
Gracefully handle 429 Too Many Requests responses when the rate limit is hit #74

@andyfeller

Description

What is the problem?

The GitHub CLI team is using gh models to help detect whether a newly created cli/cli issue is spammy or not as part of cli/cli#11316.

In order to help evaluate whether our system prompts can accurately determine if an issue is spammy, a standalone eval.sh script was created to assess a number of scenarios we've seen in cli/cli issues.

When rate limits are reached, gh models does not gracefully handle the 429 Too Many Requests response from the API:

Running evaluation: Evaluate spam detection
Description: 
Model: openai/gpt-4o-mini
Test cases: 273

Running test case 1/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'PASS'

Running test case 2/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'PASS'

Running test case 3/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 4/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 5/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 6/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 7/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 8/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 9/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 10/273...
  ✗ FAILED
    Model Response: PASS
    ✗ assert response (score: 0.00)
      Expected exact match: 'FAIL'

Running test case 11/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 12/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 13/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 14/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 15/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 16/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 17/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 18/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 19/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 20/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 21/273...
Error: test case 21 failed: failed to call model: unexpected response from the server: 429 Too Many Requests
Too Many Requests


Usage:
  gh models eval [flags]

Examples:
gh models eval my_prompt.prompt.yml
gh models eval --org my-org my_prompt.prompt.yml


Flags:
  -h, --help         help for eval
      --json         Output results in JSON format
      --org string   Organization to attribute usage to (omitting will attribute usage to the current actor)

How might this be improved?

  1. Have gh models eval sleep for an appropriate amount of time, based on when the rate limit resets, and then resume from the test case where the 429 response arose

  2. Avoid printing the usage statement for errors unrelated to invalid arguments
