Closed
Description
What is the problem?
The GitHub CLI team is using gh models to help detect whether a newly created cli/cli issue is spammy or not as part of cli/cli#11316.
In order to help evaluate whether our system prompts can accurately determine if an issue is spammy, a standalone eval.sh script was created to assess a number of scenarios we've seen in cli/cli issues.
When rate limits are reached, gh models does not gracefully handle the 429 Too Many Requests response from the API:
Running evaluation: Evaluate spam detection
Description:
Model: openai/gpt-4o-mini
Test cases: 273
Running test case 1/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'PASS'
Running test case 2/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'PASS'
Running test case 3/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 4/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 5/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 6/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 7/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 8/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 9/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 10/273...
✗ FAILED
Model Response: PASS
✗ assert response (score: 0.00)
Expected exact match: 'FAIL'
Running test case 11/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 12/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 13/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 14/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 15/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 16/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 17/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 18/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 19/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 20/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 21/273...
Error: test case 21 failed: failed to call model: unexpected response from the server: 429 Too Many Requests
Too Many Requests
Usage:
gh models eval [flags]
Examples:
gh models eval my_prompt.prompt.yml
gh models eval --org my-org my_prompt.prompt.yml
Flags:
-h, --help help for eval
--json Output results in JSON format
--org string Organization to attribute usage to (omitting will attribute usage to the current actor
How might this be improved?
- Have gh models eval sleep an appropriate amount of time based upon rate limit reset and continue from where the 429 response arose
- Avoid printing the usage statement for errors unrelated to invalid arguments