Gracefully handle `429 Too Many Requests` responses when the rate limit is reached

### What is the problem?

The GitHub CLI team uses `gh models` to help detect whether a newly created `cli/cli` issue is spam, as part of https://github.com/cli/cli/pull/11316.

To evaluate whether our system prompts accurately identify spam, we created a standalone `eval.sh` script that assesses a number of scenarios we've seen in `cli/cli` issues.

When [rate limits are reached](https://docs.github.com/en/enterprise-cloud@latest/github-models/use-github-models/prototyping-with-ai-models#rate-limits), `gh models` does not gracefully handle the `429 Too Many Requests` response from the API:

```shell
Running evaluation: Evaluate spam detection
Description: 
Model: openai/gpt-4o-mini
Test cases: 273

Running test case 1/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'PASS'

Running test case 2/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'PASS'

Running test case 3/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 4/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 5/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 6/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 7/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 8/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 9/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 10/273...
  ✗ FAILED
    Model Response: PASS
    ✗ assert response (score: 0.00)
      Expected exact match: 'FAIL'

Running test case 11/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 12/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 13/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 14/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 15/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 16/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 17/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 18/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 19/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 20/273...
  ✓ PASSED
    ✓ assert response (score: 1.00)
      Expected exact match: 'FAIL'

Running test case 21/273...
Error: test case 21 failed: failed to call model: unexpected response from the server: 429 Too Many Requests
Too Many Requests


Usage:
  gh models eval [flags]

Examples:
gh models eval my_prompt.prompt.yml
gh models eval --org my-org my_prompt.prompt.yml


Flags:
  -h, --help         help for eval
      --json         Output results in JSON format
      --org string   Organization to attribute usage to (omitting will attribute usage to the current actor
```

### How might this be improved?

1. Have `gh models eval` sleep until the rate limit resets, then resume from the test case that received the 429 response

1. Avoid printing the usage statement for errors unrelated to invalid arguments
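
To make the first suggestion concrete, here is a minimal, language-agnostic sketch of the retry behavior, written in Python for brevity. It is not how `gh models` is implemented. The names `retry_delay`, `call_with_retry`, and `run_eval` are hypothetical, the `(status, headers, body)` response shape is invented for illustration, and it assumes the API returns a `Retry-After` header in seconds on a 429, which would need to be confirmed against the actual responses.

```python
import time
from typing import Callable, List, Mapping, Tuple

# Hypothetical response shape for illustration: (status_code, headers, body).
Response = Tuple[int, Mapping[str, str], str]


def retry_delay(headers: Mapping[str, str], default: float = 60.0) -> float:
    """Seconds to wait before retrying.

    Assumes a `Retry-After` header expressed in seconds (an assumption about
    the API); falls back to `default` if it is missing or not a number.
    """
    value = headers.get("Retry-After")
    try:
        return max(0.0, float(value))
    except (TypeError, ValueError):
        return default


def call_with_retry(
    call: Callable[[], Response],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Invoke `call`, sleeping and retrying while it returns 429."""
    for attempt in range(max_retries + 1):
        status, headers, body = call()
        if status != 429:
            return status, headers, body
        if attempt < max_retries:
            sleep(retry_delay(headers))
    raise RuntimeError(f"rate limit still exceeded after {max_retries} retries")


def run_eval(
    test_cases: List[str],
    call_model: Callable[[str], Response],
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Run every test case in order; a 429 pauses the run instead of aborting it."""
    results = []
    for case in test_cases:
        # Retry the same case, so the run resumes exactly where the 429 arose.
        _, _, body = call_with_retry(lambda: call_model(case), sleep=sleep)
        results.append(body)
    return results
```

Bounding the number of retries keeps the command from hanging indefinitely if the limit never resets, and injecting `sleep` makes the behavior testable without real delays.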
