GPT 5 API requests - We are getting 429 responses but not even appearing to touch our quota this morning.

TH-4622 80 Reputation points
2025-10-21T09:13:11.75+00:00

We are getting 429 responses but not even appearing to touch our quota this morning.

GPT 5 API requests:
[screenshot: GPT-5 API request metrics]

Our deployment should be able to handle 100 API requests per minute, but analytics shows no more than 10.

Locally, our developer made the same request and also received a 429, even though the response headers show remaining quota:
[screenshot: 429 response with rate-limit headers]

Please can you help determine what the issue is here?

Azure AI services
A group of Azure services, SDKs, and APIs designed to make apps more intelligent, engaging, and discoverable.

1 answer

  1. Sina Salam 25,761 Reputation points Volunteer Moderator
    2025-10-21T10:27:43.2933333+00:00

    Hello TH-4622,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that your GPT-5 API requests are receiving 429 responses this morning, even though you do not appear to be anywhere near your quota.

    The issue is likely not caused by exceeding your visible quota, but by hidden token-per-minute (TPM) throttling, sub-minute rate bursts, or API Management policies that apply limits independently of your portal analytics. Azure OpenAI enforces both RPM (requests per minute) and TPM (tokens per minute) limits, so even a few large responses can trigger a 429 despite low request volume. See the Microsoft documentation on Azure OpenAI quotas and limits.
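    To make the arithmetic concrete (all limits and token counts below are hypothetical example values, not your actual quota), a handful of large completions can exhaust a TPM budget while the RPM counter barely moves:

```python
# Hypothetical illustration of TPM exhaustion at low request volume.
# All limits and token counts below are made-up example values.
RPM_LIMIT = 100      # requests per minute (example)
TPM_LIMIT = 30_000   # tokens per minute (example)

requests_made = 10           # what analytics shows
tokens_per_request = 4_000   # prompt + completion tokens for one large call

tokens_used = requests_made * tokens_per_request
print(tokens_used)                # 40000
print(tokens_used > TPM_LIMIT)    # True: throttled on tokens at 10% of the RPM limit
```

    In this sketch, the deployment is throttled on tokens while showing only 10 of 100 allowed requests, which matches the symptom you describe.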

    To resolve this, capture and submit complete request-response samples with all rate-limit headers (x-rate-limit-remaining-requests, x-rate-limit-reset-tokens, x-ms-request-id, and Retry-After) and timestamps, along with your resource name, region, deployment ID, and subscription ID, to support via the Azure portal. These data points allow Azure engineers to correlate your calls with internal logs and identify whether the throttling comes from TPM exhaustion, an API gateway policy, or temporary service-side constraints. You can refer to Azure support diagnostics guidance.
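    As a sketch of what to capture (the header names are taken from the list above; verify them against your actual responses, since they can vary by gateway and API version), a small helper can pull these fields from any response-like headers mapping:

```python
import time

# Headers to capture for a support case; confirm these names against
# your actual 429 responses before relying on them.
RATE_HEADERS = (
    "Retry-After",
    "x-rate-limit-remaining-requests",
    "x-rate-limit-reset-tokens",
    "x-ms-request-id",
)

def capture_rate_limit_info(headers):
    """Return a timestamped snapshot of the rate-limit headers support asks for."""
    return {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        **{name: headers.get(name) for name in RATE_HEADERS},
    }
```

    Logging this dictionary alongside each 429 gives support the exact correlation data (request ID plus timestamp) they need.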

    Additionally, update your client code to handle all Azure-specific rate-limit headers and apply exponential backoff with jitter. A simple implementation is shown below:

    import time, random

    def handle_429(response, attempt):
        """Sleep before retrying after a 429, preferring the Retry-After header."""
        # Prefer the standard Retry-After header; fall back to the token-reset hint.
        wait = response.headers.get("Retry-After") or response.headers.get("x-rate-limit-reset-tokens")
        try:
            base = float(wait)
        except (TypeError, ValueError):
            base = 0.5  # default base delay when no usable header is present
        # Exponential backoff with jitter, capped at 30 seconds.
        time.sleep(min(base * (2 ** attempt) + random.uniform(0, 0.5), 30))
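    A fuller sketch wraps the whole call in a retry loop; `send_request` here is a placeholder for whatever function issues your API call and returns a response object with `.status_code` and `.headers` (e.g. a `requests.Response`):

```python
import random, time

def call_with_retries(send_request, max_attempts=5):
    """Retry a callable that returns an HTTP-style response whenever it yields a 429."""
    for attempt in range(max_attempts):
        response = send_request()
        if response.status_code != 429:
            return response
        # Honor Retry-After if present; otherwise back off exponentially with jitter.
        wait = response.headers.get("Retry-After")
        try:
            base = float(wait)
        except (TypeError, ValueError):
            base = 0.5
        time.sleep(min(base * (2 ** attempt) + random.uniform(0, 0.5), 30))
    return response  # last 429 after exhausting retries
```

    The cap of 30 seconds keeps a misreported Retry-After value from stalling the client indefinitely.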
    

    Before retrying requests, verify that no other applications share the same API key, and check for custom API Management policies such as azure-openai-token-limit, which can impose additional throttles: https://free.blessedness.top/en-us/azure/api-management/api-management-policies

    If token throttling remains the cause, reduce max_tokens in your calls or request a quota increase in the Azure portal under Azure OpenAI > Quotas. These steps, aligned with Microsoft's rate-limit and diagnostics documentation, should isolate the root cause and restore stable GPT-5 API performance.
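    For example, lowering `max_tokens` on the request body directly reduces per-call token consumption (the payload below is a minimal sketch; the message content and values are illustrative only):

```python
# Minimal chat-completions payload sketch; values are illustrative only.
payload = {
    "messages": [{"role": "user", "content": "Summarize this document."}],
    "max_tokens": 256,   # lowered cap on completion tokens to ease TPM pressure
    "temperature": 0.2,
}
print(payload["max_tokens"])   # 256
```

    Halving the completion cap roughly halves each call's contribution to the TPM budget, so the same RPM fits under the token limit.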

    I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.


    Please don't forget to close out the thread here by upvoting and accepting this as an answer if it is helpful.

