Hello TH-4622,
Welcome to the Microsoft Q&A and thank you for posting your questions here.
I understand that your GPT-5 API requests are receiving 429 responses even though your usage does not appear to be anywhere near your quota this morning.
The issue is likely not caused by exceeding your visible quota. More probably it results from token-per-minute (TPM) throttling, sub-minute request bursts, or API Management policies that apply limits independently of what your portal analytics show. Azure OpenAI enforces both RPM (requests per minute) and TPM (tokens per minute) limits, so even a few large responses can trigger a 429 despite low request volume. See the Microsoft documentation on Azure OpenAI quotas and limits.
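Some quick arithmetic shows how easily large calls exhaust a TPM budget even at low request volume. The 30,000 TPM figure below is only an assumed example quota; substitute the limit shown for your deployment in the Azure portal.

```python
# Rough arithmetic showing how a handful of large calls can exhaust a TPM
# limit. The 30,000 TPM figure is an assumed example quota, not a real one.
tpm_limit = 30_000

# One request: prompt tokens plus the max_tokens reservation both count
# toward the rate limiter's per-minute token estimate.
prompt_tokens = 4_000
max_completion_tokens = 8_000
tokens_per_request = prompt_tokens + max_completion_tokens  # 12,000

# How many such requests fit in one minute before a 429 is returned?
requests_per_minute = tpm_limit // tokens_per_request
print(requests_per_minute)  # 2 -- only two large calls per minute
```

In other words, two large calls per minute can trip the limiter while your dashboard still shows almost no request volume.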
To resolve this, capture and submit complete request-response samples with all rate-limit headers (x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens, x-ms-request-id, and Retry-After) and timestamps, along with your resource name, region, deployment ID, and subscription ID, to Support via the Azure portal. These data points allow Azure engineers to correlate your calls with internal logs and determine whether the throttling comes from TPM exhaustion, an API gateway policy, or temporary service-side constraints. You can refer to the Azure support diagnostics guidance for details.
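To make those samples easy to collect, you can record the relevant headers and a timestamp on every 429. A minimal sketch; the header names are assumptions based on common Azure OpenAI responses, so log whatever your actual 429 responses contain:

```python
# Collect the diagnostic headers worth attaching to a support request.
from datetime import datetime, timezone

DIAGNOSTIC_HEADERS = [
    "x-ratelimit-remaining-requests",
    "x-ratelimit-remaining-tokens",
    "x-ms-request-id",
    "Retry-After",
]

def capture_429_diagnostics(headers):
    """Return a timestamped record of the rate-limit headers on a response."""
    record = {"timestamp": datetime.now(timezone.utc).isoformat()}
    for name in DIAGNOSTIC_HEADERS:
        record[name] = headers.get(name)  # None if the header was absent
    return record

# Example with a fake headers mapping:
sample = capture_429_diagnostics({"Retry-After": "7", "x-ms-request-id": "abc123"})
print(sample["Retry-After"])  # 7
```

Dumping each record as a JSON line gives Support exactly the correlated timestamps and request IDs they need.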
Additionally, update your client code to handle all Azure-specific rate-limit headers and apply exponential backoff with jitter. A simple implementation is shown below:
```python
import random
import time

def handle_429(response, attempt, max_wait=30):
    """Sleep before retrying a request that was throttled with a 429."""
    retry_after = response.headers.get("Retry-After")
    if retry_after is not None:
        # The service tells us exactly how long to wait; honor it as-is.
        wait = float(retry_after)
    else:
        # Otherwise fall back to exponential backoff with jitter.
        wait = 0.5 * (2 ** attempt) + random.uniform(0, 0.5)
    time.sleep(min(wait, max_wait))
```
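Wrapped in a retry loop, the same logic looks like the sketch below. `call_model` is a placeholder for your real API call and is assumed to return an object with `.status_code` and `.headers` attributes:

```python
import random
import time

MAX_ATTEMPTS = 5

def call_with_retry(call_model):
    """Retry a throttled call, honoring Retry-After or backing off with jitter.

    call_model is a placeholder for your real API call; it should return an
    object exposing .status_code and .headers.
    """
    for attempt in range(MAX_ATTEMPTS):
        response = call_model()
        if response.status_code != 429:
            return response
        retry_after = response.headers.get("Retry-After")
        if retry_after is not None:
            wait = float(retry_after)  # server-specified wait takes priority
        else:
            wait = 0.5 * (2 ** attempt) + random.uniform(0, 0.5)
        time.sleep(min(wait, 30))
    raise RuntimeError(f"Still throttled after {MAX_ATTEMPTS} attempts")
```

Capping the sleep at 30 seconds keeps a misbehaving header value from stalling the client indefinitely.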
Before reopening a support request, verify that no other applications share the same API key, and check for custom API Management policies such as azure-openai-token-limit, which can impose additional throttles: https://free.blessedness.top/en-us/azure/api-management/api-management-policies
If token throttling remains the cause, reduce max_tokens in your calls or request a quota increase in the Azure portal under Azure OpenAI > Quotas. These steps, aligned with Microsoft's rate-limit and diagnostics documentation, should help isolate the root cause and restore stable GPT-5 API performance.
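Capping max_tokens directly shrinks how much each call counts against the TPM budget, since the limiter reserves the full cap per request. A hypothetical payload builder (the field names follow the chat-completions request body; the default cap of 512 is just an illustrative choice):

```python
# Sketch of a chat-completions request body with a reduced max_tokens cap.
# The 512 default is an illustrative value, not a recommendation.
def build_payload(messages, max_tokens=512):
    """Build a request body that keeps completion size, and so TPM usage, low."""
    return {
        "messages": messages,
        "max_tokens": max_tokens,  # smaller cap -> fewer tokens reserved per call
    }

payload = build_payload([{"role": "user", "content": "Summarize this briefly."}])
print(payload["max_tokens"])  # 512
```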
I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.
Please don't forget to close the thread by accepting this as an answer and upvoting if it was helpful.