I am using the Python SDK to access Azure OpenAI (GPT-4o / GPT-4o-mini).
My usage logs show that I am well below the Tokens-per-Minute (TPM) and Requests-per-Minute (RPM) limits configured for my deployment.
Even so, I sometimes get:

```
429 Too Many Requests
Please try again later.
```
This happens intermittently, affecting small batches of requests, even with exponential backoff enabled (see the sketch below).
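For reference, here is roughly what my call path looks like. This is a minimal sketch, not my production code: the endpoint, API version, deployment name, and backoff parameters are placeholders.

```python
import os
import time

from openai import AzureOpenAI, RateLimitError

# Placeholder endpoint/key/version; real values come from my config.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

def chat_with_backoff(messages, max_retries=5):
    """Call the deployment, retrying 429s with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4o-mini",  # Azure deployment name (placeholder)
                messages=messages,
            )
        except RateLimitError:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s, ...
    raise RuntimeError("Still rate-limited after retries")
```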
I checked:
- No other deployments are drawing from the same quota.
- No spikes in usage.
- No quota exhaustion shown in the Azure portal.
- No signs of cold-start issues with the deployment.
Posts online attribute this to regional load, hidden or undocumented rate limits, or shared backend capacity, but I have not seen an authoritative explanation.
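To see whether a hidden limit is being hit, I started logging the rate-limit headers on each response. A sketch, using the same `client` as above and the openai>=1.x SDK's `with_raw_response` interface; I am assuming Azure returns the same `x-ratelimit-*` headers the OpenAI API does:

```python
# Log rate-limit headers to look for undocumented throttling.
raw = client.chat.completions.with_raw_response.create(
    model="gpt-4o-mini",  # Azure deployment name (placeholder)
    messages=[{"role": "user", "content": "ping"}],
)
print("remaining requests:", raw.headers.get("x-ratelimit-remaining-requests"))
print("remaining tokens:", raw.headers.get("x-ratelimit-remaining-tokens"))
completion = raw.parse()  # the normal ChatCompletion object
```

On a 429, `RateLimitError` exposes the raw response, so `err.response.headers.get("retry-after")` should show how long Azure actually wants me to wait.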
Has anyone looked into this in depth or found a solution that works?
Is this a known Azure-side issue, or is there something developers need to configure differently?