Hello Sarge,
Welcome to Microsoft Q&A, and thank you for posting your question here.
This looks like a fairly deep issue, but here are my suggestions for working through it in a production context:
Start by logging every request with a UTC timestamp, the exact system/user prompts, all parameters (temperature, top_p, max_tokens), the complete response JSON, and the response headers (x-request-id, x-ms-region). Save both the body and the headers, because Microsoft correlates issues through these IDs. A simple curl command can capture both (headers to headers.txt, body to response.json):
curl -s -D headers.txt \
  -H "Content-Type: application/json" \
  -H "api-key: $AZURE_OPENAI_API_KEY" \
  -X POST "https://<resource>.openai.azure.com/openai/deployments/<deployment>/chat/completions?api-version=2025-07-01-preview" \
  -d '{"messages":[{"role":"system","content":"<system prompt>"},{"role":"user","content":"Can you give me an apple?"}],"temperature":0,"max_tokens":200}' \
  -o response.json
Secondly, check the deployment in Azure AI Foundry and set the Version update policy to manual (no auto upgrade). This ensures the underlying model does not change automatically to a newer release, which is a common cause of unexpected behavior changes. If production must stay on auto-upgrade, create a pinned duplicate deployment to use for reproducibility testing. - Azure OpenAI Model Upgrade Guide
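To verify the pin programmatically rather than only in the portal, you can list the deployments through the ARM management API. This is a sketch under a few assumptions: the azure-identity and requests packages are installed, your identity has read access on the resource, and the subscription, resource group, and account names are placeholders:
# list deployments with their model version and upgrade policy
import requests
from azure.identity import DefaultAzureCredential

SUB, RG, ACCOUNT = "<subscription-id>", "<resource-group>", "<aoai-resource>"
token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
url = (f"https://management.azure.com/subscriptions/{SUB}/resourceGroups/{RG}"
       f"/providers/Microsoft.CognitiveServices/accounts/{ACCOUNT}/deployments"
       "?api-version=2023-05-01")
for dep in requests.get(url, headers={"Authorization": f"Bearer {token}"}).json().get("value", []):
    props = dep["properties"]
    print(dep["name"], props["model"]["name"], props["model"]["version"],
          props.get("versionUpgradeOption"))
A deployment that should not move on its own would report NoAutoUpgrade here.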
Thirdly, build a test harness that repeats each prompt at least 30 times against the pinned deployment, logging results and measuring stability (exact-match %), semantic similarity (cosine similarity between embeddings), and latency percentiles; a minimal harness sketch follows the embeddings example. For drift detection, compare embeddings like below:
# cosine similarity between embeddings to detect semantic drift
import os, numpy as np
from openai import AzureOpenAI
client = AzureOpenAI(azure_endpoint="https://<resource>.openai.azure.com",
                     api_key=os.environ["AZURE_OPENAI_API_KEY"], api_version="2025-07-01-preview")
def get_embedding(text):  # model = the name of your embeddings deployment
    return np.array(client.embeddings.create(model="<embedding-deployment>", input=text).data[0].embedding)
baseline, new = get_embedding("<baseline response text>"), get_embedding("<new response text>")
print(np.dot(baseline, new) / (np.linalg.norm(baseline) * np.linalg.norm(new)))
Azure OpenAI Embeddings Documentation
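For the repetition harness itself, here is a minimal sketch that reuses the client, get_embedding, and numpy import from the snippet above; the run count and deployment name are placeholders to adapt:
# repeat a prompt N times and measure stability, similarity, and latency
import statistics, time
from collections import Counter

def run_stability_test(messages, n=30):
    outputs, latencies = [], []
    for _ in range(n):
        start = time.perf_counter()
        resp = client.chat.completions.create(model="<deployment>", messages=messages,
                                              temperature=0, max_tokens=200)
        latencies.append(time.perf_counter() - start)
        outputs.append(resp.choices[0].message.content)
    modal_output, count = Counter(outputs).most_common(1)[0]
    modal_emb = get_embedding(modal_output)
    sims = [float(np.dot(modal_emb, e) / (np.linalg.norm(modal_emb) * np.linalg.norm(e)))
            for e in map(get_embedding, outputs)]
    return {"exact_match_pct": 100 * count / n,             # share of runs matching the modal output
            "min_cosine_similarity": min(sims),             # worst-case semantic drift within the batch
            "p50_latency_s": statistics.median(latencies),
            "p95_latency_s": statistics.quantiles(latencies, n=20)[18]}  # ~95th percentile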
Fourthly, test each hypothesis systematically: confirm deployment version stability (rule out auto-update), check prompts for truncation, run single-turn conversations to exclude context bleed, disable RAG if it is used, and monitor headers for fallback routing. Compare request bodies byte-for-byte to rule out client-side mutation (a small sketch for this check follows). Even with temperature=0, a small amount of nondeterminism is expected, but quantifying it distinguishes normal variation from genuine model drift. Non-determinism in LLMs
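One way to make the byte-for-byte comparison concrete is to hash the exact request bytes and record them alongside the routing headers. The snippet below is a sketch using the requests library, with the endpoint and deployment as placeholders:
# hash the exact request bytes and capture routing headers for each call
import hashlib, json, os, requests

URL = ("https://<resource>.openai.azure.com/openai/deployments/<deployment>"
       "/chat/completions?api-version=2025-07-01-preview")
body = json.dumps({"messages": [{"role": "user", "content": "Can you give me an apple?"}],
                   "temperature": 0, "max_tokens": 200}, sort_keys=True).encode()
resp = requests.post(URL, data=body,
                     headers={"Content-Type": "application/json",
                              "api-key": os.environ["AZURE_OPENAI_API_KEY"]})
print("request sha256:", hashlib.sha256(body).hexdigest())  # identical requests produce identical hashes
print("x-request-id:", resp.headers.get("x-request-id"))
print("x-ms-region:", resp.headers.get("x-ms-region"))
If two runs with the same hash produce different outputs, the variation is server-side rather than a client-side mutation.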
Then, prepare a support ZIP containing the request/response logs, the deployment settings (screenshots of the update policy), and the x-request-id values for the failing runs; Microsoft uses these IDs to trace server-side behavior. Submit it with a support request in the Azure Portal or through Priority Customer Support (PCS).
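The bundle can be assembled from the files produced earlier; this is just a standard-library sketch, and the file names match the placeholders used in the snippets above:
# package the evidence files into a single ZIP for the support ticket
import zipfile
from pathlib import Path

with zipfile.ZipFile("aoai_support_bundle.zip", "w", zipfile.ZIP_DEFLATED) as bundle:
    for name in ["aoai_requests.jsonl", "headers.txt", "response.json",
                 "update_policy_screenshot.png"]:
        if Path(name).exists():  # skip anything you have not collected
            bundle.write(name)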
Lastly, to protect production, pin deployments to manual updates, enforce a validation layer so the app rejects unexpected enum outputs, and set up continuous regression alerts: run the tests on a schedule and flag when the stability rate or embedding similarity drops below your thresholds (a small sketch follows). This gives early warning of semantic drift before it disrupts users. LLM Regression Testing Practices
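A scheduled check could be as simple as the sketch below, reusing run_stability_test from the harness above; the thresholds and the alerting mechanism (here just a warning log line) are placeholders to adapt to your environment:
# nightly regression check that flags drops in stability or similarity
import logging

STABILITY_FLOOR = 90.0    # minimum acceptable exact-match percentage
SIMILARITY_FLOOR = 0.95   # minimum acceptable cosine similarity to the modal output

def nightly_check(prompt_suites):  # prompt_suites: {"suite name": [messages], ...}
    for name, messages in prompt_suites.items():
        metrics = run_stability_test(messages)
        if (metrics["exact_match_pct"] < STABILITY_FLOOR
                or metrics["min_cosine_similarity"] < SIMILARITY_FLOOR):
            logging.warning("Possible drift in suite %s: %s", name, metrics)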
I hope this is helpful! Do not hesitate to let me know if you have any other questions or need clarification.
Please don't forget to close out the thread by upvoting and accepting this as the answer if it helped.