The LLM Lobotomy.

Sarge 50 Reputation points
2025-09-20T17:25:12.25+00:00

I am working on a product that uses Azure on the back end for LLMs and audio models. Just as I test the code for every release, I also test the conversational flow every time I add or update something in the system prompts for calibration or new features.

What I mean by this is that I have a fixed set of conversations, run at temperature 0 so the answers are as consistent as possible. Because I've been working on this product for over 6 months, I can see how the very same LLM gets worse and worse: I send the very same messages, and the JSON responses I receive get less and less accurate.

In other words, you are lobotomizing models in the background. Same model, same system prompt, same messages, but worse results.

I currently use gpt-4o-mini for language. Thankfully its speed is still there, but its answer accuracy has been horrible since the gpt-5 release. So I thought I would switch versions and checked out gpt-5-mini and nano. What do you know? gpt-5 is about as good as gpt-4o-mini used to be according to my tests, but it is insanely slow, sometimes taking up to 20 seconds even with minimal reasoning (which still produces bad results).

So I am trying to understand: what is Microsoft's game here? Presumably you want to onboard people to newer models so you can retire the old ones, but since the newer models are slow and not good, you have to do this by somehow reducing the old models' quality? And by serving smaller-parameter versions while still calling them by the same names?

This is a bad business strategy, and not everyone is working on note-taking apps and text summarization. Accuracy and consistency matter, which brings me and my team to consider moving away from Azure, since it cannot provide a stable service.

I am glad I have proof of this from the test system we put in place and am not making this up. What you are doing is bad: either provide better products and ask people to switch, or keep things stable and backwards compatible.

Azure AI Language
An Azure service that provides natural language capabilities including sentiment analysis, entity extraction, and automated question answering.

1 answer

  1. Sina Salam 25,761 Reputation points Volunteer Moderator
    2025-09-28T18:33:35.97+00:00

    Hello Sarge,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    This looks like a very deep issue; however, below is my widow's mite toward putting it on a production-grade footing:

    Start by logging every request with the UTC timestamp, the exact system/user prompt, all parameters (temperature, top_p, max_tokens), the complete response JSON, and the response headers (x-request-id, x-ms-region). Save both the body and the headers, because Microsoft correlates issues through these IDs. A simple curl command can capture the headers and write the response body to a file:

    curl -s -D - -H "Content-Type: application/json" \
     -H "api-key:$AZURE_OPENAI_API_KEY" \
     -X POST "https://<resource>.openai.azure.com/openai/deployments/<deployment>/chat/completions?api-version=2025-07-01-preview" \
     -d '{"messages":[{"role":"system","content":"<system prompt>"},{"role":"user","content":"Can you give me an apple?"}],"temperature":0,"max_tokens":200}' -o response.json
    

    Azure OpenAI Chat Quickstart
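
    If the production client is Python rather than curl, a rough equivalent using the openai package's raw-response accessor can log the same fields on every call. This is a minimal sketch, assuming the endpoint and key are in the usual AZURE_OPENAI_* environment variables; the deployment name, api-version, and log file name are placeholders:

    # Sketch: log UTC timestamp, parameters, messages, response body, and the
    # x-request-id / x-ms-region headers for every call.
    import datetime
    import json

    from openai import AzureOpenAI

    client = AzureOpenAI(api_version="2024-10-21")  # endpoint/key read from environment

    def logged_chat(messages, deployment="<deployment>", **params):
        raw = client.chat.completions.with_raw_response.create(
            model=deployment, messages=messages, temperature=0, **params)
        completion = raw.parse()
        record = {
            "utc": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "deployment": deployment,
            "params": {"temperature": 0, **params},
            "messages": messages,
            "x_request_id": raw.headers.get("x-request-id"),
            "x_ms_region": raw.headers.get("x-ms-region"),
            "response": completion.model_dump(),
        }
        with open("llm_calls.jsonl", "a", encoding="utf-8") as log:
            log.write(json.dumps(record) + "\n")
        return completion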

    Secondly, check the deployment in Azure AI Foundry and set the Version update policy to manual. This ensures the underlying model does not change automatically to newer releases, which often causes unexpected behavior. If production must remain flexible, create a pinned duplicate deployment for reproducibility. - Azure OpenAI Model Upgrade Guide
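
    To verify the pin programmatically, a sketch along these lines reads the deployed model version and upgrade policy from the management API. It assumes the Microsoft.CognitiveServices deployments resource exposes the versionUpgradeOption property; the subscription, resource group, and resource names are placeholders:

    # Sketch: read the deployed model version and upgrade policy via the ARM API.
    # NoAutoUpgrade means the deployment is pinned. Names in the URL are placeholders.
    import requests
    from azure.identity import DefaultAzureCredential

    token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
    url = ("https://management.azure.com/subscriptions/<sub-id>/resourceGroups/<rg>"
           "/providers/Microsoft.CognitiveServices/accounts/<aoai-resource>"
           "/deployments/<deployment>?api-version=2023-05-01")
    props = requests.get(url, headers={"Authorization": f"Bearer {token}"}).json()["properties"]
    print(props["model"]["version"], props.get("versionUpgradeOption"))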

    Thirdly, build a test harness that repeats each prompt at least 30 times against the pinned deployment, logging the results and measuring stability (exact-match %), semantic similarity (via embedding cosine similarity), and latency percentiles. Use embeddings for drift detection, for example:

    # cosine similarity between embeddings to detect semantic drift
    import numpy as np
    from openai import AzureOpenAI
    client = AzureOpenAI(api_version="2024-10-21")  # endpoint/key read from environment
    def get_embedding(text, deployment="<embedding-deployment>"):
        return np.array(client.embeddings.create(model=deployment, input=text).data[0].embedding)
    a, b = get_embedding("<baseline answer>"), get_embedding("<new answer>")
    print(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))  # cosine similarity
    

    Azure OpenAI Embeddings Documentation
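
    For instance, a minimal harness sketch along these lines; the deployment names and the baseline answer are placeholders, and the client setup matches the snippet above:

    # Sketch: call the pinned deployment N times with the same messages, then report
    # exact-match stability against a stored baseline answer, cosine similarity to
    # that baseline, and latency percentiles.
    import statistics
    import time

    import numpy as np
    from openai import AzureOpenAI

    client = AzureOpenAI(api_version="2024-10-21")  # endpoint/key read from environment

    def get_embedding(text, deployment="<embedding-deployment>"):
        return np.array(client.embeddings.create(model=deployment, input=text).data[0].embedding)

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    def run_harness(messages, baseline_answer, n=30, deployment="<chat-deployment>"):
        answers, latencies = [], []
        for _ in range(n):
            start = time.perf_counter()
            resp = client.chat.completions.create(
                model=deployment, messages=messages, temperature=0)
            latencies.append(time.perf_counter() - start)
            answers.append(resp.choices[0].message.content)
        baseline_vec = get_embedding(baseline_answer)
        sims = [cosine(get_embedding(a), baseline_vec) for a in answers]
        return {
            "exact_match_pct": 100.0 * sum(a == baseline_answer for a in answers) / n,
            "min_similarity_to_baseline": min(sims),
            "latency_p50_s": statistics.median(latencies),
            "latency_p95_s": statistics.quantiles(latencies, n=20)[18],
        }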

    Fourthly, test each hypothesis systematically: confirm deployment version stability (rule out auto-update), check prompts for truncation, run single-turn conversations to exclude context bleed, disable RAG if used, and monitor headers for fallback routing. Compare request bodies byte-for-byte to rule out client-side mutation. Even with temperature=0, small nondeterminism is expected, but quantifying it distinguishes normal variation from model drift. Non-determinism in LLMs
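
    For the byte-for-byte comparison, hashing a canonical serialization of every outgoing request body is a cheap way to prove the client is not mutating prompts between runs. A sketch; the body is whatever your application actually sends:

    # Sketch: fingerprint each request body so identical runs can be shown to be
    # identical on the client side before they leave the application.
    import hashlib
    import json

    def request_fingerprint(body: dict) -> str:
        canonical = json.dumps(body, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    # Two runs with the same fingerprint but different answers point at the service,
    # not at client-side prompt drift.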

    Then, prepare a support ZIP containing the request/response logs, the deployment settings (screenshots of the version update policy), and the x-request-id values of the failing runs. Microsoft uses these IDs to trace server-side behavior. Submit it through the Azure portal or via Priority Customer Support (PCS).
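
    A small sketch for assembling that bundle; the file names are placeholders that line up with the logging sketch above:

    # Sketch: bundle the request/response log, a screenshot of the deployment's
    # update policy, and the x-request-id values of the failing runs into one ZIP.
    # All file names below are placeholders.
    import zipfile

    with zipfile.ZipFile("support_bundle.zip", "w", zipfile.ZIP_DEFLATED) as bundle:
        bundle.write("llm_calls.jsonl")             # full request/response log
        bundle.write("deployment_settings.png")     # screenshot of the version update policy
        bundle.writestr("failing_request_ids.txt",
                        "\n".join(["<x-request-id of failing run 1>",
                                   "<x-request-id of failing run 2>"]))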

    Lastly, to protect production, pin deployments to manual updates, enforce a validation layer so the app rejects unexpected enum outputs, and set up continuous regression alerts. Run tests on a schedule and flag when stability rate or embedding similarity drops below thresholds. This provides early warning of semantic drift before it disrupts users. LLM Regression Testing Practices
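
    As a sketch of that validation and alerting layer, assuming the model is expected to return a JSON field drawn from a fixed enum (the enum values and thresholds below are hypothetical) and reusing the metrics dictionary shape from the harness sketch above:

    # Sketch: reject unexpected enum values at the application boundary and flag
    # scheduled regression runs whose stability or similarity drops below thresholds.
    # Enum values and thresholds are hypothetical.
    import json

    ALLOWED_INTENTS = {"give_item", "refuse", "clarify"}

    def validate_response(raw_json: str) -> dict:
        data = json.loads(raw_json)
        if data.get("intent") not in ALLOWED_INTENTS:
            raise ValueError(f"unexpected intent: {data.get('intent')!r}")
        return data

    def check_regression(metrics: dict, min_exact_match=90.0, min_similarity=0.95) -> list[str]:
        alerts = []
        if metrics["exact_match_pct"] < min_exact_match:
            alerts.append("stability below threshold")
        if metrics["min_similarity_to_baseline"] < min_similarity:
            alerts.append("semantic drift suspected")
        return alerts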

    I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.


    Please don't forget to close up the thread here by upvoting and accepting it as an answer if it is helpful.

    1 person found this answer helpful.
