Hello Sarge,
Welcome to Microsoft Q&A, and thank you for posting your question here.
This looks like a fairly deep issue, but here are my suggestions for working through it in a production context:
Start by logging every request with a UTC timestamp, the exact system/user prompts, all parameters (temperature, top_p, max_tokens), the complete response JSON, and the response headers (x-request-id, x-ms-region). Save both the body and the headers, because Microsoft correlates issues through these IDs. A simple curl command can capture both (headers to headers.txt, body to response.json):
curl -s -D headers.txt \
  -H "Content-Type: application/json" \
  -H "api-key: $AZURE_OPENAI_API_KEY" \
  -X POST "https://<resource>.openai.azure.com/openai/deployments/<deployment>/chat/completions?api-version=2025-07-01-preview" \
  -d '{"messages":[{"role":"system","content":"<system prompt>"},{"role":"user","content":"Can you give me an apple?"}],"temperature":0,"max_tokens":200}' \
  -o response.json
Secondly, check the deployment in Azure AI Foundry and set the Version update policy to manual (no auto upgrade). This ensures the underlying model does not change automatically to a newer release, which is a common cause of unexpected behavior changes. If production must stay on auto-upgrade, create a pinned duplicate deployment to use for reproducibility testing. - Azure OpenAI Model Upgrade Guide
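To verify the pin programmatically rather than only in the portal, you can list the deployments through the ARM management API. This is a sketch under a few assumptions: the azure-identity and requests packages are installed, your identity has read access on the resource, and the subscription, resource group, and account names are placeholders:
# list deployments with their model version and upgrade policy
import requests
from azure.identity import DefaultAzureCredential

SUB, RG, ACCOUNT = "<subscription-id>", "<resource-group>", "<aoai-resource>"
token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
url = (f"https://management.azure.com/subscriptions/{SUB}/resourceGroups/{RG}"
       f"/providers/Microsoft.CognitiveServices/accounts/{ACCOUNT}/deployments"
       "?api-version=2023-05-01")
for dep in requests.get(url, headers={"Authorization": f"Bearer {token}"}).json().get("value", []):
    props = dep["properties"]
    print(dep["name"], props["model"]["name"], props["model"]["version"],
          props.get("versionUpgradeOption"))
A deployment that should not move on its own would report NoAutoUpgrade here.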
Thirdly, build a test harness that repeats each prompt at least 30 times against the pinned deployment, logging results and measuring stability (exact-match %), semantic similarity (cosine similarity between embeddings), and latency percentiles; a minimal harness sketch follows the embeddings example. For drift detection, compare embeddings like below:
# cosine similarity between embeddings to detect semantic drift
import os, numpy as np
from openai import AzureOpenAI
client = AzureOpenAI(azure_endpoint="https://<resource>.openai.azure.com",
                     api_key=os.environ["AZURE_OPENAI_API_KEY"], api_version="2025-07-01-preview")
def get_embedding(text):  # model = the name of your embeddings deployment
    return np.array(client.embeddings.create(model="<embedding-deployment>", input=text).data[0].embedding)
baseline, new = get_embedding("<baseline response text>"), get_embedding("<new response text>")
print(np.dot(baseline, new) / (np.linalg.norm(baseline) * np.linalg.norm(new)))
Azure OpenAI Embeddings Documentation
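For the repetition harness itself, here is a minimal sketch that reuses the client, get_embedding, and numpy import from the snippet above; the run count and deployment name are placeholders to adapt:
# repeat a prompt N times and measure stability, similarity, and latency
import statistics, time
from collections import Counter

def run_stability_test(messages, n=30):
    outputs, latencies = [], []
    for _ in range(n):
        start = time.perf_counter()
        resp = client.chat.completions.create(model="<deployment>", messages=messages,
                                              temperature=0, max_tokens=200)
        latencies.append(time.perf_counter() - start)
        outputs.append(resp.choices[0].message.content)
    modal_output, count = Counter(outputs).most_common(1)[0]
    modal_emb = get_embedding(modal_output)
    sims = [float(np.dot(modal_emb, e) / (np.linalg.norm(modal_emb) * np.linalg.norm(e)))
            for e in map(get_embedding, outputs)]
    return {"exact_match_pct": 100 * count / n,             # share of runs matching the modal output
            "min_cosine_similarity": min(sims),             # worst-case semantic drift within the batch
            "p50_latency_s": statistics.median(latencies),
            "p95_latency_s": statistics.quantiles(latencies, n=20)[18]}  # ~95th percentile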
Fourthly, test each hypothesis systematically: confirm deployment version stability (rule out auto-update), check prompts for truncation, run single-turn conversations to exclude context bleed, disable RAG if it is used, and monitor headers for fallback routing. Compare request bodies byte-for-byte to rule out client-side mutation (a small sketch for this check follows). Even with temperature=0, a small amount of nondeterminism is expected, but quantifying it distinguishes normal variation from genuine model drift. Non-determinism in LLMs
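One way to make the byte-for-byte comparison concrete is to hash the exact request bytes and record them alongside the routing headers. The snippet below is a sketch using the requests library, with the endpoint and deployment as placeholders:
# hash the exact request bytes and capture routing headers for each call
import hashlib, json, os, requests

URL = ("https://<resource>.openai.azure.com/openai/deployments/<deployment>"
       "/chat/completions?api-version=2025-07-01-preview")
body = json.dumps({"messages": [{"role": "user", "content": "Can you give me an apple?"}],
                   "temperature": 0, "max_tokens": 200}, sort_keys=True).encode()
resp = requests.post(URL, data=body,
                     headers={"Content-Type": "application/json",
                              "api-key": os.environ["AZURE_OPENAI_API_KEY"]})
print("request sha256:", hashlib.sha256(body).hexdigest())  # identical requests produce identical hashes
print("x-request-id:", resp.headers.get("x-request-id"))
print("x-ms-region:", resp.headers.get("x-ms-region"))
If two runs with the same hash produce different outputs, the variation is server-side rather than a client-side mutation.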
Then, prepare a support ZIP containing the request/response logs, the deployment settings (screenshots of the update policy), and the x-request-id values for the failing runs; Microsoft uses these IDs to trace server-side behavior. Submit it with a support request in the Azure Portal or through Priority Customer Support (PCS).
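The bundle can be assembled from the files produced earlier; this is just a standard-library sketch, and the file names match the placeholders used in the snippets above:
# package the evidence files into a single ZIP for the support ticket
import zipfile
from pathlib import Path

with zipfile.ZipFile("aoai_support_bundle.zip", "w", zipfile.ZIP_DEFLATED) as bundle:
    for name in ["aoai_requests.jsonl", "headers.txt", "response.json",
                 "update_policy_screenshot.png"]:
        if Path(name).exists():  # skip anything you have not collected
            bundle.write(name)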
Lastly, to protect production, pin deployments to manual updates, enforce a validation layer so the app rejects unexpected enum outputs, and set up continuous regression alerts: run the tests on a schedule and flag when the stability rate or embedding similarity drops below your thresholds (a small sketch follows). This gives early warning of semantic drift before it disrupts users. LLM Regression Testing Practices
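A scheduled check could be as simple as the sketch below, reusing run_stability_test from the harness above; the thresholds and the alerting mechanism (here just a warning log line) are placeholders to adapt to your environment:
# nightly regression check that flags drops in stability or similarity
import logging

STABILITY_FLOOR = 90.0    # minimum acceptable exact-match percentage
SIMILARITY_FLOOR = 0.95   # minimum acceptable cosine similarity to the modal output

def nightly_check(prompt_suites):  # prompt_suites: {"suite name": [messages], ...}
    for name, messages in prompt_suites.items():
        metrics = run_stability_test(messages)
        if (metrics["exact_match_pct"] < STABILITY_FLOOR
                or metrics["min_cosine_similarity"] < SIMILARITY_FLOOR):
            logging.warning("Possible drift in suite %s: %s", name, metrics)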
I hope this is helpful! Do not hesitate to let me know if you have any other questions or need clarification.
Please don't forget to close out the thread by upvoting and accepting this as the answer if it helped.