Thanks for sharing the detailed context and code snippet!
You’re absolutely right: by default, Azure AI Foundry agents currently resend the entire thread context (user messages, model outputs, and intermediate tool responses) with each new run. This is by design: Foundry’s thread-based memory maintains full conversational continuity so the model behaves consistently across turns.
That said, if you’d like to optimize or control memory usage, here are a few approaches you can consider:
1. **Manage conversation memory manually.** You can implement a custom memory handler that stores only selected parts of the conversation (for example, just the last N turns) and replays them before calling messages.create() (see the first sketch after this list). At the moment, Foundry doesn’t expose a public API to replace the internal memory manager directly; memory threads are handled automatically.
2. **Use a new thread for short or stateless interactions.** If you don’t need the full history every time, create a fresh thread_id per query (second sketch below). This avoids sending long histories to the model, which helps reduce token consumption.
3. **Use lightweight summarization.** Some developers maintain a condensed context by periodically summarizing previous interactions and sending that summary back with each new request (for example, folded into the agent’s instructions or the next user message); this helps maintain continuity while keeping the payload smaller (third sketch below).
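To make option 1 concrete, here is a minimal sketch of a manual memory handler. It assumes the azure-ai-agents 1.x client surface (AgentsClient with threads.create, messages.create, runs.create_and_process, and ListSortOrder/MessageRole from azure.ai.agents.models); the endpoint, agent_id, and MAX_TURNS values are placeholders, and method names may differ slightly across SDK versions:

```python
from collections import deque

from azure.ai.agents import AgentsClient
from azure.ai.agents.models import ListSortOrder, MessageRole
from azure.identity import DefaultAzureCredential

client = AgentsClient(
    endpoint="https://<your-foundry-project-endpoint>",  # placeholder
    credential=DefaultAzureCredential(),
)

MAX_TURNS = 5                          # keep only the last N exchanges (tunable)
history = deque(maxlen=MAX_TURNS * 2)  # one user + one assistant entry per turn


def ask(agent_id: str, user_text: str) -> str:
    # Seed a short-lived thread with only the retained turns, instead of
    # letting one long-lived thread accumulate the whole conversation.
    thread = client.threads.create()
    for role, text in history:
        client.messages.create(thread_id=thread.id, role=role, content=text)
    client.messages.create(thread_id=thread.id, role="user", content=user_text)

    run = client.runs.create_and_process(thread_id=thread.id, agent_id=agent_id)
    if run.status == "failed":
        raise RuntimeError(f"Run failed: {run.last_error}")

    # Newest-first listing: the first agent message is the fresh reply.
    reply = next(
        m.text_messages[-1].text.value
        for m in client.messages.list(thread_id=thread.id, order=ListSortOrder.DESCENDING)
        if m.role == MessageRole.AGENT and m.text_messages
    )
    history.append(("user", user_text))
    history.append(("assistant", reply))
    return reply
```

Because the internal memory manager can’t be swapped out, the sketch sidesteps it: every call builds a fresh thread from the locally bounded history, so the payload never grows past N turns.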
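Option 2 is the degenerate case of the same pattern: a one-shot helper (reusing the client and imports from the sketch above) that never replays any history at all:

```python
def ask_stateless(agent_id: str, user_text: str) -> str:
    thread = client.threads.create()  # fresh thread_id: no history to resend
    client.messages.create(thread_id=thread.id, role="user", content=user_text)
    run = client.runs.create_and_process(thread_id=thread.id, agent_id=agent_id)
    if run.status == "failed":
        raise RuntimeError(f"Run failed: {run.last_error}")
    return next(
        m.text_messages[-1].text.value
        for m in client.messages.list(thread_id=thread.id, order=ListSortOrder.DESCENDING)
        if m.role == MessageRole.AGENT and m.text_messages
    )
```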
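And a sketch of option 3, building on the two helpers above. The threshold and summarization prompt are illustrative, not an official Foundry mechanism; and since thread messages in this API are user/assistant (there is no system role on threads), the summary is prepended to the next user message, though folding it into the agent’s instructions would also work:

```python
SUMMARY_THRESHOLD = 10  # local messages to retain before compressing (tunable)
summary = ""


def compress_history(agent_id: str) -> None:
    global summary
    if len(history) < SUMMARY_THRESHOLD:
        return
    transcript = "\n".join(f"{role}: {text}" for role, text in history)
    prompt = (
        "Summarize the conversation below in under 150 words, preserving "
        "names, decisions, and open questions:\n\n" + transcript
    )
    summary = ask_stateless(agent_id, prompt)  # one-off run, nothing stored
    history.clear()  # the summary now stands in for the dropped turns


def ask_with_summary(agent_id: str, user_text: str) -> str:
    compress_history(agent_id)
    if summary:
        # Carry continuity forward as a compact preamble rather than
        # replaying the full transcript.
        user_text = f"Summary of the conversation so far: {summary}\n\n{user_text}"
    return ask(agent_id, user_text)
```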
Foundry doesn’t yet provide an official configuration option to switch out or limit the internal memory manager, but this capability is under consideration as part of future SDK enhancements.
To help us understand your setup better and guide you further, could you please share a bit more information?

- What does your current agent configuration look like (especially how messages are created and handled)?
- Roughly how many previous turns are currently retained in the conversation history?
- Which versions of azure-ai-agents and azure-ai-projects are you using (a quick check of your pip list is enough)?
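If it saves you scanning pip list by hand, this small standard-library snippet prints both versions:

```python
# Print the installed versions of the two SDK packages.
from importlib import metadata

for pkg in ("azure-ai-agents", "azure-ai-projects"):
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed")
```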
Once we have that, we can suggest a more tailored approach for managing memory efficiently.
Hope this helps!