Evaluate your AI agents locally with the Azure AI Evaluation SDK (preview)

2025-10-14

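Important

Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.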

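AI agents are powerful productivity assistants that can create workflows for business needs. However, their complex interaction patterns make them challenging to observe. This article describes how to run built-in evaluators locally on simple agent data or agent messages.

To build production-ready agentic applications and enable observability and transparency, developers need tools that assess not only the final output of an agent's workflow but also the quality and efficiency of the workflow itself.

An event such as a user asking "weather tomorrow" triggers an agentic workflow. To produce a final response, the workflow runs multiple steps, including reasoning through user intent, calling tools, and using retrieval-augmented generation. In this process, it's crucial to evaluate each step of the workflow as well as the quality and safety of the final output. We formulate these evaluation aspects into the following evaluators for agents:

Intent resolution: Measures whether the agent correctly identifies the user's intent.
Tool call accuracy: Measures whether the agent made the correct function tool calls for the user's request.
Task adherence: Measures whether the agent's final response adheres to its assigned tasks, according to its system message and prior steps.

You can also assess other quality and safety aspects of your agentic workflows by using our comprehensive suite of built-in evaluators. In general, agents emit agent messages. Transforming agent messages into the right evaluation data for our evaluators can be a nontrivial task. If you build your agent with Foundry Agent Service, you can evaluate it seamlessly through our converter support. If you build your agent outside Foundry Agent Service, you can still use our evaluators as appropriate for your agentic workflow by parsing your agent messages into the required data formats. See the examples in Evaluate other agents.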

Get started

Install the evaluators package from the Azure AI Evaluation SDK:

pip install azure-ai-evaluation

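Evaluate Azure AI agents

If you use Foundry Agent Service, you can seamlessly evaluate your agents through our converter support for Azure AI agents and Semantic Kernel agents. The evaluation data returned by the converter supports the following evaluators: IntentResolution, ToolCallAccuracy, TaskAdherence, Relevance, and Groundedness.

Note

If you're building other agents that output a different schema, you can convert them into the general OpenAI-style agent message schema and use the preceding evaluators. More generally, if you can parse the agent messages into the required data formats, you can also use all of our evaluators.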
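As a rough illustration of that conversion, the following sketch maps an event log from a made-up framework onto OpenAI-style agent messages. The event kinds (system_prompt, user_input, tool_invocation, tool_output, agent_reply) and their field names are hypothetical and exist only for this example; the target message shape follows the agent message schema described later in this article. Treat it as a starting point under those assumptions, not as part of the SDK.

```python
from typing import Any, Dict, List


def to_openai_style_messages(events: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Map events from a hypothetical framework log onto OpenAI-style agent messages."""
    messages: List[Dict[str, Any]] = []
    for event in events:
        kind = event["kind"]  # hypothetical field name, not an SDK concept
        if kind == "system_prompt":
            messages.append({"role": "system", "content": event["text"]})
        elif kind in ("user_input", "agent_reply"):
            role = "user" if kind == "user_input" else "assistant"
            messages.append({"role": role, "content": [{"type": "text", "text": event["text"]}]})
        elif kind == "tool_invocation":
            messages.append({
                "role": "assistant",
                "content": [{
                    "type": "tool_call",
                    "tool_call_id": event["id"],
                    "name": event["tool"],
                    "arguments": event["args"],
                }],
            })
        elif kind == "tool_output":
            messages.append({
                "role": "tool",
                "tool_call_id": event["id"],
                "content": [{"type": "tool_result", "tool_result": event["output"]}],
            })
        else:
            raise ValueError(f"unsupported event kind: {kind!r}")
    return messages


# Hypothetical event log for illustration only.
events = [
    {"kind": "system_prompt", "text": "You are a helpful agent"},
    {"kind": "user_input", "text": "What is the weather in Seattle today?"},
    {"kind": "tool_invocation", "id": "call_1", "tool": "fetch_weather", "args": {"location": "Seattle"}},
    {"kind": "tool_output", "id": "call_1", "output": {"weather": "Sunny, 25°C"}},
    {"kind": "agent_reply", "text": "It is sunny and 25°C in Seattle."},
]
messages = to_openai_style_messages(events)
print([m["role"] for m in messages])  # ['system', 'user', 'assistant', 'tool', 'assistant']
```

Raising on unknown event kinds, rather than silently dropping them, makes gaps in the mapping visible before the data reaches an evaluator.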

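Tool call evaluation support

ToolCallAccuracyEvaluator supports evaluation of the following tools in Azure AI agents:

File Search
Azure AI Search
Bing Grounding
Bing Custom Search
SharePoint Grounding
Code Interpreter
Fabric Data Agent
OpenAPI
Function Tool (user-defined tools)

However, if an agent run uses an unsupported tool, the evaluator outputs "pass" together with a reason stating that evaluating the invoked tools isn't supported, which makes these cases easy to filter out. To enable evaluation, we recommend wrapping unsupported tools as user-defined tools.

The following example shows how to build and evaluate an Azure AI agent. Separately from evaluation, Azure AI Foundry Agent Service requires pip install azure-ai-projects azure-identity, an Azure AI project connection string, and a supported model.

Create agent threads and runs

Agents can use tools. The following example creates a custom tool for the agent to use, with a mock weather function as the example: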

from azure.ai.projects.models import FunctionTool, ToolSet
from typing import Set, Callable, Any
import json

# Define a custom Python function.
def fetch_weather(location: str) -> str:
    """
    Fetches the weather information for the specified location.

    :param location (str): The location to fetch weather for.
    :return: Weather information as a JSON string.
    :rtype: str
    """
    # In a real-world scenario, you'd integrate with a weather API.
    # In the following code snippet, we mock the response.
    mock_weather_data = {"Seattle": "Sunny, 25°C", "London": "Cloudy, 18°C", "Tokyo": "Rainy, 22°C"}
    weather = mock_weather_data.get(location, "Weather data not available for this location.")
    weather_json = json.dumps({"weather": weather})
    return weather_json

user_functions: Set[Callable[..., Any]] = {
    fetch_weather,
}

# Add tools that the agent will use. 
functions = FunctionTool(user_functions)

toolset = ToolSet()
toolset.add(functions)

AGENT_NAME = "Seattle Tourist Assistant"

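If you're using an Azure AI Foundry (non-Hub) project, create an agent with the toolset as follows:

Note

If you're using a Foundry Hub-based project (which supports only earlier versions, azure-ai-projects<1.0.0b10 and azure-ai-agents<1.0.0b10), we strongly recommend migrating to the latest Foundry Agent Service SDK Python client library with a Foundry project set up for logging batch evaluation results.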

import os
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential
from dotenv import load_dotenv

load_dotenv()

# Create an Azure AI Client from an endpoint, copied from your Azure AI Foundry project.
# You need to login to Azure subscription via Azure CLI and set the environment variables
# Azure AI Foundry project endpoint, example: AZURE_AI_PROJECT=https://your-account.services.ai.azure.com/api/projects/your-project
project_endpoint = os.environ["AZURE_AI_PROJECT"]  # Ensure the AZURE_AI_PROJECT environment variable is set

# Create an AIProjectClient instance
project_client = AIProjectClient(
    endpoint=project_endpoint,
    credential=DefaultAzureCredential(),  # Use Azure Default Credential for authentication
)


# Create an agent with the toolset 
agent = project_client.agents.create_agent(
    model=os.environ["MODEL_DEPLOYMENT_NAME"],  # Model deployment name
    name="my-agent",  # Name of the agent
    instructions="You are a helpful agent",  # Instructions for the agent
    toolset=toolset
)
print(f"Created agent, ID: {agent.id}")

# Create a thread for communication
thread = project_client.agents.threads.create()
print(f"Created thread, ID: {thread.id}")

# Add a message to the thread
message = project_client.agents.messages.create(
    thread_id=thread.id,
    role="user",  # Role of the message sender
    content="What is the weather in Seattle today?",  # Message content
)
print(f"Created message, ID: {message['id']}")

# Create and process an agent run
run = project_client.agents.runs.create_and_process(thread_id=thread.id, agent_id=agent.id)
print(f"Run finished with status: {run.status}")

# Check if the run failed
if run.status == "failed":
    print(f"Run failed: {run.last_error}")

# Fetch and log all messages
messages = project_client.agents.messages.list(thread_id=thread.id)
for message in messages:
    print(f"Role: {message.role}, Content: {message.content}")

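Evaluate a single agent run

After you create agent runs, you can use our converter to transform the Azure AI agent thread data into the evaluation data that the evaluators can understand.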

import json, os
from azure.ai.evaluation import AIAgentConverter, IntentResolutionEvaluator

# Initialize the converter for Azure AI agents.
converter = AIAgentConverter(project_client)

# Specify the thread and run ID.
thread_id = thread.id
run_id = run.id

converted_data = converter.convert(thread_id, run_id)

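That's it! converted_data contains all the inputs these evaluators require. You don't need to read up on the input requirements of each evaluator or do any work to parse the inputs. All you need to do is select your evaluators and call them on this single run. For the judge model, we support both reasoning and non-reasoning models from Azure OpenAI or OpenAI, depending on the evaluator:

Evaluators	Reasoning models as judge (example: o-series models from Azure OpenAI / OpenAI)	Non-reasoning models as judge (example: gpt-4.1, gpt-4o, and so on)	To enable
All quality evaluators except `GroundednessProEvaluator`	Supported	Supported	Set the additional parameter `is_reasoning_model=True` when initializing the evaluator
`GroundednessProEvaluator`	User doesn't need to supply a model	User doesn't need to supply a model	--

For complex tasks that require refined reasoning for the evaluation, we recommend a strong reasoning model such as o3-mini, or a later o-series mini model, which balance reasoning performance and cost efficiency.

We set up the quality and safety evaluators in quality_evaluators and safety_evaluators, and reference them when evaluating multiple agent runs or threads.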

# This is specific to agentic workflows.
from azure.ai.evaluation import IntentResolutionEvaluator, TaskAdherenceEvaluator, ToolCallAccuracyEvaluator 
# Other quality, risk, and safety metrics:
from azure.ai.evaluation import RelevanceEvaluator, CoherenceEvaluator, CodeVulnerabilityEvaluator, ContentSafetyEvaluator, IndirectAttackEvaluator, FluencyEvaluator
from azure.identity import DefaultAzureCredential

import os
from dotenv import load_dotenv
load_dotenv()

model_config = {
    "azure_deployment": os.getenv("AZURE_DEPLOYMENT_NAME"),
    "api_key": os.getenv("AZURE_API_KEY"),
    "azure_endpoint": os.getenv("AZURE_ENDPOINT"),
    "api_version": os.getenv("AZURE_API_VERSION"),
}

# example config for a reasoning model
reasoning_model_config = {
    "azure_deployment": "o3-mini",
    "api_key": os.getenv("AZURE_API_KEY"),
    "azure_endpoint": os.getenv("AZURE_ENDPOINT"),
    "api_version": os.getenv("AZURE_API_VERSION"),
}

# Evaluators you might want to use with reasoning models 
quality_evaluators = {evaluator.__name__: evaluator(model_config=reasoning_model_config, is_reasoning_model=True) for evaluator in [IntentResolutionEvaluator, TaskAdherenceEvaluator, ToolCallAccuracyEvaluator]}

# Other evaluators you might NOT want to use with reasoning models 
quality_evaluators.update({ evaluator.__name__: evaluator(model_config=model_config) for evaluator in [CoherenceEvaluator, FluencyEvaluator, RelevanceEvaluator]})

## Using Azure AI Foundry (non-Hub) project endpoint, example: AZURE_AI_PROJECT=https://your-account.services.ai.azure.com/api/projects/your-project
azure_ai_project = os.environ.get("AZURE_AI_PROJECT")

safety_evaluators = {evaluator.__name__: evaluator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential()) for evaluator in [ContentSafetyEvaluator, IndirectAttackEvaluator, CodeVulnerabilityEvaluator]}

# Reference the quality and safety evaluator list above.
quality_and_safety_evaluators = {**quality_evaluators, **safety_evaluators}

for name, evaluator in quality_and_safety_evaluators.items():
    result = evaluator(**converted_data)
    print(name)
    print(json.dumps(result, indent=4))

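Output format

AI-assisted quality evaluators return a result for a query and response pair. The result is a dictionary that contains:

{metric_name}: Provides a numerical score, either on a Likert scale (integer 1 to 5) or as a float between 0 and 1.
{metric_name}_label: Provides a binary label (if the metric naturally outputs a binary score).
{metric_name}_reason: Explains why a certain score or label was given for each data point.
details: Optional output that contains debugging information about the quality of a single agent run.

To further improve intelligibility, all evaluators accept a binary threshold (unless their outputs are already binary) and output two new keys. The binarization threshold has a default value, which the user can override. The two new keys are:

{metric_name}_result: A "pass" or "fail" string based on the binarization threshold.
{metric_name}_threshold: A numerical binarization threshold, set by default or by the user.

See the following example outputs for some evaluators: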

{
    "intent_resolution": 5.0, # likert scale: 1-5 integer 
    "intent_resolution_threshold": 3,
    "intent_resolution_result": "pass", # pass because 5 > 3 the threshold
    "intent_resolution_reason": "The assistant correctly understood the user's request to fetch the weather in Seattle. It used the appropriate tool to get the weather information and provided a clear and accurate response with the current weather conditions in Seattle. The response fully resolves the user's query with all necessary information."
}
{
    "task_adherence": 5.0, # likert scale: 1-5 integer 
    "task_adherence_threshold": 3,
    "task_adherence_result": "pass", # pass because 5 > 3 the threshold
    "task_adherence_reason": "The response accurately follows the instructions, fetches the correct weather information, and relays it back to the user without any errors or omissions."
}
{
    "tool_call_accuracy": 5,  # a score between 1-5, higher is better
    "tool_call_accuracy_threshold": 3,
    "tool_call_accuracy_result": "pass", # pass because 5 > 3 the threshold
    "details": { ... } # helpful details for debugging the tool calls made by the agent
}
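Because every metric follows the same key naming pattern, a result dictionary can be condensed into one record per metric. The following is a minimal sketch that relies only on the `_result`, `_threshold`, and `_reason` suffixes described above; the helper name summarize_result and the shortened sample reason are illustrative, not part of the SDK.

```python
from typing import Any, Dict, List


def summarize_result(result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect score, threshold, verdict, and reason for each metric in a result dict."""
    summaries: List[Dict[str, Any]] = []
    for key, verdict in result.items():
        if not key.endswith("_result"):
            continue
        metric = key[: -len("_result")]  # for example, "intent_resolution"
        summaries.append({
            "metric": metric,
            "score": result.get(metric),
            "threshold": result.get(f"{metric}_threshold"),
            "result": verdict,
            "reason": result.get(f"{metric}_reason"),  # None when the evaluator gives no reason
        })
    return summaries


# Sample shaped like the outputs above (reason text shortened for illustration).
sample = {
    "intent_resolution": 5.0,
    "intent_resolution_threshold": 3,
    "intent_resolution_result": "pass",
    "intent_resolution_reason": "The assistant correctly understood the request.",
}
for row in summarize_result(sample):
    print(f"{row['metric']}: {row['score']} (threshold {row['threshold']}) -> {row['result']}")
```

Keys that don't follow the pattern, such as details, are ignored by this helper and remain available in the original dictionary.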

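Evaluate multiple agent runs or threads

To evaluate multiple agent runs or threads, we recommend using the batch evaluate() API for asynchronous evaluation. First, convert your agent thread data into a file through our converter support: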

import json
import os
from azure.ai.evaluation import AIAgentConverter

# Initialize the converter.
converter = AIAgentConverter(project_client)

# Specify a file path to save the agent output (evaluation input data) to.
filename = os.path.join(os.getcwd(), "evaluation_input_data.jsonl")

evaluation_data = converter.prepare_evaluation_data(thread_ids=thread_id, filename=filename) 

print(f"Evaluation data saved to {filename}")
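Before submitting a batch run, it can help to open the JSONL file and confirm that each line parses and carries the fields your evaluators need. The sketch below uses only the standard library. It assumes that each line is a JSON object with query and response keys; that field list is an assumption to adjust for the evaluators you chose, and the demo file is written to a temporary directory purely for illustration.

```python
import json
import os
import tempfile
from typing import Any, Dict, List, Sequence


def load_jsonl(path: str) -> List[Dict[str, Any]]:
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    rows: List[Dict[str, Any]] = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                rows.append(json.loads(line))
    return rows


def missing_fields(rows: List[Dict[str, Any]], required: Sequence[str] = ("query", "response")) -> List[int]:
    """Return the indexes of rows that lack any of the required fields."""
    return [i for i, row in enumerate(rows) if any(name not in row for name in required)]


# Demo file with made-up content; the second row deliberately lacks "response".
demo_path = os.path.join(tempfile.mkdtemp(), "demo_evaluation_input.jsonl")
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write(json.dumps({"query": "q1", "response": "r1"}) + "\n")
    handle.write(json.dumps({"query": "q2"}) + "\n")

rows = load_jsonl(demo_path)
print(len(rows), missing_fields(rows))  # 2 [1]
```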

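With the evaluation data prepared in one line of code, you can select the evaluators to assess agent quality and submit a batch evaluation run. The following example references the same quality and safety evaluators, quality_and_safety_evaluators, defined in the section Evaluate a single agent run: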

import os
from dotenv import load_dotenv
load_dotenv()


# Batch evaluation API (local):
from azure.ai.evaluation import evaluate

response = evaluate(
    data=filename,
    evaluation_name="agent demo - batch run",
    evaluators=quality_and_safety_evaluators,
    # optionally, log your results to your Azure AI Foundry project for rich visualization 
    azure_ai_project=os.environ.get("AZURE_AI_PROJECT"),  # example: https://your-account.services.ai.azure.com/api/projects/your-project
)
# Inspect the average scores at a high level.
print(response["metrics"])
# Use the URL to inspect the results on the UI.
print(f'AI Foundry URL: {response.get("studio_url")}')

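By following the URL, you're redirected to Foundry. There, you can view your evaluation results in your Azure AI project and debug your application. The reason fields and pass/fail results help you assess the quality and safety performance of your application. You can run and compare multiple runs to test for regression or improvements.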
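The same pass/fail and reason fields can also be inspected locally. The following sketch scans per-row results for failed checks and pairs each failure with its reason. It assumes that the batch response exposes per-row results as a list of dictionaries (for example under a "rows" key) and that columns end in `_result` and `_reason`, as in `outputs.<evaluator>.<metric>_result`. Verify both assumptions against your own response object; the demo rows here are made up.

```python
from typing import Any, Dict, List


def failed_checks(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """List every column that ends in "_result" with the value "fail", plus its reason."""
    failures: List[Dict[str, Any]] = []
    for index, row in enumerate(rows):
        for column, value in row.items():
            if column.endswith("_result") and value == "fail":
                reason_column = column[: -len("_result")] + "_reason"
                failures.append({"row": index, "column": column, "reason": row.get(reason_column)})
    return failures


# Made-up rows; the column naming is an assumption, not confirmed by this article.
demo_rows = [
    {
        "outputs.IntentResolutionEvaluator.intent_resolution_result": "pass",
        "outputs.IntentResolutionEvaluator.intent_resolution_reason": "Intent resolved.",
    },
    {
        "outputs.IntentResolutionEvaluator.intent_resolution_result": "fail",
        "outputs.IntentResolutionEvaluator.intent_resolution_reason": "Response ignored the question.",
    },
]
for failure in failed_checks(demo_rows):
    print(failure["row"], failure["column"], failure["reason"])
```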

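With the Azure AI Evaluation SDK client library, you can seamlessly evaluate your Azure AI agents through our converter support, which enables observability and transparency in agentic workflows.

Evaluate other agents

If you're using agents outside Azure AI Foundry Agent Service, you can still evaluate them by preparing the right data for the evaluators of your choice.

Agents typically emit messages to interact with a user or other agents. Our built-in evaluators accept simple data types such as strings for query, response, and ground_truth. However, extracting these simple data types from agent messages can be difficult because of the complex interaction patterns of agents and the differences between frameworks. For example, a single user query can trigger a long list of agent messages, typically with multiple tool calls.

As shown in the following examples, we enable agent message support for the following built-in evaluators to evaluate these aspects of the agentic workflow. When evaluating agents, these evaluators can take tool_calls or tool_definitions as parameters unique to agents.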

Evaluator	`query`	`response`	`tool_calls`	`tool_definitions`
`IntentResolutionEvaluator`	Required: `Union[str, list[Message]]`	Required: `Union[str, list[Message]]`	Not applicable	Optional: `list[ToolCall]`
`ToolCallAccuracyEvaluator`	Required: `Union[str, list[Message]]`	Optional: `Union[str, list[Message]]`	Optional: `Union[dict, list[ToolCall]]`	Required: `list[ToolDefinition]`
`TaskAdherenceEvaluator`	Required: `Union[str, list[Message]]`	Required: `Union[str, list[Message]]`	Not applicable	Optional: `list[ToolCall]`
`GroundednessEvaluator`	Required: `Union[str, list[Message]]`	Required: `Union[str, list[Message]]`	Not applicable	Required: `list[ToolCall]`

Message: A dict holding an OpenAI-style message that describes agent interactions with a user, where query must include a system message as the first message.
ToolCall: A dict that specifies the tool calls invoked during agent interactions with a user.
ToolDefinition: A dict that describes the tools available to an agent.

For ToolCallAccuracyEvaluator, you must provide either response or tool_calls.
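The requirements in the table can be checked locally before calling an evaluator. The following sketch encodes the required columns from the table, plus the rule that ToolCallAccuracyEvaluator needs either response or tool_calls. The helper name missing_inputs and the REQUIRED_INPUTS mapping are illustrative and not part of the SDK, which performs its own validation.

```python
from typing import Any, Dict, List

# Required inputs per evaluator, transcribed from the table above.
REQUIRED_INPUTS = {
    "IntentResolutionEvaluator": ["query", "response"],
    "ToolCallAccuracyEvaluator": ["query", "tool_definitions"],
    "TaskAdherenceEvaluator": ["query", "response"],
    "GroundednessEvaluator": ["query", "response", "tool_definitions"],
}


def missing_inputs(evaluator_name: str, data: Dict[str, Any]) -> List[str]:
    """Return the required inputs that are absent (or None) for the given evaluator."""
    missing = [name for name in REQUIRED_INPUTS[evaluator_name] if data.get(name) is None]
    # ToolCallAccuracyEvaluator accepts either response or tool_calls.
    if evaluator_name == "ToolCallAccuracyEvaluator":
        if data.get("response") is None and data.get("tool_calls") is None:
            missing.append("response or tool_calls")
    return missing


data = {
    "query": "How is the weather in Seattle?",
    "tool_definitions": [{"name": "fetch_weather"}],
}
print(missing_inputs("ToolCallAccuracyEvaluator", data))  # ['response or tool_calls']
print(missing_inputs("IntentResolutionEvaluator", data))  # ['response']
```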

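The following sections show examples of the two data formats: simple agent data and agent messages. However, because of the unique requirements of these evaluators, we recommend referring to the sample notebooks, which illustrate the possible input paths for each evaluator.

Like other built-in AI-assisted quality evaluators, IntentResolutionEvaluator and TaskAdherenceEvaluator output a Likert score (an integer from 1 to 5; higher is better). ToolCallAccuracyEvaluator outputs the passing rate of all tool calls made (a float between 0 and 1) based on the user query. To further improve intelligibility, all evaluators accept a binary threshold and output two new keys. The binarization threshold has a default value, which the user can override. The two new keys are:

{metric_name}_result: A "pass" or "fail" string based on the binarization threshold.
{metric_name}_threshold: A numerical binarization threshold, set by default or by the user.

Simple agent data

In the simple agent data format, query and response are simple Python strings. For example: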

import os
import json
from azure.ai.evaluation import AzureOpenAIModelConfiguration
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import IntentResolutionEvaluator, ResponseCompletenessEvaluator
  
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    azure_deployment=os.environ["MODEL_DEPLOYMENT_NAME"],
)
 
intent_resolution_evaluator = IntentResolutionEvaluator(model_config)

# Evaluate the query and response as strings.
# The following is a positive example. Intent is identified and understood and the response correctly resolves user intent.
result = intent_resolution_evaluator(
    query="What are the opening hours of the Eiffel Tower?",
    response="Opening hours of the Eiffel Tower are 9:00 AM to 11:00 PM.",
)
print(json.dumps(result, indent=4))

See the following output (for details, see Output format):

{
    "intent_resolution": 5.0,
    "intent_resolution_result": "pass",
    "intent_resolution_threshold": 3,
    "intent_resolution_reason": "The response provides the opening hours of the Eiffel Tower, which directly addresses the user's query. The information is clear, accurate, and complete, fully resolving the user's intent.",
}

Agent tool calls and definitions

See the following examples of tool_calls and tool_definitions for ToolCallAccuracyEvaluator:

import json 

query = "How is the weather in Seattle?"
tool_calls = [{
                    "type": "tool_call",
                    "tool_call_id": "call_CUdbkBfvVBla2YP3p24uhElJ",
                    "name": "fetch_weather",
                    "arguments": {
                        "location": "Seattle"
                    }
            },
            {
                    "type": "tool_call",
                    "tool_call_id": "call_CUdbkBfvVBla2YP3p24uhElJ",
                    "name": "fetch_weather",
                    "arguments": {
                        "location": "London"
                    }
            }]

tool_definitions = [{
                    "name": "fetch_weather",
                    "description": "Fetches the weather information for the specified location.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "location": {
                                "type": "string",
                                "description": "The location to fetch weather for."
                            }
                        }
                    }
                }]

from azure.ai.evaluation import ToolCallAccuracyEvaluator

tool_call_accuracy = ToolCallAccuracyEvaluator(model_config) # reuse the config defined above
response = tool_call_accuracy(query=query, tool_calls=tool_calls, tool_definitions=tool_definitions)
print(json.dumps(response, indent=4))

See the following output (for details, see Output format):

{
    "tool_call_accuracy": 3,  # a score between 1-5, higher is better
    "tool_call_accuracy_result": "fail",
    "tool_call_accuracy_threshold": 4,
    "details": { ... } # helpful details for debugging the tool calls made by the agent
}

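Agent message schema

In the agent message format, query and response are lists of OpenAI-style messages. Specifically, query carries the past agent-user interactions leading up to the last user query and requires the system message (of the agent) at the top of the list; response carries the agent's response to the last user query.

The expected input format for the evaluators is a Python list of messages, as follows: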

[
  {
    "role": "system" | "user" | "assistant" | "tool",
    "createdAt": "ISO 8601 timestamp",     // Optional for 'system'
    "run_id": "string",                    // Optional, only for assistant/tool in tool call context
    "tool_call_id": "string",              // Optional, only for tool/tool_result
    "name": "string",                      // Present if it's a tool call
    "arguments": { ... },                  // Parameters passed to the tool (if tool call)
    "content": [
      {
        "type": "text" | "tool_call" | "tool_result",
        "text": "string",                  // if type == text
        "tool_call_id": "string",         // if type == tool_call
        "name": "string",                 // tool name if type == tool_call
        "arguments": { ... },             // tool args if type == tool_call
        "tool_result": { ... }            // result if type == tool_result
      }
    ]
  }
]

Example query and response objects:

query = [
    {
        "role": "system",
        "content": "You are an AI assistant interacting with Azure Maps services to serve user requests."
    },
    {
        "createdAt": "2025-04-25T23:55:43Z",
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Find the address for coordinates 41.8781,-87.6298."
            }
        ]
    },
    {
        "createdAt": "2025-04-25T23:55:45Z",
        "run_id": "run_DGE8RWPS8A9SmfCg61waRx9u",
        "role": "assistant",
        "content": [
            {
                "type": "tool_call",
                "tool_call_id": "call_nqNyhOFRw4FmF50jaCCq2rDa",
                "name": "azure_maps_reverse_address_search",
                "arguments": {
                    "lat": "41.8781",
                    "lon": "-87.6298"
                }
            }
        ]
    },
    {
        "createdAt": "2025-04-25T23:55:47Z",
        "run_id": "run_DGE8RWPS8A9SmfCg61waRx9u",
        "tool_call_id": "call_nqNyhOFRw4FmF50jaCCq2rDa",
        "role": "tool",
        "content": [
            {
                "type": "tool_result",
                "tool_result": {
                    "address": "300 South Federal Street, Chicago, IL 60604",
                    "position": {
                        "lat": "41.8781",
                        "lon": "-87.6298"
                    }
                }
            }
        ]
    },
    {
        "createdAt": "2025-04-25T23:55:48Z",
        "run_id": "run_DGE8RWPS8A9SmfCg61waRx9u",
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "The address for the coordinates 41.8781, -87.6298 is 300 South Federal Street, Chicago, IL 60604."
            }
        ]
    },
    {
        "createdAt": "2025-04-25T23:55:50Z",
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What timezone corresponds to 41.8781,-87.6298?"
            }
        ]
    },
]

response = [
    {
        "createdAt": "2025-04-25T23:55:52Z",
        "run_id": "run_DmnhUGqYd1vCBolcjjODVitB",
        "role": "assistant",
        "content": [
            {
                "type": "tool_call",
                "tool_call_id": "call_qi2ug31JqzDuLy7zF5uiMbGU",
                "name": "azure_maps_timezone",
                "arguments": {
                    "lat": 41.878100000000003,
                    "lon": -87.629800000000003
                }
            }
        ]
    },
    {
        "createdAt": "2025-04-25T23:55:54Z",
        "run_id": "run_DmnhUGqYd1vCBolcjjODVitB",
        "tool_call_id": "call_qi2ug31JqzDuLy7zF5uiMbGU",
        "role": "tool",
        "content": [
            {
                "type": "tool_result",
                "tool_result": {
                    "ianaId": "America/Chicago",
                    "utcOffset": None,
                    "abbreviation": None,
                    "isDaylightSavingTime": None
                }
            }
        ]
    },
    {
        "createdAt": "2025-04-25T23:55:55Z",
        "run_id": "run_DmnhUGqYd1vCBolcjjODVitB",
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "The timezone for the coordinates 41.8781, -87.6298 is America/Chicago."
            }
        ]
    }
]
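If your framework stores a conversation as one flat list, the query and response split described above can be derived by cutting at the last user message. This is a minimal sketch of that rule. The helper name split_conversation is illustrative, and the sample conversation is shortened from the objects above.

```python
from typing import Any, Dict, List, Tuple

Message = Dict[str, Any]


def split_conversation(messages: List[Message]) -> Tuple[List[Message], List[Message]]:
    """Split at the last user message: query includes it, response is everything after it."""
    last_user = max(
        (i for i, message in enumerate(messages) if message.get("role") == "user"),
        default=None,
    )
    if last_user is None:
        raise ValueError("conversation contains no user message")
    return messages[: last_user + 1], messages[last_user + 1 :]


conversation = [
    {"role": "system", "content": "You are an AI assistant interacting with Azure Maps services to serve user requests."},
    {"role": "user", "content": [{"type": "text", "text": "Find the address for coordinates 41.8781,-87.6298."}]},
    {"role": "assistant", "content": [{"type": "text", "text": "300 South Federal Street, Chicago, IL 60604."}]},
    {"role": "user", "content": [{"type": "text", "text": "What timezone corresponds to 41.8781,-87.6298?"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "The timezone is America/Chicago."}]},
]
query, response = split_conversation(conversation)
print(len(query), len(response))  # 4 1
```

The system message stays at the top of query because the split only cuts the tail of the list.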

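Note

When the query or the agent response isn't in the expected format, the evaluator emits a warning that the query (the conversation history up to the current run) or the agent response (the reply to the query) can't be parsed.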
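To catch format problems before they surface as evaluator warnings, you can run a local pre-check against the schema above. The following sketch tests only the rules stated in this article (known roles, known content item types, and a system message first in query). It is not the evaluator's own parsing logic, and the helper name find_format_problems is illustrative.

```python
from typing import Any, Dict, List

VALID_ROLES = {"system", "user", "assistant", "tool"}
VALID_CONTENT_TYPES = {"text", "tool_call", "tool_result"}


def find_format_problems(query: List[Dict[str, Any]], response: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable descriptions of schema problems; an empty list means none were found."""
    problems: List[str] = []
    if not query or query[0].get("role") != "system":
        problems.append("query: first message must be a system message")
    for label, messages in (("query", query), ("response", response)):
        for i, message in enumerate(messages):
            if message.get("role") not in VALID_ROLES:
                problems.append(f"{label}[{i}]: unknown role {message.get('role')!r}")
            content = message.get("content")
            if isinstance(content, str):
                continue  # the examples in this article use a plain string for system messages
            if not isinstance(content, list):
                problems.append(f"{label}[{i}]: content must be a string or a list")
                continue
            for item in content:
                if item.get("type") not in VALID_CONTENT_TYPES:
                    problems.append(f"{label}[{i}]: unknown content type {item.get('type')!r}")
    return problems


good_query = [
    {"role": "system", "content": "You are a friendly and helpful customer service agent."},
    {"role": "user", "content": [{"type": "text", "text": "Where is my order?"}]},
]
good_response = [{"role": "assistant", "content": [{"type": "text", "text": "Let me check."}]}]
print(find_format_problems(good_query, good_response))      # []
print(find_format_problems(good_query[1:], good_response))  # ['query: first message must be a system message']
```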

See the following example of evaluating agent messages with ToolCallAccuracyEvaluator:

import json

# The user asked a question.
query = [
    {
        "role": "system",
        "content": "You are a friendly and helpful customer service agent."
    },
    # Past interactions are omitted. 
    # ...
    {
        "createdAt": "2025-03-14T06:14:20Z",
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Hi, I need help with the last 2 orders on my account #888. Could you please update me on their status?"
            }
        ]
    }
]
# The agent emits multiple messages to fulfill the request.
response = [
    {
        "createdAt": "2025-03-14T06:14:30Z",
        "run_id": "0",
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "Hello! Let me quickly look up your account details."
            }
        ]
    },
    {
        "createdAt": "2025-03-14T06:14:35Z",
        "run_id": "0",
        "role": "assistant",
        "content": [
            {
                "type": "tool_call",
                "tool_call_id": "tool_call_20250310_001",
                "name": "get_orders",
                "arguments": {
                    "account_number": "888"
                }
            }
        ]
    },
    # Many more messages are omitted. 
    # ...
    # Here is the agent's final response:
    {
        "createdAt": "2025-03-14T06:15:05Z",
        "run_id": "0",
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "The order with ID 123 has been shipped and is expected to be delivered on March 15, 2025. However, the order with ID 124 is delayed and should now arrive by March 20, 2025. Is there anything else I can help you with?"
            }
        ]
    }
]

# An example of tool definitions available to the agent:
tool_definitions = [
    {
        "name": "get_orders",
        "description": "Get the list of orders for a given account number.",
        "parameters": {
            "type": "object",
            "properties": {
                "account_number": {
                    "type": "string",
                    "description": "The account number to get the orders for."
                }
            }
        }
    },
    # Other tool definitions are omitted. 
    # ...
]

result = tool_call_accuracy(
    query=query,
    response=response,
    tool_definitions=tool_definitions 
)
print(json.dumps(result, indent=4))

See the following output (for details, see Output format):

{
    "tool_call_accuracy": 2,  # a score between 1-5, higher is better
    "tool_call_accuracy_result": "fail",
    "tool_call_accuracy_threshold": 3,
    "details": { ... } # helpful details for debugging the tool calls made by the agent
}
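When a score is low, it can help to list the tool calls the agent actually made and compare their names with the tool definitions you passed in. The sketch below pulls tool_call items out of agent messages and flags names that have no definition. It is a local sanity check only and does not reproduce how the evaluator scores calls. The helper names and the cancel_order call are hypothetical.

```python
from typing import Any, Dict, List


def extract_tool_calls(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Collect every content item of type "tool_call" from a list of agent messages."""
    calls: List[Dict[str, Any]] = []
    for message in messages:
        content = message.get("content")
        if not isinstance(content, list):
            continue  # system messages may carry a plain string
        for item in content:
            if item.get("type") == "tool_call":
                calls.append(item)
    return calls


def undefined_tools(tool_calls: List[Dict[str, Any]], tool_definitions: List[Dict[str, Any]]) -> List[str]:
    """Return the sorted names of called tools that have no matching definition."""
    known = {definition["name"] for definition in tool_definitions}
    return sorted({call["name"] for call in tool_calls if call["name"] not in known})


# Shortened, partly made-up messages; "cancel_order" is a hypothetical tool with no definition.
agent_messages = [
    {"role": "assistant", "content": [{"type": "text", "text": "Let me look that up."}]},
    {"role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_1",
                                       "name": "get_orders", "arguments": {"account_number": "888"}}]},
    {"role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_2",
                                       "name": "cancel_order", "arguments": {"order_id": "123"}}]},
]
definitions = [{"name": "get_orders", "description": "Get the list of orders for a given account number."}]

calls = extract_tool_calls(agent_messages)
print([call["name"] for call in calls])     # ['get_orders', 'cancel_order']
print(undefined_tools(calls, definitions))  # ['cancel_order']
```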

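This evaluation schema helps you parse agent data outside Azure AI Foundry Agent Service, so that you can use our evaluators to support observability into your agentic workflows.

Sample notebooks

Now you're ready to try a sample for each of the evaluators covered in this article.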
