Evaluate your generative AI application locally with the Azure AI Evaluation SDK

2025-10-14

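Important

Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.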

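If you want to thoroughly assess the performance of your generative AI application when you apply it to a substantial dataset, you can evaluate it in your development environment with the Azure AI Evaluation SDK. When you provide either a test dataset or a target, the outputs of your generative AI application are measured quantitatively with both mathematical metrics and AI-assisted quality and safety evaluators. Built-in or custom evaluators can give you comprehensive insight into the application's capabilities and limitations.

In this article, you learn how to run evaluators on a single row of data and on a larger test dataset against an application target. You use the built-in evaluators of the Azure AI Evaluation SDK locally. Then you learn how to track the results and evaluation logs in an Azure AI project.

Get started

First, install the evaluators package from the Azure AI Evaluation SDK: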

pip install azure-ai-evaluation

Note

For more information, see the API reference documentation for the Azure AI Evaluation SDK.

Built-in evaluators

Category	Evaluators
General purpose	`CoherenceEvaluator`, `FluencyEvaluator`, `QAEvaluator`
Textual similarity	`SimilarityEvaluator`, `F1ScoreEvaluator`, `BleuScoreEvaluator`, `GleuScoreEvaluator`, `RougeScoreEvaluator`, `MeteorScoreEvaluator`
Retrieval-augmented generation (RAG)	`RetrievalEvaluator`, `DocumentRetrievalEvaluator`, `GroundednessEvaluator`, `GroundednessProEvaluator`, `RelevanceEvaluator`, `ResponseCompletenessEvaluator`
Risk and safety	`ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator`, `IndirectAttackEvaluator`, `ProtectedMaterialEvaluator`, `UngroundedAttributesEvaluator`, `CodeVulnerabilityEvaluator`, `ContentSafetyEvaluator`
Agentic	`IntentResolutionEvaluator`, `ToolCallAccuracyEvaluator`, `TaskAdherenceEvaluator`
Azure OpenAI	`AzureOpenAILabelGrader`, `AzureOpenAIStringCheckGrader`, `AzureOpenAITextSimilarityGrader`, `AzureOpenAIGrader`

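Built-in quality and safety metrics take query and response pairs, along with additional information for specific evaluators.

Data requirements for built-in evaluators

Built-in evaluators can accept query and response pairs, a list of conversations in JSON Lines (JSONL) format, or both.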

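Evaluator	Conversation and single-turn support for text	Conversation and single-turn support for text and image	Single-turn support for text only	Requires `ground_truth`	Supports agent inputs
Quality evaluators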
`IntentResolutionEvaluator`					✓
`ToolCallAccuracyEvaluator`					✓
`TaskAdherenceEvaluator`					✓
`GroundednessEvaluator`	✓				✓
`GroundednessProEvaluator`	✓
`RetrievalEvaluator`	✓
`DocumentRetrievalEvaluator`	✓			✓
`RelevanceEvaluator`	✓				✓
`CoherenceEvaluator`	✓
`FluencyEvaluator`	✓
`ResponseCompletenessEvaluator`			✓	✓
`QAEvaluator`			✓	✓
Natural language processing (NLP) evaluators
`SimilarityEvaluator`			✓	✓
`F1ScoreEvaluator`			✓	✓
`RougeScoreEvaluator`			✓	✓
`GleuScoreEvaluator`			✓	✓
`BleuScoreEvaluator`			✓	✓
`MeteorScoreEvaluator`			✓	✓
Safety evaluators
`ViolenceEvaluator`		✓
`SexualEvaluator`		✓
`SelfHarmEvaluator`		✓
`HateUnfairnessEvaluator`		✓
`ProtectedMaterialEvaluator`		✓
`ContentSafetyEvaluator`		✓
`UngroundedAttributesEvaluator`			✓
`CodeVulnerabilityEvaluator`			✓
`IndirectAttackEvaluator`	✓
Azure OpenAI graders
`AzureOpenAILabelGrader`	✓
`AzureOpenAIStringCheckGrader`	✓
`AzureOpenAITextSimilarityGrader`	✓			✓
`AzureOpenAIGrader`	✓

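Note

AI-assisted quality evaluators, except for SimilarityEvaluator, come with a reason field. They use techniques such as chain-of-thought reasoning to generate an explanation for the score. As a result, they consume more tokens during generation in exchange for improved evaluation quality. Specifically, max_token for evaluator generation is set to 800 for all AI-assisted evaluators, except that it is set to 1600 for RetrievalEvaluator and 3000 for ToolCallAccuracyEvaluator to accommodate longer inputs.

Azure OpenAI graders require a template that describes how their input columns are turned into the actual input that the grader uses. For example, if you have two inputs called query and response, and a template formatted as {{item.query}}, only the query is used. Similarly, you can use something like {{item.conversation}} to accept a conversation input, but whether the system can handle that input depends on how you configure the rest of the grader to expect it.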
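To make the template behavior concrete, the following minimal sketch mimics how a `{{item.<column>}}` template selects columns from one dataset row. It is an illustration only, not the grader's actual rendering logic, and the helper name `render_template` is hypothetical.

```python
import re

# Hypothetical helper for illustration only; the real grader does its own rendering.
_PLACEHOLDER = re.compile(r"\{\{\s*item\.(\w+)\s*\}\}")


def render_template(template: str, item: dict) -> str:
    """Replace each {{item.<column>}} placeholder with the value from the row."""
    return _PLACEHOLDER.sub(lambda m: str(item[m.group(1)]), template)


row = {"query": "What is the capital of France?", "response": "Paris."}

# Only the query column is referenced, so the response is never included.
query_only = render_template("{{item.query}}", row)

# Referencing both columns brings both values into the rendered input.
both = render_template("Q: {{item.query}} A: {{item.response}}", row)

print(query_only)
print(both)
```

A column that the template never references has no effect on the grade, which is worth checking when a grader seems to ignore part of your data.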

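For more information about the data requirements for agentic evaluators, see Run agent evaluations locally with the Azure AI Evaluation SDK.

Single-turn support for text

All built-in evaluators take single-turn inputs as query-and-response pairs in strings. For example: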

from azure.ai.evaluation import RelevanceEvaluator

query = "What is the capital of France?"
response = "Paris."

# Initialize an evaluator. model_config is the judge model configuration described in the Setup section.
relevance_eval = RelevanceEvaluator(model_config)
relevance_eval(query=query, response=response)

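To run batch evaluations by using local evaluation, or to upload your dataset to run a cloud evaluation, you need to represent the dataset in JSONL format. The previous single-turn data (a query-and-response pair) is equivalent to one line of a dataset like the following example, which shows three lines: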

{"query":"What is the capital of France?","response":"Paris."}
{"query":"What atoms compose water?","response":"Hydrogen and oxygen."}
{"query":"What color is my shirt?","response":"Blue."}

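The evaluation test dataset can contain the following elements, depending on the requirements of each built-in evaluator:

Query: The query sent to the generative AI application.
Response: The response to the query, generated by the generative AI application.
Context: The source on which the generated response is based, that is, the grounding documents.
Ground truth: The response generated by a user or human as the true answer.

To see what each evaluator requires, see the built-in evaluators documentation.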
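Before running an evaluation, it can help to check that every dataset line carries the fields your evaluator needs. The following is a minimal sketch that uses only the standard library. The helper name `missing_fields` is hypothetical, and the set of required fields is an assumption for illustration; take the real requirements from the evaluator documentation.

```python
import json


def missing_fields(jsonl_text: str, required: set[str]) -> dict[int, set[str]]:
    """Map 1-based line numbers to the required fields that are absent or empty."""
    problems = {}
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # Skip blank lines.
        row = json.loads(line)
        missing = {name for name in required if not row.get(name)}
        if missing:
            problems[line_no] = missing
    return problems


rows = [
    {"query": "What is the capital of France?", "response": "Paris."},
    {"query": "What atoms compose water?", "response": "Hydrogen and oxygen."},
]

# JSONL means one JSON object per line.
jsonl_text = "\n".join(json.dumps(row) for row in rows)

# Assumed requirement for illustration: query, response, and context.
report = missing_fields(jsonl_text, {"query", "response", "context"})
print(report)
```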

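Conversation support for text

For evaluators that support conversations for text, you can provide conversation as input. This input is a Python dictionary that contains a list of messages, where each message includes content, role, and optionally context.

See the following two-turn conversation in Python: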

conversation = {
        "messages": [
        {
            "content": "Which tent is the most waterproof?", 
            "role": "user"
        },
        {
            "content": "The Alpine Explorer Tent is the most waterproof",
            "role": "assistant", 
            "context": "From the our product list the alpine explorer tent is the most waterproof. The Adventure Dining Table has higher weight."
        },
        {
            "content": "How much does it cost?",
            "role": "user"
        },
        {
            "content": "The Alpine Explorer Tent is $120.",
            "role": "assistant",
            "context": None
        }
        ]
}

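To run batch evaluations by using local evaluation, or to upload your dataset to run a cloud evaluation, you need to represent the dataset in JSONL format. The previous conversation is equivalent to one line of a dataset in a JSONL file, like the following example: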

{"conversation":
    {
        "messages": [
        {
            "content": "Which tent is the most waterproof?", 
            "role": "user"
        },
        {
            "content": "The Alpine Explorer Tent is the most waterproof",
            "role": "assistant", 
            "context": "From the our product list the alpine explorer tent is the most waterproof. The Adventure Dining Table has higher weight."
        },
        {
            "content": "How much does it cost?",
            "role": "user"
        },
        {
            "content": "The Alpine Explorer Tent is $120.",
            "role": "assistant",
            "context": null
        }
        ]
    }
}
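The JSONL example above is spread over several lines for readability, but in a JSONL file each record occupies exactly one line. The following minimal sketch, which assumes only the standard library `json` module, serializes a conversation to a single line and reads it back. The file name `conversations.jsonl` is hypothetical.

```python
import json
import os
import tempfile

conversation = {
    "messages": [
        {"content": "Which tent is the most waterproof?", "role": "user"},
        {
            "content": "The Alpine Explorer Tent is the most waterproof",
            "role": "assistant",
            "context": "The alpine explorer tent is the most waterproof.",
        },
        {"content": "How much does it cost?", "role": "user"},
        {"content": "The Alpine Explorer Tent is $120.", "role": "assistant", "context": None},
    ]
}

# json.dumps emits no newlines by default, so the record stays on one line.
# Python None is serialized as JSON null.
line = json.dumps({"conversation": conversation})

path = os.path.join(tempfile.mkdtemp(), "conversations.jsonl")  # Hypothetical file name.
with open(path, "w", encoding="utf-8") as f:
    f.write(line + "\n")

# Read the file back, one JSON object per non-empty line.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(raw) for raw in f if raw.strip()]

print(len(loaded))
```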

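Our evaluators understand that the first turn of the conversation provides a valid query from user, context from assistant, and response from assistant in the query-response format. Conversations are then evaluated per turn, and the results are aggregated over all turns to produce a conversation score.

Note

In the second turn, even if context is null or a missing key, the evaluator interprets it as an empty string instead of failing with an error, which might lead to misleading results. We strongly recommend that you validate your evaluation data to make sure it complies with the data requirements.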
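One way to catch this before running an evaluation is a small pre-flight check. The following is a minimal sketch, assuming that each assistant message closes one turn. The helper name `turns_missing_context` is hypothetical and is not part of the SDK.

```python
def turns_missing_context(conversation: dict) -> list[int]:
    """Return 1-based turn numbers whose assistant message has no usable context."""
    missing = []
    turn = 0
    for message in conversation.get("messages", []):
        if message.get("role") != "assistant":
            continue
        turn += 1  # Assumption: each assistant message closes one turn.
        context = message.get("context")
        if not isinstance(context, str) or not context.strip():
            missing.append(turn)
    return missing


conversation = {
    "messages": [
        {"content": "Which tent is the most waterproof?", "role": "user"},
        {
            "content": "The Alpine Explorer Tent is the most waterproof",
            "role": "assistant",
            "context": "The alpine explorer tent is the most waterproof.",
        },
        {"content": "How much does it cost?", "role": "user"},
        {"content": "The Alpine Explorer Tent is $120.", "role": "assistant", "context": None},
    ]
}

bad_turns = turns_missing_context(conversation)
print(bad_turns)  # [2]
```

Turns reported here would be scored against an empty context, so either supply the context or drop those turns before evaluating context-dependent metrics.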

For conversation mode, here's an example for GroundednessEvaluator:

# Conversation mode:
import json
import os
from azure.ai.evaluation import GroundednessEvaluator, AzureOpenAIModelConfiguration

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ.get("AZURE_ENDPOINT"),
    api_key=os.environ.get("AZURE_API_KEY"),
    azure_deployment=os.environ.get("AZURE_DEPLOYMENT_NAME"),
    api_version=os.environ.get("AZURE_API_VERSION"),
)

# Initialize the Groundedness evaluator:
groundedness_eval = GroundednessEvaluator(model_config)

conversation = {
    "messages": [
        { "content": "Which tent is the most waterproof?", "role": "user" },
        { "content": "The Alpine Explorer Tent is the most waterproof", "role": "assistant", "context": "From the our product list the alpine explorer tent is the most waterproof. The Adventure Dining Table has higher weight." },
        { "content": "How much does it cost?", "role": "user" },
        { "content": "$120.", "role": "assistant", "context": "The Alpine Explorer Tent is $120."}
    ]
}

# Alternatively, you can load the same content from a JSONL file.
groundedness_conv_score = groundedness_eval(conversation=conversation)
print(json.dumps(groundedness_conv_score, indent=4))

For conversation outputs, per-turn results are stored in a list, and the overall conversation score 'groundedness': 5.0 is the average over those turns:

{
    "groundedness": 5.0,
    "gpt_groundedness": 5.0,
    "groundedness_threshold": 3.0,
    "evaluation_per_turn": {
        "groundedness": [
            5.0,
            5.0
        ],
        "gpt_groundedness": [
            5.0,
            5.0
        ],
        "groundedness_reason": [
            "The response accurately and completely answers the query by stating that the Alpine Explorer Tent is the most waterproof, which is directly supported by the context. There are no irrelevant details or incorrect information present.",
            "The RESPONSE directly answers the QUERY with the exact information provided in the CONTEXT, making it fully correct and complete."
        ],
        "groundedness_result": [
            "pass",
            "pass"
        ],
        "groundedness_threshold": [
            3,
            3
        ]
    }
}
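The relationship between the per-turn list and the conversation score can be checked by hand. The following minimal sketch assumes a plain arithmetic mean, as the text describes, and uses a trimmed copy of the output above.

```python
from statistics import mean

# Trimmed copy of the evaluator output shown above.
result = {
    "groundedness": 5.0,
    "evaluation_per_turn": {"groundedness": [5.0, 5.0]},
}

per_turn = result["evaluation_per_turn"]["groundedness"]
recomputed = mean(per_turn)

print(recomputed == result["groundedness"])
```

With per-turn scores of 5.0 and 3.0, the same calculation would give a conversation score of 4.0.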

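Note

We recommend that you migrate your code to use the keys without prefixes (for example, groundedness.groundedness) so that your code supports more evaluator models.

For evaluators that support conversations for image and multi-modal image and text, you can pass image URLs or Base64-encoded images in conversation.

Supported scenarios include:

Multiple images with text input, for image or text generation.
Text-only input, for image generation.
Image-only input, for text generation.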

from pathlib import Path
from azure.ai.evaluation import ContentSafetyEvaluator
import base64

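# Create an instance of an evaluator with image and multi-modal support.
# azure_cred and project_scope are assumed to be defined earlier: an Azure credential and your Azure AI project information.
safety_evaluator = ContentSafetyEvaluator(credential=azure_cred, azure_ai_project=project_scope)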

# Example of a conversation with an image URL:
conversation_image_url = {
    "messages": [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are an AI assistant that understands images."}
            ],
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Can you describe this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/68/178268-050-5B4E7FB6/Tom-Cruise-2013.jpg"
                    },
                },
            ],
        },
        {
            "role": "assistant",
            "content": [
                {
                    "type": "text",
                    "text": "The image shows a man with short brown hair smiling, wearing a dark-colored shirt.",
                }
            ],
        },
    ]
}

# Example of a conversation with base64 encoded images:
base64_image = ""

with Path("Image1.jpg").open("rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode("utf-8")

conversation_base64 = {
    "messages": [
        {"content": "create an image of a branded apple", "role": "user"},
        {
            "content": [{"type": "image_url", "image_url": {"url": f"data:image/jpg;base64,{base64_image}"}}],
            "role": "assistant",
        },
    ]
}

# Run the evaluation on the conversation to output the result.
safety_score = safety_evaluator(conversation=conversation_image_url)

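Currently, the image and multi-modal evaluators support:

Single turn only: a conversation can have only one user message and one assistant message.
Conversations that have only one system message.
Conversation payloads smaller than 10 MB, including images.
Absolute URLs and Base64-encoded images.
Multiple images in a single turn.
JPG/JPEG, PNG, and GIF file formats.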
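Several of these limits can be checked locally before a conversation is sent for evaluation. The following is a minimal sketch and not the validation the service performs. The helper name `check_multimodal_conversation` is hypothetical, and it assumes that "10 MB" means 10 × 1024 × 1024 bytes of serialized JSON and that a conversation may contain zero or one system message.

```python
import json

MAX_PAYLOAD_BYTES = 10 * 1024 * 1024  # Assumption: "10 MB" read as 10 MiB.


def check_multimodal_conversation(conversation: dict) -> list[str]:
    """Return the violated constraints; an empty list means none were found."""
    messages = conversation.get("messages", [])
    roles = [m.get("role") for m in messages]
    problems = []
    if roles.count("user") != 1 or roles.count("assistant") != 1:
        problems.append("expected exactly one user and one assistant message")
    if roles.count("system") > 1:
        problems.append("expected at most one system message")
    if len(json.dumps(conversation).encode("utf-8")) >= MAX_PAYLOAD_BYTES:
        problems.append("payload must be smaller than 10 MB")
    for message in messages:
        content = message.get("content")
        parts = content if isinstance(content, list) else []
        for part in parts:
            if part.get("type") != "image_url":
                continue
            url = part["image_url"]["url"]
            # Accept absolute URLs and Base64 data URLs only.
            if not url.startswith(("http://", "https://", "data:image/")):
                problems.append(f"image URL is not absolute or Base64 data: {url[:30]}")
    return problems


example = {
    "messages": [
        {"role": "user", "content": [{"type": "text", "text": "Describe this image."},
                                     {"type": "image_url", "image_url": {"url": "images/cat.png"}}]},
        {"role": "assistant", "content": [{"type": "text", "text": "A cat."}]},
    ]
}
print(check_multimodal_conversation(example))
```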

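Setup

For AI-assisted quality evaluators, except for GroundednessProEvaluator (preview), you must specify a GPT model (gpt-35-turbo, gpt-4, gpt-4-turbo, gpt-4o, or gpt-4o-mini) in your model_config. The GPT model acts as a judge that scores the evaluation data. Both the Azure OpenAI and the OpenAI model configuration schemas are supported. For the best performance and for responses that our evaluators can parse, we recommend GPT models that aren't in preview.

Note

We strongly recommend that you replace gpt-3.5-turbo with gpt-4o-mini as your evaluator model. According to OpenAI, gpt-4o-mini is cheaper, more capable, and just as fast.

To make inference calls with the API key, make sure that you have at least the Cognitive Services OpenAI User role for the Azure OpenAI resource. For more information about permissions, see Permissions for an Azure OpenAI resource.

For all risk and safety evaluators and GroundednessProEvaluator (preview), you must provide your azure_ai_project information instead of a GPT deployment in model_config. The evaluators use this information to access the back-end evaluation service through your Azure AI project.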
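As a rough illustration of the project information, the following sketch builds an `azure_ai_project` dictionary from environment-style settings. It assumes the dictionary form with `subscription_id`, `resource_group_name`, and `project_name` keys; your project type might expect a different form, so check the SDK reference. The helper name and the environment variable names are hypothetical.

```python
import os


def build_azure_ai_project(env=os.environ) -> dict:
    """Collect the project details from environment variables (names are assumptions)."""
    return {
        "subscription_id": env["AZURE_SUBSCRIPTION_ID"],
        "resource_group_name": env["AZURE_RESOURCE_GROUP"],
        "project_name": env["AZURE_PROJECT_NAME"],
    }


# Placeholder values so the sketch runs anywhere; replace them with your own settings.
example_env = {
    "AZURE_SUBSCRIPTION_ID": "<subscription-id>",
    "AZURE_RESOURCE_GROUP": "<resource-group>",
    "AZURE_PROJECT_NAME": "<project-name>",
}
azure_ai_project = build_azure_ai_project(example_env)
print(azure_ai_project)

# With the SDK and azure-identity installed, a safety evaluator could then be
# created in the same way as the ContentSafetyEvaluator call shown in this article:
# from azure.identity import DefaultAzureCredential
# from azure.ai.evaluation import ViolenceEvaluator
# violence_eval = ViolenceEvaluator(credential=DefaultAzureCredential(), azure_ai_project=azure_ai_project)
```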

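Prompts for AI-assisted built-in evaluators

For transparency, we open source the prompts of our quality evaluators in our Evaluator Library and in the Azure AI Evaluation Python SDK repository. The exceptions are the safety evaluators and GroundednessProEvaluator, which are powered by Azure AI Content Safety. These prompts serve as instructions for a language model to perform its evaluation task, which requires a human-friendly definition of the metric and its associated scoring rubric. We strongly recommend that you customize the definitions and scoring rubrics to the specifics of your scenario. For more information, see Custom evaluators.

Composite evaluators

Composite evaluators are built-in evaluators that combine individual quality or safety metrics. They make it easy to get a wide range of ready-made metrics for query-response pairs or chat messages.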

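Composite evaluator	Contains	Description
`QAEvaluator`	`GroundednessEvaluator`, `RelevanceEvaluator`, `CoherenceEvaluator`, `FluencyEvaluator`, `SimilarityEvaluator`, `F1ScoreEvaluator`	Combines all the quality evaluators into a single output of combined metrics for query and response pairs
`ContentSafetyEvaluator`	`ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator`	Combines all the safety evaluators into a single output of combined metrics for query and response pairs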

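Local evaluation on test datasets using `evaluate()`

After you spot-check your built-in or custom evaluators on a single row of data, you can combine multiple evaluators with the evaluate() API on an entire test dataset.

Prerequisite setup steps for Azure AI Foundry projects

If this is your first time running evaluations and logging them to your Azure AI Foundry project, you might need to do some additional setup steps:

Create a storage account and connect it to your Azure AI Foundry project at the resource level. This bicep template provisions a storage account and connects it to your Foundry project with key authentication.
Make sure that the connected storage account has access to all projects.
If you connected your storage account with Microsoft Entra ID, make sure to grant MSI (Microsoft Identity) permissions for Storage Blob Data Owner to both your account and the Foundry project resource in the Azure portal.

Evaluate on a dataset and log results to Azure AI Foundry

To ensure that the evaluate() API can parse the data correctly, you must specify a column mapping that maps the columns of the dataset to the keywords that the evaluators accept. This example specifies the data mapping for query, response, and context.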

from azure.ai.evaluation import evaluate

result = evaluate(
    data="data.jsonl", # Provide your data here:
    evaluators={
        "groundedness": groundedness_eval,
        "answer_length": answer_length
    },
    # Column mapping: each ${data.<column>} must match a column name in data.jsonl.
    evaluator_config={
        "groundedness": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${data.context}",
                "response": "${data.response}"
            } 
        }
    },
    # Optionally, provide your Azure AI Foundry project information to track your evaluation results in your project portal.
    azure_ai_project = azure_ai_project,
    # Optionally, provide an output path to dump a JSON file of metric summary, row-level data, and the metric and Azure AI project URL.
    output_path="./myevalresults.json"
)
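The call above passes an `answer_length` evaluator that isn't defined in this section. The following is a minimal sketch of what it could look like, assuming that a custom evaluator can be any callable that accepts keyword arguments and returns a dictionary. The class name and the `value` key are assumptions, chosen to match the `answer_length.value` metric in the sample output of this section.

```python
class AnswerLengthEvaluator:
    """Code-based evaluator sketch: reports the response length in characters."""

    def __call__(self, *, response: str, **kwargs) -> dict:
        # The key "value" is an assumption that mirrors "answer_length.value".
        return {"value": len(response)}


answer_length = AnswerLengthEvaluator()

score = answer_length(response="Paris is the capital of France.")
print(score)  # {'value': 31}
```

Registering the instance under the `"answer_length"` key in `evaluators` is what gives the metric its `answer_length.` prefix in the results.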

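Tip

Get the contents of the result.studio_url property for a link to view your logged evaluation results in your Azure AI project.

The evaluator outputs results in a dictionary that contains the aggregate metrics and the row-level data and metrics. See the following example output: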

{'metrics': {'answer_length.value': 49.333333333333336,
             'groundedness.gpt_groundedness': 5.0, 'groundedness.groundedness': 5.0},
 'rows': [{'inputs.response': 'Paris is the capital of France.',
           'inputs.context': 'Paris has been the capital of France since '
                                  'the 10th century and is known for its '
                                  'cultural and historical landmarks.',
           'inputs.query': 'What is the capital of France?',
           'outputs.answer_length.value': 31,
           'outputs.groundedness.groundedness': 5,
           'outputs.groundedness.gpt_groundedness': 5,
           'outputs.groundedness.groundedness_reason': 'The response to the query is supported by the context.'},
          {'inputs.response': 'Albert Einstein developed the theory of '
                            'relativity.',
           'inputs.context': 'Albert Einstein developed the theory of '
                                  'relativity, with his special relativity '
                                  'published in 1905 and general relativity in '
                                  '1915.',
           'inputs.query': 'Who developed the theory of relativity?',
           'outputs.answer_length.value': 51,
           'outputs.groundedness.groundedness': 5,
           'outputs.groundedness.gpt_groundedness': 5,
           'outputs.groundedness.groundedness_reason': 'The response to the query is supported by the context.'},
          {'inputs.response': 'The speed of light is approximately 299,792,458 '
                            'meters per second.',
           'inputs.context': 'The exact speed of light in a vacuum is '
                                  '299,792,458 meters per second, a constant '
                                  "used in physics to represent 'c'.",
           'inputs.query': 'What is the speed of light?',
           'outputs.answer_length.value': 66,
           'outputs.groundedness.groundedness': 5,
           'outputs.groundedness.gpt_groundedness': 5,
           'outputs.groundedness.groundedness_reason': 'The response to the query is supported by the context.'}],
 'traces': {}}
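To see how the aggregate metrics relate to the row-level values, the following minimal sketch recomputes `answer_length.value` from a trimmed copy of the output above. It assumes that the aggregate is the arithmetic mean of the row values, which is consistent with the numbers shown.

```python
import math
from statistics import mean

# Trimmed copy of the result dictionary shown above.
result = {
    "metrics": {"answer_length.value": 49.333333333333336},
    "rows": [
        {"outputs.answer_length.value": 31},
        {"outputs.answer_length.value": 51},
        {"outputs.answer_length.value": 66},
    ],
}

# Row-level outputs use the "outputs.<evaluator key>.<metric>" naming pattern.
row_values = [row["outputs.answer_length.value"] for row in result["rows"]]
recomputed = mean(row_values)

print(math.isclose(recomputed, result["metrics"]["answer_length.value"]))
```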

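Requirements for `evaluate()`

The evaluate() API has a few requirements for the data format that it accepts and for how it handles evaluator parameter key names, so that the charts of the evaluation results display correctly in your Azure AI project.

Data format

The evaluate() API accepts only data in JSONL format. For all built-in evaluators, evaluate() requires data in the following format with the required input fields. See the previous section on the required data input for built-in evaluators. The following snippet shows what one line can look like. It is spread over several lines here for readability; in the file, each record occupies a single line: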

{
  "query":"What is the capital of France?",
  "context":"France is in Europe",
  "response":"Paris is the capital of France.",
  "ground_truth": "Paris"
}

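Evaluator parameter format

When you pass in your built-in evaluators, it's important to specify the right keyword mapping in the evaluators parameter list. The following table lists the keyword mapping that is required for the results of your built-in evaluators to show up in the UI when they are logged to your Azure AI project.

Evaluator	Keyword parameter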
`GroundednessEvaluator`	`"groundedness"`
`GroundednessProEvaluator`	`"groundedness_pro"`
`RetrievalEvaluator`	`"retrieval"`
`RelevanceEvaluator`	`"relevance"`
`CoherenceEvaluator`	`"coherence"`
`FluencyEvaluator`	`"fluency"`
`SimilarityEvaluator`	`"similarity"`
`F1ScoreEvaluator`	`"f1_score"`
`RougeScoreEvaluator`	`"rouge"`
`GleuScoreEvaluator`	`"gleu"`
`BleuScoreEvaluator`	`"bleu"`
`MeteorScoreEvaluator`	`"meteor"`
`ViolenceEvaluator`	`"violence"`
`SexualEvaluator`	`"sexual"`
`SelfHarmEvaluator`	`"self_harm"`
`HateUnfairnessEvaluator`	`"hate_unfairness"`
`IndirectAttackEvaluator`	`"indirect_attack"`
`ProtectedMaterialEvaluator`	`"protected_material"`
`CodeVulnerabilityEvaluator`	`"code_vulnerability"`
`UngroundedAttributesEvaluator`	`"ungrounded_attributes"`
`QAEvaluator`	`"qa"`
`ContentSafetyEvaluator`	`"content_safety"`

Here's an example of how to set the evaluators parameter:

result = evaluate(
    data="data.jsonl",
    evaluators={
        "sexual":sexual_evaluator,
        "self_harm":self_harm_evaluator,
        "hate_unfairness":hate_unfairness_evaluator,
        "violence":violence_evaluator
    }
)
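Because a mistyped key only shows up later as a missing chart, it can be useful to check the keys before calling `evaluate()`. The following is a minimal sketch with a hypothetical helper, `mismatched_keys`, that compares the keys you used with a subset of the keyword table above. A stand-in class keeps the sketch runnable without the SDK installed.

```python
# Subset of the keyword table above, keyed by evaluator class name.
EXPECTED_KEYS = {
    "ViolenceEvaluator": "violence",
    "SexualEvaluator": "sexual",
    "SelfHarmEvaluator": "self_harm",
    "HateUnfairnessEvaluator": "hate_unfairness",
    "GroundednessEvaluator": "groundedness",
}


def mismatched_keys(evaluators: dict) -> dict:
    """Map each evaluator class name to (used_key, expected_key) when they differ."""
    mismatches = {}
    for key, evaluator in evaluators.items():
        class_name = type(evaluator).__name__
        expected = EXPECTED_KEYS.get(class_name)
        if expected is not None and expected != key:
            mismatches[class_name] = (key, expected)
    return mismatches


class SelfHarmEvaluator:
    """Stand-in for the SDK class so that the sketch runs on its own."""


print(mismatched_keys({"selfharm": SelfHarmEvaluator()}))
# {'SelfHarmEvaluator': ('selfharm', 'self_harm')}
```

Custom evaluators whose class names aren't in the table are ignored by this check, so they can use any key.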

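Local evaluation on a target

If you have a list of queries that you want to run and then evaluate, the evaluate() API also supports a target parameter. This parameter sends the queries to an application to collect answers, and then runs your evaluators on the resulting queries and responses.

A target can be any callable class in your directory. In this case, we have a Python script askwiki.py with a callable class askwiki() that we can set as the target. If we have a dataset of queries that we can send to our simple askwiki app, we can evaluate the groundedness of the outputs. Make sure that you specify the proper column mapping for your data in "column_mapping". You can use "default" to specify the column mapping for all evaluators.

Here's the content of "data.jsonl":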

{"query":"When was United States found ?", "response":"1776"}
{"query":"What is the capital of France?", "response":"Paris"}
{"query":"Who is the best tennis player of all time ?", "response":"Roger Federer"}

from askwiki import askwiki

result = evaluate(
    data="data.jsonl",
    target=askwiki,
    evaluators={
        "groundedness": groundedness_eval
    },
    evaluator_config={
        "default": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${outputs.context}",
                "response": "${outputs.response}"
            } 
        }
    }
)
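The askwiki.py script itself isn't shown in this article. The following is a hypothetical stand-in that illustrates the shape a target could take. It assumes that the target receives a dataset column as a keyword argument and that the keys of the returned dictionary are what `${outputs.response}` and `${outputs.context}` refer to. The lookup table replaces a real retrieval step.

```python
def askwiki(query: str) -> dict:
    """Hypothetical stand-in target: returns a response and the context it used."""
    # Tiny in-memory knowledge base that replaces a real retrieval call.
    knowledge = {
        "What is the capital of France?": "Paris is the capital of France.",
    }
    context = knowledge.get(query, "")
    response = context or "I don't know."
    # Assumption: these keys become ${outputs.response} and ${outputs.context}.
    return {"response": response, "context": context}


answer = askwiki(query="What is the capital of France?")
print(answer)
```

Returning the context alongside the response is what lets a groundedness evaluator judge the answer against the material the application actually used.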
