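AI systems might generate textual responses that are incoherent or that lack general writing quality beyond minimum grammatical correctness. To address these issues, Azure AI Foundry supports evaluating coherence and fluency.
If you have a question-answering (QA) scenario with both context and ground truth data in addition to query and response data, you can also use our QAEvaluator, a composite evaluator that uses the relevant evaluators for judgment.
Model configuration for AI-assisted evaluators
For reference in the following code snippets, the AI-assisted evaluators use a model configuration as follows: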
import os
from azure.ai.evaluation import AzureOpenAIModelConfiguration
from dotenv import load_dotenv
load_dotenv()
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_ENDPOINT"],
    api_key=os.environ.get("AZURE_API_KEY"),
    azure_deployment=os.environ.get("AZURE_DEPLOYMENT_NAME"),
    api_version=os.environ.get("AZURE_API_VERSION"),
)
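In the configuration above, only AZURE_ENDPOINT raises an error when it is unset; the other three values are read with os.environ.get and silently become None. The following is a minimal sketch of a pre-flight check, assuming all four variables are required in your setup. The helper name missing_env_vars is hypothetical and is not part of the SDK.

```python
import os
from typing import Iterable, List, Mapping

# Variable names taken from the configuration snippet above.
REQUIRED_VARS = (
    "AZURE_ENDPOINT",
    "AZURE_API_KEY",
    "AZURE_DEPLOYMENT_NAME",
    "AZURE_API_VERSION",
)


def missing_env_vars(
    names: Iterable[str] = REQUIRED_VARS,
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the names that are unset or empty in env (hypothetical helper)."""
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All model configuration variables are set.")
```

Running a check like this before building the model configuration reports every missing value at once instead of failing later inside an evaluator call.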
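Evaluator model support
Depending on the evaluator, Azure AI Foundry supports AzureOpenAI or OpenAI reasoning models and non-reasoning models as the large language model judge (LLM-judge):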
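| Evaluators | Reasoning models as judge (example: o-series models from Azure OpenAI/OpenAI) | Non-reasoning models as judge (example: gpt-4.1, gpt-4o) | To enable a reasoning model |
|---|---|---|---|
| Intent Resolution, Task Adherence, Tool Call Accuracy, Response Completeness | Supported | Supported | Set the additional parameter is_reasoning_model=True when initializing the evaluator |
| Other quality evaluators | Not supported | Supported | -- |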
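For complex evaluation that requires refined reasoning, we recommend a strong reasoning model that balances reasoning performance and cost efficiency, such as o3-mini or an o-series mini model released after it.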
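The support table can be encoded as a small lookup so that is_reasoning_model=True is only passed to evaluators that accept a reasoning-model judge. This is a sketch under that assumption: the helper judge_kwargs and the constant REASONING_JUDGE_EVALUATORS are hypothetical names for illustration, not SDK APIs, and the evaluator names are the display names used in the table.

```python
from typing import Any, Dict

# Evaluators that the support table lists as accepting a reasoning-model judge.
REASONING_JUDGE_EVALUATORS = frozenset({
    "Intent Resolution",
    "Task Adherence",
    "Tool Call Accuracy",
    "Response Completeness",
})


def judge_kwargs(evaluator: str, use_reasoning_model: bool) -> Dict[str, Any]:
    """Extra constructor keyword arguments for an evaluator (hypothetical helper)."""
    if not use_reasoning_model:
        # Non-reasoning judges need no additional parameter.
        return {}
    if evaluator not in REASONING_JUDGE_EVALUATORS:
        raise ValueError(f"{evaluator} does not support a reasoning model as judge")
    return {"is_reasoning_model": True}


print(judge_kwargs("Task Adherence", use_reasoning_model=True))
```

The returned dictionary could then be unpacked into the evaluator constructor alongside model_config.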
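Coherence
CoherenceEvaluator measures the logical and orderly presentation of ideas in a response, which allows the reader to easily follow and understand the writer's train of thought.
A coherent response directly addresses the question with clear connections between sentences and paragraphs, using appropriate transitions and a logical sequence of ideas. Higher scores mean better coherence.
Coherence example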
from azure.ai.evaluation import CoherenceEvaluator
coherence = CoherenceEvaluator(model_config=model_config, threshold=3)
coherence(
    query="Is Marie Curie is born in Paris?", 
    response="No, Marie Curie is born in Warsaw."
)
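Coherence output
The numerical score is on a Likert scale (integer 1 to 5). A higher score is better. Given a numerical threshold (default 3), the output also includes pass if the score >= threshold, or fail otherwise. Use the reason field to understand why the score is high or low.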
{
    "coherence": 4.0,
    "gpt_coherence": 4.0,
    "coherence_reason": "The RESPONSE is coherent and directly answers the QUERY with relevant information, making it easy to follow and understand.",
    "coherence_result": "pass",
    "coherence_threshold": 3
}
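The pass/fail rule described above can be sketched in a few lines. This is an illustration of the stated rule (score >= threshold passes), not the SDK's internal code; the function name to_result is hypothetical.

```python
def to_result(score: float, threshold: float = 3) -> str:
    """Map a Likert score to "pass" or "fail"; a score equal to the threshold passes."""
    return "pass" if score >= threshold else "fail"


print(to_result(4.0))  # pass
print(to_result(2.0))  # fail
```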
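Fluency
FluencyEvaluator measures the effectiveness and clarity of written communication. This measure focuses on grammatical accuracy, vocabulary range, sentence complexity, coherence, and overall readability. It assesses how smoothly ideas are conveyed and how easily the reader can understand the text.
Fluency example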
from azure.ai.evaluation import FluencyEvaluator
fluency = FluencyEvaluator(model_config=model_config, threshold=3)
fluency(
    response="No, Marie Curie is born in Warsaw."
)
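Fluency output
The numerical score is on a Likert scale (integer 1 to 5). A higher score is better. Given a numerical threshold (default 3), the output also includes pass if the score >= threshold, or fail otherwise. Use the reason field to understand why the score is high or low.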
{
    "fluency": 3.0,
    "gpt_fluency": 3.0,
    "fluency_reason": "The response is clear and grammatically correct, but it lacks complexity and variety in sentence structure, which is why it fits the \"Competent Fluency\" level.",
    "fluency_result": "pass",
    "fluency_threshold": 3
}
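Question answering composite evaluator
QAEvaluator comprehensively measures several aspects of a question-answering scenario:
- Relevance
- Groundedness
- Fluency
- Coherence
- Similarity
- F1 score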
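Unlike the Likert-scale metrics in this list, F1 score is a 0-1 value. The following is a minimal sketch, assuming F1 is computed from the tokens shared between the response and the ground truth after simple normalization (lowercasing, removing punctuation and English articles); the SDK's exact normalization may differ. The two strings are the response and ground_truth from the QA example in this article, and under these assumptions the result is about 0.63, which agrees with the f1_score in the sample QA output.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation and English articles, then split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def f1_score(response: str, ground_truth: str) -> float:
    """Token-overlap F1 between a response and a ground truth answer."""
    response_tokens = normalize(response)
    truth_tokens = normalize(ground_truth)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both texts.
    shared = sum((Counter(response_tokens) & Counter(truth_tokens)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(response_tokens)
    recall = shared / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


score = f1_score(
    "According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    "Marie Curie was born in Warsaw.",
)
print(round(score, 4))  # 0.6316
```

Here all 6 ground truth tokens appear among the 13 response tokens, so recall is 1, precision is 6/13, and F1 is 12/19.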
QA example
from azure.ai.evaluation import QAEvaluator
qa_eval = QAEvaluator(model_config=model_config, threshold=3)
qa_eval(
    query="Where was Marie Curie born?", 
    context="Background: 1. Marie Curie was a chemist. 2. Marie Curie was born on November 7, 1867. 3. Marie Curie is a French scientist.",
    response="According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    ground_truth="Marie Curie was born in Warsaw."
)
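QA output
While F1 score outputs a numerical score on a 0-1 float scale, the other evaluators output numerical scores on a Likert scale (integer 1 to 5). A higher score is better. Given a numerical threshold (default 3), the output also includes pass if the score >= threshold, or fail otherwise. Use the reason fields to understand why a score is high or low.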
{
    "f1_score": 0.631578947368421,
    "f1_result": "pass",
    "f1_threshold": 3,
    "similarity": 4.0,
    "gpt_similarity": 4.0,
    "similarity_result": "pass",
    "similarity_threshold": 3,
    "fluency": 3.0,
    "gpt_fluency": 3.0,
    "fluency_reason": "The input Data should get a Score of 3 because it clearly conveys an idea with correct grammar and adequate vocabulary, but it lacks complexity and variety in sentence structure.",
    "fluency_result": "pass",
    "fluency_threshold": 3,
    "relevance": 3.0,
    "gpt_relevance": 3.0,
    "relevance_reason": "The RESPONSE does not fully answer the QUERY because it fails to explicitly state that Marie Curie was born in Warsaw, which is the key detail needed for a complete understanding. Instead, it only negates Paris, which does not fully address the question.",
    "relevance_result": "pass",
    "relevance_threshold": 3,
    "coherence": 2.0,
    "gpt_coherence": 2.0,
    "coherence_reason": "The RESPONSE provides some relevant information but lacks a clear and logical structure, making it difficult to follow. It does not directly answer the question in a coherent manner, which is why it falls into the \"Poorly Coherent Response\" category.",
    "coherence_result": "fail",
    "coherence_threshold": 3,
    "groundedness": 3.0,
    "gpt_groundedness": 3.0,
    "groundedness_reason": "The response attempts to answer the query about Marie Curie's birthplace but includes incorrect information by stating she was not born in Paris, which is irrelevant. It does provide the correct birthplace (Warsaw), but the misleading nature of the response affects its overall groundedness. Therefore, it deserves a score of 3.",
    "groundedness_result": "pass",
    "groundedness_threshold": 3
}
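Because the composite output is a flat dictionary with one result key per metric, the failing metrics can be listed with a short filter. This is a sketch that assumes the key naming seen in the output above (metric name followed by _result); the helper failed_metrics is a hypothetical name, and qa_output is an abbreviated copy of that output.

```python
from typing import Any, Dict, List


def failed_metrics(result: Dict[str, Any]) -> List[str]:
    """Return the metric names whose "<metric>_result" value is "fail"."""
    suffix = "_result"
    return sorted(
        key[: -len(suffix)]
        for key, value in result.items()
        if key.endswith(suffix) and value == "fail"
    )


# Abbreviated copy of the QA output shown above (result keys only).
qa_output = {
    "f1_result": "pass",
    "similarity_result": "pass",
    "fluency_result": "pass",
    "relevance_result": "pass",
    "coherence_result": "fail",
    "groundedness_result": "pass",
}

print(failed_metrics(qa_output))  # ['coherence']
```

The matching reason field, here coherence_reason, then explains why that metric fell below its threshold.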