Built-in evaluators are ready to use out of the box to start evaluating your application's generations. To meet your specific evaluation needs, you can also build your own code-based or prompt-based evaluators.
Code-based evaluators
For some evaluation metrics, you don't need a large language model. Code-based evaluators give you the flexibility to define metrics based on functions or callable classes. For example, you can build your own code-based evaluator by creating a simple Python class that calculates the length of an answer in answer_length.py under the directory answer_len/, as in the following example.
Code-based evaluator example: answer length
class AnswerLengthEvaluator:
    def __init__(self):
        pass

    # A class is made callable by implementing the special method __call__
    def __call__(self, *, answer: str, **kwargs):
        return {"answer_length": len(answer)}
Run the evaluator on a row of data by importing the callable class:
from answer_len.answer_length import AnswerLengthEvaluator
answer_length_evaluator = AnswerLengthEvaluator()
answer_length = answer_length_evaluator(answer="What is the speed of light?")
Code-based evaluator output: answer length
{"answer_length":27}
Prompt-based evaluators
To build your own prompt-based large language model evaluator or AI-assisted annotator, you can create a custom evaluator based on a Prompty file.
Prompty is a file with the .prompty extension for developing prompt templates. A Prompty asset is a Markdown file with a modified front matter. The front matter is in YAML format and contains metadata fields that define the model configuration and the expected inputs of the Prompty.
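For orientation, a .prompty file has this basic anatomy (a minimal sketch with placeholder values; the complete friendliness example below shows a full file):

---
name: <evaluator name>
model:
  api: chat
  configuration:
    type: azure_openai
inputs:
  response:
    type: string
---
system:
<grading instructions written in Markdown>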
To measure the friendliness of a response, you can build a custom evaluator FriendlinessEvaluator:
Prompt-based evaluator example: friendliness evaluator
First, create a friendliness.prompty file that defines the friendliness metric and its grading rubrics:
---
name: Friendliness Evaluator
description: Friendliness Evaluator to measure warmth and approachability of answers.
model:
  api: chat
  configuration:
    type: azure_openai
    azure_endpoint: ${env:AZURE_OPENAI_ENDPOINT}
    azure_deployment: gpt-4o-mini
  parameters:
    model:
    temperature: 0.1
inputs:
  response:
    type: string
outputs:
  score:
    type: int
  explanation:
    type: string
---
system:
Friendliness assesses the warmth and approachability of the answer. Rate the friendliness of the response between one and five stars using the following scale:
One star: the answer is unfriendly or hostile
Two stars: the answer is mostly unfriendly
Three stars: the answer is neutral
Four stars: the answer is mostly friendly
Five stars: the answer is very friendly
Please assign a rating between 1 and 5 based on the tone and demeanor of the response.
**Example 1**
generated_query: I just don't feel like helping you! Your questions are getting very annoying.
output:
{"score": 1, "reason": "The response is not warm and is resisting to be providing helpful information."}
**Example 2**
generated_query: I'm sorry this watch is not working for you. Very happy to assist you with a replacement.
output:
{"score": 5, "reason": "The response is warm and empathetic, offering a resolution with care."}
**Here is the actual conversation to be scored:**
generated_query: {{response}}
output:
Then create a class FriendlinessEvaluator to load the Prompty file and process the output in JSON format:
import os
import json

from promptflow.client import load_flow


class FriendlinessEvaluator:
    def __init__(self, model_config):
        # Load the Prompty file that sits next to this module.
        current_dir = os.path.dirname(__file__)
        prompty_path = os.path.join(current_dir, "friendliness.prompty")
        self._flow = load_flow(source=prompty_path, model={"configuration": model_config})

    def __call__(self, *, response: str, **kwargs):
        llm_response = self._flow(response=response)
        try:
            # The model is instructed to return JSON; parse it into a dict.
            response = json.loads(llm_response)
        except Exception:
            # Fall back to the raw string if the model returns malformed JSON.
            response = llm_response
        return response
Now create your own Prompty-based evaluator and run it on a row of data:
from friendliness.friend import FriendlinessEvaluator

# model_config is the Azure OpenAI model configuration (endpoint, deployment,
# and credentials) that load_flow uses to call the model.
friendliness_eval = FriendlinessEvaluator(model_config)
friendliness_score = friendliness_eval(response="I will not apologize for my behavior!")
Prompt-based evaluator output: friendliness evaluator
{
    'score': 1,
    'reason': 'The response is hostile and unapologetic, lacking warmth or approachability.'
}
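Because a custom evaluator accepts keyword arguments and returns a dictionary, it can score whole datasets the same way the built-in evaluators do. As a minimal sketch (this assumes the azure-ai-evaluation package's evaluate function, and the data.jsonl path and its column names are placeholder assumptions; by default, data columns are matched to evaluator inputs of the same name):

# A minimal sketch of batch evaluation with the custom evaluators above.
# Assumptions: azure-ai-evaluation is installed, data.jsonl exists, and each
# row has "answer" and "response" columns matching the evaluator inputs.
from azure.ai.evaluation import evaluate

result = evaluate(
    data="data.jsonl",  # JSON Lines file, one record per row to score
    evaluators={
        "answer_length": AnswerLengthEvaluator(),
        "friendliness": FriendlinessEvaluator(model_config),
    },
)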