Code-based scorer examples

In MLflow evaluation for GenAI, code-based scorers let you define flexible evaluation metrics for your AI agent or application. This set of examples and its companion example notebook demonstrate patterns for code-based scorers with different options for inputs, outputs, implementations, and error handling.

The image below shows the output of some custom scorers as metrics in the MLflow UI.

Custom scorer development

Prerequisites

  1. Update MLflow
  2. Define the GenAI application
  3. Generate traces used by some of the scorer examples

Update MLflow

Update mlflow[databricks] to the latest version for the best GenAI experience, and install openai, because the sample application below uses the OpenAI client.

%pip install -q --upgrade "mlflow[databricks]>=3.1" openai
dbutils.library.restartPython()

Define the GenAI application

Some of the examples below use the following GenAI application, a general-purpose assistant for question answering. The code below uses the OpenAI client to connect to Databricks-hosted LLMs.

from databricks.sdk import WorkspaceClient
import mlflow

# Create an OpenAI client that is connected to Databricks-hosted LLMs
w = WorkspaceClient()
client = w.serving_endpoints.get_open_ai_client()

# Select an LLM
model_name = "databricks-claude-sonnet-4"

mlflow.openai.autolog()

# If running outside of Databricks, set up MLflow tracking to Databricks.
# mlflow.set_tracking_uri("databricks")

# In Databricks notebooks, the experiment defaults to the notebook experiment.
# mlflow.set_experiment("/Shared/docs-demo")

@mlflow.trace
def sample_app(messages: list[dict[str, str]]):
    # 1. Prepare messages for the LLM
    messages_for_llm = [
        {"role": "system", "content": "You are a helpful assistant."},
        *messages,
    ]

    # 2. Call LLM to generate a response
    response = client.chat.completions.create(
        model=model_name,
        messages=messages_for_llm,
    )
    return response.choices[0].message.content


sample_app([{"role": "user", "content": "What is the capital of France?"}])

Generate traces

The eval_dataset below is used with mlflow.genai.evaluate() and a placeholder scorer to generate traces.

from mlflow.genai.scorers import scorer

eval_dataset = [
    {
        "inputs": {
            "messages": [
                {"role": "user", "content": "How much does a microwave cost?"},
            ]
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "Can I return the microwave I bought 2 months ago?",
                },
            ]
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "Website"},
            ]
        },
    },
]

@scorer
def placeholder_metric() -> int:
    # placeholder return value
    return 1

eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=sample_app,
    scorers=[placeholder_metric]
)

generated_traces = mlflow.search_traces(run_id=eval_results.run_id)
generated_traces

The mlflow.search_traces() call above returns a Pandas DataFrame containing the traces, which the examples below reuse. You can inspect it before reuse, as sketched below.
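
The exact set of columns can vary across MLflow versions; this minimal sketch simply prints whatever your version returns rather than assuming specific column names:

# Inspect the traces DataFrame before reusing it in the later examples.
print(generated_traces.columns.tolist())
print(f"Collected {len(generated_traces)} traces")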

Example 1: Access data from the Trace

Access the full MLflow trace object to use a variety of details (spans, inputs, outputs, attributes, timing) for fine-grained metric calculations.

This scorer checks whether the LLM response time recorded in the trace is within an acceptable limit.

import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Trace, Feedback, SpanType

@scorer
def llm_response_time_good(trace: Trace) -> Feedback:
    # Search particular span type from the trace
    llm_span = trace.search_spans(span_type=SpanType.CHAT_MODEL)[0]

    response_time = (llm_span.end_time_ns - llm_span.start_time_ns) / 1e9 # convert to seconds
    max_duration = 5.0
    if response_time <= max_duration:
        return Feedback(
            value="yes",
            rationale=f"LLM response time {response_time:.2f}s is within the {max_duration}s limit."
        )
    else:
        return Feedback(
            value="no",
            rationale=f"LLM response time {response_time:.2f}s exceeds the {max_duration}s limit."
        )

# Evaluate the scorer using the pre-generated traces from the prerequisite code block.
span_check_eval_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[llm_response_time_good]
)

Example 2: Wrap a predefined LLM judge

Create a custom scorer that wraps one of MLflow's predefined LLM judges. Use this to preprocess the trace data for the judge or to post-process its feedback.

This example demonstrates how to wrap the is_context_relevant judge to evaluate whether the assistant's response is relevant to the user's query. Specifically, the inputs field for sample_app is a dictionary of the form {"messages": [{"role": ..., "content": ...}, ...]}. This scorer extracts the content of the last user message to pass to the relevance judge.

import mlflow
from mlflow.entities import Trace, Feedback
from mlflow.genai.judges import is_context_relevant
from mlflow.genai.scorers import scorer
from typing import Any

@scorer
def is_message_relevant(inputs: dict[str, Any], outputs: str) -> Feedback:
    last_user_message_content = None
    if "messages" in inputs and isinstance(inputs["messages"], list):
        for message in reversed(inputs["messages"]):
            if message.get("role") == "user" and "content" in message:
                last_user_message_content = message["content"]
                break

    if not last_user_message_content:
        raise Exception("Could not extract the last user message from inputs to evaluate relevance.")

    # Call the is_context_relevant judge. It returns a Feedback object.
    return is_context_relevant(
        request=last_user_message_content,
        context={"response": outputs},
    )

# Evaluate the scorer using the pre-generated traces from the prerequisite code block.
custom_relevance_eval_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[is_message_relevant]
)

Example 3: Use expectations

Expectations are ground-truth values or labels, which are often important for offline evaluation. When you run mlflow.genai.evaluate(), you can specify expectations in the data argument in two ways:

  • An expectations column or field: for example, if the data argument is a list of dictionaries or a Pandas DataFrame, each row can include an expectations key. The value associated with this key is passed directly to your custom scorer.
  • A trace column or field: for example, if the data argument is the DataFrame returned by mlflow.search_traces(), it includes a trace field that carries any Expectation data attached to each trace (see the sketch after this list).
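
A minimal sketch of the second option, assuming MLflow 3's assessment API (mlflow.log_expectation) and a trace_id column in the search_traces() result; verify both against your MLflow version, and adapt the expectation name and value to your labeling workflow:

# Attach a ground-truth expectation to each previously generated trace (illustrative values).
for trace_id in generated_traces["trace_id"]:
    mlflow.log_expectation(
        trace_id=trace_id,
        name="expected_response",
        value="<ground-truth answer for this trace>",
    )

# Traces fetched afterwards carry the expectations, so scorers that declare an
# `expectations` parameter receive them during evaluation.
labeled_traces = mlflow.search_traces(run_id=eval_results.run_id)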

Note

Production monitoring typically has no specific expectations, because you are evaluating live traffic without ground truth. If you intend to use the same scorer for both offline and online evaluation, design it to handle missing expectations gracefully, as sketched below.
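
For example, here is a minimal sketch of an expectations-tolerant scorer; the fallback check it performs when no ground truth is available is purely illustrative:

from typing import Any, Optional

from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer

@scorer
def expectation_tolerant_match(outputs: str, expectations: Optional[dict[str, Any]] = None) -> Feedback:
    # Online/production case: no ground truth, so fall back to a basic sanity check.
    if not expectations or "expected_response" not in expectations:
        return Feedback(
            value="yes" if outputs else "no",
            rationale="No expectations provided; only checked that the response is non-empty.",
        )
    # Offline case: compare against the labeled ground truth.
    matches = outputs == expectations["expected_response"]
    return Feedback(
        value="yes" if matches else "no",
        rationale="Response matches expected_response." if matches else "Response differs from expected_response.",
    )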

This example also demonstrates using a custom scorer alongside the predefined Safety scorer.

import mlflow
from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer, Safety
from typing import Any, List, Optional, Union

expectations_eval_dataset_list = [
    {
        "inputs": {"messages": [{"role": "user", "content": "What is 2+2?"}]},
        "expectations": {
            "expected_response": "2+2 equals 4.",
            "expected_keywords": ["4", "four", "equals"],
        }
    },
    {
        "inputs": {"messages": [{"role": "user", "content": "Describe MLflow in one sentence."}]},
        "expectations": {
            "expected_response": "MLflow is an open-source platform to streamline machine learning development, including tracking experiments, packaging code into reproducible runs, and sharing and deploying models.",
            "expected_keywords": ["mlflow", "open-source", "platform", "machine learning"],
        }
    },
    {
        "inputs": {"messages": [{"role": "user", "content": "Say hello."}]},
        "expectations": {
            "expected_response": "Hello there!",
            # No keywords needed for this one, but the field can be omitted or empty
        }
    }
]

Example 3.1: Exact match with the expected response

This scorer checks whether the assistant's response exactly matches the expected_response provided in expectations.

@scorer
def exact_match(outputs: str, expectations: dict[str, Any]) -> bool:
    # Scorer can return primitive value like bool, int, float, str, etc.
    return outputs == expectations["expected_response"]

exact_match_eval_results = mlflow.genai.evaluate(
    data=expectations_eval_dataset_list,
    predict_fn=sample_app, # sample_app is from the prerequisite section
    scorers=[exact_match, Safety()]  # You can include any number of scorers
)

Example 3.2: Keyword check against expectations

This scorer checks whether all of the expected_keywords from expectations appear in the assistant's response.

@scorer
def keyword_presence_scorer(outputs: str, expectations: dict[str, Any]) -> Feedback:
    expected_keywords = expectations.get("expected_keywords")
    print(expected_keywords)
    if expected_keywords is None:
        return Feedback(value="yes", rationale="No keywords were expected in the response.")

    missing_keywords = []
    for keyword in expected_keywords:
        if keyword.lower() not in outputs.lower():
            missing_keywords.append(keyword)

    if not missing_keywords:
        return Feedback(value="yes", rationale="All expected keywords are present in the response.")
    else:
        return Feedback(value="no", rationale=f"Missing keywords: {', '.join(missing_keywords)}.")

keyword_presence_eval_results = mlflow.genai.evaluate(
    data=expectations_eval_dataset_list,
    predict_fn=sample_app, # sample_app is from the prerequisite section
    scorers=[keyword_presence_scorer]
)

Example 4: Return multiple feedback objects

A single scorer can return a list of Feedback objects, allowing one scorer to assess multiple quality aspects (such as PII, sentiment, and conciseness) at once.

Each Feedback object should have a unique name, which becomes the metric name in the results (see Example 8 below for the naming rules).

This example demonstrates a scorer that returns two distinct pieces of feedback for each trace:

  1. is_not_empty_check: a "yes"/"no" value indicating whether the response content is non-empty.
  2. response_char_length: a numeric value for the character length of the response.

import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback, Trace # Ensure Feedback and Trace are imported
from typing import Any, Optional

@scorer
def comprehensive_response_checker(outputs: str) -> list[Feedback]:
    feedbacks = []
    # 1. Check if the response is not empty
    feedbacks.append(
        Feedback(name="is_not_empty_check", value="yes" if outputs != "" else "no")
    )
    # 2. Calculate response character length
    char_length = len(outputs)
    feedbacks.append(Feedback(name="response_char_length", value=char_length))
    return feedbacks

# Evaluate the scorer using the pre-generated traces from the prerequisite code block.
multi_feedback_eval_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[comprehensive_response_checker]
)

The results include two assessments for each trace: is_not_empty_check and response_char_length. They can also be retrieved programmatically, as sketched after the image below.

Multiple feedback results
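
A minimal sketch of reading both assessments back from the evaluation run; the structure assumed for the assessments column mirrors its usage in Example 9 below:

# Fetch the evaluated traces and print each assessment name and value.
result_traces = mlflow.search_traces(run_id=multi_feedback_eval_results.run_id)
for assessments in result_traces["assessments"]:
    for a in assessments:
        print(a["assessment_name"], "->", a["feedback"]["value"])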

Example 5: Use your own LLM as a judge

Incorporate a custom or externally hosted LLM into a scorer. The scorer handles the API calls and input/output formatting and generates Feedback from your LLM's response, giving you full control over the judging process.

You can also set the source field on the Feedback object to indicate that the assessment comes from an LLM judge.

import mlflow
import json
from mlflow.genai.scorers import scorer
from mlflow.entities import AssessmentSource, AssessmentSourceType, Feedback
from typing import Any, Optional

# Define the prompts for the Judge LLM.
judge_system_prompt = """
You are an impartial AI assistant responsible for evaluating the quality of a response generated by another AI model.
Your evaluation should be based on the original user query and the AI's response.
Provide a quality score as an integer from 1 to 5 (1=Poor, 2=Fair, 3=Good, 4=Very Good, 5=Excellent).
Also, provide a brief rationale for your score.

Your output MUST be a single valid JSON object with two keys: "score" (an integer) and "rationale" (a string).
Example:
{"score": 4, "rationale": "The response was mostly accurate and helpful, addressing the user's query directly."}
"""
judge_user_prompt = """
Please evaluate the AI's Response below based on the Original User Query.

Original User Query:
```{user_query}```

AI's Response:
```{llm_response_from_app}```

Provide your evaluation strictly as a JSON object with "score" and "rationale" keys.
"""

@scorer
def answer_quality(inputs: dict[str, Any], outputs: str) -> Feedback:
    user_query = inputs["messages"][-1]["content"]

    # Call the Judge LLM using the OpenAI SDK client.
    judge_llm_response_obj = client.chat.completions.create(
        model="databricks-claude-3-7-sonnet",  # This example uses Databricks hosted Claude. If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o-mini, etc.
        messages=[
            {"role": "system", "content": judge_system_prompt},
            {"role": "user", "content": judge_user_prompt.format(user_query=user_query, llm_response_from_app=outputs)},
        ],
        max_tokens=200,  # Max tokens for the judge's rationale
        temperature=0.0, # For more deterministic judging
    )
    judge_llm_output_text = judge_llm_response_obj.choices[0].message.content

    # Parse the Judge LLM's JSON output.
    judge_eval_json = json.loads(judge_llm_output_text)
    parsed_score = int(judge_eval_json["score"])
    parsed_rationale = judge_eval_json["rationale"]

    return Feedback(
        value=parsed_score,
        rationale=parsed_rationale,
        # Set the source of the assessment to indicate the LLM judge used to generate the feedback
        source=AssessmentSource(
            source_type=AssessmentSourceType.LLM_JUDGE,
            source_id="claude-3-7-sonnet",
        )
    )

# Evaluate the scorer using the pre-generated traces from the prerequisite code block.
custom_llm_judge_eval_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[answer_quality]
)

By viewing the trace in the UI and clicking the "answer_quality" assessment, you can see the judge's metadata, such as the rationale, timestamp, and judge model name. If the judge's assessment is incorrect, you can override the score by clicking the Edit button.

The new assessment supersedes the original judge assessment, and the edit history is preserved for future reference.

Edit an LLM judge assessment

Example 6: Class-based scorer definitions

If your scorer requires state, the decorator-based @scorer definition may not be sufficient. Use the Scorer base class instead for more complex scorers. The Scorer class is a Pydantic object, so you can define additional fields and use them in the __call__ method.

from mlflow.genai.scorers import Scorer
from mlflow.entities import Feedback
from typing import Optional

# Scorer class is a Pydantic object
class ResponseQualityScorer(Scorer):

    # The `name` field is mandatory
    name: str = "response_quality"

    # Define additional fields
    min_length: int = 50
    required_sections: Optional[list[str]] = None

    # Override the __call__ method to implement the scorer logic
    def __call__(self, outputs: str) -> Feedback:
        issues = []

        # Check length
        if len(outputs.split()) < self.min_length:
            issues.append(f"Too short (minimum {self.min_length} words)")

        # Check required sections (guard against the default of None)
        if self.required_sections:
            missing = [s for s in self.required_sections if s not in outputs]
            if missing:
                issues.append(f"Missing sections: {', '.join(missing)}")

        if issues:
            return Feedback(
                value=False,
                rationale="; ".join(issues)
            )

        return Feedback(
            value=True,
            rationale="Response meets all quality criteria"
        )


response_quality_scorer = ResponseQualityScorer(required_sections=["# Summary", "# Sources"])

# Evaluate the scorer using the pre-generated traces from the prerequisite code block.
class_based_scorer_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[response_quality_scorer]
)

Example 7: Error handling in scorers

The example below demonstrates two approaches to handling errors in scorers:

  • Handle errors explicitly: you can explicitly detect bad inputs, or catch other exceptions, and return a Feedback that carries an AssessmentError.
  • Let exceptions propagate (recommended): for most errors, it is best to let MLflow catch the exception. MLflow creates a Feedback object containing the error details and continues execution.

import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback, AssessmentError

@scorer
def resilient_scorer(outputs, trace=None):
    try:
        response = outputs.get("response")
        if not response:
            return Feedback(
                value=None,
                error=AssessmentError(
                    error_code="MISSING_RESPONSE",
                    error_message="No response field in outputs"
                )
            )
        # Your evaluation logic
        return Feedback(value=True, rationale="Valid response")
    except Exception as e:
        # Let MLflow handle the error gracefully
        raise

# Evaluation continues even if some scorers fail.
results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[resilient_scorer]
)

Example 8: Naming conventions for scorers

The following examples illustrate the naming behavior of code-based scorers. The behavior can be summarized as follows:

  1. If the scorer returns one or more Feedback objects, the Feedback.name field takes precedence when it is specified.
  2. For primitive return values or unnamed Feedback, the function name (for the @scorer decorator) or the Scorer.name field (for Scorer classes) is used.

from mlflow.genai.scorers import Scorer
from mlflow.entities import Feedback
from typing import Optional, Any, List

# Primitive value or single `Feedback` without a name: The scorer function name becomes the metric name.
@scorer
def decorator_primitive(outputs: str) -> int:
    # metric name = "decorator_primitive"
    return 1

@scorer
def decorator_unnamed_feedback(outputs: Any) -> Feedback:
    # metric name = "decorator_unnamed_feedback"
    return Feedback(value=True, rationale="Good quality")

# Single `Feedback` with an explicit name: The name specified in the `Feedback` object is used as the metric name.
@scorer
def decorator_feedback_named(outputs: Any) -> Feedback:
    # metric name = "decorator_named_feedback"
    return Feedback(name="decorator_named_feedback", value=True, rationale="Factual accuracy is high")

# Multiple `Feedback` objects: Names specified in each `Feedback` object are preserved. You must specify a unique name for each `Feedback`.
@scorer
def decorator_named_feedbacks(outputs) -> list[Feedback]:
    return [
        Feedback(name="decorator_named_feedback_1", value=True, rationale="No errors"),
        Feedback(name="decorator_named_feedback_2", value=0.9, rationale="Very clear"),
    ]

# Class returning primitive value
class ScorerPrimitive(Scorer):
    # metric name = "scorer_primitive"
    name: str = "scorer_primitive"
    def __call__(self, outputs: str) -> int:
        return 1

scorer_primitive = ScorerPrimitive()

# Class returning a Feedback object without a name
class ScorerFeedbackUnnamed(Scorer):
    # metric name = "scorer_feedback_unnamed" (falls back to Scorer.name)
    name: str = "scorer_feedback_unnamed"
    def __call__(self, outputs: str) -> Feedback:
        return Feedback(value=True, rationale="Good")

scorer_feedback_unnamed = ScorerFeedbackUnnamed()

# Class returning a Feedback object with a name
class ScorerFeedbackNamed(Scorer):
    # metric name = "scorer_named_feedback"
    name: str = "scorer_feedback_named"
    def __call__(self, outputs: str) -> Feedback:
        return Feedback(name="scorer_named_feedback", value=True, rationale="Good")

scorer_feedback_named = ScorerFeedbackNamed()

# Class returning multiple Feedback objects with names
class ScorerNamedFeedbacks(Scorer):
    # metric names = ["scorer_named_feedback_1", "scorer_named_feedback_2"]
    name: str = "scorer_named_feedbacks"  # Not used
    def __call__(self, outputs: str) -> List[Feedback]:
        return [
          Feedback(name="scorer_named_feedback_1", value=True, rationale="Good"),
          Feedback(name="scorer_named_feedback_2", value=1, rationale="ok"),
        ]

scorer_named_feedbacks = ScorerNamedFeedbacks()

mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[
      decorator_primitive,
      decorator_unnamed_feedback,
      decorator_feedback_named,
      decorator_named_feedbacks,
      scorer_primitive,
      scorer_feedback_unnamed,
      scorer_feedback_named,
      scorer_named_feedbacks,
    ],
)

Example 9: Chain evaluation results

If a scorer flags issues on a subset of traces, you can collect that subset with mlflow.search_traces() to iterate on further. The example below looks for general Safety failures and then analyzes the failing traces with a more tailored scorer (here, one that evaluates against a content policy). Alternatively, you could use the problematic subset of traces to iterate on the AI application itself and improve how it handles challenging inputs.

from mlflow.genai.scorers import Safety, Guidelines

# Run initial evaluation
results1 = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[Safety()]
)

# Use results to create refined dataset
traces = mlflow.search_traces(run_id=results1.run_id)

# Filter to problematic traces
safety_failures = traces[traces['assessments'].apply(
    lambda x: any(a['assessment_name'] == 'Safety' and a['feedback']['value'] == 'no' for a in x)
)]

# Updated app (not actually updated in this toy example)
updated_app = sample_app

# Re-evaluate with different scorers or updated app
if len(safety_failures) > 0:
    results2 = mlflow.genai.evaluate(
        data=safety_failures,
        predict_fn=updated_app,
        scorers=[
            Guidelines(
                name="content_policy",
                guidelines="Response must follow our content policy"
            )
        ]
    )

Example 10: Conditional logic with guidelines

You can wrap the Guidelines judge in a custom code-based scorer to apply different guidelines based on user attributes or other context.

from mlflow.genai.scorers import scorer, Guidelines

@scorer
def premium_service_validator(inputs, outputs, trace=None):
    """Custom scorer that applies different guidelines based on user tier"""

    # Extract user tier from inputs (could also come from trace)
    user_tier = inputs.get("user_tier", "standard")

    # Apply different guidelines based on user attributes
    if user_tier == "premium":
        # Premium users get more personalized, detailed responses
        premium_judge = Guidelines(
            name="premium_experience",
            guidelines=[
                "The response must acknowledge the user's premium status",
                "The response must provide detailed explanations with at least 3 specific examples",
                "The response must offer priority support options (e.g., 'direct line' or 'dedicated agent')",
                "The response must not include any upselling or promotional content"
            ]
        )
        return premium_judge(inputs=inputs, outputs=outputs)
    else:
        # Standard users get clear but concise responses
        standard_judge = Guidelines(
            name="standard_experience",
            guidelines=[
                "The response must be helpful and professional",
                "The response must be concise (under 100 words)",
                "The response may mention premium features as upgrade options"
            ]
        )
        return standard_judge(inputs=inputs, outputs=outputs)

# Example evaluation data
eval_data = [
    {
        "inputs": {
            "question": "How do I export my data?",
            "user_tier": "premium"
        },
        "outputs": {
            "response": "As a premium member, you have access to advanced export options. You can export in 5 formats: CSV, Excel, JSON, XML, and PDF. Here's how: 1) Go to Settings > Export, 2) Choose your format and date range, 3) Click 'Export Now'. For immediate assistance, call your dedicated support line at 1-800-PREMIUM."
        }
    },
    {
        "inputs": {
            "question": "How do I export my data?",
            "user_tier": "standard"
        },
        "outputs": {
            "response": "You can export your data as CSV from Settings > Export. Premium users can access additional formats like Excel and PDF."
        }
    }
]

# Run evaluation with the custom scorer
results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[premium_service_validator]
)

Example notebook

The following notebook includes all of the code on this page.

Code-based scorers for MLflow evaluation notebook

Get notebook

Next steps