LLM Judges and Scorers

Scorers evaluate GenAI app quality by analyzing outputs and producing structured feedback. The same scorer can be used for evaluation in development and reused for monitoring in production.

MLflow provides two types of scorers:

  • LLM Judges - Scorers that leverage large language models to assess nuanced quality criteria such as relevance, safety, and correctness.

  • Code-based scorers - Deterministic scorers that use programmatic logic for metrics such as latency, token usage, and exact matching; a minimal sketch follows this list.
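
As a quick sketch of the code-based category, the example below defines a hypothetical scorer that flags responses exceeding an arbitrary word budget. The within_word_budget name and the 200-word limit are illustrative only, and it assumes a custom scorer may declare just the outputs argument:

from mlflow.genai.scorers import scorer

# Hypothetical deterministic scorer: no LLM call, just programmatic logic
# over the raw output text. The 200-word budget is an arbitrary example.
@scorer
def within_word_budget(outputs: str) -> bool:
    return len(outputs.split()) <= 200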

The MLflow UI screenshot below shows outputs from the built-in Safety LLM judge and a custom code-based scorer, exact_match:

Example metrics from scorers

The code snippet below computes these metrics using mlflow.genai.evaluate() and then registers the same scorers for production monitoring:

import mlflow
from mlflow.genai.scorers import Safety, ScorerSamplingConfig, scorer
from typing import Any

@scorer
def exact_match(outputs: str, expectations: dict[str, Any]) -> bool:
    # Example of a custom code-based scorer
    return outputs == expectations["expected_response"]

# Evaluation during development
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=my_app,
    scorers=[Safety(), exact_match]
)

# Production monitoring - same scorers!
registered_scorers = [
    Safety().register(),
    exact_match.register(),
]

# Start each registered scorer, sampling 10% of production traces
registered_scorers = [
    reg_scorer.start(
        sampling_config=ScorerSamplingConfig(sample_rate=0.1)
    )
    for reg_scorer in registered_scorers
]
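
The snippet above references eval_dataset and my_app without defining them. A minimal, hypothetical sketch of both is shown below, assuming the dataset is a list of records with inputs and expectations fields and that predict_fn receives each record's inputs as keyword arguments:

# Hypothetical stand-ins for the eval_dataset and my_app used above.
# Assumes each record's "inputs" dict is passed to predict_fn as keyword
# arguments and its "expectations" dict is read by scorers like exact_match.
eval_dataset = [
    {
        "inputs": {"question": "What is MLflow?"},
        "expectations": {"expected_response": "MLflow is an open source MLOps platform."},
    },
]

def my_app(question: str) -> str:
    # Replace with a call into your real GenAI app or agent
    return "MLflow is an open source MLOps platform."

With those defined, the evaluation call runs both scorers over every record, and the registered scorers later apply the same checks to sampled production traces.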

Next steps