Scorers evaluate GenAI app quality by analyzing outputs and producing structured feedback. The same scorer can be used for evaluation in development and reused for monitoring in production.
MLflow provides two types of scorers:
- LLM Judges - Scorers that leverage Large Language Models to assess nuanced quality criteria like relevance, safety, and correctness. These include:
  - Built-in LLM Judges - Pre-configured judges for common quality dimensions
  - Custom LLM Judges - Domain-specific judges you create for your needs (see the sketch after this list)
- Code-based scorers - Deterministic scorers that use programmatic logic for metrics like latency, token usage, and exact matching:
  - Custom code-based scorers - Python functions that compute specific metrics
The MLflow UI screenshot below shows outputs from the built-in Safety LLM judge and a custom exact_match scorer:

The code snippet below computes these metrics using mlflow.genai.evaluate() and then registers the same scorers for production monitoring:
```python
import mlflow
from mlflow.genai.scorers import Safety, ScorerSamplingConfig, scorer
from typing import Any


@scorer
def exact_match(outputs: str, expectations: dict[str, Any]) -> bool:
    # Example of a custom code-based scorer
    return outputs == expectations["expected_response"]


# Evaluation during development
# eval_dataset and my_app are defined elsewhere (see the sketch below)
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=my_app,
    scorers=[Safety(), exact_match],
)

# Production monitoring - same scorers!
registered_scorers = [
    Safety().register(),
    exact_match.register(),
]
registered_scorers = [
    reg_scorer.start(
        # Score roughly 10% of production traces
        sampling_config=ScorerSamplingConfig(sample_rate=0.1)
    )
    for reg_scorer in registered_scorers
]
```
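The snippet above assumes eval_dataset and my_app already exist. Below is a minimal sketch of what they might look like, assuming the list-of-dicts dataset format with inputs and expectations keys accepted by mlflow.genai.evaluate(); the question text and app logic are placeholders:

```python
# Placeholder app: predict_fn receives each row's "inputs" dict as keyword arguments
def my_app(question: str) -> str:
    # A real app would call your model or agent here
    return "MLflow Tracking is an API for logging parameters, metrics, and artifacts."


# Each row pairs inputs for predict_fn with expectations consumed by scorers
# such as exact_match above
eval_dataset = [
    {
        "inputs": {"question": "What is MLflow Tracking?"},
        "expectations": {
            "expected_response": "MLflow Tracking is an API for logging parameters, metrics, and artifacts."
        },
    },
]
```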
Next steps
- Use built-in LLM judges - Start evaluating your app quickly with built-in LLM Judges
- Create custom LLM Judges - Customize LLM Judges for your specific application
- Create custom code-based scorers - Build code-based scorers, including possible inputs, outputs, and error handling
- Evaluation harness - Understand how mlflow.genai.evaluate() uses your LLM Judges and code-based scorers
- Production monitoring for GenAI - Deploy your LLM Judges and code-based scorers for continuous monitoring