LLM Judges and Scorers

Scorers evaluate GenAI app quality by analyzing outputs and producing structured feedback. The same scorer can be used for evaluation in development and reused for monitoring in production.

MLflow provides two types of scorers:

  • LLM Judges - Scorers that leverage large language models to assess nuanced quality criteria such as relevance, safety, and correctness.

  • Code-based scorers - Deterministic scorers that use programmatic logic for metrics such as latency, token usage, and exact matching; a minimal sketch follows this list.
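
As a quick sketch of the code-based category, the example below defines a hypothetical scorer that flags responses exceeding an arbitrary word budget. The within_word_budget name and the 200-word limit are illustrative only, and it assumes a custom scorer may declare just the outputs argument:

from mlflow.genai.scorers import scorer

# Hypothetical deterministic scorer: no LLM call, just programmatic logic
# over the raw output text. The 200-word budget is an arbitrary example.
@scorer
def within_word_budget(outputs: str) -> bool:
    return len(outputs.split()) <= 200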

The MLflow UI screenshot below shows outputs from the built-in Safety LLM judge and a custom code-based scorer, exact_match:

Example metrics from scorers

The code snippet below computes these metrics using mlflow.genai.evaluate() and then registers the same scorers for production monitoring:

import mlflow
from mlflow.genai.scorers import Safety, ScorerSamplingConfig, scorer
from typing import Any

@scorer
def exact_match(outputs: str, expectations: dict[str, Any]) -> bool:
    # Example of a custom code-based scorer
    return outputs == expectations["expected_response"]

# Evaluation during development
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=my_app,
    scorers=[Safety(), exact_match]
)

# Production monitoring - same scorers!
registered_scorers = [
    Safety().register(),
    exact_match.register(),
]

# Start each registered scorer, sampling 10% of production traces
registered_scorers = [
    reg_scorer.start(
        sampling_config=ScorerSamplingConfig(sample_rate=0.1)
    )
    for reg_scorer in registered_scorers
]
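
The snippet above references eval_dataset and my_app without defining them. A minimal, hypothetical sketch of both is shown below, assuming the dataset is a list of records with inputs and expectations fields and that predict_fn receives each record's inputs as keyword arguments:

# Hypothetical stand-ins for the eval_dataset and my_app used above.
# Assumes each record's "inputs" dict is passed to predict_fn as keyword
# arguments and its "expectations" dict is read by scorers like exact_match.
eval_dataset = [
    {
        "inputs": {"question": "What is MLflow?"},
        "expectations": {"expected_response": "MLflow is an open source MLOps platform."},
    },
]

def my_app(question: str) -> str:
    # Replace with a call into your real GenAI app or agent
    return "MLflow is an open source MLOps platform."

With those defined, the evaluation call runs both scorers over every record, and the registered scorers later apply the same checks to sampled production traces.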

Next steps