This page provides reference documentation for MLflow evaluation and monitoring concepts. For guides and tutorials, see Evaluate and Monitor AI agents.
Tip
For MLflow 3 evaluation and monitoring API documentation, see API Reference.
Quick reference
| Concept | Purpose | Usage |
|---|---|---|
| Scorers | Evaluate trace quality | `@scorer` decorator or `Scorer` class |
| Judges | LLM-based quality assessment | Used through scorers such as `Safety()` |
| Evaluation Harness | Run offline evaluation | `mlflow.genai.evaluate()` |
| Evaluation Datasets | Test data management | `mlflow.genai.datasets` |
| Evaluation Runs | Store evaluation results | Created by the evaluation harness |
| Production Monitoring | Live quality tracking | `Scorer.register()`, `Scorer.start()` |
Scorers: mlflow.genai.scorers
Functions that evaluate traces and return Feedback.
```python
import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback
from typing import Optional, Dict, Any, List

@scorer
def my_custom_scorer(
    *,  # MLflow calls your scorer with named arguments
    inputs: Optional[Dict[Any, Any]],        # App's input from the trace
    outputs: Optional[Dict[Any, Any]],       # App's output from the trace
    expectations: Optional[Dict[str, Any]],  # Ground truth (offline only)
    trace: Optional[mlflow.entities.Trace],  # Complete trace
) -> int | float | bool | str | Feedback | List[Feedback]:
    # Your evaluation logic
    return Feedback(value=True, rationale="Explanation")
```
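Once decorated, the function can be passed to the evaluation harness like any built-in scorer. A minimal sketch, assuming a hypothetical `eval_data` dataset and `my_app` predict function:

```python
import mlflow

# Run the custom scorer against every row of a (hypothetical) test dataset
results = mlflow.genai.evaluate(
    data=eval_data,              # e.g., a list of {"inputs": {...}} records
    predict_fn=my_app,           # Your application's entry point
    scorers=[my_custom_scorer],  # The @scorer-decorated function defined above
)
```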
Judges
LLM judges are a type of MLflow Scorer that uses a large language model for quality assessment. While code-based scorers apply programmatic logic, judges use the reasoning capabilities of LLMs to evaluate criteria such as helpfulness, relevance, and safety.
```python
import mlflow
from mlflow.genai.scorers import Safety, RelevanceToQuery

# Initialize judges that will assess different quality aspects
safety_judge = Safety()               # Checks for harmful, toxic, or inappropriate content
relevance_judge = RelevanceToQuery()  # Checks if responses are relevant to user queries

# Run evaluation on your test dataset with multiple judges
mlflow.genai.evaluate(
    data=eval_data,                           # Your test cases (inputs, outputs, optional ground truth)
    predict_fn=my_app,                        # The application function you want to evaluate
    scorers=[safety_judge, relevance_judge],  # Both judges run on every test case
)
```
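Some built-in judges also accept natural-language criteria. The sketch below assumes the `Guidelines` scorer, which takes a `name` and a plain-English `guidelines` string; availability and exact parameters may vary by MLflow version:

```python
import mlflow
from mlflow.genai.scorers import Guidelines

# Assumed: a judge that checks each response against custom natural-language guidelines
tone_judge = Guidelines(
    name="professional_tone",
    guidelines="The response must be professional, concise, and free of slang.",
)

mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=my_app,
    scorers=[tone_judge],
)
```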
Evaluation Harness: mlflow.genai.evaluate(...)
Orchestrates offline evaluation during development.
```python
import mlflow
from mlflow.genai.scorers import Safety, RelevanceToQuery

results = mlflow.genai.evaluate(
    data=eval_dataset,                       # Test data
    predict_fn=my_app,                       # Your app
    scorers=[Safety(), RelevanceToQuery()],  # Quality metrics
    model_id="models:/my-app/1",             # Optional version tracking
)
```
Learn more about Evaluation Harness
Evaluation Datasets: mlflow.genai.datasets.EvaluationDataset
Versioned test data with optional ground truth.
```python
import mlflow
import mlflow.genai.datasets

# Create a dataset backed by a Unity Catalog table
dataset = mlflow.genai.datasets.create_dataset(
    uc_table_name="catalog.schema.eval_data"
)

# Add production traces
traces = mlflow.search_traces(filter_string="trace.status = 'OK'")
dataset.insert(traces)

# Use in evaluation
results = mlflow.genai.evaluate(data=dataset, ...)
```
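For quick experiments, the harness also accepts in-memory test data instead of a Unity Catalog-backed dataset. A sketch assuming a hypothetical `my_app` predict function and the built-in `Correctness` judge, which reads ground truth such as `expected_facts` from `expectations` (exact expectation keys may vary by version):

```python
import mlflow
from mlflow.genai.scorers import Correctness

# In-memory test cases: "inputs" maps to predict_fn keyword arguments,
# "expectations" carries optional ground truth used by some judges
eval_data = [
    {
        "inputs": {"question": "What is MLflow Tracing?"},
        "expectations": {"expected_facts": ["MLflow Tracing records spans for GenAI apps"]},
    },
]

results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=my_app,
    scorers=[Correctness()],
)
```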
Learn more about Evaluation Datasets
Evaluation Runs: mlflow.entities.Run
Evaluation results stored as an MLflow run, containing traces annotated with scorer feedback.
```python
# Access evaluation results
traces = mlflow.search_traces(run_id=results.run_id)

# Filter by feedback
good_traces = traces[traces['assessments'].apply(
    lambda x: all(a.value for a in x if a.name == 'Safety')
)]
```
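Building on the same DataFrame, a rough sketch of aggregating feedback, assuming the `assessments` column holds lists of feedback objects with `name` and `value` attributes as above:

```python
# Fraction of traces in the run where every Safety assessment passed
safety_pass_rate = traces['assessments'].apply(
    lambda assessments: all(a.value for a in assessments if a.name == 'Safety')
).mean()
print(f"Safety pass rate: {safety_pass_rate:.1%}")
```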
Learn more about Evaluation Runs
Production Monitoring
Important
This feature is in Beta.
Continuous evaluation of deployed applications.
```python
import mlflow
from mlflow.genai.scorers import Safety, ScorerSamplingConfig

# Register the scorer under a name that is unique within the experiment, then start monitoring
safety_judge = Safety().register(name="my_safety_judge")
safety_judge = safety_judge.start(sampling_config=ScorerSamplingConfig(sample_rate=0.7))
```
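The same register-and-start flow shown for `Safety` should apply to custom scorers as well; a sketch under the assumption that `@scorer`-decorated functions (like `my_custom_scorer` above) expose the same Beta monitoring methods:

```python
from mlflow.genai.scorers import ScorerSamplingConfig

# Assumed: custom scorers support the same Beta register/start API as built-in judges
my_monitor = my_custom_scorer.register(name="my_custom_monitor")
my_monitor = my_monitor.start(sampling_config=ScorerSamplingConfig(sample_rate=0.2))
```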