Note
Access to this page requires authorization. You can try signing in or changing directories.
Access to this page requires authorization. You can try changing directories.
Evaluation runs are MLflow runs that organize and store the results of evaluating your GenAI app. An evaluation run includes the following:
- Traces: One trace for each input in your evaluation dataset.
- Feedback: Quality assessments from scorers attached to each trace.
- Metrics: Aggregate statistics across all evaluated examples.
- Metadata: Information about the evaluation configuration.
How to create evaluation runs
An evaluation run is automatically created when you call mlflow.genai.evaluate(). For more information about mlflow.genai.evaluate(), see the MLflow source code and documentation.
import mlflow
# This creates an evaluation run
results = mlflow.genai.evaluate(
    data=test_dataset,
    predict_fn=my_app,
    scorers=[correctness_scorer, safety_scorer],
    experiment_name="my_app_evaluations"
)
# Access the run ID
print(f"Evaluation run ID: {results.run_id}")
Evaluation run structure
Evaluation Run
├── Run Info
│   ├── run_id: unique identifier
│   ├── experiment_id: which experiment it belongs to
│   ├── start_time: when evaluation began
│   └── status: success/failed
├── Traces (one per dataset row)
│   ├── Trace 1
│   │   ├── inputs: {"question": "What is MLflow?"}
│   │   ├── outputs: {"response": "MLflow is..."}
│   │   └── feedbacks: [correctness: 0.8, relevance: 1.0]
│   ├── Trace 2
│   └── ...
├── Aggregate Metrics
│   ├── correctness_mean: 0.85
│   ├── relevance_mean: 0.92
│   └── safety_pass_rate: 1.0
└── Parameters
    ├── model_version: "v2.1"
    ├── dataset_name: "qa_test_v1"
    └── scorers: ["correctness", "relevance", "safety"]