Evaluation datasets are curated test data for systematic GenAI app evaluation. Each record includes inputs, optional ground truth (expectations), and metadata.
Two ways to provide evaluation data
You can provide evaluation data to the MLflow evaluation harness in two ways:
1. MLflow Evaluation Datasets (Recommended)
Purpose-built datasets stored in Unity Catalog with:
- Versioning: Track dataset changes over time
- Lineage: Link dataset records to their inputs (source traces) and track their usage (evaluation runs and app versions)
- Collaboration: Share datasets across teams
- Integration: Seamless workflow with MLflow UI and APIs
- Governance: Unity Catalog security and access controls
- Trace conversion: Easily convert production traces into evaluation dataset records using the UI or SDK
- Visualization: Inspect and edit dataset contents directly in the MLflow UI
When to use: Production evaluation workflows, regression testing, and when you need dataset management capabilities.
2. Arbitrary datasets (Quick prototyping)
Use existing data structures like:
- List of dictionaries
- Pandas DataFrame
- Spark DataFrame
When to use: Quick experiments, prototyping, or when you already have evaluation data in these formats.
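For example, a plain list of dictionaries can be passed straight to mlflow.genai.evaluate(). The following is a minimal sketch, assuming a toy predict function (my_app) and the built-in Correctness scorer; the question and expected facts are illustrative:

```python
import mlflow
from mlflow.genai.scorers import Correctness

# Toy app entry point; evaluate() calls it with each record's `inputs` as keyword arguments.
def my_app(question: str) -> str:
    return f"MLflow Tracking logs parameters, metrics, and artifacts. You asked: {question}"

# Arbitrary evaluation data: a plain list of dictionaries.
eval_data = [
    {
        "inputs": {"question": "What is MLflow Tracking?"},
        "expectations": {"expected_facts": ["logs parameters", "logs metrics", "logs artifacts"]},
    },
]

mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=my_app,
    scorers=[Correctness()],
)
```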
Evaluation Dataset schema
Evaluation datasets follow a consistent structure whether you use MLflow's Evaluation Dataset abstraction or pass data directly to mlflow.genai.evaluate().
Core fields
The following fields are used both by the Evaluation Dataset abstraction and when you pass data directly.
| Column | Data Type | Description | Required |
|---|---|---|---|
| inputs | dict[Any, Any] | Inputs for your app (e.g., user question, context), stored as a JSON-serializable dict. | Yes |
| expectations | dict[str, Any] | Ground truth labels, stored as a JSON-serializable dict. | Optional |
expectations reserved keys
expectations has several reserved keys that are consumed by built-in LLM judges and scorers: guidelines, expected_facts, expected_response, and expected_retrieved_context.
| Field | Used by | Description |
|---|---|---|
| expected_facts | Correctness judge | List of facts that should appear in the response |
| expected_response | Correctness judge | Exact or similar expected output |
| guidelines | Guidelines judge | Natural language rules to follow |
| expected_retrieved_context | document_recall scorer | Documents that should be retrieved |
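For example, a single record can combine several reserved keys so that each built-in judge or scorer picks up the expectation it needs. The following is a minimal sketch; the question, facts, guideline, and document URI are illustrative, and the list-of-documents shape for expected_retrieved_context is an assumption:

```python
# Hypothetical record exercising the reserved expectation keys.
record = {
    "inputs": {"question": "How do I register a model in MLflow?"},
    "expectations": {
        # Consumed by the Correctness judge
        "expected_facts": ["mlflow.register_model", "Model Registry"],
        "expected_response": "Call mlflow.register_model() to add the model to the Model Registry.",
        # Consumed by the Guidelines judge
        "guidelines": ["The answer must include a short code snippet."],
        # Consumed by the document_recall scorer (document shape is an assumption)
        "expected_retrieved_context": [
            {"doc_uri": "https://mlflow.org/docs/latest/model-registry.html"}
        ],
    },
}
```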
Additional fields
The following fields are used by the Evaluation Dataset abstraction to track lineage and version history.
| Column | Data Type | Description | Required |
|---|---|---|---|
| dataset_record_id | string | The unique identifier for the record. | Automatically set if not provided. |
| create_time | timestamp | The time when the record was created. | Automatically set when inserting or updating. |
| created_by | string | The user who created the record. | Automatically set when inserting or updating. |
| last_update_time | timestamp | The time when the record was last updated. | Automatically set when inserting or updating. |
| last_updated_by | string | The user who last updated the record. | Automatically set when inserting or updating. |
| source | struct | The source of the dataset record (see below). | Optional |
| tags | dict[str, Any] | Key-value tags for the dataset record. | Optional |
Source field
The source field tracks where a dataset record came from. Each record can have only one source type:
1. Human source - Record created manually by a person

```json
{
  "source": {
    "human": {
      "user_name": "jane.doe@company.com"
    }
  }
}
```

- user_name (str): The user who created the record

2. Document source - Record synthesized from a document

```json
{
  "source": {
    "document": {
      "doc_uri": "s3://bucket/docs/product-manual.pdf",
      "content": "The first 500 chars of the document..."
    }
  }
}
```

- doc_uri (str): URI/path to the source document
- content (str, optional): Excerpt or full content from the document

3. Trace source - Record created from a production trace

```json
{
  "source": {
    "trace": {
      "trace_id": "tr-abc123def456"
    }
  }
}
```

- trace_id (str): The unique identifier of the source trace
MLflow Evaluation Dataset UI

MLflow Evaluation Dataset SDK reference
The evaluation datasets SDK provides programmatic access to create, manage, and use datasets for GenAI app evaluation. For details, see the API reference: mlflow.genai.datasets. Some of the most frequently used methods and classes are the following:
- mlflow.genai.datasets.create_dataset
- mlflow.genai.datasets.get_dataset
- mlflow.genai.datasets.delete_dataset
- EvaluationDataset: This class provides methods to interact with and modify evaluation datasets.
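The following is a minimal lifecycle sketch using these APIs. The Unity Catalog table name is illustrative, and the argument to delete_dataset is assumed to be the same table name accepted by get_dataset:

```python
import mlflow.genai.datasets

# Create a dataset backed by a Unity Catalog table (illustrative name).
dataset = mlflow.genai.datasets.create_dataset(
    uc_table_name="catalog.schema.my_eval_dataset"
)

# Add records; merge_records also accepts the results of mlflow.search_traces().
dataset = dataset.merge_records(
    [
        {"inputs": {"question": "What is MLflow?"}},
        {"inputs": {"question": "What is Unity Catalog?"}},
    ]
)

# Load the same dataset later, for example in a scheduled regression job.
dataset = mlflow.genai.datasets.get_dataset("catalog.schema.my_eval_dataset")

# Remove the dataset when it is no longer needed.
mlflow.genai.datasets.delete_dataset("catalog.schema.my_eval_dataset")
```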
Common patterns
Create datasets from production traces
```python
import mlflow
import mlflow.genai.datasets
import pandas as pd

# By default, search_traces() searches the current active experiment.
# To search a different experiment, set it explicitly:
mlflow.set_experiment(experiment_id="<YOUR_EXPERIMENT_ID>")

# Search for production traces with good feedback
traces = mlflow.search_traces(
    filter_string="""
        tags.environment = 'production'
        AND attributes.feedback_score > 0.8
    """
)

# Create a Unity Catalog-backed dataset and merge the traces into it
dataset = mlflow.genai.datasets.create_dataset(
    uc_table_name="catalog.schema.production_golden_set"
)
dataset = dataset.merge_records(traces)
```
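Once created, the dataset can be passed directly to mlflow.genai.evaluate() as the data argument. The following is a minimal sketch, assuming the trace inputs contain a single question field and using the built-in Safety scorer as an illustrative metric; my_app is a hypothetical stand-in for your real prediction function:

```python
import mlflow
from mlflow.genai.scorers import Safety

# Hypothetical app entry point; evaluate() calls it with each record's `inputs` as keyword arguments.
def my_app(question: str) -> str:
    return f"Here is an answer to: {question}"

results = mlflow.genai.evaluate(
    data=dataset,          # the EvaluationDataset created above
    predict_fn=my_app,
    scorers=[Safety()],
)
```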
Update existing datasets
```python
import mlflow.genai.datasets
import pandas as pd

# Load existing dataset
dataset = mlflow.genai.datasets.get_dataset("catalog.schema.eval_dataset")

# Add new test cases
new_cases = [
    {
        "inputs": {"question": "What are MLflow models?"},
        "expectations": {
            "expected_facts": ["model packaging", "deployment", "registry"],
            "min_response_length": 100,
        },
    }
]

# Merge new cases
dataset = dataset.merge_records(new_cases)
```
Limitations
- Customer Managed Keys (CMK) are not supported.
- Maximum of 2,000 rows per evaluation dataset.
- Maximum of 20 expectations per dataset record.
If you need any of these limitations relaxed for your use case, contact your Databricks representative.
Next steps
How-to guides
- Build evaluation datasets - Step-by-step dataset creation
- Evaluate your app - Use datasets for evaluation
Concepts
- Evaluation Harness - How datasets are used
- Scorers - Metrics applied to datasets