The Correctness judge assesses whether your GenAI application's response is factually correct by comparing it against provided ground truth information (expected_facts or expected_response).
This built-in LLM judge is designed for evaluating application responses against known correct answers.
By default, this judge uses a Databricks-hosted LLM designed to perform GenAI quality assessments. You can change the judge model by using the model argument in the judge definition. The model must be specified in the format <provider>:/<model-name>, where <provider> is a LiteLLM-compatible model provider. If you use databricks as the model provider, the model name is the same as the serving endpoint name.
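The `<provider>:/<model-name>` format described above can be sketched as a small string helper. This is only an illustration: `build_judge_model_uri` and `split_judge_model_uri` are hypothetical names, not part of MLflow, and the checks are assumptions about what a well-formed value looks like.

```python
from typing import Tuple


def build_judge_model_uri(provider: str, model_name: str) -> str:
    """Join a provider and model name as '<provider>:/<model-name>'."""
    provider = provider.strip()
    model_name = model_name.strip()
    if not provider or not model_name:
        raise ValueError("Both provider and model name are required.")
    return f"{provider}:/{model_name}"


def split_judge_model_uri(model_uri: str) -> Tuple[str, str]:
    """Split '<provider>:/<model-name>' back into (provider, model_name)."""
    provider, separator, model_name = model_uri.partition(":/")
    if not separator or not provider or not model_name:
        raise ValueError(f"Expected '<provider>:/<model-name>', got {model_uri!r}")
    return provider, model_name


# With the databricks provider, the model name is the serving endpoint name.
uri = build_judge_model_uri("databricks", "databricks-gpt-oss-120b")
print(uri)  # databricks:/databricks-gpt-oss-120b
```

The resulting string is what would be passed as the model argument of the judge.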
Prerequisites for running the examples
Install MLflow and required packages
pip install --upgrade "mlflow[databricks]>=3.4.0"
Create an MLflow experiment by following the setup your environment quickstart.
Usage examples
from mlflow.genai.scorers import Correctness
correctness_judge = Correctness()
# Example 1: Response contains expected facts
feedback = correctness_judge(
inputs={"request": "What is MLflow?"},
outputs={"response": "MLflow is an open-source platform for managing the ML lifecycle."},
expectations={
"expected_facts": [
"MLflow is open-source",
"MLflow is a platform for ML lifecycle"
]
}
)
print(feedback.value) # "yes"
print(feedback.rationale) # Explanation of which facts are supported
# Example 2: Response missing or contradicting facts
feedback = correctness_judge(
inputs={"request": "When was MLflow released?"},
outputs={"response": "MLflow was released in 2017."},
expectations={"expected_facts": ["MLflow was released in June 2018"]}
)
# Example 3: Using expected_response instead of expected_facts
feedback = correctness_judge(
inputs={"request": "What is the capital of France?"},
outputs={"response": "The capital of France is Paris."},
expectations={"expected_response": "Paris is the capital of France."}
)
Usage with mlflow.genai.evaluate()
The Correctness judge can be used directly with MLflow's evaluation framework.
Requirements:
- Trace requirements: inputs and outputs must be on the Trace's root span
- Ground-truth labels: Required - must provide either expected_facts or expected_response in the expectations dictionary
import mlflow
from mlflow.genai.scorers import Correctness
# Create evaluation dataset with ground truth
eval_dataset = [
{
"inputs": {"query": "What is the capital of France?"},
"outputs": {
"response": "Paris is the magnificent capital city of France, known for the Eiffel Tower and rich culture."
},
"expectations": {
"expected_facts": ["Paris is the capital of France."]
},
},
{
"inputs": {"query": "What are the main components of MLflow?"},
"outputs": {
"response": "MLflow has four main components: Tracking, Projects, Models, and Registry."
},
"expectations": {
"expected_facts": [
"MLflow has four main components",
"Components include Tracking",
"Components include Projects",
"Components include Models",
"Components include Registry"
]
},
},
{
"inputs": {"query": "When was MLflow released?"},
"outputs": {
"response": "MLflow was released in 2017 by Databricks."
},
"expectations": {
"expected_facts": ["MLflow was released in June 2018"]
},
}
]
# Run evaluation with Correctness scorer
eval_results = mlflow.genai.evaluate(
data=eval_dataset,
scorers=[
Correctness(
model="databricks:/databricks-gpt-oss-120b", # Optional. Defaults to custom Databricks model.
)
]
)
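Because ground-truth labels are required, it may help to check a dataset before running the evaluation. The following is a minimal sketch using only the standard library; `find_rows_missing_ground_truth` is a hypothetical helper, not an MLflow function, and it assumes the list-of-dictionaries layout used in the dataset above.

```python
from typing import Any, Dict, List

GROUND_TRUTH_KEYS = ("expected_facts", "expected_response")


def find_rows_missing_ground_truth(dataset: List[Dict[str, Any]]) -> List[int]:
    """Return indexes of rows with neither expected_facts nor expected_response."""
    missing = []
    for index, row in enumerate(dataset):
        expectations = row.get("expectations") or {}
        # A key that is absent, None, or empty counts as missing.
        if not any(expectations.get(key) for key in GROUND_TRUTH_KEYS):
            missing.append(index)
    return missing


sample_dataset = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "outputs": {"response": "Paris is the capital of France."},
        "expectations": {"expected_facts": ["Paris is the capital of France."]},
    },
    {
        "inputs": {"query": "What is MLflow?"},
        "outputs": {"response": "MLflow is an open-source platform."},
        "expectations": {},  # no ground truth: should be flagged
    },
]

print(find_rows_missing_ground_truth(sample_dataset))  # [1]
```

Rows flagged this way can be fixed or dropped before they are passed to the Correctness scorer.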
Alternative: Using expected_response
You can also use expected_response instead of expected_facts:
eval_dataset_with_response = [
{
"inputs": {"query": "What is MLflow?"},
"outputs": {
"response": "MLflow is an open-source platform for managing the ML lifecycle."
},
"expectations": {
"expected_response": "MLflow is an open-source platform for managing the machine learning lifecycle, including experimentation, reproducibility, and deployment."
},
}
]
# Run evaluation with expected_response
eval_results = mlflow.genai.evaluate(
data=eval_dataset_with_response,
scorers=[Correctness()]
)
Tip
Using expected_facts is recommended over expected_response as it allows for more flexible evaluation - the response doesn't need to match word-for-word, just contain the key facts.
Customization
You can customize the judge by providing a different judge model:
import mlflow
from mlflow.genai.scorers import Correctness
# Use a different judge model
correctness_judge = Correctness(
model="databricks:/databricks-gpt-5-mini" # Or any LiteLLM-compatible model
)
# Use in evaluation (eval_dataset is the dataset defined in the earlier example)
eval_results = mlflow.genai.evaluate(
data=eval_dataset,
scorers=[correctness_judge]
)
Interpreting Results
The judge returns a Feedback object with:
- value: "yes" if the response is correct, "no" if it is incorrect
- rationale: Detailed explanation of which facts are supported or missing
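When the judge is called on many responses, the "yes"/"no" values can be rolled up into a pass rate. The sketch below works on plain strings that are assumed to have been collected from each Feedback object's value; `summarize_correctness` is a hypothetical helper, and treating anything other than "yes" or "no" as neither is an assumption made here, not documented behavior.

```python
from collections import Counter
from typing import Dict, List, Optional, Union


def summarize_correctness(values: List[Optional[str]]) -> Dict[str, Union[int, float]]:
    """Count "yes" and "no" judgments and compute the share of "yes"."""
    counts = Counter(values)
    total = len(values)
    # Values other than "yes"/"no" still count toward the total.
    pass_rate = counts["yes"] / total if total else 0.0
    return {"yes": counts["yes"], "no": counts["no"], "pass_rate": pass_rate}


# Illustrative values, not real judge output.
collected_values = ["yes", "no", "yes", "yes"]
print(summarize_correctness(collected_values))
# {'yes': 3, 'no': 1, 'pass_rate': 0.75}
```

For rows judged "no", the matching rationale is the place to look for which expected facts were missing or contradicted.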
Next Steps
- Explore other built-in judges - Learn about other built-in quality evaluation judges
- Create custom judges - Build domain-specific evaluation judges
- Run evaluations - Use judges in comprehensive application evaluation