MLflow provides two built-in LLM judges to assess relevance in your GenAI applications. These judges help diagnose quality issues: if the retrieved context isn't relevant, the generation step cannot produce a helpful response.
- RelevanceToQuery: Evaluates whether your app's response directly addresses the user's input
- RetrievalRelevance: Evaluates whether each document returned by your app's retriever(s) is relevant
By default, these judges use a Databricks-hosted LLM designed to perform GenAI quality assessments. You can change the judge model by using the model argument in the judge definition. The model must be specified in the format <provider>:/<model-name>, where <provider> is a LiteLLM-compatible model provider. If you use databricks as the model provider, the model name is the same as the serving endpoint name.
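For example, here is a minimal sketch of overriding the judge model (databricks-gpt-oss-120b is the serving endpoint reused in the examples below; the OpenAI model name is only illustrative):

from mlflow.genai.scorers import RelevanceToQuery

# Databricks provider: the model name matches the serving endpoint name.
judge = RelevanceToQuery(model="databricks:/databricks-gpt-oss-120b")

# Any LiteLLM-compatible provider uses the same <provider>:/<model-name> format,
# for example (illustrative model name):
# judge = RelevanceToQuery(model="openai:/gpt-4o-mini")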
Prerequisites for running the examples
Install MLflow and required packages
pip install --upgrade "mlflow[databricks]>=3.4.0" openai "databricks-connect>=16.1"

Create an MLflow experiment by following the Set up your environment quickstart.
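If you are running outside a Databricks notebook, a minimal setup sketch looks like this (the experiment path is a placeholder; point it at your own workspace):

import mlflow

# Send traces and evaluation runs to your Databricks workspace.
mlflow.set_tracking_uri("databricks")

# Log everything under a single experiment (placeholder path).
mlflow.set_experiment("/Shared/relevance-judges-demo")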
Usage with mlflow.genai.evaluate()
1. RelevanceToQuery Judge
This scorer evaluates if your app's response directly addresses the user's input without deviating into unrelated topics.
Requirements:
- Trace requirements: inputs and outputs must be on the Trace's root span
import mlflow
from mlflow.genai.scorers import RelevanceToQuery
eval_dataset = [
{
"inputs": {"query": "What is the capital of France?"},
"outputs": {
"response": "Paris is the capital of France. It's known for the Eiffel Tower and is a major European city."
},
},
{
"inputs": {"query": "What is the capital of France?"},
"outputs": {
"response": "France is a beautiful country with great wine and cuisine."
},
}
]
# Run evaluation with RelevanceToQuery scorer
eval_results = mlflow.genai.evaluate(
data=eval_dataset,
scorers=[
RelevanceToQuery(
model="databricks:/databricks-gpt-oss-120b", # Optional. Defaults to custom Databricks model.
)
],
)
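The per-question feedback is logged to the evaluation run and can be reviewed in the MLflow UI. Assuming the returned object exposes aggregate metrics and the run ID, as in recent MLflow releases, you can also inspect the results programmatically:

# Aggregate judge scores, keyed by metric name (attribute assumed from recent MLflow versions).
print(eval_results.metrics)

# Run ID of the evaluation run, useful for locating the detailed per-row results.
print(eval_results.run_id)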
2. RetrievalRelevance Judge
This scorer evaluates if each document returned by your app's retriever(s) is relevant to the input request.
Requirements:
- Trace requirements: The MLflow Trace must contain at least one span with span_type set to RETRIEVER
import mlflow
from mlflow.genai.scorers import RetrievalRelevance
from mlflow.entities import Document
from typing import List
# Define a retriever function with proper span type
@mlflow.trace(span_type="RETRIEVER")
def retrieve_docs(query: str) -> List[Document]:
# Simulated retrieval - in practice, this would query a vector database
if "capital" in query.lower() and "france" in query.lower():
return [
Document(
id="doc_1",
page_content="Paris is the capital of France.",
metadata={"source": "geography.txt"}
),
Document(
id="doc_2",
page_content="The Eiffel Tower is located in Paris.",
metadata={"source": "landmarks.txt"}
)
]
else:
return [
Document(
id="doc_3",
page_content="Python is a programming language.",
metadata={"source": "tech.txt"}
)
]
# Define your app that uses the retriever
@mlflow.trace
def rag_app(query: str):
docs = retrieve_docs(query)
# In practice, you would pass these docs to an LLM
return {"response": f"Found {len(docs)} relevant documents."}
# Create evaluation dataset
eval_dataset = [
{
"inputs": {"query": "What is the capital of France?"}
},
{
"inputs": {"query": "How do I use Python?"}
}
]
# Run evaluation with RetrievalRelevance scorer
eval_results = mlflow.genai.evaluate(
data=eval_dataset,
predict_fn=rag_app,
scorers=[
RetrievalRelevance(
model="databricks:/databricks-gpt-oss-120b", # Optional. Defaults to custom Databricks model.
)
]
)
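Before evaluating, you can invoke the app once to confirm that the retriever span is captured on the trace. A quick sketch, assuming the trace-lookup helpers available in the MLflow version installed above:

# Produce a trace containing a RETRIEVER span, just as evaluate() will per row.
rag_app("What is the capital of France?")

# Fetch the trace that was just recorded (assumes recent MLflow tracing APIs).
trace_id = mlflow.get_last_active_trace_id()
print(mlflow.get_trace(trace_id))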
Customization
You can customize these judges by providing different judge models:
import mlflow
from mlflow.genai.scorers import RelevanceToQuery, RetrievalRelevance
# Use different judge models
relevance_judge = RelevanceToQuery(
model="databricks:/databricks-gpt-5-mini" # Or any LiteLLM-compatible model
)
retrieval_judge = RetrievalRelevance(
model="databricks:/databricks-claude-opus-4-1"
)
# Use in evaluation
eval_results = mlflow.genai.evaluate(
data=eval_dataset,
predict_fn=rag_app,
scorers=[relevance_judge, retrieval_judge]
)
Interpreting Results
The judge returns a Feedback object with:
- value: "yes" if the context is relevant, "no" if not
- rationale: Explanation of why the context was deemed relevant or irrelevant
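To spot-check a judge outside of mlflow.genai.evaluate(), built-in scorers can also be invoked directly; the call below is a sketch based on the scorer interface in recent MLflow versions, not a guaranteed signature:

from mlflow.genai.scorers import RelevanceToQuery

# Direct invocation returns a Feedback object (keyword arguments assumed from the scorer API).
feedback = RelevanceToQuery()(
    inputs={"query": "What is the capital of France?"},
    outputs={"response": "Paris is the capital of France."},
)
print(feedback.value)      # "yes" or "no"
print(feedback.rationale)  # the judge's explanation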
Next Steps
- Explore other built-in judges - Learn about groundedness, safety, and correctness judges
- Create custom judges - Build specialized judges for your use case
- Evaluate RAG applications - Apply relevance judges in comprehensive RAG evaluation