Use built-in LLM judges

Overview

LLM judges enable you to evaluate and monitor your GenAI applications using MLflow Traces. These judges are a type of MLflow Scorer that leverages Large Language Models for nuanced quality assessment, complementing code-based scorers that handle deterministic metrics.

Important

When to use code-based scorers (a minimal sketch follows this note):

  • Deterministic metrics (latency, token usage)
  • Rule-based validations (format checks, pattern matching)
  • Business logic (price calculations, threshold checks)

When to use LLM judges:

  • Quality assessments (correctness, coherence, relevance)
  • Safety evaluations (toxicity, harmful content)
  • Complex evaluations requiring deep understanding of text, audio, images, or video content
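
For example, a code-based scorer can run a deterministic check directly on the application's output, with no LLM call. The sketch below is illustrative only: it assumes the scorer decorator from mlflow.genai.scorers and the outputs argument of the MLflow 3 custom scorer interface, and the length threshold is arbitrary.

from mlflow.genai.scorers import scorer


# Illustrative code-based scorer: a deterministic length check that needs no LLM.
# The `scorer` decorator and the `outputs` parameter name are assumptions based
# on the MLflow 3 custom scorer interface; the 1,500-character limit is arbitrary.
@scorer
def response_is_concise(outputs) -> bool:
    # Pass if the response stays under the (hypothetical) length budget
    return len(str(outputs)) <= 1500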

Built-in LLM Judges

MLflow provides built-in, research-backed LLM judges to assess traces across essential quality dimensions.

Important

Start with built-in judges for quick evaluation. As your needs evolve, you can customize the judge model, define your own guidelines, or create custom scorers.

How built-in judges work

When a Trace is passed to a built-in judge, either by evaluate() or by the monitoring service, the judge:

  1. Parses the trace to extract specific fields and data that are used to assess quality
  2. Calls an LLM to perform the quality assessment based on the extracted fields and data
  3. Returns the quality assessment as Feedback to attach to the trace

Prerequisites

  1. Run the following command to install MLflow 3 (3.4.0 or later, with the Databricks extra) and the OpenAI SDK.

    pip install --upgrade "mlflow[databricks]>=3.4.0" openai
    
  2. Follow the tracing quickstart to connect your development environment to an MLflow Experiment.

Step 1: Create a sample application to evaluate

Define a simple application with a fake retriever. The retriever is traced with span_type="RETRIEVER" so that the retrieval judges can locate the retrieved documents in the trace.

  1. Initialize an OpenAI client to connect to either Databricks-hosted LLMs or LLMs hosted by OpenAI.

    Databricks-hosted LLMs

    Use MLflow to get an OpenAI client that connects to Databricks-hosted LLMs. Select a model from the available foundation models.

    import mlflow
    from databricks.sdk import WorkspaceClient
    
    # Enable MLflow's autologging to instrument your application with Tracing
    mlflow.openai.autolog()
    
    # Set up MLflow tracking to Databricks
    mlflow.set_tracking_uri("databricks")
    mlflow.set_experiment("/Shared/docs-demo")
    
    # Create an OpenAI client that is connected to Databricks-hosted LLMs
    w = WorkspaceClient()
    client = w.serving_endpoints.get_open_ai_client()
    
    # Select an LLM
    model_name = "databricks-claude-sonnet-4"
    

    OpenAI-hosted LLMs

    Use the native OpenAI SDK to connect to OpenAI-hosted models. Select a model from the available OpenAI models.

    import mlflow
    import os
    import openai
    
    # Ensure your OPENAI_API_KEY is set in your environment
    # os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>" # Uncomment and set if not globally configured
    
    # Enable auto-tracing for OpenAI
    mlflow.openai.autolog()
    
    # Set up MLflow tracking to Databricks
    mlflow.set_tracking_uri("databricks")
    mlflow.set_experiment("/Shared/docs-demo")
    
    # Create an OpenAI client connected to OpenAI-hosted models
    client = openai.OpenAI()
    
    # Select an LLM
    model_name = "gpt-4o-mini"
    
  2. Define the application:

    from mlflow.entities import Document
    from typing import List
    
    # Retriever function called by the sample app
    @mlflow.trace(span_type="RETRIEVER")
    def retrieve_docs(query: str) -> List[Document]:
        return [
            Document(
                id="sql_doc_1",
                page_content="SELECT is a fundamental SQL command used to retrieve data from a database. You can specify columns and use a WHERE clause to filter results.",
                metadata={"doc_uri": "http://example.com/sql/select_statement"},
            ),
            Document(
                id="sql_doc_2",
                page_content="JOIN clauses in SQL are used to combine rows from two or more tables, based on a related column between them. Common types include INNER JOIN, LEFT JOIN, and RIGHT JOIN.",
                metadata={"doc_uri": "http://example.com/sql/join_clauses"},
            ),
            Document(
                id="sql_doc_3",
                page_content="Aggregate functions in SQL, such as COUNT(), SUM(), AVG(), MIN(), and MAX(), perform calculations on a set of values and return a single summary value.  The most common aggregate function in SQL is COUNT().",
                metadata={"doc_uri": "http://example.com/sql/aggregate_functions"},
            ),
        ]
    
    
    # Sample app to evaluate
    @mlflow.trace
    def sample_app(query: str):
        # 1. Retrieve documents based on the query
        retrieved_documents = retrieve_docs(query=query)
        retrieved_docs_text = "\n".join([doc.page_content for doc in retrieved_documents])
    
        # 2. Prepare messages for the LLM
        messages_for_llm = [
            {
                "role": "system",
                # Fake prompt to show how the various judges identify quality issues.
                "content": f"Answer the user's question based on the following retrieved context: {retrieved_docs_text}.  Do not mention the fact that provided context exists in your answer.  If the context is not relevant to the question, generate the best response you can.",
            },
            {
                "role": "user",
                "content": query,
            },
        ]
    
        # 3. Call LLM to generate the response
        return client.chat.completions.create(
            # Provide a valid model name for your LLM provider.
            model=model_name,
            messages=messages_for_llm,
        )

    result = sample_app("what is select in sql?")
    print(result)
    

Step 2: Create a sample evaluation dataset

Note

expected_facts is only required if you use built-in judges that require ground-truth.

eval_dataset = [
    {
        "inputs": {"query": "What is the most common aggregate function in SQL?"},
        "expectations": {
            "expected_facts": ["Most common aggregate function in SQL is COUNT()."],
        },
    },
    {
        "inputs": {"query": "How do I use MLflow?"},
        "expectations": {
            "expected_facts": [
                "MLflow is a tool for managing and tracking machine learning experiments."
            ],
        },
    },
]
print(eval_dataset)
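
To also exercise the ExpectationsGuidelines judge, you can attach per-example guidelines to a row's expectations. The following is a minimal sketch under the assumption that the judge reads a list of strings from a "guidelines" key; check the MLflow documentation for the exact expectation format.

# Hypothetical variant of the dataset above: the "guidelines" key is assumed to be
# where ExpectationsGuidelines looks for per-example criteria.
eval_dataset_with_guidelines = [
    {
        "inputs": {"query": "What is the most common aggregate function in SQL?"},
        "expectations": {
            "expected_facts": ["Most common aggregate function in SQL is COUNT()."],
            "guidelines": ["The response must mention COUNT()."],
        },
    },
]

Pass ExpectationsGuidelines() in the scorers list (as in Step 3) to evaluate these per-example criteria.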

Step 3: Run evaluation with built-in LLM judges

Now, run the evaluation using the built-in judges. The first call uses judges that require ground truth (expectations), and the second uses judges that do not.

from mlflow.genai.scorers import (
    Correctness,
    ExpectationsGuidelines,
    Guidelines,
    RelevanceToQuery,
    RetrievalGroundedness,
    RetrievalRelevance,
    RetrievalSufficiency,
    Safety,
)


# Run built-in judges that require ground truth
mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=sample_app,
    scorers=[
        Correctness(),
        # RelevanceToQuery(),
        # RetrievalGroundedness(),
        # RetrievalRelevance(),
        RetrievalSufficiency(),
        # Safety(),
    ],
)


# Run built-in judges that do NOT require ground truth
mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=sample_app,
    scorers=[
        # Correctness(),
        RelevanceToQuery(),
        RetrievalGroundedness(),
        RetrievalRelevance(),
        # RetrievalSufficiency(),
        Safety(),
        Guidelines(name="does_not_mention", guidelines="The response must not mention the fact that provided context exists."),
    ],
)
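
After the evaluation runs complete, each judge's assessment is attached to the corresponding trace as Feedback. You can also fetch the logged traces programmatically; a minimal sketch, noting that the exact DataFrame columns depend on your MLflow version:

import mlflow

# Fetch a few recently logged traces from the active experiment. Judge feedback
# is attached to each trace; inspect the returned columns to see what is available.
traces = mlflow.search_traces(max_results=5)
print(traces.columns)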

The evaluation traces and judge feedback are also displayed in the MLflow Experiment UI.
Available judges

By default, each judge uses a Databricks-hosted LLM designed to perform GenAI quality assessments. You can change the judge model by using the model argument in the judge definition. The model must be specified in the format <provider>:/<model-name>. For example:

from mlflow.genai.scorers import Correctness

Correctness(model="databricks:/databricks-gpt-5-mini")

For a list of supported models, see the MLflow documentation.

Judge                   What it evaluates                                             Requires ground truth?
RelevanceToQuery        Does the app's response directly address the user's input?   No
Safety                  Does the app's response avoid harmful or toxic content?      No
RetrievalGroundedness   Is the app's response grounded in retrieved information?     No
RetrievalRelevance      Are retrieved documents relevant to the user's request?      No
Correctness             Is the app's response correct compared to the ground truth?  Yes
RetrievalSufficiency    Do retrieved documents contain all necessary information?    Yes
Guidelines              Does the app's response meet specified criteria?             No
ExpectationsGuidelines  Does the app's response meet per-example criteria?           No

Information about the models powering LLM judges

  • LLM judges might use third-party services to evaluate your GenAI applications, including Azure OpenAI operated by Microsoft.
  • For Azure OpenAI, Databricks has opted out of Abuse Monitoring so no prompts or responses are stored with Azure OpenAI.
  • For European Union (EU) workspaces, LLM judges use models hosted in the EU. All other regions use models hosted in the US.
  • Disabling Partner-powered AI features prevents the LLM judge from calling partner-powered models. You can still use LLM judges by providing your own model.
  • LLM judges are intended to help customers evaluate their GenAI agents/applications, and LLM judge outputs should not be used to train, improve, or fine-tune an LLM.

Next steps

Continue your journey with these recommended actions and tutorials.

Reference guides

Explore detailed documentation for concepts and features mentioned in this guide.