LLM judges

Overview

LLM judges are a type of MLflow scorer that uses large language models for quality assessment. While code-based scorers apply programmatic logic, judges use the reasoning capabilities of LLMs to assess criteria such as helpfulness, relevance, and safety.

Think of a judge as an AI assistant specialized in quality assessment: it can evaluate your app's inputs and outputs, and even explore the entire execution trace, to make assessments based on criteria you define. For example, a judge can understand that "give me healthy food options" and "food to keep me fit" are similar queries.

Important

While judges can be called as standalone APIs, they must be wrapped in custom scorers before the Evaluation Harness or the production monitoring service can use them.
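
The following is a minimal sketch of that wrapping pattern. It assumes the standalone judge functions in mlflow.genai.judges and the @scorer decorator from mlflow.genai.scorers; the exact judge function name and its content parameter are assumptions here, so check the built-in judges reference for the real signatures.

from mlflow.genai import judges
from mlflow.genai.scorers import scorer


@scorer
def response_is_safe(outputs: str):
    # Call the standalone safety judge and return its feedback so the
    # Evaluation Harness can record the assessment on each trace.
    return judges.is_safe(content=outputs)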

When to use judges

Use judges when you need to evaluate natural-language inputs or outputs:

  • Semantic correctness: "Does this answer the question correctly?"
  • Style and tone: "Is this appropriate for our brand voice?"
  • Safety and compliance: "Does this follow our content guidelines?"
  • Relative quality: "Which response is more helpful?"

Use custom, code-based scorers instead (a short sketch follows this list) for:

  • Exact matching: Checking for specific keywords
  • Format validation: JSON structure, length limits
  • Performance metrics: Latency, token usage
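
For contrast, here is a minimal sketch of code-based scorers for the first two cases above. It assumes the @scorer decorator from mlflow.genai.scorers and that your app returns a plain string; the disclaimer phrase and JSON check are illustrative examples, not required conventions.

import json

from mlflow.genai.scorers import scorer


@scorer
def contains_disclaimer(outputs: str) -> bool:
    # Exact matching: pass only if the required phrase appears verbatim.
    return "not financial advice" in outputs.lower()


@scorer
def is_valid_json(outputs: str) -> bool:
    # Format validation: pass only if the output parses as JSON.
    try:
        json.loads(outputs)
        return True
    except json.JSONDecodeError:
        return False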

Built-in LLM judges

MLflow provides research-validated judges for common use cases:

from mlflow.genai.scorers import (
    Safety,                  # Content safety
    RelevanceToQuery,        # Query relevance
    RetrievalGroundedness,   # RAG grounding
    Correctness,             # Factual accuracy
    RetrievalSufficiency,    # Retrieval quality
    Guidelines,              # Custom pass/fail criteria
    ExpectationsGuidelines   # Example-specific pass/fail criteria
)

See built-in judges reference for detailed documentation.
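
As a minimal sketch of running built-in judges through the Evaluation Harness, assuming mlflow.genai.evaluate accepts data, predict_fn, and scorers arguments as shown; the sample data and app below are placeholders.

import mlflow
from mlflow.genai.scorers import RelevanceToQuery, Safety


def my_app(question: str) -> str:
    # Placeholder for your real GenAI app or agent.
    return "Try grilled vegetables, lean protein, and whole grains."


eval_data = [{"inputs": {"question": "Give me healthy food options"}}]

results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=my_app,
    scorers=[Safety(), RelevanceToQuery()],
)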

Custom LLM judges

In addition to the built-in judges, MLflow makes it easy to create your own judges with custom prompts and instructions.

Custom LLM judges are useful when you need to define specialized evaluation tasks, want more control over grades or scores (not just pass/fail), or need to validate that your agent made appropriate decisions and performed operations correctly for your specific use case.

Learn more about building judges with custom prompts
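
As a minimal sketch of custom pass/fail criteria, assuming the Guidelines scorer accepts name and guidelines arguments (the guideline text below is an illustrative example), you can encode your own rules in plain language:

from mlflow.genai.scorers import Guidelines

brand_voice = Guidelines(
    name="brand_voice",
    guidelines=(
        "The response must be friendly and professional, avoid internal "
        "jargon, and never make medical claims."
    ),
)

The resulting scorer can then be passed to mlflow.genai.evaluate alongside the built-in judges shown earlier.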

Judge accuracy

Databricks continuously improves judge quality through:

  • Research validation against human expert judgment
  • Metrics tracking: Cohen's Kappa, accuracy, F1 score
  • Diverse testing on academic and real-world datasets

See Databricks blog on LLM judge improvements for details.

Information about the models powering LLM judges

  • LLM judges might use third-party services to evaluate your GenAI applications, including Azure OpenAI operated by Microsoft.
  • For Azure OpenAI, Databricks has opted out of Abuse Monitoring, so no prompts or responses are stored with Azure OpenAI.
  • For European Union (EU) workspaces, LLM judges use models hosted in the EU. All other regions use models hosted in the US.
  • Disabling Partner-powered AI features prevents the LLM judge from calling partner-powered models. You can still use LLM judges by providing your own model.
  • LLM judges are intended to help customers evaluate their GenAI agents/applications, and LLM judge outputs should not be used to train, improve, or fine-tune an LLM.

Next steps

How-to guides

Concepts