Important
Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
In this article, you learn how to run evaluations in the cloud (preview) on a test dataset as part of pre-deployment testing. The Azure AI Evaluation SDK supports running evaluations locally on your machine and in the cloud. For example, you can run local evaluations on small test data to assess your generative AI application prototypes, then move into pre-deployment testing and run evaluations on a large dataset.
Evaluating your applications in the cloud frees you from managing your local compute infrastructure. You can also integrate evaluations as tests into your continuous integration and continuous delivery (CI/CD) pipelines. After deployment, you can continue to evaluate your applications as part of post-deployment monitoring.
When you use the Azure AI Projects SDK, it logs evaluation results in your Azure AI project for better observability. This feature supports all Microsoft-curated built-in evaluators and your own custom evaluators, which can be located in the Evaluator library and are governed by the same project-scope role-based access control.
Prerequisites
- Azure AI Foundry project in the same supported regions as risk and safety evaluators (preview). If you don't have a project, create one. See Create a project for Azure AI Foundry.
- Azure OpenAI deployment with a GPT model that supports chat completion, for example gpt-4.
- Make sure you're logged in to your Azure subscription by running az login.
If this is your first time running evaluations and logging results to your Azure AI Foundry project, you might need to complete a few additional steps:
- Create and connect your storage account to your Azure AI Foundry project at the resource level. There are two ways to do this: use a Bicep template, which provisions and connects a storage account to your Foundry project with key authentication, or manually create the storage account and provision access to it in the Azure portal.
- Make sure the connected storage account has access to all projects.
- If you connected your storage account with Microsoft Entra ID, make sure to grant Storage Blob Data Owner permissions to both your account and the Foundry project resource's managed identity in the Azure portal.
Note
Virtual network configurations are currently not supported for cloud-based evaluations. Enable public network access for your Azure OpenAI resource.
Get started
Install the Azure AI Foundry SDK project client that runs the evaluations in the cloud:
pip install azure-ai-projects azure-identity
Note
For more information, see REST API Reference Documentation.
Set your environment variables for your Azure AI Foundry resources:
import os

# Required environment variables:
endpoint = os.environ["PROJECT_ENDPOINT"]  # https://<account>.services.ai.azure.com/api/projects/<project>
model_endpoint = os.environ["MODEL_ENDPOINT"]  # https://<account>.services.ai.azure.com
model_api_key = os.environ["MODEL_API_KEY"]
model_deployment_name = os.environ["MODEL_DEPLOYMENT_NAME"]  # E.g. gpt-4o-mini

# Optional: Reuse an existing dataset.
dataset_name = os.environ.get("DATASET_NAME", "dataset-test")
dataset_version = os.environ.get("DATASET_VERSION", "1.0")

Define a client that runs your evaluations in the cloud:
import os
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient

# Create the project client (Foundry project and credentials):
project_client = AIProjectClient(
    endpoint=endpoint,
    credential=DefaultAzureCredential(),
)
Upload evaluation data
# Upload a local JSONL file. Skip this step if you already have a dataset registered.
data_id = project_client.datasets.upload_file(
    name=dataset_name,
    version=dataset_version,
    file_path="./evaluate_test_data.jsonl",
).id
To learn more about input data formats for evaluating generative AI applications, see the input data requirements for the built-in evaluators. To learn more about input data formats for evaluating agents, see Evaluate Azure AI agents and Evaluate other agents.
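If you don't already have a test file, the following minimal sketch writes a small evaluate_test_data.jsonl. The rows are illustrative, and the field names are assumptions that must match the data_mapping you configure for your evaluators in the next step.
import json

# Illustrative rows only; use field names that match your evaluators' data_mapping.
rows = [
    {"query": "What is the capital of France?", "response": "Paris is the capital of France."},
    {"query": "How many moons does Earth have?", "response": "Earth has one moon."},
]

with open("evaluate_test_data.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")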
Specify evaluators
from azure.ai.projects.models import (
    EvaluatorConfiguration,
    EvaluatorIds,
)

# Built-in evaluator configurations:
evaluators = {
    "relevance": EvaluatorConfiguration(
        id=EvaluatorIds.RELEVANCE.value,
        init_params={"deployment_name": model_deployment_name},
        data_mapping={
            "query": "${data.query}",
            "response": "${data.response}",
        },
    ),
    "violence": EvaluatorConfiguration(
        id=EvaluatorIds.VIOLENCE.value,
        init_params={"azure_ai_project": endpoint},
    ),
    "bleu_score": EvaluatorConfiguration(
        id=EvaluatorIds.BLEU_SCORE.value,
    ),
}
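The BLEU evaluator compares a response against a ground truth. If your dataset column names don't match the evaluator's expected inputs, you can add a data_mapping as shown in the following sketch; the mapping keys are assumptions based on the built-in BLEU evaluator's inputs, so adjust them to your data.
# Optional, hypothetical mapping for the BLEU evaluator when your column names differ.
evaluators["bleu_score"] = EvaluatorConfiguration(
    id=EvaluatorIds.BLEU_SCORE.value,
    data_mapping={
        "response": "${data.response}",
        "ground_truth": "${data.ground_truth}",
    },
)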
Submit an evaluation in the cloud
Finally, submit the remote evaluation run:
from azure.ai.projects.models import (
    Evaluation,
    InputDataset,
)

# Create an evaluation with the dataset and evaluators specified.
evaluation = Evaluation(
    display_name="Cloud evaluation",
    description="Evaluation of dataset",
    data=InputDataset(id=data_id),
    evaluators=evaluators,
)

# Run the evaluation.
evaluation_response = project_client.evaluations.create(
    evaluation,
    headers={
        "model-endpoint": model_endpoint,
        "api-key": model_api_key,
    },
)
print("Created evaluation:", evaluation_response.name)
print("Status:", evaluation_response.status)
Specify custom evaluators
Note
Azure AI Foundry projects aren't supported for this feature. Use an Azure AI Foundry hub project instead.
Code-based custom evaluators
Register your custom evaluators in your Azure AI Foundry hub project and fetch the evaluator IDs:
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Model
from promptflow.client import PFClient

# Define ml_client to register the custom evaluator.
ml_client = MLClient(
    subscription_id=os.environ["AZURE_SUBSCRIPTION_ID"],
    resource_group_name=os.environ["AZURE_RESOURCE_GROUP"],
    workspace_name=os.environ["AZURE_PROJECT_NAME"],
    credential=DefaultAzureCredential(),
)

# Load the evaluator from the module.
from answer_len.answer_length import AnswerLengthEvaluator

# Convert it to an evaluation flow, and save it locally.
pf_client = PFClient()
local_path = "answer_len_local"
pf_client.flows.save(entry=AnswerLengthEvaluator, path=local_path)

# Specify the evaluator name that appears in the Evaluator library.
evaluator_name = "AnswerLenEvaluator"

# Register the evaluator to the Evaluator library.
custom_evaluator = Model(
    path=local_path,
    name=evaluator_name,
    description="Evaluator calculating answer length.",
)
registered_evaluator = ml_client.evaluators.create_or_update(custom_evaluator)
print("Registered evaluator id:", registered_evaluator.id)

# Registered evaluators have versioning. You can always reference any available version.
versioned_evaluator = ml_client.evaluators.get(evaluator_name, version=1)
print("Versioned evaluator id:", versioned_evaluator.id)
After you register your custom evaluator, you can view it in your Evaluator library. In your Azure AI Foundry project, select Evaluation, then select Evaluator library.
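For reference, the answer_len.answer_length module imported earlier is expected to define a callable evaluator class. The following is a minimal sketch of such a class; the parameter name and output key are illustrative assumptions, so align them with your input data.
# answer_len/answer_length.py (illustrative sketch)
class AnswerLengthEvaluator:
    """Code-based evaluator that returns the character length of a response."""

    def __init__(self):
        pass

    def __call__(self, *, response: str, **kwargs):
        # The returned keys become metric names in the evaluation results.
        return {"answer_length": len(response)}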
Prompt-based custom evaluators
Follow this example to register a custom FriendlinessEvaluator built as described in Prompt-based evaluators:
# Import your prompt-based custom evaluator.
from friendliness.friend import FriendlinessEvaluator

# Define your model deployment configuration.
model_config = dict(
    azure_endpoint=os.environ.get("AZURE_ENDPOINT"),
    azure_deployment=os.environ.get("AZURE_DEPLOYMENT_NAME"),
    api_version=os.environ.get("AZURE_API_VERSION"),
    api_key=os.environ.get("AZURE_API_KEY"),
    type="azure_openai",
)

# Define ml_client to register the custom evaluator.
ml_client = MLClient(
    subscription_id=os.environ["AZURE_SUBSCRIPTION_ID"],
    resource_group_name=os.environ["AZURE_RESOURCE_GROUP"],
    workspace_name=os.environ["AZURE_PROJECT_NAME"],
    credential=DefaultAzureCredential(),
)

# Convert the evaluator to an evaluation flow and save it locally.
local_path = "friendliness_local"
pf_client = PFClient()
pf_client.flows.save(entry=FriendlinessEvaluator, path=local_path)

# Specify the evaluator name that appears in the Evaluator library.
evaluator_name = "FriendlinessEvaluator"

# Register the evaluator to the Evaluator library.
custom_evaluator = Model(
    path=local_path,
    name=evaluator_name,
    description="Prompt-based evaluator measuring response friendliness.",
)
registered_evaluator = ml_client.evaluators.create_or_update(custom_evaluator)
print("Registered evaluator id:", registered_evaluator.id)

# Registered evaluators have versioning. You can always reference any available version.
versioned_evaluator = ml_client.evaluators.get(evaluator_name, version=1)
print("Versioned evaluator id:", versioned_evaluator.id)
After you register your custom evaluator, you can view it in your Evaluator library. In your Azure AI Foundry project, select Evaluation, then select Evaluator library.
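After registration, you can reference a custom evaluator by its ID in the evaluators dictionary that you pass to a cloud evaluation run. The following sketch shows the idea; the init_params and data_mapping keys are assumptions that depend on your evaluator's signature.
# Hypothetical example: add the registered custom evaluator to the cloud evaluation.
evaluators["friendliness"] = EvaluatorConfiguration(
    id=registered_evaluator.id,
    init_params={"model_config": model_config},
    data_mapping={"response": "${data.response}"},
)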
Troubleshooting: job stuck in the Running state
When you use an Azure AI Foundry project or hub, your evaluation job might remain in the Running state for an extended period. A common cause is that the Azure OpenAI model you selected doesn't have enough capacity.
Resolution
- Cancel the current evaluation job.
- Increase the model capacity to handle larger input data.
- Run the evaluation again.