Create test cases to evaluate your agent (preview)

[This article is prerelease documentation and is subject to change.]

In Copilot Studio, you can create a test set of test cases to evaluate the performance of your agents. Test cases let you simulate real-world scenarios for your agent, so you can measure the accuracy, relevancy, and quality of answers to the questions the agent is asked, based on the information the agent can access. Using the results from the test set, you can optimize your agent's behavior and validate that your agent meets your business and quality requirements.

Important

This article contains Microsoft Copilot Studio preview documentation and is subject to change.

Preview features aren't meant for production use and may have restricted functionality. These features are available before an official release so that you can get early access and provide feedback.

If you're building a production-ready agent, see Microsoft Copilot Studio Overview.

Test methods

When creating test sets, you can choose from three kinds of test methods to evaluate your agent's responses: text match, similarity, and quality. Each test method has its own strengths and is suited for different types of evaluations.

Text match test methods

Text match test methods compare the agent's responses to expected responses that you define in the test set. There are two text match test methods:

Exact match checks whether the agent's answer exactly matches the expected response in the test: character for character, word for word. If it's the same, it passes. If anything differs, it fails. Exact match is useful for short, precise answers such as numbers, codes, or fixed phrases. It doesn't suit answers that people can phrase in multiple correct ways.

Partial match checks whether the agent's answer contains some of the words or phrases from the expected response that you define in the test. If it does, it passes. If it doesn't, it fails. Partial match is useful when an answer can be phrased in different correct ways, but key terms or ideas still need to be included in the response.
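
To illustrate the difference, here's a rough sketch of the two checks in Python. It isn't Copilot Studio's implementation; details such as case sensitivity and whitespace handling are assumptions.

```python
def exact_match(agent_answer: str, expected: str) -> bool:
    # Pass only if the answer is identical to the expected response,
    # character for character. (Whether Copilot Studio trims whitespace
    # or ignores case isn't documented here; this sketch trims only.)
    return agent_answer.strip() == expected.strip()


def partial_match(agent_answer: str, keywords: list[str]) -> bool:
    # Pass if the answer contains any of the expected keywords or phrases.
    # (Copilot Studio also lets you combine keywords with and/or operators,
    # described later in this article.)
    answer = agent_answer.lower()
    return any(keyword.lower() in answer for keyword in keywords)


print(exact_match("30 days", "30 days"))  # True
print(partial_match("Returns are accepted within 30 days.", ["30 days", "refund"]))  # True
```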

Similarity test methods

The similarity test method compares the similarity of the agent's responses to the expected responses you define in your test set. It's useful when an answer can be phrased in different correct ways, but the overall meaning or intent still needs to come through.

It uses a cosine similarity metric to assess how similar the agent's answer is to the wording and meaning of the expected response and determines a score. The score ranges between 0 and 1, where 1 indicates the answer closely matches and 0 indicates it doesn't. You can set a passing score threshold to determine what constitutes a passing score for an answer.
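
For context, cosine similarity between two vectors is their dot product divided by the product of their lengths. The following generic Python sketch shows the idea; the embedding step that turns text into vectors, and the example numbers, are assumptions rather than documented Copilot Studio behavior.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # cos(theta) = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical embedding vectors for the agent's answer and the expected response.
answer_vec = [0.12, 0.85, 0.51]
expected_vec = [0.10, 0.80, 0.58]

score = cosine_similarity(answer_vec, expected_vec)
passing_threshold = 0.8  # the threshold you set on the test case
print(f"score={score:.2f}, pass={score >= passing_threshold}")
```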

Quality test methods

Quality test methods help you decide whether your agent's responses meet your standards. This approach ensures the results are both reliable and easy to explain.

These methods use a large language model (LLM) to assess how effectively an agent answers user questions. They're especially helpful when there's no exact answer expected, offering a flexible and scalable way to evaluate responses based on the retrieved documents and the conversation flow.

There are two quality test methods:

General quality evaluates agent responses. It uses the following key criteria and applies a consistent prompt to guide scoring:

  • Relevance: To what extent the agent's response addresses the question. For example, does the agent's response stay on the subject and directly answer the question?

  • Groundedness: To what extent the agent's response is based on the provided context. For example, does the agent's response reference or rely on the information given in the context, rather than introducing unrelated or unsupported information?

  • Completeness: To what extent the agent's response provides all necessary information. For example, does the agent's response cover all aspects of the question and provide sufficient detail?

  • Abstention: Whether the agent attempted to answer the question.

To be considered high quality, a response must meet all key criteria. If one isn't met, the response is flagged for improvement. This scoring method ensures that only responses that are both complete and well-supported receive top marks. In contrast, answers that are incomplete or lack supporting evidence receive lower scores.
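
A minimal sketch of that scoring rule, assuming the LLM judge returns a verdict per criterion (the data shape here is illustrative, not the product's actual output):

```python
# Hypothetical per-criterion verdicts from the LLM judge for one response.
verdicts = {
    "relevance": True,
    "groundedness": True,
    "completeness": False,  # e.g., the answer skipped part of the question
    "abstention": True,     # the agent did attempt to answer
}

# High quality only if every key criterion is met; otherwise flag it.
result = "Pass" if all(verdicts.values()) else "Flagged for improvement"
print(result)  # Flagged for improvement
```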

Compare meaning evaluates how well the agent's answer reflects the intended meaning of the expected response. Instead of focusing on exact wording, it uses semantic similarity—meaning it compares the ideas and meaning behind the words—to judge how closely the response aligns with what was expected.

You can set a passing score threshold to determine what constitutes a passing score for an answer. The compare meaning test method is useful when an answer can be phrased in different correct ways, but the overall meaning or intent still needs to come through.

Thresholds and pass rates

The success of a test case depends on the test method you select and the threshold you set for passing scores.

Each test method, except exact match, produces a numeric score that reflects how well the agent's answer meets the method's evaluation criteria. The threshold is the cut-off score that separates pass from fail. You can set the passing score for similarity and compare meaning test cases.

Exact match is a strict test method that doesn't produce a numeric score; the answer must match exactly to pass. By choosing the threshold for a test case, you decide how strict or lenient the evaluation is. Each test method evaluates the agent's answer differently, so it's important to choose the one that best fits your evaluation criteria.
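
Put together, the pass/fail decision can be sketched like this, under the assumption that each scored method simply compares its score against the threshold you choose:

```python
def is_pass(method: str, score: float, threshold: float) -> bool:
    if method == "Exact match":
        # No numeric score in practice: the answer either matches or it doesn't.
        # A perfect match is represented as 1.0 here purely for illustration.
        return score == 1.0
    # Similarity and compare meaning pass at or above your threshold.
    return score >= threshold

print(is_pass("Similarity", 0.83, 0.8))       # True
print(is_pass("Compare meaning", 0.65, 0.7))  # False
```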

Create a test set

You can start creating a test set in different ways: import a file, use the questions from your test chat, have Copilot Studio generate test cases for you, or manually create test cases in Copilot Studio. Each question in the test set is a test case.

To create a test set:

  1. Open the agent that you want to evaluate.

  2. On the top menu bar, go to Analytics.

  3. If you haven't published your agent, select Start evaluation. If you have published your agent, go to the Evaluations section and select Start evaluation.

    Screenshot showing the Create new test button on the Analytics page.

  4. On the New test set page, choose the method you want to use to create your test set:

    • Select Generate 10 questions to have Copilot Studio create test cases automatically based on what your agent can do.
    • Select Use your test chat conversation to automatically populate the test set with the questions you provided in your test chat.
    • Select Manually add to enter your test cases yourself.
    • Import test cases from a file by dragging your file into the designated area or by selecting Browse to upload a file.
  5. Review and edit the test cases to create effective tests.

    You can:

    • Change the test case by selecting the case and editing it in the right pane.
    • Remove the test case by selecting the delete icon beside the case.
    • Automatically generate more questions by selecting an option in the dropdown menu.
    • Add a test case by selecting Add a case manually.

    When you're finished with your changes, select Apply.

  6. Under Name, enter a name for your test set.

  7. Under Test account, select the account that you want to use for this test set.

  8. Select Save to update the test set without running the test cases, or select Evaluate to run the test cases.

Create a test case file to import

Instead of building your test cases directly in Copilot Studio, you can create a spreadsheet file with all your test cases and import them to create your test set. You can compose each test question, determine the test method you want to use, and state the expected responses for each question. When you finish creating the file, save it as a .csv or .txt file and import it into Copilot Studio.

Important

  • The file can contain up to 100 questions.
  • Each question can be up to 1,000 characters, including spaces.
  • The file must be in comma-separated values (CSV) or text format.

To create the import file:

  1. Open a spreadsheet application (for example, Microsoft Excel).

  2. Add the following headings, in this order, in the first row:

    • Question
    • Expected response
    • Testing method
  3. Enter your test questions in the Question column. Each question can be 1,000 characters or less, including spaces.

  4. Enter one of the following test methods for each question in the Testing method column:

    • General quality
    • Compare meaning
    • Similarity
    • Exact match
    • Partial match
  5. Enter the expected responses for each question in the Expected response column. Expected responses are optional for importing a test set. However, you need expected responses to run match, similarity, and compare meaning test cases.

  6. Save the file as a .csv or .txt file.

  7. Import the file to create a test set. (A sample file is shown below.)
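
For example, a minimal import file could look like the following. The questions and expected responses are purely illustrative; the column headings match step 2 above.

```csv
Question,Expected response,Testing method
What is the standard warranty period?,12 months,Exact match
How do I reset my password?,Use the Forgot password link on the sign-in page,Compare meaning
What are your support hours?,Support is available 9 AM to 5 PM on weekdays,Similarity
Summarize the return policy for online orders.,,General quality
```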

Edit a test case

After creating a test set, you can edit the test cases by changing the wording of questions, choosing different test methods, or modifying the expected responses as needed. You can select multiple test cases to edit them in bulk by selecting the checkboxes beside each test case.

You have a choice of three test methods, also referred to as graders, to evaluate agent responses: quality, similarity, and text match. For more information about the different test methods, see Test methods.

To edit a test case:

  1. In the test set, select the test case you want to edit.

    Screenshot showing the list of test cases.

  2. In the right pane, change the wording of a question by editing the text in the Question field.

    Screenshot showing the question text field.

  3. Select the test method that you want to use.

    Screenshot showing the test method selection.

    • Quality:

      • Select General quality to evaluate the answer based on relevance, groundedness, and completeness.
      • Select Compare meaning to evaluate the answer based on how well it captures the meaning of the expected response. Under Passing score, you can set the threshold for what constitutes a passing score for an answer. In the Expected response box, provide the response against which the test method evaluates the agent's answer.
    • Similarity: uses a cosine similarity metric to assess how similar the agent's answer is to the wording and meaning of the expected response. It determines a score between 0 and 1, where 1 means it matches closely and 0 means it doesn't match at all. Under Passing score, you can set the threshold for what constitutes a passing score for an answer. In the Expected response box, provide the response against which the test method evaluates the agent's answer.

    • Text match:

      • Select Exact match to evaluate the agent's answer against the expected response, where a passing score means the agent's answer exactly matched the defined expected response. In the Expected response box, provide the response against which the test method evaluates the agent's answer.
      • Select Partial match to evaluate the agent's answer against the expected response, where a passing score means the agent's answer contained some of the words or phrases from the defined expected response. In the Expected response box, provide a phrase or keyword against which the test method evaluates the agent's answer. To add multiple keywords or phrases, select Add, choose and or or as the operator between the boxes, and provide the keyword or phrase. (A sketch of this and/or logic follows these steps.)
  4. Select Apply.

  5. When you're finished with your changes, select Save to save your test set or Evaluate to run the evaluation on the test set.
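
For reference, combining partial match keywords with and or or behaves like ordinary boolean logic. A rough sketch, with illustrative phrases:

```python
answer = "You can return items within 30 days for a full refund."
answer_lower = answer.lower()

# "30 days" and "refund": both phrases must appear for the case to pass.
passes_and = all(phrase in answer_lower for phrase in ["30 days", "refund"])

# "refund" or "store credit": at least one phrase must appear.
passes_or = any(phrase in answer_lower for phrase in ["refund", "store credit"])

print(passes_and, passes_or)  # True True
```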

Run a test set

After you create a test set, you can run or rerun it.

  1. On your agent's Analytics page, go to Evaluations.

  2. Run a test set by doing one of the following actions:

    • Find the test set in the Test sets list, select the More icon, then select Evaluate test set.

    • Hover over a test result that uses the test set you want, select the More icon, then select Evaluate test set again.

    Screenshot showing the more menu icons that appear when you hover over test sets or evaluation results.

Delete a test set

Select the More icon for a test set, then select the Trash icon.

Dive into detailed test results

Each time you run an evaluation with a test set, Copilot Studio:

  1. Uses the connected user account to simulate conversations with the agent, sending the question from each test case to the agent.

  2. Collects the agent's responses.

  3. Measures the success of each response. Each test case receives a Pass or Fail, based on the criteria of the test case.

  4. Assigns a Pass rate to the test set run based on the proportion of test cases that passed.

You can see the Pass rate of each test set run on your agent's Analytics page, under Evaluations > Recent results. To see more test set runs, select See all.
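
As a simple illustration, the Pass rate is the share of test cases in the run that passed (the percentage formatting here is an assumption):

```python
outcomes = ["Pass", "Pass", "Fail", "Pass"]  # hypothetical results for four test cases
pass_rate = outcomes.count("Pass") / len(outcomes)
print(f"Pass rate: {pass_rate:.0%}")  # Pass rate: 75%
```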

Screenshot showing a list of previous evaluations.

Select an evaluation to see a detailed breakdown of the test results for each response within a test set run.


The test case results show a list of the queries used in the test, how the agent responded, and the Pass or Fail score. Select a query in the list to see a detailed assessment of each response.

Screenshot showing a list of test cases within a completed evaluation.

The assessment includes the expected and actual responses, the reasoning behind the test result, and the knowledge and topics the agent used in creating the response.

Screenshot showing the detailed result and evaluation of a test case.