Get cached responses of Azure OpenAI API requests

APPLIES TO: All API Management tiers

Use the azure-openai-semantic-cache-lookup policy to perform cache lookup of responses to Azure OpenAI Chat Completion API requests from a configured external cache, based on vector proximity of the prompt to previous requests and a specified similarity score threshold. Response caching reduces bandwidth and processing requirements imposed on the backend Azure OpenAI API and lowers latency perceived by API consumers.

Note

Set the policy's elements and child elements in the order provided in the policy statement. Learn more about how to set or edit API Management policies.

Supported Azure OpenAI in Azure AI Foundry models

The policy is used with APIs added to API Management from Azure OpenAI in Azure AI Foundry, for the following model types:

API type             Supported models
Chat completion      gpt-3.5, gpt-4, gpt-4o, gpt-4o-mini, o1, o3
Embeddings           text-embedding-3-large, text-embedding-3-small, text-embedding-ada-002
Responses (preview)  gpt-4o (versions 2024-11-20, 2024-08-06, 2024-05-13), gpt-4o-mini (version 2024-07-18), gpt-4.1 (version 2025-04-14), gpt-4.1-nano (version 2025-04-14), gpt-4.1-mini (version 2025-04-14), gpt-image-1 (version 2025-04-15), o3 (version 2025-04-16), o4-mini (version 2025-04-16)

Note

Traditional completion APIs are only available with legacy model versions and support is limited.

For current information about the models and their capabilities, see Azure OpenAI in Foundry Models.

Policy statement

<azure-openai-semantic-cache-lookup
    score-threshold="score threshold to return cached response"
    embeddings-backend-id="backend entity ID for embeddings API"
    embeddings-backend-auth="system-assigned"
    ignore-system-messages="true | false"
    max-message-count="count">
    <vary-by>"expression to partition caching"</vary-by>
</azure-openai-semantic-cache-lookup>

Attributes

  • score-threshold — Defines how closely an incoming prompt must match a cached prompt to return its stored response. The value ranges from 0.0 to 1.0; lower values require higher semantic similarity for a match. Learn more. Required: yes. Default: N/A.
  • embeddings-backend-id — Backend entity ID for the embeddings API call. Required: yes. Default: N/A.
  • embeddings-backend-auth — Authentication used for the embeddings API backend. Required: yes (must be set to system-assigned). Default: N/A.
  • ignore-system-messages — Boolean. When set to true (recommended), removes system messages from a chat completion prompt before assessing cache similarity. Required: no. Default: false.
  • max-message-count — If specified, the number of remaining dialog messages after which caching is skipped. Required: no. Default: N/A.

Elements

  • vary-by — A custom expression determined at runtime whose value partitions caching. If multiple vary-by elements are added, their values are concatenated to create a unique combination. Required: no.
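
To illustrate how the attributes and the vary-by element fit together, here's a minimal sketch of a lookup policy that ignores system messages, skips caching for long dialogs, and partitions the cache by both subscription and API. The threshold, message count, and backend ID (azure-openai-backend) are placeholder values, and the second vary-by expression is just one possible partition key.

<!-- Sketch only: adjust score-threshold and max-message-count for your workload -->
<azure-openai-semantic-cache-lookup
    score-threshold="0.05"
    embeddings-backend-id="azure-openai-backend"
    embeddings-backend-auth="system-assigned"
    ignore-system-messages="true"
    max-message-count="10">
    <!-- Multiple vary-by values are concatenated into one partition key -->
    <vary-by>@(context.Subscription.Id)</vary-by>
    <vary-by>@(context.Api.Id)</vary-by>
</azure-openai-semantic-cache-lookup>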

Usage notes

  • This policy can only be used once in a policy section.
  • Fine-tune the value of score-threshold based on your application to ensure that the right sensitivity is used to determine when to return cached responses for queries. Start with a low value such as 0.05 and adjust to optimize the ratio of cache hits to misses.
  • A score threshold above 0.2 may lead to cache mismatches. Consider using a lower value for sensitive use cases.
  • Control cross-user access to cache entries by specifying vary-by with specific user or user-group identifiers, as shown in the sketch after this list.
  • The embeddings model should have enough capacity and a sufficiently large context size to accommodate the prompt volume and prompt lengths.
  • Consider adding the llm-content-safety policy with prompt shielding to protect against prompt attacks.
  • We recommend configuring a rate-limit policy (or rate-limit-by-key policy) immediately after any cache lookup. This helps keep your backend service from getting overloaded if the cache isn't available.
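
As a sketch of the cross-user access note above, the following fragment adds a per-caller vary-by key alongside the subscription ID. The x-user-id header is a hypothetical client-supplied identifier; substitute whatever identity claim or header your deployment actually provides.

<azure-openai-semantic-cache-lookup
    score-threshold="0.05"
    embeddings-backend-id="azure-openai-backend"
    embeddings-backend-auth="system-assigned">
    <!-- Partition cached responses per subscription -->
    <vary-by>@(context.Subscription.Id)</vary-by>
    <!-- Hypothetical per-user key read from a request header -->
    <vary-by>@(context.Request.Headers.GetValueOrDefault("x-user-id", "anonymous"))</vary-by>
</azure-openai-semantic-cache-lookup>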

Examples

Example with corresponding azure-openai-semantic-cache-store policy

The following example shows how to use the azure-openai-semantic-cache-lookup policy along with the azure-openai-semantic-cache-store policy to retrieve semantically similar cached responses with a similarity score threshold of 0.05. Cached values are partitioned by the subscription ID of the caller.

Note

The rate-limit policy added after the cache lookup helps limit the number of calls to prevent overload on the backend service in case the cache isn't available.

<policies>
    <inbound>
        <base />
        <azure-openai-semantic-cache-lookup
            score-threshold="0.05"
            embeddings-backend-id="azure-openai-backend"
            embeddings-backend-auth="system-assigned">
            <vary-by>@(context.Subscription.Id)</vary-by>
        </azure-openai-semantic-cache-lookup>
        <rate-limit calls="10" renewal-period="60" />
    </inbound>
    <outbound>
        <azure-openai-semantic-cache-store duration="60" />
        <base />
    </outbound>
</policies>
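
With this configuration, an incoming prompt whose similarity to a cached prompt falls within the 0.05 threshold (for the same subscription ID) is answered from the cache without calling the backend. Otherwise the request proceeds to Azure OpenAI, and the azure-openai-semantic-cache-store policy in the outbound section caches the new response for 60 seconds.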

For more information about working with policies, see: