
Azure OpenAI reasoning models

Azure OpenAI reasoning models are designed to tackle reasoning and problem-solving tasks with increased focus and capability. These models spend more time processing and understanding the user's request, making them exceptionally strong in areas like science, coding, and math compared to previous iterations.

Key capabilities of the reasoning models:

  • Complex code generation: capable of generating algorithms and handling advanced coding tasks to support developers.
  • Advanced problem solving: ideal for comprehensive brainstorming sessions and for tackling multifaceted challenges.
  • Complex document comparison: well suited for analyzing contracts, case files, or legal documents to identify subtle differences.
  • Instruction following and workflow management: particularly effective for managing workflows that require shorter contexts.

Availability

Region availability

Model | Region | Limited access
gpt-5-pro | East US 2 and Sweden Central (Global Standard) | Request access: limited access model application. If you already have limited access model access, no request is needed.
gpt-5-codex | East US 2 and Sweden Central (Global Standard) | Request access: limited access model application. If you already have limited access model access, no request is needed.
gpt-5 | Model availability | Request access: limited access model application. If you already have limited access model access, no request is needed.
gpt-5-mini | Model availability | No access request needed.
gpt-5-nano | Model availability | No access request needed.
o3-pro | East US 2 and Sweden Central (Global Standard) | Request access: limited access model application. If you already have limited access model access, no request is needed.
codex-mini | East US 2 and Sweden Central (Global Standard) | No access request needed.
o4-mini | Model availability | No access request is needed to use the core capabilities of this model. Request access: o4-mini reasoning summary feature.
o3 | Model availability | Request access: limited access model application.
o3-mini | Model availability | Access is no longer restricted for this model.
o1 | Model availability | Access is no longer restricted for this model.
o1-mini | Model availability | No access request needed for Global Standard deployments. Standard (regional) deployments are currently only available to select customers who were granted access as part of the o1-preview release.

API and feature support

Feature | gpt-5-pro, 2025-10-06 | gpt-5-codex, 2025-09-01 | gpt-5, 2025-08-07 | gpt-5-mini, 2025-08-07 | gpt-5-nano, 2025-08-07
API version | v1 | v1 | v1 | v1 | v1
Developer messages | ✅ | ✅ | ✅ | ✅ | ✅
Structured outputs | ✅ | ✅ | ✅ | ✅ | ✅
Context window | 400,000 (input: 272,000; output: 128,000) | 400,000 (input: 272,000; output: 128,000) | 400,000 (input: 272,000; output: 128,000) | 400,000 (input: 272,000; output: 128,000) | 400,000 (input: 272,000; output: 128,000)
Reasoning effort | - 4 | ✅ | ✅ | ✅ | ✅
Image input | ✅ | ✅ | ✅ | ✅ | ✅
Chat Completions API | - | - | ✅ | ✅ | ✅
Responses API | ✅ | ✅ | ✅ | ✅ | ✅
Functions/tools | ✅ | ✅ | ✅ | ✅ | ✅
Parallel tool calls 1 | - | ✅ | ✅ | ✅ | ✅
max_completion_tokens 2 | - | - | ✅ | ✅ | ✅
System messages 3 | ✅ | ✅ | ✅ | ✅ | ✅
Reasoning summary | ✅ | ✅ | ✅ | ✅ | ✅
Streaming | - | ✅ | ✅ | ✅ | ✅

1 Parallel tool calls are not supported when reasoning_effort is set to minimal.

2 Reasoning models only use the max_completion_tokens parameter with the Chat Completions API. Use max_output_tokens with the Responses API.

3 The latest reasoning models support system messages to make migration easier. You should not use a developer message and a system message in the same API request.

4 gpt-5-pro only supports reasoning_effort high, which is also the default when it isn't explicitly passed to the model.
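
To make footnote 2 concrete, here is a minimal sketch (assuming a gpt-5-mini deployment named gpt-5-mini) of the equivalent token caps in the two APIs:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
)

# Chat Completions API: max_completion_tokens caps everything the model
# generates, including hidden reasoning tokens.
chat = client.chat.completions.create(
    model="gpt-5-mini",  # replace with your model deployment name
    messages=[{"role": "user", "content": "Summarize the OSI model."}],
    max_completion_tokens=2000,
)

# Responses API: the equivalent cap is max_output_tokens.
resp = client.responses.create(
    model="gpt-5-mini",  # replace with your model deployment name
    input="Summarize the OSI model.",
    max_output_tokens=2000,
)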

New GPT-5 reasoning features

Feature | Description
reasoning_effort | minimal is now supported with the GPT-5 series reasoning models.* Options: minimal, low, medium, high.
verbosity | A new parameter that provides more granular control over how concise the model's output is. Options: low, medium, high.
preamble | GPT-5 series reasoning models can spend extra time "thinking" before executing a function/tool call. While this planning occurs, the model can provide insight into its planning steps through a new object called preamble in the model response. Preambles aren't guaranteed to be generated, but you can encourage them by using the instructions parameter and passing content like "Before you call any function, always plan extensively and always output your plan to the user before calling any function." (See the first sketch below.)
Allowed tools | You can specify more than one tool under tool_choice rather than only one. (See the second sketch below.)
Custom tool type | Enables raw text (non-JSON) output.
lark_tool | Allows using some features of Python lark for more flexible constraining of model responses.

* gpt-5-codex doesn't support reasoning_effort minimal.
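
A minimal sketch of encouraging preambles with the instructions parameter, as described in the table above. The get_weather function is a hypothetical placeholder tool, not part of the service:

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
)

response = client.responses.create(
    model="gpt-5",  # replace with your model deployment name
    # Nudges the model to plan (and surface that plan) before tool calls.
    instructions="Before you call any function, always plan extensively and always output your plan to the user before calling any function.",
    tools=[
        {
            "type": "function",
            "name": "get_weather",  # hypothetical placeholder tool
            "description": "Return the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }
    ],
    input="What's the weather in Paris right now?",
)

print(response.model_dump_json(indent=2))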

For more information, we also recommend reading OpenAI's GPT-5 prompting guide and GPT-5 features guide.
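
The allowed tools feature can be sketched as follows; the tool_choice shape here follows OpenAI's GPT-5 features guide, and get_weather/get_time are hypothetical placeholder tools:

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
)

response = client.responses.create(
    model="gpt-5",  # replace with your model deployment name
    tools=[
        {
            "type": "function",
            "name": "get_weather",  # hypothetical placeholder tool
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
        {
            "type": "function",
            "name": "get_time",  # hypothetical placeholder tool
            "parameters": {
                "type": "object",
                "properties": {"timezone": {"type": "string"}},
                "required": ["timezone"],
            },
        },
    ],
    # Restrict the model to a subset of the declared tools.
    tool_choice={
        "type": "allowed_tools",
        "mode": "auto",  # "required" would force a call from this subset
        "tools": [{"type": "function", "name": "get_weather"}],
    },
    input="What's the weather in Paris?",
)

print(response.model_dump_json(indent=2))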

Note

  • To avoid timeouts, background mode is recommended when using o3-pro.
  • o3-pro doesn't currently support image generation.

Not supported

The following are currently unsupported with reasoning models:

  • temperature, top_p, presence_penalty, frequency_penalty, logprobs, top_logprobs, logit_bias, max_tokens

Usage

These models don't currently support the same set of parameters as other models that use the Chat Completions API.

You might need to upgrade your version of the OpenAI Python library to take advantage of new parameters like max_completion_tokens.

pip install openai --upgrade
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
)

response = client.chat.completions.create(
    model="gpt-5-mini", # replace with your gpt-5-mini model deployment name
    messages=[
        {"role": "user", "content": "What steps should I think about when writing my first Python API?"},
    ],
    max_completion_tokens=5000
)

print(response.model_dump_json(indent=2))

Output:

{
  "id": "chatcmpl-AEj7pKFoiTqDPHuxOcirA9KIvf3yz",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "Writing your first Python API is an exciting step in developing software that can communicate with other applications. An API (Application Programming Interface) allows different software systems to interact with each other, enabling data exchange and functionality sharing. Here are the steps you should consider when creating your first Python API...truncated for brevity.",
        "refusal": null,
        "role": "assistant",
        "function_call": null,
        "tool_calls": null
      },
      "content_filter_results": {
        "hate": {
          "filtered": false,
          "severity": "safe"
        },
        "protected_material_code": {
          "filtered": false,
          "detected": false
        },
        "protected_material_text": {
          "filtered": false,
          "detected": false
        },
        "self_harm": {
          "filtered": false,
          "severity": "safe"
        },
        "sexual": {
          "filtered": false,
          "severity": "safe"
        },
        "violence": {
          "filtered": false,
          "severity": "safe"
        }
      }
    }
  ],
  "created": 1728073417,
  "model": "o1-2024-12-17",
  "object": "chat.completion",
  "service_tier": null,
  "system_fingerprint": "fp_503a95a7d8",
  "usage": {
    "completion_tokens": 1843,
    "prompt_tokens": 20,
    "total_tokens": 1863,
    "completion_tokens_details": {
      "audio_tokens": null,
      "reasoning_tokens": 448
    },
    "prompt_tokens_details": {
      "audio_tokens": null,
      "cached_tokens": 0
    }
  },
  "prompt_filter_results": [
    {
      "prompt_index": 0,
      "content_filter_results": {
        "custom_blocklists": {
          "filtered": false
        },
        "hate": {
          "filtered": false,
          "severity": "safe"
        },
        "jailbreak": {
          "filtered": false,
          "detected": false
        },
        "self_harm": {
          "filtered": false,
          "severity": "safe"
        },
        "sexual": {
          "filtered": false,
          "severity": "safe"
        },
        "violence": {
          "filtered": false,
          "severity": "safe"
        }
      }
    }
  ]
}

Reasoning effort

Note

Reasoning models return reasoning_tokens as part of completion_tokens_details in the model response. These are hidden tokens that aren't returned as part of the message response content, but the model uses them to help generate a final answer to your request. reasoning_effort can be set to low, medium, or high for all reasoning models except o1-mini. The GPT-5 reasoning models support a new minimal setting for reasoning_effort. The higher the effort setting, the longer the model spends processing the request, which generally results in a larger number of reasoning_tokens.
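
A minimal sketch of reading the hidden reasoning token count from the usage details (the field names match the example output earlier in this article):

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
)

response = client.chat.completions.create(
    model="gpt-5-mini",  # replace with your model deployment name
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
    max_completion_tokens=4000,
    reasoning_effort="high",  # higher effort generally produces more reasoning tokens
)

usage = response.usage
# reasoning_tokens are billed as completion tokens but hidden from the message content.
print("completion tokens:", usage.completion_tokens)
print("reasoning tokens:", usage.completion_tokens_details.reasoning_tokens)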

Developer messages

Functionally, developer messages ("role": "developer") are the same as system messages.

Adding a developer message to the previous code example looks like this:

You might need to upgrade your version of the OpenAI Python library to take advantage of new parameters like max_completion_tokens.

pip install openai --upgrade
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
)

response = client.chat.completions.create(
    model="gpt-5-mini", # replace with the model deployment name of your o1 deployment.
    messages=[
        {"role": "developer","content": "You are a helpful assistant."}, # optional equivalent to a system message for reasoning models 
        {"role": "user", "content": "What steps should I think about when writing my first Python API?"},
    ],
    max_completion_tokens = 5000,
    reasoning_effort = "medium" # low, medium, or high
)

print(response.model_dump_json(indent=2))

Reasoning summary

When using the latest reasoning models with the Responses API, you can use the reasoning summary parameter to receive summaries of the model's chain-of-thought reasoning.

Important

Attempting to extract raw reasoning through methods other than the reasoning summary parameter isn't supported, may violate the Acceptable Use Policy, and may result in throttling or account suspension if detected.

You'll need to upgrade your OpenAI client library for access to the latest parameters.

pip install openai --upgrade
import os
from openai import OpenAI

client = OpenAI(  
  base_url = "https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
  api_key=os.getenv("AZURE_OPENAI_API_KEY")  
)

response = client.responses.create(
    input="Tell me about the curious case of neural text degeneration",
    model="gpt-5", # replace with model deployment name
    reasoning={
        "effort": "medium",
        "summary": "auto" # auto, concise, or detailed, gpt-5 series do not support concise 
    },
    text={
        "verbosity": "low" # New with GPT-5 models
    }
)

print(response.model_dump_json(indent=2))
Output:

{
  "id": "resp_689a0a3090808190b418acf12b5cc40e0fc1c31bc69d8719",
  "created_at": 1754925616.0,
  "error": null,
  "incomplete_details": null,
  "instructions": null,
  "metadata": {},
  "model": "gpt-5",
  "object": "response",
  "output": [
    {
      "id": "rs_689a0a329298819095d90c34dc9b80db0fc1c31bc69d8719",
      "summary": [],
      "type": "reasoning",
      "encrypted_content": null,
      "status": null
    },
    {
      "id": "msg_689a0a33009881909fe0fcf57cba30200fc1c31bc69d8719",
      "content": [
        {
          "annotations": [],
          "text": "Neural text degeneration refers to the ways language models produce low-quality, repetitive, or vacuous text, especially when generating long outputs. It’s “curious” because models trained to imitate fluent text can still spiral into unnatural patterns. Key aspects:\n\n- Repetition and loops: The model repeats phrases or sentences (“I’m sorry, but...”), often due to high-confidence tokens reinforcing themselves.\n- Loss of specificity: Vague, generic, agreeable text that avoids concrete details.\n- Drift and contradiction: The output gradually departs from context or contradicts itself over long spans.\n- Exposure bias: During training, models see gold-standard prefixes; at inference, they must condition on their own imperfect outputs, compounding errors.\n- Likelihood vs. quality mismatch: Maximizing token-level likelihood doesn’t align with human preferences for diversity, coherence, or factuality.\n- Token over-optimization: Frequent, safe tokens get overused; certain phrases become attractors.\n- Entropy collapse: With greedy or low-temperature decoding, the distribution narrows too much, causing repetitive, low-entropy text.\n- Length and beam search issues: Larger beams or long generations can favor bland, repetitive sequences (the “likelihood trap”).\n\nCommon mitigations:\n\n- Decoding strategies:\n  - Top-k, nucleus (top-p), or temperature sampling to keep sufficient entropy.\n  - Typical sampling and locally typical sampling to avoid dull but high-probability tokens.\n  - Repetition penalties, presence/frequency penalties, no-repeat n-grams.\n  - Contrastive decoding (and variants like DoLa) to filter generic continuations.\n  - Min/max length, stop sequences, and beam search with diversity/penalties.\n\n- Training and alignment:\n  - RLHF/DPO to better match human preferences for non-repetitive, helpful text.\n  - Supervised fine-tuning on high-quality, diverse data; instruction tuning.\n  - Debiasing objectives (unlikelihood training) to penalize repetition and banned patterns.\n  - Mixture-of-denoisers or latent planning to improve long-range coherence.\n\n- Architectural and planning aids:\n  - Retrieval-augmented generation to ground outputs.\n  - Tool use and structured prompting to constrain drift.\n  - Memory and planning modules, hierarchical decoding, or sentence-level control.\n\n- Prompting tips:\n  - Ask for concise answers, set token limits, and specify structure.\n  - Provide concrete constraints or content to reduce generic filler.\n  - Use “say nothing if uncertain” style instructions to avoid vacuity.\n\nRepresentative papers/terms to search:\n- Holtzman et al., “The Curious Case of Neural Text Degeneration” (2020): nucleus sampling.\n- Welleck et al., “Neural Text Degeneration with Unlikelihood Training.”\n- Li et al., “A Contrastive Framework for Decoding.”\n- Su et al., “DoLa: Decoding by Contrasting Layers.”\n- Meister et al., “Typical Decoding.”\n- Ouyang et al., “Training language models to follow instructions with human feedback.”\n\nIn short, degeneration arises from a mismatch between next-token likelihood and human preferences plus decoding choices; careful decoding, training objectives, and grounding help prevent it.",
          "type": "output_text",
          "logprobs": null
        }
      ],
      "role": "assistant",
      "status": "completed",
      "type": "message"
    }
  ],
  "parallel_tool_calls": true,
  "temperature": 1.0,
  "tool_choice": "auto",
  "tools": [],
  "top_p": 1.0,
  "background": false,
  "max_output_tokens": null,
  "max_tool_calls": null,
  "previous_response_id": null,
  "prompt": null,
  "prompt_cache_key": null,
  "reasoning": {
    "effort": "minimal",
    "generate_summary": null,
    "summary": "detailed"
  },
  "safety_identifier": null,
  "service_tier": "default",
  "status": "completed",
  "text": {
    "format": {
      "type": "text"
    }
  },
  "top_logprobs": null,
  "truncation": "disabled",
  "usage": {
    "input_tokens": 16,
    "input_tokens_details": {
      "cached_tokens": 0
    },
    "output_tokens": 657,
    "output_tokens_details": {
      "reasoning_tokens": 0
    },
    "total_tokens": 673
  },
  "user": null,
  "content_filters": null,
  "store": true
}

Note

Even when enabled, reasoning summaries aren't guaranteed to be generated for every step/request. This is expected behavior.
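
Because summaries are optional, it's worth reading them defensively. A minimal sketch, assuming response came from the client.responses.create() call above (the item types and fields match the example output):

# Iterate the output items and print any reasoning summaries that were produced.
for item in response.output:
    if item.type == "reasoning":
        if item.summary:
            for part in item.summary:
                print("reasoning summary:", part.text)
        else:
            # Expected behavior: no summary was generated for this step.
            print("no reasoning summary generated")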

Python lark

The GPT-5 series reasoning models can call a new custom_tool named lark_tool. This tool is based on Python lark and can be used to constrain model responses more flexibly.

Responses API

{
  "model": "gpt-5-2025-08-07",
  "input": "please calculate the area of a circle with radius equal to the number of 'r's in strawberry",
  "tools": [
    {
      "type": "custom",
      "name": "lark_tool",
      "format": {
        "type": "grammar",
        "syntax": "lark",
        "definition": "start: QUESTION NEWLINE ANSWER\nQUESTION: /[^\\n?]{1,200}\\?/\nNEWLINE: /\\n/\nANSWER: /[^\\n!]{1,200}!/"
      }
    }
  ],
  "tool_choice": "required"
}
import os
from openai import OpenAI

client = OpenAI(  
  base_url = "https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
  api_key=os.getenv("AZURE_OPENAI_API_KEY")  
)

response = client.responses.create(  
    model="gpt-5",  # replace with your model deployment name  
    tools=[  
        {  
            "type": "custom",
            "name": "lark_tool",
            "format": {
                "type": "grammar",
                "syntax": "lark",
                "definition": "start: QUESTION NEWLINE ANSWER\nQUESTION: /[^\\n?]{1,200}\\?/\nNEWLINE: /\\n/\nANSWER: /[^\\n!]{1,200}!/"
            }
        }  
    ],  
    input=[{"role": "user", "content": "Please calculate the area of a circle with radius equal to the number of 'r's in strawberry"}],  
)  

print(response.model_dump_json(indent=2))  
  

Output

{
  "id": "resp_689a0cf927408190b8875915747667ad01c936c6ffb9d0d3",
  "created_at": 1754926332.0,
  "error": null,
  "incomplete_details": null,
  "instructions": null,
  "metadata": {},
  "model": "gpt-5",
  "object": "response",
  "output": [
    {
      "id": "rs_689a0cfd1c888190a2a67057f471b5cc01c936c6ffb9d0d3",
      "summary": [],
      "type": "reasoning",
      "encrypted_content": null,
      "status": null
    },
    {
      "id": "msg_689a0d00e60c81908964e5e9b2d6eeb501c936c6ffb9d0d3",
      "content": [
        {
          "annotations": [],
          "text": "“strawberry” has 3 r’s, so the radius is 3.\nArea = πr² = π × 3² = 9π ≈ 28.27 square units.",
          "type": "output_text",
          "logprobs": null
        }
      ],
      "role": "assistant",
      "status": "completed",
      "type": "message"
    }
  ],
  "parallel_tool_calls": true,
  "temperature": 1.0,
  "tool_choice": "auto",
  "tools": [
    {
      "name": "lark_tool",
      "parameters": null,
      "strict": null,
      "type": "custom",
      "description": null,
      "format": {
        "type": "grammar",
        "definition": "start: QUESTION NEWLINE ANSWER\nQUESTION: /[^\\n?]{1,200}\\?/\nNEWLINE: /\\n/\nANSWER: /[^\\n!]{1,200}!/",
        "syntax": "lark"
      }
    }
  ],
  "top_p": 1.0,
  "background": false,
  "max_output_tokens": null,
  "max_tool_calls": null,
  "previous_response_id": null,
  "prompt": null,
  "prompt_cache_key": null,
  "reasoning": {
    "effort": "medium",
    "generate_summary": null,
    "summary": null
  },
  "safety_identifier": null,
  "service_tier": "default",
  "status": "completed",
  "text": {
    "format": {
      "type": "text"
    }
  },
  "top_logprobs": null,
  "truncation": "disabled",
  "usage": {
    "input_tokens": 139,
    "input_tokens_details": {
      "cached_tokens": 0
    },
    "output_tokens": 240,
    "output_tokens_details": {
      "reasoning_tokens": 192
    },
    "total_tokens": 379
  },
  "user": null,
  "content_filters": null,
  "store": true
}

Chat Completions

{
  "messages": [
    {
      "role": "user",
      "content": "Which one is larger, 42 or 0?"
    }
  ],
  "tools": [
    {
      "type": "custom",
      "name": "custom_tool",
      "custom": {
        "name": "lark_tool",
        "format": {
          "type": "grammar",
          "grammar": {
            "syntax": "lark",
            "definition": "start: QUESTION NEWLINE ANSWER\nQUESTION: /[^\\n?]{1,200}\\?/\nNEWLINE: /\\n/\nANSWER: /[^\\n!]{1,200}!/"
          }
        }
      }
    }
  ],
  "tool_choice": "required",
  "model": "gpt-5-2025-08-07"
}
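
A minimal Python sketch of the same Chat Completions request; the tool shape mirrors the JSON payload above (without the redundant top-level name field):

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
)

response = client.chat.completions.create(
    model="gpt-5",  # replace with your model deployment name
    messages=[{"role": "user", "content": "Which one is larger, 42 or 0?"}],
    tools=[
        {
            "type": "custom",
            "custom": {
                "name": "lark_tool",
                "format": {
                    "type": "grammar",
                    "grammar": {
                        "syntax": "lark",
                        "definition": "start: QUESTION NEWLINE ANSWER\nQUESTION: /[^\\n?]{1,200}\\?/\nNEWLINE: /\\n/\nANSWER: /[^\\n!]{1,200}!/"
                    },
                },
            },
        }
    ],
    tool_choice="required",
)

print(response.model_dump_json(indent=2))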

Markdown output

By default, the o3-mini and o1 models don't attempt to produce output that includes markdown formatting. A common use case where this behavior is undesirable is when you want the model to output code contained within a markdown code block. When the model generates output without markdown formatting, you lose features like syntax highlighting and copyable code blocks in interactive playground experiences. To override this new default behavior and encourage markdown inclusion in model responses, add the string Formatting re-enabled to the beginning of your developer message.

Adding Formatting re-enabled to the beginning of your developer message doesn't guarantee that the model will include markdown formatting in its response; it only increases the likelihood. In internal testing, we found that Formatting re-enabled is less effective by itself with the o1 model than with o3-mini.

To improve the performance of Formatting re-enabled, you can further augment the beginning of the developer message, which will often result in the desired output. Rather than just adding Formatting re-enabled to the beginning of your developer message, you can experiment with adding a more descriptive initial instruction like one of the following examples:

  • Formatting re-enabled - please enclose code blocks with appropriate markdown tags.
  • Formatting re-enabled - code output should be wrapped in markdown.

Depending on your expected output, you might need to customize your initial developer message further to target your specific use case.
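
For example, here is a minimal sketch of a request that uses one of these developer messages (assuming an o3-mini deployment named o3-mini):

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
)

response = client.chat.completions.create(
    model="o3-mini",  # replace with your model deployment name
    messages=[
        # Increases the likelihood (but doesn't guarantee) markdown-formatted output.
        {"role": "developer", "content": "Formatting re-enabled - please enclose code blocks with appropriate markdown tags."},
        {"role": "user", "content": "Write a Python function that reverses a string."},
    ],
    max_completion_tokens=4000,
)

print(response.choices[0].message.content)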