Using the Azure OpenAI Realtime API, I sometimes get mismatches between audio and audio_transcript responses

Tim 0 Reputation points
2025-10-24T22:52:59.3033333+00:00

I set up a text-input, audio-output Realtime API session as follows:

    // Module-level imports (openai Node SDK with beta realtime support):
    // import { AzureOpenAI } from "openai";
    // import { OpenAIRealtimeWS } from "openai/beta/realtime/ws";
    this.translationWs = await OpenAIRealtimeWS.azure(
      new AzureOpenAI({
        apiKey: process.env.AZURE_OPENAI_API_KEY,
        endpoint: process.env.AZURE_OPENAI_ENDPOINT,
        apiVersion: "2024-10-01-preview",
        deployment: "gpt-realtime",
      })
    );

    this.translationWs.socket.on("open", () => {
      this.sendTranslationMessage({
        type: "session.update",
        session: {
          modalities: ["text", "audio"],
          instructions: myCustomInstructions,
          output_audio_format: "g711_ulaw",
          temperature: 0.6,
        },
      });
    });
    this.translationWs.on("response.audio.delta", (data) => {
      processAudio(data.delta);
    });

    this.translationWs.on("response.audio_transcript.done", async (data) => {
      console.log("translation: ", data.transcript);
    });

Then, after confirming I've received the session.updated event, I start sending input text (I told it to translate some text):

      this.translationWs.send({
        type: "conversation.item.create",
        item: {
          type: "message",
          role: "user",
          content: [
            {
              type: "input_text",
              text: myCustomText,
            },
          ],
        },
      });
      this.translationWs.send({
        type: "response.create",
        response: { modalities: ["text", "audio"], conversation: "none" },
      }); 
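
(For reference, this is roughly how I wait for session.updated before sending — a minimal sketch, not my exact code:)

      // Resolve once the server confirms the session settings, so the
      // sends above never race against a stale session configuration.
      const sessionReady = new Promise<void>((resolve) => {
        this.translationWs.on("session.updated", () => resolve());
      });
      await sessionReady; // only now send conversation.item.create / response.create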

Sometimes, the audio I hear from the "response.audio.delta" events does not match the text I get from "response.audio_transcript.done".

The audio often leaves out the final part of the transcript. Here are some examples:

transcript: "I'm checking. Yes, I found it. Lowering it to minimum. Okay, much better. Now I can barely hear myself."

audio: "I'm checking. Yes, I found it. Lowering it to minimum. Okay, much better."

transcript: "One moment, brief pause. Yeah, you're right, it's turned off. I didn't know that existed."

audio: "One moment, brief pause. Yeah, you're right, it's turned off."

Another error is that the audio sometimes renders part of the transcript in a different language. Here is an example:

transcript: "Exactly, it used to work automatically before, but since the last update I have to press the translate button each time."

audio: "Exacto, it used to work automatically before, but since the last update I have to press the translate button each time."

Sometimes I get both errors mentioned above at once:

transcript: "Please send it to soporte @ miempresa.com. That’s the best one."

audio: "Envíalo por favor a soporte @ miempresa.com."

In all cases, the transcript is the correct response, so I need the audio to always match the transcript.


1 answer

  1. Jerald Felix 7,910 Reputation points
    2025-10-25T11:17:10.4233333+00:00

    Hello Tim,

    To address the issue where you sometimes get mismatches between the audio response and the audio transcript response when using the Azure OpenAI Realtime API, let's break down the possible causes and recommended approaches:

    Understanding the Issue: You have set up a real-time session using the gpt-realtime model in Azure OpenAI, configured to return both text (transcript) and audio (audio delta) responses. Occasionally the audio does not fully match the transcript: it may omit segments that appear in the transcript, or render part of the output in a different language than the transcript. The examples you shared show clear mismatches in both completeness and language.

    Why Does This Happen?

    Real-Time Streaming Complexity: The real-time API processes streaming audio and transcripts in parallel. The pipeline for generating audio (via TTS) and text can operate asynchronously, especially if there are delays in TTS synthesis or streaming segment delivery. That can cause mismatches, especially in edge cases involving streaming updates, interruptions, or partial outputs.

    Modality and Session Settings: The modalities you set (["text", "audio"]) and parameters such as session update events and output audio format can affect the response behavior. Any discrepancy in event sequencing or incomplete session updates could cause the audio and transcript to diverge for a request.

    API Model Limitations: The API is in preview and may not guarantee strict alignment between transcript and audio, particularly with API versions handling multiple modalities.

    What Can You Do to Achieve Better Alignment?

    Synchronize Events: Always ensure you process the transcript only upon receiving the completion event (such as response.audio_transcript.done). For the audio, collect all response.audio.delta events and confirm you have the final chunk before playback or further processing.

    Wait for Finalization: Do not rely solely on streaming segments; wait for a "done" or similar finalization event for both audio and transcript to ensure you have the full content.

    Implementation Adjustments: Consider comparing the final transcript and final audio after all segments are received for each user input. If a mismatch is detected, log both for troubleshooting and, if feasible, replay the final audio for consistency.
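
    Here is a minimal sketch of that comparison approach (the event payload fields follow the Realtime API as used in your code; the per-response buffering and the duration heuristic are illustrative assumptions, not an official sample):

        // Buffer audio deltas and transcripts per response id, then
        // compare the two once the response is fully done.
        const audioChunks = new Map<string, Buffer[]>();
        const transcripts = new Map<string, string>();

        this.translationWs.on("response.audio.delta", (event) => {
          const chunks = audioChunks.get(event.response_id) ?? [];
          chunks.push(Buffer.from(event.delta, "base64")); // g711_ulaw bytes
          audioChunks.set(event.response_id, chunks);
        });

        this.translationWs.on("response.audio_transcript.done", (event) => {
          transcripts.set(event.response_id, event.transcript);
        });

        this.translationWs.on("response.done", (event) => {
          const id = event.response.id;
          const audio = Buffer.concat(audioChunks.get(id) ?? []);
          const text = transcripts.get(id) ?? "";
          // g711_ulaw is 8 kHz, one byte per sample -> bytes / 8000 = seconds.
          const seconds = audio.length / 8000;
          // Crude heuristic (threshold is arbitrary): flag audio that looks
          // too short to cover the transcript, and log both for later review.
          if (seconds < text.split(/\s+/).length * 0.2) {
            console.warn("possible audio/transcript mismatch", { id, seconds, text });
          }
          audioChunks.delete(id);
          transcripts.delete(id);
        });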

    Update and Version: If you are not already using the most recent API version, upgrade to the latest. Preview APIs often get fixes and improvements for modality synchronization.
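
    For example, moving the client to a newer api-version is a one-line change (the exact version string below is an assumption — check the Azure OpenAI documentation for the latest preview version):

        const client = new AzureOpenAI({
          apiKey: process.env.AZURE_OPENAI_API_KEY,
          endpoint: process.env.AZURE_OPENAI_ENDPOINT,
          apiVersion: "2025-04-01-preview", // assumed example — verify against the docs
          deployment: "gpt-realtime",
        });
        this.translationWs = await OpenAIRealtimeWS.azure(client);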

    Platform Feedback: As you've noticed this multiple times, report it directly through Azure’s feedback channels. Provide example pairs of mismatched transcript and audio, including API request details and parameter settings. Microsoft’s engineering team can investigate deeper at the service layer.

    Additional Resources for Investigation:

    Review the official documentation for streaming audio with Azure OpenAI and recommended best practices for real-time transcription and text-to-speech synchronization.

    See if the Azure AI Foundry Realtime Audio API documentation on learn.microsoft.com offers extra synchronization tips or recent updates on limitations and known issues.

    I hope these suggestions help you achieve better reliability and alignment between text and audio responses in your implementation. If this resolves the issue, kindly accept the answer.

    Best Regards,

    Jerald Felix

