
Audio objects

This page documents the response objects returned for audio transcription: the transcription object (JSON), the verbose transcription object (verbose_json), and the diarized transcription object (diarized_json).

The transcription object (JSON)


text string
The transcribed text.


usage object
Token usage details for the transcription.


input_tokens integer
Number of input tokens used in the request.


input_duration_ms integer
The duration of the input audio in milliseconds.


output_tokens integer
Number of output tokens in the transcription.


output_duration_ms integer
The duration of the output audio in milliseconds.


total_tokens integer
Total number of tokens used in the request.


OBJECT The transcription object (JSON)
json
{
  "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that.",
  "usage": {
    "input_tokens": 333,
    "input_duration_ms": 29801,
    "output_tokens": 67,
    "output_duration_ms": 0,
    "total_tokens": 400
  }
}
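
As a rough illustration, the object above can be loaded into typed Python structures. This is only a sketch: the names `Usage`, `Transcription`, and `parse_transcription` are hypothetical and not part of any SDK, and it assumes that all five `usage` fields are present, as they are in the sample.

```python
import json
from dataclasses import dataclass


@dataclass
class Usage:
    input_tokens: int
    input_duration_ms: int
    output_tokens: int
    output_duration_ms: int
    total_tokens: int


@dataclass
class Transcription:
    text: str
    usage: Usage


def parse_transcription(raw: str) -> Transcription:
    """Parse a JSON transcription object (hypothetical helper)."""
    data = json.loads(raw)
    u = data["usage"]
    # Read fields explicitly so unknown extra keys do not break parsing.
    usage = Usage(
        input_tokens=u["input_tokens"],
        input_duration_ms=u["input_duration_ms"],
        output_tokens=u["output_tokens"],
        output_duration_ms=u["output_duration_ms"],
        total_tokens=u["total_tokens"],
    )
    return Transcription(text=data["text"], usage=usage)
```

In the sample, `total_tokens` (400) equals `input_tokens` plus `output_tokens` (333 + 67). That relationship is observed in this one example only and should not be read as a documented guarantee.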

The transcription object (verbose_json)


language string
The language of the input audio.


duration number
The duration of the input audio in seconds.


text string
The transcribed text.


words array
Per-word timestamps when timestamp_granularities[] includes word.


segments array
Per-segment timestamps when timestamp_granularities[] includes segment.

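This page does not list the fields of the entries in `segments`. The sketch below assumes each segment is an object with numeric `start` and `end` values (in seconds) and a `text` string. Those field names are assumptions, and `format_segments` is a hypothetical helper. It also treats `segments` as optional, since the field is described as present only when timestamp_granularities[] includes segment.

```python
from typing import Any


def format_segments(verbose: dict[str, Any]) -> list[str]:
    """Render segments as '[start-end] text' lines (assumed field names)."""
    lines = []
    # "segments" may be absent if segment timestamps were not requested.
    for seg in verbose.get("segments") or []:
        lines.append(
            f"[{seg['start']:.2f}-{seg['end']:.2f}] {seg['text'].strip()}"
        )
    return lines
```

A response without `segments` yields an empty list rather than raising an error.
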

The diarized transcription object (diarized_json)


text string
The transcribed text.


segments array
Diarized segments with speaker labels.
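
The shape of a diarized segment is not specified here. Assuming each segment carries a `speaker` label and a `text` string (both field names are assumptions), consecutive segments from the same speaker could be merged into turns roughly as follows. `merge_turns` is a hypothetical helper.

```python
from typing import Any


def merge_turns(diarized: dict[str, Any]) -> list[tuple[str, str]]:
    """Merge adjacent same-speaker segments into (speaker, text) turns."""
    turns: list[tuple[str, str]] = []
    for seg in diarized.get("segments") or []:
        speaker = seg["speaker"]      # assumed field name
        text = seg["text"].strip()    # assumed field name
        if turns and turns[-1][0] == speaker:
            # Same speaker as the previous turn: extend that turn.
            turns[-1] = (speaker, turns[-1][1] + " " + text)
        else:
            turns.append((speaker, text))
    return turns
```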
