Audio objects
This page documents the response objects returned for audio transcription: the transcription object, the verbose transcription object, and the diarized transcription object.
The transcription object (JSON)
text string
The transcribed text.
usage object
Token usage details for the transcription.
input_tokens integer
Number of input tokens used in the request.
input_duration_ms integer
The duration of the input audio in milliseconds.
output_tokens integer
Number of output tokens in the transcription.
output_duration_ms integer
The duration of the output audio in milliseconds.
total_tokens integer
Total number of tokens used in the request.
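As a minimal sketch of how the fields above might be read on the client side, the following parses a transcription payload into a small typed structure. The `Usage` dataclass and the `parse_transcription` helper are hypothetical names introduced only for illustration; they are not part of any SDK, and the sample text is shortened.

```python
import json
from dataclasses import dataclass


@dataclass
class Usage:
    """Mirrors the documented fields of the `usage` object."""
    input_tokens: int
    input_duration_ms: int
    output_tokens: int
    output_duration_ms: int
    total_tokens: int


def parse_transcription(raw: str):
    """Parse a JSON transcription object into a (text, Usage) pair.

    Hypothetical helper: it reads only the fields documented above and
    ignores any other keys the payload may carry.
    """
    data = json.loads(raw)
    u = data["usage"]
    usage = Usage(
        input_tokens=u["input_tokens"],
        input_duration_ms=u["input_duration_ms"],
        output_tokens=u["output_tokens"],
        output_duration_ms=u["output_duration_ms"],
        total_tokens=u["total_tokens"],
    )
    return data["text"], usage


# Shortened sample payload; the usage numbers match the example on this page.
raw = json.dumps({
    "text": "Imagine the wildest idea that you've ever had.",
    "usage": {
        "input_tokens": 333,
        "input_duration_ms": 29801,
        "output_tokens": 67,
        "output_duration_ms": 0,
        "total_tokens": 400,
    },
})

text, usage = parse_transcription(raw)
print(text)
# Durations are reported in milliseconds; convert to seconds for display.
print(f"input audio: {usage.input_duration_ms / 1000:.3f} s")
print(f"tokens: {usage.input_tokens} in, {usage.output_tokens} out, {usage.total_tokens} total")
```

Reading each field explicitly, rather than unpacking the whole `usage` dictionary, keeps the helper from failing if the payload contains keys that are not listed here.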
{
"text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that.",
"usage": {
"input_tokens": 333,
"input_duration_ms": 29801,
"output_tokens": 67,
"output_duration_ms": 0,
"total_tokens": 400
}
}
The transcription object (verbose_json)
language string
The language of the input audio.
duration number
The duration of the input audio in seconds.
text string
The transcribed text.
words array
Per-word timestamps when timestamp_granularities[] includes word.
segments array
Per-segment timestamps when timestamp_granularities[] includes segment.
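The sketch below shows one way to walk the `words` and `segments` arrays of a verbose transcription. This page does not list the fields inside each array entry, so the keys used here (`word`, `text`, `start`, `end`) are assumptions, and the sample payload is invented for illustration. Both arrays are treated as optional because each is present only when the matching timestamp granularity was requested.

```python
from typing import Any, Dict, List, Tuple


def word_spans(transcription: Dict[str, Any]) -> List[Tuple[str, float, float]]:
    """Return (word, start, end) tuples, or an empty list if `words` is absent.

    Assumes each entry has `word`, `start`, and `end` keys (not confirmed here).
    """
    return [(w["word"], w["start"], w["end"]) for w in transcription.get("words") or []]


def segment_spans(transcription: Dict[str, Any]) -> List[Tuple[str, float, float]]:
    """Return (text, start, end) tuples, or an empty list if `segments` is absent.

    Assumes each entry has `text`, `start`, and `end` keys (not confirmed here).
    """
    return [(s["text"], s["start"], s["end"]) for s in transcription.get("segments") or []]


# Invented sample payload; timestamps are assumed to be in seconds,
# matching the unit of the documented `duration` field.
verbose = {
    "language": "english",
    "duration": 1.2,
    "text": "Hello world",
    "words": [
        {"word": "Hello", "start": 0.0, "end": 0.5},
        {"word": "world", "start": 0.6, "end": 1.2},
    ],
    # No "segments" key: segment granularity was not requested in this sample.
}

for word, start, end in word_spans(verbose):
    print(f"{start:5.2f} to {end:5.2f}  {word}")

print(segment_spans(verbose))  # empty list, since `segments` is missing
```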
The diarized transcription object (diarized_json)
text string
The transcribed text.
segments array
Diarized segments with speaker labels.
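As an illustrative sketch, diarized segments can be grouped by speaker label to produce a per-speaker transcript. The segment keys `speaker` and `text` are assumptions, since this page only states that segments carry speaker labels; the `text_by_speaker` helper and the sample data are hypothetical.

```python
from typing import Any, Dict, List


def text_by_speaker(transcription: Dict[str, Any]) -> Dict[str, str]:
    """Join segment text per speaker, keeping the original segment order.

    Assumes each segment has `speaker` and `text` keys (not confirmed here).
    """
    grouped: Dict[str, List[str]] = {}
    for segment in transcription.get("segments") or []:
        grouped.setdefault(segment["speaker"], []).append(segment["text"].strip())
    return {speaker: " ".join(parts) for speaker, parts in grouped.items()}


# Invented sample payload for illustration only.
diarized = {
    "text": "Hi there. Hello. How are you?",
    "segments": [
        {"speaker": "A", "text": "Hi there."},
        {"speaker": "B", "text": "Hello."},
        {"speaker": "A", "text": "How are you?"},
    ],
}

for speaker, spoken in text_by_speaker(diarized).items():
    print(f"{speaker}: {spoken}")
```

A plain dictionary is used so that speakers appear in the order they first speak, which tends to read more naturally than sorting by label.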
