Audio objects

This page documents the common audio response objects: the transcription object (JSON), the verbose transcription object (verbose_json), and the diarized transcription object (diarized_json).

The transcription object (JSON)

text string
The transcribed text.

usage object
Token usage details for the transcription. The fields that follow, input_tokens through total_tokens, are members of this object.

input_tokens integer
Number of input tokens used in the request.

input_duration_ms integer
The duration of the input audio in milliseconds.

output_tokens integer
Number of output tokens in the transcription.

output_duration_ms integer
The duration of the output audio in milliseconds.

total_tokens integer
Total number of tokens used in the request (input_tokens plus output_tokens).

Example: the transcription object (JSON)

{
  "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that.",
  "usage": {
    "input_tokens": 333,
    "input_duration_ms": 29801,
    "output_tokens": 67,
    "output_duration_ms": 0,
    "total_tokens": 400
  }
}
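The fields above map naturally onto typed records on the client side. The following is a minimal sketch in Python, assuming the response body arrives as a JSON string shaped like the example; the `Usage` and `Transcription` classes and the `parse_transcription` helper are hypothetical names introduced for illustration and are not part of any SDK.

```python
import json
from dataclasses import dataclass


@dataclass
class Usage:
    """Mirrors the documented `usage` object."""
    input_tokens: int
    input_duration_ms: int
    output_tokens: int
    output_duration_ms: int
    total_tokens: int


@dataclass
class Transcription:
    """Mirrors the documented transcription object (JSON)."""
    text: str
    usage: Usage


def parse_transcription(raw: str) -> Transcription:
    """Parse a JSON string into a Transcription (hypothetical helper)."""
    data = json.loads(raw)
    u = data["usage"]
    usage = Usage(
        input_tokens=u["input_tokens"],
        input_duration_ms=u["input_duration_ms"],
        output_tokens=u["output_tokens"],
        output_duration_ms=u["output_duration_ms"],
        total_tokens=u["total_tokens"],
    )
    return Transcription(text=data["text"], usage=usage)


# Sample body: usage values copied from the example above, text shortened.
raw = '''{
  "text": "Imagine the wildest idea that you've ever had.",
  "usage": {
    "input_tokens": 333,
    "input_duration_ms": 29801,
    "output_tokens": 67,
    "output_duration_ms": 0,
    "total_tokens": 400
  }
}'''

parsed = parse_transcription(raw)
# Convert the input duration from milliseconds to seconds for display.
input_seconds = parsed.usage.input_duration_ms / 1000
```

Accessing each key explicitly, rather than unpacking the whole dictionary, means the sketch ignores any extra fields a response might carry instead of raising on them.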

The transcription object (verbose_json)

language string
The language of the input audio.

duration number
The duration of the input audio in seconds.

text string
The transcribed text.

words array
Per-word timestamps when timestamp_granularities[] includes word.

segments array
Per-segment timestamps when timestamp_granularities[] includes segment.
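Because words and segments are only present when the matching granularity is requested, client code should treat both arrays as optional. The sketch below shows one way to do that in Python. This page does not list the fields inside each array entry, so the sketch assumes each word entry has `word`, `start`, and `end`, each segment entry has `text`, `start`, and `end`, and all times are in seconds. Those field names, the sample data, and the helper functions are assumptions for illustration only.

```python
from typing import Any


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS.mmm using integer milliseconds."""
    total_ms = round(seconds * 1000)
    minutes, rem = divmod(total_ms, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def segment_lines(resp: dict[str, Any]) -> list[str]:
    """One line per segment; empty if `segments` was not requested.

    Assumes each segment has `text`, `start`, `end` (not confirmed here).
    """
    lines = []
    for seg in resp.get("segments") or []:
        span = f"{format_timestamp(seg['start'])}-{format_timestamp(seg['end'])}"
        lines.append(f"{span} {seg['text']}")
    return lines


def word_spans(resp: dict[str, Any]) -> list[tuple[str, float, float]]:
    """(word, start, end) tuples; empty if `words` was not requested.

    Assumes each word entry has `word`, `start`, `end` (not confirmed here).
    """
    return [(w["word"], w["start"], w["end"]) for w in resp.get("words") or []]


# Hypothetical verbose_json response, invented for this sketch.
sample = {
    "language": "english",
    "duration": 2.4,
    "text": "Hello world.",
    "words": [
        {"word": "Hello", "start": 0.0, "end": 0.5},
        {"word": "world", "start": 0.6, "end": 1.1},
    ],
    "segments": [{"text": "Hello world.", "start": 0.0, "end": 1.1}],
}

lines = segment_lines(sample)
spans = word_spans(sample)
```

Using `resp.get(...) or []` covers both a missing key and an explicit null, so the same helpers work whichever granularities were requested.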

The diarized transcription object (diarized_json)

text string
The transcribed text.

segments array
Diarized segments with speaker labels.
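A common use of diarized segments is to collapse consecutive segments from the same speaker into a single turn. The sketch below illustrates this in Python. This page does not specify the segment fields, so the sketch assumes each segment has `speaker`, `text`, `start`, and `end` (times in seconds). Those names, the sample data, and the `merge_turns` helper are hypothetical.

```python
from typing import Any


def merge_turns(segments: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Merge adjacent segments that share a speaker label into one turn.

    Assumes each segment has `speaker`, `text`, `start`, `end`
    (field names are an assumption, not confirmed by this page).
    """
    turns: list[dict[str, Any]] = []
    for seg in segments:
        text = seg["text"].strip()
        if turns and turns[-1]["speaker"] == seg["speaker"]:
            # Same speaker as the previous turn: extend it.
            turns[-1]["text"] += " " + text
            turns[-1]["end"] = seg["end"]
        else:
            # New speaker: start a new turn.
            turns.append(
                {
                    "speaker": seg["speaker"],
                    "text": text,
                    "start": seg["start"],
                    "end": seg["end"],
                }
            )
    return turns


# Hypothetical diarized_json response, invented for this sketch.
sample_diarized = {
    "text": "Hi there. How are you? Fine, thanks.",
    "segments": [
        {"speaker": "A", "text": "Hi there.", "start": 0.0, "end": 1.0},
        {"speaker": "A", "text": "How are you?", "start": 1.0, "end": 2.5},
        {"speaker": "B", "text": "Fine, thanks.", "start": 2.5, "end": 4.0},
    ],
}

turns = merge_turns(sample_diarized.get("segments") or [])
script = [f"{t['speaker']}: {t['text']}" for t in turns]
```

The function builds new dictionaries rather than mutating the input segments, so the original response stays intact for other uses.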
