Speech-to-text API
Use the Privatemode speech-to-text API to generate text from audio files. The API is compatible with the OpenAI transcriptions API. To generate text from audio, send your requests to the privatemode-proxy. Audio requests and responses are encrypted, both in transit and during processing.
Generating transcriptions
Send a POST form request to the following endpoint on your proxy:
POST /v1/audio/transcriptions
This endpoint generates a transcription of the provided audio file.
Request body
- model (string): The name of the model to use for transcription, e.g., openai/whisper-large-v3.
- file (file): The audio file to transcribe. Supported formats are flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, and webm.
- language (string, optional): The language of the audio in ISO-639-1 format (e.g., en). Not setting the correct language can lead to poor accuracy and performance.
- prompt (string, optional): An optional text to guide the model's style or continue a previous audio segment. The prompt should match the audio language.

For additional parameters, see the vLLM transcriptions API documentation.
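Before uploading, it can be useful to check client-side that a file's extension is one the endpoint accepts. A minimal sketch using only the Python standard library (the helper name is illustrative; the format set mirrors the list above):

```python
from pathlib import Path

# Formats accepted by the transcriptions endpoint (see the file parameter above).
SUPPORTED_FORMATS = {"flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}

def is_supported_audio(path: str) -> bool:
    """Return True if the file extension is one the API accepts."""
    return Path(path).suffix.lower().lstrip(".") in SUPPORTED_FORMATS

print(is_supported_audio("meeting.mp3"))  # True
print(is_supported_audio("notes.txt"))    # False
```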
Returns
The response is a transcription object or a stream of transcription events containing:
- text (string): The transcribed text from the audio.
- Other fields are consistent with the OpenAI API specification.
Examples
Note: To run the examples below, start the privatemode-proxy with a pre-configured API key or add an authentication header to the requests.
Example request
curl localhost:8080/v1/audio/transcriptions \
-H "Content-Type: multipart/form-data" \
-F 'model=openai/whisper-large-v3' \
-F 'file=@path/to/your/audio/file.mp3'
Example response
{
"text": "Hello World."
}
Available speech-to-text models
To list the available speech-to-text models, call the /v1/models
endpoint or see the models overview.
Privatemode's serving backend only supports files up to 25 MB in size. For larger files, consider splitting the audio into smaller segments, or try compressing the file to reduce its size.
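For uncompressed WAV input, splitting can be sketched with Python's standard wave module; the helper name and the 300-second default are illustrative, and for compressed formats (mp3, ogg, ...) a dedicated tool such as ffmpeg is better suited:

```python
import wave

def split_wav(path: str, chunk_seconds: int = 300) -> list[str]:
    """Split a WAV file into chunks of at most chunk_seconds each."""
    out_paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = params.framerate * chunk_seconds
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            out_path = f"{path.rsplit('.', 1)[0]}_{index:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                # Copy channel count, sample width, and rate; the frame count
                # is corrected automatically when the chunk is closed.
                dst.setparams(params)
                dst.writeframes(frames)
            out_paths.append(out_path)
            index += 1
    return out_paths
```

Each resulting chunk can then be sent to /v1/audio/transcriptions separately and the transcripts concatenated.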