Version: Next

Speech-to-text

Privatemode transcribes audio to text with end-to-end encryption. Uploads, intermediate processing, and the returned transcript remain inaccessible to third parties, which makes the API suitable for recordings containing personal data, customer conversations, medical dictations, or other sensitive content.

The API is compatible with the OpenAI transcriptions API, so existing client code works unchanged once it points at the Privatemode proxy.

info

You can try out Privatemode's audio transcription directly in the web app. Just select the transcription model under settings.

Choosing a model

Two speech-to-text models are available:

Model | When to use
Whisper large-v3 | Default choice. Robust across accents, noisy recordings, and lower-bitrate inputs. Broad multilingual support.
Voxtral Mini 3B | Lighter and faster. A reasonable choice for clean audio when latency matters. Fall back to Whisper if you observe quality issues.

Both models accept the same request format and return the same response schema, so switching between them is a one-line change.

Quick start

The example below sends a local audio file through the Privatemode proxy using the OpenAI Python SDK. Pass your API key to the proxy via --apiKey; it handles encryption transparently, so the SDK's api_key argument just needs a placeholder value.

from openai import OpenAI

client = OpenAI(
    api_key="dummy",  # not used; the proxy authenticates with the API
    base_url="http://localhost:8080/v1",
)

with open("meeting.mp3", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio,
    )

print(transcript.text)

Supported formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm.
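If you accept uploads from users, it can help to reject unsupported files before sending them to the API. A minimal sketch (the helper name is ours, not part of the SDK):

```python
from pathlib import Path

# Formats accepted by the transcription endpoint (see the list above).
SUPPORTED_FORMATS = {"flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}

def is_supported_audio(filename: str) -> bool:
    """Return True if the file extension is one the API accepts."""
    return Path(filename).suffix.lstrip(".").lower() in SUPPORTED_FORMATS
```

Note that this only checks the extension; a mislabeled file will still be rejected by the backend.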

Improving transcription quality

By default, the model detects the spoken language automatically and infers vocabulary from the audio alone. Two optional fields let you guide it.

Set the language

Detection is best-effort, and a wrong guess degrades both accuracy and latency. If you know the language, pass it as a two-letter language code:

transcript = client.audio.transcriptions.create(
    model="whisper-large-v3",
    file=audio,
    language="en",
)

Setting language is also a prerequisite for verbose_json (see below).

Steer with a prompt

The prompt field biases the model toward a particular style, spelling, or vocabulary. Use it for domain-specific terms, product names, or punctuation conventions. Write the prompt in the same language as the audio.

transcript = client.audio.transcriptions.create(
    model="whisper-large-v3",
    file=audio,
    language="en",
    prompt="Transcript of a product meeting discussing Privatemode, vLLM, and confidential computing.",
)

See the OpenAI prompting guide for further patterns.

Timestamps and segments

The default response contains only the final text. To get per-segment timestamps (useful for subtitles, audio search, or aligning a transcript back to the source), request verbose_json:

transcript = client.audio.transcriptions.create(
    model="whisper-large-v3",
    file=audio,
    language="en",
    response_format="verbose_json",
)

for segment in transcript.segments:
    print(f"[{segment.start:.2f}s – {segment.end:.2f}s] {segment.text}")

Each segment includes start and end times, the transcribed text, and token-level metadata. The API reference documents the full schema.

Working with long audio files

The Privatemode backend accepts files up to 50 MB per request. For longer recordings:

  • Compress before uploading. Re-encoding wav as mp3 or ogg at 64–128 kbit/s typically fits hours of speech under the limit without measurable accuracy loss. flac is a lossless option that often halves the original wav size. Avoid low bit rates.
  • Split on silence. Tools like ffmpeg's silencedetect filter or pydub can chop a recording into chunks at natural pauses. Transcribe each chunk independently and concatenate the results.
  • Stitch with prompts. When splitting, pass the tail of the previous chunk's transcript as the prompt for the next request. This helps the model carry names, terminology, and sentence flow across the boundary.
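The stitching step can be sketched as a small helper that takes the tail of the previous chunk's transcript and passes it as the next request's prompt. The function name and word budget are our own choices, not part of any SDK; keep the prompt well under the model's context limit:

```python
def tail_prompt(previous_text: str, max_words: int = 50) -> str:
    """Return the last `max_words` words of the previous chunk's transcript,
    suitable for passing as `prompt` when transcribing the next chunk."""
    words = previous_text.split()
    return " ".join(words[-max_words:])

# Sketched usage inside a chunked-transcription loop (chunk_paths and
# client are assumed to exist; see the quick start above):
#
# prompt = ""
# texts = []
# for chunk_path in chunk_paths:
#     with open(chunk_path, "rb") as audio:
#         result = client.audio.transcriptions.create(
#             model="whisper-large-v3", file=audio,
#             language="en", prompt=prompt,
#         )
#     texts.append(result.text)
#     prompt = tail_prompt(result.text)
# full_transcript = " ".join(texts)
```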

Translations

Whisper can also translate non-English audio directly into English text via the /v1/audio/translations endpoint. The request format mirrors transcriptions; only the path changes.
