Speech-to-text
Privatemode transcribes audio to text with end-to-end encryption. Uploads, intermediate processing, and the returned transcript stay inaccessible to third parties, which makes the API suitable for recordings containing personal data, customer conversations, medical dictations, or other sensitive content.
The API is compatible with the OpenAI transcriptions API, so existing client code works unchanged once it points at the Privatemode proxy.
You can try out Privatemode's audio transcription directly in the web app. Just select the transcription model under settings.
Choosing a model
Two speech-to-text models are available:
| Model | When to use |
|---|---|
| Whisper large-v3 | Default choice. Robust across accents, noisy recordings, and lower-bitrate inputs. Broad multilingual support. |
| Voxtral Mini 3B | Lighter and faster. A reasonable choice for clean audio when latency matters. Fall back to Whisper if you observe quality issues. |
Both models accept the same request format and return the same response schema, so switching between them is a one-line change.
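The exact model identifiers exposed by your deployment may differ from the display names above. As a quick check, you can list them through the proxy's OpenAI-compatible models endpoint; this is a sketch that assumes the proxy exposes `/v1/models` and runs at the address used in the quick start below:

```python
from openai import OpenAI

# Sketch: list the model ids the proxy exposes, so you know what to pass as `model`.
client = OpenAI(
    api_key="dummy",  # not used; the proxy authenticates with the API
    base_url="http://localhost:8080/v1",
)

for model in client.models.list():
    print(model.id)
```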
Quick start
The example below sends a local audio file through the Privatemode proxy using the OpenAI Python SDK. Pass your API key to the proxy via --apiKey; it handles encryption transparently, so the SDK's api_key argument just needs a placeholder value.
```python
from openai import OpenAI

client = OpenAI(
    api_key="dummy",  # not used; the proxy authenticates with the API
    base_url="http://localhost:8080/v1",
)

with open("meeting.mp3", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio,
    )

print(transcript.text)
```
Supported formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm.
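If you build upload paths dynamically, a small pre-flight check avoids a round trip for files the API will reject. The helper below is hypothetical and only validates the file extension against the list above:

```python
from pathlib import Path

# Hypothetical helper: reject files whose extension is not in the supported list.
SUPPORTED_SUFFIXES = {".flac", ".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".ogg", ".wav", ".webm"}

def check_audio_file(path: str) -> Path:
    p = Path(path)
    if p.suffix.lower() not in SUPPORTED_SUFFIXES:
        raise ValueError(f"unsupported audio format: {p.suffix or 'no extension'}")
    return p

audio_path = check_audio_file("meeting.mp3")
```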
Improving transcription quality
By default, the model detects the spoken language automatically and infers vocabulary from the audio alone. Two optional fields let you guide it.
Set the language
Detection is best-effort, and a wrong guess degrades both accuracy and latency. If you know the language, pass it explicitly as a two-letter ISO 639-1 code:
```python
transcript = client.audio.transcriptions.create(
    model="whisper-large-v3",
    file=audio,
    language="en",
)
```
Setting language is also a prerequisite for verbose_json (see below).
Steer with a prompt
The prompt field biases the model toward a particular style, spelling, or vocabulary. Use it for domain-specific terms, product names, or punctuation conventions. Write the prompt in the same language as the audio.
```python
transcript = client.audio.transcriptions.create(
    model="whisper-large-v3",
    file=audio,
    language="en",
    prompt="Transcript of a product meeting discussing Privatemode, vLLM, and confidential computing.",
)
```
See the OpenAI prompting guide for further patterns.
Timestamps and segments
The default response contains only the final text. To get per-segment timestamps (useful for subtitles, audio search, or aligning a transcript back to the source), request verbose_json:
```python
transcript = client.audio.transcriptions.create(
    model="whisper-large-v3",
    file=audio,
    language="en",
    response_format="verbose_json",
)

for segment in transcript.segments:
    print(f"[{segment.start:.2f}s – {segment.end:.2f}s] {segment.text}")
```
Each segment includes start and end times, the transcribed text, and token-level metadata. The API reference documents the full schema.
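Since each segment carries start, end, and text, turning the response into subtitles is a short loop. The sketch below writes an SRT file from the `transcript` object of the previous example; the formatting helper is illustrative, not part of the API:

```python
def to_srt(segments) -> str:
    """Render verbose_json segments as SRT subtitle blocks."""
    def timestamp(seconds: float) -> str:
        ms = int(round(seconds * 1000))
        hours, ms = divmod(ms, 3_600_000)
        minutes, ms = divmod(ms, 60_000)
        secs, ms = divmod(ms, 1_000)
        return f"{hours:02}:{minutes:02}:{secs:02},{ms:03}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{timestamp(seg.start)} --> {timestamp(seg.end)}\n{seg.text.strip()}\n")
    return "\n".join(blocks)

with open("meeting.srt", "w", encoding="utf-8") as f:
    f.write(to_srt(transcript.segments))
```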
Working with long audio files
The Privatemode backend accepts files up to 50 MB per request. For longer recordings:
- Compress before uploading. Re-encoding `wav` as `mp3` or `ogg` at 64–128 kbit/s typically fits hours of speech under the limit without measurable accuracy loss. `flac` is a lossless option that often halves the original `wav` size. Avoid very low bit rates.
- Split on silence. Tools like `ffmpeg`'s `silencedetect` filter or `pydub` can chop a recording into chunks at natural pauses. Transcribe each chunk independently and concatenate the results.
- Stitch with prompts. When splitting, pass the tail of the previous chunk's transcript as the `prompt` for the next request. This helps the model carry names, terminology, and sentence flow across the boundary; see the sketch after this list.
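The sketch below combines the last two steps: it splits a recording at silences with pydub, re-encodes each chunk as mp3, transcribes the chunks in order, and feeds the tail of each transcript into the next request's prompt. It assumes pydub and ffmpeg are installed and that the proxy runs as in the quick start; the silence thresholds are illustrative starting points:

```python
import os
import tempfile

from openai import OpenAI
from pydub import AudioSegment
from pydub.silence import split_on_silence

client = OpenAI(api_key="dummy", base_url="http://localhost:8080/v1")

recording = AudioSegment.from_file("long_meeting.wav")

# Cut at pauses of at least 700 ms; tune these values to the recording.
chunks = split_on_silence(
    recording,
    min_silence_len=700,
    silence_thresh=recording.dBFS - 16,
    keep_silence=300,
)

parts = []
carry_over = ""  # tail of the previous transcript, reused as the next prompt
with tempfile.TemporaryDirectory() as tmpdir:
    for i, chunk in enumerate(chunks):
        path = os.path.join(tmpdir, f"chunk_{i}.mp3")
        chunk.export(path, format="mp3", bitrate="96k")  # compress before uploading
        with open(path, "rb") as audio:
            result = client.audio.transcriptions.create(
                model="whisper-large-v3",
                file=audio,
                language="en",
                prompt=carry_over,
            )
        parts.append(result.text)
        carry_over = result.text[-200:]  # last ~200 characters as context

full_transcript = " ".join(parts)
print(full_transcript)
```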
Translations
Whisper can also translate non-English audio directly into English text via the /v1/audio/translations endpoint. The request format mirrors the transcriptions endpoint; only the path changes.
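With the OpenAI SDK, the call goes through client.audio.translations instead of client.audio.transcriptions; the file name below is a placeholder:

```python
# Translate non-English speech directly into English text.
with open("interview_de.mp3", "rb") as audio:
    translation = client.audio.translations.create(
        model="whisper-large-v3",
        file=audio,
    )

print(translation.text)
```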
Reference
- Speech-to-text API: full request and response schema
- Available models: capabilities and endpoints
- Proxy configuration: how to set up the encryption layer