Speech-to-text
Privatemode transcribes audio to text with end-to-end encryption. Uploads, intermediate processing, and the returned transcript stay inaccessible to third parties, which makes the API suitable for recordings containing personal data, customer conversations, medical dictations, or other sensitive content.
The API is compatible with the OpenAI transcriptions API, so existing client code works unchanged once it points at the Privatemode proxy.
You can try out Privatemode's audio transcription directly in the web app. Just select the transcription model under settings.
Choosing a model
Two speech-to-text models are available:
| Model | When to use |
|---|---|
| Whisper large-v3 | Default choice. Robust across accents, noisy recordings, and lower-bitrate inputs. Broad multilingual support. |
| Voxtral Mini 3B | Lighter and faster. A reasonable choice for clean audio when latency matters. Fall back to Whisper if you observe quality issues. |
Both models accept the same request format and return the same response schema, so switching between them is a one-line change.
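The exact model identifiers exposed by your deployment may differ from the display names above. As a quick check, you can list them through the proxy's OpenAI-compatible models endpoint; this is a sketch that assumes the proxy exposes `/v1/models` and runs at the address used in the quick start below:

```python
from openai import OpenAI

# Sketch: list the model ids the proxy exposes, so you know what to pass as `model`.
client = OpenAI(
    api_key="dummy",  # not used; the proxy authenticates with the API
    base_url="http://localhost:8080/v1",
)

for model in client.models.list():
    print(model.id)
```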
Quick start
The example below sends a local audio file through the Privatemode proxy using the OpenAI Python SDK. Pass your API key to the proxy via --apiKey; it handles encryption transparently, so the SDK's api_key argument just needs a placeholder value.
```python
from openai import OpenAI

client = OpenAI(
    api_key="dummy",  # not used; the proxy authenticates with the API
    base_url="http://localhost:8080/v1",
)

with open("meeting.mp3", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio,
    )

print(transcript.text)
```
Supported formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm.
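If you build upload paths dynamically, a small pre-flight check avoids a round trip for files the API will reject. The helper below is hypothetical and only validates the file extension against the list above:

```python
from pathlib import Path

# Hypothetical helper: reject files whose extension is not in the supported list.
SUPPORTED_SUFFIXES = {".flac", ".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".ogg", ".wav", ".webm"}

def check_audio_file(path: str) -> Path:
    p = Path(path)
    if p.suffix.lower() not in SUPPORTED_SUFFIXES:
        raise ValueError(f"unsupported audio format: {p.suffix or 'no extension'}")
    return p

audio_path = check_audio_file("meeting.mp3")
```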
Improving transcription quality
By default, the model detects the spoken language automatically and infers vocabulary from the audio alone. Two optional fields let you guide it.
Set the language
Detection is best-effort, and a wrong guess degrades both accuracy and latency. If you know the language, pass it explicitly as a two-letter ISO 639-1 code:
```python
transcript = client.audio.transcriptions.create(
    model="whisper-large-v3",
    file=audio,
    language="en",
)
```
Setting language is also a prerequisite for verbose_json (see below).
Steer with a prompt
The prompt field biases the model toward a particular style, spelling, or vocabulary. Use it for domain-specific terms, product names, or punctuation conventions. Write the prompt in the same language as the audio.
```python
transcript = client.audio.transcriptions.create(
    model="whisper-large-v3",
    file=audio,
    language="en",
    prompt="Transcript of a product meeting discussing Privatemode, vLLM, and confidential computing.",
)
```
See the OpenAI prompting guide for further patterns.
Timestamps and segments
The default response contains only the final text. To get per-segment timestamps (useful for subtitles, audio search, or aligning a transcript back to the source), request verbose_json:
```python
transcript = client.audio.transcriptions.create(
    model="whisper-large-v3",
    file=audio,
    language="en",
    response_format="verbose_json",
)

for segment in transcript.segments:
    print(f"[{segment.start:.2f}s – {segment.end:.2f}s] {segment.text}")
```
Each segment includes start and end times, the transcribed text, and token-level metadata. The API reference documents the full schema.
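Since each segment carries start, end, and text, turning the response into subtitles is a short loop. The sketch below writes an SRT file from the `transcript` object of the previous example; the formatting helper is illustrative, not part of the API:

```python
def to_srt(segments) -> str:
    """Render verbose_json segments as SRT subtitle blocks."""
    def timestamp(seconds: float) -> str:
        ms = int(round(seconds * 1000))
        hours, ms = divmod(ms, 3_600_000)
        minutes, ms = divmod(ms, 60_000)
        secs, ms = divmod(ms, 1_000)
        return f"{hours:02}:{minutes:02}:{secs:02},{ms:03}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{timestamp(seg.start)} --> {timestamp(seg.end)}\n{seg.text.strip()}\n")
    return "\n".join(blocks)

with open("meeting.srt", "w", encoding="utf-8") as f:
    f.write(to_srt(transcript.segments))
```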
Working with long audio files
The Privatemode backend accepts files up to 50 MB per request. For longer recordings:
- Compress before uploading. Re-encoding `wav` as `mp3` or `ogg` at 64–128 kbit/s typically fits hours of speech under the limit without measurable accuracy loss. `flac` is a lossless option that often halves the original `wav` size. Avoid very low bit rates.
- Split on silence. Tools like `ffmpeg`'s `silencedetect` filter or `pydub` can chop a recording into chunks at natural pauses. Transcribe each chunk independently and concatenate the results.
- Stitch with prompts. When splitting, pass the tail of the previous chunk's transcript as the `prompt` for the next request. This helps the model carry names, terminology, and sentence flow across the boundary; see the sketch after this list.
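The sketch below combines the last two steps: it splits a recording at silences with pydub, re-encodes each chunk as mp3, transcribes the chunks in order, and feeds the tail of each transcript into the next request's prompt. It assumes pydub and ffmpeg are installed and that the proxy runs as in the quick start; the silence thresholds are illustrative starting points:

```python
import os
import tempfile

from openai import OpenAI
from pydub import AudioSegment
from pydub.silence import split_on_silence

client = OpenAI(api_key="dummy", base_url="http://localhost:8080/v1")

recording = AudioSegment.from_file("long_meeting.wav")

# Cut at pauses of at least 700 ms; tune these values to the recording.
chunks = split_on_silence(
    recording,
    min_silence_len=700,
    silence_thresh=recording.dBFS - 16,
    keep_silence=300,
)

parts = []
carry_over = ""  # tail of the previous transcript, reused as the next prompt
with tempfile.TemporaryDirectory() as tmpdir:
    for i, chunk in enumerate(chunks):
        path = os.path.join(tmpdir, f"chunk_{i}.mp3")
        chunk.export(path, format="mp3", bitrate="96k")  # compress before uploading
        with open(path, "rb") as audio:
            result = client.audio.transcriptions.create(
                model="whisper-large-v3",
                file=audio,
                language="en",
                prompt=carry_over,
            )
        parts.append(result.text)
        carry_over = result.text[-200:]  # last ~200 characters as context

full_transcript = " ".join(parts)
print(full_transcript)
```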
Translations
Whisper can also translate non-English audio directly into English text via the /v1/audio/translations endpoint. The request format mirrors the transcriptions endpoint; only the path changes.
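With the OpenAI SDK, the call goes through client.audio.translations instead of client.audio.transcriptions; the file name below is a placeholder:

```python
# Translate non-English speech directly into English text.
with open("interview_de.mp3", "rb") as audio:
    translation = client.audio.translations.create(
        model="whisper-large-v3",
        file=audio,
    )

print(translation.text)
```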
Reference
- Speech-to-text API: full request and response schema
- Available models: capabilities and endpoints
- Proxy configuration: how to set up the encryption layer