Chat completions API
Use the Privatemode chat completions API to generate text from a prompt via a large language model. The API is compatible with the OpenAI Chat Completions API. To generate text, send your requests to the Privatemode proxy. Chat requests and responses are encrypted, both in transit and during processing.
Example prompting
For prompting, use the following proxy endpoint:
POST /v1/chat/completions
This endpoint generates a response to a chat prompt.
Request body
- model (string): The name of a currently available model. Note that models are updated regularly, and support for older models is discontinued over time. Use GET /v1/models to get a list of available models, as described in the models API. The model name latest is deprecated and will be removed in a future update.
- messages (list): The prompts for which a response is generated.
- Additional parameters: These mirror the OpenAI API and are supported based on the model server's capabilities.
Returns
The response is a chat completion or chat completion chunk object containing:
- choices (list): The response generated by the model.
- Other parameters: Other fields are consistent with the OpenAI API specification.
Example request (default)
#!/usr/bin/env bash
curl localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.3-70b",
"messages": [
{
"role": "user",
"content": "Tell me a joke!"
}
]
}'
Example response
{
"id": "chat-6e8dc369b0614e2488df6a336c24c349",
"object": "chat.completion",
"created": 1727968175,
"model": "<model>",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "What do you call a fake noodle?\n\nAn impasta.",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 40,
"total_tokens": 54,
"completion_tokens": 14
},
"prompt_logprobs": null
}
Example request (streaming)
#!/usr/bin/env bash
curl localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.3-70b",
"messages": [
{
"role": "user",
"content": "Hi there!"
}
],
"stream" : true
}'
Example response
{"id":"chat-4f0bb41857044f52b5fa03fd3c752c8e","object":"chat.completion.chunk","created":1727968591,"model":"<model>","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}
{"id":"chat-4f0bb41857044f52b5fa03fd3c752c8e","object":"chat.completion.chunk","created":1727968591,"model":"<model>","choices":[{"index":0,"delta":{"content":"It"},"logprobs":null,"finish_reason":null}]}
{"id":"chat-4f0bb41857044f52b5fa03fd3c752c8e","object":"chat.completion.chunk","created":1727968591,"model":"<model>","choices":[{"index":0,"delta":{"content":"'s"},"logprobs":null,"finish_reason":null}]}
...
{"id":"chat-4f0bb41857044f52b5fa03fd3c752c8e","object":"chat.completion.chunk","created":1727968591,"model":"<model>","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":"stop","stop_reason":null}]}
Prompt caching
Privatemode supports prompt caching to reduce response latency when the first part of a prompt can be reused across requests. This is especially relevant for requests with long shared context or long conversation history.
Prompt caching is inactive by default. You can enable it in the Privatemode proxy. All requests sent via the same proxy share a cache and no further changes are required when making requests.
Alternatively, you can configure it per request via request field cache_salt, encoded as a string. All requests that use the same salt share a cache. For example, using the same salt in all requests of a user will create an isolated cache for that user.
The following examples show the same request with cURL and with the OpenAI Python client:
#!/usr/bin/env bash
curl localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.3-70b",
"messages": [{"role": "user", "content": "Tell me a joke!"}],
"cache_salt": "Y3+y3nLYf3a0CvT7VtuI0W656YXyl0Rdvd8BHI9e2rU="
}'
import openai
import os
client = openai.OpenAI(
api_key=os.environ.get("PRIVATE_MODE_API_KEY"), base_url="http://localhost:8080/v1"
)
client.chat.completions.create(
model="llama-3.3-70b",
messages=[{"role": "user", "content": "Tell me a joke!"}],
extra_body={
"cache_salt": "Y3+y3nLYf3a0CvT7VtuI0W656YXyl0Rdvd8BHI9e2rU=",
},
)
When using the OpenAI Python client, provide cache_salt as part of an extra_body argument.
If caching is configured both in the Privatemode proxy and in a request, the value from the request is used, allowing clients more granular control.
Cache salts should be kept private and have an entropy of at least 256 bits.
You can generate a secure salt with openssl rand -base64 32.
Structured outputs
Privatemode supports structured outputs using the same response_format argument as the OpenAI Chat Completions API. See OpenAI's structured outputs guide for details.
When generating JSON output, always instruct the model to output JSON as in the example below. Otherwise, some models may emit endless whitespace until the token limit is hit. If that still happens, set frequency_penalty above 0.0 (for example 0.1) to discourage repetitive tokens.
Also set max_completion_tokens to avoid overly long generations.
Example prompt when generating JSON.
{
"role": "user",
"content": "Summarize the meeting into JSON with topic and action items. ..."
}
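As a sketch, the request below combines these recommendations: it instructs the model to output JSON, passes a response_format of type json_schema as in the OpenAI API, and bounds the output with max_completion_tokens and frequency_penalty. The schema name and fields (meeting_summary, topic, action_items) are illustrative assumptions, not a fixed schema.
import os
import openai

client = openai.OpenAI(
    api_key=os.environ.get("PRIVATE_MODE_API_KEY"), base_url="http://localhost:8080/v1"
)

completion = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {
            "role": "user",
            "content": "Summarize the meeting into JSON with topic and action items. ...",
        }
    ],
    # Illustrative schema; define the fields your application needs.
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "meeting_summary",
            "schema": {
                "type": "object",
                "properties": {
                    "topic": {"type": "string"},
                    "action_items": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["topic", "action_items"],
            },
        },
    },
    # Bound the output length and discourage repetitive tokens (see above).
    max_completion_tokens=512,
    frequency_penalty=0.1,
)
print(completion.choices[0].message.content)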
System prompts
The offered models support setting a system prompt as part of the request's messages field (see the example below). You can use this to tailor the model's behavior to your needs.
Improving language accuracy
The model may occasionally make minor language mistakes, especially in languages other than English. To optimize language accuracy, you can set a system prompt. The following example significantly improves accuracy for the German language:
{
"role": "system",
"content": "Ensure every response is free from grammar and spelling errors. Use only valid words. Apply correct article usage, especially for languages with gender-specific articles like German. Follow standard grammar and syntax rules, and check spelling against standard dictionaries. Maintain consistency in style and terminology throughout."
}
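For illustration, the sketch below places this system prompt ahead of the user message in the messages field. The user prompt and the truncated system prompt are only examples; use the full system prompt from above in practice.
import os
import openai

client = openai.OpenAI(
    api_key=os.environ.get("PRIVATE_MODE_API_KEY"), base_url="http://localhost:8080/v1"
)

completion = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        # The system prompt comes first; insert the full prompt from the example above.
        {
            "role": "system",
            "content": "Ensure every response is free from grammar and spelling errors. ...",
        },
        # Example user prompt in German ("Briefly summarize the meeting.").
        {"role": "user", "content": "Fasse das Meeting kurz zusammen."},
    ],
)
print(completion.choices[0].message.content)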
Available chat completions models
To list the available chat completions models, call the /v1/models endpoint or see the models overview.
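For example, with the OpenAI Python client (a sketch, assuming the same proxy address as in the examples above):
import os
import openai

client = openai.OpenAI(
    api_key=os.environ.get("PRIVATE_MODE_API_KEY"), base_url="http://localhost:8080/v1"
)

# Print the IDs of all models currently served via the proxy.
for model in client.models.list():
    print(model.id)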