Chat completions API
Use the Privatemode chat completions API to generate text from a prompt via a large language model. The API is compatible with the OpenAI Chat Completions API. To generate text, send your requests to the privatemode-proxy. Chat requests and responses are encrypted, both in transit and during processing.
Example prompting
For prompting, use the following proxy endpoint:
POST /v1/chat/completions
This endpoint generates a response to a chat prompt.
Request body
- model (string): The name of a currently available model. Note that models are updated regularly, and support for older models is discontinued over time. Use GET /v1/models to get a list of available models as described in the models API. The model name latest is deprecated and will be removed in a future update.
- messages (list): The prompts for which a response is generated.
- Additional parameters: These mirror the OpenAI API and are supported based on the model server's capabilities.
Returns
The response is a chat completion or chat completion chunk object containing:
- choices (list): The response(s) generated by the model.
- Other parameters: Other fields are consistent with the OpenAI API specification.
Example request (default)
curl localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
"messages": [
{
"role": "user",
"content": "Tell me a joke!"
}
]
}'
Example response
{
"id": "chat-6e8dc369b0614e2488df6a336c24c349",
"object": "chat.completion",
"created": 1727968175,
"model": "<model>",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "What do you call a fake noodle?\n\nAn impasta.",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 40,
"total_tokens": 54,
"completion_tokens": 14
},
"prompt_logprobs": null
}
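The same request can also be sent with the OpenAI Python client by pointing it at the privatemode-proxy, as in the caching example further down. The following is a minimal sketch; the PRIVATE_MODE_API_KEY environment variable matches that later example, and the temperature parameter only illustrates an optional OpenAI-style parameter that is forwarded if the model server supports it.
import os

from openai import OpenAI

# Point the client at the local privatemode-proxy; it handles the encryption.
client = OpenAI(
    api_key=os.environ.get("PRIVATE_MODE_API_KEY"),
    base_url="http://localhost:8080/v1",
)

completion = client.chat.completions.create(
    model="ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
    messages=[{"role": "user", "content": "Tell me a joke!"}],
    temperature=0.7,  # optional parameter, forwarded like other OpenAI-style fields
)

print(completion.choices[0].message.content)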
Example request (streaming)
curl localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
"messages": [
{
"role": "user",
"content": "Hi there!"
}
],
"stream" : true
}'
Example response
{"id":"chat-4f0bb41857044f52b5fa03fd3c752c8e","object":"chat.completion.chunk","created":1727968591,"model":"<model>","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}
{"id":"chat-4f0bb41857044f52b5fa03fd3c752c8e","object":"chat.completion.chunk","created":1727968591,"model":"<model>","choices":[{"index":0,"delta":{"content":"It"},"logprobs":null,"finish_reason":null}]}
{"id":"chat-4f0bb41857044f52b5fa03fd3c752c8e","object":"chat.completion.chunk","created":1727968591,"model":"<model>","choices":[{"index":0,"delta":{"content":"'s"},"logprobs":null,"finish_reason":null}]}
...
{"id":"chat-4f0bb41857044f52b5fa03fd3c752c8e","object":"chat.completion.chunk","created":1727968591,"model":"<model>","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":"stop","stop_reason":null}]}
Prompt caching
Privatemode supports prompt caching to reduce response latency when the first part of a prompt can be reused across requests. This is especially relevant for requests with long shared context or long conversation history.
Prompt caching is inactive by default. You can enable it in the privatemode-proxy; all requests sent via the same proxy then share a cache, and no changes to individual requests are required.
Alternatively, you can configure it per request via the cache_salt request field, encoded as a string. All requests that use the same salt share a cache. For example, using the same salt in all of a user's requests creates an isolated cache for that user.
cURL:
curl localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
"messages": [{"role": "user", "content": "Tell me a joke!"}],
"cache_salt" : "Y3+y3nLYf3a0CvT7VtuI0W656YXyl0Rdvd8BHI9e2rU="
}'
Python:
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("PRIVATE_MODE_API_KEY"),
    base_url="http://localhost:8080/v1",
)

client.chat.completions.create(
    model="ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
    messages=[{"role": "user", "content": "Tell me a joke!"}],
    extra_body={
        "cache_salt": "Y3+y3nLYf3a0CvT7VtuI0W656YXyl0Rdvd8BHI9e2rU=",
    },
)
When using the OpenAI Python client, provide cache_salt as part of the extra_body argument.
If caching is configured both in the privatemode-proxy and in a request, the value from the request is used, allowing for more granular control by clients.
Cache salts should be kept private and have an entropy of at least 256 bits. You can generate a secure salt with openssl rand -base64 32.
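If you prefer to generate the salt in application code instead of with openssl, Python's standard library is sufficient. The sketch below is one possible approach; the function name is only illustrative.
import base64
import secrets

def new_cache_salt() -> str:
    # 32 random bytes = 256 bits of entropy, base64-encoded like the openssl example.
    return base64.b64encode(secrets.token_bytes(32)).decode()

print(new_cache_salt())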
System prompts
The offered model supports setting a system prompt as part of the request's messages field (see the example below). You can use this to tailor the model's behavior to your specific needs.
Improving language accuracy
The model may occasionally make minor language mistakes, especially in languages other than English. To optimize language accuracy, you can set a system prompt. The following example significantly improves accuracy for the German language:
{
"role": "system",
"content": "Ensure every response is free from grammar and spelling errors. Use only valid words. Apply correct article usage, especially for languages with gender-specific articles like German. Follow standard grammar and syntax rules, and check spelling against standard dictionaries. Maintain consistency in style and terminology throughout."
}
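In a full request, the system prompt is simply the first entry of the messages list, followed by the user messages. A minimal sketch with the OpenAI Python client against the local proxy; the shortened system prompt and German user message are only illustrative.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("PRIVATE_MODE_API_KEY"),
    base_url="http://localhost:8080/v1",
)

completion = client.chat.completions.create(
    model="ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
    messages=[
        # The system prompt tailors the model's behavior for all following turns.
        {"role": "system", "content": "Ensure every response is free from grammar and spelling errors. Use only valid words."},
        {"role": "user", "content": "Erzähl mir einen Witz!"},
    ],
)

print(completion.choices[0].message.content)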
Available chat completions models
To list the available chat completions models, call the /v1/models endpoint or see the models overview.
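You can also query the endpoint programmatically; a minimal sketch with the OpenAI Python client against the local proxy:
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("PRIVATE_MODE_API_KEY"),
    base_url="http://localhost:8080/v1",
)

# GET /v1/models lists the models currently served behind the proxy.
for model in client.models.list():
    print(model.id)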