Prompting
The prompting and response schema adheres to the OpenAI Chat API specification. Note that we don't use any OpenAI services; we only follow their interface definitions. To send prompts, use the privatemode-proxy as your endpoint. It takes care of end-to-end encryption with our GenAI services for you.
You can't send prompts directly to api.privatemode.ai. Always send your prompts to the privatemode-proxy, which handles encryption and communicates with the actual GenAI endpoint. For you, the proxy effectively acts as your GenAI endpoint.
Examples of a default and a stream-configured prompt and their respective responses are given below. This guide assumes the privatemode-proxy is running on localhost:8080.
Example prompting
For prompting, use the following proxy endpoint:
POST /v1/chat/completions
This endpoint generates a response to a chat prompt.
Request body
- model (string): The name of a currently available model. Note that models are updated regularly, and support for older models is discontinued over time. Use GET /v1/models to get a list of available models, as described in the List models section. The model name latest is deprecated and will be removed in a future update.
- messages (list): The prompts for which a response is generated.
- Additional parameters: These mirror the OpenAI API and are supported based on the model server's capabilities.
Returns
The response is a chat completion or chat completion chunk object containing:
- choices (list): The responses generated by the model.
- Other parameters: Other fields are consistent with the OpenAI API specification.
Default
Example request
curl localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
"messages": [
{
"role": "user",
"content": "Tell me a joke!"
}
]
}'
Example response
{
"id": "chat-6e8dc369b0614e2488df6a336c24c349",
"object": "chat.completion",
"created": 1727968175,
"model": "<model>",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "What do you call a fake noodle?\n\nAn impasta.",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 40,
"total_tokens": 54,
"completion_tokens": 14
},
"prompt_logprobs": null
}
Streaming
Example request
curl localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
"messages": [
{
"role": "user",
"content": "Hi there!"
}
],
"stream" : true
}'
Example response
{"id":"chat-4f0bb41857044f52b5fa03fd3c752c8e","object":"chat.completion.chunk","created":1727968591,"model":"<model>","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}
{"id":"chat-4f0bb41857044f52b5fa03fd3c752c8e","object":"chat.completion.chunk","created":1727968591,"model":"<model>","choices":[{"index":0,"delta":{"content":"It"},"logprobs":null,"finish_reason":null}]}
{"id":"chat-4f0bb41857044f52b5fa03fd3c752c8e","object":"chat.completion.chunk","created":1727968591,"model":"<model>","choices":[{"index":0,"delta":{"content":"'s"},"logprobs":null,"finish_reason":null}]}
...
{"id":"chat-4f0bb41857044f52b5fa03fd3c752c8e","object":"chat.completion.chunk","created":1727968591,"model":"<model>","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":"stop","stop_reason":null}]}
Prompt caching
Privatemode supports prompt caching to reduce response latency when the first part of a prompt can be reused across requests. This is especially relevant for requests with long shared context or long conversation history.
Prompt caching is inactive by default. You can enable it in the privatemode-proxy. All requests sent via the same proxy share a cache and no further changes are required when making requests.
Alternatively, you can configure it per request via the cache_salt request field, encoded as a string. All requests that use the same salt share a cache. For example, using the same salt in all requests of a user creates an isolated cache for that user.
cURL
curl localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
"messages": [{"role": "user", "content": "Tell me a joke!"}],
"cache_salt" : "Y3+y3nLYf3a0CvT7VtuI0W656YXyl0Rdvd8BHI9e2rU="
}'
Python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("PRIVATE_MODE_API_KEY"),
    base_url="http://localhost:8080/v1",
)
client.chat.completions.create(
    model="ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
    messages=[{"role": "user", "content": "Tell me a joke!"}],
    extra_body={
        "cache_salt": "Y3+y3nLYf3a0CvT7VtuI0W656YXyl0Rdvd8BHI9e2rU=",
    },
)
When using the OpenAI Python client, provide cache_salt as part of an extra_body argument.
If caching is configured in both the privatemode-proxy and a request, the value from the request is used, allowing clients more granular control.
Cache salts should be kept private and have an entropy of at least 256 bits. You can generate a secure salt with openssl rand -base64 32.
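If you generate salts programmatically, for example one per user, the Python standard library can do the same; a short sketch:
import base64
import secrets

# 32 random bytes provide 256 bits of entropy; base64-encode them for use as cache_salt.
cache_salt = base64.b64encode(secrets.token_bytes(32)).decode()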
Available models
Privatemode currently serves the following models for chat completions:
- Llama 3.3 70B: ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4
  - supports chat completions and tool calling
- Gemma 3 27B: leon-se/gemma-3-27b-it-fp8-dynamic
  - supports chat completions with text and image input
More models will be available soon.
List models
You can get a list of all available models using the models endpoint.
GET /v1/models
This endpoint lists all currently available models.
Returns
The response is a list of model objects.
Example request
curl localhost:8080/v1/models
Example response
{
"object": "list",
"data": [
{
"id": "ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
"object": "model",
"tasks": [
"generate",
"tool_calling"
]
},
{
"id": "leon-se/gemma-3-27b-it-fp8-dynamic",
"object": "model",
"tasks": [
"generate",
"vision"
]
},
{
"id": "intfloat/multilingual-e5-large-instruct",
"object": "model",
"tasks": [
"embed"
]
}
]
}
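The models endpoint can also be queried with the OpenAI Python client; a minimal sketch, again assuming the proxy runs on localhost:8080 and your API key is in the PRIVATE_MODE_API_KEY environment variable:
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("PRIVATE_MODE_API_KEY"),
    base_url="http://localhost:8080/v1",
)

# Print the IDs of all currently available models.
for model in client.models.list():
    print(model.id)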
Supported model tasks
The tasks response field provides a list of all tasks a model supports:
- embed: Create vector representations (embeddings) of input text.
- generate: Generate text completions or chat responses from prompts.
- tool_calling: Invoke function calls or tools (such as retrieval-augmented generation or plugins).
- vision: Process image input as part of chat prompts.
Note that tasks isn't part of the OpenAI API spec.
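As an illustration of the tool_calling task, the following sketch sends a function-calling request in the OpenAI tools format; the get_weather function and its schema are hypothetical:
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("PRIVATE_MODE_API_KEY"),
    base_url="http://localhost:8080/v1",
)

# A hypothetical tool definition in the OpenAI function-calling format.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
)

# If the model decides to call the tool, the call appears in tool_calls.
print(response.choices[0].message.tool_calls)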
System prompts
The offered models support setting a system prompt as part of the request's messages field (see example below). You can use this to tailor the model's behavior to your specific needs.
Improving language accuracy
The model may occasionally make minor language mistakes, especially in languages other than English. To optimize language accuracy, you can set a system prompt. The following example significantly improves accuracy for the German language:
{
"role": "system",
"content": "Ensure every response is free from grammar and spelling errors. Use only valid words. Apply correct article usage, especially for languages with gender-specific articles like German. Follow standard grammar and syntax rules, and check spelling against standard dictionaries. Maintain consistency in style and terminology throughout."
}
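For example, here is a sketch of a request combining this system prompt with a German user message via the OpenAI Python client (the system prompt is abbreviated; use the full text from the example above):
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("PRIVATE_MODE_API_KEY"),
    base_url="http://localhost:8080/v1",
)

response = client.chat.completions.create(
    model="ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
    messages=[
        # System prompt abbreviated here; see the full example above.
        {"role": "system", "content": "Ensure every response is free from grammar and spelling errors. ..."},
        {"role": "user", "content": "Erzähl mir einen Witz!"},
    ],
)
print(response.choices[0].message.content)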