Prompting
The prompting and response schema adheres to the OpenAI Chat API specification. Note that we don't use any OpenAI services; we only follow their interface definitions. To send prompts, use the privatemode-proxy as your endpoint. It takes care of end-to-end encryption with our GenAI services for you.
You can't send prompts directly to api.privatemode.ai. Always send your prompts to the privatemode-proxy, which handles encryption and communicates with the actual GenAI endpoint. For you, the proxy effectively acts as your GenAI endpoint.
Examples of a default and a stream-configured prompt and their respective responses are given below. This guide assumes the privatemode-proxy is running on localhost:8080.
Example prompting
For prompting, use the following proxy endpoint:
POST /v1/chat/completions
This endpoint generates a response to a chat prompt.
Request body
- model (string): The name of a currently available model. Note that models are updated regularly, and support for older models is discontinued over time. Use GET /v1/models to get a list of available models, as described in the List models section. The model name latest is deprecated and will be removed in a future update.
- messages (list): The prompts for which a response is generated.
- Additional parameters: These mirror the OpenAI API and are supported based on the model server's capabilities.
Returns
The response is a chat completion or chat completion chunk object containing:
- choices (list): The responses generated by the model.
- Other parameters: Other fields are consistent with the OpenAI API specification.
Default
Example request
curl localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
"messages": [
{
"role": "user",
"content": "Tell me a joke!"
}
]
}'
Example response
{
"id": "chat-6e8dc369b0614e2488df6a336c24c349",
"object": "chat.completion",
"created": 1727968175,
"model": "<model>",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "What do you call a fake noodle?\n\nAn impasta.",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 40,
"total_tokens": 54,
"completion_tokens": 14
},
"prompt_logprobs": null
}
Streaming
Example request
curl localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
"messages": [
{
"role": "user",
"content": "Hi there!"
}
],
"stream" : true
}'
Example response
{"id":"chat-4f0bb41857044f52b5fa03fd3c752c8e","object":"chat.completion.chunk","created":1727968591,"model":"<model>","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}
{"id":"chat-4f0bb41857044f52b5fa03fd3c752c8e","object":"chat.completion.chunk","created":1727968591,"model":"<model>","choices":[{"index":0,"delta":{"content":"It"},"logprobs":null,"finish_reason":null}]}
{"id":"chat-4f0bb41857044f52b5fa03fd3c752c8e","object":"chat.completion.chunk","created":1727968591,"model":"<model>","choices":[{"index":0,"delta":{"content":"'s"},"logprobs":null,"finish_reason":null}]}
...
{"id":"chat-4f0bb41857044f52b5fa03fd3c752c8e","object":"chat.completion.chunk","created":1727968591,"model":"<model>","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":"stop","stop_reason":null}]}
Prompt caching
Privatemode supports prompt caching to reduce response latency when the first part of a prompt can be reused across requests. This is especially relevant for requests with long shared context or long conversation history.
Prompt caching is inactive by default. You can enable it in the privatemode-proxy. All requests sent via the same proxy share a cache and no further changes are required when making requests.
Alternatively, you can configure it per request via the cache_salt request field, encoded as a string. All requests that use the same salt share a cache. For example, using the same salt in all requests of a user creates an isolated cache for that user.
cURL
curl localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
"messages": [{"role": "user", "content": "Tell me a joke!"}],
"cache_salt" : "Y3+y3nLYf3a0CvT7VtuI0W656YXyl0Rdvd8BHI9e2rU="
}'
Python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("PRIVATE_MODE_API_KEY"),
    base_url="http://localhost:8080/v1",
)
client.chat.completions.create(
    model="ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
    messages=[{"role": "user", "content": "Tell me a joke!"}],
    extra_body={
        "cache_salt": "Y3+y3nLYf3a0CvT7VtuI0W656YXyl0Rdvd8BHI9e2rU=",
    },
)
When using the OpenAI Python client, provide cache_salt as part of an extra_body argument.
If caching is configured in both the privatemode-proxy and a request, the value from the request is used, allowing clients more granular control.
Cache salts should be kept private and have an entropy of at least 256 bits. You can generate a secure salt with openssl rand -base64 32.
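If you generate salts programmatically, for example one per user, the Python standard library can do the same; a short sketch:
import base64
import secrets

# 32 random bytes provide 256 bits of entropy; base64-encode them for use as cache_salt.
cache_salt = base64.b64encode(secrets.token_bytes(32)).decode()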
Available models
Privatemode currently serves the following models for chat completions:
- Llama 3.3 70B: ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4
  - supports chat completions and tool calling
- Gemma 3 27B: leon-se/gemma-3-27b-it-fp8-dynamic
  - supports chat completions with text and image input
More models will be available soon.
List models
You can get a list of all available models using the models endpoint.
GET /v1/models
This endpoint lists all currently available models.
Returns
The response is a list of model objects.
Example request
curl localhost:8080/v1/models
Example response
{
"object": "list",
"data": [
{
"id": "ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
"object": "model",
"tasks": [
"generate",
"tool_calling"
]
},
{
"id": "leon-se/gemma-3-27b-it-fp8-dynamic",
"object": "model",
"tasks": [
"generate",
"vision"
]
},
{
"id": "intfloat/multilingual-e5-large-instruct",
"object": "model",
"tasks": [
"embed"
]
}
]
}
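The models endpoint can also be queried with the OpenAI Python client; a minimal sketch, again assuming the proxy runs on localhost:8080 and your API key is in the PRIVATE_MODE_API_KEY environment variable:
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("PRIVATE_MODE_API_KEY"),
    base_url="http://localhost:8080/v1",
)

# Print the IDs of all currently available models.
for model in client.models.list():
    print(model.id)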
Supported model tasks
The tasks response field provides a list of all tasks a model supports:
- embed: Create vector representations (embeddings) of input text.
- generate: Generate text completions or chat responses from prompts.
- tool_calling: Invoke function calls or tools (such as retrieval-augmented generation or plugins).
- vision: Process image input as part of chat prompts.
Note that tasks isn't part of the OpenAI API spec.
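As an illustration of the tool_calling task, the following sketch sends a function-calling request in the OpenAI tools format; the get_weather function and its schema are hypothetical:
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("PRIVATE_MODE_API_KEY"),
    base_url="http://localhost:8080/v1",
)

# A hypothetical tool definition in the OpenAI function-calling format.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
)

# If the model decides to call the tool, the call appears in tool_calls.
print(response.choices[0].message.tool_calls)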
System prompts
The offered models support setting a system prompt as part of the request's messages field (see example below). You can use this to tailor the model's behavior to your specific needs.
Improving language accuracy
The model may occasionally make minor language mistakes, especially in languages other than English. To optimize language accuracy, you can set a system prompt. The following example significantly improves accuracy for the German language:
{
"role": "system",
"content": "Ensure every response is free from grammar and spelling errors. Use only valid words. Apply correct article usage, especially for languages with gender-specific articles like German. Follow standard grammar and syntax rules, and check spelling against standard dictionaries. Maintain consistency in style and terminology throughout."
}
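For example, here is a sketch of a request combining this system prompt with a German user message via the OpenAI Python client (the system prompt is abbreviated; use the full text from the example above):
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("PRIVATE_MODE_API_KEY"),
    base_url="http://localhost:8080/v1",
)

response = client.chat.completions.create(
    model="ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
    messages=[
        # System prompt abbreviated here; see the full example above.
        {"role": "system", "content": "Ensure every response is free from grammar and spelling errors. ..."},
        {"role": "user", "content": "Erzähl mir einen Witz!"},
    ],
)
print(response.choices[0].message.content)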