Chat completions API
Use the Privatemode chat completions API to generate text from a prompt via a large language model. The API is compatible with the OpenAI Chat Completions API. To generate text, send your requests to the privatemode-proxy. Chat requests and responses are encrypted, both in transit and during processing.
Example prompting
For prompting, use the following proxy endpoint:
POST /v1/chat/completions
This endpoint generates a response to a chat prompt.
Request body
- model (string): The name of a currently available model. Note that models are updated regularly, and support for older models is discontinued over time. Use GET /v1/models to get a list of available models as described in the models API. The model name latest is deprecated and will be removed in a future update.
- messages (list): The prompts for which a response is generated.
- Additional parameters: These mirror the OpenAI API and are supported based on the model server's capabilities.
Returns
The response is a chat completion or chat completion chunk object containing:
- choices (list): The response(s) generated by the model.
- Other parameters: Other fields are consistent with the OpenAI API specification.
Example request (default)
curl localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
"messages": [
{
"role": "user",
"content": "Tell me a joke!"
}
]
}'
Example response
{
"id": "chat-6e8dc369b0614e2488df6a336c24c349",
"object": "chat.completion",
"created": 1727968175,
"model": "<model>",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "What do you call a fake noodle?\n\nAn impasta.",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 40,
"total_tokens": 54,
"completion_tokens": 14
},
"prompt_logprobs": null
}
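The same request can also be sent with the OpenAI Python client by pointing it at the privatemode-proxy, as in the caching example further down. The following is a minimal sketch; the PRIVATE_MODE_API_KEY environment variable matches that later example, and the temperature parameter only illustrates an optional OpenAI-style parameter that is forwarded if the model server supports it.
import os

from openai import OpenAI

# Point the client at the local privatemode-proxy; it handles the encryption.
client = OpenAI(
    api_key=os.environ.get("PRIVATE_MODE_API_KEY"),
    base_url="http://localhost:8080/v1",
)

completion = client.chat.completions.create(
    model="ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
    messages=[{"role": "user", "content": "Tell me a joke!"}],
    temperature=0.7,  # optional parameter, forwarded like other OpenAI-style fields
)

print(completion.choices[0].message.content)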
Example request (streaming)
curl localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
"messages": [
{
"role": "user",
"content": "Hi there!"
}
],
"stream" : true
}'
Example response
{"id":"chat-4f0bb41857044f52b5fa03fd3c752c8e","object":"chat.completion.chunk","created":1727968591,"model":"<model>","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}
{"id":"chat-4f0bb41857044f52b5fa03fd3c752c8e","object":"chat.completion.chunk","created":1727968591,"model":"<model>","choices":[{"index":0,"delta":{"content":"It"},"logprobs":null,"finish_reason":null}]}
{"id":"chat-4f0bb41857044f52b5fa03fd3c752c8e","object":"chat.completion.chunk","created":1727968591,"model":"<model>","choices":[{"index":0,"delta":{"content":"'s"},"logprobs":null,"finish_reason":null}]}
...
{"id":"chat-4f0bb41857044f52b5fa03fd3c752c8e","object":"chat.completion.chunk","created":1727968591,"model":"<model>","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":"stop","stop_reason":null}]}
Prompt caching
Privatemode supports prompt caching to reduce response latency when the first part of a prompt can be reused across requests. This is especially relevant for requests with long shared context or long conversation history.
Prompt caching is inactive by default. You can enable it in the privatemode-proxy; all requests sent via the same proxy then share a cache, and no changes to individual requests are required.
Alternatively, you can configure it per request via the cache_salt request field, encoded as a string. All requests that use the same salt share a cache. For example, using the same salt in all of a user's requests creates an isolated cache for that user.
cURL:
curl localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
"messages": [{"role": "user", "content": "Tell me a joke!"}],
"cache_salt" : "Y3+y3nLYf3a0CvT7VtuI0W656YXyl0Rdvd8BHI9e2rU="
}'
Python:
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("PRIVATE_MODE_API_KEY"),
    base_url="http://localhost:8080/v1",
)

client.chat.completions.create(
    model="ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
    messages=[{"role": "user", "content": "Tell me a joke!"}],
    extra_body={
        "cache_salt": "Y3+y3nLYf3a0CvT7VtuI0W656YXyl0Rdvd8BHI9e2rU=",
    },
)
When using the OpenAI Python client, provide cache_salt as part of the extra_body argument.
If caching is configured both in the privatemode-proxy and in a request, the value from the request is used, allowing for more granular control by clients.
Cache salts should be kept private and have an entropy of at least 256 bits. You can generate a secure salt with openssl rand -base64 32.
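If you prefer to generate the salt in application code instead of with openssl, Python's standard library is sufficient. The sketch below is one possible approach; the function name is only illustrative.
import base64
import secrets

def new_cache_salt() -> str:
    # 32 random bytes = 256 bits of entropy, base64-encoded like the openssl example.
    return base64.b64encode(secrets.token_bytes(32)).decode()

print(new_cache_salt())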
System prompts
The offered model supports setting a system prompt as part of the request's messages field (see the example below). You can use this to tailor the model's behavior to your specific needs.
Improving language accuracy
The model may occasionally make minor language mistakes, especially in languages other than English. To optimize language accuracy, you can set a system prompt. The following example significantly improves accuracy for the German language:
{
"role": "system",
"content": "Ensure every response is free from grammar and spelling errors. Use only valid words. Apply correct article usage, especially for languages with gender-specific articles like German. Follow standard grammar and syntax rules, and check spelling against standard dictionaries. Maintain consistency in style and terminology throughout."
}
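In a full request, the system prompt is simply the first entry of the messages list, followed by the user messages. A minimal sketch with the OpenAI Python client against the local proxy; the shortened system prompt and German user message are only illustrative.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("PRIVATE_MODE_API_KEY"),
    base_url="http://localhost:8080/v1",
)

completion = client.chat.completions.create(
    model="ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
    messages=[
        # The system prompt tailors the model's behavior for all following turns.
        {"role": "system", "content": "Ensure every response is free from grammar and spelling errors. Use only valid words."},
        {"role": "user", "content": "Erzähl mir einen Witz!"},
    ],
)

print(completion.choices[0].message.content)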
Available chat completions models
To list the available chat completions models, call the /v1/models endpoint or see the models overview.
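You can also query the endpoint programmatically; a minimal sketch with the OpenAI Python client against the local proxy:
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("PRIVATE_MODE_API_KEY"),
    base_url="http://localhost:8080/v1",
)

# GET /v1/models lists the models currently served behind the proxy.
for model in client.models.list():
    print(model.id)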