Version: 1.8

Server

The server side of Privatemode hosts the inference service and processes prompts securely. Its architecture is designed to be highly scalable while never compromising confidentiality.

It consists of three main parts:

The AI inference application itself, the workers
The Contrast framework for running the workers in a Confidential Computing Environment (CCE)
A key management service, the secret service

Workers

Workers are central to the backend. They host an AI model and serve inference requests. The inference code and the model are provided externally, by the platform provider and the model provider, respectively.

The containerized inference code, hereafter referred to as AI code, runs in a secure and isolated environment.

Each worker is a confidential container running in Contrast's runtime environment, which isolates containers (more precisely, Kubernetes Pods) in confidential VMs (CVMs). The runtime CVM is minimal, immutable, and verifiable through remote attestation. See the Contrast documentation for a detailed description.

Inference code

The inference code comes from an external project, such as Hugging Face TGI, vLLM, or NVIDIA Triton, and is updated frequently. Privatemode currently uses vLLM. The inference code is included in the remote attestation flow.

This code operates within a confidential computing environment that encrypts all data in memory. Within this secure environment, the inference code can access user data. To ensure that the inference code doesn't leak user data, the system relies on remote attestation, enabling the client to review and verify the code's integrity and behavior before execution.

This architecture ensures that (1) the infrastructure can't access user data or the inference code, and (2) the inference code doesn't leak user data to unprotected memory, the disk, or the network.
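The client-side check can be sketched as follows. This is a minimal, simplified Python sketch, not Privatemode's actual verification chain: `AttestationReport`, its fields, and the reference values are hypothetical, and the sketch assumes the report's signature was already validated against the hardware vendor's root of trust. It only illustrates the fail-closed principle: the prompt is released only if every reported value matches its expected reference value.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class AttestationReport:
    """Hypothetical, simplified view of attestation evidence whose signature is already verified."""

    measurement: bytes           # launch measurement of the confidential VM
    inference_code_digest: str   # digest of the inference container image


# Hypothetical reference values; a real client would obtain them from a trusted source.
EXPECTED_MEASUREMENT = hashlib.sha384(b"example-cvm-image").digest()
EXPECTED_CODE_DIGEST = "sha256:" + hashlib.sha256(b"example-inference-image").hexdigest()


def is_trusted(report: AttestationReport) -> bool:
    """Return True only if every reported value matches its reference value."""
    measurement_ok = hmac.compare_digest(report.measurement, EXPECTED_MEASUREMENT)
    code_ok = hmac.compare_digest(
        report.inference_code_digest.encode(), EXPECTED_CODE_DIGEST.encode()
    )
    return measurement_ok and code_ok


def send_prompt(report: AttestationReport, prompt: str) -> str:
    """Release the prompt only after successful verification (fail closed)."""
    if not is_trusted(report):
        raise PermissionError("attestation failed; prompt not sent")
    # Placeholder: a real client would now encrypt the prompt and send it to the worker.
    return f"sent {len(prompt)} characters"
```

A report with any mismatching value, for example a different inference image digest, makes `send_prompt` raise instead of sending data.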

Confidential computing environment

Confidential Computing Environments (CCEs) provide robust hardware-based security and workload isolation.

While encryption in transit (TLS) and at rest (disk encryption) have become widespread, confidential computing completes data protection by also securing data at runtime, so that data stays encrypted throughout its entire lifecycle.

In Privatemode, all workloads run inside AMD SEV-SNP-based Confidential VMs (CVMs).

With SEV-SNP, the memory of virtual machines (VMs) is encrypted. The processor manages encryption keys and ensures they're not accessible by untrusted software. Because encryption is hardware-accelerated, performance penalties are minimal. This reduces the attack surface, shielding workloads from:

Unauthorized Access: Even if a malicious actor compromises the server-side system, including the hypervisor or other VMs, SEV-SNP's encryption keeps your data unreadable.
Sophisticated Memory Attacks: SEV-SNP goes beyond confidentiality by adding integrity protection. It ensures that the data your VM reads is the same data it previously wrote, preventing tampering attempts.
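The integrity guarantee can be illustrated with a conceptual toy model of SEV-SNP's Reverse Map Table (RMP), which records which VM owns each physical page. This is not how the hardware is programmed: `ToyReverseMapTable` and its methods are hypothetical, and the model covers only one aspect of SNP integrity, namely rejecting writes from anyone other than the page's owner.

```python
class ToyReverseMapTable:
    """Conceptual toy model of SEV-SNP page ownership; not real hardware behavior."""

    HYPERVISOR = "hypervisor"

    def __init__(self) -> None:
        self._owner: dict[int, str] = {}     # page number -> owner
        self._memory: dict[int, bytes] = {}  # page number -> contents

    def assign(self, page: int, vm: str) -> None:
        """Assign a page to a guest VM; afterwards only that VM may write it."""
        self._owner[page] = vm

    def write(self, actor: str, page: int, data: bytes) -> None:
        """Write a page, rejecting writes from anyone other than its owner."""
        owner = self._owner.get(page, self.HYPERVISOR)
        if actor != owner:
            # In SEV-SNP, such a write causes a fault (an RMP violation) instead of succeeding.
            raise PermissionError(f"{actor} may not write page {page} owned by {owner}")
        self._memory[page] = data

    def read(self, page: int) -> bytes:
        """Read a page as its owner sees it."""
        return self._memory[page]
```

Because foreign writes are rejected, the guest reads back exactly what it last wrote, even if the hypervisor attempts to tamper with the page.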

Integrating AI accelerators into the CCE

The Privatemode API currently leverages NVIDIA's H100 AI accelerators to process large language models (LLMs). The H100’s confidential computing capabilities enable GPUs to be assigned to CVMs running on CPUs. This integration extends CCEs to include GPU workloads.

By using H100s, Privatemode applies key confidential computing features, such as remote attestation and isolation, to LLM processing, ensuring secure inference.

Encryption proxy

Each worker implements an encryption proxy responsible for decrypting prompts as they enter the CVM and encrypting responses before they leave it. This is independent of the CVM's low-level memory encryption and adds end-to-end encryption at the application level. Inside the CVM, your data remains protected from external access.

For a detailed explanation of the end-to-end encryption workflow, refer to our Encryption section.
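A minimal sketch of the proxy's request path follows, assuming AES-256-GCM with a symmetric key already shared between client and worker, and using the third-party `cryptography` package. The function names and the `b"request"`/`b"response"` associated data are hypothetical; the actual scheme is the one described in the Encryption section.

```python
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

NONCE_SIZE = 12  # 96-bit nonces, the standard size for AES-GCM


def seal(key: bytes, plaintext: bytes, context: bytes) -> bytes:
    """Encrypt and authenticate; the random nonce is prepended to the ciphertext."""
    nonce = os.urandom(NONCE_SIZE)
    return nonce + AESGCM(key).encrypt(nonce, plaintext, context)


def open_sealed(key: bytes, sealed: bytes, context: bytes) -> bytes:
    """Decrypt; raises cryptography.exceptions.InvalidTag if the data was altered."""
    nonce, ciphertext = sealed[:NONCE_SIZE], sealed[NONCE_SIZE:]
    return AESGCM(key).decrypt(nonce, ciphertext, context)


def handle_request(key: bytes, sealed_prompt: bytes, infer) -> bytes:
    """Hypothetical proxy handler: decrypt inside the CVM, run inference, re-encrypt."""
    prompt = open_sealed(key, sealed_prompt, b"request")
    response = infer(prompt)  # plaintext exists only inside the CVM
    return seal(key, response, b"response")
```

Binding the direction into the associated data means a sealed response can't be passed off as a request: decrypting it with the wrong label raises `InvalidTag`.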

Contrast Integration

Privatemode leverages Contrast to implement attestation.

Contrast Coordinator

The Contrast Coordinator acts as an attestation service and ensures that only verified workloads and infrastructure components participate in Privatemode. It performs remote attestation for workers, provides them with credentials for authentication within the service mesh, and enforces security policies.
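The Coordinator's core decision can be sketched as a manifest lookup followed by credential issuance. This is a hypothetical Python sketch, not Contrast's implementation: `ToyCoordinator`, `Evidence`, and the manifest layout are assumptions, the evidence is assumed to be signature-checked already, and an HMAC tag stands in for the certificate a real mesh CA would issue.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Hypothetical, already signature-checked attestation evidence from a workload."""

    workload_name: str
    policy_hash: str  # hash of the policy the CVM was launched with


class ToyCoordinator:
    """Sketch of manifest-based attestation; not Contrast's actual implementation."""

    def __init__(self, manifest: dict[str, str]) -> None:
        self._manifest = manifest                    # workload name -> allowed policy hash
        self._signing_key = secrets.token_bytes(32)  # stand-in for the mesh CA key

    def _tag(self, workload_name: str) -> bytes:
        return hmac.new(self._signing_key, workload_name.encode(), hashlib.sha256).digest()

    def attest(self, evidence: Evidence) -> bytes:
        """Issue a credential only if the workload's policy matches the manifest."""
        expected = self._manifest.get(evidence.workload_name)
        if expected is None or not hmac.compare_digest(
            expected.encode(), evidence.policy_hash.encode()
        ):
            raise PermissionError(f"{evidence.workload_name}: not allowed by manifest")
        return self._tag(evidence.workload_name)

    def is_valid(self, workload_name: str, credential: bytes) -> bool:
        """Check that a credential was issued by this Coordinator for this workload."""
        return hmac.compare_digest(self._tag(workload_name), credential)
```

Workloads whose launch policy isn't listed in the manifest never receive a credential and therefore can't join the mesh.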

Service Mesh

The Contrast service mesh determines which services have been verified by the Coordinator and are allowed to communicate within Privatemode.
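Assuming the mesh authenticates peers with mutual TLS and the Coordinator acts as the certificate authority, the server-side policy looks roughly like the following sketch using Python's standard `ssl` module. The function name and file parameters are hypothetical; the Contrast mesh is assumed to apply this transparently, so application code doesn't need to.

```python
import ssl
from typing import Optional


def mesh_server_context(
    ca_file: Optional[str] = None,
    cert_file: Optional[str] = None,
    key_file: Optional[str] = None,
) -> ssl.SSLContext:
    """Build a server-side TLS context that only accepts peers with mesh certificates."""
    # A bare server context loads no default CAs, so only the mesh CA will be trusted.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    # Require a client certificate: peers without a valid one fail the handshake.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if ca_file:
        # Hypothetically, the Coordinator's mesh root certificate.
        ctx.load_verify_locations(cafile=ca_file)
    if cert_file and key_file:
        # This workload's own certificate, issued after successful attestation.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

Starting from `PROTOCOL_TLS_SERVER` rather than `create_default_context()` avoids implicitly trusting the system's default CA bundle, so only certificates chaining to the mesh CA are accepted.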

Attestation Agent

The attestation agent is a Privatemode-specific component that handles GPU attestation and registers a GenAI worker at the Contrast Coordinator. It runs inside each worker as a separate container and is responsible for:

Verifying GPUs before they can be assigned to a worker.
Registering with the Coordinator to authenticate itself within the service mesh.
Contacting the secret service to retrieve decryption keys after successful attestation.
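The ordering of these responsibilities can be sketched as a fail-closed start-up sequence. This is a hypothetical Python sketch: `bootstrap_worker`, `AttestationError`, and the callables standing in for GPU verification, Coordinator registration, and the secret-service request are assumptions, not the agent's actual interfaces.

```python
from typing import Callable, Iterable


class AttestationError(Exception):
    """Raised when any step of the hypothetical start-up sequence fails."""


def bootstrap_worker(
    gpu_reports: Iterable[dict],
    verify_gpu: Callable[[dict], bool],
    register: Callable[[], bytes],
    fetch_keys: Callable[[bytes], dict],
) -> dict:
    """Run the agent's steps in order and stop at the first failure (fail closed)."""
    # 1. Verify every GPU before it can be assigned to the worker.
    reports = list(gpu_reports)
    if not reports:
        raise AttestationError("no GPU attestation reports")
    for report in reports:
        if not verify_gpu(report):
            raise AttestationError(f"GPU {report.get('id')} failed verification")
    # 2. Register with the Coordinator to obtain a service-mesh credential.
    credential = register()
    # 3. Present that credential to the secret service to retrieve decryption keys.
    return fetch_keys(credential)
```

Injecting the steps as callables keeps the ordering testable; `verify_gpu` could, for example, check a hypothetical `cc_mode` field and compare GPU measurements against reference values. If any GPU fails, the worker never registers and never receives keys.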

Disk mounter

The disk mounter is a Privatemode-specific component that handles mounting model weight disks as read-only devices through dm-verity. It runs inside each worker as a separate container and is responsible for:

Setting up a verity device with dm-verity, which checks the integrity of each disk block as it's read.
Mounting the model weights disk as read-only.
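The integrity check dm-verity performs on reads relies on a hash tree whose root hash is trusted. The following simplified Python sketch illustrates the idea with SHA-256 and 4 KiB blocks. It is not dm-verity's on-disk format (in practice, `veritysetup format` builds the tree), and all function names are hypothetical.

```python
import hashlib
import hmac

BLOCK_SIZE = 4096
FANOUT = BLOCK_SIZE // 32  # number of SHA-256 digests that fit into one hash block


def _h(salt: bytes, data: bytes) -> bytes:
    return hashlib.sha256(salt + data).digest()


def _hash_block(salt: bytes, digests: list[bytes]) -> bytes:
    """Hash a group of child digests, zero-padded to a full block."""
    return _h(salt, b"".join(digests).ljust(BLOCK_SIZE, b"\0"))


def build_tree(data: bytes, salt: bytes) -> list[list[bytes]]:
    """Return all tree levels; levels[-1] holds the single root hash."""
    if not data:
        raise ValueError("no data to protect")
    blocks = [data[i:i + BLOCK_SIZE].ljust(BLOCK_SIZE, b"\0")
              for i in range(0, len(data), BLOCK_SIZE)]
    levels = [[_h(salt, block) for block in blocks]]
    while True:
        prev = levels[-1]
        levels.append([_hash_block(salt, prev[i:i + FANOUT])
                       for i in range(0, len(prev), FANOUT)])
        if len(levels[-1]) == 1:
            return levels


def verify_block(levels: list[list[bytes]], salt: bytes, index: int,
                 block: bytes, root: bytes) -> bool:
    """Recompute the path from one data block up to the trusted root hash."""
    digest = _h(salt, block)
    for level in levels[:-1]:
        start = (index // FANOUT) * FANOUT
        group = list(level[start:start + FANOUT])
        group[index - start] = digest  # use the freshly computed hash, not the stored one
        digest = _hash_block(salt, group)
        index //= FANOUT
    return hmac.compare_digest(digest, root)
```

Because the root hash comes from a trusted source (assumed here to be covered by attestation), tampering with either a data block or a stored hash changes the recomputed root, and the read is rejected.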

Secret Service

The secret service is a Privatemode-specific component responsible for secure key management and distribution. It runs inside a confidential container through the Contrast runtime. It ensures that encryption keys are only released to verified GenAI workers after successful attestation.

Its primary role is to:

Store and manage encryption keys for GenAI workers.
Release keys only to successfully verified GenAI workers.
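The release rule can be sketched as a key store that checks the caller's attestation credential before every lookup. This is a hypothetical Python sketch: `ToySecretService`, `make_hmac_verifier`, and the HMAC-based credential check are stand-ins for however the secret service actually confirms that a worker was verified.

```python
import hashlib
import hmac
from typing import Callable

Verifier = Callable[[str, bytes], bool]


def make_hmac_verifier(ca_key: bytes) -> Verifier:
    """Hypothetical stand-in for checking a credential issued by the Coordinator."""
    def verify(worker: str, credential: bytes) -> bool:
        expected = hmac.new(ca_key, worker.encode(), hashlib.sha256).digest()
        return hmac.compare_digest(expected, credential)
    return verify


class ToySecretService:
    """Sketch: store keys and release them only to attested GenAI workers."""

    def __init__(self, verify_credential: Verifier) -> None:
        self._verify = verify_credential
        self._keys: dict[str, bytes] = {}

    def store(self, key_id: str, key: bytes) -> None:
        self._keys[key_id] = key

    def release(self, key_id: str, worker: str, credential: bytes) -> bytes:
        # Fail closed: an unverified caller learns nothing, not even whether the key exists.
        if not self._verify(worker, credential):
            raise PermissionError(f"{worker} is not an attested GenAI worker")
        return self._keys[key_id]
```

Checking the credential before the lookup keeps the service from revealing anything to unverified callers.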
