Server side
The server side of Privatemode hosts the inference service and processes prompts securely. Its architecture is designed to be highly scalable while never compromising confidentiality.
It consists of two main components: the workers and the attestation service.
Workers
Worker nodes are central to the backend. They host an AI model and serve inference requests. The necessary inference code and model are provided externally by the platform and the model provider, respectively.
The containerized inference code, referred to in the following as the AI code, runs in a secure and isolated environment.
Each worker is a confidential VM (CVM) running Privatemode's customized Linux. This OS is minimal, immutable, and verifiable through remote attestation. It hosts the inference server and mediates network traffic through a server-side encryption proxy.
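As a rough illustration, the composition of a single worker can be summarized in code. This is a descriptive sketch only; the field names and values below are hypothetical and not part of the Privatemode API.

```python
# Descriptive sketch only: the components of a single worker as described above.
# Field names and values are hypothetical, not part of the Privatemode API.
from dataclasses import dataclass

@dataclass
class Worker:
    cvm_technology: str    # confidential VM technology, e.g. "AMD SEV-SNP"
    os_image: str          # minimal, immutable OS, verified via remote attestation
    inference_server: str  # containerized AI code provided by the platform
    encryption_proxy: str  # server-side proxy mediating all network traffic

worker = Worker(
    cvm_technology="AMD SEV-SNP",
    os_image="Privatemode's customized Linux (attested launch measurement)",
    inference_server="vLLM",
    encryption_proxy="server-side encryption proxy",
)
```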
Inference code
The inference code is provided by an external party, such as HuggingFace TGI, vLLM, or NVIDIA Triton, and is frequently updated. In the case of Privatemode, the inference code is currently provided by vLLM. Including it in remote attestation and reviewing it regularly would be impractical.
This code operates within a confidential computing environment that encrypts all data in memory. Within this secure environment, the inference code can access user data. To ensure that the inference code doesn't leak user data, the system relies on remote attestation of this environment: clients can verify its integrity and the constraints it places on the inference code before any data is processed.
This architecture ensures that (1) the infrastructure can't access user data or the inference code, and (2) the inference code doesn't leak user data to unprotected memory, the disk, or the network.
A key principle is that the inference code can only communicate with the GPU and the encryption proxy, ensuring all communication is encrypted and preventing plaintext data leaks.
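The sketch below expresses this principle as a hypothetical egress allowlist. It is illustrative only: in practice the restriction is enforced by the CVM's configuration and network setup rather than by application-level checks, and the host name used here is made up.

```python
# Illustrative only: the egress principle stated above, expressed as a check.
# The host name is hypothetical; the GPU is reached over the local bus rather
# than the network, so it never appears as a network destination.
ALLOWED_EGRESS = {"encryption-proxy.local"}

def check_egress(destination: str) -> None:
    """Reject any outbound connection that is not on the allowlist."""
    if destination not in ALLOWED_EGRESS:
        raise PermissionError(f"egress to {destination!r} is not permitted")

check_egress("encryption-proxy.local")  # allowed: all traffic leaves encrypted
try:
    check_egress("example.com")         # blocked: would bypass the proxy
except PermissionError as err:
    print(err)
```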
Confidential computing environment
Confidential Computing Environments (CCEs) provide robust hardware-based security and workload isolation.
While encryption in transit (TLS) and at rest (disk encryption) have become widespread, confidential computing closes the remaining gap: it secures data at runtime, so data stays encrypted throughout its entire lifecycle.
In Privatemode, all workloads run inside AMD SEV-SNP based Confidential VMs (CVMs).
With SEV-SNP, the memory of virtual machines (VMs) is encrypted. The processor manages encryption keys and ensures they're not accessible by untrusted software. Because encryption is hardware-accelerated, performance penalties are minimal. This reduces the attack surface, shielding workloads from:
- Unauthorized Access: Even if a malicious actor compromises the server-side system including the hypervisor or other VMs, SEV-SNP's encryption makes your data unreadable.
- Sophisticated Memory Attacks: SEV-SNP goes beyond confidentiality by adding integrity protection. It ensures that the data your VM reads is the same data it previously wrote, preventing tampering attempts.
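To give a sense of what verifying such a CVM involves, the sketch below checks two fields of an SEV-SNP attestation report against reference values. It is deliberately simplified and uses made-up field names; a real verifier also validates the report's signature against AMD's certificate chain and additional policy fields.

```python
# Simplified, hypothetical sketch of checking an SEV-SNP attestation report.
# Field names are made up; a real verifier also checks the report's signature
# against AMD's certificate chain.
EXPECTED_MEASUREMENT = "reference launch measurement of the Privatemode OS image"

def verify_report(report: dict) -> bool:
    """Accept the report only if it attests the expected image and policy."""
    if report.get("launch_measurement") != EXPECTED_MEASUREMENT:
        return False  # different OS or firmware than expected
    if report.get("debug_allowed", True):
        return False  # debug access into the CVM must be disabled
    return True

print(verify_report({"launch_measurement": EXPECTED_MEASUREMENT, "debug_allowed": False}))
```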
Integrating AI accelerators into the CCE
The Privatemode API currently leverages NVIDIA's H100 AI accelerators to process large language models (LLMs). The H100’s confidential computing capabilities enable GPUs to be assigned to CVMs running on CPUs. This integration extends CCEs to include GPU workloads.
By using H100s, Privatemode applies key confidential computing features—such as remote attestation and isolation—to LLM processing, ensuring secure inference.
Encryption proxy
Each worker runs an encryption proxy that decrypts prompts as they enter the CVM for inference and encrypts replies as they leave it. This is independent of the CVM's low-level runtime encryption; it provides end-to-end encryption at the application level. Inside the CVM, your data remains protected from external access.
For a detailed explanation of the end-to-end encryption workflow, refer to our Encryption section.
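As a rough illustration of this application-level encryption, the following sketch uses AES-GCM from Python's `cryptography` package and assumes a 256-bit shared secret as the key. In Privatemode, the secret is exchanged via the attestation service, and the exact message format is described in the Encryption section.

```python
# Minimal sketch of application-level encryption, assuming an AES-GCM shared
# secret. In Privatemode, the secret comes from the attestation service.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

shared_secret = AESGCM.generate_key(bit_length=256)  # placeholder for the real shared secret
aesgcm = AESGCM(shared_secret)

# Client side: encrypt the prompt before it leaves the user's machine.
nonce = os.urandom(12)
ciphertext = aesgcm.encrypt(nonce, b"What is confidential computing?", None)

# Worker side: the encryption proxy decrypts inside the CVM, hands the plaintext
# to the inference server, and encrypts the reply the same way on the way out.
prompt = aesgcm.decrypt(nonce, ciphertext, None)
```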
Attestation service
The attestation feature of CVMs ensures the integrity and authenticity of the AI workers. This allows both the service provider and clients to verify the workers' integrity and that they're interacting with a trustworthy Privatemode deployment.
Because workers can be dynamically scaled and handle concurrent requests, individual verification is impractical. Instead, the attestation service (AS) handles attestation centrally. On the server side, the AS verifies each worker based on its attestation statement. On the client side, the AS provides a system-wide attestation endpoint and handles key exchanges for prompt encryption.
The AS runs in a CVM. Workers register with the AS, providing their attestation statements. Only verified workers can serve inference requests. The AS also manages the distribution of end-to-end encryption secrets. Verified workers synchronize with the AS to retrieve these secrets.
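The sketch below models this registration and secret-distribution flow in simplified form. All names are hypothetical, and the attestation check is reduced to a single comparison; the real attestation service verifies complete hardware attestation statements.

```python
# Hypothetical, simplified model of the worker registration flow described above.
import secrets

class AttestationService:
    def __init__(self, expected_measurement: str):
        self.expected_measurement = expected_measurement
        self.e2e_secret = secrets.token_bytes(32)  # end-to-end encryption secret
        self.verified_workers: set[str] = set()

    def register(self, worker_id: str, attestation_statement: dict) -> bool:
        """Admit a worker only if its attestation statement checks out."""
        if attestation_statement.get("launch_measurement") != self.expected_measurement:
            return False  # unverified workers never serve inference requests
        self.verified_workers.add(worker_id)
        return True

    def fetch_secret(self, worker_id: str) -> bytes:
        """Only verified workers may synchronize the encryption secret."""
        if worker_id not in self.verified_workers:
            raise PermissionError("worker is not attested")
        return self.e2e_secret

# Example flow: a worker registers and then retrieves the shared secret.
service = AttestationService(expected_measurement="expected measurement")
service.register("worker-1", {"launch_measurement": "expected measurement"})
secret = service.fetch_secret("worker-1")
```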