NVIDIA Dynamo Snapshot: Faster AI Inference Startup on Kubernetes

NVIDIA has introduced Dynamo Snapshot, a checkpoint and restore system designed to reduce cold-start latency for AI inference workloads running on Kubernetes. In production environments, scaling out inference replicas can take several minutes as the system pulls container images, loads model weights, and warms up CUDA kernels. GPUs sit idle during this process, which can lead to service level agreement (SLA) violations during traffic spikes. Dynamo Snapshot addresses this by serializing the state of a running inference worker and restoring it later, which allows near-instantaneous scaling.

The Mechanics of Checkpoint and Restore

Dynamo Snapshot relies on two primary tools: cuda-checkpoint and CRIU (Checkpoint/Restore in Userspace). A running inference worker holds both GPU-side device state and host-side process state. When a checkpoint is initiated, cuda-checkpoint first serializes the GPU state (CUDA contexts, streams, and memory mappings) into the host memory of the process. CRIU then captures the host-side process tree, including threads and file descriptors, and writes the data to storage.

To manage this on Kubernetes, NVIDIA provides a privileged DaemonSet called snapshot-agent. The agent runs on every node and handles checkpoint and restore for containers without requiring modifications to the underlying runc runtime. Because it is deployed as a DaemonSet, the system remains portable and avoids dependencies on cloud-provider feature gates, and checkpoint artifacts can be stored in flexible backend storage.
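
The ordering described above can be sketched as two small command builders. This is a minimal illustration, not the snapshot-agent implementation: the function names are hypothetical, the commands are only constructed (never executed), and the flags shown are the basic public options of the standalone cuda-checkpoint and criu tools. The restore order (host process first, then GPU state) is an assumption based on reversing the checkpoint steps.

```python
from pathlib import Path


def checkpoint_commands(pid: int, image_dir: Path) -> list[list[str]]:
    """Build the checkpoint commands in the order the article describes."""
    return [
        # Step 1: move GPU state (contexts, streams, mappings) into the
        # host memory of the target process.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        # Step 2: dump the host process tree, which now also holds the
        # GPU state, to the image directory.
        ["criu", "dump", "--tree", str(pid), "--images-dir", str(image_dir)],
    ]


def restore_commands(pid: int, image_dir: Path) -> list[list[str]]:
    """Build restore commands (assumed to mirror checkpoint in reverse)."""
    return [
        # Step 1: bring the host process tree back from the images.
        ["criu", "restore", "--images-dir", str(image_dir)],
        # Step 2: toggle again so the GPU state returns to the device.
        # Assumes the restored process keeps the same PID.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
    ]
```

Returning argument lists rather than running them keeps the sketch safe to read and test; a real agent would also need privileges, container namespace handling, and error recovery that are out of scope here.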

Optimizing Performance and Throughput

To ensure efficient scaling, Dynamo Snapshot incorporates several optimizations to reduce artifact size and restore times. By using the CUDA Virtual Memory Management API, the system can unmap and release physical memory for the KV cache while keeping the virtual address range stable. This significantly reduces the size of the checkpoint artifact; for a Qwen3-0.6B model on a B200 GPU, the artifact size drops from approximately 190 GiB to 6 GiB.
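
A toy model can make the effect on artifact size concrete. The sketch below is not CUDA code: the `Region` class and the split into a 6 GiB region and a 184 GiB KV cache region are illustrative assumptions chosen only to match the 190 GiB and 6 GiB figures above. The comments name the CUDA driver calls that a real implementation would likely use at each step.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class Region:
    """One reserved virtual address range and its optional physical backing."""

    name: str
    virtual_address: int  # stays fixed for the life of the process
    size_bytes: int
    backed: bool = True  # True while physical memory is mapped

    def release_physical(self) -> None:
        # Real code would call cuMemUnmap and cuMemRelease here while
        # keeping the range obtained from cuMemAddressReserve.
        self.backed = False

    def remap_physical(self) -> None:
        # Real code would call cuMemCreate, cuMemMap, and cuMemSetAccess
        # against the same virtual address.
        self.backed = True


def artifact_bytes(regions: list[Region]) -> int:
    """Only physically backed regions contribute to the checkpoint size."""
    return sum(r.size_bytes for r in regions if r.backed)


# Hypothetical layout: the 6 / 184 GiB split is assumed for illustration.
regions = [
    Region("weights_and_runtime", 0x7000_0000_0000, 6 * GIB),
    Region("kv_cache", 0x7100_0000_0000, 184 * GIB),
]
```

Calling `regions[1].release_physical()` before the checkpoint drops `artifact_bytes(regions)` from 190 GiB to 6 GiB while `virtual_address` is unchanged, so pointers into the KV cache remain valid once `remap_physical()` runs after restore.
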
Further improvements target the restore phase. NVIDIA has developed parallel memfd restoration and Linux native AIO for anonymous memory, both pending integration into upstream CRIU. These enhancements let the restore use thread pools and concurrent read operations to saturate storage bandwidth. In testing with a gpt-oss-120b model, they achieved a 7.9× speedup in CRIU restore time compared with upstream CRIU.
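
The general idea of using a thread pool to keep many reads in flight can be sketched in a few lines. This is an assumption-laden illustration of concurrent chunked reads, not CRIU's implementation (which is written in C and uses memfd and Linux native AIO); the function name, chunk size, and worker count are hypothetical, and `os.pread` is available on Unix-like systems only.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def parallel_read(path: str, chunk_size: int = 4 * 1024 * 1024, workers: int = 8) -> bytes:
    """Read a file by issuing fixed-size chunk reads from a thread pool."""
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    try:
        def read_chunk(offset: int) -> bytes:
            want = min(chunk_size, size - offset)
            buf = bytearray()
            # os.pread takes an explicit offset, so threads can share one
            # descriptor; loop because a single call may return fewer bytes.
            while len(buf) < want:
                part = os.pread(fd, want - len(buf), offset + len(buf))
                if not part:
                    break
                buf += part
            return bytes(buf)

        with ThreadPoolExecutor(max_workers=workers) as pool:
            # map() returns results in offset order even if the reads
            # complete out of order.
            chunks = list(pool.map(read_chunk, range(0, size, chunk_size)))
    finally:
        os.close(fd)
    return b"".join(chunks)
```

With a single sequential reader, the storage device often has only one request outstanding at a time; issuing several reads concurrently is what lets fast NVMe storage approach its rated bandwidth.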

Decoupling Weights with GPU Memory Service

To overcome the serial bottleneck in which GPU memory cannot be restored until weights are materialized in host memory, NVIDIA developed the GPU Memory Service (GMS). GMS decouples large model weights from the core CRIU checkpoint and moves them into a separate artifact. Process state and model weights can then be restored concurrently over different data paths, such as GPUDirect Storage or NVLink.

In a proof-of-concept configuration using eight striped local NVMe SSDs, GMS reduced the end-to-end startup time for a gpt-oss-120b model to under 5 seconds, a 21× reduction in latency. Dynamo Snapshot currently supports vLLM workers in a limited preview, with support for TensorRT-LLM and multi-GPU configurations planned for future releases.
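
A simple timing model shows why decoupling helps. It is a back-of-the-envelope sketch, not a measurement: the function names are hypothetical, and it assumes the two restore paths do not contend for the same bandwidth. When the steps run in sequence their durations add; when they run concurrently the slower one sets the total.

```python
def serial_restore_seconds(process_s: float, weights_s: float) -> float:
    """Weights and process state restored one after the other."""
    return process_s + weights_s


def concurrent_restore_seconds(process_s: float, weights_s: float) -> float:
    """Both restored at once over independent paths: the slower one dominates."""
    return max(process_s, weights_s)


def baseline_from_speedup(final_s: float, speedup: float) -> float:
    """Startup time implied by a final time and a reported reduction factor."""
    return final_s * speedup
```

For example, with hypothetical durations of 2.0 s for process state and 4.5 s for weights, the serial path takes 6.5 s and the concurrent path 4.5 s. Applied to the reported figures, `baseline_from_speedup(5.0, 21.0)` gives 105.0, so a startup time under 5 seconds with a 21× reduction implies a baseline of under roughly 105 seconds.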
