We often talk about cold starts in LLM inference as if they were just an annoyance - something to buffer against or mitigate with warm pools. But cold starts are actually a systems design signal. They tell us where our abstractions leak, where orchestration lags behind workload dynamics, and where latency isn't a function of compute but of movement and coordination.

From the perspective of distributed inference, cold starts are a mirror. They reflect how we schedule, serialize, and precondition large models in response to unpredictable traffic.

What’s Really Behind the Delay?

When we say "cold start," we often conflate several discrete operations:

  • Pulling multi-layer containers from remote registries

  • Initializing runtimes (Python, CUDA, Triton backends)

  • Downloading and unsharding 100+ GB model checkpoints

  • Allocating memory across devices and deserializing tensor states

  • Just-in-time kernel compilation or memory graph construction

The tail latency is often dominated by coordination delays across storage, network, and orchestration boundaries - not model compute.
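
To make that concrete, here is a minimal sketch of what a stage-by-stage breakdown might look like. The stage names and the placeholder work inside them are assumptions for illustration, not any particular runtime's API:

    import time
    from contextlib import contextmanager

    timings: dict[str, float] = {}

    @contextmanager
    def stage(name: str):
        # Record wall-clock time for one cold-start stage.
        start = time.perf_counter()
        try:
            yield
        finally:
            timings[name] = time.perf_counter() - start

    def placeholder(seconds: float) -> None:
        # Stand-in for real work (image pull, weight fetch, etc.).
        time.sleep(seconds)

    with stage("pull_image"):
        placeholder(0.05)   # registry -> local layer cache
    with stage("init_runtime"):
        placeholder(0.02)   # interpreter, CUDA context, backend init
    with stage("fetch_weights"):
        placeholder(0.10)   # checkpoint download + unsharding
    with stage("load_tensors"):
        placeholder(0.04)   # device allocation + tensor deserialization
    with stage("jit_compile"):
        placeholder(0.03)   # kernel compilation / memory graph capture

    # Rank stages by their contribution to the total delay.
    for name, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
        print(f"{name:>14}: {seconds * 1000:.1f} ms")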

What the Fastest Paths Are Showing

Some of the most promising patterns come not from model optimization, but from data systems thinking:

  • Chunked and parallel fetches: Treating model weights as parallel streams, not monoliths

  • Hot-path caching: Ensuring container and dependency layers are always locally resident

  • Snapshotting loaded state: Saving initialized memory as a restorable artifact

  • Lazy mounting: Using remote filesystems that trigger load-on-access semantics

These aren’t novel per se, but their composition changes the dynamics of responsiveness. The interesting question isn’t “how fast can we load a model?” but “what are the minimal units of readiness, and how can we reorder them?”
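
As an illustration of the first pattern, here is a minimal sketch of pulling a single large weight file as parallel byte-range streams. It assumes the server honors HTTP Range requests and reports Content-Length on HEAD; the URL, chunk size, and worker count are made up:

    from concurrent.futures import ThreadPoolExecutor
    import requests

    URL = "https://example.com/checkpoints/model-00001-of-00008.safetensors"  # hypothetical
    CHUNK = 64 * 1024 * 1024   # 64 MiB per range request (tunable assumption)
    WORKERS = 8

    def fetch_range(start: int, end: int) -> bytes:
        # Fetch bytes [start, end] of the remote file via an HTTP Range request.
        resp = requests.get(URL, headers={"Range": f"bytes={start}-{end}"}, timeout=60)
        resp.raise_for_status()
        return resp.content

    def parallel_download(path: str) -> None:
        total = int(requests.head(URL, timeout=60).headers["Content-Length"])
        ranges = [(off, min(off + CHUNK, total) - 1) for off in range(0, total, CHUNK)]
        with ThreadPoolExecutor(max_workers=WORKERS) as pool:
            chunks = pool.map(lambda r: fetch_range(*r), ranges)
            with open(path, "wb") as f:
                for chunk in chunks:   # map() yields results in submission order
                    f.write(chunk)

    parallel_download("/tmp/model-shard.safetensors")

For brevity the chunks are buffered in memory and written in order; a production loader would stream them to preallocated offsets on disk or straight into pinned host memory.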

Elasticity and the Myth of Scaling to Zero

In theory, serverless AI implies infinite elasticity. In practice, the moment you introduce models with 40GB+ of weights and long init paths, the idea of scaling to zero becomes a performance liability.

Do we really want elasticity, or do we want just-in-time readiness under bounded latency? This is a real design tension: cost vs. responsiveness, abstraction vs. control.
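
One crude way to frame that tension is to ask whether the measured cold start even fits inside the latency budget, and if not, how much warm capacity the ramp demands. All of the numbers below are illustrative assumptions, not measurements:

    import math

    cold_start_s = 180.0     # measured cold start for a 40 GB+ model (assumed)
    latency_slo_s = 2.0      # budget for time-to-first-token (assumed)
    arrival_rate_rps = 0.5   # request rate during an idle-to-busy ramp (assumed)
    replica_rps = 4.0        # sustainable throughput of one warm replica (assumed)

    # If the cold start doesn't fit in the latency budget, scaling to zero means
    # the first request after any idle period blows the SLO by construction.
    can_scale_to_zero = cold_start_s <= latency_slo_s

    # Otherwise keep enough warm capacity to absorb the ramp while new replicas
    # come up (a deliberately crude lower bound, not a queueing model).
    min_warm = 0 if can_scale_to_zero else math.ceil(arrival_rate_rps / replica_rps)

    print(f"scale-to-zero viable: {can_scale_to_zero}, min warm replicas: {min_warm}")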

Measuring What Matters

We still lack both a robust vocabulary and the instrumentation for cold start analysis. Teams ask “how long does a cold start take?” but rarely break it down by:

  • I/O wait time vs. deserialization time vs. runtime overhead

  • Image pull vs. weight fetch vs. kernel compilation

  • Correlation with model size, hardware type, or scheduler policy

Without this, it’s hard to reason about optimizations - and nearly impossible to generalize lessons across deployments.
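
One way to start is to emit every init stage as a structured record tagged with the dimensions you want to correlate against. The schema and field names below are a sketch, not an existing standard:

    import json
    import time
    from dataclasses import dataclass, asdict

    @dataclass
    class StageSpan:
        stage: str             # e.g. "image_pull", "weight_fetch", "kernel_compile"
        duration_s: float
        model: str
        model_size_gb: float
        gpu: str
        scheduler_policy: str

    def emit(span: StageSpan) -> None:
        # One JSON line per stage; any log or trace pipeline could ingest this.
        print(json.dumps(asdict(span)))

    # Example with made-up values:
    start = time.perf_counter()
    # ... weight fetch would happen here ...
    emit(StageSpan(
        stage="weight_fetch",
        duration_s=time.perf_counter() - start,
        model="llama-70b",
        model_size_gb=140.0,
        gpu="H100",
        scheduler_policy="bin-packing",
    ))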

Open Questions

What would it look like to design an LLM runtime where cold start time is bounded by a fixed constant, regardless of model size? Could we speculatively load parts of a model based on prior traffic patterns? Should inference runtimes expose structured traces of init stages as first-class observability data?
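
On the second question, even a naive predictor gives a feel for what speculative loading could look like. The decay-weighted counter and the preload hook below are assumptions for illustration, not an existing runtime feature:

    from collections import defaultdict

    DECAY = 0.9          # weight on history per tick (assumed)
    PRELOAD_TOP_K = 2    # how many models to keep speculatively warm (assumed)

    scores: dict[str, float] = defaultdict(float)

    def record_request(model: str) -> None:
        scores[model] += 1.0

    def tick(preload) -> None:
        # Decay history, then ask the runtime to preload the likeliest models.
        for model in list(scores):
            scores[model] *= DECAY
        likely = sorted(scores, key=scores.get, reverse=True)[:PRELOAD_TOP_K]
        for model in likely:
            preload(model)   # e.g. fetch weights or restore a snapshot off the hot path

    # Example usage with a stand-in preload hook:
    record_request("llama-70b"); record_request("llama-70b"); record_request("mixtral")
    tick(lambda m: print(f"preloading {m}"))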

Cold starts aren’t just something to suppress. They’re a systems artifact - and understanding them is a path to building better serving infrastructure.

The gap between theoretical throughput and observed latency starts here.

Curious what others are measuring, and whether anyone’s managed to bound it.
