
In large-scale inference systems, we often talk about latency, QPS, or cost per token. But underneath it all sits a simpler, harder question: how much are we actually using the GPUs we pay for?
GPU underutilization is the silent tax in production LLM systems. The causes are subtle: model replicas that sit idle waiting for traffic, poor batch packing, long-prompt requests that stall the scheduler, or memory fragmentation that prevents colocation. And the impact is large - enterprises provision and pay for peak GPU capacity, not for what the hardware does on average.
This makes one question increasingly urgent: how do we maximize GPU usage across heterogeneous, spiky, and multi-model workloads?
The Multi-Model Problem
A single large model will saturate a GPU if given enough requests. But most production systems serve a mix: different model sizes, different prompt lengths, different latency classes.
Some models need low-latency completions. Others handle async background jobs. Some are tiny (2B parameters), others massive (70B). And you want all of them to fit, ideally without reserving an entire A100 for each.
Colocation becomes the primitive.
But then you have to answer:
Which models can coexist on a single GPU without thrashing memory?
How do you prioritize workloads with conflicting latency goals?
What happens when a burst arrives that needs to evict something slower?
This is where GPU utilization stops being a deployment parameter and becomes a scheduling problem.
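As a concrete (and deliberately simplified) take on the first question, here is a minimal admission check, assuming each model's weight footprint and a per-replica KV-cache reservation are known up front. The ModelSpec fields, headroom value, and sizes are illustrative assumptions, not figures from any particular serving stack:

```python
from dataclasses import dataclass

@dataclass
class ModelSpec:
    name: str
    weights_gb: float      # memory for parameters at serving precision
    kv_reserve_gb: float   # KV-cache budget reserved for in-flight requests

def can_colocate(models: list[ModelSpec], gpu_memory_gb: float,
                 headroom_gb: float = 4.0) -> bool:
    """Admission check: do these replicas fit on one GPU with headroom
    left over for activations, CUDA context, and fragmentation?"""
    needed = sum(m.weights_gb + m.kv_reserve_gb for m in models)
    return needed + headroom_gb <= gpu_memory_gb

# Illustrative numbers: a 2B chat model and a 7B background model on an 80 GB GPU.
replicas = [
    ModelSpec("chat-2b", weights_gb=4.0, kv_reserve_gb=6.0),
    ModelSpec("batch-7b", weights_gb=14.0, kv_reserve_gb=12.0),
]
print(can_colocate(replicas, gpu_memory_gb=80.0))  # True: ~36 GB needed
```

The sum is the easy part; the hard part is the estimates feeding it. KV reservations depend on batch shape and sequence length, which is exactly where the scheduling problem begins.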
MIG, Sharding, and Logical Isolation
Some infra teams lean on hardware features like NVIDIA’s MIG (Multi-Instance GPU) to isolate small models. Others shard models across GPUs to run large ones in parallel. Both are partial solutions.
MIG helps with isolation and stability, but it fragments capacity into fixed-size slices. Sharding makes large models fit, but it pins memory on every participating GPU even when the model is idle. Neither solves the dynamic allocation problem for mixed workloads.
What’s needed is logical multiplexing - the ability to:
Load multiple models into GPU memory simultaneously
Route traffic dynamically based on current memory and latency budget
Reclaim memory when model use patterns shift
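A toy sketch of that multiplexing loop, assuming a single GPU and a least-recently-used reclaim policy. The GpuMultiplexer class and the sizes below are hypothetical; a real router also has to account for load latency, in-flight requests, and pinned replicas:

```python
import time

class GpuMultiplexer:
    """Toy illustration of logical multiplexing on one GPU: load models on
    demand, track free memory, and evict the least recently used replica
    when a new model needs space."""

    def __init__(self, total_gb: float):
        self.total_gb = total_gb
        self.loaded = {}  # name -> (size_gb, last_used_timestamp)

    def free_gb(self) -> float:
        return self.total_gb - sum(size for size, _ in self.loaded.values())

    def route(self, model: str, size_gb: float) -> str:
        if model not in self.loaded:
            # Reclaim memory: evict the coldest replica until the new one fits.
            # (Naively assumes a single model always fits on an empty GPU.)
            while self.free_gb() < size_gb and self.loaded:
                coldest = min(self.loaded, key=lambda m: self.loaded[m][1])
                del self.loaded[coldest]
        # Load (or refresh recency for) the replica and dispatch to it.
        self.loaded[model] = (size_gb, time.time())
        return f"dispatch to {model} (free: {self.free_gb():.0f} GB)"

mux = GpuMultiplexer(total_gb=80)
print(mux.route("chat-2b", 10))
print(mux.route("batch-7b", 26))
print(mux.route("code-70b", 70))  # forces eviction of both colder replicas
```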
Memory Pressure and Eviction
Serving infra should track not just current usage, but predicted usage. That means estimating memory needs per request, understanding how the KV cache grows with sequence length, and anticipating eviction pressure.
If a long prompt is about to blow past the KV-cache budget of your 70B replica, you should know that before decoding starts. Otherwise you stall the GPU and kill throughput.
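Knowing that ahead of time is mostly arithmetic. Here is a sketch of the per-request KV-cache estimate using the standard keys-plus-values layout; the 70B-class dimensions below (80 layers, 8 KV heads with grouped-query attention, head dim 128, fp16 cache) are illustrative:

```python
def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Per-request KV cache: keys and values for every layer and position."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative 70B-class config with grouped-query attention and an fp16 cache.
per_request = kv_cache_bytes(seq_len=8192, num_layers=80,
                             num_kv_heads=8, head_dim=128)
print(f"{per_request / 2**30:.2f} GiB per 8k-token request")  # 2.50 GiB
```

Multiply that by the number of concurrent long-context requests and it is easy to see how a handful of them can crowd out a colocated small model.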
Dynamic schedulers (like those some vLLM forks are experimenting with) are starting to reason about these tradeoffs. But it's still early.
Utilization as a First-Class Metric
Too many teams measure cost-efficiency in terms of instance count or average latency. But if your GPUs are sitting at 35% memory usage and 20% streaming multiprocessor usage, you’re leaving capacity untapped.
Useful telemetry includes:
Active vs. idle tokens per second
Memory fragmentation over time
GPU cycles spent waiting for input
Only with this kind of visibility can autoscalers and workload routers make decisions that aren’t just reactive, but optimal.
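Device-level counters are the easy part. A minimal sampler over NVML (via the nvidia-ml-py bindings) looks like the sketch below; note that the token-level signals above - active vs. idle tokens per second, cycles spent waiting for input - are not visible to NVML and have to be emitted by the serving layer itself:

```python
import pynvml  # pip install nvidia-ml-py

def sample_gpu_utilization(device_index: int = 0) -> dict:
    """Coarse device-level counters from NVML. Token-level metrics
    must come from the serving layer; NVML cannot see them."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percent, last sample window
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)          # bytes
        return {
            "sm_util_pct": util.gpu,
            "mem_used_pct": 100.0 * mem.used / mem.total,
        }
    finally:
        pynvml.nvmlShutdown()

print(sample_gpu_utilization())
```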
The Opportunity
If we solve this - if we can dynamically colocate models, route traffic based on token pressure and memory profile, and reclaim unused GPU time - we unlock a 2–5x cost reduction in inference-heavy systems without changing the models at all.
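A back-of-the-envelope way to see where that range comes from, assuming cost scales with GPU-hours and effective throughput scales roughly linearly with utilization (both simplifications; the price and token rate below are made up):

```python
def cost_per_million_tokens(gpu_hourly_usd: float,
                            tokens_per_sec_at_full_util: float,
                            effective_util: float) -> float:
    """Cost per token falls as the share of usable throughput rises."""
    tokens_per_hour = tokens_per_sec_at_full_util * effective_util * 3600
    return gpu_hourly_usd / tokens_per_hour * 1e6

# Hypothetical: a $3/hr GPU that could stream 2,000 tok/s if kept busy.
print(cost_per_million_tokens(3.0, 2000, 0.25))  # ~$1.67 per M tokens at 25% utilization
print(cost_per_million_tokens(3.0, 2000, 0.75))  # ~$0.56 per M tokens at 75%: ~3x cheaper
```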
It’s not a modeling problem. It’s a scheduling and observability one.