Right-Sizing GPUs in Kubernetes
A Platform Engineer's Guide to AI Infrastructure Efficiency
Your GPU cluster looks perfect on the dashboard. Finance sees 100% utilisation, and everyone is happy because the numbers show you're using all the resources you're paying for.
But if you ask your ML team, you'll get a different story. They can't get GPU access, training jobs are stuck in queues, and inference runs slower than expected. The dashboard tells one story, but the real situation is different.

The GPU Yield Problem
Yield is the ratio of what you get out to what you put in. For GPUs, it's the useful work you get from the hardware compared to what you pay for.
In Kubernetes, allocation is just a reservation. When a pod sets `limits: nvidia.com/gpu: 1`, Kubernetes reserves one whole GPU for that pod. But a GPU can show 100% allocation while the hardware is running at only 2% utilisation.
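A minimal pod spec illustrating such a reservation (a sketch; the pod name and image are placeholders, not from any real cluster):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-job            # placeholder name
spec:
  containers:
    - name: trainer
      image: registry.example.com/trainer:latest  # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1     # reserves one whole GPU for this pod,
                                # regardless of how busy the GPU actually is
```

Once this pod is scheduled, the GPU counts as 100% allocated on the dashboard even if the container leaves it nearly idle.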
Kubernetes reports 4 out of 4 GPUs allocated, and finance sees 100% usage on the billing dashboard. In reality, though, actual productive work often amounts to only 20-30% of that capacity.
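The gap can be made concrete with a small back-of-the-envelope calculation. This is a hypothetical illustration, not a measurement; the function name and the 25% utilisation figure are assumptions chosen to match the 20-30% range above:

```python
def gpu_yield(allocated: int, total: int, avg_utilisation: float) -> float:
    """Fraction of paid-for GPU capacity doing useful work.

    allocated/total is what Kubernetes (and the billing dashboard) reports;
    avg_utilisation is what the hardware actually did during that time.
    """
    allocation_ratio = allocated / total
    return allocation_ratio * avg_utilisation


# 4 of 4 GPUs allocated (dashboard says 100%), but the hardware
# averaged only 25% utilisation while they were reserved.
y = gpu_yield(allocated=4, total=4, avg_utilisation=0.25)
print(f"Allocation: 100%, actual yield: {y:.0%}")  # → actual yield: 25%
```

You pay for the allocation ratio; you only get the yield.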
What's Inside
Four chapters that take you from understanding the problem to implementing solutions:
- The GPU Yield Problem: Why allocation and utilisation rarely match, and how the gap between what you pay for and what you get grows in predictable ways
- Measuring What Actually Matters: Moving beyond nvidia-smi utilisation percentages to metrics that reflect real business value
- The Architecture Decision: Time-slicing vs MIG vs MPS -- when each approach makes sense and what trade-offs you're actually making
- Full-Stack Right-Sizing: Practical strategies for closing the yield gap across your entire GPU infrastructure
Who This Is For
This ebook is for platform engineers, ML infrastructure teams, and anyone managing GPU clusters in Kubernetes. If you're the person who has to explain why the GPU bill is so high while your ML engineers complain they can't get access, this is for you.