BlitzScale

What is the auto-scaling problem? Why is it important, and why is it difficult?

  • Problem: Automatically adjusting the number of model-serving instances to match fluctuating request demand.

  • Importance:

    • Maximizes goodput, i.e., the throughput of requests that meet their SLOs.
    • Reduces cost by avoiding overprovisioning expensive GPU instances.
    • Essential for Model-as-a-Service (MaaS) systems, where workloads vary rapidly.
  • Difficulty:

    • Request arrival rates fluctuate unpredictably in the short term.
    • Each request has variable latency and memory usage depending on input length.
    • Large models require expensive initialization and parameter loading.
    • Scaling decisions must balance latency, memory, and hardware utilization under uncertainty (a minimal sizing sketch follows this list).
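
As a toy illustration of the sizing half of this problem, the sketch below computes a target replica count from an arrival rate and a per-instance throughput estimate. The function and parameter names are hypothetical (this is not BlitzScale’s policy), and it deliberately ignores the hard parts listed above: bursty arrivals and slow instance startup.

```python
import math

def target_instances(arrival_rate_rps: float,
                     per_instance_rps: float,
                     headroom: float = 0.2,
                     min_instances: int = 1) -> int:
    """Replicas needed to absorb the arrival rate with burst headroom.

    per_instance_rps: request rate one replica sustains within its SLO.
    headroom: spare capacity fraction reserved for short-term bursts.
    """
    needed = arrival_rate_rps * (1.0 + headroom) / per_instance_rps
    return max(min_instances, math.ceil(needed))

# Example: 240 req/s arriving, each replica sustains 50 req/s within SLO.
print(target_instances(240, 50))  # -> 6
```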

What is the difference between stop-the-world and live auto-scaling?

  • Stop-the-world auto-scaling:

    • New instances cannot serve requests until the full model is loaded.
    • Model loading from SSD or host memory causes long cold-start delays.
    • Leads to SLO violations during scaling events.
  • Live auto-scaling:

    • New instances start serving requests before full model loading completes.
    • Uses partial model availability and cooperative execution.
    • Significantly reduces tail latency during scale-up (a back-of-envelope comparison follows this list).
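
To see why the distinction matters, here is a back-of-envelope comparison of how long a new instance waits before serving its first request. All numbers are illustrative assumptions, not measurements from the paper.

```python
# Stop-the-world: the instance is idle until the full model loads from SSD.
# Live: the instance joins serving once the first layers arrive over the
# network, cooperating with existing instances for the remaining layers.

MODEL_GB = 140           # assume a 70B-parameter model in fp16
SSD_GB_PER_S = 3         # assumed SSD read bandwidth
NET_GB_PER_S = 40        # assumed RDMA bandwidth (~320 Gbps)
EARLY_FRACTION = 0.05    # assumed fraction of layers needed to go live

stop_the_world_s = MODEL_GB / SSD_GB_PER_S
live_s = MODEL_GB * EARLY_FRACTION / NET_GB_PER_S

print(f"stop-the-world: ~{stop_the_world_s:.0f}s before serving any request")
print(f"live:           ~{live_s:.2f}s before joining serving")
```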

What is prefill / decode?

  • Prefill:

    • Processes the full user input prompt.
    • Produces the first output token.
    • High compute and memory demand.
    • Determines Time-To-First-Token (TTFT).
  • Decode:

    • Iteratively generates subsequent tokens.
    • Uses KVCache accumulated so far.
    • Determines Time-Between-Tokens (TBT).
  • Implication for provisioning:

    • Prefill and decode have different resource characteristics.
    • Disaggregating them onto separate instances complicates scaling because the KVCache must be transferred from prefill to decode instances (a schematic generation loop follows).
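
The split is easiest to see in the generation loop itself. Below is a schematic version; StubModel is a toy stand-in for a real transformer so the control flow is runnable, and its forward signature is assumed rather than taken from any real serving API.

```python
class StubModel:
    """Toy stand-in: the list plays the role of the KVCache."""
    def forward(self, tokens, kv_cache=None):
        kv_cache = (kv_cache or []) + list(tokens)   # cache grows each step
        next_token = (sum(kv_cache) + 1) % 100       # dummy "logits -> argmax"
        return next_token, kv_cache

def generate(model, prompt_tokens, max_new_tokens, eos_id=0):
    # Prefill: one pass over the whole prompt builds the KVCache and yields
    # the first output token; this pass's latency is the TTFT.
    token, kv_cache = model.forward(prompt_tokens)
    output = [token]
    # Decode: each step feeds back a single token and reuses the KVCache;
    # the per-step latency is the TBT.
    while len(output) < max_new_tokens and token != eos_id:
        token, kv_cache = model.forward([token], kv_cache)
        output.append(token)
    return output

print(generate(StubModel(), [3, 1, 4, 1, 5], max_new_tokens=6))
```

Moving a request from a prefill instance to a decode instance means shipping kv_cache across the network, which is the transfer cost noted above.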

What are RDMA and NVLink? Why do they enable BlitzScale’s design?

  • RDMA (Remote Direct Memory Access):

    • Enables direct memory-to-memory data transfer across machines.
    • Bypasses the CPU, kernel, and extra memory copies.
    • Achieves 100 to 400 Gbps inter-node bandwidth.
  • NVLink:

    • High-speed intra-node GPU interconnect.
    • Supports multi-Tbps bandwidth for GPU-to-GPU communication.
  • Why this enables BlitzScale’s design:

    • Network bandwidth is comparable to or faster than host-to-GPU PCIe.
    • These links are underutilized during normal serving.
    • Model parameters can be multicast directly from existing GPUs to new GPUs (sketched after this list).
    • Avoids caching hundreds of models in host DRAM.
    • Eliminates slow SSD loading paths.
    • Keeps host-memory caching at O(1) models, regardless of how many models are served.
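
A minimal sketch of the multicast idea, assuming a simple forwarding chain. The GPU names, the chain topology, and the function are illustrative, not BlitzScale’s actual transfer plan.

```python
def multicast_layers(num_layers: int, source: str, new_gpus: list[str]) -> None:
    """Stream parameters layer-by-layer: source -> new_gpus[0] -> new_gpus[1] ...

    Each link carries each layer once, so adding receivers does not
    multiply the load on the source GPU.
    """
    chain = [source] + new_gpus
    holds = {gpu: [] for gpu in chain}
    holds[source] = list(range(num_layers))   # source already serves the model
    for layer in range(num_layers):
        for sender, receiver in zip(chain, chain[1:]):
            assert layer in holds[sender]     # forwarding: sender got it first
            # Direct GPU-to-GPU copy over RDMA/NVLink: no SSD read and
            # no host-DRAM staging on the new machine.
            holds[receiver].append(layer)
        # A new GPU can serve this layer as soon as it lands, which is
        # what lets an instance go live before the full model arrives.
        print(f"layer {layer} now on {', '.join(chain[1:])}")

multicast_layers(num_layers=4, source="gpu0", new_gpus=["gpu1", "gpu2"])
```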