BlitzScale

What is the auto-scaling problem? Why is it important, and why is it difficult?

  • Problem: Automatically adjusting the number of model-serving instances to match fluctuating request demand.

  • Importance:

    • Maximizes goodput, i.e., the throughput of requests that meet their SLOs.
    • Reduces cost by avoiding overprovisioning expensive GPU instances.
    • Essential for Model-as-a-Service (MaaS) systems, where workloads vary rapidly.
  • Difficulty:

    • Request arrival rates fluctuate unpredictably in the short term.
    • Each request has variable latency and memory usage depending on input length.
    • Large models require expensive initialization and parameter loading.
    • Scaling decisions must balance latency, memory, and hardware utilization under uncertainty (a minimal sizing sketch follows this list).
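
As a toy illustration of the sizing half of this problem, the sketch below computes a target replica count from an arrival rate and a per-instance throughput estimate. The function and parameter names are hypothetical (this is not BlitzScale’s policy), and it deliberately ignores the hard parts listed above: bursty arrivals and slow instance startup.

```python
import math

def target_instances(arrival_rate_rps: float,
                     per_instance_rps: float,
                     headroom: float = 0.2,
                     min_instances: int = 1) -> int:
    """Replicas needed to absorb the arrival rate with burst headroom.

    per_instance_rps: request rate one replica sustains within its SLO.
    headroom: spare capacity fraction reserved for short-term bursts.
    """
    needed = arrival_rate_rps * (1.0 + headroom) / per_instance_rps
    return max(min_instances, math.ceil(needed))

# Example: 240 req/s arriving, each replica sustains 50 req/s within SLO.
print(target_instances(240, 50))  # -> 6
```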

What is the difference between stop-the-world and live auto-scaling?

  • Stop-the-world auto-scaling:

    • New instances cannot serve requests until the full model is loaded.
    • Model loading from SSD or host memory causes long cold-start delays.
    • Leads to SLO violations during scaling events.
  • Live auto-scaling:

    • New instances start serving requests before full model loading completes.
    • Uses partial model availability and cooperative execution.
    • Significantly reduces tail latency during scale-up (a back-of-envelope comparison follows this list).
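
To see why the distinction matters, here is a back-of-envelope comparison of how long a new instance waits before serving its first request. All numbers are illustrative assumptions, not measurements from the paper.

```python
# Stop-the-world: the instance is idle until the full model loads from SSD.
# Live: the instance joins serving once the first layers arrive over the
# network, cooperating with existing instances for the remaining layers.

MODEL_GB = 140           # assume a 70B-parameter model in fp16
SSD_GB_PER_S = 3         # assumed SSD read bandwidth
NET_GB_PER_S = 40        # assumed RDMA bandwidth (~320 Gbps)
EARLY_FRACTION = 0.05    # assumed fraction of layers needed to go live

stop_the_world_s = MODEL_GB / SSD_GB_PER_S
live_s = MODEL_GB * EARLY_FRACTION / NET_GB_PER_S

print(f"stop-the-world: ~{stop_the_world_s:.0f}s before serving any request")
print(f"live:           ~{live_s:.2f}s before joining serving")
```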

What is prefill / decode?

  • Prefill:

    • Processes the full user input prompt.
    • Produces the first output token.
    • High compute and memory demand.
    • Determines Time-To-First-Token (TTFT).
  • Decode:

    • Iteratively generates subsequent tokens.
    • Uses KVCache accumulated so far.
    • Determines Time-Between-Tokens (TBT).
  • Implication for provisioning:

    • Prefill and decode have different resource characteristics.
    • Disaggregating them onto separate instances complicates scaling because the KVCache must be transferred from prefill to decode instances (a schematic generation loop follows).
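
The split is easiest to see in the generation loop itself. Below is a schematic version; StubModel is a toy stand-in for a real transformer so the control flow is runnable, and its forward signature is assumed rather than taken from any real serving API.

```python
class StubModel:
    """Toy stand-in: the list plays the role of the KVCache."""
    def forward(self, tokens, kv_cache=None):
        kv_cache = (kv_cache or []) + list(tokens)   # cache grows each step
        next_token = (sum(kv_cache) + 1) % 100       # dummy "logits -> argmax"
        return next_token, kv_cache

def generate(model, prompt_tokens, max_new_tokens, eos_id=0):
    # Prefill: one pass over the whole prompt builds the KVCache and yields
    # the first output token; this pass's latency is the TTFT.
    token, kv_cache = model.forward(prompt_tokens)
    output = [token]
    # Decode: each step feeds back a single token and reuses the KVCache;
    # the per-step latency is the TBT.
    while len(output) < max_new_tokens and token != eos_id:
        token, kv_cache = model.forward([token], kv_cache)
        output.append(token)
    return output

print(generate(StubModel(), [3, 1, 4, 1, 5], max_new_tokens=6))
```

Moving a request from a prefill instance to a decode instance means shipping kv_cache across the network, which is the transfer cost noted above.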

What are RDMA and NVLink? Why do they enable BlitzScale’s design?

  • RDMA (Remote Direct Memory Access):

    • Enables direct memory-to-memory data transfer across machines.
    • Bypasses the CPU, kernel, and extra memory copies.
    • Achieves 100 to 400 Gbps inter-node bandwidth.
  • NVLink:

    • High-speed intra-node GPU interconnect.
    • Supports multi-Tbps bandwidth for GPU-to-GPU communication.
  • Why this enables BlitzScale’s design:

    • Network bandwidth is comparable to or faster than host-to-GPU PCIe.
    • These links are underutilized during normal serving.
    • Model parameters can be multicast directly from existing GPUs to new GPUs (sketched after this list).
    • Avoids caching hundreds of models in host DRAM.
    • Eliminates slow SSD loading paths.
    • Keeps host-memory caching at O(1) models, regardless of how many models are served.
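
A minimal sketch of the multicast idea, assuming a simple forwarding chain. The GPU names, the chain topology, and the function are illustrative, not BlitzScale’s actual transfer plan.

```python
def multicast_layers(num_layers: int, source: str, new_gpus: list[str]) -> None:
    """Stream parameters layer-by-layer: source -> new_gpus[0] -> new_gpus[1] ...

    Each link carries each layer once, so adding receivers does not
    multiply the load on the source GPU.
    """
    chain = [source] + new_gpus
    holds = {gpu: [] for gpu in chain}
    holds[source] = list(range(num_layers))   # source already serves the model
    for layer in range(num_layers):
        for sender, receiver in zip(chain, chain[1:]):
            assert layer in holds[sender]     # forwarding: sender got it first
            # Direct GPU-to-GPU copy over RDMA/NVLink: no SSD read and
            # no host-DRAM staging on the new machine.
            holds[receiver].append(layer)
        # A new GPU can serve this layer as soon as it lands, which is
        # what lets an instance go live before the full model arrives.
        print(f"layer {layer} now on {', '.join(chain[1:])}")

multicast_layers(num_layers=4, source="gpu0", new_gpus=["gpu1", "gpu2"])
```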