Optimizing Inference Latency: From 100ms to 10ms
Practical techniques we used to dramatically reduce model serving latency in production.
Michael Zhang
Head of Design
10x Latency Reduction
When users interact with ML-powered features, every millisecond matters. High latency leads to poor user experience, lower conversion rates, and in some cases, system timeouts. Here's how we achieved a 10x reduction in inference latency.
The Optimization Stack
Quantization
INT8 quantization reduced memory bandwidth by 4x with less than 1% accuracy loss.
Kernel Fusion
Custom CUDA kernels eliminate memory round-trips between operations.
Batching
Dynamic batching maximizes GPU utilization without sacrificing latency SLAs.
Compilation
torch.compile with CUDA graphs all but eliminates Python and kernel-launch overhead.
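Of the four techniques, dynamic batching is the easiest to sketch in isolation: collect requests until the batch is full or a small wait budget expires, then run them together. A minimal sketch (the queue-based interface and names here are illustrative, not our production code):

```python
import queue
import time

def dynamic_batcher(requests, handle_batch, max_batch=32, max_wait_ms=5.0):
    """Drain a queue into batches: flush when the batch is full or the
    wait budget expires. A None item signals shutdown."""
    while True:
        first = requests.get()
        if first is None:
            return
        batch = [first]
        deadline = time.perf_counter() + max_wait_ms / 1000.0
        while len(batch) < max_batch:
            remaining = deadline - time.perf_counter()
            if remaining <= 0:
                break
            try:
                item = requests.get(timeout=remaining)
            except queue.Empty:
                break
            if item is None:
                handle_batch(batch)
                return
            batch.append(item)
        handle_batch(batch)

# Demo: five queued requests, batches of at most two.
q = queue.Queue()
for i in range(5):
    q.put(i)
q.put(None)  # shutdown sentinel
batches = []
dynamic_batcher(q, batches.append, max_batch=2, max_wait_ms=1.0)
print(batches)  # [[0, 1], [2, 3], [4]]
```

The wait budget is what keeps batching from violating a latency SLA: a request never waits more than max_wait_ms for companions before the batch is flushed.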
Step 1: Profile First
Before optimizing, we profiled end-to-end latency to understand where time was being spent. The breakdown was surprising: 40% was network overhead, 30% was preprocessing, and only 30% was actual model inference.
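A per-stage breakdown like the one above can be collected with nothing more than wall-clock timers around each pipeline stage. A minimal sketch (the stage names and sleeps are stand-ins for real calls):

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    """Accumulate wall-clock time per pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + (time.perf_counter() - start)

# Stand-ins for the real stages; replace the sleeps with actual calls.
with timed("network"):
    time.sleep(0.004)
with timed("preprocessing"):
    time.sleep(0.003)
with timed("inference"):
    time.sleep(0.003)

total = sum(timings.values())
for stage, t in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{stage:13s} {t * 1000:6.2f} ms ({t / total:.0%})")
```

Printing percentages of the end-to-end total, rather than raw times, is what makes it obvious when the model itself is not the bottleneck.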
Step 2: Reduce Network Overhead
We moved from REST to gRPC with connection pooling, reducing network latency from 40ms to 5ms. For internal services, we use Unix domain sockets, cutting latency to under 1ms.
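The sub-millisecond figure for Unix domain sockets is easy to verify with the standard library alone. This is not our gRPC setup (gRPC's Python client accepts `unix:` target addresses for that); it is just a sketch of the raw round-trip cost of the transport:

```python
import socket
import time

def recv_exact(sock, n):
    """Read exactly n bytes from a stream socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed early")
        buf += chunk
    return buf

# Two connected AF_UNIX endpoints in one process: enough to measure
# the round-trip cost of the transport itself, with no TCP/IP stack.
server, client = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
payload = b"x" * 1024  # 1 KiB dummy request

start = time.perf_counter()
client.sendall(payload)                    # request
server.sendall(recv_exact(server, 1024))   # echo back as the "response"
reply = recv_exact(client, 1024)
elapsed_ms = (time.perf_counter() - start) * 1000

server.close()
client.close()
print(f"round trip: {elapsed_ms:.3f} ms")
```

On a typical Linux box this prints a round trip well under a millisecond, which is why keeping internal hops off the network stack pays for itself.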
Step 3: Optimize Preprocessing
Tokenization and feature engineering were running on CPU. Moving these to GPU with batched operations reduced preprocessing from 30ms to 3ms.
# Before: CPU tokenization
tokens = tokenizer(text)  # 30ms

# After: GPU batched tokenization
tokens = gpu_tokenizer.batch_encode(
    texts,
    device="cuda",
)  # 3ms for batch of 32

Step 4: Model Optimization
Finally, we optimized the model itself. INT8 quantization with calibration preserved accuracy while cutting memory bandwidth requirements by 4x. Combined with kernel fusion and CUDA graphs, inference dropped from 30ms to 2ms.
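The mechanics behind the 4x figure are simple: FP32 stores 4 bytes per weight, INT8 stores 1, and a scale factor maps between them. A toy sketch of symmetric per-tensor INT8 quantization (pure Python, not a production quantizer, and no calibration step):

```python
import struct

def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization: map floats to [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.8, -1.2, 0.05, 0.33, -0.99, 1.19]  # toy weight tensor
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# FP32 stores 4 bytes per weight, INT8 stores 1: a 4x bandwidth reduction.
fp32_bytes = len(struct.pack(f"{len(weights)}f", *weights))
int8_bytes = len(bytes(x & 0xFF for x in q))
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"{fp32_bytes} B -> {int8_bytes} B, max abs error {max_err:.4f}")
```

The rounding error is bounded by half the scale factor, which is why a well-calibrated INT8 model can stay within 1% of its FP32 accuracy.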
Results
p50 latency: 10ms
p99 latency: 18ms
Overall improvement: 10x
Michael Zhang
Head of Design
Michael leads design at 1.ML, previously Design Director at Figma.