Optimizing Inference Latency: From 100ms to 10ms
Practical techniques we used to dramatically reduce model serving latency in production.
Michael Zhang
Head of Design
10x Latency Reduction
When users interact with ML-powered features, every millisecond matters. High latency leads to poor user experience, lower conversion rates, and in some cases, system timeouts. Here's how we achieved a 10x reduction in inference latency.
The Optimization Stack
Quantization
INT8 quantization reduced memory bandwidth by 4x with less than 1% accuracy loss.
Kernel Fusion
Custom CUDA kernels eliminate memory round-trips between operations.
Batching
Dynamic batching maximizes GPU utilization without sacrificing latency SLAs.
Compilation
torch.compile with CUDA graphs all but eliminates Python and kernel-launch overhead.
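Of the four techniques, dynamic batching is the easiest to sketch in isolation: collect requests until the batch is full or a small wait budget expires, then run them together. A minimal sketch (the queue-based interface and names here are illustrative, not our production code):

```python
import queue
import time

def dynamic_batcher(requests, handle_batch, max_batch=32, max_wait_ms=5.0):
    """Drain a queue into batches: flush when the batch is full or the
    wait budget expires. A None item signals shutdown."""
    while True:
        first = requests.get()
        if first is None:
            return
        batch = [first]
        deadline = time.perf_counter() + max_wait_ms / 1000.0
        while len(batch) < max_batch:
            remaining = deadline - time.perf_counter()
            if remaining <= 0:
                break
            try:
                item = requests.get(timeout=remaining)
            except queue.Empty:
                break
            if item is None:
                handle_batch(batch)
                return
            batch.append(item)
        handle_batch(batch)

# Demo: five queued requests, batches of at most two.
q = queue.Queue()
for i in range(5):
    q.put(i)
q.put(None)  # shutdown sentinel
batches = []
dynamic_batcher(q, batches.append, max_batch=2, max_wait_ms=1.0)
print(batches)  # [[0, 1], [2, 3], [4]]
```

The wait budget is what keeps batching from violating a latency SLA: a request never waits more than max_wait_ms for companions before the batch is flushed.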
Step 1: Profile First
Before optimizing, we profiled end-to-end latency to understand where time was being spent. The breakdown was surprising: 40% was network overhead, 30% was preprocessing, and only 30% was actual model inference.
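A per-stage breakdown like the one above can be collected with nothing more than wall-clock timers around each pipeline stage. A minimal sketch (the stage names and sleeps are stand-ins for real calls):

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    """Accumulate wall-clock time per pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + (time.perf_counter() - start)

# Stand-ins for the real stages; replace the sleeps with actual calls.
with timed("network"):
    time.sleep(0.004)
with timed("preprocessing"):
    time.sleep(0.003)
with timed("inference"):
    time.sleep(0.003)

total = sum(timings.values())
for stage, t in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{stage:13s} {t * 1000:6.2f} ms ({t / total:.0%})")
```

Printing percentages of the end-to-end total, rather than raw times, is what makes it obvious when the model itself is not the bottleneck.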
Step 2: Reduce Network Overhead
We moved from REST to gRPC with connection pooling, reducing network latency from 40ms to 5ms. For internal services, we use Unix domain sockets, cutting latency to under 1ms.
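The sub-millisecond figure for Unix domain sockets is easy to verify with the standard library alone. This is not our gRPC setup (gRPC's Python client accepts `unix:` target addresses for that); it is just a sketch of the raw round-trip cost of the transport:

```python
import socket
import time

def recv_exact(sock, n):
    """Read exactly n bytes from a stream socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed early")
        buf += chunk
    return buf

# Two connected AF_UNIX endpoints in one process: enough to measure
# the round-trip cost of the transport itself, with no TCP/IP stack.
server, client = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
payload = b"x" * 1024  # 1 KiB dummy request

start = time.perf_counter()
client.sendall(payload)                    # request
server.sendall(recv_exact(server, 1024))   # echo back as the "response"
reply = recv_exact(client, 1024)
elapsed_ms = (time.perf_counter() - start) * 1000

server.close()
client.close()
print(f"round trip: {elapsed_ms:.3f} ms")
```

On a typical Linux box this prints a round trip well under a millisecond, which is why keeping internal hops off the network stack pays for itself.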
Step 3: Optimize Preprocessing
Tokenization and feature engineering were running on CPU. Moving these to GPU with batched operations reduced preprocessing from 30ms to 3ms.
# Before: CPU tokenization
tokens = tokenizer(text)  # 30ms

# After: GPU batched tokenization
tokens = gpu_tokenizer.batch_encode(
    texts,
    device="cuda",
)  # 3ms for batch of 32

Step 4: Model Optimization
Finally, we optimized the model itself. INT8 quantization with calibration preserved accuracy while cutting memory bandwidth requirements by 4x. Combined with kernel fusion and CUDA graphs, inference dropped from 30ms to 2ms.
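The mechanics behind the 4x figure are simple: FP32 stores 4 bytes per weight, INT8 stores 1, and a scale factor maps between them. A toy sketch of symmetric per-tensor INT8 quantization (pure Python, not a production quantizer, and no calibration step):

```python
import struct

def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization: map floats to [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.8, -1.2, 0.05, 0.33, -0.99, 1.19]  # toy weight tensor
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# FP32 stores 4 bytes per weight, INT8 stores 1: a 4x bandwidth reduction.
fp32_bytes = len(struct.pack(f"{len(weights)}f", *weights))
int8_bytes = len(bytes(x & 0xFF for x in q))
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"{fp32_bytes} B -> {int8_bytes} B, max abs error {max_err:.4f}")
```

The rounding error is bounded by half the scale factor, which is why a well-calibrated INT8 model can stay within 1% of its FP32 accuracy.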
Results
p50 latency: 10ms
p99 latency: 18ms
Overall improvement: 10x
Michael Zhang
Head of Design
Michael leads design at 1.ML, previously Design Director at Figma.