What Actually Makes Embedding Model Inference Fast?
From Flash Attention to Quantization, where is the inference bottleneck? Is it the architecture, the maths, or will writing everything in Rust solve all my problems? (Hint: It's Not Rust)
There’s a persistent claim that keeps coming up in ML architecture discussions:
“System X is fast because it’s written in Rust.”
You hear this about TEI (Text Embeddings Inference), about various LLM servers, about anything touching performance-critical paths.
And look, Rust has earned its reputation for speed and memory safety, but when we’re talking specifically about embedding inference, pointing to the programming language as the primary performance driver reveals a fundamental misunderstanding of where “latency actually lives”.
If you want to build fast systems, or even just make intelligent choices about which systems to use, you need to understand what’s actually happening when you send an embedding request.
The answer might surprise you, because the bottleneck is not your request handler, not your JSON parser, and not even really the math. It’s memory.
The physics of latency
Let’s trace what happens when you send “the quick brown fox” to an embedding model, asking a Transformer to turn it into a 768-dimensional vector.
Tokenization happens in microseconds of CPU work, totally fine in Python or Rust or whatever. But then you run those tokens through 12 Transformer layers, each one an attention block followed by a feed-forward network, and that is where the time goes.
An NVIDIA A100 can do 312 trillion FP16 operations per second on its Tensor Cores, so you’d think inference would be instant. Instead it takes tens of milliseconds, and the reason is that while the A100’s compute is blazing fast, it can only read from its memory at roughly 1.5 TB/s.
That sounds fast!
However, compared to the GPU’s compute capability, it’s slow. During each attention layer, you’re constantly reading weights from memory, computing something, writing results back to memory, then reading them again.
The math operations finish very quickly, but the GPU spends most of its time waiting for data.
To put numbers on it: a 100-million parameter model at FP32 is 400 MB of weights, and reading that from memory takes about 0.27 milliseconds, while the actual matrix multiplication finishes in a fraction of that time.
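To make that concrete, here is the same back-of-the-envelope arithmetic as a runnable sketch. The 100 million parameters, FP32 weights, 1.5 TB/s of bandwidth, and 312 TFLOPS of compute are the figures from above; the token count is an illustrative guess.

```python
# Back-of-the-envelope numbers from above: 100M parameters in FP32 on an A100.
params = 100e6
weight_bytes = params * 4                      # FP32 = 4 bytes/parameter -> 400 MB

hbm_bandwidth = 1.5e12                         # ~1.5 TB/s of memory bandwidth
peak_flops = 312e12                            # ~312 TFLOPS of Tensor Core compute

read_ms = weight_bytes / hbm_bandwidth * 1e3
print(f"streaming the weights once: {read_ms:.2f} ms")     # ~0.27 ms

tokens = 16                                    # a short sentence (illustrative guess)
flops = 2 * params * tokens                    # rough matmul FLOP count for one pass
compute_ms = flops / peak_flops * 1e3
print(f"doing the matrix math:      {compute_ms:.4f} ms")  # ~0.01 ms, a fraction of the read
```

The ratio is the whole story: the GPU can finish the arithmetic long before it has finished fetching the weights it needs to do that arithmetic.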
Why language choice barely moves the needle?
You’re memory-bound, not compute-bound, and that completely reframes how we think about optimization. When someone tells you a system is fast because it’s written in Rust, they’re optimizing the part of the stack that accounts for maybe 5% of your latency: the HTTP handling, the JSON parsing, and the request routing all happen on the CPU in single-digit milliseconds at most. The other 95% is the GPU waiting for data.
FlashAttention and its inner workings
Let me walk you through FlashAttention because it’s the perfect example of how understanding memory hierarchy leads to massive speedups, independent of what language your server is written in.
Where “vanilla” attention hits a bottleneck
Standard attention in a Transformer computes S = Q·K^T to get attention scores, applies softmax to get attention weights P, then computes P·V to get the output, which seems straightforward. But watch what happens in memory: every intermediate matrix (S, P) gets written out to HBM (High Bandwidth Memory), the comparatively slow global memory running at that 1.5 TB/s, then read back, and you’re hitting that bottleneck repeatedly throughout the entire forward pass.
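As a reference point, here is that standard path as a toy NumPy sketch (single head, made-up shapes). The `S` and `P` arrays are exactly the intermediates that a real GPU kernel would be writing to and reading back from HBM.

```python
import numpy as np

def vanilla_attention(Q, K, V):
    """Toy single-head attention that materializes the full S and P matrices."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                    # scores: an (n, n) intermediate
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)          # softmax weights: another (n, n) intermediate
    return P @ V                                # the output itself is only (n, d)

n, d = 512, 64                                  # made-up sequence length and head dimension
Q, K, V = np.random.default_rng(0).standard_normal((3, n, d))
print(vanilla_attention(Q, K, V).shape)         # (512, 64), but S and P were each 512 x 512
```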
The advantage of FlashAttention
FlashAttention tiles the computation into blocks that fit in fast SRAM, the GPU’s programmer-managed scratchpad carved out of the L1 cache. On the newest hardware the path looks like this: global memory → shared memory (SRAM) → tensor memory, another programmer-managed on-chip store designed specifically to hold accumulators during sequences of Tensor Core operations.
The mathematical insight is that the normalization step can be expressed as a mergeable running state. Instead of computing the full attention matrix, the algorithm maintains a small, mergeable state for each query row: a running max, a running exp-sum, and an output accumulator.
This lets it stream over tiles of keys and values independently, while still producing the exact same attention output as a full softmax. This means you can process chunks independently and merge results without ever writing those huge intermediate matrices to slow HBM.
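Here is a minimal NumPy sketch of that running state, with a made-up tile size and shapes. This is the online-softmax idea behind FlashAttention, not the actual CUDA kernel, but it produces output identical to a full softmax.

```python
import numpy as np

def flash_like_attention(Q, K, V, tile=128):
    """Stream over K/V tiles, keeping only a running max, exp-sum, and output accumulator."""
    n, d = Q.shape
    m = np.full(n, -np.inf)                       # running max per query row
    l = np.zeros(n)                               # running sum of exponentials
    acc = np.zeros((n, d))                        # running (unnormalized) output

    for start in range(0, K.shape[0], tile):
        Kt, Vt = K[start:start + tile], V[start:start + tile]
        s = Q @ Kt.T / np.sqrt(d)                 # scores for this tile only
        m_new = np.maximum(m, s.max(axis=-1))
        scale = np.exp(m - m_new)                 # rescale the old state to the new max
        p = np.exp(s - m_new[:, None])
        l = l * scale + p.sum(axis=-1)
        acc = acc * scale[:, None] + p @ Vt
        m = m_new
    return acc / l[:, None]                       # single final normalization

n, d = 512, 64
Q, K, V = np.random.default_rng(0).standard_normal((3, n, d))

# Reference: full softmax attention, materializing the big matrices.
S = Q @ K.T / np.sqrt(d)
P = np.exp(S - S.max(axis=-1, keepdims=True))
ref = (P / P.sum(axis=-1, keepdims=True)) @ V
print(np.allclose(flash_like_attention(Q, K, V), ref))  # True: exact same output
```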
The advancements of Flash Attention 4
Flash Attention 4, which came out recently, takes this even further with two clever tricks:
First, instead of always using the GPU’s Special Function Units (SFUs) for exponentiation (which can bottleneck because there are far fewer SFUs than CUDA cores), FA4 mixes in a software implementation using a cubic polynomial approximation, (((c3*r + c2)*r + c1)*r + c0), which turns out to be faster because three fused multiply-adds on CUDA cores beat waiting in the SFU queue.

Second, the old approach updated the normalization scaling factor every time it saw a new maximum, but the new approach only updates when the maximum has changed enough to actually threaten numerical stability. This cuts rescaling operations by about 10×, eliminating pure computational waste.

What’s really interesting is that the complexity in FA4 isn’t even in the math anymore, it’s in the asynchronous pipeline.
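As a rough illustration of the first trick, here is what a fused-multiply-add-friendly cubic exponential can look like. The `exp_cubic` name is made up and the coefficients are fit on the fly, so treat this as a shape-of-the-computation sketch rather than the kernel’s actual constants.

```python
import numpy as np

# Range-reduce exp(x) to 2**r with r in [0, 1), then evaluate a cubic in Horner
# form: the same (((c3*r + c2)*r + c1)*r + c0) shape, i.e. three multiply-adds.
# Coefficients are a quick least-squares fit; a real kernel uses hand-tuned ones.
r_grid = np.linspace(0.0, 1.0, 1025)
c3, c2, c1, c0 = np.polyfit(r_grid, 2.0 ** r_grid, 3)

def exp_cubic(x):
    t = np.asarray(x) * np.log2(np.e)             # exp(x) = 2 ** (x * log2(e))
    k = np.floor(t)                               # integer part handled via the exponent
    r = t - k                                     # fractional part in [0, 1)
    poly = ((c3 * r + c2) * r + c1) * r + c0      # three fused multiply-adds on real hardware
    return np.ldexp(poly, k.astype(int))          # poly * 2**k

x = np.linspace(-4.0, 4.0, 9)
print(np.max(np.abs(exp_cubic(x) - np.exp(x)) / np.exp(x)))  # small relative error
```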
Here is what this pipeline actually looks like:
This pipeline works by breaking the attention computation into tiles (small blocks of Queries, Keys, and Values sized to fit on fast on-chip memory). The kernel uses “warp specialization,” which means mapping different pipeline stages onto 32-thread groups called warps, each executing the same instruction, forming a producer/consumer pipeline synchronized with barriers. Different groups of these warps are assigned to fixed roles in the pipeline as follows:
Load warps → move tiles to fast shared memory
MMA (Matrix Multiply Accumulate) warps → matrix multiplication on Tensor Cores
Softmax warps → exponential + partial normalization
Correction warps → rescale if max “jumps” too much
Epilogue warps → final normalization + output write
Each warp runs its stage while the warp scheduler switches between them like an event loop on steroids, which is actually the opposite of async programming on a CPU where one thread follows one request through multiple states. On a GPU, one warp handles one state transition for all tiles.
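If it helps to see the structure, here is a deliberately crude CPU analogy in Python: one worker per stage, every tile flowing through all of them. The stage names mirror the list above, the “work” is placeholder arithmetic, and this says nothing about how the real CUDA kernel is written.

```python
import threading, queue

# Toy analogy of warp specialization: one worker per pipeline stage,
# and every tile flows through all of the stages in order.
tiles = [{"id": i, "data": float(i)} for i in range(8)]
q_load, q_mma, q_softmax, q_done = (queue.Queue() for _ in range(4))
STOP = object()  # sentinel that shuts the pipeline down

def stage(inbox, outbox, fn):
    """Run one pipeline stage: pull a tile, transform it, pass it on."""
    while (tile := inbox.get()) is not STOP:
        outbox.put(fn(tile))
    outbox.put(STOP)

workers = [
    threading.Thread(target=stage, args=(q_load, q_mma, lambda t: {**t, "loaded": True})),            # "load warp"
    threading.Thread(target=stage, args=(q_mma, q_softmax, lambda t: {**t, "score": t["data"]})),     # "MMA warp"
    threading.Thread(target=stage, args=(q_softmax, q_done, lambda t: {**t, "prob": t["score"] / 8})), # "softmax warp"
]
for w in workers:
    w.start()
for tile in tiles:                 # the scheduler: feed tiles into the first stage
    q_load.put(tile)
q_load.put(STOP)

while (result := q_done.get()) is not STOP:   # "epilogue": collect finished tiles
    print(result["id"], result["prob"])
for w in workers:
    w.join()
```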
The math is identical to basic attention, the result is identical, but you’ve eliminated expensive memory trips and kept everything in the fastest possible cache level. That’s why Flash Attention is 2-4× faster, and FA4 adds another 20% on top, not because of language choice but because it respects the memory hierarchy.
Quantization is About Bandwidth, Not Just Math
Quantization is usually explained as “we use smaller numbers to make the math faster,” and sure, INT8 operations are faster than FP32 operations.
But the bigger win, especially in the memory-bound regime we’ve been discussing, is bandwidth reduction. If your model weights are stored as 32-bit floats, that’s 4 bytes per parameter; quantize to 8-bit integers and each parameter becomes 1 byte, so your 400 MB model becomes 100 MB. That 0.27 ms memory read becomes roughly 0.07 ms, a saving you collect every single time the weights are streamed through the GPU, with the faster INT8 math operations being a nice bonus.
For embedding models, the accuracy impact is usually negligible because neural network weights have massive redundancy as they cluster around certain values and high-precision bits rarely matter for accuracy. Retrieval benchmarks show less than 1% degradation in metrics like NDCG when going from FP32 to INT8, which means you’re getting a 2-3× throughput improvement basically for free.
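Here is a minimal sketch of what symmetric per-channel INT8 quantization does to a weight matrix. The matrix is random, so the exact error figure is illustrative rather than a benchmark.

```python
import numpy as np

# Symmetric per-row INT8 quantization of a toy weight matrix.
W = np.random.randn(768, 768).astype(np.float32)

scale = np.abs(W).max(axis=1, keepdims=True) / 127.0        # one scale per output row
W_int8 = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
W_dequant = W_int8.astype(np.float32) * scale               # what the matmul effectively sees

rel_err = np.abs(W_dequant - W).mean() / np.abs(W).mean()
print(f"mean relative error: {rel_err:.3%}")                # around a percent on random weights
print(f"FP32: {W.nbytes / 1e6:.2f} MB  ->  INT8: {W_int8.nbytes / 1e6:.2f} MB")  # 4x less to move
```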
Again, this isn’t about what language your server is written in, it’s about understanding what’s actually slow and attacking that bottleneck directly.
So how much does language choice matter?
The language question is more nuanced than “Rust makes it fast.” In a properly architected system where the GPU is saturated, Rust versus Python accounts for maybe a 5% difference in throughput.
TEI is fast because it implements Flash Attention, uses aggressive quantization, and has smart batching logic like token-based dynamic batching (grouping by total token count rather than request count) to maximize GPU utilization.
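To be concrete about what token-based dynamic batching means, here is a sketch of the grouping rule described above: cap each batch by total token count rather than by number of requests. This is not TEI’s actual code, and the budget and request sizes are made up.

```python
def token_budget_batches(requests, max_tokens=8192):
    """Greedily group (request_id, num_tokens) pairs under a total-token budget."""
    batches, batch, batch_tokens = [], [], 0
    for req_id, n_tokens in requests:
        if batch and batch_tokens + n_tokens > max_tokens:
            batches.append(batch)              # close the current batch when the budget is hit
            batch, batch_tokens = [], 0
        batch.append(req_id)
        batch_tokens += n_tokens
    if batch:
        batches.append(batch)
    return batches

# Many short requests share a batch; the long ones push the next request out.
print(token_budget_batches([("a", 12), ("b", 30), ("c", 7000), ("d", 2500)]))
# [['a', 'b', 'c'], ['d']]
```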
Rust does help with predictable latency without garbage collection pauses, memory safety in complex async pipelines, and operational simplicity from single-binary deployments.
However, you could build a comparably fast system in Python with the same architectural principles around memory hierarchy and batching, because the fundamental performance characteristics are determined by GPU kernel choices and batching strategy, not server language. Performance is architecture first, not syntax first; the language is meaningful but secondary, affecting developer experience and operational simplicity rather than fundamentally determining whether your system is fast.
Sources:
1. https://ttsugriy.github.io/performance-book - Open-source book on understanding performance through mathematical properties
2. https://modal.com/blog/flash-attention-4 - Technical deep-dive into how FA4 achieves memory hierarchy optimization and async pipeline architecture
3. https://gordicaleksa.medium.com/eli5-flash-attention-5c44017022ad - ELI5: FlashAttention

