From Flash Attention to Quantization, where is the inference bottleneck? Is it the architecture, the maths, or will writing everything in Rust solve all my problems? (Hint: It's Not Rust)
This breakdown of memory vs compute bottlenecks is excellent and often misunderstood in the ML infra world. You nailed the explanation of why FlashAttention actually works: it's all about respecting the memory hierarchy rather than fancy math tricks. I've seen so many teams chase marginal language-level optimizations while ignoring batching strategies and quantization fundamentals. The whole 'Rust makes it fast' narrative really misses the point when you're memory-bound at the GPU level.
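To make the memory-bound point concrete, here's a rough back-of-envelope sketch (not from the article; the GPU and model numbers are illustrative assumptions for an A100-class card and a 7B FP16 model) comparing compute time vs. memory-traffic time per decoded token:

```python
# Back-of-envelope check of why single-batch LLM decoding is memory-bound.
# Hypothetical figures: A100-class GPU, 7B-parameter model in FP16.

peak_flops = 312e12          # FP16 tensor-core peak, FLOP/s
mem_bandwidth = 2.0e12       # HBM bandwidth, bytes/s (~2 TB/s)

params = 7e9                 # model parameters
bytes_per_param = 2          # FP16

# Per decoded token: every weight is read once (~2 bytes) and used in
# roughly 2 FLOPs (one multiply, one add).
bytes_moved = params * bytes_per_param
flops = 2 * params

time_compute = flops / peak_flops          # ~0.045 ms
time_memory = bytes_moved / mem_bandwidth  # ~7 ms

print(f"compute-limited time per token: {time_compute * 1e3:.3f} ms")
print(f"memory-limited time per token:  {time_memory * 1e3:.3f} ms")
print(f"memory / compute ratio: {time_memory / time_compute:.0f}x")
```

With these assumed numbers the memory term dominates by over 100x, which is why a faster host language barely moves the needle, while batching (amortizing weight reads across requests) and quantization (shrinking the bytes moved) attack the actual bottleneck.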
Thank you! Glad you liked the explanation of FlashAttention.