From Flash Attention to Quantization, where is the inference bottleneck? Is it the architecture, the maths, or will writing everything in Rust solve all my problems? (Hint: It's Not Rust)
This breakdown of memory vs compute bottlenecks is excellent and often misunderstood in the ML infra world. You nailed the explanation of why FlashAttention actually works: it's all about respecting the memory hierarchy rather than fancy math tricks. I've seen so many teams chase marginal language-level optimizations while ignoring batching strategies and quantization fundamentals. The whole 'Rust makes it fast' narrative really misses the point when you're memory-bound at the GPU level.
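To make the memory-bound point concrete, here's a rough back-of-envelope sketch (not from the article; the GPU and model numbers are illustrative assumptions for an A100-class card and a 7B FP16 model) comparing compute time vs. memory-traffic time per decoded token:

```python
# Back-of-envelope check of why single-batch LLM decoding is memory-bound.
# Hypothetical figures: A100-class GPU, 7B-parameter model in FP16.

peak_flops = 312e12          # FP16 tensor-core peak, FLOP/s
mem_bandwidth = 2.0e12       # HBM bandwidth, bytes/s (~2 TB/s)

params = 7e9                 # model parameters
bytes_per_param = 2          # FP16

# Per decoded token: every weight is read once (~2 bytes) and used in
# roughly 2 FLOPs (one multiply, one add).
bytes_moved = params * bytes_per_param
flops = 2 * params

time_compute = flops / peak_flops          # ~0.045 ms
time_memory = bytes_moved / mem_bandwidth  # ~7 ms

print(f"compute-limited time per token: {time_compute * 1e3:.3f} ms")
print(f"memory-limited time per token:  {time_memory * 1e3:.3f} ms")
print(f"memory / compute ratio: {time_memory / time_compute:.0f}x")
```

With these assumed numbers the memory term dominates by over 100x, which is why a faster host language barely moves the needle, while batching (amortizing weight reads across requests) and quantization (shrinking the bytes moved) attack the actual bottleneck.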
Thank you! Glad you liked the explanation of FlashAttention.