Comparing Embedding Inference Solutions: TEI, Infinity, and FastEmbed
In my previous article, I covered why productionizing embeddings is hard. Now let’s take a deeper dive into three major open-source solutions and look at the key advantages of each.
Embedding Inference Comparison table
Table 1: An overview of popular embedding inference solutions
Inference built for throughput and token-based batching
Text Embeddings Inference (TEI) from Hugging Face is written in Rust and uses Flash Attention. The interesting thing about TEI is how it handles batching, which is worth understanding because it affects performance quite a bit.
Most inference systems batch by request count: they group a fixed number of requests together regardless of their size. For embeddings this is a problem, because requests vary wildly in length. If you batch “Hi” (2 tokens) with a 500-word essay (800 tokens), the short request sits waiting while the long one is being processed. Your GPU ends up sitting idle because of this imbalance.
TEI instead batches by total token count: you set the maximum number of tokens per batch. With a limit of 512 tokens, for example, you can pack any number of requests into that window: 100 short queries, two long documents, or any combination that fills the “token real estate”. The result is more consistent GPU utilisation, which means better throughput when you have variable-length inputs.
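To make the idea concrete, here is a minimal Python sketch of packing requests under a token budget. It is a simplified greedy illustration, not TEI’s actual scheduler (which is written in Rust); the function name `pack_by_token_budget`, the `max_batch_tokens` parameter, and the request IDs are all made up for this example.

```python
from typing import List, Tuple


def pack_by_token_budget(
    requests: List[Tuple[str, int]], max_batch_tokens: int
) -> List[List[str]]:
    """Greedily pack (request_id, token_count) pairs into batches.

    A batch is closed as soon as adding the next request would push its
    total token count over max_batch_tokens. Arrival order is preserved.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_tokens = 0

    for request_id, n_tokens in requests:
        if n_tokens > max_batch_tokens:
            # A real server would truncate or reject; this sketch rejects.
            raise ValueError(f"{request_id} exceeds the token budget")
        if current and current_tokens + n_tokens > max_batch_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(request_id)
        current_tokens += n_tokens

    if current:
        batches.append(current)
    return batches


# Hypothetical mix of short queries and long documents, budget of 512 tokens.
incoming = [("q1", 2), ("q2", 5), ("doc1", 300), ("doc2", 250), ("q3", 4)]
batches = pack_by_token_budget(incoming, max_batch_tokens=512)
print(batches)
```

Note that the number of requests per batch varies (three in the first batch, two in the second), while the token total per batch stays under the budget. That is the property that keeps the work per batch roughly even.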
TEI also has good production observability built in: Prometheus metrics and OpenTelemetry tracing are included, which helps when you’re debugging performance issues.
However, TEI has one big limitation: each TEI container serves only one model, so if you need 5 different models running, you need 5 separate deployments. It also handles text-only embeddings, with no multi-modal support.
Handling multiple modalities
Infinity is built in Python with FastAPI, and is designed to handle a broader range of embedding types from a single server.
What’s useful about Infinity is the multi-modal support. You can embed text, images (via CLIP), audio (via CLAP), and even use late-interaction models like ColBERT, all from the same service. It also lets you serve multiple models from one instance, which can be more efficient with GPU memory than running separate containers.
For teams currently using OpenAI’s embedding API, Infinity provides an OpenAI-compatible endpoint. This means you can often switch by just changing the base URL in your code, without rewriting your application logic.
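As a rough illustration of what “changing the base URL” means, the sketch below builds the same OpenAI-style embeddings request against two different base URLs using only the standard library. The helper `build_embeddings_request` is hypothetical, and the local address `http://localhost:7997` and the model names are assumptions for this example; check your own deployment for the real host, port, and path prefix.

```python
import json
import urllib.request


def build_embeddings_request(base_url, model, texts, api_key="unused"):
    """Build (but do not send) an OpenAI-style POST to {base_url}/embeddings."""
    url = base_url.rstrip("/") + "/embeddings"
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    return urllib.request.Request(url, data=payload, headers=headers, method="POST")


# Same application code, two different backends: only the base URL
# (and the model name) changes.
openai_req = build_embeddings_request(
    "https://api.openai.com/v1", "text-embedding-3-small", ["hello world"]
)
local_req = build_embeddings_request(
    "http://localhost:7997",  # assumed address of a self-hosted server
    "BAAI/bge-small-en-v1.5",  # assumed model name
    ["hello world"],
)

print(openai_req.full_url)
print(local_req.full_url)
# To actually send it: urllib.request.urlopen(local_req)
```

If you use the official `openai` Python client instead, the equivalent change is passing a different `base_url` when constructing the client.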
The challenge with Infinity shows up under sustained production load. Looking at GitHub issues across embedding servers, there’s a pattern of stability problems like OOM crashes during traffic spikes, servers becoming unresponsive, and requests timing out. Besides this, there are some issues with model support and a lot of Python dependency and compatibility problems.
When you don’t need a server
FastEmbed from Qdrant is in a different category: it’s a Python library, not a server. You import it like any Python package, call a function, and get back embeddings. There’s no container to deploy, no endpoint to configure.
This makes sense for certain use cases. If you’re building a serverless function on AWS Lambda and cold start time matters, FastEmbed’s minimal dependencies help. It’s also simpler for CLI tools or scripts where running a dedicated server feels like too much complexity. There are some issues with the ONNX runtime and with dependency management, but the key trade-off is that there is no batching across requests: FastEmbed is a library, not a server, so nothing collects concurrent requests into shared batches. Under high concurrency, this matters.
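To show what a server adds on top of a library call, here is a small asyncio sketch of cross-request batching. It is not taken from any of the three projects: `MicroBatcher`, `fake_embed_batch`, and the parameter names are hypothetical, and the embedding function is a stand-in that returns the text length rather than a real vector.

```python
import asyncio


def fake_embed_batch(texts):
    # Hypothetical stand-in for one library call that embeds a list of texts.
    return [[float(len(text))] for text in texts]


class MicroBatcher:
    """Collects concurrent requests and embeds them in shared batches."""

    def __init__(self, embed_batch, max_batch_size=8, max_wait_s=0.05):
        self.embed_batch = embed_batch
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue = asyncio.Queue()
        self.batch_sizes = []  # recorded only to make the behaviour visible

    async def embed(self, text):
        # Each caller enqueues its text and waits for its own result.
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((text, future))
        return await future

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            # Block until there is at least one request, then wait briefly
            # for more so they can share a single model call.
            items = [await self.queue.get()]
            deadline = loop.time() + self.max_wait_s
            while len(items) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    items.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            # Note: a real model call would run in an executor, not inline.
            vectors = self.embed_batch([text for text, _ in items])
            self.batch_sizes.append(len(items))
            for (_, future), vector in zip(items, vectors):
                if not future.done():
                    future.set_result(vector)


async def demo():
    batcher = MicroBatcher(fake_embed_batch)  # created inside the running loop
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.embed(t) for t in ["a", "bb", "ccc"]))
    worker.cancel()
    return results, batcher.batch_sizes


results, batch_sizes = asyncio.run(demo())
print(results, batch_sizes)
```

With a plain library call, each of those three callers would trigger its own forward pass. If you use FastEmbed under concurrent load, this queueing layer is something you would have to build and tune yourself.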
What’s not fully solved yet
Looking at these three solutions, you can see they each optimize for different things. But there are still some capabilities that none of them addresses well.
Multi-output embeddings
Some models like BGE-M3 can produce dense, sparse, and multi-vector representations from a single forward pass. This is useful for hybrid retrieval, but current solutions treat these as separate outputs. You can’t easily request all three in one API call, so you end up issuing multiple separate requests.
Model lifecycle management
In production, you’re often testing new models or adding specialized models for specific use cases. With TEI’s one-model-per-container architecture, this gets complex and costly. Infinity can serve multiple models, but doesn’t have sophisticated memory management. If your 10 models don’t all fit in VRAM at once, you’re on your own.
Model-specific complexity handling
Different models have different quirks. For example, BERT uses absolute position embeddings, while some Mistral models use rotary embeddings. Another example is pooling strategy: some models need mean pooling (averaging all token vectors) and others need CLS pooling (taking the first token’s vector). This complexity falls on the users to handle correctly. TEI has these details compiled into the Rust binary, which works for supported models but isn’t flexible, because adding new model types means modifying and recompiling the code.
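The difference between the two pooling strategies is easy to see in a few lines of plain Python. This is a minimal sketch with made-up two-dimensional token vectors; the function names are illustrative, and real implementations operate on tensors rather than lists.

```python
def cls_pooling(token_vectors):
    """Use the first token's vector (the [CLS] position) as the embedding."""
    return list(token_vectors[0])


def mean_pooling(token_vectors, attention_mask):
    """Average the vectors of real tokens, skipping padding (mask == 0)."""
    kept = [vec for vec, mask in zip(token_vectors, attention_mask) if mask == 1]
    if not kept:
        raise ValueError("attention mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Made-up per-token outputs for one input: three real tokens, one padding token.
token_vectors = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

cls_embedding = cls_pooling(token_vectors)
mean_embedding = mean_pooling(token_vectors, attention_mask)
print(cls_embedding)   # [1.0, 0.0]
print(mean_embedding)  # [3.0, 2.0]
```

The same model output gives two different embeddings, so using the wrong strategy for a model silently degrades retrieval quality rather than raising an error.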
Conclusion
An ideal solution would combine the inference optimisations of TEI, the multi-modal flexibility of Infinity, the deployment simplicity of FastEmbed, and address these gaps. That doesn’t fully exist yet, which is why choosing an embedding inference solution currently means understanding which trade-offs fit your situation.

