How can I make a big AI model respond faster when my app asks it to turn a sentence into numbers (embeddings)?
And if you’re using embeddings at large scale, is TEI the safer choice?
Reduce embedding latency by attacking the real bottlenecks in order: tokens and padding, queueing and batching, then kernels and runtimes, then “do less work” (caching and precompute). The fastest path for most teams is: serve the encoder behind an embeddings-optimized server (TEI) with token-based micro-batching, run fp16 or bf16 on GPU, keep tokenization fast, cap sequence length, and measure queue time vs compute time. (Hugging Face)
0) Context: what “embedding latency” includes
A sentence-embedding request is not just “model forward pass.”
Typical end-to-end steps:
- Request overhead: network, TLS, JSON, gRPC framing.
- Queueing: waiting for a worker or for a batch to form.
- Tokenization: text → token IDs.
- Forward pass: transformer encoder compute.
- Pooling / normalization: mean-pool or CLS-pool, L2 normalize.
- Return payload: vector of 384–4096 floats.
If you only optimize the forward pass, p95 often stays bad because queueing or tokenization dominates.
1) Step 1: measure the latency breakdown (this prevents wasted work)
You want p50, p95, p99 for:
- queue_time_ms
- tokenize_time_ms
- forward_time_ms
- postprocess_time_ms
- serialize_time_ms
If you use TEI, it is explicitly positioned as “production-ready” and includes OpenTelemetry tracing and Prometheus metrics, which helps separate queueing from compute. (Hugging Face)
If you use Triton, use its performance tooling and tune batcher parameters based on real percentiles, not single-request timing. Triton’s dynamic batcher docs describe a recommended tuning loop and note that delaying to form batches trades latency for throughput. (NVIDIA Docs)
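If you are not behind TEI or Triton yet, even crude in-process timing is enough to find the dominant stage. Below is a minimal sketch of that breakdown, assuming a Hugging Face tokenizer and encoder you pass in yourself; the stage names match the list above, and the mean pooling is only illustrative:

```python
import time
from collections import defaultdict

import numpy as np

timings = defaultdict(list)  # stage name -> list of millisecond samples

class timed:
    """Record wall-clock milliseconds for one pipeline stage."""
    def __init__(self, stage):
        self.stage = stage
    def __enter__(self):
        self.t0 = time.perf_counter()
    def __exit__(self, *exc):
        timings[self.stage].append((time.perf_counter() - self.t0) * 1000)

def embed(texts, tokenizer, model):
    # Hypothetical pipeline: tokenize -> forward -> pool. Pass in your own
    # tokenizer and encoder; mean pooling here stands in for your real pooling.
    with timed("tokenize_time_ms"):
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with timed("forward_time_ms"):
        out = model(**batch)
    with timed("postprocess_time_ms"):
        emb = out.last_hidden_state.mean(dim=1)
    return emb

def report():
    for stage, samples in timings.items():
        p50, p95, p99 = np.percentile(samples, [50, 95, 99])
        print(f"{stage}: p50={p50:.1f}ms p95={p95:.1f}ms p99={p99:.1f}ms")
```

One caveat: CUDA ops run asynchronously, so on GPU you should call torch.cuda.synchronize() before reading the clock around the forward pass, otherwise you mostly measure kernel launch time.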
2) Step 2: make each request cheaper (usually the biggest win)
A) Cap tokens hard (token count is your main cost multiplier)
If your max_length is large (256–512) but most inputs are short, you pay for padding.
Do this:
- Set a tight max_length for online requests.
- Truncate by default.
- Bucket by length so you pad less (see batching section).
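A minimal sketch of both ideas, assuming a Hugging Face fast tokenizer; the model name and the max_length of 128 are illustrative values, not recommendations:

```python
from transformers import AutoTokenizer

# Example model; substitute your own encoder's tokenizer.
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def tokenize_online(texts, max_length=128):
    # Truncate aggressively for online traffic; padding="longest" pads only to
    # the longest sequence in this batch, not to max_length.
    return tokenizer(
        texts,
        padding="longest",
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )

def length_buckets(texts, bucket_size=32):
    # Group inputs of similar length so each batch pads to a similar length.
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    for start in range(0, len(order), bucket_size):
        idx = order[start:start + bucket_size]
        yield idx, [texts[i] for i in idx]
```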
B) Keep tokenization fast and warm
Hugging Face tokenizers often exist in:
- Python (“slow”)
- Rust-based “Fast” tokenizers
Transformers docs state that AutoTokenizer tries to load a fast tokenizer if available. (Hugging Face)
The tokenizers repo emphasizes speed from the Rust implementation and gives an order-of-magnitude throughput claim (GB-scale tokenization in seconds). (GitHub)
Practical implications:
- Ensure you are on the “Fast” tokenizer path.
- Avoid per-request tokenizer construction.
- Batch tokenization if you batch inference.
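A small sketch of what "fast and warm" looks like in practice; the model name is just an example:

```python
from transformers import AutoTokenizer

# Construct once at process startup, never per request.
TOKENIZER = AutoTokenizer.from_pretrained(
    "sentence-transformers/all-MiniLM-L6-v2", use_fast=True
)
assert TOKENIZER.is_fast, "fell back to the slow Python tokenizer"

def tokenize_batch(texts):
    # One batched call amortizes overhead across the whole request batch.
    return TOKENIZER(texts, padding=True, truncation=True, return_tensors="pt")
```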
3) Step 3: fix batching the “low-latency” way (micro-batching)
Batching is the core throughput tool. It can also reduce per-request latency at load, but it can increase tail latency if you wait too long to form batches.
Option 1: Use TEI token-based dynamic batching (recommended default)
TEI explicitly supports token-based dynamic batching. (Hugging Face)
The key idea: the server batches based on the number of tokens per request, not just “N requests,” which reduces padding waste when lengths vary. (GitHub)
When this works best:
- Many short requests plus some medium requests.
- You care about stable p95 under concurrency.
- You want fewer custom batching hacks.
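Client-side, TEI keeps things simple: send small requests and let the server pack them. A sketch assuming a TEI container already listening on localhost:8080; the /embed route and payload shape follow TEI's HTTP API, but verify fields like truncate against your version's docs:

```python
import requests

TEI_URL = "http://localhost:8080/embed"  # assumes a local TEI container on this port

def embed(texts):
    # TEI packs concurrent requests into token-based batches server-side,
    # so the client can stay simple: one small POST per request.
    resp = requests.post(TEI_URL, json={"inputs": texts, "truncate": True}, timeout=5)
    resp.raise_for_status()
    return resp.json()  # one float vector per input text

vectors = embed(["how do I reset my password?", "pricing for the pro plan"])
```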
Option 2: Use Triton dynamic batching with a strict delay budget
Triton’s dynamic batching combines requests server-side. (NVIDIA Docs)
You control how long Triton is allowed to delay requests to form a better batch with max_queue_delay_microseconds. The docs show an example config:
```
dynamic_batching {
  max_queue_delay_microseconds: 100
}
```
(Example from Triton docs.) (NVIDIA Docs)
Two critical pitfalls:
- Your model must support batching. Triton maintainers repeatedly point out that enabling batching adds a leading batch dimension that the model must accept. (GitHub)
- Set the delay to match your SLO. If you have a 50 ms p95 budget, you usually cannot spend 20 ms just waiting for a batch.
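For completeness, here is a client-side sketch using the tritonclient HTTP API. The model name (embedder) and tensor names (input_ids, attention_mask, sentence_embedding) are placeholders for whatever your config.pbtxt declares, and the shapes include the leading batch dimension that the pitfall above refers to:

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

def embed(input_ids: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    # Shapes include the leading batch dimension, e.g. (batch, seq_len),
    # which is only valid if config.pbtxt sets max_batch_size > 0.
    inputs = [
        httpclient.InferInput("input_ids", list(input_ids.shape), "INT64"),
        httpclient.InferInput("attention_mask", list(attention_mask.shape), "INT64"),
    ]
    inputs[0].set_data_from_numpy(input_ids)
    inputs[1].set_data_from_numpy(attention_mask)
    result = client.infer(
        "embedder",
        inputs=inputs,
        outputs=[httpclient.InferRequestedOutput("sentence_embedding")],
    )
    return result.as_numpy("sentence_embedding")
```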
4) Step 4: speed up the forward pass (GPU kernels, dtype, fastpaths)
A) Use fp16 or bf16 on GPU
For embedding encoders, fp16 or bf16 is typically the default latency lever on modern GPUs.
Sentence-Transformers explicitly documents using fp16 or bf16 to speed inference on GPU and shows how to enable it. (SentenceTransformers)
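A minimal sketch, assuming Sentence-Transformers on a CUDA device; the model name is an example, and the Sentence-Transformers efficiency docs cover the exact options for your version:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cuda")
model.half()  # fp16 weights; use model.bfloat16() on GPUs with good bf16 support

embeddings = model.encode(
    ["what is the refund policy?"],
    convert_to_numpy=True,
    normalize_embeddings=True,
)
```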
B) Use PyTorch SDPA and fast attention backends
PyTorch’s SDPA tutorial says fused SDPA can give “large performance benefits” vs naive attention. (PyTorch Docs)
PyTorch 2.2 release notes describe ~2Ă— improvements to scaled_dot_product_attention via FlashAttention-v2 integration. (PyTorch)
To control or debug which attention backend is used, PyTorch provides an SDPA backend selection context manager. (PyTorch Docs)
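To verify you are actually getting a fast kernel rather than a silent fallback, pin the backend. A sketch assuming PyTorch 2.3+, where torch.nn.attention.sdpa_kernel is the public context manager (older releases used torch.backends.cuda.sdp_kernel):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# (batch, heads, seq_len, head_dim) in fp16 on GPU, the layout SDPA expects.
q = torch.randn(8, 12, 128, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Pin SDPA to FlashAttention: if the inputs are unsupported this raises
# instead of silently falling back to a slower kernel.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
```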
C) Use BetterTransformer fastpath where supported
Hugging Face’s GPU inference docs describe BetterTransformer as converting models to a PyTorch-native fastpath that calls optimized kernels like Flash Attention under the hood, with the fp16/bf16 requirement noted. (Hugging Face)
Optimum’s BetterTransformer overview describes it as a “fast path” using fused kernels and SDPA-based implementations. (Hugging Face)
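A sketch of the fastpath conversion, assuming Optimum is installed and the architecture is supported; recent Transformers releases increasingly steer you toward native SDPA instead, so treat this as one of two routes to the same fused kernels:

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "sentence-transformers/all-MiniLM-L6-v2"  # example encoder
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, torch_dtype=torch.float16).to("cuda")

# Requires `pip install optimum`; raises if the architecture is unsupported,
# which is better than a silent fallback.
model = model.to_bettertransformer()

batch = tokenizer(["hello world"], return_tensors="pt").to("cuda")
with torch.inference_mode():
    out = model(**batch)
```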
Key pitfall: Fastpaths can silently fall back. Always benchmark and confirm the path is active.
5) Step 5: switch runtime when it makes sense (ONNX Runtime, OpenVINO)
If you are CPU-bound, or if PyTorch overhead is too high for your latency target, exporting the same model to an optimized runtime is often a big win.
A) ONNX Runtime transformer optimization tool
ONNX Runtime documents an offline transformer optimization tool for cases where ORT does not apply certain optimizations at load time, and for experimenting with fusions and float16 conversion. (onnxruntime.ai)
When ORT helps most:
- CPU deployments.
- GPU deployments where you can use fp16 and the graph fuses well.
- Stable fixed-shape or well-behaved dynamic shapes.
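A sketch of the offline optimization pass, assuming you have already exported the encoder to model.onnx; the path and the MiniLM-sized num_heads/hidden_size values are placeholders for your model:

```python
from onnxruntime.transformers import optimizer

# Offline pass: fuse attention/LayerNorm subgraphs and optionally convert to fp16.
opt_model = optimizer.optimize_model(
    "model.onnx",        # placeholder path to your exported encoder
    model_type="bert",
    num_heads=12,        # MiniLM-L6-sized values; set these to your model's config
    hidden_size=384,
)
opt_model.convert_float_to_float16()  # only if you will serve the model on GPU
opt_model.save_model_to_file("model_opt_fp16.onnx")
```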
B) Sentence-Transformers backends: PyTorch vs ONNX vs OpenVINO
Sentence-Transformers explicitly supports three embedding backends: PyTorch, ONNX, OpenVINO, and provides a dedicated “Speeding up Inference” guide and benchmark section. (SentenceTransformers)
This is a practical “same model, different engine” comparison path.
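A sketch of that comparison, assuming a Sentence-Transformers version that supports the backend argument (v3.2+) and that the ONNX/OpenVINO extras are installed:

```python
from sentence_transformers import SentenceTransformer

name = "sentence-transformers/all-MiniLM-L6-v2"  # example model

# Same model, three engines; benchmark encode() latency on your own traffic.
pt_model = SentenceTransformer(name)                      # PyTorch (default)
onnx_model = SentenceTransformer(name, backend="onnx")    # needs optimum + onnxruntime
ov_model = SentenceTransformer(name, backend="openvino")  # needs openvino extras
```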
6) Step 6: eliminate work (caching and precompute)
Embedding models are deterministic for a fixed model version.
Do these:
- Cache embeddings for repeated inputs: key by (model_version, normalized_text_hash).
- Precompute document embeddings offline. Only embed the user query online.
- If payload size matters, consider returning float16 vectors or compressing responses (only if downstream supports it).
These changes often reduce latency more than any kernel tweak because they remove requests entirely.
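A minimal in-process sketch of the cache; in production you would swap the dict for Redis or another shared store, and embed_fn stands in for whatever batch embedding call you already have:

```python
import hashlib

MODEL_VERSION = "all-MiniLM-L6-v2@2024-01"  # bump whenever the model changes
_cache = {}                                  # swap for Redis/memcached in production

def _key(text: str) -> str:
    normalized = " ".join(text.lower().split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"{MODEL_VERSION}:{digest}"

def embed_cached(texts, embed_fn):
    # embed_fn is your existing batch embedding call (TEI client, ST model, ...).
    missing = [t for t in texts if _key(t) not in _cache]
    if missing:
        for text, vec in zip(missing, embed_fn(missing)):
            _cache[_key(text)] = vec
    return [_cache[_key(t)] for t in texts]
```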
7) A concrete “default plan” for your case
If you tell me nothing about your environment, this is the plan that most reliably improves p95:
- Put embeddings behind TEI (or Triton) so you get production batching and metrics. TEI is purpose-built for embeddings and advertises token-based dynamic batching plus production telemetry. (Hugging Face)
- Cap and truncate max_length for online traffic. Then bucket by length.
- Use fast tokenizers (AutoTokenizer fast path) and keep tokenizers warm. (Hugging Face)
- Run fp16 or bf16 on GPU, then try BetterTransformer or SDPA fastpaths if supported. (SentenceTransformers)
- If CPU is required, export to ONNX and apply ORT transformer optimization, or evaluate OpenVINO via Sentence-Transformers backend options. (onnxruntime.ai)
- Add cache + offline precompute.
High-quality links (with what each is good for)
- TEI docs (features, quick tour, benchmarks): https://huggingface.co/docs/text-embeddings-inference/en/index (Hugging Face)
- TEI engine page (what it uses under the hood): https://huggingface.co/docs/inference-endpoints/en/engines/tei (Hugging Face)
- Token-based batching explanation (TEI discussion): https://github.com/huggingface/text-embeddings-inference/discussions/151 (GitHub)
- Triton dynamic batching and max_queue_delay_microseconds: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/batcher.html (NVIDIA Docs)
- Triton batching pitfall (model must support batching): https://github.com/triton-inference-server/server/discussions/5401 (GitHub)
- Sentence-Transformers “Speeding up Inference” (PyTorch vs ONNX vs OpenVINO): https://sbert.net/docs/sentence_transformer/usage/efficiency.html (SentenceTransformers)
- PyTorch SDPA tutorial (why fused attention is faster): https://docs.pytorch.org/tutorials/intermediate/scaled_dot_product_attention_tutorial.html (PyTorch Docs)
- PyTorch 2.2 SDPA performance notes (FlashAttention-v2 integration): https://pytorch.org/blog/pytorch2-2/ (PyTorch)
- HF Transformers GPU inference guide (BetterTransformer, FlashAttention dtype constraints): https://huggingface.co/docs/transformers/en/perf_infer_gpu_one (Hugging Face)
- ONNX Runtime transformer optimization tool: https://onnxruntime.ai/docs/performance/transformers-optimization.html (onnxruntime.ai)
- HF “fast tokenizers” page (AutoTokenizer fast fallback behavior): https://huggingface.co/docs/transformers/main/fast_tokenizers (Hugging Face)
Summary
- Biggest levers: cap tokens, reduce padding, micro-batch with a strict delay budget.
- TEI is the simplest production default for embeddings because it does token-based dynamic batching and exposes production telemetry. (Hugging Face)
- For Triton, dynamic batching is powerful but you must tune max_queue_delay_microseconds and ensure the model supports a batch dimension. (NVIDIA Docs)
- Use fp16 or bf16 plus SDPA and BetterTransformer fastpaths when supported. (PyTorch)
- For CPU or stubborn PyTorch overhead, test ONNX Runtime optimization or OpenVINO via Sentence-Transformers backends. (onnxruntime.ai)
- Add caching and offline precompute to remove requests entirely.