For now, it seems to be a known but complicated failure mode:
Answer
Yes — this is a known failure class, but I would phrase it carefully.
I would not describe it as:
TFLiteConverter officially does not support encoder-decoder seq2seq PTQ.
That is too broad.
A more accurate statement is:
Full INT8 post-training quantization with TFLiteConverter is not a robust, well-documented deployment path for a fused autoregressive encoder-decoder Transformer graph. Conversion success only proves that the graph was lowered to a TFLite flatbuffer; it does not prove that encoder-decoder conditioning survived quantization.
In this case, the symptoms are much stronger than ordinary quantization degradation:
- BLEU drops from 23.9 to 0.04.
- The model emits repeated tokens for essentially any input.
- The decoder appears to ignore the encoder from the first decoding step.
- The INT16 activations / INT8 weights path is not deployable because the target runtime rejects TILE.
That combination strongly suggests that full INT8 PTQ has damaged the encoder-memory / decoder cross-attention path. The converted model is structurally valid, but semantically broken.
Why conversion success is misleading
TFLite conversion answers a graph-lowering question:
Can this TensorFlow graph be represented as a TFLite model using the requested operator set?
It does not answer the more important deployment question:
Does the quantized model preserve the numerical behavior required for autoregressive seq2seq generation?
The LiteRT/TFLite full-integer quantization path uses a representative dataset to estimate ranges for variable tensors such as model inputs, outputs, and intermediate activations.
For image classifiers, “representative data” often means representative images. For seq2seq generation, that is not enough. The representative dataset must exercise the real generation states:
- source token lengths,
- source attention masks,
- decoder prefix lengths,
- decoder masks,
- forced BOS / language tokens,
- early decoding,
- middle decoding,
- near-EOS decoding,
- cross-attention activation ranges,
- final-logit ranges.
If the converter only calibrates a narrow graph path, it can choose bad INT8 scales for tensors that are critical during real decoding.
That is how you can get:
conversion succeeds
+
runtime does not crash
+
outputs are completely wrong
Why this looks like cross-attention failure
An encoder-decoder Transformer has two main parts:
source tokens
→ encoder
→ encoder hidden states
→ decoder cross-attention
→ decoder hidden states
→ logits
→ output tokens
The decoder uses the source input mainly through cross-attention. If that path is corrupted, the decoder still has:
- target-side token embeddings,
- decoder self-attention,
- learned language-model priors,
- LM-head bias,
- BOS / forced-token priors,
- common-token frequency bias.
So the model can still generate tokens. But the output becomes weakly conditioned or unconditioned. Typical symptoms are:
- same-ish output for different inputs,
- repeated tokens,
- generic high-prior tokens,
- early collapse,
- near-zero BLEU,
- first-step logits that barely change across source inputs.
That matches the described behavior.
The most suspicious region is:
encoder_hidden_states
→ cross-attention key/value projections
→ attention score/value path
→ cross-attention output projection
→ residual / LayerNorm-adjacent tensors
→ LM-head logits
The first decoding step is especially diagnostic. At step 1, the decoder has almost no target-side history. If the INT8 model is already source-insensitive at step 1, the problem is probably not beam search, repetition penalty, EOS handling, or long-run generation logic. It is likely the encoder-memory path or the first decoder cross-attention block.
Is this a known limitation?
In practical terms, yes.
The exact sentence “TFLiteConverter PTQ does not support encoder-decoder seq2seq” is not the usual official wording. But the documented pieces line up:
- TFLite full-integer PTQ depends on representative activation calibration: LiteRT post-training integer quantization.
- TFLite provides a Quantization Debugger specifically because full-integer quantization can produce unexpectedly poor or completely wrong results.
- Hugging Face’s Optimum TFLite exporter overview lists mostly encoder-style architectures such as BERT, RoBERTa, DistilBERT, MobileBERT, MPNet, and related models. It does not present full autoregressive encoder-decoder generation as the obvious happy path.
- Optimum’s TFLite export guide notes that static input shapes need to be specified.
- Hugging Face’s Optimum ONNX export docs describe encoder-decoder export using separate encoder and decoder pieces, because the encoder runs once and the decoder runs repeatedly during autoregressive generation.
- ONNX Runtime’s quantization guide says dynamic quantization is generally recommended for RNNs and Transformer-based models, while static quantization is generally recommended for CNNs.
That last point is especially relevant. Your hardware requires a static full-INT8-style artifact, but Transformer generation is one of the model families where static activation calibration is most fragile.
So the practical answer is:
This is a known class of PTQ failure: a valid full-INT8 TFLite model can be generated, but the quantized activations can destroy the conditioning path that makes encoder-decoder generation work.
Why Transformers are hard for generic INT8 PTQ
Transformer quantization is difficult mainly because the activations are difficult.
The literature around Transformer quantization (for example SmoothQuant, LLM.int8(), and I-BERT) repeatedly points to activation outliers and attention/LayerNorm sensitivity.
Plain TFLiteConverter PTQ is much more generic than these methods. It does not automatically perform SmoothQuant-style activation smoothing, LLM.int8()-style outlier routing, or I-BERT-style integer Transformer operator redesign.
That matters because a fused encoder-decoder generation graph contains exactly the fragile pieces:
MatMul / BatchMatMul
Softmax
LayerNorm-adjacent tensors
residual additions
attention masks
cross-attention K/V projections
final vocabulary projection
A single bad scale around cross-attention can make the decoder appear source-blind.
Why the fused graph is probably the wrong deployment shape
A fused encoder-decoder generation graph is the least debuggable shape for this problem.
Seq2seq inference naturally looks like this:
1. Run encoder once.
2. Repeatedly run decoder for each output token.
3. Select the next token outside the model.
4. Stop on EOS or max length.
The usual deployment structure is therefore:
encoder model:
input_ids, attention_mask
→ encoder_hidden_states
decoder-step model:
decoder_input_ids, decoder_attention_mask, encoder_hidden_states, encoder_attention_mask
→ next-token logits
Then the host application runs greedy search, beam search, EOS handling, and repetition logic outside the model.
This is also the shape used by common seq2seq export/deployment tooling. For example, Hugging Face’s Optimum ONNX export guide discusses decoder export with past key/value reuse because the decoder runs repeatedly during autoregressive generation.
A fused graph often hides too much:
encoder
decoder
decoder loop
mask updates
shape operations
possibly beam expansion
possibly TILE
token selection
EOS handling
That makes all of these harder:
- calibration,
- static-shape control,
- operator support,
- delegate partitioning,
- cross-attention inspection,
- first-step source-sensitivity testing,
- quantized boundary debugging.
For this case, I would not keep pushing the fused graph as the primary production path.
Recommended working approach
The most realistic path forward is:
encoder_int8.tflite
+
decoder_step_int8.tflite
+
host-side generation loop
Do not export generate() as one fused TFLite graph unless there is no alternative.
Target layout
encoder.tflite
  inputs:
    input_ids: int32
    attention_mask: int32
  outputs:
    encoder_hidden_states: int8

decoder_step.tflite
  inputs:
    decoder_input_ids: int32
    decoder_attention_mask: int32
    encoder_hidden_states: int8
    encoder_attention_mask: int32
  outputs:
    logits: int8
Host-side decoding:
encoder_states = run_encoder(input_ids, attention_mask)
decoder_ids = [decoder_start_token_id]

for step in range(max_new_tokens):
    logits = run_decoder_step(
        decoder_input_ids=decoder_ids,
        decoder_attention_mask=make_decoder_mask(decoder_ids),
        encoder_hidden_states=encoder_states,
        encoder_attention_mask=attention_mask,
    )
    next_id = select_next_token(logits)
    decoder_ids.append(next_id)
    if next_id == eos_token_id:
        break
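A minimal sketch of the runner functions with tf.lite.Interpreter. The substring-based tensor-name matching is a heuristic, the tensor names are placeholders, and all arrays must already match the static shapes and dtypes the models were exported with; adapt this to your actual signatures:

import numpy as np
import tensorflow as tf

encoder = tf.lite.Interpreter(model_path="encoder.tflite")
encoder.allocate_tensors()
decoder = tf.lite.Interpreter(model_path="decoder_step.tflite")
decoder.allocate_tensors()

def _set_inputs(interpreter, feed):
    # Match feed entries to input tensors by name substring (heuristic).
    for detail in interpreter.get_input_details():
        for key, value in feed.items():
            if key in detail["name"]:
                interpreter.set_tensor(detail["index"], value)

def run_encoder(input_ids, attention_mask):
    _set_inputs(encoder, {
        "input_ids": np.asarray(input_ids, dtype=np.int32),
        "attention_mask": np.asarray(attention_mask, dtype=np.int32),
    })
    encoder.invoke()
    return encoder.get_tensor(encoder.get_output_details()[0]["index"])

def run_decoder_step(decoder_input_ids, decoder_attention_mask,
                     encoder_hidden_states, encoder_attention_mask):
    _set_inputs(decoder, {
        "decoder_input_ids": decoder_input_ids,
        "decoder_attention_mask": decoder_attention_mask,
        "encoder_hidden_states": encoder_hidden_states,
        "encoder_attention_mask": encoder_attention_mask,
    })
    decoder.invoke()
    return decoder.get_tensor(decoder.get_output_details()[0]["index"])

def select_next_token(logits):
    # Greedy selection over the last position's vocabulary logits ([batch, seq, vocab] assumed).
    return int(np.argmax(np.asarray(logits)[0, -1]))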
This structure gives you a way to test each boundary:
FP32 encoder → FP32 decoder
INT8 encoder → FP32 decoder
FP32 encoder → INT8 decoder
INT8 encoder → INT8 decoder
INT8 encoder → INT8 decoder on hardware delegate
That isolates whether the failure comes from:
- encoder quantization,
- decoder quantization,
- the encoder-output / decoder-input boundary,
- cross-attention,
- logits,
- or the delegate.
The hard part: quantized encoder/decoder boundary
If you split the graph, the encoder output and decoder input may have different quantization parameters.
Example:
encoder output:
scale_e
zero_point_e
decoder encoder_hidden_states input:
scale_d
zero_point_d
You cannot blindly pass raw INT8 bytes from the encoder output into the decoder input unless the quantization parameters match.
If they differ, you need an explicit requantization bridge:
real_value = scale_e * (q_e - zero_point_e)
q_d = round(real_value / scale_d + zero_point_d)
q_d = clamp(q_d, -128, 127)
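A minimal NumPy sketch of that bridge, assuming per-tensor quantization parameters read from the interpreters' tensor details (the "quantization" field returns (scale, zero_point)):

import numpy as np

def requantize(q_e, scale_e, zero_point_e, scale_d, zero_point_d):
    # Dequantize with the encoder output's params, then requantize with the decoder input's params.
    real_value = scale_e * (q_e.astype(np.float32) - zero_point_e)
    q_d = np.round(real_value / scale_d + zero_point_d)
    return np.clip(q_d, -128, 127).astype(np.int8)

# Example of where the params come from:
# scale_e, zero_point_e = encoder_interpreter.get_output_details()[0]["quantization"]
# scale_d, zero_point_d = decoder_input_detail["quantization"]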
This boundary is important. A broken boundary can produce exactly the same symptom as broken cross-attention: the decoder runs but receives meaningless encoder memory.
For debugging, temporarily test these variants:
FP32 encoder output → FP32 decoder
INT8 encoder output → dequantized float → FP32 decoder
FP32 encoder output → quantized decoder input → INT8 decoder
INT8 encoder output → requantized decoder input → INT8 decoder
Only the final variant is close to strict deployment, but the intermediate variants tell you where the information is lost.
Calibration strategy
The representative dataset must cover actual generation states.
Do not calibrate only source inputs.
Do not calibrate only BOS.
Do not calibrate only full teacher-forced targets if deployment uses step-by-step decoding.
A better calibration set should include multiple decoder prefixes per source example.
Bad calibration pattern
def representative_dataset():
    for batch in source_batches:
        yield {
            "input_ids": batch["input_ids"],
            "attention_mask": batch["attention_mask"],
        }
That may calibrate the encoder path but not the decoder cross-attention behavior used during generation.
Better calibration pattern
def representative_dataset():
    for src_text, tgt_text in calibration_pairs:
        src = source_tokenizer(
            src_text,
            max_length=SRC_LEN,
            padding="max_length",
            truncation=True,
            return_tensors="np",
        )
        tgt = target_tokenizer(
            tgt_text,
            max_length=TGT_LEN,
            padding=False,
            truncation=True,
            return_tensors="np",
        )
        target_ids = tgt["input_ids"][0]

        for prefix_len in [1, 2, 4, 8, 16, 32]:
            if prefix_len > len(target_ids):
                continue
            decoder_prefix = target_ids[:prefix_len]
            decoder_prefix = pad_to_length(
                decoder_prefix,
                length=DECODER_PREFIX_LEN,
                pad_id=target_pad_id,
            )
            yield {
                "input_ids": src["input_ids"].astype("int32"),
                "attention_mask": src["attention_mask"].astype("int32"),
                "decoder_input_ids": decoder_prefix[None, :].astype("int32"),
                "decoder_attention_mask": (decoder_prefix[None, :] != target_pad_id).astype("int32"),
            }
The exact input names must match the SavedModel signature.
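The pad_to_length helper above is not a library function. A minimal version could be:

import numpy as np

def pad_to_length(ids, length, pad_id):
    # Right-pad (or truncate) a 1-D token-id array to a fixed static length.
    ids = np.asarray(ids, dtype=np.int32)[:length]
    padded = np.full(length, pad_id, dtype=np.int32)
    padded[: len(ids)] = ids
    return padded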
Calibration coverage checklist
Include:
short source examples
normal source examples
long source examples
max-length source examples
padding-heavy examples
near-no-padding examples
rare names and numerals
punctuation-heavy examples
domain-specific examples
BOS / forced decoder-start token
early decoder prefix
middle decoder prefix
near-EOS decoder prefix
A useful rule of thumb:
200 source examples × 5 decoder prefixes
is usually more informative than:
1000 source examples × only BOS
because the former covers more activation regimes.
Converter configuration advice
There is probably no single converter flag that fixes this.
Still, I would run these baselines.
1. Float TFLite baseline
converter = tf.lite.TFLiteConverter.from_saved_model("<saved_model_dir>")
tflite_model = converter.convert()
with open("<model_float>.tflite", "wb") as f:
    f.write(tflite_model)
If this fails, stop. The issue is export/lowering, not INT8.
2. Dynamic-range baseline
converter = tf.lite.TFLiteConverter.from_saved_model("<saved_model_dir>")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
with open("<model_dynamic_range>.tflite", "wb") as f:
    f.write(tflite_model)
If dynamic-range quantization works while full INT8 fails, weights are probably not the main problem. The problem is activation quantization.
3. Full INT8 baseline
converter = tf.lite.TFLiteConverter.from_saved_model("<saved_model_dir>")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
tflite_model = converter.convert()
with open("<model_int8>.tflite", "wb") as f:
    f.write(tflite_model)
Be careful with:
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
For text models, token IDs are categorical integer indices, not numeric image/audio activations. input_ids and masks often remain int32. Do not blindly force token IDs to INT8.
Always inspect the converted model:
interpreter = tf.lite.Interpreter(model_path="<model_int8>.tflite")
interpreter.allocate_tensors()
print("Inputs:")
for item in interpreter.get_input_details():
print(item["name"], item["dtype"], item["shape"], item["quantization"])
print("Outputs:")
for item in interpreter.get_output_details():
print(item["name"], item["dtype"], item["shape"], item["quantization"])
If the final logits are INT8, the host decoder must respect the output tensor’s scale and zero point.
For greedy argmax, quantized argmax is often equivalent if all logits share one scale and zero point. For beam search, length penalty, temperature, top-k, or probability arithmetic, dequantization or careful fixed-point handling is safer.
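If you need real-valued logits for those cases, a minimal dequantization sketch, reusing the interpreter from the inspection snippet above and assuming the logits output actually carries per-tensor quantization parameters:

import numpy as np

out = interpreter.get_output_details()[0]
scale, zero_point = out["quantization"]           # (scale, zero_point) of the logits tensor

q_logits = interpreter.get_tensor(out["index"])   # raw int8 values
logits = scale * (q_logits.astype(np.float32) - zero_point)
# Temperature, top-k, or beam scoring can now be applied to `logits` in float.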
About inference_output_type=tf.float32
This line is suspicious:
converter.inference_output_type = tf.float32
It is not necessarily the root cause of the collapse, but it is worth testing without it.
If the target is a strict INT8 hardware delegate, leaving a float output can create an awkward quantize/dequantize boundary or a partially non-integer interface. That may be acceptable for debugging, but it is not ideal for a strict integer deployment.
However, the repeated-token collapse is more likely caused by an internal activation/cross-attention quantization problem than by the output type alone.
I would test both:
# Debug-friendly interface
converter.inference_output_type = tf.float32
and:
# Strict integer numeric output, if compatible with your graph interface
converter.inference_output_type = tf.int8
Then compare:
full INT8 CPU output
full INT8 delegate output
first-step source sensitivity
BLEU
Why 16x8 is useful even though it is not deployable
The experimental mode:
tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8
is useful diagnostically because it tests whether INT8 activations are the problem.
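A diagnostic 16x8 conversion for CPU evaluation could look like this, following the same pattern as the full-INT8 baseline above; this is a sketch, and the ops-set name is experimental and may change between TF versions:

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("<saved_model_dir>")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8
]
tflite_model_16x8 = converter.convert()

with open("<model_16x8>.tflite", "wb") as f:
    f.write(tflite_model_16x8)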
If 16x8 improves quality but the runtime rejects TILE, the interpretation is:
The model likely needs more activation precision.
The target delegate cannot execute the more accurate path.
LiteRT documents 16-bit activations with 8-bit weights as an option that can help when activations are sensitive, but optimized kernel/delegate support is more limited than ordinary INT8.
So the TILE problem is not surprising. It is a runtime/delegate support failure, not proof that plain INT8 PTQ should work.
The first diagnostic test I would run
Before doing more BLEU evaluation, run a first-step source-sensitivity test.
Pick two very different source sentences:
source A: "The committee approved the budget after three hours of debate."
source B: "The patient developed a fever after the second injection."
Use the same decoder prefix:
decoder_input_ids = [decoder_start_token_id]
Compare:
FP32(source A, BOS) → logits_A_fp32
FP32(source B, BOS) → logits_B_fp32
INT8(source A, BOS) → logits_A_int8
INT8(source B, BOS) → logits_B_int8
Healthy behavior:
FP32 logits differ by source.
INT8 logits also differ by source.
Broken source-blind behavior:
FP32 logits differ by source.
INT8 logits are nearly identical across sources.
Example helper:
import numpy as np

def topk_ids(logits, k=10):
    flat = np.asarray(logits).reshape(-1)
    return np.argsort(flat)[-k:][::-1]

def compare_logits(logits_a, logits_b, k=10):
    a = np.asarray(logits_a).reshape(-1).astype(np.float64)
    b = np.asarray(logits_b).reshape(-1).astype(np.float64)
    top_a = topk_ids(a, k)
    top_b = topk_ids(b, k)
    cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return {
        "argmax_a": int(top_a[0]),
        "argmax_b": int(top_b[0]),
        "same_argmax": bool(top_a[0] == top_b[0]),
        "topk_overlap": len(set(top_a.tolist()) & set(top_b.tolist())),
        "cosine": float(cosine),
        "range_a": float(a.max() - a.min()),
        "range_b": float(b.max() - b.min()),
        "top_a": top_a.tolist(),
        "top_b": top_b.tolist(),
    }
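A possible usage pattern, assuming a hypothetical run_first_step(interpreter, source_text) wrapper that tokenizes the source, feeds a BOS-only decoder prefix, invokes the model, and returns the first-step logits:

logits_a_fp32 = run_first_step(fp32_interpreter, source_a)
logits_b_fp32 = run_first_step(fp32_interpreter, source_b)
logits_a_int8 = run_first_step(int8_interpreter, source_a)
logits_b_int8 = run_first_step(int8_interpreter, source_b)

print("FP32, A vs B:", compare_logits(logits_a_fp32, logits_b_fp32))
print("INT8, A vs B:", compare_logits(logits_a_int8, logits_b_int8))
# Healthy: both rows show different argmax values and cosine well below 1.
# Source-blind INT8: the FP32 row differs across sources, the INT8 row is nearly identical.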
This is more informative than BLEU.
BLEU tells you the model is broken. First-step source sensitivity tells you whether the encoder context is already gone at the first decoder step.
Use Quantization Debugger
Use TFLite’s Quantization Debugger to identify where error first explodes:
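A minimal sketch of the debugger workflow, assuming tf.lite.experimental.QuantizationDebugger as documented in the TFLite quantization-debugger guide, reusing the full-INT8 converter setup from above:

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("<saved_model_dir>")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

debugger = tf.lite.experimental.QuantizationDebugger(
    converter=converter,
    debug_dataset=representative_dataset,
)
debugger.run()

# Per-layer quantization error statistics (e.g. RMSE relative to scale) go to CSV for inspection.
with open("quant_debug_stats.csv", "w") as f:
    debugger.layer_statistics_dump(f)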
Inspect these tensors first:
encoder output
decoder layer 0 cross-attention query
decoder layer 0 cross-attention key
decoder layer 0 cross-attention value
decoder layer 0 cross-attention scores
decoder layer 0 cross-attention output
post-cross-attention residual
final decoder hidden state
LM-head logits
Look for:
| Observation | Likely meaning |
| --- | --- |
| Encoder hidden states saturated | Encoder output quantization is bad |
| Cross-attention K/V nearly constant | Source memory is destroyed |
| Attention scores nearly constant | Decoder cannot select source positions |
| Attention scores extreme | Softmax collapses |
| Cross-attention output near zero | Source signal is muted |
| Residual dominates attention output | Encoder signal is drowned |
| Logits almost identical across sources | Decoder is source-blind |
| Logits saturated | Final projection/output scale problem |
Selective quantization can also be useful diagnostically. For example, leave one region float and see whether BLEU recovers:
leave encoder output float
leave cross-attention K/V projections float
leave attention score path float
leave post-cross-attention residual float
leave LM head float
This may not be deployable on an INT8-only delegate, but it can identify the tensor group that kills the model.
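If the debugger points at specific layers, the documented selective-quantization path can keep those layers float while quantizing the rest. This is a sketch; the denylisted tensor names are hypothetical and must come from your own debugger dump, and the exact option names should be checked against your TF version:

# Keep suspected cross-attention tensors in float; names below are hypothetical examples.
suspected_layers = [
    "decoder/layer_0/cross_attention/key/MatMul",
    "decoder/layer_0/cross_attention/value/MatMul",
]

debug_options = tf.lite.experimental.QuantizationDebugOptions(
    denylisted_nodes=suspected_layers,
)
debugger = tf.lite.experimental.QuantizationDebugger(
    converter=converter,
    debug_dataset=representative_dataset,
    debug_options=debug_options,
)
selective_model = debugger.get_nondebug_quantized_model()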
Full diagnostic ladder
Run the same evaluation set through these variants:
| Variant | Purpose | Interpretation |
| --- | --- | --- |
| Original FP32 TensorFlow / Transformers | Reference | Should reproduce BLEU around 23.9 |
| Float TFLite CPU | Export/lowering check | If bad, quantization is not the first problem |
| Dynamic-range TFLite CPU | Weight-quantization check | If good, weights are not the main issue |
| Full INT8 TFLite CPU | Quantization check | If bad, calibration/numerics are failing |
| Full INT8 TFLite delegate | Runtime check | If CPU good but delegate bad, runtime/delegate is failing |
| 16x8 TFLite CPU, if possible | Activation-precision check | If better, INT8 activations are the bottleneck |
The key split is:
float TFLite bad
→ export/lowering/fused-graph issue
float TFLite good, INT8 CPU bad
→ quantization/calibration issue
INT8 CPU good, INT8 delegate bad
→ delegate/operator/kernel issue
16x8 better than INT8
→ activation precision issue
What to do if split PTQ still fails
If the split encoder/decoder-step model still collapses after proper calibration, the realistic options are:
1. Quantization-aware training
Use QAT if PTQ cannot meet the accuracy target.
Important: do QAT on the deployment-shaped graph, not only on the original training graph. A minimal sketch of the QAT entry point follows the list below.
That means:
same max source length
same decoder-step shape
same masks
same BOS/EOS behavior
same tokenizer
same target delegate constraints
same quantized encoder/decoder boundary
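As an illustration of the entry point only: the TensorFlow Model Optimization toolkit exposes Keras QAT via quantize_model. Hugging Face encoder-decoder layers typically need custom quantization annotations, so treat this as a sketch rather than a drop-in recipe; the model, loss, and dataset names below are placeholders:

import tensorflow as tf
import tensorflow_model_optimization as tfmot

# deployment_shaped_model: a Keras model with the static decoder-step shapes,
# masks, and tokenizer behavior used at deployment time (placeholder).
qat_model = tfmot.quantization.keras.quantize_model(deployment_shaped_model)

qat_model.compile(optimizer="adam", loss=seq2seq_loss)  # seq2seq_loss is a placeholder
qat_model.fit(finetune_dataset, epochs=3)               # finetune_dataset is a placeholder

# Convert the QAT model with the same full-INT8 settings as before.
converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
qat_int8_model = converter.convert()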
2. Distillation into a quantization-friendly model
If the original architecture is too sensitive, distill into a smaller model designed for the target constraints:
fixed source length
fixed decoder-step shape
simpler attention pattern
no fused generation graph
delegate-supported ops only
QAT or PTQ-aware evaluation from the beginning
3. Runtime change, if possible
If the target can change, use a Transformer-native runtime instead of generic fused TFLite.
CTranslate2 supports many encoder-decoder Transformer families and quantization modes. Even if it cannot be shipped on the final target, it is useful as a sanity check: if CTranslate2 INT8 works but TFLite INT8 collapses, the model is probably quantizable, but the current TFLite path is not preserving it.
4. Requirement change
If the hardware delegate truly requires plain full INT8 TFLite and the model cannot survive that path, the requirement may be incompatible with the model family.
Possible requirement changes:
allow int16 activations
allow selected float fallback
allow a custom op
allow a different runtime
allow a smaller/distilled model
allow server-side inference
What I would not spend time on
Blindly adding more calibration samples
More data does not fix the wrong calibration distribution.
Bad:
1000 source examples × BOS only
Better:
200 source examples × multiple decoder prefixes
Blindly trying converter flags
Converter flags are secondary. The primary issue is graph shape and activation calibration.
Assuming TILE is the root cause of BLEU collapse
TILE explains why the 16x8 path is not viable on the target. It does not by itself explain why full INT8 repeats tokens. These are related deployment constraints, but not the same failure.
Assuming the converter understands generation semantics
The converter lowers tensors and ops. It does not know that a certain tensor is “encoder memory that must preserve source conditioning.”
My final recommendation
I would proceed like this:
- Do not keep the fused graph as the main production candidate.
- Build a float TFLite baseline and verify it matches the original model.
- Build a dynamic-range TFLite baseline.
- Split the model into encoder.tflite and decoder_step.tflite.
- Calibrate the decoder-step model using real decoder prefixes across multiple timesteps.
- Run full INT8 CPU before using the hardware delegate.
- Run the first-step source-sensitivity test.
- Use Quantization Debugger around encoder output and decoder cross-attention.
- Explicitly handle the quantized encoder-output / decoder-input boundary.
- If PTQ still collapses, move to QAT or distillation.
- If CPU INT8 works but the delegate fails, treat it as a delegate/operator support problem.
The concise diagnosis is:
The converted model is probably not failing because TFLite cannot tokenize, decode, or run the graph at all. It is failing because full INT8 static PTQ has destroyed the numerical path that carries encoder information into decoder cross-attention. The decoder still emits tokens, but it no longer receives useful source context, so it falls back to repeated high-prior tokens and BLEU collapses.
Short answer
- Yes, this is a known full-INT8 PTQ failure class.
- No, there is probably not one converter flag that fixes a fused encoder-decoder generation graph.
- The likely broken region is cross-attention or the encoder-hidden-state boundary.
- The recommended deployment shape is split encoder + decoder-step, with generation outside TFLite.
- Calibration must include real decoder prefixes across timesteps, not only source inputs.
- Use float TFLite, dynamic-range TFLite, full INT8 CPU, and full INT8 delegate as separate baselines.
- Use first-step source-sensitivity tests and Quantization Debugger before relying only on BLEU.
- If careful split PTQ still fails, use QAT, distillation, or a different runtime/precision target.