PTQ INT8 via TFLiteConverter — encoder-decoder seq2seq model loses encoder context entirely after conversion

I’m trying to deploy a seq2seq encoder-decoder model on an embedded target that only accepts INT8 TFLite models. Conversion via `TFLiteConverter` completes without errors, but the resulting model is completely broken at inference, which suggests the converter is not handling the encoder-decoder architecture correctly under full INT8 quantization.

Environment

  • tensorflow 2.13, transformers 4.40
  • macOS (conversion) → embedded Linux with INT8 hardware delegate (inference)

Problem

I’m converting a fused encoder-decoder seq2seq model to INT8 using TFLiteConverter with the following setup:

converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_output_type = tf.float32

Conversion completes without errors, but the model generates repeated tokens for any input (BLEU drops from 23.9 to 0.04). The decoder stops using encoder context entirely from the first inference step.

EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8 is not viable — TILE op unsupported at runtime.

Question

Is this a known limitation of TFLiteConverter PTQ for encoder-decoder architectures? Is there a recommended calibration strategy or converter configuration for fused encoder-decoder graphs with cross-attention?

Open to any working approach to move forward.

Reproducible notebook available on request.

Update: this appears to be a known and complicated failure mode; see the answer below.


Answer

Yes — this is a known failure class, but I would phrase it carefully.

I would not describe it as:

TFLiteConverter officially does not support encoder-decoder seq2seq PTQ.

That is too broad.

A more accurate statement is:

Full INT8 post-training quantization with TFLiteConverter is not a robust, well-documented deployment path for a fused autoregressive encoder-decoder Transformer graph. Conversion success only proves that the graph was lowered to a TFLite flatbuffer; it does not prove that encoder-decoder conditioning survived quantization.

In this case, the symptoms are much stronger than ordinary quantization degradation:

  • BLEU drops from 23.9 to 0.04.
  • The model emits repeated tokens for essentially any input.
  • The decoder appears to ignore the encoder from the first decoding step.
  • The INT16 activations / INT8 weights path is not deployable because the target runtime rejects TILE.

That combination strongly suggests that full INT8 PTQ has damaged the encoder-memory / decoder cross-attention path. The converted model is structurally valid, but semantically broken.


Why conversion success is misleading

TFLite conversion answers a graph-lowering question:

Can this TensorFlow graph be represented as a TFLite model using the requested operator set?

It does not answer the more important deployment question:

Does the quantized model preserve the numerical behavior required for autoregressive seq2seq generation?

The LiteRT/TFLite full-integer quantization path uses a representative dataset to estimate ranges for variable tensors such as model inputs, outputs, and intermediate activations.
For image classifiers, “representative data” often means representative images. For seq2seq generation, that is not enough. The representative dataset must exercise the real generation states:

  • source token lengths,
  • source attention masks,
  • decoder prefix lengths,
  • decoder masks,
  • forced BOS / language tokens,
  • early decoding,
  • middle decoding,
  • near-EOS decoding,
  • cross-attention activation ranges,
  • final-logit ranges.

If the converter only calibrates a narrow graph path, it can choose bad INT8 scales for tensors that are critical during real decoding.

That is how you can get:

conversion succeeds
+
runtime does not crash
+
outputs are completely wrong
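
A toy NumPy sketch of that failure mechanism (the numbers are invented for illustration): a scale calibrated on a narrow activation range silently saturates the larger values that real decoding produces.

```python
import numpy as np

def quantize_int8(x, scale, zero_point=0):
    # Affine INT8 quantization: q = round(x / scale) + zero_point, clamped.
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point=0):
    return scale * (q.astype(np.float32) - zero_point)

# Calibration only ever saw activations in [-1, 1], so the chosen scale is:
scale = 1.0 / 127.0

# During real decoding, a cross-attention tensor reaches 5.0:
x = np.array([0.5, -0.8, 5.0], dtype=np.float32)
x_hat = dequantize(quantize_int8(x, scale), scale)

# 0.5 and -0.8 round-trip almost exactly; 5.0 saturates to about 1.0.
```

No conversion error, no runtime crash, just a silently clipped tensor.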

Why this looks like cross-attention failure

An encoder-decoder Transformer has two main parts:

source tokens
→ encoder
→ encoder hidden states
→ decoder cross-attention
→ decoder hidden states
→ logits
→ output tokens

The decoder uses the source input mainly through cross-attention. If that path is corrupted, the decoder still has:

  • target-side token embeddings,
  • decoder self-attention,
  • learned language-model priors,
  • LM-head bias,
  • BOS / forced-token priors,
  • common-token frequency bias.

So the model can still generate tokens. But the output becomes weakly conditioned or unconditioned. Typical symptoms are:

  • same-ish output for different inputs,
  • repeated tokens,
  • generic high-priority tokens,
  • early collapse,
  • near-zero BLEU,
  • first-step logits that barely change across source inputs.

That matches the described behavior.

The most suspicious region is:

encoder_hidden_states
→ cross-attention key/value projections
→ attention score/value path
→ cross-attention output projection
→ residual / LayerNorm-adjacent tensors
→ LM-head logits

The first decoding step is especially diagnostic. At step 1, the decoder has almost no target-side history. If the INT8 model is already source-insensitive at step 1, the problem is probably not beam search, repetition penalty, EOS handling, or long-run generation logic. It is likely the encoder-memory path or the first decoder cross-attention block.


Is this a known limitation?

In practical terms, yes.

The exact sentence “TFLiteConverter PTQ does not support encoder-decoder seq2seq” is not the usual official wording. But the documented pieces line up:

  • TFLite full-integer PTQ depends on representative activation calibration (see the LiteRT post-training integer quantization guide).
  • TFLite provides a Quantization Debugger specifically because full-integer quantization can produce unexpectedly poor or completely wrong results.
  • Hugging Face’s Optimum TFLite exporter overview lists mostly encoder-style architectures such as BERT, RoBERTa, DistilBERT, MobileBERT, MPNet, and related models. It does not present full autoregressive encoder-decoder generation as the obvious happy path.
  • Optimum’s TFLite export guide notes that static input shapes need to be specified.
  • Hugging Face’s Optimum ONNX export docs describe encoder-decoder export using separate encoder and decoder pieces, because the encoder runs once and the decoder runs repeatedly during autoregressive generation.
  • ONNX Runtime’s quantization guide says dynamic quantization is generally recommended for RNNs and Transformer-based models, while static quantization is generally recommended for CNNs.

That last point is especially relevant. Your hardware requires a static full-INT8-style artifact, but Transformer generation is one of the model families where static activation calibration is most fragile.

So the practical answer is:

This is a known class of PTQ failure: a valid full-INT8 TFLite model can be generated, but the quantized activations can destroy the conditioning path that makes encoder-decoder generation work.


Why Transformers are hard for generic INT8 PTQ

Transformer quantization is difficult mainly because the activations are difficult.

The literature around Transformer quantization repeatedly points to activation outliers and attention/LayerNorm sensitivity.

Plain TFLiteConverter PTQ is much more generic than these methods. It does not automatically perform SmoothQuant-style activation smoothing, LLM.int8-style outlier routing, or I-BERT-style integer Transformer operator redesign.

That matters because a fused encoder-decoder generation graph contains exactly the fragile pieces:

MatMul / BatchMatMul
Softmax
LayerNorm-adjacent tensors
residual additions
attention masks
cross-attention K/V projections
final vocabulary projection

A single bad scale around cross-attention can make the decoder appear source-blind.
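
A toy illustration of the outlier problem (invented numbers): a symmetric INT8 scale must cover the largest value, so one outlier forces a grid so coarse that every small value collapses to zero.

```python
import numpy as np

# A mostly small-valued tensor with one activation outlier.
x = np.array([0.01, 0.02, -0.015, 60.0], dtype=np.float32)

# The symmetric INT8 scale has to reach the outlier:
scale = np.abs(x).max() / 127.0   # roughly 0.47 per quantization step

q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
x_hat = scale * q.astype(np.float32)

# All three small values quantize to 0; only the outlier survives.
```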


Why the fused graph is probably the wrong deployment shape

A fused encoder-decoder generation graph is the least debuggable shape for this problem.

Seq2seq inference naturally looks like this:

1. Run encoder once.
2. Repeatedly run decoder for each output token.
3. Select the next token outside the model.
4. Stop on EOS or max length.

The usual deployment structure is therefore:

encoder model:
  input_ids, attention_mask
  → encoder_hidden_states

decoder-step model:
  decoder_input_ids, decoder_attention_mask, encoder_hidden_states, encoder_attention_mask
  → next-token logits

Then the host application runs greedy search, beam search, EOS handling, and repetition logic outside the model.

This is also the shape used by common seq2seq export/deployment tooling. For example, Hugging Face’s Optimum ONNX export guide discusses decoder export with past key/value reuse because the decoder runs repeatedly during autoregressive generation.

A fused graph often hides too much:

encoder
decoder
decoder loop
mask updates
shape operations
possibly beam expansion
possibly TILE
token selection
EOS handling

That makes all of these harder:

  • calibration,
  • static-shape control,
  • operator support,
  • delegate partitioning,
  • cross-attention inspection,
  • first-step source-sensitivity testing,
  • quantized boundary debugging.

For this case, I would not keep pushing the fused graph as the primary production path.


Recommended working approach

The most realistic path forward is:

encoder_int8.tflite
+
decoder_step_int8.tflite
+
host-side generation loop

Do not export generate() as one fused TFLite graph unless there is no alternative.

Target layout

encoder.tflite

inputs:
  input_ids: int32
  attention_mask: int32

outputs:
  encoder_hidden_states: int8

decoder_step.tflite

inputs:
  decoder_input_ids: int32
  decoder_attention_mask: int32
  encoder_hidden_states: int8
  encoder_attention_mask: int32

outputs:
  logits: int8

Host-side decoding:

encoder_states = run_encoder(input_ids, attention_mask)

decoder_ids = [decoder_start_token_id]

for step in range(max_new_tokens):
    logits = run_decoder_step(
        decoder_input_ids=decoder_ids,
        decoder_attention_mask=make_decoder_mask(decoder_ids),
        encoder_hidden_states=encoder_states,
        encoder_attention_mask=attention_mask,
    )

    next_id = select_next_token(logits)
    decoder_ids.append(next_id)

    if next_id == eos_token_id:
        break

This structure gives you a way to test each boundary:

FP32 encoder → FP32 decoder
INT8 encoder → FP32 decoder
FP32 encoder → INT8 decoder
INT8 encoder → INT8 decoder
INT8 encoder → INT8 decoder on hardware delegate

That isolates whether the failure comes from:

  • encoder quantization,
  • decoder quantization,
  • the encoder-output / decoder-input boundary,
  • cross-attention,
  • logits,
  • or the delegate.
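
The names in the loop above (`make_decoder_mask`, `select_next_token`, `run_encoder`, `run_decoder_step`) are placeholders, not library functions. Minimal NumPy sketches of the two pure helpers, under that assumption:

```python
import numpy as np

def make_decoder_mask(decoder_ids, max_len=None):
    # 1 for real tokens, 0 for right-padding, with a batch dimension.
    # max_len fixes the step shape for a static-shape TFLite decoder model.
    n = len(decoder_ids)
    total = max_len if max_len is not None else n
    mask = np.zeros(total, dtype=np.int32)
    mask[:n] = 1
    return mask[None, :]

def select_next_token(logits):
    # Greedy decoding: pick the highest-scoring vocabulary id.
    return int(np.asarray(logits).reshape(-1).argmax())
```

`run_encoder` and `run_decoder_step` would wrap two `tf.lite.Interpreter` instances; their exact input names and shapes depend on how the SavedModel signatures were exported.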

The hard part: quantized encoder/decoder boundary

If you split the graph, the encoder output and decoder input may have different quantization parameters.

Example:

encoder output:
  scale_e
  zero_point_e

decoder encoder_hidden_states input:
  scale_d
  zero_point_d

You cannot blindly pass raw INT8 bytes from the encoder output into the decoder input unless the quantization parameters match.

If they differ, you need an explicit requantization bridge:

real_value = scale_e * (q_e - zero_point_e)
q_d = round(real_value / scale_d + zero_point_d)
q_d = clamp(q_d, -128, 127)

This boundary is important. A broken boundary can produce exactly the same symptom as broken cross-attention: the decoder runs but receives meaningless encoder memory.
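
A NumPy sketch of that bridge (the function name is mine; the scales and zero points come from the TFLite tensor details):

```python
import numpy as np

def requantize(q_e, scale_e, zero_point_e, scale_d, zero_point_d):
    # Map encoder-output INT8 codes onto the decoder-input INT8 grid.
    real = scale_e * (q_e.astype(np.float32) - zero_point_e)
    q_d = np.round(real / scale_d) + zero_point_d
    return np.clip(q_d, -128, 127).astype(np.int8)
```

In practice, read `(scale, zero_point)` from `interpreter.get_output_details()` and `interpreter.get_input_details()`; if the two pairs are identical, the raw bytes can be passed through unchanged.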

For debugging, temporarily test these variants:

FP32 encoder output → FP32 decoder
INT8 encoder output → dequantized float → FP32 decoder
FP32 encoder output → quantized decoder input → INT8 decoder
INT8 encoder output → requantized decoder input → INT8 decoder

Only the final variant is close to strict deployment, but the intermediate variants tell you where the information is lost.


Calibration strategy

The representative dataset must cover actual generation states.

Do not calibrate only source inputs.

Do not calibrate only BOS.

Do not calibrate only full teacher-forced targets if deployment uses step-by-step decoding.

A better calibration set should include multiple decoder prefixes per source example.

Bad calibration pattern

def representative_dataset():
    for batch in source_batches:
        yield {
            "input_ids": batch["input_ids"],
            "attention_mask": batch["attention_mask"],
        }

That may calibrate the encoder path but not the decoder cross-attention behavior used during generation.

Better calibration pattern

# SRC_LEN, TGT_LEN, DECODER_PREFIX_LEN, target_pad_id, and pad_to_length
# are placeholders for your own sequence limits and padding helper.
def representative_dataset():
    for src_text, tgt_text in calibration_pairs:
        src = source_tokenizer(
            src_text,
            max_length=SRC_LEN,
            padding="max_length",
            truncation=True,
            return_tensors="np",
        )

        tgt = target_tokenizer(
            tgt_text,
            max_length=TGT_LEN,
            padding=False,
            truncation=True,
            return_tensors="np",
        )

        target_ids = tgt["input_ids"][0]

        for prefix_len in [1, 2, 4, 8, 16, 32]:
            if prefix_len > len(target_ids):
                continue

            decoder_prefix = target_ids[:prefix_len]
            decoder_prefix = pad_to_length(
                decoder_prefix,
                length=DECODER_PREFIX_LEN,
                pad_id=target_pad_id,
            )

            yield {
                "input_ids": src["input_ids"].astype("int32"),
                "attention_mask": src["attention_mask"].astype("int32"),
                "decoder_input_ids": decoder_prefix[None, :].astype("int32"),
                "decoder_attention_mask": (decoder_prefix[None, :] != target_pad_id).astype("int32"),
            }

The exact input names must match the SavedModel signature.
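
`pad_to_length` in the snippet above is a stand-in, not a library call. A minimal NumPy version might be:

```python
import numpy as np

def pad_to_length(ids, length, pad_id):
    # Right-pad (or truncate) a 1-D token-id sequence to a fixed length.
    ids = np.asarray(ids, dtype=np.int32)[:length]
    out = np.full(length, pad_id, dtype=np.int32)
    out[: len(ids)] = ids
    return out
```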

Calibration coverage checklist

Include:

short source examples
normal source examples
long source examples
max-length source examples
padding-heavy examples
near-no-padding examples
rare names and numerals
punctuation-heavy examples
domain-specific examples
BOS / forced decoder-start token
early decoder prefix
middle decoder prefix
near-EOS decoder prefix

A useful rule of thumb:

200 source examples × 5 decoder prefixes

is usually more informative than:

1000 source examples × only BOS

because the former covers more activation regimes.


Converter configuration advice

There is probably no single converter flag that fixes this.

Still, I would run these baselines.

1. Float TFLite baseline

converter = tf.lite.TFLiteConverter.from_saved_model("<saved_model_dir>")
tflite_model = converter.convert()

with open("<model_float>.tflite", "wb") as f:
    f.write(tflite_model)

If this fails, stop. The issue is export/lowering, not INT8.

2. Dynamic-range baseline

converter = tf.lite.TFLiteConverter.from_saved_model("<saved_model_dir>")
converter.optimizations = [tf.lite.Optimize.DEFAULT]

tflite_model = converter.convert()

with open("<model_dynamic_range>.tflite", "wb") as f:
    f.write(tflite_model)

If dynamic-range quantization works while full INT8 fails, weights are probably not the main problem. The problem is activation quantization.

3. Full INT8 baseline

converter = tf.lite.TFLiteConverter.from_saved_model("<saved_model_dir>")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

tflite_model = converter.convert()

with open("<model_int8>.tflite", "wb") as f:
    f.write(tflite_model)

Be careful with:

converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

For text models, token IDs are categorical integer indices, not numeric image/audio activations. input_ids and masks often remain int32. Do not blindly force token IDs to INT8.
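
The converter inserts quantize/dequantize ops rather than raw casts, but the capacity problem is easy to see: 8 bits simply cannot hold vocabulary-sized indices.

```python
import numpy as np

# A realistic subword vocabulary id:
token_id = np.array([30000], dtype=np.int32)

# Narrowing it to int8 wraps modulo 256 (30000 % 256 == 48),
# so the id silently becomes a completely different token:
wrapped = token_id.astype(np.int8)
```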

Always inspect the converted model:

interpreter = tf.lite.Interpreter(model_path="<model_int8>.tflite")
interpreter.allocate_tensors()

print("Inputs:")
for item in interpreter.get_input_details():
    print(item["name"], item["dtype"], item["shape"], item["quantization"])

print("Outputs:")
for item in interpreter.get_output_details():
    print(item["name"], item["dtype"], item["shape"], item["quantization"])

If the final logits are INT8, the host decoder must respect the output tensor’s scale and zero point.

For greedy argmax, quantized argmax is often equivalent if all logits share one scale and zero point. For beam search, length penalty, temperature, top-k, or probability arithmetic, dequantization or careful fixed-point handling is safer.
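
A sketch of that host-side handling (the helper name is mine; `scale` and `zero_point` would come from the output tensor's `quantization` field):

```python
import numpy as np

def dequantize_logits(q_logits, scale, zero_point):
    return scale * (np.asarray(q_logits, dtype=np.float32) - zero_point)

q = np.array([12, 87, -5], dtype=np.int8)
scale, zero_point = 0.25, 3  # illustrative values

# With one positive scale shared by all logits, argmax commutes with
# dequantization, so greedy decoding can stay in INT8:
assert int(q.argmax()) == int(dequantize_logits(q, scale, zero_point).argmax())

# Softmax, temperature, or length penalties should be computed on the
# dequantized values, never on the raw INT8 codes.
```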


About inference_output_type=tf.float32

This line is suspicious:

converter.inference_output_type = tf.float32

It is not necessarily the root cause of the collapse, but it is worth testing without it.

If the target is a strict INT8 hardware delegate, leaving a float output can create an awkward quantize/dequantize boundary or a partially non-integer interface. That may be acceptable for debugging, but it is not ideal for a strict integer deployment.

However, the repeated-token collapse is more likely caused by an internal activation/cross-attention quantization problem than by the output type alone.

I would test both:

# Debug-friendly interface
converter.inference_output_type = tf.float32

and:

# Strict integer numeric output, if compatible with your graph interface
converter.inference_output_type = tf.int8

Then compare:

full INT8 CPU output
full INT8 delegate output
first-step source sensitivity
BLEU

Why 16x8 is useful even though it is not deployable

The experimental mode:

tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8

is useful diagnostically because it tests whether INT8 activations are the problem.

If 16x8 improves quality but the runtime rejects TILE, the interpretation is:

The model likely needs more activation precision.
The target delegate cannot execute the more accurate path.

LiteRT documents 16-bit activations with 8-bit weights as an option that can help when activations are sensitive, but optimized kernel/delegate support is more limited than for ordinary INT8.

So the TILE problem is not surprising. It is a runtime/delegate support failure, not proof that plain INT8 PTQ should work.
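
A toy NumPy comparison of why 16x8 is diagnostically interesting: over the same range, 16-bit activations give a roughly 256x finer grid than 8-bit ones.

```python
import numpy as np

x = np.linspace(-4.0, 4.0, 1001).astype(np.float32)

def roundtrip(x, levels):
    # Symmetric quantization with `levels` positive steps (127 or 32767).
    scale = np.abs(x).max() / levels
    q = np.clip(np.round(x / scale), -levels - 1, levels)
    return (scale * q).astype(np.float32)

err8 = np.abs(x - roundtrip(x, 127)).max()     # INT8-style activations
err16 = np.abs(x - roundtrip(x, 32767)).max()  # INT16-style activations

# err16 is orders of magnitude smaller than err8 for the same tensor.
```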


The first diagnostic test I would run

Before doing more BLEU evaluation, run a first-step source-sensitivity test.

Pick two very different source sentences:

source A: "The committee approved the budget after three hours of debate."
source B: "The patient developed a fever after the second injection."

Use the same decoder prefix:

decoder_input_ids = [decoder_start_token_id]

Compare:

FP32(source A, BOS) → logits_A_fp32
FP32(source B, BOS) → logits_B_fp32

INT8(source A, BOS) → logits_A_int8
INT8(source B, BOS) → logits_B_int8

Healthy behavior:

FP32 logits differ by source.
INT8 logits also differ by source.

Broken source-blind behavior:

FP32 logits differ by source.
INT8 logits are nearly identical across sources.

Example helper:

import numpy as np

def topk_ids(logits, k=10):
    flat = np.asarray(logits).reshape(-1)
    return np.argsort(flat)[-k:][::-1]

def compare_logits(logits_a, logits_b, k=10):
    a = np.asarray(logits_a).reshape(-1).astype(np.float64)
    b = np.asarray(logits_b).reshape(-1).astype(np.float64)

    top_a = topk_ids(a, k)
    top_b = topk_ids(b, k)

    cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    return {
        "argmax_a": int(top_a[0]),
        "argmax_b": int(top_b[0]),
        "same_argmax": bool(top_a[0] == top_b[0]),
        "topk_overlap": len(set(top_a.tolist()) & set(top_b.tolist())),
        "cosine": float(cosine),
        "range_a": float(a.max() - a.min()),
        "range_b": float(b.max() - b.min()),
        "top_a": top_a.tolist(),
        "top_b": top_b.tolist(),
    }

This is more informative than BLEU.

BLEU tells you the model is broken. First-step source sensitivity tells you whether the encoder context is already gone at the first decoder step.


Use Quantization Debugger

Use TFLite’s Quantization Debugger to identify where the quantization error first explodes.

Inspect these tensors first:

encoder output
decoder layer 0 cross-attention query
decoder layer 0 cross-attention key
decoder layer 0 cross-attention value
decoder layer 0 cross-attention scores
decoder layer 0 cross-attention output
post-cross-attention residual
final decoder hidden state
LM-head logits

Look for:

  • Encoder hidden states saturated → encoder output quantization is bad
  • Cross-attention K/V nearly constant → source memory is destroyed
  • Attention scores nearly constant → decoder cannot select source positions
  • Attention scores extreme → softmax collapses
  • Cross-attention output near zero → source signal is muted
  • Residual dominates attention output → encoder signal is drowned
  • Logits almost identical across sources → decoder is source-blind
  • Logits saturated → final projection/output scale problem

Selective quantization can also be useful diagnostically. For example, leave one region float and see whether BLEU recovers:

leave encoder output float
leave cross-attention K/V projections float
leave attention score path float
leave post-cross-attention residual float
leave LM head float

This may not be deployable on an INT8-only delegate, but it can identify the tensor group that kills the model.


Full diagnostic ladder

Run the same evaluation set through these variants:

  • Original FP32 TensorFlow / Transformers (reference): should reproduce BLEU around 23.9
  • Float TFLite CPU (export/lowering check): if bad, quantization is not the first problem
  • Dynamic-range TFLite CPU (weight-quantization check): if good, weights are not the main issue
  • Full INT8 TFLite CPU (quantization check): if bad, calibration/numerics are failing
  • Full INT8 TFLite delegate (runtime check): if CPU is good but the delegate is bad, the runtime/delegate is failing
  • 16x8 TFLite CPU, if possible (activation-precision check): if better, INT8 activations are the bottleneck

The key split is:

float TFLite bad
→ export/lowering/fused-graph issue

float TFLite good, INT8 CPU bad
→ quantization/calibration issue

INT8 CPU good, INT8 delegate bad
→ delegate/operator/kernel issue

16x8 better than INT8
→ activation precision issue

What to do if split PTQ still fails

If the split encoder/decoder-step model still collapses after proper calibration, the realistic options are:

1. Quantization-aware training

Use QAT if PTQ cannot meet the accuracy target.

Important: do QAT on the deployment-shaped graph, not only on the original training graph.

That means:

same max source length
same decoder-step shape
same masks
same BOS/EOS behavior
same tokenizer
same target delegate constraints
same quantized encoder/decoder boundary

2. Distillation into a quantization-friendly model

If the original architecture is too sensitive, distill into a smaller model designed for the target constraints:

fixed source length
fixed decoder-step shape
simpler attention pattern
no fused generation graph
delegate-supported ops only
QAT or PTQ-aware evaluation from the beginning

3. Runtime change, if possible

If the target can change, use a Transformer-native runtime instead of generic fused TFLite.

CTranslate2 supports many encoder-decoder Transformer families and quantization modes. Even if it cannot be shipped on the final target, it is useful as a sanity check:

If CTranslate2 INT8 works but TFLite INT8 collapses,
the model is probably quantizable,
but the current TFLite path is not preserving it.

4. Requirement change

If the hardware delegate truly requires plain full INT8 TFLite and the model cannot survive that path, the requirement may be incompatible with the model family.

Possible requirement changes:

allow int16 activations
allow selected float fallback
allow a custom op
allow a different runtime
allow a smaller/distilled model
allow server-side inference

What I would not spend time on

Blindly adding more calibration samples

More data does not fix the wrong calibration distribution.

Bad:

1000 source examples × BOS only

Better:

200 source examples × multiple decoder prefixes

Blindly trying converter flags

Converter flags are secondary. The primary issue is graph shape and activation calibration.

Assuming TILE is the root cause of BLEU collapse

TILE explains why the 16x8 path is not viable on the target. It does not by itself explain why full INT8 repeats tokens. These are related deployment constraints, but not the same failure.

Assuming the converter understands generation semantics

The converter lowers tensors and ops. It does not know that a certain tensor is “encoder memory that must preserve source conditioning.”


My final recommendation

I would proceed like this:

  1. Do not keep the fused graph as the main production candidate.
  2. Build a float TFLite baseline and verify it matches the original model.
  3. Build a dynamic-range TFLite baseline.
  4. Split into:
    • encoder.tflite
    • decoder_step.tflite
  5. Calibrate the decoder-step model using real decoder prefixes across multiple timesteps.
  6. Run full INT8 CPU before using the hardware delegate.
  7. Run the first-step source-sensitivity test.
  8. Use Quantization Debugger around encoder output and decoder cross-attention.
  9. Explicitly handle the quantized encoder-output / decoder-input boundary.
  10. If PTQ still collapses, move to QAT or distillation.
  11. If CPU INT8 works but the delegate fails, treat it as a delegate/operator support problem.

The concise diagnosis is:

The converted model is probably not failing because TFLite cannot tokenize, decode, or run the graph at all. It is failing because full INT8 static PTQ has destroyed the numerical path that carries encoder information into decoder cross-attention. The decoder still emits tokens, but it no longer receives useful source context, so it falls back to repeated high-prior tokens and BLEU collapses.


Short answer

  • Yes, this is a known full-INT8 PTQ failure class.
  • No, there is probably not one converter flag that fixes a fused encoder-decoder generation graph.
  • The likely broken region is cross-attention or the encoder-hidden-state boundary.
  • The recommended deployment shape is split encoder + decoder-step, with generation outside TFLite.
  • Calibration must include real decoder prefixes across timesteps, not only source inputs.
  • Use float TFLite, dynamic-range TFLite, full INT8 CPU, and full INT8 delegate as separate baselines.
  • Use first-step source-sensitivity tests and Quantization Debugger before relying only on BLEU.
  • If careful split PTQ still fails, use QAT, distillation, or a different runtime/precision target.