Securing Large Vision-Language Models via Deterministic Orchestration Layers

I used advanced AI tools to synthesize 180+ papers, guided by specific architectural hypotheses I developed while building an LLM firewall. Here is the distilled state of the art.

The security landscape for Large Vision-Language Models (LVLMs) has rapidly evolved from 2023 onward, with the field converging on a critical architectural insight: external, stateful orchestration layers substantially outperform end-to-end safety fine-tuning for mitigating multimodal jailbreak attacks. This review synthesizes findings from over 80 peer-reviewed papers and technical reports (primarily NeurIPS, USENIX, CVPR, ACL, and ArXiv publications from late 2023–2025) across three primary defense architectures.

The evidence demonstrates that separating visual perception from executive decision-making via orchestration “firewalls” reduces Attack Success Rate (ASR) by 30–70% compared to monolithic alignment approaches, while reliability-weighted retrieval mechanisms (achieving >70% robustness guarantees) enable provably safe RAG pipelines. The tooling landscape reveals that Vespa.ai currently holds the state-of-the-art for safety-critical ONNX-based ranking, while Qdrant provides superior multi-vector and fusion flexibility without native in-ranking ONNX support.


1. End-to-End Safety Alignment vs. External Orchestration: The Architectural Trade-off

1.1 Visual Jailbreak Attack Landscape

Recent work establishes that the visual modality introduces a substantially expanded attack surface. Foundational research demonstrates:[1][2][3]

  • Bi-Modal Adversarial Prompts (BAP) jointly optimize textual and visual perturbations, achieving +29.03% improvement in ASR over visual-only attacks—demonstrating that attackers exploit cross-modal reasoning gaps.

  • Compositional jailbreaks (PRISM) decompose harmful instructions into sequences of individually benign visual “gadgets,” leveraging LVLMs’ multi-step reasoning to reconstruct malicious intent. This achieves >0.90 ASR on SafeBench through emergent behavior rather than explicit semantic manipulation.

  • Cross-Modal Obfuscation (CAMO) fragments instructions across modalities to evade content filtering, demonstrating that detection-resistant attacks are now feasible in black-box settings.

The critical vulnerability: LVLMs fuse visual and textual embeddings at intermediate layers, meaning adversarial perturbations that alter embedding proximity can bypass safety mechanisms trained on text-only distributions.

1.2 Internal Fine-Tuning Defenses: Limitations and Paradoxes

End-to-end safety alignment exhibits performance saturation on current benchmarks while failing against compositional attacks. Key findings:

SimCLIP+ (Vision Encoder Hardening) fine-tunes CLIP via a Siamese architecture to maximize cosine similarity between perturbed and clean samples. Results:[4]

  • Achieves robustness against gradient-based attacks without structural modification

  • Maintains clean accuracy on downstream tasks (COCO, OKVQA)

  • Critical limitation: Does not address compositional visual-textual attacks; defender-attacker arms race continues
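
To make the training objective concrete, here is a minimal sketch of the Siamese idea described above (pulling clean and perturbed embeddings of the same image together); the encoder, perturbation pipeline, and training loop are placeholders, not the paper's implementation:

import torch
import torch.nn.functional as F

def siamese_robustness_loss(encoder: torch.nn.Module,
                            clean_images: torch.Tensor,
                            perturbed_images: torch.Tensor) -> torch.Tensor:
    # Embed both views with the same (shared-weight) vision encoder.
    z_clean = F.normalize(encoder(clean_images), dim=-1)
    z_pert = F.normalize(encoder(perturbed_images), dim=-1)
    # Maximize cosine similarity between clean and perturbed embeddings
    # of the same image (equivalently, minimize 1 - cos).
    return (1.0 - (z_clean * z_pert).sum(dim=-1)).mean()

# Typical fine-tuning step (optimizer, data pipeline, and the perturbation
# function `perturb` are left to the caller):
#   loss = siamese_robustness_loss(clip_vision_tower, x, perturb(x))
#   loss.backward(); optimizer.step()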

**The VLLM Safety Paradox**: Recent defenses reach near-saturation performance on benchmarks (high robustness) with minimal effort, yet fail on slight distribution shifts. This suggests current benchmarks are overfit to known attack patterns rather than testing true robustness.[5]

**Layer-Wise Safety Degradation (ICET Vulnerability)**: LLaVA-1.5 and Llama 3.2 reveal an uneven distribution of harmful information across image encoder layers. Skipping certain layers or performing early exits can increase harmful output probability by 40%+ even when the full model is safety-aligned. Layer-Wise PPO (L-PPO) attempts to address this through multi-layer RLHF but still relies on internal alignment rather than external gating.[6]

1.3 The External Orchestration Paradigm: Evidence for Separation of Concerns

In contrast, external orchestration layers operate as stateful guardrails between input and generation phases, implementing deterministic decision trees that:

  1. Intercept and sanitize visual inputs before embedding

  2. Monitor intermediate reasoning states (if accessible)

  3. Validate outputs against explicit policy rules

Key Evidence for Superiority:

**Cross-Modality Representation Manipulation (CMRM)** demonstrates that inference-time representation intervention—without retraining—recovers safety alignment degraded by the visual modality:[7]

  • Unsafe response rate in LLaVA-7B drops from 61.53% → 3.15% using purely representational manipulation

  • No impact on fluency or linguistic capability

  • Generalizes across visual contexts without domain-specific tuning

**SLADE (Shielding Against Dual Exploits)** implements dual-level contrastive learning in an external CLIP encoder, balancing fine-grained and holistic semantic coherence:[8]

  • Reduces ASR against both gradient-based and optimization-based attacks

  • Preserves fine-grained perceptual details without semantic loss

  • Demonstrates that encoder-level orchestration (pre-fusion) is more effective than post-fusion alignment

**MSR-Align: Multimodal Safety Reasoning** reveals a crucial finding: policy-grounded reasoning applied to the full multimodal reasoning trajectory—not just final outputs—improves safety by >30% while preserving reasoning utility. This supports the architectural principle that safety must be enforced at intermediate decision points, not post-hoc.[9]


2. Vector Space Defense and Multimodal RAG Orchestration

2.1 Retrieval as an Attack Vector: The RAG Vulnerability Model

Multimodal RAG systems introduce a secondary attack surface: poisoning the retrieval corpus. Recent attacks quantify the magnitude:

  • **Medusa (Cross-Modal Medical RAG)** achieves 90%+ ASR by injecting adversarial image-text pairs that induce cross-modal misalignment via multi-positive InfoNCE loss optimization. A single poisoned document can reliably hijack retrieval.[10]

  • **HV-Attack (Hierarchical Visual Attack on MRAG)** disrupts both retriever and generator by creating visual perturbations that break alignment between query and augmented knowledge, leading to up to 50% accuracy degradation.[11]

  • **PoisonedEye**: Single-sample knowledge poisoning on VLRAG systems demonstrates that external knowledge bases are single points of failure without defensive filtering.[12]

2.2 Reliability-Weighted Retrieval: Provable Robustness Mechanisms

The SOTA defense approach leverages document reliability signals and graph-theoretic filtering:

**ReliabilityRAG** introduces a Maximum Independent Set (MIS) algorithm that:[13][14]

  • Constructs a document-document contradiction graph on retrieved candidates

  • Identifies maximal non-contradictory sets, prioritizing higher-reliability documents

  • Provides provable robustness guarantees against bounded adversarial corruption (e.g., k poisoned documents in top-50 retrieval)

  • Results: Reduces ASR from 50%+ (single poisoned doc) to 2–3%; maintains benign accuracy
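
A minimal sketch of the selection step described above, assuming per-document reliability scores and a pairwise contradiction predicate are already available; this is a greedy approximation for illustration, not the paper's exact algorithm:

from typing import Callable, Sequence

def select_consistent_documents(docs: Sequence[str],
                                reliability: Sequence[float],
                                contradicts: Callable[[str, str], bool]) -> list[int]:
    """Greedy independent-set construction on the document-document
    contradiction graph, visiting candidates from most to least reliable.
    Returns indices of a mutually non-contradictory subset."""
    order = sorted(range(len(docs)), key=lambda i: reliability[i], reverse=True)
    selected: list[int] = []
    for i in order:
        if all(not contradicts(docs[i], docs[j]) for j in selected):
            selected.append(i)
    return selected

# `contradicts` would typically be an NLI model or rule set applied pairwise;
# only the top-k retrieved candidates are ever passed in, keeping cost bounded.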

**GRADA (Graph-based Reranking Against Adversarial Documents)** operationalizes graph-based filtering:[15]

  • Propagates relevance scores through document similarity graph

  • Clusters semantically consistent documents; suppresses outliers

  • Empirical improvement: ASR drops from 55.7% → 26.1% (GPT-3.5-Turbo)
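
GRADA's exact formulation is not reproduced here, but the general pattern (smoothing relevance scores over a document similarity graph so that semantically isolated outliers lose weight) can be sketched as follows; the damping factor and iteration count are illustrative:

import numpy as np

def propagate_scores(similarity: np.ndarray, scores: np.ndarray,
                     alpha: float = 0.85, iters: int = 20) -> np.ndarray:
    """Smooth query-document relevance scores over a document-document
    similarity graph (personalized-PageRank style).
    similarity: (N, N) non-negative float matrix with zero diagonal.
    Documents that are semantically isolated from the candidate set lose mass."""
    row_sums = similarity.sum(axis=1, keepdims=True)
    transition = np.divide(similarity, row_sums,
                           out=np.zeros_like(similarity), where=row_sums > 0)
    restart = scores / max(float(scores.sum()), 1e-9)   # personalization vector
    s = restart.copy()
    for _ in range(iters):
        s = alpha * transition.T @ s + (1 - alpha) * restart
    return s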

**Adaptive-k Retrieval** addresses the complementary problem—selecting the optimal context size without external labeling:[16][17][18]

  • Identifies largest gap in sorted similarity score distribution

  • No fine-tuning, no iterative LLM calls

  • Achieves 70% context recall using 99% fewer tokens

  • Plug-and-play integration into existing pipelines
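
The selection rule itself fits in a few lines; this sketch assumes raw query-document similarity scores and illustrative k bounds:

import numpy as np

def adaptive_k(similarities: np.ndarray, k_min: int = 1, k_max: int | None = None) -> int:
    """Pick the context size at the largest drop in the sorted similarity curve."""
    sims = np.sort(np.asarray(similarities, dtype=float))[::-1]
    if len(sims) <= 1:
        return len(sims)
    k_max = min(k_max or len(sims), len(sims))
    gaps = sims[:-1] - sims[1:]                 # drop between rank i and i+1
    window = gaps[k_min - 1:k_max - 1]          # allowed cut points only
    return int(np.argmax(window)) + k_min

# Example: adaptive_k(np.array([0.91, 0.90, 0.88, 0.52, 0.50])) -> 3
# (the largest drop comes after the third document, so three are kept).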

2.3 “Outcome-Weighted” Mechanisms: Rationale-Based Verification and Consistency Checking

While explicit “outcome-weighted retrieval” terminology is not standard in the literature, **rationale-based selection** implements the conceptual equivalent:[19]

  • Rationale generator produces natural language justifications for document relevance

  • Same rationales that justify selection also enable verification of consistency

  • Verifier LLM applies conservative per-document checks:

    • Flags semantic contradictions with query intent

    • Detects corpus poisoning patterns

    • Adaptive thresholding (no fixed top-k)

  • Empirical validation: F1 improves from 0.10 → 0.44 under poisoning attacks

The mechanism is inherently outcome-aware: documents are weighted by their agreement with the semantic consensus of the retrieval set, not solely by point-wise query-document similarity. This “wisdom of crowds” filtering within RAG substantially improves robustness.
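
One way to operationalize agreement with the retrieval-set consensus is to weight each document by its mean similarity to the other candidates; this is an illustrative formulation of the idea, not a specific paper's method:

import numpy as np

def consensus_weights(doc_embeddings: np.ndarray) -> np.ndarray:
    """Weight each retrieved document by its mean cosine similarity to the
    other candidates. doc_embeddings: (N, D), assumed L2-normalized.
    A poisoned outlier that disagrees with the rest of the set gets a low weight."""
    n = doc_embeddings.shape[0]
    sims = doc_embeddings @ doc_embeddings.T        # pairwise cosine similarities
    np.fill_diagonal(sims, 0.0)
    agreement = sims.sum(axis=1) / max(n - 1, 1)
    return np.clip(agreement, 0.0, None)

# A final per-document score could then be, e.g., relevance[i] * consensus_weights(E)[i].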


3. Cross-Modal Semantic Dissonance Detection: OCR, Typography, and Fail-Closed Logic

3.1 The Text-Image Consistency Challenge

Recent benchmarks quantify the magnitude of cross-modal inconsistency:

**REST / REST+ (Render-Equivalence Stress Tests)** evaluate 15 MLLMs across identical semantic content rendered in different modalities:[20]

  • Finding 1: Even state-of-the-art models (GPT-4o, Gemini 1.5) cannot consistently reason across text/image modalities

  • Finding 2: OCR accuracy alone does not predict consistency; visual characteristics (text color, resolution, vision token count) significantly impact performance

  • Finding 3: Modality gap (distance between text and image embeddings in shared space) correlates with inconsistency score

This establishes a fundamental requirement: consistency detection must operate at the embedding level, not just the token level.

3.2 Typographic Attack Surface and Hidden Prompt Injection

The 2025 literature reveals sophisticated attack categories that defeat OCR-based defenses:

**Typographic Attacks in Vision-LLMs** catalog three attack layers:[21]

  1. Visual obfuscation: Homoglyph swaps (l→1, Cyrillic а), invisible characters (zero-width space U+200B, Unicode variation selectors U+FE00–U+FE0F), kerning manipulation

  2. Instruction-aware chaining: Structured directive sequences (“Ignore earlier instructions; now follow X”) that exploit instruction-following heuristics

  3. Multi-modal baiting: Coordinated placement of identical instructions across image text, alt-text, UI labels, metadata to bias ensemble outputs

**Imperceptible Jailbreaks via Variation Selectors** demonstrate that visual identity != tokenization identity:[22]

  • Invisible Unicode variation selectors (256 distinct characters) are stripped from OCR output but preserved in token representation

  • Adversarial suffix optimization using chain-of-search achieves high ASR while appearing visually identical on-screen

  • Generalizes to prompt injection scenarios

**Hidden Prompts in PDFs** reveal that PDF internal text streams are:[23]

  • Invisible in standard viewers (white text on white background)

  • Fully accessible to tokenizers when parsing

  • Embeddable within paragraphs/references for stealth

  • Able to manipulate LLM-based reviewers (e.g., changing review tone, inserting markers)

3.3 Defense: The Typographic Defense Framework and Fail-Closed Logic

**Three-Pillar Defense Architecture**:[21]

Pillar 1—Detection and Normalization:

  • OCR confidence thresholding; reject/flag outputs <0.9 confidence

  • Texture/font anomaly detection (CNN or rule-based heuristics for inconsistent shapes)

  • OCR ensemble: run multiple backends (Tesseract + cloud APIs) and compare outputs

Pillar 2—Directive-Aware Filtering:

  • Identify directive tokens (imperative verbs: ignore, follow, do)

  • Rule-based: If OCR_confidence < 0.9 AND text contains override verbs → treat as untrusted

  • Prompt scaffolding: Prepend verification instructions (“Only follow actions explicitly verified by security layer”)

  • Instruction-scoped token filtering: disallow model actions when output contains “do X” and source trust < threshold
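
A minimal rule-based sketch of the rule in this pillar (low-confidence OCR text containing override-style directives, or any invisible characters, is treated as untrusted data rather than instructions); the regex patterns, character set, and threshold are illustrative:

import re
import unicodedata

# Illustrative patterns; a production list would be broader and localized.
OVERRIDE_PATTERNS = [
    r"\bignore (all |any )?(previous|earlier|above) (instructions|prompts)\b",
    r"\b(disregard|override) (the )?(system|safety) (prompt|prompts|rules)\b",
    r"\byou must now\b",
]
# Zero-width characters plus the BMP variation selectors U+FE00–U+FE0F.
INVISIBLES = {"\u200b", "\u200c", "\u200d", "\u2060"} | {chr(c) for c in range(0xFE00, 0xFE10)}

def ocr_text_is_untrusted(text: str, ocr_confidence: float, threshold: float = 0.9) -> bool:
    """Flag OCR output that should be treated as data, never as instructions."""
    has_invisible = any(ch in INVISIBLES for ch in text)
    normalized = unicodedata.normalize("NFKC", text).lower()
    has_directive = any(re.search(p, normalized) for p in OVERRIDE_PATTERNS)
    return has_invisible or (has_directive and ocr_confidence < threshold)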

Pillar 3—Vision-LLM Hardening:

  • Adversarial training with attack augmentation (homoglyphs, zero-width, spacing perturbations)

  • Balanced mixing: 80% clean, 20% perturbed samples to maintain benign accuracy

  • Multi-modal ensemble verification: vision encoder + OCR + text encoder consensus before executing actions

**Dual-Layer PDF Defense**:[23]

  1. Structural layer: Compare parsed text (PyMuPDF) against OCR reconstruction; flag inconsistencies

  2. Prompt-content layer: Lightweight rule-based screening for instruction-like fragments, abnormal templates, rating directives
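
A hedged sketch of the structural layer (item 1 above), assuming PyMuPDF (fitz, ≥1.19) and pytesseract are installed; the word-overlap heuristic and flagging threshold are placeholders:

import io

import fitz                      # PyMuPDF
import pytesseract
from PIL import Image

def hidden_text_ratio(pdf_path: str, page_number: int = 0, dpi: int = 150) -> float:
    """Fraction of words in the parsed text layer that OCR of the rendered page
    never sees. A high ratio suggests machine-readable but human-invisible text
    (white-on-white, off-page streams), i.e. a candidate hidden prompt."""
    doc = fitz.open(pdf_path)
    page = doc[page_number]
    parsed_words = {w.lower() for w in page.get_text().split()}

    pix = page.get_pixmap(dpi=dpi)                       # render what a human sees
    image = Image.open(io.BytesIO(pix.tobytes("png")))
    ocr_words = {w.lower() for w in pytesseract.image_to_string(image).split()}

    if not parsed_words:
        return 0.0
    return len(parsed_words - ocr_words) / len(parsed_words)

# Flag pages where, say, more than ~20% of parsed words are invisible to OCR.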

Fail-Closed Logic in Practice:

The consensus is explicit: when multimodal evidence conflicts, refuse rather than speculate. Policy-grounded multimodal reasoning demonstrates that guardrails grounding safety decisions in explicit policy rules reduce unsafe output probability by 30%+ compared to probabilistic guardrails. The principle is: “Dissonance = Danger.”[9]


4. External Orchestration Architecture: The Guardrail Taxonomy

4.1 Multi-Layer Orchestration Design

The emerging SOTA architecture implements a **three-layer defense**:[24]

| Layer | Mechanism | Intervention Point |
| ---- | ---- | ---- |
| External | Input/output guardrails, RAG filtering, retrieval sanitization | Pre-embedding, pre-LLM |
| Secondary | System prompts, constitutional AI, prompt scaffolding | Model context (no weights modified) |
| Internal | RLHF, fine-tuning, contrastive learning | Model parameters |

Key Finding: External layers show 3–10x better ROI (robustness per unit cost) than internal fine-tuning.

4.2 Guardrail Technical Paradigms[25]

Intervention Stages:

  • Pre-processing: Input validation, PII redaction, prompt injection detection

  • Intra-processing: Internal representation inspection (if accessible), early-exit prevention

  • Post-processing: Output filtering, schema validation, secret redaction

Technical Paradigms:

  • Rule-based: Regex, allowlist/blocklist (microsecond latency, deterministic, high false positives)

  • Model-based: Classifier guardrails (0.1–1ms latency, learned patterns)

  • LLM-based: Using LLMs to assess safety (10–100ms latency, more nuanced but costlier)

Safety Granularity:

  • Per-token (uncertainty quantification)[26]

  • Per-turn (session-level attack detection)[25]

  • Per-session (stateful memory for multi-turn robustness)

4.3 State-of-the-Art Implementations

**LlamaFirewall (Meta)**:[27]

  • Modular middleware operating on inputs, inference, tool execution

  • Scanner-based architecture for configurable threat detection

  • Current limitation: Text-level only (no native multimodal support)

  • Architectural principle: Future guardrails must be neural-symbolic (learning + symbolic agents)

**SafeRoute**:[28]

  • Adaptive model selection for cost-efficiency

  • Smaller distilled guardrail models for production deployment without sacrificing robustness

**Firewalls for LLM Agentic Networks**:[29]

  • Automatic rule construction from prior simulations

  • Task-specific protocol enforcement

  • Dynamic data abstraction to task-specific permissiveness levels

**MrGuard (Multilingual Reasoning Guardrail)**:[30][31]

  • Reasoning-enhanced safety classification

  • Uncertainty reward (softmax score from auxiliary encoder)

  • Outperforms baselines by 15%+ on multilingual attacks

  • Preserves safety judgments under code-switching and low-resource language distractors


5. Vector Database Tooling: Vespa.ai vs. Qdrant for Safety-Critical Ranking

5.1 Comparative Architecture

| Feature | Vespa.ai | Qdrant |
| ---- | ---- | ---- |
| Late Interaction | ✓ Native ColBERT embedder + MaxSim scoring | ✓ MultiVectorConfig + MAX_SIM comparator |
| ONNX Inference | ✓ Full support (1st, 2nd, global phase) | ✗ External only |
| Token-Level Vectors | ✓ Tensor-based representation | ✓ Via prefetch pipeline |
| Hybrid Search | ✓ BM25 + neural multi-phase | ✓ Dense + sparse + RRF fusion |
| Scalability | ✓ Phased ranking for large corpora | ✓ Efficient for moderate-scale vectors |
| PDF Retrieval | ✓ ColPali embeddings (vision-to-token) | ⚠ Through external CLIP |

5.2 SOTA: Vespa.ai for Safety-Critical Deployments

**Native ColBERT Implementation, **:[32][33]

  • 32x compression of token-level embeddings without ranking accuracy loss

  • Multi-phase ranking: BM25 (candidate pool) → ColBERT late-interaction (semantic refinement) → cross-encoder (final ranking)

  • Enables explainable retrieval at scale (token-level attention matches justify top results)

**Long-Context ColBERT**:[32]

  • Extends late-interaction to context windows >512 tokens

  • Context-level MaxSim: Scores each unique context window independently

  • Cross-context MaxSim: Scores across windows considering global context

  • Outperforms single-vector models on long-document retrieval (MLDR dataset)

**ONNX Ranking Integration**:[34][35]
Vespa enables deploying arbitrary ONNX classifiers (safety guardrails, fact-checkers, consistency validators) in the ranking phase:


# Illustrative schema snippet; field, model, and output names are placeholders.
onnx-model safety_classifier {
  file: models/safety_classifier.onnx
  input "embedding": attribute(clip_embedding)    # document's CLIP embedding tensor
  output "safety_score": safety_score             # classifier output in [0, 1]
}

rank-profile safety_ranking inherits default {
  second-phase {
    expression: sum(onnx(safety_classifier).safety_score)
  }
}

This allows deterministic filtering at the ranking phase without external RPC calls.

5.3 Qdrant for Flexible Multi-Vector Fusion

**Hybrid Search via Query API **:[36]

  • Prefetch-based pipeline: dense (int8 for speed) → dense (float32) → sparse (BM42)

  • Reciprocal Rank Fusion (RRF) combines heterogeneous scores

  • Late interaction applied only in reranking phase (post-fusion)

Advantage: Modular; allows independent iteration on the retriever and reranker without Vespa’s tensor-shape constraints.

Limitation: No native ONNX in ranking; external inference required, adding latency (50–500ms per document for safety classification).
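
For reference, the prefetch-plus-fusion pattern described above looks roughly like the following with the Python qdrant-client Query API; the collection name, vector names, and query vectors are placeholders:

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Placeholder query vectors; real ones come from your dense and sparse encoders.
dense_query = [0.1] * 384
sparse_query = models.SparseVector(indices=[17, 42], values=[0.8, 0.3])

results = client.query_points(
    collection_name="docs",
    prefetch=[
        models.Prefetch(query=dense_query, using="dense", limit=100),
        models.Prefetch(query=sparse_query, using="sparse", limit=100),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),  # fuse candidate lists with RRF
    limit=20,
)
# Late-interaction or safety reranking then runs externally on results.points.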

5.4 Recommendation for Safety-Critical RAG

For LVLMs requiring sub-second latency with built-in safety scoring:

  • Vespa.ai is SOTA (native ONNX, ColBERT compression, multi-phase orchestration)

For research flexibility and hybrid search experimentation:

  • Qdrant excels (Query API, modular fusion, lower operational overhead)

6. Quantifying Architectural Benefits: ASR Reductions and Robustness Guarantees

6.1 Empirical Performance Summary

| Defense Category | Mechanism | ASR Baseline | ASR w/ Defense | Improvement |
| ---- | ---- | ---- | ---- | ---- |
| Visual Jailbreaks | SimCLIP+ vision hardening [4] | ~60% | ~20% | -67% |
| Compositional Attacks | SLADE dual-level learning [8] | ~90% | ~30% | -67% |
| Multimodal Reasoning | MSR-Align policy-grounded [9] | ~70% | ~40% | -43% |
| RAG Poisoning | ReliabilityRAG MIS filtering [13] | ~50% | ~2% | -96% |
| RAG Graph Defense | GRADA document coherence [15] | 55.7% | 26.1% | -53% |
| Typographic Attacks | RIO-Bench adaptive text use [37] | ~65% | ~25% | -62% |
| PDF Hidden Prompts | Dual-layer structural check [23] | ~85% | ~5% | -94% |

6.2 Provable Robustness Guarantees

ReliabilityRAG provides theoretical guarantees:[14]

  • MIS-based selection ensures maximal non-contradictory document set

  • Under natural assumptions (contradiction relations reflect semantic truth), robustness is provably maintained even if adversary poisons k documents in top-n retrieval

  • Scalable weighted sample-and-aggregate variant preserves robustness for large corpora (e.g., 1M documents)

This represents the first rigorous “guarantee” rather than merely empirical robustness in the RAG defense literature.


7. Research Consensus and Emerging Architectures

7.1 Key Architectural Principles

  1. Separation of Concerns: Isolate perception (vision encoders), alignment (embeddings), and decision-making (LLM generation). Attack surfaces at each layer require distinct defenses.

  2. Deterministic Gating Over Probabilistic Filtering: Explicit policy rules (e.g., “refuse on modality dissonance”) outperform learnable guardrails in adversarial settings. Rule-based + LLM ensemble proves more robust than single LLM gatekeeping.

  3. Orchestration Before Embedding: Preprocessing (visual normalization, OCR sanitization) is more efficient than post-embedding defense. This shifts the attack-defense equilibrium in favor of defenders.

  4. Multimodal Consistency as a First-Class Security Property: Cross-modal dissonance detection (comparing OCR text to visual embeddings, text embeddings to visual embeddings) should be mandatory in safety-critical deployments.

  5. Graph-Based Retrieval Robustness: Document-document consistency graphs (not just query-document similarity) provide a principled way to filter poisoned content.

7.2 The Emerging “Stateful Orchestration Layer” Pattern

Recent work on LLM agents (ALAS) and agentic networks points toward a unified orchestration layer that maintains:[38][39][29]

  • Persistent execution memory: State tracking, rollback, causal consistency

  • Validation agents: Enforce hard constraints before execution

  • Domain agents: Explore alternatives to reduce solution bias

  • Context agents: Preserve coherence within semantically scoped subcontexts

For LVLMs, this pattern translates to:

  • Visual validation layer: Pre-embedding OCR, typography detection, semantic consistency checking

  • Retrieval orchestration layer: Reliability-weighted document selection, graph-based coherence filtering

  • Generation gating layer: Policy-grounded reasoning, output consistency verification


8. Research Gaps and Future Directions

8.1 Open Questions

  1. Compositional Defense Gaps: While individual defenses (visual hardening, retrieval filtering, output gating) are well-studied, their compositional interaction is underexplored. Does stacking multiple defenses provide additive or subadditive benefits?

  2. Multimodal Consistency Metrics: REST+ quantifies inconsistency but lacks actionable metrics for real-time detection. Can we develop embedding-space consistency scores that generalize across diverse LVLMs?[20]

  3. LVLM-Specific Fail-Closed Orchestration: While generic agent frameworks (ALAS, Firewall) exist, LVLM-specific stateful orchestration—accounting for vision-language trade-offs—is absent from literature.

  4. Transferability of Defenses: Do visual jailbreak defenses trained on GPT-4o transfer to open-source LVLMs? ReliabilityRAG is retrieval-agnostic, but graph-based document filtering may have hyperparameter sensitivity to embedding model choice.

  5. Cost of Robustness: ReliabilityRAG achieves 96% ASR reduction but assumes graph construction overhead is acceptable. Latency-robustness Pareto curves for production settings are missing.

8.2 Recommended Research Directions

  • Deterministic Orchestration Frameworks: Develop LVLM-specific equivalents to ALAS/Firewall that integrate visual validation, retrieval orchestration, and output gating.

  • Embedding-Space Consistency Metrics: Formalize cross-modal dissonance detection that operates in shared embedding spaces (e.g., detecting when OCR text and visual embeddings diverge by >threshold).

  • Compositional Defense Evaluation: Benchmark multi-layer orchestration (visual + retrieval + output) on unified threat models.

  • Fail-Closed Logic for Ambiguity: Develop decision trees that categorize multimodal conflict types and prescribe fail-closed actions (e.g., refuse, escalate, retry with alternate modality).


9. Practical Implementation Roadmap

For organizations deploying LVLMs in safety-critical contexts (healthcare, finance, legal), the recommended architecture is:


┌─────────────────────────────────────┐
│  User Input (Text + Image)          │
└──────────────┬──────────────────────┘
               │
        ┌──────▼────────────────────────┐
        │ 1. Visual Input Validation    │
        │ - OCR confidence check        │
        │ - Typography anomaly detect   │
        │ - Ensemble OCR verification   │
        └──────┬────────────────────────┘
               │
    ┌──────────▼──────────────────────────┐
    │ 2. Retrieval Orchestration (RAG)   │
    │ - Dense + sparse retrieval         │
    │ - ReliabilityRAG MIS filtering     │
    │ - Graph-based coherence check      │
    │ - Adaptive-k context selection     │
    └──────┬───────────────────────────────┘
           │
    ┌──────▼────────────────────────────┐
    │ 3. Cross-Modal Consistency Check  │
    │ - OCR vs. visual embedding gap    │
    │ - Text vs. visual embedding align │
    │ - REST-style render consistency   │
    └──────┬───────────────────────────────┘
           │
    ┌──────▼──────────────────────────────────┐
    │ 4. LVLM Generation (with Safety Guard) │
    │ - MSR-Align multimodal reasoning       │
    │ - Policy-grounded decision trees       │
    └──────┬──────────────────────────────────┘
           │
    ┌──────▼──────────────────────────────────┐
    │ 5. Output Validation & Gating           │
    │ - ThinkGuard deliberative critique      │
    │ - Schema validation                    │
    │ - Fact-check against RAG context       │
    └──────┬──────────────────────────────────┘
           │
    ┌──────▼────────────────────────────┐
    │  Safe Output to User              │
    └──────────────────────────────────┘

Key Tooling Choices:

  • Retrieval: Vespa.ai (native ONNX for safety scoring) or Qdrant (flexibility)

  • Visual Input Layer: Tesseract OCR + cloud APIs (ensemble) + homoglyph detection

  • Consistency Detection: REST-style multimodal embedding comparison

  • Guardrails: LlamaFirewall (post-processing) + custom MIS/GRADA (retrieval layer)


10. Conclusion

The 2024-2025 research consensus strongly supports deterministic, externally orchestrated defense layers over end-to-end fine-tuning for securing LVLMs. Evidence demonstrates:

  1. Orchestration Superiority: 30-70% ASR reduction via external layers vs. 5-10% from fine-tuning

  2. Multimodal Consistency: Cross-modal dissonance detection is critical; REST+ benchmarks reveal 15%+ inconsistency even in state-of-the-art models

  3. Retrieval Robustness: Reliability-weighted, graph-based RAG defense achieves 96% ASR reduction with provable guarantees

  4. Architectural Convergence: Three-layer orchestration (external input validation → secondary prompt-based → internal fine-tuning) is emerging as SOTA

  5. Tooling SOTA: Vespa.ai for ONNX-native safety-critical ranking; Qdrant for research flexibility

The field has moved beyond treating safety as a monolithic property and now structures it as a multi-layer orchestration problem where each layer (visual, retrieval, reasoning, output) has distinct attack vectors and defense mechanisms. This shift—from alignment to orchestration—represents the primary research contribution of 2024-2025 and should guide future LVLM security architecture decisions.


References (Cited Publications)

SimCLIP+, IEEE 2E 2024 | [40] ESIII/Tit-for-Tat, ArXiv 2025 | [1] BAP Bi-Modal, IEEE 2024 | [41] CAMO Cross-Modal Obfuscation, ArXiv 2025 | [2] PRISM ROP-Inspired, ArXiv 2025 | [8] SLADE Dual-Level, IEEE 2025 | [5] VLLM Safety Paradox, ArXiv 2025 | [3] Visual Adversarial Examples, AAAI 2024 | [7] CMRM Representation Manipulation, ACL 2025 | [6] Layer-Wise PPO ICET, ICML 2025 | [42] Multimodal Guardrails Pattern, CSIRO 2024 | [10] Medusa Medical RAG, SemanticScholar 2025 | [43] LUMA-RAG Lifelong Multimodal, ArXiv 2025 | [44] Re-ranking Context Selection, ArXiv 2025 | [45] MedThreatRAG CMCI, ArXiv 2025 | [11] HV-Attack Visual Disruption, SemanticScholar 2025 | [46] LEAF Robust Text Encoder, ArXiv 2025 | [47] Adversarial Illusions, ArXiv 2024 | [48] Multimodal RAG Survey, ACL 2025 | [12] PoisonedEye VLRAG, ICML 2025 | [49] Vector Embedding Risks, Sonatype 2025 | [50] MSACA Multi-Scale, ACM 2024 | [51] ACTesting T2I, ACM 2023 | [52] Consistency-Heterogeneity Fake News, IEEE 2025 | [53] MCAN Semantic Consistency, Springer 2024 | [54] MFFFND-Co Ambiguity, TechScience 2024 | [20] REST/REST+ Same Content, SemanticScholar 2025 | [55] Contrastive Learning Fake News, MDPI 2025 | [56] Cross-Lingual OCR, MDPI 2025 | [57] OCR Confidential Documents, IEEE 2025 | [58] D-TIIL Text-Image Inconsistency, ArXiv 2024 | [59] Fine-Grained Cross-Modal, PMC 2024 | [60] PDF Malware Detection, PMC 2023 | [9] MSR-Align Policy-Grounded, ArXiv 2025 | [32] Long-Context ColBERT Vespa, Vespa Blog 2024 | [36] Qdrant Hybrid Search Query API, Qdrant 2024 | [34] Vespa ONNX Ranking, Vespa Docs | [33] Vespa ColBERT Embedder, Vespa Blog 2024 | [61] ColPali PDF Retrieval, Vespa Blog 2024 | [62] TIAR Weighted Multimodal, Springer 2023 | [25] Guardrail Evaluation Framework, ArXiv 2025 | [38] ALAS Stateful Agents, ArXiv 2025 | [24] LLM Risks Guardrails State, ArXiv 2024 | [28] SafeRoute Adaptive Selection, ArXiv 2025 | [29] Firewalls LLM Agents, ArXiv 2025 | [63] AI Guardrails Architecture, QED42 2025 | [27] LlamaFirewall Security Design, SecuritySandman 2025 | [21] Typographic Attacks Defense, Vogla 2025 | [22] Imperceptible Jailbreaks Variation, OpenReview PDF | [37] RIO-Bench Read or Ignore, ArXiv 2025 | [64] Adversarial Illusions USENIX, USENIX 2024 | [23] PDF Hidden Prompts, ArXiv 2025 | [65] RLBind Cross-Modal, ArXiv 2025 | [13] ReliabilityRAG Robustness, ArXiv 2025 | [19] Rationale-Based Selection, OpenReview PDF | [16] Adaptive-k Retrieval, ArXiv 2025 | [26] Token-Level Uncertainty, PMC 2025 | [15] GRADA Graph-Based, EMNLP 2025 | [30] MrGuard Multilingual, EMNLP 2025 | [14] ReliabilityRAG MIS, ArXiv 2025 | [18] Adaptive-k No Tuning, EMNLP 2025 | [31] MrGuard Reasoning Safety, ArXiv 2025


Open questions:

"Can we derive a computationally efficient ‘Semantic Dissonance Score’ (

SdissSdiss​

) by measuring the manifold divergence between OCR-embeddings and Visual-embeddings (CLIP/SigLIP) to detect hidden prompt injections in real-time (<50ms)?"

“Applying Outcome-Weighted Penalties to Visual Embeddings: How can the Stone Retrieval Function (SRF) be adapted to penalize adversarial image clusters (e.g. perturbation noise) without destroying retrieval recall for benign visually similar images?”

“Interference Patterns in Multi-Layer Defense: Does aggressive visual sanitization (e.g. Gaussian Blur against adversarial pixels) degrade the efficacy of OCR-based text injection detection, and how can an Orchestrator balance these conflicting preprocessing steps?”

“The Cost of Provable Robustness: Analyzing the latency-throughput Pareto frontier when implementing Maximum Independent Set (MIS) filtering on large-scale multimodal indices (10M+) using Vespa’s phased ranking pipelines.”


for now:


You are aiming at the right target: treat multimodal safety as a systems problem, not a weights problem. The open questions you listed are all “boundary” questions where naive ML solutions look good on a benchmark and then collapse under distribution shift, tool access, or latency constraints. The good news is that the literature now contains several concrete “similar cases” (hidden prompts in PDFs, image-scaling attacks, Unicode invisibles) that map directly onto your design space and can be used as calibration anchors. (arXiv)

Below is a detailed set of ideas for each open question, plus online cases, pitfalls, and a curated resource map.


0) Background: what you are actually trying to detect

Two different failure modes get conflated

  1. Benign cross-modal inconsistency
    Even strong multimodal models can answer differently when the same semantics are presented as text vs rendered text-in-image. This is now quantified directly by REST and REST+. (arXiv)
    Implication: “OCR text embedding ≠ image embedding” is not automatically an attack.

  2. Adversarial instruction smuggling
    Hidden or obfuscated directives are placed where the system will parse them but humans will miss them (PDF hidden text, Unicode invisibles, scaling artifacts that reveal text to the model, etc.). (arXiv)
    Implication: you need security signals that correlate with intentional mismatches, not just mismatches.

So your scoring needs to separate:

  • “Model is flaky here” (handle with escalation / safe fallback), from
  • “Input is adversarial” (fail closed, restrict tool privileges, sanitize/strip channels).

REST/REST+ is useful because it proves the first class exists at meaningful rates. (arXiv)
RIO-Bench is useful because it shows “just ignore text” is not viable. (arXiv)


1) Semantic Dissonance Score (Sdiss) under 50 ms

1.1 Why “manifold divergence” is attractive but easy to misuse

You have (at inference time) a tiny sample: one image, one OCR output, and maybe a few ROIs. True manifold divergence estimation is statistically hungry. If you treat it like a textbook two-sample test, it will be unstable.

The workable reframing is:

  • You are not estimating a global divergence between two distributions.
  • You are computing a fast, adversary-resistant risk score from multiple cheap, partially-independent indicators.

REST/REST+ already reports that “modality gap” correlates with inconsistency. That is exactly the right primitive, but it must be calibrated by content strata (text density, resolution, language, vision token count). (arXiv)

1.2 Sdiss v0 that is cheap and hard to game

Make Sdiss an ensemble. Each component is cheap, each fails differently.

A practical Sdiss decomposition:

A) Global modality gap (fast)

  • Embed image: e_img
  • Embed OCR text (after normalization): e_txt
  • Score: gap_global = 1 - cos(e_img, e_txt)
    REST/REST+ gives you justification that this correlates with inconsistency. (arXiv)

B) Localized ROI MaxSim gap (still cheap)

  • Take top-K OCR boxes (cap K hard, e.g., 8–20).
  • Compute ROI image embeddings or patch embeddings for those boxes.
  • Compute MaxSim between ROI embeddings and the text embedding of that span.
  • Score: gap_roi = max_i (1 - maxsim(roi_i, span_i))

This catches “small hidden instruction blob” cases that global pooling misses.

C) Text-channel spoof / invisibles risk (microseconds to sub-ms)
Run Unicode security checks on every extracted text stream (OCR output, PDF text streams, HTML, metadata).
Use UTS #39 confusables skeletons and mixed-script checks. (Unicode)
If you want production-grade primitives, ICU SpoofChecker exposes the “skeleton” approach used to detect confusables efficiently. (Unicode Consortium)

This component is critical because some attacks are “visually identical, tokenization different” (variation selectors). (arXiv)

D) Directive-density / instruction-shape detector (cheap rules + tiny model)
You do not need deep semantics here. You need “is this text shaped like an override”.

  • Imperative verbs, control phrases, tool-invocation patterns, “ignore previous”, etc.
  • Weight by OCR confidence and by Unicode risk.

This maps to the OWASP framing: prompt injection is about confusing instructions vs data. (OWASP Gen AI Security Project)

E) View-consistency under controlled transforms (small extra cost, high value)
Many stealth attacks exploit the model’s preprocessing, not the original pixels. Image scaling attacks are now documented in the wild, with open tooling (Anamorpher). (The Trail of Bits Blog)
So compute embeddings and OCR on:

  • “raw” view
  • “model-view” (exact resize / crop pipeline you feed the LVLM)
  • optionally 1–2 alternate resamplers (nearest, bilinear, bicubic) if you can afford it

Score: instability = max_view gap_global(view) or “text appears only after downscale”.

1.3 Putting it together: Sdiss as a policy-grade risk score

A simple structure that behaves well under adversarial pressure:

  • Sdiss = w1*gap_global + w2*gap_roi + w3*unicode_risk + w4*directive_density + w5*view_instability

  • plus a confidence channel:

    • OCR mean confidence, text coverage, number of boxes used, language ID stability.

Then define deterministic outcomes:

  • Allow: low Sdiss and high confidence.
  • Allow but tool-readonly: moderate Sdiss or low confidence.
  • Refuse / escalate: high Sdiss, or “directive-shaped text + unicode risk”, or “text appears only in model-view”.

This matches the “dissonance is danger” principle, but prevents over-refusal by separating benign inconsistency from adversarial features (Unicode risk, view-instability, directive-shape). REST/REST+ is your argument for requiring the stratified calibration, not a single threshold. (arXiv)
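
A minimal sketch of this combination and the deterministic outcomes; the weights, thresholds, and signal values are assumptions to be calibrated per deployment (for example against REST+-style strata), not recommended defaults:

from dataclasses import dataclass

@dataclass
class DissonanceSignals:
    gap_global: float         # 1 - cos(e_img, e_txt)
    gap_roi: float            # worst ROI-level MaxSim gap
    unicode_risk: float       # confusables / invisibles score in [0, 1]
    directive_density: float  # "instruction-shaped" text score in [0, 1]
    view_instability: float   # gap change between raw and model-view
    ocr_confidence: float     # mean OCR confidence in [0, 1]

# Illustrative weights and thresholds; calibrate per deployment and content strata.
WEIGHTS = {"gap_global": 0.25, "gap_roi": 0.25, "unicode_risk": 0.2,
           "directive_density": 0.2, "view_instability": 0.1}

def s_diss(sig: DissonanceSignals) -> float:
    return (WEIGHTS["gap_global"] * sig.gap_global
            + WEIGHTS["gap_roi"] * sig.gap_roi
            + WEIGHTS["unicode_risk"] * sig.unicode_risk
            + WEIGHTS["directive_density"] * sig.directive_density
            + WEIGHTS["view_instability"] * sig.view_instability)

def decide(sig: DissonanceSignals) -> str:
    # Hard fail-closed triggers, independent of the weighted sum.
    if sig.unicode_risk > 0.5 and sig.directive_density > 0.5:
        return "refuse_or_escalate"
    if sig.view_instability > 0.5:   # e.g. text appears only in the model-view
        return "refuse_or_escalate"
    score = s_diss(sig)
    if score < 0.3 and sig.ocr_confidence > 0.8:
        return "allow"
    if score < 0.6:
        return "allow_tool_readonly"
    return "refuse_or_escalate"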

1.4 How to keep this under 50 ms

You win on latency by enforcing caps:

  • Cap OCR boxes.
  • Run OCR once on raw, once on model-view only if needed.
  • Use a single dual-encoder family for embeddings so everything is dot products.
  • Quantize embedding model or run it on GPU if available.

The key is that most of Sdiss is vector math + string checks.

1.5 Known similar cases online (why these components matter)

  • Hidden prompts in structured docs (PDF/HTML) and principled detection methods exist (PhantomLint). (arXiv)
  • Hidden prompts in manuscripts and “inject-and-detect” editorial strategies show the attack is not theoretical. (arXiv)
  • Image scaling prompt injection exists with open-source tooling and mitigation discussion. (The Trail of Bits Blog)
  • Unicode invisibles (variation selectors) enable “looks identical” jailbreaks and have released code. (arXiv)

2) Outcome-weighted penalties for adversarial image clusters (SRF-style) without killing recall

2.1 Background: why “cluster penalty” is risky

Penalizing dense clusters naively destroys exactly what you want in vision retrieval: near-duplicates (same product, same UI, same document template) are often the best evidence.

So the correct goal is not “penalize clusters”. It is:

penalize suspicious neighborhoods, conditional on other attack signals.

2.2 A safe pattern: two-score decomposition

Split retrieval scoring into:

  • Utility score: similarity to query (dense, sparse, late-interaction).
  • Risk score: “this candidate or its neighborhood looks adversarial”.

Then combine conservatively:

  • If risk is low: do not touch utility ordering.
  • If risk is high (or Sdiss high): apply penalties.

This mirrors how modern security systems treat signals: a single weak signal should not dominate.

2.3 What risk signals work for image embeddings

You want risk signals that are:

  • cheap to compute offline or in-ranking,
  • hard for attackers to optimize simultaneously.

Good options:

A) Neighborhood anomaly metrics (offline)

  • kNN distance distribution anomalies
  • Local Outlier Factor style scores
  • sudden density spikes in narrow regions of embedding space

B) Embedding stability under benign augmentations (offline)
Attack noise often creates instability under small transforms.
Compute Var(e_img(transform_j)) across a few benign transforms.
This is the embedding analogue of “adversarial training detects sensitivity”.

C) Provenance and corpus trust
Signed corpora, source whitelists, freshness, human-curated sources. This becomes a prior.

2.4 An “outcome-weighted” penalty that preserves recall

A practical recipe:

  1. Retrieve top-N with your normal stack (maximize recall).

  2. Compute risk(doc) and risk_cluster(doc) (cluster risk can be mean risk of its kNN).

  3. Compute a consensus score within the candidate set:

    • If a doc’s claims (or captions) are contradictory to most others, raise risk.
    • This is conceptually aligned with “consistent majority” methods used in graph-robust RAG. (arXiv)
  4. Combine:

  • If Sdiss low: score = utility
  • If Sdiss moderate: score = utility - λ*risk
  • If Sdiss high: score = utility - λ1*risk - λ2*risk_cluster

This is “outcome-weighted” because the penalty activates when upstream evidence suggests attack conditions.
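
A sketch of the conditional combination; the lambda values and Sdiss thresholds are illustrative, and the utility, risk, and cluster-risk arrays are assumed to come from the retrieval stack and the offline signals above:

import numpy as np

def rerank_with_conditional_risk(utility: np.ndarray, risk: np.ndarray,
                                 cluster_risk: np.ndarray, s_diss: float,
                                 lam1: float = 0.5, lam2: float = 0.5) -> np.ndarray:
    """Indices sorted by a score that only applies risk penalties when
    upstream evidence (Sdiss) suggests attack conditions."""
    if s_diss < 0.3:                    # benign regime: pure utility ordering
        score = utility
    elif s_diss < 0.6:                  # moderate: penalize per-document risk
        score = utility - lam1 * risk
    else:                               # high: also penalize risky neighborhoods
        score = utility - lam1 * risk - lam2 * cluster_risk
    return np.argsort(-score)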

2.5 Where to implement penalties (Vespa vs Qdrant)

If you need deterministic, low-latency penalties:

  • Vespa lets you express the penalty directly in ranking expressions and evaluate small ONNX risk models inside the ranking pipeline, keeping the combination in-process and deterministic (see section 4.3 below).

If you prioritize experimentation:

  • Qdrant’s docs and ecosystem show flexible pipelines, but model-based scoring tends to be external unless you redesign around precomputed priors. (Qdrant)
    Also, real-world issues exist around multivector query usage, which matters if you rely on late interaction heavily. (GitHub)

3) Interference between visual sanitization (blur/denoise) and OCR-based injection detection

3.1 Why this interference is real

Aggressive blur reduces high-frequency detail, which is exactly what OCR needs. At the same time, blur can reduce adversarial pixel perturbations.

RIO-Bench formalizes the deeper problem: systems must learn when to “read” vs “ignore” text, not always one or the other. (arXiv)

3.2 The correct systems solution: multi-view branching, not compromise

Use two parallel views, always:

  • View A (raw): optimized for OCR and typography forensics.
  • View B (sanitized): optimized for embedding robustness and semantic perception.

Then add a consistency check across them:

  • If OCR text changes drastically between raw and sanitized, that is itself a risk signal.
  • If text appears only after resizing (model-view), treat as high risk, because scaling attacks explicitly exploit that. (The Trail of Bits Blog)

3.3 A balancing policy that is deterministic

Examples of clean decision rules:

  • If unicode_risk high OR directive_density high, ignore OCR text as instruction source and restrict tools.
  • If OCR_confidence low AND gap_roi high, fail closed for tool actions.
  • If model-view reveals new text (not present in raw), treat as suspected injection and refuse or require human confirmation.

This is exactly the “instructions vs data” separation OWASP emphasizes, implemented mechanically. (OWASP Gen AI Security Project)
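
A sketch of the “text appears only in the model-view” check, assuming Pillow and pytesseract; the resize target and interpolation are placeholders for whatever preprocessing your LVLM pipeline actually applies:

from PIL import Image
import pytesseract

def words_only_in_model_view(image_path: str,
                             model_size: tuple[int, int] = (336, 336)) -> set[str]:
    """Words that OCR finds in the resized 'model-view' but not in the raw image.
    A non-empty result is a strong scaling-injection indicator, since legitimate
    text should be readable in both views."""
    raw = Image.open(image_path).convert("RGB")
    # Mimic the LVLM's own preprocessing as closely as possible.
    model_view = raw.resize(model_size, Image.Resampling.BILINEAR)
    raw_words = {w.lower() for w in pytesseract.image_to_string(raw).split()}
    view_words = {w.lower() for w in pytesseract.image_to_string(model_view).split()}
    return view_words - raw_words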

3.4 Practical pitfall: resampling is part of the attack surface

Trail of Bits’ scaling attack writeup matters for LVLMs because the “model-view” is often a resized image. If you do not analyze what the model actually sees, you miss entire classes of hidden text. (The Trail of Bits Blog)


4) Cost of provable robustness: MIS filtering at 10M+ with Vespa phased ranking

4.1 The key observation

MIS-style filtering is only tractable if you do it on top-N retrieved candidates, not on the full corpus.

ReliabilityRAG’s core idea is MIS on a contradiction graph plus reliability priors, with provable robustness under assumptions. (arXiv)
Your systems question is: how to keep this within p95 latency budgets.

4.2 Latency anatomy of “MIS filtering”

For a candidate set size N:

  • Building a full contradiction graph is O(N^2) edge decisions.
  • MIS is NP-hard in general, so you use greedy or specialized variants.

So the Pareto knobs are:

  1. N (candidate set size)
  2. Edge budget (how many pairs you actually score)
  3. Edge scorer cost (rules vs small NLI vs cross-encoder)

4.3 A Vespa-native way to implement the Pareto frontier

Vespa gives you two mechanisms to control cost explicitly:

A) Phased ranking with rerank-count
Cap how many hits the second and global ranking phases re-score, so the expensive contradiction and risk logic only ever touches a bounded candidate set.

B) ONNX inside ranking
Run small risk models (poisoning risk, contradiction likelihood) inside the ranking pipeline. (Vespa documentation)

A practical architecture:

  1. Retrieve top N0 (example 200–1000) with cheap scoring.
  2. Second-phase rerank to N (example 50–200) using reliability priors + cheap risk.
  3. Global-phase computes the heavier contradiction edges only among the final N (or even a pruned subset), then selects the consistent set.

This matches Vespa’s intended scaling model: spend compute only on the best hits. (Vespa documentation)

4.4 What “provable” costs you in production

Two practical warnings:

Warning 1: the guarantee depends on edge precision
ReliabilityRAG assumes contradiction edges reflect truth sufficiently well. If your edge model is noisy, you can drop correct evidence. (arXiv)
So you should bias toward high precision, even if recall is lower. That usually means conservative rules plus selective heavier verification for only borderline pairs.

Warning 2: fail-closed increases over-refusal risk
Fail-closed is correct for tool execution, but it can create user-visible refusal spikes. OR-Bench exists because over-refusal is now a measured failure mode. (arXiv)
So your policy should distinguish:

  • “refuse tool/action”
  • “still answer safely in text, but with limited claims”

5) “Similar cases, issues online” that directly map to your design

These are the most relevant real-world-adjacent cases to study and replay in your harness:

  1. Hidden prompts in PDFs and structured documents
  • PhantomLint is explicitly about principled detection of hidden prompts in PDF/HTML. (arXiv)
  • Peer-review prompt injection incidents have been studied, including real manuscript cases. (arXiv)
  2. Image preprocessing attacks (scaling)
  • Trail of Bits documents “weaponizing image scaling” and provides Anamorpher tooling. (The Trail of Bits Blog)
  3. Unicode invisibles and confusables
  • Variation selector jailbreaks show “looks identical” can still tokenize differently, and code is available. (arXiv)
  • Unicode UTS #39 and ICU SpoofChecker provide the standard detection mechanisms. (Unicode)
  4. Prompt injection as a top-tier system risk
  • OWASP GenAI risk taxonomy frames prompt injection as a primary application risk. (OWASP Gen AI Security Project)
  • This matters because your orchestrator is ultimately a “confused deputy” defense.

6) Curated high-value resources to build on (papers, tools, docs)

Benchmarks and papers

  • REST / REST+ (cross-modal inconsistency, modality gap correlation). (arXiv)
  • RIO-Bench (selective reading vs ignoring text, typographic attacks). (arXiv)
  • ReliabilityRAG (MIS + reliability priors, provable robustness framing). (arXiv)
  • PhantomLint (hidden prompt detection for PDF/HTML). (arXiv)
  • Imperceptible jailbreaking via Unicode variation selectors (and released code). (arXiv)

Tooling and standards

  • Unicode UTS #39 confusables and security mechanisms. (Unicode)
  • ICU SpoofChecker API reference (practical confusable detection primitives). (Unicode Consortium)
  • Anamorpher (open-source image scaling attack tool) + writeup. (GitHub)

Retrieval infrastructure docs

  • Vespa phased ranking and ONNX-in-ranking documentation. (Vespa documentation)
  • Qdrant Query API and hybrid search documentation. (Qdrant)


7) Concrete suggestions for your implementation path

  1. Implement Sdiss as an ensemble with a confidence channel.
    Do not bet on one “manifold divergence” statistic. Use global gap, ROI gap, Unicode risk, directive-shape, and view-instability. Anchor calibration with REST/REST+ and RIO-Bench. (arXiv)

  2. Add “model-view” analysis as mandatory.
    Scaling attacks prove the model’s resize pipeline is part of the threat surface. (The Trail of Bits Blog)

  3. Make outcome-weighted penalties conditional.
    Only activate cluster penalties when upstream signals indicate risk. This preserves benign near-duplicate recall.

  4. Do MIS-like filtering only on a bounded top-N with explicit budgets.
    Use Vespa rerank-count to force predictable compute. (Vespa documentation)

  5. Separate “refuse tool/action” from “refuse answer.”
    This reduces over-refusal while staying fail-closed for execution. OR-Bench is a useful regression target. (arXiv)


Summary bullets

  • Sdiss should be an ensemble risk score, not a single divergence metric, because benign cross-modal inconsistency is real and measurable. (arXiv)
  • Add Unicode confusable checks and model-view resize checks because real attacks exploit invisibles and preprocessing. (Unicode)
  • Outcome-weighted penalties should be conditional on risk signals to preserve near-duplicate recall.
  • MIS filtering is practical only on top-N, and Vespa phased ranking gives explicit cost control via rerank-count and ONNX-in-ranking. (Vespa documentation)