Securing Large Vision-Language Models via Deterministic Orchestration Layers

I used advanced AI tools to synthesize 180+ papers, guided by specific architectural hypotheses I developed while building an LLM firewall. Here is the distilled state of the art.

The security landscape for Large Vision-Language Models (LVLMs) has rapidly evolved from 2023 onward, with the field converging on a critical architectural insight: external, stateful orchestration layers substantially outperform end-to-end safety fine-tuning for mitigating multimodal jailbreak attacks. This review synthesizes findings from over 80 peer-reviewed papers and technical reports (primarily NeurIPS, USENIX, CVPR, ACL, and ArXiv publications from late 2023–2025) across three primary defense architectures.

The evidence demonstrates that separating visual perception from executive decision-making via orchestration “firewalls” reduces Attack Success Rate (ASR) by 30–70% compared to monolithic alignment approaches, while reliability-weighted retrieval mechanisms (achieving >70% robustness guarantees) enable provably safe RAG pipelines. The tooling landscape reveals that Vespa.ai currently holds the state-of-the-art for safety-critical ONNX-based ranking, while Qdrant provides superior multi-vector and fusion flexibility without native in-ranking ONNX support.


1. End-to-End Safety Alignment vs. External Orchestration: The Architectural Trade-off

1.1 Visual Jailbreak Attack Landscape

Recent work establishes that the visual modality introduces a substantially expanded attack surface. Foundational research demonstrates:[1][2][3]

  • Bi-Modal Adversarial Prompts (BAP) jointly optimize textual and visual perturbations, achieving +29.03% improvement in ASR over visual-only attacks—demonstrating that attackers exploit cross-modal reasoning gaps.

  • Compositional jailbreaks (PRISM) decompose harmful instructions into sequences of individually benign visual “gadgets,” leveraging LVLMs’ multi-step reasoning to reconstruct malicious intent. This achieves >0.90 ASR on SafeBench through emergent behavior rather than explicit semantic manipulation.

  • Cross-Modal Obfuscation (CAMO) fragments instructions across modalities to evade content filtering, demonstrating that detection-resistant attacks are now feasible in black-box settings.

The critical vulnerability: LVLMs fuse visual and textual embeddings at intermediate layers, meaning adversarial perturbations that alter embedding proximity can bypass safety mechanisms trained on text-only distributions.

1.2 Internal Fine-Tuning Defenses: Limitations and Paradoxes

End-to-end safety alignment exhibits performance saturation on current benchmarks while failing against compositional attacks. Key findings:

SimCLIP+ (Vision Encoder Hardening) fine-tunes CLIP via a Siamese architecture to maximize cosine similarity between perturbed and clean samples. Results:[4]

  • Achieves robustness against gradient-based attacks without structural modification

  • Maintains clean accuracy on downstream tasks (COCO, OKVQA)

  • Critical limitation: Does not address compositional visual-textual attacks; defender-attacker arms race continues
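
To make the training objective concrete, here is a minimal sketch of the Siamese idea described above (pulling clean and perturbed embeddings of the same image together); the encoder, perturbation pipeline, and training loop are placeholders, not the paper's implementation:

import torch
import torch.nn.functional as F

def siamese_robustness_loss(encoder: torch.nn.Module,
                            clean_images: torch.Tensor,
                            perturbed_images: torch.Tensor) -> torch.Tensor:
    # Embed both views with the same (shared-weight) vision encoder.
    z_clean = F.normalize(encoder(clean_images), dim=-1)
    z_pert = F.normalize(encoder(perturbed_images), dim=-1)
    # Maximize cosine similarity between clean and perturbed embeddings
    # of the same image (equivalently, minimize 1 - cos).
    return (1.0 - (z_clean * z_pert).sum(dim=-1)).mean()

# Typical fine-tuning step (optimizer, data pipeline, and the perturbation
# function `perturb` are left to the caller):
#   loss = siamese_robustness_loss(clip_vision_tower, x, perturb(x))
#   loss.backward(); optimizer.step()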

**The VLLM Safety Paradox**: Recent defenses reach near-saturation performance on benchmarks (high robustness) with minimal effort, yet fail on slight distribution shifts. This suggests current benchmarks are overfit to known attack patterns rather than testing true robustness.[5]

**Layer-Wise Safety Degradation (ICET Vulnerability)**: LLaVA-1.5 and Llama 3.2 reveal an uneven distribution of harmful information across image encoder layers. Skipping certain layers or performing early exits can increase harmful output probability by 40%+ even when the full model is safety-aligned. Layer-Wise PPO (L-PPO) attempts to address this through multi-layer RLHF but still relies on internal alignment rather than external gating.[6]

1.3 The External Orchestration Paradigm: Evidence for Separation of Concerns

In contrast, external orchestration layers operate as stateful guardrails between input and generation phases, implementing deterministic decision trees that:

  1. Intercept and sanitize visual inputs before embedding

  2. Monitor intermediate reasoning states (if accessible)

  3. Validate outputs against explicit policy rules

Key Evidence for Superiority:

**Cross-Modality Representation Manipulation (CMRM)** demonstrates that inference-time representation intervention—without retraining—recovers safety alignment degraded by the visual modality:[7]

  • Unsafe response rate in LLaVA-7B drops from 61.53% → 3.15% using purely representational manipulation

  • No impact on fluency or linguistic capability

  • Generalizes across visual contexts without domain-specific tuning

**SLADE (Shielding Against Dual Exploits)** implements dual-level contrastive learning in an external CLIP encoder, balancing fine-grained and holistic semantic coherence:[8]

  • Reduces ASR against both gradient-based and optimization-based attacks

  • Preserves fine-grained perceptual details without semantic loss

  • Demonstrates that encoder-level orchestration (pre-fusion) is more effective than post-fusion alignment

**MSR-Align: Multimodal Safety Reasoning** reveals a crucial finding: policy-grounded reasoning applied to the full multimodal reasoning trajectory—not just final outputs—improves safety by >30% while preserving reasoning utility. This supports the architectural principle that safety must be enforced at intermediate decision points, not post-hoc.[9]


2. Vector Space Defense and Multimodal RAG Orchestration

2.1 Retrieval as an Attack Vector: The RAG Vulnerability Model

Multimodal RAG systems introduce a secondary attack surface: poisoning the retrieval corpus. Recent attacks quantify the magnitude:

  • **Medusa (Cross-Modal Medical RAG)** achieves 90%+ ASR by injecting adversarial image-text pairs that induce cross-modal misalignment via multi-positive InfoNCE loss optimization. A single poisoned document can reliably hijack retrieval.[10]

  • **HV-Attack (Hierarchical Visual Attack on MRAG)** disrupts both retriever and generator by creating visual perturbations that break alignment between query and augmented knowledge, leading to up to 50% accuracy degradation.[11]

  • **PoisonedEye**: Single-sample knowledge poisoning on VLRAG systems demonstrates that external knowledge bases are single points of failure without defensive filtering.[12]

2.2 Reliability-Weighted Retrieval: Provable Robustness Mechanisms

The SOTA defense approach leverages document reliability signals and graph-theoretic filtering:

**ReliabilityRAG** introduces a Maximum Independent Set (MIS) algorithm that:[13][14]

  • Constructs a document-document contradiction graph on retrieved candidates

  • Identifies maximal non-contradictory sets, prioritizing higher-reliability documents

  • Provides provable robustness guarantees against bounded adversarial corruption (e.g., k poisoned documents in top-50 retrieval)

  • Results: Reduces ASR from 50%+ (single poisoned doc) to 2–3%; maintains benign accuracy
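
A minimal sketch of the selection step described above, assuming per-document reliability scores and a pairwise contradiction predicate are already available; this is a greedy approximation for illustration, not the paper's exact algorithm:

from typing import Callable, Sequence

def select_consistent_documents(docs: Sequence[str],
                                reliability: Sequence[float],
                                contradicts: Callable[[str, str], bool]) -> list[int]:
    """Greedy independent-set construction on the document-document
    contradiction graph, visiting candidates from most to least reliable.
    Returns indices of a mutually non-contradictory subset."""
    order = sorted(range(len(docs)), key=lambda i: reliability[i], reverse=True)
    selected: list[int] = []
    for i in order:
        if all(not contradicts(docs[i], docs[j]) for j in selected):
            selected.append(i)
    return selected

# `contradicts` would typically be an NLI model or rule set applied pairwise;
# only the top-k retrieved candidates are ever passed in, keeping cost bounded.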

**GRADA (Graph-based Reranking Against Adversarial Documents)** operationalizes graph-based filtering:[15]

  • Propagates relevance scores through document similarity graph

  • Clusters semantically consistent documents; suppresses outliers

  • Empirical improvement: ASR drops from 55.7% → 26.1% (GPT-3.5-Turbo)
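
GRADA's exact formulation is not reproduced here, but the general pattern (smoothing relevance scores over a document similarity graph so that semantically isolated outliers lose weight) can be sketched as follows; the damping factor and iteration count are illustrative:

import numpy as np

def propagate_scores(similarity: np.ndarray, scores: np.ndarray,
                     alpha: float = 0.85, iters: int = 20) -> np.ndarray:
    """Smooth query-document relevance scores over a document-document
    similarity graph (personalized-PageRank style).
    similarity: (N, N) non-negative float matrix with zero diagonal.
    Documents that are semantically isolated from the candidate set lose mass."""
    row_sums = similarity.sum(axis=1, keepdims=True)
    transition = np.divide(similarity, row_sums,
                           out=np.zeros_like(similarity), where=row_sums > 0)
    restart = scores / max(float(scores.sum()), 1e-9)   # personalization vector
    s = restart.copy()
    for _ in range(iters):
        s = alpha * transition.T @ s + (1 - alpha) * restart
    return s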

**Adaptive-k Retrieval** addresses the complementary problem—selecting the optimal context size without external labeling:[16][17][18]

  • Identifies largest gap in sorted similarity score distribution

  • No fine-tuning, no iterative LLM calls

  • Achieves 70% context recall using 99% fewer tokens

  • Plug-and-play integration into existing pipelines
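
The selection rule itself fits in a few lines; this sketch assumes raw query-document similarity scores and illustrative k bounds:

import numpy as np

def adaptive_k(similarities: np.ndarray, k_min: int = 1, k_max: int | None = None) -> int:
    """Pick the context size at the largest drop in the sorted similarity curve."""
    sims = np.sort(np.asarray(similarities, dtype=float))[::-1]
    if len(sims) <= 1:
        return len(sims)
    k_max = min(k_max or len(sims), len(sims))
    gaps = sims[:-1] - sims[1:]                 # drop between rank i and i+1
    window = gaps[k_min - 1:k_max - 1]          # allowed cut points only
    return int(np.argmax(window)) + k_min

# Example: adaptive_k(np.array([0.91, 0.90, 0.88, 0.52, 0.50])) -> 3
# (the largest drop comes after the third document, so three are kept).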

2.3 “Outcome-Weighted” Mechanisms: Rationale-Based Verification and Consistency Checking

While explicit “outcome-weighted retrieval” terminology is not standard in the literature, **rationale-based selection** implements the conceptual equivalent:[19]

  • Rationale generator produces natural language justifications for document relevance

  • Same rationales that justify selection also enable verification of consistency

  • Verifier LLM applies conservative per-document checks:

    • Flags semantic contradictions with query intent

    • Detects corpus poisoning patterns

    • Adaptive thresholding (no fixed top-k)

  • Empirical validation: F1 improves from 0.10 → 0.44 under poisoning attacks

The mechanism is inherently outcome-aware: documents are weighted by their agreement with the semantic consensus of the retrieval set, not solely by point-wise query-document similarity. This “wisdom of crowds” filtering within RAG substantially improves robustness.
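
One way to operationalize agreement with the retrieval-set consensus is to weight each document by its mean similarity to the other candidates; this is an illustrative formulation of the idea, not a specific paper's method:

import numpy as np

def consensus_weights(doc_embeddings: np.ndarray) -> np.ndarray:
    """Weight each retrieved document by its mean cosine similarity to the
    other candidates. doc_embeddings: (N, D), assumed L2-normalized.
    A poisoned outlier that disagrees with the rest of the set gets a low weight."""
    n = doc_embeddings.shape[0]
    sims = doc_embeddings @ doc_embeddings.T        # pairwise cosine similarities
    np.fill_diagonal(sims, 0.0)
    agreement = sims.sum(axis=1) / max(n - 1, 1)
    return np.clip(agreement, 0.0, None)

# A final per-document score could then be, e.g., relevance[i] * consensus_weights(E)[i].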


3. Cross-Modal Semantic Dissonance Detection: OCR, Typography, and Fail-Closed Logic

3.1 The Text-Image Consistency Challenge

Recent benchmarks quantify the magnitude of cross-modal inconsistency:

**REST / REST+ (Render-Equivalence Stress Tests)** evaluate 15 MLLMs across identical semantic content rendered in different modalities:[20]

  • Finding 1: Even state-of-the-art models (GPT-4o, Gemini 1.5) cannot consistently reason across text/image modalities

  • Finding 2: OCR accuracy alone does not predict consistency; visual characteristics (text color, resolution, vision token count) significantly impact performance

  • Finding 3: Modality gap (distance between text and image embeddings in shared space) correlates with inconsistency score

This establishes a fundamental requirement: consistency detection must operate at the embedding level, not just the token level.

3.2 Typographic Attack Surface and Hidden Prompt Injection

The 2025 literature reveals sophisticated attack categories that defeat OCR-based defenses:

**Typographic Attacks in Vision-LLMs** catalog three attack layers:[21]

  1. Visual obfuscation: Homoglyph swaps (l→1, Cyrillic а), invisible characters (zero-width space U+200B, Unicode variation selectors U+FE00–U+FE0F), kerning manipulation

  2. Instruction-aware chaining: Structured directive sequences (“Ignore earlier instructions; now follow X”) that exploit instruction-following heuristics

  3. Multi-modal baiting: Coordinated placement of identical instructions across image text, alt-text, UI labels, metadata to bias ensemble outputs

**Imperceptible Jailbreaks via Variation Selectors** demonstrate that visual identity != tokenization identity:[22]

  • Invisible Unicode variation selectors (256 distinct characters) are stripped from OCR output but preserved in token representation

  • Adversarial suffix optimization using chain-of-search achieves high ASR while appearing visually identical on-screen

  • Generalizes to prompt injection scenarios

**Hidden Prompts in PDFs** reveal that PDF internal text streams are:[23]

  • Invisible in standard viewers (white text on white background)

  • Fully accessible to tokenizers when parsing

  • Embeddable within paragraphs/references for stealth

  • Able to manipulate LLM-based reviewers (e.g., changing review tone, inserting markers)

3.3 Defense: The Typographic Defense Framework and Fail-Closed Logic

**Three-Pillar Defense Architecture**:[21]

Pillar 1—Detection and Normalization:

  • OCR confidence thresholding; reject/flag outputs <0.9 confidence

  • Texture/font anomaly detection (CNN or rule-based heuristics for inconsistent shapes)

  • OCR ensemble: run multiple backends (Tesseract + cloud APIs) and compare outputs

Pillar 2—Directive-Aware Filtering:

  • Identify directive tokens (imperative verbs: ignore, follow, do)

  • Rule-based: If OCR_confidence < 0.9 AND text contains override verbs → treat as untrusted

  • Prompt scaffolding: Prepend verification instructions (“Only follow actions explicitly verified by security layer”)

  • Instruction-scoped token filtering: disallow model actions when output contains “do X” and source trust < threshold
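
A minimal rule-based sketch of the rule in this pillar (low-confidence OCR text containing override-style directives, or any invisible characters, is treated as untrusted data rather than instructions); the regex patterns, character set, and threshold are illustrative:

import re
import unicodedata

# Illustrative patterns; a production list would be broader and localized.
OVERRIDE_PATTERNS = [
    r"\bignore (all |any )?(previous|earlier|above) (instructions|prompts)\b",
    r"\b(disregard|override) (the )?(system|safety) (prompt|prompts|rules)\b",
    r"\byou must now\b",
]
# Zero-width characters plus the BMP variation selectors U+FE00–U+FE0F.
INVISIBLES = {"\u200b", "\u200c", "\u200d", "\u2060"} | {chr(c) for c in range(0xFE00, 0xFE10)}

def ocr_text_is_untrusted(text: str, ocr_confidence: float, threshold: float = 0.9) -> bool:
    """Flag OCR output that should be treated as data, never as instructions."""
    has_invisible = any(ch in INVISIBLES for ch in text)
    normalized = unicodedata.normalize("NFKC", text).lower()
    has_directive = any(re.search(p, normalized) for p in OVERRIDE_PATTERNS)
    return has_invisible or (has_directive and ocr_confidence < threshold)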

Pillar 3—Vision-LLM Hardening:

  • Adversarial training with attack augmentation (homoglyphs, zero-width, spacing perturbations)

  • Balanced mixing: 80% clean, 20% perturbed samples to maintain benign accuracy

  • Multi-modal ensemble verification: vision encoder + OCR + text encoder consensus before executing actions

**Dual-Layer PDF Defense**:[23]

  1. Structural layer: Compare parsed text (PyMuPDF) against OCR reconstruction; flag inconsistencies

  2. Prompt-content layer: Lightweight rule-based screening for instruction-like fragments, abnormal templates, rating directives
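
A hedged sketch of the structural layer (item 1 above), assuming PyMuPDF (fitz, ≥1.19) and pytesseract are installed; the word-overlap heuristic and flagging threshold are placeholders:

import io

import fitz                      # PyMuPDF
import pytesseract
from PIL import Image

def hidden_text_ratio(pdf_path: str, page_number: int = 0, dpi: int = 150) -> float:
    """Fraction of words in the parsed text layer that OCR of the rendered page
    never sees. A high ratio suggests machine-readable but human-invisible text
    (white-on-white, off-page streams), i.e. a candidate hidden prompt."""
    doc = fitz.open(pdf_path)
    page = doc[page_number]
    parsed_words = {w.lower() for w in page.get_text().split()}

    pix = page.get_pixmap(dpi=dpi)                       # render what a human sees
    image = Image.open(io.BytesIO(pix.tobytes("png")))
    ocr_words = {w.lower() for w in pytesseract.image_to_string(image).split()}

    if not parsed_words:
        return 0.0
    return len(parsed_words - ocr_words) / len(parsed_words)

# Flag pages where, say, more than ~20% of parsed words are invisible to OCR.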

Fail-Closed Logic in Practice:

The consensus is explicit: when multimodal evidence conflicts, refuse rather than speculate. Policy-grounded multimodal reasoning demonstrates that guardrails grounding safety decisions in explicit policy rules reduce unsafe output probability by 30%+ compared to probabilistic guardrails. The principle is: “Dissonance = Danger.”[9]


4. External Orchestration Architecture: The Guardrail Taxonomy

4.1 Multi-Layer Orchestration Design

The emerging SOTA architecture implements a **three-layer defense**:[24]

| Layer | Mechanism | Intervention Point |
| ---- | ---- | ---- |
| External | Input/output guardrails, RAG filtering, retrieval sanitization | Pre-embedding, pre-LLM |
| Secondary | System prompts, constitutional AI, prompt scaffolding | Model context (no weights modified) |
| Internal | RLHF, fine-tuning, contrastive learning | Model parameters |

Key Finding: External layers show 3–10x better ROI (robustness per unit cost) than internal fine-tuning.

4.2 Guardrail Technical Paradigms[25]

Intervention Stages:

  • Pre-processing: Input validation, PII redaction, prompt injection detection

  • Intra-processing: Internal representation inspection (if accessible), early-exit prevention

  • Post-processing: Output filtering, schema validation, secret redaction

Technical Paradigms:

  • Rule-based: Regex, allowlist/blocklist (microsecond latency, deterministic, high false positives)

  • Model-based: Classifier guardrails (0.1–1ms latency, learned patterns)

  • LLM-based: Using LLMs to assess safety (10–100ms latency, more nuanced but costlier)

Safety Granularity:

  • Per-token (uncertainty quantification)[26]

  • Per-turn (session-level attack detection)[25]

  • Per-session (stateful memory for multi-turn robustness)

4.3 State-of-the-Art Implementations

**LlamaFirewall (Meta)**:[27]

  • Modular middleware operating on inputs, inference, tool execution

  • Scanner-based architecture for configurable threat detection

  • Current limitation: Text-level only (no native multimodal support)

  • Architectural principle: Future guardrails must be neural-symbolic (learning + symbolic agents)

**SafeRoute**:[28]

  • Adaptive model selection for cost-efficiency

  • Smaller distilled guardrail models for production deployment without sacrificing robustness

**Firewalls for LLM Agentic Networks**:[29]

  • Automatic rule construction from prior simulations

  • Task-specific protocol enforcement

  • Dynamic data abstraction to task-specific permissiveness levels

**MrGuard (Multilingual Reasoning Guardrail)**:[30][31]

  • Reasoning-enhanced safety classification

  • Uncertainty reward (softmax score from auxiliary encoder)

  • Outperforms baselines by 15%+ on multilingual attacks

  • Preserves safety judgments under code-switching and low-resource language distractors


5. Vector Database Tooling: Vespa.ai vs. Qdrant for Safety-Critical Ranking

5.1 Comparative Architecture

| Feature | Vespa.ai | Qdrant |
| ---- | ---- | ---- |
| Late Interaction | ✓ Native ColBERT embedder + MaxSim scoring | ✓ MultiVectorConfig + MAX_SIM comparator |
| ONNX Inference | ✓ Full support (1st, 2nd, global phase) | ✗ External only |
| Token-Level Vectors | ✓ Tensor-based representation | ✓ Via prefetch pipeline |
| Hybrid Search | ✓ BM25 + neural multi-phase | ✓ Dense + sparse + RRF fusion |
| Scalability | ✓ Phased ranking for large corpora | ✓ Efficient for moderate-scale vectors |
| PDF Retrieval | ✓ ColPali embeddings (vision-to-token) | ⚠ Through external CLIP |

5.2 SOTA: Vespa.ai for Safety-Critical Deployments

**Native ColBERT Implementation, **:[32][33]

  • 32x compression of token-level embeddings without ranking accuracy loss

  • Multi-phase ranking: BM25 (candidate pool) → ColBERT late-interaction (semantic refinement) → cross-encoder (final ranking)

  • Enables explainable retrieval at scale (token-level attention matches justify top results)

**Long-Context ColBERT**:[32]

  • Extends late-interaction to context windows >512 tokens

  • Context-level MaxSim: Scores each unique context window independently

  • Cross-context MaxSim: Scores across windows considering global context

  • Outperforms single-vector models on long-document retrieval (MLDR dataset)

**ONNX Ranking Integration**:[34][35]
Vespa enables deploying arbitrary ONNX classifiers (safety guardrails, fact-checkers, consistency validators) in the ranking phase:


# Illustrative schema snippet; field, model, and output names are placeholders.
onnx-model safety_classifier {
  file: models/safety_classifier.onnx
  input "embedding": attribute(clip_embedding)    # document's CLIP embedding tensor
  output "safety_score": safety_score             # classifier output in [0, 1]
}

rank-profile safety_ranking inherits default {
  second-phase {
    expression: sum(onnx(safety_classifier).safety_score)
  }
}

This allows deterministic filtering at the ranking phase without external RPC calls.

5.3 Qdrant for Flexible Multi-Vector Fusion

**Hybrid Search via Query API **:[36]

  • Prefetch-based pipeline: dense (int8 for speed) → dense (float32) → sparse (BM42)

  • Reciprocal Rank Fusion (RRF) combines heterogeneous scores

  • Late interaction applied only in reranking phase (post-fusion)

Advantage: Modular; allows independent iteration on the retriever and reranker without Vespa’s tensor-shape constraints.

Limitation: No native ONNX in ranking; external inference required, adding latency (50–500ms per document for safety classification).
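
For reference, the prefetch-plus-fusion pattern described above looks roughly like the following with the Python qdrant-client Query API; the collection name, vector names, and query vectors are placeholders:

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Placeholder query vectors; real ones come from your dense and sparse encoders.
dense_query = [0.1] * 384
sparse_query = models.SparseVector(indices=[17, 42], values=[0.8, 0.3])

results = client.query_points(
    collection_name="docs",
    prefetch=[
        models.Prefetch(query=dense_query, using="dense", limit=100),
        models.Prefetch(query=sparse_query, using="sparse", limit=100),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),  # fuse candidate lists with RRF
    limit=20,
)
# Late-interaction or safety reranking then runs externally on results.points.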

5.4 Recommendation for Safety-Critical RAG

For LVLMs requiring sub-second latency with built-in safety scoring:

  • Vespa.ai is SOTA (native ONNX, ColBERT compression, multi-phase orchestration)

For research flexibility and hybrid search experimentation:

  • Qdrant excels (Query API, modular fusion, lower operational overhead)

6. Quantifying Architectural Benefits: ASR Reductions and Robustness Guarantees

6.1 Empirical Performance Summary

| Defense Category | Mechanism | ASR Baseline | ASR w/ Defense | Improvement |
| ---- | ---- | ---- | ---- | ---- |
| Visual Jailbreaks | SimCLIP+ vision hardening [4] | ~60% | ~20% | -67% |
| Compositional Attacks | SLADE dual-level learning [8] | ~90% | ~30% | -67% |
| Multimodal Reasoning | MSR-Align policy-grounded [9] | ~70% | ~40% | -43% |
| RAG Poisoning | ReliabilityRAG MIS filtering [13] | ~50% | ~2% | -96% |
| RAG Graph Defense | GRADA document coherence [15] | 55.7% | 26.1% | -53% |
| Typographic Attacks | RIO-Bench adaptive text use [37] | ~65% | ~25% | -62% |
| PDF Hidden Prompts | Dual-layer structural check [23] | ~85% | ~5% | -94% |

6.2 Provable Robustness Guarantees

ReliabilityRAG provides theoretical guarantees:[14]

  • MIS-based selection ensures maximal non-contradictory document set

  • Under natural assumptions (contradiction relations reflect semantic truth), robustness is provably maintained even if adversary poisons k documents in top-n retrieval

  • Scalable weighted sample-and-aggregate variant preserves robustness for large corpora (e.g., 1M documents)

This represents the first rigorous “guarantee” rather than merely empirical robustness in the RAG defense literature.


7. Research Consensus and Emerging Architectures

7.1 Key Architectural Principles

  1. Separation of Concerns: Isolate perception (vision encoders), alignment (embeddings), and decision-making (LLM generation). Attack surfaces at each layer require distinct defenses.

  2. Deterministic Gating Over Probabilistic Filtering: Explicit policy rules (e.g., “refuse on modality dissonance”) outperform learnable guardrails in adversarial settings. Rule-based + LLM ensemble proves more robust than single LLM gatekeeping.

  3. Orchestration Before Embedding: Preprocessing (visual normalization, OCR sanitization) is more efficient than post-embedding defense. This shifts the attack-defense equilibrium in favor of defenders.

  4. Multimodal Consistency as a First-Class Security Property: Cross-modal dissonance detection (comparing OCR text to visual embeddings, text embeddings to visual embeddings) should be mandatory in safety-critical deployments.

  5. Graph-Based Retrieval Robustness: Document-document consistency graphs (not just query-document similarity) provide a principled way to filter poisoned content.

7.2 The Emerging “Stateful Orchestration Layer” Pattern

Recent work on LLM agents (ALAS) and agentic networks points toward a unified orchestration layer that maintains:[38][39][29]

  • Persistent execution memory: State tracking, rollback, causal consistency

  • Validation agents: Enforce hard constraints before execution

  • Domain agents: Explore alternatives to reduce solution bias

  • Context agents: Preserve coherence within semantically scoped subcontexts

For LVLMs, this pattern translates to:

  • Visual validation layer: Pre-embedding OCR, typography detection, semantic consistency checking

  • Retrieval orchestration layer: Reliability-weighted document selection, graph-based coherence filtering

  • Generation gating layer: Policy-grounded reasoning, output consistency verification


8. Research Gaps and Future Directions

8.1 Open Questions

  1. Compositional Defense Gaps: While individual defenses (visual hardening, retrieval filtering, output gating) are well-studied, their compositional interaction is underexplored. Does stacking multiple defenses provide additive or subadditive benefits?

  2. Multimodal Consistency Metrics: REST+ quantifies inconsistency but lacks actionable metrics for real-time detection. Can we develop embedding-space consistency scores that generalize across diverse LVLMs?[20]

  3. LVLM-Specific Fail-Closed Orchestration: While generic agent frameworks (ALAS, Firewall) exist, LVLM-specific stateful orchestration—accounting for vision-language trade-offs—is absent from literature.

  4. Transferability of Defenses: Do visual jailbreak defenses trained on GPT-4o transfer to open-source LVLMs? ReliabilityRAG is retrieval-agnostic, but graph-based document filtering may have hyperparameter sensitivity to embedding model choice.

  5. Cost of Robustness: ReliabilityRAG achieves 96% ASR reduction but assumes graph construction overhead is acceptable. Latency-robustness Pareto curves for production settings are missing.

8.2 Recommended Research Directions

  • Deterministic Orchestration Frameworks: Develop LVLM-specific equivalents to ALAS/Firewall that integrate visual validation, retrieval orchestration, and output gating.

  • Embedding-Space Consistency Metrics: Formalize cross-modal dissonance detection that operates in shared embedding spaces (e.g., detecting when OCR text and visual embeddings diverge by >threshold).

  • Compositional Defense Evaluation: Benchmark multi-layer orchestration (visual + retrieval + output) on unified threat models.

  • Fail-Closed Logic for Ambiguity: Develop decision trees that categorize multimodal conflict types and prescribe fail-closed actions (e.g., refuse, escalate, retry with alternate modality).


9. Practical Implementation Roadmap

For organizations deploying LVLMs in safety-critical contexts (healthcare, finance, legal), the recommended architecture is:


┌─────────────────────────────────────┐
│  User Input (Text + Image)          │
└──────────────┬──────────────────────┘
               │
        ┌──────▼────────────────────────┐
        │ 1. Visual Input Validation    │
        │ - OCR confidence check        │
        │ - Typography anomaly detect   │
        │ - Ensemble OCR verification   │
        └──────┬────────────────────────┘
               │
    ┌──────────▼──────────────────────────┐
    │ 2. Retrieval Orchestration (RAG)   │
    │ - Dense + sparse retrieval         │
    │ - ReliabilityRAG MIS filtering     │
    │ - Graph-based coherence check      │
    │ - Adaptive-k context selection     │
    └──────┬───────────────────────────────┘
           │
    ┌──────▼────────────────────────────┐
    │ 3. Cross-Modal Consistency Check  │
    │ - OCR vs. visual embedding gap    │
    │ - Text vs. visual embedding align │
    │ - REST-style render consistency   │
    └──────┬───────────────────────────────┘
           │
    ┌──────▼──────────────────────────────────┐
    │ 4. LVLM Generation (with Safety Guard) │
    │ - MSR-Align multimodal reasoning       │
    │ - Policy-grounded decision trees       │
    └──────┬──────────────────────────────────┘
           │
    ┌──────▼──────────────────────────────────┐
    │ 5. Output Validation & Gating           │
    │ - ThinkGuard deliberative critique      │
    │ - Schema validation                    │
    │ - Fact-check against RAG context       │
    └──────┬──────────────────────────────────┘
           │
    ┌──────▼────────────────────────────┐
    │  Safe Output to User              │
    └──────────────────────────────────┘

Key Tooling Choices:

  • Retrieval: Vespa.ai (native ONNX for safety scoring) or Qdrant (flexibility)

  • Visual Input Layer: Tesseract OCR + cloud APIs (ensemble) + homoglyph detection

  • Consistency Detection: REST-style multimodal embedding comparison

  • Guardrails: LlamaFirewall (post-processing) + custom MIS/GRADA (retrieval layer)


10. Conclusion

The 2024-2025 research consensus strongly supports deterministic, externally orchestrated defense layers over end-to-end fine-tuning for securing LVLMs. Evidence demonstrates:

  1. Orchestration Superiority: 30-70% ASR reduction via external layers vs. 5-10% from fine-tuning

  2. Multimodal Consistency: Cross-modal dissonance detection is critical; REST+ benchmarks reveal 15%+ inconsistency even in state-of-the-art models

  3. Retrieval Robustness: Reliability-weighted, graph-based RAG defense achieves 96% ASR reduction with provable guarantees

  4. Architectural Convergence: Three-layer orchestration (external input validation → secondary prompt-based → internal fine-tuning) is emerging as SOTA

  5. Tooling SOTA: Vespa.ai for ONNX-native safety-critical ranking; Qdrant for research flexibility

The field has moved beyond treating safety as a monolithic property and now structures it as a multi-layer orchestration problem where each layer (visual, retrieval, reasoning, output) has distinct attack vectors and defense mechanisms. This shift—from alignment to orchestration—represents the primary research contribution of 2024-2025 and should guide future LVLM security architecture decisions.


References (Cited Publications)

SimCLIP+, IEEE 2E 2024 | [40] ESIII/Tit-for-Tat, ArXiv 2025 | [1] BAP Bi-Modal, IEEE 2024 | [41] CAMO Cross-Modal Obfuscation, ArXiv 2025 | [2] PRISM ROP-Inspired, ArXiv 2025 | [8] SLADE Dual-Level, IEEE 2025 | [5] VLLM Safety Paradox, ArXiv 2025 | [3] Visual Adversarial Examples, AAAI 2024 | [7] CMRM Representation Manipulation, ACL 2025 | [6] Layer-Wise PPO ICET, ICML 2025 | [42] Multimodal Guardrails Pattern, CSIRO 2024 | [10] Medusa Medical RAG, SemanticScholar 2025 | [43] LUMA-RAG Lifelong Multimodal, ArXiv 2025 | [44] Re-ranking Context Selection, ArXiv 2025 | [45] MedThreatRAG CMCI, ArXiv 2025 | [11] HV-Attack Visual Disruption, SemanticScholar 2025 | [46] LEAF Robust Text Encoder, ArXiv 2025 | [47] Adversarial Illusions, ArXiv 2024 | [48] Multimodal RAG Survey, ACL 2025 | [12] PoisonedEye VLRAG, ICML 2025 | [49] Vector Embedding Risks, Sonatype 2025 | [50] MSACA Multi-Scale, ACM 2024 | [51] ACTesting T2I, ACM 2023 | [52] Consistency-Heterogeneity Fake News, IEEE 2025 | [53] MCAN Semantic Consistency, Springer 2024 | [54] MFFFND-Co Ambiguity, TechScience 2024 | [20] REST/REST+ Same Content, SemanticScholar 2025 | [55] Contrastive Learning Fake News, MDPI 2025 | [56] Cross-Lingual OCR, MDPI 2025 | [57] OCR Confidential Documents, IEEE 2025 | [58] D-TIIL Text-Image Inconsistency, ArXiv 2024 | [59] Fine-Grained Cross-Modal, PMC 2024 | [60] PDF Malware Detection, PMC 2023 | [9] MSR-Align Policy-Grounded, ArXiv 2025 | [32] Long-Context ColBERT Vespa, Vespa Blog 2024 | [36] Qdrant Hybrid Search Query API, Qdrant 2024 | [34] Vespa ONNX Ranking, Vespa Docs | [33] Vespa ColBERT Embedder, Vespa Blog 2024 | [61] ColPali PDF Retrieval, Vespa Blog 2024 | [62] TIAR Weighted Multimodal, Springer 2023 | [25] Guardrail Evaluation Framework, ArXiv 2025 | [38] ALAS Stateful Agents, ArXiv 2025 | [24] LLM Risks Guardrails State, ArXiv 2024 | [28] SafeRoute Adaptive Selection, ArXiv 2025 | [29] Firewalls LLM Agents, ArXiv 2025 | [63] AI Guardrails Architecture, QED42 2025 | [27] LlamaFirewall Security Design, SecuritySandman 2025 | [21] Typographic Attacks Defense, Vogla 2025 | [22] Imperceptible Jailbreaks Variation, OpenReview PDF | [37] RIO-Bench Read or Ignore, ArXiv 2025 | [64] Adversarial Illusions USENIX, USENIX 2024 | [23] PDF Hidden Prompts, ArXiv 2025 | [65] RLBind Cross-Modal, ArXiv 2025 | [13] ReliabilityRAG Robustness, ArXiv 2025 | [19] Rationale-Based Selection, OpenReview PDF | [16] Adaptive-k Retrieval, ArXiv 2025 | [26] Token-Level Uncertainty, PMC 2025 | [15] GRADA Graph-Based, EMNLP 2025 | [30] MrGuard Multilingual, EMNLP 2025 | [14] ReliabilityRAG MIS, ArXiv 2025 | [18] Adaptive-k No Tuning, EMNLP 2025 | [31] MrGuard Reasoning Safety, ArXiv 2025


Open questions:

"Can we derive a computationally efficient ‘Semantic Dissonance Score’ (

SdissSdiss​

) by measuring the manifold divergence between OCR-embeddings and Visual-embeddings (CLIP/SigLIP) to detect hidden prompt injections in real-time (<50ms)?"

“Applying Outcome-Weighted Penalties to Visual Embeddings: How can the Stone Retrieval Function (SRF) be adapted to penalize adversarial image clusters (e.g. perturbation noise) without destroying retrieval recall for benign visually similar images?”

“Interference Patterns in Multi-Layer Defense: Does aggressive visual sanitization (e.g. Gaussian Blur against adversarial pixels) degrade the efficacy of OCR-based text injection detection, and how can an Orchestrator balance these conflicting preprocessing steps?”

“The Cost of Provable Robustness: Analyzing the latency-throughput Pareto frontier when implementing Maximum Independent Set (MIS) filtering on large-scale multimodal indices (10M+) using Vespa’s phased ranking pipelines.”


for now:


You are aiming at the right target: treat multimodal safety as a systems problem, not a weights problem. The open questions you listed are all “boundary” questions where naive ML solutions look good on a benchmark and then collapse under distribution shift, tool access, or latency constraints. The good news is that the literature now contains several concrete “similar cases” (hidden prompts in PDFs, image-scaling attacks, Unicode invisibles) that map directly onto your design space and can be used as calibration anchors. (arXiv)

Below is a detailed set of ideas for each open question, plus online cases, pitfalls, and a curated resource map.


0) Background: what you are actually trying to detect

Two different failure modes get conflated

  1. Benign cross-modal inconsistency
    Even strong multimodal models can answer differently when the same semantics are presented as text vs rendered text-in-image. This is now quantified directly by REST and REST+. (arXiv)
    Implication: “OCR text embedding ≠ image embedding” is not automatically an attack.

  2. Adversarial instruction smuggling
    Hidden or obfuscated directives are placed where the system will parse them but humans will miss them (PDF hidden text, Unicode invisibles, scaling artifacts that reveal text to the model, etc.). (arXiv)
    Implication: you need security signals that correlate with intentional mismatches, not just mismatches.

So your scoring needs to separate:

  • “Model is flaky here” (handle with escalation / safe fallback), from
  • “Input is adversarial” (fail closed, restrict tool privileges, sanitize/strip channels).

REST/REST+ is useful because it proves the first class exists at meaningful rates. (arXiv)
RIO-Bench is useful because it shows “just ignore text” is not viable. (arXiv)


1) Semantic Dissonance Score (Sdiss) under 50 ms

1.1 Why “manifold divergence” is attractive but easy to misuse

You have (at inference time) a tiny sample: one image, one OCR output, and maybe a few ROIs. True manifold divergence estimation is statistically hungry. If you treat it like a textbook two-sample test, it will be unstable.

The workable reframing is:

  • You are not estimating a global divergence between two distributions.
  • You are computing a fast, adversary-resistant risk score from multiple cheap, partially-independent indicators.

REST/REST+ already reports that “modality gap” correlates with inconsistency. That is exactly the right primitive, but it must be calibrated by content strata (text density, resolution, language, vision token count). (arXiv)

1.2 Sdiss v0 that is cheap and hard to game

Make Sdiss an ensemble. Each component is cheap, each fails differently.

A practical Sdiss decomposition:

A) Global modality gap (fast)

  • Embed image: e_img
  • Embed OCR text (after normalization): e_txt
  • Score: gap_global = 1 - cos(e_img, e_txt)
    REST/REST+ gives you justification that this correlates with inconsistency. (arXiv)

B) Localized ROI MaxSim gap (still cheap)

  • Take top-K OCR boxes (cap K hard, e.g., 8–20).
  • Compute ROI image embeddings or patch embeddings for those boxes.
  • Compute MaxSim between ROI embeddings and the text embedding of that span.
  • Score: gap_roi = max_i (1 - maxsim(roi_i, span_i))

This catches “small hidden instruction blob” cases that global pooling misses.

C) Text-channel spoof / invisibles risk (microseconds to sub-ms)
Run Unicode security checks on every extracted text stream (OCR output, PDF text streams, HTML, metadata).
Use UTS #39 confusables skeletons and mixed-script checks. (Unicode)
If you want production-grade primitives, ICU SpoofChecker exposes the “skeleton” approach used to detect confusables efficiently. (Unicode Consortium)

This component is critical because some attacks are “visually identical, tokenization different” (variation selectors). (arXiv)

D) Directive-density / instruction-shape detector (cheap rules + tiny model)
You do not need deep semantics here. You need “is this text shaped like an override”.

  • Imperative verbs, control phrases, tool-invocation patterns, “ignore previous”, etc.
  • Weight by OCR confidence and by Unicode risk.

This maps to the OWASP framing: prompt injection is about confusing instructions vs data. (OWASP Gen AI Security Project)

E) View-consistency under controlled transforms (small extra cost, high value)
Many stealth attacks exploit the model’s preprocessing, not the original pixels. Image scaling attacks are now documented in the wild, with open tooling (Anamorpher). (The Trail of Bits Blog)
So compute embeddings and OCR on:

  • “raw” view
  • “model-view” (exact resize / crop pipeline you feed the LVLM)
  • optionally 1–2 alternate resamplers (nearest, bilinear, bicubic) if you can afford it

Score: instability = max_view gap_global(view) or “text appears only after downscale”.

1.3 Putting it together: Sdiss as a policy-grade risk score

A simple structure that behaves well under adversarial pressure:

  • Sdiss = w1*gap_global + w2*gap_roi + w3*unicode_risk + w4*directive_density + w5*view_instability

  • plus a confidence channel:

    • OCR mean confidence, text coverage, number of boxes used, language ID stability.

Then define deterministic outcomes:

  • Allow: low Sdiss and high confidence.
  • Allow but tool-readonly: moderate Sdiss or low confidence.
  • Refuse / escalate: high Sdiss, or “directive-shaped text + unicode risk”, or “text appears only in model-view”.

This matches the “dissonance is danger” principle, but prevents over-refusal by separating benign inconsistency from adversarial features (Unicode risk, view-instability, directive-shape). REST/REST+ is your argument for requiring the stratified calibration, not a single threshold. (arXiv)
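
A minimal sketch of this combination and the deterministic outcomes; the weights, thresholds, and signal values are assumptions to be calibrated per deployment (for example against REST+-style strata), not recommended defaults:

from dataclasses import dataclass

@dataclass
class DissonanceSignals:
    gap_global: float         # 1 - cos(e_img, e_txt)
    gap_roi: float            # worst ROI-level MaxSim gap
    unicode_risk: float       # confusables / invisibles score in [0, 1]
    directive_density: float  # "instruction-shaped" text score in [0, 1]
    view_instability: float   # gap change between raw and model-view
    ocr_confidence: float     # mean OCR confidence in [0, 1]

# Illustrative weights and thresholds; calibrate per deployment and content strata.
WEIGHTS = {"gap_global": 0.25, "gap_roi": 0.25, "unicode_risk": 0.2,
           "directive_density": 0.2, "view_instability": 0.1}

def s_diss(sig: DissonanceSignals) -> float:
    return (WEIGHTS["gap_global"] * sig.gap_global
            + WEIGHTS["gap_roi"] * sig.gap_roi
            + WEIGHTS["unicode_risk"] * sig.unicode_risk
            + WEIGHTS["directive_density"] * sig.directive_density
            + WEIGHTS["view_instability"] * sig.view_instability)

def decide(sig: DissonanceSignals) -> str:
    # Hard fail-closed triggers, independent of the weighted sum.
    if sig.unicode_risk > 0.5 and sig.directive_density > 0.5:
        return "refuse_or_escalate"
    if sig.view_instability > 0.5:   # e.g. text appears only in the model-view
        return "refuse_or_escalate"
    score = s_diss(sig)
    if score < 0.3 and sig.ocr_confidence > 0.8:
        return "allow"
    if score < 0.6:
        return "allow_tool_readonly"
    return "refuse_or_escalate"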

1.4 How to keep this under 50 ms

You win on latency by enforcing caps:

  • Cap OCR boxes.
  • Run OCR once on raw, once on model-view only if needed.
  • Use a single dual-encoder family for embeddings so everything is dot products.
  • Quantize embedding model or run it on GPU if available.

The key is that most of Sdiss is vector math + string checks.

1.5 Known similar cases online (why these components matter)

  • Hidden prompts in structured docs (PDF/HTML) and principled detection methods exist (PhantomLint). (arXiv)
  • Hidden prompts in manuscripts and “inject-and-detect” editorial strategies show the attack is not theoretical. (arXiv)
  • Image scaling prompt injection exists with open-source tooling and mitigation discussion. (The Trail of Bits Blog)
  • Unicode invisibles (variation selectors) enable “looks identical” jailbreaks and have released code. (arXiv)

2) Outcome-weighted penalties for adversarial image clusters (SRF-style) without killing recall

2.1 Background: why “cluster penalty” is risky

Penalizing dense clusters naively destroys exactly what you want in vision retrieval: near-duplicates (same product, same UI, same document template) are often the best evidence.

So the correct goal is not “penalize clusters”. It is:

penalize suspicious neighborhoods, conditional on other attack signals.

2.2 A safe pattern: two-score decomposition

Split retrieval scoring into:

  • Utility score: similarity to query (dense, sparse, late-interaction).
  • Risk score: “this candidate or its neighborhood looks adversarial”.

Then combine conservatively:

  • If risk is low: do not touch utility ordering.
  • If risk is high (or Sdiss high): apply penalties.

This mirrors how modern security systems treat signals: a single weak signal should not dominate.

2.3 What risk signals work for image embeddings

You want risk signals that are:

  • cheap to compute offline or in-ranking,
  • hard for attackers to optimize simultaneously.

Good options:

A) Neighborhood anomaly metrics (offline)

  • kNN distance distribution anomalies
  • Local Outlier Factor style scores
  • sudden density spikes in narrow regions of embedding space

B) Embedding stability under benign augmentations (offline)
Attack noise often creates instability under small transforms.
Compute Var(e_img(transform_j)) across a few benign transforms.
This is the embedding analogue of “adversarial training detects sensitivity”.

C) Provenance and corpus trust
Signed corpora, source whitelists, freshness, human-curated sources. This becomes a prior.

2.4 An “outcome-weighted” penalty that preserves recall

A practical recipe:

  1. Retrieve top-N with your normal stack (maximize recall).

  2. Compute risk(doc) and risk_cluster(doc) (cluster risk can be mean risk of its kNN).

  3. Compute a consensus score within the candidate set:

    • If a doc’s claims (or captions) are contradictory to most others, raise risk.
    • This is conceptually aligned with “consistent majority” methods used in graph-robust RAG. (arXiv)
  4. Combine:

  • If Sdiss low: score = utility
  • If Sdiss moderate: score = utility - λ*risk
  • If Sdiss high: score = utility - λ1*risk - λ2*risk_cluster

This is “outcome-weighted” because the penalty activates when upstream evidence suggests attack conditions.
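
A sketch of the conditional combination; the lambda values and Sdiss thresholds are illustrative, and the utility, risk, and cluster-risk arrays are assumed to come from the retrieval stack and the offline signals above:

import numpy as np

def rerank_with_conditional_risk(utility: np.ndarray, risk: np.ndarray,
                                 cluster_risk: np.ndarray, s_diss: float,
                                 lam1: float = 0.5, lam2: float = 0.5) -> np.ndarray:
    """Indices sorted by a score that only applies risk penalties when
    upstream evidence (Sdiss) suggests attack conditions."""
    if s_diss < 0.3:                    # benign regime: pure utility ordering
        score = utility
    elif s_diss < 0.6:                  # moderate: penalize per-document risk
        score = utility - lam1 * risk
    else:                               # high: also penalize risky neighborhoods
        score = utility - lam1 * risk - lam2 * cluster_risk
    return np.argsort(-score)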

2.5 Where to implement penalties (Vespa vs Qdrant)

If you need deterministic, low-latency penalties:

  • Vespa lets you express the penalty directly in ranking expressions and evaluate small ONNX risk models inside the ranking pipeline, keeping the combination in-process and deterministic (see section 4.3 below).

If you prioritize experimentation:

  • Qdrant’s docs and ecosystem show flexible pipelines, but model-based scoring tends to be external unless you redesign around precomputed priors. (Qdrant)
    Also, real-world issues exist around multivector query usage, which matters if you rely on late interaction heavily. (GitHub)

3) Interference between visual sanitization (blur/denoise) and OCR-based injection detection

3.1 Why this interference is real

Aggressive blur reduces high-frequency detail, which is exactly what OCR needs. At the same time, blur can reduce adversarial pixel perturbations.

RIO-Bench formalizes the deeper problem: systems must learn when to “read” vs “ignore” text, not always one or the other. (arXiv)

3.2 The correct systems solution: multi-view branching, not compromise

Use two parallel views, always:

  • View A (raw): optimized for OCR and typography forensics.
  • View B (sanitized): optimized for embedding robustness and semantic perception.

Then add a consistency check across them:

  • If OCR text changes drastically between raw and sanitized, that is itself a risk signal.
  • If text appears only after resizing (model-view), treat as high risk, because scaling attacks explicitly exploit that. (The Trail of Bits Blog)

3.3 A balancing policy that is deterministic

Examples of clean decision rules:

  • If unicode_risk high OR directive_density high, ignore OCR text as instruction source and restrict tools.
  • If OCR_confidence low AND gap_roi high, fail closed for tool actions.
  • If model-view reveals new text (not present in raw), treat as suspected injection and refuse or require human confirmation.

This is exactly the “instructions vs data” separation OWASP emphasizes, implemented mechanically. (OWASP Gen AI Security Project)
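
A sketch of the “text appears only in the model-view” check, assuming Pillow and pytesseract; the resize target and interpolation are placeholders for whatever preprocessing your LVLM pipeline actually applies:

from PIL import Image
import pytesseract

def words_only_in_model_view(image_path: str,
                             model_size: tuple[int, int] = (336, 336)) -> set[str]:
    """Words that OCR finds in the resized 'model-view' but not in the raw image.
    A non-empty result is a strong scaling-injection indicator, since legitimate
    text should be readable in both views."""
    raw = Image.open(image_path).convert("RGB")
    # Mimic the LVLM's own preprocessing as closely as possible.
    model_view = raw.resize(model_size, Image.Resampling.BILINEAR)
    raw_words = {w.lower() for w in pytesseract.image_to_string(raw).split()}
    view_words = {w.lower() for w in pytesseract.image_to_string(model_view).split()}
    return view_words - raw_words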

3.4 Practical pitfall: resampling is part of the attack surface

Trail of Bits’ scaling attack writeup matters for LVLMs because the “model-view” is often a resized image. If you do not analyze what the model actually sees, you miss entire classes of hidden text. (The Trail of Bits Blog)


4) Cost of provable robustness: MIS filtering at 10M+ with Vespa phased ranking

4.1 The key observation

MIS-style filtering is only tractable if you do it on top-N retrieved candidates, not on the full corpus.

ReliabilityRAG’s core idea is MIS on a contradiction graph plus reliability priors, with provable robustness under assumptions. (arXiv)
Your systems question is: how to keep this within p95 latency budgets.

4.2 Latency anatomy of “MIS filtering”

For a candidate set size N:

  • Building a full contradiction graph is O(N^2) edge decisions.
  • MIS is NP-hard in general, so you use greedy or specialized variants.

So the Pareto knobs are:

  1. N (candidate set size)
  2. Edge budget (how many pairs you actually score)
  3. Edge scorer cost (rules vs small NLI vs cross-encoder)

4.3 A Vespa-native way to implement the Pareto frontier

Vespa gives you two mechanisms to control cost explicitly:

A) Phased ranking with rerank-count
Cap how many hits the second and global ranking phases re-score, so the expensive contradiction and risk logic only ever touches a bounded candidate set.

B) ONNX inside ranking
Run small risk models (poisoning risk, contradiction likelihood) inside the ranking pipeline. (Vespa documentation)

A practical architecture:

  1. Retrieve top N0 (example 200–1000) with cheap scoring.
  2. Second-phase rerank to N (example 50–200) using reliability priors + cheap risk.
  3. Global-phase computes the heavier contradiction edges only among the final N (or even a pruned subset), then selects the consistent set.

This matches Vespa’s intended scaling model: spend compute only on the best hits. (Vespa documentation)

4.4 What “provable” costs you in production

Two practical warnings:

Warning 1: the guarantee depends on edge precision
ReliabilityRAG assumes contradiction edges reflect truth sufficiently well. If your edge model is noisy, you can drop correct evidence. (arXiv)
So you should bias toward high precision, even if recall is lower. That usually means conservative rules plus selective heavier verification for only borderline pairs.

Warning 2: fail-closed increases over-refusal risk
Fail-closed is correct for tool execution, but it can create user-visible refusal spikes. OR-Bench exists because over-refusal is now a measured failure mode. (arXiv)
So your policy should distinguish:

  • “refuse tool/action”
  • “still answer safely in text, but with limited claims”

5) “Similar cases, issues online” that directly map to your design

These are the most relevant real-world-adjacent cases to study and replay in your harness:

  1. Hidden prompts in PDFs and structured documents
  • PhantomLint is explicitly about principled detection of hidden prompts in PDF/HTML. (arXiv)
  • Peer-review prompt injection incidents have been studied, including real manuscript cases. (arXiv)
  2. Image preprocessing attacks (scaling)
  • Trail of Bits documents “weaponizing image scaling” and provides Anamorpher tooling. (The Trail of Bits Blog)
  3. Unicode invisibles and confusables
  • Variation selector jailbreaks show “looks identical” can still tokenize differently, and code is available. (arXiv)
  • Unicode UTS #39 and ICU SpoofChecker provide the standard detection mechanisms. (Unicode)
  4. Prompt injection as a top-tier system risk
  • OWASP GenAI risk taxonomy frames prompt injection as a primary application risk. (OWASP Gen AI Security Project)
  • This matters because your orchestrator is ultimately a “confused deputy” defense.

6) Curated high-value resources to build on (papers, tools, docs)

Benchmarks and papers

  • REST / REST+ (cross-modal inconsistency, modality gap correlation). (arXiv)
  • RIO-Bench (selective reading vs ignoring text, typographic attacks). (arXiv)
  • ReliabilityRAG (MIS + reliability priors, provable robustness framing). (arXiv)
  • PhantomLint (hidden prompt detection for PDF/HTML). (arXiv)
  • Imperceptible jailbreaking via Unicode variation selectors (and released code). (arXiv)

Tooling and standards

  • Unicode UTS #39 confusables and security mechanisms. (Unicode)
  • ICU SpoofChecker API reference (practical confusable detection primitives). (Unicode Consortium)
  • Anamorpher (open-source image scaling attack tool) + writeup. (GitHub)

Retrieval infrastructure docs

  • Vespa phased ranking and ONNX-in-ranking documentation. (Vespa documentation)
  • Qdrant Query API and hybrid search documentation. (Qdrant)


7) Concrete suggestions for your implementation path

  1. Implement Sdiss as an ensemble with a confidence channel.
    Do not bet on one “manifold divergence” statistic. Use global gap, ROI gap, Unicode risk, directive-shape, and view-instability. Anchor calibration with REST/REST+ and RIO-Bench. (arXiv)

  2. Add “model-view” analysis as mandatory.
    Scaling attacks prove the model’s resize pipeline is part of the threat surface. (The Trail of Bits Blog)

  3. Make outcome-weighted penalties conditional.
    Only activate cluster penalties when upstream signals indicate risk. This preserves benign near-duplicate recall.

  4. Do MIS-like filtering only on a bounded top-N with explicit budgets.
    Use Vespa rerank-count to force predictable compute. (Vespa documentation)

  5. Separate “refuse tool/action” from “refuse answer.”
    This reduces over-refusal while staying fail-closed for execution. OR-Bench is a useful regression target. (arXiv)


Summary bullets

  • Sdiss should be an ensemble risk score, not a single divergence metric, because benign cross-modal inconsistency is real and measurable. (arXiv)
  • Add Unicode confusable checks and model-view resize checks because real attacks exploit invisibles and preprocessing. (Unicode)
  • Outcome-weighted penalties should be conditional on risk signals to preserve near-duplicate recall.
  • MIS filtering is practical only on top-N, and Vespa phased ranking gives explicit cost control via rerank-count and ONNX-in-ranking. (Vespa documentation)