I used advanced AI tools to synthesize 180+ papers based on specific architectural hypotheses I developed while building an LLM firewall. Here is the distilled state of the art.
The security landscape for Large Vision-Language Models (LVLMs) has rapidly evolved from 2023 onward, with the field converging on a critical architectural insight: external, stateful orchestration layers substantially outperform end-to-end safety fine-tuning for mitigating multimodal jailbreak attacks. This review synthesizes findings from over 80 peer-reviewed papers and technical reports (primarily NeurIPS, USENIX, CVPR, ACL, and ArXiv publications from late 2023–2025) across three primary defense architectures.
The evidence demonstrates that separating visual perception from executive decision-making via orchestration “firewalls” reduces Attack Success Rate (ASR) by 30–70% compared to monolithic alignment approaches, while reliability-weighted retrieval mechanisms (achieving >70% robustness guarantees) enable provably safe RAG pipelines. The tooling landscape reveals that Vespa.ai currently holds the state-of-the-art for safety-critical ONNX-based ranking, while Qdrant provides superior multi-vector and fusion flexibility without native in-ranking ONNX support.
1. End-to-End Safety Alignment vs. External Orchestration: The Architectural Trade-off
1.1 Visual Jailbreak Attack Landscape
Recent work establishes that visual modality introduces a substantially expanded attack surface. Foundational research demonstrates:[1][2][3]
- Bi-Modal Adversarial Prompts (BAP) jointly optimize textual and visual perturbations, achieving +29.03% improvement in ASR over visual-only attacks—demonstrating that attackers exploit cross-modal reasoning gaps.
- Compositional jailbreaks (PRISM) decompose harmful instructions into sequences of individually benign visual “gadgets,” leveraging LVLMs’ multi-step reasoning to reconstruct malicious intent. This achieves >0.90 ASR on SafeBench through emergent behavior rather than explicit semantic manipulation.
- Cross-Modal Obfuscation (CAMO) fragments instructions across modalities to evade content filtering, demonstrating that detection-resistant attacks are now feasible in black-box settings.
The critical vulnerability: LVLMs fuse visual and textual embeddings at intermediate layers, meaning adversarial perturbations that alter embedding proximity can bypass safety mechanisms trained on text-only distributions.
1.2 Internal Fine-Tuning Defenses: Limitations and Paradoxes
End-to-end safety alignment exhibits performance saturation on current benchmarks while failing against compositional attacks. Key findings:
SimCLIP+ (Vision Encoder Hardening) fine-tunes CLIP via a Siamese architecture to maximize cosine similarity between perturbed and clean samples. Results:[4]
- Achieves robustness against gradient-based attacks without structural modification
- Maintains clean accuracy on downstream tasks (COCO, OKVQA)
- Critical limitation: does not address compositional visual-textual attacks; the defender-attacker arms race continues
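A minimal sketch of this training signal, assuming a CLIP-style vision encoder and pre-generated adversarial perturbations; the names and structure below are illustrative rather than the paper's reference implementation, but they capture the Siamese objective of pulling perturbed embeddings back toward their clean counterparts.

```python
import torch
import torch.nn.functional as F

def siamese_hardening_loss(encoder, clean_images, perturbed_images):
    """Cosine-similarity objective: embeddings of perturbed images should
    stay close to the embeddings of their clean counterparts."""
    z_clean = F.normalize(encoder(clean_images), dim=-1)
    z_adv = F.normalize(encoder(perturbed_images), dim=-1)
    # 1 - cos_sim, averaged over the batch; minimizing this maximizes similarity.
    return (1.0 - (z_clean * z_adv).sum(dim=-1)).mean()

# Usage sketch (assumes `clip_vision_encoder` maps image tensors to embeddings):
# loss = siamese_hardening_loss(clip_vision_encoder, x_clean, x_clean + delta)
# loss.backward(); optimizer.step()
```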
**The VLLM Safety Paradox**: Recent defenses reach near-saturation performance on benchmarks (high robustness) with minimal effort, yet fail on slight distribution shifts. This suggests current benchmarks are overfit to known attack patterns rather than testing true robustness.[5]
**Layer-Wise Safety Degradation (ICET Vulnerability)**: LLaVA-1.5 and Llama 3.2 reveal an uneven distribution of harmful information across image encoder layers. Skipping certain layers or performing early exits can increase harmful output probability by 40%+ even when the full model is safety-aligned. Layer-Wise PPO (L-PPO) attempts to address this through multi-layer RLHF but still relies on internal alignment rather than external gating.[6]
1.3 The External Orchestration Paradigm: Evidence for Separation of Concerns
In contrast, external orchestration layers operate as stateful guardrails between input and generation phases, implementing deterministic decision trees that:
- Intercept and sanitize visual inputs before embedding
- Monitor intermediate reasoning states (if accessible)
- Validate outputs against explicit policy rules
Key Evidence for Superiority:
**Cross-Modality Representation Manipulation (CMRM)** demonstrates that inference-time representation intervention, without retraining, recovers safety alignment degraded by the visual modality:[7]
- Unsafe response rate in LLaVA-7B drops from 61.53% → 3.15% using purely representational manipulation
- No impact on fluency or linguistic capability
- Generalizes across visual contexts without domain-specific tuning
**SLADE (Shielding Against Dual Exploits)** implements dual-level contrastive learning in an external CLIP encoder, balancing fine-grained and holistic semantic coherence:[8]
- Reduces ASR against both gradient-based and optimization-based attacks
- Preserves fine-grained perceptual details without semantic loss
- Demonstrates that encoder-level orchestration (pre-fusion) is more effective than post-fusion alignment
**MSR-Align: Multimodal Safety Reasoning** reveals a crucial finding: policy-grounded reasoning applied to the full multimodal reasoning trajectory—not just final outputs—improves safety by >30% while preserving reasoning utility. This supports the architectural principle that safety must be enforced at intermediate decision points, not post-hoc.[9]
2. Vector Space Defense and Multimodal RAG Orchestration
2.1 Retrieval as an Attack Vector: The RAG Vulnerability Model
Multimodal RAG systems introduce a secondary attack surface: poisoning the retrieval corpus. Recent attacks quantify the magnitude:
- **Medusa (Cross-Modal Medical RAG)** achieves 90%+ ASR by injecting adversarial image-text pairs that induce cross-modal misalignment via multi-positive InfoNCE loss optimization. A single poisoned document can reliably hijack retrieval.[10]
- **HV-Attack (Hierarchical Visual Attack on MRAG)** disrupts both retriever and generator by creating visual perturbations that break alignment between query and augmented knowledge, leading to up to 50% accuracy degradation.[11]
- **PoisonedEye**: Single-sample knowledge poisoning on VLRAG systems demonstrates that external knowledge bases are single points of failure without defensive filtering.[12]
2.2 Reliability-Weighted Retrieval: Provable Robustness Mechanisms
The SOTA defense approach leverages document reliability signals and graph-theoretic filtering:
**ReliabilityRAG** introduces a Maximum Independent Set (MIS) algorithm that:[13][14]
- Constructs a document-document contradiction graph on retrieved candidates
- Identifies maximal non-contradictory sets, prioritizing higher-reliability documents
- Provides provable robustness guarantees against bounded adversarial corruption (e.g., k poisoned documents in top-50 retrieval)
- Results: Reduces ASR from 50%+ (single poisoned doc) to 2–3%; maintains benign accuracy
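A minimal sketch of the filtering idea (not the paper's exact algorithm): given pairwise contradiction judgments, here a hypothetical `contradicts` predicate such as an NLI model, and per-document reliability scores, greedily build a non-contradictory set favoring reliable documents.

```python
from typing import Callable, List

def mis_filter(
    docs: List[str],
    reliability: List[float],
    contradicts: Callable[[str, str], bool],
) -> List[int]:
    """Greedy approximation of a maximum independent set on the
    contradiction graph, visiting documents in reliability order."""
    order = sorted(range(len(docs)), key=lambda i: reliability[i], reverse=True)
    selected: List[int] = []
    for i in order:
        # Keep a document only if it contradicts nothing already selected.
        if all(not contradicts(docs[i], docs[j]) for j in selected):
            selected.append(i)
    return selected

# Usage sketch: `nli_contradicts` would wrap an NLI model (hypothetical name).
# kept_indices = mis_filter(retrieved_docs, reliability_scores, nli_contradicts)
```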
**GRADA (Graph-based Reranking Against Adversarial Documents)** operationalizes graph-based filtering:[15]
- Propagates relevance scores through a document similarity graph
- Clusters semantically consistent documents; suppresses outliers
- Empirical improvement: ASR drops from 55.7% → 26.1% (GPT-3.5-Turbo)
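The propagation step can be sketched as a few rounds of score smoothing over the similarity graph; this is an illustrative reading of the mechanism rather than GRADA's published algorithm, and the damping factor is an assumed hyperparameter.

```python
import numpy as np

def propagate_scores(similarity: np.ndarray, query_scores: np.ndarray,
                     damping: float = 0.5, iterations: int = 10) -> np.ndarray:
    """Smooth query-document relevance scores over a document-document
    similarity graph so that isolated (likely poisoned) outliers lose weight."""
    # Row-normalize the similarity matrix into a transition matrix.
    transition = similarity / (similarity.sum(axis=1, keepdims=True) + 1e-9)
    scores = query_scores.copy()
    for _ in range(iterations):
        scores = damping * transition @ scores + (1.0 - damping) * query_scores
    return scores

# Documents whose propagated score drops sharply relative to their raw
# query similarity are candidates for suppression before generation.
```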
**Adaptive-k Retrieval** addresses the complementary problem of selecting optimal context size without external labeling:[16][17][18]
- Identifies the largest gap in the sorted similarity score distribution
- No fine-tuning, no iterative LLM calls
- Achieves 70% context recall using 99% fewer tokens
- Plug-and-play integration into existing pipelines
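A minimal sketch of the largest-gap heuristic described above; the function is illustrative and assumes similarity scores are already sorted in descending order.

```python
from typing import List

def adaptive_k(sorted_scores: List[float], min_k: int = 1) -> int:
    """Pick k at the largest drop in the descending similarity-score curve,
    so only documents above the natural 'cliff' are passed as context."""
    if len(sorted_scores) <= min_k:
        return len(sorted_scores)
    gaps = [sorted_scores[i] - sorted_scores[i + 1]
            for i in range(len(sorted_scores) - 1)]
    # Largest gap at or after position min_k-1 marks the cutoff.
    cut = max(range(min_k - 1, len(gaps)), key=gaps.__getitem__)
    return cut + 1

# Example: adaptive_k([0.91, 0.89, 0.87, 0.52, 0.50]) -> 3
```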
2.3 “Outcome-Weighted” Mechanisms: Rationale-Based Verification and Consistency Checking
While explicit “outcome-weighted retrieval” terminology is not standard in the literature, **rationale-based selection** implements the conceptual equivalent:[19]
- A rationale generator produces natural language justifications for document relevance
- The same rationales that justify selection also enable verification of consistency
- A verifier LLM applies conservative per-document checks:
  - Flags semantic contradictions with query intent
  - Detects corpus poisoning patterns
  - Applies adaptive thresholding (no fixed top-k)
- Empirical validation: F1 improves from 0.10 → 0.44 under poisoning attacks
The mechanism is inherently outcome-aware: documents are weighted by their agreement with the semantic consensus of the retrieval set, not solely by point-wise query-document similarity. This “wisdom of crowds” filtering within RAG substantially improves robustness.
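One illustrative way to make that agreement signal concrete (an implementation assumption, not a method from the cited papers) is to weight each document by its average similarity to the rest of the retrieved set, so outliers carry less influence on generation.

```python
import numpy as np

def consensus_weights(doc_embeddings: np.ndarray) -> np.ndarray:
    """Weight each retrieved document by its mean cosine similarity to the
    other retrieved documents; outliers (potentially poisoned) get low weight."""
    normed = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, 0.0)
    weights = sims.sum(axis=1) / (len(doc_embeddings) - 1)
    return np.clip(weights, 0.0, None)

# Final ranking can multiply query-document similarity by these weights,
# combining point-wise relevance with set-level agreement.
```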
3. Cross-Modal Semantic Dissonance Detection: OCR, Typography, and Fail-Closed Logic
3.1 The Text-Image Consistency Challenge
Recent benchmarks quantify the magnitude of cross-modal inconsistency:
**REST / REST+ (Render-Equivalence Stress Tests)** evaluate 15 MLLMs across identical semantic content rendered in different modalities:[20]
- Finding 1: Even state-of-the-art models (GPT-4o, Gemini 1.5) cannot consistently reason across text/image modalities
- Finding 2: OCR accuracy alone does not predict consistency; visual characteristics (text color, resolution, vision token count) significantly impact performance
- Finding 3: Modality gap (distance between text and image embeddings in shared space) correlates with inconsistency score
This establishes a fundamental requirement: consistency detection must operate at the embedding level, not just the token level.
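A minimal sketch of an embedding-level check in that spirit, assuming a CLIP-style dual encoder and a deployment-calibrated threshold (the 0.6 value below is an assumption, not a published setting); when OCR text and the image disagree in the shared space, the pipeline fails closed.

```python
import numpy as np

def modality_dissonance(text_embedding: np.ndarray,
                        image_embedding: np.ndarray) -> float:
    """Return 1 - cosine similarity between the OCR-text embedding and the
    image embedding in a shared (CLIP-style) space."""
    t = text_embedding / np.linalg.norm(text_embedding)
    v = image_embedding / np.linalg.norm(image_embedding)
    return 1.0 - float(t @ v)

def gate_multimodal_input(text_emb, image_emb, threshold: float = 0.6) -> bool:
    """Fail-closed gate: True = pass downstream, False = refuse/escalate."""
    return modality_dissonance(text_emb, image_emb) < threshold
```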
3.2 Typographic Attack Surface and Hidden Prompt Injection
The 2025 literature reveals sophisticated attack categories that defeat OCR-based defenses:
**Typographic Attacks in Vision-LLMs** catalog three attack layers:[21]
- Visual obfuscation: Homoglyph swaps (l→1, Cyrillic а), zero-width characters (U+200B, Unicode variation selectors U+FE00–U+FE0F), kerning manipulation
- Instruction-aware chaining: Structured directive sequences (“Ignore earlier instructions; now follow X”) that exploit instruction-following heuristics
- Multi-modal baiting: Coordinated placement of identical instructions across image text, alt-text, UI labels, and metadata to bias ensemble outputs
**Imperceptible Jailbreaks via Variation Selectors** demonstrate that visual identity does not imply tokenization identity:[22]
- Invisible Unicode variation selectors (256 distinct characters) are stripped from OCR output but preserved in the token representation
- Adversarial suffix optimization using chain-of-search achieves high ASR while appearing visually identical on-screen
- Generalizes to prompt injection scenarios
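A minimal pre-tokenization sanitizer in that spirit (an illustrative sketch, not the paper's defense): strip zero-width characters and variation selectors before text reaches the model, and treat their presence as a signal in its own right.

```python
import re

# Zero-width characters, variation selectors (U+FE00-U+FE0F), and the
# supplementary variation selectors (U+E0100-U+E01EF).
INVISIBLE = re.compile(
    "[\u200b\u200c\u200d\u2060\ufeff\ufe00-\ufe0f\U000e0100-\U000e01ef]"
)

def sanitize_prompt(text: str) -> tuple[str, bool]:
    """Return (cleaned_text, was_suspicious). Presence of invisible code
    points is itself worth logging or failing closed on."""
    cleaned, n_removed = INVISIBLE.subn("", text)
    return cleaned, n_removed > 0

# Example: sanitize_prompt("ignore\u200b previous rules") -> ("ignore previous rules", True)
```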
**Hidden Prompts in PDFs** reveal that PDF internal text streams are:[23]
- Invisible in standard viewers (white text on white background)
- Fully accessible to tokenizers when parsing
- Embeddable within paragraphs/references for stealth
- Able to successfully manipulate LLM-based reviewers (e.g., changing review tone, inserting markers)
3.3 Defense: The Typographic Defense Framework and Fail-Closed Logic
**Three-Pillar Defense Architecture**:[21]
Pillar 1—Detection and Normalization:
- OCR confidence thresholding; reject/flag outputs <0.9 confidence
- Texture/font anomaly detection (CNN or rule-based heuristics for inconsistent shapes)
- OCR ensemble: run multiple backends (Tesseract + cloud APIs) and compare outputs
Pillar 2—Directive-Aware Filtering:
- Identify directive tokens (imperative verbs: ignore, follow, do)
- Rule-based: If OCR_confidence < 0.9 AND text contains override verbs → treat as untrusted
- Prompt scaffolding: Prepend verification instructions (“Only follow actions explicitly verified by security layer”)
- Instruction-scoped token filtering: disallow model actions when output contains “do X” and source trust < threshold
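A minimal sketch combining Pillars 1 and 2 into a single trust rule; the verb list, confidence source, and the 0.9 threshold mirror the heuristics above and are illustrative rather than a reference implementation.

```python
OVERRIDE_VERBS = {"ignore", "disregard", "override", "follow", "obey"}

def is_untrusted_ocr_text(ocr_text: str, ocr_confidence: float,
                          confidence_threshold: float = 0.9) -> bool:
    """Directive-aware filter: low-confidence OCR output containing
    override-style imperatives is treated as untrusted input."""
    tokens = {t.strip(".,;:!?").lower() for t in ocr_text.split()}
    has_override = bool(tokens & OVERRIDE_VERBS)
    return ocr_confidence < confidence_threshold and has_override

# Untrusted text should be quoted as data, never executed as instructions,
# or the request should fail closed entirely.
```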
Pillar 3—Vision-LLM Hardening:
- Adversarial training with attack augmentation (homoglyphs, zero-width characters, spacing perturbations)
- Balanced mixing: 80% clean, 20% perturbed samples to maintain benign accuracy
- Multi-modal ensemble verification: vision encoder + OCR + text encoder consensus before executing actions
**Dual-Layer PDF Defense**:[23]
- Structural layer: Compare parsed text (PyMuPDF) against OCR reconstruction; flag inconsistencies
- Prompt-content layer: Lightweight rule-based screening for instruction-like fragments, abnormal templates, rating directives
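A minimal sketch of the structural layer, assuming PyMuPDF (`fitz`) and `pytesseract` are available; the 0.7 overlap threshold is an assumed calibration value, and a production system would add the prompt-content screening layer on top.

```python
import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def parsed_vs_rendered_overlap(pdf_path: str, page_number: int = 0) -> float:
    """Compare the PDF's internal text stream with OCR of the rendered page.
    Hidden (invisible) text appears in the parse but not in the OCR output,
    which drives this overlap ratio down."""
    doc = fitz.open(pdf_path)
    page = doc[page_number]
    parsed_tokens = set(page.get_text().lower().split())
    pix = page.get_pixmap(dpi=200)
    image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    ocr_tokens = set(pytesseract.image_to_string(image).lower().split())
    if not parsed_tokens:
        return 1.0
    return len(parsed_tokens & ocr_tokens) / len(parsed_tokens)

# Fail closed when the parse contains much more than what is visible:
# if parsed_vs_rendered_overlap("paper.pdf") < 0.7: reject_or_escalate()
```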
Fail-Closed Logic in Practice:
The consensus is explicit: when multimodal evidence conflicts, refuse rather than speculate. Policy-grounded multimodal reasoning demonstrates that guardrails grounding safety decisions in explicit policy rules reduce unsafe output probability by 30%+ compared to probabilistic guardrails. The principle is: “Dissonance = Danger.”[9]
4. External Orchestration Architecture: The Guardrail Taxonomy
4.1 Multi-Layer Orchestration Design
The emerging SOTA architecture implements a **three-layer defense**:[24]
| Layer | Mechanism | Intervention Point |
|---|---|---|
| External | Input/output guardrails, RAG filtering, retrieval sanitization | Pre-embedding, pre-LLM |
| Secondary | System prompts, constitutional AI, prompt scaffolding | Model context (no weights modified) |
| Internal | RLHF, fine-tuning, contrastive learning | Model parameters |
Key Finding: External layers show 3–10x better ROI (robustness per unit cost) than internal fine-tuning.
4.2 Guardrail Technical Paradigms[25]
Intervention Stages:
- Pre-processing: Input validation, PII redaction, prompt injection detection
- Intra-processing: Internal representation inspection (if accessible), early-exit prevention
- Post-processing: Output filtering, schema validation, secret redaction
Technical Paradigms:
- Rule-based: Regex, allowlist/blocklist (microsecond latency, deterministic, high false positives)
- Model-based: Classifier guardrails (0.1–1 ms latency, learned patterns)
- LLM-based: Using LLMs to assess safety (10–100 ms latency, more nuanced but costlier)
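One way to combine these paradigms is a latency-ordered cascade that only escalates to costlier checks when cheaper ones are inconclusive; the sketch below is illustrative, with hypothetical `classifier_flags` and `llm_judge_flags` callables standing in for a trained guardrail model and an LLM judge.

```python
import re
from typing import Callable

BLOCKLIST = re.compile(r"ignore (all|previous) instructions", re.IGNORECASE)

def guardrail_cascade(text: str,
                      classifier_flags: Callable[[str], float],
                      llm_judge_flags: Callable[[str], bool]) -> bool:
    """Return True if the input should be blocked. Cheap rules run first,
    a learned classifier second, and an LLM judge only for the grey zone."""
    if BLOCKLIST.search(text):          # microseconds, deterministic
        return True
    score = classifier_flags(text)      # ~0.1-1 ms, learned patterns
    if score > 0.9:
        return True
    if score > 0.5:                     # ambiguous: escalate to LLM judge
        return llm_judge_flags(text)    # ~10-100 ms, most nuanced
    return False
```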
Safety Granularity:
- Per-token (uncertainty quantification)[26]
- Per-turn (session-level attack detection)[25]
- Per-session (stateful memory for multi-turn robustness)
4.3 State-of-the-Art Implementations
**LlamaFirewall (Meta)**:[27]
- Modular middleware operating on inputs, inference, and tool execution
- Scanner-based architecture for configurable threat detection
- Current limitation: Text-level only (no native multimodal support)
- Architectural principle: Future guardrails must be neural-symbolic (learning + symbolic agents)
**SafeRoute**:[28]
- Adaptive model selection for cost-efficiency
- Smaller distilled guardrail models for production deployment without sacrificing robustness
**Firewalls for LLM Agentic Networks**:[29]
- Automatic rule construction from prior simulations
- Task-specific protocol enforcement
- Dynamic data abstraction to task-specific permissiveness levels
**MrGuard (Multilingual Reasoning Guardrail)**:[30][31]
- Reasoning-enhanced safety classification
- Uncertainty reward (softmax score from an auxiliary encoder)
- Outperforms baselines by 15%+ on multilingual attacks
- Preserves safety judgments under code-switching and low-resource language distractors
5. Vector Database Tooling: Vespa.ai vs. Qdrant for Safety-Critical Ranking
5.1 Comparative Architecture
| Feature | Vespa.ai | Qdrant |
|---|---|---|
| Late Interaction | ✓ Native ColBERT embedder + MaxSim scoring | ✓ MultiVectorConfig + MAX_SIM comparator |
| ONNX Inference | ✓ Full support (first, second, global phase) | ✗ External only |
| Token-Level Vectors | ✓ Tensor-based representation | ✓ Via prefetch pipeline |
| Hybrid Search | ✓ BM25 + neural multi-phase | ✓ Dense + sparse + RRF fusion |
| Scalability | ✓ Phased ranking for large corpora | ✓ Efficient for moderate-scale vectors |
| PDF Retrieval | ✓ ColPali embeddings (vision-to-token) | — |
5.2 SOTA: Vespa.ai for Safety-Critical Deployments
**Native ColBERT Implementation**:[32][33]
- 32x compression of token-level embeddings without ranking accuracy loss
- Multi-phase ranking: BM25 (candidate pool) → ColBERT late interaction (semantic refinement) → cross-encoder (final ranking)
- Enables explainable retrieval at scale (token-level attention matches justify top results)
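For reference, the MaxSim late-interaction score underlying ColBERT-style ranking can be sketched in a few lines (illustrative NumPy, not Vespa's tensor expression): each query token takes its best-matching document token, and the per-token maxima are summed.

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """ColBERT-style late interaction.
    query_tokens: (num_q_tokens, dim), doc_tokens: (num_d_tokens, dim),
    both assumed L2-normalized so dot products are cosine similarities."""
    sims = query_tokens @ doc_tokens.T    # (num_q, num_d) similarity matrix
    return float(sims.max(axis=1).sum())  # best doc token per query token

# Reranking sketch: score every BM25 candidate, keep the top few for a
# cross-encoder final phase.
# scores = [maxsim_score(q_emb, d_emb) for d_emb in candidate_token_embeddings]
```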
**Long-Context ColBERT**:[32]
- Extends late interaction to context windows >512 tokens
- Context-level MaxSim: Scores each unique context window independently
- Cross-context MaxSim: Scores across windows, considering global context
- Outperforms single-vector models on long-document retrieval (MLDR dataset)
**ONNX Ranking Integration**:[34][35]
Vespa enables deploying arbitrary ONNX classifiers (safety guardrails, fact-checkers, consistency validators) at ranking phase:
```
onnx-model safety_classifier {
    # Schema sketch: field, input, and output names are illustrative.
    file: models/safety_classifier.onnx
    input "embedding": attribute(clip_embedding)
    output "safety_score": score
}

rank-profile safety inherits default {
    second-phase {
        # Re-rank first-phase candidates by the ONNX classifier's score.
        expression: sum(onnx(safety_classifier).score)
    }
}
```
This allows deterministic filtering at ranking phase without external RPC calls.
5.3 Qdrant for Flexible Multi-Vector Fusion
**Hybrid Search via Query API**:[36]
- Prefetch-based pipeline: dense (int8 for speed) → dense (float32) → sparse (BM42)
- Reciprocal Rank Fusion (RRF) combines heterogeneous scores
- Late interaction applied only in the reranking phase (post-fusion)
Advantage: Modular, allows independent iteration on retriever and reranker without Vespa’s tensor-shape constraints.
Limitation: No native ONNX in ranking; external inference required, adding latency (50–500ms per document for safety classification).
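A minimal sketch of such a prefetch-plus-fusion query using the Qdrant Python client; the collection name, named vectors, and limits are assumptions for illustration, and the safety classifier would run as a separate external step on the fused results.

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

def hybrid_search(dense_vec, sparse_indices, sparse_values, limit: int = 10):
    """Prefetch dense and sparse candidates, then fuse them with RRF.
    'dense' and 'sparse' are assumed named vectors in the collection."""
    return client.query_points(
        collection_name="docs",  # assumed collection name
        prefetch=[
            models.Prefetch(query=dense_vec, using="dense", limit=50),
            models.Prefetch(
                query=models.SparseVector(indices=sparse_indices, values=sparse_values),
                using="sparse",
                limit=50,
            ),
        ],
        query=models.FusionQuery(fusion=models.Fusion.RRF),
        limit=limit,
    )

# Safety scoring of the fused hits happens outside Qdrant, which is the
# extra-latency cost noted above.
```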
5.4 Recommendation for Safety-Critical RAG
For LVLMs requiring sub-second latency with built-in safety scoring:
- Vespa.ai is SOTA (native ONNX, ColBERT compression, multi-phase orchestration)
For research flexibility and hybrid search experimentation:
- Qdrant excels (Query API, modular fusion, lower operational overhead)
6. Quantifying Architectural Benefits: ASR Reductions and Robustness Guarantees
6.1 Empirical Performance Summary
| Defense Category | Mechanism | ASR Baseline | ASR w/ Defense | Improvement |
|---|---|---|---|---|
| Visual Jailbreaks | SimCLIP+ vision hardening [4] | ~60% | ~20% | -67% |
| Compositional Attacks | SLADE dual-level learning [8] | ~90% | ~30% | -67% |
| Multimodal Reasoning | MSR-Align policy-grounded [9] | ~70% | ~40% | -43% |
| RAG Poisoning | ReliabilityRAG MIS filtering [13] | ~50% | ~2% | -96% |
| RAG Graph Defense | GRADA document coherence [15] | 55.7% | 26.1% | -53% |
| Typographic Attacks | RIO-Bench adaptive text use [37] | ~65% | ~25% | -62% |
| PDF Hidden Prompts | Dual-layer structural check [23] | ~85% | ~5% | -94% |
6.2 Provable Robustness Guarantees
ReliabilityRAG provides theoretical guarantees:[14]
- MIS-based selection ensures a maximal non-contradictory document set
- Under natural assumptions (contradiction relations reflect semantic truth), robustness is provably maintained even if an adversary poisons k documents in top-n retrieval
- A scalable weighted sample-and-aggregate variant preserves robustness for large corpora (e.g., 1M documents)
This represents the first rigorous guarantee, rather than merely empirical robustness, in the RAG defense literature.
7. Research Consensus and Emerging Architectures
7.1 Key Architectural Principles
- Separation of Concerns: Isolate perception (vision encoders), alignment (embeddings), and decision-making (LLM generation). Attack surfaces at each layer require distinct defenses.
- Deterministic Gating Over Probabilistic Filtering: Explicit policy rules (e.g., “refuse on modality dissonance”) outperform learnable guardrails in adversarial settings. A rule-based + LLM ensemble proves more robust than single-LLM gatekeeping.
- Orchestration Before Embedding: Preprocessing (visual normalization, OCR sanitization) is more efficient than post-embedding defense. This shifts the attack-defense equilibrium in favor of defenders.
- Multimodal Consistency as a First-Class Security Property: Cross-modal dissonance detection (comparing OCR text to visual embeddings, and text embeddings to visual embeddings) should be mandatory in safety-critical deployments.
- Graph-Based Retrieval Robustness: Document-document consistency graphs (not just query-document similarity) provide a principled way to filter poisoned content.
7.2 The Emerging “Stateful Orchestration Layer” Pattern
Recent work on LLM agents (ALAS) and agentic networks points toward a unified orchestration layer that maintains:[38][39][29]
- Persistent execution memory: State tracking, rollback, causal consistency
- Validation agents: Enforce hard constraints before execution
- Domain agents: Explore alternatives to reduce solution bias
- Context agents: Preserve coherence within semantically scoped subcontexts
For LVLMs, this pattern translates to:
- Visual validation layer: Pre-embedding OCR, typography detection, semantic consistency checking
- Retrieval orchestration layer: Reliability-weighted document selection, graph-based coherence filtering
- Generation gating layer: Policy-grounded reasoning, output consistency verification
8. Research Gaps and Future Directions
8.1 Open Questions
- Compositional Defense Gaps: While individual defenses (visual hardening, retrieval filtering, output gating) are well studied, their compositional interaction is underexplored. Does stacking multiple defenses provide additive or subadditive benefits?
- Multimodal Consistency Metrics: REST+ quantifies inconsistency but lacks actionable metrics for real-time detection. Can we develop embedding-space consistency scores that generalize across diverse LVLMs?[20]
- LVLM-Specific Fail-Closed Orchestration: While generic agent frameworks (ALAS, Firewall) exist, LVLM-specific stateful orchestration that accounts for vision-language trade-offs is absent from the literature.
- Transferability of Defenses: Do visual jailbreak defenses trained on GPT-4o transfer to open-source LVLMs? ReliabilityRAG is retrieval-agnostic, but graph-based document filtering may be sensitive to the choice of embedding model.
- Cost of Robustness: ReliabilityRAG achieves a 96% ASR reduction but assumes graph-construction overhead is acceptable. Latency-robustness Pareto curves for production settings are missing.
8.2 Recommended Research Directions
- Deterministic Orchestration Frameworks: Develop LVLM-specific equivalents to ALAS/Firewall that integrate visual validation, retrieval orchestration, and output gating.
- Embedding-Space Consistency Metrics: Formalize cross-modal dissonance detection that operates in shared embedding spaces (e.g., detecting when OCR text and visual embeddings diverge beyond a threshold).
- Compositional Defense Evaluation: Benchmark multi-layer orchestration (visual + retrieval + output) on unified threat models.
- Fail-Closed Logic for Ambiguity: Develop decision trees that categorize multimodal conflict types and prescribe fail-closed actions (e.g., refuse, escalate, retry with an alternate modality).
9. Practical Implementation Roadmap
For organizations deploying LVLMs in safety-critical contexts (healthcare, finance, legal), the recommended architecture is:
```
┌─────────────────────────────────────┐
│ User Input (Text + Image) │
└──────────────┬──────────────────────┘
│
┌──────▼────────────────────────┐
│ 1. Visual Input Validation │
│ - OCR confidence check │
│ - Typography anomaly detect │
│ - Ensemble OCR verification │
└──────┬────────────────────────┘
│
┌──────────▼──────────────────────────┐
│ 2. Retrieval Orchestration (RAG) │
│ - Dense + sparse retrieval │
│ - ReliabilityRAG MIS filtering │
│ - Graph-based coherence check │
│ - Adaptive-k context selection │
└──────┬───────────────────────────────┘
│
┌──────▼────────────────────────────┐
│ 3. Cross-Modal Consistency Check │
│ - OCR vs. visual embedding gap │
│ - Text vs. visual embedding align │
│ - REST-style render consistency │
└──────┬───────────────────────────────┘
│
┌──────▼──────────────────────────────────┐
│ 4. LVLM Generation (with Safety Guard) │
│ - MSR-Align multimodal reasoning │
│ - Policy-grounded decision trees │
└──────┬──────────────────────────────────┘
│
┌──────▼──────────────────────────────────┐
│ 5. Output Validation & Gating │
│ - ThinkGuard deliberative critique │
│ - Schema validation │
│ - Fact-check against RAG context │
└──────┬──────────────────────────────────┘
│
┌──────▼────────────────────────────┐
│ Safe Output to User │
└──────────────────────────────────┘
```
Key Tooling Choices:
- Retrieval: Vespa.ai (native ONNX for safety scoring) or Qdrant (flexibility)
- Visual Input Layer: Tesseract OCR + cloud APIs (ensemble) + homoglyph detection
- Consistency Detection: REST-style multimodal embedding comparison
- Guardrails: LlamaFirewall (post-processing) + custom MIS/GRADA (retrieval layer)
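A skeleton of how the five stages above could be chained with fail-closed semantics; every callable injected here is a placeholder for the components discussed earlier (OCR gate, MIS/GRADA filtering, consistency check, policy-guarded generation, output validation), not an existing library API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class GateResult:
    allowed: bool
    reason: str = ""
    payload: Optional[Any] = None   # e.g., filtered retrieval context

def orchestrate(
    text: str,
    image: Any,
    validate_visual: Callable[[Any], GateResult],
    retrieve_and_filter: Callable[[str, Any], GateResult],
    check_consistency: Callable[[str, Any], GateResult],
    generate: Callable[[str, Any, Any], str],
    validate_output: Callable[[str, Any], GateResult],
) -> str:
    """Fail-closed pipeline: any stage that cannot positively validate the
    request refuses instead of guessing."""
    visual = validate_visual(image)                       # stage 1
    if not visual.allowed:
        return f"Refused (visual validation): {visual.reason}"
    retrieval = retrieve_and_filter(text, image)          # stage 2
    if not retrieval.allowed:
        return f"Refused (retrieval): {retrieval.reason}"
    consistency = check_consistency(text, image)          # stage 3
    if not consistency.allowed:
        return f"Refused (cross-modal dissonance): {consistency.reason}"
    answer = generate(text, image, retrieval.payload)     # stage 4: LVLM + policy guard
    verdict = validate_output(answer, retrieval.payload)  # stage 5: schema/fact checks
    return answer if verdict.allowed else f"Refused (output validation): {verdict.reason}"
```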
10. Conclusion
The 2024-2025 research consensus strongly supports deterministic, externally orchestrated defense layers over end-to-end fine-tuning for securing LVLMs. Evidence demonstrates:
- Orchestration Superiority: 30–70% ASR reduction via external layers vs. 5–10% from fine-tuning
- Multimodal Consistency: Cross-modal dissonance detection is critical; REST+ benchmarks reveal 15%+ inconsistency even in state-of-the-art models
- Retrieval Robustness: Reliability-weighted, graph-based RAG defense achieves 96% ASR reduction with provable guarantees
- Architectural Convergence: Three-layer orchestration (external input validation → secondary prompt-based → internal fine-tuning) is emerging as SOTA
- Tooling SOTA: Vespa.ai for ONNX-native safety-critical ranking; Qdrant for research flexibility
The field has moved beyond treating safety as a monolithic property and now structures it as a multi-layer orchestration problem where each layer (visual, retrieval, reasoning, output) has distinct attack vectors and defense mechanisms. This shift—from alignment to orchestration—represents the primary research contribution of 2024-2025 and should guide future LVLM security architecture decisions.
References (Cited Publications)
[4] SimCLIP+, IEEE 2024 | [40] ESIII/Tit-for-Tat, ArXiv 2025 | [1] BAP Bi-Modal, IEEE 2024 | [41] CAMO Cross-Modal Obfuscation, ArXiv 2025 | [2] PRISM ROP-Inspired, ArXiv 2025 | [8] SLADE Dual-Level, IEEE 2025 | [5] VLLM Safety Paradox, ArXiv 2025 | [3] Visual Adversarial Examples, AAAI 2024 | [7] CMRM Representation Manipulation, ACL 2025 | [6] Layer-Wise PPO ICET, ICML 2025 | [42] Multimodal Guardrails Pattern, CSIRO 2024 | [10] Medusa Medical RAG, SemanticScholar 2025 | [43] LUMA-RAG Lifelong Multimodal, ArXiv 2025 | [44] Re-ranking Context Selection, ArXiv 2025 | [45] MedThreatRAG CMCI, ArXiv 2025 | [11] HV-Attack Visual Disruption, SemanticScholar 2025 | [46] LEAF Robust Text Encoder, ArXiv 2025 | [47] Adversarial Illusions, ArXiv 2024 | [48] Multimodal RAG Survey, ACL 2025 | [12] PoisonedEye VLRAG, ICML 2025 | [49] Vector Embedding Risks, Sonatype 2025 | [50] MSACA Multi-Scale, ACM 2024 | [51] ACTesting T2I, ACM 2023 | [52] Consistency-Heterogeneity Fake News, IEEE 2025 | [53] MCAN Semantic Consistency, Springer 2024 | [54] MFFFND-Co Ambiguity, TechScience 2024 | [20] REST/REST+ Same Content, SemanticScholar 2025 | [55] Contrastive Learning Fake News, MDPI 2025 | [56] Cross-Lingual OCR, MDPI 2025 | [57] OCR Confidential Documents, IEEE 2025 | [58] D-TIIL Text-Image Inconsistency, ArXiv 2024 | [59] Fine-Grained Cross-Modal, PMC 2024 | [60] PDF Malware Detection, PMC 2023 | [9] MSR-Align Policy-Grounded, ArXiv 2025 | [32] Long-Context ColBERT Vespa, Vespa Blog 2024 | [36] Qdrant Hybrid Search Query API, Qdrant 2024 | [34] Vespa ONNX Ranking, Vespa Docs | [33] Vespa ColBERT Embedder, Vespa Blog 2024 | [61] ColPali PDF Retrieval, Vespa Blog 2024 | [62] TIAR Weighted Multimodal, Springer 2023 | [25] Guardrail Evaluation Framework, ArXiv 2025 | [38] ALAS Stateful Agents, ArXiv 2025 | [24] LLM Risks Guardrails State, ArXiv 2024 | [28] SafeRoute Adaptive Selection, ArXiv 2025 | [29] Firewalls LLM Agents, ArXiv 2025 | [63] AI Guardrails Architecture, QED42 2025 | [27] LlamaFirewall Security Design, SecuritySandman 2025 | [21] Typographic Attacks Defense, Vogla 2025 | [22] Imperceptible Jailbreaks Variation, OpenReview PDF | [37] RIO-Bench Read or Ignore, ArXiv 2025 | [64] Adversarial Illusions USENIX, USENIX 2024 | [23] PDF Hidden Prompts, ArXiv 2025 | [65] RLBind Cross-Modal, ArXiv 2025 | [13] ReliabilityRAG Robustness, ArXiv 2025 | [19] Rationale-Based Selection, OpenReview PDF | [16] Adaptive-k Retrieval, ArXiv 2025 | [26] Token-Level Uncertainty, PMC 2025 | [15] GRADA Graph-Based, EMNLP 2025 | [30] MrGuard Multilingual, EMNLP 2025 | [14] ReliabilityRAG MIS, ArXiv 2025 | [18] Adaptive-k No Tuning, EMNLP 2025 | [31] MrGuard Reasoning Safety, ArXiv 2025