Retrieval Strategy for 10M Documents: Standard Dense Passage vs. LightRAG?

Hi community,

I am working on a RAG project involving 10 million text documents (living in PostgreSQL). I need to ensure high retrieval accuracy (Semantic Search).

I am torn between two approaches:

  1. Standard Hybrid Search: Using text-embedding-3-large with a Vector DB (Weaviate/Pinecone) + Reranking.

  2. Newer Architectures: Like LightRAG, which claims better context understanding but might be harder to maintain at this scale.

Has anyone benchmarked these approaches on a dataset of this size? Which stack/model combination do you recommend for a balance of performance and maintainability?

1 Like

Option 1 seems better in this case.

1 Like

For this kind of scale I would worry a bit less about finding “the one true architecture” and more about how you will test and improve whatever stack you pick.

Once you have a basic pipeline (parser → chunker → embeddings → vector/hybrid search → reranker → LLM), the big lever is how systematically you run experiments on the knobs:

  • Retriever stack and settings (BM25 vs dense vs hybrid, top_k, filters, vector DB choice like pgvector vs Pinecone or Weaviate — see the pgvector sketch after this list)
  • Chunking and indexing strategy (size, overlap, per-document-type strategies, hierarchical schemes like RAPTOR)
  • Reranker on or off and which reranker
  • Model and prompt choices (system prompt, how you format context, temperature and other sampling params)
  • Update strategy (full reindex vs incremental upserts, reusing embeddings for unchanged chunks via content signatures/hashes)
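
Since your corpus already lives in PostgreSQL, the cheapest way to A/B the dense-vs-hybrid knob is often pgvector plus the built-in full-text search, fused with reciprocal rank fusion. A minimal sketch, assuming a hypothetical `chunks` table with an `embedding vector(...)` column and a `content_tsv` tsvector column (psycopg 3; names are placeholders):

```python
# Hybrid retrieval sketch: dense (pgvector) + keyword (Postgres FTS),
# fused with reciprocal rank fusion (RRF). Table/column names are hypothetical.
import psycopg  # psycopg 3

HYBRID_SQL = """
WITH dense AS (
    SELECT id,
           ROW_NUMBER() OVER (ORDER BY embedding <=> %(qvec)s::vector) AS rnk
    FROM chunks
    ORDER BY embedding <=> %(qvec)s::vector
    LIMIT %(k)s
),
keyword AS (
    SELECT id,
           ROW_NUMBER() OVER (
               ORDER BY ts_rank_cd(content_tsv, plainto_tsquery('english', %(q)s)) DESC
           ) AS rnk
    FROM chunks
    WHERE content_tsv @@ plainto_tsquery('english', %(q)s)
    ORDER BY ts_rank_cd(content_tsv, plainto_tsquery('english', %(q)s)) DESC
    LIMIT %(k)s
)
SELECT id,
       COALESCE(1.0 / (60 + d.rnk), 0) + COALESCE(1.0 / (60 + kw.rnk), 0) AS rrf_score
FROM dense d
FULL OUTER JOIN keyword kw USING (id)
ORDER BY rrf_score DESC
LIMIT %(k)s;
"""

def hybrid_search(conn, query_text, query_vec, k=20):
    """Rank chunk ids by RRF over dense (pgvector) and keyword (FTS) retrieval."""
    qvec = "[" + ",".join(f"{x:.6f}" for x in query_vec) + "]"  # pgvector literal format
    with conn.cursor() as cur:
        cur.execute(HYBRID_SQL, {"q": query_text, "qvec": qvec, "k": k})
        return cur.fetchall()  # [(chunk_id, rrf_score), ...]
```

At 10M rows you would also want an HNSW or IVFFlat index on the embedding column and a GIN index on the tsvector column, otherwise both branches degrade to sequential scans.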

The useful pattern is to lock in a representative eval set, then sweep combinations of these knobs and look at retrieval quality and answer quality side by side, instead of making one-off tweaks.
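
To make that pattern concrete, here is a minimal sketch of such a sweep, where `build_index`, `retrieve` and `answer` stand in for your own pipeline functions (all names hypothetical) and the eval set is a small hand-labelled list of questions with known relevant chunk ids:

```python
# Knob-sweep sketch: enumerate configs, score each against the same locked eval set.
# build_index / retrieve / answer are placeholders for your own pipeline functions.
from itertools import product

KNOBS = {
    "chunk_size":    [256, 512, 1024],
    "chunk_overlap": [0, 64],
    "retriever":     ["dense", "bm25", "hybrid"],
    "top_k":         [5, 10, 20],
    "reranker":      [None, "cross-encoder"],
}

def recall_at_k(retrieved_ids, relevant_ids):
    """Fraction of labelled relevant chunks that show up in the retrieved set."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids) & set(relevant_ids)) / len(relevant_ids)

def run_sweep(eval_set, build_index, retrieve, answer):
    """Score every knob combination on retrieval quality; collect answers for judging."""
    results = []
    for values in product(*KNOBS.values()):
        cfg = dict(zip(KNOBS, values))
        index = build_index(cfg)  # chunk + embed + index; cache per chunking setting in practice
        recalls, answers = [], []
        for item in eval_set:  # item = {"question": str, "relevant_ids": [...]}
            hits = retrieve(index, item["question"], cfg)        # list of (chunk_id, text) pairs
            recalls.append(recall_at_k([h[0] for h in hits], item["relevant_ids"]))
            answers.append(answer(item["question"], hits, cfg))  # judge these with RAGAS/TruLens
        results.append({**cfg, "recall": sum(recalls) / len(recalls), "answers": answers})
    return sorted(results, key=lambda r: r["recall"], reverse=True)
```

The full grid here is already 108 configs, so in practice you cache indexes per chunking setting and prune weak combinations early — which is exactly the loop the tools below help automate.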

Some tools that help with that experimentation and optimization loop:

  • RapidFire AI RAG (open source) – experiment execution framework focused on RAG. Lets you declare chunking, retrieval, reranking and prompting options as knobs, run many configs in parallel, and compare RAG metrics across them. GitHub: https://github.com/RapidFireAI/rapidfireai
  • RAGAS – library of RAG evaluation metrics such as faithfulness, answer relevance, context precision and context recall, with integrations into common RAG stacks (a minimal usage sketch follows this list).
  • TruLens – open source eval and tracing with the “RAG triad” of context relevance, groundedness and answer relevance, plus other feedback functions.
  • LangSmith – dataset based evaluation and tracing for LangChain apps, including a tutorial specifically on evaluating RAG systems.
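
For a sense of what that evaluation layer looks like in code, here is a minimal RAGAS-style sketch, assuming the classic `ragas.evaluate` interface (the API has shifted between versions, so check the current docs); in practice the questions, contexts and answers would come from the sweep above rather than being hard-coded:

```python
# Minimal RAGAS sketch. API and column names may differ between versions;
# evaluate() needs a judge LLM configured (OpenAI key by default).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

eval_rows = {
    "question":     ["What is our refund policy for enterprise plans?"],
    "contexts":     [["Enterprise plans can be refunded within 30 days of purchase."]],
    "answer":       ["Enterprise plans are refundable within 30 days."],
    "ground_truth": ["Refunds are available within 30 days for enterprise plans."],
}

result = evaluate(
    Dataset.from_dict(eval_rows),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores to log alongside each knob configuration
```

Logging those per-metric scores next to each config is what lets you compare retrieval quality and answer quality side by side rather than eyeballing individual answers.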

The exact vector DB or orchestrator you choose matters, but at 10M documents the thing that will hurt you most is poor search quality. Whatever stack you pick, make sure you have an experiment and evaluation layer that lets you iterate on those knobs quickly.

Disclosure: I work on the RapidFire AI team.

1 Like