Retrieval Strategy for 10M Documents: Standard Dense Passage vs. LightRAG?

Hi community,

I am working on a RAG project involving 10 million text documents (living in PostgreSQL). I need to ensure high retrieval accuracy (Semantic Search).

I am torn between two approaches:

  1. Standard Hybrid Search: Using text-embedding-3-large with a Vector DB (Weaviate/Pinecone) + Reranking.

  2. Newer Architectures: Like LightRAG, which claims better context understanding but might be harder to maintain at this scale.

Has anyone benchmarked these approaches on a dataset of this size? Which stack/model combination do you recommend for a balance of performance and maintainability?

1 Like

Option 1 seems better in this case.

1 Like

For this kind of scale I would worry a bit less about finding “the one true architecture” and more about how you will test and improve whatever stack you pick.

Once you have a basic pipeline (parser → chunker → embeddings → vector/hybrid search → reranker → LLM), the big lever is how systematically you run experiments on the knobs:

  • Retriever stack and settings (BM25 vs dense vs hybrid, top_k, filters, vector DB choice like pgvector vs Pinecone or Weaviate — see the pgvector sketch after this list)
  • Chunking and indexing strategy (size, overlap, per-document-type strategies, hierarchical schemes like RAPTOR)
  • Reranker on or off and which reranker
  • Model and prompt choices (system prompt, how you format context, temperature and other sampling params)
  • Update strategy (full reindex vs incremental upserts, reusing embeddings for unchanged chunks via content signatures/hashes)
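
Since your corpus already lives in PostgreSQL, the cheapest way to A/B the dense-vs-hybrid knob is often pgvector plus the built-in full-text search, fused with reciprocal rank fusion. A minimal sketch, assuming a hypothetical `chunks` table with an `embedding vector(...)` column and a `content_tsv` tsvector column (psycopg 3; names are placeholders):

```python
# Hybrid retrieval sketch: dense (pgvector) + keyword (Postgres FTS),
# fused with reciprocal rank fusion (RRF). Table/column names are hypothetical.
import psycopg  # psycopg 3

HYBRID_SQL = """
WITH dense AS (
    SELECT id,
           ROW_NUMBER() OVER (ORDER BY embedding <=> %(qvec)s::vector) AS rnk
    FROM chunks
    ORDER BY embedding <=> %(qvec)s::vector
    LIMIT %(k)s
),
keyword AS (
    SELECT id,
           ROW_NUMBER() OVER (
               ORDER BY ts_rank_cd(content_tsv, plainto_tsquery('english', %(q)s)) DESC
           ) AS rnk
    FROM chunks
    WHERE content_tsv @@ plainto_tsquery('english', %(q)s)
    ORDER BY ts_rank_cd(content_tsv, plainto_tsquery('english', %(q)s)) DESC
    LIMIT %(k)s
)
SELECT id,
       COALESCE(1.0 / (60 + d.rnk), 0) + COALESCE(1.0 / (60 + kw.rnk), 0) AS rrf_score
FROM dense d
FULL OUTER JOIN keyword kw USING (id)
ORDER BY rrf_score DESC
LIMIT %(k)s;
"""

def hybrid_search(conn, query_text, query_vec, k=20):
    """Rank chunk ids by RRF over dense (pgvector) and keyword (FTS) retrieval."""
    qvec = "[" + ",".join(f"{x:.6f}" for x in query_vec) + "]"  # pgvector literal format
    with conn.cursor() as cur:
        cur.execute(HYBRID_SQL, {"q": query_text, "qvec": qvec, "k": k})
        return cur.fetchall()  # [(chunk_id, rrf_score), ...]
```

At 10M rows you would also want an HNSW or IVFFlat index on the embedding column and a GIN index on the tsvector column, otherwise both branches degrade to sequential scans.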

The useful pattern is to lock in a representative eval set, then sweep combinations of these knobs and look at retrieval quality and answer quality side by side, instead of making one-off tweaks.
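
To make that pattern concrete, here is a minimal sketch of such a sweep, where `build_index`, `retrieve` and `answer` stand in for your own pipeline functions (all names hypothetical) and the eval set is a small hand-labelled list of questions with known relevant chunk ids:

```python
# Knob-sweep sketch: enumerate configs, score each against the same locked eval set.
# build_index / retrieve / answer are placeholders for your own pipeline functions.
from itertools import product

KNOBS = {
    "chunk_size":    [256, 512, 1024],
    "chunk_overlap": [0, 64],
    "retriever":     ["dense", "bm25", "hybrid"],
    "top_k":         [5, 10, 20],
    "reranker":      [None, "cross-encoder"],
}

def recall_at_k(retrieved_ids, relevant_ids):
    """Fraction of labelled relevant chunks that show up in the retrieved set."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids) & set(relevant_ids)) / len(relevant_ids)

def run_sweep(eval_set, build_index, retrieve, answer):
    """Score every knob combination on retrieval quality; collect answers for judging."""
    results = []
    for values in product(*KNOBS.values()):
        cfg = dict(zip(KNOBS, values))
        index = build_index(cfg)  # chunk + embed + index; cache per chunking setting in practice
        recalls, answers = [], []
        for item in eval_set:  # item = {"question": str, "relevant_ids": [...]}
            hits = retrieve(index, item["question"], cfg)        # list of (chunk_id, text) pairs
            recalls.append(recall_at_k([h[0] for h in hits], item["relevant_ids"]))
            answers.append(answer(item["question"], hits, cfg))  # judge these with RAGAS/TruLens
        results.append({**cfg, "recall": sum(recalls) / len(recalls), "answers": answers})
    return sorted(results, key=lambda r: r["recall"], reverse=True)
```

The full grid here is already 108 configs, so in practice you cache indexes per chunking setting and prune weak combinations early — which is exactly the loop the tools below help automate.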

Some tools that help with that experimentation and optimization loop:

  • RapidFire AI RAG (open source) – experiment execution framework focused on RAG. Lets you declare chunking, retrieval, reranking and prompting options as knobs, run many configs in parallel, and compare RAG metrics across them. GitHub: https://github.com/RapidFireAI/rapidfireai
  • RAGAS – library of RAG evaluation metrics such as faithfulness, answer relevance, context precision and context recall, with integrations into common RAG stacks (a minimal usage sketch follows this list).
  • TruLens – open source eval and tracing with the “RAG triad” of context relevance, groundedness and answer relevance, plus other feedback functions.
  • LangSmith – dataset based evaluation and tracing for LangChain apps, including a tutorial specifically on evaluating RAG systems.
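
For a sense of what that evaluation layer looks like in code, here is a minimal RAGAS-style sketch, assuming the classic `ragas.evaluate` interface (the API has shifted between versions, so check the current docs); in practice the questions, contexts and answers would come from the sweep above rather than being hard-coded:

```python
# Minimal RAGAS sketch. API and column names may differ between versions;
# evaluate() needs a judge LLM configured (OpenAI key by default).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

eval_rows = {
    "question":     ["What is our refund policy for enterprise plans?"],
    "contexts":     [["Enterprise plans can be refunded within 30 days of purchase."]],
    "answer":       ["Enterprise plans are refundable within 30 days."],
    "ground_truth": ["Refunds are available within 30 days for enterprise plans."],
}

result = evaluate(
    Dataset.from_dict(eval_rows),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores to log alongside each knob configuration
```

Logging those per-metric scores next to each config is what lets you compare retrieval quality and answer quality side by side rather than eyeballing individual answers.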

The exact vector DB or orchestrator you choose matters, but at 10M documents the thing that will hurt you most is poor search quality. Whatever stack you pick, make sure you have an experiment and evaluation layer that lets you iterate on those knobs quickly.

Disclosure: I work on the RapidFire AI team.

1 Like