Foldex
Multimodal RAG system for document intelligence
Personal
Problem
Most RAG systems handle only text. Real-world documents such as invoices, research papers, presentations, and technical manuals contain tables, figures, charts, and mixed-layout content that text-only pipelines miss entirely. Foldex was built to handle the full multimodal surface of a document, extracting and reasoning over both text and visual elements across 10+ input formats, including PDFs, audio, and images.
Approach
Foldex uses a multi-stage ingestion pipeline: PDFs are parsed to extract text and analyze layout, images and figures are extracted and described by a vision LLM, and tables are converted to structured markdown. Each content type is chunked with a type-appropriate strategy (sentence windows for text, full-block for tables, description plus image for figures), embedded, and stored in Qdrant. At query time, a LangChain retrieval chain fetches from the text, table, and figure namespaces and synthesizes a unified answer. The full stack runs via Docker Compose.
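The query-time step can be sketched roughly as follows. This is a minimal illustration in plain Python, not Foldex's actual LangChain chain: the `Hit` type, the `search` callable, and the content-type labels are hypothetical stand-ins for whatever the real retriever returns.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical labels for the three content types described above.
CONTENT_TYPES = ("text", "table", "figure")


@dataclass
class Hit:
    content_type: str  # which store the chunk came from
    text: str          # chunk text, table markdown, or figure description
    score: float       # similarity score, higher is better


def retrieve_all(query: str,
                 search: Callable[[str, str, int], List[Hit]],
                 k_per_type: int = 3) -> List[Hit]:
    """Query every content type and merge the hits, best score first."""
    hits: List[Hit] = []
    for content_type in CONTENT_TYPES:
        hits.extend(search(query, content_type, k_per_type))
    return sorted(hits, key=lambda h: h.score, reverse=True)


def build_context(hits: List[Hit]) -> str:
    """Label each hit with its type so the LLM can tell tables from prose."""
    return "\n\n".join(f"[{h.content_type}] {h.text}" for h in hits)
```

Sorting by raw score assumes the scores are comparable across content types, which may not hold if each type uses a different embedding model. In that case, a fixed quota per type or a reranking pass would be a safer merge.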
Architecture
Figure: Foldex system diagram
Key Technical Decisions
Qdrant for vector storage
Qdrant provides a self-hosted vector database with strong filtering and hybrid search support. For multimodal RAG with text, table, and image namespaces, Qdrant's collection-level organization and payload filtering enable precise namespace isolation without multiple databases.
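As an illustration of payload filtering, the sketch below builds the JSON body for a filtered similarity search against Qdrant's REST search endpoint (`POST /collections/{collection_name}/points/search`). The payload key `content_type` and its values are assumptions made for this example, not names taken from Foldex, and the request is only constructed, not sent.

```python
import json
from typing import List


def build_search_body(query_vector: List[float],
                      content_type: str,
                      limit: int = 5) -> dict:
    """Build a Qdrant search request restricted to one content type."""
    return {
        "vector": query_vector,
        # Only points whose payload matches the condition are considered.
        "filter": {
            "must": [
                # "content_type" is a hypothetical payload field.
                {"key": "content_type", "match": {"value": content_type}}
            ]
        },
        "limit": limit,
        "with_payload": True,
    }


body = build_search_body([0.1, 0.2, 0.3], "table")
print(json.dumps(body, indent=2))
```

Because the filter is part of the search request, one collection can serve all three content types, with the content type selected per query.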
Type-specific chunking strategies
Text paragraphs use sliding-window chunking (512 tokens with a 128-token overlap). Tables are stored as single atomic chunks to preserve relational structure. Figures are stored as (image, LLM-generated description) pairs. Mixed strategies significantly outperformed naive fixed-size chunking on retrieval precision.
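A minimal sketch of the text and table strategies, using the 512/128 parameters quoted above. The function names are hypothetical, and tokens are assumed to be pre-split strings; the tokenizer Foldex actually uses is not specified here.

```python
from typing import List


def sliding_window_chunks(tokens: List[str],
                          window: int = 512,
                          overlap: int = 128) -> List[List[str]]:
    """Split tokens into windows that share `overlap` tokens with the previous one."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap  # each window starts this many tokens after the last
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


def chunk_table(markdown: str) -> List[str]:
    """Tables stay whole: one table, one chunk."""
    return [markdown]
```

With these defaults each window starts 384 tokens after the previous one, so a 1,000-token paragraph yields three chunks, the last one shorter than the rest.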
Docker Compose for the full stack
The FastAPI backend, Qdrant storage, and a React frontend all run as services in a single Compose file. A single command spins up the entire system locally, which made it easy to share with researchers who needed a working demo without a cloud deployment.
Results
- ✓ Supports 10+ document formats including PDFs, audio, and images in a single pipeline
- ✓ Successfully retrieves from tables, figures, and text in the same document
- ✓ Vision LLM descriptions enable reasoning over charts and diagrams
- ✓ Full stack deployable with a single `docker compose up`
Tech Stack
Links