AI/ML

Foldex

Multimodal RAG system for document intelligence

Personal

6 technologies
3 key decisions
4 results

Problem

Most RAG systems handle only text. Real-world documents — invoices, research papers, presentations, technical manuals — contain tables, figures, charts, and mixed-layout content that text-only pipelines miss entirely. Foldex was built to handle the full multimodal surface of a document: extracting and reasoning over both text and visual elements across 10+ document formats including PDFs, audio, and images.

Approach

Foldex uses a multi-stage ingestion pipeline: PDFs are parsed for text and layout, images and figures are extracted and described by a vision LLM, and tables are converted to structured markdown. Each content type is chunked with a type-appropriate strategy (sliding windows for text, full-block for tables, description plus image for figures), embedded, and stored in Qdrant. At query time, a LangChain retrieval chain fetches from all three namespaces (text, table, and image) and synthesizes a unified answer. The full stack runs via Docker Compose.
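The query-time step can be illustrated with a minimal, dependency-free sketch. It is not the Foldex implementation: an in-memory dictionary stands in for Qdrant, plain cosine similarity stands in for the LangChain retrieval chain, and the names `retrieve`, `build_context`, and the toy two-dimensional embeddings are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
# Hypothetical in-memory stand-in for the vector store:
# namespace -> list of (embedding, chunk text)
Store = Dict[str, List[Tuple[Vector, str]]]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(store: Store, query_vec: Vector, k_per_namespace: int = 2):
    """Take the top-k chunks from every namespace, then rank them together."""
    hits = []
    for namespace, items in store.items():
        scored = sorted(
            ((cosine(query_vec, vec), text) for vec, text in items),
            reverse=True,
        )
        hits.extend((score, namespace, text) for score, text in scored[:k_per_namespace])
    return sorted(hits, reverse=True)


def build_context(hits) -> str:
    """Label each chunk with its namespace so the LLM can tell the sources apart."""
    return "\n".join(f"[{namespace}] {text}" for _, namespace, text in hits)


store: Store = {
    "text": [([1.0, 0.0], "Revenue grew in Q3."), ([0.0, 1.0], "Unrelated paragraph.")],
    "table": [([0.8, 0.6], "| Quarter | Revenue |")],
    "figure": [([0.6, 0.8], "Bar chart of quarterly revenue.")],
}
hits = retrieve(store, [1.0, 0.0], k_per_namespace=1)
print(build_context(hits))
```

Querying each namespace separately before merging guarantees that tables and figures reach the synthesis prompt even when text chunks dominate the raw similarity scores.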

Architecture

Foldex — system diagram

Document Upload → Multimodal Parser → Vision LLM → Chunking Layer → Qdrant → LangChain RAG

Key Technical Decisions

01

Qdrant for vector storage

Qdrant provides a self-hosted vector database with strong filtering and hybrid search support. For multimodal RAG with text, table, and image namespaces, Qdrant's collection-level organization and payload filtering enable precise namespace isolation without multiple databases.
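As a rough illustration of payload filtering, the sketch below builds a search request body that restricts results to one content type. The `must` / `key` / `match` structure follows Qdrant's documented filter syntax. The payload field name `content_type`, its values, and the helper `build_search_body` are assumptions made for this example and not taken from the Foldex codebase.

```python
import json
from typing import Any, Dict, List


def build_search_body(query_vector: List[float], content_type: str, limit: int = 5) -> Dict[str, Any]:
    """Build a JSON-serializable search body limited to one content type.

    `content_type` is a hypothetical payload field stored alongside each vector.
    """
    return {
        "vector": query_vector,
        "limit": limit,
        "with_payload": True,
        "filter": {
            "must": [
                {"key": "content_type", "match": {"value": content_type}}
            ]
        },
    }


# Example: only table chunks are eligible for this query.
body = build_search_body([0.12, 0.87, 0.33], content_type="table", limit=3)
print(json.dumps(body, indent=2))
```

Because the filter travels with the query, one database can serve text, table, and image lookups without cross-contamination between content types.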

02

Type-specific chunking strategies

Text paragraphs use sliding window chunking (512 tokens, 128 overlap). Tables are stored as single atomic chunks to preserve relational structure. Figures are stored as (image, LLM-generated description) pairs. Mixed strategies significantly outperformed naive fixed-size chunking on retrieval precision.
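The three strategies can be sketched as follows. This is a simplified illustration that assumes whitespace-separated words as a stand-in for real tokenizer output. The function names and the chunk dictionary layout are hypothetical.

```python
from typing import Dict, List


def sliding_window_chunks(tokens: List[str], size: int = 512, overlap: int = 128) -> List[List[str]]:
    """Split tokens into windows of `size` tokens; consecutive windows share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the final window already reaches the end of the text
    return chunks


def chunk_element(kind: str, content: str, description: str = "") -> List[Dict[str, str]]:
    """Route one parsed element to the chunking strategy for its type."""
    if kind == "text":
        tokens = content.split()  # assumption: whitespace tokens, not a model tokenizer
        return [{"type": "text", "text": " ".join(w)} for w in sliding_window_chunks(tokens)]
    if kind == "table":
        # One atomic chunk keeps rows and columns together.
        return [{"type": "table", "text": content}]
    if kind == "figure":
        # The description is what gets embedded; the image reference rides along.
        return [{"type": "figure", "text": description, "image": content}]
    raise ValueError(f"unknown element kind: {kind}")


tokens = [str(i) for i in range(1000)]
print([len(c) for c in sliding_window_chunks(tokens)])
```

With 1,000 tokens, a window of 512, and an overlap of 128, the stride is 384 tokens, so the windows start at positions 0, 384, and 768.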

03

Docker Compose for the full stack

FastAPI backend, Qdrant storage, and a React frontend all run as services in a single Compose file, so one command spins up the entire system locally. This made it easy to share with researchers who needed a working demo without a cloud deployment.

Results

  • Supports 10+ document formats including PDFs, audio, and images in a single pipeline
  • Successfully retrieves from tables, figures, and text in the same document
  • Vision LLM descriptions enable reasoning over charts and diagrams
  • Full stack deployable with a single `docker compose up`

Tech Stack

LangChain, FastAPI, LanceDB, Docker, Python, React
