Academic Paper Recommendation System
Hybrid NLP retrieval over 99,942 papers — 78.9% accuracy
Personal
Problem
Academic paper discovery is a hard retrieval problem: keyword search misses semantic similarity, citation graphs are sparse for new papers, and most embedding models aren't trained on scientific language. The goal was a system that recommends semantically relevant papers from a 99,942-paper corpus with high precision.
Approach
The system uses a hybrid retrieval approach combining three signals: sparse TF-IDF retrieval for lexical precision, dense Sentence-BERT embeddings for general semantic similarity, and AllenAI-Specter embeddings, which are trained specifically on scientific paper abstracts and citations. All embeddings are stored in LanceDB for efficient vector search. At query time, the scores from the three retrievers are combined via a weighted ensemble. The Specter signal proved most valuable for cross-domain recommendations, where keyword overlap is low.
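The weighted ensemble step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the min-max normalization, the function names, the paper IDs, the scores, and the weights are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one retriever's scores to [0, 1] so the signals are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Every candidate ties, so this retriever carries no ranking signal.
        return {pid: 0.0 for pid in scores}
    return {pid: (s - lo) / (hi - lo) for pid, s in scores.items()}


def fuse(
    per_retriever: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Weighted sum of normalized scores; a paper missing from a retriever adds 0."""
    fused: Dict[str, float] = {}
    for name, scores in per_retriever.items():
        for pid, s in min_max_normalize(scores).items():
            fused[pid] = fused.get(pid, 0.0) + weights[name] * s
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Hypothetical raw scores for one query (illustrative values only).
scores = {
    "tfidf":   {"p1": 0.90, "p2": 0.10, "p3": 0.00},
    "sbert":   {"p1": 0.40, "p2": 0.70, "p3": 0.50},
    "specter": {"p1": 0.30, "p2": 0.85, "p3": 0.80},
}
# Hypothetical weights; the real values would come from validation tuning.
weights = {"tfidf": 0.2, "sbert": 0.3, "specter": 0.5}

ranking = fuse(scores, weights, top_k=3)
print([pid for pid, _ in ranking])  # ['p2', 'p3', 'p1']
```

Normalizing each retriever before summing matters because TF-IDF and cosine-similarity scores live on different scales; without it, one signal can dominate regardless of its weight.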
Architecture
[Figure: system diagram of the Academic Paper Recommendation System]
Key Technical Decisions
AllenAI-Specter over general-purpose embeddings
General-purpose embedding models (e.g., all-MiniLM) underperform on scientific text because they are not trained with citation signals. Specter is trained with a citation-based contrastive loss, so papers linked by a citation end up closer in embedding space than unrelated papers. This proved critical for cross-domain and jargon-heavy recommendations.
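To make the citation-based objective concrete, here is a toy sketch of a triplet margin loss, one common way to express "cited papers should sit closer than uncited ones". The 2-D vectors, the margin of 1.0, and the function names are assumptions for illustration; real Specter embeddings are high-dimensional and come from a transformer encoder.

```python
import math
from typing import Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(
    query: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,
) -> float:
    """Zero once the cited paper is closer than the uncited one by at least `margin`."""
    gap = l2_distance(query, positive) - l2_distance(query, negative)
    return max(gap + margin, 0.0)


# Toy 2-D "embeddings" (hypothetical values).
query = [0.0, 0.0]          # the paper being embedded
cited = [0.0, 1.0]          # a paper it cites (positive)
uncited_far = [3.0, 4.0]    # an unrelated paper that is already far away
uncited_near = [0.0, 1.5]   # an unrelated paper that sits too close

loss_far = triplet_margin_loss(query, cited, uncited_far)    # 1 - 5 + 1 < 0, clipped to 0
loss_near = triplet_margin_loss(query, cited, uncited_near)  # 1 - 1.5 + 1 = 0.5
print(loss_far, loss_near)
```

Only the second triplet produces a training signal: the uncited paper is within the margin, so the loss pushes it away from the query and pulls the cited paper closer.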
Hybrid TF-IDF + dense retrieval ensemble
Dense retrieval alone misses exact terminology matches (e.g., specific model names, dataset names). TF-IDF complements it by catching those exact lexical matches. The ensemble consistently outperformed either retriever alone. Weights were tuned on a held-out validation set.
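One simple way to tune ensemble weights on a held-out set is a coarse grid search over weight triples that sum to 1. The sketch below assumes hit rate at k as the validation metric and uses two made-up validation queries; the metric, the step size, the data, and all function names are hypothetical rather than taken from the project.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

# retriever name -> {paper id -> already-normalized score}
Scores = Dict[str, Dict[str, float]]


def fuse(scores: Scores, weights: Dict[str, float]) -> List[str]:
    """Rank paper ids by the weighted sum of their per-retriever scores."""
    fused: Dict[str, float] = {}
    for name, per_paper in scores.items():
        for pid, s in per_paper.items():
            fused[pid] = fused.get(pid, 0.0) + weights[name] * s
    return sorted(fused, key=lambda pid: fused[pid], reverse=True)


def evaluate(val: List[Tuple[Scores, Set[str]]], weights: Dict[str, float], k: int = 1) -> float:
    """Fraction of validation queries with at least one relevant paper in the top k."""
    hits = sum(1 for scores, relevant in val if set(fuse(scores, weights)[:k]) & relevant)
    return hits / len(val)


def tune_weights(val, names=("tfidf", "sbert", "specter"), step=0.25, k=1):
    """Grid search over three non-negative weights that sum to 1."""
    n = round(1 / step)
    best_w, best_score = None, -1.0
    for i, j in product(range(n + 1), repeat=2):
        if i + j > n:
            continue
        w = dict(zip(names, (i * step, j * step, (n - i - j) * step)))
        score = evaluate(val, w, k)
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score


# Hypothetical validation set: (scores per retriever, relevant paper ids).
val = [
    # Exact-terminology query: TF-IDF finds the right paper, dense models do not.
    ({"tfidf": {"a1": 1.0, "a2": 0.0},
      "sbert": {"a1": 0.4, "a2": 0.6},
      "specter": {"a1": 0.45, "a2": 0.55}}, {"a1"}),
    # Cross-domain query: little keyword overlap, Specter finds the right paper.
    ({"tfidf": {"b1": 0.0, "b2": 0.3},
      "sbert": {"b1": 0.45, "b2": 0.5},
      "specter": {"b1": 0.9, "b2": 0.2}}, {"b1"}),
]

best_weights, best_score = tune_weights(val)
tfidf_only = evaluate(val, {"tfidf": 1.0, "sbert": 0.0, "specter": 0.0})
specter_only = evaluate(val, {"tfidf": 0.0, "sbert": 0.0, "specter": 1.0})
print(best_weights, best_score, tfidf_only, specter_only)
```

In this toy data each single retriever answers only one of the two queries, while a mixed weighting answers both, which mirrors the reason for combining lexical and dense signals. A real run would use many more validation queries and likely a finer grid.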
LanceDB for embedded vector storage
LanceDB runs embedded with no server overhead and handles Apache Arrow format natively. For a research project storing three separate embedding spaces (TF-IDF sparse, SBERT dense, Specter dense) across 99,942 papers, this eliminated infrastructure complexity while enabling fast ANN search.
Results
- ✓ 78.9% recommendation accuracy across 99,942 academic papers
- ✓ Hybrid retrieval outperforms single-model baselines on cross-domain queries
- ✓ AllenAI-Specter provides strongest signal for scientific paper similarity
- ✓ LanceDB enables sub-second retrieval across the full corpus
Tech Stack
Links