AI/ML

Academic Paper Recommendation System

Hybrid NLP retrieval over 99,942 papers — 78.9% accuracy

Personal

Problem

Academic paper discovery is a hard retrieval problem: keyword search misses semantic similarity, citation graphs are sparse for new papers, and most embedding models aren't trained on scientific language. The challenge was building a system that could recommend semantically relevant papers across a 99,942-paper corpus with strong precision.

Approach

The system uses a hybrid retrieval approach that combines three signals: sparse TF-IDF retrieval for lexical precision, dense Sentence-BERT embeddings for semantic similarity, and embeddings from AllenAI-Specter, a model trained specifically on scientific paper abstracts and citations. All embeddings are stored in LanceDB for efficient vector search. At query time, the scores from all three retrievers are combined via a weighted ensemble. The Specter signal proved most valuable for cross-domain recommendations, where keyword overlap is low.
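The weighted ensemble step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the min-max normalization, the function names, the paper IDs, the scores, and the weights are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one retriever's scores to [0, 1] so retrievers are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All candidates tied: give every paper the same score.
        return {pid: 1.0 for pid in scores}
    return {pid: (s - lo) / (hi - lo) for pid, s in scores.items()}


def weighted_ensemble(per_retriever: Dict[str, Dict[str, float]],
                      weights: Dict[str, float],
                      top_k: int = 10) -> List[Tuple[str, float]]:
    """Sum weighted, normalized scores; a paper a retriever did not return adds 0."""
    combined: Dict[str, float] = {}
    for name, scores in per_retriever.items():
        w = weights.get(name, 0.0)
        for pid, s in min_max_normalize(scores).items():
            combined[pid] = combined.get(pid, 0.0) + w * s
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Hypothetical raw scores for a single query (IDs and values are made up).
scores = {
    "tfidf":   {"p1": 0.8, "p2": 0.2, "p3": 0.1},
    "sbert":   {"p1": 0.5, "p2": 0.7, "p4": 0.6},
    "specter": {"p2": 0.9, "p4": 0.8, "p1": 0.3},
}
# Illustrative weights only; the tuned values are not given in this write-up.
weights = {"tfidf": 0.2, "sbert": 0.3, "specter": 0.5}

ranked = weighted_ensemble(scores, weights, top_k=3)
print([pid for pid, _ in ranked])
```

Normalizing before summing matters because TF-IDF cosine scores and dense-embedding similarities usually live on different scales; without it, one retriever can dominate regardless of its weight.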

Architecture

System diagram: Paper Query → TF-IDF (sparse) / Sentence-BERT / AllenAI-Specter → LanceDB → Weighted Ensemble → Ranked Results

Key Technical Decisions

01

AllenAI-Specter over general-purpose embeddings

General-purpose embedding models (e.g., all-MiniLM) underperform on scientific text because they are not trained with any citation signal. Specter is trained with a citation-based contrastive loss, so papers that cite each other end up closer in embedding space. This proved critical for cross-domain and jargon-heavy recommendations.
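One common form of a citation-based contrastive objective is a triplet margin loss: a query paper should sit closer to a paper it cites than to an unrelated paper, by at least some margin. The toy sketch below only illustrates that idea. The 2-D vectors, the margin value, and the function names are made up for the example, and it is not Specter's training code.

```python
import math
from typing import Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(query: Sequence[float],
                        positive: Sequence[float],
                        negative: Sequence[float],
                        margin: float = 1.0) -> float:
    """max(0, d(query, positive) - d(query, negative) + margin)."""
    return max(0.0, l2_distance(query, positive) - l2_distance(query, negative) + margin)


# Toy 2-D "embeddings" (hypothetical values).
query = [0.0, 0.0]
cited = [0.0, 1.0]     # a paper the query cites (positive example)
uncited = [3.0, 4.0]   # an unrelated paper (negative example)

# The cited paper is already much closer than the uncited one, so the loss is zero.
loss_well_separated = triplet_margin_loss(query, cited, uncited)

# Swapping the roles simulates a badly placed embedding and gives a positive loss.
loss_violated = triplet_margin_loss(query, uncited, cited)
```

Training against this kind of loss pulls cited pairs together and pushes unrelated pairs apart, which is why the resulting embeddings can relate papers that share few keywords.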

02

Hybrid TF-IDF + dense retrieval ensemble

Dense retrieval alone misses exact terminology matches (e.g., specific model names, dataset names). TF-IDF complements it by catching these exact lexical matches. The ensemble consistently outperformed either retriever alone. Weights were tuned on a held-out validation set.
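The write-up does not say how the weights were tuned, so the following is only one plausible sketch: a coarse grid search over weights that sum to 1, scored by top-1 hit rate on validation queries. The data, the metric, and every name here are assumptions for illustration.

```python
from itertools import product
from typing import Dict, List, Tuple

# retriever name -> {paper_id: normalized score} for one query
Scores = Dict[str, Dict[str, float]]


def top1(scores: Scores, weights: Dict[str, float]) -> str:
    """Return the paper with the highest weighted score (ties broken by ID)."""
    combined: Dict[str, float] = {}
    for name, per_paper in scores.items():
        for pid, s in per_paper.items():
            combined[pid] = combined.get(pid, 0.0) + weights[name] * s
    return max(sorted(combined), key=lambda pid: combined[pid])


def hit_rate(validation: List[Tuple[Scores, str]], weights: Dict[str, float]) -> float:
    """Fraction of validation queries whose top result is the known relevant paper."""
    hits = sum(top1(scores, weights) == relevant for scores, relevant in validation)
    return hits / len(validation)


def tune_weights(validation: List[Tuple[Scores, str]],
                 names: List[str],
                 steps: int = 4) -> Tuple[Dict[str, float], float]:
    """Try every weight combination on a grid of size 1/steps that sums to 1."""
    best_weights: Dict[str, float] = {}
    best_rate = -1.0
    for combo in product(range(steps + 1), repeat=len(names)):
        if sum(combo) != steps:
            continue
        weights = {n: c / steps for n, c in zip(names, combo)}
        rate = hit_rate(validation, weights)
        if rate > best_rate:  # keep the first combination that reaches the best rate
            best_weights, best_rate = weights, rate
    return best_weights, best_rate


# Two made-up validation queries: one favors exact terms, one favors semantics.
validation = [
    ({"tfidf": {"a1": 1.0, "a2": 0.0}, "dense": {"a1": 0.4, "a2": 0.6}}, "a1"),
    ({"tfidf": {"b1": 0.4, "b2": 0.6}, "dense": {"b1": 1.0, "b2": 0.0}}, "b1"),
]
best_weights, best_rate = tune_weights(validation, ["tfidf", "dense"])
```

In this toy setup each retriever alone gets only one of the two queries right, while a mixed weighting gets both, which mirrors the claim that the ensemble beats either retriever on its own.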

03

LanceDB for embedded vector storage

LanceDB runs embedded with no server overhead and handles Apache Arrow format natively. For a research project storing three separate embedding spaces (TF-IDF sparse, SBERT dense, Specter dense) across 99,942 papers, this eliminated infrastructure complexity while enabling fast ANN search.
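A rough sketch of what embedded storage and search might look like with the LanceDB Python client is shown below. The table name, the database path, the `paper_id` field, and the choice of one table per embedding space are assumptions made for illustration, not details taken from the project.

```python
from typing import Dict, List, Sequence


def build_rows(paper_ids: Sequence[str],
               vectors: Sequence[Sequence[float]]) -> List[Dict]:
    """Pair each paper ID with its embedding as a list of dict rows."""
    if len(paper_ids) != len(vectors):
        raise ValueError("paper_ids and vectors must have the same length")
    return [{"paper_id": pid, "vector": list(vec)}
            for pid, vec in zip(paper_ids, vectors)]


def index_and_search(db_path: str,
                     table_name: str,
                     rows: List[Dict],
                     query_vector: Sequence[float],
                     k: int = 5) -> List[Dict]:
    """Write one embedding space to a local table and run a nearest-neighbor query."""
    import lancedb  # imported lazily; requires `pip install lancedb`

    db = lancedb.connect(db_path)  # embedded: a local directory, no server process
    table = db.create_table(table_name, data=rows, mode="overwrite")
    return table.search(list(query_vector)).limit(k).to_list()


# Tiny made-up example rows; real embeddings would have hundreds of dimensions.
rows = build_rows(["p1", "p2"], [[0.1, 0.9], [0.8, 0.2]])

# Example call, left commented out so the sketch runs without LanceDB installed:
# hits = index_and_search("./papers.lancedb", "specter", rows, [0.1, 0.8], k=1)
```

On a table this small the search is an exhaustive scan. For a corpus the size described here, an approximate index would typically be built on the table as a separate step.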

Results

  • 78.9% recommendation accuracy across 99,942 academic papers
  • Hybrid retrieval outperforms single-model baselines on cross-domain queries
  • AllenAI-Specter provides strongest signal for scientific paper similarity
  • LanceDB enables sub-second retrieval across the full corpus

Tech Stack

TF-IDF, Sentence-BERT, AllenAI-Specter, LanceDB, Python

Links