Neural Compression for Scalable Question-Answer Retrieval

Abstract

Question-answering systems at scale face fundamental performance barriers when traditional vector databases transition from exact to approximate search, causing substantial degradation in both query throughput and retrieval quality. While compression can address these challenges, existing compression approaches either apply generic transformations that ignore retrieval task structure (PCA) or require retraining entire embedding models (Matryoshka), limiting their practical applicability. This thesis introduces neural compression for question-answer retrieval through two-stage learning that compresses 384-dimensional context embeddings to 32 or 64 dimensions while preserving semantic information. The approach first trains an autoencoder to compress context embeddings using a cosine similarity loss, then trains a mapper network to predict compressed codes directly from question embeddings using a mean squared error loss in the compressed space. Both networks are trained on the training split (72%, 86,400 pairs). We evaluate this approach on 120,000 question-answer pairs spanning six knowledge domains, comparing against six baseline methods (FAISS, HNSW, ScaNN, PCA, Matryoshka, zero-shot) across six dataset scales (20K to 120K samples) with three iterations per configuration, for a total of 198 experimental runs. Results demonstrate that neural compression achieves a 0.1725 ROUGE-1 score compared to FAISS's 0.1624 (+6.2% improvement) while reducing storage from 184 MB to 7.7 MB (96% reduction) and increasing throughput from 151 to 7,861 queries per second (52× speedup). Neural compression is the only method whose quality improves with scale (+2.5% from 20K to 120K samples), while all baselines degrade (-6% to -13%). The performance crossover occurs at approximately 40K samples, earlier than hypothesized, because FAISS quality degrades from curse-of-dimensionality effects before its algorithmic transition to approximation.
These results show that task-specific learned compression through an asymmetric architecture (compressing only contexts while keeping questions at full dimensionality) enables exact search at scales where high-dimensional methods must approximate, fundamentally changing the scalability characteristics of retrieval systems.
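The two-stage approach described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the thesis implementation: the layer sizes, optimizer settings, training loop lengths, and synthetic data are all invented for the sketch; only the 384→32 dimensions and the two losses (cosine similarity for the autoencoder, MSE in compressed space for the mapper) follow the abstract.

```python
# Hypothetical sketch of the two-stage neural compression described in the
# abstract. Dimensions (384 -> 32) follow the thesis; network widths, training
# settings, and the synthetic data are assumptions made for illustration.
import torch
import torch.nn as nn

D, K = 384, 32  # full embedding dimension, compressed dimension

# Stage 1: autoencoder over context embeddings, cosine-similarity loss.
encoder = nn.Sequential(nn.Linear(D, 128), nn.ReLU(), nn.Linear(128, K))
decoder = nn.Sequential(nn.Linear(K, 128), nn.ReLU(), nn.Linear(128, D))

contexts = torch.randn(1024, D)  # stand-in for real context embeddings
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)
for _ in range(200):
    recon = decoder(encoder(contexts))
    loss = 1 - nn.functional.cosine_similarity(recon, contexts, dim=1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2: mapper predicts compressed codes directly from question embeddings,
# trained with MSE in the 32-dim compressed space against the frozen encoder.
mapper = nn.Sequential(nn.Linear(D, 128), nn.ReLU(), nn.Linear(128, K))
questions = contexts + 0.1 * torch.randn_like(contexts)  # synthetic pairs
with torch.no_grad():
    targets = encoder(contexts)
opt2 = torch.optim.Adam(mapper.parameters(), lr=1e-3)
for _ in range(200):
    loss2 = nn.functional.mse_loss(mapper(questions), targets)
    opt2.zero_grad()
    loss2.backward()
    opt2.step()

# Query time (the asymmetric part): contexts are stored only as 32-dim codes,
# the question is compressed once by the mapper, and retrieval is an *exact*
# nearest-neighbor search in the compressed space.
with torch.no_grad():
    codes = nn.functional.normalize(encoder(contexts), dim=1)  # (1024, 32)
    q = nn.functional.normalize(mapper(questions[:1]), dim=1)  # (1, 32)
    best = (codes @ q.T).squeeze().argmax().item()
```

Because the stored codes are only 32-dimensional, brute-force exact search stays cheap at scales where 384-dimensional indexes would have to fall back to approximate methods, which is the scalability argument the abstract makes.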

Description

This thesis investigates neural compression for scalable question-answer retrieval in Retrieval-Augmented Generation systems. An asymmetric autoencoder architecture compresses 384-dimensional context embeddings to 32 dimensions, achieving 12× storage reduction while maintaining retrieval quality. Experiments on 120K question-answer pairs demonstrate superior performance over PCA and Matryoshka baselines.

Keywords

Compression, Retrieval Augmented Generation (RAG), Neural Networks, Scalability
