Comparing the representation learning of autoencoding transformer models in ad hoc information retrieval
Information retrieval (IR) has seen rapid development in ranking models since the advent of deep learning techniques. Classical IR methods, such as BM25, assume query term independence, which allows them to precompute term-document scores and makes them efficient for full ranking. Deep neural ranking models such as BERT, on the other hand, depend on interaction signals between query and document terms for successful retrieval and have therefore been restricted to late-stage re-ranking, even though their retrieval performance is superior. Recent work has shown that, with offline precomputation of sentence embeddings, these computation-intensive models can be used for full ranking and made cost-effective when combined with an indexing structure. However, BERT is not the only advanced language representation model that can be used for information retrieval. Various other models, including RoBERTa, ALBERT, DistilBERT and ELECTRA, have surpassed the performance of BERT in several NLP tasks. Since a large number of pre-trained language models have been proposed lately, we believe it is the right time to evaluate their representation learning for ranking. Although these pre-trained models share some fundamental characteristics, their performance varies because of differences in training data or training procedure. They also differ in their computational requirements. In this work, we evaluate the representation learning of various autoencoding transformer models extrinsically on the downstream task of Microsoft MAchine Reading Comprehension (MS MARCO) passage retrieval. We observe that BERT and its distilled version, DistilBERT, are the best performers in terms of ranking, while DistilBERT achieves a good trade-off between effectiveness and computational efficiency in ad hoc document retrieval. We discuss the empirical analysis of these models and provide insights about their performance in tasks like semantic similarity.
We believe that our results shed some light on the selection of embeddings for ad hoc retrieval and also serve as a benchmark for future search applications.
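
The idea of precomputing embeddings offline so that an encoder can be used for full ranking can be illustrated with a minimal sketch. It is not the pipeline evaluated in this work: `toy_encode` is a hypothetical stand-in (a hashed bag of words) for a transformer sentence encoder, `DIM`, `build_index` and `rank` are illustrative names, and the brute-force scan stands in for whatever indexing structure would be used in practice.

```python
import math
import zlib

DIM = 64  # hypothetical embedding size, chosen only for illustration


def toy_encode(text):
    """Stand-in for a transformer sentence encoder (assumption):
    a hashed bag-of-words vector, L2-normalised."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def build_index(passages):
    """Offline step: embed every passage once, independently of any query."""
    return [toy_encode(p) for p in passages]


def rank(query, index, k=3):
    """Online step: embed only the query, then score it against all stored
    passage vectors with a dot product (cosine, since vectors are normalised)."""
    q = toy_encode(query)
    scores = [sum(a * b for a, b in zip(q, d)) for d in index]
    order = sorted(range(len(index)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:k]]


if __name__ == "__main__":
    passages = [
        "bm25 assumes query term independence",
        "distilbert is a distilled version of bert",
        "ms marco is a passage retrieval benchmark",
    ]
    index = build_index(passages)  # computed once, before any query arrives
    for doc_id, score in rank("distilled version of bert", index, k=2):
        print(doc_id, round(score, 3), passages[doc_id])
```

Because the passage vectors do not depend on the query, the expensive encoder runs over the collection only once; at query time the cost is one encoding plus a vector comparison per passage, which is what makes full ranking feasible for such models.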