Approximate Bag-of-Words Top-$k$ Corpus Graphs

Abstract

A recent line of work has investigated the use of corpus graphs to improve the latency-vs-effectiveness envelope of information retrieval systems. The key idea is to build a document-to-document similarity graph offline, allowing additional relevance signals to be exploited during query processing. However, these graphs are inherently expensive to build, requiring a quadratic “all-pairs similarity” computation. In this work, we examine the problem of building corpus graphs using bag-of-words models, and explore heuristics for building high-quality graphs more cheaply than exhaustive algorithms. We demonstrate that simple mechanisms such as document titles, expanded surrogate queries, and high-impact terms can yield effective graphs at a fraction of the cost of their exhaustive counterparts.
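
To make the idea concrete, the following is a minimal sketch of a top-$k$ corpus graph built over a toy bag-of-words index. It is not the paper's implementation: the TF-IDF weighting, the whitespace tokenizer, and the names `build_index`, `top_k_neighbours`, `corpus_graph`, and `num_terms` are all assumptions made for illustration. With `num_terms=None` each document is issued in full as a query against the index (the exhaustive case); with a small `num_terms` only its highest-weighted terms are used, which is one plausible reading of the "high-impact terms" heuristic.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Return per-document term weights and an inverted index (assumed TF-IDF)."""
    term_counts = [Counter(d.lower().split()) for d in docs]
    doc_freq = Counter(t for counts in term_counts for t in counts)
    n = len(docs)
    weights = [
        {t: c * math.log(1 + n / doc_freq[t]) for t, c in counts.items()}
        for counts in term_counts
    ]
    postings = defaultdict(list)
    for doc_id, w in enumerate(weights):
        for term, weight in w.items():
            postings[term].append((doc_id, weight))
    return weights, postings


def top_k_neighbours(query_weights, postings, self_id, k):
    """Score documents by dot product with the query and keep the k best."""
    scores = defaultdict(float)
    for term, q in query_weights.items():
        for doc_id, weight in postings.get(term, ()):
            if doc_id != self_id:  # a document is not its own neighbour
                scores[doc_id] += q * weight
    ranked = sorted(scores.items(), key=lambda p: (-p[1], p[0]))
    return [doc_id for doc_id, _ in ranked[:k]]


def corpus_graph(docs, k, num_terms=None):
    """Map each document id to its top-k neighbours.

    num_terms=None uses every term of the document as the query (exhaustive);
    otherwise only the num_terms highest-weighted terms are used (hypothetical
    stand-in for a high-impact-term heuristic).
    """
    weights, postings = build_index(docs)
    graph = {}
    for doc_id, w in enumerate(weights):
        if num_terms is not None:
            w = dict(sorted(w.items(), key=lambda p: (-p[1], p[0]))[:num_terms])
        graph[doc_id] = top_k_neighbours(w, postings, doc_id, k)
    return graph
```

Shorter surrogate queries touch fewer postings lists and so cost less to process, but a query that is too short may share no terms with the true neighbours, which is the cost-versus-quality trade-off the abstract describes.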

Publication
Proceedings of the 47th European Conference on Information Retrieval (ECIR 2025)