Revisiting Document Expansion and Filtering for Effective First-Stage Retrieval


Document expansion is a technique that aims to reduce the likelihood of term mismatch by augmenting documents with related terms or queries. Doc2Query minus minus (Doc2Query–) represents an extension to the expansion process that uses a neural model to identify and remove expansions that may not be relevant to the given document, thereby increasing the quality of the ranking while simultaneously reducing the amount of augmented data. In this work, we conduct a detailed reproducibility study of Doc2Query– to better understand the trade-offs inherent to document expansion and filtering mechanisms. After successfully reproducing the best-performing method from the Doc2Query– family, we show that filtering actually harms recall-based metrics on various test collections. Next, we explore whether the two-stage “generate-then-filter” process can be replaced with a single generation phase via reinforcement learning. Finally, we extend our experimentation to learned sparse retrieval models and demonstrate that filtering is not helpful when term weights can be learned. Overall, our work provides a deeper understanding of the behaviour and characteristics of common document expansion mechanisms, and paves the way for developing more efficient yet effective augmentation models.

Proceedings of the 47th International ACM Conference on Research and Development in Information Retrieval (SIGIR 2024)