A Flexible Framework for Offline Effectiveness Metrics


Offline effectiveness metrics are one of the cornerstones of evaluation in information retrieval. Static resources that include test collections and sets of topics, the corresponding relevance judgments connecting them, and metrics that map document rankings from a retrieval system to numeric scores have been used for multiple decades as an important way of comparing systems. The premise behind this experimental structure is that the metric score for a system can serve as a surrogate measurement for user satisfaction. Here we introduce a user behavior framework that extends the C/W/L family. The essence of the new framework, which we call C/W/L/A, is that the user actions that are undertaken while reading the ranking can be considered separately from the benefit that each user will have derived as they exit the ranking. This split structure allows the great majority of current effectiveness metrics to be systematically categorized, and thus their relative properties and relationships to be better understood; at the same time, it permits a wide range of novel combinations to be considered. We then carry out experiments using relevance judgments, document rankings, and user satisfaction data from two distinct sources, comparing the patterns of metric scores generated, and showing that the metrics vary markedly in their ability to predict user satisfaction.

Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2022)