# Publications

Filter by type:

### Different Keystrokes for Different Folks: Visualizing Crowdworker Querying Behavior

Search engine users retrieve relevant information for an information need using keyword queries. Different users may have similar information needs, but use different query terms. The resulting user query variations can provide a wealth of useful information to IR researchers. Most recently, the keystroke-level telemetry data gathered as part of the CC-News-En collection provides important insights into how users create queries for a search task, at a level of detail not possible using a normal query log. In this demo, we present an interactive tool that enables practitioners to visualize users formulating queries. Our new tool is a temporal simulation of the typing behavior of crowdworkers, grouped by information need. It provides the ability to directly compare the cognitive behavior of multiple users simultaneously, and observe how query keyword selection and ordering happens before a final query is submitted to a search engine. To demonstrate the benefit of our tool, we include a qualitative study of four different user behavior patterns which were observed in the CC-News-En collection.
CHIIR, 2021

### Examining the Additivity of Top-$k$ Query Processing Innovations

Research activity spanning more than five decades has led to index organizations, compression schemes, and traversal algorithms that allow extremely rapid response to ranked queries against very large text collections. However, little attention has been paid to the interactions between these many components, and the additivity of algorithmic improvements has not been explored. Here we examine the extent to which efficiency improvements add up. We employ four query processing algorithms, four compression codecs, and all possible combinations of four distinct further optimizations, and compare the performance of the 256 resulting systems to determine when and how different optimizations interact. Our results over two test collections show that efficiency enhancements are, for the most part, additive, and that there is little risk of negative interactions. In addition, our detailed profiling across this large pool of systems leads to key insights as to why the various individual enhancements work well, and indicates that optimizing ‘simpler’ implementations can result in higher query throughput than is available from non-optimized versions of the more ‘complex’ techniques, with clear implications for the choices needing to be made by practitioners.
CIKM, 2020

### CC-News-En: A Large English News Corpus

We describe a static, open-access news corpus using data from the Common Crawl Foundation, who provide free, publicly available web archives, including a continuous crawl of international news articles published in multiple languages. Our derived corpus, CC-News-En, contains 44 million English documents collected between September 2016 and March 2018. The collection is comparable in size with the number of documents typically found in a single shard of a large-scale, distributed search engine, and is four times larger than the news collections previously used in offline information retrieval experiments. To complement the corpus, 173 topics were curated using titles from Reddit threads, forming a temporally representative sampling of relevant news topics over the 583 day collection window. Information needs were then generated using automatic summarization tools to produce textual and audio representations, and used to elicit query variations from crowdworkers, with a total of 10,437 queries collected against the 173 topics. Of these, 10,089 include key-stroke level instrumentation that captures the timings of character insertions and deletions made by the workers while typing their queries. These new resources support a wide variety of experiments, including large-scale efficiency exercises and query auto-completion synthesis, with scope for future addition of relevance judgments to support offline effectiveness experiments and hence batch evaluation campaigns.
CIKM, 2020

### Efficiency Implications of Term Weighting for Passage Retrieval

Language model pre-training has spurred a great deal of attention for tasks involving natural language understanding, and has been successfully applied to many downstream tasks with impressive results. Within information retrieval, many of these solutions are too costly to stand on their own, requiring multi-stage ranking architectures. Recent work has begun to consider how to “backport” salient aspects of these computationally expensive models to previous stages of the retrieval pipeline. One such instance is DeepCT, which uses BERT to re-weight term importance in a given context at the passage level. This process, which is computed offline, results in an augmented inverted index with re-weighted term frequency values. In this work, we conduct an investigation of query processing efficiency over DeepCT indexes. Using a number of candidate generation algorithms, we reveal how term re-weighting can impact query processing latency, and explore how DeepCT can be used as a static index pruning technique to accelerate query processing without harming search effectiveness.
SIGIR, 2020

### Supporting Interoperability Between Open-Source Search Engines with the Common Index File Format

There exists a natural tension between encouraging a diverse ecosystem of open-source search engines and supporting fair, replicable comparisons across those systems. To balance these two goals, we examine two approaches to providing interoperability between the inverted indexes of several systems. The first takes advantage of internal abstractions around index structures and building wrappers that allow one system to directly read the indexes of another. The second involves sharing indexes across systems via a data exchange specification that we have developed, called the Common Index File Format (CIFF). We demonstrate the first approach with the Java systems Anserini and Terrier, and the second approach with Anserini, JASS, OldDog, PISA, and Terrier. Together, these systems provide a wide range of implementations and features, with different research goals. Overall, we recommend CIFF as a low-effort approach to support independent innovation while enabling the types of fair evaluations that are critical for driving the field forward.
SIGIR, 2020

### Managing Tail Latency in Large Scale Information Retrieval Systems

As both the availability of internet access and the prominence of smart devices continue to increase, data is being generated at a rate faster than ever before. This massive increase in data production comes with many challenges, including efficiency concerns for the storage and retrieval of such large-scale data. However, users have grown to expect the sub-second response times that are common in most modern search engines, creating a problem - how can such large amounts of data continue to be served efficiently enough to satisfy end users?
This dissertation investigates several issues regarding tail latency in large-scale information retrieval systems. Tail latency corresponds to the high percentile latency that is observed from a system - in the case of search, this latency typically corresponds to how long it takes for a query to be processed. In particular, keeping tail latency as low as possible translates to a good experience for all users, as tail latency is directly related to the worst-case latency and hence, the worst possible user experience. The key idea in targeting tail latency is to move from questions such as “what is the median latency of our search engine?” to questions which more accurately capture user experience such as “how many queries take more than 200ms to return answers?” or “what is the worst case latency that a user may be subject to, and how often might it occur?”
While various strategies exist for efficiently processing queries over large textual corpora, prior research has focused almost entirely on improvements to the average processing time or cost of search systems. As a first contribution, we examine some state-of-the-art retrieval algorithms for two popular index organizations, and discuss the trade-offs between them, paying special attention to the notion of tail latency. This research uncovers a number of observations that are subsequently leveraged for improved search efficiency and effectiveness.
We then propose and solve a new problem, which involves processing a number of related queries together, known as multi-queries, to yield higher quality search results. We experiment with a number of algorithmic approaches to efficiently process these multi-queries, and report on the cost, efficiency, and effectiveness trade-offs present with each. Ultimately, we find that some solutions yield a low tail latency, and are hence suitable for use in real-time search environments.
Finally, we examine how predictive models can be used to improve the tail latency and end-to-end cost of a commonly used multi-stage retrieval architecture without impacting result effectiveness. By combining ideas from numerous areas of information retrieval, we propose a prediction framework which can be used for training and evaluating several efficiency/effectiveness trade-off parameters, resulting in improved trade-offs between cost, result quality, and tail latency.
PhD Thesis, 2019

### Boosting Search Performance Using Query Variations

Rank fusion is a powerful technique that allows multiple sources of information to be combined into a single result set. However, to date fusion has not been regarded as being cost-effective in cases where strict per-query efficiency guarantees are required, such as in web search. In this work we propose a novel solution to rank fusion by splitting the computation into two parts – one phase that is carried out offline to generate pre-computed centroid answers for queries with broadly similar information needs, and then a second online phase that uses the corresponding topic centroid to compute a result page for each query. We explore efficiency improvements to classic fusion algorithms whose costs can be amortized as a pre-processing step, and can then be combined with re-ranking approaches to dramatically improve effectiveness in multi-stage retrieval systems with little efficiency overhead at query time. Experimental results using the ClueWeb12B collection and the UQV100 query variations demonstrate that centroid-based approaches allow improved retrieval effectiveness at little or no loss in query throughput or latency, and with reasonable pre-processing requirements. We additionally show that queries that do not match any of the pre-computed clusters can be accurately identified and efficiently processed in our proposed ranking pipeline.
TOIS, 2019

### PISA: Performant Indexes and Search for Academia

Performant Indexes and Search for Academia (PISA) is an experimental search engine that focuses on efficient implementations of state-of-the-art representations and algorithms for text retrieval. In this work, we outline our effort in creating a replicable search run from PISA for the 2019 Open Source Information Retrieval Replicability Challenge, which encourages the information retrieval community to produce replicable systems through the use of a containerized, Docker-based infrastructure. We also discuss the origins, current functionality, and future direction and challenges for the PISA system.
OSIRRC, 2019

### Accelerated Query Processing Via Similarity Score Prediction

Processing top-$k$ bag-of-words queries is critical to many information retrieval applications, including web-scale search. In this work, we consider algorithmic properties associated with dynamic pruning mechanisms. Such algorithms maintain a score threshold (the $k$th highest similarity score identified so far) so that low-scoring documents can be bypassed, allowing fast top-$k$ retrieval with no loss in effectiveness. In standard pruning algorithms the score threshold is initialized to the lowest possible value. To accelerate processing, we make use of term- and query-dependent features to predict the final value of that threshold, and then employ the predicted value right from the commencement of processing. Because of the asymmetry associated with prediction errors (if the estimated threshold is too high the query will need to be re-executed in order to assure the correct answer), the prediction process must be risk-sensitive. We explore techniques for balancing those factors, and provide detailed experimental results that show the practical usefulness of the new approach.
SIGIR, 2019

### Exploring User Behavior in Email Re-Finding Tasks

Email continues to be one of the most commonly used forms of online communication. As inboxes grow larger, users rely more heavily on email search to effectively find what they are looking for. However, previous studies on email have been exclusive to enterprises with access to large user logs, or limited to small-scale qualitative surveys and analyses on outdated public datasets such as Enron and Avocado. In this work, we propose a novel framework that allows for experimentation with real email data. In particular, our approach provides a realistic way of simulating email re-finding tasks in a crowdsourcing environment using the workers’ personal email data. We use our approach to experiment with various ranking functions and quality degradation to measure how users behave under different conditions, and conduct analysis across various email types and attributes. Our results show that user behavior can be significantly impacted as a result of the quality of the search ranker, but only when differences in quality are very pronounced. Our analysis confirms that time-based ranking begins to fail as email age increases, suggesting that hybrid approaches may help bridge the gap between relevance-based rankers and the traditional time-based ranking approach. Finally, we also found that users typically reformulate search queries by either entirely re-writing the query, or simply appending terms to the query, which may have implications for email query suggestion facilities
WWW, 2019

### Compressing Inverted Indexes with Recursive Graph Bisection: A Reproducibility Study

Document reordering is an important but often overlooked preprocessing stage in index construction. Reordering document identifiers in graphs and inverted indexes has been shown to reduce storage costs and improve processing efficiency in the resulting indexes. However, surprisingly few document reordering algorithms are publicly available despite their importance. A new reordering algorithm derived from recursive graph bisection was recently proposed by Dhulipala (KDD 2016), and shown to be highly effective and efficient when compared against other state-of-the-art reordering strategies. In this work, we present a reproducibility study of this new algorithm. We describe the implementation challenges encountered, and explore the performance characteristics of our clean-room reimplementation. We show that we are able to successfully reproduce the core results of the original paper, and show that the algorithm generalizes to other collections and indexing frameworks. Furthermore, we make our implementation publicly available to help promote further research in this space.
ECIR, 2019

### Revisiting Spam Filtering in Web Search

The Waterloo spam scores are now a commonly used static document feature in web collections such as ClueWeb. This feature can be used as a post-retrieval filter, as a document prior, or as one of many features in a Learning-to-Rank system. In this work, we highlight the risks associated with using spam scores as a post-retrieval filter, which is now common practice in experiments with the ClueWeb test collection. While it increases the average evaluation score and boosts the performance of some topics, it can significantly harm the performance of others. Through a detailed failure analysis, we show that simple spam filtering is a high risk practice that should be avoided in future work, particularly when working with the ClueWeb12 test collection.

### RMIT at the 2018 TREC CORE Track

Ad-hoc retrieval is an important problem with many practical applications. It forms the basis of web search, question-answering, and a new generation of virtual assistants being developed by several of the largest software companies in the world. In this report, we continue our exploration of the importance of multiple expressions of information needs. Our thesis is that over-reliance on a single query can lead to suboptimal performance, and that by creating multiple query representations for an information need and combining the relevance signals through fusion and relevance modeling, highly effective systems can be produced. This approach may form the basis for more complex multi-stage retrieval systems in a variety of applications.
TREC, 2018

### Towards Efficient and Effective Query Variant Generation

Relevance modeling and data fusion are powerful yet simple approaches to improving the effectiveness of Information Retrieval Systems. For many of the classic TREC test collections, these approaches were used in many of the top performing retrieval systems. However, these approaches are often inefficient and are therefore rarely applied in production systems which must adhere to strict performance guarantees. Inspired by our recent work with human-derived query variations, we propose a new sampling-based system which provides significantly better efficiency-effectiveness trade-offs while leveraging both relevance modeling and data fusion. We show that our new end-to-end search system approaches the state-of-the-art in effectiveness while still being efficient in practice. Orthogonally, we also show how to leverage query expansion and data fusion to achieve significantly better risk-reward trade-offs than plain relevance modeling approaches.
DESIRES, 2018

### On the Cost of Negation for Dynamic Pruning

Negated query terms allow documents containing such terms to be filtered out of a search results list, supporting disambiguation. In this work, the effect of negation on the efficiency of disjunctive, top-$k$ retrieval is examined. First, we show how negation can be integrated efficiently into two popular dynamic pruning algorithms. Then, we explore the efficiency of our approach, and show that while often efficient, negation can negatively impact the dynamic pruning effectiveness for certain queries.
ECIR, 2018

### Query Driven Algorithm Selection in Early Stage Retrieval

Large scale retrieval systems often employ cascaded ranking architectures, in which an initial set of candidate documents are iteratively refined and re-ranked by increasingly sophisticated and expensive ranking models. In this paper, we propose a unified framework for predicting a range of performance-sensitive parameters based on minimizing end-to-end effectiveness loss. The framework does not require relevance judgments for training, is amenable to predicting a wide range of parameters, allows for fine tuned efficiency-effectiveness trade-offs, and can be easily deployed in large scale search systems with minimal overhead. As a proof of concept, we show that the framework can accurately predict a number of performance parameters on a query-by-query basis, allowing efficient and effective retrieval, while simultaneously minimizing the tail latency of an early-stage candidate generation system. On the $50$ million document ClueWeb09B collection, and across $25{,}000$ queries, our hybrid system can achieve superior early-stage efficiency to fixed parameter systems without loss of effectiveness, and allows more finely-grained efficiency-effectiveness trade-offs across the multiple stages of the retrieval system.
WSDM, 2018

### Early Termination Heuristics for Score-at-a-Time Index Traversal

Score-at-a-Time index traversal is a query processing approach which supports early termination in order to balance efficiency and effectiveness trade-offs. In this work, we explore new techniques which extend a modern Score-at-a-Time traversal algorithm to allow for parallel postings traversal. We show that careful integration of parallel traversal can improve both efficiency and effectiveness when compared with current single threaded early termination approaches. In addition, we explore the various trade-offs for differing early termination heuristics, and propose hybrid systems which parallelize long running queries, while processing short running queries with only a single thread.

### RMIT at the NTCIR-13 We Want Web Task

Furthering the state-of-the-art in adhoc web search is one of the underlying goals for the NTCIR We Want Web (WWW) task. Adhoc search can be viewed as a bridge connecting many of the specialized sub-fields that are a result of the way people connect to and use information access systems. Since this is the first year of the WWW task, and no training data was provided for the English subtask, we focused on classic techniques for improving effectiveness in lieu of modern techniques based on supervised learning. In particular, we explored the use of Markov Random Field Models (MRFs), static document features, field-based weighting, and query expansion. This round we made extensive use of the Indri search system and the flexible query language it provides to produce effective results.
NTCIR, 2017

### RMIT at the 2017 TREC CORE Track

The 2017 TREC CORE Track is a re-run of the classic TREC ad hoc search evaluation campaign, with the vision of establishing new methodologies for creating IR test collections. The previous TREC newswire ad hoc task was the 2004 Robust Track, where the emphasis was on improving the effectiveness of poorly performing topics in previous tracks. The TREC CORE 2017 track reuses the Robust 2004 topic set, for the development of relevance judgments over a new New York Times corpus, composed of newswire articles published between 1987 and 2007. For this task, RMIT contributed three different runs, including two manual, and one automatic run.
TREC, 2017

### Managing Tail Latencies in Large Scale IR Systems

With the growing popularity of the world-wide-web and the increasing accessibility of smart devices, data is being generated at a faster rate than ever before. This presents scalability challenges to web-scale search systems – how can we efficiently index, store and retrieve such a vast amount of data? A large amount of prior research has attempted to address many facets of this question, with the invention of a range of efficient index storage and retrieval frameworks that are able to efficiently answer most queries. However, the current literature generally focuses on improving the mean or median query processing time in a given system. In the proposed PhD project, we focus on improving the efficiency of high percentile tail latencies in large scale IR systems while minimising end-to-end effectiveness loss. Although there is a wealth of prior research involving improving the efficiency of large scale IR systems, the most relevant prior work involves predicting long-running queries and processing them in various ways to avoid large query processing times. Prediction is often done through pre-trained models based on both static and dynamic features from queries and documents. Many different approaches to reducing the processing time of long running queries have been proposed, including parallelising queries that are predicted to run slowly, scheduling queries based on their predicted run time, and selecting or modifying the query processing algorithm depending on the load of the system. Considering the specific focus on tail latencies in large-scale IR systems, the proposed research aims to: (i) study what causes large tail latencies to occur in large-scale web search systems, (ii) propose a framework to mitigate tail latencies in multi-stage retrieval through the prediction of a vast range of query-specific efficiency parameters, (iii) experiment with mixed-mode query semantics to provide efficient and effective querying to reduce tail latencies, and (iv) propose a time-bounded solution for Document-at-a-Time (DaaT) query processing which is suitable for current web search systems. As a preliminary study, Crane et al. compared some state-of-the-art query processing strategies across many modern collections. They found that although modern DaaT dynamic pruning strategies are very efficient for ranked disjunctive processing, they have a much larger variance in processing times than Score-at-a-Time (SaaT) strategies which have a similar efficiency profile regardless of query length or the size of the required result set. Furthermore, Mackenzie et al. explored the efficiency trade-offs for paragraph retrieval in a multi-stage question answering system. They found that DaaT dynamic pruning strategies could efficiently retrieve the top-1,000 candidate paragraphs for very long queries. Extending on prior work, Mackenzie et al. showed how a range of per-query efficiency settings can be accurately predicted such that $99.99$ percent of queries are serviced in less than $200$ ms without noticeable effectiveness loss. In addition, a reference list framework was used for training models such that no relevance judgements or annotations were required. Future work will focus on improving the candidate generation stage in large-scale multi-stage retrieval systems. This will include further exploration of index layouts, traversal strategies, and query rewriting, with the aim of improving early stage efficiency to reduce the system tail latency, while potentially improving end-to-end effectiveness.
SIGIR, 2017

### A Comparison of Document-at-a-Time and Score-at-a-Time Query Evaluation

We present an empirical comparison between document-at-a-time (DaaT) and score-at-a-time (SaaT) document ranking strategies within a common framework. Although both strategies have been extensively explored, the literature lacks a fair, direct comparison: such a study has been difficult due to vastly different query evaluation mechanics and index organizations. Our work controls for score quantization, document processing, compression, implementation language, implementation effort, and a number of details, arriving at an empirical evaluation that fairly characterizes the performance of three specific techniques: WAND (DaaT), BMW (DaaT), and JASS (SaaT). Experiments reveal a number of interesting findings. The performance gap between WAND and BMW is not as clear as the literature suggests, and both methods are susceptible to tail queries that may take orders of magnitude longer than the median query to execute. Surprisingly, approximate query evaluation in WAND and BMW does not significantly reduce the risk of these tail queries. Overall, JASS is slightly slower than either WAND or BMW, but exhibits much lower variance in query latencies and is much less susceptible to tail query effects. Furthermore, JASS query latency is not particularly sensitive to the retrieval depth, making it an appealing solution for performance-sensitive applications where bounds on query latencies are desirable.
WSDM, 2017

### RMIT at the TREC 2016 LiveQA Track

This paper describes the four systems RMIT fielded for the TREC 2016 LiveQA task and the associated experiments. Similar to last year, the results show that simple solutions tend to work best, and that our improved baseline systems achieved an above-average performance. We use a commercial search engine as a first stage retrieval mechanism and compare it with our internal system which uses a carefully curated document collection. Somewhat surprisingly, we found that on average the small curated collection performed better within our current framework, warranting further studies on when and when not to use an external resource, such as a publicly available search engine API. Finally, we show that small improvements to performance can substantially reduce failure rates.
TREC, 2016

### Efficient Location-Aware Web Search

Mobile search is quickly becoming the most common mode of search on the internet. This shift is driving changes in user behaviour, and search engine behaviour. Just over half of all search queries from mobile devices have local intent, making location-aware search an increasingly important problem. In this work, we compare the efficiency and effectiveness of two general types of geographical search queries, range queries and $k$-nearest-neighbor queries, for common web search tasks. We test state-of-the-art spatial-textual indexing and search algorithms for both query types on two large datasets. Finally, we present a rank-safe dynamic pruning algorithm that is simple to implement and use with current inverted indexing techniques. Our algorithm is more efficient than the tightly coupled best-in-breed hybrid indexing algorithms that are commonly used for top-$k$ spatial textual queries, and more likely to find relevant documents than techniques derived from range queries.