Publications

I believe in open access to scholarly information, and provide the “author versions” of my research manuscripts here. Please note that they may differ from the final published versions, which can be found in the respective publisher’s digital library.


Since their emergence some two decades ago, indexes based on the Burrows-Wheeler transform (BWT) have been intensely studied and today find wide use in genomics, where they form the basis of software tools for read alignment and $k$-mer lookup—routine tasks in modern data-intensive bioinformatics pipelines. BWT-based indexes reduce an existential query for a pattern $P$ of length $m$ to a sequence of up to $m$ pairs of rank queries on a sequence derived from the underlying indexed data. In general these rank queries exhibit poor locality of memory reference, with each pair causing one or two cache misses, something that has become generally accepted as a limitation of these indexes. However, in the above-mentioned applications a typical experimental run will search for hundreds of millions—even billions—of patterns using the index. In this paper we show that, taken across such a large set of patterns, rank queries do exhibit locality of memory reference and that this can be exploited by reorganising the order in which rank queries are issued. We show this leads to significant performance gains—in particular, $k$-mer lookup queries can be answered several times faster when a batch of patterns is treated holistically.
ALENEX, 2025
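
To make the rank-query access pattern concrete, here is a minimal Python sketch of FM-index backward search, assuming a plain (uncompressed) occurrence-count table rather than the compressed rank structures a real index would use. The pair of `occ` lookups per pattern symbol is exactly the pair of rank queries whose memory behaviour the paper reorganises.

```python
# Sketch of FM-index backward search. Each loop iteration issues a pair
# of rank (occurrence-count) queries -- the accesses that batching
# across many patterns seeks to group by memory region.
# Assumes `bwt` is the Burrows-Wheeler transform of the indexed text.

def build_index(bwt: str):
    alphabet = sorted(set(bwt))
    # C[c] = number of symbols in the text strictly smaller than c
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += bwt.count(c)
    # occ[c][i] = number of occurrences of c in bwt[0:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return C, occ

def count(pattern: str, bwt: str, C, occ) -> int:
    """Return the number of occurrences of `pattern` in the text."""
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):      # one step per pattern symbol...
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]      # ...and a pair of rank queries,
        hi = C[ch] + occ[ch][hi]      # each a potential cache miss
        if lo >= hi:
            return 0
    return hi - lo
```

Batching, as described above, amounts to keeping the `(lo, hi)` states of many in-flight patterns and reordering their next rank queries so that queries touching nearby memory are issued together.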

Experiments often result in the need to compare an observation against a reference, where observation and reference are selections made from some specified domain. The goal is to determine how close the observation is to the ideal result represented by the reference, so that, all other things being equal, systems that achieve outputs closer to the ideal reference can be preferred for deployment. Both observation and reference might be sets of items, or might be ordered sequences (rankings) of items. There are thus four possible combinations between sets and rankings. Three of those possibilities are already familiar to IR researchers, and have received detailed exploration. Here we consider the fourth combination, that of comparing an observation set relative to a reference ranking. We introduce a new measurement that we call rank-biased recall to cover this scenario, and demonstrate its usefulness with a case study from multi-phase ranking. We also present a new top-weighted ‘ranking compared to ranking’ measurement, and show that it represents a complementary assessment to the previous rank-biased overlap mechanism, and possesses distinctive characteristics.
SIGIR-AP, 2024
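
As an illustration of the idea (not necessarily the paper’s exact definition), one natural rank-biased formalization awards geometrically decaying credit for each reference-ranking item that the observation set covers; the persistence parameter `phi` below is an assumed name:

```python
def rank_biased_recall(observed: set, reference: list, phi: float = 0.8) -> float:
    """Illustrative sketch: geometrically top-weighted credit for each
    reference-ranking item found in the observation set.
    As phi -> 1 this approaches ordinary (unweighted) recall."""
    weight = 1.0 - phi            # normaliser so the weights sum to <= 1
    score = 0.0
    for i, item in enumerate(reference):
        if item in observed:
            score += weight * phi ** i
    return score
```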

Conversational assistants are becoming prevalent among the wider population due to their simplicity and increasing utility. However, the shortcomings of these tools are as renowned as their benefits. In this work, we present a “first look” at an extensive collection of conversational queries, aiming to identify limitations and improvement opportunities specifically related to information access (i.e., search interactions). We explore over 600,000 Google Assistant interactions from 173 unique users, examining usage trends and the resulting deficiencies and strengths of these assistants. We aim to provide a balanced assessment, highlighting both the assistant’s shortcomings in supporting users and delivering information relevant to their needs, and the areas where it demonstrates reasonable responses to user inputs. Our analysis shows that, although most users conduct information-seeking tasks, there is little evidence of complex information-seeking behaviour, with most interactions consisting of simple, imperative instructions. Finally, we find that conversational devices allow users to benefit from increased naturalistic interactions and the ability to apply acquired information in situ, a novel observation for conversational information seeking.
CIKM, 2024

Advancements in large language models (LLMs) have changed information retrieval, offering users a more personalised and natural search experience with technologies like OpenAI ChatGPT, Google Bard (Gemini), or Microsoft Copilot. Despite these advancements, research into user tasks and information needs remains scarce. This preliminary work analyses a Google Bard prompt log with 15,023 interactions called the Bard Intelligence and Dialogue Dataset (BIDD), providing an understanding akin to query log analyses. We show that Google Bard prompts are often verbose and structured, encapsulating a broader range of information needs and imperative (e.g., directive) tasks distinct from traditional search queries. We show that LLMs can support users in tasks beyond the three main types based on user intent: informational, navigational, and transactional. Our findings emphasise the versatile application of LLMs across content creation, LLM writing style preferences, and information extraction. We document diverse user interaction styles, showcasing the adaptability of users to LLM capabilities.
SIGIR, 2024

Document expansion is a technique that aims to reduce the likelihood of term mismatch by augmenting documents with related terms or queries. Doc2Query minus minus (Doc2Query--) represents an extension to the expansion process that uses a neural model to identify and remove expansions that may not be relevant to the given document, thereby increasing the quality of the ranking while simultaneously reducing the amount of augmented data. In this work, we conduct a detailed reproducibility study of Doc2Query-- to better understand the trade-offs inherent to document expansion and filtering mechanisms. After successfully reproducing the best-performing method from the Doc2Query-- family, we show that filtering actually harms recall-based metrics on various test collections. Next, we explore whether the two-stage “generate-then-filter” process can be replaced with a single generation phase via reinforcement learning. Finally, we extend our experimentation to learned sparse retrieval models and demonstrate that filtering is not helpful when term weights can be learned. Overall, our work provides a deeper understanding of the behaviour and characteristics of common document expansion mechanisms, and paves the way for developing more efficient yet effective augmentation models.
SIGIR, 2024
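
The “generate-then-filter” pipeline the study reproduces can be sketched in a few lines: generate candidate expansion queries, score each against its source document with a relevance model, and keep only the highest-scoring fraction. The function names and parameters below are placeholders, not the exact configuration of the reproduced system:

```python
# Hedged sketch of a generate-then-filter document-expansion pipeline.
# `generate_queries` and `relevance_score` stand in for a docT5query-style
# generator and a neural relevance filter; both are assumed interfaces.

def expand_document(doc_text: str, generate_queries, relevance_score,
                    n_candidates: int = 20, keep_fraction: float = 0.3) -> str:
    candidates = generate_queries(doc_text, n=n_candidates)
    scored = sorted(candidates,
                    key=lambda q: relevance_score(q, doc_text),
                    reverse=True)
    kept = scored[:max(1, int(keep_fraction * len(scored)))]
    # Append the surviving expansions to the document before indexing.
    return doc_text + " " + " ".join(kept)
```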

The Information Retrieval (IR) community has a rich history of empirically measuring novel retrieval methods in terms of effectiveness and efficiency. However, as the search ecosystem is developing rapidly, comparatively little attention has been paid to evaluating efficiency in recent years, which raises the question of the cost-benefit ratio between effectiveness and efficiency. In this regard, it has become difficult to compare and contrast systems in an empirically fair way. Factors including hardware configurations, software versioning, experimental settings, and measurement methods all contribute to the difficulty of meaningfully comparing search systems, especially where efficiency is a key component of the evaluation. Furthermore, efficiency is no longer limited to time and space but has found new, challenging dimensions that stretch to resource, sample, and energy efficiency and have implications for users, researchers, and the environment. Examining algorithms and models through the lens of efficiency and its trade-off with effectiveness requires revisiting and establishing new standards and principles, from defining relevant concepts, to designing measures, to creating guidelines for making sense of the significance of findings. The third iteration of ReNeuIR aims to bring the community together to debate these questions and collaboratively test and improve a benchmarking framework for efficiency derived from the discussions of the first two iterations of this workshop. We provide a first prototype of this framework by organizing a shared task track focused on comparability and reproducibility at the workshop.
SIGIR, 2024

It is tempting to assume that because effectiveness metrics have free choice to assign scores to search engine result pages (SERPs) there must thus be a similar degree of freedom as to the relative order that SERP pairs can be put into. In fact that second freedom is, to a considerable degree, illusory. That’s because if one SERP in a pair has been given a certain score by a metric, fundamental ordering constraints in many cases then dictate that the score for the second SERP must be either not less than, or not greater than, the score assigned to the first SERP. We refer to these fixed relationships as innate pairwise SERP orderings. Our first goal in this work is to describe and defend those pairwise SERP relationship constraints, and tabulate their relative occurrence via both exhaustive and empirical experimentation. We then consider how to employ such innate pairwise relationships in IR experiments, leading to a proposal for a new measurement paradigm. Specifically, we argue that tables of results in which many different metrics are listed for champion versus challenger system comparisons should be avoided; and that instead a single metric be argued for in principled terms, with any relationships identified by that metric then reinforced via an assessment of the innate relationship as to whether other metrics - indeed, all other metrics - are likely to yield the same system-vs-system outcome.
JASIST, 2024
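
A concrete instance of an innate ordering: if one SERP’s relevance grade at every rank is at least that of another, then any monotonic metric must score the first at least as highly, regardless of its user model. A minimal sketch of that dominance test (an assumption-level illustration, not the paper’s full taxonomy of constraints):

```python
def dominates(serp_a: list, serp_b: list) -> bool:
    """True if serp_a's relevance grade at every rank is >= serp_b's,
    in which case any reasonable (monotonic) metric M must satisfy
    M(serp_a) >= M(serp_b) -- an 'innate' pairwise SERP ordering."""
    return all(a >= b for a, b in zip(serp_a, serp_b))

# Example: [3, 2, 0] dominates [2, 2, 0], so every monotonic metric
# must agree on the ordering of this SERP pair.
```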

Dense indexes derived from whole-of-document neural models are now more effective at locating likely-relevant documents than are conventional term-based inverted indexes. That effectiveness comes at a cost, however: inverted indexes require less than a byte per posting to store, whereas dense indexes store a fixed-length vector of floating point coefficients (typically 768) for each document, making them potentially an order of magnitude larger. In this paper we consider compression of indexes employing dense vectors. Only limited space savings can be achieved via lossless compression techniques, but we demonstrate that dense indexes are responsive to lossy techniques that sacrifice controlled amounts of numeric resolution in order to gain compressibility. We describe suitable schemes, and, via experiments on three different collections, show that substantial space savings can be achieved with minimal loss of ranking fidelity. These techniques further boost the attractiveness of dense indexes for practical use.
SIGIR-AP, 2023
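
A typical lossy scheme of the kind explored above is scalar quantization: trade the 32-bit floats of each document vector for low-resolution integer codes. A minimal numpy sketch using uniform 8-bit quantization (the paper’s actual schemes and parameters may differ):

```python
import numpy as np

def quantize_uint8(vectors: np.ndarray):
    """Uniform scalar quantization of float32 vectors to one byte per
    coefficient -- a 4x space saving before any further compression."""
    lo, hi = float(vectors.min()), float(vectors.max())
    scale = (hi - lo) / 255.0
    if scale == 0.0:
        scale = 1.0                   # degenerate all-equal input
    codes = np.round((vectors - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes: np.ndarray, lo: float, scale: float) -> np.ndarray:
    return codes.astype(np.float32) * scale + lo

# Ranking fidelity can then be checked by comparing inner products on
# the reconstructed vectors against those on the originals.
```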

The SPLADE (SParse Lexical AnD Expansion) model is a highly effective approach to learned sparse retrieval, where documents are represented by term impact scores derived from large language models. During training, SPLADE applies regularization to ensure postings lists are kept sparse – with the aim of mimicking the properties of natural term distributions – allowing efficient and effective lexical matching and ranking. However, we hypothesize that SPLADE may encode additional signals into common postings lists to further improve effectiveness. To explore this idea, we perform a number of empirical analyses where we re-train SPLADE with different, controlled vocabularies and measure how effective it is at ranking passages. Our findings suggest that SPLADE can effectively encode useful ranking signals in documents even when the vocabulary is constrained to terms that are not traditionally useful for ranking, such as stopwords or even random words.
ICTIR, 2023

Multifaceted, empirical evaluation of algorithmic ideas is one of the central pillars of Information Retrieval (IR) research. The IR community has a rich history of studying the effectiveness of indexes, retrieval algorithms, and complex machine learning rankers and, at the same time, quantifying their computational costs, from creation and training to application and inference. As the community moves towards even more complex deep learning models, questions on efficiency have once again become relevant with renewed urgency. Indeed, efficiency is no longer limited to time and space; instead it has found new, challenging dimensions that stretch to resource-, sample- and energy-efficiency with ramifications for researchers, users, and the environment alike. Examining algorithms and models through the lens of holistic efficiency requires the establishment of standards and principles, from defining relevant concepts, to designing metrics, to creating guidelines for making sense of the significance of new findings. The second iteration of the ReNeuIR workshop aims to bring the community together to debate these questions, with the express purpose of moving towards a common benchmarking framework for efficiency.
SIGIR, 2023

Efficiently retrieving the top-$k$ documents for a given query is a fundamental operation in many search applications. Dynamic pruning algorithms accelerate top-$k$ retrieval over inverted indexes by skipping documents that are not able to enter the current set of results. However, the performance of these algorithms depends on a number of variables such as the ranking function, the order of documents within the index, and the number of documents to be retrieved. In this paper, we propose a diagnostic framework, Dyno, for profiling and visualizing the performance of dynamic pruning algorithms. Our framework captures processing traces during retrieval, allowing the operations of the index traversal algorithm to be visualized. These visualizations support both query-level and system-to-system comparisons, enabling performance characteristics to be readily understood for different systems. Dyno benefits both academics and practitioners by furthering our understanding of the behavior of dynamic pruning algorithms, allowing better design choices to be made during experimentation and deployment.
SIGIR, 2023

Large scale web search engines provide sub-second response times to interactive user queries. However, not all search traffic arises interactively - cache updates, internal testing and prototyping, generation of training data, and web mining tasks all contribute to the workload of a typical search service. If these non-interactive query components are collected together and processed as a batch, the overall execution cost of query processing can be significantly reduced. In this reproducibility study, we revisit query batching in the context of large-scale conjunctive processing over inverted indexes, considering both on-disk and in-memory index arrangements. Our exploration first verifies the results reported in the reference work [Ding et al., WSDM 2011], and then provides novel approaches for batch processing which give rise to better time-space trade-offs than have been previously achieved.
ECIR, 2023

In a dynamic retrieval system, documents must be ingested as they arrive, and be immediately findable by queries. Our purpose in this paper is to describe an index structure and processing regime that accommodates that requirement for immediate access, seeking to make the ingestion process as streamlined as possible, while at the same time seeking to make the growing index as small as possible, and seeking to make term-based querying via the index as efficient as possible. We describe a new compression operation and a novel approach to extensible lists which together facilitate that triple goal. In particular, the structure we describe provides incremental document-level indexing using as little as two bytes per posting and only a small amount more for word-level indexing; provides fast document insertion; supports immediate and continuous queryability; provides support for fast conjunctive queries and similarity score-based ranked queries; and facilitates fast conversion of the dynamic index to a “normal” static compressed inverted index structure. Measurement of our new mechanism confirms that in-memory dynamic document-level indexes for collections into the gigabyte range can be constructed at a rate of two gigabytes/minute using a typical server architecture, that multi-term conjunctive Boolean queries can be resolved in just a few milliseconds each on average even while new documents are being concurrently ingested, and that the net memory space required for all of the required data structures amounts to an average of as little as two bytes per stored posting, less than half the space required by the best previous mechanism.
IPM, 2023

Researchers have had much recent success with ranking models based on so-called learned sparse representations generated by transformers. One crucial advantage of this approach is that such models can exploit inverted indexes for top-$k$ retrieval, thereby leveraging decades of work on efficient query evaluation. Yet, there remain many open questions about how these learned representations fit within the existing literature, which our work aims to tackle using four representative learned sparse models. We find that impact weights generated by transformers appear to greatly reduce opportunities for skipping and early exiting optimizations in well-studied document-at-a-time (DaaT) approaches. Similarly, “off-the-shelf” application of score-at-a-time (SaaT) processing exhibits a mismatch between these weights and assumptions behind accumulator management strategies. Building on these observations, we present solutions to address deficiencies with both DaaT and SaaT approaches, yielding substantial speedups in query evaluation. Our detailed empirical analysis demonstrates that both methods lie on the effectiveness–efficiency Pareto frontier, indicating that the optimal choice for deployment depends on operational constraints.
TOIS, 2022

The array is a fundamental data object in most programs. Its key functionality – storage of and access to a set of same-type elements in $O(1)$ time per operation – is also widely employed in other more sophisticated data structures. In an extensible array the number of elements in the set is unknown at the time the program is initiated, and the array might continue to grow right through the program’s execution. In this paper we explore the use of extensible arrays in connection with the task of inverted index construction. We develop and test a space-efficient extensible array arrangement that has been previously described but not to our knowledge employed in practice, and show that it adds considerable flexibility to the index construction process while incurring only modest run-time overheads as a result of access indirections.
ADCS, 2022
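
A common space-efficient arrangement in this spirit stores elements in a sequence of blocks of growing size, reached through one level of indirection: access costs one extra pointer hop, but growth never copies existing elements. A simplified sketch with doubling block sizes (an illustration of the general idea, not the exact layout tested in the paper):

```python
class ExtensibleArray:
    """Blocks of doubling size reached via a small directory.
    Appends never relocate existing elements; access costs one
    indirection. Wasted space is at most one partially full block."""

    def __init__(self):
        self.blocks = []       # directory of blocks
        self.size = 0

    def _capacity(self, b: int) -> int:
        return 1 << b          # block b holds 2**b elements

    def append(self, value):
        if not self.blocks or len(self.blocks[-1]) == self._capacity(len(self.blocks) - 1):
            self.blocks.append([])
        self.blocks[-1].append(value)
        self.size += 1

    def __getitem__(self, i: int):
        # block index = position of the highest set bit of (i + 1),
        # since block b starts at global offset 2**b - 1
        b = (i + 1).bit_length() - 1
        return self.blocks[b][i + 1 - (1 << b)]
```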

Novel inverted index-based learned sparse ranking models provide more effective, but less efficient, retrieval performance compared to traditional ranking models like BM25. In this paper, we introduce a technique we call postings clipping to improve the query efficiency of learned representations. Our technique amplifies the benefit of dynamic pruning query processing techniques by accounting for changes in term importance distributions of learned ranking models. The new clipping mechanism accelerates top-$k$ retrieval by up to 9.6X without any loss in effectiveness.
EMNLP Findings, 2022
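
The intuition, as we read it: learned impact distributions concentrate a few very large weights at the head of each postings list, and those outliers inflate the upper bounds that max-score/WAND-style pruning relies on. Splitting each list into a short high-impact head and a low-impact tail restores tight bounds for the bulk of the list. A hedged index-time sketch (the clip depth is an assumed parameter):

```python
def clip_postings(postings, clip_fraction=0.01):
    """Split a term's postings into a short 'head' of the highest-impact
    entries and a 'tail' of the rest. The tail's much lower maximum
    impact gives dynamic pruning tighter upper bounds, so more of the
    list can be skipped. `postings` is a list of (doc_id, impact) pairs."""
    by_impact = sorted(postings, key=lambda p: p[1], reverse=True)
    cut = max(1, int(clip_fraction * len(postings)))
    head = sorted(by_impact[:cut])   # few outlier high-impact postings
    tail = sorted(by_impact[cut:])   # rest, restored to doc-id order
    return head, tail
```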

Web connectivity graphs and similar linked data such as inverted indexes are important components of the information access systems provided by social media and web search services. The Bipartite Graph Partitioning mechanism of Dhulipala et al. [KDD 2016] relabels the vertices of large sparse graphs, seeking to enhance compressibility and thus reduce the storage space occupied by these costly structures. Here we develop a range of algorithmic and heuristic refinements to Bipartite Graph Partitioning (BP) that lead to faster computation of space-reducing vertex orderings whilst continuing to apply the same broad algorithmic paradigm. Using a range of web graph and information retrieval system index data as test cases, we demonstrate an implementation that executes up to approximately four times faster than the baseline implementation we commenced with, while holding compressibility approximately constant. We have also improved the asymptotic execution time of BP by replacing a sorting step by a customized median-finding step.
TKDE, 2022

Impact-ordered index organizations are suited to score-at-a-time query evaluation strategies. A key advantage of score-at-a-time processing is that query latency can be tightly controlled, leading to lower tail latency and less latency variance overall. While score-at-a-time evaluation strategies have been explored in the literature, there is currently only one notable system that promotes impact-ordered indexing and efficient score-at-a-time query processing. In this paper, we propose an alternative implementation of score-at-a-time retrieval over impact-ordered indexes in the Rust programming language. We detail the efficiency-effectiveness characteristics of our implementation through a range of experiments on two test collections. Our results demonstrate the efficiency of our proposed model in terms of both single-threaded latency, and multi-threaded throughput capability. We make our system publicly available to benefit the community and to promote further research in efficient impact-ordered query processing.
DESIRES, 2022

The use of offline effectiveness metrics is one of the cornerstones of evaluation in information retrieval. Static resources that include test collections and sets of topics, the corresponding relevance judgments connecting them, and metrics that map document rankings from a retrieval system to numeric scores have been used for multiple decades as an important way of comparing systems. The basis behind this experimental structure is that the metric score for a system can serve as a surrogate measurement for user satisfaction. Here we introduce a user behavior framework that extends the C/W/L family. The essence of the new framework – which we call C/W/L/A – is that the user actions that are undertaken while reading the ranking can be considered separately from the benefit that each user will have derived as they exit the ranking. This split structure allows the great majority of current effectiveness metrics to be systematically categorized, and thus their relative properties and relationships to be better understood; and at the same time permits a wide range of novel combinations to be considered. We then carry out experiments using relevance judgments, document rankings, and user satisfaction data from two distinct sources, comparing the patterns of metric scores generated, and showing that those metrics vary quite markedly in terms of their ability to predict user satisfaction.
SIGIR, 2022
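
As background for the split the paper proposes, the standard C/W/L relationships connect a continuation probability $C(i)$, a weight (expected attention) $W(i)$, and a last-rank (stopping) distribution $L(i)$; the expected rate of gain then scores a ranking via per-rank gains $r(i)$. The A component introduced in the paper separates the user’s actions from this gain accounting. The summary below follows the usual C/W/L presentation, not the paper’s full development:

```latex
% C/W/L relationships: C(i) is the probability that a user who has
% inspected rank i continues to rank i+1; W(i) is the expected fraction
% of attention paid to rank i; L(i) is the probability that rank i is
% the last rank inspected.
W(i) = \frac{\prod_{j=1}^{i-1} C(j)}{\sum_{k=1}^{\infty} \prod_{j=1}^{k-1} C(j)},
\qquad
L(i) = W(i)\,\bigl(1 - C(i)\bigr),
\qquad
\mathrm{ERG} = \sum_{i=1}^{\infty} W(i)\, r(i).
```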

Neural information retrieval architectures based on transformers such as BERT are able to significantly improve system effectiveness over traditional sparse models such as BM25. Though highly effective, these neural approaches are very expensive to run, making them difficult to deploy under strict latency constraints. To address this limitation, recent studies have proposed new families of learned sparse models that try to match the effectiveness of learned dense models, while leveraging the traditional inverted index data structure for efficiency. Current learned sparse models learn the weights of terms in documents and, sometimes, queries; however, they exploit different vocabulary structures, document expansion techniques, and query expansion strategies, which can make them slower than traditional sparse models such as BM25. In this work, we propose a novel indexing and query processing technique that exploits a traditional sparse model’s ‘guidance’ to efficiently traverse the index, allowing the more effective learned model to execute fewer scoring operations. Our experiments show that our guided processing heuristic is able to boost the efficiency of the underlying learned sparse model by a factor of four without any measurable loss of effectiveness.
SIGIR, 2022
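
The guidance idea can be sketched as a two-index traversal: a BM25-style index nominates which documents are worth scoring, and the learned-impact index supplies the scores for only those candidates. A simplified, deliberately non-rank-safe Python sketch with assumed in-memory postings maps:

```python
from collections import Counter
import heapq

def guided_top_k(query_terms, bm25_index, learned_index, k=1000, budget=5000):
    """Hedged sketch of guided traversal. `bm25_index` and
    `learned_index` map a term to a {doc_id: weight} dict; both
    structures, and the candidate budget, are assumptions."""
    # Phase 1: candidate generation with the cheap traditional model.
    bm25_scores = Counter()
    for t in query_terms:
        for doc, w in bm25_index.get(t, {}).items():
            bm25_scores[doc] += w
    candidates = [d for d, _ in bm25_scores.most_common(budget)]

    # Phase 2: score only the candidates with the learned impacts,
    # so the effective-but-slow model executes far fewer operations.
    learned = Counter()
    for t in query_terms:
        plist = learned_index.get(t, {})
        for doc in candidates:
            if doc in plist:
                learned[doc] += plist[doc]
    return heapq.nlargest(k, learned.items(), key=lambda kv: kv[1])
```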

Document-at-a-time (DaaT) and score-at-a-time (SaaT) query evaluation techniques represent different approaches to top-$k$ retrieval with inverted indexes. While modern deployed systems are dominated by DaaT methods, the academic literature has seen decades of debate about the merits of both. Recently, there has been renewed interest in SaaT methods for learned sparse lexical models, where studies have shown that transformers generate ‘wacky weights’ that appear to reduce opportunities for optimizations in DaaT methods. However, researchers currently lack an easy-to-use SaaT system to support further exploration. This is the gap that our work tries to fill. Starting with JASS, a modern SaaT system, we built Python bindings to create PyJASS, and then further integrated PyJASS into the Pyserini IR toolkit. The result is a common frontend to both a DaaT system (Lucene) and a SaaT system (JASS). We demonstrate how recent experiments with a wide range of learned sparse lexical models can be easily reproduced. Our contribution is a framework that enables future research comparing DaaT and SaaT methods in the context of modern neural retrieval models.
SIGIR, 2022

The 25th Australasian Document Computing Symposium (ADCS 2021) took place as a virtual seminar series from December 2021 through to February 2022, co-hosted by the University of Melbourne and RMIT University, and held in cooperation with ACM SIGIR. Five sessions were dedicated to presenting and discussing nine accepted research papers, book-ended by opening and closing keynotes. The virtual nature of ADCS 2021, coupled with free registration, attracted a broader audience than usual, with 110 registrants from 16 countries, including many from outside Australia and New Zealand.
SIGIR Forum, 2022

We detail a deep learning approach based on the transformer architecture for performing fake news detection. The proposed approach is composed of a deep learning network which receives as input the claim to be verified, a series of predictions made by other models, and supporting evidence in the form of ranked passages. We validate our approach on the data from the CLEF2022-CheckThat! Lab (Task 3: Fake News Detection), where we achieve an F1-score of 0.275, ranking 10th out of 25 participants.
CLEF, 2022

In top-$k$ ranked retrieval the goal is to efficiently compute an ordered list of the highest scoring $k$ documents according to some stipulated similarity function such as the well-known BM25 approach. In most implementation techniques a min-heap of size $k$ is used to track the top scoring candidates. In this work we consider the question of how best to retrieve the second page of search results, given that a first page has already been computed; that is, identification of the documents at ranks $k + 1$ to $2k$ for some query. Our goal is to understand what information is available as a by-product of the first-page scoring, and how it can be employed to accelerate the second-page computation, assuming that the second-page of results is required for only a fraction of the query load. We propose a range of simple, yet efficient, next-page retrieval techniques which are suitable for accelerating Document-at-a-Time mechanisms, and demonstrate their performance on three large text collections.
IRJ, 2022
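
One simple strategy in the spirit described above: deepen the heap during first-page evaluation and cache the overflow, so that a second-page request needs no re-execution at all. A hedged sketch (the trade-off, not shown, is that a deeper heap weakens the pruning threshold slightly during the first-page run):

```python
import heapq

def first_page(scored_docs, k):
    """Hedged sketch of one next-page technique: maintain a heap of
    size 2k while scoring page one, answer page one from it, and cache
    ranks k+1..2k for the fraction of queries that request page two.
    `scored_docs` yields (doc, score) pairs."""
    heap = []                          # min-heap of (score, doc), size <= 2k
    for doc, score in scored_docs:
        if len(heap) < 2 * k:
            heapq.heappush(heap, (score, doc))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc))
    ranked = sorted(heap, reverse=True)
    page_one = ranked[:k]
    cached_page_two = ranked[k:]       # served on demand, no re-execution
    return page_one, cached_page_two
```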

Inverted indexes continue to be a mainstay of text search engines, allowing efficient querying of large document collections. While there are a number of possible organizations, document-ordered indexes are the most common, since they are amenable to various query types, support index updates, and allow for efficient dynamic pruning operations. One disadvantage with document-ordered indexes is that high-scoring documents can be distributed across the document identifier space, meaning that index traversal algorithms that terminate early might put search effectiveness at risk. The alternative is impact-ordered indexes, which primarily support top-$k$ disjunctions, but also allow for anytime query processing, where the search can be terminated at any time, with search quality improving as processing latency increases. Anytime query processing can be used to effectively reduce high-percentile tail latency which is essential for operational scenarios in which a service level agreement (SLA) imposes response time requirements. In this work, we show how document-ordered indexes can be organized such that they can be queried in an anytime fashion, enabling strict latency control with effective early termination. Our experiments show that processing document-ordered topical segments selected by a simple score estimator outperforms existing anytime algorithms, and allows query runtimes to be accurately limited in order to comply with SLA requirements.
TOIS, 2022

We consider the precedence and priority issues that arise from the increasingly common trend of distributing unrefereed preprints via services such as arXiv prior to or at the same time as they are submitted for peer review, and the effect that this practice can have on the integrity of the scientific review process. We offer some suggestions which could be incorporated into the peer-review process to provide better safeguards to authors.
SIGIR Forum, 2021

Index reordering techniques allow document collections to be renumbered, with the goal of developing a permutation of the initial document ordinal identifiers that places documents that are (somehow) like each other into positions near each other in the permuted ordering. The clustering that results allows inverted index size to be reduced, since each term’s posting list is more likely to contain a non-uniform set of inter-document integer gaps. Reordering is normally performed once, at the time the index is created. Here we consider the role of index reordering in collections that grow over time, noting that simply appending new documents to the collection may erode the effectiveness of an earlier reordering. In particular, we discuss methods for maintaining and reinstating reorderings as document collections grow, and measure the effectiveness of those techniques on a large corpus of English news articles. We also provide experimental results that illustrate the benefits of reordering in terms of query execution time.
Winner of the best paper award
ADCS, 2021

The recent MSMARCO passage retrieval collection has allowed researchers to develop highly tuned retrieval systems. One aspect of this data set that makes it distinctive compared to traditional corpora is that most of the topics only have a single answer passage marked relevant. Here we carry out a “what if” sensitivity study, asking whether a set of systems would still have the same relative performance if more passages per topic were deemed to be “relevant”, exploring several mechanisms for identifying sets of passages to be so categorized. Our results show that, in general, while run scores can vary markedly if additional plausible passages are presumed to be relevant, the derived system ordering is relatively insensitive to additional relevance, providing support for the methodology that was used at the time the MSMARCO passage collection was created.
arXiv pre-print, 2021

Recent advances in retrieval models based on learned sparse representations generated by transformers have led us to, once again, consider score-at-a-time query evaluation techniques for the top-$k$ retrieval problem. Previous studies comparing document-at-a-time and score-at-a-time approaches have consistently found that the former approach yields lower mean query latency, although the latter approach has more predictable query latency. In our experiments with four different retrieval models that exploit representational learning with bags of words, we find that transformers generate ‘wacky weights’ that appear to greatly reduce the opportunities for skipping and early exiting optimizations that lie at the core of standard document-at-a-time techniques. As a result, score-at-a-time approaches appear to be more competitive in terms of query evaluation latency than in previous studies. We find that, if an effectiveness loss of up to three percent can be tolerated, a score-at-a-time approach can yield substantial gains in mean query latency while at the same time dramatically reducing tail latency.
arXiv pre-print, 2021

Text retrieval using bags of words is typically formulated as inner products between vector representations of queries and documents, realized in query evaluation algorithms that traverse postings in an inverted index. Viewed in database terms, this captures a tight coupling between the ‘logical’ aspects of ranking (i.e., term weighting) and the ‘physical’ aspects of ranking (query evaluation). We argue that explicitly decoupling these two aspects offers a framework for thinking about the relationship between sparse retrieval techniques and the rapidly growing literature on dense retrieval techniques.
DESIRES, 2021

We revisit the Bipartite Graph Partitioning approach to document reordering (Dhulipala et al., KDD 2016), and consider a range of algorithmic and heuristic refinements that lead to faster computation of index-minimizing document orderings. Our final implementation executes approximately four times faster than the reference implementation we commence with, and obtains the same, or slightly better, compression effectiveness on three large text collections.
SIGIR, 2021

When faced with an information need, different users are likely to issue different queries. A renewed interest in these user query variations has resulted in a number of collections, tasks, and studies which utilize multiple queries for each topic. The most commonly applied technique for generating query variations is to show a short backstory to a pool of crowdworkers, and ask each of them to provide a keyword query that they would expect to provide more information pertaining to the backstory. In this short paper we explore whether the length of the backstory and the mode in which the backstory is conveyed to crowdworkers affect the resulting queries. Our experiments demonstrate that both the length of the backstory and the mode in which the backstory is delivered influence the resultant query variations; and that there is a consequent downstream implication in terms of forming the judgment pools necessary to assess systems in the presence of query variations.
ICTIR, 2021

We explore the relationship between expected reciprocal rank (ERR) and the metrics that are available under the C/W/L framework. On the surface, it appears that the user browsing model associated with ERR can be directly injected into a C/W/L arrangement, to produce system measurements equivalent to those generated from ERR. That assumption is now known to be invalid, and demonstration of the impossibility of ERR being described via C/W/L choices forms the first part of our work. Given that ERR cannot be accommodated within the C/W/L framework, we then explore the extent to which practical use of ERR correlates with metrics that do fit within the C/W/L user browsing model. In this part of the investigation we present a range of shallow-evaluation C/W/L variants that have very high correlation with ERR when compared in experiments involving a large number of TREC runs. That is, while ERR itself is not a C/W/L metric, there are other weighted-precision computations that fit with the user model assumed by C/W/L, and yield system comparisons almost indistinguishable from those generated via the use of ERR.
Winner of the best paper award
ICTIR, 2021
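
For reference, the ERR definition at the centre of this analysis (Chapelle et al.), for graded relevance $g_i$ at rank $i$:

```latex
% ERR: R_i is the probability that the document at rank i satisfies
% the user; 1/i is the reciprocal-rank utility at the stopping point.
R_i = \frac{2^{g_i} - 1}{2^{g_{\max}}},
\qquad
\mathrm{ERR} = \sum_{i=1}^{n} \frac{1}{i}\, R_i \prod_{j=1}^{i-1} \bigl(1 - R_j\bigr).
```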

Search engine users retrieve relevant information for an information need using keyword queries. Different users may have similar information needs, but use different query terms. The resulting user query variations can provide a wealth of useful information to IR researchers. Most recently, the keystroke-level telemetry data gathered as part of the CC-News-En collection provides important insights into how users create queries for a search task, at a level of detail not possible using a normal query log. In this demo, we present an interactive tool that enables practitioners to visualize users formulating queries. Our new tool is a temporal simulation of the typing behavior of crowdworkers, grouped by information need. It provides the ability to directly compare the cognitive behavior of multiple users simultaneously, and observe how query keyword selection and ordering happens before a final query is submitted to a search engine. To demonstrate the benefit of our tool, we include a qualitative study of four different user behavior patterns which were observed in the CC-News-En collection.
CHIIR, 2021

Research activity spanning more than five decades has led to index organizations, compression schemes, and traversal algorithms that allow extremely rapid response to ranked queries against very large text collections. However, little attention has been paid to the interactions between these many components, and the additivity of algorithmic improvements has not been explored. Here we examine the extent to which efficiency improvements add up. We employ four query processing algorithms, four compression codecs, and all possible combinations of four distinct further optimizations, and compare the performance of the 256 resulting systems to determine when and how different optimizations interact. Our results over two test collections show that efficiency enhancements are, for the most part, additive, and that there is little risk of negative interactions. In addition, our detailed profiling across this large pool of systems leads to key insights as to why the various individual enhancements work well, and indicates that optimizing ‘simpler’ implementations can result in higher query throughput than is available from non-optimized versions of the more ‘complex’ techniques, with clear implications for the choices needing to be made by practitioners.
CIKM, 2020

We describe a static, open-access news corpus using data from the Common Crawl Foundation, who provide free, publicly available web archives, including a continuous crawl of international news articles published in multiple languages. Our derived corpus, CC-News-En, contains 44 million English documents collected between September 2016 and March 2018. The collection is comparable in size with the number of documents typically found in a single shard of a large-scale, distributed search engine, and is four times larger than the news collections previously used in offline information retrieval experiments. To complement the corpus, 173 topics were curated using titles from Reddit threads, forming a temporally representative sampling of relevant news topics over the 583 day collection window. Information needs were then generated using automatic summarization tools to produce textual and audio representations, and used to elicit query variations from crowdworkers, with a total of 10,437 queries collected against the 173 topics. Of these, 10,089 include key-stroke level instrumentation that captures the timings of character insertions and deletions made by the workers while typing their queries. These new resources support a wide variety of experiments, including large-scale efficiency exercises and query auto-completion synthesis, with scope for future addition of relevance judgments to support offline effectiveness experiments and hence batch evaluation campaigns.
CIKM, 2020

Language model pre-training has spurred a great deal of attention for tasks involving natural language understanding, and has been successfully applied to many downstream tasks with impressive results. Within information retrieval, many of these solutions are too costly to stand on their own, requiring multi-stage ranking architectures. Recent work has begun to consider how to “backport” salient aspects of these computationally expensive models to previous stages of the retrieval pipeline. One such instance is DeepCT, which uses BERT to re-weight term importance in a given context at the passage level. This process, which is computed offline, results in an augmented inverted index with re-weighted term frequency values. In this work, we conduct an investigation of query processing efficiency over DeepCT indexes. Using a number of candidate generation algorithms, we reveal how term re-weighting can impact query processing latency, and explore how DeepCT can be used as a static index pruning technique to accelerate query processing without harming search effectiveness.
SIGIR, 2020

There exists a natural tension between encouraging a diverse ecosystem of open-source search engines and supporting fair, replicable comparisons across those systems. To balance these two goals, we examine two approaches to providing interoperability between the inverted indexes of several systems. The first takes advantage of internal abstractions around index structures and building wrappers that allow one system to directly read the indexes of another. The second involves sharing indexes across systems via a data exchange specification that we have developed, called the Common Index File Format (CIFF). We demonstrate the first approach with the Java systems Anserini and Terrier, and the second approach with Anserini, JASS, OldDog, PISA, and Terrier. Together, these systems provide a wide range of implementations and features, with different research goals. Overall, we recommend CIFF as a low-effort approach to support independent innovation while enabling the types of fair evaluations that are critical for driving the field forward.
SIGIR, 2020

As both the availability of internet access and the prominence of smart devices continue to increase, data is being generated at a rate faster than ever before. This massive increase in data production comes with many challenges, including efficiency concerns for the storage and retrieval of such large-scale data. However, users have grown to expect the sub-second response times that are common in most modern search engines, creating a problem - how can such large amounts of data continue to be served efficiently enough to satisfy end users?
This dissertation investigates several issues regarding tail latency in large-scale information retrieval systems. Tail latency corresponds to the high percentile latency that is observed from a system - in the case of search, this latency typically corresponds to how long it takes for a query to be processed. In particular, keeping tail latency as low as possible translates to a good experience for all users, as tail latency is directly related to the worst-case latency and hence, the worst possible user experience. The key idea in targeting tail latency is to move from questions such as “what is the median latency of our search engine?” to questions which more accurately capture user experience such as “how many queries take more than 200ms to return answers?” or “what is the worst case latency that a user may be subject to, and how often might it occur?”
While various strategies exist for efficiently processing queries over large textual corpora, prior research has focused almost entirely on improvements to the average processing time or cost of search systems. As a first contribution, we examine some state-of-the-art retrieval algorithms for two popular index organizations, and discuss the trade-offs between them, paying special attention to the notion of tail latency. This research uncovers a number of observations that are subsequently leveraged for improved search efficiency and effectiveness.
We then propose and solve a new problem, which involves processing a number of related queries together, known as multi-queries, to yield higher quality search results. We experiment with a number of algorithmic approaches to efficiently process these multi-queries, and report on the cost, efficiency, and effectiveness trade-offs present with each. Ultimately, we find that some solutions yield a low tail latency, and are hence suitable for use in real-time search environments.
Finally, we examine how predictive models can be used to improve the tail latency and end-to-end cost of a commonly used multi-stage retrieval architecture without impacting result effectiveness. By combining ideas from numerous areas of information retrieval, we propose a prediction framework which can be used for training and evaluating several efficiency/effectiveness trade-off parameters, resulting in improved trade-offs between cost, result quality, and tail latency.
PhD Thesis, 2019

Rank fusion is a powerful technique that allows multiple sources of information to be combined into a single result set. However, to date fusion has not been regarded as being cost-effective in cases where strict per-query efficiency guarantees are required, such as in web search. In this work we propose a novel solution to rank fusion by splitting the computation into two parts – one phase that is carried out offline to generate pre-computed centroid answers for queries with broadly similar information needs, and then a second online phase that uses the corresponding topic centroid to compute a result page for each query. We explore efficiency improvements to classic fusion algorithms whose costs can be amortized as a pre-processing step, and can then be combined with re-ranking approaches to dramatically improve effectiveness in multi-stage retrieval systems with little efficiency overhead at query time. Experimental results using the ClueWeb12B collection and the UQV100 query variations demonstrate that centroid-based approaches allow improved retrieval effectiveness at little or no loss in query throughput or latency, and with reasonable pre-processing requirements. We additionally show that queries that do not match any of the pre-computed clusters can be accurately identified and efficiently processed in our proposed ranking pipeline.
TOIS, 2019
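
A minimal sketch of the two-phase idea: offline, fuse the runs of a topic’s query variations into one centroid answer (here with reciprocal-rank fusion, one classic choice); online, simply fetch the matched cluster’s centroid and re-rank it. Function boundaries and the RRF constant below are assumptions, not the paper’s exact configuration:

```python
from collections import defaultdict

def rrf_centroid(runs, k=60, depth=1000):
    """Offline phase: fuse the ranked lists of a topic's query
    variations into a single centroid list via reciprocal-rank fusion.
    `runs` is a list of ranked doc-id lists; k=60 is the conventional
    RRF constant, assumed here."""
    scores = defaultdict(float)
    for run in runs:
        for rank, doc in enumerate(run, start=1):
            scores[doc] += 1.0 / (k + rank)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:depth]

# Online phase (sketch): map an incoming query to its nearest topic
# cluster, fetch that cluster's precomputed centroid, and hand it to
# the usual re-ranking stage -- no per-query fusion cost.
```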

Performant Indexes and Search for Academia (PISA) is an experimental search engine that focuses on efficient implementations of state-of-the-art representations and algorithms for text retrieval. In this work, we outline our effort in creating a replicable search run from PISA for the 2019 Open Source Information Retrieval Replicability Challenge, which encourages the information retrieval community to produce replicable systems through the use of a containerized, Docker-based infrastructure. We also discuss the origins, current functionality, and future direction and challenges for the PISA system.
OSIRRC, 2019

Processing top-$k$ bag-of-words queries is critical to many information retrieval applications, including web-scale search. In this work, we consider algorithmic properties associated with dynamic pruning mechanisms. Such algorithms maintain a score threshold (the $k$th highest similarity score identified so far) so that low-scoring documents can be bypassed, allowing fast top-$k$ retrieval with no loss in effectiveness. In standard pruning algorithms the score threshold is initialized to the lowest possible value. To accelerate processing, we make use of term- and query-dependent features to predict the final value of that threshold, and then employ the predicted value right from the commencement of processing. Because of the asymmetry associated with prediction errors (if the estimated threshold is too high the query will need to be re-executed in order to assure the correct answer), the prediction process must be risk-sensitive. We explore techniques for balancing those factors, and provide detailed experimental results that show the practical usefulness of the new approach.
SIGIR, 2019
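
The risk-sensitive control loop can be sketched compactly: start dynamic pruning from a predicted threshold, detect an underfull result set, and re-execute safely when the prediction was too aggressive. The predictor itself is whatever model is trained on the term- and query-dependent features; here it is an assumed black box:

```python
def top_k_with_predicted_threshold(query, k, run_pruned_query,
                                   predict_threshold, safety: float = 0.9):
    """Hedged sketch. `run_pruned_query(query, k, theta)` is a
    WAND-style evaluation that skips documents whose score bound falls
    below theta; `predict_threshold` is the learned estimator. Scaling
    the prediction down by a safety factor trades speed against the
    (asymmetric) cost of re-execution."""
    theta = safety * predict_threshold(query, k)
    results = run_pruned_query(query, k, theta)
    if len(results) < k:
        # Prediction overshot: fall back to the safe run with the
        # usual zero initial threshold to guarantee correct answers.
        results = run_pruned_query(query, k, 0.0)
    return results
```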

Email continues to be one of the most commonly used forms of online communication. As inboxes grow larger, users rely more heavily on email search to effectively find what they are looking for. However, previous studies on email have been exclusive to enterprises with access to large user logs, or limited to small-scale qualitative surveys and analyses on outdated public datasets such as Enron and Avocado. In this work, we propose a novel framework that allows for experimentation with real email data. In particular, our approach provides a realistic way of simulating email re-finding tasks in a crowdsourcing environment using the workers’ personal email data. We use our approach to experiment with various ranking functions and quality degradation to measure how users behave under different conditions, and conduct analysis across various email types and attributes. Our results show that user behavior can be significantly impacted as a result of the quality of the search ranker, but only when differences in quality are very pronounced. Our analysis confirms that time-based ranking begins to fail as email age increases, suggesting that hybrid approaches may help bridge the gap between relevance-based rankers and the traditional time-based ranking approach. Finally, we also found that users typically reformulate search queries by either entirely re-writing the query, or simply appending terms to the query, which may have implications for email query suggestion facilities.
WWW, 2019

Document reordering is an important but often overlooked preprocessing stage in index construction. Reordering document identifiers in graphs and inverted indexes has been shown to reduce storage costs and improve processing efficiency in the resulting indexes. However, surprisingly few document reordering algorithms are publicly available despite their importance. A new reordering algorithm derived from recursive graph bisection was recently proposed by Dhulipala et al. (KDD 2016), and shown to be highly effective and efficient when compared against other state-of-the-art reordering strategies. In this work, we present a reproducibility study of this new algorithm. We describe the implementation challenges encountered, and explore the performance characteristics of our clean-room reimplementation. We show that we are able to successfully reproduce the core results of the original paper, and show that the algorithm generalizes to other collections and indexing frameworks. Furthermore, we make our implementation publicly available to help promote further research in this space.
ECIR, 2019

The Waterloo spam scores are now a commonly used static document feature in web collections such as ClueWeb. This feature can be used as a post-retrieval filter, as a document prior, or as one of many features in a Learning-to-Rank system. In this work, we highlight the risks associated with using spam scores as a post-retrieval filter, which is now common practice in experiments with the ClueWeb test collection. While it increases the average evaluation score and boosts the performance of some topics, it can significantly harm the performance of others. Through a detailed failure analysis, we show that simple spam filtering is a high risk practice that should be avoided in future work, particularly when working with the ClueWeb12 test collection.
ADCS, 2018

Ad-hoc retrieval is an important problem with many practical applications. It forms the basis of web search, question-answering, and a new generation of virtual assistants being developed by several of the largest software companies in the world. In this report, we continue our exploration of the importance of multiple expressions of information needs. Our thesis is that over-reliance on a single query can lead to suboptimal performance, and that by creating multiple query representations for an information need and combining the relevance signals through fusion and relevance modeling, highly effective systems can be produced. This approach may form the basis for more complex multi-stage retrieval systems in a variety of applications.
TREC, 2018

Relevance modeling and data fusion are powerful yet simple approaches to improving the effectiveness of Information Retrieval Systems. For many of the classic TREC test collections, these approaches were used by the top-performing retrieval systems. However, these approaches are often inefficient and are therefore rarely applied in production systems which must adhere to strict performance guarantees. Inspired by our recent work with human-derived query variations, we propose a new sampling-based system which provides significantly better efficiency-effectiveness trade-offs while leveraging both relevance modeling and data fusion. We show that our new end-to-end search system approaches the state-of-the-art in effectiveness while still being efficient in practice. Orthogonally, we also show how to leverage query expansion and data fusion to achieve significantly better risk-reward trade-offs than plain relevance modeling approaches.
DESIRES, 2018

Negated query terms allow documents containing such terms to be filtered out of a search results list, supporting disambiguation. In this work, the effect of negation on the efficiency of disjunctive, top-$k$ retrieval is examined. First, we show how negation can be integrated efficiently into two popular dynamic pruning algorithms. Then, we explore the efficiency of our approach, and show that while often efficient, negation can negatively impact the dynamic pruning effectiveness for certain queries.
ECIR, 2018

Large scale retrieval systems often employ cascaded ranking architectures, in which an initial set of candidate documents are iteratively refined and re-ranked by increasingly sophisticated and expensive ranking models. In this paper, we propose a unified framework for predicting a range of performance-sensitive parameters based on minimizing end-to-end effectiveness loss. The framework does not require relevance judgments for training, is amenable to predicting a wide range of parameters, allows for fine tuned efficiency-effectiveness trade-offs, and can be easily deployed in large scale search systems with minimal overhead. As a proof of concept, we show that the framework can accurately predict a number of performance parameters on a query-by-query basis, allowing efficient and effective retrieval, while simultaneously minimizing the tail latency of an early-stage candidate generation system. On the $50$ million document ClueWeb09B collection, and across $25{,}000$ queries, our hybrid system can achieve superior early-stage efficiency to fixed parameter systems without loss of effectiveness, and allows more finely-grained efficiency-effectiveness trade-offs across the multiple stages of the retrieval system.
WSDM, 2018

Score-at-a-Time index traversal is a query processing approach which supports early termination in order to balance efficiency and effectiveness trade-offs. In this work, we explore new techniques which extend a modern Score-at-a-Time traversal algorithm to allow for parallel postings traversal. We show that careful integration of parallel traversal can improve both efficiency and effectiveness when compared with current single threaded early termination approaches. In addition, we explore the various trade-offs for differing early termination heuristics, and propose hybrid systems which parallelize long running queries, while processing short running queries with only a single thread.
Runner-up for the best paper award
ADCS, 2017

Furthering the state-of-the-art in ad hoc web search is one of the underlying goals for the NTCIR We Want Web (WWW) task. Ad hoc search can be viewed as a bridge connecting many of the specialized sub-fields that are a result of the way people connect to and use information access systems. Since this is the first year of the WWW task, and no training data was provided for the English subtask, we focused on classic techniques for improving effectiveness in lieu of modern techniques based on supervised learning. In particular, we explored the use of Markov Random Field Models (MRFs), static document features, field-based weighting, and query expansion. This round we made extensive use of the Indri search system and the flexible query language it provides to produce effective results.
NTCIR, 2017

The 2017 TREC CORE Track is a re-run of the classic TREC ad hoc search evaluation campaign, with the vision of establishing new methodologies for creating IR test collections. The previous TREC newswire ad hoc task was the 2004 Robust Track, where the emphasis was on improving the effectiveness of poorly performing topics in previous tracks. The TREC CORE 2017 track reuses the Robust 2004 topic set for the development of relevance judgments over a new New York Times corpus, composed of newswire articles published between 1987 and 2007. For this task, RMIT contributed three different runs: two manual and one automatic.
TREC, 2017

With the growing popularity of the world-wide-web and the increasing accessibility of smart devices, data is being generated at a faster rate than ever before. This presents scalability challenges to web-scale search systems – how can we efficiently index, store and retrieve such a vast amount of data? A large amount of prior research has attempted to address many facets of this question, with the invention of a range of efficient index storage and retrieval frameworks that are able to efficiently answer most queries. However, the current literature generally focuses on improving the mean or median query processing time in a given system. In the proposed PhD project, we focus on improving the efficiency of high percentile tail latencies in large scale IR systems while minimising end-to-end effectiveness loss.
Although there is a wealth of prior research involving improving the efficiency of large scale IR systems, the most relevant prior work involves predicting long-running queries and processing them in various ways to avoid large query processing times. Prediction is often done through pre-trained models based on both static and dynamic features from queries and documents. Many different approaches to reducing the processing time of long running queries have been proposed, including parallelising queries that are predicted to run slowly, scheduling queries based on their predicted run time, and selecting or modifying the query processing algorithm depending on the load of the system.
Considering the specific focus on tail latencies in large-scale IR systems, the proposed research aims to: (i) study what causes large tail latencies to occur in large-scale web search systems, (ii) propose a framework to mitigate tail latencies in multi-stage retrieval through the prediction of a vast range of query-specific efficiency parameters, (iii) experiment with mixed-mode query semantics to provide efficient and effective querying to reduce tail latencies, and (iv) propose a time-bounded solution for Document-at-a-Time (DaaT) query processing which is suitable for current web search systems.
As a preliminary study, Crane et al. compared some state-of-the-art query processing strategies across many modern collections. They found that although modern DaaT dynamic pruning strategies are very efficient for ranked disjunctive processing, they have a much larger variance in processing times than Score-at-a-Time (SaaT) strategies which have a similar efficiency profile regardless of query length or the size of the required result set. Furthermore, Mackenzie et al. explored the efficiency trade-offs for paragraph retrieval in a multi-stage question answering system. They found that DaaT dynamic pruning strategies could efficiently retrieve the top-1,000 candidate paragraphs for very long queries. Extending on prior work, Mackenzie et al. showed how a range of per-query efficiency settings can be accurately predicted such that $99.99$ percent of queries are serviced in less than $200$ ms without noticeable effectiveness loss. In addition, a reference list framework was used for training models such that no relevance judgements or annotations were required.
Future work will focus on improving the candidate generation stage in large-scale multi-stage retrieval systems. This will include further exploration of index layouts, traversal strategies, and query rewriting, with the aim of improving early stage efficiency to reduce the system tail latency, while potentially improving end-to-end effectiveness.
SIGIR, 2017

We present an empirical comparison between document-at-a-time (DaaT) and score-at-a-time (SaaT) document ranking strategies within a common framework. Although both strategies have been extensively explored, the literature lacks a fair, direct comparison: such a study has been difficult due to vastly different query evaluation mechanics and index organizations. Our work controls for score quantization, document processing, compression, implementation language, implementation effort, and a number of details, arriving at an empirical evaluation that fairly characterizes the performance of three specific techniques: WAND (DaaT), BMW (DaaT), and JASS (SaaT). Experiments reveal a number of interesting findings. The performance gap between WAND and BMW is not as clear as the literature suggests, and both methods are susceptible to tail queries that may take orders of magnitude longer than the median query to execute. Surprisingly, approximate query evaluation in WAND and BMW does not significantly reduce the risk of these tail queries. Overall, JASS is slightly slower than either WAND or BMW, but exhibits much lower variance in query latencies and is much less susceptible to tail query effects. Furthermore, JASS query latency is not particularly sensitive to the retrieval depth, making it an appealing solution for performance-sensitive applications where bounds on query latencies are desirable.
WSDM, 2017

This paper describes the four systems RMIT fielded for the TREC 2016 LiveQA task and the associated experiments. Similar to last year, the results show that simple solutions tend to work best, and that our improved baseline systems achieved an above-average performance. We use a commercial search engine as a first stage retrieval mechanism and compare it with our internal system which uses a carefully curated document collection. Somewhat surprisingly, we found that on average the small curated collection performed better within our current framework, warranting further studies on when and when not to use an external resource, such as a publicly available search engine API. Finally, we show that small improvements to performance can substantially reduce failure rates.
TREC, 2016

Mobile search is quickly becoming the most common mode of search on the internet. This shift is driving changes in user behaviour, and search engine behaviour. Just over half of all search queries from mobile devices have local intent, making location-aware search an increasingly important problem. In this work, we compare the efficiency and effectiveness of two general types of geographical search queries, range queries and $k$-nearest-neighbor queries, for common web search tasks. We test state-of-the-art spatial-textual indexing and search algorithms for both query types on two large datasets. Finally, we present a rank-safe dynamic pruning algorithm that is simple to implement and use with current inverted indexing techniques. Our algorithm is more efficient than the tightly coupled best-in-breed hybrid indexing algorithms that are commonly used for top-$k$ spatial textual queries, and more likely to find relevant documents than techniques derived from range queries.
Winner of the best paper award
ADCS, 2015