research

Combining Terrier with Apache Spark to Create Agile Experimental Information Retrieval Pipelines

Abstract

Experimentation using IR systems has traditionally been a procedural and laborious process. Queries must be run on an index, with any parameters of the retrieval models suitably tuned. With the advent of learning-to-rank, such experimental processes (including the appropriate folding of queries to achieve cross-fold validation) have resulted in complicated experimental designs and hence scripting. At the same time, machine learning platforms such as Scikit Learn and Apache Spark have pioneered the notion of an experimental pipeline , which naturally allows a supervised classification experiment to be expressed a series of stages, which can be learned or transformed. In this demonstration, we detail Terrier-Spark, a recent adaptation to the Terrier Information Retrieval platform which permits it to be used within the experimental pipelines of Spark. We argue that this (1) provides an agile experimental platform for information retrieval, comparable to that enjoyed by other branches of data science; (2) aids research reproducibility in information retrieval by facilitating easily-distributable notebooks containing conducted experiments; and (3) facilitates the teaching of information retrieval experiments in educational environments

    Similar works