1,118 research outputs found

Combining Terrier with Apache Spark to Create Agile Experimental Information Retrieval Pipelines

Author: Macdonald Craig
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2018

Experimentation using IR systems has traditionally been a procedural and laborious process. Queries must be run on an index, with any parameters of the retrieval models suitably tuned. With the advent of learning-to-rank, such experimental processes (including the appropriate folding of queries to achieve cross-fold validation) have resulted in complicated experimental designs and hence scripting. At the same time, machine learning platforms such as Scikit Learn and Apache Spark have pioneered the notion of an experimental pipeline, which naturally allows a supervised classification experiment to be expressed as a series of stages, which can be learned or transformed. In this demonstration, we detail Terrier-Spark, a recent adaptation to the Terrier Information Retrieval platform which permits it to be used within the experimental pipelines of Spark. We argue that this (1) provides an agile experimental platform for information retrieval, comparable to that enjoyed by other branches of data science; (2) aids research reproducibility in information retrieval by facilitating easily-distributable notebooks containing conducted experiments; and (3) facilitates the teaching of information retrieval experiments in educational environments.
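The pipeline idea in the abstract above can be illustrated with a minimal sketch in plain Python. It does not use the Terrier-Spark or Spark APIs; the `Stage`, `Lowercase`, `OverlapRetriever` and `Pipeline` classes, and the toy term-overlap scoring, are hypothetical names and choices made only for illustration.

```python
from typing import Dict, List


class Stage:
    """A pipeline stage: fit() may learn parameters, transform() rewrites rows."""

    def fit(self, rows: List[Dict]) -> "Stage":
        return self  # stateless by default

    def transform(self, rows: List[Dict]) -> List[Dict]:
        raise NotImplementedError


class Lowercase(Stage):
    """Stateless transformer that lowercases the query text."""

    def transform(self, rows):
        return [{**r, "query": r["query"].lower()} for r in rows]


class OverlapRetriever(Stage):
    """Toy retrieval stage (an assumption): ranks documents by shared query terms."""

    def __init__(self, docs: Dict[str, str]):
        self.docs = {d: set(text.lower().split()) for d, text in docs.items()}

    def transform(self, rows):
        out = []
        for r in rows:
            terms = set(r["query"].split())
            scored = [(len(terms & words), d) for d, words in self.docs.items()]
            # Highest overlap first, ties broken by document id; drop zero scores.
            ranked = [d for s, d in sorted(scored, key=lambda x: (-x[0], x[1])) if s > 0]
            out.append({**r, "ranking": ranked})
        return out


class Pipeline(Stage):
    """Chains stages so that the output of one is the input of the next."""

    def __init__(self, stages: List[Stage]):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


docs = {
    "d1": "Terrier information retrieval platform",
    "d2": "Apache Spark pipelines",
    "d3": "cooking recipes",
}
queries = [{"qid": "q1", "query": "Spark Pipelines"}]
pipeline = Pipeline([Lowercase(), OverlapRetriever(docs)])
results = pipeline.fit(queries).transform(queries)
```

Because every stage shares the same `fit`/`transform` contract, a learned stage such as a learning-to-rank model could be appended to the list without changing the surrounding experiment code.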

Scipedia

Enlighten

Improving Arabic Light Stemming in Information Retrieval Systems

Author: Almusaddar Mohammed Yahya
Publication venue: Islamic University of Gaza (الجامعة الإسلامية - غزة)
Publication date: 01/01/2014

Information retrieval refers to the retrieval of textual documents such as newsprint and magazine articles or Web documents. Owing to extensive research in the IR field, many retrieval techniques have been developed for the Arabic language. The main objectives of this research are to improve Arabic information retrieval by enhancing the light stemming and preprocessing stages, to contribute to the open source community, and to establish a guideline for Arabic normalization and stop-word removal. To achieve these objectives, we create a GUI toolkit that implements the preprocessing stage necessary for information retrieval. One of these steps is normalization, which we improved by introducing a set of rules that other researchers can standardize and refine. The next preprocessing step we improved is stop-word removal: we introduced two stop-word lists, an intensive list that reduces the size of the index and the number of ambiguous words, and a light list that gives better recall in information retrieval applications. We improved light stemming by updating a suffix rule and by introducing a list of 100 manually collected Arabized words, which should not follow the stemming rules because they entered Arabic from other languages, and we show how this improves results compared with two popular stemming algorithms, the Khoja and Larkey stemmers. The proposed toolkit was integrated into the popular Terrier IR platform, to which we added Arabic language support. We used the TF-IDF scoring model from Terrier and tested our results on the OSAC datasets. The systems were implemented in Java, and the experiments ran on a Core i7 CPU at 3.4 GHz with 8 GB of RAM.
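The normalization and light-stemming steps described above can be sketched as follows. The abstract does not list the thesis's actual rules, so the affix lists, the minimum stem length of three letters, and the single-entry exception set are generic assumptions for illustration only. Arabic strings are written as Unicode escapes, with transliterations in the comments.

```python
import re

# Assumed normalization rules: drop short-vowel marks and tatweel, unify alef forms.
DIACRITICS = re.compile("[\u064B-\u0652\u0640]")
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")  # alef with madda / hamza above / hamza below
BARE_ALEF = "\u0627"

# Assumed light-stemming affixes (a small subset, not the thesis's rule set).
PREFIXES = ("\u0648\u0627\u0644", "\u0627\u0644")               # "wal-", "al-"
SUFFIXES = ("\u0648\u0646", "\u0627\u062A", "\u064A\u0646")     # "-un", "-at", "-in"

# Hypothetical one-word stand-in for the list of Arabized words.
ARABIZED = {"\u062A\u0644\u0641\u0632\u064A\u0648\u0646"}       # "tilifizyun" (television)

MIN_STEM_LENGTH = 3


def normalize(word: str) -> str:
    """Remove diacritics and tatweel, then map alef variants to bare alef."""
    word = DIACRITICS.sub("", word)
    return ALEF_VARIANTS.sub(BARE_ALEF, word)


def light_stem(word: str) -> str:
    """Strip at most one prefix and one suffix, skipping Arabized loanwords."""
    word = normalize(word)
    if word in ARABIZED:
        return word  # loanwords do not follow Arabic affix rules
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LENGTH:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            word = word[:-len(suffix)]
            break
    return word


the_book = "\u0627\u0644\u0643\u062A\u0627\u0628"    # "al-kitab"
television = "\u062A\u0644\u0641\u0632\u064A\u0648\u0646"
stemmed_book = light_stem(the_book)
stemmed_television = light_stem(television)
```

Without the exception set, the loanword would lose its final two letters because they coincide with a plural suffix, which is the kind of error an Arabized-word list is meant to prevent.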

Institutional Repository of the Islamic University of Gaza

ImageTerrier: an extensible platform for scalable high-performance image retrieval

Author: Dupplaw David
Hare Jonathon
Lewis Paul H.
Samangooei Sina
Publication date: 05/06/2012

Southampton (e-Prints Soton)

reSearch : enhancing information retrieval with images

Author: Goodfellow Martin Hugh
Hunt Ela
McCafferty Daniel
Publication date: 10/05/2009

Combining image and text search is an open research question. The main issues are what technologies to base this solution on, and what measures of relevance to employ. Our reSearch prototype mashes up papers indexed using information retrieval techniques (Terrier) with Google image search for faces and Google book search. The user can interactively employ query expansion with additional terms suggested by Terrier, and use those terms to expand both the text and image search. We test this solution with a selection of recent publications and queries concerning people engaged in research. We report on the effectiveness of this solution. It seems that the combination works to a large extent, as testified by our observations

University of Strathclyde Institutional Repository

University of Glasgow at WebCLEF 2005: experiments in per-field normalisation and language specific stemming

Author: He B.
Lioma C.
Macdonald C.
Ounis I.
Plachouras V.
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2006

We participated in the WebCLEF 2005 monolingual task. In this task, a search system aims to retrieve relevant documents from a multilingual corpus of Web documents from Web sites of European governments. Both the documents and the queries are written in a wide range of European languages. A challenge in this setting is to detect the language of documents and topics, and to process them appropriately. We develop a language specific technique for applying the correct stemming approach, as well as for removing the correct stopwords from the queries. We represent documents using three fields, namely content, title, and anchor text of incoming hyperlinks. We use a technique called per-field normalisation, which extends the Divergence From Randomness (DFR) framework, to normalise the term frequencies, and to combine them across the three fields. We also employ the length of the URL path of Web documents. The ranking is based on combinations of both the language specific stemming, if applied, and the per-field normalisation. We use our Terrier platform for all our experiments. The overall performance of our techniques is outstanding, achieving the overall top four performing runs, as well as the top performing run without metadata in the monolingual task. The best run only uses per-field normalisation, without applying stemming
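Per-field normalisation, as mentioned in the abstract above, can be sketched as below. The abstract gives no formula, so this assumes the DFR Normalisation 2 form applied to each field separately and combined as a weighted sum, tfn = Σ_f w_f · tf_f · log2(1 + c_f · avgl_f / l_f). The function name, field weights and `c` values are hypothetical.

```python
import math
from typing import Dict


def per_field_tfn(
    tf: Dict[str, int],
    length: Dict[str, int],
    avg_length: Dict[str, float],
    weight: Dict[str, float],
    c: Dict[str, float],
) -> float:
    """Combine per-field normalised term frequencies into one pseudo-frequency."""
    total = 0.0
    for field, freq in tf.items():
        if freq == 0 or length[field] == 0:
            continue  # the term does not occur in this field
        # Normalisation 2 for this field: longer-than-average fields are damped.
        norm = math.log2(1.0 + c[field] * avg_length[field] / length[field])
        total += weight[field] * freq * norm
    return total


# Illustrative values for the three fields named in the abstract.
tfn = per_field_tfn(
    tf={"content": 3, "title": 1, "anchor": 0},
    length={"content": 100, "title": 5, "anchor": 0},
    avg_length={"content": 100.0, "title": 5.0, "anchor": 10.0},
    weight={"content": 1.0, "title": 2.0, "anchor": 1.0},
    c={"content": 1.0, "title": 1.0, "anchor": 1.0},
)
```

In this example both occupied fields have exactly average length, so each normalisation factor is log2(2) = 1 and the result is 1·3 + 2·1 = 5. The combined value would then be passed to a DFR weighting model in place of the raw term frequency.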

Crossref

Copenhagen University Research Information System

Enlighten

On inverted index compression for search engine efficiency

Author: Catena Matteo
Macdonald Craig
Ounis Iadh
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2014

Efficient access to the inverted index data structure is a key aspect for a search engine to achieve fast response times to users’ queries. While the performance of an information retrieval (IR) system can be enhanced through the compression of its posting lists, there is little recent work in the literature that thoroughly compares and analyses the performance of modern integer compression schemes across different types of posting information (document ids, frequencies, positions). In this paper, we experiment with different modern integer compression algorithms, integrating these into a modern IR system. Through comprehensive experiments conducted on two large, widely used document corpora and large query sets, our results show the benefit of compression for different types of posting information to the space- and time-efficiency of the search engine. Overall, we find that the simple Frame of Reference compression scheme results in the best query response times for all types of posting information. Moreover, we observe that the frequency and position posting information in Web corpora that have large volumes of anchor text are more challenging to compress, yet compression is beneficial in reducing average query response times.
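The Frame of Reference scheme named in the abstract above stores each block of integers as a base value plus fixed-width offsets. The following is a simplified sketch, not the paper's implementation: it packs offsets into a single Python integer rather than into fixed-size machine words, and the function names are hypothetical.

```python
from typing import List, Tuple


def for_encode(block: List[int]) -> Tuple[int, int, int, int]:
    """Encode a non-empty block as (base, bit width, count, packed offsets)."""
    base = min(block)
    # Width needed for the largest offset from the base (at least one bit).
    width = max((max(block) - base).bit_length(), 1)
    packed = 0
    for i, value in enumerate(block):
        packed |= (value - base) << (i * width)
    return base, width, len(block), packed


def for_decode(base: int, width: int, count: int, packed: int) -> List[int]:
    """Recover the original block from its Frame of Reference encoding."""
    mask = (1 << width) - 1
    return [base + ((packed >> (i * width)) & mask) for i in range(count)]


# Example: document-id gaps from a posting list (illustrative values).
gaps = [5, 7, 6, 12, 5]
encoded = for_encode(gaps)
decoded = for_decode(*encoded)
```

Here the base is 5 and the largest offset is 7, so every value fits in 3 bits. Decoding needs only a shift, a mask and an addition per value, which may help explain the fast query response times the abstract reports for this scheme.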

CiteSeerX

Crossref

Enlighten

Monitoring Electoral Violence through Social Media: A Machine Learning Approach

Author: Macdonald Craig
Ounis Iadh
Yang Xiao
Publication date: 01/01/2016

No abstract available

Enlighten

MapReduce for information retrieval evaluation: "Let's quickly test this on 12 TB of data"

Author: Hauff Claudia
Hiemstra Djoerd
Publication venue: Springer
Publication date: 01/01/2010

We propose to use MapReduce to quickly test new retrieval approaches on a cluster of machines by sequentially scanning all documents. We present a small case study in which we use a cluster of 15 low cost machines to search a web crawl of 0.5 billion pages showing that sequential scanning is a viable approach to running large-scale information retrieval experiments with little effort. The code is available to other researchers at: http://mirex.sourceforge.net
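The sequential-scanning approach described above can be sketched as a map step that scores every document against every query and a reduce step that keeps the best results per query. This is a single-process simulation in plain Python, not the MIREX code. The function names, the term-count scoring and the tiny collection are assumptions for illustration.

```python
import heapq
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def map_document(
    doc_id: str, text: str, queries: Dict[str, List[str]]
) -> Iterator[Tuple[str, Tuple[int, str]]]:
    """Map step: scan one document and emit (query id, (score, doc id)) pairs."""
    words = text.lower().split()
    for qid, terms in queries.items():
        score = sum(words.count(term) for term in terms)  # toy scoring function
        if score > 0:
            yield qid, (score, doc_id)


def reduce_query(
    qid: str, scored: List[Tuple[int, str]], k: int
) -> Tuple[str, List[Tuple[int, str]]]:
    """Reduce step: keep the k highest-scoring documents for one query."""
    return qid, heapq.nlargest(k, scored)


def run(docs: Dict[str, str], queries: Dict[str, List[str]], k: int = 2):
    """Simulate the shuffle phase by grouping the mapped pairs by query id."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for qid, pair in map_document(doc_id, text, queries):
            grouped[qid].append(pair)
    return dict(reduce_query(qid, pairs, k) for qid, pairs in grouped.items())


docs = {
    "d1": "map reduce for retrieval",
    "d2": "retrieval retrieval experiments",
    "d3": "web crawl",
}
queries = {"q1": ["retrieval"]}
top = run(docs, queries)
```

No index is built at any point: changing the retrieval model means changing only the scoring line in `map_document` and rescanning the collection, which is the trade-off the abstract argues for.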

CiteSeerX

Crossref

Radboud Repository

University of Twente Research Information

Evaluating Bad Query Abandonment in an Iterative SMS-Based FAQ Retrieval System

Author: Ounis Iadh
Rogers Simon
Thuma Edwin
Publication date: 01/01/2013

In this paper, we investigate how many iterations users are willing to tolerate in an iterative Frequently Asked Question (FAQ) system that provides information on HIV/AIDS. This is part of work in progress that aims to develop an automated Frequently Asked Question system that can be used to provide answers on HIV/AIDS related queries to users in Botswana. Our system engages the user in the question answering process by following an iterative interaction approach in order to avoid giving inappropriate answers to the user. Our findings provide us with an indication of how long users are willing to engage with the system. We subsequently use this to develop a novel evaluation metric to use in future developments of the system. As an additional finding, we show that the previous search experience of the users has a significant effect on their future behaviour.

Enlighten

MIREX: MapReduce Information Retrieval Experiments

Author: Hauff Claudia
Hiemstra Djoerd
Publication date: 01/01/2010

We propose to use MapReduce to quickly test new retrieval approaches on a cluster of machines by sequentially scanning all documents. We present a small case study in which we use a cluster of 15 low cost machines to search a web crawl of 0.5 billion pages showing that sequential scanning is a viable approach to running large-scale information retrieval experiments with little effort. The code is available to other researchers at: http://mirex.sourceforge.ne

arXiv.org e-Print Archive

CiteSeerX

University of Twente Research Information