
    Lucene for n‐grams using the ClueWeb collection

The ARSC team made modifications to the Apache Lucene engine to accommodate "go words," taken from the Google Gigaword vocabulary of n-grams. Indexing the Category "B" subset of the ClueWeb collection was accomplished by a divide-and-conquer method, working across the separate ClueWeb subsets for 1-, 2-, and 3-grams.

1. Overview and Prior work

Phrase searching, which imposes an order on query terms, has traditionally been an expensive IR task. One approach is to use sophisticated algorithms at query time to analyze the position of a given term relative to nearby terms in the text, where term locations are stored at indexing time. Another method is to index the document collection against a large set of phrase tokens rather than single terms.

For the TREC 2009 Web track, we indexed the ClueWeb09 Category B document collection using a "go list" vocabulary (as opposed to a "stop list") of 1-, 2-, and 3-gram phrase tokens extracted from the Google NGram data set. We made small modifications to Lucene, in both the indexer and the query processor, to facilitate use of the go list. We found that query processing was quite fast, with each query processed in well under 2 seconds. Lucene indexing time increased approximately 3x for the three passes (with 1-, 2-, and 3-grams), and index files were larger due to the larger number of indexed n-grams. The additional cost paid to index n-gram tokens allowed phrase searching to run as quickly as ordinary single-term searching.
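The go-list tokenization idea above can be sketched in a few lines. This is an illustrative reconstruction, not the actual ARSC Lucene modification: the function name `golist_tokenize` and the tiny sample go list are assumptions. It emits every 1-, 2-, and 3-gram over the word stream but keeps only those present in the go list, so phrases become ordinary index terms.

```python
def golist_tokenize(text, go_list, max_n=3):
    """Emit 1- through max_n-gram tokens from text, keeping only
    those found in go_list (hypothetical sketch of go-list indexing)."""
    words = text.lower().split()
    tokens = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            gram = " ".join(words[i:i + n])
            if gram in go_list:
                tokens.append(gram)
    return tokens

# Toy go list; a real one would come from the Google NGram data.
go_list = {"new", "york", "city", "new york", "new york city"}
print(golist_tokenize("New York City traffic", go_list))
# → ['new', 'york', 'city', 'new york', 'new york city']
```

Because phrases like "new york city" are indexed as single tokens, a phrase query reduces to an ordinary term lookup, which is why phrase search cost matches single-term search in this scheme.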