Parsing the Wiki collection and snippet generation

Abstract

University of Minnesota M.S. thesis. April 2013. Major: Computer science. Advisor: Dr Donald Crouch. 1 computer file (PDF); vi, 31 pages.

Information Retrieval (IR) is a field which deals with retrieving useful information from large sets of data in response to a query. Much information in this digital age is stored in XML format, which associates a structure with a document. Though IR systems have been used for years to access documents, the field has greatly expanded with the emergence of the World Wide Web, which emphasizes the structure of the data. The amount of data makes it difficult to identify particular portions of a document; document structure helps in this task. This thesis describes a retrieval task known as snippet retrieval. A snippet is the smallest meaningful body of text which can be used to establish the relevance of a document without actually looking at the document. The work on snippet retrieval extends past work in focused retrieval, wherein a ranked list of focused elements is retrieved in response to the user query. The Vector Space Model provides the framework for retrieval; we use Smart for basic retrieval functions. Our system for dynamic element retrieval, Flex, enables us to identify and rank the individual elements of each hypertext document with respect to the query. We include a discussion of focusing strategies and the use of focused elements for snippet generation. Results of our top-ranked 2011 and 2012 Snippet Retrieval track runs are included.
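
As a rough illustration of the idea of ranking document elements against a query in a vector space and taking the best element as a snippet, the following minimal Python sketch may help. It is not the Smart or Flex implementation described in the thesis: the function names (`tokenize`, `cosine`, `rank_elements`, `make_snippet`), the raw term-frequency weighting, the element paths, and the 300-character cutoff are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_elements(elements, query):
    """Rank (path, text) elements by similarity to the query.

    Assumption: raw term frequencies stand in for whatever term
    weighting the real system uses.
    """
    query_vec = Counter(tokenize(query))
    scored = [
        (cosine(Counter(tokenize(text)), query_vec), path, text)
        for path, text in elements
    ]
    # Sort on the score only, highest first; ties keep document order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


def make_snippet(elements, query, max_chars=300):
    """Return the text of the top-ranked element, truncated to max_chars.

    Hypothetical focusing strategy: take the single best element.
    Returns an empty string when no element shares a term with the query.
    """
    ranked = rank_elements(elements, query)
    if not ranked or ranked[0][0] == 0.0:
        return ""
    return ranked[0][2][:max_chars]
```

Calling `make_snippet(elements, "snippet retrieval")` on a list of `(path, text)` pairs extracted from an XML document would return the text of whichever element scores highest under this simplified weighting.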
