239 research outputs found

    Delivering the Maori-language newspapers on the Internet

    Although any collection of historical newspapers provides a particularly rich and valuable record of events and social and political commentary, the content tends to be difficult to access and extremely time-consuming to browse or search. The advent of digital libraries has meant that for electronically stored text, full-text searching is now a tool readily available to researchers, or indeed to anyone wishing to have access to specific information in text. Text in this form can be readily distributed via CD-ROM or the Internet, with a significant improvement in accessibility over traditional microfiche or hard-copy distribution. For the majority of text being generated de novo, availability in electronic form is standard, hence the increasing use of full-text search facilities. However, for legacy text available only in printed form, the provision of these electronic search tools depends on first capturing digital facsimile images of the printed text and then converting these images to electronic text through optical character recognition (OCR). This article describes a project undertaken at the University of Waikato over the period 1999 to 2001 to produce a full-text searchable version of the Niupepa, or Maori-language newspaper, collection for delivery over the Internet.

    Enhancing Content-And-Structure Information Retrieval using a Native XML Database

    Three approaches to content-and-structure XML retrieval are analysed in this paper: first by using Zettair, a full-text information retrieval system; second by using eXist, a native XML database; and third by using a hybrid XML retrieval system that uses eXist to produce the final answers from likely relevant articles retrieved by Zettair. INEX 2003 content-and-structure topics can be classified into two categories: the first retrieving full articles as final answers, and the second retrieving more specific elements within articles as final answers. We show that for both topic categories our initial hybrid system improves the retrieval effectiveness of a native XML database. For ranking the final answer elements, we propose and evaluate a novel retrieval model that utilises the structural relationships between the answer elements of a native XML database and retrieves Coherent Retrieval Elements. The final results of our experiments show that when the XML retrieval task focusses on highly relevant elements, our hybrid XML retrieval system with the Coherent Retrieval Elements module is 1.8 times more effective than Zettair and 3 times more effective than eXist, and yields effective content-and-structure XML retrieval.

    Investigation of the Lambda Parameter for Language Modeling Based Persian Retrieval

    Language modeling is one of the most powerful methods in information retrieval. Many language modeling based retrieval systems have been developed and tested on English collections. Hence, the evaluation of language modeling on collections in other languages is an interesting research issue. In this study, four different language modeling methods proposed by Hiemstra [1] have been evaluated on a large Persian collection drawn from a news archive. Furthermore, we study two different approaches that have been proposed for tuning the Lambda parameter in the method. Experimental results show that the performance of language models on Persian text improves after Lambda tuning. More specifically, the Witten-Bell method provides the best results.
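
    As a minimal sketch of where a Lambda parameter typically enters such a model, the code below scores a document by query log-likelihood with linear interpolation between a document model and a collection model. This is an assumption for illustration: the abstract does not state the exact formula or the two tuning approaches, and the function and parameter names (`query_log_likelihood`, `collection_tf`, `lam`) are hypothetical.

```python
import math
from collections import Counter


def query_log_likelihood(query_terms, doc_terms, collection_tf, collection_len, lam):
    """Log P(query | doc) under linear interpolation smoothing.

    Assumed formulation (not taken from the abstract):
        P(t | d) = lam * tf(t, d) / |d| + (1 - lam) * cf(t) / |C|
    """
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    log_p = 0.0
    for term in query_terms:
        p_doc = doc_tf[term] / doc_len if doc_len else 0.0
        p_col = collection_tf.get(term, 0) / collection_len
        p = lam * p_doc + (1.0 - lam) * p_col
        if p == 0.0:
            # Term unseen in both document and collection.
            return float("-inf")
        log_p += math.log(p)
    return log_p


# Toy collection of two documents (illustrative data only).
doc1 = ["a", "b"]
doc2 = ["c", "d"]
collection_tf = Counter(doc1 + doc2)
collection_len = len(doc1) + len(doc2)

# A larger lam trusts the document model more; a smaller lam leans on the collection.
for lam in (0.1, 0.5, 0.9):
    s1 = query_log_likelihood(["a"], doc1, collection_tf, collection_len, lam)
    s2 = query_log_likelihood(["a"], doc2, collection_tf, collection_len, lam)
    print(lam, round(s1, 4), round(s2, 4))
```

    Under this assumed formulation, tuning Lambda amounts to choosing the mixing weight that gives the best retrieval effectiveness on held-out queries.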

    MWAND: A New Early Termination Algorithm for Fast and Efficient Query Evaluation

    Modern information systems are very large and maintain huge amounts of data, processing millions of documents and millions of queries at any given time. To select the most important responses from this volume of data, it is useful to apply so-called early termination algorithms. These attempt to extract the top-K documents according to a specified monotonically increasing function. The principal idea is to reach and score only the smallest possible number of the most significant documents, and so avoid fully processing the whole collection. The WAND algorithm is the state of the art in this area. Although it is efficient, it lacks effectiveness and precision. In this paper, we propose two contributions. The principal one is a new early termination algorithm based on the WAND approach, which we call MWAND (Modified WAND). It is faster and more precise than WAND, and it can avoid unnecessary WAND steps. In this work, we integrate a tree structure as an index into WAND and add new levels to query processing. As the second contribution, we define new fine-grained metrics to improve the evaluation of the retrieved information. Experimental results on real datasets show that MWAND is more efficient than the WAND approach.
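
    For orientation, the following is a simplified in-memory sketch of baseline WAND-style top-K retrieval as it is commonly described, not of MWAND, whose tree index and extra processing levels are not detailed in the abstract. The data layout (a dict of doc-id-sorted posting lists holding precomputed positive per-term scores) and the name `wand_top_k` are assumptions made for illustration.

```python
import heapq

INF = float("inf")


def wand_top_k(postings, k):
    """Return the top-k (score, doc_id) pairs, best first.

    postings: dict term -> list of (doc_id, score), sorted by doc_id,
    with positive scores. A document's score is the sum over terms.
    """
    if k <= 0:
        return []
    lists = [p for p in postings.values() if p]
    upper = [max(s for _, s in p) for p in lists]  # per-term score upper bounds
    pos = [0] * len(lists)                         # cursor into each posting list
    heap = []                                      # min-heap of the current top-k

    def cur(i):
        return lists[i][pos[i]][0] if pos[i] < len(lists[i]) else INF

    while True:
        threshold = heap[0][0] if len(heap) == k else -INF
        order = sorted(range(len(lists)), key=cur)
        # Pivot: first doc id at which the summed upper bounds can beat the threshold.
        acc, pivot = 0.0, None
        for i in order:
            if cur(i) == INF:
                break
            acc += upper[i]
            if acc > threshold:
                pivot = cur(i)
                break
        if pivot is None:
            break  # no remaining document can enter the top-k
        if cur(order[0]) == pivot:
            # Every list up to the pivot sits on the pivot doc: score it fully.
            score = 0.0
            for i in range(len(lists)):
                if cur(i) == pivot:
                    score += lists[i][pos[i]][1]
                    pos[i] += 1
            if len(heap) < k:
                heapq.heappush(heap, (score, pivot))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, pivot))
        else:
            # Skip documents before the pivot: they cannot beat the threshold.
            for i in order:
                if cur(i) >= pivot:
                    break
                while cur(i) < pivot:
                    pos[i] += 1
    return sorted(heap, key=lambda x: (-x[0], x[1]))


# Illustrative data only.
example_postings = {
    "a": [(1, 1.0), (2, 3.0), (5, 0.5)],
    "b": [(2, 2.0), (3, 1.0), (5, 4.0)],
}
print(wand_top_k(example_postings, 2))
```

    The saving comes from the skip branch: documents whose summed upper bounds cannot exceed the current K-th best score are passed over without being fully scored.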