132,325 research outputs found

    Parsimonious Language Models for a Terabyte of Text

    The aims of this paper are twofold. Our first aim is to compare results of the earlier Terabyte tracks to the Million Query track. We submitted a number of runs using different document representations (such as full-text, title-fields, or incoming anchor-texts) to increase pool diversity. The initial results show broad agreement in system rankings over various measures on topic sets judged at both Terabyte and Million Query tracks, with runs using the full-text index giving superior results on all measures, but also some noteworthy upsets. Our second aim is to explore the use of parsimonious language models for retrieval on terabyte-scale collections. These models are smaller and thus more efficient than the standard language models when used at indexing time, and they may also improve retrieval performance. We have conducted initial experiments using parsimonious models in combination with pseudo-relevance feedback, for both the Terabyte and Million Query track topic sets, and obtained promising initial results.

    MIaS: Math-Aware Retrieval in Digital Mathematical Libraries

    Digital mathematical libraries (DMLs) such as arXiv, Numdam, and EuDML contain mainly documents from STEM fields, where mathematical formulae are often more important than text for understanding. Conventional information retrieval (IR) systems are unable to represent formulae and are therefore ill-suited for math information retrieval (MIR). To fill the gap, we have developed and open-sourced the MIaS MIR system. MIaS is based on the full-text search engine Apache Lucene. On top of text retrieval, MIaS also incorporates a set of tools for preprocessing mathematical formulae. We describe the design of the system and present speed and quality evaluation results. We show that MIaS is both efficient and effective, as evidenced by our victory in the NTCIR-11 Math-2 task.

    A Strategy for Electronic Dissemination of NASA Langley Technical Publications

    To demonstrate NASA Langley Research Center's relevance and to transfer technology to external customers in a timely and efficient manner, Langley has formed a working group to study and recommend a course of action for the electronic dissemination of technical reports (EDTR). The working group identified electronic report requirements (e.g., accessibility, file format, search requirements) of customers in U.S. industry through numerous site visits and personal contacts. Internal surveys were also used to determine commonalities in document preparation methods. From these surveys, a set of requirements for an electronic dissemination system was developed. Two candidate systems were identified and evaluated against the set of requirements: the Full-Text Electronic Documents System (FEDS), which is a full-text retrieval system based on the commercial document management package Interleaf, and the Langley Technical Report Server (LTRS), which is a Langley-developed system based on the publicly available World Wide Web (WWW) software system. Factors that led to the selection of LTRS as the vehicle for electronic dissemination included searching and viewing capability, current system operability, and client software availability for multiple platforms at no cost to industry. This report includes the survey results, evaluations, a description of the LTRS architecture, recommended policy statement, and suggestions for future implementations.

    Design and Implementation of a Multimedia Information Retrieval Engine for the MSR-Bing Image Retrieval Challenge

    The aim of this work is to design and implement a multimedia information retrieval engine for the MSR-Bing Retrieval Challenge provided by Microsoft. The challenge is based on the Clickture dataset, generated from click logs of Bing image search. The system has to predict the relevance of images with respect to text queries, by assigning to each pair (image, text query) a score that indicates how well the text query describes the image content. We attempt to combine textual and visual information by performing text-based and content-based image retrieval. The framework used to extract visual features is Caffe, an efficient implementation of deep Convolutional Neural Networks (CNNs). The decision is taken using a knowledge base containing triplets, each consisting of a text query, an image, and the number of times that users clicked on the image in response to the text query. Two strategies were proposed. In one case we analyse the intersection of the triplet elements retrieved using the textual query and using the image itself, respectively. In the other case we analyse the union. To solve efficiency issues we proposed an approach that indexes visual features using Apache Lucene, a text search engine library written entirely in Java and suitable for nearly any application requiring full-text search abilities. To this end, we have converted image features into a textual form, so that they can be stored in an inverted index by means of Lucene. In this way we were able to set up a robust retrieval system that combines full-text search with content-based image retrieval capabilities. To show that our search for textually and visually similar images really works, a small web-based prototype has been implemented. We evaluated different versions of our system over the development set in order to evaluate the similarity measures used to compare images, and to assess the best sorting strategy. Finally, our proposed approaches have been compared with those implemented by the winners of previous challenge editions.

    LINNAEUS: A species name identification system for biomedical literature

    Background: The task of recognizing and identifying species names in biomedical literature has recently been regarded as critical for a number of applications in text and data mining, including gene name recognition, species-specific document retrieval, and semantic enrichment of biomedical articles. Results: In this paper we describe an open-source species name recognition and normalization software system, LINNAEUS, and evaluate its performance relative to several automatically generated biomedical corpora, as well as a novel corpus of full-text documents manually annotated for species mentions. LINNAEUS uses a dictionary-based approach (implemented as an efficient deterministic finite-state automaton) to identify species names and a set of heuristics to resolve ambiguous mentions. When compared against our manually annotated corpus, LINNAEUS performs with 94% recall and 97% precision at the mention level, and 98% recall and 90% precision at the document level. Our system successfully solves the problem of disambiguating uncertain species mentions, with 97% of all mentions in PubMed Central full-text documents resolved to unambiguous NCBI taxonomy identifiers. Conclusions: LINNAEUS is an open source, stand-alone software system capable of recognizing and normalizing species name mentions with speed and accuracy, and can therefore be integrated into a range of bioinformatics and text-mining applications. The software and manually annotated corpus can be downloaded freely at http://linnaeus.sourceforge.net/.

    Textpresso Central: a customizable platform for searching, text mining, viewing, and curating biomedical literature

    Background: The biomedical literature continues to grow at a rapid pace, making the challenge of knowledge retrieval and extraction ever greater. Tools that provide a means to search and mine the full text of literature thus represent an important way by which the efficiency of these processes can be improved. Results: We describe the next generation of the Textpresso information retrieval system, Textpresso Central (TPC). TPC builds on the strengths of the original system by expanding the full text corpus to include the PubMed Central Open Access Subset (PMC OA), as well as the WormBase C. elegans bibliography. In addition, TPC allows users to create a customized corpus by uploading and processing documents of their choosing. TPC is UIMA compliant, to facilitate compatibility with external processing modules, and takes advantage of Lucene indexing and search technology for efficient handling of millions of full text documents. Like Textpresso, TPC searches can be performed using keywords and/or categories (semantically related groups of terms), but to provide better context for interpreting and validating queries, search results may now be viewed as highlighted passages in the context of full text. To facilitate biocuration efforts, TPC also allows users to select text spans from the full text and annotate them, create customized curation forms for any data type, and send resulting annotations to external curation databases. As an example of such a curation form, we describe integration of TPC with the Noctua curation tool developed by the Gene Ontology (GO) Consortium. Conclusion: Textpresso Central is an online literature search and curation platform that enables biocurators and biomedical researchers to search and mine the full text of literature by integrating keyword and category searches with viewing search results in the context of the full text. It also allows users to create customized curation interfaces, use those interfaces to make annotations linked to supporting evidence statements, and then send those annotations to any database in the world.

    Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) Cellular Component curation

    Background: Manual curation of experimental data from the biomedical literature is an expensive and time-consuming endeavor. Nevertheless, most biological knowledge bases still rely heavily on manual curation for data extraction and entry. Text mining software that can semi- or fully automate information retrieval from the literature would thus provide a significant boost to manual curation efforts. Results: We employ the Textpresso category-based information retrieval and extraction system (http://www.textpresso.org), developed by WormBase, to explore how Textpresso might improve the efficiency with which we manually curate C. elegans proteins to the Gene Ontology's Cellular Component Ontology. Using a training set of sentences that describe results of localization experiments in the published literature, we generated three new curation task-specific categories (Cellular Components, Assay Terms, and Verbs) containing words and phrases associated with reports of experimentally determined subcellular localization. We compared the results of manual curation to those of Textpresso queries that searched the full text of articles for sentences containing terms from each of the three new categories plus the name of a previously uncurated C. elegans protein, and found that Textpresso searches identified curatable papers with recall and precision rates of 79.1% and 61.8%, respectively (F-score of 69.5%), when compared to manual curation. Within those documents, Textpresso identified relevant sentences with recall and precision rates of 30.3% and 80.1% (F-score of 44.0%). From returned sentences, curators were able to make 66.2% of all possible experimentally supported GO Cellular Component annotations with 97.3% precision (F-score of 78.8%). Measuring the relative efficiencies of Textpresso-based versus manual curation, we find that Textpresso has the potential to increase curation efficiency by at least 8-fold, and perhaps as much as 15-fold, given differences in individual curatorial speed. Conclusion: Textpresso is an effective tool for improving the efficiency of manual, experimentally based curation. Incorporating a Textpresso-based Cellular Component curation pipeline at WormBase has allowed us to transition from strictly manual curation of this data type to a more efficient pipeline of computer-assisted validation. Continued development of curation task-specific Textpresso categories will provide an invaluable resource for genomics databases that rely heavily on manual curation.
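
    The F-scores quoted in the abstract above can be checked from the stated recall and precision values. The sketch below assumes the balanced F-score (the harmonic mean of precision and recall), which the abstract does not state explicitly, and it treats the "66.2% of all possible annotations" figure as a recall. The function name `f_score` and the dictionary labels are illustrative and not taken from the paper.

```python
def f_score(recall: float, precision: float) -> float:
    """Balanced F-score (harmonic mean); inputs and result are in percent."""
    if recall + precision == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall %, precision %, F-score % as stated in the abstract)
reported = {
    "curatable papers": (79.1, 61.8, 69.5),
    "relevant sentences": (30.3, 80.1, 44.0),
    "GO annotations": (66.2, 97.3, 78.8),  # 66.2% treated as recall (assumption)
}

for label, (recall, precision, stated) in reported.items():
    computed = f_score(recall, precision)
    print(f"{label}: computed {computed:.1f}%, stated {stated:.1f}%")
```

    Under this assumption, each computed value agrees with the stated one to within about 0.1 percentage points. The small gap on the first figure is consistent with the authors computing from unrounded rates.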

    Challenging Ubiquitous Inverted Files

    Stand-alone ranking systems based on highly optimized inverted file structures are generally considered ‘the’ solution for building search engines. Observing various developments in software and hardware, we argue however that IR research faces a complex engineering problem in the quest for more flexible yet efficient retrieval systems. We propose to base the development of retrieval systems on ‘the database approach’: mapping high-level declarative specifications of the retrieval process into efficient query plans. We present the Mirror DBMS as a prototype implementation of a retrieval system based on this approach.

    Enhancing Content-And-Structure Information Retrieval using a Native XML Database

    Three approaches to content-and-structure XML retrieval are analysed in this paper: first by using Zettair, a full-text information retrieval system; second by using eXist, a native XML database, and third by using a hybrid XML retrieval system that uses eXist to produce the final answers from likely relevant articles retrieved by Zettair. INEX 2003 content-and-structure topics can be classified in two categories: the first retrieving full articles as final answers, and the second retrieving more specific elements within articles as final answers. We show that for both topic categories our initial hybrid system improves the retrieval effectiveness of a native XML database. For ranking the final answer elements, we propose and evaluate a novel retrieval model that utilises the structural relationships between the answer elements of a native XML database and retrieves Coherent Retrieval Elements. The final results of our experiments show that when the XML retrieval task focusses on highly relevant elements our hybrid XML retrieval system with the Coherent Retrieval Elements module is 1.8 times more effective than Zettair and 3 times more effective than eXist, and yields an effective content-and-structure XML retrieval