Search CORE

132,325 research outputs found

Parsimonious Language Models for a Terabyte of Text

Author: Hiemstra Djoerd
Kamps Jaap
Kaptein Rianne
Li Rongmei
Publication venue: US National Institute of Standards and Technology (NIST)
Publication date: 01/01/2008
Field of study

The aims of this paper are twofold. Our first aim\ud is to compare results of the earlier Terabyte tracks\ud to the Million Query track. We submitted a number\ud of runs using different document representations\ud (such as full-text, title-fields, or incoming\ud anchor-texts) to increase pool diversity. The initial\ud results show broad agreement in system rankings\ud over various measures on topic sets judged at both\ud Terabyte and Million Query tracks, with runs using\ud the full-text index giving superior results on\ud all measures, but also some noteworthy upsets.\ud Our second aim is to explore the use of parsimonious\ud language models for retrieval on terabyte-scale\ud collections. These models are smaller thus\ud more efficient than the standard language models\ud when used at indexing time, and they may also improve\ud retrieval performance. We have conducted\ud initial experiments using parsimonious models in\ud combination with pseudo-relevance feedback, for\ud both the Terabyte and Million Query track topic\ud sets, and obtained promising initial results

University of Twente Research Information

UvA-DARE

International Migration, Integration and Social Cohesion online publications

MIaS: Math-Aware Retrieval in Digital Mathematical Libraries

Author: Aizawa Akiko
Aizawa Akiko
Białecki Andrzej
Cervone Davide
Líska Martin
Líska Martin
Richard
Rygl Jan
Růžička Michal
Růžička Michal
Sojka Petr
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2018
Field of study

Digital mathematical libraries (DMLs) such as arXiv, Numdam, and EuDML contain mainly documents from STEM fields, where mathematical formulae are often more important than text for understanding. Conventional information retrieval (IR) systems are unable to represent formulae and they are therefore ill-suited for math information retrieval (MIR). To fill the gap, we have developed, and open-sourced the MIaS MIR system. MIaS is based on the full-text search engine Apache Lucene. On top of text retrieval, MIaS also incorporates a set of tools for preprocessing mathematical formulae. We describe the design of the system and present speed, and quality evaluation results. We show that MIaS is both efficient, and effective, as evidenced by our victory in the NTCIR-11 Math-2 task

arXiv.org e-Print Archive

Crossref

Univerzitní repozitář Masarykovy univerzity

A Strategy for Electronic Dissemination of NASA Langley Technical Publications

Author: Adkins Susan L.
Ambur Manjula Y.
Campbell Bryan A.
Holland Scott D.
McCaskill Mary K.
Nelson Michael L.
Roper Donna G.
Walsh Joanne L.
Publication venue: ODU Digital Commons
Publication date: 01/01/1994
Field of study

To demonstrate NASA Langley Research Center\u27s relevance and to transfer technology to external customers in a timely and efficient manner, Langley has formed a working group to study and recommend a course of action for the electronic dissemination of technical reports (EDTR). The working group identified electronic report requirements (e.g., accessibility, file format, search requirements) of customers in U.S. industry through numerous site visits and personal contacts. Internal surveys were also used to determine commonalities in document preparation methods. From these surveys, a set of requirements for an electronic dissemination system was developed. Two candidate systems were identified and evaluated against the set of requirements: the Full-Text Electronic Documents System (FEDS), which is a full-text retrieval system based on the commercial document management package Interleaf, and the Langley Technical Report Server (LTRS), which is a Langley-developed system based on the publicly available World Wide Web (WWW) software system. Factors that led to the selection of LTRS as the vehicle for electronic dissemination included searching and viewing capability, current system operability, and client software availability for multiple platforms at no cost to industry. This report includes the survey results, evaluations, a description of the LTRS architecture, recommended policy statement, and suggestions for future implementations

Old Dominion University

Design and Implementation of a Multimedia Information Retrieval Engine for the MSR-Bing Image Retrieval Challenge

Author: RAZZANO ALESSIA
Publication venue: 'Pisa University Press'
Publication date: 24/02/2016
Field of study

The aim of this work is to design and implement a multimedia information retrieval engine for the MSR-Bing Retrieval Challenge provided by Microsoft. The challenge is based on the Clickture dataset, generated from click logs of Bing image search. The system has to predict the relevance of images with respect to text queries, by associating a score to a pair (image, text query) that indicates how the text query is good at describing the image content. We attempt to combine textual and visual information, by performing text-based and content-based image retrieval. The framework used to extract visual features is Caffe, an efficient implementation of deep Convolutional Neural Network(CNN). Decision is taken using a knowledge base containing triplets each consisting of a text query, an image, and the number of times that a users clicked on the image, in correspondence of the text query. Two strategies were proposed. In one case we analyse the intersection among the riplets elements retrieved respectively using the textual query and the image itself. In the other case we analyse the union. To solve efficiency issues we proposed an approach that index visual features using Apache Lucene, that is a text search engine library written entirely in Java, suitable for nearly any application requiring full-text search abilities. To this aim, we have converted image features into a textual form, to index them into an inverted index by means of Lucene. In this way we were able to set up a robust retrieval system that combines full-text search with content-based image retrieval capabilities. To prove that our search of textually and visually similar images really works, a small web-based prototype has been implemented. We evaluated different versions of our system over the development set in order to evaluate the measures of similarity to compare images, and to assess the best sorting strategy. Finally, our proposed approaches have been compared with those implemented by the winners of previous challenge editions

Electronic Thesis and Dissertation Archive - Università di Pisa

LINNAEUS: A species name identification system for biomedical literature

Author: Bergman Casey M
Gerner Martin
Nenadic Goran
Publication venue: BioMed Central
Publication date: 01/01/2010
Field of study

Abstract Background The task of recognizing and identifying species names in biomedical literature has recently been regarded as critical for a number of applications in text and data mining, including gene name recognition, species-specific document retrieval, and semantic enrichment of biomedical articles. Results In this paper we describe an open-source species name recognition and normalization software system, LINNAEUS, and evaluate its performance relative to several automatically generated biomedical corpora, as well as a novel corpus of full-text documents manually annotated for species mentions. LINNAEUS uses a dictionary-based approach (implemented as an efficient deterministic finite-state automaton) to identify species names and a set of heuristics to resolve ambiguous mentions. When compared against our manually annotated corpus, LINNAEUS performs with 94% recall and 97% precision at the mention level, and 98% recall and 90% precision at the document level. Our system successfully solves the problem of disambiguating uncertain species mentions, with 97% of all mentions in PubMed Central full-text documents resolved to unambiguous NCBI taxonomy identifiers. Conclusions LINNAEUS is an open source, stand-alone software system capable of recognizing and normalizing species name mentions with speed and accuracy, and can therefore be integrated into a range of bioinformatics and text-mining applications. The software and manually annotated corpus can be downloaded freely at <url>http://linnaeus.sourceforge.net/</url>.</p

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

The University of Manchester - Institutional Repository

Textpresso Central: a customizable platform for searching, text mining, viewing, and curating biomedical literature

Author: Li Y.
Müller H.-M.
Sternberg P. W.
Van Auken Kimberly M.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 09/03/2018
Field of study

Background: The biomedical literature continues to grow at a rapid pace, making the challenge of knowledge retrieval and extraction ever greater. Tools that provide a means to search and mine the full text of literature thus represent an important way by which the efficiency of these processes can be improved. Results: We describe the next generation of the Textpresso information retrieval system, Textpresso Central (TPC). TPC builds on the strengths of the original system by expanding the full text corpus to include the PubMed Central Open Access Subset (PMC OA), as well as the WormBase C. elegans bibliography. In addition, TPC allows users to create a customized corpus by uploading and processing documents of their choosing. TPC is UIMA compliant, to facilitate compatibility with external processing modules, and takes advantage of Lucene indexing and search technology for efficient handling of millions of full text documents. Like Textpresso, TPC searches can be performed using keywords and/or categories (semantically related groups of terms), but to provide better context for interpreting and validating queries, search results may now be viewed as highlighted passages in the context of full text. To facilitate biocuration efforts, TPC also allows users to select text spans from the full text and annotate them, create customized curation forms for any data type, and send resulting annotations to external curation databases. As an example of such a curation form, we describe integration of TPC with the Noctua curation tool developed by the Gene Ontology (GO) Consortium. Conclusion: Textpresso Central is an online literature search and curation platform that enables biocurators and biomedical researchers to search and mine the full text of literature by integrating keyword and category searches with viewing search results in the context of the full text. It also allows users to create customized curation interfaces, use those interfaces to make annotations linked to supporting evidence statements, and then send those annotations to any database in the world

Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) Cellular Component curation

Author: Chan Juancarlos
Jaffery Joshua
Müller Hans-Michael
Sternberg Paul W.
Van Auken Kimberly
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2009
Field of study

Background: Manual curation of experimental data from the biomedical literature is an expensive and time-consuming endeavor. Nevertheless, most biological knowledge bases still rely heavily on manual curation for data extraction and entry. Text mining software that can semi- or fully automate information retrieval from the literature would thus provide a significant boost to manual curation efforts. Results: We employ the Textpresso category-based information retrieval and extraction system http://www.textpresso.org webcite, developed by WormBase to explore how Textpresso might improve the efficiency with which we manually curate C. elegans proteins to the Gene Ontology's Cellular Component Ontology. Using a training set of sentences that describe results of localization experiments in the published literature, we generated three new curation task-specific categories (Cellular Components, Assay Terms, and Verbs) containing words and phrases associated with reports of experimentally determined subcellular localization. We compared the results of manual curation to that of Textpresso queries that searched the full text of articles for sentences containing terms from each of the three new categories plus the name of a previously uncurated C. elegans protein, and found that Textpresso searches identified curatable papers with recall and precision rates of 79.1% and 61.8%, respectively (F-score of 69.5%), when compared to manual curation. Within those documents, Textpresso identified relevant sentences with recall and precision rates of 30.3% and 80.1% (F-score of 44.0%). From returned sentences, curators were able to make 66.2% of all possible experimentally supported GO Cellular Component annotations with 97.3% precision (F-score of 78.8%). Measuring the relative efficiencies of Textpresso-based versus manual curation we find that Textpresso has the potential to increase curation efficiency by at least 8-fold, and perhaps as much as 15-fold, given differences in individual curatorial speed. Conclusion: Textpresso is an effective tool for improving the efficiency of manual, experimentally based curation. Incorporating a Textpresso-based Cellular Component curation pipeline at WormBase has allowed us to transition from strictly manual curation of this data type to a more efficient pipeline of computer-assisted validation. Continued development of curation task-specific Textpresso categories will provide an invaluable resource for genomics databases that rely heavily on manual curation

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Caltech Authors

Challenging Ubiquitous Inverted Files

Author: Vries A.P. de
Publication venue: European Research Consortium for Informatics and Mathematics (ERCIM)
Publication date: 01/01/2000
Field of study

Stand-alone ranking systems based on highly optimized inverted file structures are generally considered ‘the’ solution for building search engines. Observing various developments in software and hardware, we argue however that IR research faces a complex engineering problem in the quest for more flexible yet efficient retrieval systems. We propose to base the development of retrieval systems on ‘the database approach’: mapping high-level declarative specifications of the retrieval process into efficient query plans. We present the Mirror DBMS as a prototype implementation of a retrieval system based on this approach

CWI's Institutional Repository

University of Twente Research Information

Enhancing Content-And-Structure Information Retrieval using a Native XML Database

Author: Pehcevski Jovan
Thom James A.
Vercoustre Anne-Marie
Publication venue
Publication date: 01/01/2004
Field of study

Three approaches to content-and-structure XML retrieval are analysed in this paper: first by using Zettair, a full-text information retrieval system; second by using eXist, a native XML database, and third by using a hybrid XML retrieval system that uses eXist to produce the final answers from likely relevant articles retrieved by Zettair. INEX 2003 content-and-structure topics can be classified in two categories: the first retrieving full articles as final answers, and the second retrieving more specific elements within articles as final answers. We show that for both topic categories our initial hybrid system improves the retrieval effectiveness of a native XML database. For ranking the final answer elements, we propose and evaluate a novel retrieval model that utilises the structural relationships between the answer elements of a native XML database and retrieves Coherent Retrieval Elements. The final results of our experiments show that when the XML retrieval task focusses on highly relevant elements our hybrid XML retrieval system with the Coherent Retrieval Elements module is 1.8 times more effective than Zettair and 3 times more effective than eXist, and yields an effective content-and-structure XML retrieval

arXiv.org e-Print Archive

CiteSeerX

INRIA a CCSD electronic archive server

Hal-Diderot