29 research outputs found

    The Lucene for Information Access and Retrieval Research (LIARR) Workshop at SIGIR 2017

    As an empirical discipline, information access and retrieval research requires substantial software infrastructure to index and search large collections. This workshop is motivated by the desire to better align information retrieval research with the practice of building search applications, from the perspective of open-source information retrieval systems. Our goal is to promote the use of Lucene for information access and retrieval research.

    Toward Reproducible Baselines: The Open-Source IR Reproducibility Challenge

    The Open-Source IR Reproducibility Challenge brought together developers of open-source search engines to provide reproducible baselines of their systems in a common environment on Amazon EC2. The product is a repository that contains all code necessary to generate competitive ad hoc retrieval baselines, such that with a single script, anyone with a copy of the collection can reproduce the submitted runs. Our vision is that these results will serve as widely accessible points of comparison in future IR research. This project represents an ongoing effort; here we describe the first phase of the challenge, organized as part of a workshop at SIGIR 2015. We have succeeded modestly so far, achieving our main goals on the Gov2 collection with seven open-source search engines. In this paper, we describe our methodology, share experimental results, and discuss lessons learned as well as next steps.
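
    The "single script" idea is easy to picture with a sketch. The driver below is purely illustrative: the engine CLI, file names, and baseline score are hypothetical stand-ins, not the challenge repository's actual layout; only trec_eval is a real tool, invoked with its standard -m flag.

        import subprocess
        import sys

        # Hypothetical paths and target score; a real run would take these
        # from the challenge repository, not from constants invented here.
        COLLECTION = "/data/gov2"
        TOPICS = "topics.701-750.txt"
        QRELS = "qrels.701-750.txt"
        EXPECTED_MAP = 0.3000   # published baseline to reproduce (invented)
        TOLERANCE = 0.0005

        def run(cmd):
            print("$", " ".join(cmd))
            subprocess.check_call(cmd)

        # 1. Index the collection and produce a TREC-format run
        #    ("search-engine" is an assumed, engine-specific CLI).
        run(["search-engine", "index", "--input", COLLECTION, "--output", "gov2.idx"])
        run(["search-engine", "search", "--index", "gov2.idx",
             "--topics", TOPICS, "--run", "baseline.run"])

        # 2. Score the run with trec_eval and parse out MAP.
        out = subprocess.check_output(
            ["trec_eval", "-m", "map", QRELS, "baseline.run"], text=True)
        map_score = float(out.split()[-1])

        # 3. The run counts as reproduced only if MAP matches the published value.
        ok = abs(map_score - EXPECTED_MAP) <= TOLERANCE
        print(f"MAP = {map_score:.4f} ({'reproduced' if ok else 'MISMATCH'})")
        sys.exit(0 if ok else 1)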

    Sidra5: a search system with geographic signatures

    Master's thesis in Informatics Engineering, presented to the Universidade de Lisboa through the Faculdade de Ciências, 2007. This dissertation presents the development of a geographic information search system which implements geographic signatures, a novel approach for modeling the geographic information present in documents. The goal of the project was to determine whether the geographic semantics present in documents, captured as geographic signatures, contributes to the improvement of search results. Several strategies for computing the similarity between the geographic signatures of queries and documents are proposed and experimentally evaluated. The results show that, in some circumstances, geographic signatures can indeed improve the quality of results for geographic queries.
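
    The abstract does not spell out the thesis's similarity strategies, so as one hedged illustration, a geographic signature can be treated as a weight vector over place identifiers, with query and document signatures compared by cosine similarity:

        import math

        def signature_similarity(query_sig, doc_sig):
            """Cosine similarity between two geographic signatures given as
            {place_id: weight} dicts. An illustrative stand-in for the
            thesis's strategies, which the abstract does not detail."""
            dot = sum(w * doc_sig[p] for p, w in query_sig.items() if p in doc_sig)
            norm_q = math.sqrt(sum(w * w for w in query_sig.values()))
            norm_d = math.sqrt(sum(w * w for w in doc_sig.values()))
            return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

        # Hypothetical signatures: places weighted by their salience in the text.
        query_sig = {"Lisboa": 1.0, "Portugal": 0.4}
        doc_sig = {"Lisboa": 0.8, "Porto": 0.5, "Portugal": 0.6}
        print(f"similarity = {signature_similarity(query_sig, doc_sig):.3f}")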

    Objective and automated protocols for the evaluation of biomedical search engines using No Title Evaluation protocols

    Background: The evaluation of information retrieval techniques has traditionally relied on human judges to determine which documents are relevant to a query and which are not. This protocol is used in the Text REtrieval Conference (TREC), organized annually for the past 15 years, to support the unbiased evaluation of novel information retrieval approaches. The TREC Genomics Track has recently been introduced to measure the performance of information retrieval for biomedical applications. Results: We describe two protocols for evaluating biomedical information retrieval techniques without human relevance judgments, which we call No Title Evaluation (NT Evaluation). The first protocol measures performance for focused searches, where only one relevant document exists for each query. The second measures performance for queries expected to have potentially many relevant documents each (high-recall searches). Both protocols take advantage of the clear separation of titles and abstracts found in Medline. We compare the performance obtained with these evaluation protocols to results obtained by reusing the relevance judgments produced in the 2004 and 2005 TREC Genomics Track, and observe significant correlations between the performance rankings generated by our approach and by TREC: Spearman's correlation coefficients in the range 0.79–0.92 are observed when comparing bpref measured with NT Evaluation against TREC evaluations. For comparison, coefficients in the range 0.86–0.94 are observed when evaluating the same set of methods with data from two independent TREC Genomics Track evaluations. We discuss the advantages of NT Evaluation over the TRels and data fusion evaluation protocols introduced recently. Conclusion: Our results suggest that the NT Evaluation protocols described here could be used to optimize some search engine parameters before human evaluation. Further research is needed to determine whether NT Evaluation or variants of these protocols can fully substitute for human evaluations.
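
    The focused-search protocol can be made concrete with a short sketch. Everything below is a reconstruction from the abstract's description, not the paper's code: each title is issued as a query against title-stripped abstracts, the title's own document is the single relevant result, and a reciprocal-rank aggregate (standing in for the paper's bpref-based measurements) scores the engine without any human judgments.

        def nt_focused_mrr(documents, search, k=100):
            """Focused-search No Title Evaluation sketch. `documents` maps a
            doc id to (title, abstract); `search(query, k)` is the assumed
            black-box engine, returning ranked doc ids over abstracts that
            have had their titles removed."""
            total = 0.0
            for doc_id, (title, _abstract) in documents.items():
                ranking = search(title, k)
                if doc_id in ranking:
                    total += 1.0 / (ranking.index(doc_id) + 1)
            return total / len(documents)

        # Toy corpus and a trivial word-overlap "engine" as the system under test.
        docs = {
            "pmid1": ("BRCA1 mutations in breast cancer", "We study BRCA1 variants ..."),
            "pmid2": ("Protein folding with hidden Markov models", "HMM-based folding ..."),
        }

        def toy_search(query, k):
            q = set(query.lower().split())
            ranked = sorted(docs, key=lambda d: -len(q & set(docs[d][1].lower().split())))
            return ranked[:k]

        print(f"NT focused-search MRR = {nt_focused_mrr(docs, toy_search):.3f}")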

    Indexing and Searching Semantically Annotated Texts

    This thesis addresses search in semantically annotated texts. Its goal is to design and implement a system that retrieves documents containing user-defined fragments and enriches entities or non-entities with syntactic and semantic information that is not explicitly stated. The thesis analyzes an existing solution and the working principles of the MG4J engine. The problem is solved by extending the functionality of the existing system and implementing a new component responsible for collecting the retrieved data. The result is two programs: a server-side application that searches the document collection stored on a server, and a client-side application that collects results from multiple servers. Together they make it possible to pose advanced queries and to obtain information about real-world entities that is not explicitly mentioned in the text.
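
    The client/server split can be pictured with a small fan-out sketch. The endpoint path and JSON shape below are invented for illustration and are not the thesis's actual interface:

        import json
        from concurrent.futures import ThreadPoolExecutor
        from urllib.parse import urlencode
        from urllib.request import urlopen

        # Hypothetical search servers; a real deployment defines its own endpoints.
        SERVERS = ["http://server-a:8080", "http://server-b:8080"]

        def query_server(base_url, query):
            """Ask one server for matching documents; assumes a JSON response
            of the form {"hits": [{"doc": ..., "score": ...}, ...]}."""
            url = f"{base_url}/search?{urlencode({'q': query})}"
            with urlopen(url, timeout=10) as resp:
                return json.load(resp)["hits"]

        def federated_search(query):
            """Client-side collector: fan the query out to every server in
            parallel and merge the hits by score, mirroring the role of the
            client application described above."""
            with ThreadPoolExecutor(max_workers=len(SERVERS)) as pool:
                per_server = pool.map(lambda s: query_server(s, query), SERVERS)
            merged = [hit for hits in per_server for hit in hits]
            return sorted(merged, key=lambda h: h["score"], reverse=True)

        if __name__ == "__main__":
            for hit in federated_search("person nearby city")[:10]:
                print(hit["score"], hit["doc"])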

    Efficient Optimally Lazy Algorithms for Minimal-Interval Semantics

    Minimal-interval semantics associates with each query over a document a set of intervals, called witnesses, that are incomparable with respect to inclusion (i.e., they form an antichain): witnesses define the minimal regions of the document satisfying the query. Minimal-interval semantics makes it easy to define and compute several sophisticated proximity operators, provides snippets for user presentation, and can be used to rank documents. In this paper we provide algorithms for computing conjunction and disjunction that are linear in the number of intervals and logarithmic in the number of operands; for additional operators, such as ordered conjunction and Brouwerian difference, we provide linear algorithms. In all cases, space is linear in the number of operands. More importantly, we define a formal notion of optimal laziness, and either prove it, or prove its impossibility, for each algorithm. We cast our results in a general framework of antichains of intervals on total orders, making our algorithms directly applicable to other domains. Comment: 24 pages, 4 figures. A preliminary (now outdated) version was presented at SPIRE 200
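
    To make the semantics concrete, here is a textbook plane-sweep for the conjunction case; it is a hedged reconstruction, not the paper's optimally lazy algorithm. Given one sorted position list per term, it emits the antichain of minimal witnesses in time linear in the total number of positions, with a heap factor logarithmic in the number of terms (positions are assumed distinct across terms, as in a token stream):

        import heapq
        from typing import List, Tuple

        def minimal_witnesses(position_lists: List[List[int]]) -> List[Tuple[int, int]]:
            """Antichain of minimal witnesses for a conjunctive query. Each
            input list holds the sorted positions of one term; a witness is
            an interval [l, r] covering every term with no smaller witness
            nested inside it."""
            iters = [iter(p) for p in position_lists]
            firsts = [next(it, None) for it in iters]
            if any(f is None for f in firsts):
                return []                       # some term never occurs
            heap = [(pos, i) for i, pos in enumerate(firsts)]
            heapq.heapify(heap)
            right = max(firsts)                 # right end of current candidate

            witnesses: List[Tuple[int, int]] = []
            while True:
                left, i = heap[0]               # leftmost cursor position
                nxt = next(iters[i], None)
                if nxt is None:                 # cursor exhausted: last witness
                    witnesses.append((left, right))
                    return witnesses
                if nxt > right:
                    # The advanced cursor jumps past the current right end, so
                    # no later candidate nests inside (left, right): minimal.
                    witnesses.append((left, right))
                    right = nxt
                heapq.heapreplace(heap, (nxt, i))

        print(minimal_witnesses([[1, 8], [3, 5]]))  # -> [(1, 3), (5, 8)]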

    Implementation of an information retrieval system within a central knowledge management system

    Numbered pages: I-XIII, 14-126. Internship carried out at Wipro Portugal SA, supervised by Eng. Hugo Neto. Integrated master's thesis. Informatics and Computing Engineering. Faculdade de Engenharia, Universidade do Porto. 201

    Getting More out of Biomedical Documents with GATE's Full Lifecycle Open Source Text Analytics.

    This software article describes the GATE family of open source text analysis tools and processes. GATE is one of the most widely used systems of its type, with yearly downloads in the tens of thousands and many active users in both academic and industrial contexts. In this paper we report three examples of GATE-based systems operating in the life sciences and in medicine: first, genome-wide association studies, which have contributed to the discovery of a head and neck cancer mutation association; second, medical records analysis, which has significantly increased the statistical power of treatment/outcome models in the UK's largest psychiatric patient cohort; third, richer constructs in drug-related searching. We also explore the ways in which the GATE family supports the various stages of the lifecycle present in our examples. We conclude that the deployment of text mining for document abstraction or rich search and navigation is best thought of as a process, and that with the right computational tools and data collection strategies this process can be made well-defined and repeatable. The GATE research programme is now 20 years old and has grown from its roots as a specialist development tool for text processing into a rather comprehensive ecosystem, bringing together software developers, language engineers, and research staff from diverse fields. GATE now has a strong claim to cover a uniquely wide range of the lifecycle of text analysis systems. It forms a focal point for the integration and reuse of advances made by many people (the majority outside of the authors' own group) who work in text processing for biomedicine and other areas. GATE is available online under GNU open source licences and runs on all major operating systems. Support is available from an active user and developer community and also on a commercial basis.