The Lucene for Information Access and Retrieval Research (LIARR) Workshop at SIGIR 2017
As an empirical discipline, information access and retrieval research requires substantial software infrastructure to index and search large collections. This workshop is motivated by the desire to better align information retrieval research with the practice of building search applications from the perspective of open-source information retrieval systems. Our goal is to promote the use of Lucene for information access and retrieval research.
Toward Reproducible Baselines: The Open-Source IR Reproducibility Challenge
The Open-Source IR Reproducibility Challenge brought together developers of open-source search engines to provide reproducible baselines of their systems in a common environment on Amazon EC2. The product is a repository that contains all the code necessary to generate competitive ad hoc retrieval baselines, such that, with a single script, anyone with a copy of the collection can reproduce the submitted runs. Our vision is that these results will serve as widely accessible points of comparison in future IR research. This project represents an ongoing effort; here we describe the first phase of the challenge, organized as part of a workshop at SIGIR 2015. We have succeeded modestly so far, achieving our main goals on the Gov2 collection with seven open-source search engines. In this paper, we describe our methodology, share experimental results, and discuss lessons learned as well as next steps.
Sidra5: a search system with geographic signatures
Master's thesis in Informatics Engineering, presented to the Universidade de Lisboa through the Faculdade de Ciências, 2007. This dissertation presents the development of a geographic information search system that implements geographic signatures, a novel approach for modeling the geographic information present in documents. The goal of the project was to determine whether the geographic semantics present in documents, captured as geographic signatures, contributes to the improvement of search results. Several strategies for computing the similarity between the geographic signatures of queries and documents are proposed and evaluated. The results show that, in some circumstances, geographic signatures can indeed improve the quality of geographic searches.
Objective and automated protocols for the evaluation of biomedical search engines using No Title Evaluation protocols
Background: The evaluation of information retrieval techniques has traditionally relied on human judges to determine which documents are relevant to a query and which are not. This protocol is used in the Text Retrieval Evaluation Conference (TREC), organized annually for the past 15 years, to support the unbiased evaluation of novel information retrieval approaches. The TREC Genomics Track has recently been introduced to measure the performance of information retrieval for biomedical applications.

Results: We describe two protocols for evaluating biomedical information retrieval techniques without human relevance judgments, which we call No Title Evaluation (NT Evaluation). The first protocol measures performance for focused searches, where only one relevant document exists for each query. The second measures performance for queries expected to have potentially many relevant documents each (high-recall searches). Both protocols take advantage of the clear separation of titles and abstracts found in Medline. We compare the performance obtained with these evaluation protocols to results obtained by reusing the relevance judgments produced in the 2004 and 2005 TREC Genomics Track, and observe significant correlations between the performance rankings generated by our approach and by TREC. Spearman's correlation coefficients in the range 0.79–0.92 are observed when comparing bpref measured with NT Evaluation and with TREC evaluations. For comparison, coefficients in the range 0.86–0.94 are observed when evaluating the same set of methods with data from two independent TREC Genomics Track evaluations. We discuss the advantages of NT Evaluation over the TRels and data fusion evaluation protocols introduced recently.

Conclusion: Our results suggest that the NT Evaluation protocols described here could be used to optimize some search engine parameters before human evaluation. Further research is needed to determine whether NT Evaluation or variants of these protocols can fully substitute for human evaluations.
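The Spearman coefficients quoted above compare two rankings of the same set of retrieval methods. For reference, a minimal sketch of how such a coefficient can be computed from two per-system score lists, using the textbook formula 1 − 6Σd²/(n(n²−1)); it assumes no tied scores, and all names are illustrative:

```python
def spearman(scores_a, scores_b):
    """Spearman's rank correlation between two score lists over the same
    systems, via 1 - 6*sum(d^2) / (n*(n^2 - 1)). Assumes no tied scores."""
    def ranks(scores):
        # indices sorted by descending score; best score gets rank 1
        order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
        r = [0] * len(scores)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    d2 = sum((a - b) ** 2 for a, b in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

# identical rankings give 1.0, fully reversed rankings give -1.0
```

In practice a library routine that also handles tied ranks would be preferable; the sketch only illustrates the rank-difference computation behind the reported coefficients.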
Indexing and Searching Semantically Annotated Texts
This thesis addresses the problem of searching semantically annotated texts. Its task is to propose and implement a system for finding documents that contain user-defined fragments and for enriching entities or non-entities with syntactic and semantic information that is not explicitly stated. The thesis focuses on the analysis of an existing solution and the principles of the MG4J engine. The problem was solved by extending the existing system and implementing a new component that collects the retrieved data. As a result, two programs were implemented: a server-side application that performs retrieval over a document collection stored on a server, and a client-side application that collects data from multiple servers. The implemented programs make it possible to run advanced queries and to obtain information about real-world entities that is not explicitly mentioned in the text.
Efficient Optimally Lazy Algorithms for Minimal-Interval Semantics
Minimal-interval semantics associates with each query over a document a set of intervals, called witnesses, that are incomparable with respect to inclusion (i.e., they form an antichain): witnesses define the minimal regions of the document satisfying the query. Minimal-interval semantics makes it easy to define and compute several sophisticated proximity operators, provides snippets for user presentation, and can be used to rank documents. In this paper we provide algorithms for computing conjunction and disjunction that are linear in the number of intervals and logarithmic in the number of operands; for additional operators, such as ordered conjunction and Brouwerian difference, we provide linear algorithms. In all cases, space is linear in the number of operands. More importantly, we define a formal notion of optimal laziness, and either prove it, or prove its impossibility, for each algorithm. We cast our results in a general framework of antichains of intervals on total orders, making our algorithms directly applicable to other domains.

Comment: 24 pages, 4 figures. A preliminary (now outdated) version was presented at SPIRE 200
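To illustrate the semantics described above, here is a minimal Python sketch that computes the witnesses of a conjunction query from per-term position lists. It is a plain plane-sweep, not the optimally lazy algorithms of the paper, and it assumes distinct term positions; all names are illustrative:

```python
def minimal_windows(postings):
    """Witnesses of a conjunction under minimal-interval semantics:
    intervals containing at least one occurrence of every term, kept
    only when no smaller such interval nests inside them (an antichain)."""
    idx = [0] * len(postings)           # one cursor per sorted position list
    candidates = []
    while all(i < len(p) for i, p in zip(idx, postings)):
        fronts = [p[i] for i, p in zip(idx, postings)]
        candidates.append((min(fronts), max(fronts)))
        idx[fronts.index(min(fronts))] += 1   # advance the leftmost cursor
    # Left endpoints strictly increase and right endpoints never decrease,
    # so a candidate is minimal iff the next one has a strictly larger right end.
    return [w for i, w in enumerate(candidates)
            if i + 1 == len(candidates) or candidates[i + 1][1] > w[1]]

# e.g. positions of two terms in one document
print(minimal_windows([[1, 5], [3, 9]]))   # the antichain of minimal intervals
```

The sweep materializes every candidate eagerly; the point of the paper's lazy algorithms is to produce the next witness on demand without this work.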
Implementation of an information retrieval system within a central knowledge management system
Pages numbered I-XIII, 14-126. Internship carried out at Wipro Portugal SA and supervised by Eng. Hugo Neto. Integrated master's thesis. Informatics and Computing Engineering. Faculdade de Engenharia, Universidade do Porto. 201
Getting More out of Biomedical Documents with GATE's Full Lifecycle Open Source Text Analytics.
This software article describes the GATE family of open source text analysis tools and processes. GATE is one of the most widely used systems of its type, with yearly download rates in the tens of thousands and many active users in both academic and industrial contexts. In this paper we report three examples of GATE-based systems operating in the life sciences and in medicine: first, genome-wide association studies, which have contributed to the discovery of a head and neck cancer mutation association; second, medical records analysis, which has significantly increased the statistical power of treatment/outcome models in the UK's largest psychiatric patient cohort; and third, richer constructs in drug-related searching. We also explore the ways in which the GATE family supports the various stages of the lifecycle present in our examples. We conclude that the deployment of text mining for document abstraction or rich search and navigation is best thought of as a process, and that with the right computational tools and data collection strategies this process can be made defined and repeatable. The GATE research programme is now 20 years old and has grown from its roots as a specialist development tool for text processing into a rather comprehensive ecosystem, bringing together software developers, language engineers and research staff from diverse fields. GATE now has a strong claim to cover a uniquely wide range of the lifecycle of text analysis systems. It forms a focal point for the integration and reuse of advances made by many people (the majority outside the authors' own group) who work in text processing for biomedicine and other areas. GATE is available online under GNU open source licences and runs on all major operating systems. Support is available from an active user and developer community and also on a commercial basis.