7 research outputs found

    ir_metadata: An Extensible Metadata Schema for IR Experiments

    Full text link
    The information retrieval (IR) community has a strong tradition of making the computational artifacts and resources available for future reuse, allowing the validation of experimental results. Besides the actual test collections, the underlying run files are often hosted in data archives as part of conferences like TREC, CLEF, or NTCIR. Unfortunately, the run data itself does not provide much information about the underlying experiment. For instance, the single run file is not of much use without the context of the shared task's website or the run data archive. In other domains, like the social sciences, it is good practice to annotate research data with metadata. In this work, we introduce ir_metadata - an extensible metadata schema for TREC run files based on the PRIMAD model. We propose to align the metadata annotations to PRIMAD, which considers components of computational experiments that can affect reproducibility. Furthermore, we outline important components and information that should be reported in the metadata and give evidence from the literature. To demonstrate the usefulness of these metadata annotations, we implement new features in repro_eval that support the outlined metadata schema for the use case of reproducibility studies. Additionally, we curate a dataset with run files derived from experiments with different instantiations of PRIMAD components and annotate these with the corresponding metadata. In the experiments, we cover reproducibility experiments that are identified by the metadata and classified by PRIMAD. With this work, we enable IR researchers to annotate TREC run files and improve the reuse value of experimental artifacts even further.Comment: Resource pape

    Technologies for extracting and analysing the credibility of health-related online content

    Get PDF
    The evolution of the Web has led to an improvement in information accessibility. This change has allowed access to more varied content at greater speed, but we must also be aware of the dangers involved. The results offered may be unreliable, inadequate, or of poor quality, leading to misinformation. This can have a greater or lesser impact depending on the domain, but is particularly sensitive when it comes to health-related content. In this thesis, we focus in the development of methods to automatically assess credibility. We also studied the reliability of the new Large Language Models (LLMs) to answer health questions. Finally, we also present a set of tools that might help in the massive analysis of web textual content

    Pretrained Transformers for Text Ranking: BERT and Beyond

    Get PDF
    The goal of text ranking is to generate an ordered list of texts retrieved from a corpus in response to a query. Although the most common formulation of text ranking is search, instances of the task can also be found in many natural language processing applications. This survey provides an overview of text ranking with neural network architectures known as transformers, of which BERT is the best-known example. The combination of transformers and self-supervised pretraining has been responsible for a paradigm shift in natural language processing (NLP), information retrieval (IR), and beyond. In this survey, we provide a synthesis of existing work as a single point of entry for practitioners who wish to gain a better understanding of how to apply transformers to text ranking problems and researchers who wish to pursue work in this area. We cover a wide range of modern techniques, grouped into two high-level categories: transformer models that perform reranking in multi-stage architectures and dense retrieval techniques that perform ranking directly. There are two themes that pervade our survey: techniques for handling long documents, beyond typical sentence-by-sentence processing in NLP, and techniques for addressing the tradeoff between effectiveness (i.e., result quality) and efficiency (e.g., query latency, model and index size). Although transformer architectures and pretraining techniques are recent innovations, many aspects of how they are applied to text ranking are relatively well understood and represent mature techniques. However, there remain many open research questions, and thus in addition to laying out the foundations of pretrained transformers for text ranking, this survey also attempts to prognosticate where the field is heading

    TrialMatch: A Transformer Architecture to Match Patients to Clinical Trials

    Get PDF
    Around 80% of clinical trials fail to meet the patient recruitment requirements, which not only hinders the market growth but also delays patients’ access to new and effec- tive treatments. A possible approach is to use Electronic Health Records (EHRs) to help match patients to clinical trials. Past attempts at achieving this exact goal took place, but due to a lack of data, they were unsuccessful. In 2021 Text REtrieval Conference (TREC) introduced the Clinical Trials Track, where participants were challenged with retrieving relevant clinical trials given the patient’s descriptions simulating admission notes. Utilizing the track results as a baseline, we tackled the challenge, for this, we re- sort to Information Retrieval (IR), implementing a pipeline for document ranking where we explore the different retrieval methods, how to filter the clinical trials based on the criteria, and reranking with Transformer based models. To tackle the problem, we ex- plored models pre-trained on the biomedical domain, how to deal with long queries and documents through query expansion and passage selection, and how to distinguish an eligible clinical trial from an excluded clinical trial, using techniques such as Named Entity Recognition (NER) and Clinical Assertion. Our results let to the finding that the current state-of-the-art Bidirectional Encoder Representations from Transformers (BERT) bi-encoders outperform the cross-encoders in the mentioned task, whilst proving that sparse retrieval methods are capable of obtaining competitive outcomes, and to finalize we showed that the use of the demographic information available can be used to improve the final result.Cerca de 80% dos ensaios clínicos não satisfazem os requisitos de recrutamento de paci- entes, o que não só dificulta o crescimento do mercado como também impede o acesso dos pacientes a novos e eficazes tratamentos. Uma abordagem possível é utilizar os Pron- tuários Eletrônicos para ajudar a combinar doentes a ensaios clínicos. Tentativas passadas para alcançar este exato objetivo tiveram lugar, mas devido à falta de dados, não foram bem sucedidos. Em 2021, a TREC introduziu a Clinical Trials Track, onde os participantes foram desafiados com a recuperação ensaios clínicos relevantes, dadas as descrições dos pacientes simulando notas de admissão. Utilizando os resultados da track como base, en- frentámos o desafio, para isso recorremos à Recuperação de Informação, implementando uma pipeline para a classificação de documentos onde exploramos os diferentes métodos de recuperação, como filtrar os ensaios clínicos com base nos critérios, e reclassificação com modelos baseados no Transformer. Para enfrentar o problema, explorámos modelos pré-treinados no domínio biomédico, como lidar com longas descrições e documentos, e como distinguir um ensaio clínico elegível de um ensaio clínico excluído, utilizando técnicas como Reconhecimento de Entidade Mencionada e Asserção Clínica. Os nossos re- sultados permitem concluir que os actuais bi-encoders de última geração BERT superam os cross-encoders BERT na tarefa mencionada, provamos que os métodos de recuperação esparsos são capazes de obter resultados competitivos, e para finalizar mostramos que a utilização da informação demográfica disponível pode ser utilizada para melhorar o resultado fina

    Neural Methods for Effective, Efficient, and Exposure-Aware Information Retrieval

    Get PDF
    Neural networks with deep architectures have demonstrated significant performance improvements in computer vision, speech recognition, and natural language processing. The challenges in information retrieval (IR), however, are different from these other application areas. A common form of IR involves ranking of documents--or short passages--in response to keyword-based queries. Effective IR systems must deal with query-document vocabulary mismatch problem, by modeling relationships between different query and document terms and how they indicate relevance. Models should also consider lexical matches when the query contains rare terms--such as a person's name or a product model number--not seen during training, and to avoid retrieving semantically related but irrelevant results. In many real-life IR tasks, the retrieval involves extremely large collections--such as the document index of a commercial Web search engine--containing billions of documents. Efficient IR methods should take advantage of specialized IR data structures, such as inverted index, to efficiently retrieve from large collections. Given an information need, the IR system also mediates how much exposure an information artifact receives by deciding whether it should be displayed, and where it should be positioned, among other results. Exposure-aware IR systems may optimize for additional objectives, besides relevance, such as parity of exposure for retrieved items and content publishers. In this thesis, we present novel neural architectures and methods motivated by the specific needs and challenges of IR tasks.Comment: PhD thesis, Univ College London (2020

    Geographic information extraction from texts

    Get PDF
    A large volume of unstructured texts, containing valuable geographic information, is available online. This information – provided implicitly or explicitly – is useful not only for scientific studies (e.g., spatial humanities) but also for many practical applications (e.g., geographic information retrieval). Although large progress has been achieved in geographic information extraction from texts, there are still unsolved challenges and issues, ranging from methods, systems, and data, to applications and privacy. Therefore, this workshop will provide a timely opportunity to discuss the recent advances, new ideas, and concepts but also identify research gaps in geographic information extraction
    corecore