112 research outputs found

    Semantic concept extraction from electronic medical records for enhancing information retrieval performance

    With the healthcare industry increasingly using EMRs, there emerges an opportunity for knowledge discovery within the healthcare domain that was not possible with paper-based medical records. One such opportunity is to discover UMLS concepts from EMRs. However, with opportunities come challenges that need to be addressed. Medical verbiage is very different from common English verbiage, and it is reasonable to assume that extracting information from medical text requires different protocols than those currently used for common English text. This thesis proposes two new semantic matching models: Term-Based Matching and CUI-Based Matching. These two models use specialized biomedical text mining tools that extract medical concepts from EMRs. Extensive experiments to rank the extracted concepts are conducted on the University of Pittsburgh BLULab NLP Repository for the TREC 2011 Medical Records track dataset, which consists of 101,711 EMRs that contain concepts in 34 predefined topics. This thesis compares the proposed semantic matching models against the traditional weighting equations and information retrieval tools used in the academic world today.

    Inter-relação das técnicas Term Extration e Query Expansion aplicadas na recuperação de documentos textuais

    Thesis (doctorate) - Universidade Federal de Santa Catarina, Centro Tecnológico. Programa de Pós-graduação em Engenharia e Gestão do Conhecimento. According to Sighal (2006), people recognize the importance of storing and searching for information and, with the advent of computers, it became possible to store large quantities of it in databases. Consequently, cataloguing the information in these databases became indispensable. In this context, the field of Information Retrieval emerged in the 1950s with the aim of promoting the construction of computational tools that would allow users to use these databases more efficiently. The main objective of this research is to develop a Computational Model that enables the retrieval of textual documents ranked by semantic similarity, based on the intersection of the Term Extraction and Query Expansion techniques.

    Knowledge mining over scientific literature and technical documentation

    Abstract This dissertation focuses on the extraction of information implicitly encoded in domain descriptions (technical terminology and related items) and its usage within a restricted-domain question answering system (QA). Since different variants of the same term can be used to refer to the same domain entity, it is necessary to recognize all possible forms of a given term and structure them, so that they can be used in the question answering process. The knowledge about domain descriptions and their mutual relations is leveraged in an extension to an existing QA system, aimed at the technical maintenance manual of a well-known commercial aircraft. The original version of the QA system did not make use of domain descriptions, which are the novelty introduced by the present work. The explicit treatment of domain descriptions provided considerable gains in terms of efficiency, in particular in the process of analysis of the background document collection. Similar techniques were later applied to another domain (biomedical scientific literature), focusing in particular on protein-protein interactions. This dissertation describes in particular: (1) the extraction of domain specific lexical items which refer to entities of the domain; (2) the detection of relationships (like synonymy and hyponymy) among such items, and their organization into a conceptual structure; (3) their usage within a domain restricted question answering system, in order to facilitate the correct identification of relevant answers to a query; (4) the adaptation of the system to another domain, and extension of the basic hypothesis to tasks other than question answering.
    Summary The topic of this dissertation is the extraction of information that is implicitly contained in technical terminologies and similar resources, and its application in an answer extraction (AE) system. Since different variants of the same term can be used to refer to the same concept, recognizing and structuring all possible forms is a prerequisite for use in an AE system. The knowledge about terms and their relations is applied in an AE system focused on the maintenance manual of a well-known commercial aircraft. The original version of the system had no explicit treatment of terminology. The explicit treatment of terminology yielded a considerable improvement in the system's efficiency, in particular with regard to the analysis of the underlying document collection. Similar methodologies were later applied to another domain (biomedical literature), with a particular focus on interactions between proteins. This dissertation describes in particular: (1) the extraction of the terminology; (2) the identification of relations between terms (such as synonymy and hyponymy); (3) their use in an AE system; (4) the porting of the system to another domain.

    Detection and management of redundancy for information retrieval

    The growth of the web, authoring software, and electronic publishing has led to the emergence of a new type of document collection that is decentralised, amorphous, dynamic, and anarchic. In such collections, redundancy is a significant issue. Documents can spread and propagate across such collections without any control or moderation. Redundancy can interfere with the information retrieval process, leading to decreased user amenity in accessing information from these collections, and thus must be effectively managed. The precise definition of redundancy varies with the application. We restrict ourselves to documents that are co-derivative: those that share a common heritage, and hence contain passages of common text. We explore document fingerprinting, a well-known technique for the detection of co-derivative document pairs. Our new lossless fingerprinting algorithm improves the effectiveness of a range of document fingerprinting approaches. We empirically show that our algorithm can be highly effective at discovering co-derivative document pairs in large collections. We study the occurrence and management of redundancy in a range of application domains. On the web, we find that document fingerprinting is able to identify widespread redundancy, and that this redundancy has a significant detrimental effect on the quality of search results. Based on user studies, we suggest that redundancy is most appropriately managed as a postprocessing step on the ranked list and explain how and why this should be done. In the genomic area of sequence homology search, we explain why the existing techniques for redundancy discovery are increasingly inefficient, and present a critique of the current approaches to redundancy management. We show how document fingerprinting with a modified version of our algorithm provides significant efficiency improvements, and propose a new approach to redundancy management based on wildcards. 
We demonstrate that our scheme provides the benefits of existing techniques but does not have their deficiencies. Redundancy in distributed information retrieval systems - where different parts of the collection are searched by autonomous servers - cannot be effectively managed using traditional fingerprinting techniques. We thus propose a new data structure, the grainy hash vector, for redundancy detection and management in this environment. We show in preliminary tests that the grainy hash vector is able to accurately detect a good proportion of redundant document pairs while maintaining low resource usage

    Pretrained Transformers for Text Ranking: BERT and Beyond

    The goal of text ranking is to generate an ordered list of texts retrieved from a corpus in response to a query. Although the most common formulation of text ranking is search, instances of the task can also be found in many natural language processing applications. This survey provides an overview of text ranking with neural network architectures known as transformers, of which BERT is the best-known example. The combination of transformers and self-supervised pretraining has been responsible for a paradigm shift in natural language processing (NLP), information retrieval (IR), and beyond. In this survey, we provide a synthesis of existing work as a single point of entry for practitioners who wish to gain a better understanding of how to apply transformers to text ranking problems and researchers who wish to pursue work in this area. We cover a wide range of modern techniques, grouped into two high-level categories: transformer models that perform reranking in multi-stage architectures and dense retrieval techniques that perform ranking directly. There are two themes that pervade our survey: techniques for handling long documents, beyond typical sentence-by-sentence processing in NLP, and techniques for addressing the tradeoff between effectiveness (i.e., result quality) and efficiency (e.g., query latency, model and index size). Although transformer architectures and pretraining techniques are recent innovations, many aspects of how they are applied to text ranking are relatively well understood and represent mature techniques. However, there remain many open research questions, and thus in addition to laying out the foundations of pretrained transformers for text ranking, this survey also attempts to prognosticate where the field is heading

    User-centered semantic dataset retrieval

    Finding relevant research data is an increasingly important but time-consuming task in daily research practice. Several studies report on difficulties in dataset search, e.g., scholars retrieve only partially pertinent data, and important information cannot be displayed in the user interface. Overcoming these problems has motivated a number of research efforts in computer science, such as text mining and semantic search. In particular, the emergence of the Semantic Web opens a variety of novel research perspectives. Motivated by these challenges, the overall aim of this work is to analyze the current obstacles in dataset search and to propose and develop a novel semantic dataset search. The studied domain is biodiversity research, a domain that explores the diversity of life, habitats and ecosystems. This thesis has three main contributions: (1) We evaluate the current situation in dataset search in a user study, and we compare a semantic search with a classical keyword search to explore the suitability of semantic web technologies for dataset search. (2) We generate a question corpus and develop an information model to determine which scientific topics scholars in biodiversity research are interested in. Moreover, we analyze the gap between current metadata and scholarly search interests, and we explore whether metadata and user interests match. (3) We propose and develop an improved dataset search based on three components: (A) a text mining pipeline, enriching metadata and queries with semantic categories and URIs, (B) a retrieval component with a semantic index over categories and URIs and (C) a user interface that enables a search within categories and a search including further hierarchical relations. Following user-centered design principles, we ensure user involvement in various user studies during the development process.

    Division of labour and sharing of knowledge for synchronous collaborative information retrieval

    Mining the Medical and Patent Literature to Support Healthcare and Pharmacovigilance

    Recent advancements in healthcare practices and the increasing use of information technology in the medical domain have led to the rapid generation of free-text data in the form of scientific articles, e-health records, patents, and document inventories. This has urged the development of sophisticated information retrieval and information extraction technologies. A fundamental requirement for the automatic processing of biomedical text is the identification of information-carrying units such as concepts or named entities. In this context, this work focuses on the identification of medical disorders (such as diseases and adverse effects), which denote an important category of concepts in medical text. Two methodologies were investigated in this regard: dictionary-based and machine learning-based approaches. Furthermore, the capabilities of the concept recognition techniques were systematically exploited to build a semantic search platform for the retrieval of e-health records and patents. The system facilitates conventional text search as well as semantic and ontological searches. Performance of the adapted retrieval platform for e-health records and patents was evaluated within open assessment challenges (i.e. TRECMED and TRECCHEM respectively) wherein the system was best rated in comparison to several other competing information retrieval platforms. Finally, from the medico-pharma perspective, a strategy for the identification of adverse drug events from medical case reports was developed. Qualitative evaluation as well as an expert validation of the developed system's performance showed robust results. In conclusion, this thesis presents approaches for efficient information retrieval and information extraction from various biomedical literature sources in support of healthcare and pharmacovigilance. The applied strategies have the potential to enhance the literature searches performed by biomedical, healthcare, and patent professionals. This can promote literature-based knowledge discovery, improve the safety and effectiveness of medical practices, and drive research and development in the medical and healthcare arena.