1,160 research outputs found

    PadChest: A large chest x-ray image dataset with multi-label annotated reports

    Get PDF
    We present a labeled large-scale, high resolution chest x-ray dataset for the automated exploration of medical images along with their associated reports. This dataset includes more than 160,000 images obtained from 67,000 patients that were interpreted and reported by radiologists at Hospital San Juan Hospital (Spain) from 2009 to 2017, covering six different position views and additional information on image acquisition and patient demography. The reports were labeled with 174 different radiographic findings, 19 differential diagnoses and 104 anatomic locations organized as a hierarchical taxonomy and mapped onto standard Unified Medical Language System (UMLS) terminology. Of these reports, 27% were manually annotated by trained physicians and the remaining set was labeled using a supervised method based on a recurrent neural network with attention mechanisms. The labels generated were then validated in an independent test set achieving a 0.93 Micro-F1 score. To the best of our knowledge, this is one of the largest public chest x-ray database suitable for training supervised models concerning radiographs, and the first to contain radiographic reports in Spanish. The PadChest dataset can be downloaded from http://bimcv.cipf.es/bimcv-projects/padchest/

    Classifying distinct data types: textual streams protein sequences and genomic variants

    Get PDF
    Artificial Intelligence (AI) is an interdisciplinary field combining different research areas with the end goal to automate processes in the everyday life and industry. The fundamental components of AI models are an “intelligent” model and a functional component defined by the end-application. That is, an intelligent model can be a statistical model that can recognize patterns in data instances to distinguish differences in between these instances. For example, if the AI is applied in car manufacturing, based on an image of a part of a car, the model can categorize if the car part is in the front, middle or rear compartment of the car, as a human brain would do. For the same example application, the statistical model informs a mechanical arm, the functional component, for the current car compartment and the arm in turn assembles this compartment, of the car, based on predefined instructions, likely as a human hand would follow human brain neural signals. A crucial step of AI applications is the classification of input instances by the intelligent model. The classification step in the intelligent model pipeline allows the subsequent steps to act in similar fashion for instances belonging to the same category. We define as classification the module of the intelligent model, which categorizes the input instances based on predefined human-expert or data-driven produced patterns of the instances. Irrespectively of the method to find patterns in data, classification is composed of four distinct steps: (i) input representation, (ii) model building (iii) model prediction and (iv) model assessment. Based on these classification steps, we argue that applying classification on distinct data types holds different challenges. In this thesis, I focus on challenges for three distinct classification scenarios: (i) Textual Streams: how to advance the model building step, commonly used for static distribution of data, to classify textual posts with transient data distribution? (ii) Protein Prediction: which biologically meaningful information can be used in the input representation step to overcome the limited training data challenge? (iii) Human Variant Pathogenicity Prediction: how to develop a classification system for functional impact of human variants, by providing standardized and well accepted evidence for the classification outcome and thus enabling the model assessment step? To answer these research questions, I present my contributions in classifying these different types of data: temporalMNB: I adapt the sequential prediction with expert advice paradigm to optimally aggregate complementary distributions to enhance a Naive Bayes model to adapt on drifting distribution of the characteristics of the textual posts. dom2vec: our proposal to learn embedding vectors for the protein domains using self-supervision. Based on the high performance achieved by the dom2vec embeddings in quantitative intrinsic assessment on the captured biological information, I provide example evidence for an analogy between the local linguistic features in natural languages and the domain structure and function information in domain architectures. Last, I describe GenOtoScope bioinformatics software tool to automate standardized evidence-based criteria for pathogenicity impact of variants associated with hearing loss. Finally, to increase the practical use of our last contribution, I develop easy-to-use software interfaces to be used, in research settings, by clinical diagnostics personnel.Künstliche Intelligenz (KI) ist ein interdisziplinäres Gebiet, das verschiedene Forschungsbereiche mit dem Ziel verbindet, Prozesse im Alltag und in der Industrie zu automatisieren. Die grundlegenden Komponenten von KI-Modellen sind ein “intelligentes” Modell und eine durch die Endanwendung definierte funktionale Komponente. Das heißt, ein intelligentes Modell kann ein statistisches Modell sein, das Muster in Dateninstanzen erkennen kann, um Unterschiede zwischen diesen Instanzen zu unterscheiden. Wird die KI beispielsweise in der Automobilherstellung eingesetzt, kann das Modell auf der Grundlage eines Bildes eines Autoteils kategorisieren, ob sich das Autoteil im vorderen, mittleren oder hinteren Bereich des Autos befindet, wie es ein menschliches Gehirn tun würde. Bei der gleichen Beispielanwendung informiert das statistische Modell einen mechanischen Arm, die funktionale Komponente, über den aktuellen Fahrzeugbereich, und der Arm wiederum baut diesen Bereich des Fahrzeugs auf der Grundlage vordefinierter Anweisungen zusammen, so wie eine menschliche Hand den neuronalen Signalen des menschlichen Gehirns folgen würde. Ein entscheidender Schritt bei KI-Anwendungen ist die Klassifizierung von Eingabeinstanzen durch das intelligente Modell. Unabhängig von der Methode zum Auffinden von Mustern in Daten besteht die Klassifizierung aus vier verschiedenen Schritten: (i) Eingabedarstellung, (ii) Modellbildung, (iii) Modellvorhersage und (iv) Modellbewertung. Ausgehend von diesen Klassifizierungsschritten argumentiere ich, dass die Anwendung der Klassifizierung auf verschiedene Datentypen unterschiedliche Herausforderungen mit sich bringt. In dieser Arbeit konzentriere ich uns auf die Herausforderungen für drei verschiedene Klassifizierungsszenarien: (i) Textdatenströme: Wie kann der Schritt der Modellerstellung, der üblicherweise für eine statische Datenverteilung verwendet wird, weiterentwickelt werden, um die Klassifizierung von Textbeiträgen mit einer instationären Datenverteilung zu erlernen? (ii) Proteinvorhersage: Welche biologisch sinnvollen Informationen können im Schritt der Eingabedarstellung verwendet werden, um die Herausforderung der begrenzten Trainingsdaten zu überwinden? (iii) Vorhersage der Pathogenität menschlicher Varianten: Wie kann ein Klassifizierungssystem für die funktionellen Auswirkungen menschlicher Varianten entwickelt werden, indem standardisierte und anerkannte Beweise für das Klassifizierungsergebnis bereitgestellt werden und somit der Schritt der Modellbewertung ermöglicht wird? Um diese Forschungsfragen zu beantworten, stelle ich meine Beiträge zur Klassifizierung dieser verschiedenen Datentypen vor: temporalMNB: Verbesserung des Naive-Bayes-Modells zur Klassifizierung driftender Textströme durch Ensemble-Lernen. dom2vec: Lernen von Einbettungsvektoren für Proteindomänen durch Selbstüberwachung. Auf der Grundlage der berichteten Ergebnisse liefere ich Beispiele für eine Analogie zwischen den lokalen linguistischen Merkmalen in natürlichen Sprachen und den Domänenstruktur- und Funktionsinformationen in Domänenarchitekturen. Schließlich beschreibe ich ein bioinformatisches Softwaretool, GenOtoScope, zur Automatisierung standardisierter evidenzbasierter Kriterien für die orthogenitätsauswirkungen von Varianten, die mit angeborener Schwerhörigkeit in Verbindung stehen

    Neural Representations of Concepts and Texts for Biomedical Information Retrieval

    Get PDF
    Information retrieval (IR) methods are an indispensable tool in the current landscape of exponentially increasing textual data, especially on the Web. A typical IR task involves fetching and ranking a set of documents (from a large corpus) in terms of relevance to a user\u27s query, which is often expressed as a short phrase. IR methods are the backbone of modern search engines where additional system-level aspects including fault tolerance, scale, user interfaces, and session maintenance are also addressed. In addition to fetching documents, modern search systems may also identify snippets within the documents that are potentially most relevant to the input query. Furthermore, current systems may also maintain preprocessed structured knowledge derived from textual data as so called knowledge graphs, so certain types of queries that are posed as questions can be parsed as such; a response can be an output of one or more named entities instead of a ranked list of documents (e.g., what diseases are associated with EGFR mutations? ). This refined setup is often termed as question answering (QA) in the IR and natural language processing (NLP) communities. In biomedicine and healthcare, specialized corpora are often at play including research articles by scientists, clinical notes generated by healthcare professionals, consumer forums for specific conditions (e.g., cancer survivors network), and clinical trial protocols (e.g., www.clinicaltrials.gov). Biomedical IR is specialized given the types of queries and the variations in the texts are different from that of general Web documents. For example, scientific articles are more formal with longer sentences but clinical notes tend to have less grammatical conformity and are rife with abbreviations. There is also a mismatch between the vocabulary of consumers and the lingo of domain experts and professionals. Queries are also different and can range from simple phrases (e.g., COVID-19 symptoms ) to more complex implicitly fielded queries (e.g., chemotherapy regimens for stage IV lung cancer patients with ALK mutations ). Hence, developing methods for different configurations (corpus, query type, user type) needs more deliberate attention in biomedical IR. Representations of documents and queries are at the core of IR methods and retrieval methodology involves coming up with these representations and matching queries with documents based on them. Traditional IR systems follow the approach of keyword based indexing of documents (the so called inverted index) and matching query phrases against the document index. It is not difficult to see that this keyword based matching ignores the semantics of texts (synonymy at the lexeme level and entailment at phrase/clause/sentence levels) and this has lead to dimensionality reduction methods such as latent semantic indexing that generally have scale-related concerns; such methods also do not address similarity at the sentence level. Since the resurgence of neural network methods in NLP, the IR field has also moved to incorporate advances in neural networks into current IR methods. This dissertation presents four specific methodological efforts toward improving biomedical IR. Neural methods always begin with dense embeddings for words and concepts to overcome the limitations of one-hot encoding in traditional NLP/IR. In the first effort, we present a new neural pre-training approach to jointly learn word and concept embeddings for downstream use in applications. In the second study, we present a joint neural model for two essential subtasks of information extraction (IE): named entity recognition (NER) and entity normalization (EN). Our method detects biomedical concept phrases in texts and links them to the corresponding semantic types and entity codes. These first two studies provide essential tools to model textual representations as compositions of both surface forms (lexical units) and high level concepts with potential downstream use in QA. In the third effort, we present a document reranking model that can help surface documents that are likely to contain answers (e.g, factoids, lists) to a question in a QA task. The model is essentially a sentence matching neural network that learns the relevance of a candidate answer sentence to the given question parametrized with a bilinear map. In the fourth effort, we present another document reranking approach that is tailored for precision medicine use-cases. It combines neural query-document matching and faceted text summarization. The main distinction of this effort from previous efforts is to pivot from a query manipulation setup to transforming candidate documents into pseudo-queries via neural text summarization. Overall, our contributions constitute nontrivial advances in biomedical IR using neural representations of concepts and texts

    Biomedical ontology alignment: An approach based on representation learning

    Get PDF
    While representation learning techniques have shown great promise in application to a number of different NLP tasks, they have had little impact on the problem of ontology matching. Unlike past work that has focused on feature engineering, we present a novel representation learning approach that is tailored to the ontology matching task. Our approach is based on embedding ontological terms in a high-dimensional Euclidean space. This embedding is derived on the basis of a novel phrase retrofitting strategy through which semantic similarity information becomes inscribed onto fields of pre-trained word vectors. The resulting framework also incorporates a novel outlier detection mechanism based on a denoising autoencoder that is shown to improve performance. An ontology matching system derived using the proposed framework achieved an F-score of 94% on an alignment scenario involving the Adult Mouse Anatomical Dictionary and the Foundational Model of Anatomy ontology (FMA) as targets. This compares favorably with the best performing systems on the Ontology Alignment Evaluation Initiative anatomy challenge. We performed additional experiments on aligning FMA to NCI Thesaurus and to SNOMED CT based on a reference alignment extracted from the UMLS Metathesaurus. Our system obtained overall F-scores of 93.2% and 89.2% for these experiments, thus achieving state-of-the-art results
    corecore