3 research outputs found

    Supervised learning for detection of duplicates in genomic sequence databases

    Get PDF
    Motivation First identified as an issue in 1996, duplication in biological databases introduces redundancy and even leads to inconsistency when contradictory information appears. The amount of data makes purely manual de-duplication impractical, and existing automatic systems cannot detect duplicates as precisely as can experts. Supervised learning has the potential to address such problems by building automatic systems that learn from expert curation to detect duplicates precisely and efficiently. While machine learning is a mature approach in other duplicate detection contexts, it has seen only preliminary application in genomic sequence databases. Results We developed and evaluated a supervised duplicate detection method based on an expert curated dataset of duplicates, containing over one million pairs across five organisms derived from genomic sequence databases. We selected 22 features to represent distinct attributes of the database records, and developed a binary model and a multi-class model. Both models achieve promising performance; under cross-validation, the binary model had over 90% accuracy in each of the five organisms, while the multi-class model maintains high accuracy and is more robust in generalisation. We performed an ablation study to quantify the impact of different sequence record features, finding that features derived from metadata, sequence identity, and alignment quality impact performance most strongly. The study demonstrates machine learning can be an effective additional tool for de-duplication of genomic sequence databases. All Data are available as described in the supplementary material

    Robust Entity Linking in Heterogeneous Domains

    Get PDF
    Entity Linking is the task of mapping terms in arbitrary documents to entities in a knowledge base by identifying the correct semantic meaning. It is applied in the extraction of structured data in RDF (Resource Description Framework) from textual documents, but equally so in facilitating artificial intelligence applications, such as Semantic Search, Reasoning and Question and Answering. Most existing Entity Linking systems were optimized for specific domains (e.g., general domain, biomedical domain), knowledge base types (e.g., DBpedia, Wikipedia), or document structures (e.g., tables) and types (e.g., news articles, tweets). This led to very specialized systems that lack robustness and are only applicable for very specific tasks. In this regard, this work focuses on the research and development of a robust Entity Linking system in terms of domains, knowledge base types, and document structures and types. To create a robust Entity Linking system, we first analyze the following three crucial components of an Entity Linking algorithm in terms of robustness criteria: (i) the underlying knowledge base, (ii) the entity relatedness measure, and (iii) the textual context matching technique. Based on the analyzed components, our scientific contributions are three-fold. First, we show that a federated approach leveraging knowledge from various knowledge base types can significantly improve robustness in Entity Linking systems. Second, we propose a new state-of-the-art, robust entity relatedness measure for topical coherence computation based on semantic entity embeddings. Third, we present the neural-network-based approach Doc2Vec as a textual context matching technique for robust Entity Linking. Based on our previous findings and outcomes, our main contribution in this work is DoSeR (Disambiguation of Semantic Resources). DoSeR is a robust, knowledge-base-agnostic Entity Linking framework that extracts relevant entity information from multiple knowledge bases in a fully automatic way. The integrated algorithm represents a collective, graph-based approach that utilizes semantic entity and document embeddings for entity relatedness and textual context matching computation. Our evaluation shows, that DoSeR achieves state-of-the-art results over a wide range of different document structures (e.g., tables), document types (e.g., news documents) and domains (e.g., general domain, biomedical domain). In this context, DoSeR outperforms all other (publicly available) Entity Linking algorithms on most data sets
    corecore