39 research outputs found

    Mutalyzer 2: next generation HGVS nomenclature checker

    Motivation: Unambiguous variant descriptions are of utmost importance in clinical genetic diagnostics, scientific literature and genetic databases. The Human Genome Variation Society (HGVS) publishes a comprehensive set of guidelines on how variants should be correctly and unambiguously described. We present the implementation of the Mutalyzer 2 tool suite, designed to automatically apply the HGVS guidelines so users do not have to deal with the HGVS intricacies explicitly to check and correct their variant descriptions. Results: Mutalyzer is widely used by the community, having processed over 133 million descriptions since its launch. Over a five-year period, Mutalyzer reported a correct input in approximately 50% of cases. In 41% of the cases either a syntactic or semantic error was identified, and for approximately 7% of cases Mutalyzer was able to automatically correct the description. Molecular Technology and Informatics for Personalised Medicine and Healt

    DeepVar: An End-to-End Deep Learning Approach for Genomic Variant Recognition in Biomedical Literature

    We consider the problem of Named Entity Recognition (NER) in biomedical scientific literature, and more specifically genomic variant recognition. Significant success has been achieved for NER on canonical tasks in recent years, where large data sets are generally available. However, it remains a challenging problem in many domain-specific areas, especially those where only small sets of gold annotations can be obtained. In addition, genomic variant entities exhibit diverse linguistic heterogeneity, differing considerably from the entities characterized in existing canonical NER tasks. State-of-the-art machine learning approaches to such tasks rely heavily on arduous feature engineering to characterize those unique patterns. In this work, we present the first successful end-to-end deep learning approach to bridge the gap between generic NER algorithms and low-resource applications through genomic variant recognition. Our proposed model achieves promising performance without any hand-crafted features or post-processing rules. Our extensive experiments and results may shed light on other similar low-resource NER applications. Comment: accepted by AAAI 202

    Grammatical evolution of technical processes

    The theory of technical systems explains technical evolution, design and product development as a response to societal needs that can be met through technical processes. This teleological view requires that the first step in developing a new product concept be the identification of the technical process, that is, the process within which, with the participation of the technical product, the effects needed for a purposeful transformation of operands are realized in accordance with the working principles on which the technical process is based. The aim of the research carried out for this doctoral thesis is to create computational support for precisely this initial step of the conceptual phase of product development. Computer-generated variants of operand transformation can provide a basis for a more thorough consideration of the options for realizing the technical product. In line with the research methodology established in design science, the research was conducted in two phases: a theoretical phase, which comprises defining a method for generating variants of operand transformation based on known working principles, and a practical phase, which comprises developing a computational tool based on the defined method to a level that allows the research results to be validated. The theoretical phase concluded with the main scientific contributions of this dissertation: (1) a formal model of the technical process was defined, (2) a formal model of technical process synthesis based on graph grammars was defined, and (3) the ability to search transformation variants using the grammatical evolution algorithm [3] was introduced. The practical phase resulted in a computer implementation of the defined method for generating variants of operand transformation, within a computational tool designed and developed for that purpose.
During the research it was found that generalized and systematized knowledge about technical processes and working principles within the field is still not available as a taxonomy or ontology detailed enough for the level required by the defined method. For this reason, guidelines were proposed for the graph-grammar formalization of knowledge about technical processes and working principles (4)
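
The grammatical evolution step mentioned in contribution (3) can be illustrated with a minimal sketch of the standard genotype-to-phenotype mapping: each integer codon selects a production for the leftmost nonterminal by taking the codon modulo the number of available productions. The toy string grammar `GRAMMAR`, the function name `map_genotype`, and the `max_wraps` parameter are assumptions made for illustration; the thesis itself works with graph grammars of technical processes, which this sketch does not attempt to model.

```python
# Toy context-free grammar (hypothetical; not the graph grammar of the thesis).
GRAMMAR = {
    "<expr>": [
        ["<expr>", "+", "<expr>"],
        ["<expr>", "*", "<expr>"],
        ["x"],
        ["1"],
    ],
}


def map_genotype(codons, start="<expr>", max_wraps=2):
    """Map a list of integer codons to a string, or return None on failure.

    The leftmost nonterminal is always expanded first. The production is
    chosen as codon % number_of_productions. When the codons run out they
    are reused from the beginning, at most max_wraps times.
    """
    symbols = [start]
    output = []
    used = 0
    budget = len(codons) * (max_wraps + 1)  # total codon reads allowed
    while symbols:
        symbol = symbols.pop(0)
        if symbol not in GRAMMAR:
            output.append(symbol)  # terminal: emit it unchanged
            continue
        if used >= budget:
            return None  # mapping did not terminate within the budget
        options = GRAMMAR[symbol]
        choice = codons[used % len(codons)] % len(options)
        used += 1
        symbols = list(options[choice]) + symbols
    return "".join(output)


print(map_genotype([0, 2, 3]))
print(map_genotype([0]))
```

In a full grammatical evolution loop, an evolutionary algorithm would mutate and recombine the codon lists and score the mapped phenotypes with a fitness function; genotypes that fail to map (the `None` case) are typically given the worst fitness.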

    Algorithms for the description of molecular sequences

    Unambiguous sequence variant descriptions are important in reporting the outcome of clinical diagnostic DNA tests. The standard nomenclature of the Human Genome Variation Society (HGVS) describes the observed variant sequence relative to a given reference sequence. We propose an efficient algorithm for the extraction of HGVS descriptions from two DNA sequences. Our algorithm is able to compute the HGVS descriptions of complete chromosomes or other large DNA strings in a reasonable amount of computation time, and the resulting descriptions are relatively small. Additional applications include updating of gene variant database contents and reference sequence liftovers. Next, we adapted our method for the extraction of descriptions for protein sequences, in particular for describing frame-shifted variants. We propose an addition to the HGVS nomenclature for accommodating the (complex) frame-shifted variants that can be described with our method. Finally, we applied our method to generate descriptions for Short Tandem Repeats (STRs), a form of self-similarity. We propose an alternative repeat variant that can be added to the existing HGVS nomenclature. The final chapter takes an explorative approach to classification in large cohort studies. We provide a "cross-sectional" investigation on this data to see the relative power of the different groups. Algorithms and the Foundations of Software technolog
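
To make the idea of extracting a description from two sequences concrete, here is a deliberately naive sketch. It is not the efficient algorithm proposed in the thesis: it only trims the longest common prefix and suffix and reports whatever remains as a single HGVS-style change. The function name `describe` is hypothetical, positions are 1-based without a coordinate prefix such as `g.`, duplications are reported as plain insertions, and insertions at either end of the reference are not given meaningful flanking positions.

```python
def describe(reference: str, observed: str) -> str:
    """Describe observed relative to reference as one HGVS-style change."""
    if reference == observed:
        return "="  # no change

    limit = min(len(reference), len(observed))

    # Longest common prefix. Trimming it first pushes the change as far
    # towards the 3' end as this simple approach allows.
    prefix = 0
    while prefix < limit and reference[prefix] == observed[prefix]:
        prefix += 1

    # Longest common suffix that does not overlap the prefix.
    suffix = 0
    while (suffix < limit - prefix
           and reference[-1 - suffix] == observed[-1 - suffix]):
        suffix += 1

    deleted = reference[prefix:len(reference) - suffix]
    inserted = observed[prefix:len(observed) - suffix]
    start = prefix + 1            # first affected reference position
    end = prefix + len(deleted)   # last affected reference position

    if len(deleted) == 1 and len(inserted) == 1:
        return f"{start}{deleted}>{inserted}"        # substitution
    if not deleted:
        return f"{prefix}_{prefix + 1}ins{inserted}"  # insertion
    span = str(start) if start == end else f"{start}_{end}"
    if not inserted:
        return f"{span}del"                           # deletion
    return f"{span}delins{inserted}"                  # deletion-insertion


print(describe("ACGT", "AGGT"))
print(describe("ATTTG", "ATTG"))
```

Because everything between the trimmed prefix and suffix becomes one change, two distant substitutions would be merged into a single large `delins`. Producing compact descriptions with several separate changes is the kind of problem the thesis's algorithm addresses.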

    Transfer learning: bridging the gap between deep learning and domain-specific text mining

    Inspired by the success of deep learning techniques in Natural Language Processing (NLP), this dissertation tackles domain-specific text mining problems for which generic deep learning approaches would fail. More specifically, the domain-specific problems are: (1) success prediction in crowdfunding, (2) variant identification in biomedical literature, and (3) text data augmentation for low-resource domains. In the first part, transfer learning from a multimodal perspective is used to help predict project success in crowdfunding. Even though the information in a project profile can be of different modalities such as text, images, and metadata, most existing prediction approaches leverage only the text modality. It is promising to use the images in project profiles to find out how they could contribute to success prediction. An advanced neural network scheme that combines information learned from different modalities is designed and evaluated for project success prediction. In the second part, transfer learning is combined with deep learning techniques to solve genomic variant Named Entity Recognition (NER) problems in biomedical literature. Most of the advanced generic NER algorithms can fail due to the restricted training corpus. However, those generic deep learning algorithms are capable of learning from a canonical corpus, without any effort on feature engineering. This work aims to build an end-to-end deep learning approach to transfer the domain-specific knowledge to those advanced generic NER algorithms, addressing the challenges in low-resource training and requiring neither hand-crafted features nor post-processing rules. For the last part, transfer learning with knowledge distillation and active learning is used to solve text augmentation for low-resource domains. Most of the recent text augmentation methods rely heavily on large external resources.
This work is dedicated to solving the text augmentation problem adaptively and consistently with minimal resources for token-level tasks like NER. The solution can also assure the reliability of machine labels for noisy data and can enhance training consistency with noisy labels. All the works are evaluated on different domain-specific benchmarks, respectively. Experimental results demonstrate the effectiveness of the proposed methods. The advantages also indicate promising potential for transfer learning in domain-specific applications.

    A formalized description of the standard human variant nomenclature in Extended Backus-Naur Form

    BACKGROUND: The use of a standard human sequence variant nomenclature is advocated by the Human Genome Variation Society in order to unambiguously describe genetic variants in databases and literature. There is a clear need for tools that allow the mining of data about human sequence variants and their functional consequences from databases and literature. Existing text mining focuses on the recognition of protein variants and their effects. The recognition of variants at the DNA and RNA levels is essential for dissemination of variant data for diagnostic purposes. Development of new tools is hampered by the complexity of the current nomenclature, which requires processing at the character level to recognize the specific syntactic constructs used in variant descriptions. RESULTS: We approached the gene variant nomenclature as a scientific sublanguage and created two formal descriptions of the syntax in Extended Backus-Naur Form: one at the DNA-RNA level and one at the protein level. To ensure compatibility with older versions of the human sequence variant nomenclature, previously recommended variant description formats have been included. The first grammar versions were designed to help build variant description handling in the Alamut mutation interpretation software. The DNA and RNA level descriptions were then updated and used to construct the context-free parser of the Mutalyzer 2 sequence variant nomenclature checker, which has already been used to check more than one million variant descriptions. CONCLUSIONS: The Extended Backus-Naur Form provided an overview of the full complexity of the syntax of the sequence variant nomenclature, which remained hidden in the textual format and the division of the recommendations across the DNA, RNA and protein sections of the Human Genome Variation Society nomenclature website (http://www.hgvs.org/mutnomen/).
This insight into the syntax of the nomenclature could be used to design detailed and clear rules for software development. The Mutalyzer 2 parser demonstrated that it facilitated decomposition of complex variant descriptions into their individual parts. The Extended Backus-Naur Form, or parts of it, can be used or modified by adding rules, allowing the development of specific sequence variant text mining tools and other programs that can generate or handle sequence variant descriptions. Genomics, epigenetics, population genetics and bioinformatic
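
As a rough illustration of how a formal grammar lets software decompose a variant description into its parts, the sketch below handles a very small fragment of DNA-level syntax. The grammar in the comment is a simplified assumption written for this example, not the published Extended Backus-Naur Form, and the function name `parse` and the returned dictionary keys are hypothetical. Because this fragment happens to be regular, a regular expression suffices here; the Mutalyzer 2 parser described above is context-free and covers far more constructs.

```python
import re

# Simplified, illustrative grammar (an assumption, not the published EBNF):
#   description  = coord_system, ".", location, change ;
#   coord_system = "c" | "g" ;
#   location     = position, [ "_", position ] ;
#   change       = base, ">", base
#                | "del" | "ins", bases | "delins", bases ;
_PATTERN = re.compile(
    r"(?P<coord>[cg])\."
    r"(?P<start>\d+)(?:_(?P<end>\d+))?"
    r"(?:(?P<ref>[ACGT])>(?P<alt>[ACGT])"
    r"|(?P<kind>delins|del|ins)(?P<bases>[ACGT]*))"
)


def parse(description: str) -> dict:
    """Split a description from the supported subset into its parts."""
    match = _PATTERN.fullmatch(description)
    if match is None:
        raise ValueError(f"not in the supported subset: {description!r}")

    start = int(match["start"])
    end = int(match["end"]) if match["end"] else start
    if end < start:
        raise ValueError("range end precedes range start")
    parts = {"coord": match["coord"], "start": start, "end": end}

    if match["ref"]:
        # Substitution: exactly one position, one base replaced by another.
        if match["end"]:
            raise ValueError("a substitution takes a single position")
        parts.update(type="subst", ref=match["ref"], alt=match["alt"])
        return parts

    kind, bases = match["kind"], match["bases"]
    if kind in ("ins", "delins") and not bases:
        raise ValueError(f"{kind} requires the inserted bases")
    if kind == "ins" and end != start + 1:
        raise ValueError("an insertion sits between two adjacent positions")
    parts.update(type=kind, bases=bases)
    return parts


print(parse("c.76A>T"))
print(parse("g.10_12del"))
```

The syntactic check (does the string match the grammar?) is kept separate from the small semantic checks that follow it (is the range ordered, does an insertion have flanking positions?), which mirrors the distinction between syntactic and semantic errors made in the Mutalyzer abstract earlier in this listing.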

    Adaptive object-modeling : patterns, tools and applications

    Doctoral Programme thesis. Informatics. Universidade do Porto. Faculdade de Engenharia. 201