76 research outputs found

    Thorough in silico and in vitro cDNA analysis of 21 putative BRCA1 and BRCA2 splice variants and a complex tandem duplication in BRCA2 allowing the identification of activated cryptic splice donor sites in BRCA2 exon 11

    For 21 putative BRCA1 and BRCA2 splice site variants, the concordance between mRNA analysis and predictions by in silico programs was evaluated. Aberrant splicing was confirmed for 12 alterations. In silico prediction tools were helpful in determining for which variants cDNA analysis is warranted; however, predictions for variants in the Cartegni consensus region but outside the canonical sites were less reliable. Learning algorithms such as Adaboost and Random Forest outperformed the classical tools. Further validation is warranted before these novel tools are implemented in clinical settings. Additionally, we report here for the first time activated cryptic donor sites in the large exon 11 of BRCA2, identified by evaluating the effect at the cDNA level of a novel tandem duplication (5′ breakpoint in intron 4; 3′ breakpoint in exon 11) and of a variant disrupting the splice donor site of exon 11 (c.6841+1G>C). Additional sites were predicted but not activated. These sites warrant further research to increase our knowledge of the cis- and trans-acting factors involved in maintaining correct transcription of this large exon. This may contribute to the adequate design of antisense oligonucleotides (ASOs), an emerging therapy to render cancer cells sensitive to PARP inhibitor and platinum therapies.

    Statistical methods for clinical genome interpretation with specific application to inherited cardiac conditions

    Background: While next-generation sequencing has enabled us to rapidly identify sequence variants, clinical application is limited by our ability to determine which rare variants impact disease risk. Aim: To develop computational methods to identify clinically important variants. Methods and Results: (1) I built a disease-specific variant classifier for inherited cardiac conditions (ICCs), which outperforms genome-wide tools across a wide range of benchmarks. It discriminates pathogenic variants from benign variants with global accuracy improved by 4-24% over existing tools. Variants classified with >90% confidence are significantly associated with both disease status and clinical outcomes. (2) To better interpret missense variants, I examined evolutionarily equivalent residues across protein domain families to identify positions intolerant of variation. Homologous residue constraint is a strong predictor of variant pathogenicity. It can identify a subset of de novo missense variants whose impact on developmental disorders is comparable to that of protein-truncating variants. Independently of existing approaches, it can also improve the prioritisation of disease-relevant genes for both developmental disorders and inherited hypertrophic cardiomyopathy. (3) TTN-truncating variants are known to cause dilated cardiomyopathy (DCM), but the effect of missense variants is poorly understood. Using the approach in (2), I studied the role of TTN missense variants in DCM. Our prioritised residues are enriched for known pathogenic variants, including the two known to cause DCM and others involved in skeletal myopathies. I also found a significant association between constrained variants of TTN I-set domains and DCM in a case-control burden test of Caucasian samples (OR=3.2, 95% CI=1.3-9.4). Within subsets of DCM, the association is replicated in alcoholic cardiomyopathy. (4) Finally, I developed a tool to annotate 5′UTR variants that create or disrupt upstream open reading frames (uORFs). Its utility is demonstrated by detecting high-impact uORF-disturbing variants in ClinVar, gnomAD and Genomics England. Conclusion: These studies established broadly applicable methods and improved understanding of ICCs.
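The uORF annotation idea in point (4) can be illustrated with a minimal sketch. It assumes only that a uORF starts at an upstream ATG and checks whether a single-base substitution creates or removes one. The function names, the example sequence and the 0-based coordinates are hypothetical and are not taken from the thesis; a real annotator would also consider reading frame, stop codons and Kozak context.

```python
def atg_positions(seq):
    """Return the 0-based start index of every ATG in seq."""
    seq = seq.upper()
    return {i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"}


def classify_utr_snv(utr, pos, alt):
    """Compare upstream ATGs before and after a single-base substitution.

    utr: 5'UTR sequence (hypothetical input), pos: 0-based position, alt: new base.
    """
    utr = utr.upper()
    if not 0 <= pos < len(utr):
        raise ValueError("pos lies outside the 5'UTR")
    mutated = utr[:pos] + alt.upper() + utr[pos + 1:]
    before, after = atg_positions(utr), atg_positions(mutated)
    return {
        "created": sorted(after - before),    # new upstream start codons
        "disrupted": sorted(before - after),  # upstream start codons lost
    }


if __name__ == "__main__":
    example_utr = "GGCACGTTTATGCC"  # illustrative sequence only
    print(classify_utr_snv(example_utr, 4, "T"))   # ACG -> ATG
    print(classify_utr_snv(example_utr, 10, "C"))  # ATG -> ACG
```

Reporting created and disrupted start codons separately keeps the two variant classes named in the abstract distinct.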

    Machine Learning in clinical biology and medicine: from prediction of multidrug resistant infections in humans to pre-mRNA splicing control in Ciliates

    Machine Learning methods have begun to spread broadly through the clinical literature, and the correct use of algorithms and tools can facilitate both diagnosis and therapy. The availability of large quantities of high-quality data could lead to an improved understanding of risk factors in community- and healthcare-acquired infections. In the first part of my PhD program, I refined my skills in Machine Learning by developing, and evaluating on a real antibiotic stewardship dataset, a model that predicts multidrug-resistant urinary tract infections after patient hospitalization. For this purpose, I created an online platform called DSaaS, specifically designed for healthcare operators to train ML models (supervised learning algorithms). These results are reported in Chapter 2. In the second part of the PhD thesis (Chapter 3), I used my new skills to study genomic variants, in particular the phenomenon of intron splicing. One of the important modes of pre-mRNA post-transcriptional modification is alternative intron splicing, which includes intron retention (unsplicing) and allows many distinct mature mRNA transcripts to be created from a single gene. An accurate interpretation of genomic variants is the backbone of genomic medicine. Determining, for example, the causative variant in patients with Mendelian disorders facilitates both management and potential downstream treatment of the patient’s condition, as well as providing peace of mind and allowing more effective counselling for the wider family. Recent years have seen a surge in bioinformatics tools designed to predict variant impact on splicing, and these offer an opportunity to circumvent many limitations of RNA-seq based approaches. An increasing number of these tools rely on machine learning approaches that identify patterns in data and use this knowledge to make predictions on new data. I optimized a pipeline to extract introns from genomes and transcriptomes and classified them into retained introns (RIs) and constitutively spliced introns (CSIs). I used data from ciliates because of the peculiar organization of their genomes (enriched in coding sequences) and because they are unicellular organisms without cells differentiated into tissues, which made the identification and manipulation of introns easier. In collaboration with my PhD colleague Dr. Leonardo Vito, I analyzed these intronic sequences to identify features with which Machine Learning algorithms could predict and classify them. We also developed a platform for manipulating the FASTA, GTF, BED and other files produced by the pipeline tools. I named the platform Biounicam (intron extraction tools); it is available at http://46.23.201.244:1880/ui. The major objective of this study was to develop an accurate machine-learning model that predicts whether an intron will be retained, to understand the key features involved in the intron retention (IR) mechanism, and to provide insight into the factors that drive IR. Once the model has been developed, the final step of my PhD work will be to expand the platform with different machine learning algorithms to better predict retention and to test new features that drive this phenomenon. These features will hopefully help to uncover new mechanisms that control intron splicing. The additional papers and patents I published during my PhD program are in Appendices B and C. These works, which range from microbiology to classical statistics, have given me many useful techniques for future work.

    깊은 신경망을 이용한 강인한 특징 학습 (Robust Feature Learning Using Deep Neural Networks)

    Doctoral dissertation, Seoul National University Graduate School, Department of Electrical and Computer Engineering, August 2016. 윤성로.
    Recent advances in machine learning continue to bring us closer to artificial intelligence. In particular, deep learning plays a key role in cutting-edge frameworks such as autonomous driving and game playing. Deep learning refers to a class of techniques built on multi-layered neural networks, and it is evolving rapidly as the amount of data increases, prior knowledge builds up, efficient training schemes are developed, and high-end hardware is built. Currently, deep learning is the state-of-the-art technique for most recognition tasks. Because deep neural networks learn many parameters, finding a good solution efficiently within a vast parameter space is important. In this dissertation, three issues in deep learning are discussed and regularization techniques to address them are proposed. First, deep neural networks are exposed to intrinsic blind spots called adversarial perturbations. To build networks that are robust to them, a manifold loss term that minimizes the difference between each training sample and its adversarial counterpart is added to the objective function. Second, when training restricted Boltzmann machines, the conventional contrastive divergence algorithm performs poorly on minority classes in class-imbalanced datasets. Our approach addresses this limitation by combining a boosting scheme that assigns higher learning weights to small classes with a new regularization technique suited to data with categorical features. Lastly, when the data are insufficient for learning the network's parameters, more sophisticated data augmentation is required. The higher the dimensionality of the samples, the more necessary it is to augment the dataset using prior knowledge about how the data are generated. Furthermore, this dissertation presents the first application of deep belief networks to identifying junction splicing signals. Junction prediction is difficult to model because positive samples are very scarce; it is one of the major problems in bioinformatics and a starting point for understanding the entire gene expression process. In summary, this dissertation proposes a set of deep learning regularization schemes that can learn meaningful representations of image datasets and large-scale genomic datasets. Their effectiveness was confirmed on well-known benchmark data and biomedical imaging data.
    Contents:
    Chapter 1 Introduction: 1.1 Deep neural networks; 1.2 Issue 1: adversarial examples handling; 1.3 Issue 2: class-imbalance handling; 1.4 Issue 3: insufficient data handling; 1.5 Organization
    Chapter 2 Background: 2.1 Basic operations for deep networks; 2.2 History of deep networks; 2.3 Modern deep networks (2.3.1 Contrastive divergence; 2.3.2 Deep manifold learning)
    Chapter 3 Adversarial examples handling: 3.1 Introduction; 3.2 Methods (3.2.1 Manifold regularized networks; 3.2.2 Generation of adversarial examples); 3.3 Results and discussion (3.3.1 Improved classification performance; 3.3.2 Disentanglement and generalization); 3.4 Summary
    Chapter 4 Class-imbalance handling: 4.1 Introduction (4.1.1 Numerical interpretation of DNA sequences; 4.1.2 Review of junction prediction problem); 4.2 Methods (4.2.1 Boosted contrastive divergence with categorical gradients; 4.2.2 Stacking and fine-tuning; 4.2.3 Initialization and parameter setting); 4.3 Results and discussion (4.3.1 Experiment preparation; 4.3.2 Improved prediction performance and runtime; 4.3.3 More robust prediction by proposed approach; 4.3.4 Effects of regularization on performance; 4.3.5 Efficient RBM training by boosted CD; 4.3.6 Identification of non-canonical splice sites); 4.4 Summary
    Chapter 5 Insufficient data handling: 5.1 Introduction; 5.2 Backgrounds (5.2.1 Understanding comets; 5.2.2 Assessing DNA damage from tail shape; 5.2.3 Related image processing techniques); 5.3 Methods (5.3.1 Preprocessing; 5.3.2 Binarization; 5.3.3 Filtering and overlap correction; 5.3.4 Characterization and classification); 5.4 Results and discussion (5.4.1 Test data preparation; 5.4.2 Binarization; 5.4.3 Robust identification of comets; 5.4.4 Classification; 5.4.5 More accurate characterization by DeepComet); 5.5 Summary
    Chapter 6 Conclusion: 6.1 Dissertation summary; 6.2 Future work
    Bibliography

    Differential Architecture Search in Deep Learning for DNA Splice Site Classification

    The data explosion caused by unprecedented advances in genomics constantly challenges the conventional methods used to interpret the human genome. The demand for robust algorithms in recent years has brought Deep Learning (DL) great success in solving many difficult tasks in image, speech and natural language processing by automating the manual process of architecture design.

    Using machine learning to predict pathogenicity of genomic variants throughout the human genome

    More than 6,000 diseases are estimated to be caused by genomic variants. This can happen in many ways: a variant may stop the translation of a protein, interfere with gene regulation, or alter splicing of the transcribed mRNA into an unwanted isoform. All of these processes must be investigated to evaluate which variant may be causal for the deleterious phenotype. Variant effect scores are a great help in this regard. Implemented as machine learning classifiers, they integrate annotations from different resources to rank genomic variants by pathogenicity. Developing a variant effect score requires multiple steps: annotation of the training data, feature selection, model training, benchmarking, and finally deployment of the model. Here, I present a generalized workflow for this process. It makes it simple to configure how information is converted into model features, enabling rapid exploration of different annotations. The workflow further implements hyperparameter optimization, model validation and ultimately deployment of a selected model via genome-wide scoring of genomic variants. The workflow is applied to train Combined Annotation Dependent Depletion (CADD), a variant effect model that scores SNVs and InDels genome-wide. By porting CADD to the genome reference GRCh38, and thereby establishing the first variant effect model for that reference, I show that the workflow can be quickly adapted to novel annotations. Further, I demonstrate the integration of deep neural network scores as features into a new CADD model, improving the annotation of RNA splicing events. Finally, I apply the workflow to train multiple variant effect models on training data based on variants selected by allele frequency. In conclusion, the developed workflow is a flexible and scalable method for training variant effect scores. All software and developed scores are freely available from cadd.gs.washington.edu and cadd.bihealth.org.

    CADD-Splice—improving genome-wide variant effect prediction using deep learning-derived splice scores

    Background: Splicing of genomic exons into mRNAs is a critical prerequisite for the accurate synthesis of human proteins. Genetic variants impacting splicing underlie a substantial proportion of genetic disease, but are challenging to identify beyond those occurring at donor and acceptor dinucleotides. To address this, various methods aim to predict variant effects on splicing. Recently, deep neural networks (DNNs) have been shown to achieve better results in predicting splice variants than other strategies. Methods: It has been unclear how best to integrate such process-specific scores into genome-wide variant effect predictors. Here, we use a recently published experimental data set to compare several machine learning methods that score variant effects on splicing. We integrate the best of those approaches into general variant effect prediction models and observe the effect on classification of known pathogenic variants. Results: We integrate two specialized splicing scores into CADD (Combined Annotation Dependent Depletion; cadd.gs.washington.edu), a widely used tool for genome-wide variant effect prediction that we previously developed to weight and integrate diverse collections of genomic annotations. With this new model, CADD-Splice, we show that inclusion of splicing DNN effect scores substantially improves predictions across multiple variant categories, without compromising overall performance. Conclusions: While splice effect scores show superior performance on splice variants, specialized predictors cannot compete with other variant scores in general variant interpretation, as the latter account for nonsense and missense effects that do not alter splicing. Although only shown here for splice scores, we believe that the applied approach will generalize to other specific molecular processes, providing a path for the further improvement of genome-wide variant effect prediction

    Automatic intron detection in metagenomes using neural networks.

    This work is concerned with the detection of introns in fungal metagenomes using deep neural networks. The exact biological mechanisms of intron recognition and splicing are not yet fully known, and their automated detection is not considered a solved problem. Detecting and removing introns from DNA sequences is important for identifying genes in metagenomes and for searching for their homologs among the known DNA sequences available in public databases. Gene prediction and the discovery of homologs allow known and new species to be identified and taxonomically classified. Two neural network models were developed as part of this thesis; they detect the starts and ends of introns, the so-called donor and acceptor splice sites. The detected splice sites are then combined into candidate introns, and overlapping candidates are removed by a simple scoring algorithm. The work builds on an existing solution based on support vector machines (SVM). The resulting neural networks achieve better results than the SVM while requiring more than ten times less computation time to process an equally large genome.
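The abstract does not spell out how splice sites are paired or how overlaps are resolved, so the following is only a sketch of one plausible reading: donors are paired with downstream acceptors within a length window, each candidate is scored by the sum of its two site scores, and a greedy pass keeps the best-scoring candidates that do not overlap. The length bounds, the additive score and all names are assumptions made for illustration.

```python
def candidate_introns(donors, acceptors, min_len=20, max_len=600):
    """Pair each donor with every downstream acceptor within the length bounds.

    donors / acceptors: lists of (position, model_score) tuples.
    min_len / max_len are hypothetical bounds, not values from the thesis.
    """
    candidates = []
    for d_pos, d_score in donors:
        for a_pos, a_score in acceptors:
            if min_len <= a_pos - d_pos <= max_len:
                # Assumed scoring rule: sum of the two splice-site scores.
                candidates.append((d_pos, a_pos, d_score + a_score))
    return candidates


def resolve_overlaps(candidates):
    """Greedily keep the highest-scoring candidates that overlap nothing kept so far."""
    kept = []
    for start, end, score in sorted(candidates, key=lambda c: c[2], reverse=True):
        # Intervals are treated as half-open: [start, end).
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, score))
    return sorted(kept)


if __name__ == "__main__":
    donors = [(100, 0.9), (150, 0.4), (400, 0.8)]       # illustrative values
    acceptors = [(200, 0.7), (260, 0.95), (480, 0.6)]   # illustrative values
    for start, end, score in resolve_overlaps(candidate_introns(donors, acceptors)):
        print(start, end, round(score, 2))
```

A greedy pass is simple and fast but not guaranteed to maximise the total score; weighted interval scheduling would be the exact alternative if that mattered.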

    Learning the Regulatory Code of Gene Expression

    Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding, chromatin states as well as mRNA and protein levels. Deep neural networks automatically learn informative sequence representations and interpreting them enables us to improve our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, first focusing on the initiating protein-DNA interactions, then specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology

    Human Promoter Prediction Using DNA Numerical Representation

    With the emergence of genomic signal processing, numerical representation techniques for the DNA alphabet {A, G, C, T} play a key role in applying digital signal processing and machine learning techniques to the processing and analysis of DNA sequences. The choice of numerical representation affects how well the biological properties of a DNA sequence are reflected in the numerical domain for detecting and identifying the characteristics of special regions of interest within the sequence. This dissertation presents a comprehensive study of various DNA numerical and graphical representation methods and their applications in processing and analyzing long DNA sequences. Discussions of the relative merits and demerits of the various methods, experimental results and possible future developments are also included. Another focus of the research is promoter prediction in human (Homo sapiens) DNA sequences with a neural network based multi-classifier system using DNA numerical representation methods. Despite the recent development of several computational methods for human promoter prediction, performance still needs to improve. In particular, the high false positive rate of feature-based approaches decreases prediction reliability and leads to erroneous results in gene annotation. To improve prediction accuracy and reliability, DigiPromPred, a numerical representation based promoter prediction system, is proposed to characterize DNA alphabets in different regions of a DNA sequence. The DigiPromPred system predicts promoters with a sensitivity of 90.8% while reducing the false prediction rate for non-promoter sequences, with a specificity of 90.4%. A comparative study with state-of-the-art promoter prediction systems on human chromosome 22 shows that the proposed system maintains a good balance between prediction accuracy and reliability. To reduce the system architecture and computational complexity relative to the existing system, a simple feed-forward neural network classifier known as SDigiPromPred is proposed. The SDigiPromPred system predicts promoters with sensitivities of 87%, 87% and 99%, while reducing the false prediction rate for non-promoter sequences with specificities of 92%, 94% and 99%, for Human, Drosophila and Arabidopsis sequences respectively, and offers reconfigurable capability compared to the existing system.
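As a concrete illustration of the two ideas this abstract relies on, the sketch below shows one common numerical representation of DNA, the binary indicator (one-hot) encoding often referred to as the Voss representation, together with the standard definitions of sensitivity and specificity. The abstract does not say which representations DigiPromPred uses, so the encoding here is an assumed example, and the labels in the usage section are made up for illustration.

```python
ALPHABET = "AGCT"


def indicator_encode(seq):
    """Map a DNA string to four 0/1 indicator sequences, one per nucleotide."""
    seq = seq.upper()
    return {base: [1 if s == base else 0 for s in seq] for base in ALPHABET}


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, 1 = promoter.

    Assumes both classes are present in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    print(indicator_encode("ATGCA"))
    # Illustrative labels only, not results from the dissertation.
    truth = [1, 1, 1, 0, 0, 0, 0, 1]
    preds = [1, 1, 0, 0, 0, 1, 0, 1]
    print(sensitivity_specificity(truth, preds))
```

Sensitivity is the fraction of true promoters that are recovered and specificity is the fraction of non-promoters correctly rejected, which is why the abstract reports them as a pair.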