106 research outputs found

    Deep Learning for Genomics: A Concise Overview

    Advancements in genomic research, such as high-throughput sequencing techniques, have turned modern genomic studies into "big data" disciplines. This data explosion constantly challenges the conventional methods used in genomics. In parallel with the urgent demand for robust algorithms, deep learning has succeeded in a variety of fields such as vision, speech, and text processing. Yet genomics poses unique challenges for deep learning, since we expect from it a superhuman intelligence that explores beyond our knowledge to interpret the genome. A powerful deep learning model should rely on insightful use of task-specific knowledge. In this paper, we briefly discuss the strengths of different deep learning models from a genomic perspective, so as to fit each particular task with a proper deep architecture, and remark on practical considerations in developing modern deep learning architectures for genomics. We also provide a concise review of deep learning applications in various aspects of genomic research and point out potential opportunities and obstacles for future genomics applications. Comment: Invited chapter for Springer Book: Handbook of Deep Learning Application

    Applications of deep neural networks to protein structure prediction

    Professor Yi Shang, Dissertation Advisor; Professor Dong Xu, Dissertation Co-advisor. Includes vita. Field of Study: Computer science. "July 2018." Protein secondary structure, backbone torsion angles, and other secondary structure features can provide useful information for predicting protein 3D structure and protein function. Deep learning offers a new opportunity to significantly improve prediction accuracy. In this dissertation, several new deep neural network architectures are proposed for predicting protein secondary structure and related features: deep inception-inside-inception (Deep3I) networks and deep neighbor residual (DeepNRN) networks for secondary structure prediction; deep residual inception networks (DeepRIN) for backbone torsion angle prediction; deep dense inception networks (DeepDIN) for beta-turn prediction; and deep inception capsule networks (DeepICN) for gamma-turn prediction. Each was implemented as a standalone tool, integrated into the MUFold package, and made freely available to the research community. A web server called MUFold-SS-Angle was also developed for protein property prediction. The input to these deep neural networks is a carefully designed feature matrix corresponding to the primary amino acid sequence of a protein, containing a rich set of information derived from individual amino acids as well as from the context of the protein sequence. Specifically, the feature matrix is composed of physicochemical properties of amino acids, a PSI-BLAST profile, an HHBlits profile, and/or a predicted shape string. The deep architecture enables effective processing of local and global interactions between amino acids in making accurate predictions.
In extensive experiments on multiple datasets, the proposed deep neural architectures significantly outperformed the best existing methods and other deep neural networks. DeepNRN achieved Q8 accuracies of 75.33, 72.9, and 70.8 on CASP 10, 11, and 12, higher than the previous state-of-the-art DeepCNF-SS with 71.8, 72.3, and 69.76. MUFold-SS (Deep3I) achieved the highest Q8 accuracies, 76.47, 74.51, and 72.1, on CASP 10, 11, and 12. Compared with the recently released state-of-the-art tool SPIDER3, DeepRIN reduced the Psi angle prediction error by more than 5 degrees and the Phi angle prediction error by more than 2 degrees on average. DeepDIN significantly outperformed BetaTPred3 in both two-class and nine-class beta-turn prediction on the benchmarks BT426 and BT6376. DeepICN is the first application of a capsule network to biological sequence analysis and outperformed all previous gamma-turn predictors on the benchmark GT320. Includes bibliographical references (pages 114-131).
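
For readers unfamiliar with the Q8 figures quoted above: Q8 is the percentage of residues whose predicted eight-state secondary structure label matches the reference label. The following is a minimal sketch of that calculation, not code from the dissertation. The function name `q8_accuracy`, the example strings, and the use of `C` for coil are illustrative assumptions.

```python
def q8_accuracy(true_labels: str, predicted_labels: str) -> float:
    """Return the percentage of residues whose predicted label matches the reference."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label strings must have the same length")
    if not true_labels:
        raise ValueError("label strings must be non-empty")
    # Count per-residue agreements between reference and prediction.
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * matches / len(true_labels)


# Hypothetical 10-residue example with one letter per residue (assumed alphabet).
reference = "HHHHEEETTC"
predicted = "HHHEEEETSC"
score = q8_accuracy(reference, predicted)  # 8 of 10 residues agree
```

In practice the score is usually computed over all residues of a benchmark set, not per protein, so the reported values depend on how the benchmark pools its sequences.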

    Prediction of Secondary Protein Structure

    This project presents an attempt to solve the problem of protein secondary structure prediction using deep residual neural networks and other methods. Proteins are among the most vital components of every living being. They play a very important role, as they determine the functions of an organism. Knowing a protein's structure is therefore of great importance. Protein structure has four levels: primary, secondary, tertiary, and quaternary. The most significant is the structure in three-dimensional space, the tertiary structure, because it determines the biological role of the protein. As a result, knowledge of protein function may help in the treatment of many diseases. Unfortunately, the methods developed so far for determining structure are very complicated and time-consuming. Determining the secondary structure is a necessary step toward deriving the tertiary structure, which is why it is studied. The secondary structure is derived from the primary structure, which is the amino acid sequence. This project mainly analyzes deep residual networks and how they can help predict protein secondary structure. Such networks belong to the category of deep neural networks and essentially consist of convolutional layers with additive connections between them.
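
The "additive connections" mentioned in this abstract can be illustrated with a very small residual block: the input is transformed by a layer and then added back to itself before the activation. This is a generic sketch in plain Python, assuming a single 1D convolution with an odd-length kernel and a ReLU activation. It is not the architecture used in the thesis, and all function names are hypothetical.

```python
def conv1d_same(x, kernel):
    """1D cross-correlation with zero padding so the output length equals len(x).

    Assumes an odd-length kernel.
    """
    half = len(kernel) // 2
    padded = [0.0] * half + list(x) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(x))
    ]


def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]


def residual_block(x, kernel):
    """Compute relu(conv(x) + x); the '+ x' term is the additive skip connection."""
    transformed = conv1d_same(x, kernel)
    return relu([t + v for t, v in zip(transformed, x)])


signal = [1.0, -2.0, 3.0]
# With an all-zero kernel the layer contributes nothing and the skip path passes x through.
passthrough = residual_block(signal, [0.0, 0.0, 0.0])
# With an identity kernel the layer reproduces x, so the block outputs relu(2 * x).
doubled = residual_block(signal, [0.0, 1.0, 0.0])
```

The skip path lets a layer fall back to passing its input through unchanged, which is the usual motivation given for why very deep residual networks remain trainable.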

    Opportunities and obstacles for deep learning in biology and medicine

    Deep learning describes a class of machine learning algorithms that are capable of combining raw inputs into layers of intermediate features. These algorithms have recently shown impressive results across a variety of domains. Biology and medicine are data-rich disciplines, but the data are complex and often ill-understood. Hence, deep learning techniques may be particularly well suited to solve problems of these fields. We examine applications of deep learning to a variety of biomedical problems (patient classification, fundamental biological processes, and treatment of patients) and discuss whether deep learning will be able to transform these tasks or if the biomedical sphere poses unique challenges. Following from an extensive literature review, we find that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art. Even though improvements over previous baselines have been modest in general, the recent progress indicates that deep learning methods will provide valuable means for speeding up or aiding human investigation. Though progress has been made linking a specific neural network's prediction to input features, understanding how users should interpret these models to make testable hypotheses about the system under study remains an open challenge. Furthermore, the limited amount of labelled data for training presents problems in some domains, as do legal and privacy constraints on work with sensitive health records. Nonetheless, we foresee deep learning enabling changes at both bench and bedside with the potential to transform several areas of biology and medicine.

    Representation Learning for Biological Sequence Data

    Doctoral dissertation -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2021. Sungroh Yoon. We are living in the era of big data, and the biomedical domain is no exception. With the advent of technologies such as next-generation sequencing, developing methods to capitalize on the explosion of biomedical data is one of the major challenges in bioinformatics. Representation learning, in particular deep learning, has made significant advances in diverse fields where the artificial intelligence community had struggled for many years. However, although representation learning has also shown great promise in bioinformatics, it is not a silver bullet. Off-the-shelf applications of representation learning cannot always provide successful results for biological sequence data, and many challenges and opportunities remain to be explored. This dissertation presents a set of representation learning methods to address three issues in biological sequence data analysis. First, we propose a two-stage training strategy to address throughput and information trade-offs within wet-lab CRISPR-Cpf1 activity experiments. Second, we propose an encoding scheme to model the interaction between two sequences for functional microRNA target prediction. Third, we propose a self-supervised pre-training method to bridge the exponentially growing gap between the numbers of unlabeled and labeled protein sequences. In summary, this dissertation proposes a set of representation learning methods that can derive invaluable information from biological sequence data.
    Contents:
    1 Introduction: 1.1 Motivation; 1.2 Contents of Dissertation
    2 Background: 2.1 Representation Learning; 2.2 Deep Neural Networks (2.2.1 Multi-layer Perceptrons; 2.2.2 Convolutional Neural Networks; 2.2.3 Recurrent Neural Networks; 2.2.4 Transformers); 2.3 Training of Deep Neural Networks; 2.4 Representation Learning in Bioinformatics; 2.5 Biological Sequence Data Analyses; 2.6 Evaluation Metrics
    3 CRISPR-Cpf1 Activity Prediction: 3.1 Methods (3.1.1 Model Architecture; 3.1.2 Training of Seq-deepCpf1 and DeepCpf1); 3.2 Experiment Results (3.2.1 Datasets; 3.2.2 Baselines; 3.2.3 Evaluation of Seq-deepCpf1; 3.2.4 Evaluation of DeepCpf1); 3.3 Summary
    4 Functional microRNA Target Prediction: 4.1 Methods (4.1.1 Candidate Target Site Selection; 4.1.2 Input Encoding; 4.1.3 Residual Network; 4.1.4 Post-processing); 4.2 Experiment Results (4.2.1 Datasets; 4.2.2 Classification of Functional and Non-functional Targets; 4.2.3 Distinguishing High-functional Targets; 4.2.4 Ablation Studies); 4.3 Summary
    5 Self-supervised Learning of Protein Representations: 5.1 Methods (5.1.1 Pre-training Procedure; 5.1.2 Fine-tuning Procedure; 5.1.3 Model Architecture); 5.2 Experiment Results (5.2.1 Experiment Setup; 5.2.2 Pre-training Results; 5.2.3 Fine-tuning Results; 5.2.4 Comparison with Larger Protein Language Models; 5.2.5 Ablation Studies; 5.2.6 Qualitative Interpretation Analyses); 5.3 Summary
    6 Discussion: 6.1 Challenges and Opportunities
    7 Conclusion
    Bibliography
    Abstract in Korean

    On Improving Generalization of CNN-Based Image Classification with Delineation Maps Using the CORF Push-Pull Inhibition Operator

    Deployed image classification pipelines typically depend on images captured in real-world environments. This means that images might be affected by different sources of perturbation (e.g., sensor noise in low-light environments). The main challenge arises from the fact that image quality directly affects the reliability and consistency of classification. This challenge has therefore attracted wide interest within the computer vision community. We propose a transformation step that attempts to enhance the generalization ability of CNN models in the presence of unseen noise in the test set. Concretely, the delineation maps of given images are determined using the CORF push-pull inhibition operator. This operation transforms an input image into a space that is more robust to noise before it is processed by a CNN. We evaluated our approach on the Fashion MNIST data set with an AlexNet model. The proposed CORF-augmented pipeline achieved results on noise-free images comparable to those of a conventional AlexNet classification model without CORF delineation maps, but it consistently achieved significantly superior performance on test images perturbed with different levels of Gaussian and uniform noise.
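
The evaluation described in this abstract depends on perturbing test images with Gaussian and uniform noise at several levels. The snippet below is a rough sketch of such a perturbation step only. It assumes pixel intensities scaled to [0, 1] and stored as a flat list. The function names and parameters are hypothetical and the CORF operator itself is not reproduced here.

```python
import random


def clip01(value):
    """Clamp a pixel intensity to the assumed [0, 1] range."""
    return min(1.0, max(0.0, value))


def add_gaussian_noise(pixels, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to each pixel."""
    return [clip01(p + rng.gauss(0.0, sigma)) for p in pixels]


def add_uniform_noise(pixels, width, rng):
    """Add noise drawn uniformly from [-width, width] to each pixel."""
    return [clip01(p + rng.uniform(-width, width)) for p in pixels]


rng = random.Random(0)  # fixed seed so perturbed test sets are reproducible
image = [0.0, 0.25, 0.5, 0.75, 1.0]  # hypothetical flattened image
noisy_gaussian = add_gaussian_noise(image, sigma=0.1, rng=rng)
noisy_uniform = add_uniform_noise(image, width=0.1, rng=rng)
```

Sweeping `sigma` or `width` over a range of values gives the "different levels" of noise against which a baseline and a noise-robust pipeline could be compared.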

    A Computational Framework for Host-Pathogen Protein-Protein Interactions

    Infectious diseases cause millions of illnesses and deaths every year and raise serious health concerns worldwide. How to monitor and cure infectious diseases has become a widespread and intractable problem. Because host-pathogen interactions are considered the key infection processes at the molecular level, a large body of research has focused on them in order to understand infection mechanisms and develop novel therapeutic solutions. For years, the continuous development of technologies in biology has benefited wet-lab experiments, from small-scale biochemical, biophysical, and genetic experiments to large-scale methods (for example, yeast two-hybrid analysis and cryogenic electron microscopy). As a result of decades of effort, biological data have accumulated explosively, including multi-omics data such as genomics and proteomics data. An initial review of omics data is therefore conducted in Chapter 2, which presents recent updates in ‘omics’ studies, focusing particularly on proteomics and genomics. High-throughput technologies have further boosted the growing amount of ‘omics’ data, including genomics and proteomics. An upsurge of interest in data analytics in bioinformatics comes as no surprise to researchers from a variety of disciplines. Specifically, the astonishing rate at which genomics and proteomics data are generated leads researchers into the realm of ‘Big Data’ research. Chapter 2 thus provides an update on the omics background and the state-of-the-art developments in the omics area, with a focus on genomics data, from the perspective of big data analytics.