59 research outputs found

    Representation Learning for Biological Sequence Data

    Doctoral dissertation -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, College of Engineering, August 2021. Sungroh Yoon (윤성로).
    We are living in the era of big data, and the biomedical domain is no exception. With the advent of technologies such as next-generation sequencing, developing methods to capitalize on the explosion of biomedical data is one of the major challenges in bioinformatics. Representation learning, in particular deep learning, has made significant advances in diverse fields where the artificial intelligence community had struggled for many years. However, although representation learning has also shown great promise in bioinformatics, it is not a silver bullet. Off-the-shelf applications of representation learning cannot always provide successful results for biological sequence data, and many challenges and opportunities remain to be explored. This dissertation presents a set of representation learning methods to address three issues in biological sequence data analysis. First, we propose a two-stage training strategy to address throughput and information trade-offs within wet-lab CRISPR-Cpf1 activity experiments. Second, we propose an encoding scheme to model interaction between two sequences for functional microRNA target prediction. Third, we propose a self-supervised pre-training method to bridge the exponentially growing gap between the numbers of unlabeled and labeled protein sequences. In summary, this dissertation proposes a set of representation learning methods that can derive invaluable information from biological sequence data.
    Contents:
    1 Introduction: 1.1 Motivation; 1.2 Contents of Dissertation
    2 Background: 2.1 Representation Learning; 2.2 Deep Neural Networks (2.2.1 Multi-layer Perceptrons, 2.2.2 Convolutional Neural Networks, 2.2.3 Recurrent Neural Networks, 2.2.4 Transformers); 2.3 Training of Deep Neural Networks; 2.4 Representation Learning in Bioinformatics; 2.5 Biological Sequence Data Analyses; 2.6 Evaluation Metrics
    3 CRISPR-Cpf1 Activity Prediction: 3.1 Methods (3.1.1 Model Architecture, 3.1.2 Training of Seq-deepCpf1 and DeepCpf1); 3.2 Experiment Results (3.2.1 Datasets, 3.2.2 Baselines, 3.2.3 Evaluation of Seq-deepCpf1, 3.2.4 Evaluation of DeepCpf1); 3.3 Summary
    4 Functional microRNA Target Prediction: 4.1 Methods (4.1.1 Candidate Target Site Selection, 4.1.2 Input Encoding, 4.1.3 Residual Network, 4.1.4 Post-processing); 4.2 Experiment Results (4.2.1 Datasets, 4.2.2 Classification of Functional and Non-functional Targets, 4.2.3 Distinguishing High-functional Targets, 4.2.4 Ablation Studies); 4.3 Summary
    5 Self-supervised Learning of Protein Representations: 5.1 Methods (5.1.1 Pre-training Procedure, 5.1.2 Fine-tuning Procedure, 5.1.3 Model Architecture); 5.2 Experiment Results (5.2.1 Experiment Setup, 5.2.2 Pre-training Results, 5.2.3 Fine-tuning Results, 5.2.4 Comparison with Larger Protein Language Models, 5.2.5 Ablation Studies, 5.2.6 Qualitative Interpretation Analyses); 5.3 Summary
    6 Discussion: 6.1 Challenges and Opportunities
    7 Conclusion
    Bibliography
    Abstract in Korean

    Developing deep learning computational tools for cancer using omics data

    Master's dissertation in Computer Science. Increasing investment in cancer research has generated an enormous amount of biological and clinical data, especially after the advent of next-generation sequencing technologies. To analyze the large datasets provided by omics data of cancer samples, scientists have successfully turned to machine learning algorithms, identifying patterns and developing models with statistical techniques to make accurate predictions. Deep learning is a branch of machine learning, best known for its applications in artificial intelligence (computer vision, speech recognition, natural language processing and robotics). In general, deep learning models differ from "shallow" machine learning methods (single hidden layer) because they rely on multiple layers of abstraction. In this way, it is possible to learn high-level features and complex relations in the given data. In this context, the main goal of this work is the development and evaluation of deep learning methods for the analysis of cancer omics datasets, covering both unsupervised methods for feature generation from different types of data, and supervised methods to address cancer diagnostics and prognostic predictions. We worked with a Neuroblastoma (NB) dataset from two different platforms (RNA-Seq and microarrays) and developed both supervised (Deep Neural Networks (DNN), Multi-Task Deep Neural Network (MT-DNN)) and unsupervised (Stacked Denoising Autoencoders (SDA)) deep architectures, and compared them with traditional shallow algorithms. Overall we achieved promising results with deep learning on both platforms, meaning that it is possible to obtain the advantages of deep learning models on cancer omics data. At the same time we faced some difficulties related to model complexity and computational power requirements, as well as the lack of samples needed to truly benefit from the deep architectures. The code produced in this work can be applied to other datasets and is available in a GitHub repository, https://github.com/lmpeixoto/deepl_learning [49].

    Inference of biomolecular interactions from sequence data

    This thesis describes our work on the inference of biomolecular interactions from sequence data. In particular, the first part of the thesis focuses on proteins and describes computational methods that we have developed for the inference of both intra- and inter-protein interactions from genomic data. The second part of the thesis centers around protein-RNA interactions and describes a method for the inference of binding motifs of RNA-binding proteins from high-throughput sequencing data. The thesis is organized as follows. In the first part, we start by introducing a novel mathematical model for the characterization of protein sequences (chapter 1). We then show how, using genomic data, this model can be successfully applied to two different problems, namely to the inference of interacting amino acid residues in the tertiary structure of protein domains (chapter 2) and to the prediction of protein-protein interactions in large paralogous protein families (chapters 3 and 4). We conclude the first part with a discussion of potential extensions and generalizations of the methods presented (chapter 5). In the second part of this thesis, we first give a general introduction to RNA-binding proteins (chapter 6). We then describe a novel experimental method for the genome-wide identification of target RNAs of RNA-binding proteins and show how this method can be used to infer the binding motifs of RNA-binding proteins (chapter 7). Finally, we discuss a potential mechanism by which KH domain-containing RNA-binding proteins could achieve the specificity of interaction with their target RNAs and conclude the second part of the thesis by proposing a novel type of motif finding algorithm tailored for the inference of their recognition elements (chapter 8).

    Higher-order interactions in single-cell gene expression: towards a cybergenetic semantics of cell state

    Finding and understanding patterns in gene expression guides our understanding of living organisms, their development, and diseases, but is a challenging and high-dimensional problem as there are many molecules involved. One way to learn about the structure of a gene regulatory network is by studying the interdependencies among its constituents in transcriptomic data sets. These interdependencies could be arbitrarily complex, but almost all current models of gene regulation contain pairwise interactions only, despite experimental evidence existing for higher-order regulation that cannot be decomposed into pairwise mechanisms. I set out to capture these higher-order dependencies in single-cell RNA-seq data using two different approaches. First, I fitted maximum entropy (or Ising) models to expression data by training restricted Boltzmann machines (RBMs). On simulated data, RBMs faithfully reproduced both pairwise and third-order interactions. I then trained RBMs on 37 genes from a scRNA-seq data set of 70k astrocytes from an embryonic mouse. While pairwise and third-order interactions were revealed, the estimates contained a strong omitted variable bias, and there was no statistically sound and tractable way to quantify the uncertainty in the estimates. As a result I next adopted a model-free approach. Estimating model-free interactions (MFIs) in single-cell gene expression data required a quasi-causal graph of conditional dependencies among the genes, which I inferred with an MCMC graph-optimisation algorithm on an initial estimate found by the Peter-Clark algorithm. As the estimates are model-free, MFIs can be interpreted either as mechanistic relationships between the genes, or as substructures in the cell population. On simulated data, MFIs revealed synergy and higher-order mechanisms in various logical and causal dynamics more accurately than any correlation- or information-based quantities. 
I then estimated MFIs among 1,000 genes, at up to seventh-order, in 20k neurons and 20k astrocytes from two different mouse brain scRNA-seq data sets: one developmental, and one adolescent. I found strong evidence for up to fifth-order interactions, and the MFIs mostly disambiguated direct from indirect regulation by preferentially coupling causally connected genes, whereas correlations persisted across causal chains. Validating the predicted interactions against the Pathway Commons database, gene ontology annotations, and semantic similarity, I found that pairwise MFIs contained different mechanistic information from networks based on correlation, but a similar amount of it. Furthermore, third-order interactions provided evidence of combinatorial regulation by transcription factors and immediate early genes. I then switched focus from mechanism to population structure. Each significant MFI can be assigned a set of single cells that most influence its value. Hierarchical clustering of the MFIs by cell assignment revealed substructures in the cell population corresponding to diverse cell states. This offered a new, purely data-driven view on cell states because the inferred states are not required to localise in gene expression space. Across the four data sets, I found 69 significant and biologically interpretable cell states, of which only 9 could be obtained by standard approaches. I identified immature neurons among developing astrocytes and radial glial cells, D1 and D2 medium spiny neurons, D1 MSN subtypes, and cell-cycle related states present across four data sets. I further found evidence for states defined by genes associated with neuropeptide signalling, neuronal activity, myelin metabolism, and genomic imprinting. 
MFIs thus provide a new, statistically sound method to detect substructure in single-cell gene expression data, identifying cell types, subtypes, or states that can be delocalised in gene expression space and whose hierarchical structure provides a new view on the semantics of cell state. The estimation of the quasi-causal graph, the MFIs, and the inference of the associated states are implemented as a publicly available Nextflow pipeline called Stator.
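    As a rough illustration of what a pairwise interaction estimate can look like on binarised expression data, the Python sketch below computes a log odds ratio between two genes, using only the cells in which a chosen set of conditioning genes is off. The function name, the pseudocount, and the toy matrix are hypothetical, and whether Stator computes MFIs in exactly this way is an assumption rather than something the abstract states.

```python
import numpy as np

def pairwise_log_odds(X, i, j, cond=(), pseudo=0.5):
    """Log odds ratio between binary genes i and j, using only the cells
    in which every gene listed in `cond` is 0 (hypothetical helper)."""
    X = np.asarray(X)
    # Keep cells where all conditioning genes are off; with no conditioning
    # genes, np.all over an empty axis is True, so every cell is kept.
    keep = np.all(X[:, list(cond)] == 0, axis=1)
    sub = X[keep]
    # 2x2 contingency counts, with a pseudocount to avoid division by zero.
    n = {(a, b): np.sum((sub[:, i] == a) & (sub[:, j] == b)) + pseudo
         for a in (0, 1) for b in (0, 1)}
    return float(np.log(n[1, 1] * n[0, 0] / (n[1, 0] * n[0, 1])))

# Made-up binarised expression matrix: rows are cells, columns are genes.
cells = np.array([[1, 1, 0],
                  [0, 0, 0],
                  [1, 1, 0],
                  [0, 0, 0],
                  [1, 0, 0],
                  [0, 1, 0],
                  [1, 0, 1]])

# Interaction between genes 0 and 1, restricted to cells where gene 2 is off.
score = pairwise_log_odds(cells, 0, 1, cond=[2])
```

    A positive score means the two genes are co-expressed more often than independence would predict within the conditioned cells. In practice the conditioning set would presumably come from the inferred graph of conditional dependencies rather than being chosen by hand.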

    Identifying disease-associated genes based on artificial intelligence

    Identifying disease-gene associations can help improve the understanding of disease mechanisms, which has a variety of applications, such as early diagnosis and drug development. Although experimental techniques, such as linkage analysis and genome-wide association studies (GWAS), have identified a large number of associations, identifying disease genes is still challenging since experimental methods are usually time-consuming and expensive. To address these issues, computational methods have been proposed to predict disease-gene associations. Based on the characteristics of existing computational algorithms in the literature, we can roughly divide them into three categories: network-based methods, machine learning-based methods, and other methods. Whatever model is used to predict disease genes, the proper integration of multi-level biological data is the key to improving prediction accuracy. This thesis addresses some limitations of the existing computational algorithms, and integrates multi-level data via artificial intelligence techniques. The thesis starts with a comprehensive review of computational methods, databases, and evaluation methods used in predicting disease-gene associations, followed by one network-based method and four machine learning-based methods. The first chapter introduces the background information, objectives of the studies and structure of the thesis. After that, a comprehensive review is provided in the second chapter to discuss the existing algorithms as well as the databases and evaluation methods used in existing studies. With the objectives and future directions established, the thesis then presents five computational methods for predicting disease-gene associations. The first method, proposed in Chapter 3, considers the issue of non-disease gene selection. A shortest path-based strategy is used to select reliable non-disease genes from a disease gene network and a differential network. 
The selected genes are then used by a network-energy model to improve its performance. The second method, proposed in Chapter 4, constructs sample-based networks for case samples and uses them to predict disease genes. This strategy improves the quality of protein-protein interaction (PPI) networks, which further improves the prediction accuracy. Chapter 5 presents a generic model which applies multimodal deep belief nets (DBN) to fuse different types of data. Network embeddings extracted from PPI networks and gene ontology (GO) data are fused with the multimodal DBN to obtain cross-modality representations. Chapter 6 presents another deep learning model which uses a convolutional neural network (CNN) to integrate gene similarities with other types of data. Finally, the fifth method, proposed in Chapter 7, is a nonnegative matrix factorization (NMF)-based method. This method maps diseases and genes onto a lower-dimensional manifold, and the geodesic distance between diseases and genes is used to predict their associations. The method can predict disease genes even if the disease under consideration has no known associated genes. In summary, this thesis has proposed several artificial intelligence-based computational algorithms to address typical issues in existing computational algorithms. Experimental results have shown that the proposed methods can improve the accuracy of disease-gene prediction.
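    To make the NMF idea concrete, the Python sketch below factorizes a small disease-by-gene association matrix with the standard multiplicative updates for the squared-error objective. It is plain NMF only: the manifold mapping and geodesic distance described for Chapter 7 are not reproduced, and the function name, rank, iteration count, and toy matrix are assumptions made for illustration.

```python
import numpy as np

def nmf(A, rank, n_iter=500, eps=1e-9, seed=0):
    """Approximate a nonnegative matrix A (diseases x genes) as W @ H,
    using multiplicative updates for the Frobenius-norm objective."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    # Random nonnegative starting factors.
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Each update rescales the factor elementwise, so it stays nonnegative.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy association matrix (made-up data): rows are diseases, columns are genes.
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)

W, H = nmf(A, rank=2)
scores = W @ H  # reconstructed association scores for every disease-gene pair
```

    Rows of W and columns of H place diseases and genes in a shared low-dimensional space, and entries of the reconstruction that are large where A is zero can be read as candidate associations.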

    Opportunities and obstacles for deep learning in biology and medicine

    Deep learning describes a class of machine learning algorithms that are capable of combining raw inputs into layers of intermediate features. These algorithms have recently shown impressive results across a variety of domains. Biology and medicine are data-rich disciplines, but the data are complex and often ill-understood. Hence, deep learning techniques may be particularly well suited to solve problems of these fields. We examine applications of deep learning to a variety of biomedical problems (patient classification, fundamental biological processes and treatment of patients) and discuss whether deep learning will be able to transform these tasks or if the biomedical sphere poses unique challenges. Following from an extensive literature review, we find that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art. Even though improvements over previous baselines have been modest in general, the recent progress indicates that deep learning methods will provide valuable means for speeding up or aiding human investigation. Though progress has been made linking a specific neural network's prediction to input features, understanding how users should interpret these models to make testable hypotheses about the system under study remains an open challenge. Furthermore, the limited amount of labelled data for training presents problems in some domains, as do legal and privacy constraints on work with sensitive health records. Nonetheless, we foresee deep learning enabling changes at both bench and bedside with the potential to transform several areas of biology and medicine.

    Unveiling the frontiers of deep learning: innovations shaping diverse domains

    Deep learning (DL) enables the development of computer models that are capable of learning, visualizing, optimizing, refining, and predicting data. In recent years, DL has been applied in a range of fields, including audio-visual data processing, agriculture, transportation prediction, natural language, biomedicine, disaster management, bioinformatics, drug design, genomics, face recognition, and ecology. To explore the current state of deep learning, it is necessary to investigate the latest developments and applications of deep learning in these disciplines. However, the literature is lacking in exploring the applications of deep learning in all potential sectors. This paper thus extensively investigates the potential applications of deep learning across all major fields of study as well as the associated benefits and challenges. As evidenced in the literature, DL exhibits accuracy in prediction and analysis, which makes it a powerful computational tool, and it has the ability to articulate itself and optimize, making it effective in processing data with no prior training. Given its independence from training data, deep learning necessitates massive amounts of data for effective analysis and processing, much like data volume. To handle the challenge of compiling huge amounts of medical, scientific, healthcare, and environmental data for use in deep learning, gated architectures like LSTMs and GRUs can be utilized. For multimodal learning, shared neurons in the neural network for all activities and specialized neurons for particular tasks are necessary.
    Comment: 64 pages, 3 figures, 3 tables

    Identifying and targeting cancer-specific metabolism with network-based drug target prediction

    Background: Metabolic rewiring allows cancer cells to sustain high proliferation rates. Thus, targeting only the cancer-specific cellular metabolism will safeguard healthy tissues.
    Methods: We developed the very efficient FASTCORMICS RNA-seq workflow (rFASTCORMICS) to build 10,005 high-resolution metabolic models from the TCGA dataset to capture metabolic rewiring strategies in cancer cells. Colorectal cancer (CRC) was used as a test case for a repurposing workflow based on rFASTCORMICS.
    Findings: Alternative pathways that are not required for proliferation or survival tend to be shut down and, therefore, tumours display cancer-specific essential genes that are significantly enriched for known drug targets. We identified naftifine, ketoconazole, and mimosine as new potential CRC drugs, which were experimentally validated.
    Interpretation: The rFASTCORMICS workflow presented here successfully reconstructs a metabolic model based on RNA-seq data and successfully predicted drug targets and drugs not yet indicated for colorectal cancer.

    Bioinformatics

    This book is divided into different research areas relevant to bioinformatics, such as biological networks, next-generation sequencing, high-performance computing, molecular modeling, structural bioinformatics and intelligent data analysis. Each book section introduces the basic concepts and then explains their application to problems of great relevance, so both novice and expert readers can benefit from the information and research works presented here.

    Computational Methods for the Pharmacogenetic Interpretation of Next Generation Sequencing Data

    Up to half of all patients do not respond to pharmacological treatment as intended. A substantial fraction of these inter-individual differences is due to heritable factors, and a growing number of associations between genetic variations and drug response phenotypes have been identified. Importantly, the rapid progress in Next Generation Sequencing technologies in recent years has unveiled the true complexity of the genetic landscape in pharmacogenes, with tens of thousands of rare genetic variants. As each individual was found to harbor numerous such rare variants, they are anticipated to be important contributors to the genetically encoded inter-individual variability in drug effects. The fundamental challenge, however, is their functional interpretation, because the sheer scale of the problem renders systematic experimental characterization of these variants currently unfeasible. Here, we review concepts and important progress in the development of computational prediction methods that allow the effect of amino acid sequence alterations in drug metabolizing enzymes and transporters to be evaluated. In addition, we discuss recent advances in the interpretation of functional effects of non-coding variants, such as variations in splice sites, regulatory regions and miRNA binding sites. We anticipate that these methodologies will provide a useful toolkit to facilitate the integration of the vast extent of rare genetic variability into drug response predictions in a precision medicine framework.