429 research outputs found

    The Reasonable Effectiveness of Randomness in Scalable and Integrative Gene Regulatory Network Inference and Beyond

    Get PDF
    Gene regulation is orchestrated by a vast number of molecules, including transcription factors and co-factors, chromatin regulators, as well as epigenetic mechanisms, and it has been shown that transcriptional misregulation, e.g., caused by mutations in regulatory sequences, is responsible for a plethora of diseases, including cancer, developmental or neurological disorders. As a consequence, decoding the architecture of gene regulatory networks has become one of the most important tasks in modern (computational) biology. However, to advance our understanding of the mechanisms involved in the transcriptional apparatus, we need scalable approaches that can deal with the increasing number of large-scale, high-resolution, biological datasets. In particular, such approaches need to be capable of efficiently integrating and exploiting the biological and technological heterogeneity of such datasets in order to best infer the underlying, highly dynamic regulatory networks, often in the absence of sufficient ground truth data for model training or testing. With respect to scalability, randomized approaches have proven to be a promising alternative to deterministic methods in computational biology. As an example, one of the top performing algorithms in a community challenge on gene regulatory network inference from transcriptomic data is based on a random forest regression model. In this concise survey, we aim to highlight how randomized methods may serve as a highly valuable tool, in particular, with increasing amounts of large-scale, biological experiments and datasets being collected. Given the complexity and interdisciplinary nature of the gene regulatory network inference problem, we hope our survey maybe helpful to both computational and biological scientists. It is our aim to provide a starting point for a dialogue about the concepts, benefits, and caveats of the toolbox of randomized methods, since unravelling the intricate web of highly dynamic, regulatory events will be one fundamental step in understanding the mechanisms of life and eventually developing efficient therapies to treat and cure diseases

    Algorithms for pre-microrna classification and a GPU program for whole genome comparison

    Get PDF
    MicroRNAs (miRNAs) are non-coding RNAs with approximately 22 nucleotides that are derived from precursor molecules. These precursor molecules or pre-miRNAs often fold into stem-loop hairpin structures. However, a large number of sequences with pre-miRNA-like hairpin can be found in genomes. It is a challenge to distinguish the real pre-miRNAs from other hairpin sequences with similar stem-loops (referred to as pseudo pre-miRNAs). The first part of this dissertation presents a new method, called MirID, for identifying and classifying microRNA precursors. MirID is comprised of three steps. Initially, a combinatorial feature mining algorithm is developed to identify suitable feature sets. Then, the feature sets are used to train support vector machines to obtain classification models, based on which classifier ensemble is constructed. Finally, an AdaBoost algorithm is adopted to further enhance the accuracy of the classifier ensemble. Experimental results on a variety of species demonstrate the good performance of the proposed approach, and its superiority over existing methods. In the second part of this dissertation, A GPU (Graphics Processing Unit) program is developed for whole genome comparison. The goal for the research is to identify the commonalities and differences of two genomes from closely related organisms, via multiple sequencing alignments by using a seed and extend technique to choose reliable subsets of exact or near exact matches, which are called anchors. A rigorous method named Smith-Waterman search is applied for the anchor seeking, but takes days and months to map millions of bases for mammalian genome sequences. With GPU programming, which is designed to run in parallel hundreds of short functions called threads, up to 100X speed up is achieved over similar CPU executions

    ์ƒ๋ฌผํ•™์  ์„œ์—ด ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ํ‘œํ˜„ ํ•™์Šต

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ(๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ •๋ณด๊ณตํ•™๋ถ€, 2021.8. ์œค์„ฑ๋กœ.As we are living in the era of big data, the biomedical domain is not an exception. With the advent of technologies such as next-generation sequencing, developing methods to capitalize on the explosion of biomedical data is one of the most major challenges in bioinformatics. Representation learning, in particular deep learning, has made significant advancements in diverse fields where the artificial intelligence community has struggled for many years. However, although representation learning has also shown great promises in bioinformatics, it is not a silver bullet. Off-the-shelf applications of representation learning cannot always provide successful results for biological sequence data. There remain full of challenges and opportunities to be explored. This dissertation presents a set of representation learning methods to address three issues in biological sequence data analysis. First, we propose a two-stage training strategy to address throughput and information trade-offs within wet-lab CRISPR-Cpf1 activity experiments. Second, we propose an encoding scheme to model interaction between two sequences for functional microRNA target prediction. Third, we propose a self-supervised pre-training method to bridge the exponentially growing gap between the numbers of unlabeled and labeled protein sequences. In summary, this dissertation proposes a set of representation learning methods that can derive invaluable information from the biological sequence data.์šฐ๋ฆฌ๋Š” ๋น…๋ฐ์ดํ„ฐ์˜ ์‹œ๋Œ€๋ฅผ ๋งž์ดํ•˜๊ณ  ์žˆ์œผ๋ฉฐ, ์˜์ƒ๋ช… ๋ถ„์•ผ ๋˜ํ•œ ์˜ˆ์™ธ๊ฐ€ ์•„๋‹ˆ๋‹ค. ์ฐจ์„ธ๋Œ€ ์—ผ๊ธฐ์„œ์—ด ๋ถ„์„๊ณผ ๊ฐ™์€ ๊ธฐ์ˆ ๋“ค์ด ๋„๋ž˜ํ•จ์— ๋”ฐ๋ผ, ํญ๋ฐœ์ ์ธ ์˜์ƒ๋ช… ๋ฐ์ดํ„ฐ์˜ ์ฆ๊ฐ€๋ฅผ ํ™œ์šฉํ•˜๊ธฐ ์œ„ํ•œ ๋ฐฉ๋ฒ•๋ก ์˜ ๊ฐœ๋ฐœ์€ ์ƒ๋ฌผ์ •๋ณดํ•™ ๋ถ„์•ผ์˜ ์ฃผ์š” ๊ณผ์ œ ์ค‘์˜ ํ•˜๋‚˜์ด๋‹ค. ์‹ฌ์ธต ํ•™์Šต์„ ํฌํ•จํ•œ ํ‘œํ˜„ ํ•™์Šต ๊ธฐ๋ฒ•๋“ค์€ ์ธ๊ณต์ง€๋Šฅ ํ•™๊ณ„๊ฐ€ ์˜ค๋žซ๋™์•ˆ ์–ด๋ ค์›€์„ ๊ฒช์–ด์˜จ ๋‹ค์–‘ํ•œ ๋ถ„์•ผ์—์„œ ์ƒ๋‹นํ•œ ๋ฐœ์ „์„ ์ด๋ฃจ์—ˆ๋‹ค. ํ‘œํ˜„ ํ•™์Šต์€ ์ƒ๋ฌผ์ •๋ณดํ•™ ๋ถ„์•ผ์—์„œ๋„ ๋งŽ์€ ๊ฐ€๋Šฅ์„ฑ์„ ๋ณด์—ฌ์ฃผ์—ˆ๋‹ค. ํ•˜์ง€๋งŒ ๋‹จ์ˆœํ•œ ์ ์šฉ์œผ๋กœ๋Š” ์ƒ๋ฌผํ•™์  ์„œ์—ด ๋ฐ์ดํ„ฐ ๋ถ„์„์˜ ์„ฑ๊ณต์ ์ธ ๊ฒฐ๊ณผ๋ฅผ ํ•ญ์ƒ ์–ป์„ ์ˆ˜๋Š” ์•Š์œผ๋ฉฐ, ์—ฌ์ „ํžˆ ์—ฐ๊ตฌ๊ฐ€ ํ•„์š”ํ•œ ๋งŽ์€ ๋ฌธ์ œ๋“ค์ด ๋‚จ์•„์žˆ๋‹ค. ๋ณธ ํ•™์œ„๋…ผ๋ฌธ์€ ์ƒ๋ฌผํ•™์  ์„œ์—ด ๋ฐ์ดํ„ฐ ๋ถ„์„๊ณผ ๊ด€๋ จ๋œ ์„ธ ๊ฐ€์ง€ ์‚ฌ์•ˆ์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด, ํ‘œํ˜„ ํ•™์Šต์— ๊ธฐ๋ฐ˜ํ•œ ์ผ๋ จ์˜ ๋ฐฉ๋ฒ•๋ก ๋“ค์„ ์ œ์•ˆํ•œ๋‹ค. ์ฒซ ๋ฒˆ์งธ๋กœ, ์œ ์ „์ž๊ฐ€์œ„ ์‹คํ—˜ ๋ฐ์ดํ„ฐ์— ๋‚ด์žฌ๋œ ์ •๋ณด์™€ ์ˆ˜์œจ์˜ ๊ท ํ˜•์— ๋Œ€์ฒ˜ํ•  ์ˆ˜ ์žˆ๋Š” 2๋‹จ๊ณ„ ํ•™์Šต ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ๋‘ ๋ฒˆ์งธ๋กœ, ๋‘ ์—ผ๊ธฐ ์„œ์—ด ๊ฐ„์˜ ์ƒํ˜ธ ์ž‘์šฉ์„ ํ•™์Šตํ•˜๊ธฐ ์œ„ํ•œ ๋ถ€ํ˜ธํ™” ๋ฐฉ์‹์„ ์ œ์•ˆํ•œ๋‹ค. ์„ธ ๋ฒˆ์งธ๋กœ, ๊ธฐํ•˜๊ธ‰์ˆ˜์ ์œผ๋กœ ์ฆ๊ฐ€ํ•˜๋Š” ํŠน์ง•๋˜์ง€ ์•Š์€ ๋‹จ๋ฐฑ์งˆ ์„œ์—ด์„ ํ™œ์šฉํ•˜๊ธฐ ์œ„ํ•œ ์ž๊ธฐ ์ง€๋„ ์‚ฌ์ „ ํ•™์Šต ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ์š”์•ฝํ•˜์ž๋ฉด, ๋ณธ ํ•™์œ„๋…ผ๋ฌธ์€ ์ƒ๋ฌผํ•™์  ์„œ์—ด ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„์„ํ•˜์—ฌ ์ค‘์š”ํ•œ ์ •๋ณด๋ฅผ ๋„์ถœํ•  ์ˆ˜ ์žˆ๋Š” ํ‘œํ˜„ ํ•™์Šต์— ๊ธฐ๋ฐ˜ํ•œ ์ผ๋ จ์˜ ๋ฐฉ๋ฒ•๋ก ๋“ค์„ ์ œ์•ˆํ•œ๋‹ค.1 Introduction 1 1.1 Motivation 1 1.2 Contents of Dissertation 4 2 Background 8 2.1 Representation Learning 8 2.2 Deep Neural Networks 12 2.2.1 Multi-layer Perceptrons 12 2.2.2 Convolutional Neural Networks 14 2.2.3 Recurrent Neural Networks 16 2.2.4 Transformers 19 2.3 Training of Deep Neural Networks 23 2.4 Representation Learning in Bioinformatics 26 2.5 Biological Sequence Data Analyses 29 2.6 Evaluation Metrics 32 3 CRISPR-Cpf1 Activity Prediction 36 3.1 Methods 39 3.1.1 Model Architecture 39 3.1.2 Training of Seq-deepCpf1 and DeepCpf1 41 3.2 Experiment Results 44 3.2.1 Datasets 44 3.2.2 Baselines 47 3.2.3 Evaluation of Seq-deepCpf1 49 3.2.4 Evaluation of DeepCpf1 51 3.3 Summary 55 4 Functional microRNA Target Prediction 56 4.1 Methods 62 4.1.1 Candidate Target Site Selection 63 4.1.2 Input Encoding 64 4.1.3 Residual Network 67 4.1.4 Post-processing 68 4.2 Experiment Results 70 4.2.1 Datasets 70 4.2.2 Classification of Functional and Non-functional Targets 71 4.2.3 Distinguishing High-functional Targets 73 4.2.4 Ablation Studies 76 4.3 Summary 77 5 Self-supervised Learning of Protein Representations 78 5.1 Methods 83 5.1.1 Pre-training Procedure 83 5.1.2 Fine-tuning Procedure 86 5.1.3 Model Architecturen 87 5.2 Experiment Results 90 5.2.1 Experiment Setup 90 5.2.2 Pre-training Results 92 5.2.3 Fine-tuning Results 93 5.2.4 Comparison with Larger Protein Language Models 97 5.2.5 Ablation Studies 100 5.2.6 Qualitative Interpreatation Analyses 103 5.3 Summary 106 6 Discussion 107 6.1 Challenges and Opportunities 107 7 Conclusion 111 Bibliography 113 Abstract in Korean 130๋ฐ•

    Differential Regulatory Analysis Based on Coexpression Network in Cancer Research

    Get PDF

    Pathway-Based Multi-Omics Data Integration for Breast Cancer Diagnosis and Prognosis.

    Get PDF
    Ph.D. Thesis. University of Hawaiสปi at Mฤnoa 2017

    An integrated network of Arabidopsis growth regulators and its use for gene prioritization

    Get PDF
    Elucidating the molecular mechanisms that govern plant growth has been an important topic in plant research, and current advances in large-scale data generation call for computational tools that efficiently combine these different data sources to generate novel hypotheses. In this work, we present a novel, integrated network that combines multiple large-scale data sources to characterize growth regulatory genes in Arabidopsis, one of the main plant model organisms. The contributions of this work are twofold: first, we characterized a set of carefully selected growth regulators with respect to their connectivity patterns in the integrated network, and, subsequently, we explored to which extent these connectivity patterns can be used to suggest new growth regulators. Using a large-scale comparative study, we designed new supervised machine learning methods to prioritize growth regulators. Our results show that these methods significantly improve current state-of-the-art prioritization techniques, and are able to suggest meaningful new growth regulators. In addition, the integrated network is made available to the scientific community, providing a rich data source that will be useful for many biological processes, not necessarily restricted to plant growth

    Bipartite Heterogeneous Network Method Based on Co-neighbor for MiRNA-Disease Association Prediction

    Get PDF
    In recent years, miRNA variation and dysregulation have been found to be closely related to human tumors, and identifying miRNA-disease associations is helpful for understanding the mechanisms of disease or tumor development and is greatly significant for the prognosis, diagnosis, and treatment of human diseases. This article proposes a Bipartite Heterogeneous network link prediction method based on co-neighbor to predict miRNA-disease association (BHCN). According to the structural characteristics of the bipartite network, the concept of bipartite network co-neighbors is proposed, and the co-neighbors were used to represent the probability of association between disease and miRNA. To predict the isolated diseases and the new miRNA based on the association probability expressed by co-neighbors, we utilized the similarity between disease nodes and the similarity between miRNA nodes in heterogeneous networks to represent the association probability between disease and miRNA. The model's predictive performance was evaluated by the leave-one-out cross validation (LOOCV) on different datasets. The AUC value of BHCN on the gold benchmark dataset was 0.7973, and the AUC obtained on the prediction dataset was 0.9349, which was better than that of the classic global algorithm. In this case study, we conducted predictive studies on breast neoplasms and colon neoplasms. Most of the top 50 predicted results were confirmed by three databases, namely, HMDD, miR2disease, and dbDEMC, with accuracy rates of 96 and 82%. In addition, BHCN can be used for predicting isolated diseases (without any known associated diseases) and new miRNAs (without any known associated miRNAs). In the isolated disease case study, the top 50 of breast neoplasm and colon neoplasm potentials associated with miRNAs predicted an accuracy of 100 and 96%, respectively, thereby demonstrating the favorable predictive power of BHCN for potentially relevant miRNAs
    • โ€ฆ
    corecore