51 research outputs found

    NOVEL APPLICATIONS OF MACHINE LEARNING IN BIOINFORMATICS

    Technological advances in next-generation sequencing and biomedical imaging have led to a rapid increase in the dimensionality and acquisition rate of biomedical data, which challenges conventional data analysis strategies. Modern machine learning techniques promise to leverage large data sets to find hidden patterns and make accurate predictions. This dissertation aims to design novel machine learning-based models to transform biomedical big data into valuable biological insights. The research presented in this dissertation focuses on three bioinformatics domains: splice junction classification, gene regulatory network reconstruction, and lesion detection in mammograms. A critical step in defining gene structures and mRNA transcript variants is to accurately identify splice junctions. In the first work, we built the first deep learning-based splice junction classifier, DeepSplice. It outperforms state-of-the-art classification tools in both classification accuracy and computational efficiency. In the second work, to uncover transcription factors governing metabolic reprogramming in non-small-cell lung cancer patients, we developed TFmeta, a machine learning approach that reconstructs relationships between transcription factors and their target genes. Our approach achieves the best performance on benchmark data sets. In the third work, we designed deep learning-based architectures to perform lesion detection in both 2D and 3D whole mammogram images.

    A Study on Deep Learning for Bioinformatics

    Bioinformatics, an interdisciplinary area of biology and computer science, handles large and complex data sets with linear and non-linear relationships between attributes. Deep learning has become increasingly important for modeling such relationships. This paper analyses different deep learning architectures and their applications in bioinformatics. It also addresses the limitations and challenges of deep learning.

    Robust Feature Learning Using Deep Neural Networks

    Thesis (Ph.D.), Seoul National University Graduate School, Department of Electrical and Computer Engineering, August 2016. 윤성로 (Sungroh Yoon).

    Recent advances in machine learning continue to bring us closer to artificial intelligence. In particular, deep learning plays a key role in cutting-edge frameworks such as autonomous driving and game playing. Deep learning refers to a class of techniques built on multi-layered neural networks, which is rapidly evolving as the amount of data increases, prior knowledge builds up, efficient training schemes are developed, and high-end hardware is built. Currently, deep learning is a state-of-the-art technique for most recognition tasks. Because deep neural networks learn many parameters, there have been a variety of attempts to obtain reasonable solutions over a wide search space. In this dissertation, three issues in deep learning are discussed, and approaches to solve them with regularization techniques are suggested. First, deep neural networks are exposed to intrinsic blind spots called adversarial perturbations. We therefore construct neural networks that resist the directions of adversarial perturbations by adding an explicit manifold loss term to the objective function that minimizes the differences between the original and adversarial samples. Second, training of restricted Boltzmann machines with conventional contrastive divergence shows limited performance when handling minority samples in class-imbalanced datasets. Our approach addresses this limitation with a boosting concept that gives higher training weights to minority classes, combined with a new regularization concept for datasets that have categorical features. Lastly, insufficient data must be handled in a more sophisticated way when deep networks learn numerous parameters. Given high-dimensional samples, we must augment datasets with adequate prior knowledge to estimate a high-dimensional distribution. Furthermore, this dissertation shows the first application of deep belief networks to identifying junction splicing signals. Junction prediction is one of the major problems in the field of bioinformatics, and is a starting point for understanding the entire gene expression process. In summary, this dissertation proposes a set of deep learning regularization schemes that can learn the meaningful representation underlying large-scale genomic datasets and image datasets. The effectiveness of these methods was confirmed with a number of experimental studies.

    Contents:
    Chapter 1 Introduction. 1.1 Deep neural networks; 1.2 Issue 1: adversarial examples handling; 1.3 Issue 2: class-imbalance handling; 1.4 Issue 3: insufficient data handling; 1.5 Organization
    Chapter 2 Background. 2.1 Basic operations for deep networks; 2.2 History of deep networks; 2.3 Modern deep networks; 2.3.1 Contrastive divergence; 2.3.2 Deep manifold learning
    Chapter 3 Adversarial examples handling. 3.1 Introduction; 3.2 Methods; 3.2.1 Manifold regularized networks; 3.2.2 Generation of adversarial examples; 3.3 Results and discussion; 3.3.1 Improved classification performance; 3.3.2 Disentanglement and generalization; 3.4 Summary
    Chapter 4 Class-imbalance handling. 4.1 Introduction; 4.1.1 Numerical interpretation of DNA sequences; 4.1.2 Review of junction prediction problem; 4.2 Methods; 4.2.1 Boosted contrastive divergence with categorical gradients; 4.2.2 Stacking and fine-tuning; 4.2.3 Initialization and parameter setting; 4.3 Results and discussion; 4.3.1 Experiment preparation; 4.3.2 Improved prediction performance and runtime; 4.3.3 More robust prediction by proposed approach; 4.3.4 Effects of regularization on performance; 4.3.5 Efficient RBM training by boosted CD; 4.3.6 Identification of non-canonical splice sites; 4.4 Summary
    Chapter 5 Insufficient data handling. 5.1 Introduction; 5.2 Backgrounds; 5.2.1 Understanding comets; 5.2.2 Assessing DNA damage from tail shape; 5.2.3 Related image processing techniques; 5.3 Methods; 5.3.1 Preprocessing; 5.3.2 Binarization; 5.3.3 Filtering and overlap correction; 5.3.4 Characterization and classification; 5.4 Results and discussion; 5.4.1 Test data preparation; 5.4.2 Binarization; 5.4.3 Robust identification of comets; 5.4.4 Classification; 5.4.5 More accurate characterization by DeepComet; 5.5 Summary
    Chapter 6 Conclusion. 6.1 Dissertation summary; 6.2 Future work
    Bibliography
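
    The explicit loss term mentioned in the abstract above, which penalizes the difference between an original sample and its adversarial counterpart, can be sketched in a generic form. This is an assumed illustration and not the dissertation's stated objective: the symbols $f_\theta$ (network output), $h_\theta$ (a hidden representation), $x_{\mathrm{adv}}$ (adversarial version of $x$), and the weight $\lambda$ are all hypothetical names introduced here.

```latex
\mathcal{L}(\theta)
  = \mathcal{L}_{\mathrm{cls}}\bigl(f_\theta(x),\, y\bigr)
  + \lambda \,\bigl\lVert h_\theta(x) - h_\theta(x_{\mathrm{adv}}) \bigr\rVert_2^2
```

    Under this assumed form, a larger $\lambda$ pushes the representations of $x$ and $x_{\mathrm{adv}}$ closer together at some cost to the plain classification loss.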

    Discerning Novel Splice Junctions Derived from RNA-Seq Alignment: A Deep Learning Approach

    Background: Exon splicing is a regulated cellular process in the transcription of protein-coding genes. Technological advancements and cost reductions in RNA sequencing have made quantitative and qualitative assessments of the transcriptome both possible and widely available. RNA-seq provides unprecedented resolution to identify gene structures and resolve the diversity of splicing variants. However, currently available ab initio aligners are vulnerable to spurious alignments due to random sequence matches and sample-reference genome discordance. As a consequence, a significant set of false positive exon junction predictions is introduced, which further confuses downstream analyses of splice variant discovery and abundance estimation. Results: In this work, we present a deep learning-based splice junction sequence classifier, named DeepSplice, which employs convolutional neural networks to classify candidate splice junctions. We show that (I) DeepSplice outperforms state-of-the-art methods for splice site classification when applied to the popular benchmark dataset HS3D, (II) DeepSplice shows high accuracy for splice junction classification with GENCODE annotation, and (III) the application of DeepSplice to classify putative splice junctions generated by Rail-RNA alignment of 21,504 human RNA-seq samples reduces 43 million candidates to around 3 million highly confident novel splice junctions. Conclusions: We have implemented a model that is inferred from the sequences of annotated exon junctions and can then classify splice junctions derived from primary RNA-seq data. The performance of the model was evaluated and compared through comprehensive benchmarking and testing, indicating reliable performance and overall usability for classifying novel splice junctions derived from RNA-seq alignment.
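
    A convolutional classifier over junction sequences needs a numeric input. The abstract above does not say how DeepSplice encodes sequences, so the following is only a minimal sketch of one common choice, one-hot encoding, under that assumption. The helper name `one_hot_encode`, the A/C/G/T column order, and the all-zero row for unknown bases are illustrative choices and are not taken from the paper.

```python
# Hypothetical helper (not from the DeepSplice paper): one-hot encode a DNA
# sequence so that it could be fed to a convolutional network.
BASES = "ACGT"  # assumed column order


def one_hot_encode(seq):
    """Return one 4-element row per nucleotide, in A, C, G, T order.

    Symbols outside A/C/G/T (for example N) become an all-zero row.
    """
    rows = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        idx = BASES.find(base)
        if idx >= 0:
            row[idx] = 1
        rows.append(row)
    return rows


# Example: a short, made-up sequence containing one unknown base.
encoded = one_hot_encode("GTAAGN")
```

    Each sequence of length n becomes an n-by-4 matrix, which a one-dimensional convolution can scan along the sequence axis.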

    Deep Learning for Genomics: A Concise Overview

    Advancements in genomic research such as high-throughput sequencing techniques have driven modern genomic studies into "big data" disciplines. This data explosion is constantly challenging conventional methods used in genomics. In parallel with the urgent demand for robust algorithms, deep learning has succeeded in a variety of fields such as vision, speech, and text processing. Yet genomics poses unique challenges for deep learning, because we expect from it a superhuman intelligence that explores beyond our knowledge to interpret the genome. A powerful deep learning model should rely on insightful utilization of task-specific knowledge. In this paper, we briefly discuss the strengths of different deep learning models from a genomic perspective so as to fit each particular task with a proper deep architecture, and remark on practical considerations of developing modern deep learning architectures for genomics. We also provide a concise review of deep learning applications in various aspects of genomic research, and point out potential opportunities and obstacles for future genomics applications.
    Comment: Invited chapter for Springer Book: Handbook of Deep Learning Application

    Artificial intelligence used in genome analysis studies

    Next Generation Sequencing (NGS), or deep sequencing technology, enables parallel reading of multiple individual DNA fragments, thereby enabling the identification of millions of base pairs in several hours. Recent research has clearly shown that machine learning technologies can efficiently analyse large sets of genomic data and help to identify novel gene functions and regulatory regions. A deep artificial neural network consists of a group of artificial neurons that mimic the properties of living neurons. These mathematical models, termed Artificial Neural Networks (ANNs), can be used to solve artificial intelligence engineering problems in several different technological fields (e.g., biology, genomics, proteomics, and metabolomics). In practical terms, neural networks are non-linear statistical structures that are organized as modelling tools and are used to simulate complex genomic relationships between inputs and outputs. To date, Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have been demonstrated to be the best tools for improving performance in problem-solving tasks within the genomic field.