867 research outputs found

    Construction of Gene Regulatory Networks Using Recurrent Neural Networks and Swarm Intelligence


    ์ƒ๋ฌผํ•™์  ์„œ์—ด ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ํ‘œํ˜„ ํ•™์Šต

    Doctoral dissertation (Ph.D.) -- Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, August 2021. Advisor: Sungroh Yoon.
    As we are living in the era of big data, the biomedical domain is no exception. With the advent of technologies such as next-generation sequencing, developing methods to capitalize on the explosion of biomedical data is one of the major challenges in bioinformatics. Representation learning, in particular deep learning, has made significant advances in diverse fields where the artificial intelligence community has struggled for many years. However, although representation learning has also shown great promise in bioinformatics, it is not a silver bullet. Off-the-shelf applications of representation learning cannot always provide successful results for biological sequence data, and many challenges and opportunities remain to be explored. This dissertation presents a set of representation learning methods to address three issues in biological sequence data analysis. First, we propose a two-stage training strategy to address the trade-off between throughput and information in wet-lab CRISPR-Cpf1 activity experiments. Second, we propose an encoding scheme that models the interaction between two sequences for functional microRNA target prediction. Third, we propose a self-supervised pre-training method to bridge the exponentially growing gap between the numbers of unlabeled and labeled protein sequences. In summary, this dissertation proposes a set of representation learning methods that can derive invaluable information from biological sequence data.
    Table of Contents
    1 Introduction
        1.1 Motivation
        1.2 Contents of Dissertation
    2 Background
        2.1 Representation Learning
        2.2 Deep Neural Networks
            2.2.1 Multi-layer Perceptrons
            2.2.2 Convolutional Neural Networks
            2.2.3 Recurrent Neural Networks
            2.2.4 Transformers
        2.3 Training of Deep Neural Networks
        2.4 Representation Learning in Bioinformatics
        2.5 Biological Sequence Data Analyses
        2.6 Evaluation Metrics
    3 CRISPR-Cpf1 Activity Prediction
        3.1 Methods
            3.1.1 Model Architecture
            3.1.2 Training of Seq-deepCpf1 and DeepCpf1
        3.2 Experiment Results
            3.2.1 Datasets
            3.2.2 Baselines
            3.2.3 Evaluation of Seq-deepCpf1
            3.2.4 Evaluation of DeepCpf1
        3.3 Summary
    4 Functional microRNA Target Prediction
        4.1 Methods
            4.1.1 Candidate Target Site Selection
            4.1.2 Input Encoding
            4.1.3 Residual Network
            4.1.4 Post-processing
        4.2 Experiment Results
            4.2.1 Datasets
            4.2.2 Classification of Functional and Non-functional Targets
            4.2.3 Distinguishing High-functional Targets
            4.2.4 Ablation Studies
        4.3 Summary
    5 Self-supervised Learning of Protein Representations
        5.1 Methods
            5.1.1 Pre-training Procedure
            5.1.2 Fine-tuning Procedure
            5.1.3 Model Architecture
        5.2 Experiment Results
            5.2.1 Experiment Setup
            5.2.2 Pre-training Results
            5.2.3 Fine-tuning Results
            5.2.4 Comparison with Larger Protein Language Models
            5.2.5 Ablation Studies
            5.2.6 Qualitative Interpretation Analyses
        5.3 Summary
    6 Discussion
        6.1 Challenges and Opportunities
    7 Conclusion
    Bibliography
    Abstract in Korean
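    The third contribution above describes self-supervised pre-training on unlabeled protein sequences. As a rough, generic sketch of that idea only (not the dissertation's actual model: the PyTorch framework, amino-acid vocabulary, encoder size, and 15% masking rate are all assumptions made for illustration), a masked-residue prediction objective could look like this:

    # Minimal sketch of masked-token pre-training on protein sequences (PyTorch).
    # Vocabulary, model size, and masking rate are illustrative assumptions.
    import random
    import torch
    import torch.nn as nn

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
    PAD, MASK = 0, 1                                  # special token ids
    VOCAB = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}
    VOCAB_SIZE = len(VOCAB) + 2

    class ProteinEncoder(nn.Module):
        def __init__(self, d_model=64, nhead=4, num_layers=2):
            super().__init__()
            self.embed = nn.Embedding(VOCAB_SIZE, d_model, padding_idx=PAD)
            layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=128, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers)
            self.head = nn.Linear(d_model, VOCAB_SIZE)   # predicts the masked residue

        def forward(self, tokens):
            return self.head(self.encoder(self.embed(tokens)))

    def mask_tokens(tokens, rate=0.15):
        """Mask a fraction of residues; unmasked targets are -100 so the loss ignores them."""
        masked = tokens.clone()
        targets = torch.full_like(tokens, -100)
        for i in range(tokens.size(0)):
            length = int((tokens[i] != PAD).sum())
            for j in random.sample(range(length), max(1, int(rate * length))):
                targets[i, j] = tokens[i, j]
                masked[i, j] = MASK
        return masked, targets

    # One toy pre-training step on two hypothetical sequences.
    seqs = ["MKTAYIAKQR", "GAVLIPFMWS"]
    batch = torch.tensor([[VOCAB[aa] for aa in s] for s in seqs])
    model = ProteinEncoder()
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)
    masked, targets = mask_tokens(batch)
    logits = model(masked)
    loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1), ignore_index=-100)
    loss.backward()
    optim.step()
    print(f"masked-LM loss: {loss.item():.3f}")

    In the usual pattern that chapter 5's pre-training and fine-tuning procedures suggest, an encoder pre-trained this way would subsequently be fine-tuned on a much smaller labeled protein dataset.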

    Deep diversification of an AAV capsid protein by machine learning.

    Modern experimental technologies can assay large numbers of biological sequences, but engineered protein libraries rarely exceed the sequence diversity of natural protein families. Machine learning (ML) models trained directly on experimental data without biophysical modeling provide one route to accessing the full potential diversity of engineered proteins. Here we apply deep learning to design highly diverse adeno-associated virus 2 (AAV2) capsid protein variants that remain viable for packaging of a DNA payload. Focusing on a 28-amino acid segment, we generated 201,426 variants of the AAV2 wild-type (WT) sequence, yielding 110,689 viable engineered capsids, 57,348 of which surpass the average diversity of natural AAV serotype sequences, with 12-29 mutations across this region. Even when trained on limited data, deep neural network models accurately predict capsid viability across diverse variants. This approach unlocks vast areas of functional but previously unreachable sequence space, with many potential applications for the generation of improved viral vectors and protein therapeutics.
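    The abstract states that deep neural network models predict capsid viability from a 28-amino-acid segment even when trained on limited data. Below is a minimal sketch of that general kind of sequence-to-viability classifier, assuming a one-hot encoding, a small 1D CNN in PyTorch, and made-up variant segments and labels; it is not the paper's actual model or data:

    # Minimal sketch: predict viability of a 28-residue capsid segment (PyTorch).
    # The encoding, architecture, toy segments, and labels are illustrative assumptions.
    import torch
    import torch.nn as nn

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
    AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    SEG_LEN = 28

    def one_hot(segment):
        """Encode a 28-residue segment as a (20, 28) one-hot tensor."""
        x = torch.zeros(len(AMINO_ACIDS), SEG_LEN)
        for pos, aa in enumerate(segment):
            x[AA_INDEX[aa], pos] = 1.0
        return x

    class ViabilityNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(20, 32, kernel_size=5, padding=2),  # scan local sequence motifs
                nn.ReLU(),
                nn.AdaptiveMaxPool1d(1),                      # pool over positions
                nn.Flatten(),
                nn.Linear(32, 1),                             # viability logit
            )

        def forward(self, x):
            return self.net(x).squeeze(-1)

    # Toy batch: two synthetic 28-residue segments with made-up viability labels.
    variants = ["ACDEFGHIKLMNPQRSTVWYACDEFGHI", "A" * 28]
    labels = torch.tensor([1.0, 0.0])
    batch = torch.stack([one_hot(v) for v in variants])

    model = ViabilityNet()
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss = nn.functional.binary_cross_entropy_with_logits(model(batch), labels)
    loss.backward()
    optim.step()
    print(f"toy training loss: {loss.item():.3f}")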

    Cell Type Classification Via Deep Learning On Single-Cell Gene Expression Data

    Single-cell sequencing is a recently developed, revolutionary technology that enables researchers to obtain genomic, transcriptomic, or multi-omics information through gene expression analysis. Compared to traditional sequencing methods, it has the advantage of resolving highly heterogeneous cell-type information, which is making it increasingly popular in the biomedical field. Such analysis can also support early diagnosis and drug development for tumor and cancer cell types. In the gene expression profiling workflow, identification of cell types is an important task, but it faces many challenges, including the curse of dimensionality, sparsity, batch effects, and overfitting. These challenges can be mitigated by feature selection techniques that reduce the feature dimension while retaining the most relevant features. In this work, a recurrent neural network-based feature selection model is proposed to extract relevant features from high-dimensional, low-sample-size data. In addition, a deep learning-based gene embedding model is proposed to reduce the data sparsity of single-cell data for cell type identification. The proposed frameworks have been implemented with different recurrent neural network architectures and evaluated on real-world microarray datasets and single-cell RNA-seq data, where they perform better than other feature selection models. A semi-supervised model is also implemented using the same gene embedding workflow, since labeling data is cumbersome, time consuming, and requires manual effort and domain expertise; different ratios of labeled data are used in the experiments to validate the approach. Experimental results show that the proposed semi-supervised approach performs encouragingly even when only a limited amount of labeled data is available. In addition, a graph attention-based autoencoder model is studied to learn latent features by incorporating prior knowledge with gene expression data for cell type classification. Index Terms: Single-Cell Gene Expression Data, Gene Embedding, Semi-Supervised Model, Incorporating Prior Knowledge, Gene-Gene Interaction Network, Deep Learning, Graph Autoencoder.
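    The abstract proposes a gene embedding model that reduces the sparsity and dimensionality of single-cell expression data before cell-type classification. A minimal, generic sketch of that idea (assuming a plain dense autoencoder in PyTorch with arbitrary layer sizes and random toy data, rather than the proposed RNN-based or graph-attention models) might look like:

    # Minimal sketch: compress sparse single-cell expression profiles with an
    # autoencoder, then classify cell types from the embedding (PyTorch).
    # Gene count, embedding size, and the random data are illustrative assumptions.
    import torch
    import torch.nn as nn

    N_GENES, N_CELL_TYPES, EMBED_DIM = 2000, 5, 32

    class ExpressionAutoencoder(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(N_GENES, 256), nn.ReLU(),
                                         nn.Linear(256, EMBED_DIM))
            self.decoder = nn.Sequential(nn.Linear(EMBED_DIM, 256), nn.ReLU(),
                                         nn.Linear(256, N_GENES))

        def forward(self, x):
            z = self.encoder(x)
            return self.decoder(z), z

    # Toy data: 64 cells with mostly-zero (sparse) expression and random cell-type labels.
    x = torch.rand(64, N_GENES) * (torch.rand(64, N_GENES) < 0.1).float()  # ~90% zeros
    y = torch.randint(0, N_CELL_TYPES, (64,))

    autoencoder = ExpressionAutoencoder()
    classifier = nn.Linear(EMBED_DIM, N_CELL_TYPES)
    optim = torch.optim.Adam(list(autoencoder.parameters()) + list(classifier.parameters()), lr=1e-3)

    recon, z = autoencoder(x)
    loss = nn.functional.mse_loss(recon, x) + nn.functional.cross_entropy(classifier(z), y)
    loss.backward()
    optim.step()
    print(f"reconstruction + classification loss: {loss.item():.3f}")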

    Evaluation of machine learning approaches for prediction of protein coding genes in prokaryotic DNA sequences

    According to the National Human Genome Research Institute, the amount of genomic data generated each year is constantly increasing. This rapid growth has led to a surge in demand for efficient analysis and handling of such data. Gene prediction involves identifying the regions of a DNA sequence that code for proteins, called protein-coding genes. This task falls within the scope of bioinformatics, and there has been surprisingly little development in this field over the past years. Although capable state-of-the-art gene prediction tools exist, there is still room for improvement in terms of efficiency and accuracy. Advances in gene prediction can, among other things, aid the medical and pharmaceutical industries, as well as impact environmental and anthropological research. Machine learning techniques such as Random Forest classifiers and Artificial Neural Networks (ANNs) have proved successful at the task of gene prediction. In this thesis, one deep learning model and two other machine learning models were tested. The first model implemented was the established Random Forest classifier. For ensemble methods such as the Random Forest classifier, feature engineering is critical to success. The exploration of different feature selection and extraction techniques underscored its relevance; it also showed that feature importance varies greatly among genomes and revealed possibilities that can be explored in future work. The second model tested was the ensemble method Extreme Gradient Boosting (XGBoost), which served as a strong competitor to the Random Forest classifier. Finally, a Recurrent Neural Network (RNN) was implemented; RNNs are known to handle sequential data well, which makes them a good candidate for gene prediction. The 15 prokaryotic genomes used to train the models were extracted from the NCBI genome database. Each model was tasked with classifying sub-sequences of the genomes, called open reading frames (ORFs), as either protein-coding ORFs or non-coding ORFs. One challenge when preparing these datasets was that the number of protein-coding ORFs was very small compared to the number of non-coding ORFs. Another was that protein-coding ORFs are in general longer than non-coding ORFs, which can bias the models to simply classify long ORFs as protein-coding and short ORFs as non-coding. For these reasons, two datasets were created for each genome, taking each imbalance into account. The models were trained, tuned and tested on both datasets for all genomes, as well as on a combination of genomes, and were evaluated with regard to accuracy, precision and recall. The results show that all three methods have potential and attained broadly similar performance scores. Although both time and data were limited during model development, the models still yielded promising results, and since several parameters have not yet been tuned in all models, many possibilities for further research remain. The fact that a relatively simple RNN architecture performed so well, and requires no feature engineering, shows great promise for further applications in gene prediction, and possibly other fields in bioinformatics.
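    The thesis classifies ORFs as protein-coding or non-coding with, among other models, a Random Forest built on engineered features. A minimal sketch of that workflow, assuming scikit-learn, a made-up feature set (length, GC content, a crude codon-usage-bias score), and toy ORFs rather than the thesis's actual features or NCBI data:

    # Minimal sketch: classify ORFs as protein-coding or not with a Random Forest
    # (scikit-learn). Features and toy sequences are illustrative assumptions.
    from collections import Counter
    from sklearn.ensemble import RandomForestClassifier

    def orf_features(orf):
        """Simple hand-crafted features: length, GC content, and a crude codon-usage bias."""
        length = len(orf)
        gc = (orf.count("G") + orf.count("C")) / length
        codons = [orf[i:i + 3] for i in range(0, length - 2, 3)]
        bias = max(Counter(codons).values()) / len(codons)  # fraction of the most common codon
        return [length, gc, bias]

    # Toy training set: hypothetical ORFs labeled 1 (coding) or 0 (non-coding).
    orfs = [
        ("ATGGCTGCTAAAGGTGCTGCTGGTTAA", 1),
        ("ATGAAACGCATTAGCACCACCATTACC", 1),
        ("ATGTTTTTTTTTTTTTTTTTTTTTTAA", 0),
        ("ATGCATATATATATATATATATATTGA", 0),
    ]
    X = [orf_features(seq) for seq, _ in orfs]
    y = [label for _, label in orfs]

    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X, y)
    print(model.predict([orf_features("ATGGGTGCTGCTAAAGCTGGTGCTTAA")]))

    A real pipeline along these lines would extract candidate ORFs from whole genomes, compute a much richer feature set, and handle the class and length imbalances the abstract describes (for example by resampling or length-matched negative sets).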