Representation Learning for Biological Sequence Data
Thesis (Ph.D.) -- Graduate School of Seoul National University: College of Engineering, Department of Electrical and Computer Engineering, August 2021. Advisor: Sungroh Yoon.

As we are living in the era of big data, the biomedical domain is no exception. With the advent of technologies such as next-generation sequencing, developing methods that capitalize on the explosion of biomedical data is one of the major challenges in bioinformatics. Representation learning, in particular deep learning, has made significant advances in diverse fields where the artificial intelligence community had struggled for many years. Although representation learning has also shown great promise in bioinformatics, it is not a silver bullet: off-the-shelf applications of representation learning cannot always provide successful results for biological sequence data. A wealth of challenges and opportunities remains to be explored.
This dissertation presents a set of representation learning methods to address three issues in biological sequence data analysis. First, we propose a two-stage training strategy to address the throughput-information trade-off in wet-lab CRISPR-Cpf1 activity experiments. Second, we propose an encoding scheme that models the interaction between two sequences for functional microRNA target prediction. Third, we propose a self-supervised pre-training method to bridge the exponentially growing gap between the numbers of unlabeled and labeled protein sequences. In summary, this dissertation proposes a set of representation learning methods that can derive invaluable information from biological sequence data.

1 Introduction 1
1.1 Motivation 1
1.2 Contents of Dissertation 4
2 Background 8
2.1 Representation Learning 8
2.2 Deep Neural Networks 12
2.2.1 Multi-layer Perceptrons 12
2.2.2 Convolutional Neural Networks 14
2.2.3 Recurrent Neural Networks 16
2.2.4 Transformers 19
2.3 Training of Deep Neural Networks 23
2.4 Representation Learning in Bioinformatics 26
2.5 Biological Sequence Data Analyses 29
2.6 Evaluation Metrics 32
3 CRISPR-Cpf1 Activity Prediction 36
3.1 Methods 39
3.1.1 Model Architecture 39
3.1.2 Training of Seq-deepCpf1 and DeepCpf1 41
3.2 Experiment Results 44
3.2.1 Datasets 44
3.2.2 Baselines 47
3.2.3 Evaluation of Seq-deepCpf1 49
3.2.4 Evaluation of DeepCpf1 51
3.3 Summary 55
4 Functional microRNA Target Prediction 56
4.1 Methods 62
4.1.1 Candidate Target Site Selection 63
4.1.2 Input Encoding 64
4.1.3 Residual Network 67
4.1.4 Post-processing 68
4.2 Experiment Results 70
4.2.1 Datasets 70
4.2.2 Classification of Functional and Non-functional Targets 71
4.2.3 Distinguishing High-functional Targets 73
4.2.4 Ablation Studies 76
4.3 Summary 77
5 Self-supervised Learning of Protein Representations 78
5.1 Methods 83
5.1.1 Pre-training Procedure 83
5.1.2 Fine-tuning Procedure 86
5.1.3 Model Architecture 87
5.2 Experiment Results 90
5.2.1 Experiment Setup 90
5.2.2 Pre-training Results 92
5.2.3 Fine-tuning Results 93
5.2.4 Comparison with Larger Protein Language Models 97
5.2.5 Ablation Studies 100
5.2.6 Qualitative Interpretation Analyses 103
5.3 Summary 106
6 Discussion 107
6.1 Challenges and Opportunities 107
7 Conclusion 111
Bibliography 113
Abstract in Korean 130
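The self-supervised pre-training summarized in the abstract (and developed in Chapter 5) rests on a masked-token objective: hide some residues of an unlabeled protein sequence and train a model to recover them. The following is a minimal sketch of that objective only, not the dissertation's actual model; the `#` mask symbol, the `mask_sequence` helper, and the example sequence are illustrative assumptions.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
MASK = "#"  # hypothetical mask symbol, stands in for a model's [MASK] token

def mask_sequence(seq, mask_rate=0.15, rng=None):
    """Masked-token self-supervision: hide a fraction of residues and
    keep their identities as the prediction targets for a model."""
    rng = rng or random.Random(0)
    masked, targets = [], {}
    for i, aa in enumerate(seq):
        if rng.random() < mask_rate:
            masked.append(MASK)
            targets[i] = aa  # the model must recover this residue
        else:
            masked.append(aa)
    return "".join(masked), targets

masked, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
# every masked position keeps its original residue as a training label
assert all(masked[i] == MASK and aa in AMINO_ACIDS
           for i, aa in targets.items())
```

Because the labels come from the sequence itself, this objective needs no experimental annotations, which is what lets pre-training scale to the far larger pool of unlabeled protein sequences before fine-tuning on the scarce labeled ones.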