
    Representation Learning for Biological Sequence Data

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2021. Advisor: Sungroh Yoon.

We are living in the era of big data, and the biomedical domain is no exception. With the advent of technologies such as next-generation sequencing, developing methods to capitalize on the explosion of biomedical data is one of the major challenges in bioinformatics. Representation learning, in particular deep learning, has made significant advances in diverse fields where the artificial intelligence community had struggled for many years. However, although representation learning has also shown great promise in bioinformatics, it is not a silver bullet. Off-the-shelf applications of representation learning cannot always provide successful results for biological sequence data; many challenges and opportunities remain to be explored. This dissertation presents a set of representation learning methods to address three issues in biological sequence data analysis. First, we propose a two-stage training strategy to address throughput and information trade-offs within wet-lab CRISPR-Cpf1 activity experiments. Second, we propose an encoding scheme to model the interaction between two sequences for functional microRNA target prediction. Third, we propose a self-supervised pre-training method to bridge the exponentially growing gap between the numbers of unlabeled and labeled protein sequences. In summary, this dissertation proposes a set of representation learning methods that can derive invaluable information from biological sequence data.
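The third contribution hinges on deriving training signal from unlabeled protein sequences themselves. As a minimal illustration of that idea (a generic BERT-style masked-residue objective, not the dissertation's exact procedure; the token names and mask rate here are assumptions), each pre-training example can be built by hiding random residues and asking the model to recover them:

```python
import random

# 20 canonical amino acids plus a mask token -- a generic sketch of
# masked-language-model pre-training, not the dissertation's exact recipe.
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
MASK = "<mask>"


def mask_sequence(seq, mask_rate=0.15, rng=None):
    """Randomly mask residues; return (corrupted tokens, targets).

    `targets` maps each masked position to its original residue. A model
    pre-trained this way learns to predict the hidden residue from its
    context -- the label comes from the sequence itself, so no
    experimental annotation is required.
    """
    rng = rng or random.Random(0)
    tokens = list(seq)
    targets = {}  # position -> original residue
    for i in range(len(tokens)):
        if rng.random() < mask_rate:
            targets[i] = tokens[i]
            tokens[i] = MASK
    return tokens, targets


seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
tokens, targets = mask_sequence(seq)
```

After pre-training on a large unlabeled corpus with this objective, the learned representations can be fine-tuned on the much smaller labeled datasets, which is how such methods bridge the unlabeled/labeled gap the abstract describes.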
ν‘œν˜„ ν•™μŠ΅μ€ 생물정보학 λΆ„μ•Όμ—μ„œλ„ λ§Žμ€ κ°€λŠ₯성을 λ³΄μ—¬μ£Όμ—ˆλ‹€. ν•˜μ§€λ§Œ λ‹¨μˆœν•œ μ μš©μœΌλ‘œλŠ” 생물학적 μ„œμ—΄ 데이터 λΆ„μ„μ˜ 성곡적인 κ²°κ³Όλ₯Ό 항상 얻을 μˆ˜λŠ” μ•ŠμœΌλ©°, μ—¬μ „νžˆ 연ꡬ가 ν•„μš”ν•œ λ§Žμ€ λ¬Έμ œλ“€μ΄ λ‚¨μ•„μžˆλ‹€. λ³Έ ν•™μœ„λ…Όλ¬Έμ€ 생물학적 μ„œμ—΄ 데이터 뢄석과 κ΄€λ ¨λœ μ„Έ 가지 μ‚¬μ•ˆμ„ ν•΄κ²°ν•˜κΈ° μœ„ν•΄, ν‘œν˜„ ν•™μŠ΅μ— κΈ°λ°˜ν•œ 일련의 방법둠듀을 μ œμ•ˆν•œλ‹€. 첫 번째둜, μœ μ „μžκ°€μœ„ μ‹€ν—˜ 데이터에 λ‚΄μž¬λœ 정보와 수율의 κ· ν˜•μ— λŒ€μ²˜ν•  수 μžˆλŠ” 2단계 ν•™μŠ΅ 기법을 μ œμ•ˆν•œλ‹€. 두 번째둜, 두 μ—ΌκΈ° μ„œμ—΄ κ°„μ˜ μƒν˜Έ μž‘μš©μ„ ν•™μŠ΅ν•˜κΈ° μœ„ν•œ λΆ€ν˜Έν™” 방식을 μ œμ•ˆν•œλ‹€. μ„Έ 번째둜, κΈ°ν•˜κΈ‰μˆ˜μ μœΌλ‘œ μ¦κ°€ν•˜λŠ” νŠΉμ§•λ˜μ§€ μ•Šμ€ λ‹¨λ°±μ§ˆ μ„œμ—΄μ„ ν™œμš©ν•˜κΈ° μœ„ν•œ 자기 지도 사전 ν•™μŠ΅ 기법을 μ œμ•ˆν•œλ‹€. μš”μ•½ν•˜μžλ©΄, λ³Έ ν•™μœ„λ…Όλ¬Έμ€ 생물학적 μ„œμ—΄ 데이터λ₯Ό λΆ„μ„ν•˜μ—¬ μ€‘μš”ν•œ 정보λ₯Ό λ„μΆœν•  수 μžˆλŠ” ν‘œν˜„ ν•™μŠ΅μ— κΈ°λ°˜ν•œ 일련의 방법둠듀을 μ œμ•ˆν•œλ‹€.1 Introduction 1 1.1 Motivation 1 1.2 Contents of Dissertation 4 2 Background 8 2.1 Representation Learning 8 2.2 Deep Neural Networks 12 2.2.1 Multi-layer Perceptrons 12 2.2.2 Convolutional Neural Networks 14 2.2.3 Recurrent Neural Networks 16 2.2.4 Transformers 19 2.3 Training of Deep Neural Networks 23 2.4 Representation Learning in Bioinformatics 26 2.5 Biological Sequence Data Analyses 29 2.6 Evaluation Metrics 32 3 CRISPR-Cpf1 Activity Prediction 36 3.1 Methods 39 3.1.1 Model Architecture 39 3.1.2 Training of Seq-deepCpf1 and DeepCpf1 41 3.2 Experiment Results 44 3.2.1 Datasets 44 3.2.2 Baselines 47 3.2.3 Evaluation of Seq-deepCpf1 49 3.2.4 Evaluation of DeepCpf1 51 3.3 Summary 55 4 Functional microRNA Target Prediction 56 4.1 Methods 62 4.1.1 Candidate Target Site Selection 63 4.1.2 Input Encoding 64 4.1.3 Residual Network 67 4.1.4 Post-processing 68 4.2 Experiment Results 70 4.2.1 Datasets 70 4.2.2 Classification of Functional and Non-functional Targets 71 4.2.3 Distinguishing High-functional Targets 73 4.2.4 Ablation Studies 76 4.3 Summary 77 5 Self-supervised Learning of Protein Representations 78 5.1 
Methods 83 5.1.1 Pre-training Procedure 83 5.1.2 Fine-tuning Procedure 86 5.1.3 Model Architecturen 87 5.2 Experiment Results 90 5.2.1 Experiment Setup 90 5.2.2 Pre-training Results 92 5.2.3 Fine-tuning Results 93 5.2.4 Comparison with Larger Protein Language Models 97 5.2.5 Ablation Studies 100 5.2.6 Qualitative Interpreatation Analyses 103 5.3 Summary 106 6 Discussion 107 6.1 Challenges and Opportunities 107 7 Conclusion 111 Bibliography 113 Abstract in Korean 130λ°•