6,391 research outputs found
Machine learning-guided directed evolution for protein engineering
Machine learning (ML)-guided directed evolution is a new paradigm for
biological design that enables optimization of complex functions. ML methods
use data to predict how sequence maps to function without requiring a detailed
model of the underlying physics or biological pathways. To demonstrate
ML-guided directed evolution, we introduce the steps required to build ML
sequence-function models and use them to guide engineering, making
recommendations at each stage. This review covers basic concepts relevant to
using ML for protein engineering as well as the current literature and
applications of this new engineering paradigm. ML methods accelerate directed
evolution by learning from information contained in all measured variants and
using that information to select sequences that are likely to be improved. We
then provide two case studies that demonstrate the ML-guided directed evolution
process. We also look to future opportunities where ML will enable discovery of
new protein functions and uncover the relationship between protein sequence and
function.
Comment: Made significant revisions to focus on aspects most relevant to
applying machine learning to speed up directed evolution
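As a rough illustration of the loop this abstract describes (learn from all measured variants, then select sequences likely to be improved), here is a minimal Python sketch. It is not the authors' method: the additive site-wise model is a deliberately crude stand-in for the regression or neural models a real study would use, and the function names, the three-letter alphabet, and the fitness values are all hypothetical.

```python
from collections import defaultdict
from itertools import product


def fit_additive_model(measured):
    """Estimate a mean fitness for every (position, residue) pair seen in the data."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for seq, fitness in measured.items():
        for pos, aa in enumerate(seq):
            totals[(pos, aa)] += fitness
            counts[(pos, aa)] += 1
    return {key: totals[key] / counts[key] for key in totals}


def predict(model, seq, default=0.0):
    """Score a sequence as the average of its per-site estimates.

    Residues never observed at a position get `default`, which here
    penalises unexplored substitutions (an assumption of this sketch).
    """
    return sum(model.get((pos, aa), default) for pos, aa in enumerate(seq)) / len(seq)


def propose(model, measured, alphabet, length, top_k=3):
    """Rank unmeasured sequences by predicted fitness and return the best few."""
    candidates = ("".join(p) for p in product(alphabet, repeat=length))
    unseen = [s for s in candidates if s not in measured]
    return sorted(unseen, key=lambda s: predict(model, s), reverse=True)[:top_k]


# Hypothetical measurements: 3-residue variants with made-up fitness values.
measured = {"AAA": 1.0, "AAV": 1.4, "LAA": 1.6, "ALA": 0.7}
model = fit_additive_model(measured)
next_round = propose(model, measured, alphabet="ALV", length=3)
print(next_round)
```

In this toy data set the top proposal combines the two substitutions that each improved fitness on their own. In practice the proposed sequences would be measured and added to `measured` before refitting for the next round.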
Analysis of Three-Dimensional Protein Images
A fundamental goal of research in molecular biology is to understand protein
structure. Protein crystallography is currently the most successful method for
determining the three-dimensional (3D) conformation of a protein, yet it
remains labor intensive and relies on an expert's ability to derive and
evaluate a protein scene model. In this paper, the problem of protein structure
determination is formulated as an exercise in scene analysis. A computational
methodology is presented in which a 3D image of a protein is segmented into a
graph of critical points. Bayesian and certainty factor approaches are
described and used to analyze critical point graphs and identify meaningful
substructures, such as alpha-helices and beta-sheets. Results of applying the
methodologies to protein images at low and medium resolution are reported. The
research is related to approaches to representation, segmentation and
classification in vision, as well as to top-down approaches to protein
structure prediction.
Comment: See http://www.jair.org/ for any accompanying files
Representation Learning for Biological Sequence Data
Doctoral dissertation (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2021. Sungroh Yoon.
We are living in the era of big data, and the biomedical domain is no exception. With the advent of technologies such as next-generation sequencing, developing methods to capitalize on the explosion of biomedical data is one of the major challenges in bioinformatics. Representation learning, in particular deep learning, has made significant advances in diverse fields where the artificial intelligence community had struggled for many years. However, although representation learning has also shown great promise in bioinformatics, it is not a silver bullet. Off-the-shelf applications of representation learning do not always give successful results for biological sequence data, and many challenges and opportunities remain to be explored.
This dissertation presents a set of representation learning methods to address three issues in biological sequence data analysis. First, we propose a two-stage training strategy to address throughput and information trade-offs within wet-lab CRISPR-Cpf1 activity experiments. Second, we propose an encoding scheme to model the interaction between two sequences for functional microRNA target prediction. Third, we propose a self-supervised pre-training method to bridge the exponentially growing gap between the numbers of unlabeled and labeled protein sequences. In summary, this dissertation proposes a set of representation learning methods that can derive invaluable information from biological sequence data.
1 Introduction
1.1 Motivation
1.2 Contents of Dissertation
2 Background
2.1 Representation Learning
2.2 Deep Neural Networks
2.2.1 Multi-layer Perceptrons
2.2.2 Convolutional Neural Networks
2.2.3 Recurrent Neural Networks
2.2.4 Transformers
2.3 Training of Deep Neural Networks
2.4 Representation Learning in Bioinformatics
2.5 Biological Sequence Data Analyses
2.6 Evaluation Metrics
3 CRISPR-Cpf1 Activity Prediction
3.1 Methods
3.1.1 Model Architecture
3.1.2 Training of Seq-deepCpf1 and DeepCpf1
3.2 Experiment Results
3.2.1 Datasets
3.2.2 Baselines
3.2.3 Evaluation of Seq-deepCpf1
3.2.4 Evaluation of DeepCpf1
3.3 Summary
4 Functional microRNA Target Prediction
4.1 Methods
4.1.1 Candidate Target Site Selection
4.1.2 Input Encoding
4.1.3 Residual Network
4.1.4 Post-processing
4.2 Experiment Results
4.2.1 Datasets
4.2.2 Classification of Functional and Non-functional Targets
4.2.3 Distinguishing High-functional Targets
4.2.4 Ablation Studies
4.3 Summary
5 Self-supervised Learning of Protein Representations
5.1 Methods
5.1.1 Pre-training Procedure
5.1.2 Fine-tuning Procedure
5.1.3 Model Architecture
5.2 Experiment Results
5.2.1 Experiment Setup
5.2.2 Pre-training Results
5.2.3 Fine-tuning Results
5.2.4 Comparison with Larger Protein Language Models
5.2.5 Ablation Studies
5.2.6 Qualitative Interpretation Analyses
5.3 Summary
6 Discussion
6.1 Challenges and Opportunities
7 Conclusion
Bibliography
Abstract in Korean
Structural Prediction of Protein–Protein Interactions by Docking: Application to Biomedical Problems
A huge amount of genetic information is available thanks to the recent advances in sequencing technologies and the larger computational capabilities, but the interpretation of such genetic data at phenotypic level remains elusive. One of the reasons is that proteins are not acting alone, but are specifically interacting with other proteins and biomolecules, forming intricate interaction networks that are essential for the majority of cell processes and pathological conditions. Thus, characterizing such interaction networks is an important step in understanding how information flows from gene to phenotype. Indeed, structural characterization of protein–protein interactions at atomic resolution has many applications in biomedicine, from diagnosis and vaccine design, to drug discovery. However, despite the advances of experimental structural determination, the number of interactions for which there is available structural data is still very small. In this context, a complementary approach is computational modeling of protein interactions by docking, which is usually composed of two major phases: (i) sampling of the possible binding modes between the interacting molecules and (ii) scoring for the identification of the correct orientations. In addition, prediction of interface and hot-spot residues is very useful in order to guide and interpret mutagenesis experiments, as well as to understand functional and mechanistic aspects of the interaction. Computational docking is already being applied to specific biomedical problems within the context of personalized medicine, for instance, helping to interpret pathological mutations involved in protein–protein interactions, or providing modeled structural data for drug discovery targeting protein–protein interactions.
Spanish Ministry of Economy grant number BIO2016-79960-R; D.B.B. is supported by a predoctoral fellowship from CONACyT; M.R. is supported by an FPI fellowship from the Severo Ochoa program. We are grateful to the Joint BSC-CRG-IRB Programme in Computational Biology.
Peer Reviewed
Postprint (author's final draft)
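The two docking phases named in the abstract above, (i) sampling binding modes and (ii) scoring them, can be illustrated with a toy rigid-body search on a 2D grid. This is only a sketch under strong assumptions: real docking programs work in 3D with physics-based or statistical scoring, and the shapes, the contact-counting score, and the `clash_penalty` value here are invented for illustration.

```python
from itertools import product


def rotate(cells, quarter_turns):
    """Rotate 2D grid cells by multiples of 90 degrees about the origin."""
    for _ in range(quarter_turns % 4):
        cells = {(-y, x) for x, y in cells}
    return cells


def score(receptor, ligand, clash_penalty=10):
    """Count edge contacts between the two shapes, penalising overlapping cells."""
    clashes = len(receptor & ligand)
    contacts = sum(
        (x + dx, y + dy) in receptor
        for x, y in ligand
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
    )
    return contacts - clash_penalty * clashes


def dock(receptor, ligand, shifts=range(-4, 5)):
    """Exhaustively sample rotations and translations; keep the best-scoring pose."""
    best = None
    for turns, dx, dy in product(range(4), shifts, shifts):
        pose = {(x + dx, y + dy) for x, y in rotate(ligand, turns)}
        s = score(receptor, pose)
        if best is None or s > best[0]:
            best = (s, turns, dx, dy)
    return best


# Hypothetical shapes: a U-shaped receptor pocket and a two-cell ligand.
receptor = {(0, 0), (1, 0), (2, 0), (0, 1), (2, 1)}
ligand = {(0, 0), (0, 1)}
best_pose = dock(receptor, ligand)
print(best_pose)  # (score, quarter_turns, dx, dy)
```

The best pose places the ligand inside the pocket, where it touches three receptor cells without overlapping any of them.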
- …