6,391 research outputs found

    Machine learning-guided directed evolution for protein engineering

    Machine learning (ML)-guided directed evolution is a new paradigm for biological design that enables optimization of complex functions. ML methods use data to predict how sequence maps to function without requiring a detailed model of the underlying physics or biological pathways. To demonstrate ML-guided directed evolution, we introduce the steps required to build ML sequence-function models and use them to guide engineering, making recommendations at each stage. This review covers basic concepts relevant to using ML for protein engineering as well as the current literature and applications of this new engineering paradigm. ML methods accelerate directed evolution by learning from information contained in all measured variants and using that information to select sequences that are likely to be improved. We then provide two case studies that demonstrate the ML-guided directed evolution process. We also look to future opportunities where ML will enable discovery of new protein functions and uncover the relationship between protein sequence and function.
    Comment: Made significant revisions to focus on aspects most relevant to applying machine learning to speed up directed evolution
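
    The loop described in this abstract (learn from all measured variants, then select sequences likely to be improved) can be illustrated with a minimal sketch. It is not the review's method: the additive per-site model, the function names `fit_additive_model`, `predict`, and `propose`, and the toy sequences and fitness values are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def fit_additive_model(variants):
    """Fit a toy additive sequence-function model (an assumed, simplified model).

    variants: list of (sequence, measured_fitness) pairs of equal length.
    Returns (baseline, effects), where effects[(pos, residue)] is the mean
    deviation from the baseline of all variants carrying that residue.
    """
    baseline = sum(fit for _, fit in variants) / len(variants)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for seq, fit in variants:
        for pos, aa in enumerate(seq):
            sums[(pos, aa)] += fit - baseline
            counts[(pos, aa)] += 1
    effects = {key: sums[key] / counts[key] for key in sums}
    return baseline, effects


def predict(model, seq):
    """Predicted fitness: baseline plus the summed per-site effects."""
    baseline, effects = model
    return baseline + sum(effects.get((pos, aa), 0.0) for pos, aa in enumerate(seq))


def propose(model, measured, top_k=3):
    """Rank unmeasured recombinations of residues already seen in the data."""
    _, effects = model
    length = len(measured[0][0])
    options = [sorted({aa for (p, aa) in effects if p == pos}) for pos in range(length)]
    seen = {seq for seq, _ in measured}
    candidates = ("".join(combo) for combo in product(*options))
    unseen = [seq for seq in candidates if seq not in seen]
    return sorted(unseen, key=lambda s: predict(model, s), reverse=True)[:top_k]


# Hypothetical round of measurements (sequence, fitness).
measured = [("AKL", 1.0), ("AKV", 1.4), ("GKL", 0.6), ("ARL", 1.3)]
model = fit_additive_model(measured)
best = propose(model, measured, top_k=1)[0]
print(best, round(predict(model, best), 3))
```

    On this toy data the top proposal is `ARV`, which combines the two substitutions that were individually beneficial. In practice the proposed variants would be measured and added to the training set before the next round.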

    Analysis of Three-Dimensional Protein Images

    A fundamental goal of research in molecular biology is to understand protein structure. Protein crystallography is currently the most successful method for determining the three-dimensional (3D) conformation of a protein, yet it remains labor intensive and relies on an expert's ability to derive and evaluate a protein scene model. In this paper, the problem of protein structure determination is formulated as an exercise in scene analysis. A computational methodology is presented in which a 3D image of a protein is segmented into a graph of critical points. Bayesian and certainty factor approaches are described and used to analyze critical point graphs and identify meaningful substructures, such as alpha-helices and beta-sheets. Results of applying the methodologies to protein images at low and medium resolution are reported. The research is related to approaches to representation, segmentation and classification in vision, as well as to top-down approaches to protein structure prediction.
    Comment: See http://www.jair.org/ for any accompanying file

    ์ƒ๋ฌผํ•™์  ์„œ์—ด ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ํ‘œํ˜„ ํ•™์Šต

    Doctoral dissertation (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2021. Sungroh Yoon (윤성로).
    We live in the era of big data, and the biomedical domain is no exception. With the advent of technologies such as next-generation sequencing, developing methods to capitalize on the explosion of biomedical data is one of the major challenges in bioinformatics. Representation learning, in particular deep learning, has made significant advances in diverse fields where the artificial intelligence community had struggled for many years. However, although representation learning has also shown great promise in bioinformatics, it is not a silver bullet. Off-the-shelf applications of representation learning cannot always provide successful results for biological sequence data, and many challenges and opportunities remain to be explored. This dissertation presents a set of representation learning methods to address three issues in biological sequence data analysis. First, we propose a two-stage training strategy to address throughput and information trade-offs within wet-lab CRISPR-Cpf1 activity experiments. Second, we propose an encoding scheme to model interaction between two sequences for functional microRNA target prediction. Third, we propose a self-supervised pre-training method to bridge the exponentially growing gap between the numbers of unlabeled and labeled protein sequences. In summary, this dissertation proposes a set of representation learning methods that can derive invaluable information from biological sequence data.
    Contents:
    1 Introduction: 1.1 Motivation; 1.2 Contents of Dissertation
    2 Background: 2.1 Representation Learning; 2.2 Deep Neural Networks (2.2.1 Multi-layer Perceptrons, 2.2.2 Convolutional Neural Networks, 2.2.3 Recurrent Neural Networks, 2.2.4 Transformers); 2.3 Training of Deep Neural Networks; 2.4 Representation Learning in Bioinformatics; 2.5 Biological Sequence Data Analyses; 2.6 Evaluation Metrics
    3 CRISPR-Cpf1 Activity Prediction: 3.1 Methods (3.1.1 Model Architecture, 3.1.2 Training of Seq-deepCpf1 and DeepCpf1); 3.2 Experiment Results (3.2.1 Datasets, 3.2.2 Baselines, 3.2.3 Evaluation of Seq-deepCpf1, 3.2.4 Evaluation of DeepCpf1); 3.3 Summary
    4 Functional microRNA Target Prediction: 4.1 Methods (4.1.1 Candidate Target Site Selection, 4.1.2 Input Encoding, 4.1.3 Residual Network, 4.1.4 Post-processing); 4.2 Experiment Results (4.2.1 Datasets, 4.2.2 Classification of Functional and Non-functional Targets, 4.2.3 Distinguishing High-functional Targets, 4.2.4 Ablation Studies); 4.3 Summary
    5 Self-supervised Learning of Protein Representations: 5.1 Methods (5.1.1 Pre-training Procedure, 5.1.2 Fine-tuning Procedure, 5.1.3 Model Architecture); 5.2 Experiment Results (5.2.1 Experiment Setup, 5.2.2 Pre-training Results, 5.2.3 Fine-tuning Results, 5.2.4 Comparison with Larger Protein Language Models, 5.2.5 Ablation Studies, 5.2.6 Qualitative Interpretation Analyses); 5.3 Summary
    6 Discussion: 6.1 Challenges and Opportunities
    7 Conclusion
    Bibliography
    Abstract in Korean

    Structural Prediction of Protein–Protein Interactions by Docking: Application to Biomedical Problems

    A huge amount of genetic information is available thanks to recent advances in sequencing technologies and greater computational capabilities, but the interpretation of such genetic data at the phenotypic level remains elusive. One of the reasons is that proteins do not act alone, but interact specifically with other proteins and biomolecules, forming intricate interaction networks that are essential for the majority of cell processes and pathological conditions. Thus, characterizing such interaction networks is an important step in understanding how information flows from gene to phenotype. Indeed, structural characterization of protein–protein interactions at atomic resolution has many applications in biomedicine, from diagnosis and vaccine design to drug discovery. However, despite the advances in experimental structure determination, the number of interactions for which structural data are available is still very small. In this context, a complementary approach is computational modeling of protein interactions by docking, which is usually composed of two major phases: (i) sampling of the possible binding modes between the interacting molecules and (ii) scoring for the identification of the correct orientations. In addition, prediction of interface and hot-spot residues is very useful for guiding and interpreting mutagenesis experiments, as well as for understanding functional and mechanistic aspects of the interaction. Computational docking is already being applied to specific biomedical problems within the context of personalized medicine, for instance, helping to interpret pathological mutations involved in protein–protein interactions, or providing modeled structural data for drug discovery targeting protein–protein interactions.
    Spanish Ministry of Economy grant number BIO2016-79960-R; D.B.B. is supported by a predoctoral fellowship from CONACyT; M.R. is supported by an FPI fellowship from the Severo Ochoa program. We are grateful to the Joint BSC-CRG-IRB Programme in Computational Biology.
    Peer Reviewed. Postprint (author's final draft)
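
    The two docking phases named in this abstract, (i) sampling binding modes and (ii) scoring them, can be illustrated with a toy sketch. Everything here is an assumption for illustration: molecules are sets of 2D grid cells, sampling covers translations only (real docking also samples rotations), and the scoring function (one point per contact, a penalty of 10 per clash) is invented rather than taken from any docking program.

```python
from itertools import product


def translate(cells, dx, dy):
    """Shift a set of grid cells by (dx, dy)."""
    return {(x + dx, y + dy) for x, y in cells}


def score(receptor, ligand):
    """Toy score: +1 per ligand cell side adjacent to the receptor, -10 per overlapping cell."""
    clashes = len(receptor & ligand)
    contacts = sum(
        1
        for (x, y) in ligand
        for neighbor in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1))
        if neighbor in receptor
    )
    return contacts - 10 * clashes


def dock(receptor, ligand, search=range(-4, 5)):
    """Return (best_score, (dx, dy)) over an exhaustive translational search."""
    # Phase (i): sample candidate binding modes (translations only in this sketch).
    poses = list(product(search, repeat=2))
    # Phase (ii): score every pose and keep the highest-scoring one.
    scored = [(score(receptor, translate(ligand, dx, dy)), (dx, dy)) for dx, dy in poses]
    return max(scored)


# Hypothetical U-shaped receptor with a pocket at (1, 1), and a one-cell ligand.
receptor = {(0, 0), (1, 0), (2, 0), (0, 1), (2, 1)}
ligand = {(0, 0)}
best_score, best_pose = dock(receptor, ligand)
print(best_score, best_pose)
```

    With this toy input the ligand is placed in the pocket at (1, 1), where it touches the receptor on three sides without overlapping it.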