688 research outputs found

    A new approach for determining SARS-CoV-2 epitopes using machine learning-based in silico methods

    The emergence of machine learning-based in silico tools has enabled rapid and high-quality predictions in the biomedical field. During the COVID-19 pandemic, machine learning methods have been applied to many tasks, such as predicting patient mortality, modeling the spread of infection, determining future effects, diagnosis through medical image analysis, and forecasting the vaccination rate. However, there is a gap in the literature on using machine learning methods and bioinformatics tools to identify epitopes for fast, useful, and effective vaccine design. Machine learning methods can give medical biotechnologists an advantage in designing a faster and more successful vaccine. The motivation of this study is to propose a successful hybrid machine learning method for SARS-CoV-2 epitope prediction and to use bioinformatics tools to identify, among the predicted epitopes, nonallergenic, nontoxic, antigenic peptides that can be used in vaccine design. The identified epitopes are expected to be effective not only in the design of COVID-19 vaccines but also against viruses from the SARS family that may be encountered in the future. For this purpose, the epitope prediction performance of random forest, support vector machine, logistic regression, bagging with decision tree, k-nearest neighbor, and decision tree methods was examined. Because epitope samples were in the minority compared to non-epitope samples in the SARS-CoV and B-cell datasets used for training, epitope prediction was repeated after the datasets were balanced with the synthetic minority oversampling technique (SMOTE). The experimental results were compared, and the most successful predictions were obtained with the random forest (RF) method. 
The epitope prediction performance on the balanced datasets was higher than on the original datasets (94.0% AUC and 94.4% PRC for the SMOTE-SARS-CoV dataset; 95.6% AUC and 95.3% PRC for the SMOTE-B-cell dataset). In this study, 252 of 20,312 peptides were determined to be epitopes with the SMOTE-RF-SVM hybrid method proposed for SARS-CoV-2 epitope prediction. The determined epitopes were analyzed with the AllerTOP 2.0, VaxiJen 2.0, and ToxinPred tools, and allergenic, nonantigenic, and toxic epitopes were eliminated. As a result, 11 nonallergenic, highly antigenic, and nontoxic epitope candidates were proposed for use in protein-based COVID-19 vaccine design (“VGGNYNY”, “VNFNFNGLTG”, “RQIAPGQTGKI”, “QIAPGQTGKIA”, “SYECDIPIGAGI”, “STFKCYGVSPTKL”, “GVVFLHVTYVPAQ”, “KNHTSPDVDLGDI”, “NHTSPDVDLGDIS”, “AGAAAYYVGYLQPR”, “KKSTNLVKNKCVNF”). The small set of epitopes determined by machine learning-based in silico methods is expected to help biotechnologists design vaccines quickly and accurately by reducing the number of laboratory trials. © 2022 Elsevier Ltd. This study was supported by the Turkish Scientific and Technical Research Council, Turkey (TÜBİTAK; Project Number: 121E326).
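The balancing step described in this abstract can be illustrated with a minimal sketch of the core SMOTE idea: each synthetic minority sample is interpolated between a minority point and one of its nearest minority neighbours. This is not the authors' code. The function name `smote`, the toy feature vectors, and the choice of `k` are assumptions made only for illustration, and a real study would more likely rely on an established library implementation.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points interpolated between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical toy data: 3 "epitope" points versus 8 "non-epitope" points.
epitopes = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]
non_epitopes = [(float(i), 5.0) for i in range(8)]

# Oversample the minority class until both classes have the same size.
new_points = smote(epitopes, n_new=len(non_epitopes) - len(epitopes), k=2)
balanced_minority = epitopes + new_points
```

Because every synthetic point lies on a segment between two real minority points, the oversampled class stays inside the region already occupied by the minority data. A classifier such as a random forest would then be trained on `balanced_minority` together with `non_epitopes`.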

    Prediction of conformational B-cell epitopes from 3D structures by random forests with a distance-based feature

    Background: Antigen-antibody interactions are key events in the immune system and provide important clues to immune processes and responses. In antigen-antibody interactions, the specific sites on antigens that are directly bound by antibodies produced by B cells are known as B-cell epitopes. The identification of epitopes is an active topic in bioinformatics because of their potential use in epitope-based drug design. Although most B-cell epitopes are discontinuous (conformational), insufficient effort has been put into conformational epitope prediction, and the performance of existing methods is far from satisfactory. Results: To develop a high-accuracy model, we focus on several aspects that may affect prediction performance: the impact of interior residues, the differing contributions of adjacent residues, and imbalanced data containing many more non-epitope residues than epitope residues. We address these issues with three strategies. First, the concept of a 'thick surface patch', instead of a 'surface patch', is introduced to describe the local spatial context of each surface residue, taking the impact of interior residues into account. Comparison of the thick surface patch with the surface patch shows that interior residues contribute to the recognition of epitopes. Second, a statistically significant difference is observed between the distance distributions of non-epitope and epitope patches, so an adjacent residue distance feature is introduced to reflect the unequal contributions of adjacent residues to the location of binding sites. Third, a bootstrapping and voting procedure is adopted to deal with the imbalanced dataset. Based on these ideas, we propose a new method to identify B-cell conformational epitopes from 3D structures by combining conventional features with the proposed feature, using the random forest (RF) algorithm as the classification engine. The experiments show that our method can predict conformational B-cell epitopes with high accuracy. Evaluated by leave-one-out cross-validation (LOOCV), our method achieves a mean AUC of 0.633 on the benchmark bound dataset and a mean AUC of 0.654 on the benchmark unbound dataset. Compared with state-of-the-art prediction models in an independent test, our method demonstrates comparable or better performance. Conclusions: Our method is demonstrated to be effective for the prediction of conformational epitopes. Based on this study, we developed a tool to predict conformational epitopes from 3D structures, available at http://code.google.com/p/my-project-bpredictor/downloads/list.
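The abstract does not spell out its bootstrapping and voting procedure, so the following is only a sketch of one common way such a scheme handles class imbalance: each base model is trained on a bootstrap sample in which the majority class is drawn down to the size of the minority class, and the final label is a majority vote. A nearest-centroid rule stands in for the random forest to keep the example dependency-free. All function names, features, and data points are hypothetical.

```python
import random


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_balanced_ensemble(pos, neg, n_models=11, seed=0):
    """Train one nearest-centroid model per balanced bootstrap sample."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        pos_sample = [rng.choice(pos) for _ in pos]  # bootstrap the minority
        neg_sample = [rng.choice(neg) for _ in pos]  # same size from the majority
        models.append((centroid(pos_sample), centroid(neg_sample)))
    return models


def predict(models, x):
    """Majority vote: 1 (epitope) if most models place x nearer the positive centroid."""
    votes = sum(sq_dist(x, p) < sq_dist(x, n) for p, n in models)
    return 1 if votes > len(models) / 2 else 0


# Toy residues described by two hypothetical features.
epitope_residues = [(0.9, 0.8), (1.0, 1.1), (1.1, 0.9)]
other_residues = [(3.0 + 0.1 * i, 3.0 - 0.1 * i) for i in range(12)]

ensemble = train_balanced_ensemble(epitope_residues, other_residues)
```

Using an odd number of models avoids tied votes. Each model sees a different subset of the majority class, so the ensemble as a whole still uses most of the non-epitope data while no single model is dominated by it.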

    Recent advances in B-cell epitope prediction methods

    Identification of epitopes that invoke strong responses from B-cells is one of the key steps in designing effective vaccines against pathogens. Because experimental determination of epitopes is expensive in terms of cost, time, and effort, there is an urgent need for computational methods that reliably identify B-cell epitopes. Although several computational tools for predicting B-cell epitopes have become available in recent years, their predictive performance remains far from ideal. We review recent advances in computational methods for B-cell epitope prediction, identify some gaps in the current state of the art, and outline some promising directions for improving the reliability of such methods.

    Prediction of antigen-binding reactivity through supervised learning-based analysis of clonal enrichment patterns in bio-panning

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Medicine, Department of Biomedical Sciences, August 2021. 정준호. Background: Monoclonal antibodies (mAbs) are produced by B cells and bind specifically to target antigens. Technical advances in molecular and cellular cloning have made it possible to purify recombinant mAbs on a large scale, broadening their use in research and their potential for clinical application. As the importance of therapeutic mAbs has grown, they have become a predominant drug class for various diseases over the past decades. During that time, immense technological advances have made the discovery and development of mAb therapeutics more efficient. Owing to advances in high-throughput methodology in genomic sequencing, phenotype screening, and computational data analysis, it is now conceivable to generate panels of antibodies with annotated characteristics without experiments. Thesis objective: This thesis aims to develop next-generation antibody discovery methods using high-throughput antibody repertoire sequencing and bioinformatics analysis. I developed novel methods for constructing in vitro display antibody libraries and for machine learning-based antibody discovery. Chapter 3 describes a new method for generating an immunoglobulin (Ig) gene repertoire that minimizes the amplification bias arising from the large number of primers targeting diverse Ig germline genes. A universal primer-based amplification method was used to generate the Ig gene repertoire and was then validated by high-throughput antibody repertoire sequencing in terms of clonal diversity and immune repertoire reproducibility. This work is published in Journal of Immunological Methods (2021), doi:10.1016/j.jim.2021.113089. Chapter 4 describes a novel machine learning-based antibody discovery method. In the conventional colony screening approach, it is impossible to identify antigen-specific binders with low clonal abundance, and screening is hindered by non-specific phage particles that show antigen reactivity through the p8 coat protein. To overcome these limitations, I applied a supervised learning algorithm to high-throughput sequencing data annotated with binding properties and clonal frequencies from bio-panning. NGS analysis was performed to generate a large number of antibody sequences annotated with their clonal frequency at each selection round of bio-panning. Using the random forest (RF) algorithm, antigen-reactive binders were predicted and then validated in an in vitro screening experiment. This work is published in Experimental & Molecular Medicine (2017), doi:10.1038/emm.2017.22, and Biomolecules (2020), doi:10.3390/biom10030421. Conclusion: Combining conventional antibody discovery techniques with high-throughput antibody repertoire sequencing improved multiple aspects of the previous methodology. Multi-cycle amplification with Ig germline gene-specific primers produced a high level of repertoire distortion, which could be reduced by a universal primer-based amplification method. The RF model generated a large number of antigen-reactive antibody sequences with various clonal enrichment patterns. This result offers new insight into the clonal enrichment process: the frequency of antigen-specific binders does not increase gradually but depends on multiple selection rounds, both early and late. The supervised learning-based method also yielded more diverse antigen-specific clonotypes than conventional antibody discovery methods. Contents: 1. Introduction (1.1. Antibody and immunoglobulin repertoire; 1.2. Antibody therapeutics; 1.3. Methodology: antibody discovery and engineering). 2. Thesis objective. 3. Establishment of minimally biased phage display library construction method for antibody discovery (3.1. Abstract; 3.2. Introduction; 3.3. Results; 3.4. Discussion; 3.5. Methods). 4. In silico identification of target specific antibodies by high-throughput antibody repertoire sequencing and machine learning (4.1. Abstract; 4.2. Introduction; 4.3. Results; 4.4. Discussion; 4.5. Methods). 5. Future perspectives. 6. References. 7. Abstract in Korean.

    "Going back to our roots": second generation biocomputing

    Researchers in the field of biocomputing have, for many years, successfully "harvested and exploited" the natural world for inspiration in developing systems that are robust, adaptable and capable of generating novel and even "creative" solutions to human-defined problems. However, in this position paper we argue that the time has now come for a reassessment of how we exploit biology to generate new computational systems. Previous solutions (the "first generation" of biocomputing techniques), whilst reasonably effective, are crude analogues of actual biological systems. We believe that a new, inherently inter-disciplinary approach is needed for the development of the emerging "second generation" of bio-inspired methods. This new modus operandi will require much closer interaction between the engineering and life sciences communities, as well as a bidirectional flow of concepts, applications and expertise. We support our argument by examining, in this new light, three existing areas of biocomputing (genetic programming, artificial immune systems and evolvable hardware), as well as an emerging area (natural genetic engineering) which may provide useful pointers as to the way forward. Comment: Submitted to the International Journal of Unconventional Computin

    Using advanced computational methods to model the binding of antibody complexes: a case study from the coagulation cascade

    Haemophilia A is a congenital bleeding disorder affecting one in 5,000 to 10,000 males. To prevent symptomatic disease, injections of recombinant factor VIII (FVIII) are administered to compensate for insufficient levels of this essential clotting factor. Patients suffering from a severe form of haemophilia A are at increased risk of forming neutralising antibodies — known as inhibitors — against therapeutic FVIII. A better understanding of the binding characteristics of inhibitors may aid the selection of optimal haemophilia A therapies, lead to the development of new therapeutics that are less antigenic, and support future initiatives in personalised and precision medicine. With this goal in mind, Classical Molecular Dynamics (CMD) in conjunction with Molecular Mechanics/Generalized Born Surface Area (MM/GBSA) free energy calculations, together with enhanced sampling techniques, have been used to investigate interactions and the dynamics of binding site residues of the human inhibitory antibody BO2C11 bound to the C2-domain of factor VIII. In parallel, recombinant bacterial expressions of the C2-domain were initiated with the aim to explore structural changes induced by mutations that abrogate binding as described previously in surface plasmon resonance experiments. Computational binding affinity predictions were generally shown to be in good agreement with experimental findings. Additionally, binding site dynamics were investigated in detail using customized visualization techniques and an interpretable machine learning approach. Nevertheless, CMD simulations were insufficient for gaining insights into structural changes induced by mutations that were determined experimentally to be non-binding, and for exploring the underlying differences between the bound and unbound structures of the FVIII-C2 domain. 
To this end, Accelerated Molecular Dynamics (AMD) and Umbrella Sampling (US) simulations proved to be appropriate additions for investigating the conformational changes and energetic differences associated with the binding of BO2C11.

    Enhancing navigation in biomedical databases by community voting and database-driven text classification

    Background: The breadth of biological databases and their information content continue to increase exponentially. Unfortunately, our ability to query such sources is still often suboptimal. Here, we introduce and apply community voting, database-driven text classification, and visual aids as a means to incorporate distributed expert knowledge, to automatically classify database entries, and to retrieve them efficiently. Results: Using a previously developed peptide database as an example, we compared several machine learning algorithms in their ability to classify abstracts of published literature into categories relevant to peptide research, such as related or not related to cancer, angiogenesis, molecular imaging, etc. Ensembles of bagged decision trees met the requirements of our application best. No other algorithm consistently performed better in comparative testing. Moreover, we show that the algorithm produces meaningful class probability estimates, which can be used to visualize the confidence of automatic classification during the retrieval process. To allow viewing long lists of search results enriched by automatic classifications, we added a dynamic heat map to the web interface. We take advantage of community knowledge by enabling users to cast votes in Web 2.0 style to correct automated classification errors, which triggers reclassification of all entries. We used a novel framework in which the database "drives" the entire vote aggregation and reclassification process to increase speed while conserving computational resources and keeping the method scalable. In our experiments, we simulate community voting by adding various levels of noise to nearly perfectly labelled instances, and show that, under such conditions, classification can be improved significantly. Conclusion: Using PepBank as a model database, we show how to build a classification-aided retrieval system that gathers training data from the community, is completely controlled by the database, scales well with concurrent change events, and can be adapted to add text classification capability to other biomedical databases. The system can be accessed at http://pepbank.mgh.harvard.edu.
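The simulated community voting described in this abstract can be sketched as follows, assuming that each simulated voter reports the correct label except with a fixed noise probability and that votes are combined by simple majority. The label names, function names, and noise level are hypothetical. The abstract treats each topic as a related / not related decision, and the actual aggregation scheme used in PepBank is not described there.

```python
import random
from collections import Counter


def aggregate_votes(votes):
    """Return (winning label, share of votes it received)."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


def noisy_votes(true_label, labels, n_voters, noise, rng):
    """Each simulated voter reports the true label, or a wrong one with probability `noise`."""
    wrong = [label for label in labels if label != true_label]
    return [
        rng.choice(wrong) if rng.random() < noise else true_label
        for _ in range(n_voters)
    ]


# Hypothetical category names for one database entry.
labels = ["cancer", "angiogenesis", "imaging"]
rng = random.Random(0)
votes = noisy_votes("cancer", labels, n_voters=25, noise=0.2, rng=rng)
winner, share = aggregate_votes(votes)
```

The vote share returned alongside the winning label plays a role similar to the class probability estimates mentioned in the abstract: it gives a rough confidence value that an interface could display next to each classification.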

    Concept and application of a computational vaccinology workflow

    BACKGROUND: Recent years have seen a renaissance of the vaccine field, driven by clinical needs in infectious diseases as well as in chronic diseases such as cancer and autoimmune disorders. Equally important are technological improvements involving nano-scale delivery platforms and third-generation adjuvants. In parallel, immunoinformatics routines have reached the maturity needed to support central aspects of vaccinology beyond the prediction of antigenic determinants. On this basis, computational vaccinology has emerged as a discipline aimed at ab initio rational vaccine design. Here we present a workflow for implementing computational vaccinology, covering aspects from vaccine target identification to functional characterization and epitope selection, supported by a Systems Biology assessment of central aspects of host-pathogen interaction. We exemplify the procedures for Epstein-Barr virus (EBV), a clinically relevant pathogen that causes chronic infection and is suspected of triggering malignancies and autoimmune disorders. RESULTS: We introduce pBone/pView as a computational workflow supporting the design and execution of immunoinformatics workflow modules, additionally covering results visualization, knowledge sharing, and re-use. Specific elements of the workflow include the identification of functionally relevant vaccine targets through a Systems Biology assessment of host-pathogen interaction, as well as various methodologies for delineating B- and T-cell epitopes, with particular emphasis on broad coverage of viral isolates and MHC alleles. Applying the workflow to EBV specifically proposes sequences from the viral proteins LMP2, EBNA2, and BALF4 as vaccine targets holding specific B- and T-cell epitopes that promise broad strain and allele coverage. CONCLUSION: Advances in the experimental assessment of genomes, transcriptomes, and proteomes for both pathogen and (human) host have laid the foundations for the rational design of vaccines. In parallel, immunoinformatics modules have been designed and successfully applied to support specific aspects of vaccine design. Together with novel vaccine formulation and delivery approaches, these advances have paved the way for implementing computational vaccinology for rational vaccine design that tackles presently unmet vaccine challenges.

    Prediction of MHC-peptide binding: a systematic and comprehensive overview

    T cell immune responses are driven by the recognition of peptide antigens (T cell epitopes) that are bound to major histocompatibility complex (MHC) molecules. T cell epitope immunogenicity is thus contingent on several events, including appropriate and effective processing of the peptide from its protein source, stable peptide binding to the MHC molecule, and recognition of the MHC-bound peptide by the T cell receptor. Of these three events, MHC-peptide binding is the most selective in determining T cell epitopes. Therefore, prediction of MHC-peptide binding constitutes the principal basis for anticipating potential T cell epitopes. The tremendous relevance of epitope identification in vaccine design and in the monitoring of T cell responses has spurred the development of many computational methods for predicting MHC-peptide binding that improve the efficiency and economics of T cell epitope identification. In this report, we systematically examine the available methods for predicting MHC-peptide binding and discuss their most relevant advantages and drawbacks.

    A PubMed-Wide Associational Study of Infectious Diseases

    Background: Computational discovery is playing an ever-greater role in supporting the processes of knowledge synthesis. A significant proportion of the more than 18 million manuscripts indexed in the PubMed database describe infectious disease syndromes and various infectious agents. This study is the first attempt to integrate online repositories of text-based publications and microbial genome databases in order to explore the dynamics of relationships between pathogens and infectious diseases. Methodology/Principal Findings: Herein we demonstrate how the knowledge space of infectious diseases can be computationally represented and quantified, and tracked over time. The knowledge space is explored by mapping of the infectious disease literature, looking at dynamics of literature deposition, zooming in from pathogen to genome level and searching for new associations. Syndromic signatures for different pathogens can be created to enable a new and clinically focussed reclassification of the microbial world. Examples of syndrome and pathogen networks illustrate how multilevel network representations of the relationships between infectious syndromes, pathogens and pathogen genomes can illuminate unexpected biological similarities in disease pathogenesis and epidemiology. Conclusions/Significance: This new approach based on text and data mining can support the discovery of previously hidden associations between diseases and microbial pathogens, clinically relevant reclassification of pathogeni