    Characterisation and Classification of Protein Sequences by Using Enhanced Amino Acid Indices and Signal Processing-Based Methods

    Due to copyright reasons, the author's published papers have been removed from this copy of the thesis. Protein sequencing has produced an overwhelming number of protein sequences, especially in the last decade. Nevertheless, the functional and structural classes of the majority of proteins are still unknown, and the experimental methods currently used to determine these properties are very expensive, laborious and time consuming. Therefore, automated computational methods are urgently required to accurately and reliably predict the functional and structural classes of proteins. Several bioinformatics methods have been developed to determine such properties of proteins directly from their sequence information. Methods that involve signal processing have recently become popular in bioinformatics; they have been investigated for the analysis of DNA and protein sequences and shown to help better characterise the sequences. However, various technical issues need to be addressed in order to overcome problems associated with signal processing methods for the analysis of protein sequences. Amino acid indices, which are used to transform protein sequences into signals, have various applications and can represent diverse features of protein sequences and amino acids. As the majority of indices have similar features, this project proposes a new set of computationally derived indices that better represent the original group of indices. A study is also carried out that resulted in finding a unique and universal set of best discriminating amino acid indices for the characterisation of allergenic proteins. This analysis extracts features directly from the protein sequences by using the Discrete Fourier Transform (DFT) to build a classification model based on Support Vector Machines (SVM) for the allergenic proteins.
The proposed predictive model yields higher and more reliable accuracy than existing methods. A new method is proposed for performing multiple sequence alignment. In this method, a DFT-based approach is used to construct a new distance matrix in combination with multiple amino acid indices that are used to encode protein sequences into numerical sequences. Additionally, a new type of substitution matrix is proposed in which the physicochemical similarities between any given amino acids are calculated. These similarities are calculated based on 25 selected amino acid indices, each of which represents a unique biological protein feature. The proposed multiple sequence alignment method yields a better and more reliable alignment than the existing methods. In order to evaluate the complex information that is generated as a result of the DFT, Complex Informational Spectrum Analysis (CISA) is developed and presented. As the results show, when protein classes present similarities or differences according to the Common Frequency Peak (CFP) in specific amino acid indices, it is probable that these classes are related to the protein feature that the specific amino acid index represents. Using only the absolute spectrum in informational spectrum analysis of protein sequences is shown to be insufficient, as biologically related features can appear individually in either the real or the imaginary spectrum. This is successfully demonstrated through the analysis of influenza neuraminidase protein sequences. Upon identification of a new protein, it is important to single out the amino acids responsible for the structural and functional classification of the protein, as well as the amino acids contributing to the protein's specific biological characterisation. In this work, a novel approach is presented to identify and quantify the relationship between individual amino acids and the protein.
This is successfully demonstrated through the analysis of influenza neuraminidase protein sequences. The problem of characterising and identifying Influenza A virus protein sequences is tackled through a Subgroup Discovery (SD) algorithm, which can provide ancillary knowledge to experts. The main objective of the case study was to derive interpretable knowledge for the influenza A virus problem and consequently to better describe the relationships between subtypes of this virus. Finally, using DFT-based sequence-driven features, a Support Vector Machine (SVM)-based classification model was built and tested, which yields higher predictive accuracy than SD. The methods developed and presented in this study yield promising results and can be easily applied to proteomic fields.
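
The encoding step described in this abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the Kyte-Doolittle hydropathy scale is assumed here purely as an example amino acid index, the function names (encode, dft, spectra) are hypothetical, and the naive DFT is written out by hand so that the example needs only the standard library. The sketch returns the real, imaginary and absolute spectra separately, since the abstract argues that the absolute spectrum alone can hide features.

```python
import cmath

# Kyte-Doolittle hydropathy values, assumed here only as an example index;
# the thesis works with its own selected and derived indices.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence, index=KYTE_DOOLITTLE):
    """Map each residue to its index value and subtract the mean (DC) component."""
    values = [index[aa] for aa in sequence.upper()]
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def dft(signal):
    """Naive O(N^2) discrete Fourier transform; returns complex coefficients."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def spectra(sequence, index=KYTE_DOOLITTLE):
    """Return (real, imaginary, absolute) spectra for frequencies 1 .. N//2."""
    coeffs = dft(encode(sequence, index))[1 : len(sequence) // 2 + 1]
    real = [c.real for c in coeffs]
    imag = [c.imag for c in coeffs]
    absolute = [abs(c) for c in coeffs]
    return real, imag, absolute


if __name__ == "__main__":
    # Arbitrary short peptide, used only to exercise the functions.
    real, imag, absolute = spectra("MKTAYIAKQR")
    for k, (r, i, a) in enumerate(zip(real, imag, absolute), start=1):
        print(f"k={k}: real={r:.3f} imag={i:.3f} abs={a:.3f}")
```

Feature vectors built this way (for example, the absolute spectrum resampled to a fixed length) could then be passed to any SVM implementation; that step is left out because the abstract does not specify the kernel or the feature layout.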

    First data on Ornithodoros moubata Aquaporins: structural, phylogenetic and immunogenic characterisation as vaccine targets

    21 pages, 4 tables, 4 figures, 11 supplementary figures, 11 supplementary tables. Ornithodoros moubata transmits African swine fever and human relapsing fever in Africa. The elimination of O. moubata populations from anthropic environments is expected to improve the prevention and control of these diseases. Tick vaccines have emerged as a sustainable method for tick control, and tick aquaporins (AQPs) are promising targets for tick vaccines due to their vital functions, immunogenicity and ease of access by neutralising host antibodies. This study aimed at the systematic identification of the AQPs expressed by O. moubata (OmAQPs) and their characterisation as vaccine targets. To this end, AQP coding sequences were recovered from available transcriptomic datasets, followed by PCR amplification, cloning, sequence verification and the analysis of the AQP protein structure and epitope exposure. Seven OmAQPs were identified and characterised: six were aquaglyceroporins, and one was a water-specific aquaporin. All of these were expressed in the salivary glands and midgut, and only three in the coxal glands. Epitope exposure analysis identified three extracellular domains in each AQP, which concentrate overlapping B and T cell epitopes, making them interesting vaccine targets. Based on these domain sequences, a set of ten antigenic peptides was designed, which showed adequate properties to be produced and tested in pilot vaccine trials. This research was funded by the project “RTI2018-098297-B-I00” (MCIU/AEI/FEDER, UE), granted by the Spanish Ministry of Science, Innovation and Universities, the State Research Agency (AEI) and the European Regional Development Fund (ERDF); and project “CLU-2019-05-IRNASA/CSIC Unit of Excellence”, granted by the Junta de Castilla y León and co-financed by the European Union (ERDF “Europe drives our growth”). Peer reviewed.

    Selected Works in Bioinformatics

    This book consists of nine chapters covering a variety of bioinformatics subjects, ranging from database resources for protein allergens, unravelling genetic determinants of complex disorders, characterization and prediction of regulatory motifs, computational methods for identifying the best classifiers and key disease genes in large-scale transcriptomic and proteomic experiments, and functional characterization of inherently unfolded proteins/regions, to protein interaction networks and flexible protein-protein docking. The computational algorithms are in general presented in a way that is accessible to advanced undergraduate students, graduate students and researchers in molecular biology and genetics. The book should also serve as a stepping stone for mathematicians, biostatisticians, and computational scientists to cross their academic boundaries into the dynamic and ever-expanding field of bioinformatics.

    A multi-method and structure-based in silico vaccine designing against Echinococcus granulosus through investigating enolase protein

    Introduction: Hydatid disease is a ubiquitous parasitic zoonotic disease, which causes various medical, economic and serious public health problems in some parts of the world. The causal organism is a multi-stage parasite named Echinococcus granulosus, whose life cycle depends on two types of mammalian hosts, viz. definitive and intermediate hosts. Methods: In this study, enolase, a key functional enzyme in the metabolism of E. granulosus (EgEnolase), was targeted through a comprehensive in silico modeling analysis and the design of a host-specific multi-epitope vaccine. The three-dimensional (3D) structure of enolase was modeled using MODELLER v9.18 software. The B-cell epitopes (BEs) were predicted using a multi-method approach based on several online predictors. The ClusPro v2.0 server was used for docking-based T-helper epitope prediction. The 3D structure of the vaccine was modeled using the RaptorX server. The designed vaccine was evaluated for its immunogenicity, physicochemical properties, and allergenicity. Codon optimization of the vaccine sequence was performed based on the codon usage table of E. coli K12. Finally, energy minimization and molecular docking were carried out to simulate the binding affinity of the vaccine to TLR-2 and TLR-4 and the stability of the complex. Results: The designed multi-epitope vaccine was found to induce anti-EgEnolase immunity, which may have the potential to prevent the survival and proliferation of E. granulosus in the definitive host. Conclusion: Based on the results, this step-by-step immunoinformatics approach could be considered a rational platform for designing vaccines against such multi-stage parasites. Furthermore, it is proposed that this multi-epitope vaccine could serve as a promising preventive anti-echinococcosis agent.

    Structural Biology of Peanut Allergens

    Peanuts are a cause of one of the most common food allergies. Allergy to peanuts not only affects a significant fraction of the population, but is also relatively often associated with strong reactions in sensitized individuals. Peanut and tree nut allergies, which start in childhood, are often persistent and continue through life, as opposed to other food allergies that resolve with age. Therefore, peanut allergens are among the most intensively studied food allergens. In this review we focus on the structural studies of peanut allergens. Although these allergens are attracting a lot of interest and several of them have had their structures experimentally determined, some molecular properties of peanut allergens are still not well understood. Peanut allergens, like other allergens, belong to just a few protein families. Allergens from the cupin superfamily (Ara h 1 and Ara h 3), 2S albumins (Ara h 2 and Ara h 6), Ara h 8 (pathogenesis-related class-10 protein) and Ara h 5 (profilin) are relatively well characterized in terms of their 3D structures. However, some peanut allergens, such as Ara h 7 (2S albumin), Ara h 9 (nonspecific lipid-transfer protein), and especially oleosins (Ara h 10 and Ara h 11) and defensins (Ara h 12 and Ara h 13), are still awaiting such characterization.

    PREvaIL, an integrative approach for inferring catalytic residues using sequence, structural and network features in a machine learning framework

    Determining the catalytic residues in an enzyme is critical to understanding the relationship between protein sequence, structure, and function, and to enhancing our ability to design novel enzymes and their inhibitors. Although many enzymes have been sequenced, and their primary and tertiary structures determined, experimental methods for enzyme functional characterization lag behind. Because experimental methods used for identifying catalytic residues are resource- and labor-intensive, computational approaches have considerable value and are highly desirable for their ability to complement experimental studies in identifying catalytic residues and helping to bridge the sequence–structure–function gap. In this study, we describe a new computational method called PREvaIL for predicting enzyme catalytic residues. This method was developed by leveraging a comprehensive set of informative features extracted from multiple levels, including sequence, structure, and residue-contact network, in a random forest machine-learning framework. Extensive benchmarking experiments on eight different datasets based on 10-fold cross-validation and independent tests, as well as side-by-side performance comparisons with seven modern sequence- and structure-based methods, showed that PREvaIL achieved competitive predictive performance, with an area under the receiver operating characteristic curve and area under the precision-recall curve ranging from 0.896 to 0.973 and from 0.294 to 0.523, respectively. We demonstrated that this method was able to capture useful signals arising from different levels, leveraging such differential but useful types of features and allowing us to significantly improve the performance of catalytic residue prediction. We believe that this new method can be utilized as a valuable tool for both understanding the complex sequence–structure–function relationships of proteins and facilitating the characterization of novel enzymes lacking functional annotations.
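
The two evaluation measures quoted in this abstract can be sketched as follows. This is a generic illustration, not PREvaIL's evaluation code: the function names are hypothetical, the ROC area is computed as a pairwise ranking probability, and the precision-recall area is approximated by average precision, which is one common estimator and may differ from the integration used in the paper.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


if __name__ == "__main__":
    # Toy labels (1 = catalytic residue) and classifier scores, invented for illustration.
    labels = [1, 0, 1, 0, 0]
    scores = [0.9, 0.8, 0.7, 0.3, 0.1]
    print(roc_auc(labels, scores), average_precision(labels, scores))
```

Because catalytic residues are a small minority of all residues, precision is penalised by every false positive among the many negatives, which is one reason a precision-recall area can sit well below the ROC area for the same predictor.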

    Probing the Rhipicephalus bursa sialomes in potential anti-tick vaccine candidates : a reverse vaccinology approach

    In the wake of the ‘omics’ explosion of data, reverse vaccinology approaches are being applied more readily as an alternative for the discovery of candidates for next generation diagnostics and vaccines. Promising protective antigens for the control of ticks and tick-borne diseases can be discovered by mining available omics data for immunogenic epitopes. The present study aims to explore the previously obtained Rhipicephalus bursa sialotranscriptome during both feeding and Babesia infection, to select antigenic targets that are either membrane-associated or secreted, as well as unique to the ectoparasite and not present in the mammalian host. Further, they should be capable of stimulating T and B cells for a potentially robust immune response, and be neither allergenic nor toxic to the host. From the R. bursa transcriptome, 5706 and 3025 proteins were identified as belonging to the surfaceome and secretome, respectively. Following a reverse genetics immunoinformatics pipeline, nine preferred candidates, consisting of one transmembrane-related and eight secreted proteins, were identified. These candidates showed a higher predicted antigenicity than the Bm86 antigen, with no homology to mammalian hosts and exposed regions. Only four were functionally annotated and selected for further in silico analysis, which examined their protein structure, surface accessibility, flexibility, hydrophobicity, and putative linear B and T-cell epitopes. Regions with overlapping coincident epitope groups (CEGs) were evaluated to select peptides that were further analyzed for their physicochemical characteristics, potential allergenicity, toxicity, solubility, and potential propensity for crystallization. Following these procedures, a set of three peptides from the three R. bursa proteins was selected.
In silico results indicate that the designed epitopes could stimulate a protective and long-lasting immune response against those tick proteins, reflecting their potential as anti-tick vaccines. The immunogenicity of these peptides was evaluated in a pilot immunization study followed by tick feeding to evaluate its impact on tick behavior and pathogen transmission. Combining in silico methods with in vivo immunogenicity evaluation enabled the screening of vaccine candidates prior to expensive infestation studies on the definitive ovine host animals. Supplementary material: Spreadsheet S1 (Surfaceome), Spreadsheet S2 (Secretome), Spreadsheet S3 (MARVEL), Spreadsheet S4 (EVASIN), Spreadsheet S5 (RICIN). Funding: Fundação para a Ciência e Tecnologia (FCT). http://www.mdpi.com/journal/biomedicines

    Machine learning approaches for epitope prediction

    The identification and characterization of epitopes in antigenic sequences is critical for understanding disease pathogenesis, for identifying potential autoantigens, and for designing vaccines and immune-based cancer therapies. As the number of fully or partially sequenced pathogen genomes is rapidly increasing, experimental methods for epitope mapping would be prohibitive in terms of time and expense. Therefore, computational methods for reliably identifying potential vaccine candidates (i.e., epitopes that evoke a strong response from both T-cells and B-cells) are highly desirable. Machine learning offers one of the most cost-effective and widely used approaches to developing epitope prediction tools. In the last few years, several advances in machine learning research have emerged. We utilize these recent advances to provide epitope prediction tools with improved predictive performance. First, we introduce two methods, BCPred and FBCPred, for predicting linear B-cell epitopes and flexible length linear B-cell epitopes, respectively, using string kernel based support vector machine (SVM) classifiers. Second, we introduce three scoring matrix methods and show that they are highly competitive with a broad class of machine learning methods, including SVM, in predicting major histocompatibility complex class I (MHC-I) binding peptides. Finally, we formulate the problems of qualitatively and quantitatively predicting flexible length major histocompatibility complex class II (MHC-II) peptides as multiple instance learning and multiple instance regression problems, respectively. Based on this formulation, we introduce MHCMIR, a novel method for predicting MHC-II binding affinity using multiple instance regression. The development of reliable epitope prediction tools is not feasible in the absence of high quality data sets.
Unfortunately, most of the existing epitope benchmark data sets consist of epitope sequences that share a high degree of similarity with other peptide sequences in the same data set. We demonstrate the pitfalls of these commonly used data sets for evaluating the performance of machine learning approaches to epitope prediction. Finally, we propose a similarity reduction procedure that is more stringent than currently used similarity reduction methods.
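
The idea of similarity reduction mentioned at the end of this abstract can be illustrated with a generic greedy filter. This is a sketch under stated assumptions, not the procedure proposed in the thesis: the function names are hypothetical, the 0.8 threshold is arbitrary, and identity is computed position by position without gaps, which is a simplification of the alignment-based identity normally used.

```python
def identity(a, b):
    """Ungapped, position-wise identity relative to the longer sequence."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def reduce_similarity(peptides, threshold=0.8):
    """Greedily keep a peptide only if it is below the threshold against all kept ones."""
    kept = []
    for peptide in peptides:
        if all(identity(peptide, other) < threshold for other in kept):
            kept.append(peptide)
    return kept


if __name__ == "__main__":
    # Invented peptides: the second differs from the first at a single position.
    data = ["ACDEFGHIKL", "ACDEFGHIKV", "WWWWWWWWWW"]
    print(reduce_similarity(data))
```

Filtering before the train/test split matters because near-duplicate peptides on both sides of the split let a classifier score well by memorisation, which inflates the reported performance.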

    Investigation of ancient proteins in archaeological material

    Although several studies have positively identified dairy proteins from ancient dental calculus, other dietary protein identifications are exceedingly rare. The manuscripts included in this thesis report the identification of 20 different dietary proteins, from 10 different species, that could be taxonomically identified to the species level. One of our primary goals for this thesis was to evaluate any potential biases in our dietary protein recovery, and to gain a deeper understanding of the factors that influence dietary protein preservation within dental calculus. Although our sample sizes are small, we did find evidence of biases in the types of proteins we have recovered thus far. All proteins identified were potential allergens with IgE binding sites, low monomer molecular weight, resistance to changes in pH and temperature, and resistance to degradation through enzymatic proteolysis. The identified proteins primarily had functions in host defense or storage.