52 research outputs found

    Characterisation and Classification of Protein Sequences by Using Enhanced Amino Acid Indices and Signal Processing-Based Methods

    Due to copyright reasons, the authors' published papers have been removed from this copy of the thesis. Protein sequencing has produced an overwhelming number of protein sequences, especially in the last decade. Nevertheless, the functional and structural classes of the majority of proteins are still unknown, and the experimental methods currently used to determine these properties are very expensive, laborious and time consuming. Therefore, automated computational methods are urgently required to predict the functional and structural classes of proteins accurately and reliably. Several bioinformatics methods have been developed to determine such properties of proteins directly from their sequence information. Methods that involve signal processing have recently become popular in bioinformatics; they have been investigated for the analysis of DNA and protein sequences and shown to help characterise the sequences better. However, various technical issues need to be addressed in order to overcome problems associated with signal processing methods for the analysis of protein sequences. Amino acid indices, which are used to transform protein sequences into signals, have various applications and can represent diverse features of protein sequences and amino acids. As the majority of indices have similar features, this project proposes a new set of computationally derived indices that better represent the original group of indices. A study is also carried out that resulted in finding a unique and universal set of best discriminating amino acid indices for the characterisation of allergenic proteins. This analysis extracts features directly from the protein sequences by using the Discrete Fourier Transform (DFT) to build a classification model based on Support Vector Machines (SVM) for the allergenic proteins.
The proposed predictive model yields a higher and more reliable accuracy than the existing methods. A new method is proposed for performing multiple sequence alignment. In this method, a DFT-based approach is used to construct a new distance matrix in combination with multiple amino acid indices that are used to encode protein sequences into numerical sequences. Additionally, a new type of substitution matrix is proposed in which the physicochemical similarities between any given amino acids are calculated. These similarities are calculated from the 25 selected amino acid indices, each of which represents a unique biological protein feature. The proposed multiple sequence alignment method yields a better and more reliable alignment than the existing methods. In order to evaluate the complex information that is generated as a result of the DFT, Complex Informational Spectrum Analysis (CISA) is developed and presented. As the results show, when protein classes present similarities or differences according to the Common Frequency Peak (CFP) in specific amino acid indices, it is probable that these classes are related to the protein feature that the specific amino acid represents. Using only the absolute spectrum in informational spectrum analysis of protein sequences is shown to be insufficient, as biologically related features can appear individually in either the real or the imaginary spectrum. This is successfully demonstrated through the analysis of influenza neuraminidase protein sequences. Upon identification of a new protein, it is important to single out the amino acids responsible for the structural and functional classification of the protein, as well as the amino acids contributing to the protein's specific biological characterisation. In this work, a novel approach is presented to identify and quantify the relationship between individual amino acids and the protein.
This is successfully demonstrated through the analysis of influenza neuraminidase protein sequences. The characterisation and identification problem of Influenza A virus protein sequences is tackled through a Subgroup Discovery (SD) algorithm, which can provide ancillary knowledge to the experts. The main objective of the case study was to derive interpretable knowledge for the Influenza A virus problem and consequently to better describe the relationships between subtypes of this virus. Finally, by using DFT-based sequence-driven features, a Support Vector Machine (SVM)-based classification model was built and tested, which yields higher predictive accuracy than that of SD. The methods developed and presented in this study yield promising results and can be easily applied to proteomic fields.
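
The encoding step described in this abstract (amino acid index, then DFT, then real, imaginary and absolute spectra) can be illustrated with a minimal sketch. It is not the thesis's implementation: the Kyte-Doolittle hydropathy scale is used only as an example index because the abstract does not name the indices it selects, and the function names `encode` and `dft_spectra`, the mean-centring step, and the example sequence are all assumptions made for illustration.

```python
import cmath
import math

# Kyte-Doolittle hydropathy values, used here only as an example index.
# Assumption: the abstract does not say which amino acid indices are used.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence, index=KYTE_DOOLITTLE):
    """Map each residue to its index value (unknown residues raise KeyError)."""
    return [index[residue] for residue in sequence.upper()]


def dft_spectra(signal):
    """Return (real, imaginary, absolute) spectra of the mean-centred signal.

    Only frequencies 0 .. n//2 are returned, since the input is real-valued.
    """
    n = len(signal)
    mean = sum(signal) / n
    centred = [x - mean for x in signal]  # assumption: remove the DC offset
    real, imag, absolute = [], [], []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centred)
        )
        real.append(coeff.real)
        imag.append(coeff.imag)
        absolute.append(abs(coeff))
    return real, imag, absolute


# Hypothetical example sequence, not taken from the thesis.
real, imag, absolute = dft_spectra(encode("MKTAYIAKQRQISFVKSHFSRQ"))
```

Keeping the real and imaginary parts separate, rather than only the absolute values, mirrors the abstract's point that a feature may show up in one of those spectra alone.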

    In silico allergen identification: Proposal for a revision of FAO/WHO guidelines

    Allergy is a widespread, often severe health problem. In vivo or in vitro identification of new allergenic proteins (natural or bioengineered) is time- and resource-consuming, and in vivo testing can be dangerous. Thus, allergenicity prediction through computation (in silico) was proposed to narrow down the number of potential allergens to be tested with traditional methods. In 2001, the Food and Agriculture Organization (FAO) and the World Health Organization (WHO) officially defined guidelines for in silico allergenicity prediction, based on amino acid sequence similarity to known allergens; these guidelines, however, have been criticized because of frequent false positives. In the present work, the BLAST (Basic Local Alignment Search Tool) software was used to compare known and potential allergens and to select only statistically significant homologies (i.e. homologies whose E value, calculated by BLAST, was 1); FAO/WHO rules were then applied to these homologies. With this method, all known allergens were correctly recognized, with only 10 false positives (1.26% of all predicted allergens), when using an upper limit of 0.1 for E values; complete suppression of wrong predictions, while maintaining 100% sensitivity, was obtained with minor modifications of the minimum requirements contained in the FAO/WHO guidelines.
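
The two-stage filter described in this abstract (keep only BLAST hits below an E-value limit, then apply the FAO/WHO rules) might be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the FAO/WHO 2001 criteria are taken in their commonly stated form (more than 35% identity over an 80-residue window, or a run of 6 identical contiguous residues), the `Hit` record and its fields are hypothetical, the window identity is assumed to have been computed elsewhere, and treating the 0.1 limit as inclusive is a guess.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """Hypothetical record for one BLAST hit against a known allergen."""
    allergen_id: str
    allergen_seq: str
    e_value: float
    best_window_identity: float  # percent identity over the best 80-residue window


def shares_exact_run(query: str, subject: str, k: int = 6) -> bool:
    """True if query and subject share at least one identical run of k residues."""
    subject_kmers = {subject[i:i + k] for i in range(len(subject) - k + 1)}
    return any(query[i:i + k] in subject_kmers for i in range(len(query) - k + 1))


def predicted_allergen_matches(query: str, hits: List[Hit], e_max: float = 0.1) -> List[str]:
    """Return IDs of allergens that pass the E-value filter and either FAO/WHO rule."""
    flagged = []
    for hit in hits:
        if hit.e_value > e_max:  # assumption: the upper limit itself is allowed
            continue
        if hit.best_window_identity > 35.0 or shares_exact_run(query, hit.allergen_seq):
            flagged.append(hit.allergen_id)
    return flagged
```

Applying the E-value filter first means a short exact match alone can no longer flag a protein unless the overall homology is also statistically significant, which is the mechanism the abstract credits for reducing false positives.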

    EVALLER: a web server for in silico assessment of potential protein allergenicity

    Bioinformatics testing approaches for protein allergenicity, involving amino acid sequence comparisons, have grown appreciably in sophistication and performance over the last several years. EVALLER, the web server presented in this article, is based on our recently published ‘Detection based on Filtered Length-adjusted Allergen Peptides’ (DFLAP) algorithm, which affords in silico determination of potential protein allergenicity with high sensitivity and excellent specificity. To strengthen bioinformatics risk assessment in allergology, EVALLER provides a comprehensive outline of its judgment on a query protein's potential allergenicity. Each such textual output incorporates a scoring figure, a confidence numeral of the assignment and information on high- or low-scoring matches to identified allergen-related motifs, including their respective location in accordingly derived allergens. The interface, built on a modified Perl Open Source package, enables dynamic and color-coded graphic representation of key parts of the output. Moreover, pertinent details can be examined closely through zoomed views. The server can be accessed at http://bioinformatics.bmc.uu.se/evaller.html

    Databases and Algorithms in Allergen Informatics

    Allergic diseases are considered one of the major health problems worldwide due to their increasing prevalence. Advancements in genomic, proteomic, and analytical techniques have resulted in considerable progress in the field of allergology, which has led to the accumulation of a huge amount of data. Allergen bioinformatics comprises allergen-related data resources and computational methods and tools, which deal with the efficient archival, management, and analysis of allergological data. Significant work has been done in the area of allergen bioinformatics that has proven pivotal for the development and progress of this field. In this chapter, we describe the current status of databases and algorithms in allergen bioinformatics by examining work carried out thus far with respect to features such as allergens and allergenicity, allergen databases, algorithms/tools for allergen/allergenicity prediction, allergen epitope prediction, and allergenic cross-reactivity assessment. This chapter illustrates concepts and algorithms in allergen bioinformatics and outlines the key areas for potential development in the field of allergology.

    Computational detection of allergenic proteins attains a new level of accuracy with in silico variable-length peptide extraction and machine learning

    The placing on the market of novel or new-in-the-context proteins, appearing in genetically modified foods, certain bio-pharmaceuticals and some household products, leads to human exposure to proteins that may elicit allergic responses. Accurate methods to detect allergens are therefore necessary to ensure consumer/patient safety. We demonstrate that it is possible to reach a new level of accuracy in computational detection of allergenic proteins by presenting a novel detector, Detection based on Filtered Length-adjusted Allergen Peptides (DFLAP). The DFLAP algorithm extracts variable-length allergen sequence fragments and employs modern machine learning techniques in the form of a support vector machine. In particular, this new detector shows hitherto unmatched specificity when challenged with the Swiss-Prot repository, without appreciable loss of sensitivity. DFLAP is also the first reported detector that successfully discriminates between allergens and non-allergens occurring in protein families known to hold both categories. Allergenicity assessment for specific protein sequences of interest using DFLAP is possible via [email protected]

    Wildfire: distributed, Grid-enabled workflow construction and execution

    BACKGROUND: We observe two trends in bioinformatics: (i) analyses are increasing in complexity, often requiring several applications to be run as a workflow; and (ii) multiple CPU clusters and Grids are available to more scientists. The traditional solution to the problem of running workflows across multiple CPUs required programming, often in a scripting language such as Perl. Programming places such solutions beyond the reach of many bioinformatics consumers. RESULTS: We present Wildfire, a graphical user interface for constructing and running workflows. Wildfire borrows user interface features from Jemboss and adds a drag-and-drop interface allowing the user to compose EMBOSS (and other) programs into workflows. For execution, Wildfire uses GEL, the underlying workflow execution engine, which can exploit available parallelism on multiple CPU machines including Beowulf-class clusters and Grids. CONCLUSION: Wildfire simplifies the tasks of constructing and executing bioinformatics workflows.

    iSS-PseDNC: Identifying Splicing Sites Using Pseudo Dinucleotide Composition
