84 research outputs found

    Creation, evaluation, and use of PSI, a program for identifying protein-phenotype relationships and comparing protein content in groups of organisms

    Recent advances in DNA sequencing technology have enabled entire genomes to be sequenced quickly and accurately, resulting in an exponential increase in the number of organisms whose genome sequences have been elucidated. While the genome sequence of a given organism represents an important starting point in understanding its physiology, the functions of the protein products of many genes are still unknown; as such, computational methods for studying protein function are becoming increasingly important. In addition, this wealth of genomic information has created an unprecedented opportunity to compare the protein content of different organisms; among other applications, this can enable us to improve taxonomic classifications, to develop more accurate diagnostic tests for identifying particular bacteria, and to better understand protein content relationships in both closely-related and distantly-related organisms. This thesis describes the design, evaluation, and use of a program called Proteome Subtraction and Intersection (PSI) that uses an idea called genome subtraction for discovering protein-phenotype relationships and for characterizing differences in protein content in groups of organisms. PSI takes as input a set of proteomes, as well as a partitioning of that set into a subset of "included" proteomes and a subset of "excluded" proteomes. Using reciprocal BLAST hits, PSI finds orthologous relationships among all the proteins in the proteomes from the original set, and then finds groups of orthologous proteins containing at least one orthologue from each of the proteomes in the "included" subset, and none from any of the proteomes in the "excluded" subset. PSI is first applied to finding protein-phenotype relationships. 
By identifying proteins that are present in all sequenced isolates of the genus Lactobacillus, but not in the related bacterium Pediococcus pentosaceus, proteins are discovered that are likely to be responsible for the difference in cell shape between the lactobacilli and P. pentosaceus. In addition, proteins are identified that may be responsible for resistance to the antibiotic gatifloxacin in some lactic acid bacteria. This thesis also explores the use of PSI for comparing protein content in groups of organisms. Based on the idea of genome subtraction, a novel metric is proposed for comparing the difference in protein content between two organisms. This metric is then used to create a phylogenetic tree for a large set of bacteria, which to the author's knowledge represents the largest phylogenetic tree created to date using protein content. In addition, PSI is used to find the proteomic cohesiveness of isolates of several bacterial species in order to support or refute their current taxonomic classifications. Overall, PSI is a versatile tool with many interesting applications, and should become more and more valuable as additional genomic information becomes available
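The include/exclude step described in this abstract can be sketched as a simple set filter. This is only an illustrative sketch, not PSI's actual code: it assumes the orthologous groups have already been computed (for example from reciprocal BLAST hits) and are given as a mapping from a group identifier to the set of proteomes that contribute a member. The function name `select_groups` and the sample identifiers are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def select_groups(
    groups: Dict[str, Set[str]],
    included: Iterable[str],
    excluded: Iterable[str],
) -> List[str]:
    """Return ids of groups present in every included proteome and in no excluded one."""
    inc, exc = set(included), set(excluded)
    return sorted(
        group_id
        for group_id, proteomes in groups.items()
        # all included proteomes contribute a member, and no excluded proteome does
        if inc <= proteomes and not (exc & proteomes)
    )


# Hypothetical toy data: three ortholog groups over three proteomes.
example_groups = {
    "og1": {"L_brevis", "L_casei"},
    "og2": {"L_brevis", "L_casei", "P_pentosaceus"},
    "og3": {"L_brevis"},
}
selected = select_groups(example_groups, ["L_brevis", "L_casei"], ["P_pentosaceus"])
```

Here only `og1` would be kept: `og2` has a member in the excluded proteome, and `og3` is missing from one of the included proteomes.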

    Design and data analysis of kinome microarrays

    Catalyzed by protein kinases, phosphorylation is the most important post-translational modification in eukaryotes and is involved in the regulation of almost all cellular processes. Investigating phosphorylation events and how they change in response to different biological conditions is integral to understanding cellular signaling processes in general, as well as to defining the role of phosphorylation in health and disease. A recently-developed technology for studying phosphorylation events is the kinome microarray, which consists of several hundred "spots" arranged in a grid-like pattern on a glass slide. Each spot contains many peptides of a particular amino acid sequence chemically fixed to the slide, with different spots containing peptides with different sequences. Each peptide is a subsequence of a full protein, containing an amino acid residue that is known or suspected to undergo phosphorylation in vivo, as well as several surrounding residues. When a kinome microarray is exposed to cell lysate, the protein kinases in the lysate catalyze the phosphorylation of the peptides on the array. By measuring the degree to which the peptides comprising each spot are phosphorylated, insight can be gained into the upregulation or downregulation of signaling pathways in response to different biological treatments or conditions. There are two main computational challenges associated with kinome microarrays. The first is array design, which involves selecting the peptides to be included on a given array. The level of difficulty of this task depends largely on the number of phosphorylation sites that have been experimentally identified in the proteome of the organism being studied. For instance, thousands of phosphorylation sites are known for human and mouse, allowing considerable freedom to select peptides that are relevant to the problem being examined. In contrast, few sites are known for, say, honeybee and soybean. 
For such organisms, it is useful to expand the set of possible peptides by using computational techniques to predict probable phosphorylation sites. In this thesis, existing techniques for the computational prediction of phosphorylation sites are reviewed. In addition, two novel methods are described for predicting phosphorylation events in organisms with few known sites, with each method using a fundamentally different approach. The first technique, called PHOSFER, uses a random forest-based machine-learning strategy, while the second, called DAPPLE, takes advantage of sequence homology between known sites and the proteome of interest. Both methods are shown to allow quicker or more accurate predictions in organisms with few known sites than comparable previous techniques. Therefore, the use of kinome microarrays is no longer limited to the study of organisms having many known phosphorylation sites; rather, this technology can potentially be applied to any organism having a sequenced genome. It is shown that PHOSFER and DAPPLE are suitable for identifying phosphorylation sites in a wide variety of organisms, including cow, honeybee, and soybean. The second computational challenge is data analysis, which involves the normalization, clustering, statistical analysis, and visualization of data resulting from the arrays. While software designed for the analysis of DNA microarrays has also been used for kinome arrays, differences between the two technologies prompted the development of PIIKA, a software package specifically designed for the analysis of kinome microarray data. By comparing with methods used for DNA microarrays, it is shown that PIIKA improves the ability to identify biological pathways that are differentially regulated in a treatment condition compared to a control condition. Also described is an updated version, PIIKA 2, which contains improvements and new features in the areas of clustering, statistical analysis, and data visualization. 
Given the previous absence of dedicated tools for analyzing kinome microarray data, as well as their wealth of features, PIIKA and PIIKA 2 represent an important step in maximizing the scientific value of this technology. In addition to the above techniques, this thesis presents three studies involving biological applications of kinome microarray analysis. The first study demonstrates the existence of "kinotypes" - species- or individual-specific kinome profiles - which has implications for personalized medicine and for the use of model organisms in the study of human disease. The second study uses kinome analysis to characterize how the calf immune system responds to infection by the bacterium Mycobacterium avium subsp. paratuberculosis. Finally, the third study uses kinome arrays to study parasitism of honeybees by the mite Varroa destructor, which is thought to be a major cause of colony collapse disorder. In order to make the methods described above readily available, a website called the SAskatchewan PHosphorylation Internet REsource (SAPHIRE) has been developed. Located at the URL http://saphire.usask.ca, SAPHIRE allows researchers to easily make use of PHOSFER, DAPPLE, and PIIKA 2. These resources facilitate both the design and data analysis of kinome microarrays, making them an even more effective technique for studying cellular signaling

    Statistical characterization of the GxxxG glycine repeats in the flagellar biosynthesis protein FliH and its Type III secretion homologue YscL

    BACKGROUND: FliH is a protein involved in the export of components of the bacterial flagellum, and we herein describe the presence of glycine-rich repeats in FliH of the form AxxxG(xxxG)_{m}xxxA, where the value of m varies considerably in FliH proteins from different bacteria. While GxxxG and AxxxA patterns have previously been described, the long glycine repeat segments in FliH proteins have yet to be characterized. The Type III secretion system homologue to FliH (YscL, AscL, PscL, etc.) also contains a similar GxxxG repeat, and hence the presence of the repeat is evolutionarily conserved in these proteins, suggesting an important structural role or biological function. RESULTS: A set of FliH and YscL protein sequences was downloaded from GenBank, and then filtered to reduce redundancy, to ensure the soundness of the sequences, and to eliminate, as much as possible, confounding phylogenetic signal between individual sequences by implementing a pairwise 25% sequence identity cut-off. The general features of the glycine-rich repeats in these proteins were examined, and it was found that the length of these repeat segments varied substantially among FliH proteins but was fairly consistent for the Type III (YscL) homologue sequences, with values of m ranging from 0 to 12 for FliH and 0 to 2 for YscL. The amino acid sequence distribution of each of the three positions in the GxxxG repeats was found to differ significantly from the overall amino acid composition of the FliH/YscL proteins. The high frequency of Glu, Gln, Lys and Ala residues in the repeat positions, which is not likely indicative of any contaminating phylogenetic signal, suggests an α-helical structure for this motif. In addition, we sought to determine whether certain pairs of amino acids, in certain pairs of positions, were found together significantly more often than would be predicted by chance. 
Several statistically significant correlations were uncovered, which may be important for maintaining helical stability or for forming helix-helix interactions. These correlations are likely not of a phylogenetic origin as the originating sequences for the pair correlations are derived from a low similarity set and the individual incidences of the pair correlations do not cluster in any obvious phylogenetic sense, nor is there much evidence of strict sequence conservation outside the positions of the glycine residues. Finally, the α-helices from a non-redundant set of proteins from the Protein Data Bank were searched for GxxxG repeats similar in length to those found in FliH; however, there were no helices containing more than three contiguous glycine repeat segments; thus, long glycine repeats similar to those found in FliH are presumably quite rare in nature. CONCLUSION: The glycine repeats in YscL and particularly FliH represent an intriguing amino acid sequence motif that is very rare in nature. Although we do not attempt to offer a mechanism whereby these repeats may have evolved, we do place the existence of the motif and some residue pairings within a rational structural context. While crystal structures of these proteins are necessary to fully elucidate the structural and functional significance of these repeats, the characterization reported here represents a first step in understanding this unique sequence feature.
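The repeat form given in this abstract (an A, three arbitrary residues, a G, then m further xxxG units, then three residues and an A) maps directly onto a regular expression. The sketch below is an assumption about how one might scan a sequence for it, not the authors' procedure; `find_repeats` and the sample sequences are hypothetical, "x" is taken to mean any residue, and only non-overlapping matches are reported.

```python
import re
from typing import List, Tuple

# A, three residues, G, then m copies of (three residues + G), then three residues and A.
REPEAT_PATTERN = re.compile(r"A.{3}G((?:.{3}G)*).{3}A")


def find_repeats(sequence: str) -> List[Tuple[int, int]]:
    """Return (start_index, m) for each non-overlapping AxxxG(xxxG)m xxxA match."""
    results = []
    for match in REPEAT_PATTERN.finditer(sequence):
        # Each extra xxxG unit is four characters long.
        m = len(match.group(1)) // 4
        results.append((match.start(), m))
    return results


# Hypothetical sequence with one repeat segment where m = 2.
hits = find_repeats("MK" + "AEELGKKLGQQAGEEKA" + "LL")
```

With this toy input the single hit starts at index 2 with m = 2; a bare AxxxGxxxA segment would be reported with m = 0.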

    Strength in numbers: achieving greater accuracy in MHC-I binding prediction by combining the results from multiple prediction tools

    BACKGROUND: Peptides derived from endogenous antigens can bind to MHC class I molecules. Those which bind with high affinity can invoke a CD8+ immune response, resulting in the destruction of infected cells. Much work in immunoinformatics has involved the algorithmic prediction of peptide binding affinity to various MHC-I alleles. A number of tools for MHC-I binding prediction have been developed, many of which are available on the web. RESULTS: We hypothesize that peptides predicted by a number of tools are more likely to bind than those predicted by just one tool, and that the likelihood of a particular peptide being a binder is related to the number of tools that predict it, as well as the accuracy of those tools. To this end, we have built and tested a heuristic-based method of making MHC-binding predictions by combining the results from multiple tools. The predictive performance of each individual tool is first ascertained. These performance data are used to derive weights such that the predictions of tools with better accuracy are given greater credence. The combined tool was evaluated using ten-fold cross-validation and was found to significantly outperform the individual tools when a high specificity threshold is used. It performs comparably well to the best-performing individual tools at lower specificity thresholds. Finally, it also outperforms the combination of the tools resulting from linear discriminant analysis. CONCLUSION: A heuristic-based method of combining the results of the individual tools better facilitates the scanning of large proteomes for potential epitopes, yielding more actual high-affinity binders while reporting very few false positives.
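The idea of giving more credence to more accurate tools can be illustrated with an accuracy-weighted vote. The abstract does not state the actual weighting heuristic, so this is only a sketch under the assumption that each tool's weight equals its measured accuracy; the function `combined_score`, the tool names, and the accuracy values are all hypothetical.

```python
from typing import Mapping


def combined_score(predictions: Mapping[str, bool], accuracy: Mapping[str, float]) -> float:
    """Fraction of total accuracy weight held by tools that predict 'binder'."""
    total_weight = sum(accuracy[tool] for tool in predictions)
    binder_weight = sum(accuracy[tool] for tool, is_binder in predictions.items() if is_binder)
    return binder_weight / total_weight


# Hypothetical per-tool accuracies, e.g. measured beforehand on held-out data.
tool_accuracy = {"toolA": 0.9, "toolB": 0.6, "toolC": 0.5}
# Hypothetical binary calls from each tool for one peptide.
peptide_calls = {"toolA": True, "toolB": False, "toolC": True}

score = combined_score(peptide_calls, tool_accuracy)
```

A peptide would then be called a binder when the score exceeds a threshold chosen for the desired specificity; raising the threshold trades sensitivity for fewer false positives.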

    A better sequence-read simulator program for metagenomics

    BACKGROUND: There are many programs available for generating simulated whole-genome shotgun sequence reads. The data generated by many of these programs follow predefined models, which limits their use to the authors' original intentions. For example, many models assume that read lengths follow a uniform or normal distribution. Other programs generate models from actual sequencing data, but are limited to reads from single-genome studies. To our knowledge, there are no programs that allow a user to generate simulated data following non-parametric read-length distributions and quality profiles based on empirically-derived information from metagenomics sequencing data. RESULTS: We present BEAR (Better Emulation for Artificial Reads), a program that uses a machine-learning approach to generate reads with lengths and quality values that closely match empirically-derived distributions. BEAR can emulate reads from various sequencing platforms, including Illumina, 454, and Ion Torrent. BEAR requires minimal user input, as it automatically determines appropriate parameter settings from user-supplied data. BEAR also uses a unique method for deriving run-specific error rates, and extracts useful statistics from the metagenomic data itself, such as quality-error models. Many existing simulators are specific to a particular sequencing technology; however, BEAR is not restricted in this way. Because of its flexibility, BEAR is particularly useful for emulating the behaviour of technologies like Ion Torrent, for which no dedicated sequencing simulators are currently available. BEAR is also the first metagenomic sequencing simulator program that automates the process of generating abundances, which can be an arduous task. CONCLUSIONS: BEAR is useful for evaluating data processing tools in genomics. 
It has many advantages over existing comparable software, such as generating more realistic reads and being independent of sequencing technology, and has features particularly useful for metagenomics work
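To make "non-parametric read-length distribution" concrete, the sketch below resamples read lengths in proportion to how often each length was observed, rather than fitting a uniform or normal model. This is plain empirical resampling and is not BEAR's machine-learning approach; `sample_read_lengths` and the example lengths are hypothetical.

```python
import random
from collections import Counter
from typing import List, Optional, Sequence


def sample_read_lengths(observed: Sequence[int], n: int, seed: Optional[int] = None) -> List[int]:
    """Draw n read lengths following the empirical distribution of the observed lengths."""
    rng = random.Random(seed)
    counts = Counter(observed)
    lengths = list(counts)
    weights = [counts[length] for length in lengths]
    # random.choices samples with replacement, proportionally to the weights.
    return rng.choices(lengths, weights=weights, k=n)


# Hypothetical observed read lengths from a sequencing run.
simulated = sample_read_lengths([100, 100, 150, 250], n=1000, seed=0)
```

Because sampling uses only the observed counts, any shape of distribution (skewed, multimodal) is reproduced without choosing a parametric family.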

    Comparing the Similarity of Different Groups of Bacteria to the Human Proteome

    Numerous aspects of the relationship between bacteria and human have been investigated. One aspect that has recently received attention is sequence overlap at the proteomic level. However, there has not yet been a study that comprehensively characterizes the level of sequence overlap between bacteria and human, especially as it relates to bacterial characteristics like pathogenicity, G-C content, and proteome size. In this study, we began by performing a general characterization of the range of bacteria-human similarity at the proteomic level, and identified characteristics of the most- and least-similar bacterial species. We then examined the relationship between proteomic similarity and numerous other variables. While pathogens and nonpathogens had comparable similarity to the human proteome, pathogens causing chronic infections were found to be more similar to the human proteome than those causing acute infections. Although no general correspondence between a bacterium’s proteome size and its similarity to the human proteome was noted, no bacteria with small proteomes had high similarity to the human proteome. Finally, we discovered an interesting relationship between similarity and a bacterium’s G-C content. While the relationship between bacteria and human has been studied from many angles, their proteomic similarity still needs to be examined in more detail. This paper sheds further light on this relationship, particularly with respect to immunity and pathogenicity

    The oligodeoxynucleotide sequences corresponding to never-expressed peptide motifs are mainly located in the non-coding strand

    BACKGROUND: We study the usage of specific peptide platforms in protein composition. Using the pentapeptide as a unit of length, we find that in the universal proteome many pentapeptides are heavily repeated (even thousands of times), whereas some are quite rare, and a small number do not appear at all. To understand the physico-chemical-biological basis underlying peptide usage at the proteomic level, in this study we analyse the energetic costs for the synthesis of rare and never-expressed versus frequent pentapeptides. In addition, we explore residue bulkiness, hydrophobicity, and codon number as factors able to modulate specific peptide frequencies. Then, the possible influence of amino acid composition is investigated in zero- and high-frequency pentapeptide sets by analysing the frequencies of the corresponding inverse-sequence pentapeptides. As a final step, we analyse the pentadecamer oligodeoxynucleotide sequences corresponding to the never-expressed pentapeptides. RESULTS: We find that only DNA context-dependent constraints (such as oligodeoxynucleotide sequence location in the minus strand, introns, pseudogenes, frameshifts, etc.) provide a coherent mechanistic platform to explain the occurrence of never-expressed versus frequent pentapeptides in the protein world. CONCLUSIONS: This study is of importance in cell biology. Indeed, the rarity (or lack of expression) of specific 5-mer peptide modules implies the rarity (or lack of expression) of the corresponding n-mer peptide sequences (with n < 5), so possibly modulating protein compositional trends. Moreover, the data might further our understanding of the role exerted by rare pentapeptide modules as critical biological effectors in protein-protein interactions.
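Counting how often each pentapeptide occurs, and listing those that never occur, can be sketched with a sliding window over the protein sequences. This is an illustrative sketch rather than the authors' pipeline; the function names and the 20-letter alphabet constant are assumptions made here. Note that the full pentapeptide space has 20^5 = 3,200,000 members.

```python
from collections import Counter
from itertools import product
from typing import Iterable, Iterator

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def count_kmers(sequences: Iterable[str], k: int = 5) -> Counter:
    """Count every overlapping k-residue window across all sequences."""
    counts: Counter = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def absent_kmers(counts: Counter, k: int = 5) -> Iterator[str]:
    """Yield every possible k-mer over the standard alphabet that was never observed."""
    for letters in product(AMINO_ACIDS, repeat=k):
        kmer = "".join(letters)
        if kmer not in counts:
            yield kmer


# Hypothetical toy proteome of two short sequences.
penta_counts = count_kmers(["AAAAAA", "ACDEFG"])
```

For the toy input, `AAAAA` is counted twice (two overlapping windows) while `ACDEF` and `CDEFG` are each counted once.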

    A large data resource of genomic copy number variation across neurodevelopmental disorders

    Copy number variations (CNVs) are implicated across many neurodevelopmental disorders (NDDs) and contribute to their shared genetic etiology. Multiple studies have attempted to identify shared etiology among NDDs, but this is the first genome-wide CNV analysis across autism spectrum disorder (ASD), attention deficit hyperactivity disorder (ADHD), schizophrenia (SCZ), and obsessive-compulsive disorder (OCD) at once. Using microarray (Affymetrix CytoScan HD), we genotyped 2,691 subjects diagnosed with an NDD (204 SCZ, 1,838 ASD, 427 ADHD and 222 OCD) and 1,769 family members, mainly parents. We identified rare CNVs, defined as those found in <0.1% of 10,851 population control samples. We found clinically relevant CNVs (broadly defined) in 284 (10.5%) of total subjects, including 22 (10.8%) among subjects with SCZ, 209 (11.4%) with ASD, 40 (9.4%) with ADHD, and 13 (5.6%) with OCD. Among all NDD subjects, we identified 17 (0.63%) with aneuploidies and 115 (4.3%) with known genomic disorder variants. We searched further for genes impacted by different CNVs in multiple disorders. Examples of NDD-associated genes linked across more than one disorder (listed in order of occurrence frequency) are NRXN1, SEH1L, LDLRAD4, GNAL, GNG13, MKRN1, DCTN2, KNDC1, PCMTD2, KIF5A, SYNM, and long non-coding RNAs: AK127244 and PTCHD1-AS. We demonstrated that CNVs impacting the same genes could potentially contribute to the etiology of multiple NDDs. The CNVs identified will serve as a useful resource for both research and diagnostic laboratories for prioritization of variants.