49 research outputs found

    Fast computation by block permanents of cumulative distribution functions of order statistics from several populations

    The joint cumulative distribution function for order statistics arising from several different populations is given in terms of the distribution functions of the populations. In the case of two populations, the computational cost of the formula is still exponential in the worst case, but it is a dramatic improvement over the general formula by Bapat and Beg. When only the joint distribution function of a fixed-size subset of the order statistics is needed, the complexity is polynomial in the case of two populations. Comment: 21 pages, 3 figures

    Predicting protein linkages in bacteria: Which method is best depends on task

    Background: Applications of computational methods for predicting protein functional linkages are increasing. In recent years, several bacteria-specific methods for predicting linkages have been developed. The four major genomic context methods are: Gene cluster, Gene neighbor, Rosetta Stone, and Phylogenetic profiles. These methods have been shown to be powerful tools and this paper provides guidelines for when each method is appropriate by exploring different features of each method and potential improvements offered by their combination. We also review many previous treatments of these prediction methods, use the latest available annotations, and offer a number of new observations. Results: Using Escherichia coli K12 and Bacillus subtilis, linkage predictions made by each of these methods were evaluated against three benchmarks: functional categories defined by COG and KEGG, known pathways listed in EcoCyc, and known operons listed in RegulonDB. Each evaluated method had strengths and weaknesses, with no one method dominating all aspects of predictive ability studied. For functional categories, as previous studies have shown, the Rosetta Stone method was individually best at detecting linkages and predicting functions among proteins with shared KEGG categories while the Phylogenetic profile method was best for linkage detection and function prediction among proteins with common COG functions. Differences in performance under COG versus KEGG may be attributable to the presence of paralogs. Better function prediction was observed when using a weighted combination of linkages based on reliability versus using a simple unweighted union of the linkage sets. For pathway reconstruction, 99 complete metabolic pathways in E. coli K12 (out of the 209 known, non-trivial pathways) and 193 pathways with 50% of their proteins were covered by linkages from at least one method. Gene neighbor was most effective individually on pathway reconstruction, with 48 complete pathways reconstructed. For operon prediction, Gene cluster predicted completely 59% of the known operons in E. coli K12 and 88% (333/418) in B. subtilis. Comparing two versions of the E. coli K12 operon database, many of the unannotated predictions in the earlier version were updated to true predictions in the later version. Using only linkages found by both Gene Cluster and Gene Neighbor improved the precision of operon predictions. Additionally, as previous studies have shown, combining features based on intergenic region and protein function improved the specificity of operon prediction. Conclusion: A common problem for computational methods is the generation of a large number of false positives that might be caused by an incomplete source of validation. By comparing two versions of a database, we demonstrated the dramatic differences on reported results. We used several benchmarks on which we have shown the comparative effectiveness of each prediction method, as well as provided guidelines as to which method is most appropriate for a given prediction task.
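The abstract above reports that keeping only linkages found by both Gene Cluster and Gene Neighbor improved operon-prediction precision. A minimal sketch of that comparison follows, assuming linkages are stored as unordered gene pairs; the gene names, linkage sets, and the `precision` helper are hypothetical illustrations, not data or code from the paper.

```python
def precision(predicted, known):
    """Fraction of predicted gene pairs that also appear in the benchmark set."""
    if not predicted:
        return 0.0
    return len(predicted & known) / len(predicted)


def as_pairs(pairs):
    """Store each linkage as an unordered pair so (a, b) equals (b, a)."""
    return {frozenset(p) for p in pairs}


# Hypothetical toy linkage sets (assumed for illustration only).
gene_cluster = as_pairs([("a", "b"), ("b", "c"), ("d", "e")])
gene_neighbor = as_pairs([("a", "b"), ("b", "c"), ("e", "f")])
known_operon_pairs = as_pairs([("a", "b"), ("b", "c")])

union_links = gene_cluster | gene_neighbor   # linkage from either method
shared_links = gene_cluster & gene_neighbor  # linkage supported by both methods

union_precision = precision(union_links, known_operon_pairs)
shared_precision = precision(shared_links, known_operon_pairs)
```

In this toy case the intersection discards the two pairs that only one method supports, so precision rises while the number of predictions falls; whether that trade-off holds on real data depends on the benchmark.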

    Biomedical Discovery Acceleration, with Applications to Craniofacial Development

    The profusion of high-throughput instruments and the explosion of new results in the scientific literature, particularly in molecular biomedicine, are both a blessing and a curse to the bench researcher. Even knowledgeable and experienced scientists can benefit from computational tools that help navigate this vast and rapidly evolving terrain. In this paper, we describe a novel computational approach to this challenge, a knowledge-based system that combines reading, reasoning, and reporting methods to facilitate analysis of experimental data. Reading methods extract information from external resources, either by parsing structured data or using biomedical language processing to extract information from unstructured data, and track knowledge provenance. Reasoning methods enrich the knowledge that results from reading by, for example, noting two genes that are annotated to the same ontology term or database entry. Reasoning is also used to combine all sources into a knowledge network that represents the integration of all sorts of relationships between a pair of genes, and to calculate a combined reliability score. Reporting methods combine the knowledge network with a congruent network constructed from experimental data and visualize the combined network in a tool that facilitates the knowledge-based analysis of that data. An implementation of this approach, called the Hanalyzer, is demonstrated on a large-scale gene expression array dataset relevant to craniofacial development. The use of the tool was critical in the creation of hypotheses regarding the roles of four genes never previously characterized as involved in craniofacial development; each of these hypotheses was validated by further experimental work.

    Molecular Profiling Reveals Prognostically Significant Subtypes of Canine Lymphoma

    We performed genome-wide gene expression analysis of 35 samples representing 6 common histologic subtypes of canine lymphoma and bioinformatics analyses to define their molecular characteristics. Three major groups were defined on the basis of gene expression profiles: (1) low-grade T-cell lymphoma, composed entirely of T-zone lymphoma; (2) high-grade T-cell lymphoma, consisting of lymphoblastic T-cell lymphoma and peripheral T-cell lymphoma not otherwise specified; and (3) B-cell lymphoma, consisting of marginal B-cell lymphoma, diffuse large B-cell lymphoma, and Burkitt lymphoma. Interspecies comparative analyses of gene expression profiles also showed that marginal B-cell lymphoma and diffuse large B-cell lymphoma in dogs and humans might represent a continuum of disease with similar drivers. The classification of these diverse tumors into 3 subgroups was prognostically significant, as the groups were directly correlated with event-free survival. Finally, we developed a benchtop diagnostic test based on expression of 4 genes that can robustly classify canine lymphomas into one of these 3 subgroups, enabling a direct clinical application for our results.

    Improving protein function prediction methods with integrated literature data

    Background: Determining the function of uncharacterized proteins is a major challenge in the post-genomic era due to the problem's complexity and scale. Identifying a protein's function contributes to an understanding of its role in the involved pathways, its suitability as a drug target, and its potential for protein modifications. Several graph-theoretic approaches predict unidentified functions of proteins by using the functional annotations of better-characterized proteins in protein-protein interaction networks. We systematically consider the use of literature co-occurrence data, introduce a new method for quantifying the reliability of co-occurrence and test how performance differs across species. We also quantify changes in performance as the prediction algorithms annotate with increased specificity. Results: We find that including information on the co-occurrence of proteins within an abstract greatly boosts performance in the Functional Flow graph-theoretic function prediction algorithm in yeast, fly and worm. This increase in performance is not simply due to the presence of additional edges since supplementing protein-protein interactions with co-occurrence data outperforms supplementing with a comparably-sized genetic interaction dataset. Through the combination of protein-protein interactions and co-occurrence data, the neighborhood around unknown proteins is quickly connected to well-characterized nodes which global prediction algorithms can exploit. Our method for quantifying co-occurrence reliability shows superior performance to the other methods, particularly at threshold values around 10% which yield the best trade off between coverage and accuracy. In contrast, the traditional way of asserting co-occurrence when at least one abstract mentions both proteins proves to be the worst method for generating co-occurrence data, introducing too many false positives. Annotating the functions with greater specificity is harder, but co-occurrence data still proves beneficial. Conclusion: Co-occurrence data is a valuable supplemental source for graph-theoretic function prediction algorithms. A rapidly growing literature corpus ensures that co-occurrence data is a readily-available resource for nearly every studied organism, particularly those with small protein interaction databases. Though arguably biased toward known genes, co-occurrence data provides critical additional links to well-studied regions in the interaction network that graph-theoretic function prediction algorithms can exploit.
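The "traditional" rule mentioned in this abstract, asserting a co-occurrence edge when at least one abstract mentions both proteins, can be sketched as below. The abstract does not describe the authors' reliability measure, so the `min_abstracts` cutoff here is only a stand-in assumption showing how a stricter rule prunes weakly supported pairs; the protein identifiers and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(abstract_mentions):
    """Count, for each protein pair, how many abstracts mention both proteins."""
    counts = Counter()
    for proteins in abstract_mentions:
        # De-duplicate and sort so each pair has a single canonical key.
        for a, b in combinations(sorted(set(proteins)), 2):
            counts[(a, b)] += 1
    return counts


def edges_with_min_support(counts, min_abstracts=1):
    """min_abstracts=1 reproduces the 'at least one abstract' rule."""
    return {pair for pair, n in counts.items() if n >= min_abstracts}


# Hypothetical input: the proteins mentioned in each of four abstracts.
abstracts = [
    ["P1", "P2"],
    ["P1", "P2", "P3"],
    ["P2", "P3"],
    ["P1", "P2"],
]

counts = cooccurrence_counts(abstracts)
traditional_edges = edges_with_min_support(counts, min_abstracts=1)
stricter_edges = edges_with_min_support(counts, min_abstracts=2)
```

A raw count cutoff is the simplest possible filter; it is used here only because it needs no information beyond what the abstract states.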

    Alu insertion polymorphisms shared by Papio baboons and Theropithecus gelada reveal an intertwined common ancestry

    © 2019 The Author(s). Background: Baboons (genus Papio) and geladas (Theropithecus gelada) are now generally recognized as close phylogenetic relatives, though morphologically quite distinct and generally classified in separate genera. Primate-specific Alu retrotransposons are well-established genomic markers for the study of phylogenetic and population genetic relationships. We previously reported a computational reconstruction of Papio phylogeny using large-scale whole genome sequence (WGS) analysis of Alu insertion polymorphisms. Recently, high coverage WGS was generated for Theropithecus gelada. The objective of this study was to apply the high-throughput poly-Detect method to computationally determine the number of Alu insertion polymorphisms shared by T. gelada and Papio, and vice versa, by each individual Papio species and T. gelada. Secondly, we performed locus-specific polymerase chain reaction (PCR) assays on a diverse DNA panel to complement the computational data. Results: We identified 27,700 Alu insertions from T. gelada WGS that were also present among six Papio species, with nearly half (12,956) remaining unfixed among 12 Papio individuals. Similarly, each of the six Papio species had species-indicative Alu insertions that were also present in T. gelada. In general, P. kindae shared more insertion polymorphisms with T. gelada than did any of the other five Papio species. PCR-based genotype data provided additional support for the computational findings. Conclusions: Our discovery that several thousand Alu insertion polymorphisms are shared by T. gelada and Papio baboons suggests a much more permeable reproductive barrier between the two genera than previously suspected. Their intertwined evolution likely involves a long history of admixture, gene flow and incomplete lineage sorting.

    Stratification of co-evolving genomic groups using ranked phylogenetic profiles

    Background: Previous methods of detecting the taxonomic origins of arbitrary sequence collections, with a significant impact to genome analysis and in particular metagenomics, have primarily focused on compositional features of genomes. The evolutionary patterns of phylogenetic distribution of genes or proteins, represented by phylogenetic profiles, provide an alternative approach for the detection of taxonomic origins, but typically suffer from low accuracy. Herein, we present rank-BLAST, a novel approach for the assignment of protein sequences into genomic groups of the same taxonomic origin, based on the ranking order of phylogenetic profiles of target genes or proteins across the reference database. Results: The rank-BLAST approach is validated by computing the phylogenetic profiles of all sequences for five distinct microbial species of varying degrees of phylogenetic proximity, against a reference database of 243 fully sequenced genomes. The approach - a combination of sequence searches, statistical estimation and clustering - analyses the degree of sequence divergence between sets of protein sequences and allows the classification of protein sequences according to the species of origin with high accuracy, allowing taxonomic classification of 64% of the proteins studied. In most cases, a main cluster is detected, representing the corresponding species. Secondary, functionally distinct and species-specific clusters exhibit different patterns of phylogenetic distribution, thus flagging gene groups of interest. Detailed analyses of such cases are provided as examples. Conclusion: Our results indicate that the rank-BLAST approach can capture the taxonomic origins of sequence collections in an accurate and efficient manner. The approach can be useful both for the analysis of genome evolution and the detection of species groups in metagenomics samples.

    Operon structure of Staphylococcus aureus

    In bacteria, gene regulation is one of the fundamental characteristics of survival, colonization and pathogenesis. Operons play a key role in regulating expression of diverse genes involved in metabolism and virulence. However, operon structures in pathogenic bacteria have been determined only by in silico approaches that are dependent on factors such as intergenic distances and terminator/promoter sequences. Knowledge of operon structures is crucial to fully understand the pathophysiology of infections. Here, transcriptome data obtained from growth curves in a defined medium were used to predict operons in Staphylococcus aureus. This unbiased approach and the use of five highly reproducible biological replicates resulted in 93.5% significantly regulated genes. These data, combined with Pearson’s correlation coefficients of the transcriptional profiles, enabled us to accurately compile 93% of the genome in operon structures. A total of 1640 genes of different functional classes were identified in operons. Interestingly, we found several operons containing virulence genes and showed synergistic effects for two complement convertase inhibitors transcribed in one operon. This is the first experimental approach to fully identify operon structures in S. aureus. It forms the basis for further in vitro regulation studies that will profoundly advance the understanding of bacterial pathophysiology in vivo.
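The abstract above uses Pearson's correlation coefficients of transcriptional profiles to assemble operons. As a rough sketch of that idea, and not the authors' pipeline, the code below groups consecutive genes whose expression profiles correlate above a cutoff. The gene names, profiles, and the `threshold=0.9` value are assumptions for illustration; strand orientation and intergenic distance, which a real analysis would likely consider, are ignored.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no correlation signal
    return cov / (sd_x * sd_y)


def group_adjacent(genes, profiles, threshold=0.9):
    """Join each gene to its upstream neighbour's group when profiles correlate."""
    if not genes:
        return []
    groups = [[genes[0]]]
    for prev, cur in zip(genes, genes[1:]):
        if pearson(profiles[prev], profiles[cur]) >= threshold:
            groups[-1].append(cur)
        else:
            groups.append([cur])
    return groups


# Hypothetical expression profiles over four time points, in genome order.
gene_order = ["g1", "g2", "g3", "g4"]
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.0, 4.0, 2.0],
}

candidate_operons = group_adjacent(gene_order, expression)
```

Here `g1`/`g2` rise together and `g3`/`g4` fall together, so the sketch splits the four genes into two candidate groups at the point where the correlation between neighbours turns negative.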

    Finding the “Dark Matter” in Human and Yeast Protein Network Prediction and Modelling

    Accurate modelling of biological systems requires a deeper and more complete knowledge about the molecular components and their functional associations than we currently have. Traditionally, new knowledge on protein associations generated by experiments has played a central role in systems modelling, in contrast to generally less trusted bio-computational predictions. However, we will not achieve realistic modelling of complex molecular systems if the current experimental designs lead to biased screenings of real protein networks and leave large, functionally important areas poorly characterised. To assess the likelihood of this, we have built comprehensive network models of the yeast and human proteomes by using a meta-statistical integration of diverse computationally predicted protein association datasets. We have compared these predicted networks against combined experimental datasets from seven biological resources at different levels of statistical significance. These eukaryotic predicted networks resemble all the topological and noise features of the experimentally inferred networks in both species, and we also show that this observation is not due to random behaviour. In addition, the topology of the predicted networks contains information on true protein associations, beyond the constitutive first order binary predictions. We also observe that most of the reliable predicted protein associations are experimentally uncharacterised in our models, constituting the hidden or “dark matter” of networks by analogy to astronomical systems. Some of this dark matter shows enrichment of particular functions and contains key functional elements of protein networks, such as hubs associated with important functional areas like the regulation of Ras protein signal transduction in human cells. Thus, characterising this large and functionally important dark matter, elusive to established experimental designs, may be crucial for modelling biological systems. In any case, these predictions provide a valuable guide to these experimentally elusive regions.

    SalmoNet, an integrated network of ten Salmonella enterica strains reveals common and distinct pathways to host adaptation

    Salmonella enterica is a prominent bacterial pathogen with implications for human and animal health. Salmonella serovars can be classified as gastro-intestinal or extra-intestinal. Genome-wide comparisons revealed that extra-intestinal strains are closer relatives of gastro-intestinal strains than of each other, indicating a parallel evolution of this trait. Given the complexity of the differences, a systems-level comparison could reveal key mechanisms enabling extra-intestinal serovars to cause systemic infections. Accordingly, in this work, we introduce a unique resource, SalmoNet, which combines manual curation, high-throughput data and computational predictions to provide an integrated network for Salmonella at the metabolic, transcriptional regulatory and protein-protein interaction levels. SalmoNet provides the networks separately for five gastro-intestinal and five extra-intestinal strains. As a multi-layered, multi-strain database containing experimental data, SalmoNet is the first dedicated network resource for Salmonella. It comprehensively contains interactions between proteins encoded in Salmonella pathogenicity islands, as well as regulatory mechanisms of metabolic processes, with the option to zoom in and analyze the interactions at specific loci in more detail. Application of SalmoNet is not limited to strain comparisons, as it also provides a Salmonella resource for biochemical network modeling, host-pathogen interaction studies, drug discovery, experimental validation of novel interactions, uncovering new pathological mechanisms from emergent properties and epidemiological studies. SalmoNet is available at http://salmonet.org