28 research outputs found

    DendroBlast: approximate phylogenetic trees in the absence of multiple sequence alignments

    Get PDF
    The rapidly growing availability of genome information has created considerable demand for both fast and accurate phylogenetic inference algorithms. We present a novel method called DendroBLAST for reconstructing phylogenetic dendrograms/trees from protein sequences using BLAST. This method differs from other methods by incorporating a simple model of sequence evolution to test the effect of introducing sequence changes on the reliability of the bipartitions in the inferred tree. Using realistic simulated sequence data we demonstrate that this method produces phylogenetic trees that are more accurate than other commonly-used distance based methods though not as accurate as maximum likelihood methods from good quality multiple sequence alignments. In addition to tests on simulated data, we use DendroBLAST to generate input trees for a supertree reconstruction of the phylogeny of the Archaea. This independent analysis produces an approximate phylogeny of the Archaea that has both high precision and recall when compared to previously published analysis of the same dataset using conventional methods. Taken together these results demonstrate that approximate phylogenetic trees can be produced in the absence of multiple sequence alignments, and we propose that these trees will provide a platform for improving and informing downstream bioinformatic analysis. A web implementation of the DendroBLAST method is freely available for use at http://www.dendroblast.com/

    Comparison study on k-word statistical measures for protein: From sequence to 'sequence space'

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Many proposed statistical measures can efficiently compare protein sequence to further infer protein structure, function and evolutionary information. They share the same idea of using <it>k</it>-word frequencies of protein sequences. Given a protein sequence, the information on its related protein sequences hasn't been used for protein sequence comparison until now. This paper proposed a scheme to construct protein 'sequence space' which was associated with protein sequences related to the given protein, and the performances of statistical measures were compared when they explored the information on protein 'sequence space' or not. This paper also presented two statistical measures for protein: <it>gre.k </it>(generalized relative entropy) and <it>gsm.k </it>(gapped similarity measure).</p> <p>Results</p> <p>We tested statistical measures based on protein 'sequence space' or not with three data sets. This not only offers the systematic and quantitative experimental assessment of these statistical measures, but also naturally complements the available comparison of statistical measures based on protein sequence. Moreover, we compared our statistical measures with alignment-based measures and the existing statistical measures. The experiments were grouped into two sets. The first one, performed via ROC (Receiver Operating Curve) analysis, aims at assessing the intrinsic ability of the statistical measures to discriminate and classify protein sequences. The second set of the experiments aims at assessing how well our measure does in phylogenetic analysis. Based on the experiments, several conclusions can be drawn and, from them, novel valuable guidelines for the use of protein 'sequence space' and statistical measures were obtained.</p> <p>Conclusion</p> <p>Alignment-based measures have a clear advantage when the data is high redundant. The more efficient statistical measure is the novel <it>gsm.k </it>introduced by this article, the <it>cos.k </it>followed. When the data becomes less redundant, <it>gre.k </it>proposed by us achieves a better performance, but all the other measures perform poorly on classification tasks. Almost all the statistical measures achieve improvement by exploring the information on 'sequence space' as word's length increases, especially for less redundant data. The reasonable results of phylogenetic analysis confirm that <it>Gdis.k </it>based on 'sequence space' is a reliable measure for phylogenetic analysis. In summary, our quantitative analysis verifies that exploring the information on 'sequence space' is a promising way to improve the abilities of statistical measures for protein comparison.</p

    Classification of Protein Kinases on the Basis of Both Kinase and Non-Kinase Regions

    Get PDF
    BACKGROUND: Protein phosphorylation is a generic way to regulate signal transduction pathways in all kingdoms of life. In many organisms, it is achieved by the large family of Ser/Thr/Tyr protein kinases which are traditionally classified into groups and subfamilies on the basis of the amino acid sequence of their catalytic domains. Many protein kinases are multi-domain in nature but the diversity of the accessory domains and their organization are usually not taken into account while classifying kinases into groups or subfamilies. METHODOLOGY: Here, we present an approach which considers amino acid sequences of complete gene products, in order to suggest refinements in sets of pre-classified sequences. The strategy is based on alignment-free similarity scores and iterative Area Under the Curve (AUC) computation. Similarity scores are computed by detecting common patterns between two sequences and scoring them using a substitution matrix, with a consistent normalization scheme. This allows us to handle full-length sequences, and implicitly takes into account domain diversity and domain shuffling. We quantitatively validate our approach on a subset of 212 human protein kinases. We then employ it on the complete repertoire of human protein kinases and suggest few qualitative refinements in the subfamily assignment stored in the KinG database, which is based on catalytic domains only. Based on our new measure, we delineate 37 cases of potential hybrid kinases: sequences for which classical classification based entirely on catalytic domains is inconsistent with the full-length similarity scores computed here, which implicitly consider multi-domain nature and regions outside the catalytic kinase domain. We also provide some examples of hybrid kinases of the protozoan parasite Entamoeba histolytica. CONCLUSIONS: The implicit consideration of multi-domain architectures is a valuable inclusion to complement other classification schemes. The proposed algorithm may also be employed to classify other families of enzymes with multi-domain architecture

    Genome-Wide Comparative Gene Family Classification

    Get PDF
    Correct classification of genes into gene families is important for understanding gene function and evolution. Although gene families of many species have been resolved both computationally and experimentally with high accuracy, gene family classification in most newly sequenced genomes has not been done with the same high standard. This project has been designed to develop a strategy to effectively and accurately classify gene families across genomes. We first examine and compare the performance of computer programs developed for automated gene family classification. We demonstrate that some programs, including the hierarchical average-linkage clustering algorithm MC-UPGMA and the popular Markov clustering algorithm TRIBE-MCL, can reconstruct manual curation of gene families accurately. However, their performance is highly sensitive to parameter setting, i.e. different gene families require different program parameters for correct resolution. To circumvent the problem of parameterization, we have developed a comparative strategy for gene family classification. This strategy takes advantage of existing curated gene families of reference species to find suitable parameters for classifying genes in related genomes. To demonstrate the effectiveness of this novel strategy, we use TRIBE-MCL to classify chemosensory and ABC transporter gene families in C. elegans and its four sister species. We conclude that fully automated programs can establish biologically accurate gene families if parameterized accordingly. Comparative gene family classification finds optimal parameters automatically, thus allowing rapid insights into gene families of newly sequenced species

    CLUSS: Clustering of protein sequences based on a new similarity measure

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>The rapid burgeoning of available protein data makes the use of clustering within families of proteins increasingly important. The challenge is to identify subfamilies of evolutionarily related sequences. This identification reveals phylogenetic relationships, which provide prior knowledge to help researchers understand biological phenomena. A good evolutionary model is essential to achieve a clustering that reflects the biological reality, and an accurate estimate of protein sequence similarity is crucial to the building of such a model. Most existing algorithms estimate this similarity using techniques that are not necessarily biologically plausible, especially for hard-to-align sequences such as proteins with different domain structures, which cause many difficulties for the alignment-dependent algorithms. In this paper, we propose a novel similarity measure based on matching amino acid subsequences. This measure, named SMS for Substitution Matching Similarity, is especially designed for application to non-aligned protein sequences. It allows us to develop a new alignment-free algorithm, named CLUSS, for clustering protein families. To the best of our knowledge, this is the first alignment-free algorithm for clustering protein sequences. Unlike other clustering algorithms, CLUSS is effective on both alignable and non-alignable protein families. In the rest of the paper, we use the term "<it>phylogenetic</it>" in the sense of "<it>relatedness of biological functions</it>".</p> <p>Results</p> <p>To show the effectiveness of CLUSS, we performed an extensive clustering on COG database. To demonstrate its ability to deal with hard-to-align sequences, we tested it on the GH2 family. In addition, we carried out experimental comparisons of CLUSS with a variety of mainstream algorithms. These comparisons were made on hard-to-align and easy-to-align protein sequences. The results of these experiments show the superiority of CLUSS in yielding clusters of proteins with similar functional activity.</p> <p>Conclusion</p> <p>We have developed an effective method and tool for clustering protein sequences to meet the needs of biologists in terms of phylogenetic analysis and prediction of biological functions. Compared to existing clustering methods, CLUSS more accurately highlights the functional characteristics of the clustered families. It provides biologists with a new and plausible instrument for the analysis of protein sequences, especially those that cause problems for the alignment-dependent algorithms.</p

    Prioritizing Health Care Strategies to Reduce Childhood Mortality

    Get PDF
    IMPORTANCE: Although child mortality trends have decreased worldwide, deaths among children younger than 5 years of age remain high and disproportionately circumscribed to sub-Saharan Africa and Southern Asia. Tailored and innovative approaches are needed to increase access, coverage, and quality of child health care services to reduce mortality, but an understanding of health system deficiencies that may have the greatest impact on mortality among children younger than 5 years is lacking. OBJECTIVE: To investigate which health care and public health improvements could have prevented the most stillbirths and deaths in children younger than 5 years using data from the Child Health and Mortality Prevention Surveillance (CHAMPS) network. DESIGN, SETTING, AND PARTICIPANTS: This cross-sectional study used longitudinal, population-based, and mortality surveillance data collected by CHAMPS to understand preventable causes of death. Overall, 3390 eligible deaths across all 7 CHAMPS sites (Bangladesh, Ethiopia, Kenya, Mali, Mozambique, Sierra Leone, and South Africa) between December 9, 2016, and December 31, 2021 (1190 stillbirths, 1340 neonatal deaths, 860 infant and child deaths), were included. Deaths were investigated using minimally invasive tissue sampling (MITS), a postmortem approach using biopsy needles for sampling key organs and fluids. MAIN OUTCOMES AND MEASURES: For each death, an expert multidisciplinary panel reviewed case data to determine the plausible pathway and causes of death. If the death was deemed preventable, the panel identified which of 10 predetermined health system gaps could have prevented the death. The health system improvements that could have prevented the most deaths were evaluated for each age group: stillbirths, neonatal deaths (aged <28 days), and infant and child deaths (aged 1 month to <5 years). RESULTS: Of 3390 deaths, 1505 (44.4%) were female and 1880 (55.5%) were male; sex was not recorded for 5 deaths. Of all deaths, 3045 (89.8%) occurred in a healthcare facility and 344 (11.9%) in the community. Overall, 2607 (76.9%) were deemed potentially preventable: 883 of 1190 stillbirths (74.2%), 1010 of 1340 neonatal deaths (75.4%), and 714 of 860 infant and child deaths (83.0%). Recommended measures to prevent deaths were improvements in antenatal and obstetric care (recommended for 588 of 1190 stillbirths [49.4%], 496 of 1340 neonatal deaths [37.0%]), clinical management and quality of care (stillbirths, 280 [23.5%]; neonates, 498 [37.2%]; infants and children, 393 of 860 [45.7%]), health-seeking behavior (infants and children, 237 [27.6%]), and health education (infants and children, 262 [30.5%]). CONCLUSIONS AND RELEVANCE: In this cross-sectional study, interventions prioritizing antenatal, intrapartum, and postnatal care could have prevented the most deaths among children younger than 5 years because 75% of deaths among children younger than 5 were stillbirths and neonatal deaths. Measures to reduce mortality in this population should prioritize improving existing systems, such as better access to antenatal care, implementation of standardized clinical protocols, and public education campaigns
    corecore