156 research outputs found

    Identifying a High Fraction of the Human Genome to be under Selective Constraint Using GERP++

    Computational efforts to identify functional elements within genomes leverage comparative sequence information by looking for regions that exhibit evidence of selective constraint. One way of detecting constrained elements is to follow a bottom-up approach by computing constraint scores for individual positions of a multiple alignment and then defining constrained elements as segments of contiguous, high-scoring nucleotide positions. Here we present GERP++, a new tool that uses maximum likelihood evolutionary rate estimation for position-specific scoring and, in contrast to previous bottom-up methods, a novel dynamic programming approach to subsequently define constrained elements. GERP++ evaluates a richer set of candidate element breakpoints and ranks them based on statistical significance, eliminating the need for biased heuristic extension techniques. Using GERP++ we identify over 1.3 million constrained elements spanning over 7% of the human genome. We predict a higher fraction than earlier estimates largely due to the annotation of longer constrained elements, which improves one-to-one correspondence between predicted elements and known functional sequences. GERP++ is an efficient and effective tool to provide both nucleotide- and element-level constraint scores within deep multiple sequence alignments.
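
The bottom-up idea described in this abstract (score each alignment position, then merge contiguous high-scoring positions into elements) can be sketched as follows. This is a minimal illustration of plain thresholding only, not GERP++'s dynamic-programming segmentation; the function name, the threshold, the minimum length, and the toy scores are all hypothetical.

```python
from typing import List, Tuple


def contiguous_elements(
    scores: List[float], threshold: float = 2.0, min_length: int = 3
) -> List[Tuple[int, int, float]]:
    """Return (start, end_exclusive, total_score) for every run of positions
    whose score is >= threshold and whose length is >= min_length."""
    elements = []
    start = None  # index where the current high-scoring run began, if any
    for i, s in enumerate(scores):
        if s >= threshold:
            if start is None:
                start = i
        elif start is not None:
            # The run ended at position i - 1; keep it only if long enough.
            if i - start >= min_length:
                elements.append((start, i, sum(scores[start:i])))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(scores) - start >= min_length:
        elements.append((start, len(scores), sum(scores[start:])))
    return elements


# Toy per-position constraint scores (made up for illustration).
toy_scores = [0.1, 2.5, 3.0, 2.2, -1.0, 2.4, 2.6, 0.3, 4.0, 4.0, 4.0, 4.0]
elements = contiguous_elements(toy_scores)
```

With these toy scores the two-position run at indices 5 and 6 is discarded by the minimum-length filter. A fixed threshold of this kind is the sort of heuristic the abstract says GERP++ replaces with significance-ranked candidate elements.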

    First anatomical network analysis of fore- and hindlimb musculoskeletal modularity in bonobos, common chimpanzees, and humans

    Studies of morphological integration and modularity, and of anatomical complexity in human evolution, typically focus on skeletal tissues. Here we provide the first network analysis of the musculoskeletal anatomy of both the fore- and hindlimbs of the two species of chimpanzee and humans. Contra long-accepted ideas, network analysis reveals that the hindlimb displays a pattern opposite to that of the forelimb: the Pan big toe is typically seen as more independently mobile, but humans are actually the ones that have a separate module exclusively related to its movements. Different fore- vs hindlimb patterns are also seen for anatomical network complexity (i.e., complexity in the arrangement of bones and muscles). For instance, the human hindlimb is as complex as that of chimpanzees, but the human forelimb is less complex than in Pan. Importantly, in contrast to the analysis of morphological integration using morphometric approaches, network analyses do not support the prediction that forelimb and hindlimb are more dissimilar in species with functionally divergent limbs such as bipedal humans.

    Viral population estimation using pyrosequencing

    The diversity of virus populations within single infected hosts presents a major difficulty for the natural immune response as well as for vaccine design and antiviral drug therapy. Recently developed pyrophosphate based sequencing technologies (pyrosequencing) can be used for quantifying this diversity by ultra-deep sequencing of virus samples. We present computational methods for the analysis of such sequence data and apply these techniques to pyrosequencing data obtained from HIV populations within patients harboring drug resistant virus strains. Our main result is the estimation of the population structure of the sample from the pyrosequencing reads. This inference is based on a statistical approach to error correction, followed by a combinatorial algorithm for constructing a minimal set of haplotypes that explain the data. Using this set of explaining haplotypes, we apply a statistical model to infer the frequencies of the haplotypes in the population via an EM algorithm. We demonstrate that pyrosequencing reads allow for effective population reconstruction by extensive simulations and by comparison to 165 sequences obtained directly from clonal sequencing of four independent, diverse HIV populations. Thus, pyrosequencing can be used for cost-effective estimation of the structure of virus populations, promising new insights into viral evolutionary dynamics and disease control strategies. Comment: 23 pages, 13 figures
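
The final step named in this abstract, inferring haplotype frequencies with an EM algorithm, can be illustrated with a deliberately simplified sketch. It assumes error-free reads, a known candidate haplotype set, and a precomputed set of compatible haplotypes for each read; none of these details, nor the function and variable names, come from the paper.

```python
from typing import Dict, List, Set


def em_haplotype_frequencies(
    read_compat: List[Set[str]], haplotypes: List[str], n_iter: int = 200
) -> Dict[str, float]:
    """Estimate haplotype frequencies from reads, where read_compat[i] is the
    set of haplotypes that read i could have originated from."""
    # Start from a uniform guess.
    freqs = {h: 1.0 / len(haplotypes) for h in haplotypes}
    for _ in range(n_iter):
        # E-step: split each read fractionally among its compatible
        # haplotypes, in proportion to the current frequency estimates.
        counts = {h: 0.0 for h in haplotypes}
        for compat in read_compat:
            total = sum(freqs[h] for h in compat)
            if total == 0.0:
                continue  # read explained by no haplotype; skip it
            for h in compat:
                counts[h] += freqs[h] / total
        # M-step: renormalise the expected counts into frequencies.
        n = sum(counts.values())
        if n == 0.0:
            break
        freqs = {h: c / n for h, c in counts.items()}
    return freqs


# Toy data (made up): 6 reads match only A, 2 match only B, 2 match both.
reads = [{"A"}] * 6 + [{"B"}] * 2 + [{"A", "B"}] * 2
estimated = em_haplotype_frequencies(reads, ["A", "B"])
```

In this toy case the ambiguous reads carry no information about the A:B ratio, so the estimate converges to the ratio among the unambiguous reads (0.75 for A). A realistic model would also account for sequencing errors and for where each read falls along the genome.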

    Detection of lineage-specific evolutionary changes among primate species

    Background: Comparison of the human genome with other primates offers the opportunity to detect evolutionary events that created the diverse phenotypes among the primate species. Because the primate genomes are highly similar to one another, methods developed for analysis of more divergent species do not always detect signs of evolutionary selection. Results: We have developed a new method, called DivE, specifically designed to find regions that have evolved either more or less rapidly than expected, for any clade within a set of very closely related species. Unlike some previous methods, DivE does not rely on rates of synonymous and nonsynonymous substitution, which enables it to detect evolutionary events in noncoding regions. We demonstrate using simulated data that DivE compares favorably to alternative methods, and we then apply DivE to the ENCODE regions in 14 primate species. We identify thousands of regions in these primates, ranging from 50 to >10000 bp in length, that appear to have experienced either constrained or accelerated rates of evolution. In particular, we detected 4942 regions that have potentially undergone positive selection in one or more primate species. Most of these regions occur outside of protein-coding genes, although we identified 20 proteins that have experienced positive selection. Conclusions: DivE provides an easy-to-use method to predict both positive and negative selection in noncoding DNA, that is particularly well-suited to detecting lineage-specific selection in large genomes.

    Local conservation scores without a priori assumptions on neutral substitution rates

    Background: Comparative genomics aims to detect signals of evolutionary conservation as an indicator of functional constraint. Surprisingly, results of the ENCODE project revealed that about half of the experimentally verified functional elements found in non-coding DNA were classified as unconstrained by computational predictions. Following this observation, it has been hypothesized that this may be partly explained by biased estimates of the neutral evolutionary rates used by existing sequence conservation metrics. All methods we are aware of rely on a comparison with the neutral rate, and conservation is estimated by measuring the deviation of a particular genomic region from this rate. Consequently, it is a reasonable assumption that inaccurate neutral rate estimates may lead to biased conservation and constraint estimates. Results: We propose a conservation signal that is produced by local maximum likelihood estimation of evolutionary parameters using an optimized sliding window, and present a Kullback-Leibler projection that allows multiple different estimated parameters to be transformed into a conservation measure. This conservation measure does not rely on assumptions about neutral evolutionary substitution rates, and few a priori assumptions on the properties of the conserved regions are imposed. We show the accuracy of our approach (KuLCons) on synthetic data and compare it to the scores generated by state-of-the-art methods (phastCons, GERP, SCONE) in an ENCODE region. We find that KuLCons is most often in agreement with the conservation/constraint signatures detected by GERP and SCONE, while qualitatively very different patterns from phastCons are observed. In contrast to standard methods, KuLCons can be extended to more complex evolutionary models, e.g. by taking insertion and deletion events into account, and the corresponding results show that scores obtained under this model can diverge significantly from scores using the simpler model. Conclusion: Our results suggest that discriminating among the different degrees of conservation is possible without making assumptions about neutral rates. We find, however, that it cannot be expected to discover considerably different constraint regions than GERP and SCONE. Consequently, we conclude that the reported discrepancies between experimentally verified functional and computationally identified constraint elements are likely not to be explained by biased neutral rate estimates.
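
For orientation, the Kullback-Leibler divergence, from which a Kullback-Leibler projection takes its name, has the standard textbook form below for two discrete distributions P and Q. The nucleotide alphabet is an assumption made here for illustration; the abstract does not specify how KuLCons maps its estimated parameters onto this quantity.

```latex
D_{\mathrm{KL}}(P \,\|\, Q) \;=\; \sum_{x \in \{A, C, G, T\}} P(x) \, \log \frac{P(x)}{Q(x)}
```

The divergence is zero exactly when P and Q agree on every symbol, and it is not symmetric in its two arguments.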

    Anatomical Network Comparison of Human Upper and Lower, Newborn and Adult, and Normal and Abnormal Limbs, with Notes on Development, Pathology and Limb Serial Homology vs. Homoplasy

    How do the various anatomical parts (modules) of the animal body evolve into very different integrated forms (integration) yet still function properly without decreasing the individual's survival? This long-standing question remains unanswered for multiple reasons, including lack of consensus about conceptual definitions and approaches, as well as a reasonable bias toward the study of hard tissues over soft tissues. A major difficulty concerns the non-trivial technical hurdles of addressing this problem, specifically the lack of quantitative tools to quantify and compare variation across multiple disparate anatomical parts and tissue types. In this paper we apply for the first time a powerful new quantitative tool, Anatomical Network Analysis (AnNA), to examine and compare in detail the musculoskeletal modularity and integration of normal and abnormal human upper and lower limbs. In contrast to other morphological methods, the strength of AnNA is that it allows efficient and direct empirical comparisons among body parts with even vastly different architectures (e.g. upper and lower limbs) and diverse or complex tissue composition (e.g. bones, cartilages and muscles), by quantifying the spatial organization of these parts (their topological patterns relative to each other) using tools borrowed from network theory. Our results reveal similarities between the skeletal networks of the normal newborn/adult upper limb vs. lower limb, with the exception of the shoulder vs. pelvis. However, when muscles are included, the overall musculoskeletal network organization of the upper limb is strikingly different from that of the lower limb, particularly that of the more proximal structures of each limb. Importantly, the obtained data provide further evidence to be added to the vast amount of paleontological, gross anatomical, developmental, molecular and embryological data recently obtained that contradicts the long-standing dogma that the upper and lower limbs are serial homologues. In addition, the AnNA of the limbs of a trisomy 18 human fetus strongly supports Pere Alberch's ill-named "logic of monsters" hypothesis, and contradicts the commonly accepted idea that birth defects often lead to lower integration (i.e. more parcellation) of anatomical structures.

    Analysis of Transposon Interruptions Suggests Selection for L1 Elements on the X Chromosome

    It has been hypothesised that the massive accumulation of L1 transposable elements on the X chromosome is due to their function in X inactivation, and that the accumulation of Alu elements near genes is adaptive. We tested the possible selective advantage of these two transposable element (TE) families with a novel method, interruption analysis. In mammalian genomes, a large number of TEs interrupt other TEs due to the high overall abundance and age of repeats, and these interruptions can be used to test whether TEs are selectively neutral. Interruptions of TEs that are beneficial for the host are expected to be deleterious and therefore underrepresented compared with interruptions of neutral TEs. We found that L1 elements in the regions of the X chromosome that contain the majority of the inactivated genes are significantly less frequently interrupted than on the autosomes, while L1s near genes that escape inactivation are interrupted with higher frequency, supporting the hypothesis that L1s on the X chromosome play a role in its inactivation. In addition, we show that TEs are less frequently interrupted in introns than in intergenic regions, probably due to selection against the expansion of introns, but the insertion pattern of Alus is comparable to other repeats.
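
The comparison at the heart of interruption analysis (are elements in one genomic compartment interrupted less often than in another?) can be framed as a test on a 2x2 table of counts. The abstract does not say which test the authors used; the sketch below swaps in a one-sided Fisher's exact test as an assumption, and the function name and the counts are invented for illustration.

```python
from math import comb


def fisher_exact_less(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test p-value for the 2x2 table
        [[a, b],    e.g. X-linked L1s:  interrupted, intact
         [c, d]]         autosomal L1s: interrupted, intact
    Returns the probability, with all margins fixed, of seeing `a` or fewer
    counts in the top-left cell (i.e. under-representation in row 1)."""
    row1, col1, n = a + b, a + c, a + b + c + d
    lowest = max(0, row1 + col1 - n)  # smallest feasible top-left count
    denom = comb(n, col1)
    # Sum hypergeometric probabilities for top-left counts lowest..a.
    tail = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(lowest, a + 1)
    )
    return tail / denom


# Toy counts (made up): 1 of 10 elements interrupted in one compartment,
# 11 of 14 in the other.
p_value = fisher_exact_less(1, 9, 11, 3)
```

A small p-value here would indicate that interruptions are under-represented in the first row. Genome-scale counts would call for a library implementation and for control of confounders such as element age and local repeat density.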

    Multiple organism algorithm for finding ultraconserved elements

    Background: Ultraconserved elements are nucleotide or protein sequences with 100% identity (no mismatches, insertions, or deletions) in the same organism or between two or more organisms. Studies indicate that these conserved regions are associated with micro RNAs, mRNA processing, development and transcription regulation. The identification and characterization of these elements among genomes is necessary for the further understanding of their functionality. Results: We describe an algorithm and provide freely available software which can find all of the ultraconserved sequences between genomes of multiple organisms. Our algorithm takes a combinatorial approach that finds all sequences without requiring the genomes to be aligned. The algorithm is significantly faster than BLAST and is designed to handle very large genomes efficiently. We ran our algorithm on several large comparative analyses to evaluate its effectiveness; one compared 17 vertebrate genomes where we find 123 ultraconserved elements longer than 40 bps shared by all of the organisms, and another compared the human body louse, Pediculus humanus humanus, against itself and select insects to find thousands of non-coding, potentially functional sequences. Conclusion: Whole genome comparative analysis for multiple organisms is both feasible and desirable in our search for biological knowledge. We argue that bioinformatic programs should be forward thinking by assuming analysis on multiple (and possibly large) genomes in the design and implementation of algorithms. Our algorithm shows how a compromise design with a trade-off of disk space versus memory space allows for efficient computation while only requiring modest computer resources, and at the same time providing benefits not available with other software.

    Systematic identification of conserved motif modules in the human genome

    Background: The identification of motif modules, groups of multiple motifs frequently occurring in DNA sequences, is one of the most important tasks necessary for annotating the human genome. Current approaches to identifying motif modules are often restricted to searches within promoter regions or rely on multiple genome alignments. However, the promoter regions only account for a limited number of locations where transcription factor binding sites can occur, and multiple genome alignments often cannot align binding sites with their true counterparts because of the short and degenerate nature of these transcription factor binding sites. Results: To identify motif modules systematically, we developed a computational method for the entire non-coding regions around human genes that does not rely upon the use of multiple genome alignments. First, we selected orthologous DNA blocks approximately 1 kilobase in length based on discontiguous sequence similarity. Next, we scanned the conserved segments in these blocks using known motifs in the TRANSFAC database. Finally, a frequent pattern mining technique was applied to identify motif modules within these blocks. In total, with a false discovery rate cutoff of 0.05, we predicted 3,161,839 motif modules, 90.8% of which are supported by various forms of functional evidence. Compared with experimental data from 14 ChIP-seq experiments, on average, our methods predicted 69.6% of the ChIP-seq peaks with TFBSs of multiple TFs. Our findings also show that many motif modules have distance preference and order preference among the motifs, which further supports the functionality of these predictions. Conclusions: Our work provides a large-scale prediction of motif modules in mammals, which will facilitate the understanding of gene regulation in a systematic way.

    How accurately is ncRNA aligned within whole-genome multiple alignments?

    Background: Multiple alignment of homologous DNA sequences is of great interest to biologists since it provides a window into evolutionary processes. At present, the accuracy of whole-genome multiple alignments, particularly in noncoding regions, has not been thoroughly evaluated. Results: We evaluate the alignment accuracy of certain noncoding regions using noncoding RNA alignments from Rfam as a reference. We inspect the MULTIZ 17-vertebrate alignment from the UCSC Genome Browser for all the human sequences in the Rfam seed alignments. In particular, we find 638 instances of chimeric and partial alignments to human noncoding RNA elements, of which at least 225 can be improved by straightforward means. As a byproduct of our procedure, we predict many novel instances of known ncRNA families that are suggested by the alignment. Conclusion: MULTIZ does a fairly accurate job of aligning these genomes in these difficult regions. However, our experiments indicate that better alignments exist in some regions.