
    Logarithmic gap costs decrease alignment accuracy

    BACKGROUND: Studies on the distribution of indel sizes have consistently found that they obey a power law. This finding has led several scientists to propose that logarithmic gap costs, G(k) = a + c ln k, are more biologically realistic than affine gap costs, G(k) = a + bk, for sequence alignment. Since quick and efficient affine costs are currently the most popular way to globally align sequences, the goal of this paper is to determine whether logarithmic gap costs improve alignment accuracy enough to merit their use over the faster affine gap costs. RESULTS: A group of simulated sequence pairs was globally aligned using affine, logarithmic, and log-affine gap costs. Alignment accuracy was calculated by comparing the resulting alignments to the actual alignments of the sequence pairs. Gap costs were then compared based on average alignment accuracy. Log-affine gap costs had the best accuracy, followed closely by affine gap costs, while logarithmic gap costs performed poorly. Subsequently, a model was developed to explain the results. CONCLUSION: In contrast to initial expectations, logarithmic gap costs produce poor alignments and are actually not implied by the power-law behavior of gap sizes, given typical match and mismatch costs. Furthermore, affine gap costs not only produce accurate alignments but are also good approximations to biologically realistic gap costs. This work provides added confidence for the biological relevance of existing alignment algorithms.
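
The two gap-cost formulas quoted in the abstract can be compared numerically. The following is a minimal sketch, not the paper's code: the function names and the parameter values a, b, and c are illustrative assumptions.

```python
import math


def affine_gap_cost(k, a, b):
    """Affine cost G(k) = a + b*k for a gap of length k >= 1."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return a + b * k


def logarithmic_gap_cost(k, a, c):
    """Logarithmic cost G(k) = a + c*ln(k) for a gap of length k >= 1."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return a + c * math.log(k)


if __name__ == "__main__":
    # Illustrative parameters only; they are not taken from the paper.
    a, b, c = 2.0, 0.5, 1.0
    for k in (1, 10, 100):
        print(k, affine_gap_cost(k, a, b), logarithmic_gap_cost(k, a, c))
```

With any positive b and c, the logarithmic cost grows far more slowly than the affine cost as k increases, so long gaps are penalized much less under the logarithmic scheme.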

    The multiple personalities of Watson and Crick strands

    Background: In genetics it is customary to refer to double-stranded DNA as containing a "Watson strand" and a "Crick strand." However, there seems to be no consensus in the literature on the exact meaning of these two terms, and the many usages contradict one another as well as the original definition. Here, we review the history of the terminology and suggest retaining a single sense that is currently the most useful and consistent. Proposal: The Saccharomyces Genome Database defines the Watson strand as the strand which has its 5'-end at the short-arm telomere and the Crick strand as its complement. The Watson strand is always used as the reference strand in their database. Using this as the basis of our standard, we recommend that Watson and Crick strand terminology only be used in the context of genomics. When possible, the centromere or other genomic feature should be used as a reference point, dividing the chromosome into two arms of unequal lengths. Under our proposal, the Watson strand is standardized as the strand whose 5'-end is on the short arm of the chromosome, and the Crick strand as the one whose 5'-end is on the long arm. Furthermore, the Watson strand should be retained as the reference (plus) strand in a genomic database. This usage not only makes the determination of Watson and Crick unambiguous, but also allows unambiguous selection of reference strands for genomics. Reviewers: This article was reviewed by John M. Logsdon, Igor B. Rogozin (nominated by Andrey Rzhetsky), and William Martin.

    A Composite Genome Approach to Identify Phylogenetically Informative Data from Next-Generation Sequencing

    We have developed a novel method to rapidly obtain homologous genomic data for phylogenetics directly from next-generation sequencing reads without the use of a reference genome. This software, called SISRS, avoids the time-consuming steps of de novo whole genome assembly, genome-genome alignment, and annotation. In simulations, SISRS is able to identify large numbers of loci containing variable sites with phylogenetic signal. For genomic data from apes, SISRS identified thousands of variable sites, from which we produced an accurate phylogeny. Finally, we used SISRS to identify phylogenetic markers that we used to estimate the phylogeny of placental mammals. We recovered phylogenies from multiple datasets that were consistent with previous conflicting estimates of the relationships among mammals. SISRS is open source and freely available at https://github.com/rachelss/SISRS. Comment: 12 pages plus36 figures, 1 supplementary table, 3 supplementary figures

    The effect of the dispersal kernel on isolation-by-distance in a continuous population

    Under models of isolation-by-distance, population structure is determined by the probability of identity-by-descent between pairs of genes according to the geographic distance between them. Well established analytical results indicate that the relationship between geographical and genetic distance depends mostly on the neighborhood size of the population, $N_b = 4\pi\sigma^2 D_e$, which represents a standardized measure of dispersal. To test this prediction, we model local dispersal of haploid individuals on a two-dimensional torus using four dispersal kernels: Rayleigh, exponential, half-normal and triangular. When neighborhood size is held constant, the distributions produce similar patterns of isolation-by-distance, confirming predictions. Considering this, we propose that the triangular distribution is the appropriate null distribution for isolation-by-distance studies. Under the triangular distribution, dispersal is uniform within an area of $4\pi\sigma^2$ (i.e. the neighborhood area), which suggests that the common description of neighborhood size as a measure of a local panmictic population is valid for popular families of dispersal distributions. We further show how to draw from the triangular distribution efficiently and argue that it should be utilized in other studies in which computational efficiency is important. Comment: 18 pages (main); 4 pages (supp
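
The neighborhood-size formula and the "uniform within an area of 4πσ²" description can be illustrated with a short sketch. This is not the authors' implementation. It assumes the neighborhood area is a disc of radius 2σ and draws offsets by a standard inverse-CDF method, which may differ from the efficient sampler the abstract mentions. The function names are hypothetical.

```python
import math
import random


def neighborhood_size(sigma, density):
    """Neighborhood size N_b = 4 * pi * sigma^2 * D_e."""
    return 4.0 * math.pi * sigma ** 2 * density


def draw_uniform_disc_offset(sigma, rng=random):
    """Draw an (x, y) dispersal offset uniformly from a disc of radius 2*sigma.

    Assumption: the neighborhood area 4*pi*sigma^2 is a disc. Taking the
    square root of a uniform variate makes points uniform over the area,
    so the density of the dispersal distance grows linearly with distance.
    """
    radius = 2.0 * sigma * math.sqrt(rng.random())
    angle = 2.0 * math.pi * rng.random()
    return radius * math.cos(angle), radius * math.sin(angle)


if __name__ == "__main__":
    rng = random.Random(1)
    print(neighborhood_size(sigma=1.0, density=2.0))
    print(draw_uniform_disc_offset(1.0, rng))
```

Under this assumption the variance of the offset along each axis equals σ², which is consistent with σ acting as the per-axis dispersal parameter in the neighborhood-size formula.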

    A family-based probabilistic method for capturing de novo mutations from high-throughput short-read sequencing data

    Recent advances in high-throughput DNA sequencing technologies and associated statistical analyses have enabled in-depth analysis of whole-genome sequences. As this technology is applied to a growing number of individual human genomes, entire families are now being sequenced. Information contained within the pedigree of a sequenced family can be leveraged when inferring the donors' genotypes. The presence of a de novo mutation within the pedigree is indicated by a violation of Mendelian inheritance laws. Here, we present a method for probabilistically inferring genotypes across a pedigree using high-throughput sequencing data and producing the posterior probability of de novo mutation at each genomic site examined. This framework can be used to disentangle the effects of germline and somatic mutational processes and to simultaneously estimate the effect of sequencing error and the initial genetic variation in the population from which the founders of the pedigree arise. This approach is examined in detail through simulations and areas for method improvement are noted. By applying this method to data from members of a well-defined nuclear family with accurate pedigree information, the stage is set to make the most direct estimates of the human mutation rate to date
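
The statement that a de novo mutation shows up as a violation of Mendelian inheritance can be made concrete with a small check on called genotypes. This sketch is only a hard-call illustration with a hypothetical function name. The method described in the abstract is probabilistic and works from sequencing data, which this example does not attempt to reproduce.

```python
from itertools import product


def is_mendelian_consistent(child, mother, father):
    """Return True if the child's unordered diploid genotype can be formed
    by taking one allele from the mother and one from the father.

    Genotypes are two-character strings such as "AG" (an assumed encoding).
    """
    target = sorted(child)
    return any(sorted(pair) == target for pair in product(mother, father))


if __name__ == "__main__":
    # A child carrying an allele that neither parent has is a candidate de novo site.
    print(is_mendelian_consistent("AG", "AA", "GG"))
    print(is_mendelian_consistent("AC", "AA", "GG"))
```

In real data, an apparent violation can also come from sequencing or genotyping error, which is why the abstract frames the problem in terms of posterior probabilities.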

    Estimating error models for whole genome sequencing using mixtures of Dirichlet-multinomial distributions

    Motivation: Accurate identification of genotypes is an essential part of the analysis of genomic data, including in identification of sequence polymorphisms, linking mutations with disease and determining mutation rates. Biological and technical processes that adversely affect genotyping include copy-number-variation, paralogous sequences, library preparation, sequencing error and reference-mapping biases, among others. Results: We modeled the read depth for all data as a mixture of Dirichlet-multinomial distributions, resulting in significant improvements over previously used models. In most cases the best model comprised two distributions. The major-component distribution is similar to a binomial distribution with low error and low reference bias. The minor-component distribution is overdispersed with higher error and reference bias. We also found that sites fitting the minor component are enriched for copy number variants and low complexity regions, which can produce erroneous genotype calls. By removing sites that do not fit the major component, we can improve the accuracy of genotype calls. Availability and Implementation: Methods and data files are available at https://github.com/CartwrightLab/WuEtAl2017/ (doi:10.5281/zenodo.256858). Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
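
For readers unfamiliar with the model named in the abstract, the sketch below evaluates the standard Dirichlet-multinomial probability mass function and the log-likelihood of a finite mixture of such components. It is a generic illustration: the function names, the parameterization by concentration parameters, and any parameter values are assumptions, and the authors' fitting procedure is not shown.

```python
import math


def dirichlet_multinomial_log_pmf(counts, alphas):
    """Log probability of a vector of counts under a Dirichlet-multinomial
    distribution with positive concentration parameters `alphas`."""
    n = sum(counts)
    alpha_sum = sum(alphas)
    # Multinomial coefficient n! / prod(x_i!).
    log_coeff = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    # Ratio Gamma(A) / Gamma(n + A), where A is the sum of the alphas.
    log_norm = math.lgamma(alpha_sum) - math.lgamma(n + alpha_sum)
    # Product over categories of Gamma(x_i + alpha_i) / Gamma(alpha_i).
    log_terms = sum(
        math.lgamma(x + a) - math.lgamma(a) for x, a in zip(counts, alphas)
    )
    return log_coeff + log_norm + log_terms


def mixture_log_likelihood(counts, weights, alpha_sets):
    """Log-likelihood of one count vector under a mixture of
    Dirichlet-multinomial components (weights must sum to 1)."""
    logs = [
        math.log(w) + dirichlet_multinomial_log_pmf(counts, alphas)
        for w, alphas in zip(weights, alpha_sets)
    ]
    top = max(logs)
    # Log-sum-exp for numerical stability.
    return top + math.log(sum(math.exp(v - top) for v in logs))
```

A component with large concentration parameters behaves almost like a plain multinomial, while small values give the overdispersion that the abstract attributes to the minor component.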

    A phylogenomic approach reveals a low somatic mutation rate in a long-lived plant.

    Somatic mutations can have important effects on the life history, ecology, and evolution of plants, but the rate at which they accumulate is poorly understood and difficult to measure directly. Here, we develop a method to measure somatic mutations in individual plants and use it to estimate the somatic mutation rate in a large, long-lived, phenotypically mosaic Eucalyptus melliodora tree. Despite being 100 times larger than Arabidopsis, this tree has a per-generation mutation rate only ten times greater, which suggests that this species may have evolved mechanisms to reduce the mutation rate per unit of growth. This adds to a growing body of evidence that illuminates the correlated evolutionary shifts in mutation rate and life history in plants

    Husimi Transform of an Operator Product

    It is shown that the series derived by Mizrahi, giving the Husimi transform (or covariant symbol) of an operator product, is absolutely convergent for a large class of operators. In particular, the generalized Liouville equation, describing the time evolution of the Husimi function, is absolutely convergent for a large class of Hamiltonians. By contrast, the series derived by Groenewold, giving the Weyl transform of an operator product, is often only asymptotic, or even undefined. The result is used to derive an alternative way of expressing expectation values in terms of the Husimi function. The advantage of this formula is that it applies in many of the cases where the anti-Husimi transform (or contravariant symbol) is so highly singular that it fails to exist as a tempered distribution. Comment: AMS-LaTeX, 13 pages

    PICS-Ord: unlimited coding of ambiguous regions by pairwise identity and cost scores ordination

    Background: We present a novel method to encode ambiguously aligned regions in fixed multiple sequence alignments by 'Pairwise Identity and Cost Scores Ordination' (PICS-Ord). The method works via ordination of sequence identity or cost score matrices by means of Principal Coordinates Analysis (PCoA). After identification of ambiguous regions, the method computes pairwise distances as sequence identities or cost scores, ordinates the resulting distance matrix by means of PCoA, and encodes the principal coordinates as ordered integers. Three biological and 100 simulated datasets were used to assess the performance of the new method. Results: Including ambiguous regions coded by means of PICS-Ord increased topological accuracy, resolution, and bootstrap support in real biological and simulated datasets compared to the alternative of excluding such regions from the analysis a priori. In terms of accuracy, PICS-Ord performs as well as or better than previously available methods of ambiguous region coding (e.g., INAASE), with the advantages of a practically unlimited alignment size, increased analytical speed, and the possibility of analyzing PICS-Ord scores together with DNA data in a partitioned maximum likelihood model. Conclusions: Advantages of PICS-Ord over step matrix-based ambiguous region coding with INAASE include a practically unlimited number of OTUs and seamless integration of PICS-Ord codes into phylogenetic datasets, as well as the increased speed of phylogenetic analysis. Contrary to word- and frequency-based methods, PICS-Ord maintains the advantage of pairwise sequence alignment to derive distances, and the method is flexible with respect to the calculation of distance scores. In addition to distance and maximum parsimony, PICS-Ord codes can be analyzed in a Bayesian or maximum likelihood framework. RAxML (version 7.2.6 or higher, developed for this study) allows up to 32-state ordered or unordered characters. A GTR, MK, or ORDERED model can be applied to analyse the PICS-Ord codes partition, with GTR performing slightly better than MK and ORDERED. Availability: An implementation of the PICS-Ord algorithm is available from http://scit.us/projects/ngila/wiki/PICS-Ord. It requires both the statistical software R (http://www.r-project.org) and the alignment software Ngila (http://scit.us/projects/ngila).

    Systematic reviews of observational studies of risk of thrombosis and bleeding in urological surgery (ROTBUS) : introduction and methodology

    Background: Pharmacological thromboprophylaxis in the peri-operative period involves a trade-off between reduction in venous thromboembolism (VTE) and an increase in bleeding. Baseline risks, in the absence of prophylaxis, for VTE and bleeding are known to vary widely between urological procedures, but their magnitude is highly uncertain. Systematic reviews and meta-analyses addressing baseline risks are uncommon, needed, and require methodological innovation. In this article, we describe the rationale and methods for a series of systematic reviews of the risks of symptomatic VTE and bleeding requiring reoperation in urological surgery. Methods/design: We searched MEDLINE from January 1, 2000 until April 10, 2014 for observational studies reporting on symptomatic VTE or bleeding after urological procedures. Additional studies known to experts and studies cited in relevant review articles were added. Teams of two reviewers independently assessed articles for eligibility, evaluated risk of bias, and abstracted data. We derived best estimates of risk from the median estimates among studies rated at the lowest risk of bias. The primary endpoints were 30-day post-operative risk estimates of symptomatic VTE and bleeding requiring reoperation, stratified by procedure and patient risk factors. Discussion: This series of systematic reviews will inform clinicians and patients regarding the trade-off between VTE prevention and bleeding. Our work advances standards in systematic reviews of surgical complications, including assessment of risk of bias, criteria for arriving at best estimates of risk (including modeling of timing of events and dealing with suboptimal data reporting), dealing with subgroups at higher and lower risk of bias, and use of the Grading of Recommendations Assessment, Development and Evaluation (GRADE) approach to rate certainty in estimates of risk. The results will be incorporated in the upcoming European Association of Urology Guideline on Thromboprophylaxis. Systematic review registration: PROSPERO CRD42014010342
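
The rule of taking the median estimate among the studies rated at the lowest risk of bias can be sketched as follows. The data layout, the function name, and the numeric bias score (lower meaning lower risk of bias) are assumptions made for illustration, and the review's additional modeling steps are not represented.

```python
from statistics import median


def best_estimate(studies):
    """Median risk estimate among the studies with the lowest bias score.

    `studies` is a list of (risk_estimate, bias_score) pairs, where a lower
    bias_score means a lower risk of bias (an assumed encoding).
    """
    lowest = min(score for _, score in studies)
    return median(risk for risk, score in studies if score == lowest)


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    example = [(0.01, 1), (0.03, 1), (0.10, 2), (0.02, 1)]
    print(best_estimate(example))
```

Using the median instead of a pooled mean keeps a single outlying study from dominating the estimate, which matters when only a few low-bias studies are available.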