Approximate Common Intervals in Multiple Genome Comparison
We consider the problem of inferring approximate common intervals of multiple genomes. Genomes are modelled as sequences of homologous gene family identifiers, and approximate common intervals represent conserved regions possibly showing rearrangements, repetitions, or insertions/deletions. This problem is already known, but existing approaches are not incremental and are somewhat limited to special cases. We adopt a simple, classical graph-based approach, where the vertices of the graph represent the exact common intervals of the sequences (i.e., regions containing the same gene set), and where edges link vertices that differ by fewer than a given number of elements (this number being a parameter). With this model, approximate gene clusters are maximal cliques of the graph: computing them can exploit known and well-designed algorithms. As a proof of concept, we applied the method to several datasets of bacterial genomes and compared two maximal clique algorithms, a static and a dynamic one. While being quite flexible, this approach opens the way to a combinatorial characterization of genomic rearrangements in terms of graph substructures.
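The graph model in this abstract can be illustrated with a small sketch. It assumes that "differ" means the size of the symmetric difference between two gene sets, and it enumerates maximal cliques with a basic Bron-Kerbosch recursion. The function names, the `max_diff` parameter, and the toy gene sets are hypothetical and chosen only for illustration; this is not the authors' implementation, and neither of the two clique algorithms they compared is reproduced here.

```python
from itertools import combinations


def build_graph(gene_sets, max_diff):
    """Link two gene sets when they differ by fewer than max_diff elements.

    Assumption: the difference is measured by the symmetric difference.
    """
    adj = {i: set() for i in range(len(gene_sets))}
    for i, j in combinations(range(len(gene_sets)), 2):
        if len(gene_sets[i] ^ gene_sets[j]) < max_diff:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already-processed vertices
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


# Toy exact common intervals, given as gene sets (hypothetical data).
gene_sets = [
    frozenset("abc"),
    frozenset("abcd"),
    frozenset("abd"),
    frozenset("xyz"),
]
adj = build_graph(gene_sets, max_diff=3)
cliques = maximal_cliques(adj)
print(cliques)
```

In this toy input the first three sets are pairwise within the threshold and form one approximate cluster, while the last set stays on its own.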
Change-point model on nonhomogeneous Poisson processes with application in copy number profiling by next-generation DNA sequencing
We propose a flexible change-point model for inhomogeneous Poisson Processes,
which arise naturally from next-generation DNA sequencing, and derive score and
generalized likelihood statistics for shifts in intensity functions. We
construct a modified Bayesian information criterion (mBIC) to guide model
selection, and point-wise approximate Bayesian confidence intervals for
assessing the confidence in the segmentation. The model is applied to DNA Copy
Number profiling with sequencing data and evaluated on simulated spike-in and
real data sets. Comment: Published at http://dx.doi.org/10.1214/11-AOAS517 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
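To give a feel for a generalized likelihood statistic for a shift in intensity, the sketch below scans binned read counts for a single change-point. It is a deliberately simplified setting: it assumes a piecewise-constant rate over equal-width bins, which is much narrower than the inhomogeneous model of the abstract, and it does not implement the mBIC or the confidence intervals. The function names and the toy counts are hypothetical.

```python
import math


def poisson_loglik(total, n_bins):
    """Maximized Poisson log-likelihood for n_bins counts summing to total.

    Terms that do not depend on the rate (the log-factorials) are dropped,
    since they cancel in the likelihood ratio.
    """
    if total == 0:
        return 0.0
    rate = total / n_bins
    return total * math.log(rate) - n_bins * rate


def best_single_changepoint(counts):
    """Return (split, stat): the split index maximizing the GLR statistic.

    Bins counts[:split] and counts[split:] get separate constant rates
    under the alternative; the null uses one rate for all bins.
    """
    n = len(counts)
    total = sum(counts)
    null = poisson_loglik(total, n)
    best_split, best_stat = None, 0.0
    left = 0
    for k in range(1, n):
        left += counts[k - 1]
        alt = poisson_loglik(left, k) + poisson_loglik(total - left, n - k)
        stat = 2.0 * (alt - null)
        if stat > best_stat:
            best_split, best_stat = k, stat
    return best_split, best_stat


# Toy binned read counts with a rate increase after the fourth bin.
counts = [2, 3, 2, 3, 9, 10, 11, 10]
split, stat = best_single_changepoint(counts)
print(split, round(stat, 2))
```

A full segmentation would apply such a scan recursively or jointly and would need a model-selection penalty to decide how many change-points to keep.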
Estimating the relative rate of recombination to mutation in bacteria from single-locus variants using composite likelihood methods
A number of studies have suggested using comparisons between DNA sequences of
closely related bacterial isolates to estimate the relative rate of
recombination to mutation for that bacterial species. We consider such an
approach which uses single-locus variants: pairs of isolates whose DNA differ
at a single gene locus. One way of deriving point estimates for the relative
rate of recombination to mutation from such data is to use composite likelihood
methods. We extend recent work in this area so as to be able to construct
confidence intervals for our estimates, without needing to resort to
computationally-intensive bootstrap procedures, and to develop a test for
whether the relative rate varies across loci. Both our test and method for
constructing confidence intervals are obtained by modeling the dependence
structure in the data, and then applying asymptotic theory regarding the
distribution of estimators obtained using a composite likelihood. We applied
these methods to multi-locus sequence typing (MLST) data from eight bacteria,
finding strong evidence for considerable rate variation in three of these:
Bacillus cereus, Enterococcus faecium and Klebsiella pneumoniae. Comment: Published at http://dx.doi.org/10.1214/14-AOAS795 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
Detecting simultaneous variant intervals in aligned sequences
Given a set of aligned sequences of independent noisy observations, we are
concerned with detecting intervals where the mean values of the observations
change simultaneously in a subset of the sequences. The intervals of changed
means are typically short relative to the length of the sequences, the subset
where the change occurs, the "carriers," can be relatively small, and the sizes
of the changes can vary from one sequence to another. This problem is motivated
by the scientific problem of detecting inherited copy number variants in
aligned DNA samples. We suggest a statistic based on the assumption that for
any given interval of changed means there is a given fraction of samples that
carry the change. We derive an analytic approximation for the false positive
error probability of a scan, which is shown by simulations to be reasonably
accurate. We show that the new method usually improves on methods that analyze
a single sample at a time and on our earlier multi-sample method, which is most
efficient when the carriers form a large fraction of the set of sequences. The
proposed procedure is also shown to be robust with respect to the assumed
fraction of carriers of the changes. Comment: Published at http://dx.doi.org/10.1214/10-AOAS400 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
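The idea of scoring an interval under an assumed fraction of carriers can be sketched as follows. The abstract does not give the statistic, so the mixture-style score used here, the assumption of zero-mean unit-variance noise, the brute-force scan, and all names and toy data are assumptions made for illustration only.

```python
import math


def interval_score(samples, start, end, carrier_fraction):
    """Mixture-style score for the interval [start, end) across all samples.

    Assumes zero-mean, unit-variance noise when no change is present.
    Each sample contributes log(1 - p + p * exp(u^2 / 2)), where u is its
    standardized interval sum and p is the assumed carrier fraction.
    """
    width = end - start
    score = 0.0
    for row in samples:
        u = sum(row[start:end]) / math.sqrt(width)
        score += math.log(
            1.0 - carrier_fraction + carrier_fraction * math.exp(u * u / 2.0)
        )
    return score


def scan(samples, max_width, carrier_fraction=0.1):
    """Return (score, (start, end)) for the best interval up to max_width."""
    length = len(samples[0])
    best = (float("-inf"), None)
    for start in range(length):
        for end in range(start + 1, min(start + max_width, length) + 1):
            s = interval_score(samples, start, end, carrier_fraction)
            if s > best[0]:
                best = (s, (start, end))
    return best


# Toy aligned sequences: only the first sample carries a shift at 3..4.
samples = [
    [0.1, -0.2, 0.0, 3.0, 3.0, 0.1, -0.1, 0.0],
    [0.0, 0.1, -0.1, 0.2, -0.2, 0.0, 0.1, -0.1],
    [-0.1, 0.0, 0.2, -0.1, 0.1, -0.2, 0.0, 0.1],
    [0.2, -0.1, 0.0, 0.1, 0.0, 0.1, -0.2, 0.0],
]
best_score, best_interval = scan(samples, max_width=4)
print(best_interval)
```

Because each sample's contribution is damped by the assumed carrier fraction, non-carriers add almost nothing to the score, which is why a single carrier can still be picked out in this toy input.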
Microevolution of Helicobacter pylori during prolonged infection of single hosts and within families
Our understanding of basic evolutionary processes in bacteria is still very limited. For example, multiple recent dating estimates are based on a universal inter-species molecular clock rate, but that rate was calibrated using estimates of geological dates that are no longer accepted. We therefore estimated the short-term rates of mutation and recombination in Helicobacter pylori by sequencing an average of 39,300 bp in 78 gene fragments from 97 isolates. These isolates included 34 pairs of sequential samples, which were sampled at intervals of 0.25 to 10.2 years. They also included single isolates from 29 individuals (average age: 45 years) from 10 families. The accumulation of sequence diversity increased with time of separation in a clock-like manner in the sequential isolates. We used Approximate Bayesian Computation to estimate the rates of mutation, recombination, mean length of recombination tracts, and average diversity in those tracts. The estimates indicate that the short-term mutation rate is 1.4×10⁻⁶ (serial isolates) to 4.5×10⁻⁶ (family isolates) per nucleotide per year and that three times as many substitutions are introduced by recombination as by mutation. The long-term mutation rate over millennia is 5–17-fold lower, partly because purifying selection removes non-synonymous mutations. Comparisons with the recent literature show that short-term mutation rates vary dramatically in different bacterial species and can span a range of several orders of magnitude.
Genome-wide inference of ancestral recombination graphs
The complex correlation structure of a collection of orthologous DNA
sequences is uniquely captured by the "ancestral recombination graph" (ARG), a
complete record of coalescence and recombination events in the history of the
sample. However, existing methods for ARG inference are computationally
intensive, highly approximate, or limited to small numbers of sequences, and,
as a consequence, explicit ARG inference is rarely used in applied population
genomics. Here, we introduce a new algorithm for ARG inference that is
efficient enough to apply to dozens of complete mammalian genomes. The key idea
of our approach is to sample an ARG of n chromosomes conditional on an ARG of
n-1 chromosomes, an operation we call "threading." Using techniques based on
hidden Markov models, we can perform this threading operation exactly, up to
the assumptions of the sequentially Markov coalescent and a discretization of
time. An extension allows for threading of subtrees instead of individual
sequences. Repeated application of these threading operations results in highly
efficient Markov chain Monte Carlo samplers for ARGs. We have implemented these
methods in a computer program called ARGweaver. Experiments with simulated data
indicate that ARGweaver converges rapidly to the true posterior distribution
and is effective in recovering various features of the ARG for dozens of
sequences generated under realistic parameters for human populations. In
applications of ARGweaver to 54 human genome sequences from Complete Genomics,
we find clear signatures of natural selection, including regions of unusually
ancient ancestry associated with balancing selection and reductions in allele
age in sites under directional selection. Preliminary results also indicate
that our methods can be used to gain insight into complex features of human
population structure, even with a noninformative prior distribution. Comment: 88 pages, 7 main figures, 22 supplementary figures. This version
contains a substantially expanded genomic data analysi
Genomic Selective Constraints in Murid Noncoding DNA
Recent work has suggested that there are many more selectively constrained, functional noncoding than coding sites in mammalian genomes. However, little is known about how selective constraint varies amongst different classes of noncoding DNA. We estimated the magnitude of selective constraint on a large dataset of mouse-rat gene orthologs and their surrounding noncoding DNA. Our analysis indicates that there are more than three times as many selectively constrained, nonrepetitive sites within noncoding DNA as in coding DNA in murids. The majority of these constrained noncoding sites appear to be located within intergenic regions, at distances greater than 5 kilobases from known genes. Our study also shows that in murids, intron length and mean intronic selective constraint are negatively correlated with intron ordinal number. Our results therefore suggest that functional intronic sites tend to accumulate toward the 5' end of murid genes. Our analysis also reveals that the mean number of selectively constrained noncoding sites varies substantially with the function of the adjacent gene. We find that, among others, developmental and neuronal genes are associated with the greatest numbers of putatively functional noncoding sites compared with genes involved in electron transport and a variety of metabolic processes. Combining our estimates of the total number of constrained coding and noncoding bases, we calculate that over twice as many deleterious mutations have occurred in intergenic regions as in known genic sequence and that the total genomic deleterious point mutation rate is 0.91 per diploid genome, per generation. This estimated rate is over twice as large as a previous estimate in murids.