A Stochastic Model for Genomic Interspersed Duplication
Mutation processes such as point mutation, insertion, deletion, and duplication (including tandem and interspersed duplication) have an important role in evolution, as they lead to genomic diversity, and thus to phenotypic variation. In this work, we study the expressive power of interspersed duplication, i.e., its ability to generate diversity, via a simple but fundamental stochastic model, where the length and the location of the subsequence that is duplicated and the point of insertion of the copy are chosen randomly. In contrast to combinatorial models, where the goal is to determine the set of possible outcomes regardless of their likelihood, in stochastic systems, we investigate the properties of the set of high-probability sequences. In particular, we provide results regarding the asymptotic behavior of frequencies of symbols and short words in a sequence evolving through interspersed duplication. The study of such systems is an important step towards the design and analysis of more realistic and sophisticated models of genomic mutation processes.
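As a rough illustration of the kind of process this abstract describes, the sketch below copies a randomly chosen substring and inserts the copy at a random position. The uniform choices of length, location, and insertion point, the `max_len` cap, and all function names are assumptions made for illustration; the abstract does not specify the distributions used in the paper.

```python
import random
from collections import Counter

def interspersed_duplication_step(seq, rng, max_len=None):
    """Copy a random substring of seq and insert the copy at a random position.

    Assumption: length, start, and insertion point are drawn uniformly.
    """
    n = len(seq)
    if max_len is None:
        max_len = n
    length = rng.randint(1, min(max_len, n))  # length of the duplicated block
    start = rng.randint(0, n - length)        # where the block begins
    insert_at = rng.randint(0, n)             # where the copy is inserted
    block = seq[start:start + length]
    return seq[:insert_at] + block + seq[insert_at:]

def simulate(seed_seq, steps, seed=0, max_len=3):
    """Apply `steps` interspersed duplications to `seed_seq`."""
    rng = random.Random(seed)
    seq = seed_seq
    for _ in range(steps):
        seq = interspersed_duplication_step(seq, rng, max_len)
    return seq

def symbol_frequencies(seq):
    """Relative frequency of each symbol in seq."""
    counts = Counter(seq)
    return {s: c / len(seq) for s, c in counts.items()}
```

Calling `symbol_frequencies(simulate("ACGT", 10000))` for several seeds gives an informal view of how symbol frequencies behave as the sequence grows; it is not a substitute for the asymptotic results the abstract refers to.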
Spectral Analysis of Guanine and Cytosine Fluctuations of Mouse Genomic DNA
We study global fluctuations of the guanine and cytosine base content (GC%)
in mouse genomic DNA using spectral analyses. Power spectra S(f) of GC%
fluctuations in all nineteen autosomal and two sex chromosomes are observed to
have the universal functional form S(f) ~ 1/f^α (α ≈ 1) over
several orders of magnitude in the frequency range 10^-7 < f < 10^-5 cycle/base,
corresponding to long-ranging GC% correlations at distances between 100 kb and
10 Mb. S(f) for higher frequencies (f > 10^-5 cycle/base) shows a flattened
power-law function with α < 1 across all twenty-one chromosomes. The
substitution of about 38% interspersed repeats does not affect the functional
form of S(f), indicating that these are not predominantly responsible for the
long-ranged multi-scale GC% fluctuations in mammalian genomes. Several
biological implications of the large-scale GC% fluctuation are discussed,
including neutral evolutionary history by DNA duplication, chromosomal bands,
spatial distribution of transcription units (genes), replication timing, and
recombination hot spots. Comment: 15 pages (figures included), 2 figures
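A minimal sketch of how a GC% power spectrum could be computed, assuming non-overlapping windows and a naive discrete Fourier transform. The window scheme, normalization, and function names are illustrative assumptions, not the procedure used in the paper.

```python
import cmath

def gc_percent_windows(seq, window):
    """GC% in consecutive non-overlapping windows of `window` bases."""
    values = []
    for i in range(0, len(seq) - window + 1, window):
        chunk = seq[i:i + window]
        gc = sum(1 for b in chunk if b in "GCgc")
        values.append(100.0 * gc / window)
    return values

def power_spectrum(values):
    """Naive O(n^2) DFT power |X_k|^2 / n for k = 1 .. n//2, mean removed."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    spectrum = []
    for k in range(1, n // 2 + 1):
        x_k = sum(c * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t, c in enumerate(centered))
        spectrum.append(abs(x_k) ** 2 / n)
    return spectrum
```

With this convention, index k corresponds to a frequency of k / (n × window) cycle/base, so the exponent α can be estimated from the slope of log S(f) against log f. For chromosome-scale data an FFT routine would replace the quadratic loop.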
The Capacity of Some Pólya String Models
We study random string-duplication systems, which we call Pólya string
models. These are motivated by DNA storage in living organisms, and certain
random mutation processes that affect their genome. Unlike previous works that
study the combinatorial capacity of string-duplication systems, or various
string statistics, this work provides exact capacity or bounds on it, for
several probabilistic models. In particular, we study the capacity of noisy
string-duplication systems, including the tandem-duplication, end-duplication,
and interspersed-duplication systems. Interesting connections are drawn between
some systems and the signature of random permutations, as well as to the beta
distribution common in population genetics.
SVIM: Structural Variant Identification using Mapped Long Reads
Motivation: Structural variants are defined as genomic variants larger than 50 bp. They have been shown to affect more bases in any given genome than SNPs or small indels. Additionally, they have great impact on human phenotype and diversity and have been linked to numerous diseases. Due to their size and association with repeats, they are difficult to detect by shotgun sequencing, especially when based on short reads. Long read, single molecule sequencing technologies like those offered by Pacific Biosciences or Oxford Nanopore Technologies produce reads with a length of several thousand base pairs. Despite the higher error rate and sequencing cost, long read sequencing offers many advantages for the detection of structural variants. Yet, available software tools still do not fully exploit the possibilities. Results: We present SVIM, a tool for the sensitive detection and precise characterization of structural variants from long read data. SVIM consists of three components for the collection, clustering and combination of structural variant signatures from read alignments. It discriminates five different variant classes including similar types, such as tandem and interspersed duplications and novel element insertions. SVIM is unique in its capability of extracting both the genomic origin and destination of duplications. It compares favorably with existing tools in evaluations on simulated data and real datasets from PacBio and Nanopore sequencing machines. Availability and implementation: The source code and executables of SVIM are available on GitHub: github.com/eldariont/svim. SVIM has been implemented in Python 3 and published on bioconda and the Python Package Index. Supplementary information: Supplementary data are available at Bioinformatics online.
Reconstruction Codes for DNA Sequences with Uniform Tandem-Duplication Errors
DNA as a data storage medium has several advantages, including far greater
data density compared to electronic media. We propose that schemes for data
storage in the DNA of living organisms may benefit from studying the
reconstruction problem, which is applicable whenever multiple reads of noisy
data are available. This strategy is uniquely suited to the medium, which
inherently replicates stored data in multiple distinct ways, caused by
mutations. We consider noise introduced solely by uniform tandem-duplication,
and utilize the relation to constant-weight integer codes in the Manhattan
metric. By bounding the intersection of the cross-polytope with hyperplanes, we
prove the existence of reconstruction codes with greater capacity than known
error-correcting codes, which we can determine analytically for any set of
parameters. Comment: 11 pages, 2 figures, LaTeX; version accepted for publication
Law of Genome Evolution Direction: Coding Information Quantity Grows
The problem of the directionality of genome evolution is studied. Based on
an analysis of the C-value paradox and the evolution of genome size, we propose
that the function-coding information quantity of a genome always grows in the
course of evolution through sequence duplication, expansion of code, and gene
transfer from outside. The function-coding information quantity of a genome
consists of two parts: p-coding information quantity, which encodes functional
protein, and n-coding information quantity, which encodes functional
elements other than amino acid sequences. Evidence for this evolutionary law
of function-coding information quantity is listed. The need for function is
the driving force for the expansion of coding information quantity,
and this expansion is the way a species achieves functional innovation
and extension. The increase of coding information quantity of
a genome is therefore a measure of newly acquired function, and it determines the
directionality of genome evolution. Comment: 16 pages
Evolution of k-mer Frequencies and Entropy in Duplication and Substitution Mutation Systems
Genomic evolution can be viewed as string-editing processes driven by mutations. An understanding of the statistical properties resulting from these mutation processes is of value in a variety of tasks related to biological sequence data, e.g., estimation of model parameters and compression. At the same time, due to the complexity of these processes, designing tractable stochastic models and analyzing them are challenging. In this paper, we study two kinds of systems, each representing a set of mutations. In the first system, tandem duplications and substitution mutations are allowed, and in the other, interspersed duplications. We provide stochastic models and, via stochastic approximation, study the evolution of substring frequencies for these two systems separately. Specifically, we show that k-mer frequencies converge almost surely and determine the limit set. Furthermore, we present a method for finding upper bounds on entropy for such systems.
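To make the first system in this abstract concrete, here is a small sketch of a tandem-duplication step, a substitution step, and a k-mer frequency count. The uniform sampling, the `max_len` cap, the default alphabet, and the function names are assumptions for illustration only and are not taken from the paper.

```python
import random
from collections import Counter

def tandem_duplication(seq, rng, max_len=3):
    """Duplicate a random substring immediately after its original position."""
    n = len(seq)
    length = rng.randint(1, min(max_len, n))
    start = rng.randint(0, n - length)
    end = start + length
    return seq[:end] + seq[start:end] + seq[end:]

def substitution(seq, rng, alphabet="ACGT"):
    """Replace one randomly chosen symbol with a random symbol from alphabet."""
    pos = rng.randrange(len(seq))
    return seq[:pos] + rng.choice(alphabet) + seq[pos + 1:]

def kmer_frequencies(seq, k):
    """Relative frequency of every k-mer that occurs in seq."""
    total = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: c / total for kmer, c in counts.items()}
```

Alternating the two mutation steps with chosen probabilities and tracking `kmer_frequencies` over time gives an empirical picture of the frequency evolution that the abstract analyzes via stochastic approximation.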
Universality of Long-Range Correlations in Expansion-Randomization Systems
We study the stochastic dynamics of sequences evolving by single site
mutations, segmental duplications, deletions, and random insertions. These
processes are relevant for the evolution of genomic DNA. They define a
universality class of non-equilibrium 1D expansion-randomization systems with
generic stationary long-range correlations in a regime of growing sequence
length. We obtain explicitly the two-point correlation function of the sequence
composition and the distribution function of the composition bias in sequences
of finite length. The characteristic exponent of these quantities is
determined by the ratio of two effective rates, which are explicitly calculated
for several specific sequence evolution dynamics of the universality class.
Depending on the value of this exponent, we find two different scaling regimes, which
are distinguished by the detectability of the initial composition bias. All
analytic results are accurately verified by numerical simulations. We also
discuss the non-stationary build-up and decay of correlations, as well as more
complex evolutionary scenarios, where the rates of the processes vary in time.
Our findings provide a possible example for the emergence of universality in
molecular biology. Comment: 23 pages, 15 figures
The capacity of some Pólya string models
We study random string-duplication systems, called Pólya string models, motivated by certain random mutation processes in the genome of living organisms. Unlike previous works that study the combinatorial capacity of string-duplication systems, or peripheral properties such as symbol frequency, this work provides exact capacity or bounds on it, for several probabilistic models. In particular, we give the exact capacity of the random tandem-duplication system and the end-duplication system, and bound the capacity of the complement tandem-duplication system. Interesting connections are drawn between the former and the beta distribution common to population genetics, as well as between the latter system and signatures of random permutations.