Universal sequence map (USM) of arbitrary discrete sequences
BACKGROUND: For over a decade the idea of representing biological sequences in a continuous coordinate space has maintained its appeal but not been fully realized. The basic idea is that any sequence of symbols may define trajectories in the continuous space, conserving all its statistical properties. Ideally, such a representation would allow scale-independent sequence analysis – without the context of fixed memory length. A simple example would consist of being able to infer the homology between two sequences solely by comparing the coordinates of any two homologous units. RESULTS: We have successfully identified such an iterative function for bijective mapping ψ of discrete sequences into objects of continuous state space that enable scale-independent sequence analysis. The technique, named Universal Sequence Mapping (USM), is applicable to sequences with an arbitrary length and arbitrary number of unique units and generates a representation where map distance estimates sequence similarity. The novel USM procedure is based on earlier work by these and other authors on the properties of Chaos Game Representation (CGR). The latter enables the representation of 4 unit type sequences (like DNA) as an order-free Markov Chain transition table. The properties of USM are illustrated with test data and can be verified for other data by using the accompanying web-based tool: http://bioinformatics.musc.edu/~jonas/usm/. CONCLUSIONS: USM is shown to enable a statistical mechanics approach to sequence analysis. The scale-independent representation frees sequence analysis from the need to assume a memory length in the investigation of syntactic rules.
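The CGR iteration that USM builds on can be illustrated with a minimal Python sketch. It is not the authors' USM implementation: the corner assignment, the starting point and the helper names (`cgr_coordinates`, `chebyshev_distance`) are assumptions made for this example. It shows the property the abstract relies on, namely that two sequences sharing their last k symbols end within 2^-k of each other on the map.

```python
# Assumed corner assignment for the DNA alphabet; the abstract does not fix one.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_coordinates(sequence, start=(0.5, 0.5)):
    """Return the CGR position reached after each symbol of `sequence`."""
    x, y = start
    points = []
    for symbol in sequence:
        cx, cy = CORNERS[symbol]
        # Move halfway from the current point toward the symbol's corner.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def chebyshev_distance(p, q):
    """Largest coordinate-wise difference between two map positions."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


# Two sequences that differ early but share the 4-symbol suffix "TGCA".
a = cgr_coordinates("ACGATGCA")[-1]
b = cgr_coordinates("TTTTTGCA")[-1]
shared_suffix = 4

# Each shared symbol halves the gap, so the final gap is at most 2**-4.
assert chebyshev_distance(a, b) <= 2.0 ** -shared_suffix
```

Because each step halves the distance to the previous position, the most recent symbols dominate the coordinate. This is why map distance can serve as an estimate of local sequence similarity without fixing a memory length in advance.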
Computing distribution of scale independent motifs in biological sequences
The use of Chaos Game Representation (CGR) or its generalization, Universal Sequence Maps (USM), to describe the distribution of biological sequences has been found objectionable because of the fractal structure of that coordinate system. Consequently, the investigation of the distribution of symbolic motifs at multiple scales is hampered by an inexact association between distance and sequence dissimilarity. A solution to this problem could unleash the use of iterative maps as a phase-state representation of sequences whose statistical properties can be conveniently investigated. In this study a family of kernel density functions is described that accommodates the fractal nature of iterative function representations of symbolic sequences and, consequently, enables the exact investigation of sequence motifs of arbitrary lengths in that scale-independent representation. Furthermore, the proposed kernel density includes both Markovian succession and currently used alignment-free sequence dissimilarity metrics as special solutions. Therefore, the fractal kernel described is in fact a generalization that provides a common framework for a diverse suite of sequence analysis techniques.
Efficient Boolean implementation of universal sequence maps (bUSM)
BACKGROUND: Recently, Almeida and Vinga offered a new approach for the representation of arbitrary discrete sequences, referred to as Universal Sequence Maps (USM), and discussed its applicability to genomic sequence analysis. Their work generalizes and extends Chaos Game Representation (CGR) of DNA for arbitrary discrete sequences. RESULTS: We have considered issues associated with the practical implementation of USMs and offer a variation on the algorithm that: 1) eliminates the overestimation of similar segment lengths, 2) permits the identification of arbitrarily long similar segments in the context of finite word length coordinate representations, 3) uses more computationally efficient operations, and 4) provides a simple conversion for recovering the USM coordinates. Computational performance comparisons and examples are provided. CONCLUSIONS: We have shown that the desirable properties of the USM encoding of nucleotide sequences can be retained in a practical implementation of the algorithm. In addition, the proposed implementation enables determination of local sequence identity at increased speed
Biological sequences as pictures – a generic two dimensional solution for iterated maps
BACKGROUND: Representing symbolic sequences graphically using iterated maps has enjoyed an enduring popularity since it was first proposed in Jeffrey 1990 as chaos game representation (CGR). The usefulness of this representation goes beyond the convenience of a scale independent representation. It provides a variable memory length representation of transition. This includes the representation of succession with non-integer order, which comes with the promise of generalizing Markovian formalisms. The original proposal targeted genomic sequences only but since then several generalizations have been proposed, many specifically designed to handle protein data. RESULTS: The challenge of a general solution is that of deriving a bijective transformation of symbolic sequences into bi-dimensional planes. More specifically, it requires the regular fractal nesting of polygons. A first attempt at a general solution was proposed by Fiser 1994 by using non-overlapping circles that contain the polygons. This was used as a starting point to identify a more efficient solution where the encapsulating circles can overlap without the same happening for the sequence maps which are circumscribed to fractal polygon domains. CONCLUSION: We identified the optimal inscribed packing solution for iterated maps of any biological sequence, indeed of any symbolic sequence. The new solution maintains the prized bijective mapping property and includes the Sierpinski triangle and the CGR square as particular solutions of the more encompassing formulation.
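The idea of an iterated map over polygon vertices can be sketched in Python as follows. This is only an illustration under stated assumptions, not the packing solution the abstract describes: the function names (`polygon_vertices`, `iterated_map`), the unit-circle placement of the vertices and the `ratio` parameter are all hypothetical. With `ratio=0.5`, a 3-symbol alphabet traces the Sierpinski triangle and a 4-symbol alphabet fills the CGR square (here rotated into a diamond). For larger alphabets the contraction ratio has to be chosen so that sub-polygons do not overlap, and that derivation is not reproduced here.

```python
import math


def polygon_vertices(n):
    """Vertices of a regular n-gon inscribed in the unit circle (assumed layout)."""
    return [
        (math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
        for k in range(n)
    ]


def iterated_map(sequence, alphabet, ratio=0.5):
    """Map each prefix of `sequence` to a point inside a regular polygon.

    `ratio` is the fraction of the distance covered toward the vertex of
    the current symbol. The value 0.5 is only known to avoid overlap for
    3 or 4 symbols; other alphabets need a different, smaller value.
    """
    vertices = dict(zip(alphabet, polygon_vertices(len(alphabet))))
    x, y = 0.0, 0.0  # start at the polygon centre
    trajectory = []
    for symbol in sequence:
        vx, vy = vertices[symbol]
        x += ratio * (vx - x)
        y += ratio * (vy - y)
        trajectory.append((x, y))
    return trajectory


# Three-symbol alphabet: positions fall on the Sierpinski triangle.
triangle_path = iterated_map("ABCA", "ABC")
```

Leaving the contraction ratio as a parameter keeps the sketch independent of any particular packing rule, so it can be reused with whatever ratio a given alphabet size requires.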
Enumeration of rational plane curves tangent to a smooth cubic
We use twisted stable maps to compute the number of rational degree d plane
curves having prescribed contacts to a smooth plane cubic. Comment: 27 pages, v2: typos corrected and references added
Blind Biological Sequence Denoising with Self-Supervised Set Learning
Biological sequence analysis relies on the ability to denoise the imprecise
output of sequencing platforms. We consider a common setting where a short
sequence is read out repeatedly using a high-throughput long-read platform to
generate multiple subreads, or noisy observations of the same sequence.
Denoising these subreads with alignment-based approaches often fails when too
few subreads are available or error rates are too high. In this paper, we
propose a novel method for blindly denoising sets of sequences without directly
observing clean source sequence labels. Our method, Self-Supervised Set
Learning (SSSL), gathers subreads together in an embedding space and estimates
a single set embedding as the midpoint of the subreads in both the latent and
sequence spaces. This set embedding represents the "average" of the subreads
and can be decoded into a prediction of the clean sequence. In experiments on
simulated long-read DNA data, SSSL methods denoise small reads of
subreads with 17% fewer errors and large reads of subreads with 8% fewer
errors compared to the best baseline. On a real dataset of antibody sequences,
SSSL improves over baselines on two self-supervised metrics, with a significant
improvement on difficult small reads that comprise over 60% of the test set. By
accurately denoising these reads, SSSL promises to better realize the potential
of high-throughput DNA sequencing data for downstream scientific applications.
Horocycle dynamics: new invariants and eigenform loci in the stratum H(1,1)
We study dynamics of the horocycle flow on strata of translation surfaces,
introduce new invariants for ergodic measures, and analyze the interaction of
the horocycle flow and real Rel surgeries. We use this analysis to complete and
extend results of Calta and Wortman classifying horocycle-invariant measures in
the eigenform loci. We classify the orbit-closures and prove that every orbit
is equidistributed in its orbit-closure. We also prove equidistribution
statements regarding limits of sequences of measures, some of which have
applications to counting problems. Comment: 100 pages