Wavelet analysis on symbolic sequences and two-fold de Bruijn sequences
The concept of symbolic sequences plays an important role in the study of complex
systems. In this work we are interested in the ultrametric structure of the set of
cyclic sequences that arise naturally in the theory of dynamical systems. Aiming to
construct analytic and numerical methods for investigating clusters, we
introduce an operator language on the space of symbolic sequences and propose an
approach based on wavelet analysis for studying the cluster hierarchy. The
analytic power of the approach is demonstrated by deriving a formula for
counting two-fold de Bruijn sequences, an extension of the notion of
de Bruijn sequences. Possible advantages of the developed description are also
discussed in context of applied
Partitioning de Bruijn Graphs into Fixed-Length Cycles for Robot Identification and Tracking
We propose a new camera-based method of robot identification, tracking and
orientation estimation. The system utilises coloured lights mounted in a circle
around each robot to create unique colour sequences that are observed by a
camera. The number of robots that can be uniquely identified is limited by the
number of colours available, , the number of lights on each robot, , and
the number of consecutive lights the camera can see, . For a given set of
parameters, we would like to maximise the number of robots that we can use. We
model this as a combinatorial problem and show that it is equivalent to finding
the maximum number of disjoint -cycles in the de Bruijn graph
.
We provide several existence results that give the maximum number of cycles
in in various cases. For example, we give an optimal
solution when . Another construction yields many cycles in larger
de Bruijn graphs using cycles from smaller de Bruijn graphs: if
can be partitioned into -cycles, then
can be partitioned into -cycles for any divisor of
. The methods used are based on finite field algebra and the combinatorics
of words.
Comment: 16 pages, 4 figures. Accepted for publication in Discrete Applied
Mathematic
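The correspondence between uniquely identifiable colour sequences and disjoint cycles in a de Bruijn graph can be checked by hand on a small case. The sketch below is illustrative only and is not taken from the paper: the function names and the parameter name `ell` (the number of consecutive lights the camera sees) are assumptions, and it takes the vertices of the de Bruijn graph to be the words of length `ell`. A set of cyclic colour sequences is usable when no window of `ell` consecutive lights occurs twice, either within one robot or across two robots.

```python
def cyclic_windows(seq, ell):
    """Return the length-ell windows of seq read cyclically, one per start position."""
    k = len(seq)
    return [tuple(seq[(i + j) % k] for j in range(ell)) for i in range(k)]


def is_valid_assignment(sequences, ell):
    """True if every window is unique across all sequences and all rotations.

    Under the assumption stated above, this is the same as saying the
    sequences trace vertex-disjoint cycles in the de Bruijn graph.
    """
    seen = set()
    for seq in sequences:
        for window in cyclic_windows(seq, ell):
            if window in seen:
                return False
            seen.add(window)
    return True


# Two colours, four lights per robot, three lights visible at a time.
robots = [(0, 0, 0, 1), (0, 1, 1, 1)]
print(is_valid_assignment(robots, 3))  # True: the 8 windows are all distinct
```

In this small case the two 4-cycles together use all 8 binary words of length 3, so the graph is partitioned completely.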
Constant-Weight Gray Codes for Local Rank Modulation
We consider the local rank-modulation scheme in which a sliding window going over a sequence of real-valued variables induces a sequence of permutations. Local rank-modulation is a generalization of the rank-modulation scheme, which has been recently suggested as a way of storing information in flash memory.
We study constant-weight Gray codes for the local rank-modulation scheme in order to simulate conventional multi-level flash cells while retaining the benefits of rank modulation. We provide necessary conditions for the existence of cyclic and cyclic optimal Gray codes. We then specifically study codes of weight 2 and upper bound their efficiency, thus proving that there are no such asymptotically-optimal cyclic codes. In contrast, we study codes of weight 3 and efficiently construct codes which are asymptotically-optimal. We conclude with a construction of codes with asymptotically-optimal rate and weight asymptotically half the length, thus having an asymptotically-optimal charge difference between adjacent cells
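How a sliding window induces a sequence of permutations can be seen in a few lines. This is a minimal sketch under stated assumptions, not the paper's formal definition: the window size `t`, the step `s`, the cyclic wrap-around, and the convention of listing window positions from highest to lowest value are all choices made here for illustration, and the values are assumed to be distinct within each window.

```python
def induced_permutation(window):
    """List the positions of a window from highest value to lowest."""
    return tuple(sorted(range(len(window)), key=lambda i: -window[i]))


def local_permutations(values, t, s=1):
    """Permutations induced by cyclic windows of size t starting every s positions."""
    n = len(values)
    return [
        induced_permutation([values[(start + j) % n] for j in range(t)])
        for start in range(0, n, s)
    ]


charges = [0.3, 1.2, 0.7, 2.0]
print(local_permutations(charges, t=2))
# [(1, 0), (0, 1), (1, 0), (0, 1)]
```

Only the relative order of values inside each window matters, so any rescaling of `charges` that preserves the order gives the same output.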
MSPKmerCounter: A Fast and Memory Efficient Approach for K-mer Counting
A major challenge in next-generation genome sequencing (NGS) is to assemble
massive overlapping short reads that are randomly sampled from DNA fragments.
To complete the assembly, one needs to finish a fundamental task in many leading
assembly algorithms: counting the number of occurrences of k-mers (length-k
substrings in sequences). The counting results are critical for many components
in assembly (e.g. variant detection and read error correction). For large
genomes, the k-mer counting task can easily consume a huge amount of memory,
making large-scale parallel assembly on commodity servers impossible.
In this paper, we develop MSPKmerCounter, a disk-based approach, to
efficiently perform k-mer counting for large genomes using a small amount of
memory. Our approach is based on a novel technique called Minimum Substring
Partitioning (MSP). MSP breaks short reads into multiple disjoint partitions
such that each partition can be loaded into memory and processed individually.
By leveraging the overlaps among the k-mers derived from the same short read,
MSP can achieve an astonishing compression ratio so that the I/O cost can be
significantly reduced. For the task of k-mer counting, MSPKmerCounter offers a
very fast and memory-efficient solution. Experimental results on large real-life
short-read data sets demonstrate that MSPKmerCounter can achieve better
overall performance than state-of-the-art k-mer counting approaches.
MSPKmerCounter is available at http://www.cs.ucsb.edu/~yangli/MSPKmerCounte
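The counting task and the partitioning idea can be illustrated on toy data. The sketch below is a simplified reading of the approach, not the MSPKmerCounter implementation: it runs in memory, ignores reverse complements, and assumes that consecutive k-mers of a read are grouped whenever they share the same lexicographically smallest substring of length `p`. All function names and the parameter `p` are hypothetical.

```python
from collections import Counter, defaultdict


def kmers(read, k):
    """All length-k substrings of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads, k):
    """Direct in-memory k-mer counting."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def min_substring(kmer, p):
    """Lexicographically smallest length-p substring of a k-mer."""
    return min(kmer[i:i + p] for i in range(len(kmer) - p + 1))


def partition_reads(reads, k, p):
    """Merge runs of consecutive k-mers that share a minimum p-substring.

    Each run is stored once as a longer substring, keyed by that minimum
    substring, instead of storing every k-mer separately.
    """
    partitions = defaultdict(list)
    for read in reads:
        start, current = 0, None
        for i in range(len(read) - k + 1):
            m = min_substring(read[i:i + k], p)
            if current is None:
                current = m
            elif m != current:
                # Close the run covering k-mers start .. i-1.
                partitions[current].append(read[start:i - 1 + k])
                start, current = i, m
        if current is not None:
            partitions[current].append(read[start:])
    return partitions


def count_via_partitions(reads, k, p):
    """Count k-mers one partition at a time and combine the results."""
    total = Counter()
    for chunks in partition_reads(reads, k, p).values():
        total.update(count_kmers(chunks, k))
    return total


reads = ["ACGTACGT", "TTGACA"]
print(count_kmers(reads, 4) == count_via_partitions(reads, 4, 2))  # True
```

Because every k-mer occurrence falls in exactly one run, counting partition by partition gives the same totals as direct counting, while each partition can be handled separately.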
Cellular Probabilistic Automata - A Novel Method for Uncertainty Propagation
We propose a novel density based numerical method for uncertainty propagation
under certain partial differential equation dynamics. The main idea is to
translate them into objects that we call cellular probabilistic automata and to
evolve the latter. The translation is achieved by state discretization as in
set oriented numerics and the use of the locality concept from cellular
automata theory. We develop the method using the example of initial value
uncertainties under deterministic dynamics and prove a consistency result. As
an application we discuss arsenate transportation and adsorption in drinking
water pipes and compare our results to Monte Carlo computations
HYPA: Efficient Detection of Path Anomalies in Time Series Data on Networks
The unsupervised detection of anomalies in time series data has important
applications in user behavioral modeling, fraud detection, and cybersecurity.
Anomaly detection has, in fact, been extensively studied in categorical
sequences. However, we often have access to time series data that represent
paths through networks. Examples include transaction sequences in financial
networks, click streams of users in networks of cross-referenced documents, or
travel itineraries in transportation networks. To reliably detect anomalies, we
must account for the fact that such data contain a large number of independent
observations of paths constrained by a graph topology. Moreover, the
heterogeneity of real systems rules out frequency-based anomaly detection
techniques, which do not account for highly skewed edge and degree statistics.
To address this problem, we introduce HYPA, a novel framework for the
unsupervised detection of anomalies in large corpora of variable-length
temporal paths in a graph. HYPA provides an efficient analytical method to
detect paths with anomalous frequencies that result from nodes being traversed
in unexpected chronological order.
Comment: 11 pages with 8 figures and supplementary material. To appear at SIAM
Data Mining (SDM 2020
Rates of DNA Sequence Profiles for Practical Values of Read Lengths
A recent study by one of the authors has demonstrated the importance of
profile vectors in DNA-based data storage. We provide exact values and lower
bounds on the number of profile vectors for finite values of alphabet size ,
read length , and word length . Consequently, we demonstrate that for
and , the number of profile vectors is at least
with very close to one. In addition to enumeration
results, we provide a set of efficient encoding and decoding algorithms for
each of two particular families of profile vectors
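For very small parameters the number of profile vectors can be found by brute force. The sketch below is illustrative only and is not one of the paper's algorithms: it assumes a profile vector records how many times each substring of a fixed length occurs in a word, and the names `q` (alphabet size), `n` (word length) and `ell` (substring length) are labels chosen here.

```python
from collections import Counter
from itertools import product


def profile_vector(word, q, ell):
    """Occurrence counts of every length-ell substring, in lexicographic order."""
    counts = Counter(tuple(word[i:i + ell]) for i in range(len(word) - ell + 1))
    return tuple(counts[gram] for gram in product(range(q), repeat=ell))


def count_profile_vectors(q, n, ell):
    """Number of distinct profile vectors over all q**n words of length n."""
    return len({profile_vector(w, q, ell) for w in product(range(q), repeat=n)})


print(profile_vector((0, 0, 1, 0), 2, 2))  # (1, 1, 1, 0)
print(count_profile_vectors(2, 4, 2))      # 12
```

Several words can share a profile vector (here 0010, 0100 and 1001 all give the same counts), which is why the 16 binary words of length 4 yield only 12 vectors. Exhaustive enumeration grows as `q**n`, so it is only practical for tiny cases.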