8,091 research outputs found

    Wavelet analysis on symbolic sequences and two-fold de Bruijn sequences

    The concept of symbolic sequences plays an important role in the study of complex systems. In this work we are interested in the ultrametric structure of the set of cyclic sequences that arise naturally in the theory of dynamical systems. Aiming to construct analytic and numerical methods for investigating clusters, we introduce an operator language on the space of symbolic sequences and propose an approach based on wavelet analysis for studying the cluster hierarchy. The analytic power of the approach is demonstrated by deriving a formula for counting two-fold de Bruijn sequences, an extension of the notion of de Bruijn sequences. Possible advantages of the developed description are also discussed in the context of applied

    Partitioning de Bruijn Graphs into Fixed-Length Cycles for Robot Identification and Tracking

    We propose a new camera-based method of robot identification, tracking and orientation estimation. The system utilises coloured lights mounted in a circle around each robot to create unique colour sequences that are observed by a camera. The number of robots that can be uniquely identified is limited by the number of colours available, $q$, the number of lights on each robot, $k$, and the number of consecutive lights the camera can see, $\ell$. For a given set of parameters, we would like to maximise the number of robots that we can use. We model this as a combinatorial problem and show that it is equivalent to finding the maximum number of disjoint $k$-cycles in the de Bruijn graph $\text{dB}(q,\ell)$. We provide several existence results that give the maximum number of cycles in $\text{dB}(q,\ell)$ in various cases. For example, we give an optimal solution when $k=q^{\ell-1}$. Another construction yields many cycles in larger de Bruijn graphs using cycles from smaller de Bruijn graphs: if $\text{dB}(q,\ell)$ can be partitioned into $k$-cycles, then $\text{dB}(q,\ell)$ can be partitioned into $tk$-cycles for any divisor $t$ of $k$. The methods used are based on finite field algebra and the combinatorics of words. Comment: 16 pages, 4 figures. Accepted for publication in Discrete Applied Mathematics.
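    The correspondence between colour sequences and cycles can be illustrated with a minimal sketch. It assumes that the vertices of $\text{dB}(q,\ell)$ are the words of length $\ell$ over $q$ colours, so that a cyclic colour sequence of length $k$ traces a $k$-cycle exactly when its $k$ cyclic windows of length $\ell$ are all distinct. The function names `windows` and `are_disjoint_cycles` are hypothetical and are not taken from the paper.

```python
def windows(seq, ell):
    """Return the k cyclic windows of length ell of a cyclic sequence of length k."""
    k = len(seq)
    return [tuple(seq[(i + j) % k] for j in range(ell)) for i in range(k)]


def are_disjoint_cycles(seqs, ell):
    """Check that every sequence traces a cycle (no repeated window) and that
    no window is shared between sequences (the cycles are vertex-disjoint)."""
    seen = set()
    for seq in seqs:
        for w in windows(seq, ell):
            if w in seen:
                return False
            seen.add(w)
    return True


# Three robots, q = 3 colours, k = 3 lights each, camera sees ell = 2 lights.
robots = [[0, 0, 1], [1, 1, 2], [2, 2, 0]]
print(are_disjoint_cycles(robots, 2))
```

    In this toy case the three sequences use all nine vertices of $\text{dB}(3,2)$, so any two adjacent lights seen by the camera identify both the robot and its orientation. The example is hand-picked for illustration and is not claimed to be the construction used in the paper.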

    Constant-Weight Gray Codes for Local Rank Modulation

    We consider the local rank-modulation scheme in which a sliding window going over a sequence of real-valued variables induces a sequence of permutations. Local rank-modulation is a generalization of the rank-modulation scheme, which has been recently suggested as a way of storing information in flash memory. We study constant-weight Gray codes for the local rank-modulation scheme in order to simulate conventional multi-level flash cells while retaining the benefits of rank modulation. We provide necessary conditions for the existence of cyclic and cyclic optimal Gray codes. We then specifically study codes of weight 2 and upper bound their efficiency, thus proving that there are no such asymptotically-optimal cyclic codes. In contrast, we study codes of weight 3 and efficiently construct codes which are asymptotically-optimal. We conclude with a construction of codes with asymptotically-optimal rate and weight asymptotically half the length, thus having an asymptotically-optimal charge difference between adjacent cells.
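    The way a sliding window induces a sequence of permutations can be sketched as follows. The sketch assumes a cyclic sequence of distinct real values, a window of `t` cells advanced by `s` cells at a time, and a convention in which each permutation lists the window positions from highest to lowest value. These conventions and the function name are assumptions made for illustration and may differ from the paper's definitions.

```python
def window_permutations(values, t, s=1):
    """For each cyclic window of t values (advanced by s), return the window
    positions ordered from the highest value to the lowest."""
    n = len(values)
    perms = []
    for start in range(0, n, s):
        window = [values[(start + j) % n] for j in range(t)]
        order = sorted(range(t), key=lambda j: window[j], reverse=True)
        perms.append(tuple(order))
    return perms


# Four cell charge levels, windows of two adjacent cells.
print(window_permutations([0.3, 0.9, 0.1, 0.5], t=2))
```

    Only the relative order of the values inside each window matters, which is the property that rank-modulation schemes rely on.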

    MSPKmerCounter: A Fast and Memory Efficient Approach for K-mer Counting

    A major challenge in next-generation genome sequencing (NGS) is to assemble massive numbers of overlapping short reads that are randomly sampled from DNA fragments. To complete the assembly, one needs to finish a fundamental task in many leading assembly algorithms: counting the number of occurrences of k-mers (length-k substrings in sequences). The counting results are critical for many components of assembly (e.g. variant detection and read error correction). For large genomes, the k-mer counting task can easily consume a huge amount of memory, making large-scale parallel assembly on commodity servers impossible. In this paper, we develop MSPKmerCounter, a disk-based approach that efficiently performs k-mer counting for large genomes using a small amount of memory. Our approach is based on a novel technique called Minimum Substring Partitioning (MSP). MSP breaks short reads into multiple disjoint partitions such that each partition can be loaded into memory and processed individually. By leveraging the overlaps among the k-mers derived from the same short read, MSP can achieve a high compression ratio so that the I/O cost is significantly reduced. For the task of k-mer counting, MSPKmerCounter offers a very fast and memory-efficient solution. Experimental results on large real-life short-read data sets demonstrate that MSPKmerCounter achieves better overall performance than state-of-the-art k-mer counting approaches. MSPKmerCounter is available at http://www.cs.ucsb.edu/~yangli/MSPKmerCounte
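    The partition-then-count idea can be sketched in a few lines. The abstract does not spell out the partitioning rule, so this sketch assumes that each k-mer is assigned to a partition keyed by its lexicographically smallest length-p substring, and that runs of consecutive k-mers sharing that key are stored as one merged string. It keeps the partitions in memory rather than on disk and ignores reverse complements. The function names and the parameter `p` are hypothetical; this is not the MSPKmerCounter implementation.

```python
from collections import Counter, defaultdict


def min_substring(kmer, p):
    """Lexicographically smallest length-p substring of a k-mer."""
    return min(kmer[i:i + p] for i in range(len(kmer) - p + 1))


def partition_read(read, k, p):
    """Split a read into runs of consecutive k-mers that share the same
    minimum p-substring; each run is stored as one merged string."""
    if len(read) < k:
        return []
    parts = []
    start, current = 0, min_substring(read[:k], p)
    for i in range(1, len(read) - k + 1):
        key = min_substring(read[i:i + k], p)
        if key != current:
            parts.append((current, read[start:i - 1 + k]))
            start, current = i, key
    parts.append((current, read[start:]))
    return parts


def count_kmers(reads, k, p):
    """Count k-mers partition by partition (partitions are in memory here)."""
    buckets = defaultdict(list)
    for read in reads:
        for key, merged in partition_read(read, k, p):
            buckets[key].append(merged)
    counts = Counter()
    for merged_strings in buckets.values():  # each bucket is processed on its own
        for m in merged_strings:
            for i in range(len(m) - k + 1):
                counts[m[i:i + k]] += 1
    return counts


print(count_kmers(["ACGTACGT", "CGTACG"], k=4, p=2))
```

    Because identical k-mers always have the same minimum p-substring, every occurrence of a given k-mer lands in the same partition, so the partitions can be counted independently. Merging adjacent k-mers into one string is what avoids storing each k-mer separately.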

    Cellular Probabilistic Automata - A Novel Method for Uncertainty Propagation

    We propose a novel density-based numerical method for uncertainty propagation under certain partial differential equation dynamics. The main idea is to translate these dynamics into objects that we call cellular probabilistic automata and to evolve the latter. The translation is achieved by state discretization, as in set-oriented numerics, and by the use of the locality concept from cellular automata theory. We develop the method using the example of initial value uncertainties under deterministic dynamics and prove a consistency result. As an application we discuss arsenate transportation and adsorption in drinking water pipes and compare our results to Monte Carlo computations.

    HYPA: Efficient Detection of Path Anomalies in Time Series Data on Networks

    The unsupervised detection of anomalies in time series data has important applications in user behavioral modeling, fraud detection, and cybersecurity. Anomaly detection has, in fact, been extensively studied in categorical sequences. However, we often have access to time series data that represent paths through networks. Examples include transaction sequences in financial networks, click streams of users in networks of cross-referenced documents, or travel itineraries in transportation networks. To reliably detect anomalies, we must account for the fact that such data contain a large number of independent observations of paths constrained by a graph topology. Moreover, the heterogeneity of real systems rules out frequency-based anomaly detection techniques, which do not account for highly skewed edge and degree statistics. To address this problem, we introduce HYPA, a novel framework for the unsupervised detection of anomalies in large corpora of variable-length temporal paths in a graph. HYPA provides an efficient analytical method to detect paths with anomalous frequencies that result from nodes being traversed in unexpected chronological order. Comment: 11 pages with 8 figures and supplementary material. To appear at SIAM Data Mining (SDM 2020).

    Rates of DNA Sequence Profiles for Practical Values of Read Lengths

    A recent study by one of the authors has demonstrated the importance of profile vectors in DNA-based data storage. We provide exact values and lower bounds on the number of profile vectors for finite values of alphabet size $q$, read length $\ell$, and word length $n$. Consequently, we demonstrate that for $q\ge 2$ and $n\le q^{\ell/2-1}$, the number of profile vectors is at least $q^{\kappa n}$ with $\kappa$ very close to one. In addition to enumeration results, we provide a set of efficient encoding and decoding algorithms for each of two particular families of profile vectors.
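    For very small parameters, the quantity being bounded can be explored by brute force. The sketch below assumes that the profile vector of a word records how many times each length-$\ell$ substring occurs in it (non-cyclically), and that the rate is $\log_q$ of the number of distinct profile vectors divided by $n$. Both conventions and all function names are assumptions made for illustration; the paper's enumeration methods are not reproduced here.

```python
import math
from collections import Counter
from itertools import product


def profile_vector(word, ell):
    """Count the occurrences of every length-ell substring of a word."""
    return Counter(tuple(word[i:i + ell]) for i in range(len(word) - ell + 1))


def count_profile_vectors(q, ell, n):
    """Number of distinct profile vectors over all q-ary words of length n."""
    profiles = {
        frozenset(profile_vector(w, ell).items())
        for w in product(range(q), repeat=n)
    }
    return len(profiles)


def rate(q, ell, n):
    """log_q(number of distinct profile vectors) / n."""
    return math.log(count_profile_vectors(q, ell, n), q) / n


print(count_profile_vectors(2, 2, 3), rate(2, 2, 3))
```

    The enumeration visits all $q^n$ words, so it is only practical for toy sizes. It does show why the count falls below $q^n$: distinct words such as 010 and 101 can share a profile vector.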