Some Relationships Between Sequences and Their Kmer Profiles
This paper explores kmer profiles in bioinformatics and two of their applications: first as a model for the reads used in genome assembly, and second as a convenient representation of DNA sequences. Kmer profiles are simply unordered collections of the fixed-length substrings (of length k) of DNA sequences; they resemble an idealized form of the input that genome assemblers receive, and they have been used in the literature as a fast way to approximate the otherwise expensive edit distance. The obvious question is the choice of k. Using the theory of metric embedding, de Bruijn assembly, and to some extent algebra, the familiar conclusion for genome assembly is recovered: k should be as large as permitted. The conclusion for edit distance approximation is more subtle. Small k loses nice mathematical properties while retaining good computational ones. Large k has good mathematical properties (with a proper metric distortion) but becomes computationally expensive due to the curse of dimensionality.
Bachelor of Scienc
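To make the notion of a kmer profile concrete, the following is a minimal Python sketch, not the paper's own code. It builds the profile as a multiset of length-k substrings and compares two profiles with an L1 distance. The function names `kmer_profile` and `profile_l1_distance` are hypothetical, and the L1 distance is an assumed stand-in for whichever profile distance the paper actually analyses.

```python
from collections import Counter


def kmer_profile(seq, k):
    """Return the multiset (Counter) of all length-k substrings of seq.

    Order is discarded, which matches the description of a kmer profile
    as an unordered collection of fixed-length substrings.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # A sequence shorter than k has no kmers, so the profile is empty.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def profile_l1_distance(p, q):
    """L1 distance between two profiles (an assumed choice of distance).

    Counter returns 0 for missing keys, so kmers present in only one
    profile contribute their full count.
    """
    return sum(abs(p[kmer] - q[kmer]) for kmer in p.keys() | q.keys())


if __name__ == "__main__":
    a = kmer_profile("ACGTACGT", 3)
    b = kmer_profile("ACGTACGA", 3)
    print(dict(a))
    print(profile_l1_distance(a, b))
```

The profile has at most 4^k distinct keys for DNA, which is one way to see why large k runs into the curse of dimensionality mentioned in the abstract.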
Complex dynamics emerging in Rule 30 with majority memory
In cellular automata with memory, the unchanged maps of the conventional
cellular automata are applied to cells endowed with memory of their past states
in some specified interval. We implement Rule 30 automata with a majority
memory and show that using the memory function we can transform quasi-chaotic
dynamics of classical Rule 30 into domains of travelling structures with
predictable behaviour. We analyse morphological complexity of the automata and
classify dynamics of gliders (particles, self-localizations) in memory-enriched
Rule 30. We provide formal ways of encoding and classifying glider dynamics
using de Bruijn diagrams, soliton reactions and quasi-chemical representations.
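The mechanism described above, applying the unchanged Rule 30 map to cells that first summarise their recent past by a majority vote, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the names `rule30`, `majority` and `evolve` are hypothetical, the lattice is assumed to be a ring (periodic boundary), and ties in the majority vote are assumed to resolve to the most recent state.

```python
def rule30(left, center, right):
    """Classical Rule 30 local map: left XOR (center OR right)."""
    return left ^ (center | right)


def majority(history, i):
    """Majority state of cell i over the rows in history.

    Assumption: a tie is resolved in favour of the most recent state.
    """
    ones = sum(row[i] for row in history)
    zeros = len(history) - ones
    if ones == zeros:
        return history[-1][i]
    return 1 if ones > zeros else 0


def evolve(initial, steps, tau=3):
    """Evolve a ring of cells for `steps` generations with memory depth tau.

    At each step every cell is replaced by the majority of its last tau
    states (fewer while the history is still short), and the unchanged
    Rule 30 map is applied to these majority states.
    """
    rows = [list(initial)]
    n = len(initial)
    for _ in range(steps):
        window = rows[-tau:]
        s = [majority(window, i) for i in range(n)]
        rows.append(
            [rule30(s[(i - 1) % n], s[i], s[(i + 1) % n]) for i in range(n)]
        )
    return rows


if __name__ == "__main__":
    start = [0] * 20 + [1] + [0] * 20
    for row in evolve(start, 20, tau=3):
        print("".join("#" if cell else "." for cell in row))
```

With `tau=1` the memory window holds only the current row, so the sketch reduces to classical Rule 30; larger values of `tau` let one compare the two regimes side by side.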
HYPA: Efficient Detection of Path Anomalies in Time Series Data on Networks
The unsupervised detection of anomalies in time series data has important
applications in user behavioral modeling, fraud detection, and cybersecurity.
Anomaly detection has, in fact, been extensively studied in categorical
sequences. However, we often have access to time series data that represent
paths through networks. Examples include transaction sequences in financial
networks, click streams of users in networks of cross-referenced documents, or
travel itineraries in transportation networks. To reliably detect anomalies, we
must account for the fact that such data contain a large number of independent
observations of paths constrained by a graph topology. Moreover, the
heterogeneity of real systems rules out frequency-based anomaly detection
techniques, which do not account for highly skewed edge and degree statistics.
To address this problem, we introduce HYPA, a novel framework for the
unsupervised detection of anomalies in large corpora of variable-length
temporal paths in a graph. HYPA provides an efficient analytical method to
detect paths with anomalous frequencies that result from nodes being traversed
in unexpected chronological order.
Comment: 11 pages with 8 figures and supplementary material. To appear at SIAM
Data Mining (SDM 2020
Optimal Watermark Embedding and Detection Strategies Under Limited Detection Resources
An information-theoretic approach is proposed to watermark embedding and
detection under limited detector resources. First, we consider the attack-free
scenario under which asymptotically optimal decision regions in the
Neyman-Pearson sense are proposed, along with the optimal embedding rule.
Later, we explore the case of zero-mean i.i.d. Gaussian covertext distribution
with unknown variance under the attack-free scenario. For this case, we propose
a lower bound on the exponential decay rate of the false-negative probability
and prove that the optimal embedding and detecting strategy is superior to the
customary linear, additive embedding strategy in the exponential sense.
Finally, these results are extended to the case of memoryless attacks and
general worst case attacks. Optimal decision regions and embedding rules are
offered, and the worst attack channel is identified.
Comment: 36 pages, 5 figures. Revised version. Submitted to IEEE Transactions
on Information Theor