Shortest prefix strings containing all subset permutations
What is the length of the shortest string consisting of elements of {1,…,n} that contains as subsequences all permutations of any k-element subset? Many authors have considered the special case where k=n. We instead consider an incremental variation on this problem first proposed by Koutas and Hu. For a fixed value of n they ask for a string such that for all values of k≤n, the prefix containing all permutations of any k-element subset as subsequences is as short as possible. The problem can also be viewed as follows: For k=1 one needs n distinct digits to find each of the n possible permutations. In going from k to k+1, one starts with a string containing all k-element permutations as subsequences, and one adds as few digits as possible to the end of the string so that the new string contains all (k+1)-element permutations. We give a new construction that gives shorter strings than the best previous construction. We then prove a weak form of lower bound for the number of digits added in successive suffixes. The lower bound proof leads to a construction that matches the bound exactly. The length of a shortest prefix string is k(n−2)+[(1/3)(k+1)]+3, for k>2. The lengths for k=1, 2 are n and 2n−1. This proves the natural conjecture that requiring the strings to be prefixes strictly increases the length of the strings required for all but the smallest values of k.
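To make the prefix condition concrete, here is a minimal brute-force sketch in Python. It is not the construction from the paper: the function names are hypothetical, the example string used in the usage note is hand-picked for n=3, and the exhaustive check is only practical for very small n and k.

```python
from itertools import permutations


def is_subsequence(pattern, text):
    """Return True if `pattern` occurs in `text` as a (not necessarily contiguous) subsequence."""
    remaining = iter(text)
    # `symbol in remaining` consumes the iterator up to the first match,
    # so later symbols must be found strictly further to the right.
    return all(symbol in remaining for symbol in pattern)


def covers_all_k_permutations(text, n, k):
    """Check that every permutation of every k-element subset of {1..n} is a subsequence of `text`."""
    alphabet = range(1, n + 1)
    return all(is_subsequence(p, text) for p in permutations(alphabet, k))


def shortest_covering_prefix(text, n, k):
    """Length of the shortest prefix of `text` covering all k-element permutations, or None."""
    for length in range(len(text) + 1):
        if covers_all_k_permutations(text[:length], n, k):
            return length
    return None
```

For the hand-picked string `[1, 2, 3, 1, 2, 1, 3]` with n=3, calling `shortest_covering_prefix` for k=1, 2, 3 reports the prefix lengths at which each level of coverage is first reached.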
Superstrings with multiplicities
A superstring of a set of words P = {s1, …, sp} is a string that contains each word of P as a substring. Given P, the well-known Shortest Linear Superstring problem (SLS) asks for a shortest superstring of P. In a variant of SLS, called Multi-SLS, each word si comes with an integer m(i), its multiplicity, that sets a constraint on its number of occurrences, and the goal is to find a shortest superstring that contains at least m(i) occurrences of si. Multi-SLS generalizes SLS and is obviously as hard to solve, but it has been studied only in special cases (with words of length 2 or with a fixed number of words). The approximability of Multi-SLS in the general case remains open. Here, we study the approximability of Multi-SLS and that of the companion problem Multi-SCCS, which asks for a shortest cyclic cover instead of a shortest superstring. First, we investigate the approximation of a greedy algorithm for maximizing the compression offered by a superstring or by a cyclic cover: the approximation ratio is 1/2 for Multi-SLS and 1 for Multi-SCCS. Then, we exhibit a linear-time approximation algorithm, Concat-Greedy, and show it achieves a ratio of 4 regarding the superstring length. This demonstrates that for both measures Multi-SLS belongs to the class of APX problems. © 2018 Yoshifumi Sakai; licensed under Creative Commons License CC-BY. Peer reviewed.
Normal, Abby Normal, Prefix Normal
A prefix normal word is a binary word with the property that no substring has
more 1s than the prefix of the same length. This class of words is important in
the context of binary jumbled pattern matching. In this paper we present
results about the number $pnw(n)$ of prefix normal words of length $n$, showing
that $pnw(n) = \Omega(2^{n - c\sqrt{n \ln n}})$ for some $c$ and
$pnw(n) = O(2^n (\ln n)^2 / n)$. We introduce efficient
algorithms for testing the prefix normal property and a "mechanical algorithm"
for computing prefix normal forms. We also include games which can be played
with prefix normal words. In these games Alice wishes to stay normal but Bob
wants to drive her "abnormal" -- we discuss which parameter settings allow
Alice to succeed.
Comment: Accepted at FUN '1
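The definition of a prefix normal word can be checked directly. The sketch below is a naive quadratic-time test written for illustration, not one of the efficient algorithms the abstract refers to; the function name is hypothetical and the word is assumed to be a Python string over the characters `0` and `1`.

```python
def is_prefix_normal(word):
    """Return True if no substring of `word` has more 1s than the prefix of the same length."""
    n = len(word)
    # ones[i] = number of 1s in word[:i], so any substring count is a difference of two entries.
    ones = [0]
    for ch in word:
        ones.append(ones[-1] + (ch == "1"))
    for length in range(1, n + 1):
        prefix_ones = ones[length]
        for start in range(n - length + 1):
            if ones[start + length] - ones[start] > prefix_ones:
                return False
    return True
```

For example, `1011` fails the test because the substring `11` has two 1s while the prefix `10` has only one.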
Around Kolmogorov complexity: basic notions and results
Algorithmic information theory studies description complexity and randomness
and is now a well known field of theoretical computer science and mathematical
logic. There are several textbooks and monographs devoted to this theory where
one can find the detailed exposition of many difficult results as well as
historical references. However, it seems that a short survey of its basic
notions and main results relating these notions to each other, is missing.
This report attempts to fill this gap and covers the basic notions of
algorithmic information theory: Kolmogorov complexity (plain, conditional,
prefix), Solomonoff universal a priori probability, notions of randomness
(Martin-Löf randomness, Mises--Church randomness), effective Hausdorff
dimension. We prove their basic properties (symmetry of information, connection
between a priori probability and prefix complexity, criterion of randomness in
terms of complexity, complexity characterization for effective dimension) and
show some applications (incompressibility method in computational complexity
theory, incompleteness theorems). It is based on the lecture notes of a course
at Uppsala University given by the author.
The Google Similarity Distance
Words and phrases acquire meaning from the way they are used in society, from
their relative semantics to other words and phrases. For computers the
equivalent of `society' is `database,' and the equivalent of `use' is `way to
search the database.' We present a new theory of similarity between words and
phrases based on information distance and Kolmogorov complexity. To fix
thoughts we use the world-wide-web as database, and Google as search engine.
The method is also applicable to other search engines and databases. This
theory is then applied to construct a method to automatically extract
similarity, the Google similarity distance, of words and phrases from the
world-wide-web using Google page counts. The world-wide-web is the largest
database on earth, and the context information entered by millions of
independent users averages out to provide automatic semantics of useful
quality. We give applications in hierarchical clustering, classification, and
language translation. We give examples to distinguish between colors and
numbers, cluster names of paintings by 17th century Dutch masters and names of
books by English novelists, the ability to understand emergencies, and primes,
and we demonstrate the ability to do a simple automatic English-Spanish
translation. Finally, we use the WordNet database as an objective baseline
against which to judge the performance of our method. We conduct a massive
randomized trial in binary classification using support vector machines to
learn categories based on our Google distance, resulting in a mean agreement
of 87% with the expert crafted WordNet categories.
Comment: 15 pages, 10 figures; changed some text/figures/notation/part of
theorem. Incorporated referees' comments. This is the final published version
up to some minor changes in the galley proof.
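The abstract does not state the distance formula itself. As a sketch, the normalized form given in the published paper is NGD(x, y) = (max(log f(x), log f(y)) − log f(x, y)) / (log N − min(log f(x), log f(y))), where f(x) and f(y) are page counts for each term, f(x, y) is the count for pages containing both, and N is the number of indexed pages. The function name, parameter names, and the treatment of zero counts below are assumptions made for illustration, and any counts passed in are the caller's own, not real search-engine data.

```python
import math


def normalized_google_distance(fx, fy, fxy, total_pages):
    """Compute the NGD from page counts fx, fy, the joint count fxy, and the index size."""
    if fx <= 0 or fy <= 0:
        raise ValueError("each term must occur on at least one page")
    if fxy <= 0:
        # Assumption: terms that never co-occur are treated as infinitely far apart.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    numerator = max(log_fx, log_fy) - log_fxy
    denominator = math.log(total_pages) - min(log_fx, log_fy)
    return numerator / denominator
```

Two terms that always appear together get distance 0, and the value grows as the joint count shrinks relative to the individual counts.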
Near-Optimal Computation of Runs over General Alphabet via Non-Crossing LCE Queries
Longest common extension queries (LCE queries) and runs are ubiquitous in
algorithmic stringology. Linear-time algorithms computing runs and
preprocessing for constant-time LCE queries have been known for over a decade.
However, these algorithms assume a linearly-sortable integer alphabet. A recent
breakthrough paper by Bannai et al. (SODA 2015) showed a link between the
two notions: all the runs in a string can be computed via a linear number of
LCE queries. The first to consider these problems over a general ordered
alphabet was Kosolobov (Inf. Process. Lett., 2016), who presented an
$O(n (\log n)^{2/3})$-time algorithm for answering $O(n)$ LCE queries. This
result was improved by Gawrychowski et al. (accepted to CPM 2016) to $O(n \log \log n)$ time. In this work we note a special non-crossing property
of LCE queries asked in the runs computation. We show that any $n$ such
non-crossing queries can be answered on-line in $O(n \alpha(n))$ time, which
yields an $O(n \alpha(n))$-time algorithm for computing runs.
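For readers unfamiliar with the terminology, the sketch below shows what a single LCE query returns and how it can test periodicity, which is the basic link between LCE queries and runs. It is a naive character-by-character scan taking linear time per query, not the algorithm from the paper; the function names are hypothetical, and it only needs equality comparison between symbols.

```python
def lce(text, i, j):
    """Length of the longest common prefix of the suffixes text[i:] and text[j:]."""
    n = len(text)
    length = 0
    while i + length < n and j + length < n and text[i + length] == text[j + length]:
        length += 1
    return length


def has_period(text, start, length, p):
    """Return True if text[start:start+length] has period p (0 < p <= length).

    A fragment has period p exactly when it agrees with itself shifted by p,
    i.e. when the LCE of positions start and start + p covers length - p symbols.
    """
    return lce(text, start, start + p) >= length - p
```

For instance, in `abaababa` the fragment `ababa` starting at index 3 has period 2, which `has_period("abaababa", 3, 5, 2)` confirms with one LCE query.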