Engineering Parallel String Sorting
We discuss how string sorting algorithms can be parallelized on modern
multi-core shared memory machines. As a synthesis of the best sequential string
sorting algorithms and successful parallel sorting algorithms for atomic
objects, we first propose string sample sort. The algorithm makes effective use
of the memory hierarchy, uses additional word level parallelism, and largely
avoids branch mispredictions. Then we focus on NUMA architectures, and develop
parallel multiway LCP-merge and -mergesort to reduce the number of random
memory accesses to remote nodes. Additionally, we parallelize variants of
multikey quicksort and radix sort that are also useful in certain situations.
Comprehensive experiments on five current multi-core platforms are then
reported and discussed. The experiments show that our implementations scale
very well on real-world inputs and modern machines. Comment: 46 pages, extension of "Parallel String Sample Sort" arXiv:1305.115
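The LCP-merge idea mentioned in the abstract above can be illustrated with a small sequential sketch. This is not the paper's parallel multiway implementation; it is a simplified two-way merge, and the function names (`lcp_array`, `lcp_compare`, `lcp_merge`) are hypothetical. It assumes each sorted input comes with an array of longest-common-prefix (LCP) lengths between neighbours, and shows how those lengths let the merge skip characters that are already known to match.

```python
def lcp_array(strings):
    """lcp[i] = longest common prefix length of strings[i-1] and strings[i]; lcp[0] = 0."""
    out = [0] * len(strings)
    for i in range(1, len(strings)):
        a, b = strings[i - 1], strings[i]
        h = 0
        while h < len(a) and h < len(b) and a[h] == b[h]:
            h += 1
        out[i] = h
    return out


def lcp_compare(a, b, start):
    """Compare a and b, skipping the first `start` characters (known equal).

    Returns (a_le_b, h) where h is the LCP length of a and b.
    """
    h = start
    while h < len(a) and h < len(b) and a[h] == b[h]:
        h += 1
    if h == len(a):          # a is a prefix of b (or equal to it)
        return True, h
    if h == len(b):          # b is a proper prefix of a
        return False, h
    return a[h] < b[h], h


def lcp_merge(a, lcp_a, b, lcp_b):
    """Merge two sorted string lists; returns (merged, lcp array of merged)."""
    out, out_lcp = [], []
    i = j = 0
    ha = hb = 0  # LCP of the current head of a / b with the last output string
    while i < len(a) and j < len(b):
        if ha > hb:
            take_a = True            # a's head shares more with the last output
        elif ha < hb:
            take_a = False
        else:
            take_a, h = lcp_compare(a[i], b[j], ha)
            if take_a:
                hb = h               # loser's LCP with the new last output
            else:
                ha = h
        if take_a:
            out.append(a[i]); out_lcp.append(ha)
            i += 1
            if i < len(a):
                ha = lcp_a[i]        # predecessor in a is now the last output
        else:
            out.append(b[j]); out_lcp.append(hb)
            j += 1
            if j < len(b):
                hb = lcp_b[j]
    # Copy the remaining tail; only its first LCP differs from the input array.
    if i < len(a):
        out += a[i:]; out_lcp += [ha] + lcp_a[i + 1:]
    if j < len(b):
        out += b[j:]; out_lcp += [hb] + lcp_b[j + 1:]
    return out, out_lcp


a = ["apple", "apricot", "banana"]
b = ["april", "band", "bandana"]
merged, merged_lcp = lcp_merge(a, lcp_array(a), b, lcp_array(b))
```

Characters are compared only when the two heads have the same LCP with the last output string; otherwise the LCP values alone decide the order.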
A Polynomial Time Algorithm for Lossy Population Recovery
We give a polynomial time algorithm for the lossy population recovery
problem. In this problem, the goal is to approximately learn an unknown
distribution on binary strings of length $n$ from lossy samples: for some
parameter $\mu$ each coordinate of the sample is preserved with probability
$\mu$ and otherwise is replaced by a `?'. The running time and number of
samples needed for our algorithm are polynomial in $n$ and $1/\varepsilon$ for
each fixed $\mu > 0$. This improves on the algorithm of Wigderson and Yehudayoff that
runs in quasi-polynomial time for any $\mu > 0$ and the polynomial time
algorithm of Dvir et al. which was shown to work for $\mu \gtrapprox 0.30$ by
Batman et al. In fact, our algorithm also works in the more general framework
of Batman et al. in which there is no a priori bound on the size of the support
of the distribution. The algorithm we analyze is implicit in previous work; our
main contribution is to analyze the algorithm by showing (via linear
programming duality and connections to complex analysis) that a certain matrix
associated with the problem has a robust local inverse even though its
condition number is exponentially small. A corollary of our result is the first
polynomial time algorithm for learning DNFs in the restriction access model of
Dvir et al.
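The lossy sampling model described in the abstract above can be simulated in a few lines. The sketch below is not the paper's recovery algorithm. It only generates lossy samples, keeping each coordinate independently with probability `mu`, and computes a naive per-coordinate estimate from the entries that survived. All function names are hypothetical, and the support and weights in the usage lines are made-up illustrative values.

```python
import random


def lossy_sample(x, mu, rng):
    """Keep each character of x with probability mu, else replace it by '?'."""
    return "".join(c if rng.random() < mu else "?" for c in x)


def draw_lossy_samples(support, weights, mu, m, rng):
    """Draw m strings from the distribution, then erase coordinates independently."""
    strings = rng.choices(support, weights=weights, k=m)
    return [lossy_sample(x, mu, rng) for x in strings]


def estimate_marginals(samples):
    """Naive estimate of P[x_i = 1] using only non-erased entries.

    Returns None for a coordinate that was erased in every sample.
    This recovers per-coordinate marginals only, not the full distribution.
    """
    n = len(samples[0])
    est = []
    for i in range(n):
        seen = [s[i] for s in samples if s[i] != "?"]
        est.append(seen.count("1") / len(seen) if seen else None)
    return est


# Illustrative usage with made-up values.
rng = random.Random(0)
samples = draw_lossy_samples(["000", "101", "111"], [0.5, 0.3, 0.2], 0.4, 1000, rng)
marginals = estimate_marginals(samples)
```

Because erasures are independent of the underlying string, the surviving entries at a coordinate give an unbiased estimate of that coordinate's marginal. Recovering the joint distribution over whole strings is the harder problem that the abstract addresses.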
Facticity as the amount of self-descriptive information in a data set
Using the theory of Kolmogorov complexity the notion of facticity $\phi(x)$
of a string is defined as the amount of self-descriptive information it
contains. It is proved that (under reasonable assumptions: the existence of an
empty machine and the availability of a faithful index) facticity is definite,
i.e. random strings have facticity 0 and for compressible strings $0 < \phi(x)
< \frac{1}{2}|x| + O(1)$. Consequently facticity measures the tension in a data set
between structural and ad-hoc information objectively. For binary strings there
is a so-called facticity threshold that is dependent on their entropy. Strings
with facticity above this threshold have no optimal stochastic model and are
essentially computational. The shape of the facticity versus entropy plot
coincides with the well-known sawtooth curves observed in complex systems. The
notion of factic processes is discussed. This approach overcomes problems with
earlier proposals to use two-part code to define the meaningfulness or
usefulness of a data set. Comment: 10 pages, 2 figures