14,170 research outputs found
A Survey on Metric Learning for Feature Vectors and Structured Data
The need for appropriate ways to measure the distance or similarity between
data is ubiquitous in machine learning, pattern recognition and data mining,
but handcrafting such good metrics for specific problems is generally
difficult. This has led to the emergence of metric learning, which aims at
automatically learning a metric from data and has attracted a lot of interest
in machine learning and related fields for the past ten years. This survey
paper proposes a systematic review of the metric learning literature,
highlighting the pros and cons of each approach. We pay particular attention to
Mahalanobis distance metric learning, a well-studied and successful framework,
but additionally present a wide range of methods that have recently emerged as
powerful alternatives, including nonlinear metric learning, similarity learning
and local metric learning. Recent trends and extensions, such as
semi-supervised metric learning, metric learning for histogram data and the
derivation of generalization guarantees, are also covered. Finally, this survey
addresses metric learning for structured data, in particular edit distance
learning, and attempts to give an overview of the remaining challenges in
metric learning for the years to come.Comment: Technical report, 59 pages. Changes in v2: fixed typos and improved
presentation. Changes in v3: fixed typos. Changes in v4: fixed typos and new
method
Efficient Online String Matching through Linked Weak Factors
Online string matching is a computational problem involving the search for
patterns or substrings in a large text dataset, with the pattern and text being
processed sequentially, without prior access to the entire text. Its relevance
stems from applications in data compression, data mining, text editing, and
bioinformatics, where rapid and efficient pattern matching is crucial. Various
solutions have been proposed over the past few decades, employing diverse
techniques. Recently, weak recognition approaches have attracted increasing
attention. This paper presents Hash Chain, a new algorithm based on a robust
weak factor recognition approach that connects adjacent factors through
hashing. Despite its O(nm) complexity, the algorithm exhibits a sublinear
behavior in practice and achieves superior performance compared to the most
effective algorithms
One-Tape Turing Machine Variants and Language Recognition
We present two restricted versions of one-tape Turing machines. Both
characterize the class of context-free languages. In the first version,
proposed by Hibbard in 1967 and called limited automata, each tape cell can be
rewritten only in the first visits, for a fixed constant .
Furthermore, for deterministic limited automata are equivalent to
deterministic pushdown automata, namely they characterize deterministic
context-free languages. Further restricting the possible operations, we
consider strongly limited automata. These models still characterize
context-free languages. However, the deterministic version is less powerful
than the deterministic version of limited automata. In fact, there exist
deterministic context-free languages that are not accepted by any deterministic
strongly limited automaton.Comment: 20 pages. This article will appear in the Complexity Theory Column of
the September 2015 issue of SIGACT New
Random Access to Grammar Compressed Strings
Grammar based compression, where one replaces a long string by a small
context-free grammar that generates the string, is a simple and powerful
paradigm that captures many popular compression schemes. In this paper, we
present a novel grammar representation that allows efficient random access to
any character or substring without decompressing the string.
Let be a string of length compressed into a context-free grammar
of size . We present two representations of
achieving random access time, and either
construction time and space on the pointer machine model, or
construction time and space on the RAM. Here, is the inverse of
the row of Ackermann's function. Our representations also efficiently
support decompression of any substring in : we can decompress any substring
of length in the same complexity as a single random access query and
additional time. Combining these results with fast algorithms for
uncompressed approximate string matching leads to several efficient algorithms
for approximate string matching on grammar-compressed strings without
decompression. For instance, we can find all approximate occurrences of a
pattern with at most errors in time , where is the number of occurrences of in . Finally, we
generalize our results to navigation and other operations on grammar-compressed
ordered trees.
All of the above bounds significantly improve the currently best known
results. To achieve these bounds, we introduce several new techniques and data
structures of independent interest, including a predecessor data structure, two
"biased" weighted ancestor data structures, and a compact representation of
heavy paths in grammars.Comment: Preliminary version in SODA 201
- …