Space-efficient Feature Maps for String Alignment Kernels
String kernels are attractive tools for analyzing string data.
Among them, alignment kernels are known for their high prediction accuracy in
string classification when combined with SVMs in various
applications. However, alignment kernels have a crucial drawback: they
scale poorly due to their quadratic computational complexity in the number of
input strings, which limits large-scale applications in practice. We address
this limitation by presenting the first approximation of string alignment kernels,
which we call space-efficient feature maps for edit distance with moves
(SFMEDM), by leveraging a metric embedding named edit sensitive parsing (ESP)
and feature maps (FMs) of random Fourier features (RFFs) for large-scale string
analyses. The original FMs for RFFs consume a huge amount of memory
proportional to the dimension d of input vectors and the dimension D of output
vectors, which prohibits their use in large-scale applications. We present novel
space-efficient feature maps (SFMs) of RFFs for a space reduction from O(dD) of
the original FMs to O(d) of SFMs with a theoretical guarantee with respect to
concentration bounds. We experimentally test SFMEDM on its ability to train
SVMs for large-scale string classification on various massive string datasets,
and we demonstrate the superior performance of SFMEDM with respect to
prediction accuracy, scalability and computational efficiency.
Comment: Full version of the ICDM'19 paper
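For reference, below is a minimal sketch of the standard random Fourier feature map the abstract builds on; the d x D frequency matrix W it stores is exactly the O(dD) memory cost that the proposed SFMs reduce to O(d). This is generic RFF code for the Gaussian kernel, not the paper's SFMEDM implementation, and all names are illustrative.

```python
import numpy as np

def rff_feature_map(X, D, gamma, seed=0):
    """Standard random Fourier features approximating the Gaussian
    kernel k(x, y) = exp(-gamma * ||x - y||^2). Note the full d x D
    frequency matrix W: this is the O(dD) memory cost that the
    paper's space-efficient feature maps (SFMs) avoid."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies sampled from the Fourier transform of the RBF kernel.
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Inner products of the mapped vectors approximate the exact kernel,
# so a linear SVM on Z behaves like a kernel SVM on X.
X = np.random.default_rng(1).normal(size=(5, 10))
Z = rff_feature_map(X, D=2000, gamma=0.5)
approx = Z @ Z.T
exact = np.exp(-0.5 * np.sum((X[:, None] - X[None]) ** 2, axis=-1))
print(np.max(np.abs(approx - exact)))  # small, shrinking as D grows
```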
Kernel methods in machine learning
We review machine learning methods employing positive definite kernels. These
methods formulate learning and estimation problems in a reproducing kernel
Hilbert space (RKHS) of functions defined on the data domain, expanded in terms
of a kernel. Working in linear spaces of functions has the benefit of
facilitating the construction and analysis of learning algorithms while at the
same time allowing large classes of functions. The latter include nonlinear
functions as well as functions defined on nonvectorial data. We cover a wide
range of methods, ranging from binary classifiers to sophisticated methods for
estimation with structured data.
Comment: Published in the Annals of Statistics (http://www.imstat.org/aos/) by
the Institute of Mathematical Statistics (http://www.imstat.org) at
http://dx.doi.org/10.1214/009053607000000677
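To make the RKHS formulation concrete, here is a small illustrative sketch (not taken from the survey) of kernel ridge regression: by the representer theorem, the learned function is expanded in terms of the kernel evaluated at the training points, which is the pattern the survey describes.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Positive definite Gaussian kernel k(x, y) = exp(-gamma ||x - y||^2)."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def kernel_ridge_fit(X, y, gamma=1.0, lam=1e-2):
    """The RKHS solution has the form f(x) = sum_i alpha_i k(x_i, x),
    with coefficients alpha = (K + lam * I)^{-1} y."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, alpha, X_new, gamma=1.0):
    # Evaluate the kernel expansion at the new points.
    return rbf_kernel(X_new, X_train, gamma) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
alpha = kernel_ridge_fit(X, y)
print(kernel_ridge_predict(X, alpha, np.array([[0.0]])))  # close to sin(0) = 0
```

Nonlinearity enters only through the kernel, which is what lets the same algorithm handle nonvectorial data once a suitable kernel is defined.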
Identification of functionally related enzymes by learning-to-rank methods
Enzyme sequences and structures are routinely used in the biological sciences
as queries to search for functionally related enzymes in online databases. To
this end, one usually starts from some notion of similarity, comparing two
enzymes by looking for correspondences in their sequences, structures or
surfaces. For a given query, the search operation results in a ranking of the
enzymes in the database, from very similar to dissimilar enzymes, while
information about the biological function of annotated database enzymes is
ignored.
In this work we show that rankings of that kind can be substantially improved
by applying kernel-based learning algorithms. This approach enables the
detection of statistical dependencies between similarities of the active cleft
and the biological function of annotated enzymes. This is in contrast to
search-based approaches, which do not take annotated training data into
account. Similarity measures based on the active cleft are known to outperform
sequence-based or structure-based measures under certain conditions. We
consider the Enzyme Commission (EC) classification hierarchy for obtaining
annotated enzymes during the training phase. The results of a set of sizeable
experiments indicate a consistent and significant improvement for a set of
similarity measures that exploit information about small cavities in the
surface of enzymes.
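The abstract does not spell out the authors' model, but the pairwise learning-to-rank idea it relies on can be sketched as follows. This RankSVM-style toy (all names and data hypothetical) learns a weight vector over similarity features so that items with higher relevance labels score above items with lower ones.

```python
import numpy as np

def ranksvm_sgd(X, y, steps=2000, lr=0.01, lam=0.01, seed=0):
    """Minimal pairwise learning-to-rank by SGD on the hinge loss over
    ordered pairs: we want w . (x_i - x_j) >= 1 whenever item i has a
    higher relevance label than item j. Rows of X could be kernel
    similarity features; y holds graded relevance labels."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    pairs = [(i, j) for i in range(len(y)) for j in range(len(y)) if y[i] > y[j]]
    for _ in range(steps):
        i, j = pairs[rng.integers(len(pairs))]
        diff = X[i] - X[j]
        # Subgradient of lam/2 ||w||^2 + max(0, 1 - w . diff).
        grad = lam * w - (diff if w @ diff < 1 else 0)
        w -= lr * grad
    return w

# Toy data in which relevance grows with the first feature.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = (X[:, 0] > 0).astype(int) + (X[:, 0] > 1).astype(int)
w = ranksvm_sgd(X, y)
print(np.argsort(-(X @ w))[:5])  # indices of the top-ranked items
```

Unlike a pure similarity search, the weights are fitted on annotated training data, which is how statistical dependencies between active-cleft similarity and biological function can be exploited.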
On the String Consensus Problem and the Manhattan Sequence Consensus Problem
In the Manhattan Sequence Consensus problem (MSC problem) we are given k
integer sequences, each of length l, and we are to find an integer sequence x
of length l (called a consensus sequence), such that the maximum Manhattan
distance of x from each of the input sequences is minimized. For binary
sequences Manhattan distance coincides with Hamming distance, hence in this
case the string consensus problem (also called string center problem or
closest string problem) is a special case of MSC. Our main result is a
practically efficient O(l)-time algorithm solving MSC for k <= 5 sequences.
The practicality of our algorithm has been verified experimentally. It improves
upon the quadratic algorithm by Amir et al. (SPIRE 2012) for the string
consensus problem for k = 5 binary strings. As in Amir's algorithm, we use a
column-based framework. We replace the implied general integer linear
programming by its easy special cases, due to combinatorial properties of the
MSC for k <= 5. We also show that for a general parameter k any instance
can be reduced in linear time to a kernel of size k!, so the problem is
fixed-parameter tractable. Nevertheless, for k >= 4 this is still too large
for any naive solution to be feasible in practice.
Comment: accepted to SPIRE 2014
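For intuition, here is a small brute-force check of the MSC objective (exponential in the sequence length, so nothing like the paper's O(l)-time algorithm); run on 0/1 sequences it also illustrates the binary case, where Manhattan and Hamming distances coincide.

```python
import numpy as np
from itertools import product

def manhattan(a, b):
    return int(np.sum(np.abs(np.asarray(a) - np.asarray(b))))

def msc_brute_force(seqs, lo=0, hi=1):
    """Exhaustively search integer sequences over [lo, hi] for a
    consensus x minimizing max_i manhattan(x, seqs[i]). Only for
    sanity-checking tiny instances."""
    l = len(seqs[0])
    best, best_r = None, float("inf")
    for x in product(range(lo, hi + 1), repeat=l):
        r = max(manhattan(x, s) for s in seqs)
        if r < best_r:
            best, best_r = list(x), r
    return best, best_r

# Binary inputs: Manhattan = Hamming, i.e. the closest string problem.
seqs = [[0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
print(msc_brute_force(seqs))  # a consensus of radius 2
```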
Positive Definite Kernels in Machine Learning
This survey is an introduction to positive definite kernels and the set of
methods they have inspired in the machine learning literature, namely kernel
methods. We first discuss some properties of positive definite kernels as well
as reproducing kernel Hilbert spaces, the natural extension of the set of
functions {k(x, ·), x ∈ X} associated with a kernel k defined
on a space X. We discuss at length the construction of kernel
functions that take advantage of well-known statistical models. We provide an
overview of numerous data-analysis methods which take advantage of reproducing
kernel Hilbert spaces and discuss the idea of combining several kernels to
improve the performance on certain tasks. We also provide a short cookbook of
different kernels which are particularly useful for certain data-types such as
images, graphs or speech segments.
Comment: draft; corrected a typo in figure
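One fact underlying the survey's discussion of combining kernels can be shown in a few lines: nonnegative weighted sums of positive definite kernels are again positive definite, so a combined Gram matrix remains valid input for any kernel method. The sketch below is illustrative, not taken from the survey.

```python
import numpy as np

def linear_kernel(A, B):
    return A @ B.T

def rbf_kernel(A, B, gamma=0.5):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def combined_kernel(A, B, weights=(0.3, 0.7)):
    """A nonnegative combination of positive definite kernels is
    positive definite, which is what makes simple multiple-kernel
    combinations well-defined."""
    return weights[0] * linear_kernel(A, B) + weights[1] * rbf_kernel(A, B)

# Sanity check: a PD kernel's Gram matrix has no negative eigenvalues
# (up to floating-point error).
X = np.random.default_rng(0).normal(size=(20, 5))
K = combined_kernel(X, X)
print(np.linalg.eigvalsh(K).min() >= -1e-9)  # True
```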
A Survey on Metric Learning for Feature Vectors and Structured Data
The need for appropriate ways to measure the distance or similarity between
data is ubiquitous in machine learning, pattern recognition and data mining,
but handcrafting such good metrics for specific problems is generally
difficult. This has led to the emergence of metric learning, which aims at
automatically learning a metric from data and has attracted a lot of interest
in machine learning and related fields for the past ten years. This survey
paper proposes a systematic review of the metric learning literature,
highlighting the pros and cons of each approach. We pay particular attention to
Mahalanobis distance metric learning, a well-studied and successful framework,
but additionally present a wide range of methods that have recently emerged as
powerful alternatives, including nonlinear metric learning, similarity learning
and local metric learning. Recent trends and extensions, such as
semi-supervised metric learning, metric learning for histogram data and the
derivation of generalization guarantees, are also covered. Finally, this survey
addresses metric learning for structured data, in particular edit distance
learning, and attempts to give an overview of the remaining challenges in
metric learning for the years to come.
Comment: Technical report, 59 pages. Changes in v2: fixed typos and improved
presentation. Changes in v3: fixed typos. Changes in v4: fixed typos and new
methods
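As a concrete instance of the Mahalanobis framework the survey highlights, here is a toy sketch (not any specific published algorithm): it learns a PSD matrix M by shrinking squared distances d^T M d between similar pairs and growing them between dissimilar ones, projecting back onto the PSD cone after each step.

```python
import numpy as np

def mahalanobis(x, y, M):
    """d_M(x, y) = sqrt((x - y)^T M (x - y)); M = I gives Euclidean."""
    d = x - y
    return float(np.sqrt(d @ M @ d))

def learn_metric(pairs, labels, dim, steps=500, lr=0.01):
    """Toy Mahalanobis metric learning from pairwise constraints:
    label +1 means 'should be close', -1 means 'should be far'."""
    M = np.eye(dim)
    for t in range(steps):
        (x, y), s = pairs[t % len(pairs)], labels[t % len(pairs)]
        d = x - y
        M -= lr * s * np.outer(d, d)          # gradient of s * d^T M d
        w, V = np.linalg.eigh((M + M.T) / 2)  # project onto the PSD cone
        M = (V * np.clip(w, 0, None)) @ V.T
    return M

rng = np.random.default_rng(0)
a, b, c = rng.normal(size=(3, 4))
M = learn_metric([(a, b), (a, c)], [+1, -1], dim=4)
print(mahalanobis(a, b, M) < mahalanobis(a, c, M))  # True, by construction
```

Many of the surveyed methods refine exactly this recipe: a loss over pairwise or triplet constraints, a parameterization keeping M positive semidefinite, and a regularizer.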