Query-Focused Opinion Summarization for User-Generated Content
We present a submodular function-based framework for query-focused opinion
summarization. Within our framework, both the relevance ordering produced by a
statistical ranker and the information coverage with respect to topic
distribution and diverse viewpoints are encoded as submodular functions.
Dispersion functions are utilized to minimize redundancy. We are the first to
evaluate different metrics of text similarity for submodularity-based
summarization methods. Experiments on community QA and blog summarization show
that our system outperforms state-of-the-art approaches in both automatic and
human evaluation. A large-scale human evaluation conducted on Amazon Mechanical
Turk confirms that our system generates summaries of high overall quality and
information diversity. Comment: COLING 201
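The combination the abstract describes, a submodular coverage reward plus a dispersion term against redundancy, can be optimized greedily. The sketch below is an illustrative toy (the scoring functions, the Jaccard dispersion metric, and the trade-off weight `lam` are assumptions, not the authors' exact formulation): a concave word-coverage term gives diminishing returns, and a minimum pairwise distance rewards diverse selections.

```python
# Toy greedy selection under a submodular coverage objective plus a
# dispersion (minimum pairwise distance) reward. Illustrative only.

def jaccard_distance(a, b):
    """1 - |A∩B| / |A∪B| over word sets; used here as the dispersion metric."""
    a, b = set(a.split()), set(b.split())
    return 1.0 - len(a & b) / len(a | b)

def coverage(selected, all_words):
    """Concave (sqrt) word coverage -> diminishing returns, i.e. submodular."""
    covered = set()
    for s in selected:
        covered |= set(s.split())
    return sum(1.0 for w in all_words if w in covered) ** 0.5

def dispersion(selected):
    """Minimum pairwise distance among selected sentences (redundancy penalty)."""
    if len(selected) < 2:
        return 1.0
    return min(jaccard_distance(x, y)
               for i, x in enumerate(selected) for y in selected[i + 1:])

def greedy_summary(sentences, k, lam=0.5):
    """Greedily pick k sentences maximizing coverage + lam * dispersion."""
    all_words = set(w for s in sentences for w in s.split())
    selected = []
    while len(selected) < k:
        best = max((s for s in sentences if s not in selected),
                   key=lambda s: coverage(selected + [s], all_words)
                              + lam * dispersion(selected + [s]))
        selected.append(best)
    return selected
```

Greedy maximization is the usual choice here because, for monotone submodular objectives, it carries a (1 - 1/e) approximation guarantee.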
FlashProfile: A Framework for Synthesizing Data Profiles
We address the problem of learning a syntactic profile for a collection of
strings, i.e. a set of regex-like patterns that succinctly describe the
syntactic variations in the strings. Real-world datasets, typically curated
from multiple sources, often contain data in various syntactic formats. Thus,
any data processing task is preceded by the critical step of data format
identification. However, manual inspection of data to identify the different
formats is infeasible in standard big-data scenarios.
Prior techniques are restricted to a small set of pre-defined patterns (e.g.
digits, letters, words, etc.), and provide no control over the granularity of
profiles. We define syntactic profiling as the problem of clustering strings
based on syntactic similarity, followed by identifying patterns that succinctly
describe each cluster. We present a technique for synthesizing such profiles
over a given language of patterns that also allows for interactive refinement
by requesting a desired number of clusters.
Using a state-of-the-art inductive synthesis framework, PROSE, we have
implemented our technique as FlashProfile. Across tasks over large
real datasets, we observe a median profiling time of only s.
Furthermore, we show that access to syntactic profiles may allow for more
accurate synthesis of programs, i.e. using fewer examples, in
programming-by-example (PBE) workflows such as FlashFill. Comment: 28 pages, SPLASH (OOPSLA) 201
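A drastically simplified illustration of the profiling idea (not the FlashProfile/PROSE synthesis algorithm itself): abstract each character into a coarse token class, collapse runs into regex-like atoms, and let strings sharing a pattern form a cluster. The class names `D`/`U`/`l` and the collapsing rule are assumptions made for the sketch.

```python
# Toy syntactic profiling: map strings to regex-like patterns and cluster
# by pattern. Illustrative of the problem statement, not of FlashProfile.
import re
from collections import defaultdict

def syntactic_pattern(s):
    """Abstract a string into a pattern such as 'D+' (digits) or 'l+' (lowercase)."""
    def cls(c):
        if c.isdigit():
            return "D"
        if c.isupper():
            return "U"
        if c.islower():
            return "l"
        return re.escape(c)  # punctuation kept literally
    out = []
    for t in (cls(c) for c in s):
        if out and out[-1].rstrip("+") == t:
            out[-1] = t + "+"  # collapse a run into one quantified atom
        else:
            out.append(t)
    return "".join(out)

def profile(strings):
    """Cluster strings by shared syntactic pattern."""
    clusters = defaultdict(list)
    for s in strings:
        clusters[syntactic_pattern(s)].append(s)
    return dict(clusters)
```

In this toy version the pattern language is fixed; the point of the paper's approach is precisely that the pattern language and the cluster count are controllable.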
Ranking and significance of variable-length similarity-based time series motifs
The detection of very similar patterns in a time series, commonly called
motifs, has received continuous and increasing attention from diverse
scientific communities. In particular, recent approaches for discovering
similar motifs of different lengths have been proposed. In this work, we show
that such variable-length similarity-based motifs cannot be directly compared,
and hence ranked, by their normalized dissimilarities. Specifically, we find
that length-normalized motif dissimilarities still have intrinsic dependencies
on the motif length, and that the lowest dissimilarities are particularly affected
by this dependency. Moreover, we find that such dependencies are generally
non-linear and change with the considered data set and dissimilarity measure.
Based on these findings, we propose a solution to rank those motifs and measure
their significance. This solution relies on a compact but accurate model of the
dissimilarity space, using a beta distribution with three parameters that
depend on the motif length in a non-linear way. We believe the incomparability
of variable-length dissimilarities could go beyond the field of time series,
and that similar modeling strategies to the one used here could be of help in a
broader context. Comment: 20 pages, 10 figures
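The modeling step described above, fitting a beta distribution to a sample of normalized dissimilarities and reading off a motif's significance as a tail probability, can be sketched as follows. This is a simplification: it uses a two-parameter beta fitted by moment matching on a fixed (0, 1) support, whereas the paper proposes a three-parameter model whose parameters depend non-linearly on motif length.

```python
# Sketch: model normalized dissimilarities with a beta distribution and
# score a motif by its left-tail probability (small dissimilarity = rare
# = significant). Simplified relative to the paper's length-aware model.
import math
import numpy as np

def fit_beta_moments(x):
    """Method-of-moments estimates of the beta shape parameters on (0, 1)."""
    m, v = float(np.mean(x)), float(np.var(x))
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common

def beta_pdf(x, a, b):
    """Beta density, using math.gamma for the normalizing constant."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1.0 - x) ** (b - 1)

def left_tail(d, a, b, steps=10_000):
    """P(D <= d) by trapezoidal integration of the fitted density."""
    xs = np.linspace(1e-9, d, steps)
    ys = beta_pdf(xs, a, b)
    return float(((ys[1:] + ys[:-1]) / 2.0 * np.diff(xs)).sum())
```

A motif whose dissimilarity sits far in the left tail of the fitted model is then judged significant relative to the bulk of the dissimilarity space.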
A bag-to-class divergence approach to multiple-instance learning
In multi-instance (MI) learning, each object (bag) consists of multiple
feature vectors (instances), and is most commonly regarded as a set of points
in a multidimensional space. A different viewpoint is that the instances are
realisations of random vectors with corresponding probability distribution, and
that a bag is the distribution, not the realisations. In MI classification,
each bag in the training set has a class label, but the instances are
unlabelled. By introducing the probability distribution space to bag-level
classification problems, dissimilarities between probability distributions
(divergences) can be applied. The bag-to-bag Kullback-Leibler information is
asymptotically the best classifier, but the typical sparseness of MI training
sets is an obstacle. We introduce bag-to-class divergence to MI learning,
emphasising the hierarchical nature of the random vectors that makes bags from
the same class different. We propose two properties for bag-to-class
divergences, and an additional property for sparse training sets.
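The bag-to-class (rather than bag-to-bag) idea can be sketched under a simplifying Gaussian assumption that is not the paper's proposal: estimate a diagonal Gaussian from each bag's instances and from all training instances of each class, then assign a test bag to the class minimizing the bag-to-class KL divergence.

```python
# Minimal bag-to-class classification sketch, assuming diagonal Gaussian
# models for bags and classes. Illustrative only; the paper's proposed
# divergences and properties are not reproduced here.
import numpy as np

def gaussian_params(x):
    """Per-dimension mean and variance, with a small jitter for stability."""
    return x.mean(axis=0), x.var(axis=0) + 1e-6

def kl_diag_gauss(mu0, var0, mu1, var1):
    """KL( N(mu0, var0) || N(mu1, var1) ) for diagonal Gaussians."""
    return 0.5 * np.sum(np.log(var1 / var0)
                        + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def classify_bag(bag, class_instances):
    """class_instances: dict label -> array of all training instances
    pooled over the bags of that class (instances are unlabelled, but
    bag labels transfer to their instances for the class-level model)."""
    mu_b, var_b = gaussian_params(bag)
    scores = {label: kl_diag_gauss(mu_b, var_b, *gaussian_params(inst))
              for label, inst in class_instances.items()}
    return min(scores, key=scores.get)
```

Pooling instances at the class level is what sidesteps the sparseness obstacle the abstract mentions for bag-to-bag comparison: each divergence is estimated against many instances instead of one small bag.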
Spherical Wards clustering and generalized Voronoi diagrams
A Gaussian mixture model is very useful in many practical problems.
Nevertheless, it cannot be directly generalized to non-Euclidean spaces. To
overcome this problem, we present a spherical Gaussian-based clustering
approach for partitioning data sets with respect to an arbitrary dissimilarity
measure. The proposed method combines spherical Cross-Entropy Clustering with a
generalized Ward's approach. The algorithm finds the optimal number of clusters
by automatically removing groups which carry no information. Moreover, it is
scale invariant and allows for the formation of spherically shaped clusters of
arbitrary sizes. In order to graphically represent and interpret the results,
the notion of a Voronoi diagram is generalized to non-Euclidean spaces and
applied to the introduced clustering method.
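One distinctive behavior described above, starting with many groups and automatically removing those that carry no information, can be illustrated with a k-means-style stand-in. This sketch is not spherical Cross-Entropy Clustering; the removal criterion (a minimum share of the data, `min_weight`) and the deterministic seeding are assumptions made for the example.

```python
# Sketch: clustering with spherically shaped groups where clusters whose
# share of the data falls below a threshold are removed, so the cluster
# count adapts downward. Not the paper's CEC/Ward algorithm.
import numpy as np

def spherical_cluster(x, k_init=6, min_weight=0.2, iters=100):
    # simple deterministic seeding: centers spread evenly through the data
    step = max(1, len(x) // k_init)
    centers = x[::step][:k_init].astype(float).copy()
    for _ in range(iters):
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        sizes = np.array([(labels == j).mean() for j in range(len(centers))])
        if len(centers) > 1 and sizes.min() < min_weight:
            # this group carries too little mass: remove it and reassign
            centers = np.delete(centers, int(sizes.argmin()), axis=0)
            continue
        centers = np.stack([x[labels == j].mean(axis=0)
                            for j in range(len(centers))])
    d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return centers, d.argmin(1)
```

Starting from more clusters than needed and pruning is the practical way such methods find the number of clusters without it being supplied in advance.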
A new class of metrics for learning on real-valued and structured data
We propose a new class of metrics on sets, vectors, and functions that can be
used in various stages of data mining, including exploratory data analysis,
learning, and result interpretation. These new distance functions unify and
generalize some of the popular metrics, such as the Jaccard and bag distances
on sets, Manhattan distance on vector spaces, and Marczewski-Steinhaus distance
on integrable functions. We prove that the new metrics are complete and show
useful relationships with -divergences for probability distributions. To
further extend our approach to structured objects such as concept hierarchies
and ontologies, we introduce information-theoretic metrics on directed acyclic
graphs drawn according to a fixed probability distribution. We conduct
empirical investigation to demonstrate intuitive interpretation of the new
metrics and their effectiveness on real-valued, high-dimensional, and
structured data. Extensive comparative evaluation demonstrates that the new
metrics outperformed multiple similarity and dissimilarity functions
traditionally used in data mining, including the Minkowski family, the
fractional family, two -divergences, cosine distance, and two
correlation coefficients. Finally, we argue that the new class of metrics is
particularly appropriate for rapid processing of high-dimensional and
structured data in distance-based learning.
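Two of the classical metrics the abstract says the new class unifies can be stated compactly: the Jaccard distance on sets and its Steinhaus-type generalization to non-negative vectors, which reduces to Jaccard on 0/1 indicator vectors. The paper's new metrics themselves are not reproduced here.

```python
# The classical endpoints being generalized: Jaccard distance on sets and
# a Steinhaus-type distance on non-negative vectors.
import numpy as np

def jaccard_distance(a, b):
    """1 - |A∩B| / |A∪B|; a metric on finite sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def steinhaus_distance(x, y):
    """sum_i |x_i - y_i| / sum_i max(x_i, y_i) for non-negative vectors.
    On 0/1 indicator vectors this equals the Jaccard distance."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    denom = np.maximum(x, y).sum()
    return 0.0 if denom == 0 else float(np.abs(x - y).sum() / denom)
```

The indicator-vector reduction is easy to check by hand: for A = {a, b} and B = {b, c}, both formulas give 2/3.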
Developments in the theory of randomized shortest paths with a comparison of graph node distances
There have lately been several suggestions for parametrized distances on a
graph that generalize the shortest path distance and the commute time or
resistance distance. The need for developing such distances has risen from the
observation that the above-mentioned common distances in many situations fail
to take into account the global structure of the graph. In this article, we
develop the theory of one family of graph node distances, known as the
randomized shortest path dissimilarity, which has its foundation in statistical
physics. We show that the randomized shortest path dissimilarity can be easily
computed in closed form for all pairs of nodes of a graph. Moreover, we come up
with a new definition of a distance measure that we call the free energy
distance. The free energy distance can be seen as an upgrade of the randomized
shortest path dissimilarity as it defines a metric, in addition to which it
satisfies the graph-geodetic property. The derivation and computation of the
free energy distance are also straightforward. We then make a comparison
between a set of generalized distances that interpolate between the shortest
path distance and the commute time, or resistance distance. This comparison
focuses on the applicability of the distances in graph node clustering and
classification. The comparison, in general, shows that the parametrized
distances perform well in the tasks. In particular, we see that the results
obtained with the free energy distance are among the best in all the
experiments. Comment: 30 pages, 4 figures, 3 tables
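The closed-form computability claimed above can be sketched with the formulation commonly used in the randomized-shortest-paths literature: with reference transition matrix P, edge costs C, and inverse temperature beta, set W = P ∘ exp(-beta·C), Z = (I - W)^{-1}, take the directed free energy phi(i, j) = -(1/beta) log(z_ij / z_jj), and symmetrize. Treat the exact definitions here as recalled from that literature rather than quoted from the article.

```python
# Free energy distance sketch: one matrix inversion yields all pairwise
# distances. As beta grows it approaches the shortest path distance.
import numpy as np

def free_energy_distance(A, C, beta=1.0):
    """A: adjacency matrix; C: non-negative edge costs; beta > 0."""
    P = A / A.sum(axis=1, keepdims=True)        # natural random walk
    W = P * np.exp(-beta * C)                   # killed walk weights
    Z = np.linalg.inv(np.eye(len(A)) - W)       # fundamental matrix
    phi = -np.log(Z / np.diag(Z)[None, :]) / beta   # phi[i, j]
    return 0.5 * (phi + phi.T)                  # symmetrized distance
```

On a 3-node path graph with unit costs and large beta, the result is close to the shortest path distances (1 between neighbors, 2 across), consistent with the interpolation property the abstract describes.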
Multi-criteria Similarity-based Anomaly Detection using Pareto Depth Analysis
We consider the problem of identifying patterns in a data set that exhibit
anomalous behavior, often referred to as anomaly detection. Similarity-based
anomaly detection algorithms detect abnormally large amounts of similarity or
dissimilarity, e.g. as measured by nearest neighbor Euclidean distances between
a test sample and the training samples. In many application domains there may
not exist a single dissimilarity measure that captures all possible anomalous
patterns. In such cases, multiple dissimilarity measures can be defined,
including non-metric measures, and one can test for anomalies by scalarizing
using a non-negative linear combination of them. If the relative importance of
the different dissimilarity measures is not known in advance, as in many
anomaly detection applications, the anomaly detection algorithm may need to be
executed multiple times with different choices of weights in the linear
combination. In this paper, we propose a method for similarity-based anomaly
detection using a novel multi-criteria dissimilarity measure, the Pareto depth.
The proposed Pareto depth analysis (PDA) anomaly detection algorithm uses the
concept of Pareto optimality to detect anomalies under multiple criteria
without having to run an algorithm multiple times with different choices of
weights. The proposed PDA approach is provably better than using linear
combinations of the criteria and shows superior performance on experiments with
synthetic and real data sets. Comment: The work was submitted to the IEEE TNNLS
Special Issue on Learning in Non-(geo)metric Spaces for review on October 28,
2013, revised on July 26, 2015, and accepted on July 30, 2015. A preliminary
version of this work was reported in the conference Advances in Neural
Information Processing Systems (NIPS) 201
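The core device, judging a test sample by how deep its dyads (vectors of pairwise dissimilarities under several criteria) sit relative to the training dyads, can be sketched with a simplified dominated-count proxy for Pareto depth. This is not the paper's exact PDA algorithm; the depth definition and the mean-depth score are assumptions made for the example.

```python
# Simplified Pareto-depth anomaly scoring: dyads dominated by many
# nominal training dyads are "deep", and deep dyads indicate anomaly.
import numpy as np

def dominates(a, b):
    """a Pareto-dominates b if a <= b in all criteria and < in at least one."""
    return bool(np.all(a <= b) and np.any(a < b))

def pareto_depth(dyad, train_dyads):
    """1 + number of training dyads dominating the given dyad (a proxy
    for the index of the Pareto front the dyad would fall on)."""
    return 1 + sum(dominates(t, dyad) for t in train_dyads)

def anomaly_score(test_point, train, criteria):
    """Mean depth of the test point's dyads to all training points.
    criteria: list of dissimilarity functions (may be non-metric)."""
    train_dyads = [np.array([c(u, v) for c in criteria])
                   for i, u in enumerate(train) for v in train[i + 1:]]
    depths = [pareto_depth(np.array([c(test_point, v) for c in criteria]),
                           train_dyads)
              for v in train]
    return float(np.mean(depths))
```

Note that no weights over the criteria appear anywhere, which is the point: dominance is weight-free, so one run covers all non-negative linear combinations at once.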
Structuring Relevant Feature Sets with Multiple Model Learning
Feature selection is one of the most prominent learning tasks, especially in
high-dimensional datasets in which the goal is to understand the mechanisms
that underlie the data. However, most feature selection methods deliver just
a flat set of relevant features and provide no further information on what kind
of structures, e.g. feature groupings, might underlie the set of relevant
features. In this paper, we propose a new learning paradigm in which our goal is
to uncover the structures that underlie the set of relevant features for a given
learning problem. We uncover two types of feature sets: non-replaceable
features that contain important information about the target variable and
cannot be replaced by other features, and functionally similar features sets
that can be used interchangeably in learned models, given the presence of the
non-replaceable features, with no change in the predictive performance. To do
so we propose a new learning algorithm that learns a number of disjoint models
using a model disjointness regularization constraint together with a constraint
on the predictive agreement of the disjoint models. We explore the behavior of
our approach on a number of high-dimensional datasets and show that, as
expected by construction, the learned models satisfy a number of properties:
model disjointness, high predictive agreement, and a predictive performance
similar to that of models learned on the full set of relevant features. The ability
to structure the set of relevant features in such a manner can become a
valuable tool in different applications of scientific knowledge discovery.
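The end state the method targets, disjoint models that nevertheless agree, can be illustrated with a toy that skips the paper's regularized joint learning: fit one linear model per disjoint feature subset and measure the agreement of their predictions. High agreement across disjoint subsets is the signature of functionally similar features; the helper names and the correlation-based agreement measure are assumptions for the sketch.

```python
# Toy check of the "disjoint models with high predictive agreement" idea
# via independent least-squares fits. Not the authors' algorithm.
import numpy as np

def fit_linear(x, y):
    """Ordinary least squares weights."""
    w, *_ = np.linalg.lstsq(x, y, rcond=None)
    return w

def disjoint_models(x, y, feats_a, feats_b):
    """Fit one linear model per disjoint feature subset; return the
    models and the correlation between their predictions (agreement)."""
    assert not set(feats_a) & set(feats_b), "feature sets must be disjoint"
    wa = fit_linear(x[:, feats_a], y)
    wb = fit_linear(x[:, feats_b], y)
    pa, pb = x[:, feats_a] @ wa, x[:, feats_b] @ wb
    return wa, wb, float(np.corrcoef(pa, pb)[0, 1])
```

If feature 1 is a near-copy of feature 0, models built on {0} and on {1} agree almost perfectly, whereas a model built on an irrelevant feature does not.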
Discovering Playing Patterns: Time Series Clustering of Free-To-Play Game Data
The classification of time series data is a challenge common to all
data-driven fields. However, there is no agreement on the most efficient
techniques for grouping unlabeled time-ordered data. This is because a
successful classification of time series patterns depends on the goal and the
domain of interest, i.e. it is application-dependent.
In this article, we study free-to-play game data. In this domain, clustering
similar time series information is increasingly important due to the large
amount of data collected by current mobile and web applications. We evaluate
which methods accurately cluster time series of mobile games, focusing on
player behavior data. We identify and validate several aspects of the
clustering: the similarity measures and the representation techniques to reduce
the high dimensionality of time series. As a robustness test, we compare
various temporal datasets of player activity from two free-to-play video-games.
With these techniques we extract temporal patterns of player behavior
relevant for the evaluation of game events and game-business diagnosis. Our
experiments provide intuitive visualizations to validate the results of the
clustering and to determine the optimal number of clusters. Additionally, we
assess the common characteristics of the players belonging to the same group.
This study allows us to improve the understanding of player dynamics and churn
behavior.
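The two validated aspects named above, a representation that reduces the high dimensionality of the series and a similarity measure for clustering, can be sketched with one common choice of each: piecewise aggregate approximation (PAA) followed by Euclidean k-means. The article compares several such choices; this pairing and the helper names are assumptions for the example.

```python
# Sketch of a time series clustering pipeline: PAA dimensionality
# reduction, then k-means with deterministic farthest-point seeding.
import numpy as np

def paa(series, segments):
    """Piecewise aggregate approximation: mean of each equal-length segment."""
    series = np.asarray(series, float)
    return np.array([chunk.mean() for chunk in np.array_split(series, segments)])

def kmeans(x, k, iters=50):
    # deterministic farthest-point initialization
    centers = [x[0]]
    for _ in range(k - 1):
        d = ((x[:, None] - np.array(centers)[None]) ** 2).sum(-1).min(1)
        centers.append(x[int(d.argmax())])
    centers = np.stack(centers)
    for _ in range(iters):
        labels = ((x[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.stack([x[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

def cluster_players(series_list, segments=8, k=2):
    """Reduce each activity series with PAA, then cluster the vectors."""
    reduced = np.stack([paa(s, segments) for s in series_list])
    return kmeans(reduced, k)
```

Swapping `paa` for another representation (e.g. discrete wavelet coefficients) or the Euclidean distance for an elastic measure changes only one stage of the pipeline, which is exactly why such components are evaluated separately.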