Oversampling for Imbalanced Time Series Data
Many important real-world applications involve time-series data with skewed
distributions. Compared to conventional imbalanced learning problems, the
classification of imbalanced time-series data is more challenging due to its
high dimensionality and high inter-variable correlation. This paper proposes a
structure-preserving Oversampling method for High-dimensional Imbalanced
Time-series classification (OHIT). OHIT first leverages a density-ratio-based
shared-nearest-neighbor clustering algorithm to capture the modes of the
minority class in high-dimensional space. For each mode, it then applies a
shrinkage technique for large-dimensional covariance matrices to obtain an
accurate and reliable estimate of the covariance structure. Finally, OHIT
generates structure-preserving synthetic samples from multivariate Gaussian
distributions parameterized by the estimated covariance matrices. Experimental
results on several publicly available time-series datasets (both unimodal and
multimodal) demonstrate the superiority of OHIT over state-of-the-art
oversampling algorithms in terms of F-value, G-mean, and AUC.
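The final sampling step, drawing synthetic minority samples from a Gaussian with a shrunk covariance estimate, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the fixed shrinkage weight, and the scaled-identity shrinkage target are our assumptions, whereas OHIT estimates the shrinkage intensity for large-dimensional covariance matrices.

```python
import numpy as np

def oversample_mode(X_min, n_new, shrinkage=0.1, seed=0):
    # X_min: minority-class samples belonging to one mode, shape (n, d).
    rng = np.random.default_rng(seed)
    mu = X_min.mean(axis=0)
    S = np.cov(X_min, rowvar=False)           # sample covariance, (d, d)
    d = S.shape[0]
    # Shrink the noisy sample covariance toward a scaled identity target.
    target = np.trace(S) / d * np.eye(d)
    S_shrunk = (1 - shrinkage) * S + shrinkage * target
    # Draw structure-preserving synthetic samples from N(mu, S_shrunk).
    return rng.multivariate_normal(mu, S_shrunk, size=n_new)
```

The shrinkage keeps the estimate well-conditioned even when the dimensionality approaches or exceeds the number of minority samples.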
A Survey of Neighbourhood Construction Models for Categorizing Data Points
Finding neighbourhood structures is very useful for extracting valuable
relationships among data samples. This paper presents a survey of recent
neighbourhood construction algorithms for pattern clustering and classifying
data points. Many applications, such as detecting social network communities,
bundling related edges, and solving location and routing problems, indicate
the usefulness of this task, and finding data point neighbourhoods should
generally improve knowledge extraction from databases in data mining and
pattern recognition. Several neighbourhood construction algorithms have been
proposed to analyse data in this sense; they are described and discussed from
different aspects in this paper. Finally, future challenges in neighbourhood
construction are outlined.
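As a concrete instance of the problem the survey addresses, the simplest neighbourhood construction model is the k-nearest-neighbour rule; a brute-force sketch (the function name and implementation are ours, not from any surveyed paper):

```python
import numpy as np

def knn_neighbourhoods(X, k):
    # Pairwise Euclidean distances between all samples, shape (n, n).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)        # a point is not its own neighbour
    # Indices of each sample's k nearest neighbours.
    return np.argsort(dist, axis=1)[:, :k]
```

More elaborate models surveyed in the paper refine such raw neighbour lists, e.g. by requiring mutual or density-consistent neighbourhoods.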
Clustering Millions of Faces by Identity
In this work, we attempt to address the following problem: given a large
number of unlabeled face images, cluster them into the individual identities
present in the data. We consider this a relevant problem in application
scenarios ranging from social media to law enforcement. In large-scale
settings the number of faces in the collection can be on the order of hundreds
of millions, while the number of clusters can range from a few thousand to
millions, leading to difficulties in both run-time complexity and evaluating
clustering and per-cluster quality. An efficient and effective Rank-Order
clustering algorithm is developed to achieve the desired scalability, with
better clustering accuracy than other well-known algorithms such as k-means
and spectral clustering. We cluster up to 123 million face images into over
10 million clusters, and analyze the results in terms of external cluster
quality measures (known face labels), internal cluster quality measures
(unknown face labels), and run-time. Our algorithm achieves an F-measure of
0.87 on a benchmark unconstrained face dataset (LFW, consisting of 13K faces),
and 0.27 on the largest dataset considered (the 13K images in LFW plus 123M
distractor images). Additionally, we present preliminary work on video frame
clustering, achieving a 0.71 F-measure when clustering all frames in the
benchmark YouTube Faces dataset. A per-cluster quality measure is also
developed, which can be used to rank individual clusters and to automatically
identify a subset of good-quality clusters for manual exploration.
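The rank-order distance underlying such clustering compares how two faces rank each other's neighbours. A simplified sketch using full neighbour lists follows (the scalable variant approximates this with short kNN lists; the function name is ours):

```python
import numpy as np

def rank_order_distance(orders, a, b):
    # orders[i] ranks every sample by distance from sample i (self first).
    rank_a = {s: r for r, s in enumerate(orders[a])}
    rank_b = {s: r for r, s in enumerate(orders[b])}
    # Asymmetric distance: sum the ranks (in b's list) of a's top
    # neighbours, up to and including b itself; then symmetrize.
    d_ab = sum(rank_b[orders[a][i]] for i in range(rank_a[b] + 1))
    d_ba = sum(rank_a[orders[b][i]] for i in range(rank_b[a] + 1))
    return (d_ab + d_ba) / min(rank_a[b], rank_b[a])
```

Pairs that share many high-ranked neighbours get a small distance even when their raw feature distance is large, which is what makes the measure robust for faces of the same identity under varying conditions.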
ELKI: A large open-source library for data analysis - ELKI Release 0.7.5 "Heidelberg"
This paper documents the release of the ELKI data mining framework, version
0.7.5.
ELKI is an open source (AGPLv3) data mining software written in Java. The
focus of ELKI is research in algorithms, with an emphasis on unsupervised
methods in cluster analysis and outlier detection. In order to achieve high
performance and scalability, ELKI offers data index structures such as the
R*-tree that can provide major performance gains. ELKI is designed to be easy
to extend for researchers and students in this domain, and welcomes
contributions of additional methods. ELKI aims at providing a large collection
of highly parameterizable algorithms, in order to allow easy and fair
evaluation and benchmarking of algorithms.
We first outline the motivation for this release and the plans for the
future, and then give a brief overview of the new functionality in this
version. We also include an appendix presenting an overview of the overall
implemented functionality.
Clustering Based on Pairwise Distances When the Data is of Mixed Dimensions
In the context of clustering, we consider a generative model in a Euclidean
ambient space with clusters of different shapes, dimensions, sizes and
densities. In an asymptotic setting where the number of points becomes large,
we obtain theoretical guarantees for a few emblematic methods based on pairwise
distances: a simple algorithm based on the extraction of connected components
in a neighborhood graph; the spectral clustering method of Ng, Jordan and
Weiss; and hierarchical clustering with single linkage. The methods are shown
to enjoy some near-optimal properties in terms of separation between clusters
and robustness to outliers. The local scaling method of Zelnik-Manor and Perona
is shown to lead to a near-optimal choice for the scale in the first two
methods. We also provide a lower bound on the spectral gap to consistently
choose the correct number of clusters in the spectral method.
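The first method analyzed, extracting connected components from a neighborhood graph, can be sketched with an ε-neighborhood graph and union-find (a minimal illustration assuming a fixed global scale ε; the paper also studies locally chosen scales):

```python
import numpy as np

def eps_graph_clusters(X, eps):
    n = len(X)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Connect every pair of points within distance eps.
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(X[i] - X[j]) <= eps:
                parent[find(i)] = find(j)

    # Each connected component of the graph is one cluster.
    labels = [find(i) for i in range(n)]
    relabel = {r: c for c, r in enumerate(dict.fromkeys(labels))}
    return [relabel[r] for r in labels]
```

Two points end up in the same cluster exactly when they are joined by a chain of ε-close points, which is the separation condition the theoretical guarantees are phrased in.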
Efficient Parameter-free Clustering Using First Neighbor Relations
We present a new clustering method in the form of a single clustering
equation that is able to directly discover groupings in the data. The main
proposition is that the first neighbor of each sample is all one needs to
discover large chains and find the groups in the data. In contrast to most
existing clustering algorithms, our method does not require any
hyper-parameters, distance thresholds, or a pre-specified number of clusters.
The proposed algorithm belongs to the family of hierarchical agglomerative
methods. The technique has very low computational overhead, is easily
scalable, and is applicable to large practical problems. Evaluation on
well-known datasets from different domains, ranging between 1077 and 8.1
million samples, shows substantial performance gains compared to existing
clustering techniques.
Comment: CVPR 201
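The first-neighbor idea can be sketched as a single round of merging: link every sample to its nearest neighbour and take connected components of the resulting graph. This is a minimal one-partition sketch (the full algorithm repeats the step recursively to build a hierarchy, and the function name is ours):

```python
import numpy as np

def first_neighbor_clusters(X):
    # Nearest neighbour of every sample.
    dist = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)
    nn = dist.argmin(axis=1)

    # Connected components of the 1-nn graph via union-find.
    n = len(X)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in enumerate(nn):
        parent[find(i)] = find(int(j))
    labels = [find(i) for i in range(n)]
    relabel = {r: c for c, r in enumerate(dict.fromkeys(labels))}
    return [relabel[r] for r in labels]
```

Note that no threshold or cluster count appears anywhere: the only input is the data itself, which is the parameter-free property the abstract emphasizes.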
1-D and 2-D Parallel Algorithms for All-Pairs Similarity Problem
The all-pairs similarity problem asks to find all pairs of vectors in a set
whose similarities surpass a given threshold; it is a computational kernel
for several tasks in data mining and information retrieval. We investigate
the parallelization of a recent fast sequential algorithm. We propose
effective 1-D and 2-D data distribution strategies that preserve the
essential optimizations of the fast algorithm. The 1-D parallel algorithms
distribute either dimensions or vectors, whereas the 2-D parallel algorithm
distributes data both ways. Additional contributions to the 1-D vertical
distribution include a local pruning strategy to reduce the number of
candidates, a recursive pruning algorithm, and block processing to reduce
imbalance. The parallel algorithms were programmed in OCaml, which affords
much convenience. Our experiments indicate that performance depends on the
dataset; therefore a variety of parallelizations is useful.
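The underlying sequential kernel, reporting all vector pairs above a similarity threshold, can be sketched as a naive baseline without the pruning optimizations the paper parallelizes (cosine similarity and the function name are our assumptions):

```python
import numpy as np

def all_pairs_similarity(V, t):
    # Normalize rows so cosine similarity reduces to a dot product.
    U = V / np.linalg.norm(V, axis=1, keepdims=True)
    sims = U @ U.T
    # Report every pair whose similarity meets the threshold t.
    pairs = []
    n = len(V)
    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] >= t:
                pairs.append((i, j, float(sims[i, j])))
    return pairs
```

The fast algorithm the paper builds on avoids materializing the full O(n²) similarity matrix by pruning candidate pairs early, which is exactly what the 1-D and 2-D distributions must preserve.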
A New Clustering Algorithm Based Upon Flocking On Complex Network
We propose a model based upon flocking on a complex network, and develop two
clustering algorithms on the basis of it. In the algorithms, a
\textit{k}-nearest neighbor (knn) graph is first built as a weighted,
directed graph over all data points in a dataset, each of which is regarded
as an agent that can move in space; a time-varying complex network is then
created by adding long-range links for each data point. Each data point is
acted upon not only by its \textit{k} nearest neighbors but also by
\textit{r} long-range neighbors, through fields they jointly establish in
space, and it takes a step along the direction of the vector sum of all
fields. More importantly, these long-range links provide hidden information
for each data point as it moves, and at the same time accelerate its
convergence toward a center. As the points move in space according to the
proposed model, data points that belong to the same class gradually gather at
the same position, whereas those that belong to different classes move away
from one another. The experimental results demonstrate that data points in
the datasets are clustered reasonably and efficiently, and that the
clustering algorithms converge quickly. Moreover, comparison with other
algorithms also indicates the effectiveness of the proposed approach.
Comment: 18 pages, 4 figures, 3 tables
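A much-simplified caricature of one movement step: each point takes a step toward the centroid of its k nearest neighbours. This omits the paper's long-range links, fields, and time-varying network entirely and is only meant to convey the contraction behaviour that drives points of the same class together:

```python
import numpy as np

def flock_step(X, k, eta=0.5):
    # Move each point a fraction eta of the way toward the centroid of
    # its k nearest neighbours (a crude stand-in for the field update).
    dist = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)
    nn = np.argsort(dist, axis=1)[:, :k]
    targets = X[nn].mean(axis=1)
    return X + eta * (targets - X)
```

Iterating such a step makes nearby points collapse toward common positions, after which clusters can be read off by grouping coincident points.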
Unsupervised Methods for Determining Object and Relation Synonyms on the Web
The task of identifying synonymous relations and objects, or synonym
resolution, is critical for high-quality information extraction. This paper
investigates synonym resolution in the context of unsupervised information
extraction, where neither hand-tagged training examples nor domain knowledge is
available. The paper presents a scalable, fully-implemented system that runs in
O(KN log N) time in the number of extractions, N, and the maximum number of
synonyms per word, K. The system, called Resolver, introduces a probabilistic
relational model for predicting whether two strings are co-referential based on
the similarity of the assertions containing them. On a set of two million
assertions extracted from the Web, Resolver resolves objects with 78% precision
and 68% recall, and resolves relations with 90% precision and 35% recall.
Several variations of Resolver's probabilistic model are explored, and
experiments demonstrate that under appropriate conditions these variations can
improve F1 by 5%. An extension to the basic Resolver system allows it to handle
polysemous names with 97% precision and 95% recall on a data set from the TREC
corpus.
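Resolver's core signal, that co-referential strings appear in similar assertions, can be caricatured with a Jaccard overlap of assertion sets (a stand-in for, not a reproduction of, the paper's probabilistic relational model; the example property strings are invented):

```python
def property_similarity(props_a, props_b):
    # Jaccard overlap of the assertion sets two strings appear in;
    # higher overlap suggests the strings may be synonyms.
    a, b = set(props_a), set(props_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Resolver replaces this crude score with a probabilistic model of how likely two strings are to share assertions by chance, which is what makes it robust at web scale.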
Reductive Clustering: An Efficient Linear-time Graph-based Divisive Cluster Analysis Approach
We propose an efficient linear-time graph-based divisive cluster analysis
approach called Reductive Clustering. The approach tries to reveal
hierarchical structural information by repeatedly reducing the graph into a
more concise one. With the reductions, the original graph can be divided into
subgraphs recursively, and a lightweight informative dendrogram is
constructed from the divisions. Each reduction consists of three steps:
selection, connection, and partition. First, a subset of the graph's vertices
is selected as representatives to build a concise graph. The representatives
are re-connected to maintain a structure consistent with the previous graph.
If possible, the concise graph is divided into subgraphs, and each subgraph
is further reduced recursively until the termination condition is met. We
discuss the approach, along with several selection and connection methods, in
detail both theoretically and experimentally. Our implementations run in
linear time and achieve outstanding performance on various types of datasets.
Experimental results show that they outperform state-of-the-art clustering
algorithms with significantly smaller computing resource requirements.
Comment: http://res.ctarn.io/reductive-clusterin
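One illustrative reduction step, under our own hypothetical selection and connection rules rather than the paper's: keep as representatives the points that serve as someone's nearest neighbour, and assign every other point to its nearest representative.

```python
import numpy as np

def reduce_graph(X):
    # Selection: a point is a representative if it is some point's
    # nearest neighbour (a hypothetical rule chosen for illustration).
    dist = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)
    nn = dist.argmin(axis=1)
    reps = sorted(set(int(j) for j in nn))
    # Connection: map every point to its nearest representative, so the
    # reduced graph over reps stays consistent with the original one.
    assign = [reps[int(np.argmin([dist[i, r] if i != r else 0.0
                                  for r in reps]))]
              for i in range(len(X))]
    return reps, assign
```

Repeating such a step shrinks the graph geometrically, which is how a recursive reduce-and-partition scheme can stay within a linear time budget overall.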