    Oversampling for Imbalanced Time Series Data

    Many important real-world applications involve time-series data with skewed class distributions. Compared to conventional imbalance learning problems, the classification of imbalanced time-series data is more challenging due to high dimensionality and high inter-variable correlation. This paper proposes a structure-preserving Oversampling method to combat the High-dimensional Imbalanced Time-series classification (OHIT). OHIT first leverages a density-ratio based shared nearest neighbor clustering algorithm to capture the modes of the minority class in high-dimensional space. Then, for each mode, it applies a shrinkage technique for large-dimensional covariance matrices to obtain an accurate and reliable covariance structure. Finally, OHIT generates structure-preserving synthetic samples from a multivariate Gaussian distribution using the estimated covariance matrices. Experimental results on several publicly available time-series datasets (both unimodal and multi-modal) demonstrate the superiority of OHIT over state-of-the-art oversampling algorithms in terms of F-value, G-mean, and AUC.
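
    The last two steps of the abstract above (covariance shrinkage, then Gaussian sampling) can be illustrated with a minimal sketch. It is not the OHIT implementation: it treats the minority class as a single mode, shrinks toward the diagonal of the sample covariance, and uses a fixed, hand-picked intensity `alpha`, whereas the abstract describes per-mode estimation after clustering. All function names are hypothetical.

```python
import math
import random


def sample_covariance(rows):
    """Return (mean, unbiased sample covariance) of equal-length vectors."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in rows) / (n - 1)
            for b in range(d)] for a in range(d)]
    return mean, cov


def shrink(cov, alpha):
    """Blend cov with its own diagonal: (1 - alpha) * cov + alpha * diag(cov)."""
    d = len(cov)
    return [[cov[a][b] if a == b else (1.0 - alpha) * cov[a][b]
             for b in range(d)] for a in range(d)]


def cholesky(m):
    """Lower-triangular factor L with L * L^T == m (m must be positive definite)."""
    d = len(m)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(m[i][i] - s)
            else:
                low[i][j] = (m[i][j] - s) / low[j][j]
    return low


def oversample(rows, count, alpha=0.5, seed=0):
    """Draw `count` synthetic vectors from N(mean, shrunk covariance).

    Assumption: alpha is a fixed illustrative value in (0, 1], not an
    estimated shrinkage intensity.
    """
    rng = random.Random(seed)
    mean, cov = sample_covariance(rows)
    low = cholesky(shrink(cov, alpha))
    d = len(mean)
    out = []
    for _ in range(count):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # x = mean + L z has covariance L L^T, i.e. the shrunk covariance.
        out.append([mean[i] + sum(low[i][k] * z[k] for k in range(i + 1))
                    for i in range(d)])
    return out


# Three minority samples in four dimensions: fewer samples than dimensions,
# so the raw sample covariance is singular and shrinkage is what makes
# the Cholesky factorisation possible.
minority = [[1.0, 2.0, 3.0, 4.0], [1.5, 1.8, 3.2, 4.4], [0.8, 2.3, 2.9, 3.7]]
synthetic = oversample(minority, 5)
```

    Shrinking toward the diagonal keeps every variance unchanged while damping the off-diagonal terms, which is why the blended matrix stays positive definite whenever each variable has non-zero variance.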

    A Survey of Neighbourhood Construction Models for Categorizing Data Points

    Finding neighbourhood structures is very useful in extracting valuable relationships among data samples. This paper presents a survey of recent neighbourhood construction algorithms for clustering patterns and classifying data points. Extracting neighbourhoods and connections among the points is extremely useful for clustering and classifying the data. Applications such as detecting social network communities, bundling related edges, and solving location and routing problems all indicate the usefulness of this problem. Finding data point neighbourhoods in data mining and pattern recognition should generally improve knowledge extraction from databases. Several algorithms for data point neighbourhood construction have been proposed to analyse data in this sense. This paper describes them and discusses them from different aspects. Finally, future challenges in neighbourhood construction are outlined.

    Clustering Millions of Faces by Identity

    In this work, we attempt to address the following problem: given a large number of unlabeled face images, cluster them into the individual identities present in the data. We consider this a relevant problem in application scenarios ranging from social media to law enforcement. In large-scale scenarios the number of faces in the collection can be of the order of hundreds of millions, while the number of clusters can range from a few thousand to millions, leading to difficulties in both run-time complexity and the evaluation of clustering and per-cluster quality. An efficient and effective Rank-Order clustering algorithm is developed to achieve the desired scalability and better clustering accuracy than other well-known algorithms such as k-means and spectral clustering. We cluster up to 123 million face images into over 10 million clusters, and analyze the results in terms of external cluster quality measures (known face labels), internal cluster quality measures (unknown face labels), and run-time. Our algorithm achieves an F-measure of 0.87 on a benchmark unconstrained face dataset (LFW, consisting of 13K faces), and 0.27 on the largest dataset considered (13K images in LFW, plus 123M distractor images). Additionally, we present preliminary work on video frame clustering (achieving 0.71 F-measure when clustering all frames in the benchmark YouTube Faces dataset). A per-cluster quality measure is developed which can be used to rank individual clusters and to automatically identify a subset of good-quality clusters for manual exploration.
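
    The abstract above reports F-measures without defining them. As a hedged illustration only, the sketch below computes one common external measure for clusterings, the pairwise F-measure, in which a pair of items counts as a true positive when it shares both a true label and a cluster. Whether this matches the paper's exact definition is an assumption, and the function name is hypothetical.

```python
from itertools import combinations


def pairwise_f_measure(true_labels, cluster_labels):
    """Harmonic mean of pairwise precision and recall over all item pairs."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        if same_cluster and same_true:
            tp += 1  # pair correctly placed together
        elif same_cluster:
            fp += 1  # pair merged but identities differ
        elif same_true:
            fn += 1  # same identity split across clusters
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)


identities = ["a", "a", "b", "b"]
perfect = pairwise_f_measure(identities, [0, 0, 1, 1])
merged = pairwise_f_measure(identities, [0, 0, 0, 1])
```

    The loop visits every pair, so it is quadratic in the number of items and only suitable for small labelled subsets, not for collections of the size discussed in the abstract.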

    ELKI: A large open-source library for data analysis - ELKI Release 0.7.5 "Heidelberg"

    This paper documents the release of the ELKI data mining framework, version 0.7.5. ELKI is open-source (AGPLv3) data mining software written in Java. The focus of ELKI is research in algorithms, with an emphasis on unsupervised methods in cluster analysis and outlier detection. To achieve high performance and scalability, ELKI offers data index structures such as the R*-tree that can provide major performance gains. ELKI is designed to be easy to extend for researchers and students in this domain, and welcomes contributions of additional methods. ELKI aims to provide a large collection of highly parameterizable algorithms, in order to allow easy and fair evaluation and benchmarking of algorithms. We first outline the motivation for this release and the plans for the future, and then give a brief overview of the new functionality in this version. We also include an appendix presenting an overview of the overall implemented functionality.

    Clustering Based on Pairwise Distances When the Data is of Mixed Dimensions

    In the context of clustering, we consider a generative model in a Euclidean ambient space with clusters of different shapes, dimensions, sizes and densities. In an asymptotic setting where the number of points becomes large, we obtain theoretical guarantees for a few emblematic methods based on pairwise distances: a simple algorithm based on the extraction of connected components in a neighborhood graph; the spectral clustering method of Ng, Jordan and Weiss; and hierarchical clustering with single linkage. The methods are shown to enjoy some near-optimal properties in terms of separation between clusters and robustness to outliers. The local scaling method of Zelnik-Manor and Perona is shown to lead to a near-optimal choice for the scale in the first two methods. We also provide a lower bound on the spectral gap to consistently choose the correct number of clusters in the spectral method.
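
    The first method named in the abstract above, extracting connected components from a neighborhood graph, can be sketched in a few lines. This is an illustrative version under stated assumptions: the graph joins two points whenever their Euclidean distance is at most a user-supplied `radius`, and the function name is hypothetical. It does not reproduce the paper's scale selection or its guarantees.

```python
import math


def neighborhood_components(points, radius):
    """Label each point by its connected component in the radius graph."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Join every pair of points that lie within `radius` of each other.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= radius:
                parent[find(i)] = find(j)

    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(points))]


points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0), (5.4, 5.0)]
labels = neighborhood_components(points, radius=0.6)
```

    The first and third points are farther apart than the radius but still share a label, because the second point links them; this chaining is what lets the method follow elongated clusters.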

    Efficient Parameter-free Clustering Using First Neighbor Relations

    We present a new clustering method in the form of a single clustering equation that is able to directly discover groupings in the data. The main proposition is that the first neighbor of each sample is all one needs to discover large chains and find the groups in the data. In contrast to most existing clustering algorithms, our method does not require any hyper-parameters, distance thresholds, or a specified number of clusters. The proposed algorithm belongs to the family of hierarchical agglomerative methods. The technique has a very low computational overhead, is easily scalable, and is applicable to large practical problems. Evaluation on well-known datasets from different domains, ranging between 1077 and 8.1 million samples, shows substantial performance gains when compared to existing clustering techniques. Comment: CVPR 201
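
    The central idea in the abstract above, that first-neighbor links alone form chains that reveal groups, can be illustrated with a single partition pass. This is a sketch, not the paper's clustering equation: it assumes Euclidean distance, finds first neighbors by brute force, and omits the hierarchical steps the abstract alludes to. The function name is hypothetical.

```python
import math


def first_neighbor_partition(points):
    """Group points by chaining each one to its nearest other point.

    Requires at least two points. No threshold or cluster count is needed.
    """
    n = len(points)
    # first[i] is the index of the point closest to point i (excluding i).
    first = [min((j for j in range(n) if j != i),
                 key=lambda j: math.dist(points[i], points[j]))
             for i in range(n)]

    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Linking i to first[i] also places two points that share a first
    # neighbor in the same group, since both attach to that neighbor.
    for i, j in enumerate(first):
        parent[find(i)] = find(j)

    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


points = [(0.0, 0.0), (0.1, 0.0), (0.3, 0.0), (10.0, 10.0), (10.2, 10.0)]
labels = first_neighbor_partition(points)
```

    Every point has exactly one outgoing link, so a pass adds only n links once the first neighbors are known; the brute-force neighbor search used here is the quadratic part and is the piece a scalable version would replace.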

    1-D and 2-D Parallel Algorithms for All-Pairs Similarity Problem

    The all-pairs similarity problem asks for all pairs of vectors in a set whose similarity surpasses a given threshold; it is a computational kernel in data mining and information retrieval for several tasks. We investigate the parallelization of a recent fast sequential algorithm. We propose effective 1-D and 2-D data distribution strategies that preserve the essential optimizations in the fast algorithm. 1-D parallel algorithms distribute either dimensions or vectors, whereas the 2-D parallel algorithm distributes data both ways. Additional contributions to the 1-D vertical distribution include a local pruning strategy to reduce the number of candidates, a recursive pruning algorithm, and block processing to reduce imbalance. The parallel algorithms were programmed in OCaml, which affords much convenience. Our experiments indicate that performance depends on the dataset; therefore, a variety of parallelizations is useful.
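
    To make the problem statement in the abstract above concrete, here is a brute-force sequential baseline. It is only a sketch of the problem definition: cosine similarity is an assumed choice of measure, the function name is hypothetical, and none of the paper's pruning or data-distribution strategies are reproduced.

```python
import math


def all_pairs_similarity(vectors, threshold):
    """Return (i, j, similarity) for every pair whose cosine similarity
    strictly exceeds `threshold`, with i < j."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    pairs = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if norms[i] == 0.0 or norms[j] == 0.0:
                continue  # cosine similarity is undefined for zero vectors
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            similarity = dot / (norms[i] * norms[j])
            if similarity > threshold:
                pairs.append((i, j, similarity))
    return pairs


vectors = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
matches = all_pairs_similarity(vectors, threshold=0.9)
```

    The outer loop over `i` is independent for each vector, which corresponds to distributing vectors across workers, while the dot product sums over dimensions, which corresponds to distributing dimensions; these are the two axes the abstract's 1-D strategies split along.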

    A New Clustering Algorithm Based Upon Flocking On Complex Network

    We propose a model based on flocking on a complex network and develop two clustering algorithms from it. In these algorithms, a weighted and directed k-nearest-neighbor (kNN) graph is first built over all data points in a dataset, each of which is regarded as an agent that can move in space; a time-varying complex network is then created by adding long-range links for each data point. Each data point is acted on not only by its k nearest neighbors but also by r long-range neighbors, through fields that they jointly establish in space, so it takes a step along the direction of the vector sum of all fields. More importantly, these long-range links provide hidden information to each data point as it moves and at the same time accelerate its convergence to a center. As the points move in space according to the proposed model, those that belong to the same class gradually converge to the same position, whereas those that belong to different classes move away from one another. The experimental results demonstrate that data points are clustered reasonably and efficiently, and that the clustering algorithms converge quickly. Moreover, a comparison with other algorithms also indicates the effectiveness of the proposed approach. Comment: 18 pages, 4 figures, 3 table

    Unsupervised Methods for Determining Object and Relation Synonyms on the Web

    The task of identifying synonymous relations and objects, or synonym resolution, is critical for high-quality information extraction. This paper investigates synonym resolution in the context of unsupervised information extraction, where neither hand-tagged training examples nor domain knowledge is available. The paper presents a scalable, fully implemented system that runs in O(KN log N) time in the number of extractions, N, and the maximum number of synonyms per word, K. The system, called Resolver, introduces a probabilistic relational model for predicting whether two strings are co-referential based on the similarity of the assertions containing them. On a set of two million assertions extracted from the Web, Resolver resolves objects with 78% precision and 68% recall, and resolves relations with 90% precision and 35% recall. Several variations of Resolver's probabilistic model are explored, and experiments demonstrate that under appropriate conditions these variations can improve F1 by 5%. An extension to the basic Resolver system allows it to handle polysemous names with 97% precision and 95% recall on a data set from the TREC corpus.

    Reductive Clustering: An Efficient Linear-time Graph-based Divisive Cluster Analysis Approach

    We propose an efficient linear-time graph-based divisive cluster analysis approach called Reductive Clustering. The approach tries to reveal hierarchical structural information by repeatedly reducing the graph to a more concise one. With these reductions, the original graph can be divided into subgraphs recursively, and a lightweight, informative dendrogram is constructed from the divisions. Each reduction consists of three steps: selection, connection, and partition. First, a subset of the graph's vertices is selected as representatives to build a concise graph. The representatives are then re-connected to maintain a structure consistent with the previous graph. If possible, the concise graph is divided into subgraphs, and each subgraph is further reduced recursively until the termination condition is met. We discuss the approach, along with several selection and connection methods, in detail both theoretically and experimentally. Our implementations run in linear time and achieve outstanding performance on various types of datasets. Experimental results show that they outperform state-of-the-art clustering algorithms with significantly lower computing resource requirements. Comment: http://res.ctarn.io/reductive-clusterin