5,868 research outputs found

    Out-of-sample generalizations for supervised manifold learning for classification

    Get PDF
    Supervised manifold learning methods for data classification map data samples residing in a high-dimensional ambient space to a lower-dimensional domain in a structure-preserving way, while enhancing the separation between different classes in the learned embedding. Most nonlinear supervised manifold learning methods compute the embedding of the manifolds only at the initially available training points, while the generalization of the embedding to novel points, known as the out-of-sample extension problem in manifold learning, becomes especially important in classification applications. In this work, we propose a semi-supervised method for building an interpolation function that provides an out-of-sample extension for general supervised manifold learning algorithms studied in the context of classification. The proposed algorithm computes a radial basis function (RBF) interpolator that minimizes an objective function consisting of the total embedding error of unlabeled test samples, defined as their distance to the embeddings of the manifolds of their own class, as well as a regularization term that controls the smoothness of the interpolation function in a direction-dependent way. The class labels of test data and the interpolation function parameters are estimated jointly with a progressive procedure. Experimental results on face and object images demonstrate the potential of the proposed out-of-sample extension algorithm for the classification of manifold-modeled data sets

    Unsupervised User Stance Detection on Twitter

    Full text link
    We present a highly effective unsupervised framework for detecting the stance of prolific Twitter users with respect to controversial topics. In particular, we use dimensionality reduction to project users onto a low-dimensional space, followed by clustering, which allows us to find core users that are representative of the different stances. Our framework has three major advantages over pre-existing methods, which are based on supervised or semi-supervised classification. First, we do not require any prior labeling of users: instead, we create clusters, which are much easier to label manually afterwards, e.g., in a matter of seconds or minutes instead of hours. Second, there is no need for domain- or topic-level knowledge either to specify the relevant stances (labels) or to conduct the actual labeling. Third, our framework is robust in the face of data skewness, e.g., when some users or some stances have greater representation in the data. We experiment with different combinations of user similarity features, dataset sizes, dimensionality reduction methods, and clustering algorithms to ascertain the most effective and most computationally efficient combinations across three different datasets (in English and Turkish). We further verified our results on additional tweet sets covering six different controversial topics. Our best combination in terms of effectiveness and efficiency uses retweeted accounts as features, UMAP for dimensionality reduction, and Mean Shift for clustering, and yields a small number of high-quality user clusters, typically just 2--3, with more than 98\% purity. The resulting user clusters can be used to train downstream classifiers. Moreover, our framework is robust to variations in the hyper-parameter values and also with respect to random initialization

    A Survey on Metric Learning for Feature Vectors and Structured Data

    Full text link
    The need for appropriate ways to measure the distance or similarity between data is ubiquitous in machine learning, pattern recognition and data mining, but handcrafting such good metrics for specific problems is generally difficult. This has led to the emergence of metric learning, which aims at automatically learning a metric from data and has attracted a lot of interest in machine learning and related fields for the past ten years. This survey paper proposes a systematic review of the metric learning literature, highlighting the pros and cons of each approach. We pay particular attention to Mahalanobis distance metric learning, a well-studied and successful framework, but additionally present a wide range of methods that have recently emerged as powerful alternatives, including nonlinear metric learning, similarity learning and local metric learning. Recent trends and extensions, such as semi-supervised metric learning, metric learning for histogram data and the derivation of generalization guarantees, are also covered. Finally, this survey addresses metric learning for structured data, in particular edit distance learning, and attempts to give an overview of the remaining challenges in metric learning for the years to come.Comment: Technical report, 59 pages. Changes in v2: fixed typos and improved presentation. Changes in v3: fixed typos. Changes in v4: fixed typos and new method

    A Graph-Based Semi-Supervised k Nearest-Neighbor Method for Nonlinear Manifold Distributed Data Classification

    Get PDF
    kk Nearest Neighbors (kkNN) is one of the most widely used supervised learning algorithms to classify Gaussian distributed data, but it does not achieve good results when it is applied to nonlinear manifold distributed data, especially when a very limited amount of labeled samples are available. In this paper, we propose a new graph-based kkNN algorithm which can effectively handle both Gaussian distributed data and nonlinear manifold distributed data. To achieve this goal, we first propose a constrained Tired Random Walk (TRW) by constructing an RR-level nearest-neighbor strengthened tree over the graph, and then compute a TRW matrix for similarity measurement purposes. After this, the nearest neighbors are identified according to the TRW matrix and the class label of a query point is determined by the sum of all the TRW weights of its nearest neighbors. To deal with online situations, we also propose a new algorithm to handle sequential samples based a local neighborhood reconstruction. Comparison experiments are conducted on both synthetic data sets and real-world data sets to demonstrate the validity of the proposed new kkNN algorithm and its improvements to other version of kkNN algorithms. Given the widespread appearance of manifold structures in real-world problems and the popularity of the traditional kkNN algorithm, the proposed manifold version kkNN shows promising potential for classifying manifold-distributed data.Comment: 32 pages, 12 figures, 7 table
    • …
    corecore