
    Towards subjectifying text clustering

    Although it is common practice to produce only a single clustering of a dataset, in many cases text documents can be clustered along different dimensions. Unfortunately, not only do traditional text clustering algorithms fail to produce multiple clusterings of a dataset, but the only clustering they produce may not be the one that the user desires. In this paper, we propose a simple active clustering algorithm that is capable of producing multiple clusterings of the same data according to user interest. In comparison to previous work on feedback-oriented clustering, the amount of user feedback required by our algorithm is minimal. In fact, the feedback turns out to be as simple as a cursory look at a list of words. Experimental results are very promising: our system is able to generate clusterings along the user-specified dimensions with reasonable accuracies on several challenging text classification tasks, thus providing suggestive evidence that our approach is viable.

    Does the Geometry of Word Embeddings Help Document Classification? A Case Study on Persistent Homology Based Representations

    We investigate the pertinence of methods from algebraic topology for text data analysis. These methods enable the development of mathematically-principled isometric-invariant mappings from a set of vectors to a document embedding, which is stable with respect to the geometry of the document in the selected metric space. In this work, we evaluate the utility of these topology-based document representations in traditional NLP tasks, specifically document clustering and sentiment classification. We find that the embeddings do not benefit text analysis. In fact, performance is worse than simple techniques like tf-idf, indicating that the geometry of the document does not provide enough variability for classification on the basis of topic or sentiment in the chosen datasets. Comment: 5 pages, 3 figures. Rep4NLP workshop at ACL 201

    Discretize and Conquer: Scalable Agglomerative Clustering in Hamming Space

    Clustering is one of the most fundamental tasks in many machine learning and information retrieval applications. Roughly speaking, the goal is to partition data instances such that similar instances end up in the same group while dissimilar instances lie in different groups. Quite surprisingly though, the formal and rigorous definition of clustering is not at all clear, mainly because there is no consensus about what constitutes a cluster. That said, across all disciplines, from mathematics and statistics to genetics, people frequently try to get a first intuition about the data through identifying meaningful groups. Finding similar instances and grouping them are the two main steps in clustering, and not surprisingly, both have been the subject of extensive study over recent decades. It has been shown that using large datasets is the key to achieving acceptable levels of performance in data-driven applications. Today, the Internet is a vast resource for such datasets, each of which contains millions or billions of high-dimensional items such as images and text documents. However, for such large-scale datasets, the performance of the employed machine-learning algorithm quickly becomes the main bottleneck. Conventional clustering algorithms are no exception, and a great deal of effort has been devoted to developing scalable clustering algorithms. Clustering tasks can vary both in terms of the input they have and the output that they are expected to generate. For instance, the input of a clustering algorithm can hold various types of data, such as continuous numerical and categorical types. This thesis focuses on a particular setting in which the input instances are represented with binary strings. Binary representation has several advantages such as storage efficiency, simplicity, lack of a numerical-data-like concept of noise, and being naturally normalized.
The literature abounds with applications of clustering binary data, such as in marketing, document clustering, and image clustering. As a more concrete example, in marketing for an online store, each customer's basket is a binary representation of items. By clustering customers, the store can recommend items to customers with the same interests. In document clustering, documents can be represented as binary codes in which each element indicates whether a word exists in the document or not. Another notable application of binary codes is in binary hashing, which has been the topic of significant research in the last decade. The goal of binary hashing is to encode high-dimensional items, such as images, with compact binary strings so as to preserve a given notion of similarity. Such codes enable extremely fast nearest neighbour searches, as the distance between two codes (often the Hamming distance) can be computed quickly using bit-wise operations implemented at the hardware level. Similar to other types of data, the clustering of binary datasets has witnessed considerable research recently. Unfortunately, most of the existing approaches are only concerned with devising density and centroid-based clustering algorithms, even though many other types of clustering techniques can be applied to binary data. One of the most popular and intuitive algorithms in connectivity-based clustering is the Hierarchical Agglomerative Clustering (HAC) algorithm, which is based on the core idea of objects being more related to nearby objects than to objects farther away. As the name suggests, HAC is a family of clustering methods that return a dendrogram as their output: that is, a hierarchical tree of domain subsets, with singleton instances at the leaves and the whole dataset at the root. Such algorithms need no prior knowledge about the number of clusters.
Most of them are deterministic and applicable to different cluster shapes, but these advantages come at the price of high computational and storage costs in comparison with other popular clustering algorithms such as k-means. In this thesis, a family of HAC algorithms is proposed, called Discretized Agglomerative Clustering (DAC), that is designed to work with binary data. By leveraging the discretized and bounded nature of binary representation, the proposed algorithms can achieve significant speedup factors both in theory and practice, in comparison to the existing solutions. From the theoretical perspective, DAC algorithms can reduce the computational cost of hierarchical clustering from cubic to quadratic, matching the known lower bounds for HAC. The proposed approach is also empirically compared with other well-known clustering algorithms such as k-means, DBSCAN, and average- and complete-linkage HAC, on well-known datasets such as TEXMEX, CIFAR-10 and MNIST, which are among the standard benchmarks for large-scale algorithms. Results indicate that by mapping real points to binary vectors using existing binary hashing algorithms and clustering them with DAC, one can achieve a speedup of several orders of magnitude without losing much clustering quality, and in some cases, achieving even more
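
The abstract above mentions representing documents as binary codes and comparing them with the Hamming distance via bit-wise operations. A minimal Python sketch of that idea follows. It is not the DAC algorithm from the thesis; the function names `encode_document` and `hamming_distance` and the toy vocabulary are assumptions made for illustration.

```python
def encode_document(words, vocabulary):
    """Build a binary code: bit i is set when vocabulary[i] occurs in the document."""
    present = set(words)
    code = 0
    for i, term in enumerate(vocabulary):
        if term in present:
            code |= 1 << i
    return code


def hamming_distance(a, b):
    """Count the bit positions where two codes differ (XOR, then count set bits)."""
    return bin(a ^ b).count("1")


# Hypothetical toy vocabulary and documents, for illustration only.
vocabulary = ["cluster", "binary", "hash", "image", "text"]
doc_a = encode_document("binary hash image".split(), vocabulary)
doc_b = encode_document("binary text cluster".split(), vocabulary)
distance = hamming_distance(doc_a, doc_b)
print(distance)
```

Storing each code as a single integer keeps the comparison to one XOR followed by a bit count, which is the property that makes nearest-neighbour search over binary codes cheap.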

    Two Algorithms for Orthogonal Nonnegative Matrix Factorization with Application to Clustering

    Approximate matrix factorization techniques with both nonnegativity and orthogonality constraints, referred to as orthogonal nonnegative matrix factorization (ONMF), have been recently introduced and shown to work remarkably well for clustering tasks such as document classification. In this paper, we introduce two new methods to solve ONMF. First, we show a mathematical equivalence between ONMF and a weighted variant of spherical k-means, from which we derive our first method, a simple EM-like algorithm. This also allows us to determine when ONMF should be preferred to k-means and spherical k-means. Our second method is based on an augmented Lagrangian approach. Standard ONMF algorithms typically enforce nonnegativity for their iterates while trying to achieve orthogonality at the limit (e.g., using a proper penalization term or a suitably chosen search direction). Our method works the opposite way: orthogonality is strictly imposed at each step while nonnegativity is asymptotically obtained, using a quadratic penalty. Finally, we show that the two proposed approaches compare favorably with standard ONMF algorithms on synthetic, text and image data sets. Comment: 17 pages, 8 figures. New numerical experiments (document and synthetic data sets
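
For readers unfamiliar with spherical k-means, which the abstract above relates to ONMF, here is a minimal pure-Python sketch of the plain, unweighted variant. It is not the weighted variant or the EM-like algorithm derived in the paper. The function names, the seeding with the first `k` points, and the fixed iteration count are assumptions made for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def spherical_kmeans(points, k, iterations=20):
    """Cluster by cosine similarity: centroids are kept on the unit sphere."""
    unit = [normalize(p) for p in points]
    # Assumption: seed with the first k points; real code would seed more carefully.
    centroids = [list(p) for p in unit[:k]]
    labels = [0] * len(unit)
    for _ in range(iterations):
        # Assignment step: pick the centroid with the largest cosine similarity.
        labels = [max(range(k), key=lambda j: dot(p, centroids[j])) for p in unit]
        # Update step: each centroid becomes the normalized sum of its members.
        for j in range(k):
            members = [p for p, label in zip(unit, labels) if label == j]
            if members:
                summed = [sum(column) for column in zip(*members)]
                centroids[j] = normalize(summed)
    return labels, centroids


# Hypothetical nonnegative toy data: two directions in the plane.
points = [[1.0, 0.1], [0.1, 1.0], [2.0, 0.3], [0.2, 3.0]]
labels, centroids = spherical_kmeans(points, k=2)
print(labels)
```

Because every point is normalized first, vector length is ignored and only direction matters, which is why this family of methods suits nonnegative term-count data.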