
    Spectral Clustering with Imbalanced Data

    Spectral clustering is sensitive to how graphs are constructed from data, particularly when proximal and imbalanced clusters are present. We show that the ratio cut (RCut) and normalized cut (NCut) objectives are not tailored to imbalanced data, since they tend to emphasize cut sizes over cut values. To deal with imbalanced data, we propose a graph partitioning problem that seeks minimum cut partitions under minimum size constraints on the partitions. Our approach parameterizes a family of graphs by adaptively modulating node degrees on a fixed node set, yielding a set of parameter-dependent cuts that reflect varying levels of imbalance. The solution to our problem is then obtained by optimizing over these parameters. We present rigorous limit cut analysis results to justify our approach, and demonstrate the superiority of our method through unsupervised and semi-supervised experiments on synthetic and real data sets.
    Comment: 24 pages, 7 figures. arXiv admin note: substantial text overlap with arXiv:1302.513
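The cut-size-versus-cut-value tension described above can be seen on a toy graph. The sketch below (the graph, cluster sizes, and helper functions are illustrative assumptions, not taken from the paper) shows a partition with the minimum raw cut value losing to a balanced partition under the RCut objective:

```python
import numpy as np

# Toy graph (illustrative, not from the paper): two dense clusters
# A = {0..4} and B = {5..8} joined by two bridge edges, plus a single
# node 9 attached to B by one edge -- the small, imbalanced cluster.
n = 10
W = np.zeros((n, n))
A, B = range(0, 5), range(5, 9)
for block in (A, B):
    for i in block:
        for j in block:
            if i != j:
                W[i, j] = 1.0
W[4, 5] = W[5, 4] = 1.0   # bridge edge A-B
W[4, 6] = W[6, 4] = 1.0   # bridge edge A-B
W[8, 9] = W[9, 8] = 1.0   # lone edge to the small cluster {9}

def cut(W, S):
    """Total weight of edges crossing the partition (S, complement)."""
    m = np.zeros(len(W), dtype=bool)
    m[list(S)] = True
    return W[m][:, ~m].sum()

def rcut(W, S):
    """Ratio cut: cut value normalized by the sizes of both sides."""
    c = cut(W, S)
    return c / len(S) + c / (len(W) - len(S))

# {9} has the minimum raw cut value (1 edge vs 2), yet RCut prefers the
# balanced A-vs-rest split (0.8 < 1.111...): the size normalizer
# penalizes the size-1 partition more than it rewards the smaller cut.
print(cut(W, {9}), rcut(W, {9}))          # cut 1.0, RCut ~ 1.111
print(cut(W, set(A)), rcut(W, set(A)))    # cut 2.0, RCut 0.8
```

NCut behaves analogously, with degree volumes replacing partition sizes as the normalizers, which is why both objectives can miss small clusters that a raw minimum cut would find.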

    Clustering and Community Detection with Imbalanced Clusters

    Spectral clustering methods, which are frequently used in clustering and community detection applications, are sensitive to the specific graph construction, particularly when imbalanced clusters are present. We show that the ratio cut (RCut) and normalized cut (NCut) objectives are not tailored to imbalanced cluster sizes, since they tend to emphasize cut sizes over cut values. To deal with imbalanced cluster sizes, we propose a graph partitioning problem that seeks minimum cut partitions under minimum size constraints on the partitions. Our approach parameterizes a family of graphs by adaptively modulating node degrees on a fixed node set, yielding a set of parameter-dependent cuts that reflect varying levels of imbalance. The solution to our problem is then obtained by optimizing over these parameters. We present rigorous limit cut analysis results to justify our approach, and demonstrate the superiority of our method through experiments on synthetic and real datasets for data clustering, semi-supervised learning and community detection.
    Comment: Extended version of arXiv:1309.2303 with new applications. Accepted to IEEE TSIP

    Hash code learning for large scale similarity search

    In this thesis we explore methods which learn compact hash coding schemes to encode image databases, such that relevant images can be quickly retrieved when a query image is presented. We present three contributions. First, we improve upon the bit allocation strategy of Signal-to-Noise Ratio Maximization Hashing (SMH) to produce longer hash codes without a deterioration in retrieval performance, as measured by mean average precision (MAP). The proposed bit allocation strategy seamlessly converts the Hamming distance between hash codes into a likelihood ratio test statistic, which is the optimal decision rule for deciding whether samples are related. We show via experiments that, at the same false positive rate, the proposed method obtains false negative error rates significantly lower than those of the original SMH bit allocation strategy. Our second contribution is the extension of SMH to a deep linear discriminant analysis (LDA) framework. The original SMH method uses features from convolutional neural networks (CNNs) trained with categorical cross-entropy (CCE) loss, which does not explicitly impose linear separability on the latent representation learned by the CNN. The deep LDA framework lets us learn a non-linear transformation of the input images that yields features which are more discriminative (samples of the same class are close together while samples of different classes are far apart) and better fit the linear Gaussian model assumed in SMH. We show that the enhanced SMH method using deep LDA outperforms recent state-of-the-art hashing methods on the single-label datasets CIFAR10 and MNIST. Our final contribution is an unsupervised graph construction method which binarizes CNN features and allows the use of fast Hamming distance calculations to approximate pairwise similarity. This graph can be used in various unsupervised hashing methods which require a similarity matrix.
    Current unsupervised image graph construction methods are dominated by those which exploit the manifold structure of images in feature space. These methods face a dilemma: they need a large, dense set of data points to capture the manifold structure, yet cannot scale to the requisite sample sizes due to their very high complexity. We depart from the manifold paradigm and propose an alternative based on matching, exploiting the feature-detecting capabilities of rectified linear unit (ReLU) activations to generate binary features which are robust to dataset sparsity and have significant advantages in computational runtime and storage. We show on six benchmark datasets that our proposed binary features outperform the original ones. Furthermore, we explain why the proposed binarization under the Hamming metric outperforms the original Euclidean metric: in low-SNR regimes, such as that of features obtained from CNNs trained on another dataset, dissimilar samples are much better separated in the Hamming metric than in the Euclidean metric.
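The binarization idea in the final contribution can be sketched as follows. This is a minimal illustration on synthetic stand-in features (the actual CNN features and graph construction method are not reproduced here): keep only whether each ReLU feature fired, then compare codes in the Hamming metric.

```python
import numpy as np

rng = np.random.default_rng(0)

def binarize(features):
    """Map real-valued (post-ReLU) features to {0,1}: fired or not."""
    return (features > 0).astype(np.uint8)

def hamming(a, b):
    """Hamming distance between two binary codes."""
    return int(np.count_nonzero(a != b))

# Synthetic stand-ins for CNN features: a query, a similar sample
# (same pre-activations plus small noise), and an unrelated sample.
z = rng.normal(size=64)                                 # pre-activations
query = np.maximum(z, 0)                                # ReLU features
similar = np.maximum(z + 0.1 * rng.normal(size=64), 0)
unrelated = np.maximum(rng.normal(size=64), 0)

q, s, u = binarize(query), binarize(similar), binarize(unrelated)
# The similar pair disagrees in only the few bits whose pre-activations
# sat near zero; the unrelated pair disagrees in roughly half the bits.
print(hamming(q, s), hamming(q, u))
```

With codes packed via `np.packbits`, the same distance reduces to XOR plus popcount, which is what makes this graph construction cheap in both runtime and storage.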

    Unsupervised learning in high-dimensional space

    Thesis (Ph.D.)--Boston University. In machine learning, the problem of unsupervised learning is that of trying to explain key features and find hidden structures in unlabeled data. In this thesis we focus on three unsupervised learning scenarios: graph-based clustering with imbalanced data, point-wise anomaly detection, and anomalous cluster detection on graphs. In the first part we study spectral clustering, a popular graph-based clustering technique. We investigate why spectral clustering performs poorly on imbalanced and proximal data. We then propose the partition constrained minimum cut (PCut) framework, based on a novel parametric graph construction method that is shown to adapt to different degrees of imbalance in the data. We analyze the limit cut behavior of our approach, and demonstrate significant performance improvements through clustering and semi-supervised learning experiments on imbalanced data. [TRUNCATED]
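For reference, the generic spectral bi-partitioning pipeline that PCut departs from can be sketched as below. This is the standard textbook construction (graph Laplacian, Fiedler vector, sign split), not the thesis's PCut algorithm, and the example graph is illustrative:

```python
import numpy as np

def spectral_bipartition(W):
    """Standard spectral 2-way split: Laplacian -> Fiedler vector -> signs."""
    d = W.sum(axis=1)
    L = np.diag(d) - W              # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    fiedler = vecs[:, 1]            # eigenvector of 2nd-smallest eigenvalue
    return fiedler >= 0             # boolean cluster assignment per node

# Two 3-node cliques joined by one weak edge.
W = np.zeros((6, 6))
for block in (range(0, 3), range(3, 6)):
    for i in block:
        for j in block:
            if i != j:
                W[i, j] = 1.0
W[2, 3] = W[3, 2] = 0.1             # weak bridge between the cliques

labels = spectral_bipartition(W)
print(labels[:3], labels[3:])       # the cliques land on opposite sides
```

On balanced, well-separated data like this the sign split recovers the clusters; the thesis's point is that on imbalanced and proximal data this pipeline degrades, motivating the PCut construction.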

    Structured learning for information retrieval

    Information retrieval is the area of study concerned with the process of searching, recovering and interpreting information from large amounts of data. In this thesis we show that many problems in information retrieval are instances of structured learning, where the goal is to learn predictors of complex output structures consisting of many interdependent variables. We then attack these problems using principled machine learning methods that are specifically suited to such scenarios. In doing so, we develop new models, model extensions and algorithms that, integrated with existing methodology, comprise a new set of tools for solving a variety of information retrieval problems. First, we cover the multi-label classification problem, where we seek to predict a set of labels associated with a given object; the output in this case is structured, as the output variables are interdependent. Second, we focus on document ranking, where, given a query and a set of documents associated with it, we want to rank the documents according to their relevance to the query; here again we have a structured output: a ranking of documents. Third, we address topic models, where we are given a set of documents and attempt to find a compact representation of them by learning latent topics and associating a topic distribution with each document; the output is again structured, consisting of word and topic distributions. For all of the above problems we obtain state-of-the-art solutions, as attested by empirical performance on publicly available real-world datasets.
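The "structured output" point can be made concrete for the ranking case with a generic sketch (made-up scores and relevance grades, not one of the thesis's models): the prediction is a whole permutation of the documents, and the loss couples all output variables through pairwise comparisons.

```python
import numpy as np

def rank(scores):
    """Predicted structured output: a permutation of document indices,
    ordered by descending relevance score."""
    return np.argsort(-np.asarray(scores)).tolist()

def pairwise_loss(scores, relevance):
    """Count document pairs ordered inconsistently with the ground-truth
    relevance grades -- a loss over the whole output structure, not over
    any single output variable in isolation."""
    s, r = np.asarray(scores), np.asarray(relevance)
    inversions = 0
    for i in range(len(s)):
        for j in range(len(s)):
            if r[i] > r[j] and s[i] <= s[j]:
                inversions += 1
    return inversions

relevance = [3, 0, 1, 2]         # ground-truth grades for 4 documents
good = [0.9, 0.1, 0.3, 0.7]      # scores consistent with the grades
bad = [0.2, 0.8, 0.5, 0.1]       # scores with several mis-ordered pairs

print(rank(good), pairwise_loss(good, relevance))  # [0, 3, 2, 1] and 0
print(pairwise_loss(bad, relevance))               # 5 mis-ordered pairs
```

Structured learners minimize losses of this coupled form directly, rather than treating each document's score as an independent prediction problem.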