
    Robust Path-based Image Segmentation Using Superpixel Denoising

    Clustering is the important task of partitioning data into groups with similar characteristics. One prominent approach is spectral clustering, in which data points are represented as the vertices of a graph connected by weighted edges that encode similarity based on distance. The longest leg path distance (LLPD) has shown promise in spectral clustering, but it is sensitive to noisy data and therefore requires a denoising procedure to achieve good performance. Previous denoising techniques identify and remove noisy data points; however, this is not a desirable pre-clustering step for data sets with a specific structure, such as images. Image segmentation, the process of partitioning an image into regions of similar features, can be posed as a clustering problem by treating the vector of intensity and spatial information at each pixel as a data point. We therefore propose pre-cluster denoising to formulate a robust LLPD clustering framework. By creating a fine clustering of approximately equal-sized groups and averaging each group, we obtain a reduced set of data points that captures the relevant information of the original data set while locally averaging out the influence of noise. We then construct a smaller graph representation of the data based on the LLPD between the reduced data points and compute the spectral embedding coordinates of each reduced point. An out-of-sample extension procedure then yields spectral embedding coordinates at each of the original data points, after which a simple (k-means) clustering produces the final cluster labels. In the context of image segmentation, superpixels provide a natural structure for this type of pre-clustering. We show how the above LLPD framework can be carried out for image segmentation, and that a simple, computationally efficient spatial interpolation procedure can be used in place of the out-of-sample extension to extend the embedding in a way that yields better segmentation performance with respect to ground truth on a publicly available data set. Similar experiments using the standard Euclidean distance in place of the LLPD demonstrate the advantage of the LLPD for image segmentation.
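    The pipeline described above lends itself to a compact sketch. The following is a minimal, illustrative implementation (not the authors' code): SLIC superpixels from scikit-image play the role of the pre-cluster denoising step, the LLPD between superpixel features is computed via a minimum spanning tree, and the final per-pixel labels are obtained by simply propagating each superpixel's cluster label rather than by the paper's out-of-sample extension or spatial interpolation. The example image, feature scaling, kernel bandwidth, and the numbers of superpixels and clusters are all illustrative choices.

```python
# Minimal sketch of an LLPD-based superpixel segmentation pipeline (not the
# authors' implementation): SLIC superpixels act as the pre-cluster denoising
# step, LLPD is computed over the superpixel graph via a minimum spanning tree,
# and pixel labels are obtained by propagating each superpixel's cluster label.
import numpy as np
from skimage import data, color, segmentation
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree
from sklearn.manifold import SpectralEmbedding
from sklearn.cluster import KMeans

rgb = data.astronaut()                      # example image; any RGB image works
lab = color.rgb2lab(rgb)
sp = segmentation.slic(rgb, n_segments=300, compactness=10, start_label=0)
n_sp = sp.max() + 1

# One feature vector per superpixel: mean Lab colour + mean (row, col) position.
rows, cols = np.indices(sp.shape)
feats = np.zeros((n_sp, 5))
for k in range(n_sp):
    m = sp == k
    feats[k, :3] = lab[m].mean(axis=0)
    feats[k, 3] = rows[m].mean()
    feats[k, 4] = cols[m].mean()
feats /= feats.std(axis=0)                  # roughly balance colour and position

# LLPD(i, j) = largest edge weight on the unique MST path between i and j.
D = squareform(pdist(feats))
mst = minimum_spanning_tree(D).toarray()
mst = np.maximum(mst, mst.T)                # symmetrize the tree's adjacency
llpd = np.zeros_like(D)
for src in range(n_sp):                     # traverse the tree, tracking max edge
    seen, stack = {src}, [(src, 0.0)]
    while stack:
        node, best = stack.pop()
        for nbr in np.nonzero(mst[node])[0]:
            if nbr not in seen:
                seen.add(nbr)
                llpd[src, nbr] = max(best, mst[node, nbr])
                stack.append((nbr, llpd[src, nbr]))

# Spectral embedding of the superpixels, k-means, then label propagation
# (a simple stand-in for the paper's out-of-sample / spatial interpolation step).
affinity = np.exp(-(llpd / llpd[llpd > 0].mean()) ** 2)
emb = SpectralEmbedding(n_components=5, affinity="precomputed").fit_transform(affinity)
labels = KMeans(n_clusters=6, n_init=10).fit_predict(emb)
segmentation_map = labels[sp]               # per-pixel segment labels
```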

    Overlapping and Robust Edge-Colored Clustering in Hypergraphs

    A recent trend in data mining has explored (hyper)graph clustering algorithms for data with categorical relationship types. Such algorithms have applications in the analysis of social, co-authorship, and protein interaction networks, to name a few. Many such applications naturally have some overlap between clusters, a nuance that is missing from current combinatorial models. Additionally, existing models lack a mechanism for handling noise in datasets. We address these concerns by generalizing Edge-Colored Clustering, a recent framework for categorical clustering of hypergraphs. Our generalizations allow for a budgeted number of either (a) overlapping cluster assignments or (b) node deletions. For each new model we present a greedy algorithm that approximately minimizes an edge-mistake objective, as well as bicriteria approximations in which the second approximation factor applies to the budget. Additionally, we address the parameterized complexity of each problem, providing FPT algorithms and hardness results.
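    As a concrete reference point, the sketch below implements the basic (unbudgeted) Edge-Colored Clustering heuristic on a toy colored hypergraph: each node takes the color appearing most often among its incident hyperedges, and a hyperedge counts as a mistake unless all of its nodes receive its color. The budgeted overlap and node-deletion generalizations, the bicriteria approximations, and the FPT algorithms from the paper are not reproduced; the toy hypergraph is invented for illustration.

```python
# Minimal sketch of the basic (unbudgeted) Edge-Colored Clustering heuristic:
# each node takes the colour that appears most often among its incident
# hyperedges, and a hyperedge counts as a "mistake" unless every one of its
# nodes received its colour. The budgeted overlap / node-deletion variants and
# their guarantees from the paper are not reproduced here.
from collections import Counter, defaultdict

# Toy coloured hypergraph: (colour, set-of-nodes) pairs -- illustrative only.
hyperedges = [
    ("red",   {1, 2, 3}),
    ("red",   {2, 3}),
    ("blue",  {3, 4}),
    ("blue",  {4, 5}),
    ("green", {1, 5}),
]

# Majority-colour assignment per node.
votes = defaultdict(Counter)
for colour, nodes in hyperedges:
    for v in nodes:
        votes[v][colour] += 1
assignment = {v: counts.most_common(1)[0][0] for v, counts in votes.items()}

# Edge mistakes: hyperedges whose nodes are not all assigned the edge's colour.
mistakes = sum(
    1 for colour, nodes in hyperedges
    if any(assignment[v] != colour for v in nodes)
)
print("assignment:", assignment)
print("edge mistakes:", mistakes)
```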

    Coresets for Wasserstein Distributionally Robust Optimization Problems

    Wasserstein distributionally robust optimization (WDRO) is a popular model for enhancing the robustness of machine learning with ambiguous data. However, the complexity of WDRO can be prohibitive in practice, since solving its "minimax" formulation requires a great amount of computation. Recently, several fast WDRO training algorithms have been developed for specific machine learning tasks (e.g., logistic regression). However, to the best of our knowledge, research on designing efficient algorithms for general large-scale WDRO problems is still quite limited. Coresets are an important tool for compressing large datasets and have been widely applied to reduce the computational complexity of many optimization problems. In this paper, we introduce a unified framework for constructing an ε-coreset for general WDRO problems. Although it is challenging to obtain a conventional coreset for WDRO due to the uncertainty of the ambiguous data, we show that a "dual coreset" can be computed by using the strong duality property of WDRO. Moreover, the error introduced by the dual coreset can be theoretically bounded with respect to the original WDRO objective. To construct the dual coreset, we propose a novel grid sampling approach that is particularly suitable for the dual formulation of WDRO. Finally, we implement our coreset approach and illustrate its effectiveness for several WDRO problems in experiments.
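    To make the coreset idea concrete, the sketch below shows a generic grid-based compression step: points falling in the same grid cell are replaced by their mean, weighted by the cell count, so a weighted objective over the representatives approximates the objective over the full data. This is only an illustration of the general coreset / grid-sampling idea; the paper's dual coreset, its guarantees for the WDRO objective, and the choice of grid width are not reproduced here.

```python
# Minimal sketch of a grid-based coreset construction (generic illustration of
# the coreset idea, not the paper's dual coreset for WDRO): points in the same
# grid cell are replaced by their mean, weighted by the cell count, so a
# weighted objective over the representatives approximates the original one.
import numpy as np

def grid_coreset(X, cell_width):
    """Return (representatives, weights) after snapping points to a grid."""
    cells = np.floor(X / cell_width).astype(int)
    # Group points by grid cell and summarize each cell.
    _, inverse, counts = np.unique(cells, axis=0, return_inverse=True, return_counts=True)
    reps, weights = [], []
    for cell_id, count in enumerate(counts):
        reps.append(X[inverse == cell_id].mean(axis=0))
        weights.append(count)
    return np.array(reps), np.array(weights, dtype=float)

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 2))
reps, w = grid_coreset(X, cell_width=0.25)

# Sanity check: the weighted mean over the coreset matches the full-data mean.
print(len(reps), "representatives for", len(X), "points")
print(np.allclose(X.mean(axis=0), (w[:, None] * reps).sum(axis=0) / w.sum(), atol=1e-6))
```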

    Classification without labels: Learning from mixed samples in high energy physics

    Modern machine learning techniques can be used to construct powerful models for difficult collider physics problems. In many applications, however, these models are trained on imperfect simulations due to a lack of truth-level information in the data, which risks the model learning artifacts of the simulation. In this paper, we introduce the paradigm of classification without labels (CWoLa), in which a classifier is trained to distinguish statistical mixtures of classes, which are common in collider physics. Crucially, neither individual labels nor class proportions are required, yet we prove that the optimal classifier in the CWoLa paradigm is also the optimal classifier in the traditional fully supervised case where all label information is available. After demonstrating the power of this method in an analytical toy example, we consider a realistic benchmark for collider physics: distinguishing quark- versus gluon-initiated jets using mixed quark/gluon training samples. More generally, CWoLa can be applied to any classification problem where labels or class proportions are unknown or simulations are unreliable, but statistical mixtures of the classes are available. Comment: 18 pages, 5 figures; v2: intro extended and references added; v3: additional discussion to match JHEP version.
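    The CWoLa setup is easy to illustrate on synthetic data. In the sketch below (not the paper's quark/gluon study), signal and background are shifted Gaussians, two mixed samples are drawn with different signal fractions that are never shown to the model, a classifier is trained only on the "which mixture" label, and the resulting score is then evaluated as a signal-versus-background discriminant using held-out truth labels. The sample sizes, mixture fractions, and model choice are illustrative.

```python
# Toy illustration of the CWoLa setup on synthetic data (not the paper's
# quark/gluon study): train on "mixture A vs mixture B" labels only, then check
# that the learned score still separates signal from background.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def sample(n, signal_fraction):
    """Draw a mixed sample; signal and background are shifted 2-D Gaussians."""
    is_signal = rng.random(n) < signal_fraction
    x = rng.normal(size=(n, 2)) + np.where(is_signal[:, None], 1.0, -1.0)
    return x, is_signal

# Two mixtures with different signal fractions (never shown to the classifier).
xa, _ = sample(5_000, signal_fraction=0.7)
xb, _ = sample(5_000, signal_fraction=0.3)
X = np.vstack([xa, xb])
mixture_label = np.concatenate([np.ones(len(xa)), np.zeros(len(xb))])

clf = GradientBoostingClassifier().fit(X, mixture_label)

# Evaluate the mixture-trained score as a signal/background discriminant.
x_test, y_test = sample(5_000, signal_fraction=0.5)
score = clf.predict_proba(x_test)[:, 1]
print("signal-vs-background AUC:", roc_auc_score(y_test, score))
```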

    Object Discovery via Cohesion Measurement

    Color and intensity are two important components of an image. Usually, groups of image pixels that are similar in color or intensity form an informative representation of an object, and they are therefore particularly suitable for computer vision tasks such as saliency detection and object proposal generation. However, image pixels that share a similar real-world color may appear quite different, since colors are often distorted by intensity. In this paper, we reinvestigate the affinity matrices originally used in image segmentation methods based on spectral clustering. A new affinity matrix, which is robust to color distortions, is formulated for object discovery, and a Cohesion Measurement (CM) for object regions is derived from it. Based on this Cohesion Measurement, a novel object discovery method is proposed that discovers objects latent in an image by utilizing the eigenvectors of the affinity matrix. We then apply the proposed method to both saliency detection and object proposal generation. Experimental results on several evaluation benchmarks demonstrate that the proposed CM-based method achieves promising performance on these two tasks. Comment: 14 pages, 14 figures.
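    The sketch below shows the generic spectral machinery the abstract builds on: a Gaussian affinity over per-pixel color and position features and its leading eigenvectors, one of which is thresholded to propose a candidate object region. The paper's distortion-robust affinity and the Cohesion Measurement itself are not reproduced; the example image, downsampling factor, and kernel bandwidth are illustrative.

```python
# Generic sketch of the spectral machinery the abstract builds on: a Gaussian
# affinity over pixel colour + position features and its leading eigenvectors.
# The paper's distortion-robust affinity and Cohesion Measurement (CM) are not
# reproduced here; kernel bandwidth and downsampling factor are illustrative.
import numpy as np
from skimage import data, color, transform
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.linalg import eigsh

# Downsample so the dense affinity matrix stays small (48 * 48 = 2304 pixels).
lab = transform.resize(color.rgb2lab(data.astronaut()), (48, 48), anti_aliasing=True)
h, w, _ = lab.shape
rows, cols = np.indices((h, w))
feats = np.column_stack([lab.reshape(-1, 3), rows.ravel(), cols.ravel()]).astype(float)
feats /= feats.std(axis=0)                       # balance colour and position

# Gaussian affinity with a global bandwidth, and its leading eigenvectors.
D = squareform(pdist(feats))
A = np.exp(-(D / D.mean()) ** 2)
vals, vecs = eigsh(A, k=4, which="LM")           # eigenvalues in ascending order

# The leading eigenvector is near-constant; thresholding the next one at zero
# splits the image into two coherent regions, one of which is a candidate object.
candidate = vecs[:, -2].reshape(h, w) > 0
print(f"candidate region covers {candidate.mean():.0%} of the pixels")
```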