
    Unsupervised Learning via Mixtures of Skewed Distributions with Hypercube Contours

    Mixture models whose components have skewed hypercube contours are developed via a generalization of the multivariate shifted asymmetric Laplace density. Specifically, we develop mixtures of multiple scaled shifted asymmetric Laplace distributions. The component densities have two unique features: they include a multivariate weight function, and the marginal distributions are also asymmetric Laplace. We use these mixtures of multiple scaled shifted asymmetric Laplace distributions for clustering applications, but they could equally well be used in the supervised or semi-supervised paradigms. The expectation-maximization algorithm is used for parameter estimation and the Bayesian information criterion is used for model selection. Simulated and real data sets are used to illustrate the approach and, in some cases, to visualize the skewed hypercube structure of the components.
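
    The estimation workflow the abstract describes (EM for parameter fitting, BIC for choosing the number of components) can be sketched with off-the-shelf tools. The snippet below is a minimal illustration, assuming scikit-learn is available and substituting Gaussian components for the paper's multiple scaled shifted asymmetric Laplace densities, which have no stock implementation; the toy data and all names are illustrative.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Two overlapping skewed blobs as toy data (exponential noise is asymmetric).
    X = np.vstack([
        rng.exponential(1.0, size=(200, 2)),
        rng.exponential(1.0, size=(200, 2)) + [5.0, 5.0],
    ])

    # EM fit for G = 1..5 components; keep the model with the lowest BIC.
    models = [GaussianMixture(n_components=g, random_state=0).fit(X)
              for g in range(1, 6)]
    best = min(models, key=lambda m: m.bic(X))
    print("selected G =", best.n_components)
    labels = best.predict(X)  # hard cluster assignments from the chosen model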

    Data mining using concepts of independence, unimodality and homophily

    With the widespread use of information technologies, more and more complex data is generated and collected every day. Such complex data varies in structure, size, type, and format, e.g. time series, texts, images, videos, and graphs. Complex data is often high-dimensional and heterogeneous, which makes the separation of the wheat (knowledge) from the chaff (noise) more difficult. Clustering is a main mode of knowledge discovery from complex data: it groups objects in such a way that intra-group objects are more similar than inter-group objects. Traditional clustering methods such as k-means, Expectation-Maximization clustering (EM), DBSCAN, and spectral clustering are either deceived by "the curse of dimensionality" or spoiled by heterogeneous information. So, how can complex data be explored effectively? In some cases, people may have only partial information about the complex data. For example, in social networks, not every user provides profile information such as personal interests. Can we leverage the limited user information and the friendship network wisely to infer the likely labels of the unlabeled users, so that advertisers can target advertising accurately? This is the problem of learning from labeled and unlabeled data, commonly referred to as semi-supervised classification. To gain insights into these problems, this thesis focuses on developing clustering and semi-supervised classification methods driven by the concepts of independence, unimodality, and homophily. The proposed methods leverage techniques from diverse areas, such as statistics, information theory, graph theory, signal processing, optimization, and machine learning. Specifically, this thesis develops four methods: FUSE, ISAAC, UNCut, and wvGN. FUSE and ISAAC are clustering techniques that discover statistically independent patterns in high-dimensional numerical data. UNCut is a clustering technique that discovers unimodal clusters in attributed graphs in which not all the attributes are relevant to the graph structure. wvGN is a semi-supervised classification technique that uses the theory of homophily to infer the labels of unlabeled vertices in graphs; a generic sketch of this idea appears below. We have verified our clustering and semi-supervised classification methods on various synthetic and real-world data sets. The results are superior to those of the state-of-the-art.
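
    The homophily intuition behind wvGN (labels spread along the edges of a graph because connected vertices tend to be alike) can be illustrated with generic label propagation. The sketch below is not the thesis algorithm itself; the toy friendship graph, the clamping scheme, and all names are illustrative assumptions.

    import numpy as np

    # Adjacency matrix of a tiny friendship graph: two loosely connected cliques.
    A = np.array([
        [0, 1, 1, 0, 0, 0],
        [1, 0, 1, 0, 0, 0],
        [1, 1, 0, 1, 0, 0],
        [0, 0, 1, 0, 1, 1],
        [0, 0, 0, 1, 0, 1],
        [0, 0, 0, 1, 1, 0],
    ], dtype=float)

    # One labelled user per class; the remaining users are unknown (-1).
    y = np.array([0, -1, -1, -1, -1, 1])

    # Row-normalised propagation: each vertex averages its neighbours' label
    # distributions, while labelled vertices stay clamped to their known class.
    F = np.zeros((len(y), 2))
    F[y >= 0, y[y >= 0]] = 1.0
    P = A / A.sum(axis=1, keepdims=True)
    for _ in range(50):
        F = P @ F
        F[y >= 0] = 0.0
        F[y >= 0, y[y >= 0]] = 1.0  # re-clamp the labelled users

    print(F.argmax(axis=1))  # inferred labels for all users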

    Prototypical Contrastive Learning of Unsupervised Representations

    This paper presents Prototypical Contrastive Learning (PCL), an unsupervised representation learning method that addresses the fundamental limitations of instance-wise contrastive learning. PCL not only learns low-level features for the task of instance discrimination but, more importantly, implicitly encodes the semantic structure of the data into the learned embedding space. Specifically, we introduce prototypes as latent variables to help find the maximum-likelihood estimate of the network parameters in an Expectation-Maximization framework. We iteratively perform the E-step, which finds the distribution over prototypes via clustering, and the M-step, which optimizes the network via contrastive learning. We propose the ProtoNCE loss, a generalized version of the InfoNCE loss for contrastive learning, which encourages representations to be closer to their assigned prototypes. PCL outperforms state-of-the-art instance-wise contrastive learning methods on multiple benchmarks, with substantial improvement in low-resource transfer learning. Code and pretrained models are available at https://github.com/salesforce/PCL.
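
    The prototype term of the ProtoNCE loss can be sketched in a few lines: each embedding is pulled toward its assigned prototype under a per-prototype concentration (temperature). The code below is a hedged illustration written against plain PyTorch, not the released PCL implementation; proto_nce, phi, and the toy shapes are assumptions.

    import torch
    import torch.nn.functional as F

    def proto_nce(embeddings, prototypes, assignments, phi):
        # Similarity of every sample to every prototype, scaled by the
        # per-prototype concentration (tighter clusters get a lower temperature).
        logits = embeddings @ prototypes.t() / phi  # shape (N, K)
        # Cross-entropy against the assigned prototype is exactly
        # -log softmax at the assigned index, i.e. the prototype-level term.
        return F.cross_entropy(logits, assignments)

    # Toy usage: 8 samples and 3 prototypes in a 16-d embedding space.
    z = F.normalize(torch.randn(8, 16), dim=1)    # L2-normalised features
    c = F.normalize(torch.randn(3, 16), dim=1)    # centroids from the E-step
    a = torch.randint(0, 3, (8,))                 # cluster assignments
    phi = torch.ones(3)                           # concentration estimates
    loss = proto_nce(z, c, a, phi)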

    A robust approach to model-based classification based on trimming and constraints

    In a standard classification framework, a set of trustworthy learning data is employed to build a decision rule, with the final aim of classifying unlabelled units belonging to the test set. Unreliable labelled observations, namely outliers and data with incorrect labels, can therefore strongly undermine the classifier's performance, especially if the training set is small. The present work introduces a robust modification of the model-based classification framework, employing impartial trimming and constraints on the ratio between the maximum and minimum eigenvalues of the group scatter matrices. The proposed method effectively handles the presence of noise in both response and explanatory variables, providing reliable classification even on contaminated datasets. A robust information criterion is proposed for model selection. Experiments on real and simulated data, artificially adulterated, are provided to underline the benefits of the proposed method.
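
    The eigenvalue-ratio constraint can be made concrete with a small sketch: clip the eigenvalues of each group scatter matrix so that the largest is at most c times the smallest, then rebuild the matrix. The snippet below is an illustrative simplification (the paper determines the truncation level optimally rather than clipping around the eigenvalue mean); constrain_scatter and the reference level are assumptions.

    import numpy as np

    def constrain_scatter(S, c=10.0):
        # Eigendecompose the symmetric scatter matrix, clip its eigenvalues to
        # the band [m / sqrt(c), m * sqrt(c)] around their mean m, and rebuild.
        # After clipping, max(eig) / min(eig) <= c by construction.
        vals, vecs = np.linalg.eigh(S)
        m = vals.mean()  # a simple reference level for the band
        vals = np.clip(vals, m / np.sqrt(c), m * np.sqrt(c))
        return (vecs * vals) @ vecs.T

    S = np.diag([100.0, 1.0, 0.5])      # badly conditioned scatter matrix
    S_c = constrain_scatter(S, c=10.0)
    vals = np.linalg.eigvalsh(S_c)
    print(vals.max() / vals.min())      # <= 10 after the constraint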