11 research outputs found

    Implementation and Analysis of the HMRF-KMeans Algorithm for Semi-Supervised Document Clustering

    Get PDF
    ABSTRACT: Labeling data is expensive, so a system is needed in which data can be labeled easily and accurately. Semi-supervised clustering is a learning technique for clustering, or labeling, unsupervised data using supervised data as a reference. HMRF-KMeans is a semi-supervised clustering algorithm that uses a Hidden Markov Random Field to observe randomly selected supervised data, compute its natural probability through hidden parameter components, and then use these data as a reference for clustering. HMRF-KMeans combines constraint-based and distance-based learning in a single objective function; minimizing this objective function yields good cluster quality. The constraint-based component makes centroid initialization accurate, and distance learning helps to minimize the HMRF-KMeans objective function.
    Keywords: cost, semi-supervised clustering, HMRF-KMeans, algorithm, supervised, unsupervised, constraint, distance
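The combined objective described above can be sketched minimally as distortion to the assigned centroid plus penalties for violated pairwise constraints. The uniform penalty weight `w` and squared Euclidean distance are simplifying assumptions for illustration, not the paper's exact HMRF potentials:

```python
import numpy as np

def hmrf_kmeans_objective(X, labels, centroids, must_link, cannot_link, w=1.0):
    """Sketch of an HMRF-KMeans-style objective: distortion plus
    penalties for violated pairwise constraints."""
    # Squared distance of each point to its assigned centroid.
    distortion = sum(np.sum((X[i] - centroids[labels[i]]) ** 2)
                     for i in range(len(X)))
    # Penalise must-link pairs placed in different clusters.
    ml_penalty = sum(w for i, j in must_link if labels[i] != labels[j])
    # Penalise cannot-link pairs placed in the same cluster.
    cl_penalty = sum(w for i, j in cannot_link if labels[i] == labels[j])
    return distortion + ml_penalty + cl_penalty

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centroids = np.array([[0.05, 0.0], [5.05, 5.0]])
# A constraint-respecting assignment scores lower than a violating one.
good = hmrf_kmeans_objective(X, [0, 0, 1, 1], centroids, [(0, 1)], [(0, 2)])
bad = hmrf_kmeans_objective(X, [0, 1, 1, 1], centroids, [(0, 1)], [(0, 2)])
```

Minimizing this quantity over assignments and centroids is what drives the clustering toward solutions consistent with the expert-supplied constraints.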

    Component-level aggregation of probabilistic PCA mixtures using variational-Bayes

    Get PDF
    Technical Report. This report is an extended version of our ICPR'2010 paper. This paper proposes a technique for aggregating mixtures of probabilistic principal component analyzers, a powerful probabilistic generative model for coping with high-dimensional, nonlinear data sets. Aggregation is carried out through Bayesian estimation with a specific prior and an original variational scheme. We demonstrate how such models may be aggregated by accessing model parameters only, rather than the original data, which can be advantageous for learning from distributed data sets. Experimental results illustrate the effectiveness of the proposal.
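As a rough illustration of parameter-only aggregation, the following moment-matching merge of two mixture components uses only their weights, means, and covariances. This is a much simpler operation than the paper's variational-Bayes scheme and is shown only to convey the idea of combining models without touching the original data:

```python
import numpy as np

def merge_components(w1, mu1, C1, w2, mu2, C2):
    """Moment-matching merge of two Gaussian mixture components from
    their parameters alone (a simplified stand-in for variational-Bayes
    aggregation of PPCA mixtures)."""
    w = w1 + w2
    mu = (w1 * mu1 + w2 * mu2) / w
    # Combined covariance: within-component plus between-component spread.
    d1, d2 = mu1 - mu, mu2 - mu
    C = (w1 * (C1 + np.outer(d1, d1)) + w2 * (C2 + np.outer(d2, d2))) / w
    return w, mu, C

w, mu, C = merge_components(0.5, np.zeros(2), np.eye(2),
                            0.5, np.array([2.0, 0.0]), np.eye(2))
```

Because only parameters cross the wire, each site can fit its own local mixture and ship a handful of numbers instead of its data set.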

    Gesture Classification Using Deep Learning on Datasets with Little Labeled Data

    Get PDF
    In recent years, deep learning has proven to be a highly effective method for image classification. This effectiveness is attributed in part to the increase in processing power, the development of new algorithms, and the growth in the size and number of available datasets. However, this growth in available datasets has not reached every problem domain: in many areas the available datasets are too small for the effective application of deep learning models, or contain data of little use because it is not sufficiently representative of the problem or is noisy. This shortage of labeled data is a current problem in sign language recognition. This thesis explores several methods for achieving the best possible accuracy using the least amount of data, ultimately reaching a static sign classification accuracy of 99.26% on the LSA16 dataset and 94% on the RWTH-PHOENIX-Weather dataset.
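One standard way to get accuracy out of few labeled examples, in the spirit of this thesis though not its exact pipeline, is to train only a linear classification head on frozen pretrained features. A pure-NumPy sketch, where the feature extractor itself is assumed and not shown:

```python
import numpy as np

def train_linear_head(feats, labels, n_classes, lr=0.5, epochs=200):
    """Logistic-regression head trained on frozen, pretrained features --
    a common few-labels strategy (hypothetical sketch, not the thesis's
    actual model)."""
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(feats.shape[1], n_classes))
    Y = np.eye(n_classes)[labels]                 # one-hot targets
    for _ in range(epochs):
        logits = feats @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)         # softmax probabilities
        W -= lr * feats.T @ (p - Y) / len(feats)  # cross-entropy gradient step
    return W

# Toy "pretrained features": two well-separated classes, four labels total.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
W = train_linear_head(feats, labels, n_classes=2)
preds = (feats @ W).argmax(axis=1)
```

Freezing the extractor keeps the number of trainable parameters proportional to the feature dimension, which is what makes training stable with only a handful of labels.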

    A Lagrangian-based score for assessing the quality of pairwise constraints in semi-supervised clustering

    Get PDF
    ABSTRACT: Clustering algorithms help identify homogeneous subgroups from data. In some cases, additional information about the relationship among some subsets of the data exists. When using a semi-supervised clustering algorithm, an expert may provide additional information to constrain the solution based on that knowledge and, in doing so, guide the algorithm to a more useful and meaningful solution. Such additional information often takes the form of a cannot-link constraint (i.e., two data points cannot be part of the same cluster) or a must-link constraint (i.e., two data points must be part of the same cluster). A key challenge for users of such constraints in semi-supervised learning algorithms, however, is that adding inaccurate or conflicting constraints can decrease accuracy, and little is known about how to detect whether expert-imposed constraints are likely incorrect. In the present work, we introduce a method to score how likely each must-link and cannot-link pairwise constraint is to be incorrect. Using synthetic experimental examples and real data, we show that the resulting impact score can successfully identify individual constraints that should be removed or revised.
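The idea of flagging suspect constraints can be illustrated with a much simpler proxy than the paper's Lagrangian-based impact score: score each constraint by how strongly it disagrees with an existing (e.g. unconstrained) clustering. The scoring rules below are illustrative assumptions, not the paper's method:

```python
import numpy as np

def constraint_suspicion(X, labels, constraints, kind):
    """Score pairwise constraints by disagreement with a given clustering.
    High scores flag constraints worth reviewing (simplified proxy, not
    the paper's Lagrangian score)."""
    scores = []
    for i, j in constraints:
        d = np.linalg.norm(X[i] - X[j])
        if kind == "must_link":
            # Suspicious if the pair is far apart and currently split.
            scores.append(d if labels[i] != labels[j] else 0.0)
        else:  # cannot_link
            # Suspicious if the pair is close and currently together.
            scores.append(1.0 / (1.0 + d) if labels[i] == labels[j] else 0.0)
    return scores

X = np.array([[0.0, 0.0], [0.2, 0.0], [6.0, 6.0]])
labels = [0, 0, 1]
# A must-link between the two distant points (0, 2) scores as suspicious.
ml_scores = constraint_suspicion(X, labels, [(0, 1), (0, 2)], "must_link")
```

Ranking constraints by such a score lets an expert review the most doubtful ones first instead of auditing the whole constraint set.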

    Some contributions to k-means clustering problems

    Get PDF
    k-means clustering is the most common clustering technique for homogeneous data sets. In this thesis we introduce some contributions to problems related to k-means. First, we develop a modification of the k-means algorithm to efficiently partition massive data sets in a semi-supervised framework, i.e. where partial information is available. Our algorithms are designed to work in cases where not all of the groups have representatives in the supervised part of the data set, as well as when the total number of groups is not known in advance. We provide strategies for initializing our algorithm and for determining the number of clusters. Second, we develop a methodology to model the distribution function of the difference in residuals between a K-groups model and a K′-groups model (K′ > K) for assessing whether more groups fit the model better. This leads us to estimate the distribution of a sum of random variables; we provide two approaches, the first relying on the theory of non-parametric kernel estimation and the second an approximate approach that uses the normal approximation for this tail probability. Finally, we introduce a new merging tool that does not require any distributional assumption. To achieve this we compute the normed residuals for each cluster realization; these residuals form a sample from a non-negative distribution, and using asymmetric kernel estimation we estimate the misclassification probability. We further extend this non-parametric estimation to merge clusters.
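A minimal instance of the semi-supervised setting this thesis works in is seeded k-means: initialise centroids from the small labeled subset, then refine on all data. This sketch assumes every group has a seed and that k is known, the two simplifications the thesis specifically relaxes:

```python
import numpy as np

def seeded_kmeans(X, seed_idx, seed_labels, n_iter=20):
    """Seeded k-means sketch: centroids start from the labeled
    ('supervised') points, then standard k-means iterations run on all
    data. Empty clusters and unknown k are not handled here."""
    k = len(set(seed_labels))
    centroids = np.array([
        X[[i for i, l in zip(seed_idx, seed_labels) if l == c]].mean(axis=0)
        for c in range(k)])
    for _ in range(n_iter):
        # Assign every point (labeled or not) to its nearest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
    return labels, centroids

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])
# Only points 0 and 3 are labeled; the rest are assigned by the algorithm.
labels, _ = seeded_kmeans(X, seed_idx=[0, 3], seed_labels=[0, 1])
```

Starting from supervised seeds rather than random points is what makes the initialization "accurate" in the semi-supervised sense: clusters are anchored to the expert's groups from the first iteration.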

    Data mining and database systems: integrating conceptual clustering with a relational database management system.

    Get PDF
    Many clustering algorithms have been developed and improved over the years to cater for large-scale data clustering. However, much of this work has been in developing numeric-based algorithms that use efficient summarisations to scale to large data sets. There is a growing need for scalable categorical clustering algorithms as, although numeric-based algorithms can be adapted to categorical data, they do not always produce good results. This thesis presents a categorical conceptual clustering algorithm that can scale to large data sets using appropriate data summarisations. Data mining is distinguished from machine learning by the use of larger data sets that are often stored in database management systems (DBMSs). Many clustering algorithms require data to be extracted from the DBMS and reformatted for input to the algorithm. This thesis presents an approach that integrates conceptual clustering with a DBMS. The presented approach makes the algorithm main-memory independent and supports on-line data mining.
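The core idea of letting the DBMS produce the summarisations can be sketched with a standard-library example: push category counting into the database so the clustering algorithm works from compact summaries rather than raw rows. The table name, columns, and summary shape here are illustrative assumptions, not the thesis's actual integration:

```python
import sqlite3

# Sketch: data summarisation inside the DBMS for a categorical
# clustering algorithm (hypothetical schema, in-memory SQLite).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (colour TEXT, shape TEXT)")
conn.executemany("INSERT INTO records VALUES (?, ?)",
                 [("red", "square"), ("red", "circle"),
                  ("blue", "circle"), ("blue", "circle")])

# One aggregate query per attribute yields the category counts the
# algorithm needs; raw rows never leave the database.
summary = {attr: dict(conn.execute(
               f"SELECT {attr}, COUNT(*) FROM records GROUP BY {attr}"))
           for attr in ("colour", "shape")}
```

Because the algorithm consumes only these per-attribute counts, its memory footprint is independent of the number of rows, which is the "main-memory independent" property the abstract describes.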


    Probabilistic Semi-Supervised Clustering with Constraints

    No full text