
    Semantic distillation: a method for clustering objects by their contextual specificity

    Techniques for data mining, latent semantic analysis, contextual search of databases, and related tasks were developed long ago by computer scientists working on information retrieval (IR). Experimental scientists from all disciplines, having to analyse large collections of raw experimental data (astronomical, physical, biological, etc.), have developed powerful methods for statistical analysis and for clustering, categorising, and classifying objects. Finally, physicists have developed a theory of quantum measurement, unifying the logical, algebraic, and probabilistic aspects of queries into a single formalism. The purpose of this paper is twofold: first, to show that when formulated at an abstract level, problems from IR, statistical data analysis, and physical measurement theory are very similar and hence can profitably cross-fertilise each other; and, secondly, to propose a novel method of fuzzy hierarchical clustering, termed semantic distillation, strongly inspired by the theory of quantum measurement, which we developed to analyse raw data coming from various types of experiments on DNA arrays. We illustrate the method by analysing DNA array experiments and clustering the genes of the array according to their specificity. Comment: Accepted for publication in Studies in Computational Intelligence, Springer-Verlag

    Designing labeled graph classifiers by exploiting the Rényi entropy of the dissimilarity representation

    Representing patterns as labeled graphs is becoming increasingly common in the broad field of computational intelligence. Accordingly, a wide repertoire of pattern recognition tools, such as classifiers and knowledge discovery procedures, is nowadays available and tested on various datasets of labeled graphs. However, the design of effective learning procedures operating in the space of labeled graphs is still a challenging problem, especially from the computational complexity viewpoint. In this paper, we present a major improvement of a general-purpose classifier for graphs, which is conceived as an interplay between dissimilarity representation, clustering, information-theoretic techniques, and evolutionary optimization algorithms. The improvement focuses on a specific key subroutine devised to compress the input data. We prove several theorems which are fundamental to the setting of the parameters controlling such a compression operation. We demonstrate the effectiveness of the resulting classifier by benchmarking the developed variants on well-known datasets of labeled graphs, considering as distinct performance indicators the classification accuracy, the computing time, and parsimony in terms of structural complexity of the synthesized classification models. The results show state-of-the-art test set accuracy and a considerable speed-up in computing time. Comment: Revised version

    Macrostate Data Clustering

    We develop an effective nonhierarchical data clustering method using an analogy to the dynamic coarse graining of a stochastic system. Analyzing the eigensystem of an inter-item transition matrix identifies fuzzy clusters corresponding to the metastable macroscopic states (macrostates) of a diffusive system. A "minimum uncertainty criterion" determines the linear transformation from eigenvectors to cluster-defining window functions. Eigenspectrum gap and cluster certainty conditions identify the proper number of clusters. The physically motivated fuzzy representation and the associated uncertainty analysis distinguish macrostate clustering from spectral partitioning methods. Macrostate data clustering solves a variety of test cases that challenge other methods. Comment: keywords: cluster analysis, clustering, pattern recognition, spectral graph theory, dynamic eigenvectors, machine learning, macrostates, classification
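    The eigensystem construction sketched in this abstract can be illustrated in a few lines. The code below is a simplified toy, not the paper's method: the "minimum uncertainty criterion" that maps eigenvectors to window functions is replaced here by a crude farthest-point anchoring in eigenvector space, and all names are hypothetical.

```python
import numpy as np

def macrostate_fuzzy_clusters(similarity, n_clusters):
    """Toy fuzzy clustering from the eigensystem of an inter-item transition matrix."""
    # Row-normalise similarities into a stochastic (diffusion-like) matrix.
    T = similarity / similarity.sum(axis=1, keepdims=True)
    eigvals, eigvecs = np.linalg.eig(T)
    order = np.argsort(-eigvals.real)
    # Leading eigenvectors span the slowly-mixing (metastable) subspace.
    Y = eigvecs[:, order[:n_clusters]].real
    # Pick mutually distant rows as cluster anchors (greedy farthest-point choice,
    # a stand-in for the paper's minimum-uncertainty transformation).
    anchors = [0]
    for _ in range(n_clusters - 1):
        d = np.min([np.linalg.norm(Y - Y[a], axis=1) for a in anchors], axis=0)
        anchors.append(int(np.argmax(d)))
    # Soft memberships: inverse squared distance to each anchor, row-normalised.
    d2 = np.array([np.sum((Y - Y[a]) ** 2, axis=1) for a in anchors]).T
    u = 1.0 / (d2 + 1e-12)
    return u / u.sum(axis=1, keepdims=True)

# Two well-separated similarity blocks: memberships should split accordingly.
S = np.array([[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.9],
              [0.1, 0.1, 0.9, 1.0]])
M = macrostate_fuzzy_clusters(S, 2)
```

In this toy example the gap between the second and third eigenvalues of T signals that two clusters suffice, mirroring the eigenspectrum-gap condition mentioned above.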

    One-class classifiers based on entropic spanning graphs

    One-class classifiers offer valuable tools to assess the presence of outliers in data. In this paper, we propose a design methodology for one-class classifiers based on entropic spanning graphs. Our approach can also process non-numeric data by means of an embedding procedure. The spanning graph is learned on the embedded input data, and the resulting partition of its vertices defines the classifier. The final partition is derived by exploiting a criterion based on mutual information minimization. Here, we compute the mutual information by using a convenient formulation provided in terms of the α-Jensen difference. Once training is completed, a graph-based fuzzy model is constructed in order to associate a confidence level with the classifier decision. The fuzzification process is based only on topological information about the vertices of the entropic spanning graph. As such, the proposed one-class classifier is also suitable for data characterized by complex geometric structures. We provide experiments on well-known benchmarks containing both feature vectors and labeled graphs. In addition, we apply the method to the protein solubility recognition problem, considering several representations for the input samples. Experimental results demonstrate the effectiveness and versatility of the proposed method with respect to other state-of-the-art approaches. Comment: Extended and revised version of the paper "One-Class Classification Through Mutual Information Minimization" presented at the 2016 IEEE IJCNN, Vancouver, Canada
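    The entropic-spanning-graph machinery above rests on a minimum spanning tree over the (embedded) samples; the tree's total edge length is what feeds Rényi-entropy and α-Jensen estimates. A minimal pure-NumPy sketch of that building block (Prim's algorithm on the complete Euclidean graph; this is only an illustration, not the paper's implementation):

```python
import numpy as np

def mst_length(X):
    """Total edge length of the Euclidean minimum spanning tree (Prim's algorithm)."""
    n = len(X)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    dist = np.linalg.norm(X - X[0], axis=1)  # distance of each node to the tree
    total = 0.0
    for _ in range(n - 1):
        dist[in_tree] = np.inf
        j = int(np.argmin(dist))             # cheapest node to attach next
        total += dist[j]
        in_tree[j] = True
        dist = np.minimum(dist, np.linalg.norm(X - X[j], axis=1))
    return total

square = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
with_outlier = np.vstack([square, [[10.0, 0.0]]])
```

Adding a distant outlier inflates the tree length sharply, which is the kind of topological signal the mutual-information criterion above exploits.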

    Two-Stage Fuzzy Multiple Kernel Learning Based on Hilbert-Schmidt Independence Criterion

    Multiple kernel learning (MKL) is a principled approach to kernel combination and selection for a variety of learning tasks, such as classification, clustering, and dimensionality reduction. In this paper, we develop a novel fuzzy multiple kernel learning model based on the Hilbert-Schmidt independence criterion (HSIC) for classification, which we call HSIC-FMKL. In this model, we first propose an HSIC Lasso-based MKL formulation, which not only has a clear statistical interpretation (minimally redundant kernels with maximal dependence on the output labels are found and combined), but also enables the globally optimal solution to be computed efficiently by solving a Lasso optimization problem. Since the traditional support vector machine (SVM) is sensitive to outliers or noise in the dataset, a fuzzy SVM (FSVM) is used to select the prediction hypothesis once the optimal kernel has been obtained. The main advantage of the FSVM is that a fuzzy membership can be associated with each data point, so that different data points can have different effects on the training of the learning machine. We propose a new fuzzy membership function using a heuristic strategy based on the HSIC. The proposed HSIC-FMKL is a two-stage kernel learning approach in which the HSIC is applied in both stages. We perform extensive experiments on real-world datasets from the UCI benchmark repository and from the application domain of computational biology, which validate the superiority of the proposed model in terms of prediction accuracy.
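    The HSIC statistic underlying both stages is simple to compute from two kernel matrices. Below is a sketch of the biased empirical estimator only; the HSIC Lasso formulation and the FSVM stage of the paper are not shown.

```python
import numpy as np

def hsic(K, L):
    """Biased empirical HSIC between two n x n kernel matrices."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

# With linear kernels, HSIC reduces to a squared centered cross-covariance:
# large when y depends on x, near zero for an unrelated sign pattern.
x = np.arange(1.0, 9.0)
y = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
Kx = np.outer(x, x)
Ky = np.outer(y, y)
```

Here hsic(Kx, Kx) is large (strong self-dependence) while hsic(Kx, Ky) is close to zero, which is the "maximum dependence on output labels" signal the HSIC Lasso formulation optimizes over candidate kernels.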

    Time Series Cluster Kernel for Learning Similarities between Multivariate Time Series with Missing Data

    Similarity-based approaches represent a promising direction for time series analysis. However, many such methods rely on parameter tuning, and some have shortcomings if the time series are multivariate (MTS), due to dependencies between attributes, or if the time series contain missing data. In this paper, we address these challenges within the powerful context of kernel methods by proposing the robust time series cluster kernel (TCK). The approach leverages the missing-data handling properties of Gaussian mixture models (GMMs) augmented with informative prior distributions. An ensemble learning approach is exploited to ensure robustness to parameters: the clustering results of many GMMs are combined to form the final kernel. We evaluate the TCK on synthetic and real data and compare it to other state-of-the-art techniques. The experimental results demonstrate that the TCK is robust to parameter choices, provides competitive results for MTS without missing data, and gives outstanding results in the presence of missing data. Comment: 23 pages, 6 figures
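    The ensemble step described above (combining the clustering results of many GMMs into one kernel) reduces, in its simplest form, to averaging inner products of soft-assignment matrices. A hedged sketch of just that combination step; the actual TCK additionally fits the GMMs with informative priors and handles missing values, which is omitted here:

```python
import numpy as np

def ensemble_kernel(posteriors):
    """Combine per-model soft cluster assignments into one similarity kernel.

    posteriors: list of (n_samples, n_components) responsibility matrices,
    e.g. from GMMs fitted with different initialisations or hyperparameters.
    Two samples end up similar if the ensemble tends to co-assign them.
    """
    n = posteriors[0].shape[0]
    K = np.zeros((n, n))
    for P in posteriors:
        K += P @ P.T          # co-assignment similarity for this clustering
    return K / len(posteriors)

# Two toy clusterings that agree: samples 0 and 1 cluster together, 2 apart.
P1 = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
P2 = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])
K = ensemble_kernel([P1, P2])
```

Because each term P @ P.T is positive semidefinite, the average is a valid kernel matrix, which is what lets the result be plugged into downstream kernel methods.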

    Fuzzy Modeling of Geospatial Patterns


    Robust unsupervised learning using kernels

    This thesis studies deep connections between statistical robustness and machine learning techniques, in particular the relationship between one particular kernel (the Gaussian kernel) and the robustness of the kernel-based learning methods that use it. The thesis also shows that estimating the mean in the feature space with the RBF kernel is equivalent to robust estimation of the mean in the data space with the Welsch M-estimator. Based on these ideas, new robust kernels for machine learning algorithms are designed and implemented in this thesis: the Tukey, Andrews, and Huber robust kernels, each corresponding to the Tukey, Andrews, and Huber M-estimators, respectively. Kernel-based algorithms are an important tool widely applied to different machine learning and information retrieval problems, including clustering, latent topic analysis, recommender systems, image annotation, and content-based image retrieval, amongst others. Robustness is the ability of a statistical estimation method or machine learning method to deal with noise and outliers. There is a strong theory of robustness in statistics; however, it has received little attention in machine learning. A systematic evaluation is performed in order to assess the robustness of kernel-based algorithms in clustering, showing that some robust kernels, including the Tukey and Andrews robust kernels, perform on par with state-of-the-art algorithms.
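    The stated equivalence (kernel-space mean with the RBF kernel versus Welsch M-estimation in data space) can be made concrete with an iteratively reweighted estimator. A minimal one-dimensional sketch; the bandwidth and iteration count are arbitrary illustration choices, not values from the thesis:

```python
import numpy as np

def rbf_robust_mean(x, sigma=1.0, n_iter=50):
    """Robust location estimate via iterative reweighting with Gaussian
    (Welsch) weights, i.e. the data-space preimage of the RBF kernel mean."""
    mu = np.median(x)  # robust starting point
    for _ in range(n_iter):
        # Welsch weights: points far from the current estimate are downweighted.
        w = np.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
        mu = np.sum(w * x) / np.sum(w)
    return mu

# A tight cluster around zero plus one gross outlier: the ordinary mean is
# dragged toward the outlier, the reweighted estimate is not.
x = np.array([0.0, 0.1, -0.1, 0.05, -0.05, 10.0])
mu = rbf_robust_mean(x)
```

The same reweighting pattern with Tukey, Andrews, or Huber weight functions yields the corresponding robust kernels discussed in the thesis.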

    LOCATION AWARE CLUSTER BASED ROUTING IN WIRELESS SENSOR NETWORKS

    Wireless sensor nodes are usually embedded in the physical environment and report sensed data to a central base station. Clustering is one of the most challenging issues in wireless sensor networks. This paper proposes a new clustering scheme for wireless sensor networks obtained by modifying the K-means clustering algorithm. Sensor nodes are deployed in a harsh environment, randomly scattered in the region of interest, in a flat architecture. Packet transmission reduces the network lifetime; a clustering scheme is therefore required to avoid network traffic and increase the overall network lifetime. In order to cluster the sensor nodes deployed in the network, the location information of each sensor node must be known. Given the location of each sensor node, clusters are formed based on the highest residual energy and the minimum distance from the base station. Within each group of nodes, one node is elected as cluster head using the centroid method: the node with the minimum distance to the cluster centroid is elected as the cluster head. Clustering the nodes conserves residual energy and maximizes network performance, improving the overall network lifetime and reducing network traffic.
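    The centroid-based head election described above can be sketched as follows. This is a toy illustration with hypothetical names; the paper's full scheme also uses residual energy and distance to the base station when forming the clusters, which is not modelled here.

```python
import numpy as np

def elect_cluster_heads(positions, labels):
    """For each cluster, elect as head the node closest to the cluster centroid."""
    heads = {}
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        centroid = positions[idx].mean(axis=0)           # centroid method
        d = np.linalg.norm(positions[idx] - centroid, axis=1)
        heads[int(c)] = int(idx[np.argmin(d)])           # nearest node wins
    return heads

# Two spatial clusters; node 3 and node 7 sit nearest their centroids.
positions = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.4, 0.4],
                      [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [10.4, 10.4]])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
heads = elect_cluster_heads(positions, labels)
```

Electing the most central node keeps intra-cluster transmission distances short, which is the energy argument the abstract makes for the centroid method.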

    Semi-automatic parametrization of dynamic models using plant data

    The aim of this thesis was to develop a new methodology for estimating the parameters of NAPCON ProsDS dynamic simulator models so that they better represent data containing several operating points. Before this thesis, no methodology existed for combining operating-point identification with parameter estimation for NAPCON ProsDS simulator models. The methodology was designed by assessing and selecting suitable methods for operating-space partitioning, parameter estimation, and parameter scheduling. Previously implemented clustering algorithms were utilized for the operating-space partition. Parameter estimation was implemented as a new tool in the NAPCON ProsDS dynamic simulator, and iterative parameter estimation methods were applied. Finally, lookup tables were applied for tuning the model parameters according to the operating state. The methodology was tested by tuning a heat exchanger model to several operating points based on plant process data. The results indicated that the developed methodology was able to tune the simulator model to better represent several operating states. However, more testing with different models is required to verify the general applicability of the methodology.
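    At prediction time, the cluster-plus-lookup-table scheduling described above amounts to a nearest-centroid table lookup. A minimal sketch with hypothetical parameter names; this is illustrative only and not NAPCON ProsDS code:

```python
import numpy as np

def schedule_parameters(operating_point, centroids, param_table):
    """Select the parameter set for the current operating point by finding
    the nearest operating-state centroid and reading its lookup-table entry."""
    d = np.linalg.norm(centroids - operating_point, axis=1)
    return param_table[int(np.argmin(d))]

# Two operating states identified by clustering, each with its own tuned
# parameters (names and values are made up for illustration).
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
param_table = {0: {"heat_transfer_gain": 1.2}, 1: {"heat_transfer_gain": 2.5}}
params = schedule_parameters(np.array([9.0, 11.0]), centroids, param_table)
```

Keeping per-state parameters in a lookup table means the simulator can switch tunings as the plant moves between operating points, without re-running the estimation.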