
    Modelling and recognition of protein contact networks by multiple kernel learning and dissimilarity representations

    Multiple kernel learning is a paradigm which employs a properly constructed chain of kernel functions able to simultaneously analyse different data or different representations of the same data. In this paper, we propose a hybrid classification system based on a linear combination of multiple kernels defined over multiple dissimilarity spaces. The core of the training procedure is the joint optimisation of kernel weights and representative selection in the dissimilarity spaces. This equips the system with a two-fold knowledge discovery phase: by analysing the weights, it is possible to check which representations are more suitable for solving the classification problem, whereas the pivotal patterns selected as representatives can give further insight into the modelled system, possibly with the help of field experts. The proposed classification system is tested on real proteomic data in order to predict proteins' functional role starting from their folded structure: specifically, a set of eight representations is drawn from the graph-based description of the folded protein. The proposed multiple kernel-based system has also been benchmarked against a clustering-based classification system that is likewise able to exploit multiple dissimilarities simultaneously. Computational results show remarkable classification capabilities, and the knowledge discovery analysis is in line with current biological knowledge, suggesting the reliability of the proposed system.
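
    A minimal sketch of the kernel-combination idea (not the paper's actual training procedure): each dissimilarity matrix is embedded into a dissimilarity space via a set of prototypes, one RBF kernel is computed per representation, and a fixed weighted sum of the kernels feeds a precomputed-kernel SVM. The prototypes and kernel weights are fixed arbitrarily here, whereas the paper optimises them jointly; all data and names are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def dissimilarity_space(D, prototype_idx):
    """Embed objects as vectors of dissimilarities to a set of prototypes."""
    return D[:, prototype_idx]

def combined_kernel(embeddings, weights, gamma=1.0):
    """Weighted sum of RBF kernels, one per dissimilarity representation."""
    n = embeddings[0].shape[0]
    K = np.zeros((n, n))
    for X, w in zip(embeddings, weights):
        sq = np.sum(X ** 2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
        K += w * np.exp(-gamma * d2)
    return K

# Toy setup: two dissimilarity matrices over the same 60 objects, standing in for
# two graph-based representations of protein contact networks.
rng = np.random.default_rng(0)
D1 = np.abs(rng.normal(size=(60, 60))); D1 = (D1 + D1.T) / 2
D2 = np.abs(rng.normal(size=(60, 60))); D2 = (D2 + D2.T) / 2
y = rng.integers(0, 2, size=60)

prototypes = rng.choice(60, size=10, replace=False)  # representative selection (random here)
embeds = [dissimilarity_space(D, prototypes) for D in (D1, D2)]
weights = [0.7, 0.3]                                  # kernel weights (learned jointly in the paper)

K = combined_kernel(embeds, weights)
clf = SVC(kernel="precomputed").fit(K, y)
print("training accuracy:", clf.score(K, y))
```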

    Fuzzy spectral clustering methods for textual data

    Nowadays, the development of advanced information technologies has determined an increase in the production of textual data. This inevitable growth accentuates the need to advance in the identification of new methods and tools able to efficiently analyse such kinds of data. Against this background, unsupervised classification techniques can play a key role in this process, since most of this data is not classified. Document clustering, which is used for identifying a partition of clusters in a corpus of documents, has proven to perform efficiently in the analysis of textual documents and has been extensively applied in different fields, from topic modelling to information retrieval tasks. Recently, spectral clustering methods have gained success in the field of text classification, thanks to their solid theoretical foundations, which do not require any specific assumption on the global structure of the data. However, even though they perform well in text classification problems, comparatively little has been done in the field of clustering. Moreover, depending on the type of documents analysed, it is often the case that textual documents do not contain information related to a single topic only: indeed, there might be an overlap of contents characterizing different knowledge domains. Consequently, documents may contain information that is relevant to different areas of interest to some degree. The first part of this work critically analyses the main clustering algorithms used for text data, covering also the mathematical representation of documents and the pre-processing phase. Then, three novel fuzzy versions of spectral clustering algorithms for text data are introduced. The first exploits fuzzy K-medoids instead of K-means. The second derives directly from the first but is used in combination with Kernel and Set Similarity (KS2M), which takes into account the Jaccard index. Finally, in the third, a new similarity measure S∗ is proposed in order to enhance the clustering performance. This measure exploits the inherent sequential nature of text data by means of a weighted combination of the Spectrum string kernel function and a measure of set similarity. The second part of the thesis focuses on spectral bi-clustering algorithms for text mining tasks, which represent an interesting and partially unexplored field of research. In particular, two novel versions of fuzzy spectral bi-clustering algorithms are introduced. The two algorithms differ in the approach followed to identify the document and word partitions: the first follows a simultaneous approach, while the second follows a sequential one. This difference also leads to a different choice of the number of clusters. The adequacy of all the proposed fuzzy (bi-)clustering methods is evaluated through experiments performed on both real and benchmark data sets.
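
    A rough sketch of the spectral-plus-fuzzy pipeline described above, assuming TF-IDF document vectors and a cosine-similarity graph; a plain fuzzy c-means step stands in for the fuzzy K-medoids and the KS2M / S∗ similarities developed in the thesis, so this only illustrates the overall structure.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def spectral_embedding(S, k):
    """Eigenvectors of the symmetric normalized Laplacian for the k smallest eigenvalues."""
    d = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L_sym = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)
    U = vecs[:, :k]
    return U / np.linalg.norm(U, axis=1, keepdims=True)   # row-normalize

def fuzzy_c_means(X, k, m=2.0, iters=100, seed=0):
    """Plain fuzzy c-means on the embedded documents (the thesis uses fuzzy K-medoids)."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(k), size=len(X))             # soft membership matrix
    for _ in range(iters):
        W = U ** m
        C = (W.T @ X) / W.sum(axis=0)[:, None]             # cluster centres
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) + 1e-12
        U = 1.0 / (d ** (2 / (m - 1)))
        U /= U.sum(axis=1, keepdims=True)
    return U, C

docs = ["the cat sat on the mat", "dogs and cats are pets",
        "stock markets fell today", "investors sold shares in the market"]
X = TfidfVectorizer().fit_transform(docs)
S = cosine_similarity(X)                                   # document similarity graph
memberships, _ = fuzzy_c_means(spectral_embedding(S, k=2), k=2)
print(np.round(memberships, 2))                            # soft assignment of each document
```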

    Motion-capture-based hand gesture recognition for computing and control

    This dissertation focuses on the study and development of algorithms that enable the analysis and recognition of hand gestures in a motion capture environment. Central to this work is the study of unlabeled point sets in a more abstract sense. Evaluations of proposed methods focus on examining their generalization to users not encountered during system training. In an initial exploratory study, we compare various classification algorithms based upon multiple interpretations and feature transformations of point sets, including those based upon aggregate features (e.g. mean) and a pseudo-rasterization of the capture space. We find aggregate feature classifiers to be balanced across multiple users but relatively limited in maximum achievable accuracy. Certain classifiers based upon the pseudo-rasterization performed best among tested classification algorithms. We follow this study with targeted examinations of certain subproblems. For the first subproblem, we introduce the a fortiori expectation-maximization (AFEM) algorithm for computing the parameters of a distribution from which unlabeled, correlated point sets are presumed to be generated. Each unlabeled point is assumed to correspond to a target with independent probability of appearance but correlated positions. We propose replacing the expectation phase of the algorithm with a Kalman filter modified within a Bayesian framework to account for the unknown point labels, which manifest as uncertain measurement matrices. We also propose a mechanism to reorder the measurements in order to improve parameter estimates. In addition, we use a state-of-the-art Markov chain Monte Carlo sampler to efficiently sample measurement matrices. In the process, we indirectly propose a constrained k-means clustering algorithm. Simulations verify the utility of AFEM against a traditional expectation-maximization algorithm in a variety of scenarios. In the second subproblem, we consider the application of positive definite kernels and the earth mover's distance (EMD) to our work. Positive definite kernels are an important tool in machine learning that enable efficient solutions to otherwise difficult or intractable problems by implicitly linearizing the problem geometry. We develop a set-theoretic interpretation of EMD and propose earth mover's intersection (EMI), a positive definite analog to EMD. We offer proof of EMD's negative definiteness and provide necessary and sufficient conditions for EMD to be conditionally negative definite, including approximations that guarantee negative definiteness. In particular, we show that EMD is related to various min-like kernels. We also present a positive definite preserving transformation that can be applied to any kernel and can be used to derive positive definite EMD-based kernels, and we show that the Jaccard index is simply the result of this transformation applied to set intersection. Finally, we evaluate kernels based on EMI and the proposed transformation versus EMD in various computer vision tasks and show that EMD is generally inferior even with indefinite kernel techniques. Finally, we apply deep learning to our problem. We propose neural network architectures for hand posture and gesture recognition from unlabeled marker sets in a coordinate system local to the hand. As a means of ensuring data integrity, we also propose an extended Kalman filter for tracking the rigid pattern of markers on which the local coordinate system is based. We consider fixed- and variable-size architectures, including convolutional and recurrent neural networks, that accept unlabeled marker input. We also consider a data-driven approach to labeling markers with a neural network and a collection of Kalman filters. Experimental evaluations with posture and gesture datasets show promising results for the proposed architectures with unlabeled markers, which outperform the alternative data-driven labeling method.
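
    A small illustration of the EMD-kernel ingredients mentioned above, using scipy's 1-D Wasserstein distance: an exponential EMD kernel (indefinite in general) and a Jaccard-style normalization K(x,y) / (K(x,x) + K(y,y) - K(x,y)), which reduces to the Jaccard index when applied to the set-intersection (min) kernel; the dissertation's exact transformation and its EMI construction may differ, so this is only a sketch.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def emd_gaussian_kernel(hists, bins, gamma=1.0):
    """exp(-gamma * EMD) between normalized histograms; not positive definite in general."""
    n = len(hists)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            d = wasserstein_distance(bins, bins, hists[i], hists[j])
            K[i, j] = np.exp(-gamma * d)
    return K

def jaccard_transform(K):
    """Jaccard-style normalization: K(x,y) / (K(x,x) + K(y,y) - K(x,y)).
    Applied to the set-intersection kernel this yields the Jaccard index."""
    diag = np.diag(K)
    return K / (diag[:, None] + diag[None, :] - K)

# Two multisets over a shared support, written as (fractional) count vectors.
A = np.array([2.0, 1.0, 0.5, 0.5])
B = np.array([1.0, 1.0, 1.0, 1.0])
intersection = np.minimum(A, B).sum()
union = A.sum() + B.sum() - intersection
print("Jaccard via min-intersection:", intersection / union)

bins = np.arange(4, dtype=float)
hists = [A / A.sum(), B / B.sum()]
K = emd_gaussian_kernel(hists, bins)
print("Jaccard-transformed EMD kernel:\n", np.round(jaccard_transform(K), 3))
```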

    A skewness-based clustering method

    Partitive clustering methods represent one of the earliest and most famous sets of strategies in the field of clustering. The name comes from their main feature: all these methods start from an initial partition and modify it at every step of the process according to a known criterion, until a given convergence rule is satisfied. In other words, as pointed out by Äyrämö and Kärkkäinen (2006), they work essentially as iterative allocation algorithms. In this framework, we do not only focus on “canonical” approaches such as K-means and fuzzy C-means, but discuss some recent symmetry-based partitive clustering methods, mostly developed in the context of computer science and engineering. As will be shown, these approaches seem to provide encouraging results, especially in the field of image recognition and some related applications, and for this reason they represent a starting point for our work. In this respect, we are particularly interested in the case of overlapping clusters. As we will clarify, this case may represent a critical aspect for most of the clustering methods we have considered. In particular, we started our analysis by noting that, in the case of high-dimensional data with overlapping clusters, it may be difficult to choose the component-specific distributions, and no graphical device can help us. So, we decided to investigate nonparametric approaches to clustering. In this framework, we focused on the case of clusters with elliptical shapes, with Gaussian mixtures as a special case. Then, we realized that for elliptical shapes symmetry could be a “natural” choice. So, we searched for such clustering approaches, and we found the symmetry-based methods cited above. But, surprisingly, none of them was intended to focus on elliptical clusters, since they essentially aim at handling image recognition of different symmetric shapes. So, we decided to discuss this issue, and to test whether a suitable function of symmetry could improve clustering results in the case of elliptical overlapping clusters. Since we are interested in elliptical shapes, from a clustering point of view, another broad subject that we discuss is the Gaussian mixture model. In this context, our interest is in the EM-based Mclust algorithm from the R library mclust, see Fraley and Raftery (1999). Thus, our work addresses both of these topics: partitive clustering methods (with a focus on the symmetry-based approach) and Gaussian model-based clustering. The main reason for such a choice, that is, addressing two partially different subjects, derives from the essential features of our proposal: a symmetry-based partitive method intended to deal with elliptical clusters (with the Gaussian being a special case). In this sense, we evaluate our clustering performance through a comparison with the Gaussian mixture model implemented in the mclust library, see Fraley and Raftery (1999). This is surely a challenging task, since this method has home-court advantage in the case of Gaussian clusters. In this framework, as pointed out before, we are mainly interested in the case of overlapping clusters. In this sense, a starting point for our work was the assumption that Mclust (also in its “natural” framework, that is, Gaussian mixtures) could have problems in centroid estimation when clusters are highly overlapping. Quite obviously, this drawback could be related to its dependency on the multivariate Gaussian density. So, we searched for a nonparametric skewness-based method, which could be appropriate for elliptical distributions (including the Gaussian) in the case of overlapping clusters. This is exactly the framework of the proposed Sbam (Skewness-Based Allocation Method) algorithm.
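
    As a hedged illustration of the benchmark setting described above: two heavily overlapping elliptical Gaussian clusters are fitted with scikit-learn's GaussianMixture (a Python stand-in for R's mclust), and the componentwise skewness of each recovered cluster is reported, since near-zero within-cluster skewness is the intuition a skewness-based allocation rule can exploit. The actual Sbam criterion is defined in the thesis and is not reproduced here.

```python
import numpy as np
from scipy.stats import skew
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Two heavily overlapping elliptical (Gaussian) clusters.
cov = np.array([[2.0, 1.2], [1.2, 1.0]])
X = np.vstack([rng.multivariate_normal([0.0, 0.0], cov, 300),
               rng.multivariate_normal([1.5, 0.8], cov, 300)])
true_means = np.array([[0.0, 0.0], [1.5, 0.8]])

# EM-fitted Gaussian mixture, analogous to the mclust benchmark in the thesis.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
print("estimated centroids:\n", np.round(gmm.means_, 2))
print("true centroids:\n", true_means)

# Componentwise skewness of each fitted cluster: close to zero for well-recovered
# elliptical clusters, larger when overlap distorts the allocation.
labels = gmm.predict(X)
for g in range(2):
    print(f"cluster {g} skewness:", np.round(skew(X[labels == g], axis=0), 2))
```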

    Gromov-Wasserstein Averaging of Kernel and Distance Matrices

    This paper presents a new technique for computing the barycenter of a set of distance or kernel matrices. These matrices, which define the interrelationships between points sampled from individual domains, are not required to have the same size or to be in row-by-row correspondence. We compare these matrices using the softassign criterion, which measures the minimum distortion induced by a probabilistic map from the rows of one similarity matrix to the rows of another; this criterion amounts to a regularized version of the Gromov-Wasserstein (GW) distance between metric-measure spaces. The barycenter is then defined as a Fréchet mean of the input matrices with respect to this criterion, minimizing a weighted sum of softassign values. We provide a fast iterative algorithm for the resulting nonconvex optimization problem, built upon state-of-the-art tools for regularized optimal transportation. We demonstrate its application to the computation of shape barycenters and to the prediction of energy levels from molecular configurations in quantum chemistry.
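
    A compact sketch of a fixed-point iteration for such a barycenter, assuming the POT (Python Optimal Transport) package and its ot.gromov.entropic_gromov_wasserstein routine for the coupling step (POT also ships ready-made GW barycenter functions): couplings from the current barycenter to each input are recomputed, then the inputs are averaged after being pushed through those couplings. This is an illustrative sketch rather than the paper's reference implementation.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (pip install pot)

def gw_barycenter(Cs, lambdas, n_bar=20, epsilon=0.1, iters=5, seed=0):
    """Fixed-point iteration for an entropic GW barycenter of distance matrices.

    Alternates: (1) entropic GW couplings T_s from the current barycenter C to
    each input C_s, (2) update C <- sum_s lambda_s * T_s @ C_s @ T_s.T / (p p^T).
    """
    rng = np.random.default_rng(seed)
    p = ot.unif(n_bar)                               # uniform weights on barycenter points
    C = np.abs(rng.normal(size=(n_bar, n_bar)))      # random symmetric initialization
    C = (C + C.T) / 2
    for _ in range(iters):
        acc = np.zeros_like(C)
        for C_s, lam in zip(Cs, lambdas):
            q = ot.unif(C_s.shape[0])
            T = ot.gromov.entropic_gromov_wasserstein(C, C_s, p, q,
                                                      'square_loss', epsilon=epsilon)
            acc += lam * (T @ C_s @ T.T)
        C = acc / np.outer(p, p)
    return C

# Two toy metric spaces of different sizes: pairwise distances of random 2-D point clouds.
rng = np.random.default_rng(1)
clouds = [rng.normal(size=(30, 2)), rng.normal(size=(45, 2))]
Cs = [np.linalg.norm(X[:, None] - X[None, :], axis=2) for X in clouds]
C_bar = gw_barycenter(Cs, lambdas=[0.5, 0.5])
print("barycenter distance matrix shape:", C_bar.shape)
```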

    A Primer on Kernel Methods


    Bayesian Field Theory: Nonparametric Approaches to Density Estimation, Regression, Classification, and Inverse Quantum Problems

    Bayesian field theory denotes a nonparametric Bayesian approach for learning functions from observational data. Based on the principles of Bayesian statistics, a particular Bayesian field theory is defined by combining two models: a likelihood model, providing a probabilistic description of the measurement process, and a prior model, providing the information necessary to generalize from training to non-training data. The particular likelihood models discussed in the paper are those of general density estimation, Gaussian regression, clustering, classification, and models specific to inverse quantum problems. Besides hard constraints typical of the problem, like normalization and positivity for probabilities, prior models have to implement all the specific, and often vague, "a priori" knowledge available for a specific task. Nonparametric prior models discussed in the paper are Gaussian processes, mixtures of Gaussian processes, and non-quadratic potentials. Prior models are made flexible by including hyperparameters. In particular, the adaptation of mean functions and covariance operators of Gaussian process components is discussed in detail. Even if constructed from Gaussian process building blocks, Bayesian field theories are typically non-Gaussian and thus have to be solved numerically. With increasing computational resources, the class of numerically feasible non-Gaussian Bayesian field theories of practical interest is steadily growing. Models which turn out to be computationally too demanding can serve as a starting point for constructing easier-to-solve parametric approaches, using for example variational techniques.
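
    For the one case that is analytically tractable, Gaussian regression with a Gaussian process prior, the likelihood/prior combination can be written down in a few lines. The sketch below uses a squared-exponential covariance and synthetic data, and is only meant to make the two-model structure (likelihood plus prior) concrete; the non-Gaussian models in the paper require numerical treatment instead.

```python
import numpy as np

def rbf(x1, x2, length=0.5, var=1.0):
    """Squared-exponential covariance operator of the Gaussian process prior."""
    return var * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / length ** 2)

# Likelihood model: noisy observations y = f(x) + eps, eps ~ N(0, sigma^2).
rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 5, 15))
y_train = np.sin(x_train) + 0.1 * rng.normal(size=15)
sigma2 = 0.1 ** 2

# Closed-form posterior mean and covariance on a test grid.
x_test = np.linspace(0, 5, 100)
K = rbf(x_train, x_train) + sigma2 * np.eye(15)
K_star = rbf(x_test, x_train)
mean = K_star @ np.linalg.solve(K, y_train)
cov = rbf(x_test, x_test) - K_star @ np.linalg.solve(K, K_star.T)
print("posterior mean at x ~ 2.5:", round(mean[50], 3), "+/-", round(np.sqrt(cov[50, 50]), 3))
```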

    Doctor of Philosophy

    With the tremendous growth of data produced in recent years, it is impossible to identify patterns or test hypotheses without reducing data size. Data mining is an area of science that extracts useful information from the data by discovering patterns and structures present in the data. In this dissertation, we largely focus on clustering, which is often the first step in any exploratory data mining task: items that are similar to each other are grouped together, making downstream data analysis robust. Different clustering techniques have different strengths, and the resulting groupings provide different perspectives on the data. Due to the unsupervised nature of the task, i.e., the lack of domain experts who can label the data, validation of results is very difficult. While there are measures that compute "goodness" scores for clustering solutions as a whole, there are few methods that validate the assignment of individual data items to their clusters. To address these challenges, we focus on developing a framework that can generate, compare, combine, and evaluate different solutions to make more robust and significant statements about the data. In the first part of this dissertation, we present fast and efficient techniques to generate and combine different clustering solutions. We build on some recent ideas on efficient representations of clusters of partitions to develop a well-founded, spatially aware metric to compare clusterings. With the ability to compare clusterings, we describe a heuristic to combine different solutions to produce a single high-quality clustering. We also introduce a Markov chain Monte Carlo approach to sample different clusterings from the entire landscape to provide the users with a variety of choices. In the second part of this dissertation, we build certificates for individual data items and study their influence on effective data reduction. We present a geometric approach by defining regions of influence for data items and clusters, and use this to develop adaptive sampling techniques to speed up machine learning algorithms. This dissertation is therefore a systematic approach to studying the landscape of clusterings in an attempt to provide a better understanding of the data.
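
    A toy illustration of comparing clustering solutions: two k-means runs on the same data are compared with the variation of information (VI), a standard label-based distance between partitions. Note that the dissertation's metric is additionally spatially aware, which VI is not, so this only conveys the general idea of measuring how far apart two clusterings are.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import mutual_info_score

def variation_of_information(a, b):
    """VI(a, b) = H(a) + H(b) - 2 I(a, b); a proper metric on partitions."""
    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log(p))
    return entropy(a) + entropy(b) - 2 * mutual_info_score(a, b)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in ([0, 0], [3, 0], [0, 3])])

# Two clustering "solutions" of the same data from different random restarts.
a = KMeans(n_clusters=3, n_init=1, random_state=1).fit_predict(X)
b = KMeans(n_clusters=3, n_init=1, random_state=2).fit_predict(X)
print("VI distance between the two solutions:", round(variation_of_information(a, b), 3))
```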

    CLADAG 2021 BOOK OF ABSTRACTS AND SHORT PAPERS

    The book collects the short papers presented at the 13th Scientific Meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society (SIS). The meeting was organized by the Department of Statistics, Computer Science and Applications of the University of Florence, under the auspices of the Italian Statistical Society and the International Federation of Classification Societies (IFCS). CLADAG is a member of the IFCS, a federation of national, regional, and linguistically-based classification societies. It is a non-profit, non-political scientific organization whose aim is to further classification research.
