
    Statistical modeling for simultaneous data clustering, features selection, and outliers rejection

    Model-based approaches, and in particular finite mixture models, are widely used for data clustering, which is a crucial step in several applications of practical importance. Indeed, many pattern recognition, computer vision, and image processing applications can be approached as feature-space clustering problems. However, applying these approaches to complex high-dimensional data presents several challenges, such as the presence of many irrelevant features, which may slow down the learning algorithm and compromise its accuracy. Another problem is the presence of outliers, which can distort the resulting model parameters. Generally, clustering, feature selection, and outlier detection have been approached as separate problems. In this thesis, we propose a unified statistical framework that addresses all three simultaneously. The proposed statistical model partitions a given data set without a priori information about the number of clusters, the saliency of the features, or the number of outliers. We illustrate the performance of our approach using different applications involving synthetic data, real data, and object shape clustering.
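
    The unifying idea, a single mixture likelihood in which regular clusters and an outlier component compete for every point, can be sketched compactly. The following is a minimal illustration under assumed settings (a one-dimensional Gaussian mixture plus a uniform outlier component, with invented data and initial values), not the thesis's actual model or code:

        import numpy as np

        rng = np.random.default_rng(0)
        x = np.concatenate([rng.normal(0, 1, 200),       # cluster 1
                            rng.normal(6, 1, 200),       # cluster 2
                            rng.uniform(-10, 20, 20)])   # injected outliers

        K = 2
        mu, var = np.array([-1.0, 5.0]), np.ones(K)      # rough initial guesses
        pi = np.full(K + 1, 1.0 / (K + 1))               # last weight = outlier part
        u = 1.0 / (x.max() - x.min())                    # uniform outlier density

        for _ in range(100):
            # E-step: responsibilities over K Gaussians plus one uniform component
            dens = [np.exp(-0.5 * (x - mu[k]) ** 2 / var[k]) /
                    np.sqrt(2 * np.pi * var[k]) for k in range(K)]
            dens.append(np.full_like(x, u))
            r = pi[:, None] * np.stack(dens)
            r /= r.sum(axis=0, keepdims=True)
            # M-step: re-estimate mixing weights and Gaussian parameters
            pi = r.mean(axis=1)
            for k in range(K):
                w = r[k]
                mu[k] = (w * x).sum() / w.sum()
                var[k] = (w * (x - mu[k]) ** 2).sum() / w.sum()

        print("means:", mu, "flagged outliers:", (r[K] > 0.5).sum())

    Points whose posterior mass falls mostly on the uniform component are the rejected outliers; in the full framework, feature saliencies and the number of clusters are estimated within the same learning loop.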

    Positive Data Clustering based on Generalized Inverted Dirichlet Mixture Model

    Recent advances in the processing and networking capabilities of computers have caused an accumulation of immense amounts of multimodal multimedia data (image, text, video). These data are generally presented as high-dimensional vectors of features. The availability of these high-dimensional data sets has provided the input to a large variety of statistical learning applications, including clustering, classification, feature selection, outlier detection, and density estimation. In this context, a finite mixture offers a formal approach to clustering and a powerful tool to tackle the problem of data modeling. A mixture model assumes that the data are generated by a set of parametric probability distributions. The main learning process of a mixture model consists of two parts: parameter estimation and model selection (estimating the number of components). In addition, other issues may be considered during the learning process of mixture models, such as a) feature selection and b) outlier detection. The main objective of this thesis is to work with different kinds of estimation criteria and to incorporate these challenges into a single framework. The first contribution of this thesis is a statistical framework that tackles the problems of parameter estimation, model selection, feature selection, and outlier rejection in a unified model. We propose to use feature saliency and introduce an expectation-maximization (EM) algorithm for the estimation of the Generalized Inverted Dirichlet (GID) mixture model. By using the Minimum Message Length (MML) criterion, we can identify how much each feature contributes to our model as well as determine the number of components. The presence of outliers is an added challenge and is handled by incorporating an auxiliary outlier component, to which we associate a uniform density. Experimental results on synthetic data, as well as real-world applications involving visual scene and object classification, indicate that the proposed approach is promising, even though a low-dimensional representation of the data was used. They also show the importance of embedding an outlier component in the proposed model. EM learning suffers from significant drawbacks. To overcome them, a learning approach using a Bayesian framework is proposed as our second contribution. This learning is based on the estimation of the posterior distributions of the parameters, taking into account prior knowledge about them. The posterior distribution of each parameter in the model is computed using Markov chain Monte Carlo (MCMC) simulation methods, namely Gibbs sampling and the Metropolis-Hastings method. The Bayesian Information Criterion (BIC) was used for model selection. The proposed model was validated on object classification and forgery detection applications. For the first two contributions, we developed a finite GID mixture. In the third contribution, however, we propose an infinite GID mixture model, which simultaneously tackles the clustering and feature selection problems. The proposed learning model is based on Gibbs sampling. The effectiveness of the proposed method is shown using an image categorization application. Our last contribution in this thesis is another fully Bayesian approach for learning a finite GID mixture, using the Reversible Jump Markov Chain Monte Carlo (RJMCMC) technique. The proposed algorithm allows for the simultaneous handling of model selection and parameter estimation for high-dimensional data. The merits of this approach are investigated using synthetic data and data generated from a challenging application, namely object detection.
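
    The GID mixture itself is not available in common libraries, but the model-selection step the abstract describes, scoring candidate component counts with a criterion such as BIC, can be previewed with a stand-in model; the Gaussian mixture, synthetic data, and component range below are assumptions for the sketch, not the thesis's implementation:

        import numpy as np
        from sklearn.mixture import GaussianMixture

        rng = np.random.default_rng(1)
        X = np.vstack([rng.normal(m, 0.5, size=(150, 2)) for m in (0, 3, 6)])

        # Fit candidate models and keep the component count with the lowest BIC
        best = min((GaussianMixture(n_components=k, random_state=0).fit(X)
                    for k in range(1, 8)),
                   key=lambda gm: gm.bic(X))
        print("selected number of components:", best.n_components)

    The MML-based variant in the first contribution works analogously, except that each candidate is penalized by its message length rather than by BIC.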

    Multiple structure recovery with T-linkage

    This work addresses the problem of robust fitting of geometric structures to noisy data corrupted by outliers. An extension of J-linkage (called T-linkage) is presented and elaborated. T-linkage improves the preference analysis implemented by J-linkage in terms of performance and robustness, considering both the representation and the segmentation steps. A strategy to reject outliers and to estimate the inlier threshold is proposed, resulting in a versatile tool, suitable for multi-model fitting “in the wild”. Experiments demonstrate that our method performs better than J-linkage on simulated data and compares favorably with state-of-the-art methods on public-domain real datasets.
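
    The preference-analysis idea behind J-/T-linkage can be shown in a few lines: sample model hypotheses, record each point's soft preference for the hypotheses it fits, and compare points via the Tanimoto distance between preference vectors. The sketch below is illustrative only (line fitting on invented data, with an assumed inlier threshold), not the authors' implementation:

        import numpy as np

        rng = np.random.default_rng(2)
        t1, t2 = rng.uniform(0, 1, 50), rng.uniform(0, 1, 50)
        pts = np.vstack([np.c_[t1, 2 * t1 + 0.01 * rng.normal(size=50)],
                         np.c_[t2, -t2 + 1 + 0.01 * rng.normal(size=50)]])

        def residuals(p, a, b):
            d = b - a
            n = np.array([-d[1], d[0]]) / np.linalg.norm(d)  # unit normal of line ab
            return np.abs((p - a) @ n)

        tau = 0.05                                           # inlier threshold (assumed)
        idx = [rng.choice(len(pts), 2, replace=False) for _ in range(200)]
        R = np.stack([residuals(pts, pts[i], pts[j]) for i, j in idx], axis=1)
        P = np.where(R < tau, np.exp(-R / tau), 0.0)         # soft (T-linkage) preferences

        def tanimoto(a, b):
            ab = a @ b
            return 1.0 - ab / (a @ a + b @ b - ab + 1e-12)

        # Points on the same structure have similar preference vectors
        print(tanimoto(P[0], P[1]), tanimoto(P[0], P[75]))

    Agglomerative clustering under this distance then recovers the structures; J-linkage is the special case where preferences are binarized.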

    Unsupervised Learning with Feature Selection Based on Multivariate McDonald’s Beta Mixture Model for Medical Data Analysis

    This thesis proposes innovative clustering approaches using finite and infinite mixture models to analyze medical data and human activity recognition. These models leverage the flexibility of a novel distribution, the multivariate McDonald’s Beta distribution, offering superior capability to model data of varying shapes. We introduce a finite McDonald’s Beta Mixture Model (McDBMM), demonstrating its superior performance in handling bounded and asymmetric data distributions compared to traditional Gaussian mixture models. Further, we employ deterministic learning methods such as maximum likelihood via the expectation-maximization approach, as well as a Bayesian framework into which we integrate feature selection. This integration enhances the efficiency and accuracy of our models, offering a compelling solution for real-world applications where manual annotation of large data volumes is not feasible. To address the prevalent challenge in clustering of determining the number of mixture components, we extend our finite mixture model to an infinite one. By adopting a nonparametric Bayesian technique, we can effectively capture the underlying data distribution with an unknown number of mixture components. Across all stages, our models are evaluated on various medical applications, consistently demonstrating superior performance over traditional alternatives. The results of this research underline the potential of the McDonald’s Beta distribution and the proposed mixture models in transforming medical data into actionable knowledge, aiding clinicians in making more precise decisions and improving the health care industry.
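
    The infinite-mixture mechanics, a nonparametric Bayesian prior that leaves most of a truncated component budget unused so the data effectively pick the number of active components, can be previewed with a library stand-in. McDonald’s Beta components are not available in standard libraries, so the sketch below substitutes a Dirichlet-process Gaussian mixture on assumed bounded data, purely to illustrate the idea:

        import numpy as np
        from sklearn.mixture import BayesianGaussianMixture

        rng = np.random.default_rng(3)
        X = np.vstack([rng.beta(2, 5, size=(200, 2)),     # bounded, asymmetric clusters
                       rng.beta(8, 2, size=(200, 2))])

        dpgmm = BayesianGaussianMixture(
            n_components=10,                              # truncation level, not a fixed K
            weight_concentration_prior_type="dirichlet_process",
            random_state=0).fit(X)
        print("active components:", (dpgmm.weights_ > 0.01).sum())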

    Uncertainty Minimization in Robotic 3D Mapping Systems Operating in Dynamic Large-Scale Environments

    This dissertation research is motivated by the potential and promise of 3D sensing technologies in safety and security applications. With specific focus on unmanned robotic mapping to aid clean-up of hazardous environments, under-vehicle inspection, automatic runway/pavement inspection, and modeling of urban environments, we develop modular, multi-sensor, multi-modality robotic 3D imaging prototypes using localization/navigation hardware, laser range scanners, and video cameras. While deploying our multi-modality complementary approach to pose and structure recovery in dynamic real-world operating conditions, we observe several data fusion issues that state-of-the-art methodologies are not able to handle. Different bounds on the noise models of heterogeneous sensors, the dynamism of the operating conditions, and the interaction of the sensing mechanisms with the environment introduce situations where sensors can intermittently degenerate to accuracy levels below their design specification. This observation necessitates methods for integrating multi-sensor data that consider sensor conflict, performance degradation, and potential failure during operation. Our work in this dissertation contributes to the data fusion literature a fault-diagnosis framework inspired by information complexity theory. We implement the framework as opportunistic sensing intelligence that is able to evolve a belief policy on the sensors within multi-agent 3D mapping systems to survive and counter concerns of failure in challenging operating conditions. In addition to eliminating failed or non-functional sensors and avoiding catastrophic fusion, the implementation of the information-theoretic framework is able to minimize uncertainty during autonomous operation by adaptively deciding to fuse or to choose believable sensors. We demonstrate our framework through experiments in multi-sensor robot state localization in large-scale dynamic environments and vision-based 3D inference. Our modular hardware and software design of robotic imaging prototypes, along with the opportunistic sensing intelligence, provides significant improvements towards autonomous, accurate, photo-realistic 3D mapping and remote visualization of scenes for the motivating applications.
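
    The "fuse or choose believable sensors" behaviour can be caricatured in a few lines; the median-based conflict gate and the numbers below are invented for illustration and are not the dissertation's information-theoretic criterion:

        import numpy as np

        readings = np.array([10.2, 10.4, 10.3, 17.9])      # last sensor has degenerated
        sigmas   = np.array([0.5, 0.4, 0.6, 0.5])          # nominal noise std devs

        med = np.median(readings)
        believable = np.abs(readings - med) < 3 * sigmas   # simple conflict gate (assumed)

        w = 1.0 / sigmas[believable] ** 2                  # inverse-variance weights
        fused = (w * readings[believable]).sum() / w.sum()
        fused_sigma = np.sqrt(1.0 / w.sum())
        print(f"fused = {fused:.2f} ± {fused_sigma:.2f}, used {believable.sum()} sensors")

    The framework in the dissertation replaces this fixed gate with an evolving belief policy over the sensors, so degraded sensors can rejoin the fusion once they recover.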

    Visualisation of bioinformatics datasets

    Analysing the molecular polymorphism and interactions of DNA, RNA and proteins is of fundamental importance in biology. Predicting the functions of polymorphic molecules is important in order to design more effective medicines. Analysing major histocompatibility complex (MHC) polymorphism is important for mate choice, epitope-based vaccine design, transplantation rejection, etc. Most existing exploratory approaches cannot analyse these datasets because of the large number of molecules and the high number of descriptors per molecule. This thesis develops novel methods for data projection in order to explore high-dimensional biological datasets by visualising them in a low-dimensional space. With increasing dimensionality, some existing data visualisation methods such as generative topographic mapping (GTM) become computationally intractable. We propose variants of these methods, in which we use log-transformations at certain steps of the expectation-maximisation (EM) based parameter learning process, to make them tractable for high-dimensional datasets. We demonstrate these variants on both synthetic data and an electrostatic potential dataset of MHC class-I. We also propose to extend a latent trait model (LTM), suitable for visualising high-dimensional discrete data, to simultaneously estimate feature saliency as an integrated part of the parameter learning process of a visualisation model. This LTM variant not only gives better visualisation by modifying the projection map based on feature relevance, but also helps users to assess the significance of each feature. Another problem that is not addressed much in the literature is the visualisation of mixed-type data. We propose to combine GTM and LTM in a principled way, using appropriate noise models for each type of data, in order to visualise mixed-type data in a single plot. We call this model a generalised GTM (GGTM). We also extend the GGTM model to estimate feature saliencies while training the visualisation model; this variant is called GGTM with feature saliency (GGTM-FS). We demonstrate the effectiveness of these models on both synthetic and real datasets. We evaluate visualisation quality using metrics such as a distance distortion measure and rank-based measures: trustworthiness, continuity, and mean relative rank errors with respect to data space and latent space. In cases where the labels are known, we also use KL divergence and nearest-neighbour classification error to determine the separation between classes. We demonstrate the efficacy of the proposed models on both synthetic and real biological datasets, with a main focus on the MHC class-I dataset.
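
    The log-transformation device mentioned for the EM steps is essentially the log-sum-exp trick: responsibilities are formed in the log domain so that high-dimensional Gaussian densities neither underflow nor overflow. A minimal sketch with assumed shapes and an invented distance matrix, not the thesis's GTM code:

        import numpy as np
        from scipy.special import logsumexp

        rng = np.random.default_rng(4)
        D, K, N = 500, 20, 100                     # high dimension D breaks naive exp()
        dist2 = rng.uniform(0, 1, size=(N, K))     # squared distances to latent centres
        beta = 50.0                                # inverse noise variance (assumed)

        log_p = -0.5 * beta * dist2 + 0.5 * D * np.log(beta / (2 * np.pi))
        resp = np.exp(log_p - logsumexp(log_p, axis=1, keepdims=True))
        print(resp.sum(axis=1)[:3])                # rows sum to 1 despite extreme densities

    Exponentiating log_p directly would overflow at this dimensionality; subtracting the log-sum-exp first keeps every intermediate value in range.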