42 research outputs found

    Positive Data Clustering based on Generalized Inverted Dirichlet Mixture Model

    Recent advances in the processing and networking capabilities of computers have caused an accumulation of immense amounts of multimodal multimedia data (image, text, video). These data are generally presented as high-dimensional feature vectors. The availability of such high-dimensional data sets has provided the input to a large variety of statistical learning applications, including clustering, classification, feature selection, outlier detection and density estimation. In this context, a finite mixture offers a formal approach to clustering and a powerful tool to tackle the problem of data modeling. A mixture model assumes that the data are generated by a set of parametric probability distributions. The learning process of a mixture model consists of two main parts: parameter estimation and model selection (estimating the number of components). In addition, other issues may be considered during the learning process, such as a) feature selection and b) outlier detection. The main objective of this thesis is to work with different kinds of estimation criteria and to incorporate these challenges into a single framework. The first contribution of this thesis is a statistical framework which tackles parameter estimation, model selection, feature selection, and outlier rejection in a unified model. We propose to use feature saliency and introduce an expectation-maximization (EM) algorithm for the estimation of the Generalized Inverted Dirichlet (GID) mixture model. By using the Minimum Message Length (MML) criterion, we can identify how much each feature contributes to the model as well as determine the number of components. The presence of outliers is an added challenge and is handled by incorporating an auxiliary outlier component, to which we associate a uniform density.
    Experimental results on synthetic data, as well as on real-world applications involving visual scene and object classification, indicate that the proposed approach is promising, even though a low-dimensional representation of the data was used. They also show the importance of embedding an outlier component in the proposed model. EM learning suffers from significant drawbacks. In order to overcome them, a learning approach using a Bayesian framework is proposed as our second contribution. This learning is based on the estimation of the posterior distributions of the parameters, taking prior knowledge about these parameters into account. The posterior distribution of each parameter in the model is computed using Markov chain Monte Carlo (MCMC) simulation methods, namely Gibbs sampling and the Metropolis-Hastings method. The Bayesian Information Criterion (BIC) was used for model selection. The proposed model was validated on object classification and forgery detection applications. For the first two contributions, we developed a finite GID mixture. In the third contribution, however, we propose an infinite GID mixture model which simultaneously tackles the clustering and feature selection problems. The proposed learning model is based on Gibbs sampling. The effectiveness of the proposed method is shown on an image categorization application. Our last contribution in this thesis is another fully Bayesian approach for learning a finite GID mixture model, using the Reversible Jump Markov Chain Monte Carlo (RJMCMC) technique. The proposed algorithm allows the simultaneous handling of model selection and parameter estimation for high-dimensional data. The merits of this approach are investigated using synthetic data and data generated from a challenging application, namely object detection.
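As a rough illustration of the outlier-handling idea, the sketch below computes one E-step for a mixture augmented with a uniform outlier component. It is a hypothetical one-dimensional example using Gaussian components in place of the GID, purely for brevity; all names and values are illustrative, not the thesis's actual model.

```python
import math

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def e_step(data, weights, mus, variances, outlier_w, lo, hi):
    """E-step for a 1-D Gaussian mixture with an extra uniform outlier
    component on [lo, hi]; returns per-point responsibilities."""
    u = 1.0 / (hi - lo)  # density of the uniform outlier component
    resp = []
    for x in data:
        parts = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, mus, variances)]
        parts.append(outlier_w * u)  # auxiliary outlier component
        total = sum(parts)
        resp.append([p / total for p in parts])
    return resp
```

A point far from every component receives nearly all of its responsibility from the uniform component, which is what shields the component parameter estimates from outliers in the subsequent M-step.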

    Bayesian Learning of Asymmetric Gaussian-Based Statistical Models using Markov Chain Monte Carlo Techniques

    A novel unsupervised Bayesian learning framework based on the asymmetric Gaussian mixture (AGM) statistical model is proposed, since AGM has been shown to be more effective than the classic Gaussian mixture. The Bayesian learning framework is developed by adopting a sampling-based Markov chain Monte Carlo (MCMC) methodology. More precisely, the fundamental learning algorithm is a hybrid Metropolis-Hastings within Gibbs sampling solution which is integrated within a reversible jump MCMC (RJMCMC) learning framework, a self-adapting sampling-based MCMC implementation that enables model transitions throughout the mixture parameter learning process and therefore automatically converges to the optimal number of data groups. Furthermore, a feature selection technique is included to discard irrelevant and unneeded information from the datasets. A performance comparison between AGM and other popular solutions is given, and both synthetic and real data sets extracted from challenging applications such as intrusion detection, spam filtering and image categorization are evaluated to show the merits of the proposed approach.
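The Metropolis-Hastings-within-Gibbs building block can be sketched as follows, assuming a toy target (a Gaussian mean with a vague normal prior) rather than the AGM posterior itself; the sampler structure, not the model, is the point, and every name here is illustrative.

```python
import math, random

def log_post(mu, data, prior_mu=0.0, prior_var=100.0, var=1.0):
    """Unnormalized log-posterior of a Gaussian mean (illustrative target)."""
    ll = sum(-(x - mu) ** 2 / (2 * var) for x in data)
    return ll - (mu - prior_mu) ** 2 / (2 * prior_var)

def mh_step(mu, data, step=0.5, rng=random):
    """One random-walk Metropolis-Hastings update, as used inside a Gibbs sweep
    for parameters whose full conditional cannot be sampled directly."""
    prop = mu + rng.gauss(0.0, step)
    if math.log(rng.random()) < log_post(prop, data) - log_post(mu, data):
        return prop  # accept the proposal
    return mu        # reject, keep the current value

random.seed(0)
data = [4.8, 5.1, 5.3, 4.9]
mu, samples = 0.0, []
for i in range(3000):
    mu = mh_step(mu, data)
    if i >= 1000:          # discard burn-in
        samples.append(mu)
est = sum(samples) / len(samples)
```

In the full RJMCMC scheme, sweeps like this alternate with dimension-changing moves that add or remove mixture components.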

    Bayesian Learning Frameworks for Multivariate Beta Mixture Models

    Mixture models have been widely used as a statistical learning paradigm in various unsupervised machine learning applications, where labeling a vast amount of data is impractical and costly. They have shown significant success and encouraging performance in many real-world problems from different fields such as computer vision, information retrieval and pattern recognition. One of the most widely used distributions in mixture models is the Gaussian distribution, due to characteristics such as its simplicity and fitting capabilities. However, data obtained from some applications may have different properties, such as a non-Gaussian and asymmetric nature. In this thesis, we propose multivariate Beta mixture models, which offer flexibility and a variety of shapes with promising attributes. These models can be considered decent alternatives to Gaussian distributions. We explore multiple Bayesian inference approaches for multivariate Beta mixture models and propose a suitable solution to the problem of estimating their parameters using Markov chain Monte Carlo (MCMC) techniques. We exploit Gibbs sampling within Metropolis-Hastings for learning the parameters of our finite mixture model. Moreover, a fully Bayesian approach based on the birth-death MCMC technique is proposed, which simultaneously allows cluster assignment, parameter estimation and the selection of the optimal number of clusters. Finally, we develop a nonparametric Bayesian framework by extending our finite mixture model to infinity using the Dirichlet process to tackle the model selection problem. Experimental results obtained from challenging applications (e.g., intrusion detection and medical applications) confirm that our proposed frameworks provide effective solutions compared to existing alternatives.
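Birth and death moves of the kind used in birth-death MCMC can be sketched as follows; the proposal choices (a Beta-distributed new weight, a uniform parameter draw) are illustrative assumptions, not the thesis's actual kernels.

```python
import random

def birth(weights, params, rng=random):
    """Birth move: add a new component with weight drawn from Beta(1, K),
    rescaling the existing weights so they still sum to one."""
    w_new = rng.betavariate(1, len(weights))
    weights = [w * (1 - w_new) for w in weights] + [w_new]
    params = params + [rng.uniform(0, 1)]  # hypothetical new-component parameter
    return weights, params

def death(weights, params, idx):
    """Death move: remove component idx and renormalize the remaining weights."""
    w_gone = weights[idx]
    weights = [w / (1 - w_gone) for i, w in enumerate(weights) if i != idx]
    params = [p for i, p in enumerate(params) if i != idx]
    return weights, params
```

In a real birth-death sampler these moves fire at rates derived from the posterior, so the chain spends more time at well-supported numbers of components; the sketch only shows the state manipulation.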

    Statistical Analysis of Spherical Data: Clustering, Feature Selection and Applications

    In the light of interdisciplinary applications, the data to be studied and analyzed have witnessed a growth in volume and a change in their intrinsic structure and type. In other words, the diversity of resources generating objects has in practice imposed several challenges on decision makers in determining informative data, in terms of time, model capability, scalability and knowledge discovery. Thus, it is highly desirable to be able to extract patterns of interest that support data management decisions. Clustering, among other machine learning approaches, is an important data engineering technique that empowers the automatic discovery of clusters of similar objects and the consequent assignment of new, unseen objects to appropriate clusters. In this context, the majority of current research does not completely address the true structure and nature of the data for the particular application at hand. In contrast to most previous research, our proposed work focuses on the modeling and classification of spherical data, which are naturally generated in many data mining and knowledge discovery applications. Thus, in this thesis we propose several estimation and feature selection frameworks based on the Langevin distribution which are devoted to spherical patterns in offline and online settings. We first formulate a unified probabilistic framework in which we build probabilistic kernels, based on the Fisher score and on information divergences, from finite Langevin mixtures for Support Vector Machines. We are motivated by the fact that blending generative and discriminative approaches has prevailed, exploring and adopting the distinct characteristics of each approach to construct a complementary system combining the best of both.
    Due to the high demand for compact and accurate statistical models that automatically adjust to dynamic changes, we next propose probabilistic frameworks for high-dimensional spherical data modeling based on finite Langevin mixtures that allow simultaneous clustering and feature selection in offline and online settings. To this end, we adopt finite mixture models, which have long relied heavily on deterministic learning approaches such as maximum likelihood estimation. Despite their successful utilization in a wide spectrum of areas, these approaches have several drawbacks, as we discuss in this thesis. An alternative is the adoption of Bayesian inference, which naturally addresses data uncertainty while ensuring good generalization. To address this issue, we also propose a Bayesian approach for finite Langevin mixture model estimation and selection. When data change dynamically and grow drastically, a finite mixture is not always a feasible solution. In contrast with the previous approaches, which suppose an unknown but finite number of mixture components, we finally propose a nonparametric Bayesian approach which assumes an infinite number of components. We further enhance our model by simultaneously detecting informative features during clustering. Through extensive empirical experiments, we demonstrate the merits of the proposed learning frameworks on diverse high-dimensional datasets and challenging real-world applications.
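A minimal sketch of a Fisher-score kernel built from a mixture, assuming a one-dimensional Gaussian mixture in place of the Langevin mixture and differentiating only with respect to the mixing weights; a full hybrid system would feed such kernels into an SVM.

```python
import math

def fisher_score(x, weights, mus, var=1.0):
    """Fisher score of x w.r.t. the mixing weights of a 1-D Gaussian mixture:
    d/dw_k log p(x) = p_k(x) / p(x)."""
    comps = [math.exp(-(x - m) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
             for m in mus]
    px = sum(w * c for w, c in zip(weights, comps))
    return [c / px for c in comps]

def fisher_kernel(x, y, weights, mus):
    """Linear kernel between Fisher scores: the generative model supplies the
    feature map, a discriminative classifier consumes the kernel."""
    sx = fisher_score(x, weights, mus)
    sy = fisher_score(y, weights, mus)
    return sum(a * b for a, b in zip(sx, sy))
```

Points explained by the same mixture component produce similar score vectors and hence a large kernel value, which is the intuition behind such generative-discriminative hybrids.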

    Online Spectral Clustering on Network Streams

    Graphs are an extremely useful representation of a wide variety of practical systems in data analysis. Recently, with the fast accumulation of stream data from various types of networks, significant research interest has arisen in spectral clustering for network streams (or evolving networks). Compared with the general spectral clustering problem, the data analysis of this new type of problem may have additional requirements, such as short processing time, scalability in distributed computing environments, and temporal variation tracking. However, designing a spectral clustering method that satisfies these requirements presents non-trivial challenges. There are three major challenges for the new algorithm design. The first challenge is online clustering computation. Most existing spectral methods on evolving networks are offline methods using standard eigensystem solvers such as the Lanczos method, and need to recompute solutions from scratch at each time point. The second challenge is the parallelization of the algorithms, which is non-trivial since standard eigensolvers are iterative and the number of iterations cannot be predetermined. The third challenge is the very limited existing work. In addition, the existing method has multiple limitations, such as computational inefficiency on large similarity changes, the lack of a sound theoretical basis, and the lack of an effective way to handle accumulated approximation errors and large data variations over time. In this thesis, we propose a new online spectral graph clustering approach with a family of three novel spectrum approximation algorithms. Our algorithms incrementally update the eigenpairs in an online manner to improve computational performance. Our approaches outperform the existing method in computational efficiency and scalability while retaining competitive or even better clustering accuracy.
    We derive our spectrum approximation techniques, GEPT and EEPT, through formal theoretical analysis. The well-established matrix perturbation theory forms a solid theoretical foundation for our online clustering method. We complement our clustering method with a new metric to track accumulated approximation errors and measure short-term temporal variation. The metric not only provides a balance between computational efficiency and clustering accuracy, but also offers a useful tool to adapt the online algorithm to conditions of unexpected drastic noise. In addition, we discuss our preliminary work on approximate graph mining with evolutionary processes, non-stationary Bayesian network structure learning from non-stationary time series data, and Bayesian network structure learning with text priors imposed by nonparametric hierarchical topic modeling.
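The perturbation-theoretic idea behind such incremental updates can be sketched with the classical first-order eigenvalue formula lambda' ≈ lambda + vᵀ ΔA v for a symmetric perturbation ΔA; the function below is a generic illustration of that formula, not GEPT or EEPT themselves.

```python
def eig_update(lam, v, dA):
    """First-order eigenvalue update under a symmetric perturbation dA of the
    similarity matrix: lambda' ≈ lambda + v^T dA v, with v a unit eigenvector.
    This avoids re-running a full eigensolver when the graph changes slightly."""
    n = len(v)
    dAv = [sum(dA[i][j] * v[j] for j in range(n)) for i in range(n)]
    return lam + sum(v[i] * dAv[i] for i in range(n))
```

For a diagonal perturbation aligned with the eigenvector the first-order estimate is exact; off-diagonal perturbations introduce only second-order error, which is why an online method must still track accumulated error over many steps.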

    Unsupervised Learning with Feature Selection Based on Multivariate McDonald’s Beta Mixture Model for Medical Data Analysis

    This thesis proposes innovative clustering approaches using finite and infinite mixture models to analyze medical data and human activity recognition. These models leverage the flexibility of a novel distribution, the multivariate McDonald’s Beta distribution, offering a superior capability to model data of varying shapes. We introduce a finite McDonald’s Beta Mixture Model (McDBMM), demonstrating its superior performance in handling bounded and asymmetric data distributions compared to traditional Gaussian mixture models. Further, we employ deterministic learning methods such as maximum likelihood via the expectation-maximization approach, as well as a Bayesian framework in which we integrate feature selection. This integration enhances the efficiency and accuracy of our models, offering a compelling solution for real-world applications where manual annotation of large data volumes is not feasible. To address the prevalent challenge in clustering of determining the number of mixture components, we extend our finite mixture model to an infinite model. By adopting a nonparametric Bayesian technique, we can effectively capture the underlying data distribution with an unknown number of mixture components. Across all stages, our models are evaluated on various medical applications, consistently demonstrating superior performance over traditional alternatives. The results of this research underline the potential of the McDonald’s Beta distribution and the proposed mixture models to transform medical data into actionable knowledge, aiding clinicians in making more precise decisions and improving the healthcare industry.
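The move from a finite to an infinite mixture can be sketched with the Chinese restaurant process, a standard construction equivalent to Dirichlet-process mixture priors, under which the number of clusters is not fixed in advance; this is a generic illustration, not the thesis's sampler.

```python
import random

def crp(n, alpha, rng=random):
    """Chinese restaurant process: sample cluster assignments for n points.
    Point i joins an existing cluster with probability proportional to its
    size, or opens a new cluster with probability proportional to alpha."""
    counts, labels = [], []
    for _ in range(n):
        probs = counts + [alpha]          # existing clusters vs. a new one
        r = rng.random() * sum(probs)
        k, acc = 0, 0.0
        for k, p in enumerate(probs):
            acc += p
            if r < acc:
                break
        if k == len(counts):
            counts.append(1)              # open a new cluster
        else:
            counts[k] += 1
        labels.append(k)
    return labels
```

Larger alpha yields more clusters on average (roughly alpha * log n), which is how the model lets the data determine the number of components.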

    Mixture-based Clustering for the Ordered Stereotype Model

    No full text
    Many of the methods which deal with dimensionality reduction in data matrices are based on mathematical techniques. In general, with these techniques it is not possible to draw statistical inferences or to select the appropriateness of a model via information criteria, because there is no underlying probability model. Furthermore, the use of ordinal data is very common (e.g. Likert or Braun-Blanquet scales), yet the clustering methods in common use treat ordered categorical variables as nominal or continuous rather than as true ordinal data. Recently, a group of likelihood-based finite mixture models for binary or count data has been developed (Pledger and Arnold, 2014). This thesis extends this idea and establishes novel likelihood-based multivariate methods for data reduction of a matrix containing ordinal data. The new approach applies fuzzy clustering via finite mixtures to the ordered stereotype model (Fernández et al., 2014a). Fuzzy allocation of rows and columns to corresponding clusters is achieved by performing the EM algorithm, and Bayesian model fitting is obtained by running a reversible jump MCMC sampler. Their performances for one-dimensional clustering are compared. Simulation studies and three real data sets are used to illustrate the application of these approaches and to present novel data visualisation tools for depicting the fuzziness of the clustering results for ordinal data. Additionally, a simulation study is set up to empirically establish a relationship between our likelihood-based methodology and the performance of eleven information criteria in common use. Finally, clustering comparisons between count data and the same data set categorised as ordinal are performed, and the results are analysed and presented.
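The ordered stereotype model's category probabilities can be sketched as a constrained softmax in which monotone scores phi (0 = phi_1 <= ... <= phi_q = 1) encode the ordinality. This is a generic illustration with hypothetical parameter values, not the fitted model from the thesis.

```python
import math

def stereotype_probs(x, mu, phi, beta):
    """Ordered stereotype model: P(Y = k | x) ∝ exp(mu_k + phi_k * beta * x).
    The monotone scores phi make the categories genuinely ordinal, unlike a
    plain multinomial logit where each category gets a free slope."""
    logits = [m + p * beta * x for m, p in zip(mu, phi)]
    mx = max(logits)                       # numerically stable softmax
    exps = [math.exp(l - mx) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

With beta > 0, increasing x shifts probability mass monotonically toward the higher categories, which is the ordinal behaviour the model guarantees.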

    Occupancy Estimation and Activity Recognition in Smart Buildings using Mixture-Based Predictive Distributions

    Labeled data is a necessary ingredient of modern computer science, notably in machine learning and deep learning, where large amounts of labeled training data are required. However, collecting labeled data is a crucial step that is time-consuming, error-prone and often requires human involvement. On the other hand, imbalanced data is also a challenge for classification approaches, with most approaches simply predicting the majority class in all cases. In this work, we propose several frameworks based on the predictive distributions of mixture models. In the case of small training data, the predictive distribution is data-driven: it takes maximum advantage of the existing training data and does not need much labeled data. The flexibility and adaptability of the Dirichlet family of distributions as mixture models further improve the classification ability of the frameworks. The generalized inverted Dirichlet (GID), inverted Dirichlet (ID) and generalized Dirichlet (GD) distributions are used in this work with predictive distributions to perform classification. The GID-based predictive distribution shows a clear improvement in activity recognition over the approach of global variational inference when using small training data. The ID-based predictive distribution with over-sampling is applied to occupancy estimation: more synthetic data are sampled for the small classes, which improves the total accuracy. Finally, an occupancy estimation framework is presented based on interactive learning and the predictive distribution of the GD. This framework can find the most informative unlabeled data and interact with users to obtain the true labels. The newly labeled data are added to the data store to further improve the classification performance.
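The idea of classifying with a predictive distribution rather than point estimates can be sketched with a conjugate Gaussian toy model standing in for the Dirichlet-family mixtures; all names and priors here are illustrative assumptions.

```python
import math

def predictive_logpdf(x, data, prior_mu=0.0, prior_var=100.0, var=1.0):
    """Posterior-predictive log-density of x under a Gaussian class model with
    a conjugate normal prior on the mean: the small training set is integrated
    out rather than collapsed to a point estimate."""
    n = len(data)
    post_var = 1.0 / (1.0 / prior_var + n / var)
    post_mu = post_var * (prior_mu / prior_var + sum(data) / var)
    pred_var = post_var + var  # predictive variance inflates with uncertainty
    return (-0.5 * math.log(2 * math.pi * pred_var)
            - (x - post_mu) ** 2 / (2 * pred_var))

def classify(x, classes):
    """Assign x to the class whose predictive density at x is highest."""
    return max(classes, key=lambda c: predictive_logpdf(x, classes[c]))
```

Because the predictive variance grows when a class has few examples, the classifier is automatically more cautious about sparsely observed classes, which is the property exploited for small training sets.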

    Assessing Heterogeneity of Two-Part Model via Bayesian Model-Based Clustering with Its Application to Cocaine Use Data

    The purpose of this chapter is to provide an introduction to model-based clustering within the Bayesian framework and to apply it to assess the heterogeneity of fractional data via a finite mixture two-part regression model. The problems related to the number of clusters and the configuration of observations are addressed via Markov chain Monte Carlo (MCMC) sampling. A Gibbs sampler is implemented to draw observations from the relevant full conditionals. As a concrete example, cocaine use data are analyzed to illustrate the merits of the proposed methodology.
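A two-part likelihood for fractional data can be sketched as a point mass at zero plus a Beta density on (0, 1); the mean/precision parameterization below is a common convention, assumed here purely for illustration.

```python
import math

def two_part_loglik(y, p_zero, mu, phi):
    """Two-part model for fractional data y in [0, 1): a point mass at zero
    with probability p_zero, and (with probability 1 - p_zero) a
    Beta(mu * phi, (1 - mu) * phi) density for y > 0."""
    if y == 0:
        return math.log(p_zero)
    a, b = mu * phi, (1 - mu) * phi
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (math.log(1 - p_zero) + (a - 1) * math.log(y)
            + (b - 1) * math.log(1 - y) - log_beta)
```

In a mixture of such two-part models, each cluster gets its own (p_zero, mu, phi), and a Gibbs sampler alternates between cluster assignments and these cluster-level parameters.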