
    Image annotation and retrieval based on multi-modal feature clustering and similarity propagation.

    The performance of content-based image retrieval systems has proved to be inherently constrained by the low-level features used, and cannot give satisfactory results when the user's high-level concepts cannot be expressed in terms of low-level features. In an attempt to bridge this semantic gap, recent approaches have started integrating both low-level visual features and high-level textual keywords. Unfortunately, manual image annotation is a tedious process and may not be possible for large image databases. In this thesis we propose a system for image retrieval that has three main components. The first component consists of a novel possibilistic clustering and feature weighting algorithm based on robust modeling of the generalized Dirichlet (GD) finite mixture. Robust estimation of the mixture model parameters is achieved by incorporating two complementary types of membership degrees. The first is a posterior probability that indicates the degree to which a point fits the estimated distribution. The second represents the degree of typicality and is used to identify and discard noise points. Robustness to noisy and irrelevant features is achieved by transforming the data to make the features independent and follow a Beta distribution, and by learning an optimal relevance weight for each feature subset within each cluster. We extend our algorithm to find the optimal number of clusters in an unsupervised and efficient way by exploiting properties of the possibilistic membership function. We also outline a semi-supervised version of the proposed algorithm. The second component of our system consists of a novel approach to unsupervised image annotation, based on: (i) the proposed semi-supervised possibilistic clustering; (ii) a greedy selection and joining (GSJ) algorithm; (iii) Bayes' rule; and (iv) a probabilistic model that uses possibilistic membership degrees to annotate an image.
The third component of the proposed system consists of an image retrieval framework based on multi-modal similarity propagation, designed to deal with two data modalities: low-level visual features and high-level textual keywords generated by our proposed image annotation algorithm. The multi-modal similarity propagation system exploits the mutual reinforcement of relational data and results in a nonlinear combination of the different modalities. Specifically, it is used to learn the semantic similarities between images by leveraging the relationships between features from the different modalities. The proposed image annotation and retrieval approaches are implemented and tested on a standard benchmark dataset. We show the effectiveness of our clustering algorithm in handling high-dimensional and noisy data, compare our image annotation approach to three state-of-the-art methods, and demonstrate the effectiveness of the proposed image retrieval system.
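The multi-modal similarity propagation idea can be illustrated with a small sketch. The update rule, the fusion weight `alpha`, and the cosine keyword similarity below are illustrative assumptions rather than the thesis's exact formulation; the point is the mutual reinforcement: visual similarity is repeatedly propagated through the keyword-induced similarity and blended back with the original visual similarity.

```python
import numpy as np

def propagate_similarity(S_visual, R, alpha=0.7, n_iter=20):
    """Toy multi-modal similarity propagation (illustrative only).

    S_visual : (n, n) low-level visual similarity between images.
    R        : (n, k) binary image-keyword annotation matrix.
    """
    # image similarity induced by shared keywords (cosine on annotation rows)
    norms = np.linalg.norm(R, axis=1, keepdims=True) + 1e-12
    S_text = (R / norms) @ (R / norms).T
    S = S_visual.astype(float).copy()
    for _ in range(n_iter):
        # propagate through the textual modality, then blend with the visual one
        S = alpha * (S_text @ S @ S_text) + (1 - alpha) * S_visual
        S /= np.abs(S).max()          # keep values bounded
    return (S + S.T) / 2              # enforce symmetry

# Images 0 and 1 share a keyword, image 2 does not: the propagated
# similarity between 0 and 1 exceeds that between 0 and 2.
S = propagate_similarity(np.eye(3), np.array([[1., 0.], [1., 0.], [0., 1.]]))
```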

    Proportional Data Modeling using Unsupervised Learning and Applications

    In this thesis, we propose the use of Aitchison's distance in the K-means clustering algorithm, which we employ to initialize Dirichlet and generalized Dirichlet mixture models; the model parameters are then estimated using the Expectation-Maximization algorithm. We further exploit this method for intrusion detection by statistically analyzing the entire NSL-KDD dataset. In addition, we present an unsupervised learning algorithm for finite mixture models, based on the Dirichlet and generalized Dirichlet distributions, that uses a Markov random field (MRF) to incorporate spatial information between neighboring pixels into the mixture model. This segmentation model is likewise learned by the Expectation-Maximization algorithm using a Newton-Raphson approach. The results obtained on real image datasets are more encouraging than those obtained with similar approaches.
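The Aitchison geometry makes K-means on compositional data straightforward: the Aitchison distance between two compositions is the Euclidean distance between their centred log-ratio (clr) transforms, so Lloyd's algorithm can be run directly in clr space. The sketch below is a minimal illustration; the deterministic farthest-point initialisation is an assumption for reproducibility, not the thesis's procedure.

```python
import numpy as np

def clr(x):
    """Centred log-ratio transform of strictly positive compositions."""
    logx = np.log(x)
    return logx - logx.mean(axis=-1, keepdims=True)

def kmeans_aitchison(X, k, n_iter=50):
    """K-means under the Aitchison distance: cluster in clr coordinates,
    where the Aitchison distance is plain Euclidean distance."""
    Z = clr(X)
    idx = [0]                                   # farthest-point initialisation
    for _ in range(k - 1):
        d = np.min([((Z - Z[i]) ** 2).sum(-1) for i in idx], axis=0)
        idx.append(int(np.argmax(d)))
    centers = Z[idx]
    for _ in range(n_iter):
        labels = np.argmin(((Z[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([Z[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels
```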

    Variational learning of a Dirichlet process of generalized Dirichlet distributions for simultaneous clustering and feature selection

    This paper introduces a novel enhancement for unsupervised feature selection based on generalized Dirichlet (GD) mixture models. Our proposal extends the finite mixture model previously developed in [1] to the infinite case via the consideration of Dirichlet process mixtures, which can actually be viewed as a purely nonparametric model since the number of mixture components can increase as data are introduced. The infinite assumption avoids problems related to model selection (i.e., determination of the number of clusters) and allows simultaneous separation of the data into similar clusters and selection of relevant features. The resulting model is learned within a principled variational Bayesian framework that we have developed. The experimental results reported for both synthetic data and challenging real-world applications involving image categorization, automatic semantic annotation, and retrieval show the ability of our approach to provide accurate models by distinguishing between relevant and irrelevant features without over- or under-fitting the data.
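The Dirichlet process underlying the infinite mixture is often written via the stick-breaking construction, which shows why no number of components has to be fixed in advance. Below is a generic truncated stick-breaking sketch; the truncation level and concentration value are illustrative, and this is not the paper's variational update:

```python
import numpy as np

def stick_breaking_weights(alpha, n_components, rng):
    """Truncated stick-breaking construction of Dirichlet-process weights:
    v_k ~ Beta(1, alpha),  pi_k = v_k * prod_{j<k} (1 - v_j).
    Larger alpha spreads mass over more components."""
    v = rng.beta(1.0, alpha, size=n_components)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
    return v * remaining
```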

    Distributions based Regression Techniques for Compositional Data

    This thesis presents a systematic study of regression methods for compositional data, which remain rare in the literature. We start with the basic machine learning concept of regression and use regression equations to solve a classification problem. With partial least squares discriminant analysis (PLS-DA), we apply regression algorithms to classification problems such as spam filtering and intrusion detection. After establishing how regression works, we move on to more complex distribution-based regression algorithms. We first explore the uni-dimensional case, beta regression, which shows how a prediction can be made with regression equations when the outcome is assumed to follow a beta distribution. To further enhance our understanding, we look into the Dirichlet distribution, which covers the multi-dimensional case; unlike traditional regression, here we predict a compositional outcome. Two novel distribution-based regression approaches are then proposed for compositional data, namely generalized Dirichlet regression and Beta-Liouville regression. They extend beta regression to the multi-dimensional scenario, similarly to Dirichlet regression. The models are learned by a maximum likelihood estimation algorithm using a Newton-Raphson approach. The performance of the proposed models is compared with other popular solutions on both synthetic and real datasets extracted from challenging applications, such as market share analysis using Google Trends and occupancy estimation in smart buildings, to show the merits of the proposed approaches. Our work can act as a tool for product-based companies to estimate how their investments in advertising have affected their market shares; Google Trends gives an estimate of a company's popularity, which reflects the effect of advertisements. This thesis thus bridges the gap between open-source data from Google Trends and market shares.
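Beta regression, the uni-dimensional starting point discussed above, can be sketched compactly. The logit link and the fixed precision `phi` below are simplifying assumptions (in full beta regression the precision is usually estimated as well):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, gammaln

def beta_regression_fit(X, y, phi=10.0):
    """Toy beta regression with logit link and fixed precision phi:
    y_i ~ Beta(mu_i * phi, (1 - mu_i) * phi),  mu_i = sigmoid(x_i . b).
    Coefficients b are found by maximum likelihood."""
    def nll(b):
        mu = expit(X @ b)
        a, c = mu * phi, (1 - mu) * phi
        # negative Beta log-likelihood; note a + c = phi
        return -np.sum(gammaln(phi) - gammaln(a) - gammaln(c)
                       + (a - 1) * np.log(y) + (c - 1) * np.log(1 - y))
    return minimize(nll, np.zeros(X.shape[1]), method="BFGS").x
```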

    High-Dimensional Non-Gaussian Data Clustering using Variational Learning of Mixture Models

    Clustering has been the topic of extensive research in the past. The main concern is to automatically divide a given data set into different clusters such that vectors of the same cluster are as similar as possible and vectors of different clusters are as different as possible. Finite mixture models have been widely used for clustering since they have the advantages of being able to integrate prior knowledge about the data and to address the problem of unsupervised learning in a formal way. A crucial starting point when adopting mixture models is the choice of the component densities. In this context, the well-known Gaussian distribution has been widely used. However, deploying the Gaussian mixture implicitly implies clustering based on the minimization of Euclidean distortions, which may yield poor results in several real applications where the per-component densities are not Gaussian. Recent works have shown that other models such as the Dirichlet, generalized Dirichlet, and Beta-Liouville mixtures may provide better clustering results in applications containing non-Gaussian data, especially those involving proportional data (or normalized histograms), which are naturally generated by many applications. Two other challenging aspects that should also be addressed when considering mixture models are how to determine the model's complexity (i.e., the number of mixture components) and how to estimate the model's parameters. Fortunately, both problems can be tackled simultaneously within a principled and elegant learning framework, namely variational inference. The main idea of variational inference is to approximate the model posterior distribution by minimizing the Kullback-Leibler divergence between the exact (or true) posterior and an approximating distribution. Recently, variational inference has provided good generalization performance and computational tractability in many applications, including learning mixture models.
In this thesis, we propose several approaches for high-dimensional non-Gaussian data clustering based on various mixture models such as Dirichlet, generalized Dirichlet, and Beta-Liouville. These mixture models are learned using variational inference, whose main advantages are computational efficiency and guaranteed convergence. More specifically, our contributions are four-fold. First, we develop a variational inference algorithm for learning the finite Dirichlet mixture model, where the model parameters and the model complexity can be determined automatically and simultaneously as part of the Bayesian inference procedure. Second, an unsupervised feature selection scheme is integrated with the finite generalized Dirichlet mixture model for clustering high-dimensional non-Gaussian data. Third, we extend the proposed finite generalized Dirichlet mixture model to the infinite case using a nonparametric Bayesian framework known as the Dirichlet process, so that the difficulty of choosing the appropriate number of clusters is sidestepped by assuming an infinite number of mixture components. Finally, we propose an online learning framework to learn a Dirichlet process mixture of Beta-Liouville distributions (i.e., an infinite Beta-Liouville mixture model), which is more suitable than batch learning algorithms when dealing with sequential or large-scale data. The effectiveness of our approaches is evaluated using both synthetic data and challenging real-life applications such as image database categorization, anomaly intrusion detection, human action video categorization, image annotation, facial expression recognition, behavior recognition, and dynamic texture clustering.
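A central quantity in any such mixture learner is the posterior responsibility of each component for each proportional vector. The thesis learns these models variationally; the sketch below shows only the simpler EM-style responsibility computation for a finite Dirichlet mixture, as a point of reference:

```python
import numpy as np
from scipy.stats import dirichlet

def mixture_responsibilities(X, alphas, weights):
    """E-step of a finite Dirichlet mixture: posterior probability that each
    proportional vector x_i was generated by component k, computed in the
    log domain for numerical stability."""
    logp = np.array([[dirichlet.logpdf(x, a) for a in alphas] for x in X])
    logp += np.log(weights)
    logp -= logp.max(axis=1, keepdims=True)
    r = np.exp(logp)
    return r / r.sum(axis=1, keepdims=True)
```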

    A Statistical Framework for Discrete Visual Features Modeling and Classification

    Multimedia contents are mostly described in discrete forms, so analyzing discrete data becomes an important task in many image processing and computer vision applications. One of the most used approaches for discrete data modeling is the finite mixture of multinomial distributions, which assumes that the events to model are independent. It fails, however, to capture the true nature of sparse data and generally leads to poor, biased estimates. Different smoothing techniques that reflect prior background knowledge have been proposed to overcome this issue. The generalized Dirichlet distribution has a suitable covariance structure and offers flexibility in parameter estimation, which makes it a favorable choice as a prior. This choice, however, has its problems, mainly in the estimation of the parameters, which appears to be a laborious task and can deteriorate the accuracy of the estimates under the maximum likelihood (ML) approach. In this thesis, we propose an unsupervised statistical approach to learn structures of this kind of data. The central ingredient of our model is the introduction of a generalized Dirichlet mixture as a prior to the multinomial. An estimation algorithm for the parameters based on leave-one-out (LOO) likelihood and empirical Bayesian inference is developed. This estimation algorithm can be viewed as a hybrid expectation-maximization (EM) scheme which alternates EM iterations with Newton-Raphson iterations using the Hessian matrix. We also propose the use of our model as a parametric basis for support vector machines (SVMs) within a hybrid generative/discriminative framework. Through a series of experiments involving scene modeling and classification using visual words and color texture modeling, we show the efficiency of the proposed approaches.
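The smoothing effect of a Dirichlet-type prior on multinomial estimates is easiest to see in the standard Dirichlet case, where the posterior mean has a closed form. The thesis uses the richer generalized Dirichlet prior with LOO likelihood; the sketch below shows only the simpler baseline:

```python
import numpy as np

def smoothed_multinomial(counts, alpha):
    """Posterior-mean estimate of multinomial parameters under a Dirichlet
    prior: p_w = (n_w + alpha_w) / (N + sum(alpha)).  With sparse counts
    this avoids the zero-probability estimates of plain maximum likelihood."""
    counts = np.asarray(counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha.sum())
```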

    Unsupervised Hybrid Feature Extraction Selection for High-Dimensional Non-Gaussian Data Clustering with Variational Inference

    Clustering has been a subject of extensive research in data mining, pattern recognition, and other areas for several decades. The main goal is to assign samples, which are typically non-Gaussian and expressed as points in high-dimensional feature spaces, to one of a number of clusters. It is well known that in such high-dimensional settings, the existence of irrelevant features generally compromises modeling capabilities. In this paper, we propose a variational inference framework for unsupervised non-Gaussian feature selection in the context of finite generalized Dirichlet (GD) mixture-based clustering. Under the proposed principled variational framework, we simultaneously estimate, in closed form, all the involved parameters and determine the complexity (i.e., both model and feature selection) of the GD mixture. Extensive simulations using synthetic data, along with analyses of real-world data and human action videos, demonstrate that our variational approach achieves better results than comparable techniques.
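The feature-selection idea (a feature is "relevant" if a cluster-specific density explains it better than a common background density) can be illustrated with a toy saliency computation. The Gaussian densities and the prior `rho` below are illustrative assumptions; the paper's setting uses generalized Dirichlet mixtures:

```python
import numpy as np
from scipy.stats import norm

def feature_saliency(X, comp_mean, comp_std, bg_mean, bg_std, rho=0.5):
    """Posterior probability that each feature value is *relevant*, i.e.
    generated by the cluster-specific density rather than the common
    background density, given prior relevance probability rho."""
    p_rel = rho * norm.pdf(X, comp_mean, comp_std)
    p_irr = (1 - rho) * norm.pdf(X, bg_mean, bg_std)
    return p_rel / (p_rel + p_irr)
```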

    Bayesian learning of inverted Dirichlet mixtures for SVM kernels generation

    We describe approaches for positive data modeling and classification using both finite inverted Dirichlet mixture models and support vector machines (SVMs). Inverted Dirichlet mixture models are used to tackle an outstanding challenge in SVMs, namely the generation of accurate kernels. The kernel generation approaches that we consider, grounded in ideas from information theory, allow the incorporation of the data structure and its structural constraints. Inverted Dirichlet mixture models are learned within a principled Bayesian framework using both the Gibbs sampler and Metropolis-Hastings algorithms for parameter estimation, and the Bayes factor for model selection (i.e., determining the number of mixture components). Our Bayesian learning approach derives priors over the model parameters by showing that the inverted Dirichlet distribution belongs to the family of exponential distributions, and then combines these priors with information from the data to build posterior distributions. We illustrate the merits and effectiveness of the proposed method with two challenging real-world applications, namely object detection and visual scene analysis and classification.
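The Metropolis-Hastings step used in such Bayesian learning can be sketched generically. The random-walk proposal, step size, and one-dimensional target below are illustrative; the paper applies the sampler to the inverted Dirichlet posterior:

```python
import numpy as np

def metropolis_hastings(log_post, x0, n_samples, step, rng):
    """Random-walk Metropolis-Hastings: propose x' ~ N(x, step^2) and accept
    with probability min(1, p(x') / p(x)); the chain's stationary
    distribution is the target posterior."""
    samples, x = [], x0
    lp = log_post(x)
    for _ in range(n_samples):
        prop = x + step * rng.normal()
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # acceptance test in log domain
            x, lp = prop, lp_prop
        samples.append(x)
    return np.array(samples)
```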

    Statistical spatial color information modeling in images and applications

    Image processing, among its vast applications, has proven particularly efficient in quality control systems. Quality control systems, such as those in the food, fruit, and meat industries, pharmaceutics, and hardness testing, are highly dependent on the accuracy of the algorithms used to extract image feature vectors and process them. Thus, the need to build better quality-control systems is tied to progress in the field of image processing. Color histograms have been widely and successfully used in many computer vision and image processing applications; however, they do not include any spatial information. We propose statistical models that integrate both color and spatial information. Our first model is based on finite mixture models, which have been applied to many computer vision, image processing, and pattern recognition tasks. The majority of the work on finite mixture models has focused on mixtures for continuous data; however, many applications involve and generate discrete data, for which discrete mixtures are better suited. In this thesis, we investigate the problem of discrete data modeling using finite mixture models and propose a novel, well-motivated mixture that we call the multinomial generalized Dirichlet mixture. Our second model is based on finite multiple-Bernoulli mixtures. For the estimation of the models' parameters, we use a maximum a posteriori (MAP) approach through deterministic annealing expectation maximization (DAEM). Smoothing priors on the component parameters are introduced to stabilize the estimation, and the selection of the number of clusters is based on stochastic complexity.
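Deterministic annealing EM (DAEM) differs from plain EM only in its E-step, where the posteriors are tempered by an inverse temperature. A minimal sketch of that annealed E-step follows; the annealing schedule for `beta` is left out and would be an application-specific choice:

```python
import numpy as np

def annealed_responsibilities(log_lik, weights, beta):
    """Annealed E-step of DAEM: component posteriors computed from
    (likelihood * weight)^beta.  beta << 1 flattens the posteriors, which
    helps avoid poor local maxima; beta = 1 recovers standard EM."""
    logp = beta * (log_lik + np.log(weights))
    logp -= logp.max(axis=1, keepdims=True)
    r = np.exp(logp)
    return r / r.sum(axis=1, keepdims=True)
```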

    Distribution-based Regression for Count and Semi-Bounded Data

    Data mining techniques have been successfully utilized in different applications of significant fields, including pattern recognition, computer vision, and medical research. With the wealth of data generated every day, there is a lack of practical analysis tools to discover hidden relationships and trends. Among all statistical frameworks, regression has proven to be one of the most powerful tools for prediction. Data complexity that is unfavorable for most models poses a considerable challenge in prediction, and the ability of a model to perform accurately and efficiently is extremely important. Thus, a model must be selected that fits the data well, so that learning from previous data is efficient and highly accurate. This work is motivated by the limited number of regression analysis tools for multivariate count data in the literature. We propose two regression models for count data based on flexible distributions, namely the multinomial Beta-Liouville and the multinomial scaled Dirichlet, and evaluate them on the problem of disease diagnosis. Performance is measured by the accuracy of the prediction, which depends on the nature and complexity of the dataset. Our results show the efficiency of the two proposed regression models: the prediction performance of both is competitive with other regression approaches previously used for count data and with the best results in the literature. We then propose three regression models for positive vectors based on flexible distributions for semi-bounded data, namely the inverted Dirichlet, inverted generalized Dirichlet, and inverted Beta-Liouville. The efficiency of these models is tested via real-world applications, including software defect prediction, spam filtering, and disease diagnosis. Our results show that the performance of the three proposed regression models is better than that of other commonly used regression models.
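The link-function machinery shared by these count-regression models can be sketched with a plain multinomial regression under a softmax link. The thesis's models replace the multinomial likelihood with Beta-Liouville and scaled-Dirichlet compounds; only the regression skeleton is shown here:

```python
import numpy as np
from scipy.optimize import minimize

def multinomial_regression_fit(X, Y):
    """Toy multinomial regression: counts Y_i ~ Multinomial(n_i, softmax(B^T x_i)),
    with the last category as reference.  B is found by maximum likelihood."""
    n, d = X.shape
    k = Y.shape[1]
    def nll(b_flat):
        B = b_flat.reshape(d, k - 1)
        eta = np.column_stack([X @ B, np.zeros(n)])    # reference category: eta = 0
        log_p = eta - np.log(np.exp(eta).sum(axis=1, keepdims=True))
        return -(Y * log_p).sum()
    res = minimize(nll, np.zeros(d * (k - 1)), method="BFGS")
    return res.x.reshape(d, k - 1)
```

With an intercept-only design, the fitted coefficients are the log-odds of the empirical proportions against the reference category.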