2 research outputs found

    Offline and Online Density Estimation for Large High-Dimensional Data

    Density estimation has wide applications in machine learning and data analysis, including clustering, classification, multimodality analysis, bump hunting, and anomaly detection. In high-dimensional space, the sparsity of data in local neighborhoods makes many parametric and nonparametric density estimation methods inefficient. This work presents the development of computationally efficient algorithms for high-dimensional density estimation based on Bayesian sequential partitioning (BSP). A copula transform is used to separate the estimation of the marginal and joint densities, with the purpose of reducing the computational complexity and estimation error. Using this separation, a parallel implementation of the density estimation algorithm on a 4-core CPU is presented. Some example applications of high-dimensional density estimation in density-based classification and clustering are also presented. Another challenge in density estimation arises in dealing with online sources of data, where data arrives over an open-ended and non-stationary stream. This calls for efficient algorithms for online density estimation. An online density estimator needs to provide up-to-date estimates of the density, bounded by the available computing resources and the requirements of the application. In response, the BBSP method for online density estimation is introduced. It collects and processes the data in blocks of fixed size, then takes a weighted average over the block-wise estimates of the density. The proper choice of block size is discussed via simulations for streams of synthetic and real datasets. Further, to improve efficiency in offline and online density estimation, a progressive update of the binary partitions in BBSP is proposed which, as simulation results show, leads to improved accuracy as well as speed-up for various block sizes.
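
    The block-wise scheme described in this abstract (fixed-size blocks, then a weighted average of the per-block estimates) can be illustrated with a minimal sketch. It is not the BBSP algorithm itself: a plain 1-D histogram stands in for the BSP estimator, and the exponential weighting, the `decay` parameter, and all class and function names are assumptions made for illustration only.

```python
def histogram_density(block, lo, hi, bins):
    """Per-bin density values for one block of 1-D samples.

    A simple histogram is used here as a stand-in for the per-block
    estimator (the abstract uses BSP, which is not reproduced).
    """
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in block:
        # Clamp so that values on or beyond the edges fall in an end bin.
        idx = min(bins - 1, max(0, int((x - lo) / width)))
        counts[idx] += 1
    n = len(block)
    return [c / (n * width) for c in counts]


class BlockwiseDensity:
    """Online estimator: one estimate per full block, averaged with weights.

    Assumption: the newest block gets weight `decay` and the running
    estimate gets weight 1 - decay (an exponential forgetting scheme).
    """

    def __init__(self, block_size, lo, hi, bins, decay=0.5):
        self.block_size = block_size
        self.lo, self.hi, self.bins = lo, hi, bins
        self.decay = decay
        self.buffer = []
        self.estimate = None  # no estimate until the first block is full

    def update(self, x):
        """Add one sample; refresh the estimate when a block fills up."""
        self.buffer.append(x)
        if len(self.buffer) < self.block_size:
            return
        block_est = histogram_density(self.buffer, self.lo, self.hi, self.bins)
        self.buffer = []
        if self.estimate is None:
            self.estimate = block_est
        else:
            w = self.decay
            # A convex combination of densities still integrates to one.
            self.estimate = [(1 - w) * old + w * new
                             for old, new in zip(self.estimate, block_est)]

    def pdf(self, x):
        """Current density estimate at x."""
        if self.estimate is None:
            raise ValueError("no complete block has been processed yet")
        width = (self.hi - self.lo) / self.bins
        idx = min(self.bins - 1, max(0, int((x - self.lo) / width)))
        return self.estimate[idx]
```

    In this sketch a larger `decay` tracks a non-stationary stream more quickly at the cost of a noisier estimate, which mirrors the block-size trade-off the abstract mentions.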

    Parsimonious Multivariate Copula Model for Density Estimation

    The most common approach to estimating a multivariate density assumes a parametric form for the joint distribution. The choice of this parametric form imposes constraints on the marginal distributions. Copula models disentangle the choice of marginals from the joint distribution, making them a powerful tool for multivariate density estimation. However, so far they have been studied mostly for low-dimensional multivariate data. In this paper, we investigate a popular Copula model, the Gaussian Copula model, for high-dimensional settings. This model, however, requires estimation of a full correlation matrix, which can cause data scarcity in this setting. One approach to address this problem is to impose constraints on the parameter space. In this paper, we present a Toeplitz correlation structure to reduce the number of Gaussian Copula parameters. To increase the flexibility of our model, we also introduce a mixture of Gaussian Copulas as a natural extension of the Gaussian Copula model. Through empirical evaluation of likelihood on held-out data, we study the trade-off between correlation constraints and mixture flexibility, and report results on wine data sets from the UCI Repository as well as our corpus of monkey vocalizations. We find that a mixture of Gaussian Copulas with Toeplitz correlation structure models the data consistently better than Gaussian mixture models with an equivalent number of parameters. Index Terms: Copula, Mixture Models
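
    To make the parameter saving concrete: a full d x d correlation matrix has d(d-1)/2 free entries, whereas a Toeplitz one is fixed by d-1 lag correlations. The sketch below builds such a matrix and evaluates the standard Gaussian copula log-density, log c(u) = -1/2 log|R| - 1/2 z^T (R^{-1} - I) z with z_i = Phi^{-1}(u_i). It is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, and the inputs `u` are assumed to be already mapped to (0, 1) by the marginal CDFs.

```python
import math
from statistics import NormalDist


def toeplitz_correlation(rhos):
    """d x d Toeplitz correlation matrix; rhos[k] is the lag-(k+1) correlation."""
    first_row = [1.0] + list(rhos)
    d = len(first_row)
    # Entry (i, j) depends only on the lag |i - j|.
    return [[first_row[abs(i - j)] for j in range(d)] for i in range(d)]


def cholesky(a):
    """Lower-triangular factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                diag = a[i][i] - s
                if diag <= 0.0:
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(diag)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def gaussian_copula_logpdf(u, corr):
    """Log-density of the Gaussian copula at u (each u_i strictly in (0, 1))."""
    std_normal = NormalDist()
    z = [std_normal.inv_cdf(ui) for ui in u]
    low = cholesky(corr)
    # Forward substitution solves low * y = z, so that y.y = z^T R^{-1} z.
    y = []
    for i in range(len(z)):
        s = sum(low[i][k] * y[k] for k in range(i))
        y.append((z[i] - s) / low[i][i])
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(len(z)))
    quad = sum(v * v for v in y) - sum(v * v for v in z)
    return -0.5 * log_det - 0.5 * quad
```

    Not every choice of lag correlations yields a valid correlation matrix; in this sketch `cholesky` raises `ValueError` when the resulting matrix is not positive definite.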