
    Correcting sampling biases via importance reweighting for spatial modeling

    In machine learning models, error estimation is often complicated by distribution bias, particularly in spatial data such as those found in environmental studies. We introduce an approach based on importance sampling to obtain an unbiased estimate of the target error. By accounting for the difference between the target distribution and the available data, our method reweights the error at each sample point and neutralizes the shift. Importance sampling and kernel density estimation were used for the reweighting. We validate the effectiveness of our approach on artificial data that resemble real-world spatial datasets. Our findings demonstrate the advantages of the proposed approach for estimating the target error, offering a solution to the distribution-shift problem. The overall prediction error dropped from 7% to just 2%, and it decreases further for larger samples.
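The reweighting idea in this abstract can be sketched with a density-ratio estimate: per-sample errors are weighted by p_target(x) / p_train(x), with both densities fit by Gaussian KDE. This is a minimal illustrative sketch, not the paper's code; the toy data and the synthetic error profile are assumptions.

```python
# Importance reweighting of per-sample errors via a KDE density ratio.
# Assumption: training points oversample one region; the target is uniform.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Biased training sample (clustered near 0.3) vs. the target distribution.
x_train = rng.normal(0.3, 0.1, size=500)
x_target = rng.uniform(0.0, 1.0, size=500)

kde_train = gaussian_kde(x_train)
kde_target = gaussian_kde(x_target)

# Toy per-sample squared errors of some model, larger away from the cluster.
errors = (x_train - 0.3) ** 2

# Importance weights neutralize the sampling shift.
weights = kde_target(x_train) / kde_train(x_train)
weights /= weights.sum()

naive_error = float(errors.mean())                 # biased toward the cluster
reweighted_error = float(np.sum(weights * errors)) # estimate under the target
```

Because the training sample concentrates where the toy model is accurate, the naive average understates the target error; the reweighted estimate corrects upward.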

    Numerical Data Imputation for Multimodal Data Sets: A Probabilistic Nearest-Neighbor Kernel Density Approach

    Numerical data imputation algorithms replace missing values by estimates to leverage incomplete data sets. Current imputation methods seek to minimize the error between the unobserved ground truth and the imputed values, but this strategy can create artifacts leading to poor imputation in the presence of multimodal or complex distributions. To tackle this problem, we introduce the kNN×KDE algorithm: a data imputation method combining nearest-neighbor estimation (kNN) and density estimation with Gaussian kernels (KDE). We compare our method with previous data imputation methods on artificial and real-world data under different missing-data scenarios and various missing rates, and show that our method can cope with complex original data structures, yields lower imputation errors, and provides probabilistic estimates with higher likelihood than current methods. We release the code in open source for the community: https://github.com/DeltaFloflo/knnxkde
    Comment: 30 pages, 8 figures, accepted in TMLR (Reproducibility Certification)
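The general kNN-plus-KDE idea can be illustrated in a few lines: find the nearest rows on the observed columns, place a Gaussian kernel on each neighbor's value in the missing column, and draw from that mixture, which preserves multimodality where a neighbor mean would not. This is a hedged sketch of the idea, not the released kNN×KDE implementation; the function name and toy data are assumptions.

```python
# Toy kNN + Gaussian-kernel imputation: sample from a kernel mixture over the
# k nearest neighbors' values rather than averaging them.
import numpy as np

def knn_kde_impute(X, row, col, k=5, bandwidth=0.5, rng=None):
    """Impute X[row, col]: find the k nearest rows on the other columns,
    put a Gaussian kernel on each neighbor's value in `col`, and draw
    from that mixture (keeps multimodal structure)."""
    rng = np.random.default_rng(rng)
    observed = [c for c in range(X.shape[1]) if c != col]
    d = np.linalg.norm(X[:, observed] - X[row, observed], axis=1)
    d[row] = np.inf                          # exclude the row itself
    neighbors = np.argsort(d)[:k]
    center = rng.choice(X[neighbors, col])   # pick a mixture component
    return float(center + rng.normal(0.0, bandwidth))

# Two clear clusters; row 0's second value is treated as missing.
X = np.array([[0.0, 1.0],
              [0.1, 1.1],
              [5.0, 9.0],
              [0.05, 1.05],
              [5.1, 9.2]])
imputed = knn_kde_impute(X, row=0, col=1, k=2, bandwidth=0.1, rng=0)
```

With `k=2` the neighbors are the rows near 0 on the first column, so the draw lands near 1.1 rather than being pulled toward the distant cluster at 9.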

    The Infinite Mixture of Infinite Gaussian Mixtures

    The Dirichlet process mixture of Gaussians (DPMG) has been used in the literature for clustering and density estimation problems. However, many real-world data sets exhibit cluster distributions that cannot be captured by a single Gaussian. Modeling such data sets by DPMG creates several extraneous clusters even when clusters are relatively well-defined. Herein, we present the infinite mixture of infinite Gaussian mixtures (I2GMM) for more flexible modeling of data sets with skewed and multi-modal cluster distributions. Instead of using a single Gaussian for each cluster as in the standard DPMG model, the generative model of I2GMM uses a single DPMG for each cluster. The individual DPMGs are linked together by centering their base distributions at the atoms of a higher-level DP prior. Inference is performed by a collapsed Gibbs sampler that also enables partial parallelization. Experimental results on several artificial and real-world data sets suggest that the proposed I2GMM model can predict clusters more accurately than existing variational Bayes and Gibbs sampler versions of DPMG.
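The two-level generative process described here can be sketched with truncated stick-breaking: a top-level DP picks cluster atoms, and each cluster is itself a (truncated) mixture of Gaussians centered near its atom. This is a toy forward sampler under assumed hyperparameters, not the paper's collapsed Gibbs inference.

```python
# Illustrative generative sketch of a two-level DP mixture (I2GMM-style),
# with both DPs truncated for simplicity.
import numpy as np

def stick_breaking(alpha, trunc, rng):
    """Truncated stick-breaking weights for a DP with concentration alpha."""
    betas = rng.beta(1.0, alpha, size=trunc)
    remain = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    w = betas * remain
    return w / w.sum()

def sample_i2gmm(n, alpha_top=1.0, alpha_low=1.0, trunc=10, seed=0):
    rng = np.random.default_rng(seed)
    top_w = stick_breaking(alpha_top, trunc, rng)      # higher-level DP
    top_atoms = rng.normal(0.0, 10.0, size=trunc)      # cluster centers
    # Each cluster gets its own DPMG, centered at that cluster's atom.
    low_w = np.array([stick_breaking(alpha_low, trunc, rng)
                      for _ in range(trunc)])
    comp_means = top_atoms[:, None] + rng.normal(0.0, 1.0, size=(trunc, trunc))
    ks = rng.choice(trunc, size=n, p=top_w)            # cluster assignment
    js = np.array([rng.choice(trunc, p=low_w[k]) for k in ks])
    return rng.normal(comp_means[ks, js], 0.3)

x = sample_i2gmm(200)
```

A single Gaussian per cluster would miss the within-cluster multi-modality that the lower-level mixtures create here, which is exactly the failure mode the abstract attributes to standard DPMG.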

    AdaCat: Adaptive Categorical Discretization for Autoregressive Models

    Autoregressive generative models can estimate complex continuous data distributions, such as trajectory rollouts in an RL environment, image intensities, and audio. Most state-of-the-art models discretize continuous data into several bins and use categorical distributions over the bins to approximate the continuous data distribution. The advantage is that categorical distributions can easily express multiple modes and are straightforward to optimize. However, such an approximation cannot express sharp changes in density without using significantly more bins, making it parameter-inefficient. We propose an efficient, expressive, multimodal parameterization called Adaptive Categorical Discretization (AdaCat). AdaCat discretizes each dimension of an autoregressive model adaptively, which allows the model to allocate density to fine intervals of interest, improving parameter efficiency. AdaCat generalizes both categoricals and quantile-based regression. AdaCat is a simple add-on to any discretization-based distribution estimator. In experiments, AdaCat improves density estimation for real-world tabular data, images, audio, and trajectories, and improves planning in model-based offline RL.
    Comment: Uncertainty in Artificial Intelligence (UAI) 2022; 13 pages, 4 figures
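The quantile-based regression that AdaCat generalizes can be shown in a few lines: placing bin edges at empirical quantiles gives each bin equal probability mass, so high-density regions automatically get fine bins where uniform binning would waste parameters. This is an illustrative sketch of that baseline idea, not the AdaCat parameterization itself.

```python
# Quantile-based adaptive binning: fine bins where the data is dense.
import numpy as np

def adaptive_bins(samples, n_bins):
    """Bin edges at empirical quantiles: equal probability mass per bin."""
    qs = np.linspace(0.0, 1.0, n_bins + 1)
    return np.quantile(samples, qs)

rng = np.random.default_rng(0)
# A sharp narrow mode next to a broad one: hard for uniform bins.
x = np.concatenate([rng.normal(-2.0, 0.05, 1000), rng.normal(3.0, 1.0, 1000)])
edges = adaptive_bins(x, 8)
counts, _ = np.histogram(x, bins=edges)   # roughly 250 samples per bin
```

The edges cluster tightly around the sharp mode at -2 and spread out over the broad mode at 3, which is the density-allocation behavior the abstract describes.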

    Feature Selection Using Different Mutual Information Estimation Methods

    Thesis (M.Sc.) -- İstanbul Technical University, Institute of Science and Technology, 2010. In this study, the effect of different mutual information estimation methods on feature selection is examined. The minimum-redundancy-maximum-relevance (mRMR) and mutual information filter feature selection methods are improved by using mutual information estimation methods more advanced than binning, namely k-nearest-neighbour (KNN) based and kernel density estimation (KDE) based methods. In addition, the performance of these mutual information estimation methods on artificial and real data is measured, and attempts are made to improve it through subset selection and combination. It is concluded that subset selection and combination do not improve performance, and that the KNN-based estimation method outperforms binning when used in the mutual information filter, although mRMR does not benefit from it.
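The binning baseline that the thesis compares against can be written directly from the definition of mutual information over a 2-D histogram; the kNN (Kraskov-style) and KDE estimators refine this same quantity. This is an illustrative sketch with toy data, not the thesis's experimental setup.

```python
# Mutual information I(X;Y) estimated by histogram binning, in nats.
import numpy as np

def binned_mi(x, y, bins=16):
    """I(X;Y) = sum p(x,y) * log( p(x,y) / (p(x) p(y)) ) over a 2-D histogram."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of X, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of Y, shape (1, bins)
    nz = pxy > 0                          # skip empty cells (0 * log 0 = 0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y_dep = x + 0.1 * rng.normal(size=5000)   # strongly dependent on x
y_ind = rng.normal(size=5000)             # independent of x

mi_dep = binned_mi(x, y_dep)
mi_ind = binned_mi(x, y_ind)
```

A filter-style feature selector would rank `y_dep` far above `y_ind`; the estimator's small positive bias on independent data is one of the weaknesses that motivates the kNN and KDE alternatives studied in the thesis.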

    Conditional Density Models Integrating Fuzzy and Probabilistic Representations of Uncertainty

    Conditional density estimation is an important problem in a variety of areas such as system identification, machine learning, artificial intelligence, empirical economics, macroeconomic analysis, quantitative finance and risk management. This work considers the general problem of conditional density estimation, i.e., estimating and predicting the density of a response variable as a function of covariates. The semi-parametric models proposed and developed in this work combine fuzzy and probabilistic representations of uncertainty, while making very few assumptions regarding the functional form of the response variable's density or changes of the functional form across the space of covariates. These models possess sufficient generalization power to approximate a non-standard density and the ability to describe the underlying process using simple linguistic descriptors despite the complexity and possible non-linearity of this process. These novel models are applied to real-world quantitative finance and risk management problems by analyzing financial time-series data containing non-trivial statistical properties, such as fat tails, asymmetric distributions and changing variation over time.
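The core problem here, estimating p(y|x) without assuming a functional form, has a simple nonparametric baseline: p(y|x) = p(x, y) / p(x) with both densities fit by Gaussian KDE. This sketch illustrates the general problem on toy data; the fuzzy-probabilistic semi-parametric models of the abstract are more structured than this.

```python
# Nonparametric conditional density: p(y|x) = p(x, y) / p(x) via Gaussian KDE.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=2000)
y = np.sin(x) + 0.2 * rng.normal(size=2000)   # noisy sine as toy data

kde_joint = gaussian_kde(np.vstack([x, y]))   # p(x, y)
kde_x = gaussian_kde(x)                       # p(x)

def cond_density(y_grid, x0):
    """Estimated p(y | x = x0) on a grid of y values."""
    pts = np.vstack([np.full_like(y_grid, x0), y_grid])
    return kde_joint(pts) / kde_x(x0)

y_grid = np.linspace(-2.0, 2.0, 401)
dens = cond_density(y_grid, x0=1.0)           # peaks near sin(1) ~ 0.84
```

The estimated conditional density peaks near sin(1) and integrates to roughly one, with no parametric assumption about its shape, the same flexibility goal the abstract's models pursue with added interpretability.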