Search CORE

217,024 research outputs found

Correcting sampling biases via importance reweighting for spatial modeling

Authors: Koldasbayeva Diana; Prokhorov Boris; Zaytsev Alexey
Publication date: 14/09/2023

In machine learning models, error estimation is often complicated by distribution bias, particularly in spatial data such as those found in environmental studies. We introduce an approach based on the ideas of importance sampling to obtain an unbiased estimate of the target error. By taking into account the difference between the desired error and the available data, our method reweights errors at each sample point and neutralizes the shift. Importance sampling and kernel density estimation were used for reweighting. We validate the effectiveness of our approach using artificial data that resemble real-world spatial datasets. Our findings demonstrate the advantages of the proposed approach for estimating the target error, offering a solution to the distribution shift problem. The overall error of predictions dropped from 7% to just 2%, and it becomes smaller for larger samples.

arXiv.org e-Print Archive
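
The reweighting idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the one-dimensional toy data, the bandwidth, and every function name here are assumptions made for illustration. Each per-sample error is weighted by the ratio of a target density to a kernel density estimate of where the samples actually fall, and the weighted errors are averaged.

```python
import math
import random

def gaussian_kde(samples, bandwidth):
    """Return a 1-D Gaussian kernel density estimate built from `samples`."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return density

def reweighted_mean_error(errors, locations, target_density, source_density):
    """Self-normalized importance-weighted mean of per-sample errors."""
    weights = [target_density(x) / source_density(x) for x in locations]
    return sum(w * e for w, e in zip(weights, errors)) / sum(weights)

def draw_biased_location(rng):
    """Hypothetical sampling bias: locations on [0, 1] crowd towards 0."""
    while True:
        x = rng.expovariate(3.0)
        if x <= 1.0:
            return x

rng = random.Random(0)
locations = [draw_biased_location(rng) for _ in range(1000)]
# Toy assumption: the error at a location equals the location itself, so the
# true mean error under a uniform target on [0, 1] is exactly 0.5.
errors = list(locations)
source_density = gaussian_kde(locations, bandwidth=0.05)

def target_density(x):
    """Uniform target density on [0, 1]."""
    return 1.0

naive = sum(errors) / len(errors)
corrected = reweighted_mean_error(errors, locations, target_density, source_density)
print(f"naive: {naive:.3f}, reweighted: {corrected:.3f}, true target error: 0.500")
```

The plain average underestimates the target error because the sparsely sampled region, where errors are largest, is underrepresented. Dividing by the estimated sampling density gives those points more weight.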

Numerical Data Imputation for Multimodal Data Sets: A Probabilistic Nearest-Neighbor Kernel Density Approach

Authors: Doya Kenji; Lalande Floria
Publication date: 29/06/2023

Numerical data imputation algorithms replace missing values by estimates to leverage incomplete data sets. Current imputation methods seek to minimize the error between the unobserved ground truth and the imputed values. But this strategy can create artifacts leading to poor imputation in the presence of multimodal or complex distributions. To tackle this problem, we introduce the kNN×KDE algorithm: a data imputation method combining nearest neighbor estimation (kNN) and density estimation with Gaussian kernels (KDE). We compare our method with previous data imputation methods using artificial and real-world data with different data missing scenarios and various data missing rates, and show that our method can cope with complex original data structure, yields lower data imputation errors, and provides probabilistic estimates with higher likelihood than current methods. We release the code in open-source for the community: https://github.com/DeltaFloflo/knnxkde. Comment: 30 pages, 8 figures, accepted in TMLR (Reproducibility certification)

arXiv.org e-Print Archive
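
To make the combination of nearest neighbors and Gaussian kernels in the abstract above concrete, here is a deliberately simplified sketch. It is not the released kNN×KDE code: the function name, the parameters `k` and `bandwidth`, and the toy data are all assumptions. It finds the k complete rows closest to the query on the observed columns, then draws candidate values for the missing column from Gaussian kernels centred on those neighbours.

```python
import math
import random

def impute_samples(rows, query, missing_idx, k=6, bandwidth=0.05, n_draws=200, seed=0):
    """Draw candidate values for query[missing_idx] from a Gaussian-kernel
    mixture centred on the k nearest complete rows (hypothetical helper)."""
    rng = random.Random(seed)
    observed = [j for j in range(len(query)) if j != missing_idx]

    def distance(row):
        # Euclidean distance computed on the observed columns only.
        return math.sqrt(sum((row[j] - query[j]) ** 2 for j in observed))

    neighbours = sorted(rows, key=distance)[:k]
    centres = [row[missing_idx] for row in neighbours]
    # Pick a neighbour at random, then jitter its value with a Gaussian kernel.
    return [rng.gauss(rng.choice(centres), bandwidth) for _ in range(n_draws)]

# Toy bimodal data: for every x, the second column is either +1 or -1.
rows = []
for i in range(10):
    x = i * 0.1
    rows.append((x, 1.0))
    rows.append((x, -1.0))

draws = impute_samples(rows, query=(0.0, None), missing_idx=1)
mean_of_neighbours = 0.0  # what averaging the neighbours would impute here
```

On this toy data the draws cluster around +1 and -1, whereas a single point estimate such as the neighbour mean would land at 0, a value that never occurs in the data. This is the kind of artifact the abstract attributes to error-minimizing imputation on multimodal distributions.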

The Infinite Mixture of Infinite Gaussian Mixtures

Authors: Dundar Murat; Rajwa Bartek; Yerebakan Halid Z.
Publication date: 01/01/2015

Dirichlet process mixture of Gaussians (DPMG) has been used in the literature for clustering and density estimation problems. However, many real-world data sets exhibit cluster distributions that cannot be captured by a single Gaussian. Modeling such data sets by DPMG creates several extraneous clusters even when clusters are relatively well-defined. Herein, we present the infinite mixture of infinite Gaussian mixtures (I2GMM) for more flexible modeling of data sets with skewed and multi-modal cluster distributions. Instead of using a single Gaussian for each cluster as in the standard DPMG model, the generative model of I2GMM uses a single DPMG for each cluster. The individual DPMGs are linked together through centering of their base distributions at the atoms of a higher level DP prior. Inference is performed by a collapsed Gibbs sampler that also enables partial parallelization. Experimental results on several artificial and real-world data sets suggest the proposed I2GMM model can predict clusters more accurately than existing variational Bayes and Gibbs sampler versions of DPMG.

IUPUIScholarWorks

AdaCat: Adaptive Categorical Discretization for Autoregressive Models

Authors: Abbeel Pieter; Jain Ajay; Li Qiyang
Publication date: 03/08/2022

Autoregressive generative models can estimate complex continuous data distributions, like trajectory rollouts in an RL environment, image intensities, and audio. Most state-of-the-art models discretize continuous data into several bins and use categorical distributions over the bins to approximate the continuous data distribution. The advantage is that categorical distributions can easily express multiple modes and are straightforward to optimize. However, such an approximation cannot express sharp changes in density without using significantly more bins, making it parameter inefficient. We propose an efficient, expressive, multimodal parameterization called Adaptive Categorical Discretization (AdaCat). AdaCat discretizes each dimension of an autoregressive model adaptively, which allows the model to allocate density to fine intervals of interest, improving parameter efficiency. AdaCat generalizes both categoricals and quantile-based regression. AdaCat is a simple add-on to any discretization-based distribution estimator. In experiments, AdaCat improves density estimation for real-world tabular data, images, audio, and trajectories, and improves planning in model-based offline RL. Comment: Uncertainty in Artificial Intelligence (UAI) 2022, 13 pages, 4 figures

arXiv.org e-Print Archive
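
One way to picture the adaptive discretization described in the abstract above is a piecewise-uniform density on [0, 1) whose bins have adjustable widths as well as adjustable masses. The sketch below is an illustrative assumption, not the AdaCat implementation: the function name and the example numbers are hypothetical, and in a real model the widths and masses would be network outputs.

```python
def piecewise_uniform_pdf(x, widths, masses):
    """Density at x in [0, 1) for bins with the given relative widths and masses.

    Both lists are normalized here, so any positive numbers are accepted.
    """
    total_width = sum(widths)
    total_mass = sum(masses)
    left = 0.0
    for width, mass in zip(widths, masses):
        w = width / total_width
        if left <= x < left + w:
            # Inside a bin the density is constant: probability mass / bin width.
            return (mass / total_mass) / w
        left += w
    return 0.0  # x lies outside [0, 1)

# Equal widths reduce to an ordinary categorical histogram over three bins.
uniform_bins = piecewise_uniform_pdf(0.5, widths=[1, 1, 1], masses=[0.1, 0.8, 0.1])

# A narrow middle bin concentrates the same mass into a sharper peak
# without adding any bins.
adaptive_bins = piecewise_uniform_pdf(0.5, widths=[0.45, 0.1, 0.45], masses=[0.1, 0.8, 0.1])
```

With the same three bins, narrowing the middle one raises the peak density from about 2.4 to about 8, which is the parameter-efficiency argument the abstract makes against fixed-width bins.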

Feature Selection Using Different Mutual Information Estimation Methods

Author: Kule Ahmet Kenan
Publication venue: 'Nara Institute of Science and Technology'
Publication date: 30/11/2010

Thesis (M.Sc.) -- İstanbul Technical University, Institute of Science and Technology, 2010. This study examines the effect of different mutual information estimation methods on feature selection. It attempts to improve the minimum-redundancy-maximum-relevance (mRMR) and mutual information filter feature selection methods by replacing binning with more advanced mutual information estimators based on k-nearest neighbours (KNN) and kernel density estimation (KDE). The performance of these estimators is also measured on artificial and real data, and subset selection and combination are tried as ways to improve it. The study concludes that subset selection and combination do not improve performance, and that the KNN-based estimator outperforms binning when used in the mutual information filter, whereas mRMR does not benefit from it.

Ulusal Üniversitelerarası Açık Erişim Sistemi - İstanbul Teknik Üniversitesi
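
The binning baseline that the thesis abstract above compares against can be sketched as a plug-in mutual information estimate over already discretized values, used as a filter score per feature. The function name and the toy features are assumptions for illustration; the thesis's KNN-based and KDE-based estimators are not reproduced here.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats from paired discrete observations."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    # Sum over observed cells of p(x, y) * log( p(x, y) / (p(x) * p(y)) ).
    return sum(
        (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in joint.items()
    )

labels = [0, 1, 0, 1, 0, 1, 0, 1]
features = {
    "copy_of_label": [0, 1, 0, 1, 0, 1, 0, 1],  # fully informative (toy)
    "unrelated": [0, 0, 1, 1, 0, 0, 1, 1],      # independent of the label (toy)
}

# Mutual information filter: score each feature against the label, then rank.
scores = {name: mutual_information(values, labels) for name, values in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

A continuous feature would first have to be cut into bins before it can be passed to this function, and the choice of bins affects the estimate. That sensitivity is the motivation for the KNN-based and KDE-based alternatives the thesis evaluates.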

Conditional Density Models Integrating Fuzzy and Probabilistic Representations of Uncertainty

Author: Almeida e Santos Nogueira R.J. (Rui Jorge)
Publication date: 26/06/2014

Conditional density estimation is an important problem in a variety of areas such as system identification, machine learning, artificial intelligence, empirical economics, macroeconomic analysis, quantitative finance and risk management. This work considers the general problem of conditional density estimation, i.e., estimating and predicting the density of a response variable as a function of covariates. The semi-parametric models proposed and developed in this work combine fuzzy and probabilistic representations of uncertainty, while making very few assumptions regarding the functional form of the response variable's density or changes of the functional form across the space of covariates. These models possess sufficient generalization power to approximate a non-standard density and the ability to describe the underlying process using simple linguistic descriptors despite the complexity and possible non-linearity of this process. These novel models are applied to real world quantitative finance and risk management problems by analyzing financial time-series data containing non-trivial statistical properties, such as fat tails, asymmetric distributions and changing variation over time.

EUR Research Repository

Erasmus University Digital Repository