Correcting sampling biases via importance reweighting for spatial modeling
In machine learning models, the estimation of errors is often complex due to
distribution bias, particularly in spatial data such as those found in
environmental studies. We introduce an approach based on the ideas of
importance sampling to obtain an unbiased estimate of the target error. By
taking into account the difference between the desired error and the available data, our
method reweights the error at each sample point and neutralizes the shift.
Importance sampling and kernel density estimation are used for the
reweighting. We validate the effectiveness of our approach using artificial
data that resemble real-world spatial datasets. Our findings demonstrate
advantages of the proposed approach for the estimation of the target error,
offering a solution to the distribution shift problem. The overall prediction
error dropped from 7% to just 2%, and it decreases further for larger samples.
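The reweighting idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the one-dimensional setting, the bandwidth, the synthetic data, and the function names `gaussian_kde` and `reweighted_error` are all assumptions made for illustration. Each per-sample error is weighted by the ratio of an estimated target density to an estimated sampling density, and the weights are self-normalized.

```python
import math
import random

def gaussian_kde(samples, bandwidth):
    """Return a 1-D Gaussian kernel density estimate as a callable."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return density

def reweighted_error(errors, locations, source_density, target_density):
    """Self-normalized importance-weighted mean of per-sample errors."""
    weights = [target_density(x) / max(source_density(x), 1e-12) for x in locations]
    return sum(w * e for w, e in zip(weights, errors)) / sum(weights)

random.seed(0)
# Hypothetical setup: labelled samples cluster near x = 0.2, while the
# region of interest (the target) is spread uniformly over [0, 1].
source_x = [min(max(random.gauss(0.2, 0.15), 0.0), 1.0) for _ in range(500)]
target_x = [random.random() for _ in range(500)]
# Hypothetical error profile: the model is worse far from the sampled cluster.
errors = [abs(x - 0.2) for x in source_x]

p_source = gaussian_kde(source_x, bandwidth=0.05)
p_target = gaussian_kde(target_x, bandwidth=0.05)

naive = sum(errors) / len(errors)            # ignores the sampling bias
corrected = reweighted_error(errors, source_x, p_source, p_target)
print(f"naive={naive:.3f} corrected={corrected:.3f}")
```

In this toy setup the naive mean understates the target error because poorly covered regions carry the largest errors; the reweighted estimate gives those regions more influence.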
Numerical Data Imputation for Multimodal Data Sets: A Probabilistic Nearest-Neighbor Kernel Density Approach
Numerical data imputation algorithms replace missing values by estimates to
leverage incomplete data sets. Current imputation methods seek to minimize the
error between the unobserved ground truth and the imputed values. But this
strategy can create artifacts leading to poor imputation in the presence of
multimodal or complex distributions. To tackle this problem, we introduce the
NNKDE algorithm: a data imputation method combining nearest neighbor
estimation (NN) and density estimation with Gaussian kernels (KDE). We
compare our method with previous data imputation methods using artificial and
real-world data with different data missing scenarios and various data missing
rates, and show that our method can cope with complex original data structure,
yields lower data imputation errors, and provides probabilistic estimates with
higher likelihood than current methods. We release the code as open source for
the community: https://github.com/DeltaFloflo/knnxkde
Comment: 30 pages, 8 figures, accepted in TMLR (Reproducibility certification)
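To make the combination of nearest neighbors and Gaussian kernels concrete, here is a loose sketch of the general idea, not the released NNKDE code. The two-column data layout, the parameters `k` and `bandwidth`, and all function names are hypothetical. For a row with a missing value, the sketch finds the nearest complete rows on the observed column and places one Gaussian kernel on each neighbor's value, which gives a distribution rather than a single point estimate.

```python
import math
import random

def neighbour_values(rows, x_obs, k):
    """Return the y values of the k rows whose x is closest to x_obs."""
    ranked = sorted(rows, key=lambda r: abs(r[0] - x_obs))
    return [y for _, y in ranked[:k]]

def mixture_density(centres, bandwidth):
    """Equal-weight Gaussian mixture centred on the neighbours' values."""
    norm = 1.0 / (len(centres) * bandwidth * math.sqrt(2.0 * math.pi))
    def density(y):
        return norm * sum(math.exp(-0.5 * ((y - c) / bandwidth) ** 2) for c in centres)
    return density

def sample_imputation(centres, bandwidth, rng):
    """Draw one imputed value: pick a neighbour, then add Gaussian noise."""
    return rng.gauss(rng.choice(centres), bandwidth)

rng = random.Random(0)
# Hypothetical bimodal data: y alternates between a mode near +1 and one near -1.
rows = [(i / 400, (1.0 if i % 2 == 0 else -1.0) + rng.gauss(0.0, 0.05))
        for i in range(400)]

centres = neighbour_values(rows, x_obs=0.5, k=20)
density = mixture_density(centres, bandwidth=0.1)
draws = [sample_imputation(centres, 0.1, rng) for _ in range(200)]
mean_fill = sum(centres) / len(centres)   # what an error-minimizing fill would pick
print(f"mean fill={mean_fill:.2f}, density there={density(mean_fill):.2e}")
```

With bimodal data the neighbors' mean falls between the two modes, where the estimated density is close to zero. That is the kind of artifact the abstract attributes to error-minimizing imputation, while draws from the kernel mixture stay near the modes.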
The Infinite Mixture of Infinite Gaussian Mixtures
Dirichlet process mixture of Gaussians (DPMG) has been used in the literature for clustering and density estimation problems. However, many real-world data exhibit cluster distributions that cannot be captured by a single Gaussian. Modeling such data sets by DPMG creates several extraneous clusters even when clusters are relatively well-defined. Herein, we present the infinite mixture of infinite Gaussian mixtures (I2GMM) for more flexible modeling of data sets with skewed and multi-modal cluster distributions. Instead of using a single Gaussian for each cluster as in the standard DPMG model, the generative model of I2GMM uses a single DPMG for each cluster. The individual DPMGs are linked together through centering of their base distributions at the atoms of a higher level DP prior. Inference is performed by a collapsed Gibbs sampler that also enables partial parallelization. Experimental results on several artificial and real-world data sets suggest the proposed I2GMM model can predict clusters more accurately than existing variational Bayes and Gibbs sampler versions of DPMG
AdaCat: Adaptive Categorical Discretization for Autoregressive Models
Autoregressive generative models can estimate complex continuous data
distributions, like trajectory rollouts in an RL environment, image
intensities, and audio. Most state-of-the-art models discretize continuous data
into several bins and use categorical distributions over the bins to
approximate the continuous data distribution. The advantage is that
categorical distributions can easily express multiple modes and are
straightforward to optimize. However, such an approximation cannot express sharp
changes in density without using significantly more bins, making it parameter
inefficient. We propose an efficient, expressive, multimodal parameterization
called Adaptive Categorical Discretization (AdaCat). AdaCat discretizes each
dimension of an autoregressive model adaptively, which allows the model to
allocate density to fine intervals of interest, improving parameter efficiency.
AdaCat generalizes both categoricals and quantile-based regression. AdaCat is a
simple add-on to any discretization-based distribution estimator. In
experiments, AdaCat improves density estimation for real-world tabular data,
images, audio, and trajectories, and improves planning in model-based offline
RL.
Comment: Uncertainty in Artificial Intelligence (UAI) 2022, 13 pages, 4 figures
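The parameter-efficiency argument can be seen in a small sketch of a piecewise-constant density on [0, 1] with variable bin widths. This is an illustration under assumptions, not the AdaCat implementation: the widths and masses are fixed by hand here, whereas a model would predict them, and `adaptive_density` is a hypothetical helper name.

```python
from bisect import bisect_right
from itertools import accumulate

def adaptive_density(x, widths, masses):
    """Density at x in [0, 1] for bins with the given widths and probability masses.

    Both lists are assumed to be positive and to sum to 1.
    """
    right_edges = list(accumulate(widths))
    index = min(bisect_right(right_edges, x), len(widths) - 1)
    return masses[index] / widths[index]

# Four equal-width bins: no bin can exceed density 1 / 0.25 = 4.
uniform_widths = [0.25, 0.25, 0.25, 0.25]
uniform_masses = [0.05, 0.45, 0.45, 0.05]

# Same number of parameters per vector, but narrow bins around the sharp peak.
adaptive_widths = [0.45, 0.05, 0.05, 0.45]
adaptive_masses = [0.05, 0.45, 0.45, 0.05]

peak_uniform = adaptive_density(0.52, uniform_widths, uniform_masses)
peak_adaptive = adaptive_density(0.52, adaptive_widths, adaptive_masses)
print(peak_uniform, peak_adaptive)
```

With equal-width bins the density inside any bin is capped by the inverse bin width, so a sharper peak requires more bins. Letting the widths vary places narrow bins where the density changes quickly without increasing the bin count.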
Feature Selection Using Different Mutual Information Estimation Methods
Thesis (M.Sc.) -- İstanbul Technical University, Institute of Science and Technology, 2010
This study examines the effect of different mutual information estimation methods on feature selection. It attempts to improve the minimum-redundancy-maximum-relevance (mRMR) and mutual information filter feature selection methods by replacing binning with more advanced mutual information estimators, namely k-nearest-neighbour (KNN) based and kernel density estimation (KDE) based methods. In addition, the performance of these estimators is measured on artificial and real data, and subset selection and combination are tried as ways of improving it. The study concludes that subset selection and combination do not improve performance, and that the KNN-based estimator outperforms binning when used in the mutual information filter, whereas mRMR does not benefit from it.
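For reference, the binning baseline mentioned in this abstract can be sketched as a plug-in estimate computed from histogram counts. This is a minimal illustration with hypothetical data and function names, assuming a continuous feature and discrete class labels; it is not taken from the thesis.

```python
import math
import random
from collections import Counter

def discretize(values, bins):
    """Map continuous values to equal-width bin indices in [0, bins - 1]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0   # guard against a constant feature
    return [min(int((v - lo) / width), bins - 1) for v in values]

def mutual_information(a, b):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(a)
    joint, count_a, count_b = Counter(zip(a, b)), Counter(a), Counter(b)
    return sum((c / n) * math.log(c * n / (count_a[x] * count_b[y]))
               for (x, y), c in joint.items())

rng = random.Random(0)
labels = [i % 2 for i in range(1000)]
# Hypothetical features: one tracks the label, the other is pure noise.
informative = [label + rng.gauss(0.0, 0.3) for label in labels]
noise = [rng.gauss(0.0, 1.0) for _ in labels]

scores = {
    "informative": mutual_information(discretize(informative, 10), labels),
    "noise": mutual_information(discretize(noise, 10), labels),
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

A mutual information filter ranks features by such scores and keeps the highest ones. The KNN-based and KDE-based estimators studied in the thesis would replace `mutual_information(discretize(...), ...)` with a different estimate of the same quantity.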
Conditional Density Models Integrating Fuzzy and Probabilistic Representations of Uncertainty
__Abstract__
Conditional density estimation is an important problem in a variety of areas such as system identification, machine learning, artificial intelligence, empirical economics, macroeconomic analysis, quantitative finance and risk management.
This work considers the general problem of conditional density estimation, i.e., estimating and predicting the density of a response variable as a function of covariates. The semi-parametric models proposed and developed in this work combine fuzzy and probabilistic representations of uncertainty, while making very few assumptions regarding the functional form of the response variable's density or changes of the functional form across the space of covariates. These models possess sufficient generalization power to approximate a non-standard density and the ability to describe the underlying process using simple linguistic descriptors despite the complexity and possible non-linearity of this process.
These novel models are applied to real-world quantitative finance and risk management problems by analyzing financial time-series data with non-trivial statistical properties, such as fat tails, asymmetric distributions and changing variation over time.