A Comparison of the Quality of Rule Induction from Inconsistent Data Sets and Incomplete Data Sets
In data mining, decision rules induced from known examples are used to classify unseen cases. There are various rule induction algorithms, such as LEM1 (Learning from Examples Module version 1), LEM2 (Learning from Examples Module version 2) and MLEM2 (Modified Learning from Examples Module version 2). In the real world, many data sets are imperfect, being either inconsistent or incomplete. The idea of lower and upper approximations, or, more generally, probabilistic approximations, provides an effective way to induce rules from inconsistent and incomplete data sets. However, the accuracy of rule sets induced from imperfect data sets is expected to be lower. The objective of this project is to investigate which kind of imperfect data set (inconsistent or incomplete) is worse in terms of the quality of rule induction. In this project, experiments were conducted on eight inconsistent data sets and eight incomplete data sets with lost values. We implemented the MLEM2 algorithm to induce certain and possible rules from inconsistent data sets, and implemented the local probabilistic version of the MLEM2 algorithm to induce certain and possible rules from incomplete data sets. A program called Rule Checker was also developed to classify unseen cases with induced rules and measure the classification error rate. Ten-fold cross validation was carried out and the average error rate was used as the criterion for comparison. Mann-Whitney nonparametric tests were performed to compare, separately for certain and possible rules, incompleteness with inconsistency. The results show that there is no significant difference between inconsistent and incomplete data sets in terms of the quality of rule induction.
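To make the final comparison step concrete, the sketch below shows how per-data-set error rates from ten-fold cross validation can be compared with a two-sided Mann-Whitney U test in Python. This is not the authors' Rule Checker; the error rates are illustrative placeholders.

```python
# Illustrative sketch: comparing ten-fold cross-validation error rates of
# rule sets induced from inconsistent vs. incomplete data sets with the
# Mann-Whitney U test, as described in the abstract above.
from scipy.stats import mannwhitneyu

# Placeholder values; the real error rates would come from running the
# Rule Checker program on the eight data sets of each kind.
errors_inconsistent = [0.21, 0.18, 0.25, 0.30, 0.22, 0.19, 0.27, 0.24]
errors_incomplete   = [0.23, 0.20, 0.26, 0.28, 0.21, 0.22, 0.25, 0.26]

stat, p_value = mannwhitneyu(errors_inconsistent, errors_incomplete,
                             alternative="two-sided")
print(f"U = {stat}, p = {p_value:.3f}")
# A p-value above the chosen significance level (e.g., 0.05) would support
# the paper's conclusion of no significant difference between the two kinds.
```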
Empirical Risk Minimization with Approximations of Probabilistic Grammars
Probabilistic grammars are generative statistical models that are useful for compositional and sequential structures. We present a framework, reminiscent of structural risk minimization, for empirical risk minimization of the parameters of a fixed probabilistic grammar using the log-loss. We derive sample complexity bounds in this framework that apply to both the supervised and unsupervised settings.
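For concreteness, the objective being bounded here takes the standard ERM form with the log-loss (notation assumed rather than quoted from the paper): given a sample $x_1, \dots, x_n$, the estimator is

$$\hat{\theta}_n = \operatorname*{arg\,min}_{\theta \in \Theta} \; \frac{1}{n} \sum_{i=1}^{n} -\log p_\theta(x_i),$$

where $p_\theta$ is the distribution induced by the grammar with parameters $\theta$. In the supervised setting the $x_i$ are full derivations; in the unsupervised setting, $p_\theta(x_i)$ is the marginal probability of the observed sentence, summed over its derivations.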
Variational Bayesian Inference of Line Spectra
In this paper, we address the fundamental problem of line spectral estimation in a Bayesian framework. We target model order and parameter estimation via variational inference in a probabilistic model in which the frequencies are continuous-valued, i.e., not restricted to a grid, and the coefficients are governed by a Bernoulli-Gaussian prior model, turning model order selection into binary sequence detection. Unlike earlier works, which retain only point estimates of the frequencies, we undertake a more complete Bayesian treatment by estimating the posterior probability density functions (pdfs) of the frequencies and computing expectations over them. Thus, we additionally capture and operate with the uncertainty of the frequency estimates. Aiming to maximize the model evidence, variational optimization provides analytic approximations of the posterior pdfs and also gives estimates of the additional parameters. We propose an accurate representation of the pdfs of the frequencies by mixtures of von Mises pdfs, which yields closed-form expectations. We define the algorithm VALSE, in which the estimates of the pdfs and parameters are iteratively updated. VALSE is a gridless, convergent method, does not require parameter tuning, can easily include prior knowledge about the frequencies, and provides approximate posterior pdfs from which the uncertainty in line spectral estimation can be quantified. Simulation results show that accounting for the uncertainty of frequency estimates, rather than computing just point estimates, significantly improves the performance. The performance of VALSE is superior to that of state-of-the-art methods and closely approaches the Cramér-Rao bound computed for the true model order.
Comment: 15 pages, 8 figures, accepted for publication in IEEE Transactions on Signal Processing
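The closed-form expectations mentioned above rest on a standard property of the von Mises distribution: for θ ~ VM(μ, κ), E[e^{jθ}] = (I₁(κ)/I₀(κ)) e^{jμ}, where I_k is the modified Bessel function of the first kind; for a mixture, the expectation is the weighted sum of the per-component values. A minimal numerical check (illustrative only, not the authors' VALSE code):

```python
# Closed-form circular mean of a von Mises pdf, verified by Monte Carlo.
import numpy as np
from scipy.special import iv  # modified Bessel function of the first kind

def vonmises_circular_mean(mu, kappa):
    """Closed-form E[exp(1j*theta)] for theta ~ VM(mu, kappa)."""
    return (iv(1, kappa) / iv(0, kappa)) * np.exp(1j * mu)

mu, kappa = 0.7, 5.0
samples = np.random.vonmises(mu, kappa, size=200_000)
print(vonmises_circular_mean(mu, kappa))  # closed form
print(np.mean(np.exp(1j * samples)))      # numerical estimate, close match
```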
Bayesian Robust Tensor Factorization for Incomplete Multiway Data
We propose a generative model for robust tensor factorization in the presence of both missing data and outliers. The objective is to explicitly infer the underlying low-CP-rank tensor capturing the global information and a sparse tensor capturing the local information (also considered as outliers), thus providing a robust predictive distribution over missing entries. The low-CP-rank tensor is modeled by multilinear interactions between multiple latent factors, on which column sparsity is enforced by a hierarchical prior, while the sparse tensor is modeled by a hierarchical view of the Student-t distribution that associates an individual hyperparameter with each element independently. For model learning, we develop an efficient closed-form variational inference under a fully Bayesian treatment, which can effectively prevent overfitting and scales linearly with data size. In contrast to existing related works, our method can perform model selection automatically and implicitly, without the need for parameter tuning. More specifically, it can discover the ground truth of the CP rank and automatically adapt the sparsity-inducing priors to various types of outliers. In addition, the tradeoff between the low-rank approximation and the sparse representation can be optimized in the sense of maximum model evidence. Extensive experiments and comparisons with many state-of-the-art algorithms on both synthetic and real-world datasets demonstrate the superiority of our method from several perspectives.
Comment: in IEEE Transactions on Neural Networks and Learning Systems, 2016
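The generative model described above decomposes an incompletely observed tensor into a low-CP-rank part plus a sparse outlier part plus noise. The sketch below builds synthetic data from exactly that model; all names and dimensions are illustrative assumptions, not taken from the paper's code.

```python
# Synthetic data from the assumed model: X = low-CP-rank + sparse + noise,
# observed only on a random mask.
import numpy as np

rng = np.random.default_rng(0)
I, J, K, R = 20, 20, 20, 3  # tensor dimensions and true CP rank

# Low-CP-rank part: sum of R rank-one terms a_r (outer) b_r (outer) c_r.
A = rng.standard_normal((I, R))
B = rng.standard_normal((J, R))
C = rng.standard_normal((K, R))
low_rank = np.einsum("ir,jr,kr->ijk", A, B, C)

# Sparse outlier tensor: a few large-magnitude entries.
sparse = np.zeros((I, J, K))
idx = rng.random((I, J, K)) < 0.02
sparse[idx] = 10.0 * rng.standard_normal(idx.sum())

# Observed tensor with missing entries (NaN marks unobserved cells).
noise = 0.1 * rng.standard_normal((I, J, K))
mask = rng.random((I, J, K)) < 0.8
X_observed = np.where(mask, low_rank + sparse + noise, np.nan)
# A Bayesian robust factorization would infer A, B, C, the sparse tensor,
# and the predictive distribution over the NaN entries from X_observed.
```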
- …