Variational Bayes for Merging Noisy Databases
Bayesian entity resolution merges together multiple, noisy databases and
returns the minimal collection of unique individuals represented, together with
their true, latent record values. Bayesian methods allow flexible generative
models that share power across databases as well as principled quantification
of uncertainty for queries of the final, resolved database. However, existing
Bayesian methods for entity resolution use Markov chain Monte Carlo (MCMC)
approximations and are too slow to run on modern databases containing millions
or billions of records. Instead, we propose applying variational approximations
to allow scalable Bayesian inference in these models. We derive a
coordinate-ascent approximation for mean-field variational Bayes, qualitatively
compare our algorithm to existing methods, note unique challenges for inference
that arise from the expected distribution of cluster sizes in entity
resolution, and discuss directions for future work in this domain.
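The coordinate-ascent pattern the abstract refers to can be illustrated on a much simpler conjugate model. The sketch below runs mean-field CAVI for a Normal-Gamma toy model (unknown mean and precision of Gaussian data); it only illustrates the alternating-update scheme, not the entity-resolution model derived in the paper, and all priors and data are synthetic assumptions.

```python
# Minimal sketch of mean-field coordinate-ascent variational Bayes (CAVI)
# on a conjugate Normal-Gamma toy model; illustrative only.
import random

random.seed(0)
x = [random.gauss(2.0, 1.0) for _ in range(500)]  # synthetic data
N, xbar = len(x), sum(x) / len(x)

# Priors: mu ~ N(mu0, (lam0*tau)^-1), tau ~ Gamma(a0, b0)
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

# Variational factors: q(mu) = N(m, 1/lam), q(tau) = Gamma(a, b)
m, lam, a, b = 0.0, 1.0, 1.0, 1.0
for _ in range(50):  # alternate the two coordinate updates
    e_tau = a / b                               # E_q[tau]
    m = (lam0 * mu0 + N * xbar) / (lam0 + N)    # update q(mu)
    lam = (lam0 + N) * e_tau
    e_mu, e_mu2 = m, m * m + 1.0 / lam          # E[mu], E[mu^2]
    a = a0 + (N + 1) / 2.0                      # update q(tau)
    b = b0 + 0.5 * (lam0 * (e_mu2 - 2 * mu0 * e_mu + mu0 ** 2)
                    + sum(xi * xi - 2 * xi * e_mu + e_mu2 for xi in x))

print(round(m, 2), round(a / b, 2))  # posterior mean and precision estimates
```

With conjugate factors each update has a closed form; the entity-resolution setting is harder precisely because the latent linkage structure breaks this simplicity.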
VerdictDB: Universalizing Approximate Query Processing
Despite 25 years of research in academia, approximate query processing (AQP)
has had little industrial adoption. One of the major causes of this slow
adoption is the reluctance of traditional vendors to make radical changes to
their legacy codebases, and the preoccupation of newer vendors (e.g.,
SQL-on-Hadoop products) with implementing standard features. Additionally, the
few AQP engines that are available are each tied to a specific platform and
require users to completely abandon their existing databases---an unrealistic
expectation given the infancy of the AQP technology. Therefore, we argue that a
universal solution is needed: a database-agnostic approximation engine that
will widen the reach of this emerging technology across various platforms.
Our proposal, called VerdictDB, uses a middleware architecture that requires
no changes to the backend database, and thus, can work with all off-the-shelf
engines. Operating at the driver-level, VerdictDB intercepts analytical queries
issued to the database and rewrites them into another query that, if executed
by any standard relational engine, will yield sufficient information for
computing an approximate answer. VerdictDB uses the returned result set to
compute an approximate answer and error estimates, which are then passed on to
the user or application. However, lack of access to the query execution layer
introduces significant challenges in terms of generality, correctness, and
efficiency. This paper shows how VerdictDB overcomes these challenges and
delivers up to 171× speedup (18.45× on average) for a variety of
existing engines, such as Impala, Spark SQL, and Amazon Redshift, while
incurring less than 2.6% relative error. VerdictDB is open-sourced under Apache
License.
Comment: Extended technical report of the paper that appeared in Proceedings
of the 2018 International Conference on Management of Data, pp. 1461-1476.
ACM, 2018.
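The driver-level rewriting idea can be illustrated with a toy middleware over SQLite. The sample-table name, the naive string rewrite, and the plain 1/ratio scaling are assumptions for this sketch; VerdictDB's real rewrites are far more general and also yield error estimates.

```python
# Illustrative sketch of middleware-style approximate query processing:
# intercept an aggregate query, run it against a pre-built uniform sample
# table, and scale the result. Not VerdictDB's actual rewrite rules.
import random
import sqlite3

random.seed(1)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?)",
                 [(random.uniform(0, 100),) for _ in range(100_000)])

RATIO = 0.01  # roughly 1% uniform row sample, built once offline
conn.execute("CREATE TABLE sales_sample AS "
             "SELECT * FROM sales WHERE abs(random()) % 100 = 0")

def approx_count(query: str) -> float:
    """Rewrite COUNT(*) on the base table to run on the sample, then scale."""
    rewritten = query.replace("FROM sales", "FROM sales_sample")
    (n,) = conn.execute(rewritten).fetchone()
    return n / RATIO

exact = conn.execute("SELECT COUNT(*) FROM sales WHERE amount > 50").fetchone()[0]
approx = approx_count("SELECT COUNT(*) FROM sales WHERE amount > 50")
print(exact, approx)
```

Because the rewrite produces ordinary SQL, any relational backend can execute it; this is the property that makes the middleware database-agnostic.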
Mixture of Bilateral-Projection Two-dimensional Probabilistic Principal Component Analysis
The probabilistic principal component analysis (PPCA) model is built upon a global
linear mapping, which is insufficient to model complex data variation.
This paper proposes a mixture of bilateral-projection probabilistic principal
component analysis models (mixB2DPPCA) for 2D data. With multiple components in
the mixture, the model can be seen as a soft clustering algorithm and is capable
of modeling data with complex structures. A Bayesian inference scheme based on
the variational EM (Expectation-Maximization) approach is proposed for
learning the model parameters. Experiments on several publicly available databases
show that mixB2DPPCA achieves lower reconstruction errors and higher recognition
rates than existing PCA-based algorithms.
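As a stand-in for the paper's variational EM on 2D matrices, the sketch below runs ordinary EM on a two-component 1-D Gaussian mixture, just to show the soft-clustering behaviour the abstract mentions; all data and settings are synthetic assumptions, not the paper's model.

```python
# EM for a two-component 1-D Gaussian mixture: each point gets soft
# responsibilities rather than a hard cluster label. Illustrative only.
import math
import random

random.seed(3)
data = ([random.gauss(-2, 0.5) for _ in range(200)]
        + [random.gauss(3, 0.8) for _ in range(200)])

def normpdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

pi, mu, var = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]  # crude initialisation
for _ in range(50):
    # E-step: soft responsibilities r[i][k]
    r = []
    for x in data:
        w = [pi[k] * normpdf(x, mu[k], var[k]) for k in range(2)]
        s = sum(w)
        r.append([wk / s for wk in w])
    # M-step: re-estimate weights, means, variances from responsibilities
    for k in range(2):
        nk = sum(ri[k] for ri in r)
        pi[k] = nk / len(data)
        mu[k] = sum(ri[k] * x for ri, x in zip(r, data)) / nk
        var[k] = sum(ri[k] * (x - mu[k]) ** 2 for ri, x in zip(r, data)) / nk

print([round(m, 1) for m in sorted(mu)])  # recovered component means
```

The mixB2DPPCA model follows the same E/M alternation in spirit, but with matrix-variate components and variational rather than exact E-steps.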
Validating Variational Bayes Linear Regression Method With Multi-Central Datasets.
Purpose: To validate the prediction accuracy of variational Bayes linear regression (VBLR) with two datasets external to the training dataset.
Methods: The training dataset consisted of 7268 eyes of 4278 subjects from the University of Tokyo Hospital. The Japanese Archive of Multicentral Databases in Glaucoma (JAMDIG) dataset consisted of 271 eyes of 177 patients, and the Diagnostic Innovations in Glaucoma Study (DIGS) dataset included 248 eyes of 173 patients; these were used for validation. Prediction accuracy was compared between VBLR and ordinary least squares linear regression (OLSLR). First, OLSLR and VBLR were carried out using total deviation (TD) values at each of the 52 test points from the second to fourth visual fields (VFs) (VF2-4) up to the second to tenth VFs (VF2-10) of each patient in the JAMDIG and DIGS datasets, and the TD values of the 11th VF test were predicted each time. The predictive accuracy of each method was compared using the root mean squared error (RMSE) statistic.
Results: OLSLR RMSEs with the JAMDIG and DIGS datasets were between 31 and 4.3 dB, and between 19.5 and 3.9 dB, respectively. VBLR RMSEs with the JAMDIG and DIGS datasets were between 5.0 and 3.7 dB, and between 4.6 and 3.6 dB. There was a statistically significant difference between VBLR and OLSLR for both datasets at every series (VF2-4 to VF2-10) (P < 0.01 for all tests). However, there was no statistically significant difference in VBLR RMSEs between the JAMDIG and DIGS datasets at any series of VFs (VF2-4 to VF2-10) (P > 0.05).
Conclusions: VBLR outperformed OLSLR in predicting future VF progression, and it has the potential to be a helpful tool in clinical settings.
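The flavour of the comparison can be reproduced on synthetic data: ordinary least squares versus a shrinkage (ridge-style) slope estimator, used here as a crude stand-in for variational Bayes linear regression. The series length, noise level, and shrinkage strength are illustrative assumptions, not the study's clinical data.

```python
# Toy illustration: with short, noisy series, a shrinkage estimator tends
# to extrapolate with lower RMSE than plain OLS. Stand-in for VBLR only.
import random

random.seed(42)

def fit_predict(xs, ys, x_new, alpha=0.0):
    """Fit y = b0 + b1*x; alpha > 0 shrinks the slope toward zero."""
    n = len(xs)
    xm, ym = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xm) ** 2 for x in xs)
    sxy = sum((x - xm) * (y - ym) for x, y in zip(xs, ys))
    b1 = sxy / (sxx + alpha)
    b0 = ym - b1 * xm
    return b0 + b1 * x_new

xs, true_slope, noise_sd, x_new = [0, 1, 2, 3], -0.1, 2.0, 10
ols_err, shrunk_err, trials = 0.0, 0.0, 300
for _ in range(trials):
    ys = [true_slope * x + random.gauss(0, noise_sd) for x in xs]
    truth = true_slope * x_new
    ols_err += (fit_predict(xs, ys, x_new) - truth) ** 2
    shrunk_err += (fit_predict(xs, ys, x_new, alpha=5.0) - truth) ** 2

ols_rmse = (ols_err / trials) ** 0.5
shrunk_rmse = (shrunk_err / trials) ** 0.5
print(round(ols_rmse, 2), round(shrunk_rmse, 2))
```

The same intuition underlies the study's result: a Bayesian prior stabilises slope estimates from the short VF series where OLSLR is erratic.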
Monte Carlo Co-Ordinate Ascent Variational Inference
In Variational Inference (VI), coordinate-ascent and gradient-based
approaches are two major types of algorithms for approximating
difficult-to-compute probability densities. In real-world implementations of
complex models, Monte Carlo methods are widely used to estimate expectations in
coordinate-ascent approaches and gradients in derivative-driven ones. We
discuss a Monte Carlo Co-ordinate Ascent VI (MC-CAVI) algorithm that makes use
of Markov chain Monte Carlo (MCMC) methods in the calculation of expectations
required within Co-ordinate Ascent VI (CAVI). We show that, under regularity
conditions, an MC-CAVI recursion will get arbitrarily close to a maximiser of
the evidence lower bound (ELBO) with any given high probability. In numerical
examples, the performance of the MC-CAVI algorithm is compared with that of MCMC
and -- as a representative of derivative-based VI methods -- of Black Box VI
(BBVI). We discuss and demonstrate MC-CAVI's suitability for models with hard
constraints in simulated and real examples. We compare MC-CAVI's performance
with that of MCMC in an important complex model used in Nuclear Magnetic
Resonance (NMR) spectroscopy data analysis -- BBVI is nearly impossible to
employ in this setting due to the hard constraints involved in the model.
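The core idea, replacing an expectation inside a CAVI update with a Monte Carlo estimate drawn from the current variational factor, can be shown in a few lines. The expectation below is actually tractable, which lets the estimate be checked against its closed form; the numbers are arbitrary assumptions for the sketch.

```python
# Sketch of the MC-CAVI idea: estimate an expectation under the current
# variational factor by sampling, instead of computing it analytically.
import random

random.seed(7)
m, lam = 1.5, 4.0          # current factor q(mu) = N(m, 1/lam)
x = 2.0

# Closed form for comparison: E_q[(x - mu)^2] = (x - m)^2 + 1/lam
exact = (x - m) ** 2 + 1.0 / lam

# Monte Carlo estimate from S draws of q(mu)
S = 200_000
draws = (random.gauss(m, lam ** -0.5) for _ in range(S))
mc = sum((x - mu) ** 2 for mu in draws) / S
print(round(exact, 3), round(mc, 3))
```

In MC-CAVI the draws come from MCMC rather than exact samplers, which is what allows hard-constrained factors (as in the NMR model) to be handled.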