
    VerdictDB: Universalizing Approximate Query Processing

    Despite 25 years of research in academia, approximate query processing (AQP) has had little industrial adoption. One of the major causes of this slow adoption is the reluctance of traditional vendors to make radical changes to their legacy codebases, and the preoccupation of newer vendors (e.g., SQL-on-Hadoop products) with implementing standard features. Additionally, the few AQP engines that are available are each tied to a specific platform and require users to completely abandon their existing databases---an unrealistic expectation given the infancy of the AQP technology. Therefore, we argue that a universal solution is needed: a database-agnostic approximation engine that will widen the reach of this emerging technology across various platforms. Our proposal, called VerdictDB, uses a middleware architecture that requires no changes to the backend database, and thus can work with all off-the-shelf engines. Operating at the driver level, VerdictDB intercepts analytical queries issued to the database and rewrites each into another query that, if executed by any standard relational engine, will yield sufficient information for computing an approximate answer. VerdictDB uses the returned result set to compute an approximate answer and error estimates, which are then passed on to the user or application. However, lack of access to the query execution layer introduces significant challenges in terms of generality, correctness, and efficiency. This paper shows how VerdictDB overcomes these challenges and delivers up to 171× speedup (18.45× on average) for a variety of existing engines, such as Impala, Spark SQL, and Amazon Redshift, while incurring less than 2.6% relative error. VerdictDB is open-sourced under the Apache License.

    Comment: Extended technical report of the paper that appeared in Proceedings of the 2018 International Conference on Management of Data, pp. 1461-1476. ACM, 2018.
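
    The driver-level rewrite is easiest to see on a single aggregate. Below is a minimal, hypothetical Python sketch of the idea: the table name orders_sample, the 1% sampling ratio, and the hard-coded query shape are all illustrative assumptions, not VerdictDB's actual rewrite rules (a real middleware parses arbitrary SQL).

        SAMPLE_RATIO = 0.01  # assume orders_sample is a 1% uniform row sample of orders

        def rewrite(query: str) -> str:
            # Redirect a COUNT over the full table to the sample table and
            # scale by the inverse sampling ratio; error estimates would come
            # from subsampling or a CLT-style bound on the same result set.
            if query.strip().upper() == "SELECT COUNT(*) FROM ORDERS":
                return ("SELECT COUNT(*) / {r} AS approx_count "
                        "FROM orders_sample").format(r=SAMPLE_RATIO)
            return query  # pass anything we cannot rewrite through unchanged

        print(rewrite("SELECT COUNT(*) FROM orders"))

    Scaling by the inverse ratio keeps the estimate unbiased for a uniform sample; the hard part the paper addresses is doing this for joins, group-bys, and error bounds without touching the engine.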

    Estimating Discrete Markov Models From Various Incomplete Data Schemes

    The parameters of a discrete stationary Markov model are the transition probabilities between states. Traditionally, data consist of sequences of observed states for a given number of individuals over the whole observation period. In such a case, the estimation of transition probabilities is straightforwardly made by counting one-step moves from a given state to another. In many real-life problems, however, inference is much more difficult because state sequences are not fully observed: the state of each individual is known only for some values of the time variable. A review of the problem is given, focusing on Markov chain Monte Carlo (MCMC) algorithms to perform Bayesian inference and evaluate posterior distributions of the transition probabilities in this missing-data framework. Exploiting the dependence between the rows of the transition matrix, an adaptive MCMC mechanism accelerating the classical Metropolis-Hastings algorithm is then proposed and empirically studied.

    Comment: 26 pages. Preprint accepted on 20 February 2012 for publication in Computational Statistics and Data Analysis (please cite the journal's paper).
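
    For the fully observed case the abstract describes, the counting estimator is a few lines of code. The sketch below uses my own notation, not the paper's, and only gestures at the Bayesian missing-data extension in a comment; the paper's adaptive accelerated sampler is not reproduced here.

        import numpy as np

        def mle_transition_matrix(sequences, n_states):
            """Row-normalized one-step transition counts: the straightforward
            estimator for fully observed state sequences."""
            C = np.zeros((n_states, n_states))
            for seq in sequences:
                for a, b in zip(seq[:-1], seq[1:]):  # one-step moves a -> b
                    C[a, b] += 1
            # Assumes every state is left at least once, so no row sums to zero.
            return C / C.sum(axis=1, keepdims=True)

        # With a Dirichlet prior on each row, the posterior given *completed*
        # sequences is again Dirichlet, so a data-augmentation Gibbs sampler
        # alternates imputing the unobserved states with row draws from
        # np.random.dirichlet(alpha + C[i]).
        print(mle_transition_matrix([[0, 1, 1, 0, 2], [2, 2, 1, 0]], n_states=3))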

    Teacher Quality at the High-School Level: The Importance of Accounting for Tracks

    Unlike in elementary school, high-school teacher effects may be confounded with both selection into tracks and unobserved track-level treatments. I document sizable confounding track effects and show that traditional tests for the existence of teacher effects are likely biased. After accounting for these biases, high-school algebra and English teachers have much smaller test-score effects than found in previous studies. Moreover, unlike in elementary school, value-added estimates are weak predictors of teachers’ future performance. Results indicate that either (a) teachers are less influential in high school than in elementary school, or (b) test scores are a poor metric for measuring teacher quality at the high-school level.

    Teacher Quality and Student Inequality (Revised 2014)

    This paper examines the extent to which the allocation of teachers within and across public high schools is contributing to inequality in student test score performance. Using ten years of administrative data from North Carolina public high schools, I estimate a flexible education production function in which student achievement reflects student inputs, teacher quality, school quality, and a school-specific scaling factor that allows the impact of teaching quality to vary across schools. The existence of nearly 3,000 teacher transfers, combined with a testable exogenous mobility assumption, allows separate identification of each teacher’s quality from both school quality and school sensitivity to teacher quality. I find that teaching quality is surprisingly equitably distributed both within and across high schools. Schools predominantly serving underprivileged students employ teachers who are only slightly below average, and most students receive a mix of their school’s good and bad teachers. Overall, I find that the allocation of teacher and school inputs at the high-school level contributes only 4% to the achievement gap between the top and bottom deciles of an index of student background. Finally, I find that schools that disproportionately serve disadvantaged students tend to be more sensitive to teacher quality.
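
    One plausible formalization of the production function described above, in notation that is mine rather than the paper's, is the following; the key ingredient is the school-specific scaling factor on teacher quality:

        \[
          y_{ijs} \;=\; X_i'\beta \;+\; \mu_s \;+\; \sigma_s\,\theta_j \;+\; \varepsilon_{ijs},
        \]

    where y_{ijs} is the achievement of student i taught by teacher j in school s, X_i collects student inputs, \mu_s is school quality, \theta_j is teacher quality, and \sigma_s > 0 is the school-specific sensitivity that lets the impact of teaching quality vary across schools. Teacher transfers help identify \theta_j separately from \mu_s and \sigma_s because the same \theta_j appears in more than one school.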

    Towards Optimal Moment Estimation in Streaming and Distributed Models

    One of the oldest problems in the data stream model is to approximate the p-th moment ||X||_p^p = sum_{i=1}^n X_i^p of an underlying non-negative vector X in R^n, which is presented as a sequence of poly(n) updates to its coordinates. Of particular interest is the case p in (0,2]. Although a tight space bound of Theta(epsilon^-2 log n) bits is known for this problem when both positive and negative updates are allowed, surprisingly there is still a gap in the space complexity of this problem when all updates are positive. Specifically, the upper bound is O(epsilon^-2 log n) bits, while the lower bound is only Omega(epsilon^-2 + log n) bits. Recently, an upper bound of O~(epsilon^-2 + log n) bits was obtained under the assumption that the updates arrive in a random order. We show that for p in (0,1], the random order assumption is not needed. Namely, we give an upper bound for worst-case streams of O~(epsilon^-2 + log n) bits for estimating ||X||_p^p. Our techniques also give new upper bounds for estimating the empirical entropy in a stream. On the other hand, we show that for p in (1,2], in the natural coordinator and blackboard distributed communication topologies, there is an O~(epsilon^-2) bit max-communication upper bound based on a randomized rounding scheme. Our protocols also give rise to protocols for heavy hitters and approximate matrix product. We generalize our results to arbitrary communication topologies G, obtaining an O~(epsilon^-2 log d) max-communication upper bound, where d is the diameter of G. Interestingly, our upper bound rules out natural communication-complexity-based approaches for proving an Omega(epsilon^-2 log n) bit lower bound for p in (1,2] for streaming algorithms. In particular, any such lower bound must come from a topology with large diameter.
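
    For intuition about the insertion-only model itself, the classical AMS sampling estimator is unbiased for F_p = ||X||_p^p under unit-increment updates for any p > 0; the Python sketch below implements it in a single pass. This is a well-known baseline, not the paper's space-optimal algorithm, and the repetition count is an illustrative choice.

        import random

        def ams_fp_estimate(stream, p, reps=800):
            """Single-pass AMS-style estimator of F_p = sum_i f_i^p on an
            insertion-only stream of unit updates. Baseline sketch only."""
            samples = [None] * reps  # reservoir-sampled coordinate per repetition
            counts = [0] * reps      # occurrences of that coordinate since drawn
            m = 0                    # stream length seen so far
            for item in stream:
                m += 1
                for k in range(reps):
                    if random.randrange(m) == 0:  # keep new item with prob 1/m
                        samples[k], counts[k] = item, 0
                    if item == samples[k]:
                        counts[k] += 1
            if m == 0:
                return 0.0
            # m * (r^p - (r-1)^p) is unbiased: summing over stream positions
            # telescopes to f_i^p for each coordinate i with frequency f_i.
            return sum(m * (r**p - (r - 1)**p) for r in counts) / reps

        # Frequencies are 3 and 2, so F_1.5 = 3**1.5 + 2**1.5 ~= 8.02.
        print(ams_fp_estimate(['a', 'b', 'a', 'a', 'b'], p=1.5))

    The estimator stores one sampled coordinate and one counter per repetition, so its space is far from the O~(epsilon^-2 + log n) bound the paper proves; its role here is only to make the positive-update streaming model concrete.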