
    Cheaper and Better: Selecting Good Workers for Crowdsourcing

    Crowdsourcing provides a popular paradigm for data collection at scale. We study the problem of selecting subsets of workers from a given worker pool to maximize accuracy under a budget constraint. One natural question is whether we should hire as many workers as the budget allows, or restrict ourselves to a small number of top-quality workers. By theoretically analyzing the error rate of a typical crowdsourcing setting, we frame worker selection as a combinatorial optimization problem and propose an algorithm to solve it efficiently. Empirical results on both simulated and real-world datasets show that our algorithm selects a small number of high-quality workers and performs as well as, and sometimes better than, the much larger crowd that the budget allows.
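
    To make the budget trade-off concrete, here is a minimal Python sketch that compares majority-vote accuracy when hiring the k most accurate workers for every k the budget permits. The selection rule, cost model, and accuracy values are assumptions for illustration; the paper's actual method solves a combinatorial optimization, not this simple top-k comparison.

        # Illustrative sketch only: not the paper's algorithm.
        import numpy as np

        rng = np.random.default_rng(0)

        def majority_vote_accuracy(accuracies, n_items=20000):
            """Monte Carlo estimate of majority-vote accuracy for independent workers."""
            accuracies = np.asarray(accuracies)
            # Each worker answers each (binary) item correctly with its own probability.
            correct = rng.random((n_items, len(accuracies))) < accuracies
            # Majority vote is correct when more than half of the hired workers are correct.
            return np.mean(correct.sum(axis=1) > len(accuracies) / 2)

        worker_accuracy = np.array([0.95, 0.9, 0.85, 0.6, 0.55, 0.55, 0.52, 0.51])  # invented
        cost_per_worker = 1.0
        budget = 8.0

        best = None
        for k in range(1, len(worker_accuracy) + 1):
            if k * cost_per_worker > budget:
                break
            top_k = np.sort(worker_accuracy)[::-1][:k]
            acc = majority_vote_accuracy(top_k)
            print(f"hire top {k}: estimated accuracy {acc:.3f}")
            if best is None or acc > best[1]:
                best = (k, acc)

        print("best choice:", best)

    On values like these, a handful of strong workers typically matches or beats spending the full budget on the whole pool, which is the qualitative point of the abstract.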

    Comment on `Tainted evidence: cosmological model selection versus fitting', by Eric V. Linder and Ramon Miquel (astro-ph/0702542v2)

    In astro-ph/0702542v2, Linder and Miquel seek to criticize the use of Bayesian model selection for data analysis and for survey forecasting and design. Their discussion is based on three serious misunderstandings of the conceptual underpinnings and application of model-level Bayesian inference, which invalidate all their main conclusions. Their paper includes numerous further inaccuracies, including an erroneous calculation of the Bayesian Information Criterion. Here we seek to set the record straight. Comment: 6 pages, RevTeX.

    Standard survey methods for estimating colony losses and explanatory risk factors in Apis mellifera

    This chapter addresses survey methodology and questionnaire design for the collection of data pertaining to estimation of honey bee colony loss rates and identification of risk factors for colony loss. Sources of error in surveys are described. Advantages and disadvantages of different random and non-random sampling strategies and different modes of data collection are presented to enable the researcher to make an informed choice. We discuss survey and questionnaire methodology in some detail to raise awareness of issues to be considered at the survey design stage in order to minimise error and bias in the results. Aspects of survey design are illustrated using surveys in Scotland. As a further example, part of a standardized questionnaire developed by the COLOSS working group for Monitoring and Diagnosis is given. Approaches to data analysis are described, focussing on estimation of loss rates. Dutch monitoring data from 2012 are used as an example of a statistical analysis with the public domain R software. We demonstrate estimation of the overall proportion of losses and the corresponding confidence interval using a quasi-binomial model to account for extra-binomial variation. We also illustrate generalized linear model fitting incorporating a single risk factor, and the derivation of relevant confidence intervals.
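
    The chapter's worked example uses R; the following is a minimal Python sketch of the same kind of analysis, with the dispersion estimated from Pearson's chi-square to mimic a quasi-binomial fit. The counts, the "migratory" risk factor, and the column layout are invented for illustration.

        import numpy as np
        import statsmodels.api as sm
        from scipy.special import expit

        # Per-beekeeper counts: colonies lost and colonies going into winter (invented data).
        lost      = np.array([2, 0, 5, 1, 3, 8, 0, 4])
        wintered  = np.array([10, 6, 20, 8, 12, 25, 5, 15])
        migratory = np.array([0, 0, 1, 0, 1, 1, 0, 1])  # hypothetical risk factor

        # Overall loss rate: intercept-only binomial GLM with Pearson-estimated dispersion.
        endog = np.column_stack([lost, wintered - lost])   # (successes, failures)
        res0 = sm.GLM(endog, np.ones((len(lost), 1)),
                      family=sm.families.Binomial()).fit(scale="X2")
        lo, hi = res0.conf_int()[0]
        print(f"overall loss rate {expit(res0.params[0]):.3f} "
              f"(95% CI {expit(lo):.3f}-{expit(hi):.3f})")

        # The same model with a single risk factor added.
        X = sm.add_constant(migratory)
        res1 = sm.GLM(endog, X, family=sm.families.Binomial()).fit(scale="X2")
        print(res1.summary())

    Transforming the confidence limits through the inverse logit, as above, gives an interval for the loss proportion that accounts for extra-binomial variation between beekeepers.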

    Database Learning: Toward a Database that Becomes Smarter Every Time

    In today's databases, previous query answers rarely benefit answering future queries. For the first time, to the best of our knowledge, we change this paradigm in an approximate query processing (AQP) context. We make the following observation: the answer to each query reveals some degree of knowledge about the answer to another query because their answers stem from the same underlying distribution that has produced the entire dataset. Exploiting and refining this knowledge should allow us to answer queries more analytically, rather than by reading enormous amounts of raw data. Also, processing more queries should continuously enhance our knowledge of the underlying distribution, and hence lead to increasingly faster response times for future queries. We call this novel idea of learning from past query answers Database Learning. We exploit the principle of maximum entropy to produce answers, which are in expectation guaranteed to be more accurate than existing sample-based approximations. Empowered by this idea, we build a query engine on top of Spark SQL, called Verdict. We conduct extensive experiments on real-world query traces from a large customer of a major database vendor. Our results demonstrate that Verdict supports 73.7% of these queries, speeding them up by up to 23.0x for the same accuracy level compared to existing AQP systems. Comment: This manuscript is an extended report of the work published in the ACM SIGMOD conference 201
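
    The maximum-entropy inference used by Verdict is beyond a short snippet, but the following drastically simplified Python sketch shows why a cached past answer can tighten a new sample-based estimate: it combines the two by inverse-variance weighting, which is a stand-in for, not a description of, the paper's method. The synthetic column and sample sizes are assumptions.

        import numpy as np

        rng = np.random.default_rng(1)
        population = rng.exponential(scale=100.0, size=1_000_000)  # hypothetical column

        def sample_estimate(data, n):
            """Plain AQP-style estimate: sample mean and its standard error."""
            s = rng.choice(data, size=n, replace=False)
            return s.mean(), s.std(ddof=1) / np.sqrt(n)

        # "Past" query: AVG(col) answered earlier from a large sample, answer cached.
        past_mean, past_se = sample_estimate(population, 200_000)

        # New query over the same distribution, answered from a much smaller sample.
        new_mean, new_se = sample_estimate(population, 2_000)

        # Inverse-variance combination of the cached answer and the fresh estimate.
        w_past, w_new = 1 / past_se**2, 1 / new_se**2
        combined = (w_past * past_mean + w_new * new_mean) / (w_past + w_new)
        combined_se = np.sqrt(1 / (w_past + w_new))

        print(f"sample only : {new_mean:.2f} +/- {new_se:.2f}")
        print(f"with history: {combined:.2f} +/- {combined_se:.2f}")
        print(f"true mean   : {population.mean():.2f}")

    The combined estimate has a smaller standard error than the fresh sample alone, which is the intuition behind answers becoming more accurate as more queries are processed.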

    VerdictDB: Universalizing Approximate Query Processing

    Despite 25 years of research in academia, approximate query processing (AQP) has had little industrial adoption. One of the major causes of this slow adoption is the reluctance of traditional vendors to make radical changes to their legacy codebases, and the preoccupation of newer vendors (e.g., SQL-on-Hadoop products) with implementing standard features. Additionally, the few AQP engines that are available are each tied to a specific platform and require users to completely abandon their existing databases, an unrealistic expectation given the infancy of the AQP technology. Therefore, we argue that a universal solution is needed: a database-agnostic approximation engine that will widen the reach of this emerging technology across various platforms. Our proposal, called VerdictDB, uses a middleware architecture that requires no changes to the backend database, and thus can work with all off-the-shelf engines. Operating at the driver level, VerdictDB intercepts analytical queries issued to the database and rewrites them into another query that, if executed by any standard relational engine, will yield sufficient information for computing an approximate answer. VerdictDB uses the returned result set to compute an approximate answer and error estimates, which are then passed on to the user or application. However, lack of access to the query execution layer introduces significant challenges in terms of generality, correctness, and efficiency. This paper shows how VerdictDB overcomes these challenges and delivers up to 171x speedup (18.45x on average) for a variety of existing engines, such as Impala, Spark SQL, and Amazon Redshift, while incurring less than 2.6% relative error. VerdictDB is open-sourced under the Apache License. Comment: Extended technical report of the paper that appeared in Proceedings of the 2018 International Conference on Management of Data, pp. 1461-1476. ACM, 2018.
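
    A minimal Python sketch of driver-level query rewriting in the spirit described above. The single rewrite rule, the "<table>_sample" naming convention, and the fixed sampling ratio are assumptions for illustration; VerdictDB's actual rewriting handles far more query shapes and a richer error-estimation scheme.

        import re

        SAMPLING_RATIO = 0.01  # fraction of rows assumed kept in each "<table>_sample" table

        def rewrite_avg(query: str) -> str:
            """Rewrite `SELECT AVG(col) FROM t ...` to run on t_sample and return the
            pieces needed for an approximate answer plus an error estimate."""
            m = re.match(r"\s*SELECT\s+AVG\((\w+)\)\s+FROM\s+(\w+)(.*)", query, re.I | re.S)
            if m is None:
                return query                       # pass through anything we cannot handle
            col, table, rest = m.groups()
            return (f"SELECT AVG({col}) AS approx_avg, "
                    f"STDDEV({col}) / SQRT(COUNT(*)) AS std_error, "
                    f"COUNT(*) / {SAMPLING_RATIO} AS approx_count "
                    f"FROM {table}_sample{rest}")

        print(rewrite_avg("SELECT AVG(price) FROM orders WHERE region = 'EU'"))
        # Prints a query against orders_sample that returns approx_avg, std_error,
        # and a scaled-up approx_count, from which the driver builds the final answer.

    Because the rewritten statement is plain SQL, any off-the-shelf engine can execute it, which is the key to the middleware approach.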

    Parameter estimation in softmax decision-making models with linear objective functions

    With an eye towards human-centered automation, we contribute to the development of a systematic means to infer features of human decision-making from behavioral data. Motivated by the common use of softmax selection in models of human decision-making, we study the maximum likelihood parameter estimation problem for softmax decision-making models with linear objective functions. We present conditions under which the likelihood function is convex. These allow us to provide sufficient conditions for convergence of the resulting maximum likelihood estimator and to construct its asymptotic distribution. In the case of models with nonlinear objective functions, we show how the estimator can be applied by linearizing about a nominal parameter value. We apply the estimator to fit the stochastic UCL (Upper Credible Limit) model of human decision-making to human subject data. We show statistically significant differences in behavior across related, but distinct, tasks. Comment: In press.
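
    A minimal Python sketch of maximum likelihood estimation for a softmax choice model with a linear objective, the setting studied above. The features, the "true" parameter, and the use of a generic optimizer are illustrative assumptions; the negative log-likelihood below is convex in theta for a linear objective, so a local solver suffices.

        import numpy as np
        from scipy.optimize import minimize
        from scipy.special import logsumexp

        rng = np.random.default_rng(2)
        n_trials, n_options, n_features = 500, 4, 3

        # One feature matrix per trial (rows = options), and a "true" parameter.
        Phi = rng.normal(size=(n_trials, n_options, n_features))
        theta_true = np.array([1.0, -0.5, 2.0])

        # Simulate softmax choices: P(choose i) proportional to exp(theta' * phi_i).
        logits = Phi @ theta_true
        probs = np.exp(logits - logsumexp(logits, axis=1, keepdims=True))
        choices = np.array([rng.choice(n_options, p=p) for p in probs])

        def neg_log_likelihood(theta):
            logits = Phi @ theta
            return -(logits[np.arange(n_trials), choices]
                     - logsumexp(logits, axis=1)).sum()

        theta_hat = minimize(neg_log_likelihood, x0=np.zeros(n_features)).x
        print("true :", theta_true)
        print("MLE  :", np.round(theta_hat, 2))

    With enough trials the recovered parameter is close to the one used to generate the choices, which is the behavior the paper's convergence conditions formalize.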

    Common Biases In Business Research


    A Learning Algorithm based on High School Teaching Wisdom

    A learning algorithm based on primary school teaching and learning is presented. The methodology is to continuously evaluate a student and to give them training on the examples they repeatedly fail, until they can correctly answer all types of questions. This incremental learning procedure produces better learning curves by requiring the student to dedicate their learning time optimally to the failed examples. When used in machine learning, the algorithm is found to train a machine on the data with maximum variance in the feature space, so that the generalization ability of the network improves. The algorithm has interesting applications in data mining, model evaluation and rare-object discovery.
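
    A minimal Python sketch of the "retrain on the examples the student keeps failing" idea, using a simple perceptron on synthetic, linearly separable data. The model, data, learning rate, and stopping rule are illustrative assumptions, not the paper's exact procedure.

        import numpy as np

        rng = np.random.default_rng(3)

        # Linearly separable toy data: label is the sign of a fixed linear function.
        X = rng.normal(size=(400, 2))
        y = np.sign(X @ np.array([2.0, -1.0]) + 0.5)

        w, b = np.zeros(2), 0.0
        for round_no in range(1, 101):
            predictions = np.sign(X @ w + b)
            failed = np.flatnonzero(predictions != y)   # examples the "student" got wrong
            if failed.size == 0:
                print(f"all examples answered correctly after {round_no - 1} rounds")
                break
            # Dedicate this round's "study time" only to the failed examples.
            for i in failed:
                w += 0.1 * y[i] * X[i]
                b += 0.1 * y[i]
        else:
            print("did not converge within 100 rounds")

    Each round the update is driven entirely by the currently misclassified examples, which tend to be the high-variance, hard cases, mirroring the intuition in the abstract.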