    PF-OLA: A High-Performance Framework for Parallel On-Line Aggregation

    Online aggregation provides estimates of the final result of a computation during the actual processing. The user can stop the computation as soon as the estimate is accurate enough, typically early in the execution, which enables interactive exploration of the largest datasets. In this paper we introduce the first framework for parallel online aggregation in which estimation incurs virtually no overhead on top of the actual execution. We define a generic interface for expressing any estimation model that completely abstracts the execution details, and we design a novel estimator specifically targeted at parallel online aggregation. When executed by the framework over a massive 8 TB TPC-H instance, the estimator provides accurate confidence bounds early in the execution, even when the cardinality of the final result is seven orders of magnitude smaller than the dataset size, and without incurring overhead. Comment: 36 pages
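
    The abstract does not spell out PF-OLA's estimator, but the core mechanics of online aggregation can be illustrated with a minimal sketch: scan tuples in random order, maintain a running estimate, and report a CLT-style confidence interval that tightens as execution proceeds. Everything below (the function name, the Welford accumulators, the reporting interval) is an illustrative assumption, not PF-OLA's actual interface or estimator.

        import math
        import random

        def online_aggregate(data, confidence_z=1.96, report_every=1000):
            """Running estimate of sum(data) from a random scan order, with a
            CLT-based confidence half-width (illustrative, not PF-OLA's API)."""
            n = len(data)
            order = list(range(n))
            random.shuffle(order)          # process tuples in random order
            count, mean, m2 = 0, 0.0, 0.0  # Welford accumulators
            for idx in order:
                x = data[idx]
                count += 1
                delta = x - mean
                mean += delta / count
                m2 += delta * (x - mean)
                if count % report_every == 0 or count == n:
                    est = mean * n                                  # scaled-up total
                    var = m2 / (count - 1) if count > 1 else 0.0    # sample variance
                    # Standard error of the total under sampling without
                    # replacement (finite-population correction included).
                    se = n * math.sqrt(var / count) * math.sqrt((n - count) / max(n - 1, 1))
                    yield count, est, confidence_z * se

        # Example: watch the half-width shrink as the scan progresses.
        data = [random.gauss(10, 3) for _ in range(100_000)]
        for seen, est, half_width in online_aggregate(data, report_every=20_000):
            print(f"{seen:>7} tuples: {est:,.0f} ± {half_width:,.0f}")

    The finite-population correction makes the interval collapse to zero when the scan completes, mirroring the property that the online estimate converges to the exact answer.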

    Duet: efficient and scalable hybriD neUral rElation undersTanding

    Learned cardinality estimation methods have achieved high precision compared to traditional methods. Among learned methods, query-driven approaches have long faced the problem of data and workload drift. Although both data-driven and hybrid methods have been proposed to avoid this problem, even the state of the art among them suffers from high training and estimation costs, limited scalability, instability, and a long-tailed distribution problem on high-cardinality and high-dimensional tables, which seriously limits the practical application of learned cardinality estimators. In this paper, we prove that most of these problems are directly caused by the widely used progressive sampling. We solve this problem by introducing predicate information into the autoregressive model and propose Duet, a stable, efficient, and scalable hybrid method that estimates cardinality directly, without sampling or any non-differentiable process. Duet not only reduces the inference complexity from O(n) to O(1) compared to Naru and UAE, but also achieves higher accuracy on high-cardinality and high-dimensional tables. Experimental results show that Duet achieves all of the design goals above, is far more practical, and even has a lower inference cost on CPU than most learned methods have on GPU.
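
    The complexity claim can be made concrete with a toy sketch. In Naru/UAE-style progressive sampling, one query costs a sequential model evaluation per sampled path through the autoregressive factorization, whereas an estimator that takes the predicate itself as input answers in a single pass. Below, a small known joint distribution stands in for the learned autoregressive model; all names are illustrative, and this shows only the interface difference, not Duet's architecture.

        import numpy as np

        rng = np.random.default_rng(0)
        # Toy 3-column table: a known joint distribution standing in for a
        # learned autoregressive model P(x1) P(x2|x1) P(x3|x1,x2).
        joint = rng.dirichlet(np.ones(8 * 8 * 8)).reshape(8, 8, 8)

        pred = (slice(2, 6), slice(0, 4), slice(5, 8))  # range predicate per column

        def progressive_sampling(joint, pred, n_samples=64):
            """Progressive-sampling estimate: O(n_samples) sequential 'model
            calls', sampling column i conditioned on columns < i within the
            predicate, and averaging the product of predicate masses."""
            est = 0.0
            for _ in range(n_samples):
                prob, fixed = 1.0, ()
                for axis, sl in enumerate(pred):
                    # Conditional distribution of this column given the prefix.
                    cond = joint[fixed].sum(axis=tuple(range(1, joint.ndim - axis)))
                    cond = cond / cond.sum()
                    mass = cond[sl].sum()   # predicate mass in this column
                    prob *= mass
                    if mass == 0:
                        break
                    # Sample the next value inside the predicate range.
                    p = cond[sl] / mass
                    fixed += (sl.start + rng.choice(len(p), p=p),)
                est += prob
            return est / n_samples

        def direct_estimate(joint, pred):
            """Duet-style idea: the predicate is an input, so selectivity comes
            out of a single forward pass (here: one tensor reduction)."""
            return joint[pred].sum()

        print("progressive sampling:", progressive_sampling(joint, pred))
        print("direct (one pass):   ", direct_estimate(joint, pred))

    On the toy distribution the two estimates agree closely, but the sampling path costs n_samples sequential evaluations per query while the direct path costs one.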

    Estimation and analysis of the relationship between the endogenous and exogenous variables using a fuzzy semi-parametric sample selection model

    An important advance of the last decade in the development of the selectivity (sample selection) model approach is the use of semi-parametric methods to overcome the inconsistent results that arise when distributional assumptions on the error terms are violated. However, uncertainties and ambiguities remain in these models, particularly in the relationship between the endogenous and exogenous variables. This work introduces a new framework for that relationship in semi-parametric sample selection models using the concept of fuzzy modelling. Through this approach, a flexible fuzzy concept is hybridized with the semi-parametric sample selection model, yielding the Fuzzy Semi-Parametric Sample Selection Model (FSPSSM). Elements of vagueness and uncertainty are represented directly in the model construction, increasing the available information and producing a more accurate model. This leads to a convergence theorem, stated in terms of triangular fuzzy numbers, for use in the model, and proofs of the theorems are presented. An algorithm based on fuzzy modelling is developed, and the effectiveness of the estimators for this model is investigated. Monte Carlo simulation reveals that consistency depends on the bandwidth parameter: as the bandwidth parameter c is increased through 0.1, 0.5, 0.75, and 1, and the sample size N grows from 100 to 200 to 500, the mean of the estimates approaches the true parameter. The bandwidth study also shows that the estimator is efficient, i.e., the S.D., MSE, and RMSE values become smaller as N increases. In particular, the estimated parameter becomes consistent and efficient as the bandwidth parameter approaches infinity, c → ∞, and the number of observations tends to infinity, n → ∞. Keywords: Selectivity Model, Semi-Parametric, Fuzzy Concept, Bandwidth, Monte Carlo
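
    The convergence theorem itself is not given in the abstract, but the triangular fuzzy numbers it is stated in have a standard textbook definition, recorded below for reference; the symbols a_1, a_2, a_3 are ours, not the paper's notation.

        % Standard membership function of a triangular fuzzy number
        % A = (a_1, a_2, a_3), a_1 <= a_2 <= a_3, with support [a_1, a_3]
        % and full membership at the peak a_2:
        \mu_{\tilde{A}}(x) =
        \begin{cases}
          \dfrac{x - a_1}{a_2 - a_1}, & a_1 \le x \le a_2,\\[4pt]
          \dfrac{a_3 - x}{a_3 - a_2}, & a_2 < x \le a_3,\\[4pt]
          0, & \text{otherwise.}
        \end{cases}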

    BARRIERS FOR DEVELOPMENT IN ZAMBIAN SMALL- AND MEDIUM-SIZE FARMS: EVIDENCE FROM MICRO-DATA

    The objective of this paper is to identify factors that limit the ability of Zambian farmers to increase Maize productivity and/or diversify their crop mix. Both may enable wealth accumulation, investment, and further expansion. Specifically, we link variations in agricultural decisions, practices, and outcomes to variations in the tightness of the different constraints. We model crop production decisions as having a recursive structure. Initially, farmers decide on land allocation among the different crops based on their information set at planting time. Then, as new information (weather, market conditions) is revealed, farmers can change output by influencing the yield. This recursive structure enables us to separate the effects of the constraints on the different stages of production. We therefore conduct estimation in two stages: we first estimate the fraction of land allocated to Maize as a dependent variable that is censored from below and from above, so that its predicted value is necessarily between zero and one. The yield of Maize is estimated in the second stage as a linear function of the calculated land allotment (to avoid simultaneity bias) and the other state variables, as sketched in the equations below. Environmental and demographic variables also serve as explanatory variables in each stage. The first-stage results indicate that crop diversification can be promoted by rural road construction, developing markets for agricultural products, increasing the availability of seeds, draught animals, and farm machines, increasing women's farm work participation, and increasing the size of landholdings. Specialization in Maize can be promoted by increasing the availability of credit, fertilizers, hired permanent workers, and irrigation knowledge, and by improving the timeliness of input delivery. The second-stage results show that the yield of Maize is inversely related to the area of Maize cultivated and to the operator's age, and is lower in female-headed farm households. Maize productivity can be improved by increasing the availability of seeds, fertilizers, labor, draught animals, machines, and credit. Keywords: Crop Diversification, Maize Productivity, Recursive Decisions, Two-stage Estimation, Censored Dependent Variables, Community/Rural/Urban Development, International Development, Resource/Energy Economics and Policy. JEL: O1, Q1.
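
    The two-stage procedure can be summarized in equation form. The sketch below is our reconstruction under stated assumptions (a two-limit Tobit-type first stage and a linear second stage in the predicted land share); the notation is not the paper's.

        % Stage 1: fraction of land allocated to Maize, censored at 0 and 1
        % (two-limit Tobit-type specification), from planting-time information x_i:
        s_i^{*} = x_i'\beta + u_i, \qquad
        s_i = \min\{\,\max\{\,s_i^{*},\,0\,\},\,1\,\}

        % Stage 2: Maize yield as a linear function of the predicted share
        % \hat{s}_i (used in place of s_i to avoid simultaneity bias) and the
        % other state variables z_i:
        y_i = \gamma_1 \hat{s}_i + z_i'\gamma_2 + \varepsilon_i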

    One stone, two birds: A lightweight multidimensional learned index with cardinality support

    Innovative learning-based structures have recently been proposed to tackle indexing and cardinality estimation (CE) tasks, specifically learned indexes and data-driven cardinality estimators. These structures capture data distributions extremely well, making them promising for integration into AI-driven database kernels. However, accurate estimation for corner-case queries requires a large number of network parameters, which means more compute on expensive GPUs and more storage overhead. Additionally, implementing CE and the learned index separately is redundant: the distribution of a single table is stored twice. These issues challenge the design of AI-driven database kernels, since real database scenarios demand a compact kernel that processes queries within a limited storage and time budget. Directly integrating the two AI approaches would yield a heavy and complex kernel, due to the large number of network parameters and the repeated storage of distribution parameters. Our proposed CardIndex structure kills these two birds with one stone. It is a fast multidimensional learned index that also serves as a lightweight cardinality estimator, with parameters at the KB scale. Thanks to its special structure and small parameter size, it obtains both CDF and PDF information for tuples with a latency as low as 1 to 10 microseconds. For low-selectivity estimation tasks, rather than enlarging the model to obtain fine-grained point densities, we fully exploit the structure's characteristics and propose a hybrid estimation algorithm that provides fast and exact results.
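
    The "one stone, two birds" claim rests on a simple observation: any monotone model of a key column's CDF simultaneously yields an index position (CDF times table size) and a range cardinality (difference of two CDF values times table size). The toy class below illustrates that dual use with a piecewise-linear stand-in for the learned CDF; the names and structure are illustrative assumptions, not CardIndex itself, and a real learned index would refine the predicted position with a bounded local search.

        import bisect
        import random

        class TinyCDFModel:
            """Toy stand-in for a learned CDF: piecewise-linear interpolation
            over a handful of sampled quantiles (KB-scale 'parameters')."""
            def __init__(self, keys, n_knots=32):
                keys = sorted(keys)
                step = max(1, len(keys) // n_knots)
                self.knots = keys[::step] + [keys[-1]]
                self.fracs = [i / (len(keys) - 1)
                              for i in range(0, len(keys), step)] + [1.0]

            def cdf(self, key):
                i = bisect.bisect_right(self.knots, key)
                if i == 0: return 0.0
                if i == len(self.knots): return 1.0
                lo, hi = self.knots[i - 1], self.knots[i]
                t = (key - lo) / (hi - lo) if hi > lo else 0.0
                return self.fracs[i - 1] + t * (self.fracs[i] - self.fracs[i - 1])

        keys = sorted(random.sample(range(1_000_000), 50_000))
        model = TinyCDFModel(keys)
        n = len(keys)

        # Role 1: index -- predicted position of a key in the sorted table.
        key = keys[31_337]
        pos = round(model.cdf(key) * (n - 1))
        print("predicted pos:", pos, "true pos:", 31_337)

        # Role 2: cardinality -- estimated count of keys in [a, b], from the
        # very same CDF model, with no second structure to store.
        a, b = 200_000, 300_000
        est = (model.cdf(b) - model.cdf(a)) * n
        true = bisect.bisect_right(keys, b) - bisect.bisect_left(keys, a)
        print("estimated:", round(est), "true:", true)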

    A Maximum Likelihood Estimator based on First Differences for a Panel Data Tobit Model with Individual Specific Effects

    This paper proposes an alternative estimation procedure for a panel data Tobit model with individual specific effects, based on taking first differences of the equation of interest. This helps to alleviate the sensitivity of the estimates to a specific parameterization of the individual specific effects, and some Monte Carlo evidence is provided in support of this. To allow for arbitrary serial correlation, estimation takes place in two steps: Maximum Likelihood is applied to each pair of consecutive periods, and then a Minimum Distance estimator is employed.
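
    The two-step structure can be sketched in equations. This is our hedged reconstruction from the abstract alone: the exact likelihood for censored first differences is more involved than the latent-index equation shown, and the Minimum Distance weighting is an assumption.

        % Panel Tobit with individual-specific effects \alpha_i (notation ours):
        y_{it} = \max\{\,0,\; x_{it}'\beta + \alpha_i + \varepsilon_{it}\,\}

        % Step 1: for each pair of consecutive periods, ML is applied to the
        % first-differenced equation, which removes \alpha_i from the latent index:
        \Delta y_{it}^{*} = \Delta x_{it}'\beta + \Delta\varepsilon_{it},
        \qquad \Delta z_{it} \equiv z_{it} - z_{i,t-1}

        % Step 2: the T-1 pairwise estimates \hat\beta^{(t)} are combined by
        % Minimum Distance, here weighted by estimated covariances \hat{V}_t:
        \hat\beta_{\mathrm{MD}} = \arg\min_{\beta} \sum_{t}
          \bigl(\hat\beta^{(t)} - \beta\bigr)'\, \hat{V}_t^{-1}\,
          \bigl(\hat\beta^{(t)} - \beta\bigr)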