PF-OLA: A High-Performance Framework for Parallel On-Line Aggregation
Online aggregation provides estimates to the final result of a computation
during the actual processing. The user can stop the computation as soon as the
estimate is accurate enough, typically early in the execution. This enables
interactive data exploration of even the largest datasets. In this paper we
introduce the first framework for parallel online aggregation in which the
estimation virtually does not incur any overhead on top of the actual
execution. We define a generic interface for expressing any estimation model
that completely abstracts the execution details. We design a novel estimator
specifically targeted at parallel online aggregation. When executed by the
framework over a massive TPC-H instance, the estimator provides
accurate confidence bounds early in the execution even when the cardinality of
the final result is seven orders of magnitude smaller than the dataset size and
without incurring overhead.
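The estimate-with-confidence-bounds loop described above can be sketched generically. This is a toy single-machine illustration of online aggregation over a shuffled scan with a CLT-based bound, not the PF-OLA framework or its parallel estimator; all names are ours:

```python
import math
import random

def online_sum_estimate(data, z=1.96):
    """Stream over a randomly permuted table, yielding a running estimate
    of the total SUM together with a CLT-based 95% confidence half-width."""
    n = len(data)
    random.shuffle(data)              # online aggregation needs a random scan order
    total, total_sq = 0.0, 0.0
    for k, x in enumerate(data, start=1):
        total += x
        total_sq += x * x
        mean = total / k
        estimate = mean * n           # scale the sample mean up to the full table
        if k > 1:
            var = (total_sq / k - mean * mean) * k / (k - 1)
            half_width = z * n * math.sqrt(var / k)
        else:
            half_width = float("inf")
        yield k, estimate, half_width

random.seed(0)
table = [float(i % 100) for i in range(10_000)]   # true SUM = 495,000
for k, est, hw in online_sum_estimate(table):
    if hw < 0.05 * est:               # user stops once the bound is tight enough
        break
```

The user-controlled stopping rule is the point: the loop typically terminates after scanning a small fraction of the table, which is what makes interactive exploration possible.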
Duet: efficient and scalable hybriD neUral rElation undersTanding
Learned cardinality estimation methods have achieved high precision compared
to traditional methods. Among learned methods, query-driven approaches have
long faced the data and workload drift problem. Although both query-driven
and hybrid methods have been proposed to avoid this problem, even the
state-of-the-art among them suffers from high training and estimation costs,
limited scalability, instability, and a long-tailed distribution problem on
high-cardinality, high-dimensional tables, which seriously limits the
practical application of learned cardinality estimators. In this paper, we
prove that most of these problems are directly caused by the widely used
progressive sampling. We solve this problem by introducing predicate
information into the autoregressive model and propose Duet, a stable,
efficient, and scalable hybrid method that estimates cardinality directly
without sampling or any non-differentiable process. Duet not only reduces
the inference complexity from O(n) to O(1) compared to Naru and UAE but also
achieves higher accuracy on high-cardinality, high-dimensional tables.
Experimental results show that Duet achieves all the design goals above, is
much more practical, and even has a lower inference cost on CPU than most
learned methods have on GPU.
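The autoregressive factorization that data-driven estimators such as Naru build on can be illustrated with a toy counting model. This sketches only the factorization P(a, b) = P(a) · P(b | a) on a two-column table with exact counts; Duet's neural model, predicate encoding, and range-query handling are not shown, and all names are ours:

```python
from collections import Counter, defaultdict

# Toy table with two columns; a real estimator learns these conditionals
# with an autoregressive neural network instead of exact counts.
rows = [("red", 1), ("red", 2), ("blue", 1), ("red", 1)]
n = len(rows)

p_a = Counter(a for a, _ in rows)        # marginal counts of column A
p_b_given_a = defaultdict(Counter)       # conditional counts of B given A
for a, b in rows:
    p_b_given_a[a][b] += 1

def estimate_cardinality(a, b):
    """Selectivity = P(A=a) * P(B=b | A=a); cardinality = selectivity * |T|."""
    if p_a[a] == 0:
        return 0.0
    sel = (p_a[a] / n) * (p_b_given_a[a][b] / p_a[a])
    return sel * n
```

With exact counts the estimate is exact, e.g. `estimate_cardinality("red", 1)` returns 2.0; a learned model trades this exactness for compactness and generalization to unseen value combinations.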
Estimation and analysis of the relationship between the endogenous and exogenous variables using a fuzzy semi-parametric sample selection model
An important advance of the last decade in the selectivity-model approach, aimed at overcoming the
inconsistent results that arise when distributional assumptions are imposed on the error terms, is the use of
semi-parametric methods. However, uncertainties and ambiguities exist in these models, particularly in the
relationship between the endogenous and exogenous variables. A new framework for the relationship between
the endogenous and exogenous variables of the semi-parametric sample selection model, using the concept of
fuzzy modelling, is introduced. Through this approach, a flexible fuzzy concept is hybridized with the
semi-parametric sample selection model, yielding the Fuzzy Semi-Parametric Sample Selection Model
(FSPSSM). The elements of vagueness and uncertainty in the models are represented in the model
construction, as a way of increasing the available information to produce a more accurate model. This led to
the development of a convergence theorem, presented in the form of triangular fuzzy numbers, to be used in
the model; proofs of the theorems are presented. An algorithm using the concept of fuzzy modelling is
developed, and the effectiveness of the estimators for this model is investigated. Monte Carlo simulation
revealed that consistency depends on the bandwidth parameter: as the bandwidth parameter c is increased
through 0.1, 0.5, 0.75, and 1, and the sample size N grows from 100 to 200 to 500, the mean of the estimates
approaches the true parameter. The bandwidth study also reveals that the estimated parameter is efficient,
i.e., the S.D., MSE, and RMSE values become smaller as N increases. In particular, the estimated parameter
becomes consistent and efficient as the bandwidth parameter approaches infinity, c → ∞, and the number of
observations tends to infinity, n → ∞.
Keywords: Selectivity Model, Semi-Parametric, Fuzzy Concept, Bandwidth, Monte Carlo
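The consistency and efficiency pattern reported above (S.D. and MSE shrinking as N grows) can be reproduced by Monte Carlo for any well-behaved estimator. This sketch uses the sample mean of a Gaussian as a generic stand-in, not the FSPSSM estimator, and the sample sizes mirror those in the abstract:

```python
import random
import statistics

def monte_carlo_sd_mse(n, reps=500, true_theta=2.0, seed=0):
    """Simulate `reps` samples of size n and report the S.D. and MSE of
    the resulting estimates of true_theta (here: the sample mean)."""
    rng = random.Random(seed)
    estimates = [
        statistics.fmean(rng.gauss(true_theta, 1.0) for _ in range(n))
        for _ in range(reps)
    ]
    mse = statistics.fmean((e - true_theta) ** 2 for e in estimates)
    return statistics.stdev(estimates), mse

# S.D. shrinks roughly like 1/sqrt(N), MSE like 1/N, as N = 100, 200, 500.
results = {n: monte_carlo_sd_mse(n) for n in (100, 200, 500)}
```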
BARRIERS FOR DEVELOPMENT IN ZAMBIAN SMALL- AND MEDIUM-SIZE FARMS: EVIDENCE FROM MICRO-DATA
The objective of this paper is to identify factors which limit the ability of Zambian farmers to increase Maize productivity and/or diversify their crop mix. Both may enable wealth accumulation, investments, and further expansion. Specifically, we link variations in agricultural decisions, practices, and outcomes to variations in the tightness of the different constraints. We model crop production decisions as having a recursive structure. Initially, farmers decide on land allocation among the different crops, based on their information set at planting time. Then, as new information (weather, market conditions) is revealed, farmers can change output by influencing the yield. This recursive structure enables us to separate the effects of the constraints on the different stages of production. We therefore conduct the estimation in two stages: we first estimate the fraction of land allocated to Maize as a dependent variable that is censored from below and from above, so that its predicted value necessarily lies between zero and one. The yield of Maize is estimated in the second stage as a linear function of the calculated land allotment (to avoid simultaneity bias) and the other state variables. Environmental and demographic variables also serve as explanatory variables in each stage. The first-stage results indicate that crop diversification can be promoted by rural road construction, developing markets for agricultural products, increasing the availability of seeds, draught animals, and farm machines, increasing women's farm work participation, and increasing the size of landholdings. Specialization in Maize can be promoted by increasing the availability of credit, fertilizers, hired permanent workers, and irrigation knowledge, and improving the timeliness of input delivery. The second-stage results show that the yield of Maize is inversely related to the area of Maize cultivated and to the operator's age, and is lower in female-headed farm households.
Maize productivity can be improved by increasing the availability of seeds, fertilizers, labor, draught animals, machines, and credit.
Keywords: Crop Diversification, Maize Productivity, Recursive Decisions, Two-stage Estimation, Censored Dependent Variables, Community/Rural/Urban Development, International Development, Resource/Energy Economics and Policy, O1, Q1
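The two-stage procedure can be sketched on simulated data. This is a stylized stand-in: stage one below uses OLS with predictions clipped to [0, 1] rather than the paper's two-limit censored regression, the data-generating process is invented, and all variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
z = rng.normal(size=n)                    # exogenous planting-time variable
u = rng.normal(size=n)
share_latent = 0.5 + 0.3 * z + 0.2 * u    # latent land share allocated to maize
share = np.clip(share_latent, 0.0, 1.0)   # observed share is censored to [0, 1]
yield_ = 2.0 - 1.0 * share_latent + rng.normal(scale=0.1, size=n)

# Stage 1: predict the land share from the planting-time information set.
X1 = np.column_stack([np.ones(n), z])
b1, *_ = np.linalg.lstsq(X1, share, rcond=None)
share_hat = np.clip(X1 @ b1, 0.0, 1.0)    # keep predictions inside [0, 1]

# Stage 2: regress yield on the *predicted* share, not the observed one,
# to avoid simultaneity bias between allocation and yield.
X2 = np.column_stack([np.ones(n), share_hat])
b2, *_ = np.linalg.lstsq(X2, yield_, rcond=None)
```

The recovered stage-two slope `b2[1]` is negative, matching the paper's finding that yield is inversely related to the area cultivated.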
One stone, two birds: A lightweight multidimensional learned index with cardinality support
Innovative learning-based structures have recently been proposed to tackle
index and cardinality estimation tasks, namely learned indexes and
data-driven cardinality estimators. These structures exhibit excellent
performance in capturing data distributions, making them promising for
integration into AI-driven database kernels. However, accurate estimation
for corner-case queries requires a large number of network parameters,
resulting in higher computing resources on expensive GPUs and more storage
overhead. Additionally, implementing cardinality estimation (CE) and the
learned index separately is redundant: the distribution of a single table is
stored twice. These issues present challenges for designing AI-driven
database kernels, since in real database scenarios a compact kernel is
necessary to process queries within a limited storage and time budget.
Directly integrating the two AI approaches would yield a heavy and complex
kernel due to the large number of network parameters and the repeated
storage of data distribution parameters. Our proposed CardIndex structure
effectively kills two birds with one stone. It is a fast multidimensional
learned index that also serves as a lightweight cardinality estimator, with
parameters scaled at the KB level. Due to its special structure and small
parameter size, it can obtain both CDF and PDF information for tuples with
an incredibly low latency of 1 to 10 microseconds. For low-selectivity
estimation tasks, we do not increase the model's parameters to obtain
fine-grained point density. Instead, we fully exploit our structure's
characteristics and propose a hybrid estimation algorithm that provides fast
and exact results.
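The dual use of one CDF model, as both an index and a range cardinality estimator, can be illustrated in one dimension with a single linear model. This is a toy stand-in for CardIndex, which is multidimensional and far more sophisticated; the evenly spaced keys and all names are ours:

```python
import numpy as np

keys = np.arange(0, 10_000, 5)            # 2,000 sorted keys (toy, evenly spaced)
ranks = np.arange(len(keys), dtype=float)

# Fit rank ≈ n * CDF(key) with one linear model over the sorted keys.
A = np.column_stack([keys.astype(float), np.ones(len(keys))])
(slope, intercept), *_ = np.linalg.lstsq(A, ranks, rcond=None)

def predict_pos(k):
    """Index role: predicted array position of key k (a real system corrects
    the model's error with a short local search around the prediction)."""
    pos = float(np.clip(slope * k + intercept, 0, len(keys) - 1))
    return int(round(pos))

def estimate_range_cardinality(lo, hi):
    """CE role: the same CDF model estimates |{k : lo <= k <= hi}|."""
    return max(0, predict_pos(hi) - predict_pos(lo))
```

One set of model parameters thus answers both lookup and estimation queries, which is the storage argument made above.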
A Maximum Likelihood Estimator based on First Differences for a Panel Data Tobit Model with Individual Specific Effects
This paper proposes an alternative estimation procedure for a panel data Tobit model with individual specific effects, based on taking first differences of the equation of interest. This helps to alleviate the sensitivity of the estimates to a specific parameterization of the individual specific effects, and some Monte Carlo evidence is provided in support of this. To allow for arbitrary serial correlation, estimation takes place in two steps: Maximum Likelihood is applied to each pair of consecutive periods, and then a Minimum Distance estimator is employed.
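The second step can be sketched in isolation. Given per-pair estimates of a common scalar parameter and their variances (illustrative numbers, not from the paper), minimum distance with a diagonal weight matrix reduces to an inverse-variance weighted average:

```python
import numpy as np

# Hypothetical ML estimates of a common scalar parameter, one per pair of
# consecutive periods, together with their estimated variances.
theta_hat = np.array([0.48, 0.55, 0.51, 0.47])
var_hat = np.array([0.010, 0.020, 0.015, 0.012])

# Minimum distance: argmin_theta  sum_t (theta_hat[t] - theta)^2 / var_hat[t],
# whose closed form is the inverse-variance weighted average.
w = 1.0 / var_hat
theta_md = float(np.sum(w * theta_hat) / np.sum(w))
se_md = float(np.sqrt(1.0 / np.sum(w)))   # standard error under independence
```

With a full (non-diagonal) weight matrix the combined estimator also exploits the covariances across period pairs, which is what the two-step procedure above relies on under arbitrary serial correlation.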