
    Streaming Weighted Sampling over Join Queries

    Join queries are a fundamental database tool, capturing a range of tasks that involve linking heterogeneous data sources. However, with massive table sizes, it is often impractical to keep these tables in memory, and we can take only one or a few streaming passes over them. Moreover, building out the full join result (e.g., linking heterogeneous data sources along quasi-identifiers) can lead to a combinatorial explosion of results due to many-to-many links. Random sampling is a natural tool to boil this oversized result down to a representative subset with well-understood statistical properties, but it turns out to be a challenging task due to the combinatorial nature of the sampling domain. Existing techniques in the literature focus solely on the setting with tabular data residing in main memory, and do not address aspects such as stream operation, weighted sampling and more general join operators that are urgently needed in a modern data processing context. The main contribution of this work is to meet these needs with more lightweight practical approaches. First, a bijection between the sampling problem and a graph problem is introduced to support weighted sampling and common join operators. Second, the sampling techniques are refined to minimise the number of streaming passes. Third, techniques are presented to deal with very large tables under limited memory. Finally, the proposed techniques are compared to existing approaches that rely on database indices, and the results indicate substantial memory savings, reduced runtimes for ad-hoc queries and competitive amortised runtimes.
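To make the sampling problem concrete, here is a minimal in-memory sketch of weighted sampling from a two-table equi-join without materialising the join result. It is not the method of the paper (which targets streaming passes and more general join operators); it is the classic group-by-key approach. It assumes each join row has weight equal to the product of its two input-row weights. The function name `sample_weighted_join` and the `(key, payload, weight)` row layout are hypothetical, chosen only for illustration.

```python
import random
from collections import defaultdict


def sample_weighted_join(left, right, k, rng):
    """Draw k join rows with replacement, each with probability
    proportional to w_left * w_right, without building the full join.

    left, right: iterables of (key, payload, weight) with weight > 0.
    Assumption (not from the paper): join-row weight is the product
    of the two input-row weights.
    """
    left_by_key = defaultdict(list)
    right_by_key = defaultdict(list)
    for key, payload, weight in left:
        left_by_key[key].append((payload, weight))
    for key, payload, weight in right:
        right_by_key[key].append((payload, weight))

    # Only keys present on both sides contribute join rows.
    keys = [key for key in left_by_key if key in right_by_key]
    if not keys:
        return []

    # Total weight of all join rows sharing a key factorises as
    # (sum of left weights) * (sum of right weights).
    key_weights = [
        sum(w for _, w in left_by_key[key]) * sum(w for _, w in right_by_key[key])
        for key in keys
    ]

    samples = []
    for key in rng.choices(keys, weights=key_weights, k=k):
        left_rows = left_by_key[key]
        right_rows = right_by_key[key]
        # Given the key, the two sides can be sampled independently.
        left_payload = rng.choices(
            [p for p, _ in left_rows], weights=[w for _, w in left_rows]
        )[0]
        right_payload = rng.choices(
            [p for p, _ in right_rows], weights=[w for _, w in right_rows]
        )[0]
        samples.append((key, left_payload, right_payload))
    return samples
```

The sketch stores both tables grouped by key, so it illustrates the sampling domain only; it does not achieve the memory savings or the limited number of streaming passes that the abstract describes.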

    Zeros of random tropical polynomials, random polytopes and stick-breaking

    For $i = 0, 1, \ldots, n$, let $C_i$ be independent and identically distributed random variables with distribution $F$ with support $(0,\infty)$. The number of zeros of the random tropical polynomials $\mathcal{T}f_n(x) = \min_{i=1,\ldots,n}(C_i + ix)$ is also the number of faces of the lower convex hull of the $n+1$ random points $(i,C_i)$ in $\mathbb{R}^2$. We show that this number, $Z_n$, satisfies a central limit theorem when $F$ has polynomial decay near $0$. Specifically, if $F$ near $0$ behaves like a $gamma(a,1)$ distribution for some $a > 0$, then $Z_n$ has the same asymptotics as the number of renewals on the interval $[0,\log(n)/a]$ of a renewal process with inter-arrival distribution $-\log(Beta(a,2))$. Our proof draws on connections between random partitions, renewal theory and random polytopes. In particular, we obtain generalizations and simple proofs of the central limit theorem for the number of vertices of the convex hull of $n$ uniform random points in a square. Our work leads to many open problems in stochastic tropical geometry, the study of functionals and intersections of random tropical varieties. Comment: 22 pages, 5 figures
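The correspondence between zeros of the tropical polynomial and the lower convex hull can be explored numerically. The sketch below is illustrative only and makes three assumptions that are not taken from the paper: "faces" is read as the edges of the lower hull, all $n+1$ points $(i, C_i)$ for $i = 0, \ldots, n$ are used, and the $C_i$ are drawn exactly from a gamma$(a,1)$ distribution rather than merely behaving like one near $0$. The function names are hypothetical.

```python
import random


def lower_hull_edge_count(costs):
    """Count the edges of the lower convex hull of the points (i, costs[i])."""
    hull = []
    for x, y in enumerate(costs):
        # Monotone-chain step: drop the last hull point while it lies
        # on or above the segment from its predecessor to the new point.
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            cross = (x2 - x1) * (y - y1) - (y2 - y1) * (x - x1)
            if cross > 0:  # strict counter-clockwise turn: keep the point
                break
            hull.pop()
        hull.append((x, y))
    return len(hull) - 1


def simulate_zero_count(n, a, rng):
    """Simulate Z_n with C_0, ..., C_n i.i.d. gamma(a, 1) (an assumption)."""
    costs = [rng.gammavariate(a, 1.0) for _ in range(n + 1)]
    return lower_hull_edge_count(costs)


if __name__ == "__main__":
    rng = random.Random(2024)
    n, a, trials = 10_000, 1.0, 200
    counts = [simulate_zero_count(n, a, rng) for _ in range(trials)]
    print("mean Z_n over trials:", sum(counts) / trials)
```

Each hull edge corresponds to one breakpoint of the piecewise-linear function $x \mapsto \min_i (C_i + ix)$, so the returned count is the number of tropical zeros under the assumptions above.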