6 research outputs found
Approximation with Error Bounds in Spark
We introduce a sampling framework to support approximate computing with
estimated error bounds in Spark. Our framework allows sampling to be performed
at the beginning of a sequence of multiple transformations ending in an
aggregation operation. The framework constructs a data provenance tree as the
computation proceeds, then combines the tree with multi-stage sampling and
population estimation theories to compute error bounds for the aggregation.
When information about output keys is available early, the framework can also
use adaptive stratified reservoir sampling to avoid (or reduce) key losses in
the final output and to achieve more consistent error bounds across popular and
rare keys. Finally, the framework includes an algorithm to dynamically choose
sampling rates to meet user-specified constraints on the CDF of error bounds in
the outputs. We have implemented a prototype of our framework called
ApproxSpark, and used it to implement five approximate applications from
different domains. Evaluation results show that ApproxSpark can (a)
significantly reduce execution time if users can tolerate small amounts of
uncertainty and, in many cases, the loss of rare keys, and (b) automatically
find sampling rates that meet user-specified constraints on error bounds. We
also extensively explore and discuss the trade-offs between sampling rate,
execution time, accuracy, and key loss.
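ApproxSpark's full machinery (provenance trees combined with multi-stage sampling estimators) is specific to the system, but the core idea of estimating an aggregate with an error bound from a sample can be sketched in a few lines. The following is a minimal, self-contained illustration — not ApproxSpark's actual implementation — using a single-stage Bernoulli sample and a normal-approximation bound:

```python
import math
import random

def estimate_sum_with_bound(population, rate, z=1.96, seed=0):
    """Estimate sum(population) from a Bernoulli sample drawn at `rate`,
    with a normal-approximation error bound. Each item is kept
    independently with probability `rate` and scaled up by 1/rate,
    which keeps the sum estimate unbiased."""
    rng = random.Random(seed)
    sample = [x for x in population if rng.random() < rate]
    est = sum(sample) / rate
    # Unbiased variance of the scaled-sum estimator under Bernoulli
    # sampling: Var = sum over sampled x of x^2 * (1 - rate) / rate^2.
    var = sum(x * x for x in sample) * (1.0 - rate) / rate ** 2
    return est, z * math.sqrt(var)

data = list(range(1, 10001))                 # true sum is 50,005,000
est, bound = estimate_sum_with_bound(data, rate=0.1)
```

The paper's framework generalizes this to multiple sampling stages and uses the provenance tree to decide which estimator applies at the final aggregation.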
Essential Entities towards Developing an Adaptive Reuse Model for Organization Management in Conservation of Heritage Buildings in Malaysia
This paper sets out to explore, review, and confirm the key factors that have an enormous impact and influence on the conservation of heritage buildings in Malaysia. It focuses on developing an adaptive reuse (AR) model as a decision-making tool for enhancing heritage-building performance by supplying substantial information to the relevant organizations, such as public authorities and the private sector. The paper highlights the significant findings of an extensive review of the literature from trustworthy sources. The factors influencing heritage buildings are set out accordingly, to bridge the gap and strengthen the common understanding among the major players.
Variance-Optimal Offline and Streaming Stratified Random Sampling
Stratified random sampling (SRS) is a fundamental sampling technique that
provides accurate estimates for aggregate queries using a small size sample,
and has been used widely for approximate query processing. A key question in
SRS is how to partition a target sample size among different strata. While
Neyman allocation provides a solution that minimizes the variance of an
estimate using this sample, it works under the assumption that each stratum is
abundant, i.e., has a large number of data points to choose from. This
assumption may not hold in general: one or more strata may be bounded, and may
not contain a large number of data points, even though the total data size may
be large.
We first present VOILA, an offline method for allocating sample sizes to
strata in a variance-optimal manner, even for the case when one or more strata
may be bounded. We next consider SRS on streaming data that are continuously
arriving. We show a lower bound: any streaming algorithm for SRS must have
(in the worst case) a variance that is an Ω(r) factor away from the
optimal, where r is the number of strata. We present S-VOILA, a practical
streaming algorithm for SRS that is locally variance-optimal in its allocation
of sample sizes to different strata. Our results from experiments on real and
synthetic data show that VOILA can have significantly (1.4 to 50.0 times)
smaller variance than Neyman allocation. The streaming algorithm S-VOILA
results in a variance that is typically close to that of VOILA, which was
given the entire input beforehand.
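Neyman allocation itself is simple to state: given stratum sizes N_h and standard deviations S_h, allocate the sample budget in proportion to N_h·S_h. The sketch below illustrates the allocation and the bounded-strata problem the paper addresses by naively capping each allocation at the stratum size — unlike VOILA, it does not redistribute the capped excess variance-optimally:

```python
def neyman_allocation(strata, total_sample):
    """strata: list of (N_h, S_h) pairs, the size and standard deviation
    of each stratum. Returns per-stratum sample sizes proportional to
    N_h * S_h, naively capped at N_h when a stratum is bounded.
    (VOILA redistributes the excess optimally; this sketch just clips.)"""
    weights = [n * s for n, s in strata]
    total_w = sum(weights)
    return [min(n, round(total_sample * w / total_w))
            for (n, _), w in zip(strata, weights)]
```

For example, with a tiny third stratum of only 5 points, the proportional share may exceed what the stratum can supply, which is exactly the "bounded stratum" case where plain Neyman allocation stops being optimal.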
Stratified Random Sampling from Streaming and Stored Data
Stratified random sampling (SRS) is a widely used sampling technique for approximate query processing. We consider SRS on continuously arriving data streams, and make the following contributions. We present a lower bound showing that any streaming algorithm for SRS must have (in the worst case) a variance that is an Ω(r) factor away from the optimal, where r is the number of strata. We present S-VOILA, a streaming algorithm for SRS that is locally variance-optimal. Results from experiments on real and synthetic data show that S-VOILA results in a variance that is typically close to that of an optimal offline algorithm, which was given the entire input beforehand. We also present a variance-optimal offline algorithm, VOILA, for stratified random sampling. VOILA is a strict generalization of the well-known Neyman allocation, which is optimal only under the assumption that each stratum is abundant, i.e., has a large number of data points to choose from. Experiments show that VOILA can have significantly smaller variance (1.4x to 50x) than Neyman allocation on real-world data.
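S-VOILA's locally variance-optimal re-allocation is beyond a short sketch, but the streaming primitive it builds on — maintaining a uniform reservoir sample per stratum as items arrive — is standard. A minimal version with a fixed per-stratum capacity k (S-VOILA instead adjusts the per-stratum sizes on the fly) might look like:

```python
import random

class StratifiedReservoir:
    """Maintain a uniform reservoir sample of up to `k` items per stratum.
    This fixes k per stratum; S-VOILA's contribution is choosing how to
    re-allocate capacity across strata in a locally variance-optimal way."""
    def __init__(self, k, seed=0):
        self.k = k
        self.rng = random.Random(seed)
        self.samples = {}   # stratum -> list of currently sampled items
        self.counts = {}    # stratum -> number of items seen so far

    def add(self, stratum, item):
        seen = self.counts.get(stratum, 0) + 1
        self.counts[stratum] = seen
        res = self.samples.setdefault(stratum, [])
        if len(res) < self.k:
            res.append(item)              # reservoir not yet full
        else:
            j = self.rng.randrange(seen)  # keep new item with prob k/seen
            if j < self.k:
                res[j] = item
```

Each stratum's reservoir stays a uniform sample of everything seen in that stratum, which is the invariant any streaming SRS algorithm must preserve while it shifts the sample budget between strata.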
Approximate query processing in a data warehouse using random sampling
Data analysis consumes a large volume of data on a routine basis. With the fast increase in both the volume of data and the complexity of analytic tasks, data processing has become more complicated and expensive. Cost efficiency is a key factor in the design and deployment of data warehouse systems. Approximate query processing, in which a small sample is used to answer the query, is a well-known approach among the different methods for making big data processing more efficient. For many applications, a small error is justifiable in exchange for the savings in resources consumed to answer the query, as well as the reduced latency.
We focus on approximate query processing using random sampling in a data warehouse system, including algorithms to draw samples, methods to maintain sample quality, and effective uses of the sample for approximately answering different classes of queries. First, we study different methods of sampling, focusing on stratified sampling optimized for population aggregate queries. Next, as queries grow more involved, we propose sampling algorithms for group-by aggregate queries. Finally, we introduce sampling over the pipeline model of query processing, where multiple queries and tables are involved in order to accomplish complicated tasks. Modern big data analyses routinely involve complex pipelines in which multiple tasks are choreographed to execute queries over their inputs and write the results into their outputs (which, in turn, may be used as inputs for other tasks) in a synchronized dance of gradual data refinement until the final insight is calculated. In a pipeline, unlike in a single query, approximate results are fed into downstream queries; thus, we see both aggregate computations over sampled input and computations over approximate input.
We propose a sampling-based approximate pipeline processing algorithm that uses unbiased estimation and computes a confidence interval for the approximate results it produces. The key insight of the algorithm is to enrich the output of queries with additional information. This enables the algorithm to piggyback on the modular structure of the pipeline without performing any global rewrites, i.e., no extra query or table is added to the pipeline. Compared to the bootstrap method, the approach described in this paper provides the confidence interval while computing aggregation estimates only once, and avoids the need to maintain intermediary aggregation distributions.
Our empirical study on public and private datasets shows that our sampling algorithm can have significantly (1.4 to 50.0 times) smaller variance than Neyman allocation when drawing the optimal sample for population aggregate queries. Our experimental results for group-by queries show that our sampling algorithm outperforms the current state of the art in sample quality and estimation accuracy: under the same budget, the optimal sample yields relative errors that are 5x smaller than those of competing approaches. The experiments for approximate pipeline processing show the high accuracy of the computed estimates, with an average error as low as 2% using only a 1% sample, and demonstrate the usefulness of the confidence interval. At a confidence level of 95%, the computed CI is as tight as +/- 8%, while the actual values fall within the CI boundaries between 70.49% and 95.15% of the time.
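To make the group-by setting concrete, the following is a much-simplified sketch — not the dissertation's algorithm — of answering a group-by SUM query from a Bernoulli sample, scaling each sampled value by 1/rate so the per-group estimates stay unbiased:

```python
import random
from collections import defaultdict

def approx_groupby_sum(rows, rate, seed=0):
    """Approximate SELECT key, SUM(val) ... GROUP BY key from a Bernoulli
    sample drawn at `rate`. Each sampled value is scaled by 1/rate, so the
    expected estimate per group equals its true sum. Rare groups may be
    missed entirely -- the 'group loss' problem the dissertation's
    group-by sampling algorithms are designed to mitigate."""
    rng = random.Random(seed)
    est = defaultdict(float)
    for key, val in rows:
        if rng.random() < rate:
            est[key] += val / rate
    return dict(est)
```

A pipeline-aware algorithm must additionally carry enough information with each such output (per-group sample counts, for instance) for downstream queries to compute confidence intervals, which is what the "enriched output" above refers to.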
Optimization-driven sampling for analyzing big data streams
Real-time processing over data streams has become a popular trend in data analysis. With more business applications relying on real-time data analysis to make decisions, traditional batch data processing has become insufficient. As the demand for streaming analysis rises, analyzing big data streams quickly and accurately is a major challenge to overcome.
Sampling is a good approach for providing quick analysis over big data streams. Analyzing the sample gives us an approximation of the exact answer we would obtain by analyzing the original data. By avoiding analysis of the entire stream, the processing time can be greatly reduced. However, sampling over data streams raises the following challenges: (1) given a limited sample budget, how do we build a sample such that the approximation computed over it is accurate? And (2) recent data are usually more valuable to some streaming analysis applications; e.g., a real-time intrusion detection system will focus on recent event logs. How to build a sample that gives more weight to recent data and phases old data out of the sample is another challenge.
In this research, we propose an optimization-driven sampling (ODS) framework that aims at (1) providing more accurate analysis over streaming data and (2) eliminating older data using the sliding window model. Based on how the sample will be analyzed, we formulate the sampling process as an optimization problem and derive an optimal sampling algorithm to follow when constructing and maintaining the sample over a data stream. We study ODS with different sample usages over data streams and discuss how to construct an optimal sample in those settings. We also study lower bounds on the accuracy of an ODS sample collected from data streams. Experiments and evaluations show that our optimal sample yields better estimates than existing streaming sampling methods.
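The sliding-window requirement — keep a uniform sample of only the last W items — can be illustrated with a naive priority-based sketch: give every arrival a random priority and take the k smallest priorities still inside the window. This stores the whole window (memory-efficient variants exist, and ODS further optimizes what to keep for the intended analysis), but it shows the expiry behavior:

```python
import random
from collections import deque

def sliding_window_sample(stream, window, k, seed=0):
    """After each arrival, yield a uniform sample of up to k items drawn
    from the last `window` items. Each item gets a random priority; the
    sample is the k smallest priorities still inside the window. Naive:
    the whole window is buffered, trading memory for simplicity."""
    rng = random.Random(seed)
    buf = deque()                      # (priority, item) pairs in the window
    for item in stream:
        buf.append((rng.random(), item))
        if len(buf) > window:
            buf.popleft()              # expire the oldest item
        yield [it for _, it in sorted(buf)[:k]]
```

Because priorities are independent of the data, the k retained items are a uniform random subset of the current window, and expired items drop out automatically as the window slides.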