Cost estimation of spatial join in SpatialHadoop
Spatial join is an important operation in geo-spatial applications, since it is frequently used for performing data analysis involving geographical information. Many efforts have been made over the past decades to provide efficient algorithms for spatial join, and this becomes particularly important as the amount of spatial data to be processed increases. In recent years, the MapReduce approach has become a de facto standard for processing large amounts of data (big data), and some attempts have been made to extend existing frameworks to the processing of spatial data. In this context, several different MapReduce implementations of spatial join have been defined, which mainly differ in the use of a spatial index and in the way this index is built and used. In general, none of these algorithms can be considered better than the others; rather, the best choice depends on the characteristics of the involved datasets. The aim of this work is to analyse them in depth and define a cost model for ranking them based on the characteristics of the datasets at hand (e.g., selectivity or spatial properties). This cost model has been extensively tested on a set of synthetic datasets in order to demonstrate its effectiveness.
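To make the idea concrete, here is a minimal sketch, not the paper's actual cost model, of ranking two hypothetical MapReduce spatial-join strategies from simple dataset statistics; all class names, formulas, and constants below are illustrative assumptions.

```python
# A minimal sketch, not the paper's actual cost model: rank two hypothetical
# spatial-join strategies from simple dataset statistics. All formulas and
# constants below are illustrative assumptions.
import math
from dataclasses import dataclass

@dataclass
class DatasetStats:
    cardinality: int      # number of geometries
    avg_mbr_area: float   # average bounding-box area, normalized to [0, 1]

def est_selectivity(a: DatasetStats, b: DatasetStats) -> float:
    """Crude join selectivity: chance that two random MBRs overlap."""
    return min(1.0, a.avg_mbr_area + b.avg_mbr_area)

def cost_grid_join(a: DatasetStats, b: DatasetStats, cells: int = 64) -> float:
    """Partition both inputs on a uniform grid; join each cell pair locally."""
    per_cell = (a.cardinality / cells) * (b.cardinality / cells)
    return cells * per_cell * est_selectivity(a, b) + a.cardinality + b.cardinality

def cost_indexed_join(a: DatasetStats, b: DatasetStats) -> float:
    """Build an R-tree on the smaller input; probe it with the larger one."""
    build, probe = sorted((a, b), key=lambda d: d.cardinality)
    return build.cardinality + probe.cardinality * math.log2(max(2, build.cardinality))

def rank_strategies(a: DatasetStats, b: DatasetStats):
    costs = {"grid": cost_grid_join(a, b), "indexed": cost_indexed_join(a, b)}
    return sorted(costs.items(), key=lambda kv: kv[1])

roads = DatasetStats(cardinality=1_000_000, avg_mbr_area=1e-6)
parks = DatasetStats(cardinality=50_000, avg_mbr_area=1e-4)
print(rank_strategies(roads, parks))  # cheapest strategy first
```

The point of such a model is precisely the one the abstract makes: which strategy wins depends on the statistics of the inputs, so the ranking, not any single cost figure, is the useful output.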
High-Performance Complex Event Processing for Decision Analytics
Complex Event Processing (CEP) systems are becoming increasingly popular in decision-analytics domains such as financial services, transportation, cluster monitoring, supply chain management, business process management, and health care. These systems collect or create high-volume event streams and often require such streams to be processed in real time. To this end, CEP queries are applied for filtering, correlation, aggregation, and transformation, to derive high-level, actionable information. Tasks for CEP systems fall into two categories: passive monitoring and proactive monitoring. For passive monitoring, users know their exact needs and express them as CEP queries, and CEP engines evaluate those queries against incoming data events; for proactive monitoring, users cannot tell exactly what they are looking for and need to work with CEP engines to figure out the query. My thesis makes contributions in both categories.
For passive monitoring, the first contribution I make is to apply CEP queries over streams with imprecise timestamps, which was infeasible before this work. Existing CEP systems assumed that the occurrence time of each event is known precisely. However, I observe that event occurrence times are often unknown or imprecise due to lossy raw data, granularity mismatches, or clock synchronization issues. Therefore, I propose a temporal model that assigns a time interval to each event to represent all of its possible occurrence times. Under this uncertain temporal model, I further propose two evaluation frameworks: a point-based framework, which converts events with time intervals into events with point timestamps before pattern matching, and an event-based framework, which matches patterns over events with time intervals directly. I also propose optimizations within these frameworks. My new approach achieves high efficiency for a wide range of workloads tested using both real traces and synthetic datasets. While existing systems cannot process this type of stream, my algorithm achieves a throughput of tens of thousands of events per second in a MapReduce case study. This contribution makes CEP techniques applicable to more application scenarios.
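As an illustration of the interval temporal model, here is a minimal sketch, under assumed uniform-distribution semantics, of matching a simple sequence pattern SEQ(A, B) over events whose occurrence times are only known to lie within an interval; the event representation and the confidence definition are simplifications, not the thesis's exact formalism.

```python
# A minimal sketch, under assumed semantics, of matching SEQ(A, B) over events
# whose occurrence time is only known to lie in an interval [lo, hi].
from dataclasses import dataclass

@dataclass
class Event:
    etype: str
    lo: int   # earliest possible occurrence time
    hi: int   # latest possible occurrence time

def match_confidence(a: Event, b: Event) -> float:
    """Fraction of possible (ta, tb) point assignments with ta < tb,
    assuming occurrence times are uniform over integer points."""
    total = matches = 0
    for ta in range(a.lo, a.hi + 1):
        for tb in range(b.lo, b.hi + 1):
            total += 1
            if ta < tb:
                matches += 1
    return matches / total if total else 0.0

def seq_matches(events, conf_threshold=0.5):
    """Report (A, B) pairs whose 'A before B' confidence clears the threshold."""
    out = []
    for a in (e for e in events if e.etype == "A"):
        for b in (e for e in events if e.etype == "B"):
            c = match_confidence(a, b)
            if c >= conf_threshold:
                out.append((a, b, c))
    return out

stream = [Event("A", 1, 4), Event("B", 3, 6)]
for a, b, c in seq_matches(stream):
    print(f"A[{a.lo},{a.hi}] before B[{b.lo},{b.hi}] with confidence {c:.2f}")
```

This brute-force enumeration corresponds to the point-based view of the problem; the thesis's frameworks exist precisely to avoid this kind of per-combination cost.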
Another contribution to passive monitoring is that I identify expensive queries in CEP, analyze their runtime complexity, and propose effective optimizations that improve their performance significantly. These expensive queries involve Kleene closure patterns, flexible event selection strategies, and events with imprecise timestamps. I analyze the runtime complexity of each language component and identify two performance bottlenecks: Kleene closure under the most flexible event selection strategy, and confidence computation in the case of imprecise timestamps. For the first bottleneck, I break query evaluation into two parts: pattern matching, which can be shared across many matches, and result construction. Optimizations that share pattern matching cut the cost from exponential to polynomial, and sometimes close to linear, time. To address the second bottleneck, I design a dynamic programming algorithm to improve performance. Microbenchmark results show that state-of-the-art systems suffer poor performance, while my system provides 2 to 10 orders of magnitude improvement. A thorough case study on Hadoop cluster monitoring further demonstrates the efficiency and effectiveness of my proposed techniques: the throughput exceeds 1 million events per second.
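The gap between enumerating and sharing Kleene matches can be seen on a toy example. Under a skip-till-any-match selection strategy, a pattern A+ has one match per non-empty subsequence of A-events, so materializing all matches is inherently exponential, while a shared count can be maintained incrementally in linear time. The sketch below illustrates only this counting argument; it does not reproduce the thesis's evaluation algorithm.

```python
# A minimal sketch of sharing work across Kleene matches: with k A-events,
# pattern A+ under skip-till-any-match has 2**k - 1 matches (one per
# non-empty subsequence), yet the count is maintainable in one linear pass:
# each new A either extends every existing partial match or starts a new one.
def count_kleene_matches(events, etype="A"):
    matches_so_far = 0
    for e in events:
        if e == etype:
            matches_so_far = 2 * matches_so_far + 1
    return matches_so_far

stream = ["A", "B", "A", "A", "C", "A"]   # four A-events
print(count_kleene_matches(stream))        # 2**4 - 1 = 15
```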
The last problem solved in this thesis concerns proactive monitoring: explaining anomalies observed in CEP-based monitoring. CEP queries are widely used for monitoring purposes. When users observe an abnormal status in the monitoring results, they annotate the abnormal period and a reference period. The system then generates explanations by analyzing stream events, and these explanations can be encoded into CEP queries for future monitoring of similar anomalies. An entropy-based distance function is designed to select features for explanation. Compared to state-of-the-art distance functions for time series, the new distance function reduces the number of features examined to find the ground truth by up to 99.2%. A cluster-based auto-labeling algorithm is also designed to leverage unlabeled data to filter out noisy features. Compared with alternative techniques, the generated results improve explanation quality by up to 800%, reduce the number of features by 93.8% for conciseness, and achieve prediction quality as high as that of other techniques. The implementation is also efficient: with 2000 concurrent monitoring queries, triggered explanation analysis returns explanations within a minute and affects performance only slightly, delaying event processing by less than 1 second.
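For intuition, the following sketch scores a feature by how well a threshold split separates the annotated abnormal period from the reference period, using information gain over label entropy. This is a generic entropy-based score, not the thesis's exact distance function, and all names and data are illustrative.

```python
# A minimal sketch (not the thesis's exact distance function): score features
# by how cleanly a threshold split separates abnormal from reference samples.
import math

def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def information_gain(values, labels, threshold):
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    cond = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - cond

def score_feature(abnormal_vals, reference_vals):
    """Best information gain over candidate thresholds; higher means the
    feature explains the anomaly better."""
    values = abnormal_vals + reference_vals
    labels = [1] * len(abnormal_vals) + [0] * len(reference_vals)
    return max(information_gain(values, labels, t) for t in set(values))

# A feature like CPU load separates the two periods; a noisy one does not.
print(score_feature([0.9, 0.95, 0.97], [0.2, 0.3, 0.25]))  # close to 1 bit
print(score_feature([5, 6, 7], [6, 5, 7]))                 # close to 0 bits
```

Ranking features by such a score and keeping the top ones is one simple way to turn the annotated periods into a concise explanation.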
Efficient Incremental Data Analysis
Many data-intensive applications require real-time analytics over streaming data. In a growing number of domains -- sensor network monitoring, social web applications, clickstream analysis, high-frequency algorithmic trading, and fraud detection, to name a few -- applications continuously monitor stream events to promptly react to certain data conditions. These applications demand responsive analytics even when faced with a high volume and velocity of incoming changes, large numbers of users, and complex processing requirements. Developing a suitable online analytics engine that meets these requirements is challenging. In this thesis, we study techniques for efficient online processing of complex analytical queries, ranging from standard database queries to complex machine learning and digital signal processing workflows.

First, we focus on the problem of efficient incremental computation for database queries. We have developed a system, called DBToaster, that compiles declarative queries into high-performance stream processing engines that keep query results (views) fresh at very high update rates. At the heart of our system is a recursive query compilation algorithm that materializes a set of supporting higher-order delta views to achieve a substantially lower view maintenance cost. We study the trade-offs between single-tuple and batch incremental processing in local execution, and we present a novel approach for compiling view maintenance code into data-parallel programs optimized for distributed execution. DBToaster supports millions of complete view refreshes per second for a broad range of queries and outperforms commercial database and stream engines by orders of magnitude.

We also study incremental computation for queries written as iterative linear algebra, which can capture many machine learning and scientific calculations. We have developed a framework, called LINVIEW, for capturing deltas of linear algebra programs and understanding their computational cost. Linear algebra operations tend to cause an avalanche effect, where even very local changes to the input matrices spread out and infect all of the intermediate results and the final view, causing incremental view maintenance to lose its performance benefit over re-evaluation. We develop techniques based on matrix factorizations to contain such epidemics of change and make incremental view maintenance of linear algebra programs practical and usually substantially cheaper than re-evaluation. We show, both analytically and experimentally, the usefulness of these techniques when applied to standard analytics tasks.

Our last research question concerns the integration of general-purpose query processors and domain-specific operations to enable deep data exploration in both online and offline analysis. We advocate a deep integration of signal processing operations and general-purpose query processors. We demonstrate that in-situ processing of tempo-relational and signal data through a unified query language empowers users to express end-to-end workflows more succinctly inside one system, while at the same time offering orders of magnitude better performance than existing popular data management systems.
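The benefit of delta-based view maintenance can be illustrated on a tiny example. The sketch below is written in the spirit of incremental view maintenance, not as DBToaster's actual compiled code: it keeps per-key auxiliary sums (playing the role of supporting delta views) so that the view SUM(r.x * s.y) over an equi-join is refreshed in constant time per inserted tuple instead of being recomputed from scratch.

```python
# A minimal sketch, in the spirit of delta-based view maintenance (not
# DBToaster's actual compiled code): auxiliary per-key sums let the view
# SUM(r.x * s.y) over R JOIN S ON key be refreshed in O(1) per insert.
from collections import defaultdict

class JoinSumView:
    def __init__(self):
        self.sum_r = defaultdict(float)  # key -> SUM(r.x), auxiliary view of R
        self.sum_s = defaultdict(float)  # key -> SUM(s.y), auxiliary view of S
        self.view = 0.0                  # SUM(r.x * s.y) over the join

    def insert_r(self, key, x):
        # Delta of the view w.r.t. one R-tuple: x times the matching S-sum.
        self.view += x * self.sum_s[key]
        self.sum_r[key] += x

    def insert_s(self, key, y):
        self.view += self.sum_r[key] * y
        self.sum_s[key] += y

v = JoinSumView()
v.insert_r("k1", 2.0)
v.insert_s("k1", 3.0)   # view += 2 * 3
v.insert_r("k1", 4.0)   # view += 4 * 3
print(v.view)           # 18.0, equal to (2*3) + (4*3)
```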
A Theoretical and Experimental Comparison of Filter-Based Equijoins in MapReduce
MapReduce has become an increasingly popular framework for large-scale data processing. However, complex operations such as joins are quite expensive and require sophisticated techniques. In this paper, we review state-of-the-art strategies for joining several relations in a MapReduce environment and study their extension with filter-based approaches. The general objective of filters is to eliminate non-matching data as early as possible in order to reduce I/O, communication, and CPU costs. We examine the impact of systematically adding filters as early as possible in MapReduce join algorithms, both analytically with cost models and practically with evaluations. The study covers binary joins, multi-way joins, and recursive joins, and addresses the case of large inputs that give rise to the most intricate challenges.
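To illustrate the filtering idea, the following sketch builds a Bloom filter on the join keys of one input and prunes the other input before the shuffle would take place; the filter parameters and the single-process setup are illustrative assumptions, not the paper's experimental configuration.

```python
# A minimal sketch of the filter-based idea: build a Bloom filter on one
# input's join keys and drop non-matching tuples of the other input early.
# Single-process stand-in for a MapReduce pipeline; parameters are assumed.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

# Build the filter on R's join keys (one pass, small memory footprint) ...
r = [("k1", "r1"), ("k2", "r2")]
s = [("k1", "s1"), ("k3", "s3"), ("k4", "s4")]
bf = BloomFilter()
for key, _ in r:
    bf.add(key)

# ... then prune S early. Bloom filters admit false positives but never false
# negatives, so surviving tuples still go through the real join, while the
# data shuffled across the network shrinks.
s_pruned = [t for t in s if bf.might_contain(t[0])]
print(s_pruned)  # "k3"/"k4" dropped (barring rare false positives)
```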