Observations on Factors Affecting Performance of MapReduce based Apriori on Hadoop Cluster
Designing fast and scalable algorithms for mining frequent itemsets has
always been an eminent and promising problem in data mining. Apriori is one of
the most widely used and popular algorithms for frequent itemset mining.
Designing efficient algorithms on the MapReduce framework to process and
analyze big datasets is an active line of contemporary research. In this
paper, we focus on the performance of MapReduce-based Apriori on homogeneous
as well as heterogeneous Hadoop clusters. We investigate a number of factors
that significantly affect the execution time of MapReduce-based Apriori
running on homogeneous and heterogeneous Hadoop clusters. These factors cover
both algorithmic and non-algorithmic improvements. The factors specific to
algorithmic improvements are filtered transactions and data structures.
Experimental results show how an appropriate data structure and the
filtered-transactions technique drastically reduce execution time. The
non-algorithmic factors include speculative execution, nodes with poor
performance, data locality and distribution of data blocks, and parallelism
control via input split size. We applied strategies against these factors and
fine-tuned the relevant parameters for our particular application.
Experimental results show that when these cluster-specific parameters are
taken care of, there is a significant reduction in execution time. We also
discuss issues in the MapReduce implementation of Apriori that may
significantly influence performance.
Comment: 8 pages, 8 figures, International Conference on Computing,
Communication and Automation (ICCCA2016)
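The abstract above names several non-algorithmic tuning knobs: speculative execution, parallelism control via input split size, and data locality/block distribution. As a hedged illustration, the sketch below sets the standard Hadoop 2.x configuration keys for these knobs on a job. The parameter names are real Hadoop properties, but the values and the class itself are illustrative assumptions, not the paper's actual settings.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class AprioriJobTuning {
    public static Job configureJob() throws Exception {
        Configuration conf = new Configuration();

        // Speculative execution: re-launches slow tasks on other nodes.
        // On a heterogeneous cluster this can mask poor-performing nodes;
        // on a homogeneous one it may simply waste task slots.
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", true);

        // Parallelism control via input split size: a smaller maximum
        // split size yields more map tasks (illustrative value: 32 MB).
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize",
                     32L * 1024 * 1024);

        // Data locality / block distribution: a higher replication factor
        // raises the chance that a map task finds its block locally.
        conf.setInt("dfs.replication", 3);

        return Job.getInstance(conf, "mapreduce-apriori");
    }
}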
A Survey on Automatic Parameter Tuning for Big Data Processing Systems
Big data processing systems (e.g., Hadoop, Spark, Storm) contain a vast number of configuration parameters controlling parallelism, I/O behavior, memory settings, and compression. Improper parameter settings can cause significant performance degradation and stability issues, yet regular users and even expert administrators grapple with understanding and tuning these parameters to achieve good performance. We investigate existing approaches to parameter tuning for both batch and stream data processing systems and classify them into six categories: rule-based, cost modeling, simulation-based, experiment-driven, machine learning, and adaptive tuning. We summarize the pros and cons of each approach and raise some open research problems in automatic parameter tuning.
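To make concrete the kinds of parameters such tuners search over, the illustrative snippet below sets a few real Spark configuration keys spanning parallelism, I/O and compression, and memory. The values are placeholders for illustration only, not tuning recommendations from the survey.

import org.apache.spark.SparkConf;

public class SparkTuningExample {
    public static SparkConf illustrativeConf() {
        // A handful of the many parameters automatic tuners target.
        // All keys are real Spark properties; values are illustrative.
        return new SparkConf()
                .setAppName("tuning-example")
                .set("spark.default.parallelism", "200")   // parallelism
                .set("spark.shuffle.compress", "true")     // I/O behavior
                .set("spark.io.compression.codec", "lz4")  // compression
                .set("spark.executor.memory", "4g")        // memory settings
                .set("spark.memory.fraction", "0.6");      // memory split
    }
}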
A Workload-Specific Memory Capacity Configuration Approach for In-Memory Data Analytic Platforms
We propose WSMC, a workload-specific memory capacity configuration approach
for Spark workloads, which guides users on memory capacity configuration by
accurately predicting a workload's memory requirement under various input
data sizes and parameter settings. First, WSMC classifies in-memory computing
workloads into four categories according to their Data Expansion Ratio.
Second, WSMC establishes a memory requirement prediction model that accounts
for the input data size, the shuffle data size, the parallelism of the
workload, and the data block size. Finally, for each workload category, WSMC
calculates the shuffle data size in the prediction model in a
workload-specific way. For an ad-hoc workload, WSMC can profile its Data
Expansion Ratio with small-sized input data and decide which category the
workload falls into. Users can then determine an accurate configuration from
the corresponding memory requirement prediction. Through comprehensive
evaluations with SparkBench workloads, we found that, compared with the
default configuration, a configuration guided by WSMC saves over 40% of
memory capacity with only slight (5%) degradation in workload performance,
and that, compared with a proper configuration found manually, the
WSMC-guided configuration incurs only 7% more memory waste with a slight
(about 1%) improvement in workload performance.
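As a minimal sketch of the workflow described above, assuming a hypothetical four-way category split and a simple linear prediction form (the abstract does not give WSMC's actual model, thresholds, or shuffle-size estimators), the Java outline below profiles a Data Expansion Ratio on a small input, classifies the workload, and predicts a memory requirement from input size, shuffle size, parallelism, and block size.

/**
 * A sketch of a WSMC-style workflow. The category cut-points and the
 * prediction formula are assumed for illustration; they are not taken
 * from the paper.
 */
public class WsmcSketch {

    enum Category { LOW_EXPANSION, MODERATE, HIGH, EXTREME }

    // Step 1: profile the Data Expansion Ratio on a small input run
    // (ratio of in-memory data size to input data size).
    static double dataExpansionRatio(double smallInputBytes, double cachedBytes) {
        return cachedBytes / smallInputBytes;
    }

    // Step 2: map the ratio to one of four categories (hypothetical cut-points).
    static Category classify(double der) {
        if (der < 1.0) return Category.LOW_EXPANSION;
        if (der < 2.0) return Category.MODERATE;
        if (der < 4.0) return Category.HIGH;
        return Category.EXTREME;
    }

    // Step 3: predict memory from input size, an estimated shuffle size,
    // parallelism, and block size; a simple linear form is assumed.
    static double predictMemoryBytes(double inputBytes, double shuffleBytes,
                                     int parallelism, double blockBytes) {
        double perTaskData = (inputBytes + shuffleBytes) / parallelism;
        return perTaskData + blockBytes; // headroom of one block per task
    }
}

In WSMC proper, the shuffle-size term would itself be computed in a category-specific way; here it is simply taken as an input to keep the sketch self-contained.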
Methods to Improve Applicability and Efficiency of Distributed Data-Centric Compute Frameworks
The success of modern applications depends on the insights they collect from their data repositories. Data repositories for such applications currently exceed exabytes and are rapidly growing in size, as they collect data from varied sources: web applications, mobile phones, sensors, and other connected devices. Distributed storage and data-centric compute frameworks have been invented to store and analyze these large datasets. This dissertation focuses on extending the applicability and improving the efficiency of distributed data-centric compute frameworks.