Experimental Performance Evaluation of Cloud-Based Analytics-as-a-Service
An increasing number of Analytics-as-a-Service solutions have recently emerged
in the landscape of cloud-based services. These services allow the
flexible composition of compute and storage components into powerful
data ingestion and processing pipelines. This work is a first attempt at an
experimental evaluation of the performance of analytic applications executed using a
wide range of storage service configurations. We present an intuitive notion of
data locality, which we use as a proxy to rank different service compositions in
terms of expected performance. Through an empirical analysis, we dissect the
performance achieved by analytic workloads and unveil problems due to the
impedance mismatch that arises in some configurations. Our work paves the way to
a better understanding of modern cloud-based analytic services and their
performance, both for their end-users and their providers.
Comment: Longer version of the paper in submission at IEEE CLOUD'1
On the Energy Efficiency of MapReduce Shuffling Operations in Data Centers
This paper aims to quantitatively measure the impact of different data center networking topologies on the performance and energy efficiency of shuffling operations in MapReduce. Mixed Integer Linear Programming (MILP) models are utilized to optimize the shuffling in several data center topologies with electronic, hybrid, and all-optical switching while maximizing the throughput and reducing the power consumption. The results indicate that the networking topology has a significant impact on the performance of MapReduce. They also indicate that, with comparable performance, optical-based data centers can achieve an average reduction of 54% in energy consumption when compared to electronic switching data centers.
Towards Building Wind Tunnels for Data Center Design
Data center design is a tedious and expensive process. Recently, this process has become even more challenging as users of cloud services expect to have guaranteed levels of availability, durability and performance. A new challenge for the service providers is to find the most cost-effective data center design and configuration that will accommodate the users' expectations, on ever-changing workloads, and constantly evolving hardware and software components. In this paper, we argue that data center design should become a systematic process. First, it should be done using an integrated approach that takes into account both the hardware and the software interdependencies, and their impact on users' expectations. Second, it should be performed in a “wind tunnel”, which uses large-scale simulation to systematically explore the impact of a data center configuration on both the users' and the service providers' requirements. We believe that this is the first step towards systematic data center design – an exciting area for future research.
A Big Data Analyzer for Large Trace Logs
The current generation of Internet-based services is typically hosted in large
data centers that take the form of warehouse-size structures housing tens of
thousands of servers. Continued availability of a modern data center is the
result of a complex orchestration among many internal and external actors
including computing hardware, multiple layers of intricate software, networking
and storage devices, electrical power and cooling plants. During the course of
their operation, many of these components produce large amounts of data in the
form of event and error logs that are essential not only for identifying and
resolving problems but also for improving data center efficiency and
management. Most of these activities would benefit significantly from data
analytics techniques to exploit hidden statistical patterns and correlations
that may be present in the data. The sheer volume of data to be analyzed makes
uncovering these correlations and patterns a challenging task. This paper
presents BiDAl, a prototype Java tool for log-data analysis that incorporates
several Big Data technologies in order to simplify the task of extracting
information from data traces produced by large clusters and server farms. BiDAl
provides the user with several analysis languages (SQL, R and Hadoop MapReduce)
and storage backends (HDFS and SQLite) that can be freely mixed and matched so
that a custom tool for a specific task can be easily constructed. BiDAl has a
modular architecture so that it can be extended with other backends and
analysis languages in the future. In this paper we present the design of BiDAl
and describe our experience using it to analyze publicly-available traces from
Google data clusters, with the goal of building a realistic model of a complex
data center.
Comment: 26 pages, 10 figures
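The SQL-over-SQLite style of trace analysis mentioned in the BiDAl abstract can be illustrated with a minimal sketch. This is not BiDAl's interface (BiDAl is a Java tool); the table name `task_events`, its columns, and the sample rows are hypothetical and only loosely modeled on cluster task-event traces.

```python
import sqlite3

# In-memory SQLite database standing in for a trace storage backend.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per task event recorded in the trace.
conn.execute(
    "CREATE TABLE task_events (machine_id INTEGER, event_type TEXT, cpu_request REAL)"
)

# Made-up sample events, for illustration only.
events = [
    (1, "SCHEDULE", 0.25),
    (1, "FAIL", 0.25),
    (2, "SCHEDULE", 0.50),
    (2, "FINISH", 0.50),
    (2, "FAIL", 0.10),
    (3, "FAIL", 0.75),
]
conn.executemany("INSERT INTO task_events VALUES (?, ?, ?)", events)

# Count failure events per machine, a typical first step when looking
# for statistical patterns in error logs.
failures_per_machine = conn.execute(
    "SELECT machine_id, COUNT(*) FROM task_events "
    "WHERE event_type = 'FAIL' GROUP BY machine_id ORDER BY machine_id"
).fetchall()
conn.close()

print(failures_per_machine)
```

The same query could be pointed at a file-backed database by replacing `":memory:"` with a path; mixing in other analysis languages, as the abstract describes, is outside the scope of this sketch.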
Characterization of Performance Anomalies in Hadoop
With the huge variety of data and equally large-scale systems, there is no
single execution setting for these systems that can guarantee the best
performance for each query. In this project, we tried to study the impact of
different execution settings on execution time of workloads by varying them one
at a time. Using the data from these experiments, a decision tree was built
where each internal node represents an execution parameter, each branch
represents a value chosen for the parameter, and each leaf node represents a range
for execution time in minutes. The attribute on which to split the dataset
is selected based on the maximum information gain, that is, the lowest remaining entropy.
Once the tree is trained with the training samples, it can be used to
get an approximate range for the expected execution time. When the actual
execution time differs from this expected value, a performance anomaly can be
detected. For a test dataset with 400 samples, 99% of samples had an actual
execution time within the range predicted by the decision tree. Analyzing
the constructed tree also gives an idea of which configuration can give
better performance for a given workload. Initial experiments
suggest that the impact an execution parameter can have on the target attribute
(here execution time) can be related to the distance of that feature node from
the root of the constructed decision tree. In initial results, the percent
change in the values of the target attribute across the values of a feature
node close to the root is 6 times larger than when that same
feature node is far from the root node. This observation depends on how
well the decision tree was trained and may not be true for every case.
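The split-selection and anomaly-check ideas in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the parameter names (`reducers`, `compression`), the execution-time ranges used as class labels, and the sample rows are all hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the samples on `attribute`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    # Weighted entropy that remains after the split.
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """Pick the attribute with the maximum information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


def is_anomaly(actual_minutes, predicted_range):
    """Flag a run whose execution time falls outside the predicted range."""
    low, high = predicted_range
    return not (low <= actual_minutes <= high)


# Hypothetical training samples: execution settings and the observed
# execution-time range (in minutes) used as the class label.
rows = [
    {"reducers": 4, "compression": "on"},
    {"reducers": 4, "compression": "off"},
    {"reducers": 16, "compression": "on"},
    {"reducers": 16, "compression": "off"},
]
labels = ["10-20", "10-20", "0-10", "0-10"]

print(best_attribute(rows, labels, ["compression", "reducers"]))
print(is_anomaly(25, (10, 20)))
```

In this toy data the number of reducers fully determines the time range, so it carries all of the information gain and would be placed at the root; the compression setting carries none.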
Learning-based Automatic Parameter Tuning for Big Data Analytics Frameworks
Big data analytics frameworks (BDAFs) have been widely used for data
processing applications. These frameworks provide a large number of
configuration parameters to users, which leads to a tuning issue that
overwhelms users. To address this issue, many automatic tuning approaches have
been proposed. However, it remains a critical challenge to generate enough
samples in a high-dimensional parameter space within a time constraint. In this
paper, we present AutoTune, an automatic parameter tuning system that aims to
optimize application execution time on BDAFs. AutoTune first constructs a
smaller-scale testbed from the production system so that it can generate more
samples, and thus train a better prediction model, under a given time
constraint. Furthermore, the AutoTune algorithm produces a set of samples that
can provide a wide coverage over the high-dimensional parameter space, and
searches for more promising configurations using the trained prediction model.
AutoTune is implemented and evaluated using the Spark framework and HiBench
benchmark deployed on a public cloud. Extensive experimental results illustrate
that AutoTune improves on default configurations by 63.70% on average, and on
the five state-of-the-art tuning algorithms by 6%-23%.
Comment: 12 pages, submitted to IEEE BigData 201
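The general workflow described in the AutoTune abstract (sample the parameter space with wide coverage, measure on a small testbed, train a prediction model, then search with the model) can be sketched as below. The abstract does not name the sampling scheme or the model, so Latin hypercube sampling and a nearest-neighbor predictor are used here purely as stand-ins. The function `toy_testbed_runtime` and the two parameters (executor count, memory in GB) are hypothetical and replace real runs on a testbed.

```python
import random


def latin_hypercube(n_samples, bounds, rng):
    """Sample n_samples points so that, in every dimension, each of the
    n_samples equal-width strata contains exactly one point."""
    columns = []
    for low, high in bounds:
        width = (high - low) / n_samples
        column = [low + (i + rng.random()) * width for i in range(n_samples)]
        rng.shuffle(column)  # decouple the strata across dimensions
        columns.append(column)
    return [tuple(col[i] for col in columns) for i in range(n_samples)]


def knn_predict(train, point, k=3):
    """Stand-in prediction model: mean run time of the k nearest measured configs."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], point)),
    )[:k]
    return sum(runtime for _, runtime in nearest) / len(nearest)


def toy_testbed_runtime(config):
    """Hypothetical replacement for running the application on a testbed."""
    executors, memory_gb = config
    return (executors - 8) ** 2 + (memory_gb - 4) ** 2 + 10


def tune(bounds, n_train=30, n_candidates=500, seed=0):
    rng = random.Random(seed)
    # 1. Measure a small, well-spread set of configurations.
    train = [(c, toy_testbed_runtime(c)) for c in latin_hypercube(n_train, bounds, rng)]
    # 2. Use the model to screen many more candidates without running them.
    candidates = latin_hypercube(n_candidates, bounds, rng)
    best = min(candidates, key=lambda c: knn_predict(train, c))
    # 3. Measure only the most promising candidate.
    return best, toy_testbed_runtime(best)


bounds = [(1, 16), (1, 8)]  # (executors, memory in GB), illustrative ranges
best_config, best_time = tune(bounds)
print(best_config, best_time)
```

The point of the sketch is the division of labor: expensive measurements are spent on a few well-spread samples, while the cheap model screens many candidates. A real tuner would also normalize parameter ranges and handle categorical settings.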