543 research outputs found
OS-Assisted Task Preemption for Hadoop
This work introduces a new task preemption primitive for Hadoop that allows
tasks to be suspended and resumed by exploiting memory management mechanisms
readily available in modern operating systems. Our technique fills the gap
between the two extreme cases of killing tasks (which wastes work) and waiting
for their completion (which introduces latency): experimental results indicate
superior performance and very small overheads compared to existing
alternatives.
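The suspend/resume idea can be illustrated at the OS level. The sketch below is a hypothetical, POSIX-only analogue, not the paper's Hadoop implementation: SIGSTOP freezes a task so the kernel remains free to page out its memory under pressure, and SIGCONT resumes it with no lost work, unlike killing the task. The `preempt`/`resume` helper names are invented for illustration.

```python
import os
import signal
import subprocess
import sys
import time

def preempt(proc: subprocess.Popen) -> None:
    """Suspend the task; the OS may page out its memory while stopped."""
    os.kill(proc.pid, signal.SIGSTOP)

def resume(proc: subprocess.Popen) -> None:
    """Resume the task exactly where it left off; no work is lost."""
    os.kill(proc.pid, signal.SIGCONT)

if __name__ == "__main__":
    # A stand-in low-priority "task": sleeps briefly, then emits a result.
    task = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(0.2); print('done')"],
        stdout=subprocess.PIPE, text=True)
    preempt(task)      # higher-priority work could run here meanwhile
    time.sleep(0.1)
    resume(task)       # the suspended task continues, rather than restarting
    out, _ = task.communicate()
    print(out.strip())
```

The same contrast the abstract draws applies here: killing the child would discard its partial sleep/work, while waiting for it blocks the caller; stop-and-continue avoids both.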
Characterizing and Subsetting Big Data Workloads
Big data benchmark suites must include a diversity of data and workloads to
be useful in fairly evaluating big data systems and architectures. However,
using truly comprehensive benchmarks poses great challenges for the
architecture community. First, we need to thoroughly understand the behaviors
of a variety of workloads. Second, our usual simulation-based research methods
become prohibitively expensive for big data. As big data is an emerging field,
more and more software stacks are being proposed to facilitate the development
of big data applications, which aggravates these challenges. In this paper, we
first use Principal Component Analysis (PCA) to identify the most important
characteristics from 45 metrics to characterize big data workloads from
BigDataBench, a comprehensive big data benchmark suite. Second, we apply a
clustering technique to the principal components obtained from the PCA to
investigate the similarity among big data workloads, and we verify the
importance of including different software stacks for big data benchmarking.
Third, we select seven representative big data workloads by removing redundant
ones and release the BigDataBench simulation version, which is publicly
available from http://prof.ict.ac.cn/BigDataBench/simulatorversion/.
Comment: 11 pages, 6 figures, 2014 IEEE International Symposium on Workload Characterization
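The characterization pipeline described above (PCA to reduce 45 metrics, clustering on the principal components, then picking one representative per cluster) can be sketched as follows. The metric matrix is synthetic and the naive k-means is a stand-in, not the paper's data or exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
# 12 workloads x 45 metrics (synthetic stand-in for measured characteristics)
X = rng.normal(size=(12, 45))

# PCA via SVD on the standardized metric matrix
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
explained = (S ** 2) / (S ** 2).sum()
# Keep enough components to explain ~85% of the variance (threshold assumed)
k = int(np.searchsorted(np.cumsum(explained), 0.85)) + 1
pcs = Xs @ Vt[:k].T   # workloads projected onto k principal components

def kmeans(P, n_clusters, iters=50):
    """Naive k-means on the principal-component space."""
    centers = P[rng.choice(len(P), n_clusters, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((P[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(n_clusters):
            if (labels == c).any():
                centers[c] = P[labels == c].mean(axis=0)
    return labels

labels = kmeans(pcs, 7)   # seven clusters, mirroring the paper's subset size

# Representative subset: the member closest to each cluster's centroid
reps = []
for c in np.unique(labels):
    members = np.where(labels == c)[0]
    dist = ((pcs[members] - pcs[members].mean(axis=0)) ** 2).sum(axis=1)
    reps.append(int(members[np.argmin(dist)]))
print("representatives:", reps)
```

Redundant workloads land in the same cluster and are dropped, which is how a comprehensive suite shrinks to a simulation-friendly subset.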
DBT-5: An Open-Source TPC-E Implementation for Global Performance Measurement of Computer Systems
TPC-E is the new benchmark approved by the TPC council. It is designed to exercise a brokerage firm workload, which is representative of current On-Line Transaction Processing (OLTP) workloads. In this paper we present DBT-5, a fair-usage open-source implementation of the TPC-E benchmark. In addition to reporting on the design and implementation of the tool, we also describe experimental results from a system running a database engine. The significance of this work is that it provides an environment in which recent innovations in the OLTP workload field can be evaluated.
Does Big Data Require Complex Systems? A Performance Comparison Between Spark and Unicage Shell Scripts
The paradigm of big data is characterized by the need to collect and process
data sets of great volume, arriving at the systems with great velocity, in a
variety of formats. Spark is a widely used big data processing system that can
be integrated with Hadoop to provide powerful abstractions to developers, such
as distributed storage through HDFS and resource management through YARN. When
all the required configurations are made, Spark can also provide quality
attributes, such as scalability, fault tolerance, and security. However, all of
these benefits come at the cost of complexity, with high memory requirements,
and additional latency in processing. An alternative approach is to use a lean
software stack, like Unicage, that delegates most control back to the
developer. In this work we evaluated the performance of big data processing
with Spark versus Unicage, in a cluster environment hosted in the IBM Cloud.
Two sets of experiments were performed: batch processing of unstructured data
sets, and query processing of structured data sets. The input data sets were of
significant size, ranging from 64 GB to 8192 GB in volume. The results show
that the performance of Unicage scripts is superior to Spark for search
workloads like grep and select, but that the abstractions of distributed
storage and resource management from the Hadoop stack enable Spark to execute
workloads with inter-record dependencies, such as sort and join, with correct
outputs.
Comment: 10 pages, 14 figures
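The distinction these results hinge on, record-local versus inter-record workloads, can be sketched in a few lines: a grep/select filters each partition independently with no coordination, while a sort needs a global merge across partitions, which is exactly the coordination a distributed shuffle layer provides. The toy partitions below are illustrative.

```python
import heapq

# Three data partitions, as if split across cluster nodes
partitions = [[5, 1, 9], [7, 3], [8, 2, 6]]

# grep/select-style workload: each record is handled independently,
# so partitions can be processed in isolation and results concatenated.
matches = [x for part in partitions for x in part if x % 2 == 0]

# sort-style workload: local sorting is not enough; a global k-way
# merge across all partitions is required for a correct output.
globally_sorted = list(heapq.merge(*(sorted(p) for p in partitions)))

print(matches)          # order depends only on partition order
print(globally_sorted)  # a single, globally correct ordering
```

A lean per-node script can win on the first kind of workload; the second kind is where the distributed storage and resource-management abstractions pay for their overhead.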
The state of SQL-on-Hadoop in the cloud
Managed Hadoop in the cloud, especially SQL-on-Hadoop, has been gaining attention recently. On Platform-as-a-Service (PaaS), analytical services like Hive and Spark come preconfigured for general-purpose use, giving companies a quick entry and on-demand deployment of ready SQL-like solutions for their big data needs. This study evaluates cloud services from an end-user perspective, comparing providers including Microsoft Azure, Amazon Web Services, Google Cloud, and Rackspace. The study focuses on performance, readiness, scalability, and cost-effectiveness of the different solutions at entry/test-level cluster sizes. Results are based on over 15,000 Hive queries derived from the industry-standard TPC-H benchmark.
The study is framed within the ALOJA research project, which features an open source benchmarking and analysis platform that has been recently extended to support SQL-on-Hadoop engines.
The ALOJA Project aims to lower the total cost of ownership (TCO) of big data deployments and study their performance characteristics for optimization.
The study benchmarks cloud providers across a diverse range of instance types, and uses input data scales from 1 GB to 1 TB, in order to survey the popular entry-level PaaS SQL-on-Hadoop solutions, thereby establishing a common results base upon which subsequent research can be carried out by the project. Initial results already show the main performance trends related to hardware and software configuration, pricing, and the similarities and architectural differences of the evaluated PaaS solutions. Whereas some providers focus on decoupling storage and computing resources while offering network-based elastic storage, others choose to keep the local processing model from Hadoop for high performance, at the cost of flexibility. Results also show the importance of application-level tuning, and how keeping hardware and software stacks up to date can influence performance even more than replicating the on-premises model in the cloud.
This work is partially supported by the Microsoft Azure for Research program, the European Research Council (ERC) under the EU's Horizon 2020 programme (GA 639595), the Spanish Ministry of Education (TIN2015-65316-P), and the Generalitat de Catalunya (2014-SGR-1051).
Peer Reviewed. Postprint (author's final draft).