Resource Management and Scheduling for Big Data Applications in Cloud Computing Environments
This chapter presents the software architectures of big data processing
platforms. It provides in-depth knowledge of the resource management
techniques involved in deploying big data processing systems in cloud
environments. It starts from the very basics and gradually introduces the
core components of resource management, which we have divided into multiple
layers. It covers state-of-the-art practices and research in SLA-based
resource management, with a specific focus on job scheduling mechanisms.
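As a rough illustration of the kind of SLA-based scheduling this chapter
surveys (not an algorithm from the chapter itself), the sketch below places
jobs on VM queues in earliest-deadline-first order and flags SLA violations;
the runtime model and all names are illustrative assumptions.

```python
import heapq

def sla_schedule(jobs, vms):
    """Greedy deadline-aware placement: each job goes to the VM whose queue
    frees up earliest. jobs: list of (job_id, est_runtime_s, deadline_s);
    vms: list of vm ids. Returns (placements, violations)."""
    ready = [(0.0, vm) for vm in vms]  # min-heap of (free_at, vm_id)
    heapq.heapify(ready)
    placements, violations = {}, []
    # Serving tighter deadlines first (EDF) reduces avoidable SLA misses.
    for job_id, runtime, deadline in sorted(jobs, key=lambda j: j[2]):
        free_at, vm = heapq.heappop(ready)
        finish = free_at + runtime
        if finish > deadline:
            violations.append(job_id)  # no VM becomes free any earlier
        placements[job_id] = vm
        heapq.heappush(ready, (finish, vm))
    return placements, violations

if __name__ == "__main__":
    jobs = [("wordcount", 120, 300), ("pagerank", 600, 900), ("etl", 60, 100)]
    print(sla_schedule(jobs, ["vm-1", "vm-2"]))
```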
Efficient and Reliable Hybrid Cloud Architecture for Big Data
The objective of our paper is to propose a cloud computing framework that is
feasible and necessary for handling huge volumes of data. In our prototype
system we consider the national ID database structure of Bangladesh, prepared
by the Election Commission of Bangladesh. Using this database we propose an
interactive graphical user interface for Bangladeshi People Search (BDPS)
that uses a hybrid cloud computing structure managed by Apache Hadoop, with
the database implemented in HiveQL. The infrastructure is divided into two
parts: a locally hosted cloud based on Eucalyptus and a remote cloud
implemented on the well-known Amazon Web Services (AWS). Some problems common
in the Bangladeshi context, including data traffic congestion, server
timeouts, and server downtime, are also discussed.
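The paper's schema is not reproduced here, but a people-search lookup over a
national ID table in HiveQL might look like the following minimal sketch; the
table name, column names, and the use of the PyHive client are assumptions
made for illustration.

```python
# Hypothetical BDPS lookup against a Hive table (requires `pip install pyhive`).
from pyhive import hive

QUERY = """
SELECT nid, name, district, date_of_birth
FROM national_id
WHERE name LIKE %(pattern)s AND district = %(district)s
LIMIT 50
"""

def search_people(pattern, district):
    # HiveServer2 conventionally listens on port 10000.
    conn = hive.Connection(host="localhost", port=10000, database="bdps")
    cur = conn.cursor()
    cur.execute(QUERY, {"pattern": f"%{pattern}%", "district": district})
    rows = cur.fetchall()
    cur.close()
    conn.close()
    return rows

if __name__ == "__main__":
    for row in search_people("Rahman", "Dhaka"):
        print(row)
```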
Characterization and Architectural Implications of Big Data Workloads
Big data areas are expanding rapidly in terms of workloads and runtime
systems, and this situation poses a serious challenge to workload
characterization, which is the foundation of innovative system and
architecture design. Previous major efforts on big data benchmarking either
propose comprehensive but very large collections of workloads, or select only
a few workloads according to so-called popularity, which may lead to partial
or even biased observations. In this paper, on the basis of a comprehensive
big data benchmark suite---BigDataBench, we reduce 77 workloads to 17
representative workloads from a micro-architectural perspective. On a typical
state-of-practice platform---Intel Xeon E5645, we compare the representative
big data workloads with SPECINT, SPECCFP, PARSEC, CloudSuite, and HPCC. After
a comprehensive workload characterization, we make the following
observations. First, the big data workloads are data-movement-dominated
computing with more branch operations, taking up to 92% of the instruction
mix, which places them in a different class from Desktop (SPEC CPU2006), CMP
(PARSEC), and HPC (HPCC) workloads. Second, corroborating previous work,
Hadoop- and Spark-based big data workloads have higher front-end stalls.
Compared with traditional workloads such as PARSEC, the big data workloads
have larger instruction footprints. We also note that, in addition to varied
instruction-level parallelism, there are significant disparities in front-end
efficiency among different big data workloads. Third, we find that complex
software stacks that fail to use state-of-practice processors efficiently are
one of the main factors leading to high front-end stalls. For the same
workloads, L1I cache miss rates differ by an order of magnitude across
implementations with different software stacks.
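A common way to quantify the front-end stalls this paper reports is the
top-down "front-end bound" ratio derived from hardware performance counters.
The sketch below, which is not from the paper, parses machine-readable `perf
stat` output for two such counters; the event names assume a 4-wide Sandy
Bridge-or-later core, and older microarchitectures (e.g. the Westmere-based
Xeon E5645 used in the paper) expose different events.

```python
import subprocess

EVENTS = "idq_uops_not_delivered.core,cpu_clk_unhalted.thread"

def frontend_bound(cmd):
    """Estimate the fraction of issue slots lost to front-end stalls
    (top-down method): uops not delivered / (4 slots * cycles)."""
    # perf's CSV-style (-x,) output goes to stderr: value,unit,event,...
    out = subprocess.run(["perf", "stat", "-x,", "-e", EVENTS] + cmd,
                         capture_output=True, text=True).stderr
    counts = {}
    for line in out.splitlines():
        parts = line.split(",")
        if len(parts) >= 3 and parts[0].isdigit():
            counts[parts[2]] = int(parts[0])
    slots = 4 * counts["cpu_clk_unhalted.thread"]  # 4 issue slots per cycle
    return counts["idq_uops_not_delivered.core"] / slots

if __name__ == "__main__":
    print(frontend_bound(["python3", "-c", "sum(range(10**6))"]))
```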
A Survey on Geographically Distributed Big-Data Processing using MapReduce
Hadoop and Spark are widely used distributed processing frameworks for
large-scale data processing in an efficient and fault-tolerant manner on
private or public clouds. These big-data processing systems are extensively
used by many industries, e.g., Google, Facebook, and Amazon, for solving a
large class of problems, e.g., search, clustering, log analysis, different
types of join operations, matrix multiplication, pattern matching, and social
network analysis. However, all these popular systems have a major drawback in
terms of locally distributed computation, which prevents them from supporting
geographically distributed data processing. The increasing amount of
geographically distributed massive data is pushing industry and academia to
rethink current big-data processing systems. Novel frameworks, which go
beyond the state-of-the-art architectures and technologies of current
systems, are expected to process geographically distributed data at their
locations without moving entire raw datasets to a single location. In this
paper, we investigate and discuss challenges and requirements in designing
geographically distributed data processing frameworks and protocols. We
classify and study batch processing (MapReduce-based systems), stream
processing (Spark-based systems), and SQL-style processing geo-distributed
frameworks, models, and algorithms, together with their overhead issues.
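The core idea behind many of the designs this survey covers is to push
computation to each site and move only compact partial results instead of raw
data. A minimal sketch of that pattern (invented here, not taken from the
survey) for a geo-distributed word count:

```python
from collections import Counter

# Each "site" holds its raw logs locally; only small Counter objects
# ever cross the wide-area network.
SITES = {
    "us-east": ["error timeout", "ok", "error disk"],
    "eu-west": ["ok ok", "error timeout"],
}

def local_count(records):
    """Runs at the data's location: the map+combine step of a geo-aware job."""
    c = Counter()
    for line in records:
        c.update(line.split())
    return c  # a few kilobytes instead of the entire raw dataset

def global_merge(partials):
    """Runs at the coordinating site: merges per-site partial aggregates."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total

print(global_merge(local_count(r) for r in SITES.values()))
```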
Hadoop on HPC: Integrating Hadoop and Pilot-based Dynamic Resource Management
High-performance computing platforms such as supercomputers have
traditionally been designed to meet the compute demands of scientific
applications. Consequently, they have been architected as producers, not
consumers, of data. The Apache Hadoop ecosystem has evolved to meet the
requirements of data processing applications and has addressed many of the
limitations of HPC platforms. There exists, however, a class of scientific
applications that needs the collective capabilities of traditional
high-performance computing environments and the Apache Hadoop ecosystem. For
example, the scientific domains of bio-molecular dynamics, genomics, and
network science need to couple traditional computing with Hadoop/Spark-based
analysis. We investigate the critical question of how to present the
capabilities of both computing environments to such scientific applications.
While this question needs answers at multiple levels, we focus on the design
of resource management middleware that might support the needs of both. We
propose extensions to the Pilot-Abstraction to provide a unifying resource
management layer. This is an important step that allows applications to
integrate HPC stages (e.g. simulations) with data analytics. Many
supercomputing centers have started to officially support Hadoop
environments, either in a dedicated environment or in hybrid deployments
using tools such as myHadoop. This typically involves many intrinsic,
environment-specific details that need to be mastered, and these often swamp
conceptual issues such as: How best to couple HPC and Hadoop application
stages? How to explore runtime trade-offs (data locality vs. data movement)?
This paper provides both conceptual understanding and practical solutions for
the integrated use of HPC and Hadoop environments.
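The paper's Pilot-Abstraction extensions are not reproduced here, but the
coupling pattern they target, an HPC simulation stage feeding a
data-analytics stage under a single resource allocation, can be sketched
generically; the commands, file paths, and task sizes below are placeholders,
not the paper's middleware API.

```python
import pathlib
import subprocess

def run_pipeline(workdir="run01"):
    wd = pathlib.Path(workdir)
    wd.mkdir(exist_ok=True)
    traj = wd / "trajectory.out"
    # Stage 1 (HPC): a hypothetical MPI simulation writes a trajectory file.
    subprocess.run(["mpirun", "-n", "64", "./md_sim", "--out", str(traj)],
                   check=True)
    # Stage 2 (analytics): a Spark job consumes the simulation output.
    # In a pilot-based setup both stages run inside the same resource
    # allocation, avoiding a second batch-queue wait between them.
    subprocess.run(["spark-submit", "analyze_trajectory.py", str(traj)],
                   check=True)

if __name__ == "__main__":
    run_pipeline()
```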
Security and Privacy Aspects in MapReduce on Clouds: A Survey
MapReduce is a programming system for the distributed processing of
large-scale data in an efficient and fault-tolerant manner on a private,
public, or hybrid cloud. It is used extensively every day around the world as
an efficient distributed computation tool for a large class of problems,
e.g., search, clustering, log analysis, different types of join operations,
matrix multiplication, pattern matching, and analysis of social networks.
Security and privacy of data and of MapReduce computations are essential
concerns when a MapReduce computation is executed in public or hybrid clouds.
In order to execute a MapReduce job in public and hybrid clouds,
authentication of mappers and reducers, confidentiality of data and
computations, integrity of data and computations, and correctness and
freshness of the outputs are required. Satisfying these requirements shields
the computation from several types of attacks on data and MapReduce
computations. In this paper, we investigate and discuss security and privacy
challenges and requirements, considering a variety of adversarial
capabilities and characteristics in the scope of MapReduce. We also provide a
review of existing security and privacy protocols for MapReduce and discuss
their overhead issues.
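As a toy illustration of the integrity and authentication requirements listed
above (not a protocol from the survey), a mapper could attach an HMAC to each
intermediate record so that reducers detect tampering in transit; key
distribution is assumed to be handled out of band.

```python
import hashlib
import hmac
import json

SHARED_KEY = b"assumed pre-shared mapper/reducer key"  # illustrative only

def tag(record):
    """Mapper side: emit the record with a MAC over its canonical form."""
    payload = json.dumps(record, sort_keys=True).encode()
    mac = hmac.new(SHARED_KEY, payload, hashlib.sha256).hexdigest()
    return {"payload": payload.decode(), "mac": mac}

def verify(msg):
    """Reducer side: recompute and compare the MAC before reducing."""
    expected = hmac.new(SHARED_KEY, msg["payload"].encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, msg["mac"]):
        raise ValueError("intermediate record failed integrity check")
    return json.loads(msg["payload"])

msg = tag({"key": "word", "value": 3})
print(verify(msg))
```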
Low-Level Augmented Bayesian Optimization for Finding the Best Cloud VM
With the advent of big data applications, which tend to have long execution
times, choosing the right cloud VM to run these applications has significant
performance as well as economic implications. For example, in our large-scale
empirical study of 107 different workloads on three popular big data systems,
we found that a wrong choice can lead to a 20-times slowdown or an increase
in cost by a factor of 10.
Bayesian optimization is a technique for optimizing expensive (black-box)
functions. Previous attempts have used only instance-level information (such
as the number of cores and memory size), which is not sufficient to represent
the search space. In this work, we discover that this may lead to the
fragility problem: either incurring a high search cost or finding only
sub-optimal solutions. The central insight of this paper is to use low-level
performance information to augment the process of Bayesian optimization. Our
novel low-level augmented Bayesian optimization is rarely worse than current
practice and often performs much better (in 46 of 107 cases). Further, it
significantly reduces the search cost in nearly half of our case studies.
Based on this work, we conclude that it is often insufficient to use
general-purpose off-the-shelf methods for configuring cloud instances without
augmenting those methods with essential systems knowledge such as CPU
utilization, working memory size, and I/O wait time.
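The paper's exact model is not reproduced here, but the shape of the idea, a
Gaussian-process surrogate whose inputs concatenate instance-level features
with measured low-level metrics, can be sketched with scikit-learn; the
feature layout, candidate set, and expected-improvement acquisition below are
illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(gp, X_cand, y_best):
    """EI acquisition for minimizing runtime."""
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def next_vm(observed_X, observed_y, candidates):
    """Rows: [n_cores, mem_gb, cpu_util, io_wait, l3_miss_rate]. The last
    three are the low-level metrics augmenting instance-level features."""
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(observed_X, observed_y)
    ei = expected_improvement(gp, candidates, min(observed_y))
    return int(np.argmax(ei))  # index of the most promising VM to try next

# Toy usage with invented measurements (runtimes in seconds).
X = np.array([[4, 16, 0.7, 0.05, 0.02], [8, 32, 0.4, 0.20, 0.01]])
y = np.array([840.0, 510.0])
cand = np.array([[16, 64, 0.3, 0.10, 0.01], [8, 16, 0.8, 0.30, 0.04]])
print("try candidate", next_vm(X, y, cand))
```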
A Taxonomy and Survey on eScience as a Service in the Cloud
Cloud computing has recently evolved into a popular computing infrastructure
for many applications. Scientific computing, which was mainly hosted in
private clusters and grids, has started to migrate development and deployment
to the public cloud environment. eScience as a service is becoming an
emerging and promising direction for scientific computing. We review recent
efforts in developing and deploying scientific computing applications in the
cloud. In particular, we introduce a taxonomy specifically designed for
scientific computing in the cloud, and further review the taxonomy with four
major kinds of science applications, including life sciences, physics
sciences, social and humanities sciences, and climate and earth sciences. Our
major finding is that, despite existing efforts in developing cloud-based
eScience, eScience still has a long way to go to fully unlock the power of
the cloud computing paradigm. Therefore, we present the challenges and
opportunities in the future development of cloud-based eScience services, and
call for collaborations and innovations from both the scientific and computer
systems communities to address those challenges.
A Comparative Study of Association Rule Mining Algorithms on Grid and Cloud Platform
Association rule mining is a time-consuming process because it is both
data-intensive and computation-intensive. In order to mine large volumes of
data and to enhance the scalability and performance of existing sequential
association rule mining algorithms, parallel and distributed algorithms have
been developed. These traditional parallel and distributed algorithms assume
a homogeneous platform and are not attractive for heterogeneous platforms
such as grids and clouds. This calls for the design of new algorithms that
address the issues of good data set partitioning and distribution, load
balancing strategy, and optimization of communication and synchronization
among processors in such heterogeneous systems. Grids and clouds are emerging
platforms for distributed data processing, and various association rule
mining algorithms have been proposed on them. This survey article integrates
a brief architectural overview of distributed systems with various recent
approaches to grid-based and cloud-based association rule mining algorithms,
from a comparative perspective. We differentiate the association rule mining
algorithms developed on these architectures on the basis of data locality,
programming paradigm, fault tolerance, communication cost, and the
partitioning and distribution of data sets. Although it does not cover all
algorithms, this survey can be very useful for new researchers working in the
direction of distributed association rule mining algorithms.
Running genetic algorithms on Hadoop for solving high dimensional optimization problems
Hadoop is a popular MapReduce framework for developing parallel applications
in distributed environments. Several advantages of MapReduce, such as ease of
programming and the ability to use commodity hardware, make it easier than
before to apply soft computing methods to parallel and distributed systems.
In this paper, we present the results of an experimental study of running
soft computing algorithms using Hadoop. The study shows how a simple genetic
algorithm running on Hadoop can be used to produce solutions for
high-dimensional optimization problems. In addition, a simple but effective
technique that does not need MapReduce chains is proposed.
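The paper's Hadoop jobs are not reproduced here, but the map/reduce split a
genetic algorithm naturally admits, fitness evaluation in the map phase and
selection plus variation in the reduce phase, can be sketched in plain
Python; the sphere objective and all parameters are illustrative.

```python
import random

DIM, POP, GENS = 50, 40, 30  # toy sizes for a high-dimensional problem

def fitness_map(individual):
    """Map phase: evaluate one individual (parallel across Hadoop tasks)."""
    return sum(x * x for x in individual), individual  # minimize sphere fn

def breed_reduce(scored):
    """Reduce phase: keep the best half, refill with mutated copies."""
    scored.sort(key=lambda s: s[0])
    parents = [ind for _, ind in scored[: POP // 2]]
    children = [[x + random.gauss(0, 0.1) for x in random.choice(parents)]
                for _ in range(POP - len(parents))]
    return parents + children

population = [[random.uniform(-5, 5) for _ in range(DIM)] for _ in range(POP)]
for _ in range(GENS):
    population = breed_reduce([fitness_map(ind) for ind in population])
print("best fitness:", min(sum(x * x for x in ind) for ind in population))
```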