Cache Affinity-aware In-memory Caching Management for Hadoop
Department of Computer Science and Engineering
In this paper, we investigate techniques to effectively manage HDFS in-memory caching for Hadoop.
We first revisit the current implementation of Hadoop with HDFS in-memory caching to understand its limitations in using in-memory caching effectively.
For various representative MapReduce applications, we also evaluate the degree of benefit each application can get from in-memory caching, i.e., its cache affinity.
We then propose an adaptive cache-local scheduling algorithm that computes how long a MapReduce job waits to be scheduled on a cache-local node, making the wait proportional to the percentage of the job's input data that is cached.
In addition, we propose a block-goodness-aware cache replacement algorithm that determines which blocks are cached and evicted based on the access rate and the cache affinity of applications.
Using various workloads consisting of multiple MapReduce applications, we conduct an extensive experimental study to demonstrate the effects of the proposed in-memory orchestration techniques. Our experimental results show that our enhanced Hadoop in-memory caching scheme improves the performance of the MapReduce workloads.
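The proportional-wait idea described in this abstract can be illustrated with a minimal Python sketch. The function name `cache_local_wait`, the `max_wait_s` cap, and the linear rule are assumptions made for illustration; the abstract does not state the paper's actual formula.

```python
def cache_local_wait(cached_bytes: int, input_bytes: int,
                     max_wait_s: float = 10.0) -> float:
    """Return how long a job may wait for a cache-local node.

    Hypothetical linear rule: the wait grows with the fraction of the
    job's input that is already cached, up to max_wait_s (an assumed cap).
    """
    if input_bytes <= 0:
        # No input data, so there is nothing to gain from waiting.
        return 0.0
    # Clamp the cached fraction to the range [0, 1].
    fraction = min(max(cached_bytes / input_bytes, 0.0), 1.0)
    return max_wait_s * fraction
```

Under this rule, a job with no cached input is scheduled immediately, while a job whose input is fully cached may wait up to the assumed cap for a cache-local slot.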
Hadoop-Oriented SVM-LRU (H-SVM-LRU): An Intelligent Cache Replacement Algorithm to Improve MapReduce Performance
Modern applications can generate a large amount of data from different
sources with high velocity, a combination that is difficult to store and
process via traditional tools. Hadoop is one framework that is used for the
parallel processing of a large amount of data in a distributed environment;
however, various challenges can lead to poor performance. Two particular issues
that can limit performance are the high access time for I/O operations and the
recomputation of intermediate data. The combination of these two issues can
result in resource wastage. In recent years, there have been attempts to
overcome these problems by using caching mechanisms. Due to cache space
limitations, it is crucial to use this space efficiently and avoid cache
pollution (caching data that will not be used in the future). We propose
Hadoop-oriented SVM-LRU (H-SVM-LRU) to improve Hadoop performance. For this
purpose, we use an intelligent cache replacement algorithm, SVM-LRU, that
combines the well-known LRU mechanism with a machine learning algorithm, SVM,
to classify cached data into two groups based on their future usage.
Experimental results show a significant decrease in execution time as a result
of an increased cache hit ratio, leading to a positive impact on Hadoop
performance.
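The combination of LRU with a two-class prediction of future usage can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `ClassifierGuidedLRU` is hypothetical, a plain callable `will_be_reused` stands in for the trained SVM, and the placement policy (predicted-unused blocks go to the eviction end) is an assumption about how such a scheme might work.

```python
from collections import OrderedDict
from typing import Callable, Hashable, List


class ClassifierGuidedLRU:
    """LRU cache whose insertion position depends on a reuse prediction."""

    def __init__(self, capacity: int,
                 will_be_reused: Callable[[Hashable], bool]) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        # Stand-in for the SVM classifier (assumption): True means the
        # block is predicted to be accessed again.
        self.will_be_reused = will_be_reused
        # The first entry is the next eviction victim.
        self._blocks: "OrderedDict[Hashable, bool]" = OrderedDict()

    def access(self, block_id: Hashable) -> bool:
        """Record an access and return True on a cache hit."""
        hit = block_id in self._blocks
        if not hit and len(self._blocks) >= self.capacity:
            # Evict the block at the eviction end to make room.
            self._blocks.popitem(last=False)
        self._blocks[block_id] = True
        # Predicted-reused blocks move to the protected end; the others
        # move to the eviction end so they leave the cache first.
        self._blocks.move_to_end(block_id,
                                 last=self.will_be_reused(block_id))
        return hit

    def contents(self) -> List[Hashable]:
        """Return block ids ordered from next victim to most protected."""
        return list(self._blocks)
```

With a predictor that marks one-off blocks as not reused, those blocks are evicted before older blocks that are expected to be read again, which is the mechanism by which such a scheme would limit cache pollution.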
Enabling Distributed Applications Optimization in Cloud Environment
The past few years have seen dramatic growth in the popularity of public clouds, such as Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Container-as-a-Service (CaaS). In both commercial and scientific fields, quick environment setup and application deployment have become a mandatory requirement. As a result, more and more organizations choose cloud environments instead of setting up the environment themselves from scratch. Cloud computing resources such as server engines, orchestration, and the underlying server resources are delivered to users as a service by a cloud provider. Most of the applications that run in public clouds are distributed applications, also called multi-tier applications, which require a set of servers, a service ensemble, that cooperate and communicate to jointly provide a certain service or accomplish a task. Moreover, only a few research efforts have been devoted to providing an overall solution for distributed applications optimization in the public cloud.
In this dissertation, we present three systems that enable distributed applications optimization: (1) the first part introduces DocMan, a toolset for detecting containerized applications' dependencies in CaaS clouds, (2) the second part introduces a system to deal with hot/cold blocks in distributed applications, and (3) the third part introduces a system named FP4S, a novel fragment-based parallel state recovery mechanism that can handle many simultaneous failures for a large number of concurrently running stream applications.
Overview of Caching Mechanisms to Improve Hadoop Performance
In today's distributed computing environments, large amounts of data are
generated from different resources with a high velocity, rendering the data
difficult to capture, manage, and process within existing relational databases.
Hadoop is a tool to store and process large datasets in a parallel manner
across a cluster of machines in a distributed environment. Hadoop brings many
benefits like flexibility, scalability, and high fault tolerance; however, it
faces some challenges in terms of data access time, I/O operation, and
duplicate computations resulting in extra overhead, resource wastage, and poor
performance. Many researchers have utilized caching mechanisms to tackle these
challenges. For example, they have presented approaches to improve data access
time, enhance data locality rate, remove repetitive calculations, reduce the
number of I/O operations, decrease the job execution time, and increase
resource efficiency. In the current study, we provide a comprehensive overview
of caching strategies to improve Hadoop performance. Additionally, a novel
classification is introduced based on cache utilization. Using this
classification, we analyze the impact on Hadoop performance and discuss the
advantages and disadvantages of each group. Finally, a novel hybrid approach
called Hybrid Intelligent Cache (HIC) that combines the benefits of two methods
from different groups, H-SVM-LRU and CLQLMRS, is presented. Experimental
results show that our hybrid method achieves an average improvement of 31.2% in
job execution time.
Spark versus Flink: Understanding Performance in Big Data Analytics Frameworks
Big Data analytics has recently gained increasing popularity as a tool to process large amounts of data on-demand. Spark and Flink are two Apache-hosted data analytics frameworks that facilitate the development of multi-step data pipelines using directed acyclic graph patterns. Making the most out of these frameworks is challenging because efficient executions strongly rely on complex parameter configurations and on an in-depth understanding of the underlying architectural choices. Although extensive research has been devoted to improving and evaluating the performance of such analytics frameworks, most studies benchmark the platforms against Hadoop as a baseline, a rather unfair comparison considering the fundamentally different design principles. This paper aims to bring some justice in this respect by directly evaluating the performance of Spark and Flink. Our goal is to identify and explain the impact of the different architectural choices and the parameter configurations on the perceived end-to-end performance. To this end, we develop a methodology for correlating the parameter settings and the operators' execution plan with the resource usage. We use this methodology to dissect the performance of Spark and Flink with several representative batch and iterative workloads on up to 100 nodes. Our key finding is that neither of the two frameworks outperforms the other for all data types, sizes, and job patterns. This paper provides a fine-grained characterization of the cases in which each framework is superior, and we highlight how this performance correlates to operators, to resource usage, and to the specifics of the internal framework design.