Hadoop-Oriented SVM-LRU (H-SVM-LRU): An Intelligent Cache Replacement Algorithm to Improve MapReduce Performance
Modern applications can generate a large amount of data from different
sources with high velocity, a combination that is difficult to store and
process via traditional tools. Hadoop is one framework that is used for the
parallel processing of a large amount of data in a distributed environment;
however, various challenges can lead to poor performance. Two particular issues
that can limit performance are the high access time for I/O operations and the
recomputation of intermediate data. The combination of these two issues can
result in resource wastage. In recent years, there have been attempts to
overcome these problems by using caching mechanisms. Due to cache space
limitations, it is crucial to use this space efficiently and avoid cache
pollution (the retention of cached data that is never used again). We propose
Hadoop-oriented SVM-LRU (H-SVM-LRU) to improve Hadoop performance. For this
purpose, we use an intelligent cache replacement algorithm, SVM-LRU, that
combines the well-known LRU mechanism with a machine learning algorithm, SVM,
to classify cached data into two groups based on their future usage.
Experimental results show a significant decrease in execution time as a result
of an increased cache hit ratio, leading to a positive impact on Hadoop
performance.
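
The idea of pairing LRU with a two-class prediction can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `ClassifierGuidedLRU` and the `will_be_reused` callback are hypothetical, and a plain predicate stands in for the trained SVM. The sketch assumes that items predicted to be reused are placed at the protected end of the recency order, while items predicted not to be reused are placed at the eviction end.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable, List, Optional


class ClassifierGuidedLRU:
    """LRU-style cache whose recency position depends on a reuse prediction.

    The internal OrderedDict runs from the next eviction candidate (front)
    to the most protected item (back).
    """

    def __init__(self, capacity: int, will_be_reused: Callable[[Hashable], bool]):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        # Hypothetical stand-in for the SVM: True means "expected to be reused".
        self.will_be_reused = will_be_reused
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def _place(self, key: Hashable) -> None:
        # Predicted-reused items go to the back; the rest go to the front,
        # where they are evicted first.
        self._items.move_to_end(key, last=self.will_be_reused(key))

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self._items:
            return None  # cache miss
        self._place(key)
        return self._items[key]

    def put(self, key: Hashable, value: Any) -> None:
        if key not in self._items and len(self._items) >= self.capacity:
            self._items.popitem(last=False)  # evict from the front
        self._items[key] = value
        self._place(key)

    def keys(self) -> List[Hashable]:
        return list(self._items)


if __name__ == "__main__":
    cache = ClassifierGuidedLRU(2, lambda k: str(k).startswith("hot"))
    cache.put("hot1", 1)
    cache.put("cold1", 2)
    cache.put("hot2", 3)   # evicts "cold1", although "hot1" is older
    print(cache.keys())    # ['hot1', 'hot2']
```

A plain LRU cache would have evicted "hot1" in this example; here the item predicted not to be reused is evicted first, which is the cache-pollution behavior the abstract aims to avoid.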
Automating distributed tiered storage management in cluster computing
Presented at the 46th International Conference on Very Large Data Bases, 31 August - 4 September 2020, Japan.
Data-intensive platforms such as Hadoop and Spark are routinely used to process massive amounts of data residing on distributed file systems like HDFS. Increasing memory sizes and new hardware technologies (e.g., NVRAM, SSDs) have recently led to the introduction of storage tiering in such settings. However, users are now burdened with the additional complexity of managing the multiple storage tiers and the data residing on them while trying to optimize their workloads. In this paper, we develop a general framework for automatically moving data across the available storage tiers in distributed file systems. Moreover, we employ machine learning for tracking and predicting file access patterns, which we use to decide when and which data to move up or down the storage tiers for increasing system performance. Our approach uses incremental learning to dynamically refine the models with new file accesses, allowing them to naturally adjust and adapt to workload changes over time. Our extensive evaluation using realistic workloads derived from Facebook and CMU traces compares our approach with several other policies and showcases significant benefits in terms of both workload performance and cluster efficiency.