7 research outputs found

    LEMO-MR: Low Overhead and Elastic MapReduce Implementation Optimized for Memory and CPU-Intensive Applications

    Since its inception, MapReduce has frequently been associated with Hadoop and large-scale datasets. Its deployment at Amazon in the cloud, and its applications at Yahoo! and Facebook for large-scale distributed document indexing and database building, among other tasks, have thrust MapReduce to the forefront of the data processing application domain. The applicability of the paradigm, however, extends far beyond its use with data-intensive applications and disk-based systems, and can also be brought to bear in processing small but CPU-intensive distributed applications. In this work, we focus both on the performance of processing large-scale hierarchical data in distributed scientific applications and on the processing of smaller but demanding input sizes primarily used in diskless, memory-resident I/O systems. In this paper, we present LEMO-MR (Low overhead, elastic, configurable for in-memory applications, and on-demand fault tolerance), an optimized implementation of MapReduce for both on-disk and in-memory applications. We describe its architecture and identify not only the necessary components of this model, but also the trade-offs and factors to be considered. We show the efficacy of our implementation in terms of the potential speedup that can be achieved for representative data sets used by cloud applications. Finally, we quantify the performance gains exhibited by our MapReduce implementation over Apache Hadoop in a compute-intensive environment.

    MARIANE: MApReduce Implementation Adapted for HPC Environments

    MapReduce is becoming an increasingly popular framework and a potent programming model. The most popular open source implementation of MapReduce, Hadoop, is based on the Hadoop Distributed File System (HDFS). However, as HDFS is not POSIX compliant, it cannot be fully leveraged by applications running on a majority of existing HPC environments such as TeraGrid and NERSC. These HPC environments typically support globally shared file systems such as NFS and GPFS. On such resourceful HPC infrastructures, the use of Hadoop not only creates compatibility issues, but also affects overall performance due to the added overhead of HDFS. This paper not only presents a MapReduce implementation directly suitable for HPC environments, but also exposes the design choices for better performance gains in those settings. By leveraging inherent distributed file systems' functions, and abstracting them away from its MapReduce framework, MARIANE (MApReduce Implementation Adapted for HPC Environments) not only allows for the use of the model in an expanding number of HPC environments, but also allows for better performance in such settings. This paper shows the applicability and high performance of the MapReduce paradigm through MARIANE, an implementation designed for clustered and shared-disk file systems and as such not dedicated to a specific MapReduce solution. The paper identifies the components and trade-offs necessary for this model, and quantifies the performance gains exhibited by our approach in distributed environments over Apache Hadoop in a data-intensive setting, on the Magellan testbed at the National Energy Research Scientific Computing Center (NERSC).

    MARLA: MapReduce for Heterogeneous Clusters

    MapReduce has gradually become the framework of choice for "big data". The MapReduce model allows for efficient and swift processing of large-scale data with a cluster of compute nodes. However, this efficiency comes at a price. The performance of widely used MapReduce implementations such as Hadoop suffers in heterogeneous and load-imbalanced clusters. In this paper, we show that the disparity in performance between homogeneous and heterogeneous clusters is high. Subsequently, we present MARLA, a MapReduce framework capable of performing well not only in homogeneous settings, but also when the cluster exhibits heterogeneous properties. We address the problems that existing MapReduce implementations exhibit under cluster heterogeneity, and subsequently present through MARLA the components and trade-offs necessary for better MapReduce performance in heterogeneous cluster and cloud environments. We quantify the performance gains exhibited by our approach against Apache Hadoop and MARIANE in data-intensive and compute-intensive applications.

    Configuring a MapReduce Framework for Dynamic and Efficient Energy Adaptation

    No full text
    MapReduce has become a popular framework for Big Data applications. While MapReduce has received much praise for its scalability and efficiency, it has not been thoroughly evaluated for power consumption. Our goal with this paper is to explore the possibility of scheduling in a power-efficient manner without the need for expensive power monitors on every node. We begin by considering that no cluster is truly homogeneous with respect to energy consumption. From there we develop a MapReduce framework that can evaluate the current status of each node and dynamically react to estimated power usage. In so doing, we shift power consumption work toward more energy-efficient nodes that are currently consuming less power. Our work shows that, given an ideal framework configuration, certain nodes may consume only 62.3% of the dynamic power they consumed when the same framework was configured as it would be in a traditional MapReduce implementation.