59 research outputs found

    Hadoop Map Reduce Performance Evaluation and Improvement Using Compression Algorithms on Single Cluster

    In today's scenario, the term Big Data is used by researchers for large volumes of data that require substantial resources such as processors, memory, and storage capacity. The data can be structured or unstructured, including text, images, audio, video, and social media content, and is generated by sensor devices, mobile devices, and social media platforms. Data is stored in repositories on the basis of attributes such as size or colour name, and it demands considerable storage space. In this paper we evaluate the performance of the Hadoop MapReduce examples TeraGen, TeraSort, and TeraValidate. We then evaluate Hadoop MapReduce performance by configuring compression-related parameters and different compression algorithms (DEFLATE, Bzip2, Gzip, and LZ4) on a single cluster using the Word Count example. After evaluating the compression algorithms through the Word Count example, we find that job elapsed time, I/O time, and storage space requirements are reduced marginally, along with an increase in CPU computation time.
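
    A minimal sketch of how such compression parameters can be configured for a Word Count job through Hadoop's standard MapReduce API is shown below; the exact parameters tuned in the paper are not listed in the abstract, so the properties and codec choices here are illustrative assumptions, and the Mapper/Reducer classes of a standard WordCount are omitted.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.io.compress.BZip2Codec;
        import org.apache.hadoop.io.compress.CompressionCodec;
        import org.apache.hadoop.io.compress.Lz4Codec;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class CompressedWordCount {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();

                // Compress intermediate map output (shuffle data), e.g. with LZ4.
                conf.setBoolean("mapreduce.map.output.compress", true);
                conf.setClass("mapreduce.map.output.compress.codec",
                              Lz4Codec.class, CompressionCodec.class);

                Job job = Job.getInstance(conf, "word count (compressed)");
                job.setJarByClass(CompressedWordCount.class);
                // setMapperClass/setReducerClass of any standard WordCount implementation go here.
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);

                // Compress the final job output, e.g. with Bzip2 (splittable, but CPU-heavy).
                FileOutputFormat.setCompressOutput(job, true);
                FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);

                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }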

    Towards green scientific data compression through high-level I/O interfaces

    Every HPC system today has to cope with a deluge of data generated by scientific applications, simulations, or large-scale experiments. The upscaling of supercomputer systems and infrastructures generally results in a dramatic increase in their energy consumption. In this paper, we argue that techniques like data compression can lead to significant gains in power efficiency by reducing both network and storage requirements. To that end, we propose a novel methodology for on-the-fly, intelligent determination of energy-efficient data reduction for a given data set by leveraging state-of-the-art compression algorithms and metadata at application-level I/O. We motivate our work by analyzing the energy and storage saving needs of real-life scientific HPC applications, and we review the various compression techniques that can be applied. We find that the resulting data reduction can decrease the data volume transferred and stored by as much as 80% in some cases, consequently leading to significant savings in storage and networking costs.
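
    The abstract does not spell out the selection logic, but the underlying trade-off is that a codec is worth using only if the CPU energy it adds is outweighed by the energy saved on storage and network I/O. The small Java model below is purely illustrative; all constants (power draws, throughputs, compression ratio) are assumptions, not values from the paper.

        // Illustrative energy trade-off for deciding whether to compress before I/O.
        // All figures are placeholder assumptions, not measurements from the paper.
        public class CompressionEnergyModel {

            /** Estimated energy (J) to move 'bytes' through storage/network at a given J-per-byte cost. */
            static double ioEnergy(double bytes, double joulesPerByte) {
                return bytes * joulesPerByte;
            }

            /** Estimated CPU energy (J) to compress 'bytes' at a given throughput and power draw. */
            static double cpuEnergy(double bytes, double bytesPerSecond, double cpuWatts) {
                return (bytes / bytesPerSecond) * cpuWatts;
            }

            public static void main(String[] args) {
                double dataBytes = 1e9;          // 1 GB of simulation output (assumed)
                double ioJoulesPerByte = 2e-7;   // assumed combined storage + network energy cost
                double ratio = 0.2;              // assumed compressed size is 20% of the original
                double codecThroughput = 400e6;  // assumed 400 MB/s compression speed
                double cpuWatts = 20;            // assumed extra CPU power while compressing

                double rawEnergy = ioEnergy(dataBytes, ioJoulesPerByte);
                double compressedEnergy = cpuEnergy(dataBytes, codecThroughput, cpuWatts)
                                        + ioEnergy(dataBytes * ratio, ioJoulesPerByte);

                System.out.printf("write raw: %.0f J, compress then write: %.0f J -> compress? %b%n",
                        rawEnergy, compressedEnergy, compressedEnergy < rawEnergy);
            }
        }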

    Data compression for climate data

    The different rates of increase in the computational power and storage capability of supercomputers turn data storage into a technical and economical problem. Because storage capabilities are lagging behind, investments and operational costs for storage systems have increased to keep up with supercomputers' I/O requirements. One promising approach is to reduce the amount of data that is stored. In this paper, we take a look at the impact of compression on the performance and costs of high-performance systems. To this end, we analyze the applicability of compression on all layers of the I/O stack, that is, main memory, network, and storage. Based on the Mistral system of the German Climate Computing Center (Deutsches Klimarechenzentrum, DKRZ), we illustrate potential performance improvements and cost savings. Making use of compression on a large scale can decrease investments and operational costs by 50% without negatively impacting performance. Additionally, we present ongoing work on supporting enhanced adaptive compression in the parallel distributed file system Lustre and on application-specific compression.
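
    As a back-of-the-envelope illustration of how a compression ratio translates into storage cost, the snippet below computes the saving for assumed figures; the capacity, ratio, and cost per petabyte are placeholders, not DKRZ's actual numbers.

        // Back-of-the-envelope storage cost with and without compression.
        // All capacities, ratios, and prices are illustrative assumptions, not DKRZ figures.
        public class StorageCostSketch {
            public static void main(String[] args) {
                double rawPetabytes = 50.0;       // data to be stored (assumed)
                double compressionRatio = 2.0;    // assumed lossless 2:1 reduction for climate data
                double costPerPetabyte = 100_000; // assumed acquisition + operation cost per PB

                double uncompressedCost = rawPetabytes * costPerPetabyte;
                double compressedCost = (rawPetabytes / compressionRatio) * costPerPetabyte;

                System.out.printf("without compression: %.0f, with compression: %.0f (%.0f%% saved)%n",
                        uncompressedCost, compressedCost,
                        100.0 * (1 - compressedCost / uncompressedCost));
            }
        }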

    On Understanding the Energy Impact of Speculative Execution in Hadoop

    Hadoop has emerged as an important system for large-scale data analysis. Speculative execution is a key feature in Hadoop that is extensively leveraged in clouds: it is used to mask slow tasks (i.e., stragglers), which result from resource contention and heterogeneity in clouds, by launching speculative task copies on other machines. However, speculative execution is not cost-free and may result in performance degradation and extra resource and energy consumption. While prior literature has been dedicated to improving straggler detection to cope with the inevitable heterogeneity in clouds, little work has focused on understanding the implications of speculative execution for the performance and energy consumption of Hadoop clusters. In this paper, we design a set of experiments to evaluate the impact of speculative execution on the performance and energy consumption of Hadoop in homogeneous and heterogeneous environments. Our studies reveal that speculative execution may sometimes reduce and sometimes increase the energy consumption of Hadoop clusters. This strongly depends on the reduction in the execution time of MapReduce applications and on the extra power consumption introduced by speculative execution. Moreover, we show that the extra power consumption varies between applications and is driven by three main factors: the duration of speculative tasks, the idle time, and the allocation of speculative tasks. To the best of our knowledge, our work provides the first deep look into the energy efficiency of speculative execution in Hadoop.
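
    The abstract's core observation, that speculation saves energy only when the runtime reduction outweighs the extra power drawn by speculative copies, can be captured with a simple comparison. The sketch below is illustrative only and not the authors' model; all power and runtime figures are assumptions.

        // Does speculative execution pay off energetically? Compare cluster energy
        // with and without speculation. All numbers are illustrative assumptions.
        public class SpeculationEnergySketch {

            /** Energy (J) = average cluster power (W) x job runtime (s). */
            static double energy(double avgPowerWatts, double runtimeSeconds) {
                return avgPowerWatts * runtimeSeconds;
            }

            public static void main(String[] args) {
                double basePower = 2000;      // assumed average cluster power without speculation (W)
                double baseRuntime = 600;     // assumed job runtime without speculation (s)
                double specExtraPower = 150;  // assumed extra power from speculative copies (W)
                double specRuntime = 480;     // assumed runtime when stragglers are masked (s)

                double eNoSpec = energy(basePower, baseRuntime);
                double eSpec = energy(basePower + specExtraPower, specRuntime);

                System.out.printf("no speculation: %.0f kJ, speculation: %.0f kJ -> speculation %s energy%n",
                        eNoSpec / 1000, eSpec / 1000, eSpec < eNoSpec ? "reduces" : "increases");
            }
        }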

    Energy Efficient Data-Intensive Computing With Mapreduce

    Power and energy consumption are critical constraints in data center design and operation. In data centers, MapReduce data-intensive applications demand significant resources and energy. Recognizing the importance and urgency of optimizing the energy usage of MapReduce applications, this work aims to provide instrumental tools to measure and evaluate MapReduce energy efficiency, as well as techniques to conserve energy without impacting performance. Energy conservation for data-intensive computing requires enabling technology that provides detailed and systemic energy information and identifies inefficiencies in the underlying system hardware and software. To address this need, we present eTune, a fine-grained, scalable energy profiling framework for data-intensive computing on large-scale distributed systems. eTune leverages performance monitoring counters (PMCs) on modern computer components and statistically builds power-performance correlation models. Using the learned models, eTune augments direct measurement with a software-based power estimator that runs on compute nodes and reports power at multiple levels, including node, core, memory, and disk, with high accuracy. Data-intensive computing differs from traditional high-performance computing in that most execution time is spent moving data between storage devices, nodes, and components. Since data movements are potential performance and energy bottlenecks, we propose an analysis framework with methods and metrics for evaluating and characterizing the costly built-in MapReduce data movements. The revealed data movement energy characteristics can be exploited in system design and resource allocation to improve the energy efficiency of data-intensive computing. Finally, we present an optimization technique that targets inefficient built-in MapReduce data movements to conserve energy without impacting performance. The technique allocates the optimal number of compute nodes to an application and dynamically schedules processor frequency during its execution based on data movement characteristics. Experimental results show significant energy savings, though the improvements depend on both workload characteristics and the policies for resource allocation and dynamic voltage and frequency scaling. As data volume doubles every two years and more data centers are put into production, energy consumption is expected to grow further. We expect these studies to provide direction and insight for building more energy-efficient data-intensive systems and applications, and the accompanying tools and techniques to be adopted by other researchers in their energy-efficiency studies.
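
    The abstract does not detail eTune's statistical models; a common form for such PMC-based estimators is a linear model over counter rates, sketched below with made-up coefficients and counters chosen purely for illustration.

        // Sketch of a PMC-based linear power estimator of the kind eTune builds.
        // The counters and coefficients are illustrative assumptions, not eTune's model.
        public class PmcPowerModel {

            // P = c0 + c1 * instructions/s + c2 * LLC misses/s + c3 * disk bytes/s
            static final double C0 = 45.0;    // assumed idle/base power of a node (W)
            static final double C1 = 4.0e-9;  // assumed W per instruction/s
            static final double C2 = 6.0e-7;  // assumed W per last-level-cache miss/s
            static final double C3 = 1.0e-7;  // assumed W per disk byte/s

            static double estimateNodePower(double insnPerSec, double llcMissPerSec, double diskBytesPerSec) {
                return C0 + C1 * insnPerSec + C2 * llcMissPerSec + C3 * diskBytesPerSec;
            }

            public static void main(String[] args) {
                // Counter rates sampled over one interval on a worker node (assumed values).
                double watts = estimateNodePower(8e9, 2e7, 5e7);
                System.out.printf("estimated node power: %.1f W%n", watts);
            }
        }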

    DEVELOPMENT OF MAP/REDUCE BASED MICROARRAY ANALYSIS TOOLS

    High-density oligonucleotide arrays (microarrays) from the Affymetrix GeneChip system have been widely used for measuring gene expression. Currently, public data repositories, such as the Gene Expression Omnibus (GEO) of the National Center for Biotechnology Information (NCBI), have accumulated a very large amount of microarray data. For example, there are 84389 human and 9654 Arabidopsis microarray experiments in the GEO database. Efficient integrative analysis of this large amount of microarray data can provide more knowledge about biological systems. Traditional microarray analysis tools implement sequential algorithms and can only run on a single processor; they are not able to handle very large microarray data sets with thousands of experiments. It is therefore necessary to develop new microarray analysis tools on a parallel framework. In this thesis, I implemented microarray quality assessment, background correction, normalization, and summarization algorithms using the Map/Reduce framework. The Map/Reduce framework, first introduced by Google in 2004, offers a promising paradigm for developing scalable parallel applications for large-scale data. Evaluation of the new implementations on large rice and Arabidopsis microarray data sets showed good speedups. For example, running the rice microarray data through our implementation of the MAS5.0 algorithms on 20 compute nodes (320 processors in total) achieves a 28-fold speedup over the previous C++ implementation on a single processor. Our new microarray tools will make it possible to utilize the valuable experiments in the public repositories.
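
    The parallel decomposition is not shown in the abstract; one natural Map/Reduce formulation keys probe intensities by probe-set ID so that each probe set is summarized independently in a reducer. The sketch below assumes that shape, with a simple mean standing in for the more involved MAS5.0 summarization statistic and an assumed tab-separated input format.

        import java.io.IOException;
        import org.apache.hadoop.io.DoubleWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;

        // Illustrative Map/Reduce decomposition for probe-set summarization.
        // Input lines are assumed to look like: "<probeSetId>\t<probeIntensity>".
        // A simple mean stands in for the full MAS5.0 summarization statistic.
        public class ProbeSetSummarize {

            public static class IntensityMapper
                    extends Mapper<LongWritable, Text, Text, DoubleWritable> {
                @Override
                protected void map(LongWritable key, Text value, Context ctx)
                        throws IOException, InterruptedException {
                    String[] fields = value.toString().split("\t");
                    if (fields.length == 2) {
                        // Key by probe-set ID so all probes of one set meet in the same reducer.
                        ctx.write(new Text(fields[0]), new DoubleWritable(Double.parseDouble(fields[1])));
                    }
                }
            }

            public static class SummaryReducer
                    extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
                @Override
                protected void reduce(Text probeSet, Iterable<DoubleWritable> intensities, Context ctx)
                        throws IOException, InterruptedException {
                    double sum = 0;
                    int n = 0;
                    for (DoubleWritable v : intensities) { sum += v.get(); n++; }
                    ctx.write(probeSet, new DoubleWritable(sum / n));  // summarized expression value
                }
            }
        }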

    Storage and aggregation for fast analytics systems

    Computing in the last decade has been characterized by the rise of data-intensive scalable computing (DISC) systems. In particular, recent years have witnessed a rapid growth in the popularity of fast analytics systems. These systems exemplify a trend where queries that previously involved batch processing (e.g., running a MapReduce job) on a massive amount of data are increasingly expected to be answered in near real-time with low latency. This dissertation addresses the problem that existing designs for various components used in the software stack for DISC systems do not meet the requirements demanded by fast analytics applications. In this work, we focus specifically on two components. (1) Key-value storage: recent work has focused primarily on supporting reads with high throughput and low latency. However, fast analytics applications require that new data entering the system (e.g., newly crawled web pages, currently trending topics) be quickly made available to queries and analysis codes. This means that along with supporting reads efficiently, these systems must also support writes with high throughput, which current systems fail to do. In the first part of this work, we solve this problem by proposing a new key-value storage system, called the WriteBuffer (WB) Tree, that provides up to 30× higher write performance and similar read performance compared to current high-performance systems. (2) GroupBy-Aggregate: fast analytics systems require support for fast, incremental aggregation of data with low-latency access to results. Existing techniques are memory-inefficient and do not support incremental aggregation efficiently when aggregate data overflows to disk. In the second part of this dissertation, we propose a new data structure called the Compressed Buffer Tree (CBT) to implement memory-efficient in-memory aggregation. We also show how the WB Tree can be modified to support efficient disk-based aggregation.
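
    The WB Tree's internals are not given in the abstract; the sketch below only illustrates the general write-optimization idea such structures build on, namely absorbing writes in memory and flushing them as sorted runs, and every name in it is a placeholder rather than the dissertation's design.

        import java.io.IOException;
        import java.nio.charset.StandardCharsets;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.util.ArrayList;
        import java.util.List;
        import java.util.TreeMap;

        // Illustrative write-optimized store: absorb writes in an in-memory buffer and
        // flush sorted runs to disk. This sketches the general idea behind write-optimized
        // structures; it is not the WB Tree itself.
        public class BufferedKVStore {
            private final TreeMap<String, String> buffer = new TreeMap<>();
            private final Path dir;
            private final int flushThreshold;
            private int runCounter = 0;

            public BufferedKVStore(Path dir, int flushThreshold) {
                this.dir = dir;
                this.flushThreshold = flushThreshold;
            }

            public void put(String key, String value) throws IOException {
                buffer.put(key, value);                 // fast, purely in-memory write path
                if (buffer.size() >= flushThreshold) {
                    flush();                            // amortize disk I/O over many writes
                }
            }

            private void flush() throws IOException {
                List<String> lines = new ArrayList<>();
                buffer.forEach((k, v) -> lines.add(k + "\t" + v));  // already key-sorted (TreeMap)
                Files.write(dir.resolve("run-" + (runCounter++) + ".txt"), lines, StandardCharsets.UTF_8);
                buffer.clear();
                // A real system would later merge runs and serve reads from buffer + runs.
            }
        }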

    Doctor of Philosophy

    In the past few years, we have seen a tremendous increase in the amount of digital data being generated. By 2011, storage vendors had shipped 905 PB of purpose-built backup appliances. By 2013, the number of objects stored in Amazon S3 had reached 2 trillion. Facebook had stored 20 PB of photos by 2010. All of these require an efficient storage solution. To improve space efficiency, compression and deduplication are widely used. Compression works by identifying repeated strings and replacing them with more compact encodings, while deduplication partitions data into fixed-size or variable-size chunks and removes duplicate blocks. While these two approaches have yielded great improvements in space efficiency, there are still limitations. First, traditional compressors are limited in their ability to detect redundancy across a large range because they search for redundant data at a fine-grained (string) level. For deduplication, metadata embedded in an input file changes more frequently than the data itself, which introduces unnecessary unique chunks and leads to poor deduplication. In addition, cloud storage systems suffer from unpredictable and inefficient performance because of interference among different types of workloads. This dissertation proposes techniques to improve the effectiveness of traditional compressors and deduplication in improving space efficiency, and a new I/O scheduling algorithm to improve performance predictability and efficiency for cloud storage systems. The common idea is to utilize similarity. To improve the effectiveness of compression and deduplication, similarity in content is used to transform an input file into a compression- or deduplication-friendly format. We propose Migratory Compression, a generic data transformation that identifies similar data at a coarse-grained (block) level and then groups similar blocks together; it can be used as a preprocessing stage for any traditional compressor. We find that metadata has a huge impact on reducing the benefit of deduplication. To isolate this impact, we propose to separate metadata from data; three approaches are presented for use cases with different constraints. For the commonly used tar format, we propose Migratory Tar: a data transformation and a new tar format that deduplicates better. We also present a case study where we use deduplication to reduce storage consumption for storing disk images while at the same time achieving high performance in image deployment. Finally, we apply the same principle of utilizing similarity to I/O scheduling in order to prevent interference between random and sequential workloads, leading to efficient, consistent, and predictable performance for sequential workloads and high disk utilization.
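
    Migratory Compression's actual similarity detection is not described in the abstract; the sketch below only illustrates the preprocessing idea of splitting a file into fixed-size blocks, reordering them so similar blocks sit next to each other, and then handing the reordered stream to an ordinary compressor. The window-hash feature is a crude stand-in for real similarity sketches such as super-features, and the block size is an assumption.

        import java.util.ArrayList;
        import java.util.Comparator;
        import java.util.List;

        // Illustrative "group similar blocks, then compress" preprocessing in the spirit of
        // Migratory Compression. The feature is a crude stand-in for real similarity sketches.
        public class BlockRegrouper {
            static final int BLOCK_SIZE = 8 * 1024;   // assumed fixed block size

            // One crude similarity feature: the maximum hash over aligned 64-byte windows.
            // Blocks sharing much aligned content tend to share this maximum.
            static long feature(byte[] block) {
                final int window = 64;
                long max = Long.MIN_VALUE;
                for (int off = 0; off + window <= block.length; off += window) {
                    long h = 1125899906842597L;
                    for (int i = off; i < off + window; i++) h = 31 * h + (block[i] & 0xff);
                    if (h > max) max = h;
                }
                return max;
            }

            // Split the input into fixed-size blocks and reorder them so similar blocks are adjacent.
            static List<byte[]> regroup(byte[] data) {
                List<byte[]> blocks = new ArrayList<>();
                for (int off = 0; off < data.length; off += BLOCK_SIZE) {
                    int len = Math.min(BLOCK_SIZE, data.length - off);
                    byte[] b = new byte[len];
                    System.arraycopy(data, off, b, 0, len);
                    blocks.add(b);
                }
                // A real implementation keeps a "recipe" of the original block order so the
                // file can be reconstructed after decompression.
                blocks.sort(Comparator.comparingLong(BlockRegrouper::feature));
                return blocks;   // feed this reordered stream to any traditional compressor
            }
        }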

    Towards Implicit Parallel Programming for Systems

    Multi-core processors require a program to be decomposable into independent parts that can execute in parallel in order to scale performance with the number of cores. But parallel programming is hard, especially when the program requires state, which many systems programs use for optimization, for example a cache that reduces disk I/O. Most prevalent parallel programming models do not support a notion of state and require the programmer to synchronize state access manually, i.e., outside the scope of an associated optimizing compiler. This prevents the compiler from introducing parallelism automatically and requires the programmer to optimize the program manually. In this dissertation, we propose a programming language/compiler co-design that provides a new programming model for implicit parallel programming with state, together with a compiler that can optimize the program for parallel execution. We define the notion of stateful functions along with their composition and control structures. An example implementation of a highly scalable server shows that stateful functions integrate smoothly into existing programming language concepts, such as object-oriented programming and programming with structs. Our programming model is also highly practical and allows existing code bases to be adapted gradually. As a case study, we implemented a new data processing core for the Hadoop Map/Reduce system to overcome existing performance bottlenecks. Our lambda-calculus-based compiler automatically extracts parallelism without changing the program's semantics. We added further domain-specific, semantics-preserving transformations that reduce I/O calls for microservice programs. The runtime format of a program is a dataflow graph that can be executed in parallel, performs concurrent I/O, and allows for non-blocking live updates.
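
    The dissertation's stateful-function notation is not reproduced in the abstract; the Java sketch below only illustrates the underlying idea of a function that owns private state, so callers never synchronize access themselves and a compiler or runtime is free to schedule independent instances in parallel. The interface name and the cache example are assumptions, not the actual language constructs.

        import java.util.HashMap;
        import java.util.Map;
        import java.util.function.Function;

        // Illustrative encapsulation of state inside a function, in the spirit of
        // "stateful functions"; the names here are assumptions, not the dissertation's constructs.
        public class StatefulFunctions {

            // A function that owns private state; callers never touch the state directly,
            // which is what lets a runtime schedule independent instances in parallel.
            interface StatefulFn<I, O> extends Function<I, O> { }

            // Example: a caching lookup that avoids repeated slow backend reads (e.g., disk I/O).
            static StatefulFn<String, String> cachedLookup(Function<String, String> slowBackend) {
                Map<String, String> cache = new HashMap<>();   // state local to this function instance
                return key -> cache.computeIfAbsent(key, slowBackend);
            }

            public static void main(String[] args) {
                StatefulFn<String, String> lookup = cachedLookup(k -> {
                    System.out.println("expensive backend read for " + k);
                    return k.toUpperCase();
                });
                System.out.println(lookup.apply("page-1"));   // hits the backend
                System.out.println(lookup.apply("page-1"));   // served from encapsulated state
            }
        }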