
    Data locality in Hadoop

    Current market tendencies show the need to store and process rapidly growing amounts of data, which in turn creates demand for distributed storage and data processing systems. Apache Hadoop is an open-source framework for managing such computing clusters in an effective, fault-tolerant way. When dealing with large volumes of data, Hadoop and its storage system HDFS (Hadoop Distributed File System) face the challenge of keeping efficiency high while completing computations in reasonable time. The typical Hadoop deployment moves computation to the data rather than shipping data across the cluster, since moving large quantities of data through the network could significantly delay processing tasks. While a task is running, Hadoop favours local data access and chooses blocks from the nearest nodes; blocks that are not local are transferred only when they are needed by the given task. To support Hadoop's data locality preferences, this thesis proposes adding a new functionality to its distributed file system (HDFS) that enables moving data blocks on request. Shipping data in advance makes it possible to redistribute data between nodes so that it matches the given processing tasks. The new functionality enables instructed movement of data blocks within the cluster: data can be shifted either by a user running the proper HDFS shell command or programmatically by another module, such as a scheduler. In order to develop this functionality, a detailed analysis of the Apache Hadoop source code and its components (specifically HDFS) was conducted. The research resulted in a deep understanding of the internal architecture, which made it possible to compare the possible approaches to the desired solution and to develop the chosen one.
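    As a rough illustration of how such an on-request block move could be triggered programmatically, the sketch below lists a file's blocks with the standard Hadoop FileSystem API and then hands them to a block-move call. Only getFileBlockLocations() and the surrounding FileSystem calls are part of stock Hadoop; the BlockMover.moveBlock() call and the node name are hypothetical placeholders for the extension the thesis proposes.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.BlockLocation;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        // Sketch only: BlockMover and moveBlock() are hypothetical stand-ins for the
        // proposed HDFS extension; they are not part of stock Hadoop.
        public class BlockMoveExample {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);

                Path file = new Path("/data/input/part-00000");
                FileStatus status = fs.getFileStatus(file);

                // Standard HDFS API: list the blocks of the file and the nodes holding them.
                BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

                for (BlockLocation block : blocks) {
                    System.out.println("offset=" + block.getOffset()
                        + " hosts=" + String.join(",", block.getHosts()));
                    // Hypothetical call: ship this block to a target node in advance,
                    // before the task that needs it is scheduled there.
                    // BlockMover.moveBlock(fs, file, block.getOffset(), "worker-node-07");
                }
            }
        }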

    MOON: MapReduce On Opportunistic eNvironments

    MapReduce offers a flexible programming model for processing and generating large data sets on dedicated resources, where only a small fraction of such resources are ever unavailable at any given time. In contrast, when MapReduce is run on volunteer computing systems, which opportunistically harness idle desktop computers via frameworks such as Condor, it results in poor performance due to the volatility of the resources, in particular the high rate of node unavailability. Specifically, the data and task replication scheme adopted by existing MapReduce implementations is woefully inadequate for resources with high unavailability. To address this, we propose MOON, short for MapReduce On Opportunistic eNvironments. MOON extends Hadoop, an open-source implementation of MapReduce, with adaptive task and data scheduling algorithms in order to offer reliable MapReduce services on a hybrid resource architecture, where volunteer computing systems are supplemented by a small set of dedicated nodes. The adaptive task and data scheduling algorithms in MOON distinguish between (1) different types of MapReduce data and (2) different types of node outages in order to strategically place tasks and data on both volatile and dedicated nodes. Our tests demonstrate that MOON can deliver a three-fold performance improvement over Hadoop in volatile, volunteer computing environments.
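    The hybrid-placement idea, keeping at least one copy of important data on a reliable node while the rest lands on volunteer machines, can be pictured with the toy sketch below. It is a minimal, self-contained illustration under invented class and method names; it does not reproduce MOON's actual replication or scheduling algorithm.

        import java.util.ArrayList;
        import java.util.Comparator;
        import java.util.List;

        // Toy placement rule in the spirit of a hybrid volatile/dedicated cluster:
        // critical data anchors one replica on the least-loaded dedicated node,
        // remaining replicas go to volunteer (volatile) nodes. Names are invented.
        public class HybridPlacementSketch {

            record Node(String name, boolean dedicated, int load) {}

            static List<Node> chooseReplicaNodes(List<Node> cluster, int replicas, boolean critical) {
                List<Node> chosen = new ArrayList<>();
                List<Node> dedicated = cluster.stream()
                    .filter(Node::dedicated)
                    .sorted(Comparator.comparingInt(Node::load))
                    .toList();
                List<Node> volunteers = cluster.stream()
                    .filter(n -> !n.dedicated())
                    .sorted(Comparator.comparingInt(Node::load))
                    .toList();

                if (critical && !dedicated.isEmpty()) {
                    chosen.add(dedicated.get(0));   // anchor one copy on reliable storage
                }
                for (Node n : volunteers) {
                    if (chosen.size() >= replicas) break;
                    chosen.add(n);                  // remaining copies on volatile nodes
                }
                return chosen;
            }

            public static void main(String[] args) {
                List<Node> cluster = List.of(
                    new Node("dedicated-1", true, 3),
                    new Node("desktop-a", false, 1),
                    new Node("desktop-b", false, 5));
                System.out.println(chooseReplicaNodes(cluster, 2, true));
            }
        }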

    The Family of MapReduce and Large Scale Data Processing Systems

    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architecture and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as issues of data distribution, scheduling and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in several follow-up works after its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both the research and industrial communities. We also cover a set of systems that have been implemented to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.
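    As a concrete reminder of what the programming model hides from the developer, the canonical word-count job written against the standard Hadoop MapReduce API looks as follows; the framework takes care of input splitting, shuffling, scheduling and fault tolerance.

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class WordCount {

            // Map: emit (word, 1) for every token in an input line.
            public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
                private static final IntWritable ONE = new IntWritable(1);
                private final Text word = new Text();

                @Override
                protected void map(LongWritable key, Text value, Context context)
                        throws IOException, InterruptedException {
                    for (String token : value.toString().split("\\s+")) {
                        if (!token.isEmpty()) {
                            word.set(token);
                            context.write(word, ONE);
                        }
                    }
                }
            }

            // Reduce: sum the counts emitted for each word.
            public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
                @Override
                protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                        throws IOException, InterruptedException {
                    int sum = 0;
                    for (IntWritable v : values) sum += v.get();
                    context.write(key, new IntWritable(sum));
                }
            }

            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance(new Configuration(), "word count");
                job.setJarByClass(WordCount.class);
                job.setMapperClass(TokenizerMapper.class);
                job.setCombinerClass(IntSumReducer.class);
                job.setReducerClass(IntSumReducer.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }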

    Data Transfers in Hadoop: A Comparative Study

    Hadoop is an open-source framework for processing large amounts of data in a distributed computing environment, and it plays an important role in processing and analyzing Big Data. The framework is used for storing data on large clusters of commodity hardware. Data input and output to and from Hadoop is an indispensable part of any data processing job. At present, many tools have evolved for importing and exporting data in Hadoop. In this article, some commonly used tools for importing and exporting data are highlighted, and a state-of-the-art comparative study among these tools is made. The study establishes when to use one tool over another, with emphasis on data transfer to and from a Hadoop system. The article also discusses how Hadoop handles backup and disaster recovery, along with some open research questions concerning Big Data transfer when dealing with cloud-based services.
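    Beyond dedicated import/export tools, data can also be moved into and out of HDFS directly through the standard FileSystem API, which is what many such tools build on. A minimal sketch follows; the local and HDFS paths are placeholders.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        // Minimal sketch of moving data into and out of HDFS with the standard
        // FileSystem API; all paths below are placeholders.
        public class HdfsTransferExample {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);

                // Import: copy a local file into HDFS (equivalent to `hdfs dfs -put`).
                fs.copyFromLocalFile(new Path("/tmp/local/input.csv"),
                                     new Path("/data/staging/input.csv"));

                // Export: copy a result file from HDFS back to local disk
                // (equivalent to `hdfs dfs -get`).
                fs.copyToLocalFile(new Path("/data/results/part-r-00000"),
                                   new Path("/tmp/local/part-r-00000"));

                fs.close();
            }
        }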