    A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures

    Scientific problems that depend on processing large amounts of data require overcoming challenges in multiple areas: managing large-scale data distribution, co-placement and scheduling of data with compute resources, and storing and transferring large volumes of data. We analyze the ecosystems of the two prominent paradigms for data-intensive applications, hereafter referred to as the high-performance computing paradigm and the Apache-Hadoop paradigm. We propose a basis, common terminology, and functional factors upon which to analyze the approaches of both paradigms. We discuss the concept of "Big Data Ogres" and their facets as a means of understanding and characterizing the most common application workloads found across the two paradigms. We then discuss the salient features of the two paradigms, and compare and contrast the two approaches. Specifically, we examine common implementations of and approaches to these paradigms, shed light upon the reasons for their current "architecture", and discuss some typical workloads that utilize them. In spite of the significant software distinctions, we believe there is architectural similarity. We discuss the potential integration of different implementations, across the different levels and components. Our comparison progresses from a fully qualitative examination of the two paradigms to a semi-quantitative methodology. We take a simple and widely used Ogre (K-means clustering) and characterize its performance on a range of representative platforms, covering several implementations from both paradigms. Our experiments provide an insight into the relative strengths of the two paradigms. We propose that the set of Ogres will serve as a benchmark to evaluate the two paradigms along different dimensions. Comment: 8 pages, 2 figures

    An Optimized Model for MapReduce Based on Hadoop

    To address the waste of computing resources caused by the sequential control flow of the MapReduce execution mechanism on the Hadoop platform, the Fork/Join framework is introduced into the model to make full use of each node's CPU resources. From the perspective of fine-grained parallel data processing, this paper combines MapReduce with Fork/Join, a parallel multi-threaded framework, and proposes a MapReduce+Fork/Join programming model: a distributed and parallel architecture on the Hadoop platform that combines coarse-grained and fine-grained parallelism to support two levels of parallelism across both shared-memory and distributed-memory machines. The model is tested on a Hadoop cluster of four nodes. The experimental results show that the model improves the performance and efficiency of the whole system, and that it suits compute-intensive tasks as well as data-intensive ones. It is an effective optimization of the MapReduce model for big data processing.
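
The fine-grained tier described in this abstract can be illustrated with a minimal Java sketch. It is not the paper's implementation: it only assumes that the work of a single map task (here, a word count over the lines of one input split) is divided recursively with the standard `java.util.concurrent` Fork/Join classes so that all cores of a node are used. The class name `ForkJoinMapSketch`, the method `mapSplit`, and the `THRESHOLD` value are hypothetical, and the Hadoop side (job setup, shuffle, reduce) is omitted entirely.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class ForkJoinMapSketch {

    // Hypothetical cutoff: ranges of this many lines or fewer are processed sequentially.
    static final int THRESHOLD = 2;

    // Counts words in lines[lo, hi) by recursively splitting the range in half.
    static class CountTask extends RecursiveTask<Map<String, Integer>> {
        private final List<String> lines;
        private final int lo;
        private final int hi;

        CountTask(List<String> lines, int lo, int hi) {
            this.lines = lines;
            this.lo = lo;
            this.hi = hi;
        }

        @Override
        protected Map<String, Integer> compute() {
            if (hi - lo <= THRESHOLD) {
                // Small enough: count sequentially in the current worker thread.
                Map<String, Integer> counts = new HashMap<>();
                for (int i = lo; i < hi; i++) {
                    for (String word : lines.get(i).trim().split("\\s+")) {
                        if (!word.isEmpty()) {
                            counts.merge(word, 1, Integer::sum);
                        }
                    }
                }
                return counts;
            }
            int mid = (lo + hi) >>> 1;
            CountTask left = new CountTask(lines, lo, mid);
            CountTask right = new CountTask(lines, mid, hi);
            left.fork();                                    // schedule the left half asynchronously
            Map<String, Integer> merged = right.compute();  // process the right half in this thread
            Map<String, Integer> leftCounts = left.join();  // wait for the left half
            leftCounts.forEach((k, v) -> merged.merge(k, v, Integer::sum));
            return merged;
        }
    }

    // Stand-in for the body of one map task: counts the words of one input split
    // using the worker threads of the common Fork/Join pool.
    static Map<String, Integer> mapSplit(List<String> split) {
        return ForkJoinPool.commonPool().invoke(new CountTask(split, 0, split.size()));
    }

    public static void main(String[] args) {
        List<String> split = Arrays.asList("a b a", "b c", "a", "c c", "b");
        System.out.println(mapSplit(split));
    }
}
```

In a real job, the coarse-grained tier would remain Hadoop's distribution of input splits across nodes, and a method like `mapSplit` would be called from inside each map task. The per-word counts it returns would then be emitted as ordinary key/value pairs for the reduce phase.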