994 research outputs found

    BIGhybrid - A Toolkit for Simulating MapReduce on Hybrid Infrastructures

    Cloud computing is increasingly used as a platform for running large business and data-processing applications. Although clouds have become highly popular, their usage cost for data processing is not negligible. Desktop grids, by contrast, have been used by many projects that take advantage of the large number of resources volunteers provide for free. Merging cloud computing and desktop grids into a hybrid infrastructure can provide a feasible low-cost solution for big data analysis. Although frameworks like MapReduce were conceived to exploit commodity hardware, their use on hybrid infrastructure poses challenges due to large resource heterogeneity and a high churn rate. This study introduces BIGhybrid, a toolkit for simulating MapReduce on hybrid environments. The main goal is to provide a framework for developers and system designers to address the issues of hybrid MapReduce. In this paper, we describe the framework, which simulates the assembly of two existing middleware systems: BitDew-MapReduce for desktop grids and Hadoop-BlobSeer for cloud computing. Experimental results included in this work demonstrate the feasibility of our approach.

    A hybrid framework of iterative MapReduce and MPI for molecular dynamics applications

    Developing platforms for large-scale data processing has been of great interest to scientists. Hadoop is a widely used computational platform that provides fault-tolerant distributed data storage through HDFS (Hadoop Distributed File System) and fault-tolerant parallel data processing through the MapReduce framework. Actual computations often require multiple MapReduce cycles, which in turn require chained MapReduce jobs. However, Hadoop's design handles problems with iterative structure poorly. In many iterative problems, some invariant data is required by every MapReduce cycle. The same data is uploaded to the Hadoop file system in every cycle, which causes repeated data delivery and unnecessary transfer time. In addition, although Hadoop can process data in parallel, it does not support MPI. In any Map/Reduce task, the computation must be serial. As a result, scientific computations wrapped in Map/Reduce tasks run inefficiently because the computation cannot be distributed over a Hadoop cluster, especially a Hadoop cluster on a traditional high-performance computing cluster. Computational technologies have been extensively investigated for use in many application domains. Since Hadoop appeared, scientists have applied the MapReduce framework to biological sciences, chemistry, medical sciences, and other areas to process huge data sets efficiently. In our research, we proposed a hybrid framework of iterative MapReduce and MPI for molecular dynamics applications, and we carried out molecular dynamics simulations with the implemented framework. We improved the capability and performance of Hadoop by adding an MPI module to it. The MPI module enables Hadoop to monitor and manage the resources of the Hadoop cluster so that computations incurred in Map/Reduce tasks can be performed in parallel. We also applied a local caching mechanism to avoid redundant data delivery and make the computation more efficient. Our hybrid framework inherits the features of Hadoop and improves its computing efficiency. The target application domain of our research is molecular dynamics simulation, but the potential use of our iterative MapReduce framework with MPI is broad. It can be used by any application that contains single or multiple MapReduce iterations and invokes serial or parallel (MPI) computations in the Map phase or Reduce phase of Hadoop.
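The local caching idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not use Hadoop or MPI: `InvariantCache`, `load_invariant`, and `run_iterations` are hypothetical names, and the "upload" of invariant data is simulated by a plain function call. The sketch only shows that invariant data is fetched once and then reused by every iteration.

```python
class InvariantCache:
    """Loads invariant data at most once and reuses it across iterations."""

    def __init__(self, loader):
        self._loader = loader      # callable that fetches the invariant data
        self._data = None
        self.load_count = 0        # how many times the loader actually ran

    def get(self):
        if self._data is None:
            self._data = self._loader()
            self.load_count += 1
        return self._data


def load_invariant():
    # Hypothetical stand-in for delivering invariant input to a worker,
    # for example fixed simulation parameters.
    return {"step": 0.5}


def run_iterations(state, cache, n_iterations):
    """Run n_iterations update cycles that all need the same invariant data."""
    for _ in range(n_iterations):
        invariant = cache.get()    # served from the cache after the first cycle
        state = [value + invariant["step"] for value in state]
    return state


if __name__ == "__main__":
    cache = InvariantCache(load_invariant)
    print(run_iterations([0.0, 1.0], cache, 4))
    print("loads:", cache.load_count)
```

Without the cache, each of the four cycles would call `load_invariant` again, which corresponds to the repeated data delivery the abstract describes.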

    The Family of MapReduce and Large Scale Data Processing Systems

    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architecture and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling, and fault tolerance. However, the original implementation of the MapReduce framework had some limitations, which many follow-up research efforts have tackled since its introduction. This article provides a comprehensive survey of a family of large-scale data processing approaches and mechanisms that are based on the original idea of the MapReduce framework and are currently gaining momentum in both research and industrial communities. We also cover a set of systems that provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some future research directions for implementing the next generation of MapReduce-like solutions. Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other authors
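As a rough illustration of the programming model this abstract describes, the following single-process sketch runs a word count through map, shuffle, and reduce steps. It is an in-memory toy, not a distributed implementation; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are illustrative assumptions and do not correspond to any particular framework's API.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group all values by key, as happens between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


def word_count_mapper(line):
    # Emit (word, 1) for every word in the line.
    return [(word.lower(), 1) for word in line.split()]


def word_count_reducer(word, counts):
    # Sum the partial counts for one word.
    return sum(counts)


if __name__ == "__main__":
    lines = ["map reduce map", "reduce shuffle"]
    print(run_job(lines, word_count_mapper, word_count_reducer))
```

The application code consists only of the mapper and the reducer; data distribution, scheduling, and fault tolerance would be the framework's responsibility, and here they are reduced to plain loops.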

    Distributed simulation optimization and parameter exploration framework for the cloud

    Simulation models are becoming an increasingly popular tool for the analysis and optimization of complex real systems in different fields. Finding an optimal system design requires an organized, large sweep over the parameter space. The model optimization process is therefore computationally very demanding, as it requires careful, time-consuming, and complex orchestration of coordinated executions. In this paper, we present the design of SOF (Simulation Optimization and exploration Framework in the cloud), a framework that exploits the computing power of a cloud environment to carry out effective and efficient simulation optimization strategies. SOF offers several attractive features. First, SOF requires “zero configuration”: no additional software has to be installed on the remote node, and standard Apache Hadoop and SSH access are sufficient. Second, SOF is transparent to the user, who is unaware that the system operates in a distributed environment. Finally, SOF is highly customizable and programmable, since it can run different simulation optimization scenarios written in diverse programming languages, provided that the hosting platform supports them, and with different simulation toolkits, as developed by the modeler. The tool has been fully developed and is available in a public repository under the terms of the open source Apache License. It has been tested and validated on several private platforms, such as a dedicated cluster of workstations, as well as on public platforms, including the Hortonworks Data Platform and the Amazon Web Services Elastic MapReduce solution.
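The parameter sweep this abstract refers to can be sketched as follows. This is not SOF's API: `toy_simulation` and `sweep` are hypothetical names, the objective function is invented for illustration, and a local thread pool stands in for remote cloud nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def toy_simulation(params):
    """Hypothetical simulation run that returns a cost to minimize."""
    x, y = params
    return (x - 3) ** 2 + (y + 1) ** 2


def sweep(simulation, axes, max_workers=4):
    """Evaluate the simulation on every grid point and return the best one."""
    grid = list(product(*axes))            # full Cartesian parameter grid
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(simulation, grid))   # order matches the grid
    best_score, best_params = min(zip(scores, grid))
    return best_params, best_score


if __name__ == "__main__":
    axes = [range(0, 6), range(-3, 3)]
    print(sweep(toy_simulation, axes))
```

Each grid point is an independent simulation run, which is why a sweep of this kind distributes naturally over many workers.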