181 research outputs found

    Dynamically Iterative MapReduce

    MapReduce is a distributed, parallel computing model for data-intensive tasks, with features such as optimized scheduling, flexibility, high availability, and high manageability. MapReduce can work on various platforms; however, it is not well suited to iterative programs, whose performance may be degraded by frequent disk I/O operations. In order to improve system performance and resource utilization, we propose a novel MapReduce framework named Dynamically Iterative MapReduce (DIMR), which reduces the number of disk I/O operations and the consumption of network bandwidth by means of dynamic task allocation and a memory management mechanism. We show that DIMR is promising through detailed discussions in this paper.
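
    The performance problem DIMR targets can be seen in how an iterative job is usually driven. Below is a minimal sketch of the contrast, with a hypothetical driver API (an illustration of the idea, not DIMR's actual interface):

        def iterate_naive(job, input_path, iterations):
            # Classic Hadoop-style loop: every iteration writes its output to
            # the distributed file system and the next iteration reads it back,
            # so disk I/O grows linearly with the iteration count.
            current = input_path
            for i in range(iterations):
                output = "/tmp/iter_%d" % i        # hypothetical HDFS path
                job.run(map_input=current, reduce_output=output)
                current = output
            return current

        def iterate_cached(job, records, iterations):
            # The DIMR-style idea: load loop-invariant data into memory once
            # and iterate on the cached state, avoiding the per-iteration disk
            # round-trip and the associated network traffic.
            state = job.load_into_memory(records)
            for _ in range(iterations):
                state = job.run_in_memory(state)
            return state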

    BigFCM: Fast, Precise and Scalable FCM on Hadoop

    Clustering plays an important role in mining big data, both as a modeling technique and as a preprocessing step in many data mining pipelines. Fuzzy clustering provides more flexibility than non-fuzzy methods by allowing each data record to belong to more than one cluster to some degree. However, a serious challenge in fuzzy clustering is the lack of scalability. Massive datasets in emerging fields such as geosciences, biology, and networking require parallel and distributed computation with high performance to solve real-world problems. Although some clustering methods have already been adapted to execute on big data platforms, their execution time increases sharply for large datasets. In this paper, a scalable Fuzzy C-Means (FCM) clustering method named BigFCM is proposed and designed for the Hadoop distributed data platform. Based on the map-reduce programming model, it exploits several mechanisms, including an efficient caching design, to achieve several orders of magnitude reduction in execution time. Extensive evaluation over multi-gigabyte datasets shows that BigFCM is scalable while preserving the quality of clustering.
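
    For context, the membership update at the heart of FCM, which each pass of such a system parallelizes over chunks of the data, fits in a few lines of NumPy. This is a generic single-node FCM step for illustration; BigFCM's map-reduce partitioning and caching design are not reproduced here:

        import numpy as np

        def fcm_step(X, centers, m=2.0):
            # One Fuzzy C-Means iteration on an (n, d) data matrix X with
            # (c, d) centroids; m > 1 is the fuzzifier.
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            d = np.fmax(d, 1e-12)                  # guard against zero distances
            # u[i, j]: degree to which point i belongs to cluster j; each row
            # sums to 1, but a point may belong to several clusters at once
            # (the "fuzzy" flexibility the abstract refers to).
            u = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1)),
                             axis=2)
            w = u ** m                             # membership weights
            new_centers = (w.T @ X) / w.sum(axis=0)[:, None]
            return u, new_centers

    Iterating fcm_step until the centroids stop moving yields the final fuzzy partition.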

    Performance Model of MapReduce Iterative Applications for Hybrid Cloud Bursting

    Hybrid cloud bursting (i.e., leasing temporary off-premise cloud resources to boost the overall capacity during peak utilization) can be a cost-effective way to deal with the increasing complexity of big data analytics, especially for iterative applications. However, the low-throughput, high-latency network link between the on-premise and off-premise resources (the “weak link”) makes maintaining scalability difficult. While several data locality techniques have been designed for big data bursting on hybrid clouds, their effectiveness is difficult to estimate in advance. Yet such estimations are critical, because they help users decide whether the extra pay-as-you-go cost incurred by using the off-premise resources justifies the runtime speed-up. To this end, the current paper presents a performance model and methodology to estimate the runtime of iterative MapReduce applications in a hybrid cloud-bursting scenario. The paper focuses on the overhead incurred by the weak link at fine granularity, for both the map and the reduce phases. This approach enables high estimation accuracy, as demonstrated by extensive experiments at scale using a mix of real-world iterative MapReduce applications from standard big data benchmarking suites that cover a broad spectrum of data patterns. Not only are the produced estimations accurate in absolute terms compared with experimental results, but they are also up to an order of magnitude more accurate than state-of-the-art estimation approaches originally designed for single-site MapReduce deployments.
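
    As a toy illustration of the kind of per-phase, weak-link-aware estimate such a model produces (hypothetical parameters and a deliberately simplified formula, not the paper's actual model):

        def phase_time(local_bytes, cross_bytes, slots, slot_bps, weak_link_bps):
            # Duration of one map or reduce phase: bounded by local processing
            # across the available slots and by the data that must cross the
            # low-throughput on-/off-premise "weak link". Assumes compute and
            # transfer overlap; all parameters are hypothetical.
            compute = local_bytes / (slots * slot_bps)
            transfer = cross_bytes / weak_link_bps
            return max(compute, transfer)

        def iterative_job_time(iterations, map_args, reduce_args):
            # An iterative MapReduce job repeats both phases every iteration,
            # so weak-link overhead accumulates linearly with iteration count.
            return iterations * (phase_time(*map_args) + phase_time(*reduce_args))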

    Distributed processing of large remote sensing images using MapReduce - A case of Edge Detection

    Dissertation submitted in partial fulfillment of the requirements for the Degree of Master of Science in Geospatial Technologies. Advances in sensor technology and the ever-increasing repositories of collected data are revolutionizing the mechanisms by which remotely sensed data are collected, stored, and processed. This exponential growth of data archives and the increasing user demand for real- and near-real-time remote sensing data products have pressured remote sensing service providers to deliver the required services. The remote sensing community has recognized the challenge of processing large and complex satellite datasets to derive customized products. To address this high demand for computational resources, several efforts have been made in the past few years toward incorporating high-performance computing models in remote sensing data collection, management, and analysis. This study adds impetus to these efforts by introducing a recent advancement in distributed computing, the MapReduce programming paradigm, to the area of remote sensing. The MapReduce model, developed by Google Inc., encapsulates the machinery of distributed computing in a highly simplified single library; this simple but powerful programming model provides a distributed environment without requiring deep knowledge of parallel programming. This thesis presents MapReduce-based processing of large satellite images through a use-case scenario of edge detection. Drawing on massive remote sensing image processing applications, a prototype of several edge detection methods was implemented on the MapReduce framework using its open-source implementation, the Apache Hadoop environment, and the experience of implementing MapReduce versions of the Sobel, Laplacian, and Canny edge detection methods is presented. The thesis also evaluates the effect of parallelization with MapReduce on the quality of the output, along with execution time performance tests based on various metrics. The MapReduce algorithms were executed on a heterogeneous test cluster running the Apache Hadoop open-source software. The successful implementation of the MapReduce algorithms in a distributed environment demonstrates that MapReduce has great potential for scaling the processing of large remotely sensed images and for tackling more complex geospatial problems.
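
    To make the use case concrete, the per-tile work of a Sobel mapper looks roughly as follows. This is a generic Sobel sketch, not the thesis code; Hadoop input splitting and the halo of border pixels each tile needs from its neighbors are omitted:

        import numpy as np

        def sobel_tile(tile):
            # Gradient magnitude for one image tile; in the MapReduce setting,
            # each map task would apply this to its split of the image.
            gx_k = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
            gy_k = gx_k.T                          # vertical-gradient kernel
            h, w = tile.shape
            out = np.zeros((h, w))
            for y in range(1, h - 1):
                for x in range(1, w - 1):
                    win = tile[y - 1:y + 2, x - 1:x + 2]
                    out[y, x] = np.hypot(np.sum(gx_k * win), np.sum(gy_k * win))
            return out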

    3rd Many-core Applications Research Community (MARC) Symposium. (KIT Scientific Reports ; 7598)

    This manuscript includes recent scientific work regarding the Intel Single-Chip Cloud Computer and describes novel approaches for programming and run-time organization.