11,074 research outputs found
The Family of MapReduce and Large Scale Data Processing Systems
In the last two decades, the continuous increase of computational power has
produced an overwhelming flow of data which has called for a paradigm shift in
the computing architecture and large scale data processing mechanisms.
MapReduce is a simple and powerful programming model that enables easy
development of scalable parallel applications to process vast amounts of data
on large clusters of commodity machines. It isolates the application from the
details of running a distributed program such as issues on data distribution,
scheduling and fault tolerance. However, the original implementation of the
MapReduce framework had some limitations that have been tackled by many
research efforts in several followup works after its introduction. This article
provides a comprehensive survey for a family of approaches and mechanisms of
large scale data processing mechanisms that have been implemented based on the
original idea of the MapReduce framework and are currently gaining a lot of
momentum in both research and industrial communities. We also cover a set of
introduced systems that have been implemented to provide declarative
programming interfaces on top of the MapReduce framework. In addition, we
review several large scale data processing systems that resemble some of the
ideas of the MapReduce framework for different purposes and application
scenarios. Finally, we discuss some of the future research directions for
implementing the next generation of MapReduce-like solutions.Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other author
Distributed processing of large remote sensing images using MapReduce - A case of Edge Detection
Dissertation submitted in partial fulfillment of the requirements for the Degree of Master of Science in Geospatial Technologies.Advances in sensor technology and their ever increasing repositories of the collected data are revolutionizing the mechanisms remotely sensed data are collected, stored and processed. This exponential growth of data archives and the increasing user’s demand for real-and near-real time remote sensing data products has pressurized remote sensing service providers to deliver the required services. The remote sensing community has recognized the challenge in processing large and complex satellite datasets to derive customized products. To address this high demand in computational resources, several efforts have been made in the past few years towards incorporation of high-performance computing models in remote sensing data collection, management and analysis. This study adds an impetus to these efforts by introducing the recent advancements in distributed computing technologies, MapReduce programming paradigm, to the area of remote sensing. The MapReduce model which is developed by Google Inc. encapsulates the efforts of distributed computing in a highly simplified single library. This simple but powerful programming model can provide us distributed environment without having deep knowledge of parallel programming. This thesis presents a MapReduce based processing of large satellite images a use case scenario of edge detection methods. Deriving from the conceptual massive remote sensing image processing applications, a prototype of edge detection methods was implemented on MapReduce framework using its open-source implementation, the Apache Hadoop environment. The experiences of the implementation of the MapReduce model of Sobel, Laplacian, and Canny edge detection methods are presented. This thesis also presents the results of the evaluation the effect of parallelization using MapReduce on the quality of the output and the execution time performance tests conducted based on various performance metrics. The MapReduce algorithms were executed on a test environment on heterogeneous cluster that supports the Apache Hadoop open-source software. The successful implementation of the MapReduce algorithms on a distributed environment demonstrates that MapReduce has a great potential for scaling large-scale remotely sensed images processing and perform more complex geospatial problems
A Survey on Array Storage, Query Languages, and Systems
Since scientific investigation is one of the most important providers of
massive amounts of ordered data, there is a renewed interest in array data
processing in the context of Big Data. To the best of our knowledge, a unified
resource that summarizes and analyzes array processing research over its long
existence is currently missing. In this survey, we provide a guide for past,
present, and future research in array processing. The survey is organized along
three main topics. Array storage discusses all the aspects related to array
partitioning into chunks. The identification of a reduced set of array
operators to form the foundation for an array query language is analyzed across
multiple such proposals. Lastly, we survey real systems for array processing.
The result is a thorough survey on array data storage and processing that
should be consulted by anyone interested in this research topic, independent of
experience level. The survey is not complete though. We greatly appreciate
pointers towards any work we might have forgotten to mention.Comment: 44 page
Geospatial Narratives and their Spatio-Temporal Dynamics: Commonsense Reasoning for High-level Analyses in Geographic Information Systems
The modelling, analysis, and visualisation of dynamic geospatial phenomena
has been identified as a key developmental challenge for next-generation
Geographic Information Systems (GIS). In this context, the envisaged
paradigmatic extensions to contemporary foundational GIS technology raises
fundamental questions concerning the ontological, formal representational, and
(analytical) computational methods that would underlie their spatial
information theoretic underpinnings.
We present the conceptual overview and architecture for the development of
high-level semantic and qualitative analytical capabilities for dynamic
geospatial domains. Building on formal methods in the areas of commonsense
reasoning, qualitative reasoning, spatial and temporal representation and
reasoning, reasoning about actions and change, and computational models of
narrative, we identify concrete theoretical and practical challenges that
accrue in the context of formal reasoning about `space, events, actions, and
change'. With this as a basis, and within the backdrop of an illustrated
scenario involving the spatio-temporal dynamics of urban narratives, we address
specific problems and solutions techniques chiefly involving `qualitative
abstraction', `data integration and spatial consistency', and `practical
geospatial abduction'. From a broad topical viewpoint, we propose that
next-generation dynamic GIS technology demands a transdisciplinary scientific
perspective that brings together Geography, Artificial Intelligence, and
Cognitive Science.
Keywords: artificial intelligence; cognitive systems; human-computer
interaction; geographic information systems; spatio-temporal dynamics;
computational models of narrative; geospatial analysis; geospatial modelling;
ontology; qualitative spatial modelling and reasoning; spatial assistance
systemsComment: ISPRS International Journal of Geo-Information (ISSN 2220-9964);
Special Issue on: Geospatial Monitoring and Modelling of Environmental
Change}. IJGI. Editor: Duccio Rocchini. (pre-print of article in press
A MapReduce-Based Big Spatial Data Framework for Solving the Problem of Covering a Polygon with Orthogonal Rectangles
The polygon covering problem is an important class of problems in the area of computational geometry. There are slightly different versions of this problem depending on the types of polygons to be addressed. In this paper, we focus on finding an answer to a question of whether an orthogonal rectangle, or spatial query window, is fully covered by a set of orthogonal rectangles which are in smaller sizes. This problem is encountered in many application domains including object recognition/extraction/trace, spatial analyses, topological analyses, and augmented reality applications. In many real-world applications, in the cases of using traditional central computation techniques, working with real world data results in a performance bottlenecks. The work presented in this paper proposes a high performance MapReduce-based big data framework to solve the polygon covering problem in the cases of using a spatial query window and data are represented as a set of orthogonal rectangles. Orthogonal rectangular polygons are represented in the form of minimum bounding boxes. The spatial query windows are also called as range queries. The proposed spatial big data framework is evaluated in terms of horizontal scalability. In addition, efficiency and speed-up performance metrics for the proposed two algorithms are measured
Astronomy in the Cloud: Using MapReduce for Image Coaddition
In the coming decade, astronomical surveys of the sky will generate tens of
terabytes of images and detect hundreds of millions of sources every night. The
study of these sources will involve computation challenges such as anomaly
detection and classification, and moving object tracking. Since such studies
benefit from the highest quality data, methods such as image coaddition
(stacking) will be a critical preprocessing step prior to scientific
investigation. With a requirement that these images be analyzed on a nightly
basis to identify moving sources or transient objects, these data streams
present many computational challenges. Given the quantity of data involved, the
computational load of these problems can only be addressed by distributing the
workload over a large number of nodes. However, the high data throughput
demanded by these applications may present scalability challenges for certain
storage architectures. One scalable data-processing method that has emerged in
recent years is MapReduce, and in this paper we focus on its popular
open-source implementation called Hadoop. In the Hadoop framework, the data is
partitioned among storage attached directly to worker nodes, and the processing
workload is scheduled in parallel on the nodes that contain the required input
data. A further motivation for using Hadoop is that it allows us to exploit
cloud computing resources, e.g., Amazon's EC2. We report on our experience
implementing a scalable image-processing pipeline for the SDSS imaging database
using Hadoop. This multi-terabyte imaging dataset provides a good testbed for
algorithm development since its scope and structure approximate future surveys.
First, we describe MapReduce and how we adapted image coaddition to the
MapReduce framework. Then we describe a number of optimizations to our basic
approach and report experimental results comparing their performance.Comment: 31 pages, 11 figures, 2 table
OPTIMIZATION APPROACHES TO MPI AND AREA MERGING-BASED PARALLEL BUFFER ALGORITHM
On buffer zone construction, the rasterization-based dilation method inevitablyintroduces errors, and the double-sided parallel line method involves a series ofcomplex operations. In this paper, we proposed a parallel buffer algorithm based onarea merging and MPI (Message Passing Interface) to improve the performances ofbuffer analyses on processing large datasets. Experimental results reveal that thereare three major performance bottlenecks which significantly impact the serial andparallel buffer construction efficiencies, including the area merging strategy, thetask load balance method and the MPI inter-process results merging strategy.Corresponding optimization approaches involving tree-like area merging strategy, the vertex number oriented parallel task partition method and the inter-processresults merging strategy were suggested to overcome these bottlenecks. Experimentswere carried out to examine the performance efficiency of the optimized parallelalgorithm. The estimation results suggested that the optimization approaches couldprovide high performance and processing ability for buffer construction in a clusterparallel environment. Our method could provide insights into the parallelization ofspatial analysis algorithm
- …