209 research outputs found

    Towards High-Performance Big Data Processing Systems

    Get PDF
    The amount of generated and stored data has been growing rapidly, It is estimated that 2.5 quintillion bytes of data are generated every day, and 90% of the data in the world today has been created in the last two years. How to solve these big data issues has become a hot topic in both industry and academia. Due to the complex of big data platform, we stratify it into four layers: storage layer, resource management layer, computing layer, and methodology layer. This dissertation proposes brand-new approaches to address the performance of big data platforms like Hadoop and Spark on these four layers. We first present an improved HDFS design called SMARTH, which optimizes the storage layer. It utilizes asynchronous multi-pipeline data transfers instead of a single pipeline stop-and-wait mechanism. SMARTH records the actual transfer speed of data blocks and sends this information to the namenode along with periodic heartbeat messages. The namenode sorts datanodes according to their past performance and tracks this information continuously. When a client initiates an upload request, the namenode will send it a list of \u27\u27high performance\u27\u27 datanodes that it thinks will yield the highest throughput for the client. By choosing higher performance datanodes relative to each client and by taking advantage of the multi-pipeline design, our experiments show that SMARTH significantly improves the performance of data write operations compared to HDFS. Specifically, SMARTH is able to improve the throughput of data transfer by 27-245% in a heterogeneous virtual cluster on Amazon EC2. Secondly, we propose an optimized Hadoop extension called MRapid, which significantly speeds up the execution of short jobs on the resource management layer. It is completely backward compatible to Hadoop, and imposes negligible overhead. Our experiments on Microsoft Azure public cloud show that MRapid can improve performance by up to 88% compared to the original Hadoop. Thirdly, we introduce an efficient 3-level sampling performance model, called Hedgehog, and focus on the relationship between resource and performance. This design is a brand new white-box model for Spark, which is more complex and challenging than Hadoop. In our tool, we employ a Java bytecode manipulation and analysis framework called ASM to reduce the profiling overhead dramatically. Fourthly, on the computing layer, we optimize the current implementation of SGD in Spark\u27s MLlib by reusing data partition for multiple times within a single iteration to find better candidate weights in a more efficient way. Whether using multiple local iterations within each partition is dynamically decided by the 68-95-99.7 rule. We also design a variant of momentum algorithm to optimize step size in every iteration. This method uses a new adaptive rule that decreases the step size whenever neighboring gradients show differing directions of significance. Experiments show that our adaptive algorithm is more efficient and can be 7 times faster compared to the original MLlib\u27s SGD. At last, on the application layer, we present a scalable and distributed geographic information system, called Dart, based on Hadoop and HBase. Dart provides a hybrid table schema to store spatial data in HBase so that the Reduce process can be omitted for operations like calculating the mean center and the median center. It employs reasonable pre-splitting and hash techniques to avoid data imbalance and hot region problems. It also supports massive spatial data analysis like K-Nearest Neighbors (KNN) and Geometric Median Distribution. In our experiments, we evaluate the performance of Dart by processing 160 GB Twitter data on an Amazon EC2 cluster. The experimental results show that Dart is very scalable and efficient

    Remote sensing big data computing: challenges and opportunities

    Get PDF
    As we have entered an era of high resolution earth observation, the RS data are undergoing an explosive growth. The proliferation of data also give rise to the increasing complexity of RS data, like the diversity and higher dimensionality characteristic of the data. RS data are regarded as RS ‘‘Big Data’’. Fortunately, we are witness the coming technological leapfrogging. In this paper, we give a brief overview on the Big Data and data-intensive problems, including the analysis of RS Big Data, Big Data challenges, current techniques and works for processing RS Big Data

    Sensor grid middleware metamodeling and analysis

    Full text link
    Sensor grid is a platform that combines wireless sensor networks and grid computing with the aim of exploiting the complementary advantages of the two systems. Proper integration of these distinct systems into effective, logically single platform is challenging. This paper presents an approach for practical sensor grid implementation and management. The proposed approach uses a metamodeling technique and performance analysis and tuning as well as a middleware infrastructure that enable practical sensor grid implementation and management. The paper presents our implementation and analysis of the sensor grid. © 2014 Srimathi Chandrasekaran et al

    Development of a New Framework for Distributed Processing of Geospatial Big Data

    Get PDF
    Geospatial technology is still facing a lack of “out of the box” distributed processing solutions which are suitable for the amount and heterogeneity of geodata, and particularly for use cases requiring a rapid response. Moreover, most of the current distributed computing frameworks have important limitations hindering the transparent and flexible control of processing (and/or storage) nodes and control of distribution of data chunks. We investigated the design of distributed processing systems and existing solutions related to Geospatial Big Data. This research area is highly dynamic in terms of new developments and the re-use of existing solutions (that is, the re-use of certain modules to implement further specific developments), with new implementations continuously emerging in areas such as disaster management, environmental monitoring and earth observation. The distributed processing of raster data sets is the focus of this paper, as we believe that the problem of raster data partitioning is far from trivial: a number of tiling and stitching requirements need to be addressed to be able to fulfil the needs of efficient image processing beyond pixel level. We attempt to compare the terms Big Data, Geospatial Big Data and the traditional Geospatial Data in order to clarify the typical differences, to compare them in terms of storage and processing backgrounds for different data representations and to categorize the common processing systems from the aspect of distributed raster processing. This clarification is necessary due to the fact that they behave differently on the processing side, and particular processing solutions need to be developed according to their characteristics. Furthermore, we compare parallel and distributed computing, taking into account the fact that these are used improperly in several cases. We also briefly assess the widely-known MapReduce paradigm in the context of geospatial applications. The second half of the article reports on a new processing framework initiative, currently at the concept and early development stages, which aims to be capable of processing raster, vector and point cloud data in a distributed IT ecosystem. The developed system is modular, has no limitations on programming language environment, and can execute scripts written in any development language (e.g. Python, R or C#)

    SAT-hadoop-processor: a distributed remote sensing big data processing software for earth observation applications

    Get PDF
    Nowadays, several environmental applications take advantage of remote sensing techniques. A considerable volume of this remote sensing data occurs in near real-time. Such data are diverse and are provided with high velocity and variety, their pre-processing requires large computing capacities, and a fast execution time is critical. This paper proposes a new distributed software for remote sensing data pre-processing and ingestion using cloud computing technology, specifically OpenStack. The developed software discarded 86% of the unneeded daily files and removed around 20% of the erroneous and inaccurate datasets. The parallel processing optimized the total execution time by 90%. Finally, the software efficiently processed and integrated data into the Hadoop storage system, notably the HDFS, HBase, and Hive.This research was funded by Erasmus+ KA 107 program, and the UPC funded the APC. This work has received funding from the Spanish Government under contracts PID2019-106774RBC21, PCI2019-111851-2 (LeadingEdge CHIST-ERA), PCI2019-111850-2 (DiPET CHIST-ERA).Peer ReviewedPostprint (published version
    • 

    corecore