232 research outputs found

    A HADOOP-BASED ALGORITHM OF GENERATING DEM GRID FROM POINT CLOUD DATA


    Using SQL-based Scripting Languages in Hadoop Ecosystem for Data Analytics

    The goal of this thesis is to compare different SQL-based scripting languages in the Hadoop ecosystem by implementing data analytics algorithms. The thesis compared framework efficiency and the ease of implementing algorithms for a user with no previous experience in distributed computing. To fulfill this goal, three algorithms were implemented: Pearson's correlation, simple linear regression, and the naive Bayes classifier. The algorithms were implemented in two SQL-based frameworks in the Hadoop ecosystem, Spark SQL and HiveQL, and also using Spark MLlib. SQLContext and HiveContext were also compared in Spark SQL. The algorithms were tested on a cluster with different dataset sizes and different numbers of executors, and the scaling of the Spark SQL and Spark MLlib implementations was measured. The results show that for Pearson's correlation, HiveQL is slightly faster than the other two frameworks. For linear regression, Spark SQL and Spark MLlib have similar run times, both about 30% faster than HiveQL, and both scale well on these two algorithms. For the naive Bayes classifier, Spark SQL did not scale well but was still faster than HiveQL; the Spark MLlib results for multinomial naive Bayes proved inconclusive. For correlation and regression, no difference between SQLContext and HiveContext was found. The thesis found the SQL-based frameworks easy to use: HiveQL was the easiest, while Spark SQL required some additional study of distributed computing. Implementing the algorithms with Spark MLlib was more difficult, as it required understanding the internal workings of each algorithm as well as knowledge of distributed computing.
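
    As a rough, hedged illustration of the kind of SQL-based implementation this thesis compares (the thesis' own queries are not reproduced here; the view name, column names, and toy data below are assumptions), Pearson's correlation can be written as a single Spark SQL aggregation:

    // Minimal sketch: Pearson's correlation as one Spark SQL aggregation,
    // in the spirit of the SQL-based implementations compared in the thesis.
    // The view name, column names, and toy data are assumptions.
    import org.apache.spark.sql.SparkSession

    object PearsonSqlSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("pearson-sql").getOrCreate()
        import spark.implicits._

        // Toy data standing in for the thesis' cluster input files.
        val df = Seq((1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)).toDF("x", "y")
        df.createOrReplaceTempView("points")

        // r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))
        spark.sql("""
          SELECT (COUNT(*) * SUM(x * y) - SUM(x) * SUM(y)) /
                 (SQRT(COUNT(*) * SUM(x * x) - SUM(x) * SUM(x)) *
                  SQRT(COUNT(*) * SUM(y * y) - SUM(y) * SUM(y))) AS pearson_r
          FROM points
        """).show()

        spark.stop()
      }
    }

    For comparison, Spark's DataFrame API offers the same statistic as a one-liner, df.stat.corr("x", "y", "pearson").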

    Leveraging Tiled Display for Big Data Visualization Using D3.js

    Data visualization has proven effective at detecting patterns and drawing inferences from raw data by transforming it into visual representations. As data grows large, visualizing it faces two major challenges: 1) limited resolution, i.e. a screen is limited to a few million pixels but the data can have a billion data points, and 2) computational load, i.e. processing this data becomes challenging for a single-node system. This work addresses both of these issues for efficient big data visualization. In the developed system, a high pixel density, large format display enables fine details to be shown when visualizing data, while Apache Spark and Hadoop allow the computation to be done on a cluster. The system is demonstrated using a global wind flow simulation: the Global Surface Summary of the Day dataset is processed and visualized in web browsers with Data-Driven Documents (D3.js) code. We conducted both a performance evaluation and a user study to measure the performance and effectiveness of the system. The system was most efficient when visualizing data using streamed bitmap images rather than streamed raw data; however, it only rendered images at 6-10 frames per second (FPS) and did not meet our target of 30 FPS. The user study concluded that the system is effective and easy to use for data visualization. The outcome of our experiment suggests that the current state of Google Chrome may not be powerful enough for heavy 2D data visualization on the web and still needs more development for visualizing data of large magnitude.
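
    A minimal sketch of the Spark-side preprocessing such a pipeline might perform is shown below; the GSOD column names, file paths, and the choice of a per-station mean wind speed are assumptions for illustration, and the browser-side D3.js rendering is not shown.

    // Sketch of a Spark aggregation that prepares GSOD-style records for a
    // browser visualization: mean wind speed per station, written as JSON.
    // The input schema and paths are assumed, not taken from the paper.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.avg

    object GsodWindSummary {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("gsod-wind").getOrCreate()

        val gsod = spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("hdfs:///data/gsod/*.csv")          // assumed input location

        // One record per station with its average wind speed, ready for D3.
        gsod.groupBy("station", "lat", "lon")
          .agg(avg("wind_speed").alias("mean_wind"))
          .coalesce(1)
          .write
          .mode("overwrite")
          .json("hdfs:///viz/gsod_wind_summary")   // assumed output path

        spark.stop()
      }
    }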

    Big Data Geospatial Processing for Massive Aerial LiDAR Datasets

    For years, Light Detection and Ranging (LiDAR) technology has been considered a challenge when it comes to developing efficient software to handle the extremely large volumes of data this surveying method is able to collect. In contexts such as this, big data technologies have been providing powerful solutions for distributed storage and computing. In this work, a big data approach to geospatial processing for massive aerial LiDAR point clouds is presented. By using Cassandra and Spark, our proposal is intended to support the execution of any kind of heavy, time-consuming process; nonetheless, as an initial case study, we have focused on fast ground-only raster generation to produce digital terrain models (DTMs) from massive LiDAR datasets. Filtered clouds obtained from the isolated processing of adjacent zones may exhibit errors located on the boundaries of the zones in the form of misclassified points. Usually, this type of error is corrected through manual or semi-automatic procedures. In this work, we also present an automated strategy for correcting errors of this type, improving the quality of the classification process and the DTMs obtained while minimizing user intervention. The autonomous nature of all computing stages, along with the low processing times achieved, opens the possibility of considering the system a highly scalable, service-oriented solution for on-demand DTM generation or any other geospatial process. Such a solution would be a highly useful and unique service for many users in the LiDAR field, and one which could approach real-time processing with appropriate computational resources.
    Funding: Xunta de Galicia, ED431C 2017/04; Consolidation Programme of Competitive Research Units, R2016/037; Xunta de Galicia, ED431G/01; Ministerio de Economía y Competitividad, TIN2016-75845-
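
    This is not the paper's Cassandra/Spark pipeline; the sketch below only illustrates the ground-raster idea in plain Spark: bin points into raster cells and keep the lowest return per cell as a crude ground estimate. The point schema, file paths, cell size, and the minimum-elevation heuristic are all assumptions.

    // Minimal sketch of gridding a LiDAR point cloud into a coarse DTM:
    // assign each point to a raster cell and keep the minimum elevation per
    // cell as a crude ground estimate. Schema and cell size are assumptions.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, floor, min}

    object LidarDtmSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("lidar-dtm").getOrCreate()

        val cellSize = 1.0 // metres per raster cell (assumed)

        // Points as (x, y, z) in a projected CRS; the Parquet path is assumed.
        val points = spark.read.parquet("hdfs:///data/lidar/points.parquet")

        val dtm = points
          .withColumn("col", floor(col("x") / cellSize))
          .withColumn("row", floor(col("y") / cellSize))
          .groupBy("row", "col")
          .agg(min("z").alias("ground_z"))

        dtm.write.mode("overwrite").parquet("hdfs:///data/lidar/dtm_cells")
        spark.stop()
      }
    }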

    An auto-scaling framework for analyzing big data in the cloud environment

    Processing big data on traditional computing infrastructure is a challenge, as the large volume of data entails high computational complexity. Recently, Apache Hadoop has emerged as a distributed computing infrastructure for dealing with big data. Getting Hadoop to dynamically adjust its computing resources based on the real-time workload is itself a demanding task, so clusters are conventionally pre-configured with enough resources to handle the peak data load. However, this may cause considerable waste of computing resources when usage levels are much lower than the preset load. In consideration of this, this paper investigates an auto-scaling framework in a cloud environment that aims to minimise the cost of resource use by automatically adjusting the number of virtual nodes according to the real-time data load. A cost-effective auto-scaling (CEAS) framework is first proposed for the Amazon Web Services (AWS) cloud environment. The proposed CEAS framework allows us to scale the computing resources of a Hadoop cluster so as to either reduce resource use when the workload is low or scale up the computing resources to speed up data processing and analysis within an adequate time. To validate the effectiveness of the proposed framework, a case study of real-time sentiment analysis on university-related tweets is provided, analysing the reviews/tweets that people post on social media. Such a dynamic scaling method offers a reference for improving Twitter data analysis in a more cost-effective and flexible way.
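
    The CEAS framework itself is not reproduced here; the sketch below only makes a threshold-style scaling decision concrete. The thresholds, node bounds, and load metric are illustrative assumptions, and the AWS calls that would apply the decision are omitted.

    // Hypothetical threshold-based scaling rule: given the current load per
    // worker node, decide how many nodes the cluster should have. All
    // thresholds and bounds are illustrative assumptions only.
    object ScalingRule {

      final case class ClusterState(workerNodes: Int, loadPerNode: Double)

      /** Returns the desired number of worker nodes for the observed state. */
      def desiredNodes(state: ClusterState,
                       scaleUpAt: Double = 0.80,   // assumed upper threshold
                       scaleDownAt: Double = 0.30, // assumed lower threshold
                       minNodes: Int = 2,
                       maxNodes: Int = 20): Int = {
        val target =
          if (state.loadPerNode > scaleUpAt) state.workerNodes + 1
          else if (state.loadPerNode < scaleDownAt) state.workerNodes - 1
          else state.workerNodes
        target.max(minNodes).min(maxNodes)
      }

      def main(args: Array[String]): Unit = {
        // Example: a busy 4-node cluster is told to grow to 5 nodes.
        println(desiredNodes(ClusterState(workerNodes = 4, loadPerNode = 0.92)))
      }
    }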

    Efficient clustering techniques on Hadoop and Spark

    Software services based on large-scale distributed systems demand continuous and decentralised solutions for achieving system consistency and providing operational monitoring. Epidemic data aggregation algorithms provide decentralised, scalable and fault-tolerant solutions that can be used for system-wide tasks such as global state determination, monitoring and consensus. Existing continuous epidemic algorithms either periodically restart at fixed epochs or apply changes in the system state instantly, producing a less accurate approximation. This work introduces an innovative mechanism without fixed epochs that monitors the system state and restarts upon detecting convergence or divergence. The mechanism produces correct aggregates with an approximation error as small as desired. The proposed solution is validated and analysed by means of simulations under static and dynamic network conditions.
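
    As a toy illustration of the mechanism described above (not the paper's algorithm), the single-process simulation below runs push-sum-style epidemic averaging and checks the kind of convergence condition an epoch-free protocol could use as its restart trigger; all parameters are assumptions.

    // Toy simulation of epidemic (push-sum) averaging: each node repeatedly
    // pushes half of its (value, weight) pair to a random peer, and
    // value/weight converges to the global mean at every node.
    import scala.util.Random

    object EpidemicAveraging {

      /** True when all local estimates agree within the tolerance. */
      def converged(values: Array[Double], weights: Array[Double], tol: Double): Boolean = {
        val estimates = values.indices.map(i => values(i) / weights(i))
        estimates.max - estimates.min < tol
      }

      def main(args: Array[String]): Unit = {
        val rng = new Random(42)
        val readings = Array.fill(50)(rng.nextDouble() * 100) // per-node measurements
        val values = readings.clone()                          // push-sum state
        val weights = Array.fill(readings.length)(1.0)

        var round = 0
        while (!converged(values, weights, tol = 1e-6) && round < 1000) {
          for (i <- values.indices) {
            val j = rng.nextInt(values.length)   // random gossip partner
            val (dv, dw) = (values(i) / 2, weights(i) / 2)
            values(i) -= dv; weights(i) -= dw    // keep half ...
            values(j) += dv; weights(j) += dw    // ... push half to the partner
          }
          round += 1
        }

        // Once estimates agree within the tolerance, a continuous protocol
        // would restart the aggregation here instead of at a fixed epoch.
        println(f"rounds=$round estimate=${values(0) / weights(0)}%.4f " +
                f"trueMean=${readings.sum / readings.length}%.4f")
      }
    }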

    Distributed Kernelized Locality-Sensitive Hashing for Faster Image Based Navigation

    Content based image retrieval (CBIR) remains one of the most heavily researched areas in computer vision. Different image retrieval techniques and algorithms have been implemented and used in localization research, in object recognition applications, and commercially by companies such as Facebook, Google, and Yahoo!. Current methods for image retrieval become problematic when applied to image datasets that can easily reach billions of images. In order to process such extremely large datasets, the computation must be distributed across a cluster of machines using software such as Apache Hadoop. There are many different algorithms for conducting content based image retrieval, but this research focuses on Kernelized Locality-Sensitive Hashing (KLSH). For the first time, a distributed implementation of the KLSH algorithm using the MapReduce programming paradigm performs CBIR and localization on an urban environment image dataset. This new distributed algorithm is shown to be 4.8 times faster than a brute force linear search while still maintaining localization accuracy within 8.5 meters.
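
    The paper's kernelized, MapReduce-distributed variant is not reproduced here; the single-machine sketch below uses plain random-hyperplane LSH only to illustrate why hashing lets a query inspect one bucket instead of the whole image collection. Dimensions, hash length, and data are assumptions.

    // Plain (non-kernelized) random-hyperplane LSH on one machine:
    // descriptors sharing a hash code land in the same bucket, so a query
    // only compares against one bucket rather than the full dataset.
    import scala.util.Random

    object LshSketch {
      val rng = new Random(7)
      val dim = 16          // descriptor dimensionality (assumed)
      val numPlanes = 8     // hash length in bits (assumed)

      // One random hyperplane per hash bit.
      val planes: Array[Array[Double]] =
        Array.fill(numPlanes)(Array.fill(dim)(rng.nextGaussian()))

      /** Sign-of-dot-product hash: one bit per hyperplane. */
      def hash(v: Array[Double]): Int =
        planes.zipWithIndex.foldLeft(0) { case (code, (plane, bit)) =>
          val dot = plane.zip(v).map { case (a, b) => a * b }.sum
          if (dot >= 0) code | (1 << bit) else code
        }

      def main(args: Array[String]): Unit = {
        val descriptors = Array.fill(1000)(Array.fill(dim)(rng.nextGaussian()))
        // Build the hash table: bucket id -> descriptors in that bucket.
        val buckets = descriptors.groupBy(hash)

        // A slightly perturbed copy of a stored descriptor as the query.
        val query = descriptors(0).map(x => x + 0.01 * rng.nextGaussian())
        val candidates = buckets.getOrElse(hash(query), Array.empty[Array[Double]])
        println(s"candidates examined: ${candidates.length} of ${descriptors.length}")
      }
    }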

    MERRA/AS: The MERRA Analytic Services Project Interim Report

    MERRA/AS is a cyberinfrastructure resource that will combine iRODS-based Climate Data Server (CDS) capabilities with Cloudera MapReduce to serve MERRA analytic products. It will store the MERRA reanalysis data collection in HDFS to enable parallel, high-performance, storage-side data reductions; manage storage-side driver, mapper, and reducer code sets and realized objects for users; and provide a library of commonly used spatiotemporal operations that can be composed to enable higher-order analyses.
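
    None of MERRA/AS's storage-side code sets are shown here; the sketch below is only a map/reduce-shaped illustration of one plausible "commonly used spatiotemporal operation", a temporal mean per grid cell, with the record layout invented for illustration.

    // Map/reduce-shaped sketch of a temporal mean per grid cell, the kind of
    // spatiotemporal reduction described above. The record layout
    // (time, lat index, lon index, value) is an illustrative assumption.
    object TemporalMeanSketch {

      final case class Obs(time: Long, latIdx: Int, lonIdx: Int, value: Double)

      /** Mean value per (lat, lon) cell over the requested time window. */
      def temporalMean(records: Seq[Obs], from: Long, to: Long): Map[(Int, Int), Double] =
        records
          .filter(o => o.time >= from && o.time <= to)   // map/filter side
          .groupBy(o => (o.latIdx, o.lonIdx))            // shuffle by grid cell
          .map { case (cell, obs) =>                     // reduce side
            cell -> obs.map(_.value).sum / obs.size
          }

      def main(args: Array[String]): Unit = {
        val sample = Seq(
          Obs(0L, 10, 20, 280.1), Obs(1L, 10, 20, 281.3), Obs(0L, 11, 20, 275.0))
        println(temporalMean(sample, from = 0L, to = 1L))
      }
    }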
    • 
