
    Highest Order Voronoi Processing on Apache Spark

    A Voronoi diagram is a method that divides the plane into smaller regions based on the nearest distance to a set of objects. The highest order Voronoi diagram is a variant of the Voronoi diagram whose construction has complexity O(m^4), where m is the number of generator points. Highest order Voronoi diagrams are used in query processing, for example in reverse k-nearest neighbour (RKNN), k-farthest neighbour (KFN) and k-nearest neighbour (KNN) queries. Related work proposes two construction methods, Fast Labelling and Interchange Position (FLIP) and Left with Least-Angle Movement (LAM), but both are implemented on conventional computing, which limits the number of points that can be processed and leads to high execution times. The working set of data is reused inefficiently by repeatedly accessing disk, which drives up execution time and limits the number of points that can be processed; in addition, conventional computing does not fully utilize the available resources. Apache Spark is a framework that utilizes the available resources to optimize such computations: it distributes tasks across all available resources and works well on iterative processes that reuse a set of data, because it can keep the needed data in memory. This minor thesis shows that, with the help of the Apache Spark framework, the number of points that can be processed increases to 24, with execution times on average 60% faster than the LAM implementation. Keywords: voronoi diagram, highest order, apache spark, spatial
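    The Spark feature the abstract relies on is the ability to cache a working set in memory across iterations instead of re-reading it from disk. Below is a minimal PySpark sketch of that pattern only; the point data, iteration count, and per-iteration computation are hypothetical placeholders, not the thesis's FLIP/LAM-on-Spark algorithm.

```python
# Minimal sketch: keep a working set of generator points in memory
# and reuse it across iterations, avoiding repeated disk reads.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("hovd-sketch").getOrCreate()
sc = spark.sparkContext

# Hypothetical generator points (x, y); a real run would load them from storage.
points = sc.parallelize([(1.0, 2.0), (3.5, 0.5), (2.0, 4.0), (5.0, 1.0)])
points.cache()  # pin the working set in executor memory

for step in range(3):  # placeholder for the iterative construction steps
    # Each iteration reuses the cached points instead of re-reading from disk.
    partial = points.map(lambda p: (round(p[0]), 1)).reduceByKey(lambda a, b: a + b)
    print(step, partial.collect())

spark.stop()
```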

    Primena Big Data analitike za istraživanje prostorno-vremenske dinamike ljudske populacije

    With the rapid growth of the volume of available data related to human dynamics, it has become more challenging to investigate topics that could reveal novel knowledge in the area. At present, people tend to live mostly in large cities, where knowledge about human dynamics, habits and behaviour could lead to better city organisation, energy efficiency, transport organisation and overall better quality and more sustainable living. Human dynamics can be studied from many different aspects, but all of them have three elements in common: time, space and data volume. Human activity and interaction cannot be inspected without the space and time components, because everything happens somewhere at some time. Also, with the huge adoption of smartphones, terabytes of data related to human dynamics are now available. Although the data is sensitive with respect to personal information, the true owners of the data are the telecom operators, social media companies or other companies that provide the applications used on mobile phones. If such data is to be opened to the public or the scientific community for research, it needs to be anonymized first. Another challenge of user-generated data is the volume of the data sets. The data is usually very large in size (Volume), it comes from different sources and in different formats (Variety), and it is generated in real time and evolves very fast (Velocity). These are the three V's of Big Data, and such data sets need to be approached with specially designed Big Data technologies. In the research presented in this thesis, we bring together Big Data technologies, Graph Theory and space- and time-dependent human dynamics data.

    A Framework for Spatial Database Explanations

    In the last few years, there has been a tremendous increase in the use of big data. Most of this data is hard to understand because of its size and dimensions. The importance of this problem is underlined by the fact that the Big Data Research and Development Initiative was announced by the United States administration in 2012 to address problems faced by the government. Various states and cities in the US gather spatial data about incidents like police calls for service. Querying large amounts of data can raise many questions. For example, when we look at arithmetic relationships between queries in heterogeneous data, there are many differences. How can we explain what factors account for these differences? If we define the observation as an arithmetic relationship between queries, this kind of problem can be addressed by aggravation or intervention. Aggravation views the value of our observation for different sets of tuples, while intervention looks at the value of the observation after removing sets of tuples. We call the predicates that represent these tuples explanations. Observations by themselves have limited importance. For example, if we observe a large number of taxi trips in a specific area, we might ask: why are there so many trips here? Explanations attempt to answer these kinds of questions. While aggravation and intervention are designed for non-spatial data, we propose a new approach for explaining spatially heterogeneous data. Our approach expands on aggravation and intervention while using spatial partitioning/clustering to improve explanations for spatial data. Our proposed approach was evaluated against a real-world taxi dataset as well as synthetic disease outbreak datasets. The approach was found to outperform aggravation in precision and recall, while outperforming intervention in precision.
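    As a concrete reading of the intervention idea described above, the sketch below scores a candidate predicate by how much removing its tuples changes an observation defined as a ratio between two query results. This is a hypothetical toy (the column names, the observation, and the candidate predicates are invented for illustration), not the thesis's actual framework.

```python
import pandas as pd

# Hypothetical trip records; in the thesis the data would be, e.g., taxi trips.
trips = pd.DataFrame({
    "area":    ["A", "A", "A", "B", "B", "A", "A"],
    "hour":    [8, 9, 18, 9, 10, 18, 19],
    "payment": ["card", "cash", "card", "card", "card", "cash", "cash"],
})

def observation(df):
    # Observation: an arithmetic relationship between two queries,
    # here the ratio of trips in area A to trips in area B.
    a = len(df[df["area"] == "A"])
    b = len(df[df["area"] == "B"])
    return a / b if b else float("inf")

base = observation(trips)

# Candidate explanation predicates (hypothetical).
candidates = {
    "evening trips": trips["hour"] >= 18,
    "cash payments": trips["payment"] == "cash",
}

# Intervention: recompute the observation after removing the tuples selected
# by each predicate; a larger change suggests a stronger explanation.
for name, mask in candidates.items():
    changed = observation(trips[~mask])
    print(name, "influence:", abs(base - changed))
```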

    Geometric Approaches to Big Data Modeling and Performance Prediction

    Big Data frameworks (e.g., Spark) have many configuration parameters, such as memory size, CPU allocation, and the number of nodes (parallelism). Regular users and even expert administrators struggle to understand the relationship between different parameter configurations and the overall performance of the system. In this work, we address this challenge by proposing a performance prediction framework to build performance models with varied configurable parameters on Spark. We take inspiration from the field of Computational Geometry to construct a d-dimensional mesh using Delaunay Triangulation over a selected set of features. From this mesh, we predict execution time for unknown feature configurations. To minimize the time and resources spent in building a model, we propose an adaptive sampling technique that allows us to collect as few training points as required. Our evaluation on a cluster of computers using several workloads shows that our prediction error is lower than that of state-of-the-art methods while requiring fewer training samples.
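    A minimal sketch of the geometric idea: build a piecewise-linear model over measured (configuration, runtime) samples and query it at an unseen configuration. It uses scipy's LinearNDInterpolator, which triangulates the samples with a Delaunay triangulation internally; the configurations and runtimes below are made-up numbers, not results from the paper.

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator

# Hypothetical training samples: (executor memory in GB, number of cores).
configs = np.array([
    [2.0, 2], [2.0, 8], [8.0, 2], [8.0, 8], [4.0, 4],
], dtype=float)
# Measured execution times in seconds for those configurations (made up).
runtimes = np.array([310.0, 150.0, 240.0, 90.0, 170.0])

# Piecewise-linear model over the Delaunay triangulation of the samples.
model = LinearNDInterpolator(configs, runtimes)

# Predict the runtime of an unseen configuration inside the convex hull.
print(model([[5.0, 6.0]]))  # interpolated runtime; NaN for points outside the hull
```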

    Enabling stream processing for people-centric IoT based on the fog computing paradigm

    The world of machine-to-machine (M2M) communication is gradually moving from vertical single-purpose solutions to multi-purpose and collaborative applications interacting across industry verticals, organizations and people: a world of the Internet of Things (IoT). The dominant approach for delivering IoT applications relies on the development of cloud-based IoT platforms that collect all the data generated by the sensing elements and centrally process the information to create real business value. In this paper, we present a system that follows the Fog Computing paradigm, in which the sensor resources, as well as the intermediate layers between embedded devices and cloud computing datacenters, participate by providing computational, storage, and control capabilities. We discuss the design aspects of our system and present a pilot deployment for evaluating its performance in a real-world environment. Our findings indicate that Fog Computing can address the ever-increasing amount of data that is inherent in an IoT world through effective communication among all elements of the architecture.
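    To make the fog pattern concrete, here is a minimal, hypothetical sketch of the kind of behaviour the architecture implies: a fog node close to the sensors aggregates raw readings into windowed summaries and forwards only the summaries upstream, cutting the volume that reaches the cloud. The window size, reading source, and upstream call are placeholders, not the paper's actual system.

```python
import random
import statistics

def sensor_readings(n):
    """Hypothetical raw readings produced by an embedded device."""
    for _ in range(n):
        yield 20.0 + random.random() * 5.0  # e.g. temperature samples

def fog_node(readings, window=10):
    """Aggregate raw readings at the edge; forward one summary per window."""
    buf = []
    for value in readings:
        buf.append(value)
        if len(buf) == window:
            yield {"mean": statistics.mean(buf), "max": max(buf), "count": len(buf)}
            buf.clear()

def send_to_cloud(summary):
    """Placeholder for the upstream call to the cloud datacenter."""
    print("uploading", summary)

for summary in fog_node(sensor_readings(100), window=10):
    send_to_cloud(summary)  # 10 summaries instead of 100 raw readings
```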

    Predicting residential building age from map data

    The age of a building influences its form and fabric composition and this in turn is critical to inferring its energy performance. However, often this data is unknown. In this paper, we present a methodology to automatically identify the construction period of houses, for the purpose of urban energy modelling and simulation. We describe two major stages to achieving this – a per-building classification model and post-classification analysis to improve the accuracy of the class inferences. In the first stage, we extract measures of the morphology and neighbourhood characteristics from readily available topographic mapping, a high-resolution Digital Surface Model and statistical boundary data. These measures are then used as features within a random forest classifier to infer an age category for each building. We evaluate various predictive model combinations based on scenarios of available data, evaluating these using 5-fold cross-validation to train and tune the classifier hyper-parameters based on a sample of city properties. A separate sample estimated the best performing cross-validated model as achieving 77% accuracy. In the second stage, we improve the inferred per-building age classification (for a spatially contiguous neighbourhood test sample) through aggregating prediction probabilities using different methods of spatial reasoning. We report on three methods for achieving this based on adjacency relations, near neighbour graph analysis and graph-cuts label optimisation. We show that post-processing can improve the accuracy by up to 8 percentage points
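    A minimal sketch of the first-stage model as described: per-building morphology features feeding a random forest, with hyper-parameters tuned by 5-fold cross-validation and class probabilities kept for the second stage. The feature names, label encoding, and parameter grid are hypothetical, not the paper's actual setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Hypothetical per-building features: footprint area (m^2), height (m),
# distance to nearest neighbour (m); labels are age-band categories.
X = rng.random((200, 3)) * [400.0, 12.0, 30.0]
y = rng.integers(0, 4, size=200)  # e.g. 0: pre-1919, ..., 3: post-1980

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    cv=5,                      # 5-fold cross-validation, as in the paper
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))

# The second stage works on class probabilities, which can later be
# aggregated over spatially adjacent buildings.
proba = search.best_estimator_.predict_proba(X[:5])
print(proba.shape)  # (5, number of age classes)
```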

    Efficient Large-scale Distance-Based Join Queries in SpatialHadoop

    Efficient processing of Distance-Based Join Queries (DBJQs) in spatial databases is of paramount importance in many application domains. The most representative and well-known DBJQs are the K Closest Pairs Query (KCPQ) and the ε Distance Join Query (εDJQ). These types of join queries are characterized by a number of desired pairs (K) or a distance threshold (ε) between the components of the pairs in the final result, over two spatial datasets. Both are expensive operations, since two spatial datasets are combined with additional constraints. Given the increasing volume of spatial data originating from multiple sources and stored in distributed servers, it is not always efficient to perform DBJQs on a centralized server. For this reason, this paper addresses the problem of computing DBJQs on big spatial datasets in SpatialHadoop, an extension of Hadoop that supports efficient processing of spatial queries in a cloud-based setting. We propose novel algorithms, based on plane-sweep, to perform efficient parallel DBJQs on large-scale spatial datasets in SpatialHadoop. We evaluate the performance of the proposed algorithms in several situations with large real-world as well as synthetic datasets. The experiments demonstrate the efficiency and scalability of our proposed methodologies.
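    A minimal, single-machine sketch of the plane-sweep idea behind these joins: sort one point set by x, then for each point of the other set compare only against candidates whose x-coordinate lies within ε, reporting pairs within the distance threshold (an εDJQ). The point sets and ε are made up; the paper's actual algorithms run this in parallel inside SpatialHadoop.

```python
import math
from bisect import bisect_left, bisect_right

def epsilon_distance_join(P, Q, eps):
    """Report all pairs (p, q) with Euclidean distance <= eps, plane-sweep style."""
    Q = sorted(Q)                      # sweep structure, ordered by x
    qx = [q[0] for q in Q]
    result = []
    for p in P:
        # Only candidates whose x lies in [p.x - eps, p.x + eps] can qualify.
        lo = bisect_left(qx, p[0] - eps)
        hi = bisect_right(qx, p[0] + eps)
        for q in Q[lo:hi]:
            if math.dist(p, q) <= eps:
                result.append((p, q))
    return result

P = [(0.0, 0.0), (2.0, 2.0), (5.0, 5.0)]
Q = [(0.5, 0.5), (2.0, 2.5), (9.0, 9.0)]
print(epsilon_distance_join(P, Q, eps=1.0))
# [((0.0, 0.0), (0.5, 0.5)), ((2.0, 2.0), (2.0, 2.5))]
```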

    Locality-Aware Fair Scheduling in the Distributed Query Processing Framework

    Utilizing caching facilities in modern query processing systems is becoming more important as the capacity of main memory keeps increasing. Especially in data-intensive applications, caching gives a significant performance gain by avoiding disk I/O, which is far more expensive than memory access. Therefore, data must be carefully distributed across back-end application servers to benefit from caching as much as possible. On the other hand, load balance across back-end application servers is another concern the scheduler must consider. Serious load imbalance may result in poor performance even if the cache hit ratio is high, and the fact that a scheduling decision which raises the cache hit ratio can also cause load imbalance makes scheduling even harder. We should therefore find a scheduling algorithm that successfully balances the trade-off between load balance and cache hit ratio. To consider both cache hits and load balance, we propose two semantic caching mechanisms, DEMB and EM-KDE, which balance the load while keeping the cache hit ratio high by analyzing and predicting trends in query arrival patterns. Another concern discussed in this paper is the environment with multiple front-end schedulers, each of which can see a different query arrival pattern from its users. To reflect these differences, we compare and evaluate three algorithms that aggregate the query arrival pattern information from the front-end schedulers. To further increase the cache hit ratio in semantic caching scheduling, we propose migrating cache contents to nearby servers: the cache hit count increases if data can be dynamically migrated to the server to which subsequent requests are expected to be forwarded. Several migration policies and their pros and cons are discussed. Finally, we introduce a MapReduce framework called Eclipse, which takes full advantage of the semantic caching scheduling algorithms described above. We show that Eclipse outperforms other MapReduce frameworks in most evaluations.
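    A minimal sketch of the trade-off the thesis describes: a scheduler scores each back-end server by combining cache affinity (would this query hit the server's cached data?) with its current load, then picks the best server. The weighting, cache model, and server state are hypothetical; DEMB and EM-KDE work from query-arrival statistics rather than this simple heuristic.

```python
# Hypothetical locality-aware scheduler: trade off cache affinity vs. load.
servers = {
    "s1": {"cached": {"regionA", "regionB"}, "load": 8},
    "s2": {"cached": {"regionC"},            "load": 2},
    "s3": {"cached": set(),                  "load": 0},
}

def schedule(query_region, servers, alpha=0.7):
    """Pick the server with the best weighted score.

    alpha weighs cache affinity against load; alpha=1 is pure locality,
    alpha=0 is pure load balancing.
    """
    max_load = max(s["load"] for s in servers.values()) or 1
    def score(s):
        affinity = 1.0 if query_region in s["cached"] else 0.0
        balance = 1.0 - s["load"] / max_load   # lighter servers score higher
        return alpha * affinity + (1 - alpha) * balance
    best = max(servers, key=lambda name: score(servers[name]))
    servers[best]["load"] += 1                 # the query adds load
    servers[best]["cached"].add(query_region)  # its result is now cached there
    return best

for q in ["regionA", "regionA", "regionC", "regionD"]:
    print(q, "->", schedule(q, servers))
```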

    Efficient processing of large-scale spatio-temporal data

    Millions of location-aware devices, such as mobile phones, cars, and environmental sensors, constantly report their positions, often together with a timestamp and further payload data, to a server for different kinds of analyses. While the location information of the devices and the reported events is represented as points and polygons, raster data is another type of spatial data, produced for example by cameras and sensors. These large spatio-temporal data sets can only be processed on scalable platforms such as Hadoop and Apache Spark, which, however, are unaware of, e.g., spatial neighbourhood, making the execution of certain queries practically impossible. The repeated executions of analysis programs during development and by different users result in long execution times and potentially high costs for rented cluster resources, which can be reduced by reusing commonly computed intermediate results. This thesis tackles both of these challenges. We first present the STARK framework for processing spatio-temporal vector and raster data on the Apache Spark stack. For its operators, we identify several possible algorithms and study how they can benefit from the properties of the underlying platform. We further investigate how indexes can be realized in the distributed and parallel architecture of Big Data processing engines, and compare data partitioning methods that differ in how well they handle data skew and data set size. We also present an approach to reduce the amount of data to be processed at the operator level as early as possible. To shorten execution times, we introduce an approach to transparently materialize and recycle intermediate results of dataflow programs, based on a decision model that uses actual operator costs; to obtain these costs, the programs are instrumented with profiling code that gathers the execution time and result size of each operator. In the evaluation, we first compare the various implementation and configuration possibilities in STARK and identify scenarios when and how partitioning and indexing should be applied. We further compare STARK to related systems and show that we can achieve significantly better execution times, not only when exploiting existing partitioning information. In the second part of the evaluation, we show that the transparent cost-based materialization and recycling of intermediate results can significantly reduce the execution times of programs.
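    One way to read the "spatial awareness" point above: vanilla Spark partitions records without regard to location, so neighbourhood work touches every partition. Below is a hypothetical PySpark sketch of grid-based spatial partitioning plus caching of the partitioned data for reuse; the cell size, data, and keying scheme are illustrative, and STARK's actual partitioners, indexes, and materialization logic are considerably more sophisticated.

```python
# Sketch: key points by a coarse grid cell so that spatially close points
# land in the same partition, enabling partition-local neighbourhood work.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("grid-partition").getOrCreate()
sc = spark.sparkContext

CELL = 10.0  # hypothetical grid cell size in coordinate units

def grid_key(point):
    x, y = point
    return (int(x // CELL), int(y // CELL))

points = sc.parallelize([(1.0, 2.0), (3.0, 4.0), (55.0, 60.0), (57.0, 61.0)])
keyed = points.map(lambda p: (grid_key(p), p))

# Hash-partition on the grid key: neighbours share a cell, hence a partition.
partitioned = keyed.partitionBy(8).cache()  # cache for reuse across queries

# Example partition-local work: count points per cell without a further shuffle.
print(partitioned.mapValues(lambda _: 1).reduceByKey(lambda a, b: a + b).collect())

spark.stop()
```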