
    What makes spatial data big? A discussion on how to partition spatial data

    The amount of available spatial data has increased significantly in recent years, so that traditional analysis tools have become inadequate to manage it effectively. Therefore, many attempts have been made to extend existing MapReduce tools, such as Hadoop or Spark, with spatial capabilities in terms of data types and algorithms. Such extensions are mainly based on the partitioning techniques implemented for textual data, where the size is given in terms of the number of occupied bytes. However, spatial data are characterized by other features that describe their size, such as the number of vertices or the MBR size of geometries, which greatly affect the performance of operations like the spatial join during data analysis. The result is that the use of traditional partitioning techniques prevents fully exploiting the benefit of the parallel execution provided by a MapReduce environment. This paper extensively analyses the problem considering the spatial join operation as a use case, performing both a theoretical and an experimental analysis for it. Moreover, it provides a solution based on a different partitioning technique, which splits complex or extensive geometries. Finally, we validate the proposed solution by means of experiments on synthetic and real datasets.
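    A minimal sketch of the splitting idea mentioned in the abstract, not the paper's algorithm: a geometry whose MBR extent or vertex count exceeds a threshold is recursively bisected along the longer axis of its MBR, so that no single geometry dominates the cost of a partition during a spatial join. The thresholds, the bisection strategy, and the vertex bookkeeping are illustrative assumptions; a real implementation would clip the geometry with a geometry library rather than just distributing its boundary vertices.

```python
def split_geometry(mbr, vertices, max_vertices=1000, max_extent=0.5):
    """Recursively bisect the MBR of an oversized geometry.

    mbr      -- (xmin, ymin, xmax, ymax)
    vertices -- list of (x, y) boundary points of the geometry
    Returns a list of (sub_mbr, sub_vertices) pairs small enough to be
    assigned to independent partitions. This only redistributes boundary
    vertices as a cost proxy; it does not produce clipped sub-geometries.
    """
    xmin, ymin, xmax, ymax = mbr
    width, height = xmax - xmin, ymax - ymin
    small_enough = len(vertices) <= max_vertices and max(width, height) <= max_extent
    if small_enough or max(width, height) < 1e-9:
        return [(mbr, vertices)]

    # Bisect along the longer side of the MBR.
    if width >= height:
        xmid = (xmin + xmax) / 2
        halves = [((xmin, ymin, xmid, ymax), [p for p in vertices if p[0] <= xmid]),
                  ((xmid, ymin, xmax, ymax), [p for p in vertices if p[0] > xmid])]
    else:
        ymid = (ymin + ymax) / 2
        halves = [((xmin, ymin, xmax, ymid), [p for p in vertices if p[1] <= ymid]),
                  ((xmin, ymid, xmax, ymax), [p for p in vertices if p[1] > ymid])]

    result = []
    for sub_mbr, sub_vertices in halves:
        if sub_vertices:  # drop empty halves
            result.extend(split_geometry(sub_mbr, sub_vertices,
                                         max_vertices, max_extent))
    return result
```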

    A context-based approach for partitioning big data

    In recent years, the amount of available data has kept growing at a fast rate, and it is therefore crucial to be able to process it efficiently. The level of parallelism in tools such as Hadoop or Spark is determined, among other things, by the partitioning applied to the dataset. A common method is to split the data into chunks based on the number of bytes. While this approach may work well for text-based batch processing, there are a number of cases where the dataset contains structured information, such as time or spatial coordinates, and one may be interested in exploiting such structure to improve the partitioning. This can reduce processing time and increase the overall efficiency of resource usage. This paper explores an approach based on the notion of context, such as temporal or spatial information, for partitioning the data. We design a context-based multi-dimensional partitioning technique that divides an n-dimensional space into splits by considering the distribution of each contextual dimension in the dataset. We tested our approach on a dataset from a touristic scenario, and our experiments show that we are able to improve the efficiency of resource usage.
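    A minimal sketch, under assumptions not stated in the abstract, of how per-dimension split boundaries could be derived from the observed distribution of each contextual attribute (for example hour of day and latitude). Boundaries follow quantiles, so a skewed dimension still yields splits with roughly balanced record counts. Function names and the two-dimensional example are illustrative.

```python
def quantile_boundaries(values, parts):
    """Return parts-1 cut points that divide the sorted values evenly."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // parts] for i in range(1, parts)]

def assign_split(record, boundaries_per_dim):
    """Map a record (tuple of contextual values) to a multi-dimensional
    split index, one coordinate per contextual dimension."""
    index = []
    for value, cuts in zip(record, boundaries_per_dim):
        index.append(sum(1 for c in cuts if value >= c))
    return tuple(index)

# Example with two contextual dimensions: hour of day and latitude.
records = [(9, 45.43), (9, 45.44), (14, 45.40), (21, 45.45), (22, 45.41)]
boundaries = [quantile_boundaries([r[0] for r in records], 2),
              quantile_boundaries([r[1] for r in records], 2)]
splits = {r: assign_split(r, boundaries) for r in records}
```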

    A MapReduce-Based Big Spatial Data Framework for Solving the Problem of Covering a Polygon with Orthogonal Rectangles

    The polygon covering problem is an important class of problems in the area of computational geometry. There are slightly different versions of this problem depending on the types of polygons to be addressed. In this paper, we focus on answering the question of whether an orthogonal rectangle, or spatial query window, is fully covered by a set of smaller orthogonal rectangles. This problem is encountered in many application domains, including object recognition/extraction/tracing, spatial analyses, topological analyses, and augmented reality applications. In many real-world applications, using traditional centralized computation techniques on real-world data results in performance bottlenecks. The work presented in this paper proposes a high-performance MapReduce-based big data framework to solve the polygon covering problem when a spatial query window is used and the data are represented as a set of orthogonal rectangles. Orthogonal rectangular polygons are represented in the form of minimum bounding boxes. The spatial query windows are also called range queries. The proposed spatial big data framework is evaluated in terms of horizontal scalability. In addition, efficiency and speed-up performance metrics are measured for the two proposed algorithms.
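    A minimal, sequential sketch of the coverage test itself, not the paper's MapReduce framework: the uncovered part of the query window is kept as a list of rectangles, and each covering rectangle carves away its intersection, leaving at most four residual pieces. Rectangle layout and function names are assumptions.

```python
def subtract(piece, cover):
    """Return the parts of `piece` not covered by `cover`.
    Rectangles are (xmin, ymin, xmax, ymax), axis-aligned."""
    px1, py1, px2, py2 = piece
    cx1, cy1, cx2, cy2 = cover
    # No overlap: the piece survives unchanged.
    if cx1 >= px2 or cx2 <= px1 or cy1 >= py2 or cy2 <= py1:
        return [piece]
    out = []
    if cx1 > px1:                      # strip to the left of the cover
        out.append((px1, py1, cx1, py2))
    if cx2 < px2:                      # strip to the right of the cover
        out.append((cx2, py1, px2, py2))
    ix1, ix2 = max(px1, cx1), min(px2, cx2)
    if cy1 > py1:                      # strip below the cover
        out.append((ix1, py1, ix2, cy1))
    if cy2 < py2:                      # strip above the cover
        out.append((ix1, cy2, ix2, py2))
    return out

def is_covered(query, rectangles):
    """True if the union of `rectangles` covers the whole query window."""
    uncovered = [query]
    for cover in rectangles:
        uncovered = [p for piece in uncovered for p in subtract(piece, cover)]
        if not uncovered:
            return True
    return not uncovered

# Example: two overlapping halves fully cover the unit-square query window.
print(is_covered((0, 0, 1, 1), [(0, 0, 0.6, 1), (0.5, 0, 1, 1)]))  # True
```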

    Distributing Tourists Among POIs with an Adaptive Trip Recommendation System

    Traveling is part of many people's leisure activities, and an increasing fraction of the economy comes from tourism. Given a destination, information about the different attractions, or points of interest (POIs), can be found in many sources. Among these attractions, finding the ones that could be of interest to a specific user represents a challenging task. Travel recommendation systems deal with this type of problem. Most of the solutions in the literature do not take into account the impact of the suggestions on the level of crowding of POIs. This paper considers the trip planning problem focusing on balancing users among the different POIs. To this aim, we consider the effects of the previous recommendations, as well as estimates based on historical data, while devising a new recommendation. The problem is formulated as a multi-objective optimization problem, and a recommendation engine has been designed and implemented for exploring the solution space in near real-time, through a distributed version of the Simulated Annealing approach. We test our solution using a real dataset of users visiting the POIs of a touristic city, and we show that we are able to provide high-quality recommendations while keeping the attractions from becoming overcrowded.
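    A minimal, sequential sketch of the simulated-annealing search described in the abstract; the objective function, the neighbourhood move, and the cooling schedule are illustrative assumptions, not the paper's exact formulation. A candidate trip is a list of POI identifiers, and the score trades the user's preference for the selected POIs against their expected crowding.

```python
import math
import random

def score(trip, preference, expected_load, capacity, alpha=0.5):
    """Higher is better: preference for the POIs minus their crowding ratio."""
    interest = sum(preference[p] for p in trip)
    crowding = sum(expected_load[p] / capacity[p] for p in trip)
    return alpha * interest - (1 - alpha) * crowding

def anneal(pois, trip_len, preference, expected_load, capacity,
           steps=5000, t0=1.0, cooling=0.999):
    """Simulated annealing over trips; assumes len(pois) > trip_len."""
    current = random.sample(pois, trip_len)
    best = current
    best_s = score(best, preference, expected_load, capacity)
    t = t0
    for _ in range(steps):
        # Neighbour: swap one POI in the trip with one outside it.
        candidate = current[:]
        candidate[random.randrange(trip_len)] = random.choice(
            [p for p in pois if p not in current])
        s_cur = score(current, preference, expected_load, capacity)
        s_new = score(candidate, preference, expected_load, capacity)
        # Accept improvements always, worse moves with a temperature-dependent probability.
        if s_new > s_cur or random.random() < math.exp((s_new - s_cur) / t):
            current = candidate
        if s_new > best_s:
            best, best_s = candidate, s_new
        t *= cooling
    return best
```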

    CoPart: a context-based partitioning technique for big data

    The MapReduce programming paradigm is frequently used to process and analyse huge amounts of data. This paradigm relies on the ability to apply the same operation in parallel on independent chunks of data. The consequence is that the overall performance greatly depends on the way data are partitioned among the various computation nodes. The default partitioning technique provided by systems like Hadoop or Spark basically performs a random subdivision of the input records, without considering their nature or the correlation between them. Even if such an approach can be appropriate in the simplest case, where all the input records always have to be analyzed, it becomes a limit for sophisticated analyses, in which correlations between records can be exploited to prune unnecessary computations beforehand. In this paper we design a context-based multi-dimensional partitioning technique, called CoPart, which takes data correlation into account in order to determine how records are subdivided between splits (i.e., units of work assigned to a computation node). More specifically, it considers not only the correlation of data w.r.t. contextual attributes, but also the distribution of each contextual dimension in the dataset. We experimentally compare our approach with existing ones, considering both quality criteria and query execution times.
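    A minimal sketch, complementary to the boundary computation sketched above, of the pruning that a context-based layout enables: when each split is described by one interval per contextual dimension, a selection such as "records between 9:00 and 11:00 in a given latitude band" only needs the splits whose intervals overlap the predicate. The split metadata layout and names are assumptions, not CoPart's API.

```python
def overlapping_splits(split_ranges, predicate):
    """split_ranges: {split_id: [(lo, hi), ...]}, one interval per dimension.
    predicate:       [(lo, hi), ...] in the same dimension order.
    Returns the split ids that may contain matching records."""
    selected = []
    for split_id, ranges in split_ranges.items():
        if all(lo <= q_hi and hi >= q_lo
               for (lo, hi), (q_lo, q_hi) in zip(ranges, predicate)):
            selected.append(split_id)
    return selected

# Two contextual dimensions: hour of day and latitude.
splits = {
    "s0": [(0, 12), (45.40, 45.43)],
    "s1": [(0, 12), (45.43, 45.46)],
    "s2": [(12, 24), (45.40, 45.43)],
}
print(overlapping_splits(splits, [(9, 11), (45.41, 45.42)]))  # ['s0']
```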

    Big Data Computing for Geospatial Applications

    The convergence of big data and geospatial computing has brought forth challenges and opportunities to Geographic Information Science with regard to geospatial data management, processing, analysis, modeling, and visualization. This book highlights recent advancements in integrating new computing approaches, spatial methods, and data management strategies to tackle geospatial big data challenges, and at the same time demonstrates opportunities for using big data for geospatial applications. Crucial to the advancements highlighted in this book is the integration of computational thinking and spatial thinking and the transformation of abstract ideas and models into concrete data structures and algorithms.

    A template-based approach for the specification of 3D topological constraints

    Several different models have been defined in the literature for the definition of 3D scenes that include a geometrical representation of objects together with a semantic classification of them. Such semantic characterization encapsulates important details about object properties and behavior and often includes spatial relations that are defined only implicitly or through natural language, such as “an external access shall be in touch with the building only when it is classified as a direct access”. The problem of ensuring the coherence between geometric and semantic information is well known in the literature. Many attempts exist which try to extend the OCL to allow the representation of spatial integrity constraints in a UML model. However, this approach requires a deep knowledge of the OCL formalism and the implementation of ad-hoc procedures to validate the constraints specified at the conceptual level. Therefore, a new approach is needed that helps designers define complex OCL constraints and at the same time allows the automatic generation of the code to test them on a given dataset. The aim of this paper is to propose a set of predefined templates to express, on the classes of a UML data model, a family of 3D spatial integrity constraints based on topological relations, all without requiring domain experts to know any formal language and supporting their automatic translation into validation procedures.
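    A minimal sketch of the template idea, not the paper's generated code: a declarative constraint of the form "every <A> classified as <kind> must TOUCH some <B>" is compiled into a validation routine over a dataset. The dataset layout, the box-based geometry, and the touches() test are illustrative simplifications of a full 3D topological evaluation.

```python
def touches(box_a, box_b):
    """True if two axis-aligned 3D boxes share a boundary but no interior.
    Boxes are (xmin, ymin, zmin, xmax, ymax, zmax)."""
    overlap = [min(box_a[i + 3], box_b[i + 3]) - max(box_a[i], box_b[i])
               for i in range(3)]
    if any(o < 0 for o in overlap):      # disjoint along some axis
        return False
    return any(o == 0 for o in overlap)  # contact only on a face, edge, or vertex

def make_constraint(source_class, kind, target_class):
    """Compile the template into a checker over a list of features, where a
    feature is a dict with 'class', 'kind', and 'box' entries."""
    def check(features):
        targets = [f for f in features if f["class"] == target_class]
        violations = []
        for f in features:
            if f["class"] == source_class and f["kind"] == kind:
                if not any(touches(f["box"], t["box"]) for t in targets):
                    violations.append(f)
        return violations
    return check

# Template instance for the example constraint: "an external access classified
# as a direct access shall be in touch with the building".
direct_access_rule = make_constraint("ExternalAccess", "direct", "Building")
```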

    Adaptive Trip Recommendation System

    Travel recommendation systems provide suggestions to the users based on different information, such as user preferences, needs, or constraints. The recommendation may also take into account some characteristics of the points of interest (POIs) to be visited, such as the opening hours or the peak hours. Although a number of studies have been proposed on the topic, most of them tailor the recommendation from the user viewpoint, without evaluating the impact of the suggestions on the system as a whole. This may lead to oscillatory dynamics, where the choices made by the recommendation system generate new peak hours. This paper considers a trip planning problem that takes into account the balancing of users among the different POIs. To this aim, we consider the estimate of the level of crowding at POIs, including both historical data and the effects of the recommendations. We formulate the problem as a multi-objective optimization problem, and we design a recommendation engine that explores the solution space in near real-time, through a distributed version of the Simulated Annealing approach. Through an experimental evaluation on a real dataset of users visiting the POIs of a touristic city, we show that our solution is able to provide high-quality recommendations while keeping the attractions from becoming overcrowded.
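    A minimal sketch of the crowding estimate the abstract refers to: the expected load of a POI in a time slot combines the historical visit count with the recommendations already issued for that slot, so that each new recommendation sees the anticipated effect of the previous ones. The weights, names, and data layout are assumptions for illustration only.

```python
def expected_load(poi, slot, historical, issued, acceptance_rate=0.7):
    """historical: {(poi, slot): average past visitors}
    issued:        {(poi, slot): users recommended so far for that slot}
    A fixed fraction of issued recommendations is assumed to be followed."""
    past = historical.get((poi, slot), 0.0)
    pending = issued.get((poi, slot), 0) * acceptance_rate
    return past + pending

def register_recommendation(trip, slot_of, issued):
    """After recommending a trip, update the issued counters so that the
    next recommendation accounts for the anticipated crowding."""
    for poi in trip:
        key = (poi, slot_of[poi])
        issued[key] = issued.get(key, 0) + 1
```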