85 research outputs found

    Message from the ICDE 2015 Program Committee and general chairs

    Since its inception in 1984, the IEEE International Conference on Data Engineering (ICDE) has become a premier forum for the exchange and dissemination of data management research results among researchers, users, practitioners, and developers. Continuing this long-standing tradition, the 31st ICDE will be hosted this year in Seoul, South Korea, from April 13 to April 17, 2015. It is our great pleasure to welcome you to ICDE 2015 and to present its proceedings to you.

    Algorithmes passant à l’échelle pour la gestion de données du Web sémantique sur les platformes cloud (Scalable algorithms for Semantic Web data management on cloud platforms)

    In order to build smart systems, where machines are able to reason exactly like humans, data with semantics is a major requirement. This need led to the advent of the Semantic Web, which proposes standard ways of representing and querying data with semantics. RDF is the prevalent data model used to describe web resources, and SPARQL is the query language for expressing queries over RDF data. The ability to store and query data with semantics triggered the development of many RDF data management systems. The rapid evolution of the Semantic Web provoked a shift from centralized data management systems to distributed ones. The first systems to appear relied on P2P and client-server architectures, while recently the focus has moved to cloud computing. Cloud computing environments have strongly impacted research and development in distributed software platforms. Cloud providers offer distributed, shared-nothing infrastructures that may be used for data storage and processing. The main features of cloud computing include scalability, fault tolerance, and elastic allocation of computing and storage resources following the needs of the users. This thesis investigates the design and implementation of scalable algorithms and systems for cloud-based Semantic Web data management. In particular, we study the performance and cost of exploiting commercial cloud infrastructures to build Semantic Web data repositories, and the optimization of SPARQL queries for massively parallel frameworks. First, we introduce the basic concepts around the Semantic Web and the main components and frameworks interacting in massively parallel cloud-based systems. In addition, we provide an extended overview of existing RDF data management systems in the centralized and distributed settings, emphasizing the critical concepts of storage, indexing, query optimization, and infrastructure. Second, we present AMADA, an architecture for RDF data management using public cloud infrastructures. We follow the Software as a Service (SaaS) model, where the complete platform runs in the cloud and appropriate APIs are provided to end users for storing and retrieving RDF data. We explore various storage and querying strategies, revealing their pros and cons with respect to performance and to monetary cost, which is an important new dimension to consider in public cloud services. Finally, we present CliqueSquare, a distributed RDF data management system built on top of Hadoop, incorporating a novel optimization algorithm that is able to produce massively parallel plans for SPARQL queries. We present a family of optimization algorithms that rely on n-ary (star) equality joins to build flat plans, and compare their ability to find the flattest possible plans. Inspired by existing partitioning and indexing techniques, we present a generic storage strategy suitable for storing RDF data in HDFS (Hadoop Distributed File System). Our experimental results validate the efficiency and effectiveness of the optimization algorithm and also demonstrate the overall performance of the system.

    Enhancing In-Memory Spatial Indexing with Learned Search

    Spatial data is ubiquitous. Massive amounts of data are generated every day from a plethora of sources such as billions of GPS-enabled devices (e.g., cell phones, cars, and sensors), consumer-based applications (e.g., Uber and Strava), and social media platforms (e.g., location-tagged posts on Facebook, Twitter, and Instagram). This exponential growth in spatial data has led the research community to build systems and applications for efficient spatial data processing. In this study, we apply a recently developed machine-learned search technique for single-dimensional sorted data to spatial indexing. Specifically, we partition spatial data using six traditional spatial partitioning techniques and employ machine-learned search within each partition to support point, range, distance, and spatial join queries. Adhering to the latest research trends, we tune the partitioning techniques to be instance-optimized. By tuning each partitioning technique for optimal performance, we demonstrate that: (i) grid-based index structures outperform tree-based index structures (from 1.23× to 2.47×), (ii) learning-enhanced variants of commonly used spatial index structures outperform their original counterparts (from 1.44× to 53.34× faster), (iii) machine-learned search within a partition is faster than binary search by 11.79% - 39.51% when filtering on one dimension, (iv) the benefit of machine-learned search diminishes in the presence of other compute-intensive operations (e.g., scan costs in higher selectivity queries, Haversine distance computation, and point-in-polygon tests), and (v) index lookup is the bottleneck for tree-based structures, which could potentially be reduced by linearizing the indexed partitions. Additional Key Words and Phrases: spatial data, indexing, machine-learning, spatial queries, geospatial
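
The idea of machine-learned search over one sorted dimension can be illustrated with a minimal sketch. The abstract does not say which learned model the authors use, so the example below swaps in a single least-squares line as an assumption; the class name `LearnedSortedIndex` and its methods are hypothetical. The model predicts a position for a key, and a binary search is then confined to a small window around that prediction, sized by the largest error observed on the indexed keys.

```python
import bisect


class LearnedSortedIndex:
    """Illustrative learned search over a non-empty sorted list of numbers.

    Assumption: one linear model stands in for the learned model; this is
    not the structure used in the paper.
    """

    def __init__(self, keys):
        self.keys = keys  # must already be sorted in ascending order
        n = len(keys)
        mean_x = sum(keys) / n
        mean_y = (n - 1) / 2
        var = sum((x - mean_x) ** 2 for x in keys)
        cov = sum((x - mean_x) * (i - mean_y) for i, x in enumerate(keys))
        # Least-squares fit of position as a function of key.
        self.slope = cov / var if var else 0.0
        self.intercept = mean_y - self.slope * mean_x
        # Largest gap between predicted and true position on the indexed keys.
        self.max_err = max(abs(self._predict(x) - i) for i, x in enumerate(keys))

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def lower_bound(self, key):
        """Index of the first element >= key, searched only near the prediction."""
        n = len(self.keys)
        pred = self._predict(key)
        lo = min(max(0, pred - self.max_err), n)
        # The extra +1 covers keys that fall between two indexed keys.
        hi = max(lo, min(n, pred + self.max_err + 1))
        return bisect.bisect_left(self.keys, key, lo, hi)

    def range_query(self, low, high):
        """All indexed keys k with low <= k <= high (filter on one dimension)."""
        i = self.lower_bound(low)
        out = []
        while i < len(self.keys) and self.keys[i] <= high:
            out.append(self.keys[i])
            i += 1
        return out


xs = [1, 3, 4, 7, 9, 12, 15, 20, 21, 30]  # e.g. x-coordinates in one partition
index = LearnedSortedIndex(xs)
print(index.range_query(4, 12))
```

Because the fitted line is non-decreasing for sorted keys, the error bound measured on the indexed keys also bounds the window for keys that are absent, which is why the bounded search returns the same position as a full binary search.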

    Querying and mining heterogeneous spatial, social, and temporal data


    Efficient Group K Nearest-Neighbor Spatial Query Processing in Apache Spark

    The design and implementation of new distributed spatial query algorithms is a current challenge in spatial query processing for distributed computing systems. Apache Spark is a memory-based framework suitable for real-time and batch processing. Spark-based systems allow users to work on distributed in-memory data without worrying about the data distribution mechanism and fault tolerance. Given two datasets of points (called Query and Training), the group K nearest-neighbor (GKNN) query retrieves the K points of Training with the smallest sum of distances to every point of Query. This spatial query has been actively studied in centralized environments, where several performance-improving techniques and pruning heuristics have been proposed, and our team recently proposed a distributed algorithm for it in Apache Hadoop. Since Apache Hadoop generally exhibits lower performance than Spark, in this paper we present the first distributed GKNN query algorithm in Apache Spark and compare it against the one in Apache Hadoop. This algorithm incorporates programming features and facilities that are specific to Apache Spark, as well as performance-improving techniques that are applicable in Apache Spark. The results of an extensive set of experiments with real-world spatial datasets are presented, demonstrating that our Apache Spark GKNN solution, with its improvements, is efficient and a clear winner in comparison to processing this query in Apache Hadoop.
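
The GKNN definition above can be made concrete with a brute-force, single-machine sketch. This is not the distributed Spark algorithm from the paper: the function name `gknn`, the use of 2D tuples, and the choice of Euclidean distance are assumptions made only to illustrate what the query returns.

```python
import heapq
from math import hypot


def gknn(query, training, k):
    """Return the k training points with the smallest sum of distances to all query points.

    Assumption: points are (x, y) tuples and distance is Euclidean.
    """

    def sum_of_distances(p):
        return sum(hypot(p[0] - q[0], p[1] - q[1]) for q in query)

    # nsmallest returns the k best candidates ordered by ascending score.
    return heapq.nsmallest(k, training, key=sum_of_distances)


query = [(0, 0), (2, 0)]
training = [(1, 0), (5, 5), (1, 1), (10, 0)]
print(gknn(query, training, 2))  # -> [(1, 0), (1, 1)]
```

This scan evaluates every training point against every query point, which is the cost that pruning heuristics and distributed processing aim to reduce.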