86 research outputs found

    Scalable big data systems: Architectures and optimizations

    Get PDF
    Big data analytics has become not just a popular buzzword but also a strategic direction in information technology for many enterprises and government organizations. Even though many new computing and storage systems have been developed for big data analytics, scalable big data processing has become more and more challenging as a result of the huge and rapidly growing size of real-world data. Dedicated to the development of architectures and optimization techniques for scaling big data processing systems, especially in the era of cloud computing, this dissertation makes three unique contributions. First, it introduces a suite of graph partitioning algorithms that can run much faster than existing data distribution methods and inherently scale to the growth of big data. The main idea of these approaches is to partition a big graph by preserving the core computational data structure as much as possible to maximize intra-server computation and minimize inter-server communication. In addition, it proposes a distributed iterative graph computation framework that effectively utilizes secondary storage to maximize access locality and speed up distributed iterative graph computations. The framework not only considerably reduces memory requirements for iterative graph algorithms but also significantly improves the performance of iterative graph computations. Last but not the least, it establishes a suite of optimization techniques for scalable spatial data processing along with three orthogonal dimensions: (i) scalable processing of spatial alarms for mobile users traveling on road networks, (ii) scalable location tagging for improving the quality of Twitter data analytics and prediction accuracy, and (iii) lightweight spatial indexing for enhancing the performance of big spatial data queries.Ph.D

    SPARQLGX: Efficient Distributed Evaluation of SPARQL with Apache Spark

    Get PDF
    International audiencesparql is the w3c standard query language for querying data expressed in the Resource Description Framework (rdf). The increasing amounts of rdf data available raise a major need and research interest in building efficient and scalable distributed sparql query eval-uators. In this context, we propose sparqlgx: our implementation of a distributed rdf datastore based on Apache Spark. sparqlgx is designed to leverage existing Hadoop infrastructures for evaluating sparql queries. sparqlgx relies on a translation of sparql queries into exe-cutable Spark code that adopts evaluation strategies according to (1) the storage method used and (2) statistics on data. We show that spar-qlgx makes it possible to evaluate sparql queries on billions of triples distributed across multiple nodes, while providing attractive performance figures. We report on experiments which show how sparqlgx compares to related state-of-the-art implementations and we show that our approach scales better than these systems in terms of supported dataset size. With its simple design, sparqlgx represents an interesting alternative in several scenarios

    P-LUPOSDATE: Using Precomputed Bloom Filters to Speed Up SPARQL Processing in the Cloud

    Get PDF
    Increasingly data on the Web is stored in the form of Semantic Web data. Because of today's information overload, it becomes very important to store and query these big datasets in a scalable way and hence in a distributed fashion. Cloud Computing offers such a distributed environment with dynamic reallocation of computing and storing resources based on needs. In this work we introduce a scalable distributed Semantic Web database in the Cloud. In order to reduce the number of (unnecessary) intermediate results early, we apply bloom filters. Instead of computing bloom filters, a time-consuming task during query processing as it has been done traditionally, we precompute the bloom filters as much as possible and store them in the indices besides the data. The experimental results with data sets up to 1 billion triples show that our approach speeds up query processing significantly and sometimes even reduces the processing time to less than half

    A scalable analysis framework for large-scale RDF data

    Get PDF
    With the growth of the Semantic Web, the availability of RDF datasets from multiple domains as Linked Data has taken the corpora of this web to a terabyte-scale, and challenges modern knowledge storage and discovery techniques. Research and engineering on RDF data management systems is a very active area with many standalone systems being introduced. However, as the size of RDF data increases, such single-machine approaches meet performance bottlenecks, in terms of both data loading and querying, due to the limited parallelism inherent to symmetric multi-threaded systems and the limited available system I/O and system memory. Although several approaches for distributed RDF data processing have been proposed, along with clustered versions of more traditional approaches, their techniques are limited by the trade-off they exploit between loading complexity and query efficiency in the presence of big RDF data. This thesis then, introduces a scalable analysis framework for processing large-scale RDF data, which focuses on various techniques to reduce inter-machine communication, computation and load-imbalancing so as to achieve fast data loading and querying on distributed infrastructures. The first part of this thesis focuses on the study of RDF store implementation and parallel hashing on big data processing. (1) A system-level investigation of RDF store implementation has been conducted on the basis of a comparative analysis of runtime characteristics of a representative set of RDF stores. The detailed time cost and system consumption is measured for data loading and querying so as to provide insight into different triple store implementation as well as an understanding of performance differences between different platforms. (2) A high-level structured parallel hashing approach over distributed memory is proposed and theoretically analyzed. The detailed performance of hashing implementations using different lock-free strategies has been characterized through extensive experiments, thereby allowing system developers to make a more informed choice for the implementation of their high-performance analytical data processing systems. The second part of this thesis proposes three main techniques for fast processing of large RDF data within the proposed framework. (1) A very efficient parallel dictionary encoding algorithm, to avoid unnecessary disk-space consumption and reduce computational complexity of query execution. The presented implementation has achieved notable speedups compared to the state-of-art method and also has achieved excellent scalability. (2) Several novel parallel join algorithms, to efficiently handle skew over large data during query processing. The approaches have achieved good load balancing and have been demonstrated to be faster than the state-of-art techniques in both theoretical and experimental comparisons. (3) A two-tier dynamic indexing approach for processing SPARQL queries has been devised which keeps loading times low and decreases or in some instances removes intermachine data movement for subsequent queries that contain the same graph patterns. The results demonstrate that this design can load data at least an order of magnitude faster than a clustered store operating in RAM while remaining within an interactive range for query processing and even outperforms current systems for various queries

    Data warehousing technologies for large-scale and right-time data

    Get PDF

    The SPARQLGX System for Distributed Evaluation of SPARQL Queries

    Get PDF
    SPARQL is the W3C standard query language for querying data expressed in the Resource Description Framework (RDF). The increasing amounts of data available in the RDF format raise a major need and research interest in building efficient and scalable distributed SPARQL query evaluators. In this context, we propose SPARQLGX: an implementation of a distributed RDF datastore based on Apache Spark. SPARQLGX is designed to leverage existing Hadoop infrastructures for evaluating SPARQL queries efficiently. SPARQLGX relies on an automated translation of SPARQL queries into optimized executable Spark code. We show that SPARQLGX makes it possible to evaluate SPARQL queries on billions of triples distributed across multiple nodes, while providing attractive performance figures. We report on experiments which show how SPARQLGX compares to state-of-the-art implementations and we show that our approach scales better than other systems in terms of supported dataset size. With its simple design, SPARQLGX represents an interesting alternative in several scenarios

    Temporal RDF(S) Data Storage and Query with HBase

    Get PDF
    Resource Description Framework (RDF) is a metadata model recommended by World Wide Web Consortium (W3C) for describing the Web resources. With the arrival of the era of Big Data, very large amounts of RDF data are continuously being created and need to be stored for management. The traditional centralized RDF storage models cannot meet the need of largescale RDF data storage. Meanwhile, the importance of temporal information management and processing has been acknowledged by academia and industry. In this paper, we propose a storage model to store temporal RDF based on HBase. The proposed storage model applies the built-in time mechanism of HBase. Our experiments on LUBM dataset with temporal information added show that our storage model can store large temporal RDF data and obtain good query efficiency

    Architecture and Knowledge Modelling for Smart City

    Get PDF

    The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing: Extended Survey

    Full text link
    Graph processing is becoming increasingly prevalent across many application domains. In spite of this prevalence, there is little research about how graphs are actually used in practice. We performed an extensive study that consisted of an online survey of 89 users, a review of the mailing lists, source repositories, and whitepapers of a large suite of graph software products, and in-person interviews with 6 users and 2 developers of these products. Our online survey aimed at understanding: (i) the types of graphs users have; (ii) the graph computations users run; (iii) the types of graph software users use; and (iv) the major challenges users face when processing their graphs. We describe the participants' responses to our questions highlighting common patterns and challenges. Based on our interviews and survey of the rest of our sources, we were able to answer some new questions that were raised by participants' responses to our online survey and understand the specific applications that use graph data and software. Our study revealed surprising facts about graph processing in practice. In particular, real-world graphs represent a very diverse range of entities and are often very large, scalability and visualization are undeniably the most pressing challenges faced by participants, and data integration, recommendations, and fraud detection are very popular applications supported by existing graph software. We hope these findings can guide future research
    corecore