
    AT-GIS: highly parallel spatial query processing with associative transducers

    Users in many domains, including urban planning, transportation, and environmental science, want to execute analytical queries over continuously updated spatial datasets. Current solutions for large-scale spatial query processing either rely on extensions to RDBMSs, which entail expensive loading and indexing phases when the data changes, or on distributed map/reduce frameworks running on resource-hungry compute clusters. Both solutions struggle with the sequential bottleneck of parsing complex, hierarchical spatial data formats, which frequently dominates query execution time. Our goal is to fully exploit the parallelism offered by modern multicore CPUs for parsing and query execution, thus providing the performance of a cluster with the resources of a single machine. We describe AT-GIS, a highly parallel spatial query processing system that scales linearly to a large number of CPU cores. AT-GIS integrates the parsing and querying of spatial data using a new computational abstraction called associative transducers (ATs). ATs can form a single data-parallel pipeline for computation without requiring the spatial input data to be split into logically independent blocks. Using ATs, AT-GIS can execute spatial query operators in parallel on the raw input data, in multiple formats, without any pre-processing. On a single 64-core machine, AT-GIS provides 3× the performance of an 8-node, 192-core Hadoop cluster for containment queries, and 10× for aggregation queries.
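    The key promise above, parallel query evaluation directly over raw, unsplit input, can be illustrated with a far simpler chunk-parallel scan. The Python sketch below assumes a newline-delimited file of "x y" points and a fixed query box (both invented for illustration); each worker skips the partial record at the start of its byte range and reads past the end of the range to finish its last record. It shows only block-free data parallelism for a containment count, not the associative-transducer abstraction itself.

```python
# Minimal sketch, assuming a newline-delimited file of "x y" points and a fixed
# query box; NOT the associative-transducer (AT) abstraction from the paper.
import os
from concurrent.futures import ProcessPoolExecutor

BOX = (0.0, 0.0, 10.0, 10.0)  # assumed query box: xmin, ymin, xmax, ymax

def count_chunk(path, start, end):
    """Count points inside BOX for the records owned by this byte range."""
    count = 0
    with open(path, "rb") as f:
        f.seek(start)
        if start != 0:
            f.readline()  # partial record: owned by the previous chunk's worker
        while f.tell() <= end:
            line = f.readline()
            if not line:
                break
            try:
                x, y = map(float, line.split())
            except ValueError:
                continue  # ignore malformed records in this sketch
            if BOX[0] <= x <= BOX[2] and BOX[1] <= y <= BOX[3]:
                count += 1
    return count

def parallel_count(path, workers=8):
    """Containment count over the raw file, split into byte ranges, no pre-processing."""
    size = os.path.getsize(path)
    step = size // workers + 1
    starts = [i * step for i in range(workers)]
    ends = [min(s + step, size) for s in starts]
    with ProcessPoolExecutor(workers) as pool:
        return sum(pool.map(count_chunk, [path] * workers, starts, ends))

if __name__ == "__main__":
    # e.g. print(parallel_count("points.txt")) on a file with one "x y" pair per line
    pass
```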

    Modelling parallel database management systems for performance prediction

    Abstract unavailable; please refer to the PDF.

    Towards optimisation of model queries: A parallel execution approach

    The growing size of software models poses significant scalability challenges. Amongst these challenges is the execution time of queries and transformations. In many cases, model management programs are (or can be) expressed as chains and combinations of core fundamental operations. Most of these operations are pure functions, making them amenable to parallelisation, lazy evaluation, and short-circuiting. In this paper we show how all three of these optimisations can be combined in the context of Epsilon, an OCL-inspired family of model management languages. We compare our solutions with both interpreted and compiled OCL, as well as with hand-written Java code. Our experiments show a significant improvement in the performance of queries, especially on large models.
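    Because the operations involved are pure, the combination of laziness, short-circuiting, and parallelism can be conveyed with a small sketch. The Python code below is not Epsilon's implementation; the in-memory element list and the example predicate are assumptions. It fuses a select-then-exists chain into one lazy, short-circuiting pass and scans partitions of a model in parallel.

```python
# Minimal sketch, assuming an in-memory list of model elements and a pure
# predicate; not Epsilon's actual execution engine.
from concurrent.futures import ThreadPoolExecutor

def lazy_exists(elements, predicate):
    # select(...).exists() fused into one lazy pass: elements are pulled one at
    # a time and the scan short-circuits at the first match.
    return any(predicate(e) for e in elements)

def parallel_exists(elements, predicate, workers=4):
    # The predicate is a pure function, so partitions can be scanned
    # independently; the overall answer is the disjunction of partial answers.
    # (Partitions already submitted are not cancelled once a match is found.)
    partitions = [elements[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(workers) as pool:
        return any(pool.map(lazy_exists, partitions, [predicate] * workers))

# e.g. is there any element with more than 100 references?
# parallel_exists(model_elements, lambda e: len(e.references) > 100)
```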

    Analytical response time estimation in parallel relational database systems

    Techniques for performance estimation in parallel database systems are well established for parameters such as throughput, bottlenecks, and resource utilisation. Response time, however, is a complex quantity that is difficult to predict and has attracted research for a number of years. Simulation is one option for predicting response time, but it is a costly process. Analytical modelling is a less expensive option, but it requires approximations and assumptions about the queueing networks that build up in real parallel database machines, assumptions which are often questionable, and few papers on analytical approaches are backed by results from validation against real machines. This paper describes a new analytical approach to response time estimation that is based on a detailed study of different approaches and assumptions. The approach has been validated against two commercial parallel DBMSs running on actual parallel machines and is shown to produce acceptable accuracy.
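    For flavour only, the toy calculation below shows the style of estimate an analytical model produces: an open queueing-network approximation with one M/M/1 server per resource, where per-resource residence times are summed into a response time. The visit counts and service demands are invented, and this is not the model validated in the paper.

```python
# Toy illustration only: an open queueing-network estimate with one M/M/1
# server per resource; not the validated model from the paper.

def estimated_response_time(arrival_rate, visits, service_times):
    """arrival_rate: queries per second entering the system.
    visits[i]: mean visits a query makes to resource i (CPU, disk, network, ...).
    service_times[i]: mean service demand per visit at resource i, in seconds."""
    total = 0.0
    for v, s in zip(visits, service_times):
        utilisation = arrival_rate * v * s        # U_i = X * V_i * S_i
        if utilisation >= 1.0:
            raise ValueError("resource saturated; the approximation breaks down")
        total += v * s / (1.0 - utilisation)      # residence time at resource i
    return total

# e.g. 5 queries/s, each doing one 20 ms CPU pass and four 8 ms disk accesses
print(estimated_response_time(5.0, [1, 4], [0.020, 0.008]))  # ~0.060 s
```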

    Querying the web of data with low latency: high performance distributed SPARQL processing and benchmarking

    The Web of Data extends the World Wide Web (WWW) in a way that lets applications understand information and cooperate with humans on complex tasks. The basis of performing complex tasks is low-latency queries over the Web of Data. The large scale and distributed nature of the Web of Data have negative impacts on several factors critical for efficient query processing, including fast data transmission between datasets, predictable data distribution, and statistics that summarise and describe patterns in the data. Moreover, it is common on the Web of Data that the same resource is identified by multiple URIs. This phenomenon, named co-reference, potentially increases the complexity of query processing and makes it even harder to obtain accurate statistics. Given these challenges, it is not clear whether it is possible to achieve efficient queries over the Web of Data at a large scale.
    In this thesis, we explore techniques to improve the efficiency of querying the Web of Data at a large scale. More specifically, we investigate two typical scenarios on the Web of Data: 1) the scenario in which all datasets provide detailed statistics that are possibly available at a large scale, and 2) the scenario in which co-reference is taken into account and the datasets' statistics are not reliable. For each scenario we explore existing and novel optimisation techniques tailored for querying the Web of Data, as well as well-developed techniques with careful adjustments.
    For the scenario with detailed statistics we provide a scheme that implements a statistics-based query optimisation approach and intensively exploits parallelism. We propose an efficient algorithm, called Parallel Sub-query Identification, to increase the degree of parallelism. It breaks a SPARQL query into sub-queries that can be processed in parallel without increasing network traffic. We combine it with dynamic programming to produce query plans with both minimum cost and a fair degree of parallelism. Furthermore, we develop a mechanism that maximally exploits the bandwidth and computing power of datasets. For the scenario with co-reference and without reliable statistics we provide a scheme that implements a dynamic query optimisation approach that takes co-reference into account and utilises runtime statistics to raise query efficiency even further. We propose a model, called Virtual Graph, to transform a query and all its co-referent siblings into a single query with pre-defined bindings. Virtual Graph reduces the large number of outgoing and incoming requests that would be required to process co-referent queries individually. Moreover, Virtual Graph enables query optimisers to find the optimal plan with respect to all co-referent queries as a whole. Parallel Sub-query Identification is used in this scheme as well, but provides a higher degree of parallelism with the help of runtime statistics. A minimum-spanning-tree-based algorithm is used in this scheme as a result of using runtime statistics. The same parallel execution mechanism used in the previous scenario is adopted here as well.
    In order to examine the effectiveness of our schemes in practice, we deploy the above approaches in two distributed SPARQL engines, LHD-s and LHD-d respectively. Both engines are implemented using a popular Java-based platform for building Semantic Web applications. They can be used either as standalone applications or integrated into existing systems that require quick responses to Linked Data queries.
    We also propose a scalable and flexible benchmark, called the Distributed SPARQL Evaluation Framework (DSEF), for evaluating optimisation approaches in the Web of Data. DSEF adopts an expandable, virtual-machine-based structure and provides a set of efficient tools to help easily set up RDF networks of arbitrary sizes. We further investigate the proportion and distribution of co-reference in the real world, based on which DSEF is able to simulate co-reference for given RDF datasets. DSEF bases its soundness on the use of widely accepted assessment data and queries.
    By comparing both LHD-s and LHD-d with existing approaches using DSEF, we provide evidence that neither the existing statistics provided by datasets nor cost estimation methods are sufficiently accurate. On the other hand, dynamic optimisation using runtime statistics, together with carefully tuned parallelism, is promising for significantly reducing the latency of large-scale queries over the Web of Data. We also demonstrate that the Parallel Sub-query Identification and Virtual Graph algorithms significantly increase query efficiency for queries with or without co-reference.
    In summary, the contributions of this thesis include: 1) proposing two schemes for improving query efficiency in two typical scenarios on the Web of Data; 2) providing implementations, named LHD-s and LHD-d, for the two schemes respectively; 3) proposing a scalable and flexible evaluation framework for distributed SPARQL engines called DSEF; and 4) showing evidence that runtime-statistics-based dynamic optimisation with parallelism is promising for reducing the latency of Linked Data queries at a large scale.
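    The sketch below gives a rough feel for the parallel sub-query execution described above: independent sub-queries are issued concurrently over the standard SPARQL 1.1 protocol and their bindings are hash-joined on shared variables. The query split, the endpoints, and the left-to-right join order are assumptions for illustration; it implements neither the thesis's sub-query identification and cost-based planning nor the Virtual Graph rewriting.

```python
# Minimal sketch: the sub-query split, endpoints and join order are assumed for
# illustration; not the thesis's planning or Virtual Graph algorithms.
import json
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def execute_subquery(endpoint, subquery):
    # Standard SPARQL 1.1 protocol: GET with the query as a parameter,
    # requesting JSON results, then flattening the bindings to plain dicts.
    url = endpoint + "?" + urllib.parse.urlencode({"query": subquery})
    req = urllib.request.Request(url, headers={"Accept": "application/sparql-results+json"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return [{var: b[var]["value"] for var in b} for b in data["results"]["bindings"]]

def hash_join(left, right, keys):
    # Join two lists of binding dicts on the shared variables `keys`.
    index = {}
    for lrow in left:
        index.setdefault(tuple(lrow[k] for k in keys), []).append(lrow)
    return [{**lrow, **rrow}
            for rrow in right
            for lrow in index.get(tuple(rrow[k] for k in keys), [])]

def run_plan(plan):
    # plan: list of (endpoint_url, subquery_text) pairs issued in parallel.
    with ThreadPoolExecutor(len(plan)) as pool:
        results = list(pool.map(lambda p: execute_subquery(*p), plan))
    joined = results[0]
    for part in results[1:]:
        shared = sorted(set(joined[0]) & set(part[0])) if joined and part else []
        joined = hash_join(joined, part, shared)
    return joined
```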

    Data Management for Dynamic Multimedia Analytics and Retrieval

    Multimedia data in its various manifestations poses a unique challenge from a data storage and data management perspective, especially if search, analysis, and analytics in large data corpora are considered. The inherently unstructured nature of the data itself, and the curse of dimensionality that afflicts the representations we typically work with in its stead, cause a broad range of issues that require sophisticated solutions at different levels. This has given rise to a large body of research focused on techniques for effective and efficient multimedia search and exploration, and many of these contributions have led to an array of purpose-built multimedia search systems. However, recent progress in multimedia analytics and interactive multimedia retrieval has demonstrated that several of the assumptions usually made for such multimedia search workloads do not hold once a session has a human user in the loop. Firstly, many of the required query operations cannot be expressed by mere similarity search, and since the concrete requirements cannot always be anticipated, one needs a flexible and adaptable data management and query framework. Secondly, the widespread assumption that data collections are static does not hold for analytics workloads, whose purpose is to produce and store new insights and information. And finally, it is impossible even for an expert user to specify exactly how a data management system should produce and arrive at the desired outcomes of the potentially many different queries. Guided by these shortcomings, and motivated by the fact that similar questions have once been answered for structured data in classical database research, this thesis presents three contributions that seek to mitigate the aforementioned issues. We present a query model that generalises the notion of proximity-based query operations and formalises the connection between those queries and high-dimensional indexing. We complement this with a cost model that makes the often implicit trade-off between query execution speed and result quality transparent to the system and the user. And we describe a model for the transactional and durable maintenance of high-dimensional index structures. All contributions are implemented in the open-source multimedia database system Cottontail DB, on top of which we present an evaluation that demonstrates the effectiveness of the proposed models. We conclude by discussing avenues for future research in the quest to converge the fields of databases on the one hand and (interactive) multimedia retrieval and analytics on the other.
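    The speed/quality trade-off that such a cost model makes explicit can be caricatured in a few lines: a proximity (k-nearest-neighbour) operation whose result quality is traded for speed by scanning only a sampled subset of the collection. The sampling knob below is an assumption for illustration, not Cottontail DB's mechanism.

```python
# Minimal sketch, assuming dense float vectors held in memory; the sampling
# knob is an illustrative stand-in for a cost-model-driven choice.
import heapq
import math
import random

def knn(vectors, query, k, sample_fraction=1.0):
    """Return the k vectors closest to `query` (Euclidean distance).
    sample_fraction < 1.0 trades result quality for speed by scanning only a
    random subset of the collection."""
    if sample_fraction >= 1.0:
        candidates = vectors
    else:
        n = max(k, int(len(vectors) * sample_fraction))
        candidates = random.sample(vectors, min(n, len(vectors)))
    return heapq.nsmallest(k, candidates, key=lambda v: math.dist(v, query))
```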

    Case for holistic query evaluation

    In this thesis we present the holistic query evaluation model. We propose a novel query engine design that exploits the characteristics of modern processors when queries execute inside main memory. The holistic model (a) is based on template-based code generation for each executed query, (b) uses multithreading to adapt to multicore processor architectures, and (c) addresses the optimisation problem of scheduling multiple threads for intra-query parallelism. Main-memory query execution is now a common operation in database servers equipped with tens or hundreds of gigabytes of RAM. In such an execution environment, the query engine needs to adapt to the CPU characteristics to boost performance. For this purpose, holistic query evaluation applies customised code generation to database query evaluation. The idea is to use a collection of highly efficient code templates and dynamically instantiate them to create query- and hardware-specific source code. The source code is compiled and dynamically linked to the database server for processing. Code generation diminishes the bloat of the higher-level programming abstractions necessary for implementing generic, interpreted SQL query engines. At the same time, the generated code is customised for the hardware it will run on. The holistic model supports the most frequently used query processing algorithms, namely sorting, partitioning, join evaluation, and aggregation, thus allowing the efficient evaluation of complex DSS or OLAP queries. Modern CPUs follow multicore designs with multiple threads running in parallel. The dataflow of query engine algorithms needs to be adapted to exploit such designs. We identify memory accesses and thread synchronisation as the main bottlenecks in a multicore execution environment. We extend the holistic query evaluation model and propose techniques to mitigate the impact of these bottlenecks on multithreaded query evaluation. We analytically model the expected performance and scalability of the proposed algorithms according to the hardware specifications. The analytical performance expressions can be used by the optimiser to statically estimate the speedup of multithreaded query execution. Finally, we examine the problem of thread scheduling in the context of multithreaded query evaluation on multicore CPUs. The search space of possible operator execution schedules grows rapidly, ruling out exhaustive techniques. We model intra-query parallelism on multicore systems and present scheduling heuristics that result in different degrees of schedule quality and optimisation cost. We identify cases where each of our proposed algorithms, or combinations of them, are expected to generate schedules of high quality at an acceptable running cost.
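    The following miniature conveys the flavour of template-based code generation: query-specific source is instantiated from a template, compiled, and executed in place of a generic interpreted operator tree. Python source stands in for the native code an actual engine would emit, and the query shape is an assumed example.

```python
# Miniature sketch of template-based code generation; Python stands in for the
# native code an actual engine would emit, and the query is an assumed example.
TEMPLATE = """
def generated_query(rows):
    total = 0
    for row in rows:
        if {predicate}:
            total += {aggregate}
    return total
"""

def compile_query(predicate, aggregate):
    # Instantiate the template for this query, compile it, and return the
    # resulting specialised function (no generic operator interpretation).
    source = TEMPLATE.format(predicate=predicate, aggregate=aggregate)
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["generated_query"]

# SELECT SUM(price * qty) FROM orders WHERE qty > 10, specialised before running
query = compile_query("row['qty'] > 10", "row['price'] * row['qty']")
print(query([{"price": 2.5, "qty": 12}, {"price": 9.0, "qty": 3}]))  # 30.0
```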

    Towards optimising distributed data streaming graphs using parallel streams

    Modern scientific collaborations have opened up the opportunity of solving complex problems that involve multidisciplinary expertise and large-scale computational experiments. These experiments usually involve large amounts of data that are located in distributed data repositories running various software systems and managed by different organisations. A common strategy to make the experiments more manageable is to execute the processing steps as a workflow. In this paper, we look into the implementation of fine-grained data-flow between computational elements in a scientific workflow as streams. We model the distributed computation as a directed acyclic graph where the nodes represent processing elements that incrementally implement specific subtasks. The processing elements are connected in a pipelined streaming manner, which allows task executions to overlap. We further optimise the execution by splitting pipelines across processes and by introducing extra parallel streams. We identify performance metrics and design a measurement tool to evaluate each enactment. We conducted experiments to evaluate our optimisation strategies with a real-world problem in the life sciences, EURExpress-II. The paper presents our distributed data-handling model, the optimisation and instrumentation strategies, and the evaluation experiments. We demonstrate linear speed-up and argue that this use of data streaming to enable both overlapped pipelining and parallelised enactment is a generally applicable optimisation strategy.
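    The pipelined, parallel-stream pattern described above can be sketched compactly: each processing element runs in its own thread, consecutive stages are connected by bounded queues so task executions overlap, and every stage is replicated into a configurable number of parallel streams. The sketch assumes a uniform degree of parallelism per stage and is not the paper's enactment or measurement framework.

```python
# Minimal sketch, assuming a uniform degree of parallelism per stage; not the
# paper's enactment or measurement framework.
import queue
import threading

SENTINEL = object()

def run_stage(fn, inq, outq):
    # One parallel stream of a processing element: consume, transform, forward.
    while True:
        item = inq.get()
        if item is SENTINEL:
            outq.put(SENTINEL)   # propagate end-of-stream to the next stage
            break
        outq.put(fn(item))

def run_pipeline(source, stages, parallel=2):
    # Bounded queues between consecutive stages keep executions overlapped; the
    # final queue is unbounded so the driver can keep feeding input.
    qs = [queue.Queue(maxsize=16) for _ in range(len(stages))] + [queue.Queue()]
    threads = []
    for i, fn in enumerate(stages):
        for _ in range(parallel):  # replicate the stage into parallel streams
            t = threading.Thread(target=run_stage, args=(fn, qs[i], qs[i + 1]))
            t.start()
            threads.append(t)
    for item in source:
        qs[0].put(item)
    for _ in range(parallel):      # one end-of-stream marker per parallel stream
        qs[0].put(SENTINEL)
    results, finished = [], 0
    while finished < parallel:     # drain until every final stream has ended
        item = qs[-1].get()
        if item is SENTINEL:
            finished += 1
        else:
            results.append(item)
    for t in threads:
        t.join()
    return results

# e.g. two processing elements (square, then add one) with 2 parallel streams each
print(sorted(run_pipeline(range(10), [lambda x: x * x, lambda x: x + 1])))
```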