44 research outputs found

    Managed Query Processing within the SAP HANA Database Platform

    Get PDF
    The SAP HANA database extends the scope of traditional database engines as it supports data models beyond regular tables, e.g. text, graphs or hierarchies. Moreover, SAP HANA also provides developers with a more fine-grained control to define their database application logic, e.g. exposing specific operators which are difficult to express in SQL. Finally, the SAP HANA database implements efficient communication to dedicated client applications using more effective communication mechanisms than available with standard interfaces like JDBC or ODBC. These features of the HANA database are complemented by the extended scripting engine–an application server for server-side JavaScript applications–that is tightly integrated into the query processing and application lifecycle management. As a result, the HANA platform offers more concise models and code for working with the HANA platform and provides superior runtime performance. This paper describes how these specific capabilities of the HANA platform can be consumed and gives a holistic overview of the HANA platform starting from query modeling, to the deployment, and efficient execution. As a distinctive feature, the HANA platform integrates most steps of the application lifecycle, and thus makes sure that all relevant artifacts stay consistent whenever they are modified. The HANA platform also covers transport facilities to deploy and undeploy applications in a complex system landscape

    The End of Slow Networks: It's Time for a Redesign

    Full text link
    Next generation high-performance RDMA-capable networks will require a fundamental rethinking of the design and architecture of modern distributed DBMSs. These systems are commonly designed and optimized under the assumption that the network is the bottleneck: the network is slow and "thin", and thus needs to be avoided as much as possible. Yet this assumption no longer holds true. With InfiniBand FDR 4x, the bandwidth available to transfer data across network is in the same ballpark as the bandwidth of one memory channel, and it increases even further with the most recent EDR standard. Moreover, with the increasing advances of RDMA, the latency improves similarly fast. In this paper, we first argue that the "old" distributed database design is not capable of taking full advantage of the network. Second, we propose architectural redesigns for OLTP, OLAP and advanced analytical frameworks to take better advantage of the improved bandwidth, latency and RDMA capabilities. Finally, for each of the workload categories, we show that remarkable performance improvements can be achieved

    Graph Processing in Main-Memory Column Stores

    Get PDF
    Evermore, novel and traditional business applications leverage the advantages of a graph data model, such as the offered schema flexibility and an explicit representation of relationships between entities. As a consequence, companies are confronted with the challenge of storing, manipulating, and querying terabytes of graph data for enterprise-critical applications. Although these business applications operate on graph-structured data, they still require direct access to the relational data and typically rely on an RDBMS to keep a single source of truth and access. Existing solutions performing graph operations on business-critical data either use a combination of SQL and application logic or employ a graph data management system. For the first approach, relying solely on SQL results in poor execution performance caused by the functional mismatch between typical graph operations and the relational algebra. To the worse, graph algorithms expose a tremendous variety in structure and functionality caused by their often domain-specific implementations and therefore can be hardly integrated into a database management system other than with custom coding. Since the majority of these enterprise-critical applications exclusively run on relational DBMSs, employing a specialized system for storing and processing graph data is typically not sensible. Besides the maintenance overhead for keeping the systems in sync, combining graph and relational operations is hard to realize as it requires data transfer across system boundaries. A basic ingredient of graph queries and algorithms are traversal operations and are a fundamental component of any database management system that aims at storing, manipulating, and querying graph data. Well-established graph traversal algorithms are standalone implementations relying on optimized data structures. The integration of graph traversals as an operator into a database management system requires a tight integration into the existing database environment and a development of new components, such as a graph topology-aware optimizer and accompanying graph statistics, graph-specific secondary index structures to speedup traversals, and an accompanying graph query language. In this thesis, we introduce and describe GRAPHITE, a hybrid graph-relational data management system. GRAPHITE is a performance-oriented graph data management system as part of an RDBMS allowing to seamlessly combine processing of graph data with relational data in the same system. We propose a columnar storage representation for graph data to leverage the already existing and mature data management and query processing infrastructure of relational database management systems. At the core of GRAPHITE we propose an execution engine solely based on set operations and graph traversals. Our design is driven by the observation that different graph topologies expose different algorithmic requirements to the design of a graph traversal operator. We derive two graph traversal implementations targeting the most common graph topologies and demonstrate how graph-specific statistics can be leveraged to select the optimal physical traversal operator. To accelerate graph traversals, we devise a set of graph-specific, updateable secondary index structures to improve the performance of vertex neighborhood expansion. Finally, we introduce a domain-specific language with an intuitive programming model to extend graph traversals with custom application logic at runtime. We use the LLVM compiler framework to generate efficient code that tightly integrates the user-specified application logic with our highly optimized built-in graph traversal operators. Our experimental evaluation shows that GRAPHITE can outperform native graph management systems by several orders of magnitude while providing all the features of an RDBMS, such as transaction support, backup and recovery, security and user management, effectively providing a promising alternative to specialized graph management systems that lack many of these features and require expensive data replication and maintenance processes

    Density-Aware Linear Algebra in a Column-Oriented In-Memory Database System

    Get PDF
    Linear algebra operations appear in nearly every application in advanced analytics, machine learning, and of various science domains. Until today, many data analysts and scientists tend to use statistics software packages or hand-crafted solutions for their analysis. In the era of data deluge, however, the external statistics packages and custom analysis programs that often run on single-workstations are incapable to keep up with the vast increase in data volume and size. In particular, there is an increasing demand of scientists for large scale data manipulation, orchestration, and advanced data management capabilities. These are among the key features of a mature relational database management system (DBMS). With the rise of main memory database systems, it now has become feasible to also consider applications that built up on linear algebra. This thesis presents a deep integration of linear algebra functionality into an in-memory column-oriented database system. In particular, this work shows that it has become feasible to execute linear algebra queries on large data sets directly in a DBMS-integrated engine (LAPEG), without the need of transferring data and being restricted by hard disc latencies. From various application examples that are cited in this work, we deduce a number of requirements that are relevant for a database system that includes linear algebra functionality. Beside the deep integration of matrices and numerical algorithms, these include optimization of expressions, transparent matrix handling, scalability and data-parallelism, and data manipulation capabilities. These requirements are addressed by our linear algebra engine. In particular, the core contributions of this thesis are: firstly, we show that the columnar storage layer of an in-memory DBMS yields an easy adoption of efficient sparse matrix data types and algorithms. Furthermore, we show that the execution of linear algebra expressions significantly benefits from different techniques that are inspired from database technology. In a novel way, we implemented several of these optimization strategies in LAPEG’s optimizer (SpMachO), which uses an advanced density estimation method (SpProdest) to predict the matrix density of intermediate results. Moreover, we present an adaptive matrix data type AT Matrix to obviate the need of scientists for selecting appropriate matrix representations. The tiled substructure of AT Matrix is exploited by our matrix multiplication to saturate the different sockets of a multicore main-memory platform, reaching up to a speed-up of 6x compared to alternative approaches. Finally, a major part of this thesis is devoted to the topic of data manipulation; where we propose a matrix manipulation API and present different mutable matrix types to enable fast insertions and deletes. We finally conclude that our linear algebra engine is well-suited to process dynamic, large matrix workloads in an optimized way. In particular, the DBMS-integrated LAPEG is filling the linear algebra gap, and makes columnar in-memory DBMS attractive as efficient, scalable ad-hoc analysis platform for scientists

    Coupling databases and advanced analytic tools (R)

    Get PDF
    Today, several contemporary organizations collect various kinds of data, creating large data repositories. But the capacity to perform advanced analytics over these large amount of data stored in databases remains a significant challenge to statistical software (R, S, SAS, SPSS, etc) and data management systems (DBMSs). This is because while statistical software provide comprehensive analytics and modelling functionalities, they can only handle limited amounts of data. The data management systems in contrast have capacity to handle large amount of data but lack adequate analytical facilities. The need to draw on the strengths of both camps gave rise to the idea of coupling databases and advanced analytical or statistical tools which seems very promising and is gaining a lot of grounds. This work studied the level of development of integration of a rising popular advanced analytical tool (R) with database systems (PostgreSQL, Oracle, DB2, SQL Server) and investigated the analytic performance of such coupling vis-`a-vis the performance of stand-alone implementation of (R). The results showed that the overall performance of coupling databases and R is about two (2) times faster than performance of stand-alone R. In the case of some individual benchmarks, the coupled systems (R+DBMS) performance is more than ten (10) times faster. However, there remain the challenges of efficient retrieval and passing of data to analytic functions, code portability, indistinguishable or flat analytics performance on small datasets and integration configuration snags with some of the well-known DBMSs. Although, stand-alone R performs competitively well compared to DBMSs coupled with R in cases of very small datasets analytics, the issue of data security still lingers. Our conclusion is that coupling databases with advanced analytical tools (R) is a good concept and technique which yields considerable performance gains for advanced analytics on substantial datasets provided retrieval and passing of data to the analytical functions are efficiently done. Thus, we confirm the initial assertion or hypothesis but on the condition that significant amount of data is involved in the process and the data is efficiently retrieved and passed to analytic functions. Overall, we recommend an integration which synergizes the robust DBMSs' data management capabilities and the rich statistical functionalities of advanced analytical tools for complex analytics in-situ databases in all situations for faster performance and data security

    Building Efficient Query Engines in a High-Level Language

    Get PDF
    Abstraction without regret refers to the vision of using high-level programming languages for systems development without experiencing a negative impact on performance. A database system designed according to this vision offers both increased productivity and high performance, instead of sacrificing the former for the latter as is the case with existing, monolithic implementations that are hard to maintain and extend. In this article, we realize this vision in the domain of analytical query processing. We present LegoBase, a query engine written in the high-level language Scala. The key technique to regain efficiency is to apply generative programming: LegoBase performs source-to-source compilation and optimizes the entire query engine by converting the high-level Scala code to specialized, low-level C code. We show how generative programming allows to easily implement a wide spectrum of optimizations, such as introducing data partitioning or switching from a row to a column data layout, which are difficult to achieve with existing low-level query compilers that handle only queries. We demonstrate that sufficiently powerful abstractions are essential for dealing with the complexity of the optimization effort, shielding developers from compiler internals and decoupling individual optimizations from each other. We evaluate our approach with the TPC-H benchmark and show that: (a) With all optimizations enabled, LegoBase significantly outperforms a commercial database and an existing query compiler. (b) Programmers need to provide just a few hundred lines of high-level code for implementing the optimizations, instead of complicated low-level code that is required by existing query compilation approaches. (c) The compilation overhead is low compared to the overall execution time, thus making our approach usable in practice for compiling query engines

    Accelerating Generalized Linear Models with MLWeaving: A One-Size-Fits-All System for Any-precision Learning (Technical Report)

    Full text link
    Learning from the data stored in a database is an important function increasingly available in relational engines. Methods using lower precision input data are of special interest given their overall higher efficiency but, in databases, these methods have a hidden cost: the quantization of the real value into a smaller number is an expensive step. To address the issue, in this paper we present MLWeaving, a data structure and hardware acceleration technique intended to speed up learning of generalized linear models in databases. ML-Weaving provides a compact, in-memory representation enabling the retrieval of data at any level of precision. MLWeaving also takes advantage of the increasing availability of FPGA-based accelerators to provide a highly efficient implementation of stochastic gradient descent. The solution adopted in MLWeaving is more efficient than existing designs in terms of space (since it can process any resolution on the same design) and resources (via the use of bit-serial multipliers). MLWeaving also enables the runtime tuning of precision, instead of a fixed precision level during the training. We illustrate this using a simple, dynamic precision schedule. Experimental results show MLWeaving achieves up to16 performance improvement over low-precision CPU implementations of first-order methods.Comment: 18 page

    Technology Selection for Big Data and Analytical Applications

    Get PDF
    The term Big Data has become pervasive in recent years, as smart phones, televisions, washing machines, refrigerators, smart meters, diverse sensors, eyeglasses, and even clothes connect to the Internet. However, their generated data is essentially worthless without appropriate data analytics that utilizes information retrieval, statistics, as well as various other techniques. As Big Data is commonly too big for a single person or institution to investigate, appropriate tools are being used that go way beyond a traditional data warehouse and that have been developed in recent years. Unfortunately, there is no single solution but a large variety of different tools, each of which with distinct functionalities, properties and characteristics. Especially small and medium-sized companies have a hard time to keep track, as this requires time, skills, money, and specific knowledge that, in combination, result in high entrance barriers for Big Data utilization. This paper aims to reduce these barriers by explaining and structuring different classes of technologies and the basic criteria for proper technology selection. It proposes a framework that guides especially small and mid-sized companies through a suitable selection process that can serve as a basis for further advances
    corecore