
    Scaling Up Concurrent Analytical Workloads on Multi-Core Servers

    Today, an ever-increasing number of researchers, businesses, and data scientists collect and analyze massive amounts of data in database systems. The database system needs to process the resulting highly concurrent analytical workloads by exploiting modern multi-socket multi-core processor systems with non-uniform memory access (NUMA) architectures and increasing memory sizes. Conventional execution engines, however, are not designed for many cores, and neither scale nor perform efficiently on modern multi-core NUMA architectures. Firstly, their query-centric approach, where each query is optimized and evaluated independently, can result in unnecessary contention for hardware resources due to redundant work found across queries in highly concurrent workloads. Secondly, they are unaware of the non-uniform memory access costs and the underlying hardware topology, incurring unnecessarily expensive memory accesses and bandwidth saturation. In this thesis, we show how these scalability and performance impediments can be solved by exploiting sharing among concurrent queries and by incorporating NUMA-aware adaptive task scheduling and data placement strategies in the execution engine. Regarding sharing, we identify state-of-the-art techniques for sharing data and work across concurrent queries at run-time and group them into two categories: reactive sharing, which shares intermediate results across common query sub-plans, and proactive sharing, which builds a global query plan with shared operators to evaluate queries. We integrate the original research prototypes that introduce reactive and proactive sharing, perform a sensitivity analysis, and show how and when each technique benefits performance. Our most significant finding is that reactive and proactive sharing can be combined to exploit the advantages of both sharing techniques for highly concurrent analytical workloads. Regarding NUMA-awareness, we identify, implement, and compare various combinations of task scheduling and data placement strategies under a diverse set of highly concurrent analytical workloads. We develop a prototype based on a commercial main-memory column-store database system. Our most significant finding is that there is no single strategy for task scheduling and data placement that is best for all workloads. Specifically, inter-socket stealing of memory-intensive tasks can hurt overall performance, and unnecessary partitioning of data across sockets involves an overhead. For this reason, we implement algorithms that adapt task scheduling and data placement to the workload at run-time. Our experiments show that both sharing and NUMA-awareness can significantly improve the performance and scalability of highly concurrent analytical workloads on modern multi-core servers. Thus, we argue that sharing and NUMA-awareness are key factors for faster processing of big data analytical applications, for fully exploiting the hardware resources of modern multi-core servers, and for a more responsive user experience.
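    To illustrate the NUMA-aware scheduling idea described above, the following minimal Python sketch keeps memory-intensive tasks on their home socket and allows only compute-bound tasks to be stolen across sockets. The `Task` and `NumaScheduler` names and the stealing policy are illustrative assumptions, not the thesis prototype's API.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    home_socket: int
    memory_intensive: bool  # stealing such tasks across sockets can hurt performance

class NumaScheduler:
    """Illustrative NUMA-aware scheduler: one task queue per socket."""

    def __init__(self, num_sockets: int):
        self.queues = [deque() for _ in range(num_sockets)]

    def submit(self, task: Task) -> None:
        # Place the task on the queue of the socket that holds its data.
        self.queues[task.home_socket].append(task)

    def next_task(self, socket_id: int) -> Task | None:
        # 1) Prefer local work to keep memory accesses on the local socket.
        if self.queues[socket_id]:
            return self.queues[socket_id].popleft()
        # 2) Otherwise steal, but only compute-bound tasks: stealing
        #    memory-intensive tasks would cause remote memory traffic.
        for other, queue in enumerate(self.queues):
            if other == socket_id:
                continue
            for i, task in enumerate(queue):
                if not task.memory_intensive:
                    del queue[i]
                    return task
        return None

if __name__ == "__main__":
    sched = NumaScheduler(num_sockets=2)
    sched.submit(Task("scan_part_0", home_socket=0, memory_intensive=True))
    sched.submit(Task("agg_finalize", home_socket=0, memory_intensive=False))
    # Socket 1 has no local work: it may steal only the compute-bound task.
    print(sched.next_task(1).name)  # agg_finalize
    print(sched.next_task(1))       # None (the memory-intensive task stays on socket 0)
```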

    Growth of relational model: Interdependence and complementary to big data

    A database management system is a long-established application of computer science that provides a platform for the creation, movement, and use of voluminous data. The area has witnessed a series of developments and technological advancements, from the conventional structured database to the recent buzzword, big data. This paper aims to provide a comprehensive view of the relational database model, which is still widely used because of its well-known ACID properties, namely atomicity, consistency, isolation, and durability. Specifically, the objective of this paper is to highlight the adoption of relational model approaches by big data techniques. To explain the reasons for this incorporation, the paper qualitatively studies the advancements made over time to the relational data model. First, variations in the data storage layout are illustrated based on the needs of the application. Second, fast data retrieval techniques such as indexing, query processing, and concurrency control methods are examined. The paper provides vital insights for appraising the efficiency of the structured database in an unstructured environment, particularly when both consistency and scalability become an issue in hybrid transactional and analytical database management systems.

    Graph Processing in Main-Memory Column Stores

    Increasingly, both novel and traditional business applications leverage the advantages of a graph data model, such as the offered schema flexibility and an explicit representation of relationships between entities. As a consequence, companies are confronted with the challenge of storing, manipulating, and querying terabytes of graph data for enterprise-critical applications. Although these business applications operate on graph-structured data, they still require direct access to the relational data and typically rely on an RDBMS to keep a single source of truth and access. Existing solutions performing graph operations on business-critical data either use a combination of SQL and application logic or employ a graph data management system. For the first approach, relying solely on SQL results in poor execution performance, caused by the functional mismatch between typical graph operations and the relational algebra. Worse, graph algorithms expose a tremendous variety in structure and functionality, caused by their often domain-specific implementations, and can therefore hardly be integrated into a database management system other than with custom coding. Since the majority of these enterprise-critical applications run exclusively on relational DBMSs, employing a specialized system for storing and processing graph data is typically not sensible. A basic ingredient of graph queries and algorithms are traversal operations, which are a fundamental component of any database management system that aims at storing, manipulating, and querying graph data. Well-established graph traversal algorithms are standalone implementations relying on optimized data structures. Integrating graph traversals as an operator into a database management system requires a tight integration into the existing database environment and the development of new components, such as a graph topology-aware optimizer and accompanying graph statistics, graph-specific secondary index structures to speed up traversals, and an accompanying graph query language. In this thesis, we introduce and describe GRAPHITE, a hybrid graph-relational data management system. GRAPHITE is a performance-oriented graph data management system built into an RDBMS, allowing graph data and relational data to be processed seamlessly in the same system. We propose a columnar storage representation for graph data to leverage the already existing and mature data management and query processing infrastructure of relational database management systems. At the core of GRAPHITE, we propose an execution engine based solely on set operations and graph traversals. Our design is driven by the observation that different graph topologies impose different algorithmic requirements on the design of a graph traversal operator. We derive two graph traversal implementations targeting the most common graph topologies and demonstrate how graph-specific statistics can be leveraged to select the optimal physical traversal operator. To accelerate graph traversals, we devise a set of graph-specific, updateable secondary index structures to improve the performance of vertex neighborhood expansion. Finally, we introduce a domain-specific language with an intuitive programming model to extend graph traversals with custom application logic at runtime.
    We use the LLVM compiler framework to generate efficient code that tightly integrates the user-specified application logic with our highly optimized built-in graph traversal operators. Our experimental evaluation shows that GRAPHITE can outperform native graph management systems by several orders of magnitude while providing all the features of an RDBMS, such as transaction support, backup and recovery, and security and user management, making it a promising alternative to specialized graph management systems, which lack many of these features and require expensive data replication and maintenance processes.
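    The traversal-operator idea can be illustrated with a small sketch: a level-synchronous breadth-first search over a CSR-style columnar adjacency layout, with one column of offsets and one column of neighbor ids. The function and column names are assumptions for illustration, not GRAPHITE's actual operator interface.

```python
# Illustrative level-synchronous BFS over a columnar (CSR-style) adjacency
# representation: one column of offsets and one column of neighbor ids.
def bfs_levels(offsets: list[int], targets: list[int], start: int) -> dict[int, int]:
    """Return the traversal level (hop distance) of every reachable vertex."""
    levels = {start: 0}
    frontier = [start]
    while frontier:
        next_frontier = []
        for v in frontier:
            # Neighbors of v are the slice targets[offsets[v]:offsets[v + 1]].
            for w in targets[offsets[v]:offsets[v + 1]]:
                if w not in levels:
                    levels[w] = levels[v] + 1
                    next_frontier.append(w)
        frontier = next_frontier
    return levels

if __name__ == "__main__":
    # Tiny graph: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3
    offsets = [0, 2, 3, 4, 4]   # offsets column (length = |V| + 1)
    targets = [1, 2, 3, 3]      # neighbor-id column (length = |E|)
    print(bfs_levels(offsets, targets, start=0))  # {0: 0, 1: 1, 2: 1, 3: 2}
```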

    Robust and adaptive query processing in hybrid transactional/analytical database systems

    The quality of query execution plans in database systems determines how fast a query can be processed. Conventional query optimization may still select sub-optimal or even bad query execution plans due to errors in cardinality estimation. In this work, we address limitations and unsolved problems of Robust and Adaptive Query Processing, with the goal of improving the detection and compensation of sub-optimal query execution plans. We demonstrate that existing heuristics cannot sufficiently characterize the intermediate-result cardinalities for which a given query execution plan remains optimal, and present an algorithm to calculate precise optimality ranges. The compensation of sub-optimal query execution plans is a complementary problem. We describe metrics to quantify the robustness of query execution plans with respect to cardinality estimation errors. In queries with cardinality estimation errors, our corresponding robust plan selection strategy chooses query execution plans that are up to 3.49x faster than the estimated cheapest plans. Furthermore, we present an adaptive query processor to compensate for sub-optimal query execution plans. It collects true cardinalities of intermediate results at query execution time to re-optimize the currently running query. We show that the overall effort for re-optimizations and plan switches is similar to that of the initial optimization. Our adaptive query processor can execute queries up to 5.19x faster than a conventional query processor.
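    A minimal sketch of the re-optimization check described above, assuming a plan annotated with a precomputed optimality range for one intermediate-result cardinality; the `PlanChoice` structure and the numbers below are hypothetical and do not reproduce the thesis's algorithm.

```python
from dataclasses import dataclass

@dataclass
class PlanChoice:
    name: str
    # Range of an intermediate-result cardinality for which this plan stays optimal.
    lower: float
    upper: float

def needs_reoptimization(current: PlanChoice, observed_cardinality: float) -> bool:
    """Re-optimize only when the true cardinality leaves the plan's optimality range."""
    return not (current.lower <= observed_cardinality <= current.upper)

if __name__ == "__main__":
    plan = PlanChoice("hash_join(R, S)", lower=1_000, upper=250_000)
    # The optimizer estimated ~50k rows, but the executed sub-plan produced 900k.
    print(needs_reoptimization(plan, observed_cardinality=900_000))  # True  -> switch plans
    print(needs_reoptimization(plan, observed_cardinality=60_000))   # False -> keep the plan
```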

    Energy-Aware Data Management on NUMA Architectures

    The ever-increasing need for more computing and data processing power demands continuous and rapid growth of power-hungry data center capacities all over the world. As a first study in 2008 revealed, the energy consumption of such data centers is becoming a critical problem, since their power consumption roughly doubles every five years. However, a follow-up study released in 2016 points out that this threatening trend was dramatically throttled in the intervening years, due to the increased energy-efficiency measures taken by data center operators. Furthermore, the authors of the study emphasize that making and keeping data centers energy-efficient is a continuous task, because more and more computing power is demanded from the same or an even lower energy budget, and that this threatening energy consumption trend will resume as soon as energy-efficiency research efforts and their market adoption decline. An important class of applications running in data centers are data management systems, which are a fundamental component of nearly every application stack. While those systems were traditionally designed as disk-based databases optimized for keeping disk accesses as low as possible, modern state-of-the-art database systems are main-memory-centric and store the entire data pool in main memory, which replaces the disk as the main bottleneck. To scale up such in-memory database systems, non-uniform memory access (NUMA) hardware architectures are employed, which exhibit decreased bandwidth and increased latency when accessing remote memory compared to local memory. In this thesis, we investigate energy-awareness aspects of large scale-up NUMA systems in the context of in-memory data management systems. To do so, we pick up the idea of a fine-grained data-oriented architecture and improve the concept so that it keeps pace with the increased absolute performance of a pure in-memory DBMS and scales up on large NUMA systems. To achieve this goal, we design and build ERIS, the first scale-up in-memory data management system designed from scratch to implement a data-oriented architecture. With the help of the ERIS platform, we explore our novel core concept for energy awareness, Energy Awareness by Adaptivity: software, and especially database systems, must respond quickly to environmental changes (i.e., workload changes) by adapting themselves to enter a state of low energy consumption. We present the hierarchically organized Energy-Control Loop (ECL), a reactive control loop that provides two concrete implementations of our Energy Awareness by Adaptivity concept, namely the hardware-centric Resource Adaptivity and the software-centric Storage Adaptivity. Finally, we give an exhaustive evaluation of the scalability of ERIS as well as of our adaptivity facilities.
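    The Energy Awareness by Adaptivity concept can be sketched as a simple reactive control loop that shrinks the active worker pool under low load (so idle cores can enter low-power states) and grows it again under high load. The thresholds and callbacks below are illustrative assumptions and do not reproduce the ERIS Energy-Control Loop.

```python
import time

def energy_control_loop(read_utilization, set_active_workers, max_workers=64,
                        low=0.3, high=0.8, interval_s=1.0, iterations=10):
    """Illustrative reactive control loop: adapt the number of active workers
    to the observed load to trade performance against energy consumption."""
    workers = max_workers
    for _ in range(iterations):
        utilization = read_utilization()             # e.g. fraction of busy worker time
        if utilization < low and workers > 1:
            workers = max(1, workers // 2)            # adapt down: save energy
        elif utilization > high and workers < max_workers:
            workers = min(max_workers, workers * 2)   # adapt up: meet the workload
        set_active_workers(workers)
        time.sleep(interval_s)

if __name__ == "__main__":
    import random
    energy_control_loop(
        read_utilization=lambda: random.random(),        # stand-in for real metrics
        set_active_workers=lambda n: print("active workers:", n),
        interval_s=0.01,
    )
```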

    Just-in-time Analytics Over Heterogeneous Data and Hardware

    Industry and academia are continuously becoming more data-driven and data-intensive, relying on the analysis of a wide variety of datasets to gain insights. At the same time, data variety increases continuously across multiple axes. First, data comes in multiple formats, such as the binary tabular data of a DBMS, raw textual files, and domain-specific formats. Second, different datasets follow different data models, such as the relational and the hierarchical one. Data location also varies: Some datasets reside in a central "data lake", whereas others lie in remote data sources. In addition, users execute widely different analysis tasks over all these data types. Finally, the process of gathering and integrating diverse datasets introduces several inconsistencies and redundancies in the data, such as duplicate entries for the same real-world concept. In summary, heterogeneity significantly affects the way data analysis is performed. In this thesis, we aim for data virtualization: Abstracting data out of its original form and manipulating it regardless of the way it is stored or structured, without a performance penalty. To achieve data virtualization, we design and implement systems that i) mask heterogeneity through the use of heterogeneity-aware, high-level building blocks and ii) offer fast responses through on-demand adaptation techniques. Regarding the high-level building blocks, we use a query language and algebra to handle multiple collection types, such as relations and hierarchies, express transformations between these collection types, as well as express complex data cleaning tasks over them. In addition, we design a location-aware compiler and optimizer that masks away the complexity of accessing multiple remote data sources. Regarding on-demand adaptation, we present a design to produce a new system per query. The design uses customization mechanisms that trigger runtime code generation to mimic the system most appropriate to answer a query fast: Query operators are thus created based on the query workload and the underlying data models; the data access layer is created based on the underlying data formats. In addition, we exploit emerging hardware by customizing the system implementation based on the available heterogeneous processors (CPUs and GPGPUs). We thus pair each workload with its ideal processor type. The end result is a just-in-time database system that is specific to the query, data, workload, and hardware instance. This thesis redesigns the data management stack to natively cater for data heterogeneity and exploit hardware heterogeneity. Instead of centralizing all relevant datasets, converting them to a single representation, and loading them in a monolithic, static, suboptimal system, our design embraces heterogeneity. Overall, our design decouples the type of performed analysis from the original data layout; users can perform their analysis across data stores, data models, and data formats, but at the same time experience the performance offered by a custom system that has been built on demand to serve their specific use case.
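    The on-demand specialization idea can be sketched as follows: instead of one generic access path, a small factory builds a reader tailored to the query's data format at run time. A real engine would generate machine code (e.g. via LLVM) rather than select Python closures; the formats and function names here are assumptions for illustration only.

```python
# Illustrative on-demand specialization: build a scan function for the query's
# data format at run time instead of using one generic interpreted code path.
import csv, io, json

def make_scan(fmt: str, column: str):
    """Return a reader specialized for the given input format
    (assumption: CSV or JSON-lines text)."""
    if fmt == "csv":
        def scan(raw: str):
            for row in csv.DictReader(io.StringIO(raw)):
                yield row[column]
    elif fmt == "jsonl":
        def scan(raw: str):
            for line in raw.splitlines():
                yield json.loads(line)[column]
    else:
        raise ValueError(f"unsupported format: {fmt}")
    return scan

if __name__ == "__main__":
    csv_scan = make_scan("csv", "price")
    print(list(csv_scan("price,qty\n10,1\n20,2\n")))          # ['10', '20']
    jsonl_scan = make_scan("jsonl", "price")
    print(list(jsonl_scan('{"price": 10}\n{"price": 20}')))   # [10, 20]
```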

    Mapping Computer Technology Needs for the Surabaya City Government (Case Study: Badan Perencanaan dan Pembangunan Kota)

    Computers are among the most important tools supporting office work, including in the Surabaya city government. The growth of services in Surabaya has increased the number of computer requests submitted to the procurement unit by each Satuan Kerja Perangkat Daerah (SKPD). Technology is one of the key aspects to consider when selecting computer units, and the rapid pace of technological change makes it necessary to map the technology needs of each functional position in order to avoid a technology gap, since such a gap can lead to wasted spending on computer technology. Based on a field survey, misallocation of computers is a major cause of the sharp increase in computer requests from the SKPDs. Through an analysis of the job workload and the application workload of each functional position, this research aims to construct a general framework for determining the level of technology required by each position. The framework is intended to help optimize the allocation of computers across the SKPDs; in addition, optimizing computer allocation can reduce unnecessary procurement, helping the procurement unit save budget and avoid technology gaps.
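    A minimal sketch of the mapping idea, assuming hypothetical application weights and tier thresholds (not the study's actual scoring scheme): each position's application workload is scored and translated into a computer tier.

```python
# Hypothetical application weights and tiers, for illustration only.
APP_WEIGHTS = {"office_suite": 1, "web_browser": 1, "gis_mapping": 3, "cad": 4}

def technology_tier(applications: list[str]) -> str:
    """Map the applications used by a functional position to a computer tier."""
    score = sum(APP_WEIGHTS.get(app, 1) for app in applications)
    if score >= 6:
        return "high-end workstation"
    if score >= 3:
        return "standard desktop"
    return "basic desktop"

if __name__ == "__main__":
    print(technology_tier(["office_suite", "web_browser"]))         # basic desktop
    print(technology_tier(["office_suite", "gis_mapping", "cad"]))  # high-end workstation
```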

    Insights from the Inventory of Smart Grid Projects in Europe: 2012 Update

    At the end of 2010, the Joint Research Centre, the European Commission’s in-house science service, launched the first comprehensive inventory of smart grid projects in Europe. The final catalogue was published in July 2011 and included 219 smart grid and smart metering projects from the EU-28 member states, Switzerland and Norway. The participation of the project coordinators and the reception of the report by the smart grid community were extremely positive. Due to its success, the European Commission decided that the project inventory would be carried out on a regular basis, so as to constantly update the picture of smart grid developments in Europe and keep track of lessons learnt and of challenges and opportunities. To this end, a new on-line questionnaire was launched in March 2012, and information on projects was collected up to September 2012. At the same time, an extensive search of project information on the internet and through cooperation links with other European research organizations was conducted. The resulting final database is the most up-to-date and comprehensive inventory of smart grid and smart metering projects in Europe, including a total of 281 smart grid projects and 90 smart metering pilot projects and rollouts from the same 30 countries that were included in the 2011 inventory database. Projects surveyed were classified into three categories: R&D, demonstration (or pre-deployment), and deployment; for the first time, a distinction between smart grid and smart metering projects was made. The following is an insight into the 2012 report.

    An Approach for Guiding Developers to Performance and Scalability Solutions

    This thesis proposes an approach that enables developers who are novices in software performance engineering to solve software performance and scalability problems without the assistance of a software performance expert. The contribution of this thesis is the explicit consideration of the implementation level to recommend solutions for software performance and scalability problems. This includes a set of description languages for data representation and human-computer interaction, as well as a workflow.
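    The recommendation idea can be sketched as a simple rule catalogue that matches a described performance problem to candidate solutions; the rule format and catalogue entries below are hypothetical and merely stand in for the thesis's description languages and workflow.

```python
# Hypothetical catalogue: (conditions that must match, recommended solution).
SOLUTION_CATALOGUE = [
    ({"symptom": "high latency", "cause": "n+1 queries"},
     "Batch the database calls or add eager loading."),
    ({"symptom": "poor scalability", "cause": "shared lock"},
     "Shorten the critical section or switch to a lock-free structure."),
]

def recommend(problem: dict) -> list[str]:
    """Return every solution whose conditions are all satisfied by the problem description."""
    return [advice for conditions, advice in SOLUTION_CATALOGUE
            if all(problem.get(key) == value for key, value in conditions.items())]

if __name__ == "__main__":
    print(recommend({"symptom": "high latency", "cause": "n+1 queries"}))
    # ['Batch the database calls or add eager loading.']
```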

    Design principles for supply network systems

    Supply networks that cross organizational boundaries are becoming ever more critical to success in dynamic supply relationships. The ability to quickly find, connect, and qualify business partners as sources of supply on a regional, national, and international basis, and to sustain those relationships across intercultural barriers throughout the completion of business cases, is and will remain a key competitive differentiator in the global economy. Supply networks in particular are characterized by wide inter-organizational settings of connected entities, with a key focus on supply management and procurement in the provision of goods and services. As such, they describe the value generation flow between connected business partners. Supporting supply networks at the level of the individual company, the primary target of enterprise resource planning systems is to standardize structured data and to streamline business processes within a company. Extending this scope, e-procurement and supplier-relationship management systems focus on supply networks beyond the boundary of a single company. They enable integration across companies by establishing standards for document exchange between different systems. However, these approaches still result in costly, rigid, and complex data and process integration of peer-to-peer nodes in dynamic supply networks. The emergence of e-marketplaces in the early 2000s also could not overcome these challenges, and many of the most promising e-marketplace providers and concepts disappeared when the dot-com bubble burst, primarily because of the data and process integration complexity that arises when connecting dyadic and many-to-many network relationships on a single platform. Besides this integration challenge, e-procurement and supplier-relationship management systems are targeted at streamlining structured business processes and handling structured data. Structured data and processes have a fixed, coded meaning, format, and sequence, from the start to the completion of a business transaction. Structured data is normally stored in database fields, and structured processes follow pre-defined patterns of transactional steps to complete standardized business cases. The coverage provided by structured data and processes is mostly sufficient where the use case is commonly defined and accepted by the involved business partners, consists mostly of routine steps, and does not require much direct human interaction. To cope with the increasing challenges of highly collaborative, inter-organizational settings and to leverage Web 2.0 and Enterprise 2.0 capabilities in supply networks, support for unstructured interactions becomes more important. Unlike structured data and processes, unstructured interactions such as instant messages, feeds, or blogs have no or only a limited fixed format, are derived directly from human interactions and result in textual data, and can hardly be processed without prior transformation. Many of these interactions happen before, during, and after the actual execution of structured business processes, but are rarely supported by the corresponding supply management systems. This is a particular challenge, as Cappuccio (2012) predicts that enterprise data will grow by 650% over the next five years, with 80% of that data being unstructured. In summary, when designing systems that provide utility for companies and business professionals in supply networks, two overarching challenges need to be addressed: (1) data and process integration and (2) support for unstructured interactions.
    Addressing these challenges, the doctoral thesis proposes a design that tightly bundles the structured and unstructured data and process perspectives, following suggestions “[to find] new ways to generate and maintain connections within and between social units, and new social connection-focused IT capabilities” (Oinas-Kukkonen et al. 2010, p. 61). Rather than treating the two challenges independently, the overarching research question is: Which design principles, instantiated in a software artifact, advance supply networks for professionals by connecting both structured and unstructured data and processes? The answer to this research question consists of aggregated design principles for the design and implementation of supply network systems that support business professionals in supply management and procurement. Their individual performance in supply networks should increase, and their effort to execute supply network tasks should decrease, assuming that improvements at the individual level will also lead to supply network advances overall. In order to answer the research question, Action Design Research (ADR) is employed, based on the Design Science Research (DSR) paradigm. The design principles of ‘networked business objects (n-BO)’ and ‘social augmentation’ are conceptualized and used to derive and implement related design decisions in a software artifact. Finally, testable hypotheses are derived and evaluated with respect to the utility of both the artifact and the underlying design principles.
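    The ‘networked business object’ principle can be sketched as a data structure that bundles structured order fields with the unstructured conversation around them; the field and method names below are illustrative assumptions, not the evaluated artifact.

```python
from dataclasses import dataclass, field

@dataclass
class NetworkedBusinessObject:
    """Illustrative 'networked business object': structured supply data bundled
    with the unstructured interactions (messages, feed entries) around it."""
    object_id: str
    supplier: str
    status: str = "draft"
    feed: list[str] = field(default_factory=list)  # unstructured interactions

    def post(self, author: str, message: str) -> None:
        # Social augmentation: attach human conversation directly to the object.
        self.feed.append(f"{author}: {message}")

    def set_status(self, status: str) -> None:
        # Structured process step, visible to all connected partners.
        self.status = status

if __name__ == "__main__":
    rfq = NetworkedBusinessObject("RFQ-001", supplier="ACME Components")
    rfq.post("buyer", "Can you deliver by week 42?")
    rfq.set_status("quoted")
    print(rfq.status, rfq.feed)
```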