
    Data management in cloud environments: NoSQL and NewSQL data stores

    Advances in Web technology and the proliferation of mobile devices and sensors connected to the Internet have resulted in immense processing and storage requirements. Cloud computing has emerged as a paradigm that promises to meet these requirements. This work focuses on the storage aspect of cloud computing, specifically on data management in cloud environments. Traditional relational databases were designed in a different hardware and software era and are facing challenges in meeting the performance and scale requirements of Big Data. NoSQL and NewSQL data stores present themselves as alternatives that can handle huge volumes of data. Because of the large number and diversity of existing NoSQL and NewSQL solutions, it is difficult to comprehend the domain and even more challenging to choose an appropriate solution for a specific task. Therefore, this paper reviews NoSQL and NewSQL solutions with the objectives of (1) providing a perspective on the field, (2) providing guidance to practitioners and researchers in choosing the appropriate data store, and (3) identifying challenges and opportunities in the field. Specifically, the most prominent solutions are compared with a focus on data models, querying, scaling, and security-related capabilities. The features that drive the ability to scale reads, writes, and data storage are investigated, in particular partitioning, replication, consistency, and concurrency control. Furthermore, use cases and scenarios in which NoSQL and NewSQL data stores have been used are discussed, and the suitability of various solutions for different sets of applications is examined. Finally, this study identifies challenges in the field, including the immense diversity and inconsistency of terminologies, limited documentation, sparse comparison and benchmarking criteria, and the lack of standardized query languages.
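    As an illustration of the partitioning mechanism mentioned above, the following is a minimal sketch of hash-based sharding, in which a key is mapped to one of a fixed number of shards. The hash function, shard count, and key names are illustrative assumptions and do not describe any specific NoSQL or NewSQL store.

    /* Minimal sketch of hash partitioning: a key is routed to one of a fixed
     * number of shards by hashing it. FNV-1a is used purely for illustration;
     * real stores differ in hash choice and in how shards are rebalanced. */
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_SHARDS 4

    static uint64_t fnv1a(const char *key) {
        uint64_t h = 14695981039346656037ULL;   /* FNV offset basis */
        for (; *key; ++key) {
            h ^= (unsigned char)*key;
            h *= 1099511628211ULL;               /* FNV prime */
        }
        return h;
    }

    /* A key belongs to the shard given by its hash modulo the shard count. */
    static int shard_for_key(const char *key) {
        return (int)(fnv1a(key) % NUM_SHARDS);
    }

    int main(void) {
        const char *keys[] = { "user:42", "user:43", "order:7", "cart:42" };
        for (size_t i = 0; i < sizeof keys / sizeof keys[0]; ++i)
            printf("%s -> shard %d\n", keys[i], shard_for_key(keys[i]));
        return 0;
    }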

    Extending a methodology for migration of the database layer to the cloud considering relational database schema migration to NoSQL

    The advances in Cloud computing and in modern Web applications have raised the need for highly available and scalable distributed databases to accommodate the big data being created and consumed. Along with the explosion in data growth comes the necessity to rapidly evolve databases and schemas to meet user demands for new functionality. Special attention is paid to the vast amounts of semi-structured and unstructured data, and data management tools should support these needs. This has led to the development of new Cloud serving systems such as "Not Only" SQL (NoSQL) databases. NoSQL databases were driven by the scalability needs of big companies such as Google, Facebook, Amazon, and Yahoo. While the demands of these key players differ from those of small and medium enterprises in terms of scale, the core problem is the same: storage arrays are not scalable and force users into expensive forklift upgrades. Given these facts, combined with changes in how IT resources are delivered and consumed through the Cloud computing paradigm, projects adopting NoSQL solutions are no longer mere hype. NoSQL databases are offered as a service by the big Cloud providers, such as Google, Amazon, and Microsoft, but by smaller vendors as well. In this master's thesis we investigate the possibilities and limitations of mapping relational database schemas to NoSQL schemas when migrating the database layer to the Cloud. Based on literature research, we provide recommendations and guidelines for schema transformation and discuss the implications at other application architecture layers, such as the business logic and data access layers. We extend an existing data migration tool and methodology to incorporate the migration guidelines and hints. Moreover, we validate our work on a chosen subset of relational and NoSQL databases using example data from the established TPC-H benchmark.
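    To illustrate the kind of schema transformation the thesis discusses, the following is a minimal sketch that folds a relational one-to-many relationship (a customer and its orders, loosely modeled on the TPC-H CUSTOMER and ORDERS tables) into a single nested document of the kind a document-oriented NoSQL store expects. The struct and field names are illustrative assumptions, not the thesis' actual mapping rules.

    /* Sketch of denormalization: a parent row plus its child rows, joined by a
     * foreign key in the relational schema, become one embedded document. */
    #include <stdio.h>

    struct order    { int o_orderkey; double o_totalprice; };
    struct customer { int c_custkey; const char *c_name; };

    /* Emit the customer and its orders as a single nested JSON document. */
    static void emit_document(const struct customer *c,
                              const struct order *orders, int n_orders) {
        printf("{ \"_id\": %d, \"name\": \"%s\", \"orders\": [",
               c->c_custkey, c->c_name);
        for (int i = 0; i < n_orders; ++i)
            printf("%s{ \"orderkey\": %d, \"totalprice\": %.2f }",
                   i ? ", " : "", orders[i].o_orderkey, orders[i].o_totalprice);
        printf("] }\n");
    }

    int main(void) {
        struct customer c = { 7, "Customer#000000007" };
        struct order orders[] = { { 3271, 1234.56 }, { 4421, 987.00 } };
        emit_document(&c, orders, 2);
        return 0;
    }

    Embedding trades join performance for duplication and update cost, which is why such guidelines also have implications for the business logic and data access layers mentioned above.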

    Architectural Principles for Database Systems on Storage-Class Memory

    Database systems have long been optimized to hide the higher latency of storage media, yielding complex persistence mechanisms. With the advent of large DRAM capacities, it became possible to keep a full copy of the data in DRAM. Systems that leverage this possibility, such as main-memory databases, keep two copies of the data in two different formats: one in main memory and the other one in storage. The two copies are kept synchronized using snapshotting and logging. This main-memory-centric architecture yields nearly two orders of magnitude faster analytical processing than traditional, disk-centric ones. The rise of Big Data emphasized the importance of such systems with an ever-increasing need for more main memory. However, DRAM is hitting its scalability limits: It is intrinsically hard to further increase its density. Storage-Class Memory (SCM) is a group of novel memory technologies that promise to alleviate DRAM’s scalability limits. They combine the non-volatility, density, and economic characteristics of storage media with the byte-addressability and a latency close to that of DRAM. Therefore, SCM can serve as persistent main memory, thereby bridging the gap between main memory and storage. In this dissertation, we explore the impact of SCM as persistent main memory on database systems. Assuming a hybrid SCM-DRAM hardware architecture, we propose a novel software architecture for database systems that places primary data in SCM and directly operates on it, eliminating the need for explicit IO. This architecture yields many benefits: First, it obviates the need to reload data from storage to main memory during recovery, as data is discovered and accessed directly in SCM. Second, it allows replacing the traditional logging infrastructure by fine-grained, cheap micro-logging at data-structure level. Third, secondary data can be stored in DRAM and reconstructed during recovery. Fourth, system runtime information can be stored in SCM to improve recovery time. Finally, the system may retain and continue in-flight transactions in case of system failures. However, SCM is no panacea as it raises unprecedented programming challenges. Given its byte-addressability and low latency, processors can access, read, modify, and persist data in SCM using load/store instructions at a CPU cache line granularity. The path from CPU registers to SCM is long and mostly volatile, including store buffers and CPU caches, leaving the programmer with little control over when data is persisted. Therefore, there is a need to enforce the order and durability of SCM writes using persistence primitives, such as cache line flushing instructions. This in turn creates new failure scenarios, such as missing or misplaced persistence primitives. We devise several building blocks to overcome these challenges. First, we identify the programming challenges of SCM and present a sound programming model that solves them. Then, we tackle memory management, as the first required building block to build a database system, by designing a highly scalable SCM allocator, named PAllocator, that fulfills the versatile needs of database systems. Thereafter, we propose the FPTree, a highly scalable hybrid SCM-DRAM persistent B+-Tree that bridges the gap between the performance of transient and persistent B+-Trees. Using these building blocks, we realize our envisioned database architecture in SOFORT, a hybrid SCM-DRAM columnar transactional engine. 
We propose an SCM-optimized MVCC scheme that eliminates write-ahead logging from the critical path of transactions. Since SCM-resident data is near-instantly available upon recovery, the new recovery bottleneck is rebuilding DRAM-based data. To alleviate this bottleneck, we propose a novel recovery technique that achieves nearly instant responsiveness of the database by accepting queries right after recovering SCM-based data, while rebuilding DRAM-based data in the background. Additionally, SCM brings new failure scenarios that existing testing tools cannot detect. Hence, we propose an online testing framework that is able to automatically simulate power failures and detect missing or misplaced persistence primitives. Finally, our proposed building blocks can serve to build more complex systems, paving the way for future database systems on SCM.
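    To make the write-ordering problem concrete, the following is an illustrative sketch (not SOFORT's actual code) of publishing a record durably: the payload is flushed and fenced before the flag that marks it valid, so a power failure cannot persist the flag without the data. It assumes an x86 compiler with <immintrin.h>; a real SCM deployment would place the record in persistent memory, e.g. via a DAX-mapped file, rather than in ordinary DRAM as done here.

    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    struct record {
        uint64_t payload;
        uint64_t valid;      /* publication flag, written last */
    };

    /* Flush the cache line holding addr and fence so the flush is ordered
     * before any later stores. */
    static void persist(const void *addr) {
        _mm_clflush(addr);
        _mm_mfence();
    }

    static void durable_write(struct record *r, uint64_t value) {
        r->payload = value;
        persist(&r->payload);    /* 1: make the payload durable */
        r->valid = 1;
        persist(&r->valid);      /* 2: only then publish it */
    }

    int main(void) {
        static struct record r;
        durable_write(&r, 42);
        printf("payload=%llu valid=%llu\n",
               (unsigned long long)r.payload, (unsigned long long)r.valid);
        return 0;
    }

    Reordering the two persist steps, or omitting either one, reproduces exactly the class of missing or misplaced persistence primitives that the proposed testing framework is designed to detect.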

    Adaptive Management of Multimodel Data and Heterogeneous Workloads

    Data management systems are facing a growing demand for a tighter integration of heterogeneous data from different applications and sources for both operational and analytical purposes in real time. However, the vast diversification of the data management landscape has led to a situation where there is a trade-off between high operational performance and a tight integration of data. The gap between the growth of data volume and the growth of computational power demands a new approach for managing multimodel data and handling heterogeneous workloads. With PolyDBMS we present a novel class of database management systems, bridging the gap between multimodel databases and polystore systems. This new kind of database system combines the operational capabilities of traditional database systems with the flexibility of polystore systems. This includes support for data modifications, transactions, and schema changes at runtime. With native support for multiple data models and query languages, a PolyDBMS presents a holistic solution for the management of heterogeneous data. This not only enables a tight integration of data across different applications, but also allows a more efficient use of resources. By leveraging and combining highly optimized database systems as storage and execution engines, this novel class of database system takes advantage of decades of database systems research and development. In this thesis, we present the conceptual foundations and models for building a PolyDBMS. This includes a holistic model for maintaining and querying multiple data models in one logical schema that enables cross-model queries. With the PolyAlgebra, we present a solution for representing queries based on one or multiple data models while preserving their semantics. Furthermore, we introduce a concept for the adaptive planning and decomposition of queries across heterogeneous database systems with different capabilities and features. The conceptual contributions presented in this thesis materialize in Polypheny-DB, the first implementation of a PolyDBMS. Supporting the relational, document, and labeled property graph data models, Polypheny-DB is a suitable solution for structured, semi-structured, and unstructured data. This is complemented by an extensive type system that includes support for binary large objects. With support for multiple query languages, industry-standard query interfaces, and a rich set of domain-specific data stores and data sources, Polypheny-DB offers a flexibility unmatched by existing data management solutions.

    In-Memory Databases

    This bachelor thesis deals with in-memory databases and the concepts developed to build such systems: in these databases the data is kept in main memory, which can process it several times faster than disk but is at the same time a volatile storage medium. To ground these concepts, the thesis summarizes the evolution of database systems from their beginnings to the present. The first database types were hierarchical and network databases, which were replaced as early as the 1970s by the first relational databases, whose development continues to this day and which are nowadays represented mainly by OLTP and OLAP systems. Object, object-relational, and NoSQL databases are also covered, as is the spread of Big Data and the options for processing it. To explain how data is stored in main memory, the memory hierarchy is introduced, from CPU registers through caches and main memory down to hard disks, together with the latency and volatility of these storage media. The thesis then discusses how data can be organized in memory and explains the row and column layouts, along with how each can be exploited for the highest processing performance; this section also covers compression techniques that make the most economical use of main-memory space. The following section presents techniques that keep changes in these databases persistent even though the database runs on a volatile medium. Alongside traditional durability techniques, the concept of a differential buffer, into which all changes are written, is introduced, and the process of merging data from this buffer with the main store is described. A further section surveys existing in-memory databases such as SAP HANA and Oracle TimesTen, as well as hybrid systems that work primarily on disk but can also operate in memory, SQLite being one of them. This section compares the individual systems, evaluates how far they use the concepts introduced in the previous chapters, and ends with a table summarizing the information about these systems. The remaining parts of the thesis concern the actual performance testing of these databases. First, the test data, originating from the DBLP database, is described, along with how it was obtained and transformed into a form usable for benchmarking. The testing methodology is then described in two parts: the first compares the performance of a disk-based database with an in-memory one, using SQLite and its ability to run a database in memory; the second compares the performance of the row and column layouts in an in-memory database, using SAP HANA, which can store data in both layouts. The thesis concludes with an analysis of the results obtained from these benchmarks.
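    A minimal sketch of the row and column layouts compared in the thesis' second benchmark is given below; the table contents and attribute names are illustrative. The point is that an aggregate over one attribute reads contiguous memory in the columnar layout but strided memory in the row layout.

    #include <stdio.h>

    #define N 4

    /* Row-oriented layout: all attributes of a record are stored together. */
    struct row { int id; double price; int qty; };
    static struct row row_store[N] = {
        {1, 9.50, 2}, {2, 3.00, 1}, {3, 7.25, 5}, {4, 1.50, 3}
    };

    /* Column-oriented layout: each attribute lives in its own dense array
     * (only the price column is shown; id and qty would get their own arrays). */
    static double col_price[N] = { 9.50, 3.00, 7.25, 1.50 };

    int main(void) {
        double sum_rows = 0.0, sum_cols = 0.0;

        /* Row layout: summing prices also strides over id and qty in memory. */
        for (int i = 0; i < N; ++i)
            sum_rows += row_store[i].price;

        /* Column layout: the same aggregate scans one contiguous array, which
         * is what makes columnar in-memory scans cache-friendly. */
        for (int i = 0; i < N; ++i)
            sum_cols += col_price[i];

        printf("sum(price): row layout=%.2f, column layout=%.2f\n",
               sum_rows, sum_cols);
        return 0;
    }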

    Atomic Transfer for Distributed Systems

    Building applications and information systems increasingly means dealing with concurrency and faults stemming from the distribution of system components. Atomic transactions are a well-known method for transferring the responsibility for handling concurrency and faults from developers to the software's execution environment, but incur considerable execution overhead. This dissertation investigates methods that shift some of the burden of concurrency control into the network layer, to reduce response times and increase throughput. It anticipates future programmable network devices, enabling customized high-performance network protocols. We propose Atomic Transfer (AT), a distributed algorithm to prevent race conditions due to messages crossing on a path of network switches. Switches check request messages for conflicts with response messages traveling in the opposite direction. Conflicting requests are dropped, relieving the request's receiving host of detecting and handling the conflict. AT is designed to perform well under high data contention, as concurrency control effort is balanced across a network instead of being handled by the contended endpoint hosts themselves. We use AT as the basis for a new optimistic transactional cache consistency algorithm, supporting execution of atomic applications caching shared data. We then present a scalable refinement, allowing hierarchical consistent caches with predictable performance despite high data update rates. We give detailed I/O Automata models of our algorithms along with correctness proofs. We begin with a simplified model, assuming static network paths and no message loss, and then refine it to support dynamic network paths and safe handling of message loss. We present a trie-based data structure for accelerating conflict-checking on switches, with benchmarks suggesting the feasibility of our approach from a performance standpoint.
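    The following is a conceptual sketch (with assumed names, not the dissertation's I/O Automata model) of the switch-side check at the heart of Atomic Transfer: a request is forwarded only if the object it targets does not collide with a response currently crossing the switch in the opposite direction; otherwise the switch drops it, and the receiving host never has to detect the conflict itself.

    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_INFLIGHT 64

    /* Object IDs of responses currently traversing this switch downstream. */
    static unsigned inflight[MAX_INFLIGHT];
    static int inflight_count = 0;

    static void response_enters(unsigned object_id) {
        if (inflight_count < MAX_INFLIGHT)
            inflight[inflight_count++] = object_id;
    }

    static void response_leaves(unsigned object_id) {
        for (int i = 0; i < inflight_count; ++i)
            if (inflight[i] == object_id) {
                inflight[i] = inflight[--inflight_count];  /* swap-remove */
                return;
            }
    }

    /* Returns true if the request may be forwarded, false if it is dropped. */
    static bool forward_request(unsigned object_id) {
        for (int i = 0; i < inflight_count; ++i)
            if (inflight[i] == object_id)
                return false;      /* crossing conflict: drop the request */
        return true;
    }

    int main(void) {
        response_enters(17);       /* a response for object 17 is in flight */
        printf("request on 17: %s\n", forward_request(17) ? "forward" : "drop");
        printf("request on 99: %s\n", forward_request(99) ? "forward" : "drop");
        response_leaves(17);
        printf("request on 17: %s\n", forward_request(17) ? "forward" : "drop");
        return 0;
    }

    A production design would replace the linear scan with the trie-based structure described in the dissertation so the check stays cheap at line rate; the linear array here only keeps the sketch self-contained.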