56 research outputs found
Deep Learning Data and Indexes in a Database
A database is used to store and retrieve data, which is a critical component for any software application. Databases requires configuration for efficiency, however, there are tens of configuration parameters. It is a challenging task to manually configure a database. Furthermore, a database must be reconfigured on a regular basis to keep up with newer data and workload. The goal of this thesis is to use the query workload history to autonomously configure the database and improve its performance. We achieve proposed work in four stages: (i) we develop an index recommender using deep reinforcement learning for a standalone database. We evaluated the effectiveness of our algorithm by comparing with several state-of-the-art approaches, (ii) we build a real-time index recommender that can, in real-time, dynamically create and remove indexes for better performance in response to sudden changes in the query workload, (iii) we develop a database advisor. Our advisor framework will be able to learn latent patterns from a workload. It is able to enhance a query, recommend interesting queries, and summarize a workload, (iv) we developed LinkSocial, a fast, scalable, and accurate framework to gain deeper insights from heterogeneous data
Optimizing Hierarchical Storage Management For Database System
Caching is a classical but effective way to improve system performance.
To improve system performance, servers, such as database servers and storage servers, contain
significant amounts of memory that act as a fast cache.
Meanwhile, as new storage devices such as flash-based solid state drives (SSDs)
are added to storage systems over time, using the
memory cache is not the only way to improve system performance.
In this thesis, we address the problems of how to manage the cache of a storage server and
how to utilize the SSD in a hybrid storage system.
Traditional caching policies are known to perform poorly for storage
server caches. One promising approach to solving this problem is to use hints
from the storage clients to manage the storage server cache. Previous
hinting approaches are ad hoc, in that a predefined reaction to
specific types of hints is hard-coded into the caching policy. With
ad hoc approaches, it is difficult to ensure that the best hints are
being used, and it is difficult to accommodate multiple types of hints
and multiple client applications. In this thesis, we propose
CLient-Informed Caching (CLIC), a generic hint-based technique for
managing storage server caches. CLIC automatically interprets hints
generated by storage clients and translates them into a server caching
policy. It does this without explicit knowledge of the
application-specific hint semantics. We demonstrate using trace-based
simulation of database workloads that CLIC outperforms hint-oblivious
and state-of-the-art hint-aware caching policies.
We also demonstrate that the space required to track and interpret
hints is small.
SSDs are becoming a part of the storage system.
Adding SSD to a storage system not only raises the question of how to manage the SSD,
but also raises the question of whether current buffer pool algorithms will still work effectively.
We are interested in the use of hybrid storage systems, consisting of SSDs and hard disk drives (HDD),
for database management.
We present cost-aware replacement algorithms for both the DBMS buffer pool and the SSD.
These algorithms are aware of the different I/O performance
of HDD and SSD.
In such a hybrid storage system, the physical access pattern to the SSD depends on the management of the DBMS buffer pool.
We studied the impact of the buffer pool caching policies on the access patterns of the SSD.
Based on these studies, we designed a caching policy to effectively manage the SSD.
We implemented these algorithms in MySQL's InnoDB storage engine
and used the TPC-C workload to demonstrate that these cost-aware algorithms
outperform previous algorithms
Scaling Up Concurrent Analytical Workloads on Multi-Core Servers
Today, an ever-increasing number of researchers, businesses, and data scientists collect and analyze massive amounts of data in database systems. The database system needs to process the resulting highly concurrent analytical workloads by exploiting modern multi-socket multi-core processor systems with non-uniform memory access (NUMA) architectures and increasing memory sizes. Conventional execution engines, however, are not designed for many cores, and neither scale nor perform efficiently on modern multi-core NUMA architectures. Firstly, their query-centric approach, where each query is optimized and evaluated independently, can result in unnecessary contention for hardware resources due to redundant work found across queries in highly concurrent workloads. Secondly, they are unaware of the non-uniform memory access costs and the underlying hardware topology, incurring unnecessarily expensive memory accesses and bandwidth saturation. In this thesis, we show how these scalability and performance impediments can be solved by exploiting sharing among concurrent queries and incorporating NUMA-aware adaptive task scheduling and data placement strategies in the execution engine. Regarding sharing, we identify and categorize state-of-the-art techniques for sharing data and work across concurrent queries at run-time into two categories: reactive sharing, which shares intermediate results across common query sub-plans, and proactive sharing, which builds a global query plan with shared operators to evaluate queries. We integrate the original research prototypes that introduce reactive and proactive sharing, perform a sensitivity analysis, and show how and when each technique benefits performance. Our most significant finding is that reactive and proactive sharing can be combined to exploit the advantages of both sharing techniques for highly concurrent analytical workloads. Regarding NUMA-awareness, we identify, implement, and compare various combinations of task scheduling and data placement strategies under a diverse set of highly concurrent analytical workloads. We develop a prototype based on a commercial main-memory column-store database system. Our most significant finding is that there is no single strategy for task scheduling and data placement that is best for all workloads. In specific, inter-socket stealing of memory-intensive tasks can hurt overall performance, and unnecessary partitioning of data across sockets involves an overhead. For this reason, we implement algorithms that adapt task scheduling and data placement to the workload at run-time. Our experiments show that both sharing and NUMA-awareness can significantly improve the performance and scalability of highly concurrent analytical workloads on modern multi-core servers. Thus, we argue that sharing and NUMA-awareness are key factors for supporting faster processing of big data analytical applications, fully exploiting the hardware resources of modern multi-core servers, and for more responsive user experience
Proceedings of the NSSDC Conference on Mass Storage Systems and Technologies for Space and Earth Science Applications
The proceedings of the National Space Science Data Center Conference on Mass Storage Systems and Technologies for Space and Earth Science Applications held July 23 through 25, 1991 at the NASA/Goddard Space Flight Center are presented. The program includes a keynote address, invited technical papers, and selected technical presentations to provide a broad forum for the discussion of a number of important issues in the field of mass storage systems. Topics include magnetic disk and tape technologies, optical disk and tape, software storage and file management systems, and experiences with the use of a large, distributed storage system. The technical presentations describe integrated mass storage systems that are expected to be available commercially. Also included is a series of presentations from Federal Government organizations and research institutions covering their mass storage requirements for the 1990's
Dynamic Scale-out Mechanisms for Partitioned Shared-Nothing Databases
For a database system used in pay-per-use cloud environments, elastic scaling becomes an essential feature, allowing for minimizing costs while accommodating fluctuations of load. One approach to scalability involves horizontal database partitioning and dynamic migration of partitions between servers. We define a scale-out operation as a combination of provisioning a new server followed by migration of one or more partitions to the newly-allocated server.
In this thesis we study the efficiency of different implementations of the scale-out operation in the context of online transaction processing (OLTP) workloads. We designed and implemented three migration mechanisms featuring different strategies for data transfer. The first one is based on a modification of the Xen hypervisor, Snowflock, and uses on-demand block transfers for both server provisioning and partition migration. The second one is implemented in a database management system (DBMS) and uses bulk transfers for partition migration, optimized for higher bandwidth utilization. The third one is a conventional application, using SQL commands to copy partitions between servers.
We perform an experimental comparison of those scale-out mechanisms for disk-bound and CPU-bound configurations. When comparing the mechanisms we analyze their impact on whole-system performance and on the experience of individual clients
A technology reference model for client/server software development
In today's highly competitive global economy, information resources representing enterprise-wide information are essential to the survival of an organization. The development of and increase in the use of personal computers and data communication networks are supporting or, in many cases, replacing the traditional computer mainstay of corporations. The client/server model incorporates mainframe programming with desktop applications on
personal computers. The aim of the research is to compile a technology model for the development of client/server
software. A comprehensive overview of the individual components of the client/server system is given. The different methodologies, tools and techniques that can be used are reviewed, as well as client/server-specific design issues. The research is intended to create a road map in the form of a Technology Reference Model for Client/Server Software Development.ComputingM. Sc. (Information Systems
NSSDC Conference on Mass Storage Systems and Technologies for Space and Earth Science Applications, volume 1
Papers and viewgraphs from the conference are presented. This conference served as a broad forum for the discussion of a number of important issues in the field of mass storage systems. Topics include magnetic disk and tape technologies, optical disks and tape, software storage and file management systems, and experiences with the use of a large, distributed storage system. The technical presentations describe, among other things, integrated mass storage systems that are expected to be available commercially. Also included is a series of presentations from Federal Government organizations and research institutions covering their mass storage requirements for the 1990's
Sixth Goddard Conference on Mass Storage Systems and Technologies Held in Cooperation with the Fifteenth IEEE Symposium on Mass Storage Systems
This document contains copies of those technical papers received in time for publication prior to the Sixth Goddard Conference on Mass Storage Systems and Technologies which is being held in cooperation with the Fifteenth IEEE Symposium on Mass Storage Systems at the University of Maryland-University College Inn and Conference Center March 23-26, 1998. As one of an ongoing series, this Conference continues to provide a forum for discussion of issues relevant to the management of large volumes of data. The Conference encourages all interested organizations to discuss long term mass storage requirements and experiences in fielding solutions. Emphasis is on current and future practical solutions addressing issues in data management, storage systems and media, data acquisition, long term retention of data, and data distribution. This year's discussion topics include architecture, tape optimization, new technology, performance, standards, site reports, vendor solutions. Tutorials will be available on shared file systems, file system backups, data mining, and the dynamics of obsolescence
Recommended from our members
Hadoop performance modeling and job optimization for big data analytics
This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University LondonBig data has received a momentum from both academia and industry. The MapReduce model has emerged into a major computing model in support of big data analytics. Hadoop, which is an open source implementation of the MapReduce model, has been widely taken up by the community. Cloud service providers such as Amazon EC2 cloud have now supported Hadoop user applications. However, a key challenge is that the cloud service providers do not a have resource provisioning mechanism to satisfy user jobs with deadline requirements. Currently, it is solely the user responsibility to estimate the require amount of resources for their job running in a public cloud. This thesis presents a Hadoop performance model that accurately estimates the execution duration of a job and further provisions the required amount of resources for a job to be completed within a deadline. The proposed model employs Locally Weighted Linear Regression (LWLR) model to estimate execution time of a job and Lagrange Multiplier technique for resource provisioning to satisfy user job with a given deadline. The performance of the propose model is extensively evaluated in both in-house Hadoop cluster and Amazon EC2 Cloud. Experimental results show that the proposed model is highly accurate in job execution estimation and jobs are completed within the required deadlines following on the resource provisioning scheme of the proposed model. In addition, the Hadoop framework has over 190 configuration parameters and some of them have significant effects on the performance of a Hadoop job. Manually setting the optimum values for these parameters is a challenging task and also a time consuming process. This thesis presents optimization works that enhances the performance of Hadoop by automatically tuning its parameter values. It employs Gene Expression Programming (GEP) technique to build an objective function that represents the performance of a job and the correlation among the configuration parameters. For the purpose of optimization, Particle Swarm Optimization (PSO) is employed to find automatically an optimal or a near optimal configuration settings. The performance of the proposed work is intensively evaluated on a Hadoop cluster and the experimental results show that the proposed work enhances the performance of Hadoop significantly compared with the default settings.Abdul Wali Khan University Marda
- …