
    Database server workload characterization in an e-commerce environment

    A typical E-commerce system that is deployed on the Internet has multiple layers that include Web users, Web servers, application servers, and a database server. As the system use and user request frequency increase, Web/application servers can be scaled up by replication. A load balancing proxy can be used to route user requests to individual machines that perform the same functionality. To address the increasing workload while avoiding replicating the database server, various dynamic caching policies have been proposed to reduce the database workload in E-commerce systems. However, the nature of the changes seen by the database server as a result of dynamic caching remains unknown. A good understanding of this change is fundamental for tuning a database server to get better performance. In this study, the TPC-W (a transactional Web E-commerce benchmark) workloads on a database server are characterized under two different dynamic caching mechanisms, which are generalized and implemented as query-result cache and table cache. The characterization focuses on response time, CPU computation, buffer pool references, disk I/O references, and workload classification. This thesis combines a variety of analysis techniques: simulation, real-time measurement, and data mining. The experimental results in this thesis reveal some interesting effects that dynamic caching has on the database server workload characteristics. The main observations include: (a) dynamic cache can considerably reduce the CPU usage of the database server and the number of database page references when it is heavily loaded; (b) dynamic cache can also reduce the database reference locality, but to a smaller degree than that reported in file servers. The data classification results in this thesis show that with dynamic cache, the database server sees TPC-W profiles more like on-line transaction processing workloads.

    Auditing database systems through forensic analysis

    The majority of sensitive and personal data is stored in a number of different Database Management Systems (DBMS). For example, Oracle is frequently used to store corporate data, MySQL serves as the back-end storage for many webstores, and SQLite stores personal data such as SMS messages or browser bookmarks. Consequently, the pervasive use of DBMSes has led to an increase in the rate at which they are exploited in cybercrimes. After a cybercrime occurs, investigators need forensic tools and methods to recreate a timeline of events and determine the extent of the security breach. When a breach involves a compromised system, these tools must make few assumptions about the system (e.g., corrupt storage, poorly configured logging, data tampering). Since DBMSes manage storage independently of the operating system, they require their own set of forensic tools. This dissertation presents 1) our database-agnostic forensic methods to examine DBMS contents from any evidence source (e.g., disk images or RAM snapshots) without using a live system and 2) applications of our forensic analysis methods to secure data. The foundation of this analysis is page carving, our novel database forensic method that we implemented as the tool DBCarver. We demonstrate that DBCarver is capable of reconstructing DBMS contents, including metadata and deleted data, from various types of digital evidence. Since DBMS storage is managed independently of the operating system, DBCarver can be used for new methods to securely delete data (i.e., data sanitization). In the event of suspected log tampering or direct modification to DBMS storage, DBCarver can be used to verify log integrity and discover storage inconsistencies.

    Store-Ordered Streaming of Shared Memory

    Coherence misses in shared-memory multiprocessors account for a substantial fraction of execution time in many important scientific and commercial workloads. Memory streaming provides a promising solution to the coherence miss bottleneck because it improves memory level parallelism and lookahead while using on-chip resources efficiently. We observe that the order in which shared data are consumed by one processor is correlated to the order in which they were produced by another. We investigate this phenomenon and demonstrate that it can be exploited to send Store-ORDered Streams (SORDS) of shared data from producers to consumers, thereby eliminating coherent read misses. Using a trace-driven analysis of all user and OS memory references in a cache-coherent distributed shared-memory multiprocessor, we show that SORDS-based memory streaming can eliminate between 36% and 100% of all coherent read misses in scientific workloads and between 23% and 48% in online transaction processing workloads.

    Second-tier Cache Management to Support DBMS Workloads

    Enterprise Database Management Systems (DBMS) often run on computers with dedicated storage systems. Their data access requests need to go through two tiers of cache, i.e., a database bufferpool and a storage server cache, before reaching the storage media, e.g., disk platters. A tremendous amount of work has been done to improve the performance of the first-tier cache, i.e., the database bufferpool. However, the amount of work focusing on second-tier cache management to support DBMS workloads is comparatively small. In this thesis we propose several novel techniques for managing second-tier caches to boost DBMS performance in terms of query throughput and query response time. The main purpose of second-tier cache management is to reduce the I/O latency endured by database query executions. This goal can be achieved by minimizing the number of reads and writes issued from second-tier caches to storage devices. The first part of our research focuses on reducing the number of read I/Os issued by second-tier caches. We observe that DBMSs issue I/O requests for various reasons. The rationales behind these I/O requests provide useful information to second-tier caches because they can be used to estimate the temporal locality of the data blocks being requested. A second-tier cache can exploit this information when making replacement decisions. In this thesis we propose a technique to pass this information from DBMSs to second-tier caches and to use it in guiding cache replacements. The second part of this thesis focuses on reducing the number of writes issued by second-tier caches. Our work is twofold. First, we observe that although there are second-tier caches within computer systems, today's DBMSs cannot take full advantage of them. For example, most commercial DBMSs use forced writes to propagate bufferpool updates to permanent storage for data durability reasons. We notice that enforcing such a practice is more conservative than necessary.
Some of the writes can be issued as unforced requests and can be cached in the second-tier cache without immediate synchronization. This will give the second-tier cache opportunities to cache and consolidate multiple writes into one request. Unfortunately, the current POSIX-compliant file system interfaces provided by mainstream operating systems (e.g., Unix and Windows) are not flexible enough to support such dynamic synchronization. We propose to extend such interfaces to let DBMSs take advantage of using unforced writes whenever possible. Additionally, we observe that the existing cache replacement algorithms are designed solely to maximize read cache hits (i.e., to minimize read I/Os). The purpose is to minimize the read latency, which is on the critical path of query executions. We argue that minimizing read requests is not the only objective of cache replacement. When I/O bandwidth becomes a bottleneck, the objective should be to minimize the total number of I/Os, including both reads and writes, to achieve the best performance. We propose to associate a new type of replacement cost, i.e., the total number of I/Os caused by the replacement, with each cache page; and we also present a partial characterization of an optimal algorithm which minimizes the total number of I/Os generated by caches. Based on this knowledge, we extend several existing replacement algorithms, which are write-oblivious (focus only on reducing reads), to be write-aware and observe promising performance gains in the evaluations.
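
The difference between write-oblivious and write-aware replacement can be illustrated with a small simulation. The sketch below is not the thesis's algorithm: it assumes a hypothetical "clean-first" heuristic in which an LRU cache may evict a clean page from among its `window` least recently used pages before falling back to a dirty one. The class name, the `window` parameter, the assumption that whole-block writes need no prior read, and the trace are all invented for illustration.

```python
from collections import OrderedDict


class WriteAwareLRU:
    """LRU cache that counts the read and write I/Os it sends to storage.

    window=1 behaves as plain, write-oblivious LRU. A larger window lets
    the cache prefer a clean victim among the `window` least recently
    used pages (a hypothetical clean-first heuristic).
    """

    def __init__(self, capacity, window=1):
        self.capacity = capacity
        self.window = window
        self.pages = OrderedDict()  # page id -> dirty flag, oldest first
        self.reads = 0   # read I/Os (misses on read requests)
        self.writes = 0  # write I/Os (dirty evictions and flushes)

    def _evict(self):
        candidates = list(self.pages.items())[: self.window]
        # Pick the oldest clean candidate; fall back to the oldest page.
        victim, dirty = next(
            ((p, d) for p, d in candidates if not d), candidates[0]
        )
        if dirty:
            self.writes += 1  # a dirty page must be written back
        del self.pages[victim]

    def access(self, page, is_write=False):
        if page in self.pages:
            dirty = self.pages.pop(page) or is_write
        else:
            if not is_write:
                self.reads += 1  # read miss: fetch the block from storage
            # Assumption: a write replaces the whole block, so no fetch.
            if len(self.pages) >= self.capacity:
                self._evict()
            dirty = is_write
        self.pages[page] = dirty  # (re)insert as most recently used

    def flush(self):
        """Write back every dirty page still held in the cache."""
        for page in list(self.pages):
            if self.pages[page]:
                self.writes += 1
                self.pages[page] = False


# Invented trace of (page, is_write) requests.
TRACE = [("A", True), ("B", False), ("C", False),
         ("A", True), ("D", False), ("E", False)]


def run_trace(trace, window, capacity=2):
    """Replay the trace and return total I/Os, including the final flush."""
    cache = WriteAwareLRU(capacity, window)
    for page, is_write in trace:
        cache.access(page, is_write)
    cache.flush()
    return cache.reads + cache.writes
```

Comparing `run_trace(TRACE, window=1)` with `run_trace(TRACE, window=2)` shows how keeping a frequently rewritten dirty page in the cache can absorb repeated writes and lower the total I/O count on this particular trace; other traces may behave differently.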

    Query Interactions in Database Systems

    The typical workload in a database system consists of a mix of multiple queries of different types, running concurrently and interacting with each other. The same query may have different performance in different mixes. Hence, optimizing performance requires reasoning about query mixes and their interactions, rather than considering individual queries or query types. In this dissertation, we demonstrate how queries affect each other when they are executing concurrently in different mixes. We show the significant impact that query interactions can have on the end-to-end workload performance. A major hurdle in the understanding of query interactions in database systems is that there is a large spectrum of possible causes of interactions. For example, query interactions can happen because of any of the resource-related, data-related or configuration-related dependencies that exist in the system. This variation in underlying causes makes it very difficult to come up with robust analytical performance models to capture and model query interactions. We present a new approach for modeling performance in the presence of interactions, based on conducting experiments to measure the effect of query interactions and fitting statistical models to the data collected in these experiments to capture the impact of query interactions. The experiments collect samples of the different possible query mixes, and measure the performance metrics of interest for the different queries in these sample mixes. Statistical models such as simple regression and instance-based learning techniques are used to train models from these sample mixes. This approach requires no prior assumptions about the internal workings of the database system or the nature or cause of the interactions, making it portable across systems. 
We demonstrate the potential of capturing, modeling, and exploiting query interactions by developing techniques to help in two database performance related tasks: workload scheduling and estimating the completion time of a workload. These are important workload management problems that database administrators have to deal with routinely. We consider the problem of scheduling a workload of report-generation queries. Our scheduling algorithms employ statistical performance models to schedule appropriate query mixes for the given workload. Our experimental evaluation demonstrates that our interaction-aware scheduling algorithms outperform scheduling policies that are typically used in database systems. Estimating the completion time of a workload is an important problem, and the state of the art does not offer any systematic solution. Typically database administrators rely on heuristics or observations of past behavior to solve this problem. We propose a more rigorous solution to this problem, based on a workload simulator that employs performance models to simulate the execution of the different mixes that make up a workload. This mix-based simulator provides a systematic tool that can help database administrators in estimating workload completion time. Our experimental evaluation shows that our approach can estimate the workload completion times with a high degree of accuracy. Overall, this dissertation demonstrates that reasoning about query interactions holds significant potential for realizing performance improvements in database systems. The techniques developed in this work can be viewed as initial steps in this interesting area of research, with lots of potential for future work.
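
The idea of learning performance from sampled query mixes can be sketched with a tiny instance-based model. This is only an illustration under stated assumptions, not the dissertation's model: a mix is represented as a tuple of per-query-type counts, the prediction is the mean runtime of the `k` nearest sampled mixes, and the sample values in `SAMPLES` are invented.

```python
import math

# Invented samples: (mix, observed runtime in seconds of the query of interest).
# A mix is (number of type-1 queries, number of type-2 queries) running together.
SAMPLES = [
    ((1, 0), 10.0),
    ((2, 0), 14.0),
    ((1, 1), 25.0),
    ((0, 2), 40.0),
]


def predict_runtime(samples, mix, k=3):
    """Predict runtime in `mix` as the mean over the k nearest sampled mixes.

    Distance is Euclidean distance between mix vectors, which is one
    simple choice among many.
    """
    nearest = sorted(samples, key=lambda s: math.dist(s[0], mix))[:k]
    return sum(runtime for _, runtime in nearest) / len(nearest)
```

Calling `predict_runtime(SAMPLES, (2, 1), k=2)` averages the two closest sampled mixes. Because the model only needs observed mixes and runtimes, it makes no assumption about why the queries interact, which is the portability argument the abstract makes.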

    Workload-Aware Database Monitoring and Consolidation

    In most enterprises, databases are deployed on dedicated database servers. Often, these servers are underutilized much of the time. For example, in traces from almost 200 production servers from different organizations, we see an average CPU utilization of less than 4%. This unused capacity can be potentially harnessed to consolidate multiple databases on fewer machines, reducing hardware and operational costs. Virtual machine (VM) technology is one popular way to approach this problem. However, as we demonstrate in this paper, VMs fail to adequately support database consolidation, because databases place a unique and challenging set of demands on hardware resources, which are not well-suited to the assumptions made by VM-based consolidation. Instead, our system for database consolidation, named Kairos, uses novel techniques to measure the hardware requirements of database workloads, as well as models to predict the combined resource utilization of those workloads. We formalize the consolidation problem as a non-linear optimization program, aiming to minimize the number of servers and balance load, while achieving near-zero performance degradation. We compare Kairos against virtual machines, showing up to a factor of 12× higher throughput on a TPC-C-like benchmark. We also tested the effectiveness of our approach on real-world data collected from production servers at Wikia.com, Wikipedia, Second Life, and MIT CSAIL, showing absolute consolidation ratios ranging between 5.5:1 and 17:1.
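
To give a feel for the consolidation problem, the sketch below packs databases onto servers with a greedy first-fit-decreasing heuristic. This is a deliberately simplified stand-in, not the non-linear optimization used by Kairos: it assumes a single resource (CPU, as an integer percentage), assumes loads combine additively (the abstract indicates combined utilization needs a model), and uses invented database names and loads.

```python
def consolidate(loads, capacity=100):
    """Greedy first-fit-decreasing packing of databases onto servers.

    loads: dict mapping database name -> CPU load in percent (integer).
    Returns a list of servers, each a list of (name, load) pairs.
    """
    servers = []
    # Place the heaviest databases first; they are the hardest to fit.
    for name, load in sorted(loads.items(), key=lambda kv: kv[1], reverse=True):
        for server in servers:
            if sum(l for _, l in server) + load <= capacity:
                server.append((name, load))
                break
        else:
            # No existing server has room, so open a new one.
            servers.append([(name, load)])
    return servers


# Invented per-database CPU loads in percent.
LOADS = {"a": 40, "b": 35, "c": 30, "d": 25, "e": 20, "f": 4}
PLACEMENT = consolidate(LOADS)
```

A heuristic like this only minimizes server count; balancing load and bounding performance degradation, as the abstract describes, would require additional objectives and constraints.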

    Factors That Influence Throughput on Cloud-Hosted MySQL Server

    Many businesses are moving their infrastructure to the cloud and may not fully understand the factors that can increase costs. With so many factors available to improve throughput in a database, it can be difficult for a database administrator to know which factors can provide the best efficiency to maintain lower costs. Grounded in the Six Sigma theoretical framework, the purpose of this quantitative, quasi-experimental study was to evaluate the relationship between the time of day, the number of concurrent users, InnoDB buffer pool size, InnoDB Input/Output capacity, and MySQL transaction throughput to a MySQL database running on a cloud, virtual, database server. Data were collected from Debian Linux virtual machines (VMs) on Amazon Web Services, Google Cloud Platform, and Microsoft Azure using HammerDB database benchmarking software. The results of the one-way ANOVA were not significant. A key recommendation is to study other factors further and to investigate each cloud provider's performance in more depth. The implications for positive social change include the potential for database administrators to make informed decisions on how to configure MySQL to run in a VM and choose the best cloud provider so that nonprofits may serve their clients more efficiently.