
    Research Statement

    My research is motivated by the challenge of managing massive data. As diverse scientific fields such as astronomy and computational mechanics accumulate observational or experimental data, large-scale data management is expected to be the “limiting or the enabling factor for a wide range of sciences” [3]. The same holds for modern enterprises, which rely on massive data warehouses for business intelligence. My broad research interests span all aspects of large-scale data management performance, including query execution and optimization, indexing, storage system design, and support for novel applications. I focus on two key research areas with a critical impact on large-scale data management. The first is self-tuning databases, and in particular algorithms for automated database design. Automated design is crucial for performance and is a challenging task given today’s complex systems and workloads. The second is efficient indexing, query execution, and disk layouts for datasets such as computational meshes, graphs, and arrays, which are common in scientific and enterprise computing but not handled effectively by existing databases.

    Autopart: Automating schema design for large scientific databases using data partitioning

    Database applications that use multi-terabyte datasets are becoming increasingly important for scientific fields such as astronomy and biology. Scientific databases are particularly suited to automated physical design techniques because of their data volume and the complexity of their workloads. Current automated physical design tools focus on the selection of indexes and materialized views. In large-scale scientific databases, however, the data volume and the continuous insertion of new data allow for only a limited set of indexes and materialized views. By contrast, data partitioning does not replicate data, thereby reducing space requirements and minimizing update overhead. In this paper we propose AutoPart, an algorithm that automatically partitions database tables to optimize sequential access, assuming prior knowledge of a representative workload. The resulting schema is indexed using a fraction of the space required for indexing the original schema. To evaluate AutoPart, we build an automated schema design tool that interfaces with commercial database systems. We experiment with AutoPart in the context of the Sloan Digital Sky Survey database, a real-world astronomical database, running on SQL Server 2000. Our experiments corroborate the benefits of partitioning for large-scale systems: partitioning alone improves query execution performance by a factor of two on average. Combined with indexes, the new schema also outperforms the indexed original schema by 20% (for queries) and a factor of five (for updates), while using only half the original index space.
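    The abstract describes workload-driven vertical partitioning: split each table into column groups so that queries scan only the data they reference. The sketch below is a rough illustration of that idea only, not the AutoPart algorithm itself; the greedy merge loop, the scan-cost model, and the partition_table() helper with its fragment_overhead parameter are assumptions made for the example.

        # Illustrative sketch (not AutoPart itself): greedily merge single-column
        # fragments while a simple workload-driven scan-cost model improves.
        def partition_table(columns, workload, fragment_overhead=10):
            """columns: list of column names; workload: list of
            (referenced_columns, frequency) pairs; fragment_overhead: assumed
            fixed cost of touching one additional fragment."""
            fragments = [frozenset([c]) for c in columns]  # start fully partitioned

            def scan_cost(frags):
                # Each query pays, for every fragment holding a column it uses,
                # a fixed per-fragment overhead plus the width of that fragment.
                cost = 0
                for cols, freq in workload:
                    needed = set(cols)
                    for f in frags:
                        if f & needed:
                            cost += freq * (fragment_overhead + len(f))
                return cost

            improved = True
            while improved:
                improved = False
                base = scan_cost(fragments)
                best = None
                for i in range(len(fragments)):
                    for j in range(i + 1, len(fragments)):
                        merged = (fragments[:i] + fragments[i + 1:j] +
                                  fragments[j + 1:] + [fragments[i] | fragments[j]])
                        c = scan_cost(merged)
                        if c < base and (best is None or c < best[0]):
                            best = (c, merged)
                if best is not None:
                    fragments = best[1]
                    improved = True
            return fragments

    For example, with columns ['objid', 'ra', 'dec', 'flux'] and a workload in which 'ra' and 'dec' are always read together, merging them into one fragment saves the per-fragment overhead on every such query without forcing extra columns to be scanned.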

    Efficient Use of the Query Optimizer for Automated Database Design

    State-of-the-art database design tools rely on the query optimizer to compare physical design alternatives. Although it provides an appropriate cost model for physical design, query optimization is a computationally expensive process. The significant time consumed by optimizer invocations poses serious performance limitations for physical design tools, causing long running times, especially for large problem instances. So far it has been impossible to remove the query optimization overhead without sacrificing cost estimation precision. Inaccuracies in query cost estimation are detrimental to the quality of physical design algorithms, as they increase the chances of “missing” good designs and consequently selecting sub-optimal ones. Such precision loss, and the resulting reduction in solution quality, is particularly undesirable: accurate cost estimation is the reason the query optimizer is used in the first place. In this paper we eliminate the tradeoff between query cost estimation accuracy and performance. We introduce the INdex Usage Model (INUM), a cost estimation technique that returns the same values that would have been returned by the optimizer, while being three orders of magnitude faster. Integrating INUM with existing index selection algorithms dramatically improves their running times without compromising precision.
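    One way to read the INUM idea: for a given query, the optimizer's internal plan cost depends only on which sort orders the candidate indexes can deliver, so one optimizer call per combination of such orders can be cached and reused across all index configurations, with only cheap analytic index access costs added per configuration. The class below is a minimal sketch under that reading; InumCache, invoke_optimizer(), index_access_cost(), and the query/index attributes it touches are hypothetical placeholders, not the paper's actual interfaces.

        # Hypothetical sketch of memoized cost estimation in the spirit of INUM.
        class InumCache:
            def __init__(self, invoke_optimizer, index_access_cost):
                self.invoke_optimizer = invoke_optimizer    # expensive: real optimizer call
                self.index_access_cost = index_access_cost  # cheap: analytic index-scan cost
                self.cache = {}  # (query id, interesting orders) -> internal plan cost

            def interesting_orders(self, query, configuration):
                # The sort orders the candidate indexes could deliver for this query,
                # simplified here to the indexed columns the query references.
                return tuple(sorted(ix.column for ix in configuration
                                    if ix.column in query.columns))

            def estimate(self, query, configuration):
                orders = self.interesting_orders(query, configuration)
                key = (query.id, orders)
                if key not in self.cache:
                    # Pay for one optimizer invocation per (query, order combination)...
                    plan = self.invoke_optimizer(query, orders)
                    self.cache[key] = plan.internal_cost
                # ...and reuse it for every configuration with the same orders,
                # adding only the analytic access cost of each candidate index.
                return self.cache[key] + sum(self.index_access_cost(query, ix)
                                             for ix in configuration)

    In an index selection loop that evaluates thousands of candidate configurations, most estimate() calls would then hit the cache, which is where the speedup over repeated optimizer invocations would come from.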