The Evolution of Cloud Data Architectures: Storage, Compute, and Migration
Data architectures have shifted in recent years from on-premises deployments to the cloud. However, new challenges emerge as data volumes continue to grow exponentially. My Ph.D. research therefore focuses on addressing the following challenges.
First, cloud data warehouses such as Snowflake, BigQuery, and Redshift often rely on storage systems such as distributed file systems or object stores to hold massive amounts of data. Growing data volumes bring with them an increase in the number of stored objects and in the amount of metadata such systems must manage. By treating metadata management similarly to data management, we built FileScale, an HDFS-based file system that replaces metadata management in HDFS with a three-tiered distributed architecture: a high-throughput, distributed main-memory database system at the lowest layer, with distributed caching and routing functionality above it. FileScale performs comparably to the single-machine architecture at small scale, while scaling linearly as file system metadata grows.
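The tiered design can be illustrated with a minimal sketch (all class and method names here are hypothetical, not FileScale's actual interfaces): a hash-partitioned in-memory store stands in for the distributed main-memory database at the bottom tier, with a caching and routing layer in front of it.

```python
class MetadataStore:
    """Bottom tier: a partitioned in-memory map standing in for a
    distributed main-memory database system (illustrative only)."""
    def __init__(self, num_partitions=4):
        self.partitions = [{} for _ in range(num_partitions)]

    def partition_for(self, path):
        # Route each file path to a partition via hash partitioning.
        return hash(path) % len(self.partitions)

    def get(self, path):
        return self.partitions[self.partition_for(path)].get(path)

    def put(self, path, inode):
        self.partitions[self.partition_for(path)][path] = inode


class CachingRouter:
    """Upper tiers: cache hot metadata in front of the partitioned store,
    so repeated lookups avoid a round trip to the bottom tier."""
    def __init__(self, store):
        self.store = store
        self.cache = {}

    def lookup(self, path):
        if path in self.cache:           # cache hit: served locally
            return self.cache[path]
        inode = self.store.get(path)     # cache miss: route to a partition
        if inode is not None:
            self.cache[path] = inode
        return inode

    def create(self, path, inode):
        self.store.put(path, inode)
        self.cache[path] = inode


store = MetadataStore()
router = CachingRouter(store)
router.create("/data/part-0001", {"size": 4096, "replicas": 3})
```

Because the namespace is partitioned at the bottom tier, adding partitions spreads metadata across machines, which is the property behind the linear scalability claim above.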
Second, Function as a Service, or FaaS, is a new type of cloud computing service that executes code in response to events without the complex infrastructure typically associated with building and launching microservice applications. FaaS offers cloud functions, billed at millisecond granularity, that can be scaled automatically, independently, and instantaneously as needed. We built Flock, the first practical cloud-native SQL query engine that supports event stream processing on FaaS across heterogeneous hardware (x86 and Arm), with the ability to shuffle and aggregate data without requiring a centralized coordinator or remote storage such as Amazon S3. This architecture is more cost-effective than traditional systems, especially for dynamic workloads and continuous queries.
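The coordinator-free shuffle-and-aggregate idea can be sketched roughly as follows (illustrative Python, not Flock's actual API): each "function" hashes records by group key directly to a peer function, which aggregates its share, so no central coordinator or intermediate object store sits in the data path.

```python
def shuffle(records, num_workers):
    """Map side: each (key, value) record is routed to a peer worker by
    key hash, so every key lands in exactly one worker's inbox."""
    inboxes = [[] for _ in range(num_workers)]
    for key, value in records:
        inboxes[hash(key) % num_workers].append((key, value))
    return inboxes

def aggregate(inbox):
    """Reduce side: a worker sums the values for the keys it owns."""
    totals = {}
    for key, value in inbox:
        totals[key] = totals.get(key, 0) + value
    return totals

# Two hypothetical function instances process a small event stream.
events = [("sensor-a", 2), ("sensor-b", 5), ("sensor-a", 3)]
partials = [aggregate(inbox) for inbox in shuffle(events, num_workers=2)]
# Merging is trivial because each key was owned by a single worker.
merged = {k: v for part in partials for k, v in part.items()}
```

In a real FaaS deployment the `inboxes` hand-off would be a function-to-function invocation rather than an in-process list, but the partitioning logic is the same.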
Third, Software as a Service, or SaaS, is a method of delivering software products to end users over the internet under pay-as-you-go pricing, in which the software is centrally hosted and managed by the cloud service provider. Continuous Deployment (CD) in SaaS, an aspect of DevOps, is the increasingly popular practice of frequent, automated deployment of software changes. To realize the benefits of CD, it must be straightforward to deploy updates to both front-end code and the database, even when the database’s schema has changed. Unfortunately, this is where current practices run into difficulty. We therefore built BullFrog, a PostgreSQL extension that is the first system to use lazy schema migration to support single-step, online schema evolution without downtime, and that achieves efficient, exactly-once physical migration of data under contention.
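The lazy-migration idea can be sketched with a small model (illustrative Python, not BullFrog's PostgreSQL implementation): the logical schema switches immediately, while each row is physically migrated at most once, the first time a query touches it.

```python
class LazyMigratingTable:
    """Hypothetical model of lazy schema migration: rows stay in the old
    schema until first access, then are transformed exactly once."""
    def __init__(self, rows, transform):
        self.rows = rows            # rows still in the old schema
        self.migrated = {}          # row id -> row in the new schema
        self.transform = transform  # old-schema row -> new-schema row

    def read(self, row_id):
        # Migrate on first access; later reads reuse the migrated copy,
        # giving exactly-once physical migration per row.
        if row_id not in self.migrated:
            self.migrated[row_id] = self.transform(self.rows[row_id])
        return self.migrated[row_id]

# Example migration: split a combined "name" column into two columns.
old = {1: {"name": "Ada Lovelace"}, 2: {"name": "Alan Turing"}}
split = lambda r: dict(zip(("first", "last"), r["name"].split(" ", 1)))
table = LazyMigratingTable(old, split)
```

The real system must additionally make the first-access migration safe under concurrent transactions; this sketch only shows the on-demand, migrate-once structure.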
Towards Automated Online Schema Evolution
Schema evolution studies the issue of moving a database from one version of its schema to a new, updated schema. Traditionally, database administrators perform these tasks offline, and they involve large amounts of manual labor and custom scripting. In today’s world, where databases power many 24/7 online services, schema evolution can no longer be an offline process. Furthermore, application requirements change much more rapidly today, causing more frequent changes in database schemas. Because of these trends, it is critical for database administrators to have automated tools to evolve database schemas in an online fashion that does not disrupt foreground services. This thesis attempts to explore ways an administrator might automate this process and provides some insight into building tools that make this process easier, faster, and more reliable. The thesis makes the following contributions. First, it provides a complete system implementation, Ratchet, that a database administrator can use to perform efficient schema evolution on supported platforms (PostgreSQL). The system uses various techniques, such as improved fine-grained locking and a delayed-copy strategy, to improve its schema evolution performance. Second, it analyzes the characteristics of schema evolution over a five-year period for Wikimedia, one of the most widely used websites. Third, using Ratchet, the thesis recreates five years of schema evolution automatically. Finally, the thesis provides a mechanism for rollback in schema evolution.
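The rollback mechanism can be sketched under one simplifying assumption (all names here are hypothetical, not Ratchet's API): if the new-schema rows are built as a side copy rather than by rewriting the table in place, rolling back amounts to simply discarding that copy.

```python
class SchemaVersionedTable:
    """Hypothetical model of side-copy schema evolution with rollback:
    old rows are never modified until the change commits."""
    def __init__(self, rows):
        self.rows = rows
        self.version = 1
        self.pending = None   # (new_version, side_copy) while evolving

    def begin_evolution(self, transform):
        # Build the new-schema rows off to the side instead of
        # rewriting the table in place.
        side_copy = {rid: transform(r) for rid, r in self.rows.items()}
        self.pending = (self.version + 1, side_copy)

    def commit(self):
        self.version, self.rows = self.pending
        self.pending = None

    def rollback(self):
        # The old rows were never touched, so rollback is just
        # dropping the side copy.
        self.pending = None

# Example: start converting a text column to integers, then roll back.
table = SchemaVersionedTable({1: {"views": "42"}})
table.begin_evolution(lambda r: {"views": int(r["views"])})
table.rollback()
```

After the rollback, the table is byte-for-byte in its original state, which is what makes this style of rollback cheap compared with undoing an in-place rewrite.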