A core requirement of database engine testing is the ability to create synthetic versions of the customer’s data warehouse at the vendor site. Prior work on synthetic data regeneration suffers from critical limitations with regard to (a) scaling to large data volumes, (b) handling complex query workloads, and (c) producing data on demand. In this demo, we present HYDRA, a workload-dependent dynamic data regenerator that materially addresses these limitations. It introduces the concept of dynamic regeneration by constructing a minuscule memory-resident database summary that can on-the-fly regenerate databases of arbitrary size during query execution. Further, since the data is generated in memory, the velocity of generation can be closely regulated. Finally, to complement dynamic regeneration, Hydra also ensures that the process of summary construction is data-scale-free.

Haritsa, Jayant

Sanghi, Anupam

Singh, Dharmendra

Sood, Raghav

Tirthapura, Srikanta


This proceeding is published as Sanghi, Anupam, Raghav Sood, Dharmendra Singh, Jayant R. Haritsa, and Srikanta Tirthapura. "HYDRA: A Dynamic Big Data Regenerator." Proceedings of the VLDB Endowment 11, no. 12 (2018): 1974. doi: 10.14778/3229863.3236238. Posted with permission.

Digital Repository @ Iowa State University (ISU)

HYDRA: A Dynamic Big Data Regenerator

Anupam Sanghi♦ Raghav Sood♦ Dharmendra Singh♦ Jayant R. Haritsa♦ Srikanta Tirthapura♣
♦Indian Institute of Science, Bangalore, India ♣Iowa State University, Ames, USA
{anupamsanghi, dharmendra, haritsa}@iisc.ac.in raghavsood33@gmail.com snt@iastate.edu

ABSTRACT
A core requirement of database engine testing is the ability to create synthetic versions of the customer’s data warehouse at the vendor site. Prior work on synthetic data regeneration suffers from critical limitations with regard to (a) scaling to large data volumes, (b) handling complex query workloads, and (c) producing data on demand. In this demo, we present HYDRA, a workload-dependent dynamic data regenerator that materially addresses these limitations. It introduces the concept of dynamic regeneration by constructing a minuscule memory-resident database summary that can on-the-fly regenerate databases of arbitrary size during query execution. Further, since the data is generated in memory, the velocity of generation can be closely regulated. Finally, to complement dynamic regeneration, Hydra also ensures that the process of summary construction is data-scale-free.

PVLDB Reference Format:
Anupam Sanghi, Raghav Sood, Dharmendra Singh, Jayant R. Haritsa and Srikanta Tirthapura. HYDRA: A Dynamic Big Data Regenerator. PVLDB, 11 (12): 1974-1977, 2018.
DOI: https://doi.org/10.14778/3229863.3236238

1. INTRODUCTION
In industrial practice, relational database vendors often need to test their OLAP engines on the data warehouses present at the customer sites. This requirement arises for reasons like: (a) analyzing performance issues during query processing, (b) performing functional testing of embedded-SQL programs, and (c) proactively assessing the performance impacts of planned engine upgrades. Due to privacy concerns, however, transferring data from the client to the vendor may not be a viable option.
Moreover, even if the client is willing to share, transferring and storing the data at the vendor’s site may have impractical time and space overheads, especially in the impending Big Data era. Therefore, looking into the future, vendors need to be able to dynamically regenerate representative databases that mimic, for the intended purposes, the behavior of the client data processing environments.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org.
Proceedings of the VLDB Endowment, Vol. 11, No. 12
Copyright 2018 VLDB Endowment 2150-8097/18/8.
DOI: https://doi.org/10.14778/3229863.3236238

A rich body of literature exists on data regeneration, including both workload-independent (WI) techniques (e.g. [7, 8]) and workload-dependent (WD) techniques (e.g. [6, 9, 4]). The WI approaches, including those that are specifically targeted towards Big Data environments (e.g. [11, 13]), fail to retain satisfactory statistical fidelity. This shortcoming is addressed in the WD schemes – specifically, they generate synthetic data that exhibits volumetrically similar behavior to the original database on the customer query workload. That is, with common query execution plans at the client and vendor sites, the output row cardinalities of individual operators in these plans are almost identical.
However, these techniques suffer from substantive practical limitations, especially with regard to: (a) scaling to large data volumes, (b) handling complex query workloads, and (c) producing data on demand.

We have attempted to address the above limitations of WD techniques in designing a new database regenerator called Hydra [12]. This tool forms part of our ongoing CODD project [5], which incorporates a novel metaphor of “dataless databases”, whereby database environments with the desired characteristics are simulated without persistently generating and/or storing the contents. Hydra currently focuses on the volume and velocity facets [1] of Big Data, which are of primary interest in the context of enterprise relational warehouses. It introduces the concept of dynamic regeneration by constructing a minuscule memory-resident database summary that can on-the-fly regenerate arbitrary client databases during query execution. Since the data is generated in memory, the velocity of data generation can be closely regulated, as compared to disk-resident databases. To complement dynamic regeneration, Hydra also ensures that the process of summary construction is data-scale-free. Specifically, the summaries for complex Big Data client scenarios are constructed within just a few minutes.

On the implementation front, Hydra is completely written in Java, running to over 15K lines of code, and is currently operational on the PostgreSQL v9.3 engine [3]. It has an intuitive user interface that facilitates modeling of enterprise database environments, delivers feedback on the regenerated data, and tabulates performance reports on the regeneration quality. The entire tool, including the source, can be downloaded at [2], and has already been deployed in major telecom and software organizations.

Demo Highlights.
The highlights of the demo (detailed in Section 4) include the following components: (a) Client Interface, which captures the construction and transfer of the information synopsis created at the client site; (b) Vendor Interface, which presents the synthetic database summary and explicit verification of volumetric similarity; (c) Dynamic Regeneration, which demonstrates on-the-fly data generation during query execution on a dataless database; and (d) Scenario Construction, which helps the vendor to proactively simulate anticipated client environments.

2. HYDRA DESIGN
Hydra leverages the declarative approach to data regeneration proposed in the DataSynth tool [4]. To illustrate this approach, consider a relational database with the toy schema shown in Figure 1a (where pk and fk denote primary-key and foreign-key attributes, respectively). A sample query on this schema is listed in Figure 1b, and the corresponding execution plan in Figure 1c. A special aspect of this plan is that the output edge of each operator is annotated with the associated row cardinality, as evaluated during the client’s execution. It is therefore referred to as an Annotated Query Plan (AQP) [6]. The AQPs constructed over the entire query workload are then collectively formulated as a set of linear programs (LPs), one per schema relation. These LPs are then separately input to an SMT solver, and the solutions are used to construct the synthetic database. The objective of the regeneration process at the vendor site is to ensure that the synthetic data closely mimics the operator data volumes indicated in the AQP.
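To make the formulation step more concrete, the following Python sketch shows one simplified way to turn filter cardinalities into linear constraints over region variables. It is only an illustration under stated assumptions: it handles a single attribute (S.A) with interval filters, whereas the region-partitioning algorithm of [12] is multi-dimensional; the domain [0, 100), the second filter, and all cardinalities are hypothetical values, since the annotations of Figure 1c are not reproduced in the text; and the solver step is omitted, with a hand-picked assignment being checked instead of calling an SMT solver. All function names are illustrative and are not taken from Hydra.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open interval [lo, hi)


def partition(domain: Interval, filters: List[Interval]) -> List[Interval]:
    """Split the attribute domain at every filter boundary into disjoint regions."""
    cuts = {domain[0], domain[1]}
    for lo, hi in filters:
        cuts.update((max(lo, domain[0]), min(hi, domain[1])))
    points = sorted(cuts)
    return list(zip(points, points[1:]))


def formulate(regions: List[Interval],
              constraints: List[Tuple[Interval, int]],
              total: int) -> List[Tuple[List[int], int]]:
    """Build one linear equation per annotation: (0/1 coefficients, cardinality)."""
    rows = []
    for (lo, hi), card in constraints:
        # A region variable takes part in a constraint iff the region lies inside the filter.
        coeffs = [1 if lo <= r_lo and r_hi <= hi else 0 for r_lo, r_hi in regions]
        rows.append((coeffs, card))
    rows.append(([1] * len(regions), total))  # all regions together give the relation size
    return rows


def satisfies(rows: List[Tuple[List[int], int]], x: List[int]) -> bool:
    """Check that a non-negative tuple-count assignment meets every equation."""
    return all(v >= 0 for v in x) and all(
        sum(c * v for c, v in zip(coeffs, x)) == card for coeffs, card in rows
    )


# Filter 20 <= S.A < 60 is taken from Figure 1b; everything else is hypothetical.
domain = (0, 100)
constraints = [((20, 60), 400), ((40, 80), 300)]
regions = partition(domain, [f for f, _ in constraints])
rows = formulate(regions, constraints, total=1000)
candidate = [200, 250, 150, 150, 250]  # tuples per region, picked by hand
feasible = satisfies(rows, candidate)
```

Each region variable counts the tuples whose S.A value falls in that region, so the number of variables grows with the number of regions rather than with the data size. Whichever solver produces the assignment, any solution of these equations reproduces the operator cardinalities recorded in the AQPs.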
Matching these operator volumes helps to preserve the multi-dimensional layout and flow of the data, a prerequisite for achieving comparable performance on the client’s workload.

R (R_pk, S_fk, T_fk)    S (S_pk, A, B)    T (T_pk, C)
(a) Database Schema

    select * from R, S, T
    where R.S_fk = S.S_pk and R.T_fk = T.T_pk
    and S.A >= 20 and S.A < 60 and T.C >= 2 and T.C < 3
(b) Example Query

(c) Annotated Query Plan (AQP)

Figure 1: Example Database Scenario

Novelties. First, Hydra is able to handle significantly more complex query workloads than DataSynth, as detailed in [12]. This improvement is due to a novel region-partitioning algorithm [12] that results in an LP encoding whose complexity (in terms of the number of variables) is several orders of magnitude smaller in comparison to the grid-partitioning approach of [4]. In fact, our region-partitioning corresponds to an LP with the minimum number of variables [12].

Second, Hydra introduces the concept of dynamic regeneration by constructing a minuscule database summary that can on-the-fly regenerate databases of arbitrary size during query execution. This approach is imperative for Big Data systems, where working with materialized solutions entails impractical time and space overheads. Specifically, dynamic generation eliminates the need to store data on the disk and its subsequent load by the engine – instead, all data is created and delivered on demand. An orthogonal benefit is that the generation rate can be strictly controlled, thereby addressing the velocity aspect of Big Data.

Third, our database summary generation, thanks to its unique data-scale-free feature, is extremely efficient. As a case in point, the summary for a large workload of 131 distinct queries on the TPC-DS database was generated in less than 2 minutes on a vanilla computing platform, occupying only a few KB of space [12].

Finally, our summary generation method also delivers better fidelity than prior work with regard to volumetric similarity.
For instance, on the above-mentioned query workload, more than 90% of the volumetric constraints were satisfied with virtually no error, while the remaining were all satisfied with a relative error of less than 10% [12]. Further, since the magnitude of the volumetric discrepancy is constant for a given query workload, the relative errors become progressively smaller with increasing database size.

The above efficiency and accuracy in constructing the summary are an outcome of the deterministic alignment strategy of Hydra (details in [12]), as opposed to the sampling-based strategy of [4].

3. HYDRA ARCHITECTURE
We now present an overview of Hydra’s architecture, shown in Figure 2. In this figure, the green boxes represent the new components designed specifically for Hydra, whereas the yellow boxes are sourced from the prior literature.

Figure 2: Hydra Architecture

At the client site, Hydra fetches the schema, metadata and the query workload with its corresponding AQPs, and ships this entire information to the vendor. If required, privacy concerns can be addressed by passing the information through an appropriate anonymization layer at the client.

When the information is received at the vendor, it initially goes through a Preprocessor, sourced from [4]. This component facilitates the independent processing of each relation in the subsequent steps, a key requirement for model tractability and regeneration efficiency. The AQPs are then evaluated by the LP Formulator and an optimized LP is constructed for each relation, using our new region-partitioning approach. This collection of LPs is passed to the Z3 solver [10], which provides feasible per-relation solutions as the output. Leveraging these solutions, the Summary Generator constructs a summary using our deterministic alignment algorithm. Further, a post-processing step is executed to ensure that referential constraints are not violated across the solutions.
This step may incur minor additive errors in satisfying the volumetric constraints, but their impact is expected to be negligible at Big Data scale.

Figure 3: Client Site: Metadata, Queries and Annotated Query Plans

Subsequently, the Tuple Generator generates the requisite data on-demand, one row at a time, for each relation appearing in the query, using the database summary. As a proof of concept, we have implemented this functionality in the PostgreSQL v9.3 engine [3] by adding a new feature called datagen, which is included as a property for each relation in the database. On enabling this feature for a relation, the traditional scan operator is replaced with the equivalent dynamic regeneration operator.

Finally, the metadata transfer functionality of CODD [5] is used to ensure a common choice of plan at the client and vendor sites.

4. HYDRA DEMONSTRATION
In the demo, the audience will actively engage with a variety of visual scenarios that showcase the utility of the HYDRA tool.

4.1 Client Site
At the Client Site, the client supplies the query workload, and the corresponding AQPs are obtained by optimizing and executing these queries on the client platform. Currently, execution plans are parsed from the JSON format. The next screen in the client interface is shown in Figure 3. In this figure, the top half profiles the metadata statistics. Specifically, the user can choose a specific table column, and the system presents the distribution of the most frequent values and the bucket boundaries of the equi-depth histogram for this column.¹ In the bottom half, the user can pick a query from the input workload (the figure shows a canonical SPJ query on the TPC-DS schema), and the corresponding SQL text is displayed at the bottom left along with the associated AQP at the bottom right. The widths of the edges connecting the operators in the plan are scaled to visually indicate the volume of data flowing in each of these edges.
Finally, once the user selects the SUBMIT button, all this information is transferred to the Vendor Site.

¹ The metadata visualization is customized for PostgreSQL [3].

4.2 Vendor Site
After receiving the above-mentioned information package from the client, the Vendor Site initiates the data regeneration process. Here, the primary interface during the LP solving stage tabulates the LPs’ complexity in terms of their number of variables and run times. Subsequently, in the next screen, shown in Figure 4, the final database summary is displayed. The user can select an individual relation, and the system shows its summary in the top middle panel. The difference in the schema of a relation summary and that of the corresponding relation is that the pk column in the relation is replaced with a #TUPLES column – this column captures the number of tuples that share the vector of data values present in the remaining columns. For instance, the first row in the item relation summary in Figure 4 indicates that there are 917 rows with values <40, pop, Music, ...>. The pk columns are subsequently generated as auto-numbers. Note that this approach does not affect the referential constraints or the AQP constraints, as the foreign-keys have already been assigned compatible values.

Second, the top right panel shows the runtime configuration settings where the user can choose to either dynamically generate or optionally materialize the selected relation. Also, for dynamic generation, the desired velocity, measured in rows per second, can be set using the slider bar. The chosen relation’s row counts for the original and synthetic databases are shown below the bar. To assess the overall quality of the regenerated data, the bottom left graph plots the percentage of volumetric constraints that are satisfied within a given relative error. Finally, the user can also drill down to a query-specific AQP comparison by selecting a query from the drop down menu.
In this mode, the corresponding SQL text and AQP are shown in the bottom middle and right panels, respectively. The edges in the AQP are annotated with the original cardinality in green color, and the relative errors (typically minor) incurred as a result of the regeneration are shown in red color.

Figure 4: Vendor Site: Database Summary, Runtime Configuration Settings, Generation Quality and AQP Comparison

4.3 Dynamic Regeneration
In this segment of the demo, we explicitly demonstrate that the regenerated database has absolutely no data stored in the physical tables – i.e. the “dataless” approach. Instead, using our tuple generator, data is generated and supplied on-demand during query execution. As an example of the final outcome, a few sample rows for the initial columns of the ITEM table (highlighted in Figure 4) are enumerated in Table 1.

Table 1: Sample Tuples
item_sk  i_manager_id  i_class      i_category   ...
0        40            pop          Music        ...
917      91            dresses      Women        ...
938      0             accessories  Men          ...
963      1             reference    Electronics  ...

4.4 Scenario Construction
Finally, Hydra also enables the vendor to proactively simulate anticipated client environments, by constructing synthetic AQPs through injecting cardinality annotations into the original client AQPs. For such “what-if” scenarios, Hydra creates the regeneration summary after verifying the feasibility of the synthetic assignments. This feature is particularly useful for testing the ability of the vendor’s engine to robustly handle boundary condition scenarios and stressed Big Data environments. In the demo, we will model an extrapolated exabyte scenario to showcase this feature, focusing on the efficient summary creation and the on-demand data generation.

Acknowledgements. We thank Huawei Technologies India and TCS Innovation Labs for their valuable feedback and support in this project.

5. REFERENCES
[1] Big Data. en.wikipedia.org/wiki/Big_data
[2] Hydra Database Regenerator. dsl.cds.iisc.ac.in/projects/HYDRA
[3] PostgreSQL. postgresql.org/docs/9.3
[4] A. Arasu, R. Kaushik, and J. Li. Data Generation using Declarative Constraints. In Proc. of ACM SIGMOD Conf., 2011, pgs. 685-696.
[5] S. Ashoke and J. R. Haritsa. CODD: A Dataless Approach to Big Data Testing. PVLDB, 8(12):2008-2011, 2015.
[6] C. Binnig, D. Kossmann, E. Lo, and M. T. Özsu. QAGen: Generating Query-Aware Test Databases. In Proc. of ACM SIGMOD Conf., 2007, pgs. 341-352.
[7] N. Bruno and S. Chaudhuri. Flexible Database Generators. In Proc. of 31st VLDB Conf., 2005, pgs. 1097-1107.
[8] J. Gray, P. Sundaresan, S. Englert, K. Baclawski, and P. J. Weinberger. Quickly Generating Billion-Record Synthetic Databases. In Proc. of ACM SIGMOD Conf., 1994, pgs. 243-252.
[9] E. Lo, N. Cheng, W. W. Lin, W.-K. Hon, and B. Choi. MyBenchmark: generating databases for query workloads. The VLDB Journal, 23(6):895-913, 2014.
[10] L. De Moura and N. Bjørner. Z3: An efficient SMT solver. In Proc. of TACAS Conf., 2008, pgs. 337-340.
[11] T. Rabl, M. Danisch, M. Frank, S. Schindler and H. Jacobsen. Just can’t get enough - Synthesizing Big Data. In Proc. of ACM SIGMOD Conf., 2015, pgs. 1457-1462.
[12] A. Sanghi, R. Sood, J. R. Haritsa, and S. Tirthapura. Scalable and Dynamic Regeneration of Big Data Volumes. In Proc. of 21st EDBT Conf., 2018, pgs. 301-312.
[13] J. W. Zhang and Y. C. Tay. Dscaler: Synthetically Scaling A Given Relational Database. PVLDB, 9(14):1671-1682, 2016.
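As an illustration of the relation summaries described in Sections 4.2 and 4.3, the following Python sketch expands summary rows, each carrying a #TUPLES count, into tuples whose primary keys are assigned as consecutive auto-numbers. It is a minimal sketch under stated assumptions and not Hydra's tuple generator (which is written in Java and embedded in PostgreSQL): the count of 917 for <40, pop, Music> comes from the text, while the counts 21 and 25 are inferred by assuming that Table 1 lists the first tuple of each summary row. The names `ITEM_SUMMARY` and `generate_tuples`, and the crude `rows_per_second` pacing, are hypothetical.

```python
import time
from itertools import islice
from typing import Iterator, Optional, Sequence, Tuple

# A summary row pairs a #TUPLES count with the values of the non-key columns.
SummaryRow = Tuple[int, tuple]

ITEM_SUMMARY: Sequence[SummaryRow] = [
    (917, (40, "pop", "Music")),      # count stated in Section 4.2
    (21, (91, "dresses", "Women")),   # inferred from Table 1: 938 - 917 (assumption)
    (25, (0, "accessories", "Men")),  # inferred from Table 1: 963 - 938 (assumption)
]


def generate_tuples(summary: Sequence[SummaryRow],
                    start_pk: int = 0,
                    rows_per_second: Optional[float] = None) -> Iterator[tuple]:
    """Lazily yield (pk, *values) tuples; nothing is materialized up front."""
    pk = start_pk
    for count, values in summary:
        for _ in range(count):
            if rows_per_second:
                # Crude pacing to mimic a velocity setting (illustrative only).
                time.sleep(1.0 / rows_per_second)
            yield (pk, *values)  # the primary key is just the running auto-number
            pk += 1


# Pull only the first two tuples on demand, then expand the whole summary.
first_rows = list(islice(generate_tuples(ITEM_SUMMARY), 2))
all_rows = list(generate_tuples(ITEM_SUMMARY))
```

Under these assumptions the three summary rows expand to 963 tuples, so a fourth summary row would begin at item_sk 963, which is consistent with the last row of Table 1.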


https://dr.lib.iastate.edu/handle/20.500.12876/20887
