TabbyXL: Experiment Data

Abstract

The data are designed to evaluate TabbyXL, a system for rule-based transformation of spreadsheet data from arbitrary to relational tables that is freely available on GitHub (https://github.com/cellsrg/tabbyxl). Our data are based on the existing table dataset Troy_200 [1]. It contains 200 arbitrary tables as CSV files collected from 10 different government statistical websites for the experiment on data extraction from tables presented in the paper [2]. We use an earlier version of this dataset that stores the original tables with style features (fonts, alignment, and indentation) as Excel spreadsheets, available at http://tango.byu.edu/data. We have put all of these tables with their style features into a single spreadsheet file (data/TangoDataset.xlsx). Each of the 200 tables is located in a separate sheet. A pair of tags, $START and $END, marks its location inside the sheet. We initially used this file in our previous experiment described in the paper [3].

We automatically transformed all tables of this spreadsheet file into the relational form, using TabbyXL and the ruleset (data/rules.dslr). The folder data/results contains the obtained results. The folder data/gt contains the ground-truth data for automated performance evaluation of TabbyXL in the role and structural stages of table analysis. Each table in data/results and data/gt is accompanied by two recordsets: ENTRIES and LABELS. In the ENTRIES recordset, each record presents an entry as a triple. In the LABELS recordset, each record presents a label as a triple. We have also stored the log files: results.log with the results of running TabbyXL and eval.log with the results of its performance evaluation.

References

[1] Nagy G. TANGO-DocLab web tables from international statistical sites (Troy_200), 1, ID: Troy_200_1. URL: http://tc11.cvc.uab.es/datasets/Troy_200_1.

[2] Embley D., Krishnamoorthy M., Nagy G., & Seth S. (2016). Converting heterogeneous statistical tables on the web to searchable databases. Int. J. on Document Analysis and Recognition, 19(2), 119-138. URL: https://link.springer.com/article/10.1007/s10032-016-0259-1.

[3] Shigarov A., Paramonov V., Belykh P., & Bondarev A. (2016). Rule-Based Canonicalization of Arbitrary Tables in Spreadsheets. Proc. 22nd Int. Conf. on Information and Software Technologies, pp. 78-91. URL: http://link.springer.com/chapter/10.1007/978-3-319-46254-7_7.
