14 research outputs found

    DataSpread: Unifying Databases and Spreadsheets.

    Get PDF
    Spreadsheet software is often the tool of choice for ad-hoc tabular data management, processing, and visualization, especially on tiny data sets. On the other hand, relational database systems offer significant power, expressivity, and efficiency over spreadsheet software for data management, while lacking in the ease of use and ad-hoc analysis capabilities. We demonstrate DataSpread, a data exploration tool that holistically unifies databases and spreadsheets. It continues to offer a Microsoft Excel-based spreadsheet front-end, while in parallel managing all the data in a back-end database, specifically, PostgreSQL. DataSpread retains all the advantages of spreadsheets, including ease of use, ad-hoc analysis and visualization capabilities, and a schema-free nature, while also adding the advantages of traditional relational databases, such as scalability and the ability to use arbitrary SQL to import, filter, or join external or internal tables and have the results appear in the spreadsheet. DataSpread needs to reason about and reconcile differences in the notions of schema, addressing of cells and tuples, and the current pane (which exists in spreadsheets but not in traditional databases), and support data modifications at both the front-end and the back-end. Our demonstration will center on our first and early prototype of the DataSpread, and will give the attendees a sense for the enormous data exploration capabilities offered by unifying spreadsheets and databases

    Translating SQL to Spreadsheet: A Survey

    Get PDF
    Spreadsheets are the most popular and conventionally databases in use today. Since Spreadsheets are visual and expression based languages, research into the features of spreadsheets is therefore a highly relevant topic to study. Spreadsheet can be viewed as a Relation Database which contains a sheet and its corresponding information in terms of rows, while in RDBMS each table or say relation also represents its contained information in terms of rows. Each row represents a record which belongs to one or more relation. Spreadsheets uses different formulae to extract required information but it need expert knowledge about the tool and its usage. One can extend the usage of Spreadsheet in any direction as it provides great flexibility in terms of data storage and dependency of stored data. We surveyed some of research which took great attention over Spreadsheets and its applicability in different functional cases, such as Data Visualization, SQL Engines and many more. Our survey focuses on QUERYSHEET, ES-SQL, MDSHEET and PrediCalc [3], [5], [4], [8]. These different researches are motivations to our survey and attraction in Spreadsheets and its functional extensibility

    Dise帽o de un almac茅n de datos hist贸rico en el marco del desarrollo de software dirigido por modelos

    Get PDF
    Un Decision Support System (DSS) asiste a los usuarios en el proceso de an谩lisis de datos en una organizaci贸n con el prop贸sito de producir informaci贸n que les permita tomar mejores decisiones. Los analistas que utilizan el DSS est谩n m谩s interesados en identificar tendencias que en buscar alg煤n registro individual en forma aislada [HRU96]. Con ese prop贸sito, los datos de las diferentes transacciones se almacenan y consolidan en una base de datos central denominada Data Warehouse (DW); los analistas utilizan esas estructuras de datos para extraer informaci贸n de sus negocios que les permita tomar mejores decisiones [GHRU97]. Bas谩ndose en el esquema de datos fuente y en los requisitos de informaci贸n de la organizaci贸n, el objetivo del dise帽ador de un DSS es sintetizar esos datos para reducirlos a un formato que le permita, al usuario de la aplicaci贸n, utilizarlos en el an谩lisis del comportamiento de la empresa. Dos tipos diferentes (pero relacionados) de actividades est谩n presentes: el dise帽o de las estructuras de almacenamiento y la creaci贸n de consultas sobre esas estructuras. La primera tarea se desarrolla en el 谩mbito de los dise帽adores de aplicaciones inform谩ticas; la segunda, en la esfera de los usuarios finales. Ambas actividades, normalmente, se realizan con escasa asistencia de herramientas automatizadas.Eje: Tecnolog铆a Inform谩tica aplicada en educaci贸nRed de Universidades con Carreras en Inform谩tica (RedUNCI

    Towards a Holistic Integration of Spreadsheets with Databases: A Scalable Storage Engine for Presentational Data Management

    Full text link
    Spreadsheet software is the tool of choice for interactive ad-hoc data management, with adoption by billions of users. However, spreadsheets are not scalable, unlike database systems. On the other hand, database systems, while highly scalable, do not support interactivity as a first-class primitive. We are developing DataSpread, to holistically integrate spreadsheets as a front-end interface with databases as a back-end datastore, providing scalability to spreadsheets, and interactivity to databases, an integration we term presentational data management (PDM). In this paper, we make a first step towards this vision: developing a storage engine for PDM, studying how to flexibly represent spreadsheet data within a database and how to support and maintain access by position. We first conduct an extensive survey of spreadsheet use to motivate our functional requirements for a storage engine for PDM. We develop a natural set of mechanisms for flexibly representing spreadsheet data and demonstrate that identifying the optimal representation is NP-Hard; however, we develop an efficient approach to identify the optimal representation from an important and intuitive subclass of representations. We extend our mechanisms with positional access mechanisms that don't suffer from cascading update issues, leading to constant time access and modification performance. We evaluate these representations on a workload of typical spreadsheets and spreadsheet operations, providing up to 20% reduction in storage, and up to 50% reduction in formula evaluation time

    Initial report on Object Spreadsheets

    Get PDF
    There is a growing demand for data-driven web applications that help automate organizational and business processes of low to medium complexity by letting users view and update structured data in controlled ways. We present Object Spreadsheets, an end-user development tool that combines a spreadsheet interface with a rich data model to help the process administrators build the logic for such applications themselves. Its all-in-one interface with immediate feedback has the potential to bring more complex tasks within reach of end-user developers, compared to existing approaches. Our data model is based on the structure of entity-relationship models and directly supports nested variable-size collections and object references, which are common in web applications but poorly accommodated by traditional spreadsheets. Object Spreadsheets has a formula language suited to the data model and supports stored procedures to specify the forms of updates that application users may make. Formulas can be used to assemble data in the exact structure in which it is to be shown in the application UI, simplifying the task of UI building; we intend for Object Spreadsheets to be integrated with a UI builder to provide a complete solution for application development. We describe our prototype implementation and several example applications we built to demonstrate the applicability of the tool

    Layout Inference and Table Detection in Spreadsheet Documents

    Get PDF
    Spreadsheets have found wide use in many different domains and settings. They provide a broad range of both basic and advanced functionalities. In this way, they can support data collection, transformation, analysis, and reporting. Nevertheless, at the same time spreadsheets maintain a friendly and intuitive interface. Additionally, they entail no to very low cost. Well-known spreadsheet applications, such as OpenOffice, LibreOffice, Google Sheets, and Gnumeric, are free to use. Moreover, Microsoft Excel is widely available, with millions of users worldwide. Thus, spreadsheets are not only powerful tools, but also have a very low entrance barrier. Therefore, they have become very popular with novices and professionals alike. As a result, a large volume of valuable data resides in these documents. From spreadsheets, of particular interest are data coming in tabular form, since they provide concise, factual, and to a large extend structured information. One natural progression is to transfer tabular data from spreadsheets to databases. This would allow spreadsheets to become a direct source of data for existing or new business processes. It would be easier to digest them into data warehouses and to integrate them with other sources. Nevertheless, besides databases, there are other means to work with spreadsheet data. New paradigms, like NoDB, advocate querying directly from raw documents. Going one step further, spreadsheets together with other raw documents can be stored in a sophisticated centralized repository, i.e., a data lake. From then on they can serve (on-demand) various tasks and applications. All in all, by making spreadsheet data easily accessible, we can prevent information silos, i.e., valuable knowledge being isolated and scattered in multiple spreadsheet documents. Yet, there are considerable challenges to the automatic processing and understanding of these documents. After all, spreadsheets are designed primarily for human consumption, and as such, they favor customization and visual comprehension. Data are often intermingled with formatting, formulas, layout artifacts, and textual metadata, which carry domain-specific or even user-specific information (i.e., personal preferences). Multiple tables, with different layout and structure, can be found on the same sheet. Most importantly, the structure of the tables is not known, i.e., not explicitly given by the spreadsheet documents. Altogether, spreadsheets are better described as partially structured, with a significant degree of implicit information. In literature, the automatic understanding of spreadsheet data has only been scarcely investigated, often assuming just the same uniform table layout. However, due to the manifold possibilities to structure tabular data in spreadsheets, the assumption of a uniform layout either excludes a substantial number of tables from the extraction process or leads to inaccurate results. In this thesis, we primarily address two fundamental tasks that can lead to more accurate information extraction from spreadsheet documents. Namely, we propose intuitive and effective approaches for layout analysis and table detection in spreadsheets. Nevertheless, our overall solution is designed as a processing pipeline, where specialized steps build on top of each other to discover the tabular data. One of our main objectives is to eliminate most of the assumptions from related work. Instead, we target highly diverse sheet layouts, with one or multiple tables. On the same time, we foresee the presence of textual metadata and other non-tabular data in the sheet. Furthermore, we make use of sophisticated machine learning and optimization techniques. This brings flexibility to our approach, allowing it to work even with complex or malformed tables. Moreover, this intended flexibility makes our approaches transferable to new spreadsheet datasets. Thus, we are not bounded to specific domains or settings.:1 INTRODUCTION 1.1 Motivation 1.2 Contributions 1.3 Outline 2 FOUNDATIONS AND RELATED WORK 2.1 The Evolution of Spreadsheet Documents 2.1.1 Spreadsheet User Interface and Functionalities 2.1.2 Spreadsheet File Formats 2.1.3 Spreadsheets Are Partially-Structured 2.2 Analysis and Recognition in Electronic Documents 2.2.1 A General Overview of DAR 2.2.2 DAR in Spreadsheets 2.3 Spreadsheet Research Areas 2.3.1 Layout Inference and Table Recognition 2.3.2 Unifying Databases and Spreadsheets 2.3.3 Spreadsheet Software Engineering 2.3.4 Data Wrangling Approaches 3 AN EMPIRICAL STUDY OF SPREADSHEET DOCUMENTS 3.1 Available Corpora 3.2 Creating a Gold Standard Dataset 3.2.1 Initial Selection 3.2.2 Annotation Methodology 3.3 Dataset Analysis 3.3.1 Takeaways from Business Spreadsheets 3.3.2 Comparison Between Domains 3.4 Summary and Discussion 3.4.1 Datasets for Experimental Evaluation 3.4.2 A Processing Pipeline 4 LAYOUT ANALYSIS 4.1 A Method for Layout Analysis in Spreadsheets 4.2 Feature Extraction 4.2.1 Content Features 4.2.2 Style Features 4.2.3 Font Features 4.2.4 Formula and Reference Features 4.2.5 Spatial Features 4.2.6 Geometrical Features 4.3 Cell Classification 4.3.1 Classification Datasets 4.3.2 Classifiers and Assessment Methods 4.3.3 Optimum Under-Sampling 4.3.4 Feature Selection 4.3.5 Parameter Tuning 4.3.6 Classification Evaluation 4.4 Layout Regions 4.5 Summary and Discussions 5 CLASSIFICATION POST-PROCESSING 5.1 Dataset for Post-Processing 5.2 Pattern-Based Revisions 5.2.1 Misclassification Patterns 5.2.2 Relabeling Cells 5.2.3 Evaluating the Patterns 5.3 Region-Based Revisions 5.3.1 Standardization Procedure 5.3.2 Extracting Features from Regions 5.3.3 Identifying Misclassified Regions 5.3.4 Relabeling Misclassified Regions 5.4 Summary and Discussion 6 TABLE DETECTION 6.1 A Method for Table Detection in Spreadsheets 6.2 Preliminaries 6.2.1 Introducing a Graph Model 6.2.2 Graph Partitioning for Table Detection 6.2.3 Pre-Processing for Table Detection 6.3 Rule-Based Detection 6.3.1 Remove and Conquer 6.4 Genetic-Based Detection 6.4.1 Undirected Graph 6.4.2 Header Cluster 6.4.3 Quality Metrics 6.4.4 Objective Function 6.4.5 Weight Tuning 6.4.6 Genetic Search 6.5 Experimental Evaluation 6.5.1 Testing Datasets 6.5.2 Training Datasets 6.5.3 Tuning Rounds 6.5.4 Search and Assessment 6.5.5 Evaluation Results 6.6 Summary and Discussions 7 XLINDY: A RESEARCH PROTOTYPE 7.1 Interface and Functionalities 7.1.1 Front-end Walkthrough 7.2 Implementation Details 7.2.1 Interoperability 7.2.2 Efficient Reads 7.3 Information Extraction 7.4 Summary and Discussions 8 CONCLUSION 8.1 Summary of Contributions 8.2 Directions of Future Work BIBLIOGRAPHY LIST OF FIGURES LIST OF TABLES A ANALYSIS OF REDUCED SAMPLES B TABLE DETECTION WITH TIRS B.1 Tables in TIRS B.2 Pairing Fences with Data Regions B.3 Heuristics Framewor

    A Spreadsheet Algebra for a Direct Data Manipulation Query Interface

    No full text

    Layout inference and table detection in spreadsheet document

    Get PDF
    Spreadsheet applications have evolved to be a tool of great importance for businesses, open data, and scientific communities. Using these applications, users can perform various transformations, generate new content, analyze and format data such that they are visually comprehensive. The same data can be presented in different ways, depending on the preferences and the intentions of the user. These functionalities make spreadsheets user-friendly, but not as much machine-friendly. When it comes to integrating with other sources, the free-for-all nature of spreadsheets is disadvantageous. It is rather difficult to algorithmically infer the structure of the data when they are intermingled with formatting, formulas, layout artifacts, and textual metadata. Therefore, user involvement is often required, which results in cumbersome and time-consuming tasks. Overall, the lack of automatic processing methods limits our ability to explore and reuse a great amount of rich data stored into partially-structured documents such as spreadsheets. In this thesis, we tackle this open challenge, which so far has been scarcely investigated in literature. Specifically, we are interested in extracting tabular data from spreadsheets, since they hold concise, factual, and to a large extend structured information. It is easier to process such information, in order to make it available to other applications. For instance, spreadsheet (tabular) data can be loaded into databases. Thus, these data would become instantly available to existing or new business processes. Furthermore, we can eliminate the risk of losing valuable company knowledge, by moving data or integrating spreadsheets with other more sophisticated information management systems. To achieve the aforementioned objectives and advancements, in this thesis, we develop a spreadsheet processing pipeline. The requirements for this pipeline were derived from a large scale empirical analysis of real-world spreadsheets, from business and Web settings. Specifically, we propose a series of specialized steps that build on top of each other with the goal of discovering the structure of data in spreadsheet documents. Our approach is bottom-up, as it starts from the smallest unit (i.e., the cell) to ultimately arrive at the individual tables of the sheet. Additionally, this thesis makes use of sophisticated machine learning and optimization techniques. In particular, we apply these techniques for layout analysis and table detection in spreadsheets. We target highly diverse sheet layouts, with one or multiple tables and arbitrary arrangement of contents. Moreover, we foresee the presence of textual metadata and other non-tabular data in the sheet. Furthermore, we work even with problematic tables (e.g., containing empty rows/columns and missing values). Finally, we bring flexibility to our approach. This not only allows us to tackle the above-mentioned challenges but also to reuse our solution for different (spreadsheet) datasets.Els fulls de c脿lcul s鈥檈mpren massivament en molts dominis i contexts diferents, ja que proporcionen una 脿mplia gamma de funcionalitats, b脿siques i avan莽ades, de gesti贸 de dades. D鈥檃questa manera, donen suport a la recollida, transformaci贸, an脿lisi i visualitzaci贸 de dades. A la mateixa vegada, els fulls de c脿lcul tenen una interf铆cie amigable i intu茂tiva i tenen un cost molt baix d鈥檌mplantaci贸. Aplicacions de full de c脿lcul molt conegudes, com OpenOffice, LibreOffice, Google Sheets i Gnumeric, poden utilitzar-se de forma gratu茂ta i d鈥檃ltres, com Microsoft Excel, s贸n a l鈥檃bast d鈥檜na gran majoria d鈥檜suaris. Per tant, han esdevingut molt populars tant per a novells com per professionals. Com a resultat, un gran volum de dades valuoses resideixen en aquests documents. S贸n de particular inter猫s les dades que es presenten en format tabular dins dels fulls de c脿lcul, ja que proporcionen informaci贸 concreta, factual i parcialment estructurada. Com a conseq眉猫ncia, hi ha inter猫s en transferir dades tabulars des de fulls de c脿lcul a bases de dades. Aix貌 permetria que els fulls de c脿lcul es converteixin en una font directa de dades per a processos empresarials, i introduir aquestes dades als magatzems de dades i integrar-les amb altres fonts. Un pas m茅s enll脿, els fulls de c脿lcul juntament amb altres documents en brut es poden emmagatzemar en repositoris de dades centralitzats avan莽ats, com per exemple, els data lake. Un cop al data lake, es podran fer servir (sota demanda) per a diverses tasques i aplicacions. Tot plegat, l鈥檕bjectiu 茅s fer accessibles les dades emmagatzemades als fulls de c脿lcul. Malgrat tot, hi ha reptes considerables en el processament i comprensi贸 autom脿tica d鈥檃quests documents. Els fulls de c脿lcul estan dissenyats principalment per al consum hum脿 i, per tant, afavoreixen la personalitzaci贸 i la comprensi贸 visual. Les dades sovint s鈥檈ntrellacen amb formataci贸, f贸rmules, artefactes de disseny i metadades textuals, que porten informaci贸 espec铆fica del domini o fins i tot informaci贸 espec铆fica de l鈥檜suari. Al mateix full es poden trobar diverses taules, amb una estructura i disseny diferents. A m茅s, el format de cada taula no es declara a priori, 茅s a dir, no hi ha cap mecanisme per definir l鈥檈structura d鈥檜na taula, com passa a les bases de dades. Per aquest motiu, els fulls de c脿lcul es coneixen com a fonts de dades parcialment estructurades, amb un grau rellevant d'informaci贸 impl铆cita. A la literatura, la comprensi贸 autom脿tica de les dades emmagatzemades en fulls de c脿lcul s'ha investigat superficialment, sovint assumint el mateix format uniforme de taula a tots els fulls de c脿lcul. Tanmateix, a causa de les m煤ltiples possibilitats d'estructurar les dades tabulars en fulls de c脿lcul, la suposici贸 d'un disseny uniforme o b茅 exclou un nombre substancial de taules del proc茅s d'extracci贸 o condueix a resultats inexactes. En aquesta tesi, abordem tasques fonamentals que contribueixen a l鈥檈xtracci贸 d鈥檌nformaci贸 dels fulls de c脿lcul d鈥檜na manera m茅s precisa. Proposem m猫todes intu茂tius i efica莽os per a l鈥檃n脿lisi de la distribuci贸 i detecci贸 de taules en fulls de c脿lcul. Un dels nostres objectius principals 茅s eliminar la majoria dels sup貌sits de l鈥檈stat de l鈥檃rt actual. Per fer-ho, considerem estructures tabulars altament heterog猫nies, contingudes en fulls de c脿lcul amb una o m茅s taules. Addicionalment, preveiem la presencia de metadades i altres tipus de dades no tabulars al mateix full. Per 煤ltim, utilitzem t猫cniques d鈥檕ptimitzaci贸 i d鈥檃prenentatge autom脿tic per identificar l鈥檈structura de les taules. Aix貌 aporta flexibilitat al nostre enfocament, permetent-lo treballar, fins i tot, amb taules complexes o malformades. Aquesta flexibilitat fa que els nostres m猫todes siguin transferibles a nous conjunts de fulls de c脿lcul amb dades d鈥檃ltres dominis. Per tant, no estem limitats a dominis o configuracion
    corecore