17 research outputs found

    Layout inference and table detection in spreadsheet document

    Spreadsheet applications have evolved to be a tool of great importance for businesses, open data, and scientific communities. Using these applications, users can perform various transformations, generate new content, and analyze and format data so that they are visually comprehensible. The same data can be presented in different ways, depending on the preferences and the intentions of the user. These functionalities make spreadsheets user-friendly, but far less machine-friendly. When it comes to integrating with other sources, the free-for-all nature of spreadsheets is a disadvantage. It is rather difficult to algorithmically infer the structure of the data when they are intermingled with formatting, formulas, layout artifacts, and textual metadata. Therefore, user involvement is often required, which results in cumbersome and time-consuming tasks. Overall, the lack of automatic processing methods limits our ability to explore and reuse a great amount of rich data stored in partially structured documents such as spreadsheets. In this thesis, we tackle this open challenge, which so far has been scarcely investigated in the literature. Specifically, we are interested in extracting tabular data from spreadsheets, since they hold concise, factual, and to a large extent structured information. Such information is easier to process and to make available to other applications. For instance, spreadsheet (tabular) data can be loaded into databases. Thus, these data would become instantly available to existing or new business processes. Furthermore, we can eliminate the risk of losing valuable company knowledge, by moving data or integrating spreadsheets with other, more sophisticated information management systems. To achieve these objectives, we develop a spreadsheet processing pipeline in this thesis. The requirements for this pipeline were derived from a large-scale empirical analysis of real-world spreadsheets, from business and Web settings. Specifically, we propose a series of specialized steps that build on top of each other with the goal of discovering the structure of data in spreadsheet documents. Our approach is bottom-up, as it starts from the smallest unit (i.e., the cell) to ultimately arrive at the individual tables of the sheet. Additionally, this thesis makes use of sophisticated machine learning and optimization techniques. In particular, we apply these techniques for layout analysis and table detection in spreadsheets. We target highly diverse sheet layouts, with one or multiple tables and arbitrary arrangement of contents. Moreover, we foresee the presence of textual metadata and other non-tabular data in the sheet. Furthermore, we work even with problematic tables (e.g., containing empty rows/columns and missing values). Finally, we bring flexibility to our approach. This not only allows us to tackle the above-mentioned challenges but also to reuse our solution for different (spreadsheet) datasets.
    Spreadsheets are used massively in many different domains and contexts, since they provide a wide range of basic and advanced data management functionalities. In this way, they support the collection, transformation, analysis, and visualization of data. At the same time, spreadsheets have a friendly and intuitive interface and a very low adoption cost. Well-known spreadsheet applications, such as OpenOffice, LibreOffice, Google Sheets, and Gnumeric, can be used free of charge, and others, such as Microsoft Excel, are within reach of a large majority of users. Therefore, they have become very popular with novices and professionals alike. As a result, a large volume of valuable data resides in these documents. Of particular interest are the data presented in tabular form within spreadsheets, since they provide concrete, factual, and partially structured information. Consequently, there is interest in transferring tabular data from spreadsheets to databases. This would allow spreadsheets to become a direct source of data for business processes, and these data to be fed into data warehouses and integrated with other sources. Going one step further, spreadsheets together with other raw documents can be stored in advanced centralized data repositories, such as data lakes. Once in the data lake, they can be used (on demand) for various tasks and applications. All in all, the goal is to make the data stored in spreadsheets accessible. Nevertheless, there are considerable challenges in the automatic processing and understanding of these documents. Spreadsheets are designed primarily for human consumption and, therefore, favor customization and visual comprehension. Data are often intermingled with formatting, formulas, layout artifacts, and textual metadata, which carry domain-specific or even user-specific information. Several tables, with different structures and layouts, can be found on the same sheet. Moreover, the format of each table is not declared a priori; that is, there is no mechanism for defining the structure of a table, as there is in databases. For this reason, spreadsheets are regarded as partially structured data sources, with a significant degree of implicit information. In the literature, the automatic understanding of data stored in spreadsheets has been investigated only superficially, often assuming the same uniform table format across all spreadsheets. However, because of the many ways of structuring tabular data in spreadsheets, the assumption of a uniform layout either excludes a substantial number of tables from the extraction process or leads to inaccurate results. In this thesis, we address fundamental tasks that contribute to more accurate information extraction from spreadsheets. We propose intuitive and effective methods for layout analysis and table detection in spreadsheets. One of our main objectives is to eliminate most of the assumptions of the current state of the art. To do so, we consider highly heterogeneous tabular structures, contained in spreadsheets with one or more tables. Additionally, we foresee the presence of metadata and other types of non-tabular data on the same sheet. Finally, we use optimization and machine learning techniques to identify the structure of the tables. This brings flexibility to our approach, allowing it to work even with complex or malformed tables. This flexibility makes our methods transferable to new sets of spreadsheets with data from other domains. Thus, we are not limited to domains or configurations

    Layout Inference and Table Detection in Spreadsheet Documents

    Spreadsheets have found wide use in many different domains and settings. They provide a broad range of both basic and advanced functionalities, and can thus support data collection, transformation, analysis, and reporting. At the same time, spreadsheets maintain a friendly and intuitive interface. Additionally, they entail little or no cost. Well-known spreadsheet applications, such as OpenOffice, LibreOffice, Google Sheets, and Gnumeric, are free to use. Moreover, Microsoft Excel is widely available, with millions of users worldwide. Thus, spreadsheets are not only powerful tools, but also have a very low barrier to entry. Therefore, they have become very popular with novices and professionals alike. As a result, a large volume of valuable data resides in these documents. Of particular interest are the data that come in tabular form, since they provide concise, factual, and to a large extent structured information. One natural progression is to transfer tabular data from spreadsheets to databases. This would allow spreadsheets to become a direct source of data for existing or new business processes. It would be easier to ingest them into data warehouses and to integrate them with other sources. Besides databases, however, there are other means to work with spreadsheet data. New paradigms, like NoDB, advocate querying directly from raw documents. Going one step further, spreadsheets together with other raw documents can be stored in a sophisticated centralized repository, i.e., a data lake. From then on, they can serve (on demand) various tasks and applications. All in all, by making spreadsheet data easily accessible, we can prevent information silos, i.e., valuable knowledge being isolated and scattered across multiple spreadsheet documents. Yet, there are considerable challenges to the automatic processing and understanding of these documents.
After all, spreadsheets are designed primarily for human consumption, and as such, they favor customization and visual comprehension. Data are often intermingled with formatting, formulas, layout artifacts, and textual metadata, which carry domain-specific or even user-specific information (i.e., personal preferences). Multiple tables, with different layouts and structures, can be found on the same sheet. Most importantly, the structure of the tables is not known, i.e., not explicitly given by the spreadsheet documents. Altogether, spreadsheets are better described as partially structured, with a significant degree of implicit information. In the literature, the automatic understanding of spreadsheet data has been only scarcely investigated, often assuming the same uniform table layout. However, due to the manifold possibilities to structure tabular data in spreadsheets, the assumption of a uniform layout either excludes a substantial number of tables from the extraction process or leads to inaccurate results. In this thesis, we primarily address two fundamental tasks that can lead to more accurate information extraction from spreadsheet documents. Namely, we propose intuitive and effective approaches for layout analysis and table detection in spreadsheets. Our overall solution is designed as a processing pipeline, where specialized steps build on top of each other to discover the tabular data. One of our main objectives is to eliminate most of the assumptions made in related work. Instead, we target highly diverse sheet layouts, with one or multiple tables. At the same time, we foresee the presence of textual metadata and other non-tabular data in the sheet. Furthermore, we make use of sophisticated machine learning and optimization techniques. This brings flexibility to our approach, allowing it to work even with complex or malformed tables. Moreover, this intended flexibility makes our approaches transferable to new spreadsheet datasets.
Thus, we are not bound to specific domains or settings.

    Contents:
    1 INTRODUCTION
      1.1 Motivation
      1.2 Contributions
      1.3 Outline
    2 FOUNDATIONS AND RELATED WORK
      2.1 The Evolution of Spreadsheet Documents
        2.1.1 Spreadsheet User Interface and Functionalities
        2.1.2 Spreadsheet File Formats
        2.1.3 Spreadsheets Are Partially-Structured
      2.2 Analysis and Recognition in Electronic Documents
        2.2.1 A General Overview of DAR
        2.2.2 DAR in Spreadsheets
      2.3 Spreadsheet Research Areas
        2.3.1 Layout Inference and Table Recognition
        2.3.2 Unifying Databases and Spreadsheets
        2.3.3 Spreadsheet Software Engineering
        2.3.4 Data Wrangling Approaches
    3 AN EMPIRICAL STUDY OF SPREADSHEET DOCUMENTS
      3.1 Available Corpora
      3.2 Creating a Gold Standard Dataset
        3.2.1 Initial Selection
        3.2.2 Annotation Methodology
      3.3 Dataset Analysis
        3.3.1 Takeaways from Business Spreadsheets
        3.3.2 Comparison Between Domains
      3.4 Summary and Discussion
        3.4.1 Datasets for Experimental Evaluation
        3.4.2 A Processing Pipeline
    4 LAYOUT ANALYSIS
      4.1 A Method for Layout Analysis in Spreadsheets
      4.2 Feature Extraction
        4.2.1 Content Features
        4.2.2 Style Features
        4.2.3 Font Features
        4.2.4 Formula and Reference Features
        4.2.5 Spatial Features
        4.2.6 Geometrical Features
      4.3 Cell Classification
        4.3.1 Classification Datasets
        4.3.2 Classifiers and Assessment Methods
        4.3.3 Optimum Under-Sampling
        4.3.4 Feature Selection
        4.3.5 Parameter Tuning
        4.3.6 Classification Evaluation
      4.4 Layout Regions
      4.5 Summary and Discussions
    5 CLASSIFICATION POST-PROCESSING
      5.1 Dataset for Post-Processing
      5.2 Pattern-Based Revisions
        5.2.1 Misclassification Patterns
        5.2.2 Relabeling Cells
        5.2.3 Evaluating the Patterns
      5.3 Region-Based Revisions
        5.3.1 Standardization Procedure
        5.3.2 Extracting Features from Regions
        5.3.3 Identifying Misclassified Regions
        5.3.4 Relabeling Misclassified Regions
      5.4 Summary and Discussion
    6 TABLE DETECTION
      6.1 A Method for Table Detection in Spreadsheets
      6.2 Preliminaries
        6.2.1 Introducing a Graph Model
        6.2.2 Graph Partitioning for Table Detection
        6.2.3 Pre-Processing for Table Detection
      6.3 Rule-Based Detection
        6.3.1 Remove and Conquer
      6.4 Genetic-Based Detection
        6.4.1 Undirected Graph
        6.4.2 Header Cluster
        6.4.3 Quality Metrics
        6.4.4 Objective Function
        6.4.5 Weight Tuning
        6.4.6 Genetic Search
      6.5 Experimental Evaluation
        6.5.1 Testing Datasets
        6.5.2 Training Datasets
        6.5.3 Tuning Rounds
        6.5.4 Search and Assessment
        6.5.5 Evaluation Results
      6.6 Summary and Discussions
    7 XLINDY: A RESEARCH PROTOTYPE
      7.1 Interface and Functionalities
        7.1.1 Front-end Walkthrough
      7.2 Implementation Details
        7.2.1 Interoperability
        7.2.2 Efficient Reads
      7.3 Information Extraction
      7.4 Summary and Discussions
    8 CONCLUSION
      8.1 Summary of Contributions
      8.2 Directions of Future Work
    BIBLIOGRAPHY
    LIST OF FIGURES
    LIST OF TABLES
    A ANALYSIS OF REDUCED SAMPLES
    B TABLE DETECTION WITH TIRS
      B.1 Tables in TIRS
      B.2 Pairing Fences with Data Regions
      B.3 Heuristics Framewor

    Layout inference and table detection in spreadsheet document

    Thesis under joint supervision (cotutelle): Universitat Politècnica de Catalunya and Technische Universität Dresden. Postprint (published version).

    A machine learning approach for layout inference in spreadsheets

    Spreadsheet applications are among the most used tools for content generation and presentation in industry and on the Web. In spite of this success, no comprehensive approach exists to automatically extract and reuse the wealth of data maintained in this format. The biggest obstacle is the lack of awareness about the structure of the data in spreadsheets, which otherwise could provide the means to automatically understand and extract knowledge from these files. In this paper, we propose a classification approach to discover the layout of tables in spreadsheets. To this end, we focus on the cell level, considering a wide range of features not covered before by related work. We evaluated the performance of our classifiers on a large dataset covering three different corpora from various domains. Finally, our work includes a novel technique for detecting and repairing incorrectly classified cells in a post-processing step. The experimental results show that our approach delivers very high accuracy, bringing us a crucial step closer to automatic table extraction. Peer reviewed. Postprint (published version).

    Cell Classification for Layout Recognition in Spreadsheets

    Spreadsheets constitute a notably large and valuable collection of documents within enterprise settings and on the Web. Although spreadsheets are intuitive to use and equipped with powerful functionalities, extracting and reusing data from them remains a cumbersome and mostly manual task. Their greatest strength, the large degree of freedom they provide to the user, is at the same time their greatest weakness, since data can be arbitrarily structured. Therefore, in this paper we propose a supervised learning approach for layout recognition in spreadsheets. We work on the cell level, aiming to predict each cell's correct layout role out of five predefined alternatives. For this task we have considered a large number of features not covered before by related work. Moreover, we gather a considerably large dataset of annotated cells, from spreadsheets exhibiting variability in format and content. Our experiments, with five different classification algorithms, show that we can predict cell layout roles with high accuracy. Subsequently, we focus on revising the classification results, with the aim of repairing misclassifications. We propose a sophisticated approach, composed of three steps, which effectively corrects a reasonable number of inaccurate predictions.
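
    To make the cell-level view concrete, the sketch below turns a single cell into a small feature dictionary of the kind a supervised classifier could consume. It is only an illustration: the `Cell` fields and the seven features are assumptions chosen for this example, not the feature set used in the paper.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical cell record; the field names are assumptions."""
    row: int
    col: int
    value: str = ""
    formula: str = ""   # e.g. "=SUM(B2:B9)"; empty for a constant
    bold: bool = False


def cell_features(cell: Cell, n_rows: int, n_cols: int) -> dict:
    """Return illustrative content, style, formula, and spatial features."""
    text = cell.value.strip()
    try:
        # Accept thousands separators such as "1,250.5".
        float(text.replace(",", ""))
        is_numeric = True
    except ValueError:
        is_numeric = False
    return {
        "is_empty": text == "",
        "is_numeric": is_numeric,
        "length": len(text),
        "has_formula": cell.formula.startswith("="),
        "is_bold": cell.bold,
        # Relative position of the cell within the used range of the sheet.
        "rel_row": cell.row / max(n_rows - 1, 1),
        "rel_col": cell.col / max(n_cols - 1, 1),
    }
```

    One such dictionary per annotated cell, paired with its layout role, would form a training set for any off-the-shelf classifier.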

    Table recognition in spreadsheets via a graph representation

    Spreadsheet applications are very popular data management tools. Their ease of use and abundant functionalities equip novices and professionals alike with the means to generate, transform, analyze, and visualize data. As a result, spreadsheets are a rich source of factual and structured information. This accentuates the need to automatically understand and extract their contents. In this paper, we present a novel approach for recognizing tables in spreadsheets. Having inferred the layout role of the individual cells, we build layout regions. We encode the spatial interrelations between these regions using a graph representation. Based on this, we propose Remove and Conquer (RAC), an algorithm for table recognition that implements a list of carefully curated rules. An extensive experimental evaluation shows that our approach is viable. We achieve significant accuracy on a dataset of real spreadsheets from various domains. © 2018 IEEE. Peer reviewed. Postprint (author's final draft).
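
    The two preparatory steps named in this abstract, building layout regions and encoding their spatial relations as a graph, can be sketched as follows. This is a simplified illustration under stated assumptions: regions are single-row runs of equally labeled cells, and an edge means "directly below with overlapping columns". The function names are hypothetical, and the RAC rules themselves are not reproduced here.

```python
from collections import defaultdict


def row_regions(labels):
    """Group horizontally adjacent cells with the same label.

    `labels` maps (row, col) -> label. Returns a list of
    (label, row, first_col, last_col) tuples in row order.
    """
    by_row = defaultdict(list)
    for (r, c), lab in labels.items():
        by_row[r].append((c, lab))
    regions = []
    for r in sorted(by_row):
        run = None  # current run as [label, row, first_col, last_col]
        for c, lab in sorted(by_row[r]):
            if run and run[0] == lab and c == run[3] + 1:
                run[3] = c              # extend the current run
            else:
                if run:
                    regions.append(tuple(run))
                run = [lab, r, c, c]    # start a new run
        if run:
            regions.append(tuple(run))
    return regions


def region_graph(regions):
    """Return edges (i, j) where region j lies directly below region i
    and their column spans overlap."""
    edges = set()
    for i, (_, r1, a1, b1) in enumerate(regions):
        for j, (_, r2, a2, b2) in enumerate(regions):
            if r2 == r1 + 1 and a1 <= b2 and a2 <= b1:
                edges.add((i, j))
    return edges
```

    In this sketch a blank row breaks the chain of edges, so connected components of the graph give a first, rough grouping of regions into table candidates.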

    XLIndy: interactive recognition and information extraction in spreadsheets

    Over the years, spreadsheets have established their presence in many domains, including business, government, and science. However, challenges arise because spreadsheets are partially structured and carry implicit (visual and textual) information. This translates into a bottleneck when it comes to automatic analysis and extraction of information. Therefore, we present XLIndy, a Microsoft Excel add-in with a machine learning back-end, written in Python. It showcases our novel methods for layout inference and table recognition in spreadsheets. For a selected task and method, users can visually inspect the results, change configurations, and compare different runs. This enables iterative fine-tuning. Additionally, users can manually revise the predicted layout and tables, and subsequently save them as annotations. These annotations are used to measure performance and (re-)train classifiers. Finally, data in the recognized tables can be extracted for further processing. XLIndy supports several standard formats, such as CSV and JSON. Peer reviewed. Postprint (author's final draft).
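
    The final extraction step can be pictured with the standard library alone. The sketch below is not XLIndy's code or API; it assumes that a table has already been recognized as a rectangular range of a sheet held as a list of rows, and the helper names are hypothetical.

```python
import csv
import io
import json


def extract_table(grid, top, left, bottom, right):
    """Return the rows of a rectangular range (zero-based, inclusive)."""
    return [row[left:right + 1] for row in grid[top:bottom + 1]]


def to_csv(rows):
    """Serialize the rows as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def to_json(rows):
    """Treat the first row as the header and emit a list of records."""
    header, *body = rows
    return json.dumps([dict(zip(header, r)) for r in body])


# A sheet with a title row above the table and an unused third column.
grid = [
    ["Report 2018", "", ""],
    ["Item", "Qty", ""],
    ["Pens", "3", ""],
    ["Ink", "1", ""],
]
table = extract_table(grid, 1, 0, 3, 1)
```

    Calling `to_csv(table)` or `to_json(table)` then yields only the table itself, without the title row or the empty column.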

    Table identification and reconstruction in spreadsheets

    Spreadsheets are one of the most successful content generation tools, used in almost every enterprise to perform data transformation, visualization, and analysis. The high degree of freedom provided by these tools results in very complex sheets, intermingling the actual data with formatting, formulas, layout artifacts, and textual metadata. To unlock the wealth of data contained in spreadsheets, a human analyst will often have to understand and transform the data manually. To overcome this cumbersome process, we propose a framework that is able to automatically infer the structure and extract the data from these documents in a canonical form. In this paper, we describe our heuristics-based method for discovering tables in spreadsheets, given that each cell is classified as either header, attribute, metadata, data, or derived. Experimental results on a real-world dataset of 439 worksheets (858 tables) show that our approach is feasible and effectively identifies tables within partially structured spreadsheets. Peer reviewed. Postprint (author's final draft).
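
    As a rough illustration of how cell labels can drive table discovery, the sketch below applies one simple, assumed heuristic at row level: a row of header cells opens a table, rows with data or derived cells extend it, and blank or metadata rows are skipped. It is not the heuristics framework of the paper, and `find_tables` is a hypothetical name.

```python
def find_tables(row_labels):
    """Split a sheet into candidate tables.

    `row_labels` has one entry per sheet row: the set of cell labels
    found in that row (an empty set for a blank row).
    Returns (first_row, last_row) pairs, zero-based and inclusive.
    """
    tables = []
    start = end = None  # header row and last body row of the open table
    for i, labs in enumerate(row_labels):
        if labs & {"data", "derived"}:
            if start is not None:
                end = i                      # body row extends the table
        elif "header" in labs:
            if end is not None:              # previous table is complete
                tables.append((start, end))
                start, end = i, None
            elif start is None:              # first header row of a table
                start = i
            # otherwise: a further row of a multi-row header
    if start is not None and end is not None:
        tables.append((start, end))
    return tables
```

    Because blank rows do not close a table here, a table with an empty row in the middle stays in one piece; the price is that two tables stacked without a header between them would be merged.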

    Layout inference and table detection in spreadsheet document

    No full text
    Spreadsheet applications have evolved to be a tool of great importance for businesses, open data, and scientific communities. Using these applications, users can perform various transformations, generate new content, analyze and format data such that they are visually comprehensive. The same data can be presented in different ways, depending on the preferences and the intentions of the user. These functionalities make spreadsheets user-friendly, but not as much machine-friendly. When it comes to integrating with other sources, the free-for-all nature of spreadsheets is disadvantageous. It is rather difficult to algorithmically infer the structure of the data when they are intermingled with formatting, formulas, layout artifacts, and textual metadata. Therefore, user involvement is often required, which results in cumbersome and time-consuming tasks. Overall, the lack of automatic processing methods limits our ability to explore and reuse a great amount of rich data stored into partially-structured documents such as spreadsheets. In this thesis, we tackle this open challenge, which so far has been scarcely investigated in literature. Specifically, we are interested in extracting tabular data from spreadsheets, since they hold concise, factual, and to a large extend structured information. It is easier to process such information, in order to make it available to other applications. For instance, spreadsheet (tabular) data can be loaded into databases. Thus, these data would become instantly available to existing or new business processes. Furthermore, we can eliminate the risk of losing valuable company knowledge, by moving data or integrating spreadsheets with other more sophisticated information management systems. To achieve the aforementioned objectives and advancements, in this thesis, we develop a spreadsheet processing pipeline. 
The requirements for this pipeline were derived from a large scale empirical analysis of real-world spreadsheets, from business and Web settings. Specifically, we propose a series of specialized steps that build on top of each other with the goal of discovering the structure of data in spreadsheet documents. Our approach is bottom-up, as it starts from the smallest unit (i.e., the cell) to ultimately arrive at the individual tables of the sheet. Additionally, this thesis makes use of sophisticated machine learning and optimization techniques. In particular, we apply these techniques for layout analysis and table detection in spreadsheets. We target highly diverse sheet layouts, with one or multiple tables and arbitrary arrangement of contents. Moreover, we foresee the presence of textual metadata and other non-tabular data in the sheet. Furthermore, we work even with problematic tables (e.g., containing empty rows/columns and missing values). Finally, we bring flexibility to our approach. This not only allows us to tackle the above-mentioned challenges but also to reuse our solution for different (spreadsheet) datasets.Els fulls de càlcul s’empren massivament en molts dominis i contexts diferents, ja que proporcionen una àmplia gamma de funcionalitats, bàsiques i avançades, de gestió de dades. D’aquesta manera, donen suport a la recollida, transformació, anàlisi i visualització de dades. A la mateixa vegada, els fulls de càlcul tenen una interfície amigable i intuïtiva i tenen un cost molt baix d’implantació. Aplicacions de full de càlcul molt conegudes, com OpenOffice, LibreOffice, Google Sheets i Gnumeric, poden utilitzar-se de forma gratuïta i d’altres, com Microsoft Excel, són a l’abast d’una gran majoria d’usuaris. Per tant, han esdevingut molt populars tant per a novells com per professionals. Com a resultat, un gran volum de dades valuoses resideixen en aquests documents. 
Són de particular interès les dades que es presenten en format tabular dins dels fulls de càlcul, ja que proporcionen informació concreta, factual i parcialment estructurada. Com a conseqüència, hi ha interès en transferir dades tabulars des de fulls de càlcul a bases de dades. Això permetria que els fulls de càlcul es converteixin en una font directa de dades per a processos empresarials, i introduir aquestes dades als magatzems de dades i integrar-les amb altres fonts. Un pas més enllà, els fulls de càlcul juntament amb altres documents en brut es poden emmagatzemar en repositoris de dades centralitzats avançats, com per exemple, els data lake. Un cop al data lake, es podran fer servir (sota demanda) per a diverses tasques i aplicacions. Tot plegat, l’objectiu és fer accessibles les dades emmagatzemades als fulls de càlcul. Malgrat tot, hi ha reptes considerables en el processament i comprensió automàtica d’aquests documents. Els fulls de càlcul estan dissenyats principalment per al consum humà i, per tant, afavoreixen la personalització i la comprensió visual. Les dades sovint s’entrellacen amb formatació, fórmules, artefactes de disseny i metadades textuals, que porten informació específica del domini o fins i tot informació específica de l’usuari. Al mateix full es poden trobar diverses taules, amb una estructura i disseny diferents. A més, el format de cada taula no es declara a priori, és a dir, no hi ha cap mecanisme per definir l’estructura d’una taula, com passa a les bases de dades. Per aquest motiu, els fulls de càlcul es coneixen com a fonts de dades parcialment estructurades, amb un grau rellevant d'informació implícita. A la literatura, la comprensió automàtica de les dades emmagatzemades en fulls de càlcul s'ha investigat superficialment, sovint assumint el mateix format uniforme de taula a tots els fulls de càlcul. 
However, because tabular data can be structured in spreadsheets in many different ways, the assumption of a uniform layout either excludes a substantial number of tables from the extraction process or leads to inaccurate results. In this thesis, we address fundamental tasks that contribute to more accurate information extraction from spreadsheets. We propose intuitive and effective methods for layout analysis and table detection in spreadsheets. One of our main objectives is to eliminate most of the assumptions made by the current state of the art. To do so, we consider highly heterogeneous tabular structures, contained in sheets with one or more tables. Additionally, we anticipate the presence of metadata and other kinds of non-tabular data on the same sheet. Finally, we use optimization and machine learning techniques to identify the structure of the tables. This brings flexibility to our approach, allowing it to work even with complex or malformed tables. This flexibility makes our methods transferable to new spreadsheet datasets with data from other domains. Therefore, we are not limited to domains or configurations
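
The bottom-up idea in the abstract above (start from individual cells, end at table candidates) can be illustrated with a small sketch. This is not the thesis's method: the abstract describes machine learning and optimization techniques, whereas the code below only groups already-labeled, non-empty cells into 4-connected regions and reports their bounding boxes. The function name `find_regions`, the grid representation, and the labels `"header"` and `"data"` are assumptions made for illustration.

```python
from collections import deque


def find_regions(grid):
    """Group non-empty cells into 4-connected regions.

    `grid` is a list of rows; each cell holds a label string or None
    for an empty cell (a hypothetical representation). Returns one
    bounding box (top, left, bottom, right) per region, in scan order.
    """
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    seen = set()
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] is None or (r, c) in seen:
                continue
            # Breadth-first flood fill starting from an unvisited cell.
            queue = deque([(r, c)])
            seen.add((r, c))
            top, left, bottom, right = r, c, r, c
            while queue:
                cr, cc = queue.popleft()
                top, bottom = min(top, cr), max(bottom, cr)
                left, right = min(left, cc), max(right, cc)
                for nr, nc in ((cr - 1, cc), (cr + 1, cc),
                               (cr, cc - 1), (cr, cc + 1)):
                    inside = 0 <= nr < rows and 0 <= nc < cols
                    if inside and grid[nr][nc] is not None \
                            and (nr, nc) not in seen:
                        seen.add((nr, nc))
                        queue.append((nr, nc))
            boxes.append((top, left, bottom, right))
    return boxes


# Two small tables on one sheet, separated by empty cells.
sheet = [
    ["header", "header", None, None],
    ["data",   "data",   None, "header"],
    [None,     None,     None, "data"],
]
print(find_regions(sheet))
```

A naive grouping like this splits a table wherever an empty row or column interrupts it, which is one of the problematic cases the abstract says the thesis is designed to handle.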

    Layout Inference and Table Detection in Spreadsheet Documents

    No full text
    Spreadsheets have found wide use in many different domains and settings. They provide a broad range of both basic and advanced functionalities. In this way, they can support data collection, transformation, analysis, and reporting. At the same time, spreadsheets maintain a friendly and intuitive interface. Additionally, they come at little or no cost. Well-known spreadsheet applications, such as OpenOffice, LibreOffice, Google Sheets, and Gnumeric, are free to use. Moreover, Microsoft Excel is widely available, with millions of users worldwide. Thus, spreadsheets are not only powerful tools, but also have a very low barrier to entry. Therefore, they have become very popular with novices and professionals alike. As a result, a large volume of valuable data resides in these documents. Of particular interest are spreadsheet data in tabular form, since they provide concise, factual, and to a large extent structured information. One natural progression is to transfer tabular data from spreadsheets to databases. This would allow spreadsheets to become a direct source of data for existing or new business processes. It would be easier to ingest them into data warehouses and to integrate them with other sources. Besides databases, however, there are other means to work with spreadsheet data. New paradigms, like NoDB, advocate querying directly from raw documents. Going one step further, spreadsheets together with other raw documents can be stored in a sophisticated centralized repository, i.e., a data lake. From then on they can serve (on-demand) various tasks and applications. All in all, by making spreadsheet data easily accessible, we can prevent information silos, i.e., valuable knowledge being isolated and scattered in multiple spreadsheet documents. Yet, there are considerable challenges to the automatic processing and understanding of these documents. 
After all, spreadsheets are designed primarily for human consumption, and as such, they favor customization and visual comprehension. Data are often intermingled with formatting, formulas, layout artifacts, and textual metadata, which carry domain-specific or even user-specific information (i.e., personal preferences). Multiple tables, with different layouts and structures, can be found on the same sheet. Most importantly, the structure of the tables is not known, i.e., not explicitly given by the spreadsheet documents. Altogether, spreadsheets are better described as partially structured, with a significant degree of implicit information. In the literature, the automatic understanding of spreadsheet data has been investigated only scarcely, often assuming just the same uniform table layout. However, due to the manifold possibilities to structure tabular data in spreadsheets, the assumption of a uniform layout either excludes a substantial number of tables from the extraction process or leads to inaccurate results. In this thesis, we primarily address two fundamental tasks that can lead to more accurate information extraction from spreadsheet documents. Namely, we propose intuitive and effective approaches for layout analysis and table detection in spreadsheets. Our overall solution, however, is designed as a processing pipeline, where specialized steps build on top of each other to discover the tabular data. One of our main objectives is to eliminate most of the assumptions from related work. Instead, we target highly diverse sheet layouts, with one or multiple tables. At the same time, we foresee the presence of textual metadata and other non-tabular data in the sheet. Furthermore, we make use of sophisticated machine learning and optimization techniques. This brings flexibility to our approach, allowing it to work even with complex or malformed tables. Moreover, this intended flexibility makes our approaches transferable to new spreadsheet datasets. 
Thus, we are not bound to specific domains or settings.

1 INTRODUCTION
  1.1 Motivation
  1.2 Contributions
  1.3 Outline
2 FOUNDATIONS AND RELATED WORK
  2.1 The Evolution of Spreadsheet Documents
    2.1.1 Spreadsheet User Interface and Functionalities
    2.1.2 Spreadsheet File Formats
    2.1.3 Spreadsheets Are Partially-Structured
  2.2 Analysis and Recognition in Electronic Documents
    2.2.1 A General Overview of DAR
    2.2.2 DAR in Spreadsheets
  2.3 Spreadsheet Research Areas
    2.3.1 Layout Inference and Table Recognition
    2.3.2 Unifying Databases and Spreadsheets
    2.3.3 Spreadsheet Software Engineering
    2.3.4 Data Wrangling Approaches
3 AN EMPIRICAL STUDY OF SPREADSHEET DOCUMENTS
  3.1 Available Corpora
  3.2 Creating a Gold Standard Dataset
    3.2.1 Initial Selection
    3.2.2 Annotation Methodology
  3.3 Dataset Analysis
    3.3.1 Takeaways from Business Spreadsheets
    3.3.2 Comparison Between Domains
  3.4 Summary and Discussion
    3.4.1 Datasets for Experimental Evaluation
    3.4.2 A Processing Pipeline
4 LAYOUT ANALYSIS
  4.1 A Method for Layout Analysis in Spreadsheets
  4.2 Feature Extraction
    4.2.1 Content Features
    4.2.2 Style Features
    4.2.3 Font Features
    4.2.4 Formula and Reference Features
    4.2.5 Spatial Features
    4.2.6 Geometrical Features
  4.3 Cell Classification
    4.3.1 Classification Datasets
    4.3.2 Classifiers and Assessment Methods
    4.3.3 Optimum Under-Sampling
    4.3.4 Feature Selection
    4.3.5 Parameter Tuning
    4.3.6 Classification Evaluation
  4.4 Layout Regions
  4.5 Summary and Discussions
5 CLASSIFICATION POST-PROCESSING
  5.1 Dataset for Post-Processing
  5.2 Pattern-Based Revisions
    5.2.1 Misclassification Patterns
    5.2.2 Relabeling Cells
    5.2.3 Evaluating the Patterns
  5.3 Region-Based Revisions
    5.3.1 Standardization Procedure
    5.3.2 Extracting Features from Regions
    5.3.3 Identifying Misclassified Regions
    5.3.4 Relabeling Misclassified Regions
  5.4 Summary and Discussion
6 TABLE DETECTION
  6.1 A Method for Table Detection in Spreadsheets
  6.2 Preliminaries
    6.2.1 Introducing a Graph Model
    6.2.2 Graph Partitioning for Table Detection
    6.2.3 Pre-Processing for Table Detection
  6.3 Rule-Based Detection
    6.3.1 Remove and Conquer
  6.4 Genetic-Based Detection
    6.4.1 Undirected Graph
    6.4.2 Header Cluster
    6.4.3 Quality Metrics
    6.4.4 Objective Function
    6.4.5 Weight Tuning
    6.4.6 Genetic Search
  6.5 Experimental Evaluation
    6.5.1 Testing Datasets
    6.5.2 Training Datasets
    6.5.3 Tuning Rounds
    6.5.4 Search and Assessment
    6.5.5 Evaluation Results
  6.6 Summary and Discussions
7 XLINDY: A RESEARCH PROTOTYPE
  7.1 Interface and Functionalities
    7.1.1 Front-end Walkthrough
  7.2 Implementation Details
    7.2.1 Interoperability
    7.2.2 Efficient Reads
  7.3 Information Extraction
  7.4 Summary and Discussions
8 CONCLUSION
  8.1 Summary of Contributions
  8.2 Directions of Future Work
BIBLIOGRAPHY
LIST OF FIGURES
LIST OF TABLES
A ANALYSIS OF REDUCED SAMPLES
B TABLE DETECTION WITH TIRS
  B.1 Tables in TIRS
  B.2 Pairing Fences with Data Regions
  B.3 Heuristics Framewor