Querying for model-driven spreadsheets
Master's dissertation in Informatics Engineering (Dissertação de mestrado em Engenharia Informática).
Spreadsheets are used for a diverse range of purposes, from simple applications to complete information systems. In all of these cases, they are frequently used as data repositories that can grow tremendously in size, and as the amount of data grows, so do the frustration and the challenge of extracting information from them.
This thesis project focuses on the problem of spreadsheet querying. Specifically, the objective is to study competing query languages meticulously and carefully, and to propose our own expressive and composable query language for spreadsheets, in which intuitive queries can be defined. This approach builds on a model-driven spreadsheet development environment, and queries are expressed by referencing ClassSheet model entities instead of the actual data. Furthermore, this language shall be integrated into the MDSheet framework, taking into account evolution mechanisms and the auto-generation of models for query results, and shall rely on Google's QUERY function for spreadsheets.
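For context, Google's QUERY function, which the proposed language builds on, takes a cell range and a SQL-like query string (with an optional header-row count). A hypothetical invocation; the sheet name, range, and column letters are purely illustrative:

```
=QUERY(Sales!A1:C100, "SELECT A, SUM(C) WHERE B = 'North' GROUP BY A", 1)
```

Queries over ClassSheet model entities, as proposed here, would compile down to range-and-column expressions of this form instead of requiring the user to write them against raw cell coordinates.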
"Practicing safe spreadsheeting": a case study examination of spreadsheet use and the challenges and risks associated with their use in a private sector business operating in Halifax, Nova Scotia (Canada)
1 online resource (157 p.) : col. ill. Includes abstract and appendices. Includes bibliographical references (p. 147-156).
Spreadsheets have been and continue to be one of the most commonly used analytical tools in business today. With the increasing pressure on businesses to make sense of ever-larger data sets, it begs the question: are they the "right" tool to use? The purpose of this case study is to examine how spreadsheets are being used, and the risks associated with their use, within a private sector organization operating in Halifax, Nova Scotia, using a subject group of participants comprising various types of spreadsheet users working in different functional areas of the organization. The data were collected through a series of one-on-one interviews with each participant, using a standard list of questions developed specifically for this research study, that captured the participants' experiences using spreadsheets within their organization. The aim of this case study is to determine the extent of use of and reliance on spreadsheets by the participants and their organization, and the challenges and risks (due to potential errors) associated with their use. Additionally, this case study looks at potential techniques and tools that may be used to mitigate the risks of spreadsheet errors and the organization's reliance on spreadsheets, as well as offering an assessment of the capabilities (skill level) of the spreadsheet users/creators.
Compiling and optimizing spreadsheets for FPGA and multicore execution
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007. "September 2007." Includes bibliographical references (p. 102-104).
A major barrier to developing systems on multicore and FPGA chips is the lack of an easy-to-use development environment. This thesis presents the RhoZeta spreadsheet compiler and Catalyst optimization system for programming multiprocessors and FPGAs. Any spreadsheet frontend may be extended to work with RhoZeta's multiple interpreters and behavioral abstraction mechanisms. RhoZeta synchronizes a variety of cell interpreters acting on a global memory space. RhoZeta can also compile a group of cells to multithreaded C or Verilog. The result is an easy-to-use interface for programming multicore microprocessors and FPGAs. A spreadsheet environment presents parallelism and locality issues of modern hardware directly to the user and allows for a simple global memory synchronization model. Catalyst is a spreadsheet graph rewriting system based on performing behaviorally invariant guarded atomic actions while a system is being interpreted by RhoZeta. A number of optimization macros were developed to perform speculation, resource sharing, and propagation of static assignments through a circuit. Parallelization of a 64-bit serial leading-zero-counter is demonstrated with Catalyst. Fault tolerance macros were also developed in Catalyst to protect against dynamic faults and to offset costs associated with testing semiconductors for static defects. A model for partitioning, placing, and profiling spreadsheet execution in a heterogeneous hardware environment is also discussed. The RhoZeta system has been used to design several multithreaded and FPGA applications, including a RISC emulator and a MIDI-controlled modular synthesizer.
by Amir Hirsch. M.Eng.
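The cell-as-dataflow view behind a spreadsheet compiler can be sketched in miniature: each cell is either a constant or a function of other cells, and evaluation follows the dependency graph. The toy Python sketch below is illustrative only, not RhoZeta's actual implementation; the cell names and representation are invented:

```python
# Toy model of spreadsheet evaluation as a dataflow graph.
# Each cell is a constant or a function that pulls other cells via `get`.
# A compiler like RhoZeta extracts such a graph and emits parallel code;
# here we simply interpret it, memoizing each cell so it is computed once
# (independent cells could, in principle, be evaluated in parallel).

def evaluate(cells):
    values = {}

    def get(name):
        # Demand-driven evaluation with memoization.
        if name not in values:
            cell = cells[name]
            values[name] = cell(get) if callable(cell) else cell
        return values[name]

    for name in cells:
        get(name)
    return values

sheet = {
    "A1": 2,
    "A2": 3,
    "A3": lambda get: get("A1") + get("A2"),   # =A1+A2
    "B1": lambda get: get("A3") * 10,          # =A3*10
}

print(evaluate(sheet))  # A3 -> 5, B1 -> 50
```

A real system additionally needs cycle detection and a scheduling policy for concurrent cells; this sketch omits both.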
Layout Inference and Table Detection in Spreadsheet Documents
Spreadsheets have found wide use in many different domains and settings. They provide a broad range of both basic and advanced functionalities, supporting data collection, transformation, analysis, and reporting. Nevertheless, at the same time, spreadsheets maintain a friendly and intuitive interface. Additionally, they entail no or very low cost: well-known spreadsheet applications, such as OpenOffice, LibreOffice, Google Sheets, and Gnumeric, are free to use, and Microsoft Excel is widely available, with millions of users worldwide. Thus, spreadsheets are not only powerful tools but also have a very low barrier to entry. Therefore, they have become very popular with novices and professionals alike. As a result, a large volume of valuable data resides in these documents.
Of particular interest are spreadsheet data coming in tabular form, since they provide concise, factual, and to a large extent structured information. One natural progression is to transfer tabular data from spreadsheets to databases. This would allow spreadsheets to become a direct source of data for existing or new business processes: it would be easier to ingest them into data warehouses and to integrate them with other sources. Nevertheless, besides databases, there are other means to work with spreadsheet data. New paradigms, like NoDB, advocate querying directly from raw documents. Going one step further, spreadsheets, together with other raw documents, can be stored in a sophisticated centralized repository, i.e., a data lake, from which they can serve (on-demand) various tasks and applications. All in all, by making spreadsheet data easily accessible, we can prevent information silos, i.e., valuable knowledge being isolated and scattered across multiple spreadsheet documents.
Yet, there are considerable challenges to the automatic processing and understanding of these documents. After all, spreadsheets are designed primarily for human consumption, and as such, they favor customization and visual comprehension. Data are often intermingled with formatting, formulas, layout artifacts, and textual metadata, which carry domain-specific or even user-specific information (i.e., personal preferences). Multiple tables, with different layout and structure, can be found on the same sheet. Most importantly, the structure of the tables is not known, i.e., not explicitly given by the spreadsheet documents. Altogether, spreadsheets are better described as partially structured, with a significant degree of implicit information.
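To make the "partially structured" point concrete: a spreadsheet file records values and formatting per cell, but not cell roles such as header or data, so those roles must be inferred from per-cell signals. A minimal rule-based sketch, with invented features and labels standing in for the learned classifiers discussed in this thesis:

```python
# Toy sketch of per-cell layout inference. The features and rules are
# illustrative simplifications, not the thesis's actual method.

def cell_features(value, bold):
    """Extract a few simple per-cell signals."""
    return {
        "is_text": isinstance(value, str),
        "is_bold": bold,
        "is_empty": value is None,
    }

def classify_cell(feats):
    """Assign a coarse layout role from the features."""
    if feats["is_empty"]:
        return "Empty"
    if feats["is_bold"] and feats["is_text"]:
        return "Header"
    return "Data"

# A hypothetical 2x2 sheet fragment: (row, col) -> (value, bold?)
sheet = {
    (0, 0): ("Region", True), (0, 1): ("Sales", True),
    (1, 0): ("North", False), (1, 1): (1200, False),
}

labels = {pos: classify_cell(cell_features(v, b))
          for pos, (v, b) in sheet.items()}
print(labels)  # row 0 cells -> Header, row 1 cells -> Data
```

Real sheets defeat such rules quickly (unformatted headers, bold data, multiple tables per sheet), which is precisely why the thesis turns to machine-learned classification over richer feature sets.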
In the literature, the automatic understanding of spreadsheet data has only been scarcely investigated, often under the assumption of a single uniform table layout. However, due to the manifold possibilities for structuring tabular data in spreadsheets, the assumption of a uniform layout either excludes a substantial number of tables from the extraction process or leads to inaccurate results.
In this thesis, we primarily address two fundamental tasks that can lead to more accurate information extraction from spreadsheet documents. Namely, we propose intuitive and effective approaches for layout analysis and table detection in spreadsheets. Nevertheless, our overall solution is designed as a processing pipeline, where specialized steps build on top of each other to discover the tabular data. One of our main objectives is to eliminate most of the assumptions made in related work. Instead, we target highly diverse sheet layouts, with one or multiple tables. At the same time, we foresee the presence of textual metadata and other non-tabular data in the sheet. Furthermore, we make use of sophisticated machine learning and optimization techniques. This brings flexibility to our approach, allowing it to work even with complex or malformed tables. Moreover, this intended flexibility makes our approaches transferable to new spreadsheet datasets. Thus, we are not bound to specific domains or settings.
1 INTRODUCTION
1.1 Motivation
1.2 Contributions
1.3 Outline
2 FOUNDATIONS AND RELATED WORK
2.1 The Evolution of Spreadsheet Documents
2.1.1 Spreadsheet User Interface and Functionalities
2.1.2 Spreadsheet File Formats
2.1.3 Spreadsheets Are Partially-Structured
2.2 Analysis and Recognition in Electronic Documents
2.2.1 A General Overview of DAR
2.2.2 DAR in Spreadsheets
2.3 Spreadsheet Research Areas
2.3.1 Layout Inference and Table Recognition
2.3.2 Unifying Databases and Spreadsheets
2.3.3 Spreadsheet Software Engineering
2.3.4 Data Wrangling Approaches
3 AN EMPIRICAL STUDY OF SPREADSHEET DOCUMENTS
3.1 Available Corpora
3.2 Creating a Gold Standard Dataset
3.2.1 Initial Selection
3.2.2 Annotation Methodology
3.3 Dataset Analysis
3.3.1 Takeaways from Business Spreadsheets
3.3.2 Comparison Between Domains
3.4 Summary and Discussion
3.4.1 Datasets for Experimental Evaluation
3.4.2 A Processing Pipeline
4 LAYOUT ANALYSIS
4.1 A Method for Layout Analysis in Spreadsheets
4.2 Feature Extraction
4.2.1 Content Features
4.2.2 Style Features
4.2.3 Font Features
4.2.4 Formula and Reference Features
4.2.5 Spatial Features
4.2.6 Geometrical Features
4.3 Cell Classification
4.3.1 Classification Datasets
4.3.2 Classifiers and Assessment Methods
4.3.3 Optimum Under-Sampling
4.3.4 Feature Selection
4.3.5 Parameter Tuning
4.3.6 Classification Evaluation
4.4 Layout Regions
4.5 Summary and Discussions
5 CLASSIFICATION POST-PROCESSING
5.1 Dataset for Post-Processing
5.2 Pattern-Based Revisions
5.2.1 Misclassification Patterns
5.2.2 Relabeling Cells
5.2.3 Evaluating the Patterns
5.3 Region-Based Revisions
5.3.1 Standardization Procedure
5.3.2 Extracting Features from Regions
5.3.3 Identifying Misclassified Regions
5.3.4 Relabeling Misclassified Regions
5.4 Summary and Discussion
6 TABLE DETECTION
6.1 A Method for Table Detection in Spreadsheets
6.2 Preliminaries
6.2.1 Introducing a Graph Model
6.2.2 Graph Partitioning for Table Detection
6.2.3 Pre-Processing for Table Detection
6.3 Rule-Based Detection
6.3.1 Remove and Conquer
6.4 Genetic-Based Detection
6.4.1 Undirected Graph
6.4.2 Header Cluster
6.4.3 Quality Metrics
6.4.4 Objective Function
6.4.5 Weight Tuning
6.4.6 Genetic Search
6.5 Experimental Evaluation
6.5.1 Testing Datasets
6.5.2 Training Datasets
6.5.3 Tuning Rounds
6.5.4 Search and Assessment
6.5.5 Evaluation Results
6.6 Summary and Discussions
7 XLINDY: A RESEARCH PROTOTYPE
7.1 Interface and Functionalities
7.1.1 Front-end Walkthrough
7.2 Implementation Details
7.2.1 Interoperability
7.2.2 Efficient Reads
7.3 Information Extraction
7.4 Summary and Discussions
8 CONCLUSION
8.1 Summary of Contributions
8.2 Directions of Future Work
BIBLIOGRAPHY
LIST OF FIGURES
LIST OF TABLES
A ANALYSIS OF REDUCED SAMPLES
B TABLE DETECTION WITH TIRS
B.1 Tables in TIRS
B.2 Pairing Fences with Data Regions
B.3 Heuristics Framework
- …