434 research outputs found

    Data-Parallel Spreadsheet Programming

    Get PDF

    Querying for model-driven spreadsheetsd

    Get PDF
    Dissertação de mestrado em Engenharia InformåticaSpreadsheets are used for a diverse number of objectives, that range from simple applications to complete information systems. In all of these cases, they are frequently used as data repositories that can grow tremendously in size, and as the amount of the data grows, the frustration and challenge to withdraw information out of them also grows. This Thesis project focuses on the problem of spreadsheet querying. Speci cally, the objective is to meticulously and carefully study competing query languages, and proposing our very own expressive and composable query language to be used in spreadsheets, where intuitive queries can be de ned. This approach builds on a model-driven spreadsheet development environment, and queries are expressed referencing ClassSheet model entities instead of the actual data. Furthermore, this language shall be integrated into the MDSheet framework, taking into account evolution mechanisms, auto-generation of models for query results, and shall rely on Google's QUERY function for spreadsheets.As folhas de c alculo s~ao utilizadas para diversos ns, desde aplica c~oes simples at e sistemas de informa c~ao completos. Entre todos estes casos, s~ao frequentemente utilizadas para armazenar grandes volumes de dados, sendo que, a medida que o reposit orio cresce, a frusta c~ao e o desa o de recolher informa c~ao tamb em aumenta. O projeto desta disserta c~ao foca-se no problema da consulta e interroga c~ao de folhas de c alculo. Especi camente, o objetivo e estudar de forma cuidada e meticulosa diversas linguagens de interroga c~ao existentes, e prop^or a nossa pr opria linguagem para ser utilizada em folhas de c alculo, que se caracteriza por ser uma linguagem expressiva, que possibilita a composi c~ao de interroga c~oes e a de ni c~ao das mesmas de forma intuitiva. A abordagem a utilizar passa pela utiliza c~ao de folhas de c alculo dirigidas por modelos, sendo as interroga c~oes expressas atrav es de entidades do modelo ClassSheet em vez de dados em concreto. Al em disto, a linguagem desenvolvida ser a integrada no framework MDSheet, considerando diversos mecanismos de evolu c~ao, gera c~ao autom atica de modelos para os resultados de uma interroga c~ao, e ser a baseada na fun c~ao QUERY desenvolvida pela Google para a interroga c~ao de folhas de c alculo

    “Practicing safe spreadsheeting” : a case study examination of spreadsheet use and the challenges and risks associated with their use in a private sector business operating in Halifax, Nova Scotia (Canada)

    Get PDF
    1 online resource (157 p.) : col. ill.Includes abstract and appendices.Includes bibliographical references (p. 147-156).Spreadsheets have been and continue to be one of most commonly used analytical tools by many businesses today. With the increasing pressures on businesses today associated with making sense of increasingly larger data sets, it begs the question: are they the ‘right’ tool to be used. The purpose of this case study is to examine how spreadsheets are being used and the risks associated with their (spreadsheet) use within a private sector organization operating in Halifax, Nova Scotia, using a subject group of participants comprised of various types of spreadsheet users working in different functional areas of the organization. The data were collected through a series of one-on-one interviews with each participant, using a standard list of questions developed specifically for this research study, that captured the participants’ experiences using spreadsheets within their organization. The aim of this case study is to determine the extent of use and reliance on spreadsheets by the participants and their organization, and the challenges and risks (due to potential errors) associated with their use. Additionally this case study also looks at potential techniques and tools that may be used to mitigate the risks of spreadsheet errors and the organizations reliance on spreadsheets, as well as offering an assessment of the capabilities (skill level) of the spreadsheet users/creators

    Compiling and optimizing spreadsheets for FPGA and multicore execution

    Get PDF
    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007."September 2007."Includes bibliographical references (p. 102-104).A major barrier to developing systems on multicore and FPGA chips is an easy-to-use development environment. This thesis presents the RhoZeta spreadsheet compiler and Catalyst optimization system for programming multiprocessors and FPGAs. Any spreadsheet frontend may be extended to work with RhoZeta's multiple interpreters and behavioral abstraction mechanisms. RhoZeta synchronizes a variety of cell interpreters acting on a global memory space. RhoZeta can also compile a group of cells to multithreaded C or Verilog. The result is an easy-to-use interface for programming multicore microprocessors and FPGAs. A spreadsheet environment presents parallelism and locality issues of modem hardware directly to the user and allows for a simple global memory synchronization model. Catalyst is a spreadsheet graph rewriting system based on performing behaviorally invariant guarded atomic actions while a system is being interpreted by RhoZeta. A number of optimization macros were developed to perform speculation, resource sharing and propagation of static assignments through a circuit. Parallelization of a 64-bit serial leading-zero-counter is demonstrated with Catalyst. Fault tolerance macros were also developed in Catalyst to protect against dynamic faults and to offset costs associated with testing semiconductors for static defects. A model for partitioning, placing and profiling spreadsheet execution in a heterogeneous hardware environment is also discussed. The RhoZeta system has been used to design several multithreaded and FPGA applications including a RISC emulator and a MIDI controlled modular synthesizer.by Amir Hirsch.M.Eng

    Declarative, Parallel Programming For End-User Development

    Get PDF

    Layout Inference and Table Detection in Spreadsheet Documents

    Get PDF
    Spreadsheets have found wide use in many different domains and settings. They provide a broad range of both basic and advanced functionalities. In this way, they can support data collection, transformation, analysis, and reporting. Nevertheless, at the same time spreadsheets maintain a friendly and intuitive interface. Additionally, they entail no to very low cost. Well-known spreadsheet applications, such as OpenOffice, LibreOffice, Google Sheets, and Gnumeric, are free to use. Moreover, Microsoft Excel is widely available, with millions of users worldwide. Thus, spreadsheets are not only powerful tools, but also have a very low entrance barrier. Therefore, they have become very popular with novices and professionals alike. As a result, a large volume of valuable data resides in these documents. From spreadsheets, of particular interest are data coming in tabular form, since they provide concise, factual, and to a large extend structured information. One natural progression is to transfer tabular data from spreadsheets to databases. This would allow spreadsheets to become a direct source of data for existing or new business processes. It would be easier to digest them into data warehouses and to integrate them with other sources. Nevertheless, besides databases, there are other means to work with spreadsheet data. New paradigms, like NoDB, advocate querying directly from raw documents. Going one step further, spreadsheets together with other raw documents can be stored in a sophisticated centralized repository, i.e., a data lake. From then on they can serve (on-demand) various tasks and applications. All in all, by making spreadsheet data easily accessible, we can prevent information silos, i.e., valuable knowledge being isolated and scattered in multiple spreadsheet documents. Yet, there are considerable challenges to the automatic processing and understanding of these documents. After all, spreadsheets are designed primarily for human consumption, and as such, they favor customization and visual comprehension. Data are often intermingled with formatting, formulas, layout artifacts, and textual metadata, which carry domain-specific or even user-specific information (i.e., personal preferences). Multiple tables, with different layout and structure, can be found on the same sheet. Most importantly, the structure of the tables is not known, i.e., not explicitly given by the spreadsheet documents. Altogether, spreadsheets are better described as partially structured, with a significant degree of implicit information. In literature, the automatic understanding of spreadsheet data has only been scarcely investigated, often assuming just the same uniform table layout. However, due to the manifold possibilities to structure tabular data in spreadsheets, the assumption of a uniform layout either excludes a substantial number of tables from the extraction process or leads to inaccurate results. In this thesis, we primarily address two fundamental tasks that can lead to more accurate information extraction from spreadsheet documents. Namely, we propose intuitive and effective approaches for layout analysis and table detection in spreadsheets. Nevertheless, our overall solution is designed as a processing pipeline, where specialized steps build on top of each other to discover the tabular data. One of our main objectives is to eliminate most of the assumptions from related work. Instead, we target highly diverse sheet layouts, with one or multiple tables. On the same time, we foresee the presence of textual metadata and other non-tabular data in the sheet. Furthermore, we make use of sophisticated machine learning and optimization techniques. This brings flexibility to our approach, allowing it to work even with complex or malformed tables. Moreover, this intended flexibility makes our approaches transferable to new spreadsheet datasets. Thus, we are not bounded to specific domains or settings.:1 INTRODUCTION 1.1 Motivation 1.2 Contributions 1.3 Outline 2 FOUNDATIONS AND RELATED WORK 2.1 The Evolution of Spreadsheet Documents 2.1.1 Spreadsheet User Interface and Functionalities 2.1.2 Spreadsheet File Formats 2.1.3 Spreadsheets Are Partially-Structured 2.2 Analysis and Recognition in Electronic Documents 2.2.1 A General Overview of DAR 2.2.2 DAR in Spreadsheets 2.3 Spreadsheet Research Areas 2.3.1 Layout Inference and Table Recognition 2.3.2 Unifying Databases and Spreadsheets 2.3.3 Spreadsheet Software Engineering 2.3.4 Data Wrangling Approaches 3 AN EMPIRICAL STUDY OF SPREADSHEET DOCUMENTS 3.1 Available Corpora 3.2 Creating a Gold Standard Dataset 3.2.1 Initial Selection 3.2.2 Annotation Methodology 3.3 Dataset Analysis 3.3.1 Takeaways from Business Spreadsheets 3.3.2 Comparison Between Domains 3.4 Summary and Discussion 3.4.1 Datasets for Experimental Evaluation 3.4.2 A Processing Pipeline 4 LAYOUT ANALYSIS 4.1 A Method for Layout Analysis in Spreadsheets 4.2 Feature Extraction 4.2.1 Content Features 4.2.2 Style Features 4.2.3 Font Features 4.2.4 Formula and Reference Features 4.2.5 Spatial Features 4.2.6 Geometrical Features 4.3 Cell Classification 4.3.1 Classification Datasets 4.3.2 Classifiers and Assessment Methods 4.3.3 Optimum Under-Sampling 4.3.4 Feature Selection 4.3.5 Parameter Tuning 4.3.6 Classification Evaluation 4.4 Layout Regions 4.5 Summary and Discussions 5 CLASSIFICATION POST-PROCESSING 5.1 Dataset for Post-Processing 5.2 Pattern-Based Revisions 5.2.1 Misclassification Patterns 5.2.2 Relabeling Cells 5.2.3 Evaluating the Patterns 5.3 Region-Based Revisions 5.3.1 Standardization Procedure 5.3.2 Extracting Features from Regions 5.3.3 Identifying Misclassified Regions 5.3.4 Relabeling Misclassified Regions 5.4 Summary and Discussion 6 TABLE DETECTION 6.1 A Method for Table Detection in Spreadsheets 6.2 Preliminaries 6.2.1 Introducing a Graph Model 6.2.2 Graph Partitioning for Table Detection 6.2.3 Pre-Processing for Table Detection 6.3 Rule-Based Detection 6.3.1 Remove and Conquer 6.4 Genetic-Based Detection 6.4.1 Undirected Graph 6.4.2 Header Cluster 6.4.3 Quality Metrics 6.4.4 Objective Function 6.4.5 Weight Tuning 6.4.6 Genetic Search 6.5 Experimental Evaluation 6.5.1 Testing Datasets 6.5.2 Training Datasets 6.5.3 Tuning Rounds 6.5.4 Search and Assessment 6.5.5 Evaluation Results 6.6 Summary and Discussions 7 XLINDY: A RESEARCH PROTOTYPE 7.1 Interface and Functionalities 7.1.1 Front-end Walkthrough 7.2 Implementation Details 7.2.1 Interoperability 7.2.2 Efficient Reads 7.3 Information Extraction 7.4 Summary and Discussions 8 CONCLUSION 8.1 Summary of Contributions 8.2 Directions of Future Work BIBLIOGRAPHY LIST OF FIGURES LIST OF TABLES A ANALYSIS OF REDUCED SAMPLES B TABLE DETECTION WITH TIRS B.1 Tables in TIRS B.2 Pairing Fences with Data Regions B.3 Heuristics Framewor
    • …
    corecore