44 research outputs found

Combining Spreadsheet Smells for Improved Fault Prediction

Author: Hofer Birgit
Jannach Dietmar
Koch Patrick
Schekotihin Konstantin
Wotawa Franz
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 26/05/2018

Spreadsheets are commonly used in organizations as a programming tool for business-related calculations and decision making. Since faults in spreadsheets can have severe business impacts, a number of approaches from general software engineering have been applied to spreadsheets in recent years, among them the concept of code smells. Smells can in particular be used for the task of fault prediction. An analysis of existing spreadsheet smells, however, revealed that the predictive power of individual smells can be limited. In this work we therefore propose a machine learning based approach which combines the predictions of individual smells by using an AdaBoost ensemble classifier. Experiments on two public datasets containing real-world spreadsheet faults show significant improvements in terms of fault prediction accuracy. Comment: 4 pages, 1 figure, to be published in 40th International Conference on Software Engineering: New Ideas and Emerging Results Trac
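The idea of combining weak smell-based predictions with AdaBoost can be illustrated with a minimal sketch. This is not the authors' implementation: the three binary smell flags, the toy labels, and the helper names (`stump`, `train_adaboost`, `predict`, `accuracy`) are all assumptions made for illustration, and each weak learner here is simply "smell j fires, so the cell is faulty".

```python
import math
from itertools import product

def stump(row, j, polarity):
    """Weak learner: vote `polarity` if smell j fires on this cell, else the opposite."""
    return polarity if row[j] else -polarity

def train_adaboost(X, y, rounds=3):
    """X: rows of 0/1 smell flags (hypothetical); y: +1 for faulty, -1 for clean."""
    n, m = len(X), len(X[0])
    weights = [1.0 / n] * n
    ensemble = []  # list of (alpha, smell_index, polarity)
    for _ in range(rounds):
        # Pick the single-smell stump with the lowest weighted error.
        best = None
        for j in range(m):
            for polarity in (1, -1):
                err = sum(w for w, row, label in zip(weights, X, y)
                          if stump(row, j, polarity) != label)
                if best is None or err < best[0]:
                    best = (err, j, polarity)
        err, j, polarity = best
        if err >= 0.5:  # no stump is better than chance; stop early
            break
        err = max(err, 1e-10)  # avoid division by zero for a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j, polarity))
        # Increase the weight of misclassified cells, then renormalize.
        weights = [w * math.exp(-alpha * label * stump(row, j, polarity))
                   for w, row, label in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, row):
    """Weighted vote of all selected stumps."""
    score = sum(alpha * stump(row, j, polarity) for alpha, j, polarity in ensemble)
    return 1 if score >= 0 else -1

# Toy data (an assumption): a cell is faulty when at least two of three smells fire.
X = [list(bits) for bits in product((0, 1), repeat=3)]
y = [1 if sum(row) >= 2 else -1 for row in X]

def accuracy(ensemble):
    return sum(predict(ensemble, row) == label for row, label in zip(X, y)) / len(X)
```

On this toy data no single smell separates faulty from clean cells, while the weighted vote of three stumps does, which mirrors the abstract's point that individual smells have limited predictive power.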

arXiv.org e-Print Archive

Crossref

SpreadCluster: Recovering Versioned Spreadsheets through Similarity-Based Clustering

Author: Dou Wensheng
Gao Chushu
Huang Tao
Wang Jie
Wei Jun
Xu Liang
Zhong Hua
Publication date: 27/04/2017

Version information plays an important role in spreadsheet understanding, maintenance, and quality improvement. However, end users rarely use version control tools to document spreadsheet version information. Thus, the spreadsheet version information is missing, and different versions of a spreadsheet coexist as individual and similar spreadsheets. Existing approaches try to recover spreadsheet version information by clustering these similar spreadsheets based on spreadsheet filenames or related email conversations. However, the applicability and accuracy of existing clustering approaches are limited because the necessary information (e.g., filenames and email conversations) is usually missing. We inspected the versioned spreadsheets in VEnron, which is extracted from the Enron Corporation. In VEnron, the different versions of a spreadsheet are clustered into an evolution group. We observed that the versioned spreadsheets in each evolution group exhibit certain common features (e.g., similar table headers and worksheet names). Based on this observation, we proposed an automatic clustering algorithm, SpreadCluster. SpreadCluster learns the criteria of features from the versioned spreadsheets in VEnron, and then automatically clusters spreadsheets with similar features into the same evolution group. We applied SpreadCluster to all spreadsheets in the Enron corpus. The evaluation result shows that SpreadCluster could cluster spreadsheets with higher precision and recall than the filename-based approach used by VEnron. Based on the clustering result of SpreadCluster, we further created a new versioned spreadsheet corpus, VEnron2, which is much bigger than VEnron. We also applied SpreadCluster to the other two spreadsheet corpora, FUSE and EUSES. The results show that SpreadCluster can cluster the versioned spreadsheets in these two corpora with high precision. Comment: 12 pages, MSR 201
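Grouping spreadsheets by shared features such as worksheet names and table headers can be sketched roughly as follows. This is not SpreadCluster itself: the Jaccard measure, the equal weighting of the two features, the fixed threshold of 0.6 (SpreadCluster learns its criteria from VEnron), and the sample file names are all assumptions chosen for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of strings."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def similarity(s1, s2):
    """Equal-weight mix of worksheet-name and header overlap (an assumed formula)."""
    return (0.5 * jaccard(s1["sheets"], s2["sheets"])
            + 0.5 * jaccard(s1["headers"], s2["headers"]))

def cluster(spreadsheets, threshold=0.6):
    """Put spreadsheets whose similarity reaches the threshold into the same group."""
    parent = list(range(len(spreadsheets)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(spreadsheets)):
        for j in range(i + 1, len(spreadsheets)):
            if similarity(spreadsheets[i], spreadsheets[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i, sheet in enumerate(spreadsheets):
        groups.setdefault(find(i), []).append(sheet["name"])
    return sorted(groups.values())

# Hypothetical input: two versions of one budget workbook and an unrelated file.
files = [
    {"name": "budget_v1.xls", "sheets": ["Summary", "Q1"],
     "headers": ["Item", "Cost", "Total"]},
    {"name": "budget_final.xls", "sheets": ["Summary", "Q1", "Q2"],
     "headers": ["Item", "Cost", "Total"]},
    {"name": "trades.xls", "sheets": ["Deals"],
     "headers": ["Counterparty", "Volume"]},
]
evolution_groups = cluster(files)
```

Here the two budget files end up in one group even though their filenames differ, which is the situation the abstract describes as hard for filename-based clustering.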

arXiv.org e-Print Archive

Crossref

Enron versus EUSES: A Comparison of Two Spreadsheet Corpora

Author: Jansen Bas
Publication date: 13/03/2015

Spreadsheets are widely used within companies and often form the basis for business decisions. Numerous cases are known where incorrect information in spreadsheets has led to incorrect decisions. Such cases underline the relevance of research on the professional use of spreadsheets. Recently a new dataset became available for research, containing over 15,000 business spreadsheets that were extracted from the Enron E-mail Archive. With this dataset, we 1) aim to obtain a thorough understanding of the characteristics of spreadsheets used within companies, and 2) compare the characteristics of the Enron spreadsheets with the EUSES corpus, which is the existing state-of-the-art set of spreadsheets that is frequently used in spreadsheet studies. Our analysis shows that 1) the majority of spreadsheets are not large in terms of worksheets and formulas, do not have a high degree of coupling, and their formulas are relatively simple; 2) the spreadsheets from the EUSES corpus are, with respect to the measured characteristics, quite similar to the Enron spreadsheets. Comment: In Proceedings of the 2nd Workshop on Software Engineering Methods in Spreadsheet

arXiv.org e-Print Archive

CiteSeerX

TU Delft Repository

FigShare

Design and implementation of queries for model-driven spreadsheets

Author: Cunha Jácome Miguel Costa
Fernandes João Paulo
Mendes Jorge
Pereira Rui
Saraiva João Alexandre
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015

This paper presents a domain-specific querying language for model-driven spreadsheets. We briefly show the design of the language and present in detail its implementation, from the denormalization of data and translation of our user-friendly query language to a more efficient query, to the execution of the query using Google. To validate our work, we executed an empirical study, comparing QuerySheet with an alternative spreadsheet querying tool, which produced positive results.

Universidade do Minho: RepositoriUM

Refactoring smelly spreadsheet models

Author: B.A. Nardi
J. Cunha
L. Beckwith
S. Badame
S. Peyton Jones
S.G. Powell
T. Arendt
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2014

Identifying bad design patterns in software is a successful and inspiring research trend. While these patterns do not necessarily correspond to software errors, they raise potentially problematic issues, often referred to as code smells, that can, for example, compromise maintainability or evolution. The identification of code smells in spreadsheets, which can be viewed as software development environments for non-professional programmers, has already been the subject of converging research by different groups. While these research groups have focused on detecting smells in concrete spreadsheets, or spreadsheet instances, in this paper we propose a comprehensive set of smells for abstract representations of spreadsheets, or spreadsheet models. We also propose a set of refactorings suggesting how spreadsheet models can become simpler to understand, manipulate and evolve. Finally, we present the integration of both smells and refactorings under the MDSheet framework. Part funded by ERDF - European Regional Development Fund through the COMPETE Programme (operational programme for competitiveness) and by National Funds through the FCT - Fundação para a Ciência e a Tecnologia within projects FCOMP-01-0124-FEDER-022701 and Network Sensing for Critical Systems Monitoring (NORTE-01-0124-FEDER-000058), ref. BIM-2013 BestCase RL3.2 UMINHO

Universidade do Minho: RepositoriUM

Crossref

Towards a catalog of spreadsheet smells

Author: B.A. Nardi
E.F. Codd
F. Chiang
F. Hermans
I. Heitlager
J. Cunha
M. Erwig
M. Fowler
M. Mäntylä
M.V. Mäntylä
R. Abraham
R.R. Panko
V. Levenshtein
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2012

Spreadsheets are considered to be the most widely used programming language in the world, and reports have shown that 90% of real-world spreadsheets contain errors. In this work, we try to identify spreadsheet smells, a concept adapted from software, which consists of a surface indication that usually corresponds to a deeper problem. Our smells have been integrated in a tool, and were computed for a large spreadsheet repository. Finally, the analysis of the results we obtained led to the refinement of our initial catalog.

Universidade do Minho: RepositoriUM

Crossref

A Literature Review of Spreadsheet Technology

Author: Bock Alexander
Publication date: 01/11/2016

The IT University of Copenhagen's Repository