Enron versus EUSES: A Comparison of Two Spreadsheet Corpora
Spreadsheets are widely used within companies and often form the basis for
business decisions. Numerous cases are known where incorrect information in
spreadsheets has led to incorrect decisions. Such cases underline the
relevance of research on the professional use of spreadsheets.
Recently a new dataset became available for research, containing over 15,000
business spreadsheets that were extracted from the Enron E-mail Archive. With
this dataset, we 1) aim to obtain a thorough understanding of the
characteristics of spreadsheets used within companies, and 2) compare the
characteristics of the Enron spreadsheets with the EUSES corpus which is the
existing state of the art set of spreadsheets that is frequently used in
spreadsheet studies.
Our analysis shows that 1) the majority of spreadsheets are not large in
terms of worksheets and formulas, do not have a high degree of coupling, and
their formulas are relatively simple; 2) the spreadsheets from the EUSES corpus
are, with respect to the measured characteristics, quite similar to the Enron
spreadsheets.
Comment: In Proceedings of the 2nd Workshop on Software Engineering Methods in
Spreadsheet
A New Approach to Speeding Up Topic Modeling
Latent Dirichlet allocation (LDA) is a widely-used probabilistic topic
modeling paradigm that has recently found many applications in computer vision and
computational biology. In this paper, we propose a fast and accurate batch
algorithm, active belief propagation (ABP), for training LDA. Usually batch LDA
algorithms require repeated scanning of the entire corpus and searching the
complete topic space. To process massive corpora having a large number of
topics, the training iteration of batch LDA algorithms is often inefficient and
time-consuming. To accelerate training, ABP actively scans a subset
of the corpus and searches a subset of the topic space for topic modeling, thereby
saving enormous training time in each iteration. To ensure accuracy, ABP selects
only those documents and topics that contribute to the largest residuals within
the residual belief propagation (RBP) framework. On four real-world corpora,
ABP performs around to times faster than state-of-the-art batch LDA
algorithms with comparable topic modeling accuracy.
Comment: 14 pages, 12 figures
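The selection step described in this abstract, keeping only the documents and topics with the largest residuals, can be illustrated with a minimal sketch. The function name `select_top_residuals`, the `fraction` parameter, and the sample residual values are all hypothetical; the abstract does not say how residuals are computed or what proportion is kept.

```python
from typing import Dict, List


def select_top_residuals(residuals: Dict[int, float], fraction: float) -> List[int]:
    """Return the ids with the largest residuals, keeping roughly `fraction` of them."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Always keep at least one id so an iteration has something to update.
    keep = max(1, int(len(residuals) * fraction))
    ranked = sorted(residuals, key=lambda i: residuals[i], reverse=True)
    return ranked[:keep]


# Hypothetical per-document residuals from a previous iteration.
doc_residuals = {0: 0.02, 1: 0.90, 2: 0.15, 3: 0.40}
active_docs = select_top_residuals(doc_residuals, fraction=0.5)
```

The same helper could be applied to per-topic residuals, so that each iteration touches only the active documents and the active topics.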
Memory-Efficient Topic Modeling
As one of the simplest probabilistic topic modeling techniques, latent
Dirichlet allocation (LDA) has found many important applications in text
mining, computer vision and computational biology. Recent training algorithms
for LDA can be interpreted within a unified message passing framework. However,
message passing requires storing previous messages with a large amount of
memory space, increasing linearly with the number of documents or the number of
topics. Therefore, the high memory usage is often a major problem for topic
modeling of massive corpora containing a large number of topics. To reduce the
space complexity, we propose a novel algorithm without storing previous
messages for training LDA: tiny belief propagation (TBP). The basic idea of TBP
relates the message passing algorithms with the non-negative matrix
factorization (NMF) algorithms, which absorb the message updating into the
message passing process, and thus avoid storing previous messages. Experimental
results on four large data sets confirm that TBP performs comparably to or
even better than current state-of-the-art training algorithms for LDA but with
much lower memory consumption. TBP can do topic modeling when massive corpora
cannot fit in the computer memory, for example, extracting thematic topics from
7 GB PUBMED corpora on a common desktop computer with 2 GB of memory.
Comment: 20 pages, 7 figures
Locality statistics for anomaly detection in time series of graphs
The ability to detect change-points in a dynamic network or a time series of
graphs is an increasingly important task in many applications of the emerging
discipline of graph signal processing. This paper formulates change-point
detection as a hypothesis testing problem in terms of a generative latent
position model, focusing on the special case of the Stochastic Block Model time
series. We analyze two classes of scan statistics, based on distinct underlying
locality statistics presented in the literature. Our main contribution is the
derivation of the limiting distributions and power characteristics of the
competing scan statistics. Performance is compared theoretically, on synthetic
data, and on the Enron email corpus. We demonstrate that both statistics are
admissible in one simple setting, while one of the statistics is inadmissible in a
second setting.
Comment: 15 pages, 6 figures
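For readers unfamiliar with locality-based scan statistics, the sketch below shows one common form: the locality statistic of a vertex is the number of edges among the vertex and its neighbors, and the scan statistic of a graph is the maximum over vertices. This is an assumption chosen for illustration. The abstract does not specify which two locality statistics the paper compares, and the temporal normalization such methods normally apply is omitted here.

```python
from typing import Dict, Set

Graph = Dict[int, Set[int]]  # undirected adjacency sets, no self-loops


def locality_statistic(graph: Graph, v: int) -> int:
    """Count edges in the subgraph induced by v and its neighbors."""
    hood = graph[v] | {v}
    # Each undirected edge is seen from both endpoints, so halve the total.
    return sum(len(graph[u] & hood) for u in hood) // 2


def scan_statistic(graph: Graph) -> int:
    """Unnormalized scan statistic: the largest locality statistic over all vertices."""
    return max(locality_statistic(graph, v) for v in graph)


# Two time steps: a path 0-1-2-3, then the same graph with an extra edge 0-2.
g1 = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
g2 = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
series = [scan_statistic(g) for g in (g1, g2)]
```

A jump in this series from one time step to the next is the kind of signal a change-point test would examine.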
A machine learning approach for layout inference in spreadsheets
Spreadsheet applications are one of the most used tools for content generation and presentation in industry and the Web. In spite of this success, there does not exist a comprehensive approach to automatically extract and reuse the richness of data maintained in this format. The biggest obstacle is the lack of awareness about the structure of the data in spreadsheets, which otherwise could provide the means to automatically understand and extract knowledge from these files. In this paper, we propose a classification approach to discover the layout of tables in spreadsheets. To this end, we focus on the cell level, considering a wide range of features not covered before by related work. We evaluated the performance of our classifiers on a large dataset covering three different corpora from various domains. Finally, our work includes a novel technique for detecting and repairing incorrectly classified cells in a post-processing step. The experimental results show that our approach delivers very high accuracy, bringing us a crucial step closer towards automatic table extraction.
Peer Reviewed. Postprint (published version)
Automated Refactoring of Nested-IF Formulae in Spreadsheets
Spreadsheets are the most popular end-user programming software, where
formulae act like programs and also have smells. One well recognized common
smell of spreadsheet formulae is nest-IF expressions, which have low
readability and high cognitive cost for users, and are error-prone during reuse
or maintenance. However, end users usually lack essential programming language
knowledge and skills to tackle or even realize the problem. Previous
research has made only initial attempts in this direction, and no effective
automated approach is currently available.
This paper proposes an AST-based automated approach for systematically
refactoring nest-IF formulae. The general idea is two-fold. First, we detect
and remove logic redundancy on the AST. Second, we identify higher-level
semantics that have been fragmented and scattered, and reassemble the syntax
using concise built-in functions. A comprehensive evaluation has been conducted
against a real-world spreadsheet corpus, collected at a leading IT
company for research purposes. The results on over 68,000 spreadsheets containing 27
million nest-IF formulae show that our approach is able to relieve the smell
of over 99% of nest-IF formulae. Over 50% of the refactorings reduced the
nesting levels of the nest-IFs by more than half. In addition, a survey
involving 49 participants indicates that in most cases the participants prefer
the refactored formulae, and agree that such an automated refactoring approach
is necessary and helpful.
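The two-step idea in this abstract (remove redundancy on the AST, then reassemble with a concise built-in function) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the `If` node type, the single redundancy rule `IF(c, x, x) = x`, and the choice of Excel's `IFS` as the target function. The abstract does not say which rules or built-ins the authors use.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class If:
    """Hypothetical AST node for IF(cond, then, orelse)."""
    cond: str
    then: "Expr"
    orelse: "Expr"


Expr = Union[str, If]


def simplify(e: Expr) -> Expr:
    """Remove one kind of redundancy: IF(c, x, x) is just x."""
    if isinstance(e, If):
        then, orelse = simplify(e.then), simplify(e.orelse)
        return then if then == orelse else If(e.cond, then, orelse)
    return e


def render(e: Expr) -> str:
    """Render an expression, flattening else-chains into a single IFS call."""
    if not isinstance(e, If):
        return e
    parts: List[str] = []
    node: Expr = e
    while isinstance(node, If):
        parts += [node.cond, render(node.then)]
        node = node.orelse
    if len(parts) == 2:  # a single IF, nothing to flatten
        return f"IF({parts[0]},{parts[1]},{node})"
    parts += ["TRUE", node]  # the final else becomes the catch-all IFS branch
    return "IFS(" + ",".join(parts) + ")"


nested = If("A1>90", '"A"', If("A1>80", '"B"', If("A1>70", '"C"', '"C"')))
refactored = render(simplify(nested))
```

In this example the innermost `IF` returns `"C"` on both branches, so it collapses first, and the remaining two-level chain is emitted as one flat `IFS` call with `TRUE` as the default condition.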