Version information plays an important role in spreadsheet understanding,
maintaining and quality improving. However, end users rarely use version
control tools to document spreadsheet version information. Thus, the
spreadsheet version information is missing, and different versions of a
spreadsheet coexist as individual and similar spreadsheets. Existing approaches
try to recover spreadsheet version information through clustering these similar
spreadsheets based on spreadsheet filenames or related email conversation.
However, the applicability and accuracy of existing clustering approaches are
limited due to the necessary information (e.g., filenames and email
conversation) is usually missing. We inspected the versioned spreadsheets in
VEnron, which is extracted from the Enron Corporation. In VEnron, the different
versions of a spreadsheet are clustered into an evolution group. We observed
that the versioned spreadsheets in each evolution group exhibit certain common
features (e.g., similar table headers and worksheet names). Based on this
observation, we proposed an automatic clustering algorithm, SpreadCluster.
SpreadCluster learns the criteria of features from the versioned spreadsheets
in VEnron, and then automatically clusters spreadsheets with the similar
features into the same evolution group. We applied SpreadCluster on all
spreadsheets in the Enron corpus. The evaluation result shows that
SpreadCluster could cluster spreadsheets with higher precision and recall rate
than the filename-based approach used by VEnron. Based on the clustering result
by SpreadCluster, we further created a new versioned spreadsheet corpus
VEnron2, which is much bigger than VEnron. We also applied SpreadCluster on the
other two spreadsheet corpora FUSE and EUSES. The results show that
SpreadCluster can cluster the versioned spreadsheets in these two corpora with
high precision.Comment: 12 pages, MSR 201