A Differential Testing Approach for Evaluating Abstract Syntax Tree Mapping Algorithms
Abstract syntax tree (AST) mapping algorithms are widely used to analyze
changes in source code. Despite the foundational role of AST mapping
algorithms, little effort has been made to evaluate the accuracy of AST mapping
algorithms, i.e., the extent to which an algorithm captures the evolution of
code. We observe that a program element often has only one best-mapped program
element. Based on this observation, we propose a hierarchical approach to
automatically compare the similarity of mapped statements and tokens by
different algorithms. By performing the comparison, we determine if each of the
compared algorithms generates inaccurate mappings for a statement or its
tokens. We invite 12 external experts to determine if three commonly used AST
mapping algorithms generate accurate mappings for a statement and its tokens
for 200 statements. Based on the experts' feedback, we observe that our approach
achieves a precision of 0.98--1.00 and a recall of 0.65--0.75. Furthermore, we
conduct a large-scale study with a dataset of ten Java projects, containing a
total of 263,165 file revisions. Our approach determines that GumTree, MTDiff
and IJM generate inaccurate mappings for 20%--29%, 25%--36% and 21%--30% of the
file revisions, respectively. Our experimental results show that state-of-the-art
AST mapping algorithms still need improvement.
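
The comparison idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it works at the statement level only (the abstract describes a hierarchical comparison of statements and tokens), and the Dice similarity, the function names, and the toy data are all assumptions made for illustration. The sketch relies on the stated observation that a program element usually has a single best-mapped element, so when algorithms disagree, any mapping that is less similar than the best one is flagged.

```python
from collections import Counter


def token_similarity(a, b):
    """Dice coefficient over token multisets (an assumed similarity measure)."""
    if not a and not b:
        return 1.0
    overlap = sum((Counter(a) & Counter(b)).values())
    return 2 * overlap / (len(a) + len(b))


def flag_inaccurate(src_tokens, dst_tokens, mappings):
    """Return {statement id: [algorithms whose mapping scores below the best]}.

    Assumption: an unmapped statement scores 0.0, which ignores the case
    where a statement was genuinely deleted.
    """
    flagged = {}
    for stmt, tokens in src_tokens.items():
        scores = {}
        for algo, mapping in mappings.items():
            target = mapping.get(stmt)
            if target is None:
                scores[algo] = 0.0
            else:
                scores[algo] = token_similarity(tokens, dst_tokens[target])
        best = max(scores.values())
        worse = sorted(a for a, s in scores.items() if s < best)
        if worse:
            flagged[stmt] = worse
    return flagged


# Hypothetical statements before (s*) and after (t*) a revision.
src_tokens = {"s1": ["int", "x", "=", "1", ";"], "s2": ["return", "x", ";"]}
dst_tokens = {
    "t1": ["int", "x", "=", "2", ";"],
    "t2": ["return", "x", ";"],
    "t3": ["int", "y", "=", "0", ";"],
}
# Hypothetical outputs of two mapping algorithms, "A" and "B".
mappings = {
    "A": {"s1": "t1", "s2": "t2"},
    "B": {"s1": "t3", "s2": "t2"},
}
flagged = flag_inaccurate(src_tokens, dst_tokens, mappings)
```

Here both algorithms agree on `s2`, so nothing is flagged for it, while `B` maps `s1` to a less similar statement than `A` does and is flagged. A disagreement only shows that at least one algorithm is wrong, which is consistent with the high precision and lower recall reported above: mappings on which all algorithms make the same mistake go undetected.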
HyperAST: Enabling Efficient Analysis of Software Histories at Scale
Abstract Syntax Trees (ASTs) are widely used beyond compilers in many tools that measure and improve code quality, such as code analysis, bug detection, mining code metrics, and refactoring. With the advent of fast software evolution and multistage releases, the temporal analysis of an AST history is becoming useful to understand and maintain code. However, jointly analyzing thousands of versions of ASTs independently faces scalability issues, mostly combinatorial, both in terms of memory and CPU usage. In this paper, we propose a novel type of AST, called HyperAST, that enables efficient temporal code analysis on a given software history by: (1) leveraging code redundancy through space (between code elements) and time (between versions); (2) reusing intermediate computation results. We show how the HyperAST can be built incrementally on a set of commits to capture all multiple ASTs at once in an optimized way. We evaluated the HyperAST on a curated list of large software projects. Compared to Spoon, a state-of-the-art technique, we observed that the HyperAST outperforms it with an order-of-magnitude difference from ×6 up to ×8076 in CPU construction time and from ×12 up to ×1159 in memory footprint. While the HyperAST requires up to 2 h 22 min and 7.2 GB for the biggest project, Spoon requires up to 93 h and 31 min and 2.2 TB. The gains in construction time varied from 83.4 % to 99.99 % and the gains in memory footprint varied from 91.8 % to 99.9 %. We further compared the task of finding references of declarations with the HyperAST and Spoon. We observed on average 90 % precision and 97 % recall without a significant difference in search time.
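
The redundancy idea described above (sharing identical code elements within a version and across versions) can be illustrated with a generic hash-consing sketch. This is an assumption-laden toy, not the HyperAST data structure or its API: `NodeStore`, `intern`, `build`, and the example trees are hypothetical names introduced here. Each distinct subtree is stored once and identified by its label and the ids of its children, so a second version pays only for the subtrees that changed.

```python
class NodeStore:
    """Stores each distinct (label, children) subtree exactly once."""

    def __init__(self):
        self._ids = {}   # (label, child ids) -> node id
        self.nodes = []  # node id -> (label, child ids)

    def intern(self, label, children=()):
        key = (label, tuple(children))
        if key not in self._ids:
            self._ids[key] = len(self.nodes)
            self.nodes.append(key)
        return self._ids[key]


def build(store, tree):
    """Intern a (label, [subtrees]) tree bottom-up and return the root id."""
    label, kids = tree
    return store.intern(label, [build(store, k) for k in kids])


def method(name, literal):
    """Hypothetical toy AST for a method that returns a literal."""
    return ("method", [(f"name:{name}", []),
                       ("body", [("return", [(f"lit:{literal}", [])])])])


# Two versions of a class; only method b changes between them.
v1 = ("class", [method("a", 1), method("b", 2)])
v2 = ("class", [method("a", 1), method("b", 3)])

store = NodeStore()
root_v1 = build(store, v1)
nodes_after_v1 = len(store.nodes)
root_v2 = build(store, v2)
nodes_after_v2 = len(store.nodes)
```

In this toy example the first version adds 11 nodes and the second adds only 5 (the changed literal and its ancestors), instead of another 11. Because unchanged subtrees keep the same id in both versions, results computed per subtree could also be cached by id and reused, which is one plausible reading of "reusing intermediate computation results".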