The abundance of publicly available source code repositories, together with
advances in neural networks, has enabled data-driven approaches to
program analysis. These approaches, called neural program analyzers, use neural
networks to extract patterns from programs for tasks ranging from development
productivity to program reasoning. Despite the growing popularity of neural
program analyzers, the extent to which their results are generalizable is
unknown.
In this paper, we perform a large-scale evaluation of the generalizability of
two popular neural program analyzers using seven semantically equivalent
transformations of programs. Our results caution that in many cases the neural
program analyzers fail to generalize well, sometimes even to programs with
negligible textual differences. The results provide the initial stepping stones
for quantifying robustness in neural program analyzers.

Comment: for related work, see arXiv:2008.0156