Evaluation of Generalizability of Neural Program Analyzers under Semantic-Preserving Transformations
The abundance of publicly available source code repositories, in conjunction
with the advances in neural networks, has enabled data-driven approaches to
program analysis. These approaches, called neural program analyzers, use neural
networks to extract patterns in the programs for tasks ranging from development
productivity to program reasoning. Despite the growing popularity of neural
program analyzers, the extent to which their results are generalizable is
unknown.
In this paper, we perform a large-scale evaluation of the generalizability of
two popular neural program analyzers using seven semantically-equivalent
transformations of programs. Our results caution that in many cases the neural
program analyzers fail to generalize well, sometimes to programs with
negligible textual differences. The results provide the initial stepping stones
for quantifying robustness in neural program analyzers.
Comment: for related work, see arXiv:2008.0156
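The paper's own transformation code is not reproduced here, but a minimal sketch of one common semantic-preserving transformation, consistent variable renaming, can be written with Python's `ast` module (the `add` function and the `v0, v1, ...` naming scheme are illustrative choices, not the paper's):

```python
import ast

class RenameIdentifiers(ast.NodeTransformer):
    """Consistently rename variables and parameters to v0, v1, ...

    The transformed program computes exactly the same result as the
    original, but its text differs -- a semantic-preserving transformation.
    """
    def __init__(self):
        self.mapping = {}

    def _fresh(self, name):
        # Reuse the same new name for every occurrence of an identifier.
        if name not in self.mapping:
            self.mapping[name] = f"v{len(self.mapping)}"
        return self.mapping[name]

    def visit_Name(self, node):
        node.id = self._fresh(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._fresh(node.arg)
        return node

source = """
def add(total, count):
    result = total + count
    return result
"""
tree = RenameIdentifiers().visit(ast.parse(source))
new_source = ast.unparse(tree)  # requires Python 3.9+
print(new_source)
```

A neural analyzer that keys on identifier names may change its prediction on the renamed program even though the function's behavior is identical, which is exactly the generalization gap the paper measures.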
Towards Demystifying Dimensions of Source Code Embeddings
Source code representations are key in applying machine learning techniques
for processing and analyzing programs. A popular approach in representing
source code is neural source code embeddings, which represent programs with
high-dimensional vectors computed by training deep neural networks on a large
volume of programs. Although successful, there is little known about the
contents of these vectors and their characteristics. In this paper, we present
our preliminary results towards better understanding the contents of code2vec
neural source code embeddings. In particular, in a small case study, we use the
code2vec embeddings to create binary SVM classifiers and compare their
performance with the handcrafted features. Our results suggest that the
handcrafted features can perform very close to the high-dimensional code2vec
embeddings, and the information gains are more evenly distributed in the
code2vec embeddings compared to the handcrafted features. We also find that the
code2vec embeddings are more resilient to the removal of dimensions with low
information gains than the handcrafted features. We hope our results serve as a
stepping stone toward principled analysis and evaluation of these code
representations.
Comment: 1st ACM SIGSOFT International Workshop on Representation Learning for
Software Engineering and Program Languages, co-located with ESEC/FSE
(RL+SE&PL'20)
BEKG: A Built Environment Knowledge Graph
Practices in the built environment have become more digitalized with the
rapid development of modern design and construction technologies. However,
practitioners and scholars still lack an effective way to gather the complex
professional knowledge of the built environment. In this paper,
more than 80,000 paper abstracts in the built environment field were obtained
to build a knowledge graph, a knowledge base storing entities and their
connective relations in a graph-structured data model. To ensure the retrieval
accuracy of the entities and relations in the knowledge graph, two
well-annotated datasets have been created: one with 2,000 instances for the
named entity recognition task, and one with 1,450 instances spanning 29
relation types for the relation extraction task. These two tasks were solved by two
BERT-based models trained on the proposed dataset. Both models attained an
accuracy above 85% on these two tasks. By applying these models to all the
abstracts, more than 200,000 high-quality entities and relations were obtained.
Finally, the knowledge graph is presented through a self-developed visualization
system that reveals the relations between entities in the domain. Both the
source code and the annotated dataset can be found here:
https://github.com/HKUST-KnowComp/BEKG
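The graph-structured data model the abstract describes, entities connected by typed relations, can be sketched as a minimal triple store. The entity and relation names below are hypothetical examples; in BEKG they would come from the BERT-based NER and relation-extraction models:

```python
from collections import defaultdict

class KnowledgeGraph:
    """Minimal knowledge graph: a store of (head, relation, tail) triples."""

    def __init__(self):
        # Adjacency map: head entity -> list of (relation, tail entity).
        self.out_edges = defaultdict(list)

    def add_triple(self, head, relation, tail):
        self.out_edges[head].append((relation, tail))

    def neighbors(self, entity, relation=None):
        """Entities reachable from `entity`, optionally filtered by relation."""
        return [t for r, t in self.out_edges[entity]
                if relation is None or r == relation]

# Hypothetical triples extracted from built-environment abstracts.
kg = KnowledgeGraph()
kg.add_triple("BIM", "used-in", "construction management")
kg.add_triple("BIM", "related-to", "digital twin")
print(kg.neighbors("BIM"))
```

A production system would back this with a graph database and attach provenance (which abstract each triple was extracted from), but the retrieval pattern, look up an entity and traverse its typed edges, is the same.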