Semantic Source Code Models Using Identifier Embeddings
The emergence of online open source repositories in recent years has led
to an explosion in the volume of openly available source code, coupled with
metadata relating to a variety of software development activities. As a
result, in line with recent advances in machine learning research, software
maintenance activities are shifting from symbolic formal methods to
data-driven methods. In this context, the rich semantics hidden in source code
identifiers provide opportunities for building semantic representations of code
which can assist tasks of code search and reuse. To this end, we deliver in the
form of pretrained vector space models, distributed code representations for
six popular programming languages, namely, Java, Python, PHP, C, C++, and C#.
The models are produced using fastText, a state-of-the-art library for learning
word representations. Each model is trained on data from a single programming
language; the code mined for producing all models amounts to over 13,000
repositories. We highlight dissimilarities between natural language and source
code, as well as variations in coding conventions between the different
programming languages we processed. We describe how these heterogeneities
guided the data preprocessing decisions we took and the selection of the
training parameters in the released models. Finally, we propose potential
applications of the models and discuss their limitations.
Comment: 16th International Conference on Mining Software Repositories (MSR 2019): Data Showcase Track
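The abstract notes that identifier semantics and coding conventions drove the preprocessing decisions, though it does not show the pipeline itself. A common first step in such pipelines is splitting identifiers into subtokens before training embeddings; the sketch below is a hypothetical illustration of that normalization (the function name and exact rules are assumptions, not the authors' code):

```python
import re

def split_identifier(identifier):
    """Split a source code identifier into lowercase subtokens.

    Handles snake_case, camelCase, and PascalCase -- the kind of
    normalization an embedding pipeline over identifiers might apply
    before feeding tokens to fastText. Illustrative only.
    """
    parts = []
    for chunk in identifier.split("_"):  # break snake_case first
        # Then split case transitions, keeping acronyms and digits intact.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|\d+", chunk))
    return [p.lower() for p in parts if p]

print(split_identifier("getUserName"))    # ['get', 'user', 'name']
print(split_identifier("HTTP_response"))  # ['http', 'response']
```

Per-language differences in conventions (e.g. snake_case dominating in Python, camelCase in Java) are one reason the released models were trained separately for each language.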
Graph-Driven Generative Models for Heterogeneous Multi-Task Learning
We propose a novel graph-driven generative model that unifies multiple
heterogeneous learning tasks into the same framework. The proposed model is
based on the fact that heterogeneous learning tasks, which correspond to
different generative processes, often rely on data with a shared graph
structure. Accordingly, our model combines a graph convolutional network (GCN)
with multiple variational autoencoders, thus embedding the nodes of the graph
(i.e., samples for the tasks) in a uniform manner while specializing their
organization and usage to different tasks. With a focus on healthcare
applications (tasks), including clinical topic modeling, procedure
recommendation and admission-type prediction, we demonstrate that our method
successfully leverages information across different tasks, boosting performance
in all tasks and outperforming existing state-of-the-art approaches.
Comment: Accepted by AAAI-2020
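The abstract describes a shared GCN producing uniform node embeddings that task-specific components then consume. As a rough illustration of the shared embedding step only (not the authors' implementation, and assuming the standard symmetric normalization with self-loops), a single graph convolution can be sketched in NumPy:

```python
import numpy as np

def gcn_layer(adj, features, weight):
    """One graph convolution: D^-1/2 (A + I) D^-1/2 X W, then ReLU.

    adj:      (n, n) unweighted adjacency matrix
    features: (n, f) node feature matrix
    weight:   (f, d) learnable projection (random here for illustration)
    """
    a_hat = adj + np.eye(adj.shape[0])        # add self-loops
    deg = a_hat.sum(axis=1)
    d_inv_sqrt = np.diag(deg ** -0.5)
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt  # symmetric normalization
    return np.maximum(a_norm @ features @ weight, 0.0)

# Toy graph: 4 nodes (a path), 3 input features, 2-dim shared embedding.
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
x = rng.normal(size=(4, 3))
w = rng.normal(size=(3, 2))
emb = gcn_layer(adj, x, w)  # shared node embeddings for all downstream tasks
print(emb.shape)            # (4, 2)
```

In the paper's framework, embeddings like `emb` would feed multiple task-specific variational autoencoders (topic modeling, procedure recommendation, admission-type prediction), so the tasks share graph structure while keeping separate generative heads.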