Semantic Source Code Models Using Identifier Embeddings
The emergence of online open source repositories in recent years has led
to an explosion in the volume of openly available source code, coupled with
metadata relating to a variety of software development activities. As a
result, in line with recent advances in machine learning research, software
maintenance activities are shifting from symbolic formal methods to
data-driven methods. In this context, the rich semantics hidden in source code
identifiers provide opportunities for building semantic representations of code
which can assist tasks of code search and reuse. To this end, we deliver in the
form of pretrained vector space models, distributed code representations for
six popular programming languages, namely, Java, Python, PHP, C, C++, and C#.
The models are produced using fastText, a state-of-the-art library for learning
word representations. Each model is trained on data from a single programming
language; the code mined to produce all models amounts to over 13,000
repositories. We highlight dissimilarities between natural language and source
code, as well as variations in coding conventions between the different
programming languages we processed. We describe how these heterogeneities
guided the data preprocessing decisions we took and the selection of the
training parameters for the released models. Finally, we propose potential
applications of the models and discuss their limitations.
Comment: 16th International Conference on Mining Software Repositories (MSR
2019): Data Showcase Track
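Models like these depend on splitting compound identifiers into subtokens before training. The helper below is a hypothetical sketch of that preprocessing step, not the authors' actual pipeline; the function name and regular expression are our own assumptions.

```python
import re

def split_identifier(name):
    """Split a source-code identifier into lowercase subtokens.

    Hypothetical helper illustrating one plausible preprocessing step
    (snake_case and camelCase splitting) before training embeddings with
    a tool such as fastText; not the authors' actual pipeline.
    """
    # First break on underscores and other non-word separators.
    parts = re.split(r"[_\W]+", name)
    tokens = []
    for part in parts:
        # Then split camelCase/PascalCase boundaries, keeping acronym runs intact.
        tokens.extend(re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", part))
    return [t.lower() for t in tokens if t]

print(split_identifier("parseHTTPResponse"))  # → ['parse', 'http', 'response']
```

Subtoken splitting lets the embedding model share statistics between, say, `maxValue` and `max_value`, which would otherwise be unrelated vocabulary items.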
Unsupervised Domain Adaptation with Similarity Learning
The objective of unsupervised domain adaptation is to leverage features from
a labeled source domain and learn a classifier for an unlabeled target domain,
with a similar but different data distribution. Most deep learning approaches
to domain adaptation consist of two steps: (i) learn features that preserve a
low risk on labeled samples (source domain) and (ii) make the features from
both domains as indistinguishable as possible, so that a classifier trained
on the source domain can also be applied to the target domain. In general, the
classifiers in step (i) consist of fully-connected layers applied directly on
the indistinguishable features learned in (ii). In this paper, we propose a
different way to do the classification, using similarity learning. The proposed
method learns a pairwise similarity function in which classification can be
performed by computing similarity between prototype representations of each
category. The domain-invariant features and the categorical prototype
representations are learned jointly and in an end-to-end fashion. At inference
time, images from the target domain are compared to the prototypes, and the
label of the best-matching prototype is output. The approach is simple,
scalable and effective. We show that our model achieves state-of-the-art
performance in different unsupervised domain adaptation scenarios.
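The prototype-based decision rule described above can be sketched as follows. This is a minimal illustration using fixed vectors and cosine similarity; in the paper the similarity function and the prototypes are learned jointly end-to-end, so the names and vectors here are assumptions for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (stand-in for a learned similarity)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def classify(embedding, prototypes):
    """Assign the label of the most similar categorical prototype.

    prototypes: dict mapping label -> prototype vector. Hypothetical sketch:
    both the embedding and the prototypes are fixed here, whereas the paper
    learns them jointly together with the similarity function.
    """
    return max(prototypes, key=lambda label: cosine(embedding, prototypes[label]))

prototypes = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
print(classify([0.9, 0.1], prototypes))  # → cat
```

One consequence of this design is that inference scales with the number of categories rather than requiring a retrained fully-connected classification head.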
Preparing sparse solvers for exascale computing.
Sparse solvers provide essential functionality for a wide variety of scientific applications. Highly parallel sparse solvers are essential for continuing advances in high-fidelity, multi-physics and multi-scale simulations, especially as we target exascale platforms. This paper describes the challenges, strategies and progress of the US Department of Energy Exascale Computing Project towards providing sparse solvers for exascale computing platforms. We address the demands of systems with thousands of high-performance node devices, where exposing concurrency, hiding latency and creating alternative algorithms become essential. The efforts described here are works in progress, highlighting current successes and upcoming challenges. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'.
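To make the setting concrete, the kernels such solvers are built from operate on compressed sparse formats. The sketch below shows a matrix-vector product over the common CSR (compressed sparse row) layout; it is a plain illustration of the data structure, not code from the project, and the outer loop over rows is exactly the kind of work that must be parallelized across node devices.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    values:  nonzero entries, stored row by row
    col_idx: column index of each nonzero
    row_ptr: row_ptr[i]:row_ptr[i+1] slices the nonzeros of row i
    Illustrative sketch only; production exascale solvers run loops
    like this in parallel across many high-performance node devices.
    """
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

# A = [[1, 0, 2],
#      [0, 3, 0]]
print(csr_matvec([1.0, 2.0, 3.0], [0, 2, 1], [0, 2, 3], [1.0, 1.0, 1.0]))  # → [3.0, 3.0]
```

Because only nonzeros are stored and touched, memory traffic (not arithmetic) dominates such kernels, which is one reason hiding latency is listed among the essential strategies.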