Context2Name: A Deep Learning-Based Approach to Infer Natural Variable Names from Usage Contexts
Most of the JavaScript code deployed in the wild has been minified, a process
in which identifier names are replaced with short, arbitrary and meaningless
names. Minified code occupies less space, but also makes the code extremely
difficult to manually inspect and understand. This paper presents Context2Name,
a deep learning-based technique that partially reverses the effect of
minification by predicting natural identifier names for minified names. The
core idea is to predict from the usage context of a variable a name that
captures the meaning of the variable. The approach combines a lightweight,
token-based static analysis with an auto-encoder neural network that summarizes
usage contexts, and a recurrent neural network that predicts natural names for a
given usage context. We evaluate Context2Name with a large corpus of real-world
JavaScript code and show that it successfully predicts 47.5% of all minified
identifiers while taking only 2.9 milliseconds on average to predict a name. A
comparison with the state-of-the-art tools JSNice and JSNaughty shows that our
approach performs comparably in terms of accuracy while improving in terms of
efficiency. Moreover, Context2Name complements the state-of-the-art by
predicting 5.3% additional identifiers that are missed by both existing tools.
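As a minimal sketch of the kind of lightweight, token-based analysis the abstract describes (not the paper's implementation; the function name and window size are illustrative assumptions), the usage context of a variable can be gathered as the tokens surrounding each of its occurrences:

```python
import re

def usage_contexts(tokens, target, window=2):
    """Collect the tokens surrounding each occurrence of `target`.

    A toy stand-in for Context2Name's token-based static analysis:
    the window of neighbouring tokens at every use site is the raw
    feature later summarised by the auto-encoder network.
    """
    contexts = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            contexts.append(left + right)
    return contexts

# Minified snippet in which `a` should be renamed to something natural:
code = 'var a = document.getElementById ( "btn" ) ; a . addEventListener'
tokens = re.findall(r"\w+|\S", code)
print(usage_contexts(tokens, "a"))
```

Each context list would then be encoded and fed to the recurrent network that proposes a natural name such as `button`.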
Epicure: Distilling Sequence Model Predictions into Patterns
Most machine learning models predict a probability distribution over concrete
outputs and struggle to accurately predict names over high entropy sequence
distributions. Here, we explore finding abstract, high-precision patterns
intrinsic to these predictions in order to make abstract predictions that
usefully capture rare sequences. In this short paper, we present Epicure, a
method that distils the predictions of a sequence model, such as the output of
beam search, into simple patterns. Epicure maps a model's predictions into a
lattice that represents increasingly more general patterns that subsume the
concrete model predictions.
On the tasks of predicting a descriptive name of a function given the source
code of its body and detecting anomalous names given a function, we show that
Epicure yields accurate naming patterns that match the ground truth more often
compared to just the highest probability model prediction. For a false alarm
rate of 10%, Epicure predicts patterns that match 61% more ground-truth names
compared to the best model prediction, making Epicure well-suited for scenarios
that require high precision.
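A hedged sketch of the lattice idea, under the assumption that patterns generalise concrete predictions by replacing token positions with a wildcard (the function names and the wildcard convention are illustrative, not Epicure's actual formulation):

```python
from collections import defaultdict
from itertools import combinations

WILDCARD = "*"

def patterns_of(seq):
    """Yield every generalisation of a token sequence obtained by
    replacing some subset of positions with a wildcard; together
    these form a lattice ordered by generality."""
    n = len(seq)
    for k in range(n + 1):
        for idxs in combinations(range(n), k):
            yield tuple(WILDCARD if i in idxs else t for i, t in enumerate(seq))

def distil(beam):
    """Aggregate beam-search probability mass onto every pattern that
    subsumes a concrete prediction, and rank patterns by mass."""
    mass = defaultdict(float)
    for seq, p in beam:
        for pat in patterns_of(seq):
            mass[pat] += p
    return sorted(mass.items(), key=lambda kv: -kv[1])

# Hypothetical beam output for a function-naming task:
beam = [(("get", "file", "name"), 0.30),
        (("get", "path", "name"), 0.25),
        (("get", "file", "path"), 0.20)]
ranked = distil(beam)
```

Here the abstract pattern `("get", "*", "name")` accumulates 0.55 probability mass, more than any single concrete prediction, which is the sense in which abstract patterns can usefully capture rare sequences.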
Deep learning type inference
Dynamically typed languages such as JavaScript and Python are increasingly popular, yet static typing has not been totally eclipsed: Python now supports type annotations, and languages like TypeScript offer a middle ground for JavaScript: a strict superset of JavaScript, to which it transpiles, coupled with a type system that permits partially typed programs. However, static typing has a cost: adding annotations, reading the added syntax, and wrestling with the type system to fix type errors. Type inference can ease the transition to more statically typed code and unlock the benefits of richer compile-time information, but is limited in languages like JavaScript as it cannot soundly handle duck typing or runtime evaluation via eval. We propose DeepTyper, a deep learning model that understands which types naturally occur in certain contexts and relations and can provide type suggestions, which can often be verified by the type checker, even if the checker could not infer the type initially. DeepTyper leverages an automatically aligned corpus of tokens and types to accurately predict thousands of variable and function type annotations. Furthermore, we demonstrate that context is key in accurately assigning these types and introduce a technique to reduce overfitting on local cues while highlighting the need for further improvements. Finally, we show that our model can interact with a compiler to provide more than 4,000 additional type annotations with over 95% precision that could not be inferred without the aid of DeepTyper.
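To make the "aligned corpus of tokens and types" concrete, here is a toy frequency-table stand-in for DeepTyper's neural model (the corpus format, the `"O"` no-type tag, and all names are illustrative assumptions); it only shows that the surrounding context, not the token alone, drives the prediction:

```python
from collections import Counter, defaultdict

def train(aligned_corpus):
    """Count (left-neighbour, token) -> type co-occurrences from a
    corpus in which every token is paired with its type annotation
    ("O" marks tokens with no type, in the style of sequence tagging).
    """
    table = defaultdict(Counter)
    for sent in aligned_corpus:
        for i, (tok, typ) in enumerate(sent):
            left = sent[i - 1][0] if i > 0 else "<s>"
            table[(left, tok)][typ] += 1
    return table

def suggest(table, left, tok):
    """Suggest the most frequent type for this token in this context."""
    counts = table.get((left, tok))
    return counts.most_common(1)[0][0] if counts else None

# Hypothetical aligned corpus of (token, type) pairs:
corpus = [
    [("let", "O"), ("n", "number"), ("=", "O"), ("1", "O")],
    [("let", "O"), ("s", "string"), ("=", "O"), ('"x"', "O")],
]
table = train(corpus)
```

A real model replaces the frequency table with a bidirectional recurrent network over the token sequence, but the suggest-then-verify workflow with the type checker is the same.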
Studying the Usage of Text-To-Text Transfer Transformer to Support Code-Related Tasks
Deep learning (DL) techniques are gaining more and more attention in the
software engineering community. They have been used to support several
code-related tasks, such as automatic bug fixing and code comments generation.
Recent studies in the Natural Language Processing (NLP) field have shown that
the Text-To-Text Transfer Transformer (T5) architecture can achieve
state-of-the-art performance for a variety of NLP tasks. The basic idea behind
T5 is to first pre-train a model on a large and generic dataset using a
self-supervised task (e.g., filling masked words in sentences). Once the model
is pre-trained, it is fine-tuned on smaller and specialized datasets, each one
related to a specific task (e.g., language translation, sentence
classification). In this paper, we empirically investigate how the T5 model
performs when pre-trained and fine-tuned to support code-related tasks. We
pre-train a T5 model on a dataset composed of natural language English text and
source code. Then, we fine-tune such a model by reusing datasets used in four
previous works that used DL techniques to: (i) fix bugs, (ii) inject code
mutants, (iii) generate assert statements, and (iv) generate code comments. We
compared the performance of this single model with the results reported in the
four original papers proposing DL-based solutions for those four tasks. We show
that our T5 model, exploiting additional data for the self-supervised
pre-training phase, can achieve performance improvements over the four
baselines.
Comment: Accepted to the 43rd International Conference on Software Engineering
(ICSE 2021).
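The self-supervised pre-training task can be sketched in a few lines. This follows T5's span-corruption objective, in which masked spans are replaced by sentinel tokens (`<extra_id_0>`, `<extra_id_1>`, ...) in the input and spelled out in the target; the example snippet and span positions are illustrative:

```python
def span_corrupt(tokens, spans):
    """Build an (input, target) pair in the style of T5 span corruption.

    Each (start, end) span -- assumed sorted and disjoint -- is replaced
    in the input by a sentinel token; the target lists each sentinel
    followed by the tokens it hides.
    """
    inp, tgt, prev = [], [], 0
    for s_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{s_id}>"
        inp.extend(tokens[prev:start])
        inp.append(sentinel)
        tgt.append(sentinel)
        tgt.extend(tokens[start:end])
        prev = end
    inp.extend(tokens[prev:])
    return inp, tgt

toks = "public int add ( int a , int b )".split()
inp, tgt = span_corrupt(toks, [(1, 2), (4, 5)])
# inp: ['public', '<extra_id_0>', 'add', '(', '<extra_id_1>', 'a', ',', 'int', 'b', ')']
# tgt: ['<extra_id_0>', 'int', '<extra_id_1>', 'int']
```

Pre-training on such pairs over mixed English and source code, followed by fine-tuning on each task's dataset, is the two-phase recipe the paper evaluates.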
Typilus: Neural Type Hints
Type inference over partial contexts in dynamically typed languages is
challenging. In this work, we present a graph neural network model that
predicts types by probabilistically reasoning over a program's structure,
names, and patterns. The network uses deep similarity learning to learn a
TypeSpace -- a continuous relaxation of the discrete space of types -- and how
to embed the type properties of a symbol (i.e. identifier) into it.
Importantly, our model can employ one-shot learning to predict an open
vocabulary of types, including rare and user-defined ones. We realise our
approach in Typilus for Python that combines the TypeSpace with an optional
type checker. We show that Typilus accurately predicts types. Typilus
confidently predicts types for 70% of all annotatable symbols; when it predicts
a type, that type optionally type checks 95% of the time. Typilus can also find
incorrect type annotations; two important and popular open source libraries,
fairseq and allennlp, accepted our pull requests that fixed the annotation
errors Typilus discovered.
Comment: Accepted to PLDI 2020.
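A hedged sketch of why a learned TypeSpace supports an open vocabulary of types: prediction is a nearest-neighbour lookup among embedded, annotated symbols rather than a fixed softmax over known types, so any type present in the index, however rare, remains predictable. The 2-D vectors, the `MyConfig` type, and the majority-vote rule are illustrative assumptions, not Typilus's actual model:

```python
import math

def nearest_type(query, type_index, k=3):
    """Predict a type by k-nearest-neighbour search in an embedding
    space: a toy version of querying a learned TypeSpace.

    `type_index` pairs embedding vectors of annotated symbols with
    their types; the k closest neighbours vote on the prediction.
    """
    neighbours = sorted(type_index, key=lambda item: math.dist(query, item[0]))[:k]
    votes = {}
    for _, typ in neighbours:
        votes[typ] = votes.get(typ, 0) + 1
    return max(votes, key=votes.get)

# Hypothetical index: two common `int` symbols and a rare user-defined type.
type_index = [
    ((0.0, 0.0), "int"),
    ((0.1, 0.1), "int"),
    ((5.0, 5.0), "MyConfig"),
    ((5.1, 4.9), "MyConfig"),
    ((4.9, 5.1), "MyConfig"),
]
print(nearest_type((4.8, 5.0), type_index))
```

In the paper, the embeddings come from a graph neural network trained with deep similarity learning, and candidate predictions are then filtered by an optional type checker.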