Code Translation with Compiler Representations
In this paper, we leverage low-level compiler intermediate representations
(IR) to improve code translation. Traditional transpilers rely on syntactic
information and handcrafted rules, which limits their applicability and
produces unnatural-looking code. Applying neural machine translation (NMT)
approaches to code has successfully broadened the set of programs on which one
can get a natural-looking translation. However, these approaches treat code as
sequences of text tokens, and still do not differentiate well enough between
similar pieces of code that have different semantics in different languages.
The consequence is low-quality translation, reducing the practicality of NMT
and stressing the need for approaches that significantly increase its accuracy.
Here we propose to augment code translation with IRs, specifically LLVM IR,
with results on the C++, Java, Rust, and Go languages. Our method improves upon
the state of the art for unsupervised code translation, increasing the number
of correct translations by 11% on average, and up to 79% for the Java -> Rust
pair with greedy decoding. With beam search, it increases the number of correct
translations by 5.5% on average. We extend previous test sets for code
translation, by adding hundreds of Go and Rust functions. Additionally, we
train models with high performance on the problem of IR decompilation,
generating programming source code from IR, and study using IRs as an
intermediate pivot for translation.
Comment: 9 pages
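The gap the abstract reports between greedy decoding and beam search comes down to how each explores the model's output distribution: greedy commits to the locally best token at every step, while beam search keeps several hypotheses alive. A minimal, self-contained sketch with a hand-crafted toy scorer (the tokens and log-probabilities below are invented for illustration and are not taken from the paper's model):

```python
def step_scores(prefix):
    # Toy log-probability table for the next token given the prefix.
    # Crafted so the locally best first token ("x") leads to a dead
    # end -- the classic situation where greedy decoding underperforms.
    if not prefix:
        return {"x": -0.2, "y": -0.4}
    if prefix[-1] == "x":
        return {"<eos>": -2.0, "x": -2.0}
    return {"<eos>": -0.1, "y": -2.0}

def greedy_decode(max_len=4):
    # Pick the single best token at each step; no backtracking.
    tokens, total = [], 0.0
    for _ in range(max_len):
        scores = step_scores(tokens)
        tok = max(scores, key=scores.get)
        total += scores[tok]
        if tok == "<eos>":
            break
        tokens.append(tok)
    return tokens, total

def beam_decode(beam_size=2, max_len=4):
    # Each hypothesis: (tokens, cumulative log-prob, finished flag).
    beams = [([], 0.0, False)]
    for _ in range(max_len):
        candidates = []
        for toks, lp, done in beams:
            if done:
                candidates.append((toks, lp, True))
                continue
            for tok, s in step_scores(toks).items():
                if tok == "<eos>":
                    candidates.append((toks, lp + s, True))
                else:
                    candidates.append((toks + [tok], lp + s, False))
        # Keep only the beam_size highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    toks, lp, _ = max(beams, key=lambda c: c[1])
    return toks, lp

print(greedy_decode())  # greedy commits to 'x' and ends stuck near -2.2
print(beam_decode())    # beam keeps 'y' alive and finds a better sequence
```

On this toy scorer, beam search recovers the higher-probability sequence that greedy decoding discards at the first step; the paper's 5.5%-versus-11% gap between the two strategies reflects the same trade-off at scale.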
The Strengths and Behavioral Quirks of Java Bytecode Decompilers
During compilation from Java source code to bytecode, some information is
irreversibly lost. In other words, compilation and decompilation of Java code
is not symmetric. Consequently, the decompilation process, which aims at
producing source code from bytecode, must establish some strategies to
reconstruct the information that has been lost. Modern Java decompilers tend to
use distinct strategies to achieve proper decompilation. In this work, we
hypothesize that the diverse ways in which bytecode can be decompiled have a
direct impact on the quality of the source code produced by decompilers.
We study the effectiveness of eight Java decompilers with respect to three
quality indicators: syntactic correctness, syntactic distortion and semantic
equivalence modulo inputs. This study relies on a benchmark set of 14
real-world open-source software projects to be decompiled (2041 classes in
total).
Our results show that no single modern decompiler is able to correctly handle
the variety of bytecode structures coming from real-world programs. Even the
highest-ranking decompiler in this study produces syntactically correct output
for only 84% of the classes in our dataset, and semantically equivalent code for
78% of classes.
Comment: 11 pages, 6 figures, 9 listings, 3 tables
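The third quality indicator, semantic equivalence modulo inputs, deems two programs equivalent when they agree on every input actually tried; it can falsify equivalence with a counterexample but never prove it. A minimal differential-testing sketch of the idea (the functions and input generator are hypothetical stand-ins, not the paper's benchmark or tooling):

```python
import random

def equivalent_modulo_inputs(f, g, gen_input, trials=1000, seed=0):
    # Differential test: f and g agree on an input if they return the
    # same value or raise the same exception type. Returns (True, None)
    # if no disagreement is found, else (False, counterexample).
    rng = random.Random(seed)
    for _ in range(trials):
        x = gen_input(rng)
        try:
            a = ("ok", f(x))
        except Exception as e:
            a = ("err", type(e).__name__)
        try:
            b = ("ok", g(x))
        except Exception as e:
            b = ("err", type(e).__name__)
        if a != b:
            return False, x
    return True, None

# Stand-ins for an original method and two decompiled-then-recompiled
# versions: one faithful, one subtly distorted.
original = lambda n: n // 2
faithful = lambda n: n >> 1          # same result on non-negative ints
distorted = lambda n: (n + 1) // 2   # differs on every odd input

gen = lambda rng: rng.randrange(0, 10**6)
print(equivalent_modulo_inputs(original, faithful, gen))  # (True, None)
ok, counterexample = equivalent_modulo_inputs(original, distorted, gen)
print(ok)
```

The distorted variant is rejected as soon as sampling hits an odd input, which illustrates why the indicator is stated "modulo inputs": equivalence holds only relative to the inputs exercised.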
Leveraging Static Analysis for Bug Repair
We propose a method combining machine learning with a static analysis tool
(i.e., Infer) to automatically repair source code. Machine learning methods
perform well at producing idiomatic source code. However, their output is
sometimes difficult to trust, as language models can output incorrect code with
high confidence. Static analysis tools are trustworthy, but also less flexible,
and produce non-idiomatic code. In this paper, we propose to fix resource leak
bugs in IR space, and to use a sequence-to-sequence model to propose fixes in
source code space. We also study several decoding strategies, and use Infer to
filter the output of the model. On a dataset of CodeNet submissions with
potential resource leak bugs, our method is able to find a function with the
same semantics that no longer raises a warning, with around 97% precision and
66% recall.
Comment: 13 pages. DL4C 202
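The pipeline the abstract describes, sampling candidate fixes from a model and keeping only those the static analyzer accepts, can be sketched with toy stand-ins (the "program" strings, the analyzer, and the proposal function below are illustrative placeholders, not Infer or the paper's model):

```python
# Toy stand-in for the repair pipeline: a "program" is a string and
# a resource leak is marked by the substring "LEAK".

def analyzer_warns(prog):
    # Stand-in static analyzer: flags any program still containing a leak.
    return "LEAK" in prog

def propose_fixes(prog):
    # Stand-in seq2seq model: a few candidate rewrites, some of which
    # fail to actually remove the leak.
    yield prog                              # no-op candidate
    yield prog.replace("LEAK", "close()", 1)  # candidate that fixes the leak
    yield prog + " close()"                 # appends a close but keeps the leak

def repair(prog):
    # Filter the model's candidates through the analyzer and return the
    # first survivor; the paper additionally checks semantics against
    # the original function, which this sketch omits.
    for cand in propose_fixes(prog):
        if not analyzer_warns(cand):
            return cand
    return None

print(repair("open(); LEAK; use()"))  # "open(); close(); use()"
print(repair("LEAK LEAK"))            # None: no candidate clears the analyzer
```

The filtering step is what buys precision: an incorrect but confident model output that still leaks is discarded rather than shown to the user, at the cost of recall when no candidate passes.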