CodeTransOcean: A Comprehensive Multilingual Benchmark for Code Translation
Recent code translation techniques exploit neural machine translation models
to translate source code from one programming language to another to satisfy
production compatibility or to improve efficiency of codebase maintenance. Most
existing code translation datasets only focus on a single pair of popular
programming languages. To advance research on code translation and meet diverse
requirements of real-world applications, we construct CodeTransOcean, a
large-scale comprehensive benchmark that supports the largest variety of
programming languages for code translation. CodeTransOcean consists of three
novel multilingual datasets, namely, MultilingualTrans supporting translations
between multiple popular programming languages, NicheTrans for translating
between niche programming languages and popular ones, and LLMTrans for
evaluating executability of translated code by large language models (LLMs).
CodeTransOcean also includes a novel cross-framework dataset, DLTrans, for
translating deep learning code across different frameworks. We develop
multilingual modeling approaches for code translation and demonstrate their
great potential in improving the translation quality of both low-resource and
high-resource language pairs and boosting the training efficiency. We also
propose a novel evaluation metric Debugging Success Rate@K for program-level
code translation. Last but not least, we evaluate LLM ChatGPT on our datasets
and investigate its potential for fuzzy execution predictions. We build
baselines for CodeTransOcean and analyze challenges of code translation for
guiding future research. The CodeTransOcean datasets and code are publicly
available at https://github.com/WeixiangYAN/CodeTransOcean.
Comment: Accepted by Findings of EMNLP 2023
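To make the proposed metric concrete, the following is a minimal sketch of how a Debugging Success Rate@K computation might look. The function names, the plain subprocess-based executability check, and the round indexing are illustrative assumptions, not the paper's implementation; the paper's success criterion may additionally compare program outputs against references.

```python
import subprocess
import tempfile


def runs_successfully(code: str, timeout: float = 10.0) -> bool:
    """Illustrative executability check: write the translated Python program
    to a temporary file, run it, and treat a zero exit code as success."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False


def debugging_success_rate_at_k(rounds_per_problem, k: int) -> float:
    """DSR@k sketch: a problem counts as solved if any attempt up to and
    including the k-th debugging round executes successfully; the metric is
    the fraction of problems solved. Round 0 is taken to be the initial
    translation (an assumption about the paper's protocol)."""
    if not rounds_per_problem:
        return 0.0
    solved = 0
    for rounds in rounds_per_problem:  # rounds: list of code strings, round 0 first
        if any(runs_successfully(code) for code in rounds[: k + 1]):
            solved += 1
    return solved / len(rounds_per_problem)
```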
Teaching Large Language Models to Self-Debug
Large language models (LLMs) have achieved impressive performance on code
generation. However, for complex programming tasks, generating the correct
solution in one go becomes challenging, thus some prior works have designed
program repair approaches to improve code generation performance. In this work,
we propose Self-Debugging, which teaches a large language model to debug its
predicted program via few-shot demonstrations. In particular, we demonstrate
that Self-Debugging can teach the large language model to perform rubber duck
debugging; i.e., without any feedback on the code correctness or error
messages, the model is able to identify its mistakes by explaining the
generated code in natural language. Self-Debugging achieves the
state-of-the-art performance on several code generation benchmarks, including
the Spider dataset for text-to-SQL generation, TransCoder for C++-to-Python
translation, and MBPP for text-to-Python generation. On the Spider benchmark
where there are no unit tests to verify the correctness of predictions,
Self-Debugging with code explanation consistently improves the baseline by
2-3%, and improves the prediction accuracy on problems of the hardest label by
9%. On TransCoder and MBPP where unit tests are available, Self-Debugging
improves the baseline accuracy by up to 12%. Meanwhile, by leveraging feedback
messages and reusing failed predictions, Self-Debugging notably improves sample
efficiency, and can match or outperform baseline models that generate more than
10x candidate programs.
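The following is a minimal sketch of a Self-Debugging-style loop in the spirit of the abstract: the model first generates a program, then either receives unit-test feedback or, in the "rubber duck" setting, explains its own code and uses that explanation to revise it. The `llm` callable, the prompt wording, and the test representation are illustrative assumptions, not the authors' implementation.

```python
def self_debug(llm, task_description, unit_tests=None, max_turns=3):
    """Sketch of a self-debugging loop: generate, gather feedback, revise."""
    code = llm(f"Write a Python program for this task:\n{task_description}")
    for _ in range(max_turns):
        if unit_tests is not None:
            # Unit-test feedback variant: each test is a callable that runs
            # the candidate code and returns True on success (illustrative).
            failures = [t for t in unit_tests if not t(code)]
            if not failures:
                return code  # all tests pass; stop debugging early
            feedback = f"{len(failures)} unit test(s) failed."
        else:
            # Rubber-duck variant: with no execution feedback, the model
            # explains its own code and uses the explanation to spot mistakes.
            feedback = llm(f"Explain this code line by line:\n{code}")
        code = llm(
            "Task:\n" + task_description + "\n"
            "Candidate solution:\n" + code + "\n"
            "Feedback:\n" + feedback + "\n"
            "Fix any mistakes and return the corrected program."
        )
    return code
```

Reusing the failed candidate and its feedback in the revision prompt, rather than sampling a fresh program, is what gives this style of loop its sample-efficiency advantage over generating many independent candidates.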
Towards Explainable Evaluation Metrics for Machine Translation
Unlike classical lexical overlap metrics such as BLEU, most current
evaluation metrics for machine translation (for example, COMET or BERTScore)
are based on black-box large language models. They often achieve strong
correlations with human judgments, but recent research indicates that the
lower-quality classical metrics remain dominant, one of the potential reasons
being that their decision processes are more transparent. To foster more
widespread acceptance of novel high-quality metrics, explainability thus
becomes crucial. In this concept paper, we identify key properties as well as
key goals of explainable machine translation metrics and provide a
comprehensive synthesis of recent techniques, relating them to our established
goals and properties. In this context, we also discuss the latest
state-of-the-art approaches to explainable metrics based on generative models
such as ChatGPT and GPT4. Finally, we contribute a vision of next-generation
approaches, including natural language explanations. We hope that our work can
help catalyze and guide future research on explainable evaluation metrics and,
mediately, also contribute to better and more transparent machine translation
systems.
Comment: Preprint. We published an earlier version of this paper (arXiv:2203.11131) under a different title. Both versions consider the conceptualization of explainable metrics and are overall similar. However, the new version puts a stronger emphasis on the survey of approaches for the explanation of MT metrics, including the latest LLM-based approaches.
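To illustrate the contrast the abstract draws between transparent lexical-overlap metrics and black-box model-based metrics, here is a small example of how the two are typically invoked. It assumes the third-party `sacrebleu` and `bert-score` packages are installed; the toy sentence pair is made up.

```python
# Comparing a transparent lexical-overlap metric (BLEU) with a
# model-based metric (BERTScore) on a toy hypothesis/reference pair.
import sacrebleu
from bert_score import score

hypotheses = ["The cat sits on the mat."]
references = ["There is a cat on the mat."]

# BLEU: n-gram overlap; every step of the computation is inspectable.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")

# BERTScore: token similarity in a pretrained model's embedding space;
# it tends to correlate better with human judgments, but the decision
# process behind a given score is opaque.
precision, recall, f1 = score(hypotheses, references, lang="en")
print(f"BERTScore F1: {f1.mean().item():.3f}")
```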