Leveraging Code Generation to Improve Code Retrieval and Summarization via Dual Learning
Code summarization generates brief natural language description given a
source code snippet, while code retrieval fetches relevant source code given a
natural language query. Since both tasks aim to model the association between
natural language and programming language, recent studies have combined these
two tasks to improve their performance. However, researchers have not yet been able to effectively leverage the intrinsic connection between the two tasks, as they train them separately or in a pipeline manner, which means their performance cannot be well balanced. In this paper, we propose a novel
end-to-end model for the two tasks by introducing an additional code generation
task. More specifically, we explicitly exploit the probabilistic correlation
between code summarization and code generation with dual learning, and utilize
the two encoders for code summarization and code generation to train the code
retrieval task via multi-task learning. We have carried out extensive
experiments on an existing dataset of SQL and Python, and results show that our
model can significantly improve the results of the code retrieval task over
state-of-the-art models, as well as achieve competitive performance in terms of BLEU score for the code summarization task.
Comment: Published at The Web Conference (WWW) 2020, full paper
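The probabilistic correlation the abstract mentions can be made concrete with a duality regularizer. Below is a minimal sketch (not the authors' code) of that idea: the two factorizations of the joint probability P(code, nl) should agree, so the squared gap between them is added as a penalty. The function names and the weight lam are illustrative assumptions.

```python
import torch

def duality_loss(log_p_code, log_p_nl,
                 log_p_nl_given_code, log_p_code_given_nl,
                 lam=0.1):
    """Dual-learning regularizer: the two factorizations of the joint
    probability should agree, i.e.
    log P(code) + log P(nl|code) ~= log P(nl) + log P(code|nl).
    All arguments are scalar tensors of log-probabilities for one pair."""
    gap = (log_p_code + log_p_nl_given_code
           - log_p_nl - log_p_code_given_nl)
    return lam * gap.pow(2)

# Toy usage with made-up log-probabilities for one (code, summary) pair;
# in training this term is added to both models' cross-entropy losses.
loss = duality_loss(torch.tensor(-42.0), torch.tensor(-18.0),
                    torch.tensor(-15.5), torch.tensor(-40.2))
print(loss)
```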
CoaCor: Code Annotation for Code Retrieval with Reinforcement Learning
To accelerate software development, much research has been performed to help
people understand and reuse the huge amount of available code resources. Two
important tasks have been widely studied: code retrieval, which aims to
retrieve code snippets relevant to a given natural language query from a code
base, and code annotation, where the goal is to annotate a code snippet with a
natural language description. Despite their advancement in recent years, the
two tasks are mostly explored separately. In this work, we investigate a novel
perspective of Code annotation for Code retrieval (hence called 'CoaCor'),
where a code annotation model is trained to generate a natural language
annotation that can represent the semantic meaning of a given code snippet and
can be leveraged by a code retrieval model to better distinguish relevant code
snippets from others. To this end, we propose an effective framework based on
reinforcement learning, which explicitly encourages the code annotation model
to generate annotations that can be used for the retrieval task. Through
extensive experiments, we show that code annotations generated by our framework
are much more detailed and more useful for code retrieval, and they can further
improve the performance of existing code retrieval models significantly.
Comment: 10 pages, 2 figures. Accepted by The Web Conference (WWW) 2019
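The core mechanism, rewarding the annotation model by how well its output works as a retrieval query, can be sketched as follows. This is an illustrative reduction (a reciprocal-rank reward plus a REINFORCE loss), not the paper's exact reward design; the tensors and indices below are toy assumptions.

```python
import torch

def rr_reward(query_scores, gold_index):
    """Reciprocal rank of the gold code snippet when the generated
    annotation is used as the retrieval query. query_scores: 1-D tensor
    of similarity scores between the annotation and each candidate."""
    rank = (query_scores > query_scores[gold_index]).sum().item() + 1
    return 1.0 / rank

def reinforce_loss(token_log_probs, reward, baseline=0.0):
    """REINFORCE: scale the log-likelihood of the sampled annotation
    by the (baseline-subtracted) retrieval reward."""
    return -(reward - baseline) * token_log_probs.sum()

# Toy usage: 4 candidate snippets, gold at index 2, and log-probs of a
# sampled 5-token annotation from the annotation model.
scores = torch.tensor([0.1, 0.3, 0.8, 0.2])
r = rr_reward(scores, gold_index=2)  # gold ranked first -> reward 1.0
logp = torch.log_softmax(torch.randn(5, 100), dim=-1).max(dim=-1).values
print(reinforce_loss(logp, r))
```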
Large Language Models are Few-Shot Summarizers: Multi-Intent Comment Generation via In-Context Learning
Code comment generation aims at generating natural language descriptions for
a code snippet to facilitate developers' program comprehension activities.
Despite being studied for a long time, a bottleneck for existing approaches is
that given a code snippet, they can only generate one comment while developers
usually need to know information from diverse perspectives such as what is the
functionality of this code snippet and how to use it. To tackle this
limitation, this study empirically investigates the feasibility of utilizing
large language models (LLMs) to generate comments that can fulfill developers'
diverse intents. Our intuition rests on two facts: (1) code and its paired comment are used during the pre-training of LLMs to build the semantic connection between natural language and programming language, and (2) comments in real-world projects, which are collected for pre-training, usually reflect different developers' intents. We thus postulate that LLMs can already understand code from different perspectives after pre-training. Indeed, experiments on two large-scale datasets support this insight: by adopting the in-context learning paradigm and
giving adequate prompts to the LLM (e.g., providing it with ten or more
examples), the LLM can significantly outperform a state-of-the-art supervised
learning approach at generating comments with multiple intents. Results also show that customized strategies for constructing the prompts and post-processing strategies for reranking the results can both boost the LLM's performance, shedding light on future research directions for using LLMs for comment generation.
Comment: Accepted by the 46th International Conference on Software Engineering (ICSE 2024)
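As a concrete illustration of the in-context learning setup described above (a hypothetical sketch, not the paper's actual prompt template), one can assemble a few-shot prompt from (code, intent, comment) demonstration triples and append the query code with the desired intent:

```python
def build_prompt(examples, query_code, intent):
    """Assemble a few-shot prompt from (code, intent, comment) triples;
    the study reports that ten or more demonstrations work well."""
    parts = []
    for code, ex_intent, comment in examples:
        parts.append(f"Code:\n{code}\nIntent: {ex_intent}\nComment: {comment}\n")
    # The query ends at "Comment:" so the LLM completes the comment.
    parts.append(f"Code:\n{query_code}\nIntent: {intent}\nComment:")
    return "\n".join(parts)

demos = [
    ("def add(a, b):\n    return a + b", "what", "Adds two numbers."),
    ("def add(a, b):\n    return a + b", "usage", "Call add(x, y) to sum x and y."),
]
print(build_prompt(demos, "def mul(a, b):\n    return a * b", "what"))
```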
Laminar: A New Serverless Stream-based Framework with Semantic Code Search and Code Completion
This paper introduces Laminar, a novel serverless framework based on
dispel4py, a parallel stream-based dataflow library. Laminar efficiently
manages streaming workflows and components through a dedicated registry,
offering a seamless serverless experience. Leveraging large language models,
Laminar enhances the framework with semantic code search, code summarization,
and code completion. This contribution enhances serverless computing by
simplifying the execution of streaming computations, managing data streams more
efficiently, and offering a valuable tool for both researchers and
practitioners.
Comment: 13 pages, 10 figures, 6 tables
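Semantic code search of the kind Laminar adds typically ranks registered components by embedding similarity to a natural-language query. The sketch below uses a bag-of-words vector as a stand-in for a learned code embedding (Laminar itself would use an LLM-based encoder); all names and the toy registry are illustrative assumptions.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in embedding: a bag-of-words token-count vector.
    A real system would use a learned code/NL encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(registry, query, k=3):
    """Rank registered component descriptions by similarity to the query."""
    q = embed(query)
    return sorted(registry, key=lambda item: cosine(q, embed(item)),
                  reverse=True)[:k]

components = ["filter stream by predicate", "aggregate window counts",
              "read sensor data stream"]
print(search(components, "count events in a time window"))
```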
Large Language Models for Software Engineering: A Systematic Literature Review
Large Language Models (LLMs) have significantly impacted numerous domains,
notably including Software Engineering (SE). Nevertheless, a well-rounded
understanding of the application, effects, and possible limitations of LLMs
within SE is still in its early stages. To bridge this gap, our systematic
literature review takes a deep dive into the intersection of LLMs and SE, with
a particular focus on understanding how LLMs can be exploited in SE to optimize
processes and outcomes. Through a comprehensive review approach, we collect and
analyze a total of 229 research papers from 2017 to 2023 to answer four key
research questions (RQs). In RQ1, we categorize and provide a comparative
analysis of different LLMs that have been employed in SE tasks, laying out
their distinctive features and uses. For RQ2, we detail the methods involved in
data collection, preprocessing, and application in this realm, shedding light
on the critical role of robust, well-curated datasets for successful LLM
implementation. RQ3 allows us to examine the specific SE tasks where LLMs have
shown remarkable success, illuminating their practical contributions to the
field. Finally, RQ4 investigates the strategies employed to optimize and
evaluate the performance of LLMs in SE, as well as the common techniques
related to prompt optimization. Armed with insights drawn from addressing the
aforementioned RQs, we sketch a picture of the current state-of-the-art,
pinpointing trends, identifying gaps in existing research, and flagging
promising areas for future study.