Code2Que: A Tool for Improving Question Titles from Mined Code Snippets in Stack Overflow
Stack Overflow is one of the most popular technical Q&A sites used by
software developers. Seeking help from Stack Overflow has become an essential
part of software developers' daily work for solving programming-related
questions. Although the Stack Overflow community has provided quality assurance
guidelines to help users write better questions, we observed that a significant
number of questions submitted to Stack Overflow are of low quality. In this
paper, we introduce a new web-based tool, Code2Que, which can help developers
in writing higher quality questions for a given code snippet. Code2Que consists
of two main stages: offline learning and online recommendation. In the offline
learning phase, we first collect a set of good-quality <code snippet, question
title> pairs as training samples. We then train our model on these training samples
via a deep sequence-to-sequence approach, enhanced with an attention mechanism,
a copy mechanism and a coverage mechanism. In the online recommendation phase,
for a given code snippet, we use the offline trained model to generate question
titles to assist less experienced developers in writing questions more
effectively. At the same time, we embed the given code snippet into a vector
and retrieve the related questions with similar problematic code snippets.
Comment: arXiv admin note: text overlap with arXiv:2005.1015
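The copy mechanism mentioned in the abstract can be illustrated in a few lines. The following is a minimal, library-free sketch of one pointer-generator-style decoding step, not Code2Que's actual implementation; all values are toy numbers:

```python
def copy_mechanism_step(p_vocab, attention, src_ids, p_gen):
    """One decoding step of a copy mechanism (pointer-generator style).

    p_vocab:   generator distribution over the fixed vocabulary
    attention: attention weights over the source-code tokens
    src_ids:   vocabulary ids of those source-code tokens
    p_gen:     gate in [0, 1] choosing generate vs. copy
    """
    p_final = [p_gen * p for p in p_vocab]
    for attn, tok in zip(attention, src_ids):
        p_final[tok] += (1.0 - p_gen) * attn  # shift mass onto copied tokens
    return p_final

# Toy step: vocabulary of 5 tokens, 3 source-code tokens.
p = copy_mechanism_step(
    p_vocab=[0.2, 0.5, 0.1, 0.1, 0.1],
    attention=[0.7, 0.2, 0.1],
    src_ids=[3, 4, 3],
    p_gen=0.6,
)
```

The mixed result is still a valid probability distribution, but rare identifiers appearing in the code snippet (here tokens 3 and 4) gain probability mass they could never get from the fixed vocabulary alone.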
Diverse Title Generation for Stack Overflow Posts with Multiple Sampling Enhanced Transformer
Stack Overflow is one of the most popular programming communities where
developers can seek help for their encountered problems. Nevertheless, if
inexperienced developers fail to describe their problems clearly, it is hard
for them to attract sufficient attention and get the anticipated answers. We
propose MNSCT5, a novel approach to automatically generate multiple post
titles from the given code snippets. Developers may use the generated titles to
find closely related posts and complete their problem descriptions. MNSCT5
employs the CodeT5 backbone, which is a pre-trained Transformer model having an
excellent language understanding and generation ability. To alleviate the
ambiguity issue that the same code snippets could be aligned with different
titles under varying contexts, we propose the maximal marginal multiple nucleus
sampling strategy to generate multiple high-quality and diverse title
candidates at a time for the developers to choose from. We build a large-scale
dataset with 890,000 question posts covering eight programming languages to
validate the effectiveness of MNSCT5. The automatic evaluation results on
the BLEU and ROUGE metrics demonstrate the superiority of MNSCT5 over six
state-of-the-art baseline models. Moreover, a human evaluation with trustworthy
results also demonstrates the great potential of our approach for real-world
application.
Comment: under review
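The "maximal marginal" idea behind the sampling strategy can be sketched as a marginal-relevance reranker over sampled title candidates: each pick trades off the candidate's own score against its similarity to titles already chosen. This is an illustrative simplification, not MNSCT5's actual algorithm; scores and similarities below are made up:

```python
def mmr_select(candidates, relevance, similarity, k, lam=0.7):
    """Greedy maximal-marginal-relevance selection of diverse candidates."""
    selected, remaining = [], list(range(len(candidates)))
    while remaining and len(selected) < k:
        def mmr(i):
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1.0 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return [candidates[i] for i in selected]

# Toy candidates: two near-duplicate titles and one distinct title.
titles = [
    "How to sort a dict by value?",
    "Sorting a dict by its values",
    "Why does my loop never terminate?",
]
relevance = [0.90, 0.88, 0.60]
similarity = [
    [1.00, 0.95, 0.05],
    [0.95, 1.00, 0.05],
    [0.05, 0.05, 1.00],
]
picked = mmr_select(titles, relevance, similarity, k=2)
```

Even though the second title scores higher on relevance, the penalty for its 0.95 similarity to the first pick pushes the selection toward the distinct third title, which is exactly the diversity behaviour the abstract describes.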
Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow
For tasks like code synthesis from natural language, code retrieval, and code
summarization, data-driven models have shown great promise. However, creating
these models requires parallel data between natural language (NL) and code with
fine-grained alignments. Stack Overflow (SO) is a promising source to create
such a data set: the questions are diverse and most of them have corresponding
answers with high-quality code snippets. However, existing heuristic methods
(e.g., pairing the title of a post with the code in the accepted answer) are
limited both in their coverage and the correctness of the NL-code pairs
obtained. In this paper, we propose a novel method to mine high-quality aligned
data from SO using two sets of features: hand-crafted features considering the
structure of the extracted snippets, and correspondence features obtained by
training a probabilistic model to capture the correlation between NL and code
using neural networks. These features are fed into a classifier that determines
the quality of mined NL-code pairs. Experiments using Python and Java as test
beds show that the proposed method greatly expands coverage and accuracy over
existing mining methods, even when using only a small number of labeled
examples. Further, we find that reasonable results are achieved even when
training the classifier on one language and testing on another, showing promise
for scaling NL-code mining to a wide variety of programming languages beyond
those for which we are able to annotate data.
Comment: MSR '1
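The final classification step, which combines hand-crafted structural features with learned correspondence features, can be sketched as a simple logistic scorer. The feature names, weights, and values here are hypothetical, not those learned in the paper:

```python
import math

def pair_quality(features, weights, bias):
    """Logistic quality score for an NL-code pair from its feature vector."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical features: [snippet is a syntactically complete block,
# snippet imports the API named in the question,
# neural NL-code correspondence score].
weights, bias = [1.5, 0.5, 2.0], -2.0
good = pair_quality([1.0, 1.0, 0.9], weights, bias)  # well-aligned pair
bad = pair_quality([0.0, 0.0, 0.2], weights, bias)   # weakly aligned pair
```

Pairs scoring above a threshold (say 0.5) would be kept for the mined dataset; the point of the design is that the cheap structural features and the learned correspondence signal reinforce each other.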
Understanding the Role of Images on Stack Overflow
Images are increasingly being shared by software developers in diverse
channels including question-and-answer forums like Stack Overflow. Although
prior work has pointed out that these images are meaningful and provide
complementary information compared to their associated text, how images are
used to support questions is empirically unknown. To address this knowledge
gap, in this paper we specifically conduct an empirical study to investigate
(I) the characteristics of images, (II) the extent to which images are used in
different question types, and (III) the role of images on receiving answers.
Our results first show that user interface is the most common image content and
undesired output is the most frequent purpose for sharing images. Moreover,
these images essentially facilitate the understanding of 68% of sampled
questions. Second, we find that discrepancy questions are relatively more
frequent among questions with images than among those without, but we observe
no significant differences in description length across question types. Third,
the quantitative results statistically validate that questions with images are
more likely to receive accepted answers, but do not speed up the time to
receive answers. Our work demonstrates the crucial role that images play by
approaching the topic from a new angle and lays the foundation for future
opportunities to use images to assist in tasks like generating questions and
identifying question-relatedness.
Towards Query Logs for Privacy Studies: On Deriving Search Queries from Questions
Translating verbose information needs into crisp search queries is a
phenomenon that is ubiquitous but hardly understood. Insights into this process
could be valuable in several applications, including synthesizing large
privacy-friendly query logs from public Web sources which are readily available
to the academic research community. In this work, we take a step towards
understanding query formulation by tapping into the rich potential of community
question answering (CQA) forums. Specifically, we sample natural language (NL)
questions spanning diverse themes from the Stack Exchange platform, and conduct
a large-scale conversion experiment where crowdworkers submit search queries
they would use when looking for equivalent information. We provide a careful
analysis of this data, accounting for possible sources of bias during
conversion, along with insights into user-specific linguistic patterns and
search behaviors. We release a dataset of 7,000 question-query pairs from this
study to facilitate further research on query understanding.
Comment: ECIR 2020 Short Paper
Optimising the fit of Stack Overflow code snippets into existing code
Software developers often reuse code from online sources such as Stack Overflow within their projects. However, the process of searching for code snippets and integrating them within existing source code can be tedious. To improve efficiency and reduce the time spent on code reuse, we present an automated code reuse tool for the Eclipse IDE (Integrated Development Environment), NLP2TestableCode. NLP2TestableCode can not only search for Java code snippets using natural language tasks, but also evaluate code snippets based on a user's existing code, modifying snippets to improve fit and correcting errors, before presenting the user with the best snippet, all without leaving the editor. NLP2TestableCode also includes functionality to automatically generate customisable test cases and suggest argument and return types, in order to further evaluate code snippets. In evaluation, NLP2TestableCode found compilable code snippets for 82.9% of tasks, and testable code snippets for 42.9%.
Brittany Reid, Christoph Treude, Markus Wagner
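The snippet-evaluation step can be illustrated by filtering candidates through a compiler front end. NLP2TestableCode works on Java inside Eclipse; as a stand-in, this sketch uses Python's built-in compile() and ranks surviving snippets by length, a deliberately crude proxy for how easily a snippet fits into existing code:

```python
def rank_snippets(snippets):
    """Keep candidate snippets that parse, preferring shorter ones."""
    compilable = []
    for s in snippets:
        try:
            compile(s, "<candidate>", "exec")
            compilable.append(s)
        except SyntaxError:
            pass  # discard snippets the compiler rejects outright
    return sorted(compilable, key=len)  # length as a crude integration cost

ranked = rank_snippets([
    "y = 1 +",                   # truncated snippet: rejected
    "result = sum([1, 2, 3])",
    "x = 1",
])
```

A real tool would additionally rename variables to match the user's code and run generated test cases, but even this trivial filter removes the snippets that could never compile.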
Improving Developer Efficiency through Code Reuse
Code reuse is an integral part of modern software development, where most software is built using existing software artefacts. Ranging from the copy-pasting of code fragments to the use of third-party libraries, developers frequently turn to the internet to find already-made solutions to difficult programming tasks and save development time. However, the large number of libraries and code snippets online can make finding the best solution difficult, and reuse is not necessarily straightforward. Most online code snippets do not run, meaning developers need to spend time correcting errors, and when example code snippets are meant to demonstrate API usage, this can present a barrier to using new libraries. This work studies ways to aid developers in the code reuse process, in order to improve their efficiency. We look at ways to more easily connect developers to the wealth of libraries and usage examples online from within their programming environment with our tool for Node.js, Node Code Query (NCQ). We then evaluate how well developers perform compared to the conventional code reuse process and find that developers using our tool solve tasks faster and have to try fewer libraries. Additionally, we study what problems online Node.js code snippets have and how to best correct them automatically, to save developers time in this step of the reuse process. We find that through the combination of the TypeScript compiler's error detection and codefixes, and our line deletion and custom fixes, we can increase the percentage of error-free snippets in our dataset from 26.3% to 74.94%. Finally, we compare the emerging AI code snippet generation and pair programmer technologies to current online code snippet reuse practices, particularly looking at how snippets generated by GitHub's Copilot extension and those retrieved from Stack Overflow using Google might differ. We find that for the same set of queries, Copilot returned more snippets, with fewer errors, that were more relevant.
Ultimately, this work provides further evidence of how automating the code reuse process can improve developer efficiency, and proposes a series of solutions to that end. Additionally, we provide a comparison between existing and emerging reuse processes. As the state of code reuse changes, helping developers understand the strengths and weaknesses of these approaches will become increasingly important.
Thesis (Ph.D.) -- University of Adelaide, School of Computer and Mathematical Sciences, 202
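The line-deletion fix described above can be sketched in a few lines. The thesis targets Node.js snippets via the TypeScript compiler; as an analogy, this sketch uses Python's own parser to locate and delete offending lines until the snippet parses:

```python
def fix_by_line_deletion(snippet, max_rounds=10):
    """Repeatedly delete the line the parser complains about until it parses."""
    lines = snippet.splitlines()
    for _ in range(max_rounds):
        try:
            compile("\n".join(lines), "<snippet>", "exec")
            return "\n".join(lines)  # snippet now parses cleanly
        except SyntaxError as err:
            bad = min((err.lineno or 1) - 1, len(lines) - 1)
            del lines[bad]
    return None  # give up: not fixable by deletion alone

# A snippet with a stray line of Java-like residue on line 2.
fixed = fix_by_line_deletion("total = 0\nint counter = 0;\ntotal += 1")
```

Deletion is a blunt instrument, which is why the thesis pairs it with the compiler's own codefixes and custom repairs, but on copy-pasted snippets polluted by prompt text or foreign-language fragments it recovers a surprising number of runnable programs.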