129 research outputs found

    Code2Que: A Tool for Improving Question Titles from Mined Code Snippets in Stack Overflow

    Full text link
    Stack Overflow is one of the most popular technical Q&A sites used by software developers. Seeking help from Stack Overflow has become an essential part of software developers' daily work for solving programming-related questions. Although the Stack Overflow community has provided quality assurance guidelines to help users write better questions, we observed that a significant number of questions submitted to Stack Overflow are of low quality. In this paper, we introduce a new web-based tool, Code2Que, which can help developers in writing higher quality questions for a given code snippet. Code2Que consists of two main stages: offline learning and online recommendation. In the offline learning phase, we first collect a set of good quality pairs as training samples. We then train our model on these training samples via a deep sequence-to-sequence approach, enhanced with an attention mechanism, a copy mechanism and a coverage mechanism. In the online recommendation phase, for a given code snippet, we use the offline trained model to generate question titles to assist less experienced developers in writing questions more effectively. At the same time, we embed the given code snippet into a vector and retrieve the related questions with similar problematic code snippets.Comment: arXiv admin note: text overlap with arXiv:2005.1015

    Diverse Title Generation for Stack Overflow Posts with Multiple Sampling Enhanced Transformer

    Full text link
    Stack Overflow is one of the most popular programming communities where developers can seek help for their encountered problems. Nevertheless, if inexperienced developers fail to describe their problems clearly, it is hard for them to attract sufficient attention and get the anticipated answers. We propose M3_3NSCT5, a novel approach to automatically generate multiple post titles from the given code snippets. Developers may use the generated titles to find closely related posts and complete their problem descriptions. M3_3NSCT5 employs the CodeT5 backbone, which is a pre-trained Transformer model having an excellent language understanding and generation ability. To alleviate the ambiguity issue that the same code snippets could be aligned with different titles under varying contexts, we propose the maximal marginal multiple nucleus sampling strategy to generate multiple high-quality and diverse title candidates at a time for the developers to choose from. We build a large-scale dataset with 890,000 question posts covering eight programming languages to validate the effectiveness of M3_3NSCT5. The automatic evaluation results on the BLEU and ROUGE metrics demonstrate the superiority of M3_3NSCT5 over six state-of-the-art baseline models. Moreover, a human evaluation with trustworthy results also demonstrates the great potential of our approach for real-world application.Comment: under revie

    Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow

    Full text link
    For tasks like code synthesis from natural language, code retrieval, and code summarization, data-driven models have shown great promise. However, creating these models require parallel data between natural language (NL) and code with fine-grained alignments. Stack Overflow (SO) is a promising source to create such a data set: the questions are diverse and most of them have corresponding answers with high-quality code snippets. However, existing heuristic methods (e.g., pairing the title of a post with the code in the accepted answer) are limited both in their coverage and the correctness of the NL-code pairs obtained. In this paper, we propose a novel method to mine high-quality aligned data from SO using two sets of features: hand-crafted features considering the structure of the extracted snippets, and correspondence features obtained by training a probabilistic model to capture the correlation between NL and code using neural networks. These features are fed into a classifier that determines the quality of mined NL-code pairs. Experiments using Python and Java as test beds show that the proposed method greatly expands coverage and accuracy over existing mining methods, even when using only a small number of labeled examples. Further, we find that reasonable results are achieved even when training the classifier on one language and testing on another, showing promise for scaling NL-code mining to a wide variety of programming languages beyond those for which we are able to annotate data.Comment: MSR '1

    Understanding the Role of Images on Stack Overflow

    Full text link
    Images are increasingly being shared by software developers in diverse channels including question-and-answer forums like Stack Overflow. Although prior work has pointed out that these images are meaningful and provide complementary information compared to their associated text, how images are used to support questions is empirically unknown. To address this knowledge gap, in this paper we specifically conduct an empirical study to investigate (I) the characteristics of images, (II) the extent to which images are used in different question types, and (III) the role of images on receiving answers. Our results first show that user interface is the most common image content and undesired output is the most frequent purpose for sharing images. Moreover, these images essentially facilitate the understanding of 68% of sampled questions. Second, we find that discrepancy questions are more relatively frequent compared to those without images, but there are no significant differences observed in description length in all types of questions. Third, the quantitative results statistically validate that questions with images are more likely to receive accepted answers, but do not speed up the time to receive answers. Our work demonstrates the crucial role that images play by approaching the topic from a new angle and lays the foundation for future opportunities to use images to assist in tasks like generating questions and identifying question-relatedness

    Towards Query Logs for Privacy Studies: On Deriving Search Queries from Questions

    Get PDF
    Translating verbose information needs into crisp search queries is a phenomenon that is ubiquitous but hardly understood. Insights into this process could be valuable in several applications, including synthesizing large privacy-friendly query logs from public Web sources which are readily available to the academic research community. In this work, we take a step towards understanding query formulation by tapping into the rich potential of community question answering (CQA) forums. Specifically, we sample natural language (NL) questions spanning diverse themes from the Stack Exchange platform, and conduct a large-scale conversion experiment where crowdworkers submit search queries they would use when looking for equivalent information. We provide a careful analysis of this data, accounting for possible sources of bias during conversion, along with insights into user-specific linguistic patterns and search behaviors. We release a dataset of 7,000 question-query pairs from this study to facilitate further research on query understanding.Comment: ECIR 2020 Short Pape

    Optimising the fit of stack overflow code snippets into existing code

    Get PDF
    Software developers often reuse code from online sources such as Stack Overflow within their projects. However, the process of searching for code snippets and integrating them within existing source code can be tedious. In order to improve efficiency and reduce time spent on code reuse, we present an automated code reuse tool for the Eclipse IDE (Integrated Developer Environment), NLP2TestableCode. NLP2TestableCode can not only search for Java code snippets using natural language tasks, but also evaluate code snippets based on a user's existing code, modify snippets to improve fit and correct errors, before presenting the user with the best snippet, all without leaving the editor. NLP2TestableCode also includes functionality to automatically generate customisable test cases and suggest argument and return types, in order to further evaluate code snippets. In evaluation, NLP2TestableCode was capable of finding compilable code snippets for 82.9% of tasks, and testable code snippets for 42.9%.Brittany Reid, Christoph Treude, Markus Wagne

    Improving Developer Efficiency through Code Reuse

    Get PDF
    Code reuse is an integral part of modern software development, where most software is built using existing software artefacts. Ranging from the copy-pasting of code fragments to the use of third-party libraries, developers frequently turn to the internet to find already-made solutions to difficult programming tasks and save development time. However, the large amount of libraries and code online can make finding the best solution difficult, and reuse is not necessarily straightforward. Most online code snippets do not run, meaning developers need to spend time correcting errors, and when example code snippets are meant to demonstrate API usage, this can present a barrier to using new libraries. This work studies ways to aid developers in the code reuse process, in order to improve their efficiency. We look at ways to more easily connect developers to the wealth of libraries and usage examples online from within their programming environment with our tool for Node.js, Node Code Query (NCQ). We then evaluate how well developers perform compared to the conventional code reuse process and found that developers using our tool solve tasks faster and have to try fewer libraries. Additionally, we study what problems online Node.js code snippets have and how to best correct them automatically, to save developers time in this step of the reuse process. We find that through the combination of the TypeScript compiler’s error detection and codefixes, and our line deletion and custom fixes, we can increase the percentage error-free snippets in our dataset from 26.3% to 74.94%. Finally, we compare the emerging AI code snippet generation and pair programmer technologies to current online code snippet reuse practices, particularly looking at how snippets generated by GitHub’s Copilot extension and those retrieved from Stack Overflow using Google might differ. We find that for the same set of queries, Copilot returned more snippets, with fewer errors and that were more relevant. Ultimately, this work provides further evidence of how automating the code reuse process can improve developer efficiency, and proposes a series of solutions to that end. Additionally, we provide a comparison between existing and emerging reuse processes. As the state of code reuse changes, helping developers understand the strengths of weaknesses of these approaches will become increasingly important.Thesis (Ph.D.) -- University of Adelaide, School of Computer and Mathematical Sciences, 202
    • …
    corecore