9 research outputs found

Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow

Author: Allamanis Miltiadis
Bahdanau Dzmitry
Devanbu Premkumar
Gal Yarin
Gu Jiatao
Kalchbrenner Nal
Lin Xi Victoria
Luong Thang
Movshovitz-Attias Dana
Wong Edmund
Publication date: 22/05/2018

For tasks like code synthesis from natural language, code retrieval, and code summarization, data-driven models have shown great promise. However, creating these models requires parallel data between natural language (NL) and code with fine-grained alignments. Stack Overflow (SO) is a promising source to create such a data set: the questions are diverse and most of them have corresponding answers with high-quality code snippets. However, existing heuristic methods (e.g., pairing the title of a post with the code in the accepted answer) are limited both in their coverage and the correctness of the NL-code pairs obtained. In this paper, we propose a novel method to mine high-quality aligned data from SO using two sets of features: hand-crafted features considering the structure of the extracted snippets, and correspondence features obtained by training a probabilistic model to capture the correlation between NL and code using neural networks. These features are fed into a classifier that determines the quality of mined NL-code pairs. Experiments using Python and Java as test beds show that the proposed method greatly expands coverage and accuracy over existing mining methods, even when using only a small number of labeled examples. Further, we find that reasonable results are achieved even when training the classifier on one language and testing on another, showing promise for scaling NL-code mining to a wide variety of programming languages beyond those for which we are able to annotate data. Comment: MSR '1
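As a rough illustration of the pipeline this abstract describes (structural features plus a correspondence score, fed into a classifier), the sketch below scores one candidate NL-code pair with a logistic function. The three structural features, the weights, the bias, and the function names are all assumptions made for illustration. They are not the features or parameters used in the paper, and the correspondence score is passed in as a plain number rather than computed by a neural model.

```python
import math

# Hypothetical weights: three structural features followed by the
# correspondence score. In practice these would be learned from labeled pairs.
WEIGHTS = [-0.2, -0.5, -1.0, 2.0]
BIAS = 0.0


def structural_features(snippet):
    """Hand-crafted features describing the shape of a candidate snippet."""
    lines = [ln for ln in snippet.splitlines() if ln.strip()]
    return [
        float(len(lines)),  # number of non-empty lines
        1.0 if any(ln.lstrip().startswith("import ") for ln in lines) else 0.0,
        1.0 if snippet.lstrip().startswith(">>>") else 0.0,  # interpreter transcript
    ]


def pair_quality(snippet, correspondence_score, weights=WEIGHTS, bias=BIAS):
    """Return a value in (0, 1) estimating how well the snippet matches the NL intent."""
    features = structural_features(snippet) + [correspondence_score]
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

With these made-up weights, a higher correspondence score raises the quality estimate, while long snippets and interpreter transcripts lower it.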

arXiv.org e-Print Archive

Crossref

Mining Question and Answer Sites for Automatic Comment Generation

Author: Edmund Wong
Publication venue: 'University of Waterloo'
Publication date: 28/04/2014

Code comments improve software maintainability, programming productivity, and software reliability. To address the comment scarcity issue in many projects and save developers’ time in writing comments, we propose a new, general automatic comment generation approach, which mines comments from a large programming Question and Answer (Q&A) site. Q&A sites allow programmers to post questions and receive solutions, which contain code segments together with their descriptions, referred to as code-description mappings. We develop AutoComment to extract such mappings, and leverage them to generate description comments automatically for similar code segments matched in open source projects. We apply AutoComment to analyze 92,140 Java and Android tagged Q&A posts to extract 132,767 code-description mappings, which help AutoComment generate 102 comments automatically for 23 Java and Android projects. The number of generated comments is still low, but the user study results show that the majority of the participants consider the generated comments accurate, adequate, concise, and useful in helping them understand the code. One of the advantages of mining Q&A sites for automatic comment generation is that human-written comments can provide information that is not explicitly in the code. In the future, we would like to focus on improving both the yield and quality of the generated comments. To improve the yield, we can replace the token-based clone detection tool with one that can detect addition and reordering of lines to increase the number of code matches. To improve the quality, we can apply advanced natural language processing techniques such as semantic role labeling to analyze the semantics of the sentences, or typed dependencies to analyze the grammatical structure of the sentences.
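The idea of reusing a code-description mapping for a matching code segment can be sketched as follows. This is a deliberately simplified stand-in: it matches only segments whose token sequences are identical after whitespace is ignored, whereas the abstract refers to a token-based clone detection tool whose behaviour is not specified here. The function names and the sample mapping are hypothetical.

```python
import re


def tokens(code):
    """Split code into identifiers, numbers, and single punctuation characters."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def suggest_comment(segment, mappings):
    """Return the description of the first mapping whose code has the same tokens.

    `mappings` is a list of (code, description) pairs, standing in for
    code-description mappings mined from Q&A posts.
    """
    target = tokens(segment)
    for code, description in mappings:
        if tokens(code) == target:
            return description
    return None


# Hypothetical mined mapping, for illustration only.
MAPPINGS = [("int x = 0;", "Initialise the counter to zero.")]
```

Here `suggest_comment("int x=0;", MAPPINGS)` finds the mapping despite the different spacing, while a segment with a renamed variable is not matched. That limitation is the kind of missed match the abstract hopes to reduce with a more tolerant clone detector.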

University of Waterloo's Institutional Repository

SWordNet: Inferring Semantically Related Words from Software Context

Author: Yang Jinqiu
Publication venue: 'University of Waterloo'
Publication date: 01/01/2013

Code search is an integral part of software development and program comprehension. The difficulty of code search lies in the inability to guess the exact words used in the code. Therefore, it is crucial for keyword-based code search to expand queries with semantically related words, e.g., synonyms and abbreviations, to increase the search effectiveness. However, resources such as English dictionaries and WordNet are of limited use for obtaining semantically related words in software, because many words that are semantically related in software are not semantically related in English. Conversely, many words that are semantically related in English are not semantically related in software. This thesis proposes a simple and general technique to automatically infer semantically related words (referred to as rPairs) in software by leveraging the context of words in comments and code. In addition, we propose a ranking algorithm on the rPair results and study cross-project rPairs on two sets of software with similar functionality, i.e., media browsers and operating systems. We achieve a reasonable accuracy in nine large and popular code bases written in C and Java. Our further evaluation against the state of the art shows that our technique can achieve a higher precision and recall. In addition, the proposed ranking algorithm improves the rPair extraction accuracy by bringing correct rPairs to the top of the list. Our cross-project study successfully discovers overlapping rPairs among projects of similar functionality and finds that cross-project rPairs are more likely to be correct than project-specific rPairs. Since the cross-project rPairs are highly likely to be general for software of the same type, the discovered overlapping rPairs can benefit other projects of the same type that have not been analyzed.
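One simple way to picture inferring related words from shared context is shown below. It assumes, purely for illustration, that two word sequences of equal length that differ in exactly one position expose a candidate rPair. The thesis's actual matching and ranking rules are not given in this abstract, so the rule and the function name here are assumptions.

```python
from itertools import combinations


def candidate_rpairs(sentences):
    """Collect word pairs that occupy the same slot in otherwise identical sentences."""
    tokenized = [s.lower().split() for s in sentences]
    pairs = set()
    for a, b in combinations(tokenized, 2):
        if len(a) != len(b):
            continue  # simplification: only compare sequences of equal length
        diffs = [(x, y) for x, y in zip(a, b) if x != y]
        if len(diffs) == 1:
            # Store each pair in sorted order so (a, b) and (b, a) coincide.
            pairs.add(tuple(sorted(diffs[0])))
    return pairs
```

Given the comments "disable all interrupts" and "mask all interrupts", the function proposes ("disable", "mask"), a pair that is related in operating-system code but not in general English.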

University of Waterloo's Institutional Repository

Compreendendo programas por meio de Design by Contract: um estudo com desenvolvedores.

Author: CARVALHO JÚNIOR Normando Gomes de.
Publication venue: UFCG
Publication date: 18/05/2018

Program comprehension is generally a difficult task, since each part of the code meets specific requirements. In some cases, factors such as comprehending programs that were written by others, the limited scope of existing tools, and the lack of documentation add complexity. Therefore, developers need an effective program comprehension approach that reduces maintenance costs and the risk of errors if the program comprehension is incomplete. To mitigate that problem, systematic, tool-supported comprehension approaches are used that rely on static verification (source code analysis) or dynamic verification (data on execution). Dynamic approaches are used because of their effectiveness, since they simply run a test to see the result; however, they fail to use high-level behavioral information that can be verified. Still, this information can be captured by defining contracts, for example in the Design by Contract methodology. Nevertheless, to the best of our knowledge, there is no systematic approach to program comprehension that uses contracts. In this work, we propose a study of program comprehension supported by a systematized approach, in order to prioritize the writing of contracts in C# programs using the principles of Design by Contract through the Code Contracts library. The approach can nonetheless be used in any programming language that supports Design by Contract. We evaluated it in software development environments with 12 developers from a research and development center in computer science, applying the approach to three methods (routines) of an open-source project. The results of our study indicate better comprehension of methods when contracts are used and, in turn, that the approach favors the writing of contracts in methods initially unknown to the developers.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Biblioteca Digital de Teses e Dissertações da Universidade Federal de Campina Grande

Improving Software Dependability through Documentation Analysis

Author: Wong Edmund
Publication venue: 'University of Waterloo'
Publication date: 21/01/2019

Software documentation contains critical information that describes a system’s functionality and requirements. Documentation exists in several forms, including code comments, test plans, manual pages, and user manuals. The lack of documentation in existing software systems is an issue that impacts software maintainability and programmer productivity. Since some code bases contain a large amount of documentation, we want to leverage this existing documentation to improve software dependability. Specifically, we utilize documentation to help detect software bugs and repair corrupted files, which can reduce the number of software errors and failures and so improve a system’s reliability (e.g., continuity of correct service). We also generate documentation (e.g., code comments) automatically to help developers understand the source code, which helps improve a system’s maintainability (e.g., ability to undergo repairs and modifications). In this thesis, we analyze software documentation and propose two branches of work, which focus on three types of documentation: manual pages, code comments, and user manuals. The first branch of work focuses on documentation analysis because documentation contains valuable information that describes the behavior of the program. We automatically extract constraints from documentation and apply them to a dynamic analysis symbolic execution tool to find bugs in the target software, and we extract constraints manually from documentation and apply them to a structured-file parsing application to repair corrupted PDF files. The second branch of work focuses on automatic code comment generation to improve software documentation. For documentation analysis, we propose and implement DASE and DocRepair. DASE leverages automatically extracted constraints from documentation to improve a dynamic analysis symbolic execution tool.
DASE guides symbolic execution to focus the testing on execution paths that execute a program’s core functionalities using constraints learned from the documentation. We evaluated DASE on 88 programs from five mature real-world software suites to detect software bugs. DASE detects 12 previously unknown bugs that symbolic execution would fail to detect when given no input constraints, 6 of which have been confirmed by the developers. For DocRepair, we perform an empirical study of corrupted PDF files and how to repair them. We create the first dataset of 319 corrupted PDF files and conduct an empirical study on 119 real-world corrupted PDF files to study the common types of file corruption. Based on the result of the empirical study we propose a technique called DocRepair. DocRepair’s repair algorithm includes seven repair operators that utilize manually extracted constraints from documentation to repair corrupted files. We evaluate DocRepair against three common PDF repair tools. Amongst the 1,827 collected corrupted files from over two corpora of PDF files, DocRepair successfully repairs 354 files, compared with Mutool, PDFtk, and GhostScript, which repair 508, 41, and 84 respectively. We also propose a technique to combine multiple repair tools called DocRepair+, which can successfully repair 751 files. Where documentation is lacking, DASE and DocRepair+ would not work. Therefore, we propose automated documentation generation to address the issue. We propose and implement CloCom+ to generate code comments by mining both existing software repositories in GitHub and a Question and Answer site, Stack Overflow. CloCom+ generated 442 unique comments for 16 Java projects. Although CloCom+ improves on previous work, SumSlice, on automatic comment generation, the quality (evaluated on completeness, conciseness, expressiveness, and usefulness) and yield (number of generated comments) are still rather low, which makes the technique not ready for real-world usage.
In the future, it may be possible to combine the two proposed branches of work (documentation analysis and documentation generation) to further improve software dependability. For example, we can extract constraints from the automatically generated documentation (e.g., code comments).

University of Waterloo's Institutional Repository

Information Foraging Theory as a Unifying Foundation for Software Engineering Research : Connecting the Dots

Author: Piorkowski David
Publication venue: 'Oregon State University'

Empirical studies have shown that programmers spend up to one-third of their time navigating through code during debugging. Although researchers have conducted empirical studies to understand programmers’ navigation difficulties and developed tools to address those difficulties, the resulting findings tend to be loosely connected to each other. To address this gap, we propose using theory to “connect the dots” between software engineering (SE) research findings. Our theory of choice is Information Foraging Theory (IFT), which explains and predicts how people seek information in an environment. Because navigating code is a fundamental aspect of software engineering, IFT is well-suited as a unifying foundation. In this dissertation, we investigated IFT’s suitability as a unifying foundation for SE through a combination of tool building and empirical user studies of programmers debugging. Our contributions show how IFT can help to unify SE research via cross-cutting insights spanning multiple software engineering subdisciplines.

ScholarsArchive@OSU