9 research outputs found

    Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow

    For tasks like code synthesis from natural language, code retrieval, and code summarization, data-driven models have shown great promise. However, creating these models requires parallel data between natural language (NL) and code with fine-grained alignments. Stack Overflow (SO) is a promising source for creating such a data set: the questions are diverse, and most of them have corresponding answers with high-quality code snippets. However, existing heuristic methods (e.g., pairing the title of a post with the code in the accepted answer) are limited both in their coverage and in the correctness of the NL-code pairs obtained. In this paper, we propose a novel method to mine high-quality aligned data from SO using two sets of features: hand-crafted features that consider the structure of the extracted snippets, and correspondence features obtained by training a probabilistic model to capture the correlation between NL and code using neural networks. These features are fed into a classifier that determines the quality of mined NL-code pairs. Experiments using Python and Java as test beds show that the proposed method greatly expands coverage and accuracy over existing mining methods, even when using only a small number of labeled examples. Further, we find that reasonable results are achieved even when training the classifier on one language and testing on another, showing promise for scaling NL-code mining to a wide variety of programming languages beyond those for which we are able to annotate data. Comment: MSR '1
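
    The existing heuristic mentioned in this abstract (pairing the title of a post with the code in the accepted answer) can be sketched roughly as follows. This is only an illustration of that baseline, not the paper's mining method: the `post` dictionary layout, its field names, and the function name are assumptions made for the example.

```python
from typing import Optional, Tuple


def pair_title_with_accepted_code(post: dict) -> Optional[Tuple[str, str]]:
    """Baseline heuristic: pair the question title with the first code
    snippet found in the accepted answer.

    `post` uses a hypothetical layout (an assumption for this sketch):
      {"title": str, "accepted_answer_id": int or None,
       "answers": [{"id": int, "code_snippets": [str, ...]}, ...]}
    """
    accepted_id = post.get("accepted_answer_id")
    if accepted_id is None:
        return None  # no accepted answer, so the heuristic yields nothing
    for answer in post.get("answers", []):
        snippets = answer.get("code_snippets", [])
        if answer.get("id") == accepted_id and snippets:
            return post["title"], snippets[0]
    return None  # accepted answer has no code snippet


example_post = {
    "title": "How to reverse a list?",
    "accepted_answer_id": 2,
    "answers": [
        {"id": 1, "code_snippets": ["list(reversed(xs))"]},
        {"id": 2, "code_snippets": ["xs[::-1]"]},
    ],
}
pair = pair_title_with_accepted_code(example_post)
```

    The sketch makes the coverage limitation visible: posts without an accepted answer, and useful snippets in non-accepted answers, are never returned.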

    Mining Question and Answer Sites for Automatic Comment Generation

    Code comments improve software maintainability, programming productivity, and software reliability. To address the scarcity of comments in many projects and save developers' time in writing them, we propose a new, general automatic comment generation approach that mines comments from a large programming Question and Answer (Q&A) site. Q&A sites allow programmers to post questions and receive solutions, which contain code segments together with their descriptions, referred to as code-description mappings. We develop AutoComment to extract such mappings and leverage them to generate description comments automatically for similar code segments matched in open source projects. We apply AutoComment to 92,140 Java- and Android-tagged Q&A posts and extract 132,767 code-description mappings, which help AutoComment generate 102 comments automatically for 23 Java and Android projects. The number of generated comments is still low, but the user study results show that the majority of the participants consider the generated comments accurate, adequate, concise, and useful in helping them understand the code. One advantage of mining Q&A sites for automatic comment generation is that human-written comments can provide information that is not explicit in the code. In the future, we would like to focus on improving both the yield and the quality of the generated comments. To improve the yield, we can replace the token-based clone detection tool with one that can detect the addition and reordering of lines, which would increase the number of code matches. To improve the quality, we can apply advanced natural language processing techniques, such as semantic role labeling to analyze the semantics of the sentences, or typed dependencies to analyze their grammatical structure.

    SWordNet: Inferring Semantically Related Words from Software Context

    Code search is an integral part of software development and program comprehension. The difficulty of code search lies in the inability to guess the exact words used in the code. Therefore, it is crucial for keyword-based code search to expand queries with semantically related words, e.g., synonyms and abbreviations, to increase search effectiveness. However, relying on resources such as English dictionaries and WordNet to obtain semantically related words in software is limiting, because many words that are semantically related in software are not semantically related in English. Conversely, many words that are semantically related in English are not semantically related in software. This thesis proposes a simple and general technique to automatically infer semantically related words (referred to as rPairs) in software by leveraging the context of words in comments and code. In addition, we propose a ranking algorithm for the rPair results and study cross-project rPairs on two sets of software with similar functionality, i.e., media browsers and operating systems. We achieve reasonable accuracy on nine large and popular code bases written in C and Java. Our further evaluation against the state of the art shows that our technique can achieve higher precision and recall. In addition, the proposed ranking algorithm improves the rPair extraction accuracy by bringing correct rPairs to the top of the list. Our cross-project study successfully discovers overlapping rPairs among projects of similar functionality and finds that cross-project rPairs are more likely to be correct than project-specific rPairs. Since the cross-project rPairs are highly likely to be general for software of the same type, the discovered overlapping rPairs can benefit other projects of the same type that have not been analyzed.
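
    One simplified way to read "leveraging the context of words" is to look for two comment sentences that are identical except for a single word and treat the differing words as a candidate pair. The sketch below illustrates that idea only; it is not the thesis's algorithm. The function name, the `min_shared` threshold, and the use of raw frequency as a stand-in for ranking are all assumptions.

```python
from collections import Counter
from itertools import combinations


def candidate_rpairs(sentences, min_shared=2):
    """Count word pairs that fill the single differing position in two
    equal-length sentences sharing at least `min_shared` other words.

    Simplified illustration (an assumption), not the thesis's technique.
    """
    counts = Counter()
    tokenized = [s.lower().split() for s in sentences]
    for a, b in combinations(tokenized, 2):
        if len(a) != len(b):
            continue  # only compare sentences with the same number of tokens
        diffs = [(x, y) for x, y in zip(a, b) if x != y]
        # Exactly one differing position, with enough shared context around it.
        if len(diffs) == 1 and len(a) - 1 >= min_shared:
            counts[tuple(sorted(diffs[0]))] += 1
    # Most frequent candidates first (frequency is a placeholder for ranking).
    return counts.most_common()


comments = [
    "call the function again",
    "call the method again",
    "invoke the function again",
    "free the buffer",
]
pairs = candidate_rpairs(comments)
```

    Here "function"/"method" and "call"/"invoke" surface as candidates because each pair appears in otherwise identical sentences, a relation a general English dictionary would not necessarily capture.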

    Compreendendo programas por meio de Design by Contract: um estudo com desenvolvedores (Understanding Programs through Design by Contract: A Study with Developers)

    Program comprehension is generally a difficult task, since each piece of code meets specific requirements. In some cases, factors such as having to comprehend programs written by others, the limited scope of existing tools, and the lack of documentation add complexity. Therefore, developers need an effective program comprehension approach that reduces maintenance costs and the risk of errors when comprehension of the program is incomplete. Systematic, tool-supported approaches to comprehension exist that use static verification (source code analysis) or dynamic verification (data on execution). Dynamic approaches are used because of their effectiveness, since it is enough to run a test to see the result; however, they fail to use high-level behavioral information that can be verified. This information can be captured by defining contracts, for example in the Design by Contract methodology. Nevertheless, to the best of our knowledge, there is no systematic approach to program comprehension that uses contracts. In this work, we propose a study of program comprehension supported by a systematized approach that prioritizes the writing of contracts in C# programs, using the principles of Design by Contract through the Code Contracts library. The approach can nevertheless be used in any programming language that supports Design by Contract. We evaluated it in software development environments with 12 developers from a computer science research and development center, applying the approach to three methods (routines) of an open-source project. The results of our study give indications of better comprehension of methods when contracts are used and, on the other hand, of the approach favoring the writing of contracts for methods that were initially unknown to the developers.

    Improving Software Dependability through Documentation Analysis

    Software documentation contains critical information that describes a system's functionality and requirements. Documentation exists in several forms, including code comments, test plans, manual pages, and user manuals. The lack of documentation in existing software systems is an issue that impacts software maintainability and programmer productivity. Since some code bases contain a large amount of documentation, we want to leverage this existing documentation to improve software dependability. Specifically, we utilize documentation to help detect software bugs and repair corrupted files, which can reduce the number of software errors and failures and thereby improve a system's reliability (e.g., continuity of correct service). We also generate documentation (e.g., code comments) automatically to help developers understand the source code, which helps improve a system's maintainability (e.g., ability to undergo repairs and modifications). In this thesis, we analyze software documentation and propose two branches of work, which focus on three types of documentation: manual pages, code comments, and user manuals. The first branch of work focuses on documentation analysis, because documentation contains valuable information that describes the behavior of the program. We automatically extract constraints from documentation and apply them to a dynamic-analysis symbolic execution tool to find bugs in the target software, and we extract constraints manually from documentation and apply them to a structured-file parsing application to repair corrupted PDF files. The second branch of work focuses on automatic code comment generation to improve software documentation. For documentation analysis, we propose and implement DASE and DocRepair. DASE leverages automatically extracted constraints from documentation to improve a dynamic-analysis symbolic execution tool.
DASE uses constraints learned from the documentation to guide symbolic execution toward execution paths that exercise a program's core functionalities. We evaluated DASE on 88 programs from five mature real-world software suites to detect software bugs. DASE detects 12 previously unknown bugs that symbolic execution would fail to detect when given no input constraints, 6 of which have been confirmed by the developers. For DocRepair, we perform an empirical study of corrupted PDF files and their repair. We create the first dataset of 319 corrupted PDF files and conduct an empirical study on 119 real-world corrupted PDF files to identify the common types of file corruption. Based on the results of the empirical study, we propose a technique called DocRepair. DocRepair's repair algorithm includes seven repair operators that utilize manually extracted constraints from documentation to repair corrupted files. We evaluate DocRepair against three common PDF repair tools. Amongst the 1,827 collected corrupted files from over two corpora of PDF files, DocRepair successfully repairs 354 files, compared with Mutool, PDFtk, and GhostScript, which repair 508, 41, and 84, respectively. We also propose a technique to combine multiple repair tools, called DocRepair+, which successfully repairs 751 files. Where documentation is lacking, DASE and DocRepair+ would not work. Therefore, we propose automated documentation generation to address the issue. We propose and implement CloCom+ to generate code comments by mining both existing software repositories on GitHub and a Question and Answer site, Stack Overflow. CloCom+ generated 442 unique comments for 16 Java projects. Although CloCom+ improves on previous work on automatic comment generation, SumSlice, the quality (evaluated on completeness, conciseness, expressiveness, and usefulness) and yield (number of generated comments) are still rather low, which means the technique is not yet ready for real-world usage.
In the future, it may be possible to combine the two proposed branches of work (documentation analysis and documentation generation) to further improve software dependability. For example, we can extract constraints from the automatically generated documentation (e.g., code comments)