8 research outputs found

    Candoia: A Platform and an Ecosystem for Building and Deploying Versatile Mining Software Repositories Tools

    Get PDF
    Research on mining software repositories (MSR) has shown great promise during the last decade in solving many challenging software engineering problems. There exists, however, a ‘valley of death’ between these significant innovations in the MSR research and their deployment in practice. The significant cost of converting a prototype to software; need to provide support for a wide variety of tools and technologies e.g. CVS, SVN, Git, Bugzilla, Jira, Issues, etc, to improve applicability; and the high cost of customizing tools to practitioner-specific settings are some key hurdles in transition to practice. We describe Candoia, a platform and an ecosystem that is aimed at bridging this valley of death between innovations in MSR research and their deployment in practice. We have implemented Candoia and provide facilities to build and publish MSR ideas as Candoia apps. Our evaluation demonstrates that Candoia drastically reduces the cost of converting an idea to an app, thus reducing the barrier to transitioning research findings into practice. We also see versatility, in Candoia app’s ability to work with a variety of tools and technologies that the platform supports. Finally, we find that customizing Candoia app to fit project-specific needs is often well within the grasp of developers

    Candoia: a platform for building and sharing mining software repositories tools as apps

    Get PDF
    We propose Candoia, a novel platform and ecosystem for building and sharing Mining Software Repositories (MSR) tools. Using Candoia, MSR tools are built as apps and Candoia ecosystem, acting as an appstore, allows effective sharing. Candoia platform provides, data extraction tools for curating custom datasets for user projects, and data abstractions for enabling uniform access to MSR artifacts from disparate sources, which makes apps portable and adoptable across diverse software project settings of MSR researchers and practitioners. The structured design of a Candoia app and the languages selected for building various components of a Candoia app promotes easy customization. To evaluate Candoia we have built over two dozen MSR apps for analyzing bugs, software evolution, project management aspects, and source code and programming practices showing the applicability of the platform for building a variety of MSR apps. For testing portability of apps across diverse project settings, we tested the apps using ten popular project repositories, such as Apache Tomcat, JUnit, Node.js, etc, and found that apps required no changes to be portable. We performed a user study to test customizability and we found that five of eight Candoia users found it very easy to customize an existing app. Candoia is available for download

    Mining preconditions of APIs in large-scale code corpus

    Get PDF
    Modern software relies on existing application programming interfaces (APIs) from libraries. Formal specifications for the APIs enable many software engineering tasks as well as help developers correctly use them. In this work, we mine large-scale repositories of existing open-source software to derive potential preconditions for API methods. Our key idea is that APIs’ preconditions would appear frequently in an ultra-large code corpus with a large number of API usages, while project-specific conditions will occur less frequently. First, we find all client methods invoking APIs. We then compute a control dependence relation from each call site and mine the potential conditions used to reach those call sites. We use these guard conditions as a starting point to automatically infer the preconditions for each API. We analyzed almost 120 million lines of code from SourceForge and Apache projects to infer preconditions for the standard Java Development Kit (JDK) library. The results show that our technique can achieve high accuracy with recall from 75–80% and precision from 82–84%. We also found 5 preconditions missing from human written specifications. They were all confirmed by a specification expert. In a user study, participants found 82% of the mined preconditions as a good starting point for writing specifications. Using our mining result, we also built a benchmark of more than 4,000 precondition-related bugs

    Exploiting implicit belief to resolve sparse usage problem in usage-based specification mining

    Get PDF
    Frameworks and libraries provide application programming interfaces (APIs) that serve as building blocks in modern software development. As APIs present the opportunity of increased productivity, it also calls for correct use to avoid buggy code. The usage-based specification mining technique has shown great promise in solving this problem through a data-driven approach. These techniques leverage the use of the API in large corpora to understand the recurring usages of the APIs and infer behavioral specifications (pre- and post-conditions) from such usages. A challenge for such technique is thus inference in the presence of insufficient usages, in terms of both frequency and richness.We refer to this as a “sparse usage problem. This thesis presents the first technique to solve the sparse usage problem in usage-based precondition mining. Our key insight is to leverage implicit beliefs to overcome sparse usage. An implicit belief (IB) is the knowledge implicitly derived from the fact about the code. An IB about a program is known implicitly to a programmer via the language’s constructs and semantics, and thus not explicitly written or specified in the code. The technical underpinnings of our new precondition mining approach include a technique to analyze the data and control flow in the program leading to API calls to infer preconditions that are implicitly present in the code corpus, a catalog of 35 code elements in total that can be used to derive implicit beliefs from a program, and empirical evaluation of all of these ideas.We have analyzed over 350 millions lines of code and 7 libraries that suffer from the sparse usage problem. Our approach realizes 6 implicit beliefs and we have observed that adding single-level context sensitivity can further improve the result of usage-based precondition mining. The result shows that we achieve overall 60% in precision and 69% in recall and the accuracy is relatively improved by 32% in precision and 78% in recall compared to base usage-based mining approach for these libraries

    Software Library Usage Pattern Extraction Using a Software Model Checker

    No full text

    Statistical Machine Translation of English Text to API Code Usages: A comparison of Word Map, Contextual Graph Ordering, Phrase-based, and Neural Network Translations

    Get PDF
    Statistical Machine Translation (SMT) has gained enormous popularity in recent years as natural language translations have become increasingly accurate. In this thesis we apply SMT techniques in the context of translating English descriptions of programming tasks to source code. We evaluate four existing approaches: maximum likelihood word maps, ContextualExpansion, phrase-based, and neural network translation. As a training and test (i.e. reference translation) data set we clean and align the popular developer discussion forum StackOverflow. Our baseline approach, WordMapK, uses a simple maximum likelihood word map model which is then ordered using existing code usage graphs. The approach is quite effective, with a precision and recall of 20 and 50, respectively. Adding context to the word map model, ContextualExpansion, is able to increase the precision to 25 with a recall of 40. The traditional phrase-based translation model, Moses, achieves a similar precision and recall also incorporating the context of the input text by mapping English sequences to code sequences. The final approach is neural network translation, OpenNMT. While the median precision is 100 the recall is only 20. When manually examining the output of the neural translation, the code usages are very small and obvious. Our results represent an application of existing natural language strategies in the context of software engineering. We make our scripts, corpus, and reference translations in the hope that future work will adapt these techniques to further increase the quality of English to code statistical machine translation

    Analyzing repetitiveness in big code to support software maintenance and evolution

    Get PDF
    Software systems inevitably contain a large amount of repeated artifacts at different level of abstraction---from ideas, requirements, designs, algorithms to implementation. This dissertation focuses on analyzing software repetitiveness at implementation code level and leveraging the derived knowledge for easing tasks in software maintenance and evolution such as program comprehension, API use, change understanding, API adaptation and bug fixing. The guiding philosophy of this work is that, in a large corpus, code that conforms to specifications appears more frequently than code that does not, and similar code is changed similarly and similar code could have similar bugs that can be fixed similarly. We have developed different representations for software artifacts at source code level, and the corresponding algorithms for measuring code similarity and mining repeated code. Our mining techniques bases on the key insight that code that conforms to programming patterns and specifications appears more frequently than code that does not. Thus, correct patterns and specifications can be mined from large code corpus. We also have built program differencing techniques for analyzing changes in software evolution. Our key insight is that similar code is likely changed in similar ways and similar code likely has similar bug(s) which can be fixed similarly. Therefore, learning changes and fixes from the past can help automatically detect and suggest changes/fixes to the repeated code in software development. Our empirical evaluation shows that our techniques can accurately and efficiently detect repeated code, mine useful programming patterns and API specifications, and recommend changes. It can also detect bugs and suggest fixes, and provide actionable insights to ease maintenance tasks. Specifically, our code clone detection tool detects more meaningful clones than other tools. Our mining tools recover high quality programming patterns and API preconditions. The mined results have been used to successfully detect many bugs violating patterns and specifications in mature open-source systems. The mined API preconditions are shown to help API specification writer identify missing preconditions in already-specified APIs and start building preconditions for the not-yet-specified ones. The tools are scalable which analyze large systems in reasonable times. Our study on repeated changes give useful insights for program auto-repair tools. Our automated change suggestion approach achieves top-1 accuracy of 45%-51% which relatively improves more than 200% over the base approach. For a special type of change suggestion, API adaptation, our tool is highly correct and useful

    Exploring regularities in software with statistical models and their applications

    Get PDF
    Software systems are becoming popular. They are used with different platforms for different applications. Software systems are developed with support from programming languages, which help developers work conveniently. Programming languages can have different paradigms with different form, syntactic structures, keywords, and representation ways. In many cases, however, programming languages are similar in different important aspects: 1. They are used to support description of specific tasks, 2. Source codes are written in languages and includes a limit set of distinctive tokens, many tokens are repeated like keywords, function calls, and 3. They follow specific syntactic rules to make machine understanding. Those points also respect the similarity between programming language and natural language. Due to its critical role in many applications, natural language processing (NLP) has been studied much and given many promising results like automatic cross-language translation, speech-to-text, information searching, etc. It is interesting to observe if there are similar characteristics between natural language and programming language and whether techniques in NLP can be reused for programming language processing? Recent works in software engineering (SE) shows that their similarities between NLP and programming language processing and techniques in NLP can be reused for PLP. This dissertation introduces my works with contributions in study of characteristics of programming languages, the models which employed them and the main applications that show the usefulness of the proposed models. Study in both three aspects has drawn interests from software engineering community and received awards due to their innovation and applicability. \u27 I hope that this dissertation will bring a systematic view of how advantage techniques in natural language processing and machine learning can be re-used and give huge benefit for programming language processing, and how those techniques are adapted with characteristics of programming language and software systems
    corecore