8 research outputs found

    Code search via topic-enriched dependence graph matching

    Get PDF
    Abstract—Source code contains textual, structural, and semantic information, all of which can be leveraged for effective search. Some studies have proposed semantic code search, where users can specify query topics in natural language. Other studies support searching through system dependence graphs. In this paper, we propose a semantic dependence search engine that integrates both kinds of techniques and can retrieve code snippets based on expressive user queries describing both topics and dependencies. Users can specify their search targets in a free-form format describing desired topics (i.e., the high-level semantics or functionality of the target code); a specialized graph query language allows users to describe low-level data and control dependencies in code and thus helps to refine the free-form queries. Our empirical evaluation on a number of software maintenance tasks shows that our search engine can efficiently and accurately locate desired code fragments. Keywords: code search; topic modelling; dependence graphs
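The abstract above combines a free-form topic query with explicit dependence constraints. A minimal sketch of that idea, with entirely hypothetical names and data (the paper's actual query language and matching algorithm are not reproduced here), might filter candidate snippets first by topic terms and then by required dependence edges:

```python
def matches(snippet, topic_terms, required_deps):
    """Return True if a snippet satisfies both a topic query and
    a set of required dependence edges (illustrative only)."""
    # Topic filter: the snippet's text must mention every query term.
    topical = all(term in snippet["text"].lower() for term in topic_terms)
    # Dependence filter: each required (source, kind, target) edge
    # must appear in the snippet's dependence graph.
    deps_ok = all(edge in snippet["deps"] for edge in required_deps)
    return topical and deps_ok

snippet = {
    "text": "read lines from a file and sort them",
    "deps": [("readLines", "data", "sort")],  # a data-dependence edge
}
assert matches(snippet, ["sort", "file"], [("readLines", "data", "sort")])
```

A real engine would rank by topic-model similarity rather than exact term containment, but the two-stage structure is the point of the sketch.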

    Assisted Specification of Code Using Search

    Full text link
    We describe an intelligent assistant based on mining existing software repositories to help the developer interactively create checkable specifications of code. To be most useful, we apply this at the subsystem level, that is, chunks of code of 1,000-10,000 lines that can stand alone or be integrated into an existing application to provide additional functionality or capabilities. The resultant specifications include both a syntactic description of what should be written and a semantic specification of what it should do, initially in the form of test cases. The generated specification is designed to be used for automatic code generation using various technologies that have been proposed, including machine learning, code search, and program synthesis. Our research goal is to enable these technologies to be used effectively for creating subsystems without requiring the developer to write detailed specifications from scratch.

    Active Code Search: Incorporating User Feedback to Improve Code Search Relevance

    Get PDF
    Code search techniques return relevant code fragments given a user query. They typically work in a passive mode: given a user query, a static list of code fragments, sorted by the relevance scores assigned by a code search technique, is returned to the user. A user will go through the sorted list of returned code fragments from top to bottom. As the user checks each code fragment one by one, he or she will naturally form an opinion about the true relevance of the code fragment. In an active model, those opinions are taken as feedback to the search engine for refining result lists. In this work, we incorporate users' opinions on the results from a code search engine to refine result lists: as a user forms an opinion about one result, our technique takes this opinion as feedback and leverages it to re-order the results so that truly relevant results appear earlier in the list. The refined results can also be cached to potentially improve future code search tasks. We have built our active refinement technique on top of a state-of-the-art code search engine, Portfolio. Our technique improves Portfolio in terms of Normalized Discounted Cumulative Gain (NDCG) by more than 11.3%, from 0.738 to 0.821.
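The NDCG metric cited above rewards rankings that place relevant results near the top. A minimal sketch of how it is computed (the relevance judgments below are invented for illustration, not taken from the Portfolio evaluation):

```python
import math

def dcg(relevances):
    # Discounted Cumulative Gain: a relevant result at rank i
    # contributes rel / log2(i + 2), so higher ranks count more.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    # Normalize by the DCG of the ideal ordering (all relevant
    # results first), so a perfect ranking scores 1.0.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical binary relevance judgments for a ranked result list.
before = [0, 1, 0, 1, 1]  # relevant results buried lower in the list
after = [1, 1, 1, 0, 0]   # feedback moved relevant results to the top
assert ndcg(after) > ndcg(before)
```

Re-ordering results on user feedback improves exactly this quantity, which is why the paper reports its gain as an NDCG delta.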

    Multimodal Code Search

    Get PDF

    Improving Software Quality by Synergizing Effective Code Inspection and Regression Testing

    Get PDF
    Software quality assurance is an essential practice in software development and maintenance. Evolving software systems consistently and safely is challenging. All changes to a system must be comprehensively tested and inspected to gain confidence that the modified system behaves as intended. To detect software defects, developers often conduct quality assurance activities, such as regression testing and code review, after implementing or changing required functionality. They commonly evaluate a program using two complementary techniques: dynamic program analysis and static program analysis. Using an automated testing framework, developers typically discover program faults by observing program execution with test cases that encode required program behavior as well as expose defects. Unlike dynamic analysis, static analysis lets developers check program correctness without executing the program: they understand source code through manual inspection or identify potential program faults with an automated static analysis tool. By removing the boundaries between static and dynamic analysis, the complementary strengths and weaknesses of both techniques can be combined into unified analyses. For example, dynamic analysis is efficient and precise, but it requires a selection of test cases with no guarantee that they cover all possible program executions; static analysis is conservative and sound, but it produces less precise results because it approximates all possible behaviors that may occur at run time. Many dynamic and static techniques have been proposed, but testing a program involves substantial cost and risk, and inspecting code changes is tedious and error-prone. Our research addresses two fundamental problems in dynamic and static techniques. (1) To evaluate a program, developers are typically required to implement test cases and reuse them. 
As they develop more test cases to verify new implementations, the execution cost of the test suite increases accordingly. After every modification, they periodically conduct regression testing to see whether the program executes without new faults as it evolves. To reduce the time required for regression testing, developers should select an appropriate subset of the test suite that is guaranteed to reveal the same faults as running the entire suite. Such regression test selection techniques remain challenging, as they also incur substantial costs and risks and may discard test cases that could detect faults. (2) As a less formal and more lightweight method than running a test suite, developers often conduct code reviews with tool support; however, understanding context and changes is the key challenge of code reviews. While reviewing a code change that addresses a single issue might not be difficult, it is extremely difficult to understand complex changes that mix multiple issues such as bug fixes, refactorings, and new feature additions. Developers need to understand intermingled changes addressing multiple development issues, determining which region of the code changes deals with a particular issue. Although such changes do not cause trouble during implementation, investigating them is time-consuming and error-prone because the intertwined changes are only loosely related, making code reviews difficult. To address the limitations outlined above, our research makes the following contributions. First, we present a model-based approach to efficiently build a regression test suite that leverages Extended Finite State Machines (EFSMs). Changes to the system are performed at the transition level by adding, deleting, or replacing transitions. A test is a sequence of input and expected output messages with concrete parameter values over the supported data types. 
We introduce fully observable tests, whose descriptions contain all the information about the transitions they execute. An invariant characterizing fully observable tests is formulated such that a test is fully observable whenever the invariant is a satisfiable formula. Incremental procedures are developed to efficiently evaluate the invariant and to select, from a test suite, the tests that are guaranteed to exercise a given change when run on the modified EFSM. Tests rendered unusable by a change are also identified. Overlaps among the test descriptions are exploited to extend the approach to simultaneously select and discard multiple tests, alleviating test selection costs. Although the regression test selection problem is NP-hard [78], the experimental results show that the cost of our test selection procedure is still acceptable and economical. Second, to support code review and regression testing, we present a technique called ChgCutter. It helps developers understand and validate composite changes as follows. It interactively decomposes these complex, composite changes into atomic changes, builds related change subsets using program dependence relationships without syntactic violations, and safely selects only related test cases from the test suite to reduce regression testing time. When a code reviewer selects a change region from both the original and changed versions of a program, ChgCutter automatically identifies similar change regions based on dependence analysis and a tree-based code search technique. By automatically applying a change to the identified regions in the original program version, ChgCutter generates a syntactically correct intermediate program version. Given a generated program version, it leverages a test selection technique to select and run the subset of the test suite affected by the automatically separated change. 
Through this iterative change selection process, each separated change subset yields a distinct program version containing only that subset. ChgCutter thus helps code reviewers inspect large, complex changes by letting them focus on decomposed change subsets. In addition to aiding the understanding of substantial changes, the regression test selection technique effectively discovers defects by validating each program version that contains a separated change subset. In the evaluation, ChgCutter analyzes 28 composite changes in four open source projects. It identifies related change subsets with 95.7% accuracy, and it selects test cases affected by these changes with 89.0% accuracy. Our results show that ChgCutter should help developers effectively inspect changes and validate modified applications during development.
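The core idea of the test selection described above, keeping only tests that exercise a changed part of the system, can be sketched in a few lines. This is a simplified illustration with invented names and data, not the invariant-based procedure or ChgCutter's actual implementation:

```python
def select_tests(test_coverage, changed_transitions):
    """Select tests whose recorded execution trace touches at least
    one changed transition; all other tests can be skipped safely
    only if coverage data is complete (a simplifying assumption)."""
    changed = set(changed_transitions)
    return [name for name, covered in test_coverage.items()
            if changed & set(covered)]

# Hypothetical coverage map: test name -> transitions it exercises.
coverage = {
    "t1": ["A->B", "B->C"],
    "t2": ["A->B", "B->D"],
    "t3": ["C->E"],
}
# Suppose the modification replaced transition B->D: only t2 is affected.
assert select_tests(coverage, ["B->D"]) == ["t2"]
```

The hard part, which the thesis addresses, is obtaining selection guarantees without re-running every test to collect coverage on the modified system; this sketch assumes that information is already available.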

    Big Code Search: A Bibliography

    Get PDF
    Code search is an essential task in software development. Developers often search the internet and other code databases for necessary source code snippets to ease development effort. Code search techniques also support learning to program, as novice programmers and students can quickly retrieve (hopefully good) examples already used in actual software projects. Given how frequently code search recurs in software development, interest in the research community is increasing. To improve the code search experience, the research community has proposed many code search tools and techniques. These tools and techniques leverage several different ideas and claim better code search performance. However, it is still challenging to form a comprehensive view of the field, since existing studies generally explore narrow and limited subsets of the components used. This study aims to devise a grounded approach to understanding the procedure for code search and to build an operational taxonomy capturing the critical facets of code search techniques. Additionally, we investigate evaluation methods, benchmarks, and datasets used in the field of code search.

    A dependency-aware, context-independent code search infrastructure

    Full text link
    Over the last decade many code search engines and recommendation systems have been developed, both in academia and industry, to try to improve the component discovery step in the software reuse process. Key examples include Krugle, Koders, Portfolio, Merobase, Sourcerer, Strathcona and SENTRE. However, the recall and precision of this current generation of code search tools are limited by their inability to cope effectively with the structural dependencies between code units. This lack of "dependency awareness" manifests itself in three main ways. First, it limits the kinds of search queries that users can define and thus the precision and local recall of dependency-aware searches (giving rise to large numbers of false positives and false negatives). Second, it reduces the global recall of the component harvesting process by limiting the range of dependency-containing software components that can be used to populate the search repository. Third, it significantly reduces the performance of the retrieval process for dependency-aware searches. This thesis lays the foundation for a new generation of dependency-aware code search engines that addresses these problems by designing and prototyping a new kind of software search platform. Inspired by the Merobase code search engine, this platform contains three main innovations: an enhanced, dependency-aware query language which allows traditional Merobase interface-based searches to be extended with dependency requirements, a new "context-independent" crawling infrastructure which can recognize dependencies between code units even when their context (e.g. project) is unknown, and a new graph-based database integrated with a full-text search engine and optimized to store code modules and their dependencies efficiently. 
    After describing the background to, and state of the art in, the field of code search engines and information retrieval, the thesis motivates the aforementioned innovations and explains how they are realized in the DAISI (Dependency-Aware, context-Independent code Search Infrastructure) prototype using Lucene and Neo4j. DAISI is then used to demonstrate the advantages of the developed technology in a range of examples.
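The graph-based storage of modules and their dependencies described above can be pictured with a toy in-memory adjacency structure. The thesis uses Neo4j for this; the class below is only an illustration of the data model, with invented names, not DAISI's schema:

```python
from collections import defaultdict

class DependencyStore:
    """Toy in-memory analogue of a graph database holding code
    modules as nodes and dependency relationships as edges."""

    def __init__(self):
        # module name -> set of modules it directly depends on
        self.edges = defaultdict(set)

    def add_dependency(self, module, dependency):
        self.edges[module].add(dependency)

    def dependencies_of(self, module, seen=None):
        # Transitive closure: everything the module pulls in, which
        # a dependency-aware search must resolve to return a unit
        # together with the code it needs to compile.
        seen = seen if seen is not None else set()
        for dep in self.edges[module]:
            if dep not in seen:
                seen.add(dep)
                self.dependencies_of(dep, seen)
        return seen

store = DependencyStore()
store.add_dependency("App", "Parser")
store.add_dependency("Parser", "Lexer")
assert store.dependencies_of("App") == {"Parser", "Lexer"}
```

A production system delegates exactly this kind of reachability query to the graph database, which is why pairing one with a full-text index (Lucene) suits dependency-aware search.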