
74 research outputs found

Simplifying Deep-Learning-Based Model for Code Search

Author: Hassan Ahmed E.
Li Shanping
Liu Chao
Liu Zhiwei
Lo David
Xia Xin
Publication date: 28/05/2020

To accelerate software development, developers frequently search for and reuse existing code snippets from large-scale codebases, e.g., GitHub. Over the years, researchers have proposed many information retrieval (IR) based models for code search, which match keywords in the query against the code text. But these models fail to bridge the semantic gap between query and code. To address this challenge, Gu et al. proposed a deep-learning-based model named DeepCS. It jointly embeds method code and natural language descriptions into a shared vector space, where methods related to a natural language query are retrieved according to their vector similarities. However, DeepCS's working process is complicated and time-consuming. To overcome this issue, we propose a simplified model, CodeMatcher, that leverages the IR technique but maintains many features of DeepCS. In general, CodeMatcher combines query keywords in their original order, performs a fuzzy search on the name and body strings of methods, and returns the best-matched methods with the longer sequence of used keywords. We verified its effectiveness on a large-scale codebase with about 41k repositories. Experimental results show that CodeMatcher outperforms DeepCS by 97% in terms of MRR (a widely used accuracy measure for code search), and that it is over 66 times faster than DeepCS. Moreover, compared with the state-of-the-art IR-based model CodeHow, CodeMatcher improves the MRR by 73%. We also observed that fusing the advantages of IR-based and deep-learning-based models is promising because they complement each other by nature, and that improving the quality of method naming helps code search, since the method name plays an important role in connecting query and code.
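
For readers unfamiliar with the metric named in the abstract above: MRR (mean reciprocal rank) averages, over all queries, the reciprocal of the rank at which the first relevant result appears. The Python sketch below illustrates that standard definition only. The function name and the sample ranks are hypothetical and are not taken from the paper or its evaluation data.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(first_relevant_ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank over all queries.

    Each entry is the 1-based rank of the first relevant result for one
    query, or None when no relevant result was returned (contributes 0).
    """
    if not first_relevant_ranks:
        raise ValueError("need at least one query")
    total = 0.0
    for rank in first_relevant_ranks:
        if rank is None:
            continue  # no relevant result: reciprocal rank counts as 0
        if rank < 1:
            raise ValueError("ranks are 1-based")
        total += 1.0 / rank
    return total / len(first_relevant_ranks)


# Hypothetical ranks for four queries: hits at ranks 1, 2 and 4, one miss.
example_mrr = mean_reciprocal_rank([1, 2, None, 4])
```

With these made-up ranks the score is (1 + 1/2 + 0 + 1/4) / 4, so a model that places the first relevant method higher in the list scores closer to 1.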

arXiv.org e-Print Archive

Institutional Knowledge at Singapore Management University

AUSearch: Accurate API usage search in Github repositories with type resolution

Author: ASYROFI Muhammad Hilmi
JIANG Lingxiao
LO David
THUNG Ferdian
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/02/2020

Crossref

Institutional Knowledge at Singapore Management University

Github application programme interface and wordnet for code reuse

Author: Perera Indika
Pirapuraj
Publication venue: Faculty of Applied Sciences, South Eastern University of Sri Lanka

Code reuse is an important task in software development and maintenance. Because a great deal of software and source code is available as libraries in version control systems such as Git, SVN, and LibreSource, and on related websites such as GitHub.com, sourceforge.net, projectsgeek.com, and Googlecode.com, more and more companies, especially Small and Medium Enterprises (SMEs), are reusing open source code to develop their own software. The problem in code reuse is that, after downloading all relevant code, we need to identify the most relevant code among the pool of code. In this paper we use keyword search with an n-gram NLP technique through the GitHub Application Program Interface (API). Before searching the source code, we retrieve the names of all GitHub repositories that belong to a particular programming language (Java, C++, etc.); when searching for Java libraries, we also retrieve all .java file names using the GitHub API. We then compare our keyword with this list; if the keyword extracted from the software architecture is a connected word, we split it using the Apache Camel Splitter. If the keyword relates to any project, we download that project. Otherwise, we obtain synonyms from WordNet and repeat the above process. For further relevancy, we will use a speech recognition technique (Dynamic Time Warping (DTW)) and an NLP technique (Part of Speech (POS) tagging). Because this is one part of a larger research effort, in this paper we consider only the GitHub API.
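
The abstract above mentions keyword search with an n-gram technique but does not say how the n-grams are formed. The Python sketch below assumes word-level n-grams over a whitespace-tokenized keyword phrase, which is one common reading. The function name and the sample phrase are hypothetical, and the code does not call the GitHub API.

```python
from typing import List


def word_ngrams(phrase: str, n: int) -> List[str]:
    """Return all contiguous runs of n words from the phrase, lowercased."""
    if n < 1:
        raise ValueError("n must be at least 1")
    tokens = phrase.lower().split()
    # Slide a window of n tokens across the phrase; empty if too short.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical keyword phrase; each bigram could serve as one search term.
bigrams = word_ngrams("Parse JSON file", 2)
```

Each resulting n-gram could then be compared against the retrieved repository or file names, falling back to synonyms when nothing matches, as the abstract describes.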

IR South Eastern University of Sri Lanka

CodeHow: Effective Code Search Based on API Understanding and Extended Boolean Model (E)

Author: LOU Jian-guang
LV Fei
WANG Shaowei
ZHANG Dongmei
ZHAO Jianjun
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/11/2015

Crossref

Institutional Knowledge at Singapore Management University

Augmenting and structuring user queries to support efficient free-form code search

Author: BISSYANDE Tegawendé F.
KIM Dongsun
KIM Kisub
KLEIN Jacques
LO David
SIRRES Raphael
TRAON Yves Le
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017

Source code terms such as method names and variable types are often different from the conceptual words mentioned in a search query. This vocabulary mismatch problem can make code search inefficient. In this paper, we present Code voCABUlary (CoCaBu), an approach to resolving the vocabulary mismatch problem when dealing with free-form code search queries. Our approach leverages common developer questions and the associated expert answers to augment user queries with the relevant, but missing, structural code entities in order to improve the performance of matching relevant code examples within large code repositories. To instantiate this approach, we build GitSearch, a code search engine, on top of GitHub and StackOverflow Q&A data. We evaluate GitSearch in several dimensions to demonstrate that (1) its code search results are correct with respect to user-accepted answers; (2) the results are qualitatively better than those of existing Internet-scale code search engines; (3) our engine is competitive against web search engines, such as Google, in helping users solve programming tasks; and (4) GitSearch provides code examples that are acceptable or interesting to the community as answers for StackOverflow questions.

Crossref

Institutional Knowledge at Singapore Management University

Open Repository and Bibliography - Luxembourg

Searching, Selecting, and Synthesizing Source Code Components

Author: McMillan Collin
Publication venue: W&M ScholarWorks
Publication date: 01/01/2012

As programmers develop software, they instinctively sense that source code exists that could be reused if found: many programming tasks are common to many software projects across different domains. Oftentimes, a programmer will attempt to create new software from this existing source code, such as third-party libraries or code from online repositories. Unfortunately, several major challenges make it difficult to locate the relevant source code and to reuse it. First, there is a fundamental mismatch between the high-level intent reflected in the descriptions of source code and the low-level implementation details. This mismatch is known as the concept assignment problem, and refers to the frequent case when the keywords from comments or identifiers in code do not match the features implemented in the code. Second, even if relevant source code is found, programmers must invest significant intellectual effort into understanding how to reuse the different functions, classes, or other components present in the source code. These components may be specific to a particular application, and difficult to reuse.

One key source of information that programmers use to understand source code is the set of relationships among the source code components. These relationships are typically structural data, such as function calls or class instantiations. This structural data has been repeatedly suggested as an alternative to textual analysis for search and reuse; however, as yet no comprehensive strategy exists for locating relevant and reusable source code. In my research program, I harness this structural data in a unified approach to creating and evolving software from existing components. For locating relevant source code, I present a search engine for finding applications based on the underlying Application Programming Interface (API) calls, and a technique for finding chains of relevant function invocations from repositories of millions of lines of code. Next, for reusing source code, I introduce a system to facilitate building software prototypes from existing packages, and an approach to detecting similar software applications.

CiteSeerX

College of William & Mary: W&M Publish

Assisted Specification of Code Using Search

Author: Reiss Steven P.
Publication date: 20/09/2022

We describe an intelligent assistant, based on mining existing software repositories, that helps the developer interactively create checkable specifications of code. To be most useful, we apply this at the subsystem level, that is, to chunks of code of 1,000 to 10,000 lines that can be standalone or integrated into an existing application to provide additional functionality or capabilities. The resultant specifications include both a syntactic description of what should be written and a semantic specification of what it should do, initially in the form of test cases. The generated specification is designed to be used for automatic code generation using various technologies that have been proposed, including machine learning, code search, and program synthesis. Our research goal is to enable these technologies to be used effectively for creating subsystems without requiring the developer to write detailed specifications from scratch.

arXiv.org e-Print Archive