5,191 research outputs found
Detecting similar repositories on GitHub
NSFC Progra
Predicting Good Configurations for GitHub and Stack Overflow Topic Models
Software repositories contain large amounts of textual data, ranging from
source code comments and issue descriptions to questions, answers, and comments
on Stack Overflow. To make sense of this textual data, topic modelling is
frequently used as a text-mining tool for the discovery of hidden semantic
structures in text bodies. Latent Dirichlet allocation (LDA) is a commonly used
topic model that aims to explain the structure of a corpus by grouping texts.
LDA requires multiple parameters to work well, and there are only rough and
sometimes conflicting guidelines available on how these parameters should be
set. In this paper, we contribute (i) a broad study of parameters to arrive at
good local optima for GitHub and Stack Overflow text corpora, (ii) an
a-posteriori characterisation of text corpora related to eight programming
languages, and (iii) an analysis of corpus feature importance via per-corpus
LDA configuration. We find that (1) popular rules of thumb for topic modelling
parameter configuration are not applicable to the corpora used in our
experiments, (2) corpora sampled from GitHub and Stack Overflow have different
characteristics and require different configurations to achieve good model fit,
and (3) we can predict good configurations for unseen corpora reliably. These
findings support researchers and practitioners in efficiently determining
suitable configurations for topic modelling when analysing textual data
contained in software repositories.Comment: to appear as full paper at MSR 2019, the 16th International
Conference on Mining Software Repositorie
Unusual Events in GitHub Repositories
In large and active software projects, it becomes impractical for a developer
to stay aware of all project activity. While it might not be necessary to know
about each commit or issue, it is arguably important to know about the ones
that are unusual. To investigate this hypothesis, we identified unusual events
in 200 GitHub projects using a comprehensive list of ways in which an artifact
can be unusual and asked 140 developers responsible for or affected by these
events to comment on the usefulness of the corresponding information. Based on
2,096 answers, we identify the subset of unusual events that developers
consider particularly useful, including large code modifications and unusual
amounts of reviewing activity, along with qualitative evidence on the reasons
behind these answers. Our findings provide a means for reducing the amount of
information that developers need to parse in order to stay up to date with
development activity in their projects.Comment: Accepted for publication in Journal of Systems and Softwar
Simplifying Deep-Learning-Based Model for Code Search
To accelerate software development, developers frequently search and reuse
existing code snippets from a large-scale codebase, e.g., GitHub. Over the
years, researchers proposed many information retrieval (IR) based models for
code search, which match keywords in query with code text. But they fail to
connect the semantic gap between query and code. To conquer this challenge, Gu
et al. proposed a deep-learning-based model named DeepCS. It jointly embeds
method code and natural language description into a shared vector space, where
methods related to a natural language query are retrieved according to their
vector similarities. However, DeepCS' working process is complicated and
time-consuming. To overcome this issue, we proposed a simplified model
CodeMatcher that leverages the IR technique but maintains many features in
DeepCS. Generally, CodeMatcher combines query keywords with the original order,
performs a fuzzy search on name and body strings of methods, and returned the
best-matched methods with the longer sequence of used keywords. We verified its
effectiveness on a large-scale codebase with about 41k repositories.
Experimental results showed the simplified model CodeMatcher outperforms DeepCS
by 97% in terms of MRR (a widely used accuracy measure for code search), and it
is over 66 times faster than DeepCS. Besides, comparing with the
state-of-the-art IR-based model CodeHow, CodeMatcher also improves the MRR by
73%. We also observed that: fusing the advantages of IR-based and
deep-learning-based models is promising because they compensate with each other
by nature; improving the quality of method naming helps code search, since
method name plays an important role in connecting query and code
RefDiff: Detecting Refactorings in Version Histories
Refactoring is a well-known technique that is widely adopted by software
engineers to improve the design and enable the evolution of a system. Knowing
which refactoring operations were applied in a code change is a valuable
information to understand software evolution, adapt software components, merge
code changes, and other applications. In this paper, we present RefDiff, an
automated approach that identifies refactorings performed between two code
revisions in a git repository. RefDiff employs a combination of heuristics
based on static analysis and code similarity to detect 13 well-known
refactoring types. In an evaluation using an oracle of 448 known refactoring
operations, distributed across seven Java projects, our approach achieved
precision of 100% and recall of 88%. Moreover, our evaluation suggests that
RefDiff has superior precision and recall than existing state-of-the-art
approaches.Comment: Paper accepted at 14th International Conference on Mining Software
Repositories (MSR), pages 1-11, 201
- …