Predicting Good Configurations for GitHub and Stack Overflow Topic Models
Software repositories contain large amounts of textual data, ranging from
source code comments and issue descriptions to questions, answers, and comments
on Stack Overflow. To make sense of this textual data, topic modelling is
frequently used as a text-mining tool for the discovery of hidden semantic
structures in text bodies. Latent Dirichlet allocation (LDA) is a commonly used
topic model that aims to explain the structure of a corpus by grouping texts.
LDA requires multiple parameters to work well, and there are only rough and
sometimes conflicting guidelines available on how these parameters should be
set. In this paper, we contribute (i) a broad study of parameters to arrive at
good local optima for GitHub and Stack Overflow text corpora, (ii) an
a-posteriori characterisation of text corpora related to eight programming
languages, and (iii) an analysis of corpus feature importance via per-corpus
LDA configuration. We find that (1) popular rules of thumb for topic modelling
parameter configuration are not applicable to the corpora used in our
experiments, (2) corpora sampled from GitHub and Stack Overflow have different
characteristics and require different configurations to achieve good model fit,
and (3) we can predict good configurations for unseen corpora reliably. These
findings support researchers and practitioners in efficiently determining
suitable configurations for topic modelling when analysing textual data
contained in software repositories.
Comment: to appear as full paper at MSR 2019, the 16th International
Conference on Mining Software Repositories
Are Multi-language Design Smells Fault-prone? An Empirical Study
Nowadays, modern applications are developed using components written in
different programming languages. These systems introduce several advantages.
However, as the number of languages increases, so do the challenges related
to the development and maintenance of these systems. In such situations,
developers may introduce design smells (i.e., anti-patterns and code smells)
which are symptoms of poor design and implementation choices. Design smells are
defined as poor design and coding choices that can negatively impact the
quality of a software program despite satisfying functional requirements.
Studies on mono-language systems suggest that the presence of design smells
affects code comprehension, thus making systems harder to maintain. However,
these studies target only mono-language systems and do not consider the
interaction between different programming languages. In this paper, we present
an approach to detect multi-language design smells in the context of JNI
systems. We then investigate the prevalence of those design smells.
Specifically, we detect 15 design smells in 98 releases of nine open-source JNI
projects. Our results show that the design smells are prevalent in the selected
projects and persist throughout the releases of the systems. We observe that in
the analyzed systems, 33.95% of the files involving communications between Java
and C/C++ contain occurrences of multi-language design smells. Some kinds of
smells are more prevalent than others, e.g., Unused Parameters, Too Much
Scattering, Unused Method Declaration. Our results suggest that files with
multi-language design smells are often more closely associated with bugs than files
without these smells, and that specific smells are more strongly correlated with
fault-proneness than others.