Predicting Good Configurations for GitHub and Stack Overflow Topic Models
Software repositories contain large amounts of textual data, ranging from
source code comments and issue descriptions to questions, answers, and comments
on Stack Overflow. To make sense of this textual data, topic modelling is
frequently used as a text-mining tool for the discovery of hidden semantic
structures in text bodies. Latent Dirichlet allocation (LDA) is a commonly used
topic model that aims to explain the structure of a corpus by grouping texts.
LDA requires multiple parameters to work well, and there are only rough and
sometimes conflicting guidelines available on how these parameters should be
set. In this paper, we contribute (i) a broad study of parameters to arrive at
good local optima for GitHub and Stack Overflow text corpora, (ii) an
a-posteriori characterisation of text corpora related to eight programming
languages, and (iii) an analysis of corpus feature importance via per-corpus
LDA configuration. We find that (1) popular rules of thumb for topic modelling
parameter configuration are not applicable to the corpora used in our
experiments, (2) corpora sampled from GitHub and Stack Overflow have different
characteristics and require different configurations to achieve good model fit,
and (3) we can predict good configurations for unseen corpora reliably. These
findings support researchers and practitioners in efficiently determining
suitable configurations for topic modelling when analysing textual data
contained in software repositories.
Comment: to appear as full paper at MSR 2019, the 16th International Conference on Mining Software Repositories
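The parameter search described above can be sketched with scikit-learn's LDA implementation. This is a minimal illustration with an invented stand-in corpus and an invented parameter grid; the paper's actual corpora, grid, and optimisation procedure are not reproduced here.

```python
# A minimal sketch of searching LDA configurations for a text corpus with
# scikit-learn; corpus and parameter grid below are hypothetical examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Invented stand-in corpus; the study used GitHub and Stack Overflow texts.
corpus = [
    "parse json response from rest api",
    "serialize json object to string",
    "thread lock deadlock when joining threads",
    "start background thread with lock",
    "read file lines into list",
    "write list of strings to file",
]

X = CountVectorizer().fit_transform(corpus)

best_config, best_score = None, float("-inf")
for n_topics in (2, 3, 4):            # number of topics K
    for alpha in (0.1, 0.5, 1.0):     # document-topic prior
        lda = LatentDirichletAllocation(
            n_components=n_topics, doc_topic_prior=alpha, random_state=0
        )
        lda.fit(X)
        score = lda.score(X)          # approximate log-likelihood; higher is better
        if score > best_score:
            best_config, best_score = (n_topics, alpha), score

print("best (K, alpha):", best_config)
```

In practice such a grid would be evaluated on held-out documents rather than the training corpus, and with a model-fit measure of the researchers' choosing.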
Chaff from the Wheat : Characterization and Modeling of Deleted Questions on Stack Overflow
Stack Overflow is the most popular community question answering (CQA) site for
programmers on the web, with 2.05M users, 5.1M questions, and 9.4M answers.
Stack Overflow has explicit, detailed guidelines on how to post questions and
an active moderation community. Despite these guidelines and safeguards,
questions posted on Stack Overflow can be extremely off-topic or of very poor
quality. Such questions can
be deleted from Stack Overflow at the discretion of experienced community
members and moderators. We present the first study of deleted questions on
Stack Overflow. We divide our study into two parts (i) Characterization of
deleted questions over approx. 5 years (2008-2013) of data, (ii) Prediction of
deletion at the time of question creation. Our characterization study reveals
multiple insights on question deletion phenomena. We observe a significant
increase in the number of deleted questions over time. We find that it takes
substantial time for a question to be voted for deletion, but once it is, the
community takes swift action. We also see that question authors delete their
questions to salvage reputation points. We notice some instances of accidental
deletion of good-quality questions, but such questions are quickly voted to be
undeleted. We discover a pyramidal structure of question quality on Stack
Overflow and find that deleted questions lie at the bottom (lowest quality) of
the pyramid. We also build a predictive model to predict question deletion at
creation time. We experiment with 47 features across four categories (User
Profile, Community Generated, Question Content, and Syntactic Style) and report
an accuracy of 66%. Our feature analysis reveals that all four categories of
features are important for the prediction task. Our findings reveal important
suggestions for content quality maintenance on community based question
answering websites.
Comment: 11 pages, Pre-print
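The prediction task described above can be illustrated with a toy classifier. The feature vectors below are invented simplifications; the actual study used 47 features across the four named categories, none of which are reproduced here.

```python
# A toy sketch of predicting question deletion at creation time; the features
# and training examples below are hypothetical, not the study's data.
from sklearn.linear_model import LogisticRegression

# Invented features: [author reputation, body length, #code blocks, #tags]
X_train = [
    [5, 40, 0, 1],      # low-reputation author, short body -> deleted
    [3, 25, 0, 1],      # deleted
    [900, 600, 2, 3],   # experienced author, detailed question -> kept
    [1500, 450, 1, 4],  # kept
]
y_train = [1, 1, 0, 0]  # 1 = deleted, 0 = kept

clf = LogisticRegression().fit(X_train, y_train)
pred = clf.predict([[10, 30, 0, 1]])[0]  # classify a new, low-effort question
```

A real pipeline would train on labelled historical questions and evaluate on a held-out set, as the paper does when reporting its 66% accuracy.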
Recommending Comprehensive Solutions for Programming Tasks by Mining Crowd Knowledge
Developers often search for relevant code examples on the web for their
programming tasks. Unfortunately, they face two major problems. First, the
search is impaired due to a lexical gap between their query (task description)
and the information associated with the solution. Second, the retrieved
solution may not be comprehensive, i.e., the code segment might lack a succinct
explanation. These problems force developers to browse dozens of documents in
order to synthesize an appropriate solution. To address these two problems, we
propose CROKAGE (Crowd Knowledge Answer Generator), a tool that takes the
description of a programming task (the query) and provides a comprehensive
solution for the task. Our solutions contain not only relevant code examples
but also their succinct explanations. Our approach expands the task
description with relevant API classes from Stack Overflow Q&A threads, thereby
mitigating the lexical gap problem. Furthermore, unlike earlier studies, we
apply natural language processing to the top-quality answers and return
programming solutions that contain both code examples and their explanations.
We evaluate our approach using 48 programming queries and show that it
outperforms six baselines, including the state of the art, by a statistically
significant margin. Furthermore, our evaluation with 29 developers using 24
tasks (queries) confirms the superiority of CROKAGE over the state-of-the-art
tool in terms of the relevance of the suggested code examples, the benefit of
the code explanations, and the overall solution quality (code + explanation).
Comment: Accepted at ICPC, 12 pages, 201
Usage and Attribution of Stack Overflow Code Snippets in GitHub Projects
Stack Overflow (SO) is the most popular question-and-answer website for
software developers, providing a large amount of copyable code snippets. Using
those snippets raises maintenance and legal issues. SO's license (CC BY-SA 3.0)
requires attribution, i.e., referencing the original question or answer, and
requires derived work to adopt a compatible license. While there is a heated
debate on SO's license model for code snippets and the required attribution,
little is known about the extent to which snippets are copied from SO without
proper attribution. We present results of a large-scale empirical study
analyzing the usage and attribution of non-trivial Java code snippets from SO
answers in public GitHub (GH) projects. We followed three different approaches
to triangulate an estimate for the ratio of unattributed usages and conducted
two online surveys with software developers to complement our results. For the
different sets of projects that we analyzed, the ratio of projects containing
files with a reference to SO varied between 3.3% and 11.9%. We found that at
most 1.8% of all analyzed repositories containing code from SO used the code in
a way compatible with CC BY-SA 3.0. Moreover, we estimate that at most a
quarter of the copied code snippets from SO are attributed as required. Of the
surveyed developers, almost one half admitted copying code from SO without
attribution and about two thirds were not aware of the license of SO code
snippets and its implications.
Comment: 44 pages, 8 figures, Empirical Software Engineering (Springer)
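A first-pass check for attribution can be sketched as a scan for Stack Overflow links in source files. This only covers the link-detection side; the study's full pipeline also matched the copied code itself, which a regex alone cannot do.

```python
# A minimal sketch of detecting references to Stack Overflow in source files;
# the study's actual snippet-matching analysis was far more involved.
import re

SO_LINK = re.compile(r"stackoverflow\.com/(?:questions|q|a|answers)/\d+")

def has_so_attribution(source: str) -> bool:
    """Return True if the file contains a link to a SO question or answer."""
    return bool(SO_LINK.search(source))

attributed = 'int x = 0; // from https://stackoverflow.com/questions/1234567'
unattributed = "int x = 0; // copied from somewhere"

print(has_so_attribution(attributed), has_so_attribution(unattributed))
```

Note that a link alone does not satisfy CC BY-SA 3.0; the license also requires derived work to adopt a compatible license, which such a scan cannot verify.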
Repairing Deep Neural Networks: Fix Patterns and Challenges
Significant interest in applying Deep Neural Networks (DNNs) has fueled the
need to support the engineering of software that uses DNNs. Repairing software
that uses DNNs is one such unmistakable SE need where automated tools could be
beneficial; however, we do not fully understand the challenges of repairing
DNNs or the patterns developers use when repairing them manually. What
challenges should
automated repair tools address? What are the repair patterns whose automation
could help developers? Which repair patterns should be assigned a higher
priority for building automated bug repair tools? This work presents a
comprehensive study of bug fix patterns to address these questions. We have
studied 415 repairs from Stack Overflow and 555 repairs from GitHub for five
popular deep learning libraries (Caffe, Keras, TensorFlow, Theano, and Torch) to
understand challenges in repairs and bug repair patterns. Our key findings
reveal that DNN bug fix patterns are distinctive compared to traditional bug
fix patterns; the most common bug fix patterns are fixing data dimension and
neural network connectivity; DNN bug fixes have the potential to introduce
adversarial vulnerabilities; DNN bug fixes frequently introduce new bugs; and
DNN bug localization, reuse of trained model, and coping with frequent releases
are major challenges faced by developers when fixing bugs. We also contribute a
benchmark of 667 DNN (bug, repair) instances.
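The most common fix pattern named above, fixing data dimensions, can be illustrated with a small dimension check over connected layers. The layer representation here is hypothetical and library-agnostic, not drawn from the paper's benchmark.

```python
# An illustrative sketch of the most common fix pattern found in the study:
# dimension mismatches between connected layers. The (name, in_dim, out_dim)
# layer representation is hypothetical, not tied to any DL library.

def find_dimension_mismatches(layers):
    """Report connections where a layer's output size does not match the
    next layer's input size."""
    mismatches = []
    for (name_a, _, out_a), (name_b, in_b, _) in zip(layers, layers[1:]):
        if out_a != in_b:
            mismatches.append((name_a, name_b, out_a, in_b))
    return mismatches

# A buggy network: dense2 expects 128 inputs but dense1 produces 64.
buggy = [("dense1", 784, 64), ("dense2", 128, 10)]
print(find_dimension_mismatches(buggy))

# The typical fix is to make the adjacent dimensions agree.
fixed = [("dense1", 784, 128), ("dense2", 128, 10)]
print(find_dimension_mismatches(fixed))
```

Real DL frameworks raise such mismatches only at run time, which is one reason the study finds DNN bug localization so challenging.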
Bridging Semantic Gaps between Natural Languages and APIs with Word Embedding
Developers increasingly rely on text matching tools to analyze the relation
between natural language words and APIs. However, semantic gaps, namely textual
mismatches between words and APIs, negatively affect these tools. Previous
studies have transformed words or APIs into low-dimensional vectors for
matching; however, inaccurate results were obtained due to their failure to
model words and APIs simultaneously. To resolve this problem, two main
challenges must be addressed: acquiring a massive number of words and APIs for
mining, and aligning words with APIs for modeling. Therefore, this study
proposes Word2API to effectively estimate the relatedness of words and APIs.
Word2API collects millions of commonly used words and APIs from code
repositories to address the acquisition challenge. Then, a shuffling strategy
is used to transform related words and APIs into tuples to address the
alignment challenge. Using these tuples, Word2API models words and APIs
simultaneously. Word2API outperforms baselines by 10%-49.6% in relatedness
estimation, measured in terms of precision and NDCG. Word2API is also effective
for solving typical software tasks, e.g., query expansion and API document
linking. A simple system with Word2API-expanded queries recommends up to 21.4%
more related APIs for developers. Meanwhile, Word2API improves comparison
algorithms by 7.9%-17.4% when linking questions in Question&Answer communities
to API documents.
Comment: accepted by IEEE Transactions on Software Engineering
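The shuffling strategy for alignment can be sketched as follows: the words of a description and the APIs used alongside it are interleaved into one mixed sequence, on which a word-embedding model can then be trained so that words and APIs share a vector space. The word-API pairs below are invented examples.

```python
# A rough sketch of Word2API's alignment idea: shuffle the words of a
# description together with its APIs into mixed training tuples.
# The (words, apis) pairs below are hypothetical.
import random

def make_training_tuples(word_api_pairs, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducibility
    tuples = []
    for words, apis in word_api_pairs:
        mixed = list(words) + list(apis)
        rng.shuffle(mixed)     # shuffling mixes words and APIs in one sequence
        tuples.append(mixed)
    return tuples

pairs = [
    (["read", "file", "lines"],
     ["FileReader.read", "BufferedReader.readLine"]),
    (["open", "url", "connection"], ["URL.openConnection"]),
]
tuples = make_training_tuples(pairs)
print(tuples)
```

Because each tuple keeps a description's words and APIs together, a standard embedding model trained on these sequences places related words and APIs near each other, which is what enables the relatedness estimation.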
Smart Contract Development from the Perspective of Developers: Topics and Issues Discussed on Social Media
Blockchain-based platforms are emerging as a transformative technology that
can provide reliability, integrity, and auditability without trusted entities.
One of the key features of these platforms is the trustworthy decentralized
execution of general-purpose computation in the form of smart contracts, which
are envisioned to have a wide range of applications. As a result, a rapidly
growing and active community of smart-contract developers has emerged in recent
years. A number of research efforts have investigated the technological
challenges that these developers face, introducing a variety of tools,
languages, and frameworks for smart-contract development, focusing on security.
However, relatively little is known about the community itself, about the
developers, and about the issues that they face and discuss. To address this
gap, we study smart-contract developers and their discussions on two social
media sites, Stack Exchange and Medium. We provide insight into the trends and
key topics of these discussions, into the developers' interest in various
security issues and security tools, and into the developers' technological
background.
Sentiment Classification using N-gram IDF and Automated Machine Learning
We propose a sentiment classification method with a general machine learning
framework. For feature representation, n-gram IDF is used to extract
software-engineering-related, dataset-specific, positive, neutral, and negative
n-gram expressions. For classifiers, an automated machine learning tool is
used. In a comparison on publicly available datasets, our method achieved the
highest F1 scores for positive and negative sentences on all datasets.
Comment: 4 pages, IEEE Software
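The n-gram IDF feature representation can be sketched as follows: n-grams are scored by how few documents they occur in, so that dataset-specific expressions stand out against common phrases. The sentences below are invented examples, not the paper's datasets.

```python
# A small sketch of n-gram IDF: idf(g) = log(N / df(g)), where df(g) is the
# number of documents containing n-gram g. Example documents are invented.
import math
from itertools import islice

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return list(zip(*(islice(tokens, i, None) for i in range(n))))

def ngram_idf(docs, n=2):
    """IDF of each n-gram across the document collection."""
    tokenized = [d.lower().split() for d in docs]
    df = {}
    for tokens in tokenized:
        for g in set(ngrams(tokens, n)):   # count each n-gram once per doc
            df[g] = df.get(g, 0) + 1
    return {g: math.log(len(docs) / c) for g, c in df.items()}

docs = [
    "this code does not work",
    "the build does not work on windows",
    "great answer thanks a lot",
    "thanks this works perfectly",
]
idf = ngram_idf(docs)
```

These IDF-weighted n-grams then serve as features for whatever classifier the automated machine learning tool selects.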
Gistable: Evaluating the Executability of Python Code Snippets on GitHub
Software developers create and share code online to demonstrate programming
language concepts and programming tasks. Code snippets can be a useful way to
explain and demonstrate a programming concept, but may not always be directly
executable. A code snippet can contain parse errors, or fail to execute if the
environment contains unmet dependencies.
This paper presents an empirical analysis of the executable status of Python
code snippets shared through the GitHub gist system, and the ability of
developers familiar with software configuration to correctly configure and run
them. We find that 75.6% of gists require non-trivial configuration to overcome
missing dependencies, configuration files, reliance on a specific operating
system, or some other environment configuration. Our study also suggests that
the natural assumption developers make about resource names when resolving
configuration errors is correct less than half the time.
We also present Gistable, a database and extensible framework built on
GitHub's gist system, which provides executable code snippets to enable
reproducible studies in software engineering. Gistable contains 10,259 code
snippets, approximately 5,000 with a Dockerfile to configure and execute them
without import error. Gistable is publicly available at
https://github.com/gistable/gistable
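A first step of such an executability analysis can be sketched with Python's own `ast` module: check whether a snippet parses at all, and collect its top-level imports as candidates for missing dependencies. This only covers static checks; Gistable's actual analysis went further and executed snippets inside Docker containers.

```python
# A simplified sketch of checking whether a Python snippet parses and which
# top-level modules it imports (candidates for missing dependencies).
import ast

def analyze_snippet(source):
    """Return (parses_ok, imported_module_names)."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False, set()
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module.split(".")[0])
    return True, modules

ok, mods = analyze_snippet("import requests\nprint(requests.get)\n")
broken, _ = analyze_snippet("def f(:\n    pass\n")
print(ok, mods, broken)
```

Any imported module that is neither in the standard library nor installed in the environment would need the non-trivial configuration the study describes.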
Simplifying Deep-Learning-Based Model for Code Search
To accelerate software development, developers frequently search and reuse
existing code snippets from a large-scale codebase, e.g., GitHub. Over the
years, researchers have proposed many information retrieval (IR) based models
for code search, which match keywords in the query against code text. However,
they fail to bridge the semantic gap between query and code. To address this
challenge, Gu et al. proposed a deep-learning-based model named DeepCS. It
jointly embeds
method code and natural language description into a shared vector space, where
methods related to a natural language query are retrieved according to their
vector similarities. However, DeepCS' working process is complicated and
time-consuming. To overcome this issue, we propose a simplified model,
CodeMatcher, that leverages IR techniques while retaining many features of
DeepCS. In short, CodeMatcher keeps query keywords in their original order,
performs a fuzzy search over method names and bodies, and returns the methods
that match the longest sequence of query keywords. We verified its
effectiveness on a large-scale codebase with about 41k repositories.
Experimental results showed the simplified model CodeMatcher outperforms DeepCS
by 97% in terms of MRR (a widely used accuracy measure for code search), and it
is over 66 times faster than DeepCS. Moreover, compared with the
state-of-the-art IR-based model CodeHow, CodeMatcher also improves MRR by 73%.
We also observed that fusing the advantages of IR-based and deep-learning-based
models is promising because the two naturally complement each other, and that
improving the quality of method naming helps code search, since method names
play an important role in connecting queries and code.
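The matching idea can be sketched as ranking methods by how long a sequence of query keywords appears, in order, inside the method name. The toy method corpus and scoring function below are invented illustrations; CodeMatcher itself searches real repositories through an index and also considers method bodies.

```python
# A loose sketch of ordered fuzzy matching of query keywords against method
# names. The method corpus and scoring below are hypothetical simplifications.

def ordered_match_length(keywords, method_name):
    """Length of the longest run of keywords found, in order, in the name."""
    name = method_name.lower()
    pos, matched = 0, 0
    for kw in keywords:
        idx = name.find(kw, pos)
        if idx == -1:
            break
        matched += 1
        pos = idx + len(kw)    # the next keyword must appear after this one
    return matched

def search(query, methods):
    keywords = query.lower().split()
    return sorted(methods,
                  key=lambda m: ordered_match_length(keywords, m),
                  reverse=True)

methods = ["readFileLines", "deleteFile", "openConnection", "readBytes"]
ranked = search("read file lines", methods)
print(ranked)  # methods matching the longest ordered keyword sequence first
```

This also illustrates why the paper finds method naming so important for code search: a method named after its task matches the query directly, with no learned embedding needed.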