On Using Machine Learning to Identify Knowledge in API Reference Documentation
Using API reference documentation like JavaDoc is an integral part of
software development. Previous research introduced a grounded taxonomy that
organizes API documentation knowledge into 12 types, including knowledge about
the Functionality, Structure, and Quality of an API. We study how well modern
text classification approaches can automatically identify documentation
containing specific knowledge types. We compared conventional machine learning
(k-NN and SVM) and deep learning approaches trained on manually annotated Java
and .NET API documentation (n = 5,574). When classifying the knowledge types
individually (i.e., with multiple binary classifiers), the best AUPRC was up to
87%. The deep learning and SVM classifiers seem complementary: for four
knowledge types (Concept, Control, Pattern, and Non-Information), SVM clearly
outperforms deep learning, which, on the other hand, is more accurate for
identifying the remaining types. When considering multiple knowledge types at
once (i.e., multi-label classification), deep learning outperforms naïve
baselines and traditional machine learning, achieving a MacroAUC of up to 79%. We also compared
classifiers using embeddings pre-trained on generic text corpora and
StackOverflow but did not observe significant improvements. Finally, to assess
the generalizability of the classifiers, we re-tested them on a different,
unseen Python documentation dataset. Classifiers for Functionality, Concept,
Purpose, Pattern, and Directive seem to generalize from Java and .NET to Python
documentation. The accuracy related to the remaining types seems API-specific.
We discuss our results and how they inform the development of tools for
supporting developers sharing and accessing API knowledge. Published article:
https://doi.org/10.1145/3338906.333894
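The per-type setup described in this abstract — one binary classifier per knowledge type, evaluated by area under the precision-recall curve — can be sketched in miniature. This is an illustrative simplification, not the authors' implementation: the helper names and the toy scores below are invented, and average precision is used here as a standard estimator of AUPRC.

```python
def binary_labels(annotations, knowledge_type):
    """One-vs-rest labels: 1 if a sentence is annotated with the given type."""
    return [1 if knowledge_type in ann else 0 for ann in annotations]

def average_precision(scores, labels):
    """Average precision over the ranked predictions (an AUPRC estimator)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / rank)  # precision at each relevant hit
    return sum(precisions) / max(hits, 1)

# Toy example: two sentences, each annotated with a set of knowledge types.
annotations = [{"Functionality"}, {"Concept", "Pattern"}]
y_concept = binary_labels(annotations, "Concept")  # [0, 1]
```

With classifier scores `[0.9, 0.8, 0.7, 0.6]` and labels `[1, 0, 1, 0]`, the relevant items sit at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2 ≈ 0.83.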
A survey of the use of crowdsourcing in software engineering
The term 'crowdsourcing' was initially introduced in 2006 to describe an emerging distributed problem-solving model by online workers. Since then it has been widely studied and practiced to support software engineering. In this paper, we provide a comprehensive survey of the use of crowdsourcing in software engineering, seeking to cover all literature on this topic. We first review the definitions of crowdsourcing and derive our definition of Crowdsourcing Software Engineering together with its taxonomy. Then we summarise industrial crowdsourcing practice in software engineering and corresponding case studies. We further analyse the software engineering domains, tasks and applications for crowdsourcing and the platforms and stakeholders involved in realising Crowdsourced Software Engineering solutions. We conclude by exposing trends, open issues and opportunities for future research on Crowdsourced Software Engineering.
Demystifying Dependency Bugs in Deep Learning Stack
Deep learning (DL) applications, built upon a heterogeneous and complex DL
stack (e.g., Nvidia GPU, Linux, CUDA driver, Python runtime, and TensorFlow),
are subject to software and hardware dependencies across the DL stack. One
challenge in dependency management across the entire engineering lifecycle is
posed by the asynchronous and radical evolution and the complex version
constraints among dependencies. Developers may introduce dependency bugs (DBs)
in selecting, using, and maintaining dependencies. However, the characteristics
of DBs in the DL stack are still under-investigated, hindering practical
solutions to dependency management in the DL stack. To bridge this gap, this paper presents
the first comprehensive study to characterize symptoms, root causes and fix
patterns of DBs across the whole DL stack with 446 DBs collected from
StackOverflow posts and GitHub issues. For each DB, we first investigate the
symptom as well as the lifecycle stage and dependency where the symptom is
exposed. Then, we analyze the root cause as well as the lifecycle stage and
dependency where the root cause is introduced. Finally, we explore the fix
pattern and the knowledge sources used to fix it. Our findings shed light on
practical implications for dependency management.
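The version-constraint conflicts the study points to can be illustrated with a deliberately simplified model: each package depending on a shared component contributes an acceptable version range, and a dependency bug arises when those ranges have an empty intersection. This is a toy sketch (real resolvers such as pip handle far richer specifiers); the function names and version numbers are invented.

```python
def parse_version(version):
    """'2.5.1' -> (2, 5, 1), so versions compare component-wise."""
    return tuple(int(part) for part in version.split("."))

def jointly_satisfiable(constraints):
    """constraints: [(min_inclusive, max_exclusive), ...], one range per
    dependent package. The intersection is non-empty iff the highest lower
    bound is still below the lowest upper bound."""
    lowest = max(parse_version(lo) for lo, _ in constraints)
    highest = min(parse_version(hi) for _, hi in constraints)
    return lowest < highest
```

For example, ranges [2.0, 3.0) and [2.5, 4.0) overlap at [2.5, 3.0), while [1.0, 2.0) and [2.0, 3.0) do not — the kind of asynchronous-evolution conflict the abstract describes.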
Software Entity Recognition with Noise-Robust Learning
Recognizing software entities such as library names from free-form text is
essential to enable many software engineering (SE) technologies, such as
traceability link recovery, automated documentation, and API recommendation.
While many approaches have been proposed to address this problem, they suffer
from small entity vocabularies or noisy training data, hindering their ability
to recognize software entities mentioned in sophisticated narratives. To
address this challenge, we leverage the Wikipedia taxonomy to develop a
comprehensive entity lexicon with 79K unique software entities in 12
fine-grained types, as well as a large labeled dataset of over 1.7M sentences.
Then, we propose self-regularization, a noise-robust learning approach that
accounts for multiple dropout passes, for training our software entity
recognition (SER) model. Results show that models trained with self-regularization outperform
both their vanilla counterparts and state-of-the-art approaches on our
Wikipedia benchmark and two Stack Overflow benchmarks. We release our models,
data, and code for future research. Comment: ASE 202
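The abstract does not spell out the self-regularization objective. A common formulation of this idea (as in R-Drop-style consistency training) runs two stochastic forward passes with different dropout masks and penalizes disagreement between the resulting distributions. The sketch below is a plain-Python toy of that formulation, not the authors' implementation, and for simplicity it applies dropout directly to a logits vector rather than to hidden activations.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dropout(xs, p, rng):
    # Inverted dropout: zero each unit with probability p, rescale survivors.
    return [0.0 if rng.random() < p else x / (1 - p) for x in xs]

def kl(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def self_regularized_loss(logits, gold, p=0.1, alpha=1.0, seed=0):
    """Cross-entropy averaged over two dropout passes, plus a symmetric-KL
    consistency term that penalizes disagreement between the passes."""
    rng = random.Random(seed)
    p1 = softmax(dropout(logits, p, rng))
    p2 = softmax(dropout(logits, p, rng))
    ce = -(math.log(p1[gold] + 1e-12) + math.log(p2[gold] + 1e-12)) / 2
    consistency = (kl(p1, p2) + kl(p2, p1)) / 2
    return ce + alpha * consistency
```

With `p=0.0` the two passes coincide, the consistency term vanishes, and the loss reduces to ordinary cross-entropy; with dropout enabled, noisy labels that the two passes disagree on are penalized less sharply than consistently confident errors.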
Documentation of Machine Learning Software
Machine learning software documentation differs from most of the documentation
studied in software engineering research. Often, the users of such
documentation are not software experts. The increasing interest
in using data science and in particular, machine learning in different fields
attracted scientists and engineers with various levels of knowledge about
programming and software engineering. Our ultimate goal is automated generation
and adaptation of machine learning software documents for users with different
levels of expertise. We are interested in understanding the nature and triggers
of the problems and the impact of the users' levels of expertise in the process
of documentation evolution. We will investigate the Stack Overflow Q/As and
classify the documentation related Q/As within the machine learning domain to
understand the types and triggers of the problems as well as the potential
change requests to the documentation. We intend to use the results for building
on top of the state of the art techniques for automatic documentation
generation and extending on the adoption, summarization, and explanation of
software functionalities. Comment: The paper is accepted for publication in the 27th IEEE
International Conference on Software Analysis, Evolution and Reengineering (SANER 2020).