On Using Machine Learning to Identify Knowledge in API Reference Documentation
Using API reference documentation like JavaDoc is an integral part of
software development. Previous research introduced a grounded taxonomy that
organizes API documentation knowledge into 12 types, including knowledge about
the Functionality, Structure, and Quality of an API. We study how well modern
text classification approaches can automatically identify documentation
containing specific knowledge types. We compared conventional machine learning
(k-NN and SVM) and deep learning approaches trained on manually annotated Java
and .NET API documentation (n = 5,574). When classifying the knowledge types
individually (i.e., with multiple binary classifiers), the best AUPRC was up to
87%. The deep learning and SVM classifiers seem complementary: for four
knowledge types (Concept, Control, Pattern, and Non-Information), SVM clearly
outperforms deep learning, which, on the other hand, is more accurate for
identifying the remaining types. When considering multiple knowledge types at
once (i.e., multi-label classification), deep learning outperforms naïve
baselines and traditional machine learning, achieving a MacroAUC of up to 79%. We also compared
classifiers using embeddings pre-trained on generic text corpora and
StackOverflow but did not observe significant improvements. Finally, to assess
the generalizability of the classifiers, we re-tested them on a different,
unseen Python documentation dataset. Classifiers for Functionality, Concept,
Purpose, Pattern, and Directive seem to generalize from Java and .NET to Python
documentation. The accuracy related to the remaining types seems API-specific.
We discuss our results and how they inform the development of tools for
supporting developers sharing and accessing API knowledge. Published article:
https://doi.org/10.1145/3338906.333894
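The per-type setup described in this abstract — one binary classifier per knowledge type, evaluated by area under the precision-recall curve — can be sketched in miniature. This is an illustrative simplification, not the authors' implementation: the helper names and the toy scores below are invented, and average precision is used here as a standard estimator of AUPRC.

```python
def binary_labels(annotations, knowledge_type):
    """One-vs-rest labels: 1 if a sentence is annotated with the given type."""
    return [1 if knowledge_type in ann else 0 for ann in annotations]

def average_precision(scores, labels):
    """Average precision over the ranked predictions (an AUPRC estimator)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / rank)  # precision at each relevant hit
    return sum(precisions) / max(hits, 1)

# Toy example: two sentences, each annotated with a set of knowledge types.
annotations = [{"Functionality"}, {"Concept", "Pattern"}]
y_concept = binary_labels(annotations, "Concept")  # [0, 1]
```

With classifier scores `[0.9, 0.8, 0.7, 0.6]` and labels `[1, 0, 1, 0]`, the relevant items sit at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2 ≈ 0.83.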
A survey of the use of crowdsourcing in software engineering
The term 'crowdsourcing' was initially introduced in 2006 to describe an emerging distributed problem-solving model by online workers. Since then it has been widely studied and practiced to support software engineering. In this paper, we provide a comprehensive survey of the use of crowdsourcing in software engineering, seeking to cover all literature on this topic. We first review the definitions of crowdsourcing and derive our definition of Crowdsourcing Software Engineering together with its taxonomy. Then we summarise industrial crowdsourcing practice in software engineering and corresponding case studies. We further analyse the software engineering domains, tasks and applications for crowdsourcing and the platforms and stakeholders involved in realising Crowdsourced Software Engineering solutions. We conclude by exposing trends, open issues and opportunities for future research on Crowdsourced Software Engineering.
Demystifying Dependency Bugs in Deep Learning Stack
Deep learning (DL) applications, built upon a heterogeneous and complex DL
stack (e.g., Nvidia GPU, Linux, CUDA driver, Python runtime, and TensorFlow),
are subject to software and hardware dependencies across the DL stack. One
challenge in dependency management across the entire engineering lifecycle is
posed by the asynchronous and radical evolution and the complex version
constraints among dependencies. Developers may introduce dependency bugs (DBs)
in selecting, using, and maintaining dependencies. However, the characteristics
of DBs in the DL stack are still under-investigated, hindering practical
solutions to dependency management in the DL stack. To bridge this gap, this paper presents
the first comprehensive study to characterize symptoms, root causes and fix
patterns of DBs across the whole DL stack with 446 DBs collected from
StackOverflow posts and GitHub issues. For each DB, we first investigate the
symptom as well as the lifecycle stage and dependency where the symptom is
exposed. Then, we analyze the root cause as well as the lifecycle stage and
dependency where the root cause is introduced. Finally, we explore the fix
pattern and the knowledge sources used to fix it. Our findings shed light on
practical implications for dependency management.
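The version-constraint conflicts the study points to can be illustrated with a deliberately simplified model: each package depending on a shared component contributes an acceptable version range, and a dependency bug arises when those ranges have an empty intersection. This is a toy sketch (real resolvers such as pip handle far richer specifiers); the function names and version numbers are invented.

```python
def parse_version(version):
    """'2.5.1' -> (2, 5, 1), so versions compare component-wise."""
    return tuple(int(part) for part in version.split("."))

def jointly_satisfiable(constraints):
    """constraints: [(min_inclusive, max_exclusive), ...], one range per
    dependent package. The intersection is non-empty iff the highest lower
    bound is still below the lowest upper bound."""
    lowest = max(parse_version(lo) for lo, _ in constraints)
    highest = min(parse_version(hi) for _, hi in constraints)
    return lowest < highest
```

For example, ranges [2.0, 3.0) and [2.5, 4.0) overlap at [2.5, 3.0), while [1.0, 2.0) and [2.0, 3.0) do not — the kind of asynchronous-evolution conflict the abstract describes.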
Software Entity Recognition with Noise-Robust Learning
Recognizing software entities such as library names from free-form text is
essential to enable many software engineering (SE) technologies, such as
traceability link recovery, automated documentation, and API recommendation.
While many approaches have been proposed to address this problem, they suffer
from small entity vocabularies or noisy training data, hindering their ability
to recognize software entities mentioned in sophisticated narratives. To
address this challenge, we leverage the Wikipedia taxonomy to develop a
comprehensive entity lexicon with 79K unique software entities in 12
fine-grained types, as well as a large labeled dataset of over 1.7M sentences.
Then, we propose self-regularization, a noise-robust learning approach that
accounts for multiple dropout passes, for training our software entity
recognition (SER) model. Results show that models trained with self-regularization outperform
both their vanilla counterparts and state-of-the-art approaches on our
Wikipedia benchmark and two Stack Overflow benchmarks. We release our models,
data, and code for future research. Comment: ASE 202
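The abstract does not spell out the self-regularization objective. A common formulation of this idea (as in R-Drop-style consistency training) runs two stochastic forward passes with different dropout masks and penalizes disagreement between the resulting distributions. The sketch below is a plain-Python toy of that formulation, not the authors' implementation, and for simplicity it applies dropout directly to a logits vector rather than to hidden activations.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dropout(xs, p, rng):
    # Inverted dropout: zero each unit with probability p, rescale survivors.
    return [0.0 if rng.random() < p else x / (1 - p) for x in xs]

def kl(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def self_regularized_loss(logits, gold, p=0.1, alpha=1.0, seed=0):
    """Cross-entropy averaged over two dropout passes, plus a symmetric-KL
    consistency term that penalizes disagreement between the passes."""
    rng = random.Random(seed)
    p1 = softmax(dropout(logits, p, rng))
    p2 = softmax(dropout(logits, p, rng))
    ce = -(math.log(p1[gold] + 1e-12) + math.log(p2[gold] + 1e-12)) / 2
    consistency = (kl(p1, p2) + kl(p2, p1)) / 2
    return ce + alpha * consistency
```

With `p=0.0` the two passes coincide, the consistency term vanishes, and the loss reduces to ordinary cross-entropy; with dropout enabled, noisy labels that the two passes disagree on are penalized less sharply than consistently confident errors.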
Documentation of Machine Learning Software
Machine learning software documentation differs from most of the documentation
studied in software engineering research. Often, the users of such
documentation are not software experts. The increasing interest
in using data science and in particular, machine learning in different fields
attracted scientists and engineers with various levels of knowledge about
programming and software engineering. Our ultimate goal is automated generation
and adaptation of machine learning software documents for users with different
levels of expertise. We are interested in understanding the nature and triggers
of the problems and the impact of the users' levels of expertise in the process
of documentation evolution. We will investigate the Stack Overflow Q/As and
classify the documentation related Q/As within the machine learning domain to
understand the types and triggers of the problems as well as the potential
change requests to the documentation. We intend to use the results for building
on top of the state of the art techniques for automatic documentation
generation and extending on the adoption, summarization, and explanation of
software functionalities. Comment: The paper is accepted for publication in the 27th IEEE
International Conference on Software Analysis, Evolution and Reengineering (SANER 2020).