Overcoming Language Dichotomies: Toward Effective Program Comprehension for Mobile App Development
Mobile devices and platforms have become an established target for modern
software developers due to performant hardware and a large and growing user
base numbering in the billions. Despite their popularity, the software
development process for mobile apps comes with a set of unique, domain-specific
challenges rooted in program comprehension. Many of these challenges stem from
developer difficulties in reasoning about different representations of a
program, a phenomenon we define as a "language dichotomy". In this paper, we
reflect upon the various language dichotomies that contribute to open problems
in program comprehension and development for mobile apps. Furthermore, to help
guide the research community towards effective solutions for these problems, we
provide a roadmap of directions for future work.
Comment: Invited Keynote Paper for the 26th IEEE/ACM International Conference on Program Comprehension (ICPC'18).
On the Effect of Semantically Enriched Context Models on Software Modularization
Many of the existing approaches for program comprehension rely on the
linguistic information found in source code, such as identifier names and
comments. Semantic clustering is one such technique for modularization of the
system that relies on the informal semantics of the program, encoded in the
vocabulary used in the source code. Treating the source code as a collection of
tokens loses the semantic information embedded within the identifiers. We try
to overcome this problem by introducing context models for source code
identifiers to obtain a semantic kernel, which can be used for both deriving
the topics that run through the system as well as their clustering. In the
first model, we abstract an identifier to its type representation and build on
this notion of context to construct contextual vector representation of the
source code. The second notion of context is defined based on the flow of data
between identifiers to represent a module as a dependency graph where the nodes
correspond to identifiers and the edges represent the data dependencies between
pairs of identifiers. We have applied our approach to 10 medium-sized open
source Java projects, and show that by introducing contexts for identifiers,
the quality of the modularization of the software systems is improved. Both of
the context models give results that are superior to the plain vector
representation of documents. In some cases, the authoritativeness of
decompositions is improved by 67%. Furthermore, a more detailed evaluation of
our approach on JEdit, an open source editor, demonstrates that inferred topics
through performing topic analysis on the contextual representations are more
meaningful compared to the plain representation of the documents. The proposed
approach of introducing context models for source code identifiers paves the
way for building tools that support developers in program comprehension tasks
such as application and domain concept location, software modularization, and
topic analysis.
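The first context model can be illustrated with a toy sketch: abstracting an identifier to a (name, type) pair changes which modules look similar under a bag-of-features cosine measure, compared with treating the code as plain tokens. The modules, identifiers, and types below are invented for illustration and are not from the paper's corpus.

```python
import math
from collections import Counter

# Hypothetical toy "modules": each is a bag of identifier features.
# Plain model: identifier names alone. Context model (the first
# notion from the abstract): each identifier is abstracted to a
# (name, type) pair, so "position" declared as an int is a different
# feature from "position" declared as some other type.
modules_plain = {
    "Parser": ["token", "stream", "position", "advance"],
    "Lexer":  ["token", "stream", "char", "position"],
}
modules_typed = {
    "Parser": [("token", "Token"), ("stream", "TokenStream"),
               ("position", "int"), ("advance", "void")],
    "Lexer":  [("token", "Token"), ("stream", "CharStream"),
               ("char", "char"), ("position", "int")],
}

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-features vectors."""
    dot = sum(a[f] * b[f] for f in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def similarity(modules, x, y):
    vecs = {m: Counter(feats) for m, feats in modules.items()}
    return cosine(vecs[x], vecs[y])

plain = similarity(modules_plain, "Parser", "Lexer")
typed = similarity(modules_typed, "Parser", "Lexer")
# The type context splits the spurious overlap on "stream" (the two
# streams have different types), lowering the similarity: 0.75 vs 0.5.
print(plain, typed)
```

A clustering step would then group modules by these similarities; the point of the sketch is only that the contextual features change the distance structure the clustering sees.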
Are We Building on the Rock? On the Importance of Data Preprocessing for Code Summarization
Code summarization, the task of generating useful comments given the code,
has long been of interest. Most of the existing code summarization models are
trained and validated on widely-used code comment benchmark datasets. However,
little is known about the quality of the benchmark datasets built from
real-world projects. Are the benchmark datasets as good as expected? To bridge
the gap, we conduct a systematic study to assess and improve the quality of
four benchmark datasets widely used for code summarization tasks. First, we
propose an automated code-comment cleaning tool that can accurately detect
noisy data caused by inappropriate data preprocessing operations from existing
benchmark datasets. Then, we apply the tool to further assess the data quality
of the four benchmark datasets, based on the detected noises. Finally, we
conduct comparative experiments to investigate the impact of noisy data on the
performance of code summarization models. The results show that these data
preprocessing noises widely exist in all four benchmark datasets, and removing
these noisy data leads to a significant improvement on the performance of code
summarization. We believe that the findings and insights will enable a better
understanding of data quality in code summarization tasks, and pave the way for
relevant research and practice.
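The kind of preprocessing noise such a cleaning tool targets can be sketched with a few heuristic filters over code-comment pairs. The rules, thresholds, and example pairs below are illustrative assumptions, not the paper's actual tool.

```python
import re

# Illustrative noise filters for code-comment pairs: comments that
# are too short, IDE-generated stubs, copied verbatim from the code,
# or pure leftover markup. Thresholds are arbitrary choices.
AUTO_GENERATED = re.compile(r"auto[- ]generated|non-javadoc", re.I)

def is_noisy(code: str, comment: str) -> bool:
    first = comment.strip().split("\n")[0].strip()
    if len(first.split()) < 3:                # too short to inform
        return True
    if AUTO_GENERATED.search(comment):        # IDE-generated stub
        return True
    if first.rstrip(".").lower() in code.lower():
        return True                           # comment copied from code
    if re.fullmatch(r"<[^>]+>.*", first):
        return True                           # leftover HTML/Javadoc markup
    return False

pairs = [
    ("int add(int a, int b) { return a + b; }", "Adds two integers."),
    ("void run() {}", "TODO Auto-generated method stub"),
    ("int x;", "x"),
]
clean = [(c, d) for c, d in pairs if not is_noisy(c, d)]
print(len(clean))  # only the first pair survives the filters
```

Re-training a summarization model on the filtered pairs (versus the raw ones) is then the comparative experiment the abstract describes.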
Automatic detection and repair of directive defects of Java APIs documentation
Application Programming Interfaces (APIs) represent key tools for software developers to build complex software systems. However, several studies have revealed that even major API providers tend to have incomplete or inconsistent API documentation. This can severely hamper API comprehension and, as a consequence, the quality of the software built on top of them. In this paper, we propose DRONE (Detect and Repair of dOcumentatioN dEfects), a framework to automatically detect and repair defects in API documents by leveraging techniques from program analysis, natural language processing, and constraint solving. Specifically, we target the directives of API documents, which are related to parameter constraints and exception handling declarations. Furthermore, in the presence of defects, we also provide a prototypical repair recommendation system. We evaluate our approach on parts of the well-documented APIs of JDK 1.8 (including JavaFX) and Android 7.0 (level 24). Across the two empirical studies, our approach detects API defects with F-measures of 79.9%, 71.7%, and 81.4%, respectively. The repair capability has also been evaluated in a further experiment on the generated recommendations. User judgements indicate that the constraint information is addressed correctly and concisely in the rendered directives.
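DRONE's core idea of cross-checking analysis results against documented directives can be sketched as follows. The `doc_claims_null_check` and `code_has_null_check` helpers and the example method are hypothetical stand-ins for the paper's program analysis and constraint solving, which are far more thorough than this string matching.

```python
import re

def doc_claims_null_check(javadoc: str, param: str) -> bool:
    """Does the Javadoc declare an exception for a null argument?"""
    pattern = rf"@throws\s+\w*NullPointerException.*{param}.*null"
    return re.search(pattern, javadoc, re.S) is not None

def code_has_null_check(body: str, param: str) -> bool:
    """Crude stand-in for real analysis: look for an explicit guard."""
    return f"if ({param} == null)" in body

# Hypothetical method whose body throws on null but whose Javadoc
# never documents that directive: a directive defect.
method = {
    "javadoc": "/** Parses input. @param text the source */",
    "body": 'if (text == null) throw new NullPointerException("text");',
    "params": ["text"],
}

defects = [p for p in method["params"]
           if code_has_null_check(method["body"], p)
           and not doc_claims_null_check(method["javadoc"], p)]
# Repair recommendation: render the missing directive for the doc.
for p in defects:
    print(f"@throws NullPointerException if {{@code {p}}} is null")
```

The asymmetry matters in both directions: code behavior missing from the docs yields a repair recommendation like the one printed here, while documented constraints absent from the code point to stale documentation.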
Code Structure Guided Transformer for Source Code Summarization
Code summaries help developers comprehend programs and reduce their time to
infer the program functionalities during software maintenance. Recent efforts
resort to deep learning techniques such as sequence-to-sequence models for
generating accurate code summaries, among which Transformer-based approaches
have achieved promising performance. However, effectively integrating the code
structure information into the Transformer is under-explored in this task
domain. In this paper, we propose a novel approach named SG-Trans to
incorporate code structural properties into the Transformer. Specifically, we
inject local symbolic information (e.g., code tokens and statements) and the
global syntactic structure (e.g., the data flow graph) into the self-attention
module of the Transformer as an inductive bias. To further capture the
hierarchical characteristics of code, the local information and global
structure are designed to distribute across the attention heads of the lower
and higher layers of the Transformer, respectively. Extensive evaluation shows
the superior performance of SG-Trans over state-of-the-art approaches.
Compared with the best-performing baseline, SG-Trans still improves by 1.4%
and 2.0% in terms of METEOR score, a metric widely used for measuring
generation quality, on two benchmark datasets respectively.
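One simple way to realize a structural inductive bias in self-attention, in the spirit of (though much simpler than) SG-Trans, is to mask the attention logits at positions that are not adjacent in a code structure graph before the softmax. The scores and the data-flow adjacency below are made up for illustration.

```python
import math

# Toy attention logits for a 4-token snippet and a hypothetical
# data-flow adjacency (every token also attends to itself).
scores = [[0.9, 0.1, 0.6, 0.2],
          [0.1, 0.8, 0.3, 0.4],
          [0.6, 0.3, 0.9, 0.5],
          [0.2, 0.4, 0.5, 0.7]]
adj = [[1, 0, 1, 0],
       [0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 0, 0, 1]]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def structure_masked_attention(scores, adj):
    """Set logits at non-adjacent positions to -inf, then softmax,
    so each token distributes attention only over its graph neighbors."""
    masked = [[s if a else float("-inf") for s, a in zip(srow, arow)]
              for srow, arow in zip(scores, adj)]
    return [softmax(row) for row in masked]

attn = structure_masked_attention(scores, adj)
# Token 0 now attends only to itself and token 2; masked weights are 0
# and each row still sums to 1.
print(attn[0])
```

Applying such a mask only in some heads of the lower layers, while leaving others unconstrained, is one way to realize the layer-wise distribution of local and global structure that the abstract describes.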