Relating Developers’ Concepts and Artefact Vocabulary in a Financial Software Module
Developers working on unfamiliar systems are challenged to accurately identify where and how high-level concepts are implemented in the source code. Without additional help, concept location can become a tedious, time-consuming and error-prone task. In this paper we study an industrial financial application for which we had access to the user guide, the source code, and some change requests. We compared the relative importance of the domain concepts, as understood by developers, in the user manual and in the source code. We also searched the code for the concepts occurring in change requests, to see if they could point developers to code to be modified. We varied the searches (using exact and stem matching, discarding stop-words, etc.) and present the precision and recall. We discuss the implications of our results for maintenance.
On the Generation, Structure, and Semantics of Grammar Patterns in Source Code Identifiers
Identifier names are the atoms of program comprehension. Weak identifier names decrease developer productivity and degrade the performance of automated approaches that leverage identifier names in source code analysis, threatening many of the advantages which stand to be gained from advances in artificial intelligence and machine learning. Therefore, it is vital to support developers in naming and renaming identifiers. In this paper, we extend our prior work, which studies the primary method through which names evolve: rename refactorings. In our prior work, we contextualize rename changes by examining commit messages and other refactorings. In this extension, we further consider data type changes which co-occur with these renames, with a goal of understanding how data type changes influence the structure and semantics of renames. In the long term, the outcomes of this study will be used to support research into: (1) recommending when a rename should be applied, (2) recommending how to rename an identifier, and (3) developing a model that describes how developers mentally synergize names using domain and project knowledge. We provide insights into how our data can support rename recommendation and analysis in the future, and reflect on the significant challenges, highlighted by our study, for future research in recommending renames.
Why did you clone these identifiers? Using Grounded Theory to understand Identifier Clones
Developers spend most of their time comprehending source code, with some studies estimating this activity takes between 58% and 70% of a developer’s time. To improve the readability of source code, and therefore the productivity of developers, it is important to understand what aspects of static code analysis and syntactic code structure hinder the understandability of code. Identifiers are a main source of code comprehension due to their large volume and their role as implicit documentation of a developer’s intent when writing code. Despite the critical role that identifiers play during program comprehension, there are no regulated naming standards for developers to follow when picking identifier names. Our research supports previous work aimed at understanding what makes a good identifier name, and practices to follow when picking names, by exploring a phenomenon that occurs during identifier naming: identifier clones. Identifier clones are two or more identifiers that are declared using the same name. This is an important yet unexplored phenomenon in identifier naming where developers intentionally give the same name to two or more identifiers in separate parts of a system. We must study identifier clones to understand their impact on program comprehension and to better understand the nature of identifier naming. To accomplish this, we conducted an empirical study on identifier clones detected in open-source software engineered systems and propose a taxonomy of identifier clones containing categories that can explain why they are introduced into systems and whether they represent naming antipatterns.
Using the Uniqueness of Global Identifiers to Determine the Provenance of Python Software Source Code
We consider the problem of identifying the provenance of free/open source
software (FOSS) and specifically the need of identifying where reused source
code has been copied from. We propose a lightweight approach to solve the
problem based on software identifiers, such as the names of variables, classes,
and functions chosen by programmers. The proposed approach is able to
efficiently narrow down to a small set of candidate origin products, to be
further analyzed with more expensive techniques to make a final provenance
determination. By analyzing the PyPI (Python Packaging Index) open source
ecosystem we find that globally defined identifiers are very distinct. Across
PyPI's 244 K packages we found 11.2 M different global identifiers (classes and
method/function names, with only 0.6% of identifiers shared between the two types
of entities); 76% of identifiers were used only in one package, and 93% in at
most 3. Randomly selecting 3 non-frequent global identifiers from an input
product is enough to narrow down its origins to a maximum of 3 products within
89% of the cases. We validate the proposed approach by mapping Debian source
packages implemented in Python to the corresponding PyPI packages; this
approach uses at most five trials, where each trial uses three randomly chosen
global identifiers from a randomly chosen Python file of the subject software
package, then ranks results using a popularity index and requires inspecting
only the top result. In our experiments, this method is effective at finding
the true origin of a project with a recall of 0.9 and precision of 0.77.
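The narrowing step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `index` mapping (identifier to the set of packages defining it), the `popularity` scores, the function name, and the reading of "non-frequent" as "defined in at most `max_packages` packages" are all assumptions made for illustration.

```python
import random


def candidate_origins(file_identifiers, index, popularity,
                      sample_size=3, max_packages=3, rng=None):
    """Return packages defining all sampled identifiers, most popular first.

    index: hypothetical mapping of identifier -> set of package names.
    popularity: hypothetical mapping of package name -> numeric score.
    """
    rng = rng or random.Random()
    # Assumption: "non-frequent" means the identifier is defined in at most
    # `max_packages` packages of the index.
    rare = [name for name in sorted(set(file_identifiers))
            if name in index and len(index[name]) <= max_packages]
    if len(rare) < sample_size:
        return []
    sample = rng.sample(rare, sample_size)
    # A candidate origin must define every sampled identifier.
    candidates = set.intersection(*(index[name] for name in sample))
    # Rank by descending popularity; ties are broken alphabetically.
    return sorted(candidates, key=lambda pkg: (-popularity.get(pkg, 0), pkg))
```

One trial corresponds to one call on the identifiers of a single file; under the abstract's description, up to five such trials would be run and only the top-ranked package inspected.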
Improving Information Retrieval Bug Localisation Using Contextual Heuristics
Software developers working on unfamiliar systems are challenged to identify where and how high-level concepts are implemented in the source code prior to performing maintenance tasks. Bug localisation is a core program comprehension activity in software maintenance: given the observation of a bug, e.g. via a bug report, where is it located in the source code?
Information retrieval (IR) approaches see the bug report as the query, and the source files as the documents to be retrieved, ranked by relevance. Current approaches rely on project history, in particular previously fixed bugs and versions of the source code. Existing IR techniques fall short of providing adequate solutions in finding all the source code files relevant for a bug. Without additional help, bug localisation can become a tedious, time-consuming and error-prone task.
My research contributes a novel algorithm that, given a bug report and the application’s source files, uses a combination of lexical and structural information to suggest, in a ranked order, files that may have to be changed to resolve the reported bug without requiring past code and similar reports.
I study eight applications for which I had access to the user guide, the source code, and some bug reports. I compare the relative importance and the occurrence of the domain concepts in the project artefacts and measure the effectiveness of using only concept key words to locate files relevant for a bug compared to using all the words of a bug report.
Measuring my approach against six others, using their five metrics and eight projects, I position an affected file in the top-1, top-5 and top-10 ranks on average for 44%, 69% and 76% of the bug reports respectively. This is an improvement of 23%, 16% and 11% respectively over the best performing current state-of-the-art tool.
Finally, I evaluate my algorithm with a range of industrial applications in user studies, and find that it is superior to simple string search, as often performed by developers. These results show the applicability of my approach to software projects without history and offer a simpler, lightweight solution.
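The top-1, top-5 and top-10 figures quoted in this abstract are commonly computed as a hit rate over bug reports. The sketch below shows one way to compute such a rate; the function name and the shape of the `rankings` and `relevant` inputs are assumptions for illustration, not the thesis's own evaluation code.

```python
def top_n_hit_rate(rankings, relevant, n):
    """Fraction of bug reports with at least one affected file in the top n.

    rankings: hypothetical mapping of bug id -> list of files, best first.
    relevant: hypothetical mapping of bug id -> set of files actually changed.
    """
    if not rankings:
        return 0.0
    hits = sum(
        1
        for bug_id, ranked_files in rankings.items()
        # A bug report counts as a hit if any affected file is ranked in the top n.
        if set(ranked_files[:n]) & relevant.get(bug_id, set())
    )
    return hits / len(rankings)
```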
An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags
This paper presents an ensemble part-of-speech tagging approach for source code identifiers. Ensemble tagging is a technique that uses machine-learning and the output from multiple part-of-speech taggers to annotate natural language text at a higher quality than the part-of-speech taggers are able to obtain independently. Our ensemble uses three state-of-the-art part-of-speech taggers: SWUM, POSSE, and Stanford. We study the quality of the ensemble's annotations on five different types of identifier names: function, class, attribute, parameter, and declaration statement at the level of both individual words and full identifier names. We also study and discuss the weaknesses of our tagger to promote the future amelioration of these problems through further research. Our results show that the ensemble achieves 75% accuracy at the identifier level and 84-86% accuracy at the word level. This is an increase of 17 percentage points at the identifier level over the closest independent part-of-speech tagger.
Reproducing, Extending, and Analyzing Naming Experiments
Naming is very important in software development, as names are often the only
vehicle of meaning about what the code is intended to do. A recent study on how
developers choose names collected the names given by different developers for
the same objects. This enabled a study of these names' diversity and structure,
and the construction of a model of how names are created. We reproduce
different parts of this study in three independent experiments. Importantly, we
employ methodological variations rather than striving for an exact replication.
When the same results are obtained this then boosts our confidence in their
validity by demonstrating that they do not depend on the methodology.
Our results indeed corroborate those of the original study in terms of the
diversity of names, the low probability of two developers choosing the same
name, and the finding that experienced developers tend to use slightly longer
names than inexperienced students. We explain name diversity by performing a
new analysis of the names, classifying the concepts represented in them as
universal (agreed upon), alternative (reflecting divergent views on a topic),
or optional (reflecting divergent opinions on whether to include this concept
at all). This classification enables new research directions concerning the
considerations involved in naming decisions. We also show that explicitly using
the model proposed in the original study to guide naming leads to the creation
of better names, whereas the simpler approach of just asking participants to
use longer and more detailed names does not. Comment: 35 pages with 10 figures and 6 tables