7 research outputs found

    On the Effect of Semantically Enriched Context Models on Software Modularization

    Full text link
    Many of the existing approaches for program comprehension rely on the linguistic information found in source code, such as identifier names and comments. Semantic clustering is one such technique for modularization of the system that relies on the informal semantics of the program, encoded in the vocabulary used in the source code. Treating the source code as a collection of tokens loses the semantic information embedded within the identifiers. We try to overcome this problem by introducing context models for source code identifiers to obtain a semantic kernel, which can be used for both deriving the topics that run through the system as well as their clustering. In the first model, we abstract an identifier to its type representation and build on this notion of context to construct contextual vector representation of the source code. The second notion of context is defined based on the flow of data between identifiers to represent a module as a dependency graph where the nodes correspond to identifiers and the edges represent the data dependencies between pairs of identifiers. We have applied our approach to 10 medium-sized open source Java projects, and show that by introducing contexts for identifiers, the quality of the modularization of the software systems is improved. Both of the context models give results that are superior to the plain vector representation of documents. In some cases, the authoritativeness of decompositions is improved by 67%. Furthermore, a more detailed evaluation of our approach on JEdit, an open source editor, demonstrates that inferred topics through performing topic analysis on the contextual representations are more meaningful compared to the plain representation of the documents. The proposed approach in introducing a context model for source code identifiers paves the way for building tools that support developers in program comprehension tasks such as application and domain concept location, software modularization and topic analysis

    On the Generation, Structure, and Semantics of Grammar Patterns in Source Code Identifiers

    Get PDF
    Identifier names are the atoms of program comprehension. Weak identifier names decrease developer productivity and degrade the performance of automated approaches that leverage identifier names in source code analysis; threatening many of the advantages which stand to be gained from advances in artificial intelligence and machine learning. Therefore, it is vital to support developers in naming and renaming identifiers. In this paper, we extend our prior work, which studies the primary method through which names evolve: rename refactorings. In our prior work, we contextualize rename changes by examining commit messages and other refactorings. In this extension, we further consider data type changes which co-occur with these renames, with a goal of understanding how data type changes influence the structure and semantics of renames. In the long term, the outcomes of this study will be used to support research into: (1) recommending when a rename should be applied, (2) recommending how to rename an identifier, and (3) developing a model that describes how developers mentally synergize names using domain and project knowledge. We provide insights into how our data can support rename recommendation and analysis in the future, and reflect on the significant challenges, highlighted by our study, for future research in recommending renames

    Automatic Extraction of a WordNet-Like Identifier Network from Software

    No full text
    International audienceSoftwares are designed to be used a significant amount of time, therefore maintenance represents an important part of their life cycle. It has been estimated that a lot of the time allocated to software maintenance is spent on the program comprehension. Many approaches using the program structure or external documentation have been created to ease the program comprehension. However, another important source of information is still not widely used for this purpose: the identifiers. In this article, we propose an approach, based on Natural Language Processing techniques, that automatically extracts and organizes concepts from software identifiers in a WordNet-like structure: lexical views. Those lexical views give useful insight on an overall software architecture and can be used to improve results of many software engineering tasks. The proposal is validated on a corpus of 24 open source softwares

    Towards Improving the Code Lexicon and its Consistency

    Get PDF
    RÉSUMÉ La compréhension des programmes est une activité clé au cours du développement et de la maintenance des logiciels. Bien que ce soit une activité fréquente—même plus fré- quente que l’écriture de code—la compréhension des programmes est une activité difficile et la difficulté augmente avec la taille et la complexité des programmes. Le plus souvent, les mesures structurelles—telles que la taille et la complexité—sont utilisées pour identifier ces programmes complexes et sujets aux bogues. Cependant, nous savons que l’information linguistique contenue dans les identifiants et les commentaires—c’est-à-dire le lexique du code source—font partie des facteurs qui influent la complexité psychologique des programmes, c’est-à-dire les facteurs qui rendent les programmes difficiles à comprendre et à maintenir par des humains. Dans cette thèse, nous apportons la preuve que les mesures évaluant la qualité du lexique du code source sont un atout pour l’explication et la prédiction des bogues. En outre, la qualité des identifiants et des commentaires peut ne pas être suffisante pour révéler les bogues si on les considère en isolation—dans sa théorie sur la compréhension de programmes par exemple, Brooks avertit qu’il peut arriver que les commentaires et le code soient en contradiction. C’est pourquoi nous adressons le problème de la contradiction et, plus généralement, d’incompatibilité du lexique en définissant un catalogue d’Antipatrons Linguistiques (LAs), que nous définissons comme des mauvaises pratiques dans le choix des identifiants résultant en incohérences entre le nom, l’implémentation et la documentation d’une entité de programmation. Nous évaluons empiriquement les LAs par des développeurs de code propriétaire et libre et montrons que la majorité des développeurs les perçoivent comme mauvaises pratiques et par conséquent elles doivent être évitées. Nous distillons aussi un sous-ensemble de LAs canoniques que les développeurs perçoivent particulièrement inacceptables ou pour lesquelles ils ont entrepris des actions. En effet, nous avons découvert que 10% des exemples contenant les LAs ont été supprimés par les développeurs après que nous les leur ayons présentés. Les explications des développeurs et la forte proportion de LAs qui n’ont pas encore été résolus suggèrent qu’il peut y avoir d’autres facteurs qui influent sur la décision d’éliminer les LAs, qui est d’ailleurs souvent fait par le moyen de renommage. Ainsi, nous menons une enquête auprès des développeurs et montrons que plusieurs facteurs peuvent empêcher les développeurs de renommer. Ces résultats suggèrent qu’il serait plus avantageux de souligner les LAs et autres mauvaises pratiques lexicales quand les développeurs écrivent du code source—par exemple en utilisant notre plugin LAPD Checkstyle détectant des LAs—de sorte que l’amélioration puisse se faire sur la volée et sans impacter le reste du code.----------ABSTRACT Program comprehension is a key activity during software development and maintenance. Although frequently performed—even more often than actually writing code—program comprehension is a challenging activity. The difficulty to understand a program increases with its size and complexity and as a result the comprehension of complex programs, in the best- case scenario, more time consuming when compared to simple ones but it can also lead to introducing faults in the program. Hence, structural properties such as size and complexity are often used to identify complex and fault prone programs. However, from early theories studying developers’ behavior while understanding a program, we know that the textual in- formation contained in identifiers and comments—i.e., the source code lexicon—is part of the factors that affect the psychological complexity of a program, i.e., factors that make a program difficult to understand and maintain by humans. In this dissertation we provide evidence that metrics evaluating the quality of source code lexicon are an asset for software fault explanation and prediction. Moreover, the quality of identifiers and comments considered in isolation may not be sufficient to reveal flaws—in his theory about the program understanding process for example, Brooks warns that it may happen that comments and code are contradictory. Consequently, we address the problem of contradictory, and more generally of inconsistent, lexicon by defining a catalog of Linguistic Antipatterns (LAs), i.e., poor practices in the choice of identifiers resulting in inconsistencies among the name, implementation, and documentation of a programming entity. Then, we empirically evaluate the relevance of LAs—i.e., how important they are—to industrial and open-source developers. Overall, results indicate that the majority of the developers perceives LAs as poor practices and therefore must be avoided. We also distill a subset of canonical LAs that developers found particularly unacceptable or for which they undertook an action. In fact, we discovered that 10% of the examples containing LAs were removed by developers after we pointed them out. Developers’ explanations and the large proportion of yet unresolved LAs suggest that there may be other factors that impact the decision of removing LAs, which is often done through renaming. We conduct a survey with developers and show that renaming is not a straightforward activity and that there are several factors preventing developers from renaming. These results suggest that it would be more beneficial to highlight LAs and other lexicon bad smells as developers write source code—e.g., using our LAPD Checkstyle plugin detecting LAs—so that the improvement can be done on-the-fly without impacting other program entities
    corecore