112 research outputs found
Supporting the Maintenance of Identifier Names: A Holistic Approach to High-Quality Automated Identifier Naming
A considerable part of source code consists of identifier names: unique lexical tokens that provide information about entities, and entity interactions, within the code. Identifier names provide human-readable descriptions of classes, functions, variables, etc. Poor or ambiguous identifier names (i.e., names that do not correctly describe the code behavior they are associated with) lead developers to spend more time working towards understanding the code's behavior. Bad naming can also have detrimental effects on tools that rely on natural language clues, degrading the quality of their output and making them unreliable. Additionally, misinterpretations of the code, caused by poor names, can result in the injection of quality issues into the system under maintenance. Thus, improved identifier naming leads to increased developer effectiveness, higher-quality software, and higher-quality software analysis tools. In this dissertation, I establish several novel concepts that help measure and improve the quality of identifiers. The output of this dissertation work is a set of identifier name appraisal and quality tools that integrate into the developer workflow. Through a sequence of empirical studies, I have formulated a series of heuristics and linguistic patterns to evaluate the quality of identifier names in the code and provide naming structure recommendations. I envision, and am working towards, supporting developers in integrating the contributions discussed in this dissertation into their development workflow to significantly improve the process of crafting and maintaining high-quality identifier names in source code.
On the Generation, Structure, and Semantics of Grammar Patterns in Source Code Identifiers
Identifier names are the atoms of program comprehension. Weak identifier names decrease developer productivity and degrade the performance of automated approaches that leverage identifier names in source code analysis, threatening many of the advantages that stand to be gained from advances in artificial intelligence and machine learning. Therefore, it is vital to support developers in naming and renaming identifiers. In this paper, we extend our prior work, which studies the primary method through which names evolve: rename refactorings. In our prior work, we contextualize rename changes by examining commit messages and other refactorings. In this extension, we further consider data type changes that co-occur with these renames, with the goal of understanding how data type changes influence the structure and semantics of renames. In the long term, the outcomes of this study will be used to support research into: (1) recommending when a rename should be applied, (2) recommending how to rename an identifier, and (3) developing a model that describes how developers mentally synergize names using domain and project knowledge. We provide insights into how our data can support rename recommendation and analysis in the future, and reflect on the significant challenges, highlighted by our study, for future research in recommending renames.
Toward an Effective Automated Tracing Process
Traceability is defined as the ability to establish, record, and maintain dependency relations among various software artifacts in a software system, in both forward and backward directions, throughout the multiple phases of the project's life cycle. The availability of traceability information has been proven vital to several software engineering activities such as program comprehension, impact analysis, feature location, software reuse, and verification and validation (V&V). The research on automated software traceability has noticeably advanced in the past few years. Various methodologies and tools have been proposed in the literature to provide automatic support for establishing and maintaining traceability information in software systems. This movement is motivated by the increasing attention traceability has been receiving as a critical element of any rigorous software development process. However, despite these major advances, traceability implementation and use are still not pervasive in industry. In particular, traceability tools are still far from achieving performance levels that are adequate for practical applications. Such low levels of accuracy require software engineers working with traceability tools to spend a considerable amount of their time verifying the generated traceability information, a process that is often described as tedious, exhaustive, and error-prone. Motivated by these observations, and building upon a growing body of work in this area, in this dissertation we explore several research directions related to enhancing the performance of automated tracing tools and techniques. In particular, our work addresses several issues related to the various aspects of the IR-based automated tracing process, including trace link retrieval, performance enhancement, and the role of the human in the process.
Our main objective is to achieve performance levels, in terms of accuracy, efficiency, and usability, that are adequate for practical applications, and ultimately to accomplish a successful technology transfer from research to industry.
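As a rough illustration of what IR-based trace link retrieval can look like, the sketch below ranks candidate artifacts against a requirement by TF-IDF cosine similarity. This is a generic textbook formulation, not the method of the dissertation; the function names, the smoothed IDF formula, and the sample artifacts are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphabetic runs only (a deliberately naive tokenizer).
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    # Build one sparse TF-IDF vector (a dict) per document.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF; an assumed weighting, other variants are common.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]


def cosine(a, b):
    # Cosine similarity between two sparse vectors.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank_trace_links(requirement, artifacts):
    # artifacts: hypothetical mapping of artifact name -> textual content.
    names = list(artifacts)
    vectors = tfidf_vectors([requirement] + [artifacts[n] for n in names])
    query, rest = vectors[0], vectors[1:]
    scored = [(name, cosine(query, vec)) for name, vec in zip(names, rest)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank_trace_links(
    "The system shall encrypt user passwords",
    {
        "auth.py": "encrypt password hash user login",
        "report.py": "generate monthly sales report pdf",
    },
)
```

The ranked list is what an engineer would then have to vet by hand, which is the verification effort the abstract describes as tedious and error-prone.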
Automatic Detection and Classification of Identifier Renamings
RÉSUMÉ
The source code lexicon plays a primary role in software maintainability. A poor lexicon can lead to poor program comprehension and to an increase in software faults. It is therefore important that developers maintain their source code lexicon by renaming identifiers so that they reflect the concepts they express. In this thesis, we study the lexicon and propose an approach to detect and classify identifier renamings in source code.
Renaming detection is based on the combination of two techniques: source code differencing and data flow analysis. The renaming classifier, in turn, uses an ontological database and a natural language parser to classify renamings according to the taxonomy we have defined. To evaluate the accuracy and completeness of the renaming detector, we conducted an empirical study on the history of five open-source Java programs. The results of this study report a precision of 88% and a recall of 92%.
We also conducted an exploratory study that analyzes and discusses how identifiers are renamed, according to the proposed taxonomy, in the five Java programs of the previous study. The results of this exploratory study show that renamings occur in every dimension of our taxonomy.
To apply the proposed approach to PHP programs, we adapted our renaming detector to take into account the characteristics inherent to these programs. A preliminary study carried out on three PHP programs shows that our approach is applicable to PHP programs. However, these programs exhibit renaming trends different from those observed in the Java programs.
This thesis offers two outcomes. First, the detection and classification of renamings and a tool, which can be used to document renamings. Developers will be able, for example, to look up methods that are part of the programming interface, since these impact client applications. They will also be able to identify inconsistencies between the name and the functionality of an entity in the case of a so-called risky renaming, such as a renaming to an antonym. Second, the results of our studies provide lessons that form a base of knowledge and advice that can help developers avoid inappropriate or unnecessary renamings and thus maintain the consistency of their source code lexicon.
ABSTRACT
Source code lexicon plays a paramount role in software maintainability: a poor lexicon can lead to poor comprehensibility and increase software fault-proneness. For this reason, developers should maintain their source code lexicon by renaming identifiers when they do not reflect the concepts that they should express. In this thesis, we study lexicon and propose an approach to detect and classify identifier renamings in source code. The renaming detection is based on a combination of source code differencing and data flow analysis, while the renaming classifier uses an ontological database and a natural language parser to classify renamings according to a taxonomy we define. We report a study, conducted on the evolution history of five open-source Java programs, aimed at evaluating the accuracy and completeness of the renaming detector. The study reports a precision of 88% and a recall of 92%. In addition, we report an exploratory study investigating and discussing how identifiers are renamed in the five Java programs, according to our taxonomy. Moreover, we report the challenges and applicability of the proposed approach to PHP programs and report our preliminary results of renaming detection and classification for three programs. This thesis provides two outcomes. First, the renaming detection and classification approach and tool, which can be used for documenting renamings. Developers will be able to, for example, look up methods that are part of the public API (as they impact client applications), or look for inconsistencies between the name and the implementation of an entity that underwent a high-risk renaming (e.g., towards the opposite meaning). Second, pieces of actionable knowledge, based on our qualitative study of renamings, that provide advice on how to avoid some unnecessary renamings.
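For readers unfamiliar with the two reported metrics, the following minimal sketch shows how precision and recall are conventionally computed for a renaming detector. The renaming pairs are invented for illustration and the helper name `precision_recall` is hypothetical; none of it comes from the thesis or its data set.

```python
def precision_recall(detected, actual):
    # detected: renamings reported by a detector; actual: manually validated renamings.
    detected, actual = set(detected), set(actual)
    true_positives = len(detected & actual)
    # Precision: fraction of reported renamings that are real.
    precision = true_positives / len(detected) if detected else 0.0
    # Recall: fraction of real renamings that were reported.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical (old_name, new_name) pairs, assumed for the example only.
detected = {("getNme", "getName"), ("cnt", "count"), ("tmp", "buffer")}
actual = {("getNme", "getName"), ("cnt", "count"), ("idx", "index"), ("foo", "bar")}

precision, recall = precision_recall(detected, actual)
```

Here two of the three reported renamings are correct and two of the four real renamings are found, so precision is 2/3 and recall is 1/2.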
Towards Improving the Code Lexicon and its Consistency
RÉSUMÉ
Program comprehension is a key activity during software development and maintenance. Although it is a frequent activity (even more frequent than writing code), program comprehension is difficult, and the difficulty increases with the size and complexity of programs. Most often, structural measures such as size and complexity are used to identify these complex, bug-prone programs. However, we know that the linguistic information contained in identifiers and comments, that is, the source code lexicon, is among the factors that influence the psychological complexity of programs, that is, the factors that make programs difficult for humans to understand and maintain.
In this thesis, we provide evidence that measures evaluating the quality of the source code lexicon are an asset for explaining and predicting bugs. Moreover, the quality of identifiers and comments may not be sufficient to reveal bugs when they are considered in isolation; in his theory of program comprehension, for example, Brooks warns that comments and code may contradict each other. We therefore address the problem of contradiction and, more generally, of lexicon inconsistency by defining a catalog of Linguistic Antipatterns (LAs), which we define as poor practices in the choice of identifiers that result in inconsistencies among the name, implementation, and documentation of a programming entity. We empirically evaluate LAs with developers of proprietary and open-source code and show that the majority of developers perceive them as poor practices that should therefore be avoided. We also distill a subset of canonical LAs that developers perceive as particularly unacceptable or for which they took action. Indeed, we found that 10% of the examples containing LAs were removed by developers after we presented them.
The developers' explanations and the high proportion of LAs that have not yet been resolved suggest that other factors may influence the decision to eliminate LAs, which is often done by renaming. We therefore conduct a survey of developers and show that several factors can prevent developers from renaming. These results suggest that it would be more beneficial to highlight LAs and other poor lexical practices while developers write source code, for example by using our LAPD Checkstyle plugin, which detects LAs, so that the improvement can be made on the fly without impacting the rest of the code.
ABSTRACT
Program comprehension is a key activity during software development and maintenance. Although frequently performed (even more often than actually writing code), program comprehension is a challenging activity. The difficulty of understanding a program increases with its size and complexity. As a result, comprehending complex programs is, in the best-case scenario, more time consuming than comprehending simple ones, but it can also lead to introducing faults in the program. Hence, structural properties such as size and complexity are often used to identify complex and fault-prone programs. However, from early theories studying developers' behavior while understanding a program, we know that the textual information contained in identifiers and comments, i.e., the source code lexicon, is among the factors that affect the psychological complexity of a program, i.e., factors that make a program difficult for humans to understand and maintain.
In this dissertation, we provide evidence that metrics evaluating the quality of the source code lexicon are an asset for software fault explanation and prediction. Moreover, the quality of identifiers and comments considered in isolation may not be sufficient to reveal flaws: in his theory of the program understanding process, for example, Brooks warns that comments and code may be contradictory. Consequently, we address the problem of contradictory, and more generally inconsistent, lexicon by defining a catalog of Linguistic Antipatterns (LAs), i.e., poor practices in the choice of identifiers resulting in inconsistencies among the name, implementation, and documentation of a programming entity. Then, we empirically evaluate the relevance of LAs (i.e., how important they are) to industrial and open-source developers. Overall, results indicate that the majority of developers perceive LAs as poor practices that must therefore be avoided. We also distill a subset of canonical LAs that developers found particularly unacceptable or for which they undertook an action. In fact, we discovered that 10% of the examples containing LAs were removed by developers after we pointed them out.
Developers' explanations and the large proportion of yet unresolved LAs suggest that there may be other factors that impact the decision to remove LAs, which is often done through renaming. We conduct a survey with developers and show that renaming is not a straightforward activity and that there are several factors preventing developers from renaming. These results suggest that it would be more beneficial to highlight LAs and other lexicon bad smells as developers write source code (e.g., using our LAPD Checkstyle plugin, which detects LAs) so that the improvement can be done on the fly without impacting other program entities.
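To make the idea of a name/implementation inconsistency concrete, here is a small sketch that flags functions whose name promises a value (a `get_` prefix) but whose body never returns one. It is a hypothetical check written with Python's standard `ast` module; it is not the LAPD Checkstyle plugin, and the `get_` convention and the function name `getters_without_return` are assumptions made for the example.

```python
import ast


def getters_without_return(source):
    # Flag functions named get_* that contain no `return <value>` statement.
    # Simplification: generators and returns inside nested functions are not
    # treated specially.
    tree = ast.parse(source)
    flagged = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name.startswith("get_"):
            returns_value = any(
                isinstance(child, ast.Return) and child.value is not None
                for child in ast.walk(node)
            )
            if not returns_value:
                flagged.append(node.name)
    return flagged


sample = '''
def get_total(items):
    print(sum(items))

def get_name(user):
    return user["name"]
'''

flagged = getters_without_return(sample)
```

Running the check while the code is being written, rather than after the fact, matches the abstract's suggestion that such issues are cheapest to fix on the fly.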
Model refactoring by example: A multi-objective search based software engineering approach
Declarative rules are frequently used in model refactoring in order to detect refactoring opportunities and to apply the appropriate ones. However, a large number of rules is required to obtain a complete specification of refactoring opportunities. Companies usually have accumulated examples of refactorings from past maintenance experiences. Based on these observations, we consider the model refactoring problem as a multi-objective problem by suggesting refactoring sequences that aim to maximize both structural and textual similarity between a given model (the model to be refactored) and a set of poorly designed models in the base of examples (models that have undergone some refactorings) and minimize the structural similarity between a given model and a set of well-designed models in the base of examples (models that do not need any refactoring). To this end, we use the Non-dominated Sorting Genetic Algorithm (NSGA-II) to find a set of representative Pareto optimal solutions that present the best trade-off between structural and textual similarities of models. The validation results, based on 8 real-world models taken from open-source projects, confirm the effectiveness of our approach, yielding refactoring recommendations with an average correctness of over 80%. In addition, our approach outperforms 5 of the state-of-the-art refactoring approaches.
Peer Reviewed
https://deepblue.lib.umich.edu/bitstream/2027.42/143783/1/smr1916.pdf
https://deepblue.lib.umich.edu/bitstream/2027.42/143783/2/smr1916_am.pd
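The notion of Pareto optimality used above can be sketched in a few lines. The code below only filters a set of candidates down to the non-dominated ones; it is not NSGA-II itself (no population, crossover, mutation, or crowding distance), and the candidate names and scores are invented. All objectives are assumed to be maximized, so an objective that should be minimized is stored negated.

```python
def dominates(a, b):
    # a dominates b if it is at least as good on every objective
    # and strictly better on at least one (all objectives maximized).
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    # candidates: hypothetical mapping of refactoring-sequence name -> objective tuple.
    return {
        name: scores
        for name, scores in candidates.items()
        if not any(
            dominates(other, scores)
            for other_name, other in candidates.items()
            if other_name != name
        )
    }


# Objective tuple: (similarity to poorly designed examples,
#                   negated similarity to well-designed examples).
candidates = {
    "seq_a": (0.9, -0.2),
    "seq_b": (0.7, -0.1),
    "seq_c": (0.6, -0.3),
}

front = pareto_front(candidates)
```

In this example `seq_c` is dominated by `seq_a`, while `seq_a` and `seq_b` trade off against each other and both remain on the front.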
User Review Analysis for Requirement Elicitation: Thesis and the framework prototype's source code
Online reviews are an important channel for requirement elicitation. However, requirement engineers face challenges when analysing online user reviews, such as data volumes, technical support, existing techniques, and legal barriers. Juan Wang proposes a framework that addresses user review analysis problems for the purpose of requirement elicitation and sets up a channel from downloading user reviews to structured analysis data. The main contributions of her work are: (1) the thesis proposes a framework to solve the user review analysis problem for requirement elicitation; (2) the prototype of this framework proves its feasibility; (3) the experiments prove the effectiveness and efficiency of this framework.
This resource is the latest version of Juan Wang's PhD thesis "User Review Analysis for Requirement Elicitation", together with all the source code of the framework prototype produced as a result of her thesis.