112 research outputs found

    Supporting the Maintenance of Identifier Names: A Holistic Approach to High-Quality Automated Identifier Naming

    A considerable part of source code consists of identifier names: unique lexical tokens that provide information about entities, and entity interactions, within the code. Identifier names provide human-readable descriptions of classes, functions, variables, and so on. Poor or ambiguous identifier names (i.e., names that do not correctly describe the code behavior they are associated with) lead developers to spend more time understanding the code's behavior. Bad naming also has detrimental effects on tools that rely on natural language clues, degrading the quality of their output and making them unreliable. Additionally, misinterpretations of the code caused by poor names can result in the injection of quality issues into the system under maintenance. Thus, improved identifier naming leads to more effective developers, higher-quality software, and higher-quality software analysis tools. In this dissertation, I establish several novel concepts that help measure and improve the quality of identifiers. The output of this dissertation work is a set of identifier name appraisal and quality tools that integrate into the developer workflow. Through a sequence of empirical studies, I have formulated a series of heuristics and linguistic patterns to evaluate the quality of identifier names in the code and to provide naming structure recommendations. I envision, and am working towards, supporting developers in integrating the contributions discussed in this dissertation into their development workflow to significantly improve the process of crafting and maintaining high-quality identifier names in source code.
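
    As a rough illustration of the kind of heuristic appraisal such tooling performs (the word list and thresholds below are assumptions made for this sketch, not the checks defined in the dissertation), a minimal Python example might flag short, overly long, or dictionary-free identifier names:

        import re

        # Illustrative heuristics only; the word list and thresholds are
        # assumptions for this sketch, not the dissertation's rules.
        DICTIONARY = {"count", "index", "user", "name", "total", "buffer", "temp"}

        def split_identifier(name):
            """Split a camelCase or snake_case identifier into lowercase words."""
            spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name).replace("_", " ")
            return [word.lower() for word in spaced.split()]

        def appraise(name):
            """Return a list of naming concerns for a single identifier."""
            words = split_identifier(name)
            concerns = []
            if len(name) <= 2:
                concerns.append("too short to carry meaning")
            if len(words) > 5:
                concerns.append("unusually long; consider a more concise name")
            if not any(word in DICTIONARY for word in words):
                concerns.append("no dictionary words (possible abbreviation soup)")
            return concerns

        print(appraise("tmpBufX"))    # flags the lack of dictionary words
        print(appraise("userCount"))  # no concerns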

    On the Generation, Structure, and Semantics of Grammar Patterns in Source Code Identifiers

    Identifier names are the atoms of program comprehension. Weak identifier names decrease developer productivity and degrade the performance of automated approaches that leverage identifier names in source code analysis, threatening many of the advantages that stand to be gained from advances in artificial intelligence and machine learning. Therefore, it is vital to support developers in naming and renaming identifiers. In this paper, we extend our prior work, which studies the primary method through which names evolve: rename refactorings. In our prior work, we contextualize rename changes by examining commit messages and other refactorings. In this extension, we further consider data type changes that co-occur with these renames, with the goal of understanding how data type changes influence the structure and semantics of renames. In the long term, the outcomes of this study will be used to support research into: (1) recommending when a rename should be applied, (2) recommending how to rename an identifier, and (3) developing a model that describes how developers mentally synthesize names using domain and project knowledge. We provide insights into how our data can support rename recommendation and analysis in the future, and reflect on the significant challenges, highlighted by our study, for future research in recommending renames.
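
    A grammar pattern is simply the part-of-speech sequence of the words in an identifier. As a minimal sketch of how such a pattern can be produced (using NLTK's general-purpose English tagger rather than the specialized identifier tagging the authors rely on):

        import re
        import nltk

        # One-time tagger download; on newer NLTK releases the resource may be
        # named "averaged_perceptron_tagger_eng" instead.
        nltk.download("averaged_perceptron_tagger", quiet=True)

        def grammar_pattern(identifier):
            """Return the part-of-speech tag sequence of an identifier's words."""
            spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", identifier).replace("_", " ")
            words = [word.lower() for word in spaced.split()]
            return [tag for _word, tag in nltk.pos_tag(words)]

        # Exact tags depend on the tagger: typically a verb followed by nouns
        # for getUserName, and a verb/adjective pair for isEmpty.
        print(grammar_pattern("getUserName"))
        print(grammar_pattern("isEmpty"))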

    Toward an Effective Automated Tracing Process

    Traceability is defined as the ability to establish, record, and maintain dependency relations among various software artifacts in a software system, in both the forward and backward directions, throughout the multiple phases of the project's life cycle. The availability of traceability information has been proven vital to several software engineering activities such as program comprehension, impact analysis, feature location, software reuse, and verification and validation (V&V). Research on automated software traceability has noticeably advanced in the past few years. Various methodologies and tools have been proposed in the literature to provide automatic support for establishing and maintaining traceability information in software systems. This movement is motivated by the increasing attention traceability has been receiving as a critical element of any rigorous software development process. However, despite these major advances, traceability implementation and use is still not pervasive in industry. In particular, traceability tools are still far from achieving performance levels that are adequate for practical applications. Such low levels of accuracy require software engineers working with traceability tools to spend a considerable amount of their time verifying the generated traceability information, a process that is often described as tedious, exhaustive, and error-prone. Motivated by these observations, and building upon a growing body of work in this area, in this dissertation we explore several research directions related to enhancing the performance of automated tracing tools and techniques. In particular, our work addresses several issues related to the various aspects of the IR-based automated tracing process, including trace link retrieval, performance enhancement, and the role of the human in the process. Our main objective is to achieve performance levels, in terms of accuracy, efficiency, and usability, that are adequate for practical applications, and ultimately to accomplish a successful technology transfer from research to industry.
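
    The IR-based retrieval step at the core of such tools can be illustrated with a minimal sketch, assuming a plain TF-IDF vector space model with cosine similarity (one common baseline; the dissertation studies several enhancements beyond it). The artifacts and threshold below are invented for the example:

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        # Toy artifacts; in practice these would be requirements, design
        # documents, source files, test cases, etc.
        requirements = {
            "REQ-1": "the system shall encrypt and salt stored user passwords",
            "REQ-2": "the system shall export reports as pdf files",
        }
        source_files = {
            "PasswordStore.java": "class PasswordStore encrypt hash password salt store",
            "ReportExporter.java": "class ReportExporter render export pdf report layout",
        }

        corpus = list(requirements.values()) + list(source_files.values())
        tfidf = TfidfVectorizer().fit_transform(corpus)

        # Similarity of each requirement to each source file; links above a
        # threshold are reported as candidate trace links for the engineer to verify.
        sims = cosine_similarity(tfidf[: len(requirements)], tfidf[len(requirements):])
        for i, req in enumerate(requirements):
            for j, src in enumerate(source_files):
                if sims[i, j] > 0.1:  # threshold is an arbitrary choice for the example
                    print(f"candidate link: {req} -> {src} (score {sims[i, j]:.2f})")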

    Automatic Detection and Classification of Identifier Renamings

    Source code lexicon plays a paramount role in software maintainability: a poor lexicon can lead to poor comprehensibility and increased software fault-proneness. For this reason, developers should maintain their source code lexicon by renaming identifiers when they no longer reflect the concepts they should express. In this thesis, we study the source code lexicon and propose an approach to detect and classify identifier renamings in source code. Renaming detection is based on a combination of source code differencing and data flow analysis, while the renaming classifier uses an ontological database and a natural language parser to classify renamings according to a taxonomy we define. We report a study, conducted on the evolution history of five open-source Java programs, aimed at evaluating the accuracy and completeness of the renaming detector; it reports a precision of 88% and a recall of 92%. In addition, we report an exploratory study investigating and discussing how identifiers are renamed in the five Java programs according to our taxonomy, showing that renamings occur along every dimension of the taxonomy. Moreover, we adapt the detector to the characteristics of PHP programs and report preliminary results of renaming detection and classification for three PHP programs; these results show that the approach is applicable to PHP, although the PHP programs exhibit renaming trends different from those observed in the Java programs. This thesis provides two outcomes. First, the renaming detection and classification approach and tool, which can be used for documenting renamings. Developers will be able to, for example, look up renamed methods that are part of the public API (as they impact client applications), or look for inconsistencies between the name and the implementation of an entity that underwent a high-risk renaming (e.g., a rename towards the opposite meaning). Second, pieces of actionable knowledge, based on our qualitative study of renamings, that advise developers on how to avoid inappropriate or unnecessary renamings and thus keep the source code lexicon consistent.
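
    Purely as an illustration of one signal such a classifier can draw on (a token-level comparison of the old and new name), consider the sketch below; the approach described above additionally relies on source code differencing, data flow analysis, an ontological database, and a natural language parser, none of which this toy example reproduces:

        import re

        def tokens(name):
            """Lowercased word tokens of a camelCase or snake_case identifier."""
            spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name).replace("_", " ")
            return set(spaced.lower().split())

        def classify_rename(old, new):
            """Crude token-set comparison; the taxonomy in the thesis is far richer."""
            before, after = tokens(old), tokens(new)
            if before < after:
                return "addition: the new name adds terms"
            if after < before:
                return "removal: the new name drops terms"
            if before == after:
                return "reordering or formatting change"
            if before & after:
                return "partial replacement: some terms preserved"
            return "full replacement: no terms preserved"

        print(classify_rename("size", "maxSize"))    # addition
        # No shared tokens here, so this is reported as a full replacement;
        # stemming would reveal the preserved term 'user'.
        print(classify_rename("userList", "users"))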

    Towards Improving the Code Lexicon and its Consistency

    Program comprehension is a key activity during software development and maintenance. Although frequently performed (even more often than actually writing code), program comprehension is a challenging activity, and the difficulty increases with the size and complexity of the program: understanding a complex program is, in the best case, more time consuming than understanding a simple one, and it can also lead to introducing faults into the program. Hence, structural properties such as size and complexity are often used to identify complex and fault-prone programs. However, from early theories studying developers' behavior while understanding a program, we know that the textual information contained in identifiers and comments, i.e., the source code lexicon, is among the factors that affect the psychological complexity of a program, i.e., the factors that make a program difficult for humans to understand and maintain. In this dissertation we provide evidence that metrics evaluating the quality of the source code lexicon are an asset for software fault explanation and prediction. Moreover, the quality of identifiers and comments considered in isolation may not be sufficient to reveal flaws; in his theory of the program understanding process, for example, Brooks warns that comments and code may be contradictory. Consequently, we address the problem of contradictory, and more generally inconsistent, lexicon by defining a catalog of Linguistic Antipatterns (LAs), i.e., poor practices in the choice of identifiers resulting in inconsistencies among the name, implementation, and documentation of a programming entity. We then empirically evaluate the relevance of LAs to industrial and open-source developers. Overall, the results indicate that the majority of developers perceive LAs as poor practices that should therefore be avoided. We also distill a subset of canonical LAs that developers found particularly unacceptable or for which they undertook an action; in fact, we discovered that 10% of the examples containing LAs were removed by developers after we pointed them out. Developers' explanations and the large proportion of yet unresolved LAs suggest that other factors may affect the decision to remove LAs, which is often done through renaming. We conduct a survey with developers and show that renaming is not a straightforward activity and that several factors can prevent developers from renaming. These results suggest that it would be more beneficial to highlight LAs and other lexicon bad smells as developers write source code, e.g., using our LAPD Checkstyle plugin that detects LAs, so that the improvement can be done on the fly without impacting other program entities.
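
    One inconsistency of this flavour, a method whose name promises something its signature does not deliver, can be checked mechanically. The sketch below is only a Python illustration of the idea over toy method signatures; the LAPD detector mentioned above is a Checkstyle plugin that analyses real Java code:

        # Toy method signatures as (name, return_type) pairs; a real checker
        # works on Java ASTs rather than on strings like these.
        methods = [
            ("getScore", "void"),
            ("isValid", "int"),
            ("setName", "String"),
            ("getName", "String"),
        ]

        def lexicon_smells(name, return_type):
            """Flag a few simple name/signature inconsistencies."""
            smells = []
            if name.startswith("get") and return_type == "void":
                smells.append("'get' method does not return anything")
            if name.startswith(("is", "has")) and return_type not in ("boolean", "bool"):
                smells.append("predicate-style name does not return a boolean")
            if name.startswith("set") and return_type != "void":
                smells.append("'set' method returns a value")
            return smells

        for name, rtype in methods:
            for smell in lexicon_smells(name, rtype):
                print(f"{name}: {smell}")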

    Model refactoring by example: A multi‐objective search based software engineering approach

    Declarative rules are frequently used in model refactoring in order to detect refactoring opportunities and to apply the appropriate ones. However, a large number of rules is required to obtain a complete specification of refactoring opportunities. Companies usually have accumulated examples of refactorings from past maintenance experiences. Based on these observations, we treat model refactoring as a multi-objective problem, suggesting refactoring sequences that aim to maximize both structural and textual similarity between a given model (the model to be refactored) and a set of poorly designed models in the base of examples (models that have undergone some refactorings), and to minimize the structural similarity between the given model and a set of well-designed models in the base of examples (models that do not need any refactoring). To this end, we use the Non-dominated Sorting Genetic Algorithm (NSGA-II) to find a set of representative Pareto optimal solutions that present the best trade-off between structural and textual similarities of models. The validation results, based on 8 real-world models taken from open-source projects, confirm the effectiveness of our approach, yielding refactoring recommendations with an average correctness of over 80%. In addition, our approach outperforms five state-of-the-art refactoring approaches.
    Peer Reviewed
    https://deepblue.lib.umich.edu/bitstream/2027.42/143783/1/smr1916.pdf
    https://deepblue.lib.umich.edu/bitstream/2027.42/143783/2/smr1916_am.pd
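
    The multi-objective core of the approach can be illustrated with a plain Pareto filter over candidate refactoring sequences scored on the two kinds of similarity. The scores below are invented for the example, and the paper itself evolves such sequences with NSGA-II rather than enumerating a fixed set:

        # Candidate refactoring sequences with invented scores: similarity to the
        # poorly designed examples is to be maximized, similarity to the
        # well-designed models is to be minimized.
        candidates = {
            "seq-A": (0.82, 0.40),
            "seq-B": (0.75, 0.20),
            "seq-C": (0.60, 0.35),
            "seq-D": (0.82, 0.55),
        }

        def dominates(a, b):
            """True if a is no worse than b on both objectives and better on one."""
            (bad_a, good_a), (bad_b, good_b) = a, b
            no_worse = bad_a >= bad_b and good_a <= good_b
            better = bad_a > bad_b or good_a < good_b
            return no_worse and better

        pareto_front = [
            name for name, score in candidates.items()
            if not any(dominates(other, score)
                       for other_name, other in candidates.items()
                       if other_name != name)
        ]
        print(pareto_front)  # ['seq-A', 'seq-B']; seq-C and seq-D are dominated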

    User Review Analysis for Requirement Elicitation: Thesis and the framework prototype's source code

    Online reviews are an important channel for requirement elicitation. However, requirement engineers face challenges when analysing online user reviews, such as the volume of data, the available technical support, the limitations of existing techniques, and legal barriers. Juan Wang proposes a framework that addresses these user review analysis problems for the purpose of requirement elicitation, setting up a pipeline from downloading user reviews to producing structured analysis data. The main contributions of her work are: (1) a framework that solves the user review analysis problem for requirement elicitation; (2) a prototype that demonstrates the framework's feasibility; (3) experiments that demonstrate the framework's effectiveness and efficiency. This resource is the latest version of Juan Wang's PhD thesis "User Review Analysis for Requirement Elicitation" together with the full source code of the framework prototype produced for the thesis.
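
    As a flavour of the structuring step such a pipeline performs, the sketch below tags raw reviews with coarse requirement-related labels using keyword rules; the rules and examples are invented for this illustration and are not part of the prototype:

        # Purely illustrative keyword rules; the actual framework described in
        # the thesis is far more elaborate than this sketch.
        RULES = {
            "bug report": ("crash", "error", "freeze", "broken"),
            "feature request": ("please add", "would be nice", "wish", "should support"),
        }

        def classify_review(text):
            """Return coarse labels for a single review."""
            lowered = text.lower()
            labels = [label for label, keywords in RULES.items()
                      if any(keyword in lowered for keyword in keywords)]
            return labels or ["other"]

        reviews = [
            "The app crashes every time I open the settings page.",
            "It would be nice to have a dark mode.",
            "Great app, five stars!",
        ]
        for review in reviews:
            print(classify_review(review), "-", review)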