26 research outputs found

    Context-Aware Source Code Identifier Splitting and Expansion for Software Maintenance

    RÉSUMÉ Understanding the source code of software programs is a necessary step for several program comprehension, reverse-engineering, or re-documentation tasks. In source code, textual information such as identifiers and comments represents an important source of information. The problem of extracting and analyzing the textual information used in software artifacts has only recently been recognized by the software engineering community. Information retrieval methods have been proposed to support program comprehension tasks such as concept location and the traceability of requirements to source code. To fully benefit from information retrieval-based approaches, the language used across all software artifacts must be the same, because information retrieval queries cannot return relevant documents if the vocabulary used in the queries contains words that do not appear in the source code vocabulary. Unfortunately, source code contains a high proportion of words that are not meaningful, e.g., abbreviations, acronyms, or concatenations of these. In effect, source code uses a language different from that of the other software artifacts. This vocabulary mismatch stems from the implicit assumption made by information retrieval and natural language processing techniques that the same vocabulary is used everywhere. Normalizing the source code vocabulary is therefore a significant challenge. Normalization aligns the vocabulary used in the source code of software systems with that of the other software artifacts. It consists of splitting identifiers (i.e., names of classes, methods, variables, attributes, parameters, etc.) into terms and expanding these terms to their corresponding concepts (i.e., words of a specific dictionary). In this thesis, we propose two contributions to normalization with two novel contextual approaches: TIDIER and TRIS. We take context into account because our experimental studies have shown the importance of contextual information for source code vocabulary normalization. Indeed, we conducted two experimental studies with bachelor's, master's, and Ph.D. students as well as postdoctoral fellows. We randomly selected a set of identifiers from a corpus of systems written in C and asked the participants to normalize them using different levels of context. In particular, we considered an internal context consisting of the content of the functions, files, and systems containing the identifiers, as well as an external level in the form of external documentation. The results show the importance of contextual information for normalization. They also reveal that source code files are more helpful than functions, and that context built at the level of software systems does not bring more improvement than context built at the file level. External documentation, however, helps only sometimes. In summary, the results confirm our hypothesis about the importance of context for program comprehension in general and for the normalization of the vocabulary used in the source code of software systems in particular.
Thus, we propose a contextual approach, TIDIER, inspired by speech recognition techniques and using context in the form of specialized dictionaries (i.e., containing acronyms, abbreviations, and terms specific to the domain of the software system). TIDIER outperforms the approaches that preceded it (i.e., CamelCase and Samurai). Specifically, TIDIER reaches 54% precision in terms of identifier splitting when using a dictionary built at the level of the software system in question and enriched with domain knowledge, whereas CamelCase and Samurai reach only 30% and 31% precision, respectively. Moreover, TIDIER is the first approach that maps abbreviated terms to their corresponding concepts, with a precision of 48% for a set of 73 abbreviations. TIDIER's main limitation is its cubic complexity, which motivated us to propose a faster but equally effective solution named TRIS. TRIS is inspired by TIDIER but treats the normalization problem differently: it considers it as an optimization (minimization) problem whose goal is to find the shortest path (i.e., the optimal splitting and expansion) in an acyclic graph. In addition, it uses term frequency as a local context to determine the most likely normalization. TRIS outperforms CamelCase, Samurai, and TIDIER, in terms of precision and recall, for software systems written in C and C++. It also does 4% better than GenTest in terms of identifier splitting accuracy, although the improvement over GenTest is not statistically significant. TRIS uses a tree-based representation that considerably reduces its complexity and makes it more efficient in terms of computation time. Thus, TRIS quickly produces an optimal normalization using an algorithm whose complexity is quadratic in the length of the identifier to normalize. Having developed contextual approaches for normalization, we then analyze their impact on two information retrieval-based software maintenance tasks, namely the traceability of requirements to source code and concept location. We study the effect of three normalization strategies (CamelCase, Samurai, and an oracle) on two concept location techniques. The first is based on textual information only, while the second combines textual and dynamic information (execution traces). The obtained results confirm that normalization improves concept location techniques based on textual information only. When dynamic analysis is taken into account, any normalization technique suffices. This is because dynamic analysis considerably reduces the search space, so the contribution of normalization is not comparable to that of the dynamic information. In summary, the results show the value of developing advanced normalization techniques, because they are useful in situations where execution traces are not available. We also conducted an empirical study on the effect of normalization on the traceability of requirements to source code.
In this study, we analyzed the impact of the three aforementioned normalization approaches on two traceability techniques. The first uses latent semantic indexing (LSI), while the second relies on a vector space model (VSM). The results indicate that normalization techniques improve precision and recall in some cases. Our qualitative analysis also shows that the impact of normalization on these two traceability techniques depends on the quality of the studied data. Finally, we can conclude that this thesis contributes to the state of the art on source code vocabulary normalization and on the importance of context for program comprehension. In addition, this thesis contributes to two areas of software maintenance, specifically concept location and the traceability of requirements to source code. The theoretical and practical findings of this thesis are useful for practitioners as well as researchers. Our future research work related to program comprehension and software maintenance consists of evaluating our approaches on other software systems written in other programming languages, as well as applying our normalization approaches to other program comprehension tasks (e.g., source code summarization). We are also preparing a second study on the effect of context on source code vocabulary normalization using eye tracking, to better analyze the strategies adopted by developers when using contextual information. The second line of work that we have recently started concerns the impact of identifier styles on the quality of software systems. Indeed, we are currently inferring, using a statistical model (i.e., a hidden Markov model), the identifier styles adopted by developers in software systems, and we are studying the impact of these styles on the quality of software systems. The idea is to show, first, whether developers use their own naming style stemming from their own experience or whether they adapt to the project, i.e., to the naming conventions followed (if any), and then to analyze which identifier styles (e.g., abbreviations or acronyms) lead to the introduction of bugs and to the degradation of internal quality attributes, notably semantic coupling and cohesion.----------ABSTRACT Understanding source code is a necessary step for many program comprehension, reverse-engineering, or re-documentation tasks. In source code, textual information such as identifiers and comments represent an important source of information. The problem of extracting and analyzing the textual information in software artifacts was recognized by the software engineering research community only recently. Information Retrieval (IR) methods were proposed to support program comprehension tasks, such as feature (or concept) location and traceability link recovery. However, to reap the full benefit of IR-based approaches, the language used across all software artifacts must be the same, because IR queries cannot return relevant documents if the query vocabulary contains words that are not in the source code vocabulary.
Unfortunately, source code contains a significant proportion of vocabulary that is not made up of full (meaningful) words, e.g., abbreviations, acronyms, or concatenations of these. In effect, source code uses a different language than other software artifacts. This vocabulary mismatch stems from the implicit assumption of IR and Natural Language Processing (NLP) techniques that a single natural-language vocabulary is used. Therefore, vocabulary normalization is a challenging problem. Vocabulary normalization aligns the vocabulary found in the source code with that found in other software artifacts. Normalization must both split an identifier into its constituent parts and expand each part into a full dictionary word to match the vocabulary in other artifacts. In this thesis, we deal with the challenge of normalizing source code vocabulary by developing two novel context-aware approaches. We use context because the results of our experimental studies have shown that context is relevant for source code vocabulary normalization. In fact, we conducted two user studies with 63 participants who were asked to split and expand a set of 50 identifiers from a corpus of open-source C programs, given different levels of context. In particular, we considered an internal context consisting of the content of the functions, source code files, and applications where the identifiers appear, and an external context involving external documentation. We reported evidence on the usefulness of contextual information for source code vocabulary normalization. We observed that source code files are more helpful than just looking at the function source code, and that application-level contextual information does not help any further. The availability of external sources of information (e.g., a thesaurus of abbreviations and acronyms) only helps in some circumstances. The obtained results confirm the conjecture that contextual information is useful in program comprehension, including when developers split and expand identifiers to understand them. Thus, we propose a novel contextual approach for vocabulary normalization, TIDIER. TIDIER is inspired by speech recognition techniques and exploits contextual information in the form of specialized dictionaries (e.g., acronyms, contractions, and domain-specific terms). TIDIER significantly outperforms the approaches that preceded it (i.e., CamelCase and Samurai). Specifically, with a program-level dictionary complemented with domain knowledge, TIDIER achieves 54% correct splits, compared to 30% obtained with CamelCase and 31% attained using Samurai. Moreover, TIDIER was able to correctly map identifier terms to dictionary words with a precision of 48% for a set of 73 abbreviations. The main limitation of TIDIER is its cubic complexity, which led us to propose a faster solution, namely TRIS. TRIS is inspired by TIDIER, but it deals with the vocabulary normalization problem differently: it maps it to a graph optimization (minimization) problem whose goal is to find the optimal path (i.e., the optimal splitting and expansion) in an acyclic weighted graph. In addition, it uses the relative frequency of source code terms as a local context to determine the most likely splitting and expansion of an identifier.
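Purely as an illustration of the splitting-and-expansion formulation just described (a minimal sketch, not the actual TIDIER or TRIS implementation), the following Python fragment finds a minimum-cost segmentation of an identifier into known terms by dynamic programming over split positions; the dictionary, abbreviation table, and cost function are hypothetical.

```python
# Minimal sketch: split an identifier into dictionary terms by finding a
# minimum-cost segmentation (a toy stand-in for the shortest-path view).
# The dictionary, abbreviation table, and costs are illustrative only.

DICTIONARY = {"user", "count", "file", "name", "get", "pointer"}
EXPANSIONS = {"ptr": "pointer", "cnt": "count", "usr": "user"}  # hypothetical

def split_identifier(identifier: str) -> list[str]:
    """Return a minimum-cost split of `identifier`, with abbreviations expanded."""
    ident = identifier.lower()
    n = len(ident)
    INF = float("inf")
    best_cost = [INF] * (n + 1)   # best_cost[i]: cost of splitting ident[:i]
    best_prev = [-1] * (n + 1)
    best_cost[0] = 0
    for i in range(1, n + 1):
        for j in range(max(0, i - 12), i):
            term = ident[j:i]
            # Known words and known abbreviations are cheap; anything else is
            # penalized by its length so unknown chunks stay short.
            cost = 1 if term in DICTIONARY or term in EXPANSIONS else 5 + len(term)
            if best_cost[j] + cost < best_cost[i]:
                best_cost[i] = best_cost[j] + cost
                best_prev[i] = j
    # Reconstruct the chosen segmentation.
    terms, i = [], n
    while i > 0:
        j = best_prev[i]
        terms.append(ident[j:i])
        i = j
    terms.reverse()
    # Expand known abbreviations to full words.
    return [EXPANSIONS.get(t, t) for t in terms]

if __name__ == "__main__":
    print(split_identifier("usrcnt"))       # ['user', 'count']
    print(split_identifier("getFileName"))  # ['get', 'file', 'name']
```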
TRIS significantly outperforms CamelCase and Samurai in terms of precision and recall of splitting, and TIDIER in terms of identifier expansion precision and recall, with a medium to large effect size, for C and C++ systems. In addition, TRIS shows an improvement of 4%, in terms of identifier splitting correctness, over GenTest (a more recent splitter proposed after TIDIER), although this improvement is not statistically significant. Besides being more accurate than the other approaches, TRIS uses a tree-based representation that makes it efficient in terms of computation time. Thus, TRIS quickly produces an optimal split and expansion using an identifier processing algorithm whose complexity is quadratic in the length of the identifier to split/expand. We also investigate the impact of identifier splitting on two IR-based software maintenance tasks, namely feature location and traceability recovery. Our study on feature location analyzes the effect of three identifier splitting strategies (CamelCase, Samurai, and an oracle) on two feature location techniques (FLTs). The first is based on IR, while the second relies on the combination of IR and dynamic analysis (i.e., execution traces). The obtained results support our conjecture that, when only textual information is available, an improved splitting technique can help improve the effectiveness of feature location. The results also show that when both textual and execution information are used, any splitting algorithm will suffice, as the FLTs produced equivalent results. In other words, because dynamic information prunes the search space considerably, the benefit of an advanced splitting algorithm is comparably smaller than that of the dynamic information; hence the splitting algorithm has little impact on the final results. Overall, our findings outline potential benefits of creating advanced preprocessing techniques, as they can be useful in situations where execution information cannot be easily collected. In addition, we study the impact of identifier splitting on two traceability recovery techniques using the same three identifier splitting strategies as in our study on feature location. The first traceability recovery technique uses Latent Semantic Indexing (LSI), while the second is based on the Vector Space Model (VSM). The results indicate that advanced splitting techniques help increase the precision and recall of the studied traceability techniques, but only in some cases. In addition, our qualitative analysis shows that the impact or improvement brought by such techniques depends on the quality of the studied data. Overall, this thesis contributes to the state of the art on identifier splitting and expansion, on context, and on their importance for program comprehension. In addition, it contributes to the fields of feature location and traceability recovery. The theoretical and practical findings of this thesis are useful for both practitioners and researchers. Our future research directions in the areas of program comprehension and software maintenance include extending our empirical evaluations to software systems written in other programming languages. In addition, we will apply our source code vocabulary normalization approaches to other program comprehension tasks (e.g., code summarization).
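The VSM-based traceability recovery mentioned above can be pictured with a short sketch like the one below, which ranks source code documents against a requirement query by TF-IDF cosine similarity; the documents, query, and preprocessing are hypothetical, and the actual pipelines in the thesis (including the LSI variant and the normalization step) are more involved.

```python
# Toy sketch of VSM-based traceability: rank code documents against a
# requirement by TF-IDF cosine similarity. Corpus and query are made up;
# real pipelines add identifier splitting/expansion before indexing.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

code_documents = {                      # hypothetical, already normalized text
    "Account.java": "account balance deposit withdraw amount",
    "Report.java": "report export format printer page",
}
requirement = "the user can withdraw an amount from the account balance"

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(code_documents.values())
query_vector = vectorizer.transform([requirement])

scores = cosine_similarity(query_vector, doc_matrix)[0]
ranking = sorted(zip(code_documents, scores), key=lambda p: p[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```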
We are also preparing a replication of our study on the effect of context on vocabulary normalization using Eye-Tracking to analyze the different strategies adopted by developers when exploring contextual information to perform identifier splitting and expansion. A second research direction that we are currently tackling concerns the impact of identifier style on software quality using mining software repositories. In fact, we are currently inferring the identifier styles used by developers in open-source projects using a statistical model, namely, the Hidden Markov Model (HMM). The aim is to show whether open-source developers adhere to the style of the projects they join and their naming conventions (if any) or they bring their own style. In addition, we want to analyze whether a specific identifier style (e.g., short abbreviations or acronyms) introduces bugs in the systems and whether it impacts internal software quality metrics, in particular, the semantic coupling and cohesion

    Automatic derivation of concepts based on the analysis of identifiers

    The existing software engineering literature has empirically shown that a proper choice of identifiers influences software understandability and maintainability. Indeed, identifiers are developers' main up-to-date source of information and guide their cognitive processes during program understanding when the high-level documentation is scarce or outdated and when the source code is not sufficiently commented. Deriving domain terms from identifiers using high-level and domain concepts is not an easy task when naming conventions (e.g., Camel Case) are not used or strictly followed and/or when these words have been abbreviated or otherwise transformed. Our thesis aims at developing a contextual approach that overcomes the shortcomings of the existing approaches and maps identifiers to domain concepts even in the absence of naming conventions and/or in the presence of abbreviations. We also aim to take advantage of our approach to enhance the predictability of the overall system quality by using identifiers when assessing software quality. The key components of our approach are: the dynamic time warping (DTW) algorithm, used to recognize words in continuous speech; the string edit distance between terms and words, as a proxy for the distance between the terms and the concepts they represent; and word transformation rules attempting to mimic the cognitive processes of developers when composing identifiers with abbreviated forms. To validate our approach, we apply it to identifiers extracted from different open-source applications to show that our method is able to map identifiers to domain terms, and we compare it, with respect to a manually built oracle, with the two families of approaches that, to the best of our knowledge, exist in the literature. We also enrich our technique with domain knowledge and context-aware dictionaries to analyze how sensitive the performance of our approach is to the use of contextual information and specialized knowledge.
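A rough sketch of one ingredient mentioned above, matching an abbreviated term to its closest dictionary word by edit distance with a simple vowel-dropping transformation rule; the dictionary, rule, and scoring are illustrative assumptions, not the thesis's DTW-based algorithm.

```python
# Toy sketch: map an abbreviated identifier term to the closest dictionary
# word using Levenshtein distance, preferring words whose vowel-stripped
# form matches the term (a crude stand-in for transformation rules).
DICTIONARY = ["pointer", "counter", "message", "number", "buffer"]  # hypothetical

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def strip_vowels(word: str) -> str:
    return "".join(c for c in word if c not in "aeiou")

def expand(term: str) -> str:
    """Return the dictionary word that best explains an abbreviated term."""
    def score(word: str) -> int:
        # Transformation rule: if the term looks like the word with its
        # vowels dropped, treat it as a near-perfect match.
        return 0 if term == strip_vowels(word) else levenshtein(term, word)
    return min(DICTIONARY, key=score)

if __name__ == "__main__":
    print(expand("ptr"))   # pointer
    print(expand("msg"))   # message
    print(expand("cntr"))  # counter
```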

    Leveraging Deep Learning for Abstractive Code Summarization of Unofficial Documentation

    Usually, programming languages have official documentation to guide developers with APIs, methods, and classes. However, researchers have identified insufficient or inadequate documentation examples and flaws in the API's complex structure as barriers to learning an API. As a result, developers may consult other sources (StackOverflow, GitHub, etc.) to learn more about an API. Recent research has shown that unofficial documentation is a valuable source of information for generating code summaries. We have therefore been motivated to leverage this type of documentation, along with deep learning techniques, to generate high-quality summaries for APIs discussed in informal documentation. This paper proposes an automatic approach using BART, a state-of-the-art transformer model, to generate summaries for APIs discussed on StackOverflow. We built an oracle of human-generated summaries and evaluated our approach against it using ROUGE and BLEU, the most widely used evaluation metrics in text summarization. Furthermore, we empirically evaluated the quality of our summaries against a previous work. Our findings demonstrate that using deep learning algorithms can improve summaries' quality and outperform the previous work by an average of 57% for precision, 66% for recall, and 61% for F-measure, while running 4.4 times faster.
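As a hedged sketch of the general setup only (an off-the-shelf BART checkpoint plus ROUGE scoring, not the fine-tuned model or evaluation protocol from the paper), assuming the `transformers` and `rouge-score` packages; the input text, reference summary, and checkpoint choice are made up.

```python
# Minimal sketch: summarize an API discussion with an off-the-shelf BART
# checkpoint and score it against a human-written reference with ROUGE.
# The input text, reference, and checkpoint are assumptions; the paper
# fine-tunes BART on Stack Overflow data and also reports BLEU.
from transformers import pipeline
from rouge_score import rouge_scorer

post_text = (
    "You can use ArrayList when you need a resizable array. "
    "It is not synchronized, so wrap it with Collections.synchronizedList "
    "if several threads modify it concurrently."
)  # hypothetical Stack Overflow content
reference = "ArrayList is a resizable, non-synchronized list implementation."

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
generated = summarizer(post_text, max_length=40, min_length=10)[0]["summary_text"]

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)
print(generated)
print({name: round(s.fmeasure, 3) for name, s in scores.items()})
```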

    Improving Code Example Recommendations on Informal Documentation Using BERT and Query-Aware LSH: A Comparative Study

    Our research investigates the recommendation of code examples to aid software developers, a practice that saves developers significant time by providing ready-to-use code snippets. The focus of our study is Stack Overflow, a commonly used resource for coding discussions and solutions, particularly in the context of the Java programming language. We applied BERT, a powerful Large Language Model (LLM) that enables us to transform code examples into numerical vectors by extracting their semantic information. Once these numerical representations are prepared, we identify Approximate Nearest Neighbors (ANN) using Locality-Sensitive Hashing (LSH). Our research employed two variants of LSH: Random Hyperplane-based LSH and Query-Aware LSH. We rigorously compared these two approaches across four parameters: HitRate, Mean Reciprocal Rank (MRR), Average Execution Time, and Relevance. Our study revealed that the Query-Aware (QA) approach showed superior performance over the Random Hyperplane-based (RH) method. Specifically, it exhibited a notable improvement of 20% to 35% in HitRate for query pairs compared to the RH approach. Furthermore, the QA approach proved significantly more time-efficient, with its speed in creating hashing tables and assigning data samples to buckets being at least four times faster. It can return code examples within milliseconds, whereas the RH approach typically requires several seconds to recommend code examples. Due to the superior performance of the QA approach, we tested it against PostFinder and FaCoY, the state-of-the-art baselines. Our QA method showed comparable efficiency, proving its potential for effective code recommendation.
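The overall pipeline can be sketched as below: embed snippets with BERT (mean-pooled hidden states) and bucket them with random-hyperplane LSH. The checkpoint, snippets, and 16-bit hash size are assumptions, and only the random-hyperplane variant is shown; the paper's Query-Aware LSH chooses hash functions differently.

```python
# Sketch: embed code snippets with BERT and bucket them with random-
# hyperplane LSH; a query is answered from its own bucket. Illustrative only.
from collections import defaultdict
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> np.ndarray:
    """Mean-pool BERT's last hidden states into one vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, tokens, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()

rng = np.random.default_rng(0)
hyperplanes = rng.standard_normal((16, 768))  # 16 random hyperplanes

def lsh_key(vector: np.ndarray) -> tuple:
    """Sign pattern against the hyperplanes is the bucket key."""
    return tuple((hyperplanes @ vector > 0).astype(int))

snippets = [  # hypothetical Java code examples
    "List<String> names = new ArrayList<>(); names.add(\"a\");",
    "Map<String, Integer> counts = new HashMap<>();",
]
buckets = defaultdict(list)
for snippet in snippets:
    buckets[lsh_key(embed(snippet))].append(snippet)

query = "how to create a list and add an element in Java"
candidates = buckets.get(lsh_key(embed(query)), [])
print(candidates)  # snippets that landed in the query's bucket, if any
```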

    Empirical Study of Programming to an Interface

    A popular recommendation to programmers in object-oriented software is to "program to an interface, not an implementation" (PTI). Expected benefits include increased simplicity from abstraction, decreased dependency on implementations, and higher flexibility. Yet, interfaces must be immutable, excessive class hierarchies can be a form of complexity, and "speculative generality" is a known code smell. To advance the empirical knowledge of PTI, we conducted an empirical investigation involving 126 Java projects on GitHub, aiming to measure the decreased-dependency benefits (in terms of co-change).
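The principle itself can be sketched in a few lines, shown here in Python with an abstract base class standing in for a Java interface; the types and names are hypothetical and only illustrate PTI, not anything measured in the study.

```python
# Toy illustration of "program to an interface, not an implementation":
# client code depends only on the abstract type, so implementations can be
# swapped without co-changes in the client. Types here are hypothetical.
from abc import ABC, abstractmethod

class Storage(ABC):                      # the "interface"
    @abstractmethod
    def save(self, key: str, value: str) -> None: ...

class InMemoryStorage(Storage):          # one implementation
    def __init__(self) -> None:
        self._data: dict[str, str] = {}
    def save(self, key: str, value: str) -> None:
        self._data[key] = value

def record_event(store: Storage, event: str) -> None:
    # The client is written against Storage, not InMemoryStorage.
    store.save("last_event", event)

record_event(InMemoryStorage(), "login")
```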

    Automatic Derivation of Concepts Based on the Analysis of Source Code Identifiers


    Investigating the android apps' success: An empirical study

    Measuring the success of software systems was not a trivial task in the past. Nowadays, mobile apps provide a uniform schema, i.e., the average rating provided by the apps' users, to gauge their success. While recent research has focused on examining the relationship between change- and fault-proneness and apps' lack of success, as well as qualitatively analyzing the reasons behind app users' dissatisfaction, there is little empirical evidence on the factors related to the success of mobile apps. In this paper, we explore the relationships between mobile apps' success and a set of metrics that characterize not only the apps themselves but also the quality of the APIs used by the apps, as well as user attributes when they interact with the apps. In particular, we measure API quality in terms of bugs fixed in APIs used by apps and changes that occurred in the API methods. We examine different kinds of changes, including changes in the interfaces, implementation, and exception handling. For user-related factors, we leverage the number of an app's downloads and installations, and users' reviews. Through an empirical study of 474 free Android apps, we find that factors such as the number of users' reviews provided for an app, the app's category, and its size appear to have an impact on the app's success.
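A hedged sketch of the kind of analysis such a study implies, correlating an app-success proxy (average rating) with candidate factors on made-up data; the dataset, factor names, and choice of Spearman correlation are assumptions, not the paper's actual methodology.

```python
# Toy sketch: correlate an app-success proxy (average rating) with candidate
# factors. Data, column names, and the use of Spearman correlation are
# illustrative assumptions, not the study's actual dataset or models.
import pandas as pd
from scipy.stats import spearmanr

apps = pd.DataFrame({
    "avg_rating":    [4.5, 3.2, 4.1, 2.8, 4.7],
    "num_reviews":   [1200, 150, 800, 60, 2400],
    "apk_size_mb":   [12.5, 48.0, 9.8, 55.1, 20.3],
    "api_bug_fixes": [3, 14, 5, 21, 2],   # bugs fixed in the APIs the app uses
})

for factor in ["num_reviews", "apk_size_mb", "api_bug_fixes"]:
    rho, p_value = spearmanr(apps["avg_rating"], apps[factor])
    print(f"{factor}: rho={rho:.2f}, p={p_value:.3f}")
```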

    Understanding How and Why Developers Seek and Analyze API-related Opinions

    With the advent and proliferation of online developer forums as informal documentation, developers often share their opinions about the APIs they use. Thus, the opinions of others often shape a developer's perception and decisions related to software development. For example, the choice of an API, or how to reuse the functionality the API offers, is to a considerable degree conditioned upon what other developers think about the API. While many developers refer to and rely on such opinion-rich information about APIs, we found little research that investigates the use and benefits of public opinions. To understand how developers seek and evaluate API opinions, we conducted two surveys involving a total of 178 software developers. We analyzed the data in two dimensions, each corresponding to specific needs related to API reviews: (1) needs for seeking API reviews, and (2) needs for automated tool support to assess the reviews. We observed that developers seek API reviews and often have to summarize those reviews for diverse development needs (e.g., API suitability). Developers also make conscious efforts to judge the trustworthiness of the provided opinions and believe that automated tool support for API review analysis can assist in diverse development scenarios, including, for example, saving time in API selection as well as making informed decisions about particular API features.