129 research outputs found
Survey of Research on Software Clones
This report summarizes my overview talk on software clone detection
research. It first discusses the notion of software redundancy, cloning, duplication,
and similarity. Then, it describes various categorizations of clone types, empirical
studies on the root causes for cloning, current opinions and wisdom of consequences
of cloning, empirical studies on the evolution of clones, ways to remove, to avoid,
and to detect them, empirical evaluations of existing automatic clone detector performance
(such as recall, precision, time and space consumption) and their fitness
for a particular purpose, benchmarks for clone detector evaluations, presentation
issues, and last but not least application of clone detection in other related fields.
After each summary of a subarea, I am listing open research questions
Toward an Effective Automated Tracing Process
Traceability is defined as the ability to establish, record, and maintain dependency relations among various software artifacts in a software system, in both a forwards and backwards direction, throughout the multiple phases of the project’s life cycle. The availability of traceability information has been proven vital to several software engineering activities such as program comprehension, impact analysis, feature location, software reuse, and verification and validation (V&V). The research on automated software traceability has noticeably advanced in the past few years. Various methodologies and tools have been proposed in the literature to provide automatic support for establishing and maintaining traceability information in software systems. This movement is motivated by the increasing attention traceability has been receiving as a critical element of any rigorous software development process. However, despite these major advances, traceability implementation and use is still not pervasive in industry. In particular, traceability tools are still far from achieving performance levels that are adequate for practical applications. Such low levels of accuracy require software engineers working with traceability tools to spend a considerable amount of their time verifying the generated traceability information, a process that is often described as tedious, exhaustive, and error-prone. Motivated by these observations, and building upon a growing body of work in this area, in this dissertation we explore several research directions related to enhancing the performance of automated tracing tools and techniques. In particular, our work addresses several issues related to the various aspects of the IR-based automated tracing process, including trace link retrieval, performance enhancement, and the role of the human in the process. Our main objective is to achieve performance levels, in terms of accuracy, efficiency, and usability, that are adequate for practical applications, and ultimately to accomplish a successful technology transfer from research to industry
Smells in system user interactive tests
peer reviewedTest smells are known as bad development practices that reflect poor design and implementation choices in software tests. Over the last decade, there are few attempts to study test smells in the context of system tests that interact with the System Under Test through a Graphical User Interface. To fill the gap, we conduct an exploratory analysis of test smells occurring in System User Interactive Tests (SUIT). We thus, compose a catalog of 35 SUIT-specific smells, identified through a multi-vocal literature review, and show how they differ from smells encountered in unit tests. We also conduct an empirical analysis to assess the diffuseness and removal of these smells in 48 industrial repositories and 12 open-source projects. Our results show that the same type of smells tends to appear in both industrial and open-source projects, but they are not addressed in the same way. We also find that smells originating from a combination of multiple code locations appear more often than those that are localized on a single line. This happens because of the difficulty to observe non-local smells without tool support. Furthermore, we find that smell-removing actions are not frequent with less than 50% of the affected tests ever undergoing a smell removal. Interestingly, while smell-removing actions are rare, some smells disappear while discarding tests, i.e., these smells do not appear in follow-up tests that replace the discarded ones
Eliminating Code Duplication in Cascading Style Sheets
Cascading Style Sheets (i.e., CSS) is the standard styling language, widely used for defining the presentation semantics of user interfaces for web, mobile and desktop applications. Despite its popularity, CSS has not received much attention from academia. Indeed, developing and maintaining CSS code is rather challenging, due to the inherent language design shortcomings, the interplay of CSS with other programming languages (e.g., HTML and JavaScript), the lack of empirically- evaluated coding best-practices, and immature tool support. As a result, the quality of CSS code bases is poor in many cases.
In this thesis, we focus on one of the major issues found in CSS code bases, i.e., the duplicated code. In a large, representative dataset of CSS code, we found an average of 68% duplication in style declarations. To alleviate this, we devise techniques for refactoring CSS code (i.e., grouping style declarations into new style rules), or migrating CSS code to take advantage of the code abstraction features provided by CSS preprocessor languages (i.e., superset languages for CSS that augment it by adding extra features that facilitate code maintenance). Specifically for the migration transformations, we attempt to align the resulting code with manually-developed code, by relying on the knowledge gained by conducting an empirical study on the use of CSS preprocessors, which revealed the common coding practices of the developers who use CSS preprocessor languages.
To guarantee the behavior preservation of the proposed transformations, we come up with a list of preconditions that should be met, and also describe a lightweight testing technique. By applying a large number of transformations on several web sites and web applications, it is shown that the transformations are indeed presentation-preserving, and can effectively reduce the amount of duplicated code in CSS
Analyzing Clone Evolution for Identifying the Important Clones for Management
Code clones (identical or similar code fragments in a code-base) have dual but contradictory impacts (i.e., both positive and negative impacts) on the evolution and maintenance of a software system. Because of the negative impacts (such as high change-proneness, bug-proneness, and unintentional inconsistencies), software researchers consider code clones to be the number one bad-smell in a code-base. Existing studies on clone management suggest managing code clones through refactoring and tracking. However, a software system's code-base may contain a huge number of code clones, and it is impractical to consider all these clones for refactoring or tracking. In these circumstances, it is essential to identify code clones that can be considered particularly important for refactoring and tracking. However, no existing study has investigated this matter. We conduct our research emphasizing this matter, and perform five studies on identifying important clones by analyzing clone evolution history.
In our first study we detect evolutionary coupling of code clones by automatically investigating clone evolution history from thousands of commits of software systems downloaded from on-line SVN repositories. By analyzing evolutionary coupling of code clones we identify a particular clone change pattern, Similarity Preserving Change Pattern (SPCP), such that code clones that evolve following this pattern should be considered important for refactoring. We call these important clones the SPCP clones. We rank SPCP clones considering their strength of evolutionary coupling. In our second study we further analyze evolutionary coupling of code clones with an aim to assist clone tracking. The purpose of clone tracking is to identify the co-change (i.e. changing together) candidates of code clones to ensure consistency of changes in the code-base. Our research in the second study identifies and ranks the important co-change candidates by analyzing their evolutionary coupling. In our third study we perform a deeper analysis on the SPCP clones and identify their cross-boundary evolutionary couplings. On the basis of such couplings we separate the SPCP clones into two disjoint subsets. While one subset contains the non-cross-boundary SPCP clones which can be considered important for refactoring, the other subset contains the cross-boundary SPCP clones which should be considered important for tracking. In our fourth study we analyze the bug-proneness of different types of SPCP clones in order to identify which type(s) of code clones have high tendencies of experiencing bug-fixes. Such clone-types can be given high priorities for management (refactoring or tracking). In our last study we analyze and compare the late propagation tendencies of different types of code clones. Late propagation is commonly regarded as a harmful clone evolution pattern. Findings from our last study can help us prioritize clone-types for management on the basis of their tendencies of experiencing late propagations. We also find that late propagation can be considerably minimized by managing the SPCP clones. On the basis of our studies we develop an automatic system called AMIC (Automatic Mining of Important Clones) that identifies the important clones for management (refactoring and tracking) and ranks these clones considering their evolutionary coupling, bug-proneness, and late propagation tendencies. We believe that our research findings have the potential to assist clone management by pin-pointing the important clones to be managed, and thus, considerably minimizing clone management effort
Software Product Line Engineering via Software Transplantation
For companies producing related products, a Software Product Line (SPL) is a
software reuse method that improves time-to-market and software quality,
achieving substantial cost reductions.These benefits do not come for free. It
often takes years to re-architect and re-engineer a codebase to support SPL
and, once adopted, it must be maintained. Current SPL practice relies on a
collection of tools, tailored for different reengineering phases, whose output
developers must coordinate and integrate. We present Foundry, a general
automated approach for leveraging software transplantation to speed conversion
to and maintenance of SPL. Foundry facilitates feature extraction and
migration. It can efficiently, repeatedly, transplant a sequence of features,
implemented in multiple files. We used Foundry to create two valid product
lines that integrate features from three real-world systems in an automated
way. Moreover, we conducted an experiment comparing Foundry's feature migration
with manual effort. We show that Foundry automatically migrated features across
codebases 4.8 times faster, on average, than the average time a group of SPL
experts took to accomplish the task
Mining Semantic Loop Idioms
To write code, developers stitch together patterns, like API protocols or data structure traversals. Discovering these patterns can identify inconsistencies in code or opportunities to replace these patterns with an API or a language construct. We present coiling, a technique for automatically mining code for semantic idioms: surprisingly probable, semantic patterns. We specialize coiling for loop idioms, semantic idioms of loops. First, we show that automatically identifiable patterns exist, in great numbers, with a largescale empirical study of loops over 25MLOC. We find that most loops in this corpus are simple and predictable: 90 percent have fewer than 15LOC and 90 percent have no nesting and very simple control. Encouraged by this result, we then mine loop idioms over a second, buildable corpus. Over this corpus, we show that only 50 loop idioms cover 50 percent of the concrete loops. Our framework opens the door to data-driven tool and language design, discovering opportunities to introduce new API calls and language constructs. Loop idioms show that LINQ would benefit from an Enumerate operator. This can be confirmed by the exitence of a StackOverflow question with 542k views that requests precisely this feature
- …