Restructuring source code identifiers
In software engineering, maintenance accounts for 60% of the overall lifecycle cost of a
software product. Program comprehension is a substantial part of maintenance and
evolution cost and, thus, any advancement in maintenance, evolution, and program understanding
can greatly reduce the total cost of ownership of a software
product. Identifiers are an important source of information during program understanding
and maintenance: programmers often use identifiers to build their mental
models of software artifacts. Thus, poorly chosen identifiers have been reported in
the literature as misleading and as increasing program-comprehension effort. Identifiers are composed of terms, which can be dictionary words, acronyms, contractions, or
simple strings. We conjecture that the use of identical terms in different contexts may
increase the risk of faults, and hence maintenance effort. We investigate our conjecture
using a measure combining term entropy and term context-coverage to study whether
certain terms increase the odds ratios of methods to be fault-prone. We compute term
entropy and context-coverage of terms extracted from identifiers in Rhino 1.4R3 and
ArgoUML 0.16. We show statistically that methods containing terms with high entropy
and context-coverage are more fault-prone than others, and that the new measure is only
partially correlated with size. We will build on this study by applying summarization
techniques to extract linguistic information from methods and classes. Using this
information, we will extract domain concepts from source code and propose
linguistics-based refactorings.
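As a minimal sketch, the two measures could be computed roughly as follows. All method names, terms, and counts below are invented for illustration, and the definitions are simplified stand-ins for the thesis's exact entropy and context-coverage measures:

```python
import math

# Toy corpus: occurrence counts of each term per method. Method and term
# names are invented; the thesis mined such data from Rhino 1.4R3 and
# ArgoUML 0.16.
term_counts = {
    "parse": {"Parser.parse": 8, "Scanner.next": 1, "Node.toString": 1},
    "token": {"Scanner.next": 5},
}
all_methods = {"Parser.parse", "Scanner.next", "Node.toString"}

def term_entropy(counts):
    """Shannon entropy of a term's distribution over methods: high when
    the term is spread evenly across many methods."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def context_coverage(counts, methods):
    """Fraction of methods containing the term at least once (a simplified
    proxy for the thesis's context-coverage measure)."""
    return len(counts) / len(methods)

# Terms scoring high on both dimensions would, per the conjecture, mark
# methods with higher odds of being fault-prone.
scores = {t: (term_entropy(c), context_coverage(c, all_methods))
          for t, c in term_counts.items()}
```

Here "parse" scores higher than "token" on both measures because it appears in several distinct methods, which is exactly the kind of term the study associates with fault-proneness.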
Studying the evolution of software through software clustering and concept analysis
This thesis investigates the use of software clustering and concept analysis techniques for studying the evolution of software. These techniques produce representations of software systems by clustering similar entities in a system together. The software engineering community has used these techniques for a number of different reasons, but this is the first study to investigate their use for studying evolution. The representations produced by software clustering and concept analysis techniques can be used to trace changes to a software system across a number of different versions of the system. This information can be used by system maintainers to identify worrying evolutionary trends or to assess a proposed change by comparing it to the effects of an earlier, similar change. The work described here attempts to establish whether the use of software clustering and concept analysis techniques for studying the evolution of software is worth pursuing. Four techniques, chosen on the basis of an extensive literature survey of the field, have been used to create representations of versions of a test software system. These representations have been examined to assess whether any observations about the evolution of the system can be drawn from them. The results are positive, and it is thought that the evolution of software systems could be studied using these techniques.
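To illustrate the concept analysis side, the sketch below enumerates the formal concepts of a tiny entity/attribute relation by closing every subset of entities. The functions and global-variable names are invented; the thesis applied such techniques to successive versions of a real test system, where comparing the resulting concept lattices across versions reveals evolutionary change:

```python
from itertools import combinations

# Toy entity/attribute relation: which functions reference which global
# variables (all names hypothetical).
uses = {
    "read_config":  {"cfg_path", "cfg_cache"},
    "write_config": {"cfg_path", "cfg_cache"},
    "draw_frame":   {"screen"},
    "resize":       {"screen", "cfg_cache"},
}
ALL_ATTRS = set().union(*uses.values())

def intent(entities):
    """Attributes shared by every entity in the set."""
    return set.intersection(*(uses[e] for e in entities)) if entities else set(ALL_ATTRS)

def extent(attrs):
    """Entities possessing every attribute in the set."""
    return {e for e, a in uses.items() if attrs <= a}

def concepts():
    """Enumerate all formal concepts (extent, intent) by closing every
    subset of entities -- exponential, but fine for toy-sized relations."""
    found = set()
    for r in range(len(uses) + 1):
        for combo in combinations(uses, r):
            i = intent(set(combo))
            found.add((frozenset(extent(i)), frozenset(i)))
    return found
```

Each concept groups entities that share exactly the same attributes; for instance `read_config` and `write_config` form a concept through their shared configuration variables, a natural candidate subsystem.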
Defining linguistic antipatterns towards the improvement of source code quality
Previous studies have shown that the linguistic aspects of source code are a valuable source of
information that can help to improve program comprehension. The proposed research
work focuses on supporting quality improvement of source code by identifying, specifying,
and studying common negative practices (i.e., linguistic antipatterns) with respect
to linguistic information. We expect the definition of linguistic antipatterns to increase
the awareness of the existence of such bad practices and to discourage their use. We
also propose to study the relation between negative practices in linguistic information
(i.e., linguistic antipatterns) and negative practices in structural information (i.e., design
antipatterns) with respect to comprehension effort and fault/change proneness. We
discuss the proposed methodology and some preliminary results.
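A linguistic antipattern of the kind proposed here contrasts what a method's name promises with what its signature and behavior actually deliver. The rules below are a hypothetical, minimal sketch of such checks, not the catalogue the research defines:

```python
def lint_method(name, return_type, modifies_state):
    """Flag mismatches between a method's name and its signature/behavior.
    The three rules are illustrative examples of linguistic antipatterns."""
    issues = []
    if name.startswith(("is", "has")) and return_type != "boolean":
        issues.append("predicate name but non-boolean return type")
    if name.startswith("get") and return_type == "void":
        issues.append("'get' name but returns nothing")
    if name.startswith("get") and modifies_state:
        issues.append("'get' name but modifies state")
    return issues
```

A method such as `isValid()` returning `int`, or a `getCache()` that silently updates the cache it returns, would be flagged; such name/behavior mismatches are exactly the bad practices whose definition is expected to raise awareness and discourage their use.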
Design Recovery and Data Mining: A Methodology That Identifies Data-Cohesive Subsystems Based on Mining Association Rules
Software maintenance is both a technical and an economic concern for organizations. Large software systems are difficult to maintain due to their intrinsic complexity, and their maintenance consumes between 50% and 90% of the cost of their complete life-cycle. An essential step in maintenance is reverse engineering, which focuses on understanding the system. This system understanding is critical to avoid the generation of undesired side effects during maintenance. The objective of this research is to investigate the potential of applying data mining to reverse engineering. This research was motivated by the following: (1) data mining can process large volumes of information, (2) data mining can elicit meaningful information without previous knowledge of the domain, (3) data mining can extract novel non-trivial relationships from a data set, and (4) data mining is automatable. These data mining features are used to help address the problem of understanding large legacy systems. This research produced a general method to apply data mining to reverse engineering, and a methodology for design recovery, called Identification of Subsystems based on Associations (ISA). ISA uses mined association rules from a database view of the subject system to guide a clustering process that produces a data-cohesive hierarchical subsystem decomposition of the system. ISA promotes object-oriented principles because each identified subsystem consists of a set of data repositories and the code (i.e., programs) that manipulates them. ISA is an automatic multi-step process, which uses the source code of the subject system and multiple parameters as its input. ISA includes two representation models (i.e., text-based and graphic-based representation models) to present the resulting subsystem decomposition. The automated environment RE-ISA implements the ISA methodology. RE-ISA was used to produce the subsystem decomposition of real-world software systems.
Results show that ISA can automatically produce data-cohesive subsystem decompositions without previous knowledge of the subject system, and that ISA always generates the same results if the same parameters are utilized. This research provides evidence that data mining is a beneficial tool for reverse engineering and provides the foundation for defining methodologies that combine data mining and software maintenance.