In software engineering, maintenance accounts for 60% of the overall lifecycle cost of a
software product. Program comprehension is a substantial part of maintenance and
evolution cost; thus, any advance in maintenance, evolution, and program understanding
could greatly reduce the total cost of ownership of a software
product. Identifiers are an important source of information during program understanding
and maintenance. Programmers often use identifiers to build their mental
models of the software artifacts. Accordingly, poorly chosen identifiers have been reported in
the literature as misleading and as increasing program comprehension effort.

Identifiers are composed of terms, which can be dictionary words, acronyms, contractions, or
simple strings. We conjecture that the use of identical terms in different contexts may
increase the risk of faults, and hence maintenance effort. We investigate our conjecture
using a measure combining term entropy and term context-coverage to study whether
certain terms increase the odds of methods being fault-prone. We compute term
entropy and context-coverage of terms extracted from identifiers in Rhino 1.4R3 and
ArgoUML 0.16. We show statistically that methods containing terms with high entropy
and context-coverage are more fault-prone than others, and that the new measure is only
partially correlated with size. We will build on this study by applying summarization
techniques to extract linguistic information from methods and classes. Using this
information, we will extract domain concepts from source code and propose
linguistic-based refactorings.
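
To make the notion of term entropy concrete, the following is a minimal Python sketch. It is an illustration under stated assumptions, not the measure used in the study: this text does not define term entropy, so the sketch assumes the Shannon entropy of the distribution of a term's occurrences across methods. The names `split_identifier` and `term_entropy` and the toy input are hypothetical, and context-coverage is omitted because its definition is not given here.

```python
import math
import re
from collections import Counter, defaultdict


def split_identifier(identifier):
    """Split an identifier into lowercase terms on underscores,
    digits, and camelCase boundaries (a simple heuristic)."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", identifier)
    return [part.lower() for part in parts]


def term_entropy(methods):
    """Map each term to its entropy (in bits) across methods.

    `methods` maps a method name to the list of identifiers it contains.
    Assumption: entropy is the Shannon entropy of the term's occurrence
    distribution over methods; a term confined to one method scores 0.
    """
    occurrences = defaultdict(Counter)  # term -> {method: count}
    for method, identifiers in methods.items():
        for identifier in identifiers:
            for term in split_identifier(identifier):
                occurrences[term][method] += 1

    entropy = {}
    for term, per_method in occurrences.items():
        total = sum(per_method.values())
        entropy[term] = sum(
            (count / total) * math.log2(total / count)
            for count in per_method.values()
        )
    return entropy


# Hypothetical toy input: three methods and the identifiers they use.
toy_methods = {
    "parse": ["nodeCount", "inputNode"],
    "render": ["nodeList"],
    "reset": ["count"],
}
toy_entropy = term_entropy(toy_methods)
```

In this toy input, `node` and `count` each appear in two methods and therefore receive positive entropy, whereas `input` and `list` appear in a single method and receive zero. Under the assumption above, higher values indicate terms scattered across more of the system.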