Understanding the Error Behavior of Complex Critical Software Systems through Field Data


Software systems underpin everyday human activities, which increasingly depend on software. Software is an integral part of the systems we interact with in daily life, ranging from small systems for entertainment and domotics to large systems and infrastructures that provide fundamental services such as telecommunications, transportation, and finance. In particular, software systems play a key role in critical domains, where they support crucial activities. For example, ground and air transportation, power supply, nuclear plants, and medical applications rely strongly on software systems: failures affecting these systems can have severe consequences, which can be catastrophic in terms of business or, even worse, human losses. Given the growing dependence on software systems in life-critical and other critical applications, dependability has become one of the most relevant industry and research concerns of the last decades. Software faults have been recognized as one of the major causes of system failures, since the hardware failure rate has been decreasing over the years. Time and cost constraints, along with technical limitations, often make it impossible to fully validate the correctness of software by testing alone; software may therefore be released with residual faults that are activated during operation. The activation of a fault generates errors, which propagate through the components of the system and may lead to a failure. Therefore, in order to produce reliable software, it is important to understand how errors affect a software system. This is of paramount importance in the context of complex critical software systems, where the occurrence of a failure can lead to severe consequences. However, analyzing the error behavior of this kind of system is not trivial.
Such systems are often distributed and based on many interacting heterogeneous components and layers, including Off-The-Shelf (OTS) components, third-party components, and legacy systems. All these aspects hinder the understanding of the error behavior of complex critical software systems. A well-established methodology for evaluating the dependability of operational systems and identifying their dependability bottlenecks is field failure data analysis (FFDA), which is based on monitoring and recording the errors and failures that occur during the operational phase of the system under real workload conditions, i.e., field data. Indeed, direct measurement and analysis of natural failures occurring under real workload conditions is among the most accurate ways to assess dependability characteristics. Monitoring techniques are one of the main sources of field data. The contribution of this thesis is a methodology for understanding the error behavior of complex critical software systems by means of field data generated by the monitoring techniques already implemented in the target system. Using the available monitoring techniques makes it possible to overcome the limitations imposed in the context of critical systems, avoiding severe changes to the system and preserving its functionality and performance. The methodology is based on fault injection experiments that stimulate the target system with different error conditions. Injection experiments accelerate the collection of the error data naturally generated by the monitoring techniques already implemented in the system. The collected data are analyzed in order to characterize the behavior of the system under the software errors that occurred.
To this aim, the proposed methodology leverages a set of innovative means defined in this dissertation: (i) Error Propagation graphs, which allow analyzing the error propagation phenomena that occurred in the target system and that can be inferred from the collected field data; and a set of metrics composed of (ii) Error Determination Degree, which gives insight into the ability of a monitoring technique's error notifications to suggest either the fault that led to the error or the failure the error led to in the system, (iii) Error Propagation Reportability, which allows understanding the ability of a monitoring technique to report the propagation of errors, and (iv) Data Dissimilarity, which gives insight into the suitability of the data generated by the monitoring techniques for failure analysis. The methodology has been applied to two instances of complex critical software systems in the field of Air Traffic Control (ATC), namely a communication middleware supporting data exchange among ATC applications and an arrival manager responsible for managing flight arrivals in a given airspace, within an industry-academia collaboration in the context of a national research project. Results show that field data generated by the monitoring techniques already implemented in a complex critical software system can be leveraged to obtain insights into the error behavior exhibited by the target system, as well as into potentially beneficial locations for EDMs and ERMs. In addition, the proposed methodology made it possible to characterize the effectiveness of the monitoring techniques in terms of failure reporting, error propagation reportability, and data dissimilarity.
