Trigger and data acquisition (TDAQ) systems for modern HEP experiments are
composed of thousands of hardware and software components depending on each
other in a very complex manner. Typically, such systems are operated by
non-expert shift operators, who are not aware of system functionality
details. It is therefore necessary to help the operator to control the system
and to minimize system down-time by providing knowledge-based facilities for
automatic testing and verification of system components and also for error
diagnostics and recovery. For this purpose, a verification and diagnostic
framework was developed in the scope of ATLAS TDAQ. The verification
functionality of the framework allows developers to configure simple low-level
tests for any component in a TDAQ configuration. The test can be configured as
one or more processes running on different hosts. The framework organizes tests
in sequences, using knowledge of the component hierarchy and dependencies, so
that the operator can verify the functionality of any subset of the system.
The diagnostics functionality analyzes the test results and diagnoses detected
errors, e.g. by starting additional tests to determine the reasons for
failures. A conclusion about system functionality, an error diagnosis, and
recovery advice are presented to the operator in a GUI. The
current implementation uses the CLIPS expert system shell for knowledge
representation and reasoning.

Comment: Paper for the 2003 Computing in High Energy and Nuclear Physics
(CHEP03), La Jolla, CA, USA, March 2003 (presented as poster). Format: PDF,
using MSWord template, 5 pages, 6 figures. PSN TUGP00
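
The dependency-driven test sequencing described in the abstract can be illustrated with a minimal Python sketch. This is not the framework's implementation, which according to the abstract is built on the CLIPS expert system shell; the function `run_tests`, the component names, and the result labels are all hypothetical, and the tests here are plain callables rather than processes running on different hosts. The sketch assumes only that each component lists the components it depends on, and that a component is worth testing only once everything it depends on has passed.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_tests(dependencies, tests, selected=None):
    """Run component tests in dependency order (illustrative sketch only).

    dependencies: maps each component to the set of components it depends on.
    tests: maps each component to a zero-argument callable returning True on pass.
    selected: optional subset of components to verify; the components they
              depend on are pulled in automatically.
    """
    # Collect the requested components plus everything they depend on.
    wanted = set()
    stack = list(dependencies) if selected is None else list(selected)
    while stack:
        comp = stack.pop()
        if comp not in wanted:
            wanted.add(comp)
            stack.extend(dependencies.get(comp, ()))

    graph = {comp: set(dependencies.get(comp, ())) for comp in wanted}
    results = {}
    # static_order() yields each component after all of its dependencies.
    for comp in TopologicalSorter(graph).static_order():
        if any(results[dep] != "passed" for dep in graph[comp]):
            # A dependency failed or was skipped, so this test is not run.
            results[comp] = "untested"
        else:
            results[comp] = "passed" if tests[comp]() else "failed"
    return results


# Hypothetical four-component configuration in which the network test fails.
dependencies = {
    "host": set(),
    "network": {"host"},
    "readout": {"network"},
    "monitor": {"host"},
}
tests = {
    "host": lambda: True,
    "network": lambda: False,
    "readout": lambda: True,
    "monitor": lambda: True,
}

results = run_tests(dependencies, tests)
subset = run_tests(dependencies, tests, selected=["monitor"])
```

In this example `readout` is marked `untested` because `network`, on which it depends, failed, which points the operator at the lowest failing component. Selecting only `monitor` runs just the `host` and `monitor` tests.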