Monitoring large clusters is a challenging problem. It is necessary to
observe a large quantity of devices with a reasonably short delay between
consecutive observations. The set of monitored devices may include PCs, network
switches, tape libraries and other equipments. The monitoring activity should
not impact the performances of the system. In this paper we present PerfMC, a
monitoring system for large clusters. PerfMC is driven by an XML configuration
file, and uses the Simple Network Management Protocol (SNMP) for data
collection. SNMP is a standard protocol implemented by many networked
equipments, so the tool can be used to monitor a wide range of devices. System
administrators can display informations on the status of each device by
connecting to a WEB server embedded in PerfMC. The WEB server can produce
graphs showing the value of different monitored quantities as a function of
time; it can also produce arbitrary XML pages by applying XSL Transformations
to an internal XML representation of the cluster's status. XSL Transformations
may be used to produce HTML pages which can be displayed by ordinary WEB
browsers. PerfMC aims at being relatively easy to configure and operate, and
highly efficient. It is currently being used to monitor the Italian
Reprocessing farm for the BaBar experiment, which is made of about 200 dual-CPU
Linux machines.Comment: Talk from the 2003 Computing in High Energy and Nuclear Physics
(CHEP03), La Jolla, Ca, USA, March 2003, 10 pages, LaTeX, 4 eps figures. PSN
MOET00