The complexity and cost of managing high-performance computing
infrastructures are on the rise. Automating management and repair through
predictive models to minimize human interventions is an attempt to increase
system availability and contain these costs. Predictive models that are
accurate enough to be useful in automatic management cannot be built from
restricted log data from subsystems; they require a holistic analysis of data
from disparate sources. Here we provide a detailed multi-scale
characterization study based on four datasets reporting power consumption,
temperature, workload, and hardware/software events for an IBM Blue Gene/Q
installation. We show that the system runs a rich parallel workload, with low
correlation among its components in terms of temperature and power, but higher
correlation in terms of events. As expected, power and temperature correlate
strongly, while events display negative correlations with load and power. Power
and workload show moderate correlations, and only at the scale of components.
The aim of the study is a systematic, integrated characterization of the
computing infrastructure and the discovery of correlation sources and levels,
to serve as a basis for future predictive modeling efforts.

Comment: 12 pages, 7 figures