92,594 research outputs found
LIKWID Monitoring Stack: A flexible framework enabling job specific performance monitoring for the masses
System monitoring is an established tool to measure the utilization and
health of HPC systems. Usually system monitoring infrastructures make no
connection to job information and do not utilize hardware performance
monitoring (HPM) data. To increase the efficient use of HPC systems automatic
and continuous performance monitoring of jobs is an essential component. It can
help to identify pathological cases, provides instant performance feedback to
the users, offers initial data to judge on the optimization potential of
applications and helps to build a statistical foundation about application
specific system usage. The LIKWID monitoring stack is a modular framework build
on top of the LIKWID tools library. It aims on enabling job specific
performance monitoring using HPM data, system metrics and application-level
data for small to medium sized commodity clusters. Moreover, it is designed to
integrate in existing monitoring infrastructures to speed up the change from
pure system monitoring to job-aware monitoring.Comment: 4 pages, 4 figures. Accepted for HPCMASPA 2017, the Workshop on
Monitoring and Analysis for High Performance Computing Systems Plus
Applications, held in conjunction with IEEE Cluster 2017, Honolulu, HI,
September 5, 201
- …