Practical resource monitoring for robust high throughput computing

Benjamin Tovar; Dariusz Król; Douglas Thain; Ewa Deelman; Gideon Juve; Miron Livny; Rafael Ferreira Da Silva; William Allcock

Publication date: 1 January 2014

Abstract

Robust high throughput computing requires effective monitoring and enforcement of a variety of resources, including CPU cores, memory, disk, and network traffic. Without effective monitoring and enforcement, it is easy to overload machines, causing failures and slowdowns, or to underutilize machines, which results in wasted opportunities. This paper explores how to describe, measure, and enforce resources used by computational tasks. We focus on tasks running in distributed execution systems, in which a task requests the resources it needs, and the execution system ensures the availability of such resources. This presents two non-trivial problems: how to measure the resources consumed by a task, and how to monitor and report resource exhaustion in a robust and timely manner. For both of these problems, operating systems have a variety of mechanisms with different degrees of availability, accuracy, overhead, and intrusiveness. We describe various forms of monitoring and the available mechanisms in contemporary operating systems. We then present two specific monitoring tools that choose different tradeoffs in overhead and accuracy, and evaluate them on a selection of benchmarks.
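To make the measurement problem concrete, here is a minimal Python sketch of one low-overhead mechanism that contemporary POSIX systems offer: the per-child accounting the kernel returns when a process is reaped. It is not either of the tools the paper presents; the function name `run_and_measure` and the fields of its result are hypothetical, chosen only for illustration.

```python
import os
import subprocess
import sys
import time


def run_and_measure(argv):
    """Run argv as a child process and return its resource usage.

    Hypothetical helper for illustration. It relies on os.wait4 (POSIX only),
    which returns the rusage record the kernel kept for the reaped child.
    """
    start = time.monotonic()
    proc = subprocess.Popen(argv)
    # Reap the child ourselves so that we receive its rusage record.
    _, status, usage = os.wait4(proc.pid, 0)
    wall = time.monotonic() - start

    # Decode the wait status by hand: a negative value means "killed by signal".
    if os.WIFEXITED(status):
        exit_code = os.WEXITSTATUS(status)
    else:
        exit_code = -os.WTERMSIG(status)
    # Record the result on the Popen object, since the child is already reaped.
    proc.returncode = exit_code

    return {
        "exit_code": exit_code,
        "wall_s": wall,
        "user_cpu_s": usage.ru_utime,
        "sys_cpu_s": usage.ru_stime,
        # Peak resident set size: kilobytes on Linux, bytes on macOS.
        "max_rss": usage.ru_maxrss,
    }


if __name__ == "__main__":
    report = run_and_measure([sys.executable, "-c", "sum(range(10**6))"])
    for key, value in report.items():
        print(f"{key}: {value}")
```

This approach costs almost nothing at run time, but it only reports totals after the task has exited, and it does not cover disk or network usage. Detecting resource exhaustion in a timely manner needs a mechanism that observes the task while it runs, which is where the tradeoff between overhead and accuracy arises.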

Available Versions

CiteSeerX

oai:CiteSeerX.psu:10.1.1.1077....

Last time updated on 07/12/2020