805 research outputs found
ARcode: HPC Application Recognition Through Image-encoded Monitoring Data
Knowing which HPC applications jobs run and analyzing their performance behavior
play important roles in system management and optimization. Existing
approaches detect and identify HPC applications through machine learning
models, but they rely heavily on features manually extracted
from resource utilization data to achieve high prediction accuracy. In
this study, we propose an innovative application recognition method, ARcode,
which encodes job monitoring data into images and leverages the automatic
feature learning capability of convolutional neural networks to detect and
identify applications. Our extensive evaluations based on the dataset collected
from a large-scale production HPC system show that ARcode outperforms the
state-of-the-art methodology by up to 18.87% in terms of accuracy at high
confidence thresholds. For some applications (BerkeleyGW and E3SM),
ARcode outperforms it by over 20% at a confidence threshold of 0.8.
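The image-encoding idea described in this abstract can be sketched roughly as follows: each monitoring metric becomes one row of a fixed-size grayscale image, so jobs of different lengths map to arrays of identical shape that a CNN can classify. The metric names, image width, and min-max normalization here are illustrative assumptions, not the paper's actual encoding.

```python
import numpy as np

def encode_job_image(metrics: dict, width: int = 64) -> np.ndarray:
    """Encode per-metric utilization time series as a grayscale image.

    Each metric becomes one image row: the series is resampled to a
    fixed width via linear interpolation and min-max scaled to 0-255.
    (Hypothetical sketch; ARcode's real encoding may differ.)
    """
    rows = []
    for name in sorted(metrics):  # fixed row order across jobs
        series = np.asarray(metrics[name], dtype=float)
        xs = np.linspace(0, len(series) - 1, width)
        resampled = np.interp(xs, np.arange(len(series)), series)
        lo, hi = resampled.min(), resampled.max()
        scale = (hi - lo) or 1.0  # avoid divide-by-zero on flat series
        rows.append(np.round((resampled - lo) / scale * 255))
    return np.stack(rows).astype(np.uint8)

# Hypothetical job with three monitored metrics of 300 samples each
job = {
    "cpu_util": np.random.rand(300) * 100,
    "mem_used": np.random.rand(300) * 64,
    "ib_rx_bw": np.random.rand(300) * 10,
}
img = encode_job_image(job)
print(img.shape)  # (3, 64): one row per metric, ready for a CNN classifier
```

Because every job yields the same image shape regardless of runtime, a standard convolutional classifier can be trained directly on these arrays without hand-engineered features.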
Graph mining for role extraction in predictive analytics of high-performance computing systems
Master of Science, Department of Computer Science, William H. Hsu. This thesis addresses the task of analyzing property graphs in system log data from high-performance computing (HPC) systems, to identify entity roles to aid in predicting job submission outcomes. This predictive analytics project uses inductive learning on historical logs to produce regression models for estimating resource needs and potential shortfalls, and classification models that predict when jobs will fail due to insufficient resource allocation. The log files are generated by the workload manager of an HPC compute cluster and include runtime parameters for every submitted job. The research objectives of the overall project consist of using these techniques to solve three extant problems: (1) predicting the sufficiency of resources requested in an HPC system at job submission time; (2) making HPC resource allocation more efficient; and (3) building a decision support system for HPC users. Previous approaches and techniques used features such as user demographics and simulations harnessed with simple optimization algorithms to improve resource allocation on a large-scale compute cluster (Kansas State University’s Beocat). In this thesis, role extraction is applied with the goal of creating a user-specific feature for machine learning tasks. Specific use cases include personalized prediction of submitted job outcomes or reinforcement learning from simulation for optimization tasks in job scheduling. Objectives include improving on the accuracy, precision, recall, and utility of previous learning systems.
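One way to picture the role-extraction step described above is the sketch below: simple structural features are computed per user from the workload-manager log, then users are grouped into roles with a tiny k-means loop, and the resulting role label can serve as a user-specific feature for downstream prediction. The log rows, the chosen features, and k=2 are hypothetical illustrations, not the thesis's actual graph-mining method.

```python
import numpy as np

# Hypothetical workload-manager log rows: (user, cores_requested, job_failed)
log = [
    ("alice", 4, 0), ("alice", 8, 0), ("alice", 4, 0),
    ("bob", 128, 1), ("bob", 256, 1), ("bob", 128, 0),
    ("carol", 64, 0), ("carol", 64, 1),
]

# Per-user structural features: submission count, mean request size, failure rate
users = sorted({u for u, _, _ in log})
feats = []
for u in users:
    rows = [(c, f) for v, c, f in log if v == u]
    feats.append([len(rows),
                  np.mean([c for c, _ in rows]),
                  np.mean([f for _, f in rows])])
X = np.array(feats)
X = (X - X.mean(0)) / (X.std(0) + 1e-9)  # standardize each feature

# Tiny k-means (k=2) grouping users into roles
rng = np.random.default_rng(0)
centroids = X[rng.choice(len(X), 2, replace=False)]
for _ in range(10):
    labels = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
    centroids = np.array([X[labels == k].mean(0) if (labels == k).any()
                          else centroids[k] for k in (0, 1)])

# Role label per user, usable as an extra feature in outcome prediction
roles = dict(zip(users, labels))
```

In the full system, these role labels would be joined back onto each job record so that a failure-prediction or resource-estimation model can condition on the submitting user's role rather than on the raw user identity.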
ExaMon-X: a Predictive Maintenance Framework for Automatic Monitoring in Industrial IoT Systems
In recent years, the Industrial Internet of Things (IIoT) has led to significant steps forward in many industries, thanks to the exploitation of several technologies, ranging from Big Data processing to Artificial Intelligence (AI). Among the various IIoT scenarios, large-scale data centers can reap significant benefits from adopting Big Data analytics and AI-boosted approaches, since these technologies enable effective predictive maintenance. However, most currently available off-the-shelf solutions are not ideally suited to the HPC context: e.g., they do not sufficiently take into account the very heterogeneous data sources or the privacy issues which hinder the adoption of cloud solutions, or they do not fully exploit the computing capabilities available on site in a supercomputing facility. In this paper, we tackle this issue and propose a holistic, vertical IIoT framework for predictive maintenance in supercomputers. The framework is based on a lightweight big-data monitoring infrastructure, specialized databases suited to heterogeneous data, and a set of high-level AI-based functionalities tailored to HPC actors’ specific needs. We present the deployment and assess the usage of this framework in several in-production HPC systems.
Distributed Computing in a Pandemic: A Review of Technologies Available for Tackling COVID-19
The current COVID-19 global pandemic caused by the SARS-CoV-2 betacoronavirus
has resulted in over a million deaths and is having a grave socio-economic
impact, hence there is an urgency to find solutions to key research challenges.
Much of this COVID-19 research depends on distributed computing. In this
article, I review distributed architectures -- various types of clusters, grids
and clouds -- that can be leveraged to perform these tasks at scale, at
high-throughput, with a high degree of parallelism, and which can also be used
to work collaboratively. High-performance computing (HPC) clusters will be used
to carry out much of this work. Several big-data processing tasks used in
reducing the spread of SARS-CoV-2 require high-throughput approaches and a
variety of tools, which Hadoop and Spark provide, even on commodity hardware.
Extremely large-scale COVID-19 research has also utilised some of the world's
fastest supercomputers, such as IBM's SUMMIT -- for ensemble docking
high-throughput screening against SARS-CoV-2 targets for drug-repurposing, and
high-throughput gene analysis -- and Sentinel, an XPE-Cray based system used to
explore natural products. Grid computing has facilitated the formation of the
world's first Exascale grid computer. This has accelerated COVID-19 research in
molecular dynamics simulations of SARS-CoV-2 spike protein interactions through
massively-parallel computation and was performed with over 1 million volunteer
computing devices using the Folding@home platform. Grids and clouds both can
also be used for international collaboration by enabling access to important
datasets and providing services that allow researchers to focus on research
rather than on time-consuming data-management tasks.
Comment: 21 pages (15 excl. refs), 2 figures, 3 tables