28 research outputs found
Performance and Power Analysis of HPC Workloads on Heterogenous Multi-Node Clusters
Performance analysis tools allow application developers to identify and characterize the inefficiencies that cause performance degradation in their codes, allowing for application optimizations. Due to the increasing interest in the High Performance Computing (HPC) community towards energy-efficiency issues, it is of paramount importance to be able to correlate performance and power figures within the same profiling and analysis tools. For this reason, we present a performance and energy-efficiency study aimed at demonstrating how a single tool can be used to collect most of the relevant metrics. In particular, we show how the same analysis techniques can be applicable on different architectures, analyzing the same HPC application on a high-end and a low-power cluster. The former cluster embeds Intel Haswell CPUs and NVIDIA K80 GPUs, while the latter is made up of NVIDIA Jetson TX1 boards, each hosting an Arm Cortex-A57 CPU and an NVIDIA Tegra X1 Maxwell GPU.The research leading to these results has received funding from the European Communityâs Seventh Framework Programme [FP7/2007-2013] and Horizon 2020 under the Mont-Blanc projects [17], grant agreements n. 288777, 610402 and 671697. E.C. was partially founded by âContributo 5 per mille assegnato allâUniversitĂ degli Studi di Ferrara-dichiarazione dei redditi dellâanno 2014â. We thank the University of Ferrara and INFN Ferrara for the access to the COKA Cluster. We warmly thank the BSC tools group, supporting us for the smooth integration and test of our setup within Extrae and Paraver.Peer ReviewedPostprint (published version
Runtime Power Allocation Based on Multi-GPU utilization in GAMESS
To improve the power consumption of parallel applications at the runtime, modern processors provide frequency scaling and power limiting capabilities. In this work, a runtime strategy is proposed to maximize performance under a given power budget by distributing the available power according to the relative GPU utilization. Time series forecasting methods were used to develop workload prediction models that provide accurate prediction of GPU utilization during application execution. Experiments were performed on a multi-GPU computing platform DGX-1 equipped with eight NVIDIA V100 GPUs used for quantum chemistry calculations in the GAMESS package. For a limited power budget, the proposed strategy may deliver as much as hundred times better GAMESS performance than that obtained when the power is distributed equally among all the GPUs
Accurate Energy and Performance Prediction for Frequency-Scaled GPU Kernels
Energy optimization is an increasingly important aspect of todayâs high-performance computing applications. In particular, dynamic voltage and frequency scaling (DVFS) has become a widely adopted solution to balance performance and energy consumption, and hardware vendors provide management libraries that allow the programmer to change both memory and core frequencies manually to minimize energy consumption while maximizing performance. This article focuses on modeling the energy consumption and speedup of GPU applications while using different frequency configurations. The task is not straightforward, because of the large set of possible and uniformly distributed configurations and because of the multi-objective nature of the problem, which minimizes energy consumption and maximizes performance. This article proposes a machine learning-based method to predict the best core and memory frequency configurations on GPUs for an input OpenCL kernel. The method is based on two models for speedup and normalized energy predictions over the default frequency configuration. Those are later combined into a multi-objective approach that predicts a Pareto-set of frequency configurations. Results show that our approach is very accurate at predicting extema and the Pareto set, and finds frequency configurations that dominate the default configuration in either energy or performance.DFG, 360291326, CELERITY: Innovative Modellierung fĂŒr Skalierbare Verteilte Laufzeitsystem
Time-energy Analysis of Multilevel Parallelism in Heterogeneous Clusters: the Case of EEG Classification in BCI Tasks
Present heterogeneous architectures interconnect nodes including multiple multi-core microprocessors and accelerators that allow different strategies to accelerate the applications and optimize their energy consumption according to the specific power-performance trade-offs. In this paper, a multi-level parallel procedure is proposed to take advantage of all nodes of a heterogeneous CPU-GPU cluster. Two more alternatives have been implemented, and experimentally compared and analyzed from both running time and energy consumption. Although the paper considers an evolutionary master-worker algorithm for feature selection in EEG classification, the conclusions from the experimental analysis here provided can be frequently applied, as many other useful bioinformatics and data mining applications show the same master-worker profile than the classification problem here considered. Our parallel approach allows to reduce the time by a factor of up to 83, with only about a 4.9% of energy consumed by the sequential procedure, in a cluster with 36 CPU cores and 43 GPU compute units.Spanish Ministerio de Ciencia, InnovaciĂłn y Universidades under grant PGC2018-098813-B-C31ERDF fun
Policy-based SLA storage management model for distributed data storage services
There is high demand for storage related services supporting scientists in their research activities. Those services are expected to provide not only capacity but also features allowing for more flexible and cost efficient usage. Such features include easy multiplatform data access, long term data retention, support for performance and cost differentiating of SLA restricted data access. The paper presents a policy-based SLA storage management model for distributed data storage services. The model allows for automated management of distributed data aimed at QoS provisioning with no strict resource reservation. The problem of providing users with the required QoS requirements is complex, and therefore the model implements heuristic approach for solving it. The corresponding system architecture, metrics and methods for SLA focused storage management are developed and tested in a real, nationwide environment
How Fast Can We Play Tetris Greedily With Rectangular Pieces?
Consider a variant of Tetris played on a board of width and infinite
height, where the pieces are axis-aligned rectangles of arbitrary integer
dimensions, the pieces can only be moved before letting them drop, and a row
does not disappear once it is full. Suppose we want to follow a greedy
strategy: let each rectangle fall where it will end up the lowest given the
current state of the board. To do so, we want a data structure which can always
suggest a greedy move. In other words, we want a data structure which maintains
a set of rectangles, supports queries which return where to drop the
rectangle, and updates which insert a rectangle dropped at a certain position
and return the height of the highest point in the updated set of rectangles. We
show via a reduction to the Multiphase problem [P\u{a}tra\c{s}cu, 2010] that on
a board of width , if the OMv conjecture [Henzinger et al., 2015]
is true, then both operations cannot be supported in time
simultaneously. The reduction also implies polynomial bounds from the 3-SUM
conjecture and the APSP conjecture. On the other hand, we show that there is a
data structure supporting both operations in time on
boards of width , matching the lower bound up to a factor.Comment: Correction of typos and other minor correction
Crafting Concurrent Data Structures
Concurrent data structures lie at the heart of modern parallel programs. The design and implementation of concurrent data structures can be challenging due to the demand for good performance (low latency and high scalability) and strong progress guarantees. In this dissertation, we enrich the knowledge of concurrent data structure design by proposing new implementations, as well as general techniques to improve the performance of existing ones.The first part of the dissertation present an unordered linked list implementation that supports nonblocking insert, remove, and lookup operations. The algorithm is based on a novel ``enlist\u27\u27 technique that greatly simplifies the task of achieving wait-freedom. The value of our technique is also demonstrated in the creation of other wait-free data structures such as stacks and hash tables.The second data structure presented is a nonblocking hash table implementation which solves a long-standing design challenge by permitting the hash table to dynamically adjust its size in a nonblocking manner. Additionally, our hash table offers strong theoretical properties such as supporting unbounded memory. In our algorithm, we introduce a new ``freezable set\u27\u27 abstraction which allows us to achieve atomic migration of keys during a resize. The freezable set abstraction also enables highly efficient implementations which maximally exploit the processor cache locality. In experiments, we found our lock-free hash table performs consistently better than state-of-the-art implementations, such as the split-ordered list.The third data structure we present is a concurrent priority queue called the ``mound\u27\u27. Our implementations include nonblocking and lock-based variants. The mound employs randomization to reduce contention on concurrent insert operations, and decomposes a remove operation into smaller atomic operations so that multiple remove operations can execute in parallel within a pipeline. In experiments, we show that the mound can provide excellent latency at low thread counts.Lastly, we discuss how hardware transactional memory (HTM) can be used to accelerate existing nonblocking concurrent data structure implementations. We propose optimization techniques that can significantly improve the performance (1.5x to 3x speedups) of a variety of important concurrent data structures, such as binary search trees and hash tables. The optimizations also preserve the strong progress guarantees of the original implementations
DAG-Based Attack and Defense Modeling: Don't Miss the Forest for the Attack Trees
This paper presents the current state of the art on attack and defense
modeling approaches that are based on directed acyclic graphs (DAGs). DAGs
allow for a hierarchical decomposition of complex scenarios into simple, easily
understandable and quantifiable actions. Methods based on threat trees and
Bayesian networks are two well-known approaches to security modeling. However
there exist more than 30 DAG-based methodologies, each having different
features and goals. The objective of this survey is to present a complete
overview of graphical attack and defense modeling techniques based on DAGs.
This consists of summarizing the existing methodologies, comparing their
features and proposing a taxonomy of the described formalisms. This article
also supports the selection of an adequate modeling technique depending on user
requirements