Performance Evaluation of Data-Intensive Computing Applications on a Public IaaS Cloud
[Abstract] The advent of cloud computing technologies, which dynamically provide on-demand access to computational resources over the Internet, is offering new possibilities to many scientists and researchers. Nowadays, Infrastructure as a Service (IaaS) cloud providers can offset the increasing processing requirements of data-intensive computing applications, becoming an emerging alternative to traditional servers and clusters. In this paper, a comprehensive study of the leading public IaaS cloud platform, Amazon EC2, has been conducted in order to assess its suitability for data-intensive computing. One of the key contributions of this work is the analysis of the storage-optimized family of EC2 instances. Furthermore, this study presents a detailed analysis of both performance and cost metrics. More specifically, multiple experiments have been carried out to analyze the full I/O software stack, ranging from the low-level storage devices and cluster file systems up to real-world applications using representative data-intensive parallel codes and MapReduce-based workloads. The analysis of the experimental results has shown that data-intensive applications can benefit from tailored EC2-based virtual clusters, enabling users to obtain the highest performance and cost-effectiveness in the cloud.
Funding: Ministerio de Economía y Competitividad (TIN2013-42148-P); Galicia. Consellería de Cultura, Educación e Ordenación Universitaria (GRC2013/055); Ministerio de Educación y Ciencia (AP2010-434)
ReStore: In-Memory REplicated STORagE for Rapid Recovery in Fault-Tolerant Algorithms
Fault-tolerant distributed applications require mechanisms to recover data
lost via a process failure. On modern cluster systems it is typically
impractical to request replacement resources after such a failure. Therefore,
applications have to continue working with the remaining resources. This
requires redistributing the workload and that the non-failed processes reload
data. We present an algorithmic framework and its C++ library implementation
ReStore for MPI programs that enables recovery of data after process failures.
By storing all required data in memory via an appropriate data distribution and
replication, recovery is substantially faster than with standard checkpointing
schemes that rely on a parallel file system. As the application developer can
specify which data to load, we also support shrinking recovery instead of
recovery using spare compute nodes. We evaluate ReStore in both controlled,
isolated environments and real applications. Our experiments show loading times
of lost input data in the range of milliseconds on up to 24576 processors and a
substantial speedup of the recovery time for the fault-tolerant version of a
widely used bioinformatics application.
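The core idea behind such replicated in-memory storage can be sketched as follows. This is a purely illustrative Python toy, not the ReStore C++ API: the block-cyclic layout and the function names are assumptions. Each data block is copied to several ranks spaced far apart, so that after a single rank failure every block still has a live replica to reload from.

```python
# Toy sketch (not the ReStore API): block-cyclic replication of data
# blocks across ranks, so any single rank failure leaves a live copy.

def replica_ranks(block_id, num_ranks, replication=2):
    """Ranks holding copies of a block; copies are spread out so that
    a block and its replica never sit on adjacent ranks."""
    stride = max(1, num_ranks // replication)
    return [(block_id + j * stride) % num_ranks for j in range(replication)]

def recover(block_ids, failed, num_ranks, replication=2):
    """Map each block to a surviving rank it can be reloaded from."""
    plan = {}
    for b in block_ids:
        survivors = [r for r in replica_ranks(b, num_ranks, replication)
                     if r not in failed]
        if not survivors:
            raise RuntimeError(f"block {b} lost: all replicas failed")
        plan[b] = survivors[0]
    return plan

# Example: 16 blocks on 8 ranks, rank 3 fails; every block is still
# recoverable from a surviving rank (shrinking recovery: no spare nodes).
plan = recover(range(16), failed={3}, num_ranks=8)
```

Because the replacement data lives in the memory of surviving ranks rather than on a parallel file system, the "reload" is a lookup plus a point-to-point transfer, which is what makes millisecond-scale recovery plausible.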
gSuite: A Flexible and Framework Independent Benchmark Suite for Graph Neural Network Inference on GPUs
As interest in Graph Neural Networks (GNNs) grows, the importance of
benchmarking and performance characterization studies of GNNs is increasing. So
far, we have seen many studies that investigate and present the performance and
computational efficiency of GNNs. However, the work done so far has been
carried out using a few high-level GNN frameworks. Although these frameworks
provide ease of use, they carry many dependencies on other existing
libraries. The layers of implementation details and the dependencies complicate
the performance analysis of GNN models that are built on top of these
frameworks, especially while using architectural simulators. Furthermore,
different approaches to GNN computation are generally overlooked in prior
characterization studies, and typically only one of the common computational
models is evaluated. Based on these shortcomings and needs, we developed a
benchmark suite that is framework independent, supports versatile
computational models, is easily configurable, and can be used with
architectural simulators without additional effort.
Our benchmark suite, which we call gSuite, makes use of only the hardware
vendor's libraries and is therefore independent of any other frameworks.
gSuite enables performing detailed performance characterization studies on GNN
Inference using both contemporary GPU profilers and architectural GPU
simulators. To illustrate the benefits of our new benchmark suite, we perform a
detailed characterization study with a set of well-known GNN models with
various datasets; running gSuite both on a real GPU card and a timing-detailed
GPU simulator. We also examine the effect of computational models on
performance. We use several evaluation metrics to rigorously measure the
performance of GNN computation.
Comment: IEEE International Symposium on Workload Characterization (IISWC) 202
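The "computational models" the abstract contrasts can be made concrete with a small illustration (hypothetical Python, not gSuite code): sum-aggregation in a GNN layer computed edge-wise as message passing (gather/scatter) versus as a matrix product H' = A·H. Both yield the same result but stress hardware very differently.

```python
# Hypothetical illustration of two common GNN computational models:
# edge-centric message passing vs. aggregation as a matrix multiply.

def aggregate_message_passing(edges, feats):
    """Edge-centric model: scatter each source feature to its destination."""
    out = [[0.0] * len(f) for f in feats]
    for src, dst in edges:
        for k, v in enumerate(feats[src]):
            out[dst][k] += v
    return out

def aggregate_matmul(edges, feats, n):
    """Matrix model: build adjacency A (dense here for clarity; in
    practice a sparse SpMM kernel) and compute A @ H."""
    A = [[0.0] * n for _ in range(n)]
    for src, dst in edges:
        A[dst][src] += 1.0
    dim = len(feats[0])
    return [[sum(A[i][j] * feats[j][k] for j in range(n))
             for k in range(dim)] for i in range(n)]

edges = [(0, 1), (1, 2), (2, 0), (0, 2)]
feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# Both models agree on the aggregated features.
assert aggregate_message_passing(edges, feats) == aggregate_matmul(edges, feats, 3)
```

On a GPU the first model maps to irregular gather/scatter kernels while the second maps to (Sp)MM kernels, which is why a benchmark that exposes both models matters for architectural studies.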
EMERGING ARCHITECTURES FOR PROCESSOR-IN-MEMORY CHIPS: TAXONOMY AND IMPLEMENTATION
The emergence of PIM (processing-in-memory) dies, data-centric systems (DCS) and the near-data processing (NDP) approach has created the need for an architectural taxonomy of multi-core PNM (processing-near-memory) hardware with a multi-level memory structure. PIM dies (in the Russian technical literature, the terms chips or crystals are usually used) are considered an effective alternative to conventional SRAM/DRAM/Flash memory at the CPU-cache, main-memory, storage-class-memory and storage levels. Over the past decade, several methods to classify and implement PIM dies and DCS/NDP systems have been proposed, ranging from software interfaces to computing to hierarchical and massively parallel SIMD processing approaches. This paper presents summarized prolegomena for PIM die architecture and implementation, in particular in the form of basic PIM chips and nanostores.
Multiple target task sharing support for the OpenMP accelerator model
The use of GPU accelerators is becoming common in HPC platforms due to their effective performance and energy efficiency. In addition, new generations of multicore processors are being designed with wider vector units and/or larger hardware thread counts, also contributing to the peak performance of the whole system. Although current directive-based paradigms, such as OpenMP or OpenACC, support both accelerators and multicore-based hosts, they do not provide an effective
and efficient way to concurrently use them, usually resulting in accelerated programs in which the potential computational performance of the host is not exploited. In this paper we propose an extension to the OpenMP 4.5 directive-based programming model to support the specification and execution of multiple instances of task regions on different devices (i.e. accelerators in conjunction with the vector and heavily multithreaded capabilities in multicore processors). The compiler is responsible for the generation of device-specific code for each device kind, delegating to the runtime system the dynamic schedule of the tasks to the available devices. The new proposed clause conveys useful insight to guide the scheduler while keeping a clean, abstract and machine independent programmer interface. The potential of the proposal is analyzed in a prototype implementation in the OmpSs compiler and runtime infrastructure. Performance evaluation is done using three kernels (N-Body, tiled matrix multiply and Stream) on different GPU-capable systems based on ARM, Intel x86 and IBM Power8. From the evaluation we observe speed-ups in the 8–20% range compared to versions in which only the GPU is used, reaching 96% of the additional peak performance thanks to the reduction of data transfers and the benefits introduced by the
OmpSs NUMA-aware scheduler.
Funding: This work is partially supported by the IBM/BSC Deep Learning Center Initiative, by the Spanish Government through Programa Severo Ochoa (SEV-2015-0493), by the Spanish Ministry of Science and Technology through the TIN2015-65316-P project and by the Generalitat de Catalunya (contract 2014-SGR-1051).
New-Sum: A Novel Online ABFT Scheme for General Iterative Methods
Emerging high-performance computing platforms, with large component counts and lower power margins, are anticipated to be more susceptible to soft errors in both logic circuits and memory subsystems. We present an online algorithm-based fault tolerance (ABFT) approach to efficiently detect and recover soft errors for general iterative methods. We design a novel checksum-based encoding scheme for matrix-vector multiplication that is resilient to both arithmetic and memory errors. Our design decouples the checksum updating process from the actual computation, and allows adaptive checksum overhead control. Building on this new encoding mechanism, we propose two online ABFT designs that can effectively recover from errors when combined with a checkpoint/rollback scheme. These designs are capable of addressing scenarios under different error rates. Our ABFT approaches apply to a wide range of iterative solvers that primarily rely on matrix-vector multiplication and vector linear operations. We evaluate our designs through comprehensive analytical and empirical analysis. Experimental evaluation on the Stampede supercomputer demonstrates the low performance overheads incurred by our two ABFT schemes for preconditioned CG (0.4% and 2.2%) and preconditioned BiCGSTAB (1.0% and 4.0%) for the largest SPD matrix from the UFL Sparse Matrix Collection. The evaluation also demonstrates the flexibility and effectiveness of our proposed designs for detecting and recovering various types of soft errors in general iterative methods.
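The checksum idea that such online ABFT schemes build on can be illustrated with the classic encoding for matrix-vector multiplication (a simplified Python sketch; the paper's New-Sum encoding additionally decouples checksum updates from the computation and tolerates memory errors, which this toy does not capture). Encode the column sums c = eᵀA once; then for any x, the identity sum(A·x) = c·x must hold, so a mismatch flags a soft error.

```python
# Simplified classic checksum encoding for fault-tolerant mat-vec
# (illustrative only; not the New-Sum scheme from the abstract).

def encode_checksum(A):
    """Column sums of A: the extra 'checksum row' c = e^T A."""
    return [sum(col) for col in zip(*A)]

def checked_matvec(A, c, x, tol=1e-9):
    """Compute y = A @ x and verify sum(y) == c . x."""
    y = [sum(a * b for a, b in zip(row, x)) for row in A]
    if abs(sum(y) - sum(ci * xi for ci, xi in zip(c, x))) > tol:
        raise ArithmeticError("soft error detected in mat-vec")
    return y

A = [[4.0, 1.0], [1.0, 3.0]]
c = encode_checksum(A)               # [5.0, 4.0]
y = checked_matvec(A, c, [1.0, 2.0])  # [6.0, 7.0]; checksum 13.0 == 13.0
```

Since iterative solvers such as CG and BiCGSTAB spend most of their time in exactly these mat-vec products, verifying this one invariant covers the dominant kernel at the cost of a single extra dot product per iteration.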