65 research outputs found
Interactive Supercomputing on 40,000 Cores for Machine Learning and Data Analysis
Interactive massively parallel computations are critical for machine learning
and data analysis. These computations are a staple of the MIT Lincoln
Laboratory Supercomputing Center (LLSC) and have required the LLSC to develop
unique interactive supercomputing capabilities. Scaling interactive machine
learning frameworks, such as TensorFlow, and data analysis environments, such
as MATLAB/Octave, to tens of thousands of cores presents many technical
challenges - in particular, rapidly dispatching many tasks through a scheduler,
such as Slurm, and starting many instances of applications with thousands of
dependencies. Careful tuning of launches and prepositioning of applications
overcome these challenges and allow the launching of thousands of tasks in
seconds on a 40,000-core supercomputer. Specifically, this work demonstrates
launching 32,000 TensorFlow processes in 4 seconds and launching 262,000 Octave
processes in 40 seconds. These capabilities allow researchers to rapidly
explore novel machine learning architectures and data analysis algorithms.

Comment: 6 pages, 7 figures, IEEE High Performance Extreme Computing Conference 201
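The launch-overhead problem the abstract describes can be illustrated with a toy sketch. The snippet below (plain Python, not the LLSC's actual Slurm-based tooling; all names are illustrative) times a concurrent batch of process launches, the quantity the paper optimizes at 40,000-core scale:

```python
import subprocess
import sys
import time

def launch_batch(n_tasks, cmd):
    """Launch n_tasks processes concurrently and wait for all of them.

    A local stand-in for scheduler dispatch; the paper's real launches go
    through Slurm with careful tuning and prepositioned applications.
    """
    start = time.perf_counter()
    procs = [subprocess.Popen(cmd) for _ in range(n_tasks)]
    for p in procs:
        p.wait()
    return time.perf_counter() - start

if __name__ == "__main__":
    # Tiny demo: 8 trivial tasks instead of 32,000 TensorFlow processes.
    elapsed = launch_batch(8, [sys.executable, "-c", "pass"])
    print(f"launched 8 tasks in {elapsed:.2f}s")
```

Even this toy version exposes the dominant cost the paper targets: interpreter startup and dependency loading, which prepositioning amortizes.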
pPython Performance Study
pPython seeks to provide a parallel capability that delivers good speed-up
without sacrificing the ease of programming in Python by implementing
partitioned global array semantics (PGAS) on top of a simple file-based
messaging library (PythonMPI) in pure Python. pPython follows a SPMD (single
program multiple data) model of computation. pPython runs on a single-node
(e.g., a laptop) running Windows, Linux, or MacOS operating systems or on any
combination of heterogeneous systems that support Python, including on a
cluster through a Slurm scheduler interface so that pPython can be executed in
a massively parallel computing environment. Because of its unique file-based
messaging implementation, it is interesting to see what performance pPython
can achieve compared to traditional socket-based MPI communication. In
this paper, we present the point-to-point and collective communication
performance of pPython and compare it with that obtained using mpi4py with
OpenMPI. For large messages, pPython demonstrates performance comparable to
mpi4py.

Comment: arXiv admin note: substantial text overlap with arXiv:2208.1490
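The core idea of file-based messaging can be sketched in a few lines. The class below is a hypothetical minimal analogue of pPython's PythonMPI layer, not its real API: a send writes a pickled message to a rendezvous file and an atomic rename publishes it, while a receive polls for that file:

```python
import os
import pickle
import tempfile

class FileMessenger:
    """Minimal file-based point-to-point messaging, loosely in the spirit
    of pPython's PythonMPI layer (hypothetical API; the real library
    differs)."""

    def __init__(self, root):
        self.root = root

    def _path(self, src, dst, tag):
        return os.path.join(self.root, f"msg_{src}_{dst}_{tag}.pkl")

    def send(self, src, dst, tag, obj):
        tmp = self._path(src, dst, tag) + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(obj, f)
        os.rename(tmp, self._path(src, dst, tag))  # atomic publish

    def recv(self, src, dst, tag):
        path = self._path(src, dst, tag)
        while not os.path.exists(path):  # busy-wait; real code would back off
            pass
        with open(path, "rb") as f:
            obj = pickle.load(f)
        os.remove(path)
        return obj

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        m = FileMessenger(d)
        m.send(0, 1, tag=7, obj=[1, 2, 3])
        print(m.recv(0, 1, tag=7))  # prints [1, 2, 3]
```

On a shared parallel file system this pattern needs no sockets or daemons, which is what makes the performance comparison with socket-based MPI interesting.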
FPT: a Fixed-Point Accelerator for Torus Fully Homomorphic Encryption
Fully Homomorphic Encryption is a technique that allows computation on
encrypted data. It has the potential to change privacy considerations in the
cloud, but computational and memory overheads are preventing its adoption. TFHE
is a promising Torus-based FHE scheme that relies on bootstrapping, the
noise-removal tool invoked after each encrypted logical/arithmetical operation.
We present FPT, a Fixed-Point FPGA accelerator for TFHE bootstrapping. FPT is
the first hardware accelerator to exploit the inherent noise present in FHE
calculations. Instead of double or single-precision floating-point arithmetic,
it implements TFHE bootstrapping entirely with approximate fixed-point
arithmetic. Using an in-depth analysis of noise propagation in bootstrapping
FFT computations, FPT is able to use noise-trimmed fixed-point representations
that are up to 50% smaller than prior implementations.
FPT is built as a streaming processor inspired by traditional streaming DSPs:
it instantiates directly cascaded high-throughput computational stages, with
minimal control logic and routing networks. We explore throughput-balanced
compositions of streaming kernels with a user-configurable streaming width in
order to construct a full bootstrapping pipeline. Our approach allows 100%
utilization of arithmetic units and requires only a small bootstrapping key
cache, enabling an entirely compute-bound bootstrapping throughput of 1 BS /
35us. This is in stark contrast to the classical CPU approach to FHE
bootstrapping acceleration, which is typically constrained by memory and
bandwidth.
FPT is implemented and evaluated as a bootstrapping FPGA kernel for an Alveo
U280 datacenter accelerator card. FPT achieves two to three orders of magnitude
higher bootstrapping throughput than existing CPU-based implementations, and
2.5x higher throughput compared to recent ASIC emulation experiments.

Comment: ACM CCS 202
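The fixed-point representation FPT exploits can be demonstrated in miniature. The sketch below (illustrative Python integers, not FPT's FPGA datapath) quantizes reals to a signed fixed-point format with a configurable number of fractional bits and multiplies in that format; the paper's noise analysis determines how few bits bootstrapping can tolerate:

```python
def to_fixed(x, frac_bits):
    """Quantize a real number to a signed fixed-point integer with
    frac_bits fractional bits."""
    return round(x * (1 << frac_bits))

def fixed_mul(a, b, frac_bits):
    """Multiply two fixed-point values, rescaling back to frac_bits."""
    return (a * b) >> frac_bits

def to_float(x, frac_bits):
    return x / (1 << frac_bits)

if __name__ == "__main__":
    FRAC = 16                     # far narrower than double precision
    a, b = 1.5, -2.25
    fa, fb = to_fixed(a, FRAC), to_fixed(b, FRAC)
    approx = to_float(fixed_mul(fa, fb, FRAC), FRAC)
    print(approx, abs(approx - a * b))
```

Here the inputs happen to be exactly representable, so the error is zero; in general each operation injects bounded rounding noise, which FPT's analysis shows is masked by the noise already inherent in FHE ciphertexts.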
Automated Translation and Accelerated Solving of Differential Equations on Multiple GPU Platforms
We demonstrate a high-performance vendor-agnostic method for massively
parallel solving of ensembles of ordinary differential equations (ODEs) and
stochastic differential equations (SDEs) on GPUs. The method is integrated with
a widely used differential equation solver library in a high-level language
(Julia's DifferentialEquations.jl) and enables GPU acceleration without
requiring code changes by the user. Our approach achieves state-of-the-art
performance compared to hand-optimized CUDA-C++ kernels, while performing
faster than the vectorized-map (vmap) approach
implemented in JAX and PyTorch. Performance evaluation on NVIDIA, AMD, Intel,
and Apple GPUs demonstrates performance portability and vendor-agnosticism. We
show composability with MPI to enable distributed multi-GPU workflows. The
implemented solvers are fully featured, supporting event handling, automatic
differentiation, and incorporation of datasets via the GPU's texture memory,
allowing scientists to take advantage of GPU acceleration on all major current
architectures without changing their model code and without loss of
performance.

Comment: 11 figures
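The ensemble pattern being accelerated is simple to state. The sketch below is a serial CPU stand-in (plain Python, not the paper's Julia/DifferentialEquations.jl GPU kernels): one fixed-step Euler loop integrates many independent (initial condition, parameter) pairs, the same map-over-problems structure the GPU kernels parallelize:

```python
import math

def euler_ensemble(f, u0s, params, dt, nsteps):
    """Solve du/dt = f(u, p) for an ensemble of (u0, p) pairs with
    fixed-step Euler. A serial stand-in for GPU ensemble kernels, where
    each ensemble member maps to one GPU thread."""
    us = list(u0s)
    for _ in range(nsteps):
        us = [u + dt * f(u, p) for u, p in zip(us, params)]
    return us

if __name__ == "__main__":
    # Ensemble of linear decay problems du/dt = -p*u with u(0) = 1.
    decay = lambda u, p: -p * u
    rates = [0.5, 1.0, 2.0]
    us = euler_ensemble(decay, [1.0] * 3, rates, dt=1e-3, nsteps=1000)
    for p, u in zip(rates, us):
        print(p, u, math.exp(-p))  # numerical vs exact u(1) = exp(-p)
```

Because the ensemble members share a program but differ only in data, the computation is embarrassingly parallel, which is what makes the vendor-agnostic GPU mapping possible.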
Survey of storage systems for high-performance computing
In current supercomputers, storage is typically provided by parallel distributed file systems for hot data and tape archives for cold data. These file systems are often compatible with local file systems due to their use of the POSIX interface and semantics, which eases development and debugging because applications can easily run both on workstations and supercomputers. There is a wide variety of file systems to choose from, each tuned for different use cases and implementing different optimizations. However, the overall application performance is often held back by I/O bottlenecks due to insufficient performance of file systems or I/O libraries for highly parallel workloads. Performance problems are dealt with using novel storage hardware technologies as well as alternative I/O semantics and interfaces. These approaches have to be integrated into the storage stack seamlessly to make them convenient to use. Upcoming storage systems abandon the traditional POSIX interface and semantics in favor of alternative concepts such as object and key-value storage; moreover, they heavily rely on technologies such as NVM and burst buffers to improve performance. Additional tiers of storage hardware will increase the importance of hierarchical storage management. Many of these changes will be disruptive and require application developers to rethink their approaches to data management and I/O. A thorough understanding of today's storage infrastructures, including their strengths and weaknesses, is crucially important for designing and implementing scalable storage systems suitable for the demands of exascale computing.
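The interface shift the survey describes, from POSIX paths and byte-range I/O to put/get over opaque keys, can be sketched minimally. The class below is a toy illustration only (hypothetical names; real object stores add durability guarantees, namespaces, and concurrency control):

```python
import os
import tempfile

class ObjectStore:
    """Toy flat key-value store illustrating the put/get interface that
    upcoming storage systems favor over POSIX semantics. Illustrative
    only; not modeled on any specific system."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def put(self, key, value: bytes):
        # Keys are opaque: no directory hierarchy, no seek, no rename,
        # no partial in-place update -- objects are written whole.
        with open(os.path.join(self.root, key), "wb") as f:
            f.write(value)

    def get(self, key) -> bytes:
        with open(os.path.join(self.root, key), "rb") as f:
            return f.read()

if __name__ == "__main__":
    store = ObjectStore(tempfile.mkdtemp())
    store.put("checkpoint-0001", b"\x00\x01")
    print(store.get("checkpoint-0001"))
```

Dropping POSIX's strong consistency and fine-grained update semantics is precisely what lets such interfaces scale to highly parallel workloads.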
An Empirical Evaluation of Allgatherv on Multi-GPU Systems
Applications for deep learning and big data analytics have compute and memory
requirements that exceed the limits of a single GPU. However, effectively
scaling out an application to multiple GPUs is challenging due to the
complexities of communication between the GPUs, particularly for collective
communication with irregular message sizes. In this work, we provide a
performance evaluation of the Allgatherv routine on multi-GPU systems, focusing
on GPU network topology and the communication library used. We present results
from the OSU-micro benchmark as well as conduct a case study for sparse tensor
factorization, one application that uses Allgatherv with highly irregular
message sizes. We extend our existing tensor factorization tool to run on
systems with different node counts and varying number of GPUs per node. We then
evaluate the communication performance of our tool when using traditional MPI,
CUDA-aware MVAPICH and NCCL across a suite of real-world data sets on three
different systems: a 16-node cluster with one GPU per node, NVIDIA's DGX-1 with
8 GPUs and Cray's CS-Storm with 16 GPUs. Our results show that irregularity in
the tensor data sets produce trends that contradict those in the OSU
micro-benchmark, as well as trends that are absent from the benchmark.

Comment: 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID
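The semantics of the routine under study are worth spelling out. The function below simulates MPI_Allgatherv in plain Python (no MPI involved): each rank contributes a buffer of a possibly different length, and every rank ends up with the concatenation of all contributions, which is exactly the irregular-message-size pattern the tensor factorization case study stresses:

```python
def allgatherv(local_buffers):
    """Simulate MPI_Allgatherv: rank i contributes local_buffers[i]
    (lengths may differ across ranks), and every rank receives the
    concatenation of all contributions in rank order."""
    gathered = [x for buf in local_buffers for x in buf]
    return [list(gathered) for _ in local_buffers]  # one copy per rank

if __name__ == "__main__":
    # Three ranks with highly irregular contribution sizes.
    print(allgatherv([[1], [2, 3], [4, 5, 6]]))
```

In a real run the per-rank length imbalance interacts with GPU network topology and the communication library, which is why the observed trends can contradict the regular-message OSU micro-benchmark.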
A Formal Category Theoretical Framework for Multi-model Data Transformations
Data integration and migration processes in polystores and multi-model database management systems highly benefit from data and schema transformations. Rigorous modeling of transformations is a complex problem. The data and schema transformation field is scattered with multiple different transformation frameworks, tools, and mappings. These are usually domain-specific and lack solid theoretical foundations. Our first goal is to define category theoretical foundations for relational, graph, and hierarchical data models and instances. Each data instance is represented as a category theoretical mapping called a functor. We formalize data and schema transformations as Kan lifts utilizing the functorial representation for the instances. A Kan lift is a category theoretical construction consisting of two mappings satisfying a certain universal property. In this work, the two mappings correspond to schema transformation and data transformation.

Peer reviewed
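The "instance as functor" idea can be made concrete with a toy encoding (hypothetical names throughout; the paper works in full category-theoretic generality, not in Python). A schema is a small category of objects and arrows; an instance sends each object to a set of values and each arrow, such as a foreign key, to a function between those sets:

```python
# Schema as a tiny category: objects plus typed arrows (src, dst).
schema_objects = ["Employee", "Department"]
schema_arrows = {"works_in": ("Employee", "Department")}

# Instance as a functor: objects -> sets, arrows -> functions.
inst_objects = {
    "Employee": {"alice", "bob"},
    "Department": {"hr", "eng"},
}
inst_arrows = {
    "works_in": {"alice": "hr", "bob": "eng"},
}

def check_functor():
    """Check the instance is well-typed: each arrow's function maps the
    source object's set into the target object's set."""
    for name, (src, dst) in schema_arrows.items():
        fn = inst_arrows[name]
        assert set(fn) == inst_objects[src], f"{name}: wrong domain"
        assert set(fn.values()) <= inst_objects[dst], f"{name}: wrong codomain"
    return True

print(check_functor())
```

With instances in this functorial form, a schema transformation and its accompanying data transformation become the two mappings that a Kan lift packages together.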