4,851 research outputs found
Scaling Deep Learning on GPU and Knights Landing clusters
The speed of deep neural networks training has become a big bottleneck of
deep learning research and development. For example, training GoogleNet by
ImageNet dataset on one Nvidia K20 GPU needs 21 days. To speed up the training
process, the current deep learning systems heavily rely on the hardware
accelerators. However, these accelerators have limited on-chip memory compared
with CPUs. To handle large datasets, they need to fetch data from either CPU
memory or remote processors. We use both self-hosted Intel Knights Landing
(KNL) clusters and multi-GPU clusters as our target platforms. From an
algorithm aspect, current distributed machine learning systems are mainly
designed for cloud systems. These methods are asynchronous because of the slow
network and high fault-tolerance requirement on cloud systems. We focus on
Elastic Averaging SGD (EASGD) to design algorithms for HPC clusters. Original
EASGD used round-robin method for communication and updating. The communication
is ordered by the machine rank ID, which is inefficient on HPC clusters.
First, we redesign four efficient algorithms for HPC systems to improve
EASGD's poor scaling on clusters. Async EASGD, Async MEASGD, and Hogwild EASGD
are faster \textcolor{black}{than} their existing counterparts (Async SGD,
Async MSGD, and Hogwild SGD, resp.) in all the comparisons. Finally, we design
Sync EASGD, which ties for the best performance among all the methods while
being deterministic. In addition to the algorithmic improvements, we use some
system-algorithm codesign techniques to scale up the algorithms. By reducing
the percentage of communication from 87% to 14%, our Sync EASGD achieves 5.3x
speedup over original EASGD on the same platform. We get 91.5% weak scaling
efficiency on 4253 KNL cores, which is higher than the state-of-the-art
implementation
Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators
In this paper, we evaluate the error criticality of radiation-induced errors on modern High-Performance Computing (HPC) accelerators (Intel Xeon Phi and NVIDIA K40) through a dedicated set of metrics. We show that, as long as imprecise computing is concerned, the simple mismatch detection is not sufficient to evaluate and compare the radiation sensitivity of HPC devices and algorithms. Our analysis quantifies and qualifies radiation effects on applications’ output correlating the number of corrupted elements with their spatial locality. Also, we provide the mean relative error (dataset-wise) to evaluate radiation-induced error magnitude.
We apply the selected metrics to experimental results obtained in various radiation test campaigns for a total of more than 400 hours of beam time per device. The amount of data we gathered allows us to evaluate the error criticality of a representative set of algorithms from HPC suites. Additionally, based on the characteristics of the tested algorithms, we draw generic reliability conclusions for broader classes of codes. We show that arithmetic operations are less critical for the K40, while Xeon Phi is more reliable when executing particles interactions solved through Finite Difference Methods. Finally, iterative stencil operations seem the most reliable on both architectures.This work was supported by the STIC-AmSud/CAPES scientific cooperation program under the EnergySFE research
project grant 99999.007556/2015-02, EU H2020 Programme, and MCTI/RNP-Brazil under the HPC4E Project, grant agreement
n° 689772. Tested K40 boards were donated thanks to Steve Keckler, Timothy Tsai, and Siva Hari from NVIDIA.Postprint (author's final draft
- …