At the Locus of Performance: A Case Study in Enhancing CPUs with Copious 3D-Stacked Cache
Over the last three decades, innovations in the memory subsystem were
primarily targeted at overcoming the data movement bottleneck. In this paper,
we focus on a specific market trend in memory technology: 3D-stacked memory and
caches. We investigate the impact of extending the on-chip memory capabilities
in future HPC-focused processors, particularly by 3D-stacked SRAM. First, we
propose a method oblivious to the memory subsystem to gauge the upper bound on
performance improvements when data movement costs are eliminated. Then, using
the gem5 simulator, we model two variants of LARC, a processor fabricated in
1.5 nm and enriched with high-capacity 3D-stacked cache. With a volume of
experiments involving a broad set of proxy-applications and benchmarks, we aim
to reveal where HPC CPU performance could be circa 2028, and conclude that an
average boost of 9.77x is attainable for cache-sensitive HPC applications on a
per-chip basis. Additionally, we exhaustively document our methodological
exploration to motivate HPC centers to drive their own technological agenda
through enhanced co-design.
Discrete mathematics I.
Produced as part of the programme supported by ELTE's Higher Education Structural Transformation Fund (Felsőoktatási Struktúraátalakítási Alap)
Cache optimized linear sieve
Sieving is essential in many number-theoretic algorithms. Sieving with
large primes violates locality of memory access, thus degrading performance.
We suggest tackling this problem by using cyclic data structures in
combination with an in-place bucket sort. We present results for an
implementation of the sieve of Eratosthenes based on these ideas, showing that
this approach is more robust and less affected by slow memory.
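The core idea, trading scattered strided writes for segment-local work by filing each large prime into a bucket for the next segment it will touch, can be sketched as below. This is a rough, assumption-laden illustration rather than the paper's implementation: the function name bucket_sieve, the segment size, and the list-of-lists bucket layout are choices made here for clarity.

```python
import math

def bucket_sieve(limit, segment_size=32768):
    """Segmented sieve of Eratosthenes with per-segment prime buckets.

    Illustrative sketch: large primes are touched only in segments they
    actually hit, so composite marking stays within one cache-sized window.
    """
    # Base primes up to sqrt(limit), found with a plain in-memory sieve.
    root = math.isqrt(limit)
    small = bytearray([1]) * (root + 1)
    small[:2] = b"\x00\x00"
    for p in range(2, math.isqrt(root) + 1):
        if small[p]:
            small[p * p :: p] = bytearray(len(range(p * p, root + 1, p)))
    base_primes = [p for p in range(2, root + 1) if small[p]]

    # buckets[s] holds (prime, next multiple) pairs whose next multiple
    # falls into segment s; each prime is re-filed as the sieve advances.
    n_segments = limit // segment_size + 1
    buckets = [[] for _ in range(n_segments)]
    for p in base_primes:
        if p * p <= limit:
            buckets[(p * p) // segment_size].append((p, p * p))

    primes = list(base_primes)
    for s in range(n_segments):
        lo = s * segment_size
        hi = min(lo + segment_size, limit + 1)
        mark = bytearray([1]) * (hi - lo)
        for p, mult in buckets[s]:
            while mult < hi:          # mark composites inside this segment only
                mark[mult - lo] = 0
                mult += p
            if mult <= limit:         # re-file the prime into its next segment
                buckets[mult // segment_size].append((p, mult))
        buckets[s] = []               # the processed bucket is no longer needed
        for i in range(max(lo, root + 1), hi):
            if mark[i - lo]:
                primes.append(i)
    return primes
```

For example, len(bucket_sieve(10**6)) is 78498, the number of primes below one million. The point of the buckets is that every marking write for a large prime lands inside the current segment, which is the locality property the abstract targets.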
Why globally re-shuffle? Revisiting data shuffling in large scale deep learning
Stochastic gradient descent (SGD) is the most prevalent algorithm for training Deep Neural Networks (DNNs). SGD iterates over the input data set in each training epoch, processing data samples in a random-access fashion. Because this puts enormous pressure on the I/O subsystem, the most common approach to distributed SGD in HPC environments is to replicate the entire dataset to node-local SSDs. However, due to rapidly growing data set sizes, this approach has become increasingly infeasible. Surprisingly, the questions of why and to what extent random access is required have not received much attention in the literature from an empirical standpoint. In this paper, we revisit data shuffling in DL workloads to investigate the viability of partitioning the dataset among workers and performing only a partial distributed exchange of samples in each training epoch. Through extensive experiments on up to 2,048 GPUs of ABCI and 4,096 compute nodes of Fugaku, we demonstrate that, in practice, the validation accuracy of global shuffling can be maintained when carefully tuning the partial distributed exchange. We provide a solution implemented in PyTorch that enables users to control the proposed data exchange scheme.
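The mechanism described above, keeping each worker's partition in place and exchanging only a fraction of samples per epoch, can be sketched as a custom sampler. The code below is an illustrative sketch under stated assumptions, not the released PyTorch solution: the class name PartialShuffleSampler, the exchange_frac parameter, the initial round-robin partitioning, and the use of an all-to-all collective for the index exchange are all choices made here for clarity, and the collective requires a backend that supports it.

```python
import torch
import torch.distributed as dist
from torch.utils.data import Sampler


class PartialShuffleSampler(Sampler):
    """Sampler that shuffles locally and trades a fraction of sample
    indices between workers at every epoch (illustrative sketch)."""

    def __init__(self, dataset_len, exchange_frac=0.1, seed=0):
        self.rank = dist.get_rank()
        self.world_size = dist.get_world_size()
        self.exchange_frac = exchange_frac
        self.seed = seed
        self.epoch = 0
        # Static initial partition: rank r owns every world_size-th index.
        # Sketch assumes dataset_len is divisible by world_size so every
        # rank trades the same number of indices in the collective below.
        self.local = torch.arange(self.rank, dataset_len, self.world_size)

    def set_epoch(self, epoch):
        self.epoch = epoch

    def _exchange(self):
        n = self.local.numel()
        # Number of indices to trade away, rounded to an even per-rank split.
        k = int(self.exchange_frac * n) // self.world_size * self.world_size
        if k == 0 or self.world_size == 1:
            return
        g = torch.Generator().manual_seed(self.seed + self.epoch)
        perm = torch.randperm(n, generator=g)
        outgoing = self.local[perm[:k]].contiguous()
        kept = self.local[perm[k:]]
        incoming = torch.empty_like(outgoing)
        # Every rank sends an equal slice of `outgoing` to every other rank
        # and receives the same amount back, so partition sizes stay fixed.
        dist.all_to_all_single(incoming, outgoing)
        self.local = torch.cat([kept, incoming])

    def __iter__(self):
        self._exchange()  # one partial distributed exchange per epoch
        g = torch.Generator().manual_seed(self.seed + self.epoch + self.rank)
        order = torch.randperm(self.local.numel(), generator=g)
        return iter(self.local[order].tolist())

    def __len__(self):
        return self.local.numel()
```

Used in place of DistributedSampler, it would be passed to the DataLoader via sampler= and re-seeded with set_epoch(epoch) once per epoch, so each epoch performs one partial exchange followed by a purely local shuffle.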