High-Performance 3D Compressive Sensing MRI Reconstruction Using Many-Core Architectures
Compressive sensing (CS) describes how sparse
signals can be accurately reconstructed from far fewer samples
than the Nyquist criterion requires. Since MRI scan duration
is proportional to the number of acquired samples, CS has been
gaining significant attention in MRI. However, the computationally
intensive nature of CS reconstructions has precluded their
use in routine clinical practice. In this work, we investigate how
different throughput-oriented architectures can benefit one CS
algorithm and what levels of acceleration are feasible on different
modern platforms. We demonstrate that a CUDA-based code
running on an NVIDIA Tesla C2050 GPU can reconstruct a
256 × 160 × 80 volume from an 8-channel acquisition in 19 seconds,
which is in itself a significant improvement over the state of the art. We then
show that Intel's Knights Ferry can perform the same 3D MRI
reconstruction in only 12 seconds, bringing CS methods even
closer to clinical viability.
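To illustrate the sparse-recovery principle behind CS (this is a toy sketch, not the paper's GPU-accelerated 3D reconstruction; the problem sizes, random sensing matrix, and use of ISTA are all assumptions for the example), the following recovers a synthetic sparse signal from undersampled linear measurements:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: a k-sparse signal of length n observed
# through m < n random linear measurements (far below Nyquist).
n, m, k = 200, 100, 5
A = rng.standard_normal((m, n)) / np.sqrt(m)   # sensing matrix
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
y = A @ x_true

# ISTA: iterative soft-thresholding for min_x 0.5||Ax-y||^2 + lam*||x||_1
lam = 0.05
step = 1.0 / np.linalg.norm(A, 2) ** 2          # 1/L, L = sigma_max^2
x = np.zeros(n)
for _ in range(1000):
    x = x - step * (A.T @ (A @ x - y))                       # gradient step
    x = np.sign(x) * np.maximum(np.abs(x) - step * lam, 0.0) # shrinkage

print(np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```

Practical CS MRI replaces the dense random matrix with undersampled Fourier operators and a wavelet or total-variation sparsifier, which is where the bulk of the compute (and the paper's many-core acceleration) lives.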
Lattice QCD with Domain Decomposition on Intel Xeon Phi Co-Processors
The gap between the cost of moving data and the cost of computing continues
to grow, making it ever harder to design iterative solvers on extreme-scale
architectures. This problem can be alleviated by alternative algorithms that
reduce the amount of data movement. We investigate this in the context of
Lattice Quantum Chromodynamics and implement such an alternative solver
algorithm, based on domain decomposition, on Intel Xeon Phi co-processor (KNC)
clusters. We demonstrate close-to-linear on-chip scaling to all 60 cores of the
KNC. With a mix of single- and half-precision the domain-decomposition method
sustains 400-500 Gflop/s per chip. Compared to an optimized KNC implementation
of a standard solver [1], our full multi-node domain-decomposition solver
strong-scales to more nodes and reduces the time-to-solution by a factor of 5.
Comment: 12 pages, 7 figures, presented at Supercomputing 2014, November
16-21, 2014, New Orleans, Louisiana, USA, speaker Simon Heybrock; SC '14
Proceedings of the International Conference for High Performance Computing,
Networking, Storage and Analysis, pages 69-80, IEEE Press, Piscataway, NJ, USA
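The data-movement saving of domain decomposition can be sketched on a toy problem (a minimal multiplicative-Schwarz iteration on a 1D Laplacian; the paper's solver targets the Wilson-Dirac operator and distributed memory, so everything below, including the subdomain sizes and overlap, is an illustrative assumption):

```python
import numpy as np

# Toy system: 1D Laplacian, A x = b.
n = 64
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)

# Two overlapping subdomains. Each local solve touches only local data,
# which is the communication/data-movement saving domain decomposition
# exploits on wide nodes like the Xeon Phi.
overlap = 4
dom1 = np.arange(0, n // 2 + overlap)
dom2 = np.arange(n // 2 - overlap, n)

x = np.zeros(n)
for sweep in range(200):
    for dom in (dom1, dom2):
        r = b - A @ x                                   # global residual
        # Local correction: solve the subdomain block against its residual.
        x[dom] += np.linalg.solve(A[np.ix_(dom, dom)], r[dom])
    if np.linalg.norm(b - A @ x) < 1e-10:
        break

print(np.linalg.norm(b - A @ x))
```

In production the local solves would run in reduced precision on-chip (as in the paper's single/half-precision mix) and only boundary data would cross the network.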
Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models
Deep learning recommendation models (DLRMs) are used across many
business-critical services at Facebook and are the single largest AI
application in terms of infrastructure demand in its data-centers. In this
paper we discuss the SW/HW co-designed solution for high-performance
distributed training of large-scale DLRMs. We introduce a high-performance
scalable software stack based on PyTorch and pair it with the new evolution of
Zion platform, namely ZionEX. We demonstrate the capability to train very large
DLRMs with up to 12 trillion parameters and show that we can attain a 40X
speedup in terms of time to solution over previous systems. We achieve this by
(i) designing the ZionEX platform with a dedicated scale-out network,
provisioned with high bandwidth, optimal topology, and efficient transport;
(ii) implementing an optimized PyTorch-based training stack supporting both
model and data parallelism; (iii) developing sharding algorithms capable of
hierarchically partitioning the embedding tables along row and column
dimensions and load-balancing them across multiple workers; (iv) adding
high-performance core operators while retaining flexibility to support
optimizers with fully deterministic updates; and (v) leveraging
reduced-precision communications, a multi-level memory hierarchy
(HBM+DDR+SSD), and pipelining. Furthermore, we
develop and briefly comment on distributed data ingestion and other supporting
services that are required for the robust and efficient end-to-end training in
production environments.
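The row-wise sharding and load-balancing idea in (iii) can be sketched with a greedy placement heuristic (a simplified, hypothetical stand-in for the paper's hierarchical sharding algorithms; the table names, sizes, and cost model are invented, and real cost models also weight lookup frequency and bandwidth):

```python
import heapq

def shard_tables(table_rows, num_workers, max_rows_per_shard):
    """Split large embedding tables row-wise, then greedily assign each
    shard to the least-loaded worker (longest-processing-time heuristic)."""
    shards = []
    for name, rows in table_rows.items():
        start = 0
        while start < rows:
            end = min(start + max_rows_per_shard, rows)
            shards.append((end - start, name, start, end))
            start = end
    shards.sort(reverse=True)                        # place biggest shards first
    heap = [(0, w, []) for w in range(num_workers)]  # (load, worker_id, shards)
    heapq.heapify(heap)
    for size, name, start, end in shards:
        load, w, assigned = heapq.heappop(heap)      # least-loaded worker
        assigned.append((name, start, end))
        heapq.heappush(heap, (load + size, w, assigned))
    return {w: assigned for _, w, assigned in heap}

# Hypothetical tables: row counts per embedding table.
plan = shard_tables({"ads": 10_000, "users": 6_000, "pages": 500},
                    num_workers=4, max_rows_per_shard=4_000)
for w in sorted(plan):
    print(w, plan[w])
```

Column-wise splitting (the other axis mentioned in the abstract) would additionally divide the embedding dimension of each row shard, trading wider all-to-all communication for finer load balance.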