3,858 research outputs found
Research and Education in Computational Science and Engineering
Over the past two decades the field of computational science and engineering
(CSE) has penetrated both basic and applied research in academia, industry, and
laboratories to advance discovery, optimize systems, support decision-makers,
and educate the scientific and engineering workforce. Informed by centuries of
theory and experiment, CSE performs computational experiments to answer
questions that neither theory nor experiment alone is equipped to answer. CSE
provides scientists and engineers of all persuasions with algorithmic
inventions and software systems that transcend disciplines and scales. Carried
on a wave of digital technology, CSE brings the power of parallelism to bear on
troves of data. Mathematics-based advanced computing has become a prevalent
means of discovery and innovation in essentially all areas of science,
engineering, technology, and society; and the CSE community is at the core of
this transformation. However, a combination of disruptive
developments---including the architectural complexity of extreme-scale
computing, the data revolution that engulfs the planet, and the specialization
required to follow the applications to new frontiers---is redefining the scope
and reach of the CSE endeavor. This report describes the rapid expansion of CSE
and the challenges to sustaining its bold advances. The report also presents
strategies and directions for CSE research and education for the next decade.Comment: Major revision, to appear in SIAM Revie
Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
In the past decade, Convolutional Neural Networks (CNNs) have demonstrated
state-of-the-art performance in various Artificial Intelligence tasks. To
accelerate the experimentation and development of CNNs, several software
frameworks have been released, primarily targeting power-hungry CPUs and GPUs.
In this context, reconfigurable hardware in the form of FPGAs constitutes a
potential alternative platform that can be integrated in the existing deep
learning ecosystem to provide a tunable balance between performance, power
consumption and programmability. In this paper, a survey of the existing
CNN-to-FPGA toolflows is presented, comprising a comparative study of their key
characteristics which include the supported applications, architectural
choices, design space exploration methods and achieved performance. Moreover,
major challenges and objectives introduced by the latest trends in CNN
algorithmic research are identified and presented. Finally, a uniform
evaluation methodology is proposed, aiming at the comprehensive, complete and
in-depth evaluation of CNN-to-FPGA toolflows.Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal,
201
SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks
Going deeper and wider in neural architectures improves the accuracy, while
the limited GPU DRAM places an undesired restriction on the network design
domain. Deep Learning (DL) practitioners either need change to less desired
network architectures, or nontrivially dissect a network across multiGPUs.
These distract DL practitioners from concentrating on their original machine
learning tasks. We present SuperNeurons: a dynamic GPU memory scheduling
runtime to enable the network training far beyond the GPU DRAM capacity.
SuperNeurons features 3 memory optimizations, \textit{Liveness Analysis},
\textit{Unified Tensor Pool}, and \textit{Cost-Aware Recomputation}, all
together they effectively reduce the network-wide peak memory usage down to the
maximal memory usage among layers. We also address the performance issues in
those memory saving techniques. Given the limited GPU DRAM, SuperNeurons not
only provisions the necessary memory for the training, but also dynamically
allocates the memory for convolution workspaces to achieve the high
performance. Evaluations against Caffe, Torch, MXNet and TensorFlow have
demonstrated that SuperNeurons trains at least 3.2432 deeper network than
current ones with the leading performance. Particularly, SuperNeurons can train
ResNet2500 that has basic network layers on a 12GB K40c.Comment: PPoPP '2018: 23nd ACM SIGPLAN Symposium on Principles and Practice of
Parallel Programmin
- …