Scaling Deep Learning on GPU and Knights Landing clusters
The speed of training deep neural networks has become a major bottleneck in
deep learning research and development. For example, training GoogLeNet on the
ImageNet dataset takes 21 days on a single Nvidia K20 GPU. To speed up
training, current deep learning systems rely heavily on hardware accelerators.
However, these accelerators have limited on-chip memory compared with CPUs, so
to handle large datasets they must fetch data from either CPU memory or remote
processors. We use both self-hosted Intel Knights Landing (KNL) clusters and
multi-GPU clusters as our target platforms. From an algorithmic perspective,
current distributed machine learning systems are designed mainly for cloud
systems. These methods are asynchronous because of the slow networks and high
fault-tolerance requirements of cloud systems. We focus on Elastic Averaging
SGD (EASGD) to design algorithms for HPC clusters. The original EASGD uses a
round-robin method for communication and updating, in which communication is
ordered by machine rank ID; this is inefficient on HPC clusters.
First, we redesign four efficient algorithms for HPC systems to improve
EASGD's poor scaling on clusters. Async EASGD, Async MEASGD, and Hogwild EASGD
are faster than their existing counterparts (Async SGD, Async MSGD, and
Hogwild SGD, respectively) in all our comparisons. Finally, we design Sync
EASGD, which ties for the best performance among all the methods while being
deterministic. In addition to the algorithmic improvements, we use
system-algorithm codesign techniques to scale up the algorithms. By reducing
the communication share of runtime from 87% to 14%, our Sync EASGD achieves a
5.3x speedup over the original EASGD on the same platform. We achieve 91.5%
weak-scaling efficiency on 4253 KNL cores, which is higher than the
state-of-the-art implementation.
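As a rough illustration of the elastic-averaging update that the algorithms above build on, the following is a minimal synchronous sketch on a toy quadratic problem. The learning rate, elastic coefficient `rho`, worker count, and loss function are illustrative assumptions, not the paper's settings or implementation:

```python
import numpy as np

def easgd_step(workers, center, grads, lr=0.1, rho=0.1):
    """One synchronous EASGD round: each worker takes a gradient step plus
    an elastic pull toward the center variable, and the center moves toward
    the workers by the same elastic force."""
    new_workers = [x - lr * g - lr * rho * (x - center)
                   for x, g in zip(workers, grads)]
    center = center + lr * rho * sum(x - center for x in workers)
    return new_workers, center

# Toy problem: 4 workers jointly minimize f(x) = 0.5 * ||x||^2.
rng = np.random.default_rng(0)
workers = [rng.normal(size=2) for _ in range(4)]
center = np.zeros(2)
for _ in range(200):
    grads = [x.copy() for x in workers]  # gradient of 0.5*||x||^2 is x
    workers, center = easgd_step(workers, center, grads)
print(np.linalg.norm(center))  # the center converges toward the optimum 0
```

A synchronous round like this is deterministic, which is the property Sync EASGD preserves; the asynchronous variants instead let workers exchange updates with the center independently.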
Distributed training of deep neural networks with spark: The MareNostrum experience
Deploying a distributed deep learning technology stack on a large parallel system is a complex process, involving the integration and configuration of several layers of both general-purpose and custom software. The details of such deployments are rarely described in the literature. This paper presents the experiences observed during the deployment of a technology stack that enables deep learning workloads on MareNostrum, a petascale supercomputer. The components of a layered architecture based on Apache Spark are described, and the performance and scalability of the resulting system are evaluated. This is followed by a discussion of the impact of different configurations, including parallelism, storage, and networking alternatives, and of other aspects of executing deep learning workloads on a traditional HPC setup. The derived conclusions should help guide similarly complex deployments in the future.
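The core pattern a Spark-based training layer executes is data-parallel gradient computation: each partition computes a local gradient ("map"), and the driver averages and applies the result ("reduce"). The sketch below simulates that pattern in plain NumPy on a toy least-squares problem; the partition count, model, and learning rate are illustrative assumptions, not details of the MareNostrum stack:

```python
import numpy as np

def partition_gradient(X, y, w):
    """Least-squares gradient computed on one data partition,
    as an executor would."""
    residual = X @ w - y
    return X.T @ residual / len(y)

rng = np.random.default_rng(42)
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(400, 2))
y = X @ w_true

# Split the dataset into 4 equal 'partitions', as Spark would.
parts = [(X[i::4], y[i::4]) for i in range(4)]

w = np.zeros(2)
for _ in range(100):
    # 'map': each partition computes its local gradient;
    # 'reduce': the driver averages the gradients and updates the model.
    grads = [partition_gradient(Xp, yp, w) for Xp, yp in parts]
    w -= 0.5 * np.mean(grads, axis=0)
print(w)  # approaches w_true = [2, -1]
```

In an actual Spark deployment the per-partition step would run inside `mapPartitions` (or an aggregate action) on the executors, which is where the storage and networking configuration choices discussed in the paper come into play.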
Reconfigurable Cyber-Physical System for Lifestyle Video-Monitoring via Deep Learning
Indoor monitoring of people at their homes has become a popular application
in Smart Health. Advances in Machine Learning and in hardware for embedded
devices enable new distributed approaches to Cyber-Physical Systems (CPSs).
Changing environments and the need for cost reduction also motivate novel
reconfigurable CPS architectures. In this work, we propose an
indoor-monitoring reconfigurable CPS that uses embedded local nodes (Nvidia
Jetson TX2). We embed Deep Learning architectures to address Human Action
Recognition.
Local processing at these nodes lets us tackle some common issues: it reduces
data-bandwidth usage and preserves privacy (no raw images are transmitted).
Real-time processing is also facilitated, since each optimized node computes
only on its local video feed. Regarding reconfiguration, a remote platform
monitors CPS qualities, and a Quality and Resource Management (QRM) tool sends
commands to the CPS core to trigger its reconfiguration. Our proposal is an
energy-aware system that triggers reconfiguration based on the energy
consumption of battery-powered nodes. Reconfiguration reduces the local nodes'
energy consumption by up to 22%, extending the device operating time while
preserving accuracy similar to the alternative without reconfiguration.
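The energy-aware trigger described above can be sketched as a simple monitoring rule: a QRM-style decision function compares each battery-powered node's power draw against a budget and issues a reconfiguration command when it is exceeded. The budget value, node names, and the "lite model" command are illustrative assumptions, not the paper's actual policy or values:

```python
POWER_BUDGET_W = 10.0  # assumed per-node power budget (illustrative)

def qrm_decide(nodes):
    """Return a reconfiguration command for every battery-powered node
    whose measured power draw exceeds the budget."""
    commands = {}
    for name, status in nodes.items():
        if status["battery_powered"] and status["power_w"] > POWER_BUDGET_W:
            # e.g. tell the node to swap in a lighter DL model
            commands[name] = "switch_to_lite_model"
    return commands

nodes = {
    "jetson-01": {"battery_powered": True, "power_w": 12.5},
    "jetson-02": {"battery_powered": True, "power_w": 7.8},
    "jetson-03": {"battery_powered": False, "power_w": 14.0},  # mains-powered
}
print(qrm_decide(nodes))  # only jetson-01 is over budget and on battery
```

In the deployed system the command would travel from the remote QRM platform to the CPS core, which carries out the actual reconfiguration on the affected node.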