212 research outputs found
Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations
Load imbalance pervasively exists in distributed deep learning training
systems, either caused by the inherent imbalance in learned tasks or by the
system itself. Traditional synchronous Stochastic Gradient Descent (SGD)
achieves good accuracy for a wide variety of tasks, but relies on global
synchronization to accumulate the gradients at every training step. In this
paper, we propose eager-SGD, which relaxes the global synchronization for
decentralized accumulation. To implement eager-SGD, we propose to use two
partial collectives: solo and majority. With solo allreduce, the faster
processes contribute their gradients eagerly without waiting for the slower
processes, whereas with majority allreduce, at least half of the participants
must contribute gradients before continuing, all without using a central
parameter server. We theoretically prove the convergence of the algorithms and
describe the partial collectives in detail. Experimental results on
load-imbalanced environments (CIFAR-10, ImageNet, and UCF101 datasets) show
that eager-SGD achieves 1.27x speedup over the state-of-the-art synchronous
SGD, without losing accuracy.Comment: Published in Proceedings of the 25th ACM SIGPLAN Symposium on
Principles and Practice of Parallel Programming (PPoPP'20), pp. 45-61. 202
SparCML: High-Performance Sparse Communication for Machine Learning
Applying machine learning techniques to the quickly growing data in science
and industry requires highly-scalable algorithms. Large datasets are most
commonly processed "data parallel" distributed across many nodes. Each node's
contribution to the overall gradient is summed using a global allreduce. This
allreduce is the single communication and thus scalability bottleneck for most
machine learning workloads. We observe that frequently, many gradient values
are (close to) zero, leading to sparse of sparsifyable communications. To
exploit this insight, we analyze, design, and implement a set of
communication-efficient protocols for sparse input data, in conjunction with
efficient machine learning algorithms which can leverage these primitives. Our
communication protocols generalize standard collective operations, by allowing
processes to contribute arbitrary sparse input data vectors. Our generic
communication library, SparCML, extends MPI to support additional features,
such as non-blocking (asynchronous) operations and low-precision data
representations. As such, SparCML and its techniques will form the basis of
future highly-scalable machine learning frameworks
Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?
Dense Multi-GPU systems have recently gained a lot of attention in the HPC
arena. Traditionally, MPI runtimes have been primarily designed for clusters
with a large number of nodes. However, with the advent of MPI+CUDA applications
and CUDA-Aware MPI runtimes like MVAPICH2 and OpenMPI, it has become important
to address efficient communication schemes for such dense Multi-GPU nodes. This
coupled with new application workloads brought forward by Deep Learning
frameworks like Caffe and Microsoft CNTK pose additional design constraints due
to very large message communication of GPU buffers during the training phase.
In this context, special-purpose libraries like NVIDIA NCCL have been proposed
for GPU-based collective communication on dense GPU systems. In this paper, we
propose a pipelined chain (ring) design for the MPI_Bcast collective operation
along with an enhanced collective tuning framework in MVAPICH2-GDR that enables
efficient intra-/inter-node multi-GPU communication. We present an in-depth
performance landscape for the proposed MPI_Bcast schemes along with a
comparative analysis of NVIDIA NCCL Broadcast and NCCL-based MPI_Bcast. The
proposed designs for MVAPICH2-GDR enable up to 14X and 16.6X improvement,
compared to NCCL-based solutions, for intra- and inter-node broadcast latency,
respectively. In addition, the proposed designs provide up to 7% improvement
over NCCL-based solutions for data parallel training of the VGG network on 128
GPUs using Microsoft CNTK.Comment: 8 pages, 3 figure
Recommended from our members
Scalable High Performance Message Passing over InfiniBand for Open MPI
InfiniBand (IB) is a popular network technology for modern high-performance computing systems. MPI implementations traditionally support IB using a reliable, connection-oriented (RC) transport. However, per-process resource usage that grows linearly with the number of processes, makes this approach prohibitive for large-scale systems. IB provides an alternative in the form of a connectionless unreliable datagram transport (UD), which allows for near-constant resource usage and initialization overhead as the process count increases. This paper describes a UD-based implementation for IB in Open MPI as a scalable alternative to existing RC-based schemes. We use the software reliability capabilities of Open MPI to provide the guaranteed delivery semantics required by MPI. Results show that UD not only requires fewer resources at scale, but also allows for shorter MPI startup times. A connectionless model also improves performance for applications that tend to send small messages to many different processes
Kilometer-scale climate models: Prospects and challenges
Currently major efforts are underway toward refining the horizontal resolution (or grid spacing) of climate models to about 1 km, using both global and regional climate models (GCMs and RCMs). Several groups have succeeded in conducting kilometer-scale multiweek GCM simulations and decadelong continental-scale RCM simulations. There is the well-founded hope that this increase in resolution represents a quantum jump in climate modeling, as it enables replacing the parameterization of moist convection by an explicit treatment. It is expected that this will improve the simulation of the water cycle and extreme events and reduce uncertainties in climate change projections. While kilometer-scale resolution is commonly employed in limited-area numerical weather prediction, enabling it on global scales for extended climate simulations requires a concerted effort. In this paper, we exploit an RCM that runs entirely on graphics processing units (GPUs) and show examples that highlight the prospects of this approach. A particular challenge addressed in this paper relates to the growth in output volumes. It is argued that the data avalanche of high-resolution simulations will make it impractical or impossible to store the data. Rather, repeating the simulation and conducting online analysis will become more efficient. A prototype of this methodology is presented. It makes use of a bit-reproducible model version that ensures reproducible simulations across hardware architectures, in conjunction with a data virtualization layer as a common interface for output analyses. An assessment of the potential of these novel approaches will be provided
Increased neutrophil-lymphocyte ratio is a poor prognostic factor in patients with primary operable and inoperable pancreatic cancer
Background:
The neutrophil-lymphocyte ratio (NLR) has been proposed as an indicator of systemic inflammatory response. Previous findings from small-scale studies revealed conflicting results about its independent prognostic significance with regard to different clinical end points in pancreatic cancer (PC) patients. Therefore, the aim of our study was the external validation of the prognostic significance of NLR in a large cohort of PC patients.
Methods:
Data from 371 consecutive PC patients, treated between 2004 and 2010 at a single centre, were evaluated retrospectively. The whole cohort was stratified into two groups according to the treatment modality. Group 1 comprised 261 patients with inoperable PC at diagnosis and group 2 comprised 110 patients with surgically resected PC. Cancer-specific survival (CSS) was assessed using the Kaplan–Meier method. To evaluate the independent prognostic significance of the NLR, the modified Glasgow prognostic score (mGPS) and the platelet-lymphocyte ratio univariate and multivariate Cox regression models were applied.
Results:
Multivariate analysis identified increased NLR as an independent prognostic factor for inoperable PC patients (hazard ratio (HR)=2.53, confidence interval (CI)=1.64–3.91, P<0.001) and surgically resected PC patients (HR=1.61, CI=1.02–2.53, P=0.039). In inoperable PC patients, the mGPS was associated with poor CSS only in univariate analysis (HR=1.44, CI=1.04–1.98).
Conclusion:
Risk prediction for cancer-related end points using NLR does add independent prognostic information to other well-established prognostic factors in patients with PC, regardless of the undergoing therapeutic modality. Thus, the NLR should be considered for future individual risk assessment in patients with PC
Report from the OECI Oncology Days 2014
The 2014 OECI Oncology Days was held at the ‘Prof. Dr. Ion Chiricuta’ Oncology Institute in Cluj, Romania, from 12 to 13 June. The focus of this year’s gathering was on developments in personalised medicine and other treatment advances which have made the cost of cancer care too high for many regions throughout Europe
Structured Digital Self-Assessment of Patient Anamnesis Prior to Computed Tomography: Performance Evaluation and Added Value
The aim of this study was to evaluate the performance of a tablet-based, digitized structured self-assessment (DSSA) of patient anamnesis (PA) prior to computed tomography (CT). Of the 317 patients consecutively referred for CT, the majority (n = 294) was able to complete the tablet-based questionnaire, which consisted of 67 items covering social anamnesis, lifestyle factors (e.g., tobacco abuse), medical history (e.g., kidney diseases), current symptoms, and the usability of the system. Patients were able to mark unclear questions for a subsequent discussion with the radiologist. Critical issues for the CT examination were structured and automatically highlighted as “red flags” (RFs) in order to improve patient interaction. RFs and marked questions were highly prevalent (69.5% and 26%). Missing creatinine values (33.3%), kidney diseases (14.4%), thyroid diseases (10.6%), metformin (5.5%), claustrophobia (4.1%), allergic reactions to contrast agents (2.4%), and pathological TSH values (2.0%) were highlighted most frequently as RFs. Patient feedback regarding the comprehensibility of the questionnaire and the tablet usability was mainly positive (90.9%; 86.2%). With advanced age, however, patients provided more negative feedback for both (p = 0.007; p = 0.039). The time effort was less than 20 min for 85.1% of patients, and faster patients were significantly younger (p = 0.046). Overall, the DSSA of PA prior to CT shows a high success rate and is well accepted by most patients. RFs and marked questions were common and helped to focus patients’ interactions and reporting towards decisive aspects
- …