Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations
Load imbalance pervasively exists in distributed deep learning training
systems, either caused by the inherent imbalance in learned tasks or by the
system itself. Traditional synchronous Stochastic Gradient Descent (SGD)
achieves good accuracy for a wide variety of tasks, but relies on global
synchronization to accumulate the gradients at every training step. In this
paper, we propose eager-SGD, which relaxes the global synchronization for
decentralized accumulation. To implement eager-SGD, we propose to use two
partial collectives: solo and majority. With solo allreduce, the faster
processes contribute their gradients eagerly without waiting for the slower
processes, whereas with majority allreduce, at least half of the participants
must contribute gradients before continuing, all without using a central
parameter server. We theoretically prove the convergence of the algorithms and
describe the partial collectives in detail. Experimental results on
load-imbalanced environments (CIFAR-10, ImageNet, and UCF101 datasets) show
that eager-SGD achieves 1.27x speedup over the state-of-the-art synchronous
SGD, without losing accuracy.
Comment: Published in Proceedings of the 25th ACM SIGPLAN Symposium on
Principles and Practice of Parallel Programming (PPoPP'20), pp. 45-61. 202
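The difference between the two partial collectives described in this abstract can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the authors' implementation: the function names, the `arrival_times` parameter, and the choice to let late processes contribute zeros are all hypothetical simplifications, and no real message passing takes place.

```python
import math


def partial_allreduce(grads, arrival_times, min_contributors):
    """Sum the gradients of every process that is ready when the
    min_contributors-th fastest process arrives.

    Assumption for this sketch: processes that are not yet ready
    contribute zeros (i.e. they are simply left out of the sum).
    """
    ranks = range(len(grads))
    order = sorted(ranks, key=lambda r: arrival_times[r])
    # The collective fires as soon as enough processes have arrived.
    trigger_time = arrival_times[order[min_contributors - 1]]
    ready = [r for r in ranks if arrival_times[r] <= trigger_time]
    total = [0.0] * len(grads[0])
    for r in ready:
        for i, g in enumerate(grads[r]):
            total[i] += g
    return total, ready, trigger_time


def solo_allreduce(grads, arrival_times):
    # Solo: the fastest process triggers the reduction on its own.
    return partial_allreduce(grads, arrival_times, 1)


def majority_allreduce(grads, arrival_times):
    # Majority: at least half of the participants must have arrived.
    return partial_allreduce(grads, arrival_times, math.ceil(len(grads) / 2))


def synchronous_allreduce(grads, arrival_times):
    # Baseline: wait for every process (global synchronization).
    return partial_allreduce(grads, arrival_times, len(grads))
```

With four processes, the solo variant finishes at the arrival time of the fastest process, the majority variant at the time of the second-fastest, and the synchronous baseline at the time of the slowest, which is where the speedup under load imbalance would come from in this simplified model.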
SparCML: High-Performance Sparse Communication for Machine Learning
Applying machine learning techniques to the quickly growing data in science
and industry requires highly-scalable algorithms. Large datasets are most
commonly processed "data parallel" distributed across many nodes. Each node's
contribution to the overall gradient is summed using a global allreduce. This
allreduce is the single communication and thus scalability bottleneck for most
machine learning workloads. We observe that frequently, many gradient values
are (close to) zero, leading to sparse or sparsifiable communications. To
exploit this insight, we analyze, design, and implement a set of
communication-efficient protocols for sparse input data, in conjunction with
efficient machine learning algorithms which can leverage these primitives. Our
communication protocols generalize standard collective operations, by allowing
processes to contribute arbitrary sparse input data vectors. Our generic
communication library, SparCML, extends MPI to support additional features,
such as non-blocking (asynchronous) operations and low-precision data
representations. As such, SparCML and its techniques will form the basis of
future highly-scalable machine learning frameworks.
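The idea of summing sparse contributions instead of dense vectors can be sketched as follows. This is a minimal illustration and not SparCML's API: the functions `sparsify` and `sparse_allreduce_sum`, the dictionary representation, and the `threshold` parameter are assumptions made for this example, and the reduction runs in a single process rather than over MPI.

```python
def sparsify(dense, threshold):
    """Keep only entries whose magnitude exceeds threshold,
    stored as {index: value}. The threshold is a hypothetical knob."""
    return {i: v for i, v in enumerate(dense) if abs(v) > threshold}


def sparse_allreduce_sum(contributions):
    """Sum a list of sparse vectors ({index: value} dicts).

    Only indices that appear in at least one contribution are touched,
    so the work scales with the number of nonzeros rather than with
    the full vector length.
    """
    total = {}
    for vec in contributions:
        for idx, val in vec.items():
            total[idx] = total.get(idx, 0.0) + val
    # Drop entries that cancelled out to keep the result sparse.
    return {i: v for i, v in total.items() if v != 0.0}
```

For example, each node could call `sparsify` on its local gradient and pass only the surviving index/value pairs to the reduction.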
Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?
Dense Multi-GPU systems have recently gained a lot of attention in the HPC
arena. Traditionally, MPI runtimes have been primarily designed for clusters
with a large number of nodes. However, with the advent of MPI+CUDA applications
and CUDA-Aware MPI runtimes like MVAPICH2 and OpenMPI, it has become important
to address efficient communication schemes for such dense Multi-GPU nodes. This,
coupled with new application workloads brought forward by Deep Learning
frameworks like Caffe and Microsoft CNTK, poses additional design constraints due
to very large message communication of GPU buffers during the training phase.
In this context, special-purpose libraries like NVIDIA NCCL have been proposed
for GPU-based collective communication on dense GPU systems. In this paper, we
propose a pipelined chain (ring) design for the MPI_Bcast collective operation
along with an enhanced collective tuning framework in MVAPICH2-GDR that enables
efficient intra-/inter-node multi-GPU communication. We present an in-depth
performance landscape for the proposed MPI_Bcast schemes along with a
comparative analysis of NVIDIA NCCL Broadcast and NCCL-based MPI_Bcast. The
proposed designs for MVAPICH2-GDR enable up to 14X and 16.6X improvement,
compared to NCCL-based solutions, for intra- and inter-node broadcast latency,
respectively. In addition, the proposed designs provide up to 7% improvement
over NCCL-based solutions for data parallel training of the VGG network on 128
GPUs using Microsoft CNTK.
Comment: 8 pages, 3 figures
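A pipelined chain broadcast of the kind named in this abstract can be modeled in a few lines. The sketch below is a toy step-counting simulation, not the MVAPICH2-GDR design: the function name, the `chunk_size` parameter, and the assumption that every rank forwards exactly one chunk to its successor per step are all hypothetical.

```python
def pipelined_chain_bcast(message, num_ranks, chunk_size):
    """Simulate a chain broadcast from rank 0 with chunked pipelining.

    Returns the message as reassembled on each rank and the number of
    communication steps. Assumption: in one step, each rank can forward
    one chunk it already holds to the next rank in the chain.
    """
    chunks = [message[i:i + chunk_size]
              for i in range(0, len(message), chunk_size)]
    received = [[] for _ in range(num_ranks)]
    received[0] = list(chunks)  # the root starts with the whole message
    steps = 0
    while any(len(r) < len(chunks) for r in received):
        # Walk from the tail so a chunk advances only one hop per step.
        for rank in range(num_ranks - 1, 0, -1):
            have = len(received[rank])
            if have < len(received[rank - 1]):
                received[rank].append(received[rank - 1][have])
        steps += 1
    return [b"".join(r) for r in received], steps
```

In this model, k chunks on p ranks take k + p - 2 steps, and each step moves only `chunk_size` bytes per link, so for large messages the chain keeps all links busy instead of sending the full buffer hop by hop.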
Scalable High Performance Message Passing over InfiniBand for Open MPI
InfiniBand (IB) is a popular network technology for modern high-performance computing systems. MPI implementations traditionally support IB using a reliable, connection-oriented (RC) transport. However, per-process resource usage that grows linearly with the number of processes makes this approach prohibitive for large-scale systems. IB provides an alternative in the form of a connectionless unreliable datagram transport (UD), which allows for near-constant resource usage and initialization overhead as the process count increases. This paper describes a UD-based implementation for IB in Open MPI as a scalable alternative to existing RC-based schemes. We use the software reliability capabilities of Open MPI to provide the guaranteed delivery semantics required by MPI. Results show that UD not only requires fewer resources at scale, but also allows for shorter MPI startup times. A connectionless model also improves performance for applications that tend to send small messages to many different processes.
Kilometer-scale climate models: Prospects and challenges
Currently major efforts are underway toward refining the horizontal resolution (or grid spacing) of climate models to about 1 km, using both global and regional climate models (GCMs and RCMs). Several groups have succeeded in conducting kilometer-scale multiweek GCM simulations and decadelong continental-scale RCM simulations. There is the well-founded hope that this increase in resolution represents a quantum jump in climate modeling, as it enables replacing the parameterization of moist convection by an explicit treatment. It is expected that this will improve the simulation of the water cycle and extreme events and reduce uncertainties in climate change projections. While kilometer-scale resolution is commonly employed in limited-area numerical weather prediction, enabling it on global scales for extended climate simulations requires a concerted effort. In this paper, we exploit an RCM that runs entirely on graphics processing units (GPUs) and show examples that highlight the prospects of this approach. A particular challenge addressed in this paper relates to the growth in output volumes. It is argued that the data avalanche of high-resolution simulations will make it impractical or impossible to store the data. Rather, repeating the simulation and conducting online analysis will become more efficient. A prototype of this methodology is presented. It makes use of a bit-reproducible model version that ensures reproducible simulations across hardware architectures, in conjunction with a data virtualization layer as a common interface for output analyses. An assessment of the potential of these novel approaches will be provided
Increased neutrophil-lymphocyte ratio is a poor prognostic factor in patients with primary operable and inoperable pancreatic cancer
Background:
The neutrophil-lymphocyte ratio (NLR) has been proposed as an indicator of systemic inflammatory response. Previous findings from small-scale studies revealed conflicting results about its independent prognostic significance with regard to different clinical end points in pancreatic cancer (PC) patients. Therefore, the aim of our study was the external validation of the prognostic significance of NLR in a large cohort of PC patients.
Methods:
Data from 371 consecutive PC patients, treated between 2004 and 2010 at a single centre, were evaluated retrospectively. The whole cohort was stratified into two groups according to the treatment modality. Group 1 comprised 261 patients with inoperable PC at diagnosis and group 2 comprised 110 patients with surgically resected PC. Cancer-specific survival (CSS) was assessed using the Kaplan–Meier method. To evaluate the independent prognostic significance of the NLR, the modified Glasgow prognostic score (mGPS) and the platelet-lymphocyte ratio, univariate and multivariate Cox regression models were applied.
Results:
Multivariate analysis identified increased NLR as an independent prognostic factor for inoperable PC patients (hazard ratio (HR)=2.53, confidence interval (CI)=1.64–3.91, P<0.001) and surgically resected PC patients (HR=1.61, CI=1.02–2.53, P=0.039). In inoperable PC patients, the mGPS was associated with poor CSS only in univariate analysis (HR=1.44, CI=1.04–1.98).
Conclusion:
Risk prediction for cancer-related end points using NLR does add independent prognostic information to other well-established prognostic factors in patients with PC, regardless of the therapeutic modality they undergo. Thus, the NLR should be considered for future individual risk assessment in patients with PC.
Report from the OECI Oncology Days 2014
The 2014 OECI Oncology Days was held at the ‘Prof. Dr. Ion Chiricuta’ Oncology Institute in Cluj, Romania, from 12 to 13 June. The focus of this year’s gathering was on developments in personalised medicine and other treatment advances which have made the cost of cancer care too high for many regions throughout Europe.
Outcomes in randomised controlled trials in prevention and management of carious lesions: a systematic review
Background:
Inconsistent outcome reporting is one significant hurdle to combining results from trials into systematic reviews. Core outcome sets (COS) can reduce this barrier. The aim of this review was to map outcomes reported in caries prevention and management randomised controlled trials (RCT) as a first step to COS development. We also investigated RCT characteristics and reporting of primary outcomes and sample size calculations.
Methods:
PubMed, Embase, Web of Knowledge and Cochrane CENTRAL were systematically searched (1 January 1968 to 25 August 2015). Inclusion criteria: RCTs comparing any technique for prevention or management of caries with another or placebo and RCTs comparing interventions to support patients undergoing treatment of caries (without setting, dentition or age restrictions). Categories were developed through piloting and group consensus and outcomes grouped accordingly.
Results:
Of 4773 search results, 764 were potentially relevant, full text was available for 731 papers and 605 publications met the inclusion criteria and were included. For all outcomes across the time periods 1968–1980 and 2001–2010, reporting of outcome ‘caries experience’ reduced from 39% to 18%; ‘clinical performance of the restoration’ reporting increased from 33% to 42% although there was a reduction to 22% in 2011–2015. Emerging outcome domains include ‘lesion activity’ and ‘pulp health-related outcomes’, accounting for 1% and 0%, respectively, during 1968–1980 and 10% and 4% for 2011–2015. Reporting of ‘resource efficiency’ and ‘quality of life measures’ has remained at a low level. No publications reported tooth survival independent of an index such as DMFT or equivalent. Primary outcomes were only identified as such in 414 (68%) of the reports.
Conclusions:
Over the past 50 years, outcome reporting for trials on prevention and management of carious lesions has tended to focus on outcomes measuring caries experience and restoration material clinical performance, with lesion activity and cost-effectiveness increasingly being reported. Patient-reported and patient-focused outcomes are becoming more common (although as secondary outcomes) but remain low in use. The challenge with developing a COS will be balancing commonly previously reported outcomes against those more relevant for the future.
Trial registration:
PROSPERO, CRD42015025310. Registered on 14 August 2015, Trials (Schwendicke et al., Trials 16:397, 2015) and COMET initiative online (COMET, 2017).