
    Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations

    Load imbalance pervasively exists in distributed deep learning training systems, caused either by the inherent imbalance in the learned tasks or by the system itself. Traditional synchronous Stochastic Gradient Descent (SGD) achieves good accuracy for a wide variety of tasks, but relies on global synchronization to accumulate the gradients at every training step. In this paper, we propose eager-SGD, which relaxes the global synchronization for decentralized accumulation. To implement eager-SGD, we propose two partial collectives: solo and majority. With solo allreduce, the faster processes contribute their gradients eagerly without waiting for the slower processes, whereas with majority allreduce, at least half of the participants must contribute gradients before continuing, all without using a central parameter server. We theoretically prove the convergence of the algorithms and describe the partial collectives in detail. Experimental results on load-imbalanced environments (CIFAR-10, ImageNet, and UCF101 datasets) show that eager-SGD achieves a 1.27x speedup over state-of-the-art synchronous SGD without losing accuracy. Comment: Published in Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'20), pp. 45-61, 2020.
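
A minimal sketch of the semantics the abstract describes, not the authors' implementation: a partial allreduce proceeds once a quorum of processes has contributed, substituting stale gradients for the ranks that arrive late. The function and variable names are illustrative; solo corresponds to a quorum of one, majority to a quorum of at least half the participants.

```python
# Hedged sketch of partial-collective semantics (not the eager-SGD code).
import numpy as np

def partial_allreduce(fresh, stale, quorum):
    """fresh: dict rank -> gradient for ranks that arrived in time.
    stale: dict rank -> last gradient seen from every rank.
    quorum: minimum number of fresh contributions required."""
    assert len(fresh) >= quorum, "collective would still be blocking"
    # Use fresh contributions where available, stale ones elsewhere.
    merged = {r: fresh.get(r, g) for r, g in stale.items()}
    return sum(merged.values()) / len(merged)

P = 4
stale = {r: np.zeros(3) for r in range(P)}                      # step t-1 gradients
fresh = {0: np.ones(3), 1: 2 * np.ones(3), 2: 3 * np.ones(3)}   # rank 3 is late

solo = partial_allreduce(fresh, stale, quorum=1)                # proceeds eagerly
majority = partial_allreduce(fresh, stale, quorum=(P + 1) // 2) # at least half
print(solo, majority)
```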

    SparCML: High-Performance Sparse Communication for Machine Learning

    Applying machine learning techniques to the quickly growing data in science and industry requires highly scalable algorithms. Large datasets are most commonly processed "data parallel", i.e., distributed across many nodes. Each node's contribution to the overall gradient is summed using a global allreduce. This allreduce is the single communication and thus scalability bottleneck for most machine learning workloads. We observe that, frequently, many gradient values are (close to) zero, leading to sparse or sparsifiable communications. To exploit this insight, we analyze, design, and implement a set of communication-efficient protocols for sparse input data, in conjunction with efficient machine learning algorithms which can leverage these primitives. Our communication protocols generalize standard collective operations by allowing processes to contribute arbitrary sparse input data vectors. Our generic communication library, SparCML, extends MPI to support additional features, such as non-blocking (asynchronous) operations and low-precision data representations. As such, SparCML and its techniques will form the basis of future highly scalable machine learning frameworks.
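
As a rough illustration of the core idea (not SparCML's actual protocol or API), the sketch below sums sparse gradients held as index-value pairs using a recursive-doubling exchange schedule, so the data volume tracks the number of nonzeros rather than the dense vector length. It assumes a power-of-two process count and simulates all ranks in one Python process.

```python
# Hedged sketch of a sparse allreduce over (index, value) pairs.
def sparse_sum(a, b):
    out = dict(a)
    for idx, val in b.items():
        out[idx] = out.get(idx, 0.0) + val
    return out

def sparse_allreduce(contributions):
    """contributions: list of dicts, one sparse gradient per process."""
    p = len(contributions)
    assert p & (p - 1) == 0, "power-of-two ranks for this simple schedule"
    bufs = list(contributions)
    dist = 1
    while dist < p:
        # In round d, rank r exchanges and merges with rank r XOR d.
        bufs = [sparse_sum(bufs[r], bufs[r ^ dist]) for r in range(p)]
        dist *= 2
    return bufs[0]   # every rank now holds the same reduced vector

grads = [{0: 0.5, 7: -1.0}, {7: 2.0}, {3: 1.0}, {0: 0.25, 3: -1.0}]
print(sparse_allreduce(grads))   # {0: 0.75, 7: 1.0, 3: 0.0}
```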

    Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?

    Dense multi-GPU systems have recently gained a lot of attention in the HPC arena. Traditionally, MPI runtimes have been primarily designed for clusters with a large number of nodes. However, with the advent of MPI+CUDA applications and CUDA-aware MPI runtimes like MVAPICH2 and OpenMPI, it has become important to address efficient communication schemes for such dense multi-GPU nodes. This, coupled with new application workloads brought forward by Deep Learning frameworks like Caffe and Microsoft CNTK, poses additional design constraints due to the very large GPU-buffer messages communicated during the training phase. In this context, special-purpose libraries like NVIDIA NCCL have been proposed for GPU-based collective communication on dense GPU systems. In this paper, we propose a pipelined chain (ring) design for the MPI_Bcast collective operation, along with an enhanced collective tuning framework in MVAPICH2-GDR that enables efficient intra-/inter-node multi-GPU communication. We present an in-depth performance landscape for the proposed MPI_Bcast schemes, along with a comparative analysis of NVIDIA NCCL Broadcast and NCCL-based MPI_Bcast. The proposed designs for MVAPICH2-GDR enable up to 14X and 16.6X improvement over NCCL-based solutions for intra- and inter-node broadcast latency, respectively. In addition, the proposed designs provide up to a 7% improvement over NCCL-based solutions for data-parallel training of the VGG network on 128 GPUs using Microsoft CNTK. Comment: 8 pages, 3 figures.
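
To illustrate why a pipelined chain helps for the large messages mentioned above, here is a hedged, single-process simulation (not the MVAPICH2-GDR code): the root splits the buffer into chunks and streams them down the chain, overlapping forwarding with receipt, so the broadcast completes in roughly C + P - 2 chunk-hops rather than the C * (P - 1) of an unpipelined chain.

```python
# Toy simulation of a pipelined chain broadcast.
def pipelined_chain_bcast(message, num_procs, chunk_size):
    chunks = [message[i:i + chunk_size] for i in range(0, len(message), chunk_size)]
    received = [[] for _ in range(num_procs)]
    received[0] = list(chunks)                  # rank 0 is the root
    hops = 0
    while any(len(r) < len(chunks) for r in received):
        # In one time step every rank forwards its next pending chunk
        # to its right neighbor (iterating high to low models one hop/step).
        for rank in range(num_procs - 2, -1, -1):
            nxt = len(received[rank + 1])
            if nxt < len(received[rank]):
                received[rank + 1].append(received[rank][nxt])
        hops += 1
    return hops

msg = bytes(range(64))
print(pipelined_chain_bcast(msg, num_procs=4, chunk_size=8))  # 8 + 4 - 2 = 10 hops
```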

    Kilometer-scale climate models: Prospects and challenges

    Currently, major efforts are underway toward refining the horizontal resolution (or grid spacing) of climate models to about 1 km, using both global and regional climate models (GCMs and RCMs). Several groups have succeeded in conducting kilometer-scale multiweek GCM simulations and decade-long continental-scale RCM simulations. There is well-founded hope that this increase in resolution represents a quantum jump in climate modeling, as it enables replacing the parameterization of moist convection with an explicit treatment. It is expected that this will improve the simulation of the water cycle and of extreme events, and reduce uncertainties in climate change projections. While kilometer-scale resolution is commonly employed in limited-area numerical weather prediction, enabling it on global scales for extended climate simulations requires a concerted effort. In this paper, we exploit an RCM that runs entirely on graphics processing units (GPUs) and show examples that highlight the prospects of this approach. A particular challenge addressed in this paper relates to the growth in output volumes. It is argued that the data avalanche of high-resolution simulations will make it impractical or impossible to store the data; rather, repeating the simulation and conducting online analysis will become more efficient. A prototype of this methodology is presented. It makes use of a bit-reproducible model version that ensures reproducible simulations across hardware architectures, in conjunction with a data virtualization layer that serves as a common interface for output analyses. An assessment of the potential of these novel approaches is provided.
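
The online-analysis idea can be sketched in a few lines: rather than writing every time step to disk, a diagnostic is updated incrementally as fields stream out of the (re-runnable, seeded and hence reproducible) simulation. The toy field generator below merely stands in for the RCM; everything about it is an illustrative assumption.

```python
# Hedged sketch of online analysis: only the running statistics are stored.
import numpy as np

def model_steps(shape, nsteps, seed=0):
    rng = np.random.default_rng(seed)        # seeded, hence reproducible
    for _ in range(nsteps):
        yield rng.standard_normal(shape)     # one output field per time step

shape, nsteps = (180, 360), 1000
running_mean = np.zeros(shape)
running_max = np.full(shape, -np.inf)
for t, field in enumerate(model_steps(shape, nsteps), start=1):
    running_mean += (field - running_mean) / t     # incremental (Welford) mean
    np.maximum(running_max, field, out=running_max)

print(running_mean.mean(), running_max.max())
```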

    Increased neutrophil-lymphocyte ratio is a poor prognostic factor in patients with primary operable and inoperable pancreatic cancer

    Background: The neutrophil-lymphocyte ratio (NLR) has been proposed as an indicator of the systemic inflammatory response. Previous findings from small-scale studies revealed conflicting results about its independent prognostic significance with regard to different clinical end points in pancreatic cancer (PC) patients. Therefore, the aim of our study was the external validation of the prognostic significance of the NLR in a large cohort of PC patients. Methods: Data from 371 consecutive PC patients, treated between 2004 and 2010 at a single centre, were evaluated retrospectively. The whole cohort was stratified into two groups according to treatment modality: group 1 comprised 261 patients with inoperable PC at diagnosis, and group 2 comprised 110 patients with surgically resected PC. Cancer-specific survival (CSS) was assessed using the Kaplan–Meier method. To evaluate the independent prognostic significance of the NLR, the modified Glasgow prognostic score (mGPS), and the platelet-lymphocyte ratio, univariate and multivariate Cox regression models were applied. Results: Multivariate analysis identified an increased NLR as an independent prognostic factor for inoperable PC patients (hazard ratio (HR)=2.53, confidence interval (CI)=1.64–3.91, P<0.001) and surgically resected PC patients (HR=1.61, CI=1.02–2.53, P=0.039). In inoperable PC patients, the mGPS was associated with poor CSS only in univariate analysis (HR=1.44, CI=1.04–1.98). Conclusion: Risk prediction for cancer-related end points using the NLR adds independent prognostic information to other well-established prognostic factors in patients with PC, regardless of treatment modality. Thus, the NLR should be considered in future individual risk assessment of patients with PC.
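
For readers unfamiliar with the methodology, a minimal sketch of this kind of analysis follows, using the lifelines survival-analysis package for Python. The column names, the toy data, and the NLR cutoff of 5 are illustrative assumptions, not values taken from the study.

```python
# Hedged sketch: compute NLR = neutrophils / lymphocytes and test it in a
# Cox proportional-hazards model for cancer-specific survival.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "neutrophils": [6.1, 3.2, 8.0, 4.5, 7.3, 2.9],  # toy counts (10^9/L)
    "lymphocytes": [1.1, 1.9, 0.8, 1.6, 1.0, 2.1],
    "months":      [4, 30, 2, 18, 6, 41],           # follow-up time
    "died_of_pc":  [1, 0, 1, 1, 0, 1],              # cancer-specific event
})
# Hypothetical dichotomization cutoff for illustration only.
df["nlr_high"] = (df["neutrophils"] / df["lymphocytes"] >= 5).astype(int)

cph = CoxPHFitter()
cph.fit(df[["months", "died_of_pc", "nlr_high"]],
        duration_col="months", event_col="died_of_pc")
cph.print_summary()   # reports the hazard ratio for the high-NLR group
```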

    Report from the OECI Oncology Days 2014

    The 2014 OECI Oncology Days were held at the ‘Prof. Dr. Ion Chiricuta’ Oncology Institute in Cluj, Romania, from 12 to 13 June. The focus of this year’s gathering was on developments in personalised medicine and on other treatment advances that have made the cost of cancer care too high for many regions throughout Europe.

    Structured Digital Self-Assessment of Patient Anamnesis Prior to Computed Tomography: Performance Evaluation and Added Value

    The aim of this study was to evaluate the performance of a tablet-based, digitized structured self-assessment (DSSA) of patient anamnesis (PA) prior to computed tomography (CT). Of the 317 patients consecutively referred for CT, the majority (n = 294) were able to complete the tablet-based questionnaire, which consisted of 67 items covering social anamnesis, lifestyle factors (e.g., tobacco abuse), medical history (e.g., kidney diseases), current symptoms, and the usability of the system. Patients were able to mark unclear questions for subsequent discussion with the radiologist. Critical issues for the CT examination were structured and automatically highlighted as “red flags” (RFs) in order to improve patient interaction. RFs and marked questions were highly prevalent (69.5% and 26%, respectively). Missing creatinine values (33.3%), kidney diseases (14.4%), thyroid diseases (10.6%), metformin intake (5.5%), claustrophobia (4.1%), allergic reactions to contrast agents (2.4%), and pathological TSH values (2.0%) were highlighted most frequently as RFs. Patient feedback regarding the comprehensibility of the questionnaire and the usability of the tablet was mainly positive (90.9% and 86.2%, respectively). With advanced age, however, patients provided more negative feedback on both (p = 0.007; p = 0.039). The time effort was less than 20 min for 85.1% of patients, and faster patients were significantly younger (p = 0.046). Overall, the DSSA of PA prior to CT shows a high success rate and is well accepted by most patients. RFs and marked questions were common and helped to focus patient interactions and reporting on decisive aspects.
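
A minimal sketch of how such automatic “red flag” highlighting might work: declarative rules are evaluated over the structured questionnaire answers, and every hit is surfaced before the examination. The answer keys and rules below are hypothetical, modeled only on the items listed in the abstract.

```python
# Hedged sketch of declarative red-flag rules over questionnaire answers.
RED_FLAG_RULES = {
    "missing creatinine value": lambda a: a.get("creatinine_umol_l") is None,
    "known kidney disease":     lambda a: a.get("kidney_disease", False),
    "metformin intake":         lambda a: a.get("takes_metformin", False),
    "contrast agent allergy":   lambda a: a.get("contrast_allergy", False),
    "claustrophobia":           lambda a: a.get("claustrophobia", False),
}

def red_flags(answers):
    """Return the names of all rules that fire for one patient's answers."""
    return [name for name, rule in RED_FLAG_RULES.items() if rule(answers)]

patient = {
    "creatinine_umol_l": None,     # lab value not yet available
    "kidney_disease": False,
    "takes_metformin": True,
    "contrast_allergy": False,
    "claustrophobia": False,
}
print(red_flags(patient))  # ['missing creatinine value', 'metformin intake']
```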