
42 research outputs found

CoCoA: A General Framework for Communication-Efficient Distributed Optimization

Author: Forte Simone
Jaggi Martin
Jordan Michael I.
Ma Chenxin
Smith Virginia
Takac Martin
Publication date: 21/06/2017

The scale of modern datasets necessitates the development of efficient distributed optimization methods for machine learning. We present a general-purpose framework for distributed computing environments, CoCoA, that has an efficient communication scheme and is applicable to a wide variety of problems in machine learning and signal processing. We extend the framework to cover general non-strongly-convex regularizers, including L1-regularized problems like lasso, sparse logistic regression, and elastic net regularization, and show how earlier work can be derived as a special case. We provide convergence guarantees for the class of convex regularized loss minimization objectives, leveraging a novel approach in handling non-strongly-convex regularizers and non-smooth loss functions. The resulting framework has markedly improved performance over state-of-the-art methods, as we illustrate with an extensive set of experiments on real distributed datasets

arXiv.org e-Print Archive

Infoscience - École polytechnique fédérale de Lausanne

Repository for Publications and Research Data

Mixing multi-core CPUs and GPUs for scientific simulation software

Author: Hawick K.A.
Leist A.
Playne D.P.
Publication venue: 'Massey University'
Publication date: 01/01/2010

Recent technological and economic developments have led to widespread availability of multi-core CPUs and specialist accelerator processors such as graphical processing units (GPUs). The accelerated computational performance possible from these devices can be very high for some applications paradigms. Software languages and systems such as NVIDIA's CUDA and Khronos consortium's open compute language (OpenCL) support a number of individual parallel application programming paradigms. To scale up the performance of some complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and very many core GPUs for data parallelism is necessary. We describe our use of hybrid applications using threading approaches and multi-core CPUs to control independent GPU devices. We present speed-up data and discuss multi-threading software issues for the applications level programmer and offer some suggested areas for language development and integration between coarse-grained and fine-grained multi-thread systems. We discuss results from three common simulation algorithmic areas including: partial differential equations; graph cluster metric calculations and random number generation. We report on programming experiences and selected performance for these algorithms on: single and multiple GPUs; multi-core CPUs; a CellBE; and using OpenCL. We discuss programmer usability issues and the outlook and trends in multi-core programming for scientific applications developers

Massey Research Online

Optimizing NEURON Simulation Environment Using Remote Memory Access with Recursive Doubling on Distributed Memory Systems

Author: Danish Shehzad
Zeki Bozkuş
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2016

The increasing complexity of neuronal network models has intensified efforts to make the NEURON simulation environment more efficient. Computational neuroscientists divide the equations into subnets across multiple processors to achieve better hardware performance. On parallel machines, interprocessor spike exchange consumes a large share of the overall simulation time for neuronal networks. NEURON uses the Message Passing Interface (MPI) for communication between processors, and the MPI_Allgather collective is used to exchange spikes after each interval across distributed memory systems. Increasing the number of processors yields more concurrency and better performance, but it also increases the communication time of MPI_Allgather. This necessitates an improved communication methodology that reduces spike exchange time on distributed memory systems. This work improves on the MPI_Allgather method using Remote Memory Access (RMA), moving from two-sided to one-sided communication, and uses a recursive doubling mechanism to achieve efficient communication between the processors in precise steps. This approach enhances communication concurrency and improves overall runtime, making NEURON more efficient for simulating large neuronal network models
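
The recursive doubling pattern named in this abstract can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions: it uses no MPI and does not model RMA or one-sided puts, it assumes a power-of-two number of ranks, and the function name `recursive_doubling_allgather` is hypothetical, not taken from the paper or from NEURON. In each step, rank r exchanges everything it currently holds with rank r XOR 2^k, so every rank holds all blocks after log2(P) steps.

```python
def recursive_doubling_allgather(blocks):
    """Simulate an allgather over len(blocks) ranks using recursive doubling.

    blocks[r] is the data initially owned by rank r (hypothetical stand-in
    for a rank's spike buffer). Returns (gathered, steps), where gathered[r]
    is the full list of blocks as seen by rank r.
    """
    p = len(blocks)
    if p == 0 or p & (p - 1):
        # Assumption of this sketch: the rank count is a power of two.
        raise ValueError("this sketch assumes a power-of-two number of ranks")

    # held[r] maps source rank -> block currently known to rank r.
    held = [{r: blocks[r]} for r in range(p)]
    steps = 0
    dist = 1
    while dist < p:
        # Snapshot so each rank sends what it held at the start of the step.
        snapshot = [dict(h) for h in held]
        for r in range(p):
            partner = r ^ dist          # partner differs in exactly one bit
            held[r].update(snapshot[partner])
        dist *= 2                       # amount of exchanged data doubles
        steps += 1

    gathered = [[h[src] for src in range(p)] for h in held]
    return gathered, steps


if __name__ == "__main__":
    data, n_steps = recursive_doubling_allgather(["s0", "s1", "s2", "s3"])
    print(n_steps, data[0])
```

The point of the pattern is that the number of communication steps grows logarithmically with the number of ranks, while the amount of data moved per step doubles.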

Crossref

Directory of Open Access Journals

PubMed Central

QSGD: communication-efficient SGD via gradient quantization and encoding

Author: Alistarh Dan
Grubic Demjan
Li Jerry Z.
Tomioka Ryota
Vojnovic Milan
Publication venue: 'Center for Open Science'
Publication date: 06/12/2017

Parallel implementations of stochastic gradient descent (SGD) have received significant research attention, thanks to its excellent scalability properties. A fundamental barrier when parallelizing SGD is the high bandwidth cost of communicating gradient updates between nodes; consequently, several lossy compression heuristics have been proposed, by which nodes only communicate quantized gradients. Although effective in practice, these heuristics do not always converge. In this paper, we propose Quantized SGD (QSGD), a family of compression schemes with convergence guarantees and good practical performance. QSGD allows the user to smoothly trade off communication bandwidth and convergence time: nodes can adjust the number of bits sent per iteration, at the cost of possibly higher variance. We show that this trade-off is inherent, in the sense that improving it past some threshold would violate information-theoretic lower bounds. QSGD guarantees convergence for convex and non-convex objectives, under asynchrony, and can be extended to stochastic variance-reduced techniques. When applied to training deep neural networks for image classification and automated speech recognition, QSGD leads to significant reductions in end-to-end training time. For instance, on 16 GPUs, we can train the ResNet-152 network to full accuracy on ImageNet 1.8× faster than the full-precision variant
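
The bandwidth/variance trade-off described in this abstract can be illustrated with a minimal sketch of unbiased stochastic quantization. It is written in the spirit of the scheme the abstract describes, not as the authors' implementation: the function name `quantize`, the parameter `s` (number of quantization levels), and the use of the Euclidean norm as the scale are assumptions of this sketch, and the encoding step mentioned in the title is omitted. Each coordinate is rounded to one of `s` uniform levels, rounding up with probability equal to the fractional part so that the result equals the input in expectation.

```python
import math
import random


def quantize(v, s, rng):
    """Stochastically round each coordinate of v onto s uniform levels.

    Fewer levels (smaller s) need fewer bits per coordinate but add more
    variance; the expectation of the output equals v.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0] * len(v)

    out = []
    for x in v:
        # Position of |x| on the grid {0, 1/s, ..., 1} scaled to [0, s].
        scaled = min(abs(x) / norm * s, s)
        lower = math.floor(scaled)
        # Round up with probability equal to the fractional part (unbiased).
        level = lower + (1 if rng.random() < scaled - lower else 0)
        out.append(math.copysign(norm * level / s, x))
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    grad = [0.3, -0.4, 0.0, 1.2]   # hypothetical gradient vector
    print(quantize(grad, 4, rng))
```

A sender would then transmit the norm, the signs, and the small integer levels instead of full-precision floats; how those are encoded is outside the scope of this sketch.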

arXiv.org e-Print Archive

LSE Research Online

Comparative Investigation for Energy Consumption of Different Chipsets Based on Scheduling for Wireless Sensor Networks

Author: Monica
Sharma Ajay K
Publication venue: 'Academy and Industry Research Collaboration Center (AIRCC)'
Publication date: 23/09/2010

Rapid progress in microelectromechanical system (MEMS) and radio frequency (RF) design has enabled the development of low-power, inexpensive, and network-enabled microsensors. These sensor nodes are capable of capturing various physical information, such as temperature, pressure, motion of an object, etc., as well as mapping such physical characteristics of the environment to quantitative measurements. A typical wireless sensor network (WSN) consists of hundreds to thousands of such sensor nodes linked by a wireless medium. In this paper, we present a comparative investigation of energy consumption for a few commercially available chipsets such as TR1001, CC1000 and CC1010 based on different scheduling methods for two types of deployment strategies. We conducted our experiment within the OMNeT++ simulator. Comment: 17 pages, Based on scheduling for Wireless Sensor Network

arXiv.org e-Print Archive

CiteSeerX

Crossref

Offloading for Mobile Device Performance Improvement

Author: Parkkila Janne
Porras Jari
Temesgene Dagnachew
Publication venue: AIS Electronic Library (AISeL)
Publication date: 01/01/2019

Mobile devices are increasingly becoming part of everyday life. These include smart phones, tablets, wearable devices, etc. Due to their mobility aspect, they are always constrained in their size and weight, which limits their resource capacity, e.g. processing power and battery life. One possible solution for augmentation of such resource-constrained devices is through efficient usage of their surrounding resources, i.e. using some offloading technique. This paper studies how offloading of tasks to the surrounding resources affects both the performance of task execution and the battery life of the mobile device. Two mobile phones and two tablets (from two different manufacturers) are studied in the experiments to find out the impact of the device characteristics. Two computationally demanding tasks, namely image processing and encryption/decryption, are used in these experiments. These results are compared to our earlier results on mobile devices of previous generations. We assumed that the increased computing power of new devices would make offloading obsolete. Our results show gains both in energy saving and in computational performance with these mobile devices. The comparison to our earlier results shows that the performance increase of newer mobile device generations has not diminished the benefits of offloading. These results are in line with results presented in literature and they show that offloading could offer a viable approach for resource augmentation of mobile devices towards edge/fog resources emphasized by the new 5G technology

Crossref

ScholarSpace at University of Hawai'i at Manoa

AIS Electronic Library (AISeL)