CoCoA: A General Framework for Communication-Efficient Distributed Optimization
The scale of modern datasets necessitates the development of efficient
distributed optimization methods for machine learning. We present a
general-purpose framework for distributed computing environments, CoCoA, that
has an efficient communication scheme and is applicable to a wide variety of
problems in machine learning and signal processing. We extend the framework to
cover general non-strongly-convex regularizers, including L1-regularized
problems like lasso, sparse logistic regression, and elastic net
regularization, and show how earlier work can be derived as a special case. We
provide convergence guarantees for the class of convex regularized loss
minimization objectives, leveraging a novel approach in handling
non-strongly-convex regularizers and non-smooth loss functions. The resulting
framework has markedly improved performance over state-of-the-art methods, as
we illustrate with an extensive set of experiments on real distributed
datasets
Mixing multi-core CPUs and GPUs for scientific simulation software
Recent technological and economic developments have led to widespread availability of
multi-core CPUs and specialist accelerator processors such as graphical processing units
(GPUs). The accelerated computational performance possible from these devices can be very
high for some application paradigms. Software languages and systems such as NVIDIA's
CUDA and the Khronos consortium's Open Computing Language (OpenCL) support a number of
individual parallel application programming paradigms. To scale up the performance of some
complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and
very many core GPUs for data parallelism is necessary. We describe our use of hybrid applications
using threading approaches and multi-core CPUs to control independent GPU devices.
We present speed-up data and discuss multi-threading software issues for the applications
level programmer and offer some suggested areas for language development and integration
between coarse-grained and fine-grained multi-thread systems. We discuss results from three
common simulation algorithmic areas including: partial differential equations; graph cluster
metric calculations and random number generation. We report on programming experiences
and selected performance for these algorithms on: single and multiple GPUs; multi-core CPUs;
a CellBE; and using OpenCL. We discuss programmer usability issues and the outlook and
trends in multi-core programming for scientific applications developers
Optimizing NEURON Simulation Environment Using Remote Memory Access with Recursive Doubling on Distributed Memory Systems
The increasing complexity of neuronal network models has escalated efforts to make the NEURON simulation environment efficient. Computational neuroscientists divide the equations into subnets across multiple processors to achieve better hardware performance. On parallel machines, interprocessor spike exchange consumes a large portion of the overall simulation time for neuronal networks. NEURON uses the Message Passing Interface (MPI) for communication between processors, and the MPI_Allgather collective is used for spike exchange after each interval across distributed memory systems. Increasing the number of processors yields more concurrency and better performance, but it adversely affects MPI_Allgather, increasing the communication time between processors. This necessitates improving the communication methodology to decrease the spike exchange time on distributed memory systems. This work improves the MPI_Allgather method using Remote Memory Access (RMA), moving from two-sided to one-sided communication, and uses a recursive doubling mechanism to achieve efficient communication between the processors in precise steps. This approach enhanced communication concurrency and improved overall runtime, making NEURON more efficient for simulating large neuronal network models
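The recursive doubling pattern named in the abstract above can be illustrated with a small single-process simulation. This is a hedged sketch, not the authors' RMA implementation: it makes no MPI calls, it assumes a power-of-two number of ranks, and the function name `recursive_doubling_allgather` is hypothetical. In each step, every rank exchanges everything it currently holds with the rank whose index differs in one bit, so the amount of gathered data doubles per step.

```python
def recursive_doubling_allgather(local_items):
    """Simulate an allgather among p ranks using recursive doubling.

    local_items[r] is the item initially held by rank r.
    Returns (buffers, steps), where buffers[r] maps source rank -> item.
    Assumption: p is a power of two (the textbook form of the pattern).
    """
    p = len(local_items)
    if p == 0 or p & (p - 1):
        raise ValueError("this sketch assumes a power-of-two number of ranks")

    # Each rank starts with only its own contribution.
    buffers = [{rank: item} for rank, item in enumerate(local_items)]
    steps = 0
    distance = 1
    while distance < p:
        # Snapshot so every exchange in this step sees pre-step data,
        # as it would if all ranks communicated simultaneously.
        snapshot = [dict(buf) for buf in buffers]
        for rank in range(p):
            partner = rank ^ distance  # flip one bit of the rank index
            buffers[rank].update(snapshot[partner])
        distance *= 2
        steps += 1
    return buffers, steps


if __name__ == "__main__":
    gathered, steps = recursive_doubling_allgather(["s0", "s1", "s2", "s3"])
    print(steps, sorted(gathered[0].items()))
```

Under the power-of-two assumption, the loop runs log2(p) times, so 8 simulated ranks finish in 3 steps. A real one-sided version would replace the dictionary update with remote writes into a window exposed by the partner rank, which this sketch does not attempt to model.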
QSGD: communication-efficient SGD via gradient quantization and encoding
Parallel implementations of stochastic gradient descent (SGD) have received significant research attention, thanks to its excellent scalability properties. A fundamental barrier when parallelizing SGD is the high bandwidth cost of communicating gradient updates between nodes; consequently, several lossy compression heuristics have been proposed, by which nodes only communicate quantized gradients. Although effective in practice, these heuristics do not always converge. In this paper, we propose Quantized SGD (QSGD), a family of compression schemes with convergence guarantees and good practical performance. QSGD allows the user to smoothly trade off communication bandwidth and convergence time: nodes can adjust the number of bits sent per iteration, at the cost of possibly higher variance. We show that this trade-off is inherent, in the sense that improving it past some threshold would violate information-theoretic lower bounds. QSGD guarantees convergence for convex and non-convex objectives, under asynchrony, and can be extended to stochastic variance-reduced techniques. When applied to training deep neural networks for image classification and automated speech recognition, QSGD leads to significant reductions in end-to-end training time. For instance, on 16 GPUs, we can train the ResNet-152 network to full accuracy on ImageNet 1.8× faster than the full-precision variant
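The bandwidth versus variance trade-off described above can be made concrete with a minimal sketch of unbiased stochastic quantization. The abstract does not spell out the scheme, so the details here are assumptions for illustration: the function name `stochastic_quantize`, the `levels` parameter, and the choice of uniformly spaced levels scaled by the Euclidean norm are all hypothetical, and the encoding step mentioned in the title is omitted. Fewer levels mean fewer bits per coordinate but larger rounding variance.

```python
import math
import random


def stochastic_quantize(grad, levels, rng=random):
    """Randomly round each coordinate to one of levels + 1 magnitudes.

    Magnitudes are multiples of norm / levels in [0, norm]. Rounding up
    happens with probability equal to the fractional position, so the
    expected value of each output coordinate equals the input coordinate.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)

    quantized = []
    for g in grad:
        scaled = abs(g) / norm * levels  # position in [0, levels]
        lower = math.floor(scaled)
        # Round up with probability equal to the fractional part.
        round_up = rng.random() < (scaled - lower)
        level = lower + (1 if round_up else 0)
        quantized.append(math.copysign(norm * level / levels, g))
    return quantized


if __name__ == "__main__":
    rng = random.Random(0)
    print(stochastic_quantize([0.3, -0.4, 0.0, 1.2], levels=4, rng=rng))
```

After quantization, a node only needs to transmit the norm plus a sign and a small integer level per coordinate, which is where the bandwidth saving would come from in this sketch.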
Comparative Investigation for Energy Consumption of Different Chipsets Based on Scheduling for Wireless Sensor Networks
Rapid progress in microelectromechanical system (MEMS) and radio frequency
(RF) design has enabled the development of low-power, inexpensive, and
network-enabled microsensors. These sensor nodes are capable of capturing
various physical information, such as temperature, pressure, motion of an
object, etc., as well as mapping such physical characteristics of the environment
to quantitative measurements. A typical wireless sensor network (WSN) consists
of hundreds to thousands of such sensor nodes linked by a wireless medium. In
this paper, we present a comparative investigation of energy consumption for
a few commercially available chipsets such as TR1001, CC1000 and CC1010 based on
different scheduling methods for two types of deployment strategies. We
conducted our experiment within the OMNeT++ simulator.
Comment: 17 pages, Based on scheduling for Wireless Sensor Network
Offloading for Mobile Device Performance Improvement
Mobile devices are increasingly becoming part of everyday life. These include smartphones, tablets, wearable devices, etc. Because they must be mobile, they are always constrained in size and weight, which limits their resource capacity, e.g. processing power and battery life. One possible way to augment such resource-constrained devices is efficient use of their surrounding resources, i.e. using some offloading technique. This paper studies how offloading tasks to the surrounding resources affects both the performance of task execution and the battery life of the mobile device. Two mobile phones and two tablets (from two different manufacturers) are studied in the experiments to determine the impact of the device characteristics. Two computationally demanding tasks, namely image processing and encryption/decryption, are used in these experiments. These results are compared to our earlier results on mobile devices of previous generations. We assumed that the increased computing power of new devices would make offloading obsolete. Our results show gains both in energy saving and in computational performance with these mobile devices. The comparison to our earlier results shows that the performance increase of newer mobile device generations has not diminished the benefits of offloading. These results are in line with results presented in the literature, and they show that offloading could offer a viable approach for resource augmentation of mobile devices towards the edge/fog resources emphasized by the new 5G technology