Genuinely Distributed Byzantine Machine Learning
Machine Learning (ML) solutions are nowadays distributed, following the
so-called server/worker architecture: one server holds the model parameters
while several workers train the model. Such an architecture is prone to
various types of component failures, all of which can be encompassed within
the spectrum of Byzantine behavior. Several approaches have been proposed
recently to tolerate Byzantine workers, yet all of them require trusting a
central parameter server. In this paper, we initiate the study of the
"general" Byzantine-resilient distributed machine learning problem, where no
individual component is trusted.
We show that this problem can be solved in an asynchronous system, despite
the presence of Byzantine parameter servers and Byzantine workers (which is
optimal). We present a new algorithm, ByzSGD, which solves the general
Byzantine-resilient distributed machine learning problem by relying on three
major schemes. The first, Scatter/Gather, is a communication scheme whose
goal is to bound the maximum drift among models on correct servers. The
second, Distributed Median Contraction (DMC), leverages the geometric
properties of the median in high-dimensional spaces to bring the parameters
on the correct servers back close to each other, ensuring learning
convergence. The third, Minimum-Diameter Averaging (MDA), is a
statistically-robust gradient aggregation rule whose goal is to tolerate
Byzantine workers. MDA requires a looser bound on the variance of
non-Byzantine gradient estimates than existing alternatives (e.g., Krum).
Interestingly, ByzSGD ensures Byzantine resilience without adding communication
rounds (on a normal path), compared to vanilla non-Byzantine alternatives.
ByzSGD requires, however, a larger number of messages, which, we show, can be
reduced if we assume synchrony.
Comment: This is a merge of arXiv:1905.03853 and arXiv:1911.07537;
arXiv:1911.07537 will be retracted.
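To make the aggregation rule named above concrete, here is a minimal Python sketch of Minimum-Diameter Averaging under our own assumptions (brute-force subset search; the paper's exact requirements on n and f may differ): among n submitted gradients, of which at most f may be Byzantine, average the subset of size n - f with the smallest diameter.

```python
# Hedged sketch of Minimum-Diameter Averaging (MDA); names and the
# brute-force search strategy are ours, not the paper's implementation.
from itertools import combinations
import numpy as np

def mda(gradients, f):
    """gradients: list of n 1-D numpy arrays; f: assumed bound on Byzantine workers."""
    n = len(gradients)
    best_subset, best_diameter = None, float("inf")
    for subset in combinations(range(n), n - f):
        # diameter = largest pairwise distance within the candidate subset
        diameter = max(
            (np.linalg.norm(gradients[i] - gradients[j])
             for i, j in combinations(subset, 2)),
            default=0.0,
        )
        if diameter < best_diameter:
            best_diameter, best_subset = diameter, subset
    # average the gradients in the tightest subset
    return np.mean([gradients[i] for i in best_subset], axis=0)
```

Note that the subset search is combinatorial in f, which is acceptable for a sketch but not for large worker counts.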
Submicron Systems Architecture Project: Semiannual Technical Report
The Mosaic C is an experimental fine-grain multicomputer
based on single-chip nodes. The Mosaic C chip includes 64KB of fast dynamic RAM, a
processor, a packet interface, ROM for bootstrap and self-test, and a two-dimensional
self-timed router. The chip architecture provides low-overhead, low-latency handling of
message packets, and high memory and network bandwidth. Sixty-four Mosaic chips are
packaged by tape-automated bonding (TAB) in an 8 x 8 array on circuit boards that can, in
turn, be arrayed in two dimensions to build arbitrarily large machines. These 8 x 8 boards are
now in prototype production under a subcontract with Hewlett-Packard. We are planning
to construct a 16K-node Mosaic C system from 256 of these boards. The suite of Mosaic
C hardware also includes host-interface boards and high-speed communication cables. The
hardware developments and activities of the past eight months are described in section 2.1.
The programming system that we are developing for the Mosaic C is based on the
same message-passing, reactive-process, computational model that we have used with earlier
multicomputers, but the model is implemented for the Mosaic in a way that supports
fine-grain concurrency. A process executes only in response to receiving a message,
and may, during execution, send messages, create new processes, and modify its
persistent variables before it either exits or becomes dormant in preparation for
receiving another message. These
computations are expressed in an object-oriented programming notation, a derivative of
C++ called C+-. The computational model and the C+- programming notation are
described in section 2.2. The Mosaic C runtime system, which is written in C+-, provides
automatic process placement and highly distributed management of system resources. The
Mosaic C runtime system is described in section 2.3.
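To make the reactive-process model above concrete, here is a minimal sketch in Python (rather than C+-, whose notation we do not reproduce); all names are ours. Each process runs only when a message is delivered to it, may send messages and spawn new processes while running, updates its persistent variables, and then goes dormant until the next message.

```python
# Hypothetical, simplified rendering of the message-passing,
# reactive-process model; not the actual Mosaic C runtime system.
from collections import deque

class Process:
    def receive(self, msg, system):
        raise NotImplementedError

class System:
    def __init__(self):
        self.processes = {}
        self.queue = deque()      # pending (destination pid, message) pairs
        self.next_id = 0

    def spawn(self, process):
        pid = self.next_id
        self.next_id += 1
        self.processes[pid] = process
        return pid

    def send(self, pid, msg):
        self.queue.append((pid, msg))

    def run(self):
        while self.queue:         # deliver messages until none remain
            pid, msg = self.queue.popleft()
            # reactive execution: a process runs only on message receipt
            self.processes[pid].receive(msg, self)

class Counter(Process):
    def __init__(self):
        self.count = 0            # persistent variable
    def receive(self, msg, system):
        self.count += msg         # update state, then go dormant

system = System()
pid = system.spawn(Counter())
system.send(pid, 1)
system.run()
```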
Exploiting Device-to-Device Communications to Enhance Spatial Reuse for Popular Content Downloading in Directional mmWave Small Cells
With the explosive growth of mobile demand, small cells in millimeter wave
(mmWave) bands underlying the macrocell networks have attracted intense
interest from both academia and industry. MmWave communications in the 60 GHz
band are able to utilize the huge unlicensed bandwidth to provide multiple Gbps
transmission rates. Consequently, device-to-device (D2D) communications in
mmWave bands should be fully exploited, since they cause no interference to the
macrocell networks and achieve higher transmission rates. In addition, because
directional transmission reduces interference, multiple links, including D2D
links, can be scheduled for concurrent transmissions (spatial reuse). With the
popularity of content-based mobile applications, popular content downloading in
the small cells needs to be optimized to improve network performance and
enhance user experience. In this paper, we develop an efficient scheduling
scheme for popular content downloading in mmWave small cells, termed PCDS
(popular content downloading scheduling), where both D2D communications in
close proximity and concurrent transmissions are exploited to improve
transmission efficiency. In PCDS, a transmission path selection algorithm is
designed to establish multi-hop transmission paths for users, aiming at better
utilization of D2D communications and concurrent transmissions. After
transmission path selection, a concurrent transmission scheduling algorithm is
designed to maximize the spatial reuse gain. Through extensive simulations
under various traffic patterns, we demonstrate that PCDS achieves near-optimal
performance in terms of delay and throughput, as well as superior performance
compared with existing protocols, especially under heavy load.
Comment: 12 pages, to appear in IEEE Transactions on Vehicular Technology.
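To illustrate the spatial-reuse idea behind the concurrent transmission scheduling step, here is a hedged Python sketch under our own simplifying assumptions (the names and the conflict test are ours, not PCDS's actual criteria): links are greedily packed into the same time slot whenever no pair of them conflicts.

```python
# Hypothetical greedy slot-packing sketch, not the PCDS algorithm itself.
def schedule_concurrent(links, conflicts):
    """links: list of (sender, receiver) pairs; conflicts(a, b) -> bool."""
    slots = []
    for link in links:
        for slot in slots:
            # spatial reuse: join an existing slot if no pairwise conflict
            if all(not conflicts(link, other) for other in slot):
                slot.append(link)
                break
        else:
            slots.append([link])  # otherwise open a new time slot
    return slots

def share_node(a, b):
    # simplest placeholder conflict test: half-duplex nodes cannot take part
    # in two links at once; PCDS additionally models mutual interference
    return bool(set(a) & set(b))

slots = schedule_concurrent([("A", "B"), ("C", "D"), ("B", "C")], share_node)
# -> [[("A", "B"), ("C", "D")], [("B", "C")]]: the first two links coexist
```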
On Fault Tolerance Methods for Networks-on-Chip
Technology scaling has proceeded into dimensions in which the reliability of manufactured devices is becoming endangered. The reliability decrease is a consequence of physical limitations, relative increase of variations, and decreasing noise margins, among others. A promising solution for bringing the reliability of circuits back to a desired level is the use of design methods which introduce tolerance against possible faults in an integrated circuit.
This thesis studies and presents fault tolerance methods for networks-on-chip (NoC), a design paradigm targeted at very large systems-on-chip. In a NoC, resources such as processors and memories are connected to a communication network, comparable to the Internet. Fault tolerance in such a system can be achieved at many abstraction levels.
The thesis studies the origin of faults in modern technologies and explains their classification into transient, intermittent, and permanent faults. A survey of fault tolerance methods is presented to demonstrate the diversity of available methods. Networks-on-chip are approached by exploring their main design choices: the selection of a topology, routing protocol, and flow control method. Fault tolerance methods for NoCs are studied at different layers of the OSI reference model.
The data link layer provides a reliable communication link over a physical channel. Error control coding is an efficient fault tolerance method at this abstraction level, especially against transient faults. Error control coding methods suitable for on-chip communication are studied and their implementations presented. Error control coding loses its effectiveness in the presence of intermittent and permanent faults; therefore, other solutions against them are presented. The introduction of spare wires and split transmissions is shown to provide good tolerance against intermittent and permanent errors, and their combination with error control coding is illustrated.
At the network layer, positioned above the data link layer, fault tolerance can be achieved through the design of fault tolerant network topologies and routing algorithms. Both of these approaches are presented in the thesis, together with realizations in both categories. The thesis concludes that an optimal fault tolerance solution contains carefully co-designed elements from different abstraction levels.
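As a concrete example of the data-link-layer error control coding discussed above, here is a textbook Hamming(7,4) encoder/decoder in Python; it corrects any single-bit (e.g., transient) fault on a 4-bit flit. This generic code stands in for the on-chip codes studied in the thesis, which are not reproduced here.

```python
# Generic single-error-correcting Hamming(7,4) code, shown for illustration.
def hamming74_encode(d):
    """d: list of 4 data bits -> 7-bit codeword [p1, p2, d0, p3, d1, d2, d3]."""
    d0, d1, d2, d3 = d
    p1 = d0 ^ d1 ^ d3          # parity over codeword positions 1, 3, 5, 7
    p2 = d0 ^ d2 ^ d3          # parity over codeword positions 2, 3, 6, 7
    p3 = d1 ^ d2 ^ d3          # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d0, p3, d1, d2, d3]

def hamming74_decode(c):
    """c: 7-bit received word -> corrected 4 data bits."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of a flipped bit
    if syndrome:
        c = list(c)
        c[syndrome - 1] ^= 1          # correct the single-bit error
    return [c[2], c[4], c[5], c[6]]

word = hamming74_encode([1, 0, 1, 1])
word[5] ^= 1                          # inject one transient fault on the link
assert hamming74_decode(word) == [1, 0, 1, 1]
```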
Null convention logic circuits for asynchronous computer architecture
For most of its history, computer architecture has been able to benefit from rapid scaling in semiconductor technology, resulting in continuous improvements to CPU design. During that period, synchronous logic has dominated because of its inherent ease of design and abundant tools. However, with the scaling of semiconductor processes into deep sub-micron and then to nano-scale dimensions, computer architecture is hitting a number of roadblocks such as high power and increased process variability. Asynchronous techniques can potentially offer many advantages compared to conventional synchronous design, including average-case vs. worst-case performance, robustness in the face of process and operating-point variability, and the ready availability of high-performance, fine-grained pipeline architectures. Of the many alternative approaches to asynchronous design, Null Convention Logic (NCL) has the advantage that its quasi-delay-insensitive behavior makes it relatively easy to set up complex circuits without the need for exhaustive timing analysis.
This thesis examines the characteristics of an NCL-based asynchronous RISC-V CPU and analyses the problems with applying NCL to CPU design. While a number of university and industry groups have previously developed small 8-bit microprocessor architectures using NCL techniques, it is still unclear whether these offer any real advantages over conventional synchronous design. A key objective of this work has been to analyse the impact of larger word widths and more complex architectures on NCL CPU implementations.
The research commenced by re-evaluating existing techniques for implementing NCL on programmable devices such as FPGAs. The little work that has been undertaken previously on FPGA implementations of asynchronous logic has been inconclusive and seems to indicate that asynchronous systems cannot be easily implemented in these devices. However, most of this work related to an alternative technique called bundled data, which is not well suited to FPGA implementation because of the difficulty in controlling and matching delays in a 'bundle' of signals. On the other hand, this thesis clearly shows that such applications are not only possible with NCL, but that there are some distinct advantages in being able to prototype complex asynchronous systems in a field-programmable technology such as the FPGA.
A large part of the value of NCL derives from its architectural-level behavior, inherent pipelining, and optimization opportunities such as the merging of register and combinational logic functions. In this work, a number of NCL multiplier architectures have been analyzed to reveal the performance trade-offs between various non-pipelined, 1D, and 2D organizations. Two-dimensional pipelining can easily be applied to regular architectures such as array multipliers in a way that is both high-performance and area-efficient. It was found that the performance of 2D pipelining for small networks such as multipliers is around 260% faster than the equivalent non-pipelined design. However, the design uses 265% more transistors, so the methodology is mainly of benefit where performance is strongly favored over area. A pipelined 32-bit × 32-bit signed Baugh-Wooley multiplier with Wallace-tree carry-save adders (CSA), which is representative of a real design used for CPUs and DSPs, was used to further explore this concept, as it is faster and has fewer pipeline stages than the normal array multiplier using ripple-carry adders (RCA).
It was found that 1D pipelining with ripple-carry chains is an efficient implementation option but becomes less so for larger multipliers, due to the completion logic, whose delay depends largely on the number of bits involved in the completion network. The average-case performance of ripple-carry adders was explored using random input vectors; it was observed that it offers little advantage in the smaller multiplier blocks, but this particular timing characteristic of asynchronous design styles becomes increasingly important as word size grows.
Finally, this research has resulted in the development of the first 32-bit asynchronous RISC-V CPU core. Called the Redback RISC, the architecture is a structure of pipeline rings composed of computational oscillations linked with flow completeness relationships. It has been written using NELL, a commercial description/synthesis tool that outputs standard Verilog. The Redback has been analysed and compared to two approximately equivalent industry-standard 32-bit synchronous RISC-V cores (PicoRV32 and Rocket) that are already fabricated and used in industry. While the NCL implementation is larger than both commercial cores, it has similar performance and lower power compared to the PicoRV32. The implementation results were also compared against an existing NCL design tool flow (UNCLE), which showed how much the results of these implementation strategies differ. The Redback RISC achieved a similar level of throughput, 43% better power, and 34% better energy compared to one of the synchronous cores under the same benchmark test and test conditions, such as input supply voltage. However, it was shown that area is the biggest drawback for NCL CPU design: the core is roughly 2.5× larger than the synchronous designs. On the other hand, its area is still 2.9× smaller than previous designs using UNCLE tools. The area penalty is largely due to the unavoidable translation into a dual-rail topology when using the standard NCL cell library.
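To make the NCL primitives referred to above concrete, here is a small Python sketch, under our own naming, of dual-rail encoding (NULL, DATA0, DATA1) and a state-holding THmn threshold gate: the gate asserts its output once m of its n inputs are asserted and releases it only when all inputs have returned to 0, which is the hysteresis underlying NCL's NULL/DATA wavefront handshaking.

```python
# Illustrative model of NCL building blocks; simplified, not a cell library.
NULL, DATA0, DATA1 = (0, 0), (1, 0), (0, 1)   # one bit travels on two wires

class ThresholdGate:
    """THmn gate: asserts when >= m inputs are 1, releases only at all-0."""
    def __init__(self, m):
        self.m = m
        self.out = 0              # state-holding output (hysteresis)

    def step(self, inputs):
        if sum(inputs) >= self.m:
            self.out = 1          # enough of the DATA wavefront has arrived
        elif not any(inputs):
            self.out = 0          # the NULL wavefront has fully arrived
        return self.out           # otherwise: hold the previous value

# Completion detection for one dual-rail bit: a TH12 gate (threshold 1 of 2)
# fires once either rail carries DATA and resets only when both return to 0.
detect = ThresholdGate(m=1)
assert detect.step(DATA1) == 1    # valid data observed
assert detect.step(DATA1) == 1    # held while data persists
assert detect.step(NULL) == 0     # released by the NULL wavefront
```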
GPU High-Performance Framework for PIC-like Simulation Methods Using the Vulkan® Explicit API
Within computational continuum mechanics there exists a large category of simulation methods which operate by tracking Lagrangian particles over an Eulerian background grid. These Lagrangian/Eulerian hybrid methods, descendants of the Particle-In-Cell method (PIC), have proven highly effective at simulating a broad range of materials and mechanics including fluids, solids, granular materials, and plasma. These methods remain an area of active research after several decades, and their applications can be found across scientific, engineering, and entertainment disciplines.
This thesis presents a GPU-driven PIC-like simulation framework created using the Vulkan® API. Vulkan is a cross-platform, open-standard explicit API for graphics and GPU compute programming. Compared to its predecessors, Vulkan offers lower overhead, support for host parallelism, and finer-grained control over both device resources and scheduling. This thesis harnesses those advantages to create a programmable GPU compute pipeline backed by a Vulkan adaptation of the SPGrid data structure and multi-buffered particle arrays. The CPU host system works asynchronously with the GPU to maximize utilization of both the host and the device. The framework is demonstrated to be capable of supporting Particle-In-Cell-like simulation methods, making it viable for GPU acceleration of many hybrid methods that track Lagrangian particles on an Eulerian grid. This novel framework is the first of its kind to be created using Vulkan® and to take advantage of GPU sparse memory features for grid sparsity.
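As a minimal illustration of the Lagrangian-to-Eulerian transfer at the heart of PIC-like methods, here is a NumPy sketch of particle-to-grid mass scatter with bilinear weights; the framework performs this kind of transfer in Vulkan compute shaders over a sparse grid, whereas this serial version (all names ours) only shows the arithmetic.

```python
# Serial, dense-grid sketch of the particle-to-grid (P2G) scatter step.
import numpy as np

def p2g_scatter(positions, masses, grid_shape, dx):
    """positions: (N, 2) particle coordinates; returns mass on grid nodes."""
    grid = np.zeros(grid_shape)
    for (x, y), m in zip(positions / dx, masses):
        i, j = int(x), int(y)      # index of the cell's lower-left node
        fx, fy = x - i, y - j      # fractional position inside the cell
        for di in (0, 1):
            for dj in (0, 1):
                # bilinear weight for each of the four surrounding nodes
                w = (fx if di else 1.0 - fx) * (fy if dj else 1.0 - fy)
                grid[i + di, j + dj] += w * m
    return grid

pos = np.array([[0.6, 0.9], [2.3, 1.1]])
print(p2g_scatter(pos, np.array([1.0, 1.0]), grid_shape=(8, 8), dx=1.0))
```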
Gunrock: GPU Graph Analytics
For large-scale graph analytics on the GPU, the irregularity of data access
and control flow, and the complexity of programming GPUs, have presented two
significant challenges to developing a programmable high-performance graph
library. "Gunrock", our graph-processing system designed specifically for the
GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on
operations on a vertex or edge frontier. Gunrock achieves a balance between
performance and expressiveness by coupling high-performance GPU computing
primitives and optimization strategies with a high-level programming model that
allows programmers to quickly develop new graph primitives with small code size
and minimal GPU programming knowledge. We characterize the performance of
various optimization strategies and evaluate Gunrock's overall performance on
different GPU architectures on a wide range of graph primitives that span from
traversal-based algorithms and ranking algorithms, to triangle counting and
bipartite-graph-based algorithms. The results show that on a single GPU,
Gunrock has on average at least an order of magnitude speedup over Boost and
PowerGraph, comparable performance to the fastest GPU hardwired primitives and
CPU shared-memory graph libraries such as Ligra and Galois, and better
performance than any other GPU high-level graph library.
Comment: 52 pages, invited paper to ACM Transactions on Parallel Computing
(TOPC); an extended version of the PPoPP'16 paper "Gunrock: A High-Performance
Graph Processing Library on the GPU".
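To illustrate the frontier-centric, bulk-synchronous abstraction described above, here is a serial Python sketch (our own simplification, not Gunrock's API) of BFS expressed as repeated advance and filter steps over a vertex frontier.

```python
# Frontier-based BFS: each loop iteration is one bulk-synchronous step.
def frontier_bfs(adj, source):
    """adj: dict vertex -> list of neighbors; returns BFS depth per vertex."""
    depth = {source: 0}
    frontier = [source]
    while frontier:
        next_frontier = []
        for v in frontier:            # "advance": expand frontier out-edges
            for u in adj[v]:
                if u not in depth:    # "filter": keep only unvisited vertices
                    depth[u] = depth[v] + 1
                    next_frontier.append(u)
        frontier = next_frontier      # the new frontier for the next step
    return depth

graph = {0: [1, 2], 1: [3], 2: [3], 3: []}
assert frontier_bfs(graph, 0) == {0: 0, 1: 1, 2: 1, 3: 2}
```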