An Efficient and User Privacy-Preserving Routing Protocol for Wireless Mesh Networks
Wireless mesh networks (WMNs) have emerged as a key technology for next
generation wireless broadband networks showing rapid progress and inspiring
numerous compelling applications. A WMN comprises a set of mesh routers
(MRs) and mesh clients (MCs), where MRs are connected to the Internet backbone
through the Internet gateways (IGWs). The MCs are wireless devices and
communicate among themselves over possibly multi-hop paths with or without the
involvement of MRs. User privacy and security have been primary concerns in
WMNs due to their peer-to-peer network topology, shared wireless medium,
stringent resource constraints, and highly dynamic environment. Moreover, to
support real-time applications, WMNs must also be equipped with robust,
reliable and efficient routing protocols so as to minimize the end-to-end
latency. Design of a secure and efficient routing protocol for WMNs, therefore,
is of paramount importance. In this paper, we propose an efficient and reliable
routing protocol that also provides user anonymity in WMNs. The protocol is
based on an accurate estimation of the available bandwidth in the wireless
links and a robust estimation of the end-to-end delay in a routing path, and
minimization of control message overhead. User anonymity, authentication
and data privacy are achieved through a novel protocol based
on Rivest's ring signature scheme. Simulations carried out on the proposed
protocol demonstrate that it is more efficient than some of the existing
routing protocols.
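To make the bandwidth and delay criteria above concrete, the sketch below scores candidate routes by the feasibility of their bottleneck bandwidth and their total estimated delay. It is an illustrative model only; the function names, the scoring rule, and the numbers are assumptions, not the protocol's actual metric.

```python
# Illustrative route selection: prefer routes whose estimated available
# bandwidth meets the flow's demand and whose end-to-end delay is lowest.
# All names and the scoring rule are hypothetical, not taken from the paper.

def route_cost(hop_bandwidths_mbps, hop_delays_ms, demand_mbps):
    """Return a cost for one candidate route, or None if it cannot carry the flow."""
    bottleneck = min(hop_bandwidths_mbps)   # available bandwidth is limited by the tightest link
    if bottleneck < demand_mbps:            # infeasible route: cannot satisfy the demand
        return None
    return sum(hop_delays_ms)               # among feasible routes, minimize end-to-end delay

def select_route(candidates, demand_mbps):
    """candidates: list of (hop_bandwidths_mbps, hop_delays_ms) tuples."""
    scored = [(route_cost(bw, d, demand_mbps), i) for i, (bw, d) in enumerate(candidates)]
    feasible = [(c, i) for c, i in scored if c is not None]
    return min(feasible)[1] if feasible else None   # index of the chosen route

# Example: two candidate 3-hop routes for a 5 Mbps flow.
routes = [([12, 8, 20], [4.0, 6.5, 3.0]), ([30, 25, 18], [7.0, 9.0, 8.5])]
print(select_route(routes, demand_mbps=5))          # -> 0 (lower total delay)
```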
Efficient Large-Scale Graph Processing on Hybrid CPU and GPU Systems
The increasing scale and wealth of inter-connected data, such as those
accrued by social network applications, demand the design of new techniques and
platforms to efficiently derive actionable knowledge from large-scale graphs.
However, real-world graphs are famously difficult to process efficiently. Not
only do they have a large memory footprint, but most graph algorithms also entail
memory access patterns with poor locality, data-dependent parallelism and a low
compute-to-memory access ratio. Moreover, most real-world graphs have a highly
heterogeneous node degree distribution, hence partitioning these graphs for
parallel processing and simultaneously achieving access locality and
load-balancing is difficult.
This work starts from the hypothesis that hybrid platforms (e.g.,
GPU-accelerated systems) have the potential both to cope with the heterogeneous
structure of real graphs and to offer a cost-effective platform for
high-performance graph processing. This work assesses this hypothesis and
presents an extensive exploration of the opportunity to harness hybrid systems
to process large-scale graphs efficiently. In particular, (i) we present a
performance model that estimates the achievable performance on hybrid
platforms; (ii) informed by the performance model, we design and develop TOTEM
- a processing engine that provides a convenient environment to implement graph
algorithms on hybrid platforms; (iii) we show that further performance gains
can be extracted using partitioning strategies that aim to produce partitions
each of which matches the strengths of the processing element it is allocated to;
and, finally, (iv) we demonstrate the performance advantages of the hybrid system
through a comprehensive evaluation that uses real and synthetic workloads (as
large as 16 billion edges), multiple graph algorithms that stress the system in
various ways, and a variety of hardware configurations.
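As a rough illustration of the partitioning idea in point (iii), the sketch below splits a graph by vertex degree so that the few high-degree vertices are handled by the CPU and the many low-degree vertices by the GPU. The threshold, data layout, and function names are assumptions made for this sketch, not TOTEM's actual implementation.

```python
# Illustrative degree-aware partitioning: the CPU takes the irregular,
# cache-unfriendly high-degree hubs, the GPU takes the regular, massively
# parallel bulk of low-degree vertices.

def partition_by_degree(adjacency, degree_threshold):
    """adjacency: dict mapping vertex -> list of neighbours."""
    cpu_part, gpu_part = {}, {}
    for v, nbrs in adjacency.items():
        (cpu_part if len(nbrs) >= degree_threshold else gpu_part)[v] = nbrs
    return cpu_part, gpu_part

graph = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0]}
cpu, gpu = partition_by_degree(graph, degree_threshold=4)
print(sorted(cpu))   # [0]               high-degree hub handled by the CPU
print(sorted(gpu))   # [1, 2, 3, 4, 5]   low-degree vertices handled by the GPU
```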
Exploiting Fine-Grain Ordered Parallelism in Dense Matrix Algorithms
Dense linear algebra kernels are critical for wireless applications, and the
oncoming proliferation of 5G only amplifies their importance. Many such matrix
algorithms are inductive, and exhibit ample amounts of fine-grain ordered
parallelism -- where multiple computations flow with fine-grain
producer/consumer dependences and the iteration domain is not easily
tileable. Synchronization overheads make multi-core parallelism ineffective and
the non-tileable iterations make the vector-VLIW approach less effective,
especially for the typically modest-sized matrices. Because CPUs and DSPs lose an
order of magnitude in performance and hardware utilization, costly and inflexible
ASICs are often employed in signal processing pipelines. A programmable
accelerator with similar performance/power/area would be highly desirable. We
find that fine-grain ordered parallelism can be exploited by supporting: 1.
fine-grain stream-based communication/synchronization; 2. inductive data-reuse
and memory access patterns; 3. implicit vector-masking for partial vectors; 4.
hardware specialization of dataflow criticality. In this work, we propose
REVEL, a next-generation DSP architecture. It supports the above features in
its ISA and microarchitecture, and further uses a novel vector-stream control
paradigm to reduce control overheads. Across a suite of linear algebra kernels,
REVEL outperforms equally provisioned DSPs by 4.6x-37x in latency and achieves
a performance per mm^2 of 8.3x. It requires only 2.2x higher power than ideal
ASICs to achieve the same performance, at about 55% of the combined area.
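For a concrete picture of fine-grain ordered parallelism, the snippet below shows forward substitution (a lower-triangular solve): an inductive kernel in which each result consumes all earlier results and the inner trip count grows with the outer index, so the iteration domain is triangular and not easily tileable. This is a plain reference implementation, not REVEL code.

```python
# Forward substitution: solve L y = b for lower-triangular L.
# y[i] depends on y[0..i-1], giving fine-grain producer/consumer
# dependences over a triangular, non-tileable iteration space.

def forward_substitution(L, b):
    n = len(b)
    y = [0.0] * n
    for i in range(n):             # ordered outer loop: y[i] consumes y[0..i-1]
        acc = b[i]
        for j in range(i):         # inner trip count grows with i (triangular domain)
            acc -= L[i][j] * y[j]
        y[i] = acc / L[i][i]
    return y

L = [[2.0, 0.0, 0.0],
     [1.0, 3.0, 0.0],
     [4.0, 1.0, 5.0]]
print(forward_substitution(L, [2.0, 7.0, 21.0]))   # -> [1.0, 2.0, 3.0]
```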
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Large deep learning models offer significant accuracy gains, but training
billions to trillions of parameters is challenging. Existing solutions such as
data and model parallelism exhibit fundamental limitations when fitting these models
into limited device memory while obtaining computation, communication and
development efficiency. We develop a novel solution, Zero Redundancy Optimizer
(ZeRO), to optimize memory, vastly improving training speed while increasing
the model size that can be efficiently trained. ZeRO eliminates memory
redundancies in data- and model-parallel training while retaining low
communication volume and high computational granularity, allowing us to scale
the model size proportional to the number of devices with sustained high
efficiency. Our analysis on memory requirements and communication volume
demonstrates: ZeRO has the potential to scale beyond 1 Trillion parameters
using today's hardware.
We implement and evaluate ZeRO: it trains large models of over 100B parameters
with super-linear speedup on 400 GPUs, achieving a throughput of 15 Petaflops.
This represents an 8x increase in model size and a 10x increase in achievable
performance over the state of the art. In terms of usability, ZeRO can train large
models of up to 13B parameters (e.g., larger than Megatron GPT 8.3B and T5 11B)
without requiring model parallelism, which is harder for scientists to apply.
Last but not least, researchers have used the system breakthroughs of ZeRO
to create the world's largest language model (Turing-NLG, 17B parameters) with
record-breaking accuracy.
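The sketch below works through the per-device memory arithmetic suggested by the abstract for mixed-precision Adam training (assuming 2 bytes/parameter for fp16 weights, 2 for fp16 gradients, and roughly 12 for fp32 optimizer states), with the three ZeRO stages progressively partitioning optimizer states, gradients, and parameters across devices. The constants and example sizes are assumptions used for illustration.

```python
# Per-device training memory under plain data parallelism (stage 0) and the
# three ZeRO partitioning stages, for mixed-precision Adam.

GB = 1024 ** 3

def zero_memory_per_device(num_params, num_devices, stage, k=12):
    p, g, os_ = 2 * num_params, 2 * num_params, k * num_params
    if stage >= 1: os_ /= num_devices   # ZeRO-1: partition optimizer states
    if stage >= 2: g  /= num_devices    # ZeRO-2: also partition gradients
    if stage >= 3: p  /= num_devices    # ZeRO-3: also partition parameters
    return (p + g + os_) / GB

psi, nd = 7.5e9, 64                     # e.g. a 7.5B-parameter model on 64 GPUs
for s in range(4):
    print(f"stage {s}: {zero_memory_per_device(psi, nd, s):.1f} GB per device")
# Stage 0 (plain data parallelism) needs ~112 GB per device; stage 3 only ~1.7 GB.
```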
PipeDream: Fast and Efficient Pipeline Parallel DNN Training
PipeDream is a Deep Neural Network (DNN) training system for GPUs that
parallelizes computation by pipelining execution across multiple machines. Its
pipeline parallel computing model avoids the slowdowns faced by data-parallel
training when large models and/or limited network bandwidth induce high
communication-to-computation ratios. PipeDream reduces communication by up to
95% for large DNNs relative to data-parallel training, and allows perfect
overlap of communication and computation. PipeDream keeps all available GPUs
productive by systematically partitioning DNN layers among them to balance work
and minimize communication, versions model parameters for backward pass
correctness, and schedules the forward and backward passes of different inputs
in round-robin fashion to optimize "time to target accuracy". Experiments with
five different DNNs on two different clusters show that PipeDream is up to 5x
faster in time-to-accuracy compared to data-parallel training.
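The toy sketch below illustrates two of the mechanisms mentioned above: round-robin alternation of forward and backward work for different in-flight minibatches, and weight versioning so that a minibatch's backward pass uses the same weights its forward pass saw. The class and its bookkeeping are hypothetical simplifications, not PipeDream's implementation.

```python
class PipelineStage:
    """Toy single-stage bookkeeping: one-forward-one-backward scheduling
    plus weight versioning."""
    def __init__(self):
        self.versions = {}   # minibatch id -> weight version used in its forward pass
        self.weights = 0     # stand-in for this stage's parameter version counter

    def forward(self, mb):
        self.versions[mb] = self.weights   # stash the version this minibatch saw
        return f"F{mb}(w{self.versions[mb]})"

    def backward(self, mb):
        used = self.versions.pop(mb)       # gradient w.r.t. the stashed version
        self.weights += 1                  # apply the update (placeholder)
        return f"B{mb}(w{used})"

stage, log = PipelineStage(), []
for mb in range(2):                        # startup phase: fill the pipeline
    log.append(stage.forward(mb))
for mb in range(2, 6):                     # steady state: one backward, one forward
    log.append(stage.backward(mb - 2))
    log.append(stage.forward(mb))
print(" ".join(log))
# F0(w0) F1(w0) B0(w0) F2(w1) B1(w0) F3(w2) B2(w1) F4(w3) B3(w2) F5(w4)
```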
Complete Security Framework for Wireless Sensor Networks
Security concerns for sensor networks, and the level of security desired, may
differ according to the application-specific needs of the environments where the
sensor networks are deployed. Until now, most of the security solutions proposed for
sensor networks have been layer-wise, i.e., a particular solution applies to a single
layer only. Integrating them all is therefore a new research challenge. In this paper
we take up the challenge and propose an integrated, comprehensive security framework
that provides security services for all the services of a sensor network. We add one
extra component, the Intelligent Security Agent (ISA), to assess the level of
security and the cross-layer interactions. The framework comprises components
such as an Intrusion Detection System, a Trust Framework, a Key Management scheme
and a link-layer communication protocol. We have also tested it on three different
application scenarios in the Castalia and OMNeT++ simulators.
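As a purely hypothetical reading of how these components might fit together, the sketch below has an Intelligent Security Agent choose a security level per application and enable a corresponding subset of the framework's services. The level scale and the policy logic are invented for illustration; only the component names come from the abstract.

```python
# Hypothetical composition sketch: the ISA maps an application's required
# security level (assumed 1..4 scale) to the services it enables.

SERVICES = ["link_layer_crypto", "key_management", "trust_framework", "intrusion_detection"]

class IntelligentSecurityAgent:
    def __init__(self, app_profile):
        self.level = app_profile.get("required_security_level", 1)

    def enabled_services(self):
        # Higher levels turn on progressively more of the framework's services.
        return SERVICES[: self.level]

isa = IntelligentSecurityAgent({"required_security_level": 3})
print(isa.enabled_services())
# ['link_layer_crypto', 'key_management', 'trust_framework']
```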
Theano-MPI: a Theano-based Distributed Training Framework
We develop a scalable and extendable training framework that can utilize GPUs
across nodes in a cluster and accelerate the training of deep learning models
based on data parallelism. Both synchronous and asynchronous training are
implemented in our framework, where parameter exchange among GPUs is based on
CUDA-aware MPI. In this report, we analyze the convergence and capability of
the framework to reduce training time when scaling the synchronous training of
AlexNet and GoogLeNet from 2 GPUs to 8 GPUs. In addition, we explore novel ways
to reduce the communication overhead caused by exchanging parameters. Finally,
we release the framework as open-source for further research on distributed
deep learning.
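A minimal sketch of the synchronous parameter-exchange step described above, using mpi4py's Allreduce to sum and average gradients across workers. Theano-MPI relies on CUDA-aware MPI so the buffers can live on the GPU; host-side NumPy arrays are used here to keep the example self-contained.

```python
# Run with e.g.: mpirun -n 4 python sync_allreduce.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

local_grad = np.full(4, float(rank), dtype=np.float64)   # stand-in for this worker's gradient
global_grad = np.empty_like(local_grad)

comm.Allreduce(local_grad, global_grad, op=MPI.SUM)      # sum gradients from all workers
global_grad /= size                                      # average before applying the update

if rank == 0:
    print(global_grad)   # with P workers, each slot holds (0 + 1 + ... + P-1) / P
```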
Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale
Reliability is a serious concern for future extreme-scale high-performance
computing (HPC) systems. While the HPC community has developed various
resilience solutions, the solution space remains fragmented. There are no
formal methods and metrics to integrate the various HPC resilience techniques
into composite solutions, nor are there methods to holistically evaluate the
adequacy and efficacy of such solutions in terms of their protection coverage,
and their performance & power efficiency characteristics. In this paper, we
develop a structured approach to the design, evaluation and optimization of HPC
resilience using the concept of design patterns. A design pattern is a general
repeatable solution to a commonly occurring problem. We identify the problems
caused by various types of faults, errors and failures in HPC systems and the
techniques used to deal with these events. Each well-known solution that
addresses a specific HPC resilience challenge is described in the form of a
pattern. We develop a complete catalog of such resilience design patterns,
which may be used as essential building blocks when designing and deploying
resilience solutions. We also develop a design framework that enhances a
designer's understanding of the opportunities for integrating multiple patterns
across layers of the system stack and the important constraints during
implementation of the individual patterns. It is also useful for defining
mechanisms and interfaces to coordinate flexible fault management across
hardware and software components. The overall goal of this work is to establish
a systematic methodology for the design and evaluation of resilience
technologies in extreme-scale HPC systems that keep scientific applications
running to a correct solution in a timely and cost-efficient manner despite
frequent faults, errors, and failures of various types.
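One possible reading of the pattern-as-building-block idea is sketched below: each resilience pattern exposes a small common interface (detect and recover), so patterns protecting different layers of the stack can be composed behind one fault handler. The interface and the two example patterns are illustrative, not entries from the paper's catalog.

```python
from abc import ABC, abstractmethod

class ResiliencePattern(ABC):
    @abstractmethod
    def detect(self, event) -> bool:
        """Does this pattern recognise the given fault/error event?"""

    @abstractmethod
    def recover(self, state):
        """Bring the protected state back to a correct value."""

class CheckpointRestart(ResiliencePattern):
    def __init__(self):
        self.checkpoint = {}
    def save(self, state):
        self.checkpoint = dict(state)
    def detect(self, event):
        return event == "process_failure"
    def recover(self, state):
        state.clear()
        state.update(self.checkpoint)   # roll back to the last saved checkpoint

class RetryMessage(ResiliencePattern):
    def detect(self, event):
        return event == "transient_message_loss"
    def recover(self, state):
        state["resend"] = True          # ask the sender to retransmit

def handle(event, state, patterns):
    for p in patterns:                  # first pattern that recognises the event handles it
        if p.detect(event):
            p.recover(state)
            return type(p).__name__
    return "unhandled"

ckpt = CheckpointRestart()
ckpt.save({"step": 41})
state = {"step": 57}
print(handle("process_failure", state, [RetryMessage(), ckpt]), state)
# -> CheckpointRestart {'step': 41}
```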
Pipe-SGD: A Decentralized Pipelined SGD Framework for Distributed Deep Net Training
Distributed training of deep nets is an important technique to address some
of the present day computing challenges like memory consumption and
computational demands. Classical distributed approaches, synchronous or
asynchronous, are based on the parameter server architecture, i.e., worker
nodes compute gradients which are communicated to the parameter server while
updated parameters are returned. Recently, distributed training with AllReduce
operations gained popularity as well. While many of those operations seem
appealing, little is reported about wall-clock training time improvements. In
this paper, we carefully analyze the AllReduce based setup, propose timing
models which include network latency, bandwidth, cluster size and compute time,
and demonstrate that a pipelined training with a width of two combines the best
of both synchronous and asynchronous training. Specifically, for a setup
consisting of a four-node GPU cluster we show wall-clock time training
improvements of up to 5.4x compared to conventional approaches.
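The toy timing model below captures the comparison made in the abstract under simple assumptions: synchronous training pays compute plus communication every iteration, while a width-two pipeline overlaps one iteration's communication with the next iteration's compute, so steady-state iteration time is the maximum of the two. The cost model (latency plus size over bandwidth) and the numbers are illustrative, not the paper's.

```python
def comm_time(model_bytes, latency_s, bandwidth_bytes_per_s):
    return latency_s + model_bytes / bandwidth_bytes_per_s

def sync_iteration(compute_s, model_bytes, latency_s, bw):
    return compute_s + comm_time(model_bytes, latency_s, bw)

def pipelined_iteration(compute_s, model_bytes, latency_s, bw):
    return max(compute_s, comm_time(model_bytes, latency_s, bw))   # width-2 pipeline, steady state

compute, model, lat, bw = 0.20, 250e6, 1e-3, 1.25e9   # 200 ms compute, 250 MB model, 10 Gb/s link
print(f"sync:      {sync_iteration(compute, model, lat, bw) * 1e3:.0f} ms/iter")
print(f"pipelined: {pipelined_iteration(compute, model, lat, bw) * 1e3:.0f} ms/iter")
# sync ~401 ms, pipelined ~201 ms: communication is hidden behind compute.
```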
PIMBALL: Binary Neural Networks in Spintronic Memory
Neural networks span a wide range of applications of industrial and
commercial significance. Binary neural networks (BNN) are particularly
effective in trading accuracy for performance, energy efficiency or
hardware/software complexity. Here, we introduce a spintronic, re-configurable
in-memory BNN accelerator, PIMBALL: Processing In Memory BNN AcceL(L)erator,
which allows for massively parallel and energy efficient computation. PIMBALL
is capable of being used as a standard spintronic memory (STT-MRAM) array and a
computational substrate simultaneously. We evaluate PIMBALL using multiple
image classifiers and a genomics kernel. Our simulation results show that
PIMBALL is more energy efficient than alternative CPU, GPU, and FPGA based
implementations while delivering higher throughput.
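For context on why BNNs map well onto bitwise in-memory substrates, the snippet below shows the standard identity that a dot product over {-1, +1} values reduces to an XNOR followed by a popcount. This is a plain reference implementation of that identity, not PIMBALL's hardware mapping.

```python
def bnn_dot(x_bits, w_bits, n):
    """x_bits, w_bits: n-bit integers encoding +1 as bit 1 and -1 as bit 0."""
    matches = bin(~(x_bits ^ w_bits) & ((1 << n) - 1)).count("1")   # XNOR + popcount
    return 2 * matches - n                                          # equals sum of x_i * w_i

# Cross-check against the arithmetic definition with +/-1 vectors.
x = [+1, -1, +1, +1, -1]
w = [+1, +1, -1, +1, -1]
to_bits = lambda v: sum(int(b == 1) << i for i, b in enumerate(v))
print(bnn_dot(to_bits(x), to_bits(w), len(x)))      # -> 1
print(sum(a * b for a, b in zip(x, w)))             # -> 1 (same result)
```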