3,866 research outputs found
Design of a Hybrid Modular Switch
Network Function Virtualization (NFV) shed new light for the design,
deployment, and management of cloud networks. Many network functions such as
firewalls, load balancers, and intrusion detection systems can be virtualized
by servers. However, network operators often have to sacrifice programmability
in order to achieve high throughput, especially at networks' edge where complex
network functions are required.
Here, we design, implement, and evaluate Hybrid Modular Switch (HyMoS). The
hybrid hardware/software switch is designed to meet requirements for modern-day
NFV applications in providing high-throughput, with a high degree of
programmability. HyMoS utilizes P4-compatible Network Interface Cards (NICs),
PCI Express interface and CPU to act as line cards, switch fabric, and fabric
controller respectively. In our implementation of HyMos, PCI Express interface
is turned into a non-blocking switch fabric with a throughput of hundreds of
Gigabits per second.
Compared to existing NFV infrastructure, HyMoS offers modularity in hardware
and software as well as a higher degree of programmability by supporting a
superset of P4 language
GPU peer-to-peer techniques applied to a cluster interconnect
Modern GPUs support special protocols to exchange data directly across the
PCI Express bus. While these protocols could be used to reduce GPU data
transmission times, basically by avoiding staging to host memory, they require
specific hardware features which are not available on current generation
network adapters. In this paper we describe the architectural modifications
required to implement peer-to-peer access to NVIDIA Fermi- and Kepler-class
GPUs on an FPGA-based cluster interconnect. Besides, the current software
implementation, which integrates this feature by minimally extending the RDMA
programming model, is discussed, as well as some issues raised while employing
it in a higher level API like MPI. Finally, the current limits of the technique
are studied by analyzing the performance improvements on low-level benchmarks
and on two GPU-accelerated applications, showing when and how they seem to
benefit from the GPU peer-to-peer method.Comment: paper accepted to CASS 201
DeLTA: GPU Performance Model for Deep Learning Applications with In-depth Memory System Traffic Analysis
Training convolutional neural networks (CNNs) requires intense compute
throughput and high memory bandwidth. Especially, convolution layers account
for the majority of the execution time of CNN training, and GPUs are commonly
used to accelerate these layer workloads. GPU design optimization for efficient
CNN training acceleration requires the accurate modeling of how their
performance improves when computing and memory resources are increased. We
present DeLTA, the first analytical model that accurately estimates the traffic
at each GPU memory hierarchy level, while accounting for the complex reuse
patterns of a parallel convolution algorithm. We demonstrate that our model is
both accurate and robust for different CNNs and GPU architectures. We then show
how this model can be used to carefully balance the scaling of different GPU
resources for efficient CNN performance improvement
Recommended from our members
MobileTrust: Secure Knowledge Integration in VANETs
Vehicular Ad hoc NETworks (VANET) are becoming popular due to the emergence of the Internet of Things and ambient intelligence applications. In such networks, secure resource sharing functionality is accomplished by incorporating trust schemes. Current solutions adopt peer-to-peer technologies that can cover the large operational area. However, these systems fail to capture some inherent properties of VANETs, such as fast and ephemeral interaction, making robust trust evaluation of crowdsourcing challenging. In this article, we propose MobileTrust—a hybrid trust-based system for secure resource sharing in VANETs. The proposal is a breakthrough in centralized trust computing that utilizes cloud and upcoming 5G technologies to provide robust trust establishment with global scalability. The ad hoc communication is energy-efficient and protects the system against threats that are not countered by the current settings. To evaluate its performance and effectiveness, MobileTrust is modelled in the SUMO simulator and tested on the traffic features of the small-size German city of Eichstatt. Similar schemes are implemented in the same platform to provide a fair comparison. Moreover, MobileTrust is deployed on a typical embedded system platform and applied on a real smart car installation for monitoring traffic and road-state parameters of an urban application. The proposed system is developed under the EU-founded THREAT-ARREST project, to provide security, privacy, and trust in an intelligent and energy-aware transportation scenario, bringing closer the vision of sustainable circular economy
NVIDIA Tensor Core Programmability, Performance & Precision
The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called
"Tensor Core" that performs one matrix-multiply-and-accumulate on 4x4 matrices
per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta
microarchitecture, provides 640 Tensor Cores with a theoretical peak
performance of 125 Tflops/s in mixed precision. In this paper, we investigate
current approaches to program NVIDIA Tensor Cores, their performances and the
precision loss due to computation in mixed precision.
Currently, NVIDIA provides three different ways of programming
matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply
Accumulate (WMMA) API, CUTLASS, a templated library based on WMMA, and cuBLAS
GEMM. After experimenting with different approaches, we found that NVIDIA
Tensor Cores can deliver up to 83 Tflops/s in mixed precision on a Tesla V100
GPU, seven and three times the performance in single and half precision
respectively. A WMMA implementation of batched GEMM reaches a performance of 4
Tflops/s. While precision loss due to matrix multiplication with half precision
input might be critical in many HPC applications, it can be considerably
reduced at the cost of increased computation. Our results indicate that HPC
applications using matrix multiplications can strongly benefit from using of
NVIDIA Tensor Cores.Comment: This paper has been accepted by the Eighth International Workshop on
Accelerators and Hybrid Exascale Systems (AsHES) 201
- …