Millimeter-wave Wireless LAN and its Extension toward 5G Heterogeneous Networks
Millimeter-wave (mmw) frequency bands, especially the 60 GHz unlicensed band, are
considered as a promising solution for gigabit short range wireless
communication systems. IEEE standard 802.11ad, also known as WiGig, is
standardized for the usage of the 60 GHz unlicensed band for wireless local
area networks (WLANs). By using this mmw WLAN, multi-Gbps rate can be achieved
to support bandwidth-intensive multimedia applications. Exhaustive search along
with beamforming (BF) is usually used to overcome 60 GHz channel propagation
loss and accomplish data transmissions in such mmw WLANs. Because of their short
transmission range and high susceptibility to path blocking, multiple mmw
access points (APs) should be used to fully cover a typical target
environment for future high capacity multi-Gbps WLANs. Therefore, coordination
among mmw APs is highly needed to overcome packet collisions resulting from
un-coordinated exhaustive search BF and to increase the total capacity of mmw
WLANs. In this paper, we first present the current status of mmw WLANs with our
developed WiGig AP prototype. Then, we highlight the great need for coordinated
transmissions among mmw APs as a key enabler for future high capacity mmw
WLANs. Two different types of coordinated mmw WLAN architecture are introduced.
One is the distributed antenna type architecture to realize centralized
coordination, while the other is an autonomous coordination with the assistance
of legacy Wi-Fi signaling. Moreover, two heterogeneous network (HetNet)
architectures are also introduced to efficiently extend the coordinated mmw
WLANs to be used for future 5th Generation (5G) cellular networks.
Comment: 18 pages, 24 figures, accepted, invited paper
A radiation-hard dual-channel 12-bit 40 MS/s ADC prototype for the ATLAS liquid argon calorimeter readout electronics upgrade at the CERN LHC
The readout electronics upgrade for the ATLAS Liquid Argon Calorimeters at
the CERN Large Hadron Collider requires a radiation-hard ADC. The design of a
radiation-hard dual-channel 12-bit 40 MS/s pipeline ADC for this use is
presented. The design consists of two pipeline A/D channels each with four
Multiplying Digital-to-Analog Converters followed by 8-bit
Successive-Approximation-Register analog-to-digital converters. The custom
design, fabricated in a commercial 130 nm CMOS process, shows a performance of
67.9 dB SNDR at 10 MHz for a single channel at 40 MS/s, with a latency of 87.5
ns (to first bit read out), while its total power consumption is 50 mW/channel.
The chip uses two power supply voltages: 1.2 and 2.5 V. The sensitivity to
single event effects during irradiation is measured and determined to meet the
system requirements.
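As a rough reading aid for the figures quoted in this abstract, the sketch below converts the reported SNDR into an effective number of bits using the standard ideal-quantizer relation ENOB = (SNDR - 1.76 dB) / 6.02 dB, and expresses the reported latency in sample periods. The function names are hypothetical, and treating the sample period as 1 / (40 MS/s) is an assumption; the abstract itself gives neither an ENOB value nor a cycle count.

```python
def enob_from_sndr(sndr_db: float) -> float:
    """Effective number of bits from SNDR in dB (ideal-quantizer relation)."""
    return (sndr_db - 1.76) / 6.02


def latency_in_sample_periods(latency_ns: float, sample_rate_msps: float) -> float:
    """Latency expressed in sample periods.

    Assumes one sample period equals 1 / sample_rate (an assumption, not a
    statement from the abstract). ns * MS/s / 1000 is dimensionless.
    """
    return latency_ns * sample_rate_msps / 1000.0


if __name__ == "__main__":
    # Values taken from the abstract: 67.9 dB SNDR, 87.5 ns latency, 40 MS/s.
    print(f"ENOB: {enob_from_sndr(67.9):.2f} bits")
    print(f"Latency: {latency_in_sample_periods(87.5, 40.0)} sample periods")
```

Under these assumptions the 12-bit converter resolves to just under 11 effective bits.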
Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics
Graphics Processing Units (GPUs) are having a transformational effect on
numerical lattice quantum chromodynamics (LQCD) calculations of importance in
nuclear and particle physics. The QUDA library provides a package of mixed
precision sparse matrix linear solvers for LQCD applications, supporting single
GPUs based on NVIDIA's Compute Unified Device Architecture (CUDA). This
library, interfaced to the QDP++/Chroma framework for LQCD calculations, is
currently in production use on the "9g" cluster at the Jefferson Laboratory,
enabling unprecedented price/performance for a range of problems in LQCD.
Nevertheless, memory constraints on current GPU devices limit the problem sizes
that can be tackled. In this contribution we describe the parallelization of
the QUDA library onto multiple GPUs using MPI, including strategies for the
overlapping of communication and computation. We report on both weak and strong
scaling for up to 32 GPUs interconnected by InfiniBand, on which we sustain in
excess of 4 Tflops.
Comment: 11 pages, 7 figures, to appear in the Proceedings of Supercomputing
2010 (submitted April 12, 2010)
A Taxonomy of Self-configuring Service Discovery Systems
We analyze the fundamental concepts and issues in service
discovery. This analysis places service discovery in the context of distributed
systems by describing service discovery as a third generation
naming system. We also describe the essential architectures and the
functionalities in service discovery. We then proceed to show how service
discovery fits into a system, by characterizing operational aspects.
Subsequently, we describe how the existing state of the art performs service
discovery, in relation to the operational aspects and functionalities, and
identify areas for improvement.
Designing a CPU model: from a pseudo-formal document to fast code
For validating low level embedded software, engineers use simulators that
take the real binary as input. Like the real hardware, these full-system
simulators are organized as a set of components. The main component is the CPU
simulator (ISS), because it is the usual bottleneck for the simulation speed,
and its development is a long and repetitive task. Previous work showed that an
ISS can be generated from an Architecture Description Language (ADL). In the
work reported in this paper, we generate a CPU simulator directly from the
pseudo-formal descriptions of the reference manual. For each instruction, we
extract the information describing its behavior, its binary encoding, and its
assembly syntax. Next, after automatically applying many optimizations on the
extracted information, we generate a SystemC/TLM ISS. We also generate tests
for the decoder and a formal specification in Coq. Experiments show that the
generated ISS is as fast and stable as our previous hand-written ISS.
Comment: 3rd Workshop on: Rapid Simulation and Performance Evaluation: Methods
and Tools (2011)
Vector processing-aware advanced clock-gating techniques for low-power fused multiply-add
The need for power efficiency is driving a rethink of design decisions in processor architectures. While vector processors succeeded in the high-performance market in the past, they need a retailoring for the mobile market that they are entering now. Floating-point (FP) fused multiply-add (FMA), being a functional unit with high power consumption, deserves special attention. Although clock gating is a well-known method to reduce switching power in synchronous designs, there are unexplored opportunities for its application to vector processors, especially when considering active operating mode. In this research, we comprehensively identify, propose, and evaluate the most suitable clock-gating techniques for vector FMA units (VFUs). These techniques ensure power savings without jeopardizing the timing. We evaluate the proposed techniques using both synthetic and “real-world” application-based benchmarking. Using vector masking and vector multilane-aware clock gating, we report power reductions of up to 52%, assuming active VFU operating at the peak performance. Among other findings, we observe that vector instruction-based clock-gating techniques achieve power savings for all vector FP instructions. Finally, when evaluating all techniques together, using “real-world” benchmarking, the power reductions are up to 80%. Additionally, in accordance with processor design trends, we perform this research in a fully parameterizable and automated fashion.
The research leading to these results has received funding from the RoMoL ERC Advanced Grant GA 321253 and is supported in part by the European Union (FEDER funds) under contract TTIN2015-65316-P. The work of I. Ratkovic was supported by a FPU research grant from the Spanish MECD.
Peer Reviewed. Postprint (author's final draft)
Computing Safe Contention Bounds for Multicore Resources with Round-Robin and FIFO Arbitration
Numerous researchers have studied the contention that arises among tasks running in parallel on a multicore processor. Most of those studies seek to derive a tight and sound upper bound for the worst-case delay with which a processor resource may serve an incoming request, when its access is arbitrated using time-predictable policies such as round-robin or FIFO. We call this value upper-bound delay (ubd). Deriving trustworthy ubd statically is possible when sufficient public information exists on the timing latency incurred on access to the resource of interest. Unfortunately, however, that is rarely granted for commercial-off-the-shelf (COTS) processors. Therefore, the users resort to measurement observations on the target processor and thus compute a “measured” ubdm. However, using ubdm to compute worst-case execution time values for programs running on COTS multicore processors requires qualification on the soundness of the result. In this paper, we present a measurement-based methodology to derive a ubdm under round-robin (RoRo) and first-in-first-out (FIFO) arbitration, which accurately approximates ubd from above, without needing latency information from the hardware provider. Experimental results, obtained on multiple processor configurations, demonstrate the robustness of the proposed methodology.
The research leading to this work has received funding from: the European Union’s Horizon 2020 research and innovation programme under grant agreement No 644080 (SAFURE); the European Space Agency under Contract 789.2013 and NPI Contract 40001102880; and COST Action IC1202, Timing Analysis On Code-Level (TACLe). This work has also been partially supported by the Spanish Ministry of Science and Innovation under grant TIN2015-65316-P. Jaume Abella has been partially supported by the MINECO under Ramon y Cajal postdoctoral fellowship number RYC-2013-14717. The authors would like to thank Paul Caheny for his help with the proofreading of this document.
Peer Reviewed. Postprint (author's final draft)
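To make the notion of a static ubd concrete, the sketch below computes the textbook analytical bound for round-robin arbitration: a request waits at most for one access from each of the other requesters. This is not the measurement-based methodology of the paper; it is the kind of bound that requires the latency information the abstract says is rarely available for COTS parts. The function name, the parameters, and the assumption of at most one pending request per requester are all hypothetical choices for illustration.

```python
def round_robin_ubd(n_requesters: int, max_access_latency: int) -> int:
    """Textbook worst-case waiting time under round-robin arbitration.

    Assumptions (not taken from the abstract):
      - each requester has at most one pending request at a time,
      - every access occupies the resource for at most max_access_latency
        cycles, and this value is known.
    In the worst case the request under analysis is served last in the
    round, after one access from each of the other requesters.
    """
    if n_requesters < 1:
        raise ValueError("need at least one requester")
    if max_access_latency < 0:
        raise ValueError("latency must be non-negative")
    return (n_requesters - 1) * max_access_latency


if __name__ == "__main__":
    # Hypothetical example: 4 cores sharing a bus with a 10-cycle access.
    print(round_robin_ubd(4, 10))
```

A measured ubdm is only useful for worst-case execution time analysis if it is at least as large as a bound of this kind, which is why the abstract stresses approximating ubd from above.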