On Complexity, Energy- and Implementation-Efficiency of Channel Decoders
Future wireless communication systems require efficient and flexible baseband
receivers. Meaningful efficiency metrics are key for design space exploration
to quantify the algorithmic and the implementation complexity of a receiver.
Most of the currently established efficiency metrics are based on counting
operations, thus neglecting important issues such as data and storage complexity.
In this paper we introduce suitable energy and area efficiency metrics which
resolve the aforementioned disadvantages: decoded information bits
per unit of energy and throughput per unit of area. The efficiency metrics are assessed across
various implementations of turbo decoders, LDPC decoders and convolutional
decoders. New exploration methodologies are presented, which permit an
appropriate benchmarking of implementation efficiency, communications
performance, and flexibility trade-offs. These exploration methodologies are
based on efficiency trajectories rather than a single snapshot metric as done
in state-of-the-art approaches.
Comment: Submitted to IEEE Transactions on Communications
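As a back-of-the-envelope illustration of the two proposed metrics, the sketch below computes decoded-information-bits-per-energy and throughput-per-area figures. The decoder data points are hypothetical placeholders, not results from the paper:

```python
def energy_efficiency_bit_per_nj(throughput_mbit_s, power_mw):
    """Decoded information bits per nanojoule.
    (1e6 bit/s) / (1e-3 J/s) = 1e9 bit/J = 1 bit/nJ, so the
    Mbit/s-to-mW ratio directly gives bit/nJ."""
    return throughput_mbit_s / power_mw

def area_efficiency_mbit_s_mm2(throughput_mbit_s, area_mm2):
    """Throughput per unit silicon area (Mbit/s per mm^2)."""
    return throughput_mbit_s / area_mm2

# Hypothetical decoder implementations (illustrative numbers only).
decoders = {
    "turbo_decoder_A": dict(throughput=150.0, power=300.0, area=1.2),
    "ldpc_decoder_B":  dict(throughput=640.0, power=400.0, area=2.0),
}

for name, d in decoders.items():
    print(name,
          energy_efficiency_bit_per_nj(d["throughput"], d["power"]),
          area_efficiency_mbit_s_mm2(d["throughput"], d["area"]))
```

Plotting such pairs for many implementations yields the efficiency trajectories the paper advocates, rather than a single snapshot number.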
Frequency Domain Hybrid-ARQ Chase Combining for Broadband MIMO CDMA Systems
In this paper, we consider high-speed wireless packet access using code
division multiple access (CDMA) and multiple-input multiple-output (MIMO).
Current wireless standards, such as high speed packet access (HSPA), have
adopted multi-code transmission and hybrid-automatic repeat request (ARQ) as
major technologies for delivering high data rates. The key technique in
hybrid-ARQ is that erroneous data packets are kept in the receiver to
detect/decode retransmitted ones. This strategy is referred to as packet
combining. In CDMA MIMO-based wireless packet access, multi-code transmission
suffers from severe performance degradation due to the loss of code
orthogonality caused by both interchip interference (ICI) and co-antenna
interference (CAI). This limitation results in large transmission delays when
an ARQ mechanism is used in the link layer. In this paper, we investigate
efficient minimum mean square error (MMSE) frequency domain equalization
(FDE)-based iterative (turbo) packet combining for cyclic prefix (CP)-CDMA MIMO
with Chase-type ARQ. We introduce two turbo packet combining schemes: i) In the
first scheme, namely "chip-level turbo packet combining", MMSE FDE and packet
combining are jointly performed at the chip-level. ii) In the second scheme,
namely "symbol-level turbo packet combining", chip-level MMSE FDE and
despreading are separately carried out for each transmission, then packet
combining is performed at the level of the soft demapper. The computational
complexity and memory requirements of both techniques are quite insensitive to
the ARQ delay, i.e., maximum number of ARQ rounds. The throughput is evaluated
for some representative antenna configurations and load factors to show the
gains offered by the proposed techniques.
Comment: Submitted to IEEE Transactions on Vehicular Technology (Apr 2009)
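The packet-combining idea behind Chase-type ARQ can be sketched at the soft-value level: the receiver accumulates the demapper LLRs of all transmissions of the same packet before deciding. The following is a generic LLR-combining sketch, not the paper's MMSE FDE turbo receiver:

```python
def chase_combine(llr_rounds):
    """Elementwise sum of per-round LLRs for the same packet."""
    return [sum(vals) for vals in zip(*llr_rounds)]

def hard_decide(llrs):
    """Common convention: positive LLR -> bit 0, negative -> bit 1."""
    return [0 if l >= 0 else 1 for l in llrs]

# Round 1 alone decides bit 2 incorrectly (LLR -0.5 -> bit 1);
# combining with round 2 recovers the transmitted bits [0, 1, 0, 1].
round1 = [2.0, -1.0, -0.5, -2.0]
round2 = [1.0, -2.0, 1.5, -1.0]
combined = chase_combine([round1, round2])
```

Because each retransmission only adds one vector of LLRs to a running sum, memory cost is independent of the number of ARQ rounds, mirroring the insensitivity to ARQ delay noted above.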
System-on-chip Computing and Interconnection Architectures for Telecommunications and Signal Processing
This dissertation proposes novel architectures and design techniques targeting SoC building blocks for telecommunications and signal processing applications.
Hardware implementation of Low-Density Parity-Check decoders is approached at both the algorithmic and the architecture level. Low-Density Parity-Check codes are a promising coding scheme for future communication standards due to their outstanding error correction performance.
This work proposes a methodology for analyzing the effects of finite precision arithmetic on error correction performance and hardware complexity. The methodology is employed throughout for co-designing the decoder. First, a low-complexity check node based on the P-output decoding principle is designed and characterized on a CMOS standard-cell library. Results demonstrate an implementation loss below 0.2 dB down to a BER of 10^{-8} and complexity savings of up to 59% with respect to other works in the recent literature. High-throughput and low-latency issues are addressed with modified single-phase decoding schedules. A new "memory-aware" schedule is proposed, requiring as little as 20% of the memory needed by traditional two-phase flooding decoding. Additionally, throughput is doubled and logic complexity is reduced by 12%. These advantages are traded off against error correction performance, making the solution attractive only for long codes, such as those adopted in the DVB-S2 standard. The "layered decoding" principle is extended to codes not specifically conceived for this technique. The proposed architectures exhibit complexity savings in the order of 40% in both area and power consumption, while the implementation loss is smaller than 0.05 dB.
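A low-complexity check node can be illustrated with the standard min-sum update, a widely used approximation in which each outgoing LLR carries only the sign product and minimum magnitude of the other incoming edges. This is a generic sketch; the thesis's P-output check node is not reproduced here:

```python
def min_sum_check_node(llrs):
    """Min-sum check-node update: the output on edge i is the sign
    product and the minimum magnitude over all *other* incoming LLRs.
    Uses the two smallest magnitudes (min1/min2) so only one pass over
    the inputs is needed; assumes distinct magnitudes for simplicity."""
    sign_product = 1
    for l in llrs:
        if l < 0:
            sign_product = -sign_product
    mags = sorted(abs(l) for l in llrs)
    min1, min2 = mags[0], mags[1]
    out = []
    for l in llrs:
        own_sign = 1 if l >= 0 else -1
        s = sign_product * own_sign           # sign product excluding edge i
        m = min2 if abs(l) == min1 else min1  # min magnitude excluding edge i
        out.append(s * m)
    return out
```

The min1/min2 trick is what makes hardware check nodes cheap: only two comparators' worth of state is kept regardless of the node degree.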
Most modern communication standards employ Orthogonal Frequency Division Multiplexing (OFDM) as part of their physical layer. The core of OFDM is the Fast Fourier Transform (FFT) and its inverse, in charge of symbol (de)modulation. Requirements on throughput and energy efficiency call for hardware FFT implementation, while the ubiquity of the FFT suggests the design of parametric, re-configurable and re-usable IP hardware macrocells. In this context, this thesis describes an FFT/IFFT core compiler particularly suited for the implementation of OFDM communication systems. The tool employs an accuracy-driven configuration engine which automatically profiles the internal arithmetic and generates a core with minimum operand bit-widths and thus minimum circuit complexity. The engine performs a closed-loop optimization over three different internal arithmetic models (fixed-point, block floating-point and convergent block floating-point) using the numerical accuracy budget given by the user as a reference point. The flexibility and re-usability of the proposed macrocell are illustrated through several case studies which encompass all current state-of-the-art OFDM communication standards (WLAN, WMAN, xDSL, DVB-T/H, DAB and UWB). Implementation results are presented for two deep sub-micron standard-cell libraries (65 and 90 nm) and commercially available FPGA devices. Compared with other FFT core compilers, the proposed environment produces macrocells with lower circuit complexity and the same system-level performance (throughput, transform size and numerical accuracy).
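An accuracy-driven configuration engine of this kind can be sketched as a closed-loop search: grow the internal fractional bit-width until a quantized transform meets a user-supplied accuracy (SQNR) budget. The toy below uses a plain fixed-point DFT as a stand-in; the compiler's actual arithmetic models (block floating-point, convergent block floating-point) are not reproduced:

```python
import cmath
import math
import random

def dft(x):
    """Reference full-precision DFT."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N)
                for n in range(N)) for k in range(N)]

def quantize(v, frac_bits):
    """Round a complex value to a fixed-point grid with frac_bits fraction bits."""
    s = 2 ** frac_bits
    return complex(round(v.real * s) / s, round(v.imag * s) / s)

def quantized_dft(x, frac_bits):
    """DFT with inputs, twiddles and accumulations all quantized."""
    N = len(x)
    xq = [quantize(v, frac_bits) for v in x]
    out = []
    for k in range(N):
        acc = 0j
        for n in range(N):
            tw = quantize(cmath.exp(-2j * math.pi * k * n / N), frac_bits)
            acc = quantize(acc + xq[n] * tw, frac_bits)
        out.append(acc)
    return out

def sqnr_db(ref, test):
    sig = sum(abs(r) ** 2 for r in ref)
    err = sum(abs(r - t) ** 2 for r, t in zip(ref, test))
    return 10 * math.log10(sig / err) if err else float("inf")

def min_frac_bits(x, target_sqnr_db, max_bits=24):
    """Closed loop: smallest bit-width meeting the accuracy budget."""
    ref = dft(x)
    for b in range(2, max_bits + 1):
        if sqnr_db(ref, quantized_dft(x, b)) >= target_sqnr_db:
            return b
    return max_bits

random.seed(0)
x = [complex(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(8)]
b = min_frac_bits(x, target_sqnr_db=40.0)
```

A real compiler would run such a loop per arithmetic model and pick the cheapest configuration meeting the budget.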
The final part of this dissertation focuses on the Network-on-Chip design paradigm, whose goal is building scalable communication infrastructures connecting hundreds of cores. A low-complexity link architecture for mesochronous on-chip communication is discussed. The link enables relaxed skew constraints in clock tree synthesis, frequency speed-up, power consumption reduction and faster back-end turnarounds. The proposed architecture reaches a maximum clock frequency of 1 GHz on a 65 nm low-leakage CMOS standard-cell library. In a complex test case with a full-blown NoC infrastructure, the link overhead is only 3% of chip area and 0.5% of leakage power consumption.
Finally, a new methodology, named metacoding, is proposed. Metacoding generates correct-by-construction, technology-independent RTL codebases for NoC building blocks. The RTL coding phase is abstracted and modeled with an Object Oriented framework, integrated within a commercial tool for IP packaging (Synopsys CoreTools suite). Compared with traditional coding styles based on pre-processor directives, metacoding produces 65% smaller codebases and reduces the number of configurations to verify by up to three orders of magnitude.
A Reconfigurable Outer Modem Platform for Future Communications Systems
Future mobile and wireless communications networks require flexible modem architectures with high performance. Efficient utilization of application-specific flexibility is key to fulfilling these requirements. For high throughput, a single processor cannot provide the necessary computational power; hence, multi-processor architectures become necessary. This paper presents a multi-processor platform based on a new dynamically reconfigurable application-specific instruction-set processor (dr-ASIP) for the application domain of channel decoding. Inherently parallel decoding tasks can be mapped onto individual processing nodes. The resulting challenging inter-processor communication is efficiently handled by a Network-on-Chip (NoC) such that the throughput of each node is not degraded. The dr-ASIP features Viterbi and Log-MAP decoding to support the convolutional and turbo codes of more than 10 currently specified mobile and wireless standards. Furthermore, its flexibility allows for adaptation to future systems.
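As a concrete illustration of the decoding task such a platform accelerates, here is a minimal hard-decision Viterbi decoder for the textbook rate-1/2, constraint-length-3 convolutional code with generators (7, 5) octal. The dr-ASIP itself, its instruction set, and its Log-MAP mode are not modeled:

```python
# Generators (7, 5) octal; taps apply to (current bit, prev bit, prev-prev bit).
G = (0b111, 0b101)

def parity(v):
    return bin(v).count("1") & 1

def encode(bits):
    """Convolutional encoder; state holds the last two input bits."""
    state, out = 0, []
    for b in bits:
        reg = (b << 2) | state
        out += [parity(reg & g) for g in G]
        state = reg >> 1
    return out

def viterbi_decode(received):
    """Hard-decision Viterbi: keep the minimum-Hamming-metric path per state."""
    n_states, INF = 4, float("inf")
    metrics = [0] + [INF] * (n_states - 1)   # start in state 0
    paths = [[] for _ in range(n_states)]
    for i in range(0, len(received), 2):
        r = received[i:i + 2]
        new_metrics = [INF] * n_states
        new_paths = [None] * n_states
        for s in range(n_states):
            if metrics[s] == INF:
                continue
            for b in (0, 1):                  # hypothesize the input bit
                reg = (b << 2) | s
                ns = reg >> 1
                expected = [parity(reg & g) for g in G]
                m = metrics[s] + sum(e != x for e, x in zip(expected, r))
                if m < new_metrics[ns]:
                    new_metrics[ns] = m
                    new_paths[ns] = paths[s] + [b]
        metrics, paths = new_metrics, new_paths
    # Trace back from the best final state.
    return paths[min(range(n_states), key=lambda s: metrics[s])]
```

The add-compare-select in the inner loop is exactly the recurrence that parallel processing nodes and application-specific instructions are designed to speed up.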
State of the art baseband DSP platforms for Software Defined Radio: A survey
Software Defined Radio (SDR) is an innovative approach that is becoming an increasingly promising technology for future mobile handsets. Several proposals in the field of embedded systems have been introduced by different universities and industries to support SDR applications. This article presents an overview of current platforms and analyzes the related architectural choices, the current issues in SDR, as well as potential future trends.
Flexible LDPC Decoder Architectures
Flexible channel decoding is gaining significance with the increase in the number of wireless standards and of modes within a standard. A flexible channel decoder is a solution providing interstandard and intrastandard support without changes in hardware. However, the design of efficient implementations of flexible low-density parity-check (LDPC) decoders satisfying area, speed, and power constraints is a challenging task and still requires considerable research effort. This paper provides an overview of the state of the art in the design of flexible LDPC decoders. The published solutions are evaluated at two levels of architectural design: the processing element (PE) and the interconnection structure. A qualitative and quantitative analysis of different design choices is carried out, and a comparison is provided in terms of achieved flexibility, throughput, decoding efficiency, and area (power) consumption.
Datacenter Design for Future Cloud Radio Access Network
Cloud radio access network (C-RAN), an emerging cloud service that combines the traditional radio access network (RAN) with cloud computing technology, has been proposed as a solution to handle the growing energy consumption and cost of the traditional RAN. Through aggregating baseband units (BBUs) in a centralized cloud datacenter, C-RAN reduces energy and cost, and improves wireless throughput and quality of service. However, designing a datacenter for C-RAN has not yet been studied. In this dissertation, I investigate how a datacenter for C-RAN BBUs should be built on commodity servers.
I first design WiBench, an open-source benchmark suite containing the key signal processing kernels of many mainstream wireless protocols, and study its characteristics. The characterization study shows that there is abundant data level parallelism (DLP) and thread level parallelism (TLP). Based on this result, I then develop high performance software implementations of C-RAN BBU kernels in C++ and CUDA for both CPUs and GPUs. In addition, I generalize the GPU parallelization techniques of the Turbo decoder to the trellis algorithms, an important family of algorithms that are widely used in data compression and channel coding.
Then I evaluate the performance of commodity CPU servers and GPU servers. The study shows that a datacenter with GPU servers can meet the LTE standard throughput with 4× to 16× fewer machines than with CPU servers. A further energy and cost analysis shows that GPU servers save on average 13× more energy and 6× more cost. Thus, I propose that the C-RAN datacenter be built using GPUs as the server platform.
Next I study resource management techniques to handle the temporal and spatial traffic imbalance in a C-RAN datacenter. I propose a "hill-climbing" power management scheme that combines powering off GPUs and DVFS to match the temporal C-RAN traffic pattern. Under a practical traffic model, this technique saves 40% of the BBU energy in a GPU-based C-RAN datacenter. For spatial traffic imbalance, I propose three workload distribution techniques to improve load balance and throughput. Among the three techniques, pipelining packets yields the largest throughput improvement, at 10% and 16% for balanced and unbalanced loads, respectively.
PhD dissertation, Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/120825/1/qizheng_1.pd
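A power manager in the spirit of the hill-climbing scheme above can be sketched as a greedy loop: starting from full capacity, power off whole GPUs, then lower the DVFS level, as long as the active configuration still sustains the offered load. The capacity and power numbers below are illustrative assumptions, not figures from the dissertation:

```python
# (relative speed, watts per active GPU) per DVFS level -- hypothetical values.
DVFS_LEVELS = [(1.0, 200.0), (0.8, 140.0), (0.6, 95.0)]
CAPACITY_PER_GPU = 100.0  # offered-load units one GPU handles at full speed

def sustains(n_gpus, level, load):
    """True if n_gpus at the given DVFS level can carry the load."""
    speed, _ = DVFS_LEVELS[level]
    return n_gpus * CAPACITY_PER_GPU * speed >= load

def hill_climb(total_gpus, load):
    """Greedy descent toward the lowest-power sustaining configuration."""
    n, level = total_gpus, 0
    # Step 1: power off whole GPUs while the load still fits.
    while n > 1 and sustains(n - 1, level, load):
        n -= 1
    # Step 2: lower the DVFS level on the remaining GPUs.
    while level + 1 < len(DVFS_LEVELS) and sustains(n, level + 1, load):
        level += 1
    power = n * DVFS_LEVELS[level][1]
    return n, level, power
```

Re-running the loop as the traffic estimate changes lets the datacenter track the diurnal C-RAN load pattern with coarse (GPU on/off) and fine (DVFS) knobs.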