Resource Efficiency of Scalable Processor Architectures for SDR-based Applications (Invited) by Jungeblut, Thorsten  et al.
Resource Efficiency of Scalable Processor Architectures
for SDR-based Applications
Thorsten Jungeblut1, Johannes Ax2, Gregor Sievers2,
Boris Hübener2, Mario Porrmann2, and Ulrich Rückert1
1Cognitive Interaction Technology
Center of Excellence
Bielefeld University, Germany
2Heinz Nixdorf Institute
University of Paderborn, Germany
Abstract
In this work we discuss the resource efficiency of scalable
processor architectures for software-defined radio (SDR)
based applications. The development of resource-efficient
processor architectures is based on a two-stage design flow
that uses a high-level processor specification as a reference.
This design flow has been used to perform a comprehensive
design-space exploration of algorithms from various
application scenarios. As a result, the 4-issue VLIW
architecture CoreVA has been implemented. The fine-grained
parallelism of this architecture allows for performance gains
of up to a factor of three to four for the analyzed application
scenarios. From the design-space exploration at system level,
dedicated hardware accelerators have been derived. The
hardware accelerators proposed in this work improve the
energy efficiency by a factor of 6 to 43. To further improve
the performance of the system, the single-core approach is
extended to network-on-chip (NoC) level. As an example, an
IEEE 802.11b application has been mapped to a cluster of four
processor cores, reducing the processing time of the algorithm
by up to 60 %.
1 Introduction
Wireless communication is finding its way into our daily life.
Complex transmission methods such as WLAN (Wireless Local
Area Network), UMTS (Universal Mobile Telecommunications
System), and LTE (Long Term Evolution) provide increasing
data rates at low latencies. This allows for new applications
like HDTV (High Definition Television), video conferences, or
3D online gaming. A mobile phone therefore has to support
multiple standards. Up to now, heterogeneous hardware
platforms with dedicated accelerator devices have been used.
Although in practice few or none of the algorithms are used
simultaneously, all accelerators contribute to the area (and
therefore the cost) and the power consumption of the mobile
device.
Therefore, modern approaches rely on flexible architec-
tures, which are based on high performance and univer-
sal processors [10, 7, 17, 18, 16]. This principle is called
software-defined radio and allows for the exchange of the
communication methods during run-time. In addition, new
algorithms can be ported to the architecture. High-level
programming languages simplify the development of
applications and are mainly platform independent.
The decreasing feature size of modern fabrication processes
allows for the integration of more and more logic or
memory in embedded systems. Parallel architectures offer
high performance at a reasonably low clock frequency.
Compared to superscalar architectures, very long instruction
word (VLIW) processors allow for a fine-grained scaling
of the computational throughput at low resource requirements.
Examples of current commercial VLIW architectures
are NXP TriMedia [25] and the Texas Instruments
TMS320C6X [24]. Scalable networks-on-chip (NoCs)
with multiple processor cores can further enhance the
available throughput, depending on the application requirements.
Current architectures are, e.g., Tilera Tile64 [1, 2], Intel
TeraFLOPS [26], XPipes [3], Æthereal [23], or FAUST [15].
In this work we discuss the potential of scalable proces-
sor architectures and NoCs for the execution of SDR-based
applications. In Section 2 we propose a two-stage design
flow for the design-space exploration of resource-efficient
processor architectures. Section 3 introduces our modular
VLIW architecture CoreVA. Section 4 presents the system
architecture and proposes a flexible environment for the
integration of hardware accelerators. In Section 5 the
extension of the CoreVA system to NoC level is introduced.
Section 6 concludes this paper and gives an outlook on future
work.
Figure 1. Two-stage design flow for the design-space
exploration of processor architectures
[Diagram labels: UPSLA reference specification; (C-)compiler,
assembler, linker, instruction set simulator, visualization/
profiling; benchmarks, source code, assembler code, object
files, executable, profiling statistics; RTL description, RTL
simulation, synthesis, netlist, emulator (prototype), ASIC
realization; functional verification; resource efficiency;
Vice-UPSLA (planned), definition of the semantic]
2 Processor Design Based on UPSLA as a
Reference Specification
For the development of resource efficient processor ar-
chitectures, a two-stage design-flow is used, which con-
sists of the automatic generation of a complete C-compiler-
based toolchain from a reference specification in the Unified
Processor Specification Language (UPSLA) (cf. Figure 1,
[14, 11]).
The development toolchain encompasses a C-compiler,
an assembler, a linker, a cycle-accurate instruction set simu-
lator (ISS) and various debugging and profiling tools. Start-
ing from the application, the source code is compiled and
can be profiled, e.g., in terms of clock cycles or functional
units utilization. The hardware design flow is based on the
register-transfer level (RTL) description of the architecture.
The executables of the application can be simulated using
RTL simulation or verified functionally using our modular
rapid prototyping environment RAPTOR [21]. Combining
the synthesis results (with respect to the resource
requirements) and the profiling results, the resource
efficiency is derived. To verify the consistency of the
software and the hardware domains, the simulation by
validation approach of [5] is used.
Figure 2. 6-staged pipeline of the CoreVA processor
[Diagram labels: pipeline stages FE (Instruction Fetch),
DC (Instruction Decode), RD (Register Read), EX, ME (Memory),
WR (Register Write); instruction memory, data memory, ALU0 to
ALU3, multiply and divide units, LD/ST units, condition
register, register file, bypass]
3 The modular CoreVA VLIW-architecture
Using this design flow we implemented the resource effi-
cient VLIW-architecture called CoreVA. The RTL descrip-
tion of the architecture is widely configurable: The number
of VLIW-slots (even single slot), arithmetic-logical units
(ALUs), dedicated multiply-accumulate (MLA) or division
step (DVS) units can be specified. Load/store interfaces can
be assigned to each VLIW-slot. Using a greedy approach,
optimized pipeline-bypass configurations [4] can be
derived by the systematic deactivation of rarely used bypass
paths.
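The greedy deactivation of bypass paths could be sketched as follows. This is only an illustration and not the procedure of [4]: the path names, the usage counts, and the cycle-overhead budget are hypothetical, and the cost model (each use of a deactivated path costs one stall cycle) is an assumption.

```python
from typing import Dict, List, Tuple


def prune_bypasses(usage: Dict[str, int], total_cycles: int,
                   max_overhead: float) -> Tuple[List[str], List[str]]:
    """Greedily deactivate the least-used bypass paths.

    usage maps a (hypothetical) bypass path name to how often the profiled
    code forwards a result over it. Assumed cost model: every use of a
    deactivated path costs one extra stall cycle.
    """
    kept = sorted(usage, key=lambda p: usage[p], reverse=True)
    removed: List[str] = []
    extra_cycles = 0
    # Visit the paths from rarely used to frequently used.
    for path in sorted(usage, key=lambda p: (usage[p], p)):
        if (extra_cycles + usage[path]) / total_cycles > max_overhead:
            break  # deactivating this path would exceed the cycle budget
        extra_cycles += usage[path]
        removed.append(path)
        kept.remove(path)
    return kept, removed


# Hypothetical profiling data: forwarding events per bypass path.
profile = {"EX0->EX0": 9000, "EX0->EX1": 4000, "ME0->EX1": 150, "ME1->EX3": 20}
kept, removed = prune_bypasses(profile, total_cycles=100_000, max_overhead=0.01)
print("kept:", kept)
print("removed:", removed)
```

With a budget of 1 % additional cycles, only the two rarely used paths are deactivated in this example. A real exploration would also weigh the area and timing gain of each removed path.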
From design-space exploration of a heterogeneous set
of algorithms of various application scenarios (synthetic
benchmarks, coding, baseband processing, error correction,
cryptography, and image processing) the configuration of
the first implementation of the CoreVA architecture has
been derived [12, 6, 10, 7]. The CoreVA architecture
embeds four VLIW slots with four ALUs and two dedicated
MLA and DVS units (cf. Figure 2). The level-1
caches for instructions and data (two-port) with a size of
16 kByte each interface to external memory [13]. For
applications with low memory requirements the caches can be
configured to a scratchpad mode at run-time to avoid high
latencies on cache misses. The fine-grained parallelism
of the VLIW architecture allows for performance gains of
up to a factor of three to four for the analyzed application
scenarios. Figure 3 combines the layout with a chip photo
of the CoreVA architecture. The maximum frequency of the
ASIC prototype is 400 MHz (1.2 V, 25 °C) at a power
consumption of 100 mW. The area requirements are 2.7 mm² in a
65 nm standard cell technology from STMicroelectronics.
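Combining synthesis and profiling results as described in Section 2, the energy of one benchmark run follows from E = P · N / f, with the power P, the cycle count N, and the clock frequency f. The sketch below applies this relation to the operating point given above (400 MHz, 100 mW). The cycle count is a hypothetical placeholder, and treating the reported power as constant over the whole run is an assumption.

```python
def energy_joule(cycles: int, f_hz: float, power_w: float) -> float:
    """Energy of one run: execution time (cycles / f) times average power."""
    return power_w * cycles / f_hz


F_HZ = 400e6      # maximum clock frequency of the ASIC prototype
POWER_W = 100e-3  # power consumption at that operating point

energy_per_cycle = energy_joule(1, F_HZ, POWER_W)
print(f"{energy_per_cycle * 1e9:.2f} nJ per cycle")

cycles = 1_000_000  # hypothetical cycle count taken from a profiling run
energy_per_run = energy_joule(cycles, F_HZ, POWER_W)
print(f"{energy_per_run * 1e6:.1f} uJ per run")
```

Under these assumptions each clock cycle costs about 0.25 nJ, so any reduction of the cycle count translates directly into an energy saving as long as the power consumption does not grow by the same factor.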
Figure 3. Chip photo (upper left) and layout
(lower right) of the CoreVA architecture
4 Hardware accelerators
Figure 4 depicts the system architecture of the CoreVA
processor. Besides the instruction and data caches, the
processor core can access dedicated hardware accelerators
via memory-mapped I/O (MMIO). Hardware extensions are
mapped to the logical memory address space. Depending
on the memory address, an address decoder forwards
accesses either to the physical memory or to an extension.
Application-specific hardware accelerators allow for high
performance gains but require considerable additional
hardware effort. Nevertheless, due to a large decrease
of the processing time, the overall energy efficiency can be
increased by more than an order of magnitude. Table 1 shows
the resource efficiency of four hardware accelerators
(cyclic redundancy check (CRC), elliptic curve cryptography
(ECC) [12], IEEE 802.11b [7], and advanced encryption
standard (AES)). To limit area requirements and costs,
it may not be possible to integrate all available dedicated
hardware extensions on the processor die. Therefore, we
implemented a generic interface to either integrate hardware
extensions tightly coupled in the processor core, or
map them loosely coupled to a dedicated FPGA (cf. Figure 4).
The reconfigurability of the FPGA supports the dynamic
exchange of hardware extensions during run-time.
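The address decoding described above can be illustrated with a small behavioural model. It is a sketch only: the address ranges, the class names, and the register layout of the toy CRC extension are hypothetical and do not reflect the actual CoreVA memory map or accelerator interface.

```python
import zlib


class CrcExtension:
    """Toy accelerator: write bytes to offset 0, read the CRC-32 at offset 4."""

    def __init__(self):
        self.crc = 0

    def write(self, offset, value):
        if offset == 0:
            # Update the running CRC-32 with one more byte.
            self.crc = zlib.crc32(bytes([value & 0xFF]), self.crc)

    def read(self, offset):
        return self.crc if offset == 4 else 0


class AddressDecoder:
    """Forwards each access to data memory or to a memory-mapped extension."""

    def __init__(self):
        self.memory = {}   # sparse model of the physical data memory
        self.regions = []  # list of (base, size, extension)

    def map_extension(self, base, size, ext):
        self.regions.append((base, size, ext))

    def _lookup(self, addr):
        for base, size, ext in self.regions:
            if base <= addr < base + size:
                return ext, addr - base
        return None, addr

    def store(self, addr, value):
        ext, offset = self._lookup(addr)
        if ext is not None:
            ext.write(offset, value)
        else:
            self.memory[addr] = value

    def load(self, addr):
        ext, offset = self._lookup(addr)
        if ext is not None:
            return ext.read(offset)
        return self.memory.get(addr, 0)


CRC_BASE = 0x8000_0000  # hypothetical base address of the extension
bus = AddressDecoder()
bus.map_extension(CRC_BASE, 0x10, CrcExtension())
bus.store(0x100, 42)            # ordinary access, goes to data memory
for byte in b"123456789":
    bus.store(CRC_BASE, byte)   # forwarded to the accelerator
print(hex(bus.load(CRC_BASE + 4)))
```

From the software point of view the accelerator is used with ordinary load and store instructions, which is why the same code could address an extension inside the processor core or one mapped to the FPGA.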
5 The CoreVANoC
Figure 4. System architecture of the CoreVA processor
[Diagram labels: ASIC with CoreVA CPU, instruction cache, data
cache, MMIO, arbiter, and ISEs (CRC, ECC, IEEE 802.11b);
RAPTOR system with FPGA, system bus, system bus controller,
local bus interface, SDRAM controller, SDRAM, ETH MAC,
ETH PHY; host PC]
To further enhance the performance of the system, the
single-core based approach can be extended to a multi-core
architecture [8, 9]. Whereas the fine-grained parallelism
of VLIW architectures can improve the throughput
for applications with a high instruction level parallelism
(ILP), NoCs can enhance the efficiency of the system by
applying software pipelining or exploiting data concurrency.
The CoreVANoC is based on the GigaNoC [22, 20, 19]
and represents a hierarchical NoC that is especially
suitable for scalable multiprocessor architectures. The
CoreVANoC features packet-switched wormhole routing with a
link bandwidth of up to 24 GBit/s. The backbone of the NoC
is a parameterizable Switch-Box (SB). Each SB interfaces
to a dedicated CoreVA processor. The area requirements are
about 0.5 mm² per SB at a maximum clock frequency of
750 MHz.
As an example, the IEEE 802.11b application of [7] has
been mapped to the CoreVANoC (cf. Figure 5). The
algorithm is partitioned across four processor nodes. The first
node performs scrambling, differential encoding, and
symbol mapping. The FIR filter is split into in-phase (I) and
quadrature (Q) components, each mapped to a dedicated
processor node. A fourth CoreVA is required for the
synchronization and post-processing of the I/Q data. By using
the CoreVANoC, the processing time of the IEEE 802.11b
algorithm could be reduced by up to 60 % (cf. Figure 6).
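The effect of distributing the processing steps over several nodes can be estimated with a simple pipeline model: on a single core the stage times add up, whereas in a software pipeline the steady-state time per data block is bounded by the most heavily loaded node. The cycle counts below are hypothetical placeholders, not measurements from Figure 6, and the communication overhead of the NoC is ignored, so the model gives an optimistic bound.

```python
def single_core_cycles(stages):
    """All processing steps run back to back on one processor."""
    return sum(stages.values())


def pipelined_cycles(mapping, stages):
    """Steady-state cycles per block: the busiest node sets the pace."""
    return max(sum(stages[step] for step in node) for node in mapping)


# Hypothetical cycles per data block for each processing step.
stages = {
    "scramble+diff_enc+mapping": 300,
    "fir_i": 400,
    "fir_q": 400,
    "sync": 200,
}
# One list of steps per processor node, following the four-node partitioning.
mapping = [["scramble+diff_enc+mapping"], ["fir_i"], ["fir_q"], ["sync"]]

t1 = single_core_cycles(stages)
t4 = pipelined_cycles(mapping, stages)
reduction = 1 - t4 / t1
speedup = t1 / t4
print(f"single core: {t1} cycles, four nodes: {t4} cycles per block")
print(f"reduction {reduction:.0%}, speedup x{speedup:.2f}")
```

For comparison, a reduction of the processing time by 60 % corresponds to a speedup factor of 1/(1 - 0.6) = 2.5. The model also shows why the two FIR filter nodes limit the achievable gain in this partitioning: adding nodes helps only as long as the busiest stage can be split further.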
6 Conclusion
In this paper we discussed the resource efficiency of scal-
able processor architectures and NoCs for SDR-based ap-
plications. We introduced a two-staged design-flow used
Hardware      Processing time   Area           Power         Energy
accelerator   (speedup)         requirements   consumption   efficiency
CRC           -87 % (×8)        +0.7 %         +0.6 %        ×6.8
ECC           -93 % (×14)       +30 %          +30 %         ×11.0
802.11b       -88 % (×8)        +40 %          +19 %         ×6.0
AES           -99 % (×66)       +39 %          +54 %         ×43.0
Table 1. Resource efficiency of hardware accelerators
Figure 5. Mapping of the IEEE 802.11b algorithm to the
CoreVANoC
[Diagram labels: four CoreVA nodes, each attached to a
Switch-Box (SB) with packet memory, communication controller,
crossbar, and ports 0 to 4; tasks per node: scrambling,
differential encoding, and symbol mapping; FIR filter (I part);
FIR filter (Q part); synchronization]
for the design-space exploration of resource efficient pro-
cessor architectures. The design flow is based on a central
processor specification in the UPSLA language. The design
flow is highly automated to shorten the iteration cycles of
the design-space exploration. As a result of a comprehensive
profiling of a large set of applications from SDR scenarios,
the CoreVA VLIW architecture has been derived.
The fine-grained parallelism of the architecture allows for
performance improvements of the selected applications by a
factor of three to four.
By extending the design-space exploration to system
level, several hardware accelerators have been implemented.
These application-specific extensions improve the
energy efficiency by a factor of 6 to 43.
To further improve the performance, the single-core
approach can be extended to NoC level. The CoreVANoC is
based on the GigaNoC and integrates the CoreVA VLIW
architecture as a processor node. As an example, an IEEE
802.11b algorithm has been mapped to a cluster of four
CoreVA processors. Different mapping strategies have been
compared, leading to a reduction of the processing time of
up to 60 %.
Figure 6. Execution time of the IEEE 802.11b
algorithm for different mapping strategies
[Plot: clock cycles per byte (0 to 4000) over input data size
in bytes (8 to 512) for CoreVA, CoreVA-NoC (1 PE),
CoreVA-NoC (2 PE), and CoreVA-NoC (4 PE)]
References
[1] A. Agarwal. The Tile Processor: A 64-core Multicore for
Embedded Processing. In Proc. of HPEC Workshop, 2007.
[2] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Le-
ung, J. MacKay, M. Reif, L. Bao, J. Brown, et al. Tile64-
processor: A 64-core SoC with Mesh Interconnect. In Solid-
State Circuits Conference, 2008. ISSCC 2008. Digest of
Technical Papers. IEEE International, pages 88–598. IEEE,
2008.
[3] D. Bertozzi, A. Jalabert, S. Murali, R. Tamhankar, S. Ster-
giou, L. Benini, and G. D. Micheli. NoC Synthesis Flow
for Customized Domain Specific Multiprocessor Systems-
on-Chip. IEEE, 16, 2005.
[4] R. Dreesen, T. Jungeblut, M. Thies, and U. Kastens. De-
pendence Analysis of VLIW Code for Non-Interlocked
Pipelines. In Proceedings of the 8th Workshop on Optimiza-
tions for DSP and Embedded Systems (ODES-8), Apr. 2010.
[5] R. Dreesen, T. Jungeblut, M. Thies, M. Porrmann, U. Kastens,
and U. Rückert. A Synchronization Method for Register
Traces of Pipelined Processors. In Analysis, Architectures
and Modelling of Embedded Systems, pages 207–217.
Springer Boston, 2009.
[6] T. Jungeblut, R. Dreesen, M. Porrmann, U. Rückert, and
U. Hachmann. Design Space Exploration for Resource Efficient
VLIW-Processors. In University Booth of the Design,
Automation and Test in Europe (DATE) conference, 2008.
[7] T. Jungeblut, R. Dreesen, M. Porrmann, M. Thies,
U. Rückert, and U. Kastens. A Framework for the Design
Space Exploration of Software-Defined Radio Applications.
In The 2nd International ICST Conference on Mobile
Lightweight Wireless Systems, May 2010.
[8] T. Jungeblut, M. Grünewald, M. Porrmann, and U. Rückert.
Real-time Multiprocessor SoC for Mobile Ad Hoc Networks. In
Proceedings of the Conference on Design, Automation and
Test in Europe (DATE ’07) – University Booth, 16–20 Apr.
2007.
[9] T. Jungeblut, M. Grünewald, M. Porrmann, and U. Rückert.
Realtime Multiprocessor for Mobile Ad Hoc Networks.
Advances in Radio Science, 6:239–243, 2008.
[10] T. Jungeblut, D. Klassen, R. Dreesen, M. Porrmann,
M. Thies, U. Rückert, and U. Kastens. Design Space
Exploration for Next Generation Wireless Technologies. In Proc.
of the Electrical and Electronic Engineering for
Communication Conference (EEEfCOM) 2009, 2009.
[11] T. Jungeblut, S. Lütkemeier, G. Sievers, M. Porrmann, and
U. Rückert. A Modular Design Flow for Very Large Design
Space Explorations. In Proceedings of the CDNLive! EMEA
2010, Munich, Germany, May 2010.
[12] T. Jungeblut, C. Puttmann, R. Dreesen, M. Porrmann,
M. Thies, U. Rückert, and U. Kastens. Resource Efficiency
of Hardware Extensions of a 4-issue VLIW Processor for
Elliptic Curve Cryptography. Advances in Radio Science,
2010.
[13] T. Jungeblut, G. Sievers, M. Porrmann, and U. Rückert.
Design Space Exploration for Memory Subsystems of VLIW
Architectures. In 5th IEEE International Conference on
Networking, Architecture, and Storage (NAS 2010), July 2010.
[14] U. Kastens, D. K. Le, A. Slowik, and M. Thies. Feedback
Driven Instruction-Set Extension. ACM SIGPLAN Notices,
39(7):135, 2004.
[15] D. Lattard. A Reconfigurable Baseband Platform Based on
an Asynchronous Network-on-Chip. IEEE J. Solid-State
Circuits, 43, 2008.
[16] D. Liu, A. Nilsson, E. Tell, D. Wu, and J. Eilert. Bridging
Dream and Reality: Programmable Baseband Processors for
Software-defined Radio. Communications Magazine, IEEE,
47(9):134–140, 2009.
[17] S. Mamidi, E. Blem, M. Schulte, J. Glossner, D. Iancu,
A. Iancu, M. Moudgill, and S. Jinturkar. Instruction Set
Extensions for Software Defined Radio on a Multithreaded
Processor. In Proceedings of the 2005 international confer-
ence on Compilers, architectures and synthesis for embed-
ded systems, pages 266–273. ACM, 2005.
[18] S. Mamidi, E. Blem, M. Schulte, J. Glossner, D. Iancu,
A. Iancu, M. Moudgill, and S. Jinturkar. Instruction Set Ex-
tensions for Software Defined Radio. Microprocessors and
Microsystems, 33(4):260–272, 2009.
[19] J.-C. Niemann. Ressourceneffiziente Schaltungstechnik
eingebetteter Parallelrechner – GigaNetIC. 2009.
[20] J.-C. Niemann, C. Puttmann, M. Porrmann, and U. Rückert.
Resource Efficiency of the GigaNetIC Chip Multiprocessor
Architecture. ScienceDirect, Systems Architecture, 2006.
[21] M. Porrmann, J. Hagemeyer, C. Pohl, J. Romoth, and
M. Strugholtz. Parallel Computing: From Multicores and
GPU’s to Petascale, Advances in Parallel Computing, vol-
ume 19, chapter RAPTOR – A Scalable Platform for Rapid
Prototyping and FPGA-based Cluster Computing, pages
592–599. IOS press, 2010.
[22] C. Puttmann, J.-C. Niemann, M. Porrmann, and U. Rückert.
GigaNoC – A Hierarchical Network-on-Chip for Scalable
Chip-Multiprocessors. 2007.
[23] E. Rijpkema. Trade offs in the design of a router with
both guaranteed and best-effort services for network on chip.
IEEE Proc. Computers and Digital Techniques, 150, 2003.
[24] TMS320C64x/C64x+ DSP CPU and Instruction Set Refer-
ence Guide. Technical report, Texas Instruments, 2008.
[25] J.-W. van de Waerdt, S. Vassiliadis, S. Das, S. Mirolo,
C. Yen, B. Zhong, C. Basto, J.-P. van Itegem, D. Amirtharaj,
K. Kalra, P. Rodriguez, and H. van Antwerpen. The TM3270
Media-Processor. In MICRO 38: Proceedings of the 38th
annual IEEE/ACM International Symposium on Microarchi-
tecture, pages 331–342, Washington, DC, USA, 2005. IEEE
Computer Society.
[26] S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson,
J. Tschanz, D. Finan, A. Singh, T. Jacob, S. Jain, et al.
An 80-tile sub-100-W Teraflops Processor in 65-nm CMOS,
2008.
