I. INTRODUCTION
Wireless communications have been one of the main forces behind the growth of the microelectronics industry for the past two decades [1]. The continuous technological advance in this field has required the design and implementation of increasingly complex Systems-on-Chip (SoCs) to cope with the growing complexity of algorithms and communication protocols. Furthermore, the past few years have witnessed a gradual shift towards the implementation of multi-mode and multi-standard transceivers [2].
The implementation of multi-standard, multi-mode transceivers introduces new design challenges, and therefore new architectural solutions are required. These solutions cannot be based on a simple collage of single-standard transceivers because of intrinsic constraints such as energy and power consumption, integration costs, and the area and weight of the devices. More integrated and flexible solutions have to be explored for the realization of multi-mode, multi-standard radios. The concept of a flexible radio implemented as a software radio was introduced in [3] to meet these new requirements.
Flexible radio platforms not only offer a high degree of flexibility to users but also help the industry keep design costs under control. If a single architectural solution can be used for a wider set of applications, the non-recurring engineering (NRE) costs, which increase with technology scaling, can be shared among them. To provide the computational power required by the target application, system designers have to rely on the latest technology nodes, which provide a high computational density. However, the utilization of Ultra-Deep SubMicron (UDSM) technology nodes requires larger financial investments to cover the high costs of designing and verifying complex SoCs, as well as the intrinsically higher cost of silicon real estate. Therefore, to counterbalance the financial investment, the design effort of Very Large Scale Integration (VLSI) systems must be shared across different families of products and, if possible, across different application domains [4].
The work was financially supported by the Graduate School in Electronics, Telecommunications and Automation (GETA) and Tampere University of Technology. Research grants were received from the Tuula ja Yrjö Neuvo Foundation, the Nokia Foundation, the Ulla Tuominen Foundation and the Tekniikan Edistämissätiö, which are all gratefully acknowledged.
Along with flexibility and computational power, power and energy consumption are fundamental design parameters for mobile applications [5]. Moreover, power consumption has become one of the most crucial design parameters for VLSI systems in general: the high computational density provided by UDSM technology nodes comes with a high power density, which has become one of the main limiting factors of silicon technology [6]. High power densities on the chip lead to hard and soft errors (e.g. electromigration and IR drops), which ultimately have a negative impact on yield and reduce the industry's profit margins.
High computational power, flexibility and power efficiency can be obtained through platforms based on multi-processor systems. Multi-Processor Systems-on-Chip (MP-SoCs) have been gaining growing interest from the research community as a feasible way to implement modern radio systems [7]. Heterogeneous multi-processors are today the most common architectural solution, since they are able to deliver high performance with high efficiency, measured for example in giga operations per second per milliwatt (GOPS/mW); each processing node of the architecture is optimized to perform a predefined set of tasks. On the other hand, the cost of designing and verifying such systems increases dramatically with technology scaling. Moreover, at the application level, software engineers have to design software tool-chains able to work with different target architectures, programming languages, compilers and tools.
Homogeneous multi-processor architectures have gained growing attention from the research community because of their inherent ability to keep design costs, and especially verification costs, under control. They also simplify development at the software level. As a drawback, homogeneous architectures provide a limited computational efficiency compared to heterogeneous systems. However, future technology nodes will allow the integration of more and more processing elements, thus mitigating the performance gap.
From the application point of view, the utilization of multi-processor architectures introduces new challenges for the development of algorithms: the design and implementation of algorithms must now consider the parallelism exposed by the platforms, providing efficient solutions that take full advantage of the provided parallelism to improve the computational efficiency of the whole system.
II. OBJECTIVE AND SCOPE OF RESEARCH
The aim of this research work was the design and implementation of software defined radio (SDR) algorithms on a homogeneous multi-processor architecture. Two main goals were set: 1) the identification of a homogeneous multi-processor architecture as a reference platform for the implementation of SDR algorithms; and 2) the design and implementation of algorithms for digital base-band processing that are able to exploit the parallelism made available by the underlying hardware in order to achieve high computational efficiency as well as high power efficiency.
The reference platform needs to provide high computational power, a high degree of flexibility, and dynamic power management in order to improve the power and energy efficiency of the system. Therefore, homogeneous multi-processor architectures based on simple tile structures interconnected by a Network-on-Chip (NoC) were considered as potential reference architectures.
The designed algorithms and their related implementations need to efficiently exploit the parallel resources made available by the platform. Moreover, to ensure the portability of the proposed solutions across different platforms, the proposed implementations should not be tightly bound to the architectural solution of the reference platform. Finally, to ensure power and computational efficiency, the proposed parallel implementations should be highly scalable with the number of computational nodes.
III. MAIN RESULTS
The main results achieved by this research work are twofold: 1) the definition of Ninesilica, a homogeneous multi-processor architecture, as a representative platform for the implementation of software defined radios; and 2) the design and implementation of wireless communication algorithms able to take full advantage of the parallelism made available by the proposed reference platform.
Ninesilica is a homogeneous multi-processor architecture composed of a 3x3 mesh of computational nodes (CNs) interconnected by a hierarchical NoC [8]. Each CN hosts a RISC core as its processing element. The central node works as the master of the architecture and takes care of task and data scheduling for the whole system. The other CNs can work on independent tasks as well as act as parallel accelerators. Dynamic power saving techniques were also implemented on Ninesilica, with significant results in terms of power reduction [9], [10]. Moreover, a Ninesilica cluster can also be utilized as an elementary building block for the implementation of clustered many-core architectures, ensuring high scalability in terms of hardware [11].
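The organization described above can be illustrated with a minimal Python sketch. It only models the 3x3 arrangement of CNs and the role of the central master node; the names `ComputationalNode`, `build_mesh` and `schedule`, as well as the round-robin scheduling policy, are assumptions made for illustration and are not taken from the Ninesilica implementation.

```python
from dataclasses import dataclass, field

MESH_ROWS = 3
MESH_COLS = 3
MASTER = (1, 1)  # the central node of the 3x3 mesh acts as master


@dataclass
class ComputationalNode:
    """Hypothetical model of one CN, identified by its mesh coordinates."""
    row: int
    col: int
    tasks: list = field(default_factory=list)

    @property
    def is_master(self):
        return (self.row, self.col) == MASTER


def build_mesh():
    """Create the nine CNs of the 3x3 mesh in row-major order."""
    return [ComputationalNode(r, c)
            for r in range(MESH_ROWS)
            for c in range(MESH_COLS)]


def schedule(mesh, tasks):
    """Distribute tasks over the eight non-master CNs.

    Round-robin is an assumed policy chosen for simplicity; the master
    node only schedules and receives no tasks in this sketch.
    """
    workers = [node for node in mesh if not node.is_master]
    for index, task in enumerate(tasks):
        workers[index % len(workers)].tasks.append(task)
    return workers


if __name__ == "__main__":
    mesh = build_mesh()
    for worker in schedule(mesh, [f"task{i}" for i in range(16)]):
        print((worker.row, worker.col), worker.tasks)
```

With sixteen equal tasks, each of the eight worker CNs receives two, which corresponds to the case in which the workers act together as a parallel accelerator.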
The proposed algorithms for W-CDMA and OFDM based systems achieved a high exploitation of the parallelism made available by the Ninesilica architecture. Simulation results showed that the proposed algorithm implementations reached a parallelization efficiency close to the theoretical limits [12]. Furthermore, the proposed algorithms were able to take full advantage of the dynamic power management system provided by the Ninesilica platform, leading to significant reductions in energy and power consumption [13].
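For reference, parallelization efficiency is commonly defined as the speedup over the single-node execution divided by the number of nodes used. The short Python sketch below applies this standard definition; the function names and the cycle counts in the usage example are hypothetical and are not results from the thesis.

```python
def speedup(t_serial, t_parallel):
    """Speedup of the parallel run over the single-node run."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("execution times must be positive")
    return t_serial / t_parallel


def parallel_efficiency(t_serial, t_parallel, n_nodes):
    """Speedup normalized by node count; 1.0 is the ideal (linear) limit."""
    if n_nodes <= 0:
        raise ValueError("n_nodes must be positive")
    return speedup(t_serial, t_parallel) / n_nodes


if __name__ == "__main__":
    # Illustrative cycle counts only (assumed, not measured values):
    # a kernel taking 8000 cycles on one node and 1250 cycles on 8 nodes.
    print(parallel_efficiency(8000, 1250, 8))
```

An efficiency close to 1.0 on eight worker nodes means that communication and scheduling overheads are small compared with the computation itself.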
The proposed algorithm implementations are not tightly bound to the Ninesilica architecture. Therefore, such solutions could be ported to similar architectures without a loss in terms of scalability and parallelization efficiency. As an example, a homogeneous architecture based on a lightweight SIMD processor core working at frequencies in the range of 1 GHz (e.g. replacing the RISC core with a 2-way SIMD core at 800 MHz) would provide relative performance similar to that of Ninesilica, but with a tenfold improvement in absolute performance. Such an architecture would be a competitive alternative to today's multi-core DSPs (based on a few powerful DSP cores interconnected via a bus-based communication system), providing the required performance as well as high power efficiency and hardware/software scalability.
ACKNOWLEDGMENT
The author would like to thank all the co-authors of the published works that form the basis of the Ph.D. thesis.
