The rapid development of wireless communication protocols poses major challenges for the design of mobile baseband processors. In this paper, we propose a novel multi-core vector-scalar architecture with a heuristic instruction set that can achieve high-performance processing within a budgeted power consumption and area cost across the major computing blocks of different communication protocols. The proposed architecture consists of four Vector-Scalar Engine Pairs (VSEPs). Each pair can support two data streams for multi-protocol applications. Each vector-scalar engine pair shares a common pipeline, and the vector engine (VE) mainly handles the symbol-level data processing of wireless communication standards, such as OFDM (Orthogonal Frequency-Division Multiplexing) demodulation, while the scalar engine (SE) simultaneously calculates the key parameters based on the heuristic instructions. We verify the performance of the architecture by benchmarking typical algorithms such as FFT (Fast Fourier Transform), channel estimation and MIMO (Multiple-Input Multiple-Output) detection. The results show that the proposed architecture achieves better performance on average for 4G wireless communication.
Introduction
The 4th generation of communication technology, represented by LTE (Long Term Evolution)/LTE-A (LTE-Advanced), has already reached its commercial stage. LTE-A evolves from LTE and provides higher throughput as well as backward compatibility with the existing LTE standard. The 5th generation is still at the research stage at many companies such as Ericsson, Qualcomm, Huawei and Bell Laboratories. The major challenges in designing a baseband processor are the exponentially growing computational density of baseband processing and the coexistence of multiple communication standards, such as 3G, 4G, and 5G [1].
Because of its design simplicity, the ASIC (Application Specific Integrated Circuit) approach still plays a significant role in mobile baseband chip design for accelerating GSM (Global System for Mobile Communication), 3G and 4G communication [2]. This ASIC hardware acceleration approach separates the whole data stream processing into a few stages, such as I/Q data reception, filtering and imbalance compensation, frequency correction, demodulation and measurement, de-mapping and channel estimation, and decoding [3]. For each stage, a dedicated module is designed, which normally can handle only the present protocol for a single-mode application. When multi-mode application is required, a few more modules are designed for the same processing stage, together with on-chip resource reuse among those working modes.
The existence of multiple wireless communication protocols demands a high-performance baseband processor that is at the same time tightly constrained in power consumption and area cost. Performance, power, flexibility and upgrade cost must therefore be considered together during the design stage. Figure 1 presents the balance between flexibility and performance for all popular architectures. A general-purpose single-core processor has the best flexibility but a lower performance-to-power ratio, whereas an ASIC has the best performance-to-power ratio but less flexibility [4]. Software Defined Radio (SDR) baseband technology, with its high flexibility and low upgrade cost, is gradually replacing the traditional ASIC design method. Multi-core high-performance processor architectures based on SDR technology have become popular in recent years and can potentially solve the problem [5]. By programming different frequencies, modulation types and multiple access schemes, an SDR processor can adapt quickly to any standard protocol, and running multiple standards at the same time in one engine also becomes possible. As the protocols evolve, the increasing demand for high bit rates and short delays makes the design of an SDR processor architecture a complex project. Many factors must be taken into consideration, such as area, performance and energy efficiency. Compared with existing SDR processors, the proposed VSEP architecture seeks a better tradeoff between performance per power and flexibility/programmability.
For wireless communication, the receiver can in general be divided into three parts according to its data flow and the characteristics of the processing tasks. In a communication protocol, symbol-level processing is the key module that significantly impacts the performance-to-energy ratio. Symbol-level processing requires the computing units to handle multiple data items at the same time, a requirement that a SIMD (Single Instruction Multiple Data) processor or vector processor fits well. This paper focuses on the proposed novel architecture for symbol-level processing, which uses multiple vector-scalar pairs with a special heuristic instruction set architecture that can regenerate instructions for the scalar engine whenever parameter data is required while the vector engine is running. This paper is organized as follows. Section 2 gives an overview of the background on SIMD processors and vector processing technology. The main concept underlying the proposed architecture is given in Section 3. Finally, evaluation results are discussed in Section 4.
Background
The fundamental algorithms inside communication standards include FFT/IFFT, channel estimation and MIMO detection, all of which require data parallelism for higher throughput. Several solutions that exploit SIMD architectures have been proposed in the literature.
Schoenes and Eberli have introduced a novel DSP (Digital Signal Processing) architecture for SDR, which achieves very high data throughput by means of a massively parallel arithmetic unit [6]. Based on a radix-4 butterfly structure that is optimized for complex-valued arithmetic, the processor's datapath enables extremely fast FFT computations. Furthermore, a reconfigurable instruction set offers exceptional programming flexibility and increased code efficiency.
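To make the radix-4 butterfly concrete, the following is a minimal Python sketch of the textbook 4-point butterfly without twiddle factors. It is not the datapath of [6]; the function names `radix4_butterfly` and `dft4_reference` are hypothetical and chosen only for illustration. The point it shows is that the butterfly needs only complex additions and multiplications by -j, which is why it maps well onto complex-valued arithmetic units.

```python
import cmath


def radix4_butterfly(a, b, c, d):
    """4-point DFT of (a, b, c, d) using additions and one multiplication by -j.

    Illustrative textbook form; not the hardware structure described in [6].
    """
    t0 = a + c
    t1 = a - c
    t2 = b + d
    t3 = (b - d) * -1j  # multiplying by -j only swaps and negates components
    return (t0 + t2, t1 + t3, t0 - t2, t1 - t3)


def dft4_reference(x):
    """Direct 4-point DFT, used only to check the butterfly."""
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / 4) for n in range(4))
        for k in range(4)
    ]
```

A larger FFT built from this butterfly would additionally multiply three of the four inputs by twiddle factors before the butterfly (or the outputs after it, depending on the decimation scheme).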
Kees van Berkel et al. have presented a heterogeneous hardware architecture with the programmable vector processor EVP (Embedded Vector Processor) as the key component, which can support WLAN (Wireless Local Area Networks), UMTS (Universal Mobile Telecommunications System), and other standards [7]. The SIMD width is scalable, and the maximum available parallelism equals five vector operations plus four scalar operations, three address updates and additional loop control.
In order to achieve a more efficient and higher-performance architecture for multi-standard processing, Mark Woh et al. have designed an enhanced SDR architecture, named AnySP, which consists of SIMD and scalar data paths. The SIMD data path consists of eight groups of 8-wide SIMD units, which can be configured to implement SIMD widths of 16, 32, and 64. Each of the 8-wide SIMD units is composed of groups of Flexible Functional Units (FFUs). The FFUs contain the functional units of two lanes that are connected through a simple crossbar. Eight SIMD register files feed the SIMD data path. Each register file contains 16 entries, where each entry is 8 elements wide. The swizzle network aligns data for the FFUs. It can support a fixed number of swizzle patterns of 8-, 16-, 32-, 64-, and 128-wide elements. All these features greatly enhance multi-standard support and increase performance significantly [8].
MAPro is a tiny processor for reconfigurable baseband modulation mapping, presented by Liang Tang et al. [9]. MAPro provides a single low-cost, flexible hardware platform for emerging communication protocols and applications in modern embedded systems.
Omer Anjum et al. have proposed an MPSoC (Multi-Processor System-on-Chip) design for the baseband processing of a 20 MHz LTE system. Instead of conventional DSP/VLIW architectures, a TTA (Transport Triggered Architecture) was selected as the processing element (PE) of the MPSoC. Processing tasks are statically scheduled. Synchronization among the PEs is based on polling of a shared memory space [10].
Seyed A. Rooholamin and Sotirios G. Ziavras presented an innovative architecture for a VP (Vector Processor) which separates the path for performing data shuffle and memory-indexed accesses from the data path for executing other vector instructions that access the memory. This separation speeds up the most common memory access operations by avoiding extra delays and unnecessary stalls. In the lane-based VP design, each vector lane uses its own private memory to avoid any stalls during memory access instructions [11] .
The architectures discussed above improve performance and/or efficiency, but they have some obvious limitations. They obtain their performance improvement mainly by increasing the SIMD width, adding extra functional units and doubling the number of PEs. In some designs, an extra scalar processor is integrated for the control stream and supplementary computation. In addition, the multiple-cycle delay of vector processing, and the time/cycles wasted while waiting for the results of vector computing, have not been analyzed in depth. In the following, we present a novel vector-scalar engine pair architecture that greatly improves the efficiency of vector processing through a special heuristic instruction set architecture, which automatically generates temporary instructions for a scalar engine that runs in a separate pipeline. The scalar engine can complete the parameter computation at the right time and feed the results to the vector engine. The following section explains the novel design in detail.
Processor Design

Proposed processor
The proposed baseband processor is a fully programmable architecture, as shown in Figure 3. A novel architecture with four Vector-Scalar Engine Pairs (VSEPs) is designed for multi-protocol applications. It fully supports the processing of LTE-A core algorithms and can be easily adapted to future 5G core algorithms. The proposed 64-bit heuristic instruction set architecture is designed to ensure that the vector-scalar engine pairs run efficiently without pipeline stalls, even when computationally intensive algorithms are being processed. A two-level memory hierarchy (private data memory and public shared data memory) minimizes data transactions. The shared data memory is further divided into four banks so that it can be accessed by the four computation pairs at the same time with the help of the on-chip interconnect unit. This interconnect unit is designed with a resource scheduling pool and a scheduler for highly efficient routing management. In Figure 3, PC denotes the program counter, IB the instruction buffer, ID the instruction decoding unit, and IG the instruction regenerating unit, which produces the new instruction sequence from the ID unit. The new instruction sequence is buffered in the ISB and waits to be decoded by the ISD unit. Once decoded, the instructions are fed into the scalar engine.
With the auxiliary support of the heuristic instruction set, the scalar engine can easily handle the required parameter calculations, such as generating the Gold Sequence (GS), searching for the peak value, and generating the twiddle factors for FFT/IFFT. The results can be passed to the vector engine through the Inter-Engine Unit (IEU). This paired working design efficiently eliminates possible pipeline stalls for computation-intensive algorithms and applications.
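The three parameter calculations named above can be sketched in plain Python as follows. This is only an illustration of the kind of scalar work involved, not the processor's instruction sequences. All function names are hypothetical. The Gold sequence shown is assumed to be the length-31 pseudo-random sequence defined for LTE in 3GPP TS 36.211, since the text targets LTE-A but does not state which Gold sequence is meant.

```python
import cmath


def twiddle_factors(n):
    """Return W_n^k = exp(-2*pi*j*k/n) for k = 0 .. n/2 - 1."""
    return [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]


def peak_search(samples):
    """Return (index, squared magnitude) of the strongest sample."""
    best = max(range(len(samples)), key=lambda i: abs(samples[i]) ** 2)
    return best, abs(samples[best]) ** 2


def lte_gold_sequence(c_init, length, nc=1600):
    """Length-31 Gold sequence (assumed: the LTE definition in TS 36.211).

    x1 starts as 1, 0, 0, ...; x2 starts as the 31 bits of c_init (LSB first).
    The first nc outputs of both generators are discarded.
    """
    x1 = [1] + [0] * 30
    x2 = [(c_init >> i) & 1 for i in range(31)]
    for n in range(nc + length - 31):
        x1.append((x1[n + 3] + x1[n]) % 2)
        x2.append((x2[n + 3] + x2[n + 2] + x2[n + 1] + x2[n]) % 2)
    return [(x1[n + nc] + x2[n + nc]) % 2 for n in range(length)]
```

Each of these is sequential, data-dependent work that produces a small set of values for a much larger vector computation, which is why it suits a scalar engine running alongside the vector engine.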
Working mode
For a better balance between performance and power dissipation, the VSEP architecture provides an efficient means of self-adaptation and scalability. A single VSEP can run independently, or all the pairs can run simultaneously and synchronize as required by the communication algorithm. After the programs are loaded into the PM (Program Memory), the CMU (Clock Management Unit) turns on the clock of the VSEP. After initialization is done, the VSEP goes to sleep and waits for the system tick. If the application requires data processing, the system tick wakes up the scalar engine. According to the instruction sequences, the scalar engine triggers the data transfer into the private memory and starts the parameter calculation. The parameters are then passed to the IEU, which feeds the vector engine with the required scalar data. The vector engine acquires the parameters and continues its work. At the same time, the scalar engine works on the next round of parameter calculations.
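The benefit of letting the scalar engine work on the next round while the vector engine processes the current one can be illustrated with a small timing model. The sketch below is a simplification under stated assumptions, not a model of the actual hardware: each round costs a fixed number of scalar and vector cycles, the IEU hand-off is free, and the scalar engine never stalls. The function names and the cycle counts in the usage note are hypothetical.

```python
def serial_cycles(rounds, scalar_cycles, vector_cycles):
    """Cycles if parameter calculation and vector processing never overlap."""
    return rounds * (scalar_cycles + vector_cycles)


def paired_cycles(rounds, scalar_cycles, vector_cycles):
    """Cycles if the scalar engine prepares round i+1 during vector round i.

    Assumptions: fixed per-round costs, zero-cost IEU hand-off, and a scalar
    engine that is never blocked by the vector engine.
    """
    scalar_done = 0
    vector_done = 0
    for _ in range(rounds):
        scalar_done += scalar_cycles
        # The vector engine needs both its parameters and a free datapath.
        vector_start = max(scalar_done, vector_done)
        vector_done = vector_start + vector_cycles
    return vector_done
```

With the hypothetical values of 4 rounds, 10 scalar cycles and 30 vector cycles per round, the serial model gives 160 cycles and the paired model gives 130 cycles. In this model only the first round's parameter calculation is exposed, and the remaining scalar work is hidden behind vector processing whenever the scalar part of a round is no longer than the vector part.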
Multiple pairs mode
Taking energy dissipation into consideration, each vector-scalar engine pair can be brought into service independently, and different tasks can be dispatched to any available pair. The master control unit of the baseband chip can turn off any pair when it is no longer needed. For massive data processing, such as MIMO, all four vector-scalar engine pairs can cooperate with the help of the on-chip synchronization mechanism.
Design of the heuristic instruction set
In order to support heavy-workload operations such as FFT and MIMO matrix operations, the heuristic instruction set architecture is designed to fully utilize the hardware VSEPs and the instruction regeneration units (IRUs). The basic idea of the heuristic instruction is to divide the 64-bit instruction into two parts. The leading 46 bits are decoded for vector engine processing, while the remaining 18 bits are passed to the instruction regeneration unit for scalar engine processing, as shown in Figure 4.
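The two-part split of a heuristic instruction can be sketched as simple bit-field extraction. The code below assumes that the part "at the front" occupies the most significant 46 bits of the 64-bit word, which leaves 64 - 46 = 18 bits for the scalar side. The bit ordering, the function names and the absence of any sub-fields are assumptions made for illustration, since the actual encoding is given only in Figure 4.

```python
WORD_BITS = 64
VECTOR_BITS = 46
SCALAR_BITS = WORD_BITS - VECTOR_BITS  # 18 bits left for the scalar side


def split_instruction(word):
    """Split a 64-bit word into (vector_part, scalar_part).

    Assumption: the vector part is the upper 46 bits and the part handed to
    the instruction regeneration unit is the lower 18 bits.
    """
    if not 0 <= word < (1 << WORD_BITS):
        raise ValueError("instruction word must fit in 64 bits")
    vector_part = word >> SCALAR_BITS
    scalar_part = word & ((1 << SCALAR_BITS) - 1)
    return vector_part, scalar_part


def join_instruction(vector_part, scalar_part):
    """Inverse of split_instruction, useful for building test words."""
    if not 0 <= vector_part < (1 << VECTOR_BITS):
        raise ValueError("vector part must fit in 46 bits")
    if not 0 <= scalar_part < (1 << SCALAR_BITS):
        raise ValueError("scalar part must fit in 18 bits")
    return (vector_part << SCALAR_BITS) | scalar_part
```

In such a scheme the vector part would go straight to the vector decoder, while the 18-bit part acts as a compact hint from which the regeneration unit expands a longer scalar instruction sequence.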
Experiment
To evaluate the efficiency of the proposed processor, three categories of core processes for wireless communication based on OFDM (Orthogonal Frequency Division Multiplexing) technology are selected: FFT, channel estimation and MIMO detection. The proposed architecture performs well on these benchmarks.
Conclusions
The modem stage of an SDR requires software flexibility to cope with the multitude of wireless standards, their evolution, and algorithmic improvements (including bug fixes and in-field upgrades) without the need to respin an IC. The proposed architecture, with its powerful VE-SE pairs, outperforms conventional DSPs by an order of magnitude or more in a power-efficient way. Accordingly, the VE-SE pairs can be a key component of an SDR, where they can save silicon area through both intra-standard and inter-standard reuse and can potentially handle multiple standards simultaneously.
