Abstract-In LTE base-stations, RoHC is a processing-intensive algorithm that may limit the number of users the system can serve when it is used to compress the VoIP packets of mobile traffic. In this paper, a hardware-software and a full-hardware solution are proposed to accelerate the RoHC compression algorithm in LTE base-stations and to enhance system throughput and capacity. Results for both solutions are discussed and compared with respect to design metrics such as throughput, capacity, power consumption, and hardware resources. This comparison is instrumental in making architecture-level trade-off decisions in order to meet present-day requirements while remaining ready to support future evolution. In terms of throughput, a gain of 20% (6250 packets/sec) is achieved in the HW-SW implementation by accelerating the Cyclic Redundancy Check (CRC) and the Least Significant Bit (LSB) encoding in hardware. The full-HW implementation achieves a throughput 45 times that of the SW-only implementation (244000 packets/sec). The full-HW solution consumes more Adaptive Look-Up Tables (7477 ALUTs) than the HW-SW solution (2614 ALUTs) when synthesized on Altera's Arria II GX FPGA.
I. INTRODUCTION
With the proliferation of 4G/LTE networks, many cellular carriers are embracing the emerging field of mobile Voice over Internet Protocol (VoIP). It is a known problem that VoIP packets, transmitted via the RTP/UDP/IP stack, have relatively small payloads compared to the overhead of the packet headers added by the protocol stack. The typical VoIP payload is 20 bytes, encapsulated in headers of 40-60 bytes (200-300% overhead). This creates a need for a header compression technique that uses bandwidth and network resources efficiently to deliver a unit of payload data. The Robust Header Compression (RoHC) framework was introduced by the Internet Engineering Task Force (IETF) in [1] and adopted by the 3rd Generation Partnership Project (3GPP) as a solution to this problem. The RoHC algorithm supports the compression of various protocol stacks. In this paper, the discussion is limited to the compression of the RTP/UDP/IP protocol stack, which is referred to as RoHC profile 1 in [1].
Traditionally, base-station processing is implemented in software, with some parts, such as ciphering, accelerated in hardware. In [2], the most complex algorithms in LTE layer 2 were profiled, and it was shown that 51% of the LTE layer 2 processing power is consumed by the RoHC algorithm when it is implemented in software. It was therefore concluded that this traditional partitioning is no longer a tractable solution and that a new, more sophisticated partitioning is needed to account for the complexity of the RoHC algorithm. In [3], an analysis of the memory bandwidth requirements and a hardware implementation of some RoHC functions were presented. The authors showed that, in the next generation of mobile networks, network processors need to be augmented with more hardware in order to support the processing complexity of the RoHC algorithm.
In this paper, a hardware-software solution and a full-hardware solution of the RoHC compressor are proposed and implemented on Altera's Arria II GX FPGA. The aim of this work is to evaluate the improvement in performance with respect to the degree of HW-SW partitioning and to determine whether hardware acceleration of certain parts of RoHC can meet the increasing capacity needs or whether a fully hardware-based RoHC is needed. Both solutions target LTE base stations with the aim of supporting high capacity.
II. ROHC FRAMEWORK
The basic principle of RoHC is that only a few header fields (dynamic fields) change randomly or within a pattern in an RTP/UDP/IP packet stream, while most of the fields do not change at all (static fields). In the RoHC profile 1 framework, the RTP Sequence Number (RTP-SN) header field is used to establish functions for the other dynamic header fields that change within a pattern, such as IP Identification (IP-ID) and RTP Timestamp (RTP-TS). The core function for compressing these fields is the LSB encoding method. To compress or decompress a packet, both the compressor and the decompressor use recently stored information from previous packets of a particular stream, which is referred to as the context. The context is stored per stream and is identified by a Context Identifier (CID). Since the compressor and decompressor reside at different end points of the channel, the context of the decompressor might become invalid. Hence, three different modes/states of operation have been defined in [1] to compensate for this problem. The mode/state of operation affects the selection of the packet type to send and the compression efficiency. In RoHC profile 1, there are 40 packet types with different transmission capabilities for different header fields. In addition, RoHC packets may be protected for error detection by one of the three CRC polynomials (CRC-3, CRC-7, and CRC-8) defined in [1].
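The CRC protection mentioned above can be illustrated with a minimal sketch. The three generator polynomials are those given in RFC 3095 [1]; the LSB-first bit ordering, the all-ones initial register, and all function names are assumptions of this sketch rather than details taken from this paper. Because a CRC update is linear over GF(2), the same code can also derive which register bits and which input bits are XORed into each output bit when several input bits are consumed per step, which is the form a parallel hardware implementation would unroll.

```python
# Generator polynomials from RFC 3095, written LSB-first (reflected).
# The bit ordering is an assumption of this sketch.
ROHC_CRC_POLY = {
    3: 0x06,  # 1 + x + x^3
    7: 0x79,  # 1 + x + x^2 + x^3 + x^6 + x^7
    8: 0xE0,  # 1 + x + x^2 + x^8
}

def step_bits(width, state, bits, nbits):
    """Advance the CRC register by nbits input bits (bit 0 of `bits` first)."""
    poly = ROHC_CRC_POLY[width]
    for i in range(nbits):
        feedback = (state ^ (bits >> i)) & 1
        state >>= 1
        if feedback:
            state ^= poly
    return state

def crc_serial(width, data):
    """Bit-serial reference CRC over a bytes object, register preset to all ones."""
    state = (1 << width) - 1
    for byte in data:
        state = step_bits(width, state, byte, 8)
    return state

def xor_equations(width, nbits):
    """For each output bit, list the register bits and input bits it XORs.

    The update is linear, so probing it with one-hot vectors recovers
    the full set of XOR terms for an nbits-wide parallel step.
    """
    eqs = []
    for out in range(width):
        reg_terms = [i for i in range(width)
                     if (step_bits(width, 1 << i, 0, nbits) >> out) & 1]
        in_terms = [i for i in range(nbits)
                    if (step_bits(width, 0, 1 << i, nbits) >> out) & 1]
        eqs.append((reg_terms, in_terms))
    return eqs

def crc_parallel_step(eqs, state, word):
    """Evaluate the XOR equations once: one multi-bit step of the CRC."""
    nxt = 0
    for out, (reg_terms, in_terms) in enumerate(eqs):
        bit = 0
        for i in reg_terms:
            bit ^= (state >> i) & 1
        for i in in_terms:
            bit ^= (word >> i) & 1
        nxt |= bit << out
    return nxt
```

In this model a byte-wide step uses `xor_equations(width, 8)` and a word-wide step uses `xor_equations(width, 32)` with the four input bytes packed least-significant byte first; both produce the same register contents as the bit-serial reference.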
III. ROHC BASED HARDWARE ACCELERATORS
The RoHC algorithm is a control-intensive algorithm in which most of the processing is carried out conditionally, following the standard. The data-path complexity of the RoHC algorithm resides in the implementation of the complex LSB encoding equations, the bit-level manipulations in header parsing and in the error detection polynomials (CRC), and the exhaustive search involved in selecting a packet type to send.
a) CRC hardware accelerator: CRC computation requires bit-level operations that can be implemented efficiently in hardware. Therefore, the three CRC polynomials of RoHC are implemented as hardware accelerators. Using the method explained in [4], 32-bit and 8-bit parallel CRC accelerators are implemented for each of the CRC polynomials. For each output bit of a CRC polynomial, EX-OR operations are applied to a subset of the input vector together with a subset of the previous CRC state, which are referred to as d and c in Table I, respectively. To save hardware resources, the 8-bit parallel CRC can be implemented using the same EX-OR operations as the 32-bit parallel CRC by ensuring that input bits 8 to 31 are zeros. The 8-bit parallel CRC is needed when the input block length is not a multiple of 32 bits.
b) Least Significant Bit (LSB) Encoder: As the name indicates, the LSB encoding method is used to transmit the change that occurs in the least significant bits of a field value (value) when compared to a reference value (vref) stored in the context from a previous packet. To find the least number of bits (k) of value to send, a mathematical model is developed in this work based on the positive distances, R_fd and R_bk, between value and vref. The least number of bits, k, is found as
where k1 and k2 are calculated as
and Table II. In [3], the result of equation 3 was assumed to be larger than the result of equation 2 and was therefore ignored. In this work, it is found that either of equations 2 and 3 can determine the final result of equation 1. To reduce hardware, a method is proposed that determines beforehand which of the two equations would produce the smallest k, without calculating both equations 2 and 3. Hence, generic hardware capable of computing either equation 2 or equation 3 in one clock cycle is implemented, as shown in Fig. 1. Depending on the distance between vref and value, equation 1 can be rewritten as
where R_bk_max is shown in Table II. To calculate the integer results of the logarithmic terms in equations 2 and 3, a Leading One Detector (LOD) is used. Also, unlike in [3], an error-free method is proposed to determine the exact value of equations 2 and 3: the result is rounded up (ceiling) after comparing the integer remainders of the logarithmic terms (R1 and R2), with R2 first shifted to a logarithmic scale equivalent to that of R1. Both proposed methods are fully explained and proved in [5].
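As a point of reference for the LSB encoder described above, the following minimal sketch works directly from the interpretation interval defined in RFC 3095 [1], f(v_ref, k) = [v_ref - p, v_ref + (2^k - 1) - p], and not from the closed-form model with R_fd and R_bk developed in this work. The offset parameter p comes from RFC 3095; the function names, the 16-bit default field width, and the software stand-in for the Leading One Detector are assumptions made for illustration.

```python
def leading_one(n):
    """Index of the most significant set bit (software stand-in for an LOD)."""
    return n.bit_length() - 1

def ceil_log2(n):
    """ceil(log2(n)) for n >= 1: the LOD result, plus one if a remainder is left."""
    pos = leading_one(n)
    remainder = n - (1 << pos)
    return pos + (1 if remainder else 0)

def min_k(value, v_ref, p, width=16):
    """Smallest k such that value lies in [v_ref - p, v_ref - p + 2^k - 1]."""
    # Distance of value from the lower edge of the interval, modulo the field size.
    dist = (value - (v_ref - p)) % (1 << width)
    return ceil_log2(dist + 1)

def lsb_encode(value, k):
    """Keep only the k least significant bits of value."""
    return value & ((1 << k) - 1)

def lsb_decode(lsbs, k, v_ref, p, width=16):
    """Pick the unique value in the interval whose k LSBs match lsbs."""
    low = (v_ref - p) % (1 << width)
    return (low + ((lsbs - low) % (1 << k))) % (1 << width)
```

For example, with v_ref = 100, p = 0, and value = 105, the distance from the lower edge of the interval is 5, so three bits are enough, and decoding those three bits against the same reference recovers 105.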
c) Bit-packing: The operations of input header parsing, CRC field packing, and output packet packing require bit masking, shifting, and concatenation. These three logical operations are executed in one clock cycle using the hardware presented in Fig. 2. The bit-packing hardware has four main control signals: msb, lsb, shift, and sel_lft_over. The msb and lsb signals are used to generate a mask for the bits of interest in the input data. The masked input data is then shifted right or left by the barrel shifter and concatenated with what was previously stored in the register. The result can be taken out directly or registered in the same clock cycle.
d) Search LUTs: Selecting the best packet type to send is an exhaustive search problem in which some types are not suitable for sending due to mode or state restrictions. After the packet types that the current state/mode does not permit are determined, their flags are set and a 40-bit packet-options word, in which each bit refers to a packet type, is sent to the search LUTs hardware. The best packet is selected in the search LUTs based on the required transmission capabilities for the IP-ID, SN, and TS fields. Fig. 3 shows a block diagram of the hardware needed for this process. The LSZD detects the first zero, starting from the LSB, of the packet options register in one clock cycle. When a zero is found, the capabilities of the corresponding packet type are checked. If they are not sufficient, its flag bit is set and the process continues until a packet type is selected. The output of the LSZD is a one-hot code that addresses the look-up table storing the capabilities of all RoHC packet types.
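The search loop described in d) can be modelled in software as follows. This is a behavioural sketch only: reading LSZD as a least-significant-zero detector, the layout of the capability table, and the assumption that packet types are ordered so that the first sufficient one is the best are all choices made here for illustration, and the capability numbers in the usage example are made up rather than taken from the RoHC packet formats.

```python
def lszd(flags, width=40):
    """One-hot code of the least significant zero bit of flags, or 0 if none."""
    inverted = ~flags & ((1 << width) - 1)
    return inverted & -inverted  # isolate the lowest set bit of the inverse

def select_packet(flags, capabilities, need, width=40):
    """Return the index of the first permitted packet type that is sufficient.

    flags        -- bit i set means packet type i is already excluded
    capabilities -- dict: one-hot code -> (sn_bits, ip_id_bits, ts_bits)
    need         -- (sn_bits, ip_id_bits, ts_bits) required for this packet
    """
    while True:
        hot = lszd(flags, width)
        if hot == 0:
            return None  # every packet type has been excluded
        if all(c >= n for c, n in zip(capabilities[hot], need)):
            return hot.bit_length() - 1
        flags |= hot  # not sufficient: set its flag and keep searching
```

A usage example with a hypothetical four-entry table:

```python
caps = {
    0b0001: (4, 0, 0),
    0b0010: (6, 5, 0),
    0b0100: (6, 5, 8),
    0b1000: (16, 16, 32),
}
choice = select_packet(0b0001, caps, need=(5, 3, 6), width=4)
```

Here type 0 is excluded by its flag, type 1 cannot carry the required timestamp bits, and the search settles on type 2.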
IV. HW-SW SOLUTION
The HW-SW co-design methodology consists of porting the reference RoHC libraries [6] to Altera's NIOS II processor and profiling the code for worst-case execution. The profiling results show that an increase in performance can be obtained by accelerating the CRC and LSB functions in hardware. By implementing these two functions as custom instructions, the performance is increased by 20% without reducing flexibility. The advantage of custom instructions is that they are much faster than memory-mapped custom-logic peripherals. This is due to the reduction in communication and arbitration cost, as custom instructions are embedded directly in the NIOS II Arithmetic Logic Unit (ALU), whereas custom peripherals are accessed through the Qsys interconnect (Altera IP). The embedded system for RoHC is presented in Fig. 4. The processor executes the RoHC code from the on-chip memory, and an external memory is used for maintaining the user context data.
V. FULL-HW SOLUTION
The full-hardware architecture of the RoHC algorithm is divided into a controller stack and data-path accelerators connected through a shared bus and point-to-point connections, as shown in Fig. 5. The controller stack is a set of FSMs connected in a sequential fashion. Most of the operations in RoHC are based on comparing context data with received packet data. Therefore, separating these into two RAMs enables fast processing. The context RAM may also be used as temporary storage for the output packet. The RTL design is described in VHDL and fully explained in [5].
VI. RESULTS
An increase of 20% in throughput (and hence capacity) is obtained by the HW-SW co-design compared to the software-only implementation; the hardware that assists the RoHC software running on the NIOS II embedded system amounts to about 5% (363 ALUTs) of the full-hardware solution. This 5% comprises the CRC and WLSB hardware accelerators. In the full-HW solution, the throughput is improved by a factor of 45 compared to the pure software implementation, at the cost of more ALUTs (7477) than the HW-SW design. Moreover, the power consumption of the full-HW design is around 40% lower than that of the HW-SW design, which can be attributed to the slower clock. The full-HW, HW-SW, and SW-only designs are compared in Table III.
The execution time analysis for compressing an RTP/UDP/IP packet is shown in Fig. 6a for the HW-SW solution and in Fig. 6b for the full-HW solution. In the HW-SW solution, the processing time of CRC and LSB is almost nullified by hardware acceleration. For the full-HW solution, the execution time of compressing a packet is almost equally distributed across the controller stack, as shown in Fig. 6b. The lack of a dominant component in the execution time is the reason why the performance increase in the HW-SW solution is modest.
Regarding ALUT utilization, the NIOS II CPU occupies 86% of the HW-SW design, whereas the CRC and LSB hardware accelerators occupy the remaining 14%. In the full-HW implementation, the controller stack occupies 71% of the total design and the remaining 29% is used by the hardware accelerators, as shown in Table IV. This is due to the complex controller of 211 states.
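The modest HW-SW gain discussed above is consistent with a simple Amdahl's-law estimate. As a rough check, assume that the accelerated CRC and LSB functions take negligible time after acceleration. Let f denote the fraction of the SW-only execution time they originally occupied, s the speed-up of that fraction, and S the overall speed-up; these symbols are introduced here for illustration and are not values reported in the measurements.

$$ S = \frac{1}{(1-f) + f/s} \;\longrightarrow\; \frac{1}{1-f} \quad (s \to \infty), \qquad S = 1.2 \;\Rightarrow\; f = 1 - \frac{1}{1.2} \approx 0.17 $$

Under that assumption, CRC and LSB together would account for roughly one sixth of the software execution time, so even ideal acceleration of these two functions alone cannot raise the throughput by much more than 20%.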
In typical VoIP codecs, each user generates at most 50 packets/second. This requires a high external memory bandwidth, since each user context is read and written that many times, as shown in Table III. In the full-HW solution, the performance overhead of accessing the context in external memory is around 30%, as shown in Fig. 6c. Therefore, when increasing the RoHC capacity, the external memory bandwidth must also be considered, since it might be shared with other components in the LTE system. As IPv4 requires more processing than IPv6, compressing IPv4 in the HW-SW solution is slower than compressing IPv6, as shown in Fig. 6a. However, compressing IPv6 in the full-HW solution is slower than compressing IPv4, as it requires higher memory bandwidth, as shown in Fig. 6c.
size cells, the proposed solutions by this work address a wide range of these applications. An increase of 20% in performance is obtained by the HW-SW co-design, in which around 5% hardware assists the RoHC software, compared to the SW-only implementation. This should be compared to the 45-fold increase in performance achieved by the full-HW solution. These results can be used to determine empirically the degree of partitioning necessary to achieve a particular performance. The power and area results can also help designers in system-level planning.
