Abstract: Cyclic Redundancy Check (CRC) is a widely used error detection technique in many contemporary communication systems such as Fourth Generation (4G) Mobile Communication Long Term Evolution (LTE) and LTE Advanced, Wi-Fi and Wireless LAN. For real time embedded systems, code size (memory), processor machine cycles (speed) and power are the three important parameters that need to be optimized. CRC is very effective and simple for error detection, but its software implementation is not efficient. This paper presents software implementations of CRC using the Bit by Bit (BYB) and Look-Up Table (LUT) approaches reported in the earlier literature. Using these approaches, we have compared the machine cycle requirements for computation of CRC-3/5/8/12/16 generator polynomials on the TMS320C6713 and Freescale Star Core SC140 architectures. We have then modified our C implementation of LUT using In Place Computation (IPC). This IPC-LUT based CRC computation is found to be more optimized in terms of machine cycles and memory than the LUT method. We have reduced the machine cycle requirement by 39.47 % using our IPC-LUT approach compared to conventional LUT. We have also developed inline assembly code for the SC140 architecture using the IPC-LUT approach that takes only 45 machine cycles for the computation. Peak to Average Power Ratio (PAPR) is one of the major drawbacks of contemporary communication systems. For the third parameter (power), we have carried out an analysis to establish decision criteria for identifying sequences with low PAPR.
Introduction
Emerging demands for high data rate services and high spectral efficiency are the key driving forces for the continued technology evolution in wireless communications. Third generation (3G) mobile communication systems have been commercially deployed to meet the initial demand for high data rates. Wireless communication for mobile terminals has been a high performance computing challenge: it requires almost supercomputer performance while consuming very little power [1] - [2]. This requirement becomes even more challenging with the move to Fourth and Fifth Generation (4G / 5G) wireless communication. Next generation data rates are higher than those of current 3G technology and hence will require more computational power. Leading technologies are protocols such as 802.16e (Mobile Worldwide Interoperability for Microwave Access, WiMAX) and Third Generation Partnership Project (3GPP) Long Term Evolution (LTE) [11], which use Orthogonal Frequency Division Multiplexing (OFDM) at the core level. OFDM is a promising modulation technique that is increasingly being adopted in the telecommunication field [3] and is a good solution for high speed digital communications. However, high Peak to Average Power Ratio (PAPR) is a major problem in OFDM [8], [9], [10]. In OFDM, the data is spread over a large number of orthogonal carriers modulated at lower rates. The carriers can be made orthogonal by appropriately choosing the frequency spacing between them. Its advantages are high data rate and bandwidth efficiency. To provide high data rates in next generation wireless communication systems, all baseband processing algorithms must execute at high speed. These algorithms are implemented at the physical layer, which deals with bit level transmission between communicating stations and consists of the basic hardware transmission technologies of a network.
Hence, developing algorithms that take minimum machine cycles for execution (high speed), minimum code size (less memory) and minimum power consumption is of prime consideration. Cyclic Redundancy Check (CRC) is a widely used error detection method in data transmission and storage systems. It is simple, but its software implementation is not efficient. Using CRC for error detection in embedded systems involves a trade-off among speed (machine cycles), memory (code size) and power consumption. Because many embedded systems have significant resource constraints, it is important to understand the available trade-off options and to find ways to attain better error detection at lower computational cost [6] - [7]. In this paper, we have studied the optimization of CRC computations in terms of machine cycle and memory requirements. CRC typically uses the Galois Field GF(2) for its operation. It is basically a discrete sequence. Hence we have also carried out a PAPR analysis for discrete sequences to understand the power constraints.
Physical Layer Context
The message carried over the physical channel is protected by various Forward Error Correcting (FEC) codes in the physical layer. With FEC, redundant parity bits are added to the message, and these bits allow the receiver to detect and correct errors. In the channel coding process, the CRC is appended to the input data packet, which is then passed to the FEC encoder. After encoding, puncturing is performed to increase the data rate, followed by the interleaver, which distributes burst errors. The scrambler introduces a pseudo random sequence into the incoming bit stream. This avoids the occurrence of long streams of zeros or ones and also provides better synchronization. At the receiver side, exactly the opposite operations are performed to recover the original information bits. The flow of data through the different channel coding blocks is shown in Fig. 1, which includes all the sub-blocks of the channel coding block. The algorithm design and software implementation overview is given in the subsequent sections. Each block is implemented in such a way that it exposes two Application Programming Interfaces (APIs). One of these is the Initialization API, while the second is the actual kernel of the block, or the Processing API. Typically, the user of the channel coding block would call the Initialization APIs for all the blocks at system start up, thereby reserving the memory required by the various blocks and initializing Look Up Tables (LUT) and other data structures. Thereafter, in steady state operation, the user would call the Processing API as and when required. Fig. 2 shows the call sequence for these APIs.
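The Initialization/Processing split described above can be sketched in C as follows. This is only an illustration of the calling pattern: the names DemoBlockCtx, DemoBlockInit and DemoBlockProcess are hypothetical, and a per-byte parity table stands in for whatever tables a real channel coding block would precompute.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical context for one block. The parity table is a stand-in
 * for the look-up tables a real channel coding block would hold. */
typedef struct {
    uint8_t parityLut[256];
    int     initialized;
} DemoBlockCtx;

/* Initialization API: called once at system start-up to fill the tables. */
void DemoBlockInit(DemoBlockCtx *ctx)
{
    assert(ctx != NULL);
    for (unsigned v = 0; v < 256u; v++) {
        unsigned x = v, p = 0;
        while (x) {              /* XOR of all bits of v */
            p ^= (x & 1u);
            x >>= 1;
        }
        ctx->parityLut[v] = (uint8_t)p;
    }
    ctx->initialized = 1;
}

/* Processing API: called in steady state; it only reads the tables. */
uint8_t DemoBlockProcess(const DemoBlockCtx *ctx, const uint8_t *in, size_t len)
{
    assert(ctx != NULL && ctx->initialized);
    uint8_t parity = 0;
    for (size_t i = 0; i < len; i++)
        parity ^= ctx->parityLut[in[i]];
    return parity;
}
```

Keeping all table construction inside the Initialization API means the Processing API performs no set-up work on the time-critical path.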
A Novel Strategy for Algorithm Design

Algorithm Testing Framework
The test framework includes a set of test stub applications along with configurable parameters specific to the routines to be tested. Typically, a developed routine is tested by building a project using the Freescale CodeWarrior IDE. The input and reference output test vectors are provided. The test stub application calls the Initialization API and then passes the input test vector to the routine being tested (the Processing API). It then compares the output generated by the routine against the reference output test vector, which is calculated by hand, and reports the test result as SUCCESS or FAILURE depending on the final comparison. The test vectors (i.e. reference input and output) are saved as ASCII text files with typically one value per line. The values can be unpacked bits (0 or 1), packed bytes/words (unsigned bytes/words) or soft values (signed bytes). The nature of the values depends on the input/output format of the routine to be tested. Fig. 3 shows the flow chart for the test stub.
Fig. 3. Flow Chart for Test Stub
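A minimal sketch of the comparison step of such a test stub is given below. The function names are hypothetical, and the file format is assumed to be one decimal integer per line; the paper does not specify the number base.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Read up to maxCount values, one decimal integer per line (assumed format),
 * from an open text stream. Returns the number of values read. */
size_t LoadVector(FILE *fp, int *values, size_t maxCount)
{
    assert(fp != NULL && values != NULL);
    size_t n = 0;
    while (n < maxCount && fscanf(fp, "%d", &values[n]) == 1)
        n++;
    return n;
}

/* Return 1 if the routine output matches the reference vector, else 0. */
int VectorsMatch(const int *out, size_t outLen, const int *ref, size_t refLen)
{
    if (outLen != refLen)
        return 0;
    for (size_t i = 0; i < refLen; i++)
        if (out[i] != ref[i])
            return 0;
    return 1;
}

/* Map the comparison result to the string reported by the test stub. */
const char *ResultString(int match)
{
    return match ? "SUCCESS" : "FAILURE";
}
```

A stub would load the reference output with LoadVector, run the routine under test on the input vector, and print ResultString(VectorsMatch(...)).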
CRC
The CRC length that can be inserted has five different values: 3, 5, 8, 12 and 16 bits. The probability of undetected errors is lower when the CRC is longer.

Basic CRC computation algorithm: Bit by Bit Computation (BYB) [5]
1. Check the MSB of the data.
2. If the MSB is 1, XOR the data with the CRC polynomial and shift left by 1.
3. If the MSB is 0, only shift the data left by 1.
4. After processing all bits, the remainder is the CRC.

The output buffer pointer and output length parameters are not used in the final optimized data structure. The u2CrcLut member of the structure stores the 256 entries corresponding to the CRC for input bytes from 0x00 to 0xFF. The length of each entry is equal to the length of the CRC polynomial. The CrcLutGenerator() function generates the LUT [5] specific to the CRC polynomial. As part of initialization, the members of the stCrc structure are initialized and the LUT is populated by calling the CrcLutGenerator() function. The structure members u1InPtr and u2InLength are initialized by the user.

Table 2 provides the machine cycle requirements for the computation of different variants of CRC on the Freescale SC140 architecture using the conventional BYB and LUT [5] - [7] approaches and our proposed IPC-LUT approach. Optimization levels 0 and 3 can be set in the project settings of the SC140 Integrated Development Environment (IDE). Finding the number of machine cycles, the memory consumed and the utilization of the resources present in the architecture (shifters, Multiply and Accumulate (MAC) units, etc.) for a developed software program is termed "profiling". BYB is the traditional approach and consumes more machine cycles than LUT. In the LUT method, the CRCs for all 256 byte values are precomputed using BYB and stored in memory. Each 8 bits of the message to be encoded is then used as an index to fetch the corresponding CRC value from memory, and the resulting CRC is appended to the message.
In our proposed IPC-LUT approach, in place computation is used to reduce the number of memory references through pointers, which reduces the machine cycle requirement drastically (the machine cycle count drops from 114 to 69, a reduction of about 40%).
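The BYB steps and the byte-indexed LUT method can be sketched in C as below. This is an illustrative sketch rather than the authors' code: it uses the common register form (MSB first, zero initial value, no final XOR), keeps the register left-aligned in 16 bits so that one routine serves all five CRC lengths, and the function signatures, including that of CrcLutGenerator, are assumptions. The paper does not list its generator polynomials, so any polynomial passed in is the caller's choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit-by-bit (BYB) CRC over a byte buffer.
 * poly  : generator polynomial without its leading term
 * width : CRC length in bits (1..16) */
uint16_t CrcBitByBit(const uint8_t *data, size_t len, uint16_t poly, unsigned width)
{
    assert(width >= 1 && width <= 16);
    uint16_t alignedPoly = (uint16_t)(poly << (16u - width));
    uint16_t reg = 0;
    for (size_t i = 0; i < len; i++) {
        reg ^= (uint16_t)(data[i] << 8);        /* bring next byte into the top */
        for (int b = 0; b < 8; b++) {
            if (reg & 0x8000u)                  /* MSB = 1: shift and XOR poly */
                reg = (uint16_t)((reg << 1) ^ alignedPoly);
            else                                /* MSB = 0: shift only */
                reg = (uint16_t)(reg << 1);
        }
    }
    return (uint16_t)(reg >> (16u - width));    /* right-align the remainder */
}

/* Precompute the 256 left-aligned remainders, one per byte value, using BYB. */
void CrcLutGenerator(uint16_t lut[256], uint16_t poly, unsigned width)
{
    assert(width >= 1 && width <= 16);
    for (unsigned v = 0; v < 256u; v++) {
        uint8_t byte = (uint8_t)v;
        lut[v] = (uint16_t)(CrcBitByBit(&byte, 1, poly, width) << (16u - width));
    }
}

/* LUT-based CRC: one table read per message byte instead of eight bit steps. */
uint16_t CrcLut(const uint8_t *data, size_t len, const uint16_t lut[256], unsigned width)
{
    assert(width >= 1 && width <= 16);
    uint16_t reg = 0;
    for (size_t i = 0; i < len; i++)
        reg = (uint16_t)((reg << 8) ^ lut[(uint8_t)((reg >> 8) ^ data[i])]);
    return (uint16_t)(reg >> (16u - width));
}
```

With the illustrative 16-bit polynomial 0x1021 (the CCITT/XMODEM generator, chosen here only as an example), both routines return 0x31C3 for the nine ASCII bytes "123456789".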
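The reduction quoted above follows directly from the two cycle counts, and it matches the 39.47 % figure given in the abstract:

$$\frac{114 - 69}{114} = \frac{45}{114} \approx 0.3947 \quad\Rightarrow\quad 39.47\,\%$$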
Results and Discussion
- The implementation makes the processing in-place, i.e. there is no separate output buffer. The computed CRC is appended to the end of the input buffer. In LUT, separate input and output buffers were used:
  UINT1 *u1InpPtr; // Pointer to Input buffer
  UINT1 *u1OutPtr; // Pointer to Output buffer
  Copying the input to the output buffer and then indexing the memory for the precomputed CRC values created many memory read and write operations, which consumed a large number of machine cycles. In the proposed IPC-LUT, we have reduced these memory references using in place computation. This has also reduced the memory requirement to some extent. The following members of the stCrc structure were removed:
  o output buffer pointer
  o output length
- In the LUT approach, the members of the structure stCrc were accessed directly from inside the loop, resulting in extra memory reads. For IPC-LUT, the pointer to the input buffer and the pointer to the LUT are first stored in local variables. The local variables are then accessed inside the loop wherever required. This produced a further reduction in the machine cycle count.

Finally, the IPC-LUT approach and the Data Arithmetic & Logical Units (DALU) available in the SC140 architecture are used together to develop a highly optimized assembly routine for CRC computation. The CRC kernel takes only 45 machine cycles for the computation.
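The two IPC-LUT changes described above can be sketched in C for the 16-bit case. This is an assumed reconstruction, not the authors' code: the member names u1InPtr, u2InLength and u2CrcLut come from the text, while the type definitions, the structure layout, the function names and the high-byte-first append order are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint8_t  UINT1;   /* assumed meaning of the paper's type names */
typedef uint16_t UINT2;

/* Assumed layout after removing the output pointer and output length. */
typedef struct {
    UINT1 *u1InPtr;        /* input buffer with 2 spare bytes at its end */
    UINT2  u2InLength;     /* message length in bytes */
    UINT2  u2CrcLut[256];  /* byte-indexed table for a 16-bit CRC */
} stCrc;

/* Initialization: fill the table for a 16-bit generator polynomial. */
void CrcLutInit16(stCrc *crc, UINT2 poly)
{
    assert(crc != NULL);
    for (unsigned v = 0; v < 256u; v++) {
        UINT2 reg = (UINT2)(v << 8);
        for (int b = 0; b < 8; b++)
            reg = (reg & 0x8000u) ? (UINT2)((reg << 1) ^ poly)
                                  : (UINT2)(reg << 1);
        crc->u2CrcLut[v] = reg;
    }
}

/* Processing: compute the CRC and append it to the input buffer in place. */
UINT2 CrcProcessInPlace16(stCrc *crc)
{
    assert(crc != NULL && crc->u1InPtr != NULL);
    /* Structure members are copied to locals once, outside the loop. */
    UINT1       *p   = crc->u1InPtr;
    const UINT2 *lut = crc->u2CrcLut;
    UINT2        n   = crc->u2InLength;
    UINT2        reg = 0;

    while (n--)
        reg = (UINT2)((reg << 8) ^ lut[(UINT1)((reg >> 8) ^ *p++)]);

    p[0] = (UINT1)(reg >> 8);      /* no output buffer: CRC goes right */
    p[1] = (UINT1)(reg & 0xFFu);   /* after the last message byte */
    return reg;
}
```

Because the CRC is written directly behind the message, the caller must reserve two extra bytes in the input buffer. With a zero initial value and no final XOR, running the same routine over the message plus its appended CRC yields zero, which gives a simple receiver-side check.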
APPROACHES
The same code is executed on two different architectures to compare their profiling results. The results make it clear that the machine cycle count is drastically lower on the SC140 architecture. When designing embedded applications, the capability of the processor architecture is as important as the optimization of the algorithm being ported.
Low PAPR Discrete Sequences
The PAPR is defined as the ratio between the maximum instantaneous power and the average power [8] - [10]. PAPR can be measured either in continuous time or in discrete time [10]. Where, and k=1, 2,…..n Smaller the , nearer the number of runs approximate to length n/2. However, consider the sequence 110011001100. It is a short cycle sequence with very high PAPR. But for this sequence, the number of runs equals half of the sequence length and the number of ones equals the number of zeros. Hence, if we apply only the first two tests to this sequence, our decision will be wrong. Therefore, the decision criteria should be amended. Test 3: Find the aperiodic autocorrelation of the sequence [13], R (i). But is nothing but R (i) calculated at i=1. R (i) must be very small to have good randomness in the sequence.
Hence sum of , and must be small, to have good randomness in the sequence. Now our decision criteria for generating the low value PAPR sequences is as follows:
Forward Error Correction (FEC) generates the sequences, but these sequences should have a low Peak to Average Power Ratio (PAPR). The above decision criteria can be used for monitoring the sequences that exhibit low PAPR. We transmit only those sequences or codewords that have low PAPR, to avoid the distortion caused by non-linearity in the power amplifier. Low PAPR sequences consume less power. Thus the system becomes power efficient.
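The quantities discussed in this section (number of ones, number of runs and the aperiodic autocorrelation R(i)) can be computed with the short C sketch below. The function names are hypothetical, and mapping the bits 0/1 to -1/+1 before correlating is an assumption, since the defining equations are not reproduced in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Number of ones in a 0/1 sequence of length n. */
size_t CountOnes(const int *s, size_t n)
{
    assert(s != NULL || n == 0);
    size_t ones = 0;
    for (size_t k = 0; k < n; k++)
        ones += (s[k] != 0);
    return ones;
}

/* Number of runs, i.e. maximal blocks of identical consecutive symbols. */
size_t CountRuns(const int *s, size_t n)
{
    assert(s != NULL || n == 0);
    if (n == 0)
        return 0;
    size_t runs = 1;
    for (size_t k = 1; k < n; k++)
        if (s[k] != s[k - 1])
            runs++;
    return runs;
}

/* Aperiodic autocorrelation R(i), with bits mapped 0 -> -1 and 1 -> +1
 * (assumed mapping). Only the overlapping part of the sequence is summed. */
int AperiodicAutocorr(const int *s, size_t n, size_t i)
{
    assert(s != NULL || n == 0);
    int r = 0;
    for (size_t k = 0; k + i < n; k++) {
        int a = s[k] ? 1 : -1;
        int b = s[k + i] ? 1 : -1;
        r += a * b;
    }
    return r;
}
```

For the sequence 110011001100 these routines give 6 ones, 6 runs and R(1) = 1, so the sequence looks acceptable on the first two counts, while R(2) = -10 exposes its short periodic structure under the assumed mapping.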
Conclusions
CRC computation using the BYB, LUT and our proposed IPC-LUT approaches has been implemented on the Freescale SC140 architecture, and the machine cycle counts have been compared. Our results show that CRC computation using the proposed IPC-LUT approach drastically reduces the machine cycle count compared to the conventional BYB and LUT approaches. We have also developed optimized inline assembly code for CRC on the SC140 architecture using our proposed IPC-LUT approach and measured its machine cycle count. It takes only 45 machine cycles. The results show that the Freescale Star Core SC140 architecture provides better optimization than the TMS320C6713 architecture. Overall, we hope that our results provide embedded application engineers with better trade-off information on machine cycle consumption, architectural profiling and power level constraints for discrete sequences.
