Due to increasing demand for machine-to-machine (M2M) communication, current wireless communication systems are required to support simultaneous connections for many terminals. Interleave division multiple access (IDMA) has superior multiuser detection performance and attains high data transmission efficiency in multiuser communications. This paper describes the VLSI implementation of an interference canceller for OFDM-IDMA systems. The conventional architecture suffers reduced throughput in pipeline processing because of wait time occurring in the interleave and deinterleave memory units. The proposed architecture adopts dual-frame processing to eliminate this wait time and achieves a high utilization ratio in pipeline stage operation. In the implementation results, the proposed architecture reduces circuit area and power consumption by 25% and 41% for BPSK demodulation and by 33% and 44% for QPSK demodulation compared with the conventional architecture under the same throughput condition.
key words: OFDM-IDMA, interference canceller, VLSI architecture
Introduction
In recent wireless communication, machine-to-machine (M2M) communication, in which machines autonomously exchange information via communication networks, has attracted attention and become an essential technology for constructing information infrastructures such as smart cities and smart grids [1], [2]. Wireless communication systems are therefore required to support simultaneous connections for many terminals to cope with the resulting increase in traffic.
Multiple access, which simultaneously connects multiple user terminals, is classified into orthogonal and non-orthogonal schemes according to radio resource allocation. Orthogonal multiple access, in which users are separated in radio resources by time and frequency units, as in frequency division multiple access (FDMA), time division multiple access (TDMA), and orthogonal frequency-division multiple access (OFDMA), requires scheduling in which control information is exchanged among base stations and user terminals. In M2M communication supporting connections of many terminals, orthogonal multiple access might decrease the effective transmission rate because of the increasing control information. On the other hand, non-orthogonal multiple access such as code division multiple access (CDMA) and interleave division multiple access (IDMA) [3]-[5] does not require such scheduling. Since IDMA is superior to CDMA in multiuser detection (detecting desired signals from interference and noise) when many users are connected [5], it is considered promising for wireless M2M communication. As for practical studies, a throughput evaluation of IDMA in cellular communication has been reported in [6]. We have studied hardware development and outdoor experiments of IDMA systems [7], [8]. Moreover, OFDM-IDMA, which combines IDMA and OFDM, can easily perform channel equalization in the frequency domain and is robust against multipath interference [9]. This paper describes the VLSI implementation of an interference canceller, which is the most important part of OFDM-IDMA processing. The interference canceller separates users' signals by iterative processing, whose computational cost is proportional to the numbers of users and iterations. As related work, a pipeline and parallel architecture has been adopted in the interference canceller of a partitioned spreading CDMA (PS-CDMA) receiver [10]. Since the interference cancellation of PS-CDMA is similar to that of IDMA, we treat it as the conventional architecture.
The conventional architecture suffers reduced throughput because of wait time occurring in the interleave and deinterleave memory units, so that its utilization ratio in circuit operation is about one half. The proposed architecture adopts dual-frame processing to eliminate the wait time and achieves a high utilization ratio, which yields a smaller circuit scale and lower power consumption than the conventional architecture under the same throughput condition.
The paper is organized as follows: Sect. 2 explains the OFDM-IDMA system. Section 3 describes the interleaver design used in the OFDM-IDMA system. Section 4 discusses the VLSI architecture of the interference canceller. The VLSI implementation results of the conventional and proposed architectures are reported in Sect. 5. Section 6 summarizes our work.
OFDM-IDMA System

Transmitter and Receiver
Block diagrams of the OFDM-IDMA transmitter, uplink channel, and receiver are shown in Fig. 1.
In the transmitter in Fig. 1(a), the coded bit sequence c_k is generated by N_rep-times repetition codes with a coding rate R_rep (= 1/N_rep) from the information bit sequence b_k = {b_k(n_b)} (0 ≤ n_b ≤ N_b − 1) in the repetition coder (REP). Here, n_b is the bit number in the information bit sequence, N_c (= N_b N_rep) denotes the number of coded bits, and k (0 ≤ k ≤ K − 1) indicates the user number. The transmit sequence is generated by reordering the coded bits with a user-specific interleaving pattern π_k. The BPSK/QPSK symbol sequence is then generated by the BPSK/QPSK modulator, where n is the symbol number and N (= N_c/N_bps) denotes the number of symbols. N_bps is the number of bits per symbol, i.e., 1 for BPSK and 2 for QPSK. After the OFDM modulation, the OFDM-IDMA signals of user k are transmitted with a cyclic prefix (CP) added.
Copyright © 2015 The Institute of Electronics, Information and Communication Engineers
In the uplink channel in Fig. 1(b), the OFDM-IDMA transmitted signal of each user is multiplied by that user's channel coefficient. The OFDM-IDMA received signals y are modeled as the sum of the channel-affected signals of all users.
In the receiver in Fig. 1(c), the frequency-domain received signals y are obtained by CP removal and OFDM demodulation. From the training signals located at the head of the OFDM-IDMA frame, the channel coefficients of all users h_0, h_1, ..., h_{K−1} are computed by frequency-domain channel estimation. The interference canceller consists of the elementary signal estimator (ESE) and parallel iterative decoders (DECs) for the K users [11]. The number of iterations is given by N_iter. The ESE computes the extrinsic value λ_mud(c_k) for each user by using the received signals y and the channel coefficients of all users h_0, h_1, ..., h_{K−1}. The DEC accepts the extrinsic value λ_mud(c_k) after deinterleaving by π_k^{−1} and outputs the received bit sequence b̂_k by decoding the repetition codes with BPSK/QPSK demodulation. The extrinsic value λ_dec(c_k) produced by the DEC is reordered by the interleaving π_k, and the interleaved values are used as the input of the ESE at the next iteration step.
Interference Canceller
In the receiver of Fig. 1(c), the received signal y after the OFDM demodulation can be expressed by the desired and interference components as
where I_k(n) denotes the sum of the noise and interference components in the n-th symbol, expressed as
where z(n) is a sample of additive white Gaussian noise (AWGN). The interference cancellation by iterative processing is given by the following procedure:
(A) Generate the inputs of the interference canceller by interleaving the extrinsic values of the decoder outputs.
where λ_dec(c_k) = 0 is applied at the first iteration step.
(B) Compute the expectation and variance values of the desired signals as E_k(n) and V_k(n) and their summations as E(n) and V(n). For BPSK demodulation,
For QPSK demodulation, the symbol index n is divided into even and odd indices, 2m and 2m + 1, and the equations are given by
where 
where σ^2 denotes the average noise power in the received signals. For QPSK demodulation,
(E) Compute the DEC outputs b̂_k and the extrinsic values for the next iteration, λ_dec(c_k).
where
f_Demod generates binary data by BPSK/QPSK demodulation. By repeating the aforementioned procedure, the decoder output sequence b̂_k gradually approaches the information bit sequence b_k.
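The iterative procedure above can be illustrated with a small simulation. The following is only a sketch of the commonly used chip-by-chip IDMA detection, not a reproduction of the equations of this paper: it assumes real-valued channel gains, BPSK mapping (bit 0 to +1, bit 1 to −1), a noiseless received signal, and random interleavers. All function and variable names (`transmit`, `idma_detect`, `GAINS`, and so on) are hypothetical.

```python
import math
import random

def transmit(bits, perm, n_rep):
    coded = [b for b in bits for _ in range(n_rep)]      # repetition coding
    interleaved = [coded[src] for src in perm]           # chip j carries coded[perm[j]]
    return [1.0 - 2.0 * c for c in interleaved]          # BPSK: 0 -> +1, 1 -> -1

def idma_detect(received, gains, perms, n_rep, noise_var, n_iter):
    """Chip-by-chip iterative detection for real-valued BPSK (assumed model)."""
    n_users, n_chips = len(gains), len(received)
    # Step (A): decoder extrinsic values in interleaved order, zero at the first iteration.
    lam_dec = [[0.0] * n_chips for _ in range(n_users)]
    decoded = []
    for _ in range(n_iter):
        # Step (B): per-user mean and variance, then their sums over all users.
        mean = [[g * math.tanh(l / 2.0) for l in lam] for g, lam in zip(gains, lam_dec)]
        var = [[g * g - m * m for m in row] for g, row in zip(gains, mean)]
        mean_sum = [sum(col) for col in zip(*mean)]
        var_sum = [sum(col) + noise_var for col in zip(*var)]
        decoded = []
        for k in range(n_users):
            # ESE: exclude user k's own contribution, then form the chip LLR.
            lam_mud = [
                2.0 * gains[k] * (received[j] - (mean_sum[j] - mean[k][j]))
                / (var_sum[j] - var[k][j])
                for j in range(n_chips)
            ]
            # Deinterleave back to coded-bit order.
            lam_coded = [0.0] * n_chips
            for j, src in enumerate(perms[k]):
                lam_coded[src] = lam_mud[j]
            # DEC for the repetition code: sum the LLRs of each group of n_rep chips.
            totals = [sum(lam_coded[i:i + n_rep]) for i in range(0, n_chips, n_rep)]
            decoded.append([0 if t >= 0 else 1 for t in totals])
            # Step (E): extrinsic value = group total minus own LLR, re-interleaved.
            ext = [totals[i // n_rep] - lam_coded[i] for i in range(n_chips)]
            lam_dec[k] = [ext[src] for src in perms[k]]
    return decoded

random.seed(1)
N_B, N_REP, GAINS = 8, 8, [1.0, 0.5]        # illustrative values, not from the paper
n_chips = N_B * N_REP
bits = [[random.randint(0, 1) for _ in range(N_B)] for _ in GAINS]
perms = [random.sample(range(n_chips), n_chips) for _ in GAINS]
tx = [transmit(b, p, N_REP) for b, p in zip(bits, perms)]
received = [sum(g * x[j] for g, x in zip(GAINS, tx)) for j in range(n_chips)]  # noiseless
decoded = idma_detect(received, GAINS, perms, N_REP, noise_var=0.01, n_iter=5)
print(decoded == bits)  # True for this noiseless two-user setup
```

In this toy setup the stronger user is recovered in the first iteration and the weaker user in the second, once the stronger user's contribution has been cancelled. With complex channel coefficients and QPSK, the mean, variance, and LLR computations take a different form.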
Interleaver
Random Interleaver
In hardware, the data reordering of an interleaver is realized with memory. The behavior of the interleaver and deinterleaver is illustrated in Fig. 2. First, the data sequence {A, B, C, D} is stored in the interleaver memory in the write address order {0, 1, 2, 3}. The reordering to {D, B, C, A} is performed by the read address order {3, 1, 2, 0}. The deinterleaver reverses this reordering with the write address order {3, 1, 2, 0} and the read address order {0, 1, 2, 3}. The interleaver and deinterleaver must store the address sequences of all users because the reordering patterns are user specific. Storing the address sequences costs KN memory words if the sequences are randomly generated.
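The address-based reordering described above can be mimicked in software. The sketch below models the memory as a Python list and uses the four-symbol example of Fig. 2; the function names are illustrative only.

```python
def interleave(data, read_addr):
    memory = [None] * len(data)
    for write_addr, symbol in enumerate(data):      # sequential write: 0, 1, 2, ...
        memory[write_addr] = symbol
    return [memory[addr] for addr in read_addr]     # permuted read

def deinterleave(data, write_addr):
    memory = [None] * len(data)
    for addr, symbol in zip(write_addr, data):      # permuted write
        memory[addr] = symbol
    return [memory[addr] for addr in range(len(data))]  # sequential read

pattern = [3, 1, 2, 0]
shuffled = interleave(["A", "B", "C", "D"], pattern)
restored = deinterleave(shuffled, pattern)
print(shuffled, restored)  # ['D', 'B', 'C', 'A'] ['A', 'B', 'C', 'D']
```

The same address sequence serves both directions: it is the read order in the interleaver and the write order in the deinterleaver.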
Algebraic Interleaver
An algebraic interleaver has been presented to reduce the memory needed for interleaving patterns [12]. The algebraic interleaver generates interleave patterns by an algebraic operation based on a user-specific constant. The algebraic operation of the Takeshita-Costello method [12] is given by the following equation:
where N is a power of 2 and S_k is a user-specific constant randomly chosen from the odd numbers. Compared with the random interleaver, the Takeshita-Costello method reduces the memory from KN words to K words. However, the Takeshita-Costello method might degrade communication performance because of its low-spread property, in which similar address sequence patterns are observed among users. Hence, we employ the multiple-stage algebraic interleaver shown in Fig. 3. The output of the first-stage interleaver a_1(n) is connected to the input of the second stage to scramble the interleaved patterns. The number of user-specific constants increases to two, S_{1,k} and S_{2,k}; however, this memory remains small compared with that of the random interleaver. The optimal number of stages in the algebraic interleaver depends on communication specifications and channel conditions. As an example, simulation results of the random and algebraic interleavers are given in the Appendix.
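As an illustration of algebraic address generation, the sketch below assumes the quadratic form a(n) = S · n(n + 1)/2 mod N, which is the sequence commonly associated with the Takeshita-Costello construction. The exact equation used in this paper is not reproduced here, so the formula, the constants, and the function names are assumptions. For N a power of two and odd S, this mapping visits every address exactly once, and cascading stages keeps that property.

```python
def algebraic_stage(index, s, n):
    """One stage of the assumed form: s * index * (index + 1) / 2 mod n."""
    return (s * (index * (index + 1) // 2)) % n

def multistage_addresses(constants, n):
    """Feed the output address of each stage into the next stage."""
    addresses = []
    for i in range(n):
        a = i
        for s in constants:
            a = algebraic_stage(a, s, n)
        addresses.append(a)
    return addresses

N = 16                                           # must be a power of two
one_stage = multistage_addresses([3], N)         # one odd constant per user
two_stage = multistage_addresses([3, 5], N)      # two odd constants per user
print(one_stage)
print(two_stage)
```

Only the constants need to be stored per user; the N addresses are regenerated on the fly, which is the source of the memory saving over a stored random pattern.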
VLSI Design
IDMA Decoder
The circuit structure of the IDMA decoder inside an OFDM-IDMA receiver is illustrated in Fig. 4. After storing the received signals y and channel coefficients h in memory, the IDMA decoder performs iterative processing with parallel interference cancellers for all users. The "Interference Canceller" block includes the processing of "ESE" and "DEC" in Fig. 1. The interference canceller executes pipeline processing in block units of N symbols. Since the summations of means and variances in (6) and (7) ((10) and (11) for QPSK) cannot be computed inside a single interference canceller, the "Mean/Var Summation" block collects the mean and variance values of all users and delivers their summations to the interference cancellers.
Conventional Architecture
The conventional architecture based on pipeline processing, which has been used in the PS-CDMA system [10], is illustrated in Fig. 5. All circuit blocks operate by pipeline processing, and the feedback datapath of λ_dec(c_k) realizes iterative processing. We have modified the mean and variance computation block ("Mean/Var") and the LLR computation block ("LLR") for IDMA. We also use the algebraic interleaver, whose address generation is executed in the "Algebraic Operation" block. When the outputs of the extrinsic value computation block ("Ext. Calc") return to the input of the interleaver memory ("Int. RAM"), the next iterative process starts.
The timing chart of the conventional architecture is illustrated in Fig. 6. The output data of "Extrinsic Calculation" becomes the input data of "Interleaver" at the next iteration step. Since all the blocks execute pipeline processing, the utilization ratio would be almost one if those blocks processed valid data at all times. However, the utilization ratio of the conventional architecture is about one half, because all the blocks are forced to wait until the next iteration step starts.
Improvement of Utilization by Dual-Frame Processing
Let us consider why the conventional architecture has low utilization in pipeline processing. As indicated by the dashed circle in Fig. 6, memory writing and reading are not executed simultaneously. We explain the memory read and write operations with the example data sequence {A, B, C, D} in Fig. 2. The read and write operations in the interleaver memory for every clock cycle are listed in Table 1; the operations after interleaving are omitted. {A_1, B_1, C_1, D_1} denotes the input data at the first iteration step. The read operation of {D_1, B_1, C_1, A_1} cannot start until the symbol "D_1" has been buffered in memory. Similarly, the write operation of {A_2, B_2, C_2, D_2} at the second iteration step cannot start until the symbol "A_1" has been fed back to the input. The interleaver memory therefore inherently needs idle cycles (as many as the number of symbols N) in its read and write operations for data buffering.
Our idea for eliminating these idle cycles is to apply dual-frame processing in the interleaver memory. As long as the IDMA parameters, such as the number of symbols N and the repetition code rate R_rep, are identical between two frames, another data sequence can be processed by filling the idle cycles with its operation cycles. The memory read and write operations with dual-frame processing are listed in Table 2. Another data sequence {a_1, b_1, c_1, d_1} can be interleaved without data collision, although the memory size is doubled. The simultaneous read and write operations can be implemented by a dual-port memory having independent read and write address ports.
We should consider whether dual-frame processing is acceptable for OFDM-IDMA communication. Figure 7 shows single-frame and multi-subframe transmission schemes. For single-frame transmission such as wireless LAN, dual-frame processing is not efficient because it takes a long time until the next frame is received. Multi-subframe transmission, as used in mobile communications such as long-term evolution (LTE) and WiMAX, in which each frame consists of multiple subframes, allows dual-frame processing to be applied when the subframes are assigned to IDMA frames. Although a frame format for OFDM-IDMA communication has not been standardized yet, Matsumoto et al. presented a resource allocation for OFDM-IDMA [6] in which IDMA symbols are mapped into time and frequency bins inside a subframe, as illustrated in Fig. 8.
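The effect of filling the idle cycles can be checked with a toy schedule. The sketch below is a simplified model and not the actual circuit: it ignores the pipeline latency and the feedback path, gives each frame its own N-word bank (2N words for two frames), and simply rewrites the same data at every iteration. The function name `simulate` and the slot-based schedule are assumptions made for illustration.

```python
def simulate(frames, read_addr, n_slots):
    """Toy interleaver-memory schedule; one slot = N clock cycles.

    Frame 0 writes in even slots and reads in odd slots. An optional
    frame 1 does the opposite, filling the cycles frame 0 leaves idle.
    """
    n = len(read_addr)
    banks = [[None] * n for _ in frames]      # N words per frame
    reads = []                                # (cycle, symbol) for each valid read
    for slot in range(n_slots):
        for i in range(n):
            cycle = slot * n + i
            for f, data in enumerate(frames):
                if slot % 2 == f:             # write phase of frame f
                    banks[f][i] = data[i]
                elif slot > f:                # read phase, once the bank is filled
                    reads.append((cycle, banks[f][read_addr[i]]))
    return reads, len(reads) / (n_slots * n)

pattern = [3, 1, 2, 0]
single_reads, single_util = simulate([list("ABCD")], pattern, n_slots=8)
dual_reads, dual_util = simulate([list("ABCD"), list("abcd")], pattern, n_slots=8)
print(single_util, dual_util)  # 0.5 0.875
```

In the single-frame case the read port is busy in only half of the cycles. With two frames, every cycle after the start-up slots carries one write and one read on different banks, so the ratio approaches one as the number of slots grows.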
Proposed Architecture
In accordance with the idea of applying dual-frame processing, we present the proposed architecture, illustrated in Fig. 9. The proposed architecture expands the interleaver and deinterleaver memory sizes to 2N words and inserts a frame timing block that supplies the start timings of the two frames to the other processing blocks. Most of the arithmetic units are the same as those in the conventional architecture. The timing chart of the proposed architecture is illustrated in Fig. 10. As highlighted by the dashed circles, dual-frame processing increases the utilization ratio in pipeline processing to almost one. Table 3 compares the conventional and proposed architectures in the number of operation cycles, throughput, and memory size per user. F denotes the clock frequency (Hz), τ is the number of latency cycles caused by pipeline stages, and W_D denotes the bit length in fixed-point operation. The proposed architecture provides twice the throughput of the conventional architecture. The proposed architecture costs double the memory size; however, it needs fewer hardware resources than the conventional architecture under the same throughput condition.
[Table 3: Comparison of architectures. Columns: Conventional, Proposed. Rows: number of operation cycles, throughput, memory size per user.]
[Fig. 10: Timing chart of proposed architecture.]
Other Circuits
We describe the other circuit blocks used in the interference canceller. These blocks are implemented in both the conventional and proposed architectures.
Algebraic Interleaver
The circuit structure of the algebraic interleaver is shown in Fig. 11. According to (21), two multiplications and a modulo operation are required. Since N is a power of two, the modulo operation can be realized by selecting the lower A_b bits of the result, where A_b = log2 N is the width of the address counter bus. The multiple-stage interleaver is constructed by cascading identical circuits with pipeline registers inserted between them.
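The bit-selection trick can be checked in software. The sketch below assumes the same quadratic address form S · n(n + 1)/2 mod N as a stand-in for (21), which is not reproduced in this text, and compares a mask-and-shift version against ordinary modulo arithmetic. The names `address_bits` and `stage_hw` are hypothetical.

```python
def address_bits(n):
    """A_b = log2(N) for N a power of two."""
    return n.bit_length() - 1

def stage_hw(index, s, n):
    """Hardware-style address computation (assumed formula)."""
    mask = n - 1                          # keeps the lower A_b bits
    product = index * (index + 1)         # first multiplication
    return (s * (product >> 1)) & mask    # shift = divide by 2, second multiplication, mask = mod N

N = 1024
print(address_bits(N))                    # 10
print(all(stage_hw(i, 5, N) == (5 * (i * (i + 1) // 2)) % N for i in range(N)))
```

Because the mask replaces a divider, the modulo operation costs no arithmetic logic at all; only the multipliers determine the circuit size of each stage.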
Hyperbolic Tangent Function
A direct circuit implementation of the hyperbolic tangent function in (4) is complicated and costs computation cycles. Therefore, we apply the following approximation:
where f_table denotes a function approximation by a ROM table.
The ROM values are calculated in advance. Although the approximation accuracy depends on the number of ROM words, fixed-point simulation has indicated that 512 words are enough to obtain the same accuracy as the direct computation.
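A table-based approximation of this kind can be sketched as follows. The table size of 512 words is taken from the text, whereas the covered input range `X_MAX`, the mid-point sampling, the use of odd symmetry, and the saturation at the table end are assumptions made for this example and may differ from the approximation used in the paper.

```python
import math

ROM_WORDS = 512     # number of ROM words reported as sufficient in the text
X_MAX = 8.0         # assumed input range covered by the table

STEP = X_MAX / ROM_WORDS
# ROM contents are precomputed; each word holds tanh at the centre of its interval.
ROM = [math.tanh((i + 0.5) * STEP) for i in range(ROM_WORDS)]

def tanh_rom(x):
    """Approximate tanh(x) by table lookup, using tanh(-x) = -tanh(x)."""
    index = int(abs(x) / STEP)
    value = ROM[min(index, ROM_WORDS - 1)]   # saturate beyond the table range
    return value if x >= 0 else -value

worst = max(abs(tanh_rom(i / 1000) - math.tanh(i / 1000)) for i in range(-10000, 10001))
print(worst < 0.01)  # True: the error is bounded by half a table step
```

Since the slope of tanh never exceeds one, the error of this lookup is at most STEP / 2, so doubling the number of ROM words halves the worst-case error.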
Complex Arithmetic Unit
The computations of the mean and variance values in (8)-(11) and the LLRs in (13)-(16) for QPSK require complex operations in the arithmetic units. We define the real and imaginary parts of y and h_k as
We convert the complex operations into real-only operations that can be implemented by fixed-point arithmetic units, expressed by the following equations:
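The conversion of complex operations into real ones follows the usual identities for complex products. The sketch below shows three such building blocks with illustrative names; which of them appear in the equations of this paper is an assumption, since those equations are not reproduced here.

```python
def complex_mul(a_re, a_im, b_re, b_im):
    """(a_re + j a_im)(b_re + j b_im) using four real multiplications."""
    return a_re * b_re - a_im * b_im, a_re * b_im + a_im * b_re

def conj_mul(h_re, h_im, y_re, y_im):
    """conj(h) * y, a typical term when matching the received signal to the channel."""
    return h_re * y_re + h_im * y_im, h_re * y_im - h_im * y_re

def abs_sq(h_re, h_im):
    """|h|^2, a typical term in variance computations."""
    return h_re * h_re + h_im * h_im

print(complex_mul(3, 4, 1, -2))  # (11, -2)
print(conj_mul(3, 4, 1, -2))     # (-5, -10)
print(abs_sq(3, 4))              # 25
```

Each output is a plain sum of real products, so the same fixed-point multipliers and adders can serve both the BPSK and the QPSK datapaths.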
VLSI Implementation
The interference cancellers based on the conventional and proposed architectures have been implemented with a 90-nm CMOS standard cell library at a supply voltage of 1.0 V. We have designed the digital circuits in the Verilog hardware description language and used memory macros with 16/32-bit buses and 512/1024 words. The bit length in the fixed-point arithmetic units has been set to W_D = 16. The number of latency cycles τ is 32 for BPSK and 39 for QPSK. Table 4 summarizes the implementation results. The target clock frequency was set to 357 MHz in logic synthesis. We have measured power consumption from the synthesized gate-level circuits, including switching activities. In order to compare the two architectures under the same throughput condition, we assume a case in which two identical units of the conventional architecture operate concurrently. The notation "Conventional (two units)" in Table 4 indicates this case. Under the same throughput condition, the proposed architecture reduces circuit area and power consumption by 25% and 41% for BPSK and by 33% and 44% for QPSK compared with the conventional architecture.
The implementation results of the IDMA decoders (Fig. 4) based on the proposed architecture are presented in Table 5. The IDMA decoders implement parallel interference cancellers according to the number of users, set to K = 20. Since the IDMA decoders also require memory units that store the received signals and the channel coefficients of all users, their circuit area and power consumption are larger than 20 times those of a single interference canceller.
Conclusion
This paper has presented the VLSI implementation of an interference canceller for OFDM-IDMA systems. We have presented an algebraic interleaver for memory size reduction and proposed a dual-frame processing architecture to solve the problem of the low utilization ratio in the interleave and deinterleave memory blocks. In the VLSI implementation, the proposed architecture has shown smaller circuit area and lower power consumption than the conventional architecture under the same throughput condition. VLSI implementations of the OFDM-IDMA transmitter and receiver will be studied in our future work.
Appendix: Optimal Number of Stages in Algebraic Interleaver
The optimal number of stages in the algebraic interleaver depends on communication specifications and channel conditions. Figure A·1 shows the simulation results in BPSK and QPSK modes, where the bit error rates (BERs) averaged over all users are plotted. The one-stage algebraic interleaver degrades the BER, especially in the QPSK mode. The three-stage algebraic interleaver shows almost the same BERs as the random interleaver.
Because of the channel estimation error, the received signal is affected by amplitude and phase errors even after channel equalization. The phase error is considered to cause the difference in BER between BPSK and QPSK at E_b/N_0 below 13 dB; the probability of error in BPSK/QPSK OFDM systems with post-FFT phase error has been reported in [13].
