We present a systematic and efficient way of managing the path metric memory and simplifying its connection network to the add-compare-select unit (ACSU) for Viterbi decoder (VD) design. Using the derived equations for memory partition and add-compare-select (ACS) arrangement together with the extended in-place scheduling scheme proposed in this work, we can increase the memory bandwidth for conflict-free path metric accesses with hardwired interconnection between the path metric memory and ACSU. Compared with the existing work, the developed architecture possesses the following advantages: (1) Each partitioned memory bank can be treated as a local memory of a specific processing element inside the ACSU, with hardwired interconnection, so that the interconnect complexity is reduced significantly. (2) The partitioned memory banks can be merged into only two pseudo-banks regardless of the number of adopted ACS processing elements. This not only greatly simplifies the design of the address generation unit, but also reduces the physical size of the required memory. (3) The implementation can be accomplished in a systematic way with regular and simple control circuitry. Experimental results demonstrate the effectiveness of the developed architecture, and the benefit becomes more apparent for convolutional codes with large memory order.
key words: Viterbi decoder (VD), in-place scheduling, path metric memory management, VLSI architecture
Introduction
The Viterbi decoding algorithm (VA) is known as an efficient method of realizing the maximum likelihood (ML) decoding of convolutional codes [1]-[3]. A k-input, n-output convolutional code (code rate R=k/n) with memory order m is denoted as an (n,k,m) code. The implementation of the VA, referred to as the Viterbi decoder (VD), can generally be divided into three basic units: the branch metric unit (BMU), the add-compare-select unit (ACSU), and the survivor memory unit (SMU) [4]. The BMU computes the branch metrics at each time stage from the received noisy data and the expected codewords. These metrics are then fed to the ACSU to update the path metric of each survivor path. Finally, the SMU stores the survivor sequence for each state, performs the trace-back operation, and outputs the decoded bits. The central unit of a VD is the data-dependent feedback loop performing the add-compare-select operation, which becomes one of the bottlenecks in VLSI implementation. Therefore, an efficient technique of path metric memory management is crucial to the performance [...] interconnection network in between, we propose a novel in-place scheduling technique, denoted as the extended in-place scheduling, for VDs using butterfly-based computation. With the mathematical model and equations derived in this work, not only is conflict-free path metric memory (PMM) access ensured, but the interconnect complexity is also greatly reduced. The main advantage of our development is that each partitioned memory bank becomes local to a specific processing element (PE) inside the ACSU, making the interconnection network and the address generation circuit much simpler than those of [15], [16], [19], [20]. With the proposed scheduling scheme, we can further merge the P partitioned banks, each with a width of d bits, into only two pseudo-banks, each (P/2)*d bits wide, regardless of the number of required ACSs.
This not only reduces the hardware requirement of address generation and interconnection, but also reduces the physical size of the required memory. Moreover, we show that the proposed techniques do not affect the design of the conventional BMU and SMU. In the comparisons of Sect. 5, the proposed work exhibits the best performance among the compared designs. This paper is organized as follows: Section 2 reviews the basic concept of in-place path metric update and describes our area-efficient architecture model. Section 3 presents the derived methods of conflict-free path metric access and the corresponding area-efficient design. Then, we describe in Sect. 4 the proposed extended in-place scheduling technique and the local-memory, pseudo-bank architecture. Section 5 shows the comparison with the existing work and the performance evaluation of our development. Finally, Section 6 concludes this work.
Fundamentals and Definitions
Basically, the path metrics can be updated in either the ping-pong mode or the in-place mode [17], [18]. For in-place scheduling, only one path metric memory is required for updating path metrics, and each old path metric is immediately overwritten by the newly computed path metric [18]. As a result, its memory size is half of that of the ping-pong mode, at the expense of a more complicated control circuit. This is an important feature in the implementation of VDs with a long constraint length to save memory space and, consequently, chip area. In this work, we focus on the in-place scheduling schemes.
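The difference between the two update modes can be sketched in a few lines of Python. This is only an illustrative model, not the design of this paper: it assumes a radix-2 trellis in which old states j and j+N/2 produce new states 2j and 2j+1, it assumes that state s is stored at address RR(s,t) at time stage t (a right rotation of the m-bit state number), and `branch_metric` is a hypothetical placeholder for the BMU output.

```python
def rr(x, i, width):
    """Cyclically rotate the width-bit value x to the right by i bits."""
    i %= width
    mask = (1 << width) - 1
    return ((x >> i) | (x << (width - i))) & mask


def branch_metric(t, old_state, new_state):
    """Hypothetical stand-in for the BMU output (an assumption, not from the paper)."""
    return (7 * t + 3 * old_state + 5 * new_state) % 16


def butterfly(pm_a, pm_b, t, j, n):
    """ACS for old states j and j+n/2; returns metrics of new states 2j and 2j+1."""
    out = []
    for s_new in (2 * j, 2 * j + 1):
        cand_a = pm_a + branch_metric(t, j, s_new)
        cand_b = pm_b + branch_metric(t, j + n // 2, s_new)
        out.append(min(cand_a, cand_b))
    return out


def ping_pong_step(old, t, m):
    """Ping-pong mode: read from one memory, write into a second one."""
    n = 1 << m
    new = [0] * n
    for j in range(n // 2):
        new[2 * j], new[2 * j + 1] = butterfly(old[j], old[j + n // 2], t, j, n)
    return new


def in_place_step(mem, t, m):
    """In-place mode: each butterfly overwrites the two words it has just read."""
    n = 1 << m
    for j in range(n // 2):
        addr_a = rr(j, t, m)            # address of old state j at stage t
        addr_b = rr(j + n // 2, t, m)   # address of old state j+n/2 at stage t
        mem[addr_a], mem[addr_b] = butterfly(mem[addr_a], mem[addr_b], t, j, n)


def schemes_agree(m=4, stages=8):
    """Check that both modes hold the same path metrics after every stage."""
    n = 1 << m
    ref = list(range(n))
    mem = list(range(n))
    for t in range(stages):
        ref = ping_pong_step(ref, t, m)
        in_place_step(mem, t, m)
        if any(mem[rr(s, t + 1, m)] != ref[s] for s in range(n)):
            return False
    return True
```

Under these assumptions the in-place memory needs N words instead of 2N, and the price is the stage-dependent address rotation visible in `in_place_step`.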
Based on the in-place computation, Fig. 1(a) depicts our basic area-efficient architecture model, which contains a single memory made up of P banks for storing path metrics, P ACSs, and two interconnection (routing) networks. This architecture contains P/2 processing elements (PEs), each consisting of two ACSs for updating the path metrics of the two states defined by the butterfly module. Since P states are processed at the same time, it takes N/P time units (iterations) to operate on all the N trellis states at each time stage. For in-place computation, we use two routing networks to provide P old path metrics to the ACSs and to de- [...] In our previous work [16], we transformed the partitioning problem into finding a coloring solution on the corresponding conflict graph and derived an area-efficient architecture similar to Fig. 1(a). In this architecture, the underlying memory structure is global (centralized), and the two routing networks might dominate the performance of the resulting hardware implementation, especially for a large value of P. In this work, we present an efficient ACS arrangement scheme together with a novel scheduling technique, referred to as the extended in-place scheduling, to reduce the interconnect complexity. Based on our development, the area-efficient model of Fig. 1(a) can be greatly simplified to the one depicted in Fig. 1(b): a local-memory (LM), fixed-interconnect, area-efficient model. The resulting architecture has the following characteristics. (a) Each bank becomes the local memory of a specific processing element. (b) The interconnection networks between memory banks and ACSs become fixed (hardwired), so that the interconnection overhead can be significantly reduced. (c) As will be explained in Sect. 4, the P partitioned banks can be merged into only two pseudo-banks independent of the number of employed ACSs. As a result, the address generation circuit is much simpler than that of [15].
Note that the architecture model presented in Fig. 1 can be generalized to rate-k/n decoders. For code rate R=k/n, if the encoder has k input bits connected to k shift registers whose sizes differ by at most one bit, the rate-k/n encoder can be modeled as a radix-2^k shift register in which the most significant bit of the state corresponds to the oldest bit in the shift register and the least significant bit is the new input bit to the shift register. As a result, the bit-cyclic property can be applied for in-place scheduling and the butterfly-based computation can be used for in-place update of path metrics. Without loss of generality and for ease of explanation, we focus on rate-1/n, radix-2 applications in our development.
INF. & SYST., VOL.E91-D, NO.9 SEPTEMBER 2008
Conflict-Free Memory Access
In our previous work [16] , we have presented an efficient approach to partition the memory into a number of memory banks. In fact, more than one solution might exist when we transform the problem of memory partition into finding a coloring solution on the corresponding conflict graph, and several results have been given in the literature [11] , [12] , [15] . Instead of adopting the graph representation, this paper investigates a systematic way of formulating the partition algorithm in mathematical forms. In this way, we can derive a systematic approach of achieving a good partitioned result without going through the relatively complicated coloring process required for convolutional codes with long constraint length.
Memory Partition by Mathematical Equations

Definition of Notations
For simplicity of explanation, the following notations are used throughout this paper. (a) An address A at time stage t is denoted as At. When a specific address, say j, is of interest, it is represented as Aj. Therefore, the notation Atj represents the address j at time stage t. A similar interpretation applies to the state S, memory bank B, ACS, and PE. For example, S2, B4, and ACS3 denote state 2, bank 4, and the add-compare-select component with index 3, respectively.
(b) The function BN(A) is used to compute the bank number which the original address A will be mapped into. And, the assigned or mapped address in the bank is denoted as BA(A). For instance, if BN(A5)=3 and BA(A5)=4, it means that the original address 5 (A5) will be stored in bank 3 (B3), and it is mapped to address 4 in the selected bank.
(c) The function RR(x,i) is defined as cyclically shifting the variable x (in binary representation) to the right by i bits. When cyclically shifting to the left, it is denoted as RL(x,i). For example, RR((11100)2,3)=(10011)2.
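The two rotation functions can be sketched in Python as follows. The explicit `width` argument (the number of bits in the binary representation of x) is an assumption of this sketch, since the notation above leaves the word length implicit.

```python
def rr(x, i, width):
    """RR(x, i): cyclically rotate the width-bit value x right by i bits."""
    i %= width
    mask = (1 << width) - 1
    return ((x >> i) | (x << (width - i))) & mask


def rl(x, i, width):
    """RL(x, i): cyclically rotate the width-bit value x left by i bits."""
    i %= width
    mask = (1 << width) - 1
    return ((x << i) | (x >> (width - i))) & mask


# The example from the text: RR((11100)2, 3) = (10011)2
example = rr(0b11100, 3, 5)
```

Since the two functions are inverses of each other, `rl(rr(x, i, w), i, w)` returns x for any w-bit value.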
(d) Since the path metric update is performed based on the butterfly module and P states are processed at the same time, it takes N/P iterations to process the N trellis states at each time stage. We use the butterfly index BI, 0 ≤ BI < N/P, to indicate the iteration index, and it is assumed that the BI value changes in ascending order, i.e., from 0 to N/P-1, at each time stage.
(e) Because the set of N states is divided into N/P subsets, each consisting of P states, we define CBI={c(0,BI), c(1,BI),..., c(P-1,BI)} to be the subset that contains the P state numbers to be processed at the same time according to the given BI value. The element c(l,BI) in CBI represents a state number and its function is defined by (1) in our development. For example, for N=16 and P=4, we have C0={c(0,0),c(1,0),c(2,0),c(3,0)}={0,8,1,9}, C1={c(0,1),c(1,1),c(2,1),c(3,1)}={2,10,3,11}, C2={4,12,5,13}, and C3={6,14,7,15}. This means that states S0, S8, S1, and S9 are processed in the first (BI=0) iteration, and states S2, S10, S3, and S11 in the second (BI=1) iteration.
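The listed subsets can be reproduced with a short Python sketch. The closed form used for c(l,BI) below is an assumption: it is chosen because it yields exactly the subsets {0,8,1,9}, {2,10,3,11}, {4,12,5,13}, {6,14,7,15} and matches the bit pattern (l0,bi1,bi0,l1) quoted for N=16 and P=4 in Sect. 5. It should not be read as a verbatim copy of (1).

```python
def c(l, bi, n, p):
    """Assumed form of c(l, BI): the l-th state processed in iteration BI.

    Even l selects the upper state of a butterfly (offset 0), odd l the
    lower state (offset n/2); l // 2 selects the butterfly within the group.
    """
    return (l % 2) * (n // 2) + bi * (p // 2) + l // 2


def subsets(n, p):
    """Return [C_0, C_1, ..., C_{n/p-1}] as lists of state numbers."""
    return [[c(l, bi, n, p) for l in range(p)] for bi in range(n // p)]


all_subsets = subsets(16, 4)
```

Each consecutive pair (c(2i,BI), c(2i+1,BI)) differs by N/2, which is the pair of old states consumed by one butterfly.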
The address A of a state returns to its original value after m rotations, that is, RR(A,m)=A. Property II-Symmetrical Property: The distance between the two addresses allocated to store the pair of states Sj and Sj+N/2, 0 ≤ j < N/2, at each time stage is fixed and possesses the cyclically rotational property. That is, if the distance is 2^(m-1) at time stage 0, then it becomes 1 at time stage m-1.
Based on Property I and (5), we can deduce $BN(A_{t+1}) = RR(BN(A_t), 1)$. (7)
From (7), we conclude that the mapped memory banks also have the cyclically rotational and symmetrical properties at each time stage. This guarantees that no conflict occurs during path metric update at each time stage. Note that the salient feature of employing such a partition is that we can predict the bank access, including the bank number and the mapped address in the bank, so that the resulting hardware implementation can be greatly simplified as described in the following sections.
Table 1 Conflict-free address locations for m=4 and P=4.
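The conflict-free and rotational behavior can be checked numerically for m=4 and P=4. The bank function `bn` below is a hypothetical stand-in for (5): it XORs the upper and lower two bits of the address, a choice made only because it reproduces the bank numbers quoted in this paper for that case (for example, states 1, 4, 11, and 14 map to B1). The form of c(l,BI) and the address rule RR(S,t) for state S at time stage t are likewise assumptions of this sketch.

```python
M, V = 4, 2            # m = 4 (N = 16 states), v = log2(P) = 2 (P = 4 banks)
N, P = 1 << M, 1 << V


def rr(x, i, width):
    """Cyclically rotate the width-bit value x right by i bits."""
    i %= width
    mask = (1 << width) - 1
    return ((x >> i) | (x << (width - i))) & mask


def bn(a):
    """Hypothetical bank function (assumption): upper half XOR lower half."""
    return (a >> V) ^ (a & (P - 1))


def c(l, bi):
    """Assumed form of c(l, BI) for the butterfly-based grouping."""
    return (l % 2) * (N // 2) + bi * (P // 2) + l // 2


def banks_read(bi, t):
    """Banks accessed in iteration bi at stage t (state s sits at RR(s, t))."""
    return [bn(rr(c(l, bi), t, M)) for l in range(P)]


def is_conflict_free():
    """Every iteration of every stage touches each bank exactly once."""
    return all(sorted(banks_read(bi, t)) == list(range(P))
               for t in range(M) for bi in range(N // P))


def is_rotational():
    """Check BN(RR(A, 1)) == RR(BN(A), 1) for every address A."""
    return all(bn(rr(a, 1, M)) == rr(bn(a), 1, V) for a in range(N))
```

With this choice `banks_read(0, 0)` evaluates to `[0, 2, 1, 3]`, and both checks succeed, which illustrates how a rotation-compatible bank function makes the bank accessed at any stage predictable from the bank at stage 0.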
Conflict-Free Architectures
In the following, we describe two different conflict-free architectures and give the basic concept to achieve the developed architecture with local memory and fixed interconnect.
The Architecture without Transposed ACSs
For the architecture without transposed ACSs, it is assumed that at each iteration the P states in CBI={c(0,BI), c(1,BI),...,c(P-1,BI)} are mapped to P/2 PEs according to the following rule: The ordered pair of states (c(2i,BI), c(2i+1,BI)), for i=0,1,...,P/2-1, is assigned to PEi, which incorporates ACS2i and ACS2i+1 as shown in Fig. 1. The new state d(i,BI) in DBI is output from ACSi. Based on (1) and (2), Fig. 2 shows the mapping between states and processing elements in the case of N=16 and P=4 at time stage t=0. The states in this figure are arranged from top to bottom for BI changing from 0 to N/P-1. For each BI value, the sequence of state numbers is the same as that defined in CBI, i.e., the values of c(l,BI) for l changing from 0 to P-1.
Together with (5), we also show the bank number from which an old state is to be read on the left-hand side of Fig. 2. With in-place scheduling, the sequence of bank numbers listed on the right-hand side, which corresponds to the places that the newly computed path metrics are to be written into, is the same as that on the left-hand side. [...]
Fig. 2 An example of task mapping with P=4 and N=16.
[...] CBI={c(0,BI), c(1,BI),...,c(P-1,BI)}={c(l,BI)|l=0,1,2,...,P-1} is transformed into C*BI={c(l*,BI)|l*=TF(l,BI) for l=0,1,2,...,P-1}, in which the transposing function TF(l,BI) is defined in (8) for l=0,1,...,P-1 and BI=0,1,...,N/P-1, where (lv-1,...,l1,l0) and (l*v-1,...,l*1,l*0) are the binary representations of l and l*, respectively, and the iterating function IF(BI), a function of the butterfly index BI, is defined in (9).
The derivations of (8) and (9) [...] As an example, Fig. 4 shows the transposed conflict-free architecture of Fig. 2. Based on (9), we have IF(BI=0)=(0,0), IF(BI=1)=(0,1), IF(BI=2)=(1,0), and IF(BI=3)=(1,1). Replacing the variable l with l* according to (8) and (9), we can deduce C*0={c(0,0),c(1,0),c(2,0),c(3,0)}={0,8,1,9}, C*1={c(1,1),c(0,1),c(3,1),c(2,1)}={10,2,11,3}, C*2= [...] and 7 from B1. Thus, the resulting interconnection network between memory banks and PEs is expected to be simplified.
Note that, due to in-place scheduling, the same state number may be stored in different banks at different time stages, and it is necessary to take the temporal effect into account. For example, state 1 is allocated in two different banks, B1 and B2, for two consecutive time stages, as highlighted in the two gray boxes of Fig. 4. To consider the temporal effect, let BRt={br(0,t), br(1,t),...,br(P-1,t)} be the set that contains the bank numbers assigned to the state numbers defined in C*0={c(l*,0)|l*=TF(l,0) for l=0,1,2,...,P-1}={c(0,0), c(1,0),...,c(P-1,0)} at time stage t. That is, the path metric of state c(l*,BI), for BI=0,1,2,...,N/P-1, is read from bank br(l,t) at time stage t. According to (10) and the property of convolutional codes, we can derive
br(l,t)=RR(br(l,0),t), (11)
for l=0,1,...,P-1; t=0,1,...,m-1, because, as shown in (7), the assigned memory banks also have the cyclically rotational property at each time stage. Finally, we use BWt={bw(0,t),bw(1,t),...,bw(P-1,t)} to denote the set that contains the bank numbers for storing the newly computed path metrics of the states defined in D*0={d(l*,0)|l*=TF(l,0) for l=0,1,2,...,P-1}. Due to in-place scheduling, each old path metric is immediately overwritten by the new one; therefore, we get
bw(l,t)=br(l,t). (12)
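The effect of the ACS transposition can be illustrated for N=16 and P=4 with a small Python sketch. Three ingredients are assumptions made for illustration only: the transposing function is modeled as a bitwise XOR of l with IF(BI), with IF(BI) taken as the binary value of BI (which agrees with the four IF values listed above); c(l,BI) uses an assumed closed form; and `bn` is a hypothetical bank function that XORs the upper and lower two address bits.

```python
M, V = 4, 2
N, P = 1 << M, 1 << V


def c(l, bi):
    """Assumed form of c(l, BI)."""
    return (l % 2) * (N // 2) + bi * (P // 2) + l // 2


def bn(a):
    """Hypothetical bank function (assumption): upper half XOR lower half."""
    return (a >> V) ^ (a & (P - 1))


def tf(l, bi):
    """Assumed transposing function: l* = l XOR IF(BI), with IF(BI) = BI here."""
    return l ^ bi


def c_star(bi):
    """Transposed subset C*_BI = {c(TF(l, BI), BI) | l = 0..P-1}."""
    return [c(tf(l, bi), bi) for l in range(P)]


def banks_per_position(bi):
    """Bank read by each ACS position l in iteration bi at time stage 0."""
    return [bn(s) for s in c_star(bi)]
```

In this model `c_star(1)` gives `[10, 2, 11, 3]`, the same order as C*1 above, and `banks_per_position(bi)` is `[0, 2, 1, 3]` for every BI, so at a given time stage each ACS position always reads from one bank.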
Following (8), (11), and (12), Fig. 5 illustrates the resulting architecture for P=4, in which the rotation control element (RCE) is employed to control bank accesses. Each RCE contains log2v=1 control signal, which reflects the right rotation property indicated in (11) for each time stage. As a result, each PE only needs to access part of the memory banks, with BR0=BR2={0,2,1,3} and BR1=BR3={0,1,2,3}.
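The bank sets quoted for the RCE architecture follow from rotating the bank numbers of time stage 0. A minimal sketch, assuming that the bank read by position l at stage t is the right rotation by t bits of the v-bit bank number used at stage 0 (v=log2 P=2 here), and taking BR0={0,2,1,3} from the text:

```python
def rr(x, i, width):
    """Cyclically rotate the width-bit value x right by i bits."""
    i %= width
    mask = (1 << width) - 1
    return ((x >> i) | (x << (width - i))) & mask


V = 2                    # v = log2(P) for P = 4
BR0 = [0, 2, 1, 3]       # bank numbers at time stage 0, as quoted in the text


def br_set(t):
    """Bank read by each position l at time stage t (assumed rotation rule)."""
    return [rr(b, t, V) for b in BR0]


bank_sets = [br_set(t) for t in range(4)]
```

Because a 2-bit rotation has period 2, only two distinct connection patterns occur, which is why a single control signal per RCE is enough in this case. Banks 0 and 3 are unaffected by the rotation.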
In this paper, we denote this type of architecture as the RCE architecture. In such an architecture, the interconnection network between memory banks and PEs is fixed at a given time stage, but it changes at the following time stage. Compared with the implementation result shown in Fig. 3(a), the RCE design can reduce the interconnection overhead. Note that a fixed interconnection between memory banks and PEs can be applied to all the time stages if we employ the extended in-place scheduling scheme described in Sect. 4. Before explaining the extended in-place scheduling, we show a lemma which can be adopted to reduce the complexity of the address generation unit (AGU) in the next section.
Fig. 5 The RCE architecture for P=4.
Lemma II: The memory banks B0 and BP-1 are directly connected to the processing elements without using the rotation control elements in the RCE architecture.
Lemma II holds because RR(BN(l)=(0,...,0)2,t)=0 and RR(BN(l)=(1,...,1)2,t)=P-1 for all the time stages.
Architecture with Local Memory and Fixed Interconnect
Extended In-Place Scheduling
Basic Concept of Extended In-Place Scheduling
Two potential problems arise in the realization of Fig. 5: (a) the memory is centralized and (b) the complexity of the routing networks at both the input and output of the memory banks might strongly affect the total area and performance if P becomes large. Based on the transposed ACS architecture, in this section, we show how to find a feasible solution so that each memory bank can be treated as a local memory of a specific processing element inside the ACSU, with hardwired interconnection, for all the time stages. To achieve the desired goals, we develop a novel scheduling technique, denoted as the extended in-place scheduling to distinguish it from the conventional in-place scheduling technique.
In the conventional in-place scheduling [18], each old path metric is immediately overwritten by the newly computed path metric, and the pair of memory addresses is the same for the read and write operations of a butterfly module, as shown in Fig. 6(a). In fact, if more than one butterfly module operates at the same time, it is possible that the pair of memory addresses that a butterfly module reads is different from that in the following write operation. Figure 6(b) shows one of the possible alternatives. The salient feature of applying extended in-place scheduling is stated as follows. When we partition the PMM into a set of banks, we have the freedom to distribute the newly computed path metric into a different bank so that the interconnection network between the memory banks and the butterfly modules is altered. The interesting point is that if we can predict the bank mapping relationship among different time stages, then it is possible to write the newly computed path metric to a specific bank so that each butterfly module always reads the path metrics from a fixed bank. When a memory bank is always accessed by the same butterfly module at all the time stages, it can be treated as a local memory of the module.
Fig. 6 In-place computation with P=4. (a) Conventional method; (b) The extended in-place scheduling scheme.
The Bank Mapping Relationship
According to the discussion in Sect. 3.2.2, we can choose BI=0, for simplicity, in the following development because the derived results are independent of the value of BI. Let BWEt={bwe(0,t), bwe(1,t),...,bwe(P-1,t)} be the set that contains the predicted bank numbers for writing new path metrics based on the extended in-place scheduling scheme at time stage t, i.e., the path metric of state d(l*,0)=d(l,0) in D*0, for l=0,1,2,...,P-1, is written into bank bwe(l,t). Similarly, we use BREt={bre(0,t), bre(1,t),...,bre(P-1,t)} to denote the set whose element bre(l,t) is the bank number from which the state c(l,0) is read when employing the extended in-place scheduling.
Before deriving the desired bank mapping relationship, recall that both the states and bank numbers possess the right rotational property as indicated in Property I and (11), and that the routing network of the RCE architecture in Fig. 5 is designed to reflect the right rotational property among different time stages. As a result, if the new path metric is written into the bank whose number is a one-bit left rotation of the one defined by conventional in-place scheduling, then the same state can be read from the same bank at the next time stage. The extended in-place scheduling is applied to fulfill the relationship defined in (13) for ensuring a fixed interconnection network between the PMM and processing elements.
bwe(l,t)=RL(bw(l,0),1), (13)
for l=0,1,...,P-1; t=0,1,...,m-1. Since a state is always read from the same bank at all the time stages, the following equation can be derived. It implies that BREt=BR0 for all values of t.
bre(l,t)=br(l,0), (14)
for l=0,1,...,P-1; t=0,1,...,m-1. The property of fixed interconnect can be justified as follows. We only need to show that a state Si read from bank Bj at time stage t=0 will be read from the same bank at time stage t=1; the same reasoning can then be applied to the other time stages. For the conventional in-place scheduling, since the state c(l,0) is read from bank br(l,0)=BN(c(l,0)) at the current time stage, say t=0, the same state will be read from bank br(l,1)=BN(RR(c(l,0),1))=RR(BN(c(l,0)),1)=RR(br(l,0),1)=RR(bw(l,0),1) at t=1 by (3), (7), (11), and (12). In contrast, when applying the extended in-place scheduling, we have bre(l,1)=RR(bwe(l,0),1)=RR(RL(bw(l,0),1),1)=bw(l,0)=br(l,0) at t=1. Equations (13) and (14) hold as the situation repeats for other time stages.
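The argument can be checked numerically for P=4. The sketch below takes BR0={0,2,1,3} from Sect. 3 as the bank numbers at time stage 0 and compares where each position l must read at t=1 under the two write rules. The function names are illustrative only.

```python
def rr(x, i, width):
    """Cyclically rotate the width-bit value x right by i bits."""
    i %= width
    mask = (1 << width) - 1
    return ((x >> i) | (x << (width - i))) & mask


def rl(x, i, width):
    """Cyclically rotate the width-bit value x left by i bits."""
    i %= width
    mask = (1 << width) - 1
    return ((x << i) | (x >> (width - i))) & mask


V = 2
BR0 = [0, 2, 1, 3]       # br(l, 0) for P = 4, as quoted in the text
BW0 = list(BR0)          # conventional in-place scheduling: bw(l, 0) = br(l, 0)


def next_read_banks(write_banks):
    """Bank that position l reads at t = 1 when its result was written to
    write_banks[l] at t = 0 (banks rotate right by one bit per time stage)."""
    return [rr(b, 1, V) for b in write_banks]


conventional = next_read_banks(BW0)                   # br(l, 1)
BWE0 = [rl(b, 1, V) for b in BW0]                     # bwe(l, 0) = RL(bw(l, 0), 1)
extended = next_read_banks(BWE0)                      # bre(l, 1)
```

Here `conventional` evaluates to `[0, 1, 2, 3]`, which differs from BR0, whereas `extended` equals BR0 again. The pre-rotation in the write step cancels the stage-to-stage rotation, so the read connections never change.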
As shown in Fig. 4, the old path metric of state S1 is read from B1, but the new path metric of S1 is written into B2. Using the developed extended in-place scheduling scheme, each trellis state is stored in the same memory bank at all the time stages (see Fig. 7 and Table 2). For example, states 1, 4, 11, and 14 are all stored in B1 regardless of the value of t. Therefore, each memory bank can be treated as a local memory of a specific PE. In Table 2, the notation Si(Aj) means that state Si is stored at address Aj of the original PMM without bank partition. The Aj is given to make a comparison with the assigned address BA(Aj) in the left-most column. Figure 8(a) shows the generalized local-memory (LM) architecture based on the transposed ACS and extended in-place scheduling. Figure 8(b) illustrates an example of the LM architecture with P=8.
Fig. 8 (a) The generalized LM architecture; (b) An example of LM design with P=8.
Compared with Figs. 3 and 5, we can see that fixed interconnection networks exist at the input and output of the memory module.
Pseudo-Bank Design
From (11) and (12), the expression bwe(l,t)=RL(bw(l,0),1) given in (13) can also be rewritten as follows. [...]
for l=0,1,...,P-1; t=0,1,...,m-1. According to the definition of the assigned address BA(A) in (6), for the butterfly-based computation, the assigned addresses for c(l,BI) in (1) have the following characteristics. Because the term ⌊l/2⌋ on the right-hand side of (1) is discarded when used in (6), we have
BA(c(0,BI))=BA(c(2,BI))=...=BA(c(P-2,BI)); BA(c(1,BI))=BA(c(3,BI))=...=BA(c(P-1,BI)), for BI=0,1,...,N/P-1
for even and odd values of l, respectively, at time stage t=0. That is, the read addresses of banks bre(l,0) for even (odd) values of l are the same. Hence, when writing new path metrics into those banks, they are written with the same address. Because bre(l,0)=RR(bwe(l,0),1)=l/2 for even values of l, all the banks in the set {B0,B1,...,BP/2-1} are read with the same address at t=0.
Similar reasoning applies to the odd values of l. From (15), we obtain
BA(d(0,BI))=BA(d(1,BI))=...=BA(d(P/2-1,BI)); BA(d(P/2,BI))=BA(d(P/2+1,BI))=...=BA(d(P-1,BI)), for BI=0,1,...,N/P-1. (17)
Based on the properties of convolutional codes and extended in-place scheduling, at the next time stage (t=1) the addresses for reading the path metrics of states c(l,BI) for even (odd) values of l should be the same. Repeating the above procedure, we conclude that all the banks in the set {B0,B1,...,BP/2-1} (or {BP/2,BP/2+1,...,BP-1}) are read with the same address at all the time stages.
Since all the elements in {B0,B1,...,BP/2-1} (or {BP/2,BP/2+1,...,BP-1}) are accessed by using the same address at the same time, we denote this feature as the pseudo-bank property (Property III) in our development. The benefits of applying this salient feature are summarized as follows. (a) The resulting address generator can be greatly simplified by only generating the addresses of B0 and BP-1 in different iterations at different time stages. (b) We can merge the P partitioned banks, each d bits wide, into only two pseudo-banks, each (P/2)*d bits wide, if needed. In this way, the expected physical size of the required memory can be further reduced as the number of address decoders inside the memory is decreased. Property III-Pseudo-Bank Property: Based on the transposed ACS and extended in-place scheduling, the banks can be divided into two disjoint sets, PB0={B0,B1,...,BP/2-1} and PB1={BP/2,BP/2+1,...,BP-1}, independent of the number of employed processing elements, and all the banks in a pseudo-bank, PB0 or PB1, are accessed with the same address at the same time.
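The grouping into two pseudo-banks can be illustrated with a short Python sketch. It assumes that the bank read by ACS position l is RR(l,1) taken over v=log2 P bits, which agrees with the relation bre(l,0)=l/2 stated above for even l; the corresponding statement for odd l is an extrapolation made for this example. The helper names are illustrative.

```python
def rr(x, i, width):
    """Cyclically rotate the width-bit value x right by i bits."""
    i %= width
    mask = (1 << width) - 1
    return ((x >> i) | (x << (width - i))) & mask


def read_bank(l, p):
    """Assumed bank read by ACS position l: RR(l, 1) over log2(p) bits."""
    v = p.bit_length() - 1
    return rr(l, 1, v)


def pseudo_banks(p):
    """Split the p banks into the sets read by even and by odd positions."""
    pb0 = sorted(read_bank(l, p) for l in range(0, p, 2))
    pb1 = sorted(read_bank(l, p) for l in range(1, p, 2))
    return pb0, pb1


def pseudo_bank_width(p, d):
    """Word width in bits of one merged pseudo-bank built from p/2 d-bit banks."""
    return (p // 2) * d
```

For P=4 this gives PB0=[0, 1] and PB1=[2, 3], and for P=8 it gives [0, 1, 2, 3] and [4, 5, 6, 7]. The number of pseudo-banks stays at two while only the word width grows with P.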
Based on the extended in-place scheduling, an alternative representation of Table 2 is listed in Table 3 to emphasize the pseudo-bank property in our development. As seen from this table, the two banks in PB0={B0,B1} and PB1={B2,B3} are accessed with the same address in different iterations at different time stages.
Address Generation
By the pseudo-bank property, it is sufficient to compute only the addresses for B0 and BP-1 in different iterations at different time stages. By (13) and Lemma II, we know that the left rotational operation has no effect on B0 and BP-1 because RL(B0=(0,...,0)2,t)=0 and RL(BP-1=(1,...,1)2,t)=P-1. Therefore, the assigned addresses can be generated based on the design method used in the conventional in-place [...]
PBA0(BI,t)=BA(RR(c(TF(0,BI),BI),t)); PBA1(BI,t)=BA(RR(c(TF(P-1,BI),BI),t)), (18)
[...] pseudo-bank property. In this way, the P partitioned banks, each d bits wide, can be merged into only two pseudo-banks, each (P/2)*d bits wide, to reduce the physical size of the required PMM. Figure 9(b) shows the associated address generation unit (AGU) that implements (18), in which only a pair of memory addresses is generated at each iteration. Based on our development, a design example of the local-memory, pseudo-bank (LMPB) architecture with P=4 is illustrated in Fig. 9(c). Compared with the BCF and RCE designs in Figs. 3 and 5, the LMPB design can significantly reduce the complexity of the AGU and of the interconnection network between PMM and PEs.
Table 3 Illustration of the pseudo-bank property in Table 2.
Fig. 9 (a) The generalized LMPB architecture; (b) The address generation unit (AGU); (c) A design example with P=4.
Comparison and Discussion
For area-efficient architecture design, in addition to ensuring conflict-free access of the PMM, a good solution must take into account the impact of the routing network between the PMM and ACSU. In [11], [12], the methodology of mapping the trellis onto the ACS network is based on matrix permutation techniques where the working matrix is heuristic. In contrast, our proposed methodology of conflict-free address arrangement is based on (5), which is much simpler and more efficient than their approaches. The path metrics of [13], [14] are stored in FIFOs, which may take a larger chip area for VDs with long constraint length. In addition, their works require extra memory space to perform concatenation and decimation operations. In [15], [16], the focus is on the derivation of conflict-free memory access. Without applying appropriate scheduling techniques, the resulting architecture suffers from a complicated routing network between the PMM and ACSU. In this work, we aim at reducing the complexity of the routing network between PMM and ACSU. With 100% utilization of processing units, we have shown that a fixed interconnection network in between can be achieved based on the proposed transposed ACSs and extended in-place scheduling scheme. Compared with the existing techniques, our development has the following salient features. (a) The design can be accomplished in a simple and systematic way. (b) A fixed interconnect can be used to distribute the path metrics to ACSs. (c) The pseudo-bank property of the PMM can be applied not only to increase the equivalent memory bandwidth, but also to simplify the design of the address generation unit. (d) The property of local memory access makes our development a feasible solution for systolic array design.
Figure 10 depicts the design of VDs based on the presented BCF and LMPB architectures, respectively, in which the buffer is used to accumulate the decision bits for all the states at time stage t before writing into the SMU. The BCF architecture in Fig. 10(a) serves as a basic area-efficient design. Its main disadvantage is the complicated (de)multiplexer (DMUX/MUX) routing network and AGU used for fulfilling the in-place computation. This incurs a large hardware overhead and may degrade the overall operation speed. By applying the transposed ACS scheduling, we derive the RCE architecture for reducing the interconnection area; however, it still suffers from the drawback of a global memory structure.
The LMPB architecture in Fig. 10(b) is designed to overcome the previously mentioned drawbacks using our proposed extended in-place scheduling scheme. In addition to the fixed interconnect between PMM and ACSU, the design of the AGU is also greatly simplified according to the pseudo-bank property. Another benefit of the fixed interconnect between PMM and ACSU is that the transpose component Tr2 for the SMU also inherits the fixed-interconnect feature. That is, Tr2 contains N wires arranged based on the rule of extended in-place scheduling. As for the transpose component Tr1, it can be eliminated because we know the states to be processed simultaneously and the mapping relationship as described in Sect. 3.2.2. As an example, assume N=16 and P=4; the state c(l,BI) can be represented as (l0,bi1,bi0,l1) in binary notation according to (1). To have a fixed interconnection, we can replace c(l,BI) with c(l*,BI) and deduce the transformed state (l0⊕bi0, bi1, bi0, l1⊕bi1) by (9). With the derived mapping relationship, the associated hardware implementation can be accomplished with a fixed interconnect in the BMU. Note that since the states are transformed based on the exclusive-OR operation, we can incorporate the effect into the generator polynomials of the codeword and retain the original state order.
In consequence, there exists no overhead in BMU and SMU design when we apply the transposed ACS and extended in-place scheduling to ACSU and PMM. Table 4 compares the features of the three different architectures. The interconnect complexity shown in the last row is measured based on the number of point-to-point interconnection wires, the so-called 2-point nets, between the ACSU and PMM, excluding the required (de)multiplexers. The notation O(P) means that the interconnect complexity is linearly proportional to the number of partitioned banks.
In Table 5, we compare our results with those in the literature. Basically, the work [15] is targeted for rate-1/n convolutional codes, and all three works [15], [16], [19] need a number of multiplexers to adjust the paths to access the PMM, i.e., their routing networks are not fixed (or hardwired) [...] size of employed memory.
Table 4 Comparison of three different architectures.
Conclusion
In this paper, we present a systematic approach to designing area-efficient VDs with reduced interconnect complexity between PMM and ACSU. With a set of formulas derived for performing memory partition, ACS transposition, and extended in-place scheduling, a novel architecture with the local-memory/pseudo-bank feature is developed in this work. In our LMPB architecture, not only does the interconnection network between PMM and ACSU become hardwired, but the design of the address generation unit is also greatly simplified. We also show that the proposed techniques do not affect the design of the conventional BMU and SMU. By choosing an appropriate value of P, our development can be easily applied to design area-efficient VDs for different applications.
