Abstract - The problem of practical realization of the optimal fixed-delay symbol-by-symbol detection algorithm, which is optimum in the sense of minimizing the symbol error probability, 
I. INTRODUCTION
An increasing demand for high data rate transmission over bandlimited channels with severe intersymbol interference (ISI) has resulted in a flurry of activity over the last two decades to develop improved methods of equalization [1]. Since error probability is the most important performance measure in such applications, it is of interest to develop practical, VLSI-implementable receivers which optimize some measure directly related to the probability of error. Two important criteria in this regard are maximum likelihood sequence estimation (MLSE) and minimum probability of symbol error.
An important breakthrough came with the derivation of maximum-likelihood receivers for channels with finite memory. Taking advantage of the fact that the effective range of ISI is actually finite, Chang and Hancock derived a sequential procedure [2], in which the number of computations increased only linearly (rather than exponentially) with the message length. Following [2], Abend and Fritchman [3] derived the optimum fixed-delay 'symbol-by-symbol' detector, which is optimum in the sense of minimizing the symbol error probability given a fixed delay constraint, D (i.e., given all the previous as well as D succeeding received samples). This is simpler than the optimum compound detector [4, 5], in that the symbol estimation can be carried out before the entire data sequence is received. Nevertheless, neither this nor related simplifications, as in [6], were thought of as practical procedures, and researchers have pointed out the reasons [7-9]. Unlike optimal symbol detection, the procedure that has received tremendous attention in the last two decades is the Viterbi algorithm (VA), which was originally introduced in 1967 for maximum-likelihood decoding of convolutional codes [10] and, after its optimality in the sense of MLSE was realized, was applied to signaling on ISI channels [8, 11, 12].
While the structural aspects of MLSE (or the related VA) have received a great deal of attention [13-18], the criterion and/or algorithms based on the minimum probability of symbol error have not received the same scrutiny. The main purpose of the present paper is to examine an algorithm which minimizes symbol error probability with the objective of developing parallel-structure implementations with low complexity. We focus on the practical realization of the optimal fixed-delay symbol-by-symbol detection algorithm derived in [3]. First, the algorithm is mapped onto a fully-parallel structure suitable for VLSI implementation. Then, a number of simplifications are introduced, through systematic reformulations of the algorithm, that avoid the computation of exponentials and reduce (or possibly eliminate) the number of multiplications to be performed. All this is achieved at a price which seems acceptable in practice. In Section IV, a number of suboptimal design considerations are discussed and a few suboptimal symbol-by-symbol detectors are introduced. In Section V, the simplified parallel symbol (SPS) detector is compared to the Viterbi detector, where it is shown that a suboptimal SPS detector is identical to the minimum-metric Viterbi detector.

Figure 1 shows the block diagram of a digital transmission system with pulse amplitude modulation (PAM). The cascade of the transmitting filter, the channel filter, the receiver's whitening matched filter, and the sampler (Figure 1 [...] take on any value $I_k$ drawn from an $M$-symbol alphabet ($I_k = \pm 1$ in the binary case), $L$ is the effective length of the channel dispersion, and $\{\eta_k\}$ are independent, identically distributed Gaussian random variables with zero mean and variance $N_0/2$.
The specific problem is to provide a delayed estimate of the transmitted symbol $I_{k-D}$, given the observed received sequence $v_1, v_2, \ldots, v_k$, where $D$ is the chosen delay constraint, according to some optimality criterion.
B. The Algorithm
We consider the optimal symbol-by-symbol detection algorithm under a fixed delay constraint, developed by Abend and Fritchman [3] and as presented in [7]. The algorithm is optimum in the sense of minimizing the probability of a symbol error, in comparison with all detectors which depend on the same number of received samples.
Let $v_1, v_2, \ldots, v_k$ be the observed received sequence, where $k > D \ge L$.¹ As is well known, minimization of symbol error probability is equivalent to MAP estimation. Hence, the algorithm computes the a posteriori probabilities of the candidate symbols and chooses the symbol with the largest probability [7, p. 387].
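The MAP rule described above can be illustrated by brute force on a short block. The following is only a sketch under stated assumptions, not the recursive algorithm of [3]: it assumes a binary alphabet, a channel model of the form $v_k = \sum_j f_j I_{k-j} + \eta_k$ with noise variance $N_0/2$, and no symbols before the first one. The names `map_fixed_delay`, `f`, `n0`, and `d` are hypothetical.

```python
import itertools
import math

def map_fixed_delay(v, f, n0, d):
    """Brute-force MAP estimate of I_{k-D} from v_1..v_k, with k = len(v).

    Assumed model (illustrative): v_t = sum_j f[j] * I_{t-j} + noise,
    noise variance n0/2, equi-probable symbols in {+1, -1}.
    """
    k = len(v)
    L = len(f) - 1
    posterior = {+1: 0.0, -1: 0.0}          # unnormalized a posteriori weights
    for seq in itertools.product((+1, -1), repeat=k):
        dist = 0.0
        for t in range(k):
            # Noise-free channel output; symbols before the block are taken as absent.
            s = sum(f[j] * seq[t - j] for j in range(L + 1) if t - j >= 0)
            dist += (v[t] - s) ** 2
        # Gaussian likelihood with variance n0/2 is proportional to exp(-dist/n0).
        posterior[seq[k - 1 - d]] += math.exp(-dist / n0)
    # Choose the symbol value with the largest a posteriori weight.
    return max(posterior, key=posterior.get)
```

The cost grows as $2^k$, which is exactly why a recursive formulation is needed in practice; the sketch is useful only as a reference against which a recursive implementation can be checked on short sequences.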
The MAP estimate of the information symbol at time
In (2), the expression $\arg\max_a \{G(a)\}$ is equal to the value of 
where $P(I_k)$ is the a priori probability of the information symbol $I_k$ and $p(\cdot)$ denotes the probability density function of the received sample conditioned on the possible values of the information symbols and is given by

¹ The algorithm for the case $D < L$ (also given in [3]) is similar to the case $D \ge L$ presented in this paper, except that it considers $L$ (rather than $D$) most recent symbols for the recursion.
C. The Computational Complexity
Using (4), the recursion in (3) can be replaced, for statistically independent, equi-probable (i.e., $P(I_k) = P_I$) input data,
Using (3), the initial value for the recursion in (4) is obtained 
(7)
The MAP estimate of the information symbol, at time $k-D$, is then given by

The intimidating computational burden of the above algorithm is apparent from Table 1. The procedure involved is particularly complicated due to the large number of slow and composite operations of exponentiation and multiplication required for the estimation of each symbol. In addition, there is a large dynamic range associated with the computation of exponentials. In a typical floating-point representation, this leads to overflow (or underflow) problems. A practical realization of this algorithm, which is the focus of this paper, is not possible without challenging its computational burden.
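The dynamic-range problem mentioned above is easy to reproduce. The sketch below assumes exponentials of the form $e^{-y/N_0}$; the numerical values of `n0` and `metrics` are illustrative only and are not taken from the paper.

```python
import math

n0 = 0.05                 # assumed noise parameter N0 (illustrative value)
metrics = [40.0, 41.0]    # hypothetical accumulated squared distances

# Direct evaluation: both exponentials underflow to 0.0 in double precision,
# so any comparison or normalization based on them is meaningless.
direct = [math.exp(-y / n0) for y in metrics]

# Working with the exponents instead keeps all quantities well scaled:
# ln(e^{-a/N0} + e^{-b/N0}) = -min(a, b)/N0 + ln(1 + e^{-|a - b|/N0}).
smallest = min(metrics)
gap = max(metrics) - smallest
log_sum = -smallest / n0 + math.log1p(math.exp(-gap / n0))
```

Here `direct` contains only zeros, while `log_sum` remains a finite number close to $-800$.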
III. SIMPLIFIED PARALLEL SYMBOL DETECTOR

A. A Parallel Structure
To exploit the structure inherent in the algorithm and to take full advantage of its parallelism and regularity, we follow a procedure (as in [19]) towards a design that is suitable for VLSI implementation [20-23]. First, the recursive algorithm is mapped onto a parallel array structure.
To simplify the notation in (6), let

Then, with $N = 2^L$, the algorithm defined by (6)-(8) can be rewritten as follows.

where the floor function $\lfloor a \rfloor$ denotes the greatest integer not greater than $a$.
To exploit the parallelism in the recursive algorithm expressions given by (12), we first consider the index space $(i, k)$ of (12a) [...] In the index space $(j, k)$ of the derived DG, (11c), which gives the optimal estimate of the $k$th information symbol, uses the intermediate node outputs $\{X_j^k\}$ given by (12b). According to (11c), detection of $I_k$ depends on the comparison of two sums (each of which could be interpreted as an average) of the node outputs $X_j^{k+D}$, taken over two different sets of nodes $j$. Since $\{X_j^{k+D}\}$, for all $j$, are computed in parallel, no storage of path history is required.
It must be noted that, for equi-probable, statistically independent binary information symbols, the algorithm given in (11c) and (12) is the same as the optimal fixed-delay symbol-by-symbol detection algorithm derived by Abend and Fritchman [3] and presented in (6)-(8). The latter is only regrouped to derive the fully-parallel array structure represented by the DG of Figure 2(b). An N-processor realization, one for each state, increases the throughput of the system by a factor of N compared to a serial implementation. Equation (12), processed at each node, is represented in Figure 3.
Fig. 3. Processing required at each node before simplification, (12).
B. A Low-Complexity Parallel Structure
The next step is to reduce the large computational burden of the algorithm. Parallel processing does increase the throughput significantly, but still, the computations required at each node are complex and slow, and are accompanied by other problems such as that of a large dynamic range. The exponential factors cannot be replaced by their exponents. Furthermore, the presence of ISI terms (in (7)) makes the exponents' dynamic range too large for the exponentials to be approximated by linear or quadratic expressions, even at low values of signal-to-noise ratio (SNR), where SNR is defined as the ratio of the average power of the signal to the average power of the noise in the signal bandwidth. For the binary case considered here, $I_k = \pm 1$, the signal power is equal to 1, and therefore,

1664 IEEE TRANSACTIONS ON COMMUNICATIONS, VOL. 42, NO. 2/3/4, FEBRUARY/MARCH/APRIL 1994
Noting that Xf (1 1 
We now use the concept of Jacobi's logarithm to rewrite (14). The function Q, called Jacobi's logarithm [25], is defined by the property

where $\alpha$ is a primitive element in a finite field. By analogy,
Similarly,

$$e^{\beta_1} + e^{\beta_2} = \exp\left(\max(\beta_1, \beta_2) + \ln\left(1 + e^{-|\beta_1 - \beta_2|}\right)\right) \qquad (15)$$

Eqn. (14) can be rewritten as
Let us use the following shorthand notation
The recursive equations, for any given value of the delay constraint D, can now be obtained by generalizing the above simplification:
Then, using (18), (16) Xf+' = C j~: , iz0.1, * * . , 2 N -1 ( 1 9~)
Note, from (18) and (19), that the knowledge of the noise variance, $N_0/2$, is now used in the computation of the look-up term (18) only.
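The effect of (15) on a sum of two exponentials with negative, scaled exponents can be checked numerically. The sketch below assumes that the additive look-up term has the form $-N_0 \ln(1 + e^{-|\delta|/N_0})$, which follows from (15) when the exponents are $-a/N_0$ and $-b/N_0$; since (18) itself is not reproduced here, this form is an assumption. The function names are hypothetical (`min_star` mirrors the min* notation used later in the text).

```python
import math

def lookup_term(delta, n0):
    """Assumed look-up term: -N0 * ln(1 + exp(-|delta| / N0))."""
    return -n0 * math.log1p(math.exp(-abs(delta) / n0))

def min_star(a, b, n0):
    """Compare-select followed by lookup-add."""
    return min(a, b) + lookup_term(a - b, n0)

def direct(a, b, n0):
    """Reference value: -N0 * ln(e^{-a/N0} + e^{-b/N0}), using exponentials."""
    return -n0 * math.log(math.exp(-a / n0) + math.exp(-b / n0))
```

For moderate arguments `min_star(a, b, n0)` and `direct(a, b, n0)` agree to rounding error, but only the former avoids evaluating exponentials of large magnitude, and the noise parameter enters only through `lookup_term`.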
Also, from (18), $-N_0 \ln 2 \le \phi_j^k < 0$, and $\phi_j^k$ is significant only for small values of $|\delta_j^k|$. This is further discussed in Section IV. A significant simplification of the algorithm (19) occurs when only $y_j^k$ is computed at every node $(j, k)$ and the factor $\phi_j^k$ is tabulated explicitly as a function of $|\delta_j^k|$. Figure 4 shows the resulting simplification in the processing required at each node. Similarly, using (13a) in (11c) and applying the concept of Jacobi's logarithm, the MAP estimate of $I_{k-D}$, in the new structure, is obtained from

[...] (22) is zero and the remaining are positive, making one of the exponential terms in (22) equal to '1'. As $\Delta_i$ increases, $e^{-\Delta_i/N_0}$ in (22) decreases and becomes negligible compared to '1'. Consequently, a near-optimal approach could be to keep the two smallest $\Delta_i$ only, or, instead, if a 'tree' search is used in finding $\min(a_0, a_1, \ldots, a_{N-1})$, to apply the table look-up only to the last binary comparison, as shown in (23). This way, the same look-up table used for the recursion may be used for output generation (symbol estimation). In this case, (21) may be approximated by

Note that if the optimal expressions (21) are used in the computation of $y_k(0)$ and $y_k(1)$, each binary comparison will be followed by one lookup-add operation. However, the approximation given by (23) is justified by the low sensitivity of the algorithm to the additive look-up term given by (18). This is further discussed in Section IV below.
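One possible reading of the tree-search variant described above is sketched here. It assumes the look-up term $-N_0 \ln(1 + e^{-|\delta|/N_0})$ and compares applying the correction at every binary comparison with applying it only at the last one. The function names and the way the list is split into two halves are illustrative assumptions, not the paper's (21) and (23).

```python
import math
from functools import reduce

def min_star(a, b, n0):
    """min(a, b) plus the assumed additive look-up term."""
    return min(a, b) - n0 * math.log1p(math.exp(-abs(a - b) / n0))

def optimal_combine(values, n0):
    """Look-up correction applied after every binary comparison."""
    return reduce(lambda a, b: min_star(a, b, n0), values)

def near_optimal_combine(values, n0):
    """Plain min inside each half; correction only at the final comparison."""
    half = len(values) // 2
    return min_star(min(values[:half]), min(values[half:]), n0)
```

With `values = [3.0, 0.4, 2.5, 0.9]` and `n0 = 0.5`, the two results differ by less than 0.01, because the terms far from the minimum contribute almost nothing to the sum of exponentials.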
Note (by comparing Figures 3 and 4) that all multiplications in the computation of the exponential terms, and all exponentials, are removed, and the multiplication inside every node is replaced by one addition, all at the expense of simple operations of compare-select and look-up (one per node) and an (affordable) extra complexity in symbol estimation ((20) replaces (11c)). The result is a simplified parallel symbol (SPS) detection algorithm (19)-(23) with a fully-parallel structure containing $N$ nodes where the computations required at each node are simplified to add-compare-select followed by lookup-add (Figure 4), and where the storage of surviving sequences is not required. This algorithm is optimal if the symbol estimation given by (20) is performed using (21) and is near-optimal if (23) is used.

Nevertheless, considering that a multiplication is a few times more complex than an addition, a binary comparison, or a look-up operation, and that the complexity of an exponentiation is yet several times that of a multiplication, the substantial reduction in overall complexity is apparent. To this, one can add the other benefits: the (serious) problems associated with a large dynamic range, such as overflow or underflow, are (practically) avoided by replacing the summation of exponentials by values comparable to their exponents; and, finally, as stated earlier, the exploitation of the parallelism inherent in the algorithm has resulted in the derivation of a parallel structure, suitable for VLSI implementation, with a speed-up factor of $N$ compared to a serial implementation.
The optimal SPS detection algorithm, given in (19)-(22) [...] represented by a DG, in which the nodes output $\{y_j^k\}$, given in (28), with the same topology as that derived earlier in this paper but with $2^{D-1}$ nodes (instead of $2^L$ nodes). It must be noted that insofar as the recursion in (28) is concerned, the DG could be further collapsed to one with $2^L$ nodes. However, this is restricted by (20) and (31), in which the $\{y_j^k\}$ that span $D$ symbols and are present in the $2^{D-1}$ nodes are needed for the optimal symbol estimation. This restriction is relaxed in the next section where more suboptimal symbol detectors are considered.
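The growth in the number of nodes with the delay constraint can be tabulated with a small helper. The sketch assumes $D \ge L$ and uses the node count $2^{\max(L, D-1)}$ quoted for the SPS dependence graph in Section V; the function name is hypothetical.

```python
def sps_nodes(L, D):
    """Nodes in the SPS dependence graph: 2**L when D = L, 2**(D - 1) when D > L."""
    if D < L:
        raise ValueError("this sketch assumes D >= L")
    return 2 ** max(L, D - 1)

# Node counts for a channel with L = 2 as the delay constraint grows.
counts = {D: sps_nodes(2, D) for D in range(2, 7)}
```

For `L = 2`, the counts are 4, 4, 8, 16, and 32 for `D = 2` through `D = 6`, which illustrates why large delay constraints are costly for the optimal SPS detector.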
As discussed in Section III.B, a near-optimal approach is to replace $\min^*(\cdot)$ by $\min(\cdot)$ for all comparisons in (31) except for the final one (i.e., the one outside the square brackets in (31)). The complexity of the near-optimal SPS detection algorithm is given in the second column of Table 1. The exact comparison is a function of the specific design approach, the technology, and the comparison criterion used. Again, the computations required to obtain $\left(v_k - \sum_j f_j I_{k-j}\right)^2$, common to all algorithms considered, are not included in Table 1. Also, to perform each look-up operation, a $\delta_j^k$ (17), which is the difference between two of the $y_j^k$ in (28a), must be obtained. In practice, a look-up operation always follows a comparison, as given in (19b), and thus may be combined with it.
First, $\delta_j^k$ may be obtained; then, its magnitude is used in the table look-up operation and its sign bit is used to select the result of the comparison. Also, the time needed to perform the
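The combination of comparison and look-up can be sketched as follows. The table contents assume a look-up term of the form $-N_0 \ln(1 + e^{-|\delta|/N_0})$ sampled on a uniform grid; the grid step, the table size, and the function names are illustrative assumptions rather than design values from the paper.

```python
import math

def build_table(n0, step, size):
    """Tabulate the assumed look-up term on a uniform grid of |delta|."""
    return [-n0 * math.log1p(math.exp(-(i * step) / n0)) for i in range(size)]

def compare_select_lookup(a, b, table, step):
    """One difference drives both the selection and the table address."""
    delta = a - b
    # Magnitude of the difference addresses the table (saturating at the end).
    index = min(int(abs(delta) / step), len(table) - 1)
    # Sign of the difference selects the smaller operand.
    selected = b if delta > 0 else a
    return selected + table[index]
```

Because the last table entries are close to zero, saturating the address for large differences costs very little accuracy, which is consistent with the look-up term being significant only for small differences.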
IV. MORE SUBOPTIMAL DETECTORS
Simulation results show that the SPS detection algorithm derived in Section III is quite robust to uncertainty in our knowledge of the noise variance $N_0/2$ and, generally, it is relatively insensitive to the size of the look-up table (18) 
Equation (33) can be represented by the DG derived in Section III.A (Figure 2(b)) with $2^L$ nodes, where each node computes (33a). Not only is qk computed, but the surviving $\hat{I}_{k-L}$ is also stored at each node. In (33c) [...]

A hybrid of the optimal SPS detector and Suboptimal Detector 'B', referred to as Suboptimal Detector 'M', may also be derived. This can be obtained directly from the optimal SPS detection algorithm (20), (28), (31) through an approach similar to the one used to derive the 'B' algorithm. The only difference is that $\min^*(\cdot)$ and $\min(\cdot)$ are taken to be equal in the symbol estimation (20), (31) but not in the recursion (28). The result is an algorithm which is the same as (33), except that $\min(\cdot)$ is replaced by $\min^*(\cdot)$ in (33a) and (33b). The complexities of Suboptimal Detectors 'A2' and 'B' are displayed in Table 1. The insignificance of the look-up factor with increasing SNR suggests that the performance of these suboptimal detectors should approach optimal performance as SNR increases. Figure 6 verifies this for 'B', for the specific channel and the delay constraint indicated. The curve labeled 'Threshold Detector' is obtained by examining the sign of the received symbol $v_k$ only (ignoring the channel memory) for symbol estimation:
where $\mathrm{sgn}(x)$ is $-1$ for $x < 0$ and $+1$ for $x \ge 0$. Furthermore, the lower-bound curve labeled 'No ISI' (for $I_k = \pm 1$), obtained by assuming that the additive white Gaussian noise channel is not time-dispersive, represents $P_e = \tfrac{1}{2}\,\mathrm{erfc}(\ldots)$
The accuracy of the simulation results throughout this work has been such that the standard deviation for each point is less than ten percent of the mean. Moreover, simulations with a variety of channels indicated that a 
where the marginal densities on the right hand side of (34) are considered to be independent for additive white Gaussian noise.
If the length of the channel dispersion (the number of ISI terms) is $L$ (finite), a recursive algorithm to estimate the information symbols based on MLSE can be used as follows. Upon reception of the sample $v_k$, the metrics
are computed. The second term in the square brackets, in (35a), is the branch metric and the sum the path metric. For a binary alphabet, (35b) involves computation of $2^{L+1}$ path metrics and selection of $2^L$ surviving path metrics. Also, the $I_{k-L}$ suggested by each surviving path metric is stored, hence the storage of $2^L$ surviving sequences. At each stage, a decision will be made on the set $(I_1, \ldots, I_m)$, $1 \le m < k$, if all the $2^L$ surviving sequences that terminate in the symbol $I_k$ agree on its value. Otherwise, the decision is deferred. [...] by the majority of the surviving sequences), the 'minimum metric' rule (choosing the symbol suggested by the path with minimum metric), and the 'arbitrary selection' rule (choosing the symbol suggested by an arbitrary path).
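One stage of the add-compare-select recursion described above can be sketched as follows. This is a minimal illustration under assumptions: a binary alphabet, a channel model $v_k = \sum_j f_j I_{k-j} + \eta_k$, squared-error branch metrics, and a state encoding in which bit $j$ of the state index holds $I_{k-1-j}$. The name `acs_step` and the encoding are hypothetical, not the paper's (35).

```python
import math

def acs_step(metrics, v, f):
    """Extend 2**L survivors by one symbol: 2**(L+1) candidates, 2**L winners."""
    L = len(f) - 1
    n = 1 << L
    new_metrics = [math.inf] * n
    survivors = [None] * n          # predecessor state chosen for each new state
    for s in range(n):
        for b in (0, 1):
            # Hypothesized symbols I_k, I_{k-1}, ..., I_{k-L} as +1 / -1 values.
            symbols = [2 * b - 1] + [2 * ((s >> j) & 1) - 1 for j in range(L)]
            branch = (v - sum(fj * x for fj, x in zip(f, symbols))) ** 2
            candidate = metrics[s] + branch            # add
            t = ((s << 1) | b) & (n - 1)               # successor state
            if candidate < new_metrics[t]:             # compare
                new_metrics[t] = candidate             # select
                survivors[t] = s
    return new_metrics, survivors

# Noise-free example: symbols +1, -1, -1, +1 after an initial -1, f = [1.0, 0.5].
f = [1.0, 0.5]
metrics = [0.0, math.inf]           # assume the symbol before the block was -1
for v in [0.5, -0.5, -1.5, 0.5]:
    metrics, survivors = acs_step(metrics, v, f)
best_state = min(range(len(metrics)), key=metrics.__getitem__)
```

In this noise-free example the minimum-metric state ends at `best_state = 1` (last symbol $+1$) with metric 0, and its stored predecessor is state 0 (previous symbol $-1$).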
The recursion in (35) can be represented by the Viterbi trellis [26] which has the same structure as the SPS detector's DG (Figure 2(b)). The Viterbi trellis may also be considered as its DG [19] and may be used directly for its VLSI implementation [14]. It consists of $2^L$ nodes where the processing at each node is the add-compare-select (ACS) given in (35). Based on this representation, (35) may be rewritten as
The processing required at each node of the Viterbi trellis, given by (36a) and (36b), is shown in Figure 7. The estimate of the information symbol $I_{k-D}$, according to minimum-metric VA, is then given by
Fig. 7. Processing required at each node of Viterbi trellis.
The fixed-delay symbol-by-symbol detection algorithm and the MLSE are optimum according to two different criteria. The fixed-delay symbol-by-symbol detector is optimum in the sense of minimizing the probability of a symbol error given a delay constraint. On the other hand, MLSE is optimum in the sense of minimizing the error probability of the entire sequence. Nevertheless, even though VA is a method to implement MLSE, the implementation is a symbol-by-symbol process. This can be seen in (36) for minimum-metric VA.
Hence, it is fair to compare the (practical) VA and the SPS detection algorithm based on symbol error probability, which is also the measure of performance most commonly used in data communications.
The fixed-delay symbol-by-symbol detection algorithm is, by definition of its optimality, expected to yield a lower probability of a bit error $P_e$ compared to the Viterbi algorithm (VA) for the same delay constraint. As pointed out by Hayes [U], "optimum sequence detection considers all erroneous sequences to be equally bad" and therefore, at low SNR, "errors may lead to detected sequences that are far from the true sequence". The simulation results show that the performance of the SPS detector is indeed slightly superior to that of the Viterbi detector at low values of SNR but the two are comparable at moderate to high values of SNR. This could also be predicted from an important result shown in this work. From the comparison of (36) and (33) it is readily apparent that the minimum-metric VA and Suboptimal 'B' algorithm (derived from the optimal SPS detection algorithm) are one and the same. This result could be obtained directly through approximating the summation of exponentials in the original algorithm (7)-(9) by the largest exponential and, then, applying the transformations discussed in Section IV that led to the derivation of (34). Ungerboeck used this approximation in a different approach [6] and pointed out that, by doing so, "the 'single bit' MAP concept leading to the optimal nonlinear equalizer has been replaced by a 'sequence' ML concept. This ML solution is asymptotically optimal for high SNR."

Figure 8 shows that the gap between the SPS and the Viterbi detectors, at low SNR, remains even if the latter uses a much longer delay. An interesting range to consider is the values of the delay constraint, $D$, comparable to the length of the channel dispersion, $L$. Figure 9 shows that only the Viterbi detector with the 'minimum metric' strategy comes close to the SPS (at least for $D \le 6L$), hence a motivation for comparing their structures.
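The effect of replacing a sum of exponentials by its largest term can be quantified with a short numerical check. The sketch assumes the look-up term $-N_0 \ln(1 + e^{-|\delta|/N_0})$ and unit signal power, so that a smaller $N_0$ corresponds to a higher SNR; the metric values and the list of $N_0$ values are hypothetical.

```python
import math

def min_star(a, b, n0):
    """Exact combination: -N0 * ln(e^{-a/N0} + e^{-b/N0})."""
    return min(a, b) - n0 * math.log1p(math.exp(-abs(a - b) / n0))

a, b = 1.0, 1.3                      # hypothetical path metrics
noise_levels = (1.0, 0.3, 0.1, 0.03)

# Error made by keeping only the largest exponential, i.e. using plain min.
gaps = [min(a, b) - min_star(a, b, n0) for n0 in noise_levels]
```

The entries of `gaps` shrink from roughly 0.55 at `n0 = 1.0` to about $10^{-6}$ at `n0 = 0.03`, and each is bounded by $N_0 \ln 2$, in line with the statement that the sequence-based solution is asymptotically optimal at high SNR.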
As discussed earlier, the VA and the SPS detection algorithm have the same DG (Figure 2(b)). From the comparison of (19) and (36), or equivalently Figures 4 and 7, it can be seen that, except for an additive look-up term, the same processing is performed at every node of the two detectors. There are $2^L$ nodes in the Viterbi trellis but $2^{\max(L, D-1)}$ nodes in the SPS detector's DG. This adds to the complexity of the latter as $D$ increases beyond $L$. On the other hand, in the Viterbi detector, symbol estimation requires the storage and control of $2^L$ surviving sequences, each of length $D - L$. In a VLSI implementation, this takes a considerable area, perhaps as much as one-third of the chip area. The decision on the information sym- [...]

This study has shown that a practical realization of fixed-delay symbol-by-symbol detection for noisy and time-dispersive channels is possible. The mapping of the algorithm onto a fully-parallel array structure and the subsequent systematic simplifications introduce a simplified parallel symbol (SPS) detector which is several times faster and simpler than that suggested by the original algorithm.
We have shown a hierarchy of symbol detectors. The optimal SPS detection algorithm (19)-(22) (or (20), (28), and (31) for $D > L$) can be approximated by the near-optimal SPS detection algorithm (19), (20), and (23). Two suboptimal symbol detectors 'A1' and 'A2' (both discussed in Section IV) are, in turn, derived from the SPS detector. Further approximation leads to the derivation of Suboptimal Detector 'B' (33), which is shown to be the same as the minimum-metric Viterbi detector (36). The latter is superior in performance to the VA with the 'majority decision' rule or the 'arbitrary selection' rule with respect to minimizing the symbol error probability. The Viterbi detector with the 'arbitrary selection' rule (commonly used) is, itself, an approximation to the minimum-metric Viterbi (or 'B') detector. This hierarchy from the optimal SPS detection algorithm to the 'arbitrary-selection' VA introduces a tradeoff in performance and complexity. The derivation of the optimal SPS detector together with the related suboptimal design considerations and trade-off possibilities among a number of efficient symbol detectors (including Viterbi detectors) form the main results of this work.
Our brief comparison of the SPS detector developed in this paper with the Viterbi detector shows that the former achieves a slightly better performance at low SNRs and the latter is simpler in complexity, particularly at higher SNRs (for which large values of $D$ are needed); otherwise, the two are comparable in complexity (and performance). The simplicity of the SPS detector may make it possible to carry out a more detailed analytical comparison of its performance with respect to the Viterbi detector for specific applications and also throw light on the symbol error performance of detectors based on sequence estimation.
Since the topology and the type of operations involved in the SPS detector are similar to those of the Viterbi detector, the same approaches that use the Viterbi trellis for its VLSI implementation can be applied here. The simplifications already known for VA to avoid multiplications in the branch metric computations [27] are also applicable to the SPS detection algorithm. Furthermore, by modifying the expression for the branch metrics, an SPS decoder may be derived. A complete analysis of such designs merits further work. Finally, some of the improvements suggested for receivers that include the Viterbi detector can also be applied to the SPS detector. For example, channel truncation and estimation may be applied be- 
