Parallel Algorithms for Isolated and Connected Word Recognition by Yoder, Mark Alan & Jamieson, Leah H.
Purdue University
Purdue e-Pubs
Department of Electrical and Computer
Engineering Technical Reports
Department of Electrical and Computer
Engineering
12-1-1984






Yoder, Mark Alan and Jamieson, Leah H., "Parallel Algorithms for Isolated and Connected Word Recognition" (1984). Department of
Electrical and Computer Engineering Technical Reports. Paper 531.
https://docs.lib.purdue.edu/ecetr/531




School of Electrical Engineering
Purdue University
West Lafayette, Indiana 47907
PARALLEL ALGORITHMS FOR ISOLATED AND
CONNECTED WORD RECOGNITION
Mark A. Yoder 
Leah H. Jamieson
School of Electrical Engineering 
Purdue University




This work was supported by National Science Foundation grants ECS-7909016 and
ECS-8120896.
ACKNOWLEDGMENTS
The authors would like to thank Sharon Katz for the excellent job she did 
preparing the figures, and for the use of her equipment in assembling the final 
version. We would also like to thank Sarah K. Yoder for the time she spent 
proofreading and correcting the next-to-final draft.
Thanks also go to Steven J. Holmes for the hours he spent explaining the 
internals of the Poker System to me, and to James T. Kuehn for his help with 
the SIMD machine simulations.





LIST OF TABLES ...... viii
LIST OF FIGURES ...... xi
ABSTRACT ...... xxi
1. INTRODUCTION ...... 1
2. THE SIMD MACHINE MODEL ...... 5
2.1. Flock Algol - Introduction ...... 5
2.2. Summary of Flock Algol ...... 9
2.3. Mask Statements ...... 11
2.3.1. ENABLE and DISABLE ...... 11
2.3.2. WHERE ... ELSEWHERE ...... 11
2.4. TRANSFER and USE Statements ...... 13
2.4.1. The Cube Interconnection Function ...... 13
2.4.2. The Permutation Interconnection Function ...... 14
2.4.3. The Shift Interconnection Function ...... 14
2.5. Broadcast Statements ...... 14
2.6. An Example of a Flock Algol Algorithm ...... 15
2.7. Summary ...... 17
3. VLSI PROCESSOR ARRAY MODEL ...... 19
3.1. A Sample VLSI Processor Array Algorithm - Filtering ...... 22
3.2. Summary ...... 24
4. AN ISOLATED WORD RECOGNITION SYSTEM ...... 27
4.1. Filtering and Sampling of Input Signal ...... 29
4.2. Preemphasis Filtering ...... 29
4.3. Autocorrelation Analysis ...... 30
4.4. Linear Predictive Coding
4.5. Endpoint Detection
4.6. Time Warping
4.6.1. Linear Time Warping








5. SURVEY OF PARALLEL SPEECH PROCESSING ALGORITHMS ...... 46
5.1. Autocorrelation ...... 46
5.1.1. Autocorrelation Using M PEs - AUTO1 ...... 47
5.1.2. Autocorrelation Using Two FFTs - AUTO2 ...... 47
5.1.3. Autocorrelation Using p + 1 PEs - AUTO3 ...... 51
5.2. Linear Prediction of Speech ...... 56
5.2.1. Parallel LPC Using the Autocorrelation Method ...... 56
5.2.2. Parallel LPC Coding Using the Covariance Method ...... 58
5.3. Digital Filtering ...... 61
5.3.1. Recursive Filtering for the VLSI Processor Array (FIL1) ...... 63
5.3.2. SIMD Digital Recurrence Filter - Kogge (FIL2) ...... 63
5.3.3. SIMD Digital Recurrence Filter - Kuck ...... 66
5.3.4. Summary of Parallel Recursive Filtering Algorithms ...... 67
5.4. Dynamic Time Warping ...... 67
5.4.1. High Speed Array Computer - Full Array ...... 70
5.4.2. High Speed Array Computer - Reduced Arrays ...... 73
5.5. Summary ...... 79
6. NEW PARALLEL ALGORITHMS FOR SPEECH PROCESSING ...... 80
6.1. Digital Filtering ...... 80
6.1.1. VLSI Processor Array Algorithm - VLSI1 ...... 82
6.1.2. An Improved Parallel Filtering Algorithm - SIMD1 and VLSI2 ...... 84
6.1.3. An Improved SIMD Algorithm - SIMD2 ...... 86
6.1.4. SIMD Solution of General Linear Recurrence Equations ...... 91
6.1.5. Comparison of VLSI Processor Array and SIMD Algorithms ...... 93
6.1.6. Varying the Problem Size on an SIMD Machine ...... 95
6.1.7. Summary of General Digital Filtering Algorithms ...... 101





6.3. Linear Time Warp
6.3.1. Method One
6.3.2. Method Two
6.3.3. Summary
6.4. Dynamic Time Warping
6.4.1. SIMD Algorithms
6.4.2. VLSI Processor Array Algorithms
6.4.3. Summary of Results
6.5. Conclusions
7. SIMD MACHINE SIMULATION
7.1. Simulating an SIMD Machine Using Sim68
7.1.1. Simulating the PEs and the CU
7.1.2. Simulating the Interconnection Network
7.1.3. Simulating Broadcasts
7.1.4. Data Conditional Masking
7.1.5. The Typical Speech Recognition System
7.1.6. Execution Times
7.1.7. Summary
7.2. Digital Preemphasis Filtering
7.2.1. Summary
7.3. Simulation of the Autocorrelation Algorithm
7.3.1. Effects of NetD on Execution Times
7.3.2. Using Fewer PEs
7.3.3. Increasing the Throughput Through Serialism
7.3.4. Summary
7.4. Simulation of the Linear Prediction Algorithm
7.4.1. Summary
7.5. Simulation of Linear Time Warping (LTW) Algorithms
7.5.1. Method One - One Frame per PE
7.5.2. Method Two - One Coefficient per PE
7.5.3. Comparing LTW Methods One and Two
7.5.4. Summary
7.6. Simulation of Dynamic Time Warping Algorithms
7.6.1. Rearrange
7.6.2. Simulation of the DTW Algorithm - The Serial Parallel Method (SP)









































7.6.4. Summary ...... 194
7.7. SIMD Machine Based Isolated Word Recognition System ...... 196
7.7.1. Endpoint Detection ...... 196
7.7.2. Data Allocation ...... 201
7.7.3. Execution Times ...... 202
7.7.4. Buffering the Input Data ...... 204
7.7.5. Summary ...... 205
7.8. Conclusions ...... 207
7.8.1. The Processor ...... 207
7.8.2. Inter-PE Communication - Cube, Shift(±1), and Broadcasts ...... 210
7.8.3. Masking - Data Conditional ...... 210
7.8.4. MC68000 Clock Rate - 8 MHz ...... 213
7.8.5. Number of PEs - 100 ...... 213
7.8.6. Changing the Word Recognition System Parameters ...... 215
7.8.7. Summary ...... 217
8. SIMULATING VLSI PROCESSOR ARRAYS ...... 219
8.1. Poker Details ...... 219
8.1.1. Software for Emulating with Poker ...... 222
8.1.2. Hardware Emulated by Poker ...... 230
8.1.3. Summary ...... 234
8.2. Simulation of Filtering Algorithms ...... 235
8.2.1. Digital Filtering Without Broadcasts ...... 235
8.2.2. Digital Filtering Using Broadcasts ...... 241
8.2.3. Summary ...... 258
8.3. Simulation of the Autocorrelation Algorithms ...... 262
8.3.1. Poker Simulation of the Autocorrelation Algorithm ...... 263
8.3.2. High-level Language Programs - a1 and a2 ...... 263
8.3.3. Execution Times - a1 and a2 ...... 271
8.3.4. Assembly Language Programs - a3 and a4 ...... 275
8.3.5. Potential Problems - a3 and a4 ...... 279
8.3.6. Asynchronous Computing - a5 ...... 280
8.3.7. Summary ...... 286
8.4. Simulation of Parallel Linear Prediction Algorithms ...... 287
8.4.1. Improve the xx Compiler ...... 291
8.4.2. Use a Faster APU ...... 291
8.4.3. Use Multiple Cells ...... 292
8.4.4. Summary ...... 297
8.5. Simulation of Linear Time Warping (LTW) Algorithms ...... 299
8.5.1. Parallel LTW - l1 ...... 299
8.5.2. Serial LTW - l2 ...... 309
8.5.3. Summary ...... 309
8.6. Poker Simulation of Dynamic Time Warping ...... 317
8.6.1. BAC written in xx - d1 ...... 317
8.6.2. 8051 Assembly Language Version of BAC - d2 ...... 333
8.6.3. Execution Times ...... 335
8.6.4. Summary ...... 340
8.7. VLSI Processor Array Isolated Word Recognition System ...... 344
8.7.1. Input Cell ...... 344
8.7.2. Preemphasis Cell ...... 351
8.7.3. Autocorrelation Cells ...... 351
8.7.4. The Split, Merge, and Pipe Cells ...... 352
8.7.5. The LPC Cell ...... 352
8.7.6. Endpoint Detection ...... 356
8.7.7. Linear Time Warping ...... 356
8.7.8. Dynamic Time Warping ...... 356
8.7.9. Summary ...... 356
8.8. Conclusions ...... 359
8.8.1. The Processor ...... 359
8.8.2. Inter-PE Communications ...... 362
8.8.3. Number of Cells - 51 ...... 363
8.8.4. Changing the Word Recognition System Parameters ...... 363
8.8.5. Summary ...... 366
9. CONNECTED WORD RECOGNITION ...... 367
9.1. A Level Building Dynamic Time Warping Algorithm ...... 367
9.2. An SIMD Level Building DTW Algorithm ...... 375
9.3. A VLSI Processor Array DTW Algorithm ...... 379
9.4. Summary ...... 381
10. CONCLUSIONS ...... 382
LIST OF REFERENCES ...... 385
APPENDICES
APPENDIX A: SIMD Machine Assembly Language Programs ...... 392




5.1. Summary of the methods to compute autocorrelation coefficients ...... 52
5.2. Time complexities for computing autocorrelation coefficients for M=128 and p=8 ...... 55
5.3. Summary of parallel and serial LPC analysis algorithms ...... 60
5.4. Operations needed for Cholesky decomposition ...... 62
5.5. Summary of parallel recursive filtering algorithms ...... 68
5.6. PEs and cycles needed to filter a M=128 sample signal with a p=8 pole recursive filter ...... 69
6.1. Execution times for serial, VLSI, and SIMD digital filtering algorithms ...... 94
6.2. Comparison between Ashajayanthi's SIMD autocorrelation algorithm (AUTO3) and an improved version (AUTO4) ...... 107
6.3. Time complexities of linear time warping algorithms ...... 114
6.4. Summary of Parallel Dynamic Time Warping Algorithms ...... 119
7.1. Parameters for the typical speech recognition system ...... 147
7.2. Sampling rates for the SIMD preemphasis program using 16-bit signed data ...... 151
7.3. Execution time for autocorrelation program using 16-bit signed inputs and a 32-bit signed sum ...... 154
7.4. Execution times for LPC program and filter + auto + lpc programs ...... 162
7.5. Execution times for linear time warping, method one ...... 167
7.6. Execution times for linear time warping, method two ...... 168
7.7. Execution times for rearrange routine ...... 175
7.8. Execution times in cycles between adjacent labels of SP DTW program ...... 178
7.9. Execution times for serial dynamic time warping (SP) ...... 180
7.10. Execution times for parallel dynamic time warping (PP1) ...... 184
7.11. Execution times in cycles between adjacent labels of PP2 DTW program ...... 190
7.12. Execution times for distance calculations for PP2
7.13. Execution times for dynamic time warping program PP2 ...... 193
7.14. Parameters for speech recognition systems ...... 197
7.15. Buffer requirements for SIMD speech recognition system, p=8, NetD=18, I=40, input sample rate = 20 kHz ...... 206
7.16. Memory usage, in bytes, for SIMD based isolated word recognition system ...... 209
7.17. Inter-PE communication used by SIMD machine ...... 211
7.18. Data conditional masking time in cycles ...... 212
7.19. Number of PEs used by the parallel speech recognition system ...... 214
8.1. Execution times for filtering program f1 ...... 248
8.2. Summary of simulation of digital filtering algorithms in Poker ...... 259
8.3. Execution times for autocorrelation programs a1 and a2 ...... 274
8.4. Summary of execution times for autocorrelation programs ...... 278
8.5. Execution times for the LPC program in Figure 8.29 ...... 290
8.6. Execution times for LTW programs ...... 315
8.7. Execution time summary for DTW program d1 using four coefficients per frame ...... 332
8.8. Execution time summary for DTW programs d1 and d2 ...... 338
8.9. Execution time summary for DTW program d2 using 16 bits per coefficient ...... 339
8.10. Execution times for DTW programs ...... 341
8.11. Parameters for speech recognition systems ...... 345
8.12. Memory usage, in bytes, for SIMD based isolated word recognition system ...... 361
8.13. Number of cells used by the VLSI processor array parallel speech recognition system ...... 364
9.1. Variable name translations for connected word algorithm ...... 374
9.2. Comparison of serial and parallel level building DTW algorithms ...... 377
9.3 Comparison of serial and parallel leveling




2.1. SIMD machine organization ...... 6
2.2. Model of an SIMD processing element (PE) ...... 7
2.3. Pidgin Algol core for Flock Algol
2.4. Flock Algol statements to express parallelism
2.5. Parallel calculation of y[i] = y[i-1] + a[i] ...... 16
2.6. Intermediate values for recursive-doubling algorithm ...... 18
3.1. An example of a systolic array ...... 20
3.2. An example of a VLSI processor array ...... 21
3.3. VLSI processor array to compute FIR filter for q=2 ...... 23
3.4. Data flow diagram for Figure 3.3 ...... 25
4.1. Block diagram of an isolated word recognition system ...... 28
4.2. Durbin's Algorithm to compute LPC coefficients a_i from autocorrelation coefficients R(i), 0 ≤ i ≤ p ...... 33
4.3. An example of how the zero crossings and energy thresholds are used to find the end-points of a word ...... 35
4.4. Dynamic time warping paths ...... 38
4.5. An example of time warping ...... 39
4.6. Possible paths to a point ...... 40
4.7. Adjustment window of width r ...... 42
4.8. Serial DTW program. Execution times assume an 8 MHz MC68000 ...... 43
5.1. Algorithm for autocorrelation using N PEs ...... 48
5.2. Data transfers to move s(m+i) to PE m to compute s(m)*s(m+i) terms for R(i), 0 ≤ i ≤ p, shown for N=M=8, p=3 ...... 49
5.3. Performing sum of elements in N PEs using recursive doubling for N=8 ...... 50
5.4. SIMD algorithm (AUTO3) to compute autocorrelation coefficients R(i), 0 ≤ i ≤ p, for an M-point signal, using p+1 PEs ...... 53
5.5. Contents of variable P in each PE at the start of line 16 for p=3, M=5
5.6. Data transfers for computation of a_j's for p=4 in four PEs
5.7. SIMD algorithm using Durbin's method to solve for p predictor coefficients using p PEs
5.8. Systolic array to compute recursive filter for p=2
5.9. Data flow for array in Figure 5.8
5.10. One cell in HSAC
5.11. High Speed Array Computer used to compute dynamic time warp









5.13. Virtual movement of reduced array through I by I grid ...... 76
5.14. Virtual propagation of diagonal reduced array ...... 78
6.1. a) VLSI processor array to compute generalized digital filter p=2, q=2. b) Data flow diagram for (a) ...... 83
6.2. Data flow diagram for SIMD1 generalized digital filtering algorithm for p=2, q=2
6.3. Data flow diagram for improved SIMD2 generalized digital filtering algorithm for p=2, q=2 ...... 87
6.4. Skewed coefficient storage for SIMD2 algorithm ...... 89
6.5. SIMD2 generalized digital filtering algorithm ...... 90
6.6. linear recurrence equations ...... 92
6.7. = p + q + 1 PEs, shown for p=2, q=2 ...... 97
6.8.





Algorithm for preemphasis filtering
Ashajayanthi's SIMD autocorrelation method





6.13. SIMD algorithm to do linear time warp
6.14. Data flow for LTW for expanding from M=5 to M=7 frames
6.15. Data flow for compressing M=7 frames to N=5 frames
6.16. A set of g(i,j) that can be computed in parallel,








6.18. Parallel DTW program
6.19. Data transfers into even and odd numbered PEs in PP algorithm
6.20. g(i,j) computations in PP algorithm with r even
6.21. a) Bilinear array of cells. b) Data paths between cells in left and right columns
6.22. Data flow in BAC algorithm
6.23. Instructions executed during one loop of the BAC algorithm for I odd
6.24. Number of loops for W=100, I=40, r=8
6.25. Number of loops for W=100, I=40, no window. HSAC not shown, since 1,600 PEs required
6.26. Number of loops for W=1,000, I=40, r=8
6.27. Number of loops for W=1,000, I=40, no window. HSAC not shown, since 1,600 PEs required
7.1. Sample algorithm SIMD machine
7.2. Sim68 program to perform preemphasis filtering
7.3. Algorithm for autocorrelation using N PEs for





Program to rearrange data from PE i containing coefficient i to all PEs containing all coefficients
Calculation order for accumulated distances of
Algorithm to compute local distances and move data

















SP DTW program ...... 188
7.8. Flock Algol algorithm for isolated word recognition ...... 198
7.9. Time and PE usage for the parallel isolated word recognition system ...... 203
8.1. Typical switch lattice
8.2. An example of an xx program ...... 223
8.3. (a) Example of a cell configuration for a VLSI algorithm ...... 226
8.4. Example of a Poker switch setting and code name assignments ...... 228
8.5. Example of Poker port name assignments ...... 229
8.6. Poker cell detail
8.7. Switch settings for no broadcast xx filter program, p=2 and q=2
8.8. Port names for no broadcast xx filter program, p=2 and q=2
8.9. xx code for no broadcast filter program
8.10. Execution times for slow xx filter program
8.11. Switch settings and code names for fast filter (f1) program for p=1 and q=1
8.12. Port names of fast filter (f1) program
8.13. xx code for fast filter (f1) program
8.14. Execution times for xx fast filter (f1) program














8.16. Arrival times and port names for f2
8.17. Execution times in μs for fast filter programs
8.18. Switch setting for fast filter program (f3) for p=1 and q=2
8.19. Switch setting for autocorrelation programs (a1) and (a2) for VLSI processor array
8.20. Port names for autocorrelation programs (a1) and (a2) for VLSI processor array
8.21. xx listing for autocorrelation programs (a1) and (a2) for VLSI processor array
8.22. Execution times for autocorrelation program a1 using real numbers
8.23. Execution times for autocorrelation program a2 using integers
8.24. Switch settings for assembly language autocorrelation routines
8.25. Execution times for autocorrelation program using 8, 16, and 32-bit inputs
8.26. Switch setting for autocorrelation program a5
8.27. Time delays in using tree to broadcast. One port can send data to two ports with one write instruction
8.28. Time delays in using tree to broadcast. One port can send data to only one port with one write instruction
8.29. Durbin's method for finding LPC coefficients from autocorrelation coefficients
8.30. Switch settings and port names for


















8.31. Port names for multi-cell LPC program ...... 294
8.32. xx program listing for multi-cell LPC program ...... 295
8.33. Switch setting for multi-cell LTW program ...... 300
8.34. Port names for multi-cell LTW program ...... 301
8.35. Code for multi-cell LTW program ...... 302
8.36. Execution times in μs for multi-cell LTW ...... 307
8.37. Single-cell LTW program ...... 310
8.38. Execution times in μs for single-cell LTW ...... 313
8.39. Switch settings for DTW program d1 ...... 318
8.40. Port names for DTW program d1 ...... 319
8.41. xx code for DTW program d1 ...... 320
8.42. Execution times in μs for d1 using four coefficients per frame ...... 330
8.43. Switch settings for DTW program d2 ...... 334
8.44. Execution times in μs for d1 and d2 ...... 336
8.45. Switch settings for word recognition system ...... 346
8.46. Code names for the word recognition system ...... 347
8.47. Port names for speech recognition system ...... 348
8.48. Plot of speech data output by the input cell ...... 349
8.49. Speech input data for word recognition system ...... 350
8.50. xx code for pipe cell
8.51. xx program for computing LPC coefficients from autocorrelation coefficients ...... 354
8.52. xx program for finding endpoints ...... 357
9.1. Illustration of dynamic warping alignment between test pattern T and super reference pattern ...... 369
9.2. Graphical description of the computation order of non-level building algorithm ...... 370
9.3. Graphical description of the computation order of level building algorithm ...... 371
9.4. Algorithm for serial level building DTW ...... 373
9.5. Algorithm for parallel level building DTW ...... 376
9.6. Instructions executed during one loop of the BAC algorithm for I odd ...... 380
Appendix
Figure
A.1. MC68000 instruction set ...... 393
A.2. Contents of simd.h, the file describing the device locations in the address space ...... 395
A.3. Contents of defs.h, the definition file ...... 398
A.4. Sim68 program to perform preemphasis filtering ...... 399
A.5. Program performing autocorrelation ...... 401
A.6. Program performing autocorrelation using half as many PEs as frames ...... 404





A.8. Program for linear time warping using one frame per PE ...... 412
A.9. Program for linear time warping using p PEs ...... 418
A.10. Parallel program for parallel-parallel DTW algorithm ...... 421
A.11. Serial program for implementing the serial-parallel DTW algorithm ...... 434
A.12. Parallel program for DTWing (PP2) ...... 439
A.13. SIMD program for isolated speech recognition system. Contains endpoint routine ...... 451
B.1. Description of xx programming language ...... 457
B.2. 8051 instruction set description and timings ...... 461
B.3. Using a built-in timer to control loop time ...... 463
B.4. Example of 8051 code for inter-cell communication ...... 465
B.5. Contents of ports.h ...... 467
B.6. Contents of util.h
B.7. 8051 program listing for 8 bit fast filter (f2) ...... 469
B.8. 8051 listing for fast filter program (f3) ...... 474
B.9. 8051 programs for autocorrelation program a3 using 16-bit inputs and 32-bit sums ...... 482
B.10. 8051 program for autocorrelation program a4 using 8-bit inputs and 16-bit sums ...... 492
B.11. 8051 program for autocorrelation program a5, using asynchronous 16-bit input and 32-bit output ...... 496




B.13 Program to output stored speech signal
B.14 Assembly language program for preemphasis filtering
B.15 Assembly language program for autocorrelation
ABSTRACT
Mark Alan Yoder, Ph.D., Purdue University December 1984 Parallel Algo­
rithms for Isolated and Connected Word Recognition. Major Professor: Leah
H. Jamieson
For years researchers have worked toward finding a way to allow people to 
talk to machines in the same manner a person communicates to another per­
son. This verbal man to machine interface, called speech recognition, can be 
grouped into three types: isolated word recognition, connected word recogni­
tion, and continuous speech recognition. Isolated word recognizers recognize 
single words with distinctive pauses before and after them. Continuous speech 
recognizers recognize speech spoken as one person speaks to another, continu­
ously without pauses. Connected word recognition is an extension of isolated 
word recognition which recognizes groups of words spoken continuously. A 
group of words must have distinctive pauses before and after it, and the 
number of words in a group is limited to some small value (typically less than 
six).  
If these types of recognition systems are to be successful in the real world, 
they must be speaker independent and support a large vocabulary. They also 
must be able to recognize the speech input accurately and in real time. 
Currently there is no system which can meet all of these criteria because a vast
amount of computation is needed.
This report examines the use of parallel processing to reduce the computa­
tion time for speech recognition. Two different types of parallel architectures 
are considered here, the Single Instruction stream - Multiple Data (SIMD)
machine and the VLSI processor array. The SIMD machine is chosen for its 
flexibility, which makes it a good candidate for testing new speech recognition 
algorithms. The VLSI processor array is selected as being good for a dedicated 
recognition system because of its simple processors and fixed interconnections.
This report involves designing SIMD systems and VLSI processor arrays 
for both isolated and connected word recognition systems. These architectures 
are evaluated and contrasted in terms of the number of processors needed, the 
interprocessor connections required, and the “power” each processor needs to 
achieve real time recognition.
The results show that an SIMD machine with 100 processing elements, each
built around an MC68000 processor, can recognize isolated words in real time
using a 20 kHz sampling rate and a 1,000 word vocabulary.
1. INTRODUCTION
Voice input to machines is one of the most natural forms of man-machine 
communication. For years researchers have worked toward finding a way to 
allow a person to talk to machines in the same manner a person communicates 
to another person. This verbal man to machine interface, called speech recog­
nition, can be grouped into two major types, continuous speech recognition and 
isolated word recognition. The following describes what each type entails.
The computer’s role in continuous speech recognition is analogous to the
role of a secretary taking dictation in that the machine would take the voice
input and transcribe it into the words that were spoken.
In isolated word recognition there is a distinctive pause (of about 100 ms)
between utterances. Isolated word recognition is the more likely of the two
types of recognition to be found on an assembly line taking orders to do a
given task. Here single words or short phrases are given to control a machine.
The distinctive pauses before and after the utterance make it easier to find
where the utterance begins and ends. Continuous speech may not have pauses
around each utterance, which makes finding word boundaries within continuous
speech more difficult than within isolated speech. This is one reason why isolated
word recognition is easier to perform than continuous speech recognition.
A third type of recognition is connected word recognition. Connected word
recognition is an extension of isolated word recognition which allows recognition
of groups of words spoken continuously. A group of words must have distinctive
pauses before and after it, and the number of words in a group is limited
to some small value (typically less than six). The presence of distinctive
pauses, and the knowledge that there is only a small number of words in a
group, make connected speech recognition easier to perform than continuous
speech recognition. Since connected word recognition is an extension of isolated
word recognition, it is not considered a major type.
For any of the types of speech recognition to be successful in general 
usage, they must meet the following criteria.
1) Speaker Independence: Many recognition systems are trained to a small
group of speakers. A system is called speaker independent if it can 
recognize speakers not in the training group. To do this it must be able 
to handle different dialects, accents, speaking rates, and pitches.
2) Large Vocabulary: The typical adult may know 100,000 words or more
[LeLi81]. Although an isolated word recognizer controlling a machine 
may only need to recognize a few command words, the use of continuous 
speech recognition to take dictation requires a large vocabulary.
3) Accurate Recognition: Recognition accuracy is a common standard used to
compare different recognition systems. Certainly the machine should 
accurately recognize all utterances in order to avoid having the user 
repeat words, or worse yet, have the machine misrecognize words.
4) Real-Time Response: The response time is the time needed to decide what
was spoken. Real-time response is needed so that the speaker does not
grow tired waiting for an answer. In a continuous speech recognition
system, real-time response is needed so that unprocessed input does not
accumulate. This has not been achieved by a system which also met the
other three characteristics.
An example of a continuous speech recognition system in the literature is 
the HWiM [BBN76] system that is able to understand continuous speech from 
three cooperative male general American speakers. It can recognize a 1,097 
word vocabulary with a 56% error rate while operating at 1,350 times real time 
on a PDP-10.
The level building dynamic time warping algorithm by Myers and Rabiner
[MyRa81b] is an example of a connected word recognition system. The system
can recognize up to five words in a connected utterance. The basic operation
performed by the system is a form of dynamic programming, known as a time
warp, to compare the input utterance to stored templates representing the
vocabulary. (Time warping will be discussed in detail in later chapters. For
now, it is the complexity of the time warp process which is of interest.) With a
vocabulary size of 10 words, it requires 50 basic time warps. On a Data General
Eclipse S230 minicomputer, Myers et al. [MRR80] state that a basic time
warp requires 289 to 454 milliseconds.* This means a vocabulary of 10 words
requires 14.45 to 22.7 seconds, while a vocabulary of 1,000 words needs 24 to 38
minutes just for the dynamic time warping. Therefore, the level building
method cannot run in real time with a large vocabulary on a conventional
processor.
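The timing estimate above can be reproduced with a few lines of arithmetic. The sketch below is written in Python, which the report itself does not use. It assumes one basic time warp per vocabulary word per level and five levels, which is what the quoted figure of 50 warps for a 10-word vocabulary implies; the function name is illustrative only.

```python
def level_building_time_s(vocab_size, warp_ms, levels=5):
    """Total DTW time in seconds for one connected utterance.

    Assumes one basic time warp per vocabulary word per level,
    which gives the 50 warps quoted for a 10-word vocabulary.
    """
    warps = vocab_size * levels
    return warps * warp_ms / 1000.0


for vocab in (10, 1000):
    low = level_building_time_s(vocab, 289)    # fastest quoted warp time
    high = level_building_time_s(vocab, 454)   # slowest quoted warp time
    print(f"{vocab:5d} words: {low:8.2f} s to {high:8.2f} s "
          f"({low / 60:.1f} to {high / 60:.1f} min)")
```

Under these assumptions the 1,000-word case comes to 1,445 to 2,270 seconds, which rounds to the 24 to 38 minutes quoted in the text.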
Neither of the above two systems is speaker independent, nor could they
meet the real-time response constraint. Currently these two constraints are
met by using a simpler type of recognition, i.e., isolated word recognition.
Systems are commercially available which recognize isolated words in real time
[Dodd81]. Generally these systems are speaker dependent with small (10-20
word) vocabularies. Even though real-time response is possible, it is at the
expense of a small vocabulary and small speaker population.
This report investigates the use of parallel processing to reduce the
computation time for speech recognition. This will be done by writing parallel
processing algorithms for the component algorithms that make up the speech
recognition systems.
Two different parallel architectures are considered here, the single
instruction stream - multiple data stream (SIMD) [Flyn66] computer and the
VLSI processor array. In the SIMD machine many processors execute the same
instructions simultaneously on different data. The instructions are broadcast
from a control unit, and the processors are able to pass data between each
other by a general interconnection network. The VLSI processor array, on the
other hand, is a multidimensional pipeline consisting of many cells, with the
output(s) of one cell connected to the input(s) of other cell(s). Although most
cells will be executing the same instructions on different data, it is possible
that some “special” cells will be executing different instructions. The VLSI
processor array can be thought of as a super systolic array [Kung80]. Both
arrays use a fixed interconnection network. They differ in that each cell of
the systolic array performs simple instructions like addition and
multiplication and has a small fixed number of registers (as few as three),
while each cell of the VLSI processor array can be as powerful as a
microprocessor with its own addressable memory. The systems examined are
programmable parallel systems. Since speech recognition is a research area in
which new methods are likely to be proposed, special purpose hardware devices
(e.g. [LMMB84]) are not considered.
* The figures Myers gives are 57.8 to 90.8 ms for combinatorics, with local
distance measures requiring 80% of the computation time.
Chapter 2 presents the SIMD machine model and a language for writing 
parallel algorithms for it. Chapter 3 discusses the VLSI processor array model 
and gives examples of how it works. Chapter 4 describes the word template 
matching approach to isolated word recognition. Chapter 5 is a survey of 
parallel speech processing algorithms. Chapter 6 describes the new parallel 
speech processing algorithms developed for this report. Chapter 7 presents the 
results of simulating the SIMD algorithms and Chapter 8 presents the VLSI 
processor array simulation results. Chapter 9 discusses connected word recog­
nition and presents a parallel algorithm for a level building dynamic time warp. 
And finally, Chapter 10 gives the conclusions of this research effort.
2. THE SIMD MACHINE MODEL
With the advent of VLSI technology, large-scale processing systems with
as many as 2^14 processors have become feasible [Ba79,Pe77,SDK77]. One
approach to using a large number of processors is the single instruction stream
- multiple data stream (SIMD) machine. An SIMD machine typically consists
of a control unit (CU), a set of N = 2^n processing elements (PEs), and an
interconnection network as shown in Figure 2.1 [Sieg81a]. A PE consists of a
processor with its own memory, fast access general purpose registers, an address
register (ADDR), and two data transfer registers (DTRin and DTRout) as
shown in Figure 2.2. The PEs are addressed (numbered) from 0 to N-1 in a
machine of size N. The register ADDR in PE i contains the integer i, for
0 ≤ i < N. The two data transfer registers allow each PE to access the
interconnection network, which in turn allows each PE to send data to and
receive data from the other PEs [Si79]. The CU broadcasts instructions to all
PEs, and each active PE executes each of these instructions on the data in its
own memory. All active PEs execute each instruction simultaneously. It is
possible to enable and disable PEs, so not all N PEs need be active.
2.1. Flock Algol — Introduction
A tool called Flock Algol has been developed by Siegel et al. [Sieg81b] to 
aid in writing and describing parallel algorithms. Flock Algol is used here 
because it incorporates ways to express SIMD processing in an algorithm 
description language. The following summarizes Flock Algol and focuses on 
the constructs it uses to express and control parallel execution. Finally, an
example of a Flock Algol algorithm is given.

























Figure 2.2. Model of an SIMD processing element (PE).
2.2. Summary of Flock Algol
Flock Algol uses traditional mathematical and programming language con­
structs, after Pidgin Algol [AHU74]. It also contains parallel-specific constructs 
extending its Pidgin Algol origin to accommodate parallel algorithms. As in
Pidgin Algol, any statement with a clear meaning is allowed.
A Backus-Naur form (BNF) specification is used here to describe Flock 
Algol. A BNF statement has the form
<non-terminal> ::= sequence of terminals and/or non-terminals.
Terminals are elements of the set of language symbols. For Flock Algol the
keywords include IF, THEN, ELSE, FOR, STEP, BEGIN, END, PROCEDURE,
ENABLE, DISABLE, TRANSFER, BROADCAST, USE, etc. To
aid the reader, Flock Algol keywords are shown in all capital letters. However,
case is unimportant when expressing algorithms in Flock Algol. Nonterminals
are symbols delimited by < > such as <program>, <statement>, <variable>,
<expression>, <condition>, <initial value>, <step size>, <final value>,
<procedure name>, <parameter list>, etc.
The BNF specification consists of a set of “rewriting rules,” where each
rewriting rule specifies the ways in which a given non-terminal can be
rewritten. In the BNF specification, a vertical bar ( | ) separates alternative ways of
rewriting a given non-terminal. Braces ( { } ) denote optional replication, and
are used to indicate that the contents between the braces may be employed
zero or more times.
Flock Algol includes a core of constructs drawn from Pidgin Algol
[AHU74], Pascal [JeWi74], and C [KeRi78] which is shown in Figure 2.3.
Figure 2.4 shows the BNF specification of the extensions to Pidgin Algol that
incorporate SIMD parallelism. The statements are of three general types:
1) mask statements, to allow subsets of PEs to be enabled (active) for execu­
tion of a statement or set of statements (and implicitly, to disable other
PEs);
2) transfer statements, to specify the transfer of data between PEs; and
3) broadcast statements, to allow the dissemination of a single data item to a
specified set of PEs.
The following gives a synopsis of each of these statement types.
<program> ::= <procedure definition>
<procedure definition> ::= PROCEDURE <procedure name> (<parameter list>)
                           {<procedure definition>} <block>
<block> ::= <statement> | <declaration part> <statement>
<statement> ::=
  1.  <variable> ← <expression> |
  2a. IF <condition> THEN <statement> |
   b. IF <condition> THEN <statement> ELSE <statement> |
  3.  FOR <variable> ← <initial value> TO <final value>
      DO <statement> |
  4.  BREAK |
  5.  BEGIN <statement> { <statement> } END |
  6a. <procedure name> ( <argument list> ) |
   b. <variable> ← <procedure name> ( <argument list> ) |
   c. RETURN | RETURN <expression> |
  7.  miscellaneous statements |
  8.  <null statement>
Figure 2.3. Pidgin Algol core for Flock Algol.
<statement> ::= <mask statement> | <transfer statement> |
                <broadcast statement> | <set network>
1. <mask statement> ::= [<mask specification>] <statement> |
                        <data conditional mask>
   a. <mask specification> ::= ENABLE <well defined set of PEs> |
                               DISABLE <well defined set of PEs>
   b. <data conditional mask> ::=
        WHERE <condition> DO <statement> END WHERE |
        WHERE <condition> DO <statement> ELSEWHERE <statement> END WHERE




3. <broadcast statement> ::= BROADCAST <broadcast specification>
   <broadcast specification> ::= <source specification>
                                 FROM PE <PE source>
                                 TO <destination specification>
   <PE source> ::= <constant with value between 0 and N-1> |
                   <variable with value between 0 and N-1>
4. <set network> ::= USE <interconnection function>
Figure 2.4. Flock Algol statements to express parallelism.
2.3. Mask Statements
A mask statement will have the effect of specifying a subset of the N PEs 
in the SIMD system. Masks provide the system user with a method to control 
the active/inactive status of the PEs of the system. Siegel [Si77] gives details 
of the various types of masking schemes. Flock Algol includes two mask
formats.
2.3.1. ENABLE and DISABLE
In the first format, the statement of type 1a consists of the keyword
ENABLE or DISABLE, followed by an unambiguous specification of a set of
PEs. The PEs enabled as a result of the mask specification execute the
statement following the mask specification. If no mask accompanies a statement, all
PEs are assumed to be active. The speech processing algorithms presented here
use PE address masks [Si77] to specify which PEs to enable or disable. The PE
address masks are n-position (where n = log2 N) masks that specify which of the
N PEs are active for each instruction. Each mask position contains a 0, 1, or X
(“don’t care”) and only those active PEs whose address (in binary
representation) matches the mask are enabled (or disabled). An “X” matches either a 1
or a 0. Superscripts are repetition factors, i.e., [X^5] = [XXXXX]. Square
brackets denote a mask. For example, ENABLE [X^(n-1) 1] activates all odd numbered
PEs and DISABLE [X^(n-1) 0] disables all even numbered PEs. If no mask
accompanies an instruction, all PEs are active.
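PE address masks can be illustrated with a short simulation. The following Python sketch is not part of the report; it represents a mask as a string of '0', '1', and 'X' characters (most significant bit first), and the function names are chosen here for illustration.

```python
def matches(mask, address, n):
    """True if the n-bit binary form of `address` matches `mask`.

    `mask` is a string of n characters drawn from '0', '1', 'X',
    most significant bit first; 'X' matches either bit value.
    """
    bits = format(address, f"0{n}b")
    return all(m in ("X", b) for m, b in zip(mask, bits))


def selected_pes(mask, n):
    """Addresses of all PEs (0 .. 2**n - 1) selected by `mask`."""
    return [pe for pe in range(2 ** n) if matches(mask, pe, n)]


n = 3                                          # N = 8 PEs
print(selected_pes("X" * (n - 1) + "1", n))    # odd numbered PEs
print(selected_pes("X" * (n - 1) + "0", n))    # even numbered PEs
```

With n = 3 the first mask is [XX1] and selects PEs 1, 3, 5, and 7; the second is [XX0] and selects PEs 0, 2, 4, and 6.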
2.3.2. WHERE ... ELSEWHERE
The second format for mask statements is a data conditional statement,
defined in statement type 1b. Data conditional masks are the implicit result of
performing a conditional branch dependent on local data in an SIMD machine
environment, where the result of different PEs’ evaluations may differ. As a
result of a conditional WHERE statement of the form





each PE will be active for the statement following either the DO or the
ELSEWHERE, but not both. The execution of the ELSEWHERE statement
must follow the DO statement; i.e., the DO and ELSEWHERE statements cannot
be executed simultaneously. For example, as a result of executing the
statement





each PE will assign to C the maximum of its A and B values, i.e., some PEs
will execute “C ← A,” and then the rest will execute “C ← B.” Machines such
as the Illiac IV [Barn68] and PEPE [Cran72] use this type of masking. Nesting
data conditional mask statements is possible; the implementation can be
accomplished using a run-time control stack, as discussed in [SiMu78].
From an implementation point of view, data conditional masks allow the 
specification of the mask condition to depend on PE data. The subset of PEs 
to enable is determined at execution time. The time to execute a “WHERE ... 
ELSEWHERE” statement will be the sum of the times to execute the state­
ments following the DO and the ELSEWHERE.
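The two-pass behavior of a data conditional mask can be sketched in Python (not a language used in the report). The simulation below treats each list position as one PE and assumes the condition A > B for the maximum example; the helper names and the sample data are hypothetical.

```python
def where_elsewhere(condition, do_stmt, else_stmt, num_pes):
    """Simulate WHERE ... DO ... ELSEWHERE ... END WHERE on num_pes PEs.

    Each PE evaluates `condition` on its own data. The DO branch runs
    first on the PEs where it holds, then the ELSEWHERE branch runs on
    the remaining PEs, so the two branches never run simultaneously.
    """
    mask = [condition(pe) for pe in range(num_pes)]
    for pe in range(num_pes):          # pass 1: DO branch
        if mask[pe]:
            do_stmt(pe)
    for pe in range(num_pes):          # pass 2: ELSEWHERE branch
        if not mask[pe]:
            else_stmt(pe)
    return mask


A = [3, 1, 4, 1, 5, 9, 2, 6]           # one A value per PE
B = [2, 7, 1, 8, 2, 8, 1, 8]           # one B value per PE
C = [None] * len(A)


def assign_a(pe):
    C[pe] = A[pe]


def assign_b(pe):
    C[pe] = B[pe]


where_elsewhere(lambda pe: A[pe] > B[pe], assign_a, assign_b, len(A))
print(C)                               # each PE holds max(A, B)
```

Because the two passes run one after the other, the simulated cost is the sum of the two branch times, matching the timing statement in the text.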
The “IF-THEN-ELSE” and “WHERE-DO-ELSEWHERE” statements
correspond to two different actions on an SIMD machine. An “IF-THEN-ELSE”
is a control flow statement executed by the CU to determine which of
two sets of code should be executed. The expression specifying the condition in
an IF-THEN-ELSE statement will contain only constants and CU variables.
If the code to be executed includes PE instructions, all active PEs will
execute that code. A “WHERE-DO-ELSEWHERE” statement divides the PEs 
in the system into two sets, and instructs the two sets to execute different code. 
In this case, both sets of code are executed one after the other, but by different 
PEs. An “IF-THEN-ELSE” format could be used to specify data conditional 
mask statements. However, since the basic function of the two types of
statements is different, it seems clearer to use different keywords to identify the 
two types of actions.
2.4. TRANSFER and USE Statements
The purpose of the TRANSFER statement (type 2 in Figure 2.4) is to
allow inter-PE communications. The USE statement (type 4 in Figure 2.4)
specifies the type of interconnection function to use, and the interconnection
functions specify the type of transfer to perform. Formally, an interconnection
function is a bijection on the set of PE addresses. When an interconnection
function, f, is executed, the contents of the source variable in PE j are
transferred to the destination variable of PE f(j). This occurs for all j
simultaneously, for 0 ≤ j < N and PE j active.
The PEs interface to the interconnection network via the DTRin and
DTRout registers. If the DTRin and DTRout register names are used in the
algorithm, the “<source specification> TO <destination specification>” in
the transfer statement syntax can be omitted. In this case, the source is
assumed to be the DTRin, and the destination is the DTRout. The DTRin
acts as the standard input to the network, and the DTRout acts as the standard
output from the network. If the “<source specification>” is given
without the “<destination specification>” the destination is the same as the
source.
The following are interconnection functions used in the speech processing 
algorithms presented in Sections 5 and 6.
2.4.1. The Cube Interconnection Function
The Cube [SiMc81b] interconnection function is defined by letting
$P = p_{n-1} \cdots p_1 p_0$ be the binary representation of the address of an
arbitrary PE. The n cube interconnection functions are:
\[ \mathrm{Cube}(i)[p_{n-1} \cdots p_i \cdots p_0] = p_{n-1} \cdots \bar{p}_i \cdots p_0 \]
where 0 ≤ i < n, 0 ≤ P < N, and $\bar{p}_i$ is the complement of $p_i$. This means the
cube(i) interconnection function connects PE P to cube(i)[P], where cube(i)[P]
is the same address as P with the ith bit complemented.
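Complementing bit i of an address is an exclusive-or with 2^i, so the cube functions can be sketched in one line of Python. This is an illustration only; it assumes bit 0 is the least significant bit, and the function name is chosen here.

```python
def cube(i, address):
    """Cube(i): complement bit i of a PE address (bit 0 is least significant)."""
    return address ^ (1 << i)


N = 8                                  # n = 3 address bits
for i in range(3):
    # Destination of each PE 0 .. N-1 under Cube(i).
    print(f"Cube({i}):", [cube(i, p) for p in range(N)])
```

Applying the same cube function twice returns every PE to its own address, which is one way to see that each Cube(i) is a bijection on the PE addresses.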
2.4.2. The Permutation Interconnection Function
The Permutation [Si81] interconnection function is defined as:
\[ \mathrm{Perm}_i(j) = \begin{cases} i - j & \text{where } 0 \le j \le i \\ j & \text{elsewhere} \end{cases} \]
$\mathrm{Perm}_5(j)$ would switch data between PEs 0 and 5, PEs 1 and 4, and PEs 2 and 3.
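The Perm5 example can be checked with a small Python sketch. It assumes the range of the first case includes both endpoints (j from 0 through i), since that is what exchanging PEs 0 and 5 requires; the function name is illustrative.

```python
def perm(i, j):
    """Perm_i: reverse the order of PEs 0..i and leave all other PEs alone."""
    return i - j if 0 <= j <= i else j


N = 8
# Destination of each PE 0 .. N-1 under Perm_5.
print([perm(5, j) for j in range(N)])
```

PEs 6 and 7 lie outside the range and map to themselves, so the function remains a bijection on all N addresses.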
2.4.3. The Shift Interconnection Function
The Shift interconnection function is defined as:
\[ \mathrm{Shift}_{+n}(j) = (j + n) \bmod N \]
\[ \mathrm{Shift}_{-n}(j) = (j - n) \bmod N \]
where N is the number of PEs. Therefore Shift +1 (j) would send data from 
PE 0 to PE 1, PE 1 to PE 2, and so on.
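A transfer through an interconnection function f moves the value held by PE j to PE f(j). The Python sketch below illustrates this for the Shift function; the helper `transfer` and the sample data are assumptions made here, not constructs from the report.

```python
def shift(n, j, num_pes):
    """Shift by n (positive or negative): PE j sends to PE (j + n) mod num_pes."""
    return (j + n) % num_pes


def transfer(values, f):
    """Route values[j] to position f(j) for every PE j (f must be a bijection)."""
    out = [None] * len(values)
    for j, v in enumerate(values):
        out[f(j)] = v
    return out


data = ["a", "b", "c", "d"]            # value held by PEs 0..3
print(transfer(data, lambda j: shift(+1, j, len(data))))
print(transfer(data, lambda j: shift(-1, j, len(data))))
```

With Shift +1 the value from the last PE wraps around to PE 0, and with Shift -1 the value from PE 0 wraps around to the last PE.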
2.5. Broadcast Statements
The purpose of broadcast statements (type 3 in Figure 2.4) is to allow the 
dissemination of a value from one PE to all PEs. The <PE source> is the PE 
containing the value to be broadcast. If the PE source is not given, the value
is broadcast from the CU. The value is broadcast to all PEs.
2.6. An Example of a Flock Algol Algorithm
The following is an example of a Flock Algol algorithm. It performs a 
computation similar to that given in [Ston80]. Suppose the vector a[] is given,
and the vector y[] is to be found such that
\[ y[0] = a[0], \qquad y[i] = y[i-1] + a[i], \quad 1 \le i < N \tag{2.1} \]
Therefore y[i] is the sum a[0] + a[1] + ... + a[i]. On a serial machine y[] is
found by:
y[0] ← a[0]
FOR i ← 1 TO N-1 DO
    y[i] ← y[i-1] + a[i]
This algorithm appears to be serial since y[i-1] is computed before y[i]. Since
the last statement is executed N-1 times, the time complexity is O(N). An
SIMD machine with N PEs can find y[] in O(log N) time by using the method
diagrammed in Figure 2.5. The figure is for N=8 PEs, where the nodes with
an open circle do nothing, while the nodes with filled in circles form the sum of
the two operands. The following SIMD algorithm to find y[] assumes element i
of vector a[] is stored in PE i for 0 ≤ i < N. After the algorithm, y[i] is stored
in PE i.
1  y ← a
2  FOR j ← 0 TO log2 N - 1 DO
3      TRANSFER y TO DTRout USING Shift +2^j          (2.2)
4      DISABLE [0^(n-j) X^j]
5          y ← y + DTRout
Each step does the following:
1) Store a[i] in y[i] for 0 ≤ i < N. This is done in all PEs simultaneously.
2) Execute statements [3]-[5] log2 N times.
3) Transfer the data in y in PE i to DTRout in PE (i+2^j) mod N. On the first
loop, the data in y in PE 1 will transfer to DTRout in PE 2, and PE 2’s
data will transfer to PE 3, and so on. PE N-1 will transfer its y value
to PE 0. When j = 1, the data in y in PE 1 will transfer to DTRout in
PE 3, etc.






Figure 2.5. Parallel calculation of y[i] = y[i-1] + a[i].
n=log2N) which matches only PE 0, so PE 0 will be disabled. This is
indicated by a circle at node 0 in Figure 2.5. The second time through
the loop, j = 1, so the mask is [0^(n-1) X] which matches PEs 0 and 1, so
they are disabled. The DISABLE instruction only disables the PEs during
the indented instruction(s) below it; therefore on subsequent times
through the loop, all PEs will execute steps [2]-[4].
5) The new data transferred into DTRout is added to y[] only in the enabled
PEs.
Figure 2.6 shows the intermediate values for this algorithm. Kogge and
Stone [KoSt73] call this technique of shifting and summing recursive doubling.
The time complexity is clearly O(log N) since the body of the loop in lines
[2]-[5] of algorithm (2.2) is executed log2 N times.
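The recursive doubling algorithm can be simulated serially to check its result. The Python sketch below is an illustration, not the report's Flock Algol; it models each PE as a list position, models the Shift +2^j transfer as an index calculation, and models the DISABLE mask as a test on the PE number.

```python
from itertools import accumulate


def recursive_doubling(a):
    """Prefix sums of `a` on len(a) simulated PEs; len(a) must be a power of two."""
    N = len(a)
    n = N.bit_length() - 1             # n = log2 N
    assert N == 1 << n, "N must be a power of two"
    y = list(a)                        # step 1: y <- a in every PE
    for j in range(n):                 # step 2: log2 N iterations
        step = 1 << j
        # Step 3: Shift +2^j, so PE i receives y from PE (i - 2^j) mod N.
        dtr_out = [y[(i - step) % N] for i in range(N)]
        # Steps 4-5: the mask disables PEs 0 .. 2^j - 1; the rest add DTRout.
        y = [y[i] if i < step else y[i] + dtr_out[i] for i in range(N)]
    return y


a = [1, 2, 3, 4, 5, 6, 7, 8]
print(recursive_doubling(a))
print(recursive_doubling(a) == list(accumulate(a)))   # compare with serial sums
```

The outer loop runs log2 N times, while each list comprehension stands for one step that all PEs would perform simultaneously on a real SIMD machine.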
2.7. Summary
Real-time recognition of speaker independent isolated or connected speech 
using a large vocabulary requires more processor throughput than current serial 
machines can deliver. The SIMD machine is one possible way to organize a 
large number of processors to do the recognition in real time.
Flock Algol provides a high level algorithm description language for SIMD
algorithms. It is based on a general model of an SIMD machine, and is
intended to separate the structure of the parallel algorithm from
architecture-specific issues such as the physical interconnection network or the actual
mechanisms used to implement data broadcasts and the enabling/disabling of
PEs.
The time complexity in the example algorithm above is reduced from
O(N) on the serial machine to O(log N) on the SIMD machine. This shows
that the parallelism of the SIMD machine can reduce the execution time of
some algorithms. The following chapters will show how the SIMD machine can
reduce the time complexity of various speech processing algorithms.
                  Shift +2^0, mask [000]  Shift +2^1, mask [00X]  Shift +2^2, mask [0XX]
PE  y ← a         DTRout  Mask  Sum y     DTRout  Mask  Sum y     DTRout  Mask  Sum y
0   y(0,0)=a[0]   y(7,7)  0     y(0,0)    y(5,6)  0     y(0,0)    y(1,4)  0     y(0,0)
1   y(1,1)=a[1]   y(0,0)  1     y(0,1)    y(6,7)  0     y(0,1)    y(2,5)  0     y(0,1)
2   y(2,2)=a[2]   y(1,1)  1     y(1,2)    y(0,0)  1     y(0,2)    y(3,6)  0     y(0,2)
3   y(3,3)=a[3]   y(2,2)  1     y(2,3)    y(0,1)  1     y(0,3)    y(4,7)  0     y(0,3)
4   y(4,4)=a[4]   y(3,3)  1     y(3,4)    y(1,2)  1     y(1,4)    y(0,0)  1     y(0,4)
5   y(5,5)=a[5]   y(4,4)  1     y(4,5)    y(2,3)  1     y(2,5)    y(0,1)  1     y(0,5)
6   y(6,6)=a[6]   y(5,5)  1     y(5,6)    y(3,4)  1     y(3,6)    y(0,2)  1     y(0,6)
7   y(7,7)=a[7]   y(6,6)  1     y(6,7)    y(4,5)  1     y(4,7)    y(0,3)  1     y(0,7)
Figure 2.6. Intermediate values for recursive-doubling algorithm.
Where: y(i,j) denotes $\sum_{k=i}^{j} a(k)$,
and a 0 mask means the PE is disabled,
and a 1 mask means it is enabled.
3. VLSI PROCESSOR ARRAY MODEL
Very large scale integration technology has shown that simple regular
interconnections are easy to implement, and give high densities. The VLSI
processor arrays are so named because they are designed to have simple regular
interconnections which exploit the capabilities of VLSI technology. A VLSI
processor array is a network of specialized processing elements (cells*) that
circulate data in a regular fashion. The network configuration for a VLSI
processor array is particular to the algorithm (or class of algorithms) being
implemented. In general, the data flow can be viewed as a multidimensional
pipeline. The VLSI processor array is a generalization of the systolic array
[Kung80]. Both arrays have fixed interconnection networks. They differ in
that systolic cells are assumed to be very simple, whereas VLSI processor array
cells may be complex. For example, Figure 3.1 shows a systolic array
presented by Kung [KuLe] for matrix multiplication. Without going into the
details of how it works, notice each cell has only three registers (a,b,c) and the
cell only does the operations shown in the lower right corner of Figure 3.1.
Figure 3.2 shows a VLSI array for dynamic time warping. (Details of the array
will be discussed in Section 6.4.2.1.) All the cells are connected by a fixed
interconnection network as with the systolic array, but each cell has several
registers, some of which contain vectors. Each cell does all the instructions shown
in the lower right side. Figures 3.1 and 3.2 are only examples of one systolic
array and one VLSI processor array. Both arrays can have different
interconnections and perform different operations. This example shows that the cells in
the VLSI processor array are more complex than those in the systolic array.
* Since the processing elements in the SIMD machine are different from those in the VLSI
processor array, they will be called “PEs” in the SIMD machine and “cells” in the VLSI
processor array.





[Figure 3.2 diagram not reproduced. Recoverable labels: "a Vector down"; cell
operations: g.bot.old ← g.bot, g.top ← DTtop, g.bot ← DTbot, d.bot ← DTbot,
d.top ← DTtop, DTtop ← g, DTbot ← g.]
Figure 3.2. An example of a VLSI processor array.
Both VLSI processor arrays and SIMD machines are forms of synchronous
large scale parallel processing systems. VLSI processor arrays represent
algorithm-specific systems with fixed interconnections between cells, specialized
processors and a small set of registers for memory. SIMD machines are more
complex, having a large memory in each PE and a general interconnection
network between PEs, making the system more flexible. The VLSI processor
array algorithms are specified by giving the fixed interconnections between
cells, and the instructions executed by each cell.
3.1. A Sample VLSI Processor Array Algorithm - Filtering
An example of a linear VLSI processor array is the finite impulse response
(FIR) filter presented by Kung [Kung80]. The output $y_m$ of a FIR filter is given
by:
\[ y_m = \sum_{k=0}^{q} b_k x_{m-k}, \qquad 0 \le m < M \tag{3.1} \]
where $x_m$ is the input to the filter, the $b_k$'s are the filter coefficients, and M is
the number of samples in the signal to be filtered.
Kung’s FIR filter algorithm computes a (q+1)-tap FIR filter using a linear
array of q + 1 systolic cells. It solves the equation in which $y_m$ is computed
using the summation in equation (3.1). The output $y_m$ can be computed by the
following recurrence relation, where $y_m^{(k)}$ is the partial result in the computation
of $y_m$ after k steps in the recurrence.
\[ y_m^{(0)} = 0 \]
\[ y_m^{(k+1)} = y_m^{(k)} + b_{q-k}\, x_{m-q+k}, \qquad 0 \le k \le q \tag{3.2} \]
\[ y_m = y_m^{(q+1)} \]
The above recurrences can be evaluated by pipelining the $x_m$ and $y_m^{(k)}$ values 
through q+1 linearly connected processors as shown in Figure 3.3. Each pro­












Figure 3.3. VLSI processor array to compute FIR filter for q = 2.
respectively. Initially, all Rx and Ry registers contain zeros, and the Rb
register in processor i contains $b_{q-i}$. Each cycle of the array consists of the steps
shown in Figure 3.3. $y_m^{(k)}$ is computed in cell k-1, and the output is produced
in cell q. The data flowing up (the $x_m$ values) must be synchronized with the
data flowing down (the $y_m^{(k+1)}$ values) so that they meet in the correct cell with
the correct coefficient. Therefore during odd numbered cycles, only even
numbered cells contain valid data, and during even numbered cycles only the odd
cells contain valid data. Thus only half of the cells are active during a given
cycle. One output value is therefore computed every two cycles of the systolic
array, where during each cycle, the operations performed are the simultaneous
transfer of data in the two pipes, plus the one addition, one multiplication, and
one assignment shown in equation (3.2).
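The equivalence between the direct FIR sum and the step-by-step recurrence can be checked with a short serial program. The Python sketch below is not the systolic array itself: it evaluates the recurrence one term at a time, with step k adding the product of coefficient b[q-k] and sample x[m-q+k]. It assumes that samples with a negative index are zero, and the function names and test data are illustrative.

```python
def fir_direct(b, x):
    """y[m] = sum over k of b[k] * x[m-k], with x[i] = 0 for i < 0 (assumed)."""
    q = len(b) - 1
    return [sum(b[k] * x[m - k] for k in range(q + 1) if m - k >= 0)
            for m in range(len(x))]


def fir_recurrence(b, x):
    """Same filter, accumulated one term per step as in the recurrence."""
    q = len(b) - 1
    out = []
    for m in range(len(x)):
        y = 0                          # partial result after 0 steps
        for k in range(q + 1):         # step k adds b[q-k] * x[m-q+k]
            idx = m - q + k
            if idx >= 0:
                y += b[q - k] * x[idx]
        out.append(y)                  # result after q+1 steps
    return out


b = [1, 2, 3]                          # q = 2, a 3-tap filter
x = [1, 0, 0, 2, 1]
print(fir_direct(b, x))
print(fir_recurrence(b, x))
```

In the array, step k of this inner loop is what cell k performs, so the q + 1 steps for one output are spread across the q + 1 cells rather than executed in sequence by one processor.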
Figure 3.4 is the data flow diagram for the linear array. Each column of 
the data flow diagram represents the contents of each register in each cell after 
a given cycle. Moving from left to right shows how the data changes from one 
cycle to the next. The arrows show where Rx and Ry will be transferred on the 
next cycle.
This linear array uses q+1 cells and produces a new y value every two
cycles. Ignoring the startup and stop time (i.e., the time required to pipe $y_0$
from cell 0 to cell q and to pipe $y_{M-1}$ from cell 0 to cell q) the VLSI processor
array is (q+1)/2 times faster than a serial machine. This is because there are
q+1 cells, half of which are doing computations on valid data at a given time.
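The recurrence (3.2) can be checked against the direct sum (3.1) with a short serial sketch. The function names below are illustrative and not from the report, and the inner loop only mimics the work of one cell per step; it does not model the two-way pipelining of the array.

```python
def fir_direct(b, x):
    """Direct form of equation (3.1): y_m = sum_k b_k * x_{m-k}, q <= m < M."""
    q = len(b) - 1
    return [sum(b[k] * x[m - k] for k in range(q + 1)) for m in range(q, len(x))]


def fir_recurrence(b, x):
    """Recurrence (3.2): step k adds b_{q-k} * x_{m-q+k} to the partial result."""
    q = len(b) - 1
    out = []
    for m in range(q, len(x)):
        y = 0  # y_m^(0)
        for k in range(q + 1):
            y = y + b[q - k] * x[m - q + k]  # y_m^(k+1)
        out.append(y)  # y_m = y_m^(q+1)
    return out


def ideal_speedup(q):
    """(q+1)/2: q+1 cells, half of which hold valid data in any given cycle."""
    return (q + 1) / 2
```

For q = 2 the ideal speedup evaluates to 1.5, since only every other cell does useful work in a cycle.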
3.2. Summary
Although the SIMD machine may have the computing power needed to 
recognize speech in real time, its general nature (a general purpose processor in 
each PE and a general interconnection network) may make it too expensive for 
a dedicated application. The VLSI processor array, on the other hand, with its 
fixed interconnection network and independently operating cells may be able to 
perform the task with less hardware.
Figure 3.4. Data flow diagram for Figure 3.3.
This chapter presented a VLSI processor array model along with an example
of how a linear array of q+1 cells could achieve a speed up of (q+1)/2 over
a serial algorithm. The VLSI processor array is a generalization of Kung's
systolic array. The generalization adds a more powerful processor in each cell
along with more memory and broadcast capability. Chapters 5 and 6 present
some parallel speech processing algorithms which use the VLSI processor array
and Chapter 8 presents the results of simulating the algorithms.
4. AN ISOLATED WORD RECOGNITION SYSTEM
Of the many commercially available speech recognition systems, most
perform isolated word recognition [Dodd81] since it is easier than connected word
recognition. In isolated speech each utterance is separated from the next by a
short pause (>100 ms). These pauses help the system in locating the
beginning and end of each utterance. After the unknown utterance is located, many
speech recognition systems rely on pattern matching techniques to match the
features of an unknown input utterance to previously stored features of known
utterances. Figure 4.1 is a block diagram of a typical template matching
system for isolated word recognition [RLRW79].
A template matching based system has two modes of operation, training 
and recognizing. During training, the speech signal is bandpass filtered (to 
prevent aliasing) and then sampled. After sampling, the speech signal is bro­
ken into fixed sized frames that generally contain between 100 and 400 sam­
ples. Each frame passes through a preemphasis filter followed by autocorrela­
tion analysis. Next linear predictive coding (LPC) [Makh75,MaGy76] analysis 
is used to take the autocorrelation coefficients and produce LPC coefficients. 
The LPC analysis reduces each frame from N samples (100 < N < 400) to p 
LPC coefficients where p is typically between 6 and 25. Next, endpoint detec­
tion finds the first and last frames of the utterance and discards the silent
frames before the first frame and after the last frame. The discarded frames
are not used in the rest of the processing. At this point an utterance will be 
represented by approximately 40 frames of 8-14 coefficients each. If the utter­
ance has more or less than 40 frames, a linear time warp (LTW) normalizes, in 
time, the utterance to 40 frames.
The process above is repeated for each utterance in the vocabulary, and 
the 40 sets of LPC coefficients for each word are stored for later use. To 
achieve speaker independence, the same word is spoken by several different


















Figure 4.1. Block diagram of an isolated word recognition system.
speakers, and either all sets of coefficients are stored or clusters are used, as
discussed in [RLRW79].
During the recognition mode the same steps as in the training are used, 
except after the linear time warp a dynamic time warp (DTW) compares the 
word to be recognized (the test template) to the training set (the reference tem­
plates). The distance from the input utterance to all the stored utterances is 
found, and the stored utterance with the shortest distance from the input 
utterance is picked as the utterance that was spoken.
The following is a detailed description of each block in Figure 4.1.
4.1. Filtering and Sampling of Input Signal
The first step in recognizing a word is to filter and sample the input signal.
The choice of filtering frequencies and sampling rate depends on the quality of
speech available. The input is low pass (or possibly bandpass) filtered at 10
KHz (100 Hz - 10 KHz) and sampled at 15-20 KHz when using "high quality"
speech. If the system is to work over the phone lines (telephone quality speech)
the input is bandpass filtered around 300-3200 Hz and sampled at 6.67 KHz.
Systems using both 6.67 KHz sampling [RLRW79] and 20 KHz sampling 
[BBGI80] have appeared in the literature, along with various other sampling 
rates in between.
4.2. Preemphasis Filtering
Each frame passes through a digital preemphasis filter with a z transform
of
$$H(z) = 1 - a z^{-1}$$
where typically $a \approx 0.95$. Experimental evidence shows that preemphasis serves
to reduce the variance of the distance calculation in an LPC based template
matching system [RLRW79].
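In the time domain, $H(z) = 1 - a z^{-1}$ corresponds to the difference equation y(n) = s(n) - a s(n-1). A minimal sketch follows; the function name is illustrative, and treating the sample before the frame as zero is an assumption made here, not something the report specifies.

```python
def preemphasize(frame, a=0.95):
    """Apply y(n) = s(n) - a*s(n-1) to one frame.

    Assumption: the sample preceding the frame is taken to be 0.
    """
    out = []
    prev = 0.0
    for s in frame:
        out.append(s - a * prev)
        prev = s
    return out
```

A constant (low frequency) input is attenuated after the first sample, which is the high-pass behavior expected of preemphasis.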
4.3. Autocorrelation Analysis
Next, the sampled signal is broken into frames for autocorrelation 
analysis. The LPC processing that is done later dictates the number of sam­
ples per frame. The frame length should be short enough so the vocal tract 
configuration is constant during the frame, but long enough so the initial condi­
tion assumptions (i.e., the values the signal is assumed to have outside of the 
frame) have a small effect on the coefficients. Frame lengths are usually fixed 
and contain between 100 and 400 samples, which correspond to 10-20 ms of 
speech depending on the sampling rate. One common method uses 300 sample 
frames that begin every 100 samples. This leaves a 200 sample overlap 
between frames. This overlap tends to reduce the variance in the LPC 
coefficients between frames containing the same speech sound.
The short term autocorrelation coefficients are found by using:
$$R(i) = \sum_{m=0}^{M-i-1} s(m)\, s(m+i), \qquad 0 \le i \le p \tag{4.1}$$
where M is the frame length and p is determined by the LPC processing and is
between 6 and 25. The first autocorrelation coefficient, R(0), is the energy for
each frame, while all the coefficients are used in the LPC analysis which
follows.
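Equation (4.1) translates directly into a serial sketch. The names are illustrative; the second helper simply counts the products the double loop performs, which is one way to see where the serial operation count comes from.

```python
def autocorrelation(s, p):
    """Short term autocorrelation R(0..p) of one frame s, as in equation (4.1)."""
    M = len(s)
    return [sum(s[m] * s[m + i] for m in range(M - i)) for i in range(p + 1)]


def serial_product_count(M, p):
    """Number of products in the serial evaluation: sum over i of (M - i)."""
    return sum(M - i for i in range(p + 1))
```

The count equals M(p+1) - p(p+1)/2, because lag i contributes M - i products.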
4.4. Linear Predictive Coding
Following the autocorrelation analysis is linear predictive coding analysis.
LPC models the speech sounds as an all pole filter and an excitation source
[MaGy76]. The filter represents the configuration of the vocal tract, i.e., the
position of the mouth, nose, and throat. If the sound is voiced, the excitation
represents the pitch pulses from the vocal cords. If the sound is unvoiced, the
excitation represents the "noise-like" sound of the air being forced past some
constriction. The constriction may be the tongue and the alveolar ridge
(behind the upper front teeth) as in the sound "s."
LPC assumes that the mth sample of the speech signal {s} can be
represented by two components:
1) a linear combination of the p previous speech samples, and
2) the excitation, $\delta(m)$, which may differ for each sample s(m).
The sample s(m) is modeled as follows [AtHa71,Makh75,RaSc78]:
$$s(m) = \sum_{k=1}^{p} a(k)\, s(m-k) + \delta(m), \qquad p \le m < M \tag{4.2}$$
A common method used to find the LPC coefficients, a(k) for $1 \le k \le p$, is to
define $\hat{s}(m)$ as the predicted signal (i.e., the linear combination of the p
previous samples) and minimize the squared prediction error, which is:
$$E^2 = \sum_m \bigl(s(m) - \hat{s}(m)\bigr)^2 = \sum_m \Bigl[s(m) - \sum_{k=1}^{p} a(k)\, s(m-k)\Bigr]^2 \tag{4.3}$$
To find the a(k)'s, take the p partial derivatives of $E^2$ with respect to a(k) and
set them to zero:
$$\frac{\partial E^2}{\partial a(k)} = 0, \qquad 1 \le k \le p$$
This will result in p equations with p unknowns. By assuming the speech
signal is zero before and after the frame (i.e., s(m) = 0 for m < 0 and s(m) = 0
for $m \ge M$), equation (4.2) can be solved by defining the short-term
autocorrelation function as in equation (4.1) and rewriting equation (4.3) as
$$\sum_{k=1}^{p} a(k)\, R(|i-k|) = R(i), \qquad 1 \le i \le p \tag{4.4}$$
Equation (4.4) can be written in matrix form as:
$$\mathbf{K}\,\vec{a} = \vec{R} \tag{4.5}$$
where $\vec{R}$ and $\vec{a}$ are p element vectors of elements R(i) and a(i) respectively for
$1 \le i \le p$, and K is a p by p matrix with $K_{ik} = R(|i-k|)$, $1 \le i,k \le p$. K is a
Toeplitz matrix, i.e., it is symmetric with all elements on each diagonal being
equal.
Finding the coefficients a(k) takes two steps:
1) find the p autocorrelation coefficients R(i), and
2) solve equation (4.5) for $\vec{a}$.
$\vec{a}$ could be found from equation (4.5) by finding the matrix inverse of K, but
since K is Toeplitz, more efficient methods are available. Figure 4.2 is the
serial algorithm for Durbin's method, which is one of the most efficient
methods available.
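The following is a sketch of Durbin's recursion as outlined in Figure 4.2, written as ordinary serial code. The function name and the returned prediction error are choices made here for illustration, and the sketch assumes R(0) > 0 and a well conditioned frame so that no division by zero occurs.

```python
def durbin(R, p):
    """Solve the Toeplitz system (4.4) for a(1..p), given R(0..p).

    Returns (a, E) where a[j-1] is a_j and E is the final prediction error.
    Assumption: R[0] > 0 and E stays nonzero at every stage.
    """
    a = [0.0] * (p + 1)  # a[j] holds a_j of the previous stage; a[0] is unused
    E = R[0]
    for i in range(1, p + 1):
        # k(i) = [R(i) - sum_j a_j^(i-1) R(i-j)] / E^(i-1)
        acc = sum(a[j] * R[i - j] for j in range(1, i))
        k = (R[i] - acc) / E
        new_a = a[:]
        new_a[i] = k  # a_i^(i) = k(i)
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]  # a_j^(i) = a_j^(i-1) - k(i) a_(i-j)^(i-1)
        a = new_a
        E = (1.0 - k * k) * E  # E^(i) = (1 - k(i)^2) E^(i-1)
    return a[1:], E
```

Substituting the returned coefficients back into equation (4.4) is a quick way to confirm a result for a given frame.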
4.5. Endpoint Detection
After LPC analysis the endpoints are located. The endpoints of an
utterance are the frames where the word begins and ends.
Rabiner [RaSa75] presents a simple but robust method to detect endpoints
based on using an upper (UE) and a lower (LE) "energy" threshold, and a zero
crossing threshold (ZC). The following are definitions of the terms used in
describing the method to find the beginning point. (Reverse all directions when
finding the ending point.)
energy: The “energy” for each frame is the first autocorrelation coefficient, 
R(0). (See equation (4.1).)
zero crossing: The zero crossing rate is defined as the number of times the nor­
malized signal changes sign in one frame.
1    E(0) ← R(0);
2    FOR i ← 1 TO p DO
3        /* compute k(i) */
4        k(i) ← 0;
5        FOR j ← 1 TO i-1 DO
6            k(i) ← k(i) + a_j^(i-1) * R(i-j);
7        k(i) ← [R(i) - k(i)] / E(i-1);
8        E(i) ← (1 - k(i)^2) * E(i-1);
         /* compute a_j's for stage i */
9        a_i^(i) ← k(i);
10       FOR j ← 1 TO i-1 DO
11           a_j^(i) ← a_j^(i-1) - k(i) * a_(i-j)^(i-1);
12   FOR j ← 1 TO p DO
13       a_j ← a_j^(p);
Figure 4.2. Durbin's Algorithm to compute LPC coefficients a_j from
autocorrelation coefficients R(i), 0 ≤ i ≤ p.
frame pointer: The frame pointer points to the frame that is currently being
considered as the first (or last) frame of the word.
frame after: If the frame pointer is at frame n, the frame after is frame n+1.
back up: When the frame pointer is backed up, it moves from frame n to n-1
to n-2, etc. until the criterion is met.
Rabiner’s method works as follows:
1) The energy and zero crossings are measured for all frames in the utterance.
2) After the thresholds are set (to be discussed later), the frame pointer is used
to find the first (or last) frame in the utterance by setting the frame 
pointer to the first frame to exceed the upper energy threshold.
3) Next the frame pointer is backed up to the frame after the first frame that
does not exceed the lower energy threshold.
4) If three frames before this frame exceed the zero crossing rate threshold, the
frame pointer is backed up until the frame after the first frame that does 
not exceed the zero crossing rate threshold.
After step 4, the frame pointer is pointing to the first frame of the utter­
ance. The same procedure (and thresholds) are used to locate the ending point. 
Figure 4.3 is an example of how the thresholds are used to find the endpoints. 
The circled numbers represent the location of the frame pointer after the given 
step number.
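Steps 2 through 4 for the beginning point can be sketched as follows. This is one reading of the description above, not Rabiner's published procedure: in particular, step 4 is interpreted here as "the three frames immediately before the pointer all exceed ZC". The function name and argument names are hypothetical, and the thresholds are passed in rather than computed.

```python
def find_start_frame(energy, zero_cross, ue, le, zc):
    """Return the index of the first frame of the utterance, or None.

    energy[n] and zero_cross[n] are the per-frame measurements of step 1;
    ue, le and zc are the upper energy, lower energy and zero crossing
    thresholds (assumed to be supplied by the caller).
    """
    # Step 2: first frame whose energy exceeds the upper threshold.
    n = next((i for i, e in enumerate(energy) if e > ue), None)
    if n is None:
        return None
    # Step 3: back up to the frame after the first frame that does not
    # exceed the lower energy threshold.
    while n > 0 and energy[n - 1] > le:
        n -= 1
    # Step 4 (interpretation): if the three frames before the pointer all
    # exceed the zero crossing threshold, keep backing up on that criterion.
    if n >= 3 and all(z > zc for z in zero_cross[n - 3:n]):
        while n > 0 and zero_cross[n - 1] > zc:
            n -= 1
    return n
```

The ending point would use the same logic on the reversed frame sequence, as the definitions above suggest.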
The three thresholds are set by finding the mean ($\mu_{ZC}$) and standard
deviation ($\sigma_{ZC}$) of the zero crossings for the first 10 frames. These frames are
assumed to be silent (background noise only). The zero crossing threshold (ZC)
is found by:
$$ZC = \mathrm{MIN}(\mathrm{FIXED},\; \mu_{ZC} + 2\sigma_{ZC})$$
where FIXED is a fixed threshold. A typical value for FIXED is 25 crossings
per 10 ms if the sampling rate is 10 KHz. The UE and LE thresholds are
found by:





Figure 4.3. An example of how the zero crossings and energy thresholds are 
used to find the end-points of a word (from [RaSa75]).
where PEAK is the largest energy over all frames, and SILENT is the largest 
energy of the silent frames (silent frames are assumed to be the first 10 frames).
The double energy threshold is used so that mouth noises (breathing, lip 
smacking, etc.) that commonly occur before an utterance are not included as 
part of the utterance. These noises will tend to exceed the lower energy thres­
hold, but not the upper energy threshold. The zero crossing rate is used to 
detect the beginnings of words starting with a fricative. The energy of a frica­
tive is generally not enough to exceed the upper energy threshold, so the zero 
crossing rate is used to detect the high frequencies which are commonly present 
in fricatives. Lamel [LRRW81] states that the use of zero crossing rate is not 
effective in detecting words starting with a fricative for telephone quality recog­
nition since telephone speech is band limited to 3200 Hz.
4.6. Time Warping
Dynamic time warping (DTW) is widely used in word and speech
recognition to eliminate the effects of nonlinear time fluctuations in speech patterns.
The function of DTW is to find the minimum time-normalized distance
between two templates A and B where A and B are sequences of feature
vectors $a_i$ and $b_j$ for $1 \le i \le I$, $1 \le j \le J$. Each $a_i$ and $b_j$ is a vector of features
for a segment of speech. In the template matching system discussed here, the
feature vector contains the p LPC coefficients. It is generally easier to
compare two templates of equal length with dynamic time warping, so linear time
warping is used before dynamic time warping to normalize the length (i.e., the
number of frames) of the templates. The following two sections describe the
linear and dynamic time warping.
4.6.1 Linear Time Warping
The following linearly warps a template of speech of length M to length N.
$$T(n) = (1-s)\,R(m) + s\,R(m+1), \qquad n = 1,\ldots,N \tag{4.6}$$
where R(m) for $1 \le m \le M$ are the M frames of the input templates, and T(n)








where $\lfloor x \rfloor$ is the greatest integer less than or equal to x. For a time signal, the
simple linear interpolation used in equation (4.6) is adequate as long as M and
N do not differ greatly [Myer80]. Words are typically 40 frames long, so N =
40.
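A sketch of the interpolation in equation (4.6) is given below. The definitions of m and s did not survive in this copy, so the sketch assumes the common choice m = ⌊(n-1)(M-1)/(N-1)⌋ + 1 with s as the fractional part of (n-1)(M-1)/(N-1); this is an assumption, not necessarily the report's definition. The code uses 0-based indices and a hypothetical function name.

```python
def linear_time_warp(frames, N):
    """Linearly warp a list of M feature vectors to N feature vectors.

    Assumption: output frame n (0-based) sits at position n*(M-1)/(N-1) in
    the input; m is the integer part and s the fractional part.
    """
    M = len(frames)
    if M == 1 or N == 1:
        return [list(frames[0]) for _ in range(N)]
    warped = []
    for n in range(N):
        pos = n * (M - 1) / (N - 1)
        m = int(pos)                      # floor, since pos >= 0
        s = pos - m
        nxt = frames[min(m + 1, M - 1)]   # guard the last frame, where s == 0
        warped.append([(1 - s) * u + s * v for u, v in zip(frames[m], nxt)])
    return warped
```

With N = 40 this would map any utterance onto the 40 frame templates described above.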
4.6.2 Dynamic Time Warping
Following the linear time warp is a dynamic time warp. This is done, as
shown in Figure 4.4, by finding a path connecting (1,1) to (I,J) such that the
accumulated distance is a minimum. Figure 4.5 is an example of how an
input signal is warped to match a reference signal. The accumulated distance
is a weighted sum of the local distances d(i,j) between the feature vectors $a_i$
and $b_j$. An exhaustive search of all possible paths is computationally infeasible,
so dynamic programming (DP) theory is used to reduce the number of paths
searched. DP theory states that if the point (i,j) is on the optimum path, then
the path from (1,1) to (i,j) is locally optimum. One method to find the
accumulated distance, g(i,j), restricts the possible paths leading to a given point to
those shown in Figure 4.6. Using these restrictions*, g(i,j) is recursively defined
as,
*Myers [MRR80] would describe these restrictions as Type I local constraints with an
unsmoothed Type d weighting function.



















Figure 4.6. Possible paths to a point.
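$$g(i,j) = d(i,j) + \min \begin{cases} g(i-1,j-2) + 2\,d(i,j-1) \\ g(i-1,j-1) + d(i,j) \\ g(i-2,j-1) + 2\,d(i-1,j) \end{cases} \tag{4.7}$$
$$g(1,1) = 2\,d(1,1)$$
Once g(I,J) is found, the normalized distance D(A,B) can be found by dividing
g(I,J) by I + J.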
Two methods that can be used to reduce the computation time are an
adjustment window and pruning. The adjustment window [SaCh71], r, reduces
the number of local distance calculations by restricting the domain of the time
warp to those g(i,j) for which $|i-j| < r$, as shown by the two diagonal lines in
Figure 4.7. Pruning compares the g(i,j) values at each point in the time warp
to a threshold, and if the threshold is exceeded, the DTW is stopped and DTW
on the next reference template is started. This reduces the DTW time by
aborting comparisons that will definitely not yield the minimum distance.
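The recursion and the adjustment window can be sketched as below. The sketch assumes the three-predecessor form g(i,j) = d(i,j) + min{g(i-1,j-2) + 2d(i,j-1), g(i-1,j-1) + d(i,j), g(i-2,j-1) + 2d(i-1,j)} with g(1,1) = 2d(1,1), which is the reading of the local constraints taken here, and a squared Euclidean local distance. Pruning is omitted, indices are 0-based, and the function name is illustrative.

```python
import math


def dtw_distance(A, B, r=None):
    """Normalized DTW distance between templates A and B (lists of vectors).

    r is the adjustment window: cells with |i - j| >= r are skipped.
    Assumptions: squared Euclidean local distance; recursion as stated in
    the lead-in; no pruning.
    """
    I, J = len(A), len(B)

    def d(i, j):
        return sum((u - v) ** 2 for u, v in zip(A[i], B[j]))

    g = [[math.inf] * J for _ in range(I)]
    g[0][0] = 2 * d(0, 0)
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            if r is not None and abs(i - j) >= r:
                continue  # outside the adjustment window
            best = math.inf
            if i >= 1 and j >= 1:
                best = min(best, g[i - 1][j - 1] + d(i, j))
            if i >= 1 and j >= 2:
                best = min(best, g[i - 1][j - 2] + 2 * d(i, j - 1))
            if i >= 2 and j >= 1:
                best = min(best, g[i - 2][j - 1] + 2 * d(i - 1, j))
            g[i][j] = d(i, j) + best
    return g[I - 1][J - 1] / (I + J)
```

Dividing by I + J matches the weighting: when every local distance equals 1, the normalized distance is 1 regardless of the path taken.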
The steps needed to compute one g(i,j) are:
1) computing the local distance d(i,j);
2) the two multiplications and four additions in equation (4.7); and
3) two comparisons to find the minimum of three values.
These three steps are defined as one loop and will be used as a basis to
compare the time complexities of different dynamic time warping algorithms. The
serial algorithm in Figure 4.8 must execute one loop for every (i,j) pair in
Figure 4.4. Using no adjustment window, the total time is $I^2$ loops*. However, if
the adjustment window is used, the number of loops is
$$I^2 - 2\sum_{i=1}^{I-r} i = 2Ir - I - r^2 + r.$$
*A linear time warp is commonly used on both the test and reference patterns to make
them the same length, allowing the assumption that I = J.
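The closed form can be checked by counting the (i,j) pairs inside the window directly. The sketch below assumes I = J and the strict window |i-j| < r used in the text; the function names are illustrative.

```python
def loops_by_counting(I, r):
    """Count the (i, j) pairs, 1 <= i, j <= I, that satisfy |i - j| < r."""
    return sum(1 for i in range(1, I + 1)
               for j in range(1, I + 1) if abs(i - j) < r)


def loops_closed_form(I, r):
    """2Ir - I - r^2 + r, valid for 1 <= r <= I."""
    return 2 * I * r - I - r * r + r
```

For example, with I = 40 and r = 10 both give 670 loops, compared with 1600 loops without the window.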
Figure 4.7. Adjustment window of width r.
/*
   Serial program for dynamic time warping.
*/
   I              number of test vectors
   J              number of reference vectors
   r              adjustment window
   known[x][i]    contains coefficient i of
                  vector x of the known utterance.
   unknown[y][i]  contains coefficient i of
                  vector y of the unknown utterance.
   d[x][y]        contains the local distance between
                  the x known vector and the y unknown vector.
   g[x][y]        contains the accumulated distance up to
                  the x known vector and the y unknown vector.
Line  Time in μs
1            PROCEDURE DTW
2     9      FOR y ← 0 TO I-1          /* For each frame in the unknown utterance */
3     4        FOR x ← -r TO r         /* For each frame in the warping path */
4     5          IF (y+x > 0) AND (y+x > 2I-2)
5                /*
6                   Compute the local distance.
7                */
8     .5         sum ← 0;
9     2.75       FOR i ← 0 TO p-1
10    11.25        sum ← sum + (known[x][i] - unknown[y][i])^2;
11    2          d[x][y] ← sum;
12               /*
13                  Check initial conditions
14               */
15    5          IF y = 0 AND x = 0
16    8.25         g[x][y] ← 2 * d[x][y];
17               ELSE
18                 IF y = 0            /* Check left edge */
19    3.25           min ← 2 * d[x][y-1];
20                 ELSE IF x = 0       /* Check bottom edge */
21    3.75           min ← 2 * d[x-1][y];
22                 ELSE
23                   /*
24                      Compute possible paths.
25                   */
26    4              A ← g[x-1][y-2] + 2d[x][y-1];
27    3              B ← g[x-2][y-1] + 2d[x-1][y];
28    2              C ← g[x-1][y-1] + 2d[x][y];
29                   /*
Figure 4.8. Serial DTW program. Execution times assume an 8 MHz 
MC68000. (See Section 7.6)
30                      Find minimum path.
31                   */
32                   min ← A
33    6.5            WHERE B < A
34    5                min ← B;
35    2              END WHERE
36    6.5            WHERE C < min
37    .5               min ← C;
38    2              END WHERE
39
40    2.5          g[x][y] ← d[x][y] + min;
41                 /*
42                    If g[x][y] is > ∞ set to ∞, otherwise
43                    repeated doubling might cause it to wrap around
44                    to -∞
45                 */
46    6.5          WHERE g[x][y] > ∞
47    .5             g[x][y] ← ∞;
48    2            END WHERE
49


This section has described an isolated word recognition system that uses
template matching. This system was chosen to be implemented on an SIMD
machine and VLSI processor array for the following reasons:
1) It has speaker independent accuracies as high as 98.2% [RLRW79].
2) It and systems like it have appeared many times in the literature, therefore
there is interest in such a system.
3) The system currently cannot run in real time on a serial processor.
As the vocabulary size increases, this system will take more time to do the
pattern matching. If a vocabulary of 1,000 words is used, a conventional
processor cannot compare the input templates to all the reference templates in real
time. The following chapters present algorithms for SIMD machines and VLSI
processor arrays to do each step in the recognition system. When the SIMD
and VLSI processor array speech algorithms are combined into one system
(either as all SIMD or all VLSI processor array), it should be able to run in real
time with a large vocabulary. If so, this system will meet three of the four
criteria given in Section 1; namely, real time response, large vocabulary, and
speaker independence. The only criterion not met will be continuous speech
recognition, which is a topic of future research.
5. SURVEY OF PARALLEL SPEECH PROCESSING ALGORITHMS
The following is a survey of some of the highly parallel speech processing
algorithms in the literature. The algorithms examined are those needed for the
recognition systems considered here. The major topics are LPC coding
(including autocorrelation algorithms), dynamic time warping, and digital filtering.
Each section presents an algorithm and then discusses the machine
requirements and speed up obtained by the algorithm.
5.1. Autocorrelation
Autocorrelation has many uses in speech processing. The template matching
recognition system often uses it as an intermediate step to finding LPC
coefficients (see Section 4.4). The short term autocorrelation function, R, is
defined as:
$$R(i) = \sum_{m=0}^{M-i-1} s(m)\, s(m+i), \qquad 0 \le i \le p$$
Three methods to find the autocorrelation coefficients are discussed here. The
first method (AUTO1) uses M PEs to multiply the M-i-1 s(m)s(m+i) terms in
parallel. The second method (AUTO2) uses M PEs to compute R(i) for
$0 \le i < M$ using two FFTs. The third method (AUTO3) uses p+1 PEs to
sum the terms in each R(i) in parallel.
5.1.1. Autocorrelation Using M PEs - AUTO1
Siegel [Si80a] gives a SIMD algorithm to compute the autocorrelation
coefficients R(i), $0 \le i \le p$, for an M-point signal s(m), $0 \le m < M$. Her
algorithm, listed in Figure 5.1, is referred to here as AUTO1. It uses N PEs where
$2^{n-1} < M \le 2^n = N$. The signal s(m) is initially distributed among the PEs so
that s(j) is stored in variable s in PE j for $0 \le j < M$, and 0 is stored in
variable s in PE j for $M \le j < N$. Each element R(i) is computed simultaneously
by transferring s(m+i) in PE m+i to PE m, and then computing s(m)s(m+i)
in PE m for $0 \le m < M-i$. These products are summed up using a recursive
doubling technique (see Section 2.6). Figure 5.2 shows the pattern of data
transfers used to compute the product terms. Figure 5.3 shows the data
transfers used in recursive doubling with a Cube transfer function. Using the
Cube transfer function allows the sum of the products to appear in the first L
PEs, i.e., on completion of the algorithm PEs 0 through L will contain R(i),
$0 \le i \le p$. This is done so that the data is in place for the LPC algorithm
which follows autocorrelation. The LPC algorithm needs R(i), $0 \le i \le p$, to be
stored in PE i, $0 \le i \le p$.
Assume that M-p < L and M is a power of two; then the total number of
parallel multiplications performed in the algorithm is p+1. For each R(i), the
recursive doubling requires at most $\lceil \log M \rceil$ parallel additions, so the total
number of addition steps is $(p+1)\lceil \log M \rceil$. The number of Shift -1 transfers
performed is p, and the number of Cube transfer functions is at most
$(p+1)\lceil \log M \rceil$. The total number of transfer steps is at most $p + (p+1)\lceil \log M \rceil$.
The asymptotic complexity is reduced from O(Mp) for the serial algorithm to
O(p log M) for the SIMD algorithm.
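The recursive doubling step can be illustrated with a small serial simulation in which a list stands in for the PEs. This is only a sketch: the Cube(j) exchange is modeled as PE a reading from PE a XOR 2^j, the Shift(-1) transfers of the real algorithm are replaced by direct indexing, and the function names are illustrative.

```python
def cube_sum(values):
    """Sum N = 2**n values in log2(N) exchange steps.

    values[a] plays the role of partsum in PE a. After the last step every
    simulated PE holds the total.
    """
    N = len(values)
    assert N > 0 and N & (N - 1) == 0, "N must be a power of two"
    partsum = list(values)
    j = 0
    while (1 << j) < N:
        # Cube(j): PE a receives partsum from PE (a XOR 2**j), then adds it.
        received = [partsum[a ^ (1 << j)] for a in range(N)]
        partsum = [partsum[a] + received[a] for a in range(N)]
        j += 1
    return partsum


def auto1_simulated(s, p):
    """R(0..p) computed with one product per simulated PE and a cube sum."""
    M = len(s)
    N = 1
    while N < M:
        N *= 2
    sig = list(s) + [0] * (N - M)  # PEs M..N-1 hold 0
    R = []
    for i in range(p + 1):
        # PE m forms s(m)*s(m+i) when m < M-i; the other PEs contribute 0.
        products = [sig[m] * sig[m + i] if m < M - i else 0 for m in range(N)]
        R.append(cube_sum(products)[0])
    return R
```

Each call to `cube_sum` performs log N addition steps, so p+1 coefficients take (p+1) log N addition steps, in line with the counts above.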
5.1.2. Autocorrelation Using Two FFTs - AUTO2
Another parallel method to find the autocorrelation coefficients presented
by Siegel [Si80a] is to take the fast Fourier transform (FFT) of the magnitude
squared of the FFT of the signal s(m) padded with zeros to a length of 2M.
This method, referred to as AUTO2, is not practical on a serial machine,
especially when only a small number of coefficients are needed, since it requires so
much computation time. However, on a parallel machine, certain values of M
and p make this method practical.
/* Algorithm Name:  auto
   Section:         5.1.1
   Machine:         SIMD
   Function:        This program finds the autocorrelation
                    coefficients of input speech data.
   Number of PEs:   N
   Transfers:       Shift(-1), Cube
   Masking:         Data Conditional
   Parameters:      autocoef, The number of coefs. to find.
                    N, The number of PEs in use.
                    NetD, The interconnection network
                    delay time in cycles.
   Input:           The input data is stored in PEs 0 through N-1
                    with PE i containing sample i for 0 ≤ i < N.
   Output:          The autocorrelation coefficients, R(i),
                    for 0 ≤ i ≤ autocoef-1 appear in PE i
                    for 0 ≤ i < N (i.e. each PE contains
                    every coefficient).
   Cycles:          autocoef[136 + NetD + (54 + 2NetD)logN] - 12 - NetD
   Typical Time:    1,757 μs for autocoef=9, NetD=18, and logN=7.
   Variable Usage:  (* means set by calling routine)
     ADDR:     Address of PE (e.g. ADDR = 0 in PE 0).
     L:        on completion, PEs 0-L will contain R(i).
     partsum:  temporary variable holding a partial sum.
     R():      autocorrelation coefficients.
     sig:      input signal.
     slast:    after stage i, "slast" in PE m holds sig(m+i).
*/
Line  Time in μs
1     1.5    slast ← sig        /* After stage i, "slast" in PE m holds sig(m+i) */
2
3     5      FOR i ← 0 TO p DO
4     1.5      IF i ≠ 0 THEN
5     3          USE Shift(-1)
6     1.5        DTRin ← slast
7     4.5        TRANSFER
8     1.5        slast ← DTRout
9     0.5      partsum ← 0
10    6.5      WHERE ADDR < M-i DO
11    9.25       partsum ← slast * sig
12    2        END WHERE
13    2.25     FOR j ← 0 TO max(⌈log(M-i)⌉-1, log(L-1))
14    3          USE Cube(j)
15    12.5       TRANSFER partsum TO tmp
16    0.75       partsum ← tmp + partsum
17    1.5      R(i) ← partsum
Figure 5.1. Algorithm for autocorrelation using N PEs. The execution times 
assume an 8 MHz MC68000. (See Section 7.3.)
Figure 5.2. Data transfers to move s(m+i) to PE m to compute s(m)*s(m+i)
terms for R(i), 0 ≤ i ≤ p, shown for N=M=8, p=3.
PE   initial   after Cube(0)   after Cube(1)   after Cube(2)
0    (0)       (0+1)           (0+1+2+3)       (0+...+7)
1    (1)       (0+1)           (0+1+2+3)       (0+...+7)
2    (2)       (2+3)           (0+1+2+3)       (0+...+7)
3    (3)       (2+3)           (0+1+2+3)       (0+...+7)
4    (4)       (4+5)           (4+5+6+7)       (0+...+7)
5    (5)       (4+5)           (4+5+6+7)       (0+...+7)
6    (6)       (6+7)           (4+5+6+7)       (0+...+7)
7    (7)       (6+7)           (4+5+6+7)       (0+...+7)
Figure 5.3. Performing sum of elements in N PEs using recursive doubling for
N=8.
Siegel et al. [Si81,SMS79,MSS80] present an algorithm to compute the
FFT using the decimation-in-frequency approach on an SIMD machine. This
algorithm uses M PEs to compute the FFT of a 2M point signal where PE i
initially contains s(i) and s(i+M), $0 \le i < M$. Using this SIMD algorithm,
each 2M-point DFT, M a power of 2, is computed in M PEs at a cost of log
M + 1 parallel complex multiplications, 2(log M + 1) parallel complex additions,
and log M parallel data transfers. Finding the magnitude squared of each of
the 2M points that are distributed over M PEs requires 2 complex additions
and 2 complex multiplications. After the second FFT, p+1 broadcasts are
needed to move the R(i)'s from the M PEs so that all R(i)'s appear in each of
the first L PEs. Table 5.1 is a summary comparing the two methods.
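The two-FFT idea can be sketched serially as follows. The sketch assumes M is a power of two and p < M, uses a plain recursive radix-2 FFT rather than the SIMD decimation-in-frequency algorithm cited above, and divides the second forward transform by 2M; that scaling works because the power spectrum of a real signal is real and even, so the forward and inverse transforms differ only by the factor 2M. The function names are illustrative.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def autocorrelation_fft(s, p):
    """R(0..p) from two FFTs of length 2M (assumes M a power of two, p < M)."""
    M = len(s)
    padded = [complex(v) for v in s] + [0j] * M      # zero pad to 2M
    spectrum = fft(padded)
    power = [complex(abs(v) ** 2) for v in spectrum]  # magnitude squared
    r = fft(power)                                    # second FFT
    return [r[i].real / (2 * M) for i in range(p + 1)]
```

The zero padding to length 2M keeps the circular correlation implied by the FFT from wrapping around, so the first M lags agree with equation (4.1).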
5.1.3. Autocorrelation Using p+1 PEs - AUTO3
Ashajayanthi [ASV79] also presents an algorithm to find autocorrelation
coefficients. It is rewritten in Flock Algol and is listed in Figure 5.4 and
referred to as AUTO3. AUTO3 uses p+1 PEs and the signal s(m),
$0 \le m < M$, is stored in PE 0 (or PE 0 reads s(m) from some input device).
Lines 1-10 input each new s(m) and shift it from PE i to PE i+1 until PE i
contains s(p-i) for $0 \le i \le p$. Figure 5.5a shows the data allocation after line
14 for p=3 and M=4. Lines 17-19 broadcast Q from PE p, which is the oldest
of the p+1 stored samples, to all other PEs. Each PE multiplies this value
times its current Q value and adds it to its own variable sum. Then lines 19-21
read in a new s(m) and shift the old samples from PE i to PE i+1 as shown in
Figure 5.5b. After M loops, R(i) will be in PE p-i, $0 \le i \le p$. Lines 24-26 use
p+1 broadcasts to send all R(i) values to all PEs. The computation times
listed in Table 5.1 do not count lines 1-14 since the other SIMD algorithms all
assumed the data was already in each PE.
Table 5.2 shows the time complexity for each method with M=128 and
p=8. There is no clear best method. If p is small compared to M, straight
computation with M PEs (AUTO1) will require the least time. If p is close to
M in value, FFT (AUTO2) is the fastest approach.
Table 5.1. Summary of the methods to compute autocorrelation coefficients.
         PEs    additions             multiplications       transfers          broadcasts
Serial   1      M(p+1) - p(p+1)/2     M(p+1) - p(p+1)/2
AUTO1    M      (p+1) log M           p+1                   p + (p+1) log M
AUTO2    ...    ...                   ...                   2 log M            p+1
AUTO3    p+1    M                     M                     M                  M+p+1
/*
   sum :  sum of all coefficients in each PE.
   p   :  address of last PE.
   Q   :  register used to hold values being shifted between PEs.
   R(i):  autocorrelation coefficients. (output)
   s(i):  input signal, enters in PE 0.
*/
1   sum ← 0                      /* Initialize autocorrelation function sum to 0 */
2   USE Shift +1
3   FOR i ← 0 TO p-1 DO
4     WHERE ADDR = 0 DO          /* Shift in first p+1 samples into */



10    Q ← DTRout
11



16  FOR i ← p TO M-1 DO          /* Broadcast Q from PE p to all PEs */
17    BROADCAST Q FROM PE p
18    sum ← sum + Q * DTRout     /* Multiply Q times value from PE p */
19    TRANSFER Q



24  FOR i ← 0 TO p DO            /* Store all coefficients in all PEs */
25    BROADCAST sum FROM PE i
26    R(i) ← DTRout
Figure 5.4. SIMD algorithm (AUTO3) to compute autocorrelation coefficients
R(i), 0 ≤ i ≤ p, for an M-point signal, using p+1 PEs.
PE     4       5       6       7       8
0      s(3)    s(4)    0       0       0
1      s(2)    s(3)    s(4)    0       0
2      s(1)    s(2)    s(3)    s(4)    0
3      s(0)    s(1)    s(2)    s(3)    s(4)
       (a)     (b)
Figure 5.5. Contents of variable P in each PE at the start of line 16 for p=3,
Table 5.2. Time complexities for computing autocorrelation coefficients for
M=128 and p=8.
         PEs    additions       multiplications   transfers   broadcasts
Serial   1      1116            1116
AUTO1    M      54              9                 62
AUTO2    M      30 (complex)    16 (complex)      12          9
AUTO3    p+1    128             128               128         137
5.2. Linear Prediction of Speech
Linear prediction is a popular method used in speech recognition and 
speech compression. Parallel algorithms for both coding speech into linear 
prediction coefficients and reconstructing speech from LPC coefficients are 
presented in the literature. The following sections discuss parallel algorithms 
for computing LPC coefficients using both autocorrelation and covariance 
methods. It also discusses a parallel algorithm for synthesis using LPC 
coefficients.
5.2.1. Parallel LPC Using the Autocorrelation Method
Siegel [Si80a,Si80b,Si81] presents an SIMD algorithm for linear predictive
coding using Durbin's method [Makh75,RaSc78]. The serial algorithm is in
Figure 4.2. The SIMD algorithm achieves its speedup over the serial algorithm
by computing the k(i)'s in line 6 in parallel and the $a_j$'s in line 11 in parallel.
The SIMD algorithm uses P PEs to solve the p pole linear predictor, where
$2^{m-1} < p \le 2^m = P$. Initially, each PE contains all R(i)'s for $0 \le i \le p$.
After stage i in the iteration, the predictor coefficient, $a_j^{(i)}$, is in the variable a
of PE j mod N, for $1 \le j \le i$ (i.e., if p < N, PE j will contain $a_j$ for
$1 \le j \le p$; if p = N, PE j will contain $a_j$ for $1 \le j < p$, and PE 0 will contain
$a_p$). At the completion of the algorithm, logical PE j will contain $a_j$ for
$1 \le j \le p$.
The two parts of Durbin's method are:
1) computation of the k(i)'s from the R(i)'s and,
2) the iterative computation of the predictor coefficients ($a_j^{(i)}$'s) for an order i
predictor from the k(i)'s and the predictor coefficients from the previous
iteration.
The SIMD computation of the k(i)'s uses recursive doubling. For each
iteration i, the $a_j^{(i)}$'s are computed by transferring data so that $a_j^{(i-1)}$ and $a_{i-j}^{(i-1)}$
are in the same PE, and then executing the operations of line 11 in the serial
algorithm in parallel for all values of j, 0 < j < i. Figure 5.6 shows the
transfers needed for a 4-th order predictor computed in 4 PEs. No transfers
are needed for i=1 and i=2. Stage i of Durbin's algorithm requires pairing
elements $a_j^{(i-1)}$ and $a_{i-j}^{(i-1)}$, for $1 \le j < i$, which is done with the $\mathrm{Perm}_i$
Figure 5.6. Data transfers for computation of the a_j's for p=4 in four PEs.
interconnection function. See Section 2.4.2 for more details on the Perm
function.
The serial algorithm requires $p^2+p$ additions and multiplications, and p divisions to compute the $a_j$’s. Siegel’s algorithm, shown in Figure 5.7, requires p multiplication steps to compute the k’s (lines 4-7), and (p+1) log N additions and (p+1) log N data transfers (lines 11-15). Computing E (in lines 17-18) requires 2p multiplications and additions, and p divisions. Computing the $a_j$’s requires p-1 multiplications and additions with p-1 data transfers. Table 5.3 summarizes these results. The parallel algorithm reduces the asymptotic time complexity from $O(p^2)$ to O(p log N).
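As a point of reference for the operation counts above, the following is a minimal serial sketch of Durbin's recursion in its usual textbook form. It is assumed, not confirmed, to correspond to the serial algorithm of Figure 4.2; the function names `durbin` and `residuals` and the sample autocorrelation values are illustrative only.

```python
def durbin(R, p):
    """Solve the order-p autocorrelation normal equations by Durbin's recursion.

    R holds autocorrelation values R[0..p]. Returns (a, E), where a[j] is the
    predictor coefficient a_j for 1 <= j <= p (a[0] is unused) and E is the
    final prediction error.
    """
    a = [0.0] * (p + 1)
    E = R[0]
    for i in range(1, p + 1):
        # k(i): the sum of products that the SIMD version spreads over the PEs.
        acc = sum(a[j] * R[i - j] for j in range(1, i))
        k = (R[i] - acc) / E
        # Order-i coefficients from the order-(i-1) ones. Each update pairs
        # a_j with a_(i-j), the pairing supplied by the Perm transfers.
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        E = (1.0 - k * k) * E
    return a, E


def residuals(R, a, p):
    """Left side minus right side of sum_k a_k R(|i-k|) = R(i), 1 <= i <= p."""
    return [sum(a[k] * R[abs(i - k)] for k in range(1, p + 1)) - R[i]
            for i in range(1, p + 1)]


R = [1.0, 0.5, 0.2, 0.05, 0.01]   # made-up autocorrelation values
a, E = durbin(R, 4)
print(max(abs(r) for r in residuals(R, a, 4)))
```

The two inner loops are the parts that the SIMD algorithm executes in parallel: the sum for k(i) and the update of the $a_j$’s.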
5.2.2. Parallel LPC Coding Using the Covariance Method
The covariance method [RaSc78] is another method used to find the LPC coefficients of a speech waveform. This method involves solving:
$$\sum_{k=1}^{p} a_k\,\phi(i,k) = \phi(i,0), \qquad 1 \le i \le p$$
where $a_k$, 1 ≤ k ≤ p, are the LPC coefficients and the covariance matrix, $\phi(i,k)$, is defined as:
$$\phi(i,k) = \sum_{m=-k}^{M-k-1} s(m)\,s(m+k-i), \qquad 1 \le i \le p,\; 0 \le k \le p \qquad (5.3)$$
This equation looks something like equation (4.1), which was used for the autocorrelation method, but the samples s(m), -p ≤ m < M, are used where equation (4.1) used only s(m), 0 ≤ m < M. Equation (5.3) can be written as:
$$\mathbf{R} = \mathbf{K}\,\mathbf{a}$$
where R and a are p element vectors of elements $\phi(i,0)$ and a(i) respectively for 1 ≤ i ≤ p, and K is a p by p matrix with K = $\phi(i,k)$, 1 ≤ i, k ≤ p. This is the same as the autocorrelation analysis equation (4.1) except K is symmetric, and not Toeplitz. Durbin’s method cannot be used to solve for a; instead the Cholesky decomposition [RaSc78] can be used.
Siegel et al. [Si80b] present a parallel SIMD algorithm to compute the covariance coefficients. This algorithm uses M PEs and requires p + 1
LADDR: logical address of PE (e.g., LADDR = i + 1 in PE i).
R(): Autocorrelation coefficients (input)

Line  Time in µs
 1     2.5     E ← R(0)
 2     0.5     a ← 0
 3     5.4     FOR i ← 1 TO p DO                 /* Compute k(i) */
 4     0.75      k ← 0
 5     6.5       WHERE LADDR < i DO
 6    12.75        k ← a * R(i-LADDR)
 ...
                 /* Sum k’s in all PEs so all PEs have the sum */
11     2.75      FOR j ← 0 TO log N - 1 DO
12     3           USE Cube(j)
13     0           DTRin ← k
14    17           TRANSFER
15     0.75        k ← k + DTRout
16
 ...
20               /* Compute a_j’s for stage i */
21     3         USE PermLADDR(i)
22     8.5       WHERE LADDR = i DO
23                 a ← k                         /* a_i^(i) ← k(i) */
24     2         ELSEWHERE
25     6.5         WHERE LADDR < i DO
26     1.5           DTRin ← a
27     4.5           TRANSFER
28    14.75          a ← a - k * DTRout
29     2           ENDWHERE
30     2         ENDWHERE
Figure 5.7. SIMD algorithm using Durbin’s method to solve for p predictor coefficients using p PEs. Execution times are based on an 8 MHz MC68000. (See Section 7.4.)
Table 5.3. Summary of parallel and serial LPC analysis algorithms.

                    Additions      Multiplications   Divisions   Data Transfers
Serial     k’s      p(p+1)/2       p(p+1)/2
           E        p              p                 p
           a_j’s    p(p-1)/2       p(p-1)/2
           Total    p^2+p          p^2+p             p
Parallel   k’s      (p+1)log N     p                             (p+1)log N
           E        2p             2p                p
           a_j’s    p-1            p-1                           p-1
           Total    (p+1)log N     4p-1              p           (p+1)log N + p-1
multiplications, (p+1)(log M + 1) additions, and (p+1) log M + 3p + 1 transfers. A serial covariance algorithm requires $Mp + p^2 - p$ additions and multiplications. The parallel algorithm has reduced the time complexity from O(pM) to O(p log M).
Safranek [Saf82] presents a parallel SIMD algorithm to solve equation (5.3) for a. This algorithm uses p PEs and consists of three parts: decompose, transpose, and solve. The decomposition part assumes $\phi(i,k)$, 1 ≤ i, k ≤ p, will be stored in $\phi[k]$ in PE i. Table 5.4 shows the computation requirements for the decomposition. The decomposition results in a matrix which must be transposed. The transposition requires p + 1 additions and p transfers. Following the transposition, the predictor coefficients are computed. Table 5.4 shows the operations used for solving for the predictor coefficients. Table 5.4 also shows the number of operations used by a serial algorithm for each of the three parts of the Cholesky decomposition. The time complexity of the serial algorithm is $O(p^3)$. The parallel algorithm, on the other hand, uses p PEs and has a time complexity of $O(p^2)$. Thus this method provides an ideal asymptotic speedup.
Table 5.4. Operations used by the parallel and serial algorithms for the three parts of the Cholesky decomposition.

             Decompose               Transpose            Solve
             Parallel    Serial      Parallel   Serial    Parallel   Serial
Add/Sub.     2p(p+1)     2p^2(p+1)   p+1        0         4p         4p^2
Multiply     p(p+1)      p^2(p+1)    0          0         2p         2p^2
Divide       p^2+1       p^3+1       0          0         p+1        p^2+1
Transfer     p(p+1)      0           p          0         p+2        0

5.3. Digital Filtering
Digital filtering is frequently used in speech and signal processing. The following discusses four parallel algorithms for recursive digital filtering. The basic operations in recursive filters are the computation of the sum of product terms, with output $y_m$ given by:
(5.5)
where p is the order of the filter and $a_k$, 1 ≤ k ≤ p, are the filter coefficients and $y_i = 0$ for i < 0. All four parallel algorithms solve equation (5.5) by breaking it down into the following recurrence relations.
$$y_m = y_m^{(p)} \qquad (5.6c)$$
Kung’s method (FIL1) is for a VLSI processor array, while Kogge’s (FIL2) and Kuck’s (FIL3, FIL4) methods are for SIMD machines.
5.3.1. Recursive Filtering for the VLSI Processor Array (FIL1)
Kung [KuLe,Kung80] has given systolic arrays to do both recursive and 
non-recursive filtering and has shown that these arrays are useful for both 
types of digital filtering. (In digital signal processing terminology, “recursive” 
filter typically refers to any filter that includes a recursive dependence of the 
output on previous outputs. A non-recursive filter is a filter whose output does 
not depend on previous outputs.) The non-recursive array was given as an 
example of a VLSI array in Section 3. The following is a description of a VLSI 
array to do recursive filtering.
Kung’s recursive filter algorithm computes $y_m$ by using one cell for each of the p recurrence equations of equation (5.6b). Figure 5.8 shows the linear array of p + 1 cells used to perform the computations. Each PE is the same as in the non-recursive filter algorithm, except that PE p is a dummy PE that reads the $R_y$ data from PE p-1 and routes this same data to $R_x$ in PE p-1. Figure 5.9 shows the data flow for the array in Figure 5.8.
Each cycle of the array consists of multiplying $R_y$ times $R_a$ and adding the product to $R_x$. This array can produce one $y_m$ every two cycles for a total of 2M cycles to produce all $y_m$’s for p ≤ m < M.
Figure 5.8. Systolic array to compute recursive filter for p=2.
Figure 5.9. Data flow for array in Figure 5.8.
5.3.2. SIMD Digital Recurrence Filter - Kogge (FIL2)
Kogge and Stone [KoSt73] have formulated an SIMD method for solving recurrence relations using recursive doubling. In this approach, the computation [...]
$$Y_i = \begin{bmatrix} y_{i-1} \\ \vdots \\ y_{i-p} \end{bmatrix}, \qquad A = \begin{bmatrix} a_1 & a_2 & \cdots & a_p \\ 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 1 & 0 \end{bmatrix}$$
Therefore, A is a p by p matrix, and Y is a p by 1 vector. This approach uses M/p PEs and requires an initialization process plus ⌈log(M/p)⌉ steps. Each step, however, consists of multiplication of a p by p matrix by a p by 1 matrix and the transfer of the resulting p by p matrix to a different PE. This method is efficient when p is small and when M/p PEs are available.
5.3.3. SIMD Digital Recurrence Filter — Kuck
5.3.3.1. Column Sweep Method (FIL3)
Kuck [Kuck77] presents two algorithms to solve equation (5.5). The first is the column sweep method. It requires M-1 PEs, one for each $y_i$, 1 ≤ i ≤ M-1, that is to be computed. Initially, $y_0$ is known. In step 1, $y_0$ is broadcast to all PEs. Each PE multiplies $y_0$ by the correct $a_k$ and adds it to SUM. SUM is a variable in each PE which contains the intermediate $y_m^{(k)}$ terms from equation (5.6b), and $a_k$, 0 < k ≤ p, are the filter coefficients, which have been precomputed and stored in each PE. After step 1, SUM in PE 0 contains $y_1$. Then $y_1$ is broadcast and the same is done for $y_1$ as was done with $y_0$. This continues until $y_{M-1}$ is found. This method requires M-1 steps. Each step consists of an addition, a multiplication, and a broadcast. This method is efficient when p ≈ M and M PEs are available.
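The column sweep can be illustrated with a small serial simulation. This is only a sketch under a stated assumption: equation (5.5) is taken here to be the pure recurrence $y_m = \sum_{k=1}^{p} a_k y_{m-k}$ with $y_0$ given and $y_i = 0$ for i < 0. The names `column_sweep` and `serial_recurrence` are hypothetical, and the inner loop stands in for work that the PEs would do simultaneously.

```python
def column_sweep(a, y0, M):
    """Column sweep for y_m = sum_{k=1..p} a[k-1] * y_{m-k}, with y_0 given.

    One simulated PE per output y_1 .. y_{M-1}. Each step broadcasts the most
    recently completed output, and every PE adds its own term to SUM.
    """
    p = len(a)
    SUM = [0.0] * M              # SUM[i] belongs to the PE that computes y_i
    y = [0.0] * M
    y[0] = y0
    for step in range(1, M):                 # M-1 steps in total
        broadcast = y[step - 1]              # value sent to all PEs this step
        for i in range(step, M):             # conceptually simultaneous
            k = i - (step - 1)               # this PE needs a_k * y_{i-k}
            if k <= p:
                SUM[i] += a[k - 1] * broadcast
        y[step] = SUM[step]                  # y_step is now complete
    return y


def serial_recurrence(a, y0, M):
    """Direct serial evaluation of the same recurrence, for comparison."""
    y = [y0]
    for m in range(1, M):
        y.append(sum(a[k - 1] * y[m - k]
                     for k in range(1, len(a) + 1) if m - k >= 0))
    return y


a = [0.5, -0.3, 0.2]             # made-up coefficients, p = 3
print(column_sweep(a, 1.0, 8))
print(serial_recurrence(a, 1.0, 8))
```

Each pass of the outer loop is one step of the method: one broadcast, then one multiplication and one addition in every PE that still needs the broadcast value.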
5.3.3.2. Product-Form Recurrence Method (FIL4)
Kuck’s second method [Kuck77] to solve equation (5.5) is the fastest method known for computing recurrences. The method requires at most $(2 + \log p)\log M - \tfrac{1}{2}(\log^2 p + \log p)$ steps. Each step consists of an addition and multiplication. The number of PEs used is at most $p^2 M/2 + O(pM)$ for p ≪ M. For large p, the number of PEs used is quite large. The following section compares the four parallel recursive filtering algorithms.
5.3.4. Summary of Parallel Recursive Filtering Algorithms
Table 5.5 is a summary of the four algorithms. Consider the problem of a signal with M=128 samples and a p=16 pole filter. Table 5.6 shows how many PEs (cells) and steps are needed by each algorithm. FIL3 and FIL4 are designed for recurrences where p ≈ M. For digital filtering, p ≪ M, which makes FIL3 and FIL4 impractical for filtering applications. The numbers of PEs and steps required by FIL4 are both upper bounds; therefore these numbers could be much smaller. FIL2 uses the least number of PEs and steps, but each step requires a 16 by 16 matrix multiply and a 16 by 16 matrix transfer. The matrix multiplication alone uses 256 scalar multiplications and 240 scalar additions. Therefore, FIL2 may be the slowest of the four.
FIL1 is the only algorithm whose number of PEs does not depend on M. It is also the only algorithm that can filter an arbitrary length signal. This is a desirable property for real time processing.
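The entries of Table 5.6 follow from the formulas of Table 5.5 by direct substitution. The short sketch below evaluates them for M=128 and p=16, assuming base-2 logarithms throughout; the function name `filter_costs` and the dictionary layout are illustrative only.

```python
import math


def filter_costs(p, M):
    """PE (cell) and cycle counts from Table 5.5 for a p-pole filter on M samples."""
    lg = math.log2
    return {
        "FIL1": {"pes": p + 1, "cycles": 2 * M},
        "FIL2": {"pes": M // p, "cycles": math.ceil(lg(M / p))},
        "FIL3": {"pes": M - 1, "cycles": M - 1},
        # Upper bounds only; the O(pM) term of the PE count is left out.
        "FIL4": {"pes": p * p * M // 2,
                 "cycles": (2 + lg(p)) * lg(M) - (lg(p) ** 2 + lg(p)) / 2},
    }


costs = filter_costs(p=16, M=128)
for name, c in costs.items():
    print(name, c["pes"], c["cycles"])
```

For FIL4 the bound works out to (2 + 4)(7) - (16 + 4)/2 = 32 steps, and $p^2 M/2$ = 16,384 PEs.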
Table 5.5. Summary of parallel recursive filtering algorithms.

        PEs (cells)           Operations per cycle         Cycles to compute y_m, p ≤ m < M
FIL1    p + 1                 1 scalar add                 2M
                              1 scalar mult
                              2 shifts
FIL2    M/p                   1 p by p matrix mult.        ⌈log2(M/p)⌉
                              1 p by p matrix transfer
FIL3    M-1                   1 scalar add                 M-1
                              1 scalar mult
                              1 broadcast
FIL4    ≤ p^2 M/2 + O(pM),    1 scalar add                 ≤ (2 + log p) log M - (log^2 p + log p)/2
        p << M                1 scalar mult

Table 5.6. PEs and cycles needed to filter an M=128 sample signal with a p=16 pole recursive filter.

        PEs (cells)           Operations per cycle         Cycles to compute y_m, p ≤ m < M
FIL1    17                    1 scalar add                 256
                              1 scalar mult
                              2 shifts
FIL2    8                     1 p by p matrix mult.        3 + overhead
                              1 p by p matrix transfer
FIL3    127                   1 scalar add                 127
                              1 scalar mult
                              1 broadcast
FIL4    ≤ 16,384 + O(2,048)   1 scalar add                 ≤ 32
                              1 scalar mult

5.4. Dynamic Time Warping
As discussed in Section 4.6.2, dynamic time warping (DTW) is a common but time consuming method used in speech recognition. Its purpose is to compare each known utterance in the vocabulary to the unknown input utterance. The result of each comparison is a distance score; the lower the score, the better the match. Myers et al. [Myer80] report that dynamic time warping uses from 50 to 90% of the computation time in word recognition on a serial computer. About 80% of the dynamic time warp calculation time is spent computing the local distances between feature vectors. This makes dynamic time warping a prime target when trying to reduce the total recognition time. One system mentioned in the literature to do dynamic time warping on a VLSI processor array is the high speed array computer (HSAC) by Burr et al. [BAW81,WBA83,BAW84]. The following section discusses the HSAC, which uses a full I by I grid of cells, where I is the number of frames in each utterance. The section after that presents a reduced array which requires fewer cells, but still exploits the parallelism of the DTW task.
5.4.1. High Speed Array Computer — Full Array
The HSAC presented in [BAW81] uses an I by I grid of cells to compare several vocabulary templates to the input template simultaneously. Figure 5.10 shows a typical cell, which has two serial input lines and two serial output lines. The reference feature vector $a_i$ enters the cell from the “bottom” in a bit serial manner as the test feature vector $b_j$ enters from the “left” side. The cell calculates the local distance d between them, and outputs $a_i$ bit serially out of the top of the cell to the cell “above” it, while it outputs $b_j$ to the cell to the right. The calculation of the accumulated distance, g, overlaps with the transfer of $a_i$ and $b_j$. Following the calculation of g, g and d are moved bit serially to both the cells above and to the right over the same lines that transferred the feature vectors. Overlapping the transfers with the calculations helps reduce the overhead of the bit serial transfers. All cells on an x+y=k diagonal (for k equal to some constant) execute the same instructions at the same time. For example, cells (3,1), (2,2), and (1,3) perform the same instructions simultaneously; at the same time cells (4,1), (3,2), (2,3), and (1,4) execute the same instructions, which are possibly different from the (3,1), (2,2), (1,3) instructions. This allows one diagonal of cells to compute their accumulated distances while an adjacent diagonal is receiving new feature vectors, thus overlapping transfers and calculations. Figure 5.11 shows an example of how sixteen of the HSAC cells are connected in a four by four grid. The unknown feature vectors enter the grid on the left, pass from cell to cell unchanged, and emerge on the right. The reference vectors enter from the bottom and pass to the top.
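The diagonal ordering described above can be sketched in software. The example below is an illustration only: it assumes the simple recurrence g(i,j) = d(i,j) + min(g(i-1,j), g(i,j-1), g(i-1,j-1)), which may differ from the local path constraints of Section 4.6.2, and the names `cell`, `dtw_rowwise`, and `dtw_wavefront` are hypothetical. It checks that evaluating the grid one anti-diagonal at a time gives the same score as a row-by-row evaluation.

```python
INF = float("inf")


def cell(g, i, j, d):
    """Accumulated distance for cell (i, j), given its local distance d."""
    if i == 0 and j == 0:
        return d
    best = min(g[i - 1][j] if i > 0 else INF,
               g[i][j - 1] if j > 0 else INF,
               g[i - 1][j - 1] if i > 0 and j > 0 else INF)
    return d + best


def dtw_rowwise(A, B, dist):
    """Serial reference: fill the grid row by row."""
    g = [[INF] * len(B) for _ in A]
    for i in range(len(A)):
        for j in range(len(B)):
            g[i][j] = cell(g, i, j, dist(A[i], B[j]))
    return g[-1][-1]


def dtw_wavefront(A, B, dist):
    """Same recurrence, one anti-diagonal (i + j = k) at a time.

    All cells on a diagonal depend only on earlier diagonals, so an array of
    cells could evaluate them simultaneously.
    """
    I, J = len(A), len(B)
    g = [[INF] * J for _ in range(I)]
    waves = 0
    for k in range(I + J - 1):
        for i in range(max(0, k - J + 1), min(I, k + 1)):
            j = k - i
            g[i][j] = cell(g, i, j, dist(A[i], B[j]))
        waves += 1
    return g[-1][-1], waves


A = [1, 3, 4, 9]                 # made-up one-dimensional "feature vectors"
B = [1, 2, 5, 8]
score, waves = dtw_wavefront(A, B, lambda u, v: abs(u - v))
print(score, waves, dtw_rowwise(A, B, lambda u, v: abs(u - v)))
```

For two templates of I frames each, the wavefront passes over 2I-1 diagonals.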
Figure 5.10. One cell in HSAC.
Figure 5.11. High Speed Array Computer used to compute dynamic time warp.
To compare reference template $A = \{a_1, a_2, \ldots, a_I\}$ to test template $B = \{b_1, b_2, \ldots, b_I\}$, $a_1$ enters cell (1,1) via $R_1$ while $b_1$ enters via $U_1$. While finding the local distance, $a_1$ is shifted to cell (1,2) while $b_1$ is shifted to cell (2,1). $a_2$ enters cell (2,1) via $R_2$ and $b_2$ enters cell (1,2) via $U_2$ at the same time. All cells on this diagonal find the local distance between $a_1, b_2$ and $a_2, b_1$ in parallel while shifting $a_1$ and $a_2$ to cells (1,3) and (2,2) respectively and shifting $b_1$ and $b_2$ to cells (3,1) and (2,2). This continues until cell (I,I) computes g(I,I) from vectors $a_I$ and $b_I$. g(I,I) is the optimal distance for the templates A and B. Figure 5.12 shows the data flow for I=4. In general $a_i$ ($b_i$) enters at $R_i$ ($U_i$) one loop after $a_{i-1}$ ($b_{i-1}$) enters at $R_{i-1}$ ($U_{i-1}$). Cell (i,j) computes d(i,j) and g(i,j), with computation progressing on a diagonal wave from the lower left to the upper right of the array. For a W-word vocabulary, the comparison of words X and Y can start one loop after words X-1 and Y-1 are started by entering $a_i^{X}$ ($b_i^{Y}$) in $R_i$ ($U_i$) one loop after $a_i^{X-1}$ ($b_i^{Y-1}$) enters $R_i$ ($U_i$). For a W word vocabulary with I frames per word, the HSAC requires 2I-1 loops to compute the first comparison, and one loop for each subsequent comparison, for a total time of 2I+W-2 loops. HSAC needs $I^2$ cells if an adjustment window is not used. If an adjustment window is used, the cells in the upper left and lower right corners can be omitted, leaving an r cell wide “warping path” from cell (1,1) to cell (I,I). Only $2Ir - I - r^2 + r$ cells are needed, but the same number of loops are required. For I=40, the HSAC requires 1600 cells if no adjustment window is used; if an adjustment window of r=8 is used, 544 cells.
544 is a large number of cells. The next section discusses reduced arrays, which can use fewer cells.
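The cell and loop counts quoted for I=40 can be checked by brute force. The sketch below assumes that the adjustment window keeps exactly the cells (i, j) with |i - j| < r; that reading is an assumption, chosen because it reproduces the 1600 and 544 figures quoted for I=40. The names `window_cells` and `total_loops` are illustrative.

```python
def window_cells(I, r):
    """Count grid cells (i, j), 1 <= i, j <= I, kept when |i - j| < r."""
    return sum(1 for i in range(1, I + 1)
                 for j in range(1, I + 1) if abs(i - j) < r)


def total_loops(I, W):
    """Loops for W comparisons: 2I-1 for the first, one for each one after."""
    return (2 * I - 1) + (W - 1)


print(window_cells(40, 40))      # no effective window: full I by I grid
print(window_cells(40, 8))       # adjustment window of r = 8
print(total_loops(40, 100))      # hypothetical 100-word vocabulary
```

Under this assumption the count has the closed form I + 2[(r-1)I - r(r-1)/2], which for I=40 and r=8 is 40 + 2(280 - 28) = 544.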
5.4.2. High Speed Array Computer - Reduced Arrays
Implementing the HSAC with a full array of cells requires a large number (>500) of cells and is dependent on the problem size, since the array must have as many rows and columns as unknown frames in the utterance. West, Ackland, and Burr [WBA83,BAW84] present the “reduced” array, which overcomes these problems. The reduced array uses enough cells to compute an integral number of diagonals in parallel. Figure 5.13 shows a reduced array with three diagonals. The large square represents the I by I grid of a full array. Three pairs of vectors are being compared simultaneously. The diagonals labeled A, B, and C are the three diagonals of the reduced array which are doing the comparison. When the computations for the current diagonal are complete, the A diagonal will move to the B diagonal, and the B diagonal will move to the C diagonal. The C diagonal would move to the D diagonal in a full array, but there is no D diagonal in the reduced array. Instead, the C diagonal moves to the A diagonal in the reduced array.
*A loop, as used here, is defined as the time after vector $a_x$ enters the grid and before vector $a_{x+1}$ enters.
Figure 5.13. Virtual movement of reduced array through I by I grid.
The reduced array is therefore sweeping the matrix space of the I by I grid 
as shown in Figure 5.14. The advantages of the reduced array are:
1) Fewer cells are used.
2) The number of diagonals used is independent of the problem size.
3) The number of cells can be traded off for performance.
The disadvantages are:
1) Some cells are idle during the computation as shown in Figure 5.14.
2) Slightly more complex hardware is needed to recirculate the data from the
right edge of the reduced array to the left.
3) Fewer pairs of utterances can be compared at a time.
The smallest size a reduced array can be is one diagonal. If no adjustment window is used, the diagonal will have I cells. If an adjustment window is used, r cells are needed. The one diagonal reduced array can compute one comparison in 2I loops.
This HSAC cannot use pruning, since pruning aborts a comparison if at some time during the comparison it is apparent the current comparison will not be the closest match. Once this array starts a comparison it is difficult to [...]
Figure 5.14. Virtual propagation of diagonal reduced array (from [BAW84]).
5.5. Summary
This section presented parallel algorithms for autocorrelation, LPC 
analysis, dynamic time warping, and digital filtering. One of the filtering algo­
rithms, the three autocorrelation algorithms, and the LPC algorithm are for 
the SIMD machine. Three of the dynamic time warping algorithms and the 
rest of the digital filtering algorithms are for VLSI arrays. Several new algo­
rithms for speech processing for both SIMD machines and VLSI processor 
arrays are presented in the next chapter.
6. NEW PARALLEL ALGORITHMS FOR SPEECH PROCESSING
The following are several new algorithms for speech processing on SIMD 
machines and VLSI processor arrays. This chapter presents four parallel algo­
rithms for digital filtering, one for autocorrelation analysis, two for linear time 
warping, along with three algorithms for dynamic time warping. Each section 
presents an algorithm and then discusses the machine requirements and speed 
up obtained by the algorithm.
6.1. Digital Filtering
The basic operations in digital filtering are the computation of sum of products terms, with output $y_m$ given by
$$y_m = \sum_{k=1}^{p} a_k\,y_{m-k} + \sum_{k=0}^{q} b_k\,x_{m-k}, \qquad p \le m < M \qquad (6.1)$$
where $x_m$ is the input to the filter at sample m, the $a_k$’s and $b_k$’s are the filter coefficients, and M is the number of samples in the signal to be filtered. The first sum in (6.1) represents a recursive filter. (In digital signal processing terminology, “recursive” filter typically refers to any filter that includes a recursive dependence of the output on previous outputs, so the filter in (6.1) is a recursive filter. To make a distinction between the recursive and non-recursive portions of the computation, we will refer to (6.1) as a “generalized” recursive filter, and will use the term recursive filter to refer to a filter having only a recursive dependence.) In the recursive filter, the dependence of output $y_m$ on the previous $y_{m-k}$ values, 1 ≤ k ≤ p, takes the form of a linear recurrence relation. The second sum in (6.1) represents a non-recursive filter, in which the current output value depends only on the current and q previous input values. In digital filtering applications, non-recursive filters are used to realize finite impulse response (FIR) filters, and it is common for q to be as large as 250 [RaGo75]. Generalized recursive filters are used to realize infinite impulse response (IIR) filters (e.g., Butterworth or Chebyshev filters, or filters for linear prediction [Makh75,Si80b]), with p < 20 [RaGo75].
Real-time applications often use digital filtering as a single processing step 
in tasks requiring other extensive computations. It is therefore desirable to 
consider fast implementations. The computations required for digital filtering 
are also characteristic of the general class of problems involving linear systems 
and linear recurrences. Some work in the use of parallel systems for solution of 
such problems has been reported. Kung [Kung80] presents systolic array algo­
rithms to implement the two basic types of filters, non-recursive and recursive, 
that were described in Sections 3 and 5.3. Because of the recursive nature of 
the computation, the systolic array appears to be a natural structure for imple­
menting the digital filter. Kogge and Stone [KoSt73] have formulated an SIMD 
method for solving recurrence relations using recursive doubling. Kuck 
[Kuck77] presented two SIMD algorithms for solving recurrence relations. The 
first used the column sweep method, and the second used a product-form 
recurrence method. All these approaches were discussed in Section 5.3.
This section presents five parallel algorithms to perform digital filtering. Four of these algorithms originally appeared in [YoSi81]. The first (VLSI1) is a simple extension of Kung’s systolic array algorithms, showing how the non-recursive and recursive systolic arrays can be combined in a straightforward way. The second (SIMD1) is an SIMD algorithm derived from the VLSI1 approach. The third (VLSI2) is a VLSI processor array algorithm derived from the SIMD1 algorithm. The fourth (SIMD2) is an SIMD algorithm that assumes more powerful processors and more flexible inter-PE communications than the VLSI-based algorithms. The fifth algorithm presented (SIMD3) is an extension of the fourth algorithm to allow problems of varying sizes (number of coefficients) to be run on a fixed number of PEs. Together, the fourth and fifth algorithms provide a general method for dealing with recurrence relations in an SIMD system.
6.1.1. VLSI Processor Array Algorithm - VLSI1
The first VLSI processor algorithm presented here combines the non-recursive (FIR) and recursive filter systolic algorithms covered in Sections 3 and 5.3.1 into a generalized recursive filter algorithm. It is based on a linear array of cells, with each cell holding one filter coefficient and data flowing in opposite directions in two pipelines. One pipe circulates the input data ($x_m$ values) while the other passes partial results in the $y_m$ computations. The generalized digital filter of equation (6.1) can be computed by combining the two algorithms discussed in Sections 3 and 5.3.1. The recurrence relations used for the generalized digital filter are:
$$y_m^{(0)} = 0$$
$$y_m^{(k+1)} = y_m^{(k)} + b_{q-k}\,x_{m-q+k}, \qquad 0 \le k \le q$$
$$y_m^{(k+1)} = y_m^{(k)} + a_n\,y_{m-n}, \qquad q+1 \le k \le p+q \qquad (6.2)$$
$$y_m = y_m^{(p+q+1)}$$
Figure 6.1a shows that the recurrences can be evaluated by pipelining the $x_m$ and $y_m^{(k)}$ values through p+q+2 linearly connected cells. Input $x_m$ feeds into $R_x$ in cell q and output $y_m$ appears in $R_y$ in cell p+q. Figure 6.1b is the data flow diagram for the linear array. Each column of the data flow diagram represents the contents of each register in each cell after a given cycle. Moving from left to right shows how the data changes from one cycle to the next. The arrows show where $R_x$ and $R_y$ are transferred on the next cycle. As in the component algorithms, only half of the cells are active during a given cycle. Before the first cycle, the correct coefficient is loaded into each cell, and the $R_x$ and $R_y$ registers are set to zero. The first q cycles shift $x_0$ from the input line in cell q to $R_x$ in cell 0, $x_1$ from cell q to $R_x$ in cell 2, ..., and $x_{\lfloor q/2 \rfloor}$ to $R_x$ in cell $2\lfloor q/2 \rfloor$ (i.e., initializing the array by placing the first $\lfloor q/2 \rfloor + 1$ input values in every other cell, starting with $x_0$ in cell 0). After p+q+1 more cycles, every two cycles of the VLSI array compute one output value, where during each cycle, the operations performed are the simultaneous transfer of data in the two pipes, one addition, one multiplication, and one assignment.
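The term-by-term accumulation that the cells perform can be checked against equation (6.1) with a short serial sketch. Several things here are assumptions rather than statements from the text: zero initial conditions ($x_m = y_m = 0$ for m < 0), the pairing of coefficient $b_{q-k}$ with $x_{m-q+k}$ in the non-recursive steps, and the index mapping n = p+q+1-k in the recursive steps. The function names are hypothetical.

```python
def filter_direct(a, b, x):
    """Equation (6.1) with zero initial conditions; a[k-1] is a_k, b[k] is b_k."""
    p, q = len(a), len(b) - 1
    y = []
    for m in range(len(x)):
        acc = 0.0
        for k in range(1, p + 1):
            if m - k >= 0:
                acc += a[k - 1] * y[m - k]
        for k in range(q + 1):
            if m - k >= 0:
                acc += b[k] * x[m - k]
        y.append(acc)
    return y


def filter_by_recurrence(a, b, x):
    """Accumulate each y_m one term per step, in pipeline order."""
    p, q = len(a), len(b) - 1
    y = []
    for m in range(len(x)):
        partial = 0.0                                # y_m^(0)
        for k in range(p + q + 1):                   # step k yields y_m^(k+1)
            if k <= q:                               # non-recursive terms
                idx = m - q + k
                partial += b[q - k] * (x[idx] if idx >= 0 else 0.0)
            else:                                    # recursive terms
                n = p + q + 1 - k                    # assumed index mapping
                idx = m - n
                partial += a[n - 1] * (y[idx] if idx >= 0 else 0.0)
        y.append(partial)                            # y_m = y_m^(p+q+1)
    return y


a = [0.5, -0.25]                  # made-up coefficients, p = 2
b = [1.0, 0.5, 0.25]              # q = 2
x = [1.0, 2.0, 0.0, -1.0, 3.0]
print(filter_direct(a, b, x))
print(filter_by_recurrence(a, b, x))
```

With this ordering the last term added to each output is $a_1 y_{m-1}$, so $y_{m-1}$ is not needed until the final step of the computation of $y_m$.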
Figure 6.1. a) VLSI processor array to compute generalized digital filter for p=2, q=2. b) Data flow diagram for (a).
6.1.2. An Improved Parallel Filtering Algorithm - SIMD1 and VLSI2
A major drawback to the VLSI processor array algorithm is that only half of the cells are active during a given cycle, so that a new $y_m$ value is computed every two cycles of the VLSI array. This problem can be overcome on the SIMD machine by using a data broadcast. A broadcast sends a data item in one PE to a specified set of PEs. A broadcast may be implemented either by having the control unit broadcast the data item to all the desired PEs (e.g., Illiac IV [Barn68,Bouk72]), or by using the interconnection network to transfer the data item to the desired PEs (e.g., Cube [SiMc81b] or ADM [SiMc81a] networks). See Section 2.5 for more information on broadcasts.
In the first SIMD algorithm (SIMD1), each PE holds one filter coefficient, as in the generalized VLSI array algorithm. The upward flowing pipeline from the VLSI processor array structure, which was used to disseminate the input x values and the completely computed y values, is replaced by two broadcasts of data. One broadcasts the current x value, and the other, the newly computed y value. By making this replacement, every PE is active during every cycle. Figure 6.2 shows the data flow diagram using this technique. As each partial y shifts into a given PE, the correct coefficient and x (in PEs 0 through p) or y (in PEs p+1 through p+q+1) are there to meet it. Moreover, if a given PE receives an x as a result of the broadcast, it will not receive a y as a result of the broadcast, and vice versa. If the interconnection network (rather than the control unit) performs the broadcasts, it may be possible to do the two broadcasts to disjoint sets of PEs simultaneously [SiMc81a,SiMc81b]. Whether this is possible will depend on factors such as the type of interconnection network used, the actual sets to which the data items are being broadcast, and the way in which the x values enter the system. The data flow of the partial results (the $y_m^{(k+1)}$ values) and the placement of one coefficient per PE is the same as the VLSI processor array algorithm. The replacement of the upward pipe by two broadcasts simplifies the synchronization problems, and allows all PEs to be active at every step. Every cycle of the SIMD1 algorithm produces one output value; the operations performed in one cycle are the two possibly simultaneous broadcasts, one data transfer of partial results (the remaining pipe), one addition, one multiplication, and one assignment. This cycle is clearly longer than the cycle in the VLSI array; however, an output is produced every cycle instead of every two cycles.
The principal attributes of the SIMD1 algorithm above are that
1) each PE holds one filter coefficient and always computes the same term (i.e., superscript k) of the recurrence for the $y_m$ values,
2) a pipeline similar to the VLSI array pipeline passes partial results from one PE to the next, and
3) broadcasts are used to disseminate new x and y values to the PEs in which they are needed.
The use of broadcasts is the only architectural difference between the SIMD1 algorithm and the VLSI1 algorithm. If the VLSI processor array can broadcast data, it can execute the same SIMD1 filtering algorithm as the SIMD machine. Therefore the second VLSI algorithm (VLSI2) is the same as the SIMD1 algorithm. The major differences are:
1) The broadcasts in the VLSI2 algorithm will occur simultaneously with the shifts, while the SIMD1 broadcasts and shifts must be performed sequentially.
2) The broadcast time in the VLSI2 algorithm should be much shorter than in the SIMD1 algorithm, since the VLSI2 algorithm uses a fixed interconnection network.
Section 6.1.5 will compare these algorithms.
6.1.3. An Improved SIMD Algorithm — SIMD2
The SIMD1 algorithm can be improved by arranging the data so that partial results ($y_m^{(k+1)}$ values) do not have to be shifted from one PE to another. In the SIMD2 algorithm, the same PE performs all the steps needed to compute a given $y_m$, as shown in Figure 6.3. Each PE holds all of the filter coefficients, and uses an indexing operation to select which coefficient to use at a given step of the algorithm. Partial results accumulate within the PEs, rather than being pipelined through them. The data transfers required are two (possibly simultaneous) broadcasts, one of the current input signal value, and one of a completed output value. All PEs are always active, and each cycle of the algorithm completes one output, where the computations during a cycle are one indexing operation to select a filter coefficient, two broadcasts, one addition, one multiplication, and one assignment. Figure 6.4 shows that for this SIMD2 algorithm, the coefficients are arranged as a vector in each PE. This arrangement allows each PE to use the same index into the vector to access the correct coefficient for the cycle: at cycle m, each PE accesses its COEF[m mod (p+q+1)], where p+q+1 = N = the number of PEs. The SIMD2 algorithm works on the computation of p+q+1 $y_m$’s simultaneously by having each PE at a different stage in the computation of its own $y_m$. The algorithm is again based on the recurrence relations in equation (6.2). For its own $y_m$, each PE is computing $y_m^{(k+1)}$ for a different value of k, 0 ≤ k ≤ p+q. The data is arranged so that if PE i completes $y_m$ after cycle t, then PE (i+1) mod (p+q+1) will complete $y_{m+1}$ after cycle t+1. $y_m$ is used in computing $y_{m+j}$ for 1 ≤ j ≤ p, so after PE i computes $y_m$, its value is broadcast to PEs (i+j) mod (p+q+1), for 1 ≤ j ≤ p. In general, PE m mod (p+q+1) computes output $y_m$, and $y_m$ is completed in cycle m.
Figure 6.3. Data flow diagram for improved SIMD2 generalized digital filtering algorithm for p=2, q=2. The double boxes indicate the start of a new $y_m$ computation.
Figure 6.5 gives the SIMD2 algorithm that executes simultaneously in all PEs. Each PE will have its own values for the program variables. Initialization is handled by broadcasting 0 for the value $x_m$ during the first q cycles of the algorithm. Combined with the initialization of SUM to 0, this ensures that $y_m = 0$ for m < 0. At cycle q+1 (i.e., m = -p), $x_0$ is broadcast, followed by $x_1$ on the next cycle, etc. The computation of $y_0$ is completed during the cycle when m = 0, followed by completion of $y_1$ when m = 1, etc. The algorithm assumes that during each cycle, the current input value x is broadcast as variable x from the control unit, and the interconnection network broadcasts the newly completed y value from the PE in which it was computed. For simplicity, the algorithm is written so that all PEs receive the broadcast y and x values, and each PE selects which one it will use in accumulating the next term in its sum. To perform this selection, each PE holds a vector of flags in which FLAG[i] is set to one if COEF[i] in that PE is an “a” coefficient, and set to zero if it is a “b” coefficient. By determining whether its COEF[m mod (p+q+1)] value for cycle m is an “a” or “b” coefficient, each PE can select whether it is to use the newly received y value (with an “a” coefficient) or the input x value (with a “b” coefficient) for cycle m.
PE       COEF[0]    COEF[1]    COEF[2]    ....   COEF[p+q-1]   COEF[p+q]
0        a_1        b_q        b_{q-1}    ....   b_{q-2}       a_2
1        a_2        a_1        b_q        ....   b_{q-1}       a_3
2        a_3        a_2        a_1        ....   b_q           a_4
...
p        b_0        a_p        a_{p-1}    ....   a_{p-2}       b_1
p+1      b_1        b_0        a_p        ....   a_{p-1}       b_2
...
q+p-1    b_{q-1}    b_{q-2}    b_{q-3}    ....   b_{q-4}       b_q
q+p      b_q        b_{q-1}    b_{q-2}    ....   b_{q-3}       a_1
Figure 6.4. Skewed coefficient storage for SIMD2 algorithm.
/*
   ADDR     Address of PE (e.g., ADDR = 0 in PE 0)
   DTRin    Data Transfer Register input to interconnection network
   DTRout   Data Transfer Register output from interconnection network
   coef[]   Vector of coefficients (see Figure 6.4)
   flag[i]  Equals 1 if coef[i] is an “a” coefficient
   sum      Contains partially computed y_m
   m        Index of y value to be completed in this cycle
            (sum = y_m in PE m mod (p+q+1))
*/
sum ← 0
FOR m ← -(p+q) TO M-1 DO
    /* select the PE containing the newly */
    /* completed y value: y_{m-1} */
    BROADCAST sum FROM PE (m-1) mod (p+q+1) TO DTRout
    WHERE ADDR = (m-1) mod (p+q+1) DO
        sum ← 0                  /* start a new sum in that PE */
    ENDWHERE
    WHERE flag[m mod (p+q+1)] = 1 DO     /* In each PE, select to use */
        tmp ← DTRout                     /* either the broadcast y value */
    ELSEWHERE
        tmp ← x                          /* or the new x value, x_{m+p} */
    ENDWHERE
    sum ← sum + tmp * coef[m mod (p+q+1)]
Figure 6.5. SIMD2 generalized digital filtering algorithm.
6.1.4. SIMD Solution of General Linear Recurrence Equations
The approach presented in the SIMD2 algorithm for digital filtering can be
applied to the solution of general linear recurrence equations of order p: given
y_i for 0 ≤ i < p, solve for y_m for p ≤ m < M, where

$$ y_m = \sum_{k=1}^{p} a_{m,k}\, y_{m-k} . $$
The SIMD algorithm to handle the recursive dependence uses N = p PEs, with
PE m mod p computing y_m. This PE completes computation of y_m at cycle m,
then broadcasts its completed y_m value to PEs (m+j) mod p, for 1 ≤ j ≤ p.
PE i, 0 ≤ i < N, will hold the coefficient sets (a_{m,k}'s) for all m for which
i = m mod p. The coefficient sets are skewed in a manner analogous to that in
Figure 6.4. In particular, let z be such that z mod p = 0 (i.e., PE 0 computes
y_z). Figure 6.6 shows how the coefficient sets a_{z+j,k} for 0 ≤ j < p are stored.
At cycle m of the computation, each PE will access its COEF[m mod p]. For
example, at cycle m, PE m mod p is completing computation of y_m. From
Figure 6.6, this PE accesses a_{m,1}, which is the coefficient used with y_{m-1}, and
which is the last term in the recurrence to be accumulated in computing y_m in
the SIMD algorithm. At the same time, each other PE is accessing the
appropriate coefficient for its computation. Depending on the form of the B_m's,
it may be desirable and possible to use additional PEs to compute these terms.
(This is the case in the digital filtering algorithm, when B_m is considered to be
the (q+1)-term non-recursive sum in each y_m.) This general method will reduce
the number of multiplications and additions in solving an order p, M-point
recurrence from p(M-p) in the serial algorithm to M + p in the p-PE SIMD
method. The overhead in the SIMD algorithm is M+p broadcasts. The
broadcast-based algorithm for digital filtering therefore provides an efficient
general method for solving linear recurrence equations on an SIMD machine.
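PE     COEF[0]          COEF[1]          COEF[2]          ...   COEF[p-1]
0      a_{z,1}          a_{z,p}          a_{z,p-1}        ...   a_{z,2}
1      a_{z+1,2}        a_{z+1,1}        a_{z+1,p}        ...   a_{z+1,3}
2      a_{z+2,3}        a_{z+2,2}        a_{z+2,1}        ...   a_{z+2,4}
...
p-2    a_{z+p-2,p-1}    a_{z+p-2,p-2}    a_{z+p-2,p-3}    ...   a_{z+p-2,p}
p-1    a_{z+p-1,p}      a_{z+p-1,p-1}    a_{z+p-1,p-2}    ...   a_{z+p-1,1}
[Figure 6.6: skewed coefficient storage for the linear recurrence algorithm]
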
6.1.5. Comparison of VLSI Processor Array and SIMD Algorithms
Table 6.1 shows the times for the serial and parallel generalized digital
filtering algorithms. (The “Preem” entry will be discussed later in Section 6.1.8.)
The parallel algorithms can be compared in three ways:
1) total time to compute one y_m,
2) number of y_m's computed per unit time (throughput) (the throughput can
also be considered by measuring the time between successive y_m's), and
3) speedup over the corresponding serial algorithm.
The times considered are for the steady state operation of the algorithms.
Although the algorithms require some initialization steps (for example, to
distribute the first [q/2]+1 x's in the VLSI processor array algorithm), most of
the processing is in the steady state operation.
The time to compute one y_m value is the time from the beginning of the
computation of y_m until the time that y_m is available as an output. In the
VLSI processor array algorithms and SIMD1 algorithm, computation of each y_m
starts with the calculation of the b_q x_{m-q} term in PE/cell 0 and completes on
the inclusion of the a_1 y_{m-1} term in the sum of PE/cell p+q. (In the VLSI1
algorithm, y_m is available at this point, or access to y_m may be delayed by one
array cycle, until y_m arrives at the output line in the dummy cell.) For all of
these algorithms, the time to compute y_m is the time to move, via the
algorithm, from PE/cell 0 to PE/cell p+q, comprising p+q+1 algorithm cycles.
The number of arithmetic steps to compute one y is therefore the same as in
the serial algorithm. The VLSI processor array algorithms have an overhead of
p+q+1 shifts and the SIMD1 algorithm has an overhead of p+q+1 shifts and
2(p+q+1) broadcasts. (This section assumes that the two SIMD broadcasts do
not occur simultaneously.) The VLSI2 algorithm has p+q+1 shifts and
broadcasts, assuming the two broadcasts can occur simultaneously. In the SIMD2
algorithm, the time to compute one y is the time for a single PE to perform the
arithmetic operations (i.e., the serial time) plus the time for p+q+1 broadcasts
of x values and p+q+1 broadcasts of completed y values. As in the SIMD1
algorithm, broadcasts of x_{m+1} through x_{m+q} and of y_{m-q-p} through y_{m-p-1}
contribute to the time to compute y_m, even though they are not used in the y_m
calculations.
Table 6.1. Execution times for serial, VLSI, and SIMD digital filtering
algorithms.


















P+q + 1 P+q+1 0
M(p+q + l) M(p + q + l) 0
P+q+1 p+q+1 p+q+1
2(M-l) + p+q + l 2(M-l) + p+q + l 2(M-l) + p+q+l
p+q+i
M + p+q 
p+q + 1 
M + p + q 
p+q + 1 
M + p + q 
1
p+q + 1 
M + p +q 
p + q + 1 
M + p+ q 
p +q+ 1 
M + p+q 
1
p+q + 1 
M + p + q 
p + q + 1 









M + p+q 2(p + q + l)/3
2(p+q + l)
2(M + p + q) 2(p + q + l)/(2 + 3t) 
2(p+q +1)
2(M + p + q) (p+q + l)/(l+t)
0 M
For all the parallel algorithms, the time to compute M output values is the
time to compute one y value plus the time to compute (M-1) subsequent y
values. The latter time is obtained by considering the time between successive
y values. The time between successive y's in the VLSI1 array is two additions,
multiplications, and shifts since one y is computed every two cycles. The
VLSI2 algorithm takes only one addition, multiplication, and shift/broadcast.
The SIMD algorithms, on the other hand, do one addition, one multiplication,
and either two broadcasts and one shift for SIMD1 or two broadcasts for
SIMD2 between successive y values, since they compute one y every cycle.
Depending on the SIMD broadcast versus VLSI processor array shift time, the
second SIMD algorithm may have a greater complete throughput.
The speedup of an algorithm is (serial time/parallel time) [Kuck77].
Assume that additions and multiplications require one time unit on all
machines, and data transfers (shifts or broadcasts) require one time unit on the
VLSI processor array and t units on the SIMD machine. Also assume shifts
and broadcasts occur simultaneously on the VLSI processor array and
sequentially on the SIMD machine. The value of t will depend on a number of
factors, including implementation details of the VLSI and SIMD machines. Table
6.1 shows the speedups for the parallel algorithms, assuming that M >> p+q.
If t = 2, SIMD2 will have the same speedup as VLSI1. If t = 1/2, SIMD2 will
match the VLSI2 algorithm (SIMD1 matches it at t = 1/3). If a multistage
interconnection network such as the multistage Cube [SiMc81b] or Augmented
Data Manipulator [SiMc81a] performs the broadcasts, it is unlikely that t < 2.
Unless the broadcasts can be performed simultaneously, the speedup for the
systolic array is significantly greater than for the SIMD algorithm. However,
smaller values for t may be feasible. If the control unit performs the broadcasts,
then the systolic and SIMD algorithms may have comparable speedups.
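The steady-state speedups can be tabulated with a short calculation. The Python sketch below is only an illustration under the unit-cost assumptions stated above (one time unit per addition or multiplication, one unit per VLSI transfer, t units per SIMD transfer, and M much larger than p+q); the function name and dictionary layout are hypothetical. Under these assumptions SIMD2 matches VLSI1 at t = 2 and matches VLSI2 at t = 1/2.

```python
from fractions import Fraction


def steady_state_speedups(p, q, t):
    """Speedup over the serial algorithm, per output value, for large M."""
    c = p + q + 1
    serial = 2 * c                      # one addition and one multiplication per term
    time_between_outputs = {
        "VLSI1": 2 * (1 + 1 + 1),       # add, multiply, shift; one y every two cycles
        "VLSI2": 1 + 1 + 1,             # add, multiply, simultaneous shift/broadcast
        "SIMD1": 1 + 1 + 3 * t,         # add, multiply, two broadcasts and one shift
        "SIMD2": 1 + 1 + 2 * t,         # add, multiply, two broadcasts
    }
    return {name: Fraction(serial) / Fraction(cost)
            for name, cost in time_between_outputs.items()}
```

For example, `steady_state_speedups(2, 2, Fraction(2))` gives the same value for SIMD2 and VLSI1.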
6.1.6. Varying the Problem Size on an SIMD Machine
The VLSI processor array and SIMD algorithms can also be compared 
with respect to the ease with which the machine-size/problem-size relationship 
can be changed. In particular, assume the above techniques have been used to 
implement an order p+q digital filter. Consider the impact of deciding to use
a higher order filter. Let the new filter have p'+q'+1 coefficients, where
p'+q'+1 > p+q+1. With some modifications, the SIMD2 algorithm can
implement a filter having p'+q'+1 coefficients with fewer than p'+q'+1
PEs. Figures 6.7 and 6.8 show the data allocation diagrams for two different
problem sizes. Case A is for N = p+q+1 and case B is for
N < p'+q'+1. Each rectangle in the diagram represents the cycles
during which a given PE is computing a certain y. In each rectangle are the x
and y values the PE needs during each cycle of the computation. In the
original algorithm (case A, Figure 6.7) PE m mod (p+q+1) computes output y_m.
Since each y_m computation requires p+q+1 cycles, as soon as PE m mod
(p+q+1) completes computation of y_m, computation of y_{m+p+q+1} is about
to be started. The computations are skewed so all recurrences that require a
given x_m (or y_m) as input are computed during the same cycle. Case B
(Figure 6.8) shows the data allocation needed to implement a p'+q'+1 coefficient
filter with N < p'+q'+1 PEs. Each PE again performs all the computations
for a given output, with y_m computed in PE m mod N. However, since the
number of cycles to compute y_m is greater than the number of PEs,
computation of y_{m+N} does not begin until y_m is completed. Cycles are classified into
two types:
1) transient cycles, defined to be cycles in which any PE starts to compute a
new y value, and
2) steady state cycles, cycles that are not transient.
Following every set of N transient cycles there are p'+q'+1-N steady state
cycles. Also, following every set of N cycles during which y values are
completed, there are p'+q'+1-N cycles during which no new y values are
completed. During the set of N transient cycles, each PE can be placed into one of
two classes:
1) PEs that have started computing a new y value since the beginning of the
set of transient cycles, and
2) PEs that have not started computing a new y value.
At the start of the set of transient cycles, all PEs are in class 2. After each 
transient cycle, one PE completes its y value and therefore moves to class 1. 
At the end of the set of transient cycles, all PEs have moved to class 1. During 
the steady state cycles, the computations are skewed as in the Case A
computation, so all recurrences requiring a given x_m (or y_m) as input are
computed during the same cycle. However, during the transient cycles, the PEs in
class 1 need a different set of x's and y's than the PEs in class 2 (see Figure
6.8).

Figure 6.7. Data allocation for SIMD machine algorithm with N = p+q+1
PEs, shown for p=2, q=2.

Figure 6.8. Data allocation for SIMD machine algorithm with N < p+q+1
PEs, shown for p=2, q=2, N=4.

Figure 6.9 gives an algorithm to implement a filter where the number of
PEs is less than the number of filter coefficients. Lines 6-16 compute the
steady state cycles, while lines 18-39 handle the transient cycles. Line 25
broadcasts the newly computed y value to all PEs and line 29 stores the newly
computed value in the y[ ] vector. The variable diff is used to determine
whether a PE is in class 1 or 2. If diff = 0, the PE is in class 1; otherwise,
diff = Δ is the difference in indices of the x and y vectors between the PEs in
class 1 and the PEs in class 2. Execution time is p'+q'+1 cycles to compute
one y value, and ⌈M/N⌉(p'+q'+1) + ((M-1) mod N) cycles to compute M y
values. For large M, if N = (p'+q'+1)/r for r > 1, then the throughput of
the N-PE algorithm is reduced by approximately a factor of r from that of the
(p'+q'+1)-PE algorithm. This ability to adapt the SIMD algorithm to
different problem sizes means that a fixed set of PEs can be used to implement
digital filters. Alternatively, on reconfigurable systems, in which it is possible
to vary the number of PEs that act together as a virtual SIMD machine [e.g.,
Sieg81], it means that for a given digital filter, the virtual machine size can be
tailored to the particular application. Fewer PEs may be chosen if speed
requirements do not require the use of p+q+1 PEs. If, as will most often be
the case, the filtering is one processing step in a sequence of algorithms, fewer
than p+q+1 PEs may be chosen to make the digital filtering algorithm
compatible with other SIMD algorithms to be applied as part of the complete task.
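The cycle count quoted above can be evaluated directly. The following Python sketch is a minimal illustration, assuming the bracketed quantity in the formula is a ceiling (this choice reproduces the M + p + q cycles of the full-size algorithm when N = p'+q'+1); the function name and argument names are hypothetical.

```python
def cycles_for_outputs(M, N, c):
    """Cycles to compute M outputs of a c-coefficient filter (c = p'+q'+1) on N <= c PEs."""
    groups = -(-M // N)              # ceiling of M / N, using integer arithmetic
    return groups * c + (M - 1) % N
```

With c = 10 coefficients and N = 5 PEs (r = 2), 1000 outputs take about twice as many cycles as with 10 PEs, matching the factor-of-r throughput reduction described in the text.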
This method of adapting the SIMD digital filtering algorithm to fewer PEs 
also applies to the solution of general linear recurrence equations. The 
broadcast-based approach therefore provides a general method for using an 
SIMD system to solve linear recurrence equations of order p using p or fewer 
PEs.
In contrast to this flexibility in the SIMD implementation, a VLSI processor
array needs a major hardware modification (adding more registers to hold
additional coefficients and y values) to handle a digital filter of larger size. It is
generally easier to add more cells to the array than to modify the existing cells.
/* ADDR     Address of PE (i.e., ADDR = 0 in PE 0)
   DELTA    Difference (Δ) in indices between PEs in class 1 and 2
   diff     0 if PE is in class 1, DELTA if PE is in class 2
   x[]      Input data (x[m] = 0 for m < 0)
            (stored in each PE before start of algorithm)
   y[]      Output data (y[m] = 0 for m < 0)
   coef[]   Vector of coefficients (see Figure 6.6)
   flag[i]  Equals 1 if coef[i] is an “a” coefficient
   sum      Contains partially computed y_m
*/
Line
1    diff ← 0
2    c ← p+q+1
3    DELTA ← c-N
4    sum ← 0
5    FOR m ← 1-N TO M-1 DO
6      IF m mod N = 1 THEN      /* Do steady state recurrences */
                                /* i.e. no new y_m is started */
7        FOR i ← m TO m+DELTA-1 DO
8          WHERE flag[i mod c] = 1 DO
9            tmp ← y[i-1-DELTA]
10         ELSEWHERE
11           tmp ← x[i+p-DELTA]
12         END WHERE
13
14         sum ← sum + tmp * coef[i mod c]
15
16       diff ← DELTA
17
       WHERE ADDR = (m-1) mod N DO  /* a new y_m is computed in PE m mod N */
         DTRin ← sum            /* Send newly computed y value to */
                                /* all PEs by placing it in the DTRin. */
         sum ← 0                /* Clear SUM for next y_m */
                                /* value to be computed in this PE. */
         diff ← 0               /* When diff = 0 in a given PE, */
                                /* the given PE is in class 1 */
         BROADCAST              /* Broadcast the SUM placed in DTRin */
                                /* to all PEs. */
       END WHERE
       y[m-1] ← DTRout          /* DTRout contains y_{m-1} */
                                /* in all PEs */
       WHERE flag[m mod c] = 1 DO   /* If COEF[]s are “a” values, */
                                    /* load y values into TMP. */
         tmp ← y[m-1-DELTA+diff]
       ELSEWHERE                    /* If COEF[]s are “b” values, */
                                    /* load x values into TMP. */
         tmp ← x[m+p-DELTA+diff]
       END WHERE
       sum ← sum + tmp * coef[(m + diff) mod c]
[Figure 6.9: SIMD digital filtering algorithm for fewer PEs than filter coefficients]

Therefore, a VLSI processor array of size p+q+1 cannot easily implement a
larger problem size. In terms of flexibility to adapt to changing problem sizes,
then, the SIMD system has the capability of handling varying problem sizes
under software control. Adapting a VLSI processor array to a problem size
different from that for which the array was designed requires hardware
modification. For some computing environments, this difference in flexibility
may be significant, and would dictate use of the possibly slower but more
flexible SIMD system.
6.1.7. Summary of General Digital Filtering Algorithms
Synchronous parallel structures for implementing digital filters have been 
presented. Both VLSI processor arrays and SIMD implementations yield 
significant speedups over serial processing. The SIMD method provides a gen­
eral approach to solving linear recurrence equations on an SIMD system. For a 
given application or environment, the choice of VLSI processor or SIMD struc­
ture depends on a number of factors. Although exact timing is implementation 
dependent, it is most likely that the VLSI processor array approach will be fas­
ter than the SIMD algorithms. System cost will also be less for the VLSI pro­
cessor array. On the other hand, the SIMD system can accommodate changes 
in the order of the filter, whereas the VLSI processor array requires hardware 
modification to handle a change in problem size. Moreover, if the filtering is 
simply one step in a series of operations, no additional hardware is needed in 
the SIMD system. The data allocation resulting from the SIMD algorithm, 
where the output data is distributed across the PEs, is a useful allocation for a 
number of SIMD signal processing algorithms, including computation of auto­
correlation and covariance coefficients [Si80b] and FFTs [SMS79]. The ability
to run the SIMD algorithm on different machine sizes improves its potential 
compatibility with other SIMD algorithms which, together with digital filtering, 
comprise a complete signal processing task. Therefore, for a particular environ­
ment, speed requirements, cost, the importance of flexibility, and the context in 
which the algorithm is to be used may all be factors in selecting a parallel 
structure for digital filtering.
6.1.8. Parallel Preemphasis Filtering
Fortunately, the preemphasis filtering which is used before performing
autocorrelation in a speech processing system is much simpler than the general
digital filter. Figure 6.10 is the Flock Algol algorithm for implementing

$$ H(z) = 1 - 0.95\, z^{-1} . $$

The signal is broken up into frames containing N samples each, where N is the
number of PEs. Before execution, sample i of the input data is in PE i for
0 ≤ i < N. After execution, PE i contains output sample i for 0 ≤ i < N.
The number of PEs used need not be equal to the number of samples per LPC
frame (M). However, they are often the same since the autocorrelation
algorithm which follows uses M sample frames with the same data arrangement as
output by the filtering algorithm. Line 1 sets up the interconnection network
for a Shift +1 transfer. Line 2 transfers the input data so that PE i contains
sample i in input and sample i-1 in tmp for 1 ≤ i < N. PE 0, however, has
sample N-1 in tmp since the shift transfer wraps around. Lines 4-8 handle the
wrap around from PE N-1 to PE 0 by saving the value in tmp in PE 0 for
later and using the sample from the previous time the algorithm was used.
This value was sample N-1 from the previous N samples, which is the value
that is needed. The value that wraps around is saved in oldvalue until the
next time the routine is called.
After line 8, PE i has both sample i and sample i-1; therefore the filter
operation is easily performed by the operation in line 10.
The numbers to the right of the line numbers are the approximate
execution times in μs for each statement. These are based on the program presented
in Section 7.2. Since there are no loops, the time complexity is O(1).
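The frame-by-frame wrap-around handling can be illustrated with a serial sketch. The Python code below is not the report's program; it assumes a list element per PE, models the Shift +1 transfer as a circular shift, and uses hypothetical names (preemphasis_frame, oldvalue, alpha).

```python
def preemphasis_frame(frame, oldvalue, alpha=0.95):
    """Apply H(z) = 1 - alpha * z^-1 to one frame of N samples (one sample per simulated PE).

    oldvalue is sample N-1 of the previous frame (0 for the first frame).
    Returns the filtered frame and the value to pass in as oldvalue next time.
    """
    N = len(frame)
    # Shift +1 with wrap-around: PE i receives the sample held by PE i-1,
    # and PE 0 receives the sample held by PE N-1.
    tmp = [frame[(i - 1) % N] for i in range(N)]
    wrapped = tmp[0]          # save the value that wrapped around into PE 0
    tmp[0] = oldvalue         # PE 0 uses the last sample of the previous frame
    output = [frame[i] - alpha * tmp[i] for i in range(N)]
    return output, wrapped
```

Calling the function on consecutive frames, and passing the returned value back in as oldvalue, gives the same result as filtering the unframed signal.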
/*
H(z) = 1 - 0.95 * z ** -1
input: input data
output: filter output data
tmp, tmp2: temporary values
*/
Line  Time in μs
1     3        USE Shift +1
2     8        TRANSFER input TO tmp
3
4     7        WHERE ADDR = 0 DO   /* Get value from previous call */
5     0.5        tmp2 ← tmp

10    12.75    output ← input - tmp * 0.95
Figure 6.10. Algorithm for preemphasis filtering.

Section 5.1 presented three SIMD algorithms for computing autocorrelation
coefficients. This section presents another algorithm for the same task. It
is a variation, with throughput improvement, of Ashajayanthi's [ASV79] SIMD
machine autocorrelation algorithm. Ashajayanthi's algorithm (AUTO3) is
presented in Figure 5.4 in Section 5.1.3. A direct mapping of it into a VLSI
processor array results in the array in Figure 6.11. Each cell performs the
operations shown in the figure with all the variables set to zero before the first
sample enters cell 0. After sample M-1 enters cell 0, SUM in cell i contains
R(p-i-1).
Figure 6.12 shows an improved version of this array (AUTO4). The array
differs from AUTO3 in that the data entering in1 in the top cell is also
broadcast to in2 in all cells. AUTO3, on the other hand, broadcasts the data
entering in1 in the bottom cell to in2 in all cells. The cells in AUTO3 all do the
same operation as the cells in AUTO4, with cell i computing the same
operations as in Figure 6.11. All variables are set to zero before sample 0 enters cell
0, and cell i computes R(i) for 0 ≤ i < p. This is an improvement since Figure
6.11 requires p operations to get sample 0 into cell p-1, followed by M-1
operations to compute the coefficients. AUTO4 needs no initialization and requires
M operations using the same cells as AUTO3.
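The behaviour of the improved array can be sketched as a serial simulation. The Python code below is an illustrative model only: it assumes p cells, that each cell passes its in1 value to the next cell on every step, and that the sample entering cell 0 is simultaneously broadcast as in2. The function name auto4 and the list-based registers are hypothetical.

```python
def auto4(samples, p):
    """Simulate a p-cell array in which cell i accumulates R(i)."""
    R = [0] * p          # accumulator in each cell, cleared before sample 0 enters
    in1 = [0] * p        # value currently held on the in1 input of each cell
    for x in samples:    # one array step per input sample (M steps in total)
        # The new sample enters cell 0; every other cell receives the
        # value its predecessor held on the previous step (out <- in1).
        in1 = [x] + in1[:-1]
        for i in range(p):
            R[i] += in1[i] * x      # R <- R + in1 * in2, with in2 the broadcast sample
    return R
```

After M steps cell i holds the sum over m of x[m-i] * x[m], which is the lag-i autocorrelation coefficient.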
6.2.1. Summary
Table 6.2 compares Ashajayanthi's algorithm (AUTO3) with the improved
algorithm (AUTO4). Initialization times are included in the times in Table 6.2,
but were omitted when computing the times in Table 5.2. AUTO4 is a faster
algorithm than AUTO3 since it uses the same cells and does not require any
initialization steps other than setting R to zero before sample 0 is computed.
R ← R + in1 * in2
out ← in1
Figure 6.11. Ashajayanthi's SIMD autocorrelation method [ASV79] mapped to
a VLSI processor array.

R ← R + in1 * in2
out ← in1
Figure 6.12. Improved VLSI processor array autocorrelation algorithm.

Table 6.2. Comparison between Ashajayanthi's SIMD autocorrelation algorithm
(AUTO3) and the improved algorithm (AUTO4).
AUTO3    p+1    M+p+1    M+p+1    M+p+1    M+2p+2
AUTO4    p+1    M        M        M        M+p+1
6.3. Linear Time Warp
The purpose of linear time warping (LTW) is to take an utterance R(j) for
0 ≤ j < J and stretch or shrink it to an utterance T(i) for 0 ≤ i < I.
Elements of R(j) and T(i) are vectors of LPC coefficients. The following equations
show the relationship between R() and T().

$$ T(i) = (1-s)\,R(j) + s\,R(j+1), \qquad i = 1, \ldots, I \tag{6.3} $$

where

$$ j = \left\lfloor \frac{(i-1)(J-1)}{I-1} + 1 \right\rfloor, \qquad
   s = \frac{(i-1)(J-1)}{I-1} + 1 - j \tag{6.4} $$

One method to compute T(i) in parallel is to have PE i compute T(i) for
0 ≤ i < I. A second method is to compute the vector/scalar products
(1-s)R(j) and sR(j+1) in parallel (i.e., have PE k compute element k of vector
T(i)). The following sections discuss each method.
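A serial reference for the warp is useful when checking either parallel method. The Python sketch below is an assumption-laden illustration: it takes j as the floor of (i-1)(J-1)/(I-1) + 1 and s as the remaining fraction, following lines 21-23 of Figure 6.13, uses 1-based j as in equation (6.3), requires I ≥ 2, and treats each frame as a scalar rather than a vector of LPC coefficients. The function names are hypothetical.

```python
import math
from fractions import Fraction


def ltw_weights(J, I):
    """Return the (j, s) pair used for each output frame i = 1..I (j is 1-based)."""
    factor = Fraction(J - 1, I - 1)
    pairs = []
    for i in range(1, I + 1):
        tmp = (i - 1) * factor + 1
        j = math.floor(tmp)
        pairs.append((j, tmp - j))
    return pairs


def ltw(R, I):
    """Warp the J frames in R to I frames: T(i) = (1-s) R(j) + s R(j+1)."""
    J = len(R)
    T = []
    for j, s in ltw_weights(J, I):
        # R(j+1) does not exist for j = J, but s is 0 there, so any value works.
        nxt = R[j] if j < J else R[j - 1]
        T.append((1 - s) * R[j - 1] + s * nxt)
    return T
```

For J = 5 and I = 7 the weights are (1, 0), (1, 2/3), (2, 1/3), (3, 0), (3, 2/3), (4, 1/3), (5, 0). Exact fractions are used here only to make the weights easy to inspect.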
6.3.1. Method One
The algorithm in Figure 6.13 does a linear time warp from J frames to I
frames on an SIMD machine. It uses equations 6.3 and 6.4 to warp R(j),
0 ≤ j < J, to T(i), 0 ≤ i < I. Each element of R(j) is a feature vector and
R(j) for 0 ≤ j < J is one utterance. The algorithm assumes R(j) is in PE j for
0 ≤ j < J. Method one has three cases: J<I, J=I, and J>I. The following
sections give examples of how the algorithm works when J<I and J>I. The J=I
case is a simple copying operation




/*
  Take J samples in PEs 0 through J-1 and linearly
  warp them to I samples in PEs 0 through I-1.
  The input frames R are stored with R[j] in PE j.
  J is equal to the number of input frames.
  I is equal to the number of output frames.
  Output: T[i] will contain the linearly warped output in
          PEs 0 through I-1.
*/
Line  Time in μs
1     1.5      IF (I = J) THEN
2     32.25      T ← R
3     2          RETURN
4
5     24.5     factor ← (J-1) / (I-1)
6     26.24    i ← ⌈ADDR/factor⌉
7
8              /*
9                If data is being expanded, move input data to
10               cover all output PEs.
11             */
12    2        IF (I > J) THEN
13    3          USE Shift +1
14    3.5        FOR k ← 1 TO I - J
15    6.5          WHERE (ADDR < i) DO
16    7.5            TRANSFER i
17    127.5          TRANSFER R
18    2            ENDWHERE
19    0.5      i ← ADDR
20
21    11.25    tmp ← i * factor + 1
22    2.5      j ← ⌊tmp⌋
23    1.25     s ← tmp - j
24    3        USE Shift -1
25    96.5     TRANSFER R TO R1
26    217      T ← (1-s) * R + s * R1
27
28             /*
29               Shift new T's down until only I PEs are occupied
30             */
31    1.75     IF (I < J) THEN
32    3          FOR k ← 1 TO J - I
33    7.5          TRANSFER i TO i_tmp
34    6.5          WHERE (i_tmp < ADDR) DO
35    92             TRANSFER T
36    0.5            i ← i_tmp
37    2            ENDWHERE
Figure 6.13. SIMD algorithm to do linear time warp. Numbers to the right of the
line numbers are the execution times assuming an 8 MHz 68000. (See Section 7.5.)
6.3.1.1. An Example of Expanding J=5 Frames to I=7 Frames
Suppose J=5 and I=7. Since J < I, the data is being expanded. Using
equations (6.3) and (6.4) we have:

$$
\begin{aligned}
T(1) &= R(1) \\
T(2) &= \tfrac{1}{3}R(1) + \tfrac{2}{3}R(2) \\
T(3) &= \tfrac{2}{3}R(2) + \tfrac{1}{3}R(3) \\
T(4) &= R(3) \\
T(5) &= \tfrac{1}{3}R(3) + \tfrac{2}{3}R(4) \\
T(6) &= \tfrac{2}{3}R(4) + \tfrac{1}{3}R(5) \\
T(7) &= R(5)
\end{aligned}
\tag{6.5}
$$

Line 6 computes i in each PE based on the PE's address. R(j) can be
computed in PE k by using R in PE k and R in PE k+1. Figure 6.14 shows the
PEs and their i values. Notice 1 and 4 are missing from the i column. Lines
12-18 shift the data so that T(i) can be computed in PEs 0 through 6. This is
done by comparing ADDR to i. If ADDR < i (as in PEs 2 through 5), i and
R(i) are shifted from PE k to PE k+1. This happens I-J times as shown in
Figure 6.13. Now i is assigned ADDR in PEs 0 through 6 and R is transferred
from PE k to R1 in PE k-1 in line 25. Line 26 then does the computations of
the equations in 6.5 in parallel, leaving T(i) in PE i for 0 ≤ i < I.
In general, if J < I, the R(j)'s are shifted between the PEs until I
PEs are used, and PE i contains the two R()'s needed to compute T(i).
6.3.1.2. An Example of Compressing J=7 Frames to I=5 Frames












0 R(0) 0 0 R(0) 0 R(0) 0 R(0) 1 0
1 R(1) 2 0 R(0) 0 R(0) 1 R(1) 1 2/3
2 R(2) 3 2 R(1) 2 R(1) 2 R(2) 2 1/3
3 R(3) 5 3 R(2) 3 R(2) 3 R(2) 3 0
4 R(4) 6 5 R(3) 3 R(2) 4 R(3) 3 2/3
5 6 R(4) 5 R(3) 5 R(4) 4 1/3
6 6 R(4) 6 R(0) 5 0
Figure 6.14. Data flow for LTW for expanding from J=5 to I=7 frames.
$$
\begin{aligned}
T(1) &= R(1) \\
T(2) &= \tfrac{1}{2}R(2) + \tfrac{1}{2}R(3) \\
T(3) &= R(4) \\
T(4) &= \tfrac{1}{2}R(5) + \tfrac{1}{2}R(6) \\
T(5) &= R(7)
\end{aligned}
\tag{6.6}
$$

This is done by the transfer of lines 24 and 25. Figure 6.15 shows the data in
each PE after the transfer. The boldface values in the T columns indicate
those PEs that are disabled after line 34. Recall that if a PE is disabled, it can
pass data to other PEs, but other PEs cannot pass their data to it. Notice the
equations in (6.6) can now be computed simultaneously, with PEs 2 and 5
computing values that are not needed (“junk” values). Lines 31-37 then shift the
T(i) values so that T(i) is in PE i. Line 33 shifts the i values from PE k to
i_tmp in PE k-1, then those PEs with ADDR > i_tmp put i_tmp in i, and R
gets the value of R in PE k+1. i is transferred to i_tmp before comparing to
ADDR since a disabled PE cannot receive data. This, in effect, shifts good T(i)
values over the junk values.
In general, if J > I, PE i computes T(i) and then the data is shifted so PE
i contains T(i) for 0 ≤ i < I.
6.3.1.3. Time Complexity
Table 6.3 summarizes the time complexity for the linear time warp
algorithm. The total number of PEs required is the maximum of J and I. The 2N
products and the I additions in equation (6.3) are all done in parallel by line 26
of Figure 6.13. The rest of the algorithm is for shifting data so that each R(j)
and T(i) value is placed in the correct PE. Some of this shifting overhead may
be reduced depending on the arrangement of the data in the algorithms before











0 R(0) R(1) 0 1 0   T(0) 1 T(0) 0 1 T(0) 0
1 R(1) R(2) 1 2 1/2 T(1) 2 T(1) 1 2 T(1) 1
2 R(2) R(3) 2 4 0   junk 2 T(2) 2 3 T(2) 2
3 R(3) R(4) 2 4 0   T(2) 3 T(3) 3 4 T(3) 3
4 R(4) R(5) 3 5 1/2 T(3) 4 junk 4 4 T(4) 4
5 R(5) R(6) 4 7 0   junk 4 T(4) 4 0 T(0) 0
6 R(6) R(0) 4 7 0   T(4) 0 T(0) 0 0 T(0) 0
Figure 6.15. Data flow for compressing J=7 frames to I=5 frames.
























The second approach to parallel linear time warping is to have PE k hold
coefficient k of frame j for 0 ≤ k < p and 0 ≤ j < max(J,I). Each
vector/scalar multiplication is done in parallel. The algorithm is presented in
Figure 6.16. The numbers to the right of the line numbers are the execution
times in μs when implemented on an SIMD machine (see Section 7.5). The
number of PEs (cells) used is p, the number of coefficients per frame. This
algorithm can be implemented on both the SIMD machine and the VLSI
processor array (see Section 8.5 for details on the VLSI processor array). The time
complexity is summarized in Table 6.3.
6.3.3. Summary
Method two is an improvement over method one in that it uses fewer PEs 
(cells) and does not require vector operations. Method one requires fewer 
operations overall, and will therefore execute in less time. The final considera­
tion in choosing between these two methods is the arrangement of the data 
among the PEs (cells). The algorithm commonly preceding the linear time 
warp will be the LPC algorithm. The SIMD LPC algorithm leaves the data in
the PEs in an arrangement that method two can use directly. To use method
one, the data must be rearranged, which might require more time than will be
saved by using the faster method one.
Line  Time in μs
1     1.75     IF (M = N) THEN
2     111        T ← R
3     2          RETURN
4
5     23.5     factor ← (M-1)/(N-1)
6
7     2.75     FOR n ← 0 TO N-1
8     11.25      tmp ← n * fac...
9     2.5        m ← ⌊tmp⌋
10    1.25       s ← tmp - m
11    29         T ← (1-s) * R(...
Figure 6.16. Algorithm for linear time warping using p PEs. Execution times
are for an 8 MHz MC68000. (See Section 7.5.)
6.4. Dynamic Time Warping
This section presents dynamic time warping algorithms for both the SIMD 
machine and the VLSI processor array. These algorithms have previously 
appeared in [YoSi82]. The SIMD algorithms assume that the feature vectors 
for the entire test word and all template feature vectors needed are stored in 
every PE memory. The PEs are complete processors, and a general intercon­
nection network handles the needed inter-PE communications. The VLSI array 
algorithms assume that the cells have less memory, and that fast, fixed
inter-PE transfers are a part of the system architecture. In these algorithms, the
feature vectors shift from one cell to the next, and the computations are per­
formed in a pipelined fashion.
6.4.1. SIMD Algorithms
This section presents two approaches to performing DTW on an SIMD
machine. Both assume that the speech recognizer must compare the test
template to W reference templates, and each PE contains complete test and
reference templates. The serial-parallel approach uses up to W PEs in parallel with
each PE doing a serial DTW using a different reference template. The
parallel-parallel approach uses many PEs in parallel for each DTW match of
the test template with a reference template.
6.4.1.1. Serial-Parallel (SP) SIMD Approach
A recognizer with a vocabulary of W templates can be implemented on a
processor with N ≤ W PEs. If W = N, then PE w contains template w,
0 ≤ w < W, from the vocabulary, so that every PE contains a different
template. Each PE performs a serial DTW between its stored template and the
input X. Recursive doubling [Ston80] is used to find the PE containing the
smallest distance, in log N time, which represents the template most closely
matching the input. See Section 2.6 for an example of recursive doubling.
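The minimum search over the per-PE distances can be illustrated with a serial simulation. The Python sketch below shows one common formulation of recursive doubling (a butterfly exchange in which PE i combines with PE i XOR 2^k at step k); it is an assumption that this matches the variant in Section 2.6, N is assumed to be a power of two, and the function name is hypothetical.

```python
def recursive_doubling_min(dist):
    """Find the smallest distance and its PE index in log2(N) combining steps.

    dist[w] is the DTW distance computed in simulated PE w.
    Returns the per-PE results and the number of steps taken.
    """
    N = len(dist)                               # assumed to be a power of two
    val = [(d, w) for w, d in enumerate(dist)]  # each PE starts with its own pair
    step, steps = 1, 0
    while step < N:
        # Every PE exchanges with the PE whose address differs in one bit
        # and keeps the smaller (distance, index) pair.
        val = [min(val[i], val[i ^ step]) for i in range(N)]
        step *= 2
        steps += 1
    return val, steps
```

After the last step every simulated PE holds the same pair, so any PE (or the control unit) can report the best-matching template.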
All DTW algorithms compute the following steps:
1) computing the local distance d(i,j);
2) the two multiplications and four additions in equation (4.7); and
3) two comparisons to find the minimum of three values.
These three steps are defined as one loop as discussed in Section 4.5.2. A serial
DTW algorithm requires W(2Ir - I - r^2 + r) loops to compute W D(A,B)s with the
adjustment window r, and WI^2 loops without. This does not take into account
the possible time saved by pruning. The same algorithm on an SIMD machine
with N = W PEs requires (2Ir - I - r^2 + r) loops with the adjustment window, and
I^2 without. This is an ideal speedup (i.e., by a factor of N) over the serial
processor. However, if the serial processor uses pruning, the parallel approach will
attain a less than ideal speedup. At least one comparison (the minimum
distance match) is not pruned, so the time for the SIMD algorithm is not reduced
by pruning. Since the time of the serial algorithm may be reduced by pruning,
the SP algorithm will no longer attain a factor of N speedup. If W > N (the
vocabulary is larger than the number of PEs), then the SP algorithm can be run
⌈W/N⌉ times to match all words. See Table 6.4 for a summary of these results.
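The loop counts above can be compared numerically. The Python sketch below is a simple illustration that ignores pruning, as the counts in the text do; the function names are hypothetical, and r = None is used here to mean that no adjustment window is applied.

```python
import math


def dtw_loops(I, r=None):
    """Loops for one DTW match of I frames, with or without adjustment window r."""
    if r is None:
        return I * I
    return 2 * I * r - I - r * r + r


def serial_loops(W, I, r=None):
    """Loops for a serial processor matching against W templates."""
    return W * dtw_loops(I, r)


def sp_loops(W, N, I, r=None):
    """Loops for the serial-parallel approach on N PEs (run ceil(W/N) times)."""
    return math.ceil(W / N) * dtw_loops(I, r)
```

With W = N the ratio `serial_loops(W, I, r) / sp_loops(W, N, I, r)` equals N, the ideal speedup noted in the text.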
6.4.1.2. Parallel-Parallel (PP) SIMD Approach
Two drawbacks to the SP approach are that pruning will not reduce the
computation time unless all PEs can prune at the same time, and that there is
no effective way to use N > W PEs. In the parallel-parallel approach each
DTW match uses several PEs. Equation (4.7) shows that g(i-2,j-1) and
g(i-1,j-2) must be computed before computing g(i,j). The g(i,j)'s that can be
computed in parallel are all g(i,j)s for i+j = 2k and i+j = 2k+1, for a fixed
value of k between 1 and I inclusive. If g(k) is defined as all g(i,j) with i+j = 2k
and i+j = 2k+1, all g(i,j)s in g(k) can be computed in parallel. These g(k)
depend only on g(m) for m < k. Figure 6.17 shows two diagonal rows that
represent a typical group of g(i,j) in a given g(k); the g(m) for m < k are
“down” and “to the left” of the diagonal rows. Each g(k) contains at most
2r+1 points when using an adjustment window of size r. If no adjustment
Table 6.4. Summary of Parallel Dynamic Time Warping Algorithms.
Algorithm | Adjustment Window | Number of PEs | ΔPEs per Match | Loops for One Match | Number of Loops for W Matches | Operations per Loop

SP: Serial Parallel algorithm          ld: local distance calculation
PP: Parallel Parallel algorithm        m: multiplication
HSAC: High Speed Array Computer        a: addition
BAC: Bilinear Array Computer           c: comparison
N: number of PEs used                  sv: shift vector through pipe to adjacent PE
K: 2Ir - I - r^2 + r                   ss: shift scalar through pipe to adjacent PE




Figure 6.17. A set of g(i,j) that can be computed in parallel, labeled with PE 
numbers.
window is used, each g(k) has a maximum of 2I-1 points. Figure 6.18 shows 
the PP algorithm. The PEs are numbered -r, -(r-1), ..., -2, -1, 0, 1, 2, ..., r-1, r,* 
and PE n computes g(i,j) for (i=k+n/2, j=k-n/2) for n even, and 
(i=k+(n+1)/2, j=k-(n-1)/2) for n odd. Figure 6.19 is a data flow diagram 
for lines 13-55 of the PP algorithm, with each box showing which g(i,j) the 
given PE is computing and each column of boxes showing the contents of all 
PEs during a given loop in the algorithm. The arrows between PEs represent 
the data transfers, with the g transfers as solid lines and the d transfers as 
dashed lines. The odd (even) numbered PEs correspond to the PEs in the top 
(bottom) row in Figure 6.17. This assumes that the feature vectors a_i and b_j 
are stored in the appropriate PEs before the start of the algorithm. Figure 
6.20 is a data flow diagram for the PP algorithm. Each row of boxes indicates 
which g(i,j) a given PE is computing during each loop of the algorithm. Each 
column shows which g(i,j)s are computed in parallel for a given k value. A 
total of 2r+1 PEs per template are needed. If the SIMD machine has N PEs,
⌊N/(2r+1)⌋ templates can be matched in I parallel loops, requiring
⌈W/⌊N/(2r+1)⌋⌉ I loops for a W template vocabulary. Both the SP and PP
methods yield a speedup over the serial algorithm. The following section 
discusses a parallel DTW algorithm for the VLSI processor array. The section 
after that compares all the parallel DTW algorithms to each other.
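As a small illustration of the PE assignment and loop count for the PP approach, the following Python sketch maps a PE number n and loop index k to the grid point that PE computes, and evaluates the number of loops for a W template vocabulary. The function names are hypothetical, and the sketch covers only the bookkeeping, not the DTW arithmetic itself.

```python
import math

def pe_point(k, n):
    """Grid point (i, j) handled by PE n during loop k of the PP algorithm.

    Even n: (k + n/2, k - n/2); odd n: (k + (n+1)/2, k - (n-1)/2).
    PE numbers may be negative, matching the -r..r numbering in the text.
    """
    if n % 2 == 0:
        return (k + n // 2, k - n // 2)
    return (k + (n + 1) // 2, k - (n - 1) // 2)

def pp_loops(W, N, I, r):
    """Loops to match W templates when each match occupies 2r + 1 PEs."""
    matches_in_parallel = N // (2 * r + 1)
    if matches_in_parallel == 0:
        raise ValueError("the PP approach needs at least 2r + 1 PEs")
    return math.ceil(W / matches_in_parallel) * I
```

Even-numbered PEs land on the diagonal i+j=2k and odd-numbered PEs on i+j=2k+1, which are the two diagonal rows that make up g(k).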
6.4.2. VLSI Processor Array Algorithms
Burr, Weste, and Ackland [BAW81,BAW84,WBA83] have presented a high 
speed array computer (HSAC) in which an I by I grid of cells compares several 
vocabulary templates to the input template simultaneously. They also 
presented reduced arrays which can use as few as r + 1 cells to “sweep out” the 
I by I grid. The complexity analysis of the HSAC was presented in Section 5.6. 
The next section presents a bilinear VLSI array algorithm which incorporates 
some of the strategy used in the PP SIMD algorithm with the reduced arrays
* If no adjustment window is used, the PEs are numbered 
-(I-1), -(I-2), ..., -1, 0, 1, ..., I-2, I-1.
/*
Algorithm Name: dtw.s (parallel)
Section:        6.4.1.2.
Machine:        SIMD
Function:       This program performs a dynamic time warp.
Number of PEs:  2r+1 or 2I-1
Parameters:     r, the width of the warping path.
                p, the number of coefficients per frame.
                NetD, the network delay time.
                I, the number of frames per utterance.
Input:          All PEs hold all the input data.
Output:         PE r holds the distance score.
*/
Line  Time in µs
 1              PROCEDURE dtw
 2    2           g ← 0
 3    2           gold ← 0
 4    1.5         d ← ∞
 5    1.5         dold ← ∞
 6    28          WHERE ADDR = 0 DO
 7    2             g ← 0
 8    2           ENDWHERE
 9
10    4.75        Xindex ← 1 + ⌈ADDR/2⌉
11    4.75        Yindex ← 1 - ⌊ADDR/2⌋
12
13    1           FOR k ← 1 TO I DO
14    124           compute d(Xindex,Yindex)
15    10.5          WHERE ADDR is even DO
16    2.5             dDTR ← dold
17    2.5             gDTR ← gold
18    8             ELSEWHERE
19    2.5             dDTR ← d
20    2.5             gDTR ← g
21    8             ENDWHERE
22
23    3             USE Shift +1
24    5+NetD        TRANSFER dDTR TO dup
25    5+NetD        TRANSFER gDTR TO gup
26
27    3             USE Shift -1
28    5+NetD        TRANSFER dDTR TO ddown
29    5+NetD        TRANSFER gDTR TO gdown
30
31    7             WHERE ADDR = r DO
32    1.5             gdown ← ∞
33    2             ENDWHERE
34
35    7             WHERE ADDR = -r DO
36    1.5             gup ← ∞
37    2             ENDWHERE
38
39    2.5           gold ← g
40    2.5           dold ← d
41
42    4             A ← gdown + 2 * ddown
43    3             B ← gold + d
44    4             C ← gup + 2 * dup
45
46    6.5           WHERE B < A DO
47    .5              A ← B
48    2             ENDWHERE
49    6.5           WHERE C < A DO
50    .5              A ← C
51    2             ENDWHERE
52    3             g ← A + d
53
54                  Xindex ← Xindex + 1
55                  Yindex ← Yindex + 1
56
57    7           WHERE ADDR = 0 DO

Figure 6.18. Parallel DTW program. Execution times are for an 8 MHz 
MC68000. (See Section 7.6.)

















Figure 6.20. g(i,j) computations in PP algorithm with r even.
into a VLSI array structure. This work was reported in [YoSi82] and was 
developed independently of the HSAC reduced array [WBA83,BAW84].
6.4.2.1. A Bilinear Array Computer (BAC)
In general, the single diagonal HSAC uses r+1 cells per DTW comparison. 
Due to the interdependences discussed in the previous section, it can use no 
more than r + 1 cells per DTW for general path restriction. The bilinear array 
computer (BAC) presented here restricts the path leading to a given point on 
the warping graph so that:
g(i,j) = d(i,j) + min




Because of this restriction, the BAC uses 2r+1 cells per comparison, which 
results in it requiring half as many loops as the single diagonal HSAC. The 
single diagonal HSAC uses enough cells to compute one diagonal in Figure 6.17. 
The BAC uses enough cells to compute the two diagonals of points for g(k) shown 
in Figure 6.17. Figure 6.21a shows that the cells are arranged in a bilinear array, 
with the cells in the left column computing the g(i,j)s for the lower diagonal, 
and the right column for the upper diagonal. Figure 6.21b shows the data 
paths between adjacent cells. DTtop and DTbot are Data Transfer registers. 
Storing a value in DTtop in cell i will transfer that value to DTbot in cell i+1. 
In general, the feature vectors a_i and b_j are piped in from opposite ends at the 
rate of one vector every loop. When a_i meets b_j in cell i-j, it computes d(i,j) 
and g(i,j) and sends them to cells i-j+1 and i-j-1. On the next loop, a_i and 
b_(j+1) meet in cell i-j+1 and it computes d(i,j+1) and g(i,j+1) and sends them 
to cells i-j+2 and i-j. Figure 6.22 shows the data flow as a function of time. 
Figure 6.23 shows the instructions executed by each group of cells if I is odd. 
If I is even, the even cells execute the group B instructions and the odd cells 
execute the group A instructions. The instruction “a vector down” means to transfer the 
“a” vector from cell i to cell i-2 for -(I-2) < i < I-2 and transfer a new 
“a” vector into both cell I-1 and cell I-2. The instruction “b vector up” is 
similar to “a vector down” but for the “b” feature vector.
Figure 6.21. a) Bilinear array of cells. b) Data paths between cells in left and 
right columns.
Figure 6.22. Data flow in BAC algorithm.
Even numbered cells (Group A)        Odd numbered cells (Group B)
a vector down                        a vector down
b vector up                          b vector up
compute d                            compute d
DTtop ← d                            d.bot ← DTbot
DTbot ← d                            d.top ← DTtop
g ← d + min { g.bot.old + 2d.bot,    g ← d + min { g.bot + 2d.bot,
              g + d,                               g + d,
              g.top.old + 2d.top }                 g.top + 2d.top }
g.top.old ← g.top
g.bot.old ← g.bot
g.top ← DTtop                        DTbot ← g
g.bot ← DTbot                        DTtop ← g
d.bot ← DTbot                        DTtop ← d

Figure 6.23. Instructions executed in one loop of the BAC algorithm for I 
odd. (Exchange columns for I even.)
This array computes only one DTW at a time, so its throughput is less 
than the full array HSAC, but it uses twice as many cells as the single diagonal 
reduced array, so it takes half as long for a comparison. If the BAC requires n 
cells, and N > n cells are available, ⌊N/n⌋ arrays can be built, and ⌊N/n⌋ DTWs 
can be computed simultaneously. The time to compute one DTW is the 
number of loops from the time a_1 enters the array until a_I enters cell 0. This 
time is ⌈I/2⌉ loops to get the first a_1,b_1 pair to cell 0, and I loops until a_I,b_I 
arrive at cell 0, giving a total time of I+⌈I/2⌉ loops. The values for the 
second template follow the a_I,b_I values of the first template, so the initial ⌈I/2⌉ 
loops used to get a_1,b_1 into cell 0 are not needed for the DTWs that follow. 
With an adjustment window r, the algorithm needs only 2r+1 cells and r+2I 
loops.
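The timing argument for the case without an adjustment window can be written out as a short Python sketch. The function names are hypothetical. The formula for several back-to-back matches is an inference from the statement that the ⌈I/2⌉ start-up loops are paid only once, and the cell count of 2I-1 for the unwindowed array is an assumption taken from the PE count of the PP program.

```python
import math

def bac_loops_no_window(I, matches=1):
    """Loops for `matches` consecutive DTWs on one BAC array with no window.

    ceil(I/2) loops bring the first a_1, b_1 pair to cell 0; each match then
    needs I further loops. One match therefore takes I + ceil(I/2) loops.
    """
    return math.ceil(I / 2) + matches * I

def bac_arrays(N, I):
    """Number of independent unwindowed arrays that fit in N cells.

    Assumes each array needs 2I - 1 cells (an assumption for this example).
    """
    return N // (2 * I - 1)

def bac_total_loops(W, N, I):
    """Loops to match W templates, spreading them evenly over the arrays."""
    arrays = bac_arrays(N, I)
    if arrays == 0:
        raise ValueError("not enough cells for one BAC array")
    return bac_loops_no_window(I, math.ceil(W / arrays))
```

For I = 40, one match takes 60 loops, while three pipelined matches take 140 loops rather than 180.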
6.4.3. Summary of Results
Table 6.4 summarizes the above results. The column labeled “Number of 
PEs” lists the minimum number of PEs (cells) needed to use the algorithm. 
The ΔPE column is the number of PEs (cells) to be added to do another match 
in parallel. The fifth and sixth columns list the number of loops needed to do 
one match and W matches. The last column shows the operations done during 
one loop.
The serial and SP algorithms require the same operations per loop. The 
PP algorithm requires inter-PE transfers of the d and g values, which may 
increase the total loop time. Based on proposed general interconnection 
networks (e.g., [SiMc81a,SiMc81b]), the transfer time will be negligible compared 
to the time to compute the local distances. Depending on the implementation, 
it may be possible to overlap the transfers with the computations, so that little 
or no extra time is incurred. The loop times for the HSAC and the BAC will 
be about equal. The operation counts for the SIMD and array algorithms differ 
significantly; however, time differences will depend on specific implementations. 
The predominant difference in operation counts arises because the serial and 
SIMD algorithms assume each PE contains the feature vector before the 
algorithm starts, whereas the VLSI array algorithms require shifts to bring the test 
and vocabulary vectors into the cells. The A and B vector shifts occur
simultaneously, so the time required is for the transfer of one feature vector. 
The times to transfer d and g values may also differ, since the PP algorithm 
uses a general interconnection network, whereas the VLSI array uses a less 
general (but most likely faster) fixed pipeline between adjacent cells. If transfer 
and computation steps can be overlapped, the loop times will be approximately 
equal, in spite of differences in the operation counts. Figures 6.24 and 6.25 
show two plots of the number of loops needed to match W=100 words of 
length I=40 vs. the number of PEs (cells) with and without an adjustment 
window. Figures 6.26 and 6.27 are the same as 6.24 and 6.25 except for W=1,000. 
In Figure 6.24 the BAC and RHSAC lines are plotted almost on top of each 
other. In Figure 6.25 the BAC, PP, and RHSAC lines are almost on top of 
each other, with the RHSAC requiring fewer loops in the 1 to 128 PE range. 
In Figure 6.26 all but the SP are plotted almost on top of each other. The SP 
requires fewer loops than the other algorithms when using 1 to about 384 PEs, 
and around 500 PEs, and around 1,000 PEs. In Figure 6.27, the BAC and PP 
lines are plotted exactly on top of each other, and the RHSAC is plotted 
slightly below the BAC and PP lines for certain numbers of PEs.
Figure 6.24 shows that the BAC takes a few more loops than the PP 
algorithm, since it requires a few loops to initialize the array which the PP 
algorithm does not need. The figure also shows that the BAC algorithm requires 
fewer loops than the HSAC with 544 cells. Since the operations per loop are 
equivalent, the BAC will therefore be slightly faster. This speed is attained by 
reducing the number of idle cells. In the BAC, no cells are idle after ⌈I/2⌉ 
loops, while the HSAC requires 2I loops before all cells are in use. The PP and 
BAC algorithms can continue to reduce execution time by adding more PEs 
(cells), so for these algorithms/architectures, the machine size can be chosen to 
meet speed requirements.
Figures 6.26 and 6.27 show that when the vocabulary size is increased to 
1,000 words, the SP program clearly requires the fewest loops. This is because 
each cell is executing a serial DTW program which has little overhead of 
parallelism.
Legend for Figures 6.24 through 6.27 (horizontal axis: number of processors):
SP     Serial Parallel
PP     Parallel Parallel
BAC    Bilinear Array Computer
RHSAC  Reduced High Speed Array Computer
HSAC   High Speed Array Computer
Figure 6.25. Number of loops for W=100, I=40, no window. HSAC not shown.
Figure 6.27. Number of loops for W=1,000, I=40, no window. HSAC not 
shown, since 1,600 PEs required.
6.5. Conclusions
Five parallel digital filtering algorithms, an autocorrelation, a linear time 
warp, and three parallel dynamic time warping algorithms were discussed. To 
choose the best algorithm, one must consider the need for flexibility and the type 
of processor available (PEs for SIMD or cells for the VLSI array). Also, 
when using the DTW algorithms, the use of pruning and an adjustment window 
must be considered. The VLSI array algorithms are best suited for a dedicated 
task, since the inter-cell connections are not easily changed. The SIMD 
interconnection and PEs are more general and could therefore be used to perform 
other tasks in a recognition system. All the algorithms provide significant 
speedups for these computationally intensive tasks.
7. SIMD MACHINE SIMULATION
This chapter presents the results of simulating many of the SIMD machine 
algorithms presented in the previous chapters. Section 7.1 describes the sim68 
simulator that is used to run the simulations. These simulations allow the 
operations of the algorithms to be verified and also give an idea of the 
execution times of each algorithm, assuming the use of current technology 
processors. Sections 7.2 through 7.6 present the results of simulating some of the 
SIMD algorithms from Chapters 5 and 6. Each algorithm is presented as an 
individual program in these sections and Section 7.7 combines some of the pro­
grams into an SIMD machine based isolated word recognition system. This 
system can process input data sampled at 20 KHz and recognize a 1,000 word 
vocabulary in real time. Section 7.8 discusses the strengths and weaknesses of 
using an SIMD architecture for speech processing and suggests improvements to 
the architecture.
7.1. Simulating an SIMD Machine Using Sim68
The sim68 program performs an assembly language instruction level simu­
lation of an SIMD machine [SiKu82]. All sim68 programs are written in 
MC68000 assembly language with the aid of many support programs such as a 
parallel assembler and loader. The following sections describe the different 
parts of the SIMD model from Chapter 2 that are simulated.
7.1.1. Simulating the PEs and the CU
Sim68 simulates the PEs and the CU in the SIMD machine as MC68000 
microprocessors. The MC68000 is a state-of-the-art 16-bit microprocessor 
[ToGu81,Mot79], and reasons for its selection are discussed in [SiKu82]. 
Among these reasons are:
1) It can operate on a variety of data sizes: bit, byte, word (16-bits),
and long (32-bits).
2) It has a fast cycle time: from 8 to 12.5 MHz.
3) It has a large address space: 24-bits.
4) It has a regular instruction set. See Figure A.1 in Appendix A.
It has been shown in [SiKu82] that the execution of CU and PEs instruc­
tions can overlap by using an instruction queue between them. This overlap 
can result in a reduction in processing time. Sim68, however, assumes that 
there is no overlap and no delay time for broadcasting instructions to the PEs. 
Therefore, either the CU is executing an instruction, or the PEs are, but never 
both at the same time. This assumption means that the execution times given 
are conservative and might be reduced if an instruction queue were used.
7.1.1.1. The MC68000 Parallel Assembler
All programming for sim68 is done in MC68000 assembly language. The 
parallel assembler used is called pa68. Pa68 is loosely based on the Digital 
Equipment Corporation Macro 11 assembler [Dec]. The major differences 
between pa68 and a typical serial assembler are:
1) Instructions executed by the CU begin with a “c_”, while PE 
instructions start with a “p_”.
2) Instruction opcodes may end with a .b, .w, or a .l depending on
whether the data operated on is byte (8-bits), word (16-bits), or 
long (32-bits).
3) The .word directive is used to define data in the CU and the PEs.
When defining data in the PEs, argument i of the .word directive
is stored in PE i-1. Therefore
.word 10,11,12,13
would store the value 10 in PE 0, 11 in PE 1, and so on.
4) Instructions starting with a capital letter such as Where(d0,EQ,d1)
and Shift(d1) are macros defined to simulate the functions with 
the same name in Flock Algol. These are discussed in Section
7.1.2.
5) Unlike some assemblers, the opcode is followed by the source operand,
which is followed by the destination operand as defined in 
[Mot79]. Therefore, p_mov.w d0,d1 moves the data in register d0 
to d1 in all active PEs.

are executed, the branch to label will occur if d1 is less than d0. 
This is the reverse of the normal convention. Note that p_blt 
does nothing since the CU must perform all the branching 
instructions.
Figure 7.1 is a sample listing of a Flock Algol algorithm. It is presented here 
as an example, and the details of its operation will be discussed in Section 7.2. 
It shows some of the features of Flock Algol and the conventions that will be 
used here in presenting algorithms and programs. The left most numbers in 
Figure 7.1 are the line numbers, while the next number on the line is the 
execution time, in µs, of the statement running on an 8 MHz MC68000.
The block of comments before the first numbered line is a standard header 
that appears before each major program. Each section of the header is 
described in the following list.
Program Name gives the name of the program. This is sometimes referred to 
if there are several programs that perform the same function.
Algorithm will give the figure number of the corresponding Flock Algol code if 
the program is an assembly language program. The Flock Algol listing 
will give the figure number of the algorithm it is implementing.
Machine will be the SIMD machine.
Function will give a brief description of what the program does.















Function:        This program preemphasizes the 
                 input speech data with a filter 
                 with the transfer function:
                 H(z) = 1 - coef * z^-1
Number of PEs:   N
Parameters:      coef, The filter coef. (default = 0.95).
Input format:    The input data is stored in
                 PEs 0 through N-1. PE i contains 
                 sample i for 0 ≤ i ≤ N-1.
Output:          The output data is stored in
                 PEs 0 through N-1. PE i contains 
                 sample i for 0 ≤ i < N.
Cycles:          130 + NetD
Typical time:    37 µs
Variable Usage:  (* means set by calling routine)
                 input: input data *

3       USE Shift +1
8       TRANSFER input TO tmp
7       WHERE ADDR = 0 DO    /* Get value from previous call */
0.5       tmp2 ← tmp
1.5       tmp ← oldvalue     /* Switch tmp and oldvalue */
1.5       oldvalue ← tmp2
2       ENDWHERE
12.75   output ← input - tmp * 0.95
Figure 7.1 Sample algorithm for an SIMD machine. The execution time assumes an 
8 MHz MC68000.
other important variables used by the program.
Number of PEs will list the number of PEs used by the SIMD machine. 
Parameters lists and describes the parameters that affect the execution times. 
Input tells how the input data is distributed among the PEs in the SIMD 
machine.
Output gives the corresponding information for the output data.
Cycles gives the number of machine cycles needed to process one input sample 
for the SIMD machine.
Typical Time gives the execution time in µs for 
a typical speech recognition system.
Figure 7.2 is a listing of the assembly language program written for pa68 
to implement the algorithm in Figure 7.1. The numbers on the left are the 
only part of the listing that would not appear as an input to pa68. They show 
how many cycles each instruction takes. To convert cycles to seconds, divide 
two by the clock rate and multiply by the number of cycles. Therefore, for an 
8 MHz clock, divide the number of cycles by four to get the execution time in 
µs.
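The conversion rule above is easy to check numerically. The following Python helper is a minimal sketch with a hypothetical name; it applies the stated rule (two divided by the clock rate, times the number of cycles) and reports the result in microseconds.

```python
def cycles_to_us(cycles, clock_hz=8_000_000):
    """Convert a sim68 cycle count to microseconds.

    Rule from the text: seconds = (2 / clock rate) * cycles.
    At 8 MHz this reduces to cycles / 4 microseconds.
    """
    return cycles * 2e6 / clock_hz
```

For the preemphasis routine, 130 cycles with no network delay gives 32.5 µs, and 130 + 18 = 148 cycles gives the 37 µs listed as the typical time.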
Everything to the right of a semicolon in Figure 7.2 is a comment. The 
comments written in boldface type are the Flock Algol statements which 
correspond to the assembler statements which follow them. The number to the 
left of the Flock Algol statement but to the right of the semicolon is the line 
number of the corresponding Flock Algol listing.
Lines starting with the string #include instruct pa68 to read in another 
file and process it. The speech processing programs commonly use the simd.h 
and the defs.h include files. The include file simd.h is listed in Figure A.2. All 
the data transfer registers, masking unit registers, and other special devices are 
memory mapped into the CU and PE address spaces. Simd.h defines where the 
various devices appear in the address spaces. It also defines macros for setting 
up the different interconnection functions and for data conditional masking. 
These are discussed later.
Figure A.3 is the listing of the include file defs.h. Defs.h contains 
definitions for the parameters used by the different speech processing programs.










Machine:         SIMD, simulated by a MC68000. 
Function:        This program preemphasizes the 
                 input speech data with a filter 
                 with the transfer function:
                 H(z) = 1 - coef * z^-1
Number of PEs:   N
Parameters:      coef, The filter coef. (default = 0.95).
Input format:    The input data is stored in
                 PEs 0 through N-1. PE i contains 
                 sample i for 0 ≤ i ≤ N-1.
Output:          The output data is stored in
                 PEs 0 through N-1. PE i contains 
                 sample i for 0 ≤ i < N.
Cycles:          130 + NetD
Typical time:    37 µs
Register usage:  (* means set by calling routine)
                 d0   pe  used by macros
                 d1   pe  tmp
                 d2   pe  used to swap tmp and old value
                 d7*  pe  WHOAMI (physical pe address)
                 a0*  pe  points to input signal
                 a1*  pe  points to output signal
#include "simd.h"
#include "defs.h"
; Data allocation for routine
.p_data                 ; Data stored in each PE

Shift(#1)               ; Set up interconnection network addresses
Figure 7.2 Sim68 program to perform preemphasis filtering. Numbers to left 






























TRANSFER input TO tmp 
p_mov.w (a0),d0
p_mov.w d0,DTRIN.w      ; transfer inputs from
                        ; PE i to PE i+1
NetworkDelay(0)
p_mov.w DTROUT.w,d1
WHERE ADDR = 0 DO /* Get value from previous call */
Where(d7,EQ,#0)         ; In PE 0, get value from last call
tmp2 ← tmp

; mult. by coef and save in d1.
; shift 15 to the right by shifting left one, 
; and swapping upper and lower words.
; d0 = d0 + coef * d1 





7.1.2. Simulating the Interconnection Network
Sim68 does not simulate a given interconnection network. Instead, each 
PE has access to the following three registers:
DTRDEST Physical PE address of destination.
DTRIN Input to the interconnection network.
DTROUT Output from the interconnection network.
The DTRDEST register allows any PE to talk to any other PE. Setting 
DTRDEST to the appropriate values in each PE allows any interconnection 
function to be simulated. The programs presented here use only the Shift, 
Cube, and Permutation functions as described in Section 2.4. To assist the 
programmer, the macros Shift(x), Cube(x), and Perm(x) define the given 
functions respectively. See Figure A.2 for the actual macro definitions.
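The DTRDEST mechanism can be illustrated with a small Python model. This is not the sim68 implementation or the macros of Figure A.2: the function names are hypothetical, and the definitions used for Shift (PE i sends to PE (i + x) mod N) and Cube (PE i sends to the PE whose address differs in one bit) are assumptions based on the usual meaning of these interconnection functions, since Section 2.4 is not reproduced here.

```python
def transfer(dtr_in, dtr_dest):
    """Deliver each PE's DTRIN value to the PE named in its DTRDEST register.

    Returns the list of DTROUT values, indexed by PE number.
    """
    dtr_out = [None] * len(dtr_in)
    for pe, value in enumerate(dtr_in):
        dtr_out[dtr_dest[pe]] = value
    return dtr_out

def shift_dest(n, x):
    """DTRDEST settings for a Shift by x (assumed to wrap around modulo N)."""
    return [(pe + x) % n for pe in range(n)]

def cube_dest(n, bit):
    """DTRDEST settings for a Cube function on the given address bit (assumed)."""
    return [pe ^ (1 << bit) for pe in range(n)]
```

Because every PE names its own destination, any interconnection function that is a permutation of the PE addresses can be modeled by choosing a different DTRDEST list.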
Most interconnection networks take some time for data to travel from the 
input to the output. The macro NetworkDelay() is defined to be a nop (no 
operation, i.e., an operation that does nothing) whose execution time is the 
same as the typical network transfer time. This value is assumed to be 18 
cycles, or 4.5 µs, based on the information in [BaLu81,BrSi82]. The 
interconnection network may have a transfer time as fast as 500 ns for a 16-bit word 
[Ku84]. If such a network is used, or the transfers are overlapped with the 
execution time, the effective network delay could be zero. Therefore the case where the 
network delay is zero is also presented in many of the tables.
Some algorithms require the CU to make conditional branches based on 
data stored in the PEs; therefore there is a data path between PE 0 and the 
CU. Anything PE 0 writes into memory location TOCU will appear at the CU 
in memory location FROMPE0 after one network delay time.
7.1.3. Simulating Broadcasts
Sim68 simulates broadcasts from the CU to all PEs by using self modify­
ing code. The following two instructions will broadcast the data of size word 




The first instruction writes the data in dO into the memory location containing 
the immediate data for the PE instruction. When the second instruction is 
broadcast to all active PEs, the new data goes with it. No additional network 
delays are encountered using this method. The macro Broadcast (in,out) is 
defined to broadcast data from register in in the CU to register out in the PEs 
using the above method.
7.1.4. Data Conditional Masking
Although the SIMD machine model presented in Chapter 2 includes both 
PE address masking and data conditional masking, sim68 simulates only data 
conditional masking. It uses a mask stack as presented in [ClSi83]. The fol­
lowing example shows how it is performed.
Suppose the following code is to be performed:
1 WHERE A>B DO
2    C ← A
3 ELSEWHERE
4    C ← B
5 ENDWHERE
Line 1 is executed first in the active PEs by comparing A and B: 
p_cmp B,A
Next the flags set by the comparison are moved to the PE condition codes 
register (PECCR) of the masking unit:
p_mov.w sr,PECCR
Now the masking unit is given the desired condition: 
p_mov.b #GT,PECCS
The PE condition code select register (PECCS) tells the masking unit which 
condition must be met. At this point, all previously active PEs are still active. 
The CU now tells the masking unit to logically AND the negative of the 
current condition with the top of the mask stack and push the results on the 
mask stack. This is done by writing the proper code to the mask control 
register (MASKCTL):
c_mov.w #Pushs+NDataCond,MASKCTL
The negative of the condition enables the PEs for the ELSEWHERE condition. 
Next, the positive condition code is logically ANDed with the value second 
from the top of the mask stack and pushed on the mask stack:
c_mov.w #Pushss+DataCond,MASKCTL
Now the PEs are enabled for the WHERE condition. The statements for line 2 
are now executed in those PEs where the condition is true. The ELSEWHERE 
on line 3 is performed by popping the top of the mask stack:
c_mov.w #Pop+DataCond,MASKCTL
Then the statements of line 4 are executed. Finally, line 5 is executed by 
again popping the mask stack:
c_mov.w #Pop+DataCond,MASKCTL 
Now all the PEs that were active before line 1 are again active.
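The push and pop sequence described above can be mirrored in a few lines of Python. This is a behavioral sketch only: the class and function names are hypothetical, masks are modeled as lists of booleans, and nothing here reflects the actual register encodings of the masking unit.

```python
class MaskStack:
    """Minimal model of a mask stack for data conditional masking."""

    def __init__(self, n):
        # Initially every PE is active.
        self.stack = [[True] * n]

    def active(self):
        """Mask on top of the stack: which PEs execute the next instruction."""
        return self.stack[-1]

    def where(self, cond):
        enclosing = self.stack[-1]
        # First push: NOT cond AND top of stack (the ELSEWHERE mask).
        self.stack.append([e and not c for e, c in zip(enclosing, cond)])
        # Second push: cond AND the mask second from the top (the WHERE mask).
        self.stack.append([e and c for e, c in zip(enclosing, cond)])

    def elsewhere(self):
        # Popping exposes the ELSEWHERE mask pushed earlier.
        self.stack.pop()

    def endwhere(self):
        # Popping again restores the mask in force before the WHERE.
        self.stack.pop()

def where_example(a, b):
    """Run: WHERE A>B DO C <- A ELSEWHERE C <- B ENDWHERE on simulated PEs."""
    n = len(a)
    masks = MaskStack(n)
    c = [None] * n
    masks.where([x > y for x, y in zip(a, b)])
    for pe in range(n):
        if masks.active()[pe]:
            c[pe] = a[pe]
    masks.elsewhere()
    for pe in range(n):
        if masks.active()[pe]:
            c[pe] = b[pe]
    masks.endwhere()
    return c, masks.active()
```

Because the ELSEWHERE mask is ANDed with the enclosing mask before being pushed, nested WHERE blocks never re-enable a PE that an outer block disabled.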
Sim68 assumes that if all PEs should be disabled during a WHERE or an 
ELSEWHERE condition, the statements in that block will take no time to exe­
cute. This means the hardware must be able to detect that all PEs are dis­
abled and ignore all PE instructions until some PEs are enabled again.
In most cases some PEs will execute the WHERE block, while some will 
do the ELSEWHERE block, making the execution time the total of both 
blocks plus the time for enabling and disabling the appropriate sets of PEs.
7.1.5. The Typical Speech Recognition System
The programs in the rest of the chapter frequently reference a typical 
speech recognition system. Table 7.1 lists the parameters for the typical sys­
tem as used here. These parameters are for a high quality speech recognition 
system. Most speech recognition systems use 12-bit input samples rather than 
the 16-bit samples as shown in the table. Also, many high quality systems use 
an input data rate of 15 KHz, while this system can process data at 20 KHz. 
This system was chosen to be a conservative system, therefore, it requires more 
processor throughput than many high quality speech recognition systems.
Table 7.1 Parameters for the typical speech recognition system.








DTW Warping Path Width 











Execution times that are listed for the typical system are in µs and assume the 
MC68000 uses an 8 MHz clock and data takes 4.5 µs to travel through the 
interconnection network.
7.1.6. Execution Times
Execution times for all sim68 simulations are given in cycles. This paper 
assumes that the MC68000 runs at an 8 MHz clock rate, which gives a 
register-to-register addition time of 0.5 µs for a word (16-bit) data size, or 1 µs for a 
long (32-bit) data size. A 16 by 16 bit signed multiply takes 8.75 µs.
For each program an expression for the execution time is derived in terms 
of the parameters of the program. These times are given in terms of:
autocoef  The number of autocorrelation coefficients used for LPC.
M         The number of samples per LPC frame.
I         The number of frames output from the ltw routine.
N         The number of PEs the given program uses.
logN      ⌈log2 N⌉
NetD      The network delay time in cycles.
p         The number of LPC coefficients.
r         The width of the dtw warping path.
In most speech processing systems p = autocoef - 1. The times are given in an 
expanded form, for example:
cycles = 10 + autocoef [(24+NetD) + 85 +
         (54 + 2NetD)logN + 2 + 19] - 23 - NetD + 1
Each term corresponds roughly to the execution time between adjacent labels 
in the program being considered. In the example above, (54 + 2NetD)logN+2 
would correspond to a loop that executed log N times and contained two 
network transfers. These times do not include the overhead of a main program 
calling or returning from the given program.
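To show how such an expanded expression is evaluated, the following Python sketch computes the example above for given parameter values. The function name and the sample values in the usage note are hypothetical, and logN is taken to be the ceiling of log2 N, which is an assumption about the intended rounding.

```python
def example_cycles(autocoef, N, NetD):
    """Evaluate the example execution-time expression, in cycles.

    cycles = 10 + autocoef[(24+NetD) + 85 + (54+2NetD)logN + 2 + 19]
             - 23 - NetD + 1
    """
    # ceil(log2 N) computed with integer arithmetic (assumed rounding).
    logN = (N - 1).bit_length()
    per_coefficient = (
        (24 + NetD)                    # one network transfer
        + 85
        + (54 + 2 * NetD) * logN + 2   # loop run logN times, two transfers each
        + 19
    )
    return 10 + autocoef * per_coefficient - 23 - NetD + 1

def cycles_to_microseconds(cycles):
    """At the assumed 8 MHz clock, four cycles take one microsecond."""
    return cycles / 4.0
```

For instance, with autocoef = 10, N = 128, and NetD = 18, the expression evaluates to 7,750 cycles, or 1,937.5 µs at 8 MHz.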
7.1.7. Summary
Sim68 does a good job of simulating an SIMD machine. The important 
things to know about the simulations are:
1) All Flock Algol times are given in µs assuming an 8 MHz clock and a 4.5 µs
network delay time.
2) All pa68 times are given in cycles. Divide cycles by 4 to convert to µs.
3) If all PEs are disabled, the PE instruction takes no time to execute.
4) The times are conservative because of the assumption that CU and PE
instructions are not overlapped.
7.2. Digital Preemphasis Filtering
This section presents the SIMD implementation of the Flock Algol algo­
rithm for preemphasis filtering. The filter transfer function is:
H(z) = l~az_1
where typically a ^ .95. The preemphasis filter is used on the input speech 
data before autocorrelation analysis is done. To process telephone quality 
speech in real time, the filtering program must be able to filter 6,670 8-bit sam­
ples per second. Filtering high quality speech requires a sampling rate of 15 to 
20 KHz using 11 to 12 bits per sample.
Figure A.4 is a parallel MC68000 program to perform the preemphasis 
filtering on an SIMD machine as discussed in Section 6.1.8. The program uses 
16-bit samples and N PEs. It assumes the speech data is stored in the PEs 
before the program is executed., Sample i is stored in PE i for 0 < i < N, 
where N is the number of PEs used. The output data uses the same arrange­
ment as the input data. The total execution time is
130 + NetD,
Where NetD is the network delay time in cycles. This time does not include 
approximately 26 cycle overhead of calling and returning from the routine. 
Table 7.2 lists the sampling rates using different network delays and different 
numbers of PEs. The parameters than are being changed are shown in bold­
face type. Using one PE may be fast enough since Table 7.2 shows that one 
PE can process data at a sample rate of 27 KHz which is greater than the rate 
needed for high quality speech processing. This is a lower bound on the max­
imum sampling rate since if the algorithm uses only one PE, the conditional 
masking can be replaced by branching instructions and the network transfers 
are not needed.
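As an illustration of the arithmetic above, the following Python sketch applies the difference equation implied by H(z), y(n) = x(n) - a x(n-1), and reproduces the sampling rates that follow from the 130 + NetD cycle count. The function names, the zero initial condition for x(-1), and the default of 4 cycles per μs are assumptions made for this sketch; it is not the MC68000 program of Figure A.4.

```python
def preemphasis(samples, a=0.95):
    """Apply y(n) = x(n) - a*x(n-1), assuming x(-1) = 0."""
    out = []
    prev = 0
    for x in samples:
        out.append(x - a * prev)
        prev = x
    return out


def max_sample_rate_khz(num_pes, net_d, cycles_per_us=4):
    """Rate implied by one 130 + NetD cycle pass that filters num_pes samples."""
    cycles = 130 + net_d
    time_us = cycles / cycles_per_us
    return num_pes * 1000.0 / time_us


if __name__ == "__main__":
    print(preemphasis([100, 120, 90]))
    for net_d in (0, 18):
        print(net_d, int(max_sample_rate_khz(1, net_d)), "KHz with one PE")
```

With one PE the sketch gives 30 KHz for NetD=0 and 27 KHz for NetD=18 after truncation, and the rate scales linearly with the number of PEs because every PE filters its own sample in the same pass.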
Table 7.2 Sampling rates for the SIMD preemphasis program using 16-bit
signed data.

Program                  Preemphasis Filter
N                            1      10     100       1      10     100
Number of PEs                1      10     100       1      10     100
NetD                         0       0       0      18      18      18
Transfers                    1       1       1       1       1       1
Cycles                     130     130     130     148     148     148
Time/Sample (μs)          32.5    32.5    32.5      37      37      37
Max Sample Rate (KHz)       30     300   3,000      27     270   2,700
7.2.1. Summary
This section presented a parallel preemphasis filter program. It is able to 
process speech in real time using as few as 1 PE. By using more PEs, the pro­
gram can process data at a higher sampling rate. This program assumes that 
the data was already in the PEs before the program is executed. This is a 
valid assumption if the program calling the filter program has already loaded 
the data.
The MC68000 processor is well suited for this type of speech processing 
since speech data typically uses 12 to 16 bits per sample. The 16 by 16 signed 
multiplication instruction and the 16-bit signed addition instruction allow the 
MC68000 to compute the filtered signal quickly.
Filtering usually precedes the computation of autocorrelation coefficients. 
The next section presents the autocorrelation program and shows how it will 
work with the preemphasis filtering program to process speech.
7.3. Simulation of the Autocorrelation Algorithm
Autocorrelation plays an important role in many isolated word recognition 
systems. It is used to find the short term autocorrelation coefficients which are 
then used to find the LPC coefficients. Autocorrelation, as used here, is defined
R(i) = \sum_{k=0}^{M-i-1} x(k) x(k+i),    0 ≤ i < autocoef
where R(i) are the autocorrelation coefficients and x(n) is the input signal.
For speech processing M ranges from 100 to 300 samples, while autocoef is 
between 8 and 16 [Myer80]. For the typical system, M = 100 and autocoef=9.
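The definition can be checked with a short Python sketch. The first function evaluates the sum directly. The second imitates the data movement of the parallel version under stated assumptions: one sample per PE, a shift by one PE per coefficient, and a mask that disables PEs with address M-i or higher. An ordinary sum stands in for the recursive doubling step, and all names are hypothetical.

```python
def autocorrelation(x, autocoef):
    """R(i) = sum over k = 0 .. M-i-1 of x(k) * x(k+i), for 0 <= i < autocoef."""
    m = len(x)
    return [sum(x[k] * x[k + i] for k in range(m - i)) for i in range(autocoef)]


def simd_style_autocorrelation(x, autocoef):
    """Same result, organized as one value per simulated PE."""
    m = len(x)
    shifted = list(x)          # shifted[pe] holds x(pe + i) after stage i
    result = []
    for i in range(autocoef):
        if i != 0:
            # Each PE receives the value held by its right-hand neighbor.
            shifted = shifted[1:] + [0]
        # Only PEs with address < M - i contribute a product.
        products = [x[pe] * shifted[pe] if pe < m - i else 0 for pe in range(m)]
        result.append(sum(products))   # stands in for recursive doubling
    return result


if __name__ == "__main__":
    frame = [3, -1, 4, 1, -5, 9, 2, -6]
    print(autocorrelation(frame, 3))
    print(simd_style_autocorrelation(frame, 3))
```

Both functions return the same coefficients, which is a quick way to confirm that the masked, shifted formulation computes the sum given in the definition.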
In this section, Siegel’s autocorrelation algorithm, discussed in Section
5.1.1, is converted to a MC68000 assembly language program and sim68 is used 
to simulate an SIMD machine executing the program. Figure A.5 is a listing of 
the program with the execution times, in cycles, on the left, and the 
corresponding Flock Algol statements as comments in boldface. This program 
assumes 16-bit input data and keeps a 32-bit sum. In general, the total execu­
tion time is:
cycles = 10 +
         (autocoef)[(30 + NetD) + 85 + (54 + 2NetD)logM + 2 + 19]
         - 23 - NetD + 1
       = (autocoef)[136 + NetD + (54 + 2NetD)logM] - 12 - NetD
Each number in the first line roughly represents the execution time between 
adjacent labels in Figure A.5.
Table 7.3 gives the execution times for a typical speech application.
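The simplified cycle count is easy to evaluate for the typical parameters. The Python sketch below is a convenience for checking the arithmetic, assuming logM means the base-2 logarithm rounded up and that 4 cycles correspond to 1 μs; the function names are hypothetical.

```python
import math


def auto_cycles(autocoef, m, net_d):
    """Cycle count of the autocorrelation program from the simplified formula."""
    log_m = math.ceil(math.log2(m))
    return autocoef * (136 + net_d + (54 + 2 * net_d) * log_m) - 12 - net_d


def cycles_to_us(cycles, cycles_per_us=4):
    """Convert cycles to microseconds, assuming 4 cycles per microsecond."""
    return cycles / cycles_per_us


if __name__ == "__main__":
    for net_d in (0, 18):
        c = auto_cycles(9, 100, net_d)
        print(net_d, c, cycles_to_us(c))
```

For autocoef=9 and M=100 the sketch gives 4,614 cycles with NetD=0 and 7,026 cycles with NetD=18.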
Table 7.3 Execution time for autocorrelation program using 16-bit signed
inputs and a 32-bit signed sum.

Program               auto                    auto + filter
autocoef                  9          9            9          9
M                       100        100          100        100
logM                      7          7            7          7
Number of PEs           100        100          100        100
NetD                      0         18            0         18
Transfers               134        134          135        135
Cycles                4,614      7,026        4,744      7,174
Time               1,153 μs   1,757 μs     1,186 μs   1,794 μs
Time/Sample        11.53 μs   17.57 μs     11.86 μs   17.94 μs
Max Sample Rate      86 KHz     56 KHz       84 KHz     55 KHz

Program               auto                    auto + filter
autocoef                 17         17           17         17
M                       100        100          100        100
logM                      7          7            7          7
Number of PEs           100        100          100        100
NetD                      0         18            0         18
Transfers               134        134          135        135
Cycles                8,726     13,316        8,856     13,464
Time               2,182 μs   3,329 μs     2,214 μs   3,366 μs
Time/Sample        21.82 μs   33.29 μs     22.14 μs   33.66 μs
Max Sample Rate      45 KHz     30 KHz       45 KHz     29 KHz
7.3.1. Effects of NetD on Execution Times
Selecting a value for NetD is difficult. The execution summaries use the 
values 0 and 18 cycles. 0 is used for a small or negligible delay [Ku84] or when 
the network transfer is overlapped with the instruction execution. 18 cycles, 
which is 4.5 μs, is the value used in [BrSi82]. Another approach is to ask
“What is the maximum value NetD can have and still allow the program to 
run in real time?” Combining the filtering and autocorrelation programs, as 
they would be in a typical speech system, gives an execution time of 4,744
cycles to process 100 samples for autocoef=9. There are 200 cycles between
samples when using a 20 KHz sampling rate, so a 100-sample frame spans 20,000
cycles and the transfers can use 20,000-4,744=15,256 of them. The programs use
135 transfers, so each transfer can take 113 cycles or 28 μs per 16-bit word.
For example, the Poker system [Snyder82b] requires 12 μs per byte, or 24 μs per
16-bit word for transfers, which is less than the maximum delay of 28 μs. An
effective sampling rate of 85 KHz with no network delay is reduced to 20 KHz if
the network delay is 28 μs per 16-bit word. This algorithm can tolerate a slow
interconnection network and still process high quality speech in real time if
autocoef=9. If autocoef=17, then 8,856 cycles are used, leaving 11,144 cycles
for the 256 transfers, which is 43 cycles (10 μs) per transfer.
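The delay budget in this subsection reduces to one division. The Python sketch below restates it under the same assumptions as the text (100 samples per frame and 200 cycles between samples at 20 KHz); the function name and parameter names are hypothetical.

```python
def max_transfer_delay(compute_cycles, transfers,
                       samples_per_frame=100, cycles_between_samples=200):
    """Largest whole number of cycles each transfer may take in real time."""
    frame_cycles = samples_per_frame * cycles_between_samples
    return (frame_cycles - compute_cycles) // transfers


if __name__ == "__main__":
    # autocoef = 9: filter + autocorrelation with NetD = 0, 135 transfers
    print(max_transfer_delay(4744, 135))
    # autocoef = 17: 8,856 cycles, 256 transfers
    print(max_transfer_delay(8856, 256))
```

The two calls give 113 and 43 cycles per transfer, which correspond to roughly 28 μs and 10 μs at 4 cycles per μs.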
7.3.2. Using Fewer PEs
The algorithm, as presented, must use as many PEs as there are samples 
in each frame. In a typical speech recognition system the frame size ranges 
from 100 to 400 samples which means 100 to 400 PEs must be used. The algo­
rithm (auto/2) in Figure 7.3 can find the autocorrelation coefficients of a 
M=2N sample frame using N PEs. Before execution, PE i contains samples i and
i + N/2 for 0 ≤ i < N. As before, the data is shifted between the PEs so that
when autocorrelation coefficient j is being computed, PE i contains samples i
and i+j, and samples i+N/2 and i+j+N/2. Since each PE contains two samples,
two transfers must be used to get this arrangement. The product of samples
i and i+j for 0 ≤ i < 2N is found using two multiplication steps per PE.





Function:        This program finds the autocorrelation
                 coefficients of input speech data using
                 half as many PEs as samples in a frame.
Number of PEs:   N
Transfers:       Shift(-1), Cube
Masking:         Data Conditional
Parameters:      autocoef, The number of coefs. to find.
                 N, The number of PEs in use.
                 NetD, The interconnection network
                       delay time in cycles.
Input:           The input data is stored in PEs 0 through N-1
                 with PE i containing sample i and i+N/2
                 for 0 ≤ i < N.
Output:          The autocorrelation coefficients, R(i),
                 for 0 ≤ i ≤ autocoef-1 appear in PE i
                 for 0 ≤ i < N (i.e. each PE contains
                 every coefficient).
Cycles:          autocoef[136 + NetD + (54 + 2NetD)logN] - 12 - NetD
Typical Time:    1,757 μs for autocoef=9, NetD=18, and logN=7.
Variable Usage:  (* means set by calling routine)
    ADDR:     Address of PE (e.g. ADDR = 0 in PE 0).
    L:        on completion, PEs 0-L will contain R(i).
    partsum:  temporary variable holding a partial sum.
    R():      autocorrelation coefficients.
    sig1:*    first half of input signal (sample i)
    sig2:*    second half of input signal (sample i + N/2)
    slast1:   after stage i: “slast” in PE m holds sig(m+i).
    slast2:   after stage i: “slast” in PE m
              holds sig(m + N/2 + i).

Line  Time in μs
  1    1.75    slast1 ← sig1      /* After stage i, “slast” in
  2                                  PE m holds sig(m+i) */
  3    1.57    slast2 ← sig2      /* After stage i, “slast” in
  4                                  PE m holds sig(m+i) */
  5
  6    5       FOR i ← 0 TO p DO
  7    1.5       IF i ≠ 0 THEN
  8    3           USE Shift(-1)
  9    1.5         DTRin ← slast1
 10    4.5         TRANSFER
 11    2.0         slast1 ← DTRout
 12    1.5         DTRin ← slast2
 13    4.5         TRANSFER
 14    2.0         slast2 ← DTRout
Figure 7.3 Algorithm for autocorrelation using N PEs for a frame size of 2N.




 17    0.5           tmp ← slast1
 18                  slast1 ← slast2
 19                  slast2 ← tmp
 20                END WHERE
 21
 22    0.5       partsum ← 0
 23
 24    6.5       WHERE ADDR < M-i DO
 25   10.75        partsum ← slast2 * sig2
 26    2         END WHERE
 27
 28   10         partsum ← partsum + slast1 * sig1
 29
 30    2.25      FOR j ← 0 TO max(⌈log(M-i)⌉-1, log(L-1))
 31    3           USE Cube(j)
 32   12.5         TRANSFER partsum TO tmp
 33    0.75        partsum ← tmp + partsum
 34    1.5       R(i) ← partsum
Figure 7.3 (Continued)
Figure A.6 is a listing of the corresponding program. The time complexity 
for auto/2 is:
cycles = 18 + (autocoef)[(84 + 2NetD) + 87 + 44 + (54 + 2NetD)logN + 2 + 19]
         - 77 - 2NetD + 1
       = (autocoef)[236 + 2NetD + (54 + 2NetD)logN] - 58 - 2NetD
In the proposed speech recognition system using 100 PEs, the autocorrelation
program uses 7,026 cycles when autocoef=9, the frame size is 100 samples,
and NetD=18. If 50 PEs are used, auto/2 uses 7,214 cycles, which is a sampling
rate of about 55 KHz. Auto/2, using 50 PEs, requires 188 cycles more
than auto, using 100 PEs, which is about 3% more. This is a surprisingly small
increase in execution time. Examining the time complexity equations for auto
and auto/2 shows that auto requires 136 + NetD cycles to perform the Shift 
transfer and find the product of two samples. Auto/2 requires 236 + 2NetD 
cycles to compute the same values, therefore needing almost twice as many 
cycles. Auto requires (54 + 2NetD)logM cycles to find the sum of the products 
using recursive doubling, while auto/2 uses (54+2NetD)logN cycles where
N=M/2. Therefore since auto/2 has two samples per PE, it requires one less 
pass through the interconnection network, so it uses 54+2NetD fewer cycles to 
compute the sum. The time saved by auto/2 having two samples per PE is 
slightly less than the extra time it uses to compute the product of two samples 
per PE, therefore there is only a slight increase in the total computation time.
The same techniques that converted auto to auto/2 can be applied to 
further reduce the number of PEs used, while increasing the execution time. In 
general, if there are more samples per frame than PEs, the algorithm can
be modified so each PE will compute ⌈M/N⌉ products, where M is the number
of samples per frame and N is the number of PEs.
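The trade-off between auto and auto/2 can be reproduced from the two simplified cycle formulas. The Python sketch below is a check of that arithmetic only, assuming base-2 logarithms rounded up; the function names are hypothetical.

```python
import math


def auto_cycles(autocoef, m, net_d):
    """auto: one sample per PE, M PEs."""
    log_m = math.ceil(math.log2(m))
    return autocoef * (136 + net_d + (54 + 2 * net_d) * log_m) - 12 - net_d


def auto2_cycles(autocoef, n, net_d):
    """auto/2: two samples per PE, N = M/2 PEs."""
    log_n = math.ceil(math.log2(n))
    return autocoef * (236 + 2 * net_d + (54 + 2 * net_d) * log_n) - 58 - 2 * net_d


def products_per_pe(m, n):
    """Products each PE computes when M samples are spread over N PEs."""
    return math.ceil(m / n)


if __name__ == "__main__":
    full = auto_cycles(9, 100, 18)
    half = auto2_cycles(9, 50, 18)
    print(full, half, half - full, round(100.0 * (half - full) / full, 1))
    print(products_per_pe(100, 50))
```

For autocoef=9 and NetD=18 the sketch gives 7,026 cycles for auto and 7,214 cycles for auto/2, a difference of 188 cycles or about 2.7%. The extra multiply and transfer per coefficient is nearly cancelled by the one recursive doubling pass that is saved.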
7.3.3. Increasing the Throughput Through Serialism
The previous section showed that using half as many PEs resulted in only 
a 3% increase in the execution time. This result can be used to increase the 
throughput while using the same number of PEs. Suppose a system uses 100 
samples per frame, and has 100 PEs. The execution time will be 7,026 cycles if
autocoef=9 and NetD=18. The system can process two frames at a time if
PEs 0 through 49 process the first frame, and PEs 50 through 99 process the
second frame using a modified version of auto/2. The total execution time will
be roughly 7,214 cycles (there will be some overhead due to processing two
frames at a time). The average execution time per frame is then 7,214/2 =
3,607 cycles, which is 52% of the cycles used when processing only one frame at
a time.
The above technique could be repeated until 100 frames are being processed
in parallel with the 100 PEs doing one frame each. This will certainly
increase the throughput, but it will also greatly increase the delay between the
time a sample enters the system and the time the autocorrelation coefficients
are computed. This is probably not appropriate for an environment in which 
real-time processing is desired.
7.3.4. Summary
This section presented a program implementing a parallel autocorrelation 
algorithm. Using M=N PEs it can find the first autocoef=9 autocorrelation
coefficients of an M=100 sample frame of speech in 1.7 ms. This gives an
effective sample rate of 56 KHz, which is more than sufficient for high quality
speech processing. Each additional coefficient computed takes 194.5 μs.
Combining autocorrelation with the preemphasis filter program from the previous
section gives a sampling rate of 55 KHz which is more than enough for high 
quality speech recognition. Some high quality speech processing uses auto­
coef =17 coefficients, which gives a sampling rate of 29 KHz which is still more 
than enough for high quality speech.
The input data is arranged with one 16-bit sample per PE with PE i con­
taining sample i for 0 < i < N. This is the same as the output format of the 
filter program. The output has PEs 0 through autocoef containing all the 
autocorrelation coefficients.
Fewer PEs than samples in a frame can be used without greatly increasing 
the execution time. Although the throughput can be increased by computing 
several frames in parallel using fewer PEs per frame, the delay time between
an input and an output will increase.
The hardware is well suited for this problem since it has 16 by 16-bit
signed multiplication and 32-bit addition instructions. These built-in
instructions, which perform operations on data the same size as the problem’s
data size, make programming the SIMD machine a straightforward task.
7.4. Simulation of the Linear Prediction Algorithm
Linear predictive coding (LPC) is frequently used in both speech synthesis 
and recognition. The LPC coefficients model the vocal tract as an all pole
filter, and the error signal from the coding models the excitation of the vocal
cords. A speech recognition system divides the speech signal into 10 to 20
ms frames and finds the LPC coefficients for each frame. Therefore, a real-time
system that inputs data at 10 KHz to 20 KHz must process one frame of
between 100 and 400 samples every 10 to 20 ms. Generally 16-bit coefficients
are used, but some applications can use as few as 10 bits [MaGr74].
Figure A.7 is the listing of a program that finds the LPC coefficients given 
the autocorrelation coefficients. It is based on the algorithm in Figure 5.7. 
The input data is arranged so each PE contains all the autocorrelation 
coefficients (R(i) for 0 ≤ i < autocoef). The output data has LPC coefficient i
stored in PE i-1 for 1 ≤ i ≤ p.
The program uses fixed point arithmetic. The position of the decimal point 
is shown in the right column. The code d# ~x,y means that in register d#, x
bits are to the left of the decimal point, and y bits are to the right.
The total execution time for the program is:
cycles = 26 + p[92 + (54 + 2NetD)log(p) + 2 + 112 +
         125 + 88 + 81 + NetD + 13] - NetD - 65 + 1
       = p[513 + NetD + (54 + 2NetD)log(p)] - 38 - NetD
where each number in the first line roughly represents the time between labels 
in Figure A.7. Table 7.4 gives the execution times for a typical speech applica­
tion. Computing the LPC coefficients alone can be done at a rate of 62 KHz 
assuming 100 samples per frame, 8 coefficients and NetD = 18 using 8 PEs. A 
typical speech processing system would preemphasize the signal and find the 
autocorrelation coefficients before finding the LPC coefficients. Using the pre­
vious filtering and autocorrelation programs, this can be done with a sampling
Table 7.4 Execution times for LPC program and filter + auto + lpc programs

Program               LPC                     filter + auto + LPC
p                         8          8            8          8
M                       100        100          100        100
Number of PEs             8          8          100        100
NetD                      0         18            0         18
Transfers                55         55          190        190
Cycles                5,362      6,352       10,106     13,526
Time               1,341 μs   1,588 μs     2,527 μs   3,382 μs
Time/Sample        13.41 μs   15.88 μs     25.27 μs   33.82 μs
Max Sample Rate      74 KHz     62 KHz       39 KHz     29 KHz

Program               LPC                     filter + auto + LPC
p                        16         16           16         16
M                       100        100          100        100
Number of PEs             8          8          100        100
NetD                      0         18            0         18
Transfers               143        143          399        399
Cycles               11,626     14,200       20,482     27,664
Time               2,907 μs   3,550 μs     5,121 μs   6,916 μs
Time/Sample        29.07 μs   35.50 μs     51.21 μs   69.16 μs
Max Sample Rate      34 KHz     28 KHz       19 KHz     14 KHz
rate of 29 KHz, which is sufficient for high quality speech.
A sample rate of 20 KHz and a frame size of 100 samples gives 20,000
cycles between frames. The three programs use 10,052 cycles, leaving at most
9,948 cycles for network delays. Since 190 transfers are used, each can take 52
cycles, or 13 μs per 16-bit word, and still process speech in real time.
Table 7.4 shows that if p = 16 coefficients are used and a 4.5 μs NetD is
assumed, the programs can process data at 14 KHz which is too slow for most 
high quality speech processing. If the network transfers are fast, or overlapped 
with the instruction execution so that NetD=0, the speech data can be pro­
cessed at 19 KHz which is in the range of 15 KHz to 20 KHz used most often 
for high quality processing.
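The LPC cycle counts and the resulting sampling rates can be recomputed from the simplified formula. The Python sketch below assumes log(p) is the base-2 logarithm rounded up, 100 samples per frame, and 4 cycles per μs; the function names are hypothetical, and the filter and autocorrelation cycle counts are taken as given from Table 7.3.

```python
import math


def lpc_cycles(p, net_d):
    """Cycle count of the LPC program from the simplified formula."""
    log_p = math.ceil(math.log2(p))
    return p * (513 + net_d + (54 + 2 * net_d) * log_p) - 38 - net_d


def max_rate_khz(total_cycles, samples_per_frame=100, cycles_per_us=4):
    """Sampling rate at which one frame is processed in one frame time."""
    frame_time_us = total_cycles / cycles_per_us
    return samples_per_frame * 1000.0 / frame_time_us


if __name__ == "__main__":
    lpc = lpc_cycles(8, 18)
    print(lpc, int(max_rate_khz(lpc)))
    # filter + auto (7,174 cycles for autocoef = 9, NetD = 18) followed by LPC
    print(int(max_rate_khz(7174 + lpc)))
```

For p=8 and NetD=18 the sketch gives 6,352 cycles and 62 KHz for LPC alone, and 29 KHz when the filter and autocorrelation programs are included.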
7.4.1. Summary
This section presented a parallel program for computing LPC coefficients 
from autocorrelation coefficients. It is able to process data at a rate of 62 K 
samples per second assuming a 100 sample frame, 8 LPC coefficients, and a 
network delay of 4.5 μs per 16-bit word. LPC analysis is usually preceded by
preemphasis filtering and autocorrelation. The processing rate for these three 
programs, using the conditions above, is 29 KHz. This is sufficient for real-time 
processing of high quality speech. A network delay of up to 13 μs per 16-bit
word can be tolerated and still process at the 20 KHz rate needed for high 
quality speech.
This program uses fixed-point arithmetic and computes coefficients with 
16-bit precision. The program uses approximately 7% of the coefficient calcu­
lation time to rotate the data so the decimal point is in the correct position.
This is a small overhead for implementing fixed point arithmetic.
The LPC program uses both the Cube and Perm interconnection functions 
and is the only program to use the Perm function. It is possible the intercon­
nection network will not be able to perform the Perm function directly, but 
instead will use multiple passes through the network. Since the Perm function 
is used p times and it may take p passes through the interconnection network 
to implement it, p(p-1) additional network delays may be added to the execution
time. For the typical system this is roughly (8)(7)(4.5 μs) = 252 μs. This
is about a 16% increase over the original time.
7.5. Simulation of Linear Time Warping (LTW) Algorithms
In a typical isolated word recognition system, linear time warping occurs 
after the endpoint detection and before the dynamic time warping. Its purpose 
is to take an utterance of variable length and linearly stretch or shrink it, in 
the time domain, until it is a fixed length. Isolated utterances can range from 
20 to 80 frames in length in a typical system, where a frame consists of 8 LPC 
coefficients. Some systems will stretch or shrink the utterance to a 40 frame 
length. Only after the endpoint routines detect an utterance can the LTW 
program process the speech data. Since isolated words are about one third to 
one half second in duration, the LTW must be able to perform its operation in 
about 300 to 500 ms.
Two LTW algorithms were presented in Section 6.3. Method one places
one frame per PE and moves the data between the PEs to do the warping. 
Method two has one coefficient from each frame in each PE and gets its speed 
by doing the vector operations in parallel. The following sections present programs
implementing each algorithm and give timing information for each.
7.5.1. Method One — One Frame per PE
Figure A.8 is a program for performing method one. The input data is 
arranged so PE j contains frame j for 0 ≤ j < J, where J is the number of
frames in the input utterance and each frame consists of p LPC coefficients.
After processing, PE i contains frame i for 0 ≤ i < I, where I is the new utterance
length. In a typical system 20 ≤ J ≤ 80 and I=40, so the number of PEs
is the maximum of J and I.
The time complexity for method one in Figure A.8 is:
cycles = 7 + 210 + 80 + p(29 + NetD) + 2 + 10 + 109p + 2 + 6 + 6
         + {42 + NetD + [29 + NetD]p + 2 + 15}(J-I) + 2
       = 325 + (138 + NetD)p + (J-I)[59 + NetD + (29 + NetD)p]
if J>I. If J=I the linear time warp simplifies to a copy operation taking
11 + 11p
cycles. If J<I the time complexity is:
cycles = 7 + 232 + (I-J)[42 + NetD + (45 + NetD)p + 2 + 13] + 2 + 2 + 80
         + (29 + NetD)p + 2 + 10 + 109p + 2 + 7
       = 344 + (138 + NetD)p + (I-J)[57 + NetD + (45 + NetD)p]
Whenever the utterance is being expanded or compressed, the number of 
operations is based on the amount of change in size. Table 7.5 gives values for 
J-I = -20, -10, 0, 10, 20, 40 for network delays of 0 and 18 cycles and p=8
coefficients.
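The three cases of the method one time complexity can be collected into a single function. The Python sketch below uses the simplified forms of the formulas and takes the J=I copy cost as 11 + 11p cycles, which is an assumption chosen because it matches the cycle count listed for J-I=0 in Table 7.5; the function name is hypothetical.

```python
def ltw_method_one_cycles(j, i, p, net_d):
    """Cycles for LTW method one warping J input frames to I output frames."""
    if j > i:      # utterance is compressed
        return 325 + (138 + net_d) * p + (j - i) * (59 + net_d + (29 + net_d) * p)
    if j < i:      # utterance is expanded
        return 344 + (138 + net_d) * p + (i - j) * (57 + net_d + (45 + net_d) * p)
    return 11 + 11 * p   # J = I: plain copy (assumed form)


if __name__ == "__main__":
    for delta in (-20, -10, 0, 10, 20, 40):
        for net_d in (0, 18):
            print(delta, net_d, ltw_method_one_cycles(40 + delta, 40, 8, net_d))
```

The cost grows linearly with |J-I|, so utterances close to the target length of I=40 frames are the cheapest to warp.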
7.5.2. Method Two — One Coefficient per PE
Figure A.9 is the program for implementing the second method of
linear time warping as discussed in Section 6.3.2. For 8 LPC coefficients, it
uses 8 PEs with the input data arranged so that PE k contains coefficient k of
frame j for 0 ≤ k < p and 0 ≤ j < J. The output data uses the same
arrangement. Its time complexity is
cycles = 7 + 98 + I(45 + 10 + 22 + 106) + 2
       = 107 + 183I
if J≠I and 450 cycles if J=I. Table 7.6 gives times for a typical speech system.
7.5.3. Comparing LTW Methods One and Two
These two methods are an example of the importance of including overhead
such as transfers in the time complexities. From Table 6.3, one would
expect method one to perform better than method two because method one
Table 7.5 Execution times for linear time warping, method one.

Program          LTW Method One
J-I                -20     -20     -10     -10       0       0
p                    8       8       8       8       8       8
Number of PEs       40      40      40      40      40      40
NetD                 0      18       0      18       0      18
Transfers          188     188      98      98       0       0
Cycles           9,788  13,172   5,618   7,382      99      99
Time (μs)        2,447   3,293   1,405   1,846      25      25

Program          LTW Method One
J-I                 10      10      20      20      40      40
p                    8       8       8       8       8       8
Number of PEs       50      50      60      60      80      80
NetD                 0      18       0      18       0      18
Transfers           98      98     188     188     368     368
Cycles           4,339   6,103   7,249  10,633  13,069  19,693
Time (μs)        1,085   1,526   1,812   2,658   3,267   4,923

Program          LTW Method One
J-I                -20     -20     -10     -10       0       0
p                   16      16      16      16      16      16
Number of PEs       40      40      40      40      40      40
NetD                 0      18       0      18       0      18
Transfers          356     356     186     186       0       0
Cycles          18,092  24,500  10,322  13,670     187     187
Time (μs)        4,523   6,125   2,580   3,418      47      47

Program          LTW Method One
J-I                 10      10      20      20      40      40
p                   16      16      16      16      16      16
Number of PEs       50      50      60      60      80      80
NetD                 0      18       0      18       0      18
Transfers          186     186     356     356     696     696
Cycles           7,763  11,111  12,993  19,401  23,453  35,981
Time (μs)        1,941   2,778   3,249   4,851   5,864   8,996
Table 7.6 Execution times for linear time warping, method two.

Program          LTW Method Two
I                    40         40
p                     8         16
Number of PEs         8         16
Transfers             0          0
Cycles            7,427      7,427
Time           1,857 μs   1,857 μs
uses one scalar and two vector multiplication steps and method two uses 3I
scalar multiplication steps. In a typical system the vectors contain 8 elements
and I=40, so method one uses 17 scalar multiplication steps while method two
uses 120. Tables 7.5 and 7.6 show that methods one and two both take about
1.8 ms if I-J=10 and NetD=18. This seems inconsistent with Table 6.3 until
the transfer times are considered. Method one uses |J-I|+1 transfers while
method two uses none. The vector and scalar transfers take approximately
(|J-I|+1)(453) cycles, and the |J-I| vector multiplications, used in method
one, require 872 cycles for p=8 and NetD=18. The vector transfer time is
about half the time of a vector multiplication. Therefore when comparing the
time complexities of two methods, relative times of all operations should be
considered.
7.5.4. Summary
A typical speech recognition system has at least 300 to 500 ms between 
the starting times of two utterances. The LTW program must be performed 
once for each input utterance, therefore the LTW must execute in less than
300 to 500 ms to run in real time. Both methods presented here can execute in 
less than 300 to 500 ms assuming that the data is stored in each PE before the 
LTW program is run. The problem of getting the data in this allocation is dis­
cussed in Section 7.7.
The arrangement of the input and output data and the number of PEs 
used are the main differences between these two methods. Method one uses the 
maximum of J and I PEs while method two uses p PEs.
Selecting one of these methods may depend on the data arrangement, not 
the execution time. If a system has each PE processing one frame of speech, 
method one should be used since it requires one frame per PE as input. If the 
system has each PE containing one coefficient from each frame, method two 
should be used since that is how its input data is arranged. If the system uses 
neither of the above arrangements the data will have to be moved to match
one of the arrangements. The choice of which arrangement to use would be
based on the time needed to move the data into one of the arrangements, and
the desired output data arrangement.

Footnote: The time of a multiplication step is the time used by one
multiplication operation in several PEs in parallel.

Neither LTW program can begin execution until after the input utterance 
has been detected. This causes a delay time since the LTW program and the 
programs that follow it must wait until the entire utterance is spoken.
7.6. Simulation of Dynamic Time Warping Algorithms
Dynamic time warping (DTW) is the process of taking one unknown utterance
and comparing it to one known utterance. The DTW algorithm dynamically
stretches and shrinks both utterances, in time, to match them to each
other as well as possible. This is done, as explained in Section 4.6.2, by
computing the local distance d(i,j) between frame i of the known utterance and
frame j of the unknown utterance. Dynamic programming theory is used to
find the minimum path from d(0,0) to d(I,I) where I is the number of frames in 
the known and unknown utterances. The local distance scores are accumulated 
along this minimum path, and the result is a single score telling how closely the 
two utterances match. A typical isolated word recognition system matches an 
unknown utterance to every known utterance in the system’s vocabulary. A 
1,000 utterance vocabulary would therefore require 1,000 DTWs to be performed.
An utterance is a collection of I frames of p coefficients each. I is constant
since the LTW program will stretch or shrink the utterance to a fixed
length before the DTW program processes it. Typically I=40 and p=8 and
each coefficient has 16 bits.
Section 6.4.1 presented two approaches for implementing a parallel DTW.
Both methods are simulated using sim68. The first approach is the serial 
parallel (SP) method. Since a typical speech recognition system needs to per­
form one DTW match for each word in its vocabulary, the SP method uses one 
PE for each vocabulary word and broadcasts the unknown utterance to all 
PEs. Each PE executes a serial DTW to match its known utterance to the 
unknown utterance.
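The serial DTW that each PE runs in the SP method can be sketched in Python as follows. This is a generic banded dynamic programming recurrence, not the program of Figure A.11: the sum of absolute differences used as the local distance, the weight of two on the diagonal step, and the function names are all assumptions made for illustration. The loop over vocabulary words stands in for the PEs, which would each run the same comparison at the same time.

```python
def dtw_score(known, unknown, r):
    """Accumulated distance between two I-frame utterances within a band of width r."""
    n = len(known)
    inf = float("inf")
    acc = [[inf] * n for _ in range(n)]
    for i in range(n):
        for j in range(max(0, i - r), min(n, i + r + 1)):
            # Local distance d(i, j): assumed sum of absolute differences.
            d = sum(abs(a - b) for a, b in zip(known[i], unknown[j]))
            if i == 0 and j == 0:
                acc[i][j] = d
                continue
            best = inf
            if i > 0:
                best = min(best, acc[i - 1][j] + d)
            if j > 0:
                best = min(best, acc[i][j - 1] + d)
            if i > 0 and j > 0:
                best = min(best, acc[i - 1][j - 1] + 2 * d)
            acc[i][j] = best
    return acc[n - 1][n - 1]


def recognize(vocabulary, unknown, r):
    """Return the vocabulary word whose reference utterance scores lowest."""
    scores = {word: dtw_score(frames, unknown, r)
              for word, frames in vocabulary.items()}
    return min(scores, key=scores.get)


if __name__ == "__main__":
    vocab = {"one": [[0, 0], [1, 1], [2, 2]], "two": [[5, 5], [5, 5], [5, 5]]}
    print(recognize(vocab, [[0, 0], [1, 1], [2, 3]], r=1))
```

Because every word is scored independently against the same unknown utterance, the work divides naturally into one comparison per PE, with only the unknown utterance needing to be broadcast.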
The second approach is the parallel parallel (PP) method. The PP method 
uses several PEs to perform one DTW comparison. Two implementations of the 
PP method are given. The first (PP1) moves the input data to the appropriate
PEs and then computes the local distances as they are needed. The second
program (PP2) computes the local distances while moving the data to the PEs
and then computes the DTW.
The following section presents the rearrange routine which is used to rearrange
the unknown utterance among the PEs before executing the SP and PP1
programs.
7.6.1. Rearrange
Both the SP and PP1 methods need to store the input data in each PE in
an unusual manner. The rearrange routine moves the data from one arrange­
ment to another so that the DTW programs will have the data in the right
places.
The rearrange routine expects its input data to be stored with coefficient 
k of frame i in PE k for 0 ≤ k < p and 0 ≤ i < I. This arrangement is
chosen since it is the arrangement used by the LPC and LTW routines. Rear­
range moves the data from this arrangement to the arrangement needed by the 
DTW program, in which each PE has all the coefficients from all the frames in 
the unknown utterance. Figure 7.4 is a listing of the rearrange algorithm and 
Figure A. 10 contains a listing of the rearrange program. The rearrange routine 
sends the data to all PEs by using a series of the Shift -1 transfer functions. 
First PE 0 sends its data to the CU by writing it to a memory location called
TOCU. There is a data path from PE 0 to the CU, so that anything PE 0 
stores in memory location TOCU appears in memory location FROMPEO in 
the CU after the network delay time. PE 0 sends its data to the CU and the 
CU broadcasts it to all the PEs. The broadcast is performed by having the CU
store the data to be broadcast in the immediate data field of a PE instruction. 
The PE instruction, with the broadcast data, is broadcast to all PEs as is any 
other instruction and when the PEs execute it, then the data is stored in each 
PE’s register.
After PE 0 sends its data to the CU, all PEs execute a Shift -1 transfer 
function. Now PE i contains the data from PE i + 1. PE 0 sends the data it
received from PE 1 to the CU and it is broadcast, as before. All PEs execute 
the Shift -1 transfer function again, so now PE i has data that was originally
in PE i+2 and PE 0 sends its data to the CU. This shift-broadcast loop is




/*
Function:       This program moves data around in preparation
                for the DTW program.
Number of PEs:  2r + 1
Parameters:     r, the width of the warping path.
                p, the number of coefficients per frame.
                NetD, the network delay time.
                I, the number of frames per utterance.
Input:          input[i] contains coefficient k of
                input vector i in PE k for 0 ≤ i < I.
Output:         output[i][k] contains coefficient k of
                vector i of the output in all PEs.
Cycles:         26 + I[13 + p(47 + NetD)] + 9⌈r/2⌉
Typical Time:   5,344 μs for p=8, r=6, I=40, NetD=18
*/

Line  Time in μs
  1            PROCEDURE Rearrange
  2    3         USE Shift-1
  3    2.25      FOR i ← 0 TO I-1
  4    1           tmp ← input[i];    /* tmp contains coefficient i in PE i */
  5    1.75        FOR j ← 0 TO p-1
  6    2             TOCU ← tmp;      /* send coefficient to CU */
  7    2             DTRIN ← tmp;     /* send coefficient to PE to the left */
  8    NetD          TRANSFER;
  9    3             BROADCAST FROMPE0 TO output[i][j];
                                      /* Send to all PEs */
 10    2             tmp ← DTROUT;    /* Get coefficient from PE to right */
 11
 12    2.75      FOR i ← 0 TO r/2
 13    1           output[i+I] ← ∞;
Figure 7.4 Program to rearrange data from PE k containing coefficient k,
0 ≤ k < p to all PEs containing all coefficients.
repeated until all PEs have shifted their data to PE 0 and PE 0 has sent it to
the CU and it is broadcast to all PEs.
The time complexity of the rearrange program is:
cycles = 16 + I[6 + p(47 + NetD) + 2 + 5] + 2 + 6 + 9⌈r/2⌉ + 2
Table 7.7 summarizes the execution times for the rearrange program.
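The shift-broadcast loop and its cost can be illustrated with a small Python simulation. The list-of-lists data layout, the function names, and the reading of the r/2 term as a ceiling are assumptions of this sketch; the CU broadcast is modeled by writing a single shared output array that stands for the identical copy held in every PE.

```python
import math


def rearrange(pe_data):
    """pe_data[k][i] is coefficient k of frame i, held in PE k.

    Returns output[i][j], coefficient j of frame i, as every PE would hold it.
    """
    p = len(pe_data)
    frames = len(pe_data[0])
    output = [[None] * p for _ in range(frames)]
    for i in range(frames):
        tmp = [pe_data[k][i] for k in range(p)]   # one value per PE
        for j in range(p):
            from_pe0 = tmp[0]            # PE 0 writes its value to the CU
            output[i][j] = from_pe0      # CU broadcasts it to all PEs
            tmp = tmp[1:] + [None]       # shift: PE k receives from PE k+1
    return output


def rearrange_cycles(p, r, frames, net_d):
    """Cycle count of the rearrange program after collecting constant terms."""
    return 26 + frames * (13 + p * (47 + net_d)) + 9 * math.ceil(r / 2)


if __name__ == "__main__":
    print(rearrange([[1, 2], [3, 4], [5, 6]]))
    print(rearrange_cycles(8, 6, 40, 18))
```

Each frame needs p passes through the inner loop, one per coefficient, which is why the cycle count is dominated by the I p(47 + NetD) term.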
Although some interconnection networks can broadcast data without going 
through the CU [SiMc81a,SiMc81b], this method of using a data path between 
PE 0 and the CU is used here because it can use a less powerful interconnection
network. The method implemented requires one data path going from PE 0 to 
the CU, and the network must be able to perform a Shift +1 interconnection 
function. The execution time for such a broadcast is the time to send the data 
to the CU plus the 3 μs which are needed for the CU to write the data into a
PE instruction and broadcast the instruction.
7.6.2. Simulation of the DTW Algorithm — The Serial Parallel
Method (SP)
Figure A.11 is the listing of the SP MC68000 program for dynamic time
warping. It uses PE 0 and assumes that the rearrange program was run before
it so that all the known and unknown frames are stored in PE 0 before execut­
ing the program. It differs from a serial program in that the CU executes the 
branching instructions and performs the loop control as in a parallel program. 
Some “IF ... THEN ... ELSE” constructs that a serial program would use are
replaced by the “WHERE ... ELSEWHERE ... ENDWHERE” constructs in the
SP program. Although the serial-parallel program executes on only one PE, it
is written to execute on several PEs at the same time. This is the way it
would be used on an SIMD system in which each PE compares the unknown 
utterance to a reference utterance.
The distance score of ∞ which is used in the algorithm to represent distances
from invalid paths is represented in the MC68000 program as the value
4000 (hexadecimal). This value is used since the local distance scores are
stored as 16-bit numbers and they may be multiplied by two and added to each
other. For
Table 7.7 Execution times for rearrange routine.

Program           Rearrange
p                      8          8         16          16
r                      6          6          6           6
I                     40         40         40          40
Number of PEs         13         13         16          16
NetD                   0         18          0          18
Transfers            320        320        640         640
Cycles            15,613     21,373     30,653      42,173
Time/Rearrange  3,903 μs   5,344 μs   7,663 μs   10,543 μs
example if d1=∞ and d2=∞, then 2*d1 + d2 = C000 (hexadecimal), which can be
represented with 16 bits. Using a larger value for ∞ could cause the 16-bit
value to overflow after the above manipulations are performed.
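The choice of 4000 (hexadecimal) can be checked with a few lines of Python. The sketch treats the 16-bit value as unsigned, which is an assumption, and the names are hypothetical.

```python
INF_16 = 0x4000   # value standing in for infinity


def fits_in_16_bits(value):
    """True if value can be held in an unsigned 16-bit word."""
    return 0 <= value <= 0xFFFF


def worst_case(inf_value):
    """Largest result of doubling one distance and adding another."""
    return 2 * inf_value + inf_value


if __name__ == "__main__":
    print(hex(worst_case(INF_16)), fits_in_16_bits(worst_case(INF_16)))
    print(hex(worst_case(0x6000)), fits_in_16_bits(worst_case(0x6000)))
```

With 4000 (hexadecimal) the worst case is C000, which still fits, while a larger stand-in such as 6000 would give 12000 and no longer fit in 16 bits.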
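The time complexity of the SP program is:

$$
\begin{aligned}
& 12 + (24 + 50p + 2 + 7 + 7 + 25 + 13) \; + && (1) \\
& r\,[24 + 50p + 2 + 7 + 23 + 54 + 13] \; + && (2) \\
& \sum_{i=1}^{r-1} i\,[9 + 13 + 13] \; + && (3) \\
& r\,[24 + 50p + 2 + 15 + 13 + 54 + 13] \; + && (4) \\
& [(I-1)(2r+1) - r - r^2]\,[24 + 50p + 2 + 16 + 16 + 48 + 44 + 54 + 13] \; + && (5) \\
& \sum_{i=1}^{r} i\,[19 + 13 + 13] \; + && (6) \\
& 33I + 3 && (7)
\end{aligned}
$$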
Each number roughly represents the time between two successive labels in the 
program. Figure 7.5 shows the order in which the distances are computed for 
I=10 and r=4, and Table 7.8 gives a breakdown of the time spent between 
adjacent labels. The “.”s in Figure 7.5 are where actual distances are 
computed and the “+”s are locations that are “visited” but no distance is 
computed. A visit to a location means the program sets x and y equal to the 
coordinates of that location, but the location is not in the warping path. 
Line (1) in the equation is the time used to initialize the loop counters and 
compute the special case where x=0 and y=0 (point 1 in Figure 7.5). Line (2) 
is the special case where y=0 and x≠0 (points 2-5 in Figure 7.5). In general 
this line is executed r times. Line (3) is the time to skip over the +’s in 
the lower left triangle. In general there are r−1 +’s on the horizontal side 
of the triangle. Line (4) is the time to compute the special case where x=0 
and y≠0. Line (5) is the normal case for x≠0 and y≠0. The factor I−1 is used 
because x takes on the values from 0 to I−1 with line (2) computing the 
execution times for x=0. The 2r+1 term in line (5) is the width of the 
warping path; the r+r² term is subtracted to adjust for the time taken into 
account by lines (3),
Line (6) is the time to skip over the +’s in the upper right triangle. Line (7) is








Figure 7.5 Calculation order for accumulated distances of SP DTW program.
Table 7.8 Execution times in cycles between adjacent labels of SP DTW 
program (x = 50p + 2 + 7). The column headings refer to the lines of the time 
complexity equation in Section 7.6.2.

Times executed, by column:
  (1)  1
  (2)  r
  (3)  Σ i for i = 1 to r−1
  (4)  r
  (5)  (I−1)(2r+1) − r − r²
  (6)  Σ i for i = 1 to r
  (7)  I

Line         (1)   (2)   (3)   (4)   (5)   (6)   (7)
dtw:          12
nextdist:     24    24     9    24    24    19

findG:              54          54    54

firstrow:      7    23
firstcol:     25
yedge:        13
The simplified time complexity is:

$$
\begin{aligned}
& 12 + (78 + 50p) + r\,[123 + 50p] + \sum_{i=1}^{r-1} 35i \; + \\
& r\,[121 + 50p] + [(I-1)(2r+1) - r - r^2]\,[217 + 50p] \; + \\
& \sum_{i=1}^{r} 45i + 33I + 3
\end{aligned}
$$
Table 7.9 gives the execution times for a typical speech recognition system. 
The SP DTW program is able to execute a match in 74 ms which is 13 matches 
per second using one PE. A 1,000 word vocabulary can be matched in one 
second using 77 PEs.
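The simplified expression can be evaluated directly to reproduce the cycle counts in Table 7.9. The sketch below is only an illustration: the function names are hypothetical, and the conversion of 4 cycles per µs is the one the tables imply (for example, 54 cycles is quoted as 13.5 µs).

```python
import math

CYCLES_PER_US = 4  # conversion implied by the tables (cycles / 4 = microseconds)


def sp_dtw_cycles(I, r, p):
    """Simplified SP time complexity, in cycles, for I frames, window r, p coefficients."""
    lower_triangle = sum(range(1, r))      # sum of i for i = 1 .. r-1
    upper_triangle = sum(range(1, r + 1))  # sum of i for i = 1 .. r
    normal_points = (I - 1) * (2 * r + 1) - r - r * r
    return (12 + (78 + 50 * p)
            + r * (123 + 50 * p)
            + 35 * lower_triangle
            + r * (121 + 50 * p)
            + normal_points * (217 + 50 * p)
            + 45 * upper_triangle
            + 33 * I + 3)


def matches_per_second(cycles):
    """Whole matches completed in one second by a single PE."""
    return int(1_000_000 / (cycles / CYCLES_PER_US))


cycles = sp_dtw_cycles(I=40, r=6, p=8)
rate = matches_per_second(cycles)
print(cycles, cycles / CYCLES_PER_US / 1000, "ms")
print(rate, "matches/s ->", math.ceil(1000 / rate), "PEs for 1,000 words in one second")
```

With I=40, r=6 and p=8 this gives 296,452 cycles, and with p=16 it gives 487,652 cycles, the DTW-only entries of Table 7.9.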
The SP method has little overhead of parallelism because each PE is 
implementing a serial algorithm. The only parallel construct used is data con­
ditional masking which the program frequently uses for finding the minimum 
path. The following shows the overhead of using the data conditional mask, 
and suggests two methods for eliminating the overhead.
The following code performs the same task as the Flock Algol lines 32-35 
in Figure A.10, i.e., it stores the minimum of the variables A and B in the 
variable min.

    34   WHERE A < B
     2     min ← A;
     8   ELSEWHERE
     2     min ← B;
     8   ENDWHERE
The numbers on the left are the number of cycles used for each step assuming 
an 8 MHz MC68000 is used and A, B, and min are stored in registers. The 
program uses a total of 54 cycles (13.5 µs). Overlapping the PE and CU 
instructions by using an instruction queue would not significantly reduce the 
execution times of these statements since the CU must wait until the PEs have 
executed the instructions in the queue before enabling the data conditional 
mask [SiKu82]. The following is the faster method used in Figure A.10.

     2   min ← A;
    26   WHERE B < min
     2     min ← B;
     8   ENDWHERE
Table 7.9 Execution times for serial dynamic time warping (SP).

Program                    DTW      DTW + Rearrange
p                            8          8          8
r                            6          6          6
I                           40         40         40
Number of PEs                1          8          8
NetD                                    0         18
Transfers                    0        320        320
Cycles                 296,452    312,065    317,825
Time/Comparison      74,113 µs  78,017 µs  79,456 µs
Comparisons/Second          13         13         13

Program                    DTW      DTW + Rearrange
p                           16         16         16
r                            6          6          6
I                           40         40         40
Number of PEs                1          8          8
NetD                                    0         18
Transfers                    0        640        640
Cycles                 487,652    518,305    529,825
Time/Comparison     121,913 µs 129,577 µs 132,457 µs
Comparisons/Second           8          7          7
This requires 38 cycles (9.5 µs), which is 16 cycles less than the first 
method. The extra cycles are the time needed to push the ELSEWHERE condition 
on the condition codes stack and to pop it off again. Avoiding the ELSEWHERE 
statement by using the above technique will save 4 µs on the MC68000 when 
running at 8 MHz.
The following is a serial method to perform the same operation.
     2   min ← A;
     7   IF B < min
     2     min ← B;
This takes only 11 cycles. A processor using an instruction prefetch may 
reduce the execution time of the above statements, but its effect will be limited 
since the second line is a conditional branch which may disrupt the prefetching 
of instructions. Although this code cannot be used by the parallel DTW pro­
gram, it does show that the parallel version of finding a minimum takes about 
250% longer than the serial version. If the min operation, or any other simple 
operation, is frequently used it should be included in the instruction set of the 
PEs. Then the PEs could execute the simple function with one instruction 
rather than using the data conditional masking which requires more time to 
execute.
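The cycle counts of the three ways of finding a minimum can be tallied as follows. This is a small bookkeeping sketch using only the per-step counts quoted above; the variable names and the 4 cycles per µs conversion (54 cycles = 13.5 µs) are taken from or implied by the text, not from the simulator.

```python
CYCLES_PER_US = 4  # implied by "54 cycles (13.5 µs)"

# Per-step cycle counts, in program order, as listed in the text.
where_elsewhere = [34, 2, 8, 2, 8]  # WHERE / ELSEWHERE / ENDWHERE version
where_only = [2, 26, 2, 8]          # faster version without ELSEWHERE
serial_if = [2, 7, 2]               # serial IF version

for name, steps in [("WHERE/ELSEWHERE", where_elsewhere),
                    ("WHERE only", where_only),
                    ("serial IF", serial_if)]:
    total = sum(steps)
    print(f"{name:16s} {total:3d} cycles  {total / CYCLES_PER_US:5.2f} us")

# How much longer the masked (parallel) minimum takes than the serial one.
extra = (sum(where_only) - sum(serial_if)) / sum(serial_if) * 100
print(f"parallel minimum takes {extra:.0f}% longer than the serial version")
```

The last line evaluates to about 245%, which is the "about 250%" figure quoted above.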
A more general approach would be to allow the programmer to define his 
own instructions, so that he could define simple operations, like the min func­
tion, as they are needed. On most processors, new instructions are defined by 
writing microcode, if they can be defined at all. On the MC68000, which is 
used in the simulations, the microcode cannot be changed. Custom instruc­
tions could be implemented by allowing the PEs to execute code out of their 
own memory while running in SIMD mode. The routines, stored in the local 
memory of each PE, would be identical in each PE, and would be written so 
that the execution time of each routine is independent of the data processed. 
This would take care of the synchronization problems. Then the PEs could 
perform simple instructions like min without the overhead of data conditional 
masking.
One other approach, if a custom instruction set were being designed, 
would be to implement an Mcc instruction that works like the Bcc instruction 
on the MC68000. The Bcc is a branch on condition code; cc can be one of 
16 conditions such as less than, greater than, etc. The Mcc would be a move 
on condition code. The operation would be to move data from one register to 
another if the condition is true. Therefore,

     2   p_mov d0,d1;   Move data from register d0 to d1.
     2   p_cmp d1,d2;   Compare registers d1 and d2.
     5   p_mlt d0,d2;   Move contents of d0 to d2 if
                    ;   d2 is less than d1.

would store the minimum of d1 and d2 in d0, without data conditional 
masking. The minimum, maximum, and absolute value functions are a few of the 
many functions that could be implemented using the Mcc instruction.
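The difference between the two styles can be illustrated with a toy model in which each list element stands for the register of one PE. This is a hedged sketch, not the simulator's behavior: min_with_mask and min_with_cmove are hypothetical names, and the conditional move is modeled simply as a per-PE select.

```python
def min_with_mask(a, b):
    """Data conditional masking: the CU collects a mask, then only enabled PEs store."""
    result = list(a)                                # min <- A in every PE
    mask = [bv < mv for bv, mv in zip(b, result)]   # WHERE B < min
    for pe, enabled in enumerate(mask):
        if enabled:
            result[pe] = b[pe]                      # min <- B in enabled PEs only
    return result                                   # ENDWHERE


def min_with_cmove(a, b):
    """Move on condition: every PE runs the same three steps, no mask is needed."""
    result = []
    for av, bv in zip(a, b):
        dst = av                    # unconditional move
        is_less = bv < av           # compare sets a flag local to the PE
        dst = bv if is_less else dst  # conditional move decided inside the PE
        result.append(dst)
    return result


a = [5, 2, 9, 4]
b = [3, 7, 9, 1]
print(min_with_mask(a, b))   # [3, 2, 9, 1]
print(min_with_cmove(a, b))  # [3, 2, 9, 1]
```

Both functions produce the same values; the point of the Mcc proposal is that the second form needs no enable/disable traffic between the CU and the PEs.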
7.6.3. Simulation of the SIMD DTW Algorithms
Some applications may have more PEs available than there are words in 
the vocabulary. In cases like this, the SP method may not decrease the 
execution time of the DTW algorithm as much as wanted since it uses only one 
PE per DTW match. The parallel parallel (PP) method, discussed in Section
6.4.1.2., uses 2r+1 PEs for each DTW match, therefore decreasing the time 
needed to do one match. Two alternatives to implementing the PP program 
are presented. The first, PP1, uses the rearrange routine described earlier to 
move the data from the output format used by the LTW program to the input 
format used by the DTW program. Then the PP1 DTW program computes 
the local distances as they are needed. The second, PP2, uses a variation of 
the rearrange program which computes the local distances while moving the 
data. This reduces the amount of data that must be rearranged and stored in 
each PE. After the data is moved and all the local distances are computed, the 
PP2 program is executed. The following paragraphs discuss the PP1 program, 
and the next section covers the PP2 program.
7.6.3.1. PP1
Figure A.11 is a listing of the PP1 DTW program. The time complexity 
for the PP1 distance program is:

$$ \text{cycles} = 58 + 50p + 2 + 16 $$
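The time complexity for the PP1 DTW program is:

$$
\begin{aligned}
\text{cycles} ={} & 4 + 114 + I\,[10 + \text{dist} + 104 + 2(52 + 2\,\text{NetD}) + 104 \; + \\
& 16 + 12 + 16 + 118 + 5] + 2 + 44 + 14r + 6r && (7.1) \\
\text{cycles} ={} & 164 + I\,[565 + 50p + 4\,\text{NetD}] + 20r
\end{aligned}
$$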
where dist is the time used by the DTW distance program. The value 118 in 
equation (7.1) is the time used to run the instructions between labels findmin 
and incindex in Figure A.10. Adding up execution times between the labels 
yields 124 cycles. The six cycles used by the instruction 2 lines before the 
incindex label are not included in the total execution times because it is not 
normally executed. The sim68 simulator does not count the execution time if 
all PEs are disabled. The term 6r is added outside the main loop (the loop 
starting at the label nextdist) to compensate for the few times the statement is 
executed. Table 7.10 summarizes the execution times for both the PP1 and 
the PP1 + rearrange programs.
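As a cross-check, the detailed and simplified forms of the PP1 expression can be evaluated side by side. The sketch below assumes nothing beyond the constants printed in the equations; the function names are illustrative.

```python
def pp1_distance_cycles(p):
    """Cycles for one local distance in the PP1 program (p coefficients per frame)."""
    return 58 + 50 * p + 2 + 16


def pp1_dtw_cycles(I, r, p, net_d):
    """Detailed PP1 DTW time complexity, term by term as in equation (7.1)."""
    dist = pp1_distance_cycles(p)
    per_frame = (10 + dist + 104 + 2 * (52 + 2 * net_d) + 104
                 + 16 + 12 + 16 + 118 + 5)
    return 4 + 114 + I * per_frame + 2 + 44 + 14 * r + 6 * r


def pp1_dtw_cycles_simplified(I, r, p, net_d):
    """Simplified form of the same expression."""
    return 164 + I * (565 + 50 * p + 4 * net_d) + 20 * r


for p in (8, 16):
    for net_d in (0, 18):
        detailed = pp1_dtw_cycles(40, 6, p, net_d)
        assert detailed == pp1_dtw_cycles_simplified(40, 6, p, net_d)
        print(f"p={p:2d} NetD={net_d:2d}: {detailed:,} cycles")
```

For I=40 and r=6 this prints 38,884 and 41,764 cycles for p=8, and 54,884 and 57,764 cycles for p=16, which are the PP1 DTW cycle counts listed in Table 7.10.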
In a typical speech recognition system the PP1 program would compare a 
pair of utterances in less than 16 ms using 13 PEs. The SP requires 80 ms to 
compute the same comparison using one PE, or it can compare 13 pairs of 
utterances in 80 ms using 13 PEs. This gives an average of 6 ms per DTW 
using the SP algorithm with 13 PEs. (All times include the time for the 
rearrange program.) This means the PP1 program takes about 8/3 times as long 
as the SP program to execute roughly the same operations. One difference 
between the SP and PP1 programs is that PP1 uses the interconnection 
network. If the network delay time is 0, PP1 requires 14 ms per DTW while SP 
needs 79 ms/13 = 6 ms. Still the PP1 program takes over two times as long to 
perform a comparison between an unknown and a reference utterance.
The difference is caused by the implementation on the MC68000. The 
MC68000 has 8 32-bit data registers and 8 32-bit address registers. The SP 
program stores all of its variables in the data and address registers. The PP1 
program uses over 17 variables since it must store the g and d values for itself 
and the PEs adjacent to it, plus it must save the old g and d values for itself 
and the adjacent PEs. All these variables are stored in memory since there are 
not enough registers to hold them all. The MC68000 can do a 
register-to-register move in 0.5 µs and a memory-to-memory move in 2.5 µs, which is 5
Table 7.10 Execution times for parallel dynamic time warping (PP1).

Program               PP1 DTW                Rearrange+DTW
p                        16         16         16         16
r                         6          6          6          6
I                        40         40         40         40
Number of PEs            13         13         13         13
NetD                      0         18          0         18
Transfers               160        160        800        800
Cycles               54,884     57,764     85,537     99,937
Time/Match        13,721 µs  14,441 µs  21,384 µs  24,984 µs
Matches/Second           72         69         46         40

Program               PP1 DTW                Rearrange+DTW
p                         8          8          8          8
r                         6          6          6          6
I                        40         40         40         40
Number of PEs            13         13         13         13
NetD                      0         18          0         18
Transfers               160        160        480        480
Cycles               38,884     41,764     54,497     63,137
Time/Comparison    9,721 µs  10,441 µs  13,624 µs  15,784 µs
Comparisons/Second      102         95         73         63
times as long. In general each memory access takes about 1 µs more than each 
register access. Since the memory-to-memory move instruction references 
memory once to read the value and again to write it to a new location, it takes 
2 µs longer than the register-to-register move. Therefore the PP1 program is 
slower than the SP program partially because it uses inter-PE transfers, but 
mainly because the MC68000 does not have enough registers to hold all the PP 
variables. Some variables must be stored in memory which is slower to access.
This provides another design feature. The processor used in each PE of an 
SIMD machine for DTW should have more registers than the 8 provided by the 
MC68000. This would allow more data to be quickly accessed without using 
main memory.
7.6.3.2. Simulation of the DTW Algorithm - PP2
The time the rearrange program uses to move data between PEs is all 
parallel overhead since the data movement is not needed on a serial processor. 
The PP2 program attempts to reduce the rearrange time by computing the 
local distance as the data is being moved. The rearranging time should be 
reduced since two frames of p coefficients each are combined into one distance 
score after the calculation. The next section presents the distance program 
which computes the local distances while moving the data. The section after 
that presents the PP2 program.
7.6.3.2.1. The Distance Program
Figure 7.6 is the Flock Algol algorithm for computing the local distances. 
It uses max(p, 2r+1) PEs and the input data is arranged so PE k contains 
coefficient k of frame i for 0 ≤ k < p and 0 ≤ i < I, where I is the total 
number of frames.
The distance routine computes the local distance between known frame i 
and unknown frame j in PE 0 through PE p−1 and stores the resulting data in 
PE i−j. Figure 7.7 represents the local distances with “.”s for r=4, p=6, and 
I=10. The dots outside of the shaded area are stored in PEs 0 through 




Algorithm Name: distance (PP2)
Section:        7.6.3.2.1.
Machine:        SIMD
Function:       This program moves data around and computes
                the local distances in preparation
                for the DTW program.
Number of PEs:  2r + 1
Parameters:     r, the width of the warping path.
                p, the number of coefficients per frame.
                NetD, the network delay time.
                I, the number of frames per utterance.
Input:          known[x] contains coefficient i in PE i of
                input vector x.
                unknown[y] contains coefficient i in PE i of
                input vector y.
Output:         d[dptr] contains the local distances.
                d[0] contains the first distance needed by the
                PE it is stored in for the DTW program.
                d[1] contains the next distance, and so on.

Line  Time in µs
 1              PROCEDURE distance
 2    .5          LADDR = ADDR - r  /* Logical address, PEs are numbered -r to r */
 3    4           FOR i ← 1 TO r/2
 4    13            WHERE |LADDR| > i DO
 5    2               d[dptr] ← ∞;

      4             ENDWHERE
 8                FOR y ← 0 TO I-1
10    3             FOR x ← -r TO r
11    5               IF y+x < 0 AND y+x < 2I-2
12    10.75             sum ← (known[x] - unknown[y])
13    3                 FOR k ← 0 TO logN-1
14    3                   USE Cube(k);
15    1.5                 DTRIN ← sum;
16    NetD                TRANSFER;
17    1.5                 sum ← sum + DTROUT;
18
19                      /*
20                      The coefficients are in PE 0 - PE p
21                      and the distance score is needed in PE i
22                      where i > p. Use the Shift function to
23                      move the data from PE 0 to the desired PE.
24                      */
25    2                 IF x + r > p
26    3                   USE Shift +x+r
27    4+NetD              TRANSFER sum
28
29    6.5               WHERE x + r = ADDR   /* Enable PE */
30    1                   d[dptr] ← sum;     /* that will use */
31                        dptr ← dptr + 1;   /* the distance */
32    2                 ENDWHERE             /* score. */
33
34    3           FOR i ← 1 TO r/2
35    1             d[dptr] ← ∞;

Figure 7.6 Algorithm to compute local distances and move data. Execution 
times are for an 8 MHz MC68000.






Figure 7.7 Calculation order for accumulated distances of SP DTW program. 
PEs in shaded area do not start with input data.
the input data is stored in only PEs 0 through p−1, and the distance scores are 
computed in the same PEs, the distance scores represented by the shaded area 
in Figure 7.7 must have their scores transferred from a PE outside of the 
shaded area.
A typical speech recognition system has p=8 and r=6, so 2r+1 > p 
and extra transfers are needed to get the data from a PE outside of the shaded 
area to the proper PE in the shaded area. Lines 25-27 of Figure 7.6 handle this 
case. If p=16, as with some high quality speech recognition systems, p > 
2r+1 and lines 25-27 are never executed.
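The time complexity for the distance routine is:

$$
\begin{aligned}
\text{cycles} ={} & 12 + 85\lfloor r/2 \rfloor + 2 + 12 \; + && (1) \\
& [I(2r+1) - r - r^2]\,[20 + 43 + 4 + (\text{NetD} + 31)\log p + 2 + 9 + 38 + 13] \; + && (2) \\
& (9 + 7 + 13) \sum_{i=1}^{r-1} i \; + && (3) \\
& (19 + 7 + 13) \sum_{i=1}^{r} i \; + && (4) \\
& \Big[(2r+1-p)(I-r) + \sum_{i=1}^{2r-p} i\Big]\,(25 + \text{NetD}) \; + && (5) \\
& 30I + 1 \; + && (6) \\
& 6 + 9\lfloor r/2 \rfloor + 2 && (7)
\end{aligned}
$$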
assuming p < 2r +1. Table 7.11 gives the breakdown on how the time is spent 
between each label in the assembly language program, given in Figure A.12, for 
each line of the time complexity. Line (1) is the time used to initialize some 
variables and store infinity scores in those PEs outside the warping path during 
the first r/2 loops of the DTW program (see Figure 7.7). Line (2) is the main 
loop of the program, during which the distances are computed. Line (3) is the 
time used for visiting the “ + ”’s in the lower left triangle. Line (4) is the visit 
time for the upper right triangle. Line (5) is the time used to move data from 
PE 0 to PE i when i > p. The “.”’s in the shaded area of Figure 7.7 represent 
the time in which this is done. Line (5) can be omitted from the time complex­
ity if p > 2r + l. Line (6) is the time used to prepare to use a new unknown 
frame. Line (7) is the time needed to pad the d[] array with infinity values for 
those PEs outside the warping path.
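The line-by-line description above can be turned into a small calculator and compared with the cycle counts in Table 7.12. This is a sketch under stated assumptions: log p is taken as the base-2 logarithm with p a power of two, line (5) is dropped when p is not less than 2r+1, and the function names are hypothetical.

```python
import math


def triangle(n):
    """Sum of i for i = 1 .. n (0 when n <= 0)."""
    return n * (n + 1) // 2 if n > 0 else 0


def pp2_distance_cycles(I, r, p, net_d):
    """Cycle count of the PP2 distance program, lines (1) through (7) combined."""
    log_p = int(math.log2(p))        # assumes p is a power of two
    half_r = r // 2
    cycles = 35 + 94 * half_r + 30 * I          # constants of lines (1), (6), (7)
    cycles += 29 * triangle(r - 1)              # line (3): lower left triangle
    cycles += 39 * triangle(r)                  # line (4): upper right triangle
    main_points = I * (2 * r + 1) - r - r * r
    cycles += main_points * (129 + (net_d + 31) * log_p)   # line (2)
    if p < 2 * r + 1:                           # line (5): extra Shift transfers
        moves = (2 * r + 1 - p) * (I - r) + triangle(2 * r - p)
        cycles += moves * (25 + net_d)
    return cycles


for p in (8, 16):
    for net_d in (0, 18):
        print(f"p={p:2d} NetD={net_d:2d}: {pp2_distance_cycles(40, 6, p, net_d):,} cycles")
```

For I=40 and r=6 the sketch gives 113,387 and 142,439 cycles for p=8 and 123,705 and 158,121 cycles for p=16, in agreement with Table 7.12.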
Table 7.11 Execution times in cycles between adjacent labels of PP2 DTW 
program. The column headings refer to the lines of the time complexity 
equation in Section 7.6.3.2.1. (y = (NetD+31) log p + 2, x = 85⌊r/2⌋ + 2 + 12, 
z = 9⌊r/2⌋)

Times executed, by column:
  (1)  1
  (2)  I(2r+1) − r − r²
  (3)  Σ i for i = 1 to r−1
  (4)  Σ i for i = 1 to r
  (5)  (2r+1−p)(I−r) + Σ i for i = 1 to 2r−p
  (6)  I
  (7)  1

Line      (1)   (2)   (3)   (4)   (5)         (6)   (7)

                 43
                  4
                  y               25 + NetD
                  9
                 38
                 13    13    13
                                               30     6
                                                      z
                        7     7
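Simplified, the time complexity for the PP2 distance program is:

$$
\begin{aligned}
& 35 + 94\lfloor r/2 \rfloor + 30I + 29 \sum_{i=1}^{r-1} i + 39 \sum_{i=1}^{r} i \; + \\
& [I(2r+1) - r - r^2]\,[129 + (\text{NetD} + 31)\log p] \; + \\
& \Big[(2r+1-p)(I-r) + \sum_{i=1}^{2r-p} i\Big]\,[25 + \text{NetD}]
\end{aligned}
$$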
Table 7.12 gives execution times for a typical speech recognition system.
7.6.3.2.2. The PP2 DTW Program
After the distance program is executed, the DTW program is run. The 
PP2 DTW program is identical to the PP1 program except the PP2 program 
does not call a routine to compute the local distances. Instead, it finds the 
distances in an array, already computed by the distance program. Figure A.12 
lists the DTW program along with the main and distance programs. The time 
complexity for the PP2 DTW program is:

$$
\begin{aligned}
& 4 + 76 + I\,[106 + 2(52 + 2\,\text{NetD}) + 104 + 16 + 12 + 16 + 124 + 5] + 2 + 44 \\
& = 126 + I\,[487 + 4\,\text{NetD}]
\end{aligned}
$$
Table 7.13 summarizes the execution times for a typical speech recognition 
system. The PP2 program can match 24 pairs of utterances in one second using 
13 PEs. The PP1 program is able to match 63 pairs in the same time using the 
same number of PEs. The execution time has increased because 1) the 
number of transfers has increased, and 2) less parallelism is used.
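The 24 matches per second can be reproduced by adding the PP2 DTW cycles to the distance-program cycles from Table 7.12. The sketch below is illustrative only: the function names are hypothetical, the distance cycle count is copied from Table 7.12 (p=8, r=6, I=40, NetD=18), and the conversion of 4 cycles per µs is the one the tables imply.

```python
CYCLES_PER_US = 4  # conversion implied by the tables (cycles / 4 = microseconds)


def pp2_dtw_cycles(I, net_d):
    """Simplified PP2 DTW time complexity: 126 + I(487 + 4 NetD)."""
    return 126 + I * (487 + 4 * net_d)


def matches_per_second(total_cycles):
    """Whole matches completed in one second by one group of PEs."""
    return int(1_000_000 / (total_cycles / CYCLES_PER_US))


DISTANCE_CYCLES = 142_439  # Table 7.12 entry for p=8, r=6, I=40, NetD=18

total = DISTANCE_CYCLES + pp2_dtw_cycles(I=40, net_d=18)
print(total, "cycles,", total / CYCLES_PER_US / 1000, "ms,",
      matches_per_second(total), "matches/s")
```

The distance program accounts for most of the total, which is why PP2 ends up slower than PP1 even though its DTW loop is shorter.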
It had been expected that the number of cycles would decrease because 
two frames of coefficients were being combined into one distance score, which 
would take less time to pass through the network. This did not happen since 
in PP2, p PEs are used in parallel to compute each local distance. The 
distance calculation requires log p transfers to sum the square of the 
differences between coefficients (lines 13-17 in Figure 7.6). This is done once 
for each distance score, yielding a total of approximately I(2r+1) log p 
transfers. The rearrange program needs transfers to move the LPC coefficients 
to the appropriate
Table 7.12 Execution times for distance calculations for PP2.

Program            PP2 distance
p                         8          8         16         16
r                         6          6          6          6
I                        40         40         40         40
Number of PEs            13         13         16         16
NetD                      0         18          0         18
Transfers             1,614      1,614      1,912      1,912
Cycles              113,387    142,439    123,705    158,121
Time/Comparison   28,347 µs  35,609 µs  30,927 µs  39,531 µs
Table 7.13 Execution times for dynamic time warping program PP2.

Time/Comparison      33,249 µs  41,232 µs
Comparisons/Second          30         24

Time/Comparison      35,828 µs  45,152 µs
Comparisons/Second          24         22
destinations and uses p transfers per frame for a total of Ip transfers. If p is 
greater than (2r+1) log p, the distance program will use fewer transfers.
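The two transfer counts can be compared numerically. The following sketch uses only the approximate expressions given in the text, takes log p as the base-2 logarithm, and uses hypothetical function names.

```python
import math


def rearrange_transfers(I, p):
    """Transfers used by the rearrange program: p per frame, I frames."""
    return I * p


def distance_transfers_approx(I, r, p):
    """Approximate transfers used by the PP2 distance program: I(2r+1) log p."""
    return I * (2 * r + 1) * int(math.log2(p))


def distance_uses_fewer(r, p):
    """Condition from the text: p greater than (2r+1) log p."""
    return p > (2 * r + 1) * int(math.log2(p))


print(rearrange_transfers(40, 8), distance_transfers_approx(40, 6, 8))
print(distance_uses_fewer(6, 8), distance_uses_fewer(6, 128))
```

For the typical system (I=40, r=6, p=8) the rearrange program needs 320 transfers against roughly 1,560 for the distance program, close to the 1,614 measured in Table 7.12; with r=6 the condition only starts to favor the distance program for much larger p, for example p=128.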
7.6.4. Summary
The previous sections have presented three programs for dynamic time 
warping. The serial parallel (SP) program broadcasts the unknown input 
utterance to all PEs and each PE executes a serial DTW program to compare it 
to a known utterance. The two parallel parallel (PP) programs use 2r+1 PEs to 
perform each match. The PP1 program moves the data to all PEs, then 
computes the local distances as they are needed during the DTW program. Each 
local distance is computed in a single PE; however, all 2r+1 PEs can be 
computing a different local distance simultaneously. The PP2 program computes 
the local distances as the data is being moved to the PEs. p PEs are used to 
compute one distance score. All local distances are computed before the DTW 
program starts executing.
The SP program is the fastest of the three. It can match 169 pairs of 
utterances (consisting of 40 frames of 8 coefficients each) in one second using 13 
MC68000’s running at 8 MHz. The PP1 program is the next fastest, matching 
63 pairs per second under the same conditions, and PP2 is slowest, matching 24 
pairs per second. Tables 7.9, 7.7, and 7.12 summarize the execution times for a 
typical speech recognition system. If faster processing rates are needed, the SP 
program can use N PEs to compute N comparisons simultaneously. The PP 
programs can use sets of 2r+1 PEs in parallel so that N PEs can compute 
⌊N/(2r+1)⌋ DTW comparisons in parallel.
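The scaling rules in this paragraph amount to simple integer arithmetic. The sketch below uses the rates quoted in the summary (13 matches per second per PE for SP, 63 and 24 per group of 2r+1 PEs for PP1 and PP2); the function names are hypothetical.

```python
def sp_rate(n_pes, per_pe_rate=13):
    """SP: every PE runs its own match, so the rate scales with the PE count."""
    return n_pes * per_pe_rate


def pp_groups(n_pes, r):
    """PP: each match occupies a group of 2r+1 PEs."""
    return n_pes // (2 * r + 1)


def pp_rate(n_pes, r, per_group_rate):
    """PP rate when every complete group of 2r+1 PEs works on its own match."""
    return pp_groups(n_pes, r) * per_group_rate


print(sp_rate(13))                        # 169
print(pp_rate(13, 6, 63))                 # 63  (PP1, one group)
print(pp_rate(13, 6, 24))                 # 24  (PP2, one group)
print(pp_groups(100, 6), sp_rate(100))    # 7 1300
```

With 100 PEs and r=6, nine PEs are left over in the PP arrangement, since only seven complete groups of 13 fit.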
The SP program was fastest since it required fewer data transfers between 
PEs (none at all after the DTW starts executing except for the recursive dou­
bling needed to find the minimum distance score), and it uses fewer variables 
than the PP programs. The SP program stores all of its variables in registers, 
while the PP programs have more variables than registers, so some variables 
are stored in memory. The MC68000 uses four more cycles to reference 
memory than a register; therefore the PP programs, while executing about the 
same number of operations, run slower than the SP program. The PP pro­
grams could run faster if the processor in the PE had more registers (at least 18
data registers), or faster memory access.
The PP1 program is the next fastest DTW program since the PP2 
distance program uses p PEs to compute one distance score in parallel. The PP1 
DTW program uses 2r+1 PEs to compute 2r+1 distance scores serially within 
each PE. Since p < 2r+1 in the typical system, the distance program has 
2r+1−p PEs idle when computing local distances. Therefore the PP1 DTW 
program makes better use of the available parallel computing power.
Although the SP program is a serial program running in each PE, if the 
program is being run under SIMD control, data conditional masking must be 
used in each PE to find the minimum of two registers. Data conditional 
masking is a time consuming operation and should be avoided if possible. It 
would not be needed if the MC68000 could execute a “minimum” instruction 
directly, but it is unrealistic to expect the processor to have every possible 
“handy” instruction in its instruction set. A better approach would be to use 
a processor with programmable microcode or use a custom processor. A library 
of commonly used microcode operations could be available to the programmer so 
simple operations like finding the minimum of two registers could be executed 
with one instruction. This would reduce the number of times data conditional 
masking is used, and should reduce the execution time.
The MC68000 does not have programmable microcode, but this feature 
could be simulated by letting each PE execute code out of its own memory 
while running in SIMD mode. Again, a library of commonly used functions 
could be stored in the local memory of each PE. Each function would be writ­
ten so the execution time was independent of the data processed so all proces­
sors would execute the instruction in the same amount of time.
The DTW programs all used the Shift ±1 transfer functions, the PE 0 to 
CU link, and the CU broadcast. The PP2 program used the Cube transfers 
and the Shift +n transfer function for p < n < 2r.
Overall, the SIMD architecture implemented with MC68000s is well suited 
for the DTW programs.
7.7. SIMD Machine Based Isolated Word Recognition System
Previous sections in this chapter have presented programs for performing 
various speech recognition tasks. This section shows how these programs are 
assembled together to perform the function of the speech recognition system 
shown in Figure 4.1. The parameters listed on Figure 4.1 are for processing 
telephone quality speech. Table 7.14 lists parameters for telephone quality and 
high quality speech processing.
The following section presents the main program which calls each of the 
speech processing programs as they are needed, and contains the endpoint 
detection program. The main program contains the endpoint detection 
program since the LPC program is not called until after the beginning of an 
utterance is found, and the LTW and DTW programs are not called until after an 
entire utterance is found. Section 7.7.2 discusses the data allocation used by 
each program, and Section 7.7.3 discusses the execution times of the entire 
system. Section 7.7.4 discusses the size of the input buffers needed to hold the 
incoming speech samples while the DTW program is executing. Section 7.7.5 
summarizes Section 7.7. Figure 7.8 is a Flock Algol algorithm for the main 
program in the speech recognizer and Figure A.13 is the MC68000 program.
7.7.1. Endpoint Detection
The endpoint portion of the main program finds the endpoints based on 
the energy in each frame as discussed in Section 4.5. The program does not use 
the zero crossing (ZX) rate discussed in Section 4.5 since Lamel [LRRW81] 
states it is not always effective.
The endpoint program checks the energy of the current frame by having 
PE 0 send its autocorrelation coefficient R[0] to the CU after the 
autocorrelation program is executed. If the energy is greater than lothresh*, the low
*Unlike the method used in [RaSa75], lothresh and hithresh are not adaptive. They are 
constants that are set before the program is executed.







Sample Rate                6.67 KHz     20 KHz       20 KHz
Bits per Sample            8            16           16
LPC Coefficients           8            16           16
Bits per Coefficient       16           16           16
Vocabulary Size (words)    10-1,000     10-1,000     1,000








                This is the main routine. It calls filter()
                and auto() to preemphasize the signal and find
                the autocorrelation coefficients. If R(0)
                (the energy) is greater than lothresh, it calls
                lpc(). This main routine also does the
                endpoint detection. After an utterance is
                detected, ltw() and dtw() are called.
Number of PEs:  100
Parameters:     autocoef, the number of autocorrelation coefs.
                r, the width of the warping path.
                p, the number of LPC coefficients.
                NetD, the network delay time.
                I, the number of frames per utterance.
                VOCABSIZE, the size of the vocabulary.
                Sample i mod N is in PE i.

i:              Index to current input sample.
found:          = TRUE if utterance has been found.
lothresh:       Lower threshold.
hithresh:       Upper threshold. See Section 4.6.
M:              Index to current input frame.
input[x]:       Input samples, PE i contains sample
                i of frame x.
filout:         Filtered output for filter program.
                PE i contains sample i of frame.
R[]:            Autocorrelation coefficients,
                all coefficients in all PEs.
lpcout[x]:      LPC coefficients. PE i contains
                coefficient i of frame x.
ltwout[x]:      Output utterance from LTW program.
                PE i contains coefficient
                i from frame x.
shuffout[x]:    Output of shuffle program.
                All PEs contain all coefficients
                from all frames.
lib[x]:         Library of known utterances.
                PE i contains all coefficients from
                all frames for utterance x.
                Each PE contains a different
                utterance.
scores[i]:      Output scores from all DTW matches,
                contained in PE r. */

 1  PROCEDURE main
 2    found ← FALSE;
 3    M ← 0;
 4    i ← 0;
 5    WHILE(TRUE)
 6      /*

11      Find autocorrelation coefficients

16      Take the energy R[0] in PE 0 and
17      pass to the CU for endpoint detection.
18      */

23      If the energy is greater than the
24      low threshold, compute the LPC
25      coefficients and save in lpcout[].
26      */
27      IF energy > lothresh
28        IF energy > hithresh
29          found ← TRUE;
30        lpc(R[], lpcout[M]);
31        M ← M + 1;
32      /*
33      Otherwise, this may be the end of

44      For each word in the vocabulary,
45      do a DTW and save the scores.
46      */
47      shuffle(ltwout[], shuffout[]);
48      FOR j ← 0 TO VOCABSIZE-1

        Use the next frame.
        i ← i + 1;

Figure 7.8 Flock Algol algorithm for isolated word recognition. Contains 
endpoint detection algorithm and calls the filter, autocorrelation, LPC, LTW, 
and DTW algorithms.
threshold, the main program calls the LPC program to compute the LPC 
coefficients and saves them in an array. If the energy is greater than hithresh, 
the found flag is set to TRUE. If the energy is less than lothresh and the 
found flag is TRUE, the LTW program is called, followed by the rearrange 
program and the SP DTW program. If the energy is less than lothresh and the 
found flag is FALSE, the saved coefficients are discarded.
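The threshold logic described above can be sketched as a small state machine over per-frame energies. This is a simplified illustration, not the Flock Algol program: detect_utterances is a hypothetical name, the function only records frame indices where the real program would call the LPC, LTW, rearrange, and DTW programs, and an energy exactly equal to lothresh is treated as below the threshold (an assumption, since the text only covers "greater than" and "less than").

```python
def detect_utterances(energies, lothresh, hithresh):
    """Return (first_frame, last_frame) pairs for each detected utterance."""
    utterances = []
    saved = []       # frames whose LPC coefficients would have been saved
    found = False    # set once some frame exceeds the upper threshold
    for frame, energy in enumerate(energies):
        if energy > lothresh:
            saved.append(frame)        # lpc() would run for this frame
            if energy > hithresh:
                found = True
        else:
            if found:
                # End of an utterance: ltw(), rearrange and dtw() would run here.
                utterances.append((saved[0], saved[-1]))
            # In either case the saved coefficients are no longer needed.
            saved = []
            found = False
    return utterances


energies = [1, 2, 6, 12, 9, 2, 6, 7, 1]
print(detect_utterances(energies, lothresh=5, hithresh=10))   # [(2, 4)]
```

In the example, frames 6 and 7 exceed lothresh but never hithresh, so their coefficients are discarded rather than matched. An utterance still in progress when the input ends is not reported by this sketch.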
7.7.2. Data Allocation
When combining SIMD machine programs the output data arrangement of 
one program must match the input data format of the program that follows. 
The programs presented earlier in this chapter were written so their data for­
mats matched.
The filter program in Section 7.2 expects the input data to be stored with 
sample i mod N in PE i for 0 ≤ i < N, where N is the total number of PEs 
and mod is the modulus function. The autocorrelation program in Section 7.3 
takes the input data in the same format the filter program outputs and stores 
its output so all autocorrelation coefficients are in all PEs. The LPC program 
in Section 7.4 uses just 8 PEs, and expects all the autocorrelation coefficients in 
each PE, just as the autocorrelation program left it. The LPC program leaves 
LPC coefficient i in PE i for 0 ≤ i < p. The next task in Figure 4.1 is the 
endpoint detection. The endpoint detection program does not process the data 
as the other programs do. Instead, it decides whether or not an input 
utterance has been detected. If it has, the data is sent to the programs which 
follow. Otherwise, the data is discarded. The LTW routine is called after the 
endpoint routine has detected an utterance. The LTW routine expects PE i to 
contain coefficient i of frame j for 0 ≤ i < p and 0 ≤ j < I. This is the 
arrangement that the LPC program outputs. The output data arrangement of
the LTW program is the same as the input data arrangement.
The SP DTW program needs all frames of the unknown utterance stored 
in all PEs. This is not the format output by the LTW. The rearrange routine 
moves the data from the arrangement output by the LTW program to the
arrangement the DTW program uses as input.
When running the DTW program, each PE contains W/N known utter­
ances where W is the total number of utterances in the vocabulary, and N is 
the number of PEs. The SP DTW program is executed W/N times and the 
distance scores are accumulated in the scores array in each PE.
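The two allocation rules in this section, sample i mod N in PE i and W/N known utterances per PE, can be sketched as index arithmetic. The function names are hypothetical, and the round-robin assignment of utterances to PEs is an assumption made for illustration; the text only states how many utterances each PE holds.

```python
import math


def pe_for_sample(i, n_pes):
    """PE that holds input sample i under the 'sample i mod N' layout."""
    return i % n_pes


def dtw_passes(vocab_size, n_pes):
    """Number of times the SP DTW program runs to cover the whole vocabulary."""
    return math.ceil(vocab_size / n_pes)


def utterances_in_pe(pe, vocab_size, n_pes):
    """Known utterances stored in one PE (round-robin assignment, assumed)."""
    return list(range(pe, vocab_size, n_pes))


print(pe_for_sample(205, 100))                 # 5
print(dtw_passes(1000, 100))                   # 10
print(utterances_in_pe(3, 1000, 100)[:3])      # [3, 103, 203]
```

With W=1,000 and N=100 each PE holds 10 known utterances, so the SP DTW program is executed 10 times per unknown utterance.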
7.7.3. Execution Times
To process high quality speech in real time the system must meet the 
specifications in Table 7.14. Table 7.4 shows that if p =8 (not p=16 as shown 
in Table 7.14) and N=100, the filter, autocorrelation, and LPC programs can 
process data at 29 KHz. Table 7.9 shows that the SP DTW program can com­
pare 12 utterances per second using one PE which is 1,000 utterances per 
second using 77 PEs. These two facts show that the SIMD parallel machine 
can easily process high quality speech in real time. The only problem is the 
filter, autocorrelation, and LPC programs and the DTW program must execute 
within the allowed amount of time. Figure 7.9 shows the time and number of 
PEs used for each task in the system. The filter and autocorrelation programs 
process all input data. If the energy is below the lower threshold, the LPC program 
is not run. Frames 1 and 2 in Figure 7.9 did not exceed the threshold. 
Frames 3 through I-1 did, and the LPC coefficients are found for each of 
them. Frame I was below the low threshold, which marks the end of the utterance. 
The LTW program then is executed. During this time, the input data is 
being saved in a buffer since the PEs are not running the filter and autocorrela­
tion programs.
After the LTW program, the data is rearranged so all PEs contain the 
unknown input utterance. Finally, the SP DTW program is executed in all 100 
PEs. In the end 100 distance scores are computed and the smallest score comes 
from the known utterance that best matches the unknown input utterance.
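The data layout described above can be illustrated with a small Python sketch. It is not the SP DTW program from this report: it assumes a textbook DTW recurrence with an absolute-difference distance on scalar frames, and the names dtw_distance and recognize are hypothetical. It only shows how a vocabulary can be split across N PEs, with every PE holding the whole unknown utterance and the smallest score selecting the match.

```python
import math

def dtw_distance(a, b):
    """Textbook DTW between two sequences of scalar frames (assumed distance: |x - y|)."""
    rows, cols = len(a), len(b)
    d = [[math.inf] * (cols + 1) for _ in range(rows + 1)]
    d[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[rows][cols]

def recognize(unknown, vocabulary, num_pes):
    """Emulate num_pes PEs, each comparing the unknown against its share of the templates."""
    scores = {}
    for pe in range(num_pes):
        # Templates pe, pe + num_pes, ... are assumed to live in this PE's memory.
        for word in range(pe, len(vocabulary), num_pes):
            scores[word] = dtw_distance(unknown, vocabulary[word])
    best = min(scores, key=scores.get)   # smallest distance is the best match
    return best, scores[best]
```

On a real SIMD machine the inner loop would run in lockstep across all PEs; the sequential loop here only mimics the assignment of templates to PEs.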
Figure 7.9 shows that most of the system time is spent executing the 
filtering, autocorrelation, and LPC programs. For a typical utterance with 40 
frames, 40(1.8 + 1.6) = 136 ms are spent computing the LPC coefficients from 
the speech samples. The DTW program uses 79.4 ms for both the rearrange 
and SP programs. Since the LPC program uses only 8 PEs, 92 PEs are idle 
during 64 ms of the LPC computation time. These idle PEs can be used if
several frames of LPC coefficients are computed in parallel. To do this, the 
autocorrelation program would leave the first frame of coefficients in PEs 0 
through p-1. The LPC program would not be executed as described above; 
instead the autocorrelation program would be run again. The autocorrelation 
coefficients from the second run would be stored in PEs p through 2p-1. This 
would be repeated with the autocorrelation coefficients from frame i stored in 
PEs ip through (i+1)p-1. Then the LPC program could be run and it would 
compute ⌊N/p⌋ frames of LPC coefficients simultaneously, where N is the 
number of PEs. If this approach were used on the system in Figure 7.9, the 
filter, autocorrelation, and LPC execution time would be reduced to 78.4 ms, 
not including the time to move data from PEs 0 through p-1 to PEs ip 
through (i+1)p-1. Although this approach will increase the throughput, it will 
also increase the delay between the time the speech enters the system and the 
time LPC coefficients are computed. This is because the computation of the 
LPC coefficients of frame 0 must wait until the autocorrelation coefficients of 
frame ⌊N/p⌋ are computed. Such a delay is undesirable for real-time processing.
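The 78.4 ms figure can be reproduced with a short calculation, sketched here in Python. It assumes 1.8 ms per frame for filtering plus autocorrelation and 1.6 ms per run of the LPC program regardless of how many frames that run covers; these per-step times are taken from the 136 ms estimate for 40 frames. The function name is hypothetical, and data movement time is ignored, as in the text.

```python
import math

def lpc_front_end_ms(frames, num_pes, p, t_auto_ms=1.8, t_lpc_ms=1.6):
    """Filter + autocorrelation on every frame, with the LPC program run once per batch."""
    frames_per_batch = num_pes // p                  # floor(N/p) frames solved at once
    lpc_runs = math.ceil(frames / frames_per_batch)  # batches needed for the utterance
    return frames * t_auto_ms + lpc_runs * t_lpc_ms

serial_ms = 40 * (1.8 + 1.6)                 # one LPC run per frame
batched_ms = lpc_front_end_ms(40, 100, 8)    # 12 frames per LPC run, so 4 runs
```

With N=100 and p=8, twelve frames fit in one LPC run, so a 40 frame utterance needs four runs instead of forty.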
The DTW program could execute in fewer cycles with more PEs if needed. 
For a 1,000 word vocabulary, the area in Figure 7.9 will be constant, so adding 
more PEs will decrease the execution time, and removing PEs will increase the 
execution time. Increasing the execution time will delay the processing of new 
input samples, which would have to be buffered while the DTW program is 
running. The next section discusses the effects of the DTW execution time on 
the input buffer size.
7.7.4. Buffering the Input Data
After executing the DTW program, approximately 80 ms have passed 
since the last input frame was processed. During this time 1,600 new samples 
will arrive if the sampling rate is 20 KHz. The input data is spread among 100 
PEs, so each PE needs a buffer of 16 16-bit words to hold the new data while 
executing the DTW program. Each additional 100 utterances added to the 
vocabulary require 15 more 16-bit words of buffer space, so the 1,000 word 
vocabulary needs 151 16-bit words of storage in each PE to hold the new input
samples that arrive while the DTW program is running.
The filter, autocorrelation, and LPC programs can process data at 29 KHz 
when p = 8, while high quality speech samples arrive at 20 KHz; therefore the 
system can empty the buffer at a rate of 9 KHz. The 100 utterance system 
takes 178 ms to catch up, while the 1,000 utterance system takes 1,690 ms. 
Both of these times assume the energy is greater than the low threshold and 
the LPC coefficients are computed for each frame. If the energy is less than 
the low threshold, the endpoint routine does not call the LPC program. The 
sampling rate for the filtering and autocorrelation programs is 55 KHz (Table 
7.3); therefore the buffer will empty at 35 KHz. The 100 utterance system will 
catch up in 45 ms, while the 1,000 utterance system will need 431 ms. Most
real-time speech recognition systems can tolerate a delay of 431 ms.
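The buffer size and catch-up time for the 100 utterance case follow from simple rate arithmetic, sketched below in Python. The function names are hypothetical; the rates (20, 29, and 55 KHz) and the 80 ms DTW time are the values quoted in the text, and the sketch assumes the backlog drains at the difference between the processing rate and the arrival rate.

```python
def buffer_words_per_pe(dtw_ms, sample_khz, num_pes):
    """Samples arriving during the DTW run, spread evenly over the PEs."""
    samples = dtw_ms * sample_khz          # ms * (samples per ms)
    return samples, samples / num_pes

def catch_up_ms(samples_buffered, process_khz, sample_khz):
    """Time to drain the backlog when processing outpaces arrival."""
    return samples_buffered / (process_khz - sample_khz)

samples, words = buffer_words_per_pe(80, 20, 100)   # 1,600 samples, 16 words per PE
with_lpc = catch_up_ms(samples, 29, 20)             # drains at 9 KHz
without_lpc = catch_up_ms(samples, 55, 20)          # drains at 35 KHz
```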
7.7.5. Summary
Although the SIMD speech recognition system can process data at 20 KHz 
and have a 1,000 utterance vocabulary, a buffer is needed to hold the input 
samples as the DTW program is run, and the utterances must be spaced far 
enough apart so that a subsequent utterance does not end before the buffers are 
emptied. Table 7.15 summarizes the buffer requirements for p=8. The buffer 
requirements were not computed for p=16 since the 
filter + autocorrelation + LPC programs can process at most 14K samples per 
second when NetD=18, and 19K samples per second when NetD=0.
This chapter has shown that an SIMD machine using a current technology 
processor in each of its PEs and CU can process high quality speech in real 
time. The next section gives concluding remarks and describes the strengths 





Table 7.15 Buffer requirements for SIMD speech recognition system
                             Vocabulary size
                             100 utterances   1,000 utterances
DTW Time                     80 ms            745 ms
Samples Buffered             1,600            15,000
PE Buffer Size               16               150
Catch Up Time with LPC       178 ms           1,670 ms
Catch Up Time without LPC    45 ms            431 ms
7.8. Conclusions
Designing a parallel processor is difficult without knowing the types of programs 
it will run. This chapter has presented a parallel speech recognition system 
based on an SIMD machine. The experience gained in programming the 
SIMD machine to recognize isolated words will help in refining the SIMD 
machine design for speech recognition. The following sections discuss the 
different parts of the SIMD machine and give details as to which features it 
should have for real-time speech recognition.
7.8.1. The Processor
Each PE and the CU contain a processor. Sim68 simulated each processor 
as an MC68000 microprocessor, which proved to be well suited for the typical 
isolated word recognition system presented in Chapter 4. The following sec­
tions discuss what was good about the MC68000, and what improvements 
could be made if a custom processor were used.
7.8.1.1. Data Size and Type - 16-bit signed fixed point
Most speech data can be represented as a 16-bit signed integer; therefore 
the processor should operate on 16-bit data. The autocorrelation, LPC, and 
LTW routines used some 32-bit values, so 32-bit addition should also be implemented.
The LPC, LTW, and DTW routines could have used floating point opera­
tions, but they were able to be implemented using only fixed point operations. 
Adding floating point operations would make writing some of the programs 
easier and might reduce the execution times of the LPC and LTW programs.
Some DTW programs use a distance measure which requires taking the 
logarithm of a value [Itak75]. The logarithm function can be approximated
using fixed-point arithmetic, but this places a burden on the programmer. A 
system using such a distance measure may benefit from having hardware 
floating-point operations since it makes the machine easier to program.
7.8.1.2. Internal Registers - At Least 18 Data Registers
The MC68000 has 8 32-bit data registers. Comparing the SP and PP1 
DTW programs showed that more registers could be used. The SP program 
has only a few variables and keeps them all in registers. The PP1 program has 
18 variables, which must be stored in memory. Although the two programs 
execute similar code, the SP program takes half the time of the PP1 program 
because it does not reference variables in memory as often. For the speech 
recognition system used here, at least 18 data registers are needed since the 
PP1 program uses 18 variables.
7.8.1.3. Memory Size — 2K bytes
Table 7.16 summarizes the memory requirements for each of the programs 
in the speech recognition system. Many of the programs can store all their 
variables in the internal registers, therefore they require no PE memory. The 
total memory usage for the CU is 1,680 bytes and each PE uses 352 bytes. The 
main routine passes the data to the other routines by using pointers, therefore 
most routines use little PE memory, while the main routine (and endpoint) uses 
the most PE memory.
A CU memory size of 2K bytes and a PE memory size of 512 bytes should 
be enough for the proposed speech recognition system. Using 512 bytes for the 
PE memory allows 352 bytes for the variables and 160 bytes for buffer space.
7.8.1.4. Instruction Set - Add Mcc
The instruction set of the MC68000 is well suited for speech signal processing 
since it is a 16-bit processor. The most important operations are the 16 
and 32-bit signed additions and subtractions, the 16 by 16-bit signed multiply, 
and the 32 by 16-bit signed divide.














The need for data conditional masking could be reduced if a new instruction 
called Mcc were implemented. The Mcc instruction is like the Bcc 
instruction, which branches when a condition code is true. The Mcc instruction 
would move data from one register to another when a condition code is 
true. Finding the minimum of two variables takes 9.5 µs using data conditional 
masking. The Mcc instruction could reduce this to about 3 µs.
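The difference between the two approaches can be sketched in Python, with one list element standing for one PE. This is only an emulation of the semantics under assumed names (min_with_mask and min_with_mcc are hypothetical); it says nothing about the instruction timings quoted above.

```python
def min_with_mask(a, b):
    """Data conditional masking: build a per-PE mask, then run the move only where it is set."""
    mask = [x > y for x, y in zip(a, b)]   # mask set-up step, computed at run time
    result = list(a)
    for pe, enabled in enumerate(mask):
        if enabled:                        # disabled PEs sit out the masked statement
            result[pe] = b[pe]
    return result

def min_with_mcc(a, b):
    """Hypothetical Mcc: compare, then move b over a only where the condition code is true."""
    return [y if x > y else x for x, y in zip(a, b)]
```

Both functions give the same per-PE minimum; the point of Mcc is that the separate mask set-up step disappears.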
7.8.2. Inter-PE Communication - Cube, Shift(±1), and Broadcasts
Table 7.17 shows the inter-PE communication usages for each of the programs. 
The Shift(±1) and Cube interconnection functions are frequently used 
by the programs and should be implemented with hardware so they will 
transfer quickly. The Perm function is used only by the LPC routine and does 
not need a hardware implementation since it is infrequently used.
The broadcasts are all performed by the CU using self modifying code, 
which requires no special hardware. The TOCU path from PE 0 to the CU is 
needed by the endpoint routine so the CU can make conditional branches based 
on the data in the PEs. The rearrange program uses the TOCU path to broadcast 
data from PE 0 to all PEs.
7.8.3. Masking — Data Conditional
Of the two different masking techniques discussed in Section 2.3, the 
speech recognition system programs used only the data conditional mask. In 
all but the DTW program, general PE masks could have been used instead of 
the data conditional masks. The data conditional masks were used since it was 
clearer which set of PEs were being enabled. In many cases, general PE masks 
will execute faster than data conditional masks because they can be computed 
once at compile time. The data conditional masks, however, must be computed 
at run time, once for every time the mask is used. Table 7.18 summarizes 
the times the data conditional mask is used and gives the time, in cycles, it 
takes to set up the data conditional mask and the time taken by the statements 
affected by the mask. The LPC program is the only program that used 
the ELSEWHERE mask, and its times are indicated by 8/91, which means the































WHERE condition takes 8 cycles and the ELSEWHERE takes 91 cycles. The 
table shows that except for the LPC program, the set up time for the data con­
ditional mask is longer than the time taken by the statements affected by the 
mask. The Mcc instruction (described earlier) could be used in all but the 
LPC program instead of the data conditional mask. This would reduce the 
execution times.
7.8.4. MC68000 Clock Rate - 8 MHz
All the instruction timings presented have assumed an 8 MHz clock rate. 
Some versions of the MC68000 can run using a 12.5 MHz clock rate. This 
clock rate with a no wait state memory will cause the programs to run 50% 
faster. Although the proposed system can run in real time with the 8 MHz 
clock, the faster clock rate will allow changes in the system (such as increasing 
the number of LPC coefficients) and still run in real time.
7.8.5. Number of PEs — 100
Table 7.19 summarizes the number of PEs used by each program in the 
parallel word recognition system. By using 100 PEs, the MC68000 based SIMD 
machine is able to implement a typical speech recognition system in real time. 
The value of 100 was chosen because
1) it is the maximum number of PEs that can be used by the autocorrelation
program, and
2) the DTW program can compare 1,000 utterance pairs in 0.8 seconds.
The number of PEs used by the autocorrelation, LPC, LTW, and rearrange 
programs was determined by the problem size. The autocorrelation program 
uses N=100 PEs, which is more than all the other programs. Its PE 
usage is equal to the number of samples in a frame of speech. The preemphasis 
filter program can use any number of PEs, so it uses the same number as the 
autocorrelation program. The LPC and LTW programs use p=8 PEs. Since 
p < N, N-p = 92 PEs are idle during the execution times of the LPC and LTW 
programs. The DTW program can use any number of PEs too. It uses all 100 
since the autocorrelation program uses 100. If there are fewer than 100
utterances in the vocabulary, some PEs will be idle during the DTW's execution. 
The rearrange program uses as many PEs as the DTW program since 
its job is to rearrange the data for the DTW program.
Table 7.19 Number of PEs used by the parallel speech recognition system
Program      Number of PEs   Determined by
filter       1 or more
auto         100             N (frame size)
LPC          8               p (Number of LPC coefficients)
endpoint     0
LTW          8               p (Number of LPC coefficients)
rearrange                    Number of PEs used by DTW
DTW          1 or more
Using half as many PEs will increase the execution time of the autocorrelation 
program by only 3%. The following example shows how the proposed 
system can be implemented using 50 PEs. The filter, autocorrelation, and LPC 
programs require 148, 7,026, and 6,352 cycles respectively to execute on 100 
PEs. If 50 PEs are used, the LPC program will require the same number of 
cycles since it uses only 8 PEs, and the filter program will use twice as many 
cycles since it will be executed twice for every input frame. The autocorrelation 
program will use 7,214 cycles for a total of 2*148 + 7,214 + 6,352 = 
13,862 cycles, which is a sampling rate of 28 KHz. This is only 1 KHz 
slower than when 100 PEs are used. Therefore, 50 PEs can be used and still 
process speech in real time; however, the DTW program will require twice as 
much time when using 50 PEs. With 50 PEs the DTW program will use 1.6 
seconds on a 1,000 word vocabulary, which is considered too long for real time
response.
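The cycle totals and sampling rates in the 50 PE example can be checked with a few lines of Python. The cycle time is not restated in this section; the value of 0.25 µs per cycle used below is an assumption, chosen because it reproduces the quoted 28 and 29 KHz rates from the quoted cycle counts. The function name is hypothetical.

```python
CYCLE_US = 0.25   # assumption: inferred from the quoted cycle counts and rates

def sample_rate_khz(cycles_per_frame, samples_per_frame=100, cycle_us=CYCLE_US):
    """Samples processed per millisecond, i.e. the sustainable sampling rate in KHz."""
    frame_ms = cycles_per_frame * cycle_us / 1000.0
    return samples_per_frame / frame_ms

cycles_100_pes = 148 + 7026 + 6352        # filter + autocorrelation + LPC
cycles_50_pes = 2 * 148 + 7214 + 6352     # filter runs twice per frame on 50 PEs
rate_100 = sample_rate_khz(cycles_100_pes)
rate_50 = sample_rate_khz(cycles_50_pes)
```

Under this assumption the two rates come out a little above 29 KHz and a little below 29 KHz respectively, which matches the quoted figures when the fractions are dropped.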
7.8.6. Changing the Word Recognition System Parameters
It has been shown that the proposed isolated word recognition system can 
process high quality speech in real time. The following sections discuss the 
effects of altering the system parameters on the processing throughput.
7.8.6.1. Changing the LPC Frame Size
If the frame size is increased, the autocorrelation program can use more 
PEs, and the execution time will increase in proportion to log M (where M is 
the frame size) based on the time complexity equations. The time between 
frames will increase if the sample rate remains the same. Suppose the frame 
size is doubled to 200 samples and the sampling rate remains the same. The 
autocorrelation program requires 7,836 cycles per frame which is a sampling 
rate of 102 KHz (assuming NetD=18 and autocoef=9). This is nearly twice the 
throughput of the program using 100 sample frames (see Table 7.3).
If the frame size of the above example is doubled from 100 to 200 samples, 
and 100 PEs are still used, the autocorrelation program will use 8,204 cycles, 
the filter program will use twice as many cycles, and the LPC program will use the 
same number of cycles. The total will be 8,202 + 2*148 + 10,106 = 18,426 
cycles, which is a sampling rate of 43 KHz. This is faster than using 100 samples 
per frame, which yields 39 KHz.
Reducing the frame size would reduce the number of PEs used. The dura­
tion of a frame is based on the characteristics of the vocal tract and the pro­
posed duration (5 ms) is shorter than what is commonly used (10-20 ms); there­
fore a frame size reduction would most likely result from a decrease in the sam­
pling rate.
7.8.6.2. Changing the Number of LPC Coefficients
The proposed isolated word recognition system has assumed 8 LPC 
coefficients are used. Many high quality speech processing systems use as many 
as 16 LPC coefficients. Table 7.4 shows that the maximum sampling rate for 
16 coefficients is 19 KHz; 14 KHz for NetD=18. Although most high quality 
systems sample at 15 to 20 KHz and these are near that range, there is no time 
left for executing the DTW program. This shows that the 8 MHz MC68000 
SIMD machine based system is able to process in real time, but it does not 
have much leeway. Increasing the number of LPC coefficients makes it unable 
to process in real time.
The proposed system assumes a 5 ms frame size. Typically 10 to 20 ms 
frames are used. If the frame size is increased to 10 ms by using 200 samples 
per frame and 100 PEs are still used, the time needed will be 8,202 cycles for 
the autocorrelation program, 2*148 cycles for the filtering program, and 14,200 
cycles for the LPC program. This gives a total of 22,698 cycles to process 200 
samples for a sampling rate of 35 KHz, which is fast enough for high quality
speech.
7.8.6.3. Changing the Number of Frames per Utterance
The proposed system assumed that I=40 frames per utterance were output 
from the LTW and processed by the DTW program. The LTW and DTW execution 
times are proportional to I, so increasing I will increase the LTW and 
DTW processing times. Thus a larger buffer is needed to store the incoming 
speech samples while the LTW and DTW programs are executing. Decreasing 
I, on the other hand, will shorten the LTW and DTW execution times and 
require a smaller input buffer.
7.8.6.4. Changing the Vocabulary Size
The DTW program is the only program whose execution time depends on 
the vocabulary size. The DTW execution time is proportional to ⌈W/N⌉, where 
W is the number of words in the vocabulary and N is the number of PEs. As 
with the number of frames per utterance, an increase in W will require a larger 
input buffer, and a decrease will require a smaller input buffer.
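The dependence on vocabulary size can be sketched in Python, taking the number of DTW passes to be the ceiling of W/N. The function names are hypothetical, and the time per pass is left as a parameter; the 74.5 ms used in the example is an assumption derived from the 745 ms DTW time listed in Table 7.15 for 1,000 utterances on 100 PEs.

```python
import math

def dtw_passes(vocab_size, num_pes):
    """Number of times the DTW program runs when each PE handles one template per pass."""
    return math.ceil(vocab_size / num_pes)

def dtw_time_ms(vocab_size, num_pes, ms_per_pass):
    """Total DTW time, proportional to the number of passes."""
    return dtw_passes(vocab_size, num_pes) * ms_per_pass

full_vocab_ms = dtw_time_ms(1000, 100, 74.5)   # 10 passes
```

Because of the ceiling, a vocabulary of 1,001 words on 100 PEs already needs an eleventh pass, and any vocabulary of 100 words or fewer needs exactly one.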
7.8.7. Summary
The proposed SIMD machine based isolated word recognition system is 
able to execute in real time using 100 PEs. Many of the word recognition 
parameters can be changed and the system will still run in real time. However, 
increasing the number of LPC coefficients from 8 to 16 without increasing the 
frame size will cause the system, as it is implemented here, to run slower than 
real time. The performance of this system is conservative because:
1) a clock rate of 8 MHz was used, although 12.5 MHz MC68000s are available,
2) the PE and CU instruction executions were not overlapped,
3) the LPC frame size was assumed to be 5 ms where 10 to 20 ms are normally
used,
4) the network delay was assumed to be 4.5 µs per 16-bit word and was not
overlapped with the instruction execution, and
5) the LPC program uses only 8 PEs and leaves 92 PEs idle.
Increasing the clock rate to 12.5 MHz would increase the throughput by 50% if 
no wait state memory is used. The table on page 59 of [SiKu82] shows that
overlapping the CU and PE instruction execution can result in a 50% speedup. 
As shown earlier, increasing the frame size and using the same number of PEs 
reduces the number of computations. Using a faster network and overlapping 
network transfers can give an effective network delay of 0 which improves the 
throughput. Finally, computing the LPC coefficients for several frames in 
parallel will reduce the number of parallel computations needed for the LPC 
routine.
Considering all of the above, the SIMD based isolated word system has the 
power needed to execute the proposed system in real time. A system requiring 
more computations can be implemented in real time if a less conservative 
model is used.
8. SIMULATING VLSI PROCESSOR ARRAYS
Section 5.3 showed how a VLSI processor array could reduce the number 
of loops needed to perform a given task. Of course the question left 
unanswered was “How much time does a loop take?” The following section 
describes Poker, an emulator for a processor array called Pringle, which has 
been used to obtain timings. The Poker system was written by members of the 
Computer Science Department at Purdue University to help in developing the 
Blue CHiP project [Snyder82a].
8.1. Poker Details
The CHiP (Configurable, Highly Parallel) computer [Snyder82a] is a family 
of architectures, each constructed from a switch lattice and a collection of 
microprocessors (called cells*). The switch lattice consists of many switches 
that can be connected to each other and to adjacent cells. Figure 8.1 shows a 
possible layout of switches and cells, where the circles represent switches and 
the squares are cells. Each switch can be dynamically programmed to connect 
to any of its eight nearest neighbors (i.e., any switch or cell to the north, east, 
west, south, northeast, northwest, southeast, or southwest). The cells are not 
connected directly to each other, but communicate through the switch lattice. 
This connection is a circuit switch rather than a packet switch. The VLSI 
array structure of two cells being connected can be realized in a CHiP architecture 
by connecting two cells through a switch. The VLSI array computer can
therefore be included as a member of the CHiP computer family by using this 
type of inter-cell connection.
* Although Poker documentation calls its processors PEs, I will continue to call the 
processors associated with VLSI arrays cells, and reserve the label PEs for processors in an 
SIMD machine.
Figure 8.1. Typical switch lattice.
The Poker System provides a means to emulate Pringle, a CHiP computer 
[Snyder82a]. The Poker programming environment gives the user the following 
tools for developing programs for a CHiP computer:
1) A high level language called xx that allows one to write code for each cell
without having to be concerned with details of the hardware.
2) The ability to set switch settings, thus controlling which port on one cell can
communicate to another port on another cell.
3) A simple way to assign which cell will run which xx code, and pass different
parameters to cells running the same code.
4) A way to map the logical port names given in the xx code to the physical
ports given in the switch settings.
5) An added feature that allows a user to trace the execution of an xx program 
on a line by line basis.
Details about using 1 through 4 above are given in [Snyder83]. The major 
difference between the hardware emulated by Poker and a CHiP computer is 
the switch lattice. Poker does not use a circuit switched interconnection as a
CHiP computer does. Instead, each cell has an output latch and an input 
queue between it and the switch lattice. The latch is polled regularly by the 
switch hardware. If it contains data, the data is moved to the input queue of 
the destination cell.
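The latch and queue arrangement described above can be sketched in Python. The class and function names are hypothetical, and the behaviour when a cell writes to a latch that is still full is an assumption (the write is simply refused here); the text only states that the switch polls each latch and moves any data to the destination cell's input queue.

```python
from collections import deque

class Cell:
    """One emulated cell: a single output latch and an input queue (illustrative names)."""
    def __init__(self):
        self.out_latch = None          # holds at most one (destination, value) pair
        self.in_queue = deque()

    def write(self, dest, value):
        if self.out_latch is not None:
            return False               # assumed: latch still full, writer must retry
        self.out_latch = (dest, value)
        return True

def poll_switch(cells):
    """One polling pass by the switch: move every latched value to its destination queue."""
    moved = 0
    for cell in cells:
        if cell.out_latch is not None:
            dest, value = cell.out_latch
            cells[dest].in_queue.append(value)
            cell.out_latch = None
            moved += 1
    return moved
```

In contrast to a circuit switched lattice, data here reaches the destination only when the next polling pass occurs, which is why the emulated timing differs from that of a true CHiP switch.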
Although Poker does not directly emulate the inter-cell communication of 
a VLSI array processor it does emulate enough of the VLSI array to obtain 
meaningful timings. The following sections describe the Poker programming 
environment and the hardware it emulates.
8.1.1. Software for Emulating with Poker
8.1.1.1. The xx Programming Language
The xx programming language is a simplified sequential programming 
language for defining the code for the cells in Poker. Figure B.l in Appendix B 
gives a complete description of the language. The example in Figure 8.2 shows 
some of the features of the language and the conventions that will be used here 
in presenting Poker programs. The line numbers on the left in the figure are 
used to refer to portions of the figure.
The block of comments before the first numbered line is a standard header
that appears before each major program. Each section of the header is
described in the following list.
Program Name gives the name of the program as listed in the code names section. 
The name will be followed by the program name (as used in the 
text) in parentheses if more than one program uses the same name.
Algorithm will give the figure number of the corresponding xx code if the pro­
gram is an assembly language program. The xx programs will give the 
figure number of the algorithm it is implementing.
Machine will be the VLSI processor array.
Function will give a brief description of what the program does.
Precision lists the number of bits and format for the input, output, and any 
other important variables used by the program.
Number of PEs will list the number of cells used by the VLSI processor array.
Parameters lists and describes the parameters that affect the execution times.
Input tells which port the input data comes from in the VLSI processor array.
Output is the corresponding information to Input.
Loop Time tells how many µs are needed to process one input sample in the 
VLSI processor array.
Max Sample Rate tells how many samples can be processed in one second.
Lines 1-12 of Figure 8.2 show that a comment is enclosed between /* and */, 
and can span more than one line.
Line 14 declares this code to be named auto, and must be stored in a file 
named auto.x. If parameters were passed to this cell, the line would be
Program Name: auto (a1)
Section: 6.2
Machine: VLSI processor array, simulated by Poker.
Function: Find autocorrelation coefficients R(i)
given input signal x(m), using
R(i) = sum over k = 0, ..., M-i-1 of x(k) x(k+i)































Precision: Input: 32-bit floating point
Output: 32-bit floating point 
Number of PEs: p, the number of coefficients computed.
Parameters: p, the number of coefficients computed.
Input: Arrives at the north port of cell (1,3).
Output: Departs from east port of merge cell.
Loop Time: 90 µs to process one input sample.
Max Sample Rate: 11 KHz
This routine finds the first p autocorrelation coefficients 
of its input data. The value of p depends on the number of 
cells used. One sample is read from each of the two input 
ports (in1 and in2). The sample coming from the in1 port 
is written to the bottom port (out) so the cell below 
can use it during the next cycle. The two samples are 
multiplied together and added to a running sum (sum). After 
one frame's worth of samples have been read (as determined by 




in1, in2, out, results;
sint i,samples; /* Samples per frame */
real top,left,sum; /* These are type int for (a2) */




out <- sum; /* Send a zero out to initialize the pipeline */
while true do
begin 
i := i + 1;
Figure 8.2. An example of an xx program.
30 top <- in1;
31 left <- in2;
32
33 if i < samples then
34 begin
35 out <- top;




40 sum := sum +
41 results <- sum
42 sum := 0;
43 out <- sum;




/* Has one frame been processed? */ 
/* No */
ft; /* Last sample in frame */
/* send out results */
/* Reinitialize, sum */ 





where arg1 and arg2 are given in the code names section, which is discussed 
in Section 8.1.1.3.
Line 15 gives the variables to be traced. All variables listed here (up to four) 
will appear on the screen during a run, and in the Trace file if used. 
This allows monitoring of the variables during execution, but would not
be used in a production setting.
Line 16 tells which I/O ports will be used. These are logical names, and will 
not be associated with physical names until load time. The data in the 
port names section tell which logical name to map to which physical 
direction.
Line 17 starts the beginning of the program.
Line 18 declares i and samples to be of type short integer (sint).
Lines 19 and 20 declare several variables to be of type real.
Lines 22-24 are assignment statements.
Line 25 writes the value of sum to the port out. Notice that := is the
assignment operator, while <- is the read/write port operator.
Line 27 is a while statement, and the boolean value true is always true, so this 
loop will go on forever.
Line 28 is the start of a begin/end pair.
The rest of the code is much like any other FORTRAN-like high level 
language.
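The behaviour of one autocorrelation cell, as described in the header comment of Figure 8.2 (read one sample from each input port, pass the in1 sample on to the cell below, multiply the two samples and accumulate, then emit the sum once per frame), can be mimicked in Python. This is a simplified sketch, not a translation of the xx listing: the function name is hypothetical, every in1 sample is forwarded, and port timing and the zero written to out at the end of a frame are not modelled.

```python
def auto_cell(top_stream, left_stream, samples):
    """Emulate one cell: multiply-accumulate paired inputs and emit one sum per frame."""
    results, passed_down = [], []
    total, i = 0.0, 0
    for top, left in zip(top_stream, left_stream):
        i += 1
        passed_down.append(top)        # forwarded to the cell below (port out)
        total += top * left            # running sum of products
        if i == samples:               # last sample in the frame
            results.append(total)      # sent to port results
            total, i = 0.0, 0          # reinitialize for the next frame
    return results, passed_down

x = [1.0, 2.0, 3.0, 4.0]
# Feeding the signal and a copy delayed by one sample gives the lag-1 coefficient.
r1, forwarded = auto_cell(x[1:], x[:-1], samples=3)
```

How the delayed copy of the signal reaches each cell is a property of the array wiring and is outside this sketch.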
8.1.1.2. The Switch
Figure 8.3a is an example of a configuration of cells for a VLSI processor 
array algorithm, and Figure 8.3b is the switch setting that implements it. The 
particular algorithm is for autocorrelation, and is used as an example of a typical 
algorithm. Each box
+-+
x,y
+-+
is a cell, where x is the row number of the cell and y is the column number. 
A is a switch, and the -, \, /, and | are the data paths.
Figure 8.3. (a) Example of a cell configuration for a VLSI algorithm. 
(b) Example of Poker switch settings for the algorithm.
Each processor has eight logical switch input/output ports. Most programs 
presented here use a given port for either input or output but not both, 
so often arrow heads are used to show the direction the data flows. This has 
no effect on the hardware or software; the arrows are used to make the data 
flow clearer to the reader. Also, some data paths are used to synchronize two 
cells. In this case, the arrival of data at cell A marks some event at cell B, and 
the value of the data passed is ignored. The data paths used in this manner 
are drawn with light lines, while true data paths are drawn with heavy lines.
8.1.1.3. Code Names
Each processor can run different code. The code name listing on the right 
of Figure 8.4 shows which program is run on which processor. There is a 
correspondence between the left and the right halves of the figure. The upper 
left cell in the switch runs the code listed in the upper left of the code names. 
If a cell is unused, no name is listed. In reality, all cells run all the time, but 
the unused cells run code called empty which is a statement jumping to itself.
Some programs will have data values listed below the program names. 
These values are passed to the given program as arguments on the line declar­
ing the name of the code. For example if line 14 of Figure 8.2 were:
code auto(arg1,arg2,arg3,arg4);
the first value listed below the code name in Figure 8.3 would be passed as 
arg1, the value below it as arg2, and so on. Up to four values can be passed. 
The values need not be the same for different cells running the same code.
8.1.1.4. Port Names
As mentioned above, each port can be assigned a logical name. This name 
is mapped to a physical port during load time. The port names given in the 
example in Figure 8.5 show the mapping from logical names to physical ports. 
The position of a given name in a cell identifies the port to which it is con­
nected. The positions are:
        north
   nw          ne
west              east
   sw          se
        south
Figure 8.5. Example of Poker port name assignments.
When running assembly code, data is written to the physical ports, and not 
logical ports; therefore the port assignment table is not needed.
8.1.2. Hardware Emulated by Poker
Figure 8.6 shows the hardware used in one cell of Poker. The main com­
ponents are an Intel 8051 microprocessor, an Intel 8231 Arithmetic Processor 
Unit (APU), and the switch interface. The following gives more details about 
the hardware emulated by Poker.
8.1.2.1. The Intel 8051 Microprocessor
The heart of the hardware is an Intel 8051 single-component 8-bit microcomputer 
[Intel]. It is an 8-bit processor designed for single chip operations as 
a controller or as an arithmetic processor. It runs with a 12 MHz clock and the 
shortest instruction takes 12 cycles, or 1 µs. An 8-bit register addition or subtraction 
takes 1 µs while an 8-bit unsigned multiplication takes 4 µs. Figure 
B.2 is a list of the 8051 instruction set including execution times for each 
instruction.
The 8051 has two types of RAM, internal and external. There are 256 
bytes of internal RAM with the upper 128 bytes being special function regis­
ters. These registers allow access to the two built-in 16-bit timers, the four 
built-in 8-bit I/O ports, and other special features of the 8051. (Figure B.3 
gives an example of how to use the built-in timer to control the execution time 
of a loop.) The lower 128 bytes can be used as regular memory. Most assembly 
language programs presented here use only the internal RAM.
The external RAM consists of 4K bytes of EPROM and 2K bytes of 
RAM. The EPROM contains routines used to support the xx code. The RAM 
holds the user’s program and data. The external RAM is accessed only 
through a special register, and thus takes more processor time to use than the 
internal RAM.
Figure 8.6. Poker cell detail (from [Field]).
8.1.2.2. The Arithmetic Processing Unit (APU)
There is an Intel 8231 APU to assist the 8051 microprocessor with 32-bit 
floating point arithmetic. The two processors communicate via an 8-bit
command latch and an 8-bit data latch. The 8051 pushes data onto the 8231's 
stack, sends a command, and then pops the result. The APU executes a 32-bit 
floating-point addition in at most 92 µs, a subtraction in 93 µs, a
multiplication in 42 µs, and a division in 48 µs. These maximum execution
times are too slow for most speech processing. Also, there is considerable
overhead in pushing/popping data to/from the APU, so it is faster for the
8051 to perform some operations than to send them to the APU.
Variables declared to be type real or int in xx are 32 bits long and are 
processed by the APU. Variables of type sint, on the other hand, are 8 bits
each and are processed directly by the 8051.
8.1.2.3. The Switch
An 8051 can communicate with other 8051s through the switch. The 
switch is a crossbar switch that allows any processor to talk to any other
processor. An 8051 talks to the switch through an 11-bit wide output latch and 
an 11-bit wide, 16-word deep input queue. Since each processor has 8 logical 
I/O ports that are implemented by one latch and queue, three of the 11 bits 
are directional information, i.e., they tell to which port the remaining 8 bits
of data are to go. The same is true for the input queue: 8 bits are data, and
three bits are the tag telling from which port the data came.
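The 11-bit word format described above can be sketched in Python. This is only an illustration: the text says that 3 bits select the port and 8 bits carry data, but it does not say where the tag sits in the word, so placing the tag in the upper bits is an assumption, as are the function names pack_word and unpack_word.

```python
TAG_BITS = 3    # selects one of the 8 logical ports
DATA_BITS = 8   # one byte of payload

def pack_word(port, data):
    """Combine a 3-bit port tag and an 8-bit data byte into one 11-bit word.

    Assumption: the tag occupies the upper three bits of the word.
    """
    if not 0 <= port < (1 << TAG_BITS):
        raise ValueError("port must fit in 3 bits")
    if not 0 <= data < (1 << DATA_BITS):
        raise ValueError("data must fit in 8 bits")
    return (port << DATA_BITS) | data

def unpack_word(word):
    """Split an 11-bit word back into (port tag, data byte)."""
    return word >> DATA_BITS, word & ((1 << DATA_BITS) - 1)
```

For example, pack_word(5, 0xA7) gives 0x5A7 under this layout, and unpack_word recovers the port and the byte.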
The switch can poll 8 cells every µs. There are 64 processor cells in an 8 
by 8 square, plus 32 more I/O cells along the edges of the square, giving a total 
of 96 cells, or 12 µs to do one scan. It is the software's responsibility to wait 
12 µs between writes to the output latch to be sure the previous data was
written. If two writes happen between scans, the first data written is lost.
Figure B.4 gives an example of how to read/write data from/to the switch.
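The effect of the 12 µs scan on closely spaced writes can be illustrated with a small Python model. This is a sketch under stated assumptions, not the Poker implementation: the latch is taken to be polled at exact multiples of the scan period, a scan that coincides with a write is taken to happen first, and the name delivered is hypothetical.

```python
SCAN_PERIOD_US = 96 // 8   # 96 cells polled 8 per microsecond -> 12 us per scan

def delivered(writes, scan_period=SCAN_PERIOD_US):
    """Return the bytes the switch picks up from a one-word output latch.

    writes is a list of (time_us, byte) pairs in increasing time order.
    A byte still sitting in the latch when the next write occurs is lost.
    """
    picked_up = []
    latch = None
    next_scan = scan_period
    for time_us, byte in writes:
        # Perform every scan that occurs up to the moment of this write.
        while next_scan <= time_us:
            if latch is not None:
                picked_up.append(latch)
                latch = None
            next_scan += scan_period
        latch = byte               # overwrites any byte not yet scanned
    if latch is not None:          # the final byte is taken by the next scan
        picked_up.append(latch)
    return picked_up
```

With writes spaced 12 µs apart every byte is delivered; with two writes only 5 µs apart the first one is overwritten before the switch sees it.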
Once the data is received from the switch, it is the programmer’s responsi­
bility to check the tag and buffer the data until all four bytes have arrived 
from the same direction. In some programs, the data comes from only one 
direction, or a known direction, so the direction need not be checked. This 
short cut is used frequently in the assembly routines presented in the following 
chapters to decrease the execution time of the algorithms.
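The buffering that the programmer must otherwise do can be sketched as follows. The Python below is an illustration only: it assumes the input queue is seen as a stream of (tag, byte) pairs and that a value is complete after four bytes from the same port; the name collect_words is hypothetical.

```python
from collections import defaultdict

WORD_BYTES = 4   # a 32-bit value arrives as four tagged bytes

def collect_words(queue):
    """Group tagged bytes from the input queue into complete 4-byte values.

    queue is an iterable of (port_tag, byte) pairs in arrival order.
    Yields (port_tag, bytes) each time four bytes from one port have arrived.
    """
    pending = defaultdict(bytearray)   # one software buffer per port tag
    for tag, byte in queue:
        buf = pending[tag]
        buf.append(byte)
        if len(buf) == WORD_BYTES:
            yield tag, bytes(buf)
            buf.clear()
```

When the data is known to come from a single direction, the per-tag buffers are unnecessary, which is the short cut the assembly routines exploit.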
When using xx, the high level language, all port checking and delaying are 
handled by the compiler and/or loader.
8.1.2.4. The 8051 Assembler
The assembler used for the 8051 supports all the mnemonics for machine 




would move the data from a (the accumulator) to the internal RAM location 
called sum. The output from the assembler, shown in the figures in Appendix 
B, prints the execution time in µs for each instruction to the left of the
instruction.
The assembler also allows files to be inserted into the current input file. A 
line of the form:
#include "filename.h"
will stop the assembler from reading the current file and start reading 
filename.h. Once filename.h is read, processing is continued on the previous 
input file. Two commonly used include files are ports.h and util.h. ports.h
contains the I/O port definitions, as shown in Figure B.5. util.h contains the 
definitions of writedelay, which waits a fixed amount of time for data to be
read from the output latch, and readwait, which waits for data to appear in the 
input queue. util.h is listed in Figure B.6.
8.1.3. Summary
This section has presented the Poker system that is used to simulate VLSI 
processor arrays. A brief description was given of both the hardware and 
software, with emphasis on how the hardware affects the software. The impor­
tant points with respect to the simulations described in the following sections 
are:
1) Although each cell has an Intel 8231 APU, it is often faster to use the Intel
8051 microprocessor to perform the 8 and 16-bit fixed point arithmetic.
2) Each cell has eight logical I/O ports which are implemented by one output
latch and one 16-word deep input queue. 8-bit data is written into the 
output latch; the latch is polled once every 12 µs, so there must be a 
12-µs delay between writes to the latch.
3) The 8051 has two 16-bit timers that can be used to synchronize cells.
Overall, Poker provides an accurate simulation of a VLSI processor array.
8.2. Simulation of Filtering Algorithms
This section presents two different digital filtering algorithm simulations. 
The first is a direct implementation of the VLSI algorithm discussed in Section
6.1.1. This algorithm uses no broadcasts and produces one output every two 
loops. The second algorithm is based on the VLSI algorithm in Section 6.1.2. 
Here broadcasts are used, and one output is produced during every loop.
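As a point of reference, the difference equation that the filter listings state in their headers, y_m = sum of b_k x_(m-k) for k = 0..p plus sum of a_k y_(m-k) for k = 1..q, can be written sequentially in Python. This is a minimal sketch, not one of the Poker programs: the function name iir_filter and the list layout of the coefficients are assumptions, and samples before the start of the input are taken to be zero.

```python
def iir_filter(x, b, a):
    """Sequential reference for the recursive filter.

    b = [b0, ..., bp] are the feed-forward coefficients,
    a = [a1, ..., aq] are the feedback coefficients (added, as in the listings).
    """
    y = []
    for m in range(len(x)):
        acc = 0.0
        for k, bk in enumerate(b):            # k = 0 .. p
            if m - k >= 0:
                acc += bk * x[m - k]
        for k, ak in enumerate(a, start=1):   # k = 1 .. q
            if m - k >= 0:
                acc += ak * y[m - k]
        y.append(acc)
    return y
```

Each multiply-accumulate term in the two loops corresponds to one cell of the array, which is why the array needs p+q+1 cells.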
The following is a list of requirements a filtering program must meet to 
process speech data in real time.
Sampling rate: The sampling rate for speech data ranges from 6.67 KHz for 
telephone quality speech to 20 KHz for high quality speech. The filter 
program must process speech data at these rates to run in real time. 
Precision: Speech data needs about 8 bits per sample for telephone quality 
speech and 11 to 12 bits per sample for high quality speech.
Type of filter: Selecting values for p and q depends on the type of filter used. 
The selection of p and q does not affect the execution time of these 
filtering algorithms; it changes only the number of cells that are used. 
Therefore during the simulations, p and q are generally set to values 
that produce convenient sized arrays.
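The sampling-rate requirement translates directly into a time budget per sample. The Python sketch below makes that conversion explicit, using loop times reported later in this chapter; the function names are hypothetical.

```python
def sample_period_us(rate_hz):
    """Time available per input sample, in microseconds."""
    return 1e6 / rate_hz

def meets_real_time(loop_time_us, loops_per_sample, rate_hz):
    """True if the program finishes one sample within one sample period."""
    return loop_time_us * loops_per_sample <= sample_period_us(rate_hz)
```

At 6.67 KHz the budget is about 150 µs per sample, and at 20 KHz it is 50 µs, so a program needing two 1,008 µs loops per sample falls far short, while a 33 µs loop fits even the 20 KHz budget.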
8.2.1. Digital Filtering Without Broadcasts
Figures 8.7, 8.8, and 8.9 show the switch settings, port names, and xx
routines, respectively, used to simulate the first filter algorithm with p=2 and 
q=2. The values selected for p and q have no effect on the execution time of 
this program. For convenience, these values were selected so that the array 
would fit in a four by four cell arrangement. The number listed under the 
name filter on the right half of Figure 8.7 is the value of the filter coefficient 





[Figure 8.7. Switch settings and code names (zero, filter, output, dummy) for
the no broadcast filter; coefficient labels b0, b1, b2, a1, a2.]
Figure 8.8. Port names for no broadcast xx filter program, p=2 and q=2.




/*
   Machine:   VLSI processor array, simulated by Poker.
   Function:  Compute y_m given x_m using
                 y_m = \sum_{k=0}^{p} b_k x_{m-k} + \sum_{k=1}^{q} a_k y_{m-k}
   Precision:
      Input:        32-bit floating point.
      Coefficients: 32-bit floating point.
      Output:       32-bit floating point.
   Number of PEs: p+q+1, the number of coefficients.
   Parameters:    p+q+1, the number of coefficients.
   Input:     Arrives at the north port of cell (2,1).
   Output:    Departs from the south port of cell (4,3).
   Loop Time: 2,016 µs to produce one output sample.
*/
 1  code filter(coef);
 2  trace sum,in;
 3  ports Topin, Topout, Botin, Botout;
 4  begin
 5     real Topin, Topout, Botin, Botout;
 6     real coef, sum, in;
 7     real zero;
 8
 9     sum := 0;
10     zero := 0;
11     Topout <- zero;
12     Botout <- zero;
13
14     while true do
15     begin
16        in <- Botin;
17        Topout <- in;
18        sum <- Topin;
19        sum := sum + coef * in;
20        Botout <- sum;
21     end
22  end.
Figure 8.9. xx code for no broadcast filter program.
Dec 16 08:41 1983  dummy.x  Page 1
 1  code dummy;
 2  trace tmp;
 3  ports Topin, Botout;
 4  begin
 5     real tmp;
 6     real Topin, Botout;
 7
 8     tmp := 0.0;
 9
10     while true do
11     begin
12        Botout <- tmp;
13        tmp <- Topin;
14     end
15  end.









9 zero := 0.0;
10 i := 1.0;
11
12 while true do
13 begin
14 tmp <- sync;
15 out <- i;
16 i := i + 1;
17 if (i > 10.0) then
18 i := 1.0;
19
20 tmp <- sync;




















Dec 16 08:41 1983 zero.x Page 1
1 code zero;
2 ports Botout, sync;
3 begin
4 real dumb,z;










Botout <- z;
end
Figure 8.9 (Continued)
for b0=3, b1=2, b2=1, a1=[illegible in source]. Figure 8.10 lists the execution
time in µs for each statement in the filter program. Column one is the number
of times the given statement was executed during the simulation. Columns two 
through four are the minimum, average, and maximum times in µs for the 
given statement. The total time for one loop is 1,008 µs, and two loops are 
required to process one input. This gives a total time of 2,016 µs, or a
sampling rate of less than 500 Hz. Briefly, the main delays causing the program
to be so slow are the time for the inter-cell communications and the time needed 
to send data to and from the APU. This will be discussed in more detail in 
Section 8.2.2.2.
500 Hz is not fast enough for speech processing. This problem is overcome 
by the algorithm discussed in the next section.
8.2.2. Digital Filtering Using Broadcasts
The previous filtering algorithm could not process data fast enough to 
filter speech signals since it required two loops to produce one sample and each 
loop took 1,008 µs. An implementation of the VLSI algorithm presented in 
Section 6.1.2 can produce one sample for every loop. It does this by replacing 
the upward flowing pipeline with two simultaneous broadcasts. Three
programs were written to run this algorithm. They are as follows:
Name   Language   Data Size   Sum Size
f1     xx         32 bit      32 bit
f2     8051       8 bit       16 bit
f3     8051       16 bit      24 bit
All three programs implement the same algorithm. They differ in the 
language in which they are written and in the precision of the data they
process. Program f1 still cannot process data fast enough for real-time speech 
filtering. Programs f2 and f3 show that by reducing the precision of the data 
and coding in assembly language, one can process data fast enough for
real-time speech processing. The following sections describe each program.
Count  Min  Ave  Max
                        code filter(coef);
                        trace sum,in;
                        ports Topin, Topout, Botin, Botout;
                        begin
                           real Topin, Topout, Botin, Botout;
    1   10   10   10       real coef, sum, in;
    1    0    0    0       real zero;
    1  178  178  178       sum := 0;
    1  178  178  178       zero := 0;
    1   91   91   91       Topout <- zero;
    1   91   91   91       Botout <- zero;
    1    0    0    0       while true do
                           begin
   29  268  268  268          in <- Botin;
   29   91   91   91          Topout <- in;
   29  238  238  238          sum <- Topin;
   29  318  318  318          sum := sum + coef * in;
   29   91   91   91          Botout <- sum;
   29    2    2    2       end
                        end.




Figures 8.11, 8.12, and 8.13 show the switch settings, port names, and xx 
listings, respectively, for program f1 with p=1 and q=2. For
convenience, these values for p and q are chosen so that all the cells used for the 
filtering operation will fit along one column of a four by four array. As before, 
the values of p and q have no effect on the execution time of the algorithm 
unless they are large enough to lengthen the time needed to broadcast a value
to all cells.
The heavy lines in Figure 8.11 are the data paths, while the lighter lines 
are paths used for synchronization. Notice the similarities between the switch 
setting of Figure 8.11 and Figure 6.2. Figure 8.14 lists the execution time in µs 
for each statement in the filter program.
Some general comments about these times are:
1) The variable declarations require some execution time because various flags
are set during runtime to indicate which variables are traced. In a 
production system the variables would not need to be traced.
2) All writes to output ports take 91 µs. They are not buffered and go
immediately to the switch lattice.
3) Reads from input ports, on the other hand, can vary greatly in execution
time. The data coming from the switch lattice enters a 16-word 
hardware input buffer. When the cell reads from the buffer, it gets 
one byte of data along with a tag telling which port the byte came 
from. If the data did not come from the desired port, the data is 
stored in a buffer for use when the cell wants to read from the given 
port.
The total time for one loop is 906 µs. Since one sample is processed every 
loop, the sample rate which can be handled is about 1.1 KHz, still too slow for 
speech processing. The execution time for one loop is spent as shown in Table
8.1. 65% of the time is for I/O, while only 35% is for the actual computation. 
Figure B.4 shows it takes 49 µs to write four bytes to an output port, while
Figure 8.14 shows that writing to an output port takes 91 µs. The additional 42 
µs are the overhead introduced by the compiler. Part of this overhead is
moving the data from external RAM to internal RAM. The xx compiler stores all 
type reals in external RAM while the example in Figure B.4 assumed the data 






















Figure 8.11. Switch settings and code names for fast filter (f1) program for 
p=1 and q=2. The heavy lines are the data paths, while the lighter lines are 
paths used for synchronization.
Figure 8.12. Port names of fast filter (f1) program.
Dec 16 09:34 1983  filter.x  Page 1
/*
   Machine:   VLSI processor array, simulated by Poker.
   Function:  Compute y_m given x_m using
                 y_m = \sum_{k=0}^{p} b_k x_{m-k} + \sum_{k=1}^{q} a_k y_{m-k}
   Precision:
      Input:        32-bit floating point.
      Coefficients: 32-bit floating point.
      Output:       32-bit floating point.
   Number of PEs: p+q+1, the number of coefficients.
   Parameters:    p+q+1, the number of coefficients.
   Input:     Arrives at the north port of cell (2,1).
   Output:    Departs from the south port of cell (4,3).
   Loop Time: 906 µs to produce one output sample.
*/
 1  code filter(coef);
 2  trace sum,in;
 3  ports right, top, out;
 4  begin
 5     real right, top, out;
 6     real coef, sum, in;
 7
 8     sum := 0.0;
 9     out <- 0.0;
10
11     while true do
12     begin
13        in <- right;
14        sum <- top;
15        sum := sum + coef * in;
16        out <- sum;
17     end
18  end.
Figure 8.13. xx code for fast filter (f1) program.








8 i := 1.0;
9
10 while true do
11 begin
12 tmp <- sync;
13 out <- i;
14 i := i + 1.0;
15 end
16 end.

















while true do 
begin 
out <- in; 
end







7 while true do
8 begin
9 dumb <- sync;





Table 8.1 Execution times for filtering program f1.










Count  Min  Ave  Max
                        code filter(coef);
                        trace sum,in;
                        ports right, top, out;
                        begin
                           real right, top, out;
    1   10   10   10       real coef, sum, in;
    1   52   52   52       sum := 0.0;
    1  143  143  143       out <- 0.0;
    1    0    0    0       while true do
                           begin
   33  250  255  418          in <- right;
   33  238  240  310          sum <- top;
   32  318  318  318          sum := sum + coef * in;
   32   91   91   91          out <- sum;
   32    2    2    2       end
                        end.
Figure 8.14. Execution times for xx fast filter (f1) program.
The computation takes 318 µs to multiply two numbers and add the
product to a running sum. Most of this time is spent moving data from external 
RAM to the APU and back again.
The largest percent of time is spent reading an input port. The data 
arrives one byte at a time, with a tag telling which port it came from. The 
software must maintain a separate buffer for each possible tag since a tag 
represents a logical input port. This buffer management requires a great deal 
of time, as Figure 8.14 shows.
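The split between I/O and computation can be checked from the average statement times listed in Figure 8.14. The Python below is a small worked calculation; the dictionary layout and variable names are only for illustration.

```python
# Average time in microseconds for each statement in the f1 loop (Figure 8.14).
loop_us = {
    "in <- right": 255,
    "sum <- top": 240,
    "sum := sum + coef * in": 318,
    "out <- sum": 91,
    "end": 2,
}

total_us = sum(loop_us.values())
io_us = loop_us["in <- right"] + loop_us["sum <- top"] + loop_us["out <- sum"]
compute_us = loop_us["sum := sum + coef * in"]

io_percent = round(100 * io_us / total_us)
compute_percent = round(100 * compute_us / total_us)
sample_rate_hz = 1e6 / total_us
```

The statement times add up to the 906 µs loop, with roughly 65% spent on port reads and writes and roughly 35% on the multiply and add, for a sample rate of about 1.1 KHz.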
8.2.2.2. Programming Techniques for Reducing Execution Times
By using assembly language programming, the following techniques can be 
applied to reduce the execution time of a loop.
1) Reduce the data size. Although most applications do not need 32-bit
floating point arithmetic, the current version of xx supports only 32-bit
floating point and integer arithmetic.* Digital filtering can be done with 8 or 
16-bit signed fixed point data. This allows the 8051 to do the
computations directly, thus saving the overhead of sending the data to the APU. 
Also, reducing the data size reduces the amount of data to send through 
the switch.
2) Use the 12 µs delay time between writes to the switch. Of the 49 µs
needed to move four bytes of data from internal RAM to the switch, 33 
µs are nops ("no operations") waiting on the switch. These 33 µs could 
be used to perform a computation.
3) Store variables in internal RAM. In assembly language, all important
variables can be stored in internal RAM, thus eliminating the overhead of 
referencing external RAM.
4) Control the arrival time of data. The arrival of data to the input port can
be controlled so that data will arrive in the order needed. This
eliminates the need for time consuming buffer management.
* xx does have a short integer (sint) which is 8 bits, unsigned. Being unsigned reduces its 
usefulness for this application.
Given the current implementation of the xx programming language, 
assembly language programs are needed to get the throughput for real-time 
processing. The following sections describe f2 and f3. These programs are 
written in 8051 assembly language and use the above techniques to reduce the 
time of a loop.
8.2.2.3. Fast Assembly Language Filter Program - f2
The f2 program uses 8-bit inputs and produces a 16-bit sum. Figure 8.15 
shows the switch settings for f2 with p=1 and q=2, and Figure B.7 is a listing 
of the program. There are no port names given since these assembly language 
routines reference the physical ports and not the logical ports. Comments have 
been added to the f2 listing to help explain what it is doing. For example, the 
line
;8 sum <- 0;
is a comment meaning that the assembly statements that follow perform the 
same function as line 8 of the corresponding f1 program. The ; identifies the 
start of a comment.
Program f2 implements the same algorithm as f1 with one exception. The 
communication through the switch is carefully controlled so that data arrives 
in the order it is needed. This eliminates the need to check the source tag and 
buffer inputs. Unfortunately, the switch settings of Figure 8.11 result in a race 
condition when cells (3,1) and (4,1) write to their south ports at the same time. 
The destination of both writes is cell (4,1) and the order of arrival is uncertain. 
To prevent this, the south port of cell (4,1) goes to the output cell (4,2), which 
delays the data slightly before writing it to its west port. The arrival times are 
controlled by using some data paths only for synchronization. Figure 8.16 
shows the arrival times and the name of the input port from which the data 
came for the following example:
1) At time one each of the filter cells (1,1), (2,1), (3,1), and (4,1) writes data to 
its south port, and the zero cell (1,2) writes to its north port. This data 
arrives at the north ports of the filter cells and the south port of the 
output cell (4,2) at time two.
Figure 8.15. Switch settings for 8-bit fast filter (f2) for p=1 and q=2.
Figure 8.16. Arrival times and port names for f2.
2) The output cell (4,2) sends data out its west port after getting data from its
south port. This data arrives at the east ports of cells (3,1) and (4,1) 
and the south port of the input cell (2,2) at time three.
3) The arrival of data at the input cell (2,2) signals it to write to its west port
at time three. This arrives at the east port of filter cells (1,1) and (2,1) 
and the west port of the zero cell (1,2) at time four.
4) The arrival of data at the zero cell signals it to write a zero value to its
north port, which arrives at the north port of cell (1,1) at time five.
Now all cells have data as Figure 8.16 shows. The data is guaranteed to 
arrive in this order each time through the loop since the transfers are done syn­
chronously.
In f2, cells (1,1), (2,1), (3,1), and (4,1) perform the same code at the same 
time. At the start of the loop, the data from the north port is in the input 
queue as shown above. Both bytes are read and saved in internal RAM. Next, 
the data from the east port is in the queue. It is read and the computation 
performed. Finally, the sum is written to the south port.
The LSB (least significant byte) of the sum is written before the MSB 
(most significant byte) is computed. This allows the computation to overlap 
the 12 µs switch waiting time.
The input data and the filter coefficients are 8 bits while the running sum 
is 16. The output cell removes the upper 8 bits when passing the sum back to 
cells (3,1) and (4,1).
Figure 8.17 gives the equivalent times for each of the xx code statements. 
The total time for one loop is 33 µs, or a sample rate of about 30 KHz. This is 
well above the rate needed for speech processing.
There are some practical problems with f2. The data size of 8 bits is
adequate for telephone quality speech, but many applications use more than 8 bits. 
Also the input cell is tightly coupled to the other cells, i.e., it must produce 
input data at a given time. If it is too soon or too late, the data will enter the 
queue at the wrong time and be mistaken for other input data. In a real
application, the speech sample rate should not have to be tied to the processor 





  f1    f3    f2
  xx   8051  8051
                     real right, top, out;
                     real coef, sum, in;
                     sum := 0.0;
                     out <- 0.0;
                     sum := sum + coef * in
                     out <- sum;
   2     2     2     end
                     end.
 906    59    33     Total loop time
[The remaining per-statement times are not recoverable from the source.]
Figure 8.17. Execution times in µs for fast filter programs.
8.2.2.4. Fast Assembly Language Filter Program - f3
Program f3 overcomes the shortcomings of f2 by using 16-bit input data 
and keeping a 24-bit sum. Also, the input cell is decoupled from the rest of the 
cells thus allowing input to arrive at any time following a constant delay after 
the previous input. Figure 8.18 gives the switch settings for f3, and Figure B.8 
lists the program. The switch settings differ from f2 in that the data flow 
between the output cell (4,2) and the input cell (2,2) is reversed. In program f2 
cell (4,2) would signal the input cell (2,2) when a value arrived from cell (4,1). 
This signal indicated to the input cell that the most recently input data value 
had produced a result at the end of the pipeline and it was time to start 
another value. Since program f3 runs asynchronously, the input cell must 
notify cell (4,2) that new data has arrived, so cell (4,2) can synchronize with 
the other cells receiving input data.
When an input is produced, the data goes to cells (1,1) and (2,1). Also, 
cells (1,2) and (4,2) get the data but do not use it. Instead, the input signals 
them to output data, so that soon after filter cells (1,1) and (2,1) get data in 
their east ports from cell (2,2), filter cells (3,1) and (4,1) will get data from 
(4,2).
The filter cells in this program are not synchronized. Instead, each waits 
for an input and starts processing immediately after the input is received. 
Although the filter cells wait for the first byte of input data, the second byte is 
assumed to be no later than 24 µs behind. Therefore, there is no check made 
on the input queue before reading the second byte.
In f2, the input cell waited for filter cell (4,1) to produce an output before 
producing another input. In f3, the input cell uses the built-in timer and
produces input data at the rate of one sample (2 bytes) every 100 µs. This is done 
to show that the arrival time of the input data is not tied to the rest of the
algorithm.
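The timer setting for a 100 µs input interval can be worked out as follows. This Python sketch assumes the timer is used as a 16-bit up-counter that ticks once per machine cycle and overflows at 65536 (the usual 16-bit timer mode of the 8051); the source does not show how the input cell actually programs the timer, and the function name timer_reload is hypothetical. Instruction overhead for reloading the timer is ignored.

```python
CLOCK_HZ = 12_000_000          # 12 MHz oscillator
CLOCKS_PER_MACHINE_CYCLE = 12  # one machine cycle = 1 us at 12 MHz

def timer_reload(interval_us, width_bits=16):
    """Return (high byte, low byte) to load so the timer overflows after interval_us."""
    ticks = interval_us * CLOCK_HZ // (CLOCKS_PER_MACHINE_CYCLE * 1_000_000)
    if not 0 < ticks <= (1 << width_bits):
        raise ValueError("interval does not fit in the timer")
    reload = ((1 << width_bits) - ticks) & ((1 << width_bits) - 1)
    return reload >> 8, reload & 0xFF
```

For 100 µs the counter needs 100 ticks, so it would be loaded with 0xFF9C; one sample every 100 µs corresponds to a 10 KHz input rate.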
The 8-bit filter coefficients are treated as if the decimal point is to the left 
of the most significant bit. This assumes that the filter coefficients are less 
than one. If they are not all less than one, instructions can be added to shift 
the data left or right the number of bits needed to produce an output with the 
decimal in the same position as the input data. This can be done on a cell by 












[Figure 8.18 diagram not recoverable; the values listed under the four filter
cells are 128, 64, 128, and 64.]
Figure 8.18. Switch setting for fast filter program (f3) for p=1 and q=2.
another cell. The addition of the shift instructions will add one µs per bit 
shifted to the execution times.
When the decimal is to the left of the MSB, the values of the coefficient 
can range from 1/256 = .0039 to 255/256 = .996. The sum is 24 bits with 16 
bits left of the decimal.
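The fixed-point convention described above can be sketched in Python. This is an illustration under assumptions, not the f3 code: samples are treated as unsigned 16-bit magnitudes for simplicity, and the function names are hypothetical.

```python
def coef_value(c8):
    """Real value of an 8-bit coefficient with the binary point left of the MSB."""
    return c8 / 256

def multiply_accumulate(sum24, sample16, c8):
    """Add sample * coefficient to a 24-bit sum (16 integer bits, 8 fraction bits)."""
    product = sample16 * c8            # at most 24 bits: 16-bit value times 8-bit value
    return (sum24 + product) & 0xFFFFFF

def integer_part(sum24):
    """Drop the 8 fraction bits, leaving the 16 bits left of the binary point."""
    return sum24 >> 8
```

A coefficient byte of 128 then stands for 0.5, so a sample of 1000 contributes 500 to the integer part of the sum.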
Figure 8.17 summarizes the equivalent execution times of f3 for each step 
of f1. The total time for one loop is 63 µs, or a sample rate of 15.8 KHz. (If 
the data arrives at a slower rate the sample rate will be dictated by the arrival 
rate of the input data.) This is adequate for most speech recognition
applications.
8.2.3. Summary
Two digital filtering algorithms were simulated using an 8051 8-bit 
microprocessor running at 12 MHz. Inter-cell communication was through an 
8-bit wide pipeline between cells, with the maximum throughput of one byte 
every 12 µs.
The first algorithm was based on the pipelined VLSI array algorithm in 
Section 6.1.1. It used local pipeline communication and no broadcasts. It pro­
duced one output sample for every two loops. A simulation written in xx 
showed that a loop takes 1,008 µs, giving a sample rate of less than 500 Hz.
The second algorithm was based on the pipelined/broadcast VLSI array 
algorithm in Section 6.1.2. It used the same local communication as the first 
algorithm, but it also used broadcasts. It was simulated by three programs, 
two written in assembly language and one written in xx. Table 8.2 presents a 
summary of the simulations.
Program f3 shows that a VLSI processor array using cells with the power 
of a current 8-bit microprocessor can filter 16-bit speech data in real time at a 
sampling rate of up to 15.8 KHz. The number of coefficients in the filter does 
not affect the sampling rate. If more coefficients are needed, more cells can be
added to the array. The only limitation may be the fan out of a broadcast. 
The Poker emulator can broadcast from one port to up to four other ports. So 
if it is necessary to broadcast to more than four ports, extra cells will have to 
be added as "line drivers" (see the next section). The extra cells will require
more time for the data to move through, therefore the maximum sampling rate 
will be decreased.
Table 8.2 Summary of simulation of digital filtering algorithms in Poker.
Name   Language   Data Size   Loop Time   Sample Rate
f1     xx         32 bit      906 µs      1.1 KHz
f2     8051       8 bit       33 µs       30 KHz
f3     8051       16 bit      63 µs       15.8 KHz
In f2, 7 µs are wasted (nops) waiting for the switch to poll the output 
latch, while f3 uses 15 µs out of 63 µs for nops. If the output used a queue like 
the input does, thus eliminating the 12 µs delay between writes to the output 
port, the nops could be removed from f2 and f3. Without nops, f2 can process 
at 33-7 = 26 µs per loop for a 38 KHz sampling rate, while f3 would run at 
63-15 = 48 µs or 20.8 KHz. A sampling rate of 15.8 KHz (f3 with nops) 
should be sufficient for most speech applications. If processing of high quality 
speech requires a 20 KHz rate, this can be achieved with this modification to 
program f3.
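The sampling rates quoted above follow directly from the loop times. A short Python check, with a hypothetical helper name, is:

```python
def rate_khz(loop_us):
    """Sampling rate in KHz when one sample is produced every loop_us microseconds."""
    return 1000.0 / loop_us

f2_with_nops = rate_khz(33)           # about 30 KHz
f2_without_nops = rate_khz(33 - 7)    # about 38 KHz
f3_with_nops = rate_khz(63)           # about 15.8 KHz
f3_without_nops = rate_khz(63 - 15)   # about 20.8 KHz
```

The 63 µs loop gives 15.87 KHz, which the text quotes as 15.8 KHz; removing the 15 µs of nops moves f3 just past the 20 KHz needed for high quality speech.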
These filtering algorithms map well onto the CHiP architecture. The 
“pipeline only” algorithm can be implemented on both the Poker emulator and 
Pringle. The pipeline/broadcast algorithm will run only on the Poker emula­
tor. The Pringle hardware can not broadcast data, while the Poker emulator 
will allow one port to broadcast to up to four ports. The ability to broadcast is 
important since it allows the algorithm presented in Section 8.2.2 to be used. 
This algorithm has a throughput two times faster than the “no broadcast” 
algorithm in Section 8.2.1.
The filtering algorithms require a fast interconnection network because the 
data is transferred between cells at the same speed as the sampling rate. The 
Poker system transfers one byte every 12 µs, or one 16-bit word every 24 µs, for 
a throughput of 1/24 µs = 41 KHz. This rate is sufficient for high quality 
speech processing if the processor does not have to manage the I/O buffers. 
The I/O buffer management could be handled by having an output queue 
(instead of a latch) between the processor and the interconnection network. 
Also, separate input queues for each port could be used. If such queues are 
used, the programmer would not have to wait after writing data to the inter­
connection network to be sure it had been sent. A more general approach 
would be to have a separate I/O processor which would manage the I/O 
queue(s) so the main processor could be used mainly for executing programs.
Since most input speech data is 11 to 12 bits, and most computations use 
16 bits, each cell should have a 16-bit processor. The internal RAM of the 
8051 has the same access time as its registers, making the internal RAM act
like 64 16-bit fast general purpose registers. Storage based on a few fast regis­
ters is characteristic of the systolic array and is a desirable feature.
Since programming a large project in assembly language is tedious at best, 
the VLSI processor array must be able to execute programs written in a high 
level language in real time. The high level language should allow the program­
mer to select the precision and type (integer, floating point, etc.) of data for 
each variable. Therefore if only 16 bits are needed, only 16 bits will be used. 
If the processor which is used is like the 8051 in that it has fast internal RAM, 
the high level language should allow the programmer to select where the vari­
ables are stored.
The Poker system is able to implement the filtering algorithms. The 
second algorithm can process high quality speech in real time, if the program is 
carefully written in assembly language.
8.3. Simulation of the Autocorrelation Algorithms
Autocorrelation plays an important role in many isolated word recognition 
systems. It is used to find the short term autocorrelation coefficients which are 
then used to find the LPC coefficients. Autocorrelation, as used here, is defined 
as:
$$R(i) = \sum_{k=0}^{M-i-1} x(k)\, x(k+i), \qquad 0 \le i \le p$$
where R(i) are the autocorrelation coefficients and x(m) is the input signal. 
For speech processing the frame length, M, ranges from 100 to 300 samples, 
while p is between 8 and 16 [Myer80].
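The definition can be written as a short sequential reference in Python. This is a minimal sketch of the formula only, not of the array algorithm; the function name autocorrelation is an assumption, and the sum is taken over k = 0 to M-i-1 for each lag i from 0 to p.

```python
def autocorrelation(x, p):
    """Short-term autocorrelation R(0..p) of one frame x of length M."""
    M = len(x)
    return [sum(x[k] * x[k + i] for k in range(M - i)) for i in range(p + 1)]
```

For the frame [1, 2, 3, 4] and p=2 this gives R(0)=30, R(1)=20, and R(2)=11. In the array version each lag i is assigned to its own cell, which is why changing p changes the number of cells but not the throughput.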
For these programs, M=100 and p=4. The value p=4 is chosen so the 
arrays of cells will fit conveniently in a four by four grid of cells. Changing p 
to the more common value of 9 will not change the throughput; however, it 
will change the number of cells needed.
The number of bits used for each input sample ranges from 8 for telephone 
quality speech to 12 for high quality speech. The number of bits for the sum 
can range from 16 bits to 32 bits. If all the samples in one frame of speech use 
12 significant bits (i.e., the most significant bit is set), the square of each sam­
ple (used in finding R(0)) will use 24 bits. The sum of 100 24 bit values will 
use at most 32 bits, therefore 32 bits is sufficient for computing the sum values. 
It is possible that long frame sizes can result in a sum that uses more than 32 
bits, but this will happen only when most of the input samples use 12 
significant bits. If most samples use all 12 bits, the signal must have a large 
DC bias which can be subtracted off before processing.
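The bit-width argument above can be checked with a short calculation. The following Python sketch is illustrative only: the function name `accumulator_bits` is hypothetical, and it assumes samples are treated as unsigned magnitudes, so the worst case is every sample at full scale.

```python
def accumulator_bits(sample_bits, frame_length):
    """Bits needed for the worst-case sum of frame_length squared samples.

    Assumption: samples are unsigned magnitudes of sample_bits bits.
    """
    max_sample = 2 ** sample_bits - 1            # full-scale sample
    worst_sum = frame_length * max_sample ** 2   # worst-case R(0)
    return worst_sum.bit_length()


if __name__ == "__main__":
    for m in (100, 300):
        print(f"12-bit samples, M={m}: {accumulator_bits(12, m)} bits")
```

Under these assumptions a 100-sample frame of 12-bit data fits in a 32-bit sum, while a 300-sample frame at full scale does not, which matches the caveat about long frames.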
8.3.1. Poker Simulation of the Autocorrelation Algorithm
This section discusses the results of simulating the computation of
autocorrelation coefficients on a VLSI processor array using five different programs
(a1-a5) on the Poker system. The first two programs are written in the xx
programming language. Program a1 uses 32-bit floating point numbers, while a2
uses 32-bit integers. Both use the APU for processing.
Programs a3-a5 are written in 8051 assembly language. Programs a3 and 
a5 use 16-bit integer input samples and produce 32-bit integer sums. Program 
a4 takes 8-bit inputs and produces a 16-bit sum. None of these programs uses
the APU.
As with filtering, reducing the precision of the calculations and switching 
to assembly language results in a greater than tenfold increase in throughput. 
Although the fastest programs (a3 and a4) can process inputs as fast as 12 KHz 
and 45KHz*, respectively, like f2 they must be synchronized with the cell pro­
ducing the input. On the other hand, program a5 processes data at a slower 
rate (less than 11 KHz), but can run completely asynchronously with respect to 
the input cell.
8.3.2. High-level Language Programs — a1 and a2
Figures 8.19, 8.20 and 8.21 show the switch settings, code names, port 
names, and xx listings for program a1. Program a2 is not listed since it is
identical to a1 except all real declarations are changed to int declarations.
Programs a1 and a2 are based on the algorithm in Section 5.1.1. Although the
assembly language programs have slightly different settings for the input cells, 
the autocorrelation cells are connected in the same way. The algorithm works 
as follows:
1) The input cell (2,1) writes sample x to its output port. This value is broad­
cast to the west input ports on the autocorrelation cells (1,2), (2,2), (3,2), 
and (4,2).
2) The Poker switch emulator cannot broadcast to more than four ports. Cell
(1,2) uses two input ports, so cell (2,1) must broadcast to five ports.




Figure 8.20. Port names for autocorrelation programs (a1) and (a2) for VLSI
processor array.

















































VLSI processor array, simulated by Poker. 
Find autocorrelation coefficients R(i) 




Input: 32-bit floating point 
Output: 32-bit floating point 
p, the number of coefficients computed, 
p, the number of coefficients computed. 
Arrives at the north port of cell (1,3). 
Departs from east port of merge cell.
90 μs to process one input sample.
11 KHz
This routine finds the first p autocorrelation coefficients 
of its input data. The value of p depends on the number of 
cells used. One sample is read from each of the two input 
ports (in1 and in2). The sample coming from the in1 port
is written to the bottom port (out) so the cell below
can use it during the next cycle. The two samples are
multiplied together and added to a running sum (sum). After
one frame's worth of samples have been read (as determined by





sint i,samples;             /* Samples per frame */
real top,left,sum;          /* These are type int for (a2) */
real in1,in2,out,results;   /* These are type int for (a2) */
i := 0;
sum := 0;
samples := 10;
/* Send a zero out to initialize the pipeline */
out <- sum;
while true do
Figure 8.21. xx listing for autocorrelation programs (a1) and (a2) for VLSI
processor array.























top <- in1;
left <- in2;
if i < samples then              /* Has one frame been processed? */
begin                            /* No */
    out <- top;                  /* Send sample from top to cell below */
    sum := sum + top * left;     /* Find sum */
end
else begin
    sum := sum + top * left;     /* Last sample in frame */
    results <- sum;              /* send out results */
    sum := 0;                    /* Reinitialize sum */
    out <- sum;                  /* and pipeline. */





Dec 13 09:02 1983 input.x Page 1
/*
This routine generates input data for the autocorrelation
array. The input is the sequence 1,2,3,....
Each value is written to the output port (out), and the
next value is written after a dummy value is received at










/* These are type int in (a2) */
i := 1;

while true do
begin
    out <- i;
    tmp <- sync;    /* Wait for data out of last pe before */
                    /* sending any more out. */
    i := i + 1;
end
end.
Dec 13 09:02 1983 merge4.x Page 1
/*
This routine will merge four data streams into one by
alternating data starting with the top input.





real tmp;                       /* These are type int in (a2) */
real one,two,three,four,out;








tmp <- four;
out <- tmp;
end;
end.
Figure 8.21 (Continued)










This routine is simply a sink. It reads in real values 
from its input port (in) and assigns them to a traced 
variable. It does no useful processing, but is very 






real in,out;    /* ... */

while true do
begin
    out <- in;
end
end.
Dec 13 09:02 1983 pipe.x Page 1
This routine is like a hardware line driver. The switch
emulator cannot broadcast to more than 4 ports at a
time, so this pipe is used to increase the number of ports
a given cell can send data to at one time.
Pipe simply reads data from its input port (in) and








while true do
begin
    tmp <- in;
    out <- tmp;
end
end.
/* These are type int in (a2) */
Figure 8.21 (Continued)
The pipe cell at (3,1) is used as a “line driver” so the ports on cells (3,2)
and (4,2) will appear as one port to the input cell (2,1). If the problem
size is increased to 8 coefficients, the input will have to be broadcast to 9 
ports and another line driver will have to be added. Adding more line 
drivers will increase the execution time of the program since each line 
driver has a delay between the arrival time of the data and the time the 
data is broadcast to the output ports.
3) Cell (1,2) receives sample x at both of its input ports (in1, in2). It writes the
value from the north port (in1) to the south port (out). It then multiplies
the two input values together and adds the product to the running sum,
which accumulates R(0).




4) When cell (2,2) receives the value x from cell (1,2) it also gets the next
value (x+1) from the input cell (2,1). It does the same operations
(multiplication and addition) as cell (1,2) to compute:
$$R(1) = \sum_{k=0}^{M-2} x(k)\,x(k+1)$$
Cell (1,2) has provided the one sample delay so that although both cells
are performing the same operations, they are computing different
autocorrelation coefficients. The same operations are done for the other
autocorrelation cells (3,2) and (4,2), with cell (3,2) computing the
autocorrelation coefficient with a delay of two and cell (4,2) computing the
coefficient with a delay of three.
5) When cell (4,2) writes to its south port, the data is sent to the input cell
(2,1). The arrival of the data tells the input cell to write out another
value. The value that just arrived has no effect on the value written
out. It just synchronizes the input cell to the autocorrelation cells.
6) If fewer than M samples have been processed, go back to step 1); otherwise
write sum to the east port (results), set sum to zero, and go to 1).
7) The merge cell (3,3) collects the autocorrelation coefficients and combines
them in one stream for processing by the lpc cell, which is discussed in a
later section.
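The steps above can be sketched as a small software simulation. This is an illustrative Python model, not the xx program: the function names are hypothetical, one loop iteration stands for one broadcast sample, and the pipeline is assumed to start filled with zeros, as the initial `out <- sum` suggests. Each simulated cell multiplies the broadcast sample by the sample handed down from the cell above and accumulates the product.

```python
def systolic_autocorrelation(frame, p):
    """Simulate p autocorrelation cells for one frame; returns [R(0), ..., R(p-1)]."""
    sums = [0] * p     # running sum held in each cell
    north = [0] * p    # value waiting at each cell's north port (pipeline starts at zero)
    for x in frame:
        north[0] = x   # the first cell receives the broadcast sample on both ports
        passed_down = list(north)
        for cell in range(p):
            top, left = north[cell], x        # pipeline sample and broadcast sample
            sums[cell] += top * left          # multiply and accumulate
            if cell + 1 < p:
                passed_down[cell + 1] = top   # written south, read by the next cell next cycle
        north = passed_down
    return sums


def direct_autocorrelation(frame, p):
    """Reference computation: R(i) = sum over k of x(k) * x(k+i)."""
    m = len(frame)
    return [sum(frame[k] * frame[k + i] for k in range(m - i)) for i in range(p)]


if __name__ == "__main__":
    samples = [1, 2, 3, 4, 5]
    print(systolic_autocorrelation(samples, 4))
    print(direct_autocorrelation(samples, 4))
```

Both functions return the same coefficients, which illustrates how the one-sample delay contributed by each cell selects a different lag without any cell indexing into the frame.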
8.3.3. Execution Times — a1 and a2
Figure 8.22 shows the execution times in μs for each of the statements in
the xx program. Two things to note about these times are:
1) Short integers (sint) are only eight bits long and are handled entirely by the
8051. Variables of type real and int are 32 bits long and are handled by
the APU. This is why i := 0 takes 5 μs, whereas sum := 0 takes 178 μs.
2) As discussed in Section 8.1.2.3, each cell has one hardware input queue for
all the input ports. When data from an input port arrives, the data and
a tag indicating the port are written into the queue. When the
instruction top <- in1 is executed, the program first checks to see how much
data is in the top port buffer. If there are fewer than four bytes, the input
queue is read until four bytes from the top port are found (this includes
the data already in the top buffer). Any data read from the queue
which is not for the top port is stored in the appropriate port buffer.
The same process is followed when executing left <- in2. While auto is
waiting for data from the north port (in1) it may also read data from
all the other input ports. Therefore top <- in1 must wait 419 μs for
the data to arrive, while left <- in2 requires only 93 μs since most of
the data has already been read in and buffered.
A loop in this algorithm consists of the operations needed to input, process,
and output one sample of speech. For this program, one loop takes 961 μs.
After every M loops the computation of the autocorrelation coefficients is
completed, and the result is written to an output port. If a result is output during
the loop, the time increases to 1,233 μs. The execution time of the last loop is
longer than the rest of the loops since the result must be written to an output
port, and certain variables must be reinitialized. This gives a sampling rate of
about 1 KHz, which is too slow for speech analysis.
Figure 8.23 is the same algorithm using 32-bit integers for computations
instead of real numbers. Here the total time for a loop is 887 μs, and 1,010 μs
if a result is produced. This is still not fast enough for speech processing.
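The loop times quoted above can be reproduced by adding the per-statement times listed in Figures 8.22 and 8.23. The Python sketch below is only a bookkeeping aid: the names are hypothetical, the statement times are copied from the figures, and the `if` test is assumed to take its minimum time of 22 μs.

```python
# Statement times in microseconds that are common to programs a1 and a2.
COMMON_US = {
    "i := i + 1": 14,
    "top <- in1": 419,
    "left <- in2": 93,
    "if i < samples": 22,   # assumption: minimum time of the test
    "end (loop)": 2,
}
PORT_WRITE_US = 91          # out <- top, results <- sum, out <- sum
RESET_I_US = 5              # i := 0
INNER_END_US = 2            # end of the "not last sample" branch


def loop_times(mac_us, clear_sum_us):
    """Return (normal loop, loop that outputs a result) in microseconds."""
    base = sum(COMMON_US.values())
    normal = base + PORT_WRITE_US + mac_us + INNER_END_US
    with_result = (base + mac_us + PORT_WRITE_US + clear_sum_us
                   + PORT_WRITE_US + RESET_I_US)
    return normal, with_result


if __name__ == "__main__":
    print("a1 (real):", loop_times(mac_us=318, clear_sum_us=178))
    print("a2 (int): ", loop_times(mac_us=244, clear_sum_us=29))
```

With these inputs the sums come to 961 μs and 1,233 μs for a1 and 887 μs and 1,010 μs for a2; the 1,233 μs figure is the one that also appears in Table 8.4.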
Table 8.3 shows the most time-consuming steps in the xx routines.
Using data of type integer is adequate for speech data processing. Program 
(a2) can process one sample every 887 ps, which is a sampling rate of 1.1 KHz. 
This is not fast enough for real-time processing. As with filtering, xx in its
Count  Min   Ave   Max   Statement
                         code auto;
                         trace sum,left,top;
                         ports in1,in2,out,results;
    1    0     0     0   begin
                         sint i,samples;   /* Samples per frame */
    1   15    15    15   real top,left,sum;
                         real in1,in2,out,results;
    1    5     5     5   i := 0;
    1  178   178   178   sum := 0;
    1    5     5     5   samples := 10;
                         /* Send a zero out to initialize the pipeline */
    1   91    91    91   out <- sum;
    1    0     0     0   while true do
                         begin
   34   14    14    14   i := i + 1;
   34  419   419   419   top <- in1;
   34   93    93    93   left <- in2;
                         /* Has one frame been processed? */
   34   22  23.8    24   if i < samples then
                         begin   /* No */
                         /* Send sample from top to cell below */
   31   91    91    91   out <- top;
   31  318   318   318   sum := sum + top * left;   /* Find sum */
   31    2     2     2   end
                         else begin   /* Last sample in frame */
    3  318   318   318   sum := sum + top * left;
    3   91    91    91   results <- sum;   /* send out results */
    3  178   178   178   sum := 0;         /* Reinitialize sum */
    3   91    91    91   out <- sum;       /* and pipeline. */
    3    5     5     5   i := 0;
    3    0     0     0   end
   34    2     2     2   end
                         end.
Figure 8.22. Execution times in μs for autocorrelation program a1 using real
numbers.
Count  Min   Ave   Max   Statement
                         code auto;
                         trace sum,left,top;
                         ports in1,in2,out,results;
    1    0     0     0   begin
                         sint i,samples;   /* Samples per frame */
    1   15    15    15   int top,left,sum;
                         int in1,in2,out,results;
    1    5     5     5   i := 0;
    1   29    29    29   sum := 0;
    1    5     5     5   samples := 10;
                         /* Send a zero out to initialize the pipeline */
    1   91    91    91   out <- sum;
    1    0     0     0   while true do
                         begin
   31   14    14    14   i := i + 1;
   31  419   419   419   top <- in1;
   31   93    93    93   left <- in2;
                         /* Has one frame been processed? */
   31   22  23.8    24   if i < samples then
                         begin   /* No */
                         /* Send sample from top to cell below */
   28   91    91    91   out <- top;
   28  244   244   244   sum := sum + top * left;   /* Find sum */
   28    2     2     2   end
                         else begin   /* Last sample in frame */
    3  244   244   244   sum := sum + top * left;
    3   91    91    91   results <- sum;   /* send out results */
    3   29    29    29   sum := 0;         /* Reinitialize sum */
    3   91    91    91   out <- sum;       /* and pipeline. */
    3    5     5     5   i := 0;
    3    0     0     0   end
   31    2     2     2   end
                         end.
Figure 8.23. Execution times in μs for autocorrelation program a2 using
integers.
Table 8.3 Execution times for autocorrelation programs a1 and a2.
Program                  a1                  a2
Data Type                real                int
Input                    512 μs    53%       512 μs    58%
Finding Sum              318 μs    33%       244 μs    28%
Output                    93 μs     9%        93 μs    10%
Total (no result)        961 μs   100%       887 μs   100%
Total (with result)    1,233 μs   100%     1,010 μs   100%
current state cannot produce code that executes fast enough for real-time pro­
cessing. The following section presents assembly language implementations of 
the algorithm to compute autocorrelation coefficients.
8.3.4. Assembly Language Programs — a3 and a4
Autocorrelation of a speech signal does not require 32-bit input data, as
used above, for most applications. Instead, 8- or 16-bit input data is enough.
Using fewer bits reduces the I/O time since less data is sent through the switch
lattice, and reduces the execution time since the 8051 can do 16-bit arithmetic
without sending data to the APU.
Three 8051 assembly language programs were written to compute
autocorrelation coefficients. Two (a3 and a5) use 16-bit input samples and a 32-bit
sum, while the other (a4) uses 8-bit inputs and a 16-bit sum. Figures B.9 and
B.10 are listings of the first two programs. Figure 8.24 shows the switch
setting. They perform the same calculations as the xx programs but with less
precision. All calculations are done by the 8051; the APU is never used. All
inter-cell communication is done blindly, i.e., when receiving data, no check is
made to see from which port it came. There is no risk of data arriving at a
cell’s input queue in the wrong order if
1) all the cells are synchronized (i.e., the main loop requires the same amount
of time in each cell) and
2) there is no input from cells which are not synchronized.
Unfortunately, debugging synchronized code is tedious because the code in
unrelated cells must be carefully timed to take the same amount of time. Data
must arrive from the outside world, which is not synchronized to Poker, so 2) is
an unrealistic constraint. This will be addressed in Section 8.3.6.
Figure 8.25 is a summary of the equivalent execution times for the assem­
bly routine as compared to the integer version of the xx routines. Table 8.4 
summarizes the total time between input samples for each of the algorithms. 
Switching to assembly language has produced about a tenfold increase in speed. 
This increase comes from a combination of:
1) Reducing the input data size from 32 bits to 8 or 16 bits. This allows the
8051 to perform the arithmetic rather than sending it to the APU, which
[Figure 8.24: switch settings for the assembly language autocorrelation programs, a four by four grid of cells with code names pipe, input, and auto. The diagram is not recoverable from the scan.]




    a2      a3      a4      a5
    xx    8051    8051    8051
32 bit  16 bit   8 bit  16 bit   Statement
     5       2       2       2   i := 0;
    29       4       2       4   sum := 0;
     5                           samples := 10;
    91      17       6      17   out <- sum;
                                 while true do
    11       1       1       1   i := i + 1;
   419       6       3     >10   top <- in1;
    91       6       2     >10   left <- in2;
  23.8     3/5     3/5     3/5   if i < samples then
                                 begin   /* Not done yet */
    91       4       2       4   out <- top;
   244      60       9      60   sum := sum + top * left;
     2                           end
                                 else begin
   244      60       9      60   sum := sum + top * left;
    91      14      10      13   results <- sum;
    29       4       2       4   sum := 0;
    91      16       1      16   out <- sum;
     5       2       2       2   i := 0;
     2       2       2       2   end
Figure 8.25. Execution times in μs for autocorrelation program using 8, 16,
and 32 bits.
Table 8.4 Summary of execution times for autocorrelation programs.
Program  Language  Data type    Timing  Time per   Time per sample  Time per     Sample
                                        sample     (with result)    frame        rate
a1       xx        32-bit real  async   >961 μs    >1,233 μs        >96,372 μs   <1,037 Hz
a2       xx        32-bit int   async   >887 μs    >1,010 μs        >88,823 μs   <1,125 Hz
a3       8051      16-bit int   sync    82 μs      116 μs           8,234 μs     12,144 Hz
a4       8051      8-bit int    sync    26 μs      47 μs            2,621 μs     38,153 Hz
a5       8051      16-bit int   async   >90 μs     >123 μs          >9,033 μs    <11,071 Hz
is time consuming. Also, the smaller data size requires less time to move
through the network.
2) Storing all variables in internal RAM. xx stores all variables in external
RAM, which requires more time to access.
3) Overlapping data transfers with computation. Therefore, when waiting for
the LSB to be read from the output latch, the MSBs are being
computed.
Steps 1) and 2) could be implemented by a compiler, thus possibly making
real-time processing possible without using assembly language.
The seventh column of Table 8.4 shows the time required to process one
100-sample frame of speech. Using 16-bit samples, a sample rate of 10 KHz is
easily obtained. This is fast enough for telephone quality speech, but not for
high quality speech. Dropping to 8-bit inputs allows a sample rate of about 38
KHz, which is fast enough for most speech applications, but does not provide
enough precision for high quality speech. These rates present one problem: it is
possible to sample at a high rate (38 KHz), or with high precision (16-bit inputs),
but not both. These rates assume there is some buffering of input data during the
longer last loop so no data is lost.
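The frame times and sample rates in Table 8.4 follow from the two loop times of each program. The Python sketch below shows the arithmetic under the assumption stated above, namely that a frame consists of M-1 normal loops plus one longer loop that outputs the result, with input buffered during that last loop. The function names are hypothetical.

```python
def frame_time_us(loop_us, last_loop_us, frame_length=100):
    """Time to process one frame: frame_length - 1 normal loops plus the final loop."""
    return (frame_length - 1) * loop_us + last_loop_us


def sample_rate_hz(loop_us, last_loop_us, frame_length=100):
    """Average sustainable input rate, assuming input is buffered during the last loop."""
    return frame_length * 1_000_000 / frame_time_us(loop_us, last_loop_us, frame_length)


if __name__ == "__main__":
    programs = {"a3": (82, 116), "a4": (26, 47), "a5": (90, 123)}
    for name, (loop, last) in programs.items():
        print(name, frame_time_us(loop, last), "us per frame,",
              int(sample_rate_hz(loop, last)), "Hz")
```

For a5 the loop times are lower bounds, so the values computed for it are a best case rather than a guaranteed rate.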
8.3.5. Potential Problems — a3 and a4
The assembly routines assume the input value will enter at a given time.
There is an 8 μs window in the 8-bit version during which the input data must
arrive. The 16-bit version has a 21 μs window. If the data arrives outside this
window, data will be lost.
Because of this narrow window, a pipe cell cannot be used to broadcast
the input data since it introduces delays in the arrival times. Instead, two
identical input cells are used along with broadcasting, as the switch setting in
Figure 8.24 shows. The pipe (1,1) here is used so the data arriving at cell (1,2)
will arrive in the proper order.
This “patch job” of duplicating the input cell is sufficient for
demonstrating the system works, but is not practical for processing real data. The Pringle
hardware cannot broadcast, therefore one input cell would be needed for each
autocorrelation cell. The following section presents a method to overcome this
problem.
8.3.6. Asynchronous Computing — a5
The last assembly language program (a5) allows the auto cells to run
asynchronously with respect to the input cell. Figure 8.26 is the switch setting and
code names for the autocorrelation program listed in Figure B.11. Figure 8.25
and Table 8.4 summarize the results. Asynchronous execution is achieved as
follows. In the synchronous programs, the order of execution is:
1) Read input from external world.
2) Read input from cell above.
3) Compute sum while next external input arrives.
4) Write data to cell below.
Step 3 overlaps the computations with data input. This program is synchro­
nous since the data must arrive during the computation.
To run asynchronously the order of execution is changed to:
1) Read input from cell above.
2) Wait for input from external world.
3) Write data to cell below.
4) Compute sum while input from cell above arrives.
There is still overlap of computation with input, but the input is from another 
cell, not the external world. This new program adds only the slight overhead 
of checking for the arrival of the external input.
The only assumption made is that the external input arrives at all cells at
the same time. This is a valid assumption if the hardware can perform a
broadcast. Systolic arrays cannot broadcast data, so program a5 uses a tree-like
configuration of cells to distribute the input data as a broadcast would.
This method, however, does not deliver the data to all the cells at the same
time. Figure 8.27 shows two columns of cells. The cells in column one form a
broadcast tree while the cells in column two are cells receiving the broadcast
data. The number in each box is the arrival time of the data assuming it
starts in cell (5,1) at time t=1 and that a write to a port takes one time unit.
These arrival times also assume a cell can send data to both output ports with
Figure 8.26. Switch setting for autocorrelation program a5.
Figure 8.27 Time delays in using tree to broadcast. One port can send data to
two ports with one write instruction.
one write instruction. This means that the data can arrive at all the cells in
column two at the same time.
Figure 8.28, on the other hand, assumes the program can write to only one
port at a time. The data arrives at cell (5,1) at time t=1. Cell (5,1) first writes
to its southwest port at time t=2, then to its northwest port at time t=3.
Figure 8.28 shows that the southwest port of cell (5,1) goes to cell (7,1), and the
northwest port of cell (5,1) goes to cell (3,1). Cell (7,1) gets the data from cell
(5,1) at time t=2 and first sends it to its south port, then its north port. Cell
(3,1) receives and sends its data one time unit later. Cells (2,1), (4,1), (6,1),
and (8,1) all perform the same operations, only at different times, as shown by
Figure 8.28. When using this scheme to broadcast to 8 cells, there is only a
one unit delay between adjacent cells.
There can be timing problems with such a broadcast tree. In the assembly
language algorithm, column two acts as the pipeline in Figure 8.26. Each cell
receives two data items: one broadcast item, and one data item being passed
through the pipeline. Therefore each cell in column two receives data from two
ports. The first port comes from the broadcast tree and is called the broadcast
data. The second port comes from the cell above and is called the pipeline data.
A cell writes pipeline data to the cell below it after receiving both broadcast
data and pipeline data. Although this is how the autocorrelation algorithm
functions, this problem can be generalized to any algorithm that fits the above
description. Checking the input queue tag and buffering the input data is a time
consuming task, so the program is structured so that the data will arrive in the
queue in a known order. Since the input queue direction tag is not checked, one
of two assumptions must be made:
1) Data comes from the broadcast port first.
2) Data comes from the pipeline port first.
The following shows that either of the above assumptions can result in data
arriving in the wrong order.
Assume 1) and consider cell (4,2). Broadcast data arrives at time t=5;
suppose pipeline data arrives at time t=5.4* and the pipeline data is written at
* Since the processor can execute instructions faster than the data travels between cells, it
is possible for data to be written into the network at a non-integer number of network
units from the time the first data was written into the network.








Figure 8.28 Time delays in using tree to broadcast. One port can send data to
only one port with one write instruction.
time t=5.8. Cell (5,2) assumes the broadcast data will arrive first at t=6, but
cell (4,2) sent its pipeline data to cell (5,2) before cell (6,1) sent the broadcast
data. This is a problem.
Now assume 2), that the pipeline data comes first. Cell (3,2) receives
pipeline data from the cell above it and at time t=6 receives its broadcast data.
Suppose at time t=6.4 cell (3,2) sends its pipeline data to cell (4,2). Cell (4,2)
expects the pipeline data first, but at time t=5 the broadcast data arrives.
Again a problem.
There is one receiving cell, call it A, whose data is always written first as
it travels through the broadcast tree. In Figure 8.28, this is cell (8,2). There is
one cell, call it B, whose data is always written last as it travels through the
tree; this is cell (1,2) in Figure 8.28. As the number of cells receiving the data
increases, the number of levels in the broadcast tree must be increased. Each
additional level of the tree adds one network delay for data arriving at cell A,
and two network delays for data arriving at cell B. Therefore, the difference in
arrival times increases as the number of cells receiving the broadcast data
increases.
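The growth in arrival-time spread can be illustrated with a small model of the broadcast tree. The Python sketch below is an idealization, not the Poker timing: it assumes every cell forwards data with two successive writes, each taking one network time unit, and that data enters the root at t=1 as in Figure 8.28. The function name is hypothetical.

```python
def arrival_times(levels, start=1, write_delay=1):
    """Arrival times at the receivers of a binary broadcast tree.

    Each cell writes to one port at a time: its first destination gets the
    data write_delay after the cell received it, the second 2 * write_delay after.
    """
    times = [start]
    for _ in range(levels):
        next_times = []
        for t in times:
            next_times.append(t + write_delay)       # first write
            next_times.append(t + 2 * write_delay)   # second write
        times = next_times
    return times


if __name__ == "__main__":
    for levels in (1, 2, 3, 4):
        t = arrival_times(levels)
        print(levels, "levels:", len(t), "receivers, first", min(t),
              "last", max(t), "spread", max(t) - min(t))
```

With three levels (eight receivers) the earliest arrival is t=4 and the latest is t=7, consistent with the times used in the discussion of cells A and B, and each extra level widens the spread by one time unit.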
Five possible solutions to this problem are:
1) Lower the input rate so all the broadcast data will have propagated through
the tree before any pipeline data arrives.
2) Build delays into the broadcast network to be sure the data arrives at all
processing cells at about the same time.
3) Use separate input queues for each port.
4) Allow a port to broadcast to two ports.
5) Allow a port to broadcast to any number of ports.
From the programmer’s point of view, solution five is the best solution in 
that not being able to broadcast data is an architectural limitation. The solu­
tion to such a limitation is a different architecture. Using a general broadcast 
frees up the cells in the broadcast tree so they can perform some other task. 
Solution five is the most expensive solution in that it requires a hardware 
change.
Solution four is a less expensive solution than five since it may require 
fewer hardware modifications. The tree broadcast can be used, as shown in 
Figure 8.27, to broadcast data so that it arrives at the same time at the
destination cells, assuming the cells in the tree can broadcast to two other cells 
with one write instruction. Solution three would require the least expensive 
hardware modifications. Having separate input queues for each input port 
would eliminate the arrival order problem.
Solution one is used here since spacing the input samples 150 μs apart is
slow enough for all cells to complete computing before the next sample arrives.
This does decrease the throughput, but 150 μs between samples is fast enough
for telephone quality speech. It is not, however, fast enough for high quality
speech.
8.3.7. Summary
Table 8.4 summarizes the results of the five programs for autocorrelation
discussed in this section. As with the filtering algorithms, the programs written
in xx cannot process data fast enough for real-time speech processing. The
three programs written in assembly language show that the 8-bit 8051
microprocessor can process at real-time speeds with throughput ranging from
12 to 38 KHz. Computing more coefficients does not change the throughput;
because it requires a larger broadcast tree, it may only increase the delay time
between the arrival of the last sample and the output of the results. If more
coefficients must be computed, more cells can be added to the array.
Program a5 showed that a broadcast can be done with a tree-like structure
of cells. The problem with this type of broadcast is the variation in arrival
times at the destination cells. This problem could be overcome by allowing a
general broadcast to many ports, or simply by allowing a broadcast from one
port to two ports. This simple broadcast would allow the tree structure to
broadcast data to many cells without the variation in arrival times.
The simplest hardware change that would allow the programs to execute
faster would be to have separate input queues for each port. This would allow
a5 to process data at 10 KHz instead of 6.67 KHz.
8.4. Simulation of Parallel Linear Prediction Algorithms
Both speech synthesis and recognition frequently use linear predictive
coding (LPC). The LPC coefficients model the vocal tract as an all-pole filter,
while the error signal from the analysis represents the excitation of the vocal
cords. A speech recognition system divides the speech signal into 10 to 20
ms frames and finds the LPC coefficients for each frame. Therefore, a
real-time system must process one frame of 100 to 400 samples every 10 to 20 ms.
Generally, 16-bit signed fixed-point coefficients are used, but some applications
can use as few as 10 bits [MaGr74].
The LPC coefficients are found using the autocorrelation method [RaSc78].
The previous section showed that p cells can compute p autocorrelation
coefficients. The output from each cell is merged into one cell. This section
describes the LPC program that reads the autocorrelation coefficients from one
input port and writes the LPC coefficients to the output port.
Although Siegel’s method for computing LPC coefficients presented in
Section 5.4.1 does achieve some speedup over the serial method, the method
simulated here is entirely serial. A single 8051 with an attached APU is able to
compute the coefficients in real time. Figure 8.29 lists the xx program used. It
is a direct implementation of Durbin’s recursive solution as discussed in Section
4.4. The execution times, in μs, for computing 8 LPC coefficients are listed to
the left of each statement. Table 8.5 shows the total execution times for
various numbers of coefficients. The time to compute 8 coefficients is 42 ms, which
is two to four times longer than the desired 10 to 20 ms. Three possible
solutions to this problem are: improve the xx compiler, use a faster APU, or use
multiple cells.





Machine:       VLSI processor array, simulated by Poker.
Function:      Find LPC coefficients using
               Durbin’s method.
Precision:     Input: 32-bit floating point.
               Output: 32-bit floating point.
Number of PEs: 1
Parameters:    p, the number of coefficients computed.
Input:         Autocorrelation coefficients
               arrive at “in” port.
Output:        Energy (R[0]) is sent out “out” port
               followed by p LPC coefficients.
Typical Time:  38,421 μs for p=8.
*/





Count  Min  Ave  Max   Statement
    1   10   10   10   sint i,j,p;
    1    0    0    0   int itmp,in;
    1    0    0    0   real a[10],       /* LPC coefficients */
    1    0    0    0        aold[10],    /* old LPC coefficients */
    1    5    5    5        E,           /* Prediction error */
    1    5    5    5        k,
                            out,         /* output port */
    1    0    0    0        R[10],       /* autocorrelation coefs */
    1    0    0    0        tmp;
    1    5    5    5   p := 8;
    1    0    0    0   while true do
                       begin
    1   20   20   20   for i := 0 to p do
                       begin
                       /* Read in autocorrelation coefs */
                       /* Starting with R(0) */
    9  262  277  394   itmp <- in;
    9  193  193  193   k := itmp;
    9   79   79   79   R[i+1] := k;      /* All R[] indexes are +1 since */
    9    4    9   10   end;              /* xx indexes start at 1 */
    1   70   70   70   E := R[1];
                       /* Send R[1] to endpoint routine */
    1   91   91   91   out <- E;
    1   20   20   20   for i := 1 to p do
                       begin
Figure 8.29: Durbin’s method for finding LPC coefficients from autocorrelation
coefficients.
    8   52   52   52   k := 0.0;
    8   32   33   38   for j := 1 to i-1 do
   28  385  386  387   k := k + aold[j] * R[i-j+1];
    8  351  351  351   k := (R[i+1] - ...
    8  495  495  495   tmp := k*k; E := ...
    8   73   73   73   a[i] := k;
    8   32   32   38   for j := 1 to i-1 do
   28  433  434  435   a[j] := aold[j] - k * aold[i-j];
    8   24   24   24   for j := 1 to i do
   36  102  103  104   aold[j] := ...
    8    4    9   10   end;
    1   20   20   20   for i := 1 to p do
                       begin
    8   73   73   73   k := aold[i];
    8   91   91   91   out <- k;
    8    4    9   10   end;
    1    2    2    2   end















Table 8.5 Execution times for various numbers of coefficients.
p    Input       Computation    Output      Total
4    2,810 μs    10,246 μs        712 μs    13,768 μs
7    4,484 μs    27,607 μs      1,231 μs    33,322 μs
8    5,042 μs    35,240 μs      1,404 μs    41,686 μs
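For reference, Durbin's recursion as used by the lpc program can be sketched in a few lines of Python. This is a sketch under stated assumptions rather than a transcription of Figure 8.29: the two inner-loop updates follow the statements quoted in the text, while the reflection-coefficient division by E and the update E := (1 - k*k) * E are taken from the standard form of the recursion because those statements are truncated in the listing. The function name and zero-based indexing of R are choices made for this example.

```python
def durbin(R, p):
    """Solve for p LPC coefficients from autocorrelation values R[0..p].

    Returns (a, E): a[j-1] is the coefficient for lag j and E is the final
    prediction error. Sign convention: x_hat(n) = sum_j a_j * x(n - j).
    """
    a_old = [0.0] * (p + 1)   # index 0 unused, mirroring the 1-based listing
    E = R[0]
    for i in range(1, p + 1):
        acc = sum(a_old[j] * R[i - j] for j in range(1, i))
        k = (R[i] - acc) / E                      # assumed standard form
        a = a_old[:]
        a[i] = k
        for j in range(1, i):
            a[j] = a_old[j] - k * a_old[i - j]
        E = (1.0 - k * k) * E                     # assumed standard form
        a_old = a
    return a_old[1:], E


if __name__ == "__main__":
    # Autocorrelation of a first-order process with correlation 0.5.
    coefficients, error = durbin([1.0, 0.5, 0.25], 2)
    print(coefficients, error)
```

The recursion performs on the order of p squared multiply-accumulate steps, which is why the two inner-loop statements dominate the computation column of Table 8.5.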
8.4.1. Improve the xx Compiler
Since this method for computing LPC coefficients uses real numbers, the
xx compiler uses the APU. The 8051 accesses the APU by pushing and
popping data to and from the APU’s stack. The APU is given an operation which
it performs on the data on the stack and the result is left on the top of the
stack. Pushing and popping data from the APU stack is a time consuming
operation because the APU stack is memory mapped as external RAM. The xx
compiler does not optimize the stack operation, so when:
for j := 1 to i-1 do
k := k + aold[j] * R[i-j];




4 * (multiply top two elements and
leave the results on top of the stack.)
5 + (add top two elements and








Lines 6 and 7 show an extra push/pop operation which is not needed. A
simple improvement to the xx compiler would be to allow one variable to be
declared as an “APU stack variable” and the compiler would know to leave it
on the stack. This could save many unneeded pushes and pops.
8.4.2. Use a Faster APU
The Intel 8231 APU requires at most 92 μs for a floating-point addition,
93 μs for a subtraction, 42 μs for a multiplication, and 43 μs for a division.
These times are too slow for speech processing. For example, the two most




A: k := k + aold[j] * R[i-j+1];
and
B: a[j] := aold[j] - k * aold[i-j];
Line A uses 387 μs per execution and is run 28 times when p=8. Line B uses
434 μs per execution and is also executed 28 times. If the execution times of
lines A and B were reduced to only the time used by the APU, they would
require 134 μs and 135 μs, respectively. This is 253 μs and 299 μs less time,
for a total savings of 28*253 + 28*299 = 15,456 μs for the entire program.
The total execution time for the lpc program is 41,686 μs. Subtracting the
time saved from the total time leaves 26,230 μs, which is still too slow for
real-time processing. Therefore, even ignoring the overhead of indexing into
arrays and sending data to and from the APU on the two most time-consuming
statements, the program is still unable to run in real time. A solution to
this problem would be to use a faster APU.
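The arithmetic in this argument can be checked with a short Python sketch. It uses only the timings quoted above; the names APU_US and savings are illustrative assumptions, not part of the lpc program.

```python
# Worst-case Intel 8231 APU times quoted in the text, in microseconds.
APU_US = {"add": 92, "sub": 93, "mul": 42, "div": 43}

def savings(measured_us, apu_ops, executions):
    """Return (APU-only time per execution, total time saved) for one statement."""
    apu_only = sum(APU_US[op] for op in apu_ops)
    return apu_only, (measured_us - apu_only) * executions

# Line A: one addition and one multiplication; line B: one subtraction and one multiplication.
line_a_apu, line_a_saved = savings(387, ["add", "mul"], 28)
line_b_apu, line_b_saved = savings(434, ["sub", "mul"], 28)

total_saved = line_a_saved + line_b_saved
remaining = 41_686 - total_saved   # total lpc time minus the savings

print(line_a_apu, line_b_apu, total_saved, remaining)
```

Even with all non-APU overhead removed from these two statements, the remaining time is well above a 10 ms frame period.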
8.4.3. Use Multiple Cells
Unlike Siegel's method, where one LPC computation was divided among
many cells, each cell could perform the LPC analysis on a different frame of
speech. Figures 8.30, 8.31, and 8.32 show the switch settings and code names,
port names, and xx program listing, respectively, for the multiple cell LPC
program. The program demux receives the input coefficients from the
autocorrelation program (the autocorrelation program is replaced by the input
program for testing purposes). demux sends the first 9 coefficients (one
frame) to lpc cell (2,1). The next 9 coefficients are sent to lpc cell (2,2),
and so on. After lpc cell (2,4) receives its coefficients, the next 9
coefficients are sent to cell (2,1). The mux cell collects the outputs from
each lpc cell to form one data stream similar to the input stream into the
demux cell.
Each lpc cell receives one out of every four frames. If a frame's length is
10 ms, each cell will have 40 ms to compute its LPC coefficients before
receiving another input frame. Table 8.5 shows that the lpc cell requires
about 42 ms, slightly longer than the 40 ms that is available. The extra 2 ms
could be trimmed from the lpc program by optimizing the APU stack operations
as discussed in Section 8.4.1. If a shorter frame length is used, more lpc
cells can be used to increase the throughput.
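The relation between frame length, computation time, and the number of cells can be sketched as follows. This is a minimal illustration under the assumption that frames are dealt to the cells in strict round-robin order; the function names are hypothetical.

```python
import math

def per_cell_budget_ms(n_cells, frame_ms):
    """Time each cell has before its next frame arrives."""
    return n_cells * frame_ms

def cells_needed(compute_ms, frame_ms):
    """Smallest number of round-robin cells that keeps up with the input."""
    return math.ceil(compute_ms / frame_ms)

budget = per_cell_budget_ms(4, 10)        # four lpc cells, 10 ms frames
shortfall = 41.686 - budget               # measured lpc time minus the budget
print(budget, round(shortfall, 3))
print(cells_needed(41.686, 10), cells_needed(40.0, 10))
```

With the measured 41,686 μs, four cells fall just short, which is why trimming about 2 ms from the lpc program is enough to make four cells sufficient.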














Figure 8.30. Switch settings and port names for multiple LPC cell program.
Figure 8.31. Port names for multi-cell LPC program.













































in, out1, out2, out3, out4;
int tmp;
int in, out1, out2, out3, out4;
sint i;
/*
This program sends its input to four lpc cells.
It sends the first frame to port out1, the next to
port out2, the next to port out3, and the next to
port out4. Then it goes back to port out1 and starts
over.
*/
while true do
begin
    for i := 0 to 8 do      /* Send first frame to */
    begin                   /* lpc cell at (2,1) */
        tmp <- in;
        out1 <- tmp;
    end;
    for i := 0 to 8 do      /* Send 2nd frame to */
    begin                   /* lpc cell at (2,2) */
        tmp <- in;
        out2 <- tmp;
    end;
    for i := 0 to 8 do      /* Send third frame to */
    begin                   /* lpc cell at (2,3) */
        tmp <- in;
        out3 <- tmp;
    end;
    for i := 0 to 8 do      /* Send fourth frame to */
    begin                   /* lpc cell at (2,4) */
        tmp <- in;
        out4 <- tmp;
    end;
end
Figure 8.32. xx program listing for multi-cell LPC program.


















































in1, in2, in3, in4, out;
real tmp;
real in1, in2, in3, in4, out;
sint i;
/*
This program combines the input from four lpc cells.
It gets the first frame from port in1, the next from
port in2, the next from port in3, and the next from
port in4. Then it goes back to port in1 and starts
over.
*/
while true do
begin
    for i := 0 to 8 do      /* Get first frame from */
    begin                   /* lpc (1,2) */
        tmp <- in1;
        out <- tmp;
    end;
    for i := 0 to 8 do      /* Get 2nd frame from */
    begin                   /* lpc (2,2) */
        tmp <- in2;
        out <- tmp;
    end;
    for i := 0 to 8 do      /* Get third frame from */
    begin                   /* lpc (3,2) */
        tmp <- in3;
        out <- tmp;
    end;
    for i := 0 to 8 do
    begin
        tmp <- in4;
        out <- tmp;
    end;
end




Although this form of parallelism has the throughput needed for real-time
processing, it introduces a constant delay; i.e., it takes 42 ms to compute one
frame of LPC coefficients even though a frame is completed every 10 ms. The
result is that the input to the cell which follows the mux cell will be delayed by
about 30 ms (40 ms computation time minus the 10 ms frame length).
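The demux and mux behavior described in this section can be mimicked in Python to check that round-robin distribution followed by round-robin collection preserves frame order. This is only a sketch: the lpc computation itself is omitted (each cell simply holds its frames), and the names demux, mux, FRAME_LEN, and N_CELLS are assumptions made for the illustration.

```python
from itertools import islice

FRAME_LEN = 9   # coefficients per frame, as in the text
N_CELLS = 4     # number of lpc cells

def demux(stream, n_cells=N_CELLS, frame_len=FRAME_LEN):
    """Deal whole frames to the cells in round-robin order."""
    queues = [[] for _ in range(n_cells)]
    it = iter(stream)
    cell = 0
    while True:
        frame = list(islice(it, frame_len))
        if len(frame) < frame_len:      # stop at the end of the stream
            break
        queues[cell].append(frame)
        cell = (cell + 1) % n_cells
    return queues

def mux(queues):
    """Collect one frame from each cell in the same round-robin order."""
    out = []
    depth = max(len(q) for q in queues)
    for k in range(depth):
        for q in queues:
            if k < len(q):
                out.extend(q[k])
    return out

stream = list(range(6 * FRAME_LEN))     # six frames of dummy coefficients
queues = demux(stream)
restored = mux(queues)
print([len(q) for q in queues], restored == stream)
```

Because both ends visit the cells in the same fixed order, no frame tags or sequence numbers are needed to restore the original order.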
8.4.4. Summary
This section has presented a serial program to compute LPC coefficients
given the autocorrelation coefficients. It showed that 8 coefficients can be
computed in 42 ms, which is two to four times longer than the time needed for
real-time processing. Three solutions were given to improve the execution
time. The first was to improve the xx compiler to use an “APU stack
variable.” The compiler would leave this variable on the APU stack, thus
optimizing the stack operations. This solution will not decrease the execution
time enough unless the second solution is used. The second was to use a faster
APU, since the Intel 8231 is too slow for speech processing. The last solution
was to use multiple cells, each running a serial LPC program. A demux cell
would assign successive input frames to the LPC cells in a round-robin fashion.
A mux cell would then collect the output from each of the LPC cells. This
method has little parallelism overhead since each cell is running a serial
program.
The LPC program is the first speech processing program that uses the
APU. Although the APU can perform fixed and floating point arithmetic, until
now its use has been avoided. This is due to the overhead in communicating
with it and its slow execution times. The APU stack is memory mapped into
the 8051's external RAM address space. The 8051 accesses all external RAM
through its single dptr register. Therefore, if a 32-bit value stored in external
RAM is to be pushed on the APU stack, the dptr must be set twice for each
byte transferred (once to point to a byte in the variable and once to point to
the APU stack), giving a total of 8 times. Setting the dptr requires 2 μs, so 16
μs are used just setting the dptr. This extra setting of the dptr can be avoided
if the APU stack is mapped into one of the 8051's built-in I/O ports. The dptr
can point to the variable in external RAM, and the 8051 can access the APU
stack directly through its built-in port. This simple modification to the
hardware would decrease the time needed to use the APU.
8.5. Simulation of Linear Time Warping (LTW) Algorithms
In a typical isolated word recognition system, linear time warping occurs 
after the endpoint detection and before the dynamic time warping. Its purpose 
is to take an utterance of variable length and stretch or shrink it, in the time 
domain, until it is a fixed length. Isolated utterances can range from 20 to 80 
frames in length, where a frame consists of 8 LPC coefficients. Some systems 
will stretch or shrink the utterance to a 40 frame length. Only after detecting
the utterance can the LTW program process the speech data. Since isolated
words are about one third to one half of a second in duration, the LTW must be
able to perform its operation in about 300 to 500 ms.
The LTW algorithm presented in Section 6.3.2 is implemented on the
Poker system, and the next section presents l1, the resulting program. A later
section discusses a second LTW program, l2, which is a single processor
algorithm. The data throughput needed by the LTW processor is low (500 ms
between utterances) compared to the other parts of the speech recognition
system. A single cell may be able to perform the LTW task in real time.
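The warping step itself can be sketched in Python as follows. This is a minimal illustration patterned on the interpolation loop of the ltw listings (Figures 8.35 and 8.37), not the xx code itself. The function name linear_time_warp, the 0-based list indexing, and the clamp on the last index are assumptions added for the sketch; the scale of 64 and the bias of 128 are taken from the listings.

```python
def linear_time_warp(track, n_out):
    """Warp one coefficient track of J values to n_out values (n_out >= 2).

    Each output is a linear interpolation between two neighboring input
    values, scaled by 64 and offset by 128 as in the listings.
    """
    J = len(track)
    factor = (J - 1) / (n_out - 1)
    out = []
    for i in range(1, n_out + 1):
        tmp = (i - 1) * factor + 1.0      # 1-based position in the input
        j = int(tmp)                      # integer part (left neighbor)
        s = 64.0 * (tmp - j)              # weight of the right neighbor
        onems = 64.0 - s                  # weight of the left neighbor
        left = track[j - 1]
        right = track[min(j, J - 1)]      # clamp: an addition for this sketch
        out.append(onems * left + s * right + 128.0)
    return out

stretched = linear_time_warp([0.0, 1.0, 2.0], 5)   # 3 frames -> 5 frames
shrunk = linear_time_warp([0.0, 1.0, 2.0], 2)      # 3 frames -> 2 frames
print(stretched, shrunk)
```

In the parallel program each cell would apply this loop to the track of one coefficient, while the serial program applies it to every coefficient in turn.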
8.5.1. Parallel LTW - l1
Figures 8.33, 8.34, and 8.35 show the switch settings, port names, and xx
program listing for l1, the parallel LTW program. The l1 program uses one ltw
cell per coefficient; the switch settings in the figure are therefore for four
coefficients per frame. The following describes how the program works and
discusses the execution times for using l1 in a typical isolated word recognition
system.
In the program, cell (3,1) outputs the input frames, which go to the west
port of ltw cell (1,1). All the coefficients enter cell (1,1), and it passes all but
one to cells (1,2), (1,3), and (1,4). Each cell keeps one coefficient and passes the
other coefficients on to the cells to the right. The algorithm works as follows.
[Figure 8.33 layout: a four by four grid of cells. Row 1 holds the four ltw cells, row 2 holds the four output cells, and row 3 holds the input cell.]
Figure 8.33. Switch setting for multi-cell LTW program.
Figure 8.34. Port names for multi-cell LTW program.
































VLSI processor array, simulated by Poker 
This routine does a linear time warp.
The data enters in from the left as p 
coefficients per frame. The first cell 
takes the last coefficient and keeps it 
and passes the p-1 preceding
coefficients on to the cells to the
right. Each of the other cells does the
same thing until the rightmost cell
only inputs one coefficient. In the
end, cell 1 will have lpc coefficient k
for all frames, and cell k will have
lpc coef. 1. -1 is input to show the
end of data and time to start
computing. Each cell computes the new
warped output using only its lpc coef. 
Input: 32-bit floating point 
Output: 8-bit, unsigned 
p, the number of LPC coefficients, 
p, the number of LPC coefficients.
LPC coefficients enter the leftmost cell. 
Warped coefficients exit the south 
port of each LPC cell.
Does not apply
92 ms for p=8 and I=40
/* Number of frames in input utterance */
/* Number of frames in output utterance */
/* Number of coefs. given cell will get */
/* leftmost cell should have number=p, */
/* the next will have p-1, until the */





15 real factor, /* ratio of J/I */
Figure 8.35. Code for multi-cell LTW program.
Dec 20 11:48 1983 ltw.x Page 2
16 R[80], /* Input utterance */
17 s, /* scale */
18 onems, /* 1-s */
19 tmp,
20 T1,T2, /* Patch Job */








27 while true do
28 begin
29 j := 0;
30 inputting := true;
31
32 while inputting do
33 begin
34 tmp <- in; /* get first input for yourself */
35 if tmp > 0 then /* if -1, it's the end of the input */
36 begin
37 for k := 2 to number do
38 begin /* Send the rest to the other cells */
39 passon <- tmp;
40 tmp <- in;
41 end;
42 j := j + 1;
43 R[j] := tmp;
44 end
45 else begin
46 inputting := false;
47 for k := 2 to number do




end; /* of inputting loop */
52 J := j;
53 factor := (J-1)/(I-1);
54
55 for i := 1 to I do
56 begin
57 tmp := (i-1) * factor + 1.0;
58 j := tmp;
59 s := 64.0 * (tmp - j);
60 onems := 64.0 - s;
61 T1 := onems * R[j];
62 T2 := s * R[j + 1];
Figure 8.35 (Continued)



















T := T1 + T2 + 128.0;
out <- T;
end;
/* of while true loop */
Figure 8.35 (Continued)











11 next := 1.0;
12 out <- next;
13 next := next + 1.0;
14 out <- next;
15 next := next + 1.0;
16 out <- next;
17 next := next + 1.0;
18 out <- next;
19
20 next := 2.0;
21 out <- next;
22 tmp <- sync; /* Wait for data to flow through */
23 next := next + 1.0; /* before sending next group */
24 out <- next;
25 next := next + 1.0;
26 out <- next;
27 next := next + 1.0;
28 out <- next;
29
30 next := 3.0;
31 out <- next;
32 tmp <- sync; /* Wait for data to flow through */
33 next := next + 1.0; /* before sending next group */
34 out <- next;
35 next := next + 1.0;
36 out <- next;
37 next := next + 1.0;
38 out <- next;
39 next := -1.0;
40 out <- next;
41 end.
Figure 8.35 (Continued)
1) Coefficient one of frame one enters cell (1,1). Cell (1,1) passes it to cell (1,2)
which passes it to cell (1,3), and finally to cell (1,4).
2) Coefficient two of frame one enters cell (1,1) which passes it to cell (1,2) and
then to cell (1,3). Cell (1,3) does not pass it to cell (1,4).
3) Coefficient three enters cell (1,1), which passes it to cell (1,2) where it stops.
4) Coefficient four enters cell (1,1) and stays there.
Now each cell has one coefficient from the first frame. The above process
repeats for every frame in the utterance. Once each cell has one coefficient
from each frame, the cell starts computing the new frames. After each cell
computes a new coefficient, it writes it to the output cell below it (see Figure
8.33). If p=8, 8 ltw cells must be used, and the computation time will not
increase. However, the time needed to pipe the 8 coefficients to all the cells
will double.
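The four steps above can be traced with a small Python model of the pipeline. It is a sketch under the assumption, taken from the ltw listing, that each cell forwards every value it receives except the last one of the frame, which it keeps. The function name distribute and the list-based model of the ports are illustrative only.

```python
def distribute(frames):
    """Model the pipeline of ltw cells.

    frames: list of frames, each a list of p coefficients in arrival order.
    Returns kept, where kept[c] lists the values stored by cell c
    (c = 0 is the leftmost cell).
    """
    p = len(frames[0])
    kept = [[] for _ in range(p)]
    for frame in frames:
        incoming = list(frame)
        for cell in range(p):
            number = p - cell              # values this cell receives per frame
            assert len(incoming) == number
            *passed_on, mine = incoming    # forward all but the last value
            kept[cell].append(mine)
            incoming = passed_on
    return kept

# Two frames of four coefficients; the tens digit is the frame number.
kept = distribute([[11, 12, 13, 14], [21, 22, 23, 24]])
print(kept)
```

The leftmost cell ends up with coefficient four of every frame and the rightmost cell with coefficient one, matching steps 1 to 4.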
Figure 8.36 shows the execution times for a sample run which uses three
frames of four coefficients each for input and produces two frames of output.
The total time needed to warp the three frames to two is 11,417 μs. Of the
11,417 μs, 6,579 μs are spent reading in the coefficients and passing them on to
other cells, 4,286 μs are used to compute and scale each coefficient, and 552
μs are used for outputting the new frames. One way to gauge the performance of
the l1 program is to view it in a speech recognition system. In a typical
system, the coefficients will arrive one frame at a time about once every 10 to 20
ms. With this slow input rate, most of the time the program uses to input
data is spent waiting for the next frame to arrive. The important time is the
time after the end of the utterance and before producing new warped frames.
This time shows how quickly the program can warp the utterance after all the
data has arrived.
Consider a system that produces one frame of 8 coefficients once every 10
ms. The typical word length is 40 frames, so the LTW program must output
40 frames after all the data is input. Program l1 uses 8 ltw cells, one cell for
each coefficient. The time to compute and output one frame is the time used
by lines 56 to 64 of Figure 8.35. This is 2,303 μs. Since l1 must produce 40
frames, the total time is 92,120 μs. The computation time therefore depends
on the number of frames output. In a typical speech system there are 300 to
500 ms between the beginnings of adjacent utterances. The LTW programs
/*
This routine does a linear time
warp. The data enters in from
the left as p coefficients per
frame. The first cell takes
the last coefficient and keeps
it and passes the p-1 preceding
coefficients on to the cells
to the right. Each of the
other cells does the same thing
until the rightmost cell only
inputs one coefficient. In the
end, cell 1 will have lpc
coefficient k for all frames,
and cell k will have lpc coef.
1. -1 is input to show the end
of data and time to start
computing. Each cell computes
the new warped output using
only its lpc coef. The new
warped utterance is output one
coefficient at a time as it is
computed. The new coefficients






Count Min Ave Max
1 0 0 0
1 0 0 0
1 5 5 5
1 5 5 5
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 5 5 5
1 0 0 0
1 5 5 5
1 0 0 0
bool inputting;
sint k,
j,J, /* Number of frames in input utterance */
i,I; /* Number of frames in output utterance */
int number; /* Number of coefs. given cell will get */
/* leftmost cell should have number = p, */
/* the next will have p-1, until the */
/* rightmost will have number=1 */
real factor, /* ratio of J/I */




T1,T2, /* Patch Job */
T, /* Output utterance */
in, passon;
Figure 8.36. Execution times in μs for multi-cell LTW. Three input frames of
four coefficients each, two output frames. 2,303 μs per output frame.
1 0 0 0 int out;
1 5 5 5 I := 2;
1 0 0 0 while true do
begin
2 5 5 5 j := 0;
2 5 5 5 inputting := true;
2 9 9 9 while inputting do
begin
4 274 274 274 tmp <- in; /* get first input for yourself */
4 314 314 314 if tmp > 0 then /* if -1, it's the end of the utterance */
begin /* Send the rest to the other cells */
3 66 66 66 for k := 2 to number do
begin
9 91 91 91 passon <- tmp;
9 274 274 274 tmp <- in;
9 4 8 10 end;
3 14 14 14 j := j + 1;
3 73 73 73 R[j] := tmp;
end
3 2 2 2 else begin
1 5 6 6 inputting := false;
1 66 66 66 for k := 2 to number do
3 95 99 101 passon <- tmp;
1 0 0 0 end;
4 11 11 11 end; /* of inputting loop */
1 8 8 8 J := j;
1 210 210 210 factor := (J-1)/(I-1);
1 20 20 20 for i := 1 to I do
begin
2 425 425 425 tmp := (i-1) * factor + 1.0;
2 157 157 157 j := tmp;
2 442 442 442 s := 64.0 * (tmp - j);
2 227 227 227 onems := 64.0 - s;
2 196 196 196 T1 := onems * R[j];
2 202 202 202 T2 := s * R[j + 1];
2 368 368 368 T := T1 + T2 + 128.0;
2 276 276 276 out <- T;
2 4 7 10 end;




must execute in this amount of time to run in real time. The l1 program
requires only 92 ms; therefore it can run in real time.
8.5.2. Serial LTW - l2
Since the LTW program needs a low throughput for real-time processing
(i.e., 300 to 500 ms per utterance), this section considers a serial approach.
Figure 8.37 is the listing for l2, the single-cell LTW program. l2 uses only one
cell and executes a serial LTW program. Figure 8.38 shows the timings for
each step. The execution times depend on both the number of coefficients and
the number of output frames, but not on the number of input frames. The total
time needed for the sample run with three input frames of four coefficients
each is 18,016 μs: 6,920 μs are used to input the three frames, and 11,096 μs
are used to compute and output two frames.
Viewing the l2 program in the same setting as the l1 program shows that
no more cells are used when computing 8 coefficients than when computing 4.
Program l2 must repeat lines 62 to 65 of Figure 8.37 for each coefficient it
computes. These lines take 1,042 μs to execute, making a total time of 9,597 μs
to compute one frame of 8 coefficients. l2 uses 383,880 μs to compute 40
frames. Table 8.6 summarizes these results. In a typical system, the LTW
program has 300 to 500 ms to perform its operations; therefore the l2 program
may not be able to process in real time if many short utterances are spoken in
a row. In such a case a buffer is needed to store the next frame while the
current frame is being processed.
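The timing figures quoted for the two programs follow from a simple linear model, sketched below in Python. The constants are the measured values from the text; the function names and the assumption that the serial program's per-frame time grows by exactly 1,042 μs per extra coefficient are illustrative.

```python
L1_FRAME_US = 2_303        # l1: per output frame, independent of p
L2_FRAME_US_P4 = 5_429     # l2: measured per output frame for p = 4
L2_PER_COEF_US = 1_042     # l2: lines 62 to 65, repeated once per coefficient

def l2_frame_us(p):
    """Estimated l2 time per output frame for p coefficients."""
    return L2_FRAME_US_P4 + (p - 4) * L2_PER_COEF_US

def utterance_us(frame_us, n_frames=40):
    """Time to compute and output a whole warped utterance."""
    return frame_us * n_frames

print(utterance_us(L1_FRAME_US))       # l1, 40 frames
print(l2_frame_us(8))                  # l2, one frame of 8 coefficients
print(utterance_us(l2_frame_us(4)))    # l2, 40 frames, p = 4
print(utterance_us(l2_frame_us(8)))    # l2, 40 frames, p = 8
```

The l2 result for p=8 sits inside the 300 to 500 ms window between utterances, which is why buffering may be needed when short utterances follow one another closely.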
8.5.3. Summary
Two programs to perform linear time warping were presented. The first,
l1, was based on the algorithm presented in Section 6.3.2. l1 achieves its
parallelism by using one cell for each coefficient in a frame. By using p cells (where
p is the number of coefficients per frame), l1 is able to warp an input utterance
with an arbitrary number of frames to 40 frames in 92 ms. This time does not
depend on the number of input frames nor the number of coefficients per















VLSI processor array, simulated by Poker 
This routine does a linear time warp 
using only one cell. 10000 is input to 
show the end of data and time to start 
computing. The new warped utterance is 
output one coefficient at a time as it
is computed. All outputs are ints,
multiplied by 64 with 128 added, so the
fraction part will not be lost.
Input: 32-bit floating point
Output: 8-bit integers multiplied by 64 with 128 added.
1
p, the number of LPC coefficients.
I, the number of frames computed.
p LPC coefficients are received at the “in” port.
The value 10001 is received if a word is spotted and
the ltw program should begin. The value 10000 
is received if the energy previously received 
should be discarded.








6
7 sint j,J, /* Number of frames in input utterance */
8 i,I; /* Number of frames in output utterance */
9
10 real factor, /* ratio of J/I */






17 R7[10],
18 R8[10],
19 s, /* scale */
20 onems, /* 1 minus s */
21 tmp,
22 T1,T2, /* Patch Job */
23 T, /* Output utterance */







I := 2;
28
29 while true do
30 begin
31 j := 0;
32 inputting := true;
33
34 while inputting do
35 begin
36 tmp <- in;
37
38 if tmp = 10000.0 then /* false alarm, empty buffer */
39 j := 0
40 else if tmp = 10001.0 then
41 inputting := false /* end of word, start warping */
42 else
43 begin /* Get next frame */
44 j := j + 1;
45 R1[j] := tmp; tmp <- in;
46 R2[j] := tmp; tmp <- in;
47 R3[j] := tmp; tmp <- in;
48 R4[j] := tmp;
49 in <- tmp; /* Send sync to endpoint */
50 end
51 end; /* of inputting loop */
52
53
54 factor := (J-1)/(I-1);
55
56 for i := 1 to I do
57 begin
58 tmp := (i-1) * factor + 1.0;
59 j := tmp;
60 s := 64.0 * (tmp - j); /* Scale by 128 so it can be */
61 onems := 64.0 - s; /* stored in an 8 bit value */
62 T1 := onems * R1[j]; /* also add 128 bias so it will be */
63 T2 := s * R1[j + 1]; /* always positive */
64 T := T1 + T2 + 128.0;
65 out <- T;
66 T1 := onems * R2[j];
67 T2 := s * R2[j + 1];
68 T := T1 + T2 + 128.0;
69 out <- T;
70 T1 := onems * R3[j];
71 T2 := s * R3[j + 1];
72 T := T1 + T2 + 128.0;










T1 := onems * R4[j];
T2 := s * R4[j + 1];
T := T1 + T2 + 128.0;
out <- T;
end;
end; /* of while true loop */
Figure 8.37 (Continued)
Count Min Ave Max
1 0 0 0
1 5 5 5
1 5 5 5
/*
This routine does a linear time warp 
using only one cell. 10000 is input to 
show the end of data and time to start 
computing. The new warped utterance is 
output one coefficient at a time as it
is computed. All outputs are ints,
multiplied by 64 with 128 added, so the 







sint j,J, /* Number of frames in input utterance */ 
i,I; /* Number of frames in output utterance */
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 5 5 5
1 0 0 0
1 5 5 5
real factor, /* ratio of J/I */









onems, /* 1 minus s */
tmp,
T1,T2, /* Patch Job */
T, /* Output utterance */
in;
int out;
1 5 5 5 I := 2;
1 0 0 0
1 5 5 5





1 9 9 9 while inputting do
begin
4 262 363 398 tmp <- in;
Figure 8.38. Execution times in μs for single cell LTW. Three input frames of
four coefficients each, two output frames. 5,429 μs per output frame.
4 432 432 432 if tmp = 10000.0 then /* false alarm, empty buffer */
j := 0
else if tmp = 10001.0 then
/* end of word, start warping */
inputting := false
1 7 7 7 else
begin /* Get next frame */
3 14 14 14 j := j + 1;
3 335 335 335 R1[j] := tmp; tmp <- in;
3 335 335 335 R2[j] := tmp; tmp <- in;
3 335 335 335 R3[j] := tmp; tmp <- in;
3 73 73 73 R4[j] := tmp;
3 91 91 91 in <- tmp; /* Send sync to endpoint */
3 0 0 0 end
4 11 11 11 end; /* of inputting loop */
1 8 8 8 J := j;
1 210 210 210 factor := (J-1)/(I-1);
1 20 20 20 for i := 1 to I do
begin
2 425 425 425 tmp := (i-1) * factor + 1.0;
2 157 157 157 j := tmp;
/* Scale by 128 so it can be stored */
/* in an 8 bit value */
/* also add 128 bias so it will be */
/* always positive */
2 442 442 442 s := 64.0 * (tmp - j);
2 227 227 227 onems := 64.0 - s;
2 196 196 196 T1 := onems * R1[j];
2 202 202 202 T2 := s * R1[j + 1];
2 368 368 368 T := T1 + T2 + 128.0;
2 276 276 276 out <- T;
2 196 196 196 T1 := onems * R2[j];
2 202 202 202 T2 := s * R2[j + 1];
2 368 368 368 T := T1 + T2 + 128.0;
2 276 276 276 out <- T;
2 196 196 196 T1 := onems * R3[j];
2 202 202 202 T2 := s * R3[j + 1];
2 368 368 368 T := T1 + T2 + 128.0;
2 276 276 276 out <- T;
2 196 196 196 T1 := onems * R4[j];
2 202 202 202 T2 := s * R4[j + 1];
2 368 368 368 T := T1 + T2 + 128.0;
2 276 276 276 out <- T;
2 4 7 10 end;




Table 8.6 Execution times for LTW programs.
Program                  l1           l1           l2            l2
Number of cells          4            8            1             1
Number of coefficients   4            8            4             8
Time for one frame       2,303 μs     2,303 μs     5,429 μs      9,597 μs
Time for 40 frames       92,120 μs    92,120 μs    217,160 μs    383,880 μs
require I * 2,303 μs to compute. Data from the LPC program, which precedes
the LTW program, arrives at the rate of one frame every 10 to 20 ms in a
typical system. The 10 to 20 ms needed to input each of the 40 frames is a long
time compared to the 92 ms needed to compute and output 40 frames. Therefore
the time used for inputting frames is not included in the total time, since it
is dependent on the lpc cell which is producing the input data.
Program l2 is a serial program using one cell. It requires 384 ms to perform
the same task as above using 8 coefficients per frame. Each additional
coefficient requires 1,042 μs to compute. When using 8 coefficients, each
additional output frame requires 9,597 μs to compute.
The Poker system is able to implement both LTW algorithms in real time
since it performs the LTW task once every 300 to 500 ms. A buffer may be
needed to hold the inputs to the l2 program since it needs 384 ms for its
computation. Since the computational requirements are lax, both algorithms are
written in xx and run in real time.
8.6. Poker Simulation of Dynamic Time Warping
Dynamic time warping (DTW) is the process of taking one unknown utter­
ance and comparing it to one known utterance. The result of the DTW opera­
tion is a single score telling how closely the two utterances match. A typical 
isolated word recognition system matches the unknown utterance to every 
known utterance in the system’s vocabulary. A 1,000 word vocabulary would 
therefore require 1,000 DTWs to be performed.
An utterance is a collection of I frames of p coefficients each. I is constant
since the LTW program will stretch or shrink the utterance to a fixed
length before the DTW program processes it. Typically I=40, p=8, and each
coefficient is 16 bits.
The Poker system simulates the operations of the BAC using two different
programs. The first, d1, is written in xx. The second, d2, is written in 8051
assembly language. As with the simulations of the previous algorithms, the xx
program is too slow for real-time processing. The 8051 program, which, in
addition to being written in assembly language, uses less precise data (8-bit
coefficients and 16-bit distances), can run in real time. A typical speech
recognition system uses 16-bit coefficients and 16-bit distances, so the execution
times for such a system are extrapolated from the execution times of the
simulated system. The following sections give the highlights of the two programs.
8.6.1. BAC Written in xx - d1
Figures 8.39 and 8.40 show the switch settings, code names, and port
names used to simulate a BAC with a warping path of r=2. This value was
chosen because it requires a total of 2r+1 even and odd cells, which
conveniently fit into a four by four grid of cells. Increasing r to a typical value of
6 will not change the throughput; however, it will increase the time needed to
initialize the array. Figure 8.41 gives the xx code for the instructions given in
Figure 8.39. Switch settings for DTW program d1.
















VLSI processor array, simulated by Poker.
This routine does a dynamic time warp.
This code is executed by the even cells.
Input: 32-bit floating point
d: 32-bit floating point
g: 32-bit floating point
2r + 1
r, the width of the warping path.
p, the number of coefficients per frame.
I, the number of frames per utterance.
tend generates the a vectors and sends
them up from the bottom of the array.
bend generates the b vectors and sends
them down from the top of the array.
The d and g values are passed to the adjacent
odd cells. The a and b vectors are passed
to the adjacent even cells.
8,960 μs
36 ms for I=40, p=4, and r=2
1 */
2
3 Patterned after assembly language program d2
4 */
5 code even;
6 trace d,g,atmp,btmp;
7 ports bout, bin, aout, ain, DTtop, DTbot;
8 begin
9 sint i,
10 coefs; /* Number of coefficients in feature vector */
11 int bout, bin, aout, ain, DTtop, DTbot;
12 int a[10], atmp,
13 b[10], btmp,
14 d, /* Distance between a and b vectors */
15 Dbot,
16 Dtop,











Figure 8.41. xx code for DTW program d1.
Jan 17 08:56 1984 even.x Page 1
27 coefs := 4;
28 inf := 32768;
29 /*
30 Initialize all variables.
31 */
32 Gbotold := inf;
33 Gtopold := inf;
34 Gbot := inf;
35 Gtop := inf;
36 Dbot := inf;




40 for i := 1 to coefs do
41 begin
42 a[i] := inf;
43 b[i] := inf;
44 end;
45
46 while true do
47 begin
48 d := 0;
49 for i := 1 to coefs do
50 begin
51 aout <- a[i]; /* Send out coefficients */
52 bout <- b[i];
53 atmp <- ain; a[i] := atmp; /* Read in new coefficients */
54 btmp <- bin; b[i] := btmp;
55
56 tmp1 := atmp - btmp; /* Find distance between */




60 /* If a[1] or b[1] is == inf, distance is inf */




64 DTtop <- d; /* Send local distance to odd cell */
65 DTbot <- d; /* "above" and "below" */
66
67 tmp1 := Gbotold + 2*Dbot; /* Find minimum path */
68 tmp2 := g + d;
69 tmp3 := Gtopold + 2*Dtop;
70
71 if tmp1 < tmp2 then




74 min := tmp2;
75 if tmp3 < min then
76 min : = tmp3;
77
Jan 17 08:56 1984 even.x Page 2
78 if min < inf then /* If these are not infinite vectors, */
79 g := d + min /* compute g */
80 else
81 g := 0; /* Otherwise set to zero */
82 /* for next time */
83 Gtopold := Gtop; /* Save current values for later use */
84 Gbotold := Gbot;
85
86 Gtop <- DTtop; /* Get new g values from odd cells */
87 Gbot <- DTbot;
88
89 DTtop <- g; /* Send g to odd cells */
90 DTbot <- g;
91
92 Dtop <- DTtop; /* Get new d values from odd cells */




Figure 8.41 (Continued)
















































VLSI processor array, simulated by Poker.
This routine does a dynamic time warp.
This code is executed by the odd cells.
Input: 32-bit floating point
d: 32-bit floating point
g: 32-bit floating point
2r + 1
r, the width of the warping path.
p, the number of coefficients per frame.
I, the number of frames per utterance.
tend generates the a vectors and sends
them up from the bottom of the array.
bend generates the b vectors and sends
them down from the top of the array.
The d and g values are passed to the adjacent
even cells. The a and b vectors are passed
to the adjacent odd cells.
8,960 μs
36 ms for I=40, p=4, and r=2
Patterned after the assembly language routines.
odd;
d,g,atmp,btmp;
bout, bin, aout, ain, DTtop, DTbot;
sint i,
coefs; /* Number of coefficients in feature vector */
int bout, bin, aout, ain, DTtop, DTbot;
int a[10], atmp,
b[10], btmp,
d, /* Distance between a and b vectors */
Dbot,
Dtop,
g, /* Local optimal distance */
Gbot,
Gtop,







Jan 17 08:56 1984 odd.x Page 2
26 /*
27 Initialize all variables and send out a dummy infinity vector.
28 */
29 Gbot := inf;
30 Gtop := inf;
31 Dbot := inf;
32 Dtop := inf;
33 g := 0;
34
35 for i := 1 to coefs do /* Send out infinity vector and receive */
36 begin /* real input vector */
37 a[i] := inf;
38 b[i] := inf;
39 end;
40
41 while true do
42 begin
43 d := 0;
44 for i := 1 to coefs do
45 begin
46 aout <- a[i];
47 bout <- b[i];
48 atmp <- ain; a[i] := atmp;
49 btmp <- bin; b[i] := btmp;
50
/* Compute distance between vectors */
51 tmp1 := atmp - btmp;
52 d := d + tmp1 * tmp1
53 end;
54
55 /* If a[1] or b[1] is == inf, distance is inf */
56 if (a[1] = inf) | (b[1] = inf) then
57 d := inf;
58
59 Dbot <- DTbot;
60 Dtop <- DTtop;
61
62 tmp1 := Gbot + 2*Dbot; /* Find minimum path */
63 tmp2 := g + d;
64 tmp3 := Gtop + 2*Dtop;
65
66 if tmp1 < tmp2 then
67 min := tmp1
68 else
69 min := tmp2;
70 if tmp3 < min then
71 min := tmp3;
72
Figure 8.41 (Continued)
Jan 17 08:56 1984 odd.x Page 3










81 Gbot <- DTbot;
82 Gtop <- DTtop;
83
84 DTbot <- d;








Jan 17 08:56 1984 bend.x Page 1
1 code bend;
2 trace Ain, Bout, dTtop;
3 ports ain, bout, DTtop, bout2, ain2;
4 begin
5 sint i,
6 coefs; /* Number of coefficients in feature vector */






13 coefs := 4;
14 inf := 32768;
15 Bout := 2;
16
17 while true do
18 begin
19 for i := 1 to coefs do
20 begin
21 bout <- Bout; /* Output new B vector */
22 bout2 <- Bout;
23 Bout := Bout + 1;
24 Ain <- ain; /* Dummy read */












Jan 17 08:56 1984 tend.x Page 1
1 code tend;
2 trace Aout, Bin, dTbot;
3 ports bin, aout, DTbot, aout2, bin2;
4 begin
5 sint i,
6 coefs; /* Nur






13 coefs := 4;
14 inf := 32768;
15 Aout := 1;
16
17 while true do
18 begin
19 for i := 1 to coefs do
20 begin
21 aout <- Aout; /* Output new A vector */
22 aout2 <- Aout;
23 Aout := Aout
24 Bin <- bin;
25 Bin <- bin2;
26 end;
27
28 dTbot <- DTbot;
29
30 DTbot <- inf;
31 dTbot <- DTbot;
32 DTbot <- inf;
33 end;
34 end.




Figure 6.23, where even is Group A, odd is Group B, and tend and bend are
the top and bottom ends. tend and bend produce the input data for even and
odd. Therefore a total of 2r+5 cells are used: 2r+1 cells are for the BAC, and
the four extra cells are used to produce inputs. Changing the width of the
warping path will change the number of cells used, but will not change the
throughput of the BAC.
The instructions for the even numbered cells in Group A of Figure 6.23 
map to the xx code as follows: Lines 5-24 of even in Figure 8.41 are variable 
declarations. Lines 27-44 are variable initializations. All the variables and the 
input frames are initially set to infinity since it takes time for the frames to 
fill the bilinear array. During the filling process most cells contain invalid data. 
Figure 6.22 shows that during loop #4, only cells -1, 0, and 1 have two pairs 
of valid frames. During loop #4 these three cells compute valid distance 
scores, while the rest compute values that have no meaning. Initializing these 
“invalid” cells to infinity allows them to perform their computations and pass 
their distance scores (which will be infinity) to the cells making valid 
computations. Since g is picked as the minimum path, the infinite distances 
from the invalid cells have no effect on the path taken.
Lines 48-57 in even in Figure 8.41 move the unknown frames down, the 
known frames up, and compute d. The distance measure used is a sum of 
squares of differences. It was chosen because its computation time falls 
between the time needed for a simple “absolute value of differences” and the 
“Itakura distance measure” [Itak75]. If the Itakura distance measure were 
used, it would increase the distance computation time because it requires the 
log of a value. Since the APU takes 1,783 µs for the log computation and the 
distance measure implemented uses 1,580 µs for p=8, the Itakura measure 
would more than double the local distance computation time. 
Since the local distance measure is computed in a serial fashion, other distance 
measures can be used without the need for finding parallel algorithms to 
implement them.
Section 6.4.2 calls the input data to the BAC vectors. I will use the term frames instead 
of vectors since the vectors a and b represent frames of speech data in a speech recogni­
tion system.
Lines 61 and 62 check to see if either frame is infinity; if so, d is infinity. 
Lines 67-81 find the minimum of the three paths. If the minimum is infinity, g 
is set to zero. This condition occurs after processing one pair of utterances and 
before the arrival of the next pair. Lines 83-93 send off the g and d values to 
the adjacent odd numbered cells. The process starts over at line 48.
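The per-loop computation of an even cell described above can be sketched in Python. This is only an illustrative sketch that follows the statements visible in Figure 8.42; it is not the xx program itself. The function name `even_cell_step` and the use of `math.inf` in place of the xx constant inf are assumptions made for the example.

```python
import math

INF = math.inf  # stands in for the xx constant "inf" (assumption)


def even_cell_step(a, b, g, g_bot_old, d_bot, g_top_old, d_top):
    """One loop of an even cell: local distance d, then the new g."""
    # Local distance: sum of squared differences over the coefficients.
    d = sum((x - y) ** 2 for x, y in zip(a, b))
    # An infinite first coefficient marks an invalid frame.
    if a[0] == INF or b[0] == INF:
        d = INF
    # Minimum over the three candidate paths.
    best = min(g_bot_old + 2 * d_bot, g + d, g_top_old + 2 * d_top)
    # A finite minimum extends the path; otherwise g is reset to zero
    # so the cell is ready for the next pair of utterances.
    g_new = d + best if best < INF else 0
    return d, g_new
```

Because the neighbours' values enter only through `min`, a neighbour that still holds infinity cannot be selected while any finite path exists, which is the property the initialization relies on.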
Because of the internal workings of the xx compiler, lines 89, 90 and 92, 93 
are switched from Figure 6.23. If cell A writes two values to the same port of 
cell B before cell B reads one, the first value is lost. Alternating reads and 
writes to the same cell ensures that no data is lost. A later version of xx solves 
this problem.
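The lost-write behaviour can be illustrated with a small Python model. The sketch assumes the port behaves like a one-value latch, which is how the paragraph above describes it; the class and function names are hypothetical and do not come from the Poker system.

```python
class LatchPort:
    """Single-slot port: a second write before a read overwrites the first."""

    def __init__(self):
        self.slot = None

    def write(self, value):
        self.slot = value

    def read(self):
        value, self.slot = self.slot, None
        return value


def back_to_back(port, values):
    # Writer sends everything first, then the reader drains the port.
    for v in values:
        port.write(v)
    return [port.read() for _ in values]


def alternating(port, values):
    # Each write is followed by a read before the next write.
    received = []
    for v in values:
        port.write(v)
        received.append(port.read())
    return received
```

With two values, `back_to_back` delivers only the second one, while `alternating` delivers both in order.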
The translation from Group B of Figure 6.23 to odd of Figure 8.41 follows 
the same pattern.
Cells (1,1) and (4,4) run xx code tend and bend. tend provides the 
unknown input frame while bend produces the known frame. Since all the even 
cells are identical, cell (2,1) will read and write values to the cell “above” it 
just as cell (3,2) does, but there is no even cell above it. The teven cell absorbs 
the d and g values sent to it by cell (2,1) and produces infinity d and g values 
to send to cell (2,1). These infinity values signal cell (2,1) that there is no valid 
warping path from the cell above it. The todd, bodd, and beven cells perform 
the same function for the cells they communicate with as the teven cell does 
for cell (2,1).
Figure 8.42 gives the execution times in µs for each step of cell (2,1) using 
four coefficients per frame. The maximum total time needed for one loop of 
the d1 program, not including variable declarations or initializations, is 8,960 
µs. Table 8.7 shows the percentage of time each part of the DTW program 
uses when computing four coefficients per frame. A real speech system would 
use on the order of 8 coefficients per frame, which doubles the time to move a 
and b and find d. The total time for an 8 coefficient system is 2*3,891 µs + 
2*1,611 µs + 1,674 µs + 1,784 µs = 14,462 µs per loop. A typical system will 
require 40 loops for one comparison, which is 40 * 14,462 µs = 578,480 µs. One 
comparison is performed for every word in the vocabulary, so a vocabulary of 
only two words can be matched in a little over one second. This is much too 
slow for real time recognition. As before, coding in assembly language can 
reduce the time of a loop.
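The arithmetic above can be checked with a few lines of Python. The constants are the four-coefficient times from Table 8.7 (using the maximum time for finding g); the function name and the assumption that the two coefficient-dependent terms scale linearly with the number of coefficients are illustrative only.

```python
# Per-loop times in microseconds for four coefficients (Table 8.7).
MOVE_AB_US = 3891   # moving a and b
FIND_D_US = 1611    # finding d
FIND_G_US = 1674    # finding g (maximum of the 1,560-1,674 range)
MOVE_DG_US = 1784   # moving d and g


def d1_loop_time_us(coefs):
    """Estimated d1 loop time, assuming linear scaling in the coefficient count."""
    scale = coefs / 4
    return scale * (MOVE_AB_US + FIND_D_US) + FIND_G_US + MOVE_DG_US


loop8 = d1_loop_time_us(8)   # one loop with 8 coefficients
match_us = 40 * loop8        # one 40-loop comparison
```

Only the terms for moving a and b and for finding d depend on the number of coefficients; the other two terms are fixed costs per loop.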
Count  Min   Ave   Max
                          code even;
                          trace d,g,atmp,btmp;
                          ports bout, bin, aout, ain, DTtop, DTbot;
                          begin
1      0     0     0      sint i,
1      0     0     0      coefs; /* Number of coefficients in feature vector */
                          int bout, bin, aout, ain, DTtop, DTbot;
1      5     5     5      int a[10], atmp,
1      5     5     5      b[10], btmp,
1      5     5     5      d, /* Distance between a and b vectors*/
1      0     0     0      Dbot,
1      0     0     0      Dtop,
1      5     5     5      g, /* Local optimal distance */
1      0     0     0      Gbot,
1      0     0     0      Gbotold,
1      0     0     0      Gtop,
1      0     0     0      Gtopold,
1      0     0     0      inf, /* Infinity */
1      0     0     0      min,
1      0     0     0      tmp1,tmp2,tmp3;
1      0     5     5      coefs := 4;

1      52    52    52     Gbotold := inf;
1      52    52    52     Gtopold := inf;
1      52    52    52     Gbot := inf;
1      52    52    52     Gtop := inf;
1      52    52    52     Dbot := inf;
1      52    52    52     Dtop := inf;
1      29    29    29     g := 0;
1      20    20    20     for i := 1 to coefs do
                          begin
4      73    73    73     a[i] := inf;
4      73    73    73     b[i] := inf;
4      4     8.5   10     end;
1      0     0     0      while true do
                          begin
5      29    29    29     d := 0;
5      20    20    20     for i := 1 to coefs do
                          begin
                          /* Send out coefficients */
Figure 8.42. Execution times in µs for d1 using four coefficients per frame.
20     164   164   164    aout <- a[i];
20     164   164   164    bout <- b[i];
                          /* Read in new coefficients */
20     166   310   347    atmp <- ain; a[i] := atmp;
20     329   329   329    btmp <- bin; b[i] := btmp;
20     143   143   143    tmp1 := atmp - btmp;
                          /* Find distance between vectors */
                          d := d + tmp1 * tmp1
20     248   252   254    end;
                          /* If a[1] or b[1] is == inf, distance is inf */
5      304   304   304    if (a[1] = inf) | (b[1] = inf) then
2      52    52    52     d := inf;
                          /* Send local distance to odd cell */
5      91    91    91     DTtop <- d;
5      91    91    91     DTbot <- d; /* “above” and “below” */
                          /* Find minimum path */
5      229   229   229    tmp1 := Gbotold + 2*Dbot;
5      139   139   139    tmp2 := g + d;
5      229   229   229    tmp3 := Gtopold + 2*Dtop;
5      132   132   132    if tmp1 < tmp2 then
                          min := tmp1
3      54    54    54     else
2      52    52    52     min := tmp2;
5      132   132   132    if tmp3 < min then
                          min := tmp3;
                          /* If these are not infinite vectors, */
5      132   132   132    if min < inf then
                          g := d + min /* compute g */
3      141   141   141    else
                          /* Otherwise set to zero for next time */
2      29    29    29     g := 0;
                          /* Save current values for later use */
5      52    52    52     Gtopold := Gtop;
5      52    52    52     Gbotold := Gbot;
                          /* Get new g values from odd cells */
5      238   238   238    Gtop <- DTtop;
5      342   474   570    Gbot <- DTbot;
                          /* Send g to odd cells */
5      91    91    91     DTtop <- g;
5      91    91    91     DTbot <- g;
                          /* Get new d values from odd cells */
5      318   321   322    Dtop <- DTtop;
5      459   464   471    Dbot <- DTbot;




Table 8.7 Execution time summary for DTW program d1 using four 
coefficients per frame.
Operation          Time              Percent of Total
Moving a and b     3,891 µs          43%
Finding d          1,611 µs          18%
Finding g          1,560-1,674 µs    19%
Moving d and g     1,784 µs          20%
Total Time         8,960 µs          100%
8.6.2. 8051 Assembly Language Version of BAC - d2
Like the xx algorithm d1, the assembly language version of the BAC 
implements the algorithm in Figure 6.23. Figure 8.43 shows the switch setting 
for d2. d2 uses 13 cells since a typical speech recognition system uses a 
warping path width of r=6 and 2r+1 cells must be used. The cells are arranged in 
two vertical columns. The original “two rows on a diagonal” layout provides a 
good conceptual map from the task being performed to the program, but it 
makes poor use of cell space. The two vertical columns, however, make better 
use of the space. d2 does not use the tend and bend cells; instead it uses the 
beven (bottom even), bodd (bottom odd), teven (top even), and todd (top 
odd) cells. The difference is that the tend and bend cells do not compute distance 
values, while the top/bottom even/odd cells do. This makes better use of the 
computing power of each cell. d2 adds a new cell, called seven (scores even), to 
the middle of the array. This is an even cell that has an extra output port 
that outputs the distance scores. The xx program has no provisions for 
outputting scores. If it were to be used for a “real” speech system an output cell 
would be needed. Since d1 was used only to compute the loop time and not 
perform a complete DTW comparison, the output cell was not used.
Finally, four new cells are added: input, repeat, seq, and scores. input 
reads unknown frames from memory and writes them to its 
output port. repeat takes these frames and sends them out once for every 
utterance in the vocabulary. The known frames are stored in the seq cell. It 
outputs one known frame for every unknown frame coming from repeat. The 
known frame goes out the south port, while the unknown goes out the north.
Figure B.12 is a listing of all the assembly code for each cell. All the 
DTW cells execute their instructions in approximately lock-step fashion to 
minimize the overhead of inter-cell communication. The execution is 
approximately lock-step in that all the cells start executing their main loops at the 
same time and the instructions executed are timed so that the writes to the 
output ports are within a few µs of each other. Therefore the execution is not 
strictly lock-step, but it is not asynchronous either. When executing in this 
manner, all cells write to the switch at about the same time; however, 
operations between writes to the switch may not be precisely synchronized. After a 
fixed (and known) amount of time (12 µs), all cells can read from the switch








because data is guaranteed to be there. If the cells are not synchronized, it is 
possible for cell A to write to the switch, wait 12 µs, then read from the switch 
but get no data. This occurs when cell B is slightly behind cell A and has not 
written its data, intended for cell A, to the switch. The cells are run 
quasi-synchronously by using the built-in timer in each 8051 processor. Because of 
this, new feature frames must enter the DTW cells at a specific time. The seq 
cell is synchronized with the DTW cells so when seq has data to send, it sends 
it at the proper time. When there is no data to send, the seq cell sends infinity 
frames.
The last new cell is scores. DTW cell (4,7) sends each of its g values to 
the scores cell. When the scores cell receives a zero value, the DTW cells are 
starting to compare a new pair of utterances. The value it receives before the 
zero value is the total score for the previous pair of utterances. The scores cell 
stores this value in an array.
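The behaviour of the scores cell can be sketched in Python as follows. This is a hedged illustration rather than the assembly code of Figure B.12: the function name `collect_scores` is hypothetical, and skipping a zero that follows another zero (or that arrives first) is an assumption about how idle periods between utterances look.

```python
def collect_scores(g_values):
    """Return the total score of each utterance pair from a stream of g values.

    A zero marks the start of a new comparison, so the value received just
    before it is the final score of the previous comparison.
    """
    scores = []
    previous = None
    for g in g_values:
        # Assumption: repeated zeros (idle loops) carry no score.
        if g == 0 and previous not in (None, 0):
            scores.append(previous)
        previous = g
    return scores
```

For the stream 0, 3, 7, 12, 0, 0, 4, 9, 0 the sketch stores the scores 12 and 9.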
8.6.3. Execution Times
Figure 8.44 shows the execution times for both d1 and d2 when using four 
coefficients per frame. The total time for one loop of d2 is 460 µs. Table 8.8 
shows the percentage of time used by each part of both of the DTW programs.
The execution times for d2 assume there are four 8-bit unsigned 
coefficients per frame. A typical system would have eight 16-bit signed coefficients 
per frame. Table 8.9 is a summary of the expected execution times for a 
version of d2 that uses four or eight 16-bit coefficients per frame. The time used to 
compute g and move g and d will remain the same when changing either the 
number of coefficients or the frame size. However, the time used to move the a 
and b vectors will double when either the number of coefficients or the frame
size is doubled since twice as much data is being moved. Also, changing from 8 
to 16 bits will increase the time to compute d from 23.5 µs per coefficient to 74 
µs because the 8-bit multiply-accumulate takes 9 µs while a 16-bit 
multiply-accumulate takes about 60 µs. The computation time for d will be about 296 
µs for four coefficients or 592 µs for 8 coefficients. The d2 program spends 
47% of its time doing the computations (finding d and g) while the rest of its 








ports bout, bin, aout, ain, DTtop, DTbot;
begin
0 0 sint i,
0 0 coefs; /* Number of coefficients in feature vector */
int bout, bin, aout, ain, DTtop, DTbot;
5 0 int a[10], atmp,
5 0 b[10], btmp,
5 0 d, /* Distance between a and b vectors*/
0 0 Dbot,
0 0 Dtop,





0 0 inf, /* Infinity */
0 0 min,
0 0 tmpl,tmp2,tmp3;






52 4 Gbotold := inf;
52 2 Gtopold := inf;
52 2 Gbot := inf;
52 2 Gtop := inf;
52 2 Dbot := inf;
52 2 Dtop := inf;
29 6 g := 0;
20 2 for i := 1 to coefs do
begin
73 1 a[i] := inf;
73 1 b[i] := inf;
8.5 6 end;
0 5 while true do 
begin
29 4 d := 0;
20 3 for i := 1 to coefs do
Figure 8.44. Execution times in µs for d1 and d2.
d1   d2
xx   8051
           begin
164  16    aout <- a[i]; /* Send out coefficients */
164  14    bout <- b[i];
310  4     atmp <- ain; a[i] := atmp; /* Read in new coefficients */
329  4     btmp <- bin; b[i] := btmp;
143  7     tmp1 := atmp - btmp;
     10    d := d + tmp1 * tmp1 /* Find distance between vectors */
252  2     end;
           /* If a[1] or b[1] is == inf, distance is inf */
304  9     if (a[1] = inf) | (b[1] = inf) then
52   4     d := inf;
91   30    DTtop <- d; /* Send local distance to odd cell */
91   19    DTbot <- d; /* “above” and “below” */
229  12    tmp1 := Gbotold + 2*Dbot; /* Find minimum path */
139  6     tmp2 := g + d;
229  12    tmp3 := Gtopold + 2*Dtop;
132  10    if tmp1 < tmp2 then
54   4     min := tmp1
           else
52   4     min := tmp2;
132  8     if tmp3 < min then
54   4     min := tmp3;
132  6     if min < inf then /* If these are not infinite vectors, */
141  6     g := d + min /* compute g */
           else /* Otherwise set to zero for next time */
29   3     g := 0;
52   4     Gtopold := Gtop; /* Save current values for later use */
52   4     Gbotold := Gbot;
238  6     Gtop <- DTtop; /* Get new g values from odd cells */
474  6     Gbot <- DTbot;
91   30    DTtop <- g; /* Send g to odd cells */
91   19    DTbot <- g;
321  6     Dtop <- DTtop; /* Get new d values from odd cells */
464  6     Dbot <- DTbot;
2    11    end;
           end.
Figure 8.44 (Continued)
Table 8.8 Execution time summary for DTW programs d1 and d2.
                         d1                          d2
Operation                Time             Percent    Time       Percent
Moving a and b           3,891 µs         43%        152 µs     33%
Finding d                1,611 µs         18%        94 µs      20%
Finding g                1,560-1,674 µs   19%        76 µs      17%
Moving d and g           1,784 µs         20%        122 µs     27%
Timer Control            0                0%         >16 µs     3%
Waiting for the Switch                               146 µs     32%
Total Time               8,960 µs                    460 µs
Table 8.9 Execution time summary for DTW program d2 using 16 bits per 
coefficient.
Coefs. per Frame     4                             8
Operation            Time      Percent of Total    Time      Percent of Total
Moving a and b       304 µs    37%                 608 µs    43%
Finding d            296 µs    36%                 592 µs    42%
Finding g            76 µs     9%                  76 µs     5%
Moving d and g       122 µs    15%                 122 µs    9%
Timer Control        >16 µs    2%                  >16 µs    1%








b, d, and g), 386 µs is spent waiting for the switch. Therefore 27% of the loop 
time is idle waiting for data to move through the switch.
At least 16 µs are spent starting and stopping the internal timer. The 
timer keeps all the DTW cells executing synchronously by doing the following:
1) At the start of the main loop all DTW cells (even, odd, teven, todd, beven,
bodd, seven, and seq) start their timers at the same time.
2) All cells execute the instructions in their loop. Some cells may take longer
than others.
3) At the end of the loop all cells wait for the timer to reach a certain 
predetermined value. Since all cells start at the same time, and all cells wait for 
the same timer value, all cells will start the next loop at the same time. 
An alternative to using the timers is to pad all loops executed by the DTW 
cells with nops so they are the same length. This makes program development 
tedious since the programmer must change the code in every cell if the code in 
one cell is changed.
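The three timer steps above can be modelled with a short Python sketch. It is a simplified illustration, not the 8051 code: the function name, the dictionary of per-cell work times, and the example period are assumptions chosen to match the numbers quoted in this section.

```python
def run_synchronized(work_times_us, period_us, loops):
    """Return (starts, idle): loop start times and idle time per loop for each cell.

    Every cell starts its timer at the same instant, does its own amount of
    work, then waits until the timer reaches period_us before looping again.
    """
    starts, idle = {}, {}
    for name, work in work_times_us.items():
        if work > period_us:
            raise ValueError(f"{name} needs {work} us, more than the period")
        # Each loop begins exactly one period after the previous one.
        starts[name] = [loop * period_us for loop in range(loops)]
        # Time spent waiting for the timer at the end of each loop.
        idle[name] = period_us - work
    return starts, idle
```

Cells with different amounts of work still begin every loop together; the faster cell simply idles longer, which is why no nop padding is needed.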
Although even can complete a loop in 1,414 µs while processing 8 16-bit 
coefficients, seven requires an additional 30 µs to send data to the scores cell. 
Thus, the timer is set so one loop takes 1,445 µs. A typical speech system uses 
40 frames per utterance, giving 40 * 1,445 µs = 57.8 ms to match one unknown 
utterance to one known utterance. Table 8.10 summarizes the execution times 
for d1 and d2. d2 can match a vocabulary of 17 words in one second using 
eight 16-bit coefficients per frame. Multiple BACs can be used 
in parallel to process a larger vocabulary in real time.
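The vocabulary sizes quoted here follow directly from the loop times. A minimal Python sketch of that arithmetic is given below; the function name is hypothetical and the default of 40 frames per utterance is the figure used in the text.

```python
def comparisons_per_second(loop_us, frames=40):
    """Whole word comparisons that fit in one second for a given loop time."""
    match_us = loop_us * frames      # time for one complete DTW match
    return 1_000_000 // match_us     # only complete comparisons count
```

Calling it with the loop times 8,960, 14,462, 845, and 1,445 µs gives 2, 1, 29, and 17 comparisons per second.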
8.6.4. Summary
Two parallel programs to implement the BAC algorithm were presented. 
Program d1, written in xx, takes over 578 ms to perform one DTW match 
between two utterances of 40 frames each with 8 coefficients per frame. 
Program d2, written in assembly language, takes 57 ms to match the same two 
utterances. The following techniques were used to obtain this increase in 
speed.
1) Reducing the precision of the coefficients and distance scores.
2) Synchronizing all the DTW cells.










Coefficients                   4           8           4          8          4          8
Total Time for One Loop        8,960 µs    14,462 µs   845 µs     1,445 µs   510 µs     750 µs
Total Time for 40 Loops        358,400 µs  578,480 µs  33,800 µs  57,760 µs  20,400 µs  30,000 µs
Word-comparisons per Second    2           1           29         17         49         33
Changing the precision of the coefficients from 32-bit integers to 8 or 16- 
bit integers, and the distance scores from 32-bit integers to 16-bit integers 
reduces the inter-cell communication time because less data is passed between 
cells. This also reduces the computation time since the 8051 can perform the 
operations instead of sending them to the APU. In a real speech recognition 
system, however, 8 bits are not enough for the coefficients [MaGr74]. Instead, 
a typical system uses 16 bits [WBA83]. Therefore a 16-bit version of d2 would 
have to be used.
Synchronizing the cells was the second technique used to speed up the 
program. The order of arrival of data to the input queue for each cell is difficult 
to determine since each cell normally executes independently of the other cells. 
The hardware provides a tag for each item in the queue. The tag indicates the 
port from which the item came. The task of checking this tag and saving the 
data, if it is not from the desired port, is time consuming. The assembly 
language BAC program never checks the tag. Instead it controls when the data 
enters the switch so that the data arrives in the order it is needed. Controlling 
the arrival of data from several cells running different programs is difficult, so 
all cells are synchronized by using the 8051’s built-in timers. All cells enter 
their main loops at the same time. Each cell starts its own timer and will not 
restart the main loop until the given time has elapsed. Therefore when cell A, 
running even code, reads from its port, it knows that cell B, running odd code, 
has sent it some data.
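The tag checking that the synchronized program avoids can be pictured with the following Python sketch. It is an assumption-laden model of a shared, tagged input queue as described above; the function name, the `(port, value)` tuple layout, and the `stash` dictionary are hypothetical and are not taken from the Poker hardware.

```python
from collections import deque


def read_from_port(queue, stash, wanted_port):
    """Return the next value tagged with wanted_port, saving other arrivals.

    queue holds (port, value) items in arrival order; stash keeps values that
    arrived from other ports while we were looking for wanted_port.
    """
    # Serve a previously saved item first, if there is one.
    if stash.get(wanted_port):
        return stash[wanted_port].popleft()
    while queue:
        port, value = queue.popleft()
        if port == wanted_port:
            return value
        # Wrong tag: save the value for a later read of that port.
        stash.setdefault(port, deque()).append(value)
    return None  # nothing from wanted_port has arrived yet
```

Every read pays for the tag comparison and possibly a save; when arrival order is controlled by the timers, a cell can simply take the item at the head of the queue.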
Synchronization can be achieved without the use of timers. The 
programmer can carefully compute the execution time for the main loop in each cell 
and pad the cells which have the shortest execution times with nops so all cells 
have the same time. This makes the tedious task of assembly language 
programming even more tedious. The programmer must change the code in all 
cells if he changes the code in one cell. The 8051’s built-in timers are a great 
help to the programmer.
The DTW algorithms have required more inter-cell communication than 
the previous algorithms. This results in spending 51% of the loop time moving 
data between cells. Over half (27% of the total time) of this time is spent 
waiting for the switch. Using an output queue instead of an output latch could 
eliminate this time and allow the algorithm to run faster. Also, having
separate input queues for each port would eliminate the need to synchronize 
the cells, since there would be no confusion as to the arrival order of the input 
data.
Since most speech recognition systems use 16-bit coefficients, the processor 
must be able to perform 16-bit arithmetic. Although the 8051 is an 8-bit 
machine that can implement 16-bit arithmetic, a better solution would be to 
use a 16-bit processor. This would allow 16-bit coefficients to be processed 
without the overhead of implementing 16-bit arithmetic on an 8-bit machine. 
Likewise, a 16-bit wide data path between cells would reduce the inter-cell 
communication time.
8.7 VLSI Processor Array Isolated Word Recognition System
Previous sections have presented programs for performing various speech 
recognition tasks. The block diagram in Figure 4.1 shows a typical isolated 
word recognition system which uses some of these tasks. The parameters listed 
on it are for processing telephone quality speech. Table 8.11 lists parameters 
for telephone quality and high quality speech processing. The values listed 
under the Poker System (implemented) column are the values the system 
actually simulated. The values under the (possible) column are attainable by using 
the Poker system with minor changes in the programs.
This section shows how these programs are assembled together to perform 
the function of the speech recognition system shown in Figure 4.1. When 
combining VLSI processor array programs, the output data rate and format of one 
cell must match the input data rate and format of the cell to which it is 
attached. Figures 8.45, 8.46, and 8.47 show the switch settings, code names, 
and port names, respectively, for the entire system, which uses 51 cells. In the 
shaded area on the left are all the cells used to compute the autocorrelation 
coefficients, and the cells in the shaded area on the right perform the DTW. 
The following sections discuss the new programs and the changes made to the 
programs from the previous sections so that the system could function.
8.7.1. Input Cell
The input cell (1,1) has data from a real speech signal which it sends to 
the filter cell. Figure 8.48 shows the plot of part of the /a/ sound from the 
word “all” as spoken by a male speaker. This data is digitized and formatted 
for input to the assembler and the listing is shown in Figure 8.49. The 
program in Figure B.13 (called input) outputs the first sample as a 16-bit value in 
two’s complement notation. It sends the least significant byte (LSB) first; 16 
µs later it sends the most significant byte (MSB). 160 µs after sending the LSB








Parameter               Telephone    High        Poker System     Poker System
                        Quality      Quality     (implemented)    (possible)
Sample Rate             6.67 KHz     20 KHz      6.25 KHz         6.25 KHz
Bits per Sample         8            16          16               16
LPC Coefficients        8            16          4                8
Bits per Coefficient    16           16          8                16
Range of Vocabulary
Size (words)            10-1,000     10-1,000    49*              17*
* The number of words that can be matched in one second using 13 DTW cells.
Figure 8.45 Switch settings for word recognition system.
Figure 8.46. Code names for the word recognition system.
In a real system, the lpc cell will require six cells: one for the demux pro-











Figure 8.47. Port names for speech recognition system.
*Cells executing assembly language programs reference physical ports 






Figure 8.48. Plot of speech data output by the input cell.
This is a portion of the /a/ in the word “all” 
It is sampled at 10 KHz
dw -130, -120, -151, -91, 4, 75, 166, 277, 316, 283,
dw 205, 74, -66, -236, -341, -340, -331, -222, -73, 108,
dw 241, 302, 363, 276, 173, 41, -93, -147, -204, -144,
dw -38, 28, 166, 218, 249, 243, 151, 116, 12, -49,
dw -57, -67, -5, 43, 106, 180, 186, 209, 172, 121,
dw 71, -10, -22, -48, -28, 29, 74, 140, 184, 211,
dw 148, -150, -330, -519, -836, -821, -659, -455, -103, 254,
dw 554, 670, 614, 458, 187, -158, -484, -591, -649, -626,
dw -347, -125, 85, 288, 347, 394, 300, 170, 129, 5,
dw -69, -63, -76, -78, -100, -107, -107, -140, -115, -30,
dw 44, 132, 243, 300, 285, 218, 98, -42, -209, -329,
dw -351, -352, -260, -113, 53, 213, 282, 356, 291, 178,
dw 66, -87, -156, -203, -181, -67, -1, 132, 215, 240,
dw 253, 174, 115, 30, -58, -61, -77, -32, 42, 101,
dw 182, 213, 225, 215, 134, 76, 9, -60, -54, -52,
dw 10, 82, 170, 238, 21, -234, -353, -761, -951, -809,
dw -682, -347, 70, 467, 738, 736, 652, 389, 7, -417,
dw -670, -738, -800, -572, -222, 5, 306, 438, 480, 448,
dw 232, 130, 19, -143, -125, -115, -87, -53, -64, -39,
dw -80, -123, -60, -11, 56, 183, 280, 320, 293, 196,
dw o, 0, 0, o, o, o, o, o, 0, o,
dw o, 0, o, o, o, 0, o, 0, 0, 0,
■dw 0, 0, 0, 0, o, 0, o, 0, o, 0,
dw o, o, o, 0, o, 0, . o, 0, o, 0,
dw 0, 0, o, 0, o, 0, o, 0, o, o,
dw 0, 0, 0, o, o, o, 0,. 0, o, 0,
dw 0, 0, 0, o, 0, 0, o, 0, o, 0,
dw 0, ■ 0, ■; o, 0, o, 0, o, 0, 0. 0,
dw 0, 0, 0, 0, 0, 0, o, 0, 0, 0,
dw 0, ' 0, 0, o, o, o, o, 0, 0, 0,
Figure 8.49. Speech input data for word recognition system.
of the first sample, it is sending the LSB of the next sample. The maximum 
data rate is limited by how fast the broadcast tree can send the data to each 
autocorrelation cell. The 160 µs corresponds to a sampling rate of 6.25 KHz, 
which is too slow for telephone quality speech, but is the fastest the 
autocorrelation cells can receive data from the broadcast tree. See Section 8.3.6 for 
more details on the broadcast tree. The input cell uses the 8051’s built-in timer 
to time the delay between samples.
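The relationship between the 160 µs sample period and the 6.25 KHz rate, and the LSB-first byte order used by the input cell, can be sketched in Python. The function names are hypothetical; the sketch only mirrors the description above and is not the program of Figure B.13.

```python
def sample_rate_khz(period_us):
    """Sampling rate in KHz for a given time between samples in microseconds."""
    return 1000.0 / period_us


def to_bytes_lsb_first(sample):
    """Split a signed sample into (LSB, MSB) of its 16-bit two's complement form."""
    word = sample & 0xFFFF          # 16-bit two's complement encoding
    return word & 0xFF, word >> 8   # least significant byte is sent first
```

For the first sample in Figure 8.49, -130, the 16-bit word is 0xFF7E, so the cell would send 0x7E followed by 0xFF.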
8.7.2. Preemphasis Cell
Although Section 8.2 presented many filtering programs, none of them is 
used here. The transfer function of the preemphasis filter is H(z) = 1 - 0.95z^{-1}. It 
is simple enough for a single cell to perform. Although all the assembly 
language programs in Section 8.2 used unsigned data, the speech data coming 
from the input cell is signed data. The filter cell (5,2) uses signed data. The 
program is shown in Figure B.14. It takes 16-bit two’s complement data as 
input and produces filtered 16-bit sign magnitude data as output. The 8051 
uses two’s complement notation for its addition and subtraction. The 8051 has 
an unsigned 8-bit by 8-bit multiply, but no signed multiply. There are fewer 
conversions needed to multiply two sign magnitude numbers than to multiply 
two two’s complement numbers with an unsigned multiply. Therefore since the 
autocorrelation cells must use a multiplication, the filter cell converts its output 
to sign magnitude.
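A minimal Python sketch of the preemphasis filter and of the sign magnitude idea follows. It is illustrative only and is not the program of Figure B.14: the function names are hypothetical, the initial filter state of zero is an assumption, the output is left unrounded, and the 16-bit layout (bit 15 as the sign, bits 14 to 0 as the magnitude) is an assumed encoding.

```python
def preemphasis(samples, alpha=0.95):
    """Apply y[n] = x[n] - alpha * x[n-1], with x[-1] taken as 0 (assumption)."""
    out, prev = [], 0
    for x in samples:
        out.append(x - alpha * prev)
        prev = x
    return out


def to_sign_magnitude16(value):
    """Encode an integer as 16-bit sign magnitude (assumed layout: bit 15 = sign)."""
    magnitude = abs(value)
    if magnitude > 0x7FFF:
        raise ValueError("magnitude does not fit in 15 bits")
    return (0x8000 if value < 0 else 0) | magnitude


def sign_magnitude_multiply(x, y):
    """Multiply two sign magnitude words using only an unsigned multiply."""
    negative = (x ^ y) & 0x8000                 # signs differ -> negative product
    product = (x & 0x7FFF) * (y & 0x7FFF)       # unsigned multiply of magnitudes
    return -product if negative else product
```

The multiply needs only one XOR of the sign bits and one unsigned product, which is the saving in conversions that motivates the sign magnitude output.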
8.7.3. Autocorrelation Cells
The autocorrelation cells (1,3) - (8,3) run a program based on program a5 
in Figure B.11. The new autocorrelation program, auto, differs from a5 in that 
a5 uses unsigned data as input. The program used here takes 16-bit sign 
magnitude data as input and produces 32-bit two’s complement data as output. 
The program is shown in Figure B.15.
A typical speech recognition system uses 9 autocorrelation coefficients. 
auto computes 8 coefficients. This value is chosen since 8 cells fit into the 8 by 
8 grid of cells used by Poker. A 9th cell could be added, but it would decrease
the clarity of how the program functions because it could not be placed in the 
same vertical line with the other auto cells. Using 8 or 9 cells makes no 
difference in throughput.
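For reference, the quantity the auto cells produce can be sketched in Python as below. This is a serial, hedged illustration: the function name is hypothetical, no windowing is applied, and mapping one list entry to one auto cell (one lag per cell) is an assumption based on the eight-cell arrangement described above.

```python
def autocorrelation(frame, lags=8):
    """Return R(0)..R(lags-1), where R(k) = sum of frame[i] * frame[i+k]."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + k] for i in range(n - k))
        for k in range(lags)
    ]
```

Each lag is independent of the others, which is why adding a ninth cell for a ninth lag would not change the throughput.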
8.7.4. The Split, Merge, and Pipe Cells
The split and merge cells run the same code as shown in Figure B.11. 
They are used to broadcast data to and collect results from the auto cells. The 
split cells form a broadcast tree which sends the input data to all auto cells. 
The merge cells collect the autocorrelation coefficients from the auto cells into 
one data stream for input into the lpc cell.
The system uses the pipe cell (8,5) (see Figure 8.50) so the input buffer on 
the merge cell (5,4) will not overflow when cells (3,4) and (7,4) send their data 
(16 bytes from each cell) to cell (5,4) at the same time. The pipe cell delays 
the data from cell (7,4) so that cell (5,4) has time to empty its buffer before 
more data arrives. (This is because of a bug in the xx compiler. It is fixed in a 
later version of the compiler.)
Another function of the pipe cell is to discard some of the coefficients the 
auto cell produces. Since this system uses only four LPC coefficients per 
frame, the LPC cell uses only five autocorrelation coefficients as input. The 
pipe cell discards three out of every four values it receives (it is a leaky pipe) so 
that the extra coefficients will not reach the LPC cell.
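The discarding behaviour of the leaky pipe can be sketched in Python. The sketch assumes, from the pipe listing, that the cell reads `interlace` values and forwards only the first of each group; the function name and that reading of the listing are assumptions.

```python
def leaky_pipe(values, interlace=4):
    """Forward the first of every `interlace` values and discard the rest."""
    return [v for i, v in enumerate(values) if i % interlace == 0]
```

With the default of four, eight incoming values are reduced to two.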
8.7.5. The LPC Cell
The lpc cell (4,5) runs the code shown in Figure 8.51. This is the same 
program as Figure 8.29 except that a line (line 27) is added to send the energy 
of the frame (R(0)) to the endpoint program. The endpoint program uses this 
value to detect the endpoints.
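The computation performed by the lpc cell, Durbin's algorithm followed by sending the frame energy ahead of the coefficients, can be sketched in Python. This is a hedged restatement of the standard recursion rather than a transcription of Figure 8.51; the function name and the zero-based indexing of R are choices made for the example.

```python
def durbin(R, p):
    """Return (a, E): LPC coefficients a_1..a_p and the final prediction error.

    R holds the autocorrelation values R[0]..R[p]; R[0] is the frame energy
    that the lpc cell also passes on to the endpoint program.
    """
    a_old = [0.0] * (p + 1)          # index 0 is unused
    E = R[0]
    for i in range(1, p + 1):
        acc = sum(a_old[j] * R[i - j] for j in range(1, i))
        k = (R[i] - acc) / E         # reflection coefficient for order i
        a = a_old[:]
        a[i] = k
        for j in range(1, i):
            a[j] = a_old[j] - k * a_old[i - j]
        E = (1.0 - k * k) * E        # updated prediction error
        a_old = a
    return a_old[1:], E
```

For the autocorrelation sequence 1, 0.5, 0.25 the sketch returns the coefficients 0.5 and 0 with a prediction error of 0.75.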
This program computes four LPC coefficients while a typical speech 
recognition system would compute 8. Section 8.4.3 showed that the LPC program 
can be implemented in real time by using a demux, mux, and four lpc cells. 
The single cell LPC program is used here since the system being simulated uses 







pipe
Machine: VLSI processor array, simulated by Poker
Function: This routine reads four values from the top port
The other three values are discarded.
Its main function is to delay the data entering the 
middle merge cell.
Precision: Input: 32-bit integer 
Output: 32-bit integer
Number of PEs: 1
Parameters: interlace, the number of values to read from 
top port before writing the first value to 
out port.



















13 for i := 1 to interlace
14 begin
15 tmp <- top;
16 tophold[i] :=
17 end;
18 out <- tophold[1];
19 end;
20 end.





Machine: VLSI processor array, simulated by Poker
Function: Find LPC coefficients using Durbin’s algorithm
Precision: Input: 32-bit floating point
Output: 32-bit floating point
Number of PEs: 1
Parameters: p, the number of coefficients computed.
Input: Autocorrelation coefficients arrive at
“in” port
Output: Energy (R [0]) is sent out “out” 
port followed by p LPC coefficients
Loop Time: Does not apply
Typical Time: 42,130 µs for p=8
*/
1 code lpc;




6 int itmp,in;
7 real a[10], /* LPC coefficients */
8 aold[10], /* old LPC coefficients */
9 E, /* Prediction error */
10 k,
11 out, /* output port */





17 while true do
18 begin
19 for i := 0 to p do /* Read in autocorrelation coefs */














itmp <- in; 
k := itmp;
R[i+1] := k; /* All R[] indexes are +1 since */ 
/* xx indexes start at 1 */
end;
E := R[1]; 
out <- E; /* Send R[1] to endpoint routine */
for i := 1 to p do
begin
k := 0;





















for j := 1 to i-1 do
k := k + aold[j] * R[i-j+1]; 
k := (R[i+1] - k) / E; 
tmp := k*k; E := (1 - tmp) * E; 
a[i] := k;
for j := 1 to i-1 do
a[j] := aold[j] - k * aold[i-j]; 
for j := 1 to i do
aold[j] := a[j];
end;
for i := 1 to p do /* Send out lpc coefs starting with at 
begin






Cell (5,5) executes the endpoint code given in Figure 8.52. The program 
finds the endpoints based on the energy in each frame as discussed in Section 
4.5. The endpoint program receives the energy of the current frame followed by 
p LPC coefficients. If the energy is greater than the low threshold lothresh the 
p LPC coefficients are sent to the LTW cell. If the energy is less than lothresh 
and some previous frame exceeded hithresh, the value 10001 is sent to the 
LTW cell. This signals the LTW program to start processing. If the energy 
does not exceed hithresh the program sends the value 10000 to the LTW cell 
to tell it to discard all the data received since the last 10001 value.
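The decision logic of the endpoint program can be sketched in Python as follows. It follows the structure of the listing in Figure 8.52 but is only an illustration: the function name, the representation of each frame as an (energy, coefficients) pair, and returning the output stream as a list are assumptions.

```python
WORD_SPOTTED = 10001.0   # marker sent when a word has ended
NO_WORD = 10000.0        # marker sent when the data should be discarded


def endpoint(frames, lothresh, hithresh):
    """Return the stream sent to the ltw cell for a sequence of frames."""
    out, found = [], False
    for energy, coefs in frames:
        if energy >= lothresh:
            if energy >= hithresh:
                found = True          # energy high enough to be a real word
            out.extend(coefs)         # pass the frame on to ltw
        elif found:
            out.append(WORD_SPOTTED)  # word ended: tell ltw to start processing
            found = False
        else:
            out.append(NO_WORD)       # never reached hithresh: discard the data
    return out
```

Frames below lothresh never reach the ltw cell; only the marker values do, so the ltw program can tell a completed word from low-level noise.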
8.7.7. Linear Time Warping
Cell (6,5) executes the linear time warping program given in Figure 8.37. 
No changes are made to the program.
8.7.8. Dynamic Time Warping
The cells executing the DTW programs are identical to those discussed in 
Section 8.6. No changes are made to the program.
8.7.9. Summary
A number of the parallel speech processing programs were combined to 
form a speech recognition system. Since most speech data is signed, the filter 
and auto programs needed major changes so that they could process signed 
speech data. The other programs needed little or no modification to run on the 
system. Table 8.11 (Section 8.7) summarizes the parameters of the system 
simulated on Poker. The system is unable to process telephone quality speech 
because its maximum sample rate is 6.25 KHz where 6.67 KHz is needed. Also, 
it uses only four LPC coefficients of 8 bits each when 8 coefficients of 16 bits 
each are needed. The conclusion section of this chapter discusses the changes 
that could be made so the VLSI processor array speech recognition system can 



















lothresh, /* low threshold */
hithresh, /* high threshold */
in,out;
p := 4; /* Number of coefficients per frame */
lothresh := 1000000.0; 
hithresh := 2000000.0; 
found := false;
while true do 
begin
energy <- in;
if energy >= lothresh then 
begin
if energy >= hithresh then 
found := true;



























Section 4.5 without zero crossing rate
VLSI processor array, simulated by Poker
This routine does endpoint detection by
looking at the value of R(0) out of the
autocorrelation routine via the lpc routine.
If it is big enough, the following p
coefficients are passed on to the ltw routine.
Input: 32-bit floating point
Output: 32-bit floating point
p: the number of LPC coefficients computed
Energy (R(0)) arrives at the “in”
port followed by p LPC coefficients.
p LPC coefficients are sent out the “out”
port if the energy is greater than “lothresh”.
The value 10001 is sent if a word is spotted.
The value 10000 is sent if the energy drops
below “lothresh” before
going above “hithresh”.
      for i := 1 to p do /* Send frame to ltw */
      begin
        tmp <- in;
        out <- tmp;
      end;
  end
  else
  begin
    if found then
    begin /* A word has been spotted */
      out <- 10001.0;
      found := false;
    end
    else
      out <- 10000.0; /* No word,







for i := 1 to p do /* Dummy read word */





This chapter has presented several parallel programs for speech processing. 
The previous section showed how some of these programs could be combined 
into a parallel word recognition system. The goal was for this system to pro­
cess high quality speech, as defined in Table 8.11, in real time. As Table 8.11 
shows, the Poker system did not reach this goal for two of the parameters. It 
can process speech at a rate of 6.25 KHz, not at the rate of 20 KHz as desired 
and it uses 8-bit coefficients, not the 16-bit coefficients needed for high quality 
speech processing. The following sections discuss the VLSI processor array and 
give details as to which features it should have for it to process speech signals 
in real time.
8.8.1. The Processor
Poker emulated each cell as an Intel 8051 8-bit microprocessor. The fol­
lowing sections discuss the desirable properties of a VLSI processor array 
microprocessor.
8.8.1.1. Data Size and Type — 16-bit signed fixed point
Most speech data can be represented as a 16-bit signed integer; therefore, 
the processor should operate on 16-bit data. The autocorrelation, LPC, and 
LTW routines used some 32-bit values, so 32-bit addition should also be 
implemented. The LPC and LTW routines used the Intel 8231 APU for 
floating-point operations, but they could have been implemented using only 
fixed-point arithmetic. Adding floating-point operations would make writing 
some of the programs easier, but it did not make the LPC or LTW programs 
execute faster. If the APU is to decrease the execution time, the microprocessor 
must be able to get data to it quickly and it must be able to perform its 
operations in less time than the 42 µs the 8231 needs for a floating-point multiply.
8.8.1.2. Internal Registers
The 8051 has 128 bytes of internal RAM. The internal RAM has the 
same access time as the 8051’s data registers for most instructions. This 
internal RAM can be used as 64 16-bit registers. Having many registers available is 
good since programs like the BAC can store all of their variables in the 
register-like memory and not have to use the external memory, which is much 
slower to access.
8.8.1.3. Memory Size — 2K bytes
Table 8.12 summarizes the memory requirements for each of the programs 
in the speech recognition system. The LTW program used the most memory 
with 1,280 bytes. The input cell does not include the storage needed for the 
input data. Most likely, the input data would come from an analog to digital 
converter and not memory. Also, the seq cell memory usage does not include 
the memory needed to store the known templates. A typical system would use 
40 frames per utterance, 8 coefficients per frame, and 16-bits per coefficient. 
This is a total of 640 bytes per utterance. Therefore, if there are more than 
three words in the vocabulary, the seq cell would use more memory than any 
of the other cells. For a 100 word vocabulary the memory requirements would 
be 128K bytes if each word used 40 frames of 16 coefficients and 16 bits per 
coefficient.
Excluding the storage used by the seq cell to store known templates, each 
cell could operate using 2 K bytes of memory. The seq cell may have to be a 
special cell with extra memory to hold all the templates.
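The template storage figures quoted above follow from a single product. The short Python sketch below reproduces that arithmetic; the function name `template_bytes` is hypothetical and only serves the example.

```python
def template_bytes(frames, coeffs_per_frame, bits_per_coeff):
    """Bytes needed to store one known utterance (template)."""
    return frames * coeffs_per_frame * bits_per_coeff // 8


# Typical system: 40 frames, 8 coefficients per frame, 16 bits per coefficient.
per_word_typical = template_bytes(40, 8, 16)

# Larger case from the text: 40 frames of 16 coefficients, 16 bits each,
# for a 100 word vocabulary.
per_word_large = template_bytes(40, 16, 16)
vocabulary_100 = 100 * per_word_large

print(per_word_typical, per_word_large, vocabulary_100)
```

The product for the 100 word case is 128,000 bytes, which the text rounds to 128K bytes. That is far more than the 2K bytes suggested for an ordinary cell, which is why the seq cell may need to be a special cell.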
Table 8.12 Memory usage, in bytes, for SIMD based isolated word recognition
system.

Program    Language   Memory Usage (bytes)
input      8051        101 *
filter     8051        151
sink       8051         12
split      8051        136
auto       8051        136
merge      XX          450
lpc        XX          848
demux      XX          348
mux        XX          286
endpoint   XX          385
ltw        XX         1280
repeat     8051        211
pipe       8051        142
seq        8051        177 †
even       8051        436
teven      8051        445

* Does not include storage for input data.
† Does not include storage for known templates.
8.8.2. Inter-PE Communications
The following sections discuss features the inter-cell communication should 
have.
8.8.2.1. The Broadcast
The VLSI processor array needs to implement a general broadcast that 
allows one port to broadcast to many ports with one write instruction. Using 
such a broadcast eliminates the need for the broadcast tree (the split cells) in 
Figure 8.45. With a broadcast, the data arrives at the input ports of the auto 
cells simultaneously, thus allowing the auto cells to process their data at a 
sampling rate over 10 KHz, a 60% increase in throughput. Before, the 
autocorrelation cell would run at 6.25 KHz so the data would have time to travel 
through the broadcast tree before the next sample arrived.
If a general broadcast is not possible, broadcasting from one port to two 
ports would be an improvement. This type of broadcast allows the data to 
propagate through the broadcast tree and arrive at the auto cells at the same 
time. Again, the auto cells could process data at over 10 KHz.
The difference between a two port broadcast and the general broadcast is 
that the two port broadcast would have a longer delay between the arrival of the 
last sample of the frame and the arrival of the autocorrelation coefficients at 
the merge cell. This is because it takes time for the data to travel through the 
broadcast tree.
8.8.2.2. The I/O Buffer
Using an output queue to replace the output latch which is between the 
8051 and the switch would simplify programming the 8051 in assembly 
language and decrease the execution time. The dtw cells spend 32% of their 
total execution time waiting for the switch to read the output latch. Most of 
this wasted time would be eliminated by using a queue.
8.8.3. Number of Cells— 51
Table 8.13 summarizes the number of cells used by each program in the 
parallel word recognition system. By using 51 cells, the 8051 based VLSI 
processor array is able to process speech in real time, sampling at 6.25 KHz, using 
four 8-bit coefficients per frame, and recognizing a 17 word vocabulary in 1 
second. The demux/mux approach that was used by the LPC program could 
be used on the DTW program so that a 100 word vocabulary could be recognized 
in real time if 5 additional copies of the DTW array were used in parallel. Such a 
system would use a total of 51 + 5*15 = 126 cells. The scores cell could be changed 
to collect the distance scores from all the DTW arrays.
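The cell count for a larger vocabulary can be checked with a small calculation. The sketch below assumes, as the expression 51 + 5*15 suggests, that the 51 cell system already contains one 15 cell DTW array handling 17 words per second, and that every further array adds 15 cells. The function name `cells_needed` is hypothetical, and any extra demux/mux cells are ignored.

```python
import math


def cells_needed(vocabulary, words_per_array=17, base_cells=51, dtw_array_cells=15):
    """Total cells when the DTW array is replicated to cover the vocabulary."""
    arrays = math.ceil(vocabulary / words_per_array)  # DTW arrays running in parallel
    extra_arrays = max(arrays - 1, 0)                 # one array is already in the base system
    return base_cells + extra_arrays * dtw_array_cells


print(cells_needed(17))   # the simulated system
print(cells_needed(100))  # six arrays in total, covering up to 102 words
```

Under these assumptions a 100 word vocabulary needs six DTW arrays in all, which matches the 126 cells given in the text.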
The demux/mux approach could also be used to improve the throughput 
of the autocorrelation program, but a better approach would be to use a more 
powerful cell so the program could run faster.
8.8.4. Changing the Word Recognition System Parameters
The following sections discuss the effects of altering the system parameters 
on the processing throughput.
8.8.4.1. Changing the LPC Frame Size
Changing the LPC frame size will not change the number of cells used by 
the autocorrelation program or the throughput. Changing the frame size will 
only change how often the autocorrelation coefficients are output.
8.8.4.2. Changing the Number of LPC Coefficients
Increasing the number of LPC coefficients will not change the execution 
time of the autocorrelation program; however, the autocorrelation program 
would have to use more cells since it uses one cell per coefficient. Increasing
the number of LPC coefficients will increase the execution time of the lpc cell.
The DTW array consists of the repeat, seq, and all the even and odd cells.
Table 8.13 Number of cells used by the VLSI processor array parallel speech 
recognition system.
Function Type Number
































More lpc cells may have to be added to process in real time, and the demux 
and mux cells will have to be changed to distribute the autocorrelation 
coefficients to more lpc cells.
8.8.4.3. Changing the Number of Frames per Utterance
The proposed system assumed that I = 40 frames per utterance were output 
from the LTW and processed by the DTW program. As with the SIMD 
machine algorithms, the LTW and DTW execution times are proportional to I, 
so increasing I will increase the LTW and DTW processing times. Decreasing I, 
on the other hand, will shorten the LTW and DTW execution times.
If the LTW time is increased to greater than 500 ms, the demux/mux 
method used for the LPC program may have to be used to increase the 
throughput.
8.8.4.4. Changing the Vocabulary Size
As with the SIMD machine, the DTW program is the only program whose 
execution time depends on the vocabulary size. Increasing the vocabulary size 
will require the replication of the cells used for the DTW array and using the 
demux/mux scheme that was used for the LPC program. The scores cell could 
be changed to collect the distance scores from each DTW array and find the 
minimum score.
8.8.5. Summary
The VLSI processor array, as simulated by Poker, is not able to process 
telephone quality speech in real time. This inability to process speech in real 
time is not caused by the VLSI processor array architecture, but by the system 
that simulated it. The Poker system uses an 8-bit microprocessor in each cell. 
As Chapter 7 showed, a 16-bit processor is more suited for speech processing 
since most intermediate speech data is 16 bits.
Poker’s inter-cell communications are handled by the switch which can 
poll a cell only once every 12 µs. This slow inter-cell communication rate
combined with a single input queue and an output latch required the 8051 to 
use up to 30% of its processing time waiting on the switch. If circuit switched 
inter-cell communication is used, the time the processor uses to service the I/O 
queues would be reduced.
If the VLSI processor array uses a 16-bit processor in each cell and has 
fast inter-cell communications, it should be able to recognize isolated words in 
real time.
9. CONNECTED WORD RECOGNITION
The purpose of this work is to improve the man/machine interface 
through the use of speech recognition. The idea is that communication 
between man and machine will improve if the machine can communicate using 
man’s common method of communication (spoken words) rather than have 
man use the machine’s method (terminal). The previous chapters have dis­
cussed using an isolated word recognition system which allows the computer to 
recognize words spoken with short (100 ms) pauses between them. Although 
isolated word recognition allows man to talk to a machine in a more natural 
manner, natural speech does not contain pauses between every word. Con­
nected word recognition is an extension of isolated word recognition that allows 
several (typically less than six) words to be spoken together without pauses 
between them. An isolated word recognizer can be extended to recognize con­
nected words by changing the DTW algorithm. Section 9.1 describes a level 
building dynamic time warping algorithm for connected word recognition 
[MyRi81a]. Section 9.2 presents a parallel DTW algorithm for connected word 
recognition.
9.1. A Level Building Dynamic Time Warping Algorithm
Myers and Rabiner have presented a thorough description of a general 
DTW algorithm for connected word recognition [MyRi81a,MyRi81b]. The 
algorithm presented here is from [MyRi81a]. We are given an unknown test 
pattern T(m) for 1 ≤ m ≤ M where each T(m) is a frame of speech, and M is
the total number of frames in the pattern*. T(m) contains L utterances where 
1 ≤ L ≤ Lmax. The purpose of the DTW is to find which known utterances 
Rv are contained in pattern T(m), where 1 ≤ v ≤ V and V is the vocabulary 
size. This is done by making a “super” reference pattern Rs by concatenating 
L reference patterns, i.e.,
$R^s = R_{q(1)} R_{q(2)} R_{q(3)} \cdots R_{q(L)}$
where q(n) for 1 ≤ n ≤ Lmax selects which reference pattern to use in each 
position. The same DTW algorithm as used for isolated word recognition can 
then compare the test pattern T(m) to each of the super reference patterns Rs 
as shown in Figure 9.1.
This is a computationally intense operation since there are many super 
reference patterns. If V = 10 and Lmax = 5, there are 
10 + 10^2 + ... + 10^5 = 111,110 super patterns.
Myers’ solution is a level building approach. Figure 9.2 shows graphically the 
computations used for the non-level building approach. The vertical lines show 
the order in which the distances are computed. The computation starts at the 
bottom of the leftmost vertical line, and proceeds up the line. After the first 
line is complete, the next line starts at the bottom and continues up, and so 
on. The warping path is restricted to the trapezoidal shaped region so that the 
warping path does not try to compare the end of the super reference pattern to 
the beginning of the test pattern.
Figure 9.3 shows the level building approach. The computation is as 
before, moving up from the bottom following the vertical lines. The difference 
is that the computations are done in levels. The lowest row of heavy dots 
represents the first level. Figure 9.2 follows the vertical lines up from the 
bottom comparing a given frame of T(m) to the first utterance in the super 
reference pattern, then the second utterance, and so on. Figure 9.3 starts at the 
bottom and compares frame m = 1 of T(m) to the reference pattern Rv(n), but 
stops at the first level (row of dots). It records the accumulated distance and 
starts processing back at the bottom with frame T(m + 1) and Rv(n). This 
continues until accumulated distance scores are found for all the dots on level one.
*Previous chapters called the unknown pattern an utterance. Here the unknown pattern 
may consist of many utterances.
Figure 9.1. Illustration of dynamic warping alignment between test pattern T 
and super reference pattern Rs.








At this point, the next reference pattern, Rv+1(n), is compared to T(m) 
starting at m = 1. This continues until all reference patterns, Rv for 
1 ≤ v ≤ V, are compared to T(m) starting at m = 1. Each dot on Figure 9.3 
has one accumulated distance score for each Rv. The minimum distance for 
each dot is saved and used as initial conditions for the next level of DTWing.
The algorithm in Figure 9.4 outlines the level building process. Table 9.1 
shows the translation from the symbols used in [MyRi81a] to those used in 
Figure 9.4. Line 1 sets the accumulated distance for frame zero of level zero to 
zero. Lines 2 and 3 set the accumulated distance for all other frames on level 
zero to infinity. Lines 1-3 constrain the starting endpoint to the start of the 
super reference pattern and the start of the test pattern. Lines 5 and 6 set the 
accumulated distance of frame zero to infinity on all levels. Line 8 repeats lines 
9-27 for each possible number of utterances in the test pattern. Line 9 repeats 
lines 10-18 for each utterance in the vocabulary. Lines 11 and 12 copy the best 
accumulated distance scores from the previous level to be used as initial 
conditions on the current level. For l = 1, the previous level was set on lines 1-3 to 
allow only a path from the beginning of both the test and reference patterns. 
Lines 14-16 compute the accumulated distance in the same manner the isolated 
word DTW does. Line 14 selects which vertical line in Figure 9.3 to follow and 
line 15 selects the position on the line. Line 17 saves the accumulated distance 
at the locations with the dots in Figure 9.3. Lines 11-17 are repeated for each 
pattern in the vocabulary and the variable DT saves the accumulated distances 
for each pattern. Lines 20-22 initialize DTB and W to infinity for the current 
level. DTB is the minimum value of DT over all possible super reference 
patterns and W is the index of the minimum reference pattern. Then the shortest 
distance for each dot in Figure 9.3 is found by lines 24-27. The best distances 
and the index of the word giving that distance are saved in DTB and W 
respectively. Lines 9-27 are repeated for each level. Lines 29-34 find the level 
with the smallest distance and set D to the distance.
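As a rough companion to the walk-through above, the level building recursion can be sketched in Python. This is a minimal sketch under several assumptions that are not in the original: frames are scalars, the local distance is an absolute difference, the boundary functions L(l,m) and U(l,m) are replaced by the full range 1..Nv, word indices are 0-based, and backtracking is omitted. The variable names DTB, W, D, and Nv follow the pseudocode; the function name is hypothetical.

```python
import math

INF = math.inf


def level_building_dtw(test, refs, lmax, lmin=1):
    """Serial level building DTW without backtracking.

    test: list of scalar frames T(1..M); refs: list of reference patterns,
    each a list of scalar frames. Returns (D, L, DTB, W).
    """
    M = len(test)

    def d(v, m, n):
        # Local distance between test frame m and frame n of reference v
        # (1-based indices); absolute difference is an illustrative choice.
        return abs(test[m - 1] - refs[v][n - 1])

    DTB = [[INF] * (M + 1) for _ in range(lmax + 1)]
    W = [[None] * (M + 1) for _ in range(lmax + 1)]
    DTB[0][0] = 0.0  # paths may only start at frame zero of level zero

    for l in range(1, lmax + 1):            # for each level
        for v, ref in enumerate(refs):      # for each vocabulary word
            Nv = len(ref)
            D = [[INF] * (Nv + 1) for _ in range(M + 1)]
            for m in range(M + 1):          # initial conditions from previous level
                D[m][0] = DTB[l - 1][m]
            for m in range(1, M + 1):
                for n in range(1, Nv + 1):
                    cands = [D[m - 1][n - 1] + d(v, m, n)]
                    if n >= 2:
                        cands.append(D[m - 1][n - 2] + 2 * d(v, m, n - 1))
                    if m >= 2:
                        cands.append(D[m - 2][n - 1] + 2 * d(v, m - 1, n))
                    D[m][n] = d(v, m, n) + min(cands)
                # Keep the best score over all words at the end of this level.
                if D[m][Nv] < DTB[l][m]:
                    DTB[l][m] = D[m][Nv]
                    W[l][m] = v

    best_level, best = None, INF
    for l in range(lmin, lmax + 1):         # level with the smallest distance
        if DTB[l][M] < best:
            best_level, best = l, DTB[l][M]
    return best, best_level, DTB, W


refs = [[1, 2, 3, 4], [9, 8, 7, 6]]
test = [1, 2, 3, 4, 9, 8, 7, 6]             # word 0 followed by word 1
dist, level, DTB, W = level_building_dtw(test, refs, lmax=3)
print(dist, level)
```

In this sketch the per-word array DT is folded into the running minimum DTB, which gives the same result as storing DT(v,m) and taking the minimum afterwards.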
Myers presents an algorithm with backtracking, so after finding D the 
reference patterns that composed the pattern can be found. Although back­
tracking is omitted here because it tends to obscure the function of the algo­






















DTB(0,0) ← 0;                        /* Constrain starting point to */
FOR m ← 1 TO M                       /* 1st frame of ref. pattern */
    DTB(0,m) ← ∞;                    /* and 1st frame of test patt. */

FOR l ← 1 TO LMAX                    /* Accumulated distance scores = ∞ */
    DTB(l,0) ← ∞;                    /* on all levels */

FOR l ← 1 TO LMAX                    /* For each level */
    FOR v ← 1 TO V                   /* For each vocabulary word */

        FOR m ← 1 TO M               /* Set initial conditions of */
            D(m,0) ← DTB(l-1,m);     /* current level to accumulated */
                                     /* distances of previous level */
        FOR m ← 1 TO M               /* Perform DTW as in isolated word system */
            FOR n ← L(l,m) TO U(l,m)
                D(m,n) ← d(v,m,n) + min{ D(m-1,n-2) + 2d(v,m,n-1),
                                         D(m-1,n-1) + d(v,m,n),
                                         D(m-2,n-1) + 2d(v,m-1,n) };
            DT(v,m) ← D(m,Nv);       /* Save accumulated dist. for word v */

    FOR m ← 1 TO M                   /* Find minimum accumulated distance for */
        DTB(l,m) ← ∞;                /* each dot. */
        W(l,m) ← ∞;
        FOR v ← 1 TO V               /* For each vocab. word */
            IF( DT(v,m) < DTB(l,m) ) /* If smaller, save */
                DTB(l,m) ← DT(v,m);  /* distance, and */
                W(l,m) ← v;          /* index to word too */

L ← ∞;
D ← ∞;
FOR l ← LMIN TO LMAX                 /* For all possible levels find */
    IF( DTB(l,M) < D )               /* shortest path */
        L ← l;
        D ← DTB(L,M);
Figure 9.4. Algorithm for serial level building DTW.
Table 9.1 Variable name translations for connected word algorithm.

[MyRi81b]   Algorithm   Description
T(m)                    Test pattern.
M                       Length (in frames) of test pattern.
Rv(n)                   Reference pattern v.
V                       Number of reference words.
Rs                      Super reference pattern consisting of a sequence of
                        concatenated reference patterns.
Nv          Nv          Length (in frames) of vth reference pattern.
L                       Number of reference patterns in a string.
D           D           Global distance between test pattern and super
                        reference pattern.
D_l(m,n)    D(m,n)      Accumulated distance to frame m of the test pattern,
                        and frame n of the lth reference of the super
                        reference pattern.
            DT(v,m)     Accumulated distance to frame m of the test pattern,
                        and the last frame of the lth reference of the super
                        reference pattern for reference pattern v.
d_l(m,n)    d(v,m,n)    Local distance between the mth frame of the test
                        pattern, and the nth frame of the lth reference of the
                        super reference pattern.
L_l(m)      L(l,m)      Modified lower boundary function for the lth level.
U_l(m)      U(l,m)      Modified upper boundary function for the lth level.
Lmax        LMAX        Maximum number of references in a super reference
                        pattern.
Lmin        LMIN        Minimum number of references in a super reference
                        pattern.
D_l^B(m)    DTB(l,m)    Minimum value of D_l(m) over all possible super
                        reference patterns of length l.
W_l(m)                  The index v of the reference pattern Rv that gives
                        D_l^B(m).
9.2. An SIMD Level Building DTW Algorithm
The previous section presented a serial level building DTW algorithm. 
This section shows how it can be implemented on an SIMD machine.
The parallel level building DTW algorithm starts with the serial parallel 
(SP) algorithm discussed in Section 6.4.1.1. The SP algorithm uses one PE for 
every utterance in the vocabulary. The unknown utterance is broadcast to all 
PEs and each PE executes a serial DTW program. Figure 9.5 shows the paral­
lel version of Figure 9.4. Only a few changes are needed. The unknown pat­
tern is broadcast to all PEs, and each PE does the level building warp similar 
to the serial program. After the accumulated distances are computed for each 
level, recursive doubling is used to find the minimum distance for each dot over 
all the vocabulary words. The minimum distance is stored in all PEs.
Line 9 of the serial algorithm is missing since all the vocabulary words are 
done in parallel. Lines 24-27 of Figure 9.4 are changed since the accumulated 
distances are spread across the PEs. Lines 24-36 of Figure 9.5 use recursive 
doubling to find the utterance with the smallest distance. The arrays DTB 
and W contain the same values in all PEs, therefore lines 39-44 of Figure 9.5 
are the same as lines 32-37 of Figure 9.4.
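The recursive doubling step can be simulated serially to see why every PE ends up with the same minimum. In the sketch below, lists indexed by PE address stand in for the PEs, and the Cube(i) transfer is modeled as an exchange with the PE whose address differs in bit i. The function name and the requirement that the number of PEs be a power of two are assumptions of this example, not statements about the SIMD machine.

```python
def recursive_doubling_min(dt_values):
    """Simulate the minimum search across PEs; returns (DT, v) per PE.

    dt_values[addr] is the accumulated distance held by PE addr.
    """
    n = len(dt_values)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two number of PEs")
    dt = list(dt_values)
    v = [addr + 1 for addr in range(n)]      # v <- ADDR + 1
    for i in range(n.bit_length() - 1):      # i = 0 .. log2(n) - 1
        # Cube(i): every PE receives from the PE whose address differs in bit i.
        recv_dt = [dt[addr ^ (1 << i)] for addr in range(n)]
        recv_v = [v[addr ^ (1 << i)] for addr in range(n)]
        for addr in range(n):                # WHERE DT' < DT
            if recv_dt[addr] < dt[addr]:
                dt[addr] = recv_dt[addr]
                v[addr] = recv_v[addr]
    return dt, v


dt, v = recursive_doubling_min([7.0, 3.0, 9.0, 4.0, 8.0, 1.5, 6.0, 5.0])
print(dt[0], v[0])
```

After log2(n) exchanges each simulated PE holds the smallest distance and the index of the word that produced it. With tied distances the strict comparison could leave different indices in different PEs, so a real implementation would need a tie-breaking rule.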
Table 9.2 gives some computational comparisons between the serial and 
parallel level building DTW algorithms. The serial column is from [MyRi81a]*. 
N is the average reference pattern length in frames and M is the frame length 
of the test pattern. NM/3 is the average number of distances at each level. 
This is shown by the shaded area in Figure 9.2. Table 9.3 gives typical
computational requirements for LMAX=5, V=10, M=120, and N=35.
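The quantities named in this paragraph can be worked out directly from the loop structure of the two algorithms. The sketch below is an estimate built on two assumptions: the serial algorithm performs one basic time warp per vocabulary word per level, and in the parallel algorithm each PE performs one basic time warp per level. The function name and dictionary keys are hypothetical, and the numbers are derived here rather than copied from the tables.

```python
def level_building_counts(lmax, v, m, n):
    """Rough operation counts for serial and parallel level building DTW."""
    warp_size = n * m / 3                 # average distances per basic time warp
    return {
        "warp_size": warp_size,
        "serial_warps": lmax * v,         # every word on every level
        "parallel_warps_per_pe": lmax,    # one word per PE, every level
        "serial_distances": lmax * v * warp_size,
        "parallel_distances_per_pe": lmax * warp_size,
    }


counts = level_building_counts(lmax=5, v=10, m=120, n=35)
for name, value in counts.items():
    print(name, value)
```

For these parameters each basic time warp covers about 1,400 local distances, and under the stated assumptions the parallel version leaves each PE with one tenth of the serial work.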














































DTB(0,0) ← 0;                        /* Constrain starting point to first frame */
FOR m ← 1 TO M                       /* of test pattern and first frame of */
    DTB(0,m) ← ∞;                    /* reference pattern. */

FOR l ← 1 TO LMAX                    /* Accumulated distance scores = ∞ */
    DTB(l,0) ← ∞;                    /* on all levels */

FOR l ← 1 TO LMAX                    /* For each level */
                                     /* In all PEs */
    FOR m ← 1 TO M                   /* Set initial conditions for current */
        D(m,0) ← DTB(l-1,m);         /* level to acc. dists. on previous */
                                     /* level. */
    FOR m ← 1 TO M                   /* Perform DTW as in isolated word */
        FOR n ← L(l,m) TO U(l,m)     /* system. */
            D(m,n) ← d(m,n) + min{ D(m-1,n-2) + 2d(m,n-1),
                                   D(m-1,n-1) + d(m,n),
                                   D(m-2,n-1) + 2d(m-1,n) };
        DT(m) ← D(m,Nv);             /* Save accumulated dist. for word v */

    FOR m ← 1 TO M                   /* Use recursive doubling to find */
        DTB(l,m) ← ∞;                /* minimum dist. for each dot */
        W(l,m) ← ∞;
        DT ← DT(m);                  /* Store current frame’s dist. in DT */
        v ← ADDR + 1;                /* and index in v */
        FOR i ← 0 TO ⌈log2 V⌉ - 1
            USE Cube(i);             /* Send to another PE */
            TRANSFER DT to DT';
            TRANSFER v to v';
            WHERE DT' < DT
                DT ← DT';            /* Put smallest value in DT */
                v ← v';
            END WHERE
        DTB(l,m) ← DT;
        W(l,m) ← v;

L ← ∞;
D ← ∞;
FOR l ← LMIN TO LMAX                 /* Find smallest dist. for each */
    IF( DTB(l,M) < D )               /* level. Done serially in all */
        L ← l;                       /* PEs with the same data. */
        D ← DTB(L,M);




Number of Basic Time Warps
Size of Time Warps





Table 9.3 Comparison of serial and parallel level building DTW algorithms. 
Counts are for LMAX=5, N=35, V=10, and M=120.
Serial Parallel
Number of Basic Time Warps
Size of Time Warps




9.3. A VLSI Processor Array DTW Algorithm
The level building DTW algorithm can also be implemented on the BAC. 
Figure 9.6 gives the instructions that are executed by the array in Figure 6.21. 
The level building algorithm differs from the isolated word DTW in that the 
infinity vectors that separated utterances are used differently. Previously, all 
elements of the infinity vectors were infinity values. Now the second and third 
elements of the infinity vectors instruct the cell how to initialize its variables. 
The algorithm works as follows. An infinity vector enters the top and bottom 
of the array before any data is entered. The second element of the a vector 
(a[1]) is set to NEWUNKNOWN. This instructs the cells to set ginit, g, and 
gmin to infinity. ginit is the initial value of g on the current level, gmin is the 
minimum accumulated value of g for the current level, and g is as before, the 
current accumulated distance. The third element of a (a[2]) is an index telling 
which known utterance is being entered.
The known and unknown utterances enter as before, with the unknown 
utterances entering the bottom of the array one frame at a time and the known 
utterances entering the top of the array. The infinity vector that follows the 
known utterance has the second element set to NEWWORD, which instructs 
the cell to compare the current g value to the minimum g value of all the 
utterances which have been processed since the last NEWUNKNOWN value. 
If g is smaller than the previous g's, its value is assigned to gmin, the index of 
the current utterance is saved in minindex, and g is assigned the initial value 
for the current level, ginit. After the infinity vector has propagated from the 
top to the bottom cell, a new known/unknown pair is started through the cells. 
Between levels, the second element of the infinity vector is set to NEWLEVEL 
which instructs the cell to set the initial value for the new level to the 
minimum value of the previous level, and set gmin to infinity.
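The handling of the three control values can be illustrated with a small state object. The Python sketch below follows one reading of the prose above: on NEWWORD the comparison updates gmin and minindex and g is then reset to ginit regardless of the comparison, and on every control vector the index of the next known utterance is taken from a[2]. The class name `CellState`, the method name `control`, and the string constants are hypothetical, and the DTW computation that normally updates g between control vectors is left out.

```python
import math
from dataclasses import dataclass

NEWUNKNOWN, NEWWORD, NEWLEVEL = "NEWUNKNOWN", "NEWWORD", "NEWLEVEL"


@dataclass
class CellState:
    g: float = math.inf      # current accumulated distance
    ginit: float = math.inf  # initial value of g on the current level
    gmin: float = math.inf   # minimum g seen on the current level
    index: int = -1          # index of the known utterance being entered
    minindex: int = -1       # index of the utterance that produced gmin

    def control(self, tag, next_index):
        """Process the second and third elements of an infinity vector."""
        if tag == NEWUNKNOWN:
            self.ginit = self.g = self.gmin = math.inf
        elif tag == NEWWORD:
            if self.g < self.gmin:
                self.gmin = self.g
                self.minindex = self.index
            self.g = self.ginit      # assumed to happen whether or not g was smaller
        elif tag == NEWLEVEL:
            self.ginit = self.gmin   # best score of the previous level
            self.gmin = math.inf
            self.g = self.ginit
        else:
            raise ValueError("unknown control value: %r" % (tag,))
        self.index = next_index


cell = CellState()
cell.control(NEWUNKNOWN, 0)
cell.g = 5.0                  # stands in for the DTW result for word 0
cell.control(NEWWORD, 1)
cell.g = 3.0                  # stands in for the DTW result for word 1
cell.control(NEWWORD, 0)
print(cell.gmin, cell.minindex)
```

The conditional branches in `control` are the extra instructions that cells processing infinity vectors must execute.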
Some observations about this approach are:
1) all the even (odd) cells are not executing the same code since those cells 
processing infinity values must execute extra instructions, and
2) one pair of known/unknown utterances must pass completely through the
BAC before the next pair can enter. The original BAC allowed pairs of 
utterances to be separated by a single infinity vector, thus overlapping 
the computations and eliminating the initialization time.
Even numbered cells                  Odd numbered cells
Group A                              Group B
a vector down                        a vector down
b vector up                          b vector up
if a[1] = NEWUNKNOWN                 if a[1] = NEWUNKNOWN
ginit ← ∞                            ginit ← ∞
g ← ∞                                g ← ∞
gmin ← ∞                             gmin ← ∞
index = a[2]                         index = a[2]
if a[1] = NEWWORD                    if a[1] = NEWWORD
if(g < gmin)                         if(g < gmin)
gmin ← g                             gmin ← g
g ← ginit                            g ← ginit
minindex ← index                     minindex ← index
if a[1] = NEWLEVEL                   if a[1] = NEWLEVEL
ginit ← gmin                         ginit ← gmin
gmin ← ∞                             gmin ← ∞
index ← a[2]                         index ← a[2]
g ← ginit                            g ← ginit
compute d                            compute d
DTtop ← d                            d.bot ← DTbot
DTbot ← d                            d.top ← DTtop
g ← d + min{ g.bot.old + 2d.bot,     g ← d + min{ g.bot + 2d.bot,
             g + d,                               g + d,
             g.top.old + 2d.top }                 g.top + 2d.top }
g.top.old ← g.top
g.bot.old ← g.bot
g.top ← DTtop                        DTbot ← g
g.bot ← DTbot                        DTtop ← g
d.bot ← DTbot                        DTtop ← d
d.top ← DTtop                        DTbot ← d
DTtop ← g                            g.bot ← DTbot
DTbot ← g                            g.top ← DTtop
Figure 9.6. Instructions executed during one loop of the BAC algorithm for I 
odd. (Exchange columns for I even).
Although the level building DTW can be implemented on the BAC, it is 
not a “clean” implementation in that cells executing the same code are not 
executing synchronously as before, and the data flow must be disrupted 
between each utterance to initialize various variables.
9.4. Summary
This chapter has presented two parallel algorithms for a level building 
dynamic time warp: one for the SIMD machine and the other for the VLSI 
processor array. The SIMD machine used one PE per vocabulary word and 
required only a few simple changes to the serial level building DTW algorithm. 
The VLSI processor array algorithm required only simple changes to the code 
executed in each cell, however the changes contained conditional branches 
which caused the cell taking the branch to be unsynchronized with the other 
cells. Also, the pipeline between cells must be reinitialized between utterances, 
therefore disrupting the data flow.
The HSAC [BAW81,BAW84] cannot implement the level building DTW 
since the HSAC has all cells in a diagonal executing the same instructions and 
it is not possible to instruct individual cells to save their accumulated dis­
tances.
The SIMD machine is well suited for performing the level building DTW 
since it requires few changes from the isolated word DTW, and the changes 
that are made do not alter the time needed to perform a basic time warp.
10. CONCLUSIONS
In this thesis, parallel algorithms for isolated word recognition were 
written for a VLSI processor array and an SIMD machine. These algorithms were 
simulated to determine if real-time execution was achievable and to obtain 
detailed measurements on the ways in which the architecture features were 
used. The simulations were run using parameters that a typical speech recogni­
tion system would use. The SIMD simulations showed that an SIMD machine 
with a MC68000 microprocessor in the CU and each PE could run in real time 
using 100 PEs. The VLSI processor array simulations showed that a processor 
array using Intel 8051s in each cell could not process speech data in real time. 
This inability to process fast enough was attributed to the 8-bit 8051 and the 
slow inter-cell communication, and not the parallel architecture model.
It is not meaningful to compare the execution times of an algorithm imple­
mented on both parallel architectures since the SIMD machine uses a 16-bit 
microprocessor with an 8 MHz clock rate, and the VLSI processor array used 
an 8-bit microprocessor with a 12 MHz clock rate. Any such comparison will 
show that the 16-bit processor is better for speech processing than an 8-bit 
processor.
Desirable features for the SIMD machine architecture for real-time speech 
recognition are:
1) The processor should operate on 16-bit signed fixed-point data, have at least
18 data registers, and at least 2K bytes of general purpose memory in 
the CU and 512 bytes in each PE.
2) The interconnection network should implement the Cube and Shift(± 1)
interconnection functions and have a data path from PE0 to the CU.
3) All PE masking operations can be performed using data conditional masks;
however, many of the masks can be computed at compile time and 
executed as general PE masks.
4) If 100 PEs are used, the time to compare the input-utterance to 1,000 known 
utterances will be less than 500 ms. Also, many of the speech system 
parameters can be changed and 100 PEs will still be able to process 
speech in real time.
Desirable features for a VLSI processor array for real-time speech process­
ing are:
1) The processor should operate on 16-bit signed fixed point data and have at
least 2K bytes of general purpose memory. The internal RAM of the 
8051 is also a desirable feature since it can be used as if it were many 
fast general purpose registers.
2) The inter-cell communication must include a broadcast capability, and each
cell should have both an input and output buffer between it and the 
other cells.
3) The speech recognition system needs at least 51 cells. Fewer cells could be
used if the architecture supports broadcasting.
One comparison that can be made, however, is to compare the SIMD 
DTW program to the HSAC [BAW84]. The HSAC can produce a DTW 
comparison once every 40 µs for a throughput of 25,000 matches per second using a 
full array of 400 processors. An 8 by 16 reduced array of PEs can compute 
5,000 matches per second. The SP DTW program, using 128 PEs, can compute 
1,664 matches per second which is about 1/3 the rate of the reduced array 
HSAC. These figures show that the dedicated processors of the HSAC are able 
to compare utterances faster than the SIMD machine. However, the SIMD 
machine is more flexible in that it can perform a level building DTW for con­
nected speech recognition and the HSAC cannot.
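The throughput figures in this comparison can be checked with a few lines of Python. The variable names are hypothetical; the input numbers are the ones quoted in the paragraph above.

```python
# Full HSAC array: one DTW comparison every 40 microseconds.
hsac_full_rate = 1_000_000 / 40      # matches per second

# Rates quoted in the text for the reduced HSAC array and the SP DTW program.
hsac_reduced_rate = 5_000
simd_sp_rate = 1_664

ratio = simd_sp_rate / hsac_reduced_rate
print(hsac_full_rate, round(ratio, 3))
```

The ratio of 1,664 to 5,000 is about 0.33, which is the "about 1/3" stated in the text.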
The work in this thesis could be extended by considering the following 
problems.
1) Simulate the cells of the VLSI processor array as if they were a digital signal
processing chip instead of the 8051 microprocessor. The TMS32010 
[TI83] digital signal processor can perform a 16 by 16-bit multiply in 200 
ns and a 32-bit addition in 200 ns. Such performance is more than 
sufficient for speech processing and should improve the VLSI processor 
array throughput.
2) Simulate different inter-cell communications on the VLSI processor array.
Simulating circuit switched and packet switched communications with 1, 
8, and 16 bit wide data paths would indicate which method is best 
suited for speech processing.
3) Simulate an instruction queue between the CU and PEs in the SIMD
machine. Previous work has shown a 50% improvement in execution 
times for image processing [SiKu82]. Such simulations will show if 
speech processing can yield the same improvements.
4) Write algorithms for continuous speech recognition and simulate them.
Continuous speech recognition is more time consuming than isolated 
word recognition and a powerful parallel processor might be able to 
recognize continuous speech in real time.
In summary, this thesis has shown that parallel processing is useful for 
real-time speech recognition. Through simulations, it demonstrated that both 




[AHU74] Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman, The 
Design and Analysis of Computer Algorithms, Addison-Wesley, 
Reading, Mass., 1974.
[ASV79] A. Y. Ashajayanthi, S. Rajaram, and N. Viswanadham, “A Parallel 
Processor for Real-Time Speech Signal Processing,” 1979 IEEE 
International Conference on Acoustics, Speech, and Signal Processing, 
April 1979, pp. 868-871.
[AtHa71] Bishnu S. Atal and Susan L. Hanauer, “Speech Analysis and Syn­
thesis by Linear Prediction of the Speech Wave,” Journal of the 
Acoustical Society of America, Vol. 50, August 1971, pp. 637-655.
[Ba79] Kenneth Batcher, “MPP - a Massively Parallel Processor,” 1979 
International Conference on Parallel Processing, August 1979, pp. 
249.
[BaLu81] George H. Barnes and Stephen F. Lundstrom, “Design and Validation 
of a Connection Network for Many-Processor Multiprocessor 
Systems,” Computer, Vol. 14, No. 12, December 1981, pp. 31-41.
[Barn68] George H. Barnes, “The Illiac IV Computer,” IEEE Transactions 
on Computers, Vol. C-17, August 1968, pp. 746-757.
[BAW81] David J. Burr, Bryan D. Ackland, and Neil Weste, “A High Speed 
Array Computer for Dynamic Time Warping,” Proceedings of the 1981 
IEEE Acoustics, Speech, and Signal Processing, April 1981, pp. 
471-474.
[BAW84] David J. Burr, Bryan D. Ackland, and Neil Weste, “Array 
Configurations for Dynamic Time Warping,” IEEE Transactions 
on Acoustics, Speech, and Signal Processing, Vol. ASSP-32, No. 1, 
February 1984, pp. 119-128.
[BBGI80] Jeffrey A. Barnett, Morton I. Bernstein, Richard A. Gillmann, and 
Iris M. Kameny, “The SDC Speech Understanding System,” in 
Trends in Speech Recognition, Prentice-Hall Inc, Englewood Cliffs, 














William Woods et al., “Speech Understanding Systems, Final 
Technical Progress Report,” No. 3438, Bolt Beranek and Newman 
Inc., October 1976.
W. J. Bouknight, Stewart A. Denenberg, David E. McIntyre, J. M. 
Randall, Ahmed H. Sameh, and Daniel L. Slotnick, “The Illiac IV 
System,” Proceedings of the IEEE, Vol. 60, No. 4, April 1972, pp. 
369-388.
Edward C. Bronson and Leah Jamieson Siegel, “A Parallel 
Architecture for Acoustic Processing in Speech Understanding,” 
Proceedings of the 1982 International Conference on Parallel 
Processing, Bellaire, Michigan, August 1982, pp. 307-311.
Carolyn Cline and Howard Jay Siegel, “Extension of ADA for 
SIMD Parallel Processing,” The IEEE Computer Society’s Seventh 
International Computer Software and Applications Conference, 
November 1983, pp. 366-372.
B. A. Crane, “PEPE Computer Architecture,” IEEE Computer 
Society Conference, September 1972, pp. 57-60.
Digital Equipment Corporation, Macro-11 Assembler, Publication 
number DEC-11-DMACA-A-D.
George R. Doddington and Thomas B. Schalk, “Speech Recognition: 
Turning Theory to Practice,” IEEE Spectrum, Vol. 18, No. 
9, September 1981.
J. Timothy Field, Alejandro A. Kapauan, and Lawrence Snyder, 
“Pringle: A Parallel Processor to Emulate CHiP Computers,” Pur­
due University Department of Computer Science, CSD-TR-443.
Michael J. Flynn, “Very High-Speed Computing Systems,” 
Proceedings of the IEEE, Vol. 54, No. 12, December 1966, pp. 
1901-1909.
C. J. M. Hodges, Thomas P. Barnwell, and Daniel McWhorter, 
“The Implementation of an All Digital Speech Synthesizer Using a 
Multimicroprocessor Architecture,” IEEE International Conference 
on Acoustics, Speech, and Signal Processing, April 9-11, 1980, pp. 
855-858.
Intel Corporation, “Intel MCS-51(tm) Family of Single Chip 
Microcomputers User’s Manual,” Part Number 121517-001, July 
1981.
Fumitada Itakura, “Minimum Prediction Residual Principle 
Applied to Speech Recognition,” IEEE Transactions Acoustics, 
Speech, and Signal Processing, Vol. ASSP-23, No. 1, February 
1975.
[JeWi74] Kathleen Jensen and Niklaus Wirth, Pascal User Manual and 
Report, second edition, Springer-Verlag, New York, NY, 1974.
[KeRi78] Brian W. Kernighan and Dennis M. Ritchie, The C Programming 
Language, Prentice-Hall, Inc., Englewood Cliffs, New Jersey 07632, 
1978.
[KoSt73] Peter M. Kogge and Harold S. Stone, “A Parallel Algorithm for 
the Efficient Solution of a General Class of Recurrence Equations,” 
IEEE Transactions on Computers, Vol. C-22, No. 8, August 1973, 
pp. 786-793.
[Ku84] James T. Kuehn, internal correspondence.
[Kuck77] David J. Kuck, “A Survey of Parallel Machine Organization and 
Programming,” Computing Surveys, Vol. 9, No. 1, March 1977, pp. 
29-59.
[KuLe] H. T. Kung and Charles E. Leiserson, “Algorithms for VLSI 
Processor Arrays,” in Introduction to VLSI Systems, edited by Carver 
Mead and Lynn Conway, Addison-Wesley, Reading, MA, 1980, pp. 
271-294.
[Kung80] H. T. Kung, “The Structure of Parallel Algorithms,” in Advances 
in Computers, Vol. 19, edited by Marshall C. Yovits, Academic 
Press, New York, NY, 1980.
[LeLi81] Stephen E. Levinson and Mark Y. Liberman, “Speech Recognition 
by Computer,” Scientific American, April 1981, pp. 64-76.
[LMMB83] Menahem Lowy, Hy Murveit, David M. Mintz, and Robert W. 
Brodersen, “An Architecture for a Speech Recognition System,” IEEE 
1983 International Solid State Circuits Conference, Vol. 26, Febru­
ary 1983, pp. 118-119.
[LRRW81] Lori F. Lamel, Lawrence R. Rabiner, Aaron E. Rosenberg, and Jay 
G. Wilpon, “An Improved Endpoint Detector for Isolated Word 
Recognition,” IEEE Transactions on Acoustics, Speech, and Signal 
Processing, Vol. ASSP-29, No. 4, August 1981, pp. 777-785.
[MaGr74] John D. Markel and Augustine H. Gray, “Fixed-Point Truncation 
Arithmetic Implementation of a Linear Prediction Autocorrelation 
Vocoder,” IEEE Transactions on Acoustics, Speech, and Signal 














John D. Markel and Augustine H. Gray, Jr., Linear Prediction of 
Speech, Springer-Verlag, New York, NY, 1976.
John Makhoul, “Linear Prediction: A Tutorial Review,” Proceed­
ings of the IEEE, Vol. 63, No. 4, April 1975, pp. 561-580.
Motorola Semiconductor, MC68000 16-bit Microprocessor User’s 
Manual, Motorola IC Division, Austin, TX, 1979.
Cory Myers, Lawrence R. Rabiner, and Aaron E. Rosenberg, “Per­
formance Tradeoffs in Dynamic Time Warping Algorithms for Iso­
lated Word Recognition,” IEEE Transactions Acoustics, Speech, 
and Signal Processing, Vol. ASSP-28, No. 6, December 1980, pp. 
623-635.
Philip T. Mueller, Jr., Leah J. Siegel, and Howard Jay Siegel, 
“Parallel Algorithms for the Two-Dimensional FFT,” 5th Interna­
tional Conference on Pattern Recognition, December 1980, pp. 
497-502.
Cory S. Myers, “A Comparative Study of Several Dynamic Time 
Warping Algorithms for Speech Recognition,” Masters Thesis, 
Massachusetts Institute of Technology, February 1980.
Cory S. Myers and Lawrence R. Rabiner, “A Level Building 
Dynamic Time Warping Algorithm for Connected Word Recogni­
tion,” IEEE Transactions on Acoustics, Speech, and Signal Pro­
cessing, Vol. ASSP-29, No. 2, April 1981, pp. 284-297.
Cory S. Myers and Lawrence R. Rabiner, “Connected Digit Recog­
nition Using a Level-Building DTW Algorithm,” IEEE Transac­
tions on Acoustics, Speech, and Signal Processing, Vol. ASSP-29, 
No. 3, June 1981, pp. 351-363.
A. M. Noll, “Cepstrum Pitch Determination,” Journal of the 
Acoustical Society of America, Vol. 41, February 1967, pp. 293-309.
Alan V. Oppenheim and Ronald W. Schafer, Digital Signal Pro­
cessing, Prentice-Hall, Englewood Cliffs, NJ, 1975.
Marshall C. Pease, “The Indirect Binary N-cube Microprocessor 
Array,” IEEE Transactions on Computers, Vol. C-26, No. 5, May 
1977, pp. 458-473.
Lawrence R. Rabiner and Ben Gold, Theory and Application of 
Digital Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, 
1975.
[RaSa75] Lawrence R. Rabiner and Marvin R. Sambur, “An Algorithm for 
Determining the Endpoints of Isolated Utterances,” Bell System 
Technical Journal, Vol. 54, No. 2, February 1975.
[RaSc78] Lawrence R. Rabiner and Ronald W. Schafer, Digital Processing of 
Speech Signals, Prentice-Hall, Englewood Cliffs, NJ, 1978.
[RLRW79] Lawrence R. Rabiner, Stephen E. Levinson, Aaron E. Rosenberg, 
and Jay G. Wilpon, “Speaker-Independent Recognition of Isolated 
Words Using Clustering Techniques,” IEEE Transactions Acoustics, 
Speech, and Signal Processing, Vol. ASSP-27, No. 4, August 
1979, pp. 336-349.
[SaCh71] Hiroaki Sakoe and Seibi Chiba, “Dynamic Programming Algorithm 
Optimization for Spoken Word Recognition,” IEEE Transactions 
Acoustics, Speech, and Signal Processing, Vol. ASSP-26, No. 1, 
February 1978, pp. 43-49.
[Sakoe79] Hiroaki Sakoe, “Two-Level DP-Matching - A Dynamic 
Programming-Based Pattern Matching Algorithm for Connected 
Word Recognition,” IEEE Transactions Acoustics, Speech, and 
Signal Processing, Vol. ASSP-27, No. 6, December 1979, pp. 588- 
595.
[Saf82] Robert J. Safranek, “Speech Processing on SIMD Computers,” 
Master of Science Thesis, Purdue University, School of Electrical 
Engineering, August 1982.
[SBK77] Herbert Sullivan, T. R. Bashkow, and David Klappholz, “A Large 
Scale Homogeneous, Fully Distributed Parallel Machine,” Sym­
posium on Computer Architecture, March 1977, pp. 105-124.
[Si77] Howard Jay Siegel, “Analysis Techniques for SIMD Machine Inter­
connection Networks and the Effects of Processor Address Masks,” 
IEEE Transactions on Computers, Vol. C-26, No. 2, February 
1977, pp. 153-161.
[Si79] Howard Jay Siegel, “Interconnection Networks for SIMD
Machines,” Computer, Vol. 12, June 1979, pp. 57-65.
[Si80a] Leah J. Siegel, “Parallel Processing Algorithms for Linear Predic­
tive Coding,” IEEE International Conference on Acoustics, Speech, 
and Signal Processing, April 1980, pp. 960-963.
[Si80b] Leah J. Siegel, Howard Jay Siegel, Robert J. Safranek, and Mark 
A. Yoder, “SIMD Algorithms to Perform Linear Predictive Coding 
for Speech Processing Applications,” 1980 International Conference 
on Parallel Processing, August, 1980, pp. 193-196.
Leah J. Siegel, “Using SIMD Machines for Speech Analysis,” 14th 
Annual Hawaii International Conference on System Sciences, Janu­
ary, 1981, Vol. 1, pp. 309-318.
[Sieg81a] Howard Jay Siegel, Leah J. Siegel, Frederick Kemmerer, Philip T. 
Mueller, Jr., Harold E. Smalley, Jr., and S. Diane Smith, “PASM: 
a Partitionable Multimicrocomputer SIMD/MIMD System for 
Image Processing and Pattern Recognition,” IEEE Transactions on 
Computers, Vol. C-30, No. 12, December 1981, pp. 934-947.
[Sieg81b] Leah J. Siegel et al., “Parallel Image Processing/Feature Extrac­
tion Algorithms and Architecture Emulation: Interim Report for 
Fiscal 1981,” Technical Report, TR-EE-81-35, School of Electrical 
Engineering, Purdue University, West Lafayette, Indiana 47907.
[SiKu82] Howard Jay Siegel and James T. Kuehn, “Design and Simulation 
of a Multimicroprocessor System for Mapping Applications,” 
Technical Report, TR-EE-83-18, School of Electrical Engineering, 
Purdue University, West Lafayette, Indiana 47907, December 
1984.
[SiMc81a] Howard Jay Siegel and Robert J. McMillen, “Using the Augmented 
Data Manipulator Network in PASM,” Computer, Vol. 14, No. 2, 
February 1981, pp. 25-33.
[SiMc81b] Howard Jay Siegel and Robert J. McMillen, “The Multistage 
Cube: A Versatile Interconnection Network,” Computer, Vol. 14, 
No. 12, December 1981, pp. 65-76.
[SiMu78] Howard Jay Siegel and Philip T. Mueller, Jr., “The Organization 
and Language Design of Microprocessors for an SIMD/MIMD Sys­
tem,” 2nd Rocky Mountain Symposium on Microcomputers, August
1978, pp. 311-340.
[SMS79] Leah J. Siegel, Philip T. Mueller, and Howard Jay Siegel, “FFT 
Algorithms for SIMD Machines,” Seventeenth Annual Allerton 
Conference on Communication, Control, and Computing, October
1979, pp. 1006-1015.
[Snyder82a] Lawrence Snyder, “Introduction to the Configurable, Highly 
Parallel Computer,” Computer, Vol. 15, No. 1, January 1982, pp. 47-56.
[Snyder82b] Lawrence Snyder, “The Poker (1.0) Programmers Guide,” Techni­
cal Report CSD-TR-434, Computer Science Department, Purdue 
University, West Lafayette, IN 47907, December 1982.
[Snyder83] Lawrence Snyder, “Introduction to the Poker Parallel Program­
ming Environment,” 1983 International Conference on Parallel 
Processing, August 1983, pp 289-292.
[Sond68] Man Mohan Sondhi, “New Methods of Pitch Extraction,” IEEE 
Transactions on Audio and Electroacoustics, Vol. AU-16, No. 3, June 
1968, pp. 262-266.
[Ston80] Harold S. Stone, “Parallel Computers,” in Introduction to 
Computer Architecture, 2nd edition, edited by Harold S. Stone, Science 
Research Associates, Inc., Chicago, IL, 1980, pp. 362-425.
[Thre] Threshold Technology, Inc., Delran, NJ.
[TI83] “TMS32010 Digital Signal Processor,” Texas Instruments, Dallas,
Texas 75265, May 1983.
[ToGu81] H-m. D. Toong and A. Gupta, “An Architectural Comparison of 
Contemporary 16-bit Microprocessors,” IEEE Micro, Vol. 1, May 
1981, pp. 26-37.
[Verb] Verbex Corp., Bedford, Mass.
[WBA83] Neil Weste, David J. Burr, and Bryan D. Ackland, “Dynamic 
Time Warp Pattern Matching Using an Integrated Multiprocessing 
Array,” IEEE Transactions on Computers, Vol. C-32, No. 8, 
August 1983, pp 731-744.
[YoSi81] Mark A. Yoder and Leah J. Siegel, “Systolic and SIMD Algorithms 
for Digital Filtering,” Proceedings of the Nineteenth Annual Allerton 
Conference on Communication, Control, and Computing, October 
1981, pp. 880-889.
[YoSi82] Mark A. Yoder and Leah J. Siegel, “Dynamic Time Warping Algo­
rithms for SIMD Machines and VLSI Processor Arrays,” IEEE 
International Conference on Acoustics, Speech, and Signal Process­
ing, May 1982, pp. 1274-1277.
APPENDICES




Data Register Direct                      EA = Dn
Address Register Direct                   EA = An
Absolute Data Addressing
  Absolute Short                          EA = (Next Word)
  Absolute Long                           EA = (Next Two Words)
Program Counter Relative Addressing
  Relative with Offset                    EA = (PC) + d16
  Relative with Index and Offset          EA = (PC) + (Xn) + d8
Register Indirect Addressing
  Register Indirect                       EA = (An)
  Postincrement Register Indirect         EA = (An), An <- An + N
  Predecrement Register Indirect          An <- An - N, EA = (An)
  Register Indirect With Offset           EA = (An) + d16
  Indexed Register Indirect With Offset   EA = (An) + (Xn) + d8
Immediate Data Addressing
  Immediate                               DATA = Next Word(s)
  Quick Immediate                         Inherent Data
Implied Addressing
  Implied Register                        EA = SR, USP, SP, PC
NOTES:
  EA  = Effective Address
  An  = Address Register
  Dn  = Data Register
  Xn  = Address or Data Register used as Index Register
  SR  = Status Register
  PC  = Program Counter
  d8  = Eight-bit Offset (displacement)
  d16 = Sixteen-bit Offset (displacement)
  N   = 1 for Byte, 2 for Words and 4 for Long Words











































Bit Test and Change
Bit Test and Clear
Branch Always






























































Rotate Left without Extend
Rotate Right without Extend
Rotate Left with Extend






; Fixed addresses in CU space
MASKCTL = 0x404 ; masking operations unit control port
FROMPE0 = 0x408 ; Data path from PE0
; The following are standard definitions for the
; PE transfer registers.
DTRDEST = 0x400 ; PE address where data is transferred to
DTRIN = 0x402 ; Data transfer in from interconnection network
DTROUT = 0x404 ; Data transfer out of network
TOCU = 0x40c ; Data path to CU from PE0
#define NetworkDelay(x) p_mov.I0,0
; Network Delay (4.5 microseconds at 8 MHz)
; The following are standard definitions for the
; PE condition code registers.
PECCR = 0x408 ; Condition codes register (SR), size W, write only
PECCS = 0x40a ; Condition codes select register, size B, write only
; The following are control words for the masking operations unit.
; See [SiKu82] for more details.
; OP
Pushs = 0x0000 ; Push*






DataCond = 0x0600 ; Positive Data conditional mask
NDataCond = 0x0700 ; Negative Data conditional mask
; The following are control words for the condition codes select register
; From page A-3 of the 68000 Assembly Language Programming Manual
T = 0x0 ; True
F = 0x1 ; False
HI = 0x2 ; High
LS = 0x3 ; Low or same
CC = 0x4 ; Carry clear
CS = 0x5 ; Carry set
NE = 0x6 ; Not equal
EQ = 0x7 ; Equal
VC = 0x8 ; No overflow
VS = 0x9 ; Overflow
PL = 0xa ; Plus
Figure A-2 Contents of simd.h, the file describing the device locations 
address space.
MI = 0xb ; Minus
GE = 0xc ; Greater than or equal
LT = 0xd ; Less than
GT = 0xe ; Greater than
LE = 0xf ; Less than or equal
; Macro definitions for inter PE communications
; These definitions assume d0 is available for use,
; and d7 contains WHOAMI.
; In Transfer_1(in,out), in and out must be different D registers.
; In Broadcast, in must be a register
#define Broadcast(in,out) %\
c_mov.w in,.+6.w %\
p_mov.w #0,out







p_mov.w DTROUT.w,out %\
p_swap in %\
p_swap out %\












p_mov.w DTROUT.w,in %\
p_swap in
#define Shift(x) %\







p_bchg x,d0 %\
p_mov.w d0,DTRDEST.w
#define Perm(x,y) %\
p_mov.w y,d0 %\
p_sub.w x,d0 %\
p_subq.w #1,d0 %\
p_mov.w d0,DTRDEST.w
; Macro definitions for Data Conditional masking
#define Where(x,cond,y) %\
p_cmp.w y,x %\
p_mov.w sr,PECCR.w %\
p_mov.b #cond,PECCS.w %\
.lock %\
c_mov.w #Pushs + DataCond,MASKCTL.w %\
.unlock
#define WhereElse(x,cond,y) %\
p_cmp.w y,x %\
p_mov.w sr,PECCR.w %\
p_mov.b #cond,PECCS.w %\
.lock %\
c_mov.w #Pushs + NDataCond,MASKCTL.w %\
c_mov.w #Pushs + DataCond,MASKCTL.w %\
.unlock
#define ElseWhere c_mov.w #Pop + DataCond,MASKCTL.w
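The effect of the data conditional masking macros on which PEs execute an instruction can be pictured with a small software model. The sketch below is a toy model only: it assumes a stack of per-PE enable masks, and the class and method names (`MaskStack`, `where`, `else_where`, `end_where`, `masked_assign`) are hypothetical. It does not reproduce the control words or timing of the masking operations unit described in [SiKu82].

```python
class MaskStack:
    """Toy model of data-conditional masking with a stack of enable masks."""

    def __init__(self, num_pes):
        # Initially every PE is enabled.
        self._stack = [[True] * num_pes]

    @property
    def active(self):
        """Enable bits currently in force (top of the stack)."""
        return list(self._stack[-1])

    def where(self, cond):
        """Push a mask: a PE stays enabled only if it was enabled and cond holds."""
        top = self._stack[-1]
        self._stack.append([a and c for a, c in zip(top, cond)])

    def else_where(self):
        """Switch to the complement of the current mask within the enclosing one."""
        current = self._stack.pop()
        parent = self._stack[-1]
        self._stack.append([p and not c for p, c in zip(parent, current)])

    def end_where(self):
        """Pop the mask and return to the enclosing set of enabled PEs."""
        self._stack.pop()


def masked_assign(values, new_values, mask):
    """Apply new_values only in enabled PEs, leaving the others unchanged."""
    return [n if m else v for v, n, m in zip(values, new_values, mask)]


if __name__ == "__main__":
    whoami = [0, 1, 2, 3]                      # one address per PE
    masks = MaskStack(len(whoami))
    x = [0, 0, 0, 0]
    masks.where([addr < 2 for addr in whoami])  # like WHERE ADDR < 2 DO
    x = masked_assign(x, [1] * 4, masks.active)
    masks.else_where()
    x = masked_assign(x, [2] * 4, masks.active)
    masks.end_where()
    print(x)
```

In this model a nested `where` can only disable further PEs, which is the behavior the pseudocode `WHERE ... END WHERE` blocks in the listings rely on.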













; Definitions for main routine
= 0x1000 ; Put the stack at the top of memory
; Definitions for autocorrelation
= 9





; Definitions for ltw
= 80
; Definitions for dtw
= 0x4000 ; Infinity
= 40 ; Number of frames in utterance
= 6 ; Width of warping path N = 2r + 1
= 10 ; Number of utterances in the vocabulary




; Machine: SIMD, simulated by a MC68000.
; Function: This program preemphasises the input speech
;     data with a filter with the transfer function:
;     H(z) = 1 - coef * z^-1
; Precision: Input: 16-bit signed
;     coef: 16-bit signed (15 bits to the right
;     of the decimal point.)
;     Output: 16-bit signed
; Number of PEs: N
; Transfers: Shift(+1)
; Masking: Data Conditional
; Parameters: coef, The filter coefficient (default = 0.95).
;     NetD, The interconnection network delay
;     time in cycles.
; Input: The input data is stored in PEs 0 through N-1.
;     PE i contains sample i for 0 <= i <= N-1.
; Output: The output data is stored in PEs 0 through N-1.
;     PE i contains sample i for 0 <= i < N.
; Cycles: 130 + NetD
; Typical Time: 37 µs for one N sample frame
; Register Usage: (* means set by calling routine)
;     d0 pe used by macros
;     d1 pe tmp
;     d2 pe used to swap tmp and old value
;     d7* pe WHOAMI (physical pe address)
;     a0* pe points to input signal
;     a1* pe points to output signal
#include "simd.h"
#include "defs.h"
; Data allocation for routine
.p__data ; Data stored in each PE










Figure A.4 Sim68 program to perform preemphasis filtering. Numbers on left 
are execution times in cycles.
; 1 USE Shift +1
Shift(#1) ; Set up interconnection network addresses















Where(d7,EQ,#0) ; In PEO, get value from last call




tmp <- oldvalue 
oldvalue <- tmp2
/* Switch tmp and oldvalue */
2 p_mov.w d1,d2












; output <- input + tmp * 0.95
39 p_muls coef.w,d1 ; mult. by coef and save in d1.
4 p_asl.l #1,d1 ; shift 15 to the right by shifting left
2 p_swap d1 ; and swapping upper and lower words.
2 p_add.w d1,d0 ; d0 = d0 + coef * d1
4 p_mov.w d0,(a1) ; save in memory
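As a cross-check on the fixed-point arithmetic in the preemphasis listing, the filter can be sketched serially. This is a minimal sketch that follows the transfer function given in the program header, H(z) = 1 - coef * z^-1, with coef held as a 16-bit value with 15 fractional bits. The function name, the choice of 31130 for 0.95, and the saturation step are assumptions made for illustration; the listing itself keeps the upper word of the shifted product and does not show saturation.

```python
def preemphasis_q15(samples, coef_q15=31130, prev=0):
    """Serial sketch of y[n] = x[n] - coef * x[n-1] in 16-bit fixed point.

    coef_q15 is the coefficient scaled by 2**15 (0.95 * 32768 is about 31130).
    prev is the last sample of the previous frame (0 for the first frame).
    Returns the filtered frame and the sample to carry into the next call.
    """
    out = []
    for x in samples:
        # 16x16 -> 32-bit product, then drop the 15 fractional bits.
        prod = (coef_q15 * prev) >> 15
        y = x - prod
        # Assumption: clamp to the 16-bit signed range instead of wrapping.
        y = max(-32768, min(32767, y))
        out.append(y)
        prev = x
    return out, prev


if __name__ == "__main__":
    frame, carry = preemphasis_q15([1000, 1000, 1000])
    print(frame, carry)
```

On the SIMD machine each PE would compute one output sample after receiving its neighbor's input through the Shift(+1) transfer; the loop above simply visits the samples in order.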






; Machine: SIMD, simulated by a MC68000.
; Function: This program finds the autocorrelation
;     coefficients of input speech data.
; Precision: Input: 16-bit signed
;     Output: 32-bit signed
; Number of PEs: N
; Transfers: Shift(-1), Cube
; Masking: Data Conditional
; Parameters: autocoef, The number of coefs. to find.
;     N, The number of PEs in use.
;     NetD, The interconnection network delay
;     time in cycles.
; Input: The input data is stored in PEs 0 through N-1
;     with PE i containing sample i for 0 <= i < N.
; Output: The autocorrelation coefficients, R(i),
;     for 0 <= i <= autocoef-1 appear in PE i
;     for 0 <= i < N (i.e. each PE contains
;     every coefficient).
; Cycles: autocoef[136 + NetD + (54 + 2NetD)logN] - 12 - NetD
; Typical Time: 1,757 µs for autocoefs=9, NetD=18, and logN=7.
; Register Usage: (* means set by calling routine)
;     d0 pe used by macros
;     d1 pe j, tmp
;     d1 cu j
;     d2 pe N-i, tmp
;     d3 pe partsum
;     d4 pe sig input data
;     d5 pe slast
;     d6 pe i
;     d6 cu i
;     d7* pe WHOAMI (physical address)
;     a0* pe pointer to input data
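The cycle count expression in the header can be evaluated directly for the quoted parameters. The sketch below is a worked check only. The conversion factor of 4 cycles per microsecond is an inference from the quoted typical time, not a value stated in this header, and the function names are illustrative.

```python
# Assumption: 4 processor cycles per microsecond, inferred from the
# quoted typical time rather than stated in the program header.
CYCLES_PER_US = 4


def autocorr_cycles(autocoef, net_d, log_n):
    """Cycle count: autocoef[136 + NetD + (54 + 2NetD)logN] - 12 - NetD."""
    return autocoef * (136 + net_d + (54 + 2 * net_d) * log_n) - 12 - net_d


def cycles_to_us(cycles, cycles_per_us=CYCLES_PER_US):
    """Convert a cycle count to microseconds at the assumed clock rate."""
    return cycles / cycles_per_us


if __name__ == "__main__":
    cycles = autocorr_cycles(autocoef=9, net_d=18, log_n=7)
    print(cycles, cycles_to_us(cycles))
```

For autocoef = 9, NetD = 18, and logN = 7 the expression gives 7,026 cycles, or 1,756.5 µs under the assumed rate, which agrees with the 1,757 µs quoted in the header.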










Figure A.5 Program performing autocorrelation. Numbers on left are 
execution times in cycles.














; slast <- sig /* After stage i, "slast" in PE m
;     holds sig(m + i) */
p_mov.w (a0),d4 ; store sig in register
p_mov.w d4,d5 ; slast <- sig
; FOR i <- 0 TO p DO
p_clr.w d6 ; i <- 0 in PE
c_clr.w d6 ; i <- 0 in CU
; IF i != 0 THEN
c_tst.w d6 ; if i == 0 jump to lab1
c_beq.s lab1




; 6 DTRin <- slast
; 7 TRANSFER















4 p_mov.w #N,d2 ; d2 = N
2 p_mov.w d6,d1 ; d1 = i
2 p_sub.w d1,d2 ; d2 = N-i
26 Where(d2,GT,d7) ; d7 = WHOAMI
; 11 partsum <- slast * sig
2 p_mov.w d4,d3 ; partsum <- sig
Figure A.5 (Continued)
35 p_muls d5,d3 ; partsum <- slast * sig
; 12 END WHERE
8 EndWhere
; 13 FOR j <- 0 TO logN-1 DO
2 p_clr.w d1 ; j <- 0 in PEs





; 15 TRANSFER partsum TO tmp
32 + 2NetD Transfer_1(d3,d2)
; 16 partsum <- tmp + partsum
3 p_add.l d2,d3
; Loop back for "FOR j <- 0 TO logN-1 DO"
2 p_addq.w #1,d1 ; j++ in PEs
5/7 c_dbf d1,loop2 ; j-- in CU
; 17 R(i) <- partsum
6 p_mov.l d3,(a1)+
; Loop back for "FOR i <- 0 TO p DO"
2 p_addq.w #1,d6 ; i++ in PEs
2 c_addq.w #1,d6 ; i++ in CU
4 c_cmp.w #autocoef,d6 ; if i < number autocorrelation coefs,
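The data movement in the autocorrelation listing (a shift of slast by one PE per lag, a masked multiply, and logN cube transfers to sum the partial products) can be simulated serially. The sketch below is an illustration under stated assumptions: one sample per PE, N a power of two, and a circular wraparound on the shift, which does not matter here because the wrapped values are masked off. The function name is hypothetical.

```python
def simd_autocorrelation(sig, num_coefs):
    """Simulate the shift / mask / recursive-doubling pattern of the listing.

    sig holds one sample per PE (len(sig) must be a power of two).
    Returns [R(0), ..., R(num_coefs - 1)].
    """
    n = len(sig)
    assert n > 0 and n & (n - 1) == 0, "number of PEs must be a power of two"
    log_n = n.bit_length() - 1
    slast = list(sig)
    r = []
    for i in range(num_coefs):
        if i != 0:
            # Shift(-1): PE m receives the value held by PE m+1, so after
            # stage i PE m holds sig(m + i). Wraparound is an assumption.
            slast = slast[1:] + slast[:1]
        # WHERE ADDR < N - i: only these PEs form a product; others hold 0.
        partsum = [slast[m] * sig[m] if m < n - i else 0 for m in range(n)]
        # Cube transfers: after stage j each PE has added its partner's sum.
        for j in range(log_n):
            partsum = [partsum[m] + partsum[m ^ (1 << j)] for m in range(n)]
        # Every PE now holds R(i); read it from PE 0.
        r.append(partsum[0])
    return r


if __name__ == "__main__":
    print(simd_autocorrelation([1, 2, 3, 4], 3))
```

Because the cube exchange is symmetric, every PE ends each lag with the same total, which is why the header states that each PE contains every coefficient.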







; Machine: SIMD, simulated by a MC68000.
; Function: This program finds the autocorrelation
;     coefficients of input speech data using
;     half as many PEs as samples in a frame.
; Precision: Input: 16-bit signed
;     Output: 32-bit signed
; Number of PEs: N
; Transfers: Shift(-1), Cube
; Masking: Data Conditional
; Parameters: autocoef, The number of coefs. to find
;     N, The number of PEs in use.
;     NetD, The interconnection network delay
;     time in cycles.
; Input: The input data is stored in PEs 0 through N-1
;     with PE i containing sample i for 0 <= i < N.
; Output: The autocorrelation coefficients, R(i),
;     for 0 <= i <= autocoef-1 appear in PE i
;     for 0 <= i < N (i.e. each PE contains
;     every coefficient).
; Cycles: autocoef[136 + NetD + (54 + 2NetD)logN] - 12 - NetD
; Typical Time: 1,757 µs for autocoefs=9, NetD=18, and logN=7.
; Register Usage: (* means set by calling routine)
;     d0 pe used by macros
;     d1 pe j, tmp
;     d1 cu j
;     d2 pe N-i, tmp
;     d3 pe partsum




d7* pe WHOAMI (physical address)
a0* pe pointer to input data
a1* pe pointer to output coefficients




; Data allocation for routine
.globl auto2
Figure A.6 Program performing autocorrelation using half as many PEs as 
samples in a frame.




; 1 slast1 <- sig1 /* After stage i, "slast" in
; 2     PE m holds sig(m+i) */
; 3 slast2 <- sig2 /* After stage i, "slast" in
; 4     PE m holds sig(m+i) */
4 p_mov.w (a0)+,d4 ; store first half of frame in a reg.
2 p_swap d4 ; move to upper 16 bits
4 p_mov.w (a0),d4 ; store second half in other half of reg.
2 p_swap d4 ; Keep first half in lower 16 bits
2 p_mov.l d4,d5 ; slast <- sig
; 6 FOR i <- 0 TO p DO
2 p_clr.w d6 ; i <- 0 in PE
2 c_clr.w d6 ; i <- 0 in CU
loop1:
; 7 IF i != 0 THEN
2 c_tst.w d6 ; if i == 0 jump to lab1
5/4 c_beq.s lab1






; 9 DTRin <- slast1
; 10 TRANSFER
; 11 slast1 <- DTRout
; Send first half through network
6 p_mov.w d5,DTRIN.w ; DTRIN <- slast
NetD NetworkDelay(0)
6 p_mov.w DTROUT.w,d5 ; slast <- DTROUT
; 12 DTRin <- slast2
; 13 TRANSFER
; 14 slast2 <- DTRout
2 p_swap d5 ; Send second half through network
5 p_mov.w d5,DTRIN.w ; DTRIN <- slast
NetD NetworkDelay(0)






; 17 tmp <- slast1
; 18 slast1 <- slast2






; 22 partsum <- 0
2 p_clr.w d3 ; partsum <- 0
; 24 WHERE ADDR < N-i DO




4 p_mov.w #N,d2 ; d2 = N
2 p_mov.w d6,d1 ; d1 = i
2 p_sub.w d1,d2 ; d2 = N-i
26 Where(d2,GT,d7) ; d7 = WHOAMI
2 p_mov.l d4,d3 ; partsum <- sig (first half)
2 p_swap d3
2 p_swap d5





; 28 partsum <- partsum + slast1 * sig1
2 p_mov.w d4,d2 ; now compute second half
35 p_muls d5,d2
3 p_add.l d2,d3 ; Add two halves together
; 30 FOR j <- 0 TO logN-1 DO
2 p_clr.w d1 ; j <- 0










TRANSFER partsum TO tmp










p_addq.w #1,d1 ; j++ in pe










p_addq.w #1,d6 ; i++ in PE
c_addq.w #1,d6 ; i++ in cu
c_cmp.w #autocoef,d6 ; if i < number autocorrelation coefs.






; Machine: SIMD, simulated by a MC68000.
; Function: This routine finds the LPC coefficients using
;     Durbin's method.
; Precision: Input: 16 bits, signed
;     Output: 16 bits, signed with 12 bits to the
;     right of the decimal point
; Number of PEs: p, the number of LPC coefficients.
; Transfers: Cube, Perm
; Masking: Data Conditional
; Parameters: p, the number of LPC coefficients.
;     NetD, the interconnection network delay
;     time in cycles.
; Input: Each PE contains all the autocorrelation
;     coefficients R(i) for 0 <= i <= p.
; Output: The results are stored in a(i), with
;     PE(i-1) containing a(i) for 0 < i <= p.
; Cycles: p[513 + NetD + (54 + 2NetD)log(p)] - 38 - NetD
; Typical Time: 1,588 µs for p=8 and NetD=18
; Register usage: (* means set by calling routine)




;     d4 cu j
;     d5 pe i
;     d5 cu i
;     d6 pe LADDR logical PE address
;     d7* pe WHOAMI physical PE address
;     a1* pe R points to the array of
;         autocorrelation coef.
;     a2 pe R points to current R value
p = 8








E . = .+ 3
.c_text
Figure A.7 Program for finding LPC coefficients.
6 p_mov.w LADDR.w,d6
2 p_mov.l a1,a2 ; a2 points to current R value
•
J
5 1 E 4-' R(0)
8
5





p_clr.w dl ; a = 0
5
5 3 FOR i 4- 1 TO p DO /* Compute k(i) */
4 .
1
p_mov.w #l,d5 ; i = 1
4
mainloop:
c_mov.w #I,d5 ; i = 1
9
; 4 k 4- 0
3
5
p_clr.l d2 ; k = 0 d2=4.12
5





; 6 k <- a * R(i-LADDR)
2 p_mov.w d5,d3 ; d3 = i
2 p_sub.w d6,d3 ; d3 = i-LADDR
5 p_asl.l #1,d3 ; *2 for word addressing
7 p_mov.w 0(a1,d3.w),d2 ; d2=k=R[i-LADDR] d2=16.0
35 p_muls d1,d2 ; d2=k=a*R[i-LADDR] d1=4.12 d2=4.12
; 7 END WHERE
8 EndWhere
; 11 FOR j <- 0 TO logp - 1 DO
2 p_clr.w d4 ; j = 0
2 c_movq #logp-1,d4 ; j = log2(p) - 1
J




12 Cube(d4) ; cube(j)
; 13 DTRin <- k
; 14 TRANSFER
32 + 2NetD Transfer_1(d2,d0)
; 15 k <- k + DTRout
3 p_add.l d0,d2 ; k = k + R[i-LADDR]
; Loop back for "FOR j <- 0 TO logp - 1 DO"
2 p_addq.w #1,d4 ; j++
5/7 c_dbf d4,again
9





4 p_mov.w (a2)+,d3 ; d3 = R[i] d3=16.0
2 p_ext.l d3 ; sign extend d3=16.0
2 p_movq #12,d0 ; shift by 12 to move decimal place
16 p_asl.l d0,d3 ; d3=16.12
3 p_sub.l d2,d3 ; d3 = R[i]-k d3=16.12
2 p_mov.l d3,d2 ; d3=16.12




















} 22 WHERE LADDR = 1 DO
? 23 a +- k /« —k(i)














2 p_mov.w d2,d1 ; a = k










; 26 DTRin <- a
; 27 TRANSFER
6 p_mov.w d1,DTRIN.w ; DTRIN = a
NetD NetworkDelay(0)
; 28 a <- a - k * DTRout
39 p_muls DTROUT.w,d2 ; k = k * a
2 p_movq #12,d0
16 p_asr.l d0,d2









; Loop back for "i <- 1 TO p DO"
inccounters:
2 p_addq.w #1,d5 ; i++
2 c_addq.w #1,d5 ; i++
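For reference, the recursion that the LPC listing distributes across PEs can be written serially. The sketch below is the textbook Durbin recursion in floating point, offered as an aid to reading the listing and not as a transcription of it: the listing works in 4.12 fixed point and forms the inner sum with cube transfers, and the function name and return convention here are assumptions.

```python
def durbin(r, p):
    """Serial Durbin recursion for LPC coefficients.

    r is the autocorrelation sequence R(0)..R(p); returns (a, e), where
    a[i-1] is the predictor coefficient a(i) for 1 <= i <= p and e is the
    final prediction error energy.
    """
    a = [0.0] * (p + 1)          # a[0] is unused
    e = float(r[0])              # E <- R(0)
    for i in range(1, p + 1):
        # k(i) = [R(i) - sum_{j=1}^{i-1} a(j) R(i-j)] / E
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / e
        new_a = list(a)
        new_a[i] = k             # a(i) <- k(i)
        for j in range(1, i):
            # a(j) <- a(j) - k(i) * a(i-j)
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        e = (1.0 - k * k) * e    # E <- (1 - k(i)^2) E
    return a[1:], e


if __name__ == "__main__":
    coefs, err = durbin([1.0, 0.5, 0.25], 2)
    print(coefs, err)
```

In the SIMD version the sum over j is what the logp cube transfers compute, and the update of a(j) from a(i-j) corresponds to the Perm transfer named in the header.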















































; Machine: SIMD, simulated by a MC68000
; Function: This program does a linear time warp
;     on the input data
; Precision: Input: 16-bit signed
;     Output: 16-bit signed
; Number of PEs: Max number of frames.
;     (J or I whichever is greater.)
; Transfers: Shift(-1), Broadcast
; Masking: Data Conditional
; Parameters: J-I, the change in the number of frames
;     p, the number of coefficients per frame
;     NetD, the network delay time.
; Input: PE j holds frame j for
;     0 <= j < number of input frames (J)
; Output: PE i holds frame i for
;     0 <= i < number of output frames (I)
; Cycles: if J>I  325 + (138 + NetD)p
;             + (J-I)[59 + NetD + (29 + NetD)p]
;     if J=I  47 + 11p
;     if J<I  344 + (138 + NetD)p
;             + (I-J)[57 + NetD + (45 + NetD)p]
; Typical Time: 7,382 µs for I-J=10
; Register usage:
;     Starting number of frames
;     Finishing number of frames
;     (physical pe address)
;     Points to current input frame
;     Points to current output frame
ltw:
.c_text
Figure A.8 Program for linear time warping using one frame per PE.


































; Since "factor", "tmp", and "s" are fractions,
; they are represented as fixed decimal by shifting them left
; by Y places. The notation X.Y means there are X bits to
; the left of the decimal and Y bits to the right.
; shift lower 16 bits to upper 14 bits
; since quot. is between .5 and 2 d0=2.14
6 c_ror.l #2,d0 ; Faster to rotate right and
2 c_swap d0 ; swap words
70 c_divu d1,d0 ; d0 <- d0/d1
2 c_mov.w d0,d5 ; factor <- (J-1)/(I-1)
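The 2.14 format used for "factor" can be illustrated with plain integer arithmetic. This is a minimal sketch, assuming the quotient is formed by scaling the numerator by 2^14 before an integer divide, which is what the rotate, swap, and divide sequence above appears to do. The function names are hypothetical.

```python
FRAC_BITS = 14  # the "Y" in the 2.14 format


def to_q2_14(num, den):
    """Integer quotient num/den scaled by 2**14 (truncating divide)."""
    return (num << FRAC_BITS) // den


def from_q2_14(value):
    """Convert a 2.14 fixed-point value back to a float for inspection."""
    return value / (1 << FRAC_BITS)


if __name__ == "__main__":
    factor = to_q2_14(39, 49)     # e.g. (J-1)/(I-1) with J = 40, I = 50
    print(factor, from_q2_14(factor))
```

With two integer bits and fourteen fractional bits, an unsigned value can represent quotients up to just under 4, which covers the range of .5 to 2 mentioned in the listing comment.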












11 p_asl.l #23-16,d1 ; asl.l #23,d2 the fast way
70 p_divu d5,d1 ; d1 <- d1/d5
4 p_add.w #0x1ff,d1 ; find ceiling of d1
2 p_movq #9,d2
12 p_asr.w d2,d1 ; i <- ceil(ADDR/factor)




; IF (I > J) THEN
2 c_cmp.w d7,d6 ; if (I > J)
5/6 c_bgt lab2
; 14 FOR i <- 1 TO I-J
; FOR i <- I-J-1 TO 0 STEP -1
2 c_mov.l d7,d3 ; i <- I
3 c_sub.l d6,d3 ; i <- I-J
4 c_subq.l #1,d3 ; i <- I-J-1
; 13 USE Shift +1
12
9





d 1 = 16.0 
d4=16.0
; At this point, all PEs transfer their "i" values to the
; network, but only the enabled PEs will read the values
p_movq #p+p,d0
c_movq




; 15 WHERE (ADDR < i) DO
Where(d1,GT,d7)
NetworkDelay(0)







; The section of code turns on all PEs so they can write their 










0(a5,d0.w),DTRIN.w ; TRANSFER each R coef,



























































; 25 TRANSFER R to R1
I






11 p_mov.w 0(a5,d0.w),DTRIN.w ; DTRin <- R
NetD NetworkDelay(0)
11 p_mov.w DTROUT.w,0(a1,d0.w) ; R1 <- DTRout
5/7 c_dbf d0,loop3
; 26 T <- (1-s) * R + s * R1
4 p_mov.w #0x4000,d3 ; move a shifted 1 to d3 d3=2.14




p_movq #p+p,d0
2 p_subq.w #2,d0
2 p_mov.w d3,d1 ; d1 <- (1-s)
40 p_muls 0(a5,d0.w),d1 ; d1 <- (1-s) * R d1=2.14
2 p_mov.w d5,d2 ; d2 <- s
40 p_muls 0(a4,d0.w),d2 ; d2 <- s * R1 d2=2.14
3 p_add.l d1,d2
6 p_rol.l #2,d2 ; shift right by 14 by rotating
2 p_swap d2 ; left 2 and swapping









c_cmp.w d7,d6 ; IF (I < J)
5/4 ejdt.s Itwend
; 32 FOR i <- 1 TO J-I
; FOR i <- J-I-1 TO 0 STEP -1





c_sub.w d7,d3 ; i <- J-I
c_subq.w #1,d3 ; i <- J-I-1
for2:
•f





p_mov.w d4,DTRIN.w ; DTRin <- i
NetworkDelay(O)
p_mov.w DTROUT.w,d6 ; i_tmp <-DTRout
;34
26 !














p_mov.w 0(a6,d0.w),DTRIN.w ; DTRin <- T
NetworkDelay(O)












5/7 c_dbf d3,for2 ; FOR i <- J-I-1 TO 0 STEP -1
ltwend:





; Machine: SIMD, simulated by a MC68000
; Function: This program does a linear time warp on
;     the input data
; Precision: Input: 16 bits, signed
;     Output: 16 bits, signed
; Number of PEs: p, the number of coefficients.
; Transfers: Broadcast
; Masking: Data Conditional
; Parameters: p, the number of coefficients.
;     I, the number of output frames.
; Input: PE k contains coefficient k of frame j
;     for 0 <= k < p and 0 <= j < Total number of frames
; Output: PE k contains coefficient k of frame j
;     for 0 <= k < p and 0 <= j < I.
; Cycles: if J != I, 107 + 183I
;     if J = I, 450
; Typical Time: 1,857 µs for I=40.
; Register usage: (* means register is set by the calling routine)
;     d0 pe used by macros








d7* pe WHOAMI (physical pe address)
a4 pe R1
a5* pe R Pointer to input frames






















c_subq.w #1,d0 ; Number of coefs to transfer
6 p_mov.w (a5)+,(a6)+ ; T <- R











2 c_subq.w #1,d0 ; J-1
2 c_subq.w #1,d1 ; I-1




; they are represented as fixed decimal by shifting them left
; by Y places. The notation X.Y means there are X bits to
; the left of the decimal and Y bits to the right.
; shift lower 16 bits to upper 14 bits
; since quot. is between .5 and 2 d0=2.14
6 c_ror.l #2,d0 ; shift to left and then
2 c_swap d0 ; swap to upper byte
70 c_divu d1,d0 ; d0 <- d0/d1 d1=16.0 d0=2.14
2 c_mov.w d0,d5 ; factor <- (J-1)/(I-1) d5=2.14
; BROADCAST d5 From CU to PEs
10 Broadcast(d5,d5)
; 7 FOR i <- 0 TO I-1
; FOR i <- 0 to I-1 in pes and FOR i <- I-1 to 0 STEP -1 in cu
2 p_clr.w d4
2 c_mov.w d1,d4 ; i = I-1
J




2 p_mov.w d4,d0 ; d0=16.0
35 p_mulu d5,d0 ; d5=2.14 d0=2.14
8 p_add.l #0x4000,d0 ; add 1 (factor is still shifted)







2 p_mov.l d0,d1 ; d1 <- tmp d1=2.14
8 p_and.l #0xffffc000,d1 ; d1 == j
;




2 p_mov.l d0,d6 ; s == d6 d6=2.14
; T <- (1-s) * R + s * R1
4 p_mov.w #0x4000,d3 ; move a shifted 1 to d3 d3=2.14
2 p_sub.w d6,d3 ; (1-s) == d3
; Adjust j (d1) so it can index R()
; shift by 13 since indexing words d1=15.0
7 p_rol.l #3,d1 ; It's faster to shift right 3
2 p_swap d1 ; and swap
2 p_subq.w #2,d1 ; of 2 bytes each














; d0 <- (1-s)
; d0 <- (1-s) * R d1 = 2.14
; d2 <- s
; d2 <- s * R1 d2=2.14
; Unshift to normal range d2=16.0
; Save time by shifting left 2 and
; swapping words
; T <- (1-s)*R + s*R1
; i == i + 1
Figure A.9 (Continued)
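The interpolation that the ltw listing carries out in 2.14 fixed point can be sketched in Python as follows. This is an illustration under stated assumptions, not a transcription of the listing: the function name linear_time_warp, the list-of-lists frame layout, 0-origin indexing, and the clamping of the last frame index are choices made here. The scale factor (J-1)/(I-1), the value 0x4000 as a shifted 1, and the blend T <- (1-s)*R + s*R1 are taken from the listing's comments.

```python
FRAC_BITS = 14           # 2.14 format: 14 fractional bits
ONE = 1 << FRAC_BITS     # 0x4000, a "shifted 1"


def linear_time_warp(frames, n_out):
    """Resample J input frames to n_out frames by linear interpolation.

    frames is a list of equal-length lists of integer coefficients
    (hypothetical layout; the listing keeps one coefficient per PE).
    """
    n_in = len(frames)
    if n_in == n_out:
        return [list(f) for f in frames]      # J = I: plain copy
    if n_out < 2 or n_in < 1:
        raise ValueError("need at least two output frames and one input frame")

    # factor <- (J-1)/(I-1), held as a 2.14 fixed-point integer
    factor = ((n_in - 1) << FRAC_BITS) // (n_out - 1)

    warped = []
    for i in range(n_out):
        pos = i * factor                      # fixed-point position in the input
        j = pos >> FRAC_BITS                  # integer part: frame index
        s = pos & (ONE - 1)                   # fractional part: blend weight
        r0 = frames[j]
        r1 = frames[min(j + 1, n_in - 1)]     # clamp at the last frame (assumption)
        # T <- (1-s)*R + s*R1, then shift back to the integer range
        warped.append([((ONE - s) * a + s * b) >> FRAC_BITS
                       for a, b in zip(r0, r1)])
    return warped
```

Because the factor is truncated to 14 fractional bits, the last output frame can differ slightly from the last input frame when (J-1)/(I-1) is not exactly representable.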
















SIMD, simulated by a MC68000 
This program calls the dynamic time 
warp routine.
Input: 16-bit signed 




r, the width of the warping path, 
p, the number of coefficients per frame. 
NetD, the network delay time.
I, the number of frames per utterance. 
All PEs hold all the input data.
PE r holds the distance score.
See text
d5 pe LADDR (logical address)
d7 pe WHOAMI (physical address)
a0 pe Pointer to start of Input data
a1 pe Pointer to start of Unknown data
a2 pe0 Pointer to where to store results.





















Jul 17 08:51 1984 main.s Page 2
.word inf,inf,inf,inf,inf, inf,inf,inf,inf,inf, \
inf,inf,inf,inf,inf, inf,inf,inf,inf,inf
.word inf,inf,inf,inf,inf, inf,inf,inf,inf,inf, \
inf,inf,inf,inf,inf, inf,inf,inf,inf,inf
; Known input utterance
.word 1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3, 4,4,4,4,4
.word 5,5,5,5,5, 6,6,6,6,6, 7,7,7,7,7, 8,8,8,8,8
.word 8,8,8,8,8, 7,7,7,7,7, 6,6,6,6,6, 5,5,5,5,5
.word 4,4,4,4,4, 3,3,3,3,3, 2,2,2,2,2, 1,1,1,1,1
.word 0,0,0,0,0, 0,0,0,0,0, 0,0,0,0,0, 0,0,0,0,0
.word 0,0,0,0,0, 0,0,0,0,0, 0,0,0,0,0, 0,0,0,0,0
.word 0,0,0,0,0, 0,0,0,0,0, 0,0,0,0,0, 0,0,0,0,0
.word 0,0,0,0,0, 0,0,0,0,0, 0,0,0,0,0, 0,0,0,0,0














d: . = .+ 2 ; Local distance
g: . = .+ 2 ; Optimal subpath distance
dold: . = .+ 2 ; Old local distance
gold: . = .+ 2 ; Old optimal subpath distance
dDTR: . = .+ 2 ; d values to transfer
gDTR: . = .+ 2 ; g values to transfer
dup: . = .+ 2 ; d values to transfer to PE with next higher addr.
gup: . = .+ 2 ; g values to transfer to PE with next higher addr.
ddown: . = .+ 2 ; d values to transfer to PE with next lower addr.
Figure A.10 (Continued)
Jul 17 08:51 1984 main.s Page 3
gdown: . = .+ 2 ; g values to transfer to PE with next lower addr.
.word inf,inf,inf,inf,inf, inf,inf,inf,inf,inf, \ 
inf, inf,inf,inf, inf, inf, inf,inf,inf, inf 
.word inf,inf,inf,inf,inf, inf,inf,inf,inf,inf, \ 
inf,inf,inf,inf,inf, inf,inf,inf,inf,inf





8 c_mov.w #0x8000, MASKCTL.w
WHOAMI.w,d7 ; d7 always has WHOAMI in it 
(Physical address)
#inf,d6; Store infinity in d6 
d7,d5 ; Store logical address in d5
#r,d5
#unknown,a0 ; Take data in the order put out by lpc, and 






















; set up stack pointer 
; Initialize CMS
Figure A. 10 (Continued)
; Program Name: dtw (PP1)
; Algorithm: Figure 6.18??
; Machine: SIMD, simulated by a MC68000
; Function: This program does a dynamic time warp
; on the input data
; Precision: Input: 16-bit signed
; Output: 16-bit signed
; Number of PEs: 2r + 1
; Parameters: r, the width of the warping path.
; p, the number of coefficients per frame.
; NetD, the network delay time.
; I, the number of frames per utterance.
; Input: All PEs hold all the input data.
; Output: PE r holds the distance score.
; Cycles: See text
; Typical Time: 10 ms for I=40, r=6, NetD = 18, and p=8
; Register Usage: (* means register is set by the calling routine)
Jul 17 09:09 1984 dtw.s Page 1
d2 pe tmp storage
d5* pe LADDR (logical address)
d6 pe inf (infinity)
d7* pe WHOAMI (physical address)
d7 cu k
a0* pe Pointer to start of Input data
a1* pe Pointer to start of Unknown data
a2* pe0 Pointer to where to store results.
a2++ after stored
The subroutine distance(x,y) returns the distance between
frame x of utterance 1 and frame y of utterance 2.
x and y are passed in d0 and d1.
The result is returned in d0.
#include "defs.h"
#include "simd.h"












Figure A. 10 (Continued)




4 p_mov.w #inf,d6 ; Store infinity in d6
; 10 Xindex <- ceil(ADDR/2)
;
findindex:
2 p_mov.w d5,d0 ; Xindex <- ceil(LADDR/2)
2 p_addq.w #1,d0
4 p_asr.w #1,d0 ; LADDR/2
7 p_asl.w #4,d0 ; Multiply index by 2^4 for p=8.
; 2^4 = autocoef * word size
4 p_add.w d0,a0
; 11 Yindex <- floor(ADDR/2)
;
2 p_mov.w d5,d0 ; Yindex <- floor(LADDR/2)
4 p_asr.w #1,d0
2 p_neg.w d0
7 p_asl.w #4,d0 ; Multiply index by 2^4 for p=8.




; 3 gold <- 0
; 4 d <- ∞
; 5 dold <- ∞
;
8 p_clr.w g.w ; g <- 0
8 p_clr.w gold.w ; gold <- 0
6 p_mov.w d6,d.w ; d <- inf
6 p_mov.w d6,dold.w ; dold <- inf
;






8 p_clr.w g.w ; g <- 0
8 EndWhere
; 13 FOR k <- 1 TO I DO
Figure A.10 (Continued)
Jul 17 09:09 1984 dtw.s Page 3
;
1 c_mov.w #I-1,d7 ; FOR k <- I-1 TO 0 STEP -1 DO
;






; 15 WHERE ADDR is even DO
; 16 dDTR <- dold
; 17 gDTR <- gold
;
2 p_mov.w d5,d0 ; WHERE LADDR is even DO
4 p_and.w #1,d0
36 WhereElse(d0,EQ,#0)













; 23 USE Shift +1
; 24 TRANSFER dDTR TO dup
; 25 TRANSFER gDTR TO gup
;
movedataup:




10 p_mov.w gDTR.w,DTRIN.w ; TRANSFER gDTR TO gup
NetD NetworkDelay(0)
10 p_mov.w DTROUT.w,gup.w
; 27 USE Shift -1
Figure A.10 (Continued)
Jul 17 09:09 1984 dtw.s Page 4
; 28 TRANSFER dDTR TO ddown
; 29 TRANSFER gDTR TO gdown
movedatadown:
12 Shift(#-1) ; TRANSFER dDTR TO ddown
10 p_mov.w dDTR.w,DTRIN.w
NetD NetworkDelay(0)
10 p_mov.w DTROUT.w,ddown.w





; 31 WHERE ADDR = r DO












; 39 gold <- g
; 40 dold <- d
p_mov.w g.w,gold.w ; gold <- g
p_mov.w d.w,dold.w ; dold <- d
; 42 A <- gdown + 2 * ddown
p_mov.w ddown.w,d0 ; A <- gdown + 2 * ddown










; WHERE LADDR = -r DO 
d6,gup.w ; gup <- inf
; WHERE LADDR = + r DO 
; gdown <- inf
>
Figure A.10 (Continued)
; 43 B <- gold + d




; 44 C <- gup + 2 * dup
findC:




; 46 WHERE B < A DO
; 47 A <- B




20 Where(d1,LS,d0) ; g <- min(A,B,C) + d
2 p_mov.w d1,d0
8 EndWhere
Jul 17 09:09 1984 dtw.s Page 5
; 49 WHERE C < A DO
; 50 A <- C
; 51 ENDWHERE











5/7 c_dbf d7,for1 ; FOR d7 <- 1/2-2 TO -1 STEP -1
; 57 WHERE ADDR = 0 DO
; 58 D(A,B) <- g/(I+J)
Figure A.10 (Continued)
; 52 g <- A + d
p_add.w d.w,d0 ; d0 now holds min(d0,d1,d2)
p_mov.w d0,g.w ; g <- min(A,B,C)
Where(d6,LS,d0) ; where(d0 < inf)






















Figure A. 10 (Continued)















SIMD, simulated by a MC68000 
distance finds the distance
between two sets of coefficients
Input: 16-bit signed
Output: 16-bit signed
2r + 1
None
Data Conditional
p, the number of coefficients per frame. 
All PEs hold all the input data.












Typical Time: 474 µs for p=8
Register usage: (* means passed as argument)
d0 pe Used by macros
d1 pe running sum
d1 pe returns total distance
d2 pe Current lpc coefficient
d4 cu loop counter
d5 pe LADDR (logical pe address)
d6 pe inf
d7 pe WHOAMI (physical pe address)
a0* pe points to Input frames





; where( (a0) == inf || (a1) == inf )
p_cmp.w (a0),d6 ; is (a0) == inf ?
p_mov.w sr,d0
p_cmp.w (a1),d6 ; is (a1) == inf
p_mov.w sr,d1









#Pushs + NDataCond ,MASKCTL .w 
#Pushss + DataCond,MASKCTL w
p_mov.w d6,dl ; return(inf)







Jul 17 09:10 1984 distance.s Page 2
p_add.w
p_add.w







4 p_mov.w (a0)+,d2








; sum = 0;
; sum += (input - unknown)^2
Figure A.10 (Continued)
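The surviving comments of the distance routine (test both frames against inf, otherwise accumulate the squared coefficient differences) suggest the following Python sketch. It is an illustration only: the function name frame_distance and the sentinel value 0x7FFF for inf are assumptions made here, since the listing does not show how inf is defined.

```python
INF = 0x7FFF  # assumed sentinel; the listing's actual value of inf is not shown


def frame_distance(known, unknown):
    """Squared Euclidean distance between two coefficient frames.

    A frame whose first word equals INF is treated as padding, mirroring
    the listing's test of (a0) and (a1) against inf before summing.
    """
    if known[0] == INF or unknown[0] == INF:
        return INF                        # return(inf)
    total = 0                             # sum = 0
    for a, b in zip(known, unknown):
        total += (a - b) ** 2             # sum += (input - unknown)^2
    return total
```

A 16-bit implementation would also have to guard against overflow of the running sum; Python integers do not overflow, so that concern is outside this sketch.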
Jul 17 10:23 1984 shuffle.s Page 1
Program Name: shuffle (PP1)
Algorithm: Figure dtw.4??
Machine: SIMD, simulated by a MC68000
Function: This program rearranges the output data
from lpc for input to dtw.
Precision: Input: 16-bit signed
Output: 16-bit signed
Number of PEs: 2r + 1
Transfers: FROMPE0, Shift(-1), Broadcast
Masking: None
Parameters: p, the number of coefficients per frame.
NetD, the network delay time.
I, the number of frames per utterance.
Input: PE i contains coefficient i of frame j
for 0 <= i < p and 0 <= j < I.
Output: All PEs contain all coefficients for all PEs
Cycles: 16 + I[6 + p(47 + NetD) + 2 + 5] + 2 + 6 + 9 [r/:
Typical Time: 5,344 µs
Register usage: (* means passed as argument)
d0 pe Used by macros
d1 pe Current lpc coefficient
d3 cu loop counter
d4 cu loop counter
d6 pe Infinity
a0* pe points to Input frames



















; Total of I frames will be shuffled
c_movq #p-1,d2 ; shift over p lpc coefficients
p_mov.w (a0)+,d1




Broadcast(d1,(a1)+) ; Send data from PE0 to all PEs
Figure A.10 (Continued)












c_asr.w #1,d2 ; FOR i = r/2 - 1 TO 0 STEP -1










Machine: SIMD, simulated by a MC68000
Function: This program does a dynamic time warp 
on the input data
Precision: Input: 16-bit signed
Output: 16-bit signed
Number of PEs: 1
Transfers: None
Masking: Data Conditional
Parameters: r, the width of the warping path, 
p, the number of coefficients per frame.
NetD, the network delay time.
I, the number of frames per utterance.
Input: PE 0 holds all the input data.
Output: PE 0 holds the distance score.
Cycles: See text
Typical Time: 74 ms for p=8, r=6, and I=40
Register Usage: (* means register is set by the calling routine)
d0 pe used by macros
d1 cu x Index into known template
d2 cu y Index into unknown template
d3 pe local distance
d3 cu j
d6 pe inf Infinity
d7 pe WHOAMI (physical pe address)
a1 pe points to known template(x)
a2* pe points to unknown template(y)
a3* pe points to local distances





Data allocation for routine




p_mov.w #inf,d6 ; d6 = infinity
;
; 2 FOR y := 0 TO I-1
;





c_clr.w d2 ; For y := 0 to I-1
; FOR x := -r TO r
;
2 c_clr.w d1 ; For x := -r to r (x = 0 for first pass)










; IF (y+x >= 0) AND (y+x <= 2I-2)
c_mov.w d2,d4 ; if(y+x < 0) continue
c_add.w d1,d4 ; d4 = y + x
c_blt nextpair
c_cmp.w #I,d4 ; if(y+x > 2I-2) continue
c_bge nextpair
; 9 FOR i := 0 TO p-1
2 c_movq #p-1,d3 ; Sum d1 over all PEs
;




; 10 sum := sum + (known[x][i] - unknown[y][i])^2
;
takediff:
4 p_mov.w (a2)+,d1 ; d1 = unknown frame
4 p_sub.w (a1)+,d1 ; d1 = unknown - known
35 p_muls d1,d1 ; d1 = (unknown - known)^2
2 p_add.w d1,d3 ; d3 = sum
5/7 c_dbf d3,takediff
2 c_tst.w d2 ; if y = 0 jump to firstrow
5/6 c_beq firstrow
2 c_tst.w d1 ; if y+x = 0 jump to yedge
5/6 c_beq yedge
;
; 28 A := g[x-1][y-2] + 2d[x][y-1];
; 32 min := A
;
findA:
5 p_mov.w r+r+r+r(a3),d4 ; d(i,j-1)
4 p_asl.w #1,d4 ; 2d(i,j-1)







; B := g[x-2][y-1] + 2d[x-1][y];
4 p_mov.w (a3),d5 ; d(i-1,j)
4 p_asl.w #1,d5 ; 2d(i-1,j)
4 p_add.w (a4),d5 ; g(i-2,j-1) + 2d(i-1,j)
;
;
; 33 WHERE B < A







; C := g[x-1][y-1] + 2d[x][y];
findC:
6 p_mov.w r+r+r+r+2(a4),d5 ; g(i-1,j-1)
2 p_add.w d3,d5 ; g(i-1,j-1) + d(i,j)
;
;
; 36 WHERE C < min









; 40 g[x][y] := d[x][y] + min;
;
findG:
2 p_add.w d3,d4 ; g <- d(i,j) + min(A,B,C)
; 48 WHERE g[x][y] > ∞





2 p_mov.w d6,d4 ; g <- inf
8 EndWhere
; 11 d[x][y] := sum;
;
6 p_mov.w d3,r+r+r+r+2(a3) ; store in d array (2r)
Figure A.11 (Continued)
6 p_mov.w d4,r+r+r+r+r+r+r+r+6(a4)
; store in g array (2(2r+1)-1)
2 p_addq.w #2,a3 ; move d pointer






; FOR y := 0 TO I-1 (cont.)
2 c_addq.w #1,d1 ; x = x + 1















; FOR x := -r TO r (cont.)
p_sub.w #rp+rp+rp+rp,a1 ; x = x - 2rp
; (times 2 for word addressing)
p_addq.w #p+p,a2 ; y = y + 1
c_mov.w #-r,d1 ; x = -r
c_addq.w #1,d2 ; y = y + 1
p_addq.w #2,a3 ; move d pointer (Skip over inf value)
p_addq.w #2,a4 ; move g pointer
p_mov.w d6,r+r+r+r(a4) ; g(i-1,j-2) <- inf
c_cmp.w #I,d2
c_bne nextdist







p_addq.w #p+p,a1 ; move input data pointer
2 p_addq.w #p+p,a2 ; move unknown data pointer
2 p_addq.w #2,a3 ; move d pointer
2 p_addq.w #2,a4 ; move g pointer
5 c_bra.s nextframe
; 20 ELSE IF X = 0 /* Check bottom edge */




c_tst.w d1 ; if x = 0 jump to first column
Figure A.11 (Continued)
5/6 c_beq firstcol
6 p_mov.w r+r+r+r(a3),d4 ; d(i,j-1)
4 p_asl.w #1,d4 ; 2d(i,j-1)
5 c_bra findG
; 15 IF Y = 0 AND X = 0
; 16 g[x][y] := 2 * d[x][y];
firstcol:
6 p_mov.w d3,r+r+r+r+2(a3) ; store in d array (2r)
4 p_asl.w #1,d3 ; store in g array (2(2r+1)-1)
6 p_mov.w d3,r+r+r+r+r+r+r+r+6(a4)
2 p_addq.w #2,a3 ; move d pointer




; 18 IF Y = 0 /* Check left edge */
; 19 min := 2 * d[x][y-1];
yedge:
4 p_mov.w (a3),d4
4 p_asl.w #1,d4 ; 2d(i-1,j)
5 c_bra findG
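The recurrence that the serial listing evaluates (findA, findB, findC, findG) can be summarised by a short Python sketch. It follows the listing's comments: g(0,0) = 2d(0,0); A = g(i-1,j-2) + 2d(i,j-1), B = g(i-2,j-1) + 2d(i-1,j), C = g(i-1,j-1) + d(i,j); g(i,j) = d(i,j) + min(A,B,C); and the final score is g/(I+J). Several things are assumptions made for illustration: the function name dtw_distance, plain (i,j) indexing in place of the listing's offset x in -r..r, floating-point infinity in place of the 16-bit inf, and the treatment of boundary cells other than the origin as unreachable (the listing has separate firstrow, firstcol and yedge cases whose full code is not legible).

```python
import math


def dtw_distance(known, unknown, r):
    """Band-limited dynamic time warp between two lists of frames.

    known, unknown: lists of equal-length coefficient lists.
    r: half-width of the warping path, so only cells with |i - j| <= r are used.
    """
    n_known, n_unknown = len(known), len(unknown)
    inf = math.inf

    def in_band(i):
        return range(max(0, i - r), min(n_unknown, i + r + 1))

    # Local distances d(i,j): sum of squared coefficient differences.
    d = [[inf] * n_unknown for _ in range(n_known)]
    for i in range(n_known):
        for j in in_band(i):
            d[i][j] = sum((a - b) ** 2 for a, b in zip(known[i], unknown[j]))

    g = [[inf] * n_unknown for _ in range(n_known)]
    g[0][0] = 2 * d[0][0]
    for i in range(n_known):
        for j in in_band(i):
            if i == 0 and j == 0:
                continue
            best = inf
            if i >= 1 and j >= 2:
                best = min(best, g[i - 1][j - 2] + 2 * d[i][j - 1])   # A
            if i >= 2 and j >= 1:
                best = min(best, g[i - 2][j - 1] + 2 * d[i - 1][j])   # B
            if i >= 1 and j >= 1:
                best = min(best, g[i - 1][j - 1] + d[i][j])           # C
            g[i][j] = best + d[i][j]

    return g[n_known - 1][n_unknown - 1] / (n_known + n_unknown)
```

Identical utterances give a score of 0, and the score grows as the frames diverge; an unreachable end point yields infinity.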

























SIMD, simulated by a MC68000
This program calls the dynamic time
warp routine.
Input: 16-bit signed
Output: 16-bit signed
2r + 1
Shift(±1)
Data Conditional 
r, the width of the warping path, 
p, the number of coefficients per frame, 
NetD, the network delay time.
I, the number of frames per utterance. 
All PEs hold all the input data.
PE r holds the distance score.
See text
LADDR (logical address) 
WHOAMI (physical address) 
Pointer to start of Input data 
Pointer to start of Unknown data 
Pointer to where to store results. 
a2++ after stored
.p_text
instr: 10 ; Space for PE instructions




.word inf,inf,inf,inf ,0,0,0,0 ,0,0,0,0 ,0,0,0,0
.word inf,inf,inf,inf ,0,0,0,0 ,0,0,0,0 ,0,0,0,0
.word inf,inf,inf,inf ,0,0,0,0 ,0,0,0,0 ,0,0,0,0
.word 2,3,4,5 ,0,0,0,0 ,0,0,0,0 ,0,0,0,0
.word 6,7,8,1 ,0,0,0,0 ,0,0,0,0 ,0,0,0,0
.word 8,7,6,5 ,0,0,0,0 ,0,0,0,0 ,0,0,0,0
.word 4,3,2,1 ,0,0,0,0 ,0,0,0,0 ,0,0,0,0
.word 0,0,0,0 ,0,0,0,0 ,0,0,0,0 ,0,0,0,0
.word 0,0,0,0 ,0,0,0,0 ,0,0,0,0 ,0,0,0,0
.word 0,0,0,0 ,0,0,0,0 ,0,0,0,0 ,0,0,0,0
.word 0,0,0,0 ,0,0,0,0 ,0,0,0,0 ,0,0,0,0
Figure A. 12 Parallel program for DTWing (PP2).
.word inf,inf,inf,inf ,0,0,0,0 ,0,0,0,0 ,0,0,0,0 
.word inf,inf,inf,inf ,0,0,0,0 ,0,0,0,0 ,0,0,0,0 
.word inf,inf,inf,inf ,0,0,0,0 ,0,0,0,0 ,0,0,0,0
Jul 17 08:51 1984 main.s Page 2
libend:
unknown:
.word 1,2,3,4 ,0,0,0,0 ,0,0,0,0 ,0,0,0,0
.word 5,6,7,8 ,0,0,0,0 ,0,0,0,0 ,0,0,0,0
.word 8,7,6,5 ,0,0,0,0 ,0,0,0,0 ,0,0,0,0
.word 4,3,2,1 ,0,0,0,0 ,0,0,0,0 ,0,0,0,0
.word 0,0,0,0 ,0,0,0,0 ,0,0,0,0 ,0,0,0,0
.word 0,0,0,0 ,0,0,0,0 ,0,0,0,0 ,0,0,0,0
.word 0,0,0,0 ,0,0,0,0 ,0,0,0,0 ,0,0,0,0
.word 0,0,0,0 ,0,0,0,0 ,0,0,0,0 ,0,0,0,0
unknownend:
.word inf,inf,inf,inf ,0,0,0,0 ,0,0,0,0 ,0,0,0,0
.word inf,inf,inf,inf ,0,0,0,0 ,0,0,0,0 ,0,0,0,0
.word inf,inf,inf,inf ,0,0,0,0 ,0,0,0,0 ,0,0,0,0
dist:
.p_bss 
. = .+ 10 ; Local distance scores











d: . = .+ 2 ; Local distance
g: . = .+ 2 ; Optimal subpath distance
dold: . = .+ 2 ; Old local distance
gold: . = .+ 2 ; Old optimal subpath distance
dDTR: . = .+ 2 ; d values to transfer
gDTR: . = .+ 2 ; g values to transfer
dup: . = .+ 2 ; d values to transfer to PE with next higher addr.
gup: . = .+ 2 ; g values to transfer to PE with next higher addr.
ddown: . = .+ 2 ; d values to transfer to PE with next lower addr.
gdown: . = .+ 2 ; g values to transfer to PE with next lower addr.
.c_text
main:
Figure A. 12 (Continued)
Jul 17 08:51 1984 main.s Page 3
4 c_mov.w #STACK,a7 ; set up stack pointer
8 c_mov.w #Initialize,MASKCTL.w ; Initialize CMS
8 c_mov.w #0x8000,MASKCTL.w
10 p_mov.w #15,DTRDEST ; Set all transfers to PE15
6 p_mov.w WHOAMI.w,d7 ; d7 always has WHOAMI in it
; (Physical address)

















Jul 17 09:09 1984 dtw.s Page 1
Precision:


















SIMD, simulated by a MC68000 
This program does a dynamic time warp 
on the input data. The local distances 
have already been computed before this 
routine is called.
Input: 16-bit signed 




r, the width of the warping path.
p, the number of coefficients per frame.
NetD, the network delay time.
I, the number of frames per utterance.
All PEs hold all the input data.
PE r holds the distance score.
See text
5.6 ms for r=6, p=8, NetD = 18, and I = 40
(* means passed as parameter)
tmp storage
LADDR (logical address)
inf (infinity)
WHOAMI (physical address)
k
Pointer to local distances
Pointer to where to store results,
a1++ after stored
#include "defs.h"
#include "simd.h"














Figure A. 12 (Continued)
Jul 17 09:09 1984 dtw.s Page 2
dtw:
4 p_mov.w #inf,d6 ; Store infinity in d6
; 2 g <- 0
; 3 gold <- 0
; 4 d <- ∞
; 5 dold <- ∞
;
findstart:
8 p_clr.w g.w ; g <- 0
8 p_clr.w gold.w ; gold <- 0
6 p_mov.w d6,d.w ; d <- inf
6 p_mov.w d6,dold.w ; dold <- inf
; 6 WHERE ADDR = 0 DO
; 7 g <- 0
; 8 ENDWHERE
28 Where(d5,EQ,#0)
8 p_clr.w g.w ; g <- 0
8 EndWhere
;
; 13 FOR k <- 1 TO I DO
4 c_mov.w #I-1,d7 ; FOR k <- I-1 TO 0 STEP -1 DO
for1:
;
; 14 compute d(Xindex,Yindex)
;
8 p_mov.w (a0)+,d.w
; 15 WHERE ADDR is even DO
; 16 dDTR <- dold
; 17 gDTR <- gold
;

10 p_mov.w g.w,gDTR.w
;
; 18 ELSEWHERE
; 19 dDTR <- d
Figure A.12 (Continued)































TRANSFER dDTR TO dup 








; TRANSFER dDTR TO dup









; 27 USE Shift -1
; 28 TRANSFER dDTR TO ddown










; TRANSFER dDTR TO ddown
; TRANSFER gDTR TO gdown
i
; 31 WHERE ADDR = r DO





Where(d5,EQ,#r) ; WHERE LADDR = +r DO 
p_mov.w d6,gdown.w ; gdown <- inf
EndWhere
Figure A. 12 (Continued)
;
; 35 WHERE ADDR = -r DO
; 36 gup <- ∞
; 37 ENDWHERE
;
28 Where(d5,EQ,#-r) ; WHERE LADDR = -r DO
5 p_mov.w d6,gup.w ; gup <- inf
8 EndWhere
Jul 17 09:09 1984 dtw.s Page 4
;
; 39 gold <- g




p_mov.w g.w,gold.w ; gold <- g
10 p_mov.w d.w,dold.w ; dold <- d
; 42 A <- gdown + 2 * ddown
findA:
6 p_mov.w ddown.w,d0 ; A <- gdown + 2 * ddown
4 p_asl.w #1,d0 ; 2*ddown
6 p_add.w gdown.w,d0
; 43 B <- gold + d
findB:




; 44 C <- gup + 2*dup
;
findC:
6 p_mov.w dup.w,d2 ; C <- gup + 2 * dup
4 p_asl.w #1,d2
6 p_add.w gup.w,d2
;
; 46 WHERE B < A DO




26 Where(d1,LS,d0) ; g <- min(A,B,C) + d
2 p_mov.w d1,d0
8 EndWhere
Figure A.12 (Continued)
; 49 WHERE C < A DO
; 50 A <- C





Jul 17 09:09 1984 dtw.s Page 5
;
;
; 52 g <- A + d
;
6 p_add.w d.w,d0 ; d0 now holds min(d0,d1,d2)
6 p_mov.w d0,g.w ; g <- min(A,B,C)
26 Where(d6,HI,d0) ; where(d0 < inf)
6 p_mov.w d6,g.w ; g <- inf
8 EndWhere
incindex:
5/7 c_dbf d7,for1 ; FOR d7 <- 1/2-2 TO -1 STEP -1
; 57 WHERE ADDR = 0 DO








Figure A. 12 (Continued)
Program Name: distance (PP2)
Algorithm: Figure 6.18??
Machine: SIMD, simulated by a MC68000
Function: This program computes the local distances
for the DTW program.
Precision: Input: 16-bit signed
Output: 16-bit signed 
Number of PEs: 2r + 1
Parameters: r, the width of the warping path.
p, the number of coefficients per frame.
NetD, the network delay time.
I, the number of frames per utterance.
Input: PE i contains coefficient i of frame j.
Output: PE i-j+r contains the local distance
between known frame i and unknown frame j. 
Cycles: See text.
Jul 17 09:10 1984 distance.s Page 1
Typical Time: 35 ms for r=6, p=8, NetD = 18, and I=40.
Register usage (* means set by calling routine)
d0 pe used by macros
d1 cu x Index into known template
d2 pe x Index into known template + r
d2 cu y Index into unknown template
d3 cu j
d6 pe inf Infinity
d7 pe WHOAMI (physical pe address)
a1 pe points to known template(x)
a2* pe points to unknown template(y)
a3* pe points to local distances
4
Data allocation for routine 










; FOR i <- 1 TO r/2
p_movq #1,d3 ; FOR i := 1 to r-1 step 2
c_movq #r-2,d3
c_asr.w #1,d3 ; d3 = r/2 - 1
; 4 WHERE |LADDR| > i DO
; 5 d[dptr] <- ∞;
; 6 dptr <- dptr + 1;




26 Where(d5,GT,d3) ; where(LADDR > i)
4 p_mov.w d6,(a3)+ ; (a3) <- inf
8 EndWhere
2 p_neg.w d3
26 Where(d5,LT,d3) ; where(LADDR < -i)
Jul 17 09:10 1984 distance.s Page 2








; i:= i + 1
; FOR y <- 0 TO I-1
;
2 c_clr.w d2 ; For y := 0 to I-1
;
; 10 FOR x <- -r TO r
;
2 c_clr.w d1 ; For x := -r to r (x = 0 for first pass)
4 p_mov.w #r,d2 ; d2 in PE = d1+r in CU





; IF y+x >= 0 AND y+x <= 2I-2
2 c_mov.w d2,d4 ; if(y+x < 0) continue
2 c_add.w d1,d4 ; d4 = y + x
5/6 c_blt nextpair






; sum <- (known[x] - unknown[y])^2;
4 p_mov.w (a2),d1 ; d1 = unknown frame
4 p_sub.w (a1)+,d1 ; d1 = unknown - known
35 p_muls d1,d1 ; d1 = (unknown - known)^2
; 13 FOR k <- 0 TO logN-1
;
notinf:
2 c_movq #logN-1,d3 ; Sum d1 over all PEs
2 p_clr d3
Figure A.12 (Continued)





; 15 DTRIN <- sum;
; 16 TRANSFER;














c_blt.s easy ; If destination PE is >= N,
•26 USE Shift -he 4r
12
















; WHERE x + r = ADDR /* Enable PE that will use the */
; d[dptr] <- sum; /* distance score */
; dptr <- dptr + 1;
; ENDWHERE




c_addq.w #1,d1 ; x = x + 1
Figure A.12 (Continued)
Jul 17 09:10 1984 distance.s Page 4




6 p_sub.w #r+r+r+r,a1 ; x = x - 2r
; (times 2 for word addressing)
6 p_add.w #2,a2 ; y = y + 1
4 c_mov.w #-r,d1 ; x = -r
2 p_clr.w d2 ; x = 0 in pe
2 c_addq.w #1,d2 ; y = y + 1
4 c_cmp.w #I,d2
5/6 c_bne nextdist
; Stick infinities on the end of each list
;
; 34 FOR i <- 1 to r/2
;
2 c_movq #r-2,d3
4 c_asr.w #1,d3 ; d3 = r/2 - 1
pad2:
;
; 35 d[dptr] <- ∞;
; 36 dptr <- dptr + 1;
;











Machine: SIMD, simulated by a MC68000
Function: This is the main routine. It calls filter()
and auto() to preemphasize the signal and find
the autocorrelation coefficients. If R(0)
(the energy) is greater than lothresh, it calls
lpc(). This main routine also does the
endpoint detection. After an utterance is
detected, ltw() and dtw() are called.
Precision: Input: 16-bit signed
Output: 16-bit signed
Number of PEs: 100
Parameters: N, the frame size.
autocoef, the number of autocorrelation coefs.
r, the width of the warping path.
p, the number of LPC coefficients.
NetD, the network delay time.
I, the number of frames per utterance.
VOCABSIZE, the size of the vocabulary.
Input: Sample i mod N is in PE i.
Output: One distance score per PE.
Cycles: See text.
Typical Time: See text.
Register usage:
d0 cu M (number of frames in word)
d7 pe WHOAMI (physical address)
d7 cu i
a0 pe pointer to input data
a1 pe pointer to output data
The data is stored as follows:
Routine   Number    Data Storage
          of PEs    Input                              Output
filter    100       1 sample/PE                        1 sample/PE
auto      100       1 sample/PE                        Each PE has all coefs.
lpc       8         Each PE has all coefs              PE i has lpc coef i
ltw       8         PE i has coef i from each frame.   SAME
shuffle   100                                          Each PE has all coefs.
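The two-threshold endpoint detection mentioned in the Function entry can be sketched in Python as below. Only part of the logic survives in the listing (frames with energy >= lothresh are passed to lpc() and counted in M, and found is set once the energy reaches hithresh), so the handling of a frame that falls below lothresh is an assumption made here: the candidate word is accepted if found is set and discarded otherwise. The function name detect_utterance and its return convention are also hypothetical.

```python
def detect_utterance(energies, lothresh, hithresh):
    """Return (start, end) frame indices of the first detected word, or None.

    energies: per-frame energy values R(0).
    A candidate word starts at the first frame with energy >= lothresh and
    is kept only if some frame in it reaches hithresh (found = true).
    """
    start = None      # index of the first frame of the current candidate
    found = False     # becomes True once energy >= hithresh
    for k, energy in enumerate(energies):
        if energy >= lothresh:
            if start is None:
                start = k
            if energy >= hithresh:
                found = True
        else:
            if found:
                return (start, k)      # word ends before frame k (assumed rule)
            start = None               # too quiet throughout: discard candidate
    if found:
        return (start, len(energies))
    return None
```

For example, with lothresh = 4 and hithresh = 10 the energy sequence 0, 5, 12, 7, 1 yields frames 1 to 3 as the word, and the frame count M of the listing corresponds to end - start.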




inst: .=.+ 10 ; Space where pe instructions are broadcast to
.globl WHOAMI




; Physical addresses (stored in d7)
WHOAMI: .word 0,1,2,3,4,5,6,7, 8,9,10,11,12,13,14,15
; Input signal
input: .word 1,2,3,4,5,6,7,8, 9,10,11,12,13,14,15,16
.word 11,12,13,14,15,16,17,8, \ 
9,10,11,12,13,14,15,16 
.word 1,-2,-3,-4,-5,-6,-7,-8, \
-9,-10,-11,-12,-13,-14,-15,-16
.word 0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0
.word 1,1,1,1,1,1,1,1 ,1,1,1,1,1,1,1,1
lib: .word 1,2,3,4,5,6,7,8, 9,10,11,12,13,14,15,16
.word 11,12,13,14,15,16,17,8, \ 
9,10,11,12,13,14,15,16 


























. = .+ 2 ; Filtered signal
. = .+ autocoef+autocoef ; autocorrelation coefficients
. = .+ MAXFRAMES ; autocorrelation coefficients
. = .+ MAXFRAMES ; output of ltw
. = .+ MAXFRAMES ; output of ltw arranged for dtw
.c_bss
. = .+ 2 ; == 1 if R(0) > hithresh
. = .+ 2 ; number of utterances compared
.c_text
c_mov.w #Initialize,MASKCTL.w ; Initialize CMS
c_mov.w #0x8000,MASKCTL.w ; stop mask unit
p_mov.w #STACK,a7 ; set up PE stack pointer
c_mov.w #STACK,a7 ; set up CU stack pointer
p_mov.w WHOAMI.w,d7 ; d7 always has WHOAMI in it
; WHILE(TRUE)
p_mov.w #input,a0 ; Pointer to input data
p_mov.w #lpcout,a1 ; Pointer to output lpc coefs.
; found <- FALSE;
; M <- 0;
Figure A.13 (Continued)
8 c_clr found.w ; found := false;
;
nextframe:
c_clr d0 ; M = 0
4 p_mov.w a0,-(a7) ; push a0 on stack
4 p_mov.w a1,-(a7) ; push a1 on stack
4 c_mov.w d0,-(a7) ; push d0 on stack
;
; 9 filter(input[], filout);
;
filterdo:
4 p_mov.w #sig,a1 ; a0 points to input data
10 c_jsr filter
; auto(filout,R[]);




4 p_mov.w (a7)+,a1 ; pull a1 off stack
4 p_mov.w (a7)+,a0 ; pull a0 off stack
4 c_mov.w (a7)+,d0 ; pull d0 off stack
6 p_add.w #2,a0










; 18 TOCU <- R[0];
; 19 energy <- FROMPE0;
;
14 p_mov.w R,TOCU ; Get R(0) from PE0 and compare it
NetD NetworkDelay(0) ; to lothresh and hithresh to see if
8 c_mov.w FROMPE0,d1 ; a word is present.
;







; if energy >= lothresh then
; 27




c_cmp #hithresh,d1 ; if energy >= hithresh then
Figure A.13 (Continued)
5/4 c_blt.s lab1
8 c_mov #1,found.w ; found := true;
;








4 p_mov.w #R,a1 ; The energy is greater than lothresh
10 c_jsr lpc ; find the lpc coefficients
; Return lpc coef., in d1, one per PE
lpcdone:
4 p_mov.w (a7)+,a1 ; pull a1 off stack
4 p_mov.w (a7)+,a0 ; pull a0 off stack
4 c_mov.w (a7)+,d0 ; pull d0 off stack
4 p_mov.w d1,(a1)+ ; Save lpc coef., one per PE
;
; M <- M + 1;
;
2 c_addq.w #1,d0 ; Add one to frame count
5 c_bra nextframe ; Get next frame
;

















Figure A.13 (Continued)
; N (M is in d0)
; Address of II
; Address of Tout
p_mov.w #15,DTRDEST.w ; So all PEs will transfer to PE15
p_mov.w a0,-(a7) ; push a0 on stack
p_mov.w a1,-(a7) ; push a1 on stack








#Tout,a0 ; Take data in the order put out
#fixed,a1 ; by ltw and arrange it for the DTW
shuffle ; program.
4 p_mov.w #lib,a0 ; a0 points to known utterance in library














#fixed,a1 ; a1 points to the unknown utterance.
















APPENDIX B: VLSI Processor Array Assembly Language Programs
Purpose: The XX programming language is a simplified sequential
programming language for defining the codes for processing
elements of the CHiP computer.
Activity: Files are created or modified using a conventional UNIX editor.
The files are named <name>.x where <name> is the
name of a program referred to in the code names entries.
For convenience in referring to Poker state information on
the BitGraph display, it is recommended that XX program
files be developed on the secondary (character) Poker
display.
Programs: XX programs begin with a preamble that gives the program
name, the formal parameters, trace variables and the port
names. The preamble is followed by the program body
block:
<program> ::= code <id> <parmlist>; <tracelist> <portlist> <body>
<parmlist> ::= (<idlist>) | Λ
<tracelist> ::= trace <idlist>; | Λ
<portlist> ::= ports <idlist>; | Λ
<idlist> ::= <id>, <idlist> | <id>
<body> ::= begin <declarations> <statlist> end.
where the parameters and trace identifiers are limited to a
list of at most four identifiers separated by commas and
the port list is limited to a list of 8 identifiers separated by
commas. The identifier following code names the program
and should match the <name> of the file and the <name>
used in the Code names entries. The parameters are formal
parameters that correspond one-to-one to the actual
parameters stored in the Code Names/Parameters entries
of the PEs; each formal must be declared in the <declarations>
section of the <body>. The trace list identifiers
have their values displayed during tracing and they must
be declared in the <declarations> section of the <body>.
The port list identifiers are the symbolic port names that
are assigned physical positions in the Port Names entries,
and they must be declared in the <declarations> section of
the <body>.
Declarations: There are four data types: signed integers (32 bits), signed
reals (32 bits), characters (8 bits) and Booleans (1 bit).
Except for statement label identifiers, all identifiers,
including those appearing in the preamble, must be
declared. Simple identifiers are scalar values of the indicated
type and identifiers followed by [<unsignint>] are
vectors of length <unsignint> of scalar values of the indicated
type:
<declarations> ::= <decl>; <declarations> | Λ
<decl> ::= <type> <varlist>

<type> ::= real | int | bool | char
<varlist> ::= <varid>, <varlist> | <varid>
<varid> ::= <id> | <id> [ <unsignint> ]
where no <id> appears more than once.
The statements are:
<statlist> ::= <lstatement>; <statlist> | <lstatement>
<lstatement> ::= <id>: <statement> | <statement>
<statement> ::= <assignment> | <conditional> |
<while> | <break> | <for> | <compound> | <io>
where <id> is used for tracing rather than the target of
goto.
The Assignment statement
<assignment> ::= <varid> := <expression>
where the coercion to the left-hand side identifier type is
provided as described in Table 1.
In the Conditional statement
<conditional> ::= if <expression> then <lstatement>
else <lstatement> | if <expression>
then <lstatement>
the <expression> must evaluate to a Boolean value and an
else is associated with the immediately preceding then.
In the While statement
<while> ::= while <expression> do <lstatement>
the expression must evaluate to a Boolean value. To assist
in synchronization the compiler recognizes the construction
while true do <lstatement> as a special case and does
not generate the conditional branch code.
The Break statement
<break> ::= break
has meaning only within the <lstatement> of a While statement,
and causes control to skip to the statement following
the immediately surrounding While statement.
In the For statement
<for> ::= for <id> := <expression> to <expression> do
<lstatement>
the two expressions, the lower and upper limits of the
iteration, respectively, are evaluated once prior to beginning
the loop. If the lower and upper limits are not
integers, they are coerced to integers as described in Table
1.
Notice that the Compound statement
<compound> ::= begin <statlist> end

<io> ::= <id> <- <id>
are restricted to simple variables, exactly one of which
must be a port name. If the port name appears on the
right, the statement reads from the indicated port; if the
port name appears on the left, the statement writes to the
indicated port. Data type consistency is not enforced
across the communication links.
Expressions: The expressions
<expression> ::= <expression> <binary> <expression> |
<unary> <expression> |
<expression> <relational> <expression> |
<builtin> (<expression>) |
(<expression>) |
<unsignint> | <unsignreal> | <character> |
<boolean>
have precedence and association as in the C programming
language. Expressions of mixed type are coerced to the
higher type, where types are ranked bool < char < int <
real, as described in Table 1a. The operators are given in
Table 1b.
bool -> char: The Boolean bit becomes the
least significant bit; others are 0.
char -> bool: The least significant bit
forms the Boolean.
char -> int: The 8 character bits become
least significant bits; others are 0.
int -> char: The eight least significant
bits form the character.
int -> real: Converted to floating point
notation.
real -> int: The floating point value is
truncated and converted to integer form. All other
conversions are performed transitively.
Table 1. Semantics of representation conversion; conversions not listed 
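The conversion rules in Table 1 can be illustrated with a small Python model. This is a sketch for exposition, not part of the XX compiler: the function name coerce, the use of Python integers 0..255 to stand for characters, and the string type tags are assumptions made here. Each adjacent conversion follows the table, and any other conversion is composed transitively from adjacent steps, as the table states.

```python
ORDER = ["bool", "char", "int", "real"]   # ranking: bool < char < int < real

# One function per adjacent conversion listed in the table.
STEPS = {
    ("bool", "char"): lambda v: 1 if v else 0,   # bit becomes the LSB, others 0
    ("char", "bool"): lambda v: bool(v & 1),     # LSB forms the Boolean
    ("char", "int"): lambda v: v & 0xFF,         # 8 bits become the low bits
    ("int", "char"): lambda v: v & 0xFF,         # eight LSBs form the character
    ("int", "real"): float,                      # to floating point
    ("real", "int"): int,                        # truncated to integer
}


def coerce(value, src, dst):
    """Convert value from type src to type dst through adjacent steps."""
    i, j = ORDER.index(src), ORDER.index(dst)
    step = 1 if j > i else -1
    while i != j:
        value = STEPS[(ORDER[i], ORDER[i + step])](value)
        i += step
    return value
```

For instance, coerce(3.9, "real", "char") first truncates to the integer 3 and then keeps its low eight bits, while coerce(True, "bool", "real") passes through char and int to give 1.0.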




+ <real> no op
- <real> negation
~ <char> not
The type indicates the highest
type for which the operation
is defined; the operation is
defined for all lower types
binary
<real> + <real>
<real> - <real>
<real> * <real>
<real> / <real>
<real> mod <real>
<real> >= <real>
<real> > <real>
<real> =/ <real>
<real> < <real>
<real> <= <real>
<real> = <real>
<char> & <char>


















The constants are unsigned integers and reals in standard
formats, quoted (') characters and true and false.
All identifiers begin with a letter and are followed by
any combination of letters and numerals. The maximum
length of an identifier is 10 symbols.
Vectors can only be subscripted by character or integer
types and are referenced using 1 origin.
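The identifier rule stated above (a letter followed by any combination of letters and numerals, at most 10 symbols in all) is easy to check mechanically. The following Python sketch is an illustration only; the function name is_valid_xx_identifier is hypothetical, and restricting "letter" to the ASCII letters A-Z and a-z is an assumption.

```python
import re

# A letter, then up to nine more letters or digits: 10 symbols at most.
_XX_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]{0,9}")


def is_valid_xx_identifier(name):
    """Return True if name satisfies the XX identifier rule quoted above."""
    return _XX_IDENTIFIER.fullmatch(name) is not None
```

Under this rule gdown and dDTR2 are accepted, while 2fast (leading digit), d_old (underscore) and an eleven-character name are rejected.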




Mnemonic             Description                                 Byte  Cyc
MOV   direct,#data   Move immediate data to direct byte          3     2
MOV   @Ri,A          Move Accumulator to indirect RAM            1     1
MOV   @Ri,direct     Move direct byte to indirect RAM            2     2
MOV   @Ri,#data      Move immediate data to indirect RAM         2     1
MOV   DPTR,#data16   Load Data Pointer with a 16-bit constant    3     2
MOVC  A,@A+DPTR      Move Code byte relative to DPTR to A        1     2
MOVC  A,@A+PC        Move Code byte relative to PC to A          1     2
MOVX  A,@Ri          Move External RAM (8-bit addr) to A         1     2
MOVX  A,@DPTR        Move External RAM (16-bit addr) to A        1     2
MOVX  @Ri,A          Move A to External RAM (8-bit addr)         1     2
MOVX  @DPTR,A        Move A to External RAM (16-bit addr)        1     2
PUSH  direct         Push direct byte onto stack                 2     2
POP   direct         Pop direct byte from stack                  2     2
XCH   A,Rn           Exchange register with Accumulator          1     1
XCH   A,direct       Exchange direct byte with Accumulator       2     1
XCH   A,@Ri          Exchange indirect RAM with A                1     1
XCHD  A,@Ri          Exchange low-order Digit ind RAM w A        1     1
BOOLEAN VARIABLE MANIPULATION 
Mnemonic Description Byte Cyc
CLR C Clear Garry flag ; ; 11
CLR bit Clear direct bit 2 1
SETB C Set Carry flag 1 1
SETB bit Set direct Bit 2 1
CPL C Complement Carry flag 1 1
CPL bit Complement direct bit 2 1
ANL C,bit AND direct bit to Carry
flag 2 2
ANL C,/bit AND complement of
direct bit to Carry 2 2
ORL C,bit OR direct bit to Carry
flag 2 2
ORL C,/bit OR complement of
direct bit to Carry 2 2
MOV C,bit Move direct bit to Carry
flag 2 1
MOV bit,C Move Carry flag to
direct bit 2 2
PROGRAM AND MACHINE CONTROL
Mnemonic Description Byte Cyc
ACALL addr11 Absolute Subroutine
Call 2 2
LCALL addr16 Long Subroutine Call 3 2
RET Return from subroutine 1 2
RETI Return from interrupt 1 2
AJMP addr11 Absolute Jump 2 2
LJMP addr16 Long Jump 3 2
SJMP rel Short Jump (relative
addr) 2 2
JMP @A+DPTR Jump indirect relative to
the DPTR 1 2
JZ rel Jump if Accumulator is
Zero 2 2
JNZ rel Jump if Accumulator is
Not Zero 2 2
JC rel Jump if Carry flag is set 2 2
JNC rel Jump if No Carry flag 2 2
JB bit,rel Jump if direct Bit set 3 2
JNB bit,rel Jump if direct Bit Not
set 3 2
JBC bit,rel Jump if direct Bit is set
& Clear bit 3 2
CJNE A,direct,rel Compare direct to A &
Jump if Not Equal 3 2
CJNE A,#data,rel Comp. immed. to A &
Jump if Not Equal 3 2
CJNE Rn,#data,rel Comp. immed. to reg &
Jump if Not Equal 3 2
CJNE @Ri,#data,rel Comp. immed. to ind. &
Jump if Not Equal 3 2
DJNZ Rn,rel Decrement register &
Jump if Not Zero 2 2
DJNZ direct,rel Decrement direct &
Jump if Not Zero 3 2
NOP No operation 1 1
Notes on data addressing modes:
Rn - Working register R0-R7
direct - 128 internal RAM locations, any I/O port,
control or status register
@Ri - Indirect internal RAM location addressed by
register R0 or R1
#data - 8-bit constant included in instruction
#data16 - 16-bit constant included as bytes 2 & 3 of
instruction
bit - 128 software flags, any I/O pin, control or
status bit
Notes on program addressing modes:
addr16 - Destination address for LCALL & LJMP may
be anywhere within the 64-K program
memory address space
addr11 - Destination address for ACALL & AJMP will be
within the same 2-K page of program
memory as the first byte of the following
instruction
rel - SJMP and all conditional jumps include an 8-bit
offset byte. Range is +127/-128 bytes relative
to first byte of the following instruction
All mnemonics copyrighted © Intel Corporation 1979
Figure B.2 8051 instruction set description and timings. (From [Intel].)
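The program addressing notes in Figure B.2 can be checked numerically. The following Python sketch is illustrative only; the function names are hypothetical and the addresses in the usage note are arbitrary examples.

```python
def rel_offset(next_instr_addr: int, target: int) -> int:
    """Offset byte for SJMP and the conditional jumps.

    The offset is measured from the first byte of the following
    instruction and must lie in the range -128..+127.
    """
    offset = target - next_instr_addr
    if not -128 <= offset <= 127:
        raise ValueError("target is outside the +127/-128 byte range")
    return offset & 0xFF  # two's-complement byte stored in the instruction

def same_2k_page(next_instr_addr: int, target: int) -> bool:
    """True if ACALL/AJMP can reach target (same 2-K page of program memory)."""
    return (next_instr_addr >> 11) == (target >> 11)
```

For example, the three-byte instruction `jnb p0.7,$` placed at 8000h jumps back to itself, so its offset is computed from 8003h and comes out as FDh (that is, -3).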
LOGICAL OPERATIONS (CONTINUED)
Mnemonic Description Byte Cyc
ORL A,@Ri OR indirect RAM to
Accumulator 1 1
ORL A,#data OR immediate data to
Accumulator 2 1
ORL direct,A OR Accumulator to
direct byte 2 1
ORL direct,#data OR immediate data to
direct byte 3 2
XRL A,Rn Exclusive-OR register to
Accumulator 1 1
XRL A,direct Exclusive-OR direct
byte to Accumulator 2 1
XRL A,@Ri Exclusive-OR indirect
RAM to A 1 1
XRL A,#data Exclusive-OR
immediate data to A 2 1
XRL direct,A Exclusive-OR Accumulator
to direct byte 2 1
XRL direct,#data Exclusive-OR immediate
data to direct 3 2
CLR A Clear Accumulator 1 1
CPL A Complement
Accumulator 1 1
RL A Rotate Accumulator Left 1 1
RLC A Rotate A Left through 
the Carry flag 1 1
RR A Rotate Accumulator 
Right 1 1
RRC A Rotate A Right through 
Carry flag 1 1
SWAP A Swap nibbles within the 
Accumulator 1 1
DATA TRANSFER
Mnemonic Description Byte Cyc
MOV A,Rn Move register to
Accumulator 1 1
MOV A,direct Move direct byte to
Accumulator 2 1
MOV A,@Ri Move indirect RAM to
Accumulator 1 1
MOV A,#data Move immediate data to
Accumulator 2 1
MOV Rn,A Move Accumulator to 
register 1 1










direct byte 2 1
MOV direct,Rn Move register to direct 
byte 2 2
MOV direct,direct Move direct byte to 
direct 3 2
MOV direct,@Ri Move indirect RAM to 













ADD A,@Ri Add indirect RAM to
Accumulator 1 1
ADD A,#data Add immediate data to
Accumulator 2 1
ADDC A,Rn Add register to
Accumulator with Carry 1 1
ADDC A,direct Add direct byte to A
with Carry flag 2 1




ADDC A,#data Add immediate data to
A with Carry flag 2 1
SUBB A,Rn Subtract register from A
with Borrow 1 1
SUBB A,direct Subtract direct byte
from A with Borrow 2 1
SUBB A,@Ri Subtract indirect RAM
from A with Borrow 1 1
SUBB A,#data Subtract immed. data
from A with Borrow 2 1
INC A Increment Accumulator 1 1
INC Rn Increment register 1 1
INC direct Increment direct byte 2 1
INC @Ri Increment indirect RAM 1 1
INC DPTR Increment Data Pointer 1 2
DEC A Decrement Accumulator 1 1
DEC Rn Decrement register 1 1
DEC direct
DEC @Ri






MUL AB Multiply A & B 1 4
DIV AB Divide A by B 1 4




Mnemonic Description Byte Cyc








ANL A,@Ri AND indirect RAM to
Accumulator 1 1














ORL A,direct OR direct byte to
Accumulator 2 1
Figure B.2 (Continued)
The 8051 has two built-in timers, (0 and 1). The following are the special
function register locations used to operate timer 1.
tcon 88h ; timer control register
tmod 89h ; timer mode register
tl1 8bh ; timer register LSB
th1 8dh ; timer register MSB
To run a timed loop, first:
mov tmod,#10h
to set timer 1 to no gate and 16 bit mode. The time for the loop is set by
LOOPTIME equ 150-7
where each loop will take 150 µs and 7 µs is the overhead to restart the timer.
At the beginning of the loop use:
loop: clr a ; Clear a register
mov tcon,a ; Stop timer
mov tl1,a ; Clear timer
mov th1,a
setb tcon.6 ; Start timer
At the end of the loop, wait for the timer by using:
; Wait for MSB of LOOPTIME 
; and timer 1 to match
; xor LSBs to see if they are the same 
; move least significant bit 
; into carry bit
; Sync with timer, since the cjne takes
; 2 µs, there is a 50/50 chance the
; least significant bits of the timer 
; and LOOPTIME will not match, this 
; comparison should sync the program 
; up with the timer so the least 
; significant bits will always match.























Figure B.3 Using a built-in timer to control loop time.
This works for most values of LOOPTIME; of course, if LOOPTIME is shorter
than the time for one loop, it will not work at all. It also fails if LOOPTIME
equals 256, for example, since 256 = 100h: the MSBs match at the same
instant that the LSBs match, but 2 µs pass before the LSBs are compared,
so by then they no longer match. For this case, only the MSBs need to be
compared.
If LOOPTIME is carefully chosen, the built-in timers can synchronize
two cells which are executing different code.
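The LOOPTIME arithmetic can be sketched as follows. This is a Python illustration, not part of the 8051 code; the function names are hypothetical, and the 7 µs restart overhead is the value quoted in Figure B.3.

```python
def looptime(period_us: int, restart_overhead_us: int = 7) -> int:
    """Timer count to wait for, e.g. 150 - 7 as in 'LOOPTIME equ 150-7'."""
    return period_us - restart_overhead_us

def split16(value: int):
    """Return (MSB, LSB) of a 16-bit timer value."""
    return (value >> 8) & 0xFF, value & 0xFF
```

Here `split16(256)` gives an LSB of zero, which is the case described above: the LSB has already matched at the instant the MSB matches, so a later LSB comparison can never succeed.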
Figure B.3 (Continued)
The following gives examples, written in 8051 assembler code, of how the 
8051 writes four bytes of data to the switch and how it reads one byte from the 
switch.
When writing a byte to the switch, the 8051 first writes the 3 bit direction
tag to port 1 (p1) and the 8 bit data to external RAM location lowSWLat. The
direction tag tells which port is being written to, where 0 is the north port, 1 is
northeast, and so on. The Switch hardware polls the output latches on all the
cells and when data appears in a given latch, the Switch looks into a table to
find where to send the data. Then it writes the data, and a tag telling from
where it came, into the input queue of the destination cell.
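The direction tag can be modelled as a small lookup table. In this Python sketch only north = 0 and northeast = 1 come from the text above; east = 2 and south = 4 are suggested by comments in the listings, and the remaining values assume that the numbering simply continues clockwise. The names are hypothetical.

```python
# Tag values in assumed clockwise order, starting from north = 0.
DIRECTIONS = ["north", "ne", "east", "se", "south", "sw", "west", "nw"]
TAG = {name: value for value, name in enumerate(DIRECTIONS)}

def direction_tag(name: str) -> int:
    """Return the 3-bit direction tag written to port 1 before a switch write."""
    tag = TAG[name]
    if not 0 <= tag <= 7:  # must fit in three bits
        raise ValueError("direction tag does not fit in 3 bits")
    return tag
```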
Here is an example of how to send four bytes of data, stored in internal
RAM, out of the north port. The numbers to the left of the instructions are
the execution times in µs. The syntax for a move is: mov destination,source.
2 mov dptr,#lowSWLat ; Have dptr point to the Switch
; Lattice port in external RAM.
2 mov p1,#north ; 8051 built-in port one (p1) is where
; the three bit direction tag is written.
1 mov a,byte0 ; Move first byte from internal
; RAM into accumulator (a).
2 movx @dptr,a ; Store accumulator at location that dptr
; points to, which is the switch lattice.
11 lcall writedelay ; It takes the switch 12 µs to poll
; all the processors, so wait 12 µs
; to be sure the data have been sent.
1 mov a,byte1 ; get next data byte and write to switch.
2 movx @dptr,a ; Notice dptr does not have to be reset,
; nor does p1
11 lcall writedelay ; Wait again
1 mov a,byte2
2 movx @dptr,a ; Send third byte
11 lcall writedelay
1 mov a,byte3
2 movx @dptr,a ; send last byte
Figure B.4 Example of 8051 code for inter-cell communication.
The call to writedelay takes 11 µs and the mov instruction takes 1 µs, giving
the 12 µs delay needed between writes to the switch. The total time to write
one 32 bit word is 49 µs, assuming no writes follow (thus the missing call to
writedelay after the last movx @dptr,a). The three calls to writedelay give a
total of 33 µs spent waiting on the switch. In some applications, this time can
be used for useful operations. Data, by convention (not hardware restriction),
is sent least significant byte (LSB) first.
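The 49 µs total can be reproduced from the per-instruction times in Figure B.4. The Python sketch below is only a bookkeeping aid; the constant and function names are hypothetical.

```python
SETUP_US = 2 + 2      # mov dptr,#lowSWLat and mov p1,#north
BYTE_US = 1 + 2       # mov a,byteN and movx @dptr,a
WRITEDELAY_US = 11    # lcall writedelay, including the return

def write_time_us(nbytes: int) -> int:
    """Time to send nbytes to one port, with no delay after the last byte."""
    return SETUP_US + nbytes * BYTE_US + (nbytes - 1) * WRITEDELAY_US

def bytes_lsb_first(word: int, nbytes: int = 4):
    """Split a word into bytes in transmission order (LSB first)."""
    return [(word >> (8 * k)) & 0xFF for k in range(nbytes)]
```

For a four-byte word, `write_time_us(4)` gives 4 + 12 + 33 = 49, and `bytes_lsb_first(0x12345678)` lists the bytes in the order 78h, 56h, 34h, 12h.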
To read from the Switch:
2 mov dptr,#lowSWLat ; Same as writing
2 jnb p0.7,$ ; Test bit 7 of 8051 port 0.
; If not set, there is no data,
; so keep testing until there is some.
2 movx a,@dptr ; Get one byte of data.
1 mov byte0,a ; Save
1 mov a,p0 ; Read port 0 to see from which
; direction it came.
The program will loop on the jnb instruction until data arrives in the 
queue. Reading an empty queue is a fatal error. In some programs, owing to 
the structure of the program, the data is always in the queue when the switch 
is read, so checking port 0 bit 7 is not necessary. Otherwise it takes at least 2 














; Location of first argument 
; Location of switch lattice
Figure B.5 Contents of ports.h.













; Wait 12 microseconds
; Each nop takes 1 microsecond to execute.
; The call to writedelay takes 2 microseconds.
; The return take 2 microseconds.
; Calling and returning from "writedelay" takes a
; total of 11 microseconds. This leaves one
; microsecond for the calling program to do
; a register move.
; Wait for something to appear in input buffer
;
readwait:
jnb p0.7,$ ; Check bit 7 of port 0
ret ; If it is zero there is no data in the queue
; so jump back and check again
; If it is not zero, return
Figure B.6 Contents of util.h.
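The body of writedelay is not reproduced above, but its length follows from the timing comments. The Python sketch below assumes the 11 µs call time quoted with Figure B.4 and a body made only of one-microsecond nops; the function name is hypothetical.

```python
def nops_in_writedelay(total_call_us: int = 11, call_us: int = 2,
                       ret_us: int = 2, nop_us: int = 1) -> int:
    """Number of nops needed so that call + nops + ret takes total_call_us."""
    padding = total_call_us - call_us - ret_us
    if padding < 0 or padding % nop_us:
        raise ValueError("delay cannot be built from whole nops")
    return padding // nop_us
```

Under these assumptions the routine holds seven nops, and the caller's one-microsecond register move brings the spacing between writes to 12 µs.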
Jan 26 10:59 1984 filter.s Page 1
; Program Name: filter (f2)
; Algorithm: Figure 6.1
; Machine: VLSI processor array, simulated by Poker.
; Function: Compute ym given xm using
* y m“~ ^k^m-k m~k-
k=0 k=l
; Precision: Input: 8-bit unsigned.
; Coefficients: 8-bit unsigned.
; Output: 16-bit unsigned.
; Number of PEs: p + q + 1, the number of coefficients.
; Parameters: p + q + 1, the number of coefficients.
; Input: Arrives at the north port of cell (1,1).
; Output: Departs from the south port of cell (4,1).
; Loop Time: 33 µs to produce one output sample.
; Max sample Rate: 30 KHz
#include "ports.h"
;
org 29h ; Start of readport buffers




2 mov dptr,#ARG1 ; Get coef value
2 movx a,@dptr
1 mov coef,a
2 mov dptr,#lowSWLat ; dptr doesn’t change after this






















Figure B.7 8051 program listing for 8 bit fast filter (f2).
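The maximum sample rates quoted in the listing headers follow directly from the loop times. A minimal Python sketch (the function name is hypothetical):

```python
def max_sample_rate_khz(loop_time_us: float) -> float:
    """Highest sustainable sample rate when one sample takes loop_time_us."""
    return 1000.0 / loop_time_us
```

A 33 µs loop gives about 30.3 KHz, and the 63 µs, 82 µs, 26 µs and 90 µs loops of the later listings give about 15.9, 12.2, 38.5 and 11.1 KHz; the headers quote these figures rounded down.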





2 movx a,@dptr ; sum <- top











movx a,@dptr ; in <- right














movx @dptr,a ; out <- sum
sum :== sum +coef * in; (cont.)
»























Jan 26 10:59 1984 input.s Page 1
Program Name: input (f2)
Algorithm: None
Machine: VLSI processor array, simulated by Poker.
Function: Generate input data for filter program.
Precision: Output: 16-bit unsigned.
Number of PEs: 1
Output: Departs from the east port.
#include "ports.h"
org 29h ; Start of readport buffers
i: ds 1 ; Filter coefficient
sum: ds 2
right: ds 2
org 08000h
2 mov dptr,#lowSWLat ; dptr doesn’t change after this




2 mov i,#1 ; i := 1
1 mov r0,#2 ; wait 6 microseconds so right values
1 nop
2 djnz r0,$ ; follow top values into cells
main:
1 mov r0,#4 ; wait 9 microseconds so right values
2 djnz r0,$ ; follow top values into cells
;
; 13 out <- i;
1 mov a,i
2 movx @dptr,a ; out <- i
; 12 tmp <- sync;
;
2 jnb p0.7,$ ; Loop until data arrives
2 movx a,@dptr ; Dummy read on Switch port
;
; 14 i := i + 1;
;



















VLSI processor array, simulated by Poker.
and send it back to some filler cells. 
Input: 16-bit unsigned.
1
Arrives at the south port.
#include "ports.h"
org 29h ; Start of readport buffers
coef: ds 1 ; Filter coefficient
sum: ds 2
right: ds 2
org 08000h
2 mov dptr,#lowSWLat ; dptr doesn't change after this




; 10 out <- in;
;
2 jnb p0.7,$ ; Wait for input
2 movx a,@dptr ; sum <- bottom. Read LSB and send out
2 movx @dptr,a ; Send to other cells
1 mov sum+1,a
2 jnb p0.7,$ ; Wait for input







Figure B.7 (Continued)
Jan 26 10:59 1984 zero.s Page 1
; Program Name: zero (f2)
; Algorithm: None
; Machine: VLSI processor array, simulated by Poker.
; Function: Send zeros to first filter cell after
; receiving data from the input cell.
; Precision: Output: 16-bit unsigned zeros.
; Number of PEs: 1
; Output: Departs from the north port.
#include "ports.h"
org 29h ; Start of readport buffers




2 mov dptr,#lowSWLat ; dptr never changes after this
2 mov p1,#north ; neither does p1
2 mov p0,#0f0h
main:
'V' T . . '
; 7 * s = 0;
"• ■ ■ :
1 clr a ; out <- 0




1 clr a
2 movx @dptr,a
















; Wait for data in input queue
Figure B.7 (Continued)














VLSI processor array, simulated by Poker 





Output: 24-bit unsigned.
p + q + 1, the number of coefficients
p + q + 1, the number of coefficients
Arrives at the north port of cell 0
Departs from the south port of cell p + q
63 µs to produce one output sample
15.8 KHz
#include "ports.h"
org 29h ; Start of readport buffers
coef: ds 1 ; Filter coefficient decimal point
; is right of MSB (i.e. coef < 1)
sum: ds 3








mov dptr,#ARG1 ; Get coef value
2 movx a,@dptr
1 mov coef,a
2 mov dptr,#lowSWLat ; dptr doesn’t change after this






; sum := 0;
1 clr a
1 mov sum+2,a
1 mov sum+1,a
1 mov sum+0,a
; out <- 0;
Figure B.8 8051 listing for fast filter program (f3).
Jan 26 14:12 1984 filter.s Page 2
2 movx @dptr,a ; out <- 0
2 lcall writedelay
1 clr a
2 movx @dptr,a




; sum <- top;
2 movx a,@dptr ; sum <- top
1 mov sum+2,a
2 movx a,@dptr




; 13 in <- right;
;
2 jnb p0.7,$ ; Wait for external input
2 movx a,@dptr ; in <- LSB of right
;




1 add a,sum+2 ; sum := sum + in * right
1 mov sum+2,a
1 nop ; Wait so output.s has a chance







; 16 out <- sum;
;
2 movx @dptr,a ; out <- LSB of sum




1 addc a,sum+1
1 mov sum+1,a
2 jnc nocarry ; This will throw the timing off,
Figure B.8 (Continued)
Jan 26 14:12 1984 filter.s Page 3
1 inc sum ; but all will resync waiting for the next input
nocarry:
; 13 in <- right; (cont)
;
2 movx a,@dptr ; in <- MSB of right
;





1 add a,sum+1
; 16 out <- sum; (cont)
;
2 movx @dptr,a ; out <- middle byte of sum












5 16 out <- sum; (cont)
2
9









; Program Name: input (f3)
; Algorithm: Figure 6.2??
; Machine: VLSI processor array, simulated by Poker.
; Function: Generate input for filter program, one
; sample every LOOPTIME µs
; Precision: Output: 16-bit unsigned.
; Number of PEs: 1
; Parameters: LOOPTIME, the time between samples
; Output: Departs from the east port.
Jan 26 14:12 1984 input s Page 1
LOOPTIME equ 100-9 ; Time in microseconds between outputs
#include "ports.h"
org 29h ; Start of readport buffers
i: ds 2 ; Filter coefficient
sum: ds 2
right: ds 2
org 08000h
2 mov dptr,#lowSWLat ; dptr doesn't change after this
2 mov p1,#west ; neither does p1
2 mov p0,#0f0h
2 mov tmod,#10h ; Set timer 1 to no gate, timer, mode 1 (16 bit)
;
; 8 i := 1;
;
2 mov i+1,#1 ; i := 1
2 mov i,#0




; 12 tmp <- sync; Rather than use a port read for




1 mov tcon,a ; Stop timer
1 mov tl1,a ; clear timer
1 mov th1,a
1 setb tcon.6 ; Start timer
;
; 13 out <- i;
;
1 mov a,i+1
Figure B.8 (Continued)
Jan 26 14:12 1984 input.s Page 2
2 movx @dptr,a ; out <- LSB of i
; 14 i := i + 1;
;
1 add a,#1 ; i := i + 1












2 movx @dptr,a ; out <- MSB of i
1 mov a,#LOOPTIME


















Jan 26 14:12 1984 output.s Page 1
; Precision:
f
; Number of PEs:
; Input:





VLSI processor array, simulated by Poker. 
Receive input from filter program.
Wait for signal from input cell and send 




Data arrives from south port.
Synch signal arrives from north port. 
Departs from the east port.
org 29h ; Start of readport buffers










1 mov sum +2, a
2 jnb p0.7,$
2 movx a,@dptr













; neither does pi
; Wait for buffer to fill 
; Throw away the LSB of sum
; Wait for buffer to fill
; get middle byte of sum 
; Wait for buffer to fill 
; get MSB of sum
; Wait for buffer to fill 
; Send out sum +1 as soon as possible 
; dummy read
; wait for switch network
Figure B.8 (Continued)
2 movx @dptr,a ; Send out sum
2 movx a,@dptr ; dummy read




Jan 26 14:12 1984 output.s Page 2
Figure B.8 (Continued)
Jan 26 14:12 1984 zero.s Page 1
Program Name: zero (f3)
Algorithm: Figure 6.2??
Machine: VLSI processor array, simulated by Poker.
Function: Generate zero values for filter program
every time a synch signal come from the 
input cell.
Precision: Input: 16-bit unsigned.
Output: 16-bit unsigned.
Number of PEs: 1
; Input:
; Output:
Synch signal arrives from east port. 
Departs from the north port.
#include ”ports.h” 
org 29h ; Start of readport buffers









dptr,#lowSWLat ; dptr never changes after this




5 10 out <- 0;
i
»







2 movx @dptr,a .
f
; 9 dumb <- syncs
2
f
jnb p0.7,$ ; Wait for input
2 movx a,@dptr























VLSI processor array, simulated by Poker.
Find autocorrelation coefficients R(i)
given input signal x(m), using:
R(i) = \sum_{k=0} x(k) x(k+i).
Input: 16-bit unsigned.
Output: 32-bit unsigned.
p, the number of coefficients computed.
p, the number of coefficients computed.
Arrives at the north port of cell (1,2).
Departs from east port of merge cell.
82 µs to process one input sample.
Max Sample Rate: 12 KHz
Sends data through the switch LSB first























; wait 37 microS for input s to fill buffer
Figure B.9 8051 programs for autocorrelation program a3 using 16-bit inputs
and 32-bit sums.





mov i,a ; i := 0
; sum := 0;
1 mov sum+3,a ; sum := 0
1 mov sum+2,a








mov p1,#south ; Write 0 to south port
2 mov dptr,#lowSWLat ; This value will remain
; from now on
1 clr a
2 movx @dptr,a ; write 0 to switch
1 mov r0,#3




movx @dptr,a ; write 0 to switch
1






inc i ; i := i + 1
; left <- in2;
2 movx a,@dptr ; Read LSB of left from switch
1 mov left+1,a









movx a,@dptr ; Read LSB of top from switch
1 mov top+1,a





; if i < samples then
;
1 mov a,i ; load i
2 cjne a,#samples,loop ; if (i != samples) goto loop
Figure B.9 (Continued)
Jan 31 16:34 1984 auto.s Page 3
2 ljmp endloop ; if (i == samples) goto endloop
; sum := sum + left * top
; Where sum is 32 bits and left and top are 16 bits
;
; 30 31 (left)
; X 2e 2f (top)
;
; + 30x2f 31x2f







; + 2a 2b 2c 2d (sum)
; sum := sum + left * top
1 mov a,left+1 ; LSB of left
2 mov b,top+1 ; LSB of top
4 mul ab
1 add a,sum+3 ; LSB of sum (byte 4)
1 mov sum+3,a
1 mov a,b
1 addc a,sum+2 ; add in byte 3 of sum
1 mov sum+2,a
1 clr a
1 addc a,sum+1 ; add carry to byte 2 of sum
1 mov sum+1,a
1 clr a









movx @dptr,a ; Send LSB of top to south port
; 33 sum := sum + top * left (cont)
;
2 mov b,left ; MSB of left
4 mul ab
1 add a,sum+2 ; add to byte 3 of sum
1 mov sum+2,a
1 mov a,b
1 addc a,sum+1 ; add to byte 2 of sum
1 mov sum+1,a
1 clr a
1 addc a,sum ; add carry to byte 1 of sum
Figure B.9 (Continued)






mov a,top ; MSB of top
; out <- top;
2 movx @dptr,a ; Send MSB of top to south port
; 36 sum := sum + top * left (cont)
;
;
2 mov b,left+1 ; LSB of left
4 mul ab
1 add a,sum+2 ; add to byte 3 of sum
1 mov sum+2,a
1 mov a,b
1 addc a,sum+1 ; add to byte 2 of sum
1 mov sum+1,a
1 clr a
1 addc a,sum ; add carry to byte 1 of sum
1 mov sum,a
1 mov a,top ; MSB of top
2 mov b,left ; MSB of left
4 mul ab
1 add a,sum+1 ; add to byte 2 of sum
1 mov sum+1,a
1 mov a,b






; sum := sum + left * top
; Where sum is 32 bits and left and top are 16 bits
; 30 31 (left)
; X 2e 2f (top)
; + 30x2f 31x2f
; + 30x2e 31x2e
; + 2a 2b 2c 2d (sum)
;
endloop:
;
; 41 results <- sum;
;
Figure B.9 (Continued)
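The partial-product layout shown in the listing comments (left in 30h-31h, top in 2eh-2fh, sum in 2ah-2dh) can be modelled byte by byte. The Python sketch below is an illustration of the arithmetic only, not a cycle-accurate model of the 8051 code; the function names are hypothetical, and the sum is held as four bytes with the MSB first, as in the listing.

```python
def mul8(x: int, y: int):
    """Model of 'mul ab': 8-bit by 8-bit product as (high byte, low byte)."""
    product = (x & 0xFF) * (y & 0xFF)
    return product >> 8, product & 0xFF

def add_at(sum_bytes, index: int, value: int) -> None:
    """Add an 8-bit value into sum_bytes[index] and ripple the carry upward.

    Index 0 is the MSB; a carry out of the MSB is discarded.
    """
    carry = value
    while index >= 0 and carry:
        total = sum_bytes[index] + carry
        sum_bytes[index] = total & 0xFF
        carry = total >> 8
        index -= 1

def mac16(sum_bytes, left: int, top: int):
    """sum := sum + left * top, with sum held as 4 bytes, MSB first."""
    left_hi, left_lo = left >> 8, left & 0xFF
    top_hi, top_lo = top >> 8, top & 0xFF
    # (operand bytes, index of the sum byte receiving the low product byte)
    for x, y, low_index in ((left_lo, top_lo, 3), (left_hi, top_lo, 2),
                            (left_lo, top_hi, 2), (left_hi, top_hi, 1)):
        high, low = mul8(x, y)
        add_at(sum_bytes, low_index, low)
        add_at(sum_bytes, low_index - 1, high)
    return sum_bytes
```

Starting from a zero sum, `mac16` with both operands equal to FFFFh leaves the bytes FFh, FEh, 00h, 01h, which is FFFF0001h minus 10000h, that is, FFFE0001h = FFFFh squared.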
Jan 31 16:34 1984 auto.s Page 5
2 mov p1,#east ; Next write is to east port
; 40 sum := sum + top * left;
;
;
1 mov a,left+1 ; LSB of left
2 mov b,top+1 ; LSB of top
4 mul ab
1 add a,sum+3 ; LSB of sum (byte 4)
1 mov sum+3,a
;
; 41 results <- sum; (cont)
;
;
2 movx @dptr,a ; Send LSB of sum to east port
;




1 addc a,sum+2 ; add in byte 3 of sum
1 mov sum+2,a
1 clr a
1 addc a,sum+1 ; add carry to byte 2 of sum
1 mov sum+1,a
1 clr a
1 addc a,sum ; add carry to MSB of sum (byte 1)
1 mov sum,a
1 mov a,top+1 ; LSB of top






1 add a,sum+2 ; add to byte 3 of sum
1 mov sum+2,a
1 mov a,b
1 addc a,sum+1 ; add to byte 2 of sum








; add carry to byte 1 of sum







; LSB of left









; Send third byte of sum to east port
Figure B.9 (Continued)
Jan 31 16:34 1984 auto.s Page 6
; 40 sum := sum + top * left; (cont)
1 mov a,b
1 addc a,sum + l
















sum: = sum + top
















; add to byte 2 of sum
; add carry to byte 1 cf sum
; MSB of top 
•MSB of left
; add to byte 2 of sum
(cont)
; Send second byte of sum to east port 
left; (cont)
; add to MSB of sum (by te 1)
(cont)
; Wait 8 microseconds for switch
; Send second byte of sum to east port
Initialize i and sum for next autocorrelation calculation
;




1 mov i,a ; i := 0
; 42 sum := 0;
;
;
1 mov sum+3,a ; sum := 0
1 mov sum+2,a
1 mov sum+1,a
Figure B.9 (Continued)





1 clr a ; Send a 0 to south port
1 nop ; Wait for switch
2 mov pl,#south
2 movx @dptr,a ; write to switch
1 mov r0,#3
1 nop











Jan 31 16:34 1984 input.s Page 1
; This is the warp drive version of input.x
; It outputs 16 bit integers to port 0 LSB first.
#include "ports.h"
org 029h
; 14 int i;
i: ds 2
org 8000h
; 16 i := 100;
2 mov i,#0 ; i := 100
2 mov i+1,#100
2 mov dptr,#lowSWLat ; Get switch address
2 mov p1,#east ; Set direction to 2 (east)
2 mov p0,#0f0h
main:
; Write i to east port
>
; 20 out <- i;
f #
1 mov a,i+1 ; get LSB of i







; 22 i := 2*i;
1
»






; 21 tmp <- sync;
5
; Wait 8 micro seconds for switch
; get MSB of i 
; write to switch
; i := 2i
Figure B.9 (Continued)
Jan 31 16:34 1984 input.s Page 2
2 jnb p0.7,$ ; if p0.7 is 0 there is no data to read 
; so loop until there is
2 movx a,@dptr ; Read sync byte from switch
2 jnb p0.7,$ ; if p0.7 is 0 there is no data to read
; so loop until there is
2
•










; Warp Drive version of pipe.x
; Does 2 byte transfers holding the first byte until
; the second is received.

















2 jnb p0.7,$ ; if p0.7 is 0 there is no data to read
; so loop until there is
2 movx a,@dptr ; Read input byte from switch
1 mov r1,a ; It'll only come from one direction
2 jnb p0.7,$ ; if p0.7 is 0 there is no data to read
; so loop until there is
2 movx a,@dptr ; Read input byte from switch





; Write i to north port
; out <- tmp;
1 mov a,r1 ; send first byte
2 movx @dptr,a ; write to switch
1 mov r0,#3 ; Delay before second write
2 djnz r0,$















VLSI processor array, simulated by Poker 
Find autocorrelation coefficients R(i) 









Input: 8 bits, unsigned
Output: 16 bits, unsigned
p, the number of coefficients computed.
p, the number of coefficients computed.
Arrives at the north port of cell (1,2)
Departs from east port of merge cell
26 µs to process one input sample
Max Sample Rate: 38 KHz
Sends data through the switch LSB first
Use with runrev.o eproms
Warp Drive version of auto.s
Read 8 bit inputs and produces 16 bit outputs
Must be used with 16 reversed eproms, which were never made
#include "ports.h"













1 mov i,a ; i := 0
5


















2 movx @dptr,a ; Send LSB of sum
;
; 44 i := 0;
;
1 clr a
1 mov i,a ; i := 0
;
; 42 sum := 0;
;
1 mov sum,a ; sum := 0
1 mov sum+1,a
1 clr a ; Send a 0 to port 4
; 43 out <- sum;
;
1 nop ; wait for switch to get ready
1 nop ; Total of 12 microseconds













; 31 left <- in2;
;
2 movx a,@dptr ; Read input byte from switch
1 mov b,a ; Assume it's from left
;
; 30 top <- in1;
;
2 movx a,@dptr ; Read input byte from switch
;
;
; 35 out <- top;




; 23 sum := 0;
;
;
1 mov sum,a ; sum := 0
1 mov sum+1,a
; 25 out <- sum;
;
;
2 mov p1,#south ; Write sum to south port
2 mov dptr,#lowSWLat




; i := i + 1;
1 inc i ; i := i + 1
;
; 33 if i < samples then
;
;
1 mov a,i ; load i




2 movx a,@dptr ; Read input byte from switch
1 mov b,a ; Assume it’s from left
; 32 top <- in1;
2
1








; sum := sum + top * left;




mov p1,#east ; Write to east port (2)









sum + l,a 
a,b








2 movx @dptr,a ; write to switch
;














a,sum +1 ; Add prod to lower byte of sum
sum + l,a
a,b











VLSI processor array, simulated by Poker. 
Find autocorrelation coefficients R(i) 





p, the number of coefficients computed.
p, the number of coefficients computed.
Arrives at the north port of cell (1,3).
Departs from east port of merge cell.
90 µs to process one input sample.
11 KHz
Sends data through the switch in reverse order
(i.e. LSB first)
Use with runrev.o eproms
Quasi synchronous, i.e. Can take input from external source
at unknown intervals
Precision:





























mov p1,#south ; Write 0 to south port
Figure B.11 8051 program for autocorrelation program a5, using asynchronous
16-bit input and 32-bit output.
Jul 12 12:56 1984 auto.s Page 2


















1 mov sum+ 2,a















1 mov top + l,a
2 jnb p0.7,$
2 movx a,@dptr
1 mov top,a








; from now one 
; write 0 to switch 
; Wait 12 microseconds for switch 
; write 0 to switch
; i := 0
; sum := 0
; i := i + 1
; Wait for input from external program 
; Read LSB of top from switch
; Wait for input from external program 
; Read MSB of top from switch
; Wait for input from external program 
; Read LSB of left from switch
; Wait for input from external program 




Jul 12 12:56 1984 auto s Page 3
; 33 if i < samples then
1 mov a,i ; load i
2 cjne a,#samples,loop ; if (i != samples) goto loop
2 ljmp endloop ; if (i == samples) goto endloop
loop:
; sum := sum + left * top
; Where sum is 32 bits and left and top are 16 bits
;
; 30 31 (left)
; X 2e 2f (top)
;
; + 30x2f 31x2f
; + 30x2e 31x2e
; + 2a 2b 2c 2d (sum)
; 36 sum := sum + top * left;
;
;
1 mov a,left+1 ; LSB of left
2 mov b,top+1 ; LSB of top
4 mul ab
1 add a,sum+3 ; LSB of sum (byte 4)
1 mov sum+3,a
1 mov a,b
1 addc a,sum+2 ; add in byte 3 of sum
1 mov sum+2,a
1 clr a
1 addc a,sum+1 ; add carry to byte 2 of sum
1 mov sum+1,a
1 clr a









movx @dptr,a ; Send LSB of top to south port
; 36 sum := sum + top * left; (cont)
2 mov b,left ; MSB of left
4 mul ab
1 add a,sum+2 ; add to byte 3 of sum




Jul 12 12:56 1984 auto.s Page 4
1 addc a,sum+1 ; add to byte 2 of sum
1 mov sum+1,a
1 clr a
1 addc a,sum ; add carry to byte 1 of sum
1 mov sum,a
1 mov a,top ; MSB of top
; 35 out <- top; (cont)
;
;
2 movx @dptr,a ; Send MSB of top to south port
; 36 sum := sum + top * left; (cont)
;
;
2 mov b,left+1 ; LSB of left
4 mul ab
1 add a,sum+2 ; add to byte 3 of sum
1 mov sum+2,a
1 mov a,b
1 addc a,sum+1 ; add to byte 2 of sum
1 mov sum+1,a
1 clr a
1 addc a,sum ; add carry to byte 1 of sum
1 mov sum,a
1 mov a,top ; MSB of top
2 mov b,left ; MSB of left
4 mul ab
1 add a,sum+1 ; add to byte 2 of sum
1 mov sum+1,a
1 mov a,b







; 43 out <- sum;
;
;
1 clr a ; Send a 0 to south port
2 movx @dptr,a ; write to switch
1 mov r0,#5
2 djnz r0,$ ; Wait 12 microseconds for switch
Figure B.11 (Continued)
Jul 12 12:56 1984 auto.s Page 5
1 clr a ; Send a 0 to port 4
2 movx @dptr,a ; write to switch
; sum := sum + left * top
; Where sum is 32 bits and left and top are 16 bits
;
; 30 31 (left)
; X 2e 2f (top)
;
; + 30x2f 31x2f
; + 30x2e 31x2e
; + 2a 2b 2c 2d (sum)
;
; 40 sum := sum + top * left;
;
;
1 mov a,left+1 ; LSB of left
2 mov b,top+1 ; LSB of top
4 mul ab
1 add a,sum+3 ; LSB of sum (byte 4)
1 mov sum+3,a
;
; 41 results <- sum;
;
;
2 mov p1,#east ; Next write is to east port








1 addc a,sum+2 ; add in byte 3 of sum
1 mov sum+2,a
1 clr a
1 addc a,sum+1 ; add carry to byte 2 of sum
1 mov sum+1,a
1 clr a
1 addc a,sum ; add carry to MSB of sum (byte 1)
1 mov sum,a
1 mov a,top+1 ; LSB of top
2 mov b,left ; MSB of left
4 mul ab
1 add a,sum+2 ; add to byte 3 of sum
1 mov sum+2,a
1 mov a,b
1 addc a,sum+1 ; add to byte 2 of sum




Jul 12 12:56 1984 auto.s Page 6
1 addc a,sum ; add carry to byte 1 of sum
1 mov sum,a
1 mov a,top ; MSB of top
2 mov b,left+1 ; LSB of left
4 mul ab
1 add a,sum+2 ; add to byte 3 of sum
1 mov sum+2,a
; 41 results <- sum; (cont)
;
;
2 movx @dptr,a ; Send third byte of sum to east port
•




1 addc a,sum+1 ; add to byte 2 of sum
1 mov sum+1,a
1 clr a
1 addc a,sum ; add carry to byte 1 of sum
1 mov sum,a
1 mov a,top ; MSB of top
2 mov b,left ; MSB of left
4 mul ab
1 add a,sum+1 ; add to byte 2 of sum
;
; 41 results <- sum; (cont)
;
;
2 movx @dptr,a ; Send second byte of sum to east port
;
; 40 sum := sum + top * left; (cont)
;
;
1 mov sum+1,a
1 mov a,b





results <- sum; (cont)








Figure B. 11 (Continued)
Jul 12 12:56 1984 auto.s Page 7
2 movx @dptr,a ; Send second byte of sum to east port








mov i,a ; i := 0
; sum := 0;
;
1 mov sum+3,a ; sum := 0
1 mov sum+2,a
















; Number of PEs:
; Parameters:
; Input:









VLSI processor array, simulated by Poker 
Take the input data from one input port 
and write it to two output ports, splitting
the input data stream
1
ARG1 as set in the code names file
If ARG1 == 100 it inputs two bytes,
then outputs the same bytes first
to the up port, then the down port
If ARG1 != 100, it inputs one byte
and outputs it to the up port,
then the down port.
if ARG1 == 1 up is ne and down is east
if ARG1 == 2 up is north and down is south
if ARG1 == 3 up is nw and down is sw
if ARG1 == 100 up is ne and down is east
org 8000h
2 mov dptr,#ARG1+3 ; Check first parameter to see where to send
2 movx a,@dptr


























; param == 1, up is ne
; down is east
; param == 2, up is north
; down is south
; param == 3, up is nw
; down is sw
main:
2 mov p1,up ; Set direction up
Figure B.11 (Continued)




























































































; Wait for input
; Read byte, and write it out again
; Set direction to down 
; Send out second byte
; This split is for pel,!
; Set direction up 
; Wait for input
; Wait for input
; Send out first byte
; Send out second byte
; Set direction to 2 (down)
Figure B.11 (Continued)








Jul 12 12:57 1984 merge.x Page 1
/*
This routine will merge two data streams into one by
taking interlace number of data from the top, then
interlace number from the bottom.
Kludge: if interlace is 4, all data is read from












if interlace == 4 then 
while true do
for i := 1 to interlace do 
begin
tmp <- top; 
out <- tmp; 
end;
tmp <- bottom; 




for i := 1 to interlace do
begin
tmp <- top; tophold[i] := tmp;
tmp <- bottom; bottomhold[i] := tmp;
end;
for i := 1 to interlace do
out <- tophold[i];
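The general (non-kludge) case of the merge routine can be sketched in Python as follows. This is an illustration under stated assumptions, not a translation of the listing: the streams are modelled as finite iterables, the function name is hypothetical, and the sketch stops when either stream cannot supply a full group, whereas the original loops forever. The special case for an interlace of 4 is not modelled because its description is incomplete in the listing.

```python
from itertools import islice

def merge(top, bottom, interlace: int):
    """Yield `interlace` items from top, then `interlace` from bottom, repeatedly."""
    if interlace < 1:
        raise ValueError("interlace must be at least 1")
    top, bottom = iter(top), iter(bottom)
    while True:
        # Read one full group from each stream before writing anything,
        # as the tophold/bottomhold buffers do in the routine above.
        top_hold = list(islice(top, interlace))
        bottom_hold = list(islice(bottom, interlace))
        if len(top_hold) < interlace or len(bottom_hold) < interlace:
            return
        yield from top_hold
        yield from bottom_hold
```

With an interlace of 2, the streams 1, 2, 3, 4 and 10, 20, 30, 40 merge into 1, 2, 10, 20, 3, 4, 30, 40.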




















; Start of main loop, read a then b and find distance
main:
f








; 40 for i := 1 to coefs do
;
1 mov r0,#avec ; r0 points to current a coef.
1 mov r1,#bvec ; r1 points to current b coef.
1 mov r2,#COEFS
;
; 48 d := 0;
;
2 mov d+1,#0 ; d := 0
2 mov d,#0





VLSI processor array, simulated by Poker
Match two utterances using dynamic time warping
Input coefficients: 8 bits, unsigned
Distances: 16 bits, unsigned
Output score: 16 bits, unsigned
2r + l, where r is the width of the warping path
r, width of the warping path
coefs, the number of coefficients per frame
I, the number of frames per utterance
a vectors enter cells (1,7) and (1,8)
b vectors enter cells (7,7) and (6,8)
scores appear in cell (4,6)
input:
I
; 51 aout <- a[i];
;
Figure B.12 8051 routine used for DTW program d2.

















































; Find local distance d
;











; Set direction of write
; send avec[?] to switch
; Send bvec[?]
; Set new direction
; send avec[?] to switch
; get a coef
; save for next time
; Wait 9 more µs for b values
; get b coef
; acc := a - b
; take absolute value of acc
Figure B.12 (Continued)
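The local distance computed by this part of the routine is a sum of squared coefficient differences, with an "infinite" frame forcing an infinite distance (statements 57, 61 and 62 of the listing). The Python sketch below is illustrative only: the function name is hypothetical, and the numeric values chosen for the 8-bit and 16-bit infinity markers are assumptions, since the listing refers to them only by the names inf8 and inf.

```python
def local_distance(a_frame, b_frame, inf8: int = 0xFF, inf16: int = 0xFFFF) -> int:
    """Sum of squared differences between two frames of 8-bit coefficients."""
    # Statement 61: if the first coefficient of either frame is inf, d := inf.
    if a_frame[0] == inf8 or b_frame[0] == inf8:
        return inf16
    d = 0
    for a, b in zip(a_frame, b_frame):
        diff = abs(a - b)   # "take absolute value of acc"
        d += diff * diff    # statement 57: d := d + tmp1 * tmp1
    return d
```

The sketch does not model overflow of the 16-bit distance, which the 8051 code would have to handle separately.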
; 57 d := d + tmp1 * tmp1;
;
1 mov b,a ; acc := acc * acc
4 mul ab
1 add a,d+1







; Check to see if any vector is inf
; 61 if (a[1] = inf) | (b[1] = inf) then
1 mov a,inf8
2 cjne a,avec,bcheck ; Is avec == inf?




2 sjmp distl ; dummy jump to keep the time the same
doutdelay:




2 cjne a,bvec,doutdelay ; Is bvec == inf?
;
; 62 d := inf;
;
2 mov d+1,inf+1 ; d := inf
2 mov d,inf
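The local distance computed above (statements 48 through 62 in the pseudocode comments) is a sum of squared coefficient differences, replaced by infinity when either frame is an "infinity" frame. A minimal Python sketch follows. The names are hypothetical, the 8-bit marker value `inf8=0xFF` is an assumption (the listing does not show its value), and masking to 16 bits is an assumption that mirrors the header's statement that distances are 16 bits, unsigned.

```python
INF16 = 0x4000  # 16-bit infinity, from the init.h initialization (40h, 00h)

def local_distance(a, b, inf8=0xFF):
    """Squared-difference distance between two coefficient frames (a sketch)."""
    # Only the first coefficient is tested, as in "if (a[1] = inf) | (b[1] = inf)".
    if a[0] == inf8 or b[0] == inf8:
        return INF16
    d = 0
    for x, y in zip(a, b):
        diff = abs(x - y)                 # "take absolute value of acc"
        d = (d + diff * diff) & 0xFFFF    # "d := d + tmp1 * tmp1", kept to 16 bits
    return d
```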
; Send out d values
Jul 11 10:52 1984 even.s Page 3
dout:
; 64 DTtop <- d;
;
2 mov pl,#ne ; DTtop <- d












2 movx @dptr,a ; Send LSB of d to switch
2 lcall writedelay
1 mov a,d




; DTbot <- d;
;
1 mov a,d+1









2 movx @dptr,a ; Send LSB of d to switch
2 lcall writedelay
1 mov a,d
2 movx @dptr,a ; Send MSB of d to switch
#endif
Jul 11 10:52 1984 even.s Page 4




;
1 mov a,Dbot+1 ; tmp1 := Gbotold + 2*Dbot
1 rl a











Jul 11 10:52 1984 even.s Page 5 
1 mov tmp1,a
;
; 68 tmp2 := g + d;
;
1 mov a,d+1 ; tmp2 := g + d
1 add a,g+1





; 69 tmp3 := Gtopold + 2*Dtop;
;
1 mov a,Dtop+1 ; tmp3 := Gtopold + 2*Dtop
1 rl a




1 mov a,tmp3+1
1 add a,Gtopold+1





; 71 if tmp1 < tmp2 then
;
1 mov a,tmp1 ; compare MSB
2 cjne a,tmp2,cmp15
1 mov a,tmp1+1




; 72 min := tmp1;
;











2 mov min+1,tmp2+1 ; min := tmp2
2 mov min,tmp2
1 nop ; Used to make both paths same length
4 nop
;
; 75 if tmp3 < min then
;
next2:
Jul 11 10:52 1984 even.s Page 6
1 mov a,tmp3 ; Compare MSB
2 cjne a,min,cmp25
1 mov a,tmp3+1 ; Compare LSB
2 cjne a,min+1,cmp2
cmp2:
2 jc next25 ; This makes the timing on





2 sjmp cmp2 ; Keep paths the same length
next25:
; min := tmp3;




; 78 g := d + min;
;
1 mov a,min+1 ; g := min + d
1 add a,d+1










2 jc next32 ; if carry, g > inf
2 cjne a,inf,$ + 3 ; if g > inf
2 jnc next35 ; This makes the timing on both




























; 80 DTtop <- g;
;
2 mov p1,#ne ; DTtop <- g






















1 mov a,g +1
















2 movx @dptr,a
#endif
;
; 83 Gtopold := Gtop;
mov Gtopold+1,Gtop+1 ; Gtopold := Gtop
mov Gtopold,Gtop
;
; 84 Gbotold := Gbot;
;
Jul 11 10:52 1984 even.s Page 8













2 movx a,@dptr ; Dtop <- DTtop





;
#ifdef OUTPUT






; 03 Dbot <- DTbot;
;
#ifdef BOTTOM







2 movx a,@dptr ; Dbot <- DTbot




; 86 Dtop <- DTtop;
Figure B.12 (Continued)
Jul 11 10:52 1984 even.s Page 9
#ifdef TOP






2 movx a,@dptr ; Gtop <- DTtop




















2 movx a,@dptr ; Gbot <- DTbot










1 mov a,#low(LOOPTIME) ; Wait for timer
1 xrl a,tl1 ; xor LSBs to see if they are the same
1 rrc a ; move LSB into carry bit
1 mov a,#low(LOOPTIME)
2 jnc sync ; Sync with timer, since the cjne takes
1 nop ; 2 uS, there is a 50/50 chance the LSB
; will not match, this comparison should
; sync the program up with the timer so








; Program Name: dtw, odd.s (d2)
; Algorithm: Figure 6.16??
; Machine: VLSI processor array, simulated by Poker
; Function: Match two utterances using dynamic time warping
; Precision: Input coefficients: 8 bits, unsigned
; Distances: 16 bits, unsigned
; Number of PEs: 2r + 1, where r is the width of the warping path
; Parameters: r, width of the warping path
; coefs, the number of coefficients per frame
; I, the number of frames per utterance
Jul 11 10:52 1984 even.s Page 10
; Input: a vectors enter cells (1,7) and (1,8)
; b vectors enter cells (7,7) and (6,8)




; Start of main loop, read a then b and find distance
main:
1 clr a
1 mov tcon,a ; Stop timer
1 mov tl1,a ; clear timer
1 mov th1,a
1 setb tcon.6 ; Start timer
;
; 44 for i := 1 to coefs do
;
1 mov r0,#avec ; r0 points to current a coef.
1 mov r1,#bvec ; r1 points to current b coef.
1 mov r2,#COEFS








1 mov a,@r0 ; get avec[?]












Jul 11 11:36 1984 odd.s Page 1
;
; 47 bout <- b[i];
;
1 mov a,@r1 ; Send bvec[?]






















movx a,@dptr ; get a coef
; a[i] := atmp;
mov @r0,a ; save for next time
inc r0
mov b,a




















; tmp1 := atmp - btmp; /* Compute distance between vectors */
















; acc := a - b 

















; d := d + tmp1 * tmp1
mov b,a ; acc := acc * acc
mul ab
add a,d+1













; if (a[1] = inf) | (b[1] = inf) then
; Check for infinity
mov a,inf8
cjne a,avec,bcheck ; Is avec == inf?






















mov d+1,inf+1 ; d := inf
mov d,inf
;
; 34 DTbot <- d;
;
; Get and send d values
din:





1 mov a,d + l






; 85 DTtop <- d;
1 mov a,d + l






; 63 tmp2 := g + d;
;
1 mov a,d+1 ; tmp2 := g + d
1 add a,g+1




; 59 Dbot <- DTbot;
;
Jul 11 11:36 1984 odd.s Page 3
2 movx a,@dptr ; Dbot <- DTbot




; 62 tmp1 := Gbot + 2*Dbot; /* Find minimum path */
;
1 mov a,Dbot+1 ; tmp1 := Gbot + 2*Dbot
1 rl a





1 add a,Gbot+1
1 mov tmp1+1,a
1 mov a,Gbot









2 movx a,@dptr ; Dtop <- DTtop








1 mov a,Dtop+1 ; tmp3 := Gtop + 2*Dtop
1 rl a
1 mov tmp3+1,a
1 mov a,Dtop
1 rlc a
1 mov tmp3,a
1 mov a,tmp3+1
1 add a,Gtop+1




; 66 if tmp1 < tmp2 then
;
1 mov a,tmp1 ; compare MSB
2 cjne a,tmp2,cmp15
1 mov a,tmp1+1








2 mov min+1,tmp1+1 ; min := tmp1
2 mov min,tmp1















; 70 if tmp3 < min then
;
; if tmp3 < min
next2:
Jul 11 11:36 1984 odd.s Page 5
1 mov a,tmp3 ; Compare MSB
2 cjne a,min,cmp25
1 mov a,tmp3+1 ; Compare LSB
2 cjne a,min+1,cmp2
cmp2:
2 jc next25 ; This makes the timing on













2 mov min+1,tmp3+1 ; min := tmp3
2 mov min,tmp3
; 74 g := d + min
;
next3:
1 mov a,min+1 ; g := min + d
1 add a,d+1






; 73 If min < inf then
;
; if g > inf then
; g := 0
2 jc next32 ; if carry, g > inf
2 cjne a,inf,$+3 ; if g > inf
2 jnc next35 ; This makes the timing on both
1 nop ; branches the same
2 sjmp next4
Figure B.12 (Continued)








; 78 g := 0;
;
next35:
1 clr a ; g := 0






















































; DTtop <- g
; Gbot <- DTbot






mov a,#low(LOOPTIME) ; Wait for timer
xrl a,tl1 ; xor LSBs to see if they are the same
rrc a ; move LSB into carry bit
mov a,#low(LOOPTIME)
jnc sync ; Sync with timer, since the cjne takes
nop ; 2 uS, there is a 50/50 chance the LSB
; will not match, this comparison should
; sync the program up with the timer so
; with LSB will always match
sync:
cjne a,tl1,$




























Jul 11 10:53 1984 dtw.h Page 1
COEFS equ 4 ; Number of coefficients used
LOOPTIME equ 510-7 ; Total number of microseconds per loop
#include "ports.h"
org 29h ; Start of readport buffers
avec: ds COEFS
bvec: ds COEFS
ds 1 ; number of coefficients used
d: ds 2 ; local distance
Dbot: ds 2
Dtop: ds 2





inf: ds 2 ; Infinity 16 bit










Jul 11 10:55 1984 init.h Page 1




2 mov inf +1,#0
2 mov inf,#40h
;
; 40 for i := 1 to 2*coefs do
;
;




; 42 a[i] := inf;
; 43 b[i] := inf;
1 mov @r0,a ; set avec and bvec to inf8
1 inc r0
2 djnz r2,initl
;
;
; 32 Gbotold := inf;
; 33 Gtopold := inf;
; 34 Gbot := inf;
; 35 Gtop := inf;
; 36 Dbot := inf;
; 37 Dtop := inf;
; 38 g := 0;
;
1 mov a,inf+1 ; Initialize LSB of variables
1 mov Gbotold+1,a
1 mov Gtopold+1,a
1 mov Gbot+1,a
1 mov Gtop+1,a
1 mov Dbot+1,a
1 mov Dtop+1,a
2 mov g+1,#0







2 mov g,#0
2 mov dptr,#lowSWLat ; dptr is never changed after this.
2 mov p0,#0f0h ; neither is p0
Figure B.12 (Continued)
; Program Name: dtw, repeat.s (d2)
; Algorithm: Figure 6.16??
; Machine: VLSI processor array, simulated by Poker
; Function: 32 bit coef are input, the upper three
; bytes are thrown away. The lower byte is
; stored in aarch until an entire word is
; received. Then the word is output one frame
; at a time with LOOPTIME time between
; frames. The word is output VOCAB times,
; i.e. once for every word in the
; vocabulary.
; Precision: Input coefficients: 32-bit unsigned integers
; Output: 8-bit unsigned
; Number of PEs: 1
; Parameters: coefs, the number of coefficients per frame
Jul 11 11:42 1984 repeat.s Page 1
VOCAB equ 3 ; Words in vocabulary
COEFS equ 4 ; coefficients per frame
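The behavior described in the repeat.s header can be sketched in Python as follows. This is an illustrative model only: `repeat_word` and its parameters are hypothetical names, the defaults echo the `COEFS` and `VOCAB` constants above, and both the LOOPTIME pacing and the infinity frame that the listing inserts between words are left out.

```python
def repeat_word(coefs32, coefs_per_frame=4, vocab=3):
    """Keep the low byte of each 32-bit coefficient and replay the word `vocab` times."""
    # "the upper three bytes are thrown away"
    low_bytes = [c & 0xFF for c in coefs32]
    # Group the stored bytes into frames of coefs_per_frame coefficients.
    frames = [low_bytes[i:i + coefs_per_frame]
              for i in range(0, len(low_bytes), coefs_per_frame)]
    # "The word is output VOCAB times, i.e. once for every word in the vocabulary."
    return [frame for _ in range(vocab) for frame in frames]
```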




count: ds 1 ; Number of coefs input, or frames output
vcount: ds 1 ; Number of vocabulary words output
aindex: ds 2 ; Pointer to next location in aarch in EXRAM
avec: ds COEFS ; INRAM temp storage for one frame




2 mov p0,#0f0h ; This is for good luck
2 mov pl,#ne ; All data goes out the ne port
; Start of loop to read in the word and store in EXRAM
main:
2 mov count,#0
2 mov aindex+1,#low(aarch)
2 mov aindex,#high(aarch)
moredata:
2 jnb p0.7,$ ; Wait for input
2 movx a,@dptr
1 inc count
2 mov dpl,aindex+1 ; Get pointer to aarch
Figure B.12 (Continued)




2 mov aindex+1,dpl
2 mov aindex,dph
2 mov dptr,#lowSWLat ; Point to switch
1 mov r0,#3 ; Ignore next 3 input bytes
dummyread:
2 jnb p0.7,$ ; Wait for input
2 movx a,@dptr
2 djnz r0,dummyread
1 mov a,count
2 cjne a,#COEFS*FRAMES,moredata ; Jump if more data to read
; Turn on timer
;
1 clr a
1 mov tcon,a ; Stop timer
1 mov tl1,a ; Clear timer
1 mov th1,a
1 setb tcon.6 ; Start timer
; Pad input data with one frame of inf
2
y



















2 mov dpl,aindex+1
Figure B.12 (Continued)









2 mov aindex+1,dpl
2 mov aindex,dph
;




2 cjne a,th1,$ ; Wait for upper 8 bits to match
1 mov a,#low(LOOPTIME) ; Wait for timer
1 xrl a,tl1 ; xor LSBs to see if they are the same
1 rrc a ; move LSB into carry bit
1 mov a,#low(LOOPTIME)
2 jnc sync ; Sync with timer, since the cjne takes
1 nop ; 2 uS, there is a 50/50 chance the LSB
; will not match, this comparison should
; sync the program up with the timer so







1 mov tcon,a ; Stop timer
1 mov tl1,a ; Clear timer
1 mov th1,a










































Jul 11 11:58 1984 seq.s Page 1
Precision:




VLSI processor array, simulated by Poker 
For each unknown vector (frame) seq.s
receives from repeat.s it outputs the unknown
vector to its north port and a known vector
from its library to the south port. The new
vectors are output once every LOOPTIME µs.
If no vectors come from repeat.s, seq.s will
output infinite vectors every LOOPTIME µs.
Input coefficients: 8-bit unsigned integers
Output: 8-bit unsigned integers
1
coefs, the number of coefficients per frame
Warp drive version of DTW (seq.s)






















mov inf + l,#0
mov inf,#40h




1 mov @r0,a ; set avec and bvec to inf8
1 inc r0
1 nop ; Since this loop runs half as long as
1 nop ; the initl loop in even.s and odd.s,
1 nop ; it must be twice as long to keep
Figure B.12 (Continued)
Jul 11 11:58 1984 seq.s Page 2
1 nop ; the timing the same.
2 djnz r2,initl
1 mov r0,#infvec ; r0 points to inf vector
1 mov r1,#infvec ; r1 points to inf vector
1 mov r3,#7 ; Wait 16 microseconds for other PEs
1 nop
2 djnz r3,$
2 mov dptr,#lowSWLat
2 mov p0,#0f0h ; p0 is never changed after this.
; Start of main loop, read a then b and find distance
main:
1 clr a
1 mov tcon,a ; Stop timer
1 mov tl1,a ; clear timer
1 mov th1,a
1 setb tcon.6 ; Start timer
1 mov r2,#COEFS







2 movx @dptr,a ; Send avec[?]
2 lcall writedelay
1 mov a,@r1 ; Send bvec[?]
2 mov p1,#south ; Set new direction
2 movx @dptr,a
1 inc r0 ; Point to next element in a vector
1 inc r1 ; Point to next element in b vector
1 mov r3,#16 ; Wait 33 microseconds
2 djnz r3,$
2 djnz r2,input






2 jnb p0.7,infout ; Jump if no input data
1 mov r0,#avec
1 mov r1,#bvec
2 mov dpl,bindex+1
2 mov dph,bindex
; Get b vector from external RAM
mov r2,#COEFS ; for r2 := coefs to 0 step -1
movx a,@dptr







Jul 11 11:58 1984 seq.s Page 3
; Get a vector from input port
1 mov r2,#COEFS
;
; for r2 := coefs to 0 step -1
geta:
2 jnb p0.7,$ ; Wait for next input
2 movx a,@dptr
1 mov @r0,a ; avec[r0] <- input
1 inc r0
2 djnz r2,geta
1 mov r0,#avec ; Get pointers ready to output
1 mov r1,#bvec
2 sjmp timer
; No more input data, send infinity vectors instead
mov r0,#infvec
mov r1,#infvec
mov bindex+1,#low(barch) ; Start b vector over again
mov bindex,#high(barch)
; Wait for timer
;
timer:








Jul 11 11:58 1984 seq.s Page 4
2 cjne a,th1,$
1 mov a,#low(LOOPTIME)
1 xrl a,tl1 ; xor LSBs to see if they are the same
1 rrc a ; move LSB into carry bit
1 mov a,#low(LOOPTIME)
2 jnc sync ; Sync with timer, since the cjne takes
1 nop ; 2 uS, there is a 50/50 chance the LSB
; will not match, this comparison should 
; sync the program up with the timer so
; with LSB will always match
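The comment above explains why the low timer byte is checked for parity before the final wait: the `cjne` polling loop samples the timer only every 2 µs, so it can only ever see timer values of one parity. A simplified Python model of that reasoning follows. The function names are hypothetical, time is counted in whole microseconds, and the exact instruction offsets of the real routine are not modeled; only the parity argument is.

```python
def poll_until_match(start, target, step=2, limit=1000):
    """Sample the timer every `step` microseconds; return the match time or None."""
    t = start
    for _ in range(limit):
        if t == target:
            return t
        t += step
    return None  # the poll stepped over the target and never saw it

def synced_wait(start, target):
    """Insert a 1 microsecond delay (the nop) when the LSBs differ, then poll."""
    if (start ^ target) & 1:   # xrl / rrc: do the low bits differ?
        start += 1             # jnc not taken, so the extra nop shifts the parity
    return poll_until_match(start, target)
```

Starting at an odd time with an even target, the plain 2 µs poll never matches, while the parity-corrected wait does.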
sync:












































; Put inf vector between words
Figure B.12 (Continued)









VLSI processor array, simulated by Poker
This routine collects the distance
scores from pe 4,7. The only score that really
means something is the score just before a zero
value. All others are intermediate values.
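The rule stated in this header, that only the score just before a zero is meaningful, can be sketched in Python as follows. The function name `final_scores` is hypothetical; the `last_zero` flag plays the role of the `lastzero` variable declared below, so that a run of consecutive zeros records only one score.

```python
def final_scores(stream):
    """Return the values that immediately precede a zero in the score stream."""
    scores = []
    previous = None
    last_zero = True   # treat the start like a zero so leading zeros are ignored
    for value in stream:
        if value == 0:
            if not last_zero:
                scores.append(previous)  # the previous input was a good score
            last_zero = True
        else:
            previous = value             # intermediate value, may be overwritten
            last_zero = False
    return scores
```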




input: ds 2
lastzero: ds 1 ; == 1 if last value was a 0
scores: ds 2*(2+1) ; score of all the words
org 8000h
2 mov dptr,#lowSWLat ; Get switch address
2 mov p0,#0f0h
1 mov r0,#scores ; r0 points to the next hi location in scores
1 mov r1,#scores+1 ; r1 points to the next lo location in scores
main:





2 movx a,@dptr ; Read byte
1 mov input+1,a ; Save
2 jnb p0.7,$ ; Get second byte
2 movx a,@dptr
1 mov input,a
1 clr a ; jump if input != 0
2 cjne a,input,notyet
2 cjne a,input+1,notyet
2 cjne a,lastzero,notyet2
; The input was zero, so the previous input was a good score
; increment the scores pointer (r0 and r1) and don’t save
; the zero values
1 inc r0
Figure B.12 (Continued)


























VLSI processor array, simulated by Poker 
Output a stored speech signal at a given 
sampling rate
Output: 16 bits, sign-magnitude 
1
SAMPLETIME, the time in µs between samples
None
Departs from west port once every
SAMPLETIME µs
SAMPLETIME equ 160-8 ; Time in microseconds between outputs
#include "ports.h"
org 29h ; Start of readport buffers
in: ds 2 ; pointer to next word of input data
org 08000h
2 mov in+1,#low(word) ; point in to start of data
2 mov in,#high(word)
2 mov pl,#west ; direction to send data
2 mov p0,#0f0h
2 mov tmod,#10h ; Set timer 1 to no gate, timer,
; mode 1 (16 bit)





1 mov tcon,a ; Stop timer
1 mov tl1,a ; clear timer
1 mov th1,a
1 setb tcon.6 ; Start timer
2 mov dpl,in+1
2 mov dph,in
2 inc dptr
2 movx a,@dptr ; get LSB of next input sample
2 mov dptr,#lowSWLat
2 movx @dptr,a ; send byte to switch
2 mov dpl,in+1
2 mov dph,in
2 movx a,@dptr ; get MSB of next input sample
2 inc dptr
Figure B.13 Program to output stored speech signal.
2 inc dptr
2 mov in+1,dpl
2 mov in,dph
2 mov dptr,#lowSWLat
2 movx @dptr,a ; send byte to switch





1 mov a, in
2 cjne a,#high(dataend),wait







1 mov a,#high(SAMPLETIME)
2 cjne a,th1,$
1 mov a,#low(SAMPLETIME)
1 xrl a,tl1 ; xor LSBs to see if they are the same
1 rrc a ; move LSB into carry bit
1 mov a,#low(SAMPLETIME)
2 jnc sync ; Sync with timer, since the cjne takes
1 nop ; 2 uS, there is a 50/50 chance the LSB
; will not match, this comparison should
; sync the program up with the timer so
sync:




sjmp main ; Wait for rest of PEs to work
#include
dataend:

















VLSI processor array, simulated by Poker 
preemphasize the input signal using the
transfer function:
H(z) = 1 - 0.95 z^{-1}
Input: 16 bits, 2’s complement
Sum: 16 bits, sign-magnitude
Output: 16 bits, sign-magnitude
1
COEF, the filter coefficient
Arrives at the north port of cell
Departs from east port of cell
85 µs
11 KHz
COEF equ 243 ; 243/256 = .95
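The filter named in the header, with the 243/256 approximation of 0.95 defined above, can be sketched in Python as follows. This is a plain integer model under stated assumptions: the function name is hypothetical, the scaling is applied to the magnitude of the previous sample and the low byte of the product is dropped (as the listing's comment about discarding the A register suggests), and the 16-bit sign-magnitude output encoding is not modeled.

```python
COEF = 243  # 243/256 is approximately 0.95

def preemphasize(samples):
    """Apply y[n] = x[n] - (COEF/256) * x[n-1] with integer arithmetic."""
    out = []
    last = 0
    for x in samples:
        magnitude = (abs(last) * COEF) >> 8          # drop the fractional low byte
        scaled = magnitude if last >= 0 else -magnitude
        out.append(x - scaled)
        last = x
    return out
```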
#include "ports.h"
org 29h ;
sign: ds 1 ;
sum: ds 2 ;
last: ds 2 ;
org 08000h
2 mov dptr,#lowSWLat
2 mov p1,#east ;
2 mov p0,#0f0h
1 clr a






2 jnb p0.7,$ ;
2 movx a,@dptr ;
1 mov last+1,a ;
; sum := a - sum;
1 subb a,sum+1
1 mov sum+1,a
2 jnb p0.7,$ ;
Figure B.14 Assembly language program for preemphasis filtering.
; Start of readport buffers
; Sign of sum
; last value * COEF
; last value
; dptr doesn’t change after this
; neither does p1
; 8 sum := 0
; clear carry flag (no borrow)
; Wait for external input
; in <- LSB of right
; Save for later
; sum := a - sum carry flag was cleared
; at end of loop
; Wait for next byte
2 movx a,@dptr
1 mov last,a ; Save for later
1 subb a,sum
1 mov sum,a
2 jb sum.7,negative
1 mov a,sum+1 ; Value is positive, no change
2 movx @dptr,a ; out <- a
1 mov a,sum






; Convert to sign/magnitude
1 mov a,sum+1
1 xrl a,#0ffh
1 add a,#1
2 movx @dptr,a ; out <- a
1 mov a,sum
1 xrl a,#0ffh
1 addc a,#0 ; Add in carry
1 orl a,#80h ; Set sign bit to negative
1 mov ; Wait 10 more microS
1 nop
2 djnz r0,$
2 movx @dptr,a ; out <- a
multiply:
2 jnb last.7,positive ; Jump if last input value positive
1 mov a,last+1 ; Convert to Sign/Magnitude
1 xrl a,#0ffh
1 add a,#1

















1 add a,sum + l





1 mov a,sum+1
1 xrl a,#0ffh
1 add a,#1
1 mov sum+1,a
1 mov a,sum+0
1 xrl a,#0ffh
1 addc a,#0
1 mov sum+0,a
pos:
1 clr a




; Since COEF is a fraction,
; the result in the A register
; is thrown away, too small.
; add carry
; if not set, the number is positive
; convert back to 2’s complement
















VLSI processor array, simulated by Poker 
Find autocorrelation coefficients R(i)
given input signal x(m), using
R(i) = \sum_{k \ge 0} x(k)\, x(k+i)
Input: 16 bits, sign/magnitude
Output: 32 bits, signed
p, the number of coefficients computed,
p, the number of coefficients computed.
Arrives at the north port of cell (1,3)
Departs from east port of merge cell
160 µs to process one input sample
6.25 KHz
Sends data through the switch in reverse order 
(i.e. LSB first)
Use with runrev.o eproms
Quasi synchronous, i.e. Can take input from external source
at unknown intervals
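The autocorrelation sum in the header can be sketched directly in Python. This is a sequential reference computation, not the pipelined array program: the function name is hypothetical, and stopping each sum at the last sample for which x(k + i) exists is an assumption, since the upper limit of the sum is not legible in the scanned header.

```python
def autocorrelation(x, p):
    """Return [R(0), ..., R(p-1)] with R(i) = sum over k of x[k] * x[k + i]."""
    n = len(x)
    # Assumption: the sum runs over every k for which x[k + i] is defined.
    return [sum(x[k] * x[k + i] for k in range(n - i)) for i in range(p)]
```

For the three-sample signal `[1, 2, 3]`, R(0) is 1 + 4 + 9 and R(1) is 2 + 6.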























— yjyj ii Fu°itl"c
The result of each mul is xrl'ed with sign, so if the result is
negative the product will be complemented. Also the carry bit
is set to sign, so if the result is negative, the product will have
1 added to it. This puts the product in 2's complement notation.
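The complement-and-add-one trick described above can be checked with a byte-level Python model of the multiply-accumulate. This is a sketch under stated assumptions: the function names are hypothetical, operands are 16-bit sign/magnitude values as the header specifies, and the four 8x8 partial products are processed in a fixed order that may differ from the instruction interleaving in the listing.

```python
def to_sign_magnitude(value):
    """Encode an int in [-32767, 32767] as 16-bit sign/magnitude."""
    return (0x8000 | -value) if value < 0 else value

def mac_sign_magnitude(total, left, top):
    """Return (total + left * top) mod 2**32 using byte-wise xrl/addc steps."""
    sign = 0xFF if (left ^ top) & 0x8000 else 0x00    # 0xFF when the product is negative
    left_hi, left_lo = (left >> 8) & 0x7F, left & 0xFF
    top_hi, top_lo = (top >> 8) & 0x7F, top & 0xFF
    acc = [(total >> shift) & 0xFF for shift in (24, 16, 8, 0)]  # acc[0] is the MSB

    def add_partial(x, y, lsb_index):
        product = x * y                               # mul ab: 8 x 8 -> 16 bits
        operands = [product & 0xFF, product >> 8]
        carry = sign >> 7                             # mov c,sign.7: the "+1"
        for n, idx in enumerate(range(lsb_index, -1, -1)):
            # xrl with sign complements the product bytes; higher bytes add sign itself.
            operand = (operands[n] ^ sign) if n < 2 else sign
            result = acc[idx] + operand + carry       # addc
            acc[idx] = result & 0xFF
            carry = result >> 8

    add_partial(left_lo, top_lo, 3)
    add_partial(left_hi, top_lo, 2)
    add_partial(left_lo, top_hi, 2)
    add_partial(left_hi, top_hi, 1)
    return (acc[0] << 24) | (acc[1] << 16) | (acc[2] << 8) | acc[3]
```

Each partial product is negated separately, which is valid because two's complement negation distributes over the sum of the partial products modulo 2^32.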
org 08000h
Figure B.15 Assembly language program for autocorrelation.
;
; 25 out <- sum;
;
2 mov p1,#south ; Write 0 to south port
2 mov dptr,#lowSWLat ; This value will remain in dptr from now on
1 clr a
2 movx @dptr,a ; write 0 to switch
1 mov r0,#6
















mov sum+3,a ; sum := 0
1 mov sum + 2,a




; 28 i := i + 1;
;
1 inc i ; i := i + 1




2 jnb p0.7,$ ; Wait for input from external program
2 movx a,@dptr ; Read LSB of top from switch
1 mov top + l,a
2 jnb p0.7,$ ; Wait for input from external program





; left <- in2;
;
2 jnb p0.7,$ ; Wait for input from external program
2 movx a,@dptr ; Read LSB of left from switch
1 mov left+1,a
2 jnb p0.7,$ ; Wait for input from external program




1 xrl a,top ; exclusive or signs to see what sign ]
2 jnb acc.7,there









1 mov a,left ; remove sign bit from left
1 anl a,#07fh
1 mov left,a
1 mov a,i ; load i
2 cjne a,#samples,loop ; if (i != samples) goto loop
2 ljmp endloop ; if (i == samples) goto endloop
loop:
; sum := sum + left * top
; Where sum is 32 bits and left and top are 16 bits
;
; 30 31 (left)
; X 2e 2f (top)
;
; + 30x2f 31x2f
; + 30x2e 31x2e
; + 2a
;
; 2b 2c 2d (sum)
;
; 36 sum := sum + top * left;
;
1 mov a,left+1 ; LSB of left
2 mov b,top+1 ; LSB of top
4 mul ab
1 xrl a,sign ; change sign if needed
1 mov c,sign.7
1 addc a,sum+3 ; LSB of sum (byte 4)
1 mov sum+3,a
1 mov a,b
1 xrl a,sign ; change sign if needed
1 addc a,sum+2 ; add in byte 3 of sum
1 mov sum+2,a
1 mov a,sign
1 addc a,sum+1 ; add carry to byte 2 of sum
Figure B.15 (Continued)
1 mov sum+1,a
1 mov a,sign
1 addc a,sum ; add carry to MSB of sum (byte 1)
1 mov sum,a
1 mov a,top+1 ; LSB of top
; 35 out <- top;
;
2 movx @dptr,a ; Send LSB of top to south port
; 36 sum := sum + top * left; (cont)
;
2 mov b,left ; MSB of left
4 mul ab
1 xrl a,sign ; change sign if needed
1 mov c,sign.7
1 addc a,sum+2 ; add to byte 3 of sum
1 mov sum+2,a
1 mov a,b
1 xrl a,sign ; change sign if needed
1 addc a,sum+1 ; add to byte 2 of sum
1 mov sum+1,a
1 mov a,sign
1 addc a,sum ; add carry to byte 1 of sum
1 mov sum,a
1 mov a,top ; MSB of top
; 35 out <- top; (cont)
;
2 movx @dptr,a ; Send MSB of top to south port
; 36 sum := sum + top * left; (cont)
;
1 anl a,#7fh ; Remove sign bit from top
2 mov b,left+1 ; LSB of left
4 mul ab
1 xrl a,sign ; change sign if needed
1 mov c,sign.7
1 addc a,sum+2 ; add to byte 3 of sum
1 mov sum+2,a
1 mov a,b
1 xrl a,sign ; change sign if needed
1 addc a,sum+1 ; add to byte 2 of sum
1 mov sum+1,a
1 mov a,sign
1 addc a,sum ; add carry to byte 1 of sum
1 mov sum,a
1 mov a,top ; MSB of top
Figure B.15 (Continued)
1 anl a,#7fh ; Remove sign bit from top
2 mov b,left ; MSB of left
4 mul ab
1 xrl a,sign ; change sign if needed
1 mov c,sign.7
1 addc a,sum+1 ; add to byte 2 of sum
1 mov sum+1,a
1 mov a,b
1 xrl a,sign ; change sign if needed










; out <- sum;
;
1 clr a ; Send a 0 to south port bottom <
2 movx @dptr,a ; write to switch
1 mov r0,#6
2 djnz r0,$ ; Wait 14 microseconds for switch
1 clr a ; Send a 0 to port 4
2 movx @dptr,a ; write to switch
; sum := sum + left * top
; Where sum is 32 bits and left and top are 16 bits
;
; 30 31 (left)
; X 2e 2f (top)
;
; + 30x2f 31x2f
; + 30x2e 31x2e
; + 2a
;
; 2b 2c 2d (sum)
;
; 40 sum := sum + top * left;
1 mov a,left+1 ; LSB of left
2 mov b,top+1 ; LSB of top
4 mul ab
1 xrl a,sign ; change sign if needed
1 mov c,sign.7





mov pl,#east ; Next write is to east port
movx @dptr,a ; Send LSB of sum to east port
; sum := sum + top * left; (cont)
mov a,b
xrl a,sign ; change sign if needed
addc a,sum + 2 ; add in byte 3 of sum
mov sum + 2, a
mov a,sign
addc a,sum+1 ; add carry to byte 2 of sum
mov sum + l,a
mov a,sign
addc a,sum ; add carry to MSB of sum (byte 1)
mov sum,a
mov a,top +1 ; LSB of top
mov b,left ; MSB of left
mul ab
xrl a,sign ; change sign if needed
mov c,sign.7
addc a,sum+2 ; add to byte 3 of sum
mov sum+ 2,a
mov a,b
xrl a,sign ; change sign if needed
addc a,sum +1 ; add to byte 2 of sum
mov sum + l,a
mov a,sign
addc a,sum ; add carry to byte 1 of sum
mov sum,a
mov a,top ; MSB of top
anl a,#7fh ; Remove sign bit from top
mov b,left+1 ; LSB of left
mul ab
xrl a,sign ; change sign if needed
mov c,sign.7
addc a,sum+2 ; add to byte 3 of sum
mov sum+2,a
; results <- sum; (cont)
movx @dptr,a ; Send third byte of sum to east port
; sum := sum + top * left; (cont)
mov a,b
xrl a,sign ; change sign if needed
Figure B.15 (Continued)
1 addc a,sum + l










1 addc a,sum + l
1





; 40 sum := sum + top
;












; add to byte 2 of sum
; add carry to byte 1 of sum
; MSB of top
; Remove sign bit from top
; MSB of left
; change sign if needed
; add to byte 2 of sum
(cont)
; Send second byte of sum to east port
left; (cont)
; change sign if needed
; add to MSB of sum (byte 1)
(cont)
; Wait 14 microseconds for switch
; Send second byte of sum to east port
; initialize i and sum for next autocorrelation calculation
;
; 44 l :=#5
;
1 clr a





mov sum+3,a ; sum := 0
1 mov sum+2,a
1 mov sum+1,a
1 mov sum,a
2 mov p1,#south
Figure B.15 (Continued)
;
; 45 end
;
2 ljmp main
end
Figure B.15 (Continued)
