

# A Fully Parallel Vector-Quantization Processor for Real-Time Motion-Picture Compression

| author                       | Tadahiro Ohmi (大見忠弘)               |
|------------------------------|--------------------------------------|
| journal or publication title | IEEE Journal of Solid-State Circuits |
| volume                       | 34                                   |
| number                       | 6                                    |
| page range                   | 822-830                              |
| year                         | 1999                                 |
| URL                          | http://hdl.handle.net/10097/47985    |

doi: 10.1109/4.766816


Akira Nakada, Tadashi Shibata, Member, IEEE, Masahiro Konda, Tatsuo Morimoto, and Tadahiro Ohmi, Member, IEEE

Abstract— A vector-quantization (VQ) processor system has been developed aiming at real-time compression of motion pictures using a 0.6- $\mu$ m triple-metal CMOS technology. The chip employs a fully parallel single-instruction, multiple-data architecture having a two-stage pipeline. Each pipeline segment consists of 19 cycles, thus enabling the execution of a single VQ operation in only 19 clock cycles. As a result, it has become possible to encode a full-color picture of 640 × 480 pixels in less than 33 ms, i.e., the real-time compression of moving pictures has become available. The chip is scalable up to eight-chip master–slave configuration in conducting fully parallel search for 2-K template vectors. The chip operates at 17 MHz with a power dissipation of 0.29 W under a power-supply voltage of 3.3 V.

#### I. INTRODUCTION

Real-time transmission of motion pictures is one of the most important technologies in the era of multimedia. In this regard, efficient image compression and decompression algorithms are required to reduce the data size to fit the bandwidth of available communication networks. A variety of work has been conducted in developing compression/decompression systems based on the worldwide standard Moving Picture Experts Group (MPEG) [1]–[5]. Although the algorithm offers excellent compression performance, it requires dedicated hardware systems for both encoding and decoding. For personal and portable applications, simple algorithms that run very fast on simple hardware are of particular importance.

Vector quantization (VQ) [6] has become an attractive technique for image compression due to its very simple algorithm. In image compression using VQ, a macro block of pixels is approximated by the maximum-likelihood pattern in the codebook, where typical patterns occurring in video images are stored as templates, and only the code of the best matched pattern is transmitted. The decompression is carried out by just pasting the corresponding pattern at each location. Since pattern matching is a very computationally expensive process, VQ is asymmetric in compression and decompression operations. For this reason, VQ is so far suitable only for one-way video transmission in portable applications. Low-power VQ decoder chips were developed for such applications [7], [8].

Manuscript received January 20, 1998; revised December 9, 1998. This work was supported in part by the Ministry of Education, Science, Sports, and Culture under a Grant-in-Aid for Scientific Research on Priority Areas, Area Number 269, and in part by NEDO.

A. Nakada is with the VLSI Design and Education Center, University of Tokyo, Tokyo 113 Japan.

T. Shibata is with the Department of Information and Communication Engineering, University of Tokyo, Tokyo 113 Japan.

M. Konda, T. Morimoto, and T. Ohmi are with the Department of Electronic Engineering, Graduate School of Engineering, Tohoku University, Sendai 980-77 Japan.

Publisher Item Identifier S 0018-9200(99)04192-X.

Fig. 1. Algorithm of vector quantization for image compression.

A number of VQ encoding engines have been developed aiming at real-time applications using digital [9]-[12] and analog [13] technologies. The analog implementation of [13] employed current-mode circuitry for Euclidean distance calculation and minimum distance search, allowing very fast operation on simple hardware. However, this comes at the cost of the limited accuracy of analog computation. The digital implementation in [10] demonstrated VQ of real-time sampled NTSC video signals using a differential-vector-quantization algorithm. However, the number of template vectors on a chip was limited to 32. In [11], the functional-memory-type parallel processor architecture was employed for VQ. The chip requires 165 cycles for a single VQ operation, which makes real-time compression of motion pictures difficult. In real-time applications, minimizing the number of clock cycles for a single VQ operation is of primary importance. Increasing the codebook size is also important to preserve the image quality.

In this paper, a digital VLSI chip developed for real-time encoding of motion pictures based on a VQ algorithm is presented. The chip employs a fully parallel single-instruction, multiple-data (SIMD) architecture and completes one VQ operation every 19 clock cycles. That is, a single VQ operation is finished every 1.1 $\mu$s at a clock frequency of 17 MHz. Moreover, a parallel search over a maximum of 2048 vectors can be conducted using an eight-chip master–slave configuration. As a result, real-time VQ encoding of 640 × 480 pixel full-color motion pictures has become possible.

The organization of this paper is as follows. In Section II, the VQ algorithm employed in the present system is described. System organization and circuit implementation are presented in Sections III and IV, respectively. Measurement results of fabricated VQ chips are demonstrated in Section V. Last, conclusions are given in Section VI.

0018-9200/99\$10.00 © 1999 IEEE

Fig. 2. The performance of VQ for image compression. (a) A $640 \times 480$ picture used for generating a codebook and (b) an original image of Susie. (c) Susie VQ encoded using the codebook generated from the picture in (a) and decompressed. (d) Susie JPEG encoded and decompressed. In (b)-(d), the right-eye portions were enlarged.

#### II. VECTOR QUANTIZATION

Fig. 1 explains how the VQ is carried out in our system. The original image is composed of $640 \times 480$ pixels of luminance data. Each pixel has 256 levels of brightness represented by an 8-bit digital signal. Therefore, the data size of the image is about 2.4 Mbit ($640 \times 480 \times 8$ bits). From this original image, a macro block of $4 \times 4$ pixels is taken as a 16-element vector for a single VQ operation. In the following, this vector is called an input vector. This input vector is matched with template patterns in the codebook.


Fig. 3. Block diagram of video/audio compression system using VQ chips.

The matching is conducted by calculating the Manhattan distance $d_j$ between the template vector $\boldsymbol{T}_j$ and the input vector $\boldsymbol{X}$ as

$$d_j = \sum_{i=1}^{16} |t_{i,j} - x_i|$$

where $j$ is the code number of the template $\boldsymbol{T}_j$, $t_{i,j}$ is the $i$th element of $\boldsymbol{T}_j$, and $x_i$ is the $i$th element of $\boldsymbol{X}$.
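The matching rule above can be sketched in a few lines of software. The following is only a sequential illustration of the distance formula and the minimum search, not the parallel hardware; the function names and the three-entry toy codebook are assumptions made for the example.

```python
from typing import Sequence, Tuple


def manhattan_distance(template: Sequence[int], x: Sequence[int]) -> int:
    """d_j = sum over i of |t_{i,j} - x_i| for one template vector."""
    return sum(abs(t - xi) for t, xi in zip(template, x))


def vq_encode(codebook: Sequence[Sequence[int]], x: Sequence[int]) -> Tuple[int, int]:
    """Return (code number j, distance d_j) of the nearest template.

    min() keeps the first minimum, so ties go to the smaller code number.
    """
    distances = [manhattan_distance(t, x) for t in codebook]
    j = min(range(len(codebook)), key=distances.__getitem__)
    return j, distances[j]


# Toy codebook of three 16-element templates (illustrative values only).
codebook = [[0] * 16, [128] * 16, [255] * 16]
# A 4 x 4 macro block flattened into a 16-element input vector.
block = [120] * 8 + [140] * 8
code, dist = vq_encode(codebook, block)
print(code, dist)  # prints: 1 160
```

Only `code` would be transmitted; the decoder looks up `codebook[code]` and pastes it at the block location.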

The template vector having the maximum likelihood to the input vector is selected by searching for the minimum-distance vector; the code number of the vector is sent as the data. If the codebook has 2-K template vectors, for instance, the 16 bytes of the macro-block data are replaced by an 11-bit code. Therefore, the compression ratio is 11.6 for black-and-white pictures and 23.3 for color pictures with 4:1:1 color subsampling. (The same codebook is used for Y, U, V signals.) Huffman coding can further enhance the compression ratio. Reconstruction of the compressed image is conducted by pasting the pattern corresponding to a received code at each location. The algorithm is very straightforward, and no dedicated hardware is needed for the image retrieval.
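The quoted compression ratios can be reproduced with a short calculation. The sketch below assumes that the color figure is taken relative to 24-bit-per-pixel source data and that 4:1:1 subsampling means one U block and one V block per four Y blocks; this reading is an assumption that happens to reproduce the numbers in the text, and the function names are illustrative.

```python
BLOCK_PIXELS = 16  # one 4 x 4 macro block


def code_bits(codebook_size: int) -> int:
    """Bits needed to address one template in the codebook."""
    return (codebook_size - 1).bit_length()


def ratio_monochrome(codebook_size: int) -> float:
    """16 bytes of luminance replaced by one code."""
    raw_bits = BLOCK_PIXELS * 8
    return raw_bits / code_bits(codebook_size)


def ratio_color_411(codebook_size: int) -> float:
    """Assumed 4:1:1 layout: four Y blocks share one U and one V block."""
    raw_bits = 4 * BLOCK_PIXELS * 24             # four macro blocks, 24 bits per pixel
    coded_bits = (4 + 1 + 1) * code_bits(codebook_size)
    return raw_bits / coded_bits


print(round(ratio_monochrome(2048), 1), round(ratio_color_411(2048), 1))
```

With a 2048-vector codebook the two functions give 128/11 and 1536/66, which round to the 11.6 and 23.3 stated above.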

In the present system, a static codebook was employed. This is in contrast to VQ encoders in [9] and [11], which employ dynamic codebooks that update themselves according to input vectors. The codebook was generated using the standard technique of Kohonen's self-organizing map (SOM) [14].

The performance of the VQ image compression is demonstrated in Fig. 2. Original figures are all color pictures, but they are reproduced here in black and white. Fig. 2(a) shows the picture used for codebook generation. Each macro block in the picture was presented to the SOM, and weight updates were repeated 384 000 times. The vector length was not normalized in the present system. The original image of Susie in Fig. 2(b) was compressed using the codebook generated from the image in Fig. 2(a); the decompressed image is shown in Fig. 2(c). In Fig. 2(d), the image obtained by JPEG is shown for comparison. The peak signal-to-noise ratio (PSNR) of the decompressed VQ image in Fig. 2(c) is 32.0 dB. When the codebook is generated using the original image of Susie itself, the PSNR is 38.2 dB. To investigate the impact of the codebook size on PSNR, the picture in Fig. 2(a) was utilized to generate codebooks having 1024 and 512 template vectors. The PSNRs of the reconstructed image of Susie were 30.8 and 30.0 dB for the 1024-vector and 512-vector codebooks, respectively.
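The text does not spell out how PSNR was computed. A minimal sketch, assuming the usual definition for 8-bit data (peak value 255 and mean squared error over all pixels), is given below; the function name and the flat-list image representation are illustrative choices.

```python
import math
from typing import Sequence


def psnr(original: Sequence[int], decoded: Sequence[int], peak: int = 255) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(original) != len(decoded):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


# Every pixel off by 10 levels gives an MSE of 100.
print(psnr([100, 100, 100, 100], [110, 110, 110, 110]))
```

Under this definition, the gap between 32.0 dB and 30.0 dB corresponds to roughly a 1.6-fold increase in mean squared error when the codebook shrinks from 2048 to 512 vectors.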

#### III. SYSTEM ORGANIZATION

Fig. 3 shows the block diagram of our real-time video and audio compression system, which will be implemented on a performance board. Video and sound signals are fed to the system and captured in buffer memories. The captured signals are then transferred to the VQ chip module for encoding. The VQ chip module consists of eight VQ chips. The total number of template vectors in the video and audio codebooks combined will be kept to fewer than 2048. The board is controlled entirely by a personal computer via the peripheral component interconnect (PCI) bus.

If only a VQ chip module is used for compression, the data-compression ratio is 23 or less. To achieve a data-compression ratio as large as 200, a simple motion-sensor chip has been developed using a gate array. In this paper, however, only the VQ chip is presented.

The most important concern of the present system is the real-time encoding of motion pictures. To encode a 640  $\times$  480 full-color picture in a 4 : 1 : 1 format within 33 ms, a single VQ operation must be completed within 1.1  $\mu$ s. Our strategy toward this end is as follows. First, a fully parallel SIMD architecture has been employed. Second, a single VQ operation is conducted in two pipeline stages, each pipeline segment consisting of 19 cycles. As a result, a single VQ operation is finished in every 1.1  $\mu$ s at a clock frequency of 17 MHz. Third, the chip is extendible to an eight-chip master–slave configuration, enabling us to perform a fully parallel search for a maximum 2048 template vectors in 1.1  $\mu$ s.
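The 1.1-$\mu$s requirement can be checked with a small budget calculation. The sketch below assumes that a 4:1:1 frame consists of the full-resolution Y plane plus U and V planes at one quarter of the block count each; the constant and function names are illustrative.

```python
FRAME_WIDTH, FRAME_HEIGHT = 640, 480
BLOCK_PIXELS = 16        # 4 x 4 macro block
FRAME_TIME_S = 33e-3     # one frame period as quoted in the text
CYCLES_PER_VQ = 19       # one pipeline segment


def blocks_per_frame_411() -> int:
    """Macro blocks to encode per frame under the assumed 4:1:1 layout."""
    y_blocks = FRAME_WIDTH * FRAME_HEIGHT // BLOCK_PIXELS
    return y_blocks + 2 * (y_blocks // 4)


def time_budget_per_vq(blocks: int) -> float:
    """Seconds available for each VQ operation within one frame period."""
    return FRAME_TIME_S / blocks


def vq_period(clock_hz: float) -> float:
    """Seconds between successive VQ results at the given clock frequency."""
    return CYCLES_PER_VQ / clock_hz


blocks = blocks_per_frame_411()
print(blocks, time_budget_per_vq(blocks), vq_period(17e6))
```

Under these assumptions there are 28 800 VQ operations per frame, and 19 cycles at 17 MHz fit within the per-operation budget.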




Fig. 4. Block diagram of VQ chip module.





Fig. 5. Detailed block diagram of VQ chip.

Fig. 6. Block diagram for Manhattan distance calculation.

Fig. 4 shows the block diagram of the VQ chip module, which is composed of eight VQ chips, namely, one master chip and seven slave chips. Each VQ chip stores 256 template vectors in the embedded SRAM. The input vector is given to all the chips at the same time and stored in the input first-in, first-out (FIFO) buffers. The template vector having the minimum distance to the input vector is searched for in three stages of competition by using winner-take-all (WTA) circuits. The first stage is performed in each 64-vector matching block, where the distances between the input vector and 64 template vectors are calculated and the winner (the shortest-distance vector) is selected in each block. The second stage is conducted on each chip, and the chip winner is selected by the second WTA. The distance of each chip winner is sent to the master chip, where the final competition is carried out to find the global winner. The VQ chip organization is explained in detail in the following.
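The three-stage competition described above can be modeled in software as a hierarchical minimum search. The sketch below is a sequential illustration only; the global numbering of templates (chip index times 256 plus the on-chip index) and the function names are assumptions, not taken from the chip specification.

```python
from typing import List, Sequence, Tuple


def stage_winner(values: Sequence[int]) -> Tuple[int, int]:
    """Return (index, value) of the minimum; ties go to the smaller index."""
    idx = min(range(len(values)), key=values.__getitem__)
    return idx, values[idx]


def three_stage_search(distances: Sequence[int], block_size: int = 64,
                       blocks_per_chip: int = 4, chips: int = 8) -> Tuple[int, int]:
    """Find the global winner via block, chip, and module competitions."""
    if len(distances) != block_size * blocks_per_chip * chips:
        raise ValueError("unexpected number of distances")
    chip_results: List[Tuple[int, int]] = []
    for c in range(chips):
        block_results = []
        for b in range(blocks_per_chip):
            start = (c * blocks_per_chip + b) * block_size
            # Stage 1: winner inside one 64-vector matching block.
            block_results.append(stage_winner(distances[start:start + block_size]))
        # Stage 2: winner among the four block winners of this chip.
        b_win, _ = stage_winner([d for _, d in block_results])
        v_win, d_win = block_results[b_win]
        chip_results.append((b_win * block_size + v_win, d_win))
    # Stage 3: winner among the eight chip winners (done on the master chip).
    c_win, _ = stage_winner([d for _, d in chip_results])
    local_code, d = chip_results[c_win]
    return c_win * block_size * blocks_per_chip + local_code, d


distances = [4080] * 2048
distances[1000] = 7
print(three_stage_search(distances))  # prints: (1000, 7)
```

Because each stage only forwards its local minimum, the result equals a flat minimum search over all 2048 distances while only the winners' distances cross block and chip boundaries.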

#### IV. CIRCUIT IMPLEMENTATION

A block diagram of the VQ chip is shown in Fig. 5. The VQ chip is composed of an input FIFO buffer, four 64-vector matching blocks (only one block is shown in the figure), three stages of competition blocks, and an output FIFO buffer.

Each 64-vector matching block has an 8-K SRAM storing 64 template vectors. Before VQ operation, the template vectors are downloaded from the external memory to SRAM cells (32 K on a single chip) element by element. Each element has an 8-bit length. When all the template vectors are downloaded, an input vector is given to the chip four elements at a time (32-bit parallel) and stored in the FIFO buffer. The FIFO buffer is composed of a two-port SRAM of 32 bits $\times$ 128 words and therefore can store a maximum of 64 vectors during VQ operation. This FIFO buffer enables a flexible data transfer on the performance board. When the FIFO receives at least one input vector, the chip automatically starts the VQ operation. The vector-matching search over the entire set of 2048 template vectors is conducted in three competition stages all in parallel, as described in the following.

Fig. 7. Winner-take-all circuit.

Fig. 8. Winner-observer circuit.

The first competition is carried out in the four 64-vector-matching blocks shown in Fig. 5. The Manhattan distance between the input vector and a template vector is calculated by the absolute value and accumulate (AVA) block in a byte-serial manner. The input vector is supplied from the input FIFO element by element. All the template vectors are simultaneously read out from the embedded 32-K SRAM on each chip, also element by element, and supplied to the AVA block. The AVA block consists of 64 identical circuits described in Fig. 6, all working in parallel.

A block diagram for Manhattan distance calculation is shown in Fig. 6. In the first step, the element of the input vector is subtracted from the corresponding element of a template vector. This subtraction is carried out by adding the 2's-complement of the input-vector element to the template-vector element using an 8-bit adder. The 2's-complement is obtained by inverting the data of the input vector before the addition and supplying an extra "1" to the carry input of the least significant bit (LSB) full adder. After the subtraction, the carry bit of the 8-bit adder is examined to see whether the result is positive or negative. If the sum is positive, the data are accumulated in the 12-bit accumulator. If it is negative, the 2's-complement of the sum is accumulated. This function is implemented using exclusive OR gates.
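The adder-based absolute-difference scheme can be modeled at the bit level as follows. This is a behavioral sketch under the assumption that the carry-out of the 8-bit adder is "1" exactly when the difference is non-negative; the function names are illustrative, and the real circuit is the one in Fig. 6.

```python
MASK8 = 0xFF    # 8-bit data path
MASK12 = 0xFFF  # 12-bit accumulator


def abs_diff_adder(t: int, x: int) -> int:
    """|t - x| for 8-bit unsigned values using invert, add, and XOR only."""
    total = t + (~x & MASK8) + 1      # template plus 2's-complement of input
    carry = total >> 8                # carry-out of the 8-bit adder
    s = total & MASK8
    if carry:                         # t >= x: the sum is already |t - x|
        return s
    # Negative result: invert every bit with XOR and add 1 (2's-complement).
    return ((s ^ MASK8) + 1) & MASK8


def manhattan_accumulate(template, x) -> int:
    """Byte-serial accumulation of 16 absolute differences."""
    acc = 0
    for t, xi in zip(template, x):
        acc = (acc + abs_diff_adder(t, xi)) & MASK12
    return acc


print(abs_diff_adder(3, 5), abs_diff_adder(5, 3), manhattan_accumulate([255] * 16, [0] * 16))
```

The largest possible distance is 16 × 255 = 4080, which fits in 12 bits, so the accumulator in this model never wraps.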

In principle, it takes at least 16 clock cycles to obtain the Manhattan distance because a vector has 16 elements. However, the operation involves 19 clock cycles. The extra three cycles are used for the initialization of the circuit, the data transfer from the input FIFO and SRAM, and the data transfer from the 8-bit adder to a 12-bit accumulator via an 8-bit register. These 19 cycles form the first segment of the two-stage pipeline processing. The final result stored in the accumulator is sent to the WTA block in a bit-serial manner.

Fig. 9. Data flow timing chart.

Fig. 10. Die photograph of VQ chip.

The processing scheme of the first-stage WTA is illustrated in Fig. 7. A state register is provided for each of the 64 parallel input distance signals to store the flag data indicating the status of the competition: "1" means the winner and "0" means the loser. At first, all the flag data are set to "1." Then the 64 distance values obtained in the AVA block are fed to the minimum pass filter bit-serially, one bit per clock cycle, from the most significant bit to the least significant bit. At each cycle, only the smallest bit values are passed through the minimum pass filter, and the inputs having larger values are withdrawn from the following competition. The minimum pass filter is composed of 64 identical circuits. Each circuit is composed of a two-input OR gate, a two-input NAND gate, and the 64-input AND gate common to all circuits, as shown in Fig. 7. The flag stored in each state register controls the OR gate. If the flag is "1," the distance signal bit is passed through the upper OR gate. If the flag is "0," the output is forced to "1"; thus the input signal is neglected in the following competition. The 64-input AND gate detects the minimum value of the signals passed through the upper OR gate. The lower NAND gate compares the signal from the upper OR gate and the output of the 64-input AND gate. It outputs "1" when these values are the same and "0" if they are different. The results are stored in the state register as a flag to control the next competition step. After 12 cycles of competition steps, flag "1" remains only at the location of the minimum-distance template vector (the winner).
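The bit-serial competition described above can be simulated as follows. This is a behavioral sketch, not a gate-accurate model of Fig. 7: in particular, the update `flag AND (passed == min_bit)` is an assumption that keeps a loser's flag at "0" once it has dropped out, and the function name is illustrative.

```python
from typing import List, Sequence, Tuple


def bit_serial_wta(distances: Sequence[int], width: int = 12) -> Tuple[List[int], int]:
    """Return (winner flags, minimum distance) after `width` competition steps."""
    flags = [1] * len(distances)          # every input starts as a candidate
    min_value = 0
    for bit in range(width - 1, -1, -1):  # most significant bit first
        bits = [(d >> bit) & 1 for d in distances]
        # OR gate: losers (flag 0) are forced to 1 and thus ignored.
        passed = [b | (1 - f) for b, f in zip(bits, flags)]
        # Wide AND gate: 0 if any remaining candidate has a 0 at this bit.
        min_bit = 1 if all(passed) else 0
        # Candidates whose bit differs from the minimum bit drop out.
        flags = [f & int(p == min_bit) for f, p in zip(flags, passed)]
        min_value = (min_value << 1) | min_bit
    return flags, min_value


print(bit_serial_wta([4080, 17, 300]))  # prints: ([0, 1, 0], 17)
```

The sequence of `min_bit` values is the minimum distance itself, emitted most significant bit first, which is why the next WTA stage can consume it synchronously. Equal distances leave more than one flag set, a case the winner-observer circuit resolves.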

The minimum distance value obtained in each first-stage WTA (the signal appears at the output of the 64-input AND gate) is sent to the second-stage WTA synchronously with the operation of the first-stage WTA. The second-stage WTA works similarly, and the winner of the chip is identified. Then the distance signal of the chip winner is sent to the third-stage WTA for final competition, and the global winner is identified. The third-stage WTA/winner-observer (WO) and output FIFO are enabled only in the master chip. The distance signal transfer from the accumulator down to the third stage of WTA is conducted in a bit-serial manner, and the three-stage WTA selections are carried out within 19 clock cycles, which forms the second segment of the two-stage pipeline processing.

TABLE I Layout Area Comparison of Processing Elements

| Processing element    | Area [mm²] | Ratio [%] |
|-----------------------|------------|-----------|
| SRAM                  | 7.01       | 19        |
| Vector-Matching Block | 29.5       | 78        |
| WTA / WO              | 1.27       | 3         |

Fig. 11. Measured waveforms.

TABLE II Specification of VQ Chip (Numbers in Parentheses Indicate Data at 5-V Power Supply)

| Item              | Specification                      |
|-------------------|------------------------------------|
| Technology        | 0.6 μm CMOS / triple metal         |
| Supply            | 3.3 V (5.0 V)                      |
| Power Dissipation | 0.29 W @ 17 MHz (0.94 W @ 33 MHz)  |
| Transistor Count  | 800k Tr's                          |
| Die Size          | 7.98 mm × 9.03 mm                  |
| Code Book         | 32K SRAM for 256 Vectors           |
| Search Time       | 1.1 µs (570 ns)                    |

Fig. 12. Shmoo plot of fabricated chip.

Encoding of the location of the winner at each WTA stage is performed by the WO circuits, shown in Fig. 8. The figure shows the logic circuit of the WO after the third-stage WTA. The WO consists of three stages of AND ladders that encode the location of the winner to a 3-bit code number. This circuit plays another important role: there is a chance of having two or more winners. If two or more winners remain after the competition at a WTA stage, the WO circuit selects only the one having the smallest code number and ignores the others.
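Functionally, the WO behaves like a priority encoder over the winner flags. The sketch below models only this behavior, not the AND-ladder structure of Fig. 8; the function name is an assumption.

```python
from typing import Sequence


def winner_observer(flags: Sequence[int]) -> int:
    """Return the code number of the lowest-numbered flag that is 1."""
    for code, flag in enumerate(flags):
        if flag:
            return code
    raise ValueError("no winner flag is set")


# Eight chip-level flags after the third-stage WTA, with a tie between 5 and 7.
print(winner_observer([0, 0, 0, 0, 0, 1, 0, 1]))  # prints: 5
```

For eight inputs the returned value fits in the 3-bit code number mentioned in the text.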

In Fig. 9, the data-flow timing chart of the VQ operation is shown. The distance calculation and the three-stage competition are conducted by two pipeline processing stages. The interface between the two stages is a parallel-to-serial converter, which is composed of a 12-bit register.
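The effect of the two-stage pipeline on latency and throughput can be illustrated with a simple cycle count. The sketch assumes an ideal pipeline with no stalls, that is, the input FIFO always holds a vector when a segment starts; the names are illustrative.

```python
SEGMENT_CYCLES = 19  # length of one pipeline segment
STAGES = 2           # distance calculation, then three-stage WTA


def finish_cycle(k: int) -> int:
    """Clock cycle at which the k-th input vector (0-based) has been encoded."""
    return (k + STAGES) * SEGMENT_CYCLES


def total_cycles(n_vectors: int) -> int:
    """Cycles needed to encode n_vectors back to back."""
    return (n_vectors + STAGES - 1) * SEGMENT_CYCLES


print(finish_cycle(0), finish_cycle(1), total_cycles(28800) / 17e6)
```

Each vector takes 38 cycles from input to result, but a new result appears every 19 cycles. Under the stated assumptions, the 28 800 operations of a 4:1:1 frame take about 32.2 ms at 17 MHz.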

#### V. MEASURED RESULTS

A die photograph of the VQ chip is shown in Fig. 10. The chip is fabricated in a 0.6-$\mu$m CMOS process with triple-metal layers. The layout area of each processing block is compared in Table I. About 78% of the chip area is occupied by four 64-vector-matching blocks.

In Fig. 12, a shmoo plot of the VQ chip is shown. With a power supply of 3.3 V, the chip operates up to the maximum frequency of 28.5 MHz (cycle time of 35 ns), well fulfilling the real-time compression requirement of 17 MHz. The power dissipation of the chip was 0.35 and 0.29 W for 28.5- and 17-MHz operation, respectively. When the power supply is increased to 5 V, the chip works at 40 MHz (25 ns). This meets the requirement of 33 MHz, which corresponds to the pipeline cycle of 570 ns and is compatible with encoding of a full-color picture in the RGB format within 33 ms. At 5-V power supply, the power dissipation is 0.94 W at a clock frequency of 33 MHz. Table II summarizes the features of the chip.

#### VI. CONCLUSION

A vector-quantization processor has been developed for real-time encoding of motion pictures. It employs a fully parallel SIMD architecture with a two-stage pipeline. A 2-K-vector search is conducted by an eight-chip configuration, namely, one master chip and seven slave chips. The search time is 1.1 $\mu$s at a clock frequency of 17 MHz, and the power dissipation of the chip is 0.29 W under a 3.3-V power supply.

A single VQ operation for 2-K template vectors on typical CISC processors requires roughly 1.2-M operations. This number was derived from the following estimation: (38 operations/element) × (16 elements/vector) × (2048 vectors/VQ) = 1.2 M operations/VQ. The present VQ system in an eight-chip configuration can do this job in 1.1  $\mu$ s, which is equivalent to a CISC processor performance of about 1000 GOPS (1.2-M operations/1.1  $\mu$ s).
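The estimate above can be retraced numerically. The sketch simply multiplies out the figures quoted in the text; the 38 operations per element is the authors' own estimate and is taken as given, and the variable names are illustrative.

```python
OPS_PER_ELEMENT = 38        # estimate quoted in the text
ELEMENTS_PER_VECTOR = 16
VECTORS_PER_VQ = 2048
SEARCH_TIME_S = 1.1e-6      # one VQ operation on the eight-chip system

ops_per_vq = OPS_PER_ELEMENT * ELEMENTS_PER_VECTOR * VECTORS_PER_VQ
equivalent_gops = ops_per_vq / SEARCH_TIME_S / 1e9

print(ops_per_vq, round(equivalent_gops))
```

The product is 1 245 184 operations, which the text rounds to 1.2 M; dividing by 1.1 $\mu$s gives a figure on the order of 1000 GOPS.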

#### ACKNOWLEDGMENT

The authors wish to thank H. Akutsu, formerly with Dome, Inc., Tokyo, Japan, for his support in software development and establishing the chip specifications; A. Kawamura and K. Marumoto of Rohm Co., Ltd., Kyoto, Japan, for their contributions to the chip design; and H. Takasu of Rohm Co., Ltd., for his support in fabricating the chip.

#### REFERENCES

- [1] M. Toyokura, H. Kodama, E. Miyagoshi, K. Okamoto, M. Gion, T. Minemaru, A. Ohtani, T. Araki, H. Takeno, T. Akiyama, B. Wilson, and K. Aono, "A video DSP with a macroblock-level-pipeline and a SIMD type vector pipeline architecture for MPEG2 CODEC," *IEEE J. Solid-State Circuits*, vol. 29, pp. 1474–1481, Dec. 1994.
- [2] M. Mizuno, Y. Ooi, N. Hayashi, J. Goto, M. Hozumi, K. Furuta, Y. Nakazawa, O. Ohnishi, Y. Yokoyama, Y. Katayama, H. Takano, N. Miki, Y. Senda, I. Tamitani, and M. Yamashita, "A 1.5W single-chip MPEG2 MP@ML encoder with low-power motion estimation and clocking," in *ISSCC Dig. Tech. Papers*, Feb. 1997, pp. 256–257.
- [3] K. Ishihara, S. Masuda, S. Hattori, H. Nishikawa, Y. Ajioka, T. Yamada, H. Amishiro, S. Uramoto, M. Yoshimoto, and T. Sumi, "A half-pel precision MPEG2 motion-estimation processor with concurrent three-vector search," *IEEE J. Solid-State Circuits*, vol. 30, pp. 1502–1509, Dec. 1995.
- [4] M. Matsui, H. Hara, Y. Uetani, L. Kim, T. Nagamatsu, Y. Watanabe, A. Chiba, K. Matsuda, and T. Sakurai, "A 200MHz 13 mm<sup>2</sup> 2-D DCT macrocell using sense-amplifying pipeline flip-flop scheme," *IEEE J. Solid-State Circuits*, vol. 29, pp. 1482–1490, Dec. 1994.
- [5] A. Werf, F. Bruls, R. Kleihorst, E. Waterlander, M. Verstraelen, and T. Friedrich, "LMcIC: A single-chip MPEG2 video encoder for storage," in *ISSCC Dig. Tech. Papers*, Feb. 1997, pp. 254–255.
- [6] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression. Boston, MA: Kluwer Academic, 1992.
- [7] A. P. Chandrakasan, A. Burstein, and R. W. Brodersen, "A low-power chipset for a portable multimedia I/O terminal," *IEEE J. Solid-State Circuits*, vol. 29, pp. 1415–1428, Dec. 1994.
- [8] E. K. Tsern and T. Meng, "A low-power video-rate pyramid VQ decoder," *IEEE J. Solid-State Circuits*, vol. 31, pp. 1789–1794, Nov. 1996.
- [9] K. Tsang and B. W. Y. Wei, "A VLSI architecture for a real-time code book generator and encoder of a vector quantizer," *IEEE Trans. VLSI Syst.*, vol. 2, pp. 360–364, Sept. 1994.
- [10] J. E. Fowler, Jr., K. C. Adkins, S. B. Bibyk, and S. C. Ahalt, "Real-time video compression using differential vector quantization," *IEEE Trans. Circuits Syst.*, vol. 5, pp. 14–24, Feb. 1995.

- [12] A. Gentile, H. Cat, F. Kossentini, F. Sorbello, and D. S. Wills, "Real-time implementation of full-search vector quantization on a low memory SIMD architecture," in *Proceedings of the Data Compression Conference*, J. A. Storer and M. Cohn, Eds. Los Alamitos, CA: IEEE Computer Society Press, 1996.
- [13] G. T. Tuttle, S. Fallahi, and A. A. Abidi, "An 8b CMOS vector A/D converter," in *ISSCC Dig. Tech. Papers*, Feb. 1993, pp. 38–39.
- [14] T. Kohonen, Self-Organizing Maps. Berlin, Germany: Springer, 1995.



**Akira Nakada** was born in Ishikawa, Japan, on March 10, 1971. He received the B.S., M.S., and Ph.D. degrees in electronic engineering from Tohoku University, Sendai, Japan, in 1993, 1995, and 1998, respectively.

Since 1998, he has been a Research Associate with the VLSI Design and Education Center, University of Tokyo, Tokyo, Japan. He is currently working on new architecture electronic circuits and systems using both analog and digital technologies.



**Tadashi Shibata** (M'79) was born in Hyogo, Japan, on September 30, 1948. He received the B.S. degree in electronic engineering and the M.S. degree in material science from Osaka University, Osaka, Japan, in 1971 and 1973, respectively, and the Ph.D. degree from the University of Tokyo, Tokyo, Japan, in 1984.

From 1974 to 1986, he was with Toshiba Corp., where he was a Researcher working on the research and development (R&D) of device and processing technologies for ULSI's. He was engaged in the development of microprocessors, EEPROM's, and DRAM's, primarily in the process integration and research of advanced processing technologies for their fabrication. From 1984 to 1986, he was a Production Engineer at one of the most advanced manufacturing lines of Toshiba. During 1978–1980, he was a Visiting Research Associate at Stanford Electronics Laboratories, Stanford University, Stanford, CA, where he studied laser-beam processing of electronic materials, including silicide, polysilicon, and superconducting materials. From April 1986 to May 1997, he was an Associate Professor in the Department of Electronic Engineering, Tohoku University, where he was engaged in the R&D of ultraclean technologies. Since May 1997, he has been a Professor in the Department of Information and Communication Engineering, University of Tokyo.

Dr. Shibata is a member of the Japan Society of Applied Physics; the Institute of Electrical Engineers of Japan; and the IEEE Electron Devices Society, Circuits and Systems Society, and Computer Society.



**Masahiro Konda** was born in Hyogo, Japan, on March 23, 1972. He received the B.S. and M.S. degrees in electronic engineering from Tohoku University, Sendai, Japan, in 1995 and 1997, respectively, where he is pursuing the Ph.D. degree.

He currently is working on circuits and systems for image processing.



**Tatsuo Morimoto** was born in Tochigi, Japan, on August 3, 1971. He received the B.S. and M.S. degrees in electronic engineering from Tohoku University, Sendai, Japan, in 1995 and 1997, respectively, where he is pursuing the Ph.D. degree.

He currently is working on circuits and systems for image processing.



**Tadahiro Ohmi** (M'81) was born in Tokyo, Japan, on January 10, 1939. He received the B.S., M.S., and Ph.D. degrees in electrical engineering from the Tokyo Institute of Technology, Tokyo, in 1961, 1963, and 1966, respectively.

Prior to 1972, he was a Research Associate in the Department of Electronics, Tokyo Institute of Technology, where he worked on Gunn diodes such as velocity overshoot phenomena, multivalley diffusion, and frequency limitation of negative differential mobility due to an electron transfer in the multivalleys; high-field transport in semiconductors, such as unified theory of space-charge dynamics in negative differential mobility materials, Bloch-oscillation-induced negative mobility and Bloch oscillators, and dynamics in injection layers. In 1972, he joined Tohoku University, Sendai, Japan, where he is a Professor in the New Industry Creation Hatchery Center (NICHe). He is currently engaged in research on high-performance ULSI's such as ultrahigh-speed ULSI, current overshoot transistor LSI, HBT LSI, and SOI on metal substrate, base store image sensor (BASIS) and high-speed flat-panel display, and advanced semiconductor process technologies, i.e., ultraclean technologies such as high-quality oxidation, high-quality metallization due to low kinetic energy particle bombardment, very low-temperature Si epitaxy particle bombardment, crystallinity control film growth technologies from single-crystal, grain-size-controlled polysilicon, and amorphous due to low kinetic energy particle bombardment; highly selective CVD, highly selective RIE, high-quality ion implantation with low-temperature annealing capability, etc., based on the new concept supported by newly developed ultraclean gas supply system, ultra-high vacuum-compatible reaction chamber with self-cleaning function, ultraclean wafer surface cleaning technology, etc. His research includes 700 original papers and 600 patent applications.

Dr. Ohmi is a member of the Institute of Electronics, Information and Communication Engineers of Japan, the Institute of Electronics of Japan, the Japan Society of Applied physics, and the ECS. He received the Ichimura Award in 1979, the Teshima Award in 1987, the Inoue Harushige Award in 1989, the Ichimura Prizes in Industry—Meritorious Achievement Prize in 1990, the Okouchi Memorial Technology Prize in 1991, the Minister of State for Science and Technology Award for the promotion of Invention in 1993, the Invention Prize from the Fourth International Conference on Soft Computing (IIZKA'96), the Best Paper Award in 1996, and the IEICE Achievement Award in 1997. He is the President of the Institute of Basic Semiconductor Technology-Development (Ultra Clean Society).