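National Science Council, Executive Yuan: Research Project Interim Progress Report

# Intelligent Warning, Monitoring, and Guidance Network System for the Blind Based on Wireless Optical Communication -- Subproject 1: Program for Enhancing the R&D Capacity of Private Universities - Development of a Wireless Optical Transceiver (2/3): Interim Progress Report (Condensed Version)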

| Item           | Details                                                  |
|----------------|----------------------------------------------------------|
| Project type   | Integrated                                               |
| Project number | NSC 94-2745-E-032-002-URD                                |
| Project period | August 1, 2005 to July 31, 2006                          |
| Institution    | Department of Electrical Engineering, Tamkang University |

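Principal investigator: 江正雄 Co-principal investigators: 饒建奇, 郭建宏

Attachments: report on attending an international conference, and the published paper

Handling: this project report may be publicly accessed

#### December 23, 2006 (ROC year 95)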

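# National Science Council, Executive Yuan: Research Project Interim Progress Report

# Subproject 1: Program for Enhancing the R&D Capacity of Private Universities - Development of a Wireless Optical Transceiver (2/3)

<u>Project type:</u> Integrated project <u>Project number:</u> NSC94-2745-E-032-002-URD <u>Project period:</u> August 1, 2005 to July 31, 2006 <u>Institution:</u> Department of Electrical Engineering, Tamkang University

- <u>Principal investigator:</u> 江正雄
- Co-principal investigators: 饒建奇, 郭建宏
- Report type: condensed report

<u>Attachments</u>: report on attending an international conference, and the published paper. Handling: this project report may be publicly accessed

## June 29, 2006 (ROC year 95)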

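# National Science Council, Executive Yuan: Subsidized Research Project Results Report

Intelligent Warning, Monitoring, and Guidance Network System for the Blind Based on Wireless Optical Communication (1/3)

Subproject 1: Development of a Wireless Optical Transceiver

Project type: Integrated project

Project number: NSC 93-2745-E-032-002-URD

Project period: August 1, 2004 to July 31, 2005

Principal investigator: 江正雄

Co-principal investigators: 饒建奇, 郭建宏

Institution: Department of Electrical Engineering, Tamkang University

May 31, 2005 (ROC year 94)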

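## National Science Council, Executive Yuan: Research Project Results Report

Intelligent Warning, Monitoring, and Guidance Network System for the Blind Based on Wireless Optical Communication (2/3)

Subproject 2: Development of a Wireless Optical Transceiver

Project number: NSC 93-2745-E-032-002-URD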

Project period: August 1, 2004 to July 31, 2005

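| Role                      | Name   | Institution                                              |
|---------------------------|--------|----------------------------------------------------------|
| Principal investigator    | 江正雄 | Department of Electrical Engineering, Tamkang University |
| Co-principal investigator | 饒建奇 | Department of Electrical Engineering, Tamkang University |
| Co-principal investigator | 郭建宏 | Department of Electrical Engineering, Tamkang University |
| Project participant       | 陳信良 | Department of Electrical Engineering, Tamkang University |
| Project participant       | 陳之皓 | Department of Electrical Engineering, Tamkang University |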

Keywords: wireless optical communications, wireless optical transceiver (FSO transceiver), receiver front end (FE), clock recovery, SOC

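I. Chinese Abstract

In deploying the last stage of a communication network, the access network, the most common choices are optical fiber, copper wire, or wireless networks (Wi-Fi). Laying fiber or copper is slow, laborious, and expensive. Wireless networks raise the issue of channel allocation: most channels must be applied for in advance and can be used only after approval. In addition, the bandwidth and data rate of wireless networks are not high, which further limits their use. Besides these three, there is another option for the so-called "last mile" network (that is, the access network): wireless optical communication, also called free-space optics (FSO). Such a network uses laser diodes or light-emitting diodes to send voice, video, or data from the transmitter to the receiver with air as the medium, at data rates of up to 2.5 Gbps. Because the link is point to point, it is also highly secure and unlikely to be eavesdropped on. Its cost is much lower than that of fiber or copper, roughly 1/3 to 1/10 the price of a fiber network, and installation takes only about three hours, so it is economical. It also offers more bandwidth than wireless networks. Wireless optical communication has great potential and is therefore worth researching and developing.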

#### Abstract

In the last-mile deployment of access networks, people usually use optical fiber, copper wire, or wireless networks (Wi-Fi). However, fiber and copper are expensive: for a single office, fiber deployment may cost about 200,000 US dollars and take four to twelve months to construct. For wireless networks, the radio frequency (RF) channels are licensed, and a permit must be obtained in advance, which restricts the use of wireless networks in practice. There is another choice for last-mile deployment: wireless optical communication networks, also called free-space optics (FSO). In FSO, audio, video, images, and data are transmitted from laser diodes or light-emitting diodes (LEDs) through the air to the receiver side, which consists of photodiodes (PDs) and the corresponding receiver circuits. The highest bit rate of FSO can reach 2.5 Gbps, and the cost is only 1/3 to 1/10 that of optical fiber. Construction takes only about three hours. FSO needs no licensed RF channels, and its bit rate is much higher than that of Wi-Fi networks. It is a promising network technology for future use and merits further study and development.

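II. Project Background and Objectives

The overall project aims to use wireless optical communication to build a guidance system for the blind. In this system, a blind user can send his or her information and needs to the control center over a wireless optical link. The wireless optical communication required here resembles Wi-Fi in its mode of transmission and reception. The transceiver module does not need demanding specifications, but it must have low power consumption and low cost. Although the guidance system does not require high-end wireless optical technology, wireless optical communication is a key technology for future communications. This subproject therefore aims to develop a low-power, low-cost, high-performance Wi-Fi-style wireless optical transceiver, which can be used not only in the guidance system of the overall project but also in future Wi-Fi-style wireless optical communication systems.

This subproject, "Research and Implementation of Wireless Optical Transceiver Systems and Chips," is a three-year research project. For the wireless optical transceiver, the transmitter and the receiver will be designed separately, then integrated with each other, and finally integrated with the overall project. The goal is to design and fabricate a wireless optical transceiver with low power consumption, low cost, and high performance.

III. Research Methods and Results

This subproject studies analog receiver front-end circuits for wireless optical communication systems. It covers the principles commonly used in the front-end circuits of optical fiber receivers, silicon photodetectors, a transimpedance amplifier (TIA) designed with current mirrors, and a TIA designed with on-chip inductors. The main goal is to build a low-power, high-data-rate, high-bandwidth analog receiver front end that includes the photodetector.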

3.1 TIA

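### IV. Conclusions and Discussion

In the three-year project "Research and Implementation of Wireless Optical Transceiver Systems and Chips," defining the transceiver specifications in the first year proved quite difficult. We ran into many bottlenecks when settling the specifications, and choosing the architecture of the receiver analog front end was also tricky. We have now drafted the specifications we need and have done a fair amount of circuit research. Next, we will build a prototype of the whole optical communication system from off-the-shelf parts, implement the receiver front-end circuit on our own, and aim to complete the chip tape-out in the second year.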


[Figure: block diagram with the labels photodiode, offset compensation, gain stages, single-to-differential amplifier, offset subtractor, and CDR, followed by a transistor-level schematic.]

VI. Figures and Tables









# Report on Attendance at an International Academic Conference by Domestic Experts and Scholars, Subsidized by the National Science Council, Executive Yuan

July 19, 2004 (ROC year 93)

| Item                          | Details                                                                                                                        |
|-------------------------------|--------------------------------------------------------------------------------------------------------------------------------|
| Reporter                      | 江正雄                                                                                                                         |
| Institution and title         | Associate Professor, Department of Electrical Engineering, Tamkang University                                                  |
| Conference dates and location | May 23 to 26, 2004, Vancouver, Canada                                                                                          |
| NSC grant number              | NSC 92-2218-E-032-007-                                                                                                         |
| Conference name               | 2004 IEEE International Symposium on Circuits and Systems (Chinese: 2004 年國際電子電機協會國際電路與系統研討會)               |
| Paper title                   | High-Speed EBCOT With Dual Context-Modeling Coding Architecture for JPEG2000 (Chinese: 應用於 JPEG2000 具高速及平行處理器之 EBCOT) |

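I. Conference Participation

IEEE ISCAS is the IEEE's major annual conference on circuits and systems. It rotates among major countries and cities around the world and draws thousands of attendees each time. This year the conference was held in Vancouver, Canada. It accepted more than one thousand papers and drew over two thousand participants from around the world, an impressive turnout.

Our paper, "High-Speed EBCOT With Dual Context-Modeling Coding Architecture for JPEG2000," was accepted and presented orally on May 26. The presentation drew a strong response, which will greatly help our future research on multimedia systems.

Besides presenting the paper, I attended many paper presentations and poster sessions related to my research and discussed research ideas with the presenters, which was very rewarding. I also attended the site meetings of the IEEE Circuits and Systems technical committees (TCs) on neural networks (NSA, Neural Network Systems and Applications) and multimedia (MSA, Multimedia Systems and Applications). I was already a member of the neural network TC, and this time I also became a member of the multimedia TC. This will greatly help my research and will also benefit Taiwan to some degree.

II. Observations and Suggestions

Many scholars and experts attended the conference, including well-known figures in the field, and the chance to exchange ideas with them made the trip well worthwhile. I will attend this conference again next year and will bring some graduate students, especially doctoral students, so that they can experience the academic atmosphere. This will greatly broaden their horizons and benefit their future research.

III. Materials Brought Back

I brought back one CD-ROM of the conference proceedings and several calls for papers from other conferences.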

#### High-Speed EBCOT With Dual Context-Modeling Coding Architecture for JPEG2000

Jen-Shiun Chiang, Chun-Hau Chang, Yu-Sen Lin, Chang-Yuo Hsieh, and Chih-Hsien Hsia

Department of Electrical Engineering, Tamkang University, Tamsui, Taipei, Taiwan E-mail: {chiang, chchang, yslin, p21001}@ee.tku.edu.tw

#### ABSTRACT

This work presents a parallel context-modeling coding architecture and a matching arithmetic coder (MQ coder) for the embedded block coding (EBCOT) unit of the JPEG2000 encoder. Tier-1 of EBCOT consumes most of the computation time in a JPEG2000 encoding system, and the proposed parallel architecture increases the throughput rate of the context modeling. To match the high throughput rate of the parallel context-modeling architecture, an efficient pipelined architecture for the context-based adaptive arithmetic encoder is proposed. This JPEG2000 encoder can work at 185 MHz and encode one symbol each cycle. Compared with the conventional context-modeling architecture, our parallel architecture reduces the execution time by about 25%.

#### 1. INTRODUCTION

JPEG2000 is a new image compression standard developed by the JPEG committee (ISO/IEC JTC 1/SC 29/WG 1) [1]. The JPEG2000 image coding system provides very good rate-distortion performance in low bit-rate image compression and good subjective image quality. Its key algorithms include the discrete wavelet transform (DWT), scalar quantization, context modeling, binary arithmetic coding, and post-compression rate allocation. Although JPEG2000 benefits from EBCOT, EBCOT takes more than 50% of the computation time [3]. A speedup method, sample skipping (SS) [4], was proposed to accelerate the encoding process of a hardware EBCOT. Because the coding procedure proceeds column by column, a clock cycle is still wasted whenever an entire column is empty. To solve this empty-column problem of SS, a method called group-of-column skipping (GOCS) [4] was proposed. However, GOCS is restricted by its predefined group arrangement, and it requires an additional memory block. An enhancement of GOCS called multiple column skipping (MCOLS) [5] was also proposed. MCOLS tests multiple columns concurrently to determine whether a column can be skipped. MCOLS has to modify the memory arrangement to supply the state information that determines which column is to be coded, and it limits the number of columns that can be combined simultaneously. Besides the intensive computation, EBCOT needs a large amount of memory. In conventional architectures, the block coder requires at least 20K bits of memory.

Chiang et al. proposed another approach to increase the speed of computation and reduce the memory requirement of EBCOT [6]. They use the pass-parallel context modeling (PPCM) technique for the EBCOT entropy encoder. PPCM merges the multi-pass coding into a single pass, reduces the memory requirement by 4K bits, and requires fewer internal memory accesses than the conventional architecture.

To increase the throughput of the arithmetic coder (MQ coder), MQ coders are often designed with pipelining techniques [8]. However, a pipelined MQ coder needs a high-performance EBCOT encoder to feed it; otherwise its efficiency may be reduced. This paper proposes a parallel context-modeling scheme based on the PPCM technique that generates two CX-D pairs each cycle, together with a matched pipelined MQ coder, to achieve a high-performance Tier-1 coder.

#### 2. EMBEDDED BLOCK CODING ALGORITHM

The block diagram of the JPEG2000 encoder is shown in Fig. 1. The discrete wavelet transform and scalar quantization are first applied to the input image data. The quantized transform coefficients are then entropy coded with context modeling and adaptive binary arithmetic coding. Finally, the compressed data is organized into a feature-rich code-stream by the post-compression rate-distortion optimization algorithm. The key entropy coding algorithms involved in this paper are described in the following subsections.



Fig. 1. The block diagram of the JPEG2000 encoder system.

#### 2.1 Context-Modeling

After the transformation and quantization steps are performed, each sub-band is partitioned into rectangular blocks (called code-blocks), typically  $64 \times 64$  or  $32 \times 32$  in dimension.

In the context-modeling module, all quantized transform coefficients of the code-blocks are expressed in sign-magnitude representation and divided into one sign bit-plane and several magnitude bit-planes (from MSB to LSB). During the coding scan, each bit-plane is divided into several stripes. Each stripe is composed of four rows of samples, and the bit-plane is scanned stripe by stripe. To improve the embedding of the compressed bit-stream, each bit-plane is coded in three coding passes, and each sample in a bit-plane is coded in exactly one of them. The three coding passes and the condition for each pass are as follows:

1. Significance propagation pass (pass 1): the coded sample is insignificant, and at least one of its neighbor samples is significant.
2. Magnitude refinement pass (pass 2): the sample became significant in a previous bit-plane.
3. Cleanup pass (pass 3): the samples that have not been coded by pass 1 or pass 2 in the current bit-plane.

These three passes are composed of four coding primitives: zero coding (ZC), sign coding (SC), magnitude refinement coding (MR), and run-length coding (RLC). The context data are generated by these primitives according to the neighbor states of the coded sample. These states are shown in Fig. 2. More details about the context-modeling algorithm can be found in [1] and [3].
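The pass conditions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's hardware design: the function names (`stripe_scan`, `has_significant_neighbor`, `classify_passes`) and the list-of-lists data layout are hypothetical, the neighborhood is taken to be the eight adjacent samples inside the block, and sign coding, run-length coding, and the "stripe causal" mode are left out.

```python
from typing import Iterator, List, Tuple

Grid = List[List[int]]  # 0/1 flags, indexed as grid[row][col]


def stripe_scan(height: int, width: int) -> Iterator[Tuple[int, int]]:
    """Yield (row, col) in stripe order: 4-row stripes, column by column."""
    for top in range(0, height, 4):
        for col in range(width):
            for row in range(top, min(top + 4, height)):
                yield row, col


def has_significant_neighbor(sig: Grid, row: int, col: int) -> bool:
    """Return True if any of the 8 neighbors inside the block is significant."""
    height, width = len(sig), len(sig[0])
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = row + dr, col + dc
            if (dr or dc) and 0 <= r < height and 0 <= c < width and sig[r][c]:
                return True
    return False


def classify_passes(sig_before: Grid, bits: Grid) -> Grid:
    """Assign each sample of one bit-plane to pass 1, 2, or 3.

    sig_before: significance state left by the previous bit-planes.
    bits: magnitude bits of the current bit-plane.
    """
    height, width = len(bits), len(bits[0])
    sig = [r[:] for r in sig_before]          # updated while pass 1 proceeds
    passes = [[3] * width for _ in range(height)]  # default: cleanup pass
    for row, col in stripe_scan(height, width):
        if sig_before[row][col]:
            passes[row][col] = 2              # already significant: refinement
        elif has_significant_neighbor(sig, row, col):
            passes[row][col] = 1              # significance propagation
            if bits[row][col]:
                sig[row][col] = 1             # becomes significant in pass 1
    return passes


sig_before = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
bits = [[1, 0, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
for line in classify_passes(sig_before, bits):
    print(line)
```

In this small example the sample at row 1, column 0 becomes significant during pass 1, which in turn pulls the samples scanned after it in its neighborhood into pass 1 as well.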



Fig. 2. The neighbor states used by different primitives. (a) ZC and MR (b) SC (c) RLC.

#### 2.2 Adaptive Binary Arithmetic Coding

The MQ coder is an adaptive binary arithmetic coder with renormalization-driven probability estimation. To reduce complexity, only 18 contexts are used in JPEG2000, and each coding context is represented by 5 bits of state information. Because the MQ coder is adaptive, the content of the selected context is updated according to the probability estimation process defined in JPEG2000 whenever a renormalization occurs. Periodically, a byte of compressed data is removed from the high-order bits of the code register C and output, which keeps C from overflowing. When all of the symbols have been coded, the FLUSH procedure is executed to terminate the encoding operations and generate the required terminating marker. Several bytes are also generated in the FLUSH procedure.

#### 2.3 Pass Parallel Context Modeling

Because the context modeling of EBCOT is inefficient, Chiang et al. proposed PPCM [6], which increases the efficiency by merging the three coding passes into a single one. PPCM requires four blocks of memory, and each block takes 4K bits. These four blocks are x (records the signs of all samples in a bit-plane),  $v_p$  (records the magnitudes of all samples in a bit-plane),  $\alpha_0$  (records the significance of pass 1), and  $\alpha_1$  (records the significance of pass 2). The refinement memory can be replaced by  $\alpha_0 \oplus \alpha_1$ , where  $\oplus$  is the logical exclusive-or operation. Therefore, the memory requirement of PPCM is 4K bits less than that of a conventional design. PPCM also uses the column-based operation [4] to access the information in the memories. Because PPCM merges the three coding passes into a single pass, it encounters two problems. The first is that a coded sample belonging to pass 3 may become significant earlier than pass 1. The second is how to predict the neighbor significances of coded samples that belong to pass 1, pass 2, and pass 3, respectively. The authors of [6] solve the first problem in two steps: they use the two memory blocks  $\alpha_0$  and  $\alpha_1$  to record the significances of pass 1 and pass 3, and they delay the pass 3 coding by one stripe column. For the second problem, they use Table I to predict the neighbor significances. In addition, they use the "stripe causal" mode of JPEG2000 [2] to break the correlation between the current stripe and the next stripe. With these techniques, all samples in each column can be coded one by one efficiently.
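The memory figures quoted above can be checked with a short calculation. The sketch below assumes a 64 x 64 code-block and one bit per sample in each memory block, which gives the 4K bits per block stated in the text. The 20K-bit conventional figure is taken from the introduction, and the function name `ppcm_memory_budget` is hypothetical.

```python
def ppcm_memory_budget(block_side: int = 64,
                       ppcm_blocks: int = 4,
                       conventional_bits: int = 20 * 1024) -> dict:
    """Compare the PPCM state memory with the conventional figure.

    Assumes one bit per sample per memory block (an assumption of this
    sketch); conventional_bits is the 20K-bit lower bound quoted in the text.
    """
    bits_per_block = block_side * block_side      # 64 * 64 = 4096 bits
    ppcm_bits = ppcm_blocks * bits_per_block      # four blocks in PPCM
    return {
        "bits_per_block": bits_per_block,
        "ppcm_bits": ppcm_bits,
        "saving_bits": conventional_bits - ppcm_bits,
    }


budget = ppcm_memory_budget()
print(budget)
```

Under these assumptions the four PPCM blocks take 16K bits in total, which is the stated 4K bits less than the conventional 20K bits.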

Table I. The prediction technique for the three pass types.

| Pass type | Sample status   | Significance prediction                                   |
|-----------|-----------------|-----------------------------------------------------------|
| Pass 1    | Visited         | $\alpha_0[k]$                                             |
| Pass 1    | Not yet visited | $\alpha_0[k] \parallel \alpha_1[k]$                       |
| Pass 2    | Visited         | $\alpha_0[k]$                                             |
| Pass 2    | Not yet visited | $\alpha_0[k] \parallel \alpha_1[k] \parallel v_p[k]$      |
| Pass 3    | Visited         | $\alpha_0[k] \parallel \alpha_1[k]$                       |
| Pass 3    | Not yet visited | $\alpha_0[k] \parallel \alpha_1[k]$                       |

( "||": OR logic operation, k: location of the coded sample)
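Table I can be read as a small lookup function. The sketch below is one possible reading, with hypothetical names: `pass_type` is taken to be the pass type listed in the first column, `visited` says whether the sample at location k has already been visited in the scan, and `a0`, `a1`, and `vp` are the bits stored at location k in the $\alpha_0$, $\alpha_1$, and $v_p$ memories.

```python
def predicted_significance(pass_type: int, visited: bool,
                           a0: int, a1: int, vp: int) -> int:
    """Return the predicted significance (0 or 1) according to Table I."""
    if pass_type == 1:
        return a0 if visited else (a0 | a1)
    if pass_type == 2:
        return a0 if visited else (a0 | a1 | vp)
    if pass_type == 3:
        return a0 | a1          # same expression for visited and unvisited
    raise ValueError("pass_type must be 1, 2, or 3")


# A location not yet visited, with only the magnitude bit set:
print(predicted_significance(2, False, a0=0, a1=0, vp=1))
print(predicted_significance(1, False, a0=0, a1=0, vp=1))
```

The two calls differ only in the pass type, which shows that the magnitude bit $v_p[k]$ enters the prediction only in the pass 2 row for samples not yet visited.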

#### 3. PROPOSED ARCHITECTURE

Based on PPCM, this paper presents a parallel coding architecture to further save coding clock cycles. Our design uses a "context-window" register to store all coded samples and the neighbor status of all coded samples. The "stripe causal" mode and the column-based operation are also adopted. Fig. 3 shows the context-window. The context-window consists of two parts: the first part processes all samples of column C that are coded by pass 1 and pass 2, and the second part processes the remaining samples of column C, which are coded by pass 3, after they are shifted left by one column so that they are coded in column D. The coding procedure can be divided into three steps:

- Step 1: code the sample that belongs to pass 1 or pass 2.
- Step 2: code the sample that belongs to RLC of pass 3.
- Step 3: code the sample that belongs to ZC or SC of pass 3.

In order to increase the throughput rate, Step 1 and Step 3 process two samples concurrently.



Fig. 3. The proposed coding context-window register.

To code two samples in column C simultaneously and produce their context data (CX-D), the method that predicts the state of the upper position of each coded sample, for positions 1 to 3 in column C, must be modified. For example, both the pass type and the significance of position 0 have to be considered when the system is coding the sample at position 1 in column C. Because the correct significance of position 0 is not known until the next cycle, it has to be predicted in the current cycle. The prediction is given by equation (1):

$$
\alpha_0[k-1] = \alpha_0[k-1] \parallel S_p, \qquad
S_p = \begin{cases} v_p[k-1], & \text{upper pass type} = 1 \\ 1, & \text{upper pass type} = 2 \end{cases}
\tag{1}
$$

where  $S_p$  is a variable determined by the upper pass type of the coded sample,  $v_p[k-1]$  is the upper magnitude of the coded sample, and  $\alpha_0[k-1]$  is the upper significance of the coded sample.
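Equation (1) amounts to a one-line update. The sketch below illustrates it with hypothetical names. The text defines $S_p$ only for upper pass types 1 and 2, so the value 0 used here for pass type 3 is an assumption of this sketch, chosen so that the stored $\alpha_0[k-1]$ is left unchanged in that case.

```python
def predict_upper_significance(alpha0_upper: int,
                               upper_pass_type: int,
                               vp_upper: int) -> int:
    """Predict the significance of the sample directly above (position k-1).

    alpha0_upper: stored alpha_0[k-1] bit.
    upper_pass_type: pass type (1, 2, or 3) of the upper sample.
    vp_upper: magnitude bit v_p[k-1] of the upper sample.
    """
    if upper_pass_type == 1:
        s_p = vp_upper      # becomes significant only if its magnitude bit is 1
    elif upper_pass_type == 2:
        s_p = 1             # a refinement sample is already significant
    else:
        s_p = 0             # assumption: not specified in the text
    return alpha0_upper | s_p


print(predict_upper_significance(0, 1, 1))
print(predict_upper_significance(0, 2, 0))
```

The first call covers an upper sample in pass 1 whose magnitude bit is set, and the second an upper sample in pass 2. Both predict a significant upper neighbor without waiting for the next cycle.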

The block diagram of the proposed architecture is shown in Fig. 4. Four memory blocks store the status of the code-block (magnitude, sign, pass 1 significance, and pass 2 significance). At the very beginning, the data needed for coding are loaded into the context-window unit one column at a time. After some operations, the information needed by all coding primitives is generated and sent to the context block. The context block unit is composed of two "ZSM" (ZC, SC, and MR) primitive blocks and one RLC primitive block. Because we process two samples concurrently, the number of CX-D pairs output at each cycle is not constant (from 1 to 4). These CX-D pairs are sent to the MQ coder one by one, so a parallel-in-serial-out (PISO) buffer is needed. Fig. 5 shows this architecture.

To prevent the data of the current cycle from being overwritten in the next cycle, the frequency of the MQ coder and the size of the PISO buffer are important design issues. Table II shows the distribution of the number of outputs per cycle of the context modeling for six test images. According to Table II, two outputs per cycle occur most frequently. Therefore the MQ coder is operated at twice the frequency of the context modeling. Moreover, four outputs occur in only about 5% of the cycles, and thus we use 10 buffers in our design.







Fig. 5. Proposed architecture of Tier-1.

Table II. Output number percentage of the context modeling.

| Image size | Test image | 1 output         | 2 outputs        | 3 outputs        | 4 outputs      |
|------------|------------|------------------|------------------|------------------|----------------|
| 512x512    | Lena       | 197700 (28.45%)  | 299552 (43.10%)  | 155829 (22.42%)  | 42862 (6.02%)  |
| 512x512    | Jet        | 229396 (34.76%)  | 252241 (38.23%)  | 136365 (20.67%)  | 41754 (6.33%)  |
| 512x512    | Baboon     | 163400 (19.72%)  | 449707 (54.28%)  | 175629 (21.20%)  | 39681 (4.79%)  |
| 2048x2560  | Bike       | 4828464 (32.61%) | 6215432 (41.97%) | 2940479 (19.85%) | 823400 (5.56%) |
| 2048x2560  | Cafe       | 4524970 (27.57%) | 7912284 (48.21%) | 3170517 (19.32%) | 804444 (4.90%) |
| 2048x2560  | Woman      | 3806095 (28.04%) | 5948512 (43.83%) | 2957053 (21.79%) | 860102 (6.34%) |
| Average    |            | 28.19%           | 44.94%           | 20.88%           | 5.66%          |
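The average row of Table II gives a quick check on the choice of clock ratio. The sketch below computes the mean number of CX-D pairs produced per context-modeling cycle from the four average percentages. The names are hypothetical, and the percentages are renormalized because they do not sum to exactly 100%.

```python
# Average share (in percent) of cycles that produce 1, 2, 3, or 4 CX-D pairs,
# copied from the "Average" row of Table II.
AVERAGE_SHARE = {1: 28.19, 2: 44.94, 3: 20.88, 4: 5.66}


def mean_outputs_per_cycle(share: dict) -> float:
    """Mean number of CX-D pairs per cycle, renormalizing the percentages."""
    total = sum(share.values())
    return sum(count * pct for count, pct in share.items()) / total


most_frequent = max(AVERAGE_SHARE, key=AVERAGE_SHARE.get)
mean = mean_outputs_per_cycle(AVERAGE_SHARE)
print(most_frequent, round(mean, 2))
```

The most frequent count is two, and the mean comes out slightly above two pairs per cycle, so an MQ coder that consumes two pairs per context-modeling cycle is close to the average production rate. The PISO buffer covers the cycles that produce three or four pairs. How the design behaves when the buffer fills is not described in the text.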

#### 4. MQ CODER

To increase the performance of the MQ coder, we use a pipelined architecture that divides the coding procedure into four stages. This architecture is shown in Fig. 6. The CX-D data streams sent to the MQ coder from our parallel coding architecture are interleaved, so the traditional architecture must be modified to eliminate conflicts. In [6], more context registers and coding state registers are used to solve this problem. This method is also adopted in our pipelined design. We add two sets of coding state registers (A, B, C, CT) in Stage 2 and Stage 3.

In Stage 1, the CX and the pass number are sent to the "context table" to select an index and an MPS symbol. However, the correct index is not known until Stage 2 is finished, so a wrong index may be selected. Therefore a prediction scheme has to be used to predict the next index. An "index predict" unit performs the index prediction, and a register saves "nlps" or "nmps". If renormalization is executed during the operation of Stage 2, the correct index must be fetched from the "index predict" unit. Stage 2 and Stage 3 calculate the new interval (A) and lower bound (C). To increase the clock rate, the calculation of C is divided between Stage 2 and Stage 3. This technique is adopted from [7]. Because at most 2 bytes are output at a time, we add a FIFO in Stage 4 to keep the final bit-string in order.



Fig. 6. Pipeline architecture of the MQ-Coder.

This MQ coder has been synthesized with Synopsys in the worst-case environment (WCCOM). It can run at a clock rate of 185 MHz and encode one CX-D pair each cycle.

#### 5. EXPERIMENTAL RESULTS

The execution times of the proposed architecture and the PPCM architecture [6] are compared. Six different images of size 512x512 are used in our experiments. The results are shown in Table III. The proposed architecture reduces the execution time by about 25%.

Table III. Experimental result of the execution time.

| Test image | Execution time, [6] (clock cycles) | Execution time, this work (clock cycles) | Reduction |
|------------|------------------------------------|------------------------------------------|-----------|
| Lena       | 1431739                            | 1083918                                  | 24.29%    |
| Jet        | 1748425                            | 1383706                                  | 25.22%    |
| Baboon     | 1309989                            | 979650                                   | 20.86%    |
| Boat       | 1359648                            | 1017169                                  | 25.19%    |
| Pepper     | 1277950                            | 945675                                   | 26.00%    |
| Zelda      | 1142081                            | 816326                                   | 28.52%    |
| Average    | 1378305                            | 1037740                                  | 25.01%    |
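The percentage column of Table III can be recomputed from the two cycle counts as (cycles in [6] - cycles in this work) / cycles in [6]. The sketch below does this for each image. The dictionary and function names are hypothetical, and the cycle counts are copied from the table.

```python
# (cycles in [6], cycles in this work), copied from Table III.
CYCLES = {
    "Lena": (1431739, 1083918),
    "Jet": (1748425, 1383706),
    "Baboon": (1309989, 979650),
    "Boat": (1359648, 1017169),
    "Pepper": (1277950, 945675),
    "Zelda": (1142081, 816326),
}


def reduction_percent(before: int, after: int) -> float:
    """Relative reduction in execution time, in percent."""
    return 100.0 * (before - after) / before


reductions = {name: reduction_percent(*pair) for name, pair in CYCLES.items()}
for name, value in reductions.items():
    print(f"{name}: {value:.2f}%")
print(f"Mean of per-image reductions: {sum(reductions.values()) / len(reductions):.2f}%")
```

With these cycle counts, the recomputed values for Lena, Boat, Pepper, and Zelda match the table. The values for Jet and Baboon come out as about 20.86% and 25.22%, the reverse of the printed order, which may be a transcription slip in the table. The mean of the six per-image reductions is about 25.01% either way, in agreement with the average row.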

#### 6. CONCLUSION

This paper proposes a parallel coding architecture for the context modeling of JPEG2000 that reduces the execution time by about 25% compared with the previous work. A pipelined MQ coder is also designed to match the parallel context-modeling architecture, and this encoder can operate at a clock rate of 185 MHz.

#### 7. REFERENCES

- [1] M. D. Adams, *The JPEG-2000 Still Image Compression Standard*, ISO/IEC JTC 1/SC 29/WG 1 N2412, Sep. 2001.
- [2] D. Taubman, E. Ordentlich, M. Weinberger, and G. Seroussi, "Embedded block coding in JPEG2000," HP Labs, Palo Alto, Feb. 2001.
- [3] M. D. Adams and F. Kossentini, "JasPer: a software-based JPEG-2000 codec implementation," *Proc. IEEE Int. Conf. Image Processing*, vol. 2, pp. 53-56, Sep. 2000.
- [4] K.-F. Chen, C.-J. Lian, H.-H. Chen, and L.-G. Chen, "Analysis and architecture design of EBCOT for JPEG2000," *Proc. IEEE Int. Symp. Circuits and Systems*, vol. 2, pp. 765-768, May 2001.
- [5] H.-H. Chen, C.-J. Lian, T.-H. Chang, and L.-G. Chen, "Analysis of EBCOT decoding algorithm and its VLSI implementation for JPEG 2000," *Proc. IEEE Int. Symp. Circuits and Systems*, vol. IV, pp. 329-332, 2002.
- [6] J.-S. Chiang, Y.-S. Lin, and C.-Y. Hsieh, "Efficient pass-parallel architecture for EBCOT in JPEG2000," *IEEE Int. Symp. Circuits and Systems*, vol. I, pp. 773-776, May 2002.
- [7] C.-J. Lian, K.-F. Chen, H.-H. Chen, and L.-G. Chen, "Analysis and architecture design of block-coding engine for EBCOT in JPEG 2000," *IEEE Trans. Circuits and Systems for Video Technology*, vol. 13, pp. 219-230, March 2003.
- [8] K.-K. Ong, W.-H. Chang, Y.-C. Tseng, Y.-S. Lee, and C.-Y. Lee, "A high throughput context-based adaptive arithmetic codec for JPEG2000," *IEEE Int. Symp. Circuits and Systems*, vol. IV, pp. 133-136, May 2002.