Abstract-This paper proposes a flexible, cost-effective transport stream processor for DTV. A RISC micro-controller serves as the core of the transport stream processor so that its functions can be flexibly extended or changed. To keep the design cost-effective, the functions of the transport stream processor are partitioned into those suited to hardware implementation and those suited to software executed by the micro-controller. A special enhancement of the micro-controller instruction set is proposed to improve the code efficiency of bit-level data-field processing. A general parsing engine for parsing grouped data-fields is also proposed. With these features, about 50% of the total cost of a transport stream processor with baseline functions can be saved.
I. INTRODUCTION
With the development of Digital TV (DTV), more and more applications that are difficult to implement on traditional analog TV have been proposed in recent years. For example, interactive TV allows the receiver of broadcast TV content to return interactive information to the transmitter. TV programs carried as digital signals are also better suited to conditional access systems, such as pay-TV. In addition to TV programs, other digital data, such as software and games, can be transmitted via the DTV broadcasting system.
The various applications of DTV must be built on the MPEG-2 System Layer specification [1]. Most international DTV standards, such as DVB [2] and ATSC [3], adopt the MPEG-2 System Layer specification [1] as the basic bit-stream format and add various transmission schemes or Application Interfaces (API). For example, all data broadcast through the various channels must be packaged in the Transport Stream (TS) format. All DTV applications must likewise transmit or store information in the bit-stream format specified by the MPEG-2 System Layer.
The Transport Stream Processor (TSP), one of the important modules of a DTV system, is in charge of processing the received transport stream and must be designed to serve various DTV applications on the basis of the MPEG-2 System Layer specification [1]. The signal received from the broadcasting channel is typically demodulated and passed through the error correction module to restore the original transport stream. The transport stream processor analyzes the transport stream, parses the necessary information, which is then delivered to the CPU of the system, and extracts the desired data streams, which are further processed by the corresponding decoders.
There are two typical schemes for designing the architecture of a transport stream processor. One is to implement the functions with dedicated hardware [4] [5]. The other is to use a micro-controller as the core of the transport stream processor, with all functions implemented in software executed by the micro-controller [6]. Compared to the dedicated-hardware architecture, the micro-controller-based architecture offers higher flexibility to update or extend the functions of the transport stream processor [7], but its on-chip memory consumption significantly increases the hardware cost. In contrast, the dedicated-hardware architecture has a lower hardware cost but low flexibility.
The motivation of this work is to design a flexible transport stream processor with efficient hardware area utilization. With a flexible architecture, the functions of the transport stream processor can be updated or extended easily. In addition, as an Intellectual Property (IP) block in a DTV SoC, a flexible transport stream processor can be adopted without redesigning the architecture even if the specification of the desired transport stream processor changes. However, a flexible transport stream processor architecture usually incurs a very high hardware cost because of the large amount of on-chip data memory and code memory [6] [7], which is an undesirable characteristic for an SoC IP. Therefore, a transport stream processor with both a flexible architecture and efficient hardware area utilization is necessary.
II. ARCHITECTURE

A. Software/Hardware Partitioning
The basic functions that a typical transport stream processor needs to support are packet synchronization, clock recovery, PID filtering, CRC checking, section filtering, bit-field data processing, and data stream distribution [8] [9]. Packet synchronization detects the boundaries of the incoming transport stream packets, and a simple scheme for this is provided by the MPEG-2 System Layer [1]. Clock recovery synchronizes the System Time Counter (STC) of the decoder to that of the encoder so that the playback fluency of the received TV programs is not disturbed by the channel. The PID filter passes the desired transport stream packets and discards the unnecessary ones. CRC checking verifies the correctness of the received sections at bit level. The section filter discards duplicated sections so that they do not place an unnecessary burden on the CPU. Bit-field data processing is the major operation of the transport stream processor: it parses and analyzes each data field in the incoming transport stream. The processed data and multimedia data streams are finally distributed to memory for further processing by the CPU or the respective decoders.
Fig 1 shows an overview of the proposed transport stream processor based on the partitioning of the basic functions into those suited to dedicated hardware implementation and those suited to software implementation. PID filtering and packet synchronization are fixed functions and are therefore suited to dedicated hardware. The clock recovery and CRC checking modules are difficult to implement in software and are usually regarded as fixed functions, so hardware implementations are more suitable for them as well.
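As an illustration of the packet synchronization and PID filtering functions described above, the following is a minimal software sketch, not the proposed hardware. It relies on the MPEG-2 transport packet layout (188-byte packets, sync byte 0x47, 13-bit PID in bytes 1 and 2). The function names and the choice of three consecutive sync bytes to confirm lock are assumptions made for this example.

```python
TS_PACKET_SIZE = 188  # bytes per MPEG-2 transport stream packet
SYNC_BYTE = 0x47      # sync_byte value at the start of every packet


def find_sync(buf: bytes, confirm: int = 3) -> int:
    """Return the first offset at which `confirm` consecutive packets
    start with the sync byte, or -1 if no such offset exists.
    (`confirm` = 3 is an assumption for this sketch.)"""
    span = TS_PACKET_SIZE * (confirm - 1)
    for off in range(len(buf) - span):
        if all(buf[off + k * TS_PACKET_SIZE] == SYNC_BYTE
               for k in range(confirm)):
            return off
    return -1


def packet_pid(packet: bytes) -> int:
    """Extract the 13-bit PID: low 5 bits of byte 1, then all of byte 2."""
    return ((packet[1] & 0x1F) << 8) | packet[2]


def pid_filter(buf: bytes, wanted: set) -> list:
    """Keep only the packets whose PID is in `wanted`; discard the rest."""
    start = find_sync(buf)
    if start < 0:
        return []
    kept = []
    for off in range(start, len(buf) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE):
        pkt = buf[off:off + TS_PACKET_SIZE]
        if pkt[0] == SYNC_BYTE and packet_pid(pkt) in wanted:
            kept.append(pkt)
    return kept
```

Both functions are fixed, table-free operations on a few header bits, which is consistent with treating them as candidates for dedicated hardware.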
Section filtering and bit-field data processing need to be flexible, so they are implemented in software stored in the code memory. The processing core executes the software in the code memory and controls all the modules inside the transport stream processor. Note that CRC checking is designed to operate concurrently with the DMA module while section data is delivered to the system memory, so the transport stream processor does not need to buffer the large amount of section data until the CRC bytes are received at the end of the section. The data memory and code memory are accessible by the CPU, so the configuration of the transport stream processor can be changed or updated at run time.
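The idea of checking the CRC while section data is being forwarded, rather than after buffering the whole section, can be sketched in software as follows. The sketch uses the CRC-32 defined for MPEG-2 sections (polynomial 0x04C11DB7, initial value 0xFFFFFFFF, no bit reflection, no final XOR), for which a section that includes its trailing CRC bytes leaves a zero residue. The function names and the `sink` buffer standing in for the DMA destination are assumptions for illustration only.

```python
CRC_POLY = 0x04C11DB7   # CRC-32 polynomial used for MPEG-2 sections
CRC_INIT = 0xFFFFFFFF   # initial register value


def crc32_mpeg2_update(crc: int, chunk: bytes) -> int:
    """Feed `chunk` into a running CRC, most significant bit first."""
    for byte in chunk:
        crc ^= byte << 24
        for _ in range(8):
            if crc & 0x80000000:
                crc = ((crc << 1) ^ CRC_POLY) & 0xFFFFFFFF
            else:
                crc = (crc << 1) & 0xFFFFFFFF
    return crc


def forward_section(chunks, sink: bytearray) -> bool:
    """Forward each chunk to `sink` (a stand-in for the DMA write) while
    updating the CRC. Returns True if the section, including its four
    trailing CRC bytes, leaves a zero residue."""
    crc = CRC_INIT
    for chunk in chunks:
        crc = crc32_mpeg2_update(crc, chunk)
        sink.extend(chunk)
    return crc == 0
```

Because only the 32-bit running value is kept between chunks, no section-sized buffer is needed inside the processor; the validity result is simply available once the last chunk has been forwarded.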
B. Processing Core
The processing core of the transport stream processor is designed as a RISC micro-controller with a special enhancement for bit-field data processing. The processing core controls all the modules and executes all the software in the code memory, which keeps the functions of the transport stream processor flexible. Because bit-field data are also processed by the processing core, an instruction set optimized for bit-field data processing is needed. The proposed instructions are rarely supported by traditional micro-controllers, which must execute many redundant instructions for bit-field data processing.
The special instructions for bit-field data processing are shown in Fig 2. Fig 2(a) describes the "shmsk" instruction, which extracts a segment of bit-fields. Similarly, the "mrg" instruction merges two variable-length segments of bit-fields into one, as Fig 2(b) shows. The length of the bit-fields can be set arbitrarily in the instruction argument and ranges from 1 to 16. The "stb" instruction substitutes data C bit-by-bit with data A according to the value of each bit in data B, as Fig 2(c) shows. In addition, the special instruction set allows a calculated result to be saved directly into data memory, without first writing it back to the register file and then issuing additional instructions to store it, which also saves redundant instructions.
The additional special instructions are implemented by modifying the traditional ALU to calculate the desired results. Part of the ALU is rearranged to implement the special instructions, as shown in Fig 3. In addition to the bit-by-bit multiplexors for the "stb" instruction, the original barrel shifter and bit-wise AND unit are cascaded to implement the special instructions. Path A in Fig 4 shows the data path that implements the "shmsk" instruction, which uses the barrel shifter and the bit-wise AND unit concurrently. Path B in Fig 4 shows the data path that implements the "mrg" instruction, which combines the barrel shifter, the bit-wise AND unit, and the bit-by-bit multiplexors. The design principle of the ALU architecture is to reuse and combine the existing processing units to implement complex computations so that the additional hardware cost does not increase substantially.
Another special enhancement of the instruction set allows a computation result to be stored directly into data memory, which exploits the fact that the data memory resides inside the transport stream processor alongside the processing core.
The value in the register that would normally be designated as the destination register of the computation is instead read out as the write address of the data memory. To write computation results into consecutive memory addresses, which suits the parsing of consecutive data fields, the read-out write address is incremented automatically and saved back to the original register for the next operation. With this enhancement, the instruction count is reduced because the redundant instructions that move data from the register file into data memory and increment the write address are no longer needed.
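The behaviour of the special instructions and of the store-to-memory enhancement can be modelled in software roughly as follows. This is a sketch under stated assumptions, not the authors' definition: the exact operand encoding is given only in Fig 2, so the 16-bit word width, the argument order, the polarity of "stb" (a 1 bit in B selects A), and the `CoreModel` class with its `shmsk_to_mem` method are all hypothetical.

```python
WORD_BITS = 16                    # assumed datapath width
WORD_MASK = (1 << WORD_BITS) - 1


def _mask(length: int) -> int:
    """All-ones mask of `length` bits; lengths are limited to 1..16."""
    if not 1 <= length <= WORD_BITS:
        raise ValueError("bit-field length must be in 1..16")
    return (1 << length) - 1


def shmsk(src: int, shift: int, length: int) -> int:
    """Shift-and-mask: extract `length` bits of `src` starting at bit `shift`."""
    return ((src & WORD_MASK) >> shift) & _mask(length)


def mrg(a: int, len_a: int, b: int, len_b: int) -> int:
    """Merge the low `len_a` bits of `a` (upper part) with the low
    `len_b` bits of `b` (lower part) into one word."""
    return (((a & _mask(len_a)) << len_b) | (b & _mask(len_b))) & WORD_MASK


def stb(a: int, b: int, c: int) -> int:
    """Bit-by-bit substitute: take the bit from `a` where `b` is 1,
    otherwise keep the bit from `c` (polarity is an assumption)."""
    return ((a & b) | (c & ~b)) & WORD_MASK


class CoreModel:
    """Toy register file plus data memory, used to model a direct store."""

    def __init__(self, num_regs: int = 8, mem_words: int = 32):
        self.reg = [0] * num_regs
        self.mem = [0] * mem_words

    def shmsk_to_mem(self, rd: int, rs: int, shift: int, length: int) -> None:
        """Use reg[rd] as the write address, store the shmsk result there,
        then write the incremented address back to reg[rd]."""
        addr = self.reg[rd]
        self.mem[addr] = shmsk(self.reg[rs], shift, length)
        self.reg[rd] = addr + 1


# Usage: split bytes 1-2 of a TS packet header into four fields.
core = CoreModel()
core.reg[1] = 0x4100            # payload_unit_start_indicator = 1, PID = 0x0100
core.reg[2] = 4                 # write pointer into data memory
core.shmsk_to_mem(2, 1, 15, 1)  # transport_error_indicator
core.shmsk_to_mem(2, 1, 14, 1)  # payload_unit_start_indicator
core.shmsk_to_mem(2, 1, 13, 1)  # transport_priority
core.shmsk_to_mem(2, 1, 0, 13)  # PID
```

In this model each header field costs one operation, whereas a conventional instruction set would typically need a shift, an AND, a store, and an address increment per field.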
C. Parsing Engine
The parsing engine is designed to conveniently parse the grouped bit-fields that often appear in the bit-stream specified by the MPEG-2 System Layer [1], such as those in the PES header and the adaptation field. If a bit-field is present in the data stream, the corresponding flag in the grouping flags is set. In other words, the parsing procedure must check each flag to determine whether the corresponding bit-field exists in the bit-stream. As shown in Fig 5, the bit-field parsing operation may need to concatenate several data segments into a complete bit-field, which consumes many instructions even with the proposed special instructions. With the assistance of the parsing engine, the large amount of instruction memory otherwise needed to parse grouped bit-fields is reduced to a few bytes of configuration data for the parsing engine.
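The replacement of parsing code by configuration data can be illustrated with a small table-driven sketch. It is not the proposed engine: it works on whole bytes rather than arbitrary bit segments, and the table format and function name are assumptions. The example table covers only the first three optional fields of the MPEG-2 adaptation field (PCR, OPCR, and splice_countdown, gated by flag bits 0x10, 0x08, and 0x04) and returns raw unsigned values.

```python
# Each entry: (flag bit in the grouping flags, field name, length in bytes).
# Entries are listed in the order the fields appear in the stream.
ADAPTATION_FIELD_CONFIG = [
    (0x10, "PCR", 6),               # present when PCR_flag is set
    (0x08, "OPCR", 6),              # present when OPCR_flag is set
    (0x04, "splice_countdown", 1),  # present when splicing_point_flag is set
]


def parse_grouped_fields(flags: int, payload: bytes, config):
    """Walk the configuration table; for every flag that is set, read
    the corresponding field from `payload` and advance the position.
    Returns the parsed fields and the number of bytes consumed."""
    fields = {}
    pos = 0
    for flag_mask, name, size in config:
        if flags & flag_mask:
            if pos + size > len(payload):
                raise ValueError("payload too short for field " + name)
            fields[name] = int.from_bytes(payload[pos:pos + size], "big")
            pos += size
    return fields, pos
```

The same loop serves any flag-gated group once a different table is supplied, which mirrors how a few configuration bytes can take the place of a dedicated code sequence for each group.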
The processing element of the parsing engine is shown in Fig 6. Each level of multiplexor checks a bit in the mask data bytes to decide whether the segment of bit-fields is shifted by one bit or left unchanged. The processing element is designed to shift and concatenate the segments of a bit-field byte by byte so that a complete bit-field of variable length can be easily parsed.
III. RESULT
Table I shows the implementation result of the proposed transport stream processor together with a hardware cost comparison against previous works [6] [4]. The hardware cost is normalized to the TSMC 0.18um 1P6M process, with the areas of all on-chip memories converted into equivalent logic gate counts. Compared with the micro-controller-based transport stream processor design [6], the proposed design saves about 50% of the hardware area cost because of the proposed hardware/software partitioning and the proposed special instruction set. On the other hand, the proposed design performs comparably to the dedicated-hardware design [4], which provides much less flexibility than the proposed architecture.
Table II shows the hardware area savings contributed by the special instruction set and by the parsing engine, respectively. The previous work implemented with a micro-controller [6] is also listed. Comparing Case A and Case B shows that the special instructions contribute about 14.2% of hardware area saving. Comparing Case B and Case C shows that the parsing engine contributes about 6.8% of hardware cost saving. Both contributions are mainly due to the reduction in the total number of instructions, which in turn reduces the usage of on-chip code memory.
Fig 7 shows the chip implementation of the proposed transport stream processor in the TSMC 0.18um CMOS 1P6M process, and the specification of the chip is listed in Table I.
The proposed transport stream processor is designed for flexibility, including the extension and updating of its functions. To prevent excessive growth in the hardware area consumed by the on-chip memory, the functions that are not suitable for software implementation are implemented as dedicated hardware. In addition, the processing core that executes the software is specially designed to process the bit-stream efficiently. Furthermore, the general parsing engine saves much of the on-chip memory that would otherwise be needed to parse grouped bit-fields. The implementation result shows that the overall area saving reaches about 50% compared with a transport stream processor that implements its functions entirely in software. It is also shown that the specially designed instructions save about 14.2% of hardware area and the parsing engine saves about 6.8% of hardware area.
