This paper introduces a hardware platform for structured-light depth imaging that performs 3D modeling of cluttered workspaces for home service robots. We have found that the degradation of precision and robustness comes mainly from the overlapping of multiple codes in the signal received at a camera pixel. Because separating the overlapped codes is critical to precision and robustness, we proposed a novel signal separation code for depth imaging, referred to here as "Hierarchically Orthogonal Code (HOC)." The proposed HOC algorithm was implemented on a hardware platform that uses a Xilinx XC2V6000 FPGA to perform real-time 3D modeling and invisible IR (infrared) pattern light to avoid inconvenience in the home environment. Experimental results show that the proposed HOC algorithm significantly enhances the robustness and precision of depth imaging compared with the best-known conventional approaches. Furthermore, the HOC algorithm implemented on our hardware platform required 34 ms to generate one 3D image. This is about 24 times faster than a software implementation of the same HOC algorithm, so real-time processing is achieved.
Introduction
With the help of a 3D camera, robots can perform essential tasks such as navigation, recognition, and manipulation more easily and effectively in fields such as service robotics, medical imaging, and security surveillance. Depth imaging based on structured light has recently drawn serious attention due to its potential for application to service robotics. This potential comes mainly from its capability of measuring depth when no texture is present and of providing higher precision and robustness than stereopsis, as well as higher speed and lower cost and volume than laser scanners. The key technology for depth imaging based on structured light lies in the methodology for achieving correct pixel correspondence.
The many approaches to coding available to date [1] can be classified into four categories: direct coding, spatial coding, temporal coding, and hybrid coding. Direct coding [2] is fast but suffers from poor accuracy and low robustness to illumination variation and noise. Spatial coding depends on spatially arranged contextual information for pixel correspondence, which makes it vulnerable to signal corruption and complicated scenes [3]-[5]. Temporal coding is not suitable for a rapidly moving scene because it uses a sequence of frames [6]-[8]. Hybrid coding is a combination of temporal and spatial coding [9]. Spatial coding and hybrid coding are appropriate for modeling continuous objects. On a discontinuous surface, however, the address sequence can be transposed.
Existing approaches have concentrated mainly on the design of spatial and/or temporal codes that provide uniqueness in pixel identification based on contextual information. However, we found that the conventional approaches have a fundamental limitation in achieving the high quality of pixel correspondence required for precision depth imaging. They suffer from large pixel-wise variations in accuracy, from a number of spurious outliers, especially near occluding and shading boundaries, or from the inability to image scenes of cluttered objects where a transposition in pixel correspondence may take place. This is because the conventional approaches, which focus on decoding the received signals based on contextual information, grossly underestimate the extent of signal corruption at the camera, which often makes the contextual information inaccurate and erroneous. Even the adoption of a long sequence of temporal code that provides each DMD (Digital Mirror Device) [17] pixel with a unique address does not suffice for effectively handling such corruption. Besides system and environmental noise due to scattering, reflectance variation, and illumination variation, the signal received at a camera pixel is a mixture of multiple codes originating from neighbouring and/or distant DMD pixels. No approach has so far directly tackled signal corruption, especially that due to signal mix or code overlap.
We proposed an original approach to coding [20], called Signal Separation Coding, that works under significant signal corruption: the overlapped codes present in the received signal are first separated out based on HOC. The pixel correspondence is then made based on the contextual likelihood represented by the transition rules governing the set of codes separated from several neighbouring camera pixels. Unlike conventional approaches, the proposed approach turns out to be very effective at reducing errors, especially at or near occluding and shading boundaries. Furthermore, the use of transition rules effectively removes the brittleness of threshold-based decisions on pixel correspondence. The originality of this paper lies in the discovery of the necessity of signal separation for high-quality pixel correspondence and in the presentation of a novel signal separation coding for precision depth imaging based on structured light. To employ structured-light 3D cameras in robotic tasks such as navigation, localization, recognition, and manipulation, we have to handle the inconvenience caused by active radiation. Analyzing structured-light depth images in real time also requires numerous computations. For this reason, the proposed HOC algorithm was implemented on a hardware platform to perform real-time 3D modeling, with invisible IR pattern light to avoid any inconvenience in the home environment.
Copyright © 2006 The Institute of Electronics, Information and Communication Engineers
Signal Separation Coding

Source of Correspondence Errors
The conventional approaches, which focus on decoding the received signals with contextual information only, grossly underestimate the severity of the corruption of the received signal at the camera, which often makes the contextual information inaccurate and erroneous. Even the adoption of a long sequence of temporal code that provides each DMD pixel with a unique address does not suffice to avoid the effect of such corruption. As stated, the most damaging source of code corruption we found is the so-called signal mix: multiple light beams projected from neighbouring or, sometimes, distant DMD pixels are incident upon a camera pixel and overlap, significantly altering codes and contextual information. This signal mix cannot be treated by simple filtering because it depends on the geometry as well as the properties of the reflecting surfaces. For instance, a considerable variation in signal mix is expected at or near occluding and shading boundaries. Signal mix can be explained in terms of the geometry of the portion of the object surface where the light beam projected by a DMD pixel and the receptive field of a camera overlap, as shown in Fig. 1. Besides the signal mix described above, the signal received by a camera pixel can be corrupted by system and environmental noise. The variation of surface reflectance causes variation of signal intensity, while the scattering of light on the object surface causes blurring or signal mix, as shown in Fig. 2. A very low reflectance on part of the object surface results in a shadow or shading effect. Intensity variation of the received signal may also be caused by changes in environmental illumination. A significant noise source we have discovered with DMD-based light sourcing is that a wider variation over time of the received signal intensity is observed at or near occluding and shading boundaries, as shown in Fig. 3. This phenomenon is considered to be due to the instability of signal mix.
HOC Based Pixel Correspondence
Signal Separation by Orthogonal Codes
If we assume that the mixture signal at a pixel position is a weighted sum of the signal at that position and the signals of its neighbourhood, the relationship between the signals of the projector (sender) and the corresponding signals of the camera (receiver) is represented by a linear mixing model, $Y = XW$, where $X \in \mathbb{R}^{f \times m}$ is a source matrix whose columns contain the sending signals $x_i = (x_1, \ldots, x_f)^T$ of the projector and $Y \in \mathbb{R}^{f \times m}$ is a signal matrix whose columns contain the mixture signals $y_i = (y_1, \ldots, y_f)^T$ of the camera. Here $f$ denotes the number of frames and $m$ the number of pixels of the image. $W \in \mathbb{R}^{m \times m}$ is a mixing matrix that contains the mixing coefficients. To separate the mixed signals, we can use an orthogonal signal set, because if $X$ is an orthonormal matrix, the mixing matrix can be calculated simply as $W = X^T Y$. Even when the S/N ratio is relatively low, the source signal can be separated from a mixture of signals as long as its magnitude is higher than that of the neighbouring signals. For example, to acquire the source code from a mixture signal $y_j = (y_1, \ldots, y_f)^T$ at the receiving module (camera), we can project the mixture signal onto the space spanned by the orthogonal basis vectors (codes); the code with the maximum coefficient is taken as the right code: $\mathrm{code}(y_j) = \arg\max_i W_{i,j}$. In this research, we use this concept of signal separation in order to achieve robust depth imaging.
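The projection step can be illustrated with a minimal sketch. The four-element Walsh-Hadamard code set below is an assumption chosen only because it is easy to verify as orthonormal; real projected patterns cannot take negative values, so this shows the algebra of $W = X^T Y$ and the arg max rule rather than the patterns used in this work. The function names are hypothetical.

```python
# Four orthonormal codes of length 4 (normalised Walsh-Hadamard rows).
# This particular code set is an assumption made for illustration.
HADAMARD = [
    [1,  1,  1,  1],
    [1, -1,  1, -1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
]
CODES = [[v / 2.0 for v in row] for row in HADAMARD]  # each code has unit norm


def mix(weights):
    """Build one received signal y = sum_i weights[i] * code_i (a column of Y = XW)."""
    frames = len(CODES[0])
    return [sum(w * code[t] for w, code in zip(weights, CODES))
            for t in range(frames)]


def separate(y):
    """Project y onto every code, giving one column of W = X^T Y."""
    return [sum(c * v for c, v in zip(code, y)) for code in CODES]


def decode(y):
    """Index i of the code with the largest coefficient W[i][j] for this pixel."""
    w = separate(y)
    return max(range(len(w)), key=lambda i: w[i])


# Code 2 dominates, with leakage from codes 1 and 3 (a mixed signal).
y = mix([0.0, 0.3, 1.0, 0.2])
print(separate(y))  # recovers the mixing weights up to rounding
print(decode(y))    # index of the dominant code
```

Because the codes are orthonormal, the projection returns the mixing weights themselves, so the dominant source remains identifiable as long as its weight exceeds those of the overlapping neighbours.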
HOC Based Pixel Address Encoding
The structured light system as a signaling system consists of an information source, an encoding of this source, a channel, a noise source, a decoding, and a sink.
First, the projector module sends encoded addresses that correspond to the spatial relationship of a 3D environment. Objects and environments act as a noise source that is added to the signal in the channel. The camera module receives the codes and decodes the original addresses from the noisy signals. Since the order of addresses in the projector image can be pre-defined, the system calculates the disparity map from the order of addresses in the camera image.
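As a rough sketch of the last step, the disparity along one image row can be taken as the difference between the camera column and the decoded projector address. This assumes a rectified projector-camera pair in which the address equals the projector column, and it uses a hypothetical marker value for pixels that could not be decoded; neither detail is specified in the text.

```python
def disparity_row(decoded_addresses, invalid=-1):
    """Disparity for one row: camera column minus decoded projector column.

    `decoded_addresses[col]` is the projector address decoded at camera
    column `col`; `invalid` marks pixels with no reliable code (assumption).
    """
    return [None if address == invalid else col - address
            for col, address in enumerate(decoded_addresses)]


# Columns 0 and 1 see their own addresses, column 2 is undecodable,
# and columns 3 and 4 see addresses shifted by two pixels.
print(disparity_row([0, 1, -1, 1, 2]))
```

Depth would then follow from the disparity by triangulation with the calibrated projector-camera geometry, which is outside this sketch.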
The HOC-based communication process of a structured light system is divided into two parts. In the encoding process, we assign a unique address to each position of the projector image. We need to consider only the pixels on an epipolar line, because a structured light system is similar to a passive stereo camera system with one of the cameras replaced by a projector. We assume that N pixels lie on an epipolar line. The source of the projector module consists of N addresses, $S = \{s_1, s_2, \ldots, s_N\} \subset \mathbb{N}$. For channel coding that is robust to environmental noise, we can use a binary code set $B = \{b_1, b_2, \ldots, b_N\}$ that corresponds to the source.
If we use a perfectly orthogonal binary code set B, i.e. $\langle b_i, b_j \rangle = 0$ for $i \neq j$, we can acquire a very accurate depth image, but the system faces a computational problem because the total number of camera frames must equal the number of codes. To avoid this increase in computational complexity, we propose a novel signal coding algorithm for structured light systems. The algorithm arranges orthogonal codes hierarchically in order to reduce the code length. We named our method "HOC (Hierarchical Orthogonal Coding)." The HOC technique focuses on making the code length as short as possible while preserving the characteristics of the orthogonal code.
The HOC is composed of two consecutive processes. The first, encoding, is concerned with reducing the code length, and the second, decoding, focuses on estimating the depth image. In the encoding process, the code signals of length N are divided into L layers, and each layer recursively contains H orthogonal codes, as shown in Fig. 4.
Although the signal codes in the HOC are not orthogonal as a whole, each layer has a set of orthogonal codes. For example, assume that a HOC has four layers ($L = 4$) and that the number of orthogonal codes in each layer is also four ($H_1 = H_2 = H_3 = H_4 = 4$). In this case, the total number of signal codes is 256 ($H_1 \times H_2 \times H_3 \times H_4 = 4^4 = 256$) and the code length is 16 ($H_1 + H_2 + H_3 + H_4 = 16$), i.e. we need 16 frames of camera images to decode the address signals.
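The layered address encoding can be sketched as follows. The sketch assumes that each layer uses one-hot codes of length H, which are binary and mutually orthogonal, and that the address is written in base H with the coarsest layer first; the function name `hoc_encode` and these layout choices are illustrative assumptions rather than the exact patterns of Fig. 4.

```python
def hoc_encode(address, layers=4, codes_per_layer=4):
    """Return the binary frame sequence for one projector address.

    The address is written in base `codes_per_layer`; each digit selects one
    orthogonal code in its layer. One-hot codes are assumed here because they
    are binary and mutually orthogonal.
    """
    total = codes_per_layer ** layers
    if not 0 <= address < total:
        raise ValueError("address out of range")
    frames = []
    for layer in range(layers):
        # Most significant digit first: layer 0 is the coarsest subdivision.
        weight = codes_per_layer ** (layers - 1 - layer)
        digit = (address // weight) % codes_per_layer
        frames.extend(1 if k == digit else 0 for k in range(codes_per_layer))
    return frames


code = hoc_encode(27)  # 27 = 0*64 + 1*16 + 2*4 + 3, i.e. digits 0, 1, 2, 3
print(len(code))       # number of frames needed for 256 addresses
print(code)
```

With L = 4 and H = 4, every one of the 256 addresses receives a distinct 16-frame sequence, whereas a fully orthogonal code set would need 256 frames.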
HOC Based Signal Separation
The decoding process is divided into two parts. The first part is concerned with the signal separation of the mixture signals, as shown in Fig. 5.
The depth estimation process is as follows. For a spatio-temporal image of f frames, we represent the pixel intensity at the i-th position as $I(i, t)$. Let a vector $y = (y_1, \ldots, y_f)^T$ denote the pixel intensities at position k over the f frames. Since HOC contains L layers, we can regard the vector y as an augmented vector, $y = (b_1, b_2, \ldots, b_L)^T$, where the index j of a vector $b_j$ corresponds to the j-th layer. Because each layer uses H orthogonal codes, a vector b is a linear combination of the orthogonal codes used for the channel coding, $b = Xc$, where c and X denote a coefficient vector and a matrix representing the set of orthogonal codes, respectively. In other words, the value of the sensed signal at a position includes values of other signals due to the geometrical properties of the objects and environment, the surface reflection of the objects, etc.
The coefficient vector of each layer at the i-th position can be calculated by projecting onto the orthogonal codes, $c_j = X^T b_j$ for orthonormal $X$. The total number of possible codes at the i-th position is $H^L$ because of the hierarchical structure of HOC, e.g. there are 256 candidates when using 4 layers and 4 orthogonal codes. Note that a maximum signal value does not necessarily indicate the correct code, because of signal mixing. We proposed a decoding algorithm [20] that is intended to provide the disparity as correctly as possible. To select the most reliable address, we consider the signal magnitude at the position, the uncertainty arising from differences among the signals of the neighbourhood, and continuity as a structural constraint of objects and environments. We can represent the confidence of an address as Confidence = Signal magnitude (weight) + Uncertainty + Continuity. Nevertheless, the signal with the maximum value can be taken as an approximation of the signal sent from the projector array, even though the maximum value does not guarantee the correct code. Based on this approximation, we used maximum values in the hardware platform implementation.
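The maximum-value approximation can be sketched as below. The sketch assumes one-hot codes in each layer, so that the projection coefficient for code k is simply the intensity in frame k of that layer, and it assumes the coarsest layer comes first; it does not model the confidence terms (uncertainty and continuity) of the full decoding algorithm [20]. The function name and sample intensities are hypothetical.

```python
def hoc_decode_max(y, layers=4, codes_per_layer=4):
    """Recover an address from one pixel's intensities over all frames.

    For each layer, the code with the largest response is selected and its
    index is used as one base-`codes_per_layer` digit of the address.
    """
    if len(y) != layers * codes_per_layer:
        raise ValueError("unexpected number of frames")
    address = 0
    for layer in range(layers):
        start = layer * codes_per_layer
        block = y[start:start + codes_per_layer]
        digit = max(range(codes_per_layer), key=lambda k: block[k])
        address = address * codes_per_layer + digit
    return address


# Hypothetical 16-frame intensities with leakage from neighbouring codes.
pixel = [200, 40, 10, 5,
         30, 180, 60, 8,
         12, 70, 150, 20,
         9, 15, 80, 140]
print(hoc_decode_max(pixel))  # digits 0, 1, 2, 3 give address 27
```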
3D Camera Implementation
System Overview
The implemented 3D Camera hardware platform consists of an IR Source, DMD, IR Camera, PC and two FPGA [19] boards as shown in Fig. 6 below. The following is the operation flow of the implemented system.
1. The user presses the User Start button on the System Control FPGA.
2. When the User Start signal is high, the System Control FPGA creates the structured light pattern data and transmits them to the DMD.
3. After transmitting one frame of data, the System Control FPGA sends the start signal to the IR Source and the IR Camera.
4. When the IR Source receives the start signal, it flashes IR light; at the same time, the IR Camera opens the lens, acquires one frame of data, and transmits it to the PC.
5. Operations 3 and 4 are repeated 17 times to transmit 17 frames of data to the PC, because 17 frames of data are needed to process one 3D image, as explained in the previous section on the algorithm.
6. The data saved in the PC are transmitted to the HOC Processing Accelerator through a USB 2.0 interface.
7. In the HOC Processing Accelerator, the HOC algorithm processes the 17 frames of data into one frame of 3D image and transmits it to the PC.
The operations and components of each sub-module are explained in the next sub-sections.
IR Source, IR Camera and DMD
The system performs IR-based pattern projection and image acquisition. It is composed of an IR source (IR LED, wavelength: 870 nm) [18], an IR camera, a DMD, the DMD controller, and a lens. Human eyes can only detect light in the range between 380 nm and 780 nm [11], so infrared is used to avoid discomfort to the human eye. Infrared itself is classified into three ranges: near-infrared, medium-infrared, and far-infrared. We employ near-infrared since the CCD/CMOS sensors used in digital cameras and camera phones can detect it [12], [13], whereas medium-infrared and far-infrared must be detected using special types of sensors such as InSb and PbSe [14]. Moreover, these sensors must be cooled to cryogenic temperatures because of noise at normal temperature. In addition, cameras for detecting medium-infrared and far-infrared are more expensive and bulkier than CCD/CMOS-based cameras [15]. The system includes a high-power IR LED (peak wavelength: 870 nm) and a CCD-based camera with a zoom lens.
Generally, digital cameras have a filter that screens out infrared to prevent image distortion. To use infrared in our research, this cold filter (which blocks infrared) is replaced with a hot filter (which blocks visible light). The glass-type BW 093 from Schneider and the film-type Wratten 87 from Kodak are employed since they block the whole visible light range [16]. An additional lens may be required to compensate for the distortion caused by replacing the cold filter with a hot filter.
DMD is a semiconductor-based optical switch integrated with micro mirrors. The micro mirrors tilt about ±12° in on/off operation. The tilting function provides the capability of reflecting or blocking the light in a designated direction. Intensity control is performed by regulating the length of the reflection time [17].
DMD is used to generate variable patterns and project them reliably in the desired direction. A binary image, like a bitmap consisting of 1s and 0s, can be obtained by switching each mirror to the on or off state. To produce grayscale images, PWM (Pulse Width Modulation) is used to adjust the on-time of the mirrors. For example, a 100% duty ratio means that the mirrors stay in the on state during the whole PWM period, and a 50% duty ratio means that the mirrors are on for half of the period and off for the other half.
System Control FPGA
This module generates the structured light pattern to be output by the DMD and transmits it to the DMD. Once the transmission of a frame is completed, the module sends the start signal to the IR Source and the IR Camera to synchronize the sending and receiving systems. The structured light pattern data are transmitted from the System Control FPGA to the DMD through the high-speed port of the DMD. In each clock, 64 bits of pattern data are transmitted and saved in the internal memory of the DMD, and this is repeated for 12,288 clocks ((1024 × 768)/64). After a frame of data has been transmitted, the FPGA sends the Reset Request signal ('1') to the DMD. Once the DMD receives this signal, it outputs the internally saved pattern data to the mirrors and responds to the FPGA with Reset Active = '1.' When the FPGA receives Reset Active = '1,' it recognizes that the data in the DMD have been output and immediately sends a start signal to the IR Source and the IR Camera to synchronize the system for one frame. By repeating this operation 17 times, the 17 frames of the structured light pattern are output from the DMD.
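As a quick check of the transfer count quoted above, and assuming one bit per mirror for the 1024 × 768 array (consistent with binary on/off patterns, though not stated explicitly), the arithmetic per frame and per 3D image works out as:

```latex
\[
\frac{1024 \times 768\ \text{bits}}{64\ \text{bits/clock}}
  = \frac{786{,}432}{64}
  = 12{,}288\ \text{clocks per frame},
\qquad
17 \times 12{,}288 = 208{,}896\ \text{clocks per 3D image}.
\]
```

The second figure counts only the pattern transfer clocks for the 17 frames and excludes the reset handshake and the camera exposure between frames.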
HOC Processing Accelerator
This module receives the 17 frames of image data from the PC, performs the HOC algorithm implemented in hardware inside the FPGA, and then outputs one 3D image to the PC. Figure 7 shows the block diagram of the HOC Processing Accelerator. The operational flow of the implemented module is as follows.
The PC saves the 17 frames of data and then transmits them to the HOC Processing Accelerator. One frame of data, input through the USB 2.0 interface, is saved in SDRAM1 via the Receive FIFO and the SDRAM Interface. In this way, 16 frames of data are saved in SDRAM1 through SDRAM4. The data are stored in layer units, following the structure of the HOC algorithm (1 layer = 4 frames). SDRAM5 holds reference data for the compensation of the input image.
After the 17 frames of data have been saved, the SDRAM Interface accesses the five SDRAMs simultaneously in every clock and transmits 136 bits of data ((32 bit × 4) + (8 bit × 1)) to the HOC Processing Module.
The results of the HOC Processing Module are saved in SDRAM6. Once the processing of the 17 frames of data is completed, the SDRAM Interface transmits the result to the PC through the Transfer FIFO.
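The composition of the 136-bit word can be illustrated with a small packing sketch. The bit ordering below (layer 1 in the most significant 32 bits, the reference byte in the least significant 8 bits, and 8 bits per pixel) is a hypothetical layout chosen for illustration; only the 32 × 4 + 8 split comes from the text.

```python
def pack_word(layers, ref):
    """Pack four layers of four 8-bit pixels plus one 8-bit reference into 136 bits."""
    word = 0
    for pixels in layers:
        for p in pixels:
            word = (word << 8) | (p & 0xFF)
    return (word << 8) | (ref & 0xFF)


def unpack_word(word):
    """Split a 136-bit integer back into (layers, ref) using the same layout."""
    if not 0 <= word < (1 << 136):
        raise ValueError("word does not fit in 136 bits")
    ref = word & 0xFF
    layers = []
    for layer in range(4):
        shift = 8 + 32 * (3 - layer)  # layer 0 occupies the top 32 bits
        chunk = (word >> shift) & 0xFFFFFFFF
        layers.append([(chunk >> (8 * (3 - k))) & 0xFF for k in range(4)])
    return layers, ref


layers = [[200, 40, 10, 5], [30, 180, 60, 8], [12, 70, 150, 20], [9, 15, 80, 140]]
word = pack_word(layers, ref=4)
print(word.bit_length() <= 136)
print(unpack_word(word) == (layers, 4))
```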
HOC Processing Module
The 128 bits of data, input from the SDRAMs in layer units, are compensated by subtracting the 8 bits of Ref Data. The HOC Processor then selects the largest pixel value of each layer and transmits it to the Depth Calculation block, which finds the corresponding value in a look-up table. Figure 8 shows the block diagrams of the HOC Processing Module and the HOC Processor.
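A software model of this datapath for a single pixel might look like the sketch below. Several details are assumptions: the subtraction is taken to saturate at zero, the index of the largest value in each layer is used as one base-4 digit of the address, and the look-up table contents are hypothetical placeholders rather than calibrated depth values.

```python
def hoc_process_pixel(layers, ref, depth_lut):
    """Model one pixel: reference subtraction, per-layer maximum, table look-up."""
    address = 0
    for pixels in layers:
        # Compensation by the reference value; clamping at 0 is an assumption.
        compensated = [max(p - ref, 0) for p in pixels]
        # Index of the largest compensated value selects this layer's code.
        digit = max(range(len(pixels)), key=lambda k: compensated[k])
        address = address * len(pixels) + digit
    return depth_lut[address]


# Hypothetical table mapping each of the 256 addresses to a depth value.
depth_lut = [1000 + 2 * a for a in range(256)]
layers = [[200, 40, 10, 5], [30, 180, 60, 8], [12, 70, 150, 20], [9, 15, 80, 140]]
print(hoc_process_pixel(layers, ref=4, depth_lut=depth_lut))
```

In the hardware, one such result is produced per clock from each 136-bit input word; the loop here only mirrors the logical steps.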
Experimental Results
We evaluated the performance in two ways. The first compared the precision of the proposed algorithm with that of a conventional algorithm. The second compared the processing time of the algorithm implemented as hard-wired logic on the FPGA (Xilinx XC2V6000) and as software on the PC (Intel Pentium4 2.4 GHz). To evaluate the proposed coding algorithm, we implemented a visible structured light system consisting of an HP-vp6110 projector and a JAI analog color CCD camera. The spatial resolution is 320 × 240. The system acquired depth images from real environments, including plane, plant & ball, and complex workspaces. We compared the repeatability, as a measure of precision, of the gray code and the proposed HOC method. Since the spatial resolution is 320 × 240, we use 8 bits for the gray code. There are four scenes, including plane, plant & ball, and complex workspaces.
In each scene, we acquired thirty depth images and computed their correlation value. The results are summarized in Table 1. There are statistically significant performance differences between the previous coding and the proposed coding, as shown in Table 2, where the z value indicates the statistical significance. Table 3 compares the correspondence errors of the proposed HOC and the gray code. We compared the calculated depth images with ground-truth data generated by the line scan method. The result shows an improvement in coding accuracy at occluding and shading boundaries. Figure 9 shows depth images of a complex scene. The results show that the proposed HOC algorithm provides robust depth images for complex scenes that include curved surfaces, shadows, occluding boundaries, and varying surface reflectance.
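The text does not specify how the correlation value over repeated captures is computed. One plausible reading, shown here purely as an illustration, is the mean pairwise Pearson correlation between the depth images of the same scene; the function names and the sample data are hypothetical.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation of two equally sized depth images given as flat lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)  # undefined for a constant image


def repeatability(depth_images):
    """Mean pairwise correlation over repeated captures of the same scene."""
    pairs = list(combinations(depth_images, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Three hypothetical captures of a four-pixel scene (depth in mm).
captures = [
    [500.0, 510.0, 620.0, 700.0],
    [501.0, 509.0, 622.0, 699.0],
    [499.0, 512.0, 619.0, 702.0],
]
print(round(repeatability(captures), 4))
```

A value close to 1 indicates that repeated captures agree, which is the sense in which repeatability serves as a measure of precision.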
We implemented the proposed algorithm on a Xilinx XC2V6000 FPGA and measured the processing time. The implemented hardware operates at 50 MHz. Table 4 shows the processing time of the algorithm for the software and hardware implementations. It took about 25 ms for the PC to write 17 frames of data to the SDRAM on the FPGA board through the USB interface, about 6 ms to perform the HOC processing, which reads the 17 frames of data from the SDRAM and writes the result back to the SDRAM, and about 3 ms to transmit the result of the HOC processing back to the PC. Therefore, the total processing time was about 34 ms. In the software implementation, it took about 94 ms to build the look-up table used to calculate the address value and then 719 ms to find the maximum value in each layer. Hence, the total processing time was about 813 ms. As can be seen, processing is about 24 times faster when the HOC algorithm is implemented in hardware rather than in software.
Table 4 The results of algorithm processing time on each software implementation and hardware implementation.
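The quoted totals and the speed-up factor follow directly from the measured components:

```latex
\[
T_{\mathrm{HW}} = 25 + 6 + 3 = 34\ \text{ms}, \qquad
T_{\mathrm{SW}} = 94 + 719 = 813\ \text{ms}, \qquad
\frac{T_{\mathrm{SW}}}{T_{\mathrm{HW}}} = \frac{813}{34} \approx 23.9 .
\]
\[
\frac{1000\ \text{ms/s}}{34\ \text{ms}} \approx 29.4\ \text{3D images per second}
\]
```

The second line is only an upper bound on throughput implied by the processing time; it does not include the time needed to project and capture the 17 pattern frames.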
Conclusion
Based on our discovery that significant code overlap in the signal received at the camera causes a critical degradation of performance, we presented a novel approach to structured light depth imaging based on signal separation coding. The proposed algorithm, however, is difficult to run in real time because it must analyze and process a large amount of data. We therefore developed a high-speed 3D IR camera hardware platform with variable structured light, designed and implemented for use in home service robots. Experimental results have demonstrated that, as a result of separating the overlapped codes, our algorithm provides highly robust, precise, and accurate depth imaging. In particular, the experimental results have shown the effectiveness of the proposed approach for more complex workspaces with a number of cluttered objects. Furthermore, with the HOC algorithm running on the FPGA hardware platform, generating one 3D image required 34 ms, which is about 24 times faster than the software implementation. Our future plan is to test the proposed algorithm further in diverse real-world environments and to develop a structure that processes the output data of a high-speed CMOS image sensor (Dalsa, 300 F/S) directly instead of going through the PC, so as to design a hardware platform suitable for real-time environments.
