The single-core DSP pre-processes the compressed data streams and forwards them to the multi-core DSP, which processes the actual data. Pre-processing also includes distributing the data required for processing on the multi-core system using a data parallelism concept. We discuss both the design considerations and the implementation details of the interface and the pre-processing algorithm.
I. INTRODUCTION

Many sensor technologies are already in use for reliable Advanced Driver Assistance Systems (ADAS). These systems should generally improve the driving process of a vehicle. A significant part of ADAS is based on vision systems, e.g. road sign recognition. The EU-funded project ADOSE (reliable application-specific detection of road users with vehicle on-board sensors) is focused on cost-effective sensor technologies for ADAS. Our aim within ADOSE is to develop an embedded Silicon Retina stereo system used for pre-crash warning/preparation side impact detection. This system uses a bio-inspired analogue optical sensor, and thus a novel approach for data acquisition and processing has to be realized.
These types of sensors detect intensity changes in an observed scene at pixel level, with each pixel delivering its address and event data separately and independently whenever it detects an intensity change. Unaltered parts of a scene that have no intensity variation need neither be transmitted nor processed. Thus, the amount of data varies over time depending on the scene. Due to this asynchronous behavior …
The Silicon Retina stereo system consists of two bio-inspired optical sensors, an embedded system for data acquisition and pre-processing based on a TMS320C6455 single-core fixed-point digital signal processor (DSP), and an embedded system for processing the stereo matching algorithm based on a TMS320C6474 multi-core fixed-point DSP platform with three cores.
Data acquisition techniques for conventional cameras are prevalent and interfaces are available. Due to the special characteristics of this sensor, approaches for embedded real-time data acquisition do not exist. In this paper we give an overview of the optical sensor and the algorithmic approach for data processing. The sensor principle used is quite unconventional. This requires new algorithmic approaches for solving the stereo correspondence problem, which has a big influence on the data handling and pre-processing. Pre-processing in this context means preparing the acquired data to enable parallel data processing on the multi-core system.
The remainder of this paper is outlined as follows: Section II gives an overview of related address event representation communication, debugging and acquisition interfaces and methodologies. Section III introduces the optical sensor and describes its core features. Section IV gives an overview of the embedded system and the DSPs used for acquisition, pre-processing and stereo matching. Section V shows the concept of the hardware and details of the data acquisition and the pre-processing. Section VI covers the communication from the acquisition and pre-processing embedded system to the stereo matching embedded system. Section VII shows a high-performance approach for data acquisition of the processing embedded system using Serial RapidIO™. Finally, we give a conclusion about the work.
II. RELATED WORK
Conventional optical sensors capture image data frame-by-frame at a fixed frame-rate. A series of frames contains a vast amount of redundant data. The Silicon Retina uses an event-triggered concept, where an event E is a three-tuple consisting of the coordinates x and y, and a timestamp T.
Berner et al. [1] describe a high-speed USB 2.0 address event representation interface that allows simultaneous monitoring and sequencing of data. Their aim was to develop a simple, cheap and user-friendly device where the timestamps are generated with a 16-bit counter in a CPLD. The peak event-rate is 6 million events per second (Meps) for monitoring and 3.75 Meps for sequencing, with four bytes per event including a separate timestamp for each event. Merolla et al. [2] give an overview of using USB as a high-performance interface for neuromorphic systems, covering both hardware and software considerations.
Dante et al. [3] show a device for interfacing address-event based neuromorphic systems that is implemented on a peripheral component interconnect (PCI) board. This system can handle four transmitters and four receivers, and its three main functions are monitoring, sequencing, and mapping. Monitoring comprises reading and time-stamping events to make them available for further processing. Sequencing stands for generating traffic, and mapping transforms incoming events into outgoing events in different modes. The paper gives no detailed information on performance.
Litzenberger et al. [4] present an embedded smart camera for high-speed vision. They connect the optical sensor via a first-in first-out (FIFO) buffer memory to a Blackfin fixed-point DSP from Analog Devices. Following a data available request from the optical sensor, the data is stored in the FIFO. The DSP reads the data from the FIFO while it is not empty and timestamps the events afterwards in the DSP. Here, a significant share of the processing performance is spent on data acquisition, although it would be needed for data processing. Due to the latency between the generation of an event and its input to the DSP, the timestamps of the events are distorted.
III. OPTICAL SENSOR
The discussed type of sensor technologically goes back to Fukushima et al. [5] in 1970, who implemented an electronic model of a retina. The first silicon-based retina imager was developed by Mead and Mahowald [6]. Within ADOSE, we used two different versions of the optical sensor. The prior version has a resolution of 128x128 pixels and a time resolution of 1 ms. Due to processor and hardware restrictions, this optical sensor is able to deliver a sustained rate of up to 300 keps. This version uses a UDP socket for communication [7].
The new version of the optical sensor is a 304x240 pixel vision sensor with a time resolution of 10 ns. These types of sensors exploit a very efficient asynchronous event-driven data protocol that only delivers variations of intensity. In this way data redundancy is largely removed. Unaltered parts of a scene that have no intensity variation need neither be transmitted nor processed. Figure 1 shows a visualized image with a pedestrian, where the events of 20 ms are accumulated. The optical sensor has a 20-bit parallel asynchronous AER interface with hardware-accelerated pre-processing of the events, including 10 ns time-stamping and a region-of-interest (ROI) filter. The overall sustained performance of the AER interface is 5.125 Meps [8].
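The accumulation used for such a visualization can be sketched in a few lines of C. This is only an illustration under stated assumptions: the `event_t` and `roi_t` layouts, the function names, and the software ROI check are hypothetical (in the sensor itself the ROI filter is hardware-accelerated), and timestamps are assumed to be counted in 10 ns ticks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SENSOR_WIDTH  304
#define SENSOR_HEIGHT 240
#define TICKS_PER_MS  100000ULL  /* 1 ms expressed in 10 ns ticks */

/* Hypothetical event layout: pixel address plus timestamp in 10 ns ticks. */
typedef struct {
    uint16_t x;
    uint16_t y;
    uint64_t t_ticks;
} event_t;

/* Hypothetical rectangular region of interest, bounds inclusive. */
typedef struct {
    uint16_t x0, y0, x1, y1;
} roi_t;

static int in_roi(const roi_t *roi, const event_t *e)
{
    return e->x >= roi->x0 && e->x <= roi->x1 &&
           e->y >= roi->y0 && e->y <= roi->y1;
}

/* Count events per pixel that fall into [t_start, t_start + window_ms)
 * and into the ROI. Returns the number of accumulated events. */
static size_t accumulate_events(const event_t *events, size_t n,
                                uint64_t t_start, unsigned window_ms,
                                const roi_t *roi,
                                uint8_t image[SENSOR_HEIGHT][SENSOR_WIDTH])
{
    const uint64_t t_end = t_start + (uint64_t)window_ms * TICKS_PER_MS;
    size_t used = 0;

    for (size_t i = 0; i < n; ++i) {
        const event_t *e = &events[i];
        assert(e->x < SENSOR_WIDTH && e->y < SENSOR_HEIGHT);
        if (e->t_ticks < t_start || e->t_ticks >= t_end || !in_roi(roi, e))
            continue;
        if (image[e->y][e->x] < UINT8_MAX)  /* saturate instead of wrapping */
            ++image[e->y][e->x];
        ++used;
    }
    return used;
}
```

Only pixels that actually produced events are touched, so the cost of building such an image scales with scene activity rather than with the sensor resolution.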
To predict the data-rate of the new optical sensor, both theoretical and empirical data-rate estimations were carried out. Analysis of the optical sensor with a test pattern generator showed that the average stimulus frequency of a pixel is approximately 30 Hz. Hence, the average data-rate of the new optical sensor is about 2.07 Meps or 8.29 MiB/s, and the maximum data-rate is about 8.29 Meps or 33.18 MiB/s. In the empirical estimation, the optical sensor was stimulated by several traffic conditions in a real-world environment, resulting in an average data-rate of 5.3 Meps or 21.2 MiB/s.
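The event-rate and byte-rate pairs quoted above are consistent with roughly four bytes per event. The following C sketch makes that conversion explicit; the constant `BYTES_PER_EVENT` is an inference from the quoted figures rather than a documented property of the sensor, and the function names are hypothetical.

```c
#include <assert.h>

/* Assumption: about four bytes per event, inferred from the ratio of the
 * quoted byte rates to the quoted event rates. */
#define BYTES_PER_EVENT 4.0

/* Convert an event rate in mega-events per second into a byte rate that
 * uses the same "mega" prefix as the input. */
static double event_rate_to_byte_rate(double meps)
{
    assert(meps >= 0.0);
    return meps * BYTES_PER_EVENT;
}

/* Aggregate event rate of a stereo setup with two identical sensors. */
static double stereo_event_rate(double meps_per_sensor)
{
    assert(meps_per_sensor >= 0.0);
    return 2.0 * meps_per_sensor;
}
```

Under this assumption, the empirical average of 5.3 Meps per sensor corresponds to 21.2 mega-bytes per second, and two sensors together produce about 10.6 Meps.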
Due to the high data-rate requirements of the system consisting of two optical sensors, an interface is required that produces a minimum protocol overhead.
B. Stereo Matching
Stereo matching deals with the reconstruction of depth information of a scene captured from two different points of view. Scharstein and Szeliski [9] give a good overview of conventional stereo matching. Evaluations showed that area-based techniques do not exploit the features of the Silicon Retina technology. New approaches are based on event-triggered stereo matching.
IV. EMBEDDED SYSTEM
The embedded system used to perform data acquisition and pre-processing is based on a TMS320C6455 single-core fixed-point DSP from Texas Instruments. Due to the high performance requirements of the stereo vision algorithms, a second DSP, a TMS320C6474, is dedicated to data processing. Both DSP models are based on the C64x+ DSP core from Texas Instruments. The C64x+ DSP core has a very long instruction word (VLIW) architecture with eight functional units. The TMS320C64x+ DSP Megamodule Reference Guide [10] gives more detailed information about this architecture.
The TMS320C6455 and the TMS320C6474 have many similarities, but a more detailed view reveals significant differences. The most significant difference is that the TMS320C6474 consists of three C64x+ DSP cores rather than one. This has a noticeable effect on the peak million multiply-accumulate cycles per second (MMACS) of each DSP. Another difference in terms of data acquisition is that the TMS320C6474 has no adequate parallel interface for connecting the parallel interface of the optical sensor. After acquisition and pre-processing, the data is output via Serial RapidIO™ to the TMS320C6474 evaluation module (EVM), where it is further processed by the stereo algorithm.
A. Operating System
The implementation of both the acquisition and pre-processing embedded system and the stereo matching embedded system uses DSP/BIOS, a pre-emptive scalable real-time kernel from Texas Instruments. It offers various features supporting, e.g., hardware abstraction or real-time analysis. Hardware Interrupts (HWI) are used to handle time-critical asynchronous events. Software Interrupts (SWI) have a lower priority than HWIs, but a higher priority than conventional tasks, and are software triggered. Tasks are threads that have lower priority than HWIs and SWIs, but higher priority …
… programmable priority levels, support for two-dimensional …
When the TCC interrupt is triggered, a registered HWI routine is called and the TCC can be queried by software.
To keep the runtime of the HWI as small as possible, only a SWI is posted based on the TCC. In the SWI, the posted value can be read and the TCC can be determined. This two-level approach allows processing the correct memory in a SWI context with minimum latency. The SWI routine pre-processes the acquired data for the multi-core system.
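The two-level pattern can be illustrated with a small, kernel-independent simulation in C. It does not use the DSP/BIOS API; `swi_sim_t`, `hwi_transfer_complete`, and `run_pending_swi` are hypothetical stand-ins, and the "pre-processing" is reduced to recording which buffers were handled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a software interrupt object with a mailbox.
 * Bit i of the mailbox means: the transfer with completion code i is done. */
typedef struct {
    volatile uint32_t mailbox;
    void (*fxn)(uint32_t mailbox);
} swi_sim_t;

/* Records which buffers the deferred routine has handled (for illustration). */
static uint32_t processed_mask;

/* Deferred work: runs in "SWI context" and handles every posted code. */
static void swi_preprocess(uint32_t mailbox)
{
    for (unsigned tcc = 0; tcc < 32; ++tcc) {
        if (mailbox & (1u << tcc))
            processed_mask |= 1u << tcc;  /* placeholder for real pre-processing */
    }
}

static swi_sim_t preprocess_swi = { 0, swi_preprocess };

/* "HWI": does nothing but record the completion code and return. */
static void hwi_transfer_complete(unsigned tcc)
{
    assert(tcc < 32);
    preprocess_swi.mailbox |= 1u << tcc;
}

/* Scheduler stand-in: runs the SWI once all pending HWIs have returned. */
static void run_pending_swi(void)
{
    uint32_t pending = preprocess_swi.mailbox;
    if (pending == 0)
        return;
    preprocess_swi.mailbox = 0;   /* consume the posted value */
    preprocess_swi.fxn(pending);
}
```

Because the interrupt routine only sets a bit, several completions that arrive before the deferred routine runs are merged into a single invocation, which keeps the time spent at hardware-interrupt priority short.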
C. Pre-processing to Afford Parallelism
In this project, the stereo algorithm is processed on a multi-core DSP. Therefore, the acquired data has to be pre-processed to enable parallel processing.
Culler and Singh [16] give a detailed overview of parallel computer architectures. According to them, there are several types of parallelism:
• Parallelizing loops often leads to similar operation sequences or functions being processed on large amounts of data on parallel processing units.
Figure 4 shows an example of partitioning events from two sources to three destinations, thus N = 3. An event $e_{n,m}$ is defined by $E(x, y, T)$ and a timestamp $t_{n,m} \in \mathbb{N}^+$, where n is the source identifier and m is a continuing unique number. Regarding the compressed data-format of the source data stream, $T_{s,n} = e_{n,m}(T)$, where $T_{s,n}$ is the current timestamp of source n.
Due to n, the event is routed to the destination whenever the destination time $T_{b,n} = T_{s,n}$; otherwise, the current source timestamp is set at the destination beforehand. Using this approach, the amount of data per processing unit can be minimized, because every destination receives only the data that it can process directly.
Fig. 4. Example of the split algorithm: the events $e_{1,2}$, $e_{1,3}$, $e_{2,5}$, and $e_{2,6}$ are in the first third of the scene, $e_{1,4}$, $e_{1,5}$, $e_{1,7}$, and $e_{2,9}$ are in the second third of the scene, and the remaining events are in the last third of the scene.
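A minimal C sketch of such a split step follows. It is an interpretation under explicit assumptions, not the implementation used in the project: the scene is assumed to be divided into thirds by image row, the word layout `word_t` and all names are hypothetical, and each destination keeps one current timestamp per source so that a timestamp word is written only when it differs from the source timestamp.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_SOURCES   2    /* two optical sensors */
#define NUM_DEST      3    /* N = 3 destinations */
#define SCENE_ROWS    240  /* assumption: thirds are formed over image rows */
#define DEST_CAPACITY 64   /* arbitrary buffer size for this sketch */

typedef enum { WORD_TIMESTAMP, WORD_EVENT } word_kind_t;

/* Hypothetical word of a compressed destination stream. */
typedef struct {
    word_kind_t kind;
    uint8_t     source;
    uint16_t    x, y;  /* meaningful for WORD_EVENT */
    uint32_t    t;     /* meaningful for WORD_TIMESTAMP */
} word_t;

typedef struct {
    word_t   words[DEST_CAPACITY];
    size_t   count;
    uint32_t last_t[NUM_SOURCES];        /* destination time per source */
    int      last_t_valid[NUM_SOURCES];  /* 0 until a timestamp was written */
} dest_stream_t;

/* Map an image row to the destination that owns this third of the scene. */
static size_t dest_for_row(uint16_t y)
{
    assert(y < SCENE_ROWS);
    return (size_t)y * NUM_DEST / SCENE_ROWS;
}

/* Route one event. A timestamp word is emitted first whenever the
 * destination time differs from the current source timestamp.
 * Returns 0 on success, -1 if the destination buffer is full. */
static int split_event(dest_stream_t dests[NUM_DEST], uint8_t source,
                       uint16_t x, uint16_t y, uint32_t t_source)
{
    assert(source < NUM_SOURCES);
    dest_stream_t *d = &dests[dest_for_row(y)];
    int need_ts = !d->last_t_valid[source] || d->last_t[source] != t_source;

    if (d->count + (need_ts ? 2u : 1u) > DEST_CAPACITY)
        return -1;

    if (need_ts) {
        word_t ts = { WORD_TIMESTAMP, source, 0, 0, t_source };
        d->words[d->count++] = ts;
        d->last_t[source] = t_source;
        d->last_t_valid[source] = 1;
    }
    word_t ev = { WORD_EVENT, source, x, y, 0 };
    d->words[d->count++] = ev;
    return 0;
}
```

With this layout a destination never receives timestamps for time steps in which none of its events occurred, which keeps the per-core data volume small.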
D. Further Auxiliary Pre-processing
Depending on the stereo algorithm of the embedded system, further pre-processing is possible before outputting the data.
In the case of a time-space correlating stereo algorithm, image rectification is important, and noise filtering is an optional task.
Image rectification is used to transfer the distorted coordinate system into a standard coordinate system with aligned epipolar lines. Rectification needs to be performed before separating the data among the destination cores; otherwise, while performing the stereo algorithm, one destination may require data from another core.
Noise filtering or noise reduction is an optional process that pre-processes the data from the optical sensors and removes events that are disqualified. This simply reduces the amount of data.
VI. DATA OUTPUT ON THE TMS320C6455
After the data has been separated according to its destination, it needs to be transported from the TMS320C6455 to the TMS320C6474. According to the available interfaces of both DSPs, Ethernet and Serial RapidIO™ are possible interfaces.
Ethernet is the standard for wide-scale interconnect networks and is intended for box-to-box, board-to-board, chip-to-chip, and backplane interconnection. It has the ability to connect multiple members, and has a very flexible and extensible architecture. An advantage of Ethernet for this project is the variable packet size of up to 9000 bytes with jumbo frames.
Disadvantages are a high traffic overhead, a low Ethernet PDU size of up to 1500 bytes, a symbol-rate per pair of 125 Mbaud for 1000Base-T Gigabit Ethernet [17], and a processing-intensive stack.
Within the physical layer of Serial RapidIO™, the data is 8b/10b coded. With the actual configuration of the interface, using a streaming write operation with 32-bit addresses and 8-bit IDs, the theoretical maximum data-rate at a signaling-rate of 3.125 GHz is 289.86 MiB/s.
Every core processes different data, and each core is responsible for a separate part of the result. The result buffer is located in a global memory segment to which all cores have access. Since the cores write to disjoint parts of this buffer, no resource conflicts can occur and the cores do not have to communicate with each other.
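The following C sketch illustrates why disjoint result regions need no synchronization. It rests on assumptions that the text does not state: the result is taken to be a per-pixel map of sensor size, the regions are formed over image rows, and all names are hypothetical. The cores are simulated by plain function calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_CORES   3
#define RESULT_ROWS 240  /* assumption: result map has sensor resolution */
#define RESULT_COLS 304

/* Stands in for the result buffer in the shared global memory segment. */
static uint8_t result[RESULT_ROWS][RESULT_COLS];

/* Half-open row range [first_row, end_row) owned by one core. */
typedef struct {
    size_t first_row;
    size_t end_row;
} row_range_t;

static row_range_t rows_for_core(unsigned core)
{
    row_range_t r;
    assert(core < NUM_CORES);
    r.first_row = (size_t)core * RESULT_ROWS / NUM_CORES;
    r.end_row   = (size_t)(core + 1) * RESULT_ROWS / NUM_CORES;
    return r;
}

/* Each core writes only inside its own row range, so the writes of
 * different cores never overlap and no locking is required. */
static void write_core_result(unsigned core, uint8_t value)
{
    row_range_t r = rows_for_core(core);
    for (size_t row = r.first_row; row < r.end_row; ++row)
        for (size_t col = 0; col < RESULT_COLS; ++col)
            result[row][col] = value;
}
```

The ranges of neighbouring cores meet exactly at their boundaries, so every row is owned by exactly one core.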
VIII. RESULTS AND CONCLUSION
This paper presented a Serial RapidIO™ data acquisition interface for Silicon Retina based computer vision applications optimized for high performance. The performance is achieved by exploiting the processor peripherals at a minimum processor workload. The embedded system used was a distributed digital signal processing system using both a single-core and a multi-core DSP. Also, the pre-processing required for parallel data processing on a multi-core system using a data parallelism concept was discussed. Other kinds of schedulers to balance the work were not implemented. This approach gives a good balance between the cores with very little overhead.
We showed that a total sustained data-rate of 10 Meps for data acquisition, pre-processing and forwarding can be achieved. For stereo, this results in 5 Meps per channel. This corresponds to a performance increase by a factor of 1.67 compared to an available USB 2.0 interface. Further improvement can be achieved by using synchronous FIFOs and by applying the burst mode also for lower data-rates, using variable FIFO interrupts with configurable data amounts.
