Abstract: X-ray computer tomography is a powerful method for nondestructive investigations in many fields. Three-dimensional images of internal structure are reconstructed from a sequence of two-dimensional projections. The polychromatic high-density photon flux of modern synchrotron light sources offers hard X-ray imaging with a spatio-temporal resolution up to the microsecond and micrometer range. Existing indirect X-ray image detection systems can be adapted for fast image acquisition by high-speed visible-light cameras. In this paper, we present a platform for custom high-speed CMOS cameras with embedded field-programmable gate array (FPGA) processing. This modular system is characterized by a high-throughput PCI Express (PCIe) interface and efficient communication blocks. It has been used to develop a novel architecture for a self-event trigger that increases the effective image frame rate and reduces the amount of received data. Thanks to a low-noise design, high frame rates in the kilohertz range, and high-throughput data transfer, this camera is well suited for ultrafast synchrotron-based X-ray radiography and tomography. The camera setup is completed by high-throughput Linux drivers and a seamless integration in our GPU computing framework.
evolution, allows scientists to better understand functional units of devices and organisms, and helps to optimize technical processes. The whole setup consists of three sections: the beamline providing X-rays, the UFO experimental station, and a high-performance data storage system. The UFO experimental station, shown in Fig. 1, has three main functional units: 1) sample setup: a dedicated setup for fast sample manipulation (tomographic and laminographic rotations and scanning translations) and adequate sample loading; 2) smart high-speed camera: a fully programmable, high-performance smart camera with an internal feedback loop for self-trigger applications; 3) GPU server for online data processing and evaluation: graphics processors for accelerating the 3D image reconstruction. The fast processing speed enables an online feedback loop for sample manipulations and automatic adjustment of the experimental conditions.
The setup is based on an indirect detection method with a crystal converter screen (scintillator) optically coupled to a digital CMOS camera. In this approach, the pixel detector itself is not exposed to radiation, and, consequently, a radiation-tolerant silicon pixel detector is not required. This opens up several possibilities in the field of visible-light CCD or CMOS cameras. Unfortunately, the real-time requirements exclude most of the commercially available high-speed cameras as they do not support streaming at full camera speed. Thus, the Institute for Data Processing and Electronics (IPE) at KIT decided to develop a versatile high-throughput camera platform. An open architecture will allow the design of fully programmable cameras for automatic adaptation to experiment conditions. This new generation of experimental station generates an enormous amount of data. In order to manage the huge data stream, analysis and careful reduction help to identify the desired information. For this, an intelligent image-based self-trigger for fast spontaneous processes has been developed and implemented in the high-throughput camera platform. In conventional setups, unpredictable physical events could be lost or only partially acquired due to limited observation time given by the camera memory and/or readout bandwidth limitations. To circumvent this, an adaptive frame rate and adaptive selection of the region-of-interest (ROI) is used in order to record the temporal evolution of physical events at the fastest possible speed.
An example of fast spontaneous events is shown in Fig. 2, with two bubbles merging in gelatinous agar. The gelatinous bubbles are unchanged for the first 48.9 ms of the data acquisition, corresponding to 489 unchanged frames at 10 kframes/s. After this time, the merging takes place in less than a millisecond.

0018-9499 © 2013 British Crown Copyright
The desired logic should be able to detect fast physical events and consequently, reject redundant data frames. A multievent detection located in different frame regions is intended. Additional benefits would be simplification and acceleration of data analysis, optimization of the effective bandwidth, and a significant increase of the frame rate.
The limited density of the photon flux in the synchrotron light source application is the fundamental limit on image sensor performance during the high frame rate acquisition (short integration time). The temporal noise and the fixed pattern noise (FPN) components are dominant in these conditions. For these reasons, a low-noise camera with embedded FPN correction will be developed in order to keep a reasonable signal-to-noise ratio (SNR) in these low-illumination conditions.
The UFO project focuses on suitable data processing in hardware and software. Further image processing and image reconstruction are performed on tailored GPU servers. High-end graphic adapters have been proven to be well suited for image reconstruction and monitoring purposes [1]. Currently, processing of up to 1 GB of data per second is possible. Within the UFO project, a parallel computing framework has been proposed to support development of algorithms for parallel computing architectures using OpenCL [2]. This paper presents the architecture of the new camera platform. It describes core elements of the high-throughput FPGA design and Linux driver considerations and proposes a self-event trigger to increase the frame rate of the image sensor. Finally, the first prototype camera based on the framework is presented.
II. HIGH-SPEED CAMERA ARCHITECTURE
Advances in CMOS image-sensor technology give rise to a new generation of high-speed cameras that capture events which conventional CCD cameras could not previously acquire [3]. Present FPGA devices offer a large number of high-speed I/O interconnects combined with a large number of native blocks like DSP, FIFO, RAM, PLL, and others. For these reasons, FPGAs are usually employed for control, readout, and real-time data processing of high-speed CMOS image sensors [4], [5]. We have developed a modular FPGA-based high-throughput platform that is intended for scientific applications. The first version of the proposed camera platform has been realized based on commercially available components, where possible, in order to speed up the development and to focus on the key components.
A functional diagram of the camera prototype is shown in Fig. 3 . The readout architecture can be divided into three main parts: daughter card containing the CMOS-image sensor, mother-board based on the Virtex6 FPGA from Xilinx, and the PC used for camera control and the data acquisition system (DAQ). Daughter and mother boards are connected by a high-speed, high-density FMC-Samtec connector. The first realized camera is shown in Fig. 4 .
For a fixed integration time, the CMOS-image sensor receives a request for a new frame. After the integration time, the image is stored in the pixel-matrix (global shutter) and read out sequentially, row by row. The pixel values are passed to a column ADC cell and digitized. These digital signals are then transferred by multiple parallel low voltage differential signal (LVDS) channels. Each LVDS channel is responsible for a group of adjacent columns of the pixel matrix. Control registers are foreseen to control all kinds of image sensor parameters.
In order to read out the camera system as fast as possible, a standard PCI Express (PCIe) cable connection with four lanes is used to transfer the data from the camera directly to the main computer memory [6]. There are passive copper cables and active optical links available for this interface. The four-lane PCIe generation 2 adapter has a theoretical bandwidth of 16 Gb/s. In order to benefit fully from the high bandwidth of the PCIe link, we use direct memory access (DMA) to transfer the data from the camera to the main computer memory, and vice versa. By using PC memory, the camera benefits from the evolution of memory size and clock speed.
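The 16 Gb/s figure can be checked with a short calculation. The sketch below is illustrative only: it assumes the usual PCIe generation 2 parameters (5 GT/s per lane, 8b/10b line encoding) and ignores packet and protocol overhead. The sensor figures (1088 x 2048 pixels, 10 b per pixel, 340 frames/s) are the prototype values quoted elsewhere in this paper, and the function names are hypothetical.

```python
def pcie_link_gbps(lanes: int, gt_per_s: float, encoding_efficiency: float) -> float:
    """Theoretical link bandwidth in Gb/s after line encoding.

    Packet headers and flow-control overhead are NOT included (assumption).
    """
    return lanes * gt_per_s * encoding_efficiency


def sensor_gbps(rows: int, cols: int, bits_per_pixel: int, frames_per_s: float) -> float:
    """Raw pixel data rate of the image sensor in Gb/s."""
    return rows * cols * bits_per_pixel * frames_per_s / 1e9


# Four lanes, PCIe generation 2: 5 GT/s per lane with 8b/10b encoding.
link = pcie_link_gbps(lanes=4, gt_per_s=5.0, encoding_efficiency=8 / 10)

# Prototype sensor at full resolution, 10 b per pixel, 340 frames/s.
sensor = sensor_gbps(rows=1088, cols=2048, bits_per_pixel=10, frames_per_s=340)

print(f"PCIe link:   {link:.2f} Gb/s")
print(f"Sensor data: {sensor:.2f} Gb/s ({sensor / 8:.2f} GB/s)")
```

Under these assumptions, the raw sensor stream uses roughly half of the theoretical link bandwidth, which leaves headroom for protocol overhead.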
Addressable 32-bit user bank registers are implemented in the dedicated base address register (BAR) space. Bank registers are used to read/write the status/configurations of DMA engines, CMOS sensor, and FPGA logic. Further bank locations can be used for additional user applications. The double data rate (DDR) memory device is used for both temporary frame data storage and for self-event trigger elaboration.
The readout chain, shown in Fig. 3, including the driver layer, has been tested and characterized. For test purposes, several million data packets, generated by the PC, are sent to the FPGA board. The data received by the DMA are stored in the DDR memory. At the same time, a readout request from the PC is received by the board, and the data previously stored in DDR are sent back to the PC by the DMA engine. The data are compared in the PC for consistency and bit error rate estimation. Fig. 5 shows the performance of a simultaneous data transfer in both directions, PC to DDR and DDR to PC, as a function of

The three logic blocks shown in Fig. 3 are the backbone of the camera platform. They have been developed and optimized at KIT and are seamlessly integrated in the parallel GPU computing framework of the UFO project. In the following sections, the architecture and resulting performance of these three logic blocks are presented.
A. PCIe-DMA Architecture
The first module handles the communication via the PCIe interface. The term "Bus Master," used in the context of PCIe, indicates the ability of a PCIe port to initiate PCIe transactions, typically memory read and write transactions. The most common application for "Bus Mastering Endpoints" is for DMA. DMA is a technique for efficient data transfer to and from the host CPU system memory. This implementation has many advantages over standard programmed input/output (PIO) data transfers. In addition, the DMA engine offloads the CPU from directly transferring the data, resulting in better overall system performance through lower CPU utilization. The PCIe-DMA architecture developed is shown in Fig. 6 . Two IP cores are employed in combination with the logic blocks developed at KIT. An integrated Endpoint Xilinx-IP core for PCIe [7] and two Northwest Logic DMA [8] engines are used to move the data from the FPGA board to the PC central memory, and vice versa. A custom PCIe-DMA interface logic has been developed to adapt the Xilinx PCIe interface to the DMA engines.
The signal handshaking with the internal FPGA logic is provided by two I/O interface logic blocks. Each logic block includes a data FIFO and a finite state machine (FSM) to control the coherence between the received and transmitted data packets. The FSMs generate the start and end of packet signals for DMA and manage the busy and the back-pressure signals. The FIFOs are used as temporary data storage and for frequency domain change. In this way, the user-defined clocks, and in Fig. 6, can be used to send and receive data. The and signals will be synchronized with the user-defined domain. The signal is used to inform the logic when the valid data are present on the data out bus. A busy signal can be used in order to temporarily interrupt the data flow received from the FPGA internal logic. The signal allows us to write a data word at the bus with a user-defined clock frequency . The signal informs the logic that the driver-PC is in the busy status.
A software driver layer, fully compatible with this architecture, has been developed for 64-bit Linux operating system and will be presented in Section IV.
B. Fast SerDes Input Stage
The SerDes module realizes the communication between image sensor and FPGA. SerDes (serializers/deserializers) are devices that take wide bit-width, single-ended signal buses and compress them to a few, typically one, differential signal that works at a much higher frequency rate than a wide single-ended data bus. SerDes enable point-to-point movement of large amounts of data.
The clock division, parallel data width, and the training pattern are FPGA reconfigurable according to the CMOS-image sensor specifications. The image sensor employed in the camera prototype uses 16 parallel high-frequency LVDS serial lines to move data from the pixel-matrix to the receiver device. Each line works at a double data rate with 480 Mb/s. Moreover, 10-b or 12-b ADC pixel data can be selected by the user. This has an impact on the SerDes logic, which must switch the serial input to 10-b or 12-b parallel output. To overcome the SerDes logic limitations in the Virtex6 FPGA [9], a new SerDes input stage module has been developed for the high-throughput platform. The basic architecture of a single SerDes channel is shown in Fig. 7. To cover all image sensor outputs, 16 parallel SerDes input stages, one for each output, are employed. A common FPGA regional clock [10] for all 16 input stages has been defined as a division of the LVDS data clock according to the parallel data width. An individual programmable absolute delay primitive block, IODELAY [9], can be used for a precise data-to-clock time synchronization in 80-ps steps. The LVDS input data line is converted from double data rate to two single-ended data lines by a double-data-rate (IDDR) register [9]. The lines are then combined for parallel data output by the custom SerDes logic. The dedicated word alignment FSM checks the correct position of the MSB in the parallel data output by comparing it with the training pattern. A bit-slip signal is generated by the alignment FSM and received by the custom SerDes in order to shift a misplaced MSB to the correct position. A data lock signal is set to inform the rest of the logic that the parallel data are aligned correctly.
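The word alignment step can be illustrated with a small software model. This is only a behavioral sketch of the bit-slip idea, not the FPGA implementation: the training word, the number of words checked before lock, and all function names are assumptions chosen for illustration.

```python
def to_bits(word: int, width: int) -> list:
    """Serialize one word MSB first."""
    return [(word >> i) & 1 for i in reversed(range(width))]


def deserialize(bits: list, width: int, slip: int) -> list:
    """Group a serial stream into width-bit words after skipping `slip` bits."""
    words = []
    for start in range(slip, len(bits) - width + 1, width):
        word = 0
        for b in bits[start:start + width]:
            word = (word << 1) | b  # MSB arrives first
        words.append(word)
    return words


def align(bits: list, width: int, training_word: int, check_words: int = 4):
    """Model of the alignment FSM: slip one bit at a time until the first
    `check_words` parallel words all equal the training word.
    Returns the slip count at lock, or None if no alignment is found."""
    for slip in range(width):
        words = deserialize(bits, width, slip)[:check_words]
        if len(words) == check_words and all(w == training_word for w in words):
            return slip  # data lock would be asserted here
    return None


# Hypothetical 10-b training word; it differs from all of its own rotations,
# so exactly one slip position matches.
TRAINING = 0b1101000101
# Three stray bits before the first word boundary emulate a misaligned link.
stream = [0, 1, 0] + to_bits(TRAINING, 10) * 8

lock_slip = align(stream, width=10, training_word=TRAINING)
print("bit-slips needed for lock:", lock_slip)
```

The same model covers the 12-b mode by passing `width=12` together with a 12-b training word.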
C. DDR Memory Interface Logic
The last module is responsible for the on-board DDR memory management. The memory interface solutions developed at KIT combine a Xilinx physical layer (PHY) for DDR3 devices with additional block logic developed to extend the memory interface Xilinx IP core [11] features and overcome limitations present in the DDR3 IP core supplied by Xilinx. The signal allows to write a data word with a user-defined data width (N) present at the bus. A WR-FIFO is used as a temporary data storage and for frequency-domain change between a user clock frequency, in Fig. 8 , to the internal logic clock domain. The Arbiter FSM continuously checks if the WR-FIFO is empty. If not, the enable signal for the write operation is propagated to the WR-DDR FSM. The WR-DDR FSM receives the write command and generates the address and all control signals for the PHY logic. The PHY logic receives the command and writes the new data on the address position specified by the WR-DDR FSM.
The Arbiter FSM can receive a read request for the DDR3 device. In this case, the Arbiter checks if the WR-DDR FSM is in "idle" state. If yes, a read command is propagated to the RD-DDR FSM. The RD-DDR FSM receives a read command and generates the address and all control signals for the PHY logic. The output data from the PHY are stored in the RD-FIFO. According to the user-defined data width (M), a data word is present at the port. A signal is provided to inform the user logic that the data are ready. All signals are synchronized to the user-defined . As shown in Fig. 8 , the internal read and write paths work with a data width of 256 b at 200-MHz internal clock that corresponds to a bandwidth of 51 Gb/s in a half-duplex mode. This bandwidth limitation is imposed by the PHY logic in the present FPGA speed-grade.
A quasi full-duplex data flow is achieved with the proposed architecture. Balance optimization between the amount of the data in both FIFOs is managed by the Arbiter FSM. Thanks to this balance, intelligent burst write and read commands can be propagated to the PHY in an alternating way. This is equivalent to the full-duplex DDR3 interface with a mean bandwidth of 25 Gb/s in each direction.
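The bandwidth figures quoted above follow from the internal bus width and clock. The sketch below reproduces the arithmetic under a simplifying assumption: read and write bursts alternate evenly, and turnaround and refresh overhead are ignored, so the results are upper bounds. The LVDS input rate (16 lines at 480 Mb/s) is taken from the SerDes description; the function name is hypothetical.

```python
def bus_bandwidth_gbps(width_bits: int, clock_hz: float) -> float:
    """Peak bandwidth of a parallel bus in Gb/s (one word per clock cycle)."""
    return width_bits * clock_hz / 1e9


# Internal read/write path: 256 b at 200 MHz, shared by both directions.
half_duplex = bus_bandwidth_gbps(256, 200e6)

# Evenly alternating read and write bursts give each direction half of the bus
# (assumption: no turnaround or refresh overhead).
per_direction = half_duplex / 2

# Sensor input for comparison: 16 LVDS lines at 480 Mb/s each.
lvds_input = 16 * 480e6 / 1e9

print(f"half-duplex:   {half_duplex:.1f} Gb/s")
print(f"per direction: {per_direction:.1f} Gb/s")
print(f"LVDS input:    {lvds_input:.2f} Gb/s")
```

Under these assumptions, each direction of the memory interface offers more than three times the aggregate LVDS input rate of the prototype sensor.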
III. SELF-EVENT TRIGGER (FAST FRAME REJECT) AND HIGH FRAME RATE READOUT STRATEGIES
Recent CMOS sensors support direct pixel access. The different image sensors realize a variety of readout schemes e.g., the readout of only selected rows or pixel blocks of the pixel-matrix. This feature makes intelligent readout strategies possible to increase the frame rate and/or reduce the amount of recorded data. The proposed fast reject logic is based on row-based interleaved readout. In interleaved mode, only selected rows are read, while a programmable number of rows is skipped. This strategy allows a drastic reduction of the readout time without losing the full field of view.
The self-event trigger and the ROI readout architecture are shown in Fig. 9. The architecture contains an interleaving logic FSM unit responsible for the interleaving readout, an event-trigger FSM, row comparison logic used to compare and detect fast physical events, a CMOS control unit for reprogramming the ROI readout according to the trigger information, and a DDR3 memory device used for temporary data storage.
The self-event trigger sequence is as follows. A complete frame is stored as a reference in a dedicated DDR3 memory segment. Then, the event-trigger FSM enables the interleaved frame readout. In each interleaving frame, the rows read from the CMOS sensor are shifted down by one in order to have a roll-over readout mechanism and cover the full image after interleaving frames, as shown in Fig. 9 . Corresponding rows in the current frame and the reference frame are compared. In case of a significant difference, the row is marked as a candidate token row. The event-trigger FSM receives the token rows and uses this information for a dynamical selection of the ROI by the CMOS control unit. The CMOS control unit calculates the start and end rows for each readout window and performs a reprogramming of the CMOS sensor to acquire all rows inside the windows for the next frame readout request.
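The roll-over readout can be sketched in a few lines. The model below assumes that a gap of g skipped rows gives a row step of g + 1 and that the start row advances by one per interleaving frame, so the full image is covered after g + 1 frames. This cycle length is an assumption of the sketch, and the function name is hypothetical.

```python
def interleaved_rows(frame_index: int, total_rows: int, skipped_rows: int) -> list:
    """Rows read out in one interleaving frame.

    Every (skipped_rows + 1)-th row is read; the start row shifts down by one
    per frame and rolls over after skipped_rows + 1 frames (assumption).
    """
    step = skipped_rows + 1
    offset = frame_index % step
    return list(range(offset, total_rows, step))


# Prototype sensor height with an interleaving gap of 31 rows.
TOTAL_ROWS, GAP = 1088, 31

first = interleaved_rows(0, TOTAL_ROWS, GAP)
second = interleaved_rows(1, TOTAL_ROWS, GAP)
print(len(first), "rows per interleaving frame")
print("frame 0 starts with", first[:3], "and frame 1 with", second[:3])

# After GAP + 1 frames every row has been visited exactly once.
covered = sorted(r for k in range(GAP + 1) for r in interleaved_rows(k, TOTAL_ROWS, GAP))
print("full image covered:", covered == list(range(TOTAL_ROWS)))
```

With these settings, each interleaving frame reads only 34 of the 1088 rows, which is where the reduction in readout time comes from.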
In order to adjust the algorithm to application-specific conditions, several programmable parameters are foreseen, as follows: 1) pixel threshold: used to reduce the noise in the pixel count value; 2) row threshold: indicates how many positively compared pixels are needed to mark the row as token; 3) global threshold: how many token rows are needed in order to generate the trigger signal. The thresholds and the interleaving gap between rows are programmable by a dedicated register implemented in the bank register.

In the proposed readout scheme, a frame consists of the interleaved rows and the selected ROI. The resulting frame rate depends on the reaction time to detect the physical event and the ROI readout time. The reaction time depends on the number of skipped rows and the size of the physical event. The readout time is given by the number of rows in the ROI. As an example, a CMOS 2.2-Mpixel image sensor organized as 1088 rows and 2048 columns with a frame rate of 340 frames/s at the full resolution is used. The measurement of the frame rate achieved by the camera with self-event trigger logic for two different interleaving gap settings is shown in Fig. 10. If the physical event area is larger than the interleaving gap between the rows, then the events are detected in each interleaving frame. In this condition, the detection time benefits from a larger number of skipped rows, and the readout time scales down with the number of rows in the ROI. If the physical event area is smaller than the interleaving gap between the rows, then more than one interleaving frame is needed to detect the event, and thus the total frame rate is drastically reduced. This drop of the frame rate can be observed in Fig. 10 for the second case with a gap of 31 rows (square markers) and an ROI size of less than 30.
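The role of the three thresholds can be illustrated with a software model of the row comparison. This is a minimal sketch, not the FPGA logic: the comparison operators (strictly greater for pixels, greater-or-equal for rows and for the global count) and the way ROI windows are derived from token rows (a fixed margin around each token, with overlapping windows merged) are assumptions, and all names are hypothetical.

```python
def find_token_rows(current, reference, rows, pixel_threshold, row_threshold):
    """Return the rows whose count of changed pixels reaches row_threshold."""
    tokens = []
    for r in rows:
        changed = sum(
            1 for cur, ref in zip(current[r], reference[r])
            if abs(cur - ref) > pixel_threshold  # suppress pixel noise
        )
        if changed >= row_threshold:
            tokens.append(r)
    return tokens


def roi_windows(tokens, margin, total_rows):
    """Widen each token row by `margin` rows and merge overlapping windows."""
    windows = []
    for r in sorted(tokens):
        start, end = max(0, r - margin), min(total_rows - 1, r + margin)
        if windows and start <= windows[-1][1] + 1:
            windows[-1] = (windows[-1][0], max(windows[-1][1], end))
        else:
            windows.append((start, end))
    return windows


def self_trigger(current, reference, rows, pixel_threshold,
                 row_threshold, global_threshold, margin):
    """Return (triggered, windows) for one interleaving frame."""
    tokens = find_token_rows(current, reference, rows, pixel_threshold, row_threshold)
    if len(tokens) < global_threshold:
        return False, []
    return True, roi_windows(tokens, margin, len(reference))


# Toy 16 x 8 frame: a bright object appears in rows 4..6, columns 2..5.
reference = [[100] * 8 for _ in range(16)]
current = [row[:] for row in reference]
for r in range(4, 7):
    for c in range(2, 6):
        current[r][c] = 160

interleaved = list(range(0, 16, 4))  # rows read with a gap of three rows
result = self_trigger(current, reference, interleaved,
                      pixel_threshold=20, row_threshold=3,
                      global_threshold=1, margin=3)
print(result)
```

In this toy case only row 4 of the interleaved rows crosses the object, which is enough to raise the trigger and open a window around it for the next readout.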
From Fig. 10 , it is evident that the number of skipped rows must be selected according to the experimental conditions. The intelligent ROI readout combined with the self-event trigger (Fast Reject) allows one to obtain a high spatial and temporal resolution and can be applied to fast X-ray micro-imaging radiography. The additional benefit of this method is maintaining the full field of view of the scene.
IV. ADVANCED LINUX PCI DRIVER FOR PCI-BASED DAQ ELECTRONICS
The hardware platform is supported by a high-performance and, at the same time, flexible Linux driver and SDK library. The flexibility of the electronics that is gained by the programmable logic devices needs to be met by a configurable driver on the software side. The main goals of the driver are to fully support the high throughput of the DAQ platform and, at the same time, to provide the flexibility to customize the driver easily for different applications. These goals are met by a pluggable DMA engine and a configurable register model in the SDK library. The modular driver and library form a unique DAQ platform that can be easily adapted to very different applications. The universal PCI driver and library especially support the hardware development and commissioning of PCI/PCIe DAQ electronics. Fig. 11 illustrates the architecture of the Linux driver and SDK library. The main components are as follows.
• The PCI data access layer provides raw access to PCI standard I/O memory.
• The register access layer defines a register model. The registers are defined by an XML file. Additional registers may be added at run time (also used by DMA engine). Integer registers with a size of up to 64 bits are supported. The register size and alignment may be specified with bit precision. The endianness conversion is handled automatically if the register endianness is specified. Custom functions to read/write registers may be provided by plug-ins.
• The DMA engine layer is a pluggable interface for DMA engines. The basic interface defines five methods: start/stop DMA channel, read/write data from/to the DMA channel, and stream data from DMA channel to a supplied call-back function. The engines must only implement interaction with FPGA-specific DMA registers. The management of the DMA buffer is handled centrally and may be reused. Currently, the Northwest Logic DMA IP protocol is supported.
• The event engine layer defines an event-based model to integrate device-specific code (by plug-ins). Each device can define multiple events and, for each event, several data types. The events will be triggered in hardware or requested by software. The client application may subscribe to get event notifications. Upon event notification, the application can request the desired type of data.

The universal driver is used for the presented high-throughput camera. The camera-specific functions are realized in the event engine. The software library reflects the versatile character of the hardware design. Other applications are easily included by replacing the register model in the register access layer. Even the DMA engine can be replaced without changing the software architecture.
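The register model described above, with bit-precise size and alignment and automatic endianness handling, can be sketched as follows. This is not the API of the actual SDK library: the `Register` fields, the function names, and the example register are all hypothetical, and the sketch is limited to fields inside one 32-bit bank word, whereas the library supports registers of up to 64 bits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Register:
    """Hypothetical register description (the real model is defined in XML)."""
    name: str
    address: int        # byte offset of the 32-bit bank word within the BAR
    bit_offset: int     # position of the least significant bit in that word
    bit_width: int      # field size in bits
    big_endian: bool = False


def read_register(bar: bytes, reg: Register) -> int:
    """Read a bit field, converting from the declared endianness."""
    order = "big" if reg.big_endian else "little"
    word = int.from_bytes(bar[reg.address:reg.address + 4], order)
    return (word >> reg.bit_offset) & ((1 << reg.bit_width) - 1)


def write_register(bar: bytearray, reg: Register, value: int) -> None:
    """Read-modify-write of a bit field, leaving neighboring bits untouched."""
    mask = (1 << reg.bit_width) - 1
    if not 0 <= value <= mask:
        raise ValueError(f"{value} does not fit in {reg.bit_width} bits")
    order = "big" if reg.big_endian else "little"
    word = int.from_bytes(bar[reg.address:reg.address + 4], order)
    word = (word & ~(mask << reg.bit_offset)) | (value << reg.bit_offset)
    bar[reg.address:reg.address + 4] = word.to_bytes(4, order)


# A bytearray stands in for the memory-mapped BAR space in this sketch.
bar = bytearray(8)
pixel_threshold = Register("pixel_threshold", address=4, bit_offset=8, bit_width=12)

write_register(bar, pixel_threshold, 0xABC)
print(hex(read_register(bar, pixel_threshold)), bar[4:8].hex())
```

Keeping the register description separate from the access functions mirrors the idea that a different application only needs a different register model, not different access code.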
V. APPLICATION OF THE CAMERA PLATFORM
The presented high-throughput platform has been used to develop a high-speed CMOS camera for synchrotron applications. The first camera prototype is based on a commercial CMOS 2.2-Mpixel image sensor. The camera achieves a continuous frame rate of 340 frames/s at the full resolution. With the online self-event trigger presented in Section III, several thousand frames per second are possible. Fig. 12 shows a small object falling into a glass of water. In the sequence, the highlighted part shows the selected ROI. The darker part is taken from the reference frame. The sequence demonstrates that the algorithm is able to track the movement of the object. In addition, the turbulence in the water is completely covered. In the first part of the sequence, the object is tracked with a frame rate of 1 kframes/s. In the second part, the large turbulent area is tracked with a frame rate of 500 frames/s, which is still more than the native frame rate of the sensor.
The limited density of the photon flux in the synchrotron light source application and the short exposure times require a low noise level of the camera to achieve high image quality. The noise can be separated into a time-dependent noise and a fixed pattern noise (FPN) that does not change with time. The temporal noise includes the pixel dark current shot noise, kT/C contributions, column amplifier noise, programmable gain amplifier noise, and ADC noise. Several precautions have been taken in order to minimize the temporal noise contributions. Low-noise voltage and current references for the CMOS sensor are implemented in the PCB design. The CMOS-image sensor is cooled by a Peltier cell cooling system controlled by the FPGA. Finally, the CMOS sensor pixel parameters have been tuned for low noise. The camera has been characterized according to the European Machine Vision Association (EMVA) 1288 standard for image sensors and cameras. The results for the two modes of operation with 10 b and 12 b per pixel are shown in Table I.
The FPN is a systematic noise component coming from two different sources. The first source is caused by slight variations of individual pixels due to the manufacturing process of the CMOS technology. The second source of FPN is the performance variations of the amplifiers shared by each column of the pixel array. The information within the pixel array is read out column by column. When one amplifier behaves slightly differently, the entire column is affected. This results in vertical lines in the image. The background image recorded with a short integration time, shown in Fig. 13, clearly shows these vertical structures.
A drastic reduction of this systematic pixel-to-pixel nonuniformity contribution will be achieved by an online FPGA FPN correction. The architecture shown in Fig. 9 is able to cover this kind of operation in real time. The procedure is as follows: Several dark frames are acquired from the camera and sent to the DAQ computer for a pixel-by-pixel evaluation of the correction factors. For each pixel, the correction factor is calculated as the mean value of the corresponding pixels in the dark frames. In this way, the temporal pixel noise is mitigated, and the systematic pixel difference is emphasized. The resulting frame is stored as the FPN correction values in the DDR3 memory of the camera. When the correction is enabled, the corresponding FPN correction value is subtracted from the incoming pixel data. The results of the mathematical simulation show that the first-order FPN correction is possible. The FPGA development and verification is now ongoing. The main advantage of this first-order correction lies in the fact that it is well suited for high-speed applications.
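The first-order correction described above amounts to a per-pixel mean over dark frames followed by a subtraction. The sketch below models this on the host side in plain Python; rounding the mean to an integer and clamping the corrected value to the valid pixel range are assumptions of this sketch, and the function names are hypothetical.

```python
def fpn_reference(dark_frames):
    """Per-pixel mean of several dark frames (temporal noise averages out)."""
    n = len(dark_frames)
    rows, cols = len(dark_frames[0]), len(dark_frames[0][0])
    return [
        [round(sum(frame[r][c] for frame in dark_frames) / n) for c in range(cols)]
        for r in range(rows)
    ]


def fpn_correct(frame, reference, max_value):
    """Subtract the stored FPN value from each incoming pixel.

    The result is clamped to [0, max_value] (assumption of this sketch).
    """
    return [
        [min(max_value, max(0, pixel - offset)) for pixel, offset in zip(frow, rrow)]
        for frow, rrow in zip(frame, reference)
    ]


# Two toy dark frames: the middle column carries a larger fixed offset,
# as a column amplifier mismatch would produce.
dark_frames = [
    [[10, 20, 10], [10, 22, 10]],
    [[12, 20, 10], [10, 18, 10]],
]
reference = fpn_reference(dark_frames)

# Incoming frame: a flat signal of about 100 counts on top of the pattern.
frame = [[111, 120, 110], [110, 119, 110]]
corrected = fpn_correct(frame, reference, max_value=1023)  # 10-b mode

print(reference)
print(corrected)
```

After subtraction, the column structure of the toy frame disappears and only the residual temporal fluctuation remains.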
The camera prototype was integrated into the demonstrator setup at the TopoTomo beam line [12] and was successfully tested at ANKA with a moderate X-ray flux density. Several thousand radiographs have been acquired at full speed (340 frames/s) in streaming mode; one X-ray radiograph is shown in Fig. 14. The FPN contribution has been drastically reduced by the procedure described earlier, and the dark noise has also been reduced to an appropriate SNR level. The spatial resolution has been measured to be a few micrometers.
VI. CONCLUSION AND FUTURE WORK
In this paper, we presented a high-throughput imaging platform for fully programmable scientific cameras. The first camera demonstrator achieves the maximum frame rate of the image sensor, with 340 frames/s and 2.2 Mpixel at 10 b and produces a data rate of up to 1 GB/s. Several thousand frames per second are achievable using the proposed intelligent self-event trigger (fast reject) logic. An available Linux driver links the camera to a GPU server for further image processing.
Frame rate and resolution are currently only limited by the image sensor properties of the prototype. The next version of a visible light camera is currently under development and will employ a faster CMOS sensor with a readout bandwidth of 50 Gb/s and a native frame rate of 5000 frames/s. A novel in-house-made readout board will be developed to scale up the performance of the universal camera platform for these increased requirements. The main concept of the high-throughput platform remains unchanged, and the self-event trigger strategy is still applicable to reach even higher frame rates in the range of several tens of kiloframes per second.
