Abstract-
Framework for High-Performance Video
The intuitive solution to the problem of designing a scalable video acquisition system is to assemble a powerful computer with several frame grabber cards. Unfortunately, such an approach has several considerable drawbacks. The most important is the problem of synchronization and timestamping using off-the-shelf frame grabbers. These tasks would have to be done in software, reducing the resulting precision by an order of magnitude. Such a method would also not conform to the high reliability, availability, maintainability, and inspectability policy of large-scale experimental machines. A suitable solution shall rather be based on a hardware platform dedicated for data acquisition systems, like the MTCA.4 platform [2] .
II. MTCA.4 ARCHITECTURE
The Micro Telecommunications Computing Architecture (MTCA) standard defines compact shelves that host Advanced Mezzanine Card (AMC) modules. The standard covers the mechanical, electrical, and thermal requirements for the chassis. It also defines the MTCA Carrier Hub (MCH) module, which is a pluggable device responsible for shelf management and for providing switches for high-speed serial links. Nearly all of the MTCA intelligence is concentrated in the MCH, thus the backplane can be considered a passive component. Apart from the MCH and backplane, the shelf additionally contains a power supply and a cooling unit (which is usually a tray of fans).
The base MTCA.0 specification enables building high-throughput telecommunication solutions. However, it is not well suited for data acquisition systems. Although the MCH usually allows advanced routing of clock signals, the backplane infrastructure for distributing synchronization and trigger signals is heavily limited. Furthermore, cables cannot be connected from the rear side of the shelf, even though rear cabling is often an official recommendation. All external signals have to enter through the front panel, reducing the accessibility of the module and its neighbors [3].
All the aforementioned issues were addressed in the MTCA.4 subsidiary specification "MicroTCA Enhancements for Rear I/O and Precision Timing." A typical MTCA.4-compliant crate is presented in Fig. 1 .
0018-9499 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
The MTCA.4 specification defines a backplane M-LVDS bus spanning all AMC slots. It allows easy distribution of system-wide signals like timing events or interlocks. Moreover, the standard introduces the concept of a rear transition module (RTM): a second printed circuit board (PCB) closely cooperating with the AMC module. Such an RTM module not only effectively doubles the available PCB space but also allows connecting cables on the rear side of the shelf [4]. An RTM module can only be connected to an AMC of double width. The RTM concept allows designing complex systems as pairs of modules cooperating in an MTCA.4 shelf. Each MTCA.4 slot can receive up to 80 W of electrical power and dissipate an analogous amount in heat. In most cases, each AMC module connects to a Peripheral Component Interconnect Express (PCIe) switch in the MCH using up to four data lanes (e.g., with PCIe x4 gen. 2 offering throughput up to 16 Gb/s). The system can use an in-crate CPU module or interface an external computer with a link having a typical throughput of 16-128 Gb/s.
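The PCIe throughput quoted above can be reproduced from the per-lane signaling rate and the line-coding efficiency of each PCIe generation. The following is a minimal sketch; the function and table names are illustrative, and the result ignores packet-level protocol overhead, so it is an upper bound on the usable bandwidth.

```python
# Per-lane raw signaling rate (GT/s) and line-coding efficiency per PCIe generation.
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),     # 8b/10b encoding
    2: (5.0, 8 / 10),     # 8b/10b encoding
    3: (8.0, 128 / 130),  # 128b/130b encoding
}


def pcie_throughput_gbps(generation: int, lanes: int) -> float:
    """Return the link throughput in Gb/s after line coding (no TLP overhead)."""
    rate_gt_per_s, efficiency = PCIE_GENERATIONS[generation]
    return rate_gt_per_s * efficiency * lanes


if __name__ == "__main__":
    for generation, lanes in [(2, 4), (2, 8), (3, 16)]:
        gbps = pcie_throughput_gbps(generation, lanes)
        print(f"PCIe x{lanes} gen. {generation}: {gbps:.1f} Gb/s")
```

For an x4 gen. 2 link, the sketch yields the 16 Gb/s stated in the text.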
III. HARDWARE PLATFORM
The presented video acquisition and processing framework is developed for 7-Series field-programmable gate array (FPGA) circuits from Xilinx. It can also be used with UltraScale and UltraScale+ devices. It is not limited to a particular AMC module, as long as the card can provide PCIe and M-LVDS bus connectivity and an FMC connector. The reference implementation was prepared for the commercially available hardware shown in Fig. 2. The device is composed of a dual-slot FMC carrier (the MFMC board) and one or two Camera Link pass-through modules. The carrier is a cost-effective double-width AMC module with a recent Artix-7 FPGA, DDR3 memories, a single digital input, and a minimal set of peripherals.
The block diagram of the frame grabber module is presented in Fig. 3 . The deserialization is done directly in the FPGA, thus no external active circuits are needed. The MFMC baseboard is equipped with an SDRAM bank of 2 GB, consisting of four DDR3 chips with the 16-bit data bus. These can operate at 533 MHz (DDR-1066) offering a total throughput of 68.2 Gb/s. This is more than twice the throughput required to support two cameras outputting video at the maximum possible data rate supported by the Camera Link standard. The memory can be therefore simultaneously used also by data processing algorithms.
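The memory figures above can be checked with a short calculation. This is a sketch under one stated assumption that is not in the text: every frame is written to the memory once and read back once, so each camera consumes twice its output rate in memory bandwidth. Function names are illustrative.

```python
def ddr_peak_gbps(chips: int, bits_per_chip: int, clock_mhz: float) -> float:
    """Peak throughput of a DDR bank in Gb/s (two transfers per clock cycle)."""
    transfers_per_second = 2 * clock_mhz * 1e6
    return chips * bits_per_chip * transfers_per_second / 1e9


def camera_link_peak_gbps(bits_per_clock: int = 80, clock_mhz: float = 85.0) -> float:
    """Peak Camera Link payload rate in Gb/s (80-bit mode at 85 MHz by default)."""
    return bits_per_clock * clock_mhz * 1e6 / 1e9


memory = ddr_peak_gbps(chips=4, bits_per_chip=16, clock_mhz=533)
camera = camera_link_peak_gbps()
# Assumption: two cameras, each frame written once and read once.
demand = 2 * 2 * camera

if __name__ == "__main__":
    print(f"DDR3 peak:        {memory:.1f} Gb/s")
    print(f"One camera:       {camera:.1f} Gb/s")
    print(f"Two-camera load:  {demand:.1f} Gb/s")
    print(f"Headroom factor:  {memory / demand:.2f}")
```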
The presented hardware and firmware architecture is a continuation of earlier work by the authors described in [1] and [5]. The main improvement in the frame grabber design is the migration from the MTCA.0 standard to full compliance with MTCA.4. This enabled synchronization of the frame grabber trigger and the precise timestamping block with the reference timing distributed in the crate. Use of the 7-Series FPGA device enabled deserialization of the Camera Link data directly in the programmable logic device, which greatly simplified the interface mezzanine. Another new possibility is connecting up to four cameras to a single acquisition module, either through the second FMC slot or using the Zone 3 connector and an RTM card.
IV. FPGA FRAMEWORK
The framework delivers a complete solution for video acquisition and processing in real time. It generates triggers for the camera, provides absolute timestamps, and appends headers to the captured frames. It also enables on-line processing of captured data; however, algorithms have to be customized for a particular application. The developed framework is composed of a camera interface, video data path, camera control path, and PCIe bus support.
The FPGA firmware was prepared in the Vivado design environment. The top level of the design is almost completely defined using an IP Integrator block diagram. All exposed data interfaces are realized with the AXI4 and AXI4-Stream (AXIS) protocols. Such an approach fosters the modularity and extensibility of the base design. The framework is currently provided only with a Camera Link receiver block. However, Camera Link High-Speed and CoaXPress modules will also be developed soon. Exchanging the camera interface will have only a minor impact on the rest of the design.
A. Capturing Camera Link Data
Although the Camera Link standard was first released in 2000, it is still commonly used in high-speed cameras. In the most efficient mode of operation, 80 bits of data, along with four synchronization signals, are transmitted in every clock cycle. The clock frequency is limited to 85 MHz, mainly by the Channel Link chipset [6] . Before transmission, the data is split into three groups of 28 bits and serialized. Each group is then transmitted over four data lines accompanied by a corresponding clock signal, following the Channel Link protocol [6] .
Receiving a Camera Link data stream is not trivial as the link is composed of three independent channels with varying phase relationships. Therefore, each channel has to be buffered in the receiver and then synchronized with the others [7] . The Camera Link standard defines a variety of mappings between pixel data and link words. Due to a large number of possible configurations, the rearrangement of the incoming data is partly done in firmware and partly in software.
The Channel Link signaling used in the Camera Link standard has quite unusual timing. Data is serialized with a 7:1 ratio and accompanied by an asymmetric clock, neither edge of which is aligned with the start of the word. Each differential pair transfers up to around 600 Mb/s. To deserialize the data, several new signals have to be synthesized. This is done with the use of either a phase-locked loop (PLL) or a mixed-mode clock manager (MMCM), depending on which resource is still available in a given FPGA clock region.
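The link partitioning and the per-pair bit rate follow directly from the numbers given above. A minimal sketch of the arithmetic, with illustrative constant and function names:

```python
DATA_BITS = 80              # pixel data bits per clock in the widest mode
SYNC_SIGNALS = 4            # synchronization signals sent with the data
CHANNELS = 3                # independent Channel Link channels
DATA_LINES_PER_CHANNEL = 4  # differential data pairs per channel
SERIALIZATION_RATIO = 7     # bits sent on each pair per reference clock cycle


def bits_per_channel() -> int:
    """Number of parallel bits carried by one channel per reference clock cycle."""
    return DATA_LINES_PER_CHANNEL * SERIALIZATION_RATIO


def line_rate_mbps(reference_clock_mhz: float) -> float:
    """Bit rate of a single differential pair in Mb/s."""
    return reference_clock_mhz * SERIALIZATION_RATIO


if __name__ == "__main__":
    print("bits per channel:", bits_per_channel())
    print("total link bits: ", CHANNELS * bits_per_channel())
    print("pair rate at 85 MHz:", line_rate_mbps(85), "Mb/s")
```

At the 85-MHz limit each pair carries 595 Mb/s, which matches the "around 600 Mb/s" figure.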
The frequency synthesizer is configured to generate the following three signals.
1) A fast clock: seven times faster than the reference.
2) A slow clock: a phase-aligned copy of the reference.
3) A feedback clock: used for compensation of the intrinsic delay of the clock distribution tree.
The fast clock is used by the ISERDES block for capturing the incoming data. The slow one is, in turn, used to latch the ISERDES output word. The deserialized data is then processed using only the slow clock. The block diagram of the Channel Link receiver is depicted in Fig. 4.
The Camera Link data word starts in the middle of the high state on the reference clock line. In contrast, the Xilinx ISERDES primitive expects the data word to start on the rising edge of the clock line. In consequence, the deserializer captures 1 bit from the current clock cycle and 6 bits from the following one. To have all the bits from the same data word, the six "early" bits are delayed by a single clock cycle in a bank of registers. The data from the deserializers are presented to the external logic after a simple bit reordering.
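The realignment step can be illustrated with a small behavioral model. This is a sketch, not the firmware itself: it assumes that each captured 7-bit window holds the last bit of one word followed by the first six bits of the next word, and the bit ordering and all names are chosen purely for illustration.

```python
WORD_BITS = 7
LATE_BITS = 1  # bits in each capture that complete the previously started word


def capture_windows(stream):
    """Cut the serial stream into 7-bit windows that start one bit before a word boundary."""
    start = WORD_BITS - LATE_BITS
    stop = len(stream) - WORD_BITS + 1
    return [stream[i:i + WORD_BITS] for i in range(start, stop, WORD_BITS)]


def realign(captures):
    """Rebuild whole words by delaying the six early bits of every capture by one cycle."""
    words = []
    delayed = None  # models the register bank holding the early bits
    for capture in captures:
        late, early = capture[:LATE_BITS], capture[LATE_BITS:]
        if delayed is not None:
            words.append(delayed + late)
        delayed = early
    return words


def to_bits(value):
    """Expand an integer into a list of 7 bits in transmission order (assumed LSB first)."""
    return [(value >> bit) & 1 for bit in range(WORD_BITS)]


if __name__ == "__main__":
    sent = [to_bits(v) for v in (0x55, 0x2A, 0x7F, 0x13)]
    stream = [bit for word in sent for bit in word]
    print(realign(capture_windows(stream)) == sent[1:3])
```

The first and last transmitted words are lost in this model only because the example stream is finite; in a continuous stream every word is recovered one cycle late.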
In order to further process the captured words, these have to be transferred to a common clock domain. This is done with the use of the FPGA's built-in FIFO primitives. These hard IP-cores support two asynchronous clocks and are used in the first-word-fall-through mode.
The input channel module, illustrated in Fig. 5, is also responsible for preparing synchronization pulses, which are used to align data from all three channels. A pulse is generated whenever a rising edge on the Line Valid (LVAL) signal is detected. The Camera Link standard guarantees that the LVAL flag is present in all of the links. However, it is found on either bit 24 or bit 27 of the deserialized data, depending on the Camera Link operation mode. The selected bit is extracted and provided to an edge detector. Then, the word from the camera, along with the generated sync pulse, is stored in a FIFO. The Channel Link reference clock, no longer needed for data transmission, is sent to a higher level module for frequency measurement. The data is read from the queue using the local 128-MHz clock. This frequency was selected to guarantee that a 64-bit bus operating on it (∼8 Gb/s) offers somewhat larger throughput than an 80-bit bus running at up to 85 MHz (∼7 Gb/s).
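The generation of the sync pulse can be sketched as a rising-edge detector on the selected LVAL bit. The model below is illustrative only; it assumes that LVAL was low before the first word, and the function name is hypothetical.

```python
def tag_sync_pulses(words, lval_bit):
    """Pair every deserialized word with a flag marking a rising edge of LVAL.

    lval_bit is 24 or 27, depending on the Camera Link operation mode.
    """
    tagged = []
    previous = 0  # assumption: LVAL is low before the first word
    for word in words:
        lval = (word >> lval_bit) & 1
        tagged.append((word, bool(lval and not previous)))
        previous = lval
    return tagged


if __name__ == "__main__":
    lval = 1 << 24
    words = [0x01, lval | 0x02, lval | 0x03, 0x04, lval | 0x05]
    for word, sync in tag_sync_pulses(words, lval_bit=24):
        print(f"{word:#010x} sync={sync}")
```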
Three buffered input channels now have to be synchronized in order to produce a consistent data word. This task is performed by the dedicated finite-state machine (FSM). The synchronizer and Channel Link deserializers are instantiated in the top module of the Camera Link receiver shown in Fig. 6 . Each channel is accompanied by a frequency measurement block for basic link diagnostics. Sheer information on the presence of the clocks can be effectively used for detecting problems with Camera Link cables.
The synchronizer observes "sync" pulse outputs and "valid" signals from the Channel Link FIFO queues. Initially, the module assumes that the data channels are not synchronized and reads the FIFOs until the synchronization marker is found in all three channels. When a channel indicates the presence of the marker, it is no longer read. These queues are short and therefore will overflow if not read within several dozen cycles. In case of overflow, all FIFOs are cleared and the procedure is restarted.
Once all the data paths report the presence of sync flags, these are considered synchronized. From that moment on, channels are always read simultaneously. The read operation can only occur when all of them are ready. The stream is considered out of sync and the procedure is restarted if the sync pulses do not all appear at the same time. When the queues are synchronized and indicate valid data, the master "valid" signal is presented to the outside logic.
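The behavior of the synchronizer can be summarized with a simplified software model. This is a sketch under stated assumptions: each FIFO entry is a (word, sync) pair, FIFO overflow handling is omitted, and the model stops as soon as any queue runs empty. Names are illustrative.

```python
from collections import deque


def synchronize(channels):
    """Align several channel FIFOs on their sync markers and return the combined words."""
    fifos = [deque(channel) for channel in channels]
    synced = False
    combined = []
    while all(fifos):  # every FIFO must hold at least one entry
        if not synced:
            if all(fifo[0][1] for fifo in fifos):
                synced = True  # marker present at the head of every queue
            else:
                for fifo in fifos:
                    if not fifo[0][1]:
                        fifo.popleft()  # keep reading channels without a marker
            continue
        entries = [fifo.popleft() for fifo in fifos]  # simultaneous read
        flags = [sync for _, sync in entries]
        if any(flags) and not all(flags):
            synced = False  # markers diverged, restart the procedure
            continue
        combined.append(tuple(word for word, _ in entries))
    return combined


if __name__ == "__main__":
    a = [("a0", False), ("a1", True), ("a2", False), ("a3", False)]
    b = [("b1", True), ("b2", False), ("b3", False)]
    c = [("c8", False), ("c9", False), ("c1", True), ("c2", False), ("c3", False)]
    print(synchronize([a, b, c]))
```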
The mapping between deserializer outputs and video data words depends on the currently selected interface operation mode. The receiver module takes care of the data reordering, arranging it back into an 80-bit bus analogous to the one that existed before the Channel Link serialization. In most of the modes, only a part of this bus is used. The unused bits are stripped in one of the following blocks. Also at this stage, the Frame Valid (FVAL), LVAL, and Data Valid (DVAL) signals are combined to produce a consistent set of TUSER AXI flags.
B. Improving the AXI Video Bus
Recent Xilinx video processing cores utilize the AXIS protocol for exchanging video streams. The receiver of the video stream must at least be aware of the frame boundaries in order to perform any meaningful data processing. According to the Xilinx "AXI4-Stream Video IP and System Design Guide," the start-of-frame event shall be signaled by a logic high on the least-significant bit of the user-defined vector (TUSER) [8]. A number of video processing IP cores perform their operations in a line-by-line manner. These have to be configured in advance for a particular video resolution or receive information on the end-of-line by some means. The second option is preferred as it allows the system to dynamically adapt to resolution changes. In Xilinx IP cores, the end-of-line is indicated by asserting the TLAST signal during the trailing data word of a given video line. According to the AMBA specification, the TLAST signal is dedicated to marking the end of a data burst [9]. Therefore, each line is sent in exactly one AXIS transmission. Fig. 7 presents an example of an AXIS Video transmission with one start-of-frame and two end-of-line events visible. For the sake of clarity, it is assumed that the data sink is constantly ready to receive data, which is indicated by a high level on the TREADY line. Otherwise, the transmission would have to be paused and the transmitter would need to hold its currently transmitted word until it is acknowledged by the receiver. The transmitter is also capable of pausing the transmission by clearing the TVALID signal. However, once the TVALID signal has been asserted for some data word, it can only be deasserted after the receiver acknowledges the transfer.
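The framing rules described above can be sketched as a function that turns a frame into a sequence of stream beats. The model assumes that the sink is always ready (TREADY held high), so back-pressure is not represented; the Beat type and function name are illustrative.

```python
from typing import NamedTuple


class Beat(NamedTuple):
    tdata: int   # one data word
    tuser: int   # bit 0 marks the start of a frame
    tlast: bool  # asserted on the last word of every line


def frame_to_beats(frame):
    """Convert a frame (list of lines, each a list of words) into AXIS Video beats."""
    beats = []
    for row, line in enumerate(frame):
        for column, word in enumerate(line):
            start_of_frame = row == 0 and column == 0
            end_of_line = column == len(line) - 1
            beats.append(Beat(word, int(start_of_frame), end_of_line))
    return beats


if __name__ == "__main__":
    for beat in frame_to_beats([[10, 11, 12], [20, 21, 22]]):
        print(beat)
```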
The Xilinx AXIS Video interface is sufficient in most applications. However, the design of a frame grabber poses some unique challenges. One of them is the conversion from the variable-width bus of the Camera Link deserializers to a regular 64-bit AXIS Video bus and then to a 512-bit plain AXIS interface for storage in a DDR memory. When a system contains data buses of different widths, it also has to contain some kind of aligners.
In the case of the described framework, the memory utilizes a wide 512-bit interface, whereas the camera returns only 16-80 bits per clock cycle. This means that aligners have to buffer a number of camera words in order to compose a single memory word. This can become a problem near the end of the transmission. Assuming that only one frame had to be captured and the number of received bits was not divisible by 512, the aligner could wait forever to complete the last output data word. This means that the host system could never finish receiving a video frame. Therefore, the aligner must be flushed at the end of the frame.
The problem could be approached by emptying the aligner when no new data arrives in a given time period. This could, however, lead to data inconsistency, as the camera is allowed to pause its data stream or to stream two consecutive frames without a significant delay between them.
The problem is solved in the firmware by detecting the end of the transmission based on the synchronization signals from the imaging device. This information is then encoded on the second bit of the TUSER vector, along with the start-of-frame marker. When this event propagates through the video pipeline, it causes flushing of every buffer on its way, to ensure real-time data delivery. As a result, each frame is predictably padded with zeros up to the next multiple of 512 bits.
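The effect of the end-of-frame flag on a bus aligner can be sketched as follows. The model widens a 64-bit stream to 512 bits and flushes a partially filled word, padded with zeros, when the flag arrives. It assumes that the first input word lands in the least-significant bits of the output word; this ordering and the names are illustrative.

```python
INPUT_BITS = 64
RATIO = 8  # 512-bit output word / 64-bit input word


def widen(beats):
    """Pack (word, end_of_frame) beats into 512-bit words, flushing on end of frame."""
    output = []
    buffer = []
    for word, end_of_frame in beats:
        buffer.append(word)
        if len(buffer) == RATIO or end_of_frame:
            buffer += [0] * (RATIO - len(buffer))  # zero padding on flush
            packed = 0
            for index, value in enumerate(buffer):
                packed |= value << (INPUT_BITS * index)
            output.append((packed, end_of_frame))
            buffer = []
    return output


if __name__ == "__main__":
    # A ten-word frame: one full output word plus one padded word.
    frame = [(index + 1, index == 9) for index in range(10)]
    for packed, end_of_frame in widen(frame):
        print(f"{packed:#x} end_of_frame={end_of_frame}")
```

Without the flag, the two trailing words of this example would remain in the buffer until more data arrived.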
C. Complete Solution
As shown in Fig. 8, the video data is captured by the Camera Link receiver and transferred via an AXIS interface. The video stream is then directed to a Xilinx Virtual FIFO (VFIFO). It is an IP core that helps implement FIFO queues based on external memories [10]. The virtual FIFO is a very helpful component; however, it also has some drawbacks. For instance, it can only control relatively small memory regions (up to several tens of megabytes). Moreover, it cannot emit a warning when the amount of available storage space drops below a certain limit. Therefore, a dedicated monitoring component was developed to assess whether the next frame will fit in the memory.
In order to provide each frame with an absolute or relative timestamp, the frame grabber is equipped with a dedicated timing block. It receives a reference clock and a start event from an external timing module. These signals are provided using the M-LVDS bus available on the MTCA.4 backplane. The block contains a 64-bit counter that is first preloaded with a start value and then incremented by a configurable amount in every clock cycle. Both values are set through registers accessible via an AXI interface. The proposed use for the timestamp counter is an absolute nanosecond counter. In this scenario, it operates at a 100-MHz clock from the timing module and is incremented in steps of ten counts. The signals from the timing module can also be used to start the local camera trigger generator or be directly forwarded to the Camera Control lines.
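The timestamp counter can be modeled in a few lines. With a 100-MHz clock, one cycle lasts 10 ns, so a step of ten makes the counter value advance in nanoseconds. The class below is an illustrative sketch; its name and interface are not taken from the firmware.

```python
MASK_64 = 2**64 - 1


class TimestampCounter:
    """A 64-bit counter preloaded with a start value and advanced by a fixed step."""

    def __init__(self, start: int, step: int) -> None:
        self.value = start & MASK_64
        self.step = step

    def tick(self, cycles: int = 1) -> int:
        """Advance the counter by the given number of clock cycles, wrapping at 64 bits."""
        self.value = (self.value + cycles * self.step) & MASK_64
        return self.value


if __name__ == "__main__":
    counter = TimestampCounter(start=1_000_000, step=10)
    counter.tick(100)  # 100 cycles of a 100-MHz clock, i.e. 1 microsecond
    print(counter.value)
```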
During the write of the data to the VFIFO, the frame resolution is identified and a corresponding timestamp is obtained. The timestamp may be captured either on the trigger event sent to the camera or on the first data word received from it. The information on the arrival time, along with a frame sequential number, is stored in another queue-the frame information FIFO. Its purpose is to provide the processing block and the tagged image file format (TIFF) header generator with information on the size of the forthcoming image. The process of retrieving the data from VFIFO is coordinated by a frame reader module. The operation starts with obtaining information on the next available frame. Then, the frame reader generates a frame information vector and starts a read of the VFIFO. Apart from transferring data it also reconstructs synchronization signals.
In the reference design, the data processing module does not perform any actual operation. Its purpose is mainly to provide efficient pixelwise manipulations on the image. The foreseen uses of this module include thresholding, noise reduction, removal of white/black pixels, and linear correction. Before a transmission starts, it is provided with information on the frame to be processed. Then, it receives 8 bytes of data in every clock cycle. It can buffer several lines of the image in the FPGA's built-in memory or use an external bulk DDR3 memory. Its settings and status are accessed through an AXI-mapped register file. The processing module can only operate when the frame grabber buffer is read by the host. Therefore, it cannot be used, e.g., for generating interlock signals. If the frame grabber is expected to deliver some real-time outputs (e.g., hot-spot detection in infrared (IR) images for machine thermal protection), frame data shall be taken directly from the Camera Link receiver block.
Before the data are stored in the host memory, each captured and processed frame has to be appended with a header. Headers are generated by the TIFF header provider. The TIFF format was selected mainly due to its straightforward file structure and support for uncompressed gray-scale images.
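To illustrate why the TIFF structure is convenient for a hardware header generator, the sketch below builds a minimal little-endian header for an uncompressed single-strip gray-scale image. It is not the layout produced by the firmware, which the text does not specify: the selection of tags is an assumption, and the resolution tags that baseline TIFF formally expects are omitted for brevity.

```python
import struct

SHORT, LONG = 3, 4  # TIFF field type codes


def tiff_header(width: int, height: int, bits_per_sample: int = 8) -> bytes:
    """Return a TIFF header and image file directory to be followed by raw pixel data."""
    entry_count = 9
    ifd_size = 2 + 12 * entry_count + 4
    data_offset = 8 + ifd_size  # pixel data starts right after the directory
    byte_count = width * height * bits_per_sample // 8
    entries = [  # tags must be listed in ascending order
        (256, LONG, width),             # ImageWidth
        (257, LONG, height),            # ImageLength
        (258, SHORT, bits_per_sample),  # BitsPerSample
        (259, SHORT, 1),                # Compression: none
        (262, SHORT, 1),                # PhotometricInterpretation: black is zero
        (273, LONG, data_offset),       # StripOffsets
        (277, SHORT, 1),                # SamplesPerPixel
        (278, LONG, height),            # RowsPerStrip: whole image in one strip
        (279, LONG, byte_count),        # StripByteCounts
    ]
    header = struct.pack("<2sHI", b"II", 42, 8)  # byte order, magic, IFD offset
    directory = struct.pack("<H", len(entries))
    for tag, field_type, value in entries:
        # In little-endian files a SHORT value occupies the low bytes of the value field.
        directory += struct.pack("<HHII", tag, field_type, 1, value)
    directory += struct.pack("<I", 0)  # no further directories
    return header + directory


if __name__ == "__main__":
    header = tiff_header(1696, 1710)
    print(len(header), header[:8].hex())
```

Because every field has a fixed size and position, only the width, height, and byte count change from frame to frame, which suits a simple hardware generator.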
The video stream is written to the host computer memory by a custom direct memory access (DMA) engine developed at the Department of Microelectronics and Computer Science (DMCS) of Lodz University of Technology. The engine supports efficient transfers utilizing a scatter-gather list. It is composed of an FSM, a FIFO, and an interrupt generation block. The FSM is responsible for converting data obtained through a single-stream interface to a number of PCIe transaction-level packets (TLPs). It is closely coupled to the PCIe endpoint, minimizing the number of cycles required for a single transmission. Each DMA FIFO entry describes one transfer to consecutive memory locations and is mapped to a series of TLPs in hardware. Entries are provided by the device driver, which in turn queries the operating system for physical addresses of the user-space buffers (a zero-copy solution). The interrupt generation block provides the driver with asynchronous feedback, signaling the need for refilling the FIFO and finally the end of the transmission. The measured performance of the DMA engine alone is around 1.6 GB/s for PCIe x4 gen. 2 (having a theoretical maximum of 2 GB/s).
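The mapping of one scatter-gather entry to a series of TLPs can be sketched as follows. The sketch is not the DMCS engine: it assumes a maximum payload size of 256 bytes (the real value is negotiated per link) and applies the PCIe rule that a memory request must not cross a 4-KiB address boundary. The function name is hypothetical.

```python
PAGE_BOUNDARY = 4096  # a memory request must not cross a 4-KiB boundary


def split_into_tlps(address: int, length: int, max_payload: int = 256):
    """Split one contiguous transfer into (address, size) pairs, one per write TLP."""
    tlps = []
    while length > 0:
        to_boundary = PAGE_BOUNDARY - (address % PAGE_BOUNDARY)
        size = min(length, max_payload, to_boundary)
        tlps.append((address, size))
        address += size
        length -= size
    return tlps


if __name__ == "__main__":
    for address, size in split_into_tlps(0x1F80, 512):
        print(f"{address:#06x}: {size} bytes")
```

The measured 1.6 GB/s corresponds to 80% of the 2 GB/s theoretical maximum quoted above.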
The camera is controlled through a simple asynchronous serial interface with LVDS signaling. Its physical layer is defined by the Camera Link specification, whereas the frame format and the particular command set are vendor dependent. Therefore, the firmware is equipped with an industry-standard 16550-compatible universal asynchronous receiver/transmitter (UART).
V. SOFTWARE SUPPORT
The framework includes complete software support consisting of a Linux device driver, application programming interface (API) libraries, console, and graphical user applications as well as integration with control systems like EPICS and DOOCS. The general structure of the software stack is illustrated in Fig. 9 .
The FPGA firmware instantiates a PCIe endpoint with two address spaces. During bus enumeration, these spaces are mapped into two regions of the host memory. The larger of these spaces is used for register access. The Linux device driver enables access to these registers through the IOCTL mechanism. The other, smaller address space is used by the driver to control the DMA engine. This area is not accessible by unprivileged processes. The DMA is completely handled by the driver and is transparently used for every read operation.
The driver creates two separate device files in the /dev/ directory. The first enables high-performance DMA data transfers directly to the user-space memory. This file can only be opened by one process at a time, which will receive the full data stream. The second file is provided for controlling both the frame grabber and the camera. It can be simultaneously opened by several processes, assuming that these control different aspects of the system (e.g., one controls the camera while the other is focused on the frame grabber status and settings) [11].
To facilitate software development, the framework contains a set of user-accessible API libraries. The low-level library communicates directly with the device driver and provides easy-to-use functions for image data reading, adjusting frame grabber parameters, and sending/receiving characters via the UART interface. A set of console applications based on the low-level library allows accessing the diagnostic features and testing the firmware at the register level. These provide access to the current device settings and status information. Frame grabber settings include, but are not limited to, the following.
High-level libraries implement control protocols of particular cameras. A single library is needed for a specific group of cameras sharing the same control protocol. Thanks to such an approach, the framework can be easily extended. The only effort necessary to provide support for a new imaging device is the development of a corresponding camera library implementing its control protocol and video decoding. Currently, libraries offer only a proprietary API; however, in the future, a GenICam API library will be provided.
To facilitate testing of the camera support, the framework offers a graphical user interface application written with the use of the Qt library. It provides a live video preview as well as an option to store a predefined number of consecutive frames to a specified file. It also provides control over the most important camera parameters, e.g., region of interest, shutter time, frame rate, and black level. These parameters can also be accessed through an additional command line application.
The highest layers of the software support are TCP/IP servers for EPICS and DOOCS control systems. Naturally, these also use camera-specific libraries for communication with the hardware. The video stream can be either sent through a network connection or archived locally. Current settings and status are exposed as process variables.
VI. FRAMEWORK APPLICATIONS
The developed framework enables collecting data from cameras using the top-performance 80-bit mode of the Camera Link interface, offering 6.8 Gb/s of raw image data throughput. The solution was tested with a number of commercially available cameras, including the Mikrotron MC3010/MC3011, PCO EDGE 5.5, Andor NEO 5.5, and Basler Sprint spL2048-70km. Sections VI-A and VI-B present selected practical uses.
A. ITER Diagnostic Use Case
The tokamak of the ITER experimental nuclear power plant will have to sustain extreme conditions. It is essential to provide monitoring of both the nuclear reaction and the vessel. The ITER diagnostics will contain more than 45 systems [12]. These will include optical systems which will capture images in the IR (3-5 µm) as well as in the visible light (400-700 nm) regime [13]. In total, there will be over 200 imaging devices observing the vessel and the plasma itself [14]. The systems capturing data from the cameras have to integrate seamlessly with other diagnostic setups. It is not possible to connect cameras directly to ITER Ethernet-based networks. Cameras have to be connected to data processing computers instead. These machines split the video stream and package it in containers suitable for a given network. A single multicore computer can support several cameras.
The video acquisition system for ITER has to support a relatively large number of cameras. Consequently, the effort required to extend the system by another image sensor shall be low. ITER Organization does not recommend any particular camera communication interface or device. The video acquisition system architecture shall be open for using different machine vision interfaces. It was decided that a Camera Link-enabled camera will be used for the proof-of-concept solution. The Mikrotron MC3010 camera was selected because it was already used in another diagnostic use case. This device has a resolution of 1696×1710 pixels and is able to generate around 6.5 Gb/s of plain image data.
The system is built around a stand-alone industrial computer to fulfill ITER performance requirements. The computer interfaces the scientific and archiving networks using two 10-Gb/s Ethernet interfaces. It is also connected to an MTCA.4 crate by a PCIe x8 gen. 2 (32 Gb/s) copper interface. This link can be upgraded to x16 gen. 3 (126 Gb/s). The frame grabber is connected to the crate with PCIe x4 gen. 2 interface (16 Gb/s).
The frame grabber closely cooperates with the timing module. The timing module receives the temporal information over the ITER time communication network using the IEEE-1588 Precision Time Protocol. The timing module provides a reference clock through a star distribution fabric and a set of programmable event lines through a shared M-LVDS bus. These signals are used for starting and stopping the acquisition as well as for synchronization of the timestamping block. The timing module for the ITER Organization was also developed by the DMCS [15] .
B. Injector Laser at DESY
The developed framework will be used at Deutsches Elektronen-Synchrotron (DESY) for fine-tuning the injector system of a linear accelerator. The goal is to observe a cross section of a light beam traveling through two nonlinear optical crystals and eventually delivered to the photo-cathode [16] . The cameras will verify if a proper beam focus is maintained during operation of the machine. Some defocusing of the beam is expected due to heating caused by a considerable power conveyed by this optical tract.
The PCO EDGE 5.5 image sensor was selected by DESY, mainly due to its low noise floor. It is a compact Camera Link device that only operates in the 80-bit Camera Link configuration. Data from the camera is collected by a computer contained in an MTCA.4 crate. Both the frame grabber and the camera have to be synchronized to the machine timing. The reference timing events are provided by a DESY X2-Timer module connected to a dedicated event distribution network. This enables precise timestamping and tagging of the captured data with the accelerator pulse number. The frame grabber streams acquired frames to the memory of a computer implemented on another double-width AMC module. Captured data is available through the accelerator control network.
VII. CONCLUSION
The presented video acquisition solution features the world's first Camera Link frame grabber developed for the MTCA.4 architecture. Adherence to the MTCA.4 standard allows precise synchronization of the frame grabber operation with externally provided timing. This enables both timestamping and triggering of the imaging device with an accuracy of several tens of nanoseconds [15].
The Camera Link deserialization is done using only the FPGA's built-in ISERDES primitives. This task was usually done with several external deserializers. Such a solution requires more board space, is more expensive, and is less flexible than the proposed approach. It is probably the first system demonstrating the possibility of receiving the Camera Link signal directly in an Artix-7 device. It was definitely not possible on earlier Xilinx architectures due to limitations in their I/O operation frequency. The implemented video receiver module is capable of supporting all Camera Link modes and configurations. In particular, it allows capturing the 80-bit Camera Link image data with a reference clock of 85 MHz, which constitutes the highest data rate supported by the standard.
Another innovation is the use of AXI TUSER signals to provide information on the end of the line as well as to clearly indicate the end of the whole frame. The adopted method enables flushing remaining image data from all bus aligners ensuring delivery of final bytes without unnecessary delays.
The architecture of the frame grabber firmware guarantees that the frame is only captured when there is enough space in the buffer. Therefore, as long as the camera and its interface are operating correctly, the frame grabber only returns complete frames. When the data readout is too slow, whole frames are rejected. Thus, the software layer almost never has to clear buffers or search for a header of the following image.
The frame grabber output is a stream of consecutive TIFF images. Therefore, in the case of gray-scale cameras, collected data can be directly viewed with a generic image viewer.
