An FPGA platform for ultra-fast data acquisition by Caselle, M. et al.
KIT – University of the State of Baden-Württemberg and
National Research Center of the Helmholtz Association
KIT, Institut für Prozessdatenverarbeitung und Elektronik 
UFO project group 
www.kit.edu 
An FPGA platform for ultra-fast data acquisition 
M. Caselle, M. Balzer, S. Chilingaryan, A. Kopmann, U. Stevanovic, M. Vogelgesang 
 
FPGAs in Research - Applications, Technologies and Tools, Forschungszentrum Jülich, 3-4 December 2012 
Ultra Fast X-ray Imaging (ANKA/UFO experimental station) 
UFO à Ultra-Fast X-ray Imaging of Scientific Processes with On-line Assessment and 
Data-driven Process Control 
Main app l i ca t ion f i e lds : med i ca l 
diagnostics, biology, non-destructive testing, 
materials research and etc. 
High spatial resolution (<1 µm) included 
2D and 3D visualizations 
Time resolution (2D: ≈10kHz, 3D: ≈10Hz) 
to  give insight in the temporal structure 
evolution and thus access to dynamics of 
processes 
+ 
High readout bandwidth up to 50Gb/s with GPU (3D-tomography reconstruction)  
Requirements: 
High granularity and low noise monolithic silicon pixel detector, few µm pixel pitch, 
several MPixel matrix operating at several kframes/sec 
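A rough way to see why such a detector needs a readout bandwidth of tens of Gb/s is to multiply pixel count, bit depth, and frame rate. The sketch below is only illustrative: the operating point (2 MPixel, 10 bits per pixel, 2 kframes/s) is an assumed example, not a figure taken from the slides.

```python
def raw_data_rate_gbps(pixels: float, bits_per_pixel: int, frames_per_s: float) -> float:
    """Raw (uncompressed) detector output rate in Gb/s."""
    return pixels * bits_per_pixel * frames_per_s / 1e9


# Hypothetical operating point: 2 MPixel sensor, 10-bit pixels, 2 kframes/s.
example_rate = raw_data_rate_gbps(pixels=2e6, bits_per_pixel=10, frames_per_s=2000)
print(f"raw data rate: {example_rate:.1f} Gb/s")
```

With these assumed numbers the raw rate is already of the same order as the 50 Gb/s readout bandwidth quoted above.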
KIT-IPE – Readout concept of high data throughput for
scientific applications
Concept:
Real-time data elaboration
Data reduction
High-throughput data flow
Under development by the Data Processing group at KIT-IPE
[Figure labels: X-ray detector; daughter sensor board; mother readout board; small PCIe backplane; PCIe link to DAQ]
This talk focuses on the FPGA & readout board
Flexible high-throughput FPGA platform
[Figure: FPGA internal architecture, including the user bank register; Virtex-6 floorplan with the PCIe block]
ü Three logic cores have been developed for a flexible high-throughput platform 
ü PCIe-Bus Master DMA readout architecture  
ü Multi-port high speed DDR3 interface 
ü Configurable 2..16 bits “SerDes” (Serializers /Deserializers) architecture 
ü PCI Express/DMA Linux 32-64 bits driver with ring buffer data management 




























PCIe-Bus Master DMA readout architecture
[Figure labels: FPGA core (Xilinx / North-West IP core); software layers, user applications; FIFO; RD control; packet FSMs; busy logic; I/O interface logic. Output port: Data out [0..63], Data valid, Clock_out. Input port: Data in [0..63], WR_EN, Clock_in.]
✓ Bus-master DMA operating over 4 PCIe Gen2 lanes (250 MHz)
✓ IN and OUT FIFO-like interfaces (for user logic)
✓ FIFOs decouple the clock domains of the DMA engine and the custom user logic
✓ Two independent engines for writing/reading between the FPGA (user logic) and
the PC main memory
Preliminary: new PCIe bus-master DMA architecture
FPGA ring-buffer management → ongoing
64-bit Linux driver → under optimization (~32 Gb/s)
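The slides do not show how the ring buffer is organised. As a minimal sketch of the general idea only, the model below uses a fixed number of slots, with a producer (standing in for the DMA engine) and a consumer (standing in for the user application). The class and method names are hypothetical and do not come from the KIT driver.

```python
class RingBuffer:
    """Fixed-size ring buffer: the producer never overwrites unread slots."""

    def __init__(self, n_slots: int):
        self.slots = [None] * n_slots
        self.head = 0    # next slot the producer writes
        self.tail = 0    # next slot the consumer reads
        self.count = 0   # number of filled slots

    def push(self, packet) -> bool:
        """Producer side. Returns False (producer must wait) when the ring is full."""
        if self.count == len(self.slots):
            return False
        self.slots[self.head] = packet
        self.head = (self.head + 1) % len(self.slots)
        self.count += 1
        return True

    def pop(self):
        """Consumer side. Returns None when no packet is pending."""
        if self.count == 0:
            return None
        packet = self.slots[self.tail]
        self.slots[self.tail] = None
        self.tail = (self.tail + 1) % len(self.slots)
        self.count -= 1
        return packet


ring = RingBuffer(n_slots=2)
accepted = [ring.push("frame0"), ring.push("frame1"), ring.push("frame2")]
first = ring.pop()
```

In this toy run the third push is refused because both slots are still unread, which is the back-pressure condition a real driver would have to signal to the FPGA.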





















[Plot: comparison of NW-DMA vs. KIT-DMA; x-axis: packet size in bytes; series: PC Mem → FPGA (NW), FPGA → PC Mem (NW), PC Mem → FPGA (Michele), FPGA → PC Mem (Michele)]
Disadvantages of IP cores from external vendors:
 1)  Expensive (35 k€ for the North-West DMA and 10-60 k€ for EZDMA/QuickPCIe-IP by PLDA)
 2)  Tied to a single FPGA family (Virtex-6, speed grade -2)
New KIT-IPE bus-master DMA engines operating over x8 lanes PCI Express Gen2
IN/OUT data at 128 bit @ 250 MHz → internal bandwidth of 32 Gb/s for read and write
[Figure: Virtex-6 floorplan with the PCI Express Gen2 block, RX engine, and TX engine]
FPGA resource estimation: < 4%
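The 32 Gb/s figure can be cross-checked from two sides: the PCIe Gen2 link (5 GT/s per lane with 8b/10b line encoding) and the internal 128-bit bus at 250 MHz. The helper names below are illustrative. The calculation ignores packet (TLP) header overhead, so the sustained payload rate of a real system will be lower.

```python
def pcie_gen2_link_gbps(lanes: int) -> float:
    """Data rate of a PCIe Gen2 link after 8b/10b encoding, in Gb/s."""
    line_rate_gtps = 5.0                    # PCIe Gen2: 5 GT/s per lane
    return lanes * line_rate_gtps * 8 / 10  # 8 data bits per 10 line bits


def bus_gbps(width_bits: int, clock_mhz: float) -> float:
    """Bandwidth of a parallel bus, in Gb/s."""
    return width_bits * clock_mhz / 1000


x4_link = pcie_gen2_link_gbps(4)     # 4-lane core
x8_link = pcie_gen2_link_gbps(8)     # 8-lane KIT-IPE engines
internal_bus = bus_gbps(128, 250)    # 128 bit @ 250 MHz
print(x4_link, x8_link, internal_bus)
```

Under these assumptions the x8 link and the 128-bit internal bus both come to 32 Gb/s, so neither side bottlenecks the other.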
Two-port DDR3 memory interface architecture
Why a two-port DDR3 memory controller?
The Xilinx Multi-Port Memory Controller (IP core) is limited in its maximum data throughput
(less than 2 GB/s per port) and has a complex user interface.
Ref.: LogiCORE IP Multi-Port Memory Controller (MPMC) (v6.03.a), DS643, March 1, 2011
[Figure labels: WR DDR FSM; start address Port 1; start address Port 2; data frame segment; on-line data process segment; 256 bit @ 200 MHz per port; read/write]
✓ Bandwidth of 51 Gb/s, limited by the FPGA speed grade (Virtex-6, speed grade -1)
✓ Two operations can run in the same or in different memory segments (each operation ~25 Gb/s)
✓ FIFO-like data interface; only a minimum of control signals is required
✓ User-configurable data widths N and M → 32/64/128/512 bits
✓ FIFOs decouple the clock domains of the memory controller and the custom user logic
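The bandwidth figures for the two-port DDR3 interface follow from the 256-bit port width and the 200 MHz clock. The worked numbers below assume that two concurrent operations share the interface equally; the slides state only the approximate result.

```latex
\begin{align*}
B_{\text{total}} &= 256\,\text{bit} \times 200\,\text{MHz} = 51.2\,\text{Gb/s} \\
B_{\text{per operation}} &\approx B_{\text{total}} / 2 = 25.6\,\text{Gb/s} \\
B_{\text{MPMC, per port}} &< 2\,\text{GB/s} = 16\,\text{Gb/s}
\end{align*}
```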
A configurable “SerDes” input stage architecture
Why not a Xilinx ISERDESE1 stage?
The parallel output data width is limited to 10 bits (for two ISERDESE1 in cascade configuration)
and is not dynamically configurable. The alignment FSM is not included in the Xilinx tools.
Ref.: Virtex-6 FPGA SelectIO Resources User Guide, UG361 (v1.3), August 16, 2010.
[Figure labels: clock buffer I/O; clock-to-data time tuning; training pattern; parallel data width]
✓ Individual clock-to-data time tuning by IODELAY (time step of 75 ps)
✓ I/O clock buffer located in the centre of the FPGA bank
✓ Regional buffer synchronous to the parallel data output
✓ “SerDes” input stage fully configurable by the user
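The slides mention a training pattern and an alignment FSM but do not describe the algorithm. One common approach, sketched here in software under that assumption, is to shift the word boundary bit by bit until the deserialized words match the known training pattern. The function names and the example pattern are hypothetical.

```python
def deserialize(bits, width, offset=0):
    """Group a serial bit stream (MSB first) into words of `width` bits."""
    usable = bits[offset:]
    n_words = len(usable) // width
    words = []
    for i in range(n_words):
        value = 0
        for bit in usable[i * width:(i + 1) * width]:
            value = (value << 1) | bit
        words.append(value)
    return words


def find_word_offset(bits, width, pattern, n_check=4):
    """Return the bit offset at which `n_check` consecutive words equal `pattern`."""
    for offset in range(width):
        words = deserialize(bits, width, offset)[:n_check]
        if len(words) == n_check and all(w == pattern for w in words):
            return offset
    return None  # no alignment found with this width/pattern


# Example: 6-bit words, training pattern 0b101100, stream starts 3 bits "late".
training = [1, 0, 1, 1, 0, 0]
stream = [1, 1, 0] + training * 6
offset = find_word_offset(stream, width=6, pattern=0b101100)
```

The training pattern should differ from all of its own bit rotations (as 101100 does), otherwise the search can lock onto a wrong boundary. In hardware the same search would be combined with the per-line IODELAY tuning listed above.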
Future developments for high-speed readout systems
Two different approaches are possible:
➢ Peer-to-peer (P2P) streaming data transfer
(based on the new generation of the PCI Express protocol)
➢ Point-to-node (net) for distributed GPU/CPU High Performance
Computing (HPC) clusters
Requirements:
☐ Data source and FPGA readout board located far from the DAQ system
☐ Use of commercial, well-known protocols for easy interfacing with commercial devices/boards
☐ Real-time FPGA + GPU data elaboration → high data throughput (in the range of 64 Gb/s)
IPE - PCI Express readout card - Overview
✓ PCIe Gen3 optical/electrical data transmission (8 lanes x 8 GT/s)
✓ 64 Gb/s (W) + 64 Gb/s (R) → full-duplex mode
✓ FPGA real-time processing → close to the data source
[Figure labels: data source; Virtex-6 FPGA with user logic and integrated PCIe endpoint; multi-port PCIe switching, x8 lanes Gen3 (no DMA is needed); MiniPOD x12-lane optical cable for PCIe Gen3 (8 GT/s per lane); size → 18.6 mm x 22 mm, height → 14.5 mm; to PC host board: optical cable (up to 30 m) or electrical cable (up to 5 m)]
IPE - PCI Express host card - Overview
✓ PCIe host board with high-speed data recording
✓ Fully configurable data flow
[Figure labels: PCIe host card; x16-lane PCIe slot; NAND flash SSD]
High-bandwidth readout system based on InfiniBand
QDR 40 Gb/s InfiniBand protocol
Optical or electrical data link, up to 100 m
40 Gb/s → InfiniBand, in house
384 Gb/s → in the next two years
Heterogeneous FPGA + CPU + GPU
InfiniBand GPU cluster under development at KIT-IPE by the Data Processing group
[Figure labels: µ/ATCA; InfiniBand router; InfiniBand DAQ cluster]
InfiniBand readout board - Overview
[Figure labels: InfiniBand readout board; high-speed connectors (HPC Samtec or similar); Xilinx Virtex-6 with user logic; InfiniHost III; QSFP+]
Remote DMA for fast data transfer → intranet communication
IP-based application layer → possible (e.g. TCP, UDP, SSH, FTP ...)
Conclusion and what's next
❖ Logic cores for the high-data-throughput platform → employed in several scientific
applications:
   ❖ THz detector + readout system for CSR (M. Caselle, V. Judin, A.S. Müller, M.
   Siegel, N. Smale, P. Thoma, M. Weber, S. Wünsch). KIT departments IPE-IMS and
   ANKA
   ❖ An X-ray camera for phase contrast tomography (M. Caselle, A. Kopmann, Felix
   Beckmann (HZG), Joerg Burmester (HZG)). KIT and HZG
   ❖ An X-ray camera for high spatial resolution tomography (M. Caselle, M. Balzer,
   A. Kopmann, V. E. Asadchikova). Shubnikov Institute of Crystallography, Russian
   Academy of Sciences, Moscow, Russia
   ❖ Readout electronics for the ultrafast electron beam X-ray tomography system
   "ROFEX" at HZDR (proposal under discussion)
❖ New KIT-DMA (32 Gb/s) engines → developed and tested
❖ 64-bit Linux driver → under optimization
What's next
❖ Design & production of readout boards based on:
   ❖ PCIe Gen3 optical communication
   ❖ InfiniBand protocol
❖ Integration into the GPU/CPU compute infrastructure
Frame rate from 500 to 2 kfps
Bandwidth: 8 Gb/s. Future upgrade: 50 Gb/s
Recording & analysis of the time evolution of each bunch
in a multi-bunch accelerator filling scheme
32 samples inside …..
Backup slides ..  
InfiniBand: Link layer Flow Control
Credit-based link-level flow control
• Link flow control ensures NO packet loss within the fabric, even in the presence of
congestion
• Link receivers grant packet receive-buffer space credits per Virtual Lane
• Flow control credits are issued in 64-byte units
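Credit-based flow control can be illustrated with a toy model of a single virtual lane. The sender may transmit only while it holds enough 64-byte credits, and credits come back once the receiver has drained its buffer. This is a simplified sketch, not the InfiniBand wire protocol, and the class and method names are made up for illustration.

```python
CREDIT_BYTES = 64  # flow control credits are issued in 64-byte units


def credits_needed(n_bytes: int) -> int:
    """Number of 64-byte credits a packet of n_bytes consumes (rounded up)."""
    return -(-n_bytes // CREDIT_BYTES)


class VirtualLane:
    def __init__(self, receiver_buffer_bytes: int):
        # The receiver grants credits for the buffer space it has available.
        self.credits = receiver_buffer_bytes // CREDIT_BYTES
        self.buffered = []  # packet sizes waiting in the receive buffer

    def try_send(self, packet_bytes: int) -> bool:
        """Send only if credits suffice; otherwise the sender waits (no drop)."""
        need = credits_needed(packet_bytes)
        if need > self.credits:
            return False
        self.credits -= need
        self.buffered.append(packet_bytes)
        return True

    def receiver_drain(self) -> int:
        """Receiver consumes the oldest packet and returns its credits."""
        if not self.buffered:
            return 0
        size = self.buffered.pop(0)
        self.credits += credits_needed(size)
        return size


lane = VirtualLane(receiver_buffer_bytes=256)   # 4 credits
sent_first = lane.try_send(100)                 # needs 2 credits
sent_second = lane.try_send(130)                # needs 3 credits, only 2 left
lane.receiver_drain()                           # frees 2 credits
sent_retry = lane.try_send(130)
```

The second packet is held back rather than dropped, and it goes through after the receiver has freed buffer space, which is the no-loss property stated above.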
InfiniBand: application layers and latency
[Figure: Designing with InfiniBand; labels: number of nodes; UDP or TCP, FTP, ssh …]
Ref.: Introduction to InfiniBand™ for End Users,
 InfiniBand Trade Association Administration, 3855 SW 153rd Drive, Beaverton, OR 97006
[Figure labels: sample; fast HW loop; SW control loops; 2D and 3D image-based control loop]
✓ High-speed, high-bandwidth, fully programmable camera (continuous data acquisition at full speed)
✓ Optimized image-processing algorithms using GPU computing
✓ SW control loops based on 2D and 3D data evaluation:
   2D data → camera calibration, autofocus, self-alignment, etc.
   Reconstructed 3D data → e.g. optical flow
[Figure labels: Peltier cell (camera cooling); heat sink + fan; Peltier cell control board; Xilinx FPGA (for fast readout & on-line data processing); large local DDR3 memory; PCIe endpoint link (to UFO infrastructure)]
The main features, already implemented and tested, include:
✓ Fully configurable camera → adjustable image exposure time and dynamic range; analog and digital
pixel features such as pixel threshold, mask, analog gain, etc.
✓ Continuous data acquisition at full speed
✓ On-line image-based self-event trigger architecture (fast reject)
✓ Region-of-interest readout strategy using the self-event trigger information
✓ Easily extendable to any available CMOS image sensor
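The slides do not specify the trigger algorithm. As one possible reading of "self-event trigger" combined with region-of-interest readout, the sketch below compares each frame with a reference frame, rejects frames without significant change, and otherwise reports the band of rows to read out. The threshold, margin, and function names are all assumptions made for illustration.

```python
def changed_rows(frame, reference, threshold):
    """Indices of rows where any pixel differs from the reference by more than threshold."""
    rows = []
    for r, (row, ref_row) in enumerate(zip(frame, reference)):
        if any(abs(p - q) > threshold for p, q in zip(row, ref_row)):
            rows.append(r)
    return rows


def region_of_interest(frame, reference, threshold, margin=1):
    """Return (first_row, last_row) to read out, or None to reject the frame."""
    rows = changed_rows(frame, reference, threshold)
    if not rows:
        return None  # fast reject: nothing happened in this frame
    first = max(rows[0] - margin, 0)
    last = min(rows[-1] + margin, len(frame) - 1)
    return first, last


# Example: 6 x 4 pixel frames; one pixel in row 3 changes significantly.
reference = [[0] * 4 for _ in range(6)]
event_frame = [row[:] for row in reference]
event_frame[3][2] = 50

roi = region_of_interest(event_frame, reference, threshold=10)
rejected = region_of_interest(reference, reference, threshold=10)
```

Reading out only the reported row band reduces the data volume per frame, which is the motivation for coupling the trigger to the region-of-interest strategy.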




ANKA CSR (long observation time with YBCO)
Recording & analysis of the time evolution of each bunch
32 samples inside …..
Sample time resolution < 3 ps
[Figure: analog signal of a single bunch (amplifier output); FWHM = 42 ps, 20 – 200 mV]
Strategy:
Digitize each pulse with 4 samples + pulse reconstruction &
constant fraction discriminator (CFD) for a precise pulse timestamp.
Measurement of the peak amplitude of each bunch (resolution: a few mV)
Measurement of the pulse width of each bunch (resolution: a few ps)
Measurement of the relative time jitter between electron bunches
(resolution: a few ps)
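The constant-fraction idea can be shown with a simplified digital variant: the timestamp is taken where the leading edge crosses a fixed fraction of the pulse peak, using linear interpolation between samples. This is a sketch of the principle only. It skips the pulse reconstruction step mentioned above, it is not necessarily the method used in the KIT readout, and the sample values and the 10 ps sample period are invented for the example.

```python
def cfd_timestamp(samples, sample_period_ps, fraction=0.5):
    """Time (ps) at which the leading edge crosses `fraction` of the peak amplitude."""
    peak_index = max(range(len(samples)), key=samples.__getitem__)
    level = fraction * samples[peak_index]
    for i in range(1, peak_index + 1):
        low, high = samples[i - 1], samples[i]
        if low < level <= high:
            # Linear interpolation between the two samples around the crossing.
            return (i - 1 + (level - low) / (high - low)) * sample_period_ps
    return None  # no leading-edge crossing found


# Two pulses with the same shape but different amplitudes (4 samples each, in mV).
small_pulse = [0, 40, 100, 30]
large_pulse = [0, 80, 200, 60]

t_small = cfd_timestamp(small_pulse, sample_period_ps=10)
t_large = cfd_timestamp(large_pulse, sample_period_ps=10)
```

Both pulses yield the same timestamp although their amplitudes differ by a factor of two. This amplitude independence is why a constant-fraction scheme is preferred over a fixed threshold when bunch amplitudes vary between 20 and 200 mV.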
