Strategies for on-chip digital data compression for X-ray pixel
  detectors by Hammer, Mike et al.
Prepared for submission to JINST
Strategies for on-chip digital data compression for X-ray
pixel detectors
M. Hammer,^a K. Yoshii,^a A. Miceli^a
^a Argonne National Laboratory,
9700 S. Cass Ave., Lemont, IL 60439, U.S.A.
E-mail: amiceli@anl.gov
Abstract: As frame rates of X-ray pixel detectors continue to increase, photon counting will be
replaced with charge-integrating pixel array detectors for high dynamic range applications. The
continued desire for faster frame rates will also stress the ability of application-specific integrated
circuit (ASIC) designers to provide sufficient off-chip bandwidth to reach continuous frame rates
of 1 MHz. To move from the current 10 kHz to the 1 MHz frame rate regime, ASIC designers will
continue to pack as many power-hungry high-speed transceivers at the periphery of the ASIC as
possible. In this paper, however, we present new strategies to make the most efficient use of the
off-chip bandwidth by utilizing data compression schemes for X-ray photon-counting and charge-
integrating pixel detectors. In particular, we describe a novel in-pixel compression scheme that
converts from analog-to-digital converter (ADC) units to encoded photon counts near the photon Poisson
noise level, which achieves a compression ratio of > 1.5× independent of the dataset. In addition,
we describe a simple yet efficient zero-suppression compression scheme called “zeromask” (ZM)
located at the ASIC’s edge before streaming data off the ASIC chip. ZM achieves average
compression ratios of > 4×, > 7×, and > 8× for high-energy X-ray diffraction, ptychography, and
X-ray photon correlation spectroscopy datasets, respectively. We present the conceptual designs,
register-transfer level (RTL) block diagrams, and the physical ASIC implementation of these com-
pression schemes in 65 nm CMOS. When combined, these two digital compression schemes can
increase off-chip bandwidth by a factor of 6–12×.
Keywords: ASIC; Compression; Encoding; Pixel detectors.
1Corresponding author.
arXiv:2006.02639v1 [physics.ins-det] 4 Jun 2020
Contents
1 Introduction
2 Detector Architecture Overview
3 Compression
3.1 Compression in the pixel
3.1.1 Digital denoising
3.1.2 Digitally encoding the ADC output near the photon Poisson noise
3.2 Compression at the detector ASIC edge
3.2.1 Analysis of four typical X-ray datasets
3.2.2 Design encoding scheme
3.2.3 Design methodology using Chisel
3.2.4 Design of streaming edge compressor
3.2.5 Verification
3.2.6 Physical implementation
4 Outlook and Conclusions
1 Introduction
X-rays have unique potential for nanoscale resolution imaging of centimeter-sized objects [1].
Pushing X-ray imaging into the nanoscale is crucial for understanding complex hierarchical systems
on length scales from the atomic scale to the macroscale, in order to address scientific questions
ranging from materials science and biology to mechanical and civil engineering. The fourth-
generation storage ring light sources will increase X-ray beam brightness and coherent flux by
100 to 1,000 times over current values, with great advantages for science. Increases in nanoscale-
focused brightness have motivated beamline development [2] of very fast scanning microscopy
instruments with scanning rates approaching 1 MHz (i.e., 1 µs dwell times) in order to be able to
image large samples. In addition, high-energy X-ray diffraction [3, 4] provides nondestructive and
in situ 3D atomic and mesoscale information about structure and its evolution in the broad class
of single-crystal and polycrystalline materials. The combination of continuous megahertz frame
speed and high dynamic range would provide unprecedented increases in data fidelity.
While the brightness increases provided by the accelerator and optics upgrades will be essential,
these beamlines and techniques will not reach their full potential if they are limited to using detectors
of the type available today or those currently seen on the horizon. In particular, achieving full
potential requires advancing the state of the art from present few kilohertz frame rates to megahertz
frame rates. Photon-counting detectors are fundamentally unable to provide high dynamic range per
pixel at megahertz frame rates, especially given the bunched time structures of storage rings and free
electron lasers. Full exploitation of bright X-ray light sources requires use of charge-integrating
detectors to preserve high dynamic range at fast frame rates. A number of charge-integrating
detector platforms [5–7] reach toward the required dynamic range, but achieved frame rates are
≤ 10 kHz. At the other extreme, burst-mode detectors [8–12] are able to frame at 5–10 MHz but
only for a limited number of frames.
The common bottleneck of these detectors is the limited data bandwidth off the front-end
detector application-specific integrated circuit (ASIC) resulting from the purely analog design flow.
An approach to overcome this limitation is to move into the digital domain as early as possible.
The transmission rates of digital data can be higher than those of analog data. Digital signals are
less prone to corruption. Error detection and correction algorithms can ensure data integrity, and
standardized interfaces can be used. In addition, the digital data manipulation (e.g., compression)
can be incorporated on the same piece of silicon as the front-end detector. The big gain in sub-
100 nm CMOS technology is in the digital domain (i.e., clock speed, power, logic density). Using
smaller process nodes is the most effective path to high logic density. To first order, reducing the CMOS
feature size increases logic density quadratically with the feature-size reduction factor [13]. Increasing frame rates
can be achieved by using multiple high-speed multi-gigabit transceivers on the front-end ASIC.
In this paper, we present an additional means to increase the bandwidth and thus the frame rate
by using data compression on the front-end detector ASIC. While this concept (i.e., bandwidth
compression [14]) is not new, such advanced digital concepts have not been exploited for X-ray pixel
detectors. In this paper, we present new strategies to make the most efficient use of the off-chip
bandwidth by utilizing data compression schemes for photon-counting and charge-integrating pixel
detectors.
2 Detector Architecture Overview
To demonstrate our concept, we consider a hybrid X-ray pixel array detector that consists of a
passive sensor array bump-bonded to a mixed-signal ASIC. In this paper, we present designs in
a commercial 65 nm mixed-signal CMOS technology and aim for a pixel size of about 100 µm
per side. Underneath each sensor pixel, each pixel of the mixed-signal ASIC contains an analog
front-end, a gain-switching (e.g., adaptive gain or autoranging) amplifier, a small analog-to-digital
converter (ADC), and digital logic. We note that the concepts presented here can also be applied
to other detector topologies where the analog and digital logic are located on different pieces of
silicon that are bonded together. Once per frame, the ADC generates a digital value that represents
the voltage resulting from the charge integrated by the front-end during the previous frame. The
voltage and therefore the digital value from the ADC at the end of each frame are assumed to be
linearly proportional to the number of photons detected by the sensor during the previous frame.
We will utilize digital logic to shift the sample data from all ADCs off the device. A block diagram
of the pixel array detector is shown in Figure 1 that demonstrates the different pieces required for a
complete science-grade detector array on the order of 128 × 128 to 256 × 256 pixels. Figure 1 also
illustrates the logic inside each pixel. Under control of a frame sync pulse, once per frame the logic
within each pixel loads the ADC outputs into registers that constitute a long shift register spanning
an entire pixel row. While charge is being integrated by the sensor and front-end electronics
for the current frame, the digital logic is shifting out all digital samples from all pixels for the
previous frame to the edge of the pixel array. The digital logic within each pixel can also perform
preprocessing functions (e.g., denoising and encoding) on the samples prior to shifting, in order to
minimize the number of bits sent to the edge. At the edge of the ASIC, compression logic further
reduces the number of bits required to be sent through high-speed digital transmitters to an off-chip
data acquisition system (DAQ). A digital memory can be used as an elastic store to help smooth out
peaks and valleys in the data streams being transmitted.
Figure 1: Block diagram of the pixel detector ASIC architecture.
3 Compression
A detector ASIC has a finite amount of bandwidth to shift sample data from the pixels to the
ASIC’s edge and between the ASIC and external DAQ electronics. Depending on the parameters
of a particular experiment, the user of the detector may choose either to enable on-chip digital
processing and compression, thereby maximizing the frame rate, or to disable digital processing
and compression and send the uncompressed sample data off-chip at a lower frame rate or clock
frequency. The ASIC architecture we present will allow fine tuning of how the ASIC’s bandwidth
is used by offering a high level of configurability via software programming. The only limiting
factor will ultimately be the amount of bandwidth hardwired into the ASIC.
The following sections discuss three methods for minimizing the amount of data that needs
to be transmitted off-chip: (1) digitally removing noise from each ADC sample (denoise); (2)
digitally encoding ADC outputs to a bit-reduced sequence near the photon Poisson noise level;
and (3) streaming edge compression. Methods 1 and 2 can be implemented within each pixel.
Method 3 is implemented at the edge of the ASIC, outside of the active sensor area, and therefore is
independent of the pixel design and pixel size. All three data compression methods work together
to make the best use of the available ASIC bandwidth, both within the pixel array’s shift buses
and in the high-speed transmitters that send the data off chip. This in turn allows the detector to
achieve the highest possible frame rate. The first method of removing noise from each sample
helps maximize the number of transmitted sample values that equal zero, thus maximizing the
effectiveness of the edge compression algorithm. The second method reduces the number of bits
to be sent by digitally dividing the ADC output such that the resulting values efficiently encode the
number of photons detected based on the photon Poisson noise level. The third method implements
a digital compression algorithm that compacts the samples with zero values. Since all three of the
data reduction techniques are digital, they lend themselves to a high degree of programmability.
Additionally, different algorithms for edge compression can be researched, such as the incorporation
of artificial intelligence and machine learning algorithms that can autonomously adapt to the varying
data signatures of different types of experiments or transmit only important features.
3.1 Compression in the pixel
3.1.1 Digital denoising
In order to maximize the number of zero-valued pixels that are sent, the raw ADC output should
have all samples that represent only noise zeroed out. This can be achieved in the digital domain
by determining a digital noise floor value for each pixel. This is the maximum digital value on the
ADC outputs when no photons are detected. This value is determined during a calibration sequence
by exposing the detector to darkness, integrating for a specified interval, and examining the ADC
outputs. The resulting noise floor is programmed into a control register within each pixel. Any
ADC sample that is below the noise floor threshold is forced to 0 by logic. In a photon-counting
detector, this digital denoising is not necessary.
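The denoising step can be illustrated with a short Python sketch. It is only a behavioral model under stated assumptions, not the in-pixel logic itself: the function names and the dark-frame readings are hypothetical, and the threshold register is assumed to hold the largest dark value plus one so that every value observed in darkness falls strictly below it.

```python
def noise_floor_threshold(dark_samples):
    """Threshold derived from dark-frame calibration samples of one pixel.

    Assumption: the register holds max(dark) + 1, so that every value
    seen in darkness is strictly below the threshold.
    """
    return max(dark_samples) + 1


def denoise(adc_sample, threshold):
    """Force an ADC sample that is below the noise-floor threshold to 0."""
    return 0 if adc_sample < threshold else adc_sample


dark = [3, 5, 4, 2]                    # hypothetical dark-frame ADC readings
threshold = noise_floor_threshold(dark)
frame = [2, 5, 6, 40]                  # hypothetical samples from one pixel
denoised = [denoise(s, threshold) for s in frame]
```

In hardware this amounts to one comparator and a gating stage per pixel, with the threshold held in the per-pixel control register.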
3.1.2 Digitally encoding the ADC output near the photon Poisson noise
If the detector simply transmits the raw ADC output value each frame, many of the transmitted bits
are superfluous since the available resolution of the ADC may already exceed the Poisson noise
of the incoming photon flux. By performing some simple digital operations on the ADC output,
the number of bits per sample can be reduced with little loss of information. In a companion
paper, we have shown that using only 8 or 9 bits has a negligible effect on the ptychographic image
reconstructions [15]. This section describes an in-pixel encoding scheme where the ADC value is
digitally divided to create an encoded sequence with reduced bit count. Two examples are presented.
Example 1 is a simpler encoding that applies to photon-counting detectors and encodes the photon
count directly with no intermediate ADC value. Example 2 is more complex and applies to charge-
integrating detectors with gain-switching front-ends. Example 2 also includes an explanation of
how it can be implemented with simple digital logic.
Example 1, shown in Figure 2, demonstrates how to encode a photon count N into fewer
values. This example shows a 14-bit photon count (0–16383) encoded into 9 bits (0–511). As the
photon count N increases, so too does the Poisson noise, as $\sqrt{N}$. Therefore, as the photon count
increases, each encoded value can represent a larger number of photons per step without losing any
information. In the figure, the entire photon count range is divided into 6 regions; details are shown
in Table 1.
In region 1 of Figure 2 and Table 1, the encoded values each represent 1 photon per step.
These values cannot be coarsened since detection of single photons is an important requirement
for a detector. In region 2, however, the Poisson noise is at least 4, so each encoded value can represent
4 photons per step, and the entire 48-photon region can be encoded into 12 values.
[Figure 2 plot, "14 bit Photon Count Encoded into 9 bits": encoded value M (0–511) versus photon count N (0–16384). Region boundaries lie at N = 16, 64, 256, 1024, and 4096; the Poisson noise sqrt(N) is 4, 8, 16, 32, 64, and 128 at N = 16, 64, 256, 1024, 4096, and 16384. Encoded values used by regions 1–6: 0–15, 20–31, 40–63, 80–127, 160–255, 320–511 (388 steps total).]
Figure 2: Encoding for a 14-bit photon-counting front-end detector into 9 bits. The photon count
range is divided into 6 regions. At the start of each region, the Poisson noise is shown as well as
the encoding in each range.
Region | Photon Count Range | Number of Photons per Encoded Step | Number of Encoded Steps
1      | 0–15               | 1                                  | 16
2      | 16–63              | 4                                  | 12
3      | 64–255             | 8                                  | 24
4      | 256–1023           | 16                                 | 48
5      | 1024–4095          | 32                                 | 96
6      | 4096–16383         | 64                                 | 192
Table 1: Number of encoded steps per photon count for the photon-counting front-end example
shown in Figure 2.
In region 6, for a detector to report individual photon counts of 4,096–16,383 is wasteful of
bandwidth since the Poisson noise is 64 or greater in this region. By encoding the photon counts
such that each encoded value represents a range of 64 photon counts, the number of values is
reduced from 12,288 to just 192 with essentially no loss of information. The greatest efficiency in
encoding comes at higher photon counts where the Poisson noise allows each unique encoded value
to represent a larger range of photon counts.
As will be explained in Example 2, the boundaries of each region are chosen to be powers of
2 to simplify the encoding logic. An artifact of this is that there are gaps in the encoding sequence
since not all encoded values between 0 and 511 are needed.
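Example 1 can be sketched in Python as follows. The per-region shift and offset values are not stated explicitly in the text; they are inferred here from the step sizes in Table 1 and the encoded ranges shown in Figure 2, so they should be read as an assumption. The names `REGIONS` and `encode_count` are hypothetical.

```python
# (exclusive upper bound of region, right shift, offset) for regions 1-6.
# A right shift by k is a division by 2**k, i.e. photons per encoded step.
REGIONS = [
    (16,    0, 0),    # 1 photon per step   -> encoded 0..15
    (64,    2, 16),   # 4 photons per step  -> encoded 20..31
    (256,   3, 32),   # 8 photons per step  -> encoded 40..63
    (1024,  4, 64),   # 16 photons per step -> encoded 80..127
    (4096,  5, 128),  # 32 photons per step -> encoded 160..255
    (16384, 6, 256),  # 64 photons per step -> encoded 320..511
]


def encode_count(n):
    """Encode a 14-bit photon count into a 9-bit value."""
    if not 0 <= n < 16384:
        raise ValueError("photon count must fit in 14 bits")
    for upper, shift, offset in REGIONS:
        if n < upper:
            return (n >> shift) + offset


codes = [encode_count(n) for n in range(16384)]
used = len(set(codes))   # number of distinct encoded values actually used
```

Because every region boundary and every offset is a power of 2, the encoded values increase monotonically with N but leave the gaps noted above (for example, 16–19 and 32–39 are never produced).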
Example 2 is shown in Figure 3 and assumes a pixel design with a charge-integrating front-end
[Figure 3 plot, "Encoding a charge-integrating front-end detector": 12-bit ADC output (0–4095) versus photon count N for the high-, mid-, and low-gain regions, with the encoded value M and ADC divisor D per region:
Region 1 (high gain): 16 steps of 1 photon each, M = ADC/64, encoded values 0–15
Region 2 (high gain): 12 steps of 4 photons each, M = ADC/256 + 16, encoded values 20–31
Region 3 (mid gain): 32 steps of 7.5 photons each, M = ADC/32 + 32, encoded values 32–63
Region 4 (mid gain): 48 steps of 15 photons each, M = ADC/64 + 64, encoded values 80–127
Region 5 (low gain): 128 steps of 30 photons each, M = ADC/8 + 128, encoded values 128–255
Region 6 (low gain): 192 steps of 60 photons each, M = ADC/16 + 256, encoded values 320–511
Region boundaries lie at N = 16, 64, 304, 1024, and 4864; the Poisson noise sqrt(N) is 4, 8, 17.4, 32, 69.7, and 128 at N = 16, 64, 304, 1024, 4864, and 16384. 428 steps total.]
Figure 3: Charge-integrating encoding showing the number of photons as a function of a 12-bit
ADC digital output for a three-stage gain front-end. The Poisson noise and the encoding of ADC
outputs into bit reduced sequence are shown. The required number of transmitted bits per pixel is
reduced from 14 bits to 9 bits.
and automatic gain switching. This example assumes 3 analog gain regions: high gain, medium
gain, and low gain, as is typical in existing detectors. The three gain regions are each subdivided
into two subregions, making 6 regions total, as labeled in the figure. This example shows how the
encoding algorithm can be implemented in a few logic gates, making it practical to incorporate
within each pixel and thereby reducing the number of bits that need to be sent from each pixel
each frame. Each gain region is associated with a specific detected photon count range. Within
each region there is assumed to be a linear relationship between the number of photons detected
and the ADC output. Typically detectors might transmit the 12-bit ADC value to the DAQ each
frame, along with 2 gain bits indicating the gain region, for a total of 14 bits (ADC+gain) per pixel
per frame.
In the high gain region one can easily see that reporting a 14-bit digital value corresponding
to an ADC output of 0–4095 but representing only 0–63 detected photons is wasteful since the
photon count could be encoded into only 6 bits. This can be achieved by dividing the ADC output
by 64 and sending the resulting value. The number of encoded values can be even further reduced
by taking advantage of the algorithm described in Figure 2 and increasing the photon count per
encoded value as the Poisson noise increases. Figure 3 shows encoding photon counts of 16–63
into 12 values representing 4 photons per step. This reduces the original 0–63 photon counts to just
28=16+12 encoded values.
As described earlier, the encoding efficiency increases as the photon count increases. In the
low gain region our example detects 1,024 to 16,383 photons represented by ADC output values of
0–4095. Each ADC step represents less than 4 photons ((16384 − 1024)/4096 = 3.75). However, the Poisson
noise is >32 in the low gain region. The ADC resolution exceeds what the Poisson noise of the
incident beam supports. The reported value could be reduced from 4,096 unique ADC values down
to 480=(16384 - 1024)/32 encoded values, and the digitization noise would be at least as good as
the Poisson noise in all cases. Additionally, if the encoding logic divides the low gain region into
two subregions such that more photon counts are encoded into each step after the ADC reaches 1/4
of its full range, then the total number of encoded values is reduced even further to 320=128+192.
The threshold values chosen in Figure 3 have been carefully selected so that the encoding
algorithm can easily be implemented in digital logic with very few gates. This allows the circuit
to be added to each pixel since it will occupy very little space and consume very little power. To
encode the ADC+gain bits, the digital logic performs the following three steps:
1. Determine which of 6 subregions the ADC is reporting for. In logic, this means examining
the 2-bit gain value to determine 1 of 3 gain regions and examining the 2 most-significant
bits (MSBs) of the ADC output to determine whether the value is above or below 1,023.
2. Divide the 12-bit ADC output by the divisor D. In digital logic dividing by a power of 2 is
just a simple binary shift. For example, to divide the 12-bit ADC output representing values
0–1023 by a divisor of 64 means removing the 6 least significant bits and padding the top 6
MSBs with zeros. No digital divider circuit is actually needed.
3. Add the divided value from step 2 with an offset value to create a monotonically increasing
encoding. In our example, all of the added offsets are powers of 2. For example, region 6
requires M=ADC/16 + 256. Dividing by 16 means right shifting the 12-bit ADC value by 4
bit positions. Adding 256 means setting the resulting 9th bit position to 1. No digital adder
circuit is actually needed.
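The three steps can be modeled in a few lines of Python. This is a behavioral sketch rather than the RTL: the numeric gain codes (`HIGH`, `MID`, `LOW`) and the names `SHIFT_OFFSET` and `encode_adc` are hypothetical, while the shifts and offsets follow the M = ADC/D + offset expressions given for the six regions in Figure 3.

```python
HIGH, MID, LOW = 0, 1, 2   # hypothetical values of the 2-bit gain code

# (gain region, ADC above 1023?) -> (right shift = log2(D), offset)
SHIFT_OFFSET = {
    (HIGH, False): (6, 0),     # region 1: M = ADC/64
    (HIGH, True):  (8, 16),    # region 2: M = ADC/256 + 16
    (MID,  False): (5, 32),    # region 3: M = ADC/32 + 32
    (MID,  True):  (6, 64),    # region 4: M = ADC/64 + 64
    (LOW,  False): (3, 128),   # region 5: M = ADC/8 + 128
    (LOW,  True):  (4, 256),   # region 6: M = ADC/16 + 256
}


def encode_adc(gain, adc):
    """Encode a 12-bit ADC value plus gain code into a 9-bit value."""
    # Step 1: pick the subregion from the gain bits and the 2 ADC MSBs.
    upper = (adc >> 10) != 0
    shift, offset = SHIFT_OFFSET[(gain, upper)]
    # Step 2: divide by a power of 2 with a right shift.
    # Step 3: the offset bit never overlaps the shifted value, so a
    # bitwise OR stands in for the addition.
    return (adc >> shift) | offset


all_codes = {encode_adc(g, a) for g in (HIGH, MID, LOW) for a in range(4096)}
```

The OR in the last step is valid only because each offset is a power of 2 larger than the largest shifted value in its subregion, which is the property the threshold choices in Figure 3 provide.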
Figure 4 shows the encoding in truth table format. There are gaps in the final encoded sequence
where there are unused values. These are because the offset additions are chosen to be powers of
2 to make the digital logic simpler. These unused values could be removed in logic such that the
final sequence has no gaps, but this would come at the expense of more complicated digital logic,
and it ultimately would not change the detector performance.
Figure 5 shows a block diagram of the logic that implements this encoding algorithm that
has been coded in RTL, synthesized, and routed in 65 nm standard cell CMOS technology. At
approximately 25 µm per side this encoding logic is small enough to be incorporated into each
pixel. Alternatively, since it is small and will run fast, it can be placed at the edge of the ASIC and
shared across pixel data streaming from the row throughout the frame. The trade-off in placing the
encoding logic within each pixel or at the edge of a pixel row is more digital logic overall with fewer
shift bus wires versus less digital logic but more wires. An example layout of a pixel containing
this logic is shown.
For comparison, a different circuit is shown in Figure 6 that incorporates a 12-bit programmable
integer divider. This design was coded in RTL and implemented. This circuit allows more flexibility
in the algorithm since divisors and threshold values can be selected after the ASIC is designed, unlike
the fixed power-of-2 thresholds and divisions in the algorithm presented above. The programmable
divider approach comes at the expense of about 4× as much digital logic and more than 2× the
Binary Truth Table for Encoded Photon Count Outputs
Gain Region | ADC output | # Photons  | Photons per step | Poisson noise | Leading bits (bit8 downward) | Steps          | Value reported
High        | 0–1023     | 0–15       | 1                | 1–4           | 0 0 0 0 0                    | 16             | 0–15
High        | 1024–4095  | 16–63      | 4                | 4–8           | 0 0 0 0 1                    | 16 (12 used)   | 20–31
Mid         | 0–1023     | 64–303     | 7.5              | 8–16          | 0 0 0 1                      | 32             | 32–63
Mid         | 1024–4095  | 304–1023   | 15               | 16–32         | 0 0 1                        | 64 (48 used)   | 80–127
Low         | 0–1023     | 1024–4863  | 30               | 32–64         | 0 1                          | 128            | 128–255
Low         | 1024–4095  | 4864–16383 | 60               | 64–128        | 1                            | 256 (192 used) | 320–511
Note: in the partially used rows, the remaining binary values are not used since those values are reported in the previous region.
Figure 4: Truth table of the encoding of ADC outputs into bit-reduced sequence.
Figure 5: Block diagram and physical layout of digital logic for in-pixel encoding algorithm.
power as does the fixed division approach, but it can provide a similar compression and reduction
of the number of bits that need to be transmitted. The digital logic of either approach can support
many design features, such as disabling of compression and checks to ensure that valid sample data
are being transmitted.
An alternative implementation of the encoding algorithm could use an SRAM programmed as
a lookup table. The ADC output and gain bits could be applied to the read address inputs of the
SRAM, and the output read data would be the encoded value. A hybrid approach can use logic to
Figure 6: Block diagram and physical layout of digital logic for a pixel including 12-bit pro-
grammable divider and denoiser.
reduce the required address space by assigning groups of larger photon count ranges to common
read addresses, minimizing the size of the SRAM needed. This approach allows the encoded values
to be programmed after device fabrication. It can also allow the encoded values to more closely
follow the photon Poisson noise curve. In 65 nm technology the physical size of the memory
required would be too large to place one in each pixel, but it would be possible to place look-up
tables at the ends of the pixel rows and time share them across pixels.
The in-pixel encoding scheme described in this section differs from traditional data compression
schemes in two ways. First, this encoding scheme produces a fixed-bit reduction percentage regardless
of the nature of the data, which simplifies the digital transmission logic. Second, there is no need to
decompress the data off the chip since we have simply represented the original data in a bit-reduced
manner.
3.2 Compression at the detector ASIC edge
The in-pixel encoding scheme described above makes no assumptions about the nature of the data
measured by the detector. However, we can exploit redundancies in the data to further compress
the data. In this section, we describe a lossless compressor located at the ASIC edge that exploits
the most common redundancy in X-ray data — the zeros. The aim of lossless compression is to
reduce data by identifying and eliminating statistical redundancy. The characteristics of input data
affect both the complexity of the design and the compression ratio. In general, complex designs
require significant effort to validate and test in an ASIC. Our goal is to identify a simple design
that fulfills our requirements: high compression ratios and stall-free operation. First, it must yield
reasonably high compression ratios for typical X-ray datasets. Second, operation must be stall-free;
it must process input pixel data shifted from the pixel array every single cycle without causing any
stall cycle or dropping any data. The design should also be simple and have a minimal resource
footprint on the pixel detector ASIC with no dependency on any external IP libraries, which also
reduces the ASIC validation and testing efforts dramatically.
Dataset           | Mean (%) | Std Dev (%) | Minimum (%) | Maximum (%) | Offline gzip compression ratio
High-energy XRD   | 83.42    | 2.49        | 68.96       | 85.03       | 19
Ptychography      | 97.42    | 0.23        | 96.2        | 97.77       | 70
XPCS concentrated | 98.62    | 0.03        | 98.02       | 98.68       | 67
XPCS dilute       | 99.9     | 0.01        | 99.8        | 99.91       | 351
Table 2: Statistics of zero-valued pixels per frame and offline software-based compressor results.
3.2.1 Analysis of four typical X-ray datasets
We have analyzed four typical X-ray datasets to guide the design of the edge compressor. The first
dataset is typical of high-energy X-ray diffraction (XRD) experiments of polycrystalline materials;
the data consist of 300 images from a time-resolved additive manufacturing experiment at APS
1-ID-E taken with a Dectris Pilatus3 X CdTe 2M detector. The second dataset is typical of a
ptychography experiment; the data consist of 1,737 images from a ptychography experiment at APS
2-ID-D taken with the Dectris Eiger X 500K detector. The third and fourth datasets consist of
typical X-ray photon correlation spectroscopy (XPCS) experiments with concentrated and dilute
samples of polymer spheres diffusing in glycerol; the data consist of 1,000 images taken at APS
8-ID-I with the X-Spectrum Lambda 750K detector. Figure 7 shows a single image from each
dataset. All four datasets were taken with photon-counting detectors, which we use to emulate a
charge-integrating detector, and have been digitally denoised as described in Section 3.1.1. As one
can see, the majority of pixels in all the images are zero. Table 2 gives statistics of the zero-valued
pixels in the high-energy XRD, ptychography, XPCS concentrated, and XPCS dilute datasets,
together with the compression ratio that offline gzip compression achieves, as a reference.
The percentage of zero-valued pixels is calculated every frame. The mean, standard deviation,
minimum, and maximum are calculated from the percentage of zero-valued pixels of all frames.
The compression ratio of the gzip offline compressor, which employs Lempel-Ziv coding (LZ77),
is obtained from the ratio between the original file size and the compressed file size of a data set.
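The quantities in Table 2 can be reproduced with a short Python sketch along the following lines. The function names are hypothetical, frames are assumed to be flat lists of pixel values, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used. The gzip ratio here is computed on an in-memory byte string rather than on files.

```python
import gzip
import statistics


def zero_percentages(frames):
    """Percentage of zero-valued pixels, computed once per frame."""
    return [100.0 * sum(1 for p in f if p == 0) / len(f) for f in frames]


def summarize(frames):
    """Mean, standard deviation, minimum, and maximum over all frames."""
    pct = zero_percentages(frames)
    return {
        "mean": statistics.mean(pct),
        "std": statistics.pstdev(pct),   # assumption: population std dev
        "min": min(pct),
        "max": max(pct),
    }


def gzip_ratio(raw):
    """Original size divided by gzip-compressed size."""
    return len(raw) / len(gzip.compress(raw))


frames = [[0, 0, 0, 1], [0, 1, 1, 1]]   # two tiny hypothetical frames
stats = summarize(frames)
```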
3.2.2 Design encoding scheme
Since the majority of pixels in all the datasets are zero-valued pixels, a natural thought is to employ
run-length encoding (RLE) [16], which leverages the consecutiveness of the same value in the input
sequence. With a concrete example, an input sequence [2, 0, 0, 0, 0, 0, 0, 0] would be encoded
into [1, 2, 7, 0] using one pixel storage (e.g., 9 bits) for the number of the occurrences of a pixel
value and another pixel storage for the pixel value itself. In this example, the compression ratio
is 2. Theoretically, as the length of the input sequence increases, the compression ratio increases.
According to the percentage of zero-valued pixels in Table 2, approximately 5 out of 6 pixels,
38 out of 39 pixels, 71 out of 72 pixels, and 999 out of 1,000 pixels are zero-valued in
high-energy XRD, ptychography, XPCS concentrated, and XPCS dilute, respectively, which allows
us to estimate the length of runs of zero-valued pixels. If the 5 zero-valued pixels out of 6 are contiguous
(the best case), four pixels are needed to encode 6 pixels, which yields 1.5× in the compression
Figure 7: Four representative images for the (a) high-energy XRD, (b) ptychography, (c) XPCS
concentrated, and (d) XPCS dilute datasets. Pixel values equal to zero are displayed in black, and
≥ 1 in white. Note that the ptychography dataset taken with the Eiger 500K has been cropped to
558 × 514 pixels.
ratio on high-energy XRD. Applying the same calculation to XPCS dilute, which has the highest
percentage of zero-valued pixels, the theoretically possible maximum compression ratio would be
250×. However, zero values are typically nonuniformly distributed through an X-ray image, which
leads to higher variability in compression ratios. In the worst case for RLE, the output size is twice
the input size. Such high variability in the encoded data size imposes a significant challenge to
hardware design when it comes to I/O handling and can possibly be a source of stall. Thus we need
an encoding scheme whose encoded data has little variability in the encoded data size.
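The RLE behavior discussed above can be checked with a minimal Python sketch. It uses the (count, value) pair layout from the example in the text; the function name is hypothetical, and the sketch ignores the limit that a 9-bit count field would place on the run length.

```python
def rle_encode(pixels):
    """Run-length encode a sequence as a flat list of (count, value) pairs."""
    out = []
    i = 0
    while i < len(pixels):
        run = 1
        # Extend the run while the next pixel repeats the current value.
        while i + run < len(pixels) and pixels[i + run] == pixels[i]:
            run += 1
        out += [run, pixels[i]]
        i += run
    return out


example = rle_encode([2, 0, 0, 0, 0, 0, 0, 0])   # the sequence from the text
best_xrd = rle_encode([3, 0, 0, 0, 0, 0])        # 5 contiguous zeros out of 6
worst = rle_encode([1, 2, 3, 4])                 # no two neighbors equal
```

The last case shows the source of the variability: with no repeated neighbors every input pixel costs two output words, so the output is twice the input size.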
In this paper, we propose a new compression encoding scheme named “zeromask” (ZM) that
compresses data by leveraging zero pixels. Figure 8 depicts how ZM encodes input data with an
example. The output of ZM consists of metadata and the nonzero-valued pixels, where the metadata
is a sequence of bits that preserves the original position of zero and nonzero pixels in the input
segment. Unlike RLE, this approach does not require zero pixels to be contiguous and is thus
effective with a more random input as long as it contains zero pixels. To simplify the logic, we
repurpose a pixel storage for the metadata. With 9-bit pixel storage, a reasonable choice for the
number of input pixels (N) to each ZM compressor is 8 instead of 9, considering that the
number of pixels in a pixel array is a power of 2. The maximum length of the encoded data
is N + 1, when all pixels are nonzero; this yields the worst compression ratio (0.88× with N = 8),
which still outperforms the 0.5× worst case of RLE. The minimum length of the encoded data is
one, when all pixels are zero and only the metadata is sent; this yields the best compression ratio (8× with N = 8).
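A behavioral sketch of ZM in Python is given below. The bit convention of the metadata is an assumption (bit i is set when input pixel i is nonzero); the figure may use a different ordering. The function names are hypothetical.

```python
def zm_encode(pixels):
    """Zeromask-encode N pixels into [metadata, nonzero pixels...]."""
    mask = 0
    nonzero = []
    for i, p in enumerate(pixels):
        if p != 0:
            mask |= 1 << i        # assumption: bit i marks pixel i as nonzero
            nonzero.append(p)
    return [mask] + nonzero


def zm_decode(words, n=8):
    """Rebuild the original N pixels from a zeromask-encoded word list."""
    mask = words[0]
    values = iter(words[1:])
    return [next(values) if (mask >> i) & 1 else 0 for i in range(n)]


sparse = [0, 5, 0, 0, 7, 0, 0, 0]
encoded = zm_encode(sparse)        # metadata followed by the two nonzero pixels
all_zero = zm_encode([0] * 8)      # metadata only: best case, 8x
all_set = zm_encode([1] * 8)       # N + 1 words: worst case, 8/9
```

Unlike RLE, the encoded length depends only on how many pixels are nonzero, not on where they sit, which bounds the output between 1 and N + 1 words.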
An obvious restriction of ZM is that the maximum compression ratio is limited by the number
of input pixels and the number of bits per pixel, which is equal to the number of bits per metadata.
To relax this restriction, we employ a bit shuffling operation that can effectively increase the input
size by shuffling bits (Figure 9), which greatly resembles BitShuffle [17], a software implementation
of a bit-level transpose operation. We denote the original input by $p^{(i)}_j$, where i is the index of
the pixel and j is the bit index. The bit-shuffling operation simply converts $p^{(i)}_j$ to $q^{(j)}_i$. This operation
resembles a matrix transpose and is reversible and thus decodable. With a 9-bit pixel, it converts 16
input pixels to 9 pixels with 16-bit resolution, which is large enough for metadata. The bit-shuffling
operation is simply a set of wires between the input bits and the output bits in a correct order and
– 11 –
Figure 8: Zeromask encoding scheme.
requires no logic circuit in ASIC; it is inexpensive to implement and verify. After the bit shuffling,
we can simply use ZM to compress data. No redesign of ZM is needed; only the parameters need
to be changed. The bit shuffling is effective for the X-ray datasets, particularly high-energy XRD,
where a majority of non-zero pixels use only the lower bits such as the pixel value 1–4 or three least
significant bits. For example, a sequence of eight 9-bit input pixels [1, 2, 1, 1, 0, 2, 1, 1] is converted to
a sequence of nine 8-bit words [179, 68, 0, 0, 0, 0, 0, 0, 0].1
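The bit-level transpose can be illustrated with a short software model. This Python sketch is an illustration only; the function names are hypothetical, and the bit ordering (pixel 0 maps to the most significant bit of each output word) is an assumption inferred from the binary expansion 179 = 10110011 given in the footnote.

```python
def bit_shuffle(pixels, bitwidth):
    """Transpose bits: output word j collects bit j of every input pixel."""
    n = len(pixels)
    words = []
    for j in range(bitwidth):
        word = 0
        for i, p in enumerate(pixels):
            # Assumed ordering: pixel 0 lands in the most significant bit.
            word |= ((p >> j) & 1) << (n - 1 - i)
        words.append(word)
    return words


def bit_unshuffle(words, n):
    """Inverse transpose: recover the n original pixels."""
    return [
        sum(((w >> (n - 1 - i)) & 1) << j for j, w in enumerate(words))
        for i in range(n)
    ]


pixels = [1, 2, 1, 1, 0, 2, 1, 1]          # eight 9-bit pixels with small values
shuffled = bit_shuffle(pixels, bitwidth=9)  # nine 8-bit words
print(shuffled)
assert bit_unshuffle(shuffled, len(pixels)) == pixels
```

Because only the two lowest bit planes are populated in this input, seven of the nine shuffled words are zero, which is the sparsity that ZM then exploits.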
To evaluate different encoding schemes, we also developed a software tool2 that reads actual
X-ray datasets from a file and compresses the input data by using different encoding schemes. Each
dataset frame is divided into chunks of eight or sixteen rows, depending on the input size of the encoder. For
example, with the XPCS datasets (1556 × 512 pixels), each frame is broken into 64 chunks with
eight input pixels or 32 chunks with sixteen input pixels, where each chunk has 1,556 columns.
The columns in each chunk are fed into the encoder one by one, which emulates the shift operation of the
pixel array. The tool repeats this process for all chunks in each frame. We evaluate the compression
ratio of all frames in all four X-ray datasets. The compression ratio is calculated by dividing the
total number of input pixels per frame by the total number of encoded pixels. Tables 3, 4,
5, and 6 show the estimated compression ratios on the four datasets with different encoding schemes.
The bit width of the input pixels is 9 bits. Of the three encoding schemes (run-length, zeromask, and
shuffled zeromask), the shuffled zeromask scheme achieves the highest compression ratio for all four
datasets and the lowest relative standard deviation, and the zeromask scheme achieves the second
best compression ratio with a comparatively low relative standard deviation.
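The per-frame compression ratio defined above can be sketched as follows. This Python fragment is a hypothetical model of the evaluation loop for the plain zeromask scheme, not the authors' Scala tool; it assumes a frame is given as a list of rows and that the row count is a multiple of the encoder input size.

```python
def zeromask_length(column):
    """Encoded length in pixels: one metadata word plus the nonzero pixels."""
    return 1 + sum(1 for p in column if p != 0)


def frame_compression_ratio(frame, n):
    """frame: list of rows (lists of pixel values); n: rows per chunk."""
    rows, cols = len(frame), len(frame[0])
    assert rows % n == 0, "sketch assumes the row count is a multiple of n"
    encoded = 0
    for top in range(0, rows, n):      # one chunk of n rows
        for c in range(cols):          # feed the columns one by one
            column = [frame[top + r][c] for r in range(n)]
            encoded += zeromask_length(column)
    return (rows * cols) / encoded


# Synthetic 16 x 4 frame with a single nonzero pixel (illustrative data only).
frame = [[0] * 4 for _ in range(16)]
frame[3][2] = 5
print(frame_compression_ratio(frame, n=8))
```

For a 512-row XPCS frame, `n=8` gives the 64 chunks mentioned in the text and `n=16` gives 32.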
1 The decimal number 179 is 10110011 in binary, and 68 is 01000100.
2 The tool is written in Scala so that we can reuse its code in the Chisel testbenches to verify the outputs from the simulated
hardware.
Figure 9: Bit-shuffling scheme.
Encoding            Input pixels   Mean CR   Std Dev CR   Minimum CR   Maximum CR
Run-length                     8     1.407        0.083        1.006        1.476
Run-length                    16     1.618        0.113        1.084        1.715
Zeromask                       8     3.457        0.229        2.295        3.639
Shuffled Zeromask             16     4.177        0.134        3.527        4.289
Table 3: Compression ratios (CR) of various encoding schemes: high-energy XRD.
Encoding            Input pixels   Mean CR   Std Dev CR   Minimum CR   Maximum CR
Run-length                     8     3.264        0.049        3.018        3.324
Run-length                    16     5.395        0.139        4.709        5.574
Zeromask                       8     6.630        0.097        6.130        6.735
Shuffled Zeromask             16     7.307        0.073        6.923        7.398
Table 4: Compression ratios (CR) of various encoding schemes: ptychography.
Encoding            Input pixels   Mean CR   Std Dev CR   Minimum CR   Maximum CR
Run-length                     8     3.577        0.008        3.421        3.591
Run-length                    16     6.380        0.028        5.869        6.434
Zeromask                       8     7.201        0.016        6.901        7.231
Shuffled Zeromask             16     8.314        0.016        8.023        8.346
Table 5: Compression ratios (CR) of various encoding schemes: XPCS concentrated.
Encoding            Input pixels   Mean CR   Std Dev CR   Minimum CR   Maximum CR
Run-length                     8     3.948        0.002        3.907        3.954
Run-length                    16     7.779        0.010        7.607        7.803
Zeromask                       8     7.934        0.003        7.874        7.942
Shuffled Zeromask             16     8.899        0.004        8.843        8.911
Table 6: Compression ratios (CR) of various encoding schemes: XPCS dilute.
3.2.3 Design methodology using Chisel
Before we describe the design and implementation of the streaming edge compressor, we describe
our design methodology. To improve both the productivity and quality of digital circuit designs,
leveraging the power and flexibility of the emerging hardware design language named Chisel [18],
we have designed and implemented a hardware compressor generator framework (Figure 10) that
generates a concrete compressor RTL design.
Our compressor design is highly parameterized and fully written in Chisel so that it can
generate a wide range of hardware configurations (e.g., different numbers of input/output pixels
and bit widths and, eventually, different compression algorithms). The generator framework not
only provides a mechanism to generate synthesizable Verilog code for a streaming compressor
with specified design parameters such as input/output sizes and bit width but also provides a fully
integrated, flexible test harness and test pattern generator that allow us to test the compressor
design with various test patterns before the Verilog is synthesized and implemented in a particular
ASIC technology node.
Chisel is rapidly gaining popularity in digital circuit design and has been used in various ASIC
and FPGA projects. A notable project based on Chisel is the Rocket Chip generator [19], which
has generated RISC-V processor RTL designs with several tape-outs. Chisel is an open-source
hardware construction language, implemented as class libraries of the Scala language.
With the power of Scala, a modern functional, object-oriented language, it allows designers to
express complicated circuit blocks far more easily and concisely than hardware description
languages (HDLs) allow; this improves readability and reduces errors. The important characteristics
of Chisel are zero-cost abstraction, a higher level of expressiveness, and controllability. Chisel is not a
high-level synthesis tool. Chisel not only provides abstractions of digital circuit primitives such
as combinational logic, wires, registers, and memories but also provides some basic structures such
as multiplexers, first-in first-out (FIFO) queues, and shift registers, which are defined in the Chisel
standard library. Additionally, larger building blocks such as floating-point operations are available
as open source. Chisel also includes fully integrated frameworks for unit tests so that developers can
write unit tests in Scala, and it seamlessly integrates external Verilog simulators such as Verilator.
More importantly, it is fairly easy to install and use; a simple command-line invocation of Chisel will
generate Verilog code, run the Scala testbench, and invoke Verilator to generate a value change dump
(VCD) file.
Figure 10: Hardware compressor generator framework.
Listings 1 and 2 compare Verilog and Chisel using a simple delay circuit.
Although it is impossible to show the whole of Chisel with a small example, this example
illustrates some fundamental features of Chisel compared with Verilog. First of all, the Chisel version
has no clock or reset signals and no always blocks because Chisel automatically infers the clock and reset signals
when needed. Since the default form of the state element provided by Chisel is a positive-edge-triggered
register that supports synchronous reset, no always blocks are needed. In this example, RegNext()
delays its input by one cycle and copies it to the output. Restricting the state element in this way enforces
a design guideline transparently and makes Chisel syntax more concise at the cost of flexibility.
Chisel provides frequently used data types such as UInt, SInt, and Bool, instead of a range of bits as in
Verilog, which also improves readability.
Listing 1: Simple delay circuit in Verilog
    module delay (
      input        clock,
      input        reset,
      input  [7:0] io_in,
      output [7:0] io_out
    );
      reg [7:0] r0;
      reg [7:0] r1;
      assign io_out = r1;
      always @(posedge clock) begin
        r0 <= io_in;
        r1 <= r0;
      end
    endmodule
Listing 2: Simple delay circuit in Chisel
    import chisel3._

    class delay extends Module {
      val io = IO(new Bundle {
        val in  = Input(UInt(8.W))
        val out = Output(UInt(8.W))
      })
      val r0 = RegNext(io.in)   // first one-cycle delay
      val r1 = RegNext(r0)      // second one-cycle delay
      io.out := r1
    }
Chisel provides powerful Scala-based testbench frameworks. Designers can leverage the power
and features of a modern general-purpose language to test their circuit designs. All Scala libraries
(I/O, math, statistics, etc.), for example, can be used to test their circuits. Chisel also invokes
Verilator to generate fast, cycle-accurate C++ simulators from the Verilog code generated by
Chisel.
3.2.4 Design of streaming edge compressor
The streaming edge compressor is the data compression logic that sits between the pixel array and peripheral
logic such as a network interface or a memory. The two most important design considerations for
the streaming compressor are being stall-free and having a small resource footprint. Since the
pixel array generates (column) pixel data continuously every cycle, the compression logic
must process data without any stall cycles or dropped data. For this design we avoided inferring
a memory large enough for the compression logic to temporarily store pixel data because of two
factors: (1) the length of each experiment, which determines the size of a temporary memory, is
unknown, and (2) resources (e.g., the number of transistors) are scarce on the detector ASIC
chip.
Figure 11 shows the basic architecture of the streaming edge compressor. The main part
of the streaming compressor is the encoding stage, which is an implementation of the zeromask
compressor. The coalescing stage is needed for fixed-size peripherals (such as SRAMs, memory
controllers). The encoding stage generates encoded data every cycle. The size of the encoded data
varies depending on the content of the input data. The coalescing stage packs these variable-sized
encoded data fragments into an internal buffer whose size is a multiple of the size of a column pixel
data (e.g., 256 bits) before sending it to a target peripheral device. The number of cycles required
until the internal buffer becomes ready to be read is also variable.
Figure 11: Streaming edge compressor basic design.
Encoding stage. The encoding stage receives column pixel data from the pixel array every cycle,
performs the ZM encoding, and generates encoded data whose size varies between 1 and N + 1 pixels,
where N is the number of input pixels.
Figure 12 shows how the encoding stage generates an encoded data fragment.
An encoded data fragment consists of a metadata header and a series of nonzero pixels, where the
metadata is designed to fit into a single pixel.
Figure 12: Encoding stage: metadata creation and packing.
The metadata creation logic is basically a combination of comparators and bit-shift operations.
Chisel allows such circuits to be expressed concisely. Listing 3 shows the actual code
used in our Chisel-based generator framework to generate the mask bits. The default parameters
are defined as class parameters (npixels is 8 and pixelbitwidth is 9 in this example) and
can be changed when the Header class is instantiated. With the default parameters, the I/O block
defines eight unsigned integer values (eight pixels) as input and a single unsigned integer (one pixel) as output. The
VecInit.tabulate() method creates a vector whose elements each hold the mask value
for one pixel. If a pixel is nonzero, the mask value is 1 left-shifted by i, where i is the zero-based
index of the pixel (e.g., 1 << 0 for the first pixel). If a pixel is zero, the mask value is simply zero.
The vector is then reduced to a single integer by ORing all of its elements to
generate the metadata header.
Listing 3: Metadata generation in Chisel
    import chisel3._

    class Header(val npixels: Int = 8, val pixelbitwidth: Int = 9) extends Module {
      val io = IO(new Bundle {
        val in  = Input(Vec(npixels, UInt(pixelbitwidth.W)))
        val out = Output(UInt(pixelbitwidth.W))
      })
      // Bit i of the mask is set when pixel i is nonzero.
      val metadata = VecInit.tabulate(npixels)(i => (io.in(i) =/= 0.U).asUInt << i)
      io.out := metadata.reduce(_ | _)
    }
The rest of the encoding logic is the packing logic, which removes zero pixels from the input pixel
data and generates a sequence containing only the
nonzero pixels while preserving their order (Figure 13).
Internally the packing logic consists of N copies of a block named ShiftUp. Each ShiftUp block
receives pixel data (N pixels) and the pos indicator that holds the index of the current pixel position.
It then outputs updated pixel data and an updated pos that holds the pixel index for the next block. All
ShiftUp blocks are connected sequentially; the outputs of each block are connected to the
inputs of the next, except for the first and last blocks. The first ShiftUp block receives
pixel data from the pixel array, with the initial value of pos set to zero. The outputs of the last
ShiftUp block are connected to the coalescing stage; they comprise the pixel data that hold the packed
pixels and the position index, which equals the number of nonzero pixels. Listing 4 shows
the core part of the packing logic. io.in() is the input pixel data; io.pos is the input pixel index;
io.out() is the output pixel data; and io.posout is the output pixel index. Mux() is a multiplexer
that selects the second argument when the first argument is true and the third
argument otherwise. The for loop generates a conditional, partial shift logic: when the pixel at the
current position is zero, every pixel at or above the current position is replaced by the pixel one index
higher; otherwise the input pixel vector is connected to the output pixel vector unchanged. Pixels whose
index is below the current position are always passed through. The last pixel is filled with zero
when a shift occurs.
Figure 13: Packing sparse data into dense data.
Listing 4: The core part of packing logic in Chisel
    // sel is true when the pixel at the current position is zero
    val sel = (io.in(io.pos) === 0.U)
    io.posout := Mux(sel, io.pos, io.pos + 1.U)
    for (i <- 0 to npixels - 2)
      io.out(i) := Mux((i.U >= io.pos) & sel, io.in(i + 1), io.in(i))
    io.out(npixels - 1) := Mux(sel, 0.U, io.in(npixels - 1))
The number of lines in the entire packing logic is approximately 50, including boilerplate
code, while the number of lines in the generated Verilog code is approximately 500 with npixels =
8 and 1,600 with npixels = 16. Note that the number of lines in the Chisel code stays the same for
both configurations because it is fully parameterized.
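The behavior of the chained ShiftUp blocks can be checked with a small software model. The following Python sketch mirrors the logic of Listing 4 stage by stage; it is an illustrative model under the assumption that N identical stages are chained with pos starting at zero, and the function names `shift_up` and `pack` are hypothetical.

```python
def shift_up(pixels, pos):
    """One ShiftUp stage: returns (updated pixels, updated pos)."""
    n = len(pixels)
    sel = pixels[pos] == 0  # is the pixel at the current position zero?
    out = [
        pixels[i + 1] if (i >= pos and sel) else pixels[i]
        for i in range(n - 1)
    ]
    out.append(0 if sel else pixels[n - 1])  # last pixel is zero-filled on a shift
    return out, (pos if sel else pos + 1)


def pack(pixels):
    """Chain N stages; returns (packed pixels, number of nonzero pixels)."""
    pos = 0
    for _ in range(len(pixels)):
        # pos grows by at most one per stage, so it stays a valid index here.
        pixels, pos = shift_up(pixels, pos)
    return pixels, pos


packed, count = pack([0, 5, 0, 0, 7, 0, 0, 1])
print(packed, count)
```

Each stage either removes one zero at the current position or advances past one nonzero pixel, so N stages are enough to pack any N-pixel input.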
Coalescing stage. The data generated by the encoding stage are variable in size. If the interface
of a peripheral device accepts variable-sized data, we can simply send the encoded data to such a
device. Otherwise, if the interface of a peripheral device (e.g., SRAMs, memory controllers) expects
a fixed size (e.g., 256 bits), a buffering mechanism is needed. The coalescing logic is a simple
buffering logic that packs multiple variable-size encoded data fragments into a fixed-size buffer before sending
it to the peripheral device.
The coalescing logic consists of two components: the Selector module and the STBuf module (see Figure
14). The Selector module receives the length of the encoded data every cycle from the encoding
logic and updates the position (pos in STBuf) at which the associated
encoded data are inserted into the buffer in the STBuf module. It raises the flushed signal when the buffer becomes full. The STBuf module receives
flushed and pos from the Selector module and the encoded data from the encoding stage and
outputs the fixed-size data. It consists of an array of registers (the buffer) and a barrel shifter.
Figure 14: Block diagram of the coalescing logic that packs encoded data into a fixed-size buffer.
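The buffering behavior of the coalescing stage can be approximated in software as follows. This Python sketch is a behavioral illustration only: the class name `Coalescer` and its interface are hypothetical, and it assumes that pixels left over after a full word is emitted are carried into the next word, a detail the text does not specify.

```python
class Coalescer:
    """Packs variable-length fragments into fixed-size output words."""

    def __init__(self, word_size):
        self.word_size = word_size  # output word size, counted in pixels
        self.buffer = []            # pending pixels, oldest first

    def push(self, fragment):
        """Accept one fragment; return the list of completed fixed-size words."""
        self.buffer.extend(fragment)
        words = []
        while len(self.buffer) >= self.word_size:
            # Corresponds loosely to the flushed signal: the buffer is full.
            words.append(self.buffer[:self.word_size])
            self.buffer = self.buffer[self.word_size:]  # assumed carry-over
        return words


coalescer = Coalescer(word_size=4)
for fragment in ([146, 5, 7, 1], [0], [3, 9, 9]):
    print(coalescer.push(fragment))
```

The number of `push` calls needed before a word is produced depends on the fragment lengths, which reflects the variable number of cycles before the internal buffer becomes ready to be read.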
3.2.5 Verification
All of the design components for the streaming ZM compressor are expressed in Chisel and
translated into Verilog code. While we could simply use an external Verilog simulator such
as ModelSim to verify the functionality of the generated Verilog code, Chisel provides a fully
integrated testing harness for functional verification that allows developers to write testbenches in
Scala instead of Verilog. It seamlessly invokes an external Verilog simulator (Verilator by default).
Since Scala is a powerful general-purpose functional, object-oriented programming language, it
allows one to write complex and flexible testbench code. We have successfully verified
the functionality of the entire compressor design (Figure 15) against all four X-ray datasets. The
testbench reads data from the actual datasets and uses them as input data to the device under test (DUT),
which is the simulated streaming compressor. An image frame in a dataset is split into N-row
pixel strips, which are fed into the DUT column by column. We iterate over all frames in a dataset. For
example, with N = 16, a 256 × 256 image frame is split into 16 strips3 and each strip has 256
columns. The total encoding count per frame is 4,096 (= 16 × 256) in this example.
3 Each strip corresponds to one compressor instance.
Figure 15: Verification test bench.
3.2.6 Physical implementation
Figure 16 shows examples of the described streaming compressors that were coded in RTL, synthesized,
and placed and routed in 65 nm standard-cell CMOS. Layout (a) shows a placed and routed
8-pixel ZM compressor. The area of this block is close to the desired area of a single pixel, which
means that an 8-pixel streaming compressor block could be located at the end of every pixel row
with little impact on the ASIC floor plan. Placing one compressor per pixel row would add the
equivalent area of a single column of pixels to the pixel array. Layout (b) shows a placed and routed
16-pixel ZM compressor. The area of this block would span two 100 µm pixel rows. If one 16-pixel
compressor block were placed at the end of every two pixel rows, it would add the equivalent area of
two pixel columns to the pixel array. Since each 16-pixel compressor module takes sixteen
9-bit buses as inputs, this compressor block would allow each pixel row to output eight interleaved 9-bit shift
buses to the compressor inputs, allowing the shift clock frequency to be reduced by
a factor of 8. Layout (c) shows a 16-pixel bit-shuffled ZM compressor. This version is slightly
smaller than the non-shuffled example of (b).
Figure 16: Physical layout of three variants of the ZM compressor in 65 nm CMOS: (a) 8-pixel,
(b) 16-pixel, (c) 16-pixel bit shuffled. Dimension labels in the figure: 110 µm, 90 µm, 165 µm, 185 µm.
4 Outlook and Conclusions
Two compression methods have been described that together reduce the number of bits
streamed from a detector ASIC by a factor of 6–12×, depending on the type of X-ray dataset. Let us
consider the impact on the off-chip bandwidth required for a hypothetical detector ASIC.
For example, if a 256 × 256 pixel array running at a 1 MHz frame rate is assumed and if each pixel
has a 12-bit ADC with 2 gain bits, the uncompressed transmission bit rate would be 917 Gbps.
Individual high-speed transmitters in 65 nm CMOS have been demonstrated at 5–10 Gbps [20, 21],
and the Timepix4 ASIC is designed with 16 × 10.24 = 164 Gbps off-chip bandwidth [22].
At these extreme bit rates the number of transmitters needed to send uncompressed data would
make a 1 MHz frame rate out of the question. With the described in-pixel compression reducing the
data from 14 bits to 9 bits (i.e., a 1.5× reduction) and streaming edge compression further reducing
the data by a factor of 4–8×, the required off-chip bandwidth ranges from 76 to 153 Gbps. This can
more reasonably be achieved by providing, for example, 8–16 10 Gbps transmitters at the edge of the detector
ASIC.
In this paper we focused on a compression scheme that leverages statistical redundancy (a high
occurrence of zeros) in the single-cycle input pixels from the pixel array, which allows us to design a simple
yet effective compressor. As a next step we plan to explore compression algorithms and circuit
designs that can exploit further statistical redundancy in the spatial and temporal dimensions
in order to improve the compression ratio further.
Acknowledgments
We thank Andrew Chihpin Chuang, Junjing Deng, and Suresh Narayanan for the X-ray datasets.
We thank Gail Pieper for editing this manuscript. We thank Franck Cappello for discussions
on compression schemes. We thank Pete Beckman and Alec Sandy for encouraging this multi-
disciplinary collaboration between the X-ray Science Division (XSD) and the Mathematics and
Computer Science (MCS) division. This work is based in part on work supported by the U.S.
Department of Energy, Office of Science, under contract DE-AC02-06CH11357. This research
used resources of the Advanced Photon Source, a U.S. Department of Energy (DOE) Office of
Science User Facility operated for the DOE Office of Science by Argonne National Laboratory
under Contract No. DE-AC02-06CH11357. This research used the resources of the Fermi National
Accelerator Laboratory (Fermilab), a U.S. Department of Energy, Office of Science, HEP User
Facility. Fermilab is managed by Fermi Research Alliance, LLC (FRA), acting under Contract No.
DE-AC02-07CH11359.
References
[1] C. Jacobsen, X-ray Microscopy. Cambridge, UK: Cambridge University Press, 2019.
[2] J. Deng, C. Preissner, J. A. Klug, S. Mashrafi, C. Roehrig, Y. Jiang, Y. Yao, M. Wojcik, M. D.
Wyman, D. Vine, K. Yue, S. Chen, T. Mooney, M. Wang, Z. Feng, D. Jin, Z. Cai, B. Lai, and S. Vogt,
“The Velociprobe: An ultrafast hard X-ray nanoprobe for high-resolution ptychographic imaging,”
Review of Scientific Instruments, vol. 90, p. 083701, Aug. 2019.
[3] Y. Ren, “High-energy synchrotron x-ray diffraction and its application to in situ structural
phase-transition studies in complex sample environments,” JOM, vol. 64, pp. 140–149, Jan 2012.
[4] J.-S. Park, X. Zhang, P. Kenesei, S. L. Wong, M. Li, and J. Almer, “Far-Field High-Energy Diffraction
Microscopy: A Non-Destructive Tool for Characterizing the Microstructure and Micromechanical
State of Polycrystalline Materials,” Microscopy Today, vol. 25, pp. 36–45, Aug. 2017.
[5] M. W. Tate, D. Chamberlain, K. S. Green, H. T. Philipp, P. Purohit, C. Strohman, and S. M. Gruner,
“A medium-format, mixed-mode pixel array detector for kilohertz x-ray imaging,” Journal of Physics:
Conference Series, vol. 425, p. 062004, mar 2013.
[6] A. Mozzanica, M. Andrä, R. Barten, A. Bergamaschi, S. Chiriotti, M. Brückner, R. Dinapoli,
E. Fröjdh, D. Greiffenberg, F. Leonarski, C. Lopez-Cuenca, D. Mezza, S. Redford, C. Ruder,
B. Schmitt, X. Shi, D. Thattil, G. Tinti, S. Vetter, and J. Zhang, “The JUNGFRAU detector for applications
at synchrotron light sources and XFELs,” Synchrotron Radiation News, vol. 31, no. 6, pp. 16–20, 2018.
[7] G. Blaj, A. Dragone, C. J. Kenney, F. Abu-Nimeh, P. Caragiulo, D. Doering, M. Kwiatkowski,
B. Markovic, J. Pines, M. Weaver, S. Boutet, G. Carini, C.-E. Chang, P. Hart, J. Hasi, M. Hayes,
R. Herbst, J. Koglin, K. Nakahara, J. Segal, and G. Haller, “Performance of ePix10k, a high dynamic
range, gain auto-ranging pixel detector for FELs,” AIP Conference Proceedings, vol. 2054, no. 1,
p. 060062, 2019.
[8] H. T. Philipp, M. W. Tate, P. Purohit, D. Chamberlain, K. S. Shanks, J. T. Weiss, and S. M. Gruner,
“High-speed x-ray imaging with the Keck pixel array detector (Keck PAD) for time-resolved
experiments at synchrotron sources,” AIP Conference Proceedings, vol. 1741, no. 1, p. 040036, 2016.
[9] A. Allahgholi, J. Becker, L. Bianco, A. Delfs, R. Dinapoli, P. Goettlicher, H. Graafsma,
D. Greiffenberg, H. Hirsemann, S. Jack, R. Klanner, A. Klyuev, H. Krueger, S. Lange, A. Marras,
D. Mezza, A. Mozzanica, S. Rah, Q. Xia, B. Schmitt, J. Schwandt, I. Sheviakov, X. Shi, S. Smoljanin,
U. Trunk, J. Zhang, and M. Zimmer, “AGIPD, a high dynamic range fast detector for the European
XFEL,” Journal of Instrumentation, vol. 10, pp. C01023–C01023, jan 2015.
[10] M. Hart, C. Angelsen, S. Burge, J. Coughlan, R. Halsall, A. Koch, M. Kuster, T. Nicholls,
M. Prydderch, P. Seller, S. Thomas, A. Blue, A. Joy, V. O’Shea, and M. Wing, “Development of the
LPD, a high dynamic range pixel detector for the European XFEL,” in 2012 IEEE Nuclear Science
Symposium and Medical Imaging Conference Record (NSS/MIC), pp. 534–537, 2012.
[11] M. Porro, L. Andricek, A. Castoldi, C. Fiorini, P. Fischer, H. Graafsma, K. Hansen, A. Kugel,
G. Lutz, U. Pietsch, V. Re, and L. Struder, “Large format x-ray imager with mega-frame readout
capability for XFEL, based on the DEPFET active pixel sensor,” in 2008 IEEE Nuclear Science
Symposium Conference Record, pp. 1578–1586, 2008.
[12] M. Pezzoli, L. Lodola, M. Manghisoni, F. Morsani, L. Ratti, V. Re, E. Riceputi, and G. Traversi,
“Characterization of PFM3, a 32×32 readout chip for PixFEL x-ray imager,” in 2019 IEEE Nuclear
Science Symposium and Medical Imaging Conference (NSS/MIC), pp. 1–5, 2019.
[13] M. Garcia-Sciveres and N. Wermes, “A review of advances in pixel detectors for experiments with
high rate and radiation,” Reports on Progress in Physics, vol. 81, p. 066101, may 2018.
[14] T. Ueno, K. Sano, and S. Yamamoto, “Bandwidth Compression of Floating-Point Numerical Data
Streams for FPGA-Based High-Performance Computing,” ACM Transactions on Reconfigurable
Technology and Systems, vol. 10, pp. 1–22, July 2017.
[15] P. Huang, M. Du, M. Hammer, A. Miceli, and C. Jacobsen, “Fast digital lossy compression for x-ray
ptychographic data.” (submitted).
[16] A. H. Robinson and C. Cherry, “Results of a prototype television bandwidth compression scheme,”
Proceedings of the IEEE, vol. 55, no. 3, pp. 356–364, 1967.
[17] K. Masui, M. Amiri, L. Connor, M. Deng, M. Fandino, C. Höfer, M. Halpern, D. Hanna, A. Hincks,
G. Hinshaw, J. Parra, L. Newburgh, J. Shaw, and K. Vanderlinde, “A compression scheme for radio
data in high performance computing,” Astronomy and Computing, vol. 12, pp. 181 – 190, 2015.
[18] J. Bachrach, H. Vo, B. Richards, Y. Lee, A. Waterman, R. Avižienis, J. Wawrzynek, and K. Asanović,
“Chisel: constructing hardware in a Scala embedded language,” DAC Design Automation Conference, pp. 1212–1221, 2012.
[19] K. Asanović, R. Avizienis, J. Bachrach, S. Beamer, D. Biancolin, C. Celio, H. Cook, D. Dabbelt,
J. Hauser, A. Izraelevitz, S. Karandikar, B. Keller, D. Kim, J. Koenig, Y. Lee, E. Love, M. Maas,
A. Magyar, H. Mao, M. Moreto, A. Ou, D. A. Patterson, B. Richards, C. Schmidt, S. Twigg, H. Vo,
and A. Waterman, “The rocket chip generator,” Tech. Rep. UCB/EECS-2016-17, EECS Department,
University of California, Berkeley, Apr 2016.
[20] C. Chen, V. Wallangen, D. Gong, C. Grace, Q. Sun, D. Guo, G. Huang, S. Kulis, P. Leroux, C. Liu,
T. Liu, P. Moreira, J. Prinzie, L. Xiao, and J. Ye, “A gigabit transceiver for the ATLAS inner tracker
pixel detector readout upgrade,” Journal of Instrumentation, vol. 14, pp. C07005–C07005, jul 2019.
[21] P. Moreira, R. Ballabriga, S. Baron, S. Bonacini, O. Cobanoglu, F. Faccio, T. Fedorov, R. Francisco,
P. Gui, P. Hartin, K. Kloukinas, X. Llopart, A. Marchioro, C. Paillard, N. Pinilla, K. Wyllie, and
B. Yu, “The GBT Project,” 2009.
[22] X. Llopart, “The design of the Timepix4 chip: a 230 kpixel and 4-side buttable chip with 200 ps
on-pixel time bin resolution and 15-bits of ToT energy resolution.”
https://indico.cern.ch/event/788037/.
