Results of the NFIRAOS RTC trade study 
Jean-Pierre Vérana, Corinne Boyerb, Brent L. Ellerbroekb, Luc Gillesb, Glen Herriota, Daniel A. 
Kerleya, Zoran Ljusica, Eric A. McVeighc, Robert Priorc, Malcolm Smitha, Lianqi Wangb 
 
aNational Research Council Canada, 5071 W. Saanich Rd., Victoria, BC Canada V9E 2E7 
bTMT Observatory Corp., 1111 S. Arroyo Parkway, Pasadena, CA, USA 91101 
cUniversity of Victoria, 3800 Finnerty Rd, Victoria, BC Canada V8P 5C2 
ABSTRACT  
With two large deformable mirrors with a total of more than 7000 actuators that need to be driven from the 
measurements of six 60x60 LGS WFSs (total 1.23Mpixels) at 800Hz with a latency of less than one frame, NFIRAOS 
presents an interesting real-time computing challenge. This paper reports on a recent trade study to evaluate which 
current technology could meet this challenge, with the plan to select a baseline architecture by the beginning of 
NFIRAOS construction in 2014. We have evaluated a number of architectures, ranging from very specialized layouts 
with custom boards to more generic architectures made from commercial off-the-shelf units (CPUs with or without 
accelerator boards). For each architecture, we have found the most suitable algorithm, mapped it onto the hardware and 
evaluated the performance through benchmarking whenever possible. We have evaluated a large number of criteria, 
including cost, power consumption, reliability and flexibility, and proceeded with scoring each architecture based on 
these criteria. We have found that, with today's technology, the NFIRAOS requirements are well within reach of off-the-
connectivity. Accelerators are only required for the off-line process of updating the matrix control matrix every ~10s, as 
observing conditions change. 
Keywords: Adaptive optics, real-time computing, real-time control 
 
1. INTRODUCTION  
The Narrow Field IR Adaptive Optics System (NFIRAOS) is the first light facility adaptive optics (AO) system for the 
Thirty Meter Telescope (TMT) [1]. The NFIRAOS project is currently at the final design stage [2]. NFIRAOS is a multi-
conjugate AO system. It has two large deformable mirrors with a total of more than 7000 actuators. For most 
observations, these deformable mirrors will be driven from the measurements of six 60x60 laser guide star (LGS) wave-
front sensors (WFSs). These WFSs produce a total of 1.23Mpixels at 800Hz, from which the NFIRAOS real-time 
controller (RTC) needs to compute updated DM commands with a latency of less than one frame (1.25ms). This task 
represents quite a computational challenge. 
In 2008-2009, TMT commissioned a conceptual design study for a Real Time Control (RTC) system. Two groups 
carried out independent studies and both proposed custom-designed solutions based on Field Programmable Gate Arrays 
(FPGAs) that could meet the requirements [3][4].  
In 2013, TMT commissioned a new trade study to re-evaluate the conclusions from 2009 in light of the rapidly evolving 
technology that has become available since then, with the plan to select a baseline architecture by the beginning of 
NFIRAOS construction expected for 2014.  
This paper presents the results of this trade study. Section 2 presents the RTC requirements and the criteria used to 
evaluate different potential solutions. Section 3 summarizes the different possible RTC algorithms. In section 4, we 
present the different hardware that we have investigated, which includes PC-servers with and without accelerators, 
AdvancedTCA and Open VPX. In section 5, we discuss our efforts at mapping the algorithms on each different type of 
hardware, and in section 6 we present the different candidate architectures that we have designed, for each type of 
hardware. In section 7, we summarize our effort to verify some of these architectures through benchmarking, and in 
section 8, we present our evaluation of each of our candidate architectures. Our conclusions are presented in section 9. 
2. TRADE STUDY CRITERIA 
2.1 NFIRAOS RTC top-level requirements 
NFIRAOS is a Multi-Conjugate AO system, which includes: 
• two Deformable Mirrors (DMs) conjugated at 0km (DM0 – 3125 actuators) and 11.2km (DM11.2 – 4548 
actuators), 
• one Tip/Tilt Stage (TTS) serving as the mount for DM0, 
• six Laser Guide Star (LGS) Shack-Hartmann wavefront sensors (WFSs) of order 60x60 sub-apertures  
• up to three low order Infrared natural guide star wavefront sensors (OIWFS) within each NFIRAOS instrument, 
• one high order visible Natural Guide Star Shack-Hartmann WFS (NGS WFS) of order 60x60, which is used for 
operation without LGS, 
• one Truth Wavefront Sensor (TWFS) measuring a natural guide star at low bandwidth, which is used to 
calibrate for slow-varying biases due to temporal variations in the sodium layer profile in LGS AO mode, 
• and the RTC, which processes the inputs from the various WFSs to compute the commands of the deformable 
mirrors and tip/tilt stage. 
The RTC interfaces with additional telescope and AO sub-systems, including the AO Sequencer, the NFIRAOS 
Component Controller, the Laser Guide Star Facility System, the NFIRAOS instruments, and the Data Management 
System (DMS). 
  
 
In LGS mode, which is by far the most computationally demanding mode, the RTC processes 1.23Mpixels (2 
bytes/pixel) from the six LGS WFS cameras and outputs a total of 7673 DM commands at a rate of 800 Hz. The main 
performance requirement for the NFIRAOS RTC is to have a total latency of no more than 1.2ms with a goal of 0.6ms, 
knowing that the LGS WFS CCDs are read in 0.5ms. Here latency is calculated from the time the WFS CCD read-out 
starts to the time all the DM commands are output to the DM Drive Electronics (DME). Since the DME is allocated 
0.05ms to output the commands to the DM, the requirement corresponds to a full one-frame latency, and the goal 
corresponds to a one-half-frame latency.
Figure 1: NFIRAOS top-level block diagram
All the parameters required by the RTC are provided by the Reconstructor Parameter Generator (RPG), which is part of 
the AO Sequencer, not the RTC. These parameters are optimized and updated as conditions change. However, some 
parameters requiring frequent updates, such as the matched filters for pixel processing, are optimized within the RTC.
The full list of requirements for the NFIRAOS RTC is recorded in a Design Requirement Document (DRD).  
2.2 Trade study criteria description 
The trade study criteria are summarized in Table 1 and Table 2. They were set in advance of carrying out the trade study, 
in order to achieve the most unbiased comparison. The most important items in the DRD are specifically included in the 
criteria, and others are lumped into a generic line item: “compliance with requirements with minimal waivers.”  
The columns of these tables are from left to right: 
• Criterion, which is often a key requirement in the DRD 
• Requirement, which must be met, ideally expressed as a quantifiable and measurable value 
• Goal, a tighter version of the requirement, indicating desirable improvements over the requirement if they can 
be achieved at low cost 
• Weighting, which qualitatively ranks the importance of the criteria in three bands: high, medium and low.
In Table 1 we present the very highest priority criteria. Table 2 contains items that are considered of lesser importance, 
even if the weighting is currently shown as high.
Table 1 Highest Priority Criteria for RTC trade-off study 
Criterion | Requirement | Goal | Weighting 
Accuracy | 40 nm RMS WFE relative to FD-PCG3 | 20 nm | High 
Latency (start of CCD readout to end of transfer to DME) | 1.2 ms | 0.6 ms | High 
Latency jitter | No missed frames; jitter < TBD | TBD | Medium 
Availability (reliability) | 0.1% loss of expected 3200 h science per year | | High 
Cost | Minimize | | High 
Schedule - delivery date | Meets NFIRAOS schedule | 6 months early | High 
Risks (technical, schedule and costs) | Medium | Low | High 
Hardware maintainability | 10 yr lifetime | 15 yr | High 
Software maintainability | Acceptable to require a specialist | General programming skills sufficient | High 
Space | 24 U | | Medium 
Power | 1500 W | | Medium 
Table 2 Moderate Priority Criteria for RTC trade study 
Criterion | Requirement | Goal | Weighting 
Upgradability | NFIRAOS+ | MOAO | Low 
Ease of Verification | Module I/O comparison with MAOS | RTC testable with MAOS | Medium 
Compliance with all requirements | Minimal waivers | No waivers | High 
Telemetry (one single criterion) | | | Low 
- Rate | 3.5 GB/sec | 5.0 GB/sec | 
- Data size | 90 TB | 140 TB | 
- Rack space, power | 60U, 6 kW | | 
Updating RTC from RPG | No disturbance to closed loop | | High 
Observing overhead (one single criterion) | | | Medium 
- Initializing dead time | 10 sec (TBC) | No better goal | 
- RPG load/update | Ready in time for dither, 10 sec cadence | No goal | 
- Dithering dead time | 5 asec in < 2 sec | No goal | 
Flexibility | Software | Algorithms | Medium 
Quality of interfaces | Standards based | Industrial | High 
Standards (one single criterion) | | | Medium 
- Hardware standards compliance | | | 
- Quality of standards and documentation | | | 
- Maturity of standards and documentation | | | 
 
The accuracy is the quadrature difference between the residual wave-front error achieved by the chosen algorithm on the 
chosen architecture and the lowest wave-front error we can achieve in simulation (in median conditions). Accuracy can 
be degraded by rounding errors and/or by resorting to an inferior algorithm.
The latency measures the elapsed time between the start of WFS readout and the end of sending the DM commands to 
the DM electronics. The latency includes the CCD read-out, which is 500us. Reducing the latency towards the goal will 
reduce servo-lag errors in NFIRAOS. Furthermore, short latency algorithms have the potential to permit longer readout 
times on the WFS CCD and therefore less read noise. They also can reduce data rates on the interface between WFSs 
and the RTC, possibly saving money or reducing risk. 
For the jitter in latency between frames, the requirement is to not miss any frame. At this time, we are still evaluating 
whether this requirement should be tightened, or whether it could be loosened to allow missing a frame from time to 
time. 
For reliability, no more than 0.1% of the 3200 hours (i.e. 3.2 hours) per year planned for scientific observing can be lost 
due to an RTC failure. What needs to be compared with this requirement is the expected downtime, derived from the 
mean time between failures (MTBF) and the mean time to repair (MTTR), integrated over a year. We categorize failures 
in two types: those that can be fixed remotely (e.g. by rebooting, swapping servers, or changing observing mode), which, 
in our estimation, typically cost 30 minutes of downtime; and those that need the intervention of the day crew, which can 
only occur on the following day, and for which the remainder of the night is lost. The estimate also takes into account the 
fact that NFIRAOS will only be used half of the nights at TMT on average.
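As a concrete illustration of this bookkeeping, the short Python sketch below applies the method just described; the MTBF split and the lost-night duration are hypothetical placeholder values, not figures from the study.

    # Illustrative downtime estimate following the method above.
    # All MTBF and repair values here are hypothetical placeholders.
    remote_mtbf_yr  = 2.0    # assumed: one remotely fixable failure every 2 years
    remote_mttr_hr  = 0.5    # 30 minutes of downtime, per the text
    daycrew_mtbf_yr = 5.0    # assumed: one failure needing the day crew every 5 years
    lost_night_hr   = 4.0    # assumed remainder of the night that is lost
    on_sky_fraction = 0.5    # NFIRAOS observes only half of the nights on average

    downtime_hr = on_sky_fraction * (remote_mttr_hr / remote_mtbf_yr
                                     + lost_night_hr / daycrew_mtbf_yr)  # ~0.53 h/yr
    budget_hr = 0.001 * 3200                                             # 3.2 h/yr allowed
    print(downtime_hr, "h/yr against a budget of", budget_hr, "h/yr")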
The cost includes all the spares required to operate throughout the lifetime of NFIRAOS. It must, of course, be 
minimized. 
The schedule of the NFIRAOS RTC must be such that it does not impact the development of the rest of NFIRAOS, with 
a goal of delivering 6 months in advance. 
The technical risk is assessed according to the risk assessment framework developed at TMT. Using commercially 
available hardware with well-established costs and development methods reduces the risk. 
Maintainability quantifies how difficult it will be to maintain the RTC hardware and software. This includes how 
difficult it is to diagnose a problem, and how difficult it is to correct it. For hardware, the quantity of spares and their 
expected availability is a factor, since spares can deteriorate with time. As well, heterogeneous hardware makes 
maintainability more difficult. For software, the requirement is that the software can be changed by a specialist, e.g. a 
GPU or FPGA programmer. The goal is that this can be done by general-purpose programmers, so that the observatory 
does not need to carry specialists on staff. This goal is enabled by general-purpose programming languages, and 
homogeneous development and operating environments.  
For space, 24U is the current allocation for the RTC in the NFIRAOS electronics cabinet. However, a small amount of 
spare room exists that could be allocated to the RTC; hence this criterion has only a medium weighting. The same applies 
to the 1500W power budget, which comes from a top-level allocation split between the different NFIRAOS components.
In terms of upgradability, it is slightly desirable that the selected RTC design be upgradable for use in a future order-
120x120 NFIRAOS+, or an MOAO system.
In terms of ease of verification, it would be desirable that the RTC could be interfaced directly with MAOS, the TMT 
AO simulation software, for testing, but it is required that accuracy tests be conducted with input and output files 
provided by MAOS. 
For compliance, the requirement is that the RTC needs only very few waivers of requirements. In some cases, an 
otherwise attractive option may need more waivers to be used in NFIRAOS, but it would then receive a lower score on 
this item. The goal is to meet all the requirements with no waivers.
For updating the RTC in real time, the requirement is that downloading and swapping parameters (e.g. matrices) from 
the Reconstructor Parameter Generator (RPG) into the RTC shall take place without disturbing the closed loop; that is, 
updating parameters should not cause jitter or lost cycles.
The "Observing overhead" criterion has three components: (i) no more than 10s to initialize a new observation; (ii) no 
more than 10s to be ready for a <30 arcsec dither; and (iii) no more than 2 seconds of dead time during a dither of up to 
5 arcsec.
For the "Flexibility" criterion, the requirement is that the software may be changed; the goal is that entire algorithms 
may be replaced. The intent is that the RTC may be revised as we learn more during the construction, commissioning 
and operation of NFIRAOS, and as attractive new algorithms come along. Systems that can only support one type of 
algorithm are penalized.
For the quality of interfaces, the requirement is that the interfaces to the WFSs, DMs, Telemetry, and RPG follow 
standards-based approaches, rather than custom protocols, formats, connectors, etc., specific to NFIRAOS. The goal is 
that these are industrial standards, as opposed to consumer-product standards, which are generally neither as reliable 
nor as long-lived in the marketplace. This criterion will impact the uptime of NFIRAOS as well as the long-term 
availability of knowledgeable staff, and the ability to procure spares if needed.
Use of standards is a single criterion with medium weight, and has three components: 
• Hardware standards compliance: as above, the requirement is that all of the RTC, not just the interfaces, be 
designed using standards-based approaches, rather than custom protocols, formats, connectors, etc., specific to 
NFIRAOS. 
• Quality of standards and documentation: as above, this criterion will impact the long-term availability of 
knowledgeable staff and the ability to procure spares if needed. Furthermore, these standards should ideally be 
open and accessible. Documentation of some standards is more comprehensive and readable than that of others, 
which will affect the maintainability of NFIRAOS. Tutorial and reference documentation from a variety of 
sources is valuable. 
• Maturity of standards and documentation: well-proven standards, with extensive field usage and wide 
acceptance, help ensure the reliability of designs and multiple sources of supply. With such field experience 
comes a variety of tutorial information, and more complete and accurate reference documentation. A good 
trade-off is required between standards that are too new (not quite mature yet) and too old (soon to be obsolete).
3. RTC ALGORITHMS 
In LGS mode, the most computationally intensive tasks consist of two steps: 
• Pixel processing, which takes in the WFS pixels and computes the position of the spots in each sub-aperture 
(slopes) 
• Wave-front reconstruction, which takes in the slopes and computes the new commands to be applied to both 
DMs 
In the NFIRAOS architecture, pixel processing involves correcting each pixel for dark current and flat field, and 
applying a matched filter to the image of each sub-aperture to produce the X and Y slopes. The total number of sub-
apertures to be processed is 17,376, with between 6x6 and 6x15 pixels per sub-aperture (depending on the location of 
the sub-aperture). This corresponds to about 1.23Mpixels per frame, with two bytes per pixel. This results in 17,376 x 2 
= 34,752 slopes. 
Wave-front reconstruction involves reconstructing the incoming wave-front on 6 layers located at different altitudes 
above the telescope (tomography step), and deriving the 7,673 new commands to be applied to both DMs (fitting step). 
The slopes s are related to the DM commands via a linear equation: s = Ga, where G is a 34,752 row by 7,673 column 
matrix. 
The tomography step is performed by a minimum variance reconstructor that uses as prior the expected covariance 
matrix of the incoming atmospheric turbulence. The fitting step is a projection that optimizes performance (wavefront 
variance and Strehl ratio) averaged over the specified science field of view (FoV). 
The wave-front reconstruction consists in inverting G in the minimum variance sense. Two classes of algorithms have 
been selected to solve this problem. The first one is a straightforward matrix-vector multiply (MVM) of the form a = Es, 
where E is an inverse matrix that encapsulates both the tomography and fitting steps. The second class comprises several 
iterative algorithms, which make use of the fact that (i) the G matrix is very sparse and (ii) the covariance of the 
turbulence can be approximated by a block-diagonal matrix. During the course of the NFIRAOS design, we have 
formulated three different algorithms for the tomography step: the Conjugate Gradient (CG), the Fourier Domain 
Preconditioned Conjugate Gradient (FDPCG), and the Block-Gauss-Seidel with inner Conjugate Gradient (BGS-CG). 
For the fitting step, which involves significantly fewer operations than the tomography step, we have only considered 
the CG. 
Describing these algorithms in detail is well beyond the scope of this paper. However, Table 3 provides a top-level 
summary of the number of operations and memory footprint that these algorithms require. For the iterative algorithms, 
the number of iterations was chosen as the minimum required to achieve proper accuracy: 30 iterations for CG (CG30), 
3 iterations for FDPCG (FD3), one iteration of BGS with 20 iterations of CG for each block (BGS-CG20), and 4 
iterations of CG in the fitting step (CG4). 
Table 3: Number of operations (in millions of multiply-and-accumulates per frame) and memory footprint (MB) of the 
different wave-front reconstruction algorithms, for different layer samplings. "total" means the sum of tomography 
(tomo) and fitting. [Table entries not recoverable from this copy; the MVM total quoted elsewhere in the text is 
237.2 MMACs/frame with a ~1 GB matrix.] 
We see that MVM has the largest requirements both in terms of memory and in terms of number of operations. This is 
because the reconstruction matrix E is fully dense. Iterative algorithms require an order of magnitude or less memory:
CG30 requires the least memory, but about the same number of operations as MVM, whereas BGS-CG20 requires the 
fewest operations, but somewhat more memory.
4. HARDWARE ARCHITECTURES 
We have investigated three different hardware architectures: PC servers, AdvancedTCA and Open VPX. 
4.1 PC server-based architectures (CPUs, GPUs and Xeon Phis) 
PC servers are widely available for consumer and industrial applications. CPUs are the heart of PC servers, and thus the 
heart of modern computing. CPUs typically include several cores, as many as 12 on modern machines, and each core can 
execute 8 floating point operations simultaneously in one clock cycle. High-end modern CPUs typically run at a clock 
rate of 2.7GHz, so the maximum theoretical computation speed is 2.7*12*8 = 259 GFlops (Giga Floating point operations 
per second; one such operation can be a multiply-and-accumulate, or MAC). In practice, however, such a computation 
rate can seldom be achieved, because the computation coefficients cannot be brought from the memory to the compute 
registers at that speed. CPUs can access a limited amount of on-chip memory (cache) with very high transfer rate and 
very low latency. Data not in the cache must be brought in from the main external memory, which significantly reduces 
the transfer rate and increases the latency. Table 4 shows the characteristics of the E5-2600, which is the high-end series 
of Intel CPUs. The current version is v2; it has 30MB of L3 cache (shared between all the cores) and a maximum 
bandwidth to the main memory of 59.7 GB/sec. The table also shows the characteristics of the previous version, and 
what is expected from future versions, showing a clear trend of increasing core count, cache size and memory 
bandwidth. Note that the much more expensive but higher grade E7 series provides somewhat higher performance, and 
could be used if needed.
Table 4: Intel E5-2600 Evolution 
E5 Version | v1 | v2 | v3 | v4 
Design Name | Sandy Bridge | Ivy Bridge | Haswell | Broadwell 
Release Date | Q1 2012 | Q3 2013 | Exp. Q3 2014 | 2015 
Max. Cores | 8 | 12 | 14 | 18 
Max. L3 Cache | 20 MB | 30 MB | 35 MB | 45 MB 
RAM | DDR3-1600 | DDR3-1866 | DDR4-2133 | DDR4-2400 
RAM Bandwidth | 51.2 GB/sec | 59.7 GB/sec | 68.2 GB/sec | 76.8 GB/sec 
Manufacturing Process | 32 nm | 22 nm | 22 nm | 14 nm 
 
To increase the compute power, CPUs can be supplemented with accelerator cards, to which the CPUs can offload 
calculations. GPUs are the most common accelerators. They connect to the CPU via the PCI-Express bus and are 
usually programmed in CUDA. They pack many compute engines in a single chip, and can offer computing power and 
memory bandwidth significantly higher than CPUs. For example, the Tesla K20X can achieve 3.95 TFlops of peak 
single precision processing power, and the bandwidth to the 6GB of on-board device memory can reach 250 GB/s (a 
limited amount of faster cache memory is also available). A potential bottleneck for GPUs is the PCI-Express (PCI-E) 
bus, which needs to be used to bring data onto the GPU, and has a bandwidth of only 16 GB/s (PCI-E 3.0) and a latency 
of ~11us per transfer.
Another type of accelerator that we have considered is the Intel Xeon Phi. The Xeon Phi also connects to the host CPU 
via the PCI-Express bus. However, its architecture is more like that of a CPU, with a higher number of cores and 
higher memory bandwidth but reduced clock speed. In our study, we used the Xeon Phi 7120P, which was the top of the 
line in the summer of 2013. It has 61 cores running at 1.238 GHz, 30.5 MB of cache, and 16GB of RAM with a maximum 
bandwidth of 352 GB/s. The maximum theoretical processing power for single precision floating point operations is 2.4 
TFlops, comparable to that of a GPU.
 
4.2 AdvancedTCA-based architectures 
AdvancedTCA is a computing standard primarily geared toward the telecommunications industry. AdvancedTCA systems 
consist of a number of printed circuit boards (or blades) hosted in a dedicated chassis (or shelf) that provides a very high 
level of connectivity between the blades, via backplanes offering point-to-point connections (no data bus).
    
The main reason for investigating AdvancedTCA architectures in the context of the NFIRAOS RTC is, beyond the 
intrinsic capabilities of these architectures, to benefit from the recent development at NRC of the Kermode board. The 
Kermode XV6 is the most powerful AdvancedTCA compute blade ever built. It has been specifically designed to tackle 
the most demanding signal processing applications in existence, and its primary application will be the real-time digital 
processing of signals detected by future radio-telescopes (correlators). The Kermode XV6 packs eight Xilinx Virtex-
6 SX475T FPGAs, delivering an outstanding 8.8 TeraMACs solely from their DSP48E1 dedicated multiply-accumulate 
engines. Each FPGA interfaces with two DDR-3 SDRAM SODIMM modules, each capable of supporting up to 4 GBytes, 
for an aggregate memory capacity of 64 GBytes. The bandwidth to this memory is limited, however, to a maximum of 
6.4GB/s per memory module, or 12.8GB/s per FPGA. The blade connects with the backplanes of the AdvancedTCA 
shelf at rates exceeding 500 Gb/s, making for highly efficient clusters with up to 128 FPGAs in a single chassis. Within 
the Kermode blades, direct communications between FPGAs can be established, at bandwidths of up to 2 GB/s.
 
4.3 Open VPX-based architectures 
Open VPX is a relatively new standard that replaces VME and VXS. It is mainly targeted at defense applications and 
can operate in tough environments (temperature, shock and vibration), with the ability to replace modules in the field 
(two-level maintenance). It also has a high bandwidth density and is based on robust standards. Just like the AdvancedTCA 
standard, a VPX computer consists of a chassis with a high speed backplane, which supports various computing engines 
such as CPUs, GPUs and FPGAs, in the form of computing blades.
 
4.4 10GbE connectivity 
In addition to the main hardware, the connectivity between all the components needs to be resolved. It is advantageous to 
define a single data transfer protocol for communications between all the different NFIRAOS modules, as it makes 
prototyping and testing easier. The main requirements for selecting a common protocol are: 
• Low latency 
• Availability and industry support. 
• Well-defined evolution road map. 
• Ability to implement at reasonable cost. 
• Reasonable total cost of ownership  
The two data protocols that are capable of fulfilling the task and are supported by the industry are 10Gbps Ethernet 
(10GbE) and FDR-10 (Fourteen Data Rate) Infiniband. 
The Infiniband data protocol operates at extremely low latency, 100ns in some cases. This feature is important in High 
Performance Computing systems and data centers. 
On the other hand, 10GbE is not as good as Infiniband where latency is concerned, but with a delay of a few hundred 
nanoseconds, it could be used in NFIRAOS: these delays are acceptable when the RTC operates at 800Hz. 10GbE is the 
data protocol most widely utilized in today's telecom and computing industries. There are plenty of vendors producing 
switches, routers, adapters and cables to select from. New generations, 40GbE and 100GbE, are defined and are already 
being used. Finally, the 10GbE cost per port is less than Infiniband's. 
Our decision was to use 10GbE.
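As a rough consistency check (our own arithmetic, ignoring protocol overheads), a single 10GbE link per camera comfortably carries one LGS WFS within the 0.5ms readout window:

    # One LGS WFS over one 10GbE link, using the figures quoted in this paper
    total_pixels  = 1.23e6                              # all six WFSs, per frame
    bytes_per_pix = 2
    per_wfs_bytes = total_pixels * bytes_per_pix / 6    # ~0.41 MB per camera
    link_rate     = 10e9 / 8                            # 10 Gb/s -> 1.25 GB/s
    transfer_us   = per_wfs_bytes / link_rate * 1e6     # ~330us, inside the 500us readout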
 
 
5. ALGORITHM MAPPING 
Mapping an algorithm onto an architecture consists of examining the requirements of the algorithms given in section 3, 
and finding the layout that can meet these requirements, based on the architecture characteristics outlined in section 4.
5.1 MVM mapping 
MVM requires 237.2 MMACs/frame. At the 800Hz frame rate, this leads to 190 GFlops of required computing power. As 
discussed above, in terms of raw computation, this rate is in fact fairly modest and can, in theory, be achieved by even a 
single modern CPU. The problem is that the ~1GB worth of matrix coefficients all need to be fetched from memory 
within one frame, which would require ~800GB/s of memory bandwidth. As discussed above, none of the compute 
engines can achieve such a bandwidth. Therefore, the work needs to be split across several compute engines.
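The back-of-the-envelope numbers behind these statements, assuming 4-byte single-precision coefficients, are:

    macs_per_frame = 237.2e6
    frame_rate     = 800.0
    compute        = macs_per_frame * frame_rate   # ~1.9e11 MAC/s, i.e. ~190 GFlops
    matrix_bytes   = macs_per_frame * 4            # one coefficient per MAC: ~0.95 GB
    bandwidth      = matrix_bytes * frame_rate     # ~7.6e11 B/s, i.e. ~800 GB/s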
An attractive feature of MVM is that the computation is very easy to parallelize. The work can be split between as 
many compute engines as required: each compute engine processes one WFS, several WFSs, or a portion of a WFS, 
producing partial DM command vectors, and all the partial DM command vectors are summed at the end of the 
calculation by a central machine. Another useful feature of the MVM approach is that the MVM can be carried out 
column-wise. The calculation can therefore start as soon as the first WFS slope is available, and the 500us it takes to 
read the WFS CCDs can be fully utilized.
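The following minimal NumPy sketch illustrates this column-wise accumulation; it is a toy model (small sizes, a plain Python loop standing in for the real pixel/slope pipeline), not the NFIRAOS implementation. In NFIRAOS, the full dimensions are 34,752 slopes and 7,673 actuators.

    import numpy as np

    n_slopes, n_act = 1024, 256                     # toy sizes for the sketch
    rng = np.random.default_rng(0)
    E = rng.standard_normal((n_act, n_slopes)).astype(np.float32)  # control matrix, a = E s

    def accumulate_commands(slope_stream, E):
        # Partial DM commands accumulate as each slope arrives, so the MVM
        # overlaps with the CCD readout instead of waiting for the full frame.
        a = np.zeros(E.shape[0], dtype=np.float32)
        for j, s_j in slope_stream:                 # slopes in readout order
            a += E[:, j] * s_j                      # one matrix column per slope
        return a

    s = rng.standard_normal(n_slopes).astype(np.float32)
    assert np.allclose(accumulate_commands(enumerate(s), E), E @ s, atol=1e-2)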
5.2 Iterative algorithms mapping 
Two problems arise when mapping iterative algorithms: 1) the computational load and/or memory footprint is somewhat 
too large to fit into one processor and its high speed memory; and 2) when the work is split between several processors, 
communication of intermediate results between the processors is required after each iteration, or even more often. 
Another downside is that, unlike the MVM, where calculations can start as soon as the first pixels are read out of the 
WFS CCDs, an iterative algorithm can accomplish very little (part of the first iteration) until all the slopes are 
computed. In NFIRAOS, this results in an overhead of 500us, which is the time to read all the WFSs, and which is a 
significant fraction of the latency requirement.
Earlier attempts to map iterative algorithms onto GPUs have been unsuccessful [5]. On a single NVIDIA GTX 580 GPU, 
the fastest algorithm was found to be FDPCG, but it would take 2.6ms to complete the tomography step, and the DM 
fitting step took 4.5ms, for a total exceeding 7ms, a far cry from the requirement. Splitting the work across two 
GPUs did not improve these timings.
5.3 Region-based CG30  
During the course of the trade-study, we have attempted to reformulate the CG30 algorithm so that it could be split onto 
several processors while minimizing the amount of data that needed to be exchanged between the processors. The 
intended target were the FPGA processors on the Kermode board. The reformulated algorithm is called region-based 
CG30 (RBCG30). The details of RBCG30 will be published elsewhere, but we present a summary now. 
In RBCG30, the aperture planes (on which the WFS measurements are taken) and the layer planes (on which the 
turbulence profile is reconstructed) are separated into regions. For the sake of discussion, we choose four regions, as 
shown in Figure 2.  
Figure 2: Layout of the region-based CG30. Left: aperture plane. Right: layer planes, including some border and guard 
areas. Plus symbols correspond to phase points that are estimated during the tomography reconstruction process. 
In RBCG30, a separate processor (FPGA) handles each region. In order to carry out the calculations in a region, the 
process needs to have a copy of the values from the neighboring regions that lie in the border and guard areas. The guard 
area has a fixed size (4 points), but the width of the border area increases with the altitude of the layer, from just one 
point wide for the ground layer (and the aperture planes) to 8 points for the highest (15.5km) layer, seen at a 60 degree 
zenith angle (the maximum zenith angle for NFIRAOS observing). At the end of each iteration, the processors need to 
exchange values from the guard and border regions, but in the NFIRAOS configuration, this is only 2772 floating point 
values. This is only a small fraction of the total number of values that are handled during the CG calculations, so the 
required communication bandwidth between the processors is minimized. 
6. ARCHITECTURES CONSIDERED IN THE TRADE STUDY 
Our experience with trying to implement iterative algorithms on GPUs led us to conclude that such algorithms were not 
suitable for PC-based architectures. It is worth noting that this assessment is not as definite as it sounds, because (i) we 
have not considered RBCG30 on a PC-based architecture; and (ii) further optimization and/or newer hardware might 
make the implementation more feasible. For example, when comparing the memory requirements from Table 3 with the 
projected amount of cache memory available in CPUs in the short term (Table 4), we see that the memory 
footprint of the CG-based reconstructor might well fit in the cache memory of a single CPU, soon if not today, and that 
the computing requirements are also almost within the reach of a single CPU. However, given the limited time and 
resources to conduct our trade study, we have decided to only consider MVM for the PC-based architectures. We have 
only considered iterative algorithms for the AdvancedTCA/Kermode architecture, for which the cost of implementing a 
parallel MVM layout would be prohibitive (see below). 
6.1 PC-Server architectures 
MVM on PC-servers is accomplished by splitting the work on as many servers as required. One additional server is used 
for additional RTC tasks such as summing all the DM partial commands. One last server is used as a spare, which can be 
remotely powered if needed. A high-end telecom switch handles the communication between the servers and the WFSs, 
DMs, etc., and ensures that telemetry data are properly recorded. The switch that we have selected is the 52-port 7150S-
52 from Arista. 
All the connectivity is handled by 10Gb Ethernet. 
MVM on PC-Servers only (all CPUs) 
The CPU-only architecture is laid out in Figure 3.
 
Figure 3: Block diagram used for the CPU-only MVM architecture. 
 
There are 8 servers (S1…S8). S1-S6 handle one WFS each, S7 handles the summing of the DM commands, and S8 is a 
hot spare. The servers are 2U Supermicro dual node 6017TR-TF with two X9DRT-F motherboards. At the time of our 
trade study (2013), each motherboard contained 2 E5-2680 Intel Xeon processors with 8 cores. However, at the time of 
writing (Spring 2014), we would use the newer E5-2697 v2 with 12 cores. Each server therefore has 4 CPUs to handle 
one LGS WFS. 
MVM on a PC-Server + GPU 
The CPU+GPU architecture is the same as the CPU-only architecture, except that only 5 servers are required: 3 servers 
handle 2 LGS WFSs each, one handles the summing of the DM commands, and one is a hot spare. The servers are 2U 
Supermicro single node 2027GR-TRFT with one X9DRG-HTF motherboard. The motherboard has 2 Intel Xeon E5 
processors + 2 NVIDIA K10 modules on x16 PCIe slots (each K10 module contains 2 GPUs). The block diagram for 
each server is presented in Figure 4.
each server is presented in Figure 4.  
 
Figure 4: Block diagram of a server in the CPU+GPU architecture 
MVM on PC-Server + Xeon Phi 
This configuration is the same as the GPU configuration above, except that the GPUs are replaced by Intel 7120P Xeon 
Phis.
6.2 AdvancedTCA + Kermode Board architectures 
For the AdvancedTCA + Kermode Board architecture, we have considered two algorithms: MVM and RBCG30. For this 
architecture, the connectivity to the WFSs, DMs and Recorder is handled by sFPDP, using FMC sFPDP mezzanine cards 
on the Kermode boards.
MVM on AdvancedTCA + Kermode Board 
The architecture for the MVM on Kermode Board is presented in Figure 5. 
 
Figure 5: Block diagram for MVM on AdvancedTCA + Kermode Board architecture   
Seven Kermode boards are required in order to achieve the required memory bandwidth. They are hosted in a 14-slot 
ATCA chassis (ASYS00001 from Schroff). Also hosted in this chassis are a CPU card (AT8050-2) and a switch card 
(AT8904-2), both from Kontron. A Rear Transition Module (RTM) needs to be designed for I/O interfacing, but this is 
a very simple design consisting of an RTM PCB, a number of optical transceivers and a connector. The RTM module 
(backplane) can be easily customized to fit Kermode's Zone 3 connector. The Zone 2 backplane provides 20Gbps full 
mesh connectivity between boards in the chassis.
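A rough consistency check of this board count (our arithmetic, combining the bandwidth figures of section 4.2 with the MVM requirement of section 5.1):

    required_bw = 237.2e6 * 4 * 800   # ~7.6e11 B/s to stream the matrix every frame
    per_fpga_bw = 12.8e9              # B/s per FPGA (two SODIMMs at 6.4 GB/s each)
    per_board   = 8 * per_fpga_bw     # ~102 GB/s per Kermode board
    n_boards    = required_bw / per_board   # ~7.4, i.e. of order seven boards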
RBCG30 on AdvancedTCA + Kermode Board 
RBCG30 can be implemented on one Kermode board only, using 4 of the 8 FPGAs available. Each FPGA handles one 
of four regions, and 2 GB/s connections are established between adjacent FPGAs, as shown in Figure 6. 
 
Figure 6: FPGA layout for the RBCG30 algorithm 
The transfer of 2,772 floating-point numbers between each pair of adjacent regions can occur in only 11us. This includes 
the double loading of the communication lanes, due to the need to go through F2 and F3 to communicate between F1 and 
F4. The computations can be pipelined with the transfers, with points that do not require information from the guard and 
border areas being computed while the transfers occur. In order to keep up with the data flow, each FPGA would have to 
achieve 0.13TMAC/s, which is close to one half of the 0.250TMAC/s available in each FPGA, and therefore 
seems achievable. The total computation time for RBCG30 is therefore 30x11 = 330us. This time could be halved by 
using 8 FPGAs (and 8 regions) instead of 4. The 8 FPGAs would be split into two groups of 4, with each group in a 
different Kermode board.
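The quoted transfer time follows directly from the link rate, assuming 4-byte floats:

    n_exchange = 2772                        # values exchanged per adjacent region
    link_rate  = 2e9                         # B/s between adjacent FPGAs
    t_hop      = n_exchange * 4 / link_rate  # ~5.5us for one hop
    t_exchange = 2 * t_hop                   # ~11us with the double loading via F2 and F3
    t_total    = 30 * t_exchange             # ~330us for the 30 CG iterations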
6.3 Open VPX architecture 
MVM on Open VPX 
MVM could be implemented in an Open VPX architecture according to the block diagram in Figure 7.  
 
Figure 7: Functional block diagram of an MVM implementation in OpenVPX. 
 
 
This computing platform utilizes military grade equipment packed in an OpenVPX chassis. This system computes the 
MVM in the same fashion as the server-based platform. There are 7 CPU cards and 7 GPU cards; each GPU card contains 
two high performance MXM GPU modules. A 16-slot chassis is needed for this purpose. An OpenVPX backplane provides 
connectivity between the boards. An XMC style card, the SL240, is selected to interface with the WFSs and other instruments.
RBCG30 on OpenVPX 
Our design is based on the SCFE-V6-4QSFP-OVPX card from Mercury Systems, which contains three V6SX475 
FPGAs, the same as in the Kermode board. Each board also hosts two FMC bays for front panel interfacing. The design 
fits in a six-slot chassis and its functional block diagram is presented in Figure 8.
 
Figure 8: Functional block diagram of an RBCG30 implementation in OpenVPX. 
A 4 or 6 region implementation can be achieved by using 2 (resp. 3) FPGAs from each of two cards, connected using 
their FMC LVDS interconnect cards, as shown in Figure 9.
 
Figure 9: FPGA inter-connection layout for RBCG30 in OpenVPX 
With this layout, the estimated time to complete the 30 iterations is 210us if 4 regions/FPGAs are used, and 165us if 6 
FPGAs are used.  
6.4 Cost, power consumption and MTBF evaluations 
The cost, power consumption and MTBF of each system presented above have been evaluated based on the published 
values as of November 2013. Our findings are summarized in Table 5. 
 
 
 
Table 5: Cost, power consumption and MTBF for our proposed architectures 
Architecture | Cost (kUSD) | Power (kW) | MTBF (year) 
PC-Servers, CPUs only | 172 | 5.7 | 1.2 
PC-Servers + GPUs | 101 | 3.7 | 0.77 
PC-Servers + Xeon Phis | 164 | 5.7 | 1.1 
AdvancedTCA - MVM | 532 | 2.5 | 1.25 
AdvancedTCA - RBCG30 | 109 | 0.5 | 4.5 
Open VPX - MVM | 305 | 3.5 | 1.12 
Open VPX - RBCG30 | 144 | 1.0 | 1.31 
 
For the PC-based MVM systems, using GPUs reduces cost and power consumption, but also the MTBF, due to the lower 
MTBF of the GPU modules. Implementing MVM on AdvancedTCA or Open VPX increases the cost significantly, due 
to the higher price of the components; however, the power consumption is lower, for about the same MTBF. 
AdvancedTCA and Open VPX seem attractive if they can run RBCG30: because the unit count is low, the cost and 
power consumption are quite a bit lower, and the MTBF is significantly higher for AdvancedTCA.
7. BENCHMARKING 
We have conducted extensive benchmarking for our three proposed PC-based systems, and limited benchmarking for 
our proposed AdvancedTCA systems. No benchmarking was performed for our proposed Open VPX system, as this 
solution did not look very attractive for NFIRAOS, as discussed in section 8. 
7.1 PC server - CPU only benchmarking 
Benchmarking of our PC Server - CPU only system is described in detail in a dedicated paper [6]. Here is a summary.
Our benchmark machine replicates one motherboard of one of the servers noted S1…S8 in Figure 3. It has two E5-2697 
v2 CPUs, with 12 cores per CPU. As discussed in section 4.1, this is the latest Intel CPU available as of Spring 2014. 
We are therefore benchmarking the work required to process one half of one LGS WFS. A similar machine, connected 
via 10Gb Ethernet, sends pixels at the same rate as the LGS WFS would, and receives the DM commands once the 
computation is performed. Using the same machine to send and receive allows for accurate timings. 
The benchmarked machine receives pixels from half of one LGS WFS (one quarter of a WFS is assigned to each CPU), 
performs pixel processing and the part of the MVM related to these pixels and each CPU sends its DM commands back. 
The elapsed time (round-trip time) is calculated from the time the first pixel is sent to the time the last DM command is 
received. 
A number of system tunings have been performed in order to achieve reliable real-time performance. These include: 
• Turning hyper-threading off 
• Using the Linux operating system with the real-time patch 
• Isolating one core on each CPU to run all the non-RTC tasks 
• Configuring the Ethernet connection to use 9000 byte jumbo packets and setting the interrupt throttle rate for 
the 10Gb Ethernet adapter to 0 
• Assigning the interrupts originating from the 10Gb Ethernet devices to a specific core 
• Using the Linux “tuned” package to set the system profile to minimize latency 
• Assigning specific cores for specific tasks, including 2 cores for pixel reading, 2 cores for pixel processing and 
16 cores for the MVM. 
• Using UDP datagrams rather than TCP streams (a minimal sketch of two of these tunings follows below) 
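A minimal sketch (ours, not the delivered RTC software) of two of these tunings, pinning a task to a dedicated core and receiving pixel data as UDP datagrams; the core number and port are hypothetical, and sched_setaffinity is Linux-only:

    import os, socket

    os.sched_setaffinity(0, {2})        # run this process only on (isolated) core 2

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)        # UDP, not TCP
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1 << 24)  # large receive buffer
    sock.bind(("0.0.0.0", 5000))        # hypothetical port for WFS pixel packets

    packet, _ = sock.recvfrom(9000)     # one jumbo-frame sized pixel datagram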
 
The results of a 12 hour run processing half of an LGS WFS on 2 CPUs at 800Hz are shown in the form of a 
histogram in Figure 10.
 
Figure 10: Histogram of end-to-end round-trip times for a 12 hour run 
The average round-trip time is 766.5us, and the worst case is 877.0us, well below the requirement of 1200us. We have 
also conducted tests in which the control matrix was changed on the fly. We have found that swapping control matrices 
was slowing down the processing of the frame at which the swapping occurred, but by only ~120us, which still meets 
the requirements. The impact is relatively modest, thanks to the ability to pre-fetch the new coefficients from memory 
during the “dead time” after the new DM commands are computed and before the new frame arrives. 
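A plausible shape for this mechanism (our sketch, not the benchmarked code) is a double buffer: the new matrix is staged by a background thread, and the real-time loop swaps a reference during the per-frame dead time.

    import threading
    import numpy as np

    class ControlMatrixSwapper:
        def __init__(self, E):
            self.active = E              # matrix used by the real-time loop
            self.pending = None          # staged replacement from the RPG
            self.lock = threading.Lock()

        def stage_in_background(self, fetch_new_matrix):
            # fetch_new_matrix is slow (e.g. a network transfer from the RPG)
            def worker():
                new_E = fetch_new_matrix()
                with self.lock:
                    self.pending = new_E
            threading.Thread(target=worker, daemon=True).start()

        def swap_if_ready(self):
            # Called once per frame, in the dead time after commands are sent.
            with self.lock:
                if self.pending is not None:
                    self.active, self.pending = self.pending, None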
7.2 PC server + GPU benchmarking 
Similar end-to-end benchmarking has been carried out for our PC server + GPU architecture. The results have already 
been published [5], and are summarized below. The test also uses two servers connected with a 10GbE link: one server 
sends pixels and receives DM commands, and one server processes the pixels and computes the DM commands. The 
processing server is equipped with 2 GTX 580 GPUs, which are the consumer-grade equivalent (similar single precision 
computing performance and memory bandwidth) of the professional grade Tesla K10 GPUs that we carry in our design 
(the latter are built for high performance computing with high reliability and are certified for use in server environments).
The benchmarking therefore simulates the processing of a full WFS, and also includes control matrix swapping every 
8000 steps (sending the new control matrix from CPU to GPU is spread over a few steps, 10 columns each). The end-to-
end results are shown in Figure 11. The average round-trip time is 0.88ms, the best case is 0.83ms, and the worst case is 
0.97ms. This is well below the requirement of 1.2ms. 
 
Figure 11: End-to-end round-trip times for a 9 hour run. The green curve shows the results sorted from low to high. 
7.3 PC server + Xeon Phi benchmarking 
Benchmarking of our PC Server + Xeon Phi system is described in detail in a dedicated paper [6]. Here is a summary.
Our benchmark uses the same server platform as the CPU-only tests reported in section 7.1, but now each of the two 
CPUs is connected to a Xeon Phi (7120P, as described in section 4.1). This set-up is equivalent to the PC Server + GPU 
set-up, except that Xeon Phis are used instead of GPUs. The server now handles one full LGS WFS: the CPUs compute 
the slopes from the pixels, and each Xeon Phi computes the MVM for half of the slopes. In order to keep the code 
simple, we used the OpenMP 4.0 programming framework, with which computations can be assigned to the Xeon Phi 
with simple offload directives. Our first attempt was to offload the MVM calculation in one single call. This requires that 
all the slopes are computed, and can only start after the whole CCD is read, 500us after the first pixel arrives. The time 
between when the offload command is issued and when the CPU receives the last DM command from the Xeon Phi is 
shown in Figure 12. 
Figure 12: MVM execution time using a single offload call to the Xeon Phi to process half of one LGS WFS. The inset 
shows the results for times less than 2ms. 
The results show an unacceptable level of latency, with some measurements of up to 10ms. Also, these results do not 
account for the extra 500us required to compute all the slopes. We have tried to use two offload commands, which 
allows starting when only half of the pixels are received, but the latency results got even worse. We believe that the 
average latency could be reduced by using lower level communication routines (no OpenMP). However, we do not 
believe that this could solve the problem of the "slow frames", which is probably due to the fact that the Xeon Phi 
support software does not at the moment have a real-time kernel. 
7.4 Control matrix computation benchmarking 
The MVM algorithm relies on another machine computing the control matrix, and updating it every ~10s. This machine 
is not part of the NFIRAOS RTC, and is located in the observatory control room. The FDPCG iterative algorithm is used 
to compute the control matrix [5]. This process can be parallelized, and has been benchmarked on GPUs. The results 
show that for updating the control matrix, where the previous control matrix can be used as a warm start, only 10 FDPCG 
iterations are required, and they can be accomplished in ~9 seconds on 8 GPUs. At the beginning of a new observation, 
when no warm start is possible, 100 iterations are required. This will take 90 seconds, which is well under the 5 minutes 
allowed to be ready for a new observation. Once the control matrix is calculated, it takes about 1s to download it over a 
10GbE connection. 
7.5 RBCG30 on Kermode Board benchmarking 
We have benchmarked critical modules of the FPGA design for the RBCG30 by implementing them within the Xilinx 
design tools. Unlike CPU-based compute engines, FPGAs are very deterministic, and therefore their performance can be 
simulated very accurately. The layout of the FPGAs handling region 2 and region 3 is shown in Figure 13. The FPGAs 
handling regions 1 and 4 are similar.
Figure 13: High level FPGA RBCG30 design (left), and design of a compute element (right) 
The most critical part is the computing array. It is made up of 16 computing groups, and each computing group has 256 
computing elements. The layout of a computing element is shown in Figure 13 (right). The FPU block contains a floating 
point single precision adder and multiplier. 
A full computing array was implemented in the Xilinx development tool. It was found that it could run at 450MHz, 
therefore achieving 0.1152 TMAC/s. This would lead to 12us to compute one CG iteration, or 360us for the 30 iterations. 
This well meets our requirements. 
We have not pursued further benchmarking for this architecture. 
8. TRADE STUDY SCORING 
8.1 AdvancedTCA architecture with Kermode Board 
Based on our study, we have decided to reject this architecture. Our main concerns with this solution were: 
• The hardware cost, which is higher than the other solutions. 
• Time to repair: even though the MTBF of the hardware is the highest of all the architectures, the time to repair is 
also the highest, because there will be no on-line live spares. The lost time due to failure was estimated at 
0.04%, well below our 0.1% budget, but an order of magnitude greater than for the other architectures. 
• Programming time and programming skills: as demonstrated by our attempts at benchmarking, programming 
FPGAs is labor intensive and requires specialized programming skills. Also, the iterative algorithms that would 
be implemented on the Kermode board are more complex than the straight MVM.
A few strong points could be identified, however. These were: 
• Low energy consumption: this is the only architecture that actually meets our power budget. 
• Higher flexibility, since, in theory, a variety of algorithms could be implemented on the platform. 
• Savings in the RPG, since iterative algorithms avoid the computationally expensive task of computing the control 
matrix.
However, we have found that these advantages do not outweigh the drawbacks. 
8.2 Open VPX architecture 
We have also decided to eliminate the Open VPX architecture. Its disadvantages include the fact that there are only a few 
vendors on the market, that the hardware is relatively expensive, and that the hardware tends to be a generation behind. 
These disadvantages appear to outweigh the fact that VPX is rugged and reliable, and that it is heavily standards-based.
8.3 PC-server based architectures 
In the end, we only retained the PC-based solutions.
Cost, power consumption and MTBF have already been presented in section 6.4. The PC+GPU solution has the lowest hardware cost. However, the software development costs are estimated to be roughly the same for the three candidate architectures, and are much higher than the hardware cost (the total RTC cost is estimated at $6M). Therefore the difference in hardware cost is rather insignificant.
The three candidate architectures meet our space requirement, with the PC+GPU solution being the most compact (11U instead of 17U for the others). The PC+GPU solution also has the lowest power consumption, but still does not meet our power budget. As discussed previously, some margin exists within NFIRAOS to increase the power allocation for the RTC. The main concern here is in fact the potential heat leakage through the electronics enclosure, which only has a 10 kW cooling capacity.
The MTBF for each candidate architecture has been turned into an expected lost time. Because all the architectures have on-line spares, the lost time is very low, much lower than our requirement. The lowest lost time was achieved by the CPU-only solution at 0.004%, with the other two architectures only marginally higher at 0.005%. So the lower MTBF of the GPUs does not translate into significantly increased lost time under our operational assumptions.
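One simple way to convert an MTBF into an expected lost-time fraction is a renewal model; the study's exact model is not spelled out here, and the numbers below are illustrative placeholders only:

    # Lost-time fraction from MTBF and effective repair time (Python).
    # With on-line spares, the effective MTTR is the failover time,
    # not the physical repair time. Values are illustrative.
    def lost_fraction(mtbf_hours, mttr_hours):
        return mttr_hours / (mtbf_hours + mttr_hours)

    print(f"{lost_fraction(20_000, 1.0):.3%}")   # ~0.005%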
Based on the MTBF, the cost of replacement parts and the energy consumption, we have attempted to estimate a cost of ownership over the full 15-year lifetime. We have found that the PC+GPU solution had the lowest cost (~$300k), followed by the PC+Xeon Phi (~$400k), the most expensive being the PC+CPU-only solution (~$500k). This is because, even though GPUs are less reliable, they are cheaper to replace, and the GPU solution wins because of its lower power consumption.
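Schematically, such an estimate combines energy cost with expected replacement cost over the 15-year lifetime; all inputs below are placeholders, not the study's actual figures:

    # Cost-of-ownership sketch over 15 years (Python).
    HOURS_15Y = 15 * 8766                # hours in 15 years

    def ownership_cost(power_kw, part_cost, mtbf_h, kwh_price=0.15):
        energy = power_kw * HOURS_15Y * kwh_price
        replacements = (HOURS_15Y / mtbf_h) * part_cost
        return energy + replacements

    # Cheaper, lower-power parts can win despite a lower MTBF:
    print(ownership_cost(power_kw=6, part_cost=3000, mtbf_h=15_000))
    print(ownership_cost(power_kw=9, part_cost=8000, mtbf_h=40_000))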
However, although we have not quantified it exactly, we estimate that the order is reversed for the cost of software upgrades (which will be required to keep up with hardware obsolescence), with the PC+GPU being the most expensive because of its specialized programming language and more heterogeneous layout.
9. CONCLUSION 
Our study has shown that commercial PC-servers with 10Gb Ethernet connectivity, running an MVM algorithm, have now reached a sufficient level of maturity to meet the NFIRAOS RTC requirements, and that we no longer need to resort to more specialized platforms such as VPX or AdvancedTCA. This conclusion is grounded in a detailed analysis and extensive end-to-end benchmarking, which also covered the calculations performed by the RPG to regularly update the matrix used in the MVM.
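To make the baseline algorithm concrete, the sketch below shows the block-parallel MVM: each server applies its horizontal slice of the control matrix to the full gradient vector, and the actuator commands are the concatenation of the partial products (dimensions scaled down from the ~7000 x 43200 NFIRAOS problem so the example runs quickly):

    # Block-parallel MVM wavefront reconstruction sketch (Python/NumPy).
    # Each "server" holds a horizontal slice of the control matrix and
    # computes the commands for its share of the actuators.
    import numpy as np

    n_act, n_slopes, n_servers = 700, 4320, 7
    rng = np.random.default_rng(0)
    C = rng.standard_normal((n_act, n_slopes))   # control matrix (stand-in)
    s = rng.standard_normal(n_slopes)            # WFS gradient vector

    blocks = np.array_split(C, n_servers, axis=0)    # one slice per server
    a = np.concatenate([B @ s for B in blocks])      # done in parallel in the RTC
    assert np.allclose(a, C @ s)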
We are very confident that both the PC+CPU-only and the PC+GPU architectures we are proposing can work, but, based on our scoring table, it is difficult to pick a clear winner between them. The PC+Xeon Phi architecture could work in theory, but our benchmarking results currently show that the timing requirements are not met, with large jitter occurring.
Table 6 presents the criteria where the three proposed architectures differ the most.   
 
 
              Pluses                     Minuses
All CPU       Homogeneity                Life cycle risk
              Ease of programming        Computation margin
              Real-time kernel           Cost
CPU + GPU     Computation margin         Heterogeneity
              Cost                       Reliability
              Power                      Ties with consumer market
CPU + Phi     Computation margin         Heterogeneity
              (in theory)                New product
                                         Single source
                                         Benchmarking results
 
Table 6: Key comparison criteria between the three server-based architectures

The main appeal of the all-CPU architecture is that it is conceptually the simplest and the most homogeneous. It is the easiest to program since it does not require the specialized programming skills that the other accelerator-based platforms require. Due to limited shelf life, it will not be possible to stock spares for the entire life span of NFIRAOS for any of the proposed architectures. However, CPU servers are such mainstream systems that they are the most likely to continue to have compatible newer versions available. Continuous support of real-time operating system kernels, which is important to ensure low jitter, is also extremely likely. The downside of the all-CPU architecture is that it offers the least computational margin and comes with a higher cost.

The CPU+GPU architecture is more compact, and provides more computational margin at a lower cost and with lower power. However, the reliance on third-party GPU accelerator cards increases the heterogeneity of the system, and therefore its complexity (programming, connectivity, etc). Even though all three architectures meet the TMT requirement for lost observing time due to failure, the CPU+GPU architecture also has a lower MTBF, and therefore will require more maintenance at the observatory. Also, the GPU technology has strong ties with consumer markets, which makes its evolution somewhat less predictable.

The CPU+Xeon Phi architecture provides, in theory, even more computational margin than the CPU+GPU architecture, simply because more servers are required, and therefore more CPUs are available for additional calculations. However, benchmarking has shown that latency was an issue that might not be resolvable. This architecture remains more heterogeneous than the all-CPU architecture, because the Xeon Phi needs to run its own operating system. This solution is also the most expensive, and the Xeon Phi co-processor is a relatively new product available from one company only. Finally, as discussed above, this architecture did not pass our benchmarking tests.

Overall, we feel that the all-CPU architecture is the most attractive solution at this time. Lesser complexity resulting from a more homogeneous platform reduces the cost and schedule risks when building the RTC, and facilitates maintenance and trouble-shooting during operation. The risk of obsolescence is also better mitigated due to the more likely availability of compatible spares during the lifetime of NFIRAOS. Our opinion is that these risk reductions, although difficult to quantify, are worth the slight increase in cost and power consumption.

REFERENCES

[1] Sanders, G. H., "Thirty Meter telescope project update," In Ground-based and Airborne Telescopes V Proc. of SPIE Vol. 9145, (2014).
[2] Herriot, G., et al., "NFIRAOS: first facility AO system for the Thirty Meter telescope," In Adaptive Optics Systems IV Proc. of SPIE Vol. 9148, (2014).
[3] Browne, S., et al., "A Real-Time Controller Architecture for the Multi-Conjugate Adaptive Optics System on the Thirty Meter Telescope," 1st AO4ELT conference - Adaptive Optics for Extremely Large Telescopes, oral presentation only (2010).
[4] Hovey, G., et al., "An FPGA Based Computing Platform for Adaptive Optics Control," 1st AO4ELT conference - Adaptive Optics for Extremely Large Telescopes, 07006 (2010).
[5] Wang, L., "Design and Testing of GPU based RTC for TMT NFIRAOS," In: Proceedings of the Third AO4ELT Conference, Simone Esposito and Luca Fini, eds., Firenze, Italy, 26-31 May 2013.
[6] Smith, M., et al., “Benchmarking hardware architecture candidates for the NFIRAOS real time controller,” In 
Adaptive Optics Systems IV Proc. of SPIE Vol. 9148, (2014). 
