Investigation of Power8 processors for astronomical adaptive optics
  real-time control by Basden, Alastair
arXiv:1506.07927v1 [astro-ph.IM] 26 Jun 2015
Mon. Not. R. Astron. Soc. 000, 000–000 (0000) Printed 11 September 2018 (MN LATEX style file v2.2)
Investigation of Power8 processors for astronomical
adaptive optics real-time control
A. G. Basden,1⋆
1Department of Physics, South Road, Durham, DH1 3LE, UK
11 September 2018
ABSTRACT
The forthcoming Extremely Large Telescopes all require adaptive optics systems for
their successful operation. The real-time control for these systems becomes compu-
tationally challenging, in part limited by the memory bandwidths required for wave-
front reconstruction. We investigate new POWER8 processor technologies applied to
the problem of real-time control for adaptive optics. These processors have a large
memory bandwidth, and we show that they are suitable for operation of first-light
ELT instrumentation, and propose some potential real-time control system designs. A
CPU-based real-time control system significantly reduces complexity, improves main-
tainability, and leads to increased longevity for the real-time control system.
Key words: Instrumentation: adaptive optics, Instrumentation: miscellaneous,
Methods: numerical.
1 INTRODUCTION
The forthcoming Extremely Large Telescopes (ELTs)
(Spyromilio et al. 2008; Nelson & Sanders 2008; Johns
2008) will all rely on adaptive optics (AO) systems (Babcock
1953) for their successful operation, allowing the degrading
effects of atmospheric turbulence to be greatly reduced. An
AO system actively measures wavefront perturbations intro-
duced by the Earth’s atmosphere, and attempts to mitigate
these in real-time (on millisecond timescales) using one or
more deformable mirrors (DMs). This is a computationally
demanding task, and requires a dedicated real-time control
system (RTCS). Computational requirements scale with the
fourth power of telescope diameter when considering traditional
RTCS algorithms: for a given level of AO correction,
the DM pitch must remain constant, and so the number
of sub-apertures across the telescope pupil scales with telescope
diameter, d. The total number of sub-apertures and
actuators therefore each scale as O(d^2), and so the
number of operations required for wavefront reconstruction
(a matrix-vector multiplication) scales as O(d^4). Due to this
rapid scaling of computational complexity, careful design
considerations must be made when designing real-time con-
trol systems for the ELTs.
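The scaling argument above can be illustrated with a short sketch. The function name, the 20 cm pitch and the simplification that the actuator count equals the sub-aperture count are assumptions made for illustration only; pupil shape and obscuration are ignored.

```python
def mvm_operations(diameter_m, pitch_m=0.2):
    """Rough operation count for one wavefront reconstruction (one MVM).

    pitch_m is an assumed, fixed sub-aperture (and actuator) pitch.
    """
    n_across = round(diameter_m / pitch_m)  # sub-apertures across pupil, O(d)
    n_slopes = 2 * n_across ** 2            # x and y gradient per sub-aperture, O(d^2)
    n_actuators = n_across ** 2             # assumed same order as sub-apertures, O(d^2)
    return 2 * n_slopes * n_actuators       # one multiply and one add per matrix element


if __name__ == "__main__":
    base = mvm_operations(8)
    for d in (8, 16, 32):
        ops = mvm_operations(d)
        # Doubling the diameter multiplies the operation count by 2^4 = 16.
        print(f"d = {d:2d} m: {ops:.3e} operations ({ops // base}x the 8 m case)")
```

Under these assumptions, each doubling of telescope diameter increases the reconstruction cost by a factor of 16.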
⋆ E-mail: a.g.basden@durham.ac.uk (AGB)
These RTCSs must be designed with long lifetimes,
since the AO instruments on these telescopes are expected to
be operational for at least thirty years (Vernet et al. 2012).
Maintenance of both software and hardware is therefore
key to success. An RTCS design which is hardware agnostic,
i.e. one that doesn’t require a particular hardware set to operate,
is clearly advantageous. Previous system designs have fre-
quently relied on specific hardware, typically digital signal
processors (DSPs) and field programmable gate arrays (FP-
GAs) (for example the ESO SPARTA system, Fedrigo et al.
2006), which, due to long periods spent in design, are of-
ten close to obsolescence even during commissioning, with
availability of spare parts becoming problematic, and spe-
cific programming knowledge required. Hardware failure of
these systems then poses the risk that an entire new system
will require designing, with the original software not being
portable to new hardware.
In recent years, there has been much success with hard-
ware agnostic AO RTCSs which operate on conventional
PC hardware, including the Durham AO real-time con-
troller (DARC) (Basden et al. 2010; Basden & Myers 2012),
which is a generic system, used by the CANARY AO on-sky
demonstrator instrument (Myers et al. 2008), and the real-
time control system for the Gemini South telescope GeMS
AO system (Rigaut et al. 2012). In theory, such systems sim-
ply require a recompilation of the source code to be ported
to other (similar) hardware platforms, and are easy to move
onto upgraded hardware. In practice, the advent of binary
driver code, e.g. for wavefront sensors (WFSs) and DMs,
means that porting is not always possible. Although port-
ing to new hardware is typically limited to other PC-like
systems that have an operating system running on a central
processing unit (CPU), this is not always the case. In partic-
ular, the DARC system has a modular design which allows
parts of the real-time pipeline to be placed in alternative
hardware, including for example:
c© 0000 RAS
(i) pixel processing and slope calculation in FPGA using
a customised version of the SPARTA system (Fedrigo et al.
2006)
(ii) wavefront reconstruction using graphics processing
units (GPUs) (Basden et al. 2010)
(iii) a full GPU pipeline, from raw WFS images to DM
demands.
However, this system still requires a CPU based core to over-
see control of the hardware accelerators.
For ELT-scale AO systems, the largest computational
requirements come from wavefront reconstruction algo-
rithms, which typically use a matrix-vector multiplication
(MVM) to obtain DM surface shape from WFS slope mea-
surements. On conventional PC hardware, this algorithm
is memory-bound, rather than compute-bound, and so for
low latency operation, systems with large memory band-
width are required. For this reason, accelerator cards (such
as graphics processing units (GPUs)) are considered in de-
signs for ELT-scale RTCSs to provide the necessary mem-
ory bandwidths for these algorithms. However, this in itself
raises new problems in moving data into and out of the ac-
celerator for processing, which adds time and hence latency
to the RTCS pipeline. Designs that minimise this latency
are key.
1.1 The POWER8 processor
The specification and road-map of the IBM POWER8
processor (Sinharoy et al. 2015) seem promising for AO
RTCSs, with two key relevant features: a memory bandwidth
approaching that of GPUs (up to 230 GB/s), and
support for a novel interconnect technology (NVLink, Foley
2014) due for release in 2017 that will provide an order
of magnitude increase in data bandwidth between pro-
cessor and GPU. Additionally, the OpenPower foundation
has the potential for providing novel hardware accelera-
tion architectures tightly coupled with POWER8 processors
via the Coherent Accelerator Processor Interface (CAPI)
(Stuecheli et al. 2015), including a currently available offer-
ing from the company Nallatech. The memory bandwidth
of these processors is significantly larger than other avail-
able CPUs, hence the interest for AO real-time control, and
a concise overview of the memory subsystems is given by
Starke et al. (2015).
Here, we provide details of initial performance testing
of the DARC RTCS on a POWER8 system.
In §2 we discuss the system configuration, RTCS in-
stallation process and the tests that we perform. In §3 we
present our findings, and we conclude in §4.
2 THE DARC REAL-TIME CONTROLLER ON
A POWER8 SYSTEM
Most of the results that we present here were obtained
on a low-end Tyan OpenPower Customer Reference system,
model GN70-BP010, hosted at Durham. This system
has a single 4-core POWER8 processor clocked at 3 GHz.
Each core has 8-way simultaneous multi-threading, providing
a total of 32 hardware threads. The system has 16 GB
DDR3 (1.6 GHz) RAM, controlled by a single Centaur mem-
ory controller. The total theoretical memory bandwidth for
this system is 28.8 GB/s between CPU and main memory
(19.2 GB/s read, 9.6 GB/s write).
We have also had limited cloud access to a more powerful
S824 POWER8 system with two 12-core processors (of
which our machine instance had access to 22 cores), each
8-way threaded, providing a total of 176 hardware threads.
Half of the memory banks of this machine are populated,
and thus a total memory bandwidth of about 59 GB/s for
read operations, and 29.5 GB/s for write operations is avail-
able. The operating system of this machine was run behind
a hypervisor. Both of these systems run the Ubuntu oper-
ating system (14.10). Results presented here are from our
low-end system unless stated otherwise.
2.1 Real-time control system installation
We use the publicly available DARC AO RTCS system, with
source code downloaded from the sourceforge hosting site.
Installation on a POWER8 system was trivial: we simply
had to remove three unsupported compiler options from the
Makefile (-msse2 -mfpmath=sse -march=native) and then
compile and install in the usual way. All of the required
library dependencies were available from the Ubuntu repos-
itories, and downloaded automatically as part of the DARC
installation process. We did not attempt to optimise DARC
using compiler flags specific to the POWER8 processor, and
we used the freely available gcc compiler, for which source
code is available (important for lifetime considerations).
We investigated the use of GigE Vision cameras for
wavefront sensors, using the open-source Aravis library, with
modifications specifically to allow access to the camera pixel
stream, rather than full-frame access (to reduce RTCS la-
tency). Because this library is entirely open-source, and does
not require any hardware drivers, there were no issues with
binary drivers. This library provides access to a number of
wavefront sensors that have been used on-sky with the CA-
NARY AO system, including an Imperx Bobcat camera, an
Emergent Vision Technologies HS2000 10GBit camera and
a First-Light OCAM2S camera. During operation, as soon
as sufficient pixels have arrived at the computer to com-
plete a given sub-aperture, this sub-aperture is processed
by a thread (calibration, slope calculation and partial re-
construction). The thread then returns to compute the next
available sub-aperture, in a round-robin fashion. Once all
sub-apertures for a given frame have been processed, each
thread will have a partial DM vector, and these are then
combined in a reduction step to yield the final DM com-
mand.
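The per-thread processing and final reduction described above can be sketched as follows. This is a structural illustration only and is not DARC source code: the function name is hypothetical, precomputed slopes stand in for the calibration and slope calculation stages, a shared work queue stands in for the scheduling of available sub-apertures, and Python threads will not show the speed-up that a compiled implementation obtains.

```python
import queue
import threading


def run_frame(slopes_per_subap, control_matrix, n_threads=4):
    """Compute DM demands for one frame from per-sub-aperture slopes.

    slopes_per_subap: list of (sx, sy) tuples, one per sub-aperture.
    control_matrix:   n_actuators rows, each with 2 * n_subaps entries.
    """
    n_act = len(control_matrix)
    work = queue.Queue()
    for idx, slopes in enumerate(slopes_per_subap):
        work.put((idx, slopes))

    # One private partial DM vector per thread, so no locking is required.
    partials = [[0.0] * n_act for _ in range(n_threads)]

    def worker(thread_id):
        acc = partials[thread_id]
        while True:
            try:
                idx, (sx, sy) = work.get_nowait()  # next available sub-aperture
            except queue.Empty:
                return
            # Partial reconstruction: two matrix columns for this sub-aperture.
            for a in range(n_act):
                row = control_matrix[a]
                acc[a] += row[2 * idx] * sx + row[2 * idx + 1] * sy

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Reduction step: sum the per-thread partial vectors into the DM command.
    return [sum(p[a] for p in partials) for a in range(n_act)]
```

Because each thread accumulates into its own vector, the only synchronisation point in this sketch is the final reduction.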
To further demonstrate the proof of concept of a com-
plete AO system, we selected an Alpao 241 actuator DM
with an Ethernet interface. It was necessary to develop our
own library interface for this DM since source code for the
Software Developers Kit was not available, and the binary
libraries were for X86 architectures. However, control of this
DM involves sending a UDP packet, and so was trivial to
implement. A closed-loop AO system driven by a POWER8
server is therefore feasible using an existing RTCS.
2.2 Testing real-time performance
We investigate the performance of DARC on POWER8 by
configuring the system as would be used in a number of
different AO cases. These are:
(i) A 40 × 40 sub-aperture single conjugate AO (SCAO)
system.
(ii) An 80 × 80 sub-aperture SCAO system.
(iii) An 80 × 80 sub-aperture system with increased actuator
counts.
For each of these cases, we investigate performance for
different sized sub-apertures, i.e. different numbers of pixels
per sub-aperture.
The third case can be viewed as a single WFS of the
proposed European ELT (E-ELT) multi-conjugate adaptive
optics (MCAO) instrument (Foppiani et al. 2010) with com-
putation of a full set of partial DM demands. A full MCAO
real-time control system could then be comprised of one
compute node per WFS, with combination of partial DM
demands being computed as a (low operation count) final
processing step to give the demands to be sent to the DMs.
We discuss this further in §3.4.
Our tests presented here do not include a physical WFS
camera or DM, since we do not have suitable equipment
available (specifically, cameras with sufficient pixels and
frame-rates, and a DM with enough actuators). Rather, we
concentrate on the core computational pipeline. Our previ-
ous experience has shown that introducing a physical camera
to a system has little impact on overall performance (max-
imum achievable frame rate), provided the camera itself is
capable of reaching these frame rates. Because the DARC
RTCS can process pixels as they arrive at the computer,
then once the last pixel for a given frame arrives, most of
the computation has typically already completed. The RTCS
is used without frame pipe-lining here, i.e. there are never
two frames being processed at once, so that the frame-rate
represents the computation time of a given frame. We note
that with a real camera, expected readout time and data
transfer time will depend very much on camera model, and
in astronomical AO the readout time is often the limiting
factor in achievable frame-rate (likely to be the case for the
forthcoming ELTs), and for true latency considerations, this
should be taken into account. For example, for a camera with
a maximum frame rate of 500 Hz, the readout time (and ex-
posure time) will be 2 ms. Assuming that data is transferred
as it is read out (rather than buffered), this means there will
be a delay of 4 ms from start of exposure to last pixel arriving
at the computer (by which time, most of the computation
will have completed). However, an investigation of camera
latency is beyond the scope of this paper.
Of key importance in the approach that we take is
that we are using a fully configured AO RTCS, which has
been proven on-sky. When bench-marking hardware perfor-
mance, it can be tempting to write simple bench-marking
code which investigates the key algorithms under consider-
ation, i.e. image calibration (vector operations), slope com-
putation (vector and reduction operations), and wavefront
reconstruction (matrix-vector multiplication). However, this
leads to optimistic performance estimates, since the bench-
mark is grossly simplified and bears little resemblance to
actual code that would be usable on-sky at a telescope.
2.2.1 The performance metric
We define the performance of the RTCS by measuring the
time taken to perform the computation for each AO system
frame. In the default DARC configuration, which we use
here, the computation of each frame must be completed be-
fore the next frame is started. This therefore means that the
inverse of the frame computation time gives the maximum
achievable frame-rate for the AO system. This behaviour is
critical for optimising AO system latency on a given hard-
ware set.
The DARC RTCS uses a horizontal processing strat-
egy (Basden et al. 2010) with each thread operating on
WFS data from start to finish, rather than having different
threads performing individual tasks (e.g. a set of threads for
image calibration, a set for slope computation, and a set for
wavefront reconstruction). This strategy allows automatic
load balancing by the operating system, and simplifies per-
formance optimisation: the main parameter to be optimised
is the number of processing threads, rather than balancing
the number of threads per algorithm which can become a
complex optimisation problem. Of further consideration is
the number of sub-apertures that each thread should pro-
cess at once, influencing the order of memory operations and
the size of the partial matrix-vector multiplications. If this is
too small, then many inefficient small matrix-vector multi-
plication operations will reduce the performance, while if too
large, a small number of large matrix-vector multiplication
operations will lead to a saturation of memory bandwidth,
resulting in threads being work-starved.
2.3 Tests of memory bandwidth
To directly test the memory bandwidth available, we use
the STREAM benchmark (McCalpin 1995), which performs
a number of different memory read and write operations.
Results are given in table 1, and show that for our low-end
(4-core) server, over 85% of theoretical memory bandwidth
can be reached, while achieving nearly 80% on the higher-
end machine. There are several things to note here: we did
not optimise the STREAM benchmark on the higher-end
machine due to limited access, and so actual performance
is expected to be slightly higher. The STREAM results in-
clude memory read and write access, which will lead to lower
than expected results for some of these tests since the avail-
able bandwidth on POWER8 systems is asymmetric (i.e.
the read bandwidth is twice the write bandwidth). A non-
standard read-only version of Triad shows slightly higher
memory bandwidth utilisation, reaching 90.9% of the theo-
retical maximum.
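The utilisation percentages quoted above can be checked against the Triad entries of Table 1. This is a minimal sketch: taking the theoretical read bandwidth of each machine as the reference is our assumption, chosen because it reproduces the quoted figures, and the names used are illustrative.

```python
# Theoretical read bandwidths (GB/s) quoted in the system description.
THEORETICAL_READ_GBS = {"4-core": 19.2, "22-core": 59.0}
# Measured STREAM Triad bandwidths (GB/s) from Table 1.
TRIAD_GBS = {"4-core": 16.4, "22-core": 46.1}


def fraction_of_peak(measured_gbs, peak_gbs):
    """Return measured bandwidth as a fraction of the theoretical peak."""
    return measured_gbs / peak_gbs


if __name__ == "__main__":
    for machine, measured in TRIAD_GBS.items():
        frac = fraction_of_peak(measured, THEORETICAL_READ_GBS[machine])
        print(f"{machine}: {100 * frac:.1f}% of theoretical read bandwidth")
```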
3 RTCS PERFORMANCE ON POWER8
We now consider the achievable performance on the
POWER8 systems under investigation, and consider the ap-
plication for future RTCS designs. For each case, we investi-
gate changing the number of threads used by DARC, and the
processing block size used, i.e. the number of sub-apertures
processed together as a block.
Table 1. The STREAM benchmark results for the POWER8
systems under investigation here (total memory bandwidth
achieved, in GB/s). For the 4-core machine, best performance
was using 3 threads, while 48 threads were used for the 22-core
machine. The Read-only line is an additional function that we
added to test read memory access only (i.e. no memory writes
are performed), and is achieved using 4 threads.

STREAM function    4-core machine    22-core machine
Copy               15.5              46.0
Scale              15.1              45.5
Add                16.3              41.0
Triad              16.4              46.1
Read-only Triad    17.4

Figure 1. Achievable RTCS frame rate as a function of number of
processing threads used. The individual lines represent the number
of times (given by the legend) threads are reused each frame
(affecting the number of partial matrix-vector products that are
implemented). [Plot: achievable AO frame rate / Hz (0 to 2000)
against number of threads used (0 to 40); legend entries 1 to 6.]
3.1 An 8 m XAO system
We investigate the case of an eXtreme AO (XAO) system
on an 8 m telescope with 20 cm sub-apertures (40 × 40),
and results are shown in Fig. 1. Here, it can be seen that
with the low-end system a maximum frame-rate of nearly
2 kHz is achieved. In this case, the control matrix size is
1304 × 2480, requiring a memory bandwidth of 23.4 GB/s
to read this from main memory every RTCS iteration at
this frame rate. This is larger than the available memory
bandwidth (19.2 GB/s) and therefore, the control matrix
(12 MB) is being stored in the large L3 cache (32 MB).
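The matrix size and bandwidth figures above can be reproduced with a short calculation. This sketch assumes 4-byte (single precision) matrix elements and binary prefixes (1 MB = 2^20 bytes, 1 GB = 2^30 bytes); neither is stated explicitly in the text, but together they recover both the 12 MB figure here and the 18.9 GB/s figure quoted for the 80 × 80 case in §3.2. The function names are illustrative.

```python
BYTES_PER_ELEMENT = 4  # assumption: single-precision control matrix
MIB = 2 ** 20
GIB = 2 ** 30


def matrix_mib(rows, cols):
    """Control matrix size in binary megabytes."""
    return rows * cols * BYTES_PER_ELEMENT / MIB


def mvm_read_bandwidth_gib(rows, cols, frame_rate_hz):
    """Bandwidth needed to stream the whole matrix once per frame."""
    return rows * cols * BYTES_PER_ELEMENT * frame_rate_hz / GIB


def implied_frame_rate_hz(rows, cols, bandwidth_gib):
    """Frame rate at which streaming the matrix needs the given bandwidth."""
    return bandwidth_gib * GIB / (rows * cols * BYTES_PER_ELEMENT)


if __name__ == "__main__":
    size = matrix_mib(1304, 2480)
    print(f"40x40 matrix: {size:.1f} MB, fits in 32 MB L3: {size < 32}")
    print(f"frame rate implied by 23.4 GB/s: "
          f"{implied_frame_rate_hz(1304, 2480, 23.4):.0f} Hz")
    print(f"80x80 matrix at 100.2 Hz: "
          f"{mvm_read_bandwidth_gib(5160, 9824, 100.2):.1f} GB/s")
```

Since 23.4 GB/s exceeds the 19.2 GB/s theoretical read bandwidth, the matrix cannot be arriving from main memory at this frame rate, which is the basis for the conclusion that it is held in cache.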
RTCS processing tasks are divided among a selected
number of threads, and we see that using 31 threads provides
best performance. The processor has 4 cores, each with 8-
way simultaneous multithreading capability (i.e. 32 virtual
cores). Of particular note is the linearity of these curves
between 8 threads and the peak: the RTCS pipeline is seen
to be highly parallelisable with performance scaling almost
directly with the number of cores available.
We also consider the case when this system has a larger
number of actuators to control, e.g. for a woofer-tweeter system.
This is of particular interest, because it will allow us to
measure maximum RTCS performance as the control matrix
size approaches, and exceeds, that of the L3 cache. Fig. 2
shows these results (with the optimum number of processing
threads selected): as expected, the achievable AO frame rate
degrades as the problem size increases.
Once the control matrix size approaches about 48 MB (equal
to the size of the L3 and L4 cache combined), performance
is clearly degraded, with memory bandwidth between
the processor and main memory becoming the limiting factor.
Performance levels off utilising about 90% of the available
memory bandwidth for large control matrix sizes, in
agreement with the STREAM benchmark.
Figure 2. Maximum achievable RTCS frame rate as a function of
number of actuators controlled for a 40 × 40 sub-aperture system.
Inset is shown the corresponding memory bandwidth required by
the matrix-vector multiplication to achieve this frame rate.
[Plot: achievable AO frame rate / Hz (0 to 2000) against number
of actuators (0 to 16000); inset: MVM memory bandwidth / GB/s
(16 to 30) against number of actuators; annotations: "Theoretical
19.2 GB/s", "32 MB L3 cache", "48 MB matrix size (L3+L4 cache
size)".]
3.2 A single ELT WFS
We investigate the case of an E-ELT single conjugate AO
(SCAO) system, with a single WFS with 80 × 80 sub-
apertures (with 6×6 pixels per sub-aperture), and a control
matrix of size 5160× 9824 (193 MB). In this case, the max-
imum frame rate is 100.2 Hz on our low-end system, requir-
ing a memory bandwidth of 18.9 GB/s to read the matrix
from memory each iteration (it is too large to fit in cache),
in addition to reading calibration image and other memory
operations. This is very close to the theoretical maximum
memory bandwidth, and so we conclude that the POWER8
architecture is optimised and pipelined in such a way as to
achieve peak performance for mixed processing tasks.
The higher-end system provides a maximum frame-rate
of 150 Hz, requiring a memory bandwidth of 28.8 GB/s (with
a slightly larger control matrix with 10,000 actuators). It
should be noted that because of the way the RTCS is cur-
rently implemented, a single copy of the control matrix is ac-
cessed, and therefore will be stored in the memory attached
to one processor. Threads executing on the second proces-
sor must therefore access this matrix via the first processor,
therefore limiting the available memory bandwidth for con-
trol matrix access to that of one processor, i.e. 29.5 GB/s
in this case. This is clearly a limiting factor for the RTCS,
in part due to the non-uniform memory access (NUMA) architecture
of the multi-processor computer hardware, one
which is now on the list of improvements to be made to the
DARC system. We note here that we are achieving an effective
memory bandwidth very close to the theoretical limit
available to the system.
Figure 3. Maximum AO frame rate as a function of number of
pixels per sub-aperture (with 80 × 80 square sub-apertures).
[Plot: achievable AO frame rate / Hz (84 to 102) against number
of pixels per sub-aperture (0 to 300).]
For reference a top-end Intel X86 processor (E5-2699-
v3) has 18 cores and a 45 MB level-3 cache, with 68 GB/s
access to main system memory, costing around 5000.
We also investigate the effect of number of pixels on
AO real-time performance, with Fig. 3 showing maximum
AO frame rate on our low-end POWER8 hardware as a
function of number of pixels per sub-aperture. Increasing
the number of pixels per sub-aperture reduces maximum
frame-rate, suggesting that as sub-apertures get larger, the
matrix-vector multiplication is no longer the sole rate lim-
iting factor. Although the memory bandwidth required to
read an image, background map and flat-field information
at the AO frame rate is small (compared to that required
for the control matrix), at only 1.5 GB/s for the largest
sub-apertures used here, the larger images will have a larger
impact on cache operations, meaning that less of the con-
trol matrix is available in cache for when required, leading
to additional memory reads, and reduced AO frame-rates.
Additionally, a larger number of floating point operations
are required for pixel processing, meaning that the matrix-
vector multiplication time is no longer so dominant.
3.2.1 Thread counts
We investigate how the number of processing threads affects
the achievable AO frame rate. Fig. 4 shows that using close
to, but less than, the number of hardware threads (32) pro-
vides best performance. Of particular note here is that (in
comparison with Fig. 1) performance no longer scales di-
rectly with the number of processing cores. This is because
this larger problem size is memory bandwidth limited, rather
than compute limited.
3.2.2 Amdahl’s law
Amdahl’s law (Amdahl 1967) states that the performance
gain in a system through parallelisation (or other) techniques
is limited by the fraction of time spent within the
parts of the system benefiting from those improvements.
In the case of a high order AO RTCS, the limiting performance
factor is memory bandwidth, required for wavefront
reconstruction. Increasing available memory bandwidth
will only continue to significantly improve performance
while other parts of the computational pipeline
(namely image calibration and slope calculation) do not begin
to dominate the computation time. Therefore, to be able
to make scaled performance predictions, we need to be able
to determine the time taken for these operations which are
compute limited rather than memory bandwidth limited.
Figure 4. A figure showing how maximum achievable AO frame
rate is dependent on the number of processing threads used. The
individual lines represent the number of times (given by the legend)
threads are reused each frame. [Plot: achievable AO frame
rate / Hz (0 to 100) against number of threads used (0 to 40);
legend entries 1 to 6.]
We therefore investigate performance with and without
wavefront reconstruction. For the case without wavefront
reconstruction, we are interested in how well the POWER8
system can process pixel information and produce wavefront
slopes, and assume that the reconstruction could be per-
formed elsewhere (i.e. in a GPU, using NVLink), though of
course this may introduce additional latency.
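Amdahl's law, as invoked in this section, can be written as a one-line function. The fraction and improvement factors used in the example are purely illustrative and are not measurements from this work.

```python
def amdahl_speedup(fraction_improved, improvement_factor):
    """Overall speed-up when a fraction of the run time is accelerated.

    fraction_improved:  fraction of the original time spent in the part
                        that benefits (e.g. the memory-bound MVM).
    improvement_factor: how much faster that part becomes.
    """
    remaining = 1.0 - fraction_improved
    return 1.0 / (remaining + fraction_improved / improvement_factor)


if __name__ == "__main__":
    # Hypothetical example: 90% of frame time is memory-bound reconstruction.
    for factor in (2, 10, 100):
        print(f"MVM {factor:3d}x faster -> frame {amdahl_speedup(0.9, factor):.2f}x faster")
```

However large the improvement factor becomes, the overall gain in this example cannot exceed 10, the reciprocal of the fraction of time that does not benefit.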
Fig. 5 shows maximum achievable frame rates for the
AO RTCS processing pipeline when the large matrix-vector
multiplication for wavefront reconstruction is removed, and
thus places an approximate limit on achievable performance
for these processors when unlimited memory bandwidth is
available. Therefore, we can see that when using a POWER8
system with greater memory bandwidth (up to 256 GB/s
read bandwidth for a dual-processor server), frame rates of
nearly 1.3 kHz should be available for this system, limited by
the memory bandwidth for wavefront reconstruction, since
we know that other aspects of the real-time pipeline can be
performed faster than this (1.6 kHz on our low-end system,
and faster on a high end 24-core server).
3.3 A multiple mirror ELT SCAO system
To investigate the performance of this ELT-scale SCAO system
further, we consider the case of multiple mirror SCAO
systems, i.e. with an increased number of actuators. This increases
the control matrix size, and thus allows us to investigate
performance limiting factors for different AO system
configurations. We also investigate performance with different
sub-aperture sizes (pixels per sub-aperture), so that we
can separate compute intensive and memory intensive tasks.
Figure 5. A figure showing achievable AO RTCS frame-rates
as a function of thread count on the low-end POWER8 system
when wavefront reconstruction is not performed, for an ELT-scale
SCAO system (80 × 80 sub-apertures). [Plot: achievable AO frame
rate / Hz (0 to 1600) against number of threads used (0 to 40);
legend entries 1, 4 and 8.]
Figure 6. Maximum AO frame rate as a function of number of
actuators controlled with 80 × 80 sub-apertures. Inset is shown
the memory bandwidth required to reach this frame rate for a given
matrix size. [Plot: achievable AO frame rate / Hz (40 to 110)
against number of actuators (5000 to 10000); inset: MVM memory
bandwidth / GB/s (16.5 to 19.0) against number of actuators.]
Fig. 6 shows maximum AO frame rate on our low-end
POWER8 hardware as a function of control matrix size.
The maximum achievable frame-rate is reduced propor-
tionally to the control matrix size, again limited by mem-
ory bandwidth, though we see that for larger matrices, the
memory bandwidth achieved is slightly reduced. We believe
that this is due to less of the larger matrix being cached,
i.e. when there is a larger matrix to read, cache prediction
is not so good. However, the system is still able to achieve
nearly 90% of theoretical memory bandwidth during the AO
system loop.
3.3.1 Operation at necessary frame rates
The maximum frame rates reported so far have not been
sufficient for an on-sky ELT AO system. However, we have
only been able to perform bench marking on a low-end sys-
tem. Due to the high utilisation of available memory band-
width (close to 100%), we can make predictions as to max-
imum achievable frame rates for currently available higher
end systems. A POWER8 S824 system contains two proces-
sors, each with up to 128 GB/s memory bandwidth for read
operations, a combined factor of 13.3 times greater than
our system. If memory bandwidth is the limiting factor,
we could expect an AO frame rate of greater than 1.2 kHz
for an ELT-scale SCAO system using an S824 system. It is
likely that other parts of the computational pipeline would
start to limit performance so that this frame rate would not
be achieved. In §3.2.2 we have investigated performance on
our low-end system with the matrix-vector multiplication
removed, to demonstrate that pixel processing and slope
computation at higher frame rates is achievable. Therefore,
with sufficient memory bandwidth, ELT frame rates are eas-
ily available on an existing POWER8 server.
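The prediction above amounts to scaling the measured frame rate by the ratio of memory bandwidths, capped by the rate at which the rest of the pipeline can run. The following is a minimal sketch under the assumption of ideal linear scaling and of a simple minimum between the two limits; the function names are illustrative, and the result should be read as an upper bound.

```python
def bandwidth_limited_rate(measured_rate_hz, measured_bw_gbs, target_bw_gbs):
    """Scale a memory-bound frame rate linearly with available bandwidth."""
    return measured_rate_hz * target_bw_gbs / measured_bw_gbs


def predicted_rate(bandwidth_limited_hz, compute_limited_hz):
    """The slower of the two limits sets the achievable frame rate."""
    return min(bandwidth_limited_hz, compute_limited_hz)


if __name__ == "__main__":
    low_end_bw = 19.2      # GB/s read, low-end system
    s824_bw = 2 * 128.0    # GB/s read, two processors
    measured = 100.2       # Hz, ELT-scale SCAO case on the low-end system
    no_mvm_rate = 1600.0   # Hz, low-end pipeline without reconstruction

    print(f"bandwidth ratio: {s824_bw / low_end_bw:.1f}")
    bw_rate = bandwidth_limited_rate(measured, low_end_bw, s824_bw)
    print(f"bandwidth-limited rate: {bw_rate:.0f} Hz")
    print(f"predicted rate: {predicted_rate(bw_rate, no_mvm_rate):.0f} Hz")
```

With these inputs the bandwidth-limited rate remains below the 1.6 kHz measured without reconstruction, so memory bandwidth is still the limiting factor in this simple model.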
3.4 An ELT MCAO system
We have considered the performance case for an ELT-scale
SCAO system, and we now use this information to con-
sider MCAO system design. The E-ELT MCAO instrument,
MAORY (Foppiani et al. 2010), is likely to have 4–6 laser
guide stars (LGSs) and up to 3 natural guide star (NGS)
low order wavefront sensors, with a total of 2 or 3 DMs
(including the telescope M4 DM), operating up to 10,000
actuators with a 500 Hz frame rate.
Processing of WFS images to yield wavefront gradients
is independent, i.e. slopes obtained by processing one WFS
do not depend on the processing of other WFSs. Similarly,
when using conventional matrix-vector multiplication wave-
front reconstruction methods (we discuss other methods in
§3.7), the slopes from each WFS can be used independently
of other WFSs to compute a partial set of DM commands.
The partial DM commands from each WFS can then be
summed, yielding the final DM demands to be applied to
the mirror, in a low count vector addition operation.
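The independence of the per-WFS reconstructions follows from the linearity of the matrix-vector product: splitting the control matrix into column blocks, one per WFS, and summing the partial results gives the same DM demands as a single full multiplication. A small sketch with made-up values follows; the function names and the toy matrix are hypothetical.

```python
def mvm(matrix, vector):
    """Plain matrix-vector multiplication on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def split_columns(matrix, sizes):
    """Split a control matrix into column blocks, one block per WFS."""
    blocks, start = [], 0
    for n in sizes:
        blocks.append([row[start:start + n] for row in matrix])
        start += n
    return blocks


def summed_dm_demands(blocks, slopes_per_wfs):
    """Each block/slope pair could be handled by a separate server."""
    partials = [mvm(block, slopes) for block, slopes in zip(blocks, slopes_per_wfs)]
    # Low operation count final step: element-wise sum of the partial vectors.
    return [sum(values) for values in zip(*partials)]


if __name__ == "__main__":
    control_matrix = [[1, 0, 2, 0, 1],
                      [0, 1, 0, 3, 1]]   # 2 actuators, 5 slopes (toy values)
    slopes = [[1, 2], [3, 4, 5]]         # WFS 1 has 2 slopes, WFS 2 has 3
    blocks = split_columns(control_matrix, [2, 3])
    print(summed_dm_demands(blocks, slopes))
    print(mvm(control_matrix, slopes[0] + slopes[1]))
```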
We therefore now consider an MCAO control solution
which has a separate POWER8 server for each LGS WFS
(directly connected), and an additional POWER8 server for
the three NGS, with partial DM demands being sent to one
server for summation to yield the final DM demands, as
shown in Fig. 7. We note that since the NGS are likely to be
of lower order (resulting in a smaller matrix-vector multipli-
cation), it would be possible to process all NGS in a single
server, reducing cost and complexity. This server is then also
used to collate the partial DM demands, which will arrive
over more than one 10G Ethernet link to reduce latency.
With this control solution, each server therefore has to
process a single WFS, and between 8000 and 10000 actuators,
and so we can directly estimate expected performance using
Fig. 3, which by scaling to the memory bandwidth available
in an S824 system, will yield frame rates above 500 Hz, the
MAORY design goal. Further processor improvements over
the next few years (for example the Power9 processor in
2017) will improve performance further, and be available
within the time frame of MAORY system development.
Figure 7. A schematic design showing components for an ELT
MCAO real-time control system, and the links between them.
WFSs are connected individually to a POWER8 server, which
computes partial DM demands. These are then summed before
being sent to the DM.
Figure 8. A schematic design showing components for an ELT
MOAO real-time control system, and the links between them.
Four WFSs are connected to a server, which computes slope measurements,
and shares these with two other servers. Each server
then has access to all wavefront sensor slope measurements, and
computes DM demands for a single DM.
3.5 An ELT MOAO system
We now consider requirements for an ELT-scale multi-object
AO (MOAO) system. The E-ELT MOAO instrument is
likely to be MOSAIC (Hammer et al. 2014), and will use
6 LGS and up to 5 NGS. Up to 20 MOAO channels are pro-
posed, each with a DM, in addition to the main telescope
M4 deformable mirror.
Fig. 8 shows a possible schematic design for the MOAO
real-time control system. In summary, 21 servers are re-
quired, one for each DM, including the M4 mirror. Each
server receives images from 3 or 4 WFSs and processes
these to provide wavefront slope information. These wave-
front slopes are then shared with two other servers, which in
return also share the wavefront slope information computed
from their WFSs. Therefore, each server will have access to
the 11 WFS slope vectors. Each server then performs a to-
mographic wavefront reconstruction, projected along a given
line of sight, and sends the DM demands to the relevant DM.
With this design, each server is responsible for processing
up to 4 WFS images, and performing a matrix-vector
multiplication with a matrix size of about 100 000 × 5000. At
the desired frame rate of 250 Hz, this represents a required
memory bandwidth of about 470 GB/s, which is achievable
using a 4-socket POWER8 server (e.g. the S850 system,
which has a read memory bandwidth of 512 GB/s), though
it is above that obtainable in a single dual-socket server. It is
likely that within the next decade (the time-frame for ELT
MOAO instrument development), significant improvements
in memory bandwidth will be realised, enabling this
performance goal to be met with greater headroom, reducing
latency. Additionally, the inclusion of one or two GPUs in
the system (taking advantage of the forthcoming
high-performance NVLink interconnect, Foley 2014) specifically to
perform matrix-vector multiplication would further reduce
latency. We discuss this further in §3.7.
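The bandwidth figure quoted above can be reproduced approximately with a short calculation. This sketch assumes that the control matrix is stored as 4-byte single-precision values and is read from memory exactly once per frame. Neither assumption is stated explicitly in the text, and the function name is hypothetical.

```python
# Sketch: memory bandwidth needed to stream a control matrix once per frame.
# Assumptions (not from the text): 4-byte elements, one full read per frame.

def required_bandwidth(n_rows, n_cols, frame_rate_hz, bytes_per_element=4):
    """Bytes per second needed to read the whole matrix every frame."""
    return n_rows * n_cols * bytes_per_element * frame_rate_hz


bw = required_bandwidth(100_000, 5_000, 250)
bw_decimal_gb = bw / 1e9      # in units of 10^9 bytes per second
bw_binary_gb = bw / 2**30     # in units of 2^30 bytes per second

print(f"{bw_decimal_gb:.0f} GB/s (decimal), {bw_binary_gb:.0f} GB/s (binary)")
```

Under these assumptions the result is 500 GB/s in decimal units, or roughly 466 GB/s in binary units, which is consistent with the quoted value of about 470 GB/s and sits below the 512 GB/s read bandwidth quoted for the S850.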
It should be noted that with this design, the wavefront
reconstruction for each DM is independent, allowing differ-
ent algorithms to be trialled with performance comparisons
made while the system is in operation. This capability will
be key to maximising MOAO performance.
3.6 Variation in latency
The variation of AO system latency, or jitter, is a key
parameter when developing a real-time control system. If
this jitter is large, then there will be frequent delays in
the AO processing pipeline, leading to reduced AO perfor-
mance. This is particularly critical for higher order AO sys-
tems. Fig. 9 shows the variation in latency measured over
1,000,000 frames on the POWER8 server for both the 40×40
and 80×80 sub-aperture systems. For the higher order case,
the variation in latency follows a Gaussian distribution, with
a FWHM of 1.4 ms, 5% of the mean frame time. No frames
take more than twice the mean frame time, and 99% of
frames take less than 8% longer than the mean time.
For the low order case, the variation in latency is no
longer Gaussian, showing an extended tail, and additional
features that may be related to the granularity of the timer.
The rms jitter is 62 µs. Here, fewer than 0.01% of frames take
longer than twice the mean frame time to complete, and
99% of frames take less than 38% longer than the mean
frame time to complete.
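The jitter statistics quoted in this section (rms jitter, FWHM, the 99th-percentile excess and the fraction of frames exceeding twice the mean) can be derived from a list of per-frame times. The following is a minimal sketch of one way to do so. The function name is hypothetical, the FWHM conversion assumes a Gaussian distribution, and the input data are synthetic rather than measurements from the POWER8 server.

```python
# Sketch: summary statistics for per-frame computation times.
# The data below are synthetic; they are not the measurements of Fig. 9.
import math
import random
import statistics


def jitter_summary(frame_times):
    """Summarise a list of per-frame times given in seconds."""
    mean = statistics.fmean(frame_times)
    rms = statistics.pstdev(frame_times)  # rms deviation about the mean
    ordered = sorted(frame_times)
    # Nearest-rank 99th percentile.
    p99 = ordered[math.ceil(0.99 * len(ordered)) - 1]
    return {
        "mean": mean,
        "rms_jitter": rms,
        # Valid only if the distribution is close to Gaussian.
        "gaussian_fwhm": 2.0 * math.sqrt(2.0 * math.log(2.0)) * rms,
        # Fractional amount by which the 99th percentile exceeds the mean.
        "p99_excess": p99 / mean - 1.0,
        "fraction_over_2x_mean":
            sum(t > 2.0 * mean for t in frame_times) / len(frame_times),
    }


random.seed(1)
synthetic = [random.gauss(0.028, 0.0006) for _ in range(10_000)]
summary = jitter_summary(synthetic)
for name, value in summary.items():
    print(f"{name}: {value:.6g}")
```

For a non-Gaussian distribution, such as the low order case above, the rms jitter and the percentile-based measures remain meaningful, whereas the FWHM conversion does not.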
We are currently using a stock Ubuntu kernel (3.16.0-23).
The use of a real-time kernel would be expected to reduce this
jitter further, though we do not investigate this here as such a
kernel is not yet available.
3.7 Further considerations
We have so far only considered the basic AO RTCS
pipeline operations, including wavefront reconstruction
using a matrix-vector multiplication algorithm, image
calibration and slope computation. However, for an ELT, this is
unlikely to be sufficient, as further algorithms will be
necessary, for example linear-quadratic-Gaussian (LQG)
control for vibration mitigation, as demonstrated by CANARY
(Sivo et al. 2014), which involves several matrix-vector
multiplication operations.
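To indicate why LQG control involves several matrix-vector multiplications per frame, the following sketch shows one step of a generic, textbook steady-state LQG controller. It is not the formulation of Sivo et al. (2014), nor the CANARY implementation. The matrices A, B, C, the estimator gain L and the control gain K are assumed to be precomputed offline, and all names and values are hypothetical.

```python
# Sketch: one frame of a textbook steady-state LQG controller.
# Assumption: A, B, C (model) and L, K (gains) are precomputed offline.

def matvec(matrix, vector):
    """Dense matrix-vector product: one output element per matrix row."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lqg_step(A, B, C, L, K, x_pred, y):
    """Return (u, x_pred_next) given the predicted state and measurement.

    Five matrix-vector products are needed per frame, compared with one
    for a conventional matrix-vector multiplication reconstructor.
    """
    # 1: predicted measurement C x, then the innovation y - C x
    innovation = [yi - ci for yi, ci in zip(y, matvec(C, x_pred))]
    # 2: state update x + L (y - C x)
    x_est = [xi + di for xi, di in zip(x_pred, matvec(L, innovation))]
    # 3: control law u = -K x
    u = [-ki for ki in matvec(K, x_est)]
    # 4 and 5: prediction for the next frame, A x + B u
    x_pred_next = [a + b for a, b in zip(matvec(A, x_est), matvec(B, u))]
    return u, x_pred_next


# Scalar example with illustrative values only.
u, x_next = lqg_step(A=[[0.9]], B=[[1.0]], C=[[1.0]],
                     L=[[0.5]], K=[[0.4]], x_pred=[0.0], y=[1.0])
```

At ELT scale each of these products can involve a matrix comparable in size to the reconstruction matrix, which is one way to see why memory bandwidth requirements increase.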
Figure 9. (a) A histogram of frame computation times for an
80 × 80 SCAO system. (b) A histogram of frame computation
times for a 40 × 40 SCAO system. Axes: frame time / s and
frequency / %.
Current implementations of LQG demonstrated on-sky
have required significantly more computational power and
memory bandwidth than a conventional matrix-vector
multiplication algorithm, and so the hardware that we are
investigating here may not be sufficient for these algorithms.
There are two ways forward. First, LQG is an active area of
research, and more efficient implementations are being developed
(Gray & Le Roux 2012). Alternatively, hardware
acceleration techniques can be considered.
Should additional hardware acceleration be required, it will
benefit significantly from the proposed high-speed NVLink
and CAPI interconnects under development for future
POWER8 processors and hardware accelerators. Specific
hardware, for example GPUs or FPGAs, can be used to
accelerate given algorithms, in this case wavefront
reconstruction. A high-speed, low-latency link
is key to enabling this, as it will maintain low system latency:
improved algorithmic behaviour will only improve AO
system performance if the algorithms do not lead to significant
increases in AO system latency. A key feature of the CAPI
interface is that accelerators share the same memory address
space as the CPU, enabling abstracted code to be developed
independently of the physical hardware acceleration used.
A high bandwidth, low latency accelerator interconnect
is also essential for future designs of ELT-scale XAO real-
time systems. For these systems, low latency is critical.
3.7.1 Future-proofing AO real-time control
We have demonstrated that an existing AO RTCS can be
ported to an alternative processor technology with very lit-
tle effort, and that this technology has the potential to en-
able AO real-time control for first-light ELT AO instruments
without the requirement for additional hardware acceler-
ation. This greatly simplifies RTCS design, and provides
greater confidence that the RTCS software will be able to
operate for the foreseeable future, independent of underly-
ing hardware changes (provided a C compiler exists). No
proprietary libraries are necessary, and full source code for
this system is available.
Of key importance here is that an ELT-scale AO real-
time control system can be developed in the widely used
C programming language, and does not require any custom
hardware, or any niche untransferable skills. Transferability
of this system to other processor types gives a significant
degree of confidence that a system developed in this way will
remain operable, configurable, upgradable and hardware
independent for the foreseeable future. This is a key advantage
for telescopes with expected operational lifetimes approach-
ing a century.
4 CONCLUSION
We have investigated the use of a freely available, open
source, AO RTCS on new POWER8 hardware. We find
that installation on this hardware was trivial, have demonstrated
the use of WFSs and a DM, and find that computational
performance is in line with expectations, with ELT-scale
AO RTCS performance being limited by available memory
bandwidth, of which our RTCS typically reaches above
90% of the theoretical maximum. The large potential memory
bandwidth of the POWER8 CPU, along with forthcoming
innovations enabling high bandwidth communication between
the CPU and other hardware (including GPUs, with
NVLink), means that POWER8 systems are a prime contender
for use with ELT-scale AO RTCSs, and that using
conventional computer server technology is highly attractive
for maintaining the longevity, upgradability and
comprehensibility of these systems.
ACKNOWLEDGEMENTS
This work is funded by the UK Science and Technology
Facilities Council, grants ST/K003569/1 and ST/L00075X/1.
We thank the referee who helped improve this paper.
REFERENCES
Amdahl G., 1967, in AFIPS Conference Proceedings Vol. 30,
Validity of the Single Processor Approach to Achieving
Large-Scale Computing Capabilities. pp 483–485
Babcock H. W., 1953, Pub. Astron. Soc. Pacific, 65, 229
Basden A., Geng D., Myers R., Younger E., 2010, Appl.
Optics, 49, 6354
Basden A. G., Myers R. M., 2012, MNRAS, 424, 1483
Fedrigo E., Donaldson R., Soenke C., Myers R., Goodsell
S., Geng D., Saunter C., Dipper N., 2006, in Ellerbroek
B. L., Bonaccini Calia D., eds, Advances in Adaptive
Optics II, Proc. SPIE Vol. 6272, SPARTA: the ESO standard
platform for adaptive optics real time applications.
p. 627210
Foley D., 2014, Technical report, NVLink, Pascal and
Stacked Memory: Feeding the Appetite for Big Data,
http://devblogs.nvidia.com/parallelforall/nvlink-pascal-stacked-memory-feeding-appetite-big-data.
NVIDIA
Foppiani I., Diolaiti E., Baruffolo A., Biliotti V., Bregoli
G., Cosentino G., Delabre B., Lombini M., Marchetti
E., Rossettini P., Schreiber L., Tomelleri R., Conan J.-
M., D’Odorico S., Hubin N., 2010, in Society of Photo-
Optical Instrumentation Engineers (SPIE) Conference Se-
ries Vol. 7736 of Society of Photo-Optical Instrumentation
Engineers (SPIE) Conference Series, System overview of
the Multi conjugated Adaptive Optics RelaY for the E-
ELT
Gray M., Le Roux B., 2012, Ensemble Transform Kalman
Filter, a nonstationary control law for complex AO sys-
tems on ELTs: theoretical aspects and first simulations
results
Hammer F., Barbuy B., Cuby J., Kaper L., Morris S.,
Evans C., Jagourel P., Puech M., 2014, in Society of
Photo-Optical Instrumentation Engineers (SPIE) Confer-
ence Series Vol. in print of Society of Photo-Optical In-
strumentation Engineers (SPIE) Conference Series, MO-
SAIC at E-ELT: a MOS for astrophysics, IGM, and cos-
mology
Johns M., 2008, in Extremely Large Telescopes: Which
Wavelengths? Retirement Symposium for Arne Ardeberg
Vol. 6986, The Giant Magellan Telescope (GMT). pp 698603–
698603–12
McCalpin J. D., 1995, IEEE Computer Society Technical
Committee on Computer Architecture (TCCA) Newslet-
ter, 12, 19
Myers R. M., Hubert Z., Morris T. J., Gendron E., Dipper
N. A., Kellerer A., Goodsell S. J., Rousset G., Younger
E., Marteaud M., Basden A. G., 2008, in Society of
Photo-Optical Instrumentation Engineers (SPIE) Con-
ference Series Vol. 7015 of Presented at the Society of
Photo-Optical Instrumentation Engineers (SPIE) Confer-
ence, CANARY: the on-sky NGS/LGS MOAO demon-
strator for EAGLE
Nelson J., Sanders G. H., 2008, in Society of Photo-
Optical Instrumentation Engineers (SPIE) Conference Se-
ries Vol. 7012 of Society of Photo-Optical Instrumenta-
tion Engineers (SPIE) Conference Series, The status of
the Thirty Meter Telescope project. pp 70121A–70121A–
18
Rigaut F., Neichel B., Boccas M., d’Orgeville C., Arriagada
G., Fesquet V., Diggs S. J., Marchant C., Gausach G.,
Rambold W. N., Luhrs J., Walker S., Carrasco-Damele
E. R., Edwards M. L., Pessev P., Galvez R. L., 2012,
in Society of Photo-Optical Instrumentation Engineers
(SPIE) Conference Series Vol. 8447 of Society of Photo-
Optical Instrumentation Engineers (SPIE) Conference Se-
ries, GeMS: first on-sky results
Sinharoy B., Van Norstrand J. A., Eickemeyer R. J., Le
H. Q., Leenstra J., Nguyen D. Q., Konigsburg B., 2015,
IBM Journal of Research and Development, 59, 2:2
Sivo G., Kulcsar C., Conan J., Raynaud H., Gendron E.,
Basden A., Vidal F., Morris T., 2014, Opt. Express
Spyromilio J., Comerón F., D’Odorico S., Kissler-Patig M.,
Gilmozzi R., 2008, The Messenger, 133, 2
Starke W. J., Stuecheli J., Daly D., Dodson J., Auernham-
mer F., Sagmeister P. M., Guthrie G. L., Marino C. F.,
Siegel M., Blaner B., 2015, IBM Journal of Research and
Development, 59(1), 3:1
Stuecheli J., Blaner B., Johns C., Siegel M., 2015, IBM
Journal of Research and Development, 59, 7:1
Vernet E., Cayrel M., Hubin N., Mueller M., Biasi R.,
Gallieni D., Tintori M., 2012, Specifications and design of
the E-ELT M4 adaptive unit
This paper has been typeset from a TeX/LaTeX file prepared
by the author.
