Hardware implementations of computer-generated holography: a review by Wang, Youchao et al.
Hardware Implementations of Computer Generated Holography: A Review
Youchao Wang (a, b), Daoming Dong (a, b), Peter J. Christopher (a), Andrew Kadis (a), Ralf
Mouthaan (a), Fan Yang (a), Timothy D. Wilkinson (a, *)
(a) University of Cambridge, Centre for Molecular Materials, Photonics and Electronics, Department of Engineering,
9 JJ Thomson Avenue, Cambridge, UK, CB3 0FA
(b) Both authors contributed equally.
Abstract. Computer generated holography (CGH) is a technique to generate holographic interference patterns. One
of the major issues related to computer hologram generation is the massive computational power required. Hardware
accelerators are used to accelerate this process. Previous publications targeting hardware platforms lack performance
comparisons between different architectures and do not provide enough information for the evaluation of the suitability
of recent hardware platforms for CGH algorithms. We aim to address these limitations and present a comprehensive
review of CGH-related hardware implementations.
Keywords: Computer generated holography (CGH), Central processing unit (CPU), Graphics processing unit (GPU),
Field-programmable gate array (FPGA), Digital signal processor (DSP), Hardware accelerator, Holography,
System-on-Chip (SoC).
* Corresponding author: Timothy D. Wilkinson, tdw13@cam.ac.uk
1 Introduction
Holography is a technique used to record and reconstruct the entirety of an optical field.1 This
approach was pioneered by Dennis Gabor in 1948 as a two-step, lensless imaging process for
improving the quality of electron microscopy.2
In the early days, holograms were primarily single-use as the only recording media available
resembled photographic film. It was not until the mid-1960s that computer generated holography
(CGH),3 together with noticeable improvements in technology, revolutionized the field and
drew a significant amount of interest.
The late 1980s saw a further shift in holography from analogue to digital with the emergence
of digital imaging sensors, increases in computational power, and electronic display
devices such as digital micromirror devices (DMDs) and liquid crystal spatial light modulators
(LC SLMs). Holograms could, for the first time, be digitally captured, processed and displayed.
Over time, holography has become regarded as a serious display technology for far-field and 3D
applications.4
CGH is the field of algorithmically generating holographic interference patterns using digital
computers, with target applications including but not limited to display technologies,5
wavelength selective switches (WSSs),6 optical tweezers7 and telecommunications.8 Generating computer
holograms in real time is one of the key goals of research, with algorithms for CGH traditionally
running on central processing units (CPUs). Despite recent increases in the processing power
of CPUs, it remains insufficient for real-time holographic applications. Accelerated hardware
platforms, including graphics processing units (GPUs), field programmable gate arrays (FPGAs),
digital signal processors (DSPs), co-processors as well as application-specific integrated circuits
(ASICs), are able to bring high fidelity holographic imagery to real-time applications.
Figure 1 shows a typical system setup for CGH. The creation of computer holograms can
be divided into three parts:1
1. Calculate: to allow the computer to digitally, instead of optically, calculate the interference
fringes for a target object;
2. Encode: to determine the method to represent or encode the computation results;
3. Display: to display the encoded fringes on a suitable medium.
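The three steps above can be sketched with a one-dimensional toy example. This is a minimal illustration under stated assumptions rather than any method from the surveyed literature: the helper names (`calculate`, `encode`, `display`), the 16-sample target, the random initial phase and the four phase levels are all hypothetical choices, and a direct discrete Fourier transform stands in for the optical propagation.

```python
import cmath
import math
import random


def dft(values, sign):
    """Direct, unnormalized 1D DFT; sign=-1 is forward, sign=+1 is inverse."""
    n = len(values)
    return [
        sum(v * cmath.exp(sign * 2j * math.pi * k * m / n) for m, v in enumerate(values))
        for k in range(n)
    ]


def calculate(target_amplitude, rng):
    """Step 1: give the target a random phase and transform it to the hologram plane."""
    field = [a * cmath.exp(2j * math.pi * rng.random()) for a in target_amplitude]
    n = len(field)
    return [h / n for h in dft(field, +1)]


def encode(hologram, levels):
    """Step 2: keep the phase only and quantize it to a fixed number of levels."""
    step = 2 * math.pi / levels
    return [round((cmath.phase(h) % (2 * math.pi)) / step) % levels for h in hologram]


def display(level_indices, levels):
    """Step 3 (simulated): replay-field intensity of an ideal phase-only device."""
    step = 2 * math.pi / levels
    slm = [cmath.exp(1j * step * q) for q in level_indices]
    return [abs(r) ** 2 for r in dft(slm, -1)]


rng = random.Random(0)            # fixed seed so the sketch is repeatable
target = [0.0] * 16
target[3] = target[10] = 1.0      # two bright points in the replay field
hologram = calculate(target, rng)
encoded = encode(hologram, levels=4)
replay = display(encoded, levels=4)
print(encoded)
```

Discarding the amplitude and quantizing the phase in `encode` is what makes the replay field differ from the target; real systems use iterative or error-diffusion algorithms to reduce that difference.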
CGH algorithms, whether point-source-based, polygon-based, layer-based,
etc., typically require a very high level of computational power. Hence, when designing any








Fig 1 A typical system for computer generated holography consisting of three main components: a light source, a
computer or hardware platform for interference pattern calculation and a device to display the hologram.4
To the best of our knowledge, there is no modern review paper that specifically targets the
hardware used for the generation and processing of computer holography. Previously published
survey papers9–14 provide analyses and conceptual reviews of fast hologram generation algorithms.
Additionally, Shimobaba et al. in 201615 and 201916 provided overviews in terms of CGH-related
hardware implementations. However, all of the above reviews suffer from a lack of the following:
1. A comparison between different hardware platforms;
2. A dedicated discussion with respect to hardware implementations;
3. An assessment of the trade-offs between different development factors for a given hardware
platform;
4. An up-to-date review with respect to the most recent developments in modern hardware.
We aim, therefore, to provide a review that compares different hardware platforms and discusses
each platform's advantages and disadvantages. This review paper considers CPUs, GPUs, FPGAs
and other hardware accelerators in dedicated sections. For each platform, we provide a literature
survey on the applications utilizing these specific hardware platforms. This is followed by a
discussion of device properties, available development toolchains, the ease of development, and their
advantages and disadvantages. We also present cross-platform comparisons to gain insights
regarding the use of different types of accelerators. Overall, we provide a thorough examination
of the current state-of-the-art hardware implementations along with a review of their applications
over the previous decade (2008-2020).
This literature survey is outlined as follows. Section 1 first introduces the field of holography
alongside key concepts and a discussion of the CGH challenges. CPU, GPU, FPGA and other
platform implementations are discussed in Sections 2, 3, 4 and 5, respectively. Section 6 reports
the comparison between different hardware platforms and provides in-depth discussion to guide
hardware selections. Finally, the paper is concluded after presenting future work directions in
Section 7.
1.1 The hologram and the replay field
In a classical imaging system, Figure 2, focusing optics are used to focus light scattered from a
point of an object onto a corresponding point on a sensor (the recording device). In such a system,
any different origin point on the object leads to a corresponding change in the position on the
recording plane of the sensor. The loss of a portion of the sensor data will result in a corresponding
loss in the image.
In a holographic imaging system, Figure 3, scattered light is collected without the use of
focusing optics; instead, the scattered light is interfered with a reference beam. Replicating the recording
conditions allows for replication of the light field and the resulting image, as depicted in Figure 4.














Fig 2 A classical optical imaging system.






Fig 3 A holographic imaging system for hologram recording.
Traditional analogue holography follows two steps known as recording and reconstruction:1, 2, 17
1. Recording - Figure 3 - A coherent, collimated light source is split into object and reference
beams. The object beam is directed onto a physical object and the resulting scattered light is
interfered with the reference beam. The interference fringes are recorded on a photosensitive
film to produce the hologram.
2. Reconstruction - Figure 4 - A similar system is used to reproduce the hologram. An identi-






Fig 4 A holographic projection system for hologram reconstruction.
object directly.
Computer generated holography goes further than this by using a known target or scene to
generate the reconstruction image, thus eliminating the requirement of a recording step.
1.2 Limitations of Computer Generated Holography (CGH)
Computer generated holography promises a great deal; however, in practice there exist a number
of key limitations:
1. Hologram representation. Complex modulation schemes are achievable by several
means,18, 19 although display media such as SLMs still face technological
limitations in performing true arbitrary complex modulation. However, these methods
often require non-trivial modifications and device setups, consequently limiting the
representations of holograms. Moreover, hardware manufacturing constraints limit the size
and quality of the reconstructed holographic images as well as the viewing angle.16
2. High computational power demand. The hologram and the object field are related by Fourier
transforms for any given pixel display. For a single image frame with N × N points or
pixels, the computational complexity can be as high as O(N^4). By utilizing fast
Fourier transforms (FFTs), we are able to reduce this complexity to O(N^2 log(N)).
Unfortunately, this is still computationally expensive, even before considering the
other operations that algorithms need in order to produce high quality images and videos, where
visual effects such as shading,20 occlusion effects21, 22 and directional
scattering23 are essential. This demand for high image quality is one of the key limitations.
3. Degradation of the replay field image quality. The quality of the holographic reconstructed
image is affected by factors such as speckle noise,13 ringing artifacts,24 the
opto-mechanical properties of SLMs,25 etc. Moreover, real-world display devices are incapable of
modulating light continuously; being limited to a number of discrete levels results in
quantization artifacts which have an adverse effect on image quality.26 During this process, the
information stored in the interference pattern is reduced, leading to a degradation in
image quality.
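The gap between O(N^4) and O(N^2 log(N)) in item 2 can be made concrete with a small sketch. The code below checks a textbook radix-2 FFT against a direct DFT and then compares rough operation counts for an N × N frame. The counting functions `direct_ops` and `fft_ops` are deliberately crude assumptions (one operation per term, constant factors ignored), not measured costs of any surveyed implementation.

```python
import cmath
import math


def dft(x):
    """Direct DFT: every output sample sums over every input sample."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n)) for k in range(n)]


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def direct_ops(n):
    """N^2 output pixels, each summing over N^2 input points."""
    return n ** 4


def fft_ops(n):
    """2N one-dimensional FFTs (rows, then columns), each about N log2(N) operations."""
    return 2 * n * n * int(math.log2(n))


signal = [complex(i % 5, -(i % 3)) for i in range(32)]
max_diff = max(abs(a - b) for a, b in zip(dft(signal), fft(signal)))
speedup_1024 = direct_ops(1024) / fft_ops(1024)
print(f"max |DFT - FFT| = {max_diff:.2e}")
print(f"estimated operation ratio for N = 1024: {speedup_1024:.1f}")
```

Under this simplified count the ratio for N = 1024 is about 5 × 10^4, yet the FFT route still needs roughly 2 × 10^7 operations per frame before any shading or occlusion processing is added.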
A suitable hardware platform for CGH algorithm implementations needs to be selected in order
to speed up the generation of computationally heavy holograms while ideally also improving replay
field quality. In this paper, we aim to address this problem by providing a selection guideline for
researchers and developers to choose the most suitable hardware platforms for computer hologram
applications. While we stay focused on the hardware choice, it should be pointed out that such
choices are also intimately related to the algorithm selected, as some algorithms require more dedicated
and specialized hardware resources than others. A further discussion of this is outlined
in Section 6.
We divide the current state-of-the-art hardware platforms into two categories: conventional
processors, by which we refer to CPUs; and hardware accelerators such as GPUs, FPGAs, DSPs and
co-processors. Traditionally, basic arithmetic calculations were done in CPUs. However, for
certain computationally expensive applications, there is a need for specialized architectures where the
design is optimized for the application to accelerate performance.
Hardware accelerators were designed to tackle this issue by exploiting properties such as
parallelism and application-specific dedicated hardware acceleration. These devices are usually based
on different architectures and inherently make use of different development tools and utilities. Code
and algorithm migrations between these hardware platforms are often not straightforward.
They require a good understanding of the specific hardware architectures as well as
microarchitectures in order to carry out code implementations and optimizations properly.
2 Central Processing Units (CPUs)
Since their invention in the early 1970s, CPUs have become the core of this ever-developing
digital world. Von Neumann and Harvard architectures, along with their architectural variants, will continue to
dominate the market in the foreseeable future. The fundamental operations and underlying
theories have remained largely unchanged throughout the years. These CPUs are designed to complete
computational tasks that are as general as possible. Unfortunately, it is this very generality which
prevents CPUs from executing high-performance computational operations, since they lack a
sufficient amount of parallelism within their architectures.27
It was not until 2005, when Intel introduced the Pentium D series (the first desktop-class
dual-core processor), that parallel processing was exploited in individual consumer computers
running multi-core processors. A typical contemporary computer with a multi-core processor can
run tens or hundreds of tasks at any given time. Running multiple programmes simultaneously
utilises concurrency by switching between different threads, or instruction streams,
in real time.28 These job-switching operations consume CPU cycles and hence
prevent the platform from running at optimal efficiency when multi-tasking and exploiting
parallel processing.
For this paper, we will only evaluate Intel and AMD chip families, as they are the two vendors
producing x86/64 designs, the dominant high-performance CPU architecture at the time of
writing.
2.1 A platform for preliminary verification of algorithms
Most hologram generation algorithms were developed on conventional computers, utilizing the
power of the latest CPU chip families. Software-based algorithms run on CPUs to minimize
development time and reduce the computational burden by exploiting advanced
computation libraries, software packages and other utilities.
Reported work based purely on CPUs forms the preliminary analysis of various proposed
computer hologram generation algorithms. Researchers tend to focus more on theoretical development
than on code optimization, since conventional CPUs are not used for acceleration purposes.
Due to their ubiquity and ease of use, the majority of work that incorporates CPUs
uses them as the comparison baseline for algorithm implementations on other hardware platforms
that utilize dedicated accelerators.
2.2 Available tools and utilities for CPUs
Since CPUs are the core components within a modern personal computer (PC) and workstation,
the vast majority of software packages and development suites are readily available. Code and
programmes can be written in many high-level languages, while low-level application
programming interfaces (APIs) and frameworks, such as OpenMP and OpenCV, are also widely available.
As an API for shared memory multiprocessing, OpenMP is dedicated to high-level parallelism in
Fortran and C/C++ programs.29 Compiler directives, library routines as well as environment
variables can be used to optimize for multiprocessing by, for example, distributing workloads among
the available threads and physical cores.
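The idea of distributing a workload among threads can be sketched as follows. This is a loose Python analogy to an OpenMP parallel loop, not OpenMP itself: the point-source model, the wavelength, the pixel pitch and the two object points are illustrative assumptions, and in CPython the global interpreter lock means the example shows only the decomposition into chunks, not a real speed-up.

```python
import cmath
import math
from concurrent.futures import ThreadPoolExecutor

WAVELENGTH = 532e-9                 # assumed illustrative wavelength (m)
PITCH = 8e-6                        # assumed pixel pitch (m)
K = 2 * math.pi / WAVELENGTH        # wavenumber
POINTS = [(-1e-4, 0.10, 1.0), (2e-4, 0.12, 0.5)]  # hypothetical (x, z, amplitude)


def pixel_field(index, n_pixels):
    """Sum the spherical-wave contributions of all object points at one pixel."""
    x = (index - n_pixels / 2) * PITCH
    total = 0j
    for px, pz, amp in POINTS:
        r = math.hypot(x - px, pz)
        total += amp * cmath.exp(1j * K * r) / r
    return total


def compute_chunk(bounds, n_pixels):
    """Work done by one worker: a contiguous range of hologram pixels."""
    start, stop = bounds
    return [pixel_field(i, n_pixels) for i in range(start, stop)]


def hologram_parallel(n_pixels, workers):
    """Split the pixel range into equal chunks, one per worker, and merge in order."""
    chunk = math.ceil(n_pixels / workers)
    bounds = [(s, min(s + chunk, n_pixels)) for s in range(0, n_pixels, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda b: compute_chunk(b, n_pixels), bounds)
        return [value for part in parts for value in part]


serial = [pixel_field(i, 64) for i in range(64)]
parallel = hologram_parallel(64, workers=4)
print(serial == parallel)
```

Each pixel depends only on the object points, never on another pixel, which is why this loop splits cleanly across workers. The same property is what accelerators with many execution units exploit.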
We summarize the tools and utilities that have been reported in previously published
papers since 2008 in Table 1. The most commonly used software application is Matlab,
due to its simplicity and extensive package support. Compared with other realization methods,
developing algorithms in Matlab requires no detailed understanding of hardware
architectures or memory management. C/C++ tends to be the most popular programming language
used for algorithm implementations. C/C++ programming libraries and functions, such as FFTW,
cvDFT from OpenCV and the custom library CWO++,30 offer strong support for improved hologram
generation performance.
2.3 The advantages and disadvantages of using CPUs
The most significant advantages of using CPUs for algorithm implementation are the short
development time and the mature software toolchain support. Nearly all the software packages that
can be found on other hardware accelerator platforms have the same or equivalent toolkits
available on CPU-based PCs. These range from programming languages, such as C/C++ and
Python, and compile-time and run-time libraries, to software packages.

Table 1 Tools and utilities employed for CPU implementations since 2008
Name | Category | Appearance | Year
Matlab | Software | Novel LUT algorithm,31 Fast computation,32 Compressed LUT algorithm,33 Binary detour phase holograms,34 Specific solutions for Gerchberg-Saxton (GS) algorithm,35 Highly efficient calculation,36 |
FFTW library | FFT library | Wavefront recording plane (WRP) GPU comparison,41 CWO++30 | 2009, 2012
Intel Math Kernel Library (MKL) | Math and FFT library | Polygon-based extremely high-definition projection42 | 2009
cvDFT (OpenCV) | FFT function from OpenCV | Wavefront recording plane43 | 2018
OpenMP | Multi-processing API | Baseline for multi-GPU cluster comparison44 | 2012
C | Programming language | Simulated annealing (SA) GPU comparison,45 Multi-GPU cluster comparison44 | 2010, 2012
C++ | Programming language | Polygon-based extremely high-definition projection,42 WRP GPU comparison,41 SA GPU comparison,45 CWO++ and WASABI,30, 38–40 Full colour and colour space conversion using WASABI46 | 2010, 2012, 2017, 2018, 2019
Python | Programming language | Compressive-sensing GS47 | 2019
Other merits of using CPUs are as follows:
1. Comparatively high clock frequency: Contemporary CPUs run in the GHz range, whereas
other hardware usually runs at frequencies between hundreds of MHz and just
above 1 GHz. Higher clock rates provide shorter clock cycles, consequently speeding up
sequential processes.
2. Floating point precision: CPUs tend to have better support for double-precision floating
point arithmetic from the tools that are available, although the use of full-precision
components can reduce run-time execution speed.
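One reason double precision can matter in hologram calculation is the size of the optical phase term before it is wrapped to 2π. The sketch below rounds such a phase to single precision using only the standard library. The wavelength and propagation distance are illustrative assumptions, and `to_float32` is a hypothetical helper, not a function from any library named in this review.

```python
import math
import struct


def to_float32(value):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


WAVELENGTH = 532e-9   # assumed illustrative wavelength (m)
DISTANCE = 0.5        # assumed propagation distance (m)

phase64 = 2 * math.pi * DISTANCE / WAVELENGTH   # about 5.9e6 rad
phase32 = to_float32(phase64)
error = abs(phase32 - phase64)

print(f"double precision phase: {phase64:.6f} rad")
print(f"single precision phase: {phase32:.6f} rad")
print(f"rounding error:         {error:.6f} rad")
```

For values between 2^22 and 2^23, adjacent single-precision numbers are 0.5 apart, so an unwrapped phase of this size can be off by up to 0.25 rad, a noticeable fraction of a 2π cycle. Implementations that stay in single precision usually restructure the calculation to keep intermediate values small.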
The disadvantages are also apparent. CPUs are optimized for sequential operations and
consequently full parallelism cannot be achieved. Although state-of-the-art CPUs at the time of writing
feature a higher level of parallelism than older CPUs, with tens of cores being available in a
single package, this pales in comparison to the massively parallel architectures of GPUs and FPGAs,
which feature thousands of parallel execution units. Moreover, although software libraries and APIs
that support parallelism, such as OpenMP, help to shorten the development time needed for
multi-threaded and multiprocessing applications, they require an advanced level of skill to use
effectively.
2.4 Reported work using CPUs
Most of the reported work covering CPU-based applications is for either algorithm development
or, more commonly, for establishing baselines for cross-platform performance comparisons.
The most common method to optimize the performance of hologram generation using a
computer is to combine CPUs and GPUs.
Shimobaba et al. reported on the development of a C++ library, CWO++, which is used for
diffraction calculations.30 This library has been developed to run on both CPU (CWO class) and
GPU (GWO class, GPU-based wave optics), and has been used in various algorithm
developments.24, 38–40, 44, 48–50
We do not aim to thoroughly review the work that reports on CPU-based platform performance,
as in the majority of cases the CPU results are used to provide a baseline performance reference.
However, these baselines are encountered at several points throughout the survey.
2.5 Summary of CPUs
Contemporary CPUs offer insufficient performance for real-time CGH and hence it is not
recommended to build a real-time holographic system based solely on them. Moreover, the ready
availability of hardware accelerators, such as GPUs and FPGAs, provides further rationale for hologram
generation algorithms not to be implemented on a CPU-only platform.
Algorithm development in the initial phase, however, is one exception where purely CPU-based
simulations and implementations, e.g., with MATLAB and Simulink, can significantly cut down
the development time and improve the efficiency of research outputs. Moreover, this approach
also encourages collaboration, lowering the skills and knowledge barriers for other research groups
to replicate and improve the corresponding algorithms.
3 Graphics Processing Units (GPUs)
In both academia and industry, GPUs, as dedicated graphics accelerators, have gained much
attention since their introduction in the late 1990s.51 Through the use of parallel operations, these
accelerators maximise the performance of image- and video-related applications.
Benefiting from economies of scale, GPU products are cost-effective and readily available.
High-end products with a large number of processing units that perform half (16-bit), single
(32-bit) and double (64-bit) precision floating point operations in parallel are eminently suitable
for image and video processing applications. The introduction of the compute unified device
architecture (CUDA)52 by NVIDIA in 2007 further eased development and shortened
implementation and porting time. Due to their strong parallel performance and
well-supported development environment, GPUs are one of the most effective hardware accelerators
available on the market.
Traditionally, GPUs have been dedicated to graphics rendering. However, years
of development have brought increases in computational power, and contemporary GPUs
are encroaching upon application domains that formerly belonged to high-end, high-power CPUs.
These GPUs are regarded as general-purpose graphics processing units (GPGPUs). Non-specialized
calculations, such as machine learning computations, scientific computations, heavy image/video
editing and encryption/decryption, have been taken over by GPGPUs thanks to their
massive parallelism and large processing core counts compared with traditional CPUs.
Two vendors, NVIDIA and AMD, are major players in the graphics processing industry. Intel,
with the recent development of its own GPU hardware, is becoming another major producer of GPUs.
However, based on past lines of work, we will mainly focus on the NVIDIA GPU families, since
they are the most popular hardware platform used in the CGH and image processing community.53
3.1 The parallelism in GPUs
Architecturally, a GPU is significantly different from a CPU. The major difference is that
GPUs exploit massive parallelism at the hardware level. A single mainstream contemporary GPU
incorporates thousands of dedicated processor cores, whereas even the highest-end CPUs
typically contain fewer than 24 cores.54 It is this inherent parallelism that provides high-performance
computation capability for highly parallel problem spaces.
3.2 GPU performance trends
Over the years, NVIDIA has brought out a range of core microarchitectures in its GPU series,
targeting both professional high-performance uses and individual consumer-level applications.
The last decade has seen a large increase in the performance capability of GPUs, as summarized
in Table 2.55, 56 While the rated power consumption has remained relatively constant (at around
200-300 watts), we have seen a significant increase in processing power. Ever since Fermi be-
















































Fig 5 A typical parallel pipeline overview of an NVIDIA GPU consisting of streaming multiprocessors each containing
a number of cores and functional units.
working on incorporating advanced shaders, hardware ray tracing and many high performance
functionalities, with larger and faster processing capability for not only graphics
rendering but also more general-purpose uses.58 Consider the floating point operations per second (FLOPS)
performance of a top-end GPU. Between 2008 and 2018, the performance increased from 432
GFLOPS to 16312 GFLOPS, an improvement of more than 37×.
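The improvement factor quoted above can be reproduced directly from the first and last rows of Table 2. The short calculation below also divides GFLOPS by TDP as a rough efficiency figure. That ratio is our own illustrative metric, and TDP is only an indicator of power draw rather than a measured consumption.

```python
# (GFLOPS, TDP in watts) copied from the first and last rows of Table 2.
FLAGSHIPS = {
    "9800 GTX (2008)": (432.0, 140.0),
    "GTX TITAN RTX (2018)": (16312.32, 280.0),
}


def gflops_per_watt(gflops, tdp_watts):
    """Rough efficiency indicator: rated throughput divided by thermal design power."""
    return gflops / tdp_watts


old_gflops, old_tdp = FLAGSHIPS["9800 GTX (2008)"]
new_gflops, new_tdp = FLAGSHIPS["GTX TITAN RTX (2018)"]

throughput_gain = new_gflops / old_gflops
efficiency_gain = gflops_per_watt(new_gflops, new_tdp) / gflops_per_watt(old_gflops, old_tdp)

print(f"throughput gain: {throughput_gain:.2f}x")
print(f"GFLOPS per watt gain: {efficiency_gain:.2f}x")
```

With these figures the throughput gain is 37.76× while the rated power only doubles, so the indicative efficiency gain is about half the throughput gain.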
GPU designs vary between different microarchitectures and production families. Figure 5
shows a typical structural overview of an NVIDIA GPU with numerous streaming multiprocessors
(SMs) consisting of shared memories, L1/L2 caches, CUDA cores, arithmetic units (e.g. double
precision units), load/store (LD/ST) units, etc. This architectural setup reveals the high level of
inherent hardware parallelism within modern GPU devices.
Table 2 Microarchitectures since 2008 and their representative flagship GPU products (SP: single precision floating
point)56
Model | Year of launch | Microarchitecture | Transistors (million) | Fab (nm) | GFLOPS | TDP (watt)
9800 GTX | 2008 | Tesla | 754 | 65/55 | 432 | 140
GTX 295 | 2009 | Tesla | 2× 1400 | 55 | 1192.3 | 289
GTX 480 | 2010 | Fermi | 3000 | 40 | 1344.96 (SP) | 250
GTX 590 | 2011 | Fermi | 2× 3000 | 40 | 2488.3 (SP) | 365
GTX 690 | 2012 | Kepler | 2× 3540 | 28 | 2× 2810.88 (SP) | 300
GTX TITAN | 2013 | Kepler | 7080 | 28 | 4499.7 (SP) | 230
GTX TITAN Z | 2014 | Kepler | 2× 7080 | 28 | 8121.6 (SP) | 375
GTX TITAN X | 2015 | Maxwell | 8000 | 28 | 6604.8 (SP) | 250
GTX TITAN X 2 | 2016 | Pascal | 12000 | 16 | 10974.2 (SP) | 250
GTX TITAN V | 2017 | Volta | 21100 | 12 | 14899.2 (SP) | 250
GTX TITAN RTX | 2018 | Turing | 18600 | 12 | 16312.32 (SP) | 280
3.3 Available tools and utilities for GPUs
Two utilities are widely used: the Compute Unified Device Architecture (CUDA) platform and the Open
Computing Language (OpenCL) framework.
In 2007, NVIDIA released CUDA, a parallel computing platform and application programming
interface (API) model. Prior to the introduction of CUDA, graphics and GPU
programming skills for tools such as Direct3D, DirectX and OpenGL, together with a good understanding of
High Level Shader Language (HLSL), were required in order to take advantage of the high
computational performance of graphics cards.59 CUDA, however, requires only standard C/C++ or
Fortran programming language skills as the bare minimum.
As of the writing of this review, CUDA has iterated to its tenth generation (10.1) and comes
with both compile-time and run-time libraries.52 In particular, the CUDA fast Fourier transform
(cuFFT) library enables high-performance FFT and IFFT computations similar to the FFTW
library.60 Other useful libraries include, but are not limited to, a basic linear algebra
subroutine library, cuBLAS, useful for linear algebraic operations; a random number generation library,
cuRAND, useful for random phase generation; and a parallel algorithms and data structures
library, Thrust, to accelerate operations such as sum and average as well as boundary (maximum
and minimum) search algorithms in parallel.
Developing programmes with CUDA is straightforward with the support of a modified C
programming language dedicated to the CUDA framework.
Additionally, vendors such as NVIDIA and AMD have provided full support for, and have
released implementations of, OpenCL for their GPUs. OpenCL is a framework with low-level
APIs for cross-platform computing. Developers can use the APIs provided by OpenCL to write
programmes that run across CPUs, GPUs, etc., in the C programming language. However, it is worth
noting that a study (not related to holography) conducted by Memeti et al. in 2017 suggested that
CUDA outperforms OpenCL in terms of productivity, requiring half as much programming effort
for a specific benchmark suite.61
Matlab, in the meantime, provides a parallel computing toolbox for GPU computing. Despite
some limitations, it is argued that combining CUDA kernels with Matlab support (using the
Parallel Computing Toolbox) can further improve and smooth the programming process.34, 62 No
knowledge of CUDA is needed to exploit the parallel computing capabilities for
CGH-related computation speed-ups.
3.4 The advantages and disadvantages of using GPUs
GPUs were designed from the outset to accelerate the processing of images and videos. The hardware
architectures are specially designed for this purpose, with parallelism highly optimized in
both hardware and software.
The key advantage of GPGPUs is that they can be programmed using high-level programming
languages such as C/C++, making code development and the corresponding debugging processes faster
and easier than on other platforms such as FPGAs.
As shown in Table 2, one of the major disadvantages of using GPUs for algorithm
implementation is their high power consumption. The thermal design power (TDP), which is the maximum
amount of heat generated by the chip during operation and which serves as a basic indicator of
power consumption, is typically around 200-300 watts.
A good understanding of GPU microarchitectures and, in particular, memory management is
required for speed optimization, although dedicated utilities tend to offer modest support for
memory management.
More importantly, most GPUs cannot work as a standalone platform. A system
incorporating CPUs and other essential hardware devices tends to limit data throughput during
read, fetch and write operations and increases the overall power consumption. Additionally,
this level of integration introduces a data transfer bottleneck, which degrades the overall
performance of the platform. Data transfers between the host PC and the GPU or GPU
cluster slow down further when the implementation has not been properly optimized.
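A back-of-the-envelope estimate shows why host-to-device transfers matter. The sketch below computes how long one frame takes to cross the host link and compares that with a video-rate frame budget. The 8 GB/s effective bandwidth, the 8 bytes per pixel (a single-precision complex value) and the 60 frames per second target are hypothetical figures chosen only for illustration, not measurements from any surveyed system.

```python
def frame_bytes(width, height, bytes_per_pixel):
    """Size of one uncompressed frame in bytes."""
    return width * height * bytes_per_pixel


def transfer_ms(n_bytes, bandwidth_gb_per_s):
    """Time in milliseconds to move n_bytes at the given effective bandwidth."""
    return n_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


ASSUMED_BANDWIDTH_GB_PER_S = 8.0   # hypothetical effective host-to-GPU bandwidth
ASSUMED_FRAME_RATE = 60.0          # hypothetical display refresh target (frames/s)

size = frame_bytes(1920, 1080, 8)  # 8 bytes per pixel: single-precision complex
one_way_ms = transfer_ms(size, ASSUMED_BANDWIDTH_GB_PER_S)
budget_ms = 1000.0 / ASSUMED_FRAME_RATE

print(f"frame size: {size} bytes")
print(f"one-way transfer: {one_way_ms:.4f} ms of a {budget_ms:.2f} ms frame budget")
```

Under these assumptions a single one-way copy uses roughly an eighth of the frame budget, and a round trip doubles that. This is one reason implementations try to keep intermediate data on the device and return only the final hologram.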
3.5 The development time using GPUs
The use of CUDA makes GPU implementations simpler. The majority of the development time
will be spent on software coding in the C/C++ programming language.
The major difficulty in the development of GPU hologram generation applications is optimizing
the code for the potentially high-throughput and heavy computational requirements. This requires
a good understanding of GPU architectures, as well as hardware and software optimization
techniques. However, since most of the fast algorithms implemented, such as those reported previously,45, 66, 73, 74
require less sophisticated operations, the optimization can be based purely on increasing the data
throughput and improving the computational power with parallel processing.
Table 3 A summary of CGH implemented on GPUs since 2008
Project and application (year) | Implemented algorithm | Hardware model (GFLOPS SP based on56) | Performance
Holographic optical tweezers and 4π-microscopy (2008)63 | Gerchberg-Saxton algorithm | Geforce 8800 GTX (345.6) and 8800 GTS (416) | One GS loop at 512×512 in 16.5 msec
Data-parallel computing for point cloud (2009)64 | Nonuniform sampling, Common visibility group (CVG) approximation | Geforce 9800 GX2 (2×384) | Non-uniform 7592 points in 10.3 sec, CVG in 5.07 sec
Depth buffer rasterization for 3D display (2009)21 | Ray tracing algorithm with precomputed look-up tables | Geforce 8800 GT (336) | 266 sampling rays with 12 quads in 1.37 sec
Colour reconstruction system with GPU (2009)65 | 1000-point based | Geforce GTX280 (622) | 1400×1050 in 31 msec
Fast CGH using S-LUT (2009)66 | Split look-up tables | GTX 285 (708.48) | 700× faster than LUT on Intel Core i7 965 for object point larger than 40k
Ray-tracing (as the baseline reference) using GWO library41 | Ray-tracing algorithm | GTX 260 (approx. 550) | 48277 points 3D object in 1380 msec
Real-time CGH using multiple GPUs (2010)67 | 1000-point based | 3×GTX 285 (708.48 per GPU) | 1000 points per colour at 22 FPS
CGH with AMD (2010)68 | 1024-point based | AMD RV870 (unknown) comparing NVIDIA GTX 260 (approx. 550) | 1920×1024 in 31 msec
GPU acceleration using SA (2010)45 | Simulated annealing | NVIDIA GTX 260 (approx. 550) | Performance improvement of more than an order of magnitude compared to using CPU only
 | | GTX 580 (1581.1) | 2048×2048 in 25 msec
CWO++ library performance benchmark (2012)30 | Gerchberg-Saxton algorithm | GTX 460M (518.4), GTX 295 (1 chip, approx 600), GTX 580 (1581.1) | 2048×2048 Two magnitudes faster than an Intel Core i7 740QM
GPU cluster for divided CGH (2012)44 | Optimized 2048-point based | 12×GTX 480 (1344.96 per GPU) | 6400×3072 in 55 msec
GPU cluster for distributed hologram computation (2013)71 | Split look-up table | 9×GTX 590 (2488.3 per GPU) and 14×Quadro 5000 (722.3 per GPU) |
 | Binary detour-phase method | NVIDIA TESLA C2050 (1030.4) | 35×-53× speedup compared to AMD Phenom 9850 CPU
Localized error diffusion and redistribution (2014)72 | Localized error diffusion and redistribution (LERDR) algorithm | GTX 590 (2488.3) | 2048×2048 in 6 msec
3D binary CGH (2014, 2015)73, 74 | Precalculated triangular mesh | GTX 770 (3213.3) | Performance is better than point-based methods but slower than triangle-based algorithm
3D object tracking mask-based novel-look-up-table (2015)75 | OTM-NLUT | 3× GTX TITAN (4499.7) | 31.1 FPS of Fresnel CGH patterns
Fourier hologram benchmarking (2015)62 | Kinoform, Detour Phase, Lee and Burckhardt methods | NVIDIA TESLA C2050 (1030.4) | Speed-up of up to 68× compared to AMD Phenom 9850 CPU
Fast occlusion processing (2016)76 | Point-source and wave-field hybrid | GTX 780Ti (5045.7) | 1024 layers with 6.7 million points 21.28 msec
GPU for block-based parallel processing (2018)77 | 10K-point based | GTX 1080Ti (11339.7) | 1024×1024 in 18.7 msec
Photorealistic CGH benchmark (2018)43 | Backward ray-tracing and wavefront-recording planes (WRPs) | Quadro M5000 (4300.8) | 1920×1080 in 20 msec
Real-time colour holographic reconstruction (2020)78 | Point cloud based | 13×GTX TITAN X (8000) | 1920×1080 RGB + alpha coloured at 38.31 FPS
3.6 Reported work using GPUs
In 1995, Lucente and Galyean demonstrated the first published result of CGH generation using a
computer graphics workstation.79 The achieved performance was calculated using eight 128×60
pixel full-colour images, which led to a replay field of a 3D object with different viewing angles.
At that time, the calculation time of 2 seconds on the graphics workstation was 100 times faster
than that of a conventional computer.
Later, in 2003, Petz and Magnor used an NVIDIA Geforce 4600Ti to generate the interference
fringes for holograms.80 In their work, the authors assessed both the GPU performance and the
dependence of the computation time on the hologram resolution. For an object containing
1024 point light sources, it took 0.96 s and 3.86 s to calculate the corresponding holograms at
resolutions of 512×512 and 1024×1024, respectively.80
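The roughly fourfold increase in time (3.86 s / 0.96 s ≈ 4.02) for four times as many pixels is what a direct point-source summation would predict, since its cost grows with the product of object points and hologram pixels. The following is a minimal sketch of such a summation, not the GPU implementation of Ref. 80; the spherical-wave model, pixel pitch, wavelength and all function names are illustrative assumptions.

```python
import cmath
import math


def point_source_hologram(points, width, height, pitch, wavelength):
    """Sum a spherical wave from every object point at every hologram pixel.

    points: list of (x, y, z, amplitude) in metres; z is the (non-zero)
    distance from the point to the hologram plane.
    Returns (field, evaluations): the complex field as a list of rows and the
    number of point-pixel evaluations that were performed.
    """
    k = 2.0 * math.pi / wavelength
    field = []
    evaluations = 0
    for row in range(height):
        y = (row - height / 2) * pitch
        line = []
        for col in range(width):
            x = (col - width / 2) * pitch
            total = 0j
            for px, py, pz, amp in points:
                r = math.sqrt((x - px) ** 2 + (y - py) ** 2 + pz ** 2)
                total += amp / r * cmath.exp(1j * k * r)
                evaluations += 1
            line.append(total)
        field.append(line)
    return field, evaluations


# Assumed toy scene: two object points, 8 um pixel pitch, 532 nm wavelength.
points = [(0.0, 0.0, 0.1, 1.0), (1e-4, -1e-4, 0.12, 1.0)]
_, small = point_source_hologram(points, 8, 8, 8e-6, 532e-9)
_, large = point_source_hologram(points, 16, 16, 8e-6, 532e-9)
print(small, large, large / small)
```

Doubling both hologram dimensions quadruples the evaluation count in this sketch, which is why resolution and point count together set the workload.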
Before the introduction of CUDA, the graphics API OpenGL was used to compute holograms,
as reported in 2006 (Ref. 81) and 2009 (Ref. 21); however, the performance was not promising.
Additionally, a real-time reconstruction system for an 800×600, 64-point based 3D object CGH
was reported in 2006 using HLSL and the DirectX API, achieving a calculation speed 47× faster
than a Pentium 4 CPU.59
The use of OpenCL for parallel computing to generate holograms with an AMD HD5000 was
reported by Shimobaba et al. in 2010.68
Since the release of CUDA, there has been a surge of interest in generating computer
holograms using the full capability and computational power of GPUs.
GPU microarchitectures have changed remarkably over the past decade, and the available
computational power has increased by at least a factor of ten. This performance
improvement can also be seen in the reported literature.
In 2009, Shiraki et al.65 reported a real-time holographic video generation system for a 3D
object of 1000 point light sources, using an NVIDIA GTX 280 based on the Tesla
microarchitecture. The generated hologram resolution was 1400×1050 pixels. This performance
was later surpassed with the introduction of the GTX 1080Ti, based on the Pascal
microarchitecture, in 2018.77 The system reported by Kim et al. can produce real-time high
definition (HD) holographic generation and projection using 10,000 points of light, a tenfold
increase in object-point count over the system reported in Ref. 65.
Table 3 provides a summary of some of the hardware implementations using different hologram
generation algorithms in recent literature.
3.7 Summary of GPUs
Traditionally, GPU vendors have designed their product lines to carry out single precision
floating point operations effectively.82 Over the years, these vendors have redesigned
their products to also support half-precision numbers and fixed point arithmetic.
GPUs are by design powerful single and double precision floating point hardware
accelerators. Recent trends have led to the use of half-precision and fixed-point arithmetic,
which further increases the computational speed at the cost of precision.
Due to hardware and manufacturing constraints, the number of streaming (CUDA) cores
that can be embedded within a single GPU is limited. Therefore, in order to speed up the hologram
generation process, one practical solution is to form a GPU cluster using multiple GPUs. This can
be done either in a single stand-alone system67 or over a dedicated network.44
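One way to picture the clustering idea is to divide the hologram into bands of rows and let each device compute its own band independently. The sketch below is only an illustration of that partitioning: Python threads stand in for GPUs (they demonstrate the decomposition, not a speed-up), and the row-band split, constants and function names are assumptions rather than the schemes used in Refs. 67 and 44.

```python
import cmath
import math
from concurrent.futures import ThreadPoolExecutor

WAVELENGTH = 532e-9   # assumed illumination wavelength (m)
PITCH = 8e-6          # assumed pixel pitch (m)


def compute_rows(points, width, height, row_start, row_stop):
    """Compute hologram rows [row_start, row_stop) for all object points."""
    k = 2.0 * math.pi / WAVELENGTH
    rows = []
    for row in range(row_start, row_stop):
        y = (row - height / 2) * PITCH
        line = []
        for col in range(width):
            x = (col - width / 2) * PITCH
            line.append(sum(
                amp * cmath.exp(1j * k * math.sqrt(
                    (x - px) ** 2 + (y - py) ** 2 + pz ** 2))
                for px, py, pz, amp in points))
        rows.append(line)
    return rows


def split_rows(height, workers):
    """Divide the rows into contiguous bands, one per worker."""
    base, extra = divmod(height, workers)
    bands, start = [], 0
    for i in range(workers):
        stop = start + base + (1 if i < extra else 0)
        bands.append((start, stop))
        start = stop
    return bands


def clustered_hologram(points, width, height, workers):
    """Compute each band on its own worker and stitch the bands together."""
    bands = split_rows(height, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(
            lambda band: compute_rows(points, width, height, *band), bands)
        return [line for part in parts for line in part]
```

Because no band depends on another, the only coordination needed in this sketch is the final concatenation; in a real cluster, transferring the object data and the finished bands would add communication cost that is not modelled here.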
In general, GPUs are strong candidates for CGH systems.
4 Field Programmable Gate Arrays (FPGAs)
Field programmable gate arrays (FPGAs) are highly-configurable integrated circuits capable of
being reprogrammed by designers after manufacture. This degree of flexibility enables designers
to implement logical hardware designs during their product’s development stage and to assess
performance before the fabrication of expensive application specific integrated circuits (ASICs).
The three traditional vendors in this field have been Intel, Xilinx and Lattice. However, the
growth of the market has seen additional vendors arise, such as GOWIN Semiconductors. The cost
of a single FPGA chip ranges from several dollars at the low end to tens of thousands of dollars,
depending on the performance and hardware requirements as well as the market capability.
As shown in Figure 6, a typical FPGA architecture consists of the following five fundamental
elements:82, 83
1. Functional unit: A fundamental programmable cell that implements both combinational
and sequential circuits. Depending on the vendor, these logic cells have been given different
names, e.g., Intel names these cells logic array blocks (LABs), whereas Xilinx calls them
configurable logic blocks (CLBs).
2. Interconnect fabric: A mesh of programmable wires that establishes the signal connections
between functional units and inputs/outputs (I/Os).
3. Configuration memory blocks: A portion of on-board memory which stores the synthesized
bitstream contents used to programme the functional units and fabric.
4. I/O interfaces: General purpose inputs and outputs connect the signals from the integrated
circuit to physical peripherals and I/O pins.
5. Digital signal processing blocks: Recent FPGAs incorporate dedicated ‘hard’ digital signal
processing blocks that support accumulations and multiplications at various precisions, either
fixed-point or floating-point, to further boost the performance of FPGA-based implementations.
The implementation of a functional unit is vendor specific. The units in Xilinx and Intel FPGA
Fig 6 A typical architecture of FPGA, which consists of logic cells, I/O ports, DSP blocks, block memory, etc.
ROMs. Note that this can make it challenging to compare two FPGAs from different vendors; a
fact that should be kept in mind when assessing FPGA performance.
Given their dominance of high-performance FPGA product families, we will mainly focus on
FPGA products from Intel and Xilinx, and provide a comprehensive review based on their product
families.
Both Intel and Xilinx provide intellectual property cores (IP-cores) that are programmable-
hardware implementations of application specific peripherals and algorithms. These are optimized
for a given product line and should be used where possible to expedite development time and boost
performance.
4.1 The highly configurable hardware platform
The key strength of FPGAs is their highly configurable and hardware-programmable nature.
Applications can be developed using hardware description languages (HDLs) such
as Verilog/SystemVerilog, VHDL, etc.83 These language-based designs are portable and usually
independent of technology, except where intellectual property (IP) cores and other
chip-specific configurations are applied. Designers are able to repeatedly reprogram and
reconfigure a given chip to effect changes at the hardware level, and to reuse designs across
different FPGA chips, normally from the same vendor.
The ability of FPGAs to support HDLs provides a significant benefit in that almost all on-chip
cells are highly configurable and can be used to synthesize any hardware implementation,
as long as the design fits into the available logic cells and hardware units.
4.2 Implementations based on fixed and floating point
Since FPGA platforms are highly reconfigurable, the choice between fixed point and floating point
arithmetic becomes one of the most important design considerations. According to a report
produced by Xilinx,82 FPGA applications that require lower power but higher speed benefit
from converting floating point to fixed point arithmetic.
Floating point formats typically include IEEE 754 half-precision (16-bit), single-precision
(32-bit) and double-precision (64-bit) configurations, whereas fixed point formats are more
flexible and usually range from several bits to 32 bits in width.
Devices such as GPUs, which are used in computationally heavy applications, have
traditionally been designed architecturally so that they are more efficient when supporting
floating point operations. When implemented on FPGAs at the hardware level, however,
conventional floating point operations, such as those used on GPUs or CPUs, are slower than
fixed point alternatives. This is due to the need to manage the mantissa and exponent of
IEEE 754 floating point numbers during the calculations, and the difficulty of doing so.84
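To make the fixed point trade-off concrete, the sketch below stores a phase as an unsigned integer in which a full turn spans the whole word, so that phase wrap-around becomes a bit mask and the cosine becomes an integer table look-up. This is a generic illustration rather than a design taken from Refs. 82 or 84; the 10-bit phase word, the 8-bit signed cosine samples and all names are assumed values chosen for the example.

```python
import math

PHASE_BITS = 10                     # assumed word length: 2**10 steps per turn
LUT_SIZE = 1 << PHASE_BITS
AMP_BITS = 8                        # assumed signed 8-bit cosine samples
AMP_SCALE = (1 << (AMP_BITS - 1)) - 1

# Integer cosine table covering one full turn.
COS_LUT = [round(math.cos(2 * math.pi * i / LUT_SIZE) * AMP_SCALE)
           for i in range(LUT_SIZE)]


def to_fixed_phase(turns):
    """Quantise a phase given in turns (cycles) to PHASE_BITS bits."""
    return round(turns * LUT_SIZE) & (LUT_SIZE - 1)   # wrap-around is a mask


def fixed_cos(phase_word):
    """Integer cosine look-up; returns a value in [-AMP_SCALE, AMP_SCALE]."""
    return COS_LUT[phase_word & (LUT_SIZE - 1)]


def max_error(samples=5000):
    """Worst-case deviation from the floating point cosine over test phases."""
    worst = 0.0
    for n in range(samples):
        turns = n * 0.37913          # arbitrary test phases, many turns long
        exact = math.cos(2 * math.pi * turns)
        approx = fixed_cos(to_fixed_phase(turns)) / AMP_SCALE
        worst = max(worst, abs(exact - approx))
    return worst


print(max_error())
```

With these assumed word lengths the quantisation error stays below about one percent of full scale; widening either word reduces the error at the cost of more logic and memory, which is the precision versus resource trade-off described above.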
4.3 Available tools and utilities for FPGAs
FPGA vendors typically provide their own proprietary tools. Intel’s Quartus Prime is widely used
in the community to facilitate development for Intel FPGAs. For devices
offered by Xilinx, there are development tools such as the Vivado design suite, which has replaced
the Xilinx integrated synthesis environment (ISE).
ModelSim is a functional simulation software package from Mentor Graphics. It can be used
independently to simulate hardware based on HDL descriptions, and is also compatible with
Intel Quartus Prime, Xilinx ISE and Vivado.
Table 4 Tools and utilities reported for FPGA implementations in recent years
Max+Plus II | Legacy design tool | 86 | 2002
Xilinx
Vivado design suite | Design tool | 90 | 2019
ISE | Design tool | 91 | 2011
DisplayPort IP | IP-core for DisplayPort | 90 | 2019
MIG IP | IP-core for memory interface | 90 | 2019
In addition, many hardware implementations use intellectual-property (IP) cores provided by
the vendors to perform certain operations on FPGAs.
Due to the unique nature of FPGAs, their development process is very distinct from that of
traditional CPUs and GPUs. A simplified overview is summarized as follows:83
1. Design specification and partition: These two initial steps set up the entry point for the
design.
2. Simulation and functional verification: This verification step tests the functionality of a
compiled design using a user-specified testbench file.
3. Design integration and verification: This step integrates all partitioned modules into one
large system.
4. Pre-synthesis sign-off: At this stage, all the known functional errors should have been
eliminated.
5. Synthesis and implementation: Translates the hardware description language syntax and
contents to an optimal Boolean description that maps onto the selected FPGA chip. The language
synthesis tool will also remove redundant logic from the design if optimization is selected.
6. Configuration bitstream download: The development tool will map the synthesized HDL
to the selected chip and configure the logic unit blocks.
7. Prototype functional testing and verification: At this stage, the design is tested on
hardware to prove its functionality.
8. Final sign-off: All constraints should at this stage be satisfied and errors eliminated via
hardware and simulation debugging before the final chip production.
4.4 The development time using FPGAs
Generally, the development time varies with the level of hardware complexity and the use of
IP-cores. For example, Takada et al.93 took over 3 years to develop
and implement their algorithms on a bespoke FPGA platform consisting of 8 high-end
FPGA chips.
The average development time for a project based on FPGA hardware implementations will
typically be significantly longer than that for an equivalent CPU or GPU project. Although not
reported for CGH applications, a study conducted in 2012 estimated that developing algorithms
for dense optical flow, stereo and local image feature extraction on a GPU-based hardware
platform takes approximately 2 months for one full-time post-doctoral employee, whereas
developing the same algorithms and functionalities on an FPGA platform will likely take
12 months for two post-doctoral employees.94 Overall, FPGA-based applications are likely
to take longer to develop than equivalent implementations on a GPU platform.
4.5 The advantages and disadvantages of using FPGAs
One of the merits of FPGA implementation is the potential to migrate a given FPGA
register-transfer level (RTL) design into ASICs. ASICs are dedicated chipsets specifically
designed for a certain application. They are inflexible and require significant one-off tooling
costs, but once designed represent an optimal combination of performance, power and cost for a
given hardware accelerator. The performance can be optimized for the generation of computer
holograms with the use of ASIC technology. Work reported in 2017 (Ref. 95) demonstrated that
an FPGA-based implementation can be migrated into a very-large-scale integration (VLSI) chip
without the need for vast modifications.
The potential for high performance at moderate power consumption, along with the ability to
migrate a given design to an ASIC, provides a strong argument for the use of FPGAs in CGH
applications.
As pointed out in Section 4.4, the most significant drawback of FPGA implementation is the
relatively long development time. FPGA-based hologram generation projects often require years
of work by a group of researchers. Moreover, the knowledge of hardware architecture and FPGA
technology required of the developers sets up a high entry barrier.
Table 5 A summary of CGH implemented on FPGAs since 2008
Project and application (year) | Implemented algorithm | Hardware model | Pixel size and performance
HORN 5 2-dimensional FFT (2008)96–99 |  |  | 3D image with 10,000 points at 30 FPS
HORN 6 (2009)100 | Phase computation by addition, point-cloud | A 16-board cluster each containing 4×Xilinx XC2VP70 and 1×XC2V1000 | 67.9 msec per hologram
Realtime hologram generation (2010)92 | 40,000 point light sources | Xilinx XC2VP70 | 1408×1050 in 9.3 msec
Cell-based hardware architecture (2011)87 | Point light source | Altera (no specific model no.) | 1408×1050 in 15.8 msec
One-step phase-retrieval (2011)91 | OSPR | Xilinx Virtex-4 SX35 | 512×512 in 0.9 msec
Pixel-by-pixel hardware simulation (2012)88 | Pixel by pixel and parallel schemes | Altera simulation | Performance not measured in physical hardware implementation
HORN 7 (2012, 2013)101, 102 | Phase computation by addition, point-cloud | Xilinx Virtex-6 ML605 | 2 million pixels of 16,000 points in 0.4 sec
Full analytical Fraunhofer CGH (2015)89 | Polygon based | Altera Cyclone IV EP4CE115 | 800×600 in 9.6 msec
 |  |  | An effective speed equivalent to 0.5 PFLOPS, 1920×1080 65000 points at 8.3 FPS
Clustered HORN 8 (2018)103 | Spatiotemporal division point-cloud |  | 
 | Layer based | Xilinx XCKU115 | 1920×1080 RGB at 16 FPS
4.6 Reported work using FPGAs
Table 5 summarizes recent implementations on FPGA platforms. Most of the hardware models
used in these lines of work are high-end FPGAs from Xilinx.
HORN (HOlographic ReconstructioN) computers, which have been in active development by
Ito et al. since 1992,104 have provided the research community with many insights into the field of
CGH hardware implementation, notably the use of FPGAs for real-time hologram generation. So
far, eight generations of devices have been produced by this group, ranging from low-speed
devices to high-speed special purpose computers. The first four generations of HORN use
DSP or small-scale FPGA chips for real-time computation tasks.86, 104–106 The later four generations
of devices consist of large-scale FPGA chips embedded on custom printed circuit boards
(PCBs).46, 50, 93, 96, 100 The latest product within this line of work is HORN-8, which comprises
seven powerful FPGA chips for calculation and one FPGA chip for communication. As reported
in Refs. 93 and 50, the HORN-8 special computer can generate a hologram for a 3D object of
10,000 points within 0.019 seconds, with a peak performance of 0.5 tera floating point operations
per second (TFLOPS) running at a 0.25 GHz clock frequency. At the time of writing, the team’s
outlook is to further develop an ASIC design based on the HORN-8 structure.107
Seo et al. proposed a hardware architecture based on a pixel-by-pixel calculation scheme.87, 88, 92, 95
In this line of work, the authors reduced the number of memory accesses by using the
pixel-by-pixel method, which differs from the conventional light source-by-source calculation
method. The authors also demonstrated a very-large-scale integration (VLSI) chip based on the
proposed FPGA architecture.95 The work reported by Seo et al.95 demonstrated that it is relatively
simple to migrate an FPGA system into an ASIC design.
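The effect of loop order on memory traffic can be illustrated with a toy model. The sketch below contrasts an outer loop over object points, which performs a read-modify-write on the hologram buffer for every contribution, with an outer loop over pixels, which accumulates in a local variable and writes each pixel once. It is a simplified illustration, not the architecture of Seo et al.: the 1D Fresnel-type fringe term, the wavelength and the access-counting model (which ignores caches and reads of object data) are all assumptions.

```python
import math

WAVELENGTH = 532e-9   # assumed illumination wavelength (m)


def phase_term(x, point):
    """Toy 1D Fresnel fringe contribution of one object point at position x."""
    px, pz = point
    return math.cos(math.pi * (x - px) ** 2 / (WAVELENGTH * pz))


def source_by_source(pixels, points):
    """Outer loop over object points: every contribution is a
    read-modify-write on the hologram buffer."""
    hologram = [0.0] * len(pixels)
    accesses = 0
    for point in points:
        for i, x in enumerate(pixels):
            hologram[i] = hologram[i] + phase_term(x, point)
            accesses += 2            # one read and one write per contribution
    return hologram, accesses


def pixel_by_pixel(pixels, points):
    """Outer loop over pixels: contributions accumulate in a local register
    and each pixel is written to the buffer exactly once."""
    hologram = [0.0] * len(pixels)
    accesses = 0
    for i, x in enumerate(pixels):
        acc = 0.0
        for point in points:
            acc += phase_term(x, point)
        hologram[i] = acc
        accesses += 1                # single write per pixel
    return hologram, accesses


pixels = [i * 1e-5 for i in range(64)]
points = [(j * 2e-5, 0.1 + 0.01 * j) for j in range(10)]
print(source_by_source(pixels, points)[1], pixel_by_pixel(pixels, points)[1])
```

Both orderings produce the same hologram in this model; only the number of buffer accesses differs, scaling with points × pixels in the first case and with pixels alone in the second.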
4.7 Summary of FPGAs
Benefiting from their highly configurable architecture, FPGAs are to date the most flexible
hardware accelerators for use in hologram generation applications. The calculations required in
hologram generation algorithms can take advantage of the high degree of parallelism within an
FPGA chip. However, the development time needed to implement optimized algorithms on FPGAs
is typically significant, and the work requires advanced skills in HDLs, digital logic design and
hardware architecture; skills not typically present in traditional optics groups researching
holography.
5 Review of other available hardware platforms
In parallel with research on hardware implementations using hardware accelerators such as GPUs
and FPGAs, there have been several attempts to implement holographic generation algorithms
on other existing platforms. This section aims to review some of the candidates.
5.1 Digital Signal Processors (DSPs)
Digital signal processors (DSPs) are dedicated hardware platforms for signal processing
applications. These microprocessors have architectures that are tuned for analogue and digital
signal processing tasks, with the ability to support single instruction multiple data (SIMD).
Nishikawa et al. reported the use of a DSP to generate holographic images in the late 1990s.108
A multi-DSP system consisting of 3×4 i860 DSPs was proposed to generate 3D objects for the
application. The 3D object consists of 200 points and is 640×480 pixels in size. The multi-DSP
system takes 68 seconds to generate the object, as opposed to a reference workstation
(SPARCstation 10), which generates the object in 291 seconds.
The most recent work was reported by Oi et al.109 Twenty TMS320C6727 DSPs running
floating point arithmetic were used to form the DSP block in the proposed system. These DSPs
were dedicated to the conversion of integral photography (IP) images to holograms in the Fresnel
diffraction domain. With a 1.5× redundancy design, a real-time performance of 50 FPS was
achieved.
The current highest-end DSP products are those from Analog Devices and Texas Instruments.
A TI TMS320C6678 eight-core floating-point DSP runs at a clock rate of 1 GHz to 1.4 GHz, with
a maximum computational performance of 20 GFLOPS per core for single precision floating point
operations.110
It is unlikely that these DSPs will be capable of performing complicated hologram generation
algorithms, due to their hardware specifications and limited computational power. However, it is
still worthwhile to regard DSPs as valuable candidates for implementing less complicated
algorithms, due to their low power profile and ease of programming. DSPs are typically
programmed using C and assembly language. The toolchain support is considered mature and
time-proven, further minimizing the development time and difficulty.53
5.2 Xeon Phi coprocessor and ClearSpeed accelerator board
Xeon Phi is a family of co-processors with an x86 manycore architecture designed and produced by
Intel.111 It is to date one of the few fairly powerful manycore co-processors intended for
use in hardware acceleration applications.112 This line of products supports the use of OpenMP.113
As was introduced in Section 2.2, OpenMP is an API that is optimized for shared memory
multi-processing programming and exploits multi-thread parallelism.
In 2014, Murano et al. reported on the use of a Xeon Phi coprocessor unit (Xeon Phi 5110P)
for computer hologram generation.114 The authors used the Intel MKL for the calculation of FFTs,
along with OpenMP functionality, to make use of the available cores present in the
coprocessor. Their results show that in all their test cases, the GPU outperforms the Xeon Phi
accelerator by a significant margin. However, compared with the GPU case, the Xeon Phi
coprocessor minimizes the amount of existing code that needs to be rewritten in order to port
software-based algorithms to the hardware accelerator.
Another hardware acceleration board, the ClearSpeed Advance Dual CSX600, was demonstrated
in 2009.115 The authors were able to speed up the point cloud hologram calculation by 56×
relative to an Intel Xeon CPU performing the calculation on a single core. Unfortunately, as of
the writing of this survey, ClearSpeed accelerator boards are no longer in production.
5.3 System-on-Chip (SoC) hybrid CPU and FPGA
There is a growing need for hologram generation systems to become compact and low in power
consumption. The trend towards System-on-Chip (SoC) devices utilizing the heterogeneous system
architecture (HSA) has been growing rapidly over the years. The general idea behind SoC is to
incorporate different devices and peripherals on a single chip to reduce the overall die area and
to minimize the power consumption.116 One hybrid product places both an FPGA and
microprocessors or CPUs on one chip. Further discussion is presented in Section 6.5.
In one of the most recent studies, Yamamoto et al. developed a compact holographic computer
using a Xilinx Zynq UltraScale+ MPSoC consisting of an ARM CPU
and an FPGA on one single chip.117 The reported system was able to reproduce 1920×1080 pixel
3D video at a rate of 15 frames per second.117 They also compared the result to the performance of
a Jetson TX1 platform:118 the calculation of 1920×1080 pixels with 6500 points took 0.066 s on
the SoC platform, whereas the Jetson TX1 took 1.294 s.
The development time for these SoC hardware implementations would be even longer than
that of pure FPGA developments, since incorporating both a CPU, which requires multi-thread
programming, and an FPGA, which uses hardware description languages, adds another level of
complexity when highly optimized code and algorithms are needed. However, the gains in power
efficiency and the reductions in die area and package size can bring about benefits that
compensate for the increased programming workload.
6 Discussion
As shown in Fig 7, most of the reported work included in this survey between 2008 and 2020
implemented algorithms on GPU platforms, totaling 24 papers, as compared to other accelerator
platforms. FPGA-based systems are popular as well, accounting for 16 published papers. In
particular, the line of work exemplified by the HORN group exploits the potential of FPGA
parallelism for fast hologram generation.
Table 6 General comparison between CPU, GPU, FPGA, DSP and other platforms
Platform | Number of cores | Serial or parallel | Clock frequency | Development time | Power efficiency | Portability
CPU | Low | Mainly serial | High | Short | Average | Straightforward
GPU | High | Parallel | High | Average | Low | Less challenging
DSP | Low | Mainly serial | Average to High | Average | Average | Simple (from low- to higher-performance)
Xeon Phi / ClearSpeed | Average | Serial with many-core parallel | Average | Short | Average | Average
Heterogeneous SoC e.g. FPGA + CPU | High | Serial and parallel | Low | Long | High | Platform dependent
There are also a number of research papers implementing algorithms with CPUs only; however,
as discussed in Section 2.1, most of them tend to focus on the development of novel algorithms
and choose a PC platform without hardware accelerators as a means of algorithm evaluation and
verification.
It is worth noting that cross-platform comparisons based solely on calculation speed are
not strictly reasonable. This is because different platforms incorporate different architectures and
have different toolchain support. Essentially, despite best efforts, the algorithms implemented
can still be fundamentally different across platforms. Therefore, it is essential
that analytical models with key performance metrics, which consider not only FPS and power
efficiency but also other factors, be proposed to assess performance across different hardware.
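As one possible ingredient of such a model, calculation times for point-based algorithms can be normalized by workload, for example as object-point/pixel pairs evaluated per second. The sketch below applies this to the SoC and Jetson TX1 figures quoted in Section 5.3. The metric and the function name are assumptions made for illustration; the metric deliberately ignores power, arithmetic precision and image quality, so it is not a complete analytical model.

```python
def point_pixel_rate(width, height, points, seconds):
    """Object-point/pixel pairs evaluated per second for one hologram."""
    return width * height * points / seconds


# Figures quoted in Section 5.3 for a 1920x1080 hologram of 6500 points.
soc = point_pixel_rate(1920, 1080, 6500, 0.066)
jetson = point_pixel_rate(1920, 1080, 6500, 1.294)
print(f"SoC: {soc:.3e} pairs/s, Jetson TX1: {jetson:.3e} pairs/s, "
      f"ratio: {soc / jetson:.1f}x")
```

Because both measurements share the same resolution and point count, the ratio here simply reproduces the ratio of the calculation times; the normalization becomes useful when the workloads being compared differ.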
We summarize the hardware specifications for the reviewed hardware and provide a general
comparison between these platforms in Table 6. The table shows the differences in terms of the
number of cores, serial or parallel architectures, clock frequencies, the estimated development
time, power efficiency and portability.
Fig 8 Different levels of parallelism and concurrency on hardware platforms. The left hand side depicts the CPU
instruction pipeline and processor-level parallelization, whereas the right hand side shows the FPGA parallelization at
the equivalent levels. Hardware clustering at the board level further exploits parallelism.
We identify six key considerations when selecting a suitable hardware platform for CGH
related implementations:
1. Hardware manufacturing constraints.
2. Toolchain support.
3. Fixed point or floating point arithmetic.
4. Parallel and sequential processing (shown in Figure 8).
5. Development time.
6. Portability of software.
6.1 Toolchain support
One of the most important aspects to consider is the full-cycle development toolchain support.
CPU and GPU platforms are likely to be less affected by a lack of available software packages
and library support, as discussed in the previous sections. However, FPGAs and other accelerators
such as DSPs and co-processors might suffer from a lack of active development support, which
will, in turn, affect the overall development process. In general, limited availability of tools
and utilities for the dedicated hardware creates a resource barrier to successful implementation.
6.2 Choice of algorithms and parallel/sequential processing
Many algorithms exist for 2D/3D hologram generation. Different algorithms require different
hardware resources in practice, e.g. triangular-mesh based algorithms can take advantage of
being compatible with modern computer graphics technologies that use polygon meshes for object
computations.13 Regardless of the algorithm used, the problem size, e.g. the hologram resolution
and the number of points or polygons, is an important consideration in all cases; more
importantly, increasingly complex holograms demand larger and better hardware.
GPUs and other specialized hardware accelerators can speed up hologram calculation by
exploiting parallelism and optimizing sequential processing. For example,
in point-cloud-based calculation, the hologram patterns are calculated using the same
mathematical formula and, more importantly, the calculation of these patterns for each object
point is independent of other object points.13 The independent calculation of object points can
therefore make use of the parallel processing available on hardware platforms.
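The independence of object points means that a hologram can be assembled from partial holograms, each computed from a disjoint subset of the points, with an element-wise sum as the only combining step. The sketch below illustrates this under stated assumptions: the unnormalized spherical-wave contribution, the wavelength and all function names are illustrative and not taken from Ref. 13.

```python
import cmath
import math

WAVELENGTH = 532e-9   # assumed illumination wavelength (m)


def contribution(x, y, point):
    """Complex field of one object point at hologram coordinate (x, y)."""
    px, py, pz, amp = point
    r = math.sqrt((x - px) ** 2 + (y - py) ** 2 + pz ** 2)
    return amp * cmath.exp(2j * math.pi * r / WAVELENGTH)


def hologram(points, coords):
    """Field at each coordinate from the given subset of object points."""
    return [sum(contribution(x, y, p) for p in points) for x, y in coords]


def reduce_partials(partials):
    """Element-wise sum of partial holograms (the only combining step)."""
    return [sum(values) for values in zip(*partials)]


coords = [(i * 1e-5, j * 1e-5) for i in range(4) for j in range(4)]
pts = [(0.0, 0.0, 0.1, 1.0), (1e-4, 0.0, 0.12, 0.8),
       (0.0, -1e-4, 0.09, 0.6), (5e-5, 5e-5, 0.11, 1.0)]

full = hologram(pts, coords)
combined = reduce_partials([hologram(pts[:2], coords),
                            hologram(pts[2:], coords)])
print(max(abs(a - b) for a, b in zip(full, combined)))
```

Each call to `hologram` on a subset could run on a separate core or device; the results agree with the single-pass computation up to floating point rounding, since only the order of summation changes.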
Moreover, for CGH algorithms involving FFT operations, and depending on the hardware
utilized, the FFT operations can be parallelized across different cores or pipelines at the
processor level, as shown in Figure 8.
It is also of great importance, though algorithm-dependent, to be aware of the number
of sequential processes required and to optimize for performance while exploiting concurrency and
parallelism within the specified hardware. For example, iterative algorithms such as the
Gerchberg-Saxton (GS) algorithm119 require sequential processing that cannot be, or tends to be
difficult to, parallelize and multi-task. It is then the developer’s responsibility to select a
platform that not only fulfills the need for parallelism but also offers options to optimize the
sequential operations when implementing the desired algorithm.
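The sequential dependency can be seen in a minimal Gerchberg-Saxton loop: each iteration consumes the hologram produced by the previous one, whereas the transform inside an iteration is where parallel hardware helps. The sketch below is a 1D toy version with a naive unitary DFT in place of an FFT; the random initial phase, problem size and function names are assumptions, and it is not the implementation of Ref. 119 or of any work reviewed here.

```python
import cmath
import math
import random


def dft(values, sign):
    """Naive unitary DFT (sign=-1) or inverse DFT (sign=+1), O(N^2)."""
    n = len(values)
    scale = 1.0 / math.sqrt(n)
    return [scale * sum(v * cmath.exp(sign * 2j * math.pi * m * k / n)
                        for m, v in enumerate(values))
            for k in range(n)]


def gerchberg_saxton(target_amplitude, iterations, seed=0):
    """Phase-only hologram whose Fourier transform approximates the target.

    Returns (hologram_phases, errors), where errors[i] is the RMS amplitude
    error in the replay field at the start of iteration i.
    """
    rng = random.Random(seed)
    n = len(target_amplitude)
    hologram = [cmath.exp(2j * math.pi * rng.random()) for _ in range(n)]
    errors = []
    for _ in range(iterations):          # iterations are strictly sequential
        replay = dft(hologram, -1)       # parallelizable within an iteration
        errors.append(math.sqrt(
            sum((abs(r) - t) ** 2
                for r, t in zip(replay, target_amplitude)) / n))
        # Replay-plane constraint: keep the phase, impose the target amplitude.
        constrained = [t * cmath.exp(1j * cmath.phase(r))
                       for r, t in zip(replay, target_amplitude)]
        back = dft(constrained, +1)
        # Hologram-plane constraint: keep the phase, force unit amplitude.
        hologram = [cmath.exp(1j * cmath.phase(b)) for b in back]
    return [cmath.phase(h) for h in hologram], errors


# Assumed toy target: a bright band of four samples in a 16-sample replay field.
target = [2.0 if 4 <= i < 8 else 0.0 for i in range(16)]
phases, errors = gerchberg_saxton(target, 10)
print(errors[0], errors[-1])
```

In this sketch the replay-field error does not increase from one iteration to the next, but no iteration can start before the previous one has finished, which is the property that limits how much of the algorithm can be parallelized.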
6.3 Portability of software
It is essential to consider the possibility of transferring the developed software and firmware from
one system to another while keeping in mind the trade-offs between portability and performance.
This transfer would likely be required when upgrades toward newer generations of hardware are
expected, or when performance comparisons between different devices are needed.
The most straightforward transfer comes when a CPU platform, which is usually based on a
PC, is used. Upgrading between different operating systems and software platforms is
comparatively simple thanks to the abundant software support. In comparison, porting from one
NVIDIA GPU to another would sometimes require more work, although CUDA provides a unified
development environment. This is mainly due to the upgrades in hardware between different
generations of GPU products. Porting code within a generation is usually not challenging, as
long as the memory and computational power limitations have been taken into account by the
developer.
Code transfer among different FPGA platforms, by contrast, would be more difficult,
especially when IP-cores specific to the target chip are used for the application. That said,
transferring from a lower performance FPGA to an FPGA with higher performance can be
relatively simple, which will likely be the case when HDL descriptions are used.
6.4 Hologram generation quality assessment
An end-to-end assessment of a CGH hardware implementation should cover generation speed, hardware performance and the quality of the generated holograms.
There is currently no unified criterion for assessing the quality of computer holograms generated on different platforms. Kim et al.90 use a modulation transfer function (MTF) to compare the image quality of different holograms. Structural similarity (SSIM) has also been used in Ref. 49 to evaluate the quality of the generated images. Other widely used metrics are the mean square error (MSE) and the peak signal-to-noise ratio (PSNR).120 Blinder et al.121 provide a more detailed review of quality assessment for computer generated holograms.
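As an illustration of the two simplest metrics named above, the following sketch computes the MSE and PSNR between a reference image and a reconstructed image. It is a minimal, dependency-free example under stated assumptions: images are nested lists of greyscale values, the peak value defaults to 255 (8-bit data), and the function names `mse` and `psnr` are chosen here for illustration rather than taken from the cited works.

```python
import math


def mse(reference, test):
    """Mean square error between two equally sized greyscale images."""
    diffs = [(r - t) ** 2
             for row_r, row_t in zip(reference, test)
             for r, t in zip(row_r, row_t)]
    return sum(diffs) / len(diffs)


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite when the images are identical."""
    err = mse(reference, test)
    return math.inf if err == 0 else 10.0 * math.log10(peak ** 2 / err)
```

For example, `mse([[0, 255], [128, 64]], [[0, 250], [130, 64]])` evaluates to 7.25. Both metrics compare pixels independently and therefore ignore structural information, which is one reason perceptual metrics such as SSIM are also used.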
6.5 Heterogeneous computing and its related hardware
There is a growing trend in the embedded systems, image and video processing communities to incorporate state-of-the-art heterogeneous computing systems into their applications.
Heterogeneous computing systems typically refer to systems that combine more than one type of processor or core.122 The term can also refer to systems that combine a large number of processor cores sharing the same instruction set architecture (ISA),53 e.g. Intel Xeon Phi, or a small number of cores with different execution performance, e.g. the ARM big.LITTLE platform.123 In this section, we focus mainly on the development of, and trends in, heterogeneous hardware accelerators that incorporate devices with different ISAs.
These hardware systems take advantage of conventional multi-core hardware accelerators while bypassing some of the limitations and disadvantages of using a single hardware accelerator architecture.116 The aims of the heterogeneous system architecture (HSA) are to reduce the communication latency between different computing devices and to improve parallel execution performance.116
The level of heterogeneity in computing systems is gradually increasing, with more and more SoC platforms being produced. Among the various heterogeneous hardware platforms, the combination of CPUs and FPGAs, usually in the form of hard ARM processors embedded in an FPGA, as well as CPUs with DSPs, are potentially good candidates for low-cost, low-power hologram generation platforms because they balance the pros and cons of conventional system architectures.
Another heterogeneous computing platform worth mentioning is the Jetson module. Only the Jetson TX1 (Ref. 118) was evaluated in Ref. 117. Its upgraded version, the TX2 (Ref. 124), and the most recent AGX Xavier,125 both with boosted performance and power efficiency, are also of implementation interest. A low-cost variant of the Jetson family, the Jetson Nano, has also recently become available on the market.126 A list of the Jetson family is shown in Table 7.
Table 7 NVIDIA Jetson module product family
Model (year of launch) | GPU | Computational power | Power (W)
TX1 (2015)118 | Maxwell | Over 1 TFLOPS | Under 10
TX2 series (2017)124 | Pascal | 1.3 TFLOPS | 7.5-20
AGX Xavier series (2018)125 | Volta with Tensor Cores | 20-32 tera-operations per second (TOPS) | 10-30
Nano (2019)126 | Maxwell | 472 GFLOPS | 5-10
Xavier NX (2019/2020)127 | Volta with Tensor Cores | 21 TOPS | 10-15
6.6 Future trends in embedded systems and high-performance hardware platforms
ARM developed the Neon technology for its Cortex-A series and Cortex-R52 processors as an advanced SIMD architecture extension for image and video processing as well as general signal processing.128 No work reported to date has explored using an ARM-based SoC embedded platform to generate computer holograms while exploiting technologies such as Neon.
NVIDIA recently announced its plan to bring CUDA acceleration to the ARM ecosystem.129 This could bring the power and accessibility of CUDA to platforms such as ARM-based SoCs. It is accompanied by the introduction of the CUDA-X high performance computing (HPC) libraries, which can potentially further exploit parallelism and provide improved processing performance.130
From another perspective, the recently announced Vitis Unified Software Platform from Xilinx provides another degree of flexibility by supporting high-level synthesis (HLS) languages for FPGAs, which helps to reduce the development overhead of FPGA applications.131 This unified platform is envisioned to shorten the overall development time through higher-level implementations, without the need for fully RTL-level development, and to provide better programmability for the FPGA hardware.
7 Future work and conclusions
CGH calculations primarily require a high degree of computational parallelism, which motivates the use of hardware accelerators such as GPUs and FPGAs for the realization of real-time computer generated holograms.
It is anticipated that two separate research paths will lead towards the future of CGH hardware implementations:
1. High performance hardware platforms for real-time CGH generation and display. These systems will usually be costly and require a long development cycle. Algorithms for future fast computer hologram generation will likely be developed on these hardware platforms for first-phase verification and optimization. A good example is the HORN-8 special purpose computer.50, 93, 103 The team has recently announced its future outlook of building ASIC devices to further boost performance.107
2. Embedded computers and systems for low-power and low-cost applications. For this ultimate display technology to become accessible to ordinary households and individual consumers, a reduction in cost and a significant reduction in volume and size are essential. A large amount of work can potentially be done on SoC platforms, e.g. CPU-FPGA and CPU-DSP devices, as well as on supercomputer-on-a-module embedded computing devices.132 Examples that demonstrate implementations for embedded systems are those from Kim et al.90 and Yamamoto et al.117
In this review paper, we have attempted to provide a useful review of hardware implementations of CGH, as well as practical information about the current state-of-the-art hardware platforms that researchers and developers can select to implement computer hologram generation algorithms for their specific applications.
A key insight from this review is that the potential for real-time holography exists today without the need for bespoke hardware. A flagship GPU can process an entire holographic frame in 20 ms, providing high-quality CGH in real time. We predict that holography will transition towards mobile and embedded platforms, a trend evidenced by extrapolating from the growth of GPU computational power in Table 7. Bespoke hardware accelerators, such as FPGAs and ASICs, will continue to advance the field of CGH hardware in this period, pushing the boundaries of what is achievable in terms of computational power, energy consumption and overall system cost.
References
1 J. W. Goodman, Introduction to Fourier Optics, 3rd ed., Roberts & Co. Publishers, Englewood, CO (2005).
2 D. Gabor, “A new microscopic principle,” Nature 161(4098), 777–778 (1948).
3 B. R. Brown and A. W. Lohmann, “Complex Spatial Filtering with Binary Masks,” Applied Optics 5(6), 967–969 (1966).
4 M. Lucente, “The First 20 Years of Holographic Video – and the Next 20,” in SMPTE 2nd Annual International Conference on Stereoscopic 3D for Media and Entertainment, 1–16, SMPTE (2011).
5 A. Maimone, A. Georgiou, and J. S. Kollin, “Holographic near-eye displays for virtual and augmented reality,” ACM Transactions on Graphics 36(4), 1–16 (2017).
6 B. Robertson, H. Yang, M. M. Redmond, et al., “Demonstration of multi-casting in a 1 × 9 LCOS wavelength selective switch,” Journal of Lightwave Technology 32(3), 402–410 (2014).
7 R. W. Bowman, G. M. Gibson, A. Linnenberger, et al., “Red tweezers: Fast, customisable hologram generation for optical tweezers,” Computer Physics Communications 185(1), 268–273 (2014).
8 W. A. Crossland, T. D. Wilkinson, I. G. Manolis, et al., “Telecommunications applications of LCOS devices,” Molecular Crystals and Liquid Crystals Science and Technology Section A: Molecular Crystals and Liquid Crystals 375, 1–13 (2002).
9 F. Yaras, H. Kang, and L. Onural, “State of the art in holographic displays: A survey,” IEEE/OSA Journal of Display Technology 6(10), 443–454 (2010).
10 G. Nehmetallah and P. P. Banerjee, “Applications of digital and analog holography in three-dimensional imaging,” Advances in Optics and Photonics 4(4), 472 (2013).
11 J. Liu, J. Jia, Y. Pan, et al., “Overview of fast algorithm in 3D dynamic holographic display,” International Symposium on Photoelectronic Detection and Imaging 2013: Optical Storage and Display Technology 8913(August 2013), 89130X (2013).
12 T. Nishitsuji, T. Shimobaba, T. Kakue, et al., “Review of Fast Calculation Techniques for Computer-Generated Holograms with the Point-Light-Source-Based Model,” IEEE Transactions on Industrial Informatics 13(5), 2447–2454 (2017).
13 J. H. Park, “Recent progress in computer-generated holography for three-dimensional scenes,” Journal of Information Display 18(1), 1–12 (2017).
14 P. W. M. Tsang, T.-C. Poon, and Y. M. Wu, “Review of fast methods for point-based computer-generated holography [Invited],” Photonics Research 6(9), 837 (2018).
15 T. Shimobaba, T. Kakue, and T. Ito, “Review of Fast Algorithms and Hardware Implementations on Computer Holography,” IEEE Transactions on Industrial Informatics 12(4), 1611–1622 (2016).
16 T. Shimobaba and T. Ito, Computer Holography Acceleration Algorithms And Hardware Implementations, CRC Press, 1st ed. (2019).
17 D. Gabor, W. E. Kock, and W. S. George, “Holography,” Science 173(3991), 11–24 (1971).
18 S. Reichelt, R. Häussler, G. Fütterer, et al., “Full-range, complex spatial light modulator for real-time holography,” Optics Letters 37(11), 1955 (2012).
19 A. J. Macfaden and T. D. Wilkinson, “Characterization, design, and optimization of a two-pass twisted nematic liquid crystal spatial light modulator system for arbitrary complex modulation,” Journal of the Optical Society of America A 34(2), 161 (2017).
20 T. Kurihara and Y. Takaki, “Shading of a computer-generated hologram by zone plate modulation,” Optics Express 20(4), 3529 (2012).
21 R. H.-Y. Chen and T. D. Wilkinson, “Computer generated hologram with geometric occlusion using GPU-accelerated depth buffer rasterization for three-dimensional display,” Applied Optics 48(21), 4246 (2009).
22 S. Liu, H. Wei, N. Li, et al., “Occlusion calculation algorithm for computer generated hologram based on ray tracing,” Optics Communications 443(March), 76–85 (2019).
23 J. Xiao, J. Liu, Z. Lv, et al., “On-axis near-eye display system based on directional scattering holographic waveguide and curved goggle,” Optics Express 27(2), 1683 (2019).
24 Y. Nagahama, T. Shimobaba, T. Kakue, et al., “Image quality improvement of random phase-free holograms by addressing the cause of ringing artifacts,” Applied Optics 58(9), 2146 (2019).
25 G. Li, J. Jeong, D. Lee, et al., “Space bandwidth product enhancement of holographic display using high-order diffraction guided by holographic optical element,” Optics Express 23(26), 33170 (2015).
26 E. Buckley, “Computer-Generated Phase-Only Holograms for Real-Time Image Display,” Advanced Holography - Metrology and Imaging (2011).
27 J. L. Hennessy and D. A. Patterson, Computer Architecture, Fifth Edition: A Quantitative Approach, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 5th ed. (2011).
28 M. Nemirovsky and D. Tullsen, Multithreading Architecture, Synthesis lectures in computer architecture, Morgan & Claypool Publishers (2013).
29 OpenMP Architecture Review Board, “OpenMP FAQ,” (2018).
30 T. Shimobaba, J. Weng, T. Sakurai, et al., “Computational wave optics library for C++: CWO++ library,” Computer Physics Communications 183(5), 1124–1138 (2012).
31 S.-C. Kim and E.-S. Kim, “Effective generation of digital holograms of three-dimensional objects using a novel look-up table method,” Applied Optics 47(19), D55 (2008).
32 S.-C. Kim and E.-S. Kim, “Fast computation of hologram patterns of a 3D object using run-length encoding and novel look-up table methods,” Applied Optics 48(6), 1030 (2009).
33 J. Jia, Y. Wang, J. Liu, et al., “Reducing the memory usage for effective computer-generated hologram calculation using compressed look-up table in full-color holographic display,” Applied Optics 52(7), 1404 (2013).
34 G. Makey, M. S. El-Daher, and K. Al-Shufi, “Accelerating the calculations of binary detour phase method by integrating both CUDA and Matlab programming for GPU’s parallel computations,” Optik 124(22), 5486–5488 (2013).
35 P. Memmolo, L. Miccio, F. Merola, et al., “Investigation on specific solutions of Gerchberg-Saxton algorithm,” Optics and Lasers in Engineering 52(1), 206–211 (2014).
36 Z. Wang, G. Lv, Q. Feng, et al., “Highly efficient calculation method for computer-generated holographic stereogram using a lookup table,” Applied Optics 58(5), A41–A47 (2019).
37 A. Symeonidou, R. K. Muhamad, et al., “Efficient holographic video generation based on rotational transformation of wavefields,” Optics Express 27(26), 37383–37399 (2019).
38 T. Shimobaba and T. Ito, “Fast generation of computer-generated holograms using wavelet shrinkage,” Optics Express 25(1), 77 (2017).
39 T. Shimobaba, K. Matsushima, T. Takahashi, et al., “Fast, large-scale hologram calculation in wavelet domain,” Optics Communications 412(August 2017), 80–84 (2018).
40 T. Shimobaba, S. Yamada, T. Kakue, et al., “Fast hologram calculation using wavelet transform,” in Proc. SPIE 10964, Tenth International Conference on Information Optics and Photonics, 10964, 116, SPIE (2018).
41 T. Shimobaba, N. Masuda, and T. Ito, “Simple and fast calculation algorithm for computer-generated hologram with wavefront recording plane,” Optics Letters 34(20), 3133 (2009).
42 K. Matsushima and S. Nakahara, “Extremely high-definition full-parallax computer-generated hologram created by the polygon-based method,” Applied Optics 48(34), H54–H63 (2009).
43 Y. Wang, X. Sang, Z. Chen, et al., “Real-time photorealistic computer-generated holograms based on backward ray tracing and wavefront recording planes,” Optics Communications 429(July), 12–17 (2018).
44 N. Takada, T. Shimobaba, H. Nakayama, et al., “Fast high-resolution computer-generated hologram computation using multiple graphics processing unit cluster system,” Applied Optics 51(30), 7303 (2012).
45 J. Carpenter and T. D. Wilkinson, “Graphics processing unit–accelerated holography by simulated annealing,” Optical Engineering 49(9), 095801 (2010).
46 S. Yamada, T. Shimobaba, T. Kakue, et al., “Full-color computer-generated hologram using wavelet transform and color space conversion,” Optics Express 27(6), 8153 (2019).
47 P. Pozzi, L. Maddalena, N. Ceffa, et al., “Fast Calculation of Computer Generated Holograms for 3D Photostimulation through Compressive-Sensing Gerchberg–Saxton Algorithm,” Methods and Protocols 2(1), 1–11 (2019).
48 Y. Nagahama, T. Shimobaba, T. Kawashima, et al., “Holographic multi-projection using the random phase-free method,” Applied Optics 55(5), 1118 (2016).
49 D. Arai, T. Shimobaba, T. Nishitsuji, et al., “An accelerated hologram calculation using the wavefront recording plane method and wavelet transform,” Optics Communications 393(February), 107–112 (2017).
50 T. Akamatsu, R. Hirayama, H. Nakayama, et al., “Special-purpose computer HORN-8 for phase-type electro-holography,” Optics Express 26(20), 26722–26733 (2018).
51 NVIDIA, “NVIDIA Launches the World’s First Graphics Processing Unit: GeForce 256,” (1999).
52 NVIDIA, “CUDA C Programming Guide,” (2019).
53 A. HajiRassouliha, A. J. Taberner, M. P. Nash, et al., “Suitability of recent hardware accelerators (DSPs, FPGAs, and GPUs) for computer vision and image processing algorithms,” Signal Processing: Image Communication 68(July), 101–119 (2018).
54 Intel, “Intel Xeon Processor E7 Family,” (2019).
55 NVIDIA, “NVIDIA Geforce introduction,” (2019).
56 Wikipedia, “List of Nvidia graphics processing units,” (2019).
57 P. N. Glaskowsky, “NVIDIA’s Fermi: The First Complete GPU Computing Architecture [white paper],” Tech. Rep. September, NVIDIA (2009).
58 NVIDIA, “NVIDIA Turing GPU [white paper],” tech. rep., NVIDIA (2018).
59 T. Ito, T. Tanaka, T. Sugie, et al., “Computer generated holography using a graphics processing unit,” Optics Express 14(2), 603 (2006).
60 NVIDIA, “CuFFT Library User’s Guide,” Tech. Rep. May, NVIDIA (2019).
61 S. Memeti, L. Li, S. Pllana, et al., “Benchmarking OpenCL, OpenACC, OpenMP, and CUDA,” in 2017 Workshop on Adaptive Resource Management and Scheduling for Cloud Computing, 1–6, ACM (2017).
62 G. Makey, M. S. El-Daher, and K. Al-Shufi, “Modification of common Fourier computer generated hologram’s representation methods from sequential to parallel computing,” Optik 126(11-12), 1067–1071 (2015).
63 A. Hermerschmidt, S. Krüger, T. Haist, et al., “Holographic optical tweezers with real-time hologram calculation using a phase-only modulating LCOS-based SLM at 1064 nm,” Complex Light and Optical Forces II 6905(January 2008), 690508 (2008).
64 R. H.-Y. Chen and T. D. Wilkinson, “Computer generated hologram from point cloud using graphics processor,” Applied Optics 48(36), 6841 (2009).
65 A. Shiraki, N. Takada, M. Niwa, et al., “Simplified electroholographic color reconstruction system using graphics processing unit and liquid crystal display projector,” Optics Express 17(18), 16038–16045 (2009).
66 Y. Pan, X. Xu, S. Solanki, et al., “Fast CGH computation using S-LUT on GPU,” Optics Express 17(21), 18543 (2009).
67 H. Nakayama, N. Takada, Y. Ichihashi, et al., “Real-time color electroholography using multiple graphics processing units and multiple high-definition liquid-crystal display panels,” Applied Optics 49(31), 5993 (2010).
68 T. Shimobaba, T. Ito, N. Masuda, et al., “Fast calculation of computer-generated-hologram on AMD HD5000 series GPU and OpenCL,” (2010).
69 S. Bianchi and R. Di Leonardo, “Real-time optical micro-manipulation using optimized holograms generated on the GPU,” Computer Physics Communications 181(8), 1444–1448 (2010).
70 P. Tsang, W.-K. Cheung, T.-C. Poon, et al., “Holographic video at 40 frames per second for 4-million object points,” Optics Express 19(16), 15205 (2011).
71 Y. Pan, X. Xu, and X. Liang, “Fast distributed large-pixel-count hologram computation using a GPU cluster,” Applied Optics 52(26), 6562 (2013).
72 P. W. M. Tsang, A. S. M. Jiao, and T.-C. Poon, “Fast conversion of digital Fresnel hologram to phase-only hologram based on localized error diffusion and redistribution,” Optics Express 22(5), 5060 (2014).
73 F. Yang, A. Kaczorowski, and T. D. Wilkinson, “Fast precalculated triangular mesh algorithm for 3D binary computer-generated holograms,” Applied Optics 53(35), 8261 (2014).
74 F. Yang, A. Kaczorowski, and T. D. Wilkinson, “Enhancing the quality of reconstructed 3D objects by using point clusters,” Applied Optics 54(18), 5726 (2015).
75 M.-W. Kwon, S.-C. Kim, S.-E. Yoon, et al., “Object tracking mask-based NLUT on GPUs for real-time generation of holographic videos of three-dimensional scenes,” Optics Express 23(3), 2101 (2015).
76 A. Gilles, P. Gioia, R. Cozot, et al., “Hybrid approach for fast occlusion processing in computer-generated hologram calculation,” Applied Optics 55(20), 5459–5470 (2016).
77 D.-W. Kim, Y.-H. Lee, and Y.-H. Seo, “High-speed computer-generated hologram based on resource optimization for block-based parallel processing,” Applied Optics 57(16), 4569 (2018).
78 S. Ikawa, N. Takada, H. Araki, et al., “Real-time color holographic video reconstruction using multiple-graphics processing unit cluster acceleration and three spatial light modulators,” Chinese Optics Letters 18(1), 1–5 (2020).
79 M. Lucente and T. A. Galyean, “Rendering interactive holographic images,” in SIGGRAPH ’95 Proceedings of the 22nd annual conference on Computer graphics and interactive techniques, 387–394, SIGGRAPH (1995).
80 C. Petz and M. Magnor, “Fast hologram synthesis for 3D geometry models using graphics hardware,” Practical Holography XVII and Holographic Materials IX 5005(June 2003), 266 (2003).
81 L. Ahrenberg and J. Watson, “Computer generated holography using parallel commodity graphics hardware,” Optics Express 5664(17), 603–608 (2006).
82 Xilinx, “Reduce Power and Cost by Converting from Floating Point to Fixed Point Introduction,” White Paper: Floating vs Fixed Point, 1–14 (2017).
83 M. D. Ciletti, Advanced Digital Design with the Verilog HDL, Prentice Hall Press, Upper Saddle River, NJ, USA, 2nd ed. (2010).
84 R. Solovyev, A. Kustov, V. Rukhlov, et al., “Fixed-Point Convolutional Neural Network for Real-Time Video Processing in FPGA,” in 2019 IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (EIConRus), 1605–1611, IEEE (2019).
85 T. Shimobaba and T. Ito, “An efficient computational method suitable for hardware of computer-generated hologram with phase computation by addition,” Computer Physics Communications 138(1), 44–52 (2001).
86 T. Shimobaba, S. Hishinuma, and T. Ito, “Special-purpose computer for holography HORN-4 with recurrence algorithm,” Computer Physics Communications 148(2), 160–170 (2002).
87 Y.-H. Seo, H.-J. Choi, J.-S. Yoo, et al., “Cell-based hardware architecture for full-parallel generation algorithm of digital holograms,” Optics Express 19(9), 8750 (2011).
88 Y.-H. Seo, Y.-H. Lee, J.-S. Yoo, et al., “Hardware architecture of high-performance digital hologram generator on the basis of a pixel-by-pixel calculation scheme,” Applied Optics 51(18), 4003–4012 (2012).
89 Z.-Y. Pang, Z.-X. Xu, Y. Xiong, et al., “Hardware architecture for full analytical Fraunhofer computer-generated holograms,” Optical Engineering 54(9), 095101 (2015).
90 H. Kim, Y. Kim, H. Ji, et al., “A single-chip FPGA holographic video processor,” IEEE Transactions on Industrial Electronics 66(3), 2066–2073 (2019).
91 E. Buckley, “Real-time error diffusion for signal-to-noise ratio improvement in a holographic projection system,” IEEE/OSA Journal of Display Technology 7(2), 70–76 (2011).
92 Y. H. Seo, H. J. Choi, J. S. Yoo, et al., “An architecture of a high-speed digital hologram generator based on FPGA,” Journal of Systems Architecture 56(1), 27–37 (2010).
93 T. Sugie, T. Akamatsu, T. Nishitsuji, et al., “High-performance parallel computing for next-generation holographic imaging,” Nature Electronics 1(4), 254–259 (2018).
94 K. Pauwels, M. Tomasi, J. Díaz Alonso, et al., “A Comparison of FPGA and GPU for real-time phase-based optical flow, stereo, and local image features,” IEEE Transactions on Computers 61(7), 999–1012 (2012).
95 Y.-H. Seo, Y.-H. Lee, and D.-W. Kim, “ASIC chipset design to generate block-based complex holographic video,” Applied Optics 56(9), D52 (2017).
96 T. Ito, N. Masuda, K. Yoshimura, et al., “Special-purpose computer HORN-5 for a real-time electroholography,” Optics Express 13(6), 1923–1932 (2005).
97 Y. Abe, N. Masuda, H. Wakabayashi, et al., “Special purpose computer system for flow visualization using holography technology,” Optics Express 16(11), 587–592 (2008).
98 S.-i. Satake, Y. Hiroi, Y. Suzuki, et al., “Special-purpose computer for two-dimensional FFT,” Computer Physics Communications 179(6), 404–408 (2008).
99 Y. Ichihashi, N. Masuda, M. Tsuge, et al., “One-unit system to reconstruct a 3-D movie at a video-rate via electroholography,” Optics Express 17(22), 19691 (2009).
100 A. Shiraki, Y. Ichihashi, H. Nakayama, et al., “HORN-6 special-purpose clustered computing system for electroholography,” Optics Express 17(16), 13895 (2009).
101 N. Okada, D. Hirai, Y. Ichihashi, et al., “Special-Purpose Computer HORN-7 with FPGA Technology for Phase Modulation Type Electro-Holography,” in 19th International Display Workshops 2012 (IDW/AD’12), 1284–1287, Society for Information Display (2012).
102 N. Masuda, E. Yutaka, K. Takashi, et al., “Special Purpose Computer for Phase Modulation Type Electro-Holography with DVI output [in Japanese],” in International Conference on 3D Systems and Applications, 373–374 (2013).
103 Y. Yamamoto, H. Nakayama, N. Takada, et al., “Large-scale electroholography by HORN-8 from a point-cloud model with 400,000 points,” Optics Express 26(26), 34259 (2018).
104 T. Ito, T. Yabe, M. Okazaki, et al., “Special-purpose computer HORN-1 for reconstruction of virtual image in three dimensions,” Computer Physics Communications 82(2-3), 104–110 (1994).
105 T. Ito, H. Eldeib, K. Yoshida, et al., “Special-purpose computer for holography HORN-2,” Computer Physics Communications 93(1), 13–20 (1996).
106 T. Shimobaba, N. Masuda, T. Sugie, et al., “Special-purpose computer for holography HORN-3 with PLD technology,” Computer Physics Communications 130(1), 75–82 (2000).
107 T. Nishitsuji, Y. Yamamoto, T. Sugie, et al., “Dedicated computer for computer holography and its future outlook,” in Proceedings Volume 10997, Three-Dimensional Imaging, Visualization, and Display 2019, 16, SPIE (2019).
108 O. Nishikawa, T. Okada, H. Yoshikawa, et al., “High-speed holographic-stereogram calculation method using 2D FFT,” in Diffractive and Holographic Device Technologies and Applications IV, 3010, 49–57, SPIE (1997).
109 R. Oi, T. Mishina, and K. Yamamoto, “Real-time IP–hologram conversion hardware based on floating point DSPs,” in Proc. SPIE 7233, Practical Holography XXIII: Materials and Applications, 723305, SPIE (2009).
110 Texas Instruments, “TMS320C6678 Multicore Fixed and Floating-Point Digital Signal Processor Datasheet,” Tech. Rep. November 2010, Texas Instruments (2014).
111 Intel, “Intel Xeon Phi Coprocessor,” (2019).
112 J. Fang, H. Sips, L. Zhang, et al., “Test-Driving Intel Xeon Phi,” in ICPE ’14 Proceedings of the 5th ACM/SPEC international conference on Performance engineering, 137–148, ACM (2014).
113 OpenMP Architecture Review Board, “OpenMP Official Website,” (2019).
114 K. Murano, T. Shimobaba, A. Sugiyama, et al., “Fast computation of computer-generated hologram using Xeon Phi coprocessor,” Computer Physics Communications 185(10), 2742–2757 (2014).
115 N. Tanabe, Y. Ichihashi, H. Nakayama, et al., “Speed-up of hologram generation using ClearSpeed Accelerator board,” Computer Physics Communications 180(10), 1870–1873 (2009).
116 G. Kyriazis, “Heterogeneous system architecture: A technical review,” tech. rep., AMD (2012).
117 Y. Yamamoto, N. Masuda, R. Hirayama, et al., “Special-purpose computer for electroholography in embedded systems,” OSA Continuum 2(4), 1166 (2019).
118 NVidia Corporation, “Jetson TX1 Module.”
119 R. W. Gerchberg and W. O. Saxton, “A practical algorithm for the determination of phase from image and diffraction plane pictures,” Optik 35, 237–246 (1972).
120 P. W. Mash and T. D. Wilkinson, “Realtime hologram generation using iterative methods,” in Proc. SPIE 6252, Holography 2005: International Conference on Holography, Optical Recording, and Processing of Information, 62521O–1–62521O–9, SPIE (2006).
121 D. Blinder, A. Ahar, S. Bettens, et al., “Signal processing challenges for digital holographic video display systems,” Signal Processing: Image Communication 70(October 2018), 114–130 (2019).
122 A. Shan, “Heterogeneous processing: A strategy for augmenting Moore’s law,” Linux J. 2006, 7– (2006).
123 ARM Ltd., “White Paper: big.LITTLE Technology: The Future of Mobile,” tech. rep., ARM Ltd. (2013).
124 NVidia Corporation, “Harness AI at the Edge with the Jetson TX2 Developer Kit,” (2019).
125 NVidia Corporation, “Jetson AGX Xavier Developer Kit,” (2019).
126 NVIDIA, “Jetson NANO Module,” (2019).
127 NVidia Corporation, “Jetson Xavier NX introduction page,” (2019).
128 ARM Ltd., “Neon Architecture,” (2019).
129 NVIDIA, “NVIDIA Brings CUDA to Arm, Enabling New Path to Exascale Supercomputing,” (2019).
130 NVIDIA, “NVIDIA Announces CUDA-X HPC,” (2019).
131 Xilinx Inc., “UG1393: Vitis Unified Software Platform Documentation: Application Acceleration Development,” tech. rep. (2019).
132 NVIDIA, “Jetson TX2 introduction page,” (2019).
Youchao Wang is currently pursuing the Ph.D. degree in engineering at the University of Cambridge, Cambridge, U.K., where he previously received the M.Phil. degree. He received the B.Eng. (Hons.) degree in electronic engineering from the University of Manchester and the B.Eng. degree in electrical and electronic engineering from North China Electric Power University, Beijing, China, under a joint degree program in 2018. His research interests include optical processing, CGH hardware implementations, Internet of Things applications and low cost hardware system design.
Daoming Dong received the B.Eng. degree with first class in electronics from a joint program between the University of Liverpool (UoL) and Xi’an Jiaotong Liverpool University (XJTLU) in 2016. He then moved to Imperial College London, where he received the M.Sc. degree with distinction in Material Science and Engineering in 2017. He is currently a second-year PhD student under the supervision of Prof. Tim Wilkinson in the Centre of Molecular Materials for Photonics and Electronics (CMMPE) group, Department of Engineering, Cambridge University, Cambridge. His research focuses on accelerating and optimizing the generation of computer-generated holograms (CGH) via configurable hardware for next-generation 3D holographic displays.
Peter J. Christopher graduated from Bristol University in 2014 with a first class M.Eng. degree in Civil Engineering, before spending two years working as a Software/R&D Engineer for Autodesk’s Advanced Manufacturing Group, focusing on additive manufacture and 3D printing. Looking for a new challenge, he joined the CDT in Ultra Precision based at the Institute for Manufacturing at Cambridge University in 2016 and is currently working with Prof. Tim Wilkinson in the CMMPE group on high-power areal projection systems for additive manufacture.
Andrew Kadis graduated from the University of Adelaide, Australia, in 2010 with a first class degree in Engineering and a bachelor’s degree in Computer Science. Before commencing his PhD studies in 2018, he gained considerable experience in industry, working on embedded systems in drones, medical devices and life sciences equipment. He is currently at Cambridge University working with Prof. Tim Wilkinson in the CMMPE group.
Ralf Mouthaan obtained a Physics MSci degree from the University of Nottingham in 2008 before
joining the UK’s National Physical Laboratory as a microwave metrologist, where his research
was focused on maintaining and developing the UK’s electromagnetic exposure standards. More
recently, Ralf has obtained an MRes in Sensor Technologies from the University of Cambridge,
where he is now pursuing a PhD investigating holographic mode excitation in optofluidic waveguides.
Fan Yang received the B.Eng. degree with first class from the University of Sydney in 2017 and
joined the CMMPE group, Department of Engineering, Cambridge University, as an MPhil student
in 2018. He is continuing his research in the CMMPE group as a PhD student under the supervision
of Prof. Tim Wilkinson to develop a compatible and efficient holographic 3D display system.
Timothy D. Wilkinson received the undergraduate degree from Canterbury University, Riccarton,
New Zealand, and the Ph.D. degree from Magdalene College, Cambridge, U.K., in 1994. He is
currently a Professor of photonic engineering in the Department of Engineering, Cambridge
University, Cambridge, and a Fellow of Jesus College. He has been working in the field of photonics,
devices, and systems for more than 20 years. His current research is into applications of
holographic technology. This includes new liquid crystal device structures based on sparse arrays
of vertically grown multiwall carbon nanotubes, where the tubes are used as tiny electrodes to create
3-D electric field profiles and graded refractive index structures, which may have applications such
as switchable lenslet arrays and 3-D displays.
List of Figures
1 A typical system for computer generated holography consisting of three main components:
a light source, a computer or hardware platform for interference pattern
calculation and a device to display the hologram.4
2 A classical optical imaging system.
3 A holographic imaging system for hologram recording.
4 A holographic projection system for hologram reconstruction.
5 A typical parallel pipeline overview of an NVIDIA GPU consisting of streaming
multiprocessors each containing a number of cores and functional units.
6 A typical architecture of an FPGA, which consists of logic cells, I/O ports, DSP
blocks, block memory, etc.
7 A comparison of CPU, GPU, FPGA and other hardware implementations in recent
existing literature (2008-2020).
8 Different levels of parallelism and concurrency on hardware platforms. The left-hand
side depicts the CPU instruction pipeline and processor-level parallelization,
whereas the right-hand side shows the FPGA parallelization at the equivalent levels.
Hardware clustering at the board level further exploits parallelism.
List of Tables
1 Tools and utilities employed for CPU implementations since 2008
2 Microarchitectures since 2008 and their representative flagship GPU products (SP:
single precision floating point)56
3 A summary of CGH implemented on GPUs since 2008
4 Tools and utilities reported for FPGA implementations in recent years
5 A summary of CGH implemented on FPGAs since 2008
6 General comparison between CPU, GPU, FPGA, DSP and other platforms
7 NVIDIA Jetson module product family
