The real-time synthesis of 3D spatial audio has many applications, from virtual reality to navigation for the visually-impaired. Head-related transfer functions (HRTFs) can be used to generate spatial audio based on a model of the user's head. Previous studies have focused on the creation and interpolation of these functions with little regard for real-time performance. In this paper, we present an FPGA-based platform for real-time synthesis of spatial audio using FIR filters created from head-related transfer functions. For performance reasons, we run filtering, crossfading, and audio output on FPGA fabric, while calculating audio source locations and storing audio files on the CPU. We use a head-mounted 9-axis IMU to track the user's head in real time and adjust relative spatial audio locations to create the perception that audio sources are fixed in space. Our system, running on a Xilinx Zynq Z-7020, is able to support 4X more audio sources than a comparable GPU and 8X more sources than a CPU while maintaining sub-millisecond latency and comparable power consumption. Furthermore, we show how our system can be leveraged to communicate the location of landmarks and obstacles to a visually-impaired user during a sailing race or other navigation scenario. We test our system with multiple users and show that, as a result of our reduced latency, a user is able to locate a virtual audio source with an extremely high degree of accuracy and navigate toward it.
INTRODUCTION & MOTIVATION
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. FPGA '20, February 23-25, 2020, Seaside, CA, USA © 2020 Association for Computing Machinery. ACM ISBN 978-1-4503-7099-8/20/02...$15.00 https://doi.org/10.1145/3373087.3375323

Visually-impaired individuals frequently use their ability to determine the physical location of a sound source as a means of navigation. Blind-sailing [12] is a sport enjoyed by many visually-impaired people. There is a community of sighted guides, coaches, sailing-center personnel, and other volunteers who enjoy supporting blind sailors at all levels, from beginners sailing recreationally to skippers sailing in international blind-sailing regattas. Recently, a new type of support for blind sailing has emerged through increasingly advanced technology. In 1996, Alessandro Gaoso invented the Homerus Autonomous Sailing System, a technology that allowed blind sailors to compete without sighted human guides [1].
Twenty years later, students and faculty at Olin College began working with competitive blind-sailors to improve the Homerus system. Olin's initial work included [10] and [11], upon which this work is based.
We have recently produced two new technologies to develop a safer, easier system for blind-sailors to use. The first is an Android app, a new community-centered technology built to be used on repurposed Android phones. The app uses high-precision GPS sensors and provides text-to-speech output guiding the sailors. In keeping with the partnership between blind-sailing technologists and blind-sailors, this technology was tested successfully by blind-sailors at different levels of sailing under Blind Sail SF Bay, which organizes many blind-sailing events in the San Francisco Bay Area.
The second, described in this paper, builds upon this work to develop a better method of output to direct the sailors. While the text-to-speech output of the Android app is effective, perceiving the apparent location of a sound source in 3D space is much more natural and easier to learn, and can be used for many applications, including pure-audio virtual reality experiences. The system described in this paper builds upon the Android app by adding real-time 3D spatial audio synthesis, allowing a user to wear stereo headphones and perceive the locations of waypoints simply by listening for the direction from which the sound appears to come.
Synthesis of spatial audio is generally done through the use of head-related transfer functions (HRTFs), used not only for headphone-based spatial audio but also for artificial stereo enhancements in music players and home audio systems [4, 13]. Although a purely HRTF-based approach is effective for pre-generating spatial audio for large sound systems, a realistic spatial audio effect can only be attained through real-time head tracking and quick updates in the synthesized location of the audio source, using a low-latency IMU (Inertial Measurement Unit) system. A short delay between head movements and the corresponding updates in the audio is necessary to ensure the system is realistic and easy to use.
In the following sections an overview of related work is given, followed by the fundamentals of spatial audio synthesis. The design of our FPGA-based system for real-time audio synthesis is then presented, followed by performance benchmarks and test results for audio localization. The paper finishes with a discussion of results, conclusions, and potential applications of our system.
BACKGROUND
2.1 Related Work
There has been significant interest in the creation of stereo spatial audio recordings through specialized multi-microphone configurations [2]. While useful for creation of one-time recordings, a dynamic, HRTF-based approach has been shown to be necessary for real-time synthesis of spatial audio. Specifically, a pair of Finite Impulse Response (FIR) filters created from the HRTF are used to process an audio sample to create the left and right stereo channels to output to the user [3]. Head-related impulse responses (HRIR) have been measured for a variety of head shapes by projects such as the MIT Media Lab [7] and Listen HRTF Database [15] (note that although these projects claim to measure the transfer function, they are actually only measuring the impulse response). These measurements are then used for digital filter design for the creation of spatial audio synthesis systems such as the one created in [3].
Prior attempts at real-time spatial audio have either relied on high-latency batch processing on CPUs and GPUs with high cost and power consumption [6], or used embedded platforms but with a notable limit on the number of audio sources and a low synthesis quality, as well as a high update latency [5].
Spatial Audio Synthesis
Humans use various properties of a sound waveform to determine the spatial location of the sound source. Three of the most significant properties are:
• Interaural time difference (ITD): the difference in the arrival time of the sound at the two ears
• Interaural level difference (ILD): the difference in the sound level at the two ears
• Spectral cues: the attenuation of different frequencies at different head elevation (pitch) angles
The ITD and ILD depend primarily on the difference in distance from the source to each ear, creating a so-called "cone of confusion" along which all points produce nearly the same ITD and ILD. Locating the sound source along this cone relies primarily on spectral cues, making it crucial to include not only the azimuthal (yaw) angle but also the elevation angle of the head when synthesizing spatial audio. These properties are all encoded in the Head-Related Impulse Response (HRIR) for a given azimuth and elevation angle (the HRIR is measured using a specialized microphone setup, see [7] for more detail). In addition, listeners can move their head to resolve ambiguous sound locations.
For purposes of spatial audio synthesis, FIR filters [14] are designed from the measured HRIRs and used to process a given sound signal to generate a left/right stereo sound pair which, when played through stereo headphones, gives the listener the illusion of the sound source being placed at a given location. In order for the source location to appear fixed, the user's head must be tracked in real time and the FIR filter coefficients adjusted to compensate for the relative angle of the source audio to the user's head.
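As a minimal illustration of the paragraph above, the following Python sketch filters a mono signal with a left/right HRIR pair to produce a stereo pair. The function name `render_stereo` and the toy impulse responses are assumptions made for this example; they are not measured HRIRs and not the system's implementation.

```python
import numpy as np

def render_stereo(mono, hrir_left, hrir_right):
    """Filter a mono signal with a left/right HRIR pair (plain FIR filtering)."""
    left = np.convolve(mono, hrir_left)[: len(mono)]
    right = np.convolve(mono, hrir_right)[: len(mono)]
    return np.stack([left, right], axis=1)  # shape: (samples, 2)

# Toy HRIRs (hypothetical): the right ear hears the click 3 samples later
# and at half the level, which mimics an ITD and an ILD.
hrir_l = np.zeros(8)
hrir_l[0] = 1.0
hrir_r = np.zeros(8)
hrir_r[3] = 0.5

click = np.zeros(16)
click[0] = 1.0
stereo = render_stereo(click, hrir_l, hrir_r)
```

With real HRIRs, the delay and level differences between the two channels, along with the spectral shaping, are what produce the illusion of a source location.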
FIR Filter Design
FIR filters, which are commonly used for digital audio processing, consist of a linear convolution over a window of the past N samples (for the purposes of spatial audio synthesis, N is usually in the range 512-2048). The output of an FIR filter can be computed using a single multiply-accumulate (MAC) block over N FPGA cycles. It also requires enough memory for N filter coefficients and a shift register (or equivalent) holding the last N input values. We can design an FIR filter from the HRIR using standard DSP design techniques in the time domain. Because all measurements are made in the time domain, the ITD is preserved through the filter.
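The single-MAC computation described above can be sketched in software as one multiply-accumulate per loop iteration over a shift register of the last N inputs. This is an illustrative model only; the names `fir_mac` and `fir_stream` are hypothetical, and the integer weights are arbitrary.

```python
import numpy as np

def fir_mac(weights, history):
    """One output sample; history[k] is the input from k samples ago."""
    acc = 0
    for w, x in zip(weights, history):  # one multiply-accumulate per "cycle"
        acc += int(w) * int(x)
    return acc

def fir_stream(weights, samples):
    """Run the filter over a stream, keeping a shift register of N inputs."""
    history = [0] * len(weights)
    out = []
    for s in samples:
        history = [s] + history[:-1]  # shift in the newest sample
        out.append(fir_mac(weights, history))
    return out

w = [3, -2, 1, 4]
x = [1, 2, 3, 4, 5, 6]
y = fir_stream(w, x)
```

The result matches a truncated linear convolution of the input with the weights, which is the definition of FIR filtering.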
HRIR Interpolation and Crossfading
As the measurement of HRIRs for different azimuth and elevation angles is time-consuming and requires a lot of storage space, only the HRIRs for a fixed set of angles are measured, and various interpolation methods are used for synthesizing audio signals at angles that have not been pre-measured. There are two common methods of interpolation: coefficient interpolation, where the coefficients of the 4 nearest FIR filters are interpolated using a weighted sum, and crossfade interpolation, where the 4 nearest FIR filters are all executed and their outputs are crossfaded using a weighted sum. While the crossfade is more computationally expensive, experiments showed that it exhibited fewer noticeable artifacts when memory updates were delayed. Additionally, the crossfade is more efficient to pipeline than the coefficient interpolation. For these reasons, we focus primarily on the use of crossfade interpolation in this work.
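The two interpolation methods can be compared with a short sketch. The filters, gains, and signal below are random stand-ins (assumptions for illustration, not HRIR data). For fixed weights the two methods give the same output, because convolution is linear; the sketch does not model weight updates arriving mid-stream, which is where the artifact differences reported above would appear.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64
filters = rng.standard_normal((4, N))   # stand-ins for the 4 nearest FIR filters
gains = np.array([0.4, 0.3, 0.2, 0.1])  # hypothetical interpolation weights (sum to 1)
signal = rng.standard_normal(256)

def coefficient_interpolation(signal, filters, gains):
    """Blend the coefficients first, then run a single FIR filter."""
    blended = gains @ filters
    return np.convolve(signal, blended)[: len(signal)]

def crossfade_interpolation(signal, filters, gains):
    """Run all four FIR filters, then blend their outputs."""
    outputs = np.array([np.convolve(signal, f)[: len(signal)] for f in filters])
    return gains @ outputs

a = coefficient_interpolation(signal, filters, gains)
b = crossfade_interpolation(signal, filters, gains)
```

The crossfade costs four filter evaluations instead of one, but each filter's weights stay fixed while only the four scalar gains change, which is friendlier to a pipelined design.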
SYSTEM DESIGN
3.1 Requirements
For the purposes of spatial audio synthesis for blind sailing / navigation and other VR applications, we enforce various constraints on our system:
• Audio sampling rate of 44.1 kHz with 16-bit sample precision
• Audio synthesis with ≤ 1 ms latency
• Real-time head tracking with ≤ 5 ms latency
• Ability to use HRTF-derived FIR filters with 512-2048 taps for spatialization of the sound sources
• Ability to synthesize multiple unique sound sources with distinct locations and motion paths
• Low-cost embedded platform with low power consumption
Hardware Platform
To meet the requirements defined above, we use a specific set of hardware (chosen especially to minimize power, cost, and size):
• Our system is based on an Arty Z7-20 evaluation board with a Zynq Z-7020 SoC. It was chosen because it is a powerful SoC that is still small enough for embedded use and has fairly low power consumption. The CPU is an ARM Cortex-A9 with an AXI4 bus of up to 1024-bit width for data transfer. In addition, the evaluation board has a DDR3 RAM interface and breakout connections which can be used for audio and sensor interfaces.
• Head tracking is done using the MPU9250 [9] (which consists of a 9-axis IMU and a sensor-fusion processor) with an external update rate of up to 200 Hz. The MPU9250 is installed on the user's headphones so that its movements accurately reflect the user's head orientation. It is connected to the ARM Cortex-A9 CPU through an I2C bus.
• We use a PCM5102A [8] as a stereo DAC. It is driven by an I2S signal (carrying 44.1 kHz audio for the left and right channels) generated by the FPGA from the final output of the audio synthesis pipeline, and its output is connected to the input of the stereo headphones worn by the user.

Session: Applications I FPGA '20, February 23-25, 2020, Seaside, CA, USA
System Architecture
Our system consists of several different components, split across the Cortex-A9 CPU and the FPGA fabric. Figure 1 depicts an overview of our system. Note that the "External Controller" controls the sound sources and locations, and transmits them to the CPU to begin the spatial audio synthesis process.
FIR Filter Weight Selection.
We design an FIR filter for each side of the head for each orientation (θ, Φ) that is available in the HRIR dataset (where θ is azimuth angle and Φ is elevation angle). The dataset we use has a total of 710 unique orientation pairs, of which each has a 2048-sample long, 16-bit precision HRIR for the left and right sides. This yields 1420 unique FIR filters, each having 2048 fixed-point weights of 2 bytes each. A total of 5.5 MiB of memory is required to store all of the filters. As there is not enough block memory (BRAM) on the FPGA to store all of these filter weights, we load them into the DDR3 RAM on startup and move them onto the FPGA through AXI4 as needed.
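The storage figure above can be checked with a few lines of arithmetic. The variable names are chosen for this example; the numbers come from the paragraph above.

```python
orientations = 710      # unique (azimuth, elevation) pairs in the dataset
ears = 2                # left and right
taps = 2048             # weights per FIR filter
bytes_per_weight = 2    # 16-bit fixed point

n_filters = orientations * ears
total_bytes = n_filters * taps * bytes_per_weight
total_mib = total_bytes / 2**20
```

This gives 1420 filters and 5,816,320 bytes, or roughly 5.5 MiB, well above the 630 KiB of block memory reported for the Z-7020 later in the paper.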
Given a head orientation (θ, Φ), it is improbable that it is contained in the pre-computed set of FIR filters, so we use a crossfade interpolation scheme where the 4 nearest filter weights are selected and the filter outputs are then averaged using a weighted average based on the angle difference. The following equations demonstrate the crossfade interpolation for a single sound source (based on common HRIR datasets, we assume $\Phi_{TL} = \Phi_{TR}$ and $\Phi_{BL} = \Phi_{BR}$, but do not assume $\theta_{TL} = \theta_{BL}$). We define $W$ as the FIR filter weights, the abbreviations $TL$, $TR$, $BL$, and $BR$ for top-left, top-right, bottom-left, and bottom-right respectively, $H$ as the relative orientation of the user's head, and $S$ as the sound sample.
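The equations themselves are not available in this text. As a hedged sketch only, one plausible bilinear form that is consistent with the definitions above (equal elevation within the top pair and within the bottom pair, possibly different azimuths between top and bottom) is shown below. The symbols $\alpha_T$, $\alpha_B$, $\beta$, and $y$ are introduced for this illustration and are assumptions, as is writing the head orientation as $H = (\theta_H, \Phi_H)$; $*$ denotes convolution.

```latex
\begin{aligned}
\alpha_T &= \frac{\theta_H - \theta_{TL}}{\theta_{TR} - \theta_{TL}}, \qquad
\alpha_B = \frac{\theta_H - \theta_{BL}}{\theta_{BR} - \theta_{BL}}, \qquad
\beta = \frac{\Phi_H - \Phi_{BL}}{\Phi_{TL} - \Phi_{BL}} \\[4pt]
y &= \beta \left[ (1 - \alpha_T)\,(W_{TL} * S) + \alpha_T\,(W_{TR} * S) \right]
   + (1 - \beta) \left[ (1 - \alpha_B)\,(W_{BL} * S) + \alpha_B\,(W_{BR} * S) \right]
\end{aligned}
```

In this form the four scalar gains depend only on the angles, so they can be computed ahead of the filtering, while the four convolutions and the final weighted sum operate on the audio stream.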
This process is then repeated using the weight dataset for the other ear. As part of the weight selection, the eight relevant FIR filter weight sets ($W_{TL}$, $W_{BL}$, etc.) for each sound source are chosen and sent over AXI to the FPGA. In addition, the fractional coefficients for each filter weight are calculated, converted to fixed-point, and sent to the FPGA as well for real-time averaging of the outputs (the operations in blue are pre-computed on the CPU, while the rest are computed on the FPGA in real-time).
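The CPU-side calculation of the fractional coefficients might look like the following sketch. It assumes a bilinear weighting over the four neighbouring orientations and a Q15 fixed-point format; the source states only that the coefficients are converted to fixed-point, so both choices, and the names `crossfade_gains` and `to_q15`, are hypothetical.

```python
def crossfade_gains(theta_h, phi_h, tl, tr, bl, br):
    """Return the four crossfade gains for head orientation (theta_h, phi_h).

    tl, tr, bl, br are (theta, phi) tuples of the four nearest measured
    orientations. Assumes a bilinear weighting, which is an illustration.
    """
    a_top = (theta_h - tl[0]) / (tr[0] - tl[0])
    a_bot = (theta_h - bl[0]) / (br[0] - bl[0])
    b = (phi_h - bl[1]) / (tl[1] - bl[1])
    return {
        "TL": b * (1 - a_top),
        "TR": b * a_top,
        "BL": (1 - b) * (1 - a_bot),
        "BR": (1 - b) * a_bot,
    }

def to_q15(x):
    """Convert a value in [-1, 1) to signed Q15 (assumed format), saturating."""
    return max(-32768, min(32767, round(x * 32768)))

gains = crossfade_gains(12.0, 4.0,
                        tl=(10.0, 10.0), tr=(20.0, 10.0),
                        bl=(8.0, 0.0), br=(16.0, 0.0))
gains_q15 = {name: to_q15(g) for name, g in gains.items()}
```

The four gains sum to one, so the crossfaded output stays at the same overall level as the individual filter outputs.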
We use an AXI data width of 1024 bits, which in S2MM mode yields around 75% of maximum throughput [16]: an average of 96 bytes per clock cycle. Given this memory bandwidth, the number of clock cycles required for a full replacement of FIR filter weights can be calculated by dividing the total size of the weight sets by this throughput.
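A rough estimate of the transfer cost can be sketched as follows. It assumes that a full replacement means all eight weight sets of one sound source, with 2-byte weights and no additional burst or setup overhead; these assumptions and the function name `reload_cycles` are illustrative and are not taken from the source.

```python
import math

AXI_WIDTH_BITS = 1024
EFFICIENCY = 0.75                                      # fraction of peak throughput, from the text
BYTES_PER_CYCLE = (AXI_WIDTH_BITS // 8) * EFFICIENCY   # 96 bytes per clock cycle

def reload_cycles(taps, filters_per_source=8, bytes_per_weight=2, sources=1):
    """Estimated clock cycles to replace every weight set for `sources` sources."""
    total_bytes = taps * filters_per_source * bytes_per_weight * sources
    return math.ceil(total_bytes / BYTES_PER_CYCLE)

cycles_2048 = reload_cycles(2048)
cycles_512 = reload_cycles(512)
```

Under these assumptions the cost scales linearly with the filter length and with the number of sources whose weights change.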
3.3.2 Audio Buffering. In addition to the FIR filters, the source audio to be spatialized must also be transferred and stored. The audio source is received from either an external controller (a mobile device for the blind sailing application, or a desktop computer for VR applications) or from the persistent storage attached to the ARM Cortex-A9 CPU. As there are data transfer latencies for externally received audio, the audio data is transmitted in chunks and buffered on the CPU, which transfers one sample (one 16-bit value) from each sound source over AXI every 22.676 µs (a rate of 44.1 kHz).
On the FPGA, the processing elements run in parallel and each processes one sound source. Each processing element contains an overwriting circular buffer, stored in Block RAM, which holds the last N audio samples at all times.
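The behaviour of an overwriting circular buffer can be modelled in a few lines. This is a software sketch with hypothetical names (`CircularBuffer`, `newest_first`), not the Block RAM implementation.

```python
class CircularBuffer:
    """Fixed-size buffer that always holds the last n samples."""

    def __init__(self, n):
        self.data = [0] * n
        self.head = 0  # index of the slot that will be overwritten next

    def push(self, sample):
        """Overwrite the oldest sample with the newest one."""
        self.data[self.head] = sample
        self.head = (self.head + 1) % len(self.data)

    def newest_first(self):
        """Samples ordered from most recent to oldest, as an FIR filter reads them."""
        n = len(self.data)
        return [self.data[(self.head - 1 - k) % n] for k in range(n)]

buf = CircularBuffer(4)
for s in [1, 2, 3, 4, 5, 6]:
    buf.push(s)
recent = buf.newest_first()
```

Only one memory location is written per incoming sample, which is why this structure suits block memory better than a shift register that moves every element.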
FIR Filter Block.
The second part of the processing element (PE) is the FIR Filter Block. This block consists of eight FIR filters, all computed in parallel: TL, TR, BL, and BR for each of the left and right ears. Each of the eight filters is computed using a MAC block on an FPGA DSP slice. All eight MAC blocks take the same audio sample as input (from the circular buffer), but use eight different sets of filter weights (which are stored in the weight cache, also located in BRAM). At the end of the filter computation (which takes N clock cycles for a filter length of N), the outputs of the four filters for each ear are crossfaded together as per the equations described in the section "FIR Filter Weight Selection." This yields a single output for each of the left and right ears for that sound source.
Across all of the PEs, a total of M outputs (one per sound source) are generated for each ear.
Adder Tree.
In order to combine multiple sound sources, an adder tree is used to sum the M outputs for each ear. Optionally, a scaling factor (transferred from the CPU along with the audio source locations) can be applied to each sound source to convey its distance. Generally, this scaling should be inversely proportional to the distance from the sound source, as per standard acoustics. However, many applications (especially navigation) may prefer a sub-linear scaling such that the sound is audible from long distances without being unbearably loud at shorter distances. This is one of the major advantages of our system over existing blind sailing systems, which use loud sirens with a linear relationship between volume and distance.
The output of the adder tree is two single-frame audio samples, one for the left ear and one for the right. 
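The summation and optional distance scaling can be sketched as follows. The pairwise reduction mirrors how an adder tree is organized in hardware. The gain law `(ref / distance) ** exponent` with an exponent below 1 is a hypothetical example of a sub-linear roll-off and is not the scaling used by the system; all names are illustrative.

```python
def adder_tree(values):
    """Sum values pairwise, level by level, as a hardware adder tree would."""
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad odd-sized levels with a zero input
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0] if level else 0

def distance_gain(distance, exponent=0.5, ref=1.0):
    """Hypothetical roll-off: exponent=1 is inverse-distance, <1 is gentler."""
    return (ref / max(distance, ref)) ** exponent

def mix(per_source_lr, distances):
    """Scale each source's (left, right) output and sum each ear separately."""
    gains = [distance_gain(d) for d in distances]
    left = adder_tree([g * l for g, (l, _) in zip(gains, per_source_lr)])
    right = adder_tree([g * r for g, (_, r) in zip(gains, per_source_lr)])
    return left, right

left, right = mix([(0.5, 0.1), (0.2, 0.4), (0.0, 0.3)], [1.0, 4.0, 16.0])
```

With an exponent of 0.5, a source 16 times farther away is attenuated by a factor of 4 rather than 16, which keeps distant sources audible.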
Output Buffer & I2S Output.
Finally, the two audio samples are buffered and then converted into an I2S signal. In order to prevent issues with the clock-domain crossing, 20 audio samples are buffered in a FIFO (<0.5 ms added latency) and then converted into a digital (16-bit stereo) signal and output to the PCM5102A stereo DAC. The I2S protocol consists of a bit clock, a word select line (left ear / right ear), and a serial data line. We run the I2S bit clock at a frequency of 1.4112 MHz (sampling frequency multiplied by bits per sample multiplied by number of ears). The DAC is connected via a 3.5 mm jack to a set of over-the-ear headphones (which improve sound quality and allow the sound to enter the ear more naturally than in-ear earbuds).
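The two figures quoted above follow directly from the audio format. The constant names are chosen for this example.

```python
SAMPLE_RATE_HZ = 44_100
BITS_PER_SAMPLE = 16
CHANNELS = 2          # left and right ear
FIFO_DEPTH = 20       # buffered audio samples

bit_clock_hz = SAMPLE_RATE_HZ * BITS_PER_SAMPLE * CHANNELS
fifo_latency_ms = FIFO_DEPTH / SAMPLE_RATE_HZ * 1e3
```

This gives a bit clock of 1,411,200 Hz (1.4112 MHz) and a FIFO latency of about 0.45 ms, consistent with the stated bound of less than 0.5 ms.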
RESULTS

Verification
We implemented the filtering and crossfading algorithms in Python as a reference implementation and generated some random sound samples and motion profiles to use as test cases. After ensuring that the spatial audio performed as expected with the Python prototype, we loaded the random sound samples onto an SD card and reconfigured the control logic to load audio from the SD card, send it to the FPGA, and then read it back from the FPGA's output buffer (as opposed to sending it to the DAC for real-time playback), and write back to the SD card. We then ran the system and offloaded the outputs from the SD card to validate against the output of the Python prototype. All parts of the FPGA system match the corresponding output of the Python prototype.
Hardware Resource Constraints
Running our system at a clock rate of 100 MHz, we face two main constraints on the number of sound sources M and the maximum filter size N.
Block RAM Limitation.
Each set of filter weights and each circular buffer requires a significant amount of block memory to store on the FPGA. Specifically, each FIR filter requires 2N bytes of storage, and the circular buffer also requires 2N bytes. As there are eight FIR filters in each processing element, each processing element requires 18N bytes of storage. Given that there is 630 KiB of block memory on the Z-7020, we can calculate the number of processing elements that can be used at once as $\lfloor 645120 / (18N) \rfloor$. Table 1 shows the maximum number of sound sources that can be processed (with regard to memory constraints only) for various values of N.
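The memory bound above can be evaluated for the filter lengths considered in this paper. The function name `max_sources_by_bram` is chosen for this example.

```python
BRAM_BYTES = 630 * 1024  # 645,120 bytes of block memory on the Z-7020

def max_sources_by_bram(taps, filters=8, bytes_per_value=2):
    """Memory-limited count: 8 weight sets plus 1 circular buffer = 18N bytes each."""
    bytes_each = (filters + 1) * taps * bytes_per_value
    return BRAM_BYTES // bytes_each

limits = {n: max_sources_by_bram(n) for n in (2048, 1024, 512, 256)}
```

Halving the filter length doubles the memory-limited count, up to rounding at N = 2048.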
Parallel Compute Limitation.
Each FIR filter requires at least one DSP slice to compute. For this reason, each processing element needs eight DSP slices in order to physically exist on the FPGA. As the Z-7020 has 220 DSP slices, the number of processing elements which may be synthesized on the FPGA can be calculated as $\lfloor 220 / 8 \rfloor = 27$. Synthesizing our design onto the FPGA confirms the two numbers above as the maximum resource availability.

In the simplest case (no memory crossover), we assign each sound source onto one PE, but allow one PE to process multiple sound sources in sequence and produce multiple outputs. Additionally, N clock cycles are required for one PE to process one sound sample. With the FPGA running at a clock speed of 100 MHz, there are $\lfloor 100000000 / 44100 \rfloor = 2267$ clock cycles per sound sampling period. Given that 2048 cycles are required for the largest FIR filter to run, there is a 219-cycle slack, during which the adder tree and the AXI4 memory transfer are executed. For smaller FIR filter sizes, multiple sound sources are processed on a single PE. For example, for N = 1024, two sound sources are processed on each PE, one in the first 1024 cycles and the second in the latter 1024 cycles. Table 2 shows the maximum number of distinct sound sources for each FIR filter size.

Table 2: Maximum number of distinct sound sources for each FIR filter size.

| Filter length N | Sources (BRAM limit) | PEs (DSP limit) | Max. sound sources | Sources per PE |
|---|---|---|---|---|
| 2048 | 17 | 27 | 17 | 1 |
| 1024 | 35 | 27 | 35 | 2 |
| 512 | 70 | 27 | 70 | 3 |
| 256 | 140 | 27 | 140 | 6 |

Other resources on the FPGA (LUT, FF) are not found to be a bottleneck, which is why only Block RAM and DSP slices are considered. It is clear that memory, not compute, is the main bottleneck in all cases.
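The resource budget described above can be reproduced with a short calculation. The sketch assumes that the sources-per-PE figure is the memory-limited source count spread evenly across the 27 available PEs and rounded up; that interpretation, and the names `budget`, `slots_per_pe`, and `per_pe`, are assumptions made for this example.

```python
import math

CLOCK_HZ = 100_000_000
SAMPLE_RATE_HZ = 44_100
DSP_SLICES = 220
BRAM_BYTES = 630 * 1024

max_pes = DSP_SLICES // 8                        # eight MAC blocks per PE
cycles_per_sample = CLOCK_HZ // SAMPLE_RATE_HZ   # clock cycles per audio sample

def budget(taps):
    """Resource limits for a given FIR filter length."""
    bram_sources = BRAM_BYTES // (18 * taps)       # memory-limited source count
    slots_per_pe = cycles_per_sample // taps       # sources one PE can run in sequence
    compute_sources = max_pes * slots_per_pe       # compute-limited source count
    sources = min(bram_sources, compute_sources)
    return {
        "bram": bram_sources,
        "compute": compute_sources,
        "sources": sources,
        "per_pe": math.ceil(sources / max_pes),    # assumed even spread over PEs
    }

table = {n: budget(n) for n in (2048, 1024, 512, 256)}
slack_2048 = cycles_per_sample - 2048
```

In every row the memory-limited count is below the compute-limited count, which matches the observation that block memory is the bottleneck.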
Latency and Performance Comparison
We ported our entire pipeline to C++ (optimized using gcc with the maximum level of optimization) to run on a CPU (ARM A57 @ 1.43 GHz with 2 cores), as well as to CUDA for an NVIDIA GPU (Tegra X1 on Jetson Nano). This pipeline includes reading audio input, calculating the location, running the FIR filters (with batch processing), and I2S output. Results are shown in Table 3. We note that the lack of complete parallel execution between filters creates an added overhead. In addition, we enforce a maximum 1 ms latency for sound synthesis (1 ms from IMU input to headphone output). The reason for the 1 ms limit is to avoid compounding the already high 5 ms latency of the IMU. The FPGA and CPU do not face problems with this, as they do not rely on batch processing (due to the lower overhead of pulling data out of CPU memory vs GPU memory). Power measurements include the entire setup, including the embedded Linux OS.
Perception Tests (Blind-Sailing)
We tested our system under various conditions using two unique methods as well as a fixed control test. We primarily tested the user's ability to locate a sound around the azimuth (the elevation of the sound source was not changed between tests), but both the azimuth and elevation tracking were enabled during the test. In cases where multiple audio sources were used, they were distinct sounds and the user was informed in advance which sound to locate.
Test 1: Rotating Head Test.
In this test, we synthesized various audio sources at fixed locations and allowed the user to rotate their head until it was facing directly toward the audio source, then press a button to signal that they had located the source. No time limit was applied during this test, but the average time taken was measured and recorded. Each trial was repeated 20 times with different azimuth angles.

In test 2, we again synthesized various audio sources at fixed locations and allowed the user to move their head freely. However, the audio was only played for 1.5 seconds, and the user was then asked to select on a touchscreen the angle from which the audio seemed to be coming. Each trial was repeated 20 times with different azimuth angles. (In the results, errors where the user selected the wrong point on the cone of confusion are noted separately to avoid skewing the mean and standard deviation.)

In the control test (test 3), we repeated test 2 with the same methodology, but with the user's head fixed in place. This was done to measure the significance of head movement for the ability to locate sounds. The results of this test, compared with the results of test 2, show that head movement is crucial not only for locating sounds along the cone of confusion, but also for precise location of sounds in general.
DISCUSSION
5.1 Interpretation of Results
We have shown that our latency and overhead are significantly lower than those of GPU and CPU-based platforms, allowing our system to process more unique sound sources with the same latency and similar power consumption. In addition, our system is significantly more powerful than that presented in [5]. Our system is able to process 70 unique audio sources with the same 512-length FIR filter, compared to the 15 sources in [5]. Additionally, our system supports variable filter lengths ranging from 256 to 2048, instead of being constrained to a specific length of 512. Finally, we have a lower overall latency as a result of our improved IMU interface and better use of caching (storing coefficients in BRAM instead of transferring them from DDR3 memory after every operation). For these reasons, we find that our solution is significantly better (given our power and cost constraints) than all prior work.

Additionally, blind test results show a high degree of success for a user to locate a sound source in a fairly short amount of time. In test 1, which most accurately reflects the real-world scenario for use of this system, users were able to locate the source to an extremely high degree of accuracy. Additionally, a comparison of tests 2 and 3 demonstrates that low-latency head movement is essential in order to accurately locate a sound source. Furthermore, preliminary experiments with using the system for blindfolded indoor navigation have yielded promising results, even in narrow corridors where a few degrees of error can cause a user to collide with a wall. We also noted that during blind tests, even slight increases (5-10 ms) in head-tracking latency were extremely noticeable and caused discomfort during fast head movements.
Blind Sailing Application
We integrated our system into our previously-built smartphone app (see Introduction) as an output method. The smartphone app has been previously tested and proven on the water, so for the purposes of this paper, the FPGA system was tested on land. The smartphone was connected to the Cortex-A9 CPU as the "External Controller" in our system. Various configuration options were made available for the user, including ignoring other boats if they are more than a short distance away. Figure 2 depicts a blind-sailing race setup. Each buoy (which functions as a waypoint for match racing) was assigned a different sound pitch, and the opposing boat was assigned two sounds (for the port and starboard tack, respectively). By default, only the buoy that the user was currently navigating to would emit a sound, along with the opposing boat when it was near the user. However, the number of unique sounds could be increased up to 8 through the tactile interface. The volume of each sound source was controlled primarily by its importance to the user as opposed to the distance, in order to allow the user to focus on the most important information, not the nearest (which would include the buoy the sailor is currently sailing away from).
Future Work
Future optimizations include designing a better representation of distance, as well as compressing FIR filter weights and audio samples down to a lower bitwidth (potentially 8 bit fixed-point) and bitrate (many sound tracks are recorded at 22.05kHz or even lower), in order to allow for more sound sources and/or larger FIR filters. Tests will have to be conducted to quantitatively measure the effects of these optimizations on sound quality. In addition, further tests will have to be conducted on the effect of the headphones used for audio playback. Results from [7] suggest that it may be optimal to add an additional FIR filter to the final output (after the adder trees) to compensate for the effects of the headphones used to play the sound source.
CONCLUSION
The FPGA-based platform we presented in this paper has been proven to be a very successful method of synthesizing 3D audio in real-time using HRTFs. We showed that our system is able to process many more unique sound sources than comparable CPU and GPU-based systems, while maintaining a lower latency and equal audio quality. Our caching of filter weights and recent audio signals using a circular buffer allows use of BRAM as a more efficient cache without increasing read or write times significantly. Finally, we demonstrated that a user can locate an audio source very easily, even if only given a short sample of the sound. Our integration with a blind-sailing system is successful and can give the user all the cues needed for sailing in a blind match-racing environment.
