Abstract: In this paper, a high-performance application which uses multiple 48k-tap FIR filters is presented. Due to its size and complexity, and to constraints such as real-time operation, low latency and high memory bandwidth, the filter was implemented on UltraScale+, a high-end FPGA from Xilinx. The system was verified using a gold reference model written in C (high-level algorithm verification) and an analytical model calculated manually. The system was also tested using a development board and SystemVerilog (for register-transfer level and timing verification). The results show a perfect match between the reference models and the actual output. The main novelty of the paper is the implementation of such an immense real-time signal processing system, based on FIR filters with over a million taps in total, on a single chip. Details about the resources allocated within the FPGA are given in a table in the results section.
I. INTRODUCTION
The traditional approach to Finite Impulse Response (FIR) filter implementation for high-performance applications usually involves Application-Specific Integrated Circuits (ASICs) or dedicated Digital Signal Processing chips (DSPs). However, in recent years market pressure and advances in technology have made Field-Programmable Gate Arrays (FPGAs) a viable alternative in certain cases. The high development cost of ASICs and their long time-to-market have made this solution less attractive for some applications, especially when there is fierce competition involved. DSPs suffer from reduced flexibility due to their sequential-execution architecture, which can prevent them from achieving the desired performance in certain cases, such as when the bandwidth they provide is not sufficient. FPGAs, on the other hand, offer a balance of cost, time-to-market, performance and flexibility which the traditional approaches lack [1], while still retaining the possibility to cover real-time use cases. Implementing FIR filters in FPGAs is not a new concept, and there are many published papers on the subject [2], [3], [4]. However, the length of the filters, that is, the number of taps, is relatively small compared to the scale of this application, even in newer published works [5], [6]. Even where the difference in the number of taps was not several orders of magnitude, these applications were used in systems whose impulse response lasted several seconds or more. One of the main features and restrictions of this system is the real-time aspect: the overall latency of the system has to be below 2 microseconds. Together with the chip area required by a system of this size, this further contributes to the novelty of the paper.
In this particular case, a highly intensive application which includes multiple FIR filters is required. Each filter has 48k coefficients and works at a 48 kHz sampling rate, yielding a 1 second impulse response, with some of the coefficients being shared between the filters. The entire process is linked to a Personal Computer (PC) where all the pre- and post-processing is managed. The processing power needed for the application is less critical than the memory bandwidth needed to support the required parallelism. This is the main reason why a Central Processing Unit (CPU) could not be used for such an application: it is limited when working with memory-intensive problems. Even though the CPU could work in conjunction with its cache memory, there is simply far too much work which needs to be done in parallel, and most caches do not have enough ports to enable sharing data between many processing units. One possible solution which could provide sufficient computation power and memory bandwidth is the Graphics Processing Unit (GPU). However, GPUs were designed to provide immense throughput and parallelism in computing. As such, even high-end GPUs introduce latency greater than that specified for the application, so this approach was discarded. In the end, a high-end UltraScale+ FPGA from Xilinx was used along with the VCU118 development board [7].
[...] Random Access Memory (DRAM) has to be used. This type of memory cannot be accessed on every clock cycle, which in turn reduces the overall speed and throughput of the system. The FPGA used in this project has Ultra Random Access Memory (URAM), which increases the size of the internal BRAM up to 6 times [8].
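As a rough back-of-the-envelope illustration of why memory bandwidth dominates, the sketch below estimates the operand traffic of direct filtering. It uses the 72 filters stated in the requirements and takes "48k" to mean exactly 48,000. The 4-byte operand width is an assumption, since the floating-point format is not stated, and the figure ignores any savings from the shared coefficients.

```python
# Rough estimate of multiply-accumulate (MAC) rate and operand traffic.
# Assumptions (not stated in the paper): "48k" means 48,000 and every
# operand is a 4-byte single-precision float.
NUM_FILTERS = 72
TAPS_PER_FILTER = 48_000
SAMPLE_RATE_HZ = 48_000
BYTES_PER_OPERAND = 4  # assumed float32


def mac_rate(num_filters, taps, fs_hz):
    """MACs per second when every tap is evaluated for every input sample."""
    return num_filters * taps * fs_hz


def operand_bandwidth(macs_per_second, bytes_per_operand):
    """Bytes per second if each MAC reads one sample and one coefficient."""
    return 2 * macs_per_second * bytes_per_operand


macs = mac_rate(NUM_FILTERS, TAPS_PER_FILTER, SAMPLE_RATE_HZ)
bandwidth = operand_bandwidth(macs, BYTES_PER_OPERAND)
print(f"{macs / 1e9:.1f} GMAC/s, {bandwidth / 1e12:.2f} TB/s of operand reads")
```

Under these assumptions the estimate comes to roughly 166 GMAC/s and about 1.3 TB/s of operand reads, a rate that is only practical when the operands are spread over many memories that can be read in parallel.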
II. PROPOSED SOLUTION
Another requirement which contributed to the complexity of the design was the floating-point representation of the samples, needed so that the PC could manage them. Most FIR filters and DSPs use fixed-point arithmetic, so a special conversion block was needed, which impacted the responsiveness and latency of the system.
The data flow (shown in Fig. 1) is the following: first, the audio data is sampled by an analogue-to-digital converter (ADC) at a sampling rate of 48 kHz. The audio comes from multiple channels which all need to be filtered in parallel. Once sampled, the audio signal is transmitted via the I2S protocol to the FPGA, from where it is forwarded to the PC over the Peripheral Component Interconnect Express (PCIe) bus. On the PC, the input samples are first pre-processed and then sent back to the FPGA, where the filtering is done. Afterwards, the filtered signal is sent back to the PC via PCIe for post-processing. Finally, the processed signal is sent to the digital-to-analogue converter (DAC) and out through the speakers. A dedicated Linux driver for handling the PCIe protocol was written for this purpose.
There are several ways to obtain a specific number of taps for a single filter. The requirements stated that each filter has 48k taps, which equals 1 second of impulse response time, and that there are 72 of these filters. Since the problem consists of filtering several channels at once, it made sense to have a dedicated processing unit for each channel. The input signal is audio with a sampling rate of 48 kHz. This means that 48k taps could be obtained by having 48k processing cores working in parallel at a clock frequency equal to the sampling frequency. However, this is far too resource-intensive even for high-end FPGAs and would greatly underutilize the maximum working frequency of the FPGA.
The opposite extreme would be to have a single core working at 2.3 GHz (48k taps at a 48 kHz sampling rate, with one tap processed per clock cycle), provided that this core could handle the memory throughput as well. This clock frequency is too high for FPGAs, so this approach was also discarded. The goal was to find a solution somewhere in between, as a compromise between attainable frequency and available resources.
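The trade-off between clock frequency and core count can be sketched numerically. The snippet below assumes that each core completes one multiply-accumulate per clock cycle and takes "48k" as exactly 48,000. The helper names and the list of clock rates are illustrative and not taken from the design.

```python
import math

TAPS = 48_000            # taps per filter, taking "48k" as 48,000
SAMPLE_RATE_HZ = 48_000  # audio sampling rate


def single_core_clock_hz(taps, fs_hz):
    """Clock needed if one core performs one MAC per cycle for all taps."""
    return taps * fs_hz


def cores_needed(taps, fs_hz, clock_hz):
    """Smallest number of one-MAC-per-cycle cores that keeps up in real time."""
    return math.ceil(taps * fs_hz / clock_hz)


print(f"single core: {single_core_clock_hz(TAPS, SAMPLE_RATE_HZ) / 1e9:.3f} GHz")
# Illustrative clock rates between the two extremes.
for clock_hz in (48_000, 100_000_000, 200_000_000, 400_000_000):
    cores = cores_needed(TAPS, SAMPLE_RATE_HZ, clock_hz)
    print(f"{clock_hz / 1e6:g} MHz -> {cores} cores")
```

At a clock equal to the sampling rate the function returns 48,000 cores, while a 200 MHz clock brings the count down to 12, which shows how quickly the resource cost falls as the attainable frequency rises.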
The approach taken was to use the highest possible working frequency while not breaching the resource limit. Even though the FPGA itself could operate at 800 MHz, the floating-point unit created a bottleneck which reduced the working clock frequency to 200 MHz. This resulted in 12 processing units working in parallel, each implementing a filter with 4k taps. These processing units were named FIR Cores, and the structure of one is shown in Fig. 2. The Core consists of two buffers, one for the samples and one for the coefficients, and an accumulator which multiplies the two and accumulates the result. Each core receives an input sample from the previous core, except the first one, which receives the sample directly from the PC. The FIR coefficients are transferred over the coefficient bus.
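The following behavioural sketch (not RTL, and not the authors' implementation) illustrates why a long filter can be split across chained cores. It assumes that each core holds one consecutive segment of the coefficients, forwards its oldest sample to the next core, and that the partial results of all cores are summed. The names `FirCore`, `fir_unit` and `fir_direct` are hypothetical, and the sizes are reduced to 3 cores of 4 taps for readability.

```python
def fir_direct(samples, coeffs):
    """Reference direct-form FIR: y[n] = sum_k coeffs[k] * x[n - k]."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * samples[n - k]
        out.append(acc)
    return out


class FirCore:
    """Behavioural model of one core: sample buffer, coefficient buffer, MAC."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)
        self.buffer = [0.0] * len(self.coeffs)  # newest sample at index 0

    def step(self, sample_in):
        """Accept one sample; return (partial sum, sample for the next core)."""
        sample_out = self.buffer[-1]  # oldest sample leaves the core
        self.buffer = [sample_in] + self.buffer[:-1]
        partial = sum(c * x for c, x in zip(self.coeffs, self.buffer))
        return partial, sample_out


def fir_unit(samples, coeffs, num_cores):
    """Split coeffs into equal segments, chain the cores, sum their outputs."""
    if len(coeffs) % num_cores != 0:
        raise ValueError("coefficient count must be divisible by num_cores")
    seg = len(coeffs) // num_cores
    cores = [FirCore(coeffs[i * seg:(i + 1) * seg]) for i in range(num_cores)]
    out = []
    for sample in samples:
        total = 0.0
        carried = sample
        for core in cores:
            partial, carried = core.step(carried)  # feeds the next core
            total += partial
        out.append(total)
    return out


coeffs = [float(k + 1) for k in range(12)]  # 12 taps for illustration
samples = [1.0, 0.0, -2.0, 3.0, 0.5, 0.0, 0.0, 1.0]
assert fir_unit(samples, coeffs, num_cores=3) == fir_direct(samples, coeffs)
```

Because core i only sees samples delayed by the taps of the cores before it, the sum of the partial results reproduces the single long filter; the same reasoning would apply to 12 cores of 4k taps each.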
Results from all 12 cores are added together to form the output of a single FIR Unit. There is a total of 72 FIR Units within the system.
As far as verification and testing are concerned, there were several issues which needed to be addressed. The entire design was synthesized as an AXI IP and as such could theoretically be simulated as one (illustrated in Fig. 3). The main purpose of the simulation is to remove any bugs which can occur in the system and in the elements linked to it. Another key benefit of a simulation is the cycle-by-cycle visibility, which is impossible to recreate in a real-world test scenario. However, due to the complexity of the design, it was not possible to simulate the full design on the available workstations. Even if the available resources had been sufficient, the simulation would simply have lasted too long to be useful for debugging. A single second of real time could take hours or even days to simulate, which would easily bring the project to a halt.
Instead, a shrunken version of the system was simulated for debugging purposes and to check the structural integrity, primarily to verify that the Advanced Extensible Interface (AXI) protocol was working as intended. This version used a single FIR Unit with a shortened FIR Core. Prior to testing, we needed to make sure that there were no errors in the design, because the synthesis process lasted over 8 hours, which made design changes extremely demanding and ultimately quite expensive. Once the simulations showed that the system was error-free, the next step was the real-world test on the development board with both the shrunken and the full version of the design. A gold reference model for the shrunken version of the design was generated using a program written in C. The inputs used for this gold model were fed into the system, and the results exactly matched those generated by the model. Verifying the full version of the design brought new challenges. Even generating a meaningful reference model took too long: about a minute for every second of the actual signal. For this reason, simple input signals whose response could be calculated analytically were chosen, such as the delta impulse and a sawtooth. Finally, these simple input signals were fed to the system, whose output matched the predictions of the analytical model perfectly.
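The idea of checking a filter against analytically known responses can be sketched as follows. This is an illustration only, not the authors' C model: the coefficient values, the sawtooth period and the all-ones filter in the second check are assumptions chosen so that the expected output is easy to derive by hand.

```python
def fir_reference(samples, coeffs):
    """Direct-form FIR used as the reference: y[n] = sum_k h[k] * x[n - k]."""
    return [
        sum(coeffs[k] * samples[n - k] for k in range(min(n + 1, len(coeffs))))
        for n in range(len(samples))
    ]


def delta(length, amplitude=1.0):
    """Impulse of the given amplitude at n = 0, followed by zeros."""
    return [amplitude] + [0.0] * (length - 1)


def sawtooth(length, period):
    """Integer ramp 0, 1, ..., period - 1, repeated."""
    return [float(n % period) for n in range(length)]


# Hypothetical coefficients chosen only for this illustration.
h = [0.5, -1.0, 2.0, 4.0, 0.25, -3.0, 1.5, 8.0]

# Check 1: the response to a delta impulse is the coefficient list itself.
assert fir_reference(delta(len(h)), h) == h

# Check 2: with all-ones coefficients spanning whole sawtooth periods, the
# steady-state output is constant: (taps / period) * (0 + 1 + ... + period - 1).
period, taps = 4, 8
ones = [1.0] * taps
y = fir_reference(sawtooth(32, period), ones)
expected = (taps // period) * sum(range(period))  # 2 * 6 = 12
assert all(v == expected for v in y[taps - 1:])
```

Inputs of this kind have the advantage that the expected output can be written down without running a full-length reference model, which is what makes them attractive when the reference model itself is slow.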
III. RESULTS
The results presented in this section show the amount of FPGA resources required to meet all the restrictions placed on the application. The number of logic elements, DSP slices and memory elements, as well as the percentage of each resource used, is given in Table 1. Lookup Tables (LUTs) represent units of combinatorial logic and flip-flops are used for individual registers, whereas Block RAM (BRAM) and Ultra RAM (URAM) are used for larger quantities of memory. As shown in Fig. 2, URAM was used for the samples and coefficients, while BRAM had a more supportive role in buffers which help with the communication, mainly with the AXI protocol. Lastly, the DSP slices support the floating-point operations, providing fast multiply and accumulate operations. The values in the table indicate that memory was in fact one of the biggest obstacles, if not the biggest, which needed to be overcome. URAM usage nearly reached 100%, while the usage of LUTs and DSP slices, which indicates processing power, was somewhere in the middle.
Table 1. FPGA resource usage (LUT, flip-flop, BRAM, URAM, DSP)
One other point worth mentioning is the power consumption. The total power needed for the system is just below 50 W. This is several times less than the power needed by a high-end GPU able to provide sufficient throughput and processing power (disregarding the latency issues), which usually draws 150-200 W, with spikes going over these values. This is another benefit of using an FPGA, especially for prolonged usage over several years.
IV. CONCLUSION
A high-performance FIR filter with a very long impulse response, implemented in a high-end FPGA, is presented. The results showed adequate resource utilization while meeting strict criteria such as low latency and real-time operation. This performance could not have been achieved by traditional approaches, due to the high memory throughput demand combined with the aforementioned restrictions.
