Real-time display of processed Fourier domain optical coherence tomography (FDOCT) images is important for applications that require instant feedback of image information, for example, systems developed for rapid screening or image-guided surgery. However, the computational requirements for high-speed FDOCT image processing usually exceed the capabilities of most computers, and therefore display rates rarely match acquisition rates for most devices. We have designed and developed an image processing system, including hardware based upon a field programmable gate array, firmware, and software, that enables real-time display of processed images at rapid line rates. The system was designed to be extremely flexible and to be inserted in-line between any FDOCT detector and any Camera Link frame grabber. Two versions were developed for spectrometer-based and swept source-based FDOCT systems, the latter having an additional custom high-speed digitizer on the front end but using all the capabilities and features of the former. The system was tested in humans and monkeys using an adaptive optics retinal imager, in zebrafish using a dual-beam Doppler instrument, and in human tissue using a swept source microscope. A display frame rate of 27 fps for fully processed FDOCT images (1024 axial pixels × 512 lateral A-scans) was achieved in the spectrometer-based systems.
I. INTRODUCTION
Optical coherence tomography (OCT) obtains cross-sectional images of living human tissues with an axial resolution of several microns. 1 OCT is routinely used clinically in the field of ophthalmology 2 and has been implemented in research and clinical prototype platforms for other fields. [3] [4] [5] OCT has advantages over conventional ultrasound in resolution and in fiber optic delivery for direct imaging of internal organs with catheters, endoscopes, laparoscopes, and other surgical probes. It also has advantages over standard excisional biopsy and histopathology: surgical guidance by in situ, real-time OCT tissue microscopy and automated discrimination minimizes costly and time-consuming tissue processing and analysis, potentially reduces sampling errors, and allows greater diagnostic access to vital organs. These attributes make OCT extremely attractive for medical diagnosis and image-guided surgery.
Recently, a new OCT paradigm, called Fourier domain OCT (FDOCT), has been adopted from the radar field (spectral radar). 6, 7 It achieves rapid imaging speeds by multiplexed acquisition of the axial profile via collection of its spectrum in the Fourier domain. Multiplexed operation obviates the necessity of axial scanning for collection of depth information. FDOCT instruments come in spectrometer-based and swept source versions. The more conventional FDOCT technique uses a spectrometer with a linear array detector. The fast readout speed of the linear detector (typically tens of kilohertz) allows acquisition at video frame rates (30 fps), while the multiplexed scheme provides a signal-to-noise ratio (SNR) advantage over time domain OCT (TDOCT). 8, 9 As the name implies, swept source FDOCT rapidly sweeps a narrow wavelength band over a broad source spectrum while sensing the interference (i.e., fringes) associated with each instantaneous wavelength on a single-element detector. 10, 11 Swept sources generally operate at tens of kilohertz and also achieve an SNR advantage over TDOCT. 12 Recently, very rapid swept source speeds in the hundreds of kilohertz have been achieved. 13 Fourier domain methods provide direct access to the phase of the optical signal, enabling software implementation of automatic numerical dispersion compensation. 14, 15 Because of spectrometer resolution tradeoffs for efficient light capture over a given bandwidth, spectrometer-based FDOCT instruments generally have a reduced depth range compared to their TDOCT and swept source counterparts. Moreover, despite the great advantage in data rate, en face FDOCT imaging in the eye is slower than scanning laser ophthalmoscopy (SLO) or combined SLO/OCT techniques because a full three-dimensional (3D) data cube must be acquired before an individual en face slice is available to view. 16
Real-time display of processed FDOCT images often occurs at a much slower rate than the linear array acquisition rate because a computationally intensive fast Fourier transform (FFT) must be performed on each A-scan. This usually relegates most image analysis and diagnosis to postprocessing. For some applications, such as image-guided endoscopy, biopsy, or surgery, it is essential to get feedback in real time. For high resolution imaging exploring a small patch of tissue, early disease detection, and conditions with localized pathologies, real-time operation is also necessary to ensure that the proper region of interest is imaged with fewer repeat sessions. In ophthalmology, it is desirable to have real-time feedback to aid in patient alignment and reduce session time, especially for older patients, who make up a large percentage of the clinical population for most hospitals. Rapid processing will also enhance the operation of depth and 3D tracking hardware. 17 Moreover, real-time operation allows better screening capabilities for technicians trained to perform initial diagnoses and triage from the resultant images.
In this paper we present an approach for rapid processing and real-time display of FDOCT images. We designed and developed digital signal processing (DSP) hardware using a single field programmable gate array (FPGA) integrated circuit (IC) and a custom electronics board. A custom high-speed digitizer board was also developed for swept source-based FDOCT instruments. Firmware was developed for signal transfer, external hardware interface and timing, host processor communication, and full implementation of the FDOCT signal processing chain inside the FPGA IC. Host processor software was also developed to perform final image scaling and display, communication with the external and DSP hardware, and auxiliary functions related to the user interface. We achieved a frame rate of 27 fps for fully processed spectrometer-based FDOCT images (1024 axial pixels × 512 lateral A-scans) in initial testing on human volunteers, a rhesus monkey, and a zebrafish. We achieved a frame rate of 19.5 fps for fully processed swept source-based FDOCT images (1024 axial pixels × 512 lateral A-scans) in initial testing on in vivo and in vitro human tissue.
II. FDOCT SIGNAL PROCESSING THEORY AND REQUIREMENTS
The principle that underlies FDOCT is that an axial reflectance profile and its broadband interference spectrum comprise a Fourier transform pair. 7 The spectral fringes acquired in FDOCT represent the interference between light backscattered from sample and reference targets for all the individual colors that compose a broadband source. Fourier transformation is used to convert the spectrum in wavelength (i.e., optical frequency) dimensions to an axial reflectivity profile in space. Synchronous optomechanical scanning of the beam provides the transverse image dimensions. For all practical targets, several additional processing steps must be performed on each spectrum prior to Fourier transformation, including reference light intensity background subtraction, spectrum interpolation, and dispersion compensation.
The detected spectrum, as a function of wavenumber, k, is given by
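A standard three-term form consistent with the term-by-term description that follows is assumed here; the field symbols $E_R$ and $E_S$ are the ones defined immediately afterwards:

```latex
S(k) = |E_R(k)|^2 + 2\,\mathrm{Re}\{E_R^{*}(k)\,E_S(k)\} + |E_S(k)|^2 \qquad (1)
```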
where $E_R$ is the reference arm field and $E_S$ is the sample arm field, including delays and attenuation. In general, the sample arm intensity is small compared to the reference arm intensity, and so the last term in Eq. (1) can be ignored. In addition, the first term is removed by subtraction of the reference arm intensity prior to Fourier transformation. The middle interference term, $S_{int}(k)$, contains the image information and is the sum of fringes generated by the interference of light reflected from index variations within an object with the light reflected from the reference mirror
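For discrete reflectors, and assuming $k = 2\pi/\lambda$ so that the round-trip path difference to a reflector at depth $z_n$ contributes a phase $2kz_n$, a commonly used form of this term (a sketch under those assumptions) is

```latex
S_{int}(k) = 2 \sum_{n} \sqrt{I_R(k)\, I_n(k)}\; \cos\!\left[ 2 k z_n + \phi(k, z_n) \right] \qquad (2)
```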
where $I_n$ is the intensity of light reflected from the nth layer in the sample, $I_R$ is the intensity of light reflected from the reference arm, $z_n$ is the depth of the nth reflection, and $\phi(k, z_n)$ is a general phase term that includes dispersive effects. The axial profile is obtained from the Fourier transform of Eq. (1),
where $\Gamma(z)$ is the envelope of the coherence function and $\alpha_n$ is the sample reflectivity at depth $z_n$. 18 The first and last terms in brackets represent the reference and sample autocorrelation terms, respectively. The middle terms represent identical cross-correlation terms. With proper reference arm background intensity subtraction and for samples with low reflectivity, the first and last autocorrelation terms will be negligible. Only one of the cross-correlation terms is retained for imaging.
The wavenumber ($k$) and the axial dimension ($z$) are conjugate variables. The transformation from the physical spectrometer wavelength to spatial frequency contains inherent nonlinearities that must be compensated. Therefore, the raw spectrum must be resampled to produce a spectrum linear in $k$ prior to Fourier transformation. There are three factors. First, $k$ and $\lambda$ are inversely related,
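In standard notation, with $\lambda$ the wavelength, this relation reads

```latex
k = \frac{2\pi}{\lambda} \qquad (4)
```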
Second, the spectrometer also contributes to spatial nonlinearity via the grating equation (for first order diffraction),
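A common first-order form, assuming both angles are measured from the grating normal on the same side (other sign conventions exist), is

```latex
\lambda = a \left( \sin\theta_i + \sin\theta_m \right) \qquad (5)
```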
where $a$ is the grating spacing and $\theta_i$ and $\theta_m$ are the angles of the incident and diffracted beams, respectively. Third, distortion in our custom spectrometer optics also contributes to the nonlinearity. This is modeled in Zemax. The net interpolation correction resulting from all these sources is incorporated into a single mapping for our spectrometer-based FDOCT systems. The mapping for our swept source-based FDOCT systems is found empirically using a previously published algorithm. 19 The other major source of phase nonlinearity that must be compensated arises from dispersion mismatch between optical materials in the reference and sample arms. Dispersion is the variability of refractive index with wavelength and can be expressed with a Taylor series expansion
where $n_0$ is the center (phase) index. The first order term is the group velocity or phase dispersion term, while the second order term is referred to as the group velocity dispersion. Group velocity dispersion as a function of the phase is
Group velocity dispersion and higher order terms lead to profile broadening and a reduction in the axial resolution by introduction of a phase shift, $\phi(k, z_n)$, into the interference signal expressed in Eq. (2). For a spectrum with a Gaussian shape, the axial resolution, $\Delta z$, is
The coherence length in Eq. (8) is defined by
where $\lambda_0$ is the center wavelength and $\Delta\lambda$ is the full width at half maximum bandwidth of the source. For FDOCT, dispersion compensation is usually accomplished by adjustment of the phase numerically prior to Fourier transformation.
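As a numerical illustration of how the center wavelength and bandwidth set the axial resolution, the sketch below assumes the textbook Gaussian-spectrum expressions $l_c = (4\ln 2/\pi)\,\lambda_0^2/\Delta\lambda$ and $\Delta z = l_c/2$. Both relations, the function names, and the 850 nm / 50 nm source are assumptions chosen for illustration and are not taken from the instruments described here.

```python
import math


def coherence_length(center_wavelength_m, fwhm_bandwidth_m):
    """Assumed Gaussian-spectrum coherence length: (4 ln 2 / pi) * lambda0^2 / dlambda."""
    return (4.0 * math.log(2.0) / math.pi) * center_wavelength_m ** 2 / fwhm_bandwidth_m


def axial_resolution(center_wavelength_m, fwhm_bandwidth_m):
    """Assumed relation: the axial resolution is half the coherence length."""
    return 0.5 * coherence_length(center_wavelength_m, fwhm_bandwidth_m)


if __name__ == "__main__":
    # Hypothetical source: 850 nm center wavelength, 50 nm FWHM bandwidth.
    dz = axial_resolution(850e-9, 50e-9)
    print(f"axial resolution in air: {dz * 1e6:.2f} um")
```

Because the resolution scales as $\lambda_0^2/\Delta\lambda$, doubling the bandwidth halves $\Delta z$, which is why uncompensated dispersion broadening is costly for broadband sources.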
III. HARDWARE DESCRIPTION
To perform the FDOCT signal processing and display in real time, we have designed and developed DSP hardware that consists of an off-the-shelf FPGA minimodule (Avnet Inc.) that plugs into a custom electronics board. The custom hardware was designed and implemented such that it could be integrated into any spectrometer-based FDOCT system, regardless of hardware, by insertion between the linear array detector and the frame grabber. A block diagram and a photograph of the spectrometer-based FDOCT custom hardware are shown in Figs. 1(a) and 1(b). All of the core FDOCT signal processing calculations are performed inside the FPGA hardware, while many of the auxiliary tasks (e.g., logarithmic display scaling) are executed by the host processor to preserve FPGA resources.
The central component of the design is a high-performance FPGA with a fully parallel architecture, around which all other elements are integrated. In the first generation design, we used a Virtex4 FPGA (XC4VFX12: the smallest and lowest cost member of the Virtex4 family, Xilinx Inc.). The FPGA has 1536 configurable logic blocks (CLBs) that are the main logic resource for implementing sequential and combinatorial circuits. There are also 32 Xtreme DSP slices available on the IC. These resources facilitate new DSP algorithms and higher levels of DSP integration with very high performance. In our case, more than 95% of the FPGA resources are utilized after implementing the entire FDOCT signal processing chain inside the hardware. The minimodule was used for rapid prototyping in the initial development phase. Because of limitations in the size of the minimodule FPGA, next generation hardware currently under development will use a single high-performance FPGA IC integrated onto the custom electronics board (see Sec. VI).
The six major tasks performed concurrently inside the FPGA are as follows: (1) capture the raw spectral data from the linear array detector, (2) execute the real-time FDOCT processing algorithm, (3) transfer either raw or processed FDOCT image data to the computer, (4) control and synchronize all external hardware components, (5) drive the digital-to-analog converter (DAC) IC, and (6) coordinate the low-speed data transfer to and from the computer over the internal Camera Link RS-232 serial port.
The first five tasks are accomplished by two custom peripherals, the FDOCT processing peripheral and the timing peripheral, that reside in the FPGA fabric. Tasks (1)-(3) are accomplished with the FDOCT processing peripheral, which is connected at its input to the linear array detector over a standard Camera Link interface. A standard low voltage differential signaling (LVDS) Camera Link receiver IC is used to convert the differential, serial data to a single-ended, parallel format. Our current design supports single-tap Camera Link cameras with up to 2048 pixels, 12 bit dynamic range, and a maximum 80 MHz pixel clock. The output format of the peripheral, which is transferred over a Camera Link interface to the frame grabber, mimics a single-tap camera with a 16 bit Camera Link interface. Tasks (4) and (5) are accomplished by the timing peripheral. The timing peripheral is used to synchronize all external components, which are slaved to the FPGA. External components include the linear array detector, the frame grabber, and the lateral scanning galvanometers. The FPGA generates the line and frame synchronization signals for the frame grabber and the linear array detector operating in external trigger mode. The waveforms used to drive the galvanometers with standard (point, line, circle, raster, radial) and custom scans are generated by two channels of a four-channel, 16 bit, 200 kS/s DAC IC.
FIG. 1(d). Photograph of the custom high-speed digitizer board used as the analog front end for the swept source-based hardware.
Real-time processing for FDOCT, Rev. Sci. Instrum. 79, 114301 (2008)
Task (6) is accomplished by a soft processor, called MICROBLAZE, which is also instantiated inside the FPGA. The MICROBLAZE soft processor is a flexible 32 bit Harvard RISC architecture optimized for Xilinx FPGAs. The microprocessor clock speed is 100 MHz. The soft processor is used to orchestrate low-speed data communication between the computer and the two custom peripherals inside the FPGA fabric to configure and transfer initialization and user input parameters.
The communication medium with the host processor is the RS-232 serial port that is an integrated component of the standard Camera Link interface. A serial communication scheme is sufficient in terms of speed because initialization parameters are transferred only once at the beginning of the data capture and processing cycle. The MICROBLAZE processor is essentially used as a bridge between the computer and the external hardware and custom FPGA peripherals. The processor is connected to these custom peripherals using the fast simplex link bus, a unidirectional point-to-point communication channel bus. In addition to these custom peripherals, the MICROBLAZE has some standard peripherals, such as RS-232, double data rate synchronous dynamic random access memory (DDR SDRAM), and a Flash memory. The RS-232 peripheral is used as a universal asynchronous receiver/transmitter between the processor and the computer. The DDR SDRAM provides 64 Mbyte of on-board memory to store the application code and a user data cube for the embedded processor application. The 4 Mbyte nonvolatile Flash memory is used to store the boot code and application code.
The hardware for swept source-based FDOCT systems includes a custom high-speed digitizer board in addition to the FPGA electronics board described above. A block diagram of the swept source-based FDOCT custom hardware and a photograph of the high-speed digitizer are shown in Figs. 1(c) and 1(d). The digitizer board acts as an analog front end to the DSP hardware. It essentially takes the balanced detector output and converts it into a Camera Link signal format. All other functionality of the DSP hardware described above is therefore utilized.
The high-speed digitizer has a single channel, 12 bit, 80 MS/s analog-to-digital converter that digitizes the raw spectrum from the detector. It also has a complex programmable logic device that controls timing and synchronization, and standard Camera Link interface ICs. In contrast to the spectrometer-based system, where the real-time board (master) generates the line synchronization signals for the camera (slave), the high-speed digitizer board (slave) accepts the swept source line synchronization signal from the swept source (master). The high-speed digitizer board can operate for a range of swept source line rates exceeding 20 kHz. From the line synchronization signal, the digitizer sets the rate for the signal that drives the galvanometers. It converts all raw spectrum data and line and frame synchronization signals to a Camera Link format and transfers them to the real-time processing board.
IV. FIRMWARE AND SOFTWARE DESCRIPTION
The system software and firmware are divided among and executed by three components: the MICROBLAZE soft processor, the FDOCT peripheral (implemented in the FPGA fabric as described above), and the host processor. During the design process, the division of tasks between these three entities was determined by the physical limitations of the hardware, as well as ease of implementation, availability of predeveloped intellectual property (IP) cores, and access to available development tools. Sections IV A through IV C describe the software and firmware developed for these components.
A. Microblaze soft processor firmware
The first component of the software architecture resides inside the FPGA and runs in the MICROBLAZE soft processor core. Platform studio integrated development environment and embedded development kit embedded design flow software tools ͑Xilinx Inc.͒ were used to develop code for the MICROBLAZE processor.
The embedded code works as a custom finite state machine between the host processor user interface and the FPGA custom peripherals. Robust communication with transfer of initialization and configuration data is accomplished using handshaking and error checking. The received data are then relayed to the FDOCT or timing peripheral, where the respective memory locations are filled with configuration data. The embedded software conveys the configuration and initialization parameters to the camera over another dedicated RS-232 serial interface. Besides handshaking and confirmation packets sent back to the computer, communication is unidirectional from the computer to the FPGA.
B. FDOCT processing sequence
The real-time image processing sequence implemented in the FDOCT peripheral comprises five image processing steps (background subtraction, spectrum interpolation, high-pass filtering, dispersion compensation, and FFT) in addition to the data transfer and multiplexing steps. The System Generator for DSP tool (Xilinx Inc.) is used to model and implement these steps inside the FPGA hardware. This tool provides system level modeling and automatic code generation from a MATLAB SIMULINK environment (MathWorks Inc.). These processing steps are shown in Fig. 2 and described further in Secs. IV B 1 through IV B 7.
Data capture
In the first block of the FDOCT signal processing chain, a 2048-word deep first-in, first-out (FIFO) memory buffer is used to capture the incoming data from the detector. The size of the FIFO memory does not directly limit the maximum number of pixels that can be captured or the size of the linear detector used since data is read from the FIFO as it fills up. Therefore, no data will be lost as long as the depth of the FIFO memory, $M$, meets the following requirement:
where $N$ is the number of detector pixels, $f_m$ is the master clock frequency of the FPGA, and $f_p$ is the pixel clock of the camera. For our current implementation, $f_m$ = 100 MHz, $f_p$ < 80 MHz, and the maximum number of camera pixels that can be acquired without loss is 10 240. It is necessary that the FIFO memory operates as a buffer between the camera and the next processing stage because the camera and the FPGA hardware are working asynchronously from each other at different clock frequencies. To minimize the pipeline delay through the FIFO, we start to read the captured data before all camera pixels have been transferred, such that by the time the last pixel is collected, the spectrum is transferred to the next stage. The delay through the input FIFO is ~3120 clock cycles.
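The depth requirement can be checked numerically. The sketch below assumes the form $M \ge N\,(1 - f_p/f_m)$, chosen because it reproduces the 10 240-pixel limit quoted above for $M$ = 2048, $f_m$ = 100 MHz, and $f_p$ = 80 MHz; the inequality and the function names are assumptions for illustration, not a statement of the firmware logic.

```python
def min_fifo_depth(num_pixels, f_master_hz, f_pixel_hz):
    """Smallest integer depth M satisfying the assumed bound M >= N * (1 - f_p / f_m)."""
    if not 0 < f_pixel_hz <= f_master_hz:
        raise ValueError("pixel clock must be positive and no faster than the master clock")
    numerator = num_pixels * (f_master_hz - f_pixel_hz)
    return -(-numerator // f_master_hz)  # integer ceiling division


def max_pixels(fifo_depth, f_master_hz, f_pixel_hz):
    """Largest N that a FIFO of the given depth supports under the same assumed bound."""
    if not 0 < f_pixel_hz < f_master_hz:
        raise ValueError("pixel clock must be positive and slower than the master clock")
    return fifo_depth * f_master_hz // (f_master_hz - f_pixel_hz)


if __name__ == "__main__":
    # Clock values from the text, expressed as integers in hertz.
    print(max_pixels(2048, 100_000_000, 80_000_000))
    print(min_fifo_depth(2048, 100_000_000, 80_000_000))
```

Integer arithmetic is used so that the result does not depend on floating-point rounding of the clock ratio.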
Background subtraction
Background subtraction is performed to remove fixed pattern (or fixed frequency) noise in the FDOCT spectrum. This noise can arise from external sources (e.g., room lights, stray light reflections into the camera, and/or auxiliary electronics) and from the camera itself. The host processor software captures an unprocessed background frame without an object in the sample path (or with the sample blocked), computes the average pixel values across all A-scans in the frame (typically 512 or 1024 A-scans), and transfers the average background vector [i.e., one-dimensional (1D) array] to the FPGA memory. To conserve FPGA resources, this is performed in the host processor software prior to image acquisition rather than in the FPGA in real time. Once in FPGA memory, the background spectrum is subtracted pixel by pixel from all subsequently acquired spectra. The pipeline delay through the background subtraction stage is negligible (approximately four clock cycles).
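The host-side averaging and the per-pixel subtraction can be sketched in a few lines. This is a floating-point NumPy illustration of the arithmetic only; the function names and the (A-scans, pixels) array layout are assumptions, and the FPGA itself works in fixed point.

```python
import numpy as np


def average_background(background_frame):
    """Average a (num_ascans, num_pixels) background frame across A-scans."""
    return np.asarray(background_frame, dtype=np.float64).mean(axis=0)


def subtract_background(spectra, background_vector):
    """Subtract the 1D background vector from every spectrum (row) pixel by pixel."""
    return np.asarray(spectra, dtype=np.float64) - background_vector


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fixed_pattern = 500.0 + 50.0 * np.sin(np.arange(2048) / 40.0)
    background = fixed_pattern + rng.normal(0.0, 1.0, size=(512, 2048))
    vector = average_background(background)
    residual = subtract_background(background, vector)
    print(vector.shape, float(np.abs(residual.mean(axis=0)).max()))
```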
Spectrum interpolation
Interpolation is performed on the background subtracted spectrum using a previously calculated interpolation vector that is uploaded to the FPGA memory during initialization. As discussed in Sec. IIA, interpolation transforms the spectrum from $\lambda$- to $k$-space.
Interpolation can normally be accomplished with a few multiplications and additions. However, in practice, inside an FPGA with only fixed-point calculation capability and no general purpose scratch memory, implementation is not straightforward. Our approach is to reorganize the interpolation data into an index array ($X_1$) and a fixed-point interpolation coefficient array ($X_2$). We utilize three BlockRAM memory locations inside the FPGA for this calculation: one for holding the spectrum data ($D_{IN}$) and two to store the modified interpolation arrays ($X_1$ and $X_2$), as well as dedicated multipliers and adders to perform the interpolation operation. The output of the interpolation for each pixel is
where $D_{IN}$ and $D_{OUT}$ are the spectra before and after interpolation. The pipeline delay through the interpolation step is only ten clock cycles.
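One way an index array and a fixed-point coefficient array can drive an interpolation is sketched below. It assumes linear interpolation between neighboring input pixels and a 12 bit fractional coefficient; both choices, and the function names, are hypothetical and are not confirmed as the scheme used in the firmware.

```python
import numpy as np


def build_interp_tables(src_positions, frac_bits=12):
    """Split fractional source positions into an index array X1 and a fixed-point weight array X2."""
    pos = np.asarray(src_positions, dtype=np.float64)
    x1 = np.floor(pos).astype(np.int64)
    x2 = np.round((pos - x1) * (1 << frac_bits)).astype(np.int64)
    return x1, x2


def interpolate(d_in, x1, x2, frac_bits=12):
    """Assumed linear form: D_OUT[i] = D_IN[X1[i]] + X2[i] * (D_IN[X1[i] + 1] - D_IN[X1[i]])."""
    d_in = np.asarray(d_in, dtype=np.int64)
    upper = np.minimum(x1 + 1, d_in.size - 1)  # clamp the neighbor at the last pixel
    diff = d_in[upper] - d_in[x1]
    return d_in[x1] + ((x2 * diff) >> frac_bits)  # integer multiply, then rescale


if __name__ == "__main__":
    spectrum = np.array([0, 100, 200, 300])
    x1, x2 = build_interp_tables([0.0, 0.5, 1.25, 3.0])
    print(interpolate(spectrum, x1, x2))
```

Precomputing both tables on the host leaves only one table lookup, one subtraction, one multiplication, and one addition per output pixel for the hardware.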
High pass filtering
A high-pass filter (HPF) is used to remove the large dc and low-frequency components of the spectrum. It also eliminates low-frequency noise in the sample arm, which is not removed with background subtraction. The HPF is realized using a 21-tap finite impulse response (FIR) filter architecture with a cut-off frequency of 1 MHz. The filter core operates at the master clock rate of 100 MHz. The pipeline delay through the HPF is equal to the number of taps present in the FIR filter, or 21 clock cycles in our filter implementation.
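A 21-tap high-pass FIR with these nominal parameters can be sketched with a windowed-sinc design followed by spectral inversion. The design method, the Hamming window, and the function names are assumptions for illustration; the actual filter coefficients are not given in the text.

```python
import numpy as np


def highpass_fir(num_taps=21, cutoff_hz=1e6, sample_rate_hz=100e6):
    """Windowed-sinc low-pass design turned into a high-pass filter by spectral inversion."""
    if num_taps % 2 == 0:
        raise ValueError("spectral inversion needs an odd number of taps")
    n = np.arange(num_taps) - (num_taps - 1) / 2
    fc = cutoff_hz / sample_rate_hz  # normalized cutoff in cycles per sample
    lowpass = 2.0 * fc * np.sinc(2.0 * fc * n) * np.hamming(num_taps)
    lowpass /= lowpass.sum()  # unity gain at dc
    highpass = -lowpass
    highpass[(num_taps - 1) // 2] += 1.0  # delta minus low-pass gives zero gain at dc
    return highpass


def apply_fir(signal, taps):
    """Filter one spectrum, keeping only samples where the taps fully overlap the data."""
    return np.convolve(np.asarray(signal, dtype=np.float64), taps, mode="valid")


if __name__ == "__main__":
    taps = highpass_fir()
    n = np.arange(2048)
    spectrum = 1000.0 + 100.0 * np.cos(2.0 * np.pi * 0.1 * n)  # dc pedestal plus fringe
    filtered = apply_fir(spectrum, taps)
    print(float(filtered.mean()), float(np.abs(filtered).max()))
```

With so few taps relative to the ratio of sample rate to cut-off, the transition band is broad; the filter behaves much like subtraction of a local mean, which is sufficient to remove the dc pedestal.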
Dispersion compensation
The standard method for dispersion compensation is to add dispersive material to the reference arm to match that in the sample arm. Fixed hardware dispersion compensation only corrects the average dispersion in a subject or sample population, where a significant variation can exist (e.g., length of the eye for ophthalmic applications). As discussed in Secs. IIA and IIC, group velocity dispersion mismatch between the interferometer sample and reference arms, which causes broadening of the sample profile, can be corrected numerically because the phase of the spectrum is accessible in FDOCT instruments. Dispersion compensation is accomplished off line with an iterative optimization routine (see Sec. IV C). The dispersion compensation step includes acquisition of a single frame, optimization of the acquired frame, phase correction calculation, and transfer of the complex dispersion compensation vector to the BlockRAM memory of the FPGA. The complex dispersion compensation vector is then multiplied pixel by pixel with all subsequently acquired spectra to correct their phase. Two dedicated multipliers are utilized to perform the complex multiplication. The pipeline delay through the dispersion compensation block is six clock cycles. The dispersion compensation vector is also used for windowing (Hamming, Hanning, etc.) prior to Fourier transformation.
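The pixel-by-pixel complex multiplication, with the window folded into the same vector, can be illustrated as follows. The sketch assumes a purely second-order phase error on a normalized pixel axis and a synthetic single-reflector fringe; the names and the 30 rad coefficient are hypothetical.

```python
import numpy as np


def compensation_vector(num_pixels, second_order_rad, window=None):
    """Complex vector w * exp(-i * phi) for an assumed quadratic phase error phi = c * x^2."""
    x = (np.arange(num_pixels) - num_pixels / 2) / (num_pixels / 2)  # normalized axis in [-1, 1)
    weights = np.ones(num_pixels) if window is None else np.asarray(window, dtype=np.float64)
    return weights * np.exp(-1j * second_order_rad * x ** 2)


def apply_compensation(spectrum, vector):
    """Multiply a real spectrum by the complex compensation vector, pixel by pixel."""
    return np.asarray(spectrum) * vector


if __name__ == "__main__":
    n = 1024
    x = (np.arange(n) - n / 2) / (n / 2)
    # Synthetic fringe from one reflector (bin 200) with 30 rad of quadratic phase error.
    fringe = np.cos(2.0 * np.pi * 200 * np.arange(n) / n + 30.0 * x ** 2)
    window = np.hamming(n)
    uncorrected = np.abs(np.fft.fft(fringe * window))[: n // 2]
    corrected = np.abs(np.fft.fft(apply_compensation(fringe, compensation_vector(n, 30.0, window))))[: n // 2]
    print(float(uncorrected.max()), float(corrected.max()), int(np.argmax(corrected)))
```

Only the retained cross-correlation term is sharpened; its mirror image acquires twice the phase error, which is harmless because that half of the transform is discarded.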
Fast Fourier transform
A Fourier transform is performed on each A-line captured by the linear detector to produce an axial depth profile of the sample. A discrete Fourier transform is computed inside the FPGA using an efficient 2048-point FFT core. The FFT core uses the Cooley-Tukey computation algorithm. 21 The FFT IP core from Xilinx Inc. has three distinct architecture options available: (1) pipelined, streaming input/output (I/O), (2) radix-4, burst I/O, and (3) radix-2, burst I/O. They offer a tradeoff between core size and transform time.
The pipelined, streaming I/O core provides continuous data processing with the ability to simultaneously perform transform calculations on the current frame, load data for the next frame, and unload the results of the previous frame. The latency through the core is minimal and is equal to the transform size times the master clock period. However, resource utilization (the number of slices consumed) is much larger than for the other two FFT core architectures.
The radix-4, burst I/O core offers a load/unload phase and a processing phase and uses one radix-4 butterfly processing engine. These two processes cannot be run concurrently and therefore must be pipelined. This core is smaller in size compared to the pipelined, streaming I/O core, but has a longer transform time. The number of clock cycles required for data load and transform is slightly longer than two times the transform length.
The radix-2, burst I/O core uses one radix-2 butterfly processing engine and has a two-phase process (load/unload and process) similar to the radix-4 core. While it uses the minimum of logic resources, it has the longest transform time of the three cores. The number of clock cycles required for data loading and transform is approximately six times the transform size.
For the spectrometer-based FDOCT systems, the radix-4, burst I/O FFT core was used because of extensive FPGA resource utilization in the faster pipelined, streaming I/O FFT core. The pipeline delay through the radix-4, burst I/O FFT core is 7288 clock cycles for a 2048-point FFT. For the swept source-based FDOCT systems, the pipelined, streaming I/O FFT core was used. The final processing step is calculation of the magnitude of the Fourier transformed data. The magnitude is calculated using a coordinate rotation digital computer square root block.
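In software terms, the last two operations amount to a 2048-point transform followed by a magnitude calculation. The sketch below uses NumPy's floating-point FFT in place of the fixed-point core and CORDIC block, and it assumes that only the non-mirrored half of the transform (1024 points) is kept for display; the function name is hypothetical.

```python
import numpy as np


def axial_profile(spectrum, fft_size=2048):
    """Zero-pad to fft_size, transform, and return the magnitude of the non-mirrored half."""
    spectrum = np.asarray(spectrum)
    if spectrum.size > fft_size:
        raise ValueError("spectrum is longer than the transform size")
    transformed = np.fft.fft(spectrum, n=fft_size)
    return np.abs(transformed[: fft_size // 2])


if __name__ == "__main__":
    n = 2048
    fringe = np.cos(2.0 * np.pi * 100 * np.arange(n) / n)  # single reflector at depth bin 100
    profile = axial_profile(fringe)
    print(profile.shape, int(np.argmax(profile)))
```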
Data transfer
After the complete FDOCT signal processing sequence is executed, the axial depth profile is loaded into output FIFO memory. This memory functions as a buffer between the FPGA hardware and the computer frame grabber. The processed data cannot be sent to the frame grabber at the 100 MHz data processing speed because the Camera Link specification limits the maximum data transfer clock rate to 85 MHz. Since the rate at which the memory is filled is greater than the rate at which it is read, the FPGA FIFO controller configures data transfer to prevent data over-run. Pipeline delay through this block is minimized with efficient, overlapping write and read processes by the FIFO controller.
C. Host processor software
The host processor software performs four primary functions. First, it does final image processing, scaling, and display to preserve FPGA resources that are expensive and somewhat limited. Second, it allows the user to communicate with the FPGA firmware to pass certain hardware initialization and processing parameters. Third, it calculates certain matrices needed for FDOCT processing (e.g., background and dispersion compensation). Fourth, it performs a number of auxiliary functions related to the user interface. The host processor software and graphical user interface were developed with LABVIEW (National Instruments, Inc.) running some MATLAB (MathWorks Inc.) scripts.
The primary task of the host processor software is to transfer the FDOCT images frame by frame from the frame grabber memory over the PCI Express bus to the computer internal memory for display on the monitor at the set frame rate with no significant delays. A custom camera file is used to configure the frame grabber settings to receive the FDOCT data properly from the real-time processing board over the Camera Link interface. The image processing performed by the host software includes 16-to-8 bit scaling, logarithmic scaling, and image rotation prior to display. In normal operation, the user can receive either the raw spectrum (unprocessed data from the linear detector) or the fully processed FDOCT images. In debug mode, the data after any of the intermediate processing steps can also be accessed.
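The host-side display steps can be sketched as a single function. The 60 dB display range, normalization to the frame peak, and the 90 degree rotation are assumptions made for illustration; the text does not specify these parameters.

```python
import numpy as np


def to_display(frame_u16, floor_db=-60.0):
    """Log-scale a 16 bit magnitude frame, map it to 8 bits, and rotate it for display."""
    frame = np.asarray(frame_u16, dtype=np.float64)
    peak = max(float(frame.max()), 1.0)
    # Decibels relative to the frame peak; values below 1 count are clamped to avoid log(0).
    level_db = 20.0 * np.log10(np.maximum(frame, 1.0) / peak)
    scaled = (np.clip(level_db, floor_db, 0.0) - floor_db) / (-floor_db) * 255.0
    return np.rot90(np.round(scaled).astype(np.uint8))  # assumed 90 degree rotation


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    frame = rng.integers(0, 65536, size=(512, 1024), dtype=np.uint16)  # A-scans x axial pixels
    image = to_display(frame)
    print(image.shape, image.dtype, int(image.min()), int(image.max()))
```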
The host processor software also serves as an interface to allow communication with the real-time processing board via the RS-232 interface. This enables initialization, processing, and hardware timing and control tasks to be accomplished. During the initialization process (or when a new scan type is configured), the user selects certain imaging parameters (e.g., scan type, scan size, frame rate, number of A-scans) that are converted to base parameters (e.g., camera integration time, line rate, number of lines, and galvanometer voltage) and passed to the FPGA to set up camera and galvanometer timing and control signals. This enables more generic operation, as only one low level routine needs to be reconfigured for instrument operation with linear detectors from different manufacturers. Other camera attributes are either set by the user (analog gain, digital gain and offset, and test pattern) or hardwired (pixels, pixel frequency, and bit depth, which is always set for the maximum dynamic range, usually 12 bit). The x-y galvanometer waveform array is calculated in the host processor software during initialization, downloaded to the FPGA BlockRAM memory, and output to the galvanometers via the DAC IC. The OCT scan x-y offset is controlled by the user from a keypad on the front panel. Since the digitizer automatically detects the line synchronization signal from the swept source, no digitizer parameters need to be set during initialization.
Besides final FDOCT image processing and hardware initialization, the host processor software is also responsible for calculation and transfer to the FPGA of three arrays used in the FDOCT image processing chain: the background, interpolation, and dispersion compensation matrices. A background frame is acquired, averaged across all A-lines, and the resultant vector downloaded to the FPGA. The background is subtracted from all subsequently acquired A-lines. The 1D interpolation vector is determined a priori from the optical model or empirically and downloaded to the FPGA during the initialization phase.
A custom numerical dispersion compensation routine is run in a MATLAB script to automatically correct the phase of the spectrum for dispersion mismatch. The phase of the original spectrum is iteratively adjusted with a corrective phase curve generated from orthogonal polynomials. Each of seven orders of polynomial coefficients is independently optimized by minimization of the reciprocal of the sum of the variance of the spectrum across a number of A-lines in the image. Maximization of the pixel variance is equivalent to maximizing the image contrast, and is a standard image processing method. 22 Similar results have been achieved with other OCT dispersion compensation algorithms. 14, 15 Once the final coefficients are found, a fixed phase correction array is calculated and downloaded to the FPGA BlockRAM memory, where it is applied to the phase of each subsequently acquired A-line prior to Fourier transformation. Since the optimization routine can take up to a minute for an image depending upon the image size, it is applied off line only once for each subject. The dispersion correction array can be saved and used for the same sample or subject in subsequent imaging sessions. The user also has the option of applying a window to the complex dispersion array that will be applied to the spectral data prior to Fourier transformation.
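The optimization loop can be sketched as follows. This is not the authors' MATLAB script: the Legendre basis, the choice to start at second order, the grid-search ranges, and all function names are assumptions made for illustration. The sketch keeps the elements stated in the text: a polynomial phase curve, coefficients optimized one at a time, and a cost equal to the reciprocal of the image variance.

```python
import numpy as np
from numpy.polynomial import legendre

FIRST_ORDER = 2  # assumption: orders 0 and 1 (offset, shift) are not optimized


def phase_curve(coeffs, n_pixels):
    """Corrective phase (radians) as a sum of Legendre polynomials on [-1, 1]."""
    x = np.linspace(-1.0, 1.0, n_pixels)
    padded = np.concatenate([np.zeros(FIRST_ORDER), coeffs])
    return legendre.legval(x, padded)


def correction_array(coeffs, n_pixels):
    """Fixed complex array multiplied into every spectrum before the FFT."""
    return np.exp(-1j * phase_curve(coeffs, n_pixels))


def cost(coeffs, spectra):
    """Reciprocal of the variance of the image magnitude across the A-lines."""
    corrected = spectra * correction_array(coeffs, spectra.shape[1])
    image = np.abs(np.fft.fft(corrected, axis=1))
    return 1.0 / image.var()


def optimize_coefficients(spectra, n_orders=7, span=20.0, n_steps=81, n_passes=2):
    """Optimize each coefficient independently with a 1D grid search."""
    coeffs = np.zeros(n_orders)
    grid = np.linspace(-span, span, n_steps)
    for _ in range(n_passes):
        for i in range(n_orders):
            costs = []
            for value in grid:
                trial = coeffs.copy()
                trial[i] = value
                costs.append(cost(trial, spectra))
            coeffs[i] = grid[int(np.argmin(costs))]
    return coeffs


# Synthetic data: one reflector plus a known second-order dispersion mismatch.
n_pixels = 256
pixel = np.arange(n_pixels)
mismatch = phase_curve(np.array([12.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]), n_pixels)
a_line = np.cos(2.0 * np.pi * 40.0 * pixel / n_pixels + mismatch)
spectra = np.tile(a_line, (4, 1))

best = optimize_coefficients(spectra)
cost_before = cost(np.zeros(7), spectra)
cost_after = cost(best, spectra)
print(cost_before, cost_after)
```

A grid search is used here only because it is easy to follow; any 1D minimizer could take its place. Once the coefficients are fixed, `correction_array` yields the single array that would be stored and applied to each acquired A-line.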
The last task of the host processor software is operation of user interface functions. These include adjusting the OCT image scale and color map (e.g., gray, inverse gray, standard OCT), loading the dispersion compensation array from and saving it to a text file, applying OCT scan offsets, moving hardware stages (e.g., reference arm stage), moving a fixation target (for ophthalmic applications), displaying the spectrum or profile waveforms, saving the image data to a binary file or image file (jpeg, bmp, etc.), and management of the subject or sample database.
V. RESULTS

A. FDOCT image processing simulation
During the FPGA signal processing hardware development phase, the FDOCT system was simulated and verified in the MATLAB SIMULINK environment using the System Generator IP core blocks from Xilinx Inc. Actual FDOCT spectrum data were used as test input for the simulation. The first simulation task was to determine the speed of each of the three FFT cores. Table I lists the simulation results for a 2048-point transform for all three cores.
Each FDOCT step is executed sequentially in our pipelined signal processing architecture. Therefore, the second simulation task was to determine the delay through each step and through the entire processing sequence. Figure 3 shows the test input and simulation results at various points in the FDOCT signal processing chain. Table II summarizes the simulated pipeline delay through each FDOCT signal processing block inside the FPGA. For a pipelined system such as ours, the overall system performance is limited by the block with the longest delay. Predictably, the FFT block has the longest delay and hence limits the maximum achievable FDOCT line rate. The FFT delay of 72.88 μs translates to 13 721 lines/s, or 27 fps for an FDOCT image with 512 A-lines. Complete agreement was found between the simulated and actual system performance.
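The relation between the bottleneck delay and the achievable rates is simple arithmetic, shown here as a short sketch. The function name is hypothetical; the only inputs are the FFT delay and the number of A-lines per frame quoted above.

```python
def pipeline_rates(bottleneck_delay_s, lines_per_frame):
    """Return (line rate, frame rate) when the slowest pipeline block sets the pace."""
    line_rate = 1.0 / bottleneck_delay_s  # one line per bottleneck delay
    frame_rate = line_rate / lines_per_frame
    return line_rate, frame_rate


line_rate, frame_rate = pipeline_rates(72.88e-6, 512)
print(f"{line_rate:.0f} lines/s, {frame_rate:.1f} fps")  # 13721 lines/s, 26.8 fps
```

The frame rate of about 26.8 fps rounds to the 27 fps quoted for a 512 A-line image.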
B. FDOCT imaging
The real-time FDOCT processing hardware was integrated into and tested in three systems: a spectrometer-based adaptive optics FDOCT (AO-FDOCT) system for retinal imaging, 23,24 a spectrometer-based dual-beam Doppler FDOCT microscope for imaging zebrafish (Danio rerio), 25 and a swept source-based FDOCT microscope. 26 In both spectrometer-based systems, the board achieved a real-time frame rate of 27 fps for fully processed FDOCT images (1024 axial pixels × 512 lateral A-scans). The swept source was capable of operation at 10 and 20 kHz and the digitizer was configured for operation at 10 kHz. In the swept source-based system, a real-time frame rate of 19.5 fps, determined by the lower line rate of the swept source (10 kHz), was achieved for a similarly sized image. The retinal images show discrimination of many depth layers, including the strong OCT signal from the junction of the inner and outer segments of the photoreceptors and from the retinal pigment epithelium. The ventricle and ventral aorta of a zebrafish are indicated in Figs. 4(g) and 4(h). The composite image was generated through 18 cardiac cycles, and blurring in the image is due to the motion of the heart and blood flow. Single swept source images from a finger tip and excised intestinal tissue are shown in Fig. 5. The swept source images of the finger tip in Fig. 5(a) resolve the epidermis, dermis, and structure near several sweat glands. In Fig. 5(b), two ducts and several layers of intestinal tissue are also resolved.
VI. DISCUSSION
Development of compact DSP hardware customized for FDOCT image processing allows real-time display at rates limited only by the linear detector or swept source and without noticeable latency. This is important for most OCT applications, including ophthalmology, but especially for those designed for rapid screening or image-guided biopsy/surgery. Not only does this efficiently utilize computational resources, but it also makes an instrument more amenable to portable or hand-held operation. In preliminary testing in both spectrometer and swept source-based FDOCT systems, we demonstrated operation at real-time display rates that are comparable with the maximum acquisition rates of other clinical and commercial systems. The utility of the real-time processing hardware was demonstrated in the acquisition of high resolution images from human and monkey retina, zebrafish heart, human fingertip, and human intestinal tissue.
There are still several limitations in the current implementation where further refinement and development are desirable. First, for the spectrometer-based system, the maximum line rate (13.7 kHz) is limited by the FPGA and not the camera. For the swept source-based system, the faster FFT core allows much faster line rates, up to 49 kHz (for a single channel), and indicates the potential of the processing hardware. Second, several FDOCT image processing steps could be simplified and/or enhanced. A real-time background subtraction scheme that requires no user action or input is desirable. The reference arm background intensity that needs to be removed is not variable (i.e., it is a fixed fringe spatial frequency pattern), and so it is not necessary to block the sample arm to remove this signal for biological targets. (It is necessary to block the sample for artificial targets such as mirrors or windows that have a signal at a single fixed fringe frequency.) The additional off-line step of recording the background is therefore not usually necessary, and real-time background subtraction could be performed with adequate FPGA resources. Third, for the spectrometer-based FDOCT system, the maximum number of lines is limited to 2048 because of FPGA resources. Fourth, the raw spectrum and processed axial profile cannot be transferred to the frame grabber simultaneously. This may be critical for additional processing that requires access to the spectrum, for example, phase processing for the generation of Doppler blood flow maps.
The majority of these limitations result from the choice of a Virtex-4 FPGA with a smaller number of CLBs in the smaller-footprint minimodule format. Although this allowed rapid prototyping, it also restricted both the choice of FFT core (radix-4 burst I/O rather than pipelined streaming I/O) and the amount of additional image processing that could be accomplished in real time.
A second generation FDOCT processing board is currently under development and will address the limitations discussed above. A higher performance FPGA with more CLBs is used in the next generation design. Xilinx Inc. now offers Virtex 5 FPGAs with higher performance and a more optimized hardware architecture that will allow the use of increased clock speeds with a much denser design. For example, the Virtex 5 FPGA has lookup tables with six independent inputs (compared to four for the Virtex 4 FPGA) that we estimate will enable a clock rate increase of at least 25% with a higher density design and fewer logic levels. In addition to use of a faster FFT core, this will also preserve resources for advanced image processing functionality.
In order to simultaneously transfer the raw spectrum and processed axial profiles (or data that has been processed with different methods, e.g., reflectance and phase maps), the multitap capabilities of the Camera Link interface will be more fully utilized. While the maximum transmission rate is only 127.5 Mbyte/s for a single-tap 12 bit camera, the rate increases to 510 Mbyte/s for a four-tap source. Future versions of the Camera Link interface will support data transfer rates to 2.3 Gbyte/s. The next generation processing board will use the full (versus base) Camera Link configuration. On the input side, this will enable use of faster linear detectors via parallel multitap acquisition. Multiple FDOCT processing engines operating in parallel will be instantiated inside a single FPGA and a memory controller/arbiter module will be designed to manage acquisition, processing, and buffering of image lines. On the output side, the full Camera Link configuration will allow simultaneous transfer of raw spectrum and processed profile to the frame grabber. Collection of the raw spectrum and further off-line processing is a feature that may be confined to research FDOCT applications. For clinical uses, it is usually desirable to save only the processed information, for reasons of simplicity, data reduction, and limited database archive space. We have also begun development of algorithms for advanced FDOCT image processing, including real-time calculation and display of phase maps.
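The Camera Link rates quoted above scale linearly with the number of taps, as the following sketch shows. The 85 MHz pixel clock is an assumption (it is the maximum Camera Link pixel clock and reproduces the quoted single-tap figure); it is not stated in this section, and the function name is hypothetical.

```python
def camera_link_rate_mbyte_s(taps, bits_per_pixel=12, pixel_clock_mhz=85.0):
    """Raw pixel throughput in Mbyte/s for a given number of parallel taps."""
    # Each tap delivers one pixel per clock; divide by 8 to convert bits to bytes.
    return taps * pixel_clock_mhz * bits_per_pixel / 8.0


single_tap = camera_link_rate_mbyte_s(taps=1)
four_tap = camera_link_rate_mbyte_s(taps=4)
print(single_tap, four_tap)  # 127.5 510.0
```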
