Abstract-The reconfigurable computing paradigm, which exploits the flexibility and versatility of field-programmable gate arrays (FPGAs), has emerged as a powerful solution for speeding up time-critical algorithms. This paper describes a reconfigurable computing solution for processing raw mass spectrometric data generated by MALDI-TOF instruments. The hardware-implemented algorithms for denoising, baseline correction, peak identification, and deisotoping, running on a Xilinx Virtex-2 FPGA at 180 MHz, generate a mass fingerprint that is over 100 times faster than an equivalent algorithm written in C, running on a Dual 3-GHz Xeon server. The results obtained using the FPGA implementation are virtually identical to those generated by a commercial software package MassLynx.
Peptide Mass Fingerprinting Using
Field-Programmable Gate Arrays
István A. Bogdán, Daniel Coca, and Rob J. Beynon
Abstract-The reconfigurable computing paradigm, which exploits the flexibility and versatility of field-programmable gate arrays (FPGAs), has emerged as a powerful solution for speeding up time-critical algorithms. This paper describes a reconfigurable computing solution for processing raw mass spectrometric data generated by MALDI-TOF instruments. The hardware-implemented algorithms for denoising, baseline correction, peak identification, and deisotoping, running on a Xilinx Virtex-2 FPGA at 180 MHz, generate a mass fingerprint that is over 100 times faster than an equivalent algorithm written in C, running on a Dual 3-GHz Xeon server. The results obtained using the FPGA implementation are virtually identical to those generated by a commercial software package MassLynx.
Index Terms-Biomedical computing, field-programmable gate arrays (FPGAs), mass spectrometry, optimization methods, proteins.
I. INTRODUCTION

M
ASS spectrometry has evolved rapidly in the past decades, becoming one of the most reliable proteomics research tools. The amount of data that is currently generated by mass spectrometers around the world is growing at increasing rates. Processing and interpreting mass spectrometric data rely on computer algorithms and proteomic databases running on standard microprocessor systems.
Peptide mass fingerprinting is a protein identification technique in which mass spectrometry is used to determine the masses of peptide fragments generated by specific digestion. The proteins are identified by matching the measured molecular masses of peptide fragments against theoretical peptides generated from protein-sequence databases. Peptide mass fingerprinting involves two basic operations, namely, the processing of raw spectra to derive a mass fingerprint and using the mass fingerprint to search the protein database for a possible match. A correlation score is computed between the database entries and the unknown peptide fragment mass list. The matches with the highest score are the final protein list to be returned to the user. While the processing time for performing protein identification is relatively low, it is still greater than the spectra acquisition time and is limited by the microprocessor clock frequency.
The most effective approach to speed up computations involves the development of dedicated hardware processors that are optimized to perform specific algorithms. The speed up in computation, compared with the standard sequential microprocessor, is achieved by concurrent implementation of different arithmetic and logic operations that make up a computational loop and by concurrent execution of several computation loops. A major drawback of this approach used to be the prohibitive costs associated with manufacturing a dedicated integrated circuit [application-specific integrated circuit (ASIC)].
The hardware implementation approach has become a costeffective solution thanks to the availability of high-density fieldprogrammable gate arrays (FPGAs) and of high-level system design and development tools, which make the implementation of very complex hardware designs possible with almost the same ease as the software implementation. An FPGA is a large-scale IC that can be programmed (and reprogrammed) after it has been manufactured.
Early attempts to use FPGA devices in biocomputation were made to accelerate gene-sequence analysis [7] . FPGAs, which are well suited for high-performance, high-bandwidth, and parallel-processing applications, have been successfully employed to speed up DNA sequencing algorithms [8] - [10] , [6] , [11] , [13] . FPGAs were also used in the attempt to accelerate the search of substrings similar to a template in a proteome [12] . More recently, FPGAs have been used to accelerate sequence database searches with MS/MS-derived query peptides [4] . This hardware-based solution can reportedly locate a query within the human genome about 32 times faster than a software implementation running on a 2.4-GHz processor. A hardware-sequence alignment tool implemented in FPGA is also available [14] . This paper describes the design and hardware implementation of a raw spectra processor which performs all computational tasks involved in the generation of a mass signature from a raw spectrum, namely, smoothing, peak detection, and deisotoping. The processor, which is implemented on a Xilinx XC2V8000 FPGA and runs at 180 MHz, achieves more than 100 fold speedup compared with a C software implementation running on a dual 3-GHz Xeon Server with 4-GB of memory. In an earlier paper [15] , we have successfully tested the implementation in terms of peak extraction and deconvolution accuracy against 1932-4545/$25.00 © 2008 IEEE commercial software implementations. This paper focuses in more detail on the actual processor design.
II. ALGORITHM DESCRIPTION
Following specific protein digestion, a MALDI-TOF mass spectrometer generates pairs of mass-to-charge (m/z) and abundance values symbolized with . Typically, the number of points in the spectrum ranges from a few thousand to a few hundred thousand. The determination of experimental peptide masses (the so-called peptide mass fingerprint) requires relatively complex processing of the raw mass spectrum in order to discriminate between spectral peaks that correspond to digested peptides and associated isotopes and the spurious peaks caused by noise and sample contamination.
The FPGA spectra processor was designed to implement, with some variations, an algorithm proposed in [1] which is used in a popular mass spectrometry software package. The current processor implements additional optional smoothing operation and uses a different algorithm to implement deisotoping [see Step 6) ] . The algorithm was found to be computationally efficient and well suited for hardware implementation but, in principle, other peak extraction algorithms could be considered for hardware implementation.
Step 1) Smoothing (Optional): Performing Savitzky-Golay smoothing over the raw input spectrum can reduce the effect of instrumentation noise. The algorithm is based on performing a least-squares linear regression fit of a polynomial of degree over at least data points around each point in the spectrum to smooth the data. The main advantage of this procedure is that it tends to preserve the shape of the signal peaks [3] . The smoothing operation is implemented as a standard finite-impulse-response (FIR) filter where is the size of the smoothing window, is the input data stream, is the FIR output, and are the time-varying filter coefficients. For a given filter of order and (odd) frame size , all of the coefficients needed to implement the smoothing operation to form a matrix . In a smoothing filter implementation, the last rows are used for the first data points, the first rows are used with the last data points, and the middle row is used with the rest of the spectrum. The Savitzky-Golay smoothing operation can be represented in matrix form as follows:
More details about this procedure can be found in the original paper [3] . Typically, for processing a raw spectrum, filters of order and a frame length were found to produce the best results for spectra having around 50 samples/(m/z). The hardware implementation allows the user to specify the size of the smoothing window and the corresponding filter parameters.
Step 2) Baseline and Noise Detection: The raw spectrum mass list is divided into small intervals of width , and local minimum , maximum , and abundances and their differences are computed [1] . In the equation that will be shown, the pairs correspond to data points where denotes the smallest integer greater than or equal to . For each integer mass in the spectrum , a symmetric window of width is placed around, and the baseline and noise levels are estimated as follows:
where is the index of the leftmost subinterval covered by and is the index of the rightmost subinterval covered by [1] . Subsequently, the signal-to-noise ratio (SNR) is computed for each spectral point where denotes the largest integer smaller than or equal to .
Step 3) Spectrum Segmentation: According to the signal-tonoise ratio (computed in the previous step) relative to a user adjustable threshold SN [1] , the spectrum is segmented into three categories: noise , support , and signal .
Step 4) Peak Detection: Peaks are constructed from data points that are signal or support points and are bounded by noise. A collection of data points used to construct a peak has the following criteria: or for . The center of the mass and relevant molecule abundance are computed for each constructed peak [1] as follows:
Step 5) Clustering: This involves grouping valid peaks into clusters. Two peaks and are in the same cluster if , where is a user-defined parameter (typically, ) [1] . A valid cluster has at least one peak with a data point in .
Step 6) Peak Deisotoping: The major difference compared to [1] is the method used by the FPGA processor to implement aggregation of natural isotopomers (due primarily to the natural abundance of C and N). The algorithm implemented in FPGA uses Poisson distributions to approximate the isotopic patterns for every peptide [2] . The expected proportional abundance of the heavier isotopes , with respect to the monoisotopic peaks , are computed as follows: (1) where is the theoretical abundance of the first isotope of is the abundance of the second isotope, etc. The best-fit Poisson models of isotopic distributions are shown to match those of theoretical distributions [2] .
The complete processing step for a cluster of consecutive peaks can be summarized as follows.
• The leftmost peak in the cluster is always considered to be a monoisotope [2] .
• Compute the expected abundance of the heavier isotopes [2] .
• Subtract these higher contributions from the actual abundances of the next peaks . Only the results higher than a threshold (ISOTHR) are retained for further processing [2] . These substeps are recursively repeated for the residual cluster until all of the residual peaks are less than the threshold [2] . For example, if becomes the next monoisotope and the steps are repeated. The procedure is illustrated on Fig. 1 . Here, a small cluster of four peaks is depicted in Fig. 1(a) . The first peak is considered monoisotope. Its isotopic distribution is shown in Fig. 1(b) . Fig. 1(c) shows the resulting residual cluster after subtracting (b) from (a).
III. HARDWARE IMPLEMENTATION
The block diagram of the hardware processor is depicted on Fig. 2 . The implementation has two major functional blocks: 1) a peak detection unit, which identifies all significant spectral peaks and 2) a peptide identification unit that generates the final list of peptide masses and associated abundances.
The peak detection unit implements smoothing, baseline, and noise-level estimation in order to discriminate between signal and noise peaks. The first block is a Savitzky-Golay smoothing filter [3] that implements the equations from the first algorithmic step. It has a user-defined window that can be chosen according to the instrument resolution setting (number of data points recorded per 1 m/z unit). The smoothing operation is optional; the user can specify if the data are preprocessed or not by the SGSEL input flag. A 43-tap FIR filter implements the Savitzky-Golay smoothing filter with coefficients that may be reloaded as user parameters in an LUT. The coefficients are loaded into the FIR on its dedicated inputs (COEF, LOAD, COEF WE). The first spectral abundances are loaded as coefficients and the last rows of the filter coefficient matrix are loaded on the FIR data input. Then, the middle row of the filter coefficients is loaded as coefficients and the spectrum is loaded at the FIR data input. Finally, the last data points are loaded as FIR coefficients, the last rows of the coefficient matrix are loaded as data input. The block diagram of the FIR filter, depicted on Fig. 3 , is implemented as a single channel highly parallel filter by using a Xilinx Logicore block [5] . While the abundances are smoothed, the mass list is delayed by the FIR latency. Processing time for this step is given by where is the spectrum length, is the clock period, and all constants inside the first round bracket are specified in the Xilinx LogiCore FIR implementation datasheet [5] .
The peak construction pipeline ( stages long) is depicted in Fig. 4 . It implements the algorithmic steps described in the previous section (Steps 2-4). The sorting FIFO detects the minimum and maximum abundances and computes their differences over a sliding window of length . It is implemented by using a filter with a structure similar to a median filter that sorts in ascending order its input data stream over its filter length. Instead of computing the median, the maximum and minimum values are found. Baseline and noise are computed over a bigger spectral interval of small windows of length Baseline and noise detection are implemented using 32-b pipeline dividers and accumulators. Spectrum segmentation is performed by computing an SNR for each delayed spectrum data point according to
Step 3). The result is then compared with the user-selectable threshold (SNTHR), and a classification flag coded on two bits is assigned for each data point. The flag can be 1, 2, or 3 depending on whether the respective data point is classified as noise, support, or signal, respectively. The original spectral points , their associated classification flag , and the baseline are aligned and fed into the peak construction state machine. Here, the centered mass and baseline-subtracted abundance of spectral peaks are computed according to algorithm Step 4).
Steps 1-4 are implemented in a pipeline manner and the total processing time is given by The last functional block of the peak detection unit-the cluster flag generator FIFO is depicted in Fig. 5 . Its role is to detect possible peak candidates that are isotopes of one or more singly charged chemical compounds, separated by the mass of a neutron. This group of peaks is called a cluster. Clustering involves grouping together peaks so that the m/z (mass to charge ratio) distance between two successive peaks is between and , where is a user-selectable value, typically set to 0.20. The circuit is a delay line for the input data peaks with a maximum length of .
It is assumed that the spectrum is sorted by increasing mass. The distance between the masses of all consecutive signal peaks starting with the lowest mass value is computed. To speed up computations, there are circuits that compute mass differences between and the following consecutive mass values in parallel. In our design, is an adjustable parameter, which is selected according to mass spectrometer resolution, to be larger than the maximum number of signal peaks that are registered within a window of m/z. Typically, about 50-100 samples/(m/z) are taken, so the FIFO length (60 120)/3 20-40 or less. The output of the circuit is a cluster flag of bits, which is generated for each peak . If the distance between and is within the range of m/z, the th bit in is set to 1, indicating that is a potential isotope of . If all of the bits in the flag are zero, this indicates that has no isotopes. The peptide identification unit consists of two dual-port random-access-memory (RAM) devices A and B (embedded RAM blocks), and two state machines: 1) a clustering and 2) a deisotoping machine. The output of the peak detection unit (peak mass , peak abundance , and cluster flag ) is stored in RAM (A) at consecutive addresses, starting from zero. After all peaks are stored, the clustering state machine switches the input multiplexer and clustering starts.
Clustering is simply a sorting process in which peaks belonging to the same cluster are grouped together and stored at consecutive memory locations. In addition, clusters are also indexed so that consecutive clusters are stored consecutively in the memory. If the th bit of a cluster flag at address is one, the peak at address is in the same cluster as the original peak form address . The process continues until the flag associated with a signal peak in the cluster only has zero entries. Clustering is implemented as a state machine that sequentially reads the first dual-port RAM (A). Each location stores the peak information as a 108-b word: 32 b for mass, 32 b for abundance, 32 b for the cluster flag, and 12 b for the cluster index. The cluster flag is used to calculate the memory location of the peaks that are part of the same cluster while the cluster index is an integer that uniquely identifies clusters.
The clustering process is illustrated in Fig. 6 . Here, the triplets in RAM (A) are generated by the cluster flag generator FIFO, explained earlier, and stored at consecutive addresses from 0 to . All of these peaks are analyzed starting from address 0. A 12-b counter that stores the maximum cluster index during operations is reset to 1. In this example, [ Fig. 6(a) ], the bit of the cluster flag is one. This indicates that the peak stored at address is a heavier isotope of the first peak . In the first step, the mass and abundance of the first peak from location 0 in RAM (A) is written to location 0 from RAM (B), and the cluster index of the peak at the first location of RAM (B) is assigned to the clustering index counter value (1-in our case).
The cluster flag field, where was stored, is overwritten with the address of the next memory location in RAM (B), where the next peak from the cluster will be stored. The process is repeated for the next peak from cluster 1 which is stored at address in RAM (A). Since, in this example, the cluster flag of the peak is null, this means that there are no more peaks to be added to the current cluster (i.e., cluster 1 will consist of only two peaks), and the cluster index counter is incremented to 2. Clustering continues with the remaining peaks from RAM (A) that were not yet visited. In this example, the next peak that will be considered is address 1. Here, the cluster flag is 0, so there are no other peaks to be added to this cluster. The process continues until all of the peaks from RAM (A) are visited. As a result, RAM (B) will hold the same peak list, this time, it is rearranged in ascending order according to cluster index and spectral mass. The time required to perform clustering depends on the number of detected peaks in the spectrum, the number of identified clusters, and the number of peaks in a cluster. In the worst-case scenario, the processing time is given by where the multiplicative factor of 10 is given by the longest cycle of the clustering state machine.
The final processing step required to generate the peptide fingerprint is deisotoping. In practice, deisotoping is required be- cause one cluster may contain more than one peptide. In other words, the peaks in a cluster can be viewed as a superposition of isotopic distributions of two or more peptides. The deisotoping unit identifies all peptides (the mass of the monoisotope) within a cluster and calculates the total abundance of the peptide corresponding to each monoisotope. The hardware implementation of deisotoping is based on an approximation of isotopic patterns by Poisson distribution as presented in Step 6) of the algorithm in the previous section [2] . Starting with the first peak in a cluster (assumed to be a monoisotope), the algorithm generates the theoretical isotopic distribution based on peak height (abundance) and mass value. The computed abundance values are then subtracted from the original peaks at the corresponding m/z values. Following subtraction, any negative abundance value is set to zero. The procedure was illustrated in Fig. 1 . The step is then repeated, with the remaining (height-adjusted) peaks. At each step, the monoisotopic mass value and original and total abundance (the sum of the monoisotopic peak and its theoretical isotopic abundances) are recorded in the final peak list.
The deisotoping unit processes previously computed clusters from the dual-port RAM (B), writes back partial results in RAM (B), and the final peak list in RAM (A). When deisotoping starts, the first location from RAM (B) is read in the first step. The first peak of each cluster is considered monoisotope and is written to RAM (A), where the final peaks are stored. The theoretical abundance of the first heavier isotope is computed by using (1) and is used to update the total abundance field (used previously to store the cluster flag) as . Then, the next peak from the cluster is read from RAM (B) and the difference between the abundance of this peak and the theoretical abundance contribution of the previous monoisotope is computed. If the residual abundance is less than a user-defined threshold (ISOTHR), the peak is considered to be a higher isotope of the detected monoisotope and is no longer stored or processed.
However, if , the residual peak is stored back to RAM (B) to be analyzed as the second monoisotope (peptide) in the cluster. The identi- fication of the isotopes of the first peak in the cluster continues until all of the peaks in the cluster are visited.
The total abundance of the peptide is updated step by step until the last peak in the cluster is processed. Deisotoping of the first cluster continues with the second monistotope residual peak) identified in the cluster and so on.
This operation is performed for all clusters stored in RAM (B). When processing ends, the harvested peak list in RAM (A) is ready to be used to search the protein database.
If the number of detected peaks is and the number of clusters is , the minimal processing time for deisotoping is when each cluster has one peak . If each peak in each cluster is a monoisotope, each cluster is processed in cycles, where is the number of peaks in the cluster. If is the average number of peaks in a cluster, the processing time for deisotoping is where the multiplicative constant 28 is the length of the longest cycle in the deisotoping state machine.
An example of the spectrum fragment from the range of 1032. 5-1036.5 m/z is depicted in Fig. 7 . The figure shows all peaks in the cluster, the identified peptide masses, and the abundance associated with each peptide.
The implemented peak processor has a number of user-adjustable parameters, which are given in Table I . A block diagram of the complete FPGA solution, including the database search system, is shown in Fig. 8 . The implementation uses fixed-point arithmetic on 32 b with the fractional part on 12 b.
IV. RESULTS
The spectrum processor was implemented on an FPGA motherboard equipped with a Xilinx Virtex-II XC2V8000 FPGA (46 592 individual slices/8 million equivalent gates), 4-MB ZBT RAM communicating with the host PC server via a PCI interface (32 b, 33 MHz). The server is a Dual 3.06 GHz Xeon processor machine with 4-GB RAM.
The initial design was developed by using Xilinx's System Generator (6.3) for Matlab (7.0). The resulting VHDL code was refined and optimized by using Xilinx ISE Foundation (7.1) and Modelsim (SE 5.7d). The mass/abundance data were represented as unsigned integer numbers (32 b with 12 b after the radix point).
The user FPGA from the motherboard is used to implement the spectrum processor (Virtex-II XC2V8000). The actual design occupies 73% of total FPGA resources (34 139 slices) and runs at 180 MHz.
The 4 MB of ZBT RAM is enough to store 512 K samples of mass-abundance pairs of 32 b each.
The motherboard has sockets that can host additional FPGA modules hosting a Virtex-II XC2V8000 FPGA device and 1 GB of DDR SDRAM. These modules are used to host the database search engine [18] .
In the first instance, the hardware implementation was validated by comparing the results of processing mass spectrometric data generated by the FPGA implementation and the equivalent C software implementation of the algorithms. The reference C program was run on a dual 3.06-GHz Xeon processor server. In all tests, both implementations produced identical results.
The impact of the spectral length on processing time was measured by using spectra with various lengths but constant isotopic composition and noise levels. The results are summarized in Table II. Each C simulation was repeated 30 times and the average processing time is used here. The standard deviation for the speed gain is shown in brackets. The average speed gain of the FPGA implementation is 122.07. The processing times are shown in Fig. 9 . The FPGA processing time is unscaled, but the processing time of the C implementation is scaled down by a factor of 100.
It should be noted that the time elapsed for initializations of memory locations before the effective processing of data and disk-access time are not included in the software processing time. Only the processing of effective algorithmic steps is measured. Initializations add, on average, 30 ms to the C processing time, resulting in an increased average speed gain of 169. The maximum speed gain, however, could be as high as 200.
Another simulation shows the effect of the spectral complexity over performance. The spectral length was kept constant at 175 000 data points, while the isotopic complexity (the number of relevant peaks) was modified. A different noise profile was used here, compared to the previous simulations. The performance results are given in Table III . The results are also shown in Fig. 10 .
In this case, speed of the hardware processor is almost linear with insignificant fluctuation. On average, the speed gain is influenced by the spectral length and less influenced by the spectral profile. The peak detection unit is a pipeline so the performance of this block is linear of the spectral length and it does not depend on the spectral composition.
The performance of the peptide identification unit significantly depends on the spectral profile (noise, number of detected peaks, and their abundances). However, its processing time is insignificant-compared to total processing of the spectra by the peak detection unit.
The FPGA spectrum processor was compared to a commercial product MassLynx (Waters Corp.). A recombinant protein designed as an internal standard for multiplexed absolute protein quantification [16] , [17] was digested with trypsin to release 20-limit peptides of known identity. The digested material was analyzed by MALDI-ToF MS and the raw data were processed separately by using the MassLynx package and the FPGA processor.
The FPGA implementation and MassLynx correctly identified all of the peaks [15] . The intensities of the different peaks correlated well, irrespective of the method used to process the spectrum (sample correlation coefficient of 0.9874).
V. CONCLUSION
This paper presented a hardware solution, based on FPGA devices, to accelerate algorithms used in peptide mass fingerprinting. Results show that a significant increase in processing speed is achieved, compared to software implementations running on conventional microprocessor-based systems. Once the processing speeds are achieved, it is possible to implement realtime processing of spectra during DAQ.
