Experimental evaluation with real-world financial data shows that our architecture implemented on a Stratix-V FPGA achieves up to 40.97 times speed-up over LIBSVM on a Core i7-4770 CPU.
I. INTRODUCTION
Support Vector Machine (SVM) is one of the most widely used supervised machine learning techniques, with successful application to various classification and regression tasks [1]. To train an SVM, a Quadratic Programming (QP) problem constructed from the training dataset needs to be solved. The QP problem has a unique global optimal solution, determined by the Karush-Kuhn-Tucker (KKT) conditions [2].
Traditionally, SVM is trained using batch training methods, in which the training dataset is gathered and then the corresponding QP problem is solved [2]. The drawback is that whenever the training dataset changes, the QP problem needs to be solved from scratch again. For on-line tasks like financial trading, in which the training dataset itself is evolving, incremental training methods are more suitable than batch training methods. Incremental training updates the SVM from the existing KKT conditions to obtain the solution for the new training dataset, without solving the QP from scratch again. Incremental SVM training algorithms can be optimal or approximate, depending on whether the new SVM obtained from incremental updating corresponds to the exact new optimal QP solution or an approximate one. In this paper, we develop an optimal incremental SVM training algorithm [3] on FPGA.
Much existing work accelerating SVM training on FPGA focuses on batch training [4]. This paper presents the first hardware-accelerated SVM for incremental training. In particular, we focus on the ε-insensitive SVM for Regression (ε-SVR). The main novelty is a parametric parallel dataflow architecture with customisable arithmetic units for the dense linear algebraic operations involved in updating the KKT conditions. Specifically, we make the following contributions:
978-1-5090-5602-6/16/$31.00 ©2016 IEEE
• A novel dataflow architecture addressing the challenges of incremental SVM training on FPGA: random memory access, numerical accuracy, and list manipulation.
• A parallel data path for updating KKT conditions. Parallelism is adjustable to make the design scalable, and to trade off parallelism against resource usage.
• Implementation on Maxeler MPC-X2000 with an Altera Stratix-V FPGA and evaluation using real financial data. The proposed system is significantly faster than software.
The rest of this paper is organised as follows. Section II reviews related background. Section III details our dataflow architecture. Section IV presents experimental evaluation. Finally, Section V concludes and suggests future work.
II. BACKGROUND
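We focus on the ε-insensitive SVM for Regression (ε-SVR).
With training dataset $D = \{(x_i, y_i) \mid i = 1, \dots, N\}$, where $x_i \in \mathbb{R}^d$ is the input feature and $y_i \in \mathbb{R}$ the regression target, ε-SVR gives the following regression function:

$$f(x) = \sum_{i=1}^{N} \theta_i K(x_i, x) + b \qquad (1)$$

Here, $\theta_i$ and $b$ are obtained from the KKT conditions. $K(x, x') = \langle \Phi(x), \Phi(x') \rangle$ is the kernel function [2].
We define the training error as $h(x_i) = f(x_i) - y_i$. For the SVM training problem, the KKT conditions can be expressed as a relation between training error and coefficients [3]. These relations divide the training dataset into three subsets:

$$\begin{aligned} S &= \{\, i \mid |\theta_i| \in (0, C),\ |h(x_i)| = \varepsilon \,\} \\ E &= \{\, i \mid |\theta_i| = C,\ |h(x_i)| \ge \varepsilon \,\} \\ R &= \{\, i \mid \theta_i = 0,\ |h(x_i)| \le \varepsilon \,\} \end{aligned} \qquad (2)$$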
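The partition into the three subsets can be illustrated with a short Python sketch. It is not the authors' implementation: the function name `partition`, the tolerance `tol`, and the choice to decide membership from the coefficient magnitude alone (zero for set R, equal to the bound C for set E, anything in between for set S) are assumptions made for the example.

```python
def partition(theta, C, tol=1e-12):
    """Split sample indices into the sets S, E and R by coefficient value.

    theta: list of coefficients, one per training sample
    C:     bound on the coefficient magnitude (hypothetical parameter name)
    tol:   hypothetical tolerance for floating-point comparisons
    """
    S, E, R = [], [], []
    for i, t in enumerate(theta):
        if tol >= abs(t):
            R.append(i)              # coefficient is zero: set R
        elif tol >= abs(abs(t) - C):
            E.append(i)              # coefficient sits at the bound: set E
        else:
            S.append(i)              # coefficient strictly inside the bounds: set S
    return S, E, R


S, E, R = partition([0.0, 0.3, -1.0, 1.0, -0.25], C=1.0)
```

In this toy call, samples 1 and 4 land in S, samples 2 and 3 in E, and sample 0 in R.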
When a sample $(x_c, y_c)$ joins or leaves the training dataset, a new optimal solution can be obtained incrementally by adjusting $\theta_i$ and $b$ such that the KKT conditions remain satisfied. This is the idea of incremental SVM training algorithms [3], [5]. The major computation involved is calculating the sensitivities of $\theta_i$ and $h(x_i)$ with respect to $\theta_c$. These sensitivities tell us how existing training samples are affected if we change the weight $\theta_c$ of the new sample. These computations involve two dense matrices $Q$ and $R$. $Q$ is the kernel matrix, $Q_{ij} = K(x_i, x_j)$. $R$ is a matrix that depends on $Q$ and set $S$. Using $\{s_1, s_2, \dots, s_{l_s}\}$ to denote all samples in $S$, matrix $R$ can be expressed as follows:

$$R = \begin{bmatrix} 0 & 1 & \cdots & 1 \\ 1 & Q_{s_1 s_1} & \cdots & Q_{s_1 s_{l_s}} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & Q_{s_{l_s} s_1} & \cdots & Q_{s_{l_s} s_{l_s}} \end{bmatrix}^{-1} \qquad (3)$$
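With matrices $Q$ and $R$, the sensitivities of $b$, $\theta_i$ and $h(x_i)$ of all training samples with respect to $\theta_c$ and $b$ can be calculated via matrix-vector multiplications [5], [3]. The first one is vector $\beta$, the sensitivity of $b$ and $\theta_i$ ($i \in S$) with respect to $\theta_c$. $\beta$ is a matrix-vector product of $R$ and some $Q$ elements:

$$\beta = \begin{bmatrix} \beta_b \\ \beta_{s_1} \\ \vdots \\ \beta_{s_{l_s}} \end{bmatrix} = -R \begin{bmatrix} 1 \\ Q_{s_1 c} \\ \vdots \\ Q_{s_{l_s} c} \end{bmatrix} \qquad (4)$$

For the elements in sets $E$ and $R$, consider the sensitivity between training error $h(x_i)$ ($i \in E \cup R$) and $\theta_c$. Denote the samples in $E \cup R$ as $\{n_1, n_2, \dots, n_{l_n}\}$ and let $\Delta h(x_i) = \gamma_i \Delta\theta_c$. The coefficient vector $\gamma$ is calculated as follows:

$$\gamma = \begin{bmatrix} Q_{n_1 c} \\ \vdots \\ Q_{n_{l_n} c} \end{bmatrix} + \begin{bmatrix} 1 & Q_{n_1 s_1} & \cdots & Q_{n_1 s_{l_s}} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & Q_{n_{l_n} s_1} & \cdots & Q_{n_{l_n} s_{l_s}} \end{bmatrix} \beta \qquad (5)$$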
Using the sensitivity vectors $\beta$ and $\gamma$, the change of $\theta_c$ or $b$ in each iteration can be calculated. The two matrices need to be updated when necessary. $Q$ is updated when a new sample joins the training dataset. As $Q$ is symmetric, only one row needs to be updated. $R$ is enlarged or shrunk when a sample joins or leaves set $S$ [5].
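The two matrix-vector products can be sketched in plain Python. The sketch assumes the formulation used in the cited incremental SVR literature: the first sensitivity vector (called `beta` here) is minus R times the vector [1, Q of each S sample against c], and each entry of the second vector (`gamma`) is Q of that sample against c, plus the b-sensitivity, plus the Q entries against the S samples weighted by the remaining entries of `beta`. The function name, argument names and toy matrices are hypothetical.

```python
def sensitivities(Q, R, S, N, c):
    """Compute the sensitivity vectors beta and gamma for candidate sample c.

    Q: kernel matrix as a list of rows
    R: inverse of the bordered kernel matrix over set S, size (|S|+1) x (|S|+1)
    S: indices of the samples in set S
    N: indices of the samples in sets E and R
    """
    # beta = -R * [1, Q[s_1][c], ..., Q[s_k][c]]
    v = [1.0] + [Q[s][c] for s in S]
    beta = [-sum(r * x for r, x in zip(row, v)) for row in R]
    # gamma_i = Q[n_i][c] + beta_b + sum_j Q[n_i][s_j] * beta_{s_j}
    gamma = [Q[i][c] + beta[0] + sum(Q[i][s] * bs for s, bs in zip(S, beta[1:]))
             for i in N]
    return beta, gamma


# Toy problem: sample 0 is in S, sample 1 is in E or R, sample 2 is the new sample c.
Q = [[1.0, 0.5, 0.25],
     [0.5, 1.0, 0.75],
     [0.25, 0.75, 1.0]]
# Inverse of [[0, 1], [1, Q[0][0]]], worked out by hand for this example.
R = [[-1.0, 1.0],
     [1.0, 0.0]]
beta, gamma = sensitivities(Q, R, S=[0], N=[1], c=2)
```

With a single sample in S, the second entry of `beta` is -1: any increase in the new sample's coefficient is offset by an equal decrease in the coefficient of the one S sample.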
The incremental training algorithm needs a starting point. It can be initialised using either an existing trained SVM (warm start) or two data samples (cold start) [3]. The incremental training procedure can also be 'reversed' when an existing training sample is removed from the training dataset, which is called decremental training. In decremental training, the coefficient $\theta$ of the leaving sample is gradually reduced to 0 while preserving the KKT conditions [3], [5].
III. HARDWARE DESIGN
The general system architecture is shown in Figure 1. Our system supports both incremental and decremental SVM training as they share the same data path. As most of the computations involve dense linear algebra, we introduce parallelism by using blocked matrix-vector arithmetic. The idea is to divide an n-by-n matrix into K×K blocks of size (n/K)-by-(n/K) each and process them in parallel. K is a compile-time parameter. We will first discuss the design challenges.

1) Random Memory Access: Denote the number of elements of set S by $N_S$, and that of set E ∪ R by $N_{ER}$. $\gamma$ is a vector of $N_{ER}$ elements.
Note that eq. (5) implies random access of Q, as both row and column addresses depend on the set membership of S, E, R at run-time. As matrix Q is divided into K×K blocks and each BRAM block has only two ports in hardware, we cannot truly parallelise the random memory accesses: the K elements needed in a given cycle may all lie in the same Q block. We address this challenge by exploiting problem-specific properties to re-arrange the loop. We notice that in many cases $N_{ER} \gg N_S$, i.e. the majority of the training samples belong to E ∪ R. Thus, we extend the outer loop (the loop over set E ∪ R with random access $\{n_1, n_2, \dots, n_{l_n}\}$) to loop over the entire dataset (the sequential loop over samples {1, 2, 3, ..., n}), and parallelise it with factor K. The inner loop over set S is still computed sequentially due to random memory access. The time complexity of computing $\gamma$ without parallelisation is $O(N_S \times N_{ER})$. With the proposed scheme, the time complexity is $O(N_S \times n/K)$. In the typical scenario where $N_{ER} \gg N_S$, the scheme is highly effective.
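The re-arranged loop can be modelled in software as follows. This is a behavioural sketch under stated assumptions, not the hardware design: each of the K lanes is assumed to own one contiguous block of n/K rows, the inner loop over set S is sequential, and one "cycle" is counted per S element per row step. The function name and the `beta` values in the toy call are hypothetical.

```python
def gamma_all_samples(Q, beta, S, c, K):
    """Compute gamma for every sample using K lanes over contiguous row blocks.

    Returns (gamma, cycles), where cycles counts sequential steps assuming the
    K lanes of one step run at the same time in hardware.
    """
    n = len(Q)
    assert n % K == 0, "this sketch assumes K divides n"
    rows_per_lane = n // K
    gamma = [0.0] * n
    cycles = 0
    for r in range(rows_per_lane):               # sequential over rows in a block
        acc = [Q[lane * rows_per_lane + r][c] + beta[0] for lane in range(K)]
        for s, beta_s in zip(S, beta[1:]):       # sequential over set S
            for lane in range(K):                # parallel in hardware
                i = lane * rows_per_lane + r
                acc[lane] += Q[i][s] * beta_s
            cycles += 1
        for lane in range(K):
            gamma[lane * rows_per_lane + r] = acc[lane]
    return gamma, cycles


Q = [[1.0, 0.5, 0.25, 0.125],
     [0.5, 1.0, 0.5, 0.25],
     [0.25, 0.5, 1.0, 0.5],
     [0.125, 0.25, 0.5, 1.0]]
gamma, cycles = gamma_all_samples(Q, beta=[0.5, -1.0], S=[0], c=3, K=2)
```

Entries computed for samples that are not in E ∪ R are simply ignored afterwards. That is the price of replacing the random-access outer loop with a sequential one, and it is small when most samples belong to E ∪ R.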
2) Numerical Accuracy: The incremental SVM iteratively updates itself, and this procedure is sensitive to numerical errors. In our FPGA design we use fixed-point numbers to reduce resource usage, which makes it challenging to maintain good numerical accuracy. We handle this issue in two ways:
• Using more fractional bits. Theoretically, double has 15-17 decimal digits of precision. Thus we use 50 fractional bits in our fixed-point data type (15 decimal digits).
• Exploiting problem-specific features. As shown in eq. (2), a sample in set S has $h(x_i) = \pm\varepsilon$; a sample in set E has $\theta_i = \pm C$; a sample in set R has $\theta_i = 0$. These values are fixed. Although the incremental algorithm will compute them when updating, there may be small deviations due to limited precision. To correct such deviations we always write these fixed values explicitly ($\pm\varepsilon$, $\pm C$, 0) when we know them.
By combining the two methods above, our FPGA design achieves the same level of accuracy as the double precision LIBSVM software [6] in our experiment. For better accuracy the number of fractional bits can be further increased.
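Both measures can be mimicked in software. The sketch below is a hypothetical model, not the hardware arithmetic units: it represents a fixed-point value as a Python integer scaled by 2^50, and `snap` overwrites the quantity whose value is known exactly for a given set. All function names are assumptions, and the multiply simply discards the extra fraction bits.

```python
import math

FRAC_BITS = 50                # fractional bits, as stated in the text
ONE = 2 ** FRAC_BITS          # fixed-point representation of 1.0


def to_fixed(x):
    """Quantise a float to an integer holding x * 2**FRAC_BITS."""
    return round(x * ONE)


def to_float(q):
    """Convert a fixed-point integer back to a float."""
    return q / ONE


def fixed_mul(a, b):
    """Multiply two fixed-point values; the shift discards the extra fraction bits."""
    return (a * b) >> FRAC_BITS


def snap(label, theta, h, eps, C):
    """Overwrite the value that is known exactly for the sample's set."""
    if label == "S":
        h = math.copysign(eps, h)          # training error is +/- eps
    elif label == "E":
        theta = math.copysign(C, theta)    # coefficient is +/- C
    elif label == "R":
        theta = 0.0                        # coefficient is 0
    return theta, h


resolution = to_float(1)                   # smallest representable step
product = fixed_mul(to_fixed(0.5), to_fixed(0.25))
corrected = snap("E", 0.9999999, 0.3, eps=0.1, C=1.0)
```

The resolution of 2^-50 is slightly below 1e-15, which matches the 15 decimal digits quoted above.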
3) List Manipulation: The training samples are divided into the sets S, E and R. Each set is a list. When a sample moves from one set to another we need to update the lists. A set membership update involves inserting or removing an element at a random position in a list. A straightforward implementation on FPGA is to store each list in an array held in a BRAM block. In this way the set membership update has O(n) time complexity. For better performance we implement the list using shift registers. In each clock cycle, the new value for each register can be either: 1) its current value; 2) the value from the register on its left; 3) the value from the register on its right; 4) the external input. As all these registers are synchronous and operate in parallel, the insertion or removal of an element at a random position in the list finishes within one clock cycle. In this way we reduce the time complexity of list manipulation from O(n) to O(1).
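A small behavioural model may help make the four-way selection concrete. It is a software sketch only: the selector names, the `clock` function and the use of `None` for empty slots are hypothetical, and the per-register loop stands in for logic that all registers perform simultaneously in hardware.

```python
HOLD, FROM_LEFT, FROM_RIGHT, EXTERNAL = range(4)


def clock(regs, select, external=None, fill=None):
    """One synchronous clock edge: every register picks its next value
    from the current state, so all updates happen at once."""
    n = len(regs)
    nxt = []
    for i, sel in enumerate(select):
        if sel == HOLD:
            nxt.append(regs[i])
        elif sel == FROM_LEFT:
            nxt.append(regs[i - 1] if i > 0 else fill)
        elif sel == FROM_RIGHT:
            nxt.append(regs[i + 1] if n > i + 1 else fill)
        else:
            nxt.append(external)
    return nxt


def insert(regs, pos, value):
    """Insert value at pos: registers right of pos take their left neighbour."""
    select = [HOLD if pos > i else EXTERNAL if i == pos else FROM_LEFT
              for i in range(len(regs))]
    return clock(regs, select, external=value)


def remove(regs, pos):
    """Remove the element at pos: registers from pos onwards take their right neighbour."""
    select = [HOLD if pos > i else FROM_RIGHT for i in range(len(regs))]
    return clock(regs, select)


after_insert = insert([3, 7, 9, None, None], 1, 5)
after_remove = remove(after_insert, 2)
```

Each call to `insert` or `remove` corresponds to a single call to `clock`, mirroring the single-cycle update described above.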
B. Training with Limited Resources - The Sliding Window
In the proposed architecture for incremental and decremental SVM training, all related coefficients are stored in the FPGA's BRAM for efficient access. Among these coefficients, matrices Q and R are the most memory-consuming.
Consequently, the number of training samples that the system can store is determined by the BRAM space available in the FPGA chip. To train the SVM with limited resources, we adopt a sliding window approach: when the window is full and a new sample arrives, an existing sample needs to be removed before the new sample can be added to the training dataset. A similar approach to learning with limited resources has been reported [7]. In this paper, as we are evaluating the system with high-frequency financial data, we choose to remove the oldest sample from the training dataset, as it is considered to be out-of-date. For other applications, the control logic can be modified to deploy a different strategy for selecting the item to be removed.
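The oldest-first policy can be sketched as a thin wrapper around the two training steps. The class name and the `on_add` / `on_remove` callbacks are hypothetical placeholders for the incremental and decremental updates; the sketch only shows the ordering of operations.

```python
from collections import deque


class SlidingWindow:
    """Keep at most `capacity` samples, evicting the oldest one first."""

    def __init__(self, capacity, on_add, on_remove):
        self.capacity = capacity
        self.on_add = on_add          # placeholder for the incremental update
        self.on_remove = on_remove    # placeholder for the decremental update
        self.window = deque()

    def push(self, sample):
        if len(self.window) == self.capacity:
            oldest = self.window.popleft()
            self.on_remove(oldest)    # remove the oldest sample first
        self.window.append(sample)
        self.on_add(sample)           # then add the new sample


log = []
w = SlidingWindow(3,
                  on_add=lambda s: log.append(("add", s)),
                  on_remove=lambda s: log.append(("remove", s)))
for sample in range(1, 6):
    w.push(sample)
```

A different eviction strategy would only change which element is taken out of `window` before the new sample is appended.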
IV. EXPERIMENTAL EVALUATION
In this section we evaluate the proposed hardware design running on FPGA against LIBSVM, one of the most widely used SVM software libraries [6]. LIBSVM uses the Sequential Minimal Optimisation (SMO) algorithm to train SVM. SMO is an efficient batch training algorithm. LIBSVM is single-threaded. We use high-frequency financial order book data in our experiments.
A. Hardware Platform
The proposed system is implemented using the MaxJ dataflow computing language by Maxeler Technologies and built on Maxeler's MAX4 platform with an Altera Stratix V 5SGSD8 FPGA (28nm technology). The FPGA runs at 150MHz. The LIBSVM software runs on a computer with an Intel Core i7-4770 CPU at 3.4GHz (22nm technology) and 16GB of DDR3-1600 memory.
B. Benchmark Problem
We use 5-level financial order book data in our experiments. An order book is a list of buy and sell orders for a certain financial instrument. Level 1 corresponds to the best bid and ask, level 2 to the second best, and so on. The order book keeps evolving as market participants buy and sell. Our application predicts the future mid-price at level 1 (the average of best bid and best ask). We construct 16 features from the order book data as follows (EMA is Exponential Moving Average):

This SVM model has the potential to predict stock price movements. Figure 2 shows the mid-price prediction using our window-based incremental SVM with window size 420. A green dot is plotted when the SVM mid-price prediction is continuously higher than the recent mid-price for 150 events, and a red dot is plotted when it is continuously lower for 150 events. These dots can be used to make trading decisions.
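The dot-plotting rule can be sketched as a streak counter. The sketch rests on assumptions the text does not settle: a dot is emitted on every event once the streak has reached the required length (rather than only once), ties reset both streaks, and the function names and the `run` parameter are hypothetical.

```python
def mid_price(best_bid, best_ask):
    """Level-1 mid-price: the average of best bid and best ask."""
    return (best_bid + best_ask) / 2.0


def signals(predicted, recent_mid, run=150):
    """Return 'green', 'red' or None per event, based on streak length."""
    out = []
    up = down = 0
    for p, m in zip(predicted, recent_mid):
        up = up + 1 if p > m else 0        # consecutive events with prediction above
        down = down + 1 if m > p else 0    # consecutive events with prediction below
        if up >= run:
            out.append("green")
        elif down >= run:
            out.append("red")
        else:
            out.append(None)
    return out


# Shortened streak length so the toy series is easy to follow.
dots = signals([2, 2, 2, 2, 0, 0, 0], [1] * 7, run=3)
```

With `run=3`, the first green dot appears at the third consecutive event above the mid-price, and the red dot at the third consecutive event below it.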
C. Resource Usage
Our incremental SVM system has a window size n = 420, and RSize = 120. RSize controls the allocated memory space (RSize × RSize elements) for matrix R, which can be smaller than the window size. This is because R is determined by set S, and set S can be small for many real-world problems. With (n, RSize) = (420, 120) fixed, we parallelise our system with K = 3, 4, 5, 6 and report resource usage in Table I. FPGA clock frequency is set to 150MHz.

Table I
K          QBDim  RBDim  LUT     FF       BRAM  DSP
3          140    40     185942  270687   1446  770
4          105    30     201825  313540   1479  1143
5          84     24     224748  365129   1657  1486
6          70     20     253331  423728   1892  1916
Available                524800  1049600  2567  1963

In the table, QBDim = n/K is the dimension of each matrix Q block, and RBDim = RSize/K is the dimension of each matrix R block. Larger K leads to fewer kernel cycles and thus better performance. DSP is the critical resource in our system. Most of the DSPs are used by: a) the Gaussian RBF kernel $K(x, x') = \exp\left(-\|x - x'\|^2 / (2\sigma^2)\right)$, because of its complexity;
and b) matrix R enlarging and shrinking, because R is divided into K×K blocks.
We notice that although the total size of matrices Q, R and the other coefficients stays the same for different configurations, as n = 420 and RSize = 120 are fixed, BRAM usage increases with parallelism K. This is because blocked storage is used for parallel access, and higher parallelism means more blocks, even though the overall size is unchanged.
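The block dimensions follow directly from n, RSize and K, and can be recomputed with a few lines of Python. The function name `block_dims` is hypothetical, and the block count K×K per matrix is taken from the blocking scheme described in Section III.

```python
def block_dims(n, r_size, K):
    """Return (QBDim, RBDim, blocks per matrix) for parallelism K."""
    if n % K or r_size % K:
        raise ValueError("this sketch assumes K divides both n and RSize")
    return n // K, r_size // K, K * K


dims = {K: block_dims(420, 120, K) for K in (3, 4, 5, 6)}
```

The number of blocks per matrix grows from 9 at K = 3 to 36 at K = 6 while the total number of stored elements stays fixed, which is consistent with the rise in BRAM usage.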
D. Performance Evaluation and Discussion
We compare the elapsed time for LIBSVM and FPGA to perform the training task. Our dataset contains 1902 items, corresponding to 88 seconds of trading. We use the sliding window approach with window size n = 420 and RSize = 120. This means that as trading goes on, we always use the latest 420 prices to train the SVM. In the beginning, when there are fewer than 420 items, all data are used. Results are reported in Table II, where:
• ACC_Act.: actual speed-up, ACC_Act. = T_LIB / T_Act.
As we see from the table, up to 40.97 times speed-up has been achieved. However, the expected speed-up calculated using the number of cycles to run and the FPGA clock frequency (150MHz) is much greater than the actual speed-up. This indicates that our system is bounded by CPU-FPGA communication.
When constructing the SVM we use 16 features, so together with the prediction target (future mid-price) there are 17 items. As they are double-precision numbers, the size of each training sample is 17 × 64 = 1088 bits. The total size of 1902 samples is 252.61KB. In the Maxeler MAX4 system we used, CPU and FPGA communicate via an InfiniBand connection with 2GB/s bandwidth. If we divide 252.61KB by 2GB/s, the data transfer needs only 0.00012s; but this is certainly not the case, as the difference between T_Exp. and T_Act. is much larger than that. A reasonable explanation is that bandwidth is a function of transfer data size and pattern. In our experiment, the data are transferred in small pieces (1088 bits) at irregular times (whenever the training of the previous sample finishes). It is possible that for this kind of data transfer the actual InfiniBand bandwidth available is much smaller than expected.
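The arithmetic behind these figures can be reproduced as follows. The sketch assumes binary units (1KB = 1024 bytes and 2GB/s = 2 × 2^30 bytes per second), since that interpretation matches the quoted numbers. The variable names are illustrative.

```python
FEATURES = 16
VALUES_PER_SAMPLE = FEATURES + 1        # features plus the prediction target
BITS_PER_VALUE = 64                     # double precision
SAMPLES = 1902
BANDWIDTH_BYTES_PER_S = 2 * 2 ** 30     # assumption: 2GB/s read as binary units

bits_per_sample = VALUES_PER_SAMPLE * BITS_PER_VALUE
total_bytes = SAMPLES * bits_per_sample // 8
total_kib = total_bytes / 1024
ideal_transfer_s = total_bytes / BANDWIDTH_BYTES_PER_S
```

The ideal transfer time of roughly 0.00012s assumes one large transfer at full bandwidth, which is exactly the condition the small, irregular per-sample transfers do not meet.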
V. CONCLUSION AND FUTURE WORK
In this paper, we have introduced a novel dataflow design for incremental SVM training. The proposed design addresses three challenges of implementing incremental SVM efficiently on FPGA: random memory access, numerical accuracy, and list manipulation. Experimental evaluation using high-frequency financial data shows that the proposed design running on a Stratix-V FPGA achieves up to 40.97 times speed-up against LIBSVM software on a Core i7-4770 CPU. The proposed design is suitable for scenarios in which on-line SVM training is needed, such as financial time series prediction.
Possible future work includes using data compression to reduce communication overhead. Assuming the I/O overhead is removed with such compression techniques, we should be able to reach the expected speed-up (up to 109.79 times).
