Abstract-A general solution for dealing with parameter estimation in a rigorous way is Bayesian data analysis. This analysis allows estimation of specific parameters and their uncertainties for non-linear inverse problems in a strictly mathematical way. Though advantageous, it is computationally intensive with long processing times and therefore, with exceptions such as the Kalman filter for linear systems with Gaussian noise, its full analysis is not commonly used for real-time applications. For non-linear problems, there are many suboptimal Bayesian online algorithms using different approaches which all introduce some kind of approximation. Nevertheless, an approach avoiding approximations to improve inference is desired. An acceleration of Bayesian analysis for inverse problems using reconfigurable hardware or multi-core processing is necessary and could prove valuable for scientific inference. This paper describes the acceleration of the Dispersion Interferometer (DI) for the Wendelstein 7-X (W7-X) magnetic confinement device as a proof of principle of the Bayesian model based data analysis.
I. INTRODUCTION
Parameter estimation in the analysis of inverse problems has different approaches that are tailored depending on the application. In plasma physics and fusion devices, one key aspect for diagnostic parameter estimation is the rigor of uncertainty estimation and the reduction of approximations when possible. This makes the Bayesian analysis a desired technique for data analysis in fusion experiments like W7-X [12] . In this scope, the current analysis approach is almost exclusively offline because of the computation times and processing power required for the analysis of complex models with several data sources and free parameters [11] . Model dimensionality can grow up to a point where the full posterior distribution is intractable. For these cases, the posterior distribution is not of a standard analytical form, and various iterative sampling algorithms, such as Markov Chain Monte Carlo (MCMC) are used to calculate the proper (generally non-Gaussian) probability distribution, often taking hours on regular CPUs. Providing an acceleration of this type of analysis is a complicated problem due to the volume of the analysis and its iterative nature in some cases. There are many ways of using Bayesian inference for model comparison or parameter estimation. Several solutions have been developed for time dependent models where a state-space approach and Bayesian filtering and smoothing are the preferred tools of analysis [9] . Most of these algorithms either handle linear Gaussian problems with optimal algorithms like the Kalman Filter [6] or have sub-optimal algorithms that reduce the volume of calculations by using approximations while meeting Bayesian requirements. Some examples are extended Kalman filters, particle filters, grid based methods or the most recently used variational Bayes which can be seen as Bayesian solutions to time-varying inverse problems when they can be formulated as stochastic state-space models [1] , [2] , [4] , [7] .
These have all been successful in tackling Bayesian analysis for intractable posteriors through simplification in a control theory environment and state-space approach, and for many cases these are the only way to reach a online time frame. Nevertheless, there are still cases in scientific inverse problems where the estimation of time-independent physics parameters from observations using forward modeling (predicting observations from physics parameters through scientific models) is needed. In these cases approximations are preferably avoided, generic solutions for time independent models are desired and an acceleration of this type of Bayesian data analysis would be beneficial. One example is combining data from multiple diagnostics through joint forward modeling and Bayesian inference, where approximations could propagate to an inaccurate final uncertainty in the joint analysis [10] .
As this analysis in plasma physics and fusion research commonly takes long processing times, an accelerated or even online implementation is often desirable. A real time approach of this analysis involving complex physics models could have a big impact in fusion plasma control as well; but real time control is outside of the scope of this project. The intention of this work is to accelerate parameter estimation for these complex inverse problems, specifically for interpretation of diagnostic data. This paper presents an architecture for accelerating formal model based Bayesian analysis for the Dispersion Interferometer diagnostic. It is important to notice that a single free parameter has been chosen to simplify the analysis for a proof of principle. The selected first implementation is a Field Programmable Gate Array (FPGA) design for acceleration through parallelization and application specific hardware architecture. This provides a good foundation to analyze limitations and advantages compared with CPU processing or a possible GPU implementation. Magnetic fusion devices like W7-X, as well as many other physics research devices, can benefit from having this thorough analysis in real time.
II. MODELING A. The Dispersion Interferometer
The W7-X Dispersion Interferometer is a diagnostic that allows the indirect measurement of electron density due to its proportionality with the phase shift introduced by the plasma refraction index. The interferometer works by frequency doubling a small amount of incident 10.6μm wavelength CO 2 laser light with a AgGaSe crystal. This crystal also polarizes the doubled frequency component by 90
• . The two components on the same beam path pass through an Elasto-Optical Modulator (EOM) introducing an artificial modulation. The laser then crosses the plasma twice, receiving a phase shift proportional to the plasma refractive index. The proportionality constant depends on the wavelength such that each component receives a different phase shift. The beam goes a second time through a frequency doubling crystal that has the same effect as the first crystal on the base-band component. Now with the same polarization and frequency, the two components produce an interference pattern that depends on the different phase shift obtained from the the plasma. The base frequency remainder can be filtered out. The remaining components get recorded with a photo-sensor giving an output signal that can be modeled as:
Where I 1,2 are almost constant along a 20μs period of the signal, m is the modulation depth set by the EOM at 50kHz, and Δϕ is the phase difference between the two wave components in the beam path which is proportional to the line integrated electron density n e dl of the W7-X plasma along the laser beam path.
B. Bayesian Model
Acceleration of general Bayesian analysis allows us to have a more generic solution for any case while reducing the number of approximations. Ideally such an acceleration scheme could then be used for different models and diagnostics. For an acceleration of this analysis a proper approach has to be chosen that can meet real time requirements and yet be tractable with reconfigurable hardware. In order to gradually accelerate different cases with a growing complexity, a proof of principle model has been designed and implemented. The chosen approach is sequential Bayesian analysis. Bayes' theorem states:
where D is the data, p(Δϕ | D) is the posterior probability density function(pdf), p(D | Δϕ) is the likelihood, composed by a normal distribution of the data around the predicted forward modeled value, p(Δϕ) is the prior of the selected parameters and p(D) is the evidence.
The sequential approach of Bayes' theorem instead of a batch approach, is done by generating a posterior distribution from each sample analyzed as it arrives, using the time within samples 1 to do processing instead of waiting for a set of data to be obtained. If each sample can be considered as independent, the posterior for the following observation can be generated by using the previous posterior as the prior factor of (2).
The acceleration possibilities can be categorized depending on where the bottleneck of the analysis lies. For the case of Bayesian analysis it can lie in the forward model (FM) of a complex function requiring iterative procedures, the application of Bayes theorem for highly dimensional posterior distributions and the inversion for multimodal posterior distributions.
In this case our parameter of interest is Δϕ which is time dependent. It can vary on each modulating period as well as within this period in smaller amounts. Given the possible Δϕ resolution with current noise levels and the desired uncertainty, changes within this uncertainty range in a single modulation period time frame are not expected; therefore phase difference changes within a period can be ignored and Δϕ can be seen as a constant parameter within this time frame. This allows us to apply sequential Bayesian data analysis to this signal within a period with Δϕ as free parameter. With this, it is possible to study the acceleration possibilities for higher dimensionality cases.
The initial guess of the knowledge of the phase difference is defined as the following flat prior distribution that indicates a lack of knowledge of what the value of Δϕ is:
For an initial range with low densities, the phase shift is expected to be anywhere between 0 and 2π. Even though the plasma is expected to start at a electron density of around 0, the phase shift could have any relative value within its range, so a flat prior is used.
The likelihood is chosen as a normal distribution of the data around the forward modeled value of the predicted voltage.
where V is the predicted value using (1) as the forward model, and D is the data. Variables that are not known with accuracy are ideally treated as free parameters and may be considered in a future extension of this project. The uncertainties in these variables that are treated as non-free parameters will be considered. Since calculations are made in an FPGA without floating point precision, arithmetic errors in the forward modeling as well as the non-free parameter uncertainties are taken into account. For this, σ di will include contributions from both the raw signal, σ D , and error propagation of arithmetic errors in the forward model, σ V .
The evidence term p(D) of (2) is constant for a given data set within a period and is not needed since normalization is not required to find the Maximum a Posteriori (MAP). Thus using (3) and (4) the posterior distribution of (2) can be changed to a proportionality as:
Since the main objective is acceleration, a useful approach to speed up calculation times is calculating the natural logarithm of the posterior pdf, simplifying (6) to:
Before this FPGA implementation, the model was implemented in the MINERVA Bayesian modelling framework [13] in order to study the behavior of the posterior and likelihood.
The framework was used to analyze each sample and generate a posterior per sample of data generated for Δϕ = 1.5rad. The results of this can be seen in figure (1). The analytic solution of (7) reveals the same results seen in figure 1 where one oscillatory solution is found as well as the correct solution.
Δϕ e = Δϕ r and ϕ e = sin(ωt) − ϕ r + k2π (8) In this equation ϕ e is the estimated value of the phase difference and ϕ r the correct value and k ∈ Z + .
The posterior of the 440 th sample is plotted as in figure 2 , where the multimodality can be seen clearly.
It is possible to reduce the multimodality when a good number of informative samples is available. The posterior distribution that takes them all into account is shown in (9) . This can also be seen as using the posterior generated with one sample as the prior for the next one. 
If enough samples are taken, the number of modes is then reduced to a single mode of the normalized posterior as seen in figure 3. 
III. ACCELERATION WITH FPGA ARCHITECTURE
As mentioned before, the use of sequential Bayesian analysis creates an opportunity to design a circuit that uses each sample to update a stored posterior with each incoming sample in real time. At the same time the use of parallelization can allow several calculations of the forward model simultaneously. This can be done with a GPU or a FPGA architecture. For this approach a Virtex 6 LX130T FPGA was chosen as the platform for the proof of principle.
The advantage of the FPGA in this problem is the high level of parallelization that can be reached given the required scan resolution needed on the free parameter. The possibility of designing a specific architecture for each operation without having to funnel through a single processor per thread helps the acceleration of the estimation.
A. Architecture Parametrization
The forward model was implemented in parallel using (1) so that each forward model is pre-calculated before the next sample arrives. The number of parallel calculations is determined by the desired resolution but was kept intentionally low to test FPGA resource requirements.
For the first design a set of initial requirements was defined. An initial maximum line integrated electron density of 10 20 m −2 was selected, corresponding to a maximum phase difference of 4.4806rad. The chosen phase resolution is 7x10 17 m −2 ≈ 0.03rad. For the selection of bus sizes, a resolution is required for each parameter which can be found by error propagation in the forward modeling. For this, as a way of saving FPGA resources, a fixed point architecture was chosen.
As (5) shows, the error in the forward modeling is included in the standard deviation of the likelihood. By numerically relating the standard deviation in the posterior and the likelihood with all the samples from one period considered, the desired error in the posterior requires an error of 0.032rad in the likelihood. With (5) and by doing error propagation on (1) we have
By using Taylor expansion to approximate the error in the argument of the cosine to the error in the cosine, it can be shown that σ cos = σ A with A representing the argument of the cosine and therefore
(11) With this we can introduce the known error of each of the non-free parameters and define the minimum bus size and the required resolution of the free parameter σ Δϕ ≈ 0.31. This allows us to see that the errors related to other nonfree parameters do not limit the required resolution and thus estimate the required resolution for the implementation of the cosine function. Nevertheless this value obtained cannot be taken as a design parameter for Δϕ, since it is directly related to the number of points in the posterior pdf and therefore the final resolution.
B. Architecture Design
After error propagation analysis, the scaling for the fixed point implementation is chosen to use integer arithmetic for acceleration and low use of resources. The forward modeling of the argument elements is composed of several stages that follow (1). The initial stage is an array of the scanned values of Δϕ. The following one is the addition of the modulation factor and the third one is the sine signal. The sine signal is generated with a Look Up Table (LUT) containing the scaled values of π sin(ωt). This signal is synchronized with the real signal through the EOM mentioned in section 2.1. This modulator transmits a TTL signal that drives the modulation where the jitter and delay of this signal are considered in the error propagation calculation. It is important to notice that the modularity of the FPGA allows for this process, which is part of the forward modelling, to be carried out independently and constantly, therefore adding no time to the total processing time. The output of that LUT is then multiplied by a register having the scaled value of m which can also be modified using PCIe register mapping in order to calibrate the FM results.
A generic and fast way for the implementation of the cosine function is through LUT and therefore the scaling can be done taking into consideration the addressing required for it. The scaling for the required resolution yields a 16 bit address that has to be scaled through
where K and B are scaling and offset constants that depend on the desired resolution. Based on the error propagation analysis, in order to have an address space between 0 and 2 16 , the scaling required is K = 5452 and B = 20553. This requires a bus size of 7 bits for m, 9 bits for sin(ωt) and 16bits for Δφ. This will yield the address of the LUT where the 16 bit answer of the cosine will be stored.
For the estimation of I 1 and I 2 values, given that these values do not vary significantly from one period to another the preliminary approach is done by measuring the amplitude and offset of the incoming signal within one period and using it as a prediction value for the following one. With this approach it is possible to have a forward modeling design that allows us to generate the predicted value N times depending on the desired resolution. For the first implementation the chosen value is of N = 60 parallel estimations of the forward function for different values of Δϕ, each of 16bit as is the resolution of the input data. This first N selection is done to test resource requirements.
To accelerate the processing of the cosine value without having to implement 60 LUTs, the workload is divided between 3 dual read-only memories which occupy 30 blocks of 36k RAM resources each. Each of its inputs are used for 10 values of the cosine argument that will be multiplexed. The memory will be then read at 300MHz given a chosen datapath clock frequency of 30MHz.
Once the forward modeling is complete, the computation of the likelihood has to be implemented. Given that we are looking for the MAP of the posterior pdf after using one period of samples and sequential Bayesian analysis, the logarithm of the pdf can be estimated changing (9) to
With this equation the likelihood calculation becomes simplified and requires only a subtraction and a division. For the likelihood estimation each sample is compared against its respective values of FM data and stored in an array. This array accumulates the results and waits for the next sample to be added until the full period is completed. This array contains as an initial value the selection of the prior, which for this case is 1/2π.
The resulting circuit can be seen in figure 4 . The estimation of the MAP in this design uses a comparator scheme of 60 levels to estimate a maximum of each posterior after a full period analysis is done.
For testing purposes, a PCIe transmission of the posterior was also implemented in order to observe and store the temporal behavior of the real time analysis.
IV. RESULTS AND ANALYSIS
Formal Bayesian analysis takes each value that is not known as a free parameter. However, in this first part of this project only Δϕ is taken as a free parameter. The usage of only one free parameter for the proof of principle, when others are not well known, raises a natural question of whether the analysis could be invalidated. In order to measure the effect of errors from the other non free parameters, the one with the biggest estimated error was tested. To analyze its effect, a posterior was calculated with an artificial error of 10% in m from (1). The resulting MAP has an error of 0, 01rad, which is within an acceptable range for the desired resolution. Given that the error in the pre-estimation of m is smaller than 10%, we can conclude that this small error in the non-free parameters would not invalidate the analysis.
Our implementation of this architecture allowed for a real time analysis of a single free parameter. With a resolution of 60 values of Δϕ from 0 to 4.4406rad as well as a test signal generated with a true value of Δϕ = 1.5rad the following results were generated from the unnormalized posterior distribution given by the implemented circuit.
As we can see figure 5 shows a trimmed Gaussian distribution given that the signal is periodic around 2π and the selected range is smaller than that. These results of the first implementation also have a maximum, with the value matching the test signal and with no trace of multimodality as in the single sample posterior in figure 2 .
The maximum estimation and standard deviation were also tested for 10 different values of Δϕ as we can observe in figure  6 where the error values were within the designed range.
Regarding processing time, the selected clock speed for the first implementation leaves room for improvement. The cosine LUT can also be read at a max clock frequency of 400MHz which means that the critical path clock frequency can be optimized.
In terms of duration, even though the forward modeling is done before the sample arrives and therefore has no effect on the duration of the analysis, it requires 8 clock cycles to generate a full FM value for each parallel value of Δϕ. The application of Bayes formula itself (13) , is done within 3 clock cycles per sample. The estimation of the MAP for the posterior distribution, given a N = 60 values resolution, requires 6 clock cycles to compute. This means a total of 0.3μs for the analysis plus MAP. Given that a set of 200 samples within a 20μs period was tested, a current temporal resolution of a density value is then 20.3μs which is mainly limited by the 20μs period of the modulation.
One important point for analyzing performance on this implementation is the comparison against state of the art Bayesian real time analysis. As mentioned in the introduction this type of analysis differs from current implementations of Bayesian and Smoothing filters. Other implementations of FPGA accelerators for Bayesian networks are mainly used to evaluate and compare score values for model evaluation in Bayesian networks [5] . Due to the difference of these implementations, the chosen comparison is a basic implementation of the same algorithm using C, ran on a Intel i3 − 4130 CPU with a 3.40GHz clock speed.
The C implementation yielded an average of 5μs per sample on a single thread. This initially shows a sixteen fold acceleration considering that neither of the implementations has been optimized.
Regarding the comparison between the FPGA and C implementations there are several points to consider. For this proof of principle the FPGA implementation was mainly limited by the calculation of the cosine function in the forward modeling. This is mainly due to available resources on the FPGA that require the multiplexing to reuse this LUT. If enough resources were available to add more LUT and reduce the number of multiplexed signals, the limiting factor would then be the maximum operating frequency of these LUT (up to 450Mhz). The reason for the selection of a LUT to calculate the cosine instead of CORDIC or other algorithms that can be ran at higher clock frequencies, was to keep the design general enough for other possible models. Clearly, the calculation of more complicated mathematical operations will generate the majority of the bottlenecks in the critical path. Given the variety of these operations under different models a LUT is the most generic approach in order to asses the timing limitations and feasibility of the analysis.
It is important to note that the estimation of the forward model with limited integer arithmetic is an advantage in terms of hardware resources, but it introduces an error that has to be considered and therefore would eventually limit the possible standard deviation value of the posterior. The use of higher resolution for stages like the cosine estimation requires more resources and thus a balance has to be kept in order to be able to maintain the level of parallelization required.
This indicates that for higher dimensionality or free parameter resolution, a trade off would eventually be made between parallelization and resolution. It is important to mention that for cases of low dimensionality posterior, a comparator scheme can be used to obtain the MAP. This is rarely the case and dealing with a typical multidimensional posterior requires an alternative solution. For those cases sampling algorithms like MCMC offer possible parallelized solutions that can be used to achieve the inversion and obtain the most likely values for the selected parameters [8] .
V. CONCLUSIONS
This proof of principle shows an example of Bayesian estimation that can be extended to more complex models. It proves that an initial single free parameter approach is possible for simple models and opens possibilities of scaling for more complex cases.
Multiplexing to reuse resources for mathematical operations can be helpful when dealing with a higher level of parallelization. At the same time it represents one of the main speed limitations in the analysis in terms of resources and maximum clock speed.
In the case of a higher number of free parameters, a need to marginalize introduces integrals into the processing scheme. This can also be approached with an FPGA/CPU scheme accelerating the integration required in the marginalization [3] . This iterative algorithms in reality slow down the estimation process but can nevertheless be parallelized.
Finally the comparison against a CPU implementation shows that an acceleration is already achieved even with a slow 30MHz main clock and a limitation of the FPGA resources available for this proof of principle. This can be improved by increasing the critical path clock speed; then the acceleration is finally limited by the maximum frequencies of the LUT for calculating more complex operations. Since the CPU implementation can be optimized by the use of more threads, a GPU approach could present another possibility but is still lacking advantages of FPGA pre-processing of the forward modeling.
VI. ACKNOWLEDGMENTS
This work has been carried out within the framework of the EUROfusion Consortium and has received funding from the Euratom research and training programme 2014-2018 under grant agreement No 633053. The views and opinions expressed herein do not necessarily reflect those of the European Commission. This work was also improved with the help of Dr. Loizu regarding several mathematical aspects.
