Atomic force microscopes (AFM) provide high resolution images of surfaces. In this paper, we focus on an interferometry method for estimation of deflections in arrays of cantilever in quasi-static regime. We propose a novel complete solution with a least square based algorithm to determine interference fringe phases and its optimized FPGA implementation. Simulations and real tests show very good results and open perspectives for real-time estimation and control of cantilever arrays in the dynamic regime.
Introduction
Cantilevers are used in atomic force microscopes (AFM) which provide high resolution surface images. Several techniques have been reported in literature for cantilever displacement measurement. In [1] , authors have shown how a piezoresistor can be integrated into a can tilever for deflection measurement. In [2] , authors have presented a cantilever mechanism based on capacitive sensing. These techniques require cantilever instrumenta tion resulting in complex fabrication processes.
In this paper our attention is focused on a method based on interferometry for cantilever displacement measure ment in quasi-static regime. Cantilevers are illuminated by an optical source. Interferometry produces fringes en abling cantilever displacement computation. A high speed camera is used to observe the fringes. In view of real time applications, images need to be processed quickly and then a fast estimation method is required to determine the displacement of each cantilever. In [3] , an algorithm based on spline has been introduced for cantilever position estimation. The overall process gives accurate results but computations are performed on a standard computer using Lab View ®. Consequently, the main drawback of this implementation is that the computer is a bottleneck. In this paper we pose the problem of real-time cantilever position estimation and bring a hardware/software solution. It is divided into two parts: a calibration phase using a spline based algorithm presented in [3] , and an acquisition loop relying on a fast method based on least squares and its FPGA implementation.
The remainder of the paper is organized as follows. Section 2 describes the goals we chose for our setup. Our solution based on the FPGA implementation of a least square method is presented in Section 3. Numerical exper imentations are described in Section 4. Finally conclusions and perspectives are drawn.
Design goals and choices
The expermiental setup, described in [4] has been designed to be simple, cost effective and user-friendly. It is based on a Linnick interferometer [5] to produce interferometric fringes on the sample surface. The sample is itself placed on a nano-positioning table. A CMOS camera takes images of the surface, that are analyzed to recover the cantilever deflections.
Current experiments operate on arrays with up to 17 x 4 = 68 levers, a size that should increase in a near future. Thus, we chose a goal which consists in designing a computing unit able to estimate the deflections of at least 100 cantilevers, for an image rate of 1 KHz. The geometry of the array of cantilevers is not fixed but our solution should manage a grid, for example lO x 10, 7 xIS, 4 x 25, In addition, the result accuracy must be close to lnm, the maximum precision reached in [3] . Finally, the latency between the entrance of the first pixel of an image and the end of deflection computation must be as small as possi ble. All these requirements are stated in the perspective of implementing real-time active control for each cantilever (see [6] , [7] ).
If we put aside other hardware issues like the speed of the link between the camera and the computation unit, the time to deserialize pixels and to store them in memory, the phase computation is the main bottleneck. For 100 cantilevers, our method implies the computation of the phase of interferometric fringes located on 201 segments of pixels (i.e. profiles, see Section 3). Assuming that the camera is able to pick images every milliseconds, each phase calculation (including the time to extract pixels) should take no more than 511s.
In fact, this timing is a very hard constraint. A very simple test-bench doing cumulated sums on 20 values (the average profile size in our experiments) reaches an average of 155Mflops on an Intel Core 2 Duo E6650 at 2.33GHz. Obviously, some cache effects and optimizations on huge amount of computations can drastically increase these performances: peak efficiency is about 2.5Gflops for the considered CPU. But this is not the case for phase computation that is using only a few tenth of values. This test-bench implies that the phase computation algorithm should do less than 775 floating point instructions, which is small. However, the most important point is the latency of the whole computation. Supposing that each profile must be treated one after the other in 511S, the deflection of 100 cantilevers would take Ims. It is totally inadequate for real-time requirements as for individual cantilever active control. An obvious solution is to parallelize the compu tations, for example on a GPu. Nevertheless, the cost of transferring profiles in GPU memory and of taking back results would be prohibitive compared to computation time. It is far more efficient to pipeline the computation. For example, supposing that 200 profiles of 20 pixels could be pushed sequentially in a pipelined unit clocked at a 100MHz (i.e. a pixel enters in the unit each IOns), all profiles would be treated in 200 x 20 X 10.10-9 = 40l1s plus the latency of the pipeline. Such a solution would be meeting our requirements and would be accurate for real-time control. FPGAs are appropriate for such an implementation, so they turn out to be the computation units of choice to reach our goals.
Interferometric based cantilever deflection estima tion
As shown in Figure 1 , each cantilever is covered by several interferometric fringes. The fringes distort when cantilevers are deflected. For each cantilever, the method uses three segments of pixels, parallel to its width, to determine phase shifts. The first one (reference profile) is on the base of the array. It is common to all levers and provides a reference for noise suppression. The second is located just above the AFM tip (tip profile), it provides the phase shift modulo 2n. The third one is close to the base junction (base profile). It is used to determine the "real" tip phase through an operation called unwrapping where it is assumed that the average deflections along the base and tip profiles are linearly dependent. Finally, deflections are simply derived from this unwrapped phase.
The pixel gray-level intensity I of each profile is mod eled by
where x denotes the position of a pixel in a segment, A, f and 8 are the amplitude, the frequency and the phase of the Our method consists in two main sequences: calibration and acquisition loop. The calibration does not require real time computations. First, an image of the whole array is taken, without any forces applied to levers. From profile locations, we determine the frequency f of each profile using a spline interpolation, described in Section 3-A. Then, we take several images for different locations of the nano-positioning table to compute, for each lever, the coefficients of the linear relation between base and tip phases. Finally, we derive the slope of the array with respect to the horizontal plane.
The acquisition loop consists in moving the nano positioning table and taking images at regularly spaced time steps. For each image, the phase 8 of each profile is computed to obtain, after unwrapping, the cantilever deflection. The phase determination is achieved by the FGPA, with the least square method described in Section 3-B.
A. Frequency determination
As mentioned above, the phases are computed for each image but f is computed only once during calibration. This is why the frequency determination can be done on a CPU with a time consuming method based on spline interpolation.
We denote by M the number of pixels in a segment (i.e. a profile) used for phase computation. For the sake of the simplicity of the notations, we consider the light intensity I as a function defined in the interval [0, M -1] which itself is the range of a one-to-one mapping defined on the physical segment. The pixels are assumed to be regularly spaced and centered at coordinates xP E {O, 1, ... ,M -I}. We use the simplest definition of a pixel, namely the value of 1 at its center. The pixel intensities are considered as pre-normalized so that their minimum and maximum have been resized to -1 and 1.
The first step consists in computing the cubic spline interpolation of the intensities. This allows interpolating 1 at a larger number L = 1 + k x (M -1) of points (typically k = 4 is sufficient) in the interval [O,M -1]. We denote these points by xi with i = 0, 1, ... ,L -1.
The second step is to determine the affine part ax + b of 1. It is found by an ordinary least square method, taking into account the L points. Values of 1 at xl are used to compute its intersections with ax + b. The period of 1 (and thus its frequency) is deduced from the number of intersections and the distance between the first one and the last one.
The frequency could also be obtained using the deriva tive of spline functions, which only requires to solve quadratic equations but yields higher errors.
B. Phase determination
When f is known, Equation (1) has 4 parameters: a, b,A, and 8. A least square method based on a Gauss Newton algorithm coulb be used to determine them. Such an iterative process ends with a convergence criterion, so it is not suited to our design goals. Fortunately, it is quite simple to reduce the number of parameters to 8 only. Firstly, the affine part ax + b is estimated from the M values l(xP) to determine the corrected intensities, Finally, the problem of approximating 8 is reduced to minimizing
An optimal value 8* of the minimization problem is a zero of the first derivative of the above argument,
(4) Several points can be noticed:
• If all profiles have the same fixed size, then XP and var(XP) are constants.
• The terms L�O l sin( 4rtfi) and L�O l cos( 4rtfi) are independent of 8, they can be precomputed.
• Lookup tables can be set with the 2.M values sin(2rtfi) and cos(2rtfi).
• A simple method to find a zero 8* of the optimality condition is to discretize the range [-rt, rt] with a large number nbs of points and to find which one is a minimizer in the absolute value sense. Hence, three other lookup tables can be set with the 3 x nb,\' values sin8, cos8, and
• The search algorithm using a dichotomous process is very fast, namely in log 2 (nbs) operations. Taking into account all these remarks, we obtain an algorithm called LSQ (i.e. least square abbreviation) in the following, avoiding complex operations (division, trigono metric functions, ... ) which is mandatory to be ported efficiently on an FPGA. Nevertheless, it should also be proved that it fits with our goals.
C. LSQ evaluation
We evaluated the algorithm over three criteria:
• precision of results on a cosines profile distorted by noise,
• number of operations,
• complexity of FPGA implementation. For the first item, we produced a Matlab ® version, running in double precision. Before phase computation, it generates a profile for x E {O, ... , 20} using the following equation:
The profile was generated for about 34,000 ditlerent quadruplets of periods (J E [3.1, 6 .1]' step = 0.1), phases an average ratio of 50 between phase variation in radians and lever end position in nanometers. Assuming such a ratio and nbs = 1024, the maximum lever deflection error would be 0.15nm which is smaller than Inm, the precision to reach.
Moreover, pixels have been paired and the paired in tensities have been perturbed by addition of a random number uniformly picked in [-N,N] . Notice that we have observed that perturbing each pixel independently yields too weak profile distortions. We report percentages of errors between the reference and the computed phases out of 2n, The results show that the algorithm behaves very well against noise. Assuming an average ratio of 50 (see above), an error of 1 percent on the phase corresponds to an error of 0.5nm on the lever deflection, which is still under the precision to reach.
We ignore the level of noise present in real experiments and how it will distort the profiles. Results on the 17 x 4 array mentioned above allowed us to compare experimen tal profiles to simulated ones. Figure 2 shows the profile with N = 10 that leads to the biggest error. It is a bit distorted, with pikes and straight/rounded portions. In fact, it is very close to some of the worst experimental profiles. Figure 3 shows a sample of worst profile for N = 30. It is completely distorted, largely beyond any experimental ones. Obviously, these comparisons are a bit subjective and experimental profiles could also be more distorted on other experiments. Nevertheless, they give an idea of the possible error. The second criterion is relatively easy to estimate. The number of operation is proportional to M the number of pixels. It also depends on nb.I·• We assume that M = 20, nb.l· = 1024 and k = 4, that all possible parts are already in lookup tables and that a limited set of operations (+, -, *, /, <, » is taken into account. Translating LSQ algorithm in C code, we obtain about 430 operations, which is largely under the limit of 775 (see Section2). Nevertheless, considering the total number of operations is not fully relevant for FPGA implementation for which time and space consumption depends not only on the type of operations but also on their ordering. The final evaluation is thus very much driven by the third criterion.
,----�-�-�-� -�-�--�-�-�-_ _,
The Spartan 6 used in our architecture has a hard constraint since it has no built-in floating point units. Obviously, it is possible to use some existing "black boxes" for double precision operations. But they require a lot of clock cycles to complete. It is much simpler to exclusively use integers, with a quantization of all double precision values. It should be chosen in a manner that does not alter result precision. Furthermore, it should not lead to a design with a huge latency because of operations that could not complete during a single or few clock cycles. Divisions fall into that category and, moreover, they need a varying number of clock cycles to complete. Even multiplications can be a problem since a DSP48 takes inputs of 18 bits maximum. So, for larger multiplications, several DSP must be combined which increases the overall latency.
Nevertheless, the hardest constraint does not come from the FPGA characteristics but from the algorithm itself. Its VHDL implementation can be efficient only if it can be fully (or near) pipelined. We observe that it is not a problem with the LSQ since all parts, except the dichotomic search, work on the same data with a constant size: the profile.
VHDL implementation and experimental tests

A. The FPGA board
The architecture we use is designed by the Armadeus Systems company. It consists in a development board called APF27 ®, hosting an i.MX27 ARM processor (from Freescale) and a Spartan3A (from Xilinx). This board includes all classical connectors as USB and Ethernet for instance. A Flash memory contains a Linux kernel that can be launched after booting the board via u-Boot. The processor is directly connected to the Spartan3A via its special interface called WEIM. The Spartan3A is itself connected to an extension board called SP Vision ®, that hosts a Spartan6 FPGA. Thus, it is possible to develop programs that communicate between i.MX and Spartan6, using Spartan3 as a tunnel. A clock signal at 100MHz (by default) is delivered to dedicated FPGA pins. The Spartan6 of our board is an LX100 version. It has 15,822 slices, each slice containing 4 LUTs and 8 flip/flops. It is equivalent to 101,261 logic cells. There are 268 internal RAM blocks of 18Kbits, and 180 dedicated multiply adders (named DSP48), which is largely enough for our project. Some I/O pins of Spartan6 are connected to two 2 x 17 headers that can be used for any purpose. In our setup, they are connected to an interface board that is bound to the camera and the nano-positioning table controller.
B. VHDL implementation
From the LSQ algorithm, we have written a C program that uses only integer values. We used a very simple quan tization which consists in scaling each double precision value by a power of two and by keeping the integer part. For an accurate evaluation of the division in the computation of a the slope coefficient, we also scaled the pixel intensities by another power of two. The main problem was to determine these factors. Usually, they are chosen to minimize the error induced by the quantization. But in our case, we also have some hardware constraints, for example the width and depth of RAMs or the input size of DSPs. Thus, having a maximum of values that fit in these sizes is a very important criterion to choose the scaling factors.
Consequently, we determined the maximum value of each variable as a function of the scale factors and the profile size involved in the algorithm. It gave us the maximum number of bits necessary to code them. We chose the scale factors so that any variable (except the covariance) fits in 18 bits, which is the maximum input size of DSPs. In this way, all multiplications (except one with covariance) could be done with a single DSP, in a single clock cycle. Moreover, assuming that nbs = 1024, all LUTs could fit in the 18Kbits RAMs. Finally, we compared the double and integer versions of LSQ and found a nearly perfect agreement between their results.
As mentioned above, some operations like divisions must be avoided. When the divisor is fixed, a division can be replaced by its multiplication/shift counterpart. This is always the case in LSQ. For example, assuming that M is fixed, var(XP) is known and fixed. Thus, cova���fx��XP)) can be replaced by
where» is a right bit-shift and n depends on the desired precision (in our case n = 24).
Obviously, multiplications and divisions by a power of two can be replaced by left or right bit shifts. Finally, the code only contains shifts, additions, subtractions and multiplications of signed integers, which are perfectly adapted to FPGAs.
We built two versions of VHDL codes, namely one directly by hand coding and the other with Matlab us ing the Simulink HDL coder feature [8] . Although the approaches are completely different we obtained quite comparable VHDL codes. Each approach has advantages and drawbacks. Roughly speaking, hand coding provides beautiful and much better structured code while Simulink HDL coder allows fast code production. In terms of throughput and latency, simulations show that the two approaches yield close results with a slight advantage for hand coding.
C. Simulation
Before experimental tests on the FPGA board, we sim ulated our two VHDL codes with GHDL and GTKWave (two free tools with Linux). We built a test-bench based on experimental profiles and compared the results to values given by the C implementation. Both versions lead to correct results. Our first codes were highly optimized, indeed the pipeline could compute a new phase each 33 cycles and its latency was equal to 95 cycles. Since the Spartan6 is clocked at 100MHz, estimating the deflection of 100 cantilevers would take about (95 + 200 x 33).10 = 66. 95f.1S, i.e. nearly 15,000 estimations by second. Figure 4 shows the complete processing chain, includ ing FPGA components, the i.MX processor, and external devices. The main interface between the i.MX and the FPGA is a VHDL component called interconnector, that implements the wishbone protocol. It is in charge of relaying data from/to i.MX to/from other components. It is mainly used before the acquisition loop to fill the precomputed values LUTs, to send profile positions and identifiers to the profile extractor component, and to setup the scan parameters to table mover. It also relays orders like to begin profile extraction.
D. Complete design
The table mover component is directly connected to the nPoint controller that drives the nano-positioning table. It can be programmed to achieve two types of scan: a sequence of round trips in the x-y dimension or a simple round trip in the z dimension. The first one is mainly used to obtain topographic informations and the second one a force curve. For each type, the user can specify the size of the steps in space and time and their numbers.
The phase storage component is connected to a specific bus on the board, called CSI. Usually, this bus is used to grab video frames from little CMOS camera at high rate (up to 60 MHz). We use it to flush the phases of all profiles, each time an incoming image has been totally processed. It is the most efficient way to send the phases to the CPU.
It should be noticed that unwrapping the tip phase is done on the CPU since it is not time consuming and quite complex to implement on the FPGA.
The whole source code represents 9,300 lines of VHDL (without comments) but nearly 3,800 were generated by ISE CoreGen ® to implement wrappers for RAMs and DSPs. The phase computer which is the heart of the design consists in 2,200 lines of code, highly optimized. It perfectly illustrates the complexity to transpose the 150 lines of LSQ written in C into VHDL.
E. Bitstream creation and tests
Unfortunately, the version of phase computer used during simulations led to a design that could not be placed and routed with ISE on the Spartan6 with a 100MHz clock. The main problems were encountered with series of arithmetic operations and more especially with RAM outputs used in DSPs. The distance between some RAMs and DSPs, combined to the time needed to achieve the multiplication led to a signal propagation that lasts longer than the clock cycle. We solve the problem by inserting a delay after these RAMs.
Finally, we obtained a phase computer with a latency of 128 cycles and that can compute a new phase every 60 cycles. For 100 cantilevers, it would take ( 128 + 201 x 60) x IOns = 121.88,us to compute their deflection. It corresponds to about 8,200 estimations per second, which is largely beyond our camera capacities and the possibility to extract a new profile from an image every 60 cycles. Nevertheless, it also largely fits our design goals.
Conclusion and perspectives
In this paper we have presented a full hardware and software solution for real-time cantilever deflection com putation from interferometry images. Phases are computed thanks to a new algorithm based on the least square method. It has been quantized and pipe lined to be mapped into a FPGA, the heart of our solution. Performances have been analyzed through simulations and real experiments on a Spartan6 FPGA. The results meet our initial re quirements in terms of throughput (at least 100 deflection estimations every millisecond) and precision (I nm) . The next step is to finalize the setup, integrating the camera and the nano-positioning table for the hardware part, and implementing components to drive them for the software part.
Future works will also address more general problems such as algorithm quantization and real-time filtering or control for AFM arrays in dynamic regime. 
