Holographic optical tweezers allow the three dimensional, dynamic, multipoint manipulation of micron sized dielectric objects. Exploiting the massive parallel architecture of modern GPUs we can generate highly optimized holograms at video frame rate allowing the interactive micro-manipulation of complex structures.
Introduction
Holographic optical tweezers (HOT) use light to manipulate matter at the micron scale [1] . Dielectric objects, whose refractive index is higher than the surrounding medium can be trapped in regions of high light intensity by electromagnetic forces arising from the scattering of light [2] . To achieve stable trapping in three dimensions, light has to be strongly focused using a microscope objective with high numerical aperture. Many objects can be trapped simultaneously if more than a single focal spot is generated around the objective's focal plane. Digital holography provides a way to achieve this by applying a computer generated phase mask to a laser beam before it is sent through the microscope objective. The commercial availability of Spatial Light Modulators (SLM) has made this task easier by providing a reconfigurable support for computer generated holograms which is connected to a PC through the video output (usually DVI) on a standard video card [3, 4, 5, 6] . The task of finding a phase mask that efficiently redistributes the available laser power among an array of target focal spots is not a straightforward one. Phase only modulation can easily give rise to unwanted focal spots ("ghost traps") or large intensity variations. We recently proposed an iterative procedure that achieves optimal efficiency and uniformity in a few tens of steps [7] . However the resulting computational load is so high that the use of optimized algorithms for dynamic manipulation is limited to those circumstances when one knows in advance the sequence of moves and holograms can be then pre-calculated. Such a slowness is often considered as one of the major factors for preferring scanning beam techniques [8] over digital holography, for real-time applications.
In this paper we demonstrate how the use of commercially available Graphic Processing Units (GPU) can speedup hologram generation up to a factor 350, in a way that long computational time has not to be considered a limit for HOTs any longer and the SLM inertia becomes the real time limit.
GPU device architecture
GPUs use parallel computation, this is possible because in GPUs there are many processors that can run at the same time. With the recent release of Compute Unified Device Architecture (CUDA) [9] , NVidia GPUs are now programmable using C language, we can see CUDA as a C extension which adds few new syntax which permits the user to easily give instructions to the GPU. Codes written with CUDA are scalable this means that can be utilized in all NVidia GPUs independently of the number of their multiprocessors. The possibility of using a personal computer to easily and cheaply achieve the performances of an expensive CPU cluster is revolutionizing computational physics in a wide range of fields including medical imaging, molecular dynamics, finite element analysis, biochemical networks [9] . In the field of digital holography GPUs have been used for real-time holographic microscopy [10] or holographic displays [11] . The parallel architecture of GPUs is particularly suited for digital holography, whose basic task is that of performing complex algebra over a large array of independent pixels.
To understand how to write and optimize a code running on a GPU is necessary to know its architecture. Basically a GPU is composed of global memory and a variable number of multiprocessors. Each multiprocessor includes eight scalar processor cores, two special function units, 8192 registers, a multithreaded instruction unit and one on-chip shared memory. When we want the GPU (device) to execute many concurrent threads (independent processes) we must tell the CPU (host) to call a kernel. A kernel is a function called from the host and executed on the device (the device can execute only one kernel at time) once a kernel is called the host goes on and can execute other instructions that doesn't involve the device. When a kernel is launched threads are arranged in blocks contained in a grid and each block is assigned to a multiprocessor, to launch a kernel we only have to specify the number of threads in a block and the number of blocks. Instructions written in a kernel are executed by all threads, however each thread has his own ID given by its coordinate relative to the block and the coordinate of his block (see Fig. 1 ). Threads cooperation is possible only within threads that belong to the same block, they can be synchronized and can cooperate by using shared memory. Since shared-memory resides on individual multiprocessors, data stored in it can only be shared between threads belonging to the same block. We can indeed think of the shared memory as blocks private memory and in analogy with that we can consider registers as threads private memory. Within a block threads are arranged in groups of 32 called warps, threads in a warp are physically executed in parallel and are synchronized. Any control flow instruction (if, while..) may cause threads of a warp to diverge and therefore execution of threads belonging to different paths must be serialized. If threads belonging to two different warps diverge there's no slowing down because their execution is always serialized. Multiprocessors execute one warp at time, however if threads in a warp are waiting to access memory the multiprocessor can stop executing that warp and start to process another warp eliminating memory's latency time. This can happen within threads that belong to active blocks, the number of active blocks a multiprocessor can accommodate is limited by multiprocessor resources (shared memory, registers, etc.). The management of warps is automatic and is completely invisible to the programmer. Summarizing, to take advantage of the full computational power of the GPU and hide well memory latency, one should follow the following general rules:
1. Launch a kernel with a number of blocks bigger (better if multiple) than the number of multiprocessor to keep all multiprocessors busy. 2. Choose the number of threads per block as a multiple of 32 to avoid wasting time with unfilled warps. 3. Hide memory latency maximizing the number of active warps by using many threads per block.
Another very important issue to consider when optimizing a kernel, is the access to global memory. Writing and reading global memory is very slow and sometimes it can be even better to recalculate than to cache data. Shared memory is a good alternative to global memory (only 16k are currently available to any multiprocessor), because it is on chip and it's hundreds times faster. When all threads in a half warp execute a load instruction, the hardware detects whether threads access consecutive global memory locations and coalesces all these accesses if that is the case. In analogy with what happens with global memory, shared memory can be accessed with read or write instructions simultaneously by threads belonging to a half warp. To achieve this, threads must access different memory banks, if two addresses of a memory request fall in the same memory bank there is a bank conflict and the access has to be serialized. A memory bank is a portion of shared memory of 4 bytes, the 4 bytes successive to the first bank belongs to the second bank and so on. The number of bank is 16 so that distant addresses cannot be accessed simultaneously, restrictions to avoid bank conflicts are less hard to satisfy respect to restrictions of global memory coalesced access.
Optimizing algorithms
In back focal plane phase modulation we use an SLM to apply an array of phase shifts to a plane wave at the back focal plane of a focusing optical system (Fig. 2) . Our task here is to calculate the best phase mask so that the modulated wavefront propagating through the optical system is focused onto an array of chosen target spots. Given the phase shift on each pixel Φ j the complex field on a target point m, with coordinates x m , y m , z m , is given by [7] :
Where N is the number of pixels, i is the imaginary unit and ∆ m j is a term that takes account of the phase acquired upon propagation:
where f is the effective focal length of the focusing optics (L3, L4, M0 in Fig. 5 ), λ is the laser wavelength and x j , y j are the coordinates of the j th pixel. If we want to send all the light through the single point m = 1 then we should set Φ j = ∆ 1 j , so that V 1 = 1. When considering multiple traps, a phase only modulation might not be able to split all the available power uniformly among the target points. A quantitative measure of the hologram performance can be obtained by defining an efficiency (e) and a uniformity (u) parameters as a function of the fractions of total power flowing through the m
For each pixel we now have the multiple choices ∆ m j (the single trap holograms) and finding a compromise could seem a hopeless task. A first, reasonably fast recipe is that of taking the complex superposition of single trap holograms [12] :
Where Φ j is again the phase of the j th SLM's pixel, m th is the trap index, M is the number of traps, Θ m is a random phase relative to the m th trap. Such a procedure, usually referred as the random superposition algorithm (SR), is computationally rather fast but usually results in ghost traps and poor uniformities, especially when dealing with ordered structures. In our code we first generate M random phases (Θ m ) and get the traps coordinate, then we have to transfer everything on GPU global memory. The matrix (768x768) that contains the SLM phases (Φ j ) also has to reside in the global memory. It is quite immediate to think of kernel which assign a thread for each pixel, we first start with the less efficient situation and then step by step we improve the speed of the application. In the slowest of the cases each thread execute the operations in equation (4) by reading x m ,y m ,z m and Θ m from the global memory and valuating ∆ m j M times. The blocks organized so that his threads cover a square of 16x16 pixels (notice that the maximum number of threads per block is 512). In this way, because of the organization of the block, threads in the same warp will not access contiguous addresses of global memory; we don't use shared memory too but at least we have blocks with 8 fully filled warps that can well hide memory latencies. We can improve our code by loading the traps' coordinates and random phases on shared memory at the beginning of the kernel, in this way every block reads the global memory only once instead of reading it for every thread. At the beginning of the kernel only M threads cooperate to copy data to shared memory with a coalesced access to global memory. } /* we synchronize threads so that shared data is accessible by all threads within the block*/ __syncthreads();
Next we can arrange our blocks so that threads access the addresses of the matrix in sequence. Instead of allocating square 16x16 blocks we can allocate linear blocks of 384 threads (half row), in this way we still have a good number of filled warps per block. In Table 1 Using SR we have an excellent speedup but the traps' performance that can be obtained is poor (in the best case 80% of efficiency and 20% of uniformity). A poor performance may result in particles getting trapped in unwanted ghost trap sites or bead escape from temporary low intensity traps. When such events are acceptable SR provides a good choice for real time manipulation [13] , but if a higher degree of control is required a more performing algorithm is needed. A good candidate is the GSW algorithm which gives excellent results in terms of efficiency and uniformity. The basic idea behind GSW is that, if aiming at uniform trap intensities with SR leads to nonuniformities, we may hope that there's a choice of non uniform target traps' intensities resulting in an evenly spread trapping light. GSW allows to calculate such non uniform weights w m by an iterative procedure:
Where V m are calculated with equation (1), and |V | is the average module value of the traps' field over the M traps. It is clear from (6) that the convergence The weights initialization and evaluation functions (once V m are known) can be easily parallelized but that would affect only slightly the total time. The phase evaluation function in the GSW loop is very similar to what we described when we talked about SR. Equation (4) must be replaced by (5) and at the beginning of the kernel we load into shared memory weights, traps' coordinates and the phase of the traps' field (V m /|V m | = arg(V m )). We will now discuss the propagation routine that calculates the field on traps' locations, we will later show that GSW spends most of the time in this propagation function. The field on a trap is calculated as in equation (1). In our code we will evaluate the traps' field in a loop with M cycles, what is parallelized is the sum over the pixels and the evaluation of ∆ In the first kernel (Phase Kernel) we allocate a number of threads equal to half of the number of pixels, every thread will calculate the corresponding exponential exp(Φ j −∆ m j ), when this is done all threads will cooperate via shared memory to partially sum the exponentials as shown in 3 in every step of the summation the number of elements to sum will be halved till only one element will remain that will be copied on vector resident in global memory. At the end of the kernel we will have a complex vector with the dimensions of the number of blocks allocated (every element of the vector will hold the partial sum of the corresponding block). Then we launch a number of kernels to add the partial sums. Partial sums are evaluated using the same procedure described for the previous kernel (Phase Kernel). It is important that Phase Kernel performs a sum reduction, in this way a operation of read and write to global memory is avoided respect to the case where phase calculation and sum are totally separated. Notice also that in Phase Kernel the m th trap's coordinate must be passed, we suggest to trasfer this data directly from host without passing through the device global memory. The phases ∆ m j (M*N in total) are always the same for every iteration of the GSW loop, so we tried to precache them in global memory instead of calculate them in every kernel but this did not improved the code. The time required by GSW grows almost linearly with the number of traps or iterations. In 4 we report a graph which shows deviations from linearity of the dependence between time and number of traps. The initial decay evidences the presence of a time cost which is essentially independent from the number of traps and is probably due to memory read/write operations. As we can see from the figure we can neglet the small deviations from linearity and define the time needed to calculate an hologram for a trap in a iteration, using a GTX 260 we obtain 0.44 ms/trap/iteration obtaining a 45x speedup respect to a Pentium D 3.2 GHz. Schematic view of the experimental setup for holographic optical trapping. L1,L2,L3,L4 are planoconvex achromats, M2 is a dichroic mirror.
Real-time manipulation
Our optical tweezers are based upon a Nikon TE2000U inverted microscope with a 100x objective lens, NA 1.4. To form the trap we use a Nd:YAG laser, frequency-doubled to give a maximum power of 3 W at 532 nm (LaserQuantum Opus). After expansion and collimation, the beam from this laser is reflected off a computer-controlled SLM (HoloEye LCR 2500). Our SLM is a phase only modulator that consists in a matrix of liquid crystal pixels whose refractive index relative to the incoming polarization can be easily varied by applying an electric voltage. Light will pass through different pixels with a different velocity, thus giving the desired phase retardation. In reflective SLM light passes twice through the same liquid crystal cell which can be made thinner and consequently with a better time response. When a grayscale, 8bit depth image is displayed on the SLM, a proper pattern of voltages is relayed to the pixels to linearly map a grayscale value to a phase shift ranging from 0 to 2π. Light reflected from the SLM is then coupled to the microscope by projecting a demagnified image of the SLM plane on the back focal plane of the microscope objective. : GSW performance. We report the efficiency (e) and uniformity (u) for GSW generated holograms as a function of the number of traps. The number of GSW iterations is always such to work at a fixed framerate of 20 Hz. Holograms with a performance above 90% can be generated at 20Hz for trap arrays as large as 16.
An array of optical traps is then produced around the front focal plane of the objective located in a colloidal water suspension above the coverslip. The SLM was controlled by a host PC equipped with a NVidia GeForce GTX 260 video card. User input is managed by a GUI mainloop thread (Tkinter) running in a Python shell while a Python module wraps the CUDA library functions providing a high level interface to the GPU hologram generation. As a demonstration of real-time manipulation using optimized GPU generated holograms, we show the simultaneous trapping and manipulation of eight silica beads (2µm diameter) in water. Optimized holograms are obtained with 5 GSW steps at a rate of 48 Hz following user input. Fig. 7 shows three frames from the corresponding SLM and CCD timelines. While a hologram movie is displayed on the SLM (lower timeline) based on user input, a dynamic 3D micro-hologram, consisting of an array of moving bright light spots, is projected in the sample volume providing dynamical, and real-time reconfigurable optical traps. Trapped beads are imaged with bright light illumination on a CCD camera (upper timeline). The actual frame-rate is slightly lowered due to time lost in copying from device memory to host memory and then back to the video card output where the SLM is attached. This further delay could be avoided exploiting CUDA-OpenGL interoperability. In this way holograms could be calculated and then directly displayed on the SLM without passing throught the host. Ultimately the frame-rate is limited by SLM response time, which, for liquid crystal based devices, is typically about 20 Hz. Such a frame-rate allows to perform enough iterations of GSW to generate large arrays of traps with a good efficiency and uniformity. In Fig.6 we report the efficiency (e) and uniformity (u) for GSW generated holograms as a function of the number of traps arranged in a 2D square grid. The number of GSW iterations is always such to work at a fixed framerate of 20 Hz. Though efficiency is never lower than 85% uniformity falls down to 0.36 for a 9x9 grid where only one GSW iterations is allowed at the frame-rate of 20 Hz.
In conclusion, we have used a CUDA enabled video card to achieve a speedup of 350x (SR) and 45x (GSW) over the host CPU. The obtained speedup allowed us to trap and manipulate multiparticle 3D structures with optimized holograms in real time. Our results demonstrate that the high computational load of hologram generation cannot be considered any longer as a limiting factor of holographic trapping for real time applications.
