The challenging issues of cancer prevention and cure lie in the need for a more detailed knowledge of the dynamic processes and mechanisms of cellular behaviour and tumour growth dynamics. In this paper we extend a previous 2D parallel implementation of a continuous-discrete model of tumour-induced angiogenesis to the more realistic 3D case. In particular, we look in-depth at available performance optimisation techniques to further improve the computational method and explore in more detail the hardware architecture. Recent evidence clearly indicates that GPU-accelerated computing can greatly facilitate researchers, clinicians and oncologists by performing time-saving in-silico experiments that have the potential to assist in quantifying cellular parameters, highlight model features, and help explore new cancer treatments and therapies.
Introduction
Over the last decade, high-performance computing (HPC) has evolved dramatically, in particular because of the accessibility to graphics processing units (GPUs) and the emergence of GPU-CPU heterogeneous architectures, which have led to a fundamental shift in parallel programming. Finite difference methods (FDM), such as those developed here, are the first port of call for solving complex biological phenomenon described by nonlinear partial differential equations (PDEs). However, they require intensive computational resources which generally lead to significant and time-consuming expense. The advantages of explicit time-stepping in FDM over many other types of solutions lend themselves well to exploitation in a completely dataparallel context. In such cases, GPUs can be used to greatly accelerate numerical simulations and offer an extremely valuable computational technique for tackling such problems. The compute unified device architecture (CUDA) programming model is especially well-suited to address problems that can be expressed as data-parallel computations.
In a previous paper the authors developed a 2D finite difference approximation to a hybrid continuous-discrete model of tumour-induced angiogenesis [2] . The numerical solution was implemented in both C++ and CUDA C to assess the performance benefits of porting from a serial to a parallel programming platform. Results indicated a dramatic increase in execution time between the two implementations and also highlighted a range of potential performance improvements available through more advanced data manipulation and memory management techniques. In this paper, the authors develop a 3D finite difference approximation to the same hybrid continuous-discrete model and implement some of these advanced features with a view to highlighting the potential benefits of modelling cellular and cancer dynamics whilst also highlighting the possibilities for developing new advanced clinical research tools based on GPU-accelerated applications. Indeed, in the last decade in silico trials focussed on simulating the different processes of solid tumour growth have become more readily accepted by the clinical and oncology community. The advantages of using GPU-accelerated programs and HPC continually highlight the potential performance improvements in solving complex mathematical models of biological phenomenon in this way [1, 2] .
In order to progress from the relatively harmless avascular phase to the potentially lethal vascular state, solid tumours must induce the growth of new blood vessels from existing ones, a process known as angiogenesis. While early models of angiogenesis were focused on accurately replicating key observed behaviours during the process, more recent models have been able to test specific hypotheses and suggest useful strategies for antiangiogenic drug development. A key mechanism of antiangiogenic therapy is to interfere with the process of blood vessel growth and literally starve the tumour of its blood supply. Indeed, a new class of cancer treatments that block angiogenesis have recently been approved and available to treat cancers of the colon, kidney, lung, breast, liver, brain, ovaries and thyroid [3] [4] [5] [6] [7] . Angiogenesis is without doubt a complex biological phenomena and one that at a cellular level is dynamic, spatially heterogeneous, frequently non-linear, and spans many orders of magnitude, both spatially and temporally. Mathematical and computational models of vascular formation have generated a basic understanding of the processes of capillary assembly and morphogenesis during tumour development and growth [8, 9] . However, by the time a tumour has grown to a size whereby it can be detected by clinical means, there is a strong likelihood that it has already reached the vascular growth phase and developed its own blood circulatory network. For this reason, a thorough understanding of the behavioural processes of angiogenesis is essential. The development of realistic mathematical and computational models of cancer dynamics is a powerful method of testing hypotheses, confirming biological experiments, and simulating complex behaviour. The model presented here is of a hybrid nature in which a system of couple nonlinear partial differential equations (PDEs) describing continuous chemical and macromolecular dynamics and a discrete cellular automata-type model controls cell migration and their interaction with neighbouring cells [10] . The main objective of the paper is to extend the work developed in [2] to the numerical solution of the hybrid continuous-discrete model describing tumour-induced angiogenesis to the more realistic 3D case. We also wish to address ways in which parallel performance can be optimised by making use of the explicit GPU hardware architecture and CUDA programming model.
A Continuous-Discrete Model of
Tumour-Induced Angiogenesis
The Continuous Model
For a more detailed treatment of the biological aspects of tumour-induced angiogenesis as well as a more rigorous mathematical proof, readers are directed to [2, 10] and references therein. Here we simply summarise the main mathematical development so as to focus on the main issues of the paper. If we denote the endothelial cell density by n, the TAF and fibronectin concentration by c and f, respectively the complete system of scaled coupled nonlinear PDEs describing tumour-induced angiogenesis can be written as [2, 10] :
A description of each of the parameters, and their respective values, can be found in [2, 10] . Our system is assumed to hold on a 3D spatial domain Ω (i.e., a volume of tissue) with appropriate initial conditions; c(x, y, z 0), f(x, y, z 0) and n(x, y, z 0) [2, 10] . The tumour cells are assumed to be confined within a domain Ω ∈ 0,1 in which no-flux (Neumann) boundary conditions are imposed on the boundaries of Ω (see Figure 1) . 
The Discrete Model
The technique of tracing the path of an individual endothelial cell at a sprout tip was first proposed by Anderson et al. [11] . The method involves using standard FDM to discretise the continuous model described in (1)-(3) over a 3D uniform grid. Then, the resulting coefficients of the finite difference seven-point stencil are used to generate the probabilities of movement of an individual endothelial cell in response to its local microenvironment. 3D stencil computations are those in which each node in a 3D grid is updated with a weighted average of the six neighbouring node values. Two schematic diagrams of a 3D finite difference seven-point stencil are shown in Figure 2 . We first discretise the continuous model by approximating the 3D domain Ω ∈ 0,1 on a uniform grid of node length, width and depth h, and time t by increments of size k. By applying a forward finite difference scheme, the fully-explicit discretised version of the continuous model can be obtained. 
The coefficients P 0 -P 6 can be thought of as being proportional to the probabilities of endothelial movement. That is, the coefficient P 0 , is proportional to the probability of no movement, and the coefficients P 1 , P 2 , P 3 , P 4 , P 5 , and P 6 , are proportional to the probabilities of moving left, right, up and down, out of and into the plane, respectively. The exact forms of P 0 -P 6 are functions of both fibronectin and TAF concentrations at nearby neighbouring points of an individual endothelial cell [2, 10] .
Each numerical simulation is based on an increased size of array width i.e., a finer grained uniform 3D grid. We use a constant iteration size of 1,000 time steps to allow for an adequate convergence of the numerical solution. At each time step, the numerical simulation involves solving the discrete model to generate the seven coefficients P 0 -P 6. Based on the values of these coefficients , a set of seven probability ranges are determined and then a uniform random number is then generated on the interval [0, 1], and, depending on the range into which this value falls, the current individual endothelial cell will remain stationary (R o ), move left (R 1 ), right (R 2 ), move up (R 3 ), down (R 4 ), out of (R 5 ), or into the plane (R 6 ). The complete set of parameter values used for the numerical simulation can be found in [2, 10] .
Implementation

The Kepler GK110 Architecture
The GPU is specialised for computer-intensive, highly data parallel computations allowing more transistors to be devoted to data processing rather data caching and flow control. More specifically, the GPU is especially well-suited to address problems that can be expressed such that the same algorithm is executed on many data elements in parallel, with high arithmetic intensity i.e., the ratio of arithmetic operations to memory operations. Since the algorithm is executed on many data elements and has high arithmetic intensity, the memory access latency can be hidden with calculations instead of big data caches. The recent rollout of the Nvidia Kepler GK110 architecture marked a significant milestone in the evolution of GPU-accelerated computing. By offering much higher processing power than previous architectures and by providing new methods to optimise and increase parallel workload execution on the GPU, the Kepler GK110 has further revolutionised HPC. Each of the Kepler GK110 streaming multiprocessor (SMX) units feature 192 single-precision CUDA cores, and each core has fully pipelined floating-point and integer arithmetic logic units (ALU). Figure 3 shows the major differences between the CPU and GPU architectures in terms of ALU, cache and dynamic random access memory (DRAM) layout [12] . [12] .
Applications running on Kepler GK110 can also take advantage of the increased number of registers available to each thread to increase instruction level parallelism. The Kepler GK110 features a large dedicated L2 cache memory, double the amount of L2 available with previous architectures. The L2 cache is the primary point of data unification between the SMX units, servicing all load, store, and texture requests and providing efficient, high speed data sharing across the GPU [12] . Table 1 shows some specifications for the hardware architecture in the Kepler GK 110 architecture. The SMX schedules threads in groups of 32 parallel threads known as warps. Each SMX features four warp schedulers and eight instruction dispatch units, allowing four warps to be issued and executed concurrently [12] . Figure 4 shows a schematic of how warps are scheduled in the Kepler GK110 architecture. [12] .
The Kepler GK110 memory hierarchy is organised similarly way to earlier architectures as shown in Figure 5 and also enables compiler-directed use of an additional new on a Parallel Programming Platform cache for read-only data (see Table 1 ). When writing parallel programs, it is often necessary to communicate values between parallel threads. The typical way to do this in the CUDA programming model is to use shared memory. However, the Kepler GK110 architecture introduced a way to directly share data between threads that are part of the same warp i.e., threads in a warp can read other registers by using a new instruction called SHFL, or shuffle. Firstly, it is possible to use the shuffle instruction to free up shared memory to be used for other data. Secondly, the shuffle instruction is faster than shared memory since it only requires one instruction versus three for shared memory (write, synchronise, read). Another potential performance advantage for shuffle is that relative to older architectures, shared memory bandwidth has doubled on Kepler devices and the number of cores has increased by 6×; therefore, the shuffle instruction provides another means to share data between threads and keep the cores busy with memory accesses that have low latency and high bandwidth.
Hardware Specifications
The hardware used for the serial C++ implementation is a fourth generation Intel ® Quad Core ™ i7-4790K 4GHz CPU processor. The C++ implementation was developed and compiled in Microsoft ® Visual Studio 2012. The CUDA C++ program was also developed in Microsoft ® Visual Studio 2012 using CUDA version 7.0 and tested on an Nvidia GeForce ® GTX TM 780 GPU based on the Kepler GK110 architecture with Compute Capability 3.5. The Compute Capability describes the features of the hardware and reflects the set of instructions supported by the device as well as other specifications, such as the maximum number of threads per block and the number of registers per multiprocessor. Moreover, hardware design, number of cores, cache size, and supported arithmetic instructions are different for different versions of Compute Capability. Higher compute capability versions are supersets of lower (i.e., earlier) versions, so they are backward compatible. The operating system for both configurations is Windows 8.1. Table 2 shows some hardware specifications for the Nvidia GeForce ® GTX TM 780 GPU. 
Memory Bandwidth
Bandwidth is usually used to describe the highest possible amount of data transfer per unit time, while throughput can be used to describe the rate of any kind of information or operations carried out per unit time, such as, how many instructions are completed per cycle. Limited memory bandwidth can become a serious bottleneck to GPU performance and while a GPU typically has far greater memory bandwidth than a CPU, maximising the use of this bandwidth is still a critical issue. If an algorithm spends more time computing than transferring data, then it may be possible to overlap these operations and completely hide the latency associated with transferring data. On the other hand, if the algorithm spends less time computing than transferring data, it is important to minimise transfer between the CPU and GPU. In general, whilst performing code optimisation, it is important to determine how the application compares to theoretical limits. Theoretical memory bandwidth can be calculated using: memory clock rate × bus interface width × data rate (5) Since the Kepler 110 architecture relies on the graphics double data rate random access memory type i.e., GDDR5, the theoretical memory bandwidth for the GTX TM 780 GPU card is 288.4 GB/s = 3.004 × 384/8 × 2 . Note that bus interface width has been converted to bytes.
Instruction Throughput
Two types of floating-point numbers are typically used in algorithms, single-precision floats and double precision doubles. Single precision requires 32 bits (4 bytes) of storage and has an accuracy around 7 decimal places. Double precision requires 64 bits (8 bytes) and achieves an accuracy around 16 decimal places. Such large discrepancies between the two types of numbers have a significant impact on numerical simulations. That is, nine decimal places of information are lost when using only single-precision, and when implementing iterative procedures, such as FDM, this can introduce large errors. Results obtained using doubleprecision calculations will frequently differ from the same operation performed using single-precision arithmetic due to rounding issues. Therefore, it is important to be sure to compare values of like precision and to express the results within a certain tolerance rather than assuming them to be exact. Indeed, we can estimate a lower-bound to the performance of our CUDA C implementation by estimating the (giga) floating point operations per second (Gflops). Gflops are a measure of processing speed, equal to the number of operations the CPU and GPU can perform per second. In general, a processor can do a certain number of Gflops every time its internal clock ticks (or cycle). It is important to note that there is quite a difference between single-precision and double-precision Gflops. A processor that is capable of many single-precision Gflops may only be capable of a small fraction of that many double-precision calculations. We assume the following general formula to determine the number of Gflops for our CPU and GPU processors, given by:
For the GTX TM 780 GPU card, we get 3977 Gflops = 0.863 × 2304 × 2 i.e., 3.98 Tflops single-precision. With the GK110 architecture, double-precision performance is fixed at 1/24 that of single-precision performance i.e., 166 Gflops double-precision. Based on these values, the estimated performance improvement between serial and parallel implementations, in terms of Gflops calculations alone, should be at least in the region of 31×. Note that, in addition to accuracy, the relative conversion between double and floating point numbers (and vice versa) can also have a detrimental effect on performance.
We can also measure the performance of an algorithm in terms of its compute to memory access ratio (CMA). Many numerical algorithms, such as FDM, have a very low CMA of around 1.0, implying there is a read or write to memory for every floating-point operation. For the GTX TM 780 GPU card, which has a memory bandwidth of 288.4 GB/sec. At single precision (4 bytes) the maximum transfer rate will be 72.1 Gflops. With a CMA of 1.0, this gives a calculated flop rate of 72.1 Gflops, far less than the theoretical maximum of 3.5 Tflops. In order to achieve optimal memory bandwidth, it is vital to ensure that memory Is effectively managed, which when correctly managed, can lead to substantial increases in data transfer rates, and is vital for delivering performance that is close to the theoretical maximum.
In a GPU, a SMX relies on thread-level parallelism to maximise utilisation of its functional units. Utilisation is therefore directly linked to the number of resident warps. The number of clock cycles between an instruction being issued and being completed is defined as instruction latency. Full compute resource utilisation is achieved when all warp schedulers have an eligible warp at every clock cycle. This ensures that the latency of each instruction can be hidden by computation from other warps. Whilst bandwidth is usually used to describe the highest possible amount of data transfer per unit time, while throughput can be used to describe the rate of any kind of information or operations carried out per unit time, such as, how many instructions are completed per cycle. Another useful performance metric is the ratio of instructions to bytes. For the GTX TM 780 GPU card, the theoretical ratio is 13.8 instructions: 1 byte (= 3.98/288.4) i.e., if an application issues more than 13.8 instructions for every byte accessed, then it is bound by arithmetic performance. However, most GPU-accelerated workloads, are bound by memory bandwidth.
The CUDA Programming Model
The CUDA programming model involves running code on two different platforms concurrently; a host system (the CPU) and a device (the GPU). While GPUs are frequently associated with graphics, they are also powerful arithmetic engines capable of running thousands of lightweight threads in parallel. This capability makes them well suited to computations that can leverage parallel execution. Nowadays, modern GPUs can support up to 2,304 active threads concurrently per multiprocessor. So, for a GPU with 12 multiprocessors, this leads to more than 27,000 concurrently active threads. Threads on a CPU are generally heavyweight entities. The operating system must swap threads on and off CPU execution channels to provide multithreading capability. Context switches (i.e., when two threads are swapped) are subsequently slow and expensive. On GPUs, threads are extremely lightweight. In a typical system, thousands of threads are queued up for work in sets of 32 threads each (i.e., warps). If the GPU must wait on one warp of threads, it simply begins executing work on another. Since separate registers are allocated to all active threads, no swapping of registers or other state need occur when switching among GPU threads. Resources stay allocated to each thread until it completes its execution. In short, CPU cores are designed to minimise latency for one or two threads at a time, whereas GPUs are designed to handle a large number of concurrent, lightweight threads in order to maximise throughput. The host system and the device each have their own distinct attached physical memories. As the host and device memories are separated by the PCI Express (PCIe) bus, data in the host memory must be communicated across the bus to the device memory. Such continually data transfers usually result in memory bottlenecks which can lead to serious performance issue when developing GPU-accelerated applications.
The CUDA programming model provides an application program interface (API) that exposes the underlying GPU architecture; a collection of single instruction, multiple data on a Parallel Programming Platform (SIMD) processors capable of executing thousands of threads in parallel. A version of SIMD used by GPUs is the single instruction, multiple threads (SIMT) architecture in which multiple threads execute an instruction sequence. In CUDA C, an instruction sequence is written into a specific function known as a kernel that can be executed on a device N times in parallel by N different CUDA threads, asynchronously. Unlike a C function call, all CUDA kernel launches are asynchronous so that control returns to the CPU immediately after the CUDA kernel is invoked. An execution configuration defines both the number of threads that will run the kernel plus their arrangement in a 1D, 2D, or 3D computational grid. In its simplest form, the kernel is defined using the following CUDA C syntax [13, 14] :
__global__ kernel<<<dimGrid, dimBlock>>>(); Threads are grouped into blocks and blocks are grouped into grids as shown schematically in Figure 6 . There is a limit to the number of threads per block, for the Kepler GK110 architecture a thread block may contain up to 1,024 threads. On the GPU, each multiprocessor is responsible for handling one or more blocks in a grid which is further divided into a number of streaming processors each handling one or more threads in a block.
Figure 6. A schematic representation of threads, blocks and grids.
In general we want to size our blocks and grids to match data requirements and simultaneously maximise occupancy. Occupancy measures the efficiency to which we assign how many threads are active at any one time. The major factors influencing occupancy are efficient memory allocation and thread block size. Clearly, thread block size should always be a multiple of 32, since threads are scheduled in warps. For example, if we have a block size of 50 threads, the GPU will still issue commands to 64 threads and this would just be waste of resources. It is often necessary to try and size blocks based on the maximum numbers of threads and blocks corresponding to the Compute Capability of the GPU. The theoretical occupancy is the ratio of active warps to the maximum warps for a SMX. Each multiprocessor on a device has a set of N registers available for use by CUDA thread programs. These registers are a shared resource that is allocated amongst thread blocks executing on a multiprocessor. The CUDA compiler attempts to minimise register usage to maximise the number of thread blocks that can be active simultaneously. If a program tries to launch a kernel for which the registers used per thread times the block size is greater than N, the launch will fail. Varying the size of the thread block is a standard optimisation to find the best occupancy rates. Moreover, high occupancy rates help to hide the latency in accessing global memory.
A block is 1D, 2D, or 3D with the maximum size of the x, y, and z dimensions being 1,024, 1,024, and 64, respectively, such that M × N × O ≤ 1,024 i.e., the maximum number of threads per block. Blocks are subsequently organised into a 1D, 2D or 3D grid with the maximum size of the x, y, and z dimensions being 2 31 -1, 65,535, and 65,535, respectively. An example schematic of a block and grid set up is shown in Figure 7 . There are also a maximum of 65,536 registers available per block. 
Performance Optimisation
Optimising the performance of CUDA applications most often involves optimising data accesses which includes the appropriate use of the various available memory spaces (see Figure 8 ) of the GPU architecture. Indeed, appropriate use of these memory spaces can have significant performance implications for almost every CUDA application. Note that Figure 8 includes blocks labelled local memory within the multiprocessor. Local memory implies local in the scope of each thread. It is a memory abstraction, not an actual hardware component. In actuality, local memory gets allocated in global memory by the compiler and delivers the same performance as any other global memory region. The local and global memory spaces are not cached which means each memory access to global (or local) memory generates an explicit memory access.
Global Memory Coalescing
The latency in accessing global memory can be considerable. Although the bandwidth of global memory seems high, around 200-300 GB/s, it is very slow compared to the Tflop performance capability of a typical GPU. Global memory is implemented with dynamic random access memories (DRAM) using a parallel access process i.e., each time a memory location is accessed, a number of other memory locations (that include the requested location) are also accessed. If an application utilises data from consecutive accessed locations before accessing other locations, the DRAM can achieve near peak global memory bandwidth. Therefore, global memory delivers the highest memory bandwidth only when the global memory accesses can be coalesced. The performance penalty for non-coalesced memory operations varies according to the size of the data type (e.g., 4-bytes). Each active block is split into SIMD groups of threads; warps. Each warp contains the same number of threads i.e., warp size, which are executed by the multiprocessor in a SIMD manner. This means each thread within a warp is broadcast the same instruction from the instruction store, which directs the thread to perform some operation or manipulation of local and/or global memory. Active warps are time-sliced; the thread scheduler periodically switches from one warp to another to maximise the use of the multiprocessor's hardware resources (see Figure 9 ). The order of execution of the warps within a block and of blocks themselves is undefined, which means they can occur in any order. Moreover, all threads in a warp execute the same instruction. When all threads in a warp execute a load instruction, the hardware checks if the threads are accessing consecutive memory locations. Ideally, thread 0 on a Parallel Programming Platform accesses location n, thread 1 accesses location n + 1, ..., thread 31 accesses location n + 31, then all accesses are coalesced and combined into one single contiguous access.
Consider the case when the warp scheduler requests 32, aligned consecutive 4-byte words from global memory. Each memory address will fall into 4 segments of 128 bytes each as shown in Figure 10 . The consecutive alignment scenario in Figure 10 will result in 100% coalesced global memory access. Now consider the case when the warp scheduler requests 32, permutated consecutive 4-byte words from global memory as shown in Figure 11 . Each memory address will still fall into 4 segments of 128 bytes and also achieve 100% coalesced global memory reads. Figure 12 shows the case when memory addresses are misaligned consecutive 4-byte words. In this case, each memory address now falls into at most 5 segments of 160 bytes which results in a lower utilisation of global memory reads (80%). Finally, consider the case when memory addresses are scattered as shown in Figure 13 . This results in N segments of N × 32 bytes and a severe non-coalesced global memory access. It is therefore extremely important to aim for perfect address coalescing i.e., optimised address patterns. A warp will generally access a contiguous region of memory so it is necessary to avoid scattered access patterns or those with large strides between threads.
Array Flattening
In general memory allocated dynamically on the GPU cannot use 2D array indexing like you would using C++ i.e., a 2D array declared as: Q R S , must first be flattened into 1D linear array of memory in which each element is indexed from the beginning of the array by determining an offset, Q offset dependent on indices R, S, and the array width. In general, such memory allocation utilises row-major order. Figure 14 shows an example of how a 2D array is flattened into a 1D representation using a row-major order offset. Using linear arrays removes a great deal of the complexity associated with transferring a double pointer (i.e., ** or ) array between the host and device. In essence, a nested deep copy operation is required in the copy sequence from host to device, such that linearising (or flattening) the data allows it to be referenced using only a single pointer (*). For a 3D array, declared as: Q R S T , a similar offset is calculated but now dependent on indices R, S, T and the array width and depth as shown in Figure 15 . 
Texture Cache
Together with memory coalescing in global memory transfers, exploiting shared memory is a further key optimisation. Reusing data stored in shared memory is far more efficient than repeatedly loading from global memory, as long as it can be used efficiently within a thread block. However, with the increases in caching levels for the Kepler GK110 architecture, this has become less critical for certain types of algorithm [12] . For example, texture memory provides a surprising aggregation of capabilities including the ability to cache global memory (separate from register, global, and shared memory) and dedicated interpolation hardware separate from the thread processors. Texture memory also provides a way to interact with the display capabilities of the GPU. Since optimised data access is very important to GPU performance, the use of texture memory can (in the right circumstances) provide a large performance increase. The best performance will be achieved when the threads of a warp read locations that are spatially local. Moreover, designed primarily for graphics applications, textures are used more generally to maximise memory bandwidth in applications where global memory reads do not satisfy coherency constraints but nonetheless exhibit a high degree of spatial locality [14] .
The Kepler GK110 architecture enables applications to utilise texture cache when reading from global memory without actually using the texture reference or texture object APIs. This is done using LDG instruction which is like a global load, except that data is transported through the texture cache instead of the regular L1/L2 cache hierarchy. To allow access to such bindless textures, pointers to global memory must be decorated with __const__ and __restrict__ qualifiers. The whole point of __restrict__ is to tell the compiler that two or more pointer arguments will never overlap in memory. Two pointers alias if the memory to which they point overlaps, so in an ideal situation we require no redundant memory accesses as a result of pointer aliasing. By decorating a pointer with the restrict property, the programmer is promising the compiler that any data written to through that pointer is not read by any other pointer with the __restrict__ property. In other words, the compiler does not have to worry that a write to a restrict pointer will cause a value read from another restrict pointer to change. Pointer aliasing is something developers need to be extremely aware of on both the GPU and CPU for which proper use can significantly improve performance and code optimisation [13, 15] .
Constant Memory
Constant memory is read only and cached on-chip and has only one read port, but can broadcast data from this port across a warp. This means that constant memory access is effective when all threads in a warp read the same address, but when threads in a warp read different addresses the reads are serialised. Since constant memory is cached, a read from constant memory costs one memory read from device memory only on a cache miss; otherwise, it just costs one read from the constant cache. That is, since constant memory is cached, consecutive reads of the same address will not incur any additional memory traffic. Constant memory is declared using the __constant__ keyword and must be declared outside of the main body of the program and the kernel function. Constant cache is written to only by the host and subsequently initialised in the main body of the program using cudaMemcpyToSymbol(). Constant memory is perfect on a Parallel Programming Platform for coefficients and other data that are used uniformly across threads [13, 15] .
C++ and CUDA C Algorithms
The main body of the C++ implementation is shown in Algorithm 1.
Achieving a high-level of performance and optimisation using the CUDA programming model requires careful attention to detail [1, 2, [16] [17] [18] . Porting from C++ to CUDA C involves additional coding, as well as some efficient manipulation of the kernel function in respect of thread deployment and memory management. Within the CUDA programming model, CUDA C code is required to initialise memory on the device, and to deal with the transfers of data to the device and back to the host after the kernel execution has completed. In general, there are three steps that are essential to the successfully execution of a kernel on the GPU. Firstly, data must be initialised and transferred from the host to the device global memory. Once the data is on the GPU, the kernel is executed N times and launches the required number of N threads for the device. When all threads have completed execution (enforced through synchronisation) data is transferred back to the host from the device. In the CUDA programming model, device memory is typically allocated using cudaMalloc()and data is transferred between host and device memory using cudaMemcpy depending on the data flow i.e., either cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost. Memory is subsequently freed after completion using cudaFree(). Making efficient use of available memory (e.g. texture cache, constant memory) can reduce the amount of data that has to be physically transferred between host and device, which is typically the performance bottleneck. The algorithms for the implementation of the CUDA C kernel and the main body are shown in Algorithms 2 and 3.
The actual performance improvement is be based on the execution time of each of the C++ and CUDA C implementations. Here, the execution time is the difference between two clock statements in each of the C++ and CUDA C algorithms. One placed at the start, and the other at the end of the main looping routine (including the memory transfer in CUDA C). Thus, execution time represents the time taken to complete the entire process of a single simulation of the numerical solution to the hybrid continuous-discrete model. With CUDA C, it is important to remember that calls to kernels are asynchronous. Therefore, to accurately measure the elapsed time for a particular call or sequence of CUDA calls, it is necessary to synchronise the CPU thread with the GPU by calling cudaDeviceSynchronize() immediately before starting and stopping the CPU timer. cudaDeviceSynchronize() blocks the calling CPU thread until all CUDA calls previously issued by the thread are completed. The CUDA event API provides calls that create and destroy events, record events (timestamp), and convert timestamp differences into a floating point value i.e., milliseconds (ms) with a resolution of approximately ½ ms. cudaEventRecord() is used to place the start and stop events into the default stream i.e., stream 0. The device will record a timestamp for the event when it reaches that event in the stream. The cudaEventElapsedTime() function returns the time elapsed between the recording of the start and stop events.
Results and Discussion
Usually we require a host function to verify the results from the kernel to check that both versions are indeed producing the same answer. This usually achieved by setting the execution configuration to <<<1, 1>>>, so that the kernel is forced to run with only one block and one thread. This emulates a sequential implementation. In addition, this is very useful for verifying that numeric results are bitwise exact from one simulation to another, especially if encountering order of operations issues. Table 3 shows the magnitude of speedup (×) of the CUDA C implementation over that of C++ based on execution time and for a range of block dimensions up to the maximum allowable threads per block i.e., 1,024. Table 4 shows average occupancy, memory bandwidth, and instruction throughput for the range of block dimensions. Table 3 & 4 show that the optimal block dimensions are 16 × 4 × 4 resulting in the best performance in terms of execution speed, bandwidth and throughput. Notice that average memory bandwidth and throughput are only 56% that of their theoretical peak values which suggests there are likely further areas of optimisation that need to be investigated. However, it is not that surprising since theoretical values are rarely achieved in reality. An optimal choice of execution configuration can often lead to performance benefits at the expense of a higher occupancy. For example, a very low occupancy such as the one obtained with block dimension 4 × 4 × 4 is clearly a bad allocation of resources and will generally lead to poor performance. However, the 64 × 4 × 4 configuration resulted in a high occupancy but the lowest of throughput. So, it is not necessarily the case that the highest occupancy always results in optimum performance. Moreover, it is fair to say that algorithm optimisation is an exhaustive process i.e., involving identify an opportunity for optimisation, apply and testing, verify the speedup achieved, and repeating. It is not necessary for a programmer to spend large amounts of time memorising the bulk of all possible optimisation strategies prior to achieving reasonable speedups. Instead, strategies can be applied incrementally as they are understood. As we have seen, optimisations can be applied at various levels, from overlapping data transfers with computation all the way down to fine-tuning floating-point operations. The available profiling tools are invaluable for guiding this process, as they can help suggest a next-best course of action for the developer's optimisation efforts and provide references into the relevant portions of the optimisation section of this guide. When attempting to optimise CUDA C applications, it pays to know how to measure performance accurately and to understand the role that bandwidth plays in performance measurement.
Further areas of improved performance and code optimisation are currently being explored by the authors, including methods such as data prefetching and improvements to the instruction mix. Data prefetching involved masking the loading of data from global memory to register by overlapping data access and computation. Instruction mix optimisation is where code is refactored to maximise the number of floating-point operations as opposed to addressing and branching. An example would be loop unrolling, which decreases loop iterations whilst increasing the number of floating-point calculations per iteration. These optimisations are again based on maximising the available memory bandwidth.
Conclusions
The main objective of this paper was to extend previous work by implementing a hybrid continuous-discrete model describing tumour-induced angiogenesis in the more realistic 3D case. We also addressed ways in which performance can be optimised by making use of the GPU hardware architecture and the CUDA programming model. Indeed, CUDA has proved to be a natural programming model to deploy applications in a modern massively-parallel (i.e., highlythreaded) environment. We consider several orders of magnitude performance increase over existing technology a disruptive change that can dramatically alter various aspects of the techniques and methods applied to computational biology. For example, computational tasks that previously would have taken a year can now complete in a few days, hour long computations suddenly become interactive being completed in seconds. Moreover, new technology will allow the tractability of previously intractable complex real-time processing tasks.
Data-parallel processing maps data elements to parallel on a Parallel Programming Platform processing threads. Many applications that process large data sets can use a data-parallel programming model to speed up computations. For example, 3D rendering, large sets of pixels and vertices are mapped to parallel threads. Similarly, image and media processing applications such as post-processing of rendered images, video encoding and decoding, image scaling, stereo vision, and pattern recognition can map image blocks and pixels to parallel processing threads. In fact, many algorithms can be accelerated by data-parallel processing, from general signal processing or physics simulation to computational biology. In terms of clinical research, developing realistic highly-complex cancer simulations using such technologies will lead to new therapy strategies, optimised drug delivery systems, interactive computer simulations of dynamic oncological process, and other advanced cancer treatment possibilities. Radiation therapies with ion beams can precisely target cancerous tumours, while leaving surrounding healthy tissue unharmed. Such targeted therapy leads to less invasive surgery, shorter hospital stays and speedier recovery times. The drawback is that conventional ion accelerators tend to be huge in both size and cost. This puts them beyond the budget for most medical facilities. Creating live computational simulations and adding physical phenomena to the algorithms was once considered impossible. Now, with the parallel processing power of GPUs, it is no longer impossible. Moreover, the race is on to understand how cell mutation causes cancer, which kills hundreds of thousands worldwide each year and is the second leading cause of death in the U.K. Research is focused on anti-cancer drug discovery pipelines using advanced molecular dynamics simulations powered by GPUs. Enormous gains in computing power are enabling a new framework for drug discovery utilising computer simulations to capture various shapes of a tumour suppressor, the protein called p53, known as the guardian of the genome because it is a key regulator of cell growth and development in normal cells. Indeed, computer simulations capture not just how proteins are built, but how they function inside the body. Such simulations running on GPU-accelerated computers can reveal new binding sites that may help cancer researchers create new drugs to help reactivate p53 when it mutates and ceases to function correctly. Clearly, the continued development of parallel processed computer simulations using GPUs is of paramount importance. Such technologies can clearly aid in the understanding of complex mathematical models and underlying biological processes, whilst also helping to uncover the elusive cure for cancer.
