This paper describes advances in statistical computation for large-scale data analysis in structured Bayesian mixture models via graphics processing unit (GPU) programming. The developments are partly motivated by computational challenges arising in fitting models of increasing heterogeneity to increasingly large data sets. An example context concerns common biological studies using high-throughput technologies generating many, very large data sets and requiring increasingly high-dimensional mixture models with large numbers of mixture components. We outline important strategies and processes for GPU computation in Bayesian simulation and optimization approaches, examples of the benefits of GPU implementations in terms of processing speed and scale-up in ability to analyze large data sets, and provide a detailed, tutorial-style exposition that will benefit readers interested in developing GPU-based approaches in other statistical models. Novel, GPU-oriented approaches to modifying existing algorithms software design can lead to vast speed-up and, critically, enable statistical analyses that presently will not be performed due to compute time limitations in traditional computational environments. Supplemental materials are provided with all source code, example data and details that will enable readers to implement and explore the GPU approach in this mixture modelling context.
Introduction
Scientific computation using graphics processing units (GPUs) is increasingly attracting the attention of researchers in statistics as in other fields. A number of recent papers have detailed the "what" and the "why" of GPU computation; our work here assumes a reader aware of the foundations and interested now in the "how?".
We detail advances in statistical computing (programming perspectives and strategy, and how these guide algorithm implementation) and computation (experiences, results and benchmarks with specific models, data and hardware) for Bayesian analyses of structured multivariate mixtures. We are particularly concerned with mixture models with many mixture components and with large data sets -massive mixtures, where the massive parallelization offered by GPU machines has potential to define access to relevant statistical modelbased methodology as a routine. The redevelopment of computation using GPUs enables scale-up on desktop personal computers that are simply not achievable using multi-threaded CPU desktops and often impractical across distributed-memory computing clusters.
GPUs are dedicated numerical processors originally designed for rendering 3-dimensional computer graphics. Current GPUs have hundreds of processor cores on a single chip and can be programmed to apply the same numerical operations simultaneously to each element of large data arrays under a single program, multiple data (SPMD) paradigm. As the same operations, called kernels, function simultaneously, GPUs can achieve extremely high arithmetic intensity if we can ensure swift data transfer to and from the processors.
General purpose computing on GPUs (GPGPU) is capturing the attention of researchers in many computational fields. Early adoption is growing for dynamic simulation in physics, signal and image processing and visualization techniques (Owens et al., 2007) . Computational statistics and statistical inference tools have yet to substantially embrace this new direction, though early forays are most encouraging (e.g. Charalambous et al., 2005; Manavski and Valle, 2008; Silberstein et al., 2008; Lee et al., 2009) . Silberstein et al. (2008) first demonstrated the potential for GPGPU to impact the statistical fitting of simple Bayesian networks, and recent work, such as studies using novel GPU/CPU-based algorithms for MCMC fitting of highly structured Bayesian models of molecular sequence evolution (Suchard and Rambaut, 2009b,a) , clearly exemplifies the opportunities; the latter example realized a greater than 100-fold reduction in run-time in very challenging and otherwise inaccessible computations.
Scientific computation using GPUs requires major advances in computing resources at the level of extensions to common programming languages (NVIDIA-CUDA, 2008 ) and standard libraries (OpenCL: www.khronos.org/opencl) ; these are developing, and enabling processing in data intensive problems many orders of magnitude faster than using conventional CPUs (Owens et al., 2007) . The recent adoption of the OpenCL library for porting to popular statistical analysis packages reflects a future for algorithmic advances that are immediately available to the larger research community.
A barrier to adoption in the statistics community is the investment needed in developing programming skills and adopting new computing perspectives for GPUs; this requires substantial time and energy to develop skills and expertise, and this challenges researchers for whom low-level programming has never been a focus. Part of our goal here is to use our context of Bayesian mixture modelling to convey not only the essence and results of our work in GPU implementations, but to also lay out a tutorial-style description and flow of the analysis so as to engage researchers that may be interested in developing in this direction in related or other classes of statistical models.
Our work is partly motivated by computational challenges in increasingly prevalent biological studies using high-throughput flow cytometry (FCM), generating many large data sets and requiring high-dimensional mixture models with large numbers of mixture components (Herzenberg et al., 2006) . FCM assays provide the ability to measure multiple cellular markers for hundreds to thousands of cells per second, capturing cellular population data in a single, very high-throughput assay. FCM is relatively cheap and increasingly accessible, and will continue to grow dramatically as a preferred biotechnology as it provides data at the level of single cells -particularly critical in many areas of immunology, cancer and emerging systems biology. The number of variables (markers) measured is currently in the 1-20 range, but technological advances are set to increase that substantially (Ornatsky et al., 2006) . Statistical analysis using various mixture modelling approaches is readily being adopted in FCM studies (Chan et al., 2008; Boedigheimer and Ferbas, 2008; Lo et al., 2008; Pyne et al., 2009 ). The applied contexts involve very large sample sizes (n = 10 4 − 10 7 per assay) and, typically, quite heterogeneous distributions of the response markers reflecting multiple subtypes of cells and non-Gaussian structure. Further, standardization and routine application is pressing the need for efficient computation; typical studies can involve tens or hundreds of assays of such size (multiple treatments, time points, patients, etc.). Our developments aim, in part, to bring advanced Bayesian computation for FCM analysis into this frame. Laboratory practitioners and routine users of FCM do not and will not embrace institutional "computer center" cluster facilities, due to inaccessibility, expense and other constraints; also, laboratory "culture" is heavily oriented around desktop computing. The availability of low-cost GPU technology on commodity desktops and workstations offers ease of access that is simply persuasive for many, hence promoting a focus on rethinking the computational implementations to exploit GPU parallelization for routine, automated use. It is with this perspective that we are interested in substantial computational advances for a class of Bayesian multivariate mixture models that provides the statistical setting.
Structure of Bayesian Mixture Modelling
Many aspects of Bayesian computation can be cast in the "single instruction/program, multiple data" (SIMD/SPMD) framework to exploit the massive parallelism afforded by commodity GPUs, with potential speed-ups of orders of magnitude. A central case in point is Bayesian analysis of mixture models with sample sizes in the millions and hundreds of mixture components; these result in massively expensive computations for Markov chain Monte Carlo (MCMC) analysis and/or Bayesian EM (BEM) for local mode search and optimization, but that can easily be cast in the SIMD/SPMD framework. We focus on multivariate normal mixture models under truncated Dirichlet process (TDP) priors (Ishwaran and James, 2001 ), a variant of a model class very popular in density estimation and classification problems in applied statistics and machine learning (e.g. Escobar and West, 1995; MacEachern and Müller, 1998b,a; Teh et al., 2006; Ji et al., 2009) . Applied, substantive variants involving heavier-tailed or skewed mixture components, or hierarchical extensions in which component densities are non-normal, themselves modelled as mixtures of normals, add detail to the computations but do not impact on the main themes and results of interest here.
A p−dimensional data vector x has density g(x|Θ) = j=1:k π j N (x|µ j , Σ j ) with parameters Θ = {π 1:k , µ 1:k , Σ 1:k } for a fixed upper bound k. Analysis uses conditionally conjugate, parametrized priors of standard form for (µ j , Σ j ), coupled with the specific class of priors arising in the TDP model for the mixture weights π 1:k . Details are summarised in Appendix A. We use MCMC simulation methods and posterior mode search via BEM. The most computationally expensive aspect of the analysis lies in evaluation of posterior configuration
The MCMC analysis then adds multinomial draws of configuration indicators (z i |x i , Θ) ∼ M n(1, π 1:k (x i )), (i = 1 : n). The n × k tasks imply billions of subsidiary calculations when n and k are large. Conditional independencies coupled with the relatively simple operations involved make this a prime target for GPU implementations. BEM calculations involve a complete scan over all n observations, weighted by their current configuration probabilities, for each of the k components at each iterate; MCMC involves component-specific calculations with the data subset currently configured into that component.
Coding for the GPU

Strategies and process
Coding for the GPU is fundamentally an exercise in parallel programming, and the theoretical maximum speed-up is bounded by Amdahl's quantity 1/(1 − P + P/N ) (Amdahl, 1967) where P is the fraction of the program that can be parallelized and N the number of processors. Clearly, the bottleneck is in the fraction of irreducibly serial computations in the code; if P is 90%, the maximum speed-up is only 10-fold even with an infinite number of processors. However, the fraction P is not necessarily fixed, and typically increases with problem size as first pointed out by Gustafson (1988) . Hence, MCMC and related approaches that are often viewed as intrinsically serial algorithms can still benefit from tremendous parallelization speed-ups with the use of GPUs. The contexts in which this is attractive are those that involve intense, parallelizable computations per iterate of the MCMC. Since the basic algorithms for mixture modelling rely on iterative Gibbs samplers, the usual coarse-grain approach taken for "embarrassingly parallel" problems with message passing techniques (e.g.
MPI on a Beowulf cluster) simply do not work well due to the need for sequential updates of global model parameters. With increasingly large data sets and larger numbers of mixture components, however, each iteration becomes increasingly burdensome; hence the interest in opportunities to exploit fine-grain problem decompositions for parallelizing within iterations.
Problem decomposition for MCMC
We illustrate the strategy of problem decomposition using the so-called block Gibbs sampling approach in TDP models. The size of a given problem is determined by the three integers (n, p, k), respectively the data sample size, variable dimension and maximum number of mixture components. In flow cytometry applications, p is usually modest (5-20), while n can range from 10 5 -10 7 and k typically ranges from 10 1 to 10 3 . These quantities suggest that any sampling steps involving n and/or k are potential candidates for parallelization. Step 1 and, to a much lesser extent, step 2 represent key computational bottlenecks for problems in which both n and k are large -the massive mixture contexts. For example, for a serial version of MCMC for the TDP, when n = 10 6 and k = 256, step 1 accounts for more than 99% of overall compute time, making step 1 the prime target for parallelization. This number only increases as n and k becomes even larger.
Potential gains
The analysis in Section 3.1.1 provides potential targets for parallelization. However, we still need to identify the proportion of remaining code that is serial so as to understand the potential speed-up and assess just how worthwhile it might be to invest the significant effort needed for effective parallelization. In our TDP example, the serial components (including steps 2-4 that are not immediate targets for parallelization currently) comprise a very low % of the total compute burden. Focusing on step 1 alone can yield theoretical speed-up greater than 100-fold, so is well worth the effort to parallelize.
In practice, of course, the theoretically achievable speed-up is never realized due to the finite number of parallel devices (e.g. GPU cores, CPU cores, processing nodes in a Beowulf cluster) available, and also because there are latency costs involved in moving data to, from and between these devices. Suppose that we have S devices, and that each can be programmed to perform the same computational task in the same amount of time. If a parallelizable task with overall execution time c 0 can be evenly distributed, then an ideal S-fold speed-up will yield actual time of c 0 /S + c 1 where c 1 is an incurred access/utilization time.
To make things more complicated, if we have to divide tasks into even smaller pieces (such as assigning m tasks per parallel device) for some legitimate reason, the overhead parameter c 1 will then be multiplied by m. Any strategy to define a parallel algorithm has to take these issues into account in assessing potential speed-up benefits.
Coarse grain parallelization (MPI/multi-core computing)
The most familiar form of parallel computing is MPI-based computing on Beowulf clusters of inexpensive CPU machines, widely employed in academic institutions. Clusters are usually used in the "embarrassingly parallel" mode; more delicately, in the "master/slave" model for scientific computing. Clusters can scale up to large numbers of processing nodes with typical research institute clusters having 1000s of nodes. However, since most of these clusters are rack mounted computers with multi-core CPUs linked loosely through an Ethernet network, sharing of data across nodes has high latency, making the parallelization overhead parameter c 1 relatively large. Because of the high latency, coarse-grain decompositions with larger and relatively independent tasks assigned to each computing node are optimal and result in minimizing the parallelization overhead mc 1 for m tasks per node.
Another common complication is cluster node heterogeneity. As large computer clusters evolve, hardware updates lead to mixtures of architectures, CPU speed and memory available. With coarse-graining, the time for each iteration is held back by the slowest node, dragging down overall performance. Hence there is usually some trial-and-error to determine the optimal granularity and load-balancing of task decomposition for the specific cluster available. In our experiences running MCMC for TDP models on the large Duke University high-performance cluster, a granularity in the vicinity of 5,000 data points per compute node works best. Even then the resulting speed-ups, while significant, are disappointing due to the high latency of data transport. For example, with n = 5 × 10 6 , k = 256 and p = 14, speed-up of approximately only 20-fold is a good benchmark on a cluster of 100 nodes.
As an alternative to MPI-based clusters, modern desktop and laptop computers usually come with 2-4 homogeneous CPU cores that can be effectively used for parallel computing.
Using either OpenMP (http://openmp.org/wp/) or various multi-threading libraries (e.g.
Intel Thread Building Block (TBB), Boost Thread library), we can write efficient parallel code that uses these cores. The homogeneity of CPU cores and the fact that all memory is shared by all cores makes communication between threads (tasks) much faster, thus incurring much smaller overhead costs (mc 1 ) than MPI-based cluster solutions. For example, using the Intel TBB (Thread Building Block) library, we have implemented a multi-threading version of MCMC for the TDP; this achieves a 5-fold speed-up on an 8-core (dual quad-core) machine.
Of course, this approach is severely limited in scalability by the number of cores available, and is useful only for problems of modest size.
Fine grain parallelization on the GPU
In last couple of years, GPUs have begun to attract attention in scientific computational communities due to their very low expense and high performance. Current commodity GPUs can have an extremely large number of relatively simple compute cores (240-480), each being capable of executing fast arithmetic and logical instructions. Data on a GPU can be in device memory that is accessible to all cores, shared memory that is common to a block of cores, and also in registers specific to individual cores. Unlike CPU cores, cores in the GPU have very limited numbers of registers and small shared memory resources, and completely lack control logic for branch prediction and other aggressive speculative execution techniques. Therefore, an understanding of GPU architectural limitations is critical to maximally exploiting GPUs.
The architecture in a GPU has the following distinctive features:
1. communication between CPU and GPU is only possible via device memory;
2. device memory on board can be accessed by all cores but data access is much slower than for shared memory;
3. barring memory bank conflicts, shared memory is as efficient as the use of registers (1 clock cycle);
4. the shared memory within a core block allows extremely fast communication/collaboration among cores;
5. the creation of GPU threads is significantly faster than that of CPU threads.
The limited number of registers and small shared memory essentially mandates a very finegrain decomposition of parallelized algorithms for optimal performance. Such high parallelism critically hides memory latency, as different threads continue to execute while others wait on memory transactions. Yet, because of the low cost of thread creation and ultraefficient data access from registers and shared memory, dramatic speed-ups are possible for algorithms such as those that arise in massive mixture context.
In contrast to the coarse grain size of the CPU/multi-threading approach -in which each parallel CPU task thread tries to work on as many data points as possible to reduce parallelization overhead costs -the GPU implementation goes to the opposite extreme and makes multiple threads work on a single data point. More specifically, a block of threads (limited to 512 threads in current GPU architectures) first loads the necessary data concurrently from device memory to much faster shared memory. This approach proves to be very successful. With the same data set, we can achieve more than 120-fold speed-up for the overall program using one standard, cheap consumer GPU (NVIDIA GeForce GTX285 with 240 cores). This impressive factor results from parallelizing step 1 alone. Detailed discussion of speed-up across various data sets and using different
GPUs and machines appears below.
For a comparable cluster-based approach, if we ignore communication latency between nodes a 120-fold speed-up requires (at a severe underestimate, since we only achieved a 20-fold speed-up with 100 CPU cores) at least 119 additional CPU cores. Assuming high-end, quad-core CPUs would require about 30 nodes -US ∼$50,000 at today's costs. This estimate does not include maintenance, permanent personnel support, space or air-conditioning for such a large system. In any case, the cluster-approach compares quite poorly with the approximate $400 cost of a GTX285 card currently.
CUDA tutorial for Bayesian simulation and optimization
Once a strategy for fine-grain decomposition of a problem has been decided, it only remains for it to be written and compiled for the GPU. For NVIDIA hardware, this currently can be done either with the CUDA SDK (http://www.nvidia.com/object/cuda_get.html) or the OpenCL library (http://www.khronos.org/registry/cl/). We describe the approach using CUDA but the OpenCL approach is broadly similar. We detail implementation and the evolution of the MCMC code from serial to MPI/multi-threaded to GPU code that results in over two orders of magnitude speed-up. BEM code highlights differences in coding simulation and optimization routines.
MCMC: Serial, multi-threaded and CUDA
The resampling configuration step, 1 f o r each i = 1 : n, compute π 1:k (x i ) , then 2 draw z i i n d e p e n d e n t l y from M n(1, π 1:
is the main bottleneck in the MCMC routine. This typically accounts for over 99% of the execution time with large data sets, so is the primary target for parallelization.
Serial version
Algorithm 1 gives the C-style pseudo code for this sampler in a standard serial CPU implementation.
The function sample here simply normalizes the weights and then performs the multinomial draw. A number of optimization considerations can be found in our source code, which is essentially fully optimized for serial implementation.
MPI/Multi-threaded version
The strategy for parallelizing this code snippet with large n, k is obvious in view of the nested loops over samples and components. Any large grain size parallelization will allow each "slave" CPU to compute the configuration probabilities and sample indicators for a subset of the data, returning the indicators alone to the "master" computer node for next step computations. Algorithm 2 outlines the pseudo-code.
Algorithm 2. MPI/Multi-threaded version
3 G r a i n S i z e = FindOptimalGrainSizeBasedOnSystemResources ( ) ;
4 NumberOfTasks = n/ G r a i n S i z e ; 
Compared to serial Algorithm 1, this allows multiple CPUs to work in parallel to resample component indicators. GrainSize will be chosen based on the numbers of nodes in the cluster and other aspects of the parallel computing architectures.
CUDA version for GPUs
Parallel programming for GPU devices takes a quite different approach due to differences in hardware (limited fast processing registers and shared memory resources) as well in the use of threads in scheduling tasks. We start with a simple framework for CUDA programming, then extend it to describe the mixture analysis algorithms.
CUDA is implemented as a set of extensions to a subset of the C language. The NVIDIA nvcc compiler splits and runs two code components: one on CPU(s) and one on GPU(s) as CUDA kernels. CUDA refers to a CPU as the host and a GPU as the device. CUDA extends C with declaration specifications for functions that determine where the function call can originate and where it is executed. The default host declaration is a regular C function, called from and executed on the host CPU. In contrast, the device declaration describes a function called from and executed on the GPU device. The CUDA kernel is key: it is there that the program transfers execution from CPU to GPU. This can be done asynchronously, allowing simultaneous CPU and GPU computations. Kernels are executed in parallel and threads are organized into blocks. Threads within a block can communicate using extremely low latency shared memory; optimizing the use of this shared memory is fundamental to GPU performance. There are at most 512 threads in a block, and blocks are organized into grids. A block can be specified as 1 or 2-dimensional array; a grids can be specified as 1, 2 or 3-dimensional array, with the dimensionality chosen to match the problem being solved.
Thread organization is specified with a <<<gridSize,blockSize,sharedMemSize>>> syntax between the kernel function name and its argument list. The calls gridSize and blockSize are defined using the predefined type dim3 (e.g. dim3 gridSize(4,2,1)) and sharedMemSize spec- From a programming strategy perspective, the essential differences between the MCMC algorithm for the GPU and the MPI/multi-threaded parallel versions are:
1. We have very fine grain size for parallelization: each thread (task) calculates for only one data point x i against one normal density N (µ j , Σ j ).
2.
A dedicated execution plan is needed to decide how to maximize performance by organizing threads into blocks so they can cooperate with each other and optimize the use of low-latency shared memory and thread registers.
3. GPU computations are typically done in single-precision, since double-precision operations are up to 10× slower with the current generation of GPUs. This is key with current technology, but will change as substantially faster double precision GPUs emerge in future generations.
The following sections detail each of the GPU program steps, with emphasis on coding techniques to optimize performance.
Allocate memory on CPU and GPU and move data from CPU to GPU
The host (CPU) and device (GPU) have separate memory spaces. CUDA supports three type of data storage: global, constant and shared memory, with vastly different access methods and speed. Global memory is large (up to 4G in the NVIDIA Tesla chips in 2009) and the only memory accessible to the host, so all data moving from host to device or vice versa needs to go through global memory. Constant memory is basically global memory that is textured for faster access. Within a thread block, threads can communicate using shared memory whose latency is more than 100× faster than global memory, but the capacity is relatively Here we allocate n × DATA PADDED DIM floating point numbers in the GPU global memory and transfer the corresponding data on the CPU to the GPU.
Shared memory and registers
Up to 512 threads can be grouped into a thread block, and these threads have shared access to 16KB of shared memory that performs 100-150 times faster than even coalesced global memory transactions. However 16KB is very small and only holds 4096 singleprecision values. Individual threads also have private access to very fast but scarce registers.
All threads in a block share a 16KB register file that holds all the register values. If a block contains the maximum of 512 threads, each thread will only be allocated (16×1024)/(512 * 4)
single-precision registers. If more registers are requested than are available, the excess data is stored in global memory, resulting in a devastating drop in performance.
Efficient caching is thus simply critical to optimizing performance. As a result, besides using data padding for the input data to enforce coalescing shared memory transactions, we further combine all the parameters {π 1:k , µ 1:k , Σ 1:k } into a k × PACK DIM matrix; each row holds all the information for one mixture component, padded with zeros at the end of the row such that PACK DIM is a multiple of 16. We attempt to cooperatively prefetch the largest possible chunks of data sitting in global memory using coalesced transactions and cache these values in shared memory, in an order that maximizes their re-use across the threads in block.
The code snippet in Appendix B (in the Supplemental Material) is the kernel function that combines these strategies and steps. The implementation is relatively direct, with typical consideration for shared memory reuse and coalesced memory transactions. Yet the level of parallelization it achieves allows the density evaluation to be more than 150 times faster than its serial peer. Together with another relatively simple kernel function for sampling the configuration indicators, the overall performance improvement is about 120-fold in total run-time for large data sets.
Blocking for large data set
While the performance enhancement is impressive, this implementation requires more memory than the serial and MPI/multi-threading versions. In non-GPU versions, the parallel task grain size is relatively large and each task can deal with multiple data points without saving density values for each combination of (i, j), as shown in Algorithms 1 and 2. This is not the case for Algorithm 3. The particular, efficient design of thread blocks allows for densities to be evaluated in parallel. However, the efficiency comes with the cost of having to save all these densities in global memory before applying another kernel to sample from them. This will become a problem for really huge data sets on commonly used GPU devices such as GTX285 with just 1G or 2G device memory. We address this by chunking data into subsets, and launching the kernel functions multiple times, once for each subset. The overheads of this approach are minimal since it is relatively cheap to launch kernel functions multiple times (a full implementation can be found in the source code).
Using multiple GPUs
Practical implementations typically exploit multiple GPUs as well as multiple CPUs. By chunking data into subsets and launching multi-threads on multiple CPUs, with each CPU controlling one GPU, we gain substantial further speed-up; see benchmarks report in Section 4. Additional code illustrating this, and allowing interested researchers to exploit multiple processors, is available in our implementations at the web site (see url below).
Bayesian EM for TDP Mixtures -Serial and CUDA Algoithms
Posterior mode search using EM (Bayesian EM, or BEM) and other optimization algorithms, can also greatly benefit from massive parallelization even though they contain no random sampling. These algorithms can ultimately be reduced to simple sums and matrix manipulations, and in the large data context, these basic functions are extremely cumbersome and are prime candidates for parallelization. The dominant computations in BEM in our mixture models (see details in Appendix A) are as follows:
The CPU version of this algorithm parallels that for the MCMC, differing in the evaluations of probability weighted sample means and covariance matrices that scan across the entire data set rather than just a subsample. This latter element imposes additional storage as well as computational requirements, but does not modify the overall program flow.
CUDA version
Let Π be the n × k matrix of configuration probabilities π j (x i ), (i = 1 : n, j = 1 : k),
M the p × k matrix whose columns are the p−vector weighted means m j , (j = 1 : k), and X the n × p data matrix. BEM involves multiplication M = X Π of two very large matrices. In standard CPU implementations we can use highly optimized libraries, such as BLAS. The analogous library CUBLAS (NVIDIA-CUBLAS, 2008) in the CUDA toolkit is a GPU implementation of basic BLAS functions. As with BLAS, CUBLAS reads and writes matrices in column-major format, while the C convention is row-major, so all results will be implicitly transposed. Algorithm 5A uses CUBLAS in our GPU template.
Algorithm 5A. GPU BEM template version with CUBLAS 1 i n t main ( ) { 2 c u b l a s I n i t ( ) ; // Required b e f o r e any CUBLAS f u n c t i o n s a r e c a l l e d . Calculating the weighted sample covariance is more specialized and requires a custom kernel. There are several approaches but, due to the limited shared memory (Section 3.2.6)
we are led to calculating the S j element-wise, reducing to a problem of maximizing memory bandwidth by reading numbers in a coalesced fashion (Harris, 2008) , as in the kernel detailed in Appendix C (in the Supplemental Material). Implementing this simple kernel and prepackaged CUBLAS libraries has produced speedups as high as 80-fold as noted in the following section.
16 Table 1 Performance on this latter computer is particularly relevant as it reflects scale-up ability on routinely configured, cheap "everyday" laptops (US ∼$1,500 currently).
Note that the Tesla C1060 is about 20% slower than each GTX285 due to slower memory reading/writing and, as a result, the performance is downgraded a little on the desktop running on the tesla GPU. The Macbook is naturally slower than the desktop given the substantial differences in processor and memory speed, but note that the speed-up using a single GPU compared to the CPU is comparable and substantial, enabling 16 to 25-fold faster computations with medium to larger data sets. Table 2 . The BEM algorithm is less efficient for small n because it requires a complete scan and calculations on all n observations and is thus naturally less efficient in parallelized mode than the MCMC that splits into subsets of the data allocated across threads. However, when n is moderately large, the GPU implementation is vastly more efficient than its CPU counterpart. In general, moving to the GPU requires a good deal of overhead that is partly offset in the MCMC analysis by exploiting this greater parallelization opportunity. Also note that 100 iterations of BEM is generally slower than 100 iterations of MCMC. This is due to the fact that in BEM the component means and covariance matrices are computed based on all of the data while the MCMC components are based on a random subset. Bayesian computation in highly structured, multi-layered models -that lie at the heart of many significant data mining, prediction and discovery contexts in "data rich studies" -is a key example of an arena seeing initial developments in the use of GPUs, and where the technology is opening up simply major new opportunities for orders of magnitude advances in the ability to compute. Our studies and innovations in computing and computation for structured mixture models represent just one area in which such advances are being made, and we hope that the detailed discussion of this work will aid others in making advances with other model classes.
To aid in this, we have made all code available, together with links and resources related to libraries of interest to statisticians working with, or beginning to engage in, GPU-based computational methods in statistics; see details in the Supplemental Materials section below.
The potential for further growth in GPU computing is great; often quoted in 2009, Nvidia CEO Jen-Hsun Huang predicted in 2009 that GPU compute capabilities are likely to increase by 570-fold over the next 6 years (Huang, 2009) , greatly out-pacing CPU development. The computer gaming industry is now rapidly broadening its interests to address increasing attention from scientific arenas, and the next couple of years will see substantial advances in the ability to more easy access and program with GPUs as a result. There is also increasing interest in developing in hybrid GPU-CPU technologies that have the potential to bring massive parallelization to the desktop faster and more forcefully that to date.
Appendix A: Aspects of Bayesian TDP Mixture Models
The TDP analysis uses normal, inverse Wishart priors for normal component parameters,
k. This is coupled with the implicit priors over mixture probabilities arising from the underlying DP model,
(1 − v j ) for j = 2 : k − 1 and π k = 1, where v j ∼ Be(1, α) for j = 1 : k − 1. The DP precision parameter α has a conditionally conjugate gamma prior.
Analysis includes learning on hyper-parameters {m, γ, ν, Φ}; this adds compute burden, but does not materially impact on the general results and ideas of this paper.
Based on data x 1:n of sample size n, the block MCMC algorithm successively resamples values of the parameters Θ = (π 1:k , µ 1:k , Σ 1:k ) of the k normal components. Optimization computations to find posterior modes follow related steps, and in practice typically involves running multiple Bayesian EM (BEM) searches from multiple different starting points to couple with MCMC analyses. Each relies heavily on repeatedly recomputed values of the key conditional configuration probabilities π j (x i ) ≡ P r(z i = j|x i , Θ) = π j N (x i |µ j , Σ j )/g(x i |Θ), (j = 1 : k),
independently over data points i = 1 : n, at any parameter value Θ. These are the theoretical classification probabilities for the implicit configuration indicators z 1:n where z i ∈ 1 : k is such that (x i |z i = j, Θ) ∼ N (x i |µ j , Σ j ).
MCMC steps:
Each MCMC iterate has the following steps.
(a). Recompute configuration probabilities and resample configuration indicators:
Calculate π j (x i ) for each i = 1 : n, j = 1 : k. Then make n independent and individual multinomial draws of size 1, (z i |x i , Θ) ∼ M n(1, π 1:k (x i )). This reconfigures the n points independently among the k components, and delivers counts n j = #{z i = j, i = 1 : n} for j = 1 : k. Note that some components may be empty, with n j = 0 for some j. Additional, subsidiary computations address model identifiability, a thorny and challenging issue in mixture analysis when full posterior inferences via MCMC are desired. These are secondary issues in the context of the contributions of this paper, but we do note that we use novel mixture component relabelling strategies to address this in our applied work (Lin et al., 2010) . This adds to the overall computational burden but is, again, a strategy involving parallelizable steps also ideally suited to GPU computing.
BEM steps:
Each BEM iterate has the following steps, involving averaging with respect to the configuration probabilities. Full details are given in Lin et al. (2010) . (1 − v r ) for j = 2 : k − 1. Note, incidentally, that some of the v j can be zero, reflecting the ability of modal estimates of the model to "cut-back" to fewer than the upper bound k of components. Interested readers can also download the above Supplemental material from http://www. stat.duke.edu/gpustatsci/ under the Software tab. This web site also contains additional software, including extended versions of the original code with Matlab and R wrappers that create the input files and retrieve the data automatically.
