Hardware
INTRODUCTION
Radio telescopes produce enormous amounts of data. LOFAR [1] , for instance, will produce some tens of petabits per day, and the Australian square kilometer array pathfinder (ASKAP) will even produce over six exabits per day [2] . These modern radio telescopes use many separate receivers as building blocks and combine their signals to form a single large and sensitive instrument.
To extract the sky signal from the system noise, the correlator coordinates the signals from different receivers, and integrates the correlations over time to reduce the amount of data. This is a challenging problem in radio astronomy, since the data volumes are large and the computational demands grow quadratically with the number of receivers. Correlators are not limited to astronomy but are also used in geophysics [3] , radar systems [4] , and wireless networking [5] .
Traditionally, custom-built hardware, and later field programmable gate arrays (FPGAs), were used to correlate telescope signals. A recent development is to use a supercomputer [6] . Both approaches have important advantages and disadvantages. Custom-built hardware is efficient and consumes modest amounts of power but is inflexible, expensive to design, and has a long development time. Solutions that use a supercomputer are much more flexible, but are less efficient and consume more power. Future instruments, like the square kilometer array (SKA), need several orders of magnitude more computational resources. It is likely that the requirements of the SKA cannot be met by using current supercomputer technology. Therefore, it is important to investigate alternative hardware solutions.
General-purpose architectures no longer achieve performance improvements by increasing the clock frequency but by adding more compute cores and by exploiting parallelism. Intel's recent Core i7 processor is a good example of this. It has four cores and supports additional vector parallelism. Furthermore, the high-performance computing community is steadily adopting clusters of GPUs as a viable alternative to supercomputers, due to their unparalleled growth in computational performance, increasing flexibility and programmability, high power efficiency, and low purchase costs. GPUs are highly parallel and contain hundreds of processor cores. An example of a processor that combines GPU and central processing unit (CPU) qualities into one design is the Cell Broadband Engine (Cell/BE) [7] . The Cell/ BE consists of an "ordinary" PowerPC core and eight powerful vector processors that provide the bulk of the processing power. Programming the Cell/BE requires more effort than programming an ordinary CPU, but various studies showed that the Cell/ BE performs well on signal-processing tasks like fast Fourier transforms (FFTs) [8] .
In this article, we explain how many-core architectures can be exploited for signal-processing purposes. We give insights into their architectural limitations, and how to best cope with them. We treat five different, popular architectures with multiple cores: the Cell/BE, GPUs from both NVIDIA and ATI, the Intel Core i7 processor, and the BG/P supercomputer. We discuss their similarities and differences, and how the architectural differences affect optimization choices and the eventual performance of a correlator. We also discuss the programmability of the architectures. We focus on correlators but many of the findings, claims, and optimizations hold for other signal-processing algorithms as well, both inside and outside the area of radio astronomy. For instance, we discussed radio-astronomy imaging (another signal processing algorithm) on many-core hardware in previous work [9] .
In this article, we use the LOFAR telescope as a running example and use its production correlator on the BG/P as a comparison. This way, we demonstrate how many-core architectures can be used in practice for a real application. For educational purposes, we made the correlator implementations for all architectures available online. They exemplify the different optimization choices for the different architectures. The code may be reused under the GNU public license. We describe and analyze the correlator on many-core platforms in much more detail in [10] .
TRENDS IN RADIO ASTRONOMY
During the past decade, new types of radio-telescope concepts emerged that rely less on concrete, steel, and extreme cooling techniques, but more on signal-processing techniques. For example, LOFAR [1] , Karoo Array Telescope (MeerKAT) [11] , and ASKAP [2] are distributed sensor networks that combine the signals of many receiver elements. All three are pathfinders for the future SKA [12] telescope, which will be orders-of-magnitude larger. These instruments combine the advantages of higher sensitivity, higher resolution, and multiple concurrent observation directions. However, they require huge amounts of processing power to combine the data from the receiving elements.
The signal-processing hardware technology used to process telescope data also changes rapidly. Only a decade ago, correlators required special-purpose application-specific integrated circuits (ASICs) to keep up with the high data rates and processing requirements. The advent of sufficiently fast FPGAs significantly lowered the developments times and costs of correlators and increased the flexibility substantially. LOFAR requires even more flexibility to support many different processing pipelines for various observation modes, and uses FPGAs for on-the-field processing and a BG/P supercomputer to perform real-time central processing. We describe LOFAR in more detail below.
THE LOFAR TELESCOPE
LOFAR is an aperture array radio telescope operating in the 10-250 MHz frequency range [1] . It is the first of a new generation of radio telescopes that breaks with the concepts of traditional telescopes in several ways. Rather than using large, expensive dishes, LOFAR uses many thousands of simple antennas that have no movable parts (see Figure 1) . Essentially, it is a distributed sensor network that monitors the sky and combines all signals centrally. This concept requires much more signal processing, but the costs of the silicon for the processing are much lower that the costs of steel that would be needed for dishes. Moreover, LOFAR can observe the sky in many directions concurrently and switch directions instantaneously. In several ways, LOFAR will be the largest telescope of the world. The antennas are simple, but there are a lot of them -44,000 in the full LOFAR design. To make radio pictures of the sky with adequate resolution, these antennas are to be arranged in clusters. In the rest of this article, we call a cluster of antennas "receivers." The receivers will be spread out over an area of ultimately 350 km in diameter. This is shown in Figure 2 .
Data transport requirements are in the range of many terabits/s and the processing power needed is tens of tera-ops.
Another novelty is the elaborate use of software to process the telescope data in real time. LOFAR thus is an IT-telescope. The cost is dominated by the cost of computing and will follow Moore's law; becoming cheaper with time and allowing increasingly large telescopes to be built.
LOFAR will enable exciting new science cases. First, we expect to see the epoch of reionization, the time that the first star galaxies and quasars were formed. Second, LOFAR offers a unique possibility in particle astrophysics for studying the origin of high-energy cosmic rays. Third, LOFAR's ability to continuously monitor a large fraction of the sky makes it uniquely suited to find new pulsars and to study transient sources. Since LOFAR has no moving parts, it can instantaneously switch focus to some galactic event. Fourth, deep extragalactic surveys will be carried out to find the most distant radio galaxies and study star-forming galaxies. Fifth, LOFAR will be capable of observing the so far unexplored radio waves emitted by cosmic magnetic fields. For a more extensive description of the astronomical aspects of the LOFAR system, see [13] .
A global overview of the LOFAR processing is given in Figure 3 . The thickness of the lines indicates the size of the data streams. Initial processing is done in the field, using FPGA technology. Typical operations that are performed there include analog-to-digital conversion, filtering, frequency selection, and combination of the signals from the different antennas. Next, the data is transported to the central processing location in Groningen, The Netherlands, via dedicated optical wide-area networks.
The real-time central processing of LOFAR data is done on a BG/P supercomputer. There, we filter the data and perform phase-shift and bandpass-corrections. Next, the signals from all receivers are cross correlated. The correlation process performs a data reduction by integrating samples over time. Finally, the data is forwarded to a storage cluster, where results can be kept for several days. After an observation has finished, further processing, such as radio frequency interference (RFI) removal, calibration, and imaging is done offline on commodity cluster hardware. In this article, we focus on the correlator step (the highlighted part in the red box in Figure 3 ), because it must deal with the full data streams from all receivers. Moreover, its costs grow quadratically with the number of receivers, while all other steps have a lower time complexity. LOFAR uses an FX correlator: it first filters the different frequencies and then correlates the signals. This is more efficient than an XF correlator for larger numbers of receivers. In FX and XF, F means frequency separation and X means correlation (multiplication of the signal).
CORRELATING SIGNALS
Prior to correlation, the data that comes from the receivers must be reordered: each input carries the signals of many frequency bands from a single receiver, but the correlator needs data from a single frequency of all inputs. Depending on the data rate, switching the data can be a real challenge. The data reordering phase is outside the scope of this article, but a correlator implementation cannot ignore this issue. The LOFAR BG/P correlator uses the fast three-dimensional torus for this purpose; other multicore architectures need external switches.
The received signals from sky sources are so weak, that the antennas mainly receive noise. To see if there is statistical coherence in the noise, simultaneous samples of each pair of receivers are correlated, by multiplying the sample of one re ceiver with the complex conjugate of the sample of the other receiver. To reduce the output size, the correlations are integrated over time, by accumulating all products. Therefore, the correlator is mostly multiplying and adding complex numbers. Both polarizations of a station A are correlated with both polarizations of a station B, yielding correlations in XX, XY, YX, and YY directions. The correlator algorithm itself thus is straight forward, and can be written in a single formula
The total number of correlations we have to compute is 1nrReceivers 3 1nrReceivers 1 1 2 2 / 2, since we need each pair of correlations only once. This includes the autocorrelations (the correlation of a receiver with itself), since we need them later in the pipeline for calibration purposes. The autocorrelations can be computed with fewer instructions. We can implement the correlation operation very efficiently, with only four fused-multiply-add (FMA) instructions, doing eight floating-point operations in total. For each pair of receivers, we have to do this four times, once for each combination of polarizations. Thus, in total we need 32 operations. To perform these operations, we have to load the samples generated by two different receivers from memory. As explained above, the samples each consist of four single-precision floatingpoint numbers. Therefore, we need to load eight floats or 32 B in total. This results in exactly one floating-point operations per second (FLOP)/byte. We will describe the implementation and optimization of the correlator on the many-core systems in more detail in the section "Implementation and Optimization," but first, we explain the architectures themselves.
MANY-CORE ARCHITECTURES
In this section, we explain key properties of five different architectures with multiple cores and the most important differences between them. Table 1 shows the most important properties of the different many-core architectures.
GENERAL-PURPOSE MULTICORE CPUS (INTEL CORE i7)
As a reference, we implemented the correlator on a multicore general-purpose architecture, in this case an Intel Core i7. The theoretical peak performance of the system is 85 gflops, in single precision. The parallelism comes from four cores with hyperthreading. Using two threads per core allows the hardware to overlap load delays and pipeline stalls with useful work from the other thread. The SSE4 instruction set provides single instruction multiple data (SIMD) parallelism with a vector length of four floats.
IBM BG/P SUPERCOMPUTER
The IBM BG/P [14] is the architecture that is currently used for the LOFAR correlator. Four PowerPC processor cores are integrated on each BG/P chip. Each core is extended with two floating-point units (FPUs) that provide the bulk of the processing power. The BG/P is an energy-efficient supercomputer. This is accomplished by using many small, low-power chips, at a low clock frequency.
ATI GPUs
ATI's GPU with the highest performance is the Radeon 4870 [15] . The chip contains 160 cores, with 800 FPUs in total, and has a theoretical peak performance of 1.2 teraflops. The board uses a [FIG3] A simplified overview of the LOFAR processing.
[ 
THE CELL/BE
The Cell/BE [7] is a heterogeneous many-core processor, designed by Sony, Toshiba, and IBM (STI). The Cell/BE has nine cores: one power processing element (PPE), acting as a main processor, and eight synergistic processing elements (SPEs) that provide the real processing power. A SPE contains a reduced instruction set computing (RISC) core, a 256 KB local store (LS), and a direct memory access (DMA) controller. The LS is an extremely fast local memory for both code and data and is managed entirely by the application with explicit DMA transfers to and from main memory. The LS can be considered the SPU's (explicit) L1 cache. The Cell/BE has a large number of registers: each SPU has 128, which are 128-b (four floats) wide. The SPU can dispatch two instructions in each clock cycle using the two pipelines designated even and odd. Most of the arithmetic instructions execute on the even pipe, while most of the memory instructions execute on the odd pipe. For the performance evaluation, we use a QS21 Cell blade with two Cell/BE processors. The eight SPEs of a single chip in the system have a total theoretical single-precision peak performance of 205 gflops.
MAPPING SIGNAL-PROCESSING ALGORITHMS ON MANY-CORE HARDWARE
Many-core architectures derive their performance from parallelism. Several different forms of parallelism can be identified: multithreading (with or without shared memory), overlapping of I/O and computations, instruction-level parallelism, and vector parallelism. Most many-core architectures combine several of these methods. Unfortunately, an application has to handle all available levels of parallelism to obtain good performance. Therefore, it is clear that algorithms have to be adapted to efficiently exploit many-core hardware. Additional parallelism can be obtained by using multiple processor chips. In this article, however, we restrict ourselves to single chips for simplicity.
FINDING PARALLELISM
The first step is to find parallelism in the algorithm, on all different levels. Basically, this means looking for independent operations. With the correlator, for example, the thousands of different frequency channels are completely independent, and they can be processed in parallel. But there are other, more fine-grained sources of parallelism as well. The correlations for each pair of receivers are independent, just like the four combinations of polarizations. Finally, samples taken at different times can be correlated independently, as long as the subresults are integrated later. Of course, the problem now is how to map the parallelism in the algorithm to the parallelism provided by the architecture. We found that, even for the relatively straightforward correlator algorithm, the different architectures require very different mappings and strategies.
OPTIMIZING MEMORY PRESSURE AND ACCESS PATTERNS
On many-core architectures, the memory bandwidth is shared between the cores. This has shifted the balance between computational and memory performance. The available memory bandwidth per operation has decreased dramatically compared to traditional processors. For the many-core architectures we use here, the theoretical bandwidth per operation is three to ten times lower than on the BG/P, for instance. In practice, if algorithms are not optimized well for many-core platforms, the achieved memory bandwidth can easily be ten to 100 times lower than the theoretical maximum. Therefore, we must treat memory bandwidth as a scarce resource, and it is important to minimize the number of memory accesses. In fact, one of the most important lessons of this article is that on many-core architectures, optimizing the memory properties of the algorithms is more important than focusing on reducing the number of compute cycles that is used, as is traditionally done on systems with only a few or just one core.
WELL-KNOWN MEMORY OPTIMIZATION TECHNIQUES
The insight that optimizing the interaction with the memory system is becoming more and more important is not new. The book by Catthoor et al. [17] is an excellent starting point for more information on memory-system related optimizations. We can make a distinction between hardware and software memory-optimization techniques. Examples of hardware-based techniques include caching, data prefetching, write combining, and pipelining. The software techniques can be divided further into compiler optimizations and algorithmic improvements. The distinction between hardware and software is not entirely black and white. Data prefetching, for instance, can be done both in hardware and software. Another good example is the explicit cache of the Cell/BE processor. This is an architecture where the programmer handles the cache replacement policies instead of the hardware.
Many optimizations focus on utilizing data caches more efficiently. Hardware cache hierarchies can, in principle, transparently improve application performance. Nevertheless, it is important to take the sizes of the different cache levels into account when optimizing an algorithm. A cache line is the smallest unit of memory than can be transferred between the main memory and the cache. Code can be optimized for the cache line size of a particular architecture. Moreover, the associativity of the cache can be important. If a cache is N-way set associative, this means that any particular location in memory can be cached in either of N locations in the data cache. Algorithms can be designed such that they take care that cache lines that are needed later are not replaced prematurely. In addition, write combining, a technique that allows data writes to be combined and written later in burst mode, can be used if the ordering of writes is not important. Finally, prefetching can be used to load data into caches or registers ahead of time.
Many cache-related optimization techniques have been described in the literature, both in the context of hardware and software. For instance, an efficient implementation of hardwarebased prefetching is described in [18] . As we will describe in the section "Implementation and Optimization," we implemented prefetching manually in software, for example by using multibuffering on the Cell/BE, or by explicitly loading data into shared memory or registers on the GPUs. A good starting point for cacheaware or cache-oblivious algorithms is [19] . An example of a technique that we used to improve cache efficiencies for the correlator is the padding of multidimensional arrays with extra "dummy" data elements. This can be especially important if memory is accessed with a stride of a (large) power of two. This way, we can make sure that cache replacement policies work well, and subsequent elements in an array dimension are not mapped onto the same cache location. This well-known technique is described, for instance, by Bacon et al. [20] . Many additional data access patterns optimization techniques are described in [17] .
Many memory-optimization techniques have been developed in the context of optimizing compilers and run-time systems (e.g., efficient memory allocators). For instance, a lot of research effort has been invested in cache-aware memory allocation; see e.g., [21] . Compilers can exploit many techniques to optimize locality by applying code and loop transformations such as interchange, reversal, skewing, and tiling [22] . Furthermore, compilers can optimize code for the parameters and sizes of the caches by carefully choosing the placement of variables, objects, and arrays in memory [23] .
The memory systems of the many-core architectures are quite complex. GPUs, for instance, have banked device memory, several levels of texture cache, in addition to local memory, application-managed shared memory (also divided over several banks), and write combining buffers. There also are complex interactions between the memory system and the hardware thread scheduler. GPUs literally run tens of thousands of parallel threads to overlap memory latencies, trying to keep all functional units fully occupied. We apply the techniques described above in software by hand, since we found that the current compilers for the many-core architectures do not (yet) implement them well on their complex memory systems.
APPLYING THE TECHNIQUES
So, the second step of mapping a signal-processing algorithm to a many-core architecture is optimizing the memory behavior. We can split this step into two phases: an algorithm phase and an architectural phase. In the first phase, we identify algorithm-specific, but architecture-independent optimizations. In this phase, it is of key importance to understand that, although a set of operations in an algorithm can be independent, the data accesses may not be. This is essential for good performance, even though it may not be a factor in the correctness of the algorithm. The number of memory accesses per operation should be reduced as much as possible, sometimes even at the cost of more compute cycles. An example is a case where different parallel operations read (but not write) the same data. For the correlator, the most important insight here is a technique to exploit date reuse opportunities, reducing the number of memory loads. We explain this in detail in the section "Architecture-Independent Optimizations."
The second phase deals with architecture-specific optimizations. In this phase, we do not reduce the number of memory loads, but think about the memory access patterns. Typically, several cores share one or more cache levels. Therefore, the access patterns of several different threads that share a cache should be tailored accordingly. On GPUs, for example, this can be done by coalescing memory accesses. This means that different concurrent threads read subsequent memory locations. This can be counter-intuitive, since traditionally, it was more efficient to have linear memory access patterns within a thread. Table 2 summarizes the differences in memory architectures of the different platforms. Other techniques that are performed in this phase include optimizing cache behavior, avoiding load delays and pipeline stalls, and exploiting special floating-point instructions. We explain several examples of this in more detail in the section "Architecture-Specific Optimizations."
A SIMPLE ANALYTICAL TOOL
A simple analytic approach, the Bound and Bottleneck analysis [24] , [25] , can provide more insight on the memory properties of an algorithm. It also gives us a reality check and calculates what the expected maximal performance is that can be achieved on a particular platform. The number of operations that is performed per byte that have to be transferred (the flop/byte ratio) is called the arithmetic intensity (AI) [24] . Performance is bound by the [ gives a rough idea of the performance than can be achieved. It is insightful to apply this method to the correlator on the GPUs. We do it for the NVIDIA GPU here, but the results for the ATI hardware is similar. With the GPUs, there are several communication steps that influence the performance. First, the data has to be transferred from the host to the device memory. Next, the data is read from the device memory into registers. The host-to-device bandwidth is limited by the low PCI-express throughput, 5.6 GB/s in this case. We can easily show that this is a bottleneck by computing the AI for the full system, using the host-to-device transfers. (The AI can also be computed for the device memory.)
As explained in the section "Correlating Signals," the number of flops in the correlator is the number of receiver combinations times 32 operations, while the number of bytes that have to be loaded in total is 16 B times the number of receivers. The number of combinations is 1nrReceivers 3 1nrReceivers 1 1 2 2 / 2 (see the section "Correlating Signals"). If we substitute this, we find that the AI 5 nrReceivers 1 1. For LOFAR, we can assume 64 receivers (each in turn containing many antennas), so the AI is 65 in our case. Therefore, the performance bound on NVIDIA hardware is 65 3 5.6 5 364 gflops. This is only 39% of the theoretical peak. Note that this even is optimistic, since it assumes perfect overlap of communication and computation.
COMPLEX NUMBERS
Support for complex numbers is important for signal processing. Explicit hardware support for complex operations is preferable, both for programmability and performance. Except for the BG/P, none of the architectures support this. The different architectures require two different approaches of dealing with this problem. If an architecture does not use explicit vector parallelism, the complex operations can simply be expressed in terms of normal floatingpoint operations. This puts an extra burden on the programmer, but achieves good performance. The NVIDIA GPUs work this way. If an architecture does use vector parallelism, we can either store the real and complex parts alternatingly inside a single vector, or have separate vectors for the two parts. In both cases, support for shuffling data inside the vector registers is essential, since complex multiplications operate on both the real and imaginary parts. The architectures differ considerably in this respect. The Cell/BE excels; its vectors contain four floats, which can be shuffled around in arbitrary patterns. Moreover, shuffling and computations can be overlapped effectively. On ATI GPUs, this works similarly. The SSE4 instructions in the Intel CPUs do not support arbitrary shuffling patterns. This has a large impact on the way the code is vectorized.
IMPLEMENTATION AND OPTIMIZATION
In this section, we explain the techniques described above by applying them to the correlator for all different architectures.
ARCHITECTURE-INDEPENDENT OPTIMIZATIONS
An unoptimized correlator would read the samples from two receivers and multiply them, requiring two sample loads for one multiplication. We can optimize this by reusing a sample as often as possible, by using it for multiple correlations (see Figure 4) . The figure is triangular, because we compute the correlation of each pair of receivers only once. The squares labeled A are autocorrelations. For example, the samples from receivers 8, 9, 10, and 11 can be correlated with the samples from receivers 4, 5, 6, and 7 (the red square in the figure), reusing each fetched sample four times. By dividing the correlation triangle in 4 3 4 tiles, eight samples are read from memory for sixteen correlations, reducing the amount of memory operations by a factor of four. The maximum number of receivers that can be simultaneously correlated this way (i.e., the tile size) is limited by the number of registers that an architecture has. The samples and accumulated correlations are best kept in registers, and the number of required registers grows rapidly with the number of receiver inputs. The example above already requires 16 accumulators. To obtain good performance, it is important to tune the tile size to the architecture. There still is opportunity for additional data reuse between tiles. The tiles within a row or column in the triangle still need the same samples. In addition to registers, caches can thus also be used to increase data reuse.
ARCHITECTURE-SPECIFIC OPTIMIZATIONS
We will now describe the implementation of the correlator on the different architectures, evaluating the performance and optimizations needed in detail. For comparison reasons, we use the performance per chip for each architecture. The performance results are shown in Figure 5 .
INTEL CPUs
The SSE4 instruction set can be used to exploit vector parallelism. Unlike the Cell/BE and ATI GPUs, a problem with SSE4 is the limited support for shuffling data within vector registers. Computing the correlations of the four polarizations within a vector is inefficient, and computing four samples with subsequent time stamps in a vector works better. The use of SSE4 improves the performance by a factor of 3.6 in this case. In addition, multiple threads should be used to utilize all four cores. To benefit from hyperthreading, twice as many threads as cores are needed. For the correlator, hyperthreading increases performance by 6%. Also, the number of vector registers is small. Therefore, there is not much opportunity to reuse data in registers, limiting the tile size to 2 3 2; reuse has to come from the L1 cache.
THE BG/P SUPERCOMPUTER
We found that the BG/P is extremely suitable for our application, since it is highly optimized for processing of complex numbers. However, the BG/P performs all floating-point operations in double precision, which is overkill for our application. Although the BG/P can keep the same number of values in register as the Intel chip, an important difference is that the BG/P has 32 registers of width two, compared to Intel's 16 of width four. The smaller vector size reduces the amount of shuffle instructions needed. In contrast to all other architectures we evaluate, the problem is compute bound instead of I/O bound, thanks to the BG/P's high memory bandwidth per operation, which is three to ten times higher than for the other architectures.
ATI GPUs
The ATI architecture has several important drawbacks for dataintensive applications. First, the host-to-device bandwidth is a bottleneck. Second, overlapping communication with computation does not work well. We observed kernel slowdowns of more than a factor of two due to asynchronous transfers in the background. This can clearly be seen in Figure 5 . Third, the architecture does not provide random write access to device memory, but only to host memory. The correlator reduces the data by a large amount, and the results are never reused by the kernel. Therefore, they can be directly streamed to host memory. Nevertheless, in general, the absence of random write access to device memory significantly reduces the programmability and prohibits the use of traditional programming models. ATI offers two separate programming models at different abstraction levels [15] . The low-level programming model is called Compute Abstraction Layer (CAL). It provides communication primitives and an assembly language, allowing fine tuning of device performance. For high-level programming, ATI provides Brook+. We implemented the correlator with both models. In both cases, the programmer has to do the vectorization, unlike with NVIDIA GPUs. CAL provides a feature called swizzling, which is used to select parts of vector registers in arithmetic operations. We found this improves readability of the code. However, the programming tools still are unsatisfactory. The high-level Brook+ model does not achieve acceptable performance. The low-level CAL model does, but it is difficult to use. The best-performing implementation uses a tile size of 4 3 3, thanks to the large number of registers. Due to the low I/O performance, we achieve only 16% of the theoretical peak.
NVIDIA GPUs
NVIDIA's programming model is called Compute Unified Device Architecture (CUDA) [16] . CUDA is relatively high level, and achieves good performance. An advantage of NVIDIA hardware, in contrast to ATI, is that the application does not have to do vectorization. This is thanks to the fact that all cores have their own address-generation units. All data parallelism is expressed by using threads. When accessing device memory, it is important to make sure that simultaneous memory accesses by different threads are coalesced into a single memory transaction. In contrast to ATI hardware, NVIDIA GPUs support random write access to device memory. This allows a programming model that is much closer to traditional models, greatly simplifying software development. It is important to use shared memory or the texture cache to enable data reuse. In our case, we use the texture cache to speedup access to the sample data. CUDA provides barrier synchronization between threads within a thread block. We exploit this feature to let the threads that access the same samples run in lock step. This way, we pay a small synchronization overhead, but we can increase the cache hit ratio significantly. We found that this optimization improved performance by a factor of two. This optimization is a good example that shows that, on GPUs, it is important to optimize memory behavior, even at the cost of additional instructions and synchronization overhead.
We also investigated the use of the per-multiprocessor shared memory as an application-managed cache. Others report good results with this approach [26] . However, we found that, for our application, the use of shared memory only led to performance degradation compared to the use of the texture caches.
Registers are a shared resource. Using fewer registers in a kernel allows the use of more concurrent threads, hiding load delays. We found that using a relatively small tile size (3 3 2) and many threads increases performance. The kernel itself, without host-to-device communication achieves 38% of the theoretical peak performance. If we include communication, the performance drops to 32% of the peak. Just like with the ATI hardware, this is caused by the low PCI-e bandwidth. With NVIDIA hardware, significant performance gains can be achieved by using asynchronous host-to-device I/O.
THE CELL/BE
With the Cell/BE it is important to exploit all levels of parallelism. Applications deal with task and data parallelism across multiple SPEs, vector parallelism inside the SPEs, and multibuffering for asynchronous DMA transfers [7] . Acceptable performance can be achieved by programming the Cell/BE in C or C11, while using intrinsics to manually express vector parallelism. Thus, the programmer specifies which instructions have to be used but can typically leave the instruction scheduling and register allocation to the compiler. A distinctive property of the architecture is that cache transfers are explicitly managed by the application using DMA. This is unlike other architectures, where caches work transparently. Communication can be overlapped with computation, by using multiple buffers. Although issuing explicit DMA commands complicates programming, we found that this usually is not problematic for signal-processing applications. Thanks to the explicit cache, the correlator implementation fetches each sample from main memory only exactly once. The large number of registers allows a big tile size of 4 3 3, leading to a lot of data reuse. We exploit the vector parallelism of the Cell/BE by computing the four polarization combinations in parallel. We found that this performs better than vectorizing over the integration time. This is thanks to the Cell/ BE's excellent support for shuffling data around in the vector registers. Due to the high memory bandwidth and the ability to reuse data, we achieve 92% of the peak performance on one chip. If we use both chips in a cell blade, we still achieve 91%. Even though the memory bandwidth per operation of the Cell/ BE is eight times lower than that of the BG/P, we still achieve excellent performance, thanks to the high-data reuse factor. Figure 5 shows the performance on all architectures we evaluated. The NVIDIA GPU achieves the highest absolute performance. Nevertheless, the GPU efficiencies are much lower than on the other platforms. The Cell/BE achieves the highest efficiency of all many-core architectures, close to that of the BG/P. Although the theoretical peak performance of the Cell/BE is 4.6 times lower than the NVIDIA chip, the absolute performance is only 1.6 times lower. If both chips in the cell blade are used, the Cell/BE also has the highest absolute performance. For the GPUs, it is possible to use more than one chip as well, for instance with the ATI 4870x2 device. However, we found that this does not help, since the performance is already limited by the low PCI-e throughput, and the chips have to share this resource. In Table 3 we summarize the architectural strengths and weaknesses that we discussed.
COMPARISON AND EVALUATION

PROGRAMMABILITY OF THE PLATFORMS
The performance gap between assembly and a high-level programming language is quite different for the different platforms. It also depends on how much the compiler is helped by manually unrolling loops, eliminating common subexpressions, and the use of register variables up to a level that the C code becomes almost as low level as assembly code. The difference varies between only a few percent to a factor of ten.
For the BG/P, the performance from compiled C++ code was by far not sufficient. The assembly code is approximately ten times faster. For both the Cell/BE and the Intel Core i7, we found that high-level code in C or C11 in combination with the use of intrinsics to manually describe the SIMD parallelism yields acceptable performance compared to optimized assembly code. Thus, the programmer specifies which instructions have to be used, but can typically leave the instruction scheduling and register allocation to the compiler. On NVIDIA hardware, the high-level CUDA model delivers excellent performance, as long as the programmer helps by using SIMD data types for loads and stores, and separate local variables for values that should be kept in registers. With ATI hardware, this is different. We found that the high-level Brook+ model does not achieve acceptable performance compared to hand-written CAL code. Manually written assembly is more than three times faster. Also, the Brook+ documentation is insufficient.
CONCLUSIONS
Radio telescopes require large amounts of signal processing and have high computational and I/O demands. We presented general insights on how to use many-core platforms for signalprocessing applications, looking at the aspects of performance, optimization, and programmability. As an example, we evaluated the extremely data-intensive correlator algorithm on today's many-core architectures.
The many-core architectures have a significantly lower memory bandwidth per operation compared to traditional architectures.
[ This requires completely different algorithm implementation and optimization strategies: minimizing the number of memory loads per operation is of key importance to obtain good performance. A high memory bandwidth per operation is not strictly necessary, as long as the architecture (and the algorithm) allows efficient data reuse. This can be achieved through caches, shared memory, LSs, and registers. It is clear that application-level control of cache behavior (either through explicit DMA or thread synchronization) has a substantial performance benefit, and is of key importance for signal-processing applications. We demonstrated that the many-core architectures have very different performance characteristics, and require different implementation and optimization strategies. The BG/P supercomputer achieves high efficiencies thanks to the high memory bandwidth per operation. The GPUs are unbalanced: they provide an enormous computational power but have a relatively low bandwidth per operation, both internally and externally (between the host and the device). Because of this, many data-intensive signal-processing applications will achieve only a small fraction of the theoretical peak. The Cell/BE performs excellently on signal-processing applications, even though its memory bandwidth per operation is eight times lower than the BG/P. Applications can exploit the applicationmanaged cache and the large number of registers. For the correlator, this results in optimal reuse of all sample data. Nevertheless, it is clear that, for signal-processing applications, the recent trend of increasing the number of cores will not work indefinitely if I/O is not scaled accordingly.
ACKNOWLEDGMENTS
This work was performed in the context of the NWO STARE AstroStream project. We gratefully acknowledge NVIDIA, and in particular Dr. David Luebke, for freely providing some of the GPU cards used in this work.
AUTHORS
Rob V. van Nieuwpoort (nieuwpoort@astron.nl) is a postdoctoral researcher at the Vrije Universiteit, Amsterdam and ASTRON. His current research interests focus on many-core systems and radio astronomy. He got his Ph.D degree at Vrije Universiteit, Amsterdam on efficient Java-centric grid computing. He has designed and implemented the Manta, Ibis, Satin, and JavaGAT systems and worked on the GridLab and Virtual Labs for E-science projects. At ASTRON, he works on the central, real-time data processing of LOFAR. His research interests include high-performance computing, parallel and distributed algorithms, networks, programming languages, and compiler construction.
John W. Romein (romein@astron.nl) is a senior system researcher of high-performance computing at ASTRON, where he is responsible for the central, real-time data processing of LOFAR. He obtained his Ph.D. degree on distributed search algorithms for board-game playing at Vrije Universiteit, Amsterdam. As a postdoctoral researcher, he solved the game of Awari using a large computer cluster and did research on parallel algorithms for bioinformatics. His research interests include high-performance computing, parallel algorithms, networks, programming languages, and compiler construction.
