The main contributors to message delivery latency in message passing environments are the copying operations needed to transfer and bind a received message to the consuming process/thread. A significant portion of the software communication overhead is attributed to message copying. Recently, a set of factors has been leading high-performance processor architectures toward designs that feature multiple processing cores on a single chip (a.k.a. chip multiprocessors, CMP). The Cell Broadband Engine (BE) shows potential to provide high performance to parallel applications (e.g., MPI applications). However, the Cell's non-homogeneous architecture, along with the small local storage in its SPEs, imposes restrictions and challenges on parallel applications. In this work, we first characterize various data delivery mechanisms in the Cell BE processor; we then propose techniques to facilitate the delivery of a message in MPI environments implemented on the Cell BE processor. We envision a cluster system comprising several Cell processors, each supporting several computation threads.
Introduction
The main contributors to message delivery latency in message passing environments are the copying operations needed to transfer and bind a received message to the consuming process/thread. A significant portion of the software communication overhead is attributed to message copying. Therefore, if the necessary data could be placed into the fastest level of the memory hierarchy, closest to the processor where the consuming process/thread resides and before it is needed, latency would be minimized.
Recently, a set of factors, such as poor performance/power efficiency and limited design scalability in monolithic designs, has moved high-performance processor architectures toward designs that feature multiple processing cores on a single chip (a.k.a. CMP).
The advent of the Cell Processor [1] with its multitude of synergistic processing elements (SPEs) and their associated fast memories provides an opportunity for an advanced computational environment for demanding MPI applications.
The challenge is to ensure data availability to the computations running on the SPEs. This data may exist locally, on another SPE, on the PPE, or globally on another Cell processor.
In previous studies [2, 3, 4, 5], we have proposed methods that predict data consumption patterns and facilitate placement of the data on the processor that eventually consumes it.
The present work focuses on studying the data transfer mechanisms between the processing elements of the Cell BE (i.e., the PPE and the SPEs) and identifying their communication capabilities (in terms of latency and throughput) for a variety of communication patterns. The ultimate goal is to use this information together with our prediction techniques to implement an efficient MPI environment.
This paper is organized as follows. Section 2 discusses related work; Section 3 presents an overview of the Cell processor architecture. Section 4 describes the simulation environment; Section 5 presents the obtained results. Section 6 discusses the achieved results, and Section 7 concludes.
Motivation and Related Work
The Cell processor has shown potential for use in high-performance computing, which uses MPI as the de facto programming model. Therefore, it is vital to implement MPI efficiently on the Cell BE processor to leverage its tremendous computational power. To this end, there has been a feasibility study of MPI implementation on the Cell BE [6, 7]. In this study, a minimal subset of synchronous-mode MPI has been implemented on the Cell BE, and the results show the potential of the Cell BE to run MPI applications efficiently.
Also, a programming model, MPI microtask, has been proposed [8], in which programmers do not need to manage the SPEs' local stores as long as they partition their application into parts (a.k.a. microtasks) that fit into the local stores. Besides the studies on MPI implementations, there have been several compiler proposals that aim at automatically generating parallel code for the Cell BE processor [9, 10, 11].
Figure 1. An overview of the Cell processor
Another study [12] has characterized the DMA transfers in the Cell BE processor when the DMA operations are issued on the SPEs. In the current study, we have repeated the DMA characterization discussed in [12], but we have also investigated the different components of DMA operations and additional modes of data transfer, including PPE-initiated transfers and mailbox communication, as well as memory management overheads.
The present study quantifies the latency of the DMA technique and measures its different components to gain more insight into data transfer in the Cell BE. Using these measurements, we propose preliminary techniques to facilitate receiving and binding arrived data in the Cell processor environment.
The portion of our results focusing on SPE-issued DMA operations agrees with the results presented in [12]. However, the results on the remaining modes of communication and on the memory management behavior are, to our knowledge, new.
The Cell Processor Architecture Overview
The Cell BE was designed as a joint venture by Sony, Toshiba, and IBM to be the processor for Sony's PlayStation gaming system [1]. It consists of a high-performance PowerPC core, called the PPE, that controls eight simple in-order SIMD cores, called synergistic processing elements (SPEs). An overview of the Cell processor is shown in Figure 1.
The PPE, the eight SPEs, the memory controller, and the I/O controller are connected through four data rings, known as the Element Interconnect Bus (EIB). Each unit can transfer data via the EIB at a rate of 8 bytes/cycle. In the Cell processor, both types of processor cores share access to a common address space, which includes main memory, address ranges corresponding to each SPE's local store, control registers, and I/O devices.
The PPE is a dual-threaded 64-bit PowerPC core with a vector multimedia extension (VMX). It runs the operating system, manages the SPEs, and includes a conventional cache hierarchy with 32 KB first-level instruction and data caches and a 512 KB second-level cache. As a standard 64-bit PowerPC processor, the PPE can run existing 64-bit PowerPC binaries.
Each SPE contains a synergistic processing unit (SPU) and a memory flow controller (MFC). The SPU is an in-order SIMD architecture that operates on 128-bit vectors and is optimized for computation-intensive applications; its simple, in-order implementation saves power and area. Each SPE has its own local store (256 KB) from which it fetches instructions and reads and writes data. The MFC includes a DMA controller and a memory management unit (MMU). Data and instructions are transferred between the PPE and the SPEs using the DMA controller, which has two DMA queues for transferring data between SPEs and between an SPE and the PPE: one for SPE-initiated DMA commands and another for PPE-initiated DMA commands.
The MMU in the MFC is responsible for address translation and protection using segment and page tables. A DMA transaction between an SPE and the PPE involves a data transfer between a local store address on the SPE and an effective address in system memory. For this transfer, the MFC resolves the system memory effective address through a virtual address translation using its page table. The first DMA transfer between the SPE and the PPE causes a page table fault in the MFC. The DMA engine can also transfer data directly between the local stores of two SPEs and within its own SPE.
Methodology and Simulation Environment
The focus of this work is to study and quantify the latency and throughput of different data transfer techniques.
The first phase involves the study of the communication overhead to transfer data (send or receive) between the PPE and the SPEs. There are two general data transfer methods, involving the PUT and GET functions. The GET function transfers data from the PPE to the SPEs. The PUT function transfers data from the SPEs to the PPE. These functions can be initiated from either the PPE or the SPE side and employ different mechanisms to set up and transfer data between the cores.
These different data transfer mechanisms result in different characteristics in terms of setup, latency, and throughput of the data transfer channels. To gain more insight, we explore the different components contributing to the communication overhead in these techniques. These components include the DMA issue and setup times, latency, data-delivery time, memory-management overhead, and synchronization between cores. In this study, the total data transfer time can be expressed as a function of several components, as shown in Figure 2, which include:
• First byte arrival time (Latency): a lower bound on the delay incurred in transferring a message containing a byte from its source core to its destination core.
• Data Delivery: defined as the time that the processor needs to transfer the arrived message to its final destination.
• DMA overhead (DMA set up time): defined as the time to enqueue a DMA transfer.
• Gap: the minimum time interval between consecutive message arrivals in a non-blocking transfer.
Figure 2. Components of data transfer (DMA setup overhead, latency, gap between subsequent messages, and data delivery, between sender and receiver)
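One way to read the components listed above is as terms of a simple additive model. The sketch below is only an illustration of that reading, not the model used in this study: it assumes that the components add up serially and that, in a batch, each message after the first arrives one gap after its predecessor. The names `transfer_components`, `batch_total_us`, and `per_message_us` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Timing components of one transfer, in microseconds (names are illustrative). */
typedef struct {
    double dma_setup_us;  /* time to enqueue one DMA transfer */
    double latency_us;    /* first-byte arrival time */
    double delivery_us;   /* first byte to last byte of one message */
    double gap_us;        /* interval between consecutive message arrivals */
} transfer_components;

/* ASSUMPTION: the components add serially and each message after the first
 * starts arriving one gap after the previous one. This is one possible
 * reading of the component list, not a measured law. */
double batch_total_us(const transfer_components *c, unsigned n)
{
    assert(c != NULL);
    if (n == 0)
        return 0.0;
    return c->dma_setup_us + c->latency_us
         + (double)(n - 1) * c->gap_us + c->delivery_us;
}

/* Per-message time of a batch-n transfer: batch total divided by n. */
double per_message_us(const transfer_components *c, unsigned n)
{
    assert(n > 0);
    return batch_total_us(c, n) / (double)n;
}
```

Under these assumptions, a larger batch amortizes the setup and latency terms over more messages, so the per-message time approaches the gap as n grows.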
To quantify these components, we wrote several microbenchmarks.
The DMA issue time is established easily by measuring the time elapsed between calling a function that issues a DMA (i.e., PUT or GET) and the return from this function. The on-board timer is used to measure the elapsed time.
The data delivery time is established by determining the arrival times of the first and last elements of a message by continuously monitoring the destination buffer.
Determining the remainder of these components (i.e., the first byte delivery and the MMU overhead) is more complex. Our methods are discussed in the subsequent sections.
First-Byte & Last-Byte Delivery measurement
In this part, we explain our methodology for evaluating the first-byte delivery latency. The first-byte delivery is defined as the elapsed time from issuing a DMA command to receiving the first byte of the message. In general, our approach detects the arrival of the first byte by checking the contents of the destination buffer.
Since PUT (GET) exhibits long issue times on the PPE, we elected to implement a two-threaded benchmark, where the first thread implements the data transfer while the second thread (in a tight loop) determines the time of first-byte arrival. The pseudo code depicted in Table 1 shows the implementation when the initiator and the target of the data transfer are on the PPE. Table 2 shows the pseudo code of the micro-benchmark implementation when the initiator resides on the SPE while the target of the transfer is the PPE.
Similar benchmarks were developed that use the PUT and GET functions in the opposite directions to the ones depicted in Tables 1 and 2. The micro-benchmark depicted in Table 2 records times on two different cores (i.e., the PPE and the SPE), so the associated timers need to be synchronized. We discuss the synchronization mechanism in the following subsections. In this part, we only show the pseudo code for the first-byte arrival time on the PPE in the SPE-initiated GET function. Similar benchmarks were developed to detect the arrival of the last byte.
Synchronizing the PPE and the SPEs timers
As mentioned in Section 4.1, synchronizing the timers on the SPEs and the PPE is crucial. The main issue to consider is the accuracy of this synchronization. To this end, we selected mailbox communication that accesses the SPEs' problem state directly. This approach was chosen for its short round-trip latency between the PPE and the SPEs, which is 10 to 14 cycles. Using this approach, we are able to determine the time on each core with high accuracy and the least delay. The following code is part of our synchronization approach. In this code, we write into the mailbox using the spe_in_mbox_write() function, which accesses the SPE's problem state directly. As shown in the code, the start time on the SPE is obtained from the time at which data is received in the mailbox on the SPE and from the round-trip time, which is evaluated using a ping-pong method.
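Independently of the listing referred to above, the arithmetic behind such a ping-pong synchronization can be sketched in portable C. This is a minimal illustration, not the benchmark code: it assumes that both timers tick at the same rate, that timestamps have already been converted to upward-counting tick values (the SPU decrementer itself counts down), and that the one-way mailbox delay is half of the measured round trip. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Offset that maps a PPE timestamp onto the SPE's timeline, in timer ticks.
 * ASSUMPTION: the one-way mailbox delay is half of the round-trip time, so a
 * word sent at ppe_send_ticks reaches the SPE at ppe_send_ticks + rtt / 2. */
int64_t spe_minus_ppe_offset(int64_t ppe_send_ticks,
                             int64_t spe_recv_ticks,
                             int64_t round_trip_ticks)
{
    assert(round_trip_ticks >= 0);
    int64_t one_way = round_trip_ticks / 2;
    return spe_recv_ticks - (ppe_send_ticks + one_way);
}

/* Express a PPE timestamp on the SPE's timeline using the offset above. */
int64_t ppe_to_spe_ticks(int64_t ppe_ticks, int64_t offset)
{
    return ppe_ticks + offset;
}
```

With this scheme the synchronization error is bounded by the asymmetry of the two one-way delays, which cannot exceed the round-trip time itself.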
Memory Management Overhead
Memory management overhead plays a significant role in the latency when a transfer from a new source to a new target destination is required.
In the first part, we examine the miss penalty of the translation look-aside buffer (TLB) on the SPEs. In each data transfer between the PPE and the SPE, the effective address of the memory location on the PPE is obtained through the TLB in the MFC, a 256-entry table in which each entry holds the address mapping of a 4 KB page in main memory. If the TLB has no entry for the address of the referenced memory location, a TLB miss occurs. The memory management system then finds the corresponding effective address for the memory location and fills in the corresponding TLB entry for subsequent accesses.
To measure the TLB miss penalty, our micro-benchmark uses the fact that the first transfer causes a cold miss in the TLB, while the second transfer of the same buffer results in a hit. Therefore, the difference between these two cases is equal to the memory-management overhead. Another quantity that we want to measure is the replacement penalty due to a conflict in the TLB. For this, we wrote a micro-benchmark that generates addresses that map to the same entry in the TLB. As stated before, the first access to the TLB results in a miss, while subsequent accesses with the same address result in a hit. However, accessing the TLB with different addresses that refer to the same entry causes a replacement. Therefore, the difference between the times of a TLB hit and the following replacement is equal to the replacement penalty in the case of conflicts in the TLB. This approach is almost the same as the code shown in Table 4.
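The address generation and the two subtractions described above can be illustrated with a short sketch. The indexing function is an assumption made for illustration only: it treats the TLB as direct-mapped, with the entry selected by the page number modulo the number of entries, which the text does not state. The page size and entry count are the values given above; all identifiers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum {
    TLB_PAGE_BYTES = 4096,  /* page size stated in the text */
    TLB_NUM_ENTRIES = 256   /* number of TLB entries stated in the text */
};

/* ASSUMPTION: direct-mapped indexing by page number modulo entry count. */
unsigned tlb_index(uint64_t effective_addr)
{
    return (unsigned)((effective_addr / TLB_PAGE_BYTES) % TLB_NUM_ENTRIES);
}

/* k-th address that maps to the same entry as base under that assumption. */
uint64_t conflicting_address(uint64_t base, unsigned k)
{
    return base + (uint64_t)k * TLB_PAGE_BYTES * TLB_NUM_ENTRIES;
}

/* Miss penalty: first (cold) transfer time minus repeated (hit) transfer time. */
double miss_penalty_us(double first_transfer_us, double repeat_transfer_us)
{
    return first_transfer_us - repeat_transfer_us;
}

/* Replacement penalty: transfer that evicts an entry minus a hit transfer. */
double replacement_penalty_us(double replace_transfer_us, double hit_transfer_us)
{
    return replace_transfer_us - hit_transfer_us;
}
```

Under the direct-mapped assumption, conflicting buffers are spaced 1 MiB apart (4096 bytes times 256 entries); a set-associative TLB would need a different stride and more than one conflicting address to force a replacement.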
Experimental Setup
We use the aforementioned methods to measure the different components of various data transfer techniques. We performed our experiments on a 3.2 GHz PS3 running Linux kernel 2.6.16. For all experiments, we use C language intrinsics with the libspe2.1 library. The SPU decrementer register, which ticks every 12.5 ns, is used to measure elapsed time on the SPEs. The time-base register, which has the same tick rate as the SPU decrementer, is used to measure elapsed time on the PPE. We run all experiments 20000 times and find the minimum of the results using the k-best algorithm.
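As an illustration of this post-processing, the following sketch converts timer ticks to microseconds and applies one common form of the k-best scheme: the minimum sample is reported, and the measurement is considered stable when the k smallest samples agree within a relative tolerance. The values of k and the tolerance are not given in the text, so `k` and `eps` here are free parameters, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TICK_NS 12.5  /* timer period stated in the text */

/* Convert a tick count of the 12.5 ns timer to microseconds. */
double ticks_to_us(unsigned long ticks)
{
    return (double)ticks * TICK_NS / 1000.0;
}

/* Ascending comparison for qsort on unsigned long samples. */
static int cmp_ulong(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;
    return (x > y) - (x < y);
}

/* Sorts samples in place, stores the minimum in *best, and returns 1 if the
 * k smallest samples lie within a factor (1 + eps) of the minimum, else 0.
 * ASSUMPTION: k and eps are illustrative; the text does not specify them. */
int k_best(unsigned long *samples, size_t n, size_t k, double eps,
           unsigned long *best)
{
    assert(samples != NULL && best != NULL && k >= 1 && k <= n);
    qsort(samples, n, sizeof samples[0], cmp_ulong);
    *best = samples[0];
    return (double)samples[k - 1] <= (1.0 + eps) * (double)samples[0];
}
```

A return value of 0 would suggest that more runs are needed before the minimum can be trusted.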
The following section describes techniques and the experimental results.
Results
Having prepared the environment, we ran our micro-benchmarks and explored the latency and overhead incurred during data transfers.
While in the previous section (section 4) we discussed methods of achieving synchronization and determining elapsed times, in this section, our objective is to determine the communication components that play a role in delivering a message. We consider the following mechanisms:
• Transferring data between the PPE and the SPE using PUT and GET functions (SPE-initiated and PPE-initiated)
• Transferring data between the SPEs
• Using Mailbox for data transfer
These mechanisms have different characteristics in terms of latency, data delivery, DMA setup overhead, and receiving rates. In some cases, despite a high-bandwidth communication path, other factors limit the achievable data transfer throughput. We explore these situations in the following sections.
DMA Transfer Time between the PPE and SPEs
In this section, we present results for transferring data using DMA between the PPE and SPEs. First, we run the micro-benchmarks in a blocking mode, that is, we need to wait for the termination of each data transfer and to receive the acknowledgement from the receiver. Subsequently, we consider the non-blocking mode of transfer. In this mode, we initiate several DMAs without waiting for the arrival of individual acknowledgements. We call this batched mode. For example, if we issue n DMA transfers before waiting for the acknowledgement, we will have a batch-n data transfer.
Figure 6. Accumulative data delivery components of the GET function (PPE-initiated)
GET Function Characteristics
We measured the different components of data transfers using the GET function, initiated on the PPE and the SPEs. These measurements are obtained for the blocking and non-blocking methods. Figures 3 and 4 show the total data transfer time per message using the GET function between the PPE and SPEs. As can be observed, the latency of the SPE-initiated GET function is much lower than that of the PPE-initiated function. To investigate this behavior further, we quantify the components of these data transfers. Figures 5 and 6 show the different components of data transfer. To identify the dominant factors in each data transfer, these figures illustrate the total time of data transfer for all messages. For example, for a batch-2 data transfer they show the elapsed time from issuing the first message to the arrival of the last byte of the second message. Figures 5 and 6 include the overhead of setting up a DMA transfer in the GET function, first message arrival (latency), data delivery, and the gap between successive messages.
It is worth mentioning that Figures 4 and 6 depict different time intervals. While Figure 4 shows the per-message total transfer time as measured at the PPE, Figure 6 shows the components of the transfer as observed at the SPE. These components exclude the acknowledgement, which is received by the PPE after the completion of data delivery of the last message at the SPE, and reflect the time taken by all n messages in a batch-n to be transferred. The difference between these two observations yields an acknowledgement time of 5 µs. As can be observed from the figures, the main component of the SPE-initiated GET function is data delivery. In other words, the DMA setup on the SPE, which is accomplished by enqueuing the request through channels, is fast enough that it is not a bottleneck in this data delivery. However, the PPE-initiated GET function shows a long DMA issue time as well as a long gap in receiving successive messages.
PUT Function Characteristics
In this part, we investigate the different components of data transfers using the PUT function, initiated on the PPE as well as the SPEs. Figures 7 and 8 show the total data transfer time per message using the PUT function between the PPE and SPEs. As can be observed, the latency of the SPE-initiated PUT function is much lower than that of the PPE-initiated function. As in the previous part, we also quantify the components of these data transfers.
Figures 9 and 10 include the overhead of setting up a DMA transfer in the PUT function, first message arrival time (Latency), data delivery, and the gap between successive messages. Similar to the GET function, we show the total time of data transfer for all messages. To find the latency of data transfer per message, we need to divide the total time by the number of sent messages. For example, in a batch-8 transfer, we should divide the total time by 8.
We should point out that the results in Figures 8 and 10 were obtained using different methods. Figure 8 depicts the latency of the PUT function, measured as the elapsed time from issuing the PUT function to receiving the corresponding acknowledgement on the PPE. Figure 10, in contrast, shows the different components of the data transfer using the PUT function, measured by a separate thread on the PPE. The discrepancies between the two figures can be explained if the acknowledgement is received after the arrival of the first message but before the whole data transfer completes on the PPE; we speculate that this is the case. This behavior depends on the implementation of the data transfer libraries on the Cell BE processor.
As can be observed from the figures, the main components of the SPE-initiated PUT function are data delivery and latency (i.e., the time to receive the first byte). There are two main concerns to handle in the PPE-initiated PUT function: the DMA issue time and the latency. As a result of these observations, we need to consider techniques to hide the latency of PUT functions on the PPE side.
DMA Latency among and inside the SPEs
In this part, we measure the latency of transferring data using DMAs between different SPEs. We use synchronization through mailboxes among the SPEs to start the data transfer. We also transfer the destination's local store address to the source through DMA. PUT and GET operations are used for this purpose. Since our results match those in [12], we summarize them together with the other techniques in Table 8.
We also investigated the latency of data transfer inside each SPE using DMA transfers and copy operations (i.e., processor load and store instructions). This measurement, shown in Table 5, helps us choose the most efficient technique to transfer and bind an arrived message to its destination thread on the SPEs.
Mailbox Communication
Another mechanism to transfer data (32 bits) between the PPE and the SPEs in the Cell processor is through mailboxes. We measure the latency of mailbox communication in two situations: using library functions and directly accessing the SPE's problem state. Table 6 shows the latency of mailbox communication in these situations. As can be inferred from the table, mailbox communication through direct access to the SPE's problem state is one order of magnitude faster than using library functions. However, direct access is unprotected and must be used with more caution.
Address Translation Behavior
This section examines the address translation behavior in data transfer between the PPE and the SPEs.
First, we explored the miss penalty of the translation look-aside buffer (TLB) on the SPEs. Second, we measured the replacement penalty due to a conflict in the TLB. Table 7 shows the results of these experiments for different message sizes.
Discussion
In this work, we have measured the latency and memory management overhead incurred during transferring data between the PPE and the SPEs as well as between the SPEs in the Cell processor. A summary of the achieved results is provided in Table 8 . Considering the summarized results in Table 8 , we distinguish different situations to transfer data among cores in the Cell processor.
To send data from the PPE to the SPEs, we can distinguish two different cases: long and short messages. For long messages, the SPE-initiated GET function has the least overhead to transfer data from the PPE to the SPE. The address of the destination buffer is usually sent through the parameters when spawning the SPE thread; therefore, the SPE does not need to communicate with the PPE to obtain the destination buffer. However, the destination buffer is likely to change during execution. In this case, mailbox communication, which takes 0.27 µs, can be employed to receive the destination buffer instead of issuing a long-latency GET function on the PPE. For short messages (<= 4B), mailbox communication is a very efficient method to transfer data from the PPE to the SPE.
To send data from the SPEs to the PPE, we also distinguish two different cases: long and short messages. For long messages, as can be observed from Table 8, the SPE-initiated PUT function has the least overhead, provided the SPE is responsible for the data transfer and knows the target buffer. However, if the PPE is responsible for the data transfer, there are two possible ways to perform it. First, we might employ mailbox communication to send the target address to the SPE, which then initiates the transfer. Second, if we predict the pattern of data consumption on the PPE, we can issue the data transfer ahead of time to tackle the PUT function's bottleneck, which is the DMA issue time, as shown in Figure 10.
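The selection rules discussed in the two paragraphs above can be condensed into a small decision function. This is only a sketch of those rules, not part of the measured system; the type and function names are hypothetical. Since the discussion does not single out a mechanism for short SPE-to-PPE messages, the sketch makes no special case for them.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { PPE_TO_SPE, SPE_TO_PPE } direction;
typedef enum { MAILBOX, SPE_INITIATED_GET, SPE_INITIATED_PUT } mechanism;

enum { MAILBOX_PAYLOAD_BYTES = 4 };  /* one 32-bit mailbox word */

/* Pick a transfer mechanism following the rules summarized in the text.
 * ASSUMPTION: the SPE knows the remote buffer address; otherwise the address
 * would first have to be passed to it, e.g. through a mailbox. */
mechanism choose_mechanism(direction dir, size_t bytes)
{
    assert(bytes > 0);
    if (dir == PPE_TO_SPE) {
        /* Short messages fit in a mailbox word; long ones use SPE-initiated GET. */
        return bytes <= MAILBOX_PAYLOAD_BYTES ? MAILBOX : SPE_INITIATED_GET;
    }
    /* SPE to PPE: SPE-initiated PUT is the low-overhead choice for long
     * messages; no separate short-message rule is given, so none is applied. */
    return SPE_INITIATED_PUT;
}
```

For example, `choose_mechanism(PPE_TO_SPE, 16384)` selects the SPE-initiated GET, while a 4-byte message in the same direction selects the mailbox.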
When data is transferred from a memory location on the PPE (or even from another SPE) to an SPE for the first time, our results show that the TLB miss penalty is between 0.15 µs and 0.4 µs, which is comparable to the achieved data transfer times. This high overhead is attributed to memory management operations on the PPE to allocate the required memory and to set up the appropriate tables for address mapping on the SPEs. The TLB latency therefore needs to be hidden. This can be achieved by pre-allocating TLB entries and by using prediction [5] to manage this allocation. We are presently investigating the efficacy of these methods.
In the current study, the results were obtained using two different timers (i.e., on the PPE and on the SPE) when the initiator and the receiver are located on different cores. In this case, we had to synchronize these timers. For this, we used mailbox communication through the SPEs' problem state. The measured round-trip time for this communication is 10 to 14 timer cycles (i.e., 0.120 µs to 0.175 µs), which is quite short in comparison to our measurements. Therefore, we are quite confident that our results have high accuracy.
Conclusions
In this work, we measured the different components of data transfers using different techniques and established a framework to transfer data in a Cell processor. We plan to investigate and tune the proposed techniques by implementing several basic MPI functions and testing them on a Cell processor. We also envision exploiting our prediction techniques in such an environment.
