Abstract: For flexible mapping of various task-level pipelines on a multi-core processor, the authors propose the memory-centric network-on-chip (NoC). The memory-centric NoC manages producer-consumer data transactions between tasks when a task-level pipeline is distributed over multiple processing cores. Because the memory-centric NoC manages these data transactions, it relieves the burden on the software running on the processing cores, which results in power-efficient execution of the task-level pipeline. To demonstrate the advantages of the memory-centric NoC, the authors implemented a multi-core processor based on it.
Introduction
In recent decades, very large-scale integration designs with multiple processing cores have become prevalent [1-4], since modern process technology has enabled the integration of billions of transistors on a single chip. At the same time, implementing a heavy single-core processor to exploit instruction-level parallelism (ILP) has reached a point of diminishing returns [5]. The power overhead of complex out-of-order execution units has become intolerable, whereas the amount of ILP available in most common applications is limited. Multi-core designs have therefore been preferred to avoid the limitations of heavy single-core processors. By dividing target applications into small computation kernels and executing these kernels concurrently on multiple processing cores, the operating frequency and supply voltage of the chip can be lowered to reduce power consumption without performance degradation [6].
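As a rough illustration of this argument, the standard first-order model of CMOS dynamic power can be used. The activity factor, the switched capacitance, the assumption of ideal two-way parallelisation and the 0.8 voltage scaling factor below are illustrative assumptions, not values reported in this work.

```latex
% First-order dynamic power of one core (alpha: activity factor, C: switched capacitance)
P_{\mathrm{dyn}} = \alpha \, C \, V_{DD}^{2} \, f

% Two cores, each at half the frequency and an assumed 0.8x supply voltage,
% delivering the same aggregate throughput under ideal parallelisation:
P_{\mathrm{2\,cores}} = 2 \cdot \alpha \, C \, (0.8\,V_{DD})^{2} \cdot \frac{f}{2}
                      = 0.64 \, \alpha \, C \, V_{DD}^{2} \, f
```

Under these assumptions the two-core configuration consumes about 64% of the single-core dynamic power. Real savings depend on how far the supply voltage can actually be lowered at the reduced frequency and on parallelisation overheads.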
Video and image processing applications generally contain intrinsic parallelism and require intensive computation, because they involve numerous iterations of complex calculations over all pixels of the input image. These applications are commonly referred to as stream processing, because they comprise a series of tasks organised as a pipeline, and the data flow through the pipelined tasks as a stream [7-9]. In stream processing, data dependencies exist mainly between adjacent tasks in the pipeline, and intermediate data are shared locally and rarely reused rather than being shared globally. Adopting multi-core processors for stream applications is a suitable way to meet their large computation requirements in a power-efficient manner. Different numbers of processing cores can be assigned to different tasks according to their workloads, and a series of tasks can also be distributed over the multiple processing cores. The operating frequency and supply voltage of each core can then be adjusted to optimise power consumption under the given performance constraints. For efficient utilisation of multi-core processors across various stream applications, it is important to design on-chip interconnections capable of managing the concurrent data transactions initiated by the multiple processing cores. Designing such interconnections becomes more tractable when the characteristics of data flow in stream applications are taken into account.
In this paper, we propose the memory-centric network-on-chip (NoC) to facilitate the mapping of various types of task-level pipeline onto a multi-core processor architecture. When different numbers of processing cores are assigned to each stage of the task-level pipeline, data transactions between the producer and consumer tasks take the form of N-to-1 and 1-to-M. The key feature of the memory-centric NoC is its support for low-overhead N-to-1 and 1-to-M producer-consumer data transactions, which avoids the overhead of software-level primitives for data synchronisation. To manage producer-consumer data transactions, the memory-centric NoC schedules the utilisation of the on-chip shared memories and tracks the validity of the shared memory entries on behalf of the programmer. With the memory-centric NoC, the use of polling-based synchronisation primitives such as barriers and spin locks is greatly reduced, which lowers both power consumption and the amount of on-chip data transactions. We implemented and published an object recognition processor based on the memory-centric NoC in [10-12]. In this paper, we focus on the memory-centric NoC itself and describe its programming model, operation and advantages in detail.
The remainder of the paper is organised as follows. Section 2 discusses multi-core processor implementations for video and image processing applications, followed by previous work on the producer-consumer data transactions of the stream processing model. The architecture, operation and programming model of the memory-centric NoC are described in Section 3. The advantages of the memory-centric NoC for efficient N-to-1 and 1-to-M producer-consumer data transactions are explained in Section 4. Implementation results are presented in Section 5, and conclusions are drawn in Section 6.
Related works
Recently, the CELL processor [3, 13] was developed to meet the high performance requirements of digital media applications. It comprises eight processing cores, called SPEs, for parallel execution of multiple threads, and the on-chip interconnection among the SPEs is provided by four data ring buses. Data transactions between the SPEs are managed by dedicated direct memory access (DMA) controllers integrated in each SPE, and the memory accesses of an SPE are limited to its own private memory. Although the CELL processor is well suited to accelerating parallel execution of multiple threads, its support for 1-to-M producer-consumer data transactions is not efficient, since the data from the producer SPE must be copied M times into the private memories of the consumer SPEs. In contrast, IMAGINE [7], which consists of eight ALU clusters and a stream register file (SRF), has more flexibility in supporting the various types of producer-consumer data transactions. The SRF is a large on-chip shared memory, and intermediate data are kept in the SRF to exploit data locality between the producer and consumer tasks. The drawback of IMAGINE is the poor scalability of the processor architecture, since scaling it requires significant redesign of the processor and compiler. With the memory-centric NoC, adding processing cores does not affect the existing parts of the original processor or its programs, since the memory-centric NoC dynamically manages physical memory assignments for producer-consumer data transactions. Larrabee [4] is the most recently proposed multi-core processor for video and image processing applications. Because Larrabee targets a wide range of applications, it adopts conventional distributed L2 caches and ring-topology multi-layer buses, which are not optimised for producer-consumer data transactions.
Regarding the data transactions of stream processing applications, streaming consistency [14] has been reported. In the streaming consistency model, producer-consumer data transactions are performed through communication buffers together with explicit acquire and release synchronisation sections. Since streaming consistency allows reordering between synchronisation sections associated with different communication buffers, it is practical for supporting concurrent producer-consumer data transactions. In the streaming consistency model, the C-HEAP protocol [15, 16] is used to manage the communication buffers. Although the C-HEAP protocol is suitable for simple one-to-one producer-consumer data transactions, it does not support other types of data transactions efficiently because its data transactions are FIFO-based. When the data access patterns of the consumer tasks differ from those of the producer task, FIFO-based data transactions require the data to be reordered in the consumer tasks. In addition, the explicit management of the read/write pointers of the FIFOs incurs considerable data transaction overhead, since it requires continuous polling of synchronisation variables. The memory-centric NoC is proposed in this paper to reduce these overheads when supporting the various types of producer-consumer data transactions.
Memory-centric NoC
Most stream applications consist of a series of tasks that can be organised as a task-level pipeline. Examples of such applications are MPEG-4 decoding [8], image depth extraction [7], object recognition [12] and many others. When such applications are mapped onto multi-core processors, it is advantageous to assign a different number of processing cores to each stage of the pipeline according to its workload. Moreover, because the number of tasks to be processed varies between applications, flexible mapping of various task-level pipelines must be supported for efficient utilisation of multi-core processors. A generalised representation of various types of task-level pipeline is depicted in Fig. 1a.
As shown in Fig. 1a, data transactions between adjacent tasks occur in the form of N-to-1 and 1-to-M, because different numbers of processing cores are assigned to each task and most of the data transactions are unidirectional. In addition, the amount of data transferred between the tasks also varies at each stage.
Architecture
Based on the idea described in Fig. 1b, we implemented a multi-core processor using the memory-centric NoC as its on-chip interconnection [10-12]. The architecture of the memory-centric NoC and the implemented processor is shown in Fig. 2. Ten PEs are integrated for parallel execution of multiple tasks, and a RISC processor is integrated to manage program execution on the PEs. Eight dual port memories provide shared buffers for inter-PE data transactions. The memory-centric NoC consists of five crossbar switches in a hierarchical star topology, four channel controllers, eight valid-check logic blocks (one at each dual port memory) and a number of network interface modules (NIMs).
The topology of the memory-centric NoC was chosen by considering the characteristics of on-chip data transactions. When a task-level pipeline is executed, temporary data produced by a task are accessed only by the subset of PEs that execute the subsequent consumer tasks, and data are not shared globally. In addition, direct PE-to-PE data transactions rarely occur, because most of the shared data are exchanged through the dual port memories. On-chip data transactions are therefore highly localised within subsets of the PEs and dual port memories. In previous work by our research group [17], we concluded that the hierarchical star topology is the most efficient in terms of area and power consumption when the NoC contains fewer than hundreds of modules; hence, the hierarchical star topology is adopted for the memory-centric NoC. The four channel controllers dynamically update the routing look-up tables (LUTs) of the NIMs in response to commands from the programs running on the PEs. This run-time manipulation of the LUTs provides the abstractions of N-to-1 and 1-to-M data transactions. The valid-check logic enables the memory-centric NoC to track the validity of each memory entry. A NIM is placed at each component of the processor to perform packet generation and parsing. The next subsection describes the programming model of the memory-centric NoC.
Programming model
The conventional approach to scheduling inter-core data transactions is to build compilers that statically arrange the memory utilisation of the multiple processing cores [1, 7, 13, 17, 18]. In compiler-based approaches, data movements across the on-chip processing cores are either performed implicitly by accessing special communication registers [1, 18], or explicit code for DMA controllers is generated [7, 13, 18]. Compiler-based approaches are convenient for programmers; however, developing a compiler for a multi-core processor requires complete knowledge of the target processor architecture and a large design effort. It is also difficult to integrate heterogeneous processing cores or to change the interconnection topology among the integrated cores. NoCs, in contrast, are generally designed to be used in various processor architectures with homogeneous or heterogeneous processing cores. It is therefore advantageous to devise a programming model that is independent of the chip architecture. The memory-centric NoC provides low-overhead synchronisation primitives that reduce the complexity of writing program code for manual parallelisation and of developing compilers for multi-core processors.
When a task-level pipeline is mapped onto the proposed multi-core processor shown in Fig. 2, two issues have to be addressed. The first issue is scheduling the utilisation of the dual port memories for inter-PE data transactions. This issue arises when the number of concurrent data transactions between the PEs exceeds the number of dual port memories on the chip. In such cases, the data transactions between the PEs should be scheduled so that the limited memory space is used in a time-multiplexed fashion. The second issue appears after a dual port memory has been assigned to a subset of PEs for N-to-1 or 1-to-M data transactions. Here, we refer to a PE computing an earlier stage of the task-level pipeline as a producer PE and to a PE of the subsequent stage as a consumer PE. When a consumer PE tries to read data produced by a producer PE, the consumer PE must wait until the producer PE has written valid data to the assigned memory. Conventionally, this requires the consumer PE to execute polling loops on a synchronisation variable so that it reads the data from the producer PE at the right time. The memory-centric NoC addresses these two issues by dynamically assigning the dual port memories for producer-consumer data transactions and by tracking the validity of the memory entries after a dual port memory has been assigned. Since these two operations are transparent to the programs running on the PEs, the memory-centric NoC leads to a simplified programming model, which is described in this subsection.
In the memory-centric NoC, all shared data transactions are explicit and are controlled by the following three commands, which each PE issues through memory-mapped register writes.
Open channel - Producer PE command, which requests memory assignment for a shared data transaction. This command includes the channel number to be used and the list of consumer PEs.
Close channel - Producer PE command, which is necessary to release the memory space assigned for a shared data transaction. This command includes the channel number used and the list of consumer PEs.
End channel - Consumer PE command, which is necessary to release the memory space assigned for a shared data transaction. This command includes the channel number used, the producer PE and the PE ID of the consumer PE.
After writing the open channel command, the producer PE is allowed to write the shared data for the consumer PEs without checking the availability of the memory space, i.e. the dual port memories. The scheduling of on-chip memory utilisation is performed transparently by the memory-centric NoC rather than by the programs running on the PEs. The producer PE writes its results to a pre-defined address region determined by the channel number, and the memory-centric NoC directs these writes to the dual port memory it has selected. At the end of the shared data transaction, the producer PE must write a close channel command to release the assigned dual port memory. Consumer PEs, on the other hand, can start reading shared data from the producer PE without writing any command. The memory-centric NoC automatically blocks read accesses until the requested data are available in the assigned dual port memory, so no polling loops for checking synchronisation variables are required. Like the producer PE, the consumer PEs read a fixed, pre-defined address region determined by the producer PE and the channel number. After reading all shared data from the producer PE, each consumer PE must write an end channel command to release the assigned dual port memory.
In the memory-centric NoC programming model, an N-to-1 data transaction is realised simply by a consumer PE reading the address regions of the corresponding producer PEs. For example, PE 2 of Fig. 3 could read both the PE1_CH0_READ_ADDR and PE0_CH0_READ_ADDR address regions to read data from the two producer tasks. For 1-to-M data transactions, data transfers to the M consumer PEs are managed by specifying the list of consumer PEs in the open/close channel commands. The channel numbers are used to manage concurrent, independent data transactions, and reordering between data transactions with different channel numbers is allowed. To impose sequential order between two data transactions initiated by a producer PE, the same channel number should be used. Finally, an N-to-M data transaction is realised simply by M consumer PEs reading data from the same N producer PEs. In this case, N dual port memories remain assigned until all the producer and consumer PEs associated with each dual port memory have completed their data transactions.
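The programmer's view described in this subsection can be sketched in software. The following is a minimal behavioural model, not the hardware implementation: the class `ToyNoC`, its method names, and the use of Python threads to stand in for PEs are all assumptions made for illustration. It reproduces the N-to-1 example in which PE 2 reads from producers PE 0 and PE 1 on channel 0, with reads blocking until the requested word has been written.

```python
import threading

class ToyNoC:
    """Behavioural toy model of the three-command programming model (hypothetical API)."""

    def __init__(self):
        self.cv = threading.Condition()
        # (producer_pe, channel_no) -> {"data": {offset: word}, "consumers": set, "closed": bool}
        self.channels = {}

    def open_channel(self, producer, ch, consumers):
        with self.cv:
            self.channels[(producer, ch)] = {
                "data": {}, "consumers": set(consumers), "closed": False}
            self.cv.notify_all()

    def write(self, producer, ch, offset, word):
        with self.cv:
            self.channels[(producer, ch)]["data"][offset] = word
            self.cv.notify_all()          # wake any consumer blocked on this word

    def read(self, producer, ch, offset):
        key = (producer, ch)
        with self.cv:
            # The consumer issues no command; the read simply blocks until valid.
            self.cv.wait_for(
                lambda: key in self.channels and offset in self.channels[key]["data"])
            return self.channels[key]["data"][offset]

    def close_channel(self, producer, ch):
        with self.cv:
            self.channels[(producer, ch)]["closed"] = True
            self._release_if_done((producer, ch))

    def end_channel(self, consumer, producer, ch):
        with self.cv:
            self.channels[(producer, ch)]["consumers"].discard(consumer)
            self._release_if_done((producer, ch))

    def _release_if_done(self, key):
        # Buffer is released after close channel plus one end channel per consumer.
        c = self.channels[key]
        if c["closed"] and not c["consumers"]:
            del self.channels[key]

def producer(noc, pe, words):
    noc.open_channel(pe, 0, consumers=[2])
    for offset, word in enumerate(words):
        noc.write(pe, 0, offset, word)
    noc.close_channel(pe, 0)

def consumer(noc, out):
    # N-to-1: PE 2 reads the regions of both producers, then ends each channel.
    for pe in (0, 1):
        out[pe] = [noc.read(pe, 0, offset) for offset in range(3)]
        noc.end_channel(2, pe, 0)

noc = ToyNoC()
result = {}
threads = [threading.Thread(target=consumer, args=(noc, result)),
           threading.Thread(target=producer, args=(noc, 0, [10, 11, 12])),
           threading.Thread(target=producer, args=(noc, 1, [20, 21, 22]))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(result)
```

The consumer thread is started first on purpose: it contains no polling loop of its own, and the blocking is handled entirely inside `read`, which plays the role that the text assigns to the NoC.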
Operation
The function of the memory-centric NoC is to manage N-to-1 and 1-to-M data transactions between the producer and consumer tasks, which it accomplishes through two operations: communication buffer management and memory transaction control. As shown in Fig. 4a, the communication buffer management operation schedules requests for the dual port memories in response to open channel commands from the producer PEs. The memory-centric NoC updates the routing LUTs in the NIM of each PE so that accesses by the producer and consumer PEs to the fixed address regions, which are determined by channel number and PE number, are directed to the same dual port memory. This memory is selected dynamically according to the utilisation status of the dual port memories. In this manner, physical memory addresses are assigned to the producer and consumer PEs transparently to the programs running on the PEs. The memory transaction control operation, which supports low-overhead data transactions between the producer and consumer PEs, is shown in Fig. 4b. The consumer PE has to wait for valid data from the producer PE, which conventionally requires executing polling loops in the consumer PE. The memory-centric NoC reduces this overhead by having the NIM execute the polling loops instead of the PE. The power otherwise wasted while waiting for valid data is saved, because the PE is idle and only the NIM is active. The two operations are described in detail in this subsection.
Communication buffer management:
Fig. 5 shows the overall procedure of the communication buffer management operation in the memory-centric NoC. To explain the procedure, we assume that PE 1 is the producer PE and PEs 3 and 4 are the consumer PEs. As described in the programming model subsection, the operation of the memory-centric NoC is initiated by PE 1 writing an open channel command to the nearest channel controller (Fig. 5a). In response to the open channel command, the channel controller reads the global status register of the dual port memories to check their utilisation status. After selecting an available dual port memory, the channel controller updates the routing LUTs in the NIMs of PEs 1, 3 and 4, so that accesses to the pre-defined address region are directed to the assigned dual port memory (Fig. 5b). The LUTs are updated by the channel controller sending special configuration (CFG) packets, which are visible only to the NIMs. At each PE, accesses to the pre-defined address regions are blocked until the corresponding LUTs have been updated. After the LUT updates, PE 1 writes shared data to the dual port memory assigned by the memory-centric NoC, and PEs 3 and 4 read the data from the same dual port memory (Fig. 5c). At the end of the shared data transaction, PE 1 sends a close channel command to the channel controller to invalidate the write LUT of its NIM, and PEs 3 and 4 send end channel commands to invalidate the read LUTs in their NIMs. Once the channel controller has received the close channel command and as many end channel commands as there are consumer PEs, the assigned dual port memory is released for use by other data transactions (Fig. 5d).
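The communication buffer management procedure can be sketched as a small sequential model. This is only an illustration under stated assumptions: the class `ChannelController`, the LUT key layout, and the FIFO queue used for requests that arrive when no dual port memory is free are hypothetical, since the text does not specify the scheduling policy. The example uses two memories instead of eight so that the time-multiplexed reuse is visible.

```python
from collections import deque

class ChannelController:
    """Toy model of dual port memory assignment and routing LUT updates (hypothetical)."""

    def __init__(self, num_memories=8):
        self.busy = [False] * num_memories  # stands in for the global status register
        self.lut = {}       # (pe, "W" or "R", producer, channel) -> memory index
        self.active = {}    # (producer, channel) -> assignment state
        self.pending = deque()  # ASSUMPTION: waiting open requests are served in FIFO order

    def open_channel(self, producer, ch, consumers):
        self.pending.append((producer, ch, tuple(consumers)))
        self._schedule()

    def _schedule(self):
        while self.pending and False in self.busy:
            producer, ch, consumers = self.pending.popleft()
            mem = self.busy.index(False)
            self.busy[mem] = True
            # Models the CFG packets: write LUT of the producer, read LUTs of the consumers.
            self.lut[(producer, "W", producer, ch)] = mem
            for pe in consumers:
                self.lut[(pe, "R", producer, ch)] = mem
            self.active[(producer, ch)] = {
                "mem": mem, "waiting": set(consumers), "closed": False}

    def close_channel(self, producer, ch):
        # Only valid for an assigned channel; a PE whose LUT is not yet set up is blocked.
        self.lut.pop((producer, "W", producer, ch))
        self.active[(producer, ch)]["closed"] = True
        self._release_if_done(producer, ch)

    def end_channel(self, consumer, producer, ch):
        self.lut.pop((consumer, "R", producer, ch))
        self.active[(producer, ch)]["waiting"].discard(consumer)
        self._release_if_done(producer, ch)

    def _release_if_done(self, producer, ch):
        state = self.active[(producer, ch)]
        if state["closed"] and not state["waiting"]:
            self.busy[state["mem"]] = False
            del self.active[(producer, ch)]
            self._schedule()   # a queued request may now take the freed memory

cc = ChannelController(num_memories=2)
cc.open_channel(1, 0, [3, 4])   # PE 1 -> PEs 3 and 4, as in the Fig. 5 walkthrough
cc.open_channel(2, 0, [5])
cc.open_channel(6, 0, [7])      # no memory free: request waits
waiting_before = (6, "W", 6, 0) not in cc.lut

cc.close_channel(1, 0)
cc.end_channel(3, 1, 0)
still_busy = cc.busy[0]         # one consumer has not ended yet
cc.end_channel(4, 1, 0)         # memory 0 released and handed to PE 6
print(cc.lut)
```

In this model the memory is released only after the close channel command and one end channel command per listed consumer, which mirrors the release condition stated for Fig. 5d.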

Memory transaction control:
To realise memory transaction control, the memory-centric NoC tracks every write access from the producer PE to the assigned dual port memory once the requested memory space has been assigned. To track these write accesses, valid-check logic is integrated into each of the dual port memories. The valid-check logic includes an array of 1-bit valid entries, one for each word in the dual port memory. All entries of the valid bit array are initialised to low when the processor is reset or when the assigned dual port memory is released. In response to a write access from the producer PE, the valid bit of the corresponding address is set to high. When a consumer PE accesses a memory address whose valid bit is low, the valid-check logic asserts an invalid signal to the NIM of the dual port memory. In response to the invalid signal, the NIM of the dual port memory sends an invalid packet to the NIM of the consumer PE so that it retries reading the corresponding address later. To support burst read/write operations, the valid-check logic is designed to check the validity of up to eight entries in a single cycle. Fig. 6 describes the overall procedure of the memory transaction control operation. We again assume that PE 1 is the producer PE and PEs 3 and 4 are the consumer PEs. In the example, PE 3 reads the shared data at address 0x0 and PE 4 reads the data at address 0x8, whereas PE 1 has written valid data only at address 0x0 of the assigned dual port memory (Fig. 6a). In this case, PE 3 reads the data at address 0x0, and the NIM of PE 4 receives an invalid packet instead of the data at address 0x8 (Fig. 6b). In response to the invalid packet, the NIM of PE 4 holds the operation of PE 4 and periodically retries reading address 0x8 until PE 1 has written valid data there (Fig. 6c). After the valid shared data have been read from the dual port memory, the operation of PE 4 continues (Fig. 6d).
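The valid-bit mechanism and the NIM-side retry can be illustrated with a short model of the Fig. 6 scenario. The class names `ValidCheckedMemory` and `ConsumerNIM`, the packet labels, and the data values are hypothetical; addresses are treated as plain word indices, and the periodic retry is triggered manually rather than by a timer.

```python
class ValidCheckedMemory:
    """Toy dual port memory with one valid bit per word (hypothetical model)."""

    def __init__(self, num_words):
        self.words = [0] * num_words
        self.valid = [False] * num_words   # all low after reset or release

    def write(self, addr, value):
        self.words[addr] = value
        self.valid[addr] = True            # producer write sets the valid bit

    def read(self, addr):
        # Returns a data packet, or an invalid packet if the word is not yet written.
        if self.valid[addr]:
            return ("DATA", self.words[addr])
        return ("INVALID", None)

    def release(self):
        self.valid = [False] * len(self.valid)

class ConsumerNIM:
    """Toy NIM that holds its PE and retries a read on behalf of the PE."""

    def __init__(self, memory):
        self.memory = memory
        self.pending_addr = None   # address being retried while the PE is held
        self.retries = 0

    def request(self, addr):
        kind, value = self.memory.read(addr)
        if kind == "INVALID":
            self.pending_addr = addr   # PE is held idle from here on
            return None
        return value

    def retry(self):
        self.retries += 1
        kind, value = self.memory.read(self.pending_addr)
        if kind == "DATA":
            self.pending_addr = None   # PE resumes with the data
            return value
        return None

mem = ValidCheckedMemory(16)
mem.write(0x0, 111)                  # PE 1 has written only address 0x0 (Fig. 6a)
nim3, nim4 = ConsumerNIM(mem), ConsumerNIM(mem)

v3 = nim3.request(0x0)               # PE 3 receives data immediately (Fig. 6b)
v4_first = nim4.request(0x8)         # PE 4 receives an invalid packet and is held
v4_retry = nim4.retry()              # still invalid (Fig. 6c)
mem.write(0x8, 222)                  # PE 1 writes address 0x8
v4_final = nim4.retry()              # retry succeeds, PE 4 continues (Fig. 6d)
print(v3, v4_first, v4_retry, v4_final, nim4.retries)
```

The point of the sketch is that the retry loop lives in `ConsumerNIM`, not in the code of the consumer task, which corresponds to the PE remaining idle while the NIM polls.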
The advantages of memory transaction control are reduced NoC traffic and reduced PE activity, both of which contribute to low-power operation. While the NIM polls for valid data in the assigned memory, the consumer PE is idle, which reduces the power consumption of the PE. On-chip traffic is also reduced because the NIM of the consumer PE uses compact packets when retrying: while the NIM is polling, the address field is not required in the read request and the data field is not required in the invalid packet. The advantages of the memory transaction control operation are discussed in detail in Section 4.
Performance evaluations
This section quantifies the advantages of the memory-centric NoC in realising power-efficient data transactions for various task-level pipelines. The first advantage is reduced power consumption, which results from the NIM executing the polling loops instead of the heavy, power-hungry PE. The second advantage is the reduced amount of on-chip data transactions, which comes from using compact packets for polling valid data and from eliminating synchronisation variables in the shared memory. To evaluate these advantages, we mapped three types of task-level pipeline onto the multi-processor architecture described in Fig. 2. The three mappings are shown in Fig. 7. Mapping A consists only of N-to-1 data transactions and Mapping B consists only of 1-to-M data transactions. As a more practical example, the task mapping of SIFT object recognition [9] is chosen as Mapping C; the detailed computations of each task are described in our previous publications [10-12]. For Mappings A and B, each task is configured to read data from the external memory or the previous stage, and to write the fetched data to the external memory or the following stage. In this case, each PE accesses the dual port memories every ten cycles on average, because of the execution time required for address calculation and branch instructions. Fig. 8 compares the power consumption and the percentage of active cycles in each PE with and without memory transaction control. The average power consumption of Mapping C is higher than that of Mappings A and B, since SIFT object recognition requires complex computations, whereas Mappings A and B execute only simple load/store instructions and waiting loops. The first advantage of the memory-centric NoC for efficient unidirectional data transactions is apparent in Fig. 8.
Instead of the consumer PEs, the NIMs of the memory-centric NoC execute the polling loops required for waiting for valid data from the producer PEs, and this reduces the active cycles required of the consumer PEs. Fig. 9 compares the execution times of SIFT object recognition (Mapping C) with and without the memory-centric NoC for varying data synchronisation unit sizes. For the SIFT computation with the memory-centric NoC, the data synchronisation unit is fixed at 4 bytes, which corresponds to the word size of the dual port memory. As shown in Fig. 9, no data synchronisation unit size outperforms the SIFT computation with the memory-centric NoC. For small data synchronisation units, the frequent accesses to synchronisation variables degrade performance, whereas for large data synchronisation units, unnecessarily long waiting times increase the overall execution time. Fig. 10 shows the contribution of the memory-centric NoC to the reduction of on-chip data transactions. The reduction is more noticeable in Mappings A and B, because the synchronisation variable is updated at every write of shared data. In Mapping C, the synchronisation variable is updated after every 48 bytes of shared data written. The average reduction of on-chip data transactions is 35.9%, 38.6% and 12.1% for Mappings A, B and C, respectively.
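The difference between the mappings can be made concrete with a back-of-the-envelope count of synchronisation variable updates in the software baseline. The sketch below is a simplified model, not the measurement method of this work: it counts only producer-side updates, ignores consumer polling traffic, assumes that each shared data write in Mappings A and B is one 4-byte word, and uses a hypothetical transfer size of 4800 bytes.

```python
import math

WORD_BYTES = 4  # word size of the dual port memory, as stated in the text

def sync_updates(total_bytes, sync_unit_bytes):
    """Number of synchronisation variable updates needed to transfer total_bytes."""
    return math.ceil(total_bytes / sync_unit_bytes)

def sync_updates_per_data_word(sync_unit_bytes):
    """Synchronisation variable updates per shared data word written."""
    return WORD_BYTES / sync_unit_bytes

total = 4800  # hypothetical amount of shared data in bytes

per_write = sync_updates(total, 4)    # Mappings A and B: update at every write
per_48b = sync_updates(total, 48)     # Mapping C: update every 48 bytes

print(per_write, per_48b)
print(sync_updates_per_data_word(4), sync_updates_per_data_word(48))
```

Under these assumptions, a per-write synchronisation unit adds one synchronisation update for every data word, whereas a 48-byte unit adds one for every twelve words. This is consistent in direction with the larger reductions reported for Mappings A and B, although the model does not attempt to reproduce the reported percentages.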
Implementation results
The proposed multi-processor SoC based on the memory-centric NoC is implemented in a 0.18 µm standard CMOS process technology. The size of the implemented chip is 7.7 × 5 mm², and the operating frequencies are set to 400 MHz for the memory-centric NoC and 200 MHz for the other parts of the chip. The higher operating frequency of the memory-centric NoC compensates for the latency overhead of the packet-switching network. Fig. 11 summarises the implementation of the chip. The peak power consumption of the chip
