Abstract
Introduction
As advanced CMOS technology scaling enables ever higher transistor density, a new data-processing paradigm is required to hide the large wire delay. Dynamically scheduled clustered architectures, which process data locally, can fulfill this requirement [1], [13], [14]. In a clustered architecture, global structures are partitioned into smaller, simpler structures, each of which is placed in a processing element (PE), called a cluster in some papers. This partitioning simplifies the hardware and speeds up its control and data paths, because the partitioned structures need fewer entries and fewer ports.
The performance of a clustered architecture depends on how many instructions execute in parallel and on how much inter-PE communication is needed to synchronize dependent instructions. If too many instructions are steered to a particular PE, communication among PEs seldom occurs. However, the PEs are then deprived of the opportunity to work in parallel, and the instructions in the overloaded PE cause resource conflicts, which degrade performance. This is referred to as workload imbalance. Conversely, if instructions are steered to PEs evenly, the opportunity for parallel processing increases. However, the amount of inter-PE communication also increases, which likewise degrades performance. Hence, a clustered architecture must be designed to balance both the workload and the communication among PEs.
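The trade-off described above can be illustrated with a small toy model. The sketch below is not the steering scheme or the simulator used in this paper; the function `steering_cost`, the dependence list, and the three assignments are hypothetical names and data introduced only for illustration. It counts the producer-consumer pairs that cross PEs (a proxy for inter-PE communication) and the largest number of instructions steered to one PE (a proxy for workload imbalance).

```python
from collections import Counter

def steering_cost(deps, assignment):
    """Return (inter-PE communications, max instructions on one PE).

    deps       -- list of (producer, consumer) instruction ids (hypothetical input)
    assignment -- dict mapping instruction id -> PE id (hypothetical input)
    """
    # A dependence needs inter-PE communication when its two ends sit on different PEs.
    comms = sum(1 for p, c in deps if assignment[p] != assignment[c])
    # The most heavily loaded PE is a simple proxy for workload imbalance.
    load = Counter(assignment.values())
    return comms, max(load.values())

# Two independent dependence chains of three instructions each: 0->1->2 and 3->4->5.
deps = [(0, 1), (1, 2), (3, 4), (4, 5)]

all_on_one_pe = {i: 0 for i in range(6)}               # no communication, one overloaded PE
round_robin = {i: i % 2 for i in range(6)}             # even load, every dependence crosses PEs
by_chain = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}        # even load, no communication

for name, a in [("all on one PE", all_on_one_pe),
                ("round robin", round_robin),
                ("by chain", by_chain)]:
    print(name, steering_cost(deps, a))
```

In this toy case, steering by dependence chain achieves both goals at once; real instruction streams rarely split so cleanly, which is why the two goals have to be traded off against each other.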
Many proposed instruction steering schemes have tried to balance the workload and communication across PEs [1], [14], [16]. However, the performance gain achievable with steering schemes alone is limited. To overcome this limitation, some hardware components must be redesigned together with the instruction steering schemes. The key components to reconsider are the communication structures between PEs, which determine the communication delay.
In this paper, we make every pair of neighboring PEs in the clustered architecture cooperate with each other. To achieve effective cooperation, we add direct communication structures between neighboring PEs and propose novel instruction steering schemes suited to these structures. The additional communication structures reduce the latency of communication between neighboring PEs. Load imbalance can also be avoided, since instructions can be steered more flexibly without incurring extra inter-PE communication delay.
The rest of this paper is organized as follows. Section 2 gives a brief overview of a baseline clustered architecture and a baseline instruction steering scheme, then discusses the limitations of the existing steering schemes and considers making neighboring PEs cooperate with each other. Section 3 describes the experimental framework, the evaluation methodology, and the results. Section 4 discusses related work. Section 5 concludes the paper.
