High performance and energy efficiency are critical requirements for Internet
of Things (IoT) end-nodes. Exploiting tightly-coupled clusters of programmable
processors (CMPs) has recently emerged as a suitable solution to address this
challenge. One of the main bottlenecks limiting the performance and energy
efficiency of these systems is the instruction cache architecture due to its
criticality in terms of timing (i.e., maximum operating frequency), bandwidth,
and power. We propose a hierarchical instruction cache tailored to
ultra-low-power tightly-coupled processor clusters where a relatively large
cache (L1.5) is shared by L1 private caches through a two-cycle latency
interconnect. To address the performance loss caused by L1 capacity misses,
we introduce a next-line prefetcher with cache probe filtering (CPF) from L1 to
L1.5. We optimize the core instruction fetch (IF) stage by removing the
critical core-to-L1 combinational path. We present a detailed comparison of
instruction cache architectures' performance and energy efficiency for parallel
ultra-low-power (ULP) clusters. Focusing on the implementation, our two-level
instruction cache provides better scalability than existing shared caches,
delivering up to 20\% higher operating frequency. On average, the proposed
two-level cache improves maximum performance by up to 17\% compared to the
state-of-the-art while delivering similar energy efficiency for most relevant
applications.
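
The next-line prefetcher with cache probe filtering mentioned above can be illustrated with a minimal behavioral sketch. It is not the authors' hardware design: the fully associative LRU model, the class name `L1ICache`, the function `simulate`, and the parameters `num_lines` and `line_bytes` are all assumptions made for illustration, and L1.5 latency and timing are ignored. The sketch only shows the filtering idea: before a next-line prefetch is sent towards L1.5, the private L1 is probed, and the request is dropped if the line is already present.

```python
from collections import OrderedDict


class L1ICache:
    """Hypothetical fully associative LRU cache holding line indices only."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()  # oldest entry first, most recent last

    def probe(self, line):
        """Check for presence without changing the replacement state."""
        return line in self.lines

    def touch(self, line):
        """Mark a resident line as most recently used."""
        self.lines.move_to_end(line)

    def fill(self, line):
        """Install a line, evicting the least recently used one if full."""
        if line in self.lines:
            self.lines.move_to_end(line)
            return
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)
        self.lines[line] = True


def simulate(trace, num_lines=4, line_bytes=16):
    """Replay a fetch-address trace and count demand and prefetch events."""
    l1 = L1ICache(num_lines)
    stats = {"hits": 0, "misses": 0,
             "prefetches_issued": 0, "prefetches_filtered": 0}
    for addr in trace:
        line = addr // line_bytes
        if l1.probe(line):
            stats["hits"] += 1
            l1.touch(line)
        else:
            stats["misses"] += 1
            l1.fill(line)  # demand refill (would come from L1.5)
        # Next-line prefetch candidate, filtered by probing the L1 first.
        next_line = line + 1
        if l1.probe(next_line):
            stats["prefetches_filtered"] += 1  # already cached: no request
        else:
            stats["prefetches_issued"] += 1    # request would go to L1.5
            l1.fill(next_line)
    return stats


if __name__ == "__main__":
    # Sequential 4-byte instruction fetches covering four 16-byte lines.
    print(simulate(range(0, 64, 4)))
```

On this purely sequential trace, only the first fetch misses, and most prefetch candidates are filtered because the next line was already brought in by an earlier prefetch. In this simplified model, that saved traffic is what the probe is meant to capture.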