Optimizing data movement is becoming one of the biggest challenges in
heterogeneous computing as systems cope with the data deluge and, consequently,
big data applications. When creating specialized accelerators, modern high-level
synthesis (HLS) tools are increasingly efficient in optimizing the
computational aspects, but data transfers have not been adequately improved. To
combat this, novel architectures such as High-Bandwidth Memory with wider data
buses have been developed so that more data can be transferred in parallel.
Designers must tailor their hardware/software interfaces to fully exploit the
available bandwidth. HLS tools can automate this process, but the designer must
follow strict coding-style rules. If the bus width is not evenly divisible by
the data width (e.g., when using custom-precision data types) or if the arrays
are not power-of-two length, the HLS-generated accelerator will likely not
fully utilize the available bandwidth, demanding even more manual effort from
the designer. We propose a methodology to automatically find and implement a
data layout that, when streamed between memory and an accelerator, uses a
higher percentage of the available bandwidth than a naive or HLS-optimized
design. We borrow concepts from multiprocessor scheduling to achieve such high
efficiency.

Comment: Accepted for presentation at ASPDAC'2
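
To make the bus-width mismatch concrete, the following minimal sketch estimates what fraction of the transferred bits carries payload under three layouts. It is an illustration under stated assumptions, not the methodology proposed here: the function name `bus_efficiency` and the layout labels `"padded"`, `"aligned"`, and `"packed"` are hypothetical, and each element is assumed to be no wider than the bus.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def bus_efficiency(bus_bits: int, elem_bits: int, n_elems: int, layout: str) -> float:
    """Fraction of transferred bits that are payload (hypothetical helper).

    layout:
      "padded"  - each element is padded to the next power-of-two width
      "aligned" - elements keep their width but never straddle a bus word
      "packed"  - elements are laid out back to back across bus words
    """
    if elem_bits > bus_bits:
        raise ValueError("sketch assumes an element fits in one bus word")

    if layout == "padded":
        slot = 1
        while slot < elem_bits:  # round the element width up to a power of two
            slot *= 2
        words = ceil_div(n_elems, bus_bits // slot)
    elif layout == "aligned":
        words = ceil_div(n_elems, bus_bits // elem_bits)
    elif layout == "packed":
        words = ceil_div(n_elems * elem_bits, bus_bits)
    else:
        raise ValueError(f"unknown layout: {layout}")

    return (n_elems * elem_bits) / (words * bus_bits)


if __name__ == "__main__":
    # Example: 1000 elements of 24 bits streamed over a 256-bit bus.
    for layout in ("padded", "aligned", "packed"):
        print(layout, bus_efficiency(256, 24, 1000, layout))
```

With a 256-bit bus and 24-bit elements, padding to 32 bits uses 75% of each word and the non-straddling layout uses 93.75%, while a dense layout wastes only the tail of the final word. The dense layout, however, lets elements cross word boundaries, which is the kind of interface complexity that otherwise falls on the designer.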