In many parallel applications, network latency causes a dramatic
loss in processor utilization. This paper examines software
pipelining as a technique for network latency hiding. It
quantifies the potential improvements with
detailed, instruction-level simulations.
The benchmarks are the Livermore Loop kernels and the BLAS Level 1
routines.
These were parallelized and run on the instruction-level RISC
simulator DLX, extended with both a blocking and a pipelined
network. Our results show that prefetching in a pipelined network
improves performance by a factor of 2 to 9, provided the network
has sufficient bandwidth to accept at least 10 requests per
processor.
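
To make the contrast between the blocking and the pipelined network concrete, the following is a minimal sketch of a toy cycle-count model. It is not the DLX simulator used in this paper. The function names, the cost model (zero cycles to issue a request, fixed latency, fixed compute per iteration), and all numeric parameters are assumptions chosen for illustration only, and the numbers it produces are not results from the paper.

```python
def blocking_cycles(n_iters, latency, compute):
    """Blocking network: every iteration stalls for the full round trip
    before it can compute (hypothetical cost model)."""
    return n_iters * (latency + compute)


def pipelined_cycles(n_iters, latency, compute, depth):
    """Software-pipelined prefetch with at most `depth` outstanding requests.

    Assumptions (illustrative only): issuing a request costs no cycles,
    every request has the same fixed latency, and the network always
    accepts a request as soon as a slot is free.
    """
    arrival = [0] * n_iters

    # Prologue: fill the pipeline with the first `depth` requests at cycle 0.
    for i in range(min(depth, n_iters)):
        arrival[i] = latency

    t = 0
    for i in range(n_iters):
        # Stall only if the operand for iteration i has not arrived yet.
        t = max(t, arrival[i])
        # Consuming a reply frees a slot: prefetch for iteration i + depth.
        nxt = i + depth
        if nxt < n_iters:
            arrival[nxt] = t + latency
        # Compute on the operand that just arrived.
        t += compute
    return t


if __name__ == "__main__":
    # Illustrative parameters, not taken from the paper.
    n, lat, comp = 100, 100, 10
    base = blocking_cycles(n, lat, comp)
    for depth in (1, 2, 5, 10):
        cycles = pipelined_cycles(n, lat, comp, depth)
        print(f"depth={depth:2d}  cycles={cycles:6d}  speedup={base / cycles:.2f}")
```

In this model the steady-state cost per iteration is the larger of the compute time and the latency divided by the number of outstanding requests, so allowing more requests in flight helps only until the compute time becomes the bottleneck.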