In this master's thesis, we design and implement MultiStream, a solution that extends the existing data-parallel skeleton library SkePU with NVIDIA CUDA streams to overlap data transfers between main memory and device memory with CUDA kernel executions.
We show the benefits of this approach using a task-parallel framework, FastFlow, on top of SkePU.
Finally, we compare the MultiStream-extended SkePU to an ad hoc solution to discuss the trade-offs between the level of abstraction and the maximum achievable performance.