3 research outputs found

    Best of both latency and throughput


    Impact of Heterogeneity on DSM Performance

    This paper explores area/parallelism tradeoffs in the design of distributed shared-memory (DSM) multiprocessors built out of large single-chip computing nodes. In this context, area-efficiency arguments motivate a heterogeneous organization consisting of few nodes with large caches designed for single-thread parallelism, and a larger number of nodes with smaller caches designed for multi-thread parallelism. This paper quantitatively studies the performance of such an organization for a set of homogeneous multiprocessor programs from the SPLASH-2 benchmark suite. These programs are mapped onto the heterogeneous processors without source code modifications via static thread assignment policies. A constant-area simulation analysis shows that a 4-node heterogeneous DSM with 21 processors outperforms its homogeneous counterpart with 4 processors by an average of 36% for the studied multiprocessor workload, while having the same performance for sequential codes. Also studied are the implications of the degree of heterogeneity in the functional units of such heterogeneous DSMs on overall system cost and performance. This paper presents a sensitivity analysis based on a factorial design experiment that determines the relative impact of heterogeneity on performance. The studied benchmarks are affected, on average, primarily by heterogeneity in processor performance (59.9%), followed by cache sizes (18.2%), memory latency (14.6%), and network latency (5.6%).

    Impact of Heterogeneity on DSM Performance

    This paper explores area/parallelism tradeoffs in the design of distributed shared-memory (DSM) multiprocessors built out of large single-chip computing nodes. In this context, area-efficiency arguments motivate a heterogeneous organization consisting of few nodes with large caches designed for single-thread parallelism, and a larger number of nodes with smaller caches designed for multi-thread parallelism. Quantitative performance of such an organization is reported for a set of homogeneous multiprocessor programs from the SPLASH-2 benchmark suite. These programs are mapped onto the heterogeneous processors without source code modifications via static thread assignment policies. Simulation-based analysis is used to compare the performance of heterogeneous and homogeneous DSMs that occupy the same silicon area. The analysis shows that a 4-node heterogeneous DSM with 21 processors outperforms its homogeneous counterpart with 4 processors by an average of 36% for the studied multiprocessor workload, while having the same performance for sequential codes. A sensitivity analysis based on a factorial design experiment is used to study the implications of processor, memory, and network heterogeneity on the overall cost and performance of a heterogeneous DSM. The studied benchmarks are affected, on average, primarily by heterogeneity in processor performance (59.3%), followed by cache sizes (18.2%), memory latency (14.6%), and network latency (5.6%).