11,664 research outputs found
Performance evaluation of multi-core multi-cluster architecture
A multi-core cluster is a cluster composed of numbers of nodes where each node has a number of processors, each with more than one core within each single chip. Cluster nodes are connected via an interconnection network. Multi-cored processors are able to achieve higher performance without driving up power consumption and heat, which is the main concern in a single-core processor. A general problem in the network arises from the fact that multiple messages can be in transit at the same time on the same network links. This paper considers the communication latencies of a multi-core multi-cluster architecture will be investigated using simulation experiments and measurements under various working conditions
GRAPE-6: The massively-parallel special-purpose computer for astrophysical particle simulation
In this paper, we describe the architecture and performance of the GRAPE-6
system, a massively-parallel special-purpose computer for astrophysical
-body simulations. GRAPE-6 is the successor of GRAPE-4, which was completed
in 1995 and achieved the theoretical peak speed of 1.08 Tflops. As was the case
with GRAPE-4, the primary application of GRAPE-6 is simulation of collisional
systems, though it can be used for collisionless systems. The main differences
between GRAPE-4 and GRAPE-6 are (a) The processor chip of GRAPE-6 integrates 6
force-calculation pipelines, compared to one pipeline of GRAPE-4 (which needed
3 clock cycles to calculate one interaction), (b) the clock speed is increased
from 32 to 90 MHz, and (c) the total number of processor chips is increased
from 1728 to 2048. These improvements resulted in the peak speed of 64 Tflops.
We also discuss the design of the successor of GRAPE-6.Comment: Accepted for publication in PASJ, scheduled to appear in Vol. 55, No.
Cost-Effective Clustering
Small Beowulf clusters can effectively serve as personal or group
supercomputers. In such an environment, a cluster can be optimally designed for
a specific problem (or a small set of codes). We discuss how theoretical
analysis of the code and benchmarking on similar hardware lead to optimal
systems.Comment: 7 pages, 2 figures (one in color). Color version of paper to be
published as part of proceedings of CCP2000 (Brisbane) in a special isssue of
Computer Physics Communication
Efficient resources assignment schemes for clustered multithreaded processors
New feature sizes provide larger number of transistors per chip that architects could use in order to further exploit instruction level parallelism. However, these technologies bring also new challenges that complicate conventional monolithic processor designs. On the one hand, exploiting instruction level parallelism is leading us to diminishing returns and therefore exploiting other sources of parallelism like thread level parallelism is needed in order to keep raising performance with a reasonable hardware complexity. On the other hand, clustering architectures have been widely studied in order to reduce the inherent complexity of current monolithic processors. This paper studies the synergies and trade-offs between two concepts, clustering and simultaneous multithreading (SMT), in order to understand the reasons why conventional SMT resource assignment schemes are not so effective in clustered processors. These trade-offs are used to propose a novel resource assignment scheme that gets and average speed up of 17.6% versus Icount improving fairness in 24%.Peer ReviewedPostprint (published version
Dynamic Energy Management for Chip Multi-processors under Performance Constraints
We introduce a novel algorithm for dynamic energy management (DEM) under performance constraints in chip multi-processors (CMPs). Using the novel concept of delayed instructions count, performance loss estimations are calculated at the end of each control period for each core. In addition, a Kalman filtering based approach is employed to predict workload in the next control period for which voltage-frequency pairs must be selected. This selection is done with a novel dynamic voltage and frequency scaling (DVFS) algorithm whose objective is to reduce energy consumption but without degrading performance beyond the user set threshold. Using our customized Sniper based CMP system simulation framework, we demonstrate the effectiveness of the proposed algorithm for a variety of benchmarks for 16 core and 64 core network-on-chip based CMP architectures. Simulation results show consistent energy savings across the board. We present our work as an investigation of the tradeoff between the achievable energy reduction via DVFS when predictions are done using the effective Kalman filter for different performance penalty thresholds
- …