Search CORE

282 research outputs found

QCD on α-Clusters

Author: C. Best
Fischer
Frommer
K. Schilling
N. Eicker
Th. Lippert
Publication venue: Elsevier BV
Publication date: 01/01/1999

It is shown that the 21264 Alpha processor can reach about 20% sustained efficiency for the inversion of the Wilson-Dirac operator. Since Fast Ethernet is not sufficient to balance computation and communication on reasonable lattice and system sizes, an interconnection using Myrinet is discussed. We find a price/performance ratio comparable with state-of-the-art SIMD systems for lattice QCD. Comment: LATTICE99 (machines), 3 pages

arXiv.org e-Print Archive

CiteSeerX

Crossref

Juelich Shared Electronic Resources

CERN Document Server

HP-DAEMON: High Performance Distributed Adaptive Energy-efficient Matrix-multiplicatiON

Author: Chen Longxiang
Chen Zizhong
Ge Rong
Li Dong
Tan Li
Zong Ziliang
Publication venue: e-Publications@Marquette
Publication date: 01/01/2014

The demand for better energy efficiency in high performance scientific applications has become pressing. Software-controlled hardware solutions directed by Dynamic Voltage and Frequency Scaling (DVFS) have proven effective. Although DVFS is beneficial to green computing, it can incur non-negligible overhead when a large number of frequency switches are issued. In this paper, we propose a strategy to achieve the optimal energy savings for distributed matrix multiplication by algorithmically trading more computation and communication at a time, adaptively and within user-specified memory costs, for fewer DVFS switches, which saves 7.5% more energy on average than a classic strategy. Moreover, we leverage a high performance communication scheme that fully exploits network bandwidth via pipeline broadcast. Overall, the integrated approach achieves substantial energy savings (up to 51.4%) and performance gain (28.6% on average) compared to ScaLAPACK pdgemm() on a cluster with an Ethernet switch, and outperforms ScaLAPACK and DPLASMA pdgemm() by 33.3% and 32.7% on average, respectively, on a cluster with an InfiniBand switch.

epublications@Marquette

A Message Scheduling Scheme for All-to-All Personalized Communication on Ethernet Switched Clusters

Author: Ahmad Faraj
Pitch Patarasuk
Xin Yuan
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)

Crossref

Bandwidth optimal all-reduce algorithms for clusters of workstations

Author: Bar-Noy
Bruck
Faraj
Gropp
Iannello
Karwande
Knodel
Lane
Patarasuk
Pitch Patarasuk
Rabenseifner
Thakur
van de Geijn
Xin Yuan
Yuan
Publication venue: Elsevier BV

Crossref

Evaluating the performance of the allreduce collective operation on clusters. Approach and results

Author: Anshus Otto J.
Bjørndalen John Markus
Bongo Lars Ailo
Publication venue: University of Tromsø
Publication date: 01/01/2004

The performance of the collective operations provided by a communication library is important for many applications run on clusters. The communication structure of collective operations can be organized as a tree. Performance can be improved by configuring and mapping the tree to the clusters in use. We describe and demonstrate an approach for evaluating the performance of different configurations and mappings of allreduce run on clusters of different sizes, consisting of single-CPU hosts and SMPs with different numbers of CPUs. A breakdown of the cost of allreduce using the best configuration on different clusters is provided. In all cases, the broadcast part is more expensive than the reduce part. Inter-host communication contributes more to the time per allreduce than the synchronization in the allreduce components. For the small message sizes used (4 and 256 bytes), the time spent computing the partial reductions is insignificant. Reconfiguring hierarchy-aware trees improved performance by up to a factor of 1.49, by avoiding scalability problems of the components on SMPs and by finding the right balance between available concurrency, load on 'root' hosts, and the number of network links in a tree. Extending a tree by adding more threads, or by combining two trees, does not have a negative influence on the performance of a configuration, but increasing message size does.

Munin - Open Research Archive

A survey of techniques and technologies for web-based real-time interactive rendering

Author: Pacheco Filipe
Tovar Eduardo
Publication venue: IPP-Hurray Group
Publication date: 01/01/2001

When exploring a virtual environment, realism depends mainly on two factors: realistic images and real-time feedback (motions, behaviour, etc.). In this context, the photorealism and physical validity of computer-generated images required by emerging applications, such as advanced e-commerce, still pose major challenges in rendering research, while the complexity of lighting phenomena further requires powerful and predictable computing if time constraints are to be met. In this technical report we address the state of the art in rendering, focusing on approaches, techniques and technologies that might enable real-time interactive web-based client-server rendering systems. The focus is on the end-systems and not the networking technologies used to interconnect client(s) and server(s). Siemens; Bertelsmann mediaSystems GmbH; Eptron Multimedia; Instituto Politécnico do Porto - ISEP-IPP; Institute Laboratory for Mixed Realities at the Academy of Media Arts Cologne, LMR; Mälardalen Real-Time Research Centre (MRTC) at Mälardalen University in Västerås; Q-Systems

Repositório Científico do Instituto Politécnico do Porto