13 research outputs found

    Divergent initialization experiments using a global primitive equation spectral model

    An initialization of a spectral formulation of the primitive equations using a diagnostic divergence is tested for a global model. The initial conditions are generated from a developing baroclinically unstable wave. A semi-implicit time scheme is developed and tested along with the usual explicit method during the course of the experiments. Results show a relatively small effect of a divergent initialization on the ensuing integrations. The semi-implicit method shows a tendency to smooth out high-frequency oscillations in local tendencies.
    http://archive.org/details/divergentinitial00lube
    Lieutenant, United States Navy
    Approved for public release; distribution is unlimited

    Performance and Scalability Analysis of Teraflop-Scale Parallel Architectures using Multidimensional Wavefront Applications

    The authors develop a model for the parallel performance of algorithms that consist of concurrent, two-dimensional wavefronts implemented in a message-passing environment. The model, based on a LogGP machine parameterization, combines the separate contributions of computation and communication wavefronts. The authors validate the model on three important supercomputer systems, on up to 500 processors. They use data from a deterministic particle transport application taken from the ASCI workload, although the model is general to any wavefront algorithm implemented on a 2-D processor domain. They also use the validated model to make estimates of performance and scalability of wavefront algorithms on 100 TFLOPS computer systems expected to be in existence within the next decade as part of the ASC
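As a rough illustration of the kind of model this abstract describes, the sketch below combines the standard LogGP cost of a single message with the pipeline length of a wavefront sweeping a 2-D processor grid. The function names, the assumption of two neighbor messages per step, and the absence of any compute/communication overlap are simplifications made here for illustration; they are not the authors' equations, and the numbers in the usage lines are hypothetical.

```python
def loggp_message_time(L, o, G, k):
    """End-to-end time for one k-byte message under LogGP.

    L: network latency, o: per-message send/receive overhead,
    G: gap per byte for long messages, k: message size in bytes.
    """
    if k < 1:
        raise ValueError("message size must be at least 1 byte")
    # Sender overhead, (k - 1) per-byte gaps, wire latency, receiver overhead.
    return o + (k - 1) * G + L + o


def wavefront_steps(px, py, n_blocks):
    """Pipeline steps for n_blocks work units sweeping a px-by-py grid.

    The far-corner processor waits (px - 1) + (py - 1) steps for the
    pipeline to fill, then needs n_blocks further steps for its own work.
    """
    if px < 1 or py < 1 or n_blocks < 0:
        raise ValueError("grid dimensions must be >= 1 and n_blocks >= 0")
    return (px - 1) + (py - 1) + n_blocks


def wavefront_time(px, py, n_blocks, t_block, t_msg):
    """Assumed cost model: each step pays one block of computation plus
    two neighbor messages, with no overlap between the two."""
    return wavefront_steps(px, py, n_blocks) * (t_block + 2 * t_msg)


# Hypothetical parameters, all in the same (arbitrary) time unit.
t_msg = loggp_message_time(L=5.0, o=2.0, G=0.01, k=101)
total = wavefront_time(px=4, py=4, n_blocks=10, t_block=50.0, t_msg=t_msg)
```

The pipeline-fill term `(px - 1) + (py - 1)` is what makes scalability depend on the processor grid shape: it grows with the grid while the per-processor work `n_blocks` does not.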

    MonteSim

    Implementation and performance modeling of deterministic particle transport (Sweep3D) on the IBM Cell/B.E.

    The IBM Cell Broadband Engine (BE) is a novel multi-core chip with the potential for the demanding floating point performance that is required for high-fidelity scientific simulations. However, data movement within the chip can be a major challenge to realizing the benefits of the peak floating point rates. In this paper, we present the results of implementing Sweep3D on the Cell/B.E. using an intra-chip message passing model that minimizes data movement. We compare the advantages/disadvantages of this programming model with a previous implementation using a master–worker threading strategy. We apply a previously validated micro-architecture performance model for the application executing on the Cell/B.E. (based on our previous work in Monte Carlo performance models), that predicts overall CPI (cycles per instruction), and gives a detailed breakdown of processor stalls. Finally, we use the micro-architecture model to assess the performance of future design parameters for the Cell/B.E. micro-architecture. The methodologies and results have broader implications that extend to multi-core architectures

    Performance Evaluation of the SGI Origin2000: A Memory-Centric Characterization of LANL ASCI Applications

    In this paper we compare single-processor performance of the SGI Origin and PowerChallenge and utilize a previously reported performance model for hierarchical memory systems to explain the results. Both the Origin and PowerChallenge use the same microprocessor (MIPS R10000) but have significant differences in their memory subsystems. Our memory model includes the effect of overlap between CPU and memory operations and allows us to infer the individual contributions of all three improvements in the Origin's memory architecture and relate the effectiveness of each improvement to application characteristics.
    1 Introduction
    The biggest challenge in the design and use of high-performance computer systems involves managing the disparity between central processing unit (CPU) speed and memory subsystem speed. The need to address this issue is likely to become more acute in the future, because processor speed may double every eighteen months but DRAM memory access speed is expected to inc..

    Development and validation of a hierarchical memory model incorporating CPU- and memory-operation overlap model

    By acceptance of this article, the publisher recognizes that the U.S. Government retains a nonexclusive royalty-free license to publish or reproduce the published form of this contribution or to allow others to do so, for U.S. Government purposes. The Los Alamos National Laboratory requests that the publisher identify this article as work performed under the auspices of the U.S. Department of Energy.
    DISCLAIMER: This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.
    ABSTRACT: Distributed shared memory architectures (DSMs) such as the Origin 2000 are being implemented which extend the concept of single-processor cache hierarchies across an entire physically distributed multi-processor machine. The scalability of a DSM machine is inherently tied to memory hierarchy performance, including such issues as latency hiding techniques in the architecture, global cache-coherence protocols, memory consistency models and, of course, the inherent locality of reference in algorithms of interest. In this paper, we characterize application performance with a "memory-centric" view. Using a simple mean value analysis (MVA) strategy and empirical performance data, we infer the contribution of each level in the memory system to the application's overall cycles per instruction (cpi). We account for the overlap of processor execution with memory accesses, a key parameter which is not directly measurable on the Origin systems. We infer the separate contributions of three major architecture features in the memory subsystem of the Origin 2000: cache size, outstanding loads-under-miss, and memory latency.
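The memory-centric accounting described in this abstract can be sketched as a small calculation: cpi is split into an ideal-memory term plus a stall term per memory level, and an overlap factor is backed out from a measured cpi because it cannot be observed directly. The linear form, the function names, and every numeric value below are assumptions chosen for illustration; they are not taken from the paper and are not measured Origin 2000 data.

```python
def cpi_breakdown(cpi0, refs_per_instr, latency_cycles, overlap):
    """Return (total cpi, list of per-level stall contributions).

    cpi0:           cycles per instruction with an ideal memory system
    refs_per_instr: references served by each level beyond L1, per instruction
    latency_cycles: access cost of each of those levels, in cycles
    overlap:        fraction of memory time hidden behind execution (0 to 1)
    """
    if len(refs_per_instr) != len(latency_cycles):
        raise ValueError("one latency is required per memory level")
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must lie between 0 and 1")
    stalls = [(1.0 - overlap) * h * t
              for h, t in zip(refs_per_instr, latency_cycles)]
    return cpi0 + sum(stalls), stalls


def infer_overlap(cpi_measured, cpi0, refs_per_instr, latency_cycles):
    """Back out the overlap fraction from a measured cpi under the same model."""
    raw_stall = sum(h * t for h, t in zip(refs_per_instr, latency_cycles))
    if raw_stall <= 0:
        raise ValueError("no memory stall time to attribute")
    return 1.0 - (cpi_measured - cpi0) / raw_stall


# Hypothetical two-level example: L2 cache and main memory.
refs = [0.02, 0.005]       # references per instruction reaching each level
lat = [10.0, 100.0]        # cycles per access at each level
total, per_level = cpi_breakdown(1.0, refs, lat, overlap=0.5)
recovered = infer_overlap(total, 1.0, refs, lat)
```

Running the inference on the model's own output recovers the overlap that was put in, which is a quick consistency check before applying the same arithmetic to empirical counter data.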