86 research outputs found

4.45 Pflops Astrophysical N-Body Simulation on K computer -- The Gravitational Trillion-Body Problem

Authors: Ishiyama Tomoaki; Makino Junichiro; Nitadori Keigo
Publication date: 13/04/2015

As an entry for the 2012 Gordon-Bell performance prize, we report performance results of astrophysical N-body simulations of one trillion particles performed on the full system of K computer. This is the first gravitational trillion-body simulation in the world. We describe the scientific motivation, the numerical algorithm, the parallelization strategy, and the performance analysis. Unlike many previous Gordon-Bell prize winners that used the tree algorithm for astrophysical N-body simulations, we used the hybrid TreePM method, which offers a similar level of accuracy: the short-range force is calculated by the tree algorithm, and the long-range force is solved by the particle-mesh algorithm. We developed a highly-tuned gravity kernel for short-range forces, and a novel communication algorithm for long-range forces. The average performance on 24576 and 82944 nodes of K computer is 1.53 and 4.45 Pflops, respectively, which corresponds to 49% and 42% of the peak speed.

Comment: 10 pages, 6 figures, Proceedings of Supercomputing 2012 (http://sc12.supercomputing.org/), Gordon Bell Prize Winner. Additional information is available at http://www.ccs.tsukuba.ac.jp/CCS/eng/gbp201
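The efficiency figures quoted in the abstract above can be cross-checked with a little arithmetic. The sketch below is not from the paper. It uses only the node counts, sustained speeds, and peak fractions stated in the abstract, and derives the per-node peak speed that each run implies. The function name and the dictionary layout are illustrative assumptions. If the reported numbers are consistent, both runs should imply roughly the same per-node peak.

```python
# Figures quoted in the abstract; everything else here is illustrative.
runs = [
    {"nodes": 24576, "pflops": 1.53, "fraction_of_peak": 0.49},
    {"nodes": 82944, "pflops": 4.45, "fraction_of_peak": 0.42},
]


def implied_peak_per_node_gflops(nodes, pflops, fraction_of_peak):
    """Per-node peak speed (Gflops) implied by sustained speed and efficiency."""
    system_peak_pflops = pflops / fraction_of_peak  # sustained / efficiency
    return system_peak_pflops * 1e6 / nodes         # 1 Pflops = 1e6 Gflops


implied = [implied_peak_per_node_gflops(**run) for run in runs]

for run, gflops in zip(runs, implied):
    print(f"{run['nodes']} nodes: about {gflops:.0f} Gflops peak per node")
```

Because the percentages in the abstract are rounded to whole numbers, the two implied values need only agree to within a few percent.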

arXiv.org e-Print Archive

CiteSeerX

Virtual Machine Level Temperature Profiling and Prediction in Cloud Datacenters

Authors: Garraghan PM; Jiang X; Li X; Wu Z; Ye K; Zomaya A
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/06/2016

Temperature prediction can enhance datacenter thermal management towards minimizing cooling power draw. Traditional approaches achieve this by analyzing task-temperature profiles or resistor-capacitor circuit models to predict CPU temperature. However, they are unable to capture task resource heterogeneity within multi-tenant environments or to make predictions under dynamic scenarios such as virtual machine migration, which is one of the main characteristics of Cloud computing. This paper proposes virtual machine level temperature prediction in Cloud datacenters. Experiments show that the mean squared error of stable CPU temperature prediction is within 1.10, and that of dynamic CPU temperature prediction is within 1.60 in most scenarios.
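For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a mean squared error between predicted and measured temperatures is computed. It is not the authors' evaluation code. The function name and the sample temperatures are hypothetical, and the abstract does not state the units behind its 1.10 and 1.60 figures.

```python
def mean_squared_error(predicted, measured):
    """Average of the squared differences between paired values."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical CPU temperatures, invented purely for illustration.
predicted_temps = [61.0, 62.5, 64.0, 63.0]
measured_temps = [60.0, 63.0, 65.0, 63.5]

mse = mean_squared_error(predicted_temps, measured_temps)
print(f"MSE = {mse:.3f}")
```

Because the errors are squared before averaging, a single large miss weighs more heavily on the result than several small ones.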

Crossref

Lancaster E-Prints

White Rose Research Online

Bio-inspired call-stack reconstruction for performance analysis

Authors: Giménez Lucas Judit; González Juan; Labarta Mancho Jesús José; Llort German; Servat Harald
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2016

The correlation of performance bottlenecks and their associated source code has become a cornerstone of performance analysis. It allows understanding why the efficiency of an application falls behind the computer's peak performance and ultimately enables optimizations of the code. To this end, performance analysis tools collect the processor call-stack and then combine this information with measurements to allow the analyst to comprehend the application behavior. Some tools modify the call-stack during run-time to diminish the collection expense, but at the cost of resulting in non-portable solutions. In this paper, we present a novel portable approach to associate performance issues with their source code counterpart. To do so, we capture a reduced segment of the call-stack (up to three levels) and then process the segments using an algorithm inspired by multi-sequence alignment techniques. The results of our approach are easily mapped to detailed performance views, enabling the analyst to unveil the application behavior and its corresponding region of code. To demonstrate the usefulness of our approach, we have applied the algorithm to several in-production applications that we had not seen before, describing them in detail and optimizing them with small modifications based on the analyses.

We thankfully acknowledge Mathis Bode for giving us access to the Arts CF binaries, and Miguel Castrillo and Kim Serradell for their valuable insight regarding Nemo. We would like to thank Forschungszentrum Jülich for the computation time on their Blue Gene/Q system. This research has been partially funded by the CICYT under contracts No. TIN2012-34557 and TIN2015-65316-P.

Peer Reviewed. Postprint (author's final draft)

Crossref

UPCommons. Portal del coneixement obert de la UPC

Juelich Shared Electronic Resources

Optimizing the MapReduce Framework on Intel Xeon Phi Coprocessor

Authors: Goh Rick Siow Mong; He Bingsheng; Huynh Huynh Phung; Huynh Richard; Liang Yun; Lu Mian; Ong Zhongliang; Zhang Lei
Publication date: 01/01/2013

With its ease of programming, flexibility, and efficiency, MapReduce has become one of the most popular frameworks for building big-data applications. MapReduce was originally designed for distributed computing, and has been extended to various architectures, e.g., multi-core CPUs, GPUs and FPGAs. In this work, we focus on optimizing the MapReduce framework on Xeon Phi, which is the latest product released by Intel based on the Many Integrated Core Architecture. To the best of our knowledge, this is the first work to optimize the MapReduce framework on the Xeon Phi. In our work, we utilize advanced features of the Xeon Phi to achieve high performance. In order to take advantage of the SIMD vector processing units, we propose a vectorization-friendly technique for the map phase to assist the auto-vectorization, and we develop SIMD hash computation algorithms. Furthermore, we utilize MIMD hyper-threading to pipeline the map and reduce phases to improve the resource utilization. We also eliminate multiple local arrays and instead use low-cost atomic operations on the global array for some applications, which can improve the thread scalability and data locality due to the coherent L2 caches. Finally, for a given application, our framework can either automatically detect suitable techniques to apply or provide guidelines for users at compilation time. We conduct comprehensive experiments to benchmark the Xeon Phi and compare our optimized MapReduce framework with a state-of-the-art multi-core based MapReduce framework (Phoenix++). By evaluating six real-world applications, the experimental results show that our optimized framework is 1.2X to 38X faster than Phoenix++ for various applications on the Xeon Phi.

arXiv.org e-Print Archive

Crossref

Distributed-memory large deformation diffeomorphic 3D image registration

Authors: Biros George; Gholami Amir; Mang Andreas
Publication date: 11/08/2016

We present a parallel distributed-memory algorithm for large deformation diffeomorphic registration of volumetric images that produces large isochoric deformations (locally volume preserving). Image registration is a key technology in medical image analysis. Our algorithm uses a partial differential equation constrained optimal control formulation. Finding the optimal deformation map requires the solution of a highly nonlinear problem that involves pseudo-differential operators, biharmonic operators, and pure advection operators both forward and backward in time. A key issue is the time to solution, which poses the demand for efficient optimization methods as well as an effective utilization of high performance computing resources. To address this problem we use a preconditioned, inexact, Gauss-Newton-Krylov solver. Our algorithm integrates several components: a spectral discretization in space, a semi-Lagrangian formulation in time, analytic adjoints, different regularization functionals (including volume-preserving ones), a spectral preconditioner, a highly optimized distributed Fast Fourier Transform, and a cubic interpolation scheme for the semi-Lagrangian time-stepping. We demonstrate the scalability of our algorithm on images with resolution of up to 1024^3 on the "Maverick" and "Stampede" systems at the Texas Advanced Computing Center (TACC). The critical problem in the medical imaging application domain is strong scaling, that is, solving registration problems of a moderate size of 256^3, a typical resolution for medical images. We are able to solve the registration problem for images of this size in less than five seconds on 64 x86 nodes of TACC's "Maverick" system.

Comment: accepted for publication at SC16 in Salt Lake City, Utah, USA; November 201

arXiv.org e-Print Archive

Crossref

Year 1 Report: Center for Trustworthy Scientific Cyberinfrastructure

Author: Welch Von
Publication date: 01/01/2013

This material is based in part on work supported by the National Science Foundation under Grant Number OCI-1234408. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

IUScholarWorks (University of Indiana)

Computational Science, Demystified...the Future, Revealed...and CiSE, 2013

Author: Thiruvathukal George K.
Publication venue: Loyola eCommons
Publication date: 01/03/2013

What are some of the exciting avenues that computational science is exploring, and how can we best give a voice to such emerging ideas?

Loyola eCommons