1,643 research outputs found

    Using Cognitive Computing for Learning Parallel Programming: An IBM Watson Solution

    While modern parallel computing systems provide high-performance resources, utilizing them to the fullest extent requires advanced programming expertise. Programming for parallel computing systems is much more difficult than programming for sequential systems. OpenMP is an extension of the C++ programming language that enables parallelism to be expressed using compiler directives. While OpenMP eases parallel programming by reducing the lines of code that the programmer needs to write, deciding how and when to use these compiler directives is up to the programmer. Novice programmers may make mistakes that lead to performance degradation or unexpected program behavior. Cognitive computing has shown impressive results in various domains, such as health or marketing. In this paper, we describe the use of the IBM Watson cognitive system for the education of novice parallel programmers. Using the dialogue service of IBM Watson, we have developed a solution that assists the programmer in avoiding common OpenMP mistakes. To evaluate our approach, we conducted a survey with a number of novice parallel programmers at Linnaeus University and obtained encouraging results with respect to the usefulness of our approach.
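As an illustration of the kind of mistake this abstract alludes to, the sketch below shows one frequently cited OpenMP pitfall: updating a shared accumulator from a parallel loop without a `reduction` clause. The abstract does not list the specific mistakes the Watson-based assistant covers, so this example is an assumption, and the function name `sum_array` is hypothetical. It is written in C, which OpenMP directives also support.

```c
#include <assert.h>
#include <stddef.h>

/* Sum an array with an OpenMP parallel loop.
 *
 * Writing only "#pragma omp parallel for" here would let every thread
 * update the shared variable `total` at the same time, which is a data
 * race. The reduction clause gives each thread a private partial sum
 * and combines the partial sums when the loop ends.
 *
 * Compiled without OpenMP support, the pragma is ignored and the loop
 * runs sequentially with the same result. */
double sum_array(const double *values, int n)
{
    assert(values != NULL && n >= 0);

    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; ++i) {
        total += values[i];
    }
    return total;
}
```

Calling `sum_array` on a four-element array holding 1, 2, 3, and 4 returns 10 whether or not OpenMP is enabled at compile time.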

    Large-Eddy Simulations of Flow and Heat Transfer in Complex Three-Dimensional Multilouvered Fins

    The paper describes the computational procedure and results from large-eddy simulations in a complex three-dimensional louver geometry. The three-dimensionality in the louver geometry occurs along the height of the fin, where the angled louver transitions to the flat landing and joins with the tube surface. The transition region is characterized by a swept leading edge and decreasing flow area between louvers. Preliminary results show a high energy compact vortex jet forming in this region. The jet forms in the vicinity of the louver junction with the flat landing and is drawn under the louver in the transition region. Its interaction with the surface of the louver produces vorticity of the opposite sign, which aids in augmenting heat transfer on the louver surface. The top surface of the louver in the transition region experiences large velocities in the vicinity of the surface and exhibits higher heat transfer coefficients than the bottom surface. Air Conditioning and Refrigeration Project 9

    Power efficient job scheduling by predicting the impact of processor manufacturing variability

    Modern CPUs suffer from performance and power consumption variability due to the manufacturing process. As a result, systems that do not consider such variability caused by manufacturing issues lead to performance degradations and wasted power. In order to avoid such negative impact, users and system administrators must actively counteract any manufacturing variability. In this work we show that parallel systems benefit from taking into account the consequences of manufacturing variability when making scheduling decisions at the job scheduler level. We also show that it is possible to predict the impact of this variability on specific applications by using variability-aware power prediction models. Based on these power models, we propose two job scheduling policies that consider the effects of manufacturing variability for each application and that ensure that power consumption stays under a system-wide power budget. We evaluate our policies under different power budgets and traffic scenarios, consisting of both single- and multi-node parallel applications, utilizing up to 4096 cores in total. We demonstrate that they decrease job turnaround time, compared to contemporary scheduling policies used on production clusters, up to 31% while saving up to 5.5% energy. Postprint (author's final draft)

    Hierarchical Parallelisation of Functional Renormalisation Group Calculations -- hp-fRG

    The functional renormalisation group (fRG) has evolved into a versatile tool in condensed matter theory for studying important aspects of correlated electron systems. Practical applications of the method often involve a high numerical effort, motivating the question of how far High Performance Computing (HPC) can leverage the approach. In this work we report on a multi-level parallelisation of the underlying computational machinery and show that this can speed up the code by several orders of magnitude. This in turn can extend the applicability of the method to otherwise inaccessible cases. We exploit three levels of parallelisation: distributed computing by means of message passing (MPI), shared-memory computing using OpenMP, and vectorisation by means of SIMD (single-instruction-multiple-data) units. Results are provided for two distinct HPC platforms, namely the IBM-based BlueGene/Q system JUQUEEN and an Intel Sandy-Bridge-based development cluster. We discuss how certain issues and obstacles were overcome in the course of adapting the code. Most importantly, we conclude that this vast improvement can actually be accomplished by introducing only moderate changes to the code, such that this strategy may serve as a guideline for other researchers to likewise improve the efficiency of their codes.

    Domain knowledge specification for energy tuning

    To overcome the challenges of energy consumption of HPC systems, the European Union Horizon 2020 READEX (Runtime Exploitation of Application Dynamism for Energy-efficient Exascale computing) project uses an online auto-tuning approach to improve energy efficiency of HPC applications. The READEX methodology pre-computes optimal system configurations at design-time, such as the CPU frequency, for instances of program regions and switches at runtime to the configuration given in the tuning model when the region is executed. READEX goes beyond previous approaches by exploiting dynamic changes of a region's characteristics by leveraging region and characteristic specific system configurations. While the tool suite supports an automatic approach, specifying domain knowledge such as the structure and characteristics of the application and application tuning parameters can significantly help to create a more refined tuning model. This paper presents the means available for an application expert to provide domain knowledge and presents tuning results for some benchmarks. Web of Science 316 art. no. E465

    NBSymple, a double parallel, symplectic N-body code running on Graphic Processing Units

    We present and discuss the characteristics and performance, in terms of both computational speed and precision, of a code that numerically integrates the equations of motion of N 'particles' interacting via Newtonian gravitation and moving in a smooth external galactic field. The force on every particle is evaluated by direct summation of the contributions of all the other particles in the system, avoiding truncation error. The time integration is done with second-order and sixth-order symplectic schemes. The code, NBSymple, has been parallelized twice: by means of the Compute Unified Device Architecture (CUDA), to make the all-pair force evaluation as fast as possible on high-performance NVIDIA TESLA C 1060 Graphics Processing Units, while the O(N) computations are distributed over multiple CPUs by means of the OpenMP Application Program Interface. The code works in either single-precision or double-precision floating-point arithmetic. Single precision makes the best use of GPU performance but, of course, limits the precision of the simulation in some critical situations. We find a good compromise in using a software reconstruction of double precision for those variables that are most critical for the overall precision of the code. The code is available on the web site astrowww.phys.uniroma1.it/dolcetta/nbsymple.html. Comment: 29 pages, including 9 figures. Submitted to New Astronomy.
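The "software reconstruction of double precision" mentioned in this abstract can be illustrated with a float-pair accumulator, where a value is kept as the unevaluated sum of two single-precision numbers. The abstract does not say which scheme NBSymple uses, so the sketch below is only one common approach (based on the Knuth two-sum), and the names `FloatPair`, `pair_add`, `accumulate_pair`, and `accumulate_plain` are hypothetical. It assumes IEEE single-precision arithmetic without value-changing optimizations such as `-ffast-math`.

```c
#include <assert.h>

/* A value stored as the unevaluated sum hi + lo of two floats. */
typedef struct {
    float hi;  /* leading part */
    float lo;  /* rounding error carried alongside hi */
} FloatPair;

/* Add a single float to a float pair, keeping the rounding error. */
FloatPair pair_add(FloatPair a, float b)
{
    /* Knuth two-sum: s + err equals a.hi + b exactly. */
    float s   = a.hi + b;
    float bb  = s - a.hi;
    float err = (a.hi - (s - bb)) + (b - bb);

    /* Fold in the low part carried from earlier additions. */
    err += a.lo;

    /* Renormalize so that hi holds the leading part again. */
    FloatPair r;
    r.hi = s + err;
    r.lo = err - (r.hi - s);
    return r;
}

/* Add `step` to `start` n times using the float pair. */
double accumulate_pair(float start, float step, int n)
{
    assert(n >= 0);
    FloatPair acc = { start, 0.0f };
    for (int i = 0; i < n; ++i) {
        acc = pair_add(acc, step);
    }
    return (double)acc.hi + (double)acc.lo;
}

/* Same accumulation in plain single precision, for comparison. */
float accumulate_plain(float start, float step, int n)
{
    assert(n >= 0);
    float sum = start;
    for (int i = 0; i < n; ++i) {
        sum += step;
    }
    return sum;
}
```

Adding a step of `1e-8f` to `1.0f` one million times shows the difference: the plain float sum never moves away from 1 because each step is below half a unit in the last place, while the float-pair sum ends close to 1.01.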

    Evaluation of OpenMP Dependent Tasks with the KASTORS Benchmark Suite

    The recent introduction of task dependencies in the OpenMP specification provides new ways of synchronizing tasks. Application programmers can now describe the data a task will read as input and write as output, letting the runtime system resolve fine-grain dependencies between tasks to decide which task should execute next. Such an approach should scale better than the excessive global synchronization found in most OpenMP 3.0 applications. As promising as it looks, however, any new feature needs proper evaluation to encourage application programmers to embrace it. This paper introduces the KASTORS benchmark suite, designed to evaluate OpenMP task dependencies. We modified state-of-the-art OpenMP 3.0 benchmarks and data-flow parallel linear algebra kernels to make use of task dependencies. Learning from this experience, we propose extensions to the current OpenMP specification to improve the expressiveness of dependencies. Finally, we evaluate both the GCC/libGOMP and the CLANG/libIOMP implementations of OpenMP 4.0 on our KASTORS suite, demonstrating the value of task dependencies compared to taskwait-based approaches.
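The input/output description of tasks mentioned in this abstract corresponds to the `depend` clause introduced in OpenMP 4.0. The following is a minimal sketch of that clause, not code from the KASTORS suite; the three-stage pipeline and the function name `run_pipeline` are hypothetical.

```c
#include <assert.h>

/* Hypothetical three-stage pipeline expressed as dependent tasks:
 * stage 1 writes a, stage 2 reads a and writes b, stage 3 reads both.
 * The depend clauses let the runtime order the tasks without an
 * explicit taskwait between stages. Compiled without OpenMP support,
 * the pragmas are ignored and the stages run in program order. */
int run_pipeline(int seed)
{
    assert(seed >= 0);

    int a = 0, b = 0, result = 0;

    #pragma omp parallel
    #pragma omp single
    {
        /* Stage 1: produces a. */
        #pragma omp task depend(out: a) shared(a)
        a = seed + 1;

        /* Stage 2: waits for a, produces b. */
        #pragma omp task depend(in: a) depend(out: b) shared(a, b)
        b = a * 2;

        /* Stage 3: waits for both a and b. */
        #pragma omp task depend(in: a, b) shared(a, b, result)
        result = a + b;
    }   /* implicit barrier: all tasks have completed here */

    return result;
}
```

With a seed of 1 the stages compute a = 2, b = 4, and a result of 6. A taskwait-based version would instead place `#pragma omp taskwait` after each stage, which also blocks unrelated sibling tasks created by the same thread.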

    FISH: A 3D parallel MHD code for astrophysical applications

    FISH is a fast and simple ideal magneto-hydrodynamics code that scales to ~10 000 processes for a Cartesian computational domain of ~1000^3 cells. The simplicity of FISH has been achieved by the rigorous application of the operator splitting technique, while second-order accuracy is maintained by the symmetric ordering of the operators. Between directional sweeps, the three-dimensional data is rotated in memory so that the sweep is always performed in a cache-efficient way along the direction of contiguous memory. Hence, the code only requires a one-dimensional description of the conservation equations to be solved. This approach also enables an elegant, novel parallelisation of the code that is based on persistent communications with MPI for cubic domain decomposition on machines with distributed memory. This scheme is then combined with an additional OpenMP parallelisation of different sweeps that can take advantage of clusters of shared memory. We document the detailed implementation of a second-order TVD advection scheme based on flux reconstruction. The magnetic fields are evolved by a constrained transport scheme. We show that the subtraction of a simple estimate of the hydrostatic gradient from the total gradients can significantly reduce the dissipation of the advection scheme in simulations of gravitationally bound hydrostatic objects. Through its simplicity and efficiency, FISH is as well-suited for hydrodynamics classes as for large-scale astrophysical simulations on high-performance computer clusters. In preparation for the release of a public version, we demonstrate the performance of FISH in a suite of astrophysically orientated test cases. Comment: 27 pages, 11 figures