    Performance and energy footprint assessment of FPGAs and GPUs on HPC systems using Astrophysics application

    New challenges in Astronomy and Astrophysics (AA) are driving the need for a large number of exceptionally computationally intensive simulations. "Exascale" (and beyond) computational facilities are mandatory to address the size of theoretical problems and data coming from the new generation of observational facilities in AA. Currently, the High Performance Computing (HPC) sector is undergoing a profound phase of innovation, in which the primary challenge to achieving "Exascale" is power consumption. The goal of this work is to give some insight into the performance and energy footprint of contemporary architectures for a real astrophysical application in an HPC context. We use a state-of-the-art N-body application that we re-engineered and optimized to fully exploit the underlying heterogeneous hardware. We quantitatively evaluate the impact of computation on energy consumption when running on four different platforms. Two of them represent the current HPC systems (Intel-based and equipped with NVIDIA GPUs), one is a micro-cluster based on ARM-MPSoC, and one is a "prototype towards Exascale" equipped with ARM-MPSoCs tightly coupled with FPGAs. We investigate the behavior of the different devices: the high-end GPUs excel in terms of time-to-solution, while MPSoC-FPGA systems outperform GPUs in power consumption. Our experience reveals that considering FPGAs for computationally intensive applications seems very promising, as their performance is improving to meet the requirements of scientific applications. This work can be a reference for future platform development for astrophysics applications where computationally intensive calculations are required. Comment: 15 pages, 4 figures, 3 tables; Preprint (V2) submitted to MDPI (Special Issue: Energy-Efficient Computing on Parallel Architectures)
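
    The trade-off between time-to-solution and power consumption described above can be made concrete with a small calculation. The following is a minimal sketch, assuming energy-to-solution is approximated as average power multiplied by run time; the platform names and all numbers are hypothetical and are not measurements from the paper.

```python
def energy_to_solution(avg_power_watts: float, time_seconds: float) -> float:
    """Approximate energy (joules) of a run at a constant average power."""
    return avg_power_watts * time_seconds


# Hypothetical figures for illustration only; NOT taken from the paper.
runs = {
    "gpu_node": {"power_w": 250.0, "time_s": 10.0},
    "fpga_node": {"power_w": 20.0, "time_s": 100.0},
}

# Energy per platform, in joules.
energy = {
    name: energy_to_solution(r["power_w"], r["time_s"])
    for name, r in runs.items()
}

# The platform with the shortest run time need not be the one that
# consumes the least energy.
fastest = min(runs, key=lambda name: runs[name]["time_s"])
most_frugal = min(energy, key=energy.get)

print(energy, fastest, most_frugal)
```

    With these made-up inputs the faster platform is not the one with the lower energy, which is the kind of distinction an energy footprint assessment has to capture.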

    LUSA: the HPC library for lattice-based cryptanalysis

    This paper introduces LUSA - the Lattice Unified Set of Algorithms library - a C++ library that comprises many high-performance, parallel implementations of lattice algorithms, with particular focus on lattice-based cryptanalysis. Currently, LUSA offers algorithms for lattice reduction and the SVP. LUSA was designed to 1) be simple to install and use, 2) have no other dependencies, 3) target lattice-based cryptanalysis specifically, including the majority of the most relevant algorithms in this field, and 4) offer efficient, parallel and scalable methods for those algorithms. LUSA exploits parallelism mainly at the thread level, being based on OpenMP. However, the code is also written to be efficient at the cache and operation level, taking advantage of carefully sorted data structures and data-level parallelism. This paper shows that LUSA delivers on these promises, by being simple to use while consistently outperforming its counterparts, such as NTL, plll and fplll, and offering scalable, parallel implementations of the most relevant algorithms to date, which are currently not available in other libraries.

    Practical Parallelization of Scientific Applications

    Parallel improved Schnorr-Euchner enumeration SE++ on shared and distributed memory systems, with and without extreme pruning

    The security of lattice-based cryptography relies on the hardness of problems based on lattices, such as the Shortest Vector Problem (SVP) and the Closest Vector Problem (CVP). This paper presents two parallel implementations of the SE++, with and without extreme pruning. The SE++ is an enumeration-based CVP-solver, which can be easily adapted to solve the SVP. We improved the SVP version of the SE++ with an optimization that avoids symmetric branches, improving its performance by approximately 50%, and applied the extreme pruning technique to this improved version. The extreme pruning technique is the fastest way to compute the SVP with enumeration known to date. It solves the SVP for lattices in much higher dimensions in less time than implementations without extreme pruning. Our parallel implementation of the SE++ with extreme pruning targets distributed memory multi-core CPU systems, while our SE++ without extreme pruning is designed for shared memory multi-core CPU systems. These implementations address load balancing problems for optimal performance, with a master-slave mechanism on the distributed memory implementation, and specific bounds for task creation on the shared memory implementation. The parallel implementation of the SE++ without extreme pruning scales linearly for up to 8 threads and almost linearly for 16 threads. In addition, it also achieves super-linear speedups on some instances, as the workload may be shortened, since some threads may find shorter vectors at earlier points in time, compared to the sequential implementation. Tests with our Improved SE++ implementation showed that it outperforms the state-of-the-art implementation by between 35% and 60%, while maintaining a scalability similar to the SE++ implementation.
Our parallel implementation of the SE++ with extreme pruning achieves linear speedups for up to 8 (working) processes and speedups of up to 13x for 16 (working) processes.
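
    The scaling figures quoted above can be read as parallel efficiency, i.e. speedup divided by the number of workers. The sketch below is only an arithmetic illustration: the function name is hypothetical, a linear speedup at 8 processes is taken to mean a speedup of 8, and the 13x value is the upper bound stated in the abstract.

```python
def parallel_efficiency(speedup: float, workers: int) -> float:
    """Return speedup per worker; 1.0 corresponds to linear scaling."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return speedup / workers


# Linear speedup at 8 working processes is interpreted as a speedup of 8.
eff_8 = parallel_efficiency(8.0, 8)

# Up to 13x at 16 working processes, as reported in the abstract.
eff_16 = parallel_efficiency(13.0, 16)

print(eff_8, eff_16)
```

    Under this reading, efficiency drops from 1.0 at 8 processes to about 0.81 at 16 processes in the best reported case.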

    Classifying Process Instances Using Recurrent Neural Networks

    Process Mining consists of techniques where logs created by operative systems are transformed into process models. In process mining tools it is often desired to be able to classify ongoing process instances, e.g., to predict how long the process will still require to complete, or to classify process instances to different classes based only on the activities that have occurred in the process instance thus far. Recurrent neural networks and their subclasses, such as Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM), have been demonstrated to be able to learn relevant temporal features for subsequent classification tasks. In this paper we apply recurrent neural networks to classifying process instances. The proposed model is trained in a supervised fashion using labeled process instances extracted from event log traces. To our knowledge, this is the first time GRU has been used in classifying business process instances. Our main experimental results show that GRU remarkably outperforms LSTM in training time while giving almost identical accuracies to LSTM models. An additional contribution of our paper is improving the classification model training time by filtering infrequent activities, which is a technique commonly used, e.g., in Natural Language Processing (NLP).
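
    The filtering of infrequent activities mentioned above resembles dropping rare tokens from a vocabulary in NLP. The following is a minimal sketch of that idea, not the authors' implementation; the function name, the `min_count` threshold and the example traces are assumptions made for illustration.

```python
from collections import Counter


def filter_infrequent_activities(traces, min_count):
    """Drop activities that occur fewer than min_count times in the whole log."""
    # Count every activity occurrence across all traces.
    counts = Counter(activity for trace in traces for activity in trace)
    keep = {activity for activity, n in counts.items() if n >= min_count}
    # Preserve the order of the remaining activities within each trace.
    return [[a for a in trace if a in keep] for trace in traces]


# Hypothetical event log: each trace is the activity sequence of one instance.
traces = [
    ["register", "check", "approve"],
    ["register", "check", "escalate", "approve"],
    ["register", "reject"],
]

filtered = filter_infrequent_activities(traces, min_count=2)
print(filtered)
```

    A smaller activity vocabulary shortens the input sequences and shrinks the input encoding, which is one plausible reason such filtering reduces training time.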

    Advances in Engineering Software for Multicore Systems

    The vast amounts of data to be processed by today’s applications demand higher computational power. To meet application requirements and achieve reasonable application performance, it becomes increasingly profitable, or even necessary, to exploit any available hardware parallelism. For both new and legacy applications, successful parallelization often comes at a high cost. This chapter proposes a set of methods that employ an optimistic semi-automatic approach, which enables programmers to exploit parallelism on modern hardware architectures. It provides a set of methods, including an LLVM-based tool, to help programmers identify the most promising parallelization targets and understand the key types of parallelism. The approach reduces the manual effort needed for parallelization. A contribution of this work is an efficient profiling method to determine the control and data dependences for performing parallelism discovery or other types of code analysis. Another contribution is a method for detecting code sections where parallel design patterns might be applicable and suggesting relevant code transformations. Our approach efficiently reports detailed runtime data dependences. It accurately identifies opportunities for parallelism and the appropriate type of parallelism to use, whether task-based or loop-based.

    Analytical attack modeling and security assessment based on the common vulnerability scoring system

    The paper analyzes an approach to analytical attack modeling and security assessment on the basis of the Common Vulnerability Scoring System (CVSS) format, considering different modifications that appeared in the new version of the CVSS specification. The common approach to analytical attack modeling and security assessment was suggested by the authors earlier. The paper outlines disadvantages of the previous CVSS version that negatively influenced the results of the attack modeling and security assessment. Differences between the new and previous CVSS versions are analyzed. Modifications of the approach to analytical attack modeling and security assessment that follow from the CVSS modifications are suggested. Advantages of the modified approach are described. A case study that illustrates the enhanced approach is provided.

    Evaluation of the Memory Communication Traffic in a Hierarchical Cache Model for Massively-Manycore Processors

    The scaling of semiconductor technologies is leading to processors with increasing numbers of cores. A key enabler in manycore systems is the use of Networks-on-Chip (NoC) as a global communication mechanism. The adoption of NoCs in manycore systems requires a shift in focus from computation to communication, as communication is fast becoming the dominant factor in processor performance. Many researchers have focused on direct communication between cores in the NoC; however, in a manycore processor the communication is actually between the cores and the memory hierarchy. In this work, we investigate the memory communication traffic of shared threads in a hierarchical cache architecture. We argue that the performance scalability for shared-memory applications in a hierarchical cache architecture for systems with thousands of processor cores depends on the distance between threads sharing memory in terms of the cache hierarchy (the "memory distance"). We present latency and throughput results comparing fat quadtree, concentrated mesh and mesh topologies as a function of the "memory distance" between the threads. Our results using the ITRS physical data for 2023 show that the model of thread placement and the placement distance significantly affect the NoC performance, and that scale-invariant topologies perform better than flat topologies.

    Specification and verification of synchronisation classes in Java: A practical approach

    Digital services are becoming an essential part of our daily lives. To provide these services, efficient software plays an important role. Concurrent programming is a technique that developers can exploit to gain more performance. In a concurrent program, several threads of execution are executed simultaneously. Sometimes they have to compete to access shared resources, like memory. This race to access shared memory can cause unexpected errors. Programmers use synchronisation constructs to tame the concurrency and control the accesses. In order to develop reliable concurrent software, the correctness of these synchronisation constructs is crucial. In this thesis we use a program logic, called permission-based Separation Logic, to statically reason about the correctness of synchronisation constructs. The logic has the power to reason about correct ownership of threads regarding shared memory. A correctly functioning synchroniser is responsible for exchanging a correct permission when a thread requests access to the shared memory. We use our VERCORS verification tool-set to verify the correctness of various synchronisation constructs. In Chapter 1 we discuss the scope of the thesis. All the required technical background about permission-based Separation Logic and synchronisation classes is explained in Chapter 2. In Chapter 3 we discuss how threads' start and join as minimum synchronisation points can be verified. To verify correctness of the synchronisation classes we have to first specify the expected behaviour of the classes. This is covered in Chapter 4. In this chapter we present a unified approach to abstractly describe the common behaviour of synchronisers. Using our specifications, one is able to reason about the correctness of the client programs that access the shared state through the synchronisers. The atomic classes of java.util.concurrent are the core element of every synchronisation construct implementation.
In Chapter 5 and Chapter 6 we propose a specification for atomic classes. Using this contract, we verified the implementation of synchronisation constructs with respect to their specifications from Chapter 4. In our proposed contract the specification of the atomic classes is parameterized with the protocols and resource invariants. Based on the context, the parameters can be defined. In Chapter 7 we propose a verification stack where each layer of the stack verifies one particular aspect of a specified concurrent program in which atomic operations are the main synchronisation constructs. We demonstrate how to verify that a non-blocking data structure is data-race free and well connected. Based on the result of the verification from the lower layers, upper layers can reason about the functional properties of the concurrent data structure. In Chapter 8 we present a sound specification and verification technique to reason about data race freedom and functional correctness of GPU kernels that use atomic operations as a synchronisation mechanism. Finally, Chapter 9 concludes the thesis with future directions.

    Distributed simulation optimization and parameter exploration framework for the cloud

    Simulation models are becoming an increasingly popular tool for the analysis and optimization of complex real systems in different fields. Finding an optimal system design requires performing a large sweep over the parameter space in an organized way. Hence, the model optimization process is extremely demanding from a computational point of view, as it requires careful, time-consuming, complex orchestration of coordinated executions. In this paper, we present the design of SOF (Simulation Optimization and exploration Framework in the cloud), a framework which exploits the computing power of a cloud computational environment in order to carry out effective and efficient simulation optimization strategies. SOF offers several attractive features. Firstly, SOF requires “zero configuration” as it does not require any additional software installed on the remote node; only standard Apache Hadoop and SSH access are sufficient. Secondly, SOF is transparent to the user, since the user is totally unaware that the system operates in a distributed environment. Finally, SOF is highly customizable and programmable, since it enables the running of different simulation optimization scenarios using diverse programming languages – provided that the hosting platform supports them – and different simulation toolkits, as developed by the modeler. The tool has been fully developed and is available on a public repository under the terms of the open source Apache License. It has been tested and validated on several private platforms, such as a dedicated cluster of workstations, as well as on public platforms, including the Hortonworks Data Platform and Amazon Web Services Elastic MapReduce solution.
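
    A sweep over a parameter space, as described above, amounts to enumerating parameter combinations and distributing them over workers. The sketch below illustrates only that general idea in plain Python; it does not use or reproduce SOF's interface, and the function names, parameter names and worker count are all hypothetical.

```python
from itertools import product


def parameter_grid(space):
    """Enumerate every combination of the given parameter values."""
    names = list(space)
    return [
        dict(zip(names, values))
        for values in product(*(space[name] for name in names))
    ]


def partition(items, n_workers):
    """Split items round-robin into n_workers chunks, one per worker."""
    return [items[i::n_workers] for i in range(n_workers)]


# Hypothetical simulation parameters for illustration.
space = {"arrival_rate": [0.5, 1.0, 1.5], "servers": [1, 2]}

grid = parameter_grid(space)   # 3 * 2 = 6 combinations
chunks = partition(grid, 4)    # work assigned to 4 hypothetical workers

print(len(grid), [len(chunk) for chunk in chunks])
```

    In a distributed setting each chunk would be evaluated on a separate node and the results collected to guide the next optimization step; that orchestration is the part a framework of this kind is meant to handle.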