Search CORE

708 research outputs found

The GPU vs Phi Debate: Risk Analytics Using Many-Core Computing

Author: Varghese Blesson
Publication venue: 'Elsevier BV'
Publication date: 26/01/2015
Field of study

The risk of reinsurance portfolios covering globally occurring natural catastrophes, such as earthquakes and hurricanes, is quantified by employing simulations. These simulations are computationally intensive and require large amounts of data to be processed. The use of many-core hardware accelerators, such as the Intel Xeon Phi and the NVIDIA Graphics Processing Unit (GPU), are desirable for achieving high-performance risk analytics. In this paper, we set out to investigate how accelerators can be employed in risk analytics, focusing on developing parallel algorithms for Aggregate Risk Analysis, a simulation which computes the Probable Maximum Loss of a portfolio taking both primary and secondary uncertainties into account. The key result is that both hardware accelerators are useful in different contexts; without taking data transfer times into account the Phi had lowest execution times when used independently and the GPU along with a host in a hybrid platform yielded best performance.Comment: A modified version of this article is accepted to the Computers and Electrical Engineering Journal under the title - "The Hardware Accelerator Debate: A Financial Risk Case Study Using Many-Core Computing"; Blesson Varghese, "The Hardware Accelerator Debate: A Financial Risk Case Study Using Many-Core Computing," Computers and Electrical Engineering, 201

arXiv.org e-Print Archive

CiteSeerX

Crossref

University of St. Andrews - Pure

St Andrews Research Repository

Machine Assisted Proof of ARMv7 Instruction Level Isolation Properties

Author: A. Fox
C. Heitmeyer
E. Alkassar
G. Barthe
M.O. Myreen
T. Murray
W. Paul
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013
Field of study

In this paper, we formally verify security properties of the ARMv7 Instruction Set Architecture (ISA) for user mode executions. To obtain guarantees that arbitrary (and unknown) user processes are able to run isolated from privileged software and other user processes, instruction level noninterference and integrity properties are provided, along with proofs that transitions to privileged modes can only occur in a controlled manner. This work establishes a main requirement for operating system and hypervisor verification, as demonstrated for the PROSPER separation kernel. The proof is performed in the HOL4 theorem prover, taking the Cambridge model of ARM as basis. To this end, a proof tool has been developed, which assists the verification of relational state predicates semi-automatically

Publikationer från KTH

Crossref

RISE – Research Institutes of Sweden

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Swedish Institute of Computer Science Publications Database

Software institutes' Online Digital Archive

Fault Diagnosis of Hybrid Computing Systems Using Chaotic-Map Method

Author: Philip Bobby
Rao Nageswara S. V.
Publication venue: 'IntechOpen'
Publication date: 05/11/2018
Field of study

Computing systems are becoming increasingly complex with nodes consisting of a combination of multi-core central processing units (CPUs), many integrated core (MIC) and graphics processing unit (GPU) accelerators. These computing units and their interconnections are subject to different classes of hardware and software faults, which should be detected to support mitigation measures. We present the chaotic-map method that uses the exponential divergence and wide Fourier properties of the trajectories, combined with memory allocations and assignments to diagnose component-level faults in these hybrid computing systems. We propose lightweight codes that utilize highly parallel chaotic-map computations tailored to isolate faults in arithmetic units, memory elements and interconnects. The diagnosis module on a node utilizes pthreads to place chaotic-map threads on CPU and MIC cores, and CUDA C and OpenCL kernels on GPU blocks. We present experimental diagnosis results on five multi-core CPUs; one MIC; and, seven GPUs with typical diagnosis run-times under a minute

IntechOpen

Crossref

Hardware-software co-design of an iris recognition algorithm

Author: Canto Navarro Enrique Fernando
Daugman J.
López García Mariano
Publication venue: 'Institution of Engineering and Technology (IET)'
Publication date: 01/01/2011
Field of study

This paper describes the implementation of an iris recognition algorithm based on hardware-software co-design. The system architecture consists of a general-purpose 32- bit microprocessor and several slave coprocessors that accelerate the most intensive calculations. The whole iris recognition algorithm has been implemented on a low-cost Spartan 3 FPGA, achieving significant reduction in execution time when compared to a conventional software-based application. Experimental results show that with a clock speed of 40 MHz, an IrisCode is obtained in less than 523 ms from an image of 640x480 pixels, which is just 20% of the total time needed by a software solution running on the same microprocessor embedded in the architecture.Peer ReviewedPreprin

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

UPCommons. Portal del coneixement obert de la UPC

Hardware implementation of non-bonded forces in molecular dynamics simulations

Author: Caicedo Beltrán Álvaro José
Publication venue: INGENIERIA ELECTRÓNICA
Publication date: 01/01/2011
Field of study

Molecular Dynamics is a computational method based on classical mechanics to describe the behavior of a molecular system. This method is used in biomolecular simulations, which are intended to contribute to the study and advance of nanotechnology, medicine, chemistry and biology. Software implementations of Molecular Dynamics simulations can spend most of time computing the non-bonded interactions. This work presents the design and implementation of an FPGA-based coprocessor that accelerates MD simulations by computing in parallel the non-bonded interactions, specifically, the van der Waals and the electrostatic interactions. These interactions are modeled as the Lennard-Jones 6-12 potential and the direct-space Ewald summation, respectively. In addition, this work introduces a novel variable transformation of the potential energy functions, and a novel interpolation method with pseudo-floating-point representation to compute the short-range forces. Also, it uses a combination of fixed-point and floating-point arithmetic to obtain the best of both representations. The FPGA coprocessor is a memory-mapped system connected to a host by PCI Express, and is provided with interruption capabilities to improve parallelization. Its main block is based on a single functional pipeline, and is connected via Avalon Bus to other peripherals such as the PCIe Hard-IP and the SG-DMA. It is implemented on an Altera¿s EP2AGX125EF35C4 device, can process 16k particles, and is configured to store up to 16 different types of particles. Simulations in a custom C-application for MD that only computes non-bonded forces become up to 12.5x faster using the FPGA coprocessor when considering 12500 atoms.PregradoINGENIERO(A) EN ELECTRÓNIC

Biblioteca Digital de la Universidad del Valle

Accelerating Gauss-Newton filters on FPGA's

Author: Da Conceicao Jean-Paul Costa
Publication venue: Department of Electrical Engineering
Publication date: 01/01/2010
Field of study

Includes bibliographical references (leaves 123-128).Radar tracking filters are generally computationally expensive, involving the manipulation of large matrices and deeply nested loops. In addition, they must generally work in real-time to be of any use. The now-common Kalman Filter was developed in the 1960's specifically for the purposes of lowering its computational burden, so that it could be implemented using the limited computational resources of the time. However, with the exponential increases in computing power since then, it is now possible to reconsider more heavy-weight, robust algorithms such as the original nonrecursive Gauss-Newton filter on which the Kalman filter is based. This dissertation investigates the acceleration of such a filter using FPGA technology, making use of custom, reduced-precision number formats

Cape Town University OpenUCT

Numerically robust implementations of fast recursive least squares adaptive filters using interval arithmetic

Author: Callender Christopher Peter
Publication venue: The University of Edinburgh
Publication date: 01/01/1991
Field of study

Edinburgh Research Archive

Deep convolutional network evaluation on the Intel Xeon Phi:where subword parallelism meets many-core

Author: Raina G.
Publication venue
Publication date: 31/01/2016
Field of study

Pure OAI Repository

Vitruvius+: An area-efficient RISC-V decoupled vector coprocessor for high performance computing applications

The maturity level of RISC-V and the availability of domain-specific instruction set extensions, like vector processing, make RISC-V a good candidate for supporting the integration of specialized hardware in processor cores for the High Performance Computing (HPC) application domain. In this article,1 we present Vitruvius+, the vector processing acceleration engine that represents the core of vector instruction execution in the HPC challenge that comes within the EuroHPC initiative. It implements the RISC-V vector extension (RVV) 0.7.1 and can be easily connected to a scalar core using the Open Vector Interface standard. Vitruvius+ natively supports long vectors: 256 double precision floating-point elements in a single vector register. It is composed of a set of identical vector pipelines (lanes), each containing a slice of the Vector Register File and functional units (one integer, one floating point). The vector instruction execution scheme is hybrid in-order/out-of-order and is supported by register renaming and arithmetic/memory instruction decoupling. On a stand-alone synthesis, Vitruvius+ reaches a maximum frequency of 1.4 GHz in typical conditions (TT/0.80V/25°C) using GlobalFoundries 22FDX FD-SOI. The silicon implementation has a total area of 1.3 mm2 and maximum estimated power of ~920 mW for one instance of Vitruvius+ equipped with eight vector lanes.This research has received funding from the European High Performance Computing Joint Undertaking (JU) under Framework Partnership Agreement No 800928 (European Processor Initiative) and Specific Grant Agreement No 101036168 (EPI SGA2). The JU receives support from the European Union’s Horizon 2020 research and innovation programme and from Croatia, France, Germany, Greece, Italy, Netherlands, Portugal, Spain, Sweden, and Switzerland. The EPI-SGA2 project, PCI2022-132935 is also co-funded by MCIN/AEI/10.13039/501100011033 and by the UE NextGen- erationEU/PRTR. This work has also been partially supported by the Spanish Ministry of Science and Innovation (PID2019-107255GB-C21/AEI/10.13039/501100011033).Peer ReviewedPostprint (author's final draft

UPCommons. Portal del coneixement obert de la UPC