3,536 research outputs found

    Speculative Segmented Sum for Sparse Matrix-Vector Multiplication on Heterogeneous Processors

    Full text link
    Sparse matrix-vector multiplication (SpMV) is a central building block for scientific software and graph applications. Recently, heterogeneous processors composed of different types of cores attracted much attention because of their flexible core configuration and high energy efficiency. In this paper, we propose a compressed sparse row (CSR) format based SpMV algorithm utilizing both types of cores in a CPU-GPU heterogeneous processor. We first speculatively execute segmented sum operations on the GPU part of a heterogeneous processor and generate a possibly incorrect results. Then the CPU part of the same chip is triggered to re-arrange the predicted partial sums for a correct resulting vector. On three heterogeneous processors from Intel, AMD and nVidia, using 20 sparse matrices as a benchmark suite, the experimental results show that our method obtains significant performance improvement over the best existing CSR-based SpMV algorithms. The source code of this work is downloadable at https://github.com/bhSPARSE/Benchmark_SpMV_using_CSRComment: 22 pages, 8 figures, Published at Parallel Computing (PARCO

    E-QED: Electrical Bug Localization During Post-Silicon Validation Enabled by Quick Error Detection and Formal Methods

    Full text link
    During post-silicon validation, manufactured integrated circuits are extensively tested in actual system environments to detect design bugs. Bug localization involves identification of a bug trace (a sequence of inputs that activates and detects the bug) and a hardware design block where the bug is located. Existing bug localization practices during post-silicon validation are mostly manual and ad hoc, and, hence, extremely expensive and time consuming. This is particularly true for subtle electrical bugs caused by unexpected interactions between a design and its electrical state. We present E-QED, a new approach that automatically localizes electrical bugs during post-silicon validation. Our results on the OpenSPARC T2, an open-source 500-million-transistor multicore chip design, demonstrate the effectiveness and practicality of E-QED: starting with a failed post-silicon test, in a few hours (9 hours on average) we can automatically narrow the location of the bug to (the fan-in logic cone of) a handful of candidate flip-flops (18 flip-flops on average for a design with ~ 1 Million flip-flops) and also obtain the corresponding bug trace. The area impact of E-QED is ~2.5%. In contrast, deter-mining this same information might take weeks (or even months) of mostly manual work using traditional approaches

    Improved Architectures for Secure Intra-process Isolation

    Get PDF
    Intra-process memory isolation can improve security by enforcing least-privilege at a finer granularity than traditional operating system controls without the context-switch overhead associated with inter-process communication. Because the process has traditionally been a fundamental security boundary, assigning different levels of trust to components within a process is a fundamental change in secure systems design. However, so far there has been little research on the challenges of securely implementing intra-process isolation on top of existing operating system abstractions. We find that frequently-used assumptions in secure system design do not precisely hold under realistic conditions, and that these discrepancies lead to exploitable vulnerabilities. We evaluate two recently-proposed memory isolation systems and show that both are vulnerable to the same generic attacks that break their security model. We then extend a subset of these attacks by applying them to a fully-precise model of control-flow integrity, demonstrating a data-only attack that bypasses both static and dynamic control-flow integrity enforcement by overwriting executable code in-memory even under typical w^x assumptions. From these two results, we propose a set of kernel modifications called Xlock that systemically addresses weaknesses in memory permissions enforcement on Linux, bringing them into line with w^x assumptions. Finally, we present modifications to intra-process isolation systems that preserve efficient userspace component transitions while drastically reducing risk of accidental kernel mismanagement by modeling intra-process components as separate processes from the kernel\u27s perspective. Taken together, these mitigations represent a more robust architecture for efficient and secure intra-process isolation

    A scalable parallel finite element framework for growing geometries. Application to metal additive manufacturing

    Get PDF
    This work introduces an innovative parallel, fully-distributed finite element framework for growing geometries and its application to metal additive manufacturing. It is well-known that virtual part design and qualification in additive manufacturing requires highly-accurate multiscale and multiphysics analyses. Only high performance computing tools are able to handle such complexity in time frames compatible with time-to-market. However, efficiency, without loss of accuracy, has rarely held the centre stage in the numerical community. Here, in contrast, the framework is designed to adequately exploit the resources of high-end distributed-memory machines. It is grounded on three building blocks: (1) Hierarchical adaptive mesh refinement with octree-based meshes; (2) a parallel strategy to model the growth of the geometry; (3) state-of-the-art parallel iterative linear solvers. Computational experiments consider the heat transfer analysis at the part scale of the printing process by powder-bed technologies. After verification against a 3D benchmark, a strong-scaling analysis assesses performance and identifies major sources of parallel overhead. A third numerical example examines the efficiency and robustness of (2) in a curved 3D shape. Unprecedented parallelism and scalability were achieved in this work. Hence, this framework contributes to take on higher complexity and/or accuracy, not only of part-scale simulations of metal or polymer additive manufacturing, but also in welding, sedimentation, atherosclerosis, or any other physical problem where the physical domain of interest grows in time

    An integrated framework to support remote IEEE 1149.1 /1149.4 design for test experiments

    Get PDF
    Remote experiments for academic purposes can only achieve their educational goals if an appropriate framework is able to provide a basic set of features, namely remote laboratory management, collaborative learning tools and content management and delivery. This paper presents a framework developed to support remote experiments in a design for test class offered to final year students at the Electrical and Computer Engineering degree at the University of Porto. The proposed solution combines a test language command interpreter and various virtual instruments (VIs), with a demonstration board that comprises a boundary-scan IEEE 1149.1 / 1149.4 test infrastructure. The experiments are presented as embedded learning objects, with no distinction from other e-learning contents (e.g. lessons, lecture notes, etc.)

    Comprehensive analysis of high-performance computing methods for filtered back-projection

    Get PDF
    This paper provides an extensive runtime, accuracy, and noise analysis of Computed To-mography (CT) reconstruction algorithms using various High-Performance Computing (HPC) frameworks such as: "conventional" multi-core, multi threaded CPUs, Compute Unified Device Architecture (CUDA), and DirectX or OpenGL graphics pipeline programming. The proposed algorithms exploit various built-in hardwired features of GPUs such as rasterization and texture filtering. We compare implementations of the Filtered Back-Projection (FBP) algorithm with fan-beam geometry for all frameworks. The accuracy of the reconstruction is validated using an ACR-accredited phantom, with the raw attenuation data acquired by a clinical CT scanner. Our analysis shows that a single GPU can run a FBP reconstruction 23 time faster than a 64-core multi-threaded CPU machine for an image of 1024 X 1024. Moreover, directly programming the graphics pipeline using DirectX or OpenGL can further increases the performance compared to a CUDA implementation
    • …
    corecore