Search CORE

14,705 research outputs found

A Novel Hybrid Quicksort Algorithm Vectorized using AVX-512 on Intel Skylake

Author: Bramas Berenger
Publication venue: 'The Science and Information Organization'
Publication date: 01/10/2017
Field of study

The modern CPU's design, which is composed of hierarchical memory and SIMD/vectorization capability, governs the potential for algorithms to be transformed into efficient implementations. The release of the AVX-512 changed things radically, and motivated us to search for an efficient sorting algorithm that can take advantage of it. In this paper, we describe the best strategy we have found, which is a novel two parts hybrid sort, based on the well-known Quicksort algorithm. The central partitioning operation is performed by a new algorithm, and small partitions/arrays are sorted using a branch-free Bitonic-based sort. This study is also an illustration of how classical algorithms can be adapted and enhanced by the AVX-512 extension. We evaluate the performance of our approach on a modern Intel Xeon Skylake and assess the different layers of our implementation by sorting/partitioning integers, double floating-point numbers, and key/value pairs of integers. Our results demonstrate that our approach is faster than two libraries of reference: the GNU \emph{C++} sort algorithm by a speedup factor of 4, and the Intel IPP library by a speedup factor of 1.4.Comment: 8 pages, research pape

arXiv.org e-Print Archive

MPG.PuRe

Low Power Processor Architectures and Contemporary Techniques for Power Optimization – A Review

Author: Gujarathi Hemal S
McDonald-Maier Klaus D
Qadri Muhammad Yasir
Publication venue: 'Academy Publisher'
Publication date: 01/01/2009
Field of study

The technological evolution has increased the number of transistors for a given die area significantly and increased the switching speed from few MHz to GHz range. Such inversely proportional decline in size and boost in performance consequently demands shrinking of supply voltage and effective power dissipation in chips with millions of transistors. This has triggered substantial amount of research in power reduction techniques into almost every aspect of the chip and particularly the processor cores contained in the chip. This paper presents an overview of techniques for achieving the power efficiency mainly at the processor core level but also visits related domains such as buses and memories. There are various processor parameters and features such as supply voltage, clock frequency, cache and pipelining which can be optimized to reduce the power consumption of the processor. This paper discusses various ways in which these parameters can be optimized. Also, emerging power efficient processor architectures are overviewed and research activities are discussed which should help reader identify how these factors in a processor contribute to power consumption. Some of these concepts have been already established whereas others are still active research areas. © 2009 ACADEMY PUBLISHER

University of Essex Research Repository

CiteSeerX

Crossref

Half-buffer retiming and token cages for synchronous elastic circuits

Author: Ampalam
Austin
Bañeres
Blaauw
Bowman
Brej
Bufistov
Carloni
Carloni
Carmona
Casu
Collins
Cortadella
Cortadella
Ghosh
Jacobson
Júlvez
Kam
Li
Li
Lu
Lu
M.R. Casu
Reese
Publication venue: IET
Publication date: 01/01/2011
Field of study

Crossref

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

PORTO Publications Open Repository TOrino

High Performance Biological Pairwise Sequence Alignment: FPGA versus GPU versus Cell BE versus GPP

Author: Akoglu Ali
Benkrid Khaled
Ling Cheng
Liu Ying
Song Yang
Tian Xiang
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2012
Field of study

This paper explores the pros and cons of reconfigurable computing in the form of FPGAs for high performance efficient computing. In particular, the paper presents the results of a comparative study between three different acceleration technologies, namely, Field Programmable Gate Arrays (FPGAs), Graphics Processor Units (GPUs), and IBM’s Cell Broadband Engine (Cell BE), in the design and implementation of the widely-used Smith-Waterman pairwise sequence alignment algorithm, with general purpose processors as a base reference implementation. Comparison criteria include speed, energy consumption, and purchase and development costs. The study shows that FPGAs largely outperform all other implementation platforms on performance per watt criterion and perform better than all other platforms on performance per dollar criterion, although by a much smaller margin. Cell BE and GPU come second and third, respectively, on both performance per watt and performance per dollar criteria. In general, in order to outperform other technologies on performance per dollar criterion (using currently available hardware and development tools), FPGAs need to achieve at least two orders of magnitude speed-up compared to general-purpose processors and one order of magnitude speed-up compared to domain-specific technologies such as GPUs

Crossref

Directory of Open Access Journals

Edinburgh Research Explorer

System-on-Chip Design and Test with Embedded Debug Capabilities

Author: Marwah Tushti
Publication venue: TRACE: Tennessee Research and Creative Exchange
Publication date: 01/08/2006
Field of study

In this project, I started with a System-on-Chip platform with embedded test structures. The baseline platform consisted of a Leon2 CPU, AMBA on-chip bus, and an Advanced Encryption Standard decryption module. The basic objective of this thesis was to use the embedded reconfigurable logic blocks for post-silicon debug and verification. The System-on-Chip platform was designed at the register transistor level and implemented in a 180-nm IBM process. Test logic instrumentation was done with DAFCA (Design Automation for Flexible Chip Architecture) Inc. pre-silicon tools. The design was then synthesized using the Synopsys Design Compiler and placed and routed using Cadence SOC Encounter. Total transistor count is about 3 million, including 1400K transistors for the debug module serving as on chip logic analyzer. Core size of the design is about 4.8mm x 4.8mm and the system is working at 151MHz. Design verification was done with Cadence NCSim. The controllability and observability of internal signals of the design is greatly increased with the help of pre-silicon tools which helps locate bugs and later fix them with the help of post-silicon tools. This helps prevent re-spins on several occasions thus saving millions of dollars. Post-silicon tools have been used to program assertions and triggers and inject numerous personalities into the reconfigurable fabric which has greatly increased the versatility of the circuit

University of Tennessee, Knoxville: Trace

A fine-grain time-sharing Time Warp system

Author: Pellegrini Alessandro
Quaglia Francesco
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2017
Field of study

Although Parallel Discrete Event Simulation (PDES) platforms relying on the Time Warp (optimistic) synchronization protocol already allow for exploiting parallelism, several techniques have been proposed to further favor performance. Among them we can mention optimized approaches for state restore, as well as techniques for load balancing or (dynamically) controlling the speculation degree, the latter being specifically targeted at reducing the incidence of causality errors leading to waste of computation. However, in state of the art Time Warp systems, events’ processing is not preemptable, which may prevent the possibility to promptly react to the injection of higher priority (say lower timestamp) events. Delaying the processing of these events may, in turn, give rise to higher incidence of incorrect speculation. In this article we present the design and realization of a fine-grain time-sharing Time Warp system, to be run on multi-core Linux machines, which makes systematic use of event preemption in order to dynamically reassign the CPU to higher priority events/tasks. Our proposal is based on a truly dual mode execution, application vs platform, which includes a timer-interrupt based support for bringing control back to platform mode for possible CPU reassignment according to very fine grain periods. The latter facility is offered by an ad-hoc timer-interrupt management module for Linux, which we release, together with the overall time-sharing support, within the open source ROOT-Sim platform. An experimental assessment based on the classical PHOLD benchmark and two real world models is presented, which shows how our proposal effectively leads to the reduction of the incidence of causality errors, as compared to traditional Time Warp, especially when running with higher degrees of parallelism

ART

Archivio della ricerca- Università di Roma La Sapienza