
538 research outputs found

Dynamic instruction scheduling and data forwarding in asynchronous superscalar processors

Author: Mullins Robert D.
Publication venue: The University of Edinburgh
Publication date: 01/01/2001

Edinburgh Research Archive

Power reduction in superscalar datapaths through dynamic bit-slice activation

Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2001

Crossref

A VHDL model of a superscalar implementation of the DLX instruction set architecture

Author: Ferno Paul
Publication venue: RIT Scholar Works
Publication date: 01/10/1996

The complexity of today's microprocessors demands that designers have an extensive knowledge of superscalar design techniques; this knowledge is difficult to acquire outside of a professional design team. Presently, there are a limited number of adequate resources available for the student, both in textual and model form. The limited number of options available emphasizes the need for more models and simulators, allowing students the opportunity to learn more about superscalar designs prior to entering the work force. This thesis details the design and implementation of a superscalar version of the DLX instruction set architecture in behavioral VHDL. The branch prediction strategy, instruction issue model, and hazard avoidance techniques are all issues critical to superscalar processor design and are studied in this thesis. Preliminary test results demonstrate that the performance advantage of the superscalar processor is applicable even to short test sequences. Initial findings have shown a performance improvement of 26% to 57% for instruction sequences under 150 instructions.

RIT Scholar Works

SAMIE-LSQ: set-associative multiple-instruction entry load/store queue

Author: Abella Ferrer Jaume
González Colás Antonio María
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2006

The load/store queue (LSQ) is one of the most complex parts of contemporary processors. Its latency is critical for the processor performance and it is usually one of the processor hotspots. This paper presents a highly banked, set-associative, multiple-instruction entry LSQ (SAMIE-LSQ) that achieves high performance with small energy requirements. The SAMIE-LSQ classifies the memory instructions (loads and stores) based on the address to be accessed, and groups those instructions accessing the same cache line in the same entry. Our approach relies on the fact that many in-flight memory instructions access the same cache lines. Each SAMIE-LSQ entry has space for several memory instructions accessing the same cache line. This arrangement has a number of advantages. First, it significantly reduces the address comparison activity needed for memory disambiguation since there are fewer addresses to be compared. It also reduces the activity in the data TLB, the cache tag and cache data arrays. This is achieved by caching the cache line location and address translation in the corresponding SAMIE-LSQ entry once the access of one of the instructions in an entry is performed, so instructions that share an entry can reuse the translation, avoid the tag check and get the data directly from the concrete cache way without checking the others. Besides, the delay of the proposed scheme is lower than that required by a conventional LSQ. We show that the SAMIE-LSQ saves 82% dynamic energy for the load/store queue, 42% for the L1 data cache and 73% for the data TLB, with a negligible impact on performance (0.6%).
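
The grouping idea in this abstract can be illustrated with a minimal sketch. It is not the paper's design: it only shows how in-flight loads and stores might be bucketed by cache line so that instructions touching the same line share one entry. The 64-byte line size, the function names, and the sample addresses are all assumptions made for illustration; banking, set associativity, per-entry capacity limits, and the cached translation are left out.

```python
from collections import defaultdict

LINE_BYTES = 64  # assumed cache-line size; the abstract does not state one


def line_address(addr: int, line_bytes: int = LINE_BYTES) -> int:
    """Return the cache-line-aligned address for a byte address."""
    return addr - (addr % line_bytes)


def group_by_cache_line(accesses, line_bytes: int = LINE_BYTES):
    """Group (op, address) pairs so ops touching the same line share an entry."""
    entries = defaultdict(list)
    for op, addr in accesses:
        entries[line_address(addr, line_bytes)].append((op, addr))
    return dict(entries)


# Hypothetical in-flight memory instructions (addresses invented for the example).
in_flight = [("load", 0x1000), ("store", 0x1008), ("load", 0x1040), ("load", 0x1010)]
entries = group_by_cache_line(in_flight)
```

In this toy input, four instructions occupy two entries, so an incoming address would be matched against two line addresses rather than four individual ones.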

UPCommons. Portal del coneixement obert de la UPC

Towards a goal-oriented agent-based simulation framework for high-performance computing

Author: Cortés García Claudio Ulises
Garcia Gasulla Dario
Gnatyshak Dmitry
Oliva Felipe Luis Javier
Padget Julián
Vázquez Salceda Javier
Álvarez Napagao Sergio
Publication venue: 'IOS Press'
Publication date: 01/01/2019

Currently, agent-based simulation frameworks force the user to choose between simulations involving a large number of agents (at the expense of limited agent reasoning capability) or simulations including agents with increased reasoning capabilities (at the expense of a limited number of agents per simulation). This paper describes a first attempt at putting goal-oriented agents into large agent-based (micro-)simulations. We discuss a model for goal-oriented agents in High-Performance Computing (HPC) and then briefly discuss its implementation in PyCOMPSs (a library that eases the parallelisation of tasks) to build such a platform that benefits from a large number of agents with the capacity to execute complex cognitive agents.

arXiv.org e-Print Archive

UPCommons. Portal del coneixement obert de la UPC

Improving latency tolerance of multithreading through decoupling

Author: González Colás Antonio María
Parcerisa Bundó Joan Manuel
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2001

The increasing hardware complexity of dynamically scheduled superscalar processors may compromise the scalability of this organization to make efficient use of future increases in transistor budget. SMT processors, designed over a superscalar core, are therefore directly concerned by this problem. The article presents and evaluates a novel processor microarchitecture which combines two paradigms: simultaneous multithreading and access/execute decoupling. Since its decoupled units issue instructions in order, this architecture is significantly less complex, in terms of critical path delays, than a centralized out-of-order design, and it is more effective for future growth in issue-width and clock speed. We investigate how both techniques complement each other. Since decoupling features an excellent memory latency hiding efficiency, the large amount of parallelism exploited by multithreading may be used to hide the latency of functional units and keep them fully utilized. The study shows that, by adding decoupling to a multithreaded architecture, fewer threads are needed to achieve maximum throughput. Therefore, in addition to the obvious hardware complexity reduction, it places lower demands on the memory system. The study also reveals that multithreading by itself exhibits little memory latency tolerance. Results suggest that most of the latency hiding effectiveness of SMT architectures comes from the dynamic scheduling. On the other hand, decoupling is very effective at hiding memory latency. An increase in the cache miss penalty from 1 to 32 cycles reduces the performance of a 4-context multithreaded decoupled processor by less than 2 percent. For the nondecoupled multithreaded processor, the loss of performance is about 23 percent.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

A First Practical Fully Homomorphic Crypto-Processor Design: The Secret Computer is Nearly Here

Author: Bowen Jonathan
Breuer Peter
Publication venue: 'Elsevier BV'
Publication date: 02/03/2016

Following a sequence of hardware designs for a fully homomorphic crypto-processor - a general purpose processor that natively runs encrypted machine code on encrypted data in registers and memory, resulting in encrypted machine states - proposed by the authors in 2014, we discuss a working prototype of the first of those, a so-called 'pseudo-homomorphic' design. This processor is in principle safe against physical or software-based attacks by the owner/operator of the processor on user processes running in it. The processor is intended as a more secure option for those emerging computing paradigms that require trust to be placed in computations carried out in remote locations or overseen by untrusted operators. The prototype has a single-pipeline superscalar architecture that runs OpenRISC standard machine code in two distinct modes. The processor runs in the encrypted mode (the unprivileged, 'user' mode, with a long pipeline) at 60-70% of the speed in the unencrypted mode (the privileged, 'supervisor' mode, with a short pipeline), emitting a completed encrypted instruction every 1.67-1.8 cycles on average in real trials.

arXiv.org e-Print Archive

Elsevier - Publisher Connector