    The "MIND" Scalable PIM Architecture

    MIND (Memory, Intelligence, and Network Device) is an advanced parallel computer architecture for high-performance computing and scalable embedded processing. It is a Processor-in-Memory (PIM) architecture that integrates DRAM bit cells and CMOS logic devices on the same silicon die. MIND is multicore, with multiple memory/processor nodes on each chip, and supports global shared memory across systems of MIND components. MIND is distinguished from other PIM architectures by its mechanisms for efficiently supporting a global parallel execution model based on the semantics of message-driven multithreaded split-transaction processing. MIND is designed to operate either in conjunction with conventional microprocessors or in standalone arrays of like devices. It also incorporates mechanisms for fault tolerance, real-time execution, and active power management. This paper describes the major elements and operational methods of the MIND architecture.
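
    To make that execution model concrete, the following C sketch shows message-driven split-transaction processing in miniature: work travels to the node that owns the data as a message (a "parcel"), and that node runs a short handler thread against its local memory rather than the requester stalling on a remote access. Every name here (parcel_t, node_run, and so on) is an illustrative assumption, not the actual MIND interface.

        /* Minimal sketch of message-driven split-transaction processing.
         * All types and names are illustrative, not the MIND ISA. */
        #include <stdio.h>

        typedef struct {            /* a "parcel": a message carrying work */
            int  dest_node;         /* memory/processor node owning the data */
            long addr;              /* address in that node's local DRAM */
            void (*handler)(long);  /* thread body to run at the destination */
        } parcel_t;

        #define QCAP 64
        static parcel_t queue[QCAP];   /* one node's incoming parcel queue */
        static int head, tail;

        static void send_parcel(parcel_t p) { queue[tail++ % QCAP] = p; }

        /* Split transaction: the requester never stalls; the owning node
         * activates a short thread for each parcel it receives. */
        static void node_run(void) {
            while (head != tail) {
                parcel_t p = queue[head++ % QCAP];
                p.handler(p.addr);  /* message-driven thread activation */
            }
        }

        static void inc_handler(long addr) { printf("increment word at %ld\n", addr); }

        int main(void) {
            send_parcel((parcel_t){ .dest_node = 3, .addr = 0x1000, .handler = inc_handler });
            node_run();
            return 0;
        }

    The point of the split transaction is that the requester's pipeline is never held across the memory round trip; any response would simply arrive as a further parcel.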

    On the Distribution of Control in Asynchronous Processor Architectures

    The effective performance of computer systems is to a large measure determined by the synergy between the processor architecture, the instruction set and the compiler. In the past, the sequencing of information within processor architectures has normally been synchronous: controlled centrally by a clock. However, this global signal could limit the future gains in performance that can potentially be achieved through improvements in implementation technology. This thesis investigates the effects of relaxing this strict synchrony by distributing control within processor architectures through the use of a novel asynchronous design model known as a micronet. The impact of asynchronous control on the performance of a RISC-style processor is explored at different levels. Firstly, improvements in the performance of individual instructions by exploiting actual run-time behaviours are demonstrated. Secondly, it is shown that micronets are able to exploit further (both spatial and temporal) instruction-level parallelism (ILP) efficiently through the distribution of control to datapath resources. Finally, exposing fine-grain concurrency within a datapath can only be of benefit to a computer system if it can easily be exploited by the compiler. Although compilers for micronet-based asynchronous processors may be considered more complex than their synchronous counterparts, it is shown that the variable execution time of an instruction does not adversely affect the compiler's ability to schedule code efficiently. In conclusion, modelling a processor's datapath as a micronet permits the exploitation of both fine-grain ILP and actual run-time delays, leading to the efficient utilisation of functional units and, in turn, an improvement in overall system performance.
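
    The central idea, replacing a global clock with local request/acknowledge handshakes between datapath resources, can be simulated in a few lines of C. The channel below stands in for one handshake wire pair between a producer and a consumer stage; it is a sketch under assumed names (channel_t, produce, consume), not the thesis's micronet model.

        /* Illustrative request/acknowledge handshake in place of a clock. */
        #include <stdio.h>
        #include <stdbool.h>

        typedef struct {
            bool req, ack;   /* handshake wires between two stages */
            int  data;       /* value being transferred */
        } channel_t;

        /* Producer: asserts req once data is valid; the consumer's ack,
         * not a clock edge, signals that the transfer completed. */
        static void produce(channel_t *ch, int value) {
            ch->data = value;
            ch->req  = true;
        }

        /* Consumer: latches data when req is seen, acknowledges, and then
         * computes at its own speed -- no global synchronization needed. */
        static bool consume(channel_t *ch, int *out) {
            if (!ch->req) return false;   /* nothing offered yet */
            *out = ch->data;
            ch->ack = true;
            ch->req = false;
            return true;
        }

        int main(void) {
            channel_t ch = {0};
            int x;
            produce(&ch, 42);
            if (consume(&ch, &x))
                printf("stage consumed %d asynchronously\n", x);
            return 0;
        }

    Because each unit completes as soon as its own handshake fires, fast instructions are not stretched to a worst-case clock period, which is the run-time behaviour the thesis exploits.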

    Instruction scheduling in micronet-based asynchronous ILP processors

    Performance mapping of a class of fully decoupled architecture

    Limits of a decoupled out-of-order superscalar architecture

    Procedure graphs and computer optimizations

    by Ho Kei Shiu Edward. Thesis (M.Phil.), Chinese University of Hong Kong, 1992. Includes bibliographical references (leaves 199-202). Contents:
    Chapter 1, Introduction: initial motivations; objectives of the study; outline of the thesis.
    Chapter 2, Basics of the Procedure Graph Theory: nodes, arcs and pseudo-time labels; examples; the meanings of the pseudo-time labels; equivalence and transformation (transmission tracks and causality preservation; serial-to-parallel (SP), parallel-to-serial (PS) and store-store cancellation (SSC) transformations; normalization of pseudo-time labels; boundary conditions and multi-level pseudo-time labels); procedure graph optimizations (representing and eliminating unnecessary dependencies); a simulation program (preliminary study; economic factors; combinatorial explosion of procedure graphs).
    Chapter 3, Extending the Procedure Graph Theory: the T- and F-operators; modifying the firing rule; representing branch strategies (multiple-path execution; conditional execution with delayed commitment of results; speculative execution with register backup and branch repair); representing a stack; vector forwarding (vector chaining in the Cray-1; vector SP, PS and SSC; algorithmic time labels).
    Chapter 4, Hardware Realization of Procedure Graph Optimizations: node-oriented versus arc-oriented representation; backward versus forward pointers; backward pointers as hardware tags; pointer algebra for the SP, SSC and PS transformations; drawbacks of backward pointers; multiple tags.
    Chapter 5, A Backward-Pointer Representation Scheme, the T-Architecture: local addressing space within the CPU; reservation stations; memory data forwarding (the updating buffer; store-after-store and store-after-load ordering and consistency); speculative execution (procedural dependencies; branch instruction format, prediction and branch instruction unit; register backups and branch repair; total ordering of memory stores; simplifying the checkpoint repair mechanism); a simulator for the T-Architecture (configuration, parameters and benchmark programs); four experiments.
    Chapter 6, Predictive Procedure Graph Optimizations in the S-Prototype: keys to higher performance; the superscalar approach; processor architecture and design strategies of the S-Prototype (fetching multiple instructions; handling procedural dependencies with a branch unit, branch predicting buffer and branch repair; extensive tagging and result forwarding; handling static dependencies with the multitag pool and dynamic dependencies with the reservation stations; extracting parallelism and implementing the transformation rules; out-of-order issue and execution; memory accesses; bus contention and arbitration).
    Chapter 7, An Attempt to Simulate Procedure Graphs Using Graph Grammar: basic concepts of sequential graph grammar (production rules and interface graphs; gluing constructions, pushouts and gluing conditions); initial considerations and an example; problems encountered and insights into the unsolved problem; parallelism, concurrency and new transformation rules.
    Chapter 8, Representing Causality Using Petri Nets: defining Petri nets (Petri nets as a system-modeling tool; characteristics; useful extensions); program analysis and modeling computer operations (representing causality relationships; representing the total ordering of instructions in a sequential program); extending the model; comparing procedure graphs and Petri nets.
    Chapter 9, Conclusion and Future Research Directions: formalizing the procedure graph theory; mathematical properties of procedure graphs; register abuses; hardware representation of procedure graphs; tags describing tags; software optimizations; simulation programs.
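
    As an illustration of one transformation named above, the C sketch below applies store-store cancellation (SSC) to a linear trace of memory operations: a store can be cancelled when the next access to the same location is another store, since no load ever observes the first value. The flat op_t array is a deliberate simplification; the thesis works on procedure graphs with pseudo-time labels, not traces.

        /* Hedged sketch of store-store cancellation over a linear trace. */
        #include <stdio.h>
        #include <stdbool.h>

        typedef enum { LOAD, STORE } kind_t;
        typedef struct { kind_t kind; int loc; bool dead; } op_t;

        /* A STORE is dead if the first later access to its location is
         * another STORE (no intervening LOAD reads the stored value). */
        static void ssc(op_t *ops, int n) {
            for (int i = 0; i < n; i++) {
                if (ops[i].kind != STORE) continue;
                for (int j = i + 1; j < n; j++) {
                    if (ops[j].loc != ops[i].loc) continue;
                    if (ops[j].kind == STORE) ops[i].dead = true;
                    break;  /* the first later access decides */
                }
            }
        }

        int main(void) {
            op_t trace[] = { {STORE, 1, false}, {STORE, 1, false}, {LOAD, 1, false} };
            ssc(trace, 3);
            for (int i = 0; i < 3; i++)
                printf("op %d: %s%s\n", i,
                       trace[i].kind == STORE ? "store" : "load",
                       trace[i].dead ? " (cancelled)" : "");
            return 0;
        }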

    Design of a distributed memory unit for clustered microarchitectures

    Power constraints ended the exponential growth in single-processor performance that characterized the semiconductor industry for many years. Single-chip multiprocessors have allowed performance growth to continue so far. Yet Amdahl's law asserts that the overall performance of future single-chip multiprocessors will depend crucially on single-processor performance, so in a multiprocessor even a small gain in single-processor performance can justify the use of significant resources. Partitioning the layout of critical components can improve the energy efficiency, and ultimately the performance, of a single processor. In a clustered microarchitecture, parts of these components form clusters; instructions are processed locally in the clusters and benefit from the clusters' smaller size and complexity. Because the clusters together process a single instruction stream, communication between clusters is necessary and introduces an additional cost.
    This thesis proposes the design of a distributed memory unit and first-level cache in the context of a clustered microarchitecture. While the partitioning of other parts of the microarchitecture has been well studied, the distribution of the memory unit and the cache has received comparatively little attention. The first proposal consists of a set of cache bank predictors; eight predictor designs are compared on cost and accuracy. The second proposal is the distributed memory unit itself: the load and store queues are split into smaller queues for distributed disambiguation, and the mapping of memory instructions to cache banks is delayed until addresses have been calculated. We show how disambiguation can be implemented efficiently with unordered queues, and a bank predictor is used to map instructions that consume memory data near the data's origin. This organization significantly reduces both energy usage and latency. The third proposal introduces Dispatch Throttling and Pre-Access Queues, mechanisms that avoid the load/store queue overflows caused by the late allocation of entries. The fourth proposal introduces Memory Issue Queues, which add to the memory unit the functionality to select instructions for execution and re-execution. The fifth proposal introduces Conservative Deadlock-Aware Entry Allocation, a deadlock-safe issue policy for the Memory Issue Queues; deadlocks can result from certain queue allocations because entries are allocated out of order rather than in order as in traditional architectures. The sixth proposal is the Early Release of Load Queue Entries: architectures with weak memory ordering, such as Alpha, PowerPC or ARMv7, can use this mechanism to release load queue entries before the commit stage. Together, these proposals allow significantly smaller and more energy-efficient load queues without the need for energy-hungry recovery mechanisms and without performance penalties. Finally, we present a detailed study that compares the proposed distributed memory unit to a centralized one and confirms its advantages in reduced energy usage and improved performance.
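
    To give a flavour of the first proposal, the sketch below implements one plausible cache bank predictor: a small table indexed by the load's PC that predicts whichever bank the instruction touched last time, updated once the address is actually computed. The thesis compares eight designs; this table, its size, and the bank-from-address mapping here are illustrative assumptions rather than any of the eight.

        /* One plausible (assumed) last-bank predictor, indexed by load PC. */
        #include <stdio.h>
        #include <stdint.h>

        #define ENTRIES 256
        #define BANKS   4
        static uint8_t last_bank[ENTRIES];   /* per-PC last observed bank */

        /* Predict before the address is known, so the instruction can be
         * steered toward the cluster holding the likely bank. */
        static unsigned predict(uint64_t pc) {
            return last_bank[(pc >> 2) % ENTRIES];
        }

        /* Once the address is calculated, record the true bank. */
        static void update(uint64_t pc, uint64_t addr) {
            last_bank[(pc >> 2) % ENTRIES] = (addr >> 6) % BANKS; /* 64 B lines */
        }

        int main(void) {
            uint64_t pc = 0x400100, addr = 0x80c0;
            printf("predicted bank %u\n", predict(pc));
            update(pc, addr);
            printf("after update: bank %u\n", predict(pc));
            return 0;
        }

    A last-bank table like this trades a little accuracy for very low cost; misprediction only costs an extra inter-cluster communication, not incorrect execution.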

    Techniques for Efficient Implementation of FIR and Particle Filtering

    Proceedings of the 21st Conference on Formal Methods in Computer-Aided Design – FMCAD 2021

    The Conference on Formal Methods in Computer-Aided Design (FMCAD) is an annual conference on the theory and applications of formal methods in hardware and system verification. FMCAD provides a leading forum for researchers in academia and industry to present and discuss groundbreaking methods, technologies, theoretical results, and tools for reasoning formally about computing systems. FMCAD covers formal aspects of computer-aided system design, including verification, specification, synthesis, and testing.