Guarded atomic actions and refinement in a system-on-chip
development flow: bridging the specification gap with Event-B by Colley, John
UNIVERSITY OF SOUTHAMPTON
Guarded Atomic Actions and
Reﬁnement in a System-on-Chip
Development Flow: Bridging the
Speciﬁcation Gap with Event-B
by
John Larry Colley
A thesis submitted in partial fulﬁllment for the
degree of Doctor of Philosophy
in the
Faculty of Engineering, Science and Mathematics
School of Electronics and Computer Science
November 2010
UNIVERSITY OF SOUTHAMPTON
ABSTRACT
FACULTY OF ENGINEERING, SCIENCE AND MATHEMATICS
SCHOOL OF ELECTRONICS AND COMPUTER SCIENCE
Doctor of Philosophy
by John Larry Colley
Modern System-on-Chip (SoC) hardware design puts considerable pressure on existing
design and verification flows, languages and tools. The Register Transfer Level (RTL)
description, which forms the input for synchronous, logic synthesis-driven design, is at too
low a level of abstraction for efficient architectural exploration and re-use. The existing
methods for taking a high-level paper specification and refining this specification to an
implementation that meets its performance criteria are largely manual and error-prone
and, as RTL descriptions get larger, a systematic design method is necessary to address
explicitly the timing issues that arise when applying logic synthesis to such large blocks.
Guarded Atomic Actions have been shown to offer a convenient notation for describing
microarchitectures that is amenable to formal reasoning and high-level synthesis.
Event-B is a language and method that supports the development of specifications with
automatic proof and refinement, based on guarded atomic actions. Latency-insensitive
design ensures that a design composed of functionally correct components will be
independent of communication latency. A method has been developed which uses Event-B
for latency-insensitive SoC component and sub-system design which can be combined
with high-level component synthesis to enable architectural exploration and re-use at
the specification level and to close the specification gap in the SoC hardware flow.
Contents
Acknowledgements
1 Introduction
2 Background
2.1 Board-level Design and Verification
2.2 Chip-level Design and Verification
2.3 EDA Design and Verification Languages
2.3.1 Modelling Languages
2.3.2 Test Languages
2.3.3 Property Languages
2.4 EDA Tools
2.4.1 Simulation-based Verification
2.4.2 Disadvantages of Simulation
2.4.3 Formal Verification
2.4.4 Disadvantages of Current Model Checking Solutions
2.5 Higher-level Modelling Abstractions
2.5.1 Behavioural Modelling
2.5.2 Transaction-Level Modelling (TLM)
2.6 SystemC Transaction Level Modelling
2.6.1 TLM level 3
2.6.2 TLM level 2
2.6.2.1 Modelling TLM level 2 Processes
2.6.2.2 Modelling TLM level 2 Communications
2.6.3 TLM level 1
2.6.4 TLM Architectural Exploration
2.7 Microprocessor Pipeline Verification
2.8 Modelling Concurrency
2.8.1 Modelling Concurrency with Partial Orders
2.8.2 Modelling Concurrency with Guarded Atomic Actions
2.8.3 Bluespec
2.8.4 CAL
2.9 Event-B
2.9.1 Introduction
2.9.2 Refinement
2.9.3 Decomposition
2.9.4 Variants
2.9.5 Records
2.9.6 Witnesses
3 Enhancing the SoC Hardware Design Flow with Event-B
3.1 Background to the Existing Flow
3.1.1 Logic Synthesis
3.1.2 Formal RTL Verification
3.1.3 RTL Architectural Exploration
3.1.4 Closing the Gap: Behavioural Synthesis
3.1.5 Closing the Gap: High-level Synthesis with Term Rewriting Systems
3.2 Closing the Gap: The Event-B method with TRS
3.2.1 Event-B Specification Refinement in the SoC Flow
3.2.2 Architectural Exploration at the Specification Level
4 Developing SoC Components
4.1 Restrictions on SoC Component Size
4.2 State Machines
4.3 Case Study: Developing an EFSM for Huffman Encoding/Decoding
4.3.1 The Abstract Model
4.3.2 The First Refinement
4.3.3 The Second Refinement
4.3.4 The Third Refinement
4.4 Pipelines
4.4.1 Modern SoC Microprocessor Pipelines
4.4.2 Designing and Verifying an SoC Microprocessor Pipeline
4.4.3 A Pipeline Example: Counting Playing Cards
4.4.4 Event Simultaneity in Pipelines
4.4.5 Measuring Pipeline Complexity at the Specification Level
4.4.6 Pipeline Feedback
4.4.7 An Alternative Compositional Approach to Pipeline Refinement
4.4.8 Pipeline Decomposition
4.4.9 Generalising the Approach to Pipeline Verification with Event-B
5 Developing an SoC Pipelined Microprocessor Model
5.1 Modelling DLX with Event-B
5.1.1 Instruction Fetch (IF)
5.1.2 Instruction Decode (ID)
5.1.3 Execute (EX)
5.1.4 Memory Access (MEM)
5.1.5 Writeback (WB)
5.2 A General Overview of the Method
5.3 Abstracting and Refining the Arithmetic Instruction
5.3.1 The Abstract ISA Model
5.3.2 The First Refinement: a 2-stage pipeline
5.3.3 Detecting the RAW Hazard
5.3.4 Dealing Correctly with the RAW Hazard
5.3.5 The Second Refinement: a 3-stage pipeline
5.3.6 Shared Event Decomposition of the Feedback Loop
5.3.7 The Third Refinement: a 4-stage pipeline
5.3.8 Generalising the ArithRR model
5.4 Abstracting and Refining the Branch Instruction
5.4.1 The Abstract ISA Model
5.4.2 The First Refinement
5.4.3 The Second Refinement: a 2-stage pipeline
5.4.4 The Third Refinement: a 3-stage pipeline
5.4.5 The Fourth Refinement: a 4-stage pipeline
5.4.6 The Fifth Refinement
5.4.7 Pipeline Decomposition
5.5 Pipeline Instruction Composition
5.6 Measuring Pipeline Complexity at the Specification Level
5.6.1 Combined State Machine Arc Coverage
5.6.2 Combined State Machine Path Coverage
5.7 Component Re-use in Pipeline Specifications
5.7.1 Parameters and Witnesses in Component Re-use
5.8 A Review of the Pipeline Development Method
5.8.1 Modelling Data
5.8.2 Modelling Control
5.8.3 Advancing from Modelling to Proving: Problem Decomposition
5.8.4 Automatic Proof
5.8.5 Managing Architectural Complexity
5.8.6 Measuring Architectural Complexity
5.8.7 Memory Accesses
6 Memory Accesses: Managing Component Latency
6.0.8 Modelling Synchronous Memory
6.0.9 The IP Look-up Circular Pipeline
7 Developing SoC Sub-Systems
7.1 Managing Design Complexity
7.2 Asynchronous Design and Transaction-Level Modelling
7.3 Unit-Transaction Level Design
7.4 Latency-Insensitive Design
7.5 Synchronous Elastic Design
7.6 Microarchitectural Exploration: Introducing Synchronous Elastic Buffering with Event-B
7.6.1 The Synchronous Elastic Buffer: An Abstract Specification in Event-B
7.6.2 Synchronous Buffering with Forwarding
7.6.3 Synchronous Elastic Buffering with Stalling
7.6.4 Shared Event Pipeline Decomposition
7.6.5 A Review of the use of Synchronous Elastic Design with Event-B
8 Conclusions
A Published Papers
List of Figures
2.1 Simulation Based Verification
2.2 Model Checking
2.3 Counter
2.4 TLM Level 2 Master/Slave Communication
2.5 channel behaviour
3.1 SoC Flow
3.2 The Role of RTL Synthesis
3.3 The Role of Formal RTL Checking
3.4 RTL Architectural Exploration
3.5 Behavioural Synthesis Flow
3.6 TRS Processor Instruction Rule
3.7 TRS Synthesis Flow
3.8 TRS Architectural Exploration
3.9 Event-B and TRS flow
3.10 Event-B Architectural Exploration
4.1 Counter
4.2 HUFFMAN Encoding Tree
4.3 HUFFMAN Partial Order
4.4 Huffman EFSM
4.5 The Pipeline
4.6 Pipeline Execution
4.7 Even Number Generator
4.8 Pipeline without Feedback
4.9 Pipeline with Feedback
4.10 Abstract Model
4.11 First Refinement
4.12 Abstract Model
4.13 First Refinement
4.14 Component Q and its Environment
4.15 Initial Component Composition
4.16 Pipelined Implementation
5.1 DLX 5-stage Pipeline
5.2 Event-B DLX Instructions
5.3 Abstract Machine: Microarchitecture
5.4 Refined Machine: Microarchitecture
5.5 Successive Instructions can Interfere
5.6 Refined Machine with Forwarding: Microarchitecture
5.7 Second Refinement: Microarchitecture
5.8 Instructions one apart can Interfere
5.9 Second Refinement with Forwarding: Microarchitecture
5.10 Third Refinement: Microarchitecture
5.11 ArithRR 4-stage Pipeline
5.12 AbstractMachine: Microarchitecture
5.13 Refined Machine: Microarchitecture
5.14 Second Refinement
5.15 Third Refinement
5.16 Fourth Refinement
5.17 Fifth Refinement: Combined State Machine
5.18 Fifth Refinement: Events
5.19 Branch 4-stage Pipeline
5.20 The Pipeline Development Flow
5.21 ArithRR 4-stage Pipeline
5.22 Branch 4-stage Pipeline
5.23 Composed 4-stage Pipeline
5.24 Combined State Machine Arcs
5.25 Branch Refinement Diagram: 4th Refinement, No RAW
5.26 NoBranch Refinement Diagram: 4th Refinement, No RAW
5.27 Abstract Machine: Microarchitecture
5.28 First Refinement: Components
5.29 First Refinement: Component Composition
5.30 Second Refinement: Parameter Instantiation
5.31 Third Refinement: Component Composition
5.32 Third Refinement: Parameter Instantiation
6.1 Managing RAM Latency
6.2 SDRAM lookup
6.3 IP Lookup Circular Pipeline
7.1 Latency Insensitive Protocol
7.2 The SELF Protocol
7.3 Connecting 2 Elastic Components
7.4 Abstract Synchronous Elastic Buffer
7.5 Abstract Machine: Microarchitecture
7.6 Refined Machine with forwarding: Microarchitecture
7.7 Refined Machine with buffers and forwarding: Microarchitecture
7.8 Refined Machine with synchronous elastic buffers: Microarchitecture
7.9 Synchronous Elastic Buffers: Stalling mechanism
7.10 Synchronous elastic buffer decomposition: Microarchitecture
List of Tables
4.1 Huffman Proofs
4.2 Abstract Model Events
4.3 Refined Model Merged Events
4.4 Pipeline Proofs
4.5 2-stage 2-event pipeline
4.6 2-stage 4-event pipeline
4.7 2-stage m+n-event pipeline
4.8 Verilog RTL Processes
5.1 DLX Pipeline Structure
5.2 Pipeline Proofs
5.3 Pipeline Proofs
6.1 IP Lookup State Machine
6.2 IP Lookup Proofs
7.1 SELF State Machine
Academic Thesis: Declaration Of Authorship
I, John Larry Colley
declare that this thesis and the work presented in it are my own and has been generated by me as the result of 
my own original research.
Guarded Atomic Actions and Refinement in a System-on-Chip Development Flow: Bridging the Specification Gap 
with Event-B
 I confirm that:
1. This work was done wholly or mainly while in candidature for a research degree at this University;
2. Where any part of this thesis has previously been submitted for a degree or any other qualification at this 
University or any other institution, this has been clearly stated;
3. Where I have consulted the published work of others, this is always clearly attributed;
4. Where I have quoted from the work of others, the source is always given. With the exception of such 
quotations, this thesis is entirely my own work;
5. I have acknowledged all main sources of help;
6. Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was 
done by others and what I have contributed myself;
7. Either none of this work has been published before submission, or parts of this work have been published as:
 
On Proving with Event-B that a Pipelined Processor Model Implements its ISA Specification
Colley, J and Butler, M
Refinement Based Methods for the Construction of Dependable Systems
Dagstuhl, Germany
September 2009
Signed:
Date:
Acknowledgements
I would like to thank my supervisors Michael Butler and João Marques-Silva for their
guidance and Edd Turner, Lis Ball, Renato Silva and Ross Horne for their help during
my research.
Chapter 1
Introduction
The complexity of modern System-on-Chip (SoC) hardware stretches the existing design
and verification flows, languages and tools to the limit of their capabilities [Asanovic
et al., 2006], [Sylvester and Keutzer, 2001]. Verification takes a larger and larger
proportion of the overall effort and it is often very late in the design process that timing issues,
resulting from the very small feature sizes of modern silicon processes, are encountered
and can only be corrected by substantial re-design [Carloni and Sangiovanni-Vincentelli,
2002]. The commensurate reduction in the predictability of the verification closure
process means that there is a clear need to enhance existing design flows to be better able
to manage this increased complexity, without losing the well-established benefits that
have driven successful synchronous design.
First, the Register Transfer Level (RTL) description, which forms the input for
synchronous, logic synthesis-driven design [Devadas et al., 1994], is at too low a level of
abstraction for efficient architectural exploration and re-use [Hoe, 2000]. Second, the
existing methods for taking a high-level paper specification and refining this specification
to an implementation that meets its performance criteria are largely manual and
error-prone [Arvind et al., 2004]. Third, as RTL descriptions get larger, a systematic design
method is necessary to address explicitly the timing issues that arise when applying logic
synthesis to such large blocks [Asanovic, 2007], [Rose et al., 2005].
Term Rewriting Systems (TRS) [Hoe and Arvind, 1999] have been shown to offer a
convenient notation, as a set of guarded atomic actions, for describing microarchitectures
that is amenable to formal reasoning and is used by the industrial tool Bluespec
[Nikhil, 2004] for high-level synthesis. Event-B is a language and method [Abrial et al.,
2006] that supports the development of specifications with automatic proof and
refinement, also based on guarded atomic actions. Latency-insensitive design [Carloni and
Sangiovanni-Vincentelli, 2002] is a method for assembling and managing the
communication between components in complex synchronous hardware systems which ensures
that a design composed of functionally correct components will be independent of
communication latency.
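As an informal illustration of the guarded atomic action style mentioned above, the following Python sketch models a system as a set of rules, each with a guard and an action, and fires one enabled rule at a time. It is not taken from the thesis: the names `Rule`, `run`, `GCD_RULES` and `gcd` are hypothetical, the greatest-common-divisor example is only a familiar small case, and choosing the first enabled rule is a simplifying assumption where a rewriting system would allow any enabled rule to fire.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, int]


@dataclass(frozen=True)
class Rule:
    """A guarded atomic action: the action may fire only when the guard holds."""
    name: str
    guard: Callable[[State], bool]
    action: Callable[[State], State]


def run(state: State, rules: List[Rule], max_steps: int = 10_000) -> State:
    """Fire one enabled rule per step until no guard holds."""
    for _ in range(max_steps):
        enabled = [rule for rule in rules if rule.guard(state)]
        if not enabled:
            return state  # quiescent: no rule can fire
        # Assumption: pick the first enabled rule; each firing is atomic
        # because the whole next state is computed from the current one.
        state = enabled[0].action(state)
    raise RuntimeError("no quiescent state reached within max_steps")


# Hypothetical example: Euclid's algorithm by repeated subtraction.
GCD_RULES = [
    Rule("swap",
         lambda s: s["x"] > s["y"] and s["y"] != 0,
         lambda s: {"x": s["y"], "y": s["x"]}),
    Rule("subtract",
         lambda s: s["x"] <= s["y"] and s["y"] != 0,
         lambda s: {"x": s["x"], "y": s["y"] - s["x"]}),
]


def gcd(a: int, b: int) -> int:
    """Greatest common divisor of a positive a and a non-negative b."""
    return run({"x": a, "y": b}, GCD_RULES)["x"]
```

Because the two guards are mutually exclusive, the order in which the rules are listed does not affect the result in this particular example.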
In this work a method has been developed which uses Event-B for latency-insensitive SoC
component and sub-system design which can be combined with high-level component
synthesis to enable architectural exploration and re-use at the speciﬁcation level and to
close the speciﬁcation gap in the SoC hardware ﬂow. It will be shown that an abstract
speciﬁcation can be reﬁned systematically to a level where it is possible to represent
and reason about the key elements that the hardware designer employs in developing
an architecture to meet the required constraints on performance and power. These key
elements include pipelining strategies, the size of components, and the communication
mechanisms employed between components. The goals of this work are threefold.
First, to show that alternative pipeline architectures within components can be devel-
oped from a high-level speciﬁcation, compared and veriﬁed systematically. The challenge
here is that all possible combinations of pipeline activity must be identiﬁed and explored
exhaustively.
Second, that a high-level model for latency-insensitive communication can be devel-
oped that can be used at the speciﬁcation level to verify formally that communicating
components obey the latency-insensitive protocol and will not cause deadlock.
Third, that the speciﬁcation can be reﬁned to a level of abstraction that matches that
required for input to high-level or RTL synthesis.
Chapter 2 provides background information, covering the technical and ﬁnancial moti-
vation for the adoption of veriﬁcation tools and the need for greater conﬁdence in the
quality of a design before sign-off and manufacture. It then looks at current verification
technologies and solutions, their strengths and weaknesses, and the recent emergence
of new technologies to address these weaknesses. It also covers the new and evolv-
ing standard for Transaction-Level Modelling (TLM) in SystemC and how it augments
the existing hardware design ﬂow. It goes on to provide a background to methods for
modelling concurrency, including the use of guarded atomic actions for modelling and
high-level synthesis, and an introduction to the Event-B language and method.
The contribution of Chapter 3 is to show how the existing SoC synchronous hardware
design ﬂow can be enhanced by the seamless introduction of a formal speciﬁcation and
veriﬁcation method that addresses the gap that exists in this ﬂow between the speci-
ﬁcation and the RTL description used for logic synthesis. It looks at the reasons that
behavioural synthesis has not in general helped to close this gap, and how guarded atomic
actions can be used to provide efficient high-level synthesis and architectural exploration.
It then provides an overview of how the Event-B formal method can be combined with
RTL and high-level guarded atomic action synthesis to close the specification gap. This
method is then elaborated in detail in the subsequent chapters.
The contribution described in Chapter 4 is an Event-B based method that can be used
for developing SoC component speciﬁcations that takes into account the restrictions
that hardware process technology places on component size. It then shows how the
method can be applied to pipelined architectures and explores the implications that
such architectures have for how simultaneous pipeline events are managed.
The contribution of Chapter 5 is to show how an SoC microprocessor pipelined implementation
can be derived formally from its abstract specification using the Event-B based
method described in Chapter 4.
Chapter 6 addresses the issues of latency introduced by memory sub-systems. The
contribution of this chapter is to show how latency and out-of-order completion can be
modelled formally in an IP Lookup circular pipeline.
Chapter 7 covers the emerging latency-insensitive protocols which can de-couple sub-
system design from the complex timing interaction that can occur between components.
The contribution is an extension of the method described in Chapter 4 to support
latency-insensitive design with formal proof.
Chapter 2
Background
2.1 Board-level Design and Veriﬁcation
Before the widespread adoption of complementary metal oxide semiconductor (CMOS)
technology [Horowitz and Hill, 1989], the complexity of application-speciﬁc hardware
designs was limited by transistor size. Transistor-Transistor Logic (TTL) [Horowitz and
Hill, 1989] packaged 4 NAND gates in a 2cm x 1cm chip so that approximately 500 gates
could be placed on a board. Using multiple boards it was possible to build designs with
a few thousand gates.
The advantage, however, of using discrete components was that it was very easy to build
a working prototype of the design which could be run at full speed on a tester, and in the
event of errors in the design it was possible to insert tester probes anywhere on the board
to help locate the errors [Williams, 1986]. Errors could be ﬁxed by replacing components
or inserting wire patches. Test patterns were developed by hand, and once the prototype
passed the tests, track layout could be performed and the boards manufactured [Russell,
1985]. Even if errors were discovered after manufacture, the boards could be patched
on-site.
The low cost of failure meant that there was low demand for commercial tools to aid in
design veriﬁcation, and the majority of tools used, for instance to optimise layout, were
developed in-house by the electronics companies themselves.
2.2 Chip-level Design and Veriﬁcation
As CMOS technology became more accessible, companies such as Texas Instruments
and LSI Logic offered methodology and manufacturing services for application-specific
(ASIC) CMOS design. Coupled with the shrinking transistor sizes enabled by the
increasingly sophisticated manufacturing processes developed by the semi-conductor
companies, it was now possible for fab-less design companies to develop designs of
considerable complexity with very low unit cost. By 2006 it was possible to produce designs
with more than 5 million gates on a single chip.
It is, however, far more difficult to verify a design on a single chip than on a board.
It is no longer possible to produce a working prototype and, prior to manufacture, a
photographic mask of the chip [Ragan et al., 2002], which can cost as much as $2 million,
must be produced. The cost of failure has become extremely high. It is these economic
considerations that have driven the development of an Electronic Design Automation
(EDA) industry that was, in 2006, valued at $5.3 billion [Huygen, 2007].
2.3 EDA Design and Veriﬁcation Languages
2.3.1 Modelling Languages
Early modelling languages only supported logic gate descriptions and were seldom en-
tered by hand. Schematic capture tools [Russell, 1985] allowed users to enter the design
graphically, and the modelling language was used for data interchange between tools
[Stanford and Mancuso, 1990], but this technology severely restricted the size of design
that could be handled.
In the mid-1980s the modelling languages Verilog [Thomas and Moorby, 2002] and
VHDL [Perry, 1994] raised the level of modelling abstraction to the Register Transfer
Level (RTL) and enabled a switch from schematic capture to language-driven design.
An RTL description represents the design as a set of communicating processes. Each
process can represent a block of combinational logic, a ﬁnite state machine or a combi-
nation of both. Although the data types are still low level, the signiﬁcant improvement
in control logic abstraction means that much larger designs can be contemplated. Cou-
pled with the availability of Logic Synthesis tools [Devadas et al., 1994], which generate
automatically a gate level representation from the RTL, designers had a mechanism for
exploiting the increased complexity made available by the new CMOS technology.
At the same time, VHDL and Verilog provided language constructs to support be-
havioural modelling. VHDL took these constructs, such as procedure calls, from ADA,
and Verilog from C. It was hoped that behavioural modelling would supersede RTL
modelling, but this has not happened, for reasons that will be discussed later.
2.3.2 Test Languages
Because of the very high masking costs, considerable effort has been placed in the EDA
industry into designing languages that can be used to ensure that the model of the
design has been thoroughly tested before manufacture and, importantly, to provide a
measure of how well the tests cover the design. Specman [Kuhn et al., 2001], VERA
[Haque and Michelson, 2001] and SystemVerilog [Sutherland et al., 2004] provide object-
oriented languages that allow the environment of the design to be modelled at a very
high level. Constrained-Random generation of tests means that the number of tests that
can be applied to the design can be signiﬁcantly higher than could be achieved using
hand-written directed tests, and there is more chance that unanticipated corner-cases
are covered.
The test languages provide a measure of functional coverage which, coupled with
traditional code coverage measures, can be used to increase confidence in the signed-off
design.
2.3.3 Property Languages
While test languages provide a functional view of the properties of the design and its
environment, property languages provide a formal view. Based on Linear-time Temporal
Logic (LTL) [Huth and Ryan, 2004] the property languages PSL (Property Speciﬁcation
Language) [Foster et al., 2005] and SVA (SystemVerilog Assertions) [Sutherland et al.,
2004] augment the test languages by providing a means to specify the temporal behaviour
of the design (properties/assertions) and its environment (assumptions). The other
major advantage of properties is that they may be embedded in the design itself, so that
errors can be ﬂagged locally.
2.4 EDA Tools
2.4.1 Simulation-based Veriﬁcation
RTL simulation remains the dominant technology for the veriﬁcation of application-
speciﬁc hardware designs. Figure 2.1 shows how the model of the design, represented by
the function f of its inputs, outputs and internal state, is placed in a test harness that
represents the speciﬁcation of the design. The test harness may consist of directed-tests,
pseudo-random tests and assertions. The design is simulated using the input values
generated by each test, and the outputs are checked to ensure correct functionality. At
the same time, the code and functional coverage of each test is measured, and these
measurements are accumulated to provide an overall value of design coverage. When
the accepted coverage threshold is reached, the design is signed-off.
Figure 2.1: Simulation Based Verification (the test harness, representing the
specification, generates the inputs In, checks the outputs Om and measures coverage of
the component model f(In, Om, Sp), which has p state variables)
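The generate, check and measure loop of Figure 2.1 can be illustrated with a minimal Python sketch. The 4-bit counter design, the reset probability and the three coverage bins are all assumptions chosen for illustration. They are not taken from any real test language, and the reference model is deliberately trivial.

```python
import random

def counter_model(count, reset):
    """Design under test: a hypothetical 4-bit counter with synchronous reset."""
    return 0 if reset else (count + 1) % 16

def run_harness(num_cycles, seed=0):
    """Drive pseudo-random inputs, check outputs and accumulate functional coverage."""
    rng = random.Random(seed)
    count = 0                # state of the design under test
    expected = 0             # state of the reference (specification) model
    covered = set()          # functional coverage bins hit so far
    for _ in range(num_cycles):
        reset = rng.random() < 0.1                  # generate inputs
        count = counter_model(count, reset)
        expected = 0 if reset else (expected + 1) % 16
        assert count == expected                    # check outputs
        if reset:
            covered.add("reset")
        if count == 15:
            covered.add("max_value")
        if count == 0 and not reset:
            covered.add("wrap_around")
    bins = {"reset", "max_value", "wrap_around"}
    return len(covered & bins) / len(bins)          # measure coverage
```

Calling run_harness with a larger number of cycles tends to raise the returned coverage fraction, but nothing in the loop shows that the bins themselves are an adequate description of the specification.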
2.4.2 Disadvantages of Simulation
Although the use of test languages and functional coverage has provided increased
confidence in design sign-off, this has come at a price. The number of tests that can
accumulate with a design has increased from hundreds to tens of thousands. Pseudo-random
tests can never be as well-targeted as directed tests, and this results in much
test duplication. For a 2 million gate design with 20 thousand tests, this can result in
simulation times of up to 2 weeks, even if the tests are distributed on simulation farms
with hundreds of processors (Infineon, Bristol: Case Studies¹). The effort required to
verify design changes increases greatly as the design progresses.
The real disadvantage of simulation, however, is that the executable speciﬁcation of the
design, the set of tests, is not in any way formal, and therefore any measure of coverage
is approximate. For designs that have safety considerations, particularly those in the
expanding automotive sector, simulation on its own is not sufficient to give the required
confidence for design sign-off.
¹ Private communication.
2.4.3 Formal Verification
Although there was interest in some theorem proving solutions in the 1990s (Larch/VHDL
[Baraona and Alexander, 1994], [Alexander and Baraona, 1997] and Lambda
[Hadjinicolaou et al., 1994]), their limited capacity and need for expert manual intervention
meant that the EDA industry focused on the promise of automatic formal verification
that model checking technology [Clarke and Emerson, 1981], [J.R. Burch et al., 1990]
provided.
EDA tool developers and users have put considerable effort into defining the IEEE standard
Property Specification Language (PSL), which is based on Linear-time Temporal Logic
[Vardi and Wolper, 1984], can be used for both formal and simulation-based verification,
and has well-defined formal semantics [Gordon et al., 2003]. The essence of PSL
has also been incorporated into SystemVerilog and, as a result, users have access to a
mechanism for formal hardware speciﬁcation in all the major industry tool-sets.
In parallel with the deployment of PSL, considerable advances have been made in model
checking technology, particularly with the adoption of SAT-based techniques [Clarke
et al., 2001], which means that very large design components can be veriﬁed auto-
matically. Model checkers take as input the same RTL model description as used in
simulation, and Figure 2.2 shows how the test-bench used for simulation is replaced
with a formal description of the design’s environment (the assumptions), and a set of
properties (the assertions) that specify the behaviour of the design.
Figure 2.2: Model Checking
2.4.4 Disadvantages of Current Model Checking Solutions
Apart from the fact that model checking cannot be used on very large designs, the
effort required to write the PSL properties for a design component is large. PSL was
designed to be used with the low-level data types used in RTL descriptions, and the
properties required to describe RTL behaviour can be long and complex. Although
suitable for parallel interfaces such as ARM AMBA, where the state of the bus is defined
explicitly by a set of low-level signals [Cohen et al., 2004], for serial interfaces such as
PCI Express (TransEDA PCI Express PSL Property Library²), where the bus state has
to be decoded from the bit stream, the properties become long and unwieldy. In practice,
HDL monitors need to be written to present a higher, transaction-level stream to the
PSL checker, but it would be easier if it were possible to raise the level of abstraction
and define transaction-level properties in the property language itself.
Property coverage metrics [Chockler et al., 2001] are also immature, so there is no way
of measuring that a set of properties adequately covers the functionality of the models.
PSL properties are often written as implications and there is no way of telling that
an implication has only been vacuously satisﬁed and that the left hand side of the
implication is always false. This situation can occur in simulation-based veriﬁcation
when the test bench is inadequate and the signals comprising the left hand side have
not been fully exercised. It can also occur in model checking when the environment is
over-constrained.
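The vacuity problem can be made concrete with a small Python sketch. It checks a same-cycle implication over a finite simulation trace and reports separately whether the antecedent was ever true. The function name, the trace representation and the req/ack signals are hypothetical, and the temporal operators of PSL are deliberately left out.

```python
def check_implication(trace, antecedent, consequent):
    """Check 'always (antecedent -> consequent)' over a finite trace.

    Returns (holds, vacuous). vacuous is True when the property holds only
    because the antecedent was never true in any state of the trace.
    """
    triggered = False
    for state in trace:
        if antecedent(state):
            triggered = True
            if not consequent(state):
                return False, False      # genuine failure
    return True, not triggered

# Hypothetical signals: every request must be acknowledged in the same cycle.
req = lambda s: s["req"]
ack = lambda s: s["ack"]
idle_trace = [{"req": False, "ack": False}] * 3   # test bench never raises req
```

Here check_implication(idle_trace, req, ack) reports that the property holds, but only vacuously, which is the situation described above for an inadequate test bench.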
2.5 Higher-level Modelling Abstractions
2.5.1 Behavioural Modelling
As described in Section 2.3 on Page 5, modern hardware description languages have
inherited the capabilities of the procedural programming languages, ADA and C. Writing
procedural descriptions of hardware, excluding all timing and concurrency issues, is
termed behavioural modelling. The main advantage of behavioural models is that they
can be simulated at speeds orders-of-magnitude faster than their RTL counterparts. The
disadvantage is that efficient hardware is not procedural, but highly concurrent, and the
translation of a behavioural description to an RTL description that will implement a
given hardware architecture is complex and application speciﬁc. Much research and
development has been undertaken over the last 20 years to automate this translation
(behavioural synthesis), but no viable general solution has been found, and it remains
largely a manual operation. Without this automatic step in place, any attempts to use
behavioural descriptions for formal veriﬁcation would be of little value.
Currently, behavioural models are usually written in C to provide a behavioural refer-
ence speciﬁcation for the RTL model that can be used in system simulation to improve
simulation speed. The C test-bench is re-used to verify the RTL, or the C and RTL
models are co-simulated in the design.
² Private communication.
2.5.2 Transaction-Level Modelling (TLM)
Another way of increasing simulation speed is to reduce the number of discrete data
communications between processes, and the frequency of those data communications
[Donlin, 2004]. Where the interface comprises many bit and bit vector signals, these are
bundled into a single packet of data for transfer. In addition, if each communication
consists of a sequence of data transfers over time, the data packets can be collected
and passed as a single transaction. This abstraction of both simulation data values and
events in inter-process communication leads to simpler and faster simulation models.
Although primarily developed to speed up simulation, the event and data abstraction
principles on which TLM is founded could provide a major contribution to raising the
abstraction level for formal veriﬁcation.
2.6 SystemC Transaction Level Modelling
The Open SystemC Initiative (OSCI), which promotes the SystemC hardware-oriented
language based on C++ class libraries, has recognised the importance of TLM [Ghenas-
sia, 2006] and produced a draft standard [OSCI-TLMSubgroup, 2007] for TLM modelling
in SystemC. The standard focuses on the communication between hardware processes
using standardised interfaces [Rose et al., 2005], [Swan, 2006], [Bombieri et al., 2006a].
2.6.1 TLM level 3
The highest level, TLM level 3, is a behavioural modelling style but with a well-deﬁned
communication mechanism. It is designed for high-performance simulation while sup-
porting a design methodology which establishes an inter-process communication mech-
anism that can be reﬁned at lower levels. TLM level 3 is untimed: there is no notion of
time or a clock to control execution of the design and the ordering of events is determined
by the order in which the communications (transactions) occur between processes. After
an initial evaluation, each process waits until there is a change on one or more of its
inputs. The process code is then executed and then suspends again, waiting for further
input changes. Each execution of the process code is assumed to take zero time. A Mas-
ter process is written using sequential C++ code and the Master and Slave communicate
via read and write procedure calls implemented by the slave. High-level types are used
to represent the data. The master issues a command to the slave using the write proce-
dure call, the operation is performed by the slave and the return result made available in
zero time. The Master may therefore call the read procedure call immediately to get the
result. When two independent processes communicate in SystemC, the communication
and synchronisation is handled by the scheduler of the simulator kernel [Mueller et al.,
2001]. The request is placed in a priority queue [Brown, 1988], which incurs a context
switch in the SystemC simulator. Since Master/Slave communications at TLM Level 3
do not need to use simulator kernel event scheduling, level 3 models simulate quickly, but
retain the disadvantages of behavioural modelling described in 2.5.1 on page 9 above.
2.6.2 TLM level 2
2.6.2.1 Modelling TLM level 2 Processes
At TLM level 2, each process is represented by an Extended Finite State Machine
(EFSM) [Gajski et al., 1997], [Bombieri et al., 2006b]. An EFSM provides a more
compact representation of the states and transitions of a process than a traditional
Finite State Machine (FSM).
In its simplest form, an EFSM has a single state and no variables, and is represented by
purely combinational logic; the output values are a function of the input values.
An EFSM with one or more variables may have a single state with one or more transitions
representing the allowable value changes of those variables. For instance, a counter has
a transition to set or reset the count variable to zero and a transition to increment the
count. The input is reset and the output is count. Each transition has a condition or
guard that enables the transition (Figure 2.3).
Figure 2.3: Counter
In general, an EFSM has a single state variable, zero or more other variables, a set of
states that the state variable may take, and a set of transitions between those states.
EFSMs are deterministic; only one transition may be enabled from any given state.
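The counter described above can be sketched in Python as a single-state EFSM with two guarded transitions. The exact guards are an assumption, since only the prose description of the counter is relied on here. The assertion checks the determinism requirement that exactly one transition is enabled.

```python
def counter_step(count, reset):
    """One step of a single-state EFSM with one variable, count."""
    transitions = [
        (lambda: reset,     lambda: 0),          # guard: reset      action: count := 0
        (lambda: not reset, lambda: count + 1),  # guard: not reset  action: count := count + 1
    ]
    enabled = [action for guard, action in transitions if guard()]
    assert len(enabled) == 1, "an EFSM must be deterministic"
    return enabled[0]()
```

Because the two guards are mutually exclusive and cover all input values, the determinism check can never fail for this counter.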
2.6.2.2 Modelling TLM level 2 Communications
Each process communicates with other processes using uni-directional channels. These
channels are based on the standard, bounded SystemC queue and replace the read and
write calls of the level 3 model. Thus, the Master and Slave at level 2 are modelled
as two distinct processes with an output channel to represent the write and an input
channel to represent the read as shown in Figure 2.4. The TLM standard deﬁnes put
and get operations for the channels.
Figure 2.4: TLM Level 2 Master/Slave Communication
The level 2 processes are also sensitive to input changes: when a process resumes it
evaluates the actions associated with its current state and the values of its inputs, and
sets the value of the next state. This set of actions is atomic and represents a process
event which must be scheduled using the simulation kernel. Simulation is therefore
slower than at TLM level 3.
TLM level 2 simulation is event accurate: the order in which the process events occur
is deterministic and an event can represent a transaction that in the hardware imple-
mentation will take place over several clock cycles. Simulation is therefore faster than
at RTL where every cycle of the transaction must be scheduled through the simulation
kernel.
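The level 2 communication style can be sketched in Python as follows. The Channel class is only a stand-in for the bounded SystemC queue, and the command format and slave behaviour are hypothetical. Each call to slave_event models one atomic process event that consumes a command from the write channel and places a result on the read channel.

```python
from collections import deque

class Channel:
    """Bounded, uni-directional FIFO channel (a stand-in for a SystemC queue)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def can_put(self):
        return len(self.items) < self.capacity

    def can_get(self):
        return len(self.items) > 0

    def put(self, item):
        assert self.can_put(), "channel full"
        self.items.append(item)

    def get(self):
        assert self.can_get(), "channel empty"
        return self.items.popleft()

def slave_event(write_ch, read_ch, memory):
    """One atomic slave event; returns False when the event is not enabled."""
    if not (write_ch.can_get() and read_ch.can_put()):
        return False
    op, addr, data = write_ch.get()
    if op == "write":
        memory[addr] = data
    read_ch.put(memory.get(addr))      # result returned on the read channel
    return True
```

A master process would put commands such as ("write", 4, 99) on the write channel and later get the results from the read channel, so neither process calls the other directly.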
2.6.3 TLM level 1
TLM level 1 retains the high-level data types of level 2 but now introduces a clock which
synchronises the process behaviour: each process is sensitive to the rising (or falling)
edge of the clock and the process executes even if its state is not going to change. TLM
level 1 is termed cycle accurate. Because of these extra process evaluations and the
fact that each cycle of a transaction must be simulated, simulation is slower than at
TLM level 2.
This is the ﬁnal level of abstraction before bit and bit-vector data types are introduced
in the RTL.
2.6.4 TLM Architectural Exploration
TLM level 3 is not suitable for architectural exploration because of its untimed,
behavioural nature. It is at levels 1 and 2 that designers can begin to explore different
possible architectures to meet performance and power consumption constraints. To fa-
cilitate such exploration it is desirable to have an automated route to RTL and early
feedback of performance and power consumption estimates from tools downstream in
the ﬂow.
Forte Design Systems have introduced the TLM synthesis tool Cynthesizer [Cline, 2007a]
which promises such an automated route from TLM to RTL and [Cline, 2007b] discusses
the early adoption of TLM synthesis by design teams. As TLM synthesis technology
matures, designers will be able to focus most of their attention at the electronic system-
level (ESL) [Henkel, 2003] rather than RTL and to utilise the performance and power
estimation tools currently available at RTL.
It has been a limitation of the TLM standard at level 1 and 2 that standard bus interfaces
had to be modelled using standard channels. The June 2007 release of the standard
addresses this issue by providing an abstract, black-box bus interface that takes into
account the capabilities of industry-standard buses such as AMBA [ARM, 1999], AXI
[ARM, 2003] and OCP (www.ocp.org).
This new bus interface facilitates architectural exploration greatly, but is it possible to
raise the level at which architectural exploration can be done to the speciﬁcation level?
2.7 Microprocessor Pipeline Veriﬁcation
Early work in the formal veriﬁcation of microprocessors was focused on simple, non-
pipelined processors described at the Register Transfer Level (RTL). In [Joyce et al.,
1986] the RTL is represented in the ML programming language and the HOL proof
assistant system [Gordon and Melham, 1993] used to discharge the proofs.
In [Burch and Dill, 1994] and [Jones et al., 1995] the representation of the processor is
raised to the Instruction Set Architecture (ISA) level and the techniques described focus
on the formal verification of the control logic of first a 3-stage pipelined ALU and then the
full 5-stage DLX processor. ALU operations are represented as uninterpreted functions.
In order to show that the pipelined processor will behave in the same way as a notional
non-pipelined version, the concept of pipeline ﬂushing is introduced. Stall instructions
are introduced at the pipeline input to ensure that each instruction is completed before
the next is initiated. The notion of refinement maps is introduced in [Manolios, 2000]
and [Manolios and Srinivasan, 2005a] to extend the flushing concepts of Burch and Dill
to more complex 3 and 10-stage pipelines, using the ACL2 functional programming
language and theorem prover [Kaufmann and Moore, 2004]. [Manolios and Srinivasan,
2005b] takes a pipelined processor model, translates it into a combination of two different
formal languages and then shows that this translation meets its speciﬁcation. None of
these methods is incorporated into the top-down design ﬂow, and the formal veriﬁcation
is performed when the RTL has been completed.
[Tahar and Kumar, 1994] focuses its attention on the formalization of the pipeline haz-
ards that can occur when multiple instructions are executed at once in the DLX pipeline.
Structural, data and control hazards are represented and checked using the HOL veriﬁca-
tion system [Gordon and Melham, 1993]. Incremental design techniques with reﬁnement
are described in [Borger and Mazzanti, 1997] to show that a notional DLX pipeline with,
however, no overlapping instruction execution, can be refined to a pipeline that executes
5 instructions at each clock cycle and manages structural hazards, provided that it does
not encounter a sequence of instructions that would incur data or control hazards. This
pipeline is then
further reﬁned to model the data and control hazards. Abstract State Machines (ASMs)
are used to represent the DLX instructions. In [Kroening and Paul, 2001], a tool that
takes a sequential model of the DLX pipeline, which is assumed to be correct, and adds
the forwarding logic is described. The tool also provides a proof of correctness for the
generated hardware. [Plosila and Sere, 1997] uses Action Systems in a reﬁnement-based
approach to processor veriﬁcation, but does not have the beneﬁt of tool support.
2.8 Modelling Concurrency
2.8.1 Modelling Concurrency with Partial Orders
A process in a concurrent system may be represented as a set of partially ordered mul-
tisets (pomsets) [Pratt, 1986]. Each element of a partial order represents an event, and
for two events e and f, e < f is interpreted as “event e precedes event f”. Events are
not necessarily atomic and can represent intervals as well as instants. In this case,
e < f means “the whole of e must precede the whole of f”: e must complete before f
can begin. Actions label events, and an event is an instance (occurrence) of its action.
The motivation for studying partial orders is to investigate an Event-B modelling style
where the specification is as loose as possible in the early stages of refinement, decisions
about actual event ordering are deferred for as long as possible and potential concurrency
is identiﬁed as early as possible. Specifying the behaviour of a component using partial
orders means that the speciﬁcation can be re-used when the component is subsequently
re-targeted to a different hardware architecture for power or performance reasons.
Modelling a Communications Channel as a Partial Order
A channel is a process in which the events Transmit (T) data and Receive (R) data
occur in ordered pairs:-
(data, T) < (data, R)
For the stream of data 011, the behaviour of the channel is speciﬁed in Figure 2.5.
(0, T) (1, T) (1, T)
(0, R) (1, R) (1, R)
Figure 2.5: channel behaviour
The actual order in which transmits and receives occur in the implementation will
depend on the performance characteristics of the processes on each end of the channel.
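To illustrate how many implementation orderings satisfy the same partial order, the following Python sketch accepts any total order of events in which each receive follows its matching transmit. FIFO delivery of the data is an assumption of the sketch, and the function name and trace format are hypothetical.

```python
def respects_channel_order(trace):
    """trace is a list of (data, kind) events with kind "T" or "R".

    The trace is accepted if the i-th receive occurs after the i-th
    transmit and carries the same data.
    """
    pending = []                     # transmitted but not yet received
    for data, kind in trace:
        if kind == "T":
            pending.append(data)
        elif not pending or pending.pop(0) != data:
            return False             # receive without a matching earlier transmit
    return True

# Two different total orders for the data stream 011, both allowed.
interleaved = [(0, "T"), (0, "R"), (1, "T"), (1, "R"), (1, "T"), (1, "R")]
buffered = [(0, "T"), (1, "T"), (1, "T"), (0, "R"), (1, "R"), (1, "R")]
```

Both example traces are accepted. Which of them an implementation actually exhibits depends on how much buffering the channel provides.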
Communication Reﬁnement using Partial Orders
[Lieverse et al., 2001] describes a methodology with which multi-task applications can be
designed at the abstract level using high-level inter-task communication. The commu-
nication can then be reﬁned by transforming the high-level view of the communication
primitives into a partially-ordered representation using lower-level primitives. Then,
depending on the hardware architecture, the partial order can be reﬁned into the ap-
propriate total order. Factors that govern the most appropriate total order are whether
communication is enabled with channels (message passing) or shared memory and how
much local memory a processor has to store local results before passing them to the next
process.
2.8.2 Modelling Concurrency with Guarded Atomic Actions
Any hardware component can be represented as a set of variables that comprise the
state of the component, and a set of actions on those variables that updates the state.
The actions are atomic, all the variable updates comprising an action occur simultane-
ously, and are protected by guards, variable predicates, that determine which actions
are enabled or disabled for a given state of the component. If more than one action is
enabled in a given state, a single action is chosen, non-deterministically, for evaluation.
The behaviour of the component is represented by the set of legal sequences of atomic
actions.
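A minimal Python sketch of this execution model is given below. The component, a one-place buffer, and all of the names are hypothetical. Each action is a (name, guard, update) triple, one enabled action is chosen at random to stand in for the non-deterministic choice, and its updates are computed from the old state and applied together.

```python
import random

def run(state, actions, max_steps, rng=None):
    """Fire at most max_steps actions, one enabled action per step."""
    if rng is None:
        rng = random.Random()
    trace = []
    for _ in range(max_steps):
        # An action is enabled when its guard holds in the current state.
        enabled = [(name, update) for name, guard, update in actions if guard(state)]
        if not enabled:
            break                               # nothing enabled: stop
        name, update = rng.choice(enabled)      # non-deterministic choice
        state = {**state, **update(state)}      # atomic: all updates applied together
        trace.append(name)
    return state, trace

# Hypothetical component: a one-place buffer with two guarded actions.
BUFFER_ACTIONS = [
    ("put", lambda s: not s["full"], lambda s: {"full": True, "puts": s["puts"] + 1}),
    ("get", lambda s: s["full"], lambda s: {"full": False, "gets": s["gets"] + 1}),
]
```

For this buffer the guards are mutually exclusive, so the only legal sequences alternate put and get. A component with overlapping guards would produce a different trace on each run.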
2.8.3 Bluespec
Bluespec is a set of commercial tools, developed from research carried out at MIT
[Rosenband and Arvind, 2004], which allows hardware components to be speciﬁed as a
set of guarded atomic actions, called rules, simulated and then synthesised automatically
to RTL. The toolset enables architectural exploration to be done at the speciﬁcation
level, and then RTL to be generated to meet differing power, size and performance
constraints. Links are provided to ﬁt in a SystemC TLM ﬂow and, in particular, a
library is provided that converts TLM put and get calls into a sequence of bus-speciﬁc
operations. AMBA, AXI and OCP busses are supported.
Bluespec synthesis is based on a term-rewriting system [Hoe and Arvind, 1999], but no
formal verification capability is provided for Bluespec models; the tool set is purely
simulation based. The semantics of guarded atomic actions are not strictly adhered to,
and it is possible for multiple enabled actions to be executed simultaneously. It is also
possible for the user to apply priorities to actions using tool pragmas. Bluespec rules
are translated to a set of RTL concurrent assignments, which are processes sensitive to
changes to variables on the right hand side of the assignment. Simultaneously-ﬁring
rules are therefore mapped to concurrent assignments that are evaluated simultaneously
in the target HDL; the left hand sides of the assignments are not updated until all
the variables on the right hand sides have had their values updated, preventing race
conditions.
2.8.4 CAL
CAL [Bhattacharyya et al., 2008] is a language, also based on guarded atomic actions,
for hardware/software co-design. It provides routes for both hardware and software
synthesis, but no formal verification capability has been developed.
2.9 Event-B
2.9.1 Introduction
Event-B [Abrial and Mussat, 1998], [Hallerstede, 2007] is a proof-based modelling lan-
guage and method that enables the development of speciﬁcations using reﬁnement. The
Rodin platform [Abrial et al., 2006] is the Eclipse-based IDE that provides automated
support for Event-B modelling, reﬁnement and mathematical proof.
In Event-B, an abstract model comprises a machine that speciﬁes the high-level be-
haviour and a context, made up of sets, constants and their properties, that represents
the type environment for the high-level machine. The machine is represented as a set of
state variables, v, and a set of events (guarded atomic actions) which modify the state. If
more than one action is enabled, then one is chosen non-deterministically for execution,
an observable transition on the state variables which must preserve an invariant on the
variables, I(v). A more concrete representation of the machine may then be created
which reﬁnes the abstract machine, and the abstract context may be extended to sup-
port the types required by the reﬁnement. Gluing invariants are used to verify that
the concrete machine is a correct reﬁnement: any behaviour of the concrete machine
must satisfy the abstract behaviour. Gluing invariants give rise to proof obligations for
pairs of abstract and corresponding concrete events. Events may also have parameters
which take, non-deterministically, the values that will make the guards in which they
are referenced true.
An event can be represented by the generalized substitution,
any x where P(x,v) then v := F(x,v) end
where x represents the event parameters and v represents the machine state variables.
Informally, this event can be ﬁred provided that the guard P(x, v) can be satisﬁed for
some value x. The details are explained in [Abrial, 2005].
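The informal reading of the event can be sketched in Python. This is an illustration only: in Event-B, invariant preservation is established by proof, whereas the sketch checks it at run time, and it searches a finite list of candidate parameter values where Event-B chooses non-deterministically. All names are hypothetical.

```python
def fire(guard, action, v, candidates, invariant):
    """Fire 'any x where P(x, v) then v := F(x, v) end' for some x, if one exists."""
    for x in candidates:                 # look for a parameter value satisfying the guard
        if guard(x, v):
            new_v = action(x, v)
            assert invariant(new_v), "event must preserve the invariant I(v)"
            return new_v
    return None                          # guard unsatisfiable: the event is not enabled

# Hypothetical event: add x to v, where x > 0 and v + x stays within the bound 10.
add_guard = lambda x, v: x > 0 and v + x <= 10
add_action = lambda x, v: v + x
bounded = lambda v: 0 <= v <= 10         # the invariant I(v)
```

With v = 7 the event can fire, for instance with x = 1. With v = 10 no value of x satisfies the guard, so the event is disabled.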
2.9.2 Reﬁnement
Event-B reﬁnement allows a model to be built gradually [Abrial and Hallerstede, 2006],
starting with an abstract model and then introducing successive, more concrete reﬁne-
ments. Adding variables achieves spatial extension and adding events temporal exten-
sion. Events in the abstract model may be reﬁned by one or more events in the concrete
model. New events, which reﬁne skip may also be introduced in the reﬁnement. Data-
reﬁnement [Abrial, 2005] can also be used to modify the state so that an abstract variable
can be replaced with a concrete variable that can be implemented in the target hardware
or software.
2.9.3 Decomposition
Event-B supports two mechanisms for formal composition and decomposition: shared
event [Butler, 2009] and shared variable [Abrial and Hallerstede, 2006]. Shared event
decomposition has tool support in Rodin [Silva et al., 2010].
Shared Event Decomposition
[Butler, 2009] describes a parallel composition operator for machines, where the compo-
sition of machines M and N is written M || N. Machines M and N synchronise over
shared events which have common names. If eM and eN are the shared events in M
and N respectively and m and n are the (disjoint) variables of M and N respectively,
then M || N is deﬁned as follows.
if
eM = any x where P(x,m) then v1 := F(x,m) end
eN = any y where Q(y,n) then v2 := G(y,n) end
then
eM || eN = any x, y where P(x,m) ∧ Q(y,n) then v1 := F(x,m) || v2 := G(y,n) end
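The composition of two shared events can be mirrored in a short Python sketch. Each event is represented as a (guard, action) pair over its own machine's state. The composed event is enabled only when both guards hold, and it performs both updates in one atomic step. The representation and the example events are hypothetical.

```python
def compose(event_m, event_n):
    """Parallel composition of shared events over disjoint state m and n."""
    (p, f), (q, g) = event_m, event_n
    guard = lambda x, y, m, n: p(x, m) and q(y, n)     # conjunction of the two guards
    action = lambda x, y, m, n: (f(x, m), g(y, n))     # both updates, applied together
    return guard, action

# Hypothetical events: M gives up x units of m, N accumulates y units in n.
e_m = (lambda x, m: 0 <= x <= m, lambda x, m: m - x)
e_n = (lambda y, n: y >= 0, lambda y, n: n + y)
guard, action = compose(e_m, e_n)
```

Because m and n are disjoint, the order in which the two updates are evaluated inside the composed action does not matter.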
Shared Variable Decomposition
An alternative to shared event decomposition is presented in [Abrial, 2009]. Instead of
synchronising over shared events, the machines M and N communicate using shared
common variables, which must be replicated in M and N. When M || N is decomposed,
however, it is necessary to impose the restriction that the shared common variables must
not be data-reﬁned in any subsequent reﬁnements of M or N. Additional events must
also be introduced into M and N, called external events, which simulate the way the
shared events are managed in the composition M || N.
2.9.4 Variants
A convergent event is an event which reﬁnes skip and has some liveness property associ-
ated with it through the deﬁnition of a variant. A variant is a natural number expression
that must be decreased by the convergent event, or a ﬁnite set expression that must be
made strictly smaller by the convergent event. An anticipated event is an event that
refines skip that is not convergent but will become convergent in a later refinement
[Abrial, 2007].
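The role of a natural-number variant can be illustrated with a Python sketch. The obligations that Event-B discharges by proof are replaced here by run-time assertions, and the countdown event is a hypothetical example.

```python
def fire_convergent(v, guard, action, variant):
    """Fire a convergent event, checking that it strictly decreases the variant."""
    assert guard(v), "event is not enabled"
    before = variant(v)
    assert before >= 0, "variant must be a natural number when the guard holds"
    new_v = action(v)
    assert variant(new_v) < before, "convergent event must decrease the variant"
    return new_v

def run_until_disabled(v, guard, action, variant):
    """Repeat the event while enabled; the variant bounds the number of steps."""
    steps = 0
    while guard(v):
        v = fire_convergent(v, guard, action, variant)
        steps += 1
    return v, steps
```

For a countdown event with guard v > 0, action v := v - 1 and variant v, the loop started from v = 3 stops after exactly three steps, so the new event cannot take control forever.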
2.9.5 Records
Structuring of data can be achieved in Event-B by using the conventions established in
[Evans and Butler, 2006]. For a set representing the data type T, projection functions
can be deﬁned which map a ﬁeld of T to the ﬁeld’s value. For instance, if T has two
ﬁelds F1 and F2 which are natural numbers and one ﬁeld F3 which is an integer, then
the record can be described thus.
axm1 : F1 ∈ T → ℕ
axm2 : F2 ∈ T → ℕ
axm3 : F3 ∈ T → ℤ
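As an informal illustration of the record convention, the Python sketch below models each field as a total function, represented by a dictionary, from the set T of record instances to the field's value. The instances t1 and t2 and their field values are hypothetical.

```python
T = {"t1", "t2"}                 # carrier set of record instances (hypothetical)
F1 = {"t1": 3, "t2": 0}          # F1 ∈ T → ℕ
F2 = {"t1": 7, "t2": 1}          # F2 ∈ T → ℕ
F3 = {"t1": -4, "t2": 12}        # F3 ∈ T → ℤ

def is_total_function(f, domain, in_range):
    """True if f maps every element of domain to a value accepted by in_range."""
    return set(f) == domain and all(in_range(value) for value in f.values())

is_nat = lambda n: isinstance(n, int) and n >= 0
is_int = lambda n: isinstance(n, int)
```

Reading a field of a record is then function application, for example F3["t1"], and the three axioms correspond to the three is_total_function checks.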
2.9.6 Witnesses
When an abstract event with parameter p is reﬁned then the parameter p in the concrete
event corresponds directly with that in the abstract event. If, however, the parameter
p does not appear in the concrete event, then p must receive a concrete value, called a
witness, in the refinement [Abrial, 2007].
Chapter 3
Enhancing the SoC Hardware
Design Flow with Event-B
This chapter covers the modern synchronous design ﬂow and the gap that exists in this
ﬂow between the speciﬁcation and the RTL description used for logic synthesis. It looks
at the reasons that behavioural synthesis has not in general helped to close this gap,
and how the use of guarded atomic actions can provide efficient high-level synthesis and
architectural exploration. Although semi-conductor companies express a clear need to
raise the design process to the Electronic System Level (ESL) [Asanovic et al., 2006], the
approach that the EDA industry has taken to address this need has been fragmentary,
no clear standards have emerged, and the tried and proven RTL design methodology
still forms the signiﬁcant bedrock of any modern SoC design ﬂow (Figure 3.1). The
hardware description languages are mature and simulators can be second-sourced. It
is, however, the increasing maturity of Logic Synthesis [Devadas et al., 1994] that has
made the most signiﬁcant contribution.
The contribution of this chapter is to show how the Event-B formal method can be
incorporated into this ﬂow and combined seamlessly with high-level and RTL synthesis
to close the speciﬁcation gap.
3.1 Background to the Existing Flow
3.1.1 Logic Synthesis
Logic Synthesis, as shown in Figure 3.2, contributes the ﬁrst step in bridging the gap
between the textual speciﬁcation and the gate level description from which hardware can
be generated automatically. Although RTL descriptions raise the level of abstraction
significantly, enable language-driven design and reduce simulation effort, if the
translation route to gates were a manual process the RTL would simply represent some useful
documentation and would make little contribution to closing the specification gap.
Figure 3.1: SoC Flow
Early users of logic synthesis expended significant simulation effort on verifying that the
output from synthesis was correct. Today, very little gate level simulation is performed,
not only because logic synthesis is mature but also, more importantly, because formal
tools have been introduced to augment the design ﬂow.
3.1.2 Formal RTL Veriﬁcation
Although modern logic synthesis tools are mature and reliable, they are also very large
and complex and present the user with an enormous range of control options. A logic
synthesis tool could never represent a trusted component [Meyer et al., 2003] in the
SoC ﬂow in the sense of representing a provable transformation on the RTL. It would be
infeasible to show that the tool, given an RTL description as input produced an equivalent
gate level description as output. It is possible, however, given the logic synthesis input
and output description, to reason about the correctness of the translation using the
formal methods property checking [J.R. Burch et al., 1990] and equivalence checking
[Drechsler and Horeth, 2002], as shown in Figure 3.3.
Figure 3.2: The Role of RTL Synthesis
Figure 3.3: The Role of Formal RTL Checking
Property Checking
A property checking tool takes a set of temporal properties [Cohen et al., 2004], [Sutherland
et al., 2004], derived from the specification, and checks these properties against the
synthesised gate-level description.
Property checking depends on having a comprehensive set of properties to represent the
desired behaviour. It is very difficult to establish whether sufficient properties have been
written or not, and property coverage [Chockler et al., 2001], [Hoskote et al., 1999] is a
topic of ongoing research.
Without a measurable outcome, property checking will not become an indispensable
component in the SoC ﬂow, but it does provide an independent check on the validity of
the design, and is a valuable auxiliary method for flushing out design faults.
Equivalence Checking
An equivalence checking tool takes the RTL and the synthesised netlist as inputs, con-
verts both into an internal gate-level format and then veriﬁes that they are functionally
equivalent. Equivalence checking is a mature technology, requires little user interaction
and is now widely used to verify synthesis output.
The combination of logic synthesis with equivalence checking, which provides practical
and efficient support for RTL design and a formal link between RTL and gate-level views,
represents a signiﬁcant breakthrough in raising the abstraction level in hardware design.
In particular, it raises the level at which architectural exploration and component re-use
can be considered.
3.1.3 RTL Architectural Exploration
Before the adoption of RTL synthesis, when design was done using gate-level schematic
capture, arriving at an architecture that satisﬁed performance constraints was laborious
and time consuming. Components designed for re-use were only available as gate-level
descriptions, which could require significant re-work to be of value in different designs.
Once an appropriate architecture had been settled on to target a given hardware tech-
nology, represented by a library of cell primitives, even targeting a similar library for a
different hardware technology could require significant and time-consuming changes to
the gate-level description, including changes to the supposedly re-usable components.
The ﬁrst clear beneﬁt of RTL synthesis is that it facilitates technology-independent
design. The second beneﬁt is that it is possible to explore micro-level architectural
alternatives without changing the RTL description. Using the rich set of control options
in the synthesis process it is possible to generate differing gate-level implementations to
meet differing performance requirements. Re-usable components are now provided as
RTL descriptions which can therefore also be manipulated by the synthesis process.
If, however, it is not possible to reach the required targets through synthesis alone,
alternative RTL architectures must be developed, and even though RTL/RTL equivalence
checking can be used to ensure functional equivalence between different RTL
representations, RTL architectural exploration, shown in Figure 3.4, is still very time
consuming.
Figure 3.4: RTL Architectural Exploration
3.1.4 Closing the Gap: Behavioural Synthesis
As SoCs have become more complex, managing system design, architectural exploration
and component re-use at the Register Transfer Level has become increasingly difficult.
Since high-level, behavioural models and system-level speciﬁcations are already devel-
oped using programming languages such as C and C++, behavioural synthesis from a
programming language source is an attractive concept [Edwards, 2005]. If efficient hardware
could be generated automatically from a collection of behavioural descriptions, and
re-usable components could also be represented in this way, then the gap between spec-
iﬁcation and implementation would close dramatically. Behavioural Synthesis takes a
behavioural description and generates an RTL description that can then be used as an
input to logic synthesis. The behavioural synthesis ﬂow is shown in Figure 3.5.
Behavioural Synthesis Issues
Although there is a clear relationship between RTL processes and their gate level equiv-
alents, and the transformations between the two levels are well deﬁned, no such rela-
tionships and transformations can be easily deﬁned between a behavioural description
and an RTL description. Fundamentally, RTL descriptions by their very nature contain
architectural information and behavioural descriptions do not. What is needed is a
high-level representation that can represent the architecture of the target implementation in
an abstract way and is amenable to formal reasoning.
Figure 3.5: Behavioural Synthesis Flow
3.1.5 Closing the Gap: High-level Synthesis with Term Rewriting Systems
Term Rewriting Systems (TRS) [Baader and Nipkow, 1998] offer a natural way to describe
hardware architectures which facilitates architectural exploration at the specification
level and allows the designer to explore design trade-offs much earlier in the
process [Arvind and Shen, 1999]. TRS can be used to describe both deterministic and
non-deterministic behaviour; as an abstract speciﬁcation is successively reﬁned the non-
determinism, which aids greatly in the representation of concise abstract speciﬁcations,
can systematically be made more concrete. In addition to being amenable to formal anal-
ysis, TRS descriptions, in the form of guarded atomic actions, have also been demon-
strated [Hoe and Arvind, 1999], [Arvind and Hoe, 1999], [Arvind et al., 2004], [Hoe,
2004] to be amenable to high-level synthesis [R. Kumar et al., 1996]. Once a detailed
TRS description of a hardware component is available, there is an automatic route to an
RTL description [Rosenband and Arvind, 2005] and therefore to hardware using current
logic synthesis flows [Devadas et al., 1994]. The commercial high-level synthesis tool
Bluespec [Nikhil, 2007] is already being used successfully in industrial flows (Figure 3.7).
Term Rewriting Systems
A Term Rewriting System is a set of rules which deﬁnes a set of transitions on state,
represented by TRS terms. Each rule is guarded by a predicate on the current state and
has an associated atomic action which updates the state [Arvind and Shen, 1999]. More
formally, a TRS is defined as a tuple (S, R, S0) where S is a set of terms, R is a set of
re-writing rules and S0 is a set of initial terms, S0 ⊆ S. States are represented by TRS
terms and transitions are represented by TRS rules. A rule is of the form
s1 if p(s1) → s2
where s1 and s2 are terms and p is a predicate. When p(s1) holds, the term s1 is rewritten
to the term s2.
Consider a microprocessor Proc with an instruction address (program counter) ia, a
register ﬁle rf and instruction memory im. Any given state of the processor can be
represented as Proc(ia,rf,im). The predicate representing the next instruction to be
processed in the instruction memory (at the instruction address) can be written as say
im[ia] = “rd := Op(rs1,rs2)”. If, for a given state of the processor, this instruction
is encountered in the instruction memory, then the rule will ﬁre and the state of the
processor will be updated appropriately: the instruction address will be incremented
to point to the next instruction location and the value of the target register rd in the
register ﬁle will be updated to reﬂect the result of the operation on the two source
registers rs1 and rs2. The state of the instruction memory im is unchanged. This rule
is shown in Figure 3.6.
Figure 3.6: TRS Processor Instruction Rule
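The processor rule described above can be sketched in executable form. The following Python fragment is an illustration only and is not taken from the thesis: the tuple encoding of instructions, the operation table OPS and the helper names op_guard, op_action and step are all assumptions made for the example. The guard plays the role of the predicate p(s1) and the action constructs the rewritten term s2.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Proc:
    ia: int      # instruction address (program counter)
    rf: tuple    # register file, indexed by register number
    im: tuple    # instruction memory, never modified by the rule

# Assumed instruction encoding: (opcode, rd, rs1, rs2)
OPS = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b}

def op_guard(s):
    """Predicate p(s1): the instruction at im[ia] has the form rd := Op(rs1, rs2)."""
    return 0 <= s.ia < len(s.im) and s.im[s.ia][0] in OPS

def op_action(s):
    """Atomic action: build the rewritten term s2 from s1."""
    opcode, rd, rs1, rs2 = s.im[s.ia]
    value = OPS[opcode](s.rf[rs1], s.rf[rs2])
    rf = s.rf[:rd] + (value,) + s.rf[rd + 1:]
    return Proc(ia=s.ia + 1, rf=rf, im=s.im)

def step(s):
    """Fire the rule if its guard holds, otherwise report that it is not enabled."""
    return op_action(s) if op_guard(s) else None

s0 = Proc(ia=0, rf=(0, 5, 7, 0), im=(("add", 3, 1, 2), ("sub", 0, 3, 1)))
s1 = step(s0)   # r3 := r1 + r2
s2 = step(s1)   # r0 := r3 - r1
```

Each firing increments ia, updates only the target register and leaves im untouched, which mirrors the three effects listed in the text.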
If, for a given state, the predicates of more than one of the rules evaluate to true, then
one of the enabled rules is chosen non-deterministically for evaluation. [Dave, 2005]
describes the use of the commercial high-level synthesis tool Bluespec [Nikhil, 2004],
which generates RTL from a TRS description, to design a processor. In practice the TRS
synthesis tool allows the use of pragmas to impose event ordering, and if simultaneously
enabled events do not conflict (do not attempt to write different values to the same
variable) they are scheduled to be evaluated simultaneously. The synthesis tool generates
an RTL scheduler to implement these semantics. At present, no formal checker exists
to verify that the derived RTL is a correct reﬁnement of the TRS description, but the
nature of guarded atomic actions means that they are amenable to formal analysis,
and the transformations from TRS rules to RTL processes could be formalised. How
TRS-based high-level synthesis fits in the design flow is shown in Figure 3.7.
Figure 3.7: TRS Synthesis Flow
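The scheduling semantics described above (enabled rules that do not write different values to the same variable are evaluated together) can be illustrated with a small Python sketch. This is not the scheduler generated by Bluespec; the rule representation, the function schedule and the use of list order as a stand-in for ordering pragmas are assumptions made for the illustration.

```python
def schedule(rules, state):
    """Select, in list order, enabled rules whose writes do not conflict.

    Every guard and action is evaluated against the same pre-state, so the
    selected rules behave as if they fired simultaneously.
    """
    fired, writes = [], {}
    for rule in rules:
        if not rule["guard"](state):
            continue
        update = rule["action"](state)
        # Conflict: an earlier rule already writes a different value to a variable.
        conflict = any(var in writes and writes[var] != val
                       for var, val in update.items())
        if not conflict:
            fired.append(rule["name"])
            writes.update(update)
    return fired, {**state, **writes}

# Hypothetical rules over two state variables a and b.
RULES = [
    {"name": "inc_a", "guard": lambda s: s["a"] < 3, "action": lambda s: {"a": s["a"] + 1}},
    {"name": "inc_b", "guard": lambda s: True, "action": lambda s: {"b": s["b"] + 1}},
    {"name": "clr_a", "guard": lambda s: s["b"] >= 1, "action": lambda s: {"a": 0}},
]

fired, nxt = schedule(RULES, {"a": 0, "b": 1})
```

In this example inc_a and inc_b fire together, while clr_a is enabled but held back because it would write a different value to a.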
TRS Architectural Exploration
TRS synthesis enables the abstraction level at which architectural exploration can be
conducted to be raised considerably [Arvind et al., 2004]. At a very early stage of
the design it is possible to explore different architectures and use the downstream TRS
and RTL synthesis flow to generate a gate level description for performance and power
estimation. Figure 3.8 shows that the synthesis process can generate different RTL
descriptions from the same TRS input under user control, or different TRS descriptions
can be synthesised and compared downstream.
Together with higher-level architectural exploration comes the opportunity for higher-
level component re-use [Ng et al., 2007].
TRS Synthesis Issues
Incorporating TRS synthesis into an SoC design flow can decrease the design effort
considerably, but currently does not decrease the verification effort, since most of the
verification must be conducted downstream in the flow. Bluespec descriptions can be
simulated before synthesis, but simulation facilities are limited and can only be used
effectively to eliminate gross errors. The major part of the verification effort must
still be expended on the generated RTL. This effort includes writing test benches for
simulation and properties for model checking. To get the full benefit from TRS synthesis
it will be necessary to raise at least some of the verification effort to the TRS level.
Figure 3.8: TRS Architectural Exploration
Bluespec works well for SoC component design, but for the development of SoC sub-systems,
complex inter-component interactions, described using Bluespec method invocation,
can lead to inefficiency in the synthesised design. Composition is functional in
Bluespec. A guarded action can make a call to a method which itself can be guarded
and can in turn make further method calls. Bluespec differentiates between the top-level
explicit guard and the implicit guards of the called methods [Nikhil, 2007]. Explicit and
implicit guards must be combined to determine whether an action is enabled or not and
this leads to complexity and strong coupling between the components. [Asanovic, 2007]
identiﬁes this problem and proposes an approach where composition is connection-based.
Bluespec is used to model the components and then an explicit inter-component commu-
nication mechanism is used to model the sub-systems, decoupling the components with
message queues. What is clearly needed is an environment in which both components
and sub-systems can be modeled and reasoned about in a systematic way.
The speciﬁcation gap still remains. Textual or behavioural descriptions of the speciﬁca-
tion must be transformed by hand to TRS descriptions, and since these descriptions are
fundamentally different in nature, this transformation process can be a major potential
source of errors.
3.2 Closing the Gap: The Event-B method with TRS
Since Event-B is also based on guarded atomic actions, the semantics for updating the
values of variables is the same as for Bluespec, CAL and indeed RTL.
Support for non-determinism in Event-B means that early abstract models can be
under-specified in a natural way. In terms of hardware modelling, this non-determinism
matches well the desirable ability to represent the abstract model
as a partial order on a set of events.
Events can also have parameters which take, non-deterministically, the values that will
make the guards in which they are referenced true. This provides a ﬂexible and proof-
driven mechanism for describing the environment for which a hardware model’s be-
haviour is deﬁned, analogous to the pseudo-random, constraint-based techniques of the
simulation test languages such as SpecMan [Hollander et al., 2001] and VERA [Haque
and Michelson, 2001].
An event takes the following form:-
Event E ≙
any
p1..pn
where
G1 ∧ .. ∧ Gm
then
A1..Ap
end
where p1..pn are parameters, G1..Gm are predicates representing the event guards and
A1..Ap are atomic assignments representing the event actions.
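The event form above can be mimicked in Python to show how parameters are chosen non-deterministically from the values that satisfy the guards. This is a sketch under stated assumptions and not Rodin's semantics engine: parameters are given explicit finite domains so that the candidates can be enumerated, and the dictionary layout, the function fire and the example event load are hypothetical.

```python
import random
from itertools import product

def fire(event, state, rng):
    """Fire `event` once if some parameter valuation makes every guard true."""
    names = list(event["any"])
    domains = [event["any"][n] for n in names]
    candidates = []
    for values in product(*domains):
        params = dict(zip(names, values))
        if all(guard(state, params) for guard in event["where"]):
            candidates.append(params)
    if not candidates:
        return None                      # the event is not enabled
    params = rng.choice(candidates)      # non-deterministic choice
    # All assignments read the pre-state and are applied together (atomically).
    updates = {var: expr(state, params) for var, expr in event["then"].items()}
    return {**state, **updates}

# Hypothetical event: any v where busy = FALSE ∧ v mod 2 = 0 then data := v, busy := TRUE
load = {
    "any": {"v": range(8)},
    "where": [lambda s, p: s["busy"] is False, lambda s, p: p["v"] % 2 == 0],
    "then": {"data": lambda s, p: p["v"], "busy": lambda s, p: True},
}

rng = random.Random(0)
before = {"busy": False, "data": 0}
after = fire(load, before, rng)
```

A proof in Event-B covers every admissible parameter value at once, whereas this sketch samples only one of them per firing, in the manner of constrained-random simulation.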
As explained in Section 2.9 on Page 17, Event-B reﬁnement allows a model to be built
gradually [Abrial and Hallerstede, 2006], starting with an abstract model and then
introducing successive, more concrete reﬁnements. Adding variables achieves spatial
extension and adding events temporal extension. Events in the abstract model may be
refined by one or more events in the concrete model. New events, which refine skip, may
also be introduced in the refinement.
3.2.1 Event-B Speciﬁcation Reﬁnement in the SoC Flow
A method is proposed where Event-B is used to represent the abstract hardware specification
with the TRS-style descriptions familiar to hardware designers who use Bluespec
or CAL. The abstract specification is then refined systematically to reflect the
architectural decisions of the designer. It will be shown in the following chapters how
the designer will be able to deal with one architectural consideration at a time, reﬁning a
particular aspect of the design to a concrete representation while leaving the rest of the
representation abstract. At each reﬁnement step it will be shown how the Rodin tool
helps the designer to discover the gluing invariants that must be proved to demonstrate
that the concrete representation is a correct reﬁnement of the abstract. These invari-
ants are fundamental properties of the design that can be translated directly into PSL
or SVA descriptions and used downstream in the ﬂow for RTL formal and simulation-
based veriﬁcation. The reﬁnement process continues until a concrete representation of
the speciﬁcation has been derived that is suitable for either TRS or RTL synthesis. The
verification effort has been raised to the specification level because the concrete representation
has been proved formally to implement the abstract specification. Verification
is made manageable because it is performed incrementally within the design ﬂow. The
hardware and its associated properties are described in an event-based language which
is a natural vehicle for synchronous hardware description. All the associated proof obli-
gations are generated and proved by the Rodin tool environment. Figure 3.9 shows how
the TRS flow is augmented by this method.
Figure 3.9: Event-B and TRS flow
3.2.2 Architectural Exploration at the Speciﬁcation Level
With this method the designer can choose to explore different architectures at a very
early stage of the development process. As shown in Figure 3.10, alternative concrete
representations can be derived from the speciﬁcation, translated automatically using a
TRS mapper to a Bluespec or CAL representation and then passed to tools downstream
in the ﬂow that predict the performance and power consumption of the target hardware.
This feedback from downstream tools then allows the user to select the appropriate
concrete representation or explore further alternatives. Where an existing design is to
be targeted to a new hardware platform, architectural decisions can be revisited at the
specification level.
Figure 3.10: Event-B Architectural Exploration
In general, increasing concurrency can increase performance, but will increase power
consumption. Reducing the clock speed, however, reduces power consumption at a
greater rate than it is increased by exploiting concurrency [Kumar et al., 2003]. Therefore
architectural solutions which maximise concurrency are often desirable, and this is the reason
that modern SoCs incorporate increasing numbers of processor cores. It is also the
reason that pipelining [Hennessy and Patterson, 2006] is so widely exploited. Pipelining
uses shared registers to communicate between simple, specialised pipeline stages. The
simplicity of each stage contributes to keeping power consumption low because the low
gate count minimises the amount of transistor switching. Similarly, shared registers can
provide an efficient mechanism for communication between state machines.
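The trade-off between concurrency and clock speed can be made concrete with a first-order calculation. The sketch below uses the common textbook model of dynamic power, P = N·C·V²·f, and assumes that the supply voltage can be lowered in proportion to the clock frequency; both the model and the numbers are illustrative assumptions and are not taken from the thesis or from [Kumar et al., 2003]. Leakage power and the practical limits on voltage scaling are ignored.

```python
def dynamic_power(units, capacitance, voltage, frequency):
    """First-order dynamic power of `units` identical processing units."""
    return units * capacitance * voltage ** 2 * frequency

def throughput(units, frequency):
    """Idealised throughput, assuming the work divides perfectly across units."""
    return units * frequency

# Normalised baseline: one unit at full voltage and full clock speed.
base_power = dynamic_power(1, 1.0, 1.0, 1.0)
base_rate = throughput(1, 1.0)

# Two units at half the clock speed, with voltage assumed to scale with frequency.
par_power = dynamic_power(2, 1.0, 0.5, 0.5)
par_rate = throughput(2, 0.5)

ratio = par_power / base_power
```

Under these assumptions the two-unit design delivers the same idealised throughput for a quarter of the dynamic power, which is the effect the paragraph above appeals to.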
For synchronous design, however, shared register communication can only be used for lo-
calised, intra-component communication, because the logic synthesis tools cannot man-
age the global track delays between components located in non-adjacent areas of the
chip. This issue will be dealt with in detail in the next chapter. Transaction Level
Modelling (TLM) [Ghenassia, 2006] and Network on Chip (NoC) technology [Gebhardt
and Stevens, 2008] have evolved to meet the requirements for inter-component commu-
nication on SoCs and, in particular, to support component re-use.
To raise architectural exploration and re-use to the speciﬁcation level it is essential that
the issues of inter- and intra-component communication can be addressed at this level,
before synthesis. It is therefore important that the speciﬁcation method supports both
component and sub-system design and it is a key goal of the following chapters to show
that speciﬁcation reﬁnement with formal proof can be used, starting with an abstract
speciﬁcation, to develop, systematically, alternative sub-system architectures to meet
differing performance and power consumption goals.
Chapter 4
Developing SoC Components
This chapter looks at the use of the Event-B method for developing SoC component
speciﬁcations, and how this relates to the restrictions that hardware process technology
places on component size. It ﬁrst looks at the modelling and reﬁnement of ﬁnite state
machines in general and then looks at pipelined architectures and the implications that
such architectures have for how simultaneous pipeline events are managed. It concludes
with a detailed investigation of how latency can be managed in an IP Lookup circular
pipeline.
The contribution described in this chapter is the development of a method where an
abstract speciﬁcation of an SoC component can be reﬁned formally to derive a con-
crete implementation of the component that has a direct correspondence to its HDL
description. Whether the concrete implementation is represented by a single process
or by multiple processes communicating with shared variables, the method enables the
implementation to be veriﬁed formally against its abstract speciﬁcation.
• A single process may be represented in hardware design by a Finite State Machine.
• Multiple processes may be represented by a set of communicating Finite State Ma-
chines which communicate with message-passing FIFOs or with shared registers.
• A hardware pipeline is a special case of a set of multiple communicating processes.
An approach is presented where a concrete representation of a single FSM, amenable
to hardware synthesis, is derived systematically from an abstract hardware speciﬁca-
tion using Event-B reﬁnement. The approach is then extended to show how an ab-
stract hardware speciﬁcation can be modeled, reﬁned and then decomposed to form
a concrete, synthesisable pipelined representation, comprising several communicating
processes, that has been proved to implement its abstract speciﬁcation.
This chapter presents the building blocks of the method which will be elaborated in later
chapters.
4.1 Restrictions on SoC Component Size
In a modern, low-power System-on-Chip (SoC) development flow it is already possible
to incorporate several microprocessors (multi-core) in a design [Geer, 2005] and in the
near future it is feasible that this number could increase to several hundred (many-
core) [Asanovic et al., 2006]. The trend towards multi-processor design has come about
because the constraints of sub-90 nanometre design mean that it is no longer possible
to simply increase processor clock speeds to achieve higher performance [Geer, 2005].
These design constraints, which restrict the size of all hardware processing components,
not just microprocessors, result from several factors, which are explored in [Sylvester
and Keutzer, 2001].
First, increases in clock speed increase power consumption and cause heat dissipation
problems. Second, the small feature sizes mean that the beneﬁts of faster device switch-
ing can be negated by the delays incurred when connecting these devices. Synchronous
design tool-chains rely on the combinational logic settling between clock edges and these
global track delays therefore restrict the speed at which a component can be clocked.
As components get larger and more complex, a corresponding increase in global track
delay length is also incurred, limiting the size of components that can be incorporated
into a SoC. Third, the capacity limitations of current veriﬁcation tools, whether formal
or simulation-based, coupled with the large cost of verifying complex hardware logic
also limit the size and complexity that can be handled. There is therefore a compelling
argument for SoC hardware components to be kept as simple as possible.
The detailed investigation of these physical factors in [Sylvester and Keutzer, 2001]
concludes that SoC component size needs to be restricted to between 100,000 and 200,000
gates, and that a Network on Chip (NoC) protocol [Carloni and Sangiovanni-Vincentelli,
2002] is then used to manage the communication between components.
This chapter focuses on the methods required to refine the different types of hardware
component encountered on an SoC, from a high-level speciﬁcation to a representation
that is suitable for high-level or RTL synthesis. Chapter 7 will address the development
of sub-systems and inter-component communication.
4.2 State Machines
For an abstract model, represented as a Term Rewriting System or in an HDL at the
Transaction Level, each constituent process may be represented by an Extended Finite
State Machine (EFSM) [Gajski et al., 1997], [Bombieri et al., 2006b]. An EFSM provides
a more compact representation of the states and transitions of a process than a
traditional Finite State Machine (FSM).
In its simplest form, an EFSM has a single state and no variables, and is represented by
purely combinational logic; the output values are a function of the input values.
An EFSM with one or more variables may have a single state with one or more transitions
representing the allowable value changes of those variables. For instance, a counter has
a transition to set or reset the count variable to zero and a transition to increment the
count. The input is reset and the output is count. Each transition has a condition or
guard that enables the transition (Figure 4.1).
Figure 4.1: Counter
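The counter can be written as a single-state EFSM with two guarded transitions. The Python sketch below is illustrative only: the wrap-around at a maximum value and the name max_count are assumptions, since the text specifies only a transition that sets or resets the count to zero and a transition that increments it.

```python
def counter_step(count, reset, max_count=3):
    """One transition of the counter EFSM; returns the next value of count."""
    # Transition 1, guard: reset or count = max_count (the latter is an assumption).
    if reset or count == max_count:
        return 0
    # Transition 2, guard: not reset and count < max_count.
    return count + 1

def run(resets, count=0, max_count=3):
    """Apply a sequence of reset inputs and collect the count output after each step."""
    outputs = []
    for reset in resets:
        count = counter_step(count, reset, max_count)
        outputs.append(count)
    return outputs

trace = run([True, False, False, False, False, False], count=2)
```

The two guards are mutually exclusive, so exactly one transition is enabled in any configuration, in line with the determinism required of an EFSM.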
In general, an EFSM has a single, explicit state variable, zero or more other variables,
a ﬁnite set of states that the state variable may take, and a set of transitions between
those states. EFSMs are deterministic; only one transition may be enabled from any
given state.
4.3 Case Study: Developing an EFSM for Huffman Encoding/Decoding
Huffman Encoding [Huffman, 1952] is an algorithm used for lossless data compression.
In this development, an EFSM implementation of the Huffman Encoding/Decoding
Algorithm is derived formally from its abstract speciﬁcation using Event-B reﬁnement
and proof. The purpose of this development is to encode a stream of vowels into a stream
of bit values, decode the stream and prove that the decoded encoded stream is identical
to the original stream. An encoding tree for the letters A, E, I, O, and U is constructed
using the probabilities of encountering a given letter in the stream. The more commonly
encountered the character, the shorter the bit encoding. The tree is shown in Figure 4.2.
Figure 4.2: HUFFMAN Encoding Tree
The development begins with an abstract model which deﬁnes the control ﬂow for the
encoding/decoding process. In the ﬁrst reﬁnement, the character to be encoded is read
from the input stream and stored. Then, in the second reﬁnement the bit encoding for
each character is introduced. Finally, in the third reﬁnement the bit encoding of the
character is decoded and gluing invariants introduced to prove that the decoded encoded
stream is identical to the original stream.
4.3.1 The Abstract Model
We present an abstract model of Huffman Encoding/Decoding using Event-B. Two state
variables char_to_encode and string_to_decode and three events get_char, encode
and decode are identified. The partial order on events is shown in Figure 4.3. After the
first encode event, get_char and decode are enabled simultaneously and can occur in
any order.
Figure 4.3: HUFFMAN Partial Order
The alphabet is represented as an enumerated set and get_char does a non-deterministic
assignment to the variable current_char to represent the input character stream.
CONTEXT HUFFC
SETS
Alphabet
CONSTANTS
A,E,I,O,U
AXIOMS
axm1 : Alphabet = {A,E,I,O,U}
END
At this level of abstraction there is no explicit encoding or decoding; the model simply
describes the control logic.
MACHINE HUFFM
SEES HUFFC
VARIABLES
current_char, char_to_encode, string_to_decode
INVARIANTS
inv1 : current_char ∈ Alphabet
inv2 : char_to_encode ∈ BOOL
inv3 : string_to_decode ∈ BOOL
EVENTS
Initialisation
begin
act1 : current_char := A
act2 : char_to_encode := FALSE
act3 : string_to_decode := FALSE
end
Event get ≙
any
char
where
grd1 : char ∈ Alphabet
grd2 : char_to_encode = FALSE
then
act1 : char_to_encode := TRUE
act2 : current_char := char
end
Event encode ≙
when
grd1 : char_to_encode = TRUE
grd2 : string_to_decode = FALSE
then
act1 : char_to_encode := FALSE
act2 : string_to_decode := TRUE
end
Event decode ≙
when
grd1 : string_to_decode = TRUE
then
act1 : string_to_decode := FALSE
end
END
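The control behaviour of the abstract machine can be explored by simulation. The following Python sketch transcribes the guards and actions of HUFFM and picks one enabled event at random at each step; the function names and the random scheduling are assumptions of the sketch, and a simulation of this kind only samples behaviours, whereas the Rodin proofs cover all of them.

```python
import random

ALPHABET = ["A", "E", "I", "O", "U"]

def init():
    return {"current_char": "A", "char_to_encode": False, "string_to_decode": False}

def enabled_events(s):
    """Names of the events whose guards hold in state s."""
    names = []
    if not s["char_to_encode"]:
        names.append("get")
    if s["char_to_encode"] and not s["string_to_decode"]:
        names.append("encode")
    if s["string_to_decode"]:
        names.append("decode")
    return names

def fire(s, name, rng):
    """Apply the actions of the named event and return the new state."""
    s = dict(s)
    if name == "get":
        s["char_to_encode"] = True
        s["current_char"] = rng.choice(ALPHABET)   # parameter char, chosen freely
    elif name == "encode":
        s["char_to_encode"] = False
        s["string_to_decode"] = True
    elif name == "decode":
        s["string_to_decode"] = False
    return s

def invariants_hold(s):
    return (s["current_char"] in ALPHABET
            and isinstance(s["char_to_encode"], bool)
            and isinstance(s["string_to_decode"], bool))

def run(steps, seed=0):
    rng = random.Random(seed)
    s, trace = init(), []
    for _ in range(steps):
        names = enabled_events(s)
        if not names:
            break                                  # deadlock, not expected here
        name = rng.choice(names)
        s = fire(s, name, rng)
        assert invariants_hold(s)
        trace.append(name)
    return trace

trace = run(200)
```

After an encode event both get and decode are enabled, which corresponds to the point in the partial order where the two events may occur in either order.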
4.3.2 The First Reﬁnement
The encode event is reﬁned to introduce 5 new events; one for each character. A new
variable encoded_char is introduced to store the value of the character from the input
stream. This variable is required because current_char can be overwritten before encoding
and decoding operations on the character are complete. The model has potentially
a one stage pipeline and, in general, one element of storage is required for each pipeline
stage. In practice, this means that there is always a trade-off between performance and
power consumption. At this stage the decision as to whether to exploit the pipelining
or not is left open.
For example, the event representing the encoding of the letter E is as follows.
Event encode_E ≙
refines encode
when
grd1 : char_to_encode = TRUE
grd2 : string_to_decode = FALSE
grd3 : current_char = E
then
act1 : char_to_encode := FALSE
act2 : string_to_decode := TRUE
act3 : encoded_char := E
end
The following invariants are introduced which define encoded_char and establish its
relationship with the abstract variable current_char.
INVARIANTS
inv1 : encoded_char ∈ Alphabet
inv2 : char_to_encode = FALSE ∧ string_to_decode = TRUE ⇒ current_char = encoded_char
4.3.3 The Second Reﬁnement
Now the bit encoding is introduced using four boolean variables, and each of the encode
events is extended to incorporate the encoding. For example:-
Event encode_A ≙
refines encode_A
when
grd1 : char_to_encode = TRUE
grd2 : string_to_decode = FALSE
grd3 : current_char = A
then
act1 : char_to_encode := FALSE
act2 : string_to_decode := TRUE
act3 : encoded_char := A
act4 : bit0 := TRUE
act5 : bit1 := FALSE
act6 : bit2 := FALSE
end
New invariants relate the encoded_char to the bit variables.
INVARIANTS
inv1 : bit0 ∈ BOOL
inv2 : bit1 ∈ BOOL
inv3 : bit2 ∈ BOOL
inv4 : bit3 ∈ BOOL
inv5 : string_to_decode = TRUE ∧ encoded_char = E ⇒ bit0 = FALSE
inv6 : string_to_decode = TRUE ∧ encoded_char = O ⇒ bit0 = TRUE ∧ bit1 = TRUE
inv7 : string_to_decode = TRUE ∧ encoded_char = A ⇒ bit0 = TRUE ∧ bit1 = FALSE ∧ bit2 = FALSE
inv8 : string_to_decode = TRUE ∧ encoded_char = U ⇒ bit0 = TRUE ∧ bit1 = FALSE ∧ bit2 = TRUE ∧ bit3 = FALSE
inv9 : string_to_decode = TRUE ∧ encoded_char = I ⇒ bit0 = TRUE ∧ bit1 = FALSE ∧ bit2 = TRUE ∧ bit3 = TRUE
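The code words implied by these invariants can be collected into a table and checked for the prefix property that makes the encoding uniquely decodable. The Python sketch below is an illustration: the table CODES is read off the invariants (bit0 first), while the helper functions and the concatenation of code words into one bit stream are assumptions, since the model itself handles one character at a time in bit0..bit3.

```python
# Code words read off inv5..inv9, listed as (bit0, bit1, ...).
CODES = {
    "E": (False,),
    "O": (True, True),
    "A": (True, False, False),
    "U": (True, False, True, False),
    "I": (True, False, True, True),
}

def is_prefix_free(codes):
    """True if no code word is a prefix of a different code word."""
    words = list(codes.values())
    for i, short in enumerate(words):
        for j, other in enumerate(words):
            if i != j and other[:len(short)] == short:
                return False
    return True

def encode(text):
    """Concatenate the code words of the characters in text."""
    bits = []
    for ch in text:
        bits.extend(CODES[ch])
    return bits

stream = encode("AEIOU")
```

The code lengths (1, 2, 3, 4 and 4 bits) reflect the statement that more commonly encountered characters receive shorter encodings.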
4.3.4 The Third Reﬁnement
Up to now the decode event has done nothing but update the state. In this refinement
events and variables are introduced to model the encoding tree. The variable
decoded_char is introduced to store the decoded value, together with the variable level,
which is set to zero to represent the root and is incremented during the tree walk. For
the events that refine skip it is necessary to introduce a variant which must be decreased
by these events.
VARIANT
3 − level
The proof obligations generated for the variant ensure that the tree walk terminates.
The decode events in this refinement are named decode_L_B where L represents the tree
level and B represents the branch taken in the search. The encoding tree is represented
using two types of event: those which represent the leaves of the tree and refine decode.
For instance,
Event decode_2_0 ≙
refines decode
when
grd1 : string_to_decode = TRUE
grd2 : bit2 = FALSE
grd3 : level = 2
then
act1 : string_to_decode := FALSE
act2 : decoded_char := A
end
and those which represent intermediary nodes and reﬁne skip. For instance,
Event decode_0_1 ≙
Status convergent
when
grd1 : string_to_decode = TRUE
grd2 : bit0 = TRUE
grd3 : level = 0
then
act1 : level := 1
end
The tree is therefore represented by a total of 8 decode events: decode_0_0, decode_0_1,
decode_1_0, decode_1_1, decode_2_0, decode_2_1, decode_3_0 and decode_3_1.
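The tree walk can be sketched in Python. The bit patterns below are read off the second-refinement invariants and the (level, bit) table mirrors the eight decode events; the dictionary layout and the function names are illustrative assumptions and are not part of the Event-B model.

```python
# Bit patterns taken from inv5..inv9 of the second refinement (bit0 first).
CODES = {
    "E": (0,),
    "O": (1, 1),
    "A": (1, 0, 0),
    "U": (1, 0, 1, 0),
    "I": (1, 0, 1, 1),
}

# (level, bit) -> letter for a leaf event, or None for an intermediate node.
TREE = {
    (0, 0): "E", (0, 1): None,
    (1, 0): None, (1, 1): "O",
    (2, 0): "A", (2, 1): None,
    (3, 0): "U", (3, 1): "I",
}


def decode(bits):
    """Walk the tree from the root; return (letter, final level)."""
    level = 0
    while True:
        letter = TREE[(level, bits[level])]
        if letter is not None:          # leaf event: refines decode
            return letter, level
        variant = 3 - level             # intermediate event: refines skip
        level += 1
        assert 0 <= 3 - level < variant  # the variant strictly decreases


def round_trip(letter):
    """Encode a letter with CODES and decode it again."""
    return decode(CODES[letter])[0]
```

Because both branches at level 3 are leaves, the walk never steps past level 3, which is the informal counterpart of the variant proof obligation.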
The following invariants link the original encoded_char to the final decoded_char:-
INVARIANTS
inv1 : decoded_char ∈ Alphabet
inv2 : level ∈ 0 .. 3
inv3 : string_to_decode = FALSE ∧ encoded_char = E ∧ level = 0 ⇒ decoded_char = E
inv4 : string_to_decode = FALSE ∧ encoded_char = O ∧ level = 1 ⇒ decoded_char = O
inv5 : string_to_decode = FALSE ∧ encoded_char = A ∧ level = 2 ⇒ decoded_char = A
inv6 : string_to_decode = FALSE ∧ encoded_char = U ∧ level = 3 ⇒ decoded_char = U
inv7 : string_to_decode = FALSE ∧ encoded_char = I ∧ level = 3 ⇒ decoded_char = I
inv8 : 3 − level ≥ 0
All proof obligations associated with the abstract model and its refinements are
discharged automatically, as shown in Table 4.1.
                    Total no. of        Discharged      Discharged   Not
                    proof obligations   Automatically   Manually     Discharged
Abstract Model      0                   0               0            0
First Refinement    8                   8               0            0
Second Refinement   35                  35              0            0
Third Refinement    94                  94              0            0
Table 4.1: Huffman Proofs
Through the use of the gluing invariants, it has been proved that the decoded encoded
stream produced by the concrete implementation is identical to the original input
stream. The EFSM representing the concrete model, shown in Figure 4.4, can now be
implemented as a single RTL process. There is, however, potential for pipelining in the
Huffman algorithm, as described above. The following section will describe how an abstract
hardware specification may be refined into a pipelined implementation comprising
two or more RTL processes communicating with shared memory.
[Figure 4.4: Huffman EFSM]
4.4 Pipelines
Although it is good practice, as exempliﬁed by the SystemC TLM methodology, to
employ FIFOs for communication between SoC components, it is often desirable, for
performance reasons, to implement synchronous communication within a component
with shared variables. If a component is unable to meet its performance targets, then it
may be split into sub-components which not only share the workload, but can operate
concurrently. This technique is called pipelining. A consequence of using a shared
variable for communication is that while one sub-component is reading the value of the
shared variable, the other sub-component will simultaneously write to the same variable,
and this must be taken into account when modelling the pipeline.
4.4.1 Modern SoC Microprocessor Pipelines
In order to achieve high instruction throughput, modern microprocessors can execute
several instructions at once in a multi-stage pipeline [Sutherland, 1989]. Early RISC
processors typically implemented 5 pipeline stages, as exemplified by the DLX
microprocessor [Hennessy and Patterson, 2006], but during the 1990s, to keep up with
increasing performance needs, more and more stages were introduced. Intel's Pentium III
microprocessor has 10 stages and was followed by the Pentium IV with 20. During the design
of the Pentium IV, however, it was found that in order to keep the global track delays
within the required bounds to meet performance targets, it was necessary to manage the
latency in the 20-stage pipeline explicitly. Track delay was emerging as the major factor
governing pipeline depth [Taylor et al., 2002]. The further shrinking of device sizes and
lower power requirements have required a re-evaluation of what constitutes the optimal
number of stages [Hartstein and Puzak, 2004], and many modern SoC microprocessors
(MIPS, Virtex PowerPC, ARM9, OR1K and Tensilica Extensa) implement a 5-stage
pipeline.
4.4.2 Designing and Verifying an SoC Microprocessor Pipeline
Despite the elegance and simplicity of the core DLX pipeline architecture, any implementation
based on this architecture is still subject to design faults which can be extremely
difficult to detect using traditional, simulation-based verification techniques. In particular,
resource conflicts encountered when several pipelined instructions execute at once
can easily be missed. These resource conflicts result from structural, data and control
hazards [Hennessy and Patterson, 2006]. What is needed is a suitable representation
and a systematic and rigorous methodology for the design of an SoC microprocessor
which supports architectural exploration of the trade-offs incurred by the different ways
of dealing with pipeline hazards. Ideally such a methodology would fit well with existing
synchronous design methodologies and provide a seamless link with the lower level
representations required for chip manufacture.
4.4.3 A Pipeline Example: Counting Playing Cards
In this example, an abstract Event-B model is developed to represent the speciﬁcation of
a card counting game. Then, to exploit the potential concurrency inherent in the spec-
iﬁcation, the abstract model is reﬁned to produce a concrete model which implements
a two-stage pipeline. In the development of the reﬁnement it will be shown that the
simultaneous access of shared variables must be managed correctly. Appropriate gluing
invariants are introduced to prove that the pipelined implementation meets its abstract
speciﬁcation.
Informal Speciﬁcation
Each card in the pack is assigned a positive integer value. As cards are taken successively
from the pack, their values are accumulated and a count kept of the number of cards
taken. If the count reaches a predeﬁned limit, or if the accumulated total meets or
exceeds a predeﬁned maximum value, the session terminates and a fresh session begins.
The Abstract Model
The Context deﬁnes the maximum accumulated total, the maximum card count and the
maximum card value.
CONTEXT PIPE21C
CONSTANTS
MaxTotal,MaxCardValue,MaxCards
AXIOMS
axm1 : MaxTotal ∈ ℕ1
axm2 : MaxCardValue ∈ ℕ1
axm3 : MaxCardValue < MaxTotal
axm4 : MaxCards ∈ ℕ1
END
Two variables count and total model the number of cards taken and the accumulated
card values respectively.
INVARIANTS
inv5 : count ≤ MaxCards
inv6 : total < MaxTotal + MaxCardValue
The event Accumulate takes as a parameter the card value and updates count and total
appropriately. (The guard grd3 is independent of cardval and different cards may have
the same value.)
Event Accumulate ≙
any
cardval
where
grd1 : cardval ∈ 1 .. MaxCardValue
grd2 : count < MaxCards
grd3 : total < MaxTotal
then
act1 : total := total + cardval
act2 : count := count + 1
end
The event Reset zeroes count and total when the terminating conditions are satisfied.
Event Reset ≙
when
grd1 : count = MaxCards ∨ total ≥ MaxTotal
then
act1 : count := 0
act2 : total := 0
end
Although not in general a requirement of Event-B modelling, it is a requirement of this
particular system that the model is free from deadlock. At least one of the events Accumulate
and Reset must always be enabled. This is achieved by introducing a theorem
to represent the disjunction of the guards of the 2 events.
thm1 : (count < MaxCards ∧ total < MaxTotal) ∨ count = MaxCards ∨ total ≥ MaxTotal
This gives rise to a proof obligation: the theorem must follow from the invariants and
axioms.
In addition, although not a general Event-B modelling requirement, in this system the
enabling of Accumulate and Reset should be mutually exclusive. The conjunction of the
guards should never be true.
thm2 : ¬((count < MaxCards ∧ total < MaxTotal) ∧ (count = MaxCards ∨ total ≥ MaxTotal))
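The two theorems can be spot-checked by brute force for small constants. The sketch below is only a sanity check under assumed values, not a substitute for the Rodin proofs; the function names and the enumeration bounds (taken from inv5 and inv6) are illustrative.

```python
from itertools import product


def accumulate_enabled(count, total, max_cards, max_total):
    # Guards grd2 and grd3 of Accumulate (grd1 only constrains the parameter).
    return count < max_cards and total < max_total


def reset_enabled(count, total, max_cards, max_total):
    # Guard grd1 of Reset.
    return count == max_cards or total >= max_total


def theorems_hold(max_cards, max_total, max_card_value):
    """Check thm1 (deadlock freedom) and thm2 (mutual exclusion) on every
    state permitted by inv5 and inv6 for the given constants."""
    assert 0 < max_card_value < max_total and max_cards > 0  # axioms
    counts = range(max_cards + 1)                    # inv5
    totals = range(max_total + max_card_value)       # inv6
    for count, total in product(counts, totals):
        acc = accumulate_enabled(count, total, max_cards, max_total)
        rst = reset_enabled(count, total, max_cards, max_total)
        if not (acc or rst):      # thm1 violated
            return False
        if acc and rst:           # thm2 violated
            return False
    return True
```

A call such as `theorems_hold(5, 21, 10)` enumerates every permitted (count, total) pair for those hypothetical constants.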
The behaviour of the abstract model may be represented by the Finite State Machine
(FSM), shown in Table 4.2, where count and total represent the state variables and the
abstract events Accumulate and Reset represent the transitions.
Current State                Actions      Next State
count        total                        count     total
<MaxCards    <MaxTotal       Accumulate   count+1   total+value
=MaxCards    don't care      Reset        0         0
don't care   ≥MaxTotal       Reset        0         0
Table 4.2: Abstract Model Events
An implementation could now be derived from this abstract model in which a card
value is generated and added to the total in a single clock cycle. Each evaluation of
Accumulate or Reset therefore represents a clock cycle in the implementation. Although
this implementation would meet its functional speciﬁcation, it may be found not to meet
its performance targets in terms of speed or power consumption. The complexity of the
implementation could mean that either an operation could not be completed within
the required clock period, or that doing so would generate too much heat. Event-B
reﬁnement is therefore used to explore the viability of a pipelined solution.
The Reﬁned Model
The overall task is split in two. The ﬁrst sub-task generates the card value from a pack
of cards and the second sub-task adds this value to the total and checks the termination
conditions. A pipeline register is introduced comprising two variables. The pipeline is
shown in Figure 4.5.
[Figure 4.5: The Pipeline]
The variable value represents the card value generated and the variable cardsleft repre-
sents the number of cards left in the current pack. The new event Generate is introduced
which writes to value and the guards of the abstract event Accumulate are strengthened
so that it reads from value. A pack of cards is available in the initial state, but when
this pack runs out, a new pack must be fetched. The event GetNewPack is introduced
to do this.
Event Generate ≙
any
cardval
where
grd1 : cardval ∈ 1 .. MaxCardValue
grd2 : cardsleft > 0
then
act1 : value := cardval
act2 : cardsleft := cardsleft − 1
end
Event GetNewPack ≙
where
grd1 : cardsleft = 0
then
act1 : cardsleft := PackSize
end
Event Accumulate ≙
refines Accumulate
any
cardval
where
grd1 : cardval = value
grd2 : count < MaxCards
grd3 : total < MaxTotal
then
act1 : total := total + cardval
act2 : count := count + 1
end
The pipeline, however, will not operate as desired. Initially, only the Generate event is
enabled, but an evaluation of this event results in both Generate and Accumulate being
enabled simultaneously. If Generate is chosen, non-deterministically, for evaluation, the
initial value generated will be incorrectly overwritten. If, however, Accumulate is chosen
for evaluation and then chosen a second time, the original generated value will be incor-
rectly used twice. Using a FIFO instead of a variable for communication would ensure
the desired operation, but the overhead of incorporating the FIFO and its control logic
in small components is prohibitive. There are therefore two alternative architectures
that can be considered. A single state variable could be used to manage the enabling
of the Generate and Accumulate events and ensure the desired ordering. An implementation
of this architecture would require two clock cycles for each operation and for
very low power applications this may be acceptable. In general, however, it is desirable
to exploit the simultaneity offered by a pipelined architecture to maximise throughput.
An example of the desired behaviour is shown in Figure 4.6.
[Figure 4.6: Pipeline Execution (timeline over six clock cycles showing the Generate, Accumulate, Reset and GetNewPack events)]
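The two failure modes of the unmerged events (a generated value being overwritten, and a value being accumulated twice) can be reproduced with a small Python sketch. The constants, the initial state and the function names are hypothetical; only the guards and actions follow the Generate and Accumulate events of the refinement.

```python
MAX_CARDS, MAX_TOTAL = 5, 21   # hypothetical constants

# Hypothetical initial state: a full pack and nothing accumulated yet.
INITIAL = {"value": 0, "cardsleft": 52, "count": 0, "total": 0}


def generate(state, cardval):
    """Generate: write the new card value into the pipeline register."""
    assert state["cardsleft"] > 0                       # grd2
    return {**state, "value": cardval,
            "cardsleft": state["cardsleft"] - 1}


def accumulate(state):
    """Accumulate: read the pipeline register and add it to the total."""
    assert state["count"] < MAX_CARDS and state["total"] < MAX_TOTAL
    return {**state, "total": state["total"] + state["value"],
            "count": state["count"] + 1}


def overwritten():
    """Generate chosen twice in a row: the card worth 3 is lost."""
    return accumulate(generate(generate(INITIAL, 3), 7))


def used_twice():
    """Accumulate chosen twice in a row: the card worth 3 is counted twice."""
    return accumulate(accumulate(generate(INITIAL, 3)))
```

In the first run two cards leave the pack but only one is counted; in the second, one card leaves the pack but two are counted.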
To model this behaviour, it is necessary to introduce further events into this reﬁnement
which represent a merge of the events in the two pipeline stages. How the events in the
ﬁrst stage must be merged with those in the second is dependent on the value of the
pipeline control variable cardsleft. Generate is enabled when cardsleft is greater than
zero and GetNewPack is enabled when cardsleft is equal to zero. Either Accumulate
or Reset is enabled simultaneously depending on the values of count and total. The
exception is when a new pack is presented and cardsleft equals PackSize. In this case
only Generate is enabled. The merged events and the conditions under which they are
enabled, as represented by the values of cardsleft, count and total, are shown in Table 4.3.
The table also includes the states of these variables that result from the evaluation of the
merged events. A current-state entry that does not affect enabling is a "don't care" and a
next-state entry that the merged event does not modify is "unchanged".
Consider an example of merging events. The merged event GenerateAccumulate, shown
in the second row of Table 4.3, both reads and writes the variable value. Event-B
semantics ensures that the value of a variable read is always its value prior to update, and
this matches precisely the semantics of a hardware implementation.
Event GenerateAccumulate ≙
refines Accumulate
any
gcardval,cardval
where
grd1 : gcardval ∈ 1 .. MaxCardValue
grd2 : cardval = value
grd3 : total < MaxTotal
grd4 : count < MaxCards
grd5 : cardsleft > 0
grd6 : cardsleft < PackSize
then
act1 : value := gcardval
act2 : total := total + cardval
act3 : count := count + 1
act4 : cardsleft := cardsleft − 1
end
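The read-before-write behaviour of the merged event can be illustrated with a Python sketch in which every right-hand side is evaluated on the old state before any variable is updated. The constant values and the function name are hypothetical; the guards and actions follow GenerateAccumulate.

```python
MAX_CARD_VALUE, MAX_TOTAL, MAX_CARDS, PACK_SIZE = 10, 21, 5, 52  # hypothetical


def generate_accumulate(state, gcardval):
    """One evaluation of the merged event on a state dictionary."""
    assert 1 <= gcardval <= MAX_CARD_VALUE            # grd1
    cardval = state["value"]                          # grd2: old register value
    assert state["total"] < MAX_TOTAL                 # grd3
    assert state["count"] < MAX_CARDS                 # grd4
    assert 0 < state["cardsleft"] < PACK_SIZE         # grd5, grd6
    # All four actions use values read from the old state.
    return {
        "value": gcardval,                            # act1
        "total": state["total"] + cardval,            # act2
        "count": state["count"] + 1,                  # act3
        "cardsleft": state["cardsleft"] - 1,          # act4
    }
```

For example, with value = 3 in the register and a newly generated card worth 7, the total grows by 3 while the register is loaded with 7 for the next cycle.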
Current State                               Actions                  Next State
cardsleft       count        total                                   cardsleft     count       total
PackSize        don't care   don't care     Generate                 PackSize-1    unchanged   unchanged
1..PackSize-1   <MaxCards    <MaxTotal      Generate, Accumulate     cardsleft-1   count+1     total+value
1..PackSize-1   =MaxCards    don't care     Generate, Reset          cardsleft-1   0           0
1..PackSize-1   don't care   ≥MaxTotal      Generate, Reset          cardsleft-1   0           0
0               <MaxCards    <MaxTotal      GetNewPack, Accumulate   PackSize      count+1     total+value
0               =MaxCards    don't care     GetNewPack, Reset        PackSize      0           0
0               don't care   ≥MaxTotal      GetNewPack, Reset        PackSize      0           0
Table 4.3: Refined Model Merged Events
                    Total no. of        Discharged      Discharged   Not
                    proof obligations   Automatically   Manually     Discharged
Abstract Model      11                  11              0            0
First Refinement    13                  13              0            0
Table 4.4: Pipeline Proofs
All proof obligations associated with the abstract model and its reﬁnement, which in-
cludes an event for each case identiﬁed in Table 4.3, are discharged automatically by the
Rodin tool, as shown in Table 4.4.
We modelled simultaneous execution of pipeline stages as a single event in Event-B. We
then identified the different combinations of cases for the stages (Table 4.3). In the next
section we outline a general scheme for managing event simultaneity in pipelines.
4.4.4 Event Simultaneity in Pipelines
Consider a 2-stage pipeline with the first stage represented by the event E1 and the
second by the event E2, and each event comprising a guard gi and an action Ai (Ei is
gi → Ai). If g1 and g2 are both true, A1 and A2 are executed simultaneously. If exactly
one of g1 and g2 is true then the corresponding action, A1 or A2, is executed on its own.
If both guards are false then neither action is executed. The required cases of behaviour
are shown in Table 4.5.
g1 ∧ g2      →  A1 || A2
g1 ∧ ¬g2     →  A1
¬g1 ∧ g2     →  A2
¬g1 ∧ ¬g2    →  (no action)
Table 4.5: 2-stage 2-event pipeline
If a particular conjunction of two guards always evaluates to false, then the associated
action will never occur. For instance, if g1 ∧ ¬g2 can never be true, then action A1 can
never be executed on its own.
More generally, a pipeline stage may be represented by a set of events. For instance,
in a 2-stage pipeline where each stage i is represented by two events Ei1 (gi1 → Ai1) and
Ei2 (gi2 → Ai2), and where Ngi is ¬(gi1 ∨ gi2), the required behaviour is shown in Table 4.6.
Again, not all combinations may be possible.
g11 ∧ g21    →  A11 || A21
g11 ∧ g22    →  A11 || A22
g11 ∧ Ng2    →  A11
g12 ∧ g21    →  A12 || A21
g12 ∧ g22    →  A12 || A22
g12 ∧ Ng2    →  A12
Ng1 ∧ g21    →  A21
Ng1 ∧ g22    →  A22
Ng1 ∧ Ng2    →  (no action)
Table 4.6: 2-stage 4-event pipeline
For a 2-stage pipeline, where the stages are represented by m and n events respectively,
the combinations are shown in Table 4.7.
g11 ∧ g21    →  A11 || A21
. . . . .
g11 ∧ g2n    →  A11 || A2n
g11 ∧ Ng2    →  A11
. . . . .
. . . . .
g1m ∧ g21    →  A1m || A21
. . . . .
g1m ∧ g2n    →  A1m || A2n
g1m ∧ Ng2    →  A1m
Ng1 ∧ g21    →  A21
. . . . .
Ng1 ∧ g2n    →  A2n
Ng1 ∧ Ng2    →  (no action)
Table 4.7: 2-stage m+n-event pipeline
Each row in the table, apart from the last, corresponds to a combined event that will need
to be represented in the machine for the pipeline to be modelled correctly. Combined
events whose guards always evaluate to false, however, can be excluded. Since the guards
of the events representing a pipeline stage need not be mutually exclusive, more than
one combined event may be enabled for a given set of variable values.
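The rows of Tables 4.5 to 4.7 can be generated mechanically. The following Python sketch enumerates them for any number of stages; the representation (an event index per stage, with None standing for the "no event of this stage enabled" guard Ng) and the function names are illustrative assumptions.

```python
from itertools import product


def combined_events(stage_sizes):
    """Return one tuple per combined event.

    stage_sizes[i] is the number of events in stage i+1. In each tuple,
    entry i is the index (1-based) of the event chosen from stage i+1,
    or None when no event of that stage takes part (guard Ng).
    The final all-None row is dropped because it performs no action.
    """
    choices = [list(range(1, n + 1)) + [None] for n in stage_sizes]
    rows = product(*choices)
    return [row for row in rows if any(e is not None for e in row)]


def describe(row):
    """Render a row in the style of the tables, e.g. 'g11 ∧ Ng2 → A11'."""
    guards, actions = [], []
    for stage, event in enumerate(row, start=1):
        if event is None:
            guards.append(f"Ng{stage}")
        else:
            guards.append(f"g{stage}{event}")
            actions.append(f"A{stage}{event}")
    return " ∧ ".join(guards) + " → " + " || ".join(actions)
```

For two stages of two events each, the enumeration order matches Table 4.6; combinations whose guards can never hold together would still need to be pruned by hand or by proof.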
4.4.5 Measuring Pipeline Complexity at the Speciﬁcation Level
The number of simultaneous events that need to be considered is exponential in the
number of pipeline stages. For a pipeline stage Si represented by ni events, there is a
maximum of ni + 1 ways that these events can be combined with the events of other
stages, since the absence of an event must also be considered. Thus, for a pipeline with
p stages the maximum number of combined events is
(n1 + 1) × (n2 + 1) × ··· × (np + 1)
In practice, not all combinations are usually possible, but nevertheless the number of
combined events increases rapidly with the number of stages in a component pipeline.
For instance, a 7-stage pipeline with 2 events per stage could have up to 2187 combined
events. The number of combined events, however, is a direct measure of the complex-
ity of the component and represents the number of cases that must be considered for
full component veriﬁcation. Valuable feedback to the designer on the precise number
of feasible simultaneous pipeline event combinations can therefore be provided during
micro-architectural exploration at the speciﬁcation level.
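The bound is simple to evaluate during exploration. A minimal sketch, with a hypothetical function name, that computes the product for a list of per-stage event counts:

```python
from math import prod


def max_combined_events(events_per_stage):
    """Upper bound (n1 + 1) x (n2 + 1) x ... x (np + 1) on combined events."""
    return prod(n + 1 for n in events_per_stage)
```

With seven stages of two events each this gives 3 to the power 7, which is the figure of 2187 quoted above.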
4.4.6 Pipeline Feedback
In this section we consider the issue of pipeline feedback. Consider the example from
[Shankar, 1998], a pipelined version of which is shown in Figure 4.7. The component P
guarantees that its output is even if both its inputs are odd. The component Q guaran-
tees that its outputs are both odd if its input is even. It is required that three invariants
are preserved by the composition of P and Q as a pipeline.
Invariant 1: x is always odd
Invariant 2: y is always odd
Invariant 3: z is always even
[Figure 4.7: Even Number Generator]
The difficulty with this example arises from the fact that in composing P and Q the
output of P is fed back to the input of Q.
For a pipeline section with no feedback, as shown in Figure 4.8, where E1 and E2
are events and V1 and V2 are shared registers, E2 followed by E1 (E2;E1) is
equivalent to E1 and E2 occurring simultaneously (E1||E2). Even if V2 depends on V1,
by evaluating E2 before E1, E2 sees the previous value of V1, which matches precisely
the hardware semantics for simultaneous register read and write.
For a pipeline section with feedback, however, as shown in Figure 4.9, there is no
interleaving that represents (E1||E2). It is therefore necessary to merge the events E1
and E2 into a single, combined event E1E2 to model correctly the simultaneity of the
pipeline section. For pipeline sections with feedback, the invariants must be proved for
the complete, merged section. Once these invariants have been proved, however, the
merged pipeline section may be decomposed, using the techniques described in Section
2.9 on Page 17 and illustrated later in this section, into the events that represent each of
its constituent stages. By the use of Event-B decomposition, the exponential explosion
of combined pipeline events with pipeline length can be controlled.
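The difference between the two cases can be seen in a small Python sketch. The register names V1 and V2 follow the text; the input IN, the particular update functions and the helper names are hypothetical choices made only to give the events something to compute.

```python
def parallel(state, events):
    """E1||E2: every event reads the old state; all writes commit together."""
    updates = {}
    for event in events:
        updates.update(event(state))
    return {**state, **updates}


def sequential(state, events):
    """E;E': each event sees the writes of the events before it."""
    for event in events:
        state = {**state, **event(state)}
    return state


# Without feedback: E1 reads an external input, E2 reads V1.
def e1(s):
    return {"V1": s["IN"] + 1}


def e2(s):
    return {"V2": s["V1"] * 2}


# With feedback: E1 reads V2, which E2 writes, and E2 reads V1.
def f1(s):
    return {"V1": s["V2"] + 1}


def f2(s):
    return {"V2": s["V1"] * 2}


START = {"IN": 5, "V1": 10, "V2": 0}
```

Without feedback, running E2 before E1 reproduces the simultaneous result; with feedback, neither ordering does, which is why the merged event is needed.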
[Figure 4.8: Pipeline without Feedback]
[Figure 4.9: Pipeline with Feedback]
[Figure 4.10: Abstract Model]
The Abstract Model
At the abstract level, the Even Number Generator may be considered as a black box that
writes an even number to a register z, as shown in Figure 4.10, preserves the invariant
inv2 : z mod 2 = 0
and can be described with the single event
Event Even ≙
any
v
where
grd1 : v ∈ ℕ
grd2 : v mod 2 = 0
then
act1 : z := v
end
This abstraction captures the important property of this system: register z may be
updated, but always with an even number.
The First Reﬁnement
The simultaneous behaviour of the complete pipeline with feedback can now be modelled.
The micro-architecture of this refinement is shown in Figure 4.11.
[Figure 4.11: First Refinement]
Two registers x and y are introduced together with the following invariants
inv1 : x ∈ ℕ
inv2 : x mod 2 ≠ 0
inv3 : y ∈ ℕ
inv4 : y mod 2 ≠ 0
inv5 : (x + y) mod 2 = 0
and the parameter v is bound to the registers x and y with the witness
v : v = x + y
which instantiates the parameter of the abstract model with a concrete value, as described
in Section 2.9 on Page 19.
The register z now takes the value x + y while simultaneously x and y both take the
value z + 1 as shown in the actions
act1 : z := x + y
act2 : x := z + 1
act3 : y := z + 1
The event Evencmp represents the composed behaviour P||Q.
Event Evencmp ≙
refines Even
begin
with
v : v = x + y
act1 : z := x + y
act2 : x := z + 1
act3 : y := z + 1
end
END
All proof obligations are discharged automatically by the Rodin tool, showing that the
composed concrete implementation is a correct reﬁnement of the abstract model.
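As an informal complement to the proofs, the refinement can be exercised in Python: at each step the witness v = x + y is checked against the abstract guards, the concrete update is compared with the abstract action z := v, and the invariants are re-checked. The initial values and the function names are assumptions, since the INITIALISATION event is not shown here.

```python
def evencmp(state):
    """Concrete event: simultaneous update of (x, y, z) from the old values."""
    x, y, z = state
    return (z + 1, z + 1, x + y)


def invariants_hold(state):
    """inv1..inv5 of this refinement plus the abstract invariant on z."""
    x, y, z = state
    return (x >= 0 and y >= 0
            and x % 2 != 0 and y % 2 != 0
            and (x + y) % 2 == 0
            and z % 2 == 0)


def run(initial, cycles):
    """Simulate the event, checking the witness and invariants each cycle."""
    state = initial
    assert invariants_hold(state)
    trace = [state]
    for _ in range(cycles):
        x, y, _z = state
        v = x + y                        # witness for the abstract parameter
        assert v >= 0 and v % 2 == 0     # abstract guards grd1 and grd2
        state = evencmp(state)
        assert state[2] == v             # matches the abstract action z := v
        assert invariants_hold(state)
        trace.append(state)
    return trace
```

A simulation of this kind only samples one trace from one assumed starting point; the discharged proof obligations cover all reachable states.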
4.4.7 An Alternative Compositional Approach to Pipeline Reﬁnement
Recall the abstract machine architecture of Figure 4.12
and the event of the abstract model
Event Even ≙
any
v
where
grd1 : v ∈ ℕ
grd2 : v mod 2 = 0
then
act1 : z := v
end
[Figure 4.12: Abstract Model]
The First Reﬁnement
Now consider an Event-B machine with a single event Evenp which reﬁnes the abstract
machine and models the component P of Figure 4.11 together with an abstract repre-
sentation of the environment in which P must be placed if it is to deliver its speciﬁed
behaviour.
Event Evenp ≙
refines Even
any
pin1,pin2
where
grd1 : pin1 ∈ ℕ
grd2 : pin1 mod 2 ≠ 0
grd3 : pin2 ∈ ℕ
grd4 : pin2 mod 2 ≠ 0
grd5 : (pin1 + pin2) mod 2 = 0
with
v : v = pin1 + pin2
then
act1 : z := pin1 + pin2
end
The micro-architecture is shown in Figure 4.13.
[Figure 4.13: First Refinement]
There is a requirement on the environment in which component P is used to ensure that
both inputs are odd, represented by the guards
grd1 : pin1 ∈ ℕ
grd2 : pin1 mod 2 ≠ 0
grd3 : pin2 ∈ ℕ
grd4 : pin2 mod 2 ≠ 0
If these conditions are met, then the model takes its two inputs pin1 and pin2, adds
them together and writes the result to the output register z. The parameter v is replaced
by the parameters pin1 and pin2 using the witness
v : v = pin1 + pin2
Thus, if P’s environment fulﬁlls its requirement to ensure that both inputs are odd then
P will produce an even number on its output.
All proof obligations are discharged, showing that component P, together with its
parameterised environment, is a correct refinement of the abstract model.
The Second Reﬁnement
Now consider a parameterised representation of the component Q of Figure 4.11 and its
environment that takes a single input, increments the value on that input and writes the
result to two output registers x and y. There is a requirement on the environment in
which component Q is used to ensure that the input is even. If the environment fulﬁlls
this requirement then Q will produce an odd number on each of its outputs.
The micro-architecture of the component is shown in Figure 4.14.
[Figure 4.14: Component Q and its Environment]
and its single event is as follows.
Event Oddp ≙
any
pin
where
grd1 : pin ∈ ℕ
grd2 : pin mod 2 = 0
then
act1 : x := pin + 1
act2 : y := pin + 1
end
In this refinement, the machine representing the component Q and its environment
is composed, using the Rodin composition plug-in, with the machine from the first
refinement (representing P and its environment) to form a machine with a single event
PQ which represents the composition of the events Evenp and Oddp and refines Evenp.
Event PQ ≙
refines Evenp
any
pin,pin1,pin2
where
grd1 : pin ∈ ℕ
grd2 : pin mod 2 = 0
grd3 : pin1 ∈ ℕ
grd4 : pin1 mod 2 ≠ 0
grd5 : pin2 ∈ ℕ
grd6 : pin2 mod 2 ≠ 0
grd7 : (pin1 + pin2) mod 2 = 0
then
act1 : z := pin1 + pin2
act2 : x := pin + 1
act3 : y := pin + 1
end
All proof obligations are discharged automatically by the Rodin tool, showing that
the composed machine is a correct reﬁnement of the abstract machine. At this stage,
however, no communication has been established between the components because the
parameters have not yet been bound to the registers, as shown in Figure 4.15.
[Figure 4.15: Initial Component Composition]
The Third Reﬁnement
Now, the parameterised inputs and outputs of the composed machine can take the
values from the appropriate registers using witnesses. The parameter pin is bound to
the register z and pin1 and pin2 are bound to the registers x and y respectively. The
concrete, combined event is as follows.
Event PQ ≙
refines PQ
begin
with
pin : pin = z
pin1 : pin1 = x
pin2 : pin2 = y
act1 : z := x + y
act2 : x := z + 1
act3 : y := z + 1
end
END
Shared register communication has now been established. All proof obligations are
discharged automatically, showing that this is a correct refinement of the abstract model,
and the final, concrete micro-architecture of Figure 4.16 is achieved.
[Figure 4.16: Pipelined Implementation]
The event reﬁnement steps used can be summarised as follows.
Even (Abstract event)
⊑
Evenp (Component P with its Environment)
⊑
Evenp || Oddp (P with its Environment || Q with its Environment)
⊑
PQ
4.4.8 Pipeline Decomposition
The last reﬁnement is a concrete representation of the simultaneous behaviour of the
2-stage pipeline, with each evaluation of the composed event Evencmp representing a
clock cycle in the hardware pipeline. The model can now be decomposed formally using
Event-B shared event decomposition, as described in Section 2.9 on Page 18, into two
models, to represent the two hardware processes that constitute the two stages of the
pipeline, communicating with the shared registers x, y and z. The two models each
comprise a single event, P and Q respectively,
Event P ≙
begin
act1 : z := x + y
end
Event Q ≙
begin
act1 : x := z + 1
act2 : y := z + 1
end
and have a direct correspondence with the two Verilog RTL processes shown in Table 4.8.
always @ (posedge clk)      always @ (posedge clk)
begin                       begin
  z <= x + y;                 x <= z + 1;
end                           y <= z + 1;
                            end
Table 4.8: Verilog RTL Processes
At each clock cycle, z takes the value x + y while, simultaneously, x and y take the
value z + 1. The update semantics of the Event-B and the Verilog are identical since in
each case the registers are updated with the previous values of the registers referred to
on the right-hand side of the assignments; the non-blocking nature of the assignments
used in the Verilog ensures that all updates happen in parallel and match the atomic
nature of the Event-B actions.
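The role of the non-blocking assignments can be illustrated by modelling one clock cycle in Python in three ways: as the merged Event-B event, as two processes with non-blocking updates, and, for contrast, with blocking updates where P happens to run before Q. The function names and the starting values are assumptions made for this sketch.

```python
def merged_event(x, y, z):
    """Merged Event-B event: all right-hand sides read the old values."""
    return z + 1, z + 1, x + y


def nonblocking_cycle(x, y, z):
    """Both processes sample the current registers, then commit together."""
    next_z = x + y          # process P:  z <= x + y
    next_x = z + 1          # process Q:  x <= z + 1
    next_y = z + 1          #             y <= z + 1
    return next_x, next_y, next_z


def blocking_cycle(x, y, z):
    """Blocking '=' with P evaluated first: Q sees the new z."""
    z = x + y
    x = z + 1
    y = z + 1
    return x, y, z


def trace(step, state, cycles):
    """Apply one of the cycle functions repeatedly and record each state."""
    states = [state]
    for _ in range(cycles):
        state = step(*state)
        states.append(state)
    return states
```

The non-blocking model reproduces the merged event cycle for cycle, whereas the blocking variant diverges after the first clock edge even though the parity invariants happen to survive.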
The three key invariants from the Event-B model,
inv2 : z mod 2 = 0
inv3 : x mod 2 ≠ 0
inv4 : y mod 2 ≠ 0
can also be translated directly into assertions in the hardware property specification
language PSL to accompany the RTL.
inv2 : assert always((z % 2) == 0)
inv3 : assert always((x % 2) != 0)
inv4 : assert always((y % 2) != 0)
4.4.9 Generalising the Approach to Pipeline Veriﬁcation with Event-B
This example suggests a general approach to pipeline modelling and veriﬁcation with
Event-B, which will be explored in detail in subsequent chapters and is a key contribution
of our work.
The method begins with an abstract Event-B model representing the speciﬁcation of
the required behaviour. The abstract speciﬁcation represents a high-level view of the
hardware which executes in a single cycle. A reﬁnement of the abstract model is then
introduced which represents the behaviour as a two-stage pipeline; the second stage rep-
resents a concrete stage of the ﬁnal implementation while the ﬁrst stage is an abstraction
of the rest of the pipeline. The choice of which stage to make concrete depends on the
nature of the abstract specification. For instance, the specification may be defined in
terms of the effect that an operation has on the pipeline registers or the program counter
and this will determine how the refinement proceeds, as explained in the next chapter.
The two stages communicate via shared registers and gluing invariants will need to be
introduced to prove that the two-stage pipeline implements the abstract speciﬁcation.
This model executes in two cycles but with overlapping execution of the two stages
which, coupled with any feedback in the pipeline, can lead to data or control hazards.
It will be shown in the next chapter how the proof obligations generated by the Rodin
tool can be used to discover the appropriate invariants to deal with the hazards.
This two-stage model is then further reﬁned into a three-stage pipeline with now two of
the stages representing the ﬁnal implementation while the third is an abstraction of the
rest of the pipeline. Again, gluing invariants are introduced to prove that the three-stage
pipeline of the reﬁnement implements the more abstract two-stage model and to deal
with any further pipeline hazards. This process continues until all pipeline stages within
a pipeline feedback loop have a concrete implementation.
Once the invariants have been proved, which shows that the overlapping execution within
the feedback loop has been implemented correctly, shared event decomposition is used
to decompose the pipeline model into a set of models, one for every stage of the pipeline.
Models representing the stages outside the feedback loop can then be introduced. Each
pipeline stage model represents a hardware process of the ﬁnal pipeline implementation,
and the invariants represent the properties of that pipeline.
The Event-B models representing each pipeline stage can then be mapped directly to a
Bluespec, CAL or RTL process representation and the invariants mapped to a property
speciﬁcation language such as PSL.
In the next chapter we will apply this general approach to a typical SoC Microprocessor
and also show how, by decomposing the problem, each instruction of the ISA can be
modeled and reﬁned individually before arriving at a concrete implementation of the
pipeline.

Chapter 5
Developing an SoC Pipelined
Microprocessor Model
The contribution described in this chapter is the development of the method outlined
at the end of Chapter 4 so that an abstract speciﬁcation of an SoC microprocessor can
be reﬁned formally to derive a concrete implementation of the microprocessor that has
a direct correspondence to its HDL description. The method supports the exploration
and development of the pipelined microarchitecture, where the abstract machine represents
directly an instruction from the ISA that specifies the effect that the instruction
has on the microprocessor register file and/or the program counter. Refinement is then
used systematically to derive a concrete, pipelined execution of that instruction that is
proved to implement correctly its abstract ISA speciﬁcation. When all the instructions
have been modelled, formal re-composition of the instruction contributions to each stage
is then used to derive the full, pipelined ISA implementation. Microarchitectural con-
siderations are raised to the speciﬁcation level and design choices can be veriﬁed much
earlier in the ﬂow.
The motivation for this work is to demonstrate an approach that uses Event-B and
reﬁnement to develop and verify a concrete implementation of a typical SoC micropro-
cessor from its ISA speciﬁcation. A TRS-style description of the DLX instruction set is
presented, which is a natural mapping from the DLX speciﬁcation taken from [Hennessy
and Patterson, 2006] and shown in Table 5.1, to show that an abstract speciﬁcation of
the DLX pipeline can be developed in Event-B, that this speciﬁcation can be reﬁned
systematically to a point where different architectural trade-offs can be explored to deal
with pipeline hazards and that ﬁnally a concrete speciﬁcation suitable for synthesis can
be produced that has been veriﬁed formally to meet its abstract speciﬁcation.
[Figure: the DLX datapath, showing PC, the Instruction Memory (IMem), the register file
Registers (Regs) and the Data Memory (DMem), separated by pipeline registers.
Legend: generic operations Load, Store, Branch, Arith, ArithImm; pipeline stages
Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Access (MEM),
Write Back (WB).]
Figure 5.1: DLX 5-stage Pipeline
5.1 Modelling DLX with Event-B
DLX has 5 generic instructions: Load, Store, Branch, ALU (register/register) and ALU
(register/immediate) which are executed over 5 pipeline stages: Instruction Fetch (IF),
Instruction Decode (ID), Execute (EX), Memory Access (MEM) and Writeback (WB)
as shown in Figure 5.1.
5.1.1 Instruction Fetch (IF)
The IF stage gets the instruction address from the program counter (PC) register and
fetches the instruction from the Instruction Memory (IMem). The instruction is written
to the pipeline register IFID. If a branch instruction has been encountered in the pipeline,
then PC is updated to reﬂect the new instruction address. Otherwise PC is incremented
to point to the next instruction in IMem. The value of PC is also written to the IFID
pipeline register.
5.1.2 Instruction Decode (ID)
DLX has three 32-bit instruction types, where the instruction (IR) fields and offsets differ
according to type.
I-type:
0..5 6..10 11..15 16..31
Opcode rs1 rd Immediate
where rs1 (IR6..10) is the source register, rd (IR11..15) is the destination register and
Immediate (IR16..31) is a 16-bit source data value stored in the instruction (used for
register/immediate arithmetic operations and for conditional branches),
R-type:
0..5 6..10 11..15 16..20 21..31
Opcode rs1 rs2 rd func
where rs1 (IR6..10) and rs2 (IR11..15) are source registers, rd (IR16..20) is the destination
register and func (IR21..31) is the ALU operation to be performed, and
J-type:
0..5 6..31
Opcode Branch Offset
used for unconditional jumps.
The values at offsets IR6..10 and IR11..15, read from the IFID pipeline register, are used
as indices into the register file Regs and the resulting values are stored in the registers A
and B respectively, which form part of the IDEX pipeline register. The value at offset
IR16..31 is written directly to the register Imm in the IDEX pipeline register, together
with the instruction itself and PC value from the previous stage.
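The field layouts above can be exercised with a small decoder. The following Python sketch is illustrative only: it assumes that bit 0 is the most significant bit of the 32-bit word (the text gives the offsets but not the bit ordering), the helper names `field`, `decode_i_type` and `decode_r_type` are hypothetical, and the example opcode value is arbitrary. The sign extension of the immediate follows the ID row of Table 5.1.

```python
def field(ir: int, first: int, last: int) -> int:
    """Return bits first..last of a 32-bit word, counting bit 0 as the MSB (assumption)."""
    width = last - first + 1
    return (ir >> (31 - last)) & ((1 << width) - 1)


def sign_extend_16(value: int) -> int:
    """Interpret a 16-bit field as a two's-complement integer."""
    return value - 0x10000 if value & 0x8000 else value


def decode_i_type(ir: int) -> dict:
    """I-type: Opcode (0..5), rs1 (6..10), rd (11..15), Immediate (16..31)."""
    return {
        "opcode": field(ir, 0, 5),
        "rs1": field(ir, 6, 10),
        "rd": field(ir, 11, 15),
        "imm": sign_extend_16(field(ir, 16, 31)),
    }


def decode_r_type(ir: int) -> dict:
    """R-type: Opcode (0..5), rs1 (6..10), rs2 (11..15), rd (16..20), func (21..31)."""
    return {
        "opcode": field(ir, 0, 5),
        "rs1": field(ir, 6, 10),
        "rs2": field(ir, 11, 15),
        "rd": field(ir, 16, 20),
        "func": field(ir, 21, 31),
    }


# Hypothetical I-type word: opcode 35, rs1 = 1, rd = 2, immediate = -4.
example = (35 << 26) | (1 << 21) | (2 << 16) | 0xFFFC
print(decode_i_type(example))
```

In the ID stage described above, rs1 and the field at offsets 11..15 index the register file to produce A and B, while the sign-extended immediate is passed on in Imm.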
5.1.3 Execute (EX)
This stage performs three different operations, depending on instruction type.
For ALU operations, the values of A and (B or Imm) are passed to the ALU and the
result written to the ALUoutput register.
For Load and Store operations, the values of A and Imm are added together by the ALU
to generate the memory address, which is also stored in ALUoutput. The value of B is
written directly to the EXMEM pipeline register.
For conditional Branch instructions, the PC and Imm values are added by the ALU to
generate the branch offset, also stored in ALUoutput. In addition, the cond register is
set to "1" if the boolean operation (branch if equal/not equal to "0") on the value of
register A determines that the branch will occur. (In all other cases, cond is set to "0".)
5.1.4 Memory Access (MEM)
The address line of the data memory DMem takes the value of the register ALUoutput
from the previous stage. In addition, the values of ALUoutput and cond are fed back to
the Instruction Fetch stage so that the next value of PC can be determined if there is a
branch.
For Store operations, the value of B from the previous stage is written to the data line
of DMem.
For Load operations the value on the output line of DMem is written to the register
LMD.
For ALU operations, the value of ALUoutput is transferred directly to the MEMWB
pipeline register.
5.1.5 Writeback (WB)
For ALU operations the value of ALUoutput from the previous stage, and for Load
operations the value of LMD, is written to the register file at the location determined
by the target register in the instruction. Store operations do not write to the register file.
[Figure: the per-stage DLX register transfers written as Event-B assignments, with
alternatives for the ALU, Load/Store and Branch instructions.]
Figure 5.2: Event-B DLX Instructions
Figure 5.2 shows how the DLX Pipeline Structure as summarised in [Prabhu, 2000] and
shown in Table 5.1 can be mapped directly to an Event-B description whilst retaining
the style and structure of the original representation.
Stage IF (all instruction types):
    IF/ID.IR ← Mem[PC];
    IF/ID.NPC, PC ← (if EX/MEM.cond {EX/MEM.NPC} else {PC+4});

Stage ID (all instruction types):
    ID/EX.A ← Regs[IF/ID.IR6..10];
    ID/EX.B ← Regs[IF/ID.IR11..15];
    ID/EX.NPC ← IF/ID.NPC;
    ID/EX.IR ← IF/ID.IR;
    ID/EX.Imm ← (IR16)^16 ## IR16..31;

Stage EX:
    ALU instruction:
        EX/MEM.IR ← ID/EX.IR;
        EX/MEM.ALUoutput ← ID/EX.A op ID/EX.B; or
        EX/MEM.ALUoutput ← ID/EX.A op ID/EX.Imm;
        EX/MEM.cond ← 0;
    Load or store instruction:
        EX/MEM.IR ← ID/EX.IR;
        EX/MEM.ALUoutput ← ID/EX.A + ID/EX.Imm;
        EX/MEM.cond ← 0;
        EX/MEM.B ← ID/EX.B;
    Branch instruction:
        EX/MEM.ALUoutput ← ID/EX.NPC + ID/EX.Imm;
        EX/MEM.cond ← (ID/EX.A op 0);

Stage MEM:
    ALU instruction:
        MEM/WB.IR ← EX/MEM.IR;
        MEM/WB.ALUoutput ← EX/MEM.ALUoutput;
    Load or store instruction:
        MEM/WB.IR ← EX/MEM.IR;
        MEM/WB.LMD ← Mem[EX/MEM.ALUoutput]; or
        Mem[EX/MEM.ALUoutput] ← EX/MEM.B;

Stage WB:
    ALU instruction:
        Regs[MEM/WB.IR16..20] ← MEM/WB.ALUoutput; or
        Regs[MEM/WB.IR11..15] ← MEM/WB.ALUoutput;
    Load instruction:
        Regs[MEM/WB.IR11..15] ← MEM/WB.LMD;

Table 5.1: DLX Pipeline Structure
5.2 A General Overview of the Method
Using Event-B, we develop a method in which an abstract machine represents directly an
instruction from the ISA, specifying the effect that the instruction has on the
microprocessor register file or the program counter (PC). Refinement is then used
systematically to derive a concrete, pipelined execution of that instruction. Each
refinement step must address the inherent simultaneity that characterises pipelined
behaviour, the explosion of complexity that can result from this simultaneity and the
effect that feedback, which may introduce data or control hazards, has on pipeline
construction. The architect can focus on developing and refining the pipeline
description and defining the verification conditions which link successive levels of
refinement; the necessary proof obligations are generated and discharged in the Event-B
development environment, Rodin. Once the instruction operation within a feedback
loop has been correctly modelled and proved, formal decomposition is used to derive
the contribution of the instruction to the process representing each pipeline stage within
the feedback loop. The instruction model can then be extended formally to represent
the remaining pipeline stages. When all the instructions have been modelled, formal
re-composition of the instruction contributions to each stage is then used to derive the
full, pipelined ISA implementation.
To illustrate the method, the register/register arithmetic and branch instructions of a
typical SoC microprocessor are specified and refined. The technique termed forwarding,
where intermediate values are fed back to a stage that needs them, is employed
in modern microprocessors to provide a very efficient means of managing data hazards
[Hennessy and Patterson, 2006]. Finding errors in the forwarding logic has, however,
been found to be difficult and expensive [Kroening and Paul, 2001]. With the introduction
of appropriate invariants in our approach, it is shown that the concrete, pipelined
reﬁnement will not preserve these invariants unless the data hazards are detected and
managed appropriately. The concrete Event-B model implements forwarding in a way
that corresponds directly to the techniques used in microprocessor design and is proved,
automatically, in the Rodin environment to be a correct reﬁnement of the abstract ISA
speciﬁcation.
A method for managing pipeline control hazards caused by branching is to detect the
branch instruction in the Instruction Decode pipeline stage and to stall the pipeline
until it is known whether the branch is taken or not. If the branch is not taken, the
pipeline ﬂow can simply resume. If the branch is taken, pending instructions are ﬂushed,
the address of the next instruction is determined, and operation resumes at the new
value of PC. Again, with the introduction of appropriate invariants, it will be proved
that the pipelined implementation executes the instructions in accordance with the ISA
specification.
The ﬁnal, concrete models of the instructions comprise a set of Event-B machines, one for
each pipeline stage. The machines representing each particular stage are then composed
formally to provide a complete, concrete model of the pipeline. Thus, microarchitectural
considerations are raised from the register transfer level to the speciﬁcation level and
design choices can be veriﬁed much earlier in the ﬂow. It will be shown that the concrete
model also has a direct correspondence to an equivalent hardware description at the
register transfer level or in the high-level languages Bluespec and CAL, which like Event-
B are based on guarded atomic actions. The method proposed therefore has the potential
to be integrated into an existing synthesis methodology, providing an automated design
and veriﬁcation ﬂow from high-level speciﬁcation to hardware.
5.3 Abstracting and Reﬁning the Arithmetic Instruction
In our approach, modelling begins with an abstract representation of the instruction
that can be validated against the ISA. The microarchitecture of the abstract machine
for the arithmetic instruction specifies its effect on the register file. The instruction is
issued and completes in a single stage and is therefore free of data hazards. The abstract
machine is then reﬁned, introducing a single concrete pipeline stage at each reﬁnement
step while the rest of the pipeline is left abstract. In the first refinement, a 2-stage
pipeline, the register file is read by the first stage and written by the second simultaneously,
which can result in read-after-write (RAW) hazards. The abstract specification is therefore not
met by the reﬁnement and the proof obligations generated by the Rodin tool to verify
the correctness of the reﬁnement cannot be proved. The method forces the architect to
address the inherently hazardous nature of the pipeline in order to prove the correctness
of the refinement. It will be shown how the Event-B proof method helps in identifying
the problem and the solution. It will also be shown how the natural solution to this
problem, forwarding, can be implemented in the reﬁnement and how, with the use of
appropriate gluing invariants, the reﬁnement can be proved correct. The second reﬁne-
ment, a 3-stage pipeline, introduces additional RAW hazards which again can be dealt
with by introducing forwarding. RAW hazards are symptomatic of the inherent feedback
associated with reading and writing the register ﬁles simultaneously, and this simultane-
ity must be correctly modelled by the architect to ensure that the Instruction Decode,
Execute and Write Back stages can be proved to implement the abstract speciﬁcation,
using gluing invariants that represent the whole feedback loop. Once the feedback loop
has been modeled and the gluing invariants proved, showing that the overlapping exe-
cution within the feedback loop has been implemented correctly, decomposition of the
pipeline stages into the individual processes which have a direct mapping to the ﬁnal
hardware implementation can occur.
5.3.1 The Abstract ISA Model
The structure of a register/register arithmetic instruction associates the opcode with
a destination register Rr and two source registers Ra and Rb. The Event-B context,
PIPEC, for the arithmetic instruction therefore deﬁnes a set of operations Op, the
type Register, the subset of operations that are of type register/register arithmetic,
ArithRROp, and the relationship between the fields of the arithmetic instruction and
their associated registers. The conventions of Evans and Butler [2006] are followed to
model operation ﬁelds as described in Section 2.9 on Page 17. The context also deﬁnes
No Operation, NOP.
CONTEXT PIPEC
SETS
  Op
CONSTANTS
  Register, Rr, Ra, Rb, NOP, ArithRROp
AXIOMS
  axm1 : Register ⊆ ℕ
  axm2 : Rr ∈ Op → Register
  axm3 : Ra ∈ Op → Register
  axm4 : Rb ∈ Op → Register
  axm5 : ArithRROp ⊆ Op
  axm6 : NOP ∈ Op
  axm7 : NOP ∉ ArithRROp
END
Each Op has three register fields, modelled as projection functions (Rr, Ra, Rb).
ArithRROp is the set of all register/register arithmetic operations. NOP is a null operation.
The abstract machine, PIPEM, defines the register file Regs and a single event
ArithRR that specifies the effect that execution of the instruction has on the register
file. The microarchitecture of the abstract machine is shown in Figure 5.3.
For simplicity of presentation, the addition operation is shown, but this can more generally
be represented by an uninterpreted function (Burch and Dill [1994]) without affecting
the proof approach used. The more general approach is outlined later in the section.
The parameter pop specifies the environment for the event; given an instruction of type
ArithRROp, the state of the register file will be updated according to that instruction.
[Figure: the ArithRR event, with parameter pop, updating the register file Regs.]
Figure 5.3: Abstract Machine: Microarchitecture
MACHINE PIPEM
SEES PIPEC
VARIABLES
  Regs
INVARIANTS
  inv1 : Regs ∈ Register → ℤ
EVENTS
Initialisation
  begin
    act1 : Regs := Register × {0}
  end
Event ArithRR ≙
  any
    pop
  where
    grd1 : pop ∈ ArithRROp
  then
    act1 : Regs(Rr(pop)) := Regs(Ra(pop)) + Regs(Rb(pop))
  end
END
5.3.2 The First Reﬁnement: a 2-stage pipeline
A 2-stage pipeline is now introduced which reﬁnes the abstract machine. The second
pipeline stage is a concrete representation of the Write Back (WB) stage while the ﬁrst
stage is still abstract, representing the Fetch/Decode/Execute operations of the pipeline.
The microarchitecture of the reﬁned machine is shown in Figure 5.4.
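[Figure: the abstract FDEX stage, with parameter ppop, writing the pipeline registers
EXALUoutput and EXop, which are read by the WB stage to update Regs.]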
Figure 5.4: Reﬁned Machine: Microarchitecture
MACHINE PIPER1
REFINES PIPEM
SEES PIPEC
VARIABLES
  Regs, EXALUoutput, EXop
INVARIANTS
  inv1 : EXop ∈ Op
  inv2 : EXALUoutput ∈ ℤ
  inv3 : EXALUoutput = Regs(Ra(EXop)) + Regs(Rb(EXop))
EVENTS
Initialisation
  begin
    act1 : Regs := Register × {0}
    act2 : EXop := NOP
    act3 : EXALUoutput := 0
  end
Two new variables, EXALUoutput and EXop are introduced to represent the EXWB
pipeline registers. The parameter pop of the abstract ArithRR event is bound to the
concrete register EXop using an Event-B witness and a new parameter ppop represents
the environment of the reﬁned event, FDEXWB. The FDEXWB event models the si-
multaneous execution of both pipeline stages.
Event FDEXWB ≙
refines ArithRR
  any
    ppop
  where
    grd1 : EXop ∈ ArithRROp
    grd2 : ppop ∈ ArithRROp
  with
    pop : pop = EXop
  then
    act1 : Regs(Rr(EXop)) := EXALUoutput
    act2 : EXALUoutput := Regs(Ra(ppop)) + Regs(Rb(ppop))
    act3 : EXop := ppop
  end
END
It is now necessary to introduce the gluing invariant to establish that this is a correct
reﬁnement of the abstract machine. To preserve the meaning of the abstract speciﬁ-
cation, the new variable EXALUoutput must always have the value Regs(Ra(EXop))
+ Regs(Rb(EXop)), as represented by the invariant inv3. The Rodin prover, however,
cannot discharge the following proof obligation generated by the tool to show that the
invariant is preserved.
Ra(ppop) = Rr(EXop)
⊢
Regs(Ra(ppop)) + Regs(Rb(ppop)) = EXALUoutput
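The failure can be reproduced outside the prover with a small executable model. The Python sketch below is an illustration rather than part of the Event-B development: the `Op` tuple, the function names and the register values are all assumptions chosen for the example. It fires the unguarded FDEXWB event once from a state in which inv3 holds, reading every right-hand side from the pre-state as Event-B actions do.

```python
from typing import Dict, NamedTuple, Tuple


class Op(NamedTuple):
    rr: int  # destination register index, Rr
    ra: int  # first source register index, Ra
    rb: int  # second source register index, Rb


def inv3(regs: Dict[int, int], ex_op: Op, ex_alu: int) -> bool:
    """inv3: EXALUoutput = Regs(Ra(EXop)) + Regs(Rb(EXop))."""
    return ex_alu == regs[ex_op.ra] + regs[ex_op.rb]


def fdexwb_unguarded(
    regs: Dict[int, int], ex_op: Op, ex_alu: int, ppop: Op
) -> Tuple[Dict[int, int], Op, int]:
    """One firing of FDEXWB without hazard guards; all reads use the pre-state."""
    new_regs = dict(regs)
    new_regs[ex_op.rr] = ex_alu              # act1: write back the previous result
    new_alu = regs[ppop.ra] + regs[ppop.rb]  # act2: reads the OLD register file
    return new_regs, ppop, new_alu           # act3: EXop := ppop


# Illustrative pre-state in which inv3 holds: R3 := R1 + R2 is waiting in EX.
regs0 = {0: 0, 1: 5, 2: 7, 3: 0}
ex_op0 = Op(rr=3, ra=1, rb=2)
ex_alu0 = 12

no_hazard = fdexwb_unguarded(regs0, ex_op0, ex_alu0, Op(rr=0, ra=1, rb=2))
hazard = fdexwb_unguarded(regs0, ex_op0, ex_alu0, Op(rr=0, ra=3, rb=1))
print(inv3(*no_hazard), inv3(*hazard))  # prints: True False
```

When the incoming instruction reads R3, the stale value is used and the invariant no longer holds in the post-state, which is the situation the undischarged proof obligation describes.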
               
               
Figure 5.5: Successive Instructions can Interfere
5.3.3 Detecting the RAW Hazard
The abstract FDEX pipeline stage may only read from the source registers Ra and Rb if
they do not coincide with the target register Rr of the previous instruction, represented
by Rr(EXop). This interference between successive instructions is shown in Figure 5.5.
Therefore, in the failing proof obligation,
Ra(ppop) = Rr(EXop)
⊢
Regs(Ra(ppop)) + Regs(Rb(ppop)) = EXALUoutput
the goal may in general be false. One way of addressing this is to strengthen the
guards of the event so that the hypothesis Ra(ppop) = Rr(EXop) is contradicted and the
obligation holds trivially. A new guard, grd3, is therefore introduced into the
refined event to model the case when there is no hazard on register Ra. A further guard,
grd4, is also introduced to check that there is no hazard on register Rb.
grd1 : ...
grd2 : ...
grd3 : Rr(EXop) ≠ Ra(ppop)
grd4 : Rr(EXop) ≠ Rb(ppop)
The Rodin prover now shows that the invariant
EXALUoutput = Regs(Ra(EXop)) + Regs(Rb(EXop))
is preserved by the refined machine.
5.3.4 Dealing Correctly with the RAW Hazard
It is now necessary to deal with the cases where a hazard is encountered on register Ra
alone, on register Rb alone and on both registers Ra and Rb. In each case, the required
value(s) can be read from the EXALUoutput register. This corresponds directly to the
forwarding technique used in microprocessor design. Three extra events are introduced,
one for each case. For instance, for the hazard on register Ra, the guards of the
event are
grd3 : Rr(EXop) = Ra(ppop)
grd4 : Rr(EXop) ≠ Rb(ppop)
and the associated action now reads the value of Ra from EXALUoutput.
act2 : EXALUoutput := EXALUoutput + Regs(Rb(ppop))
The Rodin prover shows that, for each case, the invariant is preserved. The microarchi-
tecture of the modiﬁed reﬁned machine is shown in Figure 5.6.
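The effect of forwarding can be checked on concrete data with the following Python sketch. It is a simplified illustration and makes several assumptions: instructions are plain `(Rr, Ra, Rb)` tuples, the four mutually exclusive events (no hazard, hazard on Ra, on Rb, on both) are collapsed into one function with two conditional operand selections, and the program and register values are invented for the example. The pipeline run asserts inv3 after every step and is compared with a sequential execution of the abstract ArithRR event.

```python
from typing import Dict, List, Tuple

Op = Tuple[int, int, int]  # (Rr, Ra, Rb) register indices


def run_abstract(program: List[Op], regs: Dict[int, int]) -> Dict[int, int]:
    """Sequential execution of the abstract ArithRR event."""
    regs = dict(regs)
    for rr, ra, rb in program:
        regs[rr] = regs[ra] + regs[rb]
    return regs


def fdexwb_forwarding(regs, ex_op, ex_alu, ppop):
    """One pipeline step; a source that matches Rr(EXop) is read from EXALUoutput."""
    ex_rr = ex_op[0]
    _, ra, rb = ppop
    a = ex_alu if ra == ex_rr else regs[ra]  # hazard on Ra: forward
    b = ex_alu if rb == ex_rr else regs[rb]  # hazard on Rb: forward
    new_regs = dict(regs)
    new_regs[ex_rr] = ex_alu                 # write back
    return new_regs, ppop, a + b


def run_pipeline(program: List[Op], regs: Dict[int, int]) -> Dict[int, int]:
    """Run the 2-stage pipeline, checking inv3 after every step."""
    regs = dict(regs)
    ex_op = program[0]
    ex_alu = regs[ex_op[1]] + regs[ex_op[2]]
    for ppop in program[1:]:
        regs, ex_op, ex_alu = fdexwb_forwarding(regs, ex_op, ex_alu, ppop)
        assert ex_alu == regs[ex_op[1]] + regs[ex_op[2]], "inv3 violated"
    return regs  # the last instruction is still waiting in the EX register


program = [(3, 1, 2), (0, 3, 1), (1, 0, 0), (2, 1, 3)]
regs0 = {0: 0, 1: 5, 2: 7, 3: 0}
print(run_pipeline(program, regs0) == run_abstract(program[:-1], regs0))
```

The comparison uses all but the last instruction because, in this two-stage model, the final instruction has been executed but not yet written back when the run stops.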
    
 
 
 
 
 
 
 
 
 
 
 
    
[Figure: the refined machine of Figure 5.4, with parameter ppop, extended with a path
labelled Forwarding.]
Figure 5.6: Reﬁned Machine with Forwarding: Microarchitecture
5.3.5 The Second Reﬁnement: a 3-stage pipeline
In the second reﬁnement, the concrete Execute (EX) stage is introduced together with
the IDEX pipeline registers.
The microarchitecture is shown in Figure 5.7.
The registers A and B store the values of Ra and Rb respectively. The following two
new gluing invariants are introduced.
  
 
 
 
 
 
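[Figure: the three-stage microarchitecture, showing the pipeline registers holding the
operations, the register file and a path labelled Forwarding.]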
Figure 5.7: Second Reﬁnement: Microarchitecture
inv1 : A = Regs(Ra(IDop))
inv2 : B = Regs(Rb(IDop))
The third pipeline stage introduced in this refinement brings further data hazards
beyond those dealt with in the previous section. As well as adjacent instructions
interfering, in the 3-stage pipeline instructions that are one apart can also interfere, as
shown in Figure 5.8. If, for instance, the source register Ra coincides with the target
register Rr in the EXop register, then the register A must have its value forwarded from
EXALUoutput rather than getting it directly from the register ﬁle.
Figure 5.8: Instructions one apart can Interfere
All possible RAW hazard combinations must be dealt with. There are four hazard cases
(none, Ra alone, Rb alone, both Ra and Rb) for adjacent instructions and the same four
for instructions one apart, giving a total of 4 × 4 = sixteen combinations, modelled by
sixteen events. For example, in the following event IDRAWaEXWBnoRAW, the guards
grd3 : Rr(EXop) ≠ Ra(IDop)
grd4 : Rr(EXop) ≠ Rb(IDop)
cater for the case where there is no interference between the adjacent IDop and EXop
instructions, while the guards
grd5 : Rr(EXop) = Ra(pppop)
grd6 : Rr(EXop) ≠ Rb(pppop)
cater for the case where the source register Ra of the fetched instruction is the same as
the target Rr of the EXop instruction. In this case EXALUoutput is updated normally
using the values from the registers A and B,
act2 : EXALUoutput := A + B
register B gets its value directly from the register ﬁle,
act5 : B := Regs(Rb(pppop))
but register A must have its value forwarded from EXALUoutput
act4 : A := EXALUoutput
The complete description of event IDRAWaEXWBnoRAW follows.
Event IDRAWaEXWBnoRAW ≙
refines EXWBnoRAW
  any
    pppop
  where
    grd1 : EXop ∈ ArithRROp
    grd2 : IDop ∈ ArithRROp
    grd3 : Rr(EXop) ≠ Ra(IDop)
    grd4 : Rr(EXop) ≠ Rb(IDop)
    grd5 : Rr(EXop) = Ra(pppop)
    grd6 : Rr(EXop) ≠ Rb(pppop)
  with
    ppop : ppop = IDop
  then
    act1 : Regs(Rr(EXop)) := EXALUoutput
    act2 : EXALUoutput := A + B
    act3 : EXop := IDop
    act4 : A := EXALUoutput
    act5 : B := Regs(Rb(pppop))
    act6 : IDop := pppop
  end
The case where there are no hazards is managed by the event IDnoRAWEXWBnoRAW
Event IDnoRAWEXWBnoRAW ≙
refines EXWBnoRAW
  any
    pppop
  where
    grd1 : EXop ∈ ArithRROp
    grd2 : IDop ∈ ArithRROp
    grd3 : Rr(EXop) ≠ Ra(IDop)
    grd4 : Rr(EXop) ≠ Rb(IDop)
    grd5 : Rr(EXop) ≠ Ra(pppop)
    grd6 : Rr(EXop) ≠ Rb(pppop)
  with
    ppop : ppop = IDop
  then
    act1 : Regs(Rr(EXop)) := EXALUoutput
    act2 : EXALUoutput := A + B
    act3 : EXop := IDop
    act4 : A := Regs(Ra(pppop))
    act5 : B := Regs(Rb(pppop))
    act6 : IDop := pppop
  end
and the case where hazards are encountered on both Ra and Rb for both adjacent and
one-apart instructions is managed by the event IDRAWabEXWBabRAW.
Event IDRAWabEXWBabRAW ≙
refines EXWBabRAW
  any
    pppop
  where
    grd1 : EXop ∈ ArithRROp
    grd2 : IDop ∈ ArithRROp
    grd3 : Rr(EXop) = Ra(IDop)
    grd4 : Rr(EXop) = Rb(IDop)
    grd5 : Rr(EXop) = Ra(pppop)
    grd6 : Rr(EXop) = Rb(pppop)
  with
    ppop : ppop = IDop
  then
    act1 : Regs(Rr(EXop)) := EXALUoutput
    act2 : EXALUoutput := EXALUoutput + EXALUoutput
    act3 : EXop := IDop
    act4 : A := EXALUoutput
    act5 : B := EXALUoutput
    act6 : IDop := pppop
  end
In this case the registers EXALUoutput, A and B are all updated with the forwarded
value from EXALUoutput. The microarchitecture, with forwarding, is shown in Fig-
ure 5.9.
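The sixteen events can be summarised, for illustration, by a single step function in which each of the four operand choices is a conditional. The Python sketch below is not derived mechanically from the Event-B model: the `Op` tuple, the function names and the test program are assumptions. It checks the gluing invariants A = Regs(Ra(IDop)), B = Regs(Rb(IDop)) and EXALUoutput = Regs(Ra(EXop)) + Regs(Rb(EXop)) after every step and compares the register file with a sequential run of the abstract event.

```python
from typing import Dict, List, NamedTuple


class Op(NamedTuple):
    rr: int  # destination register index, Rr
    ra: int  # first source register index, Ra
    rb: int  # second source register index, Rb


def run_abstract(program: List[Op], regs: Dict[int, int]) -> Dict[int, int]:
    """Sequential execution of the abstract ArithRR event."""
    regs = dict(regs)
    for op in program:
        regs[op.rr] = regs[op.ra] + regs[op.rb]
    return regs


def step(regs, ex_op, ex_alu, id_op, a, b, pppop):
    """One combined ID/EX/WB step; every read uses the pre-state."""
    # EX: hazard between the adjacent IDop and EXop (grd3, grd4)
    op_a = ex_alu if ex_op.rr == id_op.ra else a
    op_b = ex_alu if ex_op.rr == id_op.rb else b
    # ID: hazard between the fetched pppop and EXop (grd5, grd6)
    new_a = ex_alu if ex_op.rr == pppop.ra else regs[pppop.ra]
    new_b = ex_alu if ex_op.rr == pppop.rb else regs[pppop.rb]
    # WB: act1
    new_regs = dict(regs)
    new_regs[ex_op.rr] = ex_alu
    return new_regs, id_op, op_a + op_b, pppop, new_a, new_b


def run_pipeline(program: List[Op], regs: Dict[int, int]) -> Dict[int, int]:
    """Run the 3-stage pipeline, checking the gluing invariants after each step."""
    regs = dict(regs)
    ex_op, id_op = program[0], program[1]
    ex_alu = regs[ex_op.ra] + regs[ex_op.rb]
    a, b = regs[id_op.ra], regs[id_op.rb]
    for pppop in program[2:]:
        regs, ex_op, ex_alu, id_op, a, b = step(regs, ex_op, ex_alu, id_op, a, b, pppop)
        assert a == regs[id_op.ra] and b == regs[id_op.rb]
        assert ex_alu == regs[ex_op.ra] + regs[ex_op.rb]
    return regs  # the last two instructions are still in flight


program = [Op(3, 1, 2), Op(0, 3, 1), Op(1, 0, 3), Op(2, 1, 0), Op(3, 2, 2)]
regs0 = {0: 0, 1: 5, 2: 7, 3: 0}
print(run_pipeline(program, regs0) == run_abstract(program[:-2], regs0))
```

A test of this kind only samples a few executions; the proof obligations discharged in Rodin cover all of them.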
All the proof obligations generated are discharged automatically by the Rodin tool, as
shown in Table 5.2.
                   Total no. of         Discharged
                   proof obligations    automatically
Abstract Model     3                    3
1st Refinement     33                   33
2nd Refinement     192                  192
Table 5.2: Pipeline Proofs
  
 
 
 
 
 
[Figure: the three-stage microarchitecture of Figure 5.7, with parameter ppop, extended
with forwarding paths.]
Figure 5.9: Second Reﬁnement with Forwarding: Microarchitecture
5.3.6 Shared Event Decomposition of the Feedback Loop
Since the ArithRR instruction does not access memory, the MEM stage does nothing
other than pass on values to the next pipeline stage and therefore, for clarity of presen-
tation, it has been omitted. In the ﬁnal four-stage pipeline, as shown in Figure 5.10,
there is no feedback between the IF stage and the rest of the pipeline. It is therefore
possible, now that the global gluing invariants for the feedback loop have been proved,
to perform a three-way decomposition of the sixteen events of the second reﬁnement into
three distinct machines representing the concrete EX and WB stages and the IFID stage
which is still abstract. The Event-B shared event decomposition mechanism is explained
in Section 2.9 on Page 17. Each Event-B machine represents a process communicating
by shared pipeline variables.
[Figure: the four stages and their registers: IF (PC, IMem, IFop), ID (A, B, IDop),
EX (EXALUoutput, EXop) and WB (Regs).]
Figure 5.10: Third Reﬁnement: Microarchitecture
We now outline the decomposed components. The WB stage is represented by a machine
with a single event.
Event WB ≙
  when
    grd1 : EXop ∈ ArithRROp
  then
    act1 : Regs(Rr(EXop)) := EXALUoutput
  end
The WB event reads the EXALUoutput register and writes to the Register File.
The EX stage is represented by a machine comprising four mutually exclusive events.
For instance, in the case when a hazard is encountered on register Ra alone,
Event EXaRAW ≙
  when
    grd1 : IDop ∈ ArithRROp
    grd2 : EXop ∈ ArithRROp
    grd3 : Rr(EXop) = Ra(IDop)
    grd4 : Rr(EXop) ≠ Rb(IDop)
  then
    act1 : EXALUoutput := EXALUoutput + B
    act2 : EXop := IDop
  end
The EX events read from the IDop, A and B registers and write to the EXop and
EXALUoutput registers.
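That the four EX events are mutually exclusive, and that together they cover every case, can be confirmed by enumerating their hazard guards. In the Python sketch below only EXaRAW is named in the text above; the names EXnoRAW, EXbRAW and EXabRAW and the four-register range are assumptions made by analogy for the purpose of the example.

```python
from itertools import product

# Hazard guards (grd3, grd4) of the four EX events over Rr(EXop), Ra(IDop), Rb(IDop).
GUARDS = {
    "EXnoRAW": lambda rr, ra, rb: rr != ra and rr != rb,
    "EXaRAW": lambda rr, ra, rb: rr == ra and rr != rb,
    "EXbRAW": lambda rr, ra, rb: rr != ra and rr == rb,
    "EXabRAW": lambda rr, ra, rb: rr == ra and rr == rb,
}


def enabled(rr: int, ra: int, rb: int) -> list:
    """Names of the EX events whose hazard guards hold for these register indices."""
    return [name for name, guard in GUARDS.items() if guard(rr, ra, rb)]


# Exactly one event is enabled for every combination of register indices.
all_cases = list(product(range(4), repeat=3))
print(all(len(enabled(*case)) == 1 for case in all_cases))
```

This property is what allows the events of a stage to be treated as alternative branches of one process.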
The abstract IFID stage is also represented by a machine comprising four mutually
exclusive events. For instance, in the case when a hazard is encountered on both source
registers Ra and Rb,
Event IFIDRAWab ≙
  any
    pppop
  where
    grd1 : EXop ∈ ArithRROp
    grd2 : IDop ∈ ArithRROp
    grd3 : Rr(EXop) = Ra(pppop)
    grd4 : Rr(EXop) = Rb(pppop)
  then
    act1 : A := EXALUoutput
    act2 : B := EXALUoutput
    act3 : IDop := pppop
  end
This gives a total of nine events in three machines for the IFID, EX and WB stages,
compared with the 16 combined events before decomposition. Decomposing as soon as
the feedback loop has been verified stems the exponential increase in the number of
events.
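The event counts can be reproduced by enumeration. In the Python sketch below, the combined names are built from an ID part and an EXWB part; IDRAWaEXWBnoRAW, IDnoRAWEXWBnoRAW and IDRAWabEXWBabRAW appear in the text, while the remaining name fragments (those for a hazard on Ra or Rb alone) are assumed by analogy.

```python
from itertools import product

# Hazard cases per stage; fragments not attested in the text are assumptions.
ID_CASES = ["IDnoRAW", "IDRAWa", "IDRAWb", "IDRAWab"]
EXWB_CASES = ["EXWBnoRAW", "EXWBaRAW", "EXWBbRAW", "EXWBabRAW"]

# Before decomposition: one combined event per pair of cases (a product).
combined = [id_case + ex_case for id_case, ex_case in product(ID_CASES, EXWB_CASES)]

# After decomposition: each machine carries only its own cases (a sum).
decomposed = {"IFID": 4, "EX": 4, "WB": 1}

print(len(combined), sum(decomposed.values()))  # prints: 16 9
```

The combined model grows as a product of the per-stage cases, whereas the decomposed model grows as their sum.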
5.3.7 The Third Reﬁnement: a 4-stage pipeline
The abstract IFID model is now reﬁned, introducing the concrete program counter PC
and the instruction memory IMem
CONSTANTS
  IMSIZE
  IMem
AXIOMS
  axm1 : IMSIZE ∈ ℕ
  axm2 : IMem ∈ 0 .. IMSIZE → Op
END
resulting in four concrete events which represent the combined operation of the IF and
ID stages. In the case when a hazard is encountered on both source registers Ra and
Rb,
Event IFIDRAWab ≙
refines IFIDRAWab
  where
    grd1 : EXop ∈ ArithRROp
    grd2 : IDop ∈ ArithRROp
    grd3 : Rr(EXop) = Ra(IFop)
    grd4 : Rr(EXop) = Rb(IFop)
    grd5 : PC < IMSIZE
  with
    pppop : pppop = IFop
  then
    act1 : A := EXALUoutput
    act2 : B := EXALUoutput
    act3 : IDop := IFop
    act4 : PC := PC + 1
    act5 : IFop := IMem(PC)
  end
The concrete IFID machine is now decomposed into two machines, IF and ID which
represent the Instruction Fetch and Instruction Decode stages of the hardware pipeline
respectively, as shown in Figure 5.10.
The Instruction Fetch stage is represented by a single event,
Event IF ≙
  when
    grd1 : PC < IMSIZE
  then
    act1 : PC := PC + 1
    act2 : IFop := IMem(PC)
  end
which reads and writes PC and writes to the IFop register,
and the Instruction Decode stage by four events of the form
Event IFIDRAWab ≙
  where
    grd1 : EXop ∈ ArithRROp
    grd2 : IDop ∈ ArithRROp
    grd3 : Rr(EXop) = Ra(IFop)
    grd4 : Rr(EXop) = Rb(IFop)
  then
    act1 : A := EXALUoutput
    act2 : B := EXALUoutput
    act3 : IDop := IFop
  end
which read from the EXALUoutput and IFop registers and write to the IDop, A and
B registers, resulting in a total of ten events in four processes.
All parameters have been resolved to concrete pipeline registers, and each process may
be mapped directly to an RTL representation. Each of the DECODE and EXECUTE
processes comprises four mutually exclusive events which are mapped to four branches
in an RTL case statement. The ﬁnal concrete pipeline is shown in Figure 5.11.
 
  
[Figure: the final pipeline drawn as its processes, each listing its events in the
notation used above: the IF event; four ID events (no hazard, hazard on Ra, hazard on Rb,
hazard on both); four EX events (the same four cases); and the WB event.]
Figure 5.11: ArithRR 4-stage PipelineChapter 5 Developing an SoC Pipelined Microprocessor Model 90
5.3.8 Generalising the ArithRR model
To generalise this approach for uninterpreted arithmetic operations, the action
act1 : Regs(Rr(p)) := Regs(Ra(p)) + Regs(Rb(p))
can be replaced with
act1 : Regs(Rr(p)) := f(Regs(Ra(p)) ↦ Regs(Rb(p)))
where
grd1 : f = func(p)
and
axm8 : func ∈ Op → (ℕ × ℕ → ℕ)
is a field of the arithmetic instruction. The invariants are generalised as well.
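The generalisation can be illustrated with a minimal Python sketch. It is not part of the Event-B model: the table FUNC stands in for the constant func of axm8, and the dictionary encoding of an instruction (keys op, rr, ra, rb) and the two sample operations are assumptions made only for illustration.

```python
from typing import Callable, Dict, List

# A binary operation on naturals, mirroring the type N x N -> N in axm8.
BinOp = Callable[[int, int], int]

# Hypothetical stand-in for the constant `func`: maps an operation tag to
# the otherwise uninterpreted function it denotes.
FUNC: Dict[str, BinOp] = {
    "ADD": lambda a, b: a + b,
    "MUL": lambda a, b: a * b,
}


def arith_rr(regs: List[int], p: dict) -> List[int]:
    """Return the register file after Regs(Rr(p)) := f(Regs(Ra(p)), Regs(Rb(p)))."""
    f = FUNC[p["op"]]                      # grd1 : f = func(p)
    new_regs = list(regs)                  # leave the caller's register file untouched
    new_regs[p["rr"]] = f(regs[p["ra"]], regs[p["rb"]])
    return new_regs
```

The event body no longer mentions addition: any operation that can be placed in the table is handled by the same action.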
5.4 Abstracting and Reﬁning the Branch Instruction
The branch instruction presents a different modelling challenge to the arithmetic
instruction. Incorrect branch prediction results in a Control Hazard, which must be managed
correctly in the pipeline. The implementation must detect the hazard, stall the pipeline
and flush out the incorrectly pre-fetched instructions.
Modelling begins with an abstract representation of the instruction that can be validated
against the ISA. The microarchitecture of the abstract machine for the branch instruction
specifies its effect on an abstract variable that represents the next instruction to be
executed. The instruction is issued and completes in a single stage and therefore the
next instruction to be executed is always known. The abstract machine is then reﬁned,
introducing a single concrete pipeline stage at each reﬁnement step while the rest of the
pipeline is left abstract. In the ﬁrst reﬁnement, the registers associated with the Execute
stage are introduced. In the second reﬁnement, the Write Back stage is developed,
which introduces RAW hazards on the Ra source register. As with the arithmetic
instruction, the hazards are managed using forwarding. In the third reﬁnement, the
IDEX pipeline registers are introduced, together with a 2-bit counter Stall to manage
pipeline stalling. In the fourth reﬁnement, the program counter PC is introduced,
together with a set of gluing invariants that link PC to the abstract variable that
represents the next instruction to be executed. Finally, in the ﬁfth reﬁnement, this
abstract variable is removed since it is not needed in the implementation. Once this is
done, decomposition of the pipeline stages into individual processes can occur.
5.4.1 The Abstract ISA Model
The microarchitecture of the abstract model is shown in Figure 5.12.
[Figure: a single FDEX stage that updates TNextExecuted]
Figure 5.12: Abstract Machine: Microarchitecture
The abstract machine, bPIPEM, deﬁnes two parameterised events, NoBranchp and
Branchp, that specify the effect that execution of a branch instruction has on the
program counter PC when a branch fails or succeeds respectively. The current value of
PC is represented by the parameter ppc and the next value of PC by ppcprime. The
current instruction in Instruction Memory is represented by pop and the Register File
by pr. The value of pop’s source register, Ra, is compared with 0 using BoolOp which
represents either a BEQ0 or a BNE0 instruction. If the comparison fails, PC is in-
cremented by one. Otherwise, PC is incremented by the Immediate value of pop. The
variable TNextExecuted is introduced to represent the address of the next instruction to
be executed. The event EX represents the effect on TNextExecuted of the execution of
instructions other than Branch.
Event NoBranchp ≙
any
pop,pr,ppc,ppcprime
where
grd1 : pop ∈ BranchOp
grd2 : ppc ∈ 0 .. IMSIZE
grd3 : pr ∈ Register → ℤ
grd4 : BoolOp(pr(Ra(pop)) ↦ 0) = 0
grd5 : ppcprime = ppc + 1
then
act1 : TNextExecuted := ppcprime
end
Event Branchp ≙
any
pop,pr,ppc,ppcprime
where
grd1 : pop ∈ BranchOp
grd2 : ppc ∈ 0 .. IMSIZE
grd3 : pr ∈ Register → ℤ
grd4 : BoolOp(pr(Ra(pop)) ↦ 0) = 1
grd5 : ppcprime = ppc + Imm(pop)
then
act1 : TNextExecuted := ppcprime
end
Event EX ≙
any
pop,ppc,ppcprime
where
grd1 : pop ∉ BranchOp
grd2 : ppc ∈ 0 .. IMSIZE
grd3 : ppcprime = ppc + 1
then
act1 : TNextExecuted := ppcprime
end
The abstract machine executes instructions in a single stage and determines the next
instruction to be executed. In the ﬁnal pipelined implementation it will be necessary to
prove that the program counter is updated in accordance with the abstract speciﬁcation;
gluing invariants will need to be introduced to link the concrete program counter to the
address of the next instruction executed, represented by the variable TNextExecuted.
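As an informal illustration, the three abstract events can be read as one function that computes the value assigned to TNextExecuted. The sketch below is an assumption-laden paraphrase, not the Event-B model: the dictionary encoding of pop (keys is_branch, ra, imm) and the helper beq0, which stands in for one possible BoolOp, are hypothetical.

```python
from typing import Callable, Dict


def beq0(x: int, y: int) -> int:
    """Assumed BEQ0-style comparison: 1 when the operands are equal, else 0."""
    return 1 if x == y else 0


def next_executed(pop: dict, pr: Dict[int, int], ppc: int,
                  bool_op: Callable[[int, int], int] = beq0) -> int:
    """Value written to TNextExecuted by EX, Branchp or NoBranchp."""
    if not pop["is_branch"]:                  # event EX: pop is not a BranchOp
        return ppc + 1
    if bool_op(pr[pop["ra"]], 0) == 1:        # event Branchp: comparison succeeds
        return ppc + pop["imm"]
    return ppc + 1                            # event NoBranchp: comparison fails
```

Exactly one of the three branches applies to any instruction, which mirrors the mutually exclusive guards of the abstract events.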
5.4.2 The First Reﬁnement
The registers associated with the EX stage, EXALUoutput, Cond and EXop, are introduced,
as shown in Figure 5.13.
[Figure: the FDEX stage, the parameter pop, the EX-stage registers EXALUoutput, Cond and EXop, and the variable TNextExecuted]
Figure 5.13: Refined Machine: Microarchitecture
The current instruction address is added to the Immediate value of the branching in-
struction and the result written to EXALUoutput irrespective of whether the branch
will be taken or not. Cond, the register which will be used to determine whether the
branch is taken or not, is set to 1 by the EXBranch event and 0 by the EXNoBranch
event. In the final pipeline, EXALUoutput and Cond are read by the Instruction Fetch
stage to calculate the new value of PC.
Event EXBranch ≙
refines Branchp
any
pop,pr,ppc,ppcprime
where
grd1 : pop ∈ BranchOp
grd2 : ppc ∈ 0 .. IMSIZE
grd3 : pr ∈ Register → ℤ
grd4 : BoolOp(pr(Ra(pop)) ↦ 0) = 1
grd5 : ppcprime = ppc + Imm(pop)
then
act1 : EXALUoutput := ppc + Imm(pop)
act2 : Cond := BoolOp(pr(Ra(pop)) ↦ 0)
act3 : EXop := pop
act4 : TNextExecuted := ppcprime
end
Event EXNoBranch ≙
refines NoBranchp
any
pop,pr,ppc,ppcprime
where
grd1 : pop ∈ BranchOp
grd2 : ppc ∈ 0 .. IMSIZE
grd3 : pr ∈ Register → ℤ
grd4 : BoolOp(pr(Ra(pop)) ↦ 0) = 0
grd5 : ppcprime = ppc + 1
then
act1 : EXALUoutput := ppc + Imm(pop)
act2 : Cond := BoolOp(pr(Ra(pop)) ↦ 0)
act3 : EXop := pop
act4 : TNextExecuted := ppcprime
end
The EX event, which represents all non-branching instructions, must set the value of
Cond to 0 to ensure that the correct next instruction is fetched.
Event EX ≙
refines EX
any
pop,ppc,ppcprime
where
grd1 : pop ∉ BranchOp
grd2 : ppc ∈ 0 .. IMSIZE
grd3 : ppcprime = ppc + 1
then
act1 : EXALUoutput :∈ ℤ
act2 : Cond := 0
act3 : EXop := pop
act4 : TNextExecuted := ppcprime
end
5.4.3 The Second Reﬁnement: a 2-stage pipeline
In the second refinement, the Register File and the Write Back stage are introduced as
shown in Figure 5.14. Although the branch instruction does not itself modify the
registers, an earlier arithmetic instruction in the pipeline can. A RAW hazard is encountered
if a branch instruction reads the value of the Ra register at the same time that an arith-
metic instruction is writing back to the same register. The hazard is dealt with using
forwarding.
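The forwarding decision can be sketched in Python as follows. This is an illustrative paraphrase under stated assumptions rather than the model itself: the function name branch_operand and its flat argument list are hypothetical, and the register file is represented as a plain list.

```python
from typing import List


def branch_operand(regs: List[int], ex_alu_output: int,
                   ex_op_rr: int, branch_ra: int) -> int:
    """Value of Ra as seen by the branch instruction.

    ex_op_rr is Rr(EXop), the register the earlier arithmetic instruction
    is about to write; branch_ra is Ra(pop), the register the branch reads.
    """
    if ex_op_rr == branch_ra:      # RAW hazard: Rr(EXop) = Ra(pop)
        return ex_alu_output       # forward the result that is not yet in Regs
    return regs[branch_ra]         # no hazard: the register file is up to date
```

With a hazard the comparison against 0 uses the forwarded EXALUoutput; without one it uses the stored register value.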
[Figure: microarchitecture of the second refinement, adding the Register File and the Write Back stage]
Figure 5.14: Second Refinement
Two events are required for the case when no branch is taken and the EXop instruction is
an ArithRROp. The first applies when there is a hazard on register Ra, that is
Rr(EXop) = Ra(pop), and the evaluation of BoolOp must use the forwarded value,
Event EXRAWaNoBranchWB ≙
refines EXNoBranch
any
pop,ppc,ppcprime
where
grd1 : pop ∈ BranchOp
grd2 : ppc ∈ 0 .. IMSIZE
grd3 : BoolOp(Regs(Ra(pop)) ↦ 0) = 0
grd4 : ppcprime = ppc + 1
grd5 : EXop ∈ ArithRROp
grd6 : Rr(EXop) = Ra(pop)
with
pr : pr = Regs
then
act1 : EXALUoutput := ppc + Imm(pop)
act2 : Cond := BoolOp(EXALUoutput ↦ 0)
act3 : EXop := pop
act4 : TNextExecuted := ppcprime
act5 : Regs(Rr(EXop)) := EXALUoutput
end
and the second, where there is no hazard and Rr(EXop) ≠ Ra(pop)
Event EXnoRAWNoBranchWB ≙
refines EXNoBranch
any
pop,ppc,ppcprime
where
grd1 : pop ∈ BranchOp
grd2 : ppc ∈ 0 .. IMSIZE
grd3 : BoolOp(Regs(Ra(pop)) ↦ 0) = 0
grd4 : ppcprime = ppc + 1
grd5 : EXop ∈ ArithRROp
grd6 : Rr(EXop) ≠ Ra(pop)
with
pr : pr = Regs
then
act1 : EXALUoutput := ppc + Imm(pop)
act2 : Cond := BoolOp(Regs(Ra(pop)) ↦ 0)
act3 : EXop := pop
act4 : TNextExecuted := ppcprime
act5 : Regs(Rr(EXop)) := EXALUoutput
end
Similarly, two events are required for the case when the branch is taken. The
hazard-free event is shown here.
Event EXnoRAWBranchWB ≙
refines EXBranch
any
pop,ppc,ppcprime
where
grd1 : pop ∈ BranchOp
grd2 : ppc ∈ 0 .. IMSIZE
grd3 : BoolOp(Regs(Ra(pop)) ↦ 0) = 1
grd4 : ppcprime = ppc + Imm(pop)
grd5 : EXop ∈ ArithRROp
grd6 : Rr(EXop) ≠ Ra(pop)
with
pr : pr = Regs
then
act1 : EXALUoutput := ppc + Imm(pop)
act2 : Cond := BoolOp(Regs(Ra(pop)) ↦ 0)
act3 : EXop := pop
act4 : TNextExecuted := ppcprime
act5 : Regs(Rr(EXop)) := EXALUoutput
end
[Figure: microarchitecture of the third refinement, adding the IDEX pipeline registers and the Stall counter]
Figure 5.15: Third Refinement
5.4.4 The Third Reﬁnement: a 3-stage pipeline
The IDEX pipeline registers are introduced, together with a 2-bit counter Stall to man-
age the pipeline stalling mechanism as shown in Figure 5.15.
Further RAW hazards are encountered in this reﬁnement if the ID events load the value
of the Ra register into the pipeline register A at the same time that an arithmetic
instruction is writing back to the same register. These hazards are again dealt with
using forwarding.
Stall is initialised to 0 and remains at 0 while no branch instruction is encountered.
Event IDEXWB ≙
refines EXWB
any
ppop,pnpc,ppcprime
where
grd1 : IDop ∉ BranchOp
grd2 : ppop ∉ BranchOp
grd3 : Stall = 0
grd4 : pnpc ∈ 0 .. IMSIZE
grd5 : EXop ∈ ArithRROp
grd6 : ppcprime = pnpc
with
pop : pop = IDop
then
act1 : EXALUoutput :∈ ℤ
act2 : Cond := 0
act3 : EXop := IDop
act4 : TNextExecuted := ppcprime
act5 : Regs(Rr(EXop)) := EXALUoutput
act6 : IDop := ppop
act7 : A := Regs(Ra(ppop))
act8 : ImmV := Imm(ppop)
act9 : NPC := pnpc
end
When a branch operation is encountered, Stall is set to 2 which enables one of the two
following events that establish whether the branch is taken or not, calculate the next
value of PC, decrement Stall and set the speculatively fetched instruction to NOP.
Event IDStallEXNoBranchWB ≙
refines EXNoBranchWB
any
ppop,ppcprime
where
grd1 : IDop ∈ BranchOp
grd2 : BoolOp(A ↦ 0) = 0
grd3 : ppcprime = NPC + 1
grd4 : ppop ∈ Op
grd5 : EXop ∈ ArithRROp
grd6 : Stall > 0
with
pop : pop = IDop
ppc : ppc = NPC
then
act1 : EXALUoutput := NPC + ImmV
act2 : Cond := BoolOp(A ↦ 0)
act3 : EXop := IDop
act4 : TNextExecuted := ppcprime
act5 : Regs(Rr(EXop)) := EXALUoutput
act6 : IDop := NOP
act7 : Stall := Stall − 1
act8 : A := Regs(Ra(NOP))
act9 : ImmV := Imm(NOP)
end
Event IDStallEXBranchWB ≙
refines EXBranchWB
any
ppop,ppcprime
where
grd1 : IDop ∈ BranchOp
grd2 : BoolOp(A ↦ 0) = 1
grd4 : ppcprime = NPC + Imm(IDop)
grd5 : EXop ∈ ArithRROp
grd6 : ppop ∈ Op
grd7 : Stall > 0
with
pop : pop = IDop
ppc : ppc = NPC
then
act1 : EXALUoutput := NPC + ImmV
act2 : Cond := BoolOp(A ↦ 0)
act3 : EXop := IDop
act4 : TNextExecuted := ppcprime
act5 : Regs(Rr(EXop)) := EXALUoutput
act6 : IDop := NOP
act7 : Stall := Stall − 1
act8 : A := Regs(Ra(NOP))
act9 : ImmV := Imm(NOP)
end
The pipeline stalls for a further cycle to ensure that all the speculatively-fetched
instructions have been flushed, represented by a stall event that decrements Stall to 0 so that
normal pipeline operation can resume.
Event IDStallEXWB ≙
refines EXWB
any
ppop
where
grd1 : IDop ∉ BranchOp
grd2 : EXop ∈ ArithRROp
grd3 : ppop ∈ Op
grd4 : Stall > 0
with
pop : pop = IDop
then
act1 : EXALUoutput :∈ ℤ
act2 : Cond := 0
act3 : EXop := IDop
act4 : Regs(Rr(EXop)) := EXALUoutput
act5 : IDop := NOP
act6 : Stall := Stall − 1
act7 : A := Regs(Ra(NOP))
act8 : ImmV := Imm(NOP)
end
Three gluing invariants establish that the reﬁnement is correct.
inv7 : IDop ∈ BranchOp ⇒ A = Regs(Ra(IDop))
inv8 : IDop ∈ BranchOp ⇒ ImmV = Imm(IDop)
inv9 : Stall = 1 ⇒ IDop = NOP
5.4.5 The Fourth Reﬁnement: a 4-stage pipeline
It is at this stage that the program counter, PC, is introduced together with the IFID
pipeline registers and the instruction memory IMEM as shown in Figure 5.16.
[Figure: microarchitecture of the fourth refinement, adding the program counter, the IFID pipeline registers and the instruction memory IMEM]
Figure 5.16: Fourth Refinement
It is now necessary to introduce the gluing invariants that will prove that the program
counter is updated in accordance with the abstract specification. This is done by linking
the concrete variable PC to the variable INPC, which in turn is linked to NPC, which
finally is linked to the abstract variable TNextExecuted. Informally,
INPC always has the same value as PC.
If IDop is not a NOP then INPC holds the address that immediately succeeds NPC in
Instruction Memory.
If IDop is not a NOP then NPC holds the address of the next instruction to be executed.
Formally,
inv4 : INPC = PC
inv5 : IDop ≠ NOP ⇒ INPC = NPC + 1
inv6 : IDop ≠ NOP ⇒ NPC = TNextExecuted
It should be noted that, because there is no no-operation instruction in the user-accessible
DLX instruction set, NOP cannot occur in the input instruction stream.
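For readers who prefer an executable reading, the three gluing invariants can be written as a predicate over a snapshot of the model state. The sketch below is only a paraphrase: the dictionary representation of the state and the string encoding of NOP are assumptions, and checking a predicate on sample states is no substitute for the proofs discharged in Rodin.

```python
NOP = "NOP"  # assumed encoding of the NOP instruction


def gluing_invariants_hold(state: dict) -> bool:
    """Evaluate inv4, inv5 and inv6 on one state snapshot."""
    inv4 = state["INPC"] == state["PC"]
    # An implication P => Q is encoded as (not P) or Q.
    inv5 = state["IDop"] == NOP or state["INPC"] == state["NPC"] + 1
    inv6 = state["IDop"] == NOP or state["NPC"] == state["TNextExecuted"]
    return inv4 and inv5 and inv6
```

When IDop is NOP, inv5 and inv6 hold vacuously, so only the link between INPC and PC is constrained during a stall.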
5.4.6 The Fifth Reﬁnement
Now that the gluing invariants have established, formally, the link between PC and
TNextExecuted, the 5th reﬁnement removes TNextExecuted since it is not needed in the
implementation. TNextExecuted is a convenient modelling abstraction representing a
notional PC for a non-pipelined ISA. Its relationship to real registers is given by the
invariants inv4-6 above.
Figure 5.17 shows the control ﬂow of the pipelined branch operation as a state machine.
     
[Figure: state machine of the pipelined branch operation; transitions are guarded by whether the instruction is a BranchOp and by the value of BoolOp(A ↦ 0)]
Figure 5.17: Fifth Refinement: Combined State Machine
Under normal operation, in the absence of branching instructions, the value of the Cond
register is always 0 and Stall remains at 0. When a branch instruction is detected in
the decode stage, the Stall counter is set to 2. In the next cycle, the value of Cond is
calculated. If it is set to 1 then the branch is taken. Otherwise, if Cond is set to 0,
the branch is not taken. In both cases Stall is decremented to 1. The pipeline then
stalls for one extra cycle to ﬂush out the last incorrectly fetched instruction and Stall
is decremented to 0. The pipeline then either resumes normal operation or, if the next
instruction is a branch operation, it stalls again.
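The behaviour of the Stall counter described above can be traced with a small Python sketch. It abstracts away everything except the counter: the function names step_stall and trace are hypothetical, and the sketch assumes that instructions arriving while Stall is non-zero are ignored because they are replaced by NOP.

```python
from typing import Iterable, List


def step_stall(stall: int, branch_in_decode: bool) -> int:
    """Next value of the 2-bit Stall counter after one pipeline cycle."""
    if stall > 0:
        return stall - 1                      # stall cycles: 2 -> 1 -> 0
    return 2 if branch_in_decode else 0       # a decoded branch starts a stall


def trace(branches: Iterable[bool]) -> List[int]:
    """Stall value after each cycle, starting from Stall = 0."""
    stall, values = 0, []
    for branch_in_decode in branches:
        stall = step_stall(stall, branch_in_decode)
        values.append(stall)
    return values
```

For example, a branch decoded in the second cycle and another decoded immediately after the stall completes gives the sequence 0, 2, 1, 0, 2.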
All the proof obligations generated are discharged automatically by the Rodin tool, as
shown in Table 5.3.
                  Total no. of         Discharged
                  proof obligations    automatically
Abstract Model    4                    4
1st Refinement    9                    9
2nd Refinement    18                   18
3rd Refinement    79                   79
4th Refinement    76                   76
5th Refinement    6                    6
Table 5.3: Pipeline Proofs
Figure 5.18 illustrates further the operation of the concrete pipeline. Concrete events
represent the transitions of the model’s state. The numbered arcs are annotated with
the value of Stall and point to the event or events that are enabled in the next pipeline
state.
[Figure: the concrete events IFIDEXWB, IFIDBranchEXRAWaWB, IFIDBranchEXnoRAWWB, IDStallEXBranchWB, IDStallEXNoBranchWB, IDStallEXWB and IFBranchIDStallEXWB, linked by arcs annotated with the value of Stall]
Figure 5.18: Fifth Refinement: Events
5.4.7 Pipeline Decomposition
The concrete representation of the branching instruction can now be decomposed into
four machines, representing each of the IF, ID, EX and WB stage processes. The WB
stage is represented by a single event, identical to that derived from the ArithRR devel-
opment. The EX stage is represented by two events; one for the branching instruction,
Event EXBranch ≙
when
grd1 : IDop ∈ BranchOp
then
act1 : EXALUoutput := NPC + ImmV
act2 : Cond := BoolOp(A ↦ 0)
act3 : EXop := IDop
end
and the event that represents the behaviour of all other instructions.
Event EX ≙
when
grd1 : IDop ∉ BranchOp
then
act1 : EXALUoutput :∈ ℤ
act2 : Cond := 0
act3 : EXop := IDop
end
The EX events write to the EXop, EXALUoutput and Cond registers and read from
the IDop, A, NPC and ImmV registers.
The ID stage comprises an event for non-branching instructions, two branching events
(one of which manages the data hazard) and a stall event.
Event ID ≙
when
grd1 : IFop ∉ BranchOp
grd2 : Stall = 0
then
act1 : IDop := IFop
act2 : A := Regs(Ra(IFop))
act3 : ImmV := Imm(IFop)
act4 : NPC := INPC
end
Event IDBranchNoRAW ≙
when
grd1 : IFop ∈ BranchOp
grd2 : Stall = 0
grd3 : Rr(EXop) ≠ Ra(IFop)
then
act1 : IDop := IFop
act2 : A := Regs(Ra(IFop))
act3 : ImmV := Imm(IFop)
act4 : NPC := INPC
act5 : Stall := 2
end
Event IDBranchRAWa ≙
when
grd1 : IFop ∈ BranchOp
grd2 : Stall = 0
grd3 : Rr(EXop) = Ra(IFop)
then
act1 : IDop := IFop
act2 : A := EXALUoutput
act3 : ImmV := Imm(IFop)
act4 : NPC := INPC
act5 : Stall := 2
end
Event IDStall ≙
when
grd1 : IFop ∈ Op
grd2 : Stall > 1
then
act1 : IDop := NOP
act2 : A := Regs(Ra(NOP))
act3 : ImmV := Imm(NOP)
act4 : Stall := Stall − 1
end
The ID events write to the IDop, A, NPC and ImmV registers and read from the Stall,
IFop and INPC registers.
The IF stage is represented by two events, which check the value of the Cond pipeline
register and set the new value of PC accordingly.
Event IF ≙
when
grd1 : Stall = 0
grd2 : Cond = 0
grd3 : PC < IMSIZE
then
act1 : PC := PC + 1
act2 : INPC := PC + 1
act3 : IFop := IMem(PC)
end
Event IFBranch ≙
when
grd1 : Stall = 1
grd2 : Cond = 1
grd3 : EXALUoutput ∈ 0 .. IMSIZE
then
act1 : PC := EXALUoutput
act2 : INPC := EXALUoutput
act3 : IFop := IMem(PC)
end
The IF events write to the PC, INPC and IFop registers and read from the EXALUoutput
and PC registers.
All parameters have been resolved to concrete pipeline registers, and each process may
be mapped directly to an RTL representation. The DECODE and EXECUTE processes
comprise four and two mutually exclusive events respectively which are mapped to the
branches in an RTL case statement. The ﬁnal concrete pipeline representing the four
processes which implement the branch instruction is shown in Figure 5.19.
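The claim that the events of a process map onto the branches of a case statement relies on their guards being mutually exclusive. The following Python sketch checks this for the four ID-stage events by enumerating the guard inputs. It is an informal illustration only: the boolean abstraction of the guards (is_branch for IFop ∈ BranchOp, raw_hazard for Rr(EXop) = Ra(IFop)) and the function names are assumptions, and the guard IFop ∈ Op of IDStall is taken to be always true.

```python
from itertools import product
from typing import List


def enabled_id_events(is_branch: bool, stall: int, raw_hazard: bool) -> List[str]:
    """Names of the ID-stage events whose guards hold in the given state."""
    guards = {
        "ID":            (not is_branch) and stall == 0,
        "IDBranchNoRAW": is_branch and stall == 0 and not raw_hazard,
        "IDBranchRAWa":  is_branch and stall == 0 and raw_hazard,
        "IDStall":       stall > 1,
    }
    return [name for name, holds in guards.items() if holds]


def mutually_exclusive() -> bool:
    """True when no state enables more than one ID-stage event."""
    # Stall is a 2-bit counter, so the values 0..3 cover its whole range.
    states = product([False, True], range(4), [False, True])
    return all(len(enabled_id_events(*state)) <= 1 for state in states)
```

Because at most one guard holds in any state, the order of the branches in the resulting case statement does not affect behaviour.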
  
[Figure: the events of the FETCH, DECODE, EXECUTE and WRITEBACK processes that implement the branch instruction, connected by the pipeline registers]
Figure 5.19: Branch 4-stage Pipeline
5.5 Pipeline Instruction Composition
The speciﬁcations of the ArithRR and Branch operations have been considered and
reﬁned separately. This loose decomposition of the problem domain has allowed the
particular characteristics of these instructions to be modeled and reﬁned independently.
Further instructions from the ISA, such as the Load and Store instructions, may be
considered in a similar way.
Once all the instructions have been modeled, the concrete instruction implementations
may be composed to derive a pipeline that supports all the instructions. The overall
method that we have developed is illustrated, using the ArithRR and Branch
instructions, in Figure 5.20.
[Figure: the ISA is split into an ArithRR Abstract Machine and a Branch Abstract Machine; each is refined to a Concrete Machine and decomposed into 4 machines, 1 per pipeline stage (IF, ID, EX, WB); the stage machines are then composed into a single IF, ID, EX, WB pipeline]
Figure 5.20: The Pipeline Development Flow
First, the task of implementing the ISA is decomposed, in the problem domain, into
a set of tasks: one for each instruction. Event-B reﬁnement with formal proof is then
used to derive a concrete implementation, represented by a single Event-B machine, of
each instruction. Shared event decomposition is then used to decompose each Event-B
machine into four machines, one for each stage and representing the contribution of each
instruction to each of the four pipeline stages. The contributions from each instruction
to a given stage are then composed so that in the ﬁnal step of the method a single,
4-stage pipeline comprising 4 processes, one per stage, is derived that represents the
combined operation of the instructions.
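The final composition step can be pictured with a short Python sketch that regroups events by stage. It is a deliberate simplification and not the shared event composition procedure itself: events are represented only by name, events with the same name in the same stage are assumed to be identical and are kept once, and the function name compose and the sample event lists are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

STAGES = ["IF", "ID", "EX", "WB"]


def compose(decomposed: Dict[str, Dict[str, List[str]]]) -> Dict[str, List[str]]:
    """Regroup {instruction: {stage: [events]}} into {stage: [events]}."""
    pipeline: Dict[str, List[str]] = defaultdict(list)
    for stages in decomposed.values():
        for stage in STAGES:
            for event in stages.get(stage, []):
                if event not in pipeline[stage]:   # assumed-identical events appear once
                    pipeline[stage].append(event)
    return {stage: pipeline[stage] for stage in STAGES}
```

Given one four-stage decomposition per instruction, the result is a single description with one process per stage.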
Recall the implementations of the ArithRR instruction as shown in Figure 5.21 and the
Branch instruction as shown in Figure 5.22.
 
  
[Figure: the events of the FETCH, DECODE, EXECUTE and WRITEBACK processes of the concrete ArithRR pipeline, connected by the pipeline registers]
Figure 5.21: ArithRR 4-stage Pipeline
[Figure: the events of the FETCH, DECODE, EXECUTE and WRITEBACK processes that implement the branch instruction, connected by the pipeline registers]
Figure 5.22: Branch 4-stage Pipeline
Following the procedure described in [Silva and Butler, 2009], for each stage the ini-
tialisation events and the invariants are composed, and the variables merged. For the
FETCH stage, the IF events from each instruction are composed and the IFbranch
event, from the Branch reﬁnement, included. For the DECODE stage, the four events
from each instruction are included, resulting in eight mutually exclusive events. For the
EXECUTE stage, the EX event from the Branch reﬁnement is merged with the four
events from the Arithmetic reﬁnement, and the EXBranch included to give a total of ﬁve
events. For the WRITEBACK stage the WB events from each instruction are merged,
resulting in a single WB event in the composition.
The composed 4-stage pipeline comprises fifteen events as shown in Figure 5.23.

[Figure: events of the composed pipeline, grouped by pipeline stage (FETCH, DECODE, EXECUTE, WRITEBACK)]
Figure 5.23: Composed 4-stage Pipeline
Each instruction of the ISA can be developed independently, addressing the particular
architectural considerations of that instruction, and then composed formally with the
other instructions. Each stage of the composed pipeline is represented by a set of
mutually exclusive events that maps directly to an HDL description which can be RTL,
Bluespec or CAL.
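
The idea that a pipeline stage is a set of mutually exclusive guarded events can be sketched in Python. This is an illustrative model only: the state fields (`stall`, `is_branch`) and the three event names are hypothetical stand-ins, not the events, guards or generated HDL of the thesis.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class State:
    stall: int        # hypothetical stall counter
    is_branch: bool   # hypothetical: the instruction in DECODE is a branch


@dataclass(frozen=True)
class Event:
    name: str
    guard: Callable[[State], bool]    # when the event may fire
    action: Callable[[State], State]  # atomic state update


# Hypothetical DECODE-stage events; their guards never overlap.
DECODE_EVENTS: List[Event] = [
    Event("Decode", lambda s: s.stall == 0 and not s.is_branch, lambda s: s),
    Event("DecodeBranch", lambda s: s.stall == 0 and s.is_branch,
          lambda s: State(stall=2, is_branch=False)),
    Event("DecodeStall", lambda s: s.stall > 0,
          lambda s: State(stall=s.stall - 1, is_branch=s.is_branch)),
]


def enabled(events: List[Event], state: State) -> List[Event]:
    """Return the events whose guards hold in the given state."""
    return [e for e in events if e.guard(state)]


def mutually_exclusive(events: List[Event], states: List[State]) -> bool:
    """True if at most one event is enabled in every listed state."""
    return all(len(enabled(events, s)) <= 1 for s in states)


# Small, exhaustively enumerable state space for the check.
ALL_STATES = [State(stall, br) for stall in range(3) for br in (False, True)]
```

When `mutually_exclusive(DECODE_EVENTS, ALL_STATES)` holds, the stage could plausibly be written as a single prioritised case statement in an HDL, since no two branches can be selected at once.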
5.6 Measuring Pipeline Complexity at the Speciﬁcation
Level
The simplicity of the final pipeline is deceptive. Although the processes representing
each stage are not in themselves complex, reflecting the inherent efficiency and elegance
of the DLX design, the real complexity of the pipeline derives from the fact that the
pipeline stage processes run concurrently and interact via shared variables
with feedback.
In a current RTL development flow, the implementation will be handed to a verification
engineer who must verify the design against a test plan. The difficulty comes in developing
a credible test plan that reflects the complexity of the design. The ISA specification
on its own cannot be used as the basis for the test plan; the design engineer needs to
understand the detailed pipeline implementation. In practice, the test engineer must
derive the behaviour of the combined state machine from the individual state machines
that represent each pipeline stage. Just as ensuring full code coverage of each individual
process is insufficient, ensuring full arc coverage of each of the interacting state machines
is also insufficient. In general, generating a combined state machine is impractical as it
is extremely difficult to decide which of all the possible combined transitions are actually
valid.
In the proof-based Event-B method described, design complexity is exposed explicitly
in the design process: the combined state machine is visible from an early stage of the
design, the effect of design decisions on complexity can be seen immediately, and design
alternatives can be explored and measured. Proof-based refinement, with invariant
preservation and convergence, can be seen to obviate the need for unit tests.
5.6.1 Combined State Machine Arc Coverage
Recall the concrete implementation (prior to decomposition) of the branch instruction
pipeline feedback loop as shown in Figure 5.24.
The combined state machine representing the three pipeline stages and its valid transi-
tions are revealed explicitly. The eleven arcs are labeled with the value of the counter
stall; the veriﬁcation engineer must ensure that all eleven are covered by the tests. These
tests for the downstream process can therefore be derived directly from the ﬁnal com-
bined state machines revealed in this Event-B model, ensuring that full arc coverage can
be targeted.
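
As a rough illustration of how arc coverage might be measured once the valid arcs are known, the sketch below compares the arcs exercised by one test trace against a set of valid arcs. The arc set is a hypothetical five-arc subset built from event names used in this chapter; it is not the eleven arcs of Figure 5.24, and the function name `arc_coverage` is an assumption.

```python
from typing import List, Set, Tuple

Arc = Tuple[str, str]  # (event, next event)

# Hypothetical subset of valid arcs; NOT the eleven arcs of Figure 5.24.
VALID_ARCS: Set[Arc] = {
    ("IFIDEXWB", "IFIDEXWB"),
    ("IFIDEXWB", "IFIDBranchNoRAWEXWB"),
    ("IFIDBranchNoRAWEXWB", "IFIDStallEXBranchWB"),
    ("IFIDStallEXBranchWB", "IFBranchIDStallEXWB"),
    ("IFBranchIDStallEXWB", "IFIDEXWB"),
}


def arc_coverage(valid_arcs: Set[Arc], trace: List[str]) -> Tuple[float, Set[Arc]]:
    """Return (fraction of valid arcs covered, set of uncovered arcs)."""
    taken = set(zip(trace, trace[1:]))  # consecutive event pairs in the trace
    invalid = taken - valid_arcs
    if invalid:
        raise ValueError(f"trace takes arcs that are not valid: {sorted(invalid)}")
    uncovered = valid_arcs - taken
    fraction = (len(valid_arcs) - len(uncovered)) / len(valid_arcs)
    return fraction, uncovered


# A short trace that stops part-way through the stall sequence.
TRACE = ["IFIDEXWB", "IFIDEXWB", "IFIDBranchNoRAWEXWB", "IFIDStallEXBranchWB"]
FRACTION, UNCOVERED = arc_coverage(VALID_ARCS, TRACE)
```

The uncovered set tells the verification engineer which transitions still need a test; a trace that takes an arc outside the valid set is reported as an error rather than counted.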
[Figure: combined state machine of the branch pipeline. Each state is an event of the concrete model (for example IFIDEXWB, IDStallEXWB and IFBranchIDStallEXWB), reached from INIT. Numbered arcs show the value of the counter Stall.]
Figure 5.24: Combined State Machine Arcs
5.6.2 Combined State Machine Path Coverage
Although combined state machine Arc Coverage is a desirable, but often unattainable,
goal in design verification, it is in itself not sufficient to ensure full functional coverage.
All valid paths through the combined state machine must also be covered and identifying
these paths is extremely difficult. It is a major benefit of a proof-based method that there
is no need to be concerned about the paths of the combined state machine; invariant
preservation and convergence are proved for all possible interleavings of the composed
events.
In general it is difficult to document the combined state machine with diagrams such
as that shown in Figure 5.24. A better approach is to use event refinement diagrams
[Butler, 2009], which also reveal the refinement hierarchy. The uppermost node of the
diagram represents an event from the abstract machine. This node is connected to one or
more events below it which represent the first refinement. These events may themselves
be refined by events below them in the tree. An event refinement diagram, annotated
to show the value of the Stall variable before (left) and after (right) the event has been
evaluated, is used in Figure 5.25 to illustrate the case where a branch is taken, but no
RAW hazard is incurred. The dashed line denotes that the event EX refines skip. The
solid lines denote that, for instance, EXNoBranch refines NoBranch. The XOR denotes
that one and only one of the events below it is enabled at a time.
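[Figure: event refinement diagram. Nodes, from the abstract event down to the fourth refinement:
Branchp
EX, EXBranch
EXWB, EXBranchWB
IDStallEXBranchWB, IDEXWB, IDBranchNoRAWEXWB, IDStallEXWB
IFIDEXWB, IFBranchIDStallEXWB, IFIDStallEXBranchWB, IFIDBranchNoRAWEXWB]
Figure 5.25: Branch Refinement Diagram: 4th Refinement, No RAW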
[Figure: event refinement diagram with an XOR node. Nodes, from the abstract event down to the fourth refinement:
NoBranchp
EX, EXNoBranch
EXWB, EXNoBranchWB
IDStallEXNoBranchWB, IDEXWB, IDBranchNoRAWEXWB, IDStallEXWB
IFIDEXWB, IFIDStallEXWB, IFIDStallEXNoBranchWB, IFIDBranchNoRAWEXWB]
Figure 5.26: NoBranch Refinement Diagram: 4th Refinement, No RAW
At the fourth level of refinement it can be seen that the pipeline, in normal operation
and in the absence of branch instructions, executes the event IFIDEXWB. The value
of Stall remains at zero. When a branch instruction is encountered, the event
IFIDBranchNoRAWEXWB is enabled and Stall is set to 2 to indicate that the pipeline
will now stall. This enables the event IFIDStallEXBranchWB, which determines that
the branch will be taken and decrements Stall. IFBranchIDStallEXWB is now enabled
to complete the pipeline stall and decrement the value of the counter to 0 so that normal
operation can resume. Figure 5.26 documents the case where there is no hazard and the
branch is not taken.
A total of four such diagrams, to reflect the combination of cases where a RAW hazard
is encountered or not and where the branch is taken or not, document concisely the
complexity of the combined state machine, its transitions and their associated convergence.
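
The taken-branch, no-RAW sequence described above can be replayed with a small Python sketch that tracks only the Stall counter. The before and after values of Stall for each event are read from the prose; treating the before value as the whole guard is a simplifying assumption, since the real guards also test opcodes and the branch condition.

```python
from typing import Dict, List, Tuple

# event name -> (Stall value required before, Stall value after)
# Assumption: only the Stall part of each guard is modelled here.
STALL_EFFECT: Dict[str, Tuple[int, int]] = {
    "IFIDEXWB": (0, 0),             # normal operation
    "IFIDBranchNoRAWEXWB": (0, 2),  # branch decoded, pipeline will stall
    "IFIDStallEXBranchWB": (2, 1),  # branch resolved as taken
    "IFBranchIDStallEXWB": (1, 0),  # stall completes, fetch from the target
}


def replay(trace: List[str], stall: int = 0) -> List[int]:
    """Return the sequence of Stall values seen while executing the trace."""
    history = [stall]
    for name in trace:
        before, after = STALL_EFFECT[name]
        if stall != before:
            raise ValueError(f"{name} is not enabled when Stall = {stall}")
        stall = after
        history.append(stall)
    return history


TAKEN_BRANCH = [
    "IFIDEXWB",
    "IFIDBranchNoRAWEXWB",
    "IFIDStallEXBranchWB",
    "IFBranchIDStallEXWB",
    "IFIDEXWB",
]
STALL_HISTORY = replay(TAKEN_BRANCH)  # [0, 0, 2, 1, 0, 0]
```

The counter returns to 0 after two stall steps, which mirrors the convergence argument: each stall event strictly decreases Stall until normal operation resumes.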
5.7 Component Re-use in Pipeline Speciﬁcations
Once the key concepts of pipeline specification in this method have been developed,
to manage for instance stalling and forwarding, it is desirable that the efforts of this
specification work can be re-used in subsequent pipeline designs. Exploring further
the component-based composition described in the specification of the Even Number
Generator in Section 4.4.6 on Page 53, the ArithRR instruction is revisited to show
how specification components representing a pipeline stage can be introduced into the
refinement.
Recall the abstract model of the ArithRR instruction represented by the event
Event ArithRR ≙
  any
    pop
  where
    grd1 : pop ∈ ArithRROp
  then
    act1 : Regs(Rr(pop)) := Regs(Ra(pop)) + Regs(Rb(pop))
  end
with the microarchitecture shown in Figure 5.27.
[Figure: the ArithRR event, with parameter pop, acting on Regs]
Figure 5.27: Abstract Machine: Microarchitecture
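
To make the abstract event concrete, the following Python sketch executes ArithRR as a guarded atomic action on a register file. The dictionary encoding of an instruction (keys `op`, `rr`, `ra`, `rb`) and the contents of `ARITH_RR_OPS` are assumptions made for illustration; only the guard and the action mirror the event above.

```python
from typing import Dict, Optional

# Hypothetical opcode set standing in for ArithRROp.
ARITH_RR_OPS = {"ADD"}


def arith_rr(regs: Dict[int, int], pop: Dict[str, object]) -> Optional[Dict[int, int]]:
    """Fire the abstract ArithRR event once.

    Returns the new register file, or None if the guard (grd1) does not hold.
    The before-state is never mutated, so both operands are read before the
    result register is written, as in an atomic Event-B action.
    """
    if pop["op"] not in ARITH_RR_OPS:      # grd1 : pop in ArithRROp
        return None
    new_regs = dict(regs)
    # act1 : Regs(Rr(pop)) := Regs(Ra(pop)) + Regs(Rb(pop))
    new_regs[pop["rr"]] = regs[pop["ra"]] + regs[pop["rb"]]
    return new_regs


REGS = {1: 5, 2: 7, 3: 0}
AFTER = arith_rr(REGS, {"op": "ADD", "rr": 3, "ra": 1, "rb": 2})
```

Because the whole instruction completes in one atomic step, this abstract machine has no hazards; these only appear once the refinements split the work across pipeline stages.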
Now consider the two component specifications of Figure 5.28.

[Figure: the two component machines, one containing the WB (WriteBack) stage and one containing the EX stage, each with its parameters]
Figure 5.28: First Refinement: Components

The two machines representing each of these abstract specifications can be composed
formally to create a single machine that must be proved to be a correct refinement of the
original abstract specification. The abstract machine is refined by Machine 1. The event
representing the WriteBack stage, WBp, refines the abstract event ArithRR and the four
events EX*RAWp refine skip. The initial composition, using Rodin's composition plug-in,
results in four composed events as shown in Figure 5.29.
In a further reﬁnement, the variables EXALUoutput and EXop are now introduced, the
parameters of the composed machine are instantiated using witnesses and the gluing
invariant introduced as shown in Figure 5.30. The proof obligations are discharged
automatically.
The composition process is repeated, introducing the parameterised speciﬁcation of the
Instruction Decode stage as shown in Figure 5.31. Four Instruction Decode events are
composed with the four events from the previous reﬁnement to give a total of sixteen
events.
In the third reﬁnement the variables A, B and EXop are introduced, the parameters
of the composed machine are again instantiated using witnesses and the gluing invari-
ants introduced as shown in Figure 5.32. Again, the proof obligations are discharged
automatically.
Now that the full feedback loop of the pipeline has been modelled and proved, the events
can be de-composed so that there is a machine to represent each pipeline stage, in the
same way as described for the ArithRR instruction in Section 5.3.6 on Page 85.
  
[Figure: microarchitecture of the composed machine, with the EX and WB stages and the parameters pop, ppop, popout, pres, pv and pvprime]
  event EXWBaRAWp refines ArithRR
    any pop pres ppop pv pvprime popout pr
    where
      // guards from the WB component
      @grd1 pop ∈ ArithRROp
      @grd3 pres = Regs(Ra(pop)) + Regs(Rb(pop))
      // guards from the EX component
      @grd1 ppop ∈ ArithRROp
      @grd2 Rr(popout) = Ra(ppop)
      @grd3 Rr(popout) ≠ Rb(ppop)
      @grd4 pr ∈ Register → …
      @grd5 pvprime = pv + pr(Rb(ppop))
      @grd6 popout ∈ ArithRROp
      @grd7 pv ∈ …
    then
      @act1 Regs(Rr(pop)) := pres
  end
  event EXWBabRAWp refines ArithRR
  event EXWBbRAWp refines ArithRR
  event EXWBnoRAWp refines ArithRR
Figure 5.29: First Refinement: Component Composition
[Figure: the event EXWBaRAWp after the second refinement. The variables EXop and EXALUoutput are introduced, witnesses instantiate the parameters of the composed event (for example popout = EXop), and a gluing invariant on EXALUoutput is added. The microarchitecture shows the EX and WB stages linked by EXALUoutput and EXop.]
Figure 5.30: Second Refinement: Parameter Instantiation
[Figure: the four EXWB*RAW events of the previous refinement and the four parameterised Instruction Decode events, with the microarchitecture of the ID, EX and WB stages]
Figure 5.31: Third Refinement: Component Composition

[Figure: the third-refinement events with their witnesses and gluing invariants, and the microarchitecture of the ID, EX and WB stages with the variables A, B, IDop, EXop and EXALUoutput]
Figure 5.32: Third Refinement: Parameter Instantiation
5.7.1 Parameters and Witnesses in Component Re-use
Witnesses, described in Section 2.9 on Page 19, were introduced into Event-B to make
proof easier. They do also, however, provide a powerful mechanism for importing re-
usable event templates into an Event-B development. In our method, a re-usable com-
ponent has outputs that are concrete variables and inputs that are parameters. These
parameters represent the environment within which the component is guaranteed to im-
plement its speciﬁed behaviour, represented by Event-B invariants, in terms of its output
variables. The parameterized environment can be represented in the guards of the events
in a powerful and ﬂexible way, which can broaden considerably the scope for re-use of
that component by deﬁning a family of behaviours which meets the component’s input
speciﬁcation.
When a component is introduced into an Event-B reﬁnement, its output variables are
bound using witnesses to the parameters of the reﬁnement and its input parameters
are bound, again using witnesses, to the variables that already exist or are freshly
introduced at this reﬁnement level. In this way, communication is established between
the component and the existing model. Alternatively, the parameters of the component
may be left to be bound in a subsequent reﬁnement; the parameters become part of
the environment speciﬁcation of the model. The developer does not need to modify
the component, only to provide the witnesses that constrain the parameters to the
particular variables of the development. The Rodin tool proves that the variables meet
the parameterized speciﬁcation.
It is a fundamental requirement of a re-use method that the developer should not have
to modify a component to meet a particular need. Event-B parameters and witnesses
provide a mechanism that meets this requirement.
5.8 A Review of the Pipeline Development Method
Even for a simple and elegant pipeline architecture such as DLX it was a daunting
proposition to attempt to derive, formally, a correct, concrete implementation from the
processor ISA. It was therefore necessary to decompose the problem. The ﬁrst issue that
needed to be addressed was whether it was actually possible to model the ISA and the
derived pipeline in Event-B. This issue splits into two: the modelling of data and the
modelling of control.
5.8.1 Modelling Data
The key observation here was that the data needed to be only concrete enough to be
amenable to high-level or RTL synthesis. It is one of the major strengths of modern
synthesis techniques that they relieve the designer of the need to worry about the
low-level data representation. Event-B types, supported by the structuring conventions
that facilitate the representation of records, were found to be sufficient to represent the
instructions and registers of the processor.
5.8.2 Modelling Control
The next step was to ensure that each concrete pipeline stage operation could be mod-
elled with Event-B events; to show a direct mapping between the speciﬁed operations and
these events was an important step in validating the approach. At this stage, however,
it became clear that the processor behaviour could not be modelled as a straightfor-
ward interleaving of these events and that it would be necessary to model explicitly the
inherent simultaneity of the pipeline.
5.8.3 Advancing from Modelling to Proving: Problem Decomposition
Now it was possible, using multiple refinement steps, to derive a concrete implementation
of the pipeline, but not to prove that the implementation met its specification.
It was therefore important to decompose the problem further and consider a single
instruction at a time. The Arithmetic instruction was chosen first and a strong abstract
specification sought. In this case the specification of the effect that the instruction has
on the processor register file, for a notional abstract machine that executes in a single
cycle, was settled upon. It was then possible to refine this to a two-stage abstract
machine with the introduction of just two pipeline variables and focus on using the
tool to help discover the appropriate gluing invariants. At this stage either stalling or
forwarding could have been chosen to address the RAW hazards shown up by the failing
proof obligations, but since these proof obligations pointed directly to the forwarding
solution, this was chosen.
The Branch instruction was then chosen so that the pipeline stalling technique could
be developed. Again, a strong abstract specification, which defined the effect of the
instruction on the program counter of an abstract single-cycle machine, was chosen.
5.8.4 Automatic Proof
It is highly desirable for a proof-based method that the proof obligations are discharged
automatically without manual user intervention. Where proofs cannot be discharged, it
is not always obvious to the designer whether the problem is with the model or the tool.
It is therefore important that the developer of the method ensures that automatic proof
is feasible, and it was not considered an option in our work to require manual proof
intervention. In retrospect, striving for automatic proof had other major benefits, not
least in ensuring that each refinement step represented a clear, understandable increment
addressing a single modelling issue.
5.8.5 Managing Architectural Complexity
A further benefit of taking small, incremental steps is that any increase in complexity
is noticed and can be managed immediately. The method ensures that all the simultaneous
pipeline activity must be addressed at every refinement step and makes it easy
to explore alternative architectures; the designer can move back up the refinement
hierarchy, as far as is necessary, to choose an alternative refinement approach. Event-B
de-composition and re-composition have also been shown to be invaluable in managing
pipeline complexity.
5.8.6 Measuring Architectural Complexity
An interesting by-product of our work has been the establishment of a relationship
between formal proof and functional coverage; in particular combined state machine arc
and path coverage. During the development of the pipeline, faults were deliberately
injected into the models by, for instance, changing the value of a variable
and then re-running the provers. Sensitivity to faults is a measure of the strength of the
formal model and if a variable can take faulty values without affecting the proofs, then
that variable can also be considered not to be covered. The relationship between model
strength and coverage is an interesting area for further research.
5.8.7 Memory Accesses
The DLX microarchitecture assumes that memory loads and stores occur in a single
cycle. Although straightforward to model, this does not reﬂect the actual memory
access characteristics of an SoC microprocessor. A more representative memory access
example is considered in the next chapter.

Chapter 6
Memory Accesses: Managing
Component Latency
In modern SoC designs a large area of the SoC can be devoted to on-chip memory
and memory accesses for Synchronous Dynamic RAM (SDRAM) can take several clock
cycles, although it is possible to make a read request on every clock cycle. In addition,
where a microprocessor architecture is capable of issuing multiple instructions in a single
cycle, which can complete in an arbitrary order, it must be shown that the re-order buffer
maintains the original issue order.
The contribution of this chapter is to extend the method that has been developed in
earlier chapters to model RAM latency and manage out-of-order completion. Modeling
SDRAM, in the context of an Internet Protocol (IP) address look-up module, is the
subject of detailed analysis in [Arvind et al., 2004] and the same example is used here to
demonstrate the Event-B modelling techniques that we have developed in the extension
of the method.
6.0.8 Modelling Synchronous Memory
For a synchronous RAM with a latency of N cycles, an auxiliary N-bit shift register is
introduced to keep track of the valid memory accesses, as shown in Figure 6.1.
On every clock cycle, if a read request is initiated, then a '1' is written to the shift
register. Otherwise, if there is no read request, then a '0' is written. At the same time,
on every clock cycle, the last bit of the shift register is read. If it is a '1', then the value
on the output data line of the RAM represents the valid result of a read request made
N cycles earlier, and this result can be written to the output FIFO. If it is a '0', the
result is ignored.
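[Figure: a synchronous memory with a latency of N cycles, with Address input and Data output, alongside an N-bit shift register of valid bits (for example 0 1 1 1 0 1) whose last bit gates writes to the Output FIFO]
Figure 6.1: Managing RAM Latency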
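
The shift-register scheme can be simulated cycle by cycle with a short Python sketch. The function name `simulate`, the use of `None` for "no request", and the second deque that stands in for the RAM's internal pipeline are all assumptions made for illustration; the sketch also assumes a latency of at least one cycle.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple


def simulate(ram: Dict[int, str], requests: List[Optional[int]],
             latency: int) -> List[Tuple[int, str]]:
    """Return (cycle, value) pairs written to the output FIFO.

    requests[t] is the address read at cycle t, or None if no read is made.
    """
    valid = deque([0] * latency, maxlen=latency)    # N-bit valid shift register
    data = deque([None] * latency, maxlen=latency)  # stands in for the RAM pipeline
    fifo: List[Tuple[int, str]] = []
    # Extra idle cycles at the end let outstanding reads drain.
    for cycle, addr in enumerate(list(requests) + [None] * latency):
        # Read the last bit before shifting, as registers would on a clock edge.
        if valid[-1] == 1:
            fifo.append((cycle, data[-1]))
        # Shift in a 1 for a read request, a 0 otherwise.
        valid.appendleft(1 if addr is not None else 0)
        data.appendleft(ram[addr] if addr is not None else None)
    return fifo


RAM = {0: "a", 1: "b", 2: "c"}
# Reads issued at cycles 0, 2 and 3 with a latency of 3 cycles.
OUTPUT = simulate(RAM, [0, None, 1, 2], latency=3)
```

Each result appears exactly `latency` cycles after its request, and back-to-back requests are accepted on consecutive cycles, which is the pipelined behaviour the valid bits are there to track.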
To model the memory latency in Event-B, a partial function latency_table is introduced
that maps a unique token representing a memory read to a constant natural number
that represents the number of clock cycles that will elapse before the data associated
with the memory read will be ready.
inv1 : latency_table ∈ ℕ ⇸ ℕ
inv2 : last_token ∈ ℕ
The variable last_token is used to ensure that the next, incremental token value is
associated with each new lookup.
A further table maps the unique token to the memory address.
inv3 : address_table ∈ ℕ ⇸ Address
The SDRAM maps an address to a value,
inv4 : RAM ∈ Address → Value
and the output FIFO maps a natural number to a Value and is accessed using a write
and read index, wr and rd respectively.
inv5 : output_fifo ∈ ℕ ⇸ Value
The microarchitecture is shown in Figure 6.2.
[Figure 6.2: SDRAM lookup]
When a read request is initiated, a new entry for the associated read address is added
to the latency table with the natural number set to the memory latency. At the same
time, a lambda function is used to decrement the value associated with any previous
read address in the table,
  act2 : latency_table := (λi·i ∈ dom(latency_table) ∧ latency_table(i) > 0 |
         latency_table(i) − 1) <+ {tok ↦ Latency}
where tok is the token associated with the latest read request and Latency is the SDRAM
latency in clock cycles.
If at a clock cycle no read request is initiated, then the values associated with outstanding
read requests must still be decremented.
  act2 : latency_table := (λi·i ∈ dom(latency_table) ∧ latency_table(i) > 0 |
         latency_table(i) − 1)
When the value associated with an outstanding read request reaches zero, then the
look-up can complete.
Event CompleteLookup ≙
  any
    tok
  where
    grd1 : tok ∈ dom(latency_table) ∧ latency_table(tok) = 0
  then
    act1 : output_fifo(wr) := RAM(address_table(tok))
  end
The token tok is used to find the corresponding RAM address in the address table and
the value at that address is written to the output FIFO.
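The behaviour of the latency table over successive clock cycles can be sketched in Python. This is an illustration only, not the Event-B model: the dictionary representation and the names tick and ready_tokens are assumptions. As in the lambda expression above, only entries with a value greater than zero are carried forward, so an entry that has reached zero is dropped on the following cycle.

```python
def tick(latency_table, new_tok=None, latency=4):
    """Advance the latency table by one clock cycle.

    Entries greater than zero are decremented; entries already at zero are
    not carried forward. If `new_tok` is given, it is entered with the full
    memory latency (names and defaults here are illustrative).
    """
    nxt = {tok: n - 1 for tok, n in latency_table.items() if n > 0}
    if new_tok is not None:
        nxt[new_tok] = latency
    return nxt


def ready_tokens(latency_table):
    """Tokens whose look-up can complete in this cycle (value is zero)."""
    return sorted(tok for tok, n in latency_table.items() if n == 0)
```

Starting from an empty table, issuing tokens 1 and 2 on consecutive cycles and then idling for three cycles leaves token 1 at zero and token 2 at one, so only token 1 is ready to complete.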
This method for modeling latency is now used in an Event-B model of an IP look-up
pipeline.
6.0.9 The IP Look-up Circular Pipeline
The IP look-up module in a router is responsible for conducting a Longest Prefix Match
of an IP address with the entries in a look-up table. (An IP address is a sequence
of numbers separated by dots.) The look-up table has a sparse tree representation
in memory and therefore multiple memory reads may be needed to get a match. To
maximise throughput, a circular pipeline is employed so that a memory read request
can be made at each clock cycle. The number of memory reads required will vary
according to the makeup of the IP address and an out-of-order completion buffer is
therefore needed to ensure that the results are made available in the correct sequence.
The architecture of the IP Address Lookup model is shown in Figure 6.3.
When an IP address is passed to the look-up module, it is associated with a numerical
token that denotes its place in the look-up queue. The first sub-address is placed in the
address table and then used as an index to the top-level node of the table in memory.
The latency table ensures that the result of the look-up is correctly synchronised. If the
sub-address points to a leaf, then the look-up result is placed in the completion buffer at
the location denoted by the IP address token. Otherwise, the reference to the sub-tree
is re-circulated and the next sub-address read and used as an index into the sub-tree.
This continues until a leaf is reached or it is found that the address is invalid. The entry
in the completion buffer is updated accordingly and the next IP address is processed.
The model performs three fundamental actions: enter, recirculate and completelookup.
Since this model is pipelined, it is also necessary to consider the simultaneous actions
as described by the state machine of Table 6.1. The action enter is enabled when a new
IP address is presented to the module.
[Figure 6.3: IP Lookup Circular Pipeline. Blocks shown: latency table (token ↦ latency), address table (token ↦ address), result table (token ↦ result), look-up table (address ↦ result) held in RAM.]
Event enter ≙
refines enter
  any
    tok, subaddr
  where
    grd1 : tok ∈ ℕ
    grd2 : subaddr ∈ Subaddress
    grd3 : tok = last_token + 1
    grd6 : table_in_place = TRUE
    grd7 : tok ∉ dom(latency_table)
    grd8 : ∀x·x ∈ dom(latency_table) ⇒ latency_table(x) > 0
  then
    act1 : address_table := address_table <+ {tok ↦ {0 ↦ subaddr}}
    act2 : last_token := tok
    act3 : latency_table := (λi·i ∈ dom(latency_table) ∧ latency_table(i) > 0 |
           latency_table(i) − 1) <+ {tok ↦ 4}
  end
The address table maps a token to an ordered list of IP sub-addresses.
  inv1 : address_table ∈ ℕ ⇸ (ℕ ⇸ Subaddress)
The new token is given the next free value,
  grd3 : tok = last_token + 1
a guard ensures that no memory access is completing in this cycle,
  grd8 : ∀x·x ∈ dom(latency_table) ⇒ latency_table(x) > 0
the token is associated with the first sub-address of the IP address and added to the
address table,
  act1 : address_table := address_table <+ {tok ↦ {0 ↦ subaddr}}
and the token is associated with the RAM latency (in this model the value is 4 cycles)
while, simultaneously, all other values in the latency table are decremented by one.
  act3 : latency_table := (λi·i ∈ dom(latency_table) ∧ latency_table(i) > 0 |
         latency_table(i) − 1) <+ {tok ↦ 4}
It is also necessary to model the case where there is no address lookup pending. The
action idle simply decrements the values in the latency table to ensure that the latency
is correctly modelled.
Event idle ≙
refines idle
  when
    grd1 : table_in_place = TRUE
    grd3 : ∀x·x ∈ dom(latency_table) ⇒ latency_table(x) > 0
  then
    act1 : latency_table := (λi·i ∈ dom(latency_table) ∧ latency_table(i) > 0 |
           latency_table(i) − 1)
  end
The action completelookup is enabled when a result is available to put in the completion
buffer. It is composed with either an idle or enter action. For instance,
Event enter_complete_lookup ≙
refines enter
  any
    tok, subaddr, ctok
  where
    grd1 : tok ∈ ℕ
    grd2 : subaddr ∈ Subaddress
    grd4 : ctok ∈ ℕ
    grd5 : tok = last_token + 1
    grd8 : table_in_place = TRUE
    grd9 : tok ∉ dom(latency_table)
    grd10 : ctok ∈ dom(latency_table)
    grd11 : ctok ∈ dom(address_table)
    grd13 : address_table(ctok) ∈ dom(table)
    grd14 : latency_table(ctok) = 0
    grd15 : card(table(address_table(ctok))) = 1
  then
    act1 : address_table := address_table <+ {tok ↦ {0 ↦ subaddr}}
    act2 : last_token := tok
    act3 : latency_table := (λi·i ∈ dom(latency_table) ∧ latency_table(i) > 0 |
           latency_table(i) − 1) <+ {tok ↦ 4}
    act4 : result_table := result_table <+ {ctok ↦ table(address_table(ctok))}
  end
In this composed event tok is the entering token and ctok is the completing token.
The value in the latency table for the completing token has reached zero,
  grd14 : latency_table(ctok) = 0
and therefore the result table can be updated.
  act4 : result_table := result_table <+ {ctok ↦ table(address_table(ctok))}
The completing token ctok is used to get the sub-address from the address table. This
sub-address is then used to get the result from the look-up table table and the result is
written to the completion buffer result_table.
An extra action, completelookuperr, is added to deal with invalid addresses, again
composed with either an idle or enter action. For instance,
Event idle_complete_lookup_err ≙
refines idle
  any
    ctok
  where
    grd1 : ctok ∈ ℕ
    grd2 : table_in_place = TRUE
    grd3 : ctok ∈ dom(latency_table)
    grd4 : ctok ∈ dom(address_table)
    grd6 : address_table(ctok) ∉ dom(table)
    grd7 : latency_table(ctok) = 0
  then
    act1 : result_table := result_table <+ {ctok ↦ ∅}
    act2 : latency_table := (λi·i ∈ dom(latency_table) ∧ latency_table(i) > 0 |
           latency_table(i) − 1)
  end
The event recirculate is enabled when the result of the lookup is not a leaf. In this case,
the reference to the sub-tree is re-circulated and the next sub-address read and used as
an index into the sub-tree.
Event recirculate ≙
refines recirculate
  any
    tok, subaddr, next
  where
    grd1 : table_in_place = TRUE
    grd2 : tok ∈ ℕ
    grd4 : tok ∈ dom(latency_table)
    grd6 : tok ∈ dom(address_table)
    grd7 : address_table(tok) ∈ dom(table)
    grd3 : latency_table(tok) = 0
    grd5 : card(table(address_table(tok))) > 1
    grd8 : subaddr ∈ Subaddress
    grd9 : next ∈ ℕ
    grd10 : next = card(dom(address_table(tok)))
  then
    act1 : address_table := address_table <+ {tok ↦ address_table(tok) <+
           {next ↦ subaddr}}
    act2 : latency_table := (λi·i ∈ dom(latency_table) ∧ latency_table(i) > 0 |
           latency_table(i) − 1) <+ {tok ↦ 4}
    act3 : result_table := result_table <+ {tok ↦ table(address_table(tok))}
  end
    Current State                                                 Next State
  Res   Leaf   Valid   Action                                 Res   Leaf   Valid
  F     -      -       idle/enter                             T/F   T/F    T/F
  T     F      T       recirculate                            T/F   T/F    T/F
  T     -      F       idle/enter with completelookuperr      T/F   T/F    T/F
  T     T      T       idle/enter with completelookup         T/F   T/F    T/F
Table 6.1: IP Lookup State Machine
Row 1: If the result (indicated by the boolean Res) of a memory lookup is not yet
available due to memory latency, or there are no lookups pending, then the model will
either idle if there is no pending lookup or enter the next pending IP address. (Lookup
of the next address can proceed even while an earlier address is still being processed.)
Row 2: When the result of the memory lookup becomes available and a leaf node has
not been encountered, precedence is given to this lookup in progress over pending new
lookups and the action recirculate is enabled.
Row 3: If the memory lookup fails because the address is invalid, then an empty result is
written to the completion buffer and, simultaneously, the next address can be processed
if there is one pending.
Row 4: When the result of the memory lookup becomes available and a leaf is
encountered, the valid result is written to the completion buffer and, simultaneously, the next
address can be processed if there is one pending.
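The four rows of Table 6.1 can be read as a decision function. The Python sketch below is an illustrative restatement of the table, not part of the Event-B model; the function name select_actions and the extra pending argument (whether a new IP address is waiting to enter) are assumptions.

```python
def select_actions(res, leaf, valid, pending):
    """Return the actions enabled in one cycle, following Table 6.1.

    res     : a memory look-up result is available in this cycle
    leaf    : the result is a leaf node (ignored when res or valid is False)
    valid   : the address was found in the table (ignored when res is False)
    pending : a new IP address is waiting to enter (illustrative parameter)
    """
    base = "enter" if pending else "idle"
    if not res:
        return (base,)                       # Row 1
    if not valid:
        return (base, "completelookuperr")   # Row 3
    if not leaf:
        return ("recirculate",)              # Row 2: takes precedence over enter
    return (base, "completelookup")          # Row 4
```

For example, select_actions(True, False, True, pending=True) returns only recirculate, reflecting the precedence given to a look-up in progress over a pending new look-up.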
The model is developed with three refinement stages and all proof obligations are
discharged automatically, as shown in Table 6.2.

                      Total no. of         Discharged      Discharged   Not
                      proof obligations    Automatically   Manually     Discharged
  Abstract Model      28                   28              0            0
  First Refinement    7                    7               0            0
  Second Refinement   11                   11              0            0
  Third Refinement    21                   21              0            0
Table 6.2: IP Lookup Proofs
This section has demonstrated how Event-B can be used to model the fixed latency of
synchronous RAM in a hardware component. It has also demonstrated how multiple
memory accesses can be managed simultaneously and how a re-order buffer can be
introduced to ensure that the look-up results are processed in the correct order. These
techniques can be transferred to the modelling of super-scalar SoC pipelined
microprocessors.
For applications where latency is variable, a latency-insensitive approach provides an
effective solution. Modeling latency-insensitive protocols is the topic of Chapter 7.

Chapter 7
Developing SoC Sub-Systems
This chapter covers the emerging latency-insensitive protocols which can de-couple sub-
system design from the complex timing interaction that can occur between components
and the implications of latency-insensitive design for an Event-B based method.
The contribution described in this chapter is the development of a high-level represen-
tation in Event-B of a low-level, latency-insensitive protocol that can be used at the
speciﬁcation level to verify that a sub-system of components is a correct reﬁnement of
its abstract speciﬁcation, that each component conforms to the protocol and that the
sub-system does not deadlock.
7.1 Managing Design Complexity
Once a methodology is in place for the speciﬁcation, reﬁnement and architectural explo-
ration of SoC components, the “consequent challenge is addressing the communication
and synchronization issues that arise while assembling predesigned components.” [Car-
loni and Sangiovanni-Vincentelli, 2002]
The requirements of an SoC inter-component communication mechanism are many-fold.
First, it must address the issue of latency which has an increasing impact on SoC design
methods. Second, it must decouple the components so that function and communication
can be considered as orthogonal concerns [Keutzer et al., 2000]. Increases in component
complexity result in a corresponding exponential increase in veriﬁcation complexity.
Third, it must work with existing tool ﬂows which are predominantly synchronous, and
fourth it should not in itself impose unreasonable performance and power-consumption
overheads.
7.2 Asynchronous Design and Transaction-Level Modelling
Introducing uni-directional FIFOs as a communication mechanism between components
addresses both latency and complexity concerns. This mechanism forms the foundation
of the SystemC Transaction Level Modeling methodology [Ghenassia, 2006].
Strictly enforcing FIFO-based communication ensures that there are no shared con-
trol variables and the only way that communication between components can occur is
through message-passing. However, to ensure that the functionality of a sub-system
depends only on the order in which messages are received, and not on their timing, it is
essential that every component is stallable [Carloni and Sangiovanni-Vincentelli, 2002];
if an input FIFO is empty or an output FIFO is full, then a component must retain its
internal state until it is possible for communication to recommence. The price of the
reduced complexity and independence from latency issues that FIFO-based communi-
cation exacts is that it must be veriﬁed that the components can stall correctly and, in
addition, that the stalling mechanism cannot lead to inadvertent sub-system deadlock
or livelock. This additional veriﬁcation requirement is better addressed using formal
methods rather than simulation.
Although an exponential merging of events is required for each individual pipelined
component, as shown in earlier chapters, the events of different components do not
need to be merged. The FIFO de-coupling means that the order in which the events of
different components are activated does not matter, and the interleaving semantics of
Event-B are precisely those that are required. So, for a 10-stage pipeline represented by 3
events per stage, the total number of potential combined events that must be considered
is $3^{10} = 59049$. If the pipeline register between the fifth and sixth stage is replaced with
a FIFO, splitting the pipeline in two, the total number of events that must be considered
is
$3^5 + 3^5 = 243 + 243 = 486$
Splitting a pipeline to manage latency issues has a commensurate effect on model
complexity, with the proviso that each component of the pipeline must be shown to be
stallable.
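The arithmetic above generalises to any partition of a pipeline into FIFO-separated components. The following Python sketch simply restates the calculation; the function name combined_events is an assumption made for illustration.

```python
def combined_events(events_per_stage, stages_per_component):
    """Number of combined events to consider when a pipeline is split into
    FIFO-separated components: events merge within a component
    (exponentially in its number of stages) but not across components.
    """
    return sum(events_per_stage ** stages for stages in stages_per_component)


# One 10-stage component versus two 5-stage components, 3 events per stage.
unsplit = combined_events(3, [10])
split = combined_events(3, [5, 5])
```

Here unsplit evaluates to 59049 and split to 486, matching the figures in the text.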
Although FIFO-based communication addresses the issues of latency and complexity,
the asynchronous nature of the communication means that it does not fit well within
synchronous tool chains. This has the knock-on effect that the synthesised output cannot be
as efficient, in terms of performance and power consumption, as for a purely synchronous
design, since it is much more difficult to optimise across component boundaries.
FIFO-based communication is a mechanism, not a method. It imposes the burden that it
diverges from the well-established synchronous design methods and is more suitable for
communication between large sub-systems which may operate in different clock domains
than for communication within sub-systems.
7.3 Unit-Transaction Level Design
Unit-Transaction Level design (UTL) [Asanovic, 2007] builds on the notion of FIFO-
based communication with the constraint that each component must be a transactor
(transactional actor) which has private local state, buffered uni-directional input and
output channels and a set of guarded atomic actions (transactions) that can read from
its input channels, modify local state and write to its output channels. A scheduler is
associated with each transactor which determines the next transaction to be executed,
although this scheduling function could be implemented implicitly within the transac-
tions themselves. A transaction is chosen non-deterministically for execution from the
set of ready transactions.
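The structure of a transactor can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the UTL or Bluespec formulation: the class name Transactor, the two example transactions (accumulate and emit), the bounded output channel and the seeded random scheduler are all hypothetical choices made to show guards, atomic actions and non-deterministic selection.

```python
import random
from collections import deque


class Transactor:
    """A transactor with private state, one input and one output channel."""

    def __init__(self, inbox, outbox, out_capacity, seed=0):
        self.inbox = inbox                # buffered input channel (deque)
        self.outbox = outbox              # buffered output channel (deque)
        self.out_capacity = out_capacity  # bound on the output channel
        self.total = 0                    # private local state
        self._rng = random.Random(seed)   # models the non-deterministic choice
        # Each transaction is a (guard, action) pair.
        self._transactions = [
            (self._can_accumulate, self._accumulate),
            (self._can_emit, self._emit),
        ]

    def _can_accumulate(self):
        return len(self.inbox) > 0

    def _accumulate(self):
        self.total += self.inbox.popleft()

    def _can_emit(self):
        return self.total != 0 and len(self.outbox) < self.out_capacity

    def _emit(self):
        self.outbox.append(self.total)
        self.total = 0

    def step(self):
        """Execute one ready transaction atomically; return False if stalled."""
        ready = [action for guard, action in self._transactions if guard()]
        if not ready:
            return False  # stalled: local state is retained unchanged
        self._rng.choice(ready)()
        return True


def run_to_quiescence(transactor):
    """Step the transactor until no transaction is ready."""
    steps = 0
    while transactor.step():
        steps += 1
    return steps
```

Whatever order the scheduler chooses, the sum of the values emitted equals the sum of the values consumed once the transactor is quiescent, provided the output channel has room. If the output channel is full, the transactor stalls and keeps its accumulated state.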
The UTL method has been designed expressly for use with Bluespec high-level synthe-
sis, but with one important constraint. Individual Bluespec descriptions are written for
each transactor; the object-oriented mechanism that Bluespec provides for hierarchical
composition is not used and is replaced by explicit FIFO-based communication. Trans-
actors are constrained in size according to the guidelines recommended in [Sylvester and
Keutzer, 2001] so that conventional logic synthesis can be used on the RTL output from
Bluespec for each transactor in turn.
7.4 Latency-Insensitive Design
The theory of latency-insensitive design [Carloni et al., 2001], [Carloni et al., 1999] is
proposed as “the foundation of a correct-by-construction methodology for SoC design”
[Carloni and Sangiovanni-Vincentelli, 2002]. Latency-insensitive design introduces the
notion of a communication shell which wraps each synchronous, stallable component
with a set of buffers that manage point-to-point, uni-directional data links between the
components of a sub-system. The behaviour of a sub-system composed of components
that conform to the latency-insensitive protocol is independent of the communication
latencies between the components.
It is an important benefit of this approach that the sub-system itself is also synchronous.
In an asynchronous system, the delay between two successive messages is arbitrary. In a
latency-insensitive system, this delay is constrained to be a multiple of the clock period.
Latency-insensitive design therefore exhibits the beneficial properties of asynchronous
design, but with the considerable advantage that, because all inter-component
communication is synchronous, it can be used in a well-established synchronous design tool flow,
with all the advantages of the optimised synthesis facilities that such a flow provides.
For each input of a stallable component the wrapper implements 3 registers as shown in
Figure 7.1.
[Figure 7.1: Latency Insensitive Protocol. Registers shown: DataIn, voidIn, stopOut.]
The DataIn register holds the input data. The single-bit voidIn register takes the value
'1' if the input data is valid and '0' if not. The single-bit stopOut register is set to '1'
by the receiving component if it is stalled and '0' otherwise. A corresponding set of 3
registers is implemented for each output. If a component can be stalled with a gated
clock then the wrapper can be generated automatically.
Where the point-to-point communication between two components cannot meet the
clock period constraint, then one or more relay stations may be inserted without
affecting the sub-system functionality. Relay stations are simple, stallable components which
comply with the latency-insensitive protocol and simply hold the data value passed to
them for a clock period before passing it on. A point-to-point path with relay stations
acts as a pipeline and therefore can minimise the effect on throughput.
7.5 Synchronous Elastic Design
Synchronous ELastic Flow (SELF) [Krstic et al., 2006], [Cortadella et al., 2006b],
[Cortadella et al., 2006a] builds on the theory of latency-insensitive design by defining a
highly efficient communication protocol that imposes a very low overhead in terms of
area, performance and power consumption. The protocol is shown in Figure 7.2.
[Figure 7.2: The SELF Protocol. An input channel (idata, ivalid, istop) with state istate and an output channel (odata, ovalid, ostop) with state ostate. Each channel is in one of the states IDLE (¬valid), TRANSFER (valid ∧ ¬stop) or RETRY (valid ∧ stop).]
Each elastic component must be stallable and implement the SELF protocol on each of
its inputs and outputs. Two control signals, valid and stop, determine the three possible
states that an output or input channel can take as shown in Table 7.1.
  State      Signal Conditions   Actions
  TRANSFER   valid ∧ ¬stop       receiver accepts valid data from sender
  IDLE       ¬valid              sender has no valid data
  RETRY      valid ∧ stop        valid data from sender is not accepted by receiver
Table 7.1: SELF State Machine
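Table 7.1 can be restated as a function from the two control signals to the channel state. The Python sketch below is an illustration only; the function name channel_state and the string labels are assumptions.

```python
def channel_state(valid, stop):
    """Classify a SELF channel from its valid and stop control signals."""
    if not valid:
        return "IDLE"       # sender has no valid data (stop is irrelevant)
    if stop:
        return "RETRY"      # valid data is not accepted by the receiver
    return "TRANSFER"       # receiver accepts valid data from the sender


def channel_trace(signals):
    """Map a per-cycle sequence of (valid, stop) pairs to channel states."""
    return [channel_state(valid, stop) for valid, stop in signals]
```

For instance, a sender that holds valid high while the receiver asserts stop for one cycle gives the trace RETRY followed by TRANSFER.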
Two elastic components are connected with an elastic buffer as shown in Figure 7.3.
The buffer is essentially a 2-entry FIFO, which can be implemented using 2 latches, and a
simple mechanism for propagating the values of valid and stop between the components.
Synchronous Elastic Design combines the de-coupling benefits of asynchronous design
with the efficiency that can be achieved using synchronous tool flows and is therefore a
strong candidate for use in SoC sub-system design.
[Figure 7.3: Connecting 2 Elastic Components]
7.6 Microarchitectural Exploration: Introducing Synchronous
Elastic Buffering with Event-B
If a component design is seen to be getting too complex and potential timing issues arise,
the microarchitectural designer will need to consider breaking the component into
sub-components. By introducing synchronous elastic buffers to replace pipeline registers, the
designer is able to address the timing issues in a way that de-couples the functionality
from the timing.
We present a method, illustrated by examples, that enables the designer to prove that
an alternative, buffered sub-system design is functionally equivalent to the original
component design which used shared register communication.
In the first example, we show how the output of the Execute stage in a microprocessor
pipeline may be buffered instead of written directly to pipeline registers. Two
refinements of the abstract pipeline are shown: one which introduces registers in the traditional
way and another which introduces synchronous elastic buffers instead of the registers.
The Rodin tool is then used to prove that, provided that appropriate forwarding
mechanisms are introduced, both microarchitectural alternatives are correct refinements of
the abstract pipeline. The use of forwarding with synchronous buffering is a valuable
technique because it allows the designer to manage, for instance, arithmetic operations
that take a variable number of cycles to complete but with the performance benefits
that access to the forwarded value can provide.
In the second example, synchronous elastic buffering is used to implement a distributed
stalling mechanism in a microprocessor pipeline which provides an efficient alternative
to forwarding that allows the pipeline implementation to sustain higher clock rates.
7.6.1 The Synchronous Elastic Buffer: An Abstract Specification in
Event-B
[Cortadella et al., 2006c] presents an abstract model of a synchronous elastic buffer,
based on an unbounded FIFO B, indexed by two variables wr and rd as shown in
Figure 7.4.
[Figure 7.4: Abstract Synchronous Elastic Buffer. An unbounded FIFO B with read index rd and write index wr.]
B can be modelled in Event-B as
  inv1 : B ∈ ℕ → Op
  inv2 : rd ∈ ℕ
  inv3 : wr ∈ ℕ
The value to be buffered, in this case a microprocessor opcode, is indexed by two natural
numbers, the read index rd and the write index wr. Values enter the buffer at the
location pointed to by the write index and are read from the buffer at the location
pointed to by the read index. Since the synchronous elastic buffer is of size two,
  inv4 : wr = rd + 1
and the variables are initialised thus.
  act1 : B := ℕ × {NOP}
  act2 : rd := 0
  act3 : wr := 1
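The abstract buffer can be animated with a short Python sketch. It is an illustration under stated assumptions rather than a translation of the Event-B machine: the class name AbstractElasticBuffer and the single step method, which performs one synchronous read and write and then advances both indices, are hypothetical. The total function B is approximated by a dictionary that returns NOP for every index not yet written.

```python
from collections import defaultdict


class AbstractElasticBuffer:
    """Unbounded FIFO B indexed by rd and wr, with wr = rd + 1 maintained."""

    def __init__(self, nop="NOP"):
        self.B = defaultdict(lambda: nop)  # B := N x {NOP}
        self.rd = 0
        self.wr = 1

    def step(self, value):
        """One synchronous step: read at rd, write at wr, advance both."""
        out = self.B[self.rd]
        self.B[self.wr] = value
        self.rd, self.wr = self.wr, self.wr + 1
        assert self.wr == self.rd + 1      # inv4 is preserved
        return out
```

Each value written emerges on the following step, so the first step returns the initial NOP and later steps return the values in the order in which they were written.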
[Cortadella et al., 2006c] uses the DLX pipeline to illustrate how elasticity can be
introduced into microarchitectural design. Here, the ArithRR instruction
Event ArithRR ≙
  any
    pop
  where
    grd1 : pop ∈ ArithRROp
  then
    act1 : Regs(Rr(pop)) := Regs(Ra(pop)) + Regs(Rb(pop))
  end
will be used to explore two alternative pipeline implementations with synchronous
buffering to manage RAW hazards. The microarchitecture of the abstract machine is shown
in Figure 7.5.
[Figure 7.5: Abstract Machine: Microarchitecture]
[Figure 7.6: Refined Machine with forwarding: Microarchitecture. Labels shown: FDEX, EXALUoutput, EXop, ppop, WB, Regs, Forwarding.]
7.6.2 Synchronous Buffering with Forwarding
Recall the first refinement of the ArithRR model with the microarchitecture shown in
Figure 7.6.
In this alternative refinement of the abstract model, the registers EXALUoutput and
EXop are replaced with the synchronous elastic buffers vbuf and obuf respectively as
shown in Figure 7.7.
[Figure 7.7: Refined Machine with buffers and forwarding: Microarchitecture. Labels shown: FDEX, ppop, WB, Regs, Forwarding, vbuf, obuf.]
Both buffers have read indices (vrd and ord respectively) and write indices (vwr and owr
respectively) which are all incremented at each event execution to ensure synchronous
operation. The combined event, for instance, that deals with a RAW hazard on the
source register Ra
Event EXWBaRAW ≙
refines ArithRR
  any
    ppop
  where
    grd1 : obuf(ord) ∈ ArithRROp
    grd2 : ppop ∈ ArithRROp
    grd3 : Rr(obuf(ord)) = Ra(ppop)
    grd4 : Rr(obuf(ord)) ≠ Rb(ppop)
  with
    pop : pop = obuf(ord)
  then
    act1 : Regs(Rr(obuf(ord))) := vbuf(vrd)
    act2 : ord := owr
    act3 : vrd := vwr
    act4 : vbuf(vwr) := vbuf(vrd) + Regs(Rb(ppop))
    act5 : obuf(owr) := ppop
    act6 : vwr := vwr + 1
    act7 : owr := owr + 1
  end
can be compared with the same event in the register-based refinement,
Event EXWBaRAW ≙
refines ArithRR
  any
    ppop
  where
    grd1 : EXop ∈ ArithRROp
    grd2 : ppop ∈ ArithRROp
    grd3 : Rr(EXop) = Ra(ppop)
    grd4 : Rr(EXop) ≠ Rb(ppop)
  with
    pop : pop = EXop
  then
    act1 : Regs(Rr(EXop)) := EXALUoutput
    act2 : EXALUoutput := EXALUoutput + Regs(Rb(ppop))
    act3 : EXop := ppop
  end
The new gluing invariant is
  inv9 : vbuf(vrd) = Regs(Ra(obuf(ord))) + Regs(Rb(obuf(ord)))
compared with the original
  inv3 : EXALUoutput = Regs(Ra(EXop)) + Regs(Rb(EXop))
and all proof obligations are discharged automatically. In this case, 88 proof obligations
are discharged compared with the 33 in the original, register-based solution.
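The role of forwarding in the register-based refinement can be illustrated by comparing a sequential execution of ArithRR instructions with a two-stage execution that delays the register write by one step. The Python sketch below is a test-style illustration, not a substitute for the refinement proof: the function names, the encoding of an instruction as a tuple (rr, ra, rb) and the use of None for an empty Execute stage are assumptions. It forwards on either source register, whereas the event shown above covers the hazard on Ra only.

```python
def run_abstract(regs, program):
    """Abstract machine: each ArithRR completes before the next begins."""
    regs = dict(regs)
    for rr, ra, rb in program:
        regs[rr] = regs[ra] + regs[rb]
    return regs


def operand(regs, ex_op, ex_alu, r):
    """Read source register r, forwarding the pending ALU result on a hazard."""
    if ex_op is not None and ex_op[0] == r:
        return ex_alu
    return regs[r]


def run_pipelined(regs, program):
    """Register-based refinement: EXop and EXALUoutput hold the instruction
    in flight; its result is written back one step later.
    """
    regs = dict(regs)
    ex_op, ex_alu = None, 0
    # A trailing None drains the last instruction from the pipeline.
    for ppop in list(program) + [None]:
        if ppop is not None:
            new_alu = (operand(regs, ex_op, ex_alu, ppop[1])
                       + operand(regs, ex_op, ex_alu, ppop[2]))
        else:
            new_alu = 0
        if ex_op is not None:
            regs[ex_op[0]] = ex_alu        # write back the earlier result
        ex_op, ex_alu = ppop, new_alu      # all updates take effect together
    return regs
```

For a program in which each instruction reads the target of its predecessor, both functions produce the same final register file, which is the behaviour that the gluing invariant captures in the Event-B development.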
7.6.3 Synchronous Elastic Buffering with Stalling
Introducing synchronous elastic buffering allows the designer to take advantage of the
elastic buffering protocol and implement stalling in a distributed manner, obviating the
need for a centralised stall counter. When a RAW hazard is detected, for instance,
instead of using forwarding, the Valid control signals are used effectively to stall the
pipeline for a single cycle. A bubble of invalid data is flushed out through the pipeline
and then normal operation can resume.
In order to implement this stalling mechanism it is not only necessary to introduce
buffering between the Execute and Writeback stages but also buffering in a backwards
direction within the feedback loop which writes values back to the register file; if values
travelling in a forward direction are stalled because of a hazard, then the delay caused
must be compensated for when the values are written back. The stalling protocol,
described below, ensures that writing back to the register file is correctly synchronised
in the presence of RAW hazards.
The tradeoff is the cost of an extra cycle when a hazard is encountered against the extra
forwarding logic complexity, but with the added advantage that the pipeline will be able
to sustain higher clock frequencies because of the shorter signal paths introduced by
buffering the write back path. The micro-architecture is shown in Figure 7.8.
[Figure 7.8: Refined Machine with synchronous elastic buffers: Microarchitecture. Labels shown: F, EX, WB, Regs, SEB, ppop; buffers vbuf1, obuf1, vbuf2, obuf2 with indices vrd1, vwr1, ord1, owr1, vrd2, vwr2, ord2, owr2; control signals valid1, valid2.]
In this refinement of the abstract ISA specification, events are introduced which refine
the abstract event ArithRR. The actions that update the synchronous buffer indices,
  act6 : ord1 := owr1
  act7 : vrd1 := vwr1
  act8 : vwr1 := vwr1 + 1
  act9 : owr1 := owr1 + 1
  act10 : ord2 := owr2
  act11 : vrd2 := vwr2
  act12 : vwr2 := vwr2 + 1
  act13 : owr2 := owr2 + 1
are common to all the events and are omitted for clarity.
Under normal operation, in the absence of hazards, Valid1 and Valid2 are TRUE and
the event EXWBnoRAW is enabled. The guards
  grd5 : Rr(obuf1(ord1)) ≠ Ra(pppop)
  grd6 : Rr(obuf1(ord1)) ≠ Rb(pppop)
check that the target register Rr of the instruction buffered in the Writeback stage is not
about to overwrite either of the source registers, Ra and Rb, of the incoming instruction
pppop and the guards
  grd7 : Rr(obuf2(ord2)) ≠ Ra(pppop)
  grd8 : Rr(obuf2(ord2)) ≠ Rb(pppop)
check that the source registers, Ra and Rb, of the incoming instruction pppop do not
need the result stored in the target register Rr of the instruction buffered in the Execute
stage.
Event EXWBnoRAW ≙
refines ArithRR
  any
    pppop
  where
    grd1 : pppop ∈ ArithRROp
    grd2 : obuf1(ord1) ∈ ArithRROp
    grd3 : obuf2(ord2) ∈ ArithRROp
    grd4 : Valid1 = TRUE
    grd5 : Rr(obuf1(ord1)) ≠ Ra(pppop)
    grd6 : Rr(obuf1(ord1)) ≠ Rb(pppop)
    grd7 : Rr(obuf2(ord2)) ≠ Ra(pppop)
    grd8 : Rr(obuf2(ord2)) ≠ Rb(pppop)
    grd9 : Valid2 = TRUE
  with
    pop : pop = obuf1(ord1)
  then
    act1 : Regs(Rr(obuf1(ord1))) := vbuf1(vrd1)
    act2 : obuf1(owr1) := obuf2(ord2)
    act3 : vbuf1(vwr1) := vbuf2(vrd2)
    act4 : vbuf2(vwr2) := Regs(Ra(pppop)) + Regs(Rb(pppop))
    act5 : obuf2(owr2) := pppop
    ... // Update buffer indices
  end
The control signals Valid1 and Valid2 remain TRUE; the values of the source registers
Ra and Rb are added together and written to the buffer vbuf2; the result from the
previous cycle, stored in buffer vbuf2, is written to the buffer vbuf1; the result from the
cycle before that, stored in buffer vbuf1, is written back to the Register File; and all the
buffer read and write indexes are incremented.
To illustrate the stalling mechanism, consider the case where a RAW hazard is
encountered on register Ra of the new instruction pppop because the pipeline is about to write a
value back to that location, as detected by the guard
  grd5 : Rr(obuf1(ord1)) = Ra(pppop)
of the event EXWBRAWa.
Event EXWBRAWa ≙
refines ArithRR
  any
    pppop
  where
    grd1 : pppop ∈ ArithRROp
    grd2 : obuf1(ord1) ∈ ArithRROp
    grd3 : obuf2(ord2) ∈ ArithRROp
    grd4 : Valid1 = TRUE
    grd5 : Rr(obuf1(ord1)) = Ra(pppop)
    grd6 : Rr(obuf1(ord1)) ≠ Rb(pppop)
    grd7 : Rr(obuf2(ord2)) ≠ Ra(pppop)
    grd8 : Rr(obuf2(ord2)) ≠ Rb(pppop)
    grd9 : Valid2 = TRUE
  with
    pop : pop = obuf1(ord1)
  then
    act1 : Regs(Rr(obuf1(ord1))) := vbuf1(vrd1)
    act2 : obuf2(owr2) := pppop
    act3 : obuf1(owr1) := obuf2(ord2)
    act4 : vbuf1(vwr1) := vbuf2(vrd2)
    ... // Update buffer indices
    act14 : Valid2 := FALSE
  end
The control signal Valid2 is set to FALSE, invalidating the value in vbuf2, which retains
its previous value. The results from the earlier instruction are handled in the normal
way and Valid1 remains TRUE.
The event EXstallWB is now enabled, which writes the new result to the bu er vbuf2
and sets Valid2 to TRUE, transfers the invalid result from vbuf2 to vbuf1 and sets Valid1
to FALSE, and writes the value in vbuf1 back to the Register File.
Event EXstallWB ≙
reﬁnes ArithRR
any
pppop
where
grd1 : pppop ∈ ArithRROp
grd2 : obuf1(ord1) ∈ ArithRROp
grd3 : obuf2(ord2) ∈ ArithRROp
grd4 : Valid1 = TRUE
grd5 : Rr(obuf1(ord1)) ≠ Ra(pppop)
grd6 : Rr(obuf1(ord1)) ≠ Rb(pppop)
grd7 : Rr(obuf2(ord2)) ≠ Ra(pppop)
grd8 : Rr(obuf2(ord2)) ≠ Rb(pppop)
grd9 : Valid2 = FALSE
with
pop : pop = obuf1(ord1)
then
act1 : Regs(Rr(obuf1(ord1))) := vbuf1(vrd1)
act2 : obuf1(owr1) := obuf2(ord2)
act3 : vbuf1(vwr1) := vbuf2(vrd2)
act4 : vbuf2(vwr2) := Regs(Ra(pppop)) + Regs(Rb(pppop))
act5 : obuf2(owr2) := pppop
... //Update buffer indices
act14 : Valid1 := FALSE
act15 : Valid2 := TRUE
end
Finally, the event EXWBstall, which reﬁnes skip, is enabled. Because Valid1 is
FALSE, it does not perform the Register File write back. The bubble of invalid data has now
been ﬂushed from the pipeline, Valid1 is set to TRUE and normal pipeline operation
resumes.
Event EXWBstall ≙
any
pppop
where
grd1 : pppop ∈ ArithRROp
grd2 : obuf2(ord2) ∈ ArithRROp
grd3 : obuf2(ord2) = obuf1(ord1)
grd4 : Valid1 = FALSE
grd5 : Valid2 = TRUE
grd6 : Rr(obuf1(ord1)) ≠ Ra(pppop)
grd7 : Rr(obuf1(ord1)) ≠ Rb(pppop)
grd8 : Rr(obuf2(ord2)) ≠ Ra(pppop)
grd9 : Rr(obuf2(ord2)) ≠ Rb(pppop)
then
act1 : vbuf2(vwr2) := Regs(Ra(pppop)) + Regs(Rb(pppop))
act2 : obuf2(owr2) := pppop
act5 : vbuf1(vwr1) := vbuf2(vrd2)
act6 : obuf1(owr1) := obuf2(ord2)
act9 : Valid1 := TRUE
... //Update buffer indices
end
The stalling mechanism is shown in Figure 7.9. Concrete events represent the transitions
of the model’s state. The numbered arcs are annotated with the values of Valid1 and
Valid2 and point to the event or events that are enabled in the next pipeline state.
Figure 7.9: Synchronous Elastic Buffers: Stalling mechanism
Two gluing invariants are required to prove that this is a correct reﬁnement.
inv14 : Valid1 = TRUE ⇒ vbuf1(vrd1) = Regs(Ra(obuf1(ord1))) + Regs(Rb(obuf1(ord1)))
inv15 : Valid2 = TRUE ⇒ vbuf2(vrd2) = Regs(Ra(obuf2(ord2))) + Regs(Rb(obuf2(ord2)))
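Reading each gluing invariant as an implication guarded by its Valid flag, its effect on a concrete state can be checked by hand. The Python sketch below is illustrative only and assumes a simplified state: Op, gluing_invariant and the sample register values are hypothetical names and data, not taken from the model. It shows that a valid buffer entry must equal the sum of the two source registers, while an invalid entry (the bubble) satisfies the invariant whatever value it holds.

```python
from typing import NamedTuple


class Op(NamedTuple):
    """Illustrative instruction record: register indices only."""
    rr: int  # destination register, Rr
    ra: int  # first source register, Ra
    rb: int  # second source register, Rb


def gluing_invariant(valid: bool, value: int, op: Op, regs: list) -> bool:
    """valid = TRUE implies value = Regs(Ra(op)) + Regs(Rb(op))."""
    return (not valid) or value == regs[op.ra] + regs[op.rb]


# Hypothetical register file and buffered instruction.
regs = [0, 5, 7, 0]
op = Op(rr=3, ra=1, rb=2)

print(gluing_invariant(True, 12, op, regs))    # valid entry holding the correct sum
print(gluing_invariant(False, 999, op, regs))  # bubble: any stored value is accepted
print(gluing_invariant(True, 999, op, regs))   # valid entry holding a wrong value
```

The same predicate serves for both buffers: inv14 applies it to Valid1, vbuf1(vrd1) and obuf1(ord1), and inv15 to Valid2, vbuf2(vrd2) and obuf2(ord2).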
From Figure 7.9 it can be seen that all possible combinations of Valid1 and Valid2 are
feasible except one: Valid1 = FALSE and Valid2 = FALSE.
The invariant
inv20 : ¬(Valid1 = FALSE ∧ Valid2 = FALSE)
is therefore introduced and proved, showing that this case cannot occur and that the
pipeline cannot deadlock.
In this case 195 proof obligations are discharged automatically.
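The reachability argument behind inv20 can be reproduced on a small abstraction of the events. The Python sketch below is a hedged illustration, not a substitute for the Rodin proof: it keeps only the Valid1 and Valid2 guards and actions of the events shown in this section, abstracts the register-index guards into nondeterministic choice, uses the hypothetical name NoHazard for the hazard-free event, and assumes that the pipeline starts with both flags TRUE. It explores the resulting state space and checks that inv20 holds and that some event is enabled in every reachable state.

```python
from collections import deque

# Each entry: (event name, required (Valid1, Valid2), resulting (Valid1, Valid2)).
# Only the Valid1/Valid2 guards and actions are kept; the register-index
# guards are abstracted away, so a hazard becomes a nondeterministic choice.
EVENTS = [
    ("NoHazard",  (True, True),  (True, True)),   # hypothetical name, hazard-free case
    ("EXWBRAWa",  (True, True),  (True, False)),  # hazard detected: Valid2 := FALSE
    ("EXstallWB", (True, False), (False, True)),  # Valid1 := FALSE, Valid2 := TRUE
    ("EXWBstall", (False, True), (True, True)),   # bubble flushed: Valid1 := TRUE
]


def enabled(state):
    """Return (name, next_state) for every event whose guard holds in state."""
    return [(name, post) for name, pre, post in EVENTS if pre == state]


def reachable(initial=(True, True)):
    """Breadth-first exploration of the (Valid1, Valid2) state space."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        for _, nxt in enabled(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def inv20(state):
    """not (Valid1 = FALSE and Valid2 = FALSE)."""
    valid1, valid2 = state
    return not (valid1 is False and valid2 is False)


states = reachable()
print(sorted(states))
print(all(inv20(s) for s in states))    # inv20 holds in every reachable state
print(all(enabled(s) for s in states))  # an event is enabled in every reachable state
```

Because the data guards are abstracted away, this exploration only confirms the shape of the Valid1/Valid2 state machine; the proof obligations discharged by the tool cover the full model.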
7.6.4 Shared Event Pipeline Decomposition
When pipeline communication is via shared registers, a two-way, shared event decomposition
is used to decompose the combined events into two processes. Where synchronous
elastic buffers are introduced to de-couple the pipeline stages, a three-way, shared event
decomposition [Butler, 2009] as described in Section 2.9 on Page 18 is used. The
microarchitecture of the partitioning is shown in Figure 7.10.
Figure 7.10: Synchronous elastic buffer decomposition: Microarchitecture
The Execute stage of Machine 1 and WriteBack stage of Machine 3 do not communicate
directly with each other; all communication is via a shared event, SEBtransfer, in a
separate Event-B machine (Machine 2) which represents the synchronous elastic buffers
and which increments the buffer indices.
Event SEBtransfer ≙
then
act1 : ord1 := owr1
act2 : vrd1 := vwr1
act3 : vwr1 := vwr1 + 1
act4 : owr1 := owr1 + 1
act5 : ord2 := owr2
act6 : vrd2 := vwr2
act7 : vwr2 := vwr2 + 1
act8 : owr2 := owr2 + 1
end
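The actions of an Event-B event are applied simultaneously, so every right-hand side of SEBtransfer is evaluated in the state before the event: each read index takes the old value of the corresponding write index, which is then incremented. The Python sketch below illustrates this reading under stated assumptions; the Indices record and the function seb_transfer are hypothetical names, and the indices are modelled as unbounded integers with no wrap-around.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Indices:
    """Read and write indices of the two order buffers and two value buffers."""
    ord1: int
    vrd1: int
    owr1: int
    vwr1: int
    ord2: int
    vrd2: int
    owr2: int
    vwr2: int


def seb_transfer(s: Indices) -> Indices:
    """Apply act1..act8 at once: all right-hand sides read the pre-state s."""
    return Indices(
        ord1=s.owr1,      # act1
        vrd1=s.vwr1,      # act2
        vwr1=s.vwr1 + 1,  # act3
        owr1=s.owr1 + 1,  # act4
        ord2=s.owr2,      # act5
        vrd2=s.vwr2,      # act6
        vwr2=s.vwr2 + 1,  # act7
        owr2=s.owr2 + 1,  # act8
    )


# Hypothetical starting indices: each write index is one ahead of its read index.
s0 = Indices(ord1=0, vrd1=0, owr1=1, vwr1=1, ord2=0, vrd2=0, owr2=1, vwr2=1)
s1 = seb_transfer(s0)
print(s1)
```

Building a new immutable record, rather than updating fields one at a time, avoids the ordering dependence that sequential assignment would introduce between an index and its increment.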
7.6.5 A Review of the use of Synchronous Elastic Design with Event-B
Incorporating latency-insensitive buffers into the pipeline increases modelling complexity:
the number of proof obligations rises from 33 in the original shared register model
to 195. These proof obligations are nevertheless still discharged automatically. Stalling is
inherently a more complex solution to the problem of RAW hazards than forwarding, but
on the other hand the distributed stalling mechanism that latency-insensitive design
enables is inherently less complex than the centralised mechanism used to implement
the branch instruction.
It is a great beneﬁt of our method that the gluing invariants discovered in the shared
variable implementation can be re-used with minor and well-deﬁned modiﬁcations which
would be amenable to automatic tool support. In addition, the introduction of the extra
actions needed to update the buffer indices could also be automated.
The microarchitectural designer has the choice as to whether to use forwarding or stalling
techniques and shared register or buffered communication. The trade-offs between
veriﬁcation complexity, power consumption and performance can be taken into account at
this early stage. Forwarding, though easier to implement, may not meet the performance
requirements.
Components that become too large or complex can be partitioned to form sub-systems
within which communication between components is both latency-insensitive and efficient.
Track length constraints preclude the design of long pipelines with only shared
register inter-stage communication. Introducing latency-insensitive buffers breaks up the
pipeline into more manageable sections without affecting its functionality. The exploration
is done at the speciﬁcation level with the support of Event-B reﬁnement and
automatic proof.
The designer can also use these latency-insensitive design techniques to model instruc-
tions which in some SoC microprocessor architectures can have variable latency, such as
memory accesses and arithmetic instructions.
Chapter 8
Conclusions
The complexity of modern System-on-Chip (SoC) hardware stretches the existing de-
sign and veriﬁcation ﬂows, languages and tools to the limit of their capabilities [Asanovic
et al., 2006], [Sylvester and Keutzer, 2001]. Veriﬁcation takes a larger and larger proportion
of the overall effort and it is often very late in the design process that timing issues,
resulting from the very small feature sizes of modern silicon processes, are encountered
and can only be corrected by substantial re-design [Carloni and Sangiovanni-Vincentelli,
2002].
Although semiconductor companies express the desire to raise the design process to the
Electronic System Level (ESL) [Asanovic et al., 2006] in order to improve the predictabil-
ity of the veriﬁcation closure process, the approach that the EDA industry has taken to
address this need has been fragmentary, no clear standards have emerged, and the tried
and proven RTL design methodology still forms the signiﬁcant bedrock of any modern
SoC design ﬂow. There is a clear need to enhance existing design ﬂows to be better able
to manage this increased complexity, without losing the well-established beneﬁts that
have driven successful synchronous design.
In a current RTL development ﬂow, the implementation will be handed to a veriﬁcation
engineer who must verify the design against a test plan. The difficulty comes in
developing a credible test plan that reﬂects the complexity of the design. An ISA speciﬁcation,
for instance, on its own cannot be used as the basis for the test plan; the test
engineer needs to understand the detailed pipeline implementation. In practice, the test
engineer must derive the behaviour of the combined state machine from the individual
state machines that represent each pipeline stage. Just as ensuring full code coverage of
each individual process is insufficient, ensuring full arc coverage of each of the interacting
state machines is also insufficient. In general, generating a combined state machine
is impractical as it is extremely difficult to decide which of all the possible combined
transitions are actually valid.
Property checking, rather than using tests, takes a set of temporal properties [Cohen
et al., 2004], [Sutherland et al., 2004], derived from the speciﬁcation and checks these
properties against the synthesised gate-level description. Property checking depends on
having a comprehensive set of properties to represent the desired behaviour. It is very
difficult to establish whether sufficient properties have been written or not, and property
coverage [Chockler et al., 2001], [Hoskote et al., 1999] is a topic of ongoing research.
Without a measurable outcome, property checking will not become an indispensable
component in the SoC ﬂow.
In our Event-B, proof-based method, design complexity is exposed explicitly in the design
process, the combined state machine is visible from an early stage of the design, the effect
on complexity of design decisions can be seen immediately and design alternatives can
be explored and measured. Proof-based reﬁnement, with invariant preservation and
convergence, can be seen to obviate the need for unit tests.
Although combined state machine Arc Coverage is a desirable, but often unattainable,
goal in design veriﬁcation, it is in itself not sufficient to ensure full functional coverage.
All valid paths through the combined state machine must also be covered and identifying
these paths is extremely difficult. It is a major beneﬁt of our proof-based method that
there is no need to be concerned about the paths of the combined state machine; invariant
preservation and convergence are proved for all possible interleavings of the composed
events.
In our approach, an enhancement of the System-on-Chip hardware design and veriﬁca-
tion ﬂow, incorporating synthesis, has been explored which exploits the synergy between
the Event-B method and the rule-driven approach of guarded atomic action, high-level
synthesis. Since both these methods are based fundamentally on guarded atomic actions,
it has been possible to deﬁne an augmented ﬂow whereby a formal and systematic
approach to high-level speciﬁcation with reﬁnement and supporting architectural ex-
ploration can be provided by Event-B, resulting in a concrete speciﬁcation that can be
mapped directly to a TRS description for high-level synthesis to RTL and therefore close
the gap between speciﬁcation and implementation. It has also been shown that, in the
case of microprocessor pipelines, it is possible to derive a concrete speciﬁcation that can
be mapped directly to RTL.
A general approach to specifying and reﬁning SoC components, in particular those with
pipelined architectures, has been presented which focuses on the issues that arise when
feedback is present in the pipeline. The method begins with an abstract Event-B model
representing the speciﬁcation of the required behaviour. The abstract speciﬁcation rep-
resents a high-level view of the hardware which executes in a single cycle. A reﬁnement
of the abstract model is then introduced which represents the behaviour as a two-stage
pipeline. The two stages communicate via shared registers and gluing invariants are
introduced to prove that the two-stage pipeline implements the abstract speciﬁcation.
This model executes in two cycles, with overlapping execution of the pipeline stages.
This two-stage model is then further reﬁned until all pipeline stages within a pipeline
feedback loop have a concrete implementation. Once the invariants have been proved,
which shows that the overlapping execution within the feedback loop has been imple-
mented correctly, shared event decomposition is used to decompose the pipeline model
into a set of models, one for every stage of the pipeline. Models representing the stages
outside the feedback loop can then be introduced. Each pipeline stage model represents
a hardware process of the ﬁnal pipeline implementation, and the invariants represent
the properties of that pipeline.
This general approach is then applied, using the register/register arithmetic and branch
instructions as examples, to show that the ISA speciﬁcation of the instruction can be
reﬁned systematically to a pipelined model that can be proved to implement its ISA
speciﬁcation. The ﬁnal, concrete models of each instruction comprise a set of Event-B
machines, one for each pipeline stage. The machines representing each particular stage
are then composed formally to provide a complete, concrete model of the pipeline. The
method ensures, through the introduction of gluing invariants at each stage, that microarchitectural
considerations are addressed early in the design ﬂow. Different microarchitectures
may be explored and veriﬁed at the speciﬁcation level. Stepwise reﬁnement
allows the developer to manage the multiplicity of cases caused by pipeline data hazards.
A method for incorporating a latency-insensitive design protocol for inter-component
communication has also been developed, which retains the beneﬁts of asynchronous
communication but within a synchronous tool ﬂow. This does, however, require that
components are stallable and do not cause sub-system deadlock. The DLX register/register
arithmetic instruction has been used to show that synchronous elastic buffers can
be used to implement a distributed pipeline stalling mechanism for managing data hazards,
as an efficient alternative to forwarding, and this mechanism has been proved to be free
from deadlock.
Three goals were set for our work which, if achieved, would represent signiﬁcant, original
contributions to the area of microelectronic design.
First, that alternative pipeline architectures within components could be developed from
a high-level speciﬁcation, compared and veriﬁed systematically.
Second, that a high-level model for latency-insensitive communication could be devel-
oped that could be used at the speciﬁcation level to verify formally that communicating
components obey the latency-insensitive protocol and will not cause deadlock.
Third, that the speciﬁcation could be reﬁned to a level of abstraction that matches that
required for input to high-level or RTL synthesis.
All three goals have been achieved. In our correct-by-construction approach, Event-B
is used to represent the abstract hardware speciﬁcation with the TRS-style descriptions
familiar to hardware designers who use Bluespec or CAL. The abstract speciﬁcation is
then reﬁned systematically to reﬂect the architectural decisions of the designer. It has
been shown how the designer is able to deal with one architectural consideration at a
time, reﬁning a particular aspect of the design to a concrete representation while leaving
the rest of the representation abstract. At each reﬁnement step the Rodin tool helps
the designer to discover the gluing invariants that must be proved to demonstrate that
the concrete representation is a correct reﬁnement of the abstract. These invariants are
fundamental properties of the design that can be translated directly into PSL or SVA
descriptions and used downstream in the ﬂow for RTL formal and simulation-based
veriﬁcation. Where performance or complexity considerations require that a component
be split into a sub-system of communicating components, it has been shown how an
alternative reﬁnement, incorporating synchronous elastic buffers, can be derived that
also meets the abstract speciﬁcation. The reﬁnement process continues until a concrete
representation of the speciﬁcation has been derived that is suitable for either TRS or
RTL synthesis.
The veriﬁcation effort has been raised to the speciﬁcation level because the concrete
representation has been proved formally to implement the abstract speciﬁcation. Veriﬁ-
cation is made manageable because it is performed incrementally within the design ﬂow.
The hardware and its associated properties are described in an event-based language
which is a natural vehicle for synchronous hardware description and all the associated
proof obligations are generated and proved by the Rodin tool environment.
The fourth, and overriding goal, was that the method developed could ﬁt seamlessly
within a modern SoC veriﬁcation ﬂow. It has been a drawback of many formal hard-
ware veriﬁcation developments that they sit as an adjunct to the main ﬂow, can be useful
in ﬁnding bugs, but require the design to be translated into specialist representations for
analysis by specialist scientists. A modern veriﬁcation ﬂow requires a measurable outcome
that provides the conﬁdence required for the design to be signed off. A mechanism
that may or may not ﬁnd additional bugs does not necessarily increase that conﬁdence.
Code and functional coverage measurements provide the conﬁdence required for sign-off
in current, simulation-based ﬂows. We have developed, systematically, models that have
been proved to be a correct implementation of the abstract speciﬁcation for all possible
paths through the combined state machine that represents the concurrent behaviour of a
component. The component can therefore be said to have achieved full path coverage
and this measurement can be correlated directly with path coverage in a simulation-based
ﬂow. It is therefore possible with our method to combine the coverage results
from formal and simulation-based veriﬁcation into a single metric for design sign-off.
The work presented has focused on microprocessor pipelines, but these are speciali-
sations of what constitutes the general mechanism for representing complex hardware
designs, communicating state machines, which communicate via shared registers or by
message-passing with buffers. A pipeline stage is a state machine that communicates
with neighbouring stages using the pipeline stage registers. Our method can therefore
be used to develop implementations of the components that form the building blocks
of a modern hardware design. For each type of component, the challenge will be to identify
the abstract speciﬁcation that best embodies the essential characteristics of that
component and to develop the reﬁnement strategy that allows a concrete representation
to be derived formally with automatic proof. A library of parameterised templates for
each component type can then be assembled for subsequent use in the mainstream de-
sign ﬂow. Such templates would best be developed in a small, leading-edge group with
highly-experienced designers or in the central research and development department of a
larger semiconductor company, using existing designs. Having incurred the initial cost of
developing the templates, the return on investment would be realised in the mainstream
ﬂow with greatly improved efficiency in reaching coverage targets when compared to
traditional, simulation-based methods and the increased conﬁdence that formally-based
coverage results would deliver.
As far as SoC processor pipelines are concerned, these will inevitably become more
complex while attempting to retain the low power consumption beneﬁts of the simple,
ﬁve-stage pipelines that dominate current SoC designs. Super-scalar implementations
with multi-issue instruction capabilities will require the multiple memory access and out-
of-order completion capabilities demonstrated in our IP lookup circular bu er develop-
ment. Instructions with variable latency can be managed elegantly using synchronous,
latency-insensitive design.
The long-term beneﬁt of incorporating our method into the veriﬁcation ﬂow will be to
bridge the speciﬁcation gap that currently exists, to enable the veriﬁcation process to
begin at the speciﬁcation level and to allow both functional and performance consider-
ations to be addressed through architectural exploration at a much earlier stage in the
ﬂow than is currently possible.
References
J.R. Abrial. The Event-B Modelling Notation. 2007. URL http://deploy-eprints.
ecs.soton.ac.uk/11/3/notation-1.5.pdf.
JR Abrial. Rigorous Open Development Environment for Complex Systems: event B
language. 2005.
J.R. Abrial. Event model decomposition. 2009.
J.R. Abrial and S. Hallerstede. Reﬁnement, decomposition, and instantiation of discrete
models: Application to Event-B. Fundamenta Informaticae, XXI, 2006.
J.R. Abrial and L. Mussat. Introducing dynamic constraints in B. B, 98:83–128, 1998.
J.R. Abrial, M. Butler, S. Hallerstede, and L. Voisin. An open extensible tool envi-
ronment for Event-B. In International Conference on Formal Engineering Methods
(ICFEM), 2006.
P. Alexander and P. Baraona. Formal methods at the systems level. In Systems, Man,
and Cybernetics, 1997. ’Computational Cybernetics and Simulation’., 1997 IEEE In-
ternational Conference on, 1997. URL http://ieeexplore.ieee.org/iel3/4942/
13802/00638303.pdf?arnumber=638303.
ARM. Speciﬁcation Rev 2.0. ARM Limited, 1999.
ARM. AMBA AXI Protocol Speciﬁcation. 2003.
Arvind and J.C. Hoe. Micro-architecture Exploration and Synthesis via TRS’s. Technical
report, MIT, April 1999.
Arvind and X. Shen. Using Term Rewriting Systems to Design and Verify Processors.
IEEE Micro, 19(3):36–46, 1999.
Arvind, R. Nikhil, D. Rosenband, and N. Dave. High-level Synthesis: An Essential
Ingredient for Designing Complex ASICs. CSAIL, April, 2004.
K. Asanovic. Transactors for Parallel Hardware and Software Co-Design. International
High Level Design Validation and Test Workshop, IEEE, November, 2007.
K. Asanovic, R. Bodik, B.C. Catanzaro, J.J. Gebis, P. Husbands, K. Keutzer, D.A.
Patterson, W.L. Plishker, J. Shalf, S.W. Williams, et al. The Landscape of Parallel
Computing Research: A View from Berkeley. Electrical Engineering and Computer
Sciences, University of California at Berkeley, Technical Report No. UCB/EECS-
2006-183, December, 18(2006-183):19, 2006.
F. Baader and T. Nipkow. Term Rewriting and All That. Cambridge University Press,
1998.
P. Baraona and P. Alexander. VSPEC: A Language for Digital System Speciﬁcation.
Proceedings of the Al and Systems Engineering Workshop, pages 19–27, 1994.
Shuvra S. Bhattacharyya, Gordon Brebner, Jörn W. Janneck, Johan Eker, Carl von
Platen, Marco Mattavelli, and Mickaël Raulet. Opendf: a dataﬂow toolset for reconﬁgurable
hardware and multicore systems. SIGARCH Comput. Archit. News, 36(5):
29–35, 2008. ISSN 0163-5964. doi: http://doi.acm.org/10.1145/1556444.1556449.
N. Bombieri, F. Fummi, and G. Pravadelli. On the evaluation of transactor-based ver-
iﬁcation for reusing TLM assertions and testbenches at RTL. Proceedings of the
conference on Design, automation and test in Europe: Proceedings, pages 1007–1012,
2006a.
N. Bombieri, F. Fummi, and G. Pravadelli. A Methodology for Abstracting RTL Designs
into TL Descriptions. Proc. of ACM/IEEE MEMOCODE, pages 145–152, 2006b.
E. Borger and S. Mazzanti. A Practical Method for Rigorously Controllable Hardware
Design. ZUM’97, the Z Formal Speciﬁcation Notation: 10th International Conference
of Z Users, Reading, UK, April 3-4, 1997: Proceedings, 1997.
R. Brown. Calendar queues: a fast O(1) priority queue implementation for the simulation
event set problem. Communications of the ACM, 31(10):1220–1227, 1988.
J.R. Burch and D.L. Dill. Automatic veriﬁcation of Pipelined Microprocessor Control.
Proceedings of the 6th International Conference on Computer Aided Veriﬁcation, pages
68–80, 1994.
M. Butler. Decomposition structures for Event-B. Integrated Formal Methods iFM2009,
Springer, LNCS, 5423, 2009.
L.P. Carloni and A.L. Sangiovanni-Vincentelli. Coping with latency in SOC design.
Micro, IEEE, 22(5):24–35, 2002.
L.P. Carloni, K.L. McMillan, A. Saldanha, and A.L. Sangiovanni-Vincentelli. A method-
ology for correct-by-construction latency insensitive design. Proc. Intl. Conf. on
Computer-Aided Design, 1999.
L.P. Carloni, K.L. McMillan, and A.L. Sangiovanni-Vincentelli. Theory of latency-
insensitive design. Computer-Aided Design of Integrated Circuits and Systems, IEEE
Transactions on, 20(9):1059–1076, 2001.
H. Chockler, O. Kupferman, and M. Y. Vardi. Coverage metrics for temporal logic model
checking. Lecture Notes in Computer Science, 2031:528+, 2001. URL citeseer.ist.
psu.edu/chockler02coverage.html.
E.M. Clarke and E.A. Emerson. Design and Synthesis of Synchronization Skeletons
Using Branching-Time Temporal Logic. Lecture Notes In Computer Science, pages
52–71, 1981.
E.M. Clarke, A. Biere, R. Raimi, and Y. Zhu. Bounded model checking using sat-
isﬁability solving. Formal Methods in System Design, 19(1):7–34, 2001. URL
citeseer.ist.psu.edu/clarke01bounded.html.
B. Cline. Forte announces cynthesizer 3.3. Website, May 2007a. URL http://www.
forteds.com/news/pr052507.pdf.
B. Cline. Transaction-level modeling gains further momentum, 2007b. URL http:
//www.chipdesignmag.com/print.php?articleId=813?issueId=20.
B. Cohen, S. Venkataramanan, and A. Kumari. Using PSL/Sugar for formal and dy-
namic veriﬁcation: Guide to Property Speciﬁcation Language for Assertion-based Ver-
iﬁcation. VhdlCohen Publ., 2004.
J. Cortadella, M. Kishinevsky, and B. Grundmann. Synthesis of synchronous elastic
architectures. In DAC ’06: Proceedings of the 43rd annual conference on Design
automation, pages 657–662, New York, NY, USA, 2006a. ACM. ISBN 1-59593-381-6.
doi: http://doi.acm.org/10.1145/1146909.1147077.
J. Cortadella, M. Kishinevsky, and B. Grundmann. Speciﬁcation and design of syn-
chronous elastic circuits. In Proc. International Workshop on Timing Issues in the
Speciﬁcation and Synthesis of Digital Systems (TAU), pages 16–21, February 2006b.
J. Cortadella, M. Kishinevsky, and B. Grundmann. SELF: Speciﬁcation and design of
synchronous elastic circuits. In TAU’06: Proceedings of the ACM/IEEE International
Workshop on Timing Issues 2006. Citeseer, 2006c.
N.H. Dave. Designing a processor in Bluespec. Master's thesis, Massachusetts
Institute of Technology, 2005.
S. Devadas, A. Ghosh, and K. Keutzer. Logic synthesis. McGraw-Hill, Inc. New York,
NY, USA, 1994.
A. Donlin. Transaction Level Modeling: Flows and Use Models. CODES+ ISSS, 4:
75–80, 2004.
R. Drechsler and S. Horeth. Gatecomp: Equivalence checking of digital circuits in an
industrial environment. Int’l Workshop on Boolean Problems, pages 195–200, 2002.
S.A. Edwards. The Challenges of Hardware Synthesis from C-Like Languages. Design,
Automation, and Test in Europe: Proceedings of the conference on Design, Automa-
tion and Test in Europe-, 1:66–67, 2005.
N. Evans and M. Butler. A Proposal for Records in Event-B. FM, pages 21–27, 2006.
H. Foster, E. Marschner, and Y. Wolfsthal. IEEE 1850 PSL: The next generation, 2005.
D. Gajski, J. Zhu, and R. Domer. Essential Issues in Codesign. Technical report,
Technical report ICS-97-26, University of California, Irvine, 1997, 1997.
D. Gebhardt and K.S. Stevens. Elastic Flow in an Application Speciﬁc Network-on-Chip.
Electronic Notes in Theoretical Computer Science, 200(1):3–15, 2008.
D. Geer. Chip makers turn to multicore processors. IEEE Computer, 38(5):11–13, 2005.
F. Ghenassia. Transaction-Level Modeling with Systemc: Tlm Concepts and Applications
for Embedded Systems. Springer-Verlag New York, Inc. Secaucus, NJ, USA, 2006.
M. Gordon and T. Melham. Introduction to HOL: a theorem proving environment for
higher order logic. Cambridge University Press New York, NY, USA, 1993.
M. Gordon, J. Hurd, and K. Slind. Executing the formal semantics of the ac-
cellera property speciﬁcation language by mechanised theorem proving, 2003. URL
citeseer.ist.psu.edu/gordon03executing.html.
M.G. Hadjinicolaou, G. Musgrave, and R.B. Hughes. Graphical speciﬁcation of dig-
ital systems using interval temporal logic. In ASIC Conference and Exhibit, 1994.
Proceedings., Seventh Annual IEEE International, 1994. URL http://ieeexplore.
ieee.org/iel2/3197/9098/00404589.pdf?tp=&isnumber=&arnumber=404589.
S. Hallerstede. Justiﬁcations for the Event-B Modelling Notation. In B 2007: Formal
Speciﬁcation and Development in B, 2007.
F. Haque and J. Michelson. Art of Veriﬁcation with VERA. Veriﬁcation Central, 2001.
A. Hartstein and T. Puzak. The optimum pipeline depth considering both power and
performance. ACM Trans. Archit. Code Optim., 1(4):369–388, 2004. ISSN 1544-3566.
doi: http://doi.acm.org/10.1145/1044823.1044824.
J. Henkel. Closing the SoC design gap. Computer, 36(9):119–121, 2003.
J.L. Hennessy and D.A. Patterson. Computer Architecture: A Quantitative Approach.
Morgan Kaufmann, 2006.
J.C. Hoe. Synthesis of Operation-Centric Hardware Descriptions. Proceedings of the
2000 IEEE/ACM international conference on Computer-aided design, pages 511–519,
2000.
J.C. Hoe. Operation-centric hardware description and synthesis. Computer-Aided Design
of Integrated Circuits and Systems, IEEE Transactions on, 23(9):1277–1288, 2004.
J.C. Hoe and Arvind. Hardware Synthesis from Term Rewriting Systems. Proceedings
of the IFIP TC10/WG10. 5 Tenth International Conference on Very Large Scale
Integration: Systems on a Chip, pages 595–619, 1999.
Y. Hollander, M. Morley, and A. Noy. The e language: A fresh separation of concerns.
Proceedings of TOOLS, 38, 2001.
P. Horowitz and W. Hill. The Art of Electronics. Cambridge University Press, 1989.
Y. Hoskote, T. Kam, P.H. Ho, and X. Zhao. Coverage estimation for symbolic model
checking. Proc. 36th Design automation conference, pages 300–305, 1999.
D. A. Huffman. A Method for the Construction of Minimum-Redundancy Codes. Proceedings
of the IRE, 40(9):1098–1101, 1952.
M. Huth and M. Ryan. Logic in Computer Science: Modelling and Reasoning about
Systems. Cambridge University Press, 2004.
Y. Huygen. Eda consortium report. Web, April 2007. URL http://www.edac.org/
downloads/pressreleases/07-04-9 MSS Q4 2007 ReleaseFINAL.pdf.
R.B. Jones, D.L. Dill, and J.R. Burch. Efficient validity checking for processor verification.
In IEEE International Conference on Computer-Aided Design, San Jose,
California, USA, 1995. URL citeseer.ist.psu.edu/jones95efficient.html.
J.J. Joyce, G. Birtwistle, and M. Gordon. Proving a Computer Correct in Higher Order
Logic, Report No. 100. Computer Laboratory, Cambridge University, 1986.
J.R. Burch, E.M. Clarke, K.L. McMillan, D.L. Dill, and L.J. Hwang. Symbolic Model
Checking: 10^20 States and Beyond. In Proceedings of the Fifth Annual IEEE Symposium
on Logic in Computer Science, pages 1–33, Washington, D.C., 1990. IEEE
Computer Society Press. URL citeseer.ist.psu.edu/burch90symbolic.html.
M. Kaufmann and J. Moore. Industrial proofs with acl2. Technical report, University
of Texas, 2004.
K. Keutzer, A.R. Newton, J.M. Rabaey, and A. Sangiovanni-Vincentelli. System-level
design: orthogonalization of concerns and platform-based design. IEEE Transac-
tions on Computer-Aided Design of Integrated Circuits and Systems, 19(12):1523–
1543, 2000.
D. Kroening and W.J. Paul. Automated pipeline design. In Proceedings of the 38th
conference on Design automation, pages 810–815. ACM New York, NY, USA, 2001.
S. Krstic, J. Cortadella, M. Kishinevsky, and J. O’Leary. Synchronous Elastic Networks.
Formal Methods in Computer Aided Design, 2006. FMCAD’06, pages 19–30, 2006.
T. Kuhn, T. Oppold, M. Winterholer, W. Rosenstiel, M. Edwards, and Y. Kashai.
A framework for object oriented hardware speciﬁcation, veriﬁcation, and synthesis.
Proceedings of the Design Automation Conference (DAC’2001), pages 413–418, 2001.
R. Kumar, K.I. Farkas, N.P. Jouppi, P. Ranganathan, and DM Tullsen. Single-ISA
heterogeneous multi-core architectures: the potential for processor power reduction.
Microarchitecture, 2003. MICRO-36. Proceedings. 36th Annual IEEE/ACM Interna-
tional Symposium on, pages 81–92, 2003.
P. Lieverse, P. van der Wolf, and E. Deprettere. A trace transformation technique
for communication reﬁnement. In Proceedings 9th International Symposium on
Hardware/Software Codesign (CODES’2001), pages 134–139, Copenhagen, Denmark,
April 25–27 2001. URL citeseer.ist.psu.edu/lieverse01trace.html.
P. Manolios. Correctness of pipelined machines. Formal Methods in Computer-Aided
Design–FMCAD, 1954:161–178, 2000.
P. Manolios and S. Srinivasan. A computationally efficient method based on commitment
reﬁnement maps for verifying pipelined machines models. ACM-IEEE International
Conference on Formal Methods and Models for Codesign, pages 189–198, 2005a.
P. Manolios and S.K. Srinivasan. Veriﬁcation of executable pipelined machines with
bit-level interfaces. In Proceedings of the 2005 IEEE/ACM International conference
on Computer-aided design, page 862. IEEE Computer Society, 2005b.
B. Meyer, E.T. Hochschule, and S. Zurich. The grand challenge of trusted components.
Software Engineering, 2003. Proceedings. 25th International Conference on, pages
660–667, 2003.
W. Mueller, J. Ruf, D. Hofmann, J. Gerlach, T. Kropf, and W. Rosenstiel. The
Simulation Semantics of SystemC. Proc. of DATE 2001, 2001.
M.C. Ng, M. Vijayaraghavan, N. Dave, G. Raghavan, and J. Hicks. From WiFi to
WiMAX: Techniques for High-Level IP Reuse across Different OFDM Protocols. Proceedings
of the 5th IEEE/ACM International Conference on Formal Methods and
Models for Codesign, pages 71–80, 2007.
R. Nikhil. Bluespec System Verilog: efficient, correct RTL from high level specifications.
Formal Methods and Models for Co-Design, 2004. MEMOCODE’04. Proceedings. Second
ACM and IEEE International Conference on, pages 69–70, 2004.
R.S. Nikhil. Composable Guarded Atomic Actions: a Bridging Model for SoC Design.
Application of Concurrency to System Design, 2007. ACSD 2007. Seventh Interna-
tional Conference on, pages 23–28, 2007.
OSCI-TLMSubgroup. Systemc tlm 2 draft. Technical report, 2007. URL http://www.
systemc.org/web/sitedocs/TLM 2 0.html.
D.L. Perry. VHDL. McGraw-Hill New York, 1994.
J. Plosila and K. Sere. Action systems in pipelined processor design. In Proceedings
Third International Symposium on Advanced Research in Asynchronous Circuits and
Systems, pages 156–166. Citeseer, 1997.
G. Prabhu. Computer architecture tutorial, 1 2000. URL http://www.cs.iastate.
edu/ prabhu/Tutorial/title.html.
V. R. Pratt. Modelling concurrency with partial orders. International Journal of Parallel
Programming, 15(1):33–71, 1986. URL citeseer.ist.psu.edu/pratt86modelling.
html.
R. Kumar, C. Blumenroehr, D. Eisenbiegler, and D. Schmid. Formal synthesis in
circuit design-A classiﬁcation and survey. In M. Srivas and A. Camilleri, edi-
tors, First international conference on formal methods in computer-aided design,
volume 1166, pages 294–299, Palo Alto, CA, USA, 1996. Springer Verlag. URL
citeseer.ist.psu.edu/kumar96formal.html.
D. Ragan, P. Sandborn, and P. Stoaks. A Detailed Cost Model for Concurrent Use With
Hardware/Software Co-Design. Proceedings of the Design Automation Conference,
pages 269–274, 2002.
A. Rose, S. Swan, J. Pierce, and J.M. Fernandez. Transaction Level Modeling in Sys-
temC. OSCI TLM-WG, 2005.
D.L. Rosenband and Arvind. Modular Scheduling of Guarded Atomic Actions. Proc.
41st DAC (June 2004), 2004.
D.L. Rosenband and Arvind. Hardware synthesis from guarded atomic actions with per-
formance speciﬁcations. Proceedings of the 2005 IEEE/ACM International conference
on Computer-aided design, pages 784–791, 2005.
G. Russell. CAD for VLSI. Van Nostrand Reinhold, 1985.
N. Shankar. Lazy compositional veriﬁcation. Lecture Notes in Computer Science, 1536:
541–564, 1998.
R. Silva and M.J. Butler. Supporting reuse mechanisms for developments in event-b:
Composition. Technical report, University of Southampton, 2009.
R. Silva, C. Pascal, T.S. Hoang, and M. Butler. Decomposition tool for Event-B. 2010.
P. Stanford and P. Mancuso. EDIF: Electronic Design Interchange Format: Version
200. Electronic Industries Association, Engineering Department, 1990.
I.E. Sutherland. Micropipelines. Communications of the ACM, 32(6):720–738, 1989.
S. Sutherland, S. Davidmann, and P. Flake. System Verilog for Design: A Guide to Using
SystemVerilog for Hardware Design and Modeling. Kluwer Academic Pub, 2004.
S. Swan. SystemC transaction level models and RTL veriﬁcation. Proceedings of the
43rd annual conference on Design automation, pages 90–92, 2006.
D. Sylvester and K. Keutzer. Impact of small process geometries on microarchitectures
in systems on a chip. Proceedings of the IEEE, 89(4):467–489, 2001.
S. Tahar and R. Kumar. Formal Veriﬁcation of Pipeline Conﬂicts in RISC Processors.
Proc. European Design Automation Conference (EURO-DAC94), Grenoble, France,
September, pages 285–289, 1994.
M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman,
P. Johnson, J. Lee, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen,
M. Frank, S. Amarasinghe, and A. Agarwal. The raw microprocessor: A computa-
tional fabric for software circuits and general-purpose programs. IEEE Micro, 22(2):
25–35, 2002. ISSN 0272-1732. doi: http://doi.ieeecomputersociety.org/10.1109/MM.
2002.997877.
D.E. Thomas and P.R. Moorby. The Verilog (r) Hardware Description Language. Kluwer
Academic Publishers, 2002.
M.Y. Vardi and P. Wolper. Automata theoretic techniques for modal logics of programs:
(extended abstract). In STOC ’84: Proceedings of the sixteenth annual ACM sym-
posium on Theory of computing, pages 446–456, New York, NY, USA, 1984. ACM
Press. ISBN 0-89791-133-4. doi: http://doi.acm.org/10.1145/800057.808711.
T. W. Williams. VLSI testing. Elsevier Science Publishers BV Amsterdam, The Netherlands,
1986.
Appendix A
Published Papers
On Proving with Event-B that a Pipelined Processor Model Implements its
ISA Speciﬁcation
Colley, J and Butler, M
Reﬁnement Based Methods for the Construction of Dependable Systems
Dagstuhl, Germany
September 2009
On Proving with Event-B that a Pipelined
Processor Model Implements its ISA
Speciﬁcation
John Colley1, Michael Butler2
1 University of Southampton,
School of Electronics and Computer Science
Southampton, SO17 1BJ, UK
jlc05r@ecs.soton.ac.uk
2 University of Southampton,
School of Electronics and Computer Science
Southampton, SO17 1BJ, UK
mjb@ecs.soton.ac.uk
Abstract. Microprocessor pipelining is a well-established technique that
improves performance and reduces power consumption by overlapping in-
struction execution. Verifying, however, that an implementation meets
its ISA specification is complex and time-consuming. One of the key
veriﬁcation issues that must be addressed is that of overlapping instruc-
tion execution. This can introduce hazards where, for instance, a new
instruction reads the value from a register which will be written by an
earlier instruction that has not yet completed. Using Event-B’s support
for refinement with automated proof, a method is explored where the abstract
machine represents directly an instruction from the ISA that specifies
the effect that the instruction has on the microprocessor register file.
Reﬁnement is then used systematically to derive a concrete, pipelined ex-
ecution of that instruction. Microarchitectural considerations are raised
to the speciﬁcation level and design choices can be veriﬁed much earlier
in the ﬂow. The method proposed therefore has the potential to be in-
tegrated into an existing high-level synthesis methodology, providing an
automated design and veriﬁcation ﬂow from high-level speciﬁcation to
hardware.
Keywords. Event-B, Pipeline, ISA
1 Introduction
Microprocessor pipelining is a well-established technique that improves perfor-
mance and reduces power consumption by overlapping instruction execution.
Modern System-on-Chip microprocessors used for mobile applications have very
stringent power consumption requirements and are typically based on the 5-stage
DLX microprocessor [1]. From the Instruction Set Architecture (ISA) specification,
a pipelined microarchitecture is developed that implements the specification.
Verifying, however, that an implementation meets this ISA specification is
complex and time-consuming. Current veriﬁcation techniques are predominantly
test based within a Register Transfer Level (RTL) simulation and synthesis ﬂow.
One of the key veriﬁcation issues that must be addressed is that of overlap-
ping instruction execution. This can introduce hazards where, for instance, a
new instruction reads the value from a register which will be written by an ear-
lier instruction that has not yet completed. These are termed Read-After-Write
(RAW) data hazards [1]. The presence of hazards depends on the instruction mix
presented to the microprocessor and pseudo-random test generation techniques
have been used in an attempt to achieve adequate test coverage of instruction
combinations [2], [3].
Formal techniques, using both model checking and theorem proving, have
been used in microprocessor veriﬁcation, but as an adjunct to the simulation-
based ﬂow. These techniques are applied after the design is completed in the hope
of detecting errors not discovered by testing. Higher-level hardware description
languages such as Bluespec [4] and CAL [5], which provide an automatic syn-
thesis route to RTL, can speed up the design process, but it is the veriﬁcation
costs that dominate in the overall ﬂow and the bulk of the veriﬁcation must still
be done at the Register Transfer Level.
Event-B [6], [7] is a proof-based modelling language and method that en-
ables the development of speciﬁcations using reﬁnement. The Rodin platform
[8] is the Eclipse-based IDE that provides support for Event-B reﬁnement and
mathematical proof. Using Event-B’s support for reﬁnement with automated
proof, a method is explored where the abstract machine represents directly an
instruction from the ISA that specifies the effect that the instruction has on the
microprocessor register ﬁle. Reﬁnement is then used systematically to derive a
concrete, pipelined execution of that instruction. Each refinement step shows the
importance of addressing the inherent simultaneity that characterises the
pipelined behaviour and, in particular, the effects that feedback has in pipeline
construction.
To illustrate the method, the register/register arithmetic instruction of a typ-
ical System-on-Chip (SoC) microprocessor is chosen that can exhibit RAW data
hazards with overlapping execution. The technique, termed forwarding, where
intermediate values are fed back to a stage that needs them, is employed in modern
microprocessors to provide a very efficient means of managing RAW hazards
[1]. Debugging the forwarding logic has, however, been found to be difficult and
expensive [9]. With the introduction of appropriate invariants in our approach,
it is shown that the concrete, pipelined reﬁnement will not preserve these invari-
ants unless the RAW hazards are detected and managed appropriately.
The concrete Event-B model implements forwarding in a way that corre-
sponds directly to the techniques used in microprocessor design and is proved,
automatically, in the Rodin environment to be a correct reﬁnement of the ab-
stract ISA speciﬁcation. Thus, microarchitectural considerations are raised to
the speciﬁcation level and design choices can be veriﬁed much earlier in the ﬂow.
The concrete model also has a direct correspondence to an equivalent hardware
description in the high-level languages Bluespec and CAL, which like Event-B
are based on guarded atomic actions. The method proposed therefore has the
potential to be integrated into an existing high-level synthesis methodology, pro-
viding an automated design and veriﬁcation ﬂow from high-level speciﬁcation to
hardware.
2 An Overview of Event-B
In Event-B, an abstract model comprises a machine that speciﬁes the high-
level behaviour and a context, made up of sets, constants and their properties,
that represents the type environment for the high-level machine. The machine
is represented as a set of state variables, v, and a set of events (guarded atomic
actions) which modify the state. If more than one event is enabled, then one
is chosen non-deterministically for execution. Executing an event is an observable
transition on the state variables which must preserve an invariant on the variables, I(v). A more
concrete representation of the machine may then be created which reﬁnes the
abstract machine, and the abstract context may be extended to support the
types required by the reﬁnement. Gluing invariants are used to verify that the
concrete machine is a correct reﬁnement of the abstract. Gluing invariants give
rise to proof obligations for pairs of abstract and corresponding concrete events.
Events may also have parameters which take, non-deterministically, the values
that will make the guards in which they are referenced true.
An event can be represented by the generalized substitution,
any x where P(x,v) then v := F(x,v) end
where x represents the event parameters and v represents the value of the ma-
chine state variables. Informally, this event can be ﬁred provided that the guard
P(x, v) can be satisfied for some value x. The details are explained in [10].
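The execution model just described can be sketched in Python. This is only an illustrative interpreter, not part of Event-B or Rodin: the names Event, enabled, step and the bounded-counter example machine are assumptions made for the sketch, and the invariant is checked at run time here, whereas Rodin establishes invariant preservation by proof.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple

State = Dict[str, int]


@dataclass
class Event:
    """A guarded atomic action: any x where P(x, v) then v := F(x, v) end."""
    name: str
    params: Callable[[State], Iterable[int]]   # candidate values for x
    guard: Callable[[int, State], bool]         # P(x, v)
    action: Callable[[int, State], State]       # F(x, v), returns the new state


def enabled(events: List[Event], state: State) -> List[Tuple[Event, int]]:
    """Every (event, x) pair whose guard holds in the given state."""
    return [(e, x) for e in events for x in e.params(state) if e.guard(x, state)]


def step(events, state, invariant, rng):
    """Fire one enabled event, chosen non-deterministically (here: at random)."""
    choices = enabled(events, state)
    if not choices:
        return state  # no guard can be satisfied, so nothing fires
    event, x = rng.choice(choices)
    new_state = event.action(x, state)
    assert invariant(new_state), f"{event.name} does not preserve the invariant"
    return new_state


# Hypothetical example machine: a counter n with invariant 0 <= n <= 3.
inc = Event("inc", lambda v: [1, 2],
            lambda x, v: v["n"] + x <= 3,
            lambda x, v: {"n": v["n"] + x})
reset = Event("reset", lambda v: [0],
              lambda x, v: v["n"] == 3,
              lambda x, v: {"n": 0})


def run(steps: int, seed: int = 0) -> State:
    rng = random.Random(seed)
    state = {"n": 0}
    for _ in range(steps):
        state = step([inc, reset], state, lambda v: 0 <= v["n"] <= 3, rng)
    return state
```

The parameter x is restricted to the values that make the guard true, which mirrors the way an Event-B parameter takes, non-deterministically, a value satisfying the guards.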
3 Modelling the Arithmetic Instruction
3.1 The Abstract ISA Model
The structure of a register/register arithmetic instruction associates the opcode
with a destination register Rr and two source registers Ra and Rb. The Event-B
context, PIPEC, for the arithmetic instruction therefore defines a set of operations
Op, the type Register, the subset of operations that are of type register/register
arithmetic, ArithRROp, and the relationship between the fields of
the arithmetic instruction and their associated registers. The conventions of [11]
are followed to model operation ﬁelds. The context also deﬁnes No Operation,
NOP.
CONTEXT PIPEC
SETS
Op
CONSTANTS
Register
Rr
Ra
Rb
NOP
ArithRROp
AXIOMS
axm1 : Register ⊆ ℕ
axm2 : Rr ∈ Op → Register
axm3 : Ra ∈ Op → Register
axm4 : Rb ∈ Op → Register
axm5 : ArithRROp ⊆ Op
axm6 : NOP ∈ Op
axm7 : NOP ∉ ArithRROp
END
The abstract machine, PIPEM, deﬁnes the register ﬁle Regs and a single
event ArithRR that specifies the effect that execution of the instruction has on
the register file. For simplicity, the addition operation is shown, but this can
more generally be represented by an uninterpreted function [12] without affecting
the proof approach used. The parameter pop specifies the environment for
the event; given an instruction of type ArithRROp, the state of the register ﬁle
will be updated according to that instruction.
MACHINE PIPEM
SEES PIPEC
VARIABLES
Regs
INVARIANTS
inv1 : Regs ∈ Register → ℤ
EVENTS
Initialisation
begin
act1 : Regs := Register × {0}
end
Event ArithRR ≙
any
pop
where
grd1 : pop ∈ ArithRROp
then
act1 : Regs(Rr(pop)) := Regs(Ra(pop)) + Regs(Rb(pop))
end
END
The microarchitecture of the abstract machine is shown in Figure 1.
[Figure: blocks labelled ArithRR and Regs, with input pop]
Fig. 1. Abstract Machine: Microarchitecture
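As an informal reading of the abstract machine, the ArithRR event can be mimicked by a small Python function. The sketch assumes a register file indexed by small integers and an instruction record Instr with fields rr, ra and rb standing for Rr(pop), Ra(pop) and Rb(pop); these names and the helper initial_regs are illustrative and do not come from the Event-B model.

```python
from typing import Dict, NamedTuple


class Instr(NamedTuple):
    """Fields of a register/register arithmetic instruction (assumed encoding)."""
    rr: int  # destination register, Rr(pop)
    ra: int  # first source register, Ra(pop)
    rb: int  # second source register, Rb(pop)


def initial_regs(n_regs: int = 4) -> Dict[int, int]:
    """Initialisation: every register holds 0 (Regs := Register x {0})."""
    return {r: 0 for r in range(n_regs)}


def arith_rr(regs: Dict[int, int], pop: Instr) -> Dict[int, int]:
    """act1: Regs(Rr(pop)) := Regs(Ra(pop)) + Regs(Rb(pop)), as one atomic step."""
    new_regs = dict(regs)                     # leave the caller's state untouched
    new_regs[pop.rr] = regs[pop.ra] + regs[pop.rb]
    return new_regs
```

For example, with registers 1 and 2 holding 2 and 3, arith_rr(regs, Instr(rr=0, ra=1, rb=2)) yields a register file in which register 0 holds 5.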
3.2 The First Reﬁnement: a 2-stage pipeline
A 2-stage pipeline is now introduced which reﬁnes the abstract machine. The
second pipeline stage is a concrete representation of the Write Back (WB) stage
while the ﬁrst stage is still abstract, representing the Fetch/Decode/Execute op-
erations of the pipeline.
MACHINE PIPER1
REFINES PIPEM
SEES PIPEC
VARIABLES
Regs
EXop
ALUout
INVARIANTS
inv1 : EXop ∈ Op
inv2 : ALUout ∈ ℤ
inv3 : ALUout = Regs(Ra(EXop)) + Regs(Rb(EXop))
EVENTS
Event FDEXWB ≙
refines ArithRR
any
ppop
where
grd1 : EXop ∈ ArithRROp
grd2 : ppop ∈ ArithRROp
grd3 : Rr(EXop) ≠ Ra(ppop)
grd4 : Rr(EXop) ≠ Rb(ppop)
with
pop : pop = EXop
then
act1 : Regs(Rr(EXop)) := ALUout
act2 : ALUout := Regs(Ra(ppop)) + Regs(Rb(ppop))
act3 : EXop := ppop
end
END
Two new variables, ALUout and EXop are introduced to represent the EXWB
pipeline registers. The parameter pop of the abstract ArithRR event is bound
to the concrete register EXop using an Event-B witness and a new parameter
ppop represents the environment of the reﬁned event, FDEXWB. The FDEXWB
event models the simultaneous execution of both pipeline stages. The microar-
chitecture of the reﬁned machine is shown in Figure 2.
[Figure: blocks labelled FDEX, ALUout, EXop, WB and Regs, with input ppop]
Fig. 2. Refined Machine: Microarchitecture
It is now necessary to introduce the gluing invariant to establish that this
is a correct reﬁnement of the abstract machine. To preserve the meaning of the
abstract speciﬁcation, the new variable ALUout must always have the value
Regs(Ra(EXop)) + Regs(Rb(EXop)), as represented by the invariant inv3. The
Rodin prover, however, shows that this invariant is not preserved by the reﬁned
machine. The abstract FDEX pipeline stage simultaneously reads the register
ﬁle while the WB stage is writing to it. If the location being read is the same as
that being written, a Read After Write (RAW) data hazard will be encountered
and the wrong value will be read by the first pipeline stage. This inherent feedback
in the pipelined implementation must be addressed explicitly if it is to meet its
specification.
3.3 Detecting the RAW Hazard
The abstract FDEX pipeline stage may only read from the source registers
Ra and Rb if they do not coincide with the target register Rr of the previous
instruction, represented by Rr(EXop). Two new guards are introduced into the
reﬁned event to meet this requirement.
grd1 : ...
grd2 : ...
grd3 : Rr(EXop) ≠ Ra(ppop)
grd4 : Rr(EXop) ≠ Rb(ppop)
The Rodin prover now shows that the invariant ALUout = Regs(Ra(EXop)) +
Regs(Rb(EXop)) is preserved by the reﬁned machine.
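The effect of the two guards can be illustrated by brute force on a tiny instance. The Python sketch below is not a substitute for the Rodin proof: it assumes three registers holding values from {0, 1, 2}, and the names State, fdexwb, no_hazard and violations are hypothetical. It enumerates every state satisfying inv3 and every next instruction ppop, and reports the cases where the simultaneous update breaks inv3.

```python
from itertools import product
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    rr: int  # Rr
    ra: int  # Ra
    rb: int  # Rb


class State(NamedTuple):
    regs: Tuple[int, ...]
    exop: Instr
    aluout: int


def inv3(s: State) -> bool:
    """ALUout = Regs(Ra(EXop)) + Regs(Rb(EXop))."""
    return s.aluout == s.regs[s.exop.ra] + s.regs[s.exop.rb]


def no_hazard(s: State, ppop: Instr) -> bool:
    """grd3 and grd4: the register being written is not a source of ppop."""
    return s.exop.rr != ppop.ra and s.exop.rr != ppop.rb


def fdexwb(s: State, ppop: Instr) -> State:
    """All three actions read the OLD state, modelling simultaneous stages."""
    regs = list(s.regs)
    regs[s.exop.rr] = s.aluout                    # act1 (WB stage)
    aluout = s.regs[ppop.ra] + s.regs[ppop.rb]    # act2 (FDEX stage)
    return State(tuple(regs), ppop, aluout)       # act3


def violations(guarded: bool, n_regs: int = 3,
               values=(0, 1, 2)) -> List[Tuple[State, Instr]]:
    """(state, ppop) pairs for which the event does not preserve inv3."""
    instrs = [Instr(*f) for f in product(range(n_regs), repeat=3)]
    bad = []
    for regs in product(values, repeat=n_regs):
        for exop in instrs:
            s = State(regs, exop, regs[exop.ra] + regs[exop.rb])  # inv3 holds
            for ppop in instrs:
                if guarded and not no_hazard(s, ppop):
                    continue  # event not enabled
                if not inv3(fdexwb(s, ppop)):
                    bad.append((s, ppop))
    return bad
```

Under these assumptions, violations(guarded=False) is non-empty (each entry is a RAW hazard), while violations(guarded=True) is empty.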
3.4 Dealing Correctly with the RAW Hazard
It is now necessary to deal with the cases where a hazard is encountered on
register Ra alone, on register Rb alone and on both registers Ra and Rb. In
each case, the required value(s) can be read from the ALUout register. This
corresponds directly to the forwarding technique used in microprocessor design.
Three extra events are introduced to deal with each case. For instance, for the
hazard on register Ra, the guards of the event are
grd3 : Rr(EXop) = Ra(ppop)
grd4 : Rr(EXop) ≠ Rb(ppop)
and the associated action now reads the value of Ra from ALUout.
act2 : ALUout := ALUout + Regs(Rb(ppop))
The Rodin prover shows that, for each case, the invariant is preserved. The
microarchitecture of the modiﬁed reﬁned machine is shown in Figure 3.
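The four forwarding cases can be sketched in the same brute-force style. The paper gives the action only for the hazard on Ra; the actions used below for the hazard on Rb and for the hazard on both registers are inferred by symmetry and should be read as assumptions, as should the three-register, three-value instance and the names fdexwb_forwarding and check_all.

```python
from itertools import product
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    rr: int
    ra: int
    rb: int


class State(NamedTuple):
    regs: Tuple[int, ...]
    exop: Instr
    aluout: int


def inv3(s: State) -> bool:
    return s.aluout == s.regs[s.exop.ra] + s.regs[s.exop.rb]


# One event per combination of (hazard on Ra, hazard on Rb).
EVENT_NAMES = {
    (False, False): "no hazard",
    (True, False): "hazard on Ra",
    (False, True): "hazard on Rb",
    (True, True): "hazard on Ra and Rb",
}


def fdexwb_forwarding(s: State, ppop: Instr) -> Tuple[str, State]:
    """Fire the single event whose guards match; forward ALUout where needed."""
    haz_a = s.exop.rr == ppop.ra
    haz_b = s.exop.rr == ppop.rb
    a = s.aluout if haz_a else s.regs[ppop.ra]   # forwarded or read from Regs
    b = s.aluout if haz_b else s.regs[ppop.rb]
    regs = list(s.regs)
    regs[s.exop.rr] = s.aluout                   # WB stage, simultaneous
    return EVENT_NAMES[(haz_a, haz_b)], State(tuple(regs), ppop, a + b)


def check_all(n_regs: int = 3, values=(0, 1, 2)) -> Set[str]:
    """Check inv3 after every possible step; return the events that fired."""
    instrs = [Instr(*f) for f in product(range(n_regs), repeat=3)]
    fired = set()
    for regs in product(values, repeat=n_regs):
        for exop in instrs:
            s = State(regs, exop, regs[exop.ra] + regs[exop.rb])
            for ppop in instrs:
                name, nxt = fdexwb_forwarding(s, ppop)
                assert inv3(nxt), (s, ppop)
                fired.add(name)
    return fired
```

Because the four guard combinations are mutually exclusive and exhaustive, exactly one event is enabled for each ppop, so the pipeline no longer has to wait when a hazard is present.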
3.5 Further Reﬁnements
The reﬁnement process can continue, systematically, until all the pipeline stages
are represented in concrete form. At each step, the gluing invariants will ensure
that the reﬁnement implements its predecessor.
In the second reﬁnement, the concrete Execute (EX) stage is introduced
together with the IDEX pipeline registers. The registers A and B store the values
of Ra and Rb respectively. Four events in the abstract Fetch/Decode stage are
needed to deal with the possible data hazard combinations and two new gluing
invariants,
inv1 : A = Regs(Ra(IDop))
inv2 : B = Regs(Rb(IDop))
ensure that the data hazards are dealt with correctly. When combined with the
four EXWB events, this gives a total of sixteen events.
[Figure: blocks labelled FDEX, ALUout, EXop, WB and Regs, with input ppop and a Forwarding path]
Fig. 3. Refined Machine with forwarding: Microarchitecture
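The count of sixteen events in the second refinement is the product of the four hazard cases in each of the two stages. A minimal Python enumeration, in which the case labels and the event naming scheme are purely illustrative, is:

```python
from itertools import product

# The four data hazard cases that each stage must distinguish.
HAZARD_CASES = ("none", "Ra", "Rb", "Ra and Rb")

# One combined event per (Fetch/Decode case, EXWB case) pair.
combined_events = [
    f"FD[{fd}] / EXWB[{ex}]" for fd, ex in product(HAZARD_CASES, HAZARD_CASES)
]
```

The list has 4 x 4 = 16 entries, which is the multiplicity of cases that stepwise refinement helps to manage.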
In the third refinement, the concrete Instruction Fetch (IF) and Instruction
Decode (ID) stages are established.
To generalise this approach for uninterpreted arithmetic operations, the ac-
tion
act1 : Regs(Rr(p)) := Regs(Ra(p)) + Regs(Rb(p))
can be replaced with
act1 : Regs(Rr(p)) := fop(Regs(Ra(p)) ↦ Regs(Rb(p)))
where
grd1 : fop = func(p)
and
axm8 : func ∈ Op → Register
is a ﬁeld of the arithmetic instruction. The proofs with arbitrary arithmetic
operations are still automatic.
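That the argument does not depend on the particular arithmetic operation can be illustrated by comparing a sequential ISA interpreter with a 2-stage pipeline with forwarding, both parameterised by the operation. The Python sketch below is a test on random programs, not a proof. It assumes that each instruction carries its binary function directly in a func field, that an empty pipeline is represented by None, and that the pipeline is drained by one final write-back; all of these are modelling assumptions of the sketch.

```python
import operator
import random
from typing import Callable, Dict, List, NamedTuple, Optional


class Instr(NamedTuple):
    func: Callable[[int, int], int]  # fop, treated as an arbitrary function
    rr: int
    ra: int
    rb: int


def run_isa(regs: Dict[int, int], program: List[Instr]) -> Dict[int, int]:
    """Abstract specification: one instruction completes before the next starts."""
    regs = dict(regs)
    for p in program:
        regs[p.rr] = p.func(regs[p.ra], regs[p.rb])
    return regs


def run_pipeline(regs: Dict[int, int], program: List[Instr]) -> Dict[int, int]:
    """2-stage pipeline: FDEX of ppop overlaps WB of EXop, with forwarding."""
    regs = dict(regs)
    exop: Optional[Instr] = None   # None models an empty WB stage
    aluout = 0
    for ppop in program:
        busy = exop is not None
        # Operands are chosen from the OLD state: forwarded ALUout on a hazard.
        a = aluout if busy and ppop.ra == exop.rr else regs[ppop.ra]
        b = aluout if busy and ppop.rb == exop.rr else regs[ppop.rb]
        if busy:
            regs[exop.rr] = aluout          # WB stage
        aluout = ppop.func(a, b)            # FDEX stage
        exop = ppop
    if exop is not None:
        regs[exop.rr] = aluout              # drain the last instruction
    return regs


OPS = (operator.add, operator.sub, operator.xor)


def random_program(rng: random.Random, length: int, n_regs: int) -> List[Instr]:
    return [Instr(rng.choice(OPS), rng.randrange(n_regs),
                  rng.randrange(n_regs), rng.randrange(n_regs))
            for _ in range(length)]
```

For any program built from these operations, run_pipeline is expected to return the same register file as run_isa, since forwarding supplies exactly the value the sequential machine would have read.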
The final, concrete pipeline is represented by sixteen events and all the proof
obligations generated are discharged automatically by the Rodin tool, as shown
in Table 1.
                  Total no. of        Discharged
                  proof obligations   automatically
Abstract Model    3                   3
1st Refinement    33                  33
2nd Refinement    192                 192
3rd Refinement    115                 115
Table 1. Pipeline Proofs
4 Related Work
Early work in the formal verification of microprocessors was focused on simple,
non-pipelined processors described at the Register Transfer Level (RTL). In [13]
the RTL is represented in the ML programming language and the HOL proof
assistant system [14] is used to discharge the proofs.
In [12] and [15] the representation of the processor is raised to the Instruction
Set Architecture (ISA) level and the techniques described focus on the formal
veriﬁcation of the control logic of ﬁrst a 3-stage pipelined ALU and then the
full 5-stage DLX processor. ALU operations are represented as uninterpreted
functions. In order to show that the pipelined processor will behave in the same
way as a notional non-pipelined version, the concept of pipeline ﬂushing is intro-
duced. Stall instructions are introduced at the pipeline input to ensure that each
instruction is completed before the next is initiated. The notion of reﬁnement
maps is introduced in [16] and [17] to extend the flushing concepts of Burch
and Dill to more complex 3 and 10-stage pipelines, using the ACL2 functional
programming language and theorem prover [18].
[19] focuses its attention on the formalization of the pipeline hazards that
can occur when multiple instructions are executed at once in the DLX pipeline.
Structural, data and control hazards are represented and checked using the HOL
verification system [14]. Incremental design techniques with refinement are described
in [20] to show that a notional DLX pipeline that executes one instruction
at a time can be refined to a pipeline that executes 5 instructions at each clock
cycle and manages structural hazards, provided that it does not encounter a sequence of
instructions that would incur data or control hazards. This pipeline is then further
reﬁned to model the data and control hazards. Abstract State Machines (ASMs)
are used to represent the DLX instructions. In [9], a tool that takes a sequen-
tial model of the DLX pipeline, which is assumed to be correct, and adds the
forwarding logic is described. The tool also provides a proof of correctness for
the generated hardware. Our approach is the only one that starts with an ab-
stract ISA speciﬁcation and proves, systematically, that the concrete, concurrent
pipeline model derived from the ISA implements that speciﬁcation.
5 Conclusions
A method has been explored, using the register/register arithmetic instruction
as an example, to show that the ISA speciﬁcation of the instruction can be re-
ﬁned systematically to a pipelined model that can be proved to implement its
ISA specification. The method ensures, through the introduction of gluing invariants
at each stage, that microarchitectural considerations are addressed early
in the design flow. Different microarchitectures may be explored and verified at
the speciﬁcation level. Stepwise reﬁnement allows us to manage the multiplicity
of cases caused by pipeline data hazards. The models have been developed us-
ing the Rodin Platform and all the generated proof obligations are discharged
automatically by the tool.
Current work is focused on managing the effect of branch instructions on
correct pipeline execution. The techniques described have been used to prove
that the pipeline program counter is updated correctly according to the branch
instruction ISA speciﬁcation. Gluing invariants are being developed to ensure
that instructions that have been fetched speculatively are not executed when a
branch is encountered.
A disadvantage of our approach is that we need to specify separate pipeline
stages with a single event. We are exploring a technique that uses reﬁnement and
decomposition to create separate events for each stage once the gluing invariants
have been proved.
In common with Bluespec and CAL, Event-B is based on guarded atomic
actions. The method proposed therefore has the potential to be integrated into
an existing high-level synthesis methodology, providing an automated design and
veriﬁcation ﬂow from high-level speciﬁcation to hardware.
References
1. Hennessy, J., Patterson, D.: Computer Architecture: A Quantitative Approach.
Morgan Kaufmann (2006)
2. Hollander, Y., Morley, M., Noy, A.: The e language: A fresh separation of concerns.
Proceedings of TOOLS 38 (2001)
3. Haque, F., Michelson, J.: Art of Veriﬁcation with VERA. Veriﬁcation Central
(2001)
4. Nikhil, R.: Bluespec System Verilog: efficient, correct RTL from high level specifications.
Formal Methods and Models for Co-Design, 2004. MEMOCODE’04.
Proceedings. Second ACM and IEEE International Conference on (2004) 69–70
5. Bhattacharyya, S.S., Brebner, G., Janneck, J.W., Eker, J., von Platen, C., Mat-
tavelli, M., Raulet, M.: Opendf: a dataﬂow toolset for reconﬁgurable hardware and
multicore systems. SIGARCH Comput. Archit. News 36 (2008) 29–35
6. Abrial, J., Mussat, L.: Introducing dynamic constraints in B. B 98 (1998) 83–128
7. Hallerstede, S.: Justiﬁcations for the Event-B Modelling Notation. In: B 2007:
Formal Speciﬁcation and Development in B. (2007)
8. Abrial, J., Butler, M., Hallerstede, S., Voisin, L.: An open extensible tool environ-
ment for Event-B. In: International Conference on Formal Engineering Methods
(ICFEM). (2006)
9. Kroening, D., Paul, W.: Automated pipeline design. In: Proceedings of the 38th
conference on Design automation, ACM New York, NY, USA (2001) 810–815
10. Abrial, J.: Rigorous Open Development Environment for Complex Systems: event
B language. (2005)
11. Evans, N., Butler, M.: A Proposal for Records in Event-B. FM (2006) 21–27
12. Burch, J., Dill, D.: Automatic veriﬁcation of Pipelined Microprocessor Control.
Proceedings of the 6th International Conference on Computer Aided Veriﬁcation
(1994) 68–80
13. Joyce, J., Birtwistle, G., Gordon, M.: Proving a Computer Correct in Higher Order
Logic, Report No. 100. Computer Laboratory, Cambridge University (1986)
14. Gordon, M., Melham, T.: Introduction to HOL: a theorem proving environment
for higher order logic. Cambridge University Press New York, NY, USA (1993)
15. Jones, R., Dill, D., Burch, J.: Efficient validity checking for processor verification.
In: IEEE International Conference on Computer-Aided Design, San Jose,
California, USA (1995)
16. Manolios, P.: Correctness of pipelined machines. Formal Methods in Computer-
Aided Design–FMCAD 1954 (2000) 161–178
17. Manolios, P., Srinivasan, S.: A computationally efficient method based on commitment
refinement maps for verifying pipelined machines models. ACM-IEEE International
Conference on Formal Methods and Models for Codesign (2005) 189–198
18. Kaufmann, M., Moore, J.: Industrial proofs with acl2. Technical report, University
of Texas (2004)
19. Tahar, S., Kumar, R.: Formal Veriﬁcation of Pipeline Conﬂicts in RISC Proces-
sors. Proc. European Design Automation Conference (EURO-DAC94), Grenoble,
France, September (1994) 285–289
20. Borger, E., Mazzanti, S.: A Practical Method for Rigorously Controllable Hard-
ware Design. ZUM’97, the Z Formal Speciﬁcation Notation: 10th International
Conference of Z Users, Reading, UK, April 3-4, 1997: Proceedings (1997)