Abstract: Instruction sets of modern processors contain hundreds of instructions defined on a relatively small set of datapath components and distinguished by their codes and the order in which they activate these components. Optimal design of an instruction set for a particular combination of available hardware components and software requirements is crucial for system performance and is a challenging task involving a lot of heuristics and high-level design decisions. The overall design process is significantly complicated by inefficient representation of instructions, which are usually described individually despite the fact that they share a lot of common behavioural patterns.
I. INTRODUCTION
Modern microprocessors are becoming increasingly diversified in terms of power modes, heterogeneous hardware platforms, requirements for legacy software reuse, etc. This is amplified by the rapidly growing demand for low power consumption, high performance and small area of the produced circuits. As a result, under the pressure of time-to-market constraints, a computer architect faces a productivity gap: the capacity of modern CAD tools is insufficient for exploring the variety of possible architectural solutions and for identifying the optimal instruction set, which is a large part of a microprocessor design.
A. Instruction Set Architecture (ISA) criteria
There are several criteria which determine the choice of a processor microarchitecture and the generation of an efficient instruction set:
Functionality. Each instruction is associated with a sequence of atomic actions (usually acyclic) to complete the task. Note that while a sequential run of actions is sufficient to achieve the instruction functionality, it is often practical to enable some of the actions concurrently, e.g. in order to speed up the instruction execution and to efficiently utilise the available energy. The distinctive classes of instruction functionality are arithmetic operations, data handling, memory access and flow control.
The amount of computation per instruction is the key dilemma of computer architecture: it determines the tradeoff between the complexity of the microarchitecture implementation and the software code it executes. Historically there were different views on this dilemma [9]. Initially, the Complex Instruction Set Computer (CISC) architecture with its semantically rich instruction set dominated the microprocessor market. CISC instructions could access their operands in several addressing modes and could execute complex multi-cycle operations without storing the intermediate results, which was advantageous for slow and expensive memory. The major disadvantage of CISC was the complexity of the instruction decoding logic: it had to distinguish among many instructions and their addressing modes. This problem was resolved in the Reduced Instruction Set Computer (RISC) architecture, where the simplicity of instruction decoding and pipelining was achieved at the cost of decreased code density. In RISC a relatively small set of basic instructions was employed to build complex functionality at the level of software [4].
Operation modes. The same functionality can be achieved in different ways targeting various optimisation criteria. For example, an arithmetic operation can be executed either in an energy efficient way but slowly, or in a low latency mode at the price of extra energy consumption. Alternatively, for security applications, the operation can be combined with power masking and data scrambling. The choice of available operation modes is usually made at the design time and is limited by the circuit area and the timing constraints. Selection of the operation mode can be encoded in the instruction set at two levels: coarse-grain, as a separate class of mode-switching instructions or fine-grain, as a part of each instruction code.
For example, in the ARM architecture [7] , apart from the standard RISC-like operation mode with a 32-bit instruction set there are several special modes, e.g. Thumb and Jazelle. In the Thumb mode the processor switches to a compact 16-bit encoding of a subset of ARM instructions and makes the instruction operands implicit. This reduces the processor functionality but improves its performance as less data needs to be fetched from the memory. In the Jazelle mode the instruction set is changed to natively execute Java Bytecode and to support JIT compilation [16] .
Resources. At least one computation resource needs to be available for each type of atomic action comprising the instructions. The availability of resources has two aspects: static and dynamic. The static aspect is addressed at the stage of system synthesis and is mostly constrained by the circuit area and timing requirements. The dynamic aspect arises at runtime when the same resource is needed for several actions; such a conflict has to be resolved through scheduling, which may also involve resource arbitration. It is advantageous to optimise the quantity of each resource type at the synthesis stage, targeting a trade-off between resource idle time and the number of conflicts to resolve. This can be achieved by statistical analysis of potential resource utilisation and careful adjustment of the instruction set.
Usually a designer tries to balance the load on CPU, memory and communication buses at the design time. However, it is often not possible to fine tune the circuit for all execution scenarios at the design time and one of the circuit components becomes a bottleneck limiting the performance of the whole system. In this situation a dynamic reconfiguration of the system brings advantages, e.g. the critical path can be sped up to improve circuit latency and the non-critical paths can be slowed down to save power.
Modern microprocessors, while often referred to as RISC-like, also exhibit the features of CISC and Very Long Instruction Word (VLIW) architectures. For example, they have multi-clock instructions with high-level execution semantics (e.g. if-then-else, DSP and multimedia instructions), which is typical for CISC. They also combine the compile-time scheduling of the VLIW architecture with dynamic arbitration of resources to exploit instruction-level parallelism (ILP) for instruction pipelining, out-of-order and speculative execution. Combined with various operation modes and resource restrictions, such a diversity of instruction functionality presents a real challenge to the efficient design of microprocessors.
B. Existing ISA approaches and challenges
There are several well-established approaches for the functional-level description and formal verification of ISA. Event-B [21] is a widely adopted language for specifying first-order logic systems and refining those representations. Combined with the RODIN theorem prover tool, it becomes a powerful platform for proving that a (refined) system satisfies the initial specification, e.g. does not leave a certain set of 'good' states during its operation. HOL [6] is a computer-assisted proving environment for constructing verifiably correct mathematical proofs. Although its expressiveness is unrivalled, the generic nature of a tool such as ISABELLE/HOL makes it more suitable for analysing individual instructions with deep mathematical properties; see, for example, verification of the IA-64 division algorithm [8].
These formal ISA methods have a history of being used for reasoning about hardware implementations; however, they are more targeted to the software-related aspects of processor functionality. No hardware implementation issues are usually taken into consideration apart from those directly visible to the instructions, such as the size of addressable memory, the number and type of available registers, etc. As a result, an ISA designer does not have full control over how the specified functionality is achieved in hardware, what the costs of every instruction are in terms of energy consumption and computation resources, how to minimise latency of instruction decoding logic or how to dynamically adapt the processor to the current operating conditions. Modelling such low-level implementation details in Event-B or HOL is costly; a more targeted formalism is needed to interface the representation of knowledge about instruction sets with that of knowledge about their execution.
There is clearly a niche in microprocessor EDA where the following design requirements need to be addressed:
• compact description of individual instruction functionalities as partial orders of atomic actions;
• efficient representation of complete instruction sets to allow their transformations (optimisation of encoding, retargeting for different hardware platforms, etc.);
• capturing of processor operation modes as explicit parameters of the instruction sets;
• possibility to express the resource availability constraints;
• encoding of instruction set for different optimisation criteria (code length minimisation, complexity of decoding logic, legacy software compatibility, etc.).

We propose to address these requirements using a graph model, called Conditional Partial Order Graphs (CPOGs) [15]. This model is particularly convenient for composing and representing large sets of partial orders in a compact form. It can be equipped with a set of mathematical tools for the refinement, optimisation, encoding and synthesis of the control hardware which implements the required instruction set, similar in spirit to the approach based on control automata [2]. We envisage that the model can be used as a complementary formalism for the existing ISA methodologies, providing a formal link between the software and hardware domains. Although general-purpose modelling languages and proving environments, such as Event-B or HOL, may be used to a similar effect, the CPOG model offers a superior mathematical construction permitting automated analysis and synthesis.

This paper presents a significant contribution to the relatively new concept of CPOGs. The previous CPOG-related publications, e.g. [13] [14] [15], focused on algebraic CPOG properties, controller synthesis, verification and optimal encoding of partial orders, while this work brings all these methods to the area of formal specification of processor instruction sets and introduces CPOG transformations as an efficient way of instruction set management.
The organisation of the paper is as follows. Section II gives the background of the CPOG model and shows how to use it for specification and composition of processor instruction sets. It is followed by Section III, where we describe several transformations defined on CPOGs and discuss issues of a physical microcontroller implementation. Case study in Section IV demonstrates how CPOGs can be used for capturing different hardware configurations and operation modes. The paper is concluded with experiments, Section V, where we specify an instruction set and study its FPGA implementation.
II. FORMAL MODEL FOR INSTRUCTION SETS
This section presents the basic definitions behind the CPOG model and demonstrates how it can be applied to the efficient specification of processor instruction sets. 
A. CPOG essentials
A Conditional Partial Order Graph [15] (further referred to as CPOG or graph) is a quintuple H = (V, E, X, ρ, φ) where:
• V is a set of vertices which correspond to events (or atomic actions) in a modelled system.
• E ⊆ V × V is a set of arcs representing dependencies between the events.
• Operational vector X is a set of Boolean variables. An opcode is an assignment $(x_1, x_2, \ldots, x_{|X|}) \in \{0, 1\}^{|X|}$ of these variables. An opcode selects a particular partial order from those contained in the graph.
• ρ ∈ F(X) is a restriction function, where F(X) is the set of all Boolean functions over variables in X. ρ defines the operational domain of the graph: X can be assigned only those opcodes $(x_1, x_2, \ldots, x_{|X|})$ which satisfy the restriction function, i.e. $\rho(x_1, x_2, \ldots, x_{|X|}) = 1$.
• Function $\varphi : (V \cup E) \to F(X)$ assigns a Boolean condition φ(z) ∈ F(X) to every vertex and arc z ∈ V ∪ E in the graph. Let us also define $\varphi(z) = 0$ for $z \notin V \cup E$.
CPOGs are represented graphically by drawing a labelled circle for every vertex and drawing a labelled arrow for every arc. The label of a vertex v consists of the vertex name, semicolon and the vertex condition φ(v), while every arc e is labelled with the corresponding arc condition φ(e). The restriction function ρ is depicted in a box next to the graph; operational variables X can therefore be observed as parameters of ρ. Fig. 1(a) shows an example of a CPOG with |V| = 5 vertices and |E| = 7 arcs. There is a single operational variable x; the restriction function is ρ(x) = 1, hence both opcodes x = 0 and x = 1 are allowed. Vertices {a, b, d} have constant φ = 1 conditions and are called unconditional, while vertices {c, e} are conditional and have conditions $\varphi(c) = x$ and $\varphi(e) = \overline{x}$ respectively. Arcs also fall into two classes: unconditional (arc c → d) and conditional (all the rest). As CPOGs tend to have many unconditional vertices and arcs we use a simplified notation in which conditions equal to 1 are not depicted in the graph; see Fig. 1(b).
The purpose of conditions φ is to 'switch off' some vertices and/or arcs in a CPOG according to a given opcode, thereby producing different CPOG projections. An example of a graph and its two projections is presented in Fig. 2. The leftmost projection is obtained by keeping in the graph only those vertices and arcs whose conditions evaluate to 1 after variable x is set to 1; it is denoted by $H|_{x=1}$. The rightmost projection is obtained in the same way with the only difference that variable x is set to 0; it is denoted by $H|_{x=0}$. Note that although the condition of arc c → d evaluates to 1 (in fact it is constant 1) the arc is still excluded from the resultant graph because one of the vertices it connects, viz. vertex c, is excluded and naturally an arc cannot appear in a graph without one of its vertices. Each of the obtained projections can be regarded as a specification of a particular behavioural scenario of the modelled system, e.g. as a specification of a processor instruction. Potentially, a CPOG H = (V, E, X, ρ, φ) can specify an exponential number of different instructions (each composed from atomic actions in V) according to one of $2^{|X|}$ different possible opcodes.
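The projection operation can be illustrated with a minimal Python sketch. The dictionary encoding, the helper name `project` and the exact conditions on the conditional arcs are assumptions made for illustration (the text only states which vertices and arcs are conditional); conditions are modelled as predicates over an opcode.

```python
# Conditions are predicates over an opcode given as {variable: 0 or 1}.
ONE = lambda op: True
X = lambda op: op["x"] == 1
NOT_X = lambda op: op["x"] == 0

# Assumed encoding of a CPOG with 5 vertices and 7 arcs, rho = 1.
# Only arc c -> d is unconditional; the other arc conditions are assumptions.
vertex_cond = {"a": ONE, "b": ONE, "c": X, "d": ONE, "e": NOT_X}
arc_cond = {
    ("a", "c"): X, ("b", "c"): X, ("c", "d"): ONE,
    ("a", "d"): NOT_X, ("a", "e"): NOT_X,
    ("b", "d"): NOT_X, ("b", "e"): NOT_X,
}

def project(vertex_cond, arc_cond, opcode):
    """Return (vertices, arcs) that survive under the given opcode."""
    vertices = {v for v, cond in vertex_cond.items() if cond(opcode)}
    # An arc survives only if its condition holds and both endpoints survive.
    arcs = {(u, v) for (u, v), cond in arc_cond.items()
            if cond(opcode) and u in vertices and v in vertices}
    return vertices, arcs

add_v, add_e = project(vertex_cond, arc_cond, {"x": 1})
xchg_v, xchg_e = project(vertex_cond, arc_cond, {"x": 0})
```

Under x = 0 the arc c → d is dropped even though its own condition is 1, because vertex c is excluded.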
B. Specification and composition of instructions
Consider a processing unit that has two registers A and B, and can perform two different instructions: addition and exchange of two variables stored in memory. The processor contains five datapath components (denoted by a . . . e) that can perform the following atomic actions: a) Load register A from memory; b) Load register B from memory; c) Compute sum A + B and store it in A; d) Save register A into memory; e) Save register B into memory. Table I describes the addition and exchange instructions in terms of usage of these atomic actions.
The addition instruction consists of loading the two operands from memory (actions a and b, causally independent and thus possibly concurrent), their addition (action c), and saving the result (action d). Whether a and b are to be performed concurrently depends on: i) the system architecture, e.g. if concurrent read memory access is allowed, ii) static and dynamic resource availability (the processor hardware configuration must physically contain two memory access components and they both have to be immediately available for use), and iii) the current operation mode which determines the scheduling strategy, e.g. 'execute a and b concurrently to minimise latency', or 'execute a and b in sequence to lower peak power'. Let us assume for simplicity that in this example all causally independent actions are always performed concurrently; see the corresponding partial order P_ADD in Table I¹. Section IV will address joint specification of different scheduling strategies of an instruction.

Table I: Two instructions specified as partial orders with maximum concurrency (P_ADD and P_XCHG).
The operation of exchange consists of loading the operands (concurrent actions a and b), and saving them into swapped memory locations (concurrent actions d and e), as captured by P XCHG . Note that in order to start saving one of the registers it is necessary to wait until both of them have been loaded to avoid overwriting one of the values.
One can see that the two partial orders in Table I appear to be the two projections shown in Fig. 2, thus the corresponding graph can be considered as a joint specification of both instructions. Two important characteristics of such a specification are that the common events {a, b, d} are overlaid and the choice between the two operations is distributed in the Boolean expressions associated with the vertices and arcs of the graph. As a result, in our model there is no need for a 'nodal point' of choice, which tends to appear in alternative specification models (a Petri Net [5] would have an explicit choice place, a Finite State Machine [12] an explicit choice state, and a specification written in a Hardware Description Language [12] would describe the two instructions by two separate branches of a conditional statement if or case).
The following notions are introduced to formally define specification and composition of instruction sets.
An instruction is a pair I = (ψ, P), where $\psi \in \{0, 1\}^{|X|}$ is a vector assigning a Boolean value to each variable in X, and P = (V, ≺) is a partial order defined on a set of atomic actions V. Semantically, ψ represents the instruction opcode², while the precedence relation ≺ of the partial order captures behaviour of the instruction. We assume that V and X belong to the corresponding universes shared by all the instructions of the processor: V ⊆ U_V and X ⊆ U_X.

¹ In this paper we describe partial orders using Hasse diagrams [3], i.e. without depicting transitive dependencies, such as, for example, dependencies a → d and b → d in partial order P_ADD.
² In this section the instruction operands are implicit and the opcode completely defines the instruction. We elaborate on this in Section IV.
An instruction set (denoted by IS) is a set of instructions with unique opcodes, i.e. for any IS = {I 1 , I 2 , . . . , I n }, such that I k = (ψ k , P k ), all opcodes ψ k must be different.
Given a CPOG H = (V, E, X, ρ, φ) there is a natural correspondence between its projections and instructions: an opcode ψ = (x 1 , x 2 , . . . , x |X| ) induces a partial order H| ψ , and paired together they form an instruction I ψ = (ψ, H| ψ ) according to the above definition. This leads to the following formal link between CPOGs and instruction sets.
A CPOG H = (V, E, X, ρ, φ) is a specification of an instruction set IS(H) defined as a union of instructions (ψ, H|_ψ) which are allowed by the restriction function ρ:

$$IS(H) = \{(\psi, H|_\psi) : \rho(\psi) = 1\}$$
Using this definition we can formally state that the graph in Fig. 2 specifies the instruction set from Table I. In the rest of this section we show how to obtain such CPOG specifications.
Composition of two instruction sets IS 1 and IS 2 is their union IS 1 ∪ IS 2 . Composition is not defined if the union contains two instructions with the same opcode (otherwise, the result would not be an instruction set by the above definition). Due to the commutativity and associativity properties of set union ∪ we can compose more than two instruction sets by performing their pairwise composition in arbitrary order.
Note that, if instructions in given sets IS_k are represented individually (as they are in conventional methods), then the complexity of the composition operation is linear with respect to the total number of instructions: Θ(|IS|), where $IS = \bigcup_k IS_k$. This is because we have to iterate over all of them to generate the result. It may be unacceptably slow for those applications which routinely perform various operations on large instruction sets. Using the CPOG model for the compact representation of instruction sets allows most of the operations to be performed much faster, as demonstrated below.
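Let instruction sets IS_1 and IS_2 be specified with graphs $H_1 = (V_1, E_1, X_1, \rho_1, \varphi_1)$ and $H_2 = (V_2, E_2, X_2, \rho_2, \varphi_2)$, respectively. Then their composition is specified by the graph $H = (V_1 \cup V_2, E_1 \cup E_2, X_1 \cup X_2, \rho_1 + \rho_2, \varphi)$, whose vertex and arc conditions are $\varphi(z) = \rho_1 \varphi_1(z) + \rho_2 \varphi_2(z)$ for all $z \in V_1 \cup V_2 \cup E_1 \cup E_2$.

We call H the CPOG composition of H_1 and H_2 and denote this operation as H = H_1 ∪ H_2. Note that if $\rho_1 \cdot \rho_2 \neq 0$ then the composition is undefined, because IS(H_1) and IS(H_2) contain instructions with the same opcode ψ allowed by both restriction functions: ρ_1(ψ) = ρ_2(ψ) = 1. It is possible to formally prove that IS(H) = IS(H_1) ∪ IS(H_2) using algebraic methods⁴ [15], deriving the following important equation:

$$IS(H_1 \cup H_2) = IS(H_1) \cup IS(H_2)$$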
Crucially, the complexity of computing a CPOG composition does not depend on the total number of instructions |IS_1 ∪ IS_2|. It depends only on the sizes of graph specifications H_1 and H_2: $\Theta(|V_1| + |E_1| + |V_2| + |E_2|)$.

Since the number of arcs |E_k| is at most quadratic with respect to |V_k| and |V_k| ≤ |U_V| (all vertices are contained in universe U_V), we have the following upper bound on CPOG composition complexity: $O(|U_V|^2)$. Note that $|U_V|^2$ is potentially much smaller than the number of different instructions⁵, which can be exponential with respect to |V|; in particular, the total number of partial orders on set U_V is greater than $2^{|U_V|^2/4}$.
To conclude, we can operate on the CPOG representations of instruction sets faster than on the instruction sets themselves.
Let us demonstrate specification and composition of instruction sets on the aforementioned processing unit example. Fig. 3(a,b) shows two graphs H_ADD and H_XCHG specifying singleton instruction sets IS(H_ADD) = {(1, P_ADD)} and IS(H_XCHG) = {(0, P_XCHG)}, respectively. Since their restriction functions are orthogonal, $\rho_{ADD} \cdot \rho_{XCHG} = x \cdot \overline{x} = 0$, we can compose them into the graph shown in Fig. 3(c). It specifies the compositional instruction set IS(H_ADD ∪ H_XCHG) = {(1, P_ADD), (0, P_XCHG)} as intended (see Fig. 2).
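The composition of the two singleton specifications can be sketched in Python as follows. This is an illustrative sketch, not the implementation used in the paper: Boolean functions over the single variable x are represented by their sets of satisfying opcodes, and the composed conditions are assumed to take the form $\varphi(z) = \rho_1 \varphi_1(z) + \rho_2 \varphi_2(z)$, with a condition of 0 for elements absent from a graph. The names `make_graph` and `compose` are hypothetical.

```python
# A Boolean function over the single variable x is the set of opcodes
# on which it evaluates to 1.
ALL = frozenset({(0,), (1,)})  # constant 1
X1 = frozenset({(1,)})         # x
X0 = frozenset({(0,)})         # not x

def make_graph(rho, vertices, arcs):
    """Graph in which every listed vertex and arc has condition 1."""
    phi = {z: ALL for z in list(vertices) + list(arcs)}
    return {"rho": rho, "phi": phi}

def compose(h1, h2):
    """Assumed rule: rho = rho1 + rho2, phi(z) = rho1*phi1(z) + rho2*phi2(z)."""
    if h1["rho"] & h2["rho"]:
        raise ValueError("opcode clash: rho1 * rho2 != 0")
    phi = {}
    for z in set(h1["phi"]) | set(h2["phi"]):
        c1 = h1["phi"].get(z, frozenset())  # condition 0 outside the graph
        c2 = h2["phi"].get(z, frozenset())
        phi[z] = (h1["rho"] & c1) | (h2["rho"] & c2)
    return {"rho": h1["rho"] | h2["rho"], "phi": phi}

h_add = make_graph(X1, "abcd", [("a", "c"), ("b", "c"), ("c", "d")])
h_xchg = make_graph(X0, "abde",
                    [("a", "d"), ("a", "e"), ("b", "d"), ("b", "e")])
h = compose(h_add, h_xchg)
```

The cost of `compose` is proportional to the number of vertices and arcs in the two graphs, independent of how many instructions they specify. In this sketch arc c → d receives condition x rather than 1; the two are interchangeable here because the arc can only appear when vertex c, whose condition is x, is present.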
III. TRANSFORMATIONS
In this section we describe several CPOG transformations which allow efficient management of instruction sets. We also discuss the issues associated with physical controller implementation and possible signal-level refinements of the model for capturing synchronous and self-timed control interfaces.
A. Basic graph transformations
Consider a graph H = (V, E, X, ρ, φ). Since elements of the quintuple are shared by all instructions from IS(H), we can make global modifications of the instruction set without iterating over all the instructions. For example, we can add a new action go at the beginning of every instruction by setting V′ = V ∪ {go}, φ(go) = 1, and φ(go → v) = 1 for all v ∈ V. The cost of this global modification is only Θ(|V|); we call transformations of this type event insertions.

⁴ The proof follows from Theorems 1 and 2 of [15], which concern a more restrictive operation, CPOG addition.
⁵ Although this statement does not hold for our simplistic examples, e.g., |V| + |E| = 5 + 7 = 12 and |IS| = 2 in Fig. 3, it does hold in practice. For example, the Intel 8051 microprocessor has 111 instructions but its CPOG representation [17] contains only 13 vertices and 34 arcs (excluding auxiliary go and done). Also, if we do not use abstraction and treat instructions ADD A,B and ADD C,D as different ones, the number of instructions of a typical processor can easily grow to $2^{32}$ while its CPOG will remain compact.
It is possible to introduce a global concurrency reduction between actions a and b, by setting E′ = E ∪ {a → b} and φ(a → b) = 1. As a result, action b will always be scheduled after a in all the instructions. The cost of this transformation is O(1), but it is not safe in general: it can introduce deadlocks if action a is scheduled to happen after b in one of the instructions (forming a cyclic dependency). To ensure deadlock freeness, verification algorithms from [14] must be employed.
Another basic transformation with the global effect is variable substitution. For instance, by replacing every occurrence of x with $\overline{x}$ in all conditions φ and function ρ, we flip the corresponding bit in all instruction opcodes. To perform this operation we need to change $\Theta(|V|^2)$ Boolean functions. Variable substitution is a powerful transformation: it can affect not only a single bit, but all the opcodes; care must be taken to ensure that the resultant opcodes do not clash.
The above transformations are global. It is possible to apply them to a subset of selected instructions using the operations of set extraction and decomposition defined below.
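Event insertion and concurrency reduction can be sketched as follows. The encoding and the function names are hypothetical, and the deadlock check is a naive one that projects the graph under every opcode and searches for a cycle; it stands in for, and is much less efficient than, the verification algorithms of [14].

```python
ONE = lambda op: True
X = lambda op: op["x"] == 1
NOT_X = lambda op: op["x"] == 0

# Assumed encoding of the two-instruction example (addition and exchange).
vertex_cond = {"a": ONE, "b": ONE, "c": X, "d": ONE, "e": NOT_X}
arc_cond = {
    ("a", "c"): X, ("b", "c"): X, ("c", "d"): ONE,
    ("a", "d"): NOT_X, ("a", "e"): NOT_X,
    ("b", "d"): NOT_X, ("b", "e"): NOT_X,
}
OPCODES = [{"x": 0}, {"x": 1}]

def insert_go(vertex_cond, arc_cond):
    """Event insertion: an unconditional 'go' precedes every vertex."""
    new_v = dict(vertex_cond, go=ONE)
    new_a = dict(arc_cond)
    for v in vertex_cond:
        new_a[("go", v)] = ONE
    return new_v, new_a

def reduce_concurrency(arc_cond, u, v):
    """Concurrency reduction: add the unconditional arc u -> v."""
    return {**arc_cond, (u, v): ONE}

def project(vertex_cond, arc_cond, opcode):
    vertices = {v for v, c in vertex_cond.items() if c(opcode)}
    arcs = {(u, v) for (u, v), c in arc_cond.items()
            if c(opcode) and u in vertices and v in vertices}
    return vertices, arcs

def has_cycle(vertices, arcs):
    """Depth-first search for a directed cycle."""
    succ = {v: [] for v in vertices}
    for u, v in arcs:
        succ[u].append(v)
    state = {v: 0 for v in vertices}  # 0 unseen, 1 on stack, 2 finished
    def visit(v):
        state[v] = 1
        for w in succ[v]:
            if state[w] == 1 or (state[w] == 0 and visit(w)):
                return True
        state[v] = 2
        return False
    return any(state[v] == 0 and visit(v) for v in vertices)

def deadlock_free(vertex_cond, arc_cond):
    """Naive check: no projection contains a cyclic dependency."""
    return not any(has_cycle(*project(vertex_cond, arc_cond, op))
                   for op in OPCODES)

go_v, go_a = insert_go(vertex_cond, arc_cond)
safe = deadlock_free(vertex_cond, reduce_concurrency(arc_cond, "a", "b"))
unsafe = deadlock_free(vertex_cond, reduce_concurrency(arc_cond, "d", "a"))
```

Adding a → b is safe in this example, whereas adding d → a creates the cycle a → c → d → a in the addition instruction.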
B. Set-theoretic operations
Instead of looking at the whole instruction set of a processor we may need to focus our attention on its smaller part. As an example, consider the MMIX processor instruction set [11] containing 256 different opcodes. 16 of them, starting with bits 0010, are dedicated to addition/subtraction operations, and we want to manipulate them separately from the others.
Let graph H = (V, E, X, ρ, φ) specify the whole instruction set IS(H) of the processor and 8-bit opcodes be encoded with variables {x_1, ..., x_8}. Function $f = \overline{x_1} \cdot \overline{x_2} \cdot x_3 \cdot \overline{x_4}$ enumerates all Boolean vectors starting with 0010 and its conjunction with ρ enumerates all wanted opcodes. Thus, graph H′ = (V, E, X, f · ρ, φ) specifies the required part of IS(H). There is a dedicated operation in CPOG algebra, called scalar multiplication, intended for this task: $f \cdot H = (V, E, X, f \cdot \rho, \varphi)$. Its main feature is that
∀f, IS(f · H) ⊆ IS(H)
In our context, f can be considered an instruction property and operation f · H can be called a set extraction: it extracts a subset of a given instruction set according to a required property.
A generalisation of this operation is called decomposition. It is easy to see that $H_1 = f \cdot H$ and $H_0 = \overline{f} \cdot H$ together contain all instructions from IS(H): all instructions with opcodes satisfying property f are put into H_1, and all the rest are put into H_0. Thus, any instruction set can be decomposed into two disjoint sets according to a given property. This is formally captured by the following statement:

$$IS(H) = IS(f \cdot H) \cup IS(\overline{f} \cdot H)$$
Set extraction and decomposition are very cheap operations: they only require computation of a conjunction of two Boolean functions f and ρ. 
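The following Python sketch illustrates set extraction and decomposition on 8-bit opcodes. It assumes, purely for illustration, that the restriction function ρ allows all 256 opcodes; the helper names `scalar_mul` and `opcodes` are hypothetical. Only the restriction function is touched, which is why the operations are cheap.

```python
from itertools import product

def rho(op):
    """Assumption: every 8-bit opcode is allowed."""
    return True

def f(op):
    """Property: the opcode starts with bits 0010."""
    return op[:4] == (0, 0, 1, 0)

def scalar_mul(g, restriction):
    """Restriction function of g * H; V, E, X and phi stay unchanged."""
    return lambda op: g(op) and restriction(op)

def opcodes(restriction, n=8):
    """Enumerate allowed opcodes (for checking only, not part of the method)."""
    return {op for op in product((0, 1), repeat=n) if restriction(op)}

rho_1 = scalar_mul(f, rho)                      # extraction: f * H
rho_0 = scalar_mul(lambda op: not f(op), rho)   # the rest: (not f) * H

selected = opcodes(rho_1)
others = opcodes(rho_0)
```

The explicit enumeration in `opcodes` is used here only to confirm that the two parts are disjoint and cover the original set; the transformations themselves never iterate over instructions.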
C. Refinements for control synthesis
As soon as all the intended manipulations with the instruction set are performed, we can proceed to the stage of mapping the resultant CPOG into Boolean equations and produce a physical implementation of the specified microcontroller. In order to descend from the abstract level of atomic actions to the physical level of digital circuits the signal-level refinements are necessary.
To interface with an asynchronous datapath component a it is possible to use the standard request-acknowledgement handshake (req_a, ack_a), as shown in Fig. 4. In case of a synchronous component b the request signal is used to start the computation but, as there is no completion detection, the acknowledgement signal has to be generated using a matched delay [19]. Also, there are cases when a matched delay has to be replaced with a counter connected to the clock signal to provide an accurate multi-cycle delay; see the interface of component c in the same figure. Note that we do not explicitly show synchronisers [10] in the diagram; it is assumed that components b and c are equipped with the necessary synchronisation mechanisms to accept asynchronous requests from the microcontroller.
To explicitly specify handshake signals it is possible to perform a graph transformation explained in , then both vertices are split 6 , etc. Semantically, when an atomic action a 1 is ready for execution, the controller should issue the request signal req_a 1 to component a; then the high value of the acknowledgement signal ack_a 1 will indicate completion of a.
Notice that the microcontroller does not reset handshakes until all of them are complete. This leads to a potential problem: a component cannot be released until the instruction execution is finished. To deal with the problem it is necessary to decouple the microcontroller from the component, see box 'decouple' in Fig. 4 and its gate-level implementation in Fig. 6(a) . Also, when a component b is used twice in an instruction we have to combine two handshakes (req_b 1,2 , ack_b 1,2 ) into one using the merge controller, see Fig. 6(b) . Merge controllers can only be used if the requests are mutually exclusive 7 . If this is not the case, as e.g. for concurrent actions c 1 and c 2 , then we have to set an arbiter guarding access to the component. Its implementation consists of the merge controller and the mutual exclusion (ME) element [10] , see Fig. 6(c) .
Finally, the refined graph can be mapped into Boolean equations. An event associated with vertex v ∈ V is enabled to fire (req_v+ is excited) when all the preceding events u ∈ V have already fired (ack_u have been received) [15] :
where a ⇒ b stands for Boolean implication indicating 'b if a' relation. Mapping is a simple structural operation, however the obtained equations may not be optimal and should undergo the conventional logic minimisation [12] [15] and technology mapping [5] procedures.
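The enabling rule can be sketched in Python under an assumed form of the equation, $req_v = \varphi(v) \cdot \bigwedge_{u \in V} (\varphi(u) \cdot \varphi(u \to v) \Rightarrow ack_u)$. This form is one reading consistent with the description above and is not claimed to be the exact equation of [15]; the graph encoding and the function name `req` are likewise hypothetical.

```python
ONE = lambda op: True
X = lambda op: op["x"] == 1
NOT_X = lambda op: op["x"] == 0

# Assumed encoding of the two-instruction example (addition and exchange).
vertex_cond = {"a": ONE, "b": ONE, "c": X, "d": ONE, "e": NOT_X}
arc_cond = {
    ("a", "c"): X, ("b", "c"): X, ("c", "d"): ONE,
    ("a", "d"): NOT_X, ("a", "e"): NOT_X,
    ("b", "d"): NOT_X, ("b", "e"): NOT_X,
}

def req(v, opcode, acks):
    """Assumed rule: v is requested once every active predecessor has acked."""
    if not vertex_cond[v](opcode):
        return False  # phi(v) = 0: the action is not part of this instruction
    for (u, w), cond in arc_cond.items():
        active = w == v and vertex_cond[u](opcode) and cond(opcode)
        if active and u not in acks:
            return False  # the implication phi(u)*phi(u->v) => ack_u fails
    return True
```

For example, with x = 1 action d waits only for c, while with x = 0 it waits for a and b: the unconditional arc c → d is ignored because φ(c) evaluates to 0.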
It is interesting to note that the size of the microcontroller does not depend on the number of instructions directly. There are $\Theta(|V|^2)$ conditions φ in all the resultant equations; the average size of these conditions is difficult to estimate, but in practice we found that the overall size of the microcontroller never grows beyond $\Theta(|V|^2)$.

IV. CASE STUDY

In this section we study a common low-level GPU instruction, called DP3, which given two vectors x = (x_1, x_2, x_3) and y = (y_1, y_2, y_3) computes their dot product x · y = x_1 · y_1 + x_2 · y_2 + x_3 · y_3. There are many ways to achieve the required functionality in hardware; consider the following datapath components (denoted by a . . . e) which can be used to fulfil this task: a) 2-input adder; b) 3-input adder; c) 2-input multiplier; d) fast 2-input multiplier; e) dedicated DP3 unit. Similar to the Energy Token model [18], we associate two attributes, execution latency and power consumption, with every component. Fig. 7 visualises them as labelled boxes, whose dimensions correspond to their attributes; the area of a box represents energy required for the computation.
Depending on the current operation mode and availability of the components, the processor has to schedule their activation in the appropriate partial order. Fig. 8 lists several possible partial orders together with their power/latency profiles.
Fastest implementation: the fastest way to implement the instruction is to compute multiplications tmp k = x k · y k concurrently using three fast multipliers d1-d3 and then compute the final result tmp 1 +tmp 2 +tmp 3 with a 3-input adder b; see Fig. 8(a) . This implementation has a very high cost in terms of peak power and thus may not always be affordable.
Least peak power implementation: a directly opposite scheduling strategy is shown in Fig. 8(b) . Three multiplications are performed sequentially on the same slow multiplier c1, followed by 3-input addition b. This strategy has the largest latency among all presented because it is completely sequential and uses slow power-saving components. On a positive side, this implementation requires only two basic functional blocks, which are likely to be reused by other instructions, so its component utilisation is high.
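The contrast between the fastest and the least peak power schedules can be made concrete with a small Python sketch. The latency and power values below are hypothetical placeholders chosen only for illustration (no numbers are given in the text), and `profile` is a hypothetical helper that schedules every action as soon as its predecessors finish.

```python
# Hypothetical (latency, power) attributes per component type:
# b = 3-input adder, c = slow multiplier, d = fast multiplier.
ATTR = {"b": (2, 3), "c": (4, 1), "d": (2, 4)}

def profile(actions, arcs):
    """actions: {action: component}; arcs: precedence pairs (u, v).
    Returns (latency, peak power) of the as-soon-as-possible schedule."""
    start = {}
    def begin(a):
        if a not in start:
            start[a] = max((begin(p) + ATTR[actions[p]][0]
                            for p, q in arcs if q == a), default=0)
        return start[a]
    end = {a: begin(a) + ATTR[actions[a]][0] for a in actions}
    latency = max(end.values())
    # Total power can only rise at a start time, so sampling there suffices.
    peak = max(sum(ATTR[actions[a]][1] for a in actions
                   if start[a] <= t < end[a])
               for t in start.values())
    return latency, peak

# Three fast multipliers in parallel, then the 3-input adder.
fastest = profile({"d1": "d", "d2": "d", "d3": "d", "sum": "b"},
                  [("d1", "sum"), ("d2", "sum"), ("d3", "sum")])

# One slow multiplier used three times in sequence, then the adder.
least_peak = profile({"m1": "c", "m2": "c", "m3": "c", "sum": "b"},
                     [("m1", "m2"), ("m2", "m3"), ("m3", "sum")])
```

With these placeholder values the parallel schedule finishes in 4 time units at a peak power of 12, while the sequential one takes 14 time units at a peak power of 3.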
Use of a dedicated component: it is possible that the chosen hardware platform contains a dedicated computation unit capable of computing dot product of two vectors, e.g. Altera Cyclone III FPGA board allows building a functional block (3) with three multipliers connected to a 3-input adder. We can directly execute this block without any scheduling -see Fig. 8(c) . While being convenient and potentially very efficient due to custom design, such solution is not always justified because of low component utilisation: it is impossible to reuse the built-in multipliers for implementing other instructions and if DP3 is rarely used by software then this dedicated component will be wasting area and power (due to the leakage current) most of the time. Moreover, such implementation does not allow any real-time rescheduling thereby being less flexible.
Fast implementation with limited resources: if there are only two available multipliers c1 and c2 (either because of hardware limitations or because other multipliers are busy at the moment) then the fastest possible scheduling strategy is as follows. At first, two multiplications are performed in parallel. Then their results are fed to the 2-input adder a, while c1 is restarted to compute the third multiplication. Finally, the obtained results are added together by the same adder a, as shown in Fig. 8(d).
Balanced solution: Fig. 8(e) presents a balanced strategy, which aims to spread power consumption evenly over time, while being relatively fast. This schedule may be advantageous for the best energy utilisation and in security applications.
The point is to demonstrate that even such a basic instruction as DP3 has a lot of valid scheduling strategies with distinct characteristics. Importantly, it is not possible to select a single best strategy, because none of them is superior to the others in all respects (latency, peak power, component utilisation, flexibility). Therefore including only one of them into a processor instruction set is a serious compromise which should not be made at this early and abstract stage of the design process. We propose to include as many different implementations into the instruction set as possible and, if needed, to reduce the behavioural spectrum at the later design stages when more information is at hand (some final decisions can even be made during runtime by dynamic processor reconfiguration). The CPOG model is well suited for this task: it can represent a multitude of different implementations of the same instruction efficiently. If the instruction is intended to have only one opcode, we can distinguish between its different implementations using mode and configuration variables. They are not part of the opcode (which is fetched from the program memory during software execution), but can be dynamically changed by the power/latency runtime control mechanisms [20] or be statically set to constants according to the limitations of the actual hardware platform, as shown in Fig. 9.
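To make the latency/peak-power trade-off between the schedules more tangible, the following minimal Python sketch evaluates three of the partial orders described above (fastest, least peak power, and two slow multipliers). It is an illustration only: the (latency, power) numbers in `ATTRS`, the event names such as `c1.2`, and the function `profile` are assumptions introduced here, since the actual component attributes are given only graphically in Fig. 7. Each event is assumed to start as soon as all its predecessors have finished.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# HYPOTHETICAL (latency, power) per component type; not taken from the paper.
ATTRS = {
    "a": (1, 1),  # 2-input adder
    "b": (2, 1),  # 3-input adder
    "c": (4, 1),  # 2-input multiplier (slow, low power)
    "d": (2, 3),  # fast 2-input multiplier
}

# A schedule maps an event name to (component type, set of predecessor events).
FASTEST = {
    "d1": ("d", set()), "d2": ("d", set()), "d3": ("d", set()),
    "b": ("b", {"d1", "d2", "d3"}),
}
LEAST_POWER = {
    "c1.1": ("c", set()), "c1.2": ("c", {"c1.1"}), "c1.3": ("c", {"c1.2"}),
    "b": ("b", {"c1.3"}),
}
LIMITED = {
    "c1.1": ("c", set()), "c2.1": ("c", set()),
    "a.1": ("a", {"c1.1", "c2.1"}),   # add the first two products
    "c1.2": ("c", {"c1.1"}),          # c1 is restarted for the third product
    "a.2": ("a", {"a.1", "c1.2"}),    # final addition on the same adder
}

def profile(schedule):
    """Return (latency, peak_power) when every event starts as early as possible."""
    preds_of = {ev: preds for ev, (_, preds) in schedule.items()}
    start, finish = {}, {}
    for ev in TopologicalSorter(preds_of).static_order():
        kind, preds = schedule[ev]
        start[ev] = max((finish[p] for p in preds), default=0)
        finish[ev] = start[ev] + ATTRS[kind][0]
    # Total power can only rise when an event starts, so sampling the start
    # times is enough to find the peak.
    peak = max(
        sum(ATTRS[schedule[e][0]][1] for e in schedule if start[e] <= t < finish[e])
        for t in start.values()
    )
    return max(finish.values()), peak

for name, s in [("fastest", FASTEST), ("least power", LEAST_POWER), ("limited", LIMITED)]:
    print(name, profile(s))
```

With these made-up attributes the fastest schedule has the shortest latency and the highest peak power, the fully sequential schedule has the longest latency and the lowest peak power, and the two-multiplier schedule lies in between, which mirrors the qualitative comparison above. Different attribute values would of course change the numbers.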
We can specify all the discussed implementations of the DP3 instruction using a single CPOG. To do that we first have to encode all of them. If there are no requirements on the mode/configuration codes, then a designer is free to assign them arbitrarily; however, the choice may affect CPOG complexity and, as a consequence, the complexity of the resultant microcontroller. In this case it is possible to resort to automated optimal encoding methods [13] (we used the WORKCRAFT framework [1] for CPOG modelling and encoding), which generate codes $\psi_1 = 001$, $\psi_2 = 011$, $\psi_3 = 000$, $\psi_4 = 111$, and $\psi_5 = 101$ for the five partial orders depicted in Fig. 8 (note that these optimal codes are far from the trivial sequence of binary codes 000-100). If we compose all of them into a single CPOG using the method from Section II, we obtain the graph shown in Fig. 10(a). The mode/configuration variables are denoted as $X = \{x, y, z\}$, and two intermediate variables $\{p, q\}$ are derived from them to simplify other graph conditions; as a result only seven 2-input gates are required to compute all graph conditions.
Figure 10: CPOG specification of DP3 instruction
The obtained graph is a superposition of the given partial orders, i.e. all of them can be visually identified in it; see, for example, Fig. 10(b), which shows the balanced implementation generated by code $\psi_5$, and compare it with the partial order in Fig. 8(e). For a designer this gives a useful higher-level picture which brings out interaction between the components much better than separate partial order diagrams (this is similar to a metro map which represents a set of metro lines in a compact understandable form).
V. EXPERIMENTS
This section demonstrates specification of an instruction set with three instructions running under two different operation modes and on two different hardware platforms.
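The superposition of encoded DP3 partial orders into one graph (Fig. 10) can be illustrated with a small Python sketch. It is not the encoding or synthesis procedure of [13]; it only shows the idea that each vertex and arc carries a condition selecting the codes under which it is present. Here a condition is stored simply as the set of codes that include the element (equivalent to a sum of minterms over the used codes), whereas the real CPOG stores minimised Boolean functions of x, y, z. The event names, the helper functions `order`, `compose` and `project`, and the pairing of the codes 001, 011, 000, 111 with the first four panels of Fig. 8 are assumptions; the balanced order of Fig. 8(e) is omitted because its arcs are not described in the text.

```python
def order(arcs, extra_vertices=()):
    """Build a partial order as (vertices, arcs) from a list of arcs."""
    vertices = set(extra_vertices) | {v for arc in arcs for v in arc}
    return vertices, set(arcs)

# Four DP3 implementations keyed by their (assumed) mode/configuration codes.
ORDERS = {
    "001": order([("d1", "b"), ("d2", "b"), ("d3", "b")]),                # fastest
    "011": order([("c1.1", "c1.2"), ("c1.2", "c1.3"), ("c1.3", "b")]),    # least peak power
    "000": order([], extra_vertices=["e"]),                               # dedicated unit
    "111": order([("c1.1", "a.1"), ("c2.1", "a.1"), ("c1.1", "c1.2"),
                  ("a.1", "a.2"), ("c1.2", "a.2")]),                      # two multipliers
}

def compose(orders):
    """Superpose partial orders: map each vertex/arc to the codes that contain it."""
    conditions = {}
    for code, (vertices, arcs) in orders.items():
        for element in vertices | arcs:
            conditions.setdefault(element, set()).add(code)
    return conditions

def project(cpog, code):
    """Recover the partial order selected by a given code."""
    present = [el for el, codes in cpog.items() if code in codes]
    vertices = {el for el in present if isinstance(el, str)}
    arcs = {el for el in present if isinstance(el, tuple)}
    return vertices, arcs

cpog = compose(ORDERS)
print(sorted(cpog["b"]))          # codes under which the 3-input adder is active
print(project(cpog, "111"))       # the two-multiplier schedule, recovered
```

Shared elements, such as the adder b or the arc from the first to the second use of c1, appear once in the composed graph with a condition covering several codes; this sharing is what keeps the single-graph description compact.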
In addition to instruction DP3 described in the previous section, we consider two more instructions, namely ADD and MAD, which are traditionally supported by most GPUs at the assembly level. Instruction ADD computes 4-component vector sum $x + y = (x_1 + y_1, x_2 + y_2, x_3 + y_3, x_4 + y_4)$, i.e. it executes four addition operations $add_k = x_k + y_k$ independently, while MAD performs a more sophisticated task, executing four independent multiplications followed by additions: $mad_k = x_k \cdot y_k + z_k$. Fig. 11 shows the complete instruction set split into four mode/configuration sets. Hardware configuration 0 contains four multiplication and four addition components (denoted by m1-m4 and a1-a4), while configuration 1 contains only a pair of each component type (m1, m2, a1, and a2), thus being less flexible in terms of possible scheduling strategies but better in terms of component utilisation.
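For reference, the functionality of the three instructions can be written down as a few lines of Python. This is only a functional model of the formulas above (the function names are chosen here for illustration); it says nothing about how the datapath components are scheduled.

```python
def dp3(x, y):
    """Dot product of two 3-component vectors."""
    assert len(x) == len(y) == 3
    return sum(xk * yk for xk, yk in zip(x, y))

def add(x, y):
    """Component-wise sum of two 4-component vectors: add_k = x_k + y_k."""
    assert len(x) == len(y) == 4
    return tuple(xk + yk for xk, yk in zip(x, y))

def mad(x, y, z):
    """Component-wise multiply-add: mad_k = x_k * y_k + z_k."""
    assert len(x) == len(y) == len(z) == 4
    return tuple(xk * yk + zk for xk, yk, zk in zip(x, y, z))

print(dp3((1, 2, 3), (4, 5, 6)))
print(add((1, 2, 3, 4), (10, 20, 30, 40)))
print(mad((1, 2, 3, 4), (2, 2, 2, 2), (1, 1, 1, 1)))
```

The four components of ADD and MAD are mutually independent, which is why their schedules can range from fully parallel to fully sequential depending on the mode and the available components.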
There are two operation modes: mode 0 aims to maximise performance of the processor by high parallelism, while mode 1 executes the instructions under limited power availability. The power limit was set to allow concurrent execution of a multiplier and an adder, or concurrent execution of three adders. We use a subscript to denote the mode/configuration code of an instruction, e.g. $\mathrm{DP3}_{01}$ stands for the implementation of instruction DP3 intended for use in mode 0 and configuration 1.
All the instructions have been composed into a single instruction set IS; the corresponding CPOG H is shown in Fig. 12(a). Opcodes $\psi_{DP3} = 01$, $\psi_{ADD} = 00$, and $\psi_{MAD} = 11$ have been automatically generated by the optimal encoding procedure [13]. There are four variables in the complete instruction code: $X = \{x, y, m, c\}$, where x and y are the opcode bits, while m and c stand for the mode and configuration bits, respectively; for example, instruction $\mathrm{MAD}_{01}$ has code 1101. Since opcode (x, y) = (1, 0) is not used, the restriction function $\rho = \overline{x} + y$ forbids it.
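The structure of the 4-bit instruction code can be sketched in Python as follows. The opcodes are those listed above; the helper names `encode`, `decode` and `rho` are introduced here for illustration, and the restriction function is written as "not x, or y", i.e. the function that is false exactly on the unused opcode (x, y) = (1, 0).

```python
OPCODES = {"DP3": "01", "ADD": "00", "MAD": "11"}
NAMES = {bits: name for name, bits in OPCODES.items()}

def rho(x, y):
    """Restriction on opcode bits: false only for the unused opcode (1, 0)."""
    return bool((not x) or y)

def encode(name, mode, config):
    """Concatenate opcode bits (x, y) with the mode bit m and configuration bit c."""
    return OPCODES[name] + str(mode) + str(config)

def decode(code):
    """Split a 4-bit code into (instruction, mode, configuration)."""
    x, y, m, c = (int(bit) for bit in code)
    if not rho(x, y):
        raise ValueError(f"opcode {code[:2]} is forbidden by the restriction function")
    return NAMES[code[:2]], m, c

print(encode("MAD", 0, 1))
print(decode("0110"))
```

Keeping the mode and configuration bits outside the opcode, as in this sketch, reflects the point made earlier: the opcode comes from program memory, while m and c may be driven by runtime control or tied to constants for a given platform.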
We synthesised a microcontroller for each configuration by using decomposition into $IS_0 = IS(\overline{c} \cdot H)$ and $IS_1 = IS(c \cdot H)$, followed by mapping of the obtained instruction sets into Boolean equations, as explained in Section III. These equations were imported into Altera Quartus II design kit for logic minimisation and technology mapping into an FPGA board from the Cyclone III family; we used 32-bit multipliers and 64-bit adders in our design. Fig. 12(b) shows the reduced instruction set $IS_1$ after logic minimisation [15] (conditions containing variable c were minimised).
Both microcontrollers have been tested to confirm the correct implementation of each instruction in terms of its functionality and the proper activation order of the datapath components. Latency and peak power of each instruction have been measured using Quartus II analysis tools and are reported in Table II. As expected, in mode 0 all the instructions are executed faster but at the expense of higher peak power (up to 0.72 mW); in mode 1, on the other hand, the peak power never gets higher than 0.26 mW. In configuration 1 the difference between the modes is smaller, because there are not enough hardware components to take advantage of maximum parallelism. Note that we were unable to perform peak power measurements with a good accuracy using PowerPlay Analyzer (a part of Quartus II toolkit), therefore the figures in Table II should be considered as a guide only.
Table II: Latency and peak power of instructions
