Complex hardware systems can be designed by breaking down their behaviour into high-level descriptions of constituent scenarios and then composing these scenarios into an efficient hardware implementation using a form of high-level synthesis. There are a few existing methodologies for such scenario-based specification and synthesis, and in this paper we focus on highly concurrent systems, whose scenarios are typically described using explicit concurrency models such as partial orders.
I. INTRODUCTION
Hardware systems grow more complex every year: processors gain new features and application-specific instructions [1], and the number of processing cores and other IP components steadily increases following the need for IP reuse [2]. Conventional approaches to the development of hardware systems rely on HDL system descriptions, which require designers to deal with low-level implementation details. As complexity increases, it is convenient to raise the level of abstraction, which simplifies system representation, enables automatic hardware synthesis and, consequently, increases productivity [3]; see [4], [5] for examples.
In this work, we use the high-level methodology based on the Conditional Partial Order Graph (CPOG) [6] formalism to design hardware architectures in the control domain. This methodology, originally conceived for the design of processor instruction set architectures (ISAs) [7], is supported by an automated hardware synthesis flow, and by algorithms for the derivation of efficient hardware implementations [8]. However, previously published algorithms do not scale to large numbers of behavioural scenarios and have no support for composition constraints, which are important for real-life systems heavily relying on IP reuse. This motivates our research.
The paper comprises the following sections.
• Background: Section II reviews the CPOG formalism and the related methodology [7] .
• Related work: Section III compares the methodology with other approaches in the field of behavioural synthesis, and reviews existing algorithms for efficient composition of scenarios.
• Scenario composition algorithm: Section IV presents our main contribution: a new algorithm for CPOG composition that scales to systems comprising hundreds of partial order scenarios and supports composition constraints.
• Design automation: Section V describes the developed open-source tool SCENCO [9], which is integrated in the WORKCRAFT framework [10] as an external plugin and implements the CPOG methodology.
• Algorithm and tool validation: Section VI validates the presented contributions on a set of benchmarks that includes ad-hoc controllers, processor instruction sets, and software output logs.
We discuss achieved results and future research in Section VII.
This paper is an extended version of [11] and includes the following changes. We review the CPOG-based design methodology in Section II-B and summarise differences with other existing behavioural synthesis approaches in Section III-A. The new algorithm for scenario composition is described in greater detail: in particular, we discuss how to reduce the space of possible solutions to improve the performance of the algorithm (Section IV-A), describe how the composition algorithm handles constraints (Sections IV-B and IV-D), and analyse the algorithm's correctness and complexity (Sections IV-E and IV-F). We describe how to synthesise the interface between the controller and the controlled datapath modules in Section V-A. Finally, the presented algorithm and tool are evaluated on an extended set of benchmarks in Section VI.
II. BACKGROUND
Complex systems are designed by breaking them down into their constituent behaviours, or scenarios. In this paper, a scenario is a list of operations that are executed in a specified order. Formally, a scenario s = (O, ≺) is a partial order (PO) [12], i.e. a binary precedence relation ≺ describing dependencies between a set of operations O that satisfies two properties:
• Irreflexivity: ∀a ∈ O, ¬(a ≺ a)
• Transitivity: ∀a, b, c ∈ O, (a ≺ b) ∧ (b ≺ c) ⇒ (a ≺ c)
A scenario specification formally captures the behaviour of a system by the set of its constituent scenarios S = {s 1 , .., s n }.
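The two properties above can be checked mechanically. The following is a minimal sketch, assuming a scenario is stored as a set of operation names plus a set of ordered pairs; the helper names `transitive_closure` and `is_partial_order` and the three-operation scenario are illustrative assumptions, not part of the methodology.

```python
from itertools import product


def transitive_closure(arcs):
    """Return the smallest transitive relation containing `arcs`."""
    closure = set(arcs)
    while True:
        # Chain every pair (a, b), (b, d) into the implied pair (a, d).
        extra = {(a, d) for (a, b), (c, d) in product(closure, closure) if b == c}
        if extra <= closure:
            return closure
        closure |= extra


def is_partial_order(operations, prec):
    """Check that `prec` is irreflexive and transitive on `operations`."""
    irreflexive = all((a, a) not in prec for a in operations)
    transitive = all(
        (a, c) in prec
        for (a, b), (b2, c) in product(prec, prec)
        if b == b2
    )
    return irreflexive and transitive


# Hypothetical scenario: a precedes b, and b precedes c.
operations = {"a", "b", "c"}
arcs = {("a", "b"), ("b", "c")}
scenario = (operations, transitive_closure(arcs))
```

Here the raw arc set is not itself a partial order because the implied pair (a, c) is missing; taking the transitive closure adds it. A cyclic dependency would produce a reflexive pair in the closure and be rejected.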
As an example, the behaviour of a processor can be specified by the set of instructions it can execute, see Figure 1a.
Fig. 1: Subfigure (a) shows a scenario specification comprising two processor instructions whose behaviour is expressed using partial orders. Scenario s 1 corresponds to the arithmetic instruction that fetches an instruction from the program memory, decodes it, loads the two operands concurrently (loadA and loadB), uses them to perform an arithmetic operation (ALU), and subsequently saves the result into the memory via the saveMEM operation. Scenario s 2 is the unconditional branch, which takes one operand to compute the jump address (ALU) and saves the result into the program counter register (savePC). Subfigures (b-c) show two approaches to behavioural composition of scenarios.
We use Conditional Partial Order Graphs (CPOGs) (reviewed in Section II-A) for representation of scenario specifications. CPOGs are supported by an efficient scenario composition methodology that takes advantage of the similarities between scenarios. The methodology will be described in detail in Section II-B, but here we provide an intuitive explanation of what we mean by 'efficient composition'. Figure 1b shows an inefficient composition where each scenario is synthesised in isolation and the right scenario is selected by means of (de)multiplexers at runtime. A more efficient approach consists of deriving a hardware implementation where system resources and common parts of behaviour are shared, as shown in Figure 1c. This paper presents a new approach to composition of scenarios for deriving efficient hardware implementations of control circuits, such as interface controllers and processor instruction decoders.
A. Conditional Partial Order Graphs
A CPOG is a collection of scenarios in the form of partial orders. Formally, a CPOG [6] is a tuple H = (V, E, B, φ):
• V is a set of vertices which correspond to operations (or events) in a modelled system.
• E ⊆ V × V is a set of arcs which represent dependencies between the operations.
• B is a set of Boolean variables.
• φ assigns a Boolean condition φ(z) over the variables B to every vertex and arc z ∈ V ∪ E.
CPOGs can be represented graphically: vertices are depicted as circles, and arcs are depicted as arrows →. Vertices and arcs are labelled by their conditions φ(z). For example, Figure 2 shows a CPOG with two possible projections on top (we define projections in Section II-B3). The purpose of conditions φ is to switch vertices and arcs on (off) when the conditions on them are (not) satisfied. We use dashed circles and arrows to represent vertices and arcs that have been switched off by their conditions.
The example in Figure 2 shows that a CPOG can be used to represent multiple behavioural scenarios compactly by overlaying their common parts. In practice CPOGs remain compact and easy to understand even when the number of scenarios increases, making the formalism suitable for representing a large class of hardware systems.
B. Design methodology
This section reviews the design methodology based on the CPOGs [7] , see Figure 3 . The scenarios of a system are formally specified as a scenario specification (in the form of a set of partial orders). Scenarios are composed into a system specification (in the form of a CPOG), which represents the complete system behaviour. The latter is used to synthesise a hardware controller (in the form of gate-level description in Verilog). The presented approach enables the specification of composition constraints (in the form of codes). The controller is then automatically interfaced to the specified datapath modules in the final system implementation.
As a running example, the methodology is applied to the system described by the scenarios in Figure 1a .
1) Scenario specification: A hardware system is described by a collection of scenarios, each in the form of a partial order. Vertices and arcs constitute the basic elements of these graphs, where vertices represent system operations (or events), and arcs represent dependencies between them.
System scenarios can be specified either graphically (see Section V) or textually in a file. Text files containing scenarios are parsed, and each scenario is converted into a graph. As an example, text-based descriptions of the scenarios in Fig. 1a are shown below.
The effort required by engineers to produce such scenario specifications is high, and it is desirable to extract scenarios from higher-level descriptions. There are several examples of high-level specification languages targeting processor architectures, e.g. see Arm's Architecture Specification Language [13] and Sail [14] . This aspect of automation is outside the scope of this paper; we refer the reader to [15] for a relevant example.
2) Scenario encoding: Scenario encoding is the process of finding an injective function between a set of scenarios and a set of codes. Let n be the number of scenarios. The following definitions will be used to formally state the CPOG encoding problem.
• S is the set of scenarios {s 1 , s 2 , ..., s n } described as POs. • C is the universe of codes {c 1 , c 2 ..., c |C| } satisfying the following two properties:
given a set of Boolean variables B = {b 1 , b 2 }, the corresponding code universe is C(B) = {00, 01, 10, 11}. • Encoding is a set of n pairs {(s 1 , c 1 ), ..., (s n , c n )}, where each scenario s i is encoded by the code c i , such that:
The arithmetic instruction scenario s 1 and the unconditional branch scenario s 2 in Fig. 1a can be encoded by one Boolean variable B = {b}, with the code universe C(B) = {0, 1}. The encoding illustrated in Fig. 2 is e = {(s 1 , 0), (s 2 , 1)}.
Different encodings lead to different CPOGs, and consequently to different hardware implementations, see next sections.
3) Composition: Let e = {(s 1 , c 1 ), ..., (s n , c n )} be a scenario encoding for a CPOG H = (V, E, B, φ). The following definitions will be used to formally state the CPOG synthesis problem.
• A projection H| ci applies the code c i to all Boolean conditions of H. The result is a graph H i , whose vertex/arc conditions are now fully evaluated to 1 or 0, see Figure 2.
• The operation scen(H i ) removes vertices and arcs with 0 condition, and applies the transitive closure to the resulting graph, obtaining the scenario s i .
The purpose of the above definitions is to let a code c i select a scenario s i from the CPOG:
$$\mathrm{scen}(H|_{c_i}) = s_i, \quad 1 \le i \le n.$$
The CPOG synthesis process uses the encoding e to synthesise the CPOG H. It produces the encoding functions F(B) = {f 1 , f 2 , ..., f n }, so that the code c i ∈ e selects the scenario s i ∈ e. Following [6], we represent the CPOG H as the following linear combination of projections:
$$H = f_1 H|_{c_1} + f_2 H|_{c_2} + \dots + f_n H|_{c_n}.$$
The CPOG synthesis requirement is satisfied if the encoding functions are orthogonal (f i f j = 0, 1 ≤ i < j ≤ n), and are not contradictions, i.e. f i ≠ 0 for all 1 ≤ i ≤ n.
As an example, consider the encoding {(s 1 , 0), (s 2 , 1)} of the scenarios in Figure 1a. The resulting CPOG should be in the form of H = f 1 H| c1 + f 2 H| c2 such that scen(H| c1 ) = s 1 and scen(H| c2 ) = s 2 . The CPOG is represented by the linear combination H = ¬b H| 0 + b H| 1 , and the encoding functions f 1 = ¬b and f 2 = b satisfy the synthesis requirement: they are orthogonal (¬b · b = 0) and neither is a contradiction. Figure 2 shows the resulting CPOG H at the bottom, and the projections H| b=0 and H| b=1 on the top. The CPOG represents the system specification.
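The projection and scen operations can be illustrated with a small sketch. It assumes a CPOG stored as two dictionaries mapping vertices and arcs to conditions, each condition being a function of the code. The graph below is a hypothetical, simplified overlay of an arithmetic and a branch scenario (fetch and decode are omitted); it is not a transcription of Figure 2, and the names `project`, `scen`, and `cpog` are illustrative.

```python
def project(cpog, code):
    """H|c: evaluate every vertex and arc condition under the given code."""
    vertices = {v: bool(cond(code)) for v, cond in cpog["vertices"].items()}
    arcs = {a: bool(cond(code)) for a, cond in cpog["arcs"].items()}
    return vertices, arcs


def scen(projection):
    """Drop elements whose condition is 0, then take the transitive closure."""
    vertices, arcs = projection
    ops = {v for v, on in vertices.items() if on}
    prec = {(u, v) for (u, v), on in arcs.items() if on and u in ops and v in ops}
    while True:
        extra = {(a, d) for (a, b) in prec for (c, d) in prec if b == c} - prec
        if not extra:
            return ops, prec
        prec |= extra


# Conditions over a single Boolean variable b (an assumption of this sketch).
always = lambda code: True
not_b = lambda code: not code["b"]
is_b = lambda code: code["b"]

cpog = {
    "vertices": {"loadA": always, "loadB": not_b, "ALU": always,
                 "saveMEM": not_b, "savePC": is_b},
    "arcs": {("loadA", "ALU"): always, ("loadB", "ALU"): not_b,
             ("ALU", "saveMEM"): not_b, ("ALU", "savePC"): is_b},
}

arithmetic = scen(project(cpog, {"b": 0}))  # selected by code 0
branch = scen(project(cpog, {"b": 1}))      # selected by code 1
```

With code 0 the sketch keeps loadB and saveMEM and drops savePC; with code 1 it does the opposite, so the shared vertices loadA and ALU are stored only once.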
4) Hardware synthesis: The hardware synthesis step of the design flow extracts a set of Boolean equations from the derived CPOG, obtaining an implementation of the controller. Its area, latency and power strongly correlate with the CPOG complexity [16] , defined as the number of Boolean literals of conditions φ. An operation v ∈ V can be executed if: 1) it belongs to the current projection, i.e. φ(v) = 1; 2) all preceding vertices have already been executed:
This is captured in terms of Boolean equations as follows:
where (u, v) is the arc from u to v, req(v) is the request signal which activates the v operation, while ack(u) is the acknowledgement signal which comes from the u operation, and indicates its completion. As an example, the hardware implementation (in the form of Boolean equations) of the CPOG in Figure 2 is shown below:
Signals go and done are automatically added to the set of operations to delimit the start and the end of a scenario execution. The above Boolean equations are used for the synthesis of the gate-level description of the system hardware controller (in Verilog), which complies with its scenario specification. Finally, the controller can be connected to the specified synchronous or asynchronous datapath modules automatically, see Section V-A, and the final system implementation can be further processed by conventional EDA tools.
III. RELATED WORK

A. Behavioural synthesis
Behavioural synthesis is not new and several other approaches exist that allow the designer to formally describe the behaviour of a controller and synthesise the corresponding hardware implementation. The most relevant approaches are: the work by Cortadella et al. [17] that is based on Signal Transition Graphs (STG) as the formal specification model and produces asynchronous controllers; and the work by De Micheli [18] that uses synchronous Finite State Machines (FSM) to derive controllers implemented as microcode memories or hard-wired control units. In [6] , CPOGs were compared to STGs and FSMs in terms of their compactness and ease of use when specifying asynchronous circuits. Below we highlight the main reasons for using CPOGs in the broader context of scenario-based synthesis.
• The separation of datapath (scenarios) and control (encoding) abstraction layers enables scenarios to remain unchanged when the encoding changes.
• Underlying partial orders can efficiently represent highly concurrent systems without incurring exponential state explosion.
• Scenario composition allows CPOGs to remain compact even when the size of the specification grows.
• Opportunity to minimise various design criteria (e.g. area, power, latency) by scenario encoding, which is our main goal.
In this paper, we compare CPOGs, STGs and FSMs practically by synthesising real scenario-based specifications. Our benchmarks, evaluated in Section VI, highlight that: (1) the STG methodology does not scale to specifications that include many scenarios, (2) the presented approach shows better results than the FSM methodology.
As an example of specifications, Fig. 4a and 4b show an STG and FSM model of the processor scenarios in Fig. 1a . In these figures: red transitions are the inputs of the designed controller, blue ones are the outputs and green ones are dummy transitions (used to simplify the model). In the STG in Figure 4a , the two scenarios are mutually excluded via the choice place p1, and the causality dependencies of their operations are modelled via sequences of request/acknowledge transitions. The two scenarios are encoded by the same encoding used in Figure 2 : {(s 1 , 0), (s 2 , 1)} on one bit b. STG specifications are handled by the EDA tools Petrify [19] and MPSat [20] , which synthesise asynchronous implementations using different algorithms. Petrify uses binary decision diagrams [17] , while MPSat uses Petri net unfoldings [20] .
In the FSM specification in Figure 4b, the two scenarios are selected via one bit b observed at the rising edge of the go signal, which starts the computation. Upon the completion of each scenario, all output requests r all are reset, and the FSM returns to the initial state s0 when all input acknowledgements a all are also reset. Such specifications are described in VHDL as FSMs, and are handled by Design Compiler [21] to derive synchronous controllers. We applied concurrency reduction [6] to some of the considered FSM specifications to avoid state explosion, see benchmarks in [9].
B. CPOG scenario composition
The characteristics of the synthesised hardware controller correlate with the encoding selected [8] . In this work, we present a metric for extracting such a correlation, and an algorithm for approaching the efficient behavioural composition heuristically. In this section, we report other encoding techniques available for the efficient composition of scenarios into a CPOG.
The Single-literal encoding [16] is based on the graph colouring algorithm [22]. It finds an encoding under the constraint that each Boolean equation φ(z) of the synthesised CPOG can have at most 1 literal. The number of Boolean variables |B| determines the colours available for solving the graph colouring problem, and can be increased above ⌈log 2 |S|⌉ automatically.
The SAT-based encoding [8] uses SAT solvers (CLASP [23] or MINISAT [24]) for minimising the synthesised CPOG Boolean equations. The number of Boolean variables |B| for encoding is set by the user. In this paper, we set |B| = ⌈log 2 |S|⌉. In Section VI, we show that the above approaches do not scale well to large numbers of scenarios (|S| > 15).
IV. SCENARIO COMPOSITION ALGORITHM
The optimal scenario encoding problem is NP-complete [16]. Finding the encoding that optimises a target hardware characteristic can only be achieved by synthesising and comparing all available encodings. In practice, this exhaustive search is infeasible due to the exponential growth of the number of available encodings |E| when either the number of scenarios |S| or the number of codes |C| increases (|E| is defined in Section IV-A). This motivates the proposed Heuristic encoding, described in this section.
Table I: Symmetric encodings derivable from e 1 , e.g. e 2 is symmetric to e 1 , as it can be obtained by negating the Boolean variable b 1 in all the codes in e 1 .
A. Symmetric encodings
It is inefficient to inspect encodings that result in similar hardware implementations. This is the case for symmetric encodings, which are best explained by an example. The encoding e 1 = {(s 1 , 00), (s 2 , 01), (s 3 , 10), (s 4 , 11)} has three symmetric encodings: e 2 , e 3 and e 4 , see examples in Table I. A symmetric encoding can be obtained by negating one or more Boolean variables in all the codes of an encoding. We do not consider symmetric encodings, as the corresponding implementations differ only in terms of input inverters, which is insignificant. To rule out symmetric encodings, we always encode the first scenario by the first available code, e.g. the zero code 00..0:
The symmetry-breaking allows us to restrict the universe of allowed encodings E = {e 1 , e 2 , ..., e |E| } to the set that satisfies the two properties below: 1) All encodings are different: e i ≠ e j for 1 ≤ i < j ≤ |E|.
2) No two encodings e i and e j are symmetric. Given |S| scenarios and |C| codes, the size of the universe of encodings is:
Note that at least ⌈log 2 |S|⌉ Boolean variables are needed to encode |S| scenarios (|B| ≥ ⌈log 2 |S|⌉). In this paper, we fix the number of such variables to the minimum, and restrict |E| using |C| = 2^⌈log 2 |S|⌉ codes (see Section II-B2).
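The effect of symmetry breaking on the size of the search space can be checked by brute force for small cases. The sketch below enumerates all injective encodings, maps each one to the representative whose first code is 00..0, and counts the distinct representatives. The closed form (|C| − 1)! / (|C| − |S|)! used for comparison is an assumption of this sketch, chosen because it agrees with the 5040 encodings reported for 8 scenarios on 8 codes in Section IV-C; the function names are illustrative.

```python
from itertools import permutations, product
from math import factorial


def code_universe(num_bits):
    """All codes over `num_bits` Boolean variables, as bit strings."""
    return ["".join(bits) for bits in product("01", repeat=num_bits)]


def canonical(encoding):
    """Negate the variables set to 1 in the first code, so it becomes 00..0."""
    mask = encoding[0]
    return tuple(
        "".join("1" if bit != m else "0" for bit, m in zip(code, mask))
        for code in encoding
    )


def count_asymmetric(num_scenarios, num_bits):
    """Count encodings up to negation of Boolean variables, by enumeration."""
    universe = code_universe(num_bits)
    return len({canonical(e) for e in permutations(universe, num_scenarios)})


def closed_form(num_scenarios, num_codes):
    """Assumed closed form: first code fixed, the rest chosen injectively."""
    return factorial(num_codes - 1) // factorial(num_codes - num_scenarios)
```

For instance, `count_asymmetric(3, 2)` and `closed_form(3, 4)` both give 6, and `closed_form(8, 8)` gives 5040.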
B. Composition constraints
In real-life systems, there are composition constraints that restrict the space of allowed encodings, for example due to backward compatibility requirements. Consider the two scenarios in Figure 1a , and assume that the following constraints must be met:
• The code of the arithmetic instruction (s 1 ) consists of an arbitrary 2-bit opcode, and two 3-bit operands A and B.
• The code of the unconditional branch (s 2 ) consists of a 5-bit opcode fixed to 00111, and a 3-bit branch offset operand.
The arithmetic instruction 2-bit opcode (b 1 b 2 ) is denoted by ??, where each ? is a don't care bit that becomes either 0 or 1 in the encoding. Each X is a don't use bit, which is not used for selecting a PO from those contained in the CPOG. In fact, 6 bits are left unused for the two operands (b 3 ... b 8 ). The opcode of the unconditional branch (b 1 ... b 5 ) is fixed to 00111, the remaining 3 bits are left unused for the branch offset operand (b 6 b 7 b 8 ).
As shown in the above example, a constraint g is an assignment g : B → {0, 1, ?, X} of the set of Boolean variables B. Sets of constraints are used to express composition constraints
The presented scenario composition algorithm handles composition constraints. As an example, the constraints set above can be satisfied by {(s 1 , 10XXXXXX), (s 2 , 00111XXX)}. On the other hand, {(s 1 , 00XXXXXX), (s 2 , 00111XXX)} is an incorrect encoding, as the code 00111000 selects both instructions. Codes such as 00XXXXXX and 00111XXX are said to be conflicting. The implementation details for satisfying the composition constraints and finding an initial encoding prior to the heuristic optimisation are described in Algorithm 1.
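The notion of conflicting codes can be made concrete with a short check. This is a minimal sketch, assuming codes are strings over {0, 1, X}: two codes conflict when some fully specified bit pattern matches both, i.e. when they agree on every position where neither has a don't use bit. The function names are illustrative.

```python
from itertools import combinations


def conflicting(code_a, code_b):
    """True if some concrete bit pattern is selected by both codes."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same number of bits")
    # A position rules out overlap only if both bits are fixed and differ.
    return all(a == b or "X" in (a, b) for a, b in zip(code_a, code_b))


def has_conflicts(encoding):
    """True if any two codes of the encoding conflict."""
    return any(conflicting(a, b) for a, b in combinations(encoding, 2))


valid = ["10XXXXXX", "00111XXX"]    # the correct encoding from the text
invalid = ["00XXXXXX", "00111XXX"]  # both codes match 00111000
```

`has_conflicts(valid)` is false because the first bit already separates the two codes, while `has_conflicts(invalid)` is true.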
Algorithm 1: Algorithm for satisfying the given composition constraints and finding the initial encoding.
2 C ← (0^|B| , · · · , 1^|B| ) ; // universe of codes
3 enc ← (− 1 , · · · , − |S| ) ; // empty encoding
The function findInitialEncoding takes as input the Boolean variables B for encoding and the set of composition constraints G. The latter is an array of size |S| whose indexes represent the scenarios and whose elements represent the constraints. As a running example, we consider the constraints G = {???, ??X, ???, 110, ???, ???} on six scenarios.
Initially, the universe of codes C is initialised with 2^|B| codes (line 2), and the encoding enc with |G| no-code symbols (−), as the encoding is initially empty (line 3). The array enc represents the initial encoding: its indexes represent the scenarios and its elements represent the codes. In the example, C = (000, 001, ..., 111) and enc = (−, −, −, −, −, −).
The function randomAssignment can introduce conflicting codes. In the example, the constraint G[1] = ??X cannot be turned into 11X, as the latter overlaps the code 110 that already encodes s 3 (11X ∉ C). Lines 10-12 can be repeated up to MAX = 10 times to increase the probability of satisfying all constraints. If the constraint is still not satisfied, an error is returned (line 14).
In line 16, an error is returned if the number of codes left for encoding (|C|) is smaller than the number of scenarios that still need to be encoded (|− ∈ enc|). In this case, more codes and bits B are required for encoding the given S under the constraints G.
In lines 17-19, the first scenario s 0 is encoded by the zero code if it is unconstrained (G[0] =? |B| ) and if the zero code has not been used (0 |B| ∈ C). This is necessary for avoiding symmetric encodings. In the example, C and enc become: The output of Algorithm 1 is the encoding enc, which satisfies G and can be optimised via the heuristics that we will describe shortly.
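The steps described above can be sketched as follows. This is an illustrative reading of the prose, not a transcription of Algorithm 1: it assumes constraints are strings over {0, 1, ?, X}, and it assumes that a code containing X bits reserves every concrete code it covers. The names `expand`, `find_initial_encoding`, and `MAX_TRIES` are hypothetical.

```python
import random
from itertools import product

MAX_TRIES = 10  # the text allows up to MAX = 10 random attempts


def expand(code):
    """Concrete codes covered by `code`; an X bit matches both 0 and 1."""
    options = [("0", "1") if bit == "X" else (bit,) for bit in code]
    return {"".join(bits) for bits in product(*options)}


def find_initial_encoding(num_bits, constraints, rng=None):
    rng = rng or random.Random()
    free = {"".join(bits) for bits in product("01", repeat=num_bits)}
    enc = [None] * len(constraints)

    def assign(i, code):
        covered = expand(code)
        if not covered <= free:  # would conflict with a code already in use
            return False
        free.difference_update(covered)
        enc[i] = code
        return True

    # Fully constrained scenarios (no ? bits) take their code directly.
    for i, g in enumerate(constraints):
        if "?" not in g and not assign(i, g):
            raise ValueError(f"overlapping constraint {g} for scenario {i}")

    # Partially constrained scenarios: replace each ? randomly, with retries.
    for i, g in enumerate(constraints):
        if "?" in g and set(g) != {"?"}:
            for _ in range(MAX_TRIES):
                guess = "".join(rng.choice("01") if b == "?" else b for b in g)
                if assign(i, guess):
                    break
            else:
                raise ValueError(f"constraint {g} could not be satisfied")

    remaining = [i for i, code in enumerate(enc) if code is None]
    if len(free) < len(remaining):
        raise ValueError("not enough codes left; more bits are required")

    # Symmetry breaking: an unconstrained first scenario gets the zero code.
    zero = "0" * num_bits
    if remaining and remaining[0] == 0 and zero in free:
        assign(0, zero)
        remaining = remaining[1:]

    # Unconstrained scenarios receive random codes among those left.
    for i in remaining:
        assign(i, rng.choice(sorted(free)))
    return enc


G = ["???", "??X", "???", "110", "???", "???"]
initial = find_initial_encoding(3, G, random.Random(0))
```

On the running example the fourth scenario always receives 110, and the second one receives 00X, 01X or 10X, since 11X would cover the code 110 that is already in use.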
C. Heuristic cost function
The main idea of the heuristics is to encode similar scenarios by similar codes. Similarities between codes are determined using the classic Hamming distance metric [25] . Similarities between scenarios, on the other hand, are determined by referring to their partial order representation. Consider two scenarios s 1 = (O 1 , ≺ 1 ) and s 2 = (O 2 , ≺ 2 ). The distance between s 1 and s 2 is computed following the two rules below:
and if it connects two operations which are both present in the operation sets of the two scenarios:
Distances between pairs of scenarios are elements of the Scenario Distance Matrix SD. Distances between pairs of codes are elements of the Code Distance Matrix CD. Both SD and CD have size |S| × |S|, where |S| is the size of the scenario specification. Elements SD ij and CD ij represent the number of differences between the i th and j th scenarios and codes, respectively, in an encoding e = {(s i , c i ), (s j , c j ), · · · , (s |S| , c |S| )}. These matrices are used to evaluate encodings heuristically via the below cost function:
Intuitively, minimising F means encoding similar scenarios with similar codes. We evaluated the cost function F empirically, by analysing several scenario specifications.
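The code side of this idea can be sketched directly, since CD is built from Hamming distances. The sketch below also includes a stand-in cost that sums squared differences between SD and CD entries over all pairs of scenarios. That stand-in is an assumption made purely for illustration and is not claimed to be the cost function F; the matrix `SD` is likewise hypothetical and codes are assumed to be fully specified.

```python
def hamming(code_a, code_b):
    """Number of bit positions in which two equally long codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


def code_distance_matrix(codes):
    """CD: pairwise Hamming distances between the codes of an encoding."""
    return [[hamming(a, b) for b in codes] for a in codes]


def stand_in_cost(sd, codes):
    """Illustrative cost (an assumption): sum of (SD_ij - CD_ij)^2 for i < j."""
    cd = code_distance_matrix(codes)
    n = len(codes)
    return sum((sd[i][j] - cd[i][j]) ** 2 for i in range(n) for j in range(i + 1, n))


# Hypothetical scenario distances: s0 is close to s1, and s2 is close to s3.
SD = [
    [0, 1, 2, 2],
    [1, 0, 2, 2],
    [2, 2, 0, 1],
    [2, 2, 1, 0],
]

similar_together = ["00", "01", "11", "10"]  # similar scenarios differ in one bit
similar_apart = ["00", "11", "01", "10"]     # similar scenarios differ in two bits
```

Under this stand-in, `similar_together` scores 2 and `similar_apart` scores 6, matching the intuition that similar scenarios should receive similar codes.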
As an example, Figure 5a shows the analysis of a subset of 8 scenarios of the Intel 8051 [7] scenario specification, where the universe of encodings E is fully inspected, and 5040 controllers are synthesised with a 90 nm technology library [9]. The size of the controllers is plotted against the heuristic value F of the corresponding encodings. Figure 5b, in turn, shows the analysis of the scenario specification of the Arm Cortex M0+ [11], composed of 11 scenarios. In this figure, 10^2 controllers produced by the proposed algorithm, described in Section IV-D, are compared to 10^5 controllers produced by encoding scenarios randomly.
The two figures highlight the existence of a correlation between the controller area and F, and suggest the following two claims:
• the likelihood of synthesising efficient implementations is higher where F is lower, see Promising candidates in Fig. 5a;
• the likelihood of synthesising efficient implementations is proportional to the number of encodings inspected, due to the inaccuracy of the heuristics, see Variability span in Fig. 5b.
D. The heuristic encoding algorithm
The presented heuristic algorithm is based on the cost function F, and on the below implementation of the simulated annealing (SA) [26] . The latter is a heuristic method for solving optimisation problems where a function must be minimised in a large search space. The algorithm pseudo-code is shown in Algorithm 2. The inputs of the function heuristicEncoding are the Boolean variables B, the scenarios S and constraints G. We continue the running example used for the Algorithm 1, where constraints G = {???, ??X, ???, 110, ???, ???} were turned to enc = {000, 01X, 100, 110, 001, 111} by the findInitialEncoding function (line 2). The encoding is also copied into enc best (line 3), which represents the best encoding found during the SA search.
Simulated annealing parameters were calibrated experimentally. The initial temperature is t 0 = 10. The cooldown factor alpha is a = 0.996, and the ending temperature is t e = 0.1. These parameters can be modified for increasing or decreasing the number of iterations for the SA optimisation. Line 4 initialises the universe of codes C. The code of the first scenario enc[0] is removed for avoiding symmetric encodings.
Lines 5-21 minimise the initial encoding heuristic value F(S, enc) by repeatedly swapping pairs of codes in the encoding, until the temperature t 0 reaches the ending temperature t e (line 5). Line 6 stores the current encoding enc into the next encoding enc next . Lines 7-8 select a random scenario s i in enc (1 ≤ i < |S|), and a random code c j in C, respectively. Such indexes are used for swapping codes in enc next (see lines 9-12). In the example, if i = 4 and j = 7, the code of scenario s 4 (001) is swapped with the code 111 (which identifies s 5 in enc). enc and enc next become:
enc = (000, 01X, 100, 110, 001, 111)
enc next = (000, 01X, 100, 110, 111, 001)
The next encoding enc next is considered if it satisfies the composition constraints G (line 13). The function satisfy (lines 22-28) checks that the bit size of the code matches the bit size of the constraint (line 23), and that the bits constrained by {0, 1, X} hold these values in the final code (lines 24-27). Notice that bits constrained by ? do not need to be checked, as both logic values {0, 1} satisfy such constraints.
In lines 14-15, enc next replaces the best encoding enc best found during the SA optimisation if the former has a lower heuristic value than the latter, i.e. F(S, enc next ) < F(S, enc best ). In lines 16-19, enc next also replaces enc in one of two cases, where d denotes the difference between the heuristic values of enc next and enc, and v is a random value with 0 ≤ v < 1. In the first case, enc next is not worse than enc (d ≤ 0), so that v < e^(−d/t 0 ) holds for every v. In the second case, d > 0 and the extracted random value v is lower than e^(−d/t 0 ); a worse encoding (with a higher F) then replaces enc.
The randomness allows the heuristicEncoding to return a different enc best (output) at every execution. The solution space is connected, as all e ∈ E are reachable by a set of swap moves.
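The search loop described above can be sketched as follows. This is a simplified illustration rather than a transcription of Algorithm 2: it only swaps codes that are already part of the encoding (codes left unused in C are not explored), it never moves the code of the first scenario, and it takes the cost function as a parameter. The function `toy_cost` is a hypothetical stand-in for F, and all names are illustrative.

```python
import math
import random


def satisfies(code, constraint):
    """Bits constrained to 0, 1 or X must match; ? bits accept anything."""
    if len(code) != len(constraint):
        return False
    return all(g == "?" or g == c for c, g in zip(code, constraint))


def heuristic_encoding(enc, constraints, cost, t0=10.0, alpha=0.996,
                       t_end=0.1, rng=None):
    rng = rng or random.Random()
    enc = list(enc)
    best = list(enc)
    cost_enc = cost_best = cost(enc)
    if len(enc) < 3:
        return best  # nothing to swap without touching the first scenario
    t = t0
    while t > t_end:
        nxt = list(enc)
        # Index 0 is excluded so the first scenario keeps its code.
        i, k = rng.sample(range(1, len(enc)), 2)
        nxt[i], nxt[k] = nxt[k], nxt[i]
        if all(satisfies(c, g) for c, g in zip(nxt, constraints)):
            cost_next = cost(nxt)
            if cost_next < cost_best:
                best, cost_best = list(nxt), cost_next
            d = cost_next - cost_enc
            # Always accept improvements; accept worse moves with prob. e^(-d/t).
            if d <= 0 or rng.random() < math.exp(-d / t):
                enc, cost_enc = nxt, cost_next
        t *= alpha  # exponential multiplicative cooling
    return best


def toy_cost(encoding):
    """Hypothetical cost: Hamming distance between consecutive codes."""
    return sum(
        sum(a != b for a, b in zip(x, y))
        for x, y in zip(encoding, encoding[1:])
    )


G = ["???", "??X", "???", "110", "???", "???"]
start = ["000", "01X", "100", "110", "001", "111"]
result = heuristic_encoding(start, G, toy_cost, rng=random.Random(1))
```

Because a swap only permutes codes that were conflict-free to begin with, the result is a permutation of the initial codes; in the running example the constrained scenarios keep 01X and 110 throughout.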
The current implementation of the algorithm is run in a single thread of execution. However, multiple instances of the heuristicEncoding function can be run on multiple threads, resulting in several encodings to be produced concurrently. The parallelisation of the presented algorithm is left as future research.
E. Correctness
An encoding enc, constrained by composition constraints G, is said to be correct if: 1) every code enc[i] can be derived from G[i] by substituting every ? by either 0 or 1; 2) the encoding enc does not contain conflicting codes, which do not identify scenarios uniquely (see Section IV-B).
Whenever Algorithm 1 terminates, a correct encoding enc is returned by construction. I.e. the result enc is constructed by selecting the codes for encoding from the universe C, which only contains valid codes being derived by the number of bits |B| selected for encoding (line 2). Fully and partially constrained scenarios are always encoded by codes derived by their constraints, see lines 6 and 9-14, respectively. Thus the resulting enc always satisfies the constraints G. Also, overlapping codes cannot be introduced in the final encoding result enc: a code is always removed from C when it is used to encode a scenario and thus cannot be reused to encode a different scenario, see lines 7, 15, 19 and 22.
On the other hand, Algorithm 1 generates an error if the constraints G cannot be met for any of the following reasons:
• The user introduces overlapping constraints, see line 5.
• A partially constrained code is not turned into a code c left for encoding (c ∈ C) in any of the MAX iterations, see lines 9-13. An optimal solution would be to run an exhaustive search, which we avoid to reduce the algorithmic complexity.
• The number of codes |C| is not enough for encoding a set of scenarios with size |G| with the given constraints G, see line 16.
With regards to Algorithm 2, the function heuristicEncoding takes the encoding enc produced by the previous function, and advances to enc best through a sequence of swap moves that inspects many intermediate encodings enc next . Given a correct enc (see line 2), each intermediate encoding enc next derived by a swap (see lines 7-12) is always an encoding with no conflicting codes (i.e. code swap does not introduce encoding conflicts). Intermediate encodings can replace enc best only if they satisfy the constraints G (lines 13 and 22-27). Consequently, enc best is also correct by construction. The two algorithms always terminate, as there are no infinite loops.
F. Time complexity analysis
The function findInitialEncoding consists of a sequence of three loops. The first one (lines 4-7) encodes fully constrained scenarios by moving their codes into the encoding enc. Its complexity only depends on the number of fully constrained codes introduced: O(|S|). The second loop (lines 8-15) encodes partially constrained scenarios by looping over the |B| bits of each constraint, in order to turn every ? into a value in {0, 1}. Thus, its complexity is: O(|S| · |B|). The third loop (lines 20-22) makes use of the function pickRandom (O(1)) to extract codes left in the code universe C and encode the unconstrained scenarios. Its complexity depends on the constraints: O(|S|). Consequently, the complexity of Algorithm 1 (A 1 ) comes from the second loop. In this paper, we assume that |B| = ⌈log 2 |S|⌉, hence the below equation:
$$O(A_1) = O(|S| \cdot \lceil \log_2 |S| \rceil)$$
On the other hand, the function heuristicEncoding, excluding the internal findInitialEncoding function in line 2, consists of a loop that implements an exponential multiplicative cooling strategy of the simulated annealing algorithm (SA) [26], i.e. an initial temperature t 0 is multiplied by a constant factor a at each iteration, until an ending temperature t e is reached. This causes a fixed number of iterations n that can be tweaked by modifying these parameters. At each iteration of the SA, the most computationally expensive statements are in lines 13 and 22-28, where enc next is checked against the constraints G, and in lines 14 and 16, where the function F has to be computed. The former has a complexity of O(|S| · ⌈log 2 |S|⌉), as the encoding has to be checked for every bit of each constraint. The latter has a complexity of O(|S|^2), see Formula 1. Consequently, Algorithm 2 (A 2 ) has the following time complexity:
$$O(A_2) = O(n \cdot |S|^2)$$
This analysis disregards the implementation details of the further set of functions (e.g. pickRandomScenario) that the proposed algorithm relies on. However, these additional functions do not increase the above time complexity if implemented reasonably.
V. DESIGN AUTOMATION
The design methodology described in Section II-B is implemented in the EDA tool SCENCO [9], which stands for SCENario ENCOder. It features the following scenario encoding algorithms.
1) Exhaustive search fully explores the universe of encodings E.
SCENCO relies on Espresso [27] for Boolean minimisation and on Abc [28] for technology mapping; the gate library is specified in the GenLib format [29]. Abc is also used for producing synthesised controllers in the Verilog file format. SCENCO also uses the Clasp [23] and MiniSAT [24] SAT solvers for supporting the SAT encoding.
The SCENCO graphical user interface is described in [30]; Figure 6 shows and describes an example of the applied design methodology in WORKCRAFT.
A. Interface synthesis
The synthesis of the interface between the controller and the datapath has been automated in the EDA tool [9] , relying on the ideas elaborated in [31] and summarised below.
The controller can be interfaced either with asynchronous datapath modules, relying on the request/acknowledge handshake, or with synchronous modules using matched delays [32], which produce acknowledgement signals after a chosen delay. In turn, since the controller resets request signals only at the end of each scenario execution, decouple and merge [31] are needed to release datapath modules immediately after they acknowledge their completion. Also, merge is used when a module is executed multiple times within a scenario; see a schematic of the interface in Figure 7.
The developed tool [9] takes as input the datapath modules in the form of Verilog, and interfaces them to the synthesised controller automatically. The produced Verilog file contains the final system implementation, see Figure 3 . 
VI. ALGORITHM AND TOOL VALIDATION
We validate the presented algorithm and tool over a set of benchmarks coming from three domains: ad-hoc controllers, processor instruction sets and process mining in Sections VI-B, VI-C and VI-D, respectively. Experimental results are compared with existing scenario composition approaches on all benchmarks, and also with the behavioural synthesis methodologies based on STGs and FSMs in the processor benchmarks. All used benchmarks can be found online (see benchmarks folder in [9] ), and can be displayed and run in WORKCRAFT [10] .
A. Configuration and notation
We run our experiments on an Intel i7-3610QM 2.30 GHz CPU with 8 GB of DDR 1600 MHz RAM. Benchmarks are:
1) Specified in the form of partial orders in WORKCRAFT [10], and synthesised by the presented tool SCENCO [9].
2) Specified as STGs in WORKCRAFT, and synthesised by Petrify [19] and MPSat [20].
3) Specified as synchronous FSMs in VHDL, and synthesised by Synopsys Design Compiler [21].
The same 90nm gate library is used for technology mapping. For presenting benchmark results, we use the following notation. #e denotes the number of encodings generated and synthesised by the proposed algorithm. The smallest controller out of these is shown as the result. Area (|B|) denotes the area [µm²] of the resulting controller, with the number of bits used for encoding in brackets. RT denotes the tool runtime [s], i.e. the time from parsing the specification to obtaining the final implementation. We only consider results produced within a runtime of 1 hour; longer runs are reported as a timeout (TO). Finally, we use the dash character '−' when a behavioural synthesis approach cannot be applied to a benchmark due to a technical issue; see the textual description for an explanation. Controllers derived from FSMs and STGs include sequential components (registers and C-elements, respectively) for holding system states. In the results, we only consider the combinational part of these controllers, so as not to penalise them in the comparison.
B. Ad-hoc controllers
The first set of benchmarks includes an on-chip power management controller of a buck converter [33] , and an asynchronous controller for the reconfigurable pipeline of a dataflow processor [34] .
The power management controller is required to regulate the activation of the PMOS (gp) and NMOS (gn) transistors in response to three signals coming from sensors within the power regulator: over-current (oc), under-voltage (uv) and zero-crossing (zc). The two transistors must never be on at the same time to avoid a short circuit. Two of the four scenarios that compose the power management controller are shown in Figure 8a and described below.
Over-current scenario: when the oc condition is detected (event oc+), the PMOS transistor must be switched off (event gp-). Afterwards, the NMOS transistor must be switched on (event gn+).
Zero-crossing followed by under-voltage scenario: if zc is detected before uv, the NMOS transistor must be switched off (event gn-). The two transistors must stay off until the arrival of the uv condition. Afterwards, the PMOS transistor must be switched on (event gp+).
The asynchronous dataflow processor contains a 16-stage reconfigurable pipeline for statistical analysis of data streams. Its controller manages the energy-quality of the result by controlling the number of active pipeline stages. It is an important case study, as it was fabricated in an ASIC and tested [34]. Three of the 13 scenarios that compose the reconfigurable pipeline controller are specified below in text form, i.e. scenario s1 activates 4 pipeline stages, s2 activates 5 stages, up to scenario s13, which activates all 16 stages of the pipeline:
. . .
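To make the buck controller scenarios more concrete, the following Python sketch models a scenario as a partial order of events and checks, over all of its linearisations, that the two transistors are never on at the same time. It is an illustration only: the precedence arcs and the initial transistor states are assumptions read off the textual description (the exact arcs are those of Figure 8a), and all function names are hypothetical.

```python
def linearisations(events, arcs):
    """Yield every total order of events compatible with the arcs (p before q)."""
    def extend(prefix, remaining):
        if not remaining:
            yield list(prefix)
            return
        for e in sorted(remaining):
            # e is enabled once all of its predecessors are in the prefix.
            if all(p in prefix for (p, q) in arcs if q == e):
                yield from extend(prefix + [e], remaining - {e})
    yield from extend([], frozenset(events))

def never_both_on(events, arcs, initial):
    """True if gp and gn are never on together in any linearisation."""
    if initial["gp"] and initial["gn"]:
        return False
    for order in linearisations(events, arcs):
        state = dict(initial)
        for e in order:
            signal, sign = e[:-1], e[-1]
            if signal in state:               # only transistor events change state
                state[signal] = (sign == "+")
            if state["gp"] and state["gn"]:
                return False
    return True

# Over-current scenario: oc+ -> gp- -> gn+ (assumed to start with PMOS on).
oc_events = {"oc+", "gp-", "gn+"}
oc_arcs = [("oc+", "gp-"), ("gp-", "gn+")]
oc_initial = {"gp": True, "gn": False}

# Zero-crossing followed by under-voltage (assumed to start with NMOS on).
zc_events = {"zc+", "gn-", "uv+", "gp+"}
zc_arcs = [("zc+", "gn-"), ("zc+", "uv+"), ("gn-", "gp+"), ("uv+", "gp+")]
zc_initial = {"gp": False, "gn": True}

print(never_both_on(oc_events, oc_arcs, oc_initial))
print(never_both_on(zc_events, zc_arcs, zc_initial))
```

Under these assumptions both checks succeed. Dropping the arc from gp- to gn+ in the over-current scenario would allow gn+ to fire while the PMOS transistor is still on, which the check reports as a violation.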
Evaluation: Table II shows the results upon application of the state-of-the-art encoding algorithms. The Single-literal encoding produces a 1.9% smaller buck controller in comparison to other approaches, and uses one more variable than needed (|B| = 3). On 2 variables, the optimal controller is generated by the Exhaustive search by definition. Such a controller is also achieved by the SAT-based and by the proposed Heuristic algorithm.
The reconfigurable pipeline controller is not produced within the considered timeout by the Exhaustive and the SAT-based algorithms, due to the complexity of the corresponding scenario specification. The Single-literal controller is 25.8% smaller than the controller produced by the proposed encoding technique, and uses 3× more variables. In [34], we implemented the controller produced by the proposed encoding technique, as the final design was constrained by the pins of the external package.
The runtime of the tool for processing the above benchmarks is always less than 1 s.
C. Processor instruction sets
The second set of benchmarks includes different subsets of instructions of the ARM Cortex M0+ [11] , Texas Instruments MSP430 [8] and Intel 8051 [7] , [35] . These processor specifications were derived by analysing their corresponding ISA reference manuals, and identifying classes of instructions (scenarios) that share similar functionalities and addressing modes.
With regard to the design of real processors, the above manual scenario extraction approach is not ideal for obtaining accurate specifications. However, recent research on specification languages for processor architectures (see Section II-B1) makes it possible to fully specify the behaviour of modern systems comprising hundreds of instructions, and to derive accurate specifications for synthesising real processors. In this context, the presented algorithm is important as it scales well to hundreds of scenarios (as we show in Section VI-D), making the CPOG methodology suitable for the design of such modern systems.
The ARM Cortex M0+ scenario specification is fully described in [11]. This processor has an ISA comprising 68 instructions. The specification, composed of 11 scenarios and 6 datapath modules (scenario operations), models 61 of these instructions. As an example, two scenarios of the specification are shown in Figure 8b and described below.
Load (reg.) covers the LDR (reg.) instruction. The ALU operation computes the memory address, the MAU loads a value from the memory and stores it into a specified register. The IFU fetches a new processor instruction.
Arit/Log (Imm.) covers arithmetical, logical and data transfer instructions with immediate addressing, e.g. ADD (imm.), LSR (imm.). An immediate value is fetched from the instruction register (PCIU → IFU), and used as operand for the selected operation (ALU). The result is stored into a specified register. The ALU operation is executed concurrently with the program counter incrementation (PCIU/2). The resulting PC is used for fetching a new instruction (IFU/2).
Fig. 8: (a) shows 2 scenarios of the Buck controller specification [33]. (b) shows 2 scenarios of the ARM Cortex M0+ [11]. (c) shows 2 scenarios of the TI MSP430 specification [8]; the scenario Cond. ALU op. #123 to Rn is a conditional scenario in the form of a CPOG, due to the conditional operation ALU/2. (d) shows the CPOGs synthesised from the full scenario specification in [11]: the CPOG derived by the proposed encoding is on the left-hand side, and the one derived by the Random search is on the right-hand side.
The Texas Instruments (TI) MSP430 scenario specification has been introduced in [8]. The specification, composed of 8 scenarios and 7 datapath modules, models the full instruction set of 51 instructions. This benchmark is important as some of its scenarios have conditional elements. As an example, two of its scenarios are shown in Figure 8c. The second scenario is said to be conditional, and can be described in the form of a CPOG. Conditional scenarios can be composed regularly with other scenarios, see [16] for further details.
The Intel 8051 specification supported the design of an asynchronous version of this processor [35]. It comprises 37 scenarios and 17 datapath modules that model 255 processor instructions. It is important to the CPOG validation, as it contains 3× more scenarios and 2× more operations than the other processor benchmarks.
Evaluation: Table III shows the results of applying the state-of-the-art CPOG encoding algorithms to the described set of processor benchmarks. This set is also used to compare the proposed algorithm based on CPOGs to the methodologies based on FSMs and STGs, as it is the most diverse set, being characterised by (1) specifications of different sizes (from 4 to 37 scenarios), (2) specifications with conditional scenarios (see TI MSP430), and (3) specifications comprising a different number of datapath modules (from the ARM processor with 6, to the Intel with 17). For these reasons, it is able to highlight the characteristics of all used approaches to behavioural synthesis.
The Exhaustive search produces the smallest instruction decoders using log₂ |S| variables. In practice, it is applicable to specifications that contain up to 8 scenarios, as its runtime increases exponentially with the specification size.
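A rough count illustrates why exhaustive exploration stops being practical beyond a handful of scenarios. The Python sketch below counts the ways of giving each of |S| scenarios a distinct code of the minimum width. It is only an upper bound under that assumption: it ignores any symmetry reduction of the solution space, and the function names are hypothetical.

```python
import math

def code_width(num_scenarios):
    """Smallest number of bits that gives every scenario a distinct code."""
    return max(1, (num_scenarios - 1).bit_length())

def count_encodings(num_scenarios, num_bits=None):
    """Number of assignments of distinct num_bits-wide codes to the scenarios."""
    if num_bits is None:
        num_bits = code_width(num_scenarios)
    # Ordered selections of num_scenarios codes out of 2**num_bits.
    return math.perm(2 ** num_bits, num_scenarios)

for s in (4, 8, 11, 37):
    print(s, code_width(s), count_encodings(s))
```

For 8 scenarios on 3 bits the count is 8! = 40320, whereas 11 scenarios on 4 bits already give 16!/5! = 174356582400 candidate encodings.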
The Single-literal encoding produces the smallest instruction decoders in most of the cases when it does not exceed the time limit. However, synthesised decoders might not be applicable to real processors, as the code size |B| is fixed by the algorithm rather than by the processor (op)code specifications. The SAT-based encoding produces decoders with an average overhead of 7.4% in comparison to Exhaustive decoders. The current implementation does not support scenarios in the form of CPOGs (see missing results − in the TI MSP430 rows). The runtime of the SAT-based and Single-literal approaches increases exponentially (exceeding the timeout) when |S| grows.
On average, the Proposed encoding produces implementations with an area overhead of 4.5% in comparison to optimal solutions. It scales to a higher number of scenarios (see Intel 8051 results), and supports scenarios in the form of CPOGs (see TI MSP430 results). The runtime is always within the timeout. As an example, Figure 8d shows the ARM system specifications obtained by composing its 11 constituent scenarios via the proposed encoding (left-hand side) and via the random search (right-hand side). The 'proposed' CPOG contains shorter conditions φ.
We also run the Proposed encoding by constraining |S|/2 scenarios of every processor specification randomly, using {0, 1, ?, X}. The resulting decoders always satisfy the composition constraints given, and have an overhead of 12.4%, on average, in comparison to optimal implementations.
Finally, we used the behavioural synthesis approaches based on Finite State Machines (FSM) and Signal Transition Graphs (STG), in order to show that the proposed methodology achieves better results than established techniques in the field. The approach based on synchronous FSM and Design Compiler (known as dc shell in the Synopsys tool-chain) is always able to synthesise controllers from the given specifications using the sequential encoding. Synthesised implementations show an average area overhead of 56% in comparison to the proposed unconstrained approach. The processing runtime is comparable.
On the other hand, the methodology based on STG is never able to synthesise implementations from the given specifications with the sequential encoding. The results shown in the table are derived with the one-hot encoding, which simplifies the specifications by replacing the go transitions and their dependencies with the codes, see [9]. However, even after this simplification, the methodology is often unsuccessful. In most cases, Petrify returns the error "support too big for minimisation" (see missing results − on the left-hand side of the STG column), and MPSat does not find a solution within the given time limit (see TO entries on the right-hand side). MPSat is partially successful with the ARM Cortex M0+, whose scenarios include fewer datapath modules (6 compared to the 17 modules of the Intel 8051) and which does not include conditional scenarios (unlike the TI MSP430). On average, the methodology based on STG has an area overhead of 51% in comparison to the proposed unconstrained approach, and a much higher synthesis runtime.
D. Software output logs
The third set of benchmarks includes scenario specifications that describe a set of different software output logs [36] . They come from the process mining community: artificial logs derived from the simulation of a process model (BigLog1, Log2, Caise2014), and real-life traces in different other contexts (purchasetopay, incidenttelco, svn log, telecom, documentflow).
Due to the size of these benchmarks (from 16 to 651 scenarios), we compare the Proposed encoding to the Sequential encoding and Random search, as the other CPOG algorithms always exceed the time limit. The proposed encoding is applied with three configurations: (1) with #e = 1, (2) with #e = 10, and (3) with #e = 1 and the Simulated Annealing parameters modified to allow 10× more iterations for the optimisation (SA ×10).
Evaluation: On average, the controllers found by the proposed encoding in configuration 1 (configuration 2) are 4.7% (9.8%) more efficient, in terms of area, than sequential controllers, and 12.9% (18%) more efficient than random controllers. In turn, the results produced by the proposed encoding in configuration 3 are 13.2% and 21.7% more efficient, on average, than the sequential and random implementations, respectively.
On average, the Sequential encoding produces 8.56% smaller controllers in comparison to the Random search algorithm. Such a good result is due to a certain degree of similarity between pairs of subsequent scenarios, which are encoded naturally by pairs of subsequent and similar codes by the Sequential algorithm.
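The effect of sequential codes can be illustrated with a small Python sketch. It assumes that the Sequential encoding assigns the i-th scenario the binary representation of i on the minimum number of bits. This is our reading of the description above rather than a definition taken from the tool, and the function names are hypothetical.

```python
def sequential_encoding(num_scenarios):
    """Assign scenario i the binary code of i on the minimum number of bits."""
    n_bits = max(1, (num_scenarios - 1).bit_length())
    return [format(i, f"0{n_bits}b") for i in range(num_scenarios)]

def hamming(a, b):
    """Number of bit positions in which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))

def mean_consecutive_distance(codes):
    """Average Hamming distance between the codes of subsequent scenarios."""
    pairs = list(zip(codes, codes[1:]))
    if not pairs:
        return 0.0
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)

codes = sequential_encoding(651)
print(len(codes[0]), mean_consecutive_distance(codes))
```

Under this assumption, subsequent scenarios differ in fewer than 2 bits on average whatever the code width, whereas two codes picked at random on 10 bits differ in about half of their bits. Similar scenarios placed next to each other therefore tend to receive similar codes.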
In Table IV, the benchmarks are divided into three sets of different sizes, from the bottom to the top: (S)mall (10 < |S| < 30), (M)edium (30 < |S| < 400) and (L)arge (400 < |S| < 652). We make the following observations.
Small set: configuration 2 of the proposed encoding finds the best results, as the higher number of encodings inspected (#e = 10) provides a higher chance to produce a good result. The increased number of SA iterations of configuration 3 is not justified in this set due to the small |S|.
Medium set: configuration 3 finds the best results, as the higher |S| justifies a longer optimisation time provided to the Simulated Annealing optimisation.
Large set: configuration 3 finds smaller controllers in comparison to configurations 1 and 2. However, these benchmarks highlight the heuristic (inaccurate) component of the proposed approach, which may find worse controllers in comparison to trivial algorithms. For this set, a higher number of SA iterations would be justified for obtaining good results.
This set of benchmarks shows that the proposed approach can handle specifications of hundreds of scenarios. Also, it can be tuned as much as needed by modifying the time for the SA optimisation.
VII. DISCUSSION AND FUTURE RESEARCH
This paper presented a novel approach to scenario composition for the design methodology based on Conditional Partial Order Graphs. The presented open-source SCENCO tool, embedded in the EDA toolsuite WORKCRAFT, implements this methodology. The algorithm and tool are evaluated on a set of benchmarks and compared to the state-of-the-art composition algorithms for CPOGs, and to the behavioural synthesis techniques based on FSMs and STGs. Table V summarises the comparison of all CPOG composition techniques, relying on the experimental results shown and evaluated in Section VI. The proposed algorithm, unlike previously published techniques, handles hundreds of scenarios with a good area/synthesis runtime trade-off, and supports composition constraints. It also supports conditional scenarios for modelling behaviours that contain dynamic branching. Also, the experimental results highlight that the CPOG methodology produces more efficient implementations (in terms of area) than the approaches based on FSMs and STGs. The latter can be applied only to relatively compact models.
To further improve the CPOG methodology, a number of recommendations for future research are given. (1) Parallel implementation of the presented algorithm, which is important to further improve the efficiency of behavioural composition by exploring more solutions at no extra runtime cost. (2) Support for x-aware scenario encoding (with x being latency, power, energy, and other characteristics), which is important for making the methodology attractive to many practical domains.
