Synchoros VLSI design style has been proposed as an alternative to standard cell based design. It enables end-to-end automation of large designs with ASIC-like efficiency. A key problem in this automation process is the generation of the regional clock tree. The synchoros design style requires that the clock tree emerge by abutting identical fragments that are absorbed in large-grain synchoros VLSI design objects as a one-time engineering effort. The clock tree should not be ad hoc; it should be a structured, parametrically predictable design whose cost metrics are known. We present a new clock tree design that is compatible with the synchoros design style. The proposed design has been verified with static timing analysis and compared against a functionally equivalent clock tree synthesized by commercial EDA tools. The scheme is also scalable: we show that composition by abutment generates valid VLSI designs from 0.5 to ~2 million gates without the need for clock tree synthesis. More critically, the synthesized design is correct by construction and requires no further verification. In contrast, the hierarchical synthesis flow must synthesize the regional clock tree and follow it with a verification step because it lacks predictability. The results also demonstrate that the capacitance, slew and skew-balancing ability of the clock tree generated by abutment are comparable to those of the one generated by commercial EDA tools.
I. INTRODUCTION
In this paper, we present a clock tree generation scheme for a novel synchoros VLSI (Very Large Scale Integration) design framework that is an alternative to the standard cell based VLSI design framework. The word synchoros is derived from the Greek word for space, "choros". Synchoricity is analogous to synchronicity. In synchronous systems, time is discretized with clock ticks to enable temporal composition of synchronous datapaths. In synchoros systems, space is discretized uniformly with a virtual grid to enable spatial composition of the system by abutting synchoros building blocks that we call SiLago (Silicon Lego bricks) blocks. All wires, including the clock, are absorbed within these blocks, and when a system is composed by abutting SiLago blocks, a valid, DRC (Design Rule Checking) and timing clean GDSII (Graphic Database System II) is produced. The design does not need any further VLSI engineering other than what has gone into creating the SiLago blocks as a one-time engineering effort. In essence, SiLago blocks are the new mega standard cells.
The need for synchoros VLSI design style as a replacement for standard cell based VLSI design is elaborated in section II. Briefly, the key benefits of synchoros VLSI design style are: 1) end-to-end automation from complex system models to GDSII, where complex system models imply 100+ million gates and non-deterministically concurrent untimed applications, 2) Application-Specific Integrated Circuit (ASIC) like computational efficiency [1], 3) engineering effort comparable to programming, with correct-by-construction designs and near-perfect prediction of cost metrics, 4) sufficient flexibility for in-field bug fixes and version updates, 5) foundry compatibility and 6) potential to eliminate mask engineering cost and also significantly lower the silicon and engineering cost related to DFT (Design For Testing).
D. Stathis and A. Hemani are with the KTH Royal Institute of Technology, Stockholm, Sweden (e-mail: stathis, hemani@kth.se).
Clocks in synchoros VLSI designs have three levels of hierarchy. The highest level of clocks are the global clocks that are derived from PLL and distributed to region instances. The region instances roughly correspond to chiplets or sub-systems in traditional SOC architectures. They communicate with each other on a latency insensitive basis. Each region instance is composed of SiLago blocks that communicate with each other on synchronous basis fed by a regional clock tree (RCT). Each SiLago block has its internal local clock synthesized by commercial EDA tools.
In this paper, we focus on a key aspect of synchoros VLSI design style: the generation of a valid RCT by abutment of SiLago blocks. The proposed RCT generation by abutment scheme does not compete with the existing clock tree synthesis schemes on traditional figures of merit. Instead, the proposed scheme obviates the need for RCT synthesis.
The key contributions in this paper are:
a. We present an RCT design that can be divided into components that are absorbed as part of the synchoros SiLago blocks. When these SiLago blocks abut, a valid RCT is created. No additional step, like clock tree synthesis in the traditional standard cell based design flow, is needed. All that is done is to abut the synchoros SiLago blocks.
b. The generated RCT and the synchoros SiLago based design are correct by construction and do not require any further verification by static timing analysis or DRC.
c. The generated RCT is predictable in terms of its switching capacitance, arrival time, slew and skew at each leaf node.
d. The generated RCT has cost metrics (capacitance, skew and slew) comparable to those of a functionally equivalent RCT generated by commercial EDA tools.
The roadmap for the rest of the paper is as follows. In section II, we justify the need for synchoros VLSI design style as a replacement for standard cell based VLSI design. In section III, we introduce a synchoros VLSI design platform as the context in which the proposed RCT scheme has been developed. In the same section, we also introduce the demands for RCT generation in a synchoros VLSI design platform. Sections II and III give an overview of the basics of the synchoros VLSI design style, and serve as the motivation and basis for this work. In section IV, we elaborate the proposed RCT scheme and in section V we explain the process of configuring the proposed clock tree. In section VI, we quantify the benefits and cost of the RCT, compare it to a functionally equivalent clock tree generated by commercial EDA tools and provide proof-of-concept experimental results for the predictability of the generated RCT. In section VII, we review the state of the art in clock tree synthesis and argue why these techniques do not meet the requirements for RCT generation in the synchoros VLSI design style. Finally, we draw conclusions and point to the ongoing enhancements to the SiLago platform.
Regional Clock Tree Generation by Abutment in Synchoros VLSI Design
Dimitrios Stathis, Panagiotis Chaourani, Syed M. A. H. Jafri, Ahmed Hemani
P. Chaourani and S. M. A. H. Jafri were with KTH Royal Institute of Technology, Stockholm, Sweden (e-mail: pancha, jafri@kth.se).
II. THE NEED FOR SYNCHOROS VLSI DESIGN
Synchoros VLSI design style is needed to make ASIC-like performance and computational efficiency affordable. It also makes ASICs accessible to end users with no VLSI design competence and by implication no need for logic and physical syntheses. The problem addressed in this paper is a key subproblem in making synchoros VLSI design possible. We make the case for synchoros VLSI design by a chain of arguments: the need for ASIC like performance and efficiency, why standard cell based design flows cannot make ASIC affordable and accessible to all and how synchoros VLSI design style can achieve what standard cells cannot.
A. ASIC like performance and efficiency is acutely needed
ASICs are known to outperform software implementations by 2-3 orders of magnitude in energy efficiency [2, 3, 4]. The need for ASICs is being acutely felt in almost all application domains. The cloud computing community has shown that the TCO (total cost of ownership) for ASIC based clouds is 3-5 orders of magnitude lower than for GPUs and FPGAs [5]. It is not surprising to see Google adopting ASIC as the implementation style for the TPU [6], for its cloud servers and now also for edge computing. The autonomous driving community has openly expressed its exasperation with the electronics and computer architecture community not being able to deliver a sufficiently low power computing platform. According to Delphi [7], autonomous driving requires the power equivalent of 100+ laptops. This will force the automobile industry to exclude electric cars from being the initial targets of autonomous driving. The need for ASICs in the AI community is so acute that Forbes highlighted it with the title of an article: The next big thing in AI will be ASICs [8]. The title is meant to shock readers who would naturally think that some new learning algorithm, NN topology etc. would be the next big thing in AI. The academic community has also shown that ASICs outperform GPUs by 3-5 orders of magnitude [9, 10, 11]. 5G in all aspects expects 1-2 orders of magnitude improvement compared to 4G (LTE) and yet expects these improvements to be delivered with 1 order of magnitude less energy consumption. No implementation style other than ASIC fits the bill.
B. ASICs are only affordable to big players
The cost of a VLSI design is proportional to the size of the VLSI design space that must be refined by the end user for every design and design iteration. We argue and explain next that the VLSI design space embodied by standard cells is too large to be refined automatically. This has made complex standard cell based VLSI designs, which require more than 300 million US dollars [12], unaffordable except to large players; 90% of this cost is engineering.
1) The concept of VLSI Design Space
VLSI Design Space is spanned by three dimensions. Abstraction and complexity are the independent variables, and the number of solutions is an exponential function of these two variables. Synthesis is the process of refining from more abstract to more detailed and proceeds in a top-down manner by successively refining functionality at each level to the next level. At each level, the refinement process evaluates multiple functionally equivalent implementation alternatives in terms of objects at the next lower level of abstraction. This process continues and at each level the number of design alternatives increases exponentially, as shown in Fig. 1. By the time physical synthesis happens, the design space has expanded exponentially to $(((T^L)^R)^A)^P$. This space further increases exponentially with complexity $C$ to $((((T^L)^R)^A)^P)^C$.
The exponential increase in design space, i.e., number of possible solutions, with abstraction is what we define as the fundamental VLSI design automation challenge.
2) The Unscalability of Standard Cell VLSI Design Space
Initially, the end user had to manually refine the entire VLSI design space, and this style of design was naturally called full custom, Fig. 1 (a). In spite of Mead and Conway's structured VLSI design style [13] and later attempts at automation using silicon compilers, the full custom design space was too large to automate its refinement. Full custom designs were restricted to tens of thousands of gates.
To go beyond what was possible with full custom style of VLSI design, standard cells were introduced to exponentially reduce the design space to allow a higher degree of automation. This was achieved by standardising all Boolean level logic as a set of one-time engineered set of standard cells. This effectively raised the physical design to Boolean level, i.e., as soon as a design is at Boolean level we know its physical design since the standard cells are pre-designed. However, the wires that connect these standard cells, clock them, reset them etc. are not known and have to be synthesized as part of physical synthesis.
In synchoros VLSI design style, the need for synthesis of all wires, including clocks, is obviated for the end user.
Standard cells, in spite of imperfectly raising the physical design to Boolean level, succeeded in automating synthesis from RTL down to physical level. The Achilles heel of standard cell based design flows is that the refinement from system level down to RTL is done manually. For ASICs, this refinement is in terms of functional RTL designs and for SOCs, this refinement is in terms of infrastructural IPs. This manual refinement requires costly functional verification to ensure that the system model is preserved in the manually refined RTL model. Manual refinement is done using crude estimates that can only be verified when the design has been refined down to physical level. This is shown as constraints verification in Fig. 1 (b). The two verifications, functional and constraints, are not independent and are the dominant cost components that make complex VLSI design in standard cells unaffordable.
To increase automation beyond RTL and thus reduce manual refinement, high-level synthesis has been intensely researched for three decades and it is still not mainstream. As recently as 2016, status of HLS was judged as [14] : "Could High Level Synthesis be the key to the next generation of EDA? As we all know, that did not happen-despite some very large investments".
The cost of making SOCs has become unaffordable, at ~300 million USD, in spite of the reuse-heavy Platform Based Design (PBD) style. The performance and efficiency of these SOCs have left the community clamouring for ASICs. These are sure signs that it is time to overhaul the VLSI design methodology on the same scale as the introduction of standard cells did for the full-custom design style.
C. Synchoricity makes ASICs affordable and accessible
The core idea behind synchoros VLSI design style is to raise the physical design level to RTL for both logic and all wires. Wires include the data buses and infrastructure wires such as the RCT, reset, power grid etc. After HLS, the physical design is raised to algorithmic level as shown in Fig. 1 (d). This exponentially reduces the design space that must be refined by the end user for every design instance and iteration. Further, synchoricity also makes the reduced space composable and predictable. These factors enable automated synthesis of custom hardware-centric designs as opposed to manual refinement of software-centric PBD in terms of processors and other infrastructural IPs. This automation eliminates the need for functional verification because no manual refinement is involved. The constraints verification is also eliminated because synchoricity makes the reduced design space predictable with post-layout accuracy: as soon as a design is at RTL or algorithmic level, the dimension and position of every single transistor and wire segment is known with certainty. Application and System Level syntheses (ALS/SLS) synthesize not just single algorithms, as HLS does, but complete systems that include NOCs, scratchpad memory, 3D DRAM vaults etc. ALS and SLS are not the focus of this paper and are under development. We refer the interested reader to [15] for an early version of ALS.
We close this section with Table I, where we compare standard cell and synchoros VLSI design styles and show how the latter overcomes the limitations of the former. SiLago blocks are atomic physical design building blocks that implement RTL operations.
Physical Design raised to higher abstractions
Standard cell: When a design is at Boolean level, the dimension and position of every intra standard-cell transistor and wire segment is known. A follow-up physical synthesis is required to implement inter standard-cell wires and buffers.
Synchoros: When a design is at RTL level, the dimension and position of every intra and inter SiLago block transistor and wire segment is known. This advantage over standard cells is enabled by the synchoricity property of the SiLago blocks.
Reduction in VLSI Design Space
Reduces VLSI design space from
Standard cell: The reduction in design space enables partial automation from RTL to physical level; system-level to RTL refinement remains manual.
Synchoros: The reduction in design space enables end-to-end automation from system level to RTL, which is equivalent to being at physical level (GDSII).
III. THE SYNCHOROS VLSI DESIGN FRAMEWORK
In this section, we present an experimental synchoros VLSI design framework in terms of its synchoros SiLago (Silicon Lego) building blocks, their domain specificity, its generality as an aggregation of a wide range of domain specific SiLago blocks, its architectural hierarchy and, briefly, its design methodology. This framework forms the basis for arguing why a new method for regional clock tree generation is needed. It also motivates the need to elaborate the details of the composable RCT (Regional Clock Tree) that emerges when the synchoros SiLago blocks abut. As a result of the abutment, not just the RCT but also other inter-SiLago block functional and infrastructural wires get generated.
Fig. 2. Composition by Abutment of SiLago Blocks

A. Synchoros SiLago Blocks and Composition by Abutment
Silicon Lego bricks or SiLago blocks replace standard cells as the atomic physical design building blocks in the synchoros VLSI design style. Synchoricity of SiLago blocks is manifested in the discrete spatial dimensions of the SiLago block in terms of the virtual grid cells as shown in Fig. 4 . Composition by abutment is enabled by bringing all the inter-SiLago block interconnects (functional and infrastructural wires like clock, power grid etc.) to the periphery at the right place and on the right metal layer. As a result, when functionally compatible SiLago blocks are placed on the grid, their corresponding interconnect abut to create a larger valid VLSI design. This happens without any further need for VLSI engineering other than what has already gone into creating the SiLago blocks, see 
B. Regions: Domain specific SiLago Blocks
SiLago blocks are heavily customized for specific application domains called regions. There are two broad categories of regions: functional and infrastructural. Fig. 3 shows examples in these two categories. Functional region types roughly correspond to the dwarfs identified in the Berkeley Report on Landscape of Parallel Computing [16]. Functional regions are domain specific Coarse Grain Reconfigurable Architectures (CGRAs) that are customized for their respective domains for computation, control, address generation, regional interconnect, access to scratchpad etc. How SiLago CGRAs differ from other CGRAs is elaborated in [1]. Evidence that these CGRAs achieve power and performance comparable to ASICs, typically 3-5 orders of magnitude better than GPUs, is provided in [9, 11, 1].
C. The generality of Synchoros VLSI Design Framework
Synchoros VLSI design style is as general as the standard cell based design style. The key difference is in the abstraction. Using standard cells to express functionality is similar to using the first 100 words we use as a child. While feasible, it is unscalable to express Newton's Principia using the first 100 words. This is what is happening when we attempt to express 100-500 million gate VLSIs in terms of standard cells. The native vocabulary spoken by standard cells is too primitive. IPs and software libraries do not help because they are more like macros and do not constitute a native vocabulary of hardware the way domain specific SiLago blocks do. SiLago blocks are functional, whereas IPs are typically infrastructural. SiLago blocks compose by abutment and do not require any follow-up VLSI engineering. IPs, even if they are hardened, do not compose by abutment and require significant follow-up VLSI engineering. SiLago blocks from different regions natively implement the Webster's dictionary to make expressing complex functionality scalable.
Each region represents a domain, see Fig. 3 (a), and is expressive enough to capture any functionality in that domain. As long as an application is composed of functionalities that belong to the domains shown in Fig. 3 (a), the synchoros SiLago design framework is as general as standard cells. Should there be a domain that is not covered by the regions shown in Fig. 3 (a), this is not a fundamental limitation of synchoricity; it is merely a question of an extra one-time engineering effort to create such a region. We also draw the attention of the reader to the list of infrastructural regions to emphasize that we target not merely data-parallel, compile-time static applications but also reactive applications that are non-deterministically concurrent.
D. Three Levels of Hierarchy
In the synchoros VLSI design framework, there are three levels of hierarchy: local, regional and global, see Fig. 3 
The design of the clock distribution network is also divided in three levels, see Fig. 4. The synchoros grid in Fig. 4 enforces synchoricity. Each hierarchy level uses a different clock distribution mechanism.
Local: At the local level, i.e., the intra-SiLago block level, all wires, functional and infrastructural, are ad hoc and synthesized by the commercial EDA tools the way they are in standard cell based design flows; the local clock tree (LCT) is synthesized by the EDA tools.
Regional: Region instances are aggregations of region specific SiLago blocks. All inter-SiLago block wires in a region instance are created by abutment. SiLago blocks communicate with each other via local region specific NOCs, see [19, 20, 21, 22, 23]. Regional Clock Trees (RCT) are design constructs that are also absorbed in each SiLago block. When SiLago blocks in a region instance abut, an RCT is created that feeds the LCTs in each SiLago block by balancing the skew and also maintaining the slew rate.
Global: The global level (or chip level) is composed of region instances and is also composed by abutment. Region instances communicate with each other using global NOCs, see [1] for more details. These NOCs are also composed of SiLago blocks for buffered wires, NOC logic and region specific network interface units. Region instances communicate with each other on a latency insensitive basis using a variant of GALS called the Globally Ratiochronous and Locally Synchronous (GRLS) method, see [24] for more details. The global clock derived from the PLL(s) is distributed via the space allocated for global NOCs to feed the RCTs in each region instance via a GRLS interface at region instance boundaries. Often the GRLS interface is absorbed as part of a region specific network interface unit.
Fig. 3. (a) Functional and Infrastructural Region Types (b) Three Levels of Hierarchy (Global, Regional, Local) in Synchoros SiLago VLSI Design Platform
E. Overview of Synchoros VLSI Design Flow
The synchoros VLSI design flow has three components as shown in Fig. 5. Two of these components are one-time engineering efforts and the third component is used by the end user. The first component is responsible for the development of the synchoros SiLago platform. This methodology component is based on commercial EDA tools. SiLago region types are designed at RTL, verified, synthesized down to physical level, made synchoros and abutable, and characterized with post-layout data, see [25]. This results in the synchoros SiLago block types, the leaf nodes in Fig. 3 (b). The RCT fragment in each SiLago block type is designed and incorporated during this phase as shown in Fig. 5. The second component develops a library of function implementations, or FIMPs, using the SiLago HLS tool. Each function/algorithm/actor that could be used to compose application or system models is implemented by the HLS tool in M different architectures and degrees of parallelism, thereby also varying in cost metrics (area, latency and average energy). For instance, FIMPs for a 2048-point FFT actor could vary in the number and type of butterflies. The libraries developed correspond to BLAS, LAPACK, Matlab toolboxes etc. FIMPs effectively raise the physical design to algorithmic level and reduce the synchoros VLSI design space sufficiently, as shown in Fig. 1 (d), to enable application and system level synthesis.
The third component transforms application and system models composed in terms of actors/functions into a custom synchoros design in terms of SiLago blocks. At present a proof-of-concept ALS tool exists, see [15], and we are in the process of enhancing it to SLS. The present ALS has three main steps. In the first step it does DSE (Design Space Exploration). If there are L actors and each actor has M FIMPs, there are $M^L$ possible ways of implementing the application. A constraint satisfaction problem has been formulated to do global minimization: the cost of the entire application is evaluated and minimized, whereas HLS tools optimize individual algorithms. Constraints are also applied to the entire application and not budgeted for each algorithm as is done by HLS tools. Finally, ALS explores the design space in terms of more serial or more parallel FIMPs and not in terms of more or fewer arithmetic resources as HLS tools do. Once the optimal mix of FIMPs is selected by the constraint solver, the global interconnect and control (GLIC) logic that connects the selected FIMP instances is synthesized. In HLS tools, this is done manually. GLIC logic is also realized in terms of the SiLago operations. Finally, the entire design is floorplanned. SiLago blocks abut to create all inter-SiLago wires, including the RCT, and the ALS tool also decides the RCT taps in the programmable delay lines to balance the skew. This is discussed in section V.
When the constraint solver evaluates solutions in the $M^L$ space, the cost metrics of all the evaluated solutions are known with post-layout accuracy and are not estimated. The same applies to the GLIC logic. The reason why an ALS like the one described above fails with standard cells is that the cost metrics for the evaluated solutions in the $M^L$ space are not known; they are crude estimates of what HLS, logic synthesis and physical synthesis will do. The discussion above may look far from the RCT synthesis problem, but it is vital to convince the reader about the value of synchoricity and how the proposed RCT enables it.
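The design space exploration step can be illustrated with a minimal sketch. It is not the ALS tool's constraint solver: it simply enumerates all combinations of FIMPs exhaustively, and it assumes that area, latency and energy add up across actors. The `Fimp` record, the `explore` function and all numbers are hypothetical and serve only to show how application-wide constraints are applied to exactly known cost metrics.

```python
from itertools import product
from typing import NamedTuple, Optional, Sequence, Tuple


class Fimp(NamedTuple):
    """One function implementation with its (assumed, illustrative) cost metrics."""
    name: str
    area: int
    latency: int
    energy: int


def explore(actors: Sequence[Sequence[Fimp]],
            max_area: int,
            max_latency: int) -> Optional[Tuple[int, Tuple[Fimp, ...]]]:
    """Search every combination of one FIMP per actor.

    Returns (total_energy, chosen_fimps) for the lowest-energy combination
    that meets the application-wide area and latency constraints, or None.
    """
    best = None
    for choice in product(*actors):            # one FIMP per actor
        area = sum(f.area for f in choice)
        latency = sum(f.latency for f in choice)  # assumption: actors run back to back
        energy = sum(f.energy for f in choice)
        if area > max_area or latency > max_latency:
            continue                           # violates a global constraint
        if best is None or energy < best[0]:
            best = (energy, choice)
    return best


# Two actors (L = 2), each with a serial and a parallel FIMP (M = 2).
actors = [
    [Fimp("fft_serial", 1, 8, 4), Fimp("fft_parallel", 4, 2, 6)],
    [Fimp("fir_serial", 1, 6, 3), Fimp("fir_parallel", 3, 2, 6)],
]
best = explore(actors, max_area=6, max_latency=10)
print(best)
```

The constraints are checked against the whole application rather than budgeted per actor, which mirrors the global minimization described in the text. Exhaustive enumeration is only practical for small L and M; it stands in here for a proper solver.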
IV. RCT BY ABUTMENT IN SYNCHOROS VLSI DESIGN
In this section, we elaborate the RCT design scheme. We first lay down, in sub-section IV.A, the requirements of synchoros VLSI design that the RCT design must comply with. After the requirements, we elaborate the RCT scheme in terms of its components in sub-section IV.B. Next, in sub-section IV.C, we elaborate the delay model of the RCT, which is generated as a one-time activity as part of the design of the synchoros VLSI platform. This is followed by how the delay model is used during application and system level syntheses. Finally, we formulate minimizing the difference in arrival of the RCT at each SiLago block's LCT as a search and optimization problem and present its solution in section V.
A. RCT Requirements in Synchoros VLSI Design
Regional clock tree (RCT) synthesis, like any other clock tree synthesis, has two constraints. The first is to minimize the clock skew to maximize the percentage of clock period that can be used by the combinational logic. The second is to maintain sufficient drive strength to ensure that the slew rate, a technology-design-rule, is not violated. This is critical because the timing models are characterized for a tight range of slew rates and if this range is violated, the timing model would no longer be valid. Finally, such a clock tree should factor in variations in manufacturing, temperature and power supply. Besides these classic requirements on RCT that all clock trees must fulfil, there are two synchoros VLSI design specific requirements that the RCT generated by abutment must fulfil:
1) Identical electrical properties for SiLago blocks of the same type
All SiLago blocks of the same type should have identical electrical properties: delay, load, drive etc. This requirement stems from the need to make the cost metrics of a synchoros VLSI design predictable for the synthesis tools. A signal that propagates through a SiLago block of a specific type anywhere in the fabric should have the same characteristics. If this requirement is not fulfilled, the same SiLago block type would have to be characterized for each possible location in the fabric. Such an engineering effort would be unscalable.
2) No follow up VLSI engineering after abutment
The second requirement is that when SiLago blocks abut, the RCT fragments should compose into a valid RCT without any further synthesis of wires or buffers, or verification in terms of static timing analysis (STA) or DRC (Design Rule Check). In the next sub-section, we present the RCT design scheme that fulfils these requirements.
B. RCT for Synchoros VLSI Design Style
Every SiLago block has an RCT fragment with three components that enable a scalable, correct-by-construction RCT to emerge by abutment of these fragments.
1) Standardized Entry and Exit Points
Every SiLago block type has standard entry (Hin, Vin) and exit (Hout, Vout) points for the RCT fragment as shown in Fig. 6 . Standard implies fixed location on specific edges and metal layers. The entry and exit points of RCT fragments in neighbouring SiLago blocks abut to create a valid RCT. Since the neighbours can be in horizontal and/or vertical dimensions, SiLago blocks need entry and exit points on both horizontal and vertical edges as shown in Fig. 6 .
The RCT can be distributed/propagated in four orientations: top-down and left-right, top-down and right-left, bottom-up and left-right, and bottom-up and right-left. The choice of orientation depends on the corner at which the GCT (Global Clock Tree) enters. The decision at which corner the GCT enters depends on the number and floorplanning of the global NOCs decided during the application and system level syntheses; as stated earlier in section III, the GCT is routed in the same physical space as the global NOCs.
The selection of the entry point on the top-left corner is done to make sure that the regional clock tree can connect with the global clock tree in a regular fashion. The different SiLago blocks can vary in size, but due to synchoricity, the size is discrete and standardized, see Fig. 4. This synchoros property of the SiLago blocks allows blocks of different sizes to abut and generate a valid regional clock tree, see Fig. 2.
Fig. 6. Hardened SiLago block and its integrated RCT fragment and its components
2) Multiplexed and buffered horizontal and vertical chords
These components have two functions. The first is to select the RCT input and output, and the second is to maintain the slew rate. Selecting the input implies selecting Hin or Vin, as shown in Fig. 6. Only one of the inputs to the OR gate can be a clock; the other is set to zero when the RCT is configured at power-up time. Selecting the output implies selecting whether the RCT is propagated to the right exit, i.e., Hout, to the bottom exit, i.e., Vout, or to both; an unselected exit is set to 0 so that no wires are left hanging. This is achieved using two AND gates, as shown in Fig. 6. Depending upon the configuration of the two AND gates, one of the four variants of the chord delay, TRCT_chord, gets selected, see Fig. 6. These gates also serve as the drivers that maintain the slew of the clock.
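The gating described above can be sketched as a small behavioural model. This is an illustration under stated assumptions, not the authors' netlist: the function names `rct_fragment` and `propagate`, the enable signals `en_h` and `en_v`, and the particular configuration (clock entering at Hin of the top-left block, the first column feeding downwards, every row feeding to the right) are hypothetical choices for a top-down, left-right orientation.

```python
from typing import List, Tuple


def rct_fragment(h_in: int, v_in: int, en_h: int, en_v: int) -> Tuple[int, int, int]:
    """Return (clock towards the delay line, Hout, Vout) for one fragment."""
    clk = h_in | v_in                    # OR gate: the unused entry is held at 0
    return clk, clk & en_h, clk & en_v   # AND gates: an unselected exit drives 0


def propagate(clk: int, rows: int, cols: int) -> List[List[int]]:
    """Clock value seen by each block of a rows x cols region instance."""
    local = [[0] * cols for _ in range(rows)]
    h_out = [[0] * cols for _ in range(rows)]
    v_out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Assumption: the global clock enters at Hin of the top-left block.
            if (r, c) == (0, 0):
                h_in = clk
            else:
                h_in = h_out[r][c - 1] if c > 0 else 0
            v_in = v_out[r - 1][c] if r > 0 else 0
            assert not (h_in and v_in), "at most one entry may carry the clock"
            en_h = 1 if c < cols - 1 else 0                 # pass right, except last column
            en_v = 1 if (c == 0 and r < rows - 1) else 0    # only the first column feeds down
            local[r][c], h_out[r][c], v_out[r][c] = rct_fragment(h_in, v_in, en_h, en_v)
    return local


print(propagate(1, 2, 4))
```

With this configuration every block receives the clock on exactly one entry, and exits at the region boundary are tied to 0, which matches the requirement that no wires are left hanging.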
3) Programmable Delay Line
The third component is a programmable delay line that enables adjusting the delay to the LCT entry point, see Fig. 6 . The delay is adjusted depending on the position of the SiLago block in a region instance with respect to where the GCT enters the instance. The objective of adjusting the delay is to minimize the skew for arrival of RCT at LCT entry points in SiLago blocks in a region instance.
The three components together are called the RCT fragment, and the design of the RCT fragment is identical in all SiLago block types. Depending on the size of the SiLago block type, the length and position of the horizontal and vertical chords and, if required to maintain slew, the strength of the driver of the OR gate can differ from one SiLago block type to another. The small and simple logic in the RCT fragment is built from standard cells and does not require any custom design.
In contrast, the LCT can be completely different from one SiLago block type to another. The LCT is automatically generated by the EDA tools and depends on the functionality that the block implements.
The RCT fragment, along with the rest of the logic, wires and buffers in a SiLago block, is hardened and characterized with post-layout data. The hardening process ensures synchoricity: SiLago blocks occupy multiples of SiLago grid cells, and all interconnects, including those for RCT fragments, are brought to the right positions and the right metal layer to enable composition by abutment. As stated, this is a one-time engineering effort that needs to be done for each type of SiLago block; the result can then be used for any SiLago block instantiated in any position in a SiLago region instance of any permissible size. We next derive the RCT delay model that enables calculating the delay of an arbitrary RCT created by abutting the RCT fragments without having to do STA (Static Timing Analysis). This timing model is used to decide the permissible size of a region instance and to decide the tap index i of the programmable delay line in each SiLago block instance so as to minimize the skew between the arrival of the RCT at the LCT entry points of the SiLago blocks in a region instance.
C. RCT Delay Model
The principal delay that we are interested in is the latency of RCT from the entry point into a region instance to each LCT entry point in the SiLago blocks, see Fig. 7. There are as many instances of this latency as there are SiLago blocks in the region. TLCT_x,i is the notation used for this latency, with the subscript x identifying the node ID and the subscript i identifying the selected tap index of the programmable delay line in node x. The purpose of the programmable delay line is to make all TLCT_x,i as equal as possible and to keep the skew within the limits of the slack of the SiLago block so that no timing violation is caused. TLCT_x,i has two components.
1. The first is the sum of the delays TRCT_chord(s) in the RCT chords of the nodes s that are previous to node x. This delay is called the natural propagation delay of RCT, Tnat_x. Fig. 7 shows Tnat_6 as the thick red line composed of RCT chords in the previous nodes 0, 4 and 5.
2. The second is the delay that is imposed on RCT by the delay line in the destination node x. This imposed delay is represented by ttap_i,x, where i is the tap index. ttap_i,x is itself made up of three components and is defined in Fig. 6. In the example shown in Fig. 7, node 6 is the destination node and ttap_i,6 is shown as the thick green line.
With prev(x) denoting the nodes previous to node x, the two components combine as follows:
$$T_{\mathrm{nat}\_x} = \sum_{s \in \mathrm{prev}(x)} T_{\mathrm{RCT\_chord}}(s) \tag{1}$$
$$T_{\mathrm{LCT}\_x,i} = T_{\mathrm{nat}\_x} + t_{\mathrm{tap}\_i,x} \tag{2}$$
These delay components are extracted from post layout design data using sign-off quality timing analysis tools as a one-time engineering effort. They are space invariant, i.e., no matter where a SiLago block of a specific type is instantiated in a fabric, its delay components discussed above are the same. Note that each SiLago region type in principle has the same type of SiLago block; however, there can be a small number of variants because of functional and position dependent interconnect requirements. Each SiLago block is characterized together with its possible neighbours. This ensures that a valid model exists for any possible driver and load. In the experimental setup that we have used for this paper, the nodes on the edge have slightly different values of TRCT_chord. These are reported in section VI.B. The simple delay model described above can predict TLCT_x,i with the same precision as STA applied to post layout data. This is validated in section VI.C. The end user never uses this timing model explicitly because the RCT generated is correct by construction. It is however used by application and system level synthesis tools for deciding the maximum size of a synchronous region instance and for selecting the optimal taps in the delay lines. This is elaborated in the next section.
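The delay model described above (natural propagation delay plus imposed tap delay) can be illustrated with a minimal Python sketch. The chord delays, the (row, col) mesh coordinates and the assumption that the RCT enters at the top-left node and crosses one chord per hop are illustrative assumptions, not values from the characterized platform; the real values are extracted from post layout data.

```python
from typing import Sequence

# Hypothetical chord delays in arbitrary units (assumed for illustration).
H_CHORD = 1.5  # delay of one horizontal RCT chord
V_CHORD = 1.0  # delay of one vertical RCT chord


def natural_delay(row: int, col: int) -> float:
    """T_nat of the node at (row, col).

    Assumes the RCT enters at node (0, 0) and reaches (row, col) through
    `row` vertical and `col` horizontal chords of the previous nodes.
    """
    return row * V_CHORD + col * H_CHORD


def lct_latency(row: int, col: int, tap_delays: Sequence[float], tap: int) -> float:
    """T_LCT_x,i = T_nat_x + t_tap_i, with a 1-based tap index as in the text."""
    return natural_delay(row, col) + tap_delays[tap - 1]


if __name__ == "__main__":
    taps = [0.5 * k for k in range(1, 9)]  # hypothetical 8-tap delay line
    print(natural_delay(1, 2))             # node in row 1, column 2
    print(lct_latency(1, 2, taps, 1))      # same node with the minimum-delay tap
```

Because the chord and tap delays are space invariant, evaluating the model is a table lookup and a sum, which is why no STA run is needed per region instance.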
V. OPTIMAL TAP SELECTION IN DELAY LINES
The search and optimization problem is formulated as follows: which assignment of tap index in each node x minimizes the absolute differences among TLCT_x,i? We first formulate a simpler, locally optimum cost function and then a globally optimum cost function.
To formally define these cost functions, we introduce some notations and conventions. Node IDs are 1…N and tap indices 1…M. The first tap, i=1, imposes the minimum delay and the last tap, i=M, the maximum delay. In other words, ttap_1=min(ttap_i) and ttap_M=max(ttap_i). We remind here that every node has the same number of taps, M, inside its delay line. The ID of the node where RCT enters a region instance is by convention 1 and it has Tnat_1=0 because there are no previous nodes, see equation 1. The node ID N is reserved for the node that is furthest, i.e., Tnat_N=max(Tnat_x). The tap index in node N is fixed to 1, i.e., its imposed delay is ttap_1, to have the minimal latency, TLCT_N,1. This selection is done for two main reasons. The first is to minimize the insertion delay to the blocks. The second is to minimize the number of buffers used in each delay line, reducing the power consumption of the clock tree. As a result, the search space excludes the tap index of node N, which is fixed to 1.
A. Locally Optimum Solution
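The locally optimum cost function in equation 3, L_abs_mean, quantifies the mean of the absolute differences between TLCT_x,i and TLCT_N,1, where x=1…N-1.
$$\mathrm{L\_abs\_mean} = \frac{1}{N-1} \sum_{x=1}^{N-1} \left| T_{\mathrm{LCT}\_x,i} - T_{\mathrm{LCT}\_N,1} \right| \tag{3}$$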
Since L_abs_mean is a sum of absolute values in which each term depends only on the tap index of its own node, L_abs_mean is minimal exactly when each term in the summation is minimal. This in turn can be guaranteed by visiting each of the 1…N-1 nodes and sweeping through the M taps to find the index that gives the minimal absolute difference w.r.t. the reference node N. This recipe for finding the locally minimal solution ILM can be formalized as follows:
$$\mathrm{ILM}(x) = \arg\min_{i \in \{1,\dots,M\}} \left| T_{\mathrm{LCT}\_x,i} - T_{\mathrm{LCT}\_N,1} \right|, \quad x = 1 \dots N-1 \tag{4}$$
To conclude, L_abs_mean is minimal if we replace TLCT_x,i with TLCT_x,k where k=ILM(x). The complexity of the locally optimum solution is O(NM). This is evident from equation 4: there are N-1 nodes to be evaluated and in each node there are M taps to be evaluated.
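The per-node tap sweep can be sketched in Python as follows. This is a minimal illustration, not the SiLago tool implementation; the function names and the delay values in the example are hypothetical and are not the values of Fig. 8.

```python
from typing import List, Sequence


def local_tap_selection(t_nat: Sequence[float],
                        tap_delays: Sequence[float]) -> List[int]:
    """Sweep the M taps of each node against the furthest node.

    t_nat lists T_nat_x for nodes 1..N with the furthest node last;
    tap_delays lists t_tap_1..t_tap_M. Returns 1-based tap indices.
    """
    reference = t_nat[-1] + tap_delays[0]  # T_LCT_N,1: node N is fixed to tap 1
    selection = []
    for t in t_nat[:-1]:  # N-1 nodes, M taps each: O(NM)
        best = min(range(len(tap_delays)),
                   key=lambda i: abs(t + tap_delays[i] - reference))
        selection.append(best + 1)
    selection.append(1)  # fixed tap of the furthest node
    return selection


def l_abs_mean(t_nat: Sequence[float], tap_delays: Sequence[float],
               selection: Sequence[int]) -> float:
    """Mean absolute difference of nodes 1..N-1 w.r.t. T_LCT_N,1."""
    reference = t_nat[-1] + tap_delays[0]
    diffs = [abs(t + tap_delays[i - 1] - reference)
             for t, i in zip(t_nat[:-1], selection[:-1])]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    t_nat = [0.0, 1.0, 1.5, 2.5, 3.0, 4.0]      # hypothetical natural delays
    taps = [0.5 * k for k in range(1, 9)]       # hypothetical 8-tap delay line
    sel = local_tap_selection(t_nat, taps)
    print(sel, l_abs_mean(t_nat, taps, sel))
```

In this hypothetical setup only the entry node cannot reach the reference latency, because even its largest tap delay is too small; every other node matches the reference exactly.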
What makes L_abs_mean local is the fact that it minimizes the mean w.r.t. a single node -the furthest node. If the reference node is changed to some other arbitrary node K, it is not guaranteed that ILM would give a minimal L_abs_mean. In short, L_abs_mean w.r.t. any node would prioritize the needs of a single node, potentially at the expense of other nodes.
Before we present the global cost function, we would like to complement the abstract concepts with a concrete example. This example is then also used with the global cost function to highlight the difference. A SiLago region instance can be modelled as a DAG (Directed Acyclic Graph) laid out as a two dimensional mesh, shown in Fig. 8; this example is a subset of the example shown in Fig. 7, with 6 instead of 8 nodes. Each node in the DAG represents a SiLago block and the edge to the nearest neighbour represents TRCT_chord. Without loss of generality, let us simplify the four variants of TRCT_chord to two: horizontal and vertical, with 1.5 and 1.0 units of delay respectively. Each node x is annotated with its node ID and the values of the delays introduced in equations 1 and 2. The values in the nodes reflect the assumption that the RCT enters the region instance at the top-left corner and propagates in top-down, left-right fashion. Each node is equipped with a delay line with M=8 taps. The delay associated with each tap is shown on the right side in Fig. 8. The tap selection is done to have a minimal L_abs_mean w.r.t. the furthest node 6.
The L_abs_mean in the example given in Fig. 8 is given by the equation 3 and is equal to: 
B. Globally Optimum Solution
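A global cost function would weigh in the needs of all nodes. G_abs_mean is such a cost function: it quantifies the absolute differences of N-1 nodes with respect to each of the N nodes as the reference. The reference index y replaces N in equation 3 and is swept over 1…N, see equation 5, where i_x denotes the tap index selected in node x.
$$\mathrm{G\_abs\_mean} = \frac{1}{N(N-1)} \sum_{y=1}^{N} \sum_{\substack{x=1 \\ x \neq y}}^{N} \left| T_{\mathrm{LCT}\_x,i_x} - T_{\mathrm{LCT}\_y,i_y} \right| \tag{5}$$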
An implication of the global nature of G_abs_mean is that it is no longer sufficient to select the tap index in each node independently of the tap index selection in other nodes. Since there are N nodes and each node has M taps, there are M^N possible configurations of the delay lines, Iconf, in a region instance:
$$I_{\mathrm{conf}} = \{ I_1, I_2, \dots, I_{M^N} \} \quad \text{where} \quad I_k = ( i_1, i_2, \dots, i_N ) \quad \text{and} \quad i_x \in \{ 1, 2, \dots, M \}$$
Note that Ik is a sequence and not a set; the elements of this sequence i) have an order 1…N and ii) can be duplicates. Next, let us define G_abs_mean(Ik) for the candidate solution Ik:
$$\mathrm{G\_abs\_mean}(I_k) = \frac{1}{N(N-1)} \sum_{y=1}^{N} \sum_{\substack{x=1 \\ x \neq y}}^{N} \left| T_{\mathrm{LCT}\_x,I_k(x)} - T_{\mathrm{LCT}\_y,I_k(y)} \right|$$
The globally minimal solution IGM is then defined as follows:
$$\mathrm{IGM} = \arg\min_{I_k \in I_{\mathrm{conf}}} \mathrm{G\_abs\_mean}(I_k)$$
Fig. 9. Solution points in globally optimum search space with red dot highlighting the globally minimum solution.
We next apply the above recipe for finding the globally optimum solution to the same problem that was used as a concrete example for the locally optimum solution in the previous section. We have plotted all 8^5 Ik solutions in the globally optimum search space, see Fig. 9. The highlighted point identifies the globally minimal solution: I={7, 5, 2, 6, 3, 1} with an absolute variation of 0.173. If the tap selection in Fig. 8 that was made for minimal L_abs_mean is used to compute G_abs_mean, the result is 0.24. This illustrates the benefit of adopting G_abs_mean as the cost function.
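The exhaustive search over all delay line configurations can be sketched as follows. This is a minimal illustration under stated assumptions: the cost is normalized over all ordered node pairs, the furthest node is listed last and kept on tap 1, and the function names and example delays are hypothetical.

```python
from itertools import product
from typing import Sequence, Tuple


def g_abs_mean(t_lct: Sequence[float]) -> float:
    """Mean absolute difference over all ordered node pairs (x != y)."""
    n = len(t_lct)
    total = sum(abs(a - b) for a in t_lct for b in t_lct)  # x == y adds 0
    return total / (n * (n - 1))


def global_tap_selection(t_nat: Sequence[float],
                         tap_delays: Sequence[float]
                         ) -> Tuple[Tuple[int, ...], float]:
    """Brute-force search of all M^(N-1) configurations.

    t_nat lists T_nat_x for nodes 1..N with the furthest node last.
    Returns the best 1-based tap sequence and its cost.
    """
    m = len(tap_delays)
    best_taps, best_cost = None, float("inf")
    for free in product(range(1, m + 1), repeat=len(t_nat) - 1):
        taps = free + (1,)  # the furthest node stays on tap 1
        t_lct = [t + tap_delays[i - 1] for t, i in zip(t_nat, taps)]
        cost = g_abs_mean(t_lct)
        if cost < best_cost:
            best_taps, best_cost = taps, cost
    return best_taps, best_cost


if __name__ == "__main__":
    # Hypothetical 3-node chain with a 3-tap delay line.
    print(global_tap_selection([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))
```

The nested loop in the cost function and the product over tap indices make the exponential growth of this search directly visible.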
The globally optimum solution has an exponential complexity of O(N^2 M^N); the N^2 factor represents the complexity of calculating the cost function. In the experimental platform we report in section VI, M=32. Even for a small region instance of just 10 nodes, the complexity is of the order of ~10^17 evaluations of absolute differences. Fortunately, this design space can be easily pruned down to a scalable size without sacrificing the global optimality. We next justify how M can be pruned down to 2, i.e., M=2 independent of the dimension M of the delay line, and how N' can replace N, where N' ≪ N and N' is also independent of the size N of the region instance.
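The order-of-magnitude figure quoted above can be checked with a few lines of Python. The helper name is hypothetical; it simply evaluates N^2 M^N.

```python
def exhaustive_evaluations(n: int, m: int) -> int:
    """N^2 * M^N: cost-function work times the number of configurations."""
    return n ** 2 * m ** n


if __name__ == "__main__":
    print(exhaustive_evaluations(10, 32))  # 112589990684262400, about 1.1e17
    print(exhaustive_evaluations(10, 2))   # 102400 once each node keeps 2 taps
```

The second line shows how strongly the evaluation count drops for the same 10 nodes when only two candidate taps per node remain.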
C. Pruning the Search Space of Delay Line Tap Selection
In this section, we present justifications for dramatically pruning the delay line configuration space and then pruning the pair wise combinations of nodes in a region instance.
1) Justification for M=2
Ideally, we would like L_abs_mean and G_abs_mean to be zero. Since these cost functions are the mean of absolute differences, each term must be reduced to zero for them to be zero. This in turn requires the two TLCT terms in the absolute difference expression to be equal. Since the Tnat component in TLCT is invariant, ttap should have the exact delay that makes the two TLCT equal. This delay is called the ideal ttap delay. For the L_abs_mean case, this is formalized in equation 6.
$$t_{\mathrm{tap\_ideal},x} = T_{\mathrm{LCT}\_N,1} - T_{\mathrm{nat}\_x} \tag{6}$$
Such an ideal tap delay is not possible in practice because it requires an infinitely divisible delay line. However, the ideal tap delay helps identify the two contiguous tap indices whose delays bracket the ideal tap delay. This is illustrated in Fig. 10: ttap_ideal is the basis for finding the two taps i and i+1 whose delays bracket ttap_ideal in node x. This can be repeated for all N-1 nodes to identify the lower bound tap i for each node; node N has a pre-decided tap index as stated before. We represent this sequence of ideal lower bound indices by Ideal_LB. Knowing the lower bound index i = Ideal_LB(x), the ideal upper bound index is simply i+1.
The above justification was for L_abs_mean, which uses the furthest node as the reference. For G_abs_mean, we still maintain the assumption that the furthest node N has a fixed tap index set to min(ttap_i). This allows us to retain the same index bounds as Ideal_LB for the 1…N-1 nodes while searching for the globally optimal solution. Note that this does not in any way compromise the global optimality.
In conclusion, the complexity reduces to O(2N) and O(N^2 2^N) for the locally and globally optimum solutions respectively. 2^N is still a large factor and we next justify why this can be pruned down by replacing N by N' ≪ N.
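Finding the two taps that bracket the ideal tap delay can be sketched with a binary search. This is a minimal sketch assuming that the tap delays increase monotonically with the tap index and that a node whose ideal delay falls outside the delay line is clamped to the first or last pair of taps; the function name and example values are hypothetical.

```python
from bisect import bisect_right
from typing import Sequence, Tuple


def bracketing_taps(t_nat_x: float, t_lct_ref: float,
                    tap_delays: Sequence[float]) -> Tuple[int, int]:
    """Return the 1-based taps (i, i+1) whose delays bracket the ideal delay.

    t_lct_ref is T_LCT_N,1 of the furthest node; tap_delays must be sorted
    in ascending order (assumption: delay grows with tap index).
    """
    ideal = t_lct_ref - t_nat_x             # ideal tap delay of node x
    pos = bisect_right(tap_delays, ideal)   # number of taps with delay <= ideal
    lower = min(max(pos, 1), len(tap_delays) - 1)  # clamp to a valid pair
    return lower, lower + 1


if __name__ == "__main__":
    taps = [0.5 * k for k in range(1, 9)]   # hypothetical 8-tap delay line
    print(bracketing_taps(1.2, 4.5, taps))  # ideal delay 3.3 lies between taps 6 and 7
```

Only these two candidates per node then need to be enumerated, which is what reduces M^N to 2^N in the global search.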
2) Justifying N' ≪ N
A combinational path implies a register output to register input path composed of wires and combinational logic. It is desirable that the clock skew between such register to register paths be minimized. The cost functions specified in EQ3 and EQ5 attempt to minimize such skews, i.e., the absolute difference in TLCT between the furthest node pairs like (1 and N) and also between the nearest neighbours. This would be necessary if every node pair in a region instance had a combinational path between them. This is clearly not the case and is the basis for replacing N by N' ≪ N.
Fig. 11. Every node has combinational paths to other nodes in the span of the sliding window, creating a need to minimize TLCT among nodes within such sliding windows rather than the whole region instance.
To understand this, let us revisit the local NOCs: the intra-region-instance, structured interconnect scheme in synchoros VLSI region types, see section III. These local NOCs serve two purposes. The first is more conventional: to allow SiLago blocks in a region instance to communicate with each other. The second embodies the spirit of synchoricity: they allow the functionality hosted by individual SiLago blocks to be clustered to provide variations in function, capacity and degree of parallelism. In standard cell based VLSI design, this objective is achieved by synthesizing ad-hoc wires to cluster the standard cells. In synchoros VLSI design, these wires pre-exist as fragments of local NOCs in SiLago blocks and the region wide local NOCs emerge as a result of abutment. However, in synchoros VLSI design, the combinational paths between SiLago blocks are restricted to a small window that slides by a certain stride to cover the entire region instance, as shown in Fig. 11. This implies that two SiLago blocks that are in different windows do not have a combinational path between them and as such there is no need to worry about clock skew between them. Communication between such SiLago blocks is possible but involves multiple hops/cycles, i.e., the wires are pipelined.
Since not every node is connected to every other node, but only to the nodes surrounding it in a sliding window, the absolute difference in RCT arrival time needs to be minimized only between every node x and the small set of nodes that are combinationally reachable from x within its sliding window. In the example shown in Fig. 11, this small set has N'=15 nodes. In the synchoros VLSI region type called DRRA, N'=14, see [26].
The space invariance property might suggest that once the absolute difference has been minimized for one sliding window, the same solution should be applicable to all other sliding windows. Unfortunately, this is not true. If the overlap between sliding windows were zero, the solution for one sliding window would apply to all others. Because of the overlap, the natural delay of RCT, though perfectly predictable, is not uniform. Finally, the size of the sliding window decides N' and, as is evident from the example in Fig. 11, it is independent of the size of the region instance.
[Fig. 11 annotations: Sliding windows slide with a horizontal stride of 2 columns and a vertical stride of 1 row. The horizontal and vertical strides and the sliding window size of 3×5 nodes are examples; they will change from one region type to another. 1. Every SiLago node is surrounded by 14 SiLago nodes with which it has combinational paths -the local NOCs. The sliding window has 3×5 nodes. 2. The sliding windows gradually lose their span towards the edges.]
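The restriction of skew balancing to sliding windows can be illustrated with a short Python sketch. It assumes, purely for illustration, that the 3×5 window is centred on the node in question and clipped at the region boundary; the actual window placement and strides depend on the region type, and the function name is hypothetical.

```python
from typing import List, Tuple


def window_neighbours(row: int, col: int, rows: int, cols: int,
                      half_h: int = 1, half_w: int = 2) -> List[Tuple[int, int]]:
    """Nodes that share a (2*half_h+1) x (2*half_w+1) window centred on (row, col).

    The window is clipped at the edges of the rows x cols region instance,
    so nodes near the boundary have fewer neighbours.
    """
    return [(r, c)
            for r in range(max(0, row - half_h), min(rows, row + half_h + 1))
            for c in range(max(0, col - half_w), min(cols, col + half_w + 1))
            if (r, c) != (row, col)]


if __name__ == "__main__":
    print(len(window_neighbours(2, 4, rows=5, cols=9)))   # interior node
    print(len(window_neighbours(0, 0, rows=5, cols=9)))   # corner node
    print(len(window_neighbours(2, 4, rows=5, cols=90)))  # larger region, same count
```

The neighbour count of an interior node does not change when the region grows, which is the property that makes N' independent of N.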
D. Maximum Size of Region Instance
The timing model parameters that were introduced in section IV.C are also used to decide the maximum size of region instance that we allow during the system-level synthesis. We remind that region instances are treated as synchronous regions. More accurately, all flops within a region instance have clocks that are phase aligned but not necessarily skew aligned. Paths within the sliding window span are skew aligned. Since the sliding windows overlap, as shown in Fig. 11, all flops in a region instance have a phase aligned clock even if there is no timing path between them. A concrete manifestation of this policy is that the furthest node has the smallest tap delay assigned to it and this is not subject to change as part of optimization. This not only guarantees minimal RCT latency to the furthest flops but also decides the maximal logical span of the region instance. In essence, the number of taps and their ability to compensate the monotonic increase in Tnat_x with tap delays decide the maximum size of region instance allowed.
$$\max(T_{\mathrm{nat}\_x}) - \min(T_{\mathrm{nat}\_x}) \le \max(t_{\mathrm{tap}\_i})$$
We quantify the above equation in the next section and also show how the region instance size scales with the number of taps.
VI. EXPERIMENTS AND RESULTS
In this section, we experimentally validate the key claim that a valid and predictable RCT can be generated by abutment without the end user having to do logic or physical synthesis. A valid RCT means that the generated RCT is guaranteed to be timing and DRC (Design Rule Check) clean. A predictable RCT means that all properties of the generated RCT are known with post layout accuracy without having to do Static Timing Analysis.
Three experiments were performed. The first experiment reports the results of the RCT model and its properties as discussed in section IV. The second experiment then uses the RCT model to predict the properties of the RCT in an experimental design. The predicted values are validated against the values analysed by commercial EDA tools. A side effect of this experiment is that the RCT generated by abutment is shown to be timing and DRC clean by the commercial EDA tools. The third experiment benchmarks the properties of the RCT generated by abutment against a functionally equivalent RCT generated by EDA tools to show that the two are comparable in their figures of merit, i.e., synchoricity and abutment do not degrade the quality of the RCT generated.
A. Experimental setup
In this sub-section, we present the experimental setup we have used for our experiments.
1) Technology and Tools
All experiments have been done in a 40 nm technology node and the results validated using commercial EDA tools. These tools have been used for three purposes: a) To build the synchoros VLSI design platform, including RCT design and its characterization. This use of EDA tools is a one-time engineering effort and not seen by the end user. b) To validate the claim that the generated RCT is timing and DRC clean and predictable. c) To demonstrate that the benefits of synchoricity and abutment do not degrade the quality of RCT.
2) Experimental Design
The proposed RCT by abutment scheme and the state of the art hierarchical EDA flow are applied to a common experimental design that is a composite region instance of two different types: a dense linear algebra CGRA fabric called DRRA (Dynamically Reconfigurable Resource Array) [19] and a CGRA for scratchpad fabric called DiMArch (Distributed Memory Architecture) [21]. The region instance has 24 SiLago blocks that correspond to roughly 1.5 million NAND gates and 16 kB of SRAM, or 4 mm^2 in 40 nm. The design has been kept small to enable easier presentation of the results and their analysis.
B. RCT Model
An RCT fragment, shown in Fig. 6, with 32 taps in the delay line was incorporated into the DRRA and DiMArch SiLago blocks. These blocks were hardened to be synchoros and all interconnects, including the RCT interconnects, were brought to the periphery to enable abutment, as conceptually shown in Fig. 12. The RCT model parameters were extracted using STA (Static Timing Analysis) from the post layout data and the results are tabulated in Table II. The SiLago blocks on the edges have slightly different values of TRCT_chord compared to the ones in the middle. Because the interconnect and the layout differ among the three rows in Fig. 12, each row has different TRCT_chord delays. The delay model takes that into consideration to correctly predict Tnat. Here we report the values for the middle row. The slew at the LCT entry point and the total capacitance of the RCT structure in each SiLago block are also reported. The tap delay ttap_i in our setup can introduce a delay from 1.7 to 6.2 ns.
The timing model created above factors in variations in temperature, VDD and process as part of standard logic synthesis. For the experiments reported in this paper, the timing model factors in Best and Worst Case Commercial variations. This ensures that the RCT constructs and their timing and electrical properties are robust against the variations that have been factored in.
C. Predictability and validity
The RCT model quantified in the previous sub-section was used to predict the properties of the RCT created by abutment as shown in Fig. 12. This is done by taking the post layout SiLago region instance created by abutment and analysing the properties of the generated RCT by two methods, as shown in Fig. 13. The first method is to use the RCT model and SiLago analysis scripts embodied by equations 1 and 2. The second method is to use the EDA analysis tools. The results of these two paths are compared to establish the accuracy of the RCT model with the EDA tools as the benchmark. A side effect of this experiment is that the RCT created by abutment gets certified as being timing and DRC clean by the EDA analysis tools. The three RCT properties that we focus on predicting are the arrival times of RCT, TLCT_x,i, the slew rate at the entry point of LCT in each of the 24 SiLago blocks, and the combined capacitance of the RCT structure in the 24 blocks. The optimization algorithm was used to find the optimal set of tap indices in the 24 blocks to minimize the absolute difference among TLCT_x,i. Two critical quantified conclusions can be drawn. The first is that the worst case absolute difference is 129 ps, which is easily absorbed by the slack margin with which the SiLago blocks are synthesized. The second is that the values predicted by the SiLago analysis tools are almost identical to the ones analysed by the EDA tools. The worst case error compared to the EDA tools is 1.5 ps and the RMS is 0.0005 ps. We believe that this difference comes from the fact that our experimental setup does not have an infinite ground plane. This results in different parts of the design experiencing different coupling with long signals, like the reset, and the ground plane. This suffices as a proof of concept and provides sufficient accuracy that a post abutment STA is not needed.
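The comparison metrics used above (worst case absolute difference among arrival times, worst case prediction error and RMS error) can be computed as in the following minimal sketch. The function names are hypothetical and the arrival times in the example are made-up placeholders, not measured data from the experiment.

```python
import math
from typing import Sequence


def worst_case_skew(t_lct: Sequence[float]) -> float:
    """Largest absolute difference among the RCT arrival times."""
    return max(t_lct) - min(t_lct)


def worst_case_error(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Largest absolute deviation of the model from the reference analysis."""
    return max(abs(p - r) for p, r in zip(predicted, reference))


def rms_error(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Root mean square of the per-node prediction errors."""
    squares = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squares) / len(squares))


if __name__ == "__main__":
    model = [1.0, 2.0, 3.0]      # placeholder model predictions
    sta = [1.0, 2.5, 3.0]        # placeholder reference values
    print(worst_case_skew(model), worst_case_error(model, sta), rms_error(model, sta))
```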
The slew rate at each LCT entry point and the total capacitance predicted by both methods are a near perfect match between SiLago and EDA tools. The SiLago design reports slew and capacitance equal to 67 ps and 129.9 pF. The respective EDA values are 87 ps and 127.2 pF. We would like to note here that this is the capacitance for the whole clock, RCT and LCT, and that the LCT dominates the capacitance, with the RCT contributing < 3%.
To address scalability concerns, we first performed the experiments, as reported above, for the baseline design of 8 columns and 3 rows that corresponds to 1.5 million gates. We then repeated the experiment for a larger design with 25 columns and three rows, corresponding to roughly 5 million gates. The predictability holds equally well for the larger design as it did for the smaller baseline design.
D. SiLago RCT comparison with EDA
In this section, we demonstrate that the quality of the RCT (Regional Clock Tree) generated by abutment is comparable to a functionally equivalent RCT generated as part of a hierarchical EDA flow. This experiment is visualized in Fig. 14. The Synchoros SiLago design flow starts with an untimed model at application/system-level. Application/system-level synthesis then transforms the model into a logical netlist of SiLago blocks -the region instance; in the current context, a single region instance shown in Fig. 12. The SiLago blocks in the netlist are picked from the Synchoros SiLago VLSI design platform and the VLSI design of the region instance, including RCT, is created by abutment. Note that creation of the synchoros VLSI design platform is a one-time engineering effort, like standard cell library creation.
The second path, based on a commercial hierarchical EDA flow, starts with a logical netlist of RTL SiLago blocks that does not include the regional (inter-SiLago block) wires like RCT. The SiLago blocks are hardened but do not include global wires. Once hardened, these blocks are floor-planned, and the regional clock tree and inter-SiLago block wire synthesis step follows, which generates the functionally equivalent RCT. Note that in EDA terminology, the RCT and inter-SiLago block wires are called global wires, but we refrain from using the term global to maintain consistency with the SiLago flow, in which global implies inter-region instance wires. Both designs are then analysed by the commercial EDA tools to compare the properties of the two functionally equivalent RCTs. The results are shown in Table III. As can be seen, the values of the key parameters are comparable. The standard cell area refers to the standard cells dedicated to the creation of the RCT, including MUX/AND/OR gates, as reported by the EDA tools. In the SiLago case, each SiLago block has a fixed overhead and not all of it is used. This is the key reason for a larger overhead compared to the EDA RCT. However, do note that the RCT area as a percentage of a SiLago block is only 0.1%.
The SiLago RCT achieves an arrival time of RCT at the LCT (Local Clock Tree) entry points that is comparable to that of the commercial EDA RCT. The SiLago RCT has a slightly better slew rate at the LCT entry, and the average and absolute difference in slew at different points in the trunks of the entire clock tree (RCT+LCT) are also comparable and within the limits of the technology rules.
The above experiment and numbers establish that the RCT generated by abutment has comparable quality to the one generated by commercial EDA tools. The difference is that the EDA tool generated RCT is ad-hoc and synthesized anew for each design instance. Notice the irregularity of the EDA RCT in Fig. 15 (a), which results from the attempt to factor and reuse the buffers. This irregularity violates the requirements that synchoros VLSI design places on RCT. In contrast, the synchoros RCT by abutment is regular, as shown in Fig. 15 (b). It has three main branches corresponding to two DRRA rows and one DiMArch row. Each branch has 8 leaf nodes and each leaf node has the same RCT structure. This regularity enables abutment and the absorption of the RCT in SiLago blocks as a pre-synthesized and characterized structure, and it enables predictability. Though not the most critical difference, the RCT generated by commercial EDA tools took ~50 minutes for synthesis and another ~5 minutes for STA for a modest 1.5 million gate design. Any post-CTS optimization requires another ~30 minutes. This time is expected to increase exponentially with design size. In the case of synchoros RCT, the clock tree generation and its analysis are instantaneous.
VII. STATE OF THE ART
In this section, we justify the contributions of this paper in comparison to the state of the art primarily in clock tree synthesis but also in composition by abutment.
CTS (Clock Tree Synthesis) has been researched since the earliest days of VLSI. One of the most critical chapters in the classic textbook on VLSI Systems by Mead-Conway [13] is dedicated to clocking and timing. Ever since then, CTS has been a mainstream VLSI research topic. The main objective of CTS research has been to optimize the cost metrics of the clock tree, i.e., smaller switching capacitance, minimizing skew, maintaining edge etc. [27, 28]. These approaches are based on sophisticated heuristic algorithms. Most of these methods are based on the van Ginneken dynamic programming algorithm for buffer insertion and sizing, and the delay model used is the Elmore delay model [29]. A good survey of CTS techniques is presented in [30]; additionally, the authors propose their own algorithm for chip level clock tree synthesis that minimizes the skew and is manufacturing variation aware. [31, 32] also propose solutions to chip level clock tree synthesis that are aware of on chip variations. J. Lu et al. in [33] address a different problem with a similar method, where they use a post-clock-tree-synthesis optimization to improve the clock period. Their proposed method addresses the problem by buffering a synthesized clock tree considering the already synthesized local clock tree and the delay of the logic path, allowing for non-zero skew. In contrast, our goal is not to optimize the clock period, but to balance the arrival time of the clock at the source points of the local clock trees. The balance is done by configuring an existing delay line that has already been implemented. In the synchoros VLSI design style, once a SiLago block has been designed, no further change is possible and no additional external circuitry can be added.
Fig. 15. The ad-hoc unstructured EDA RCT vs. the regular structured synchoros SiLago RCT.
With dynamic voltage frequency scaling becoming mainstream, research in clock trees has also addressed concerns raised by varying VDD, which changes the drive and thus influences the amount of skew. To address these problems, a common method is the use of Adjustable Delay Buffers (ADBs). Many recent research works address this problem using ADBs [34, 35, 36].
H-tree and Mesh clock structures have also been widely researched [29, 37]. The regularity of these structures suggests them as a good match for the regularity of the synchoros VLSI design style. However, neither of these structures fulfils the requirements that RCT by abutment imposes as part of the synchoros VLSI design style. An H-tree is a hierarchical structure and its depth would depend on the size of the region instance. For this reason, absorbing an H-tree as RCT fragments and creating an arbitrary H-tree by abutment, depending on the size and shape of the region instance, is not feasible. A mesh based clock tree poses a challenge in terms of the predictability requirement. Since the entry point in a clock tree mesh, an equi-potential surface, is not known, it creates cyclic graphs that are hard to analyse with STA [29, 37]. This would be an even bigger problem for the RCT by abutment scheme because it requires no STA and is thus required to have iron-clad timing models that predict the properties of RCT with sufficient accuracy. However, we note that in terms of geometry, the proposed RCT structure is a mesh structure, but one in which the entry point and propagation of RCT are known and thus analysable to create the RCT model as shown in section IV.C.
FPGAs, like synchoros VLSI designs, are regular structures and naturally invite comparison of their clock tree schemes. The fundamental difference in regularity is that the regularity of FPGAs is not changeable in silicon for the end user, i.e., no matter how the end user configures an FPGA, the clock tree wires will not change in their position, length or drive. These wires can be configured differently but the silicon will not change. In contrast, depending on functionality and constraints, the output of application and system level syntheses will imply different RCT wires in silicon -all of them regular and structured but different structures. FPGAs partition their leaf nodes -the flops -into clusters based on spatial locality. These clusters can be clocked by different global clock tree wires in a configurable manner. Further, depending on the datapath created by the end user and the choice of clock to drive them, FPGAs do require an STA as a post synthesis step. In synchoros VLSI design, unlike FPGAs, the datapath is highly customized for a specific application domain -much more than DSP slices in FPGAs. These regions are pre-characterized to work at certain clock speeds as long as the RCT infrastructure is able to deliver RCT to the LCT entry points within a known skew margin. This is the reason why in synchoros VLSI design no STA is required as a follow-up step and region instances of arbitrary size, hosting arbitrary domain specific functionality, are guaranteed to be correct by construction.
The clock tree synthesis research that we address in this paper does not compete directly with these approaches. The method reported in this paper complements and builds on existing research. As shown by our experiments, and as would be obvious to most practitioners of VLSI design, the LCT has the dominant switching capacitance and we rely on the existing body of research to automate this step. What we propose is an alternative to the clock tree generation at a higher level that has relatively insignificant capacitance but profoundly affects the engineering cost.
The proposed solution most directly competes against the widely prevalent hierarchical synthesis flow of the EDA tools and its global clock tree synthesis phase. We propose to replace this ad hoc follow-up synthesis step with a one-time engineering effort that makes the generation of VLSI design alternatives non-incrementally faster, easier and, most critically, predictable and correct by construction. This, we believe, is essential to automate the higher abstraction synthesis (see Fig. 1) as argued by Hemani et al. in [1].
Another aspect of the research proposed in this paper, besides RCT synthesis, is composition by abutment. Once again, this method is as old as VLSI design itself. The Mead-Conway [13] method involved composing macros by abutting bit slices. Later, these macros were parametrized, and complete system level design was attempted as part of silicon compiler research. This phase of VLSI design automation was eclipsed when standard cells were introduced, and composition by abutment lost its appeal as a method. More recently, [38, 39] have proposed methods that bear some similarity to the SiLago method, but they do not detail how they handle the synthesis of infrastructural wires like clocks, resets and the power grid by abutment. The critical difference is that SiLago aims at automated synthesis of functional hardware, whereas [38, 39] aim at incrementally improving the EDA tool's hierarchical synthesis flow for designing processor centric infrastructural hardware.
VIII. CONCLUSION AND FUTURE WORK
We have presented a regional clock tree synthesis method based on abutting identical fragments of the clock tree that are absorbed, as a one-time VLSI engineering effort, in the abuttable SiLago blocks. The generated clock tree has been shown to be valid, i.e., timing clean, and comparable in its cost metrics to a functionally equivalent clock tree generated by the EDA tools. The advantage of the proposed scheme over the EDA tool based design is that it generates a predictable, correct by construction design of the regional clock tree in two orders of magnitude less time. This, we argue, is essential to enable automated synthesis of functional hardware from higher abstractions. We have also shown that the scheme is scalable to larger designs of ~5 million gate complexity and that the synthesis time required is less than a minute.
The programmable delay line is constructed from qualified standard cells, and its atomic delay decides the resolution to which we can minimize skew. We are working on a more advanced programmable delay line that will allow the finer adjustment of delay that would be needed for higher frequencies.
In this improved delay line, the number of taps would increase logarithmically with the amount of delay, and thereby with the size of the design over which it can maintain synchronicity. Such a delay line will be constructed with positionally weighted taps, much like the digits of a fixed point number. We are also in the process of making the delay tapped in the delay line adaptive, so that it can be adjusted in response to changes in Vdd as part of a dynamic voltage and frequency scaling scheme.
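The positional weighting can be illustrated with a minimal sketch. It is not the design under development; it assumes binary (radix-2) weights, so that stage i contributes 2^i atomic delays when its tap is enabled. The function names and the 10 ps atomic delay used in the usage note are hypothetical and chosen only for illustration.

```python
def num_stages(max_delay_ps: int, atomic_delay_ps: int) -> int:
    """Number of binary-weighted stages needed to cover max_delay_ps.

    Assumption: stage i adds (2**i) * atomic_delay_ps when enabled,
    so n stages cover up to (2**n - 1) atomic delays.
    """
    steps = -(-max_delay_ps // atomic_delay_ps)  # integer ceiling division
    return max(1, steps.bit_length())


def select_taps(target_delay_ps: int, atomic_delay_ps: int, stages: int):
    """Pick the tap setting closest to the target delay.

    Returns (bits, achieved_delay_ps), where bits[i] is 1 if stage i
    (weight 2**i) is enabled. The setting saturates at the maximum
    delay the line can provide.
    """
    max_code = (1 << stages) - 1
    code = round(target_delay_ps / atomic_delay_ps)
    code = min(max_code, max(0, code))
    bits = [(code >> i) & 1 for i in range(stages)]
    return bits, code * atomic_delay_ps
```

With a hypothetical atomic delay of 10 ps, covering 1000 ps takes 7 stages and covering 2000 ps takes 8, so doubling the delay range adds a single stage. This is the logarithmic growth described above, under the binary-weight assumption. The atomic delay remains the resolution of the line, so the residual error after rounding is at most half an atomic delay whenever the target lies within the covered range.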
