# SYNTHESIS OF <br> CLOCK DISTRIBUTION NETWORKS <br> IN THE PRESENCE OF <br> MANUFACTURING PROCESS VARIATIONS

> MOHAMED NEKILI, DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING, ÉCOLE POLYTECHNIQUE DE MONTRÉAL

THESIS SUBMITTED FOR THE DEGREE OF PHILOSOPHIAE DOCTOR (Ph.D.) (ELECTRICAL ENGINEERING)

June 1998
© Mohamed Nekili, 1998

## National Library of Canada

Acquisitions and Bibliographic Services

395 Wellington Street, Ottawa ON K1A 0N4, Canada

The author has granted a nonexclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.


## UNIVERSITÉ DE MONTRÉAL

## ÉCOLE POLYTECHNIQUE DE MONTRÉAL

## This thesis, entitled: <br> SYNTHESIS OF <br> CLOCK DISTRIBUTION NETWORKS <br> IN THE PRESENCE OF MANUFACTURING PROCESS VARIATIONS

presented by: NEKILI Mohamed
for the degree of Philosophiae Doctor (Ph.D.), has been duly accepted by the examination committee composed of:
Mr. SAWAN Mohamad, Ph.D., chair
Mr. SAVARIA Yvon, Ph.D., member and research supervisor
Mr. BOIS Guy, Ph.D., member and research co-supervisor
Mr. RUMIN Nick Charles, Ph.D., member
Mr. AUDET Daniel, Ph.D., member

To my grandmother Fatma, the shepherd of my childhood. To my mother and her wisdom of the heart, to my father and his patience, to their courage.

To my wife Kahina, to my sister Aini, my sister-in-law Djamila and my aunt Yamina, to my brother-in-law Ahmed, to my brother Hakim,
who watched over the family during my long quest for knowledge, to my brother Djamel, to my nephews Djillali, Abdelhafid, Koceila and Chabane.

To the whole family.

To my best friend in North America,
Abdelouahab

## Acknowledgements

I wish to thank one person first, who deserves a paragraph of his own. This tribute goes to my research supervisor, Professor Yvon SAVARIA, for his fundamental contribution and for his financial and moral support throughout my Ph.D. I thank him for sharing my concern for this thesis at both the technical and methodological levels. I can attest to his rare competence. I salute his patience and his human values. May he find here the expression of my deep gratitude for his continual presence at the forefront.

To my research co-supervisor, Professor Guy BOIS: I thank him for his technical contribution and for his financial and moral support. I found in him a friend and a valuable collaborator. On the administrative side, the competence and availability of Ghyslaine ETHIER-CARRIER are greatly appreciated; they certainly make a student's life more pleasant.

I thank Professors Mohamad SAWAN for agreeing to chair my thesis defence, Nick RUMIN for serving as my external examiner, and Daniel AUDET for travelling from Chicoutimi to enrich my thesis with his expertise.

To all my colleagues at the VLSI laboratory of École Polytechnique de Montréal and at the GRM laboratory in the André Aisenstadt pavilion of Université de Montréal, thank you, and in particular to Jean BOUCHARD and Réjean LEPAGE for their patience.

To all those who deserve credit for the completion of this work.

## Summary

One of the growing limitations faced by large, high-speed clock distribution networks is the clock skew caused by manufacturing process variations. This thesis sets out to determine the importance of the problem in a regular architecture, such as a processor array. Process variations are modeled either as an oriented uniform gradient (VLSI) or as a non-uniform gradient (WSI) of the MOS transistor time constant. One goal of this modeling is to derive strategies that the designer can incorporate early in the design of the clock distribution network. Then, in order to understand the sources of the problem, we propose a spatial characterization of the effect of process variations on the MOS transistor time constant. This characterization is carried out by measuring the period of CMOS ring oscillators operating at 500 MHz. These oscillators are placed at different positions, on different dies, in different silicon wafers. The spatial variations of the MOS transistor time constant turned out to come from diverse sources and, in some cases, escape modeling. A solution advocated in this thesis therefore consists in measuring the resultant of the different components and compensating for it by equalizing the delays of the different paths of a clock distribution tree after the circuit has been fabricated. Calibration is performed at the clock-signal amplifiers and at the interconnections, using laser cuts and reconfigured masks.


## Abstract

An ever-growing limitation in large, high-speed clock distribution networks is process-induced skew. This thesis addresses the importance of the problem in the case of a regular architecture such as a processor array. Process variations are modeled either as an oriented uniform gradient (VLSI scale) or as a non-uniform gradient (WSI scale) of the MOS transistor time constant. One of the goals is also to propose strategies that can be used in the early stages of clock distribution network design. In order to understand the sources of the problem, a spatial characterization of the effects of process variations on the MOS transistor time constant is proposed. This characterization is achieved by measuring CMOS ring oscillators operating around 500 MHz. These oscillators are implemented in different locations, on different dies and wafers. Spatial variations of the MOS transistor time constant appear to originate from diverse sources, some of which are difficult to model. Thus, a solution proposed in this thesis consists in measuring the resultant of all components and compensating for it by delay calibration of the different paths in the clock distribution network. This calibration is performed either via the clock drivers or via the interconnections, by using laser cuts and reconfiguring some masks.


## TABLE OF CONTENTS

DEDICATION
Acknowledgements
Summary
Abstract
TABLE OF CONTENTS
LIST OF APPENDICES
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1. General Introduction
I. Problem Statement
II. Objectives and Methodology
III. Contributions
IV. Publications
V. Thesis Outline
CHAPTER 2. Literature Review
Design of Clock Distribution Networks in Presence of Process Variations
I. Introduction
II. Importance of the Problem
II.1. Work of Fisher \& Kung
II.2. Work of Pelgrom et al.
III. Source of the problem
III.1. Work of Pavasovic \& Andreou
III.2. Work of Gneiting \& Jalowiecki
IV. "Design Rules" Solution
IV.1. Work of Shoji
IV.2. Work of Vittoz
V. Exact-Zero Skew Algorithms
V.1. Work of Tsay
VI. Built-in Self-Compensation
VII. Conclusion
CHAPTER 3. Importance of Spatial Variations of the MOS Transistor Time Constant
Pipelined H-trees for High-Speed Clocking of Large Integrated Systems in Presence of Process Variations
I. Introduction
II. Logic-Based H-tree: An Extension of the Pipeline Clocking Method
III. Skew Modeling in a Logic-Based H-tree Subject to a Uniform and Oriented Transistor Time Constant Gradient
III.1. Assumptions
III.2. Logic-Based H-tree Definition
III.3. Delay Formulation Along a Path of the H-tree
III.4. Skew Formulation
III.5. An Upper Bound Result on Clock Skew
III.6. Two-Dimensional Arrays
IV. Generalization to Large-Area Parametric Variations
V. Trade-offs between Skew and Clock Rate
V.1. Skew Dominance in the Determination of the Maximal Clock Rate
VI. Conclusion
VII. Acknowledgments
CHAPTER 4. Design of Reliable, Low-Power Clock Distribution Networks
Design of Low-Power and Reliable Logic-Based H-trees
I. Introduction
II. Formulation of Constraints
III. Consequences on Global Performance
IV. Dominance of Constraints
V. Conclusion
CHAPTER 5. Sources of Spatial Variations of the MOS Transistor Time Constant
Spatial Characterization of Process Variations via MOS Transistor Time Constants in VLSI \& WSI
I. Introduction
II. Circuit Design
III. Discussion of Experimental Results
III.1. The Environmental Component
III.2. The Process Component
III.3. The Power-Supply Component
IV. Spectral Analysis and Spatial Correlations
IV.1. FFT-based Analysis
IV.2. Spatial Correlations on-die and from-die-to-die
V. Conclusion
Acknowledgment
CHAPTER 6. Skew Minimization Techniques Based on Delay Calibration
Minimizing Process-Induced Skew Using Delay Tuning
I. Introduction
II. Delay Tuning via Interconnections
II.1. Algorithm
II.2. Test chip
III. Delay Tuning via Oscillators \& Buffers
III.1. Oscillator-Based Method
III.2. Buffer-Based Method
III.3. Impact on prototyping delays
III.4. Test chip proposals
III.4.1. Test on delay
III.4.2. Test on skew
IV. Conclusion
CHAPTER 7. General Conclusion
References

## LIST OF APPENDICES

Appendix A. Number of inverters in a Logic-Based H-tree
Appendix B. A Variable-Size Parallel Regenerator for Long Integrated Interconnections
I. Introduction
II. Circuit Description
III. Performance Analysis Under an ATP Metric
IV. Conclusion
Appendix C. A Fast Low-Power Driver for Long Interconnections in VLSI Systems
I. Introduction
II. Circuit Description
III. Results and Discussion
IV. Conclusion

## LIST OF FIGURES

Fig. 2.1. Difference model
Fig. 2.2. Summation model
Fig. 2.3. H-tree clock
Fig. 2.4. A two-phase clock circuit
Fig. 2.5. Interconnecting two zero-skew subtrees
Fig. 3.1. A logic-based H-tree
Fig. 3.2. Detail of the logic-based variant of an H-tree-based clock network
Fig. 3.3. Distributing a 1.11 GHz clock through a 50-stage chain of successive minimum-sized inverters
Fig. 3.4. Series used in delay calculation
Fig. 3.5. Skew variation versus the configuration of direction variables
Fig. 3.6. The two pairs of leaves that produce the skew upper bound
Fig. 3.7. A processor and its 8 neighbors
Fig. 3.8. An example of sub-regions in Fig. 3.1
Fig. 3.9. Gaussoid distribution of a transistor time constant
Fig. 3.10. Use of drivers and interconnection in an H-tree metallic path
Fig. 3.11. Skew dominance in the determination of the maximal clock rate
Fig. 4.1. Amplification in a hybrid H-tree
Fig. 4.2. Logic-based H-tree
Fig. 4.3. Effect of processor size on H-tree depth
Fig. 4.4. Domains of dominance of power and skew constraints
Fig. 5.1. Ring oscillator (schematic)
Fig. 5.2. Ring oscillator and inverter's output (photograph)
Fig. 5.3. HSPICE electrical simulation
Fig. 5.4. Oscillation cell (photograph)
Fig. 5.5. Our design (photograph)
Fig. 5.6. Vdd and Gnd pads (photograph)
Fig. 5.7. The die core and our design
Fig. 5.8. Wafer (photograph)
Fig. 5.9. Equipment (photograph)
Fig. 5.10. Two signal outputs (photograph)
Fig. 5.11. Four-side representation of clock period at the die level: (a) 1st randomly-chosen die; (b) 2nd randomly-chosen die; (c) die from a non-scribed wafer
Fig. 5.12. Histogram of period distribution in die (b) of Fig. 5.11
Fig. 5.13. Correlation along X \& Y directions: (a) 1st non-scribed wafer; (b) 2nd non-scribed wafer
Fig. 5.14. Correlations between neighbors: (a) horizontal pair of die sides; (b) vertical pair of die sides
Fig. 5.15. Spatial characterization of clock period on WSI scale: (a) 1st non-scribed wafer; (b) 2nd non-scribed wafer
Fig. 5.16. Numerical values of the process component
Fig. 5.17. Schematic of the power-supply distribution network
Fig. 5.18. Voltage fluctuations along power-supply rails: (a) power fluctuations along Vdd rail; (b) power fluctuations along Gnd rail; (c) voltage difference Pi-Gi
Fig. 5.19. Clock period variations versus the position of a ring oscillator along the segment
Fig. 5.20. Modulation of the clock period
Fig. 5.21. One-dimensional FFT of period distribution: (a) in X direction of Fig. 5.13a; (b) in Y direction of Fig. 5.13a
Fig. 5.22. Types of neighbors
Fig. 6.1. Delay tuning via interconnections
Fig. 6.2. Test chip
Fig. 6.3. Logic for generating and driving the clock signal
Fig. 6.4. HSPICE simulation of the clock signal
Fig. 6.5. Pads for microwave probes
Fig. 6.6. Capacitor layout
Fig. 6.7. Configuration modes for the delay tuning structure: (a) oscillator mode; (b) amplifier mode
Fig. 6.8. Laser tuning of buffer size
Fig. 6.9. Test on delay
Fig. 6.10. Test on skew
Fig. B.1. The Parallel Regeneration Technique (PRT) introduced in [64]
Fig. B.2. A Variable-size PRT (VPRT)
Fig. B.3. Output voltage of a 10 cm line versus time using PRT and VPRT with CMOS4S and BATMOS
Fig. C.1. Insertion of drivers for line regeneration
Fig. C.2. A $\pi$ line model
Fig. C.3. A clamping regenerator
Fig. C.4. Outputs of successive drivers with a good design of the clamper
Fig. C.5. Problems due to a bad design of the clamper
Fig. C.6. Oscillations caused by a bad design of the clamping regenerator
Fig. C.7. Line delay versus driver size
Fig. C.8. Line delay versus segment length

## LIST OF TABLES

Table 2.1. Transistor time constants
Table 3.1. Horizontal and vertical time constant gradient components
Table 3.2. Skew complexities versus H-tree size according to some authors
Table 3.3. Summary of the H-tree performances with different configurations
Table 4.1. Performance versus processor switching activity
Table 5.1. Means and standard deviations of period distributions at the die level
Table 5.2. Means and standard deviations of period distributions at the wafer level
Table 5.3. Power in various components of the clock signal (in $\mathrm{ps}^{2}$)
Table 5.4. Means of absolute period differences
Table B.1. Interconnection delays (CMOS4S)
Table B.2. Interconnection delays (BATMOS)
Table C.1. Comparisons with RID
Table C.2. Speed Stability

## Chapter 1.

## General Introduction

## I. Problem Statement

One of the essential requirements of any synchronous system is the ability to bound the clock skew. This requirement increasingly constrains clock distribution networks as technology scales down and chip size grows. For example, the current trend in CMOS technology is for the clock period to approach 1 ns. A rule of thumb for ensuring the reliability of a synchronous system is to keep the skew below 10% of the period. One can therefore appreciate how difficult it is to guarantee that the largest delay difference between any two clock nodes in the system stays below 100 ps.
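As a quick check of the rule of thumb above, the following sketch computes the skew budget for a given clock period. The function name and its `fraction` argument are illustrative choices, not taken from the thesis.

```python
def skew_budget_ps(clock_period_ns: float, fraction: float = 0.10) -> float:
    """Maximum tolerable clock skew, in picoseconds.

    `fraction` is the share of the clock period allowed for skew
    (10% is the rule of thumb quoted in the text).
    """
    if clock_period_ns <= 0 or not 0 < fraction < 1:
        raise ValueError("period must be positive and fraction in (0, 1)")
    # 1 ns = 1000 ps
    return clock_period_ns * 1000.0 * fraction


if __name__ == "__main__":
    # A 1 ns period (1 GHz clock) leaves roughly 100 ps for skew.
    print(f"skew budget: {skew_budget_ps(1.0):.0f} ps")
```

Any faster clock shrinks this budget proportionally, which is why process-induced skew becomes harder to absorb as clock rates rise.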

Over the last decade, much of the literature has focused on minimizing clock skew during the design process. These efforts mainly consist in equalizing the electrical lengths of the different clock paths in the system. Despite these efforts, a non-negligible amount of skew persists after fabrication, notably because of manufacturing process variations. Moreover, miniaturization has progressed far more aggressively than the improvement in tolerance to process variations. Very few published works attempt to measure the importance of the phenomenon, identify its sources, and suggest solutions. Remedying this shortcoming is one of the goals of this thesis.

## II. Objectives and Methodology

A significant effort was devoted to reviewing contributions in the field in order to distinguish the different approaches advocated by researchers. This exercise allows us to better position our own contribution.

We then determine the importance of the effect of process variations on clock skew in a regular architecture, such as a processor array, when this architecture is implemented in a VLSI or WSI context. Process variations are modeled either as an oriented spatial gradient (VLSI) or as a Gaussian distribution (WSI) of the transistor time constant. Experimental work has shown that logic is much more affected by process variations than interconnections are, hence the choice of a model based on transistor properties. One goal of this modeling is to derive strategies and methods that the designer can incorporate early in the design of a clock distribution network.
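The two spatial models named above can be sketched as simple functions of position. This is a minimal illustration only: the functional forms are the generic linear and Gaussian shapes, and every parameter name and numeric value (`tau0`, `gx`, `gy`, `amplitude`, `sigma`, the 4 x 4 grid) is a hypothetical placeholder rather than a value from the thesis.

```python
import math


def tau_gradient(x, y, tau0, gx, gy):
    """Uniform oriented gradient (VLSI scale): tau varies linearly in x and y."""
    return tau0 + gx * x + gy * y


def tau_gaussian(x, y, tau0, amplitude, x0, y0, sigma):
    """Gaussian-shaped variation (WSI scale) centred at (x0, y0)."""
    r2 = (x - x0) ** 2 + (y - y0) ** 2
    return tau0 + amplitude * math.exp(-r2 / (2.0 * sigma ** 2))


def spread(model, positions, **params):
    """Largest difference in tau between any two positions."""
    values = [model(x, y, **params) for x, y in positions]
    return max(values) - min(values)


# Hypothetical 4 x 4 processor array on a unit grid (arbitrary units).
GRID = [(x, y) for x in range(4) for y in range(4)]

if __name__ == "__main__":
    # All numeric values below are made up for illustration.
    print(spread(tau_gradient, GRID, tau0=100.0, gx=0.5, gy=0.2))
    print(spread(tau_gaussian, GRID, tau0=100.0, amplitude=5.0,
                 x0=1.5, y0=1.5, sigma=2.0))
```

Under the linear model the spread grows with the array size, whereas under the Gaussian model it saturates at the bump amplitude; this is one way to see why the two scales are treated separately.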

At high speed, another constraint becomes critical in determining the performance of a large synchronous system: power consumption. Some processors, such as the DEC Alpha, devote half of their power budget to the clock distribution network. When a circuit operates at high speed, the component of dynamic power associated with the so-called short-circuit current plays an important, even dominant, role in the power consumption of a clock distribution tree. For designers of portable systems, this issue is of prime importance. In this thesis, we consider the skew and power dissipation constraints simultaneously to predict system performance. We also propose clock-signal amplification structures that reconcile high speed with reduced power consumption.

The first models adopted in this thesis gave a better understanding of the importance of the phenomenon, but they remain first-order models. Experiment remains the only way to account for the full reality. Moreover, to understand the sources of the problem, it is useful to resort to physical measurement in order to capture all possible components of the phenomenon. To this end, we propose a spatial characterization of the effect of process variations through a parameter directly involved in timing issues: the MOS transistor time constant. Variations of this parameter over large silicon areas have direct consequences on the synchronization and timing performance of high-speed integrated systems. This experimental characterization is carried out by measuring the oscillation period of CMOS ring oscillators operating at around 500 MHz. These oscillators are placed at different positions, on different dies, in different silicon wafers. The goal is to map the MOS transistor time constant at the VLSI and WSI scales.
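To illustrate how a measured ring-oscillator period relates to a per-stage delay, here is a minimal sketch using the textbook first-order relation $T = 2 N t_{stage}$ for an $N$-stage inverter ring. The number of stages and the per-site periods below are hypothetical; the text quoted here does not give them.

```python
def stage_delay_ps(period_ns: float, n_stages: int) -> float:
    """Average per-stage delay of an N-stage inverter ring oscillator.

    Uses the first-order relation T = 2 * N * t_stage; N must be odd
    for the ring to oscillate.
    """
    if n_stages < 3 or n_stages % 2 == 0:
        raise ValueError("a ring oscillator needs an odd number of stages >= 3")
    return period_ns * 1000.0 / (2 * n_stages)


def relative_deviation(periods_ns):
    """Deviation of each measured period from the mean, as a fraction."""
    mean = sum(periods_ns.values()) / len(periods_ns)
    return {site: (p - mean) / mean for site, p in periods_ns.items()}


if __name__ == "__main__":
    # Hypothetical measurements around 2 ns (i.e. around 500 MHz).
    measured = {"north": 1.98, "east": 2.03, "south": 2.01, "west": 1.96}
    for site, dev in relative_deviation(measured).items():
        print(f"{site}: {dev:+.2%}")
    # Hypothetical 5-stage ring, for illustration only.
    print(stage_delay_ps(2.0, 5), "ps per stage")
```

To first order the stage delay is proportional to the transistor time constant, so the relative deviation of the period at each site can be read as the relative deviation of that time constant.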

Analysis of the measured spatial variations of the MOS transistor time constant revealed components from diverse sources, and part of our experimental results escape modeling. The solution advocated in this thesis therefore consists in measuring the resultant of the different components and compensating for it by equalizing the delays of the different paths of a clock distribution tree after the circuit has been fabricated. Calibration is performed at the clock-signal amplifiers and at the interconnections.
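The measure-then-compensate idea can be sketched as follows. This is not the thesis's tuning algorithm (which is presented in Chapter 6); it is a simplified illustration that assumes delay can only be added to a path, in arbitrary continuous amounts. The leaf names and delay values are hypothetical.

```python
def skew_ps(delays_ps):
    """Clock skew: largest minus smallest root-to-leaf delay."""
    return max(delays_ps.values()) - min(delays_ps.values())


def tuning_offsets_ps(measured_ps):
    """Extra delay to add on each path so that all match the slowest one."""
    target = max(measured_ps.values())
    return {leaf: target - delay for leaf, delay in measured_ps.items()}


def apply_tuning(measured_ps, offsets_ps):
    """Path delays after the tuning offsets have been applied."""
    return {leaf: measured_ps[leaf] + offsets_ps[leaf] for leaf in measured_ps}


if __name__ == "__main__":
    # Hypothetical post-fabrication measurements, in picoseconds.
    measured = {"leaf0": 1510.0, "leaf1": 1542.0, "leaf2": 1498.0, "leaf3": 1530.0}
    offsets = tuning_offsets_ps(measured)
    tuned = apply_tuning(measured, offsets)
    print("skew before:", skew_ps(measured), "ps")
    print("offsets:", offsets)
    print("skew after:", skew_ps(tuned), "ps")
```

In practice the achievable offsets are quantized by the tuning mechanism, so the residual skew would be bounded by the tuning step rather than driven exactly to zero.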

## III. Contributions

The literature review presented in this thesis is the first effort in the literature to synthesize the contributions aimed at a global understanding of the effect of process variations on synchronous systems.

The idea of modeling process variations as an oriented uniform gradient (VLSI) and as a non-uniform gradient (WSI) is original, and it sheds new light on the models known in the literature for the effect of process variations on the maximum skew in processor arrays. We have also proposed a new way to distribute a clock: an H-tree built entirely from minimum-sized inverters. In addition to showing the attractive performance this technique can offer, we used it to locate, within the processor array, the types of inter-processor communication that can make the synchronization of large arrays impossible. Finally, the analysis of this technique made it possible to compare it with other clock-tree construction schemes (namely hybrid and metallic trees).

This thesis also presents one of the first studies showing the interaction of skew and power constraints in determining the overall performance of a clock distribution network. In addition, the thesis proposes two original low-power techniques for amplifying the clock signal. The first is based on inserting amplifiers in parallel with the clock line and takes advantage of the possibility of varying the size of the individual amplifiers. The other is based on serial insertion and reduces switching power by using a dual power supply.

One of the original contributions of this thesis is the spatial characterization of the effect of process variations through a parameter directly involved in timing issues: the MOSFET time constant. To our knowledge, this thesis presents the first such map of this parameter at the VLSI and WSI scales.

The idea of using calibration at the clock-signal amplifiers and at the interconnections to compensate for process-induced skew is not new. Indeed, the principle has already been applied to input/output pads, in which case calibration is performed on every fabricated chip. The merit of this thesis, however, is to show its application in the context of clock distribution trees and to subject only a limited set of prototype chips to calibration.

## IV. Publications

This thesis has led to a number of published papers and fabricated experimental circuits, including:

## Journal papers:

1. M. Nekili, G. Bois \& Y. Savaria, "Pipelined H-trees for High-Speed Clocking of Large Integrated Systems in Presence of Process Variations", IEEE Transactions on VLSI Systems, Vol. 5, No. 2, pp. 161-174, June 1997.
2. M. Nekili, Y. Savaria \& G. Bois, "Characterization of Process Variations via MOS Transistor Time Constants in VLSI \& WSI", accepted for publication in the Journal of Solid-State Circuits, October 1997.
3. M. Nekili, G. Bois \& Y. Savaria, "Design of Clock Distribution Networks in Presence of Process Variations", submitted for publication (tutorial) to IEEE Trans. on VLSI Systems, February 1998.
4. M. Nekili, Y. Savaria \& G. Bois, "Minimizing Process-Induced Skew Using Delay Tuning", submitted for publication to the Journal of Microelectronic Systems Integration, June 1998.
5. M. Nekili, G. Bois \& Y. Savaria, "Design of Low-Power and Reliable Logic-Based H-trees", submitted for publication to IEEE Transactions on VLSI Systems, June 1998.

## Conference papers:

1. M. Nekili, Y. Savaria \& G. Bois, "Minimizing Process-induced Skew Using Delay Calibration in Clock Distribution Networks", IEEE International Workshop on Clock Distribution Networks, Atlanta, Georgia, October 9-10, 1997.
2. M. Nekili, Y. Savaria \& G. Bois, "Design of Clock Distribution Networks in Presence of Process Variations", accepted for publication in a special session at the 8th Great Lakes Symposium on VLSI, Lafayette, Louisiana, February 19-21, 1998.
3. M. Nekili, Y. Savaria \& G. Bois, "A Fast Low-Power Driver for Long Interconnections in VLSI Circuits", Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS'94), London, June 1994.
4. M. Nekili, Y. Savaria \& G. Bois, "A Variable-Size Parallel Regenerator for Long Integrated Interconnections", Proceedings of Midwest Symposium on Circuits and Systems (MWSCAS'94), Lafayette, Louisiana, August 1994.

## Technical report:

M. Nekili, G. Bois \& Y. Savaria, "Deterministic Skew Modeling and High-Speed Clocking using Logic-Based and Mixed H-trees within Integrated Systems", Technical Report EPM/RT 94/09, Éditions de l'École Polytechnique de Montréal, 1994.

## Experimental circuits:

1. On a $1.44 \mathrm{~cm} \times 1.44 \mathrm{~cm}$ silicon die, ring oscillators operating at 500 MHz are placed along the die edges. The die is replicated across the surface of a wafer. The goal is to characterize the effect of process variations on the MOSFET time constant. The circuit was submitted for fabrication on June 28, 1995 in Nortel's $1.2 \mu \mathrm{~m}$ CMOS technology. It was fabricated and tested successfully.
2. A clock distribution H-tree is implemented along the edges of a $1.44 \mathrm{~cm} \times 1.44 \mathrm{~cm}$ silicon die. The goal is to test a method for compensating process-induced skew by delay calibration using laser cuts at the interconnections. The circuit was submitted for fabrication in February 1997 in Nortel's $0.8 \mu \mathrm{~m}$ BiCMOS technology.

## V. Thesis Outline

Chapter 2 presents a literature review of the field. Chapter 3 shows the importance of the problem and proposes the use of a clock tree built entirely from logic (pipelined H-trees). The effect of power dissipation on the overall performance of clock distribution trees is analyzed in Chapter 4. Chapter 5 examines the sources of the problem by performing a spatial characterization of process variations through the MOS transistor time constant. Finally, Chapter 6 proposes a solution in the form of skew compensation techniques based on delay calibration. This thesis is written in the "thesis by articles" format. Chapters 2, 3, 4, 5 and 6 are based on journal papers. In addition, Chapter 4 refers to two conference papers found in Appendices B and C.

## Chapter 2.

## Literature Review

One of the growing limitations faced by large, high-speed clock distribution networks is the clock skew caused by manufacturing process variations. These fluctuations in process parameters have been anticipated since at least the 1970s. To compensate for the mismatch of transistor characteristics, high-precision analog circuits have traditionally been designed by following a set of empirical rules acquired through experience. With the scaling of MOS technology, transistor mismatch has become more important, mainly because process tolerance does not scale in proportion to geometry. Minimizing this mismatch in synchronous integrated systems has been approached from different angles. Section II of the following article shows the importance of the problem, which is easily studied in the case of clock networks whose nodes are distributed according to predefined geometries. In Section III, two examples of experimental characterization of process variations point to the sources of the problem. Existing solutions include rules incorporated early in the design process (Section IV), so-called "exact-zero" skew algorithms (Section V) and built-in self-compensation methods (Section VI).

# Design of <br> Clock Distribution Networks in Presence of Process Variations 

Mohamed Nekili, Yvon Savaria \& Guy Bois<br>École Polytechnique de Montréal<br>Department of Electrical and Computer Engineering<br>P.O. Box 6079, Station "Centre-Ville"<br>Montréal, PQ, Canada, H3C 3A7<br>E-mail: nekili@vlsi.polymtl.ca<br>Tel.: 1-(514) 340-4711 ext. 4737


#### Abstract

Tolerance to process-induced skew remains one of the major concerns in the design of large-area, high-speed clock distribution networks. Indeed, despite the availability of some efficient exact-zero skew algorithms that can be applied during circuit design, clock skew remains an important performance-limiting factor after chip manufacturing, and is of increasing concern for sub-micron technologies. This tutorial reviews the importance of the problem, its sources, and typical examples of existing solutions. Solutions range from suitable design rules to built-in self-compensation methods.


## I. Introduction

An ever-growing limitation for high-speed, large-area clock distribution networks is process-induced skew. This paper reviews the research progress made on this problem since the early 1980s.

Minimizing clock skew in synchronous integrated systems has been approached from different angles. In Section II, the importance of the problem is assessed. In Section III, two examples of experimental characterizations of process variations point to the source of the problem. Existing solutions include rules incorporated early in the design process (Section IV), optimistic exact-zero skew algorithms (Section V), and built-in self-compensation techniques (Section VI), which are activated after chip manufacturing.

## II. Importance of the Problem

The importance of the problem can easily be studied in the case of clock nets distributed according to predefined geometries. Existing models of process variations are either probabilistic (Steiglitz \& Kugelmass [1] and Afghahi \& Svensson [2]), deterministic (Fisher \& Kung [3] and Nekili et al. [4, 5]) or hybrid (Pelgrom et al. [6]).

With the probabilistic approach, Steiglitz \& Kugelmass [1] treat the signal delay along a given clock path as the sum of the delays of the different segments of that path, each behaving according to a probabilistic law. Assuming that the segment delays are independent, the total path delay as well as the clock skew tend to follow a normal distribution. For Afghahi \& Svensson [2], the clock skew is assumed to result from a dispersion of physical circuit parameters (geometrical dimensions) and process parameters (temperature sensitivity, ...).
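
The independence argument of Steiglitz \& Kugelmass can be made slightly more explicit. As a sketch only, using notation introduced here for illustration (it is not taken from [1]), let a clock path consist of $n$ segments whose delays $t_{k}$ are independent, with means $\mu_{k}$ and variances $\sigma_{k}^{2}$. The path delay $T$ then satisfies:

$$
T=\sum_{k=1}^{n} t_{k}, \qquad \mathrm{E}[T]=\sum_{k=1}^{n} \mu_{k}, \qquad \operatorname{Var}[T]=\sum_{k=1}^{n} \sigma_{k}^{2}
$$

Under the usual conditions of the central limit theorem, $T$ approaches a normal distribution as $n$ grows. For two root-to-leaf paths, the segments they share cancel in the difference of their delays, so the variance of the skew is the sum of the variances $\sigma_{k}^{2}$ taken over the segments that are not shared by the two paths.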

## II. 1. Work of Fisher \& Kung [3]

Fisher \& Kung proposed two distinct clock skew models: a difference model (Fig. 2.1) and a summation model (Fig. 2.2). They assume an architecture composed of processors synchronized by the clock and organized as an array of a given size. The processors act as the leaves of a binary tree (Fig. 2.1).

For the difference model, the clock skew between two tree nodes $C_{i}$ and $C_{j}$ depends on the difference between the physical distances from each of the nodes to the root of the tree (bold line in Fig. 2.1). This model was proposed for high-speed systems made of discrete components, where clock trees are often wired so that the delay from the root is the same for all cells.


Fig. 2.1. Difference Model

Nevertheless, as system size increases, small variations in electrical characteristics along clock lines can build up unpredictably to produce skews even between wires of the same length. In the worst case, two wires can have propagation delays which differ in proportion to the sum of their lengths. This remark suggests a second model that simulates process variations which affect the clock tree.

In the summation model, the clock skew between two tree nodes $C_{i}$ and $C_{j}$ depends on the sum of their distances to their common ancestor (bold line in Fig. 2.2).

Besides the binary tree distributing the clock signal (Fig. 2.1 and Fig. 2.2), it is assumed that the processors exchange data through a communication graph superposed on the clock graph.


Fig. 2.2. Summation Model

Fisher \& Kung reached the following conclusions:

- Under both models, if the communication graph is linear or one-dimensional, i.e., a direct connection links all adjacent processors laid out as a linear array, the processor array can be clocked with a skew that is independent of system size.
- For multidimensional processor arrays, i.e., each processor communicates with any other processor, the above conclusion remains valid under the difference model only, which makes the synchronization of large systems infeasible under the summation model.
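
The two distance measures behind the difference and summation models can be sketched as follows. This is an illustrative sketch, not Fisher \& Kung's formulation: the tree encoding (a `parent` map and a per-node `length` of the wire to its parent) and all function names are assumptions made here, and the functions return the distance quantity on which each model makes the skew depend, not a skew in seconds.

```python
def path_to_root(parent, node):
    """Return the nodes from `node` up to the root, inclusive."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def distance_up(length, path, stop):
    """Sum wire lengths along `path` from its first node up to `stop`."""
    total = 0.0
    for node in path:
        if node == stop:
            break
        total += length[node]  # length[node]: wire from node to its parent
    return total


def common_ancestor(parent, a, b):
    """Return the nearest common ancestor of nodes a and b."""
    ancestors_of_a = set(path_to_root(parent, a))
    for node in path_to_root(parent, b):
        if node in ancestors_of_a:
            return node
    raise ValueError("nodes are not in the same tree")


def difference_metric(parent, length, a, b):
    """Difference model: |d(root, a) - d(root, b)|."""
    path_a, path_b = path_to_root(parent, a), path_to_root(parent, b)
    root = path_a[-1]
    return abs(distance_up(length, path_a, root) - distance_up(length, path_b, root))


def summation_metric(parent, length, a, b):
    """Summation model: d(a, ancestor) + d(b, ancestor)."""
    anc = common_ancestor(parent, a, b)
    return (distance_up(length, path_to_root(parent, a), anc)
            + distance_up(length, path_to_root(parent, b), anc))


# Hypothetical balanced tree: root R, internal nodes X and Y, four leaves.
parent = {"R": None, "X": "R", "Y": "R",
          "c0": "X", "c1": "X", "c2": "Y", "c3": "Y"}
length = {"X": 2.0, "Y": 2.0, "c0": 1.0, "c1": 1.0, "c2": 1.0, "c3": 1.0}
```

In this hypothetical tree, the neighboring leaves `c1` and `c2` are equidistant from the root, so the difference metric vanishes, while their summation metric spans the whole path through the root.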


## II.2. Work of Pelgrom et al. [6]

Despite the widely recognized importance of matching, only a limited number of specialized contributions to this field appeared in the literature up to the late 1980s. Previously, Shyu et al. [7, 8] analyzed the variation in capacitors and current sources in terms of local and global variations. Later, Lakshmikumar et al. [9] described MOS-transistor matching by means of threshold-voltage and current-factor standard deviations.

Pelgrom et al. defined mismatch as the process that causes time-independent random variations in physical quantities of identically designed devices. An analysis is performed for a general parameter $P$, which can be, in particular, the delay of an electrical device.

Pelgrom et al. assumed that the value of a parameter $P$ is composed of a constant part and a random part, which results in different values of $P$ for different pairs of coordinates $(x, y)$ over the wafer. If the variations are small, the average value of the parameter $P$ over a given area can be represented by the integral of $P(x, y)$ over that area, and the mismatch $\Delta P$ between two identical areas of logic at different coordinates on the wafer can be expressed in the same way.

Based on a two-dimensional Fourier transform, Pelgrom et al. expressed the mismatch $\Delta P$ as the product of a geometry-dependent function and a process-dependent function. The latter is the combination of a short-distance "white" noise and a long-distance stochastic phenomenon acting on the wafer scale.

For two identical rectangular electrical components, each with an area of $W \times L$, at a horizontal distance $D_{x}$ from each other, the variance of $\Delta P$ can be written as:

$$
\sigma^{2}(\Delta P)=\frac{A_{p}^{2}}{W \times L}+S_{p}^{2} \times D_{x}^{2}
$$

where $A_{p}$ is the area proportionality constant for parameter $P$, while $S_{p}$ describes the variation of parameter $P$ with spacing. The proportionality constants can be measured and used to predict the mismatch variance of a circuit.

As an application example, Pelgrom et al. used the threshold voltage $V_{T}$ and the current factor $\beta$, two parameters that determine the drain current and hence the transistor time constant. Another direct application of Pelgrom's modeling was proposed recently by Karhunen et al. [10]. It is a compensation technique based on centroid configurations.
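
The variance expression above can be evaluated directly once $A_{p}$ and $S_{p}$ are known. The following is a minimal sketch; the function names and the numerical values are hypothetical and chosen only to show the two contributions (area term and distance term).

```python
import math


def mismatch_variance(a_p, s_p, width, length, distance):
    """sigma^2(dP) = A_p^2 / (W * L) + S_p^2 * D_x^2."""
    if width <= 0 or length <= 0:
        raise ValueError("device dimensions must be positive")
    return a_p ** 2 / (width * length) + (s_p * distance) ** 2


def mismatch_sigma(a_p, s_p, width, length, distance):
    """Standard deviation of the mismatch."""
    return math.sqrt(mismatch_variance(a_p, s_p, width, length, distance))


# Hypothetical constants: A_p = 10, S_p = 0.002, W = 10, L = 1, D_x = 100
# (any consistent set of units).
example_variance = mismatch_variance(10.0, 0.002, 10.0, 1.0, 100.0)
```

With the distance term set to zero, quadrupling the device area halves the standard deviation of the mismatch, which is the inverse dependence on the square root of the area that the model predicts.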

## III. Source of the Problem

The source of the problem of process variations is investigated through characterization. Two typical examples are provided in this paper: the work of Pavasovic and Andreou [11, 12] at the VLSI scale and the work of Gneiting and Jalowiecki [13] at the WSI scale. Recently, Nekili et al. [14] conducted a series of experiments in order to build a map of variations in CMOS transistor time constants at the die and wafer levels. For a detailed analysis of the sources of disturbances in the IC manufacturing process, see Maly et al. [15]. Also, the analysis of current mismatch by Shyu et al. [8] pointed out four physical causes: edge effects, implantation and surface-state charges, oxide effects and mobility effects.

## III.1. Work of Pavasovic and Andreou [11, 12]

CMOS combined with the subthreshold operation of the transistor has been a technology of choice for implementing low-power devices. However, with voltage reduction and scaling, subthreshold operation faces substantial parametric variations in the process. Pavasovic and Andreou addressed the measurement of process variations through the subthreshold transistor drain current [11] and the above-threshold drain current [12]. The drain current is a parameter that determines the transistor time constant and hence affects the delay in electronic devices.

Their experiment is based on a set of four transistor arrays. Transistors belonging to different arrays have different sizes, while transistors of the same array have the same size. Pavasovic and Andreou reported the presence of three quasi-deterministic effects ("edge", "striation" and "gradient" effects) and a random phenomenon.

The "edge" effect manifests itself as a drastic increase or decrease of drain current at the borders of the transistor arrays. According to Pavasovic and Andreou, a probable cause of this effect is strain-induced shifts in the characteristics of the devices. During annealing processes, stress in the middle of the array will tend to propagate outward and accumulate at the periphery. The "striation" effect appears as a spatial, sinusoidal oscillation of slowly varying frequency in the device current. For Pavasovic and Andreou, a possible cause of this effect is the threshold-adjustment ion implantation process: the spatial distribution of the resulting doping concentration could arise from the summation of the Gaussian doping profiles deposited at each pass of the scan. The "gradient" effect appears as a position-dependent spatial variation of much lower frequency acting over long distances. When systematic effects are removed from the data, the random variations follow an inverse linear dependence on the square root of the transistor area.

## III.2. Work of Gneiting \& Jalowiecki [13]

The idea behind this experiment is to characterize, at the WSI scale, the effect of process variations on a ring oscillator designed using 51 inverter stages. It involved 25 wafers from different lots. Copies of the ring oscillator were distributed over the different dies in a wafer and over different wafers. Oscillation frequencies ranging from 60 to 90 MHz were measured, with coefficients of variation $\left(\frac{\sigma}{\mu}\right)$ of $11 \%$, where $\mu$ and $\sigma$ represent the average and the standard deviation, respectively.

The above data combined with other process parameters extracted from V-I curves allowed studying the impact of process variations on delay dispersion and skew in a set of typical WSI clock distribution networks which are reviewed in [16].

One of their main observations is the minor contribution of variations due to interconnections (as low as $10.6 \%$ on delay and $17.3 \%$ on skew). Moreover, the H-tree based clock network illustrated in Fig. 2.3 seems to be the most affected in terms of delay, while it is the least affected in terms of skew.


Fig. 2.3. H-tree Clock

## IV. "Design Rules" Solution

An approach proposed by Shoji in [17] is based on a first-order model of process variations while, in [18], Theune et al. exploit electro-magnetic phenomena. The goal is to elaborate rules which will allow circuit designers to minimize clock skew. In [19], Vittoz summarizes a set of rules applied by designers of high-performance analog circuits.

## IV.1. Work of Shoji [17]

Shoji starts from the architecture of Fig. 2.4, which shows two clock paths for a clock signal $I$. The first one contains an even number of inverters, while the second contains an odd number of inverters. The two paths feed identical processors represented by a load.


Fig. 2.4. A two-phase clock circuit

Shoji reasons as follows. Let $T_{1}$ and $T_{2}$ denote the delays of inverters 1 and 2, and $T_{A}$, $T_{B}$ and $T_{C}$ the delays of inverters A, B and C. In order to make this architecture process-insensitive, the circuit is designed such that the electrical length of the clock path that inverts signal $I$ (i.e., $T_{I}$) is equal to that of the path which does not invert signal $I$ (i.e., $T_{N I}$).

In other words:

$$
\begin{gather*}
T_{I}=T_{N I}  \tag{2.1}\\
\text { with } T_{I}=T_{A}+T_{B}+T_{C}  \tag{2.2}\\
\text { and } T_{N I}=T_{1}+T_{2} \tag{2.3}
\end{gather*}
$$

According to Shoji, a circuit designer assumes that his circuit will be implemented in a typical process noted M. In practice, however, the P-type and N-type transistors are subject to variations of their time constants. Shoji reflects this phenomenon by considering two other processes: a high-current process (noted H) and a low-current process (noted L). These processes are distinct from the typical process with respect to the time constants of P-type and N-type transistors. The time constants are lower for the H-process and higher for the L-process (Table 2.1).

Table 2.1: Transistor Time Constants [14]

| Process | NFET (ps) | ratio $\mathrm{f}_{\mathrm{N}}$ | PFET (ps) | ratio $\mathrm{f}_{\mathrm{P}}$ |
| :---: | :---: | :---: | :---: | :---: |
| High current (H) | 88.0 | 0.556 | 201.0 | 0.620 |
| Typical (M) | 158.0 | 1.000 | 324.0 | 1.000 |
| Low current (L) | 273.0 | 1.730 | 530.0 | 1.630 |

As an example, let us consider the case of a low-to-high transition on signal $I$ and a circuit that is subject to a high-current process H. Table 2.1 suggests that the delay of inverter B (determined by the P-type transistor time constant) becomes:

$$
T_{B}(H)=\frac{T_{B}(M)}{f_{P}(H)}
$$

where $f_{P}(H)$ is the ratio between $T_{B}(M)$ and $T_{B}(H)$. The factor $f_{N}$ is defined similarly from Table 2.1, based on inverter A whose delay is determined by its N-type transistor time constant. Applying a similar reasoning to the rest of inverters, one can re-write Eqs. 2.2 and 2.3 as:

$$
\begin{gathered}
T_{I}(H)=\frac{\left(T_{A}(M)+T_{C}(M)\right)}{f_{N}(H)}+\frac{T_{B}(M)}{f_{P}(H)} \\
T_{N I}(H)=\frac{T_{1}(M)}{f_{N}(H)}+\frac{T_{2}(M)}{f_{P}(H)}
\end{gathered}
$$

Taking into account Eq. 2.1, we have:

$$
\begin{equation*}
T_{I}(H)-T_{N I}(H)=\left(T_{B}(M)-T_{2}(M)\right)\left(\frac{1}{f_{P}(H)}-\frac{1}{f_{N}(H)}\right) \tag{2.4}
\end{equation*}
$$

Therefore, since $f_{P}(H)$ is in general different from $f_{N}(H)$, a condition for the circuit to be process-insensitive, i.e., $T_{I}(H)=T_{N I}(H)$, is to satisfy $T_{B}(M)=T_{2}(M)$. Given Eqs. 2.1 to 2.3, this equality between $T_{B}(M)$ and $T_{2}(M)$ can also be expressed as $T_{A}(M)+T_{C}(M)=T_{1}(M)$.

Thus, Shoji concludes by stating the following design rules to be applied by designers aiming at process-insensitive circuits:

1) Match the sum of pull-up delays of the P-type transistors with the pull-up delays of other clock paths.
2) Match the sum of pull-down delays of the N-type transistors with the pull-down delays of other clock paths.
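
The derivation leading to Eq. 2.4 can be checked numerically. The sketch below assumes that each inverter delay scales in proportion to the relevant transistor time constant, and it computes the factors $f_{N}$ and $f_{P}$ as defined in the text, i.e., as the ratio of the typical-process delay to the delay in the process considered, directly from the time constants of Table 2.1 (the ratio columns of the table appear to list the reciprocal of this quantity). The inverter delays and the function names are hypothetical.

```python
# Transistor time constants from Table 2.1, in picoseconds.
TAU = {
    "H": {"N": 88.0, "P": 201.0},
    "M": {"N": 158.0, "P": 324.0},
    "L": {"N": 273.0, "P": 530.0},
}


def factor(process, kind):
    """f = T(M) / T(process), assuming delay scales with the time constant."""
    return TAU["M"][kind] / TAU[process][kind]


def path_skew(delays_m, process):
    """T_I - T_NI for a low-to-high transition on signal I.

    delays_m holds the typical-process delays of inverters A, B, C (inverting
    path) and 1, 2 (non-inverting path). Inverters A, C and 1 pull down
    (N-type); inverters B and 2 pull up (P-type).
    """
    f_n, f_p = factor(process, "N"), factor(process, "P")
    t_i = (delays_m["A"] + delays_m["C"]) / f_n + delays_m["B"] / f_p
    t_ni = delays_m["1"] / f_n + delays_m["2"] / f_p
    return t_i - t_ni


# Both designs satisfy Eq. 2.1 in the typical process (400 ps on each path).
unmatched = {"A": 100.0, "B": 200.0, "C": 100.0, "1": 150.0, "2": 250.0}
matched = {"A": 100.0, "B": 200.0, "C": 100.0, "1": 200.0, "2": 200.0}
```

Under these assumptions, `path_skew(unmatched, "H")` is nonzero and agrees with the right-hand side of Eq. 2.4, whereas `path_skew(matched, p)` vanishes for every process `p`, since $T_{B}(M)=T_{2}(M)$.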

## IV.2. Work of Vittoz [19]

Most analog circuit techniques are based on the matching properties of similar components. For a given process, matching of critical devices may be improved by enforcing the set of rules that are summarized below. These rules are not specific to CMOS and are applicable to all kinds of IC technologies. The relevance and the quantitative importance of each of these rules depend on the particular process and on the particular device under consideration.

1) Devices to be matched should have the same structure. For instance, a junction capacitor cannot be matched with an oxide capacitor.
2) They should have the same temperature, which is no problem if power dissipated on chip is very low. Otherwise, devices to be matched should be located on the same isotherm, which can be obtained by a symmetrical implementation with respect to the dissipative devices.
3) They should have the same shape and the same size. For example, matched capacitors should have same aspect ratios, and matched transistors or resistors should have the same width and the same length, and not simply the same aspect ratios.
4) Minimum distance between matched devices is necessary to take advantage of spatial correlation of fluctuating physical parameters.
5) Common-centroid geometries should be used to cancel constant gradients of parameters. Good practical examples are the quad configuration used to implement a pair of transistors, and common-centroid sets of capacitors.
6) The same orientation on chip is necessary to eliminate dissymmetries due to anisotropic steps in the process, or the anisotropy of the silicon substrate itself. In particular, the source to drain flows of current in matched transistors should be strictly parallel.
7) Devices to be matched should have the same surroundings in the layout. This is to avoid for instance the end effect in a series of current sources implemented as a line of transistors, or the street effect in a matrix of capacitors.
8) Using non-minimum size devices is an obvious way of reducing the effect of edge fluctuations, and to improve spatial averaging of fluctuating parameters.

## V. Exact-Zero Skew Algorithms

In the competitive market of integrated circuits, the algorithmic approach is a fast and usually inexpensive design tool. Early work in this direction focused on global clock routing, which addressed geometrical issues by exploiting the notions of Steiner minimal tree and rectilinear minimal tree [20, 21]. Later, researchers [22, 23, 24, 25] began taking into account electrical aspects of the clock tree such as skew, by balancing and minimizing the length of clock paths. A disadvantage of this approach is that it uses minimum-width lines, which are very susceptible to variations in the etch rate of the metal lines, as well as to mask misalignment or local spot defects [26]. As a consequence, effective interconnect impedance and delay can vary greatly from wafer to wafer. To design process-insensitive interconnects, Pullela et al. [27, 28, 29] have developed an automated layout algorithm that widens rather than lengthens the interconnect.

The work of Tsay [30, 31] announced a new generation of algorithms [32, 33, 34], in which delay balancing is based on electrical length rather than geometrical length and which take clock net loads into account. Though the primary concern of Tsay's algorithm is not process-induced skew, it is, to date, representative of a class of algorithms which illustrate the limit of how much clock skew can be optimized prior to chip manufacturing.

## V.1. Work of Tsay [30, 31]

This algorithm is based on the Elmore delay model [35]. It uses a bottom-up recursive procedure which is illustrated in Fig. 2.5. In this figure, it is assumed that the algorithm has reached an intermediate stage in the clock tree hierarchy. Also, two zero-skew subtrees are already built, i.e., for each subtree, all root-to-leaf delays are equal. This assumption holds trivially when the subtree reduces to a leaf, which is the initial condition of the algorithm.

To interconnect two zero-skew subtrees using a wire, while ensuring the resulting tree is zero-skew, the problem is to find a location (tapping point) on the wire so that the delays from the new tree root to all leaves are equal. The tapping point does not necessarily cut the wire into two equal halves, but into two different segments. Each segment is represented by a $\pi$ interconnect model, where $r_{1}$ and $c_{1}$ are the resistance and capacitance of the first segment. Parameters $r_{2}$ and $c_{2}$ are similarly defined for the second segment. Each subtree $i$ is replaced by an input capacitance $C_{i}$ (which accounts for load differences) and a branch delay $t_{i}$ (which accounts for the capability of clock drivers).


Fig. 2.5. Interconnecting two zero-skew subtrees [30]

To ensure the resulting clock tree is zero-skew, the following equation has to be satisfied:

$$
\begin{equation*}
r_{1}\left(\frac{c_{1}}{2}+C_{1}\right)+t_{1}=r_{2}\left(\frac{c_{2}}{2}+C_{2}\right)+t_{2} \tag{2.5}
\end{equation*}
$$

The total wire length is noted $l$. If the length of the wire segment between the new tree root and the root of subtree 1 is $x \times l$, then, the length of the wire segment between the new tree root and the root of subtree 2 will be $(1-x) \times l$.

Let $\alpha$ and $\beta$ be the resistance and capacitance per unit of wire length. Hence, we get $r=\alpha l$, $r_{1}=\alpha x l$ and $r_{2}=\alpha(1-x) l$. We also have $c=\beta l$, $c_{1}=\beta x l$ and $c_{2}=\beta(1-x) l$.

Solving Eq. 2.5 ensures the zero-skew condition and fixes the tapping point location at:

$$
x=\frac{\left(t_{2}-t_{1}\right)+\alpha l\left(C_{2}+\frac{\beta l}{2}\right)}{\alpha l\left(\beta l+C_{1}+C_{2}\right)}
$$

If $0 \leq x \leq 1$, the tapping point lies somewhere on the wire; otherwise ($x<0$ or $x>1$), no tapping point satisfying the zero-skew condition can be found on the wire. For $x<0$, subtree 1 features a delay greater than what the wire can balance. Therefore, the tapping point should be fixed at the root of subtree 1. Then, a wire elongation is explored in a way similar to the above. A similar scheme is adopted for $x>1$, which corresponds to an unbalanced delay lying on the subtree 2 side. In case the delays of the two subtrees are too different, so that a reasonable wire elongation cannot balance them, Tsay recommends the use of drivers, delay lines or capacitors.
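
The tapping-point computation can be sketched in a few lines. This is a minimal illustration of Eq. 2.5 and of the closed-form solution for $x$, not Tsay's implementation: the function names are hypothetical, the units are normalized, and the wire-elongation step used when $x$ falls outside $[0, 1]$ is not modeled.

```python
def tapping_point(t1, t2, c_load1, c_load2, alpha, beta, wire_len):
    """Return x such that both sides of Eq. 2.5 are equal.

    t1, t2: branch delays of subtrees 1 and 2; c_load1, c_load2: their input
    capacitances; alpha, beta: wire resistance and capacitance per unit length.
    """
    r = alpha * wire_len  # total wire resistance
    c = beta * wire_len   # total wire capacitance
    return ((t2 - t1) + r * (c_load2 + c / 2.0)) / (r * (c + c_load1 + c_load2))


def branch_delays(x, t1, t2, c_load1, c_load2, alpha, beta, wire_len):
    """Elmore delays from the tapping point into subtrees 1 and 2."""
    r1, c1 = alpha * x * wire_len, beta * x * wire_len
    r2, c2 = alpha * (1.0 - x) * wire_len, beta * (1.0 - x) * wire_len
    return (r1 * (c1 / 2.0 + c_load1) + t1,
            r2 * (c2 / 2.0 + c_load2) + t2)


# Hypothetical case: subtree 2 is slower, so the tapping point moves toward it.
x = tapping_point(0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0)
d1, d2 = branch_delays(x, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0)
```

For identical subtrees the function returns the wire midpoint, and a result outside $[0, 1]$ signals the case where the wire alone cannot balance the two subtrees.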

As far as circuit design is concerned, in order to complete this approach, a consistent strategy for buffer insertion was needed [29, 36, 37, 38, 39]. The importance of such a strategy becomes clear when one considers that up to $90 \%$ of process variations affecting delay are due to buffers [13]. The notion of iso-radius levels [37] is worth mentioning here. Previous methods insert buffers level-by-level at the branch split points of the clock tree. Unfortunately, this works only in a full binary tree where all sinks have the same number of levels (e.g., processor arrays). In general, to ensure equalized root-to-leaf delays, one sometimes needs to insert a buffer somewhere between consecutive levels, at every iso-radius level.

Despite the extent to which skew is reduced by exact-zero skew algorithms, this strategy faces two main limitations [40]:

1- It is vulnerable to process variations. The clock routing produced by [30, 31] is guided by estimated interconnect and logic RC parameters. It therefore achieves zero skew only for the parameter values available during the design process. After chip manufacturing, any variation in these parameters will cause skew to appear again.

2- It lacks design flexibility. If designers want to modify the location of any clock pin, the entire clock routing needs to be redone. Modifications commonly occur because designers cannot know the exact clock pin locations until late in the physical design process.

To overcome the above drawbacks, Lin and Wong [40] proposed a hierarchical two-stage multiple-merge approach which relies on a zero-skew center chunk (fat wire).

Another attempt to design process-insensitive circuits [37] aimed at constructing a clock distribution tree by automatically and separately sizing PMOS and NMOS transistors of the clock buffers (Shoji's technique [17]). However, it was shown [45] that worst-case analysis is often carried out in terms of a correlated set of parameters, which results in a design that is unnecessarily pessimistic. This remark was confirmed in [14]. Nekili et al. noticed that active devices, such as inverters, are subject to large dispersions in their time constants depending on the position of the transistor in the wafer, and on the nature of its surroundings. As a consequence, only a small proportion of the transistor population is close to the worst or best case. Therefore, designing a clock system based on transistor worst-case performance will cause a substantial penalty to the clock frequency, whereas in practice, one might take advantage of knowledge on the variations of transistor performance throughout the wafer.

To overcome the difficulty of controlling clock path delays due to process parameter variations, Neves and Friedman [46] reduce minimum clock period using intentional skew. A permissible clock skew range is calculated for each local data path while incorporating process dependent delay values on the clock signal paths.

Based on a concept initially proposed by Lee and Murphy [47], Nekili et al. [48] recently proposed zero-skew techniques to minimize process-induced clock skew using delay calibration in buffered clock trees. In a bottom-up fashion, an algorithm proceeds by a set of clock signal measurements performed at a limited number of clock nets, and then the delays of tree branches are balanced by modifying the electrical features of either the interconnections or the buffers. A first technique consists of connecting several capacitors to the largest branches of the clock tree. The clock skew is measured between two clock nets and, accordingly, a laser alternately cuts capacitors from the corresponding pair of tree branches until both skew and delay are minimized at the current tree level. An experimental chip clocked at 500 MHz was fabricated with tree branches laid out over a 1.8 cm by 1.8 cm die, using the Nortel 0.8 $\mu \mathrm{m}$ BiCMOS technology. A second technique implements a clock buffer as a number of minimum-sized inverters in parallel, all linked using the top metal layer. According to the skew measurement at a pair of clock nets, a laser cuts inverters from one of the corresponding pair of buffers until the skew is minimized at the current tree level. Once the whole calibration phase is finished, the circuit designer puts back the final buffer sizes into the circuit design using an adapted mask of the top metal layer and vias only. This allows rapid and relatively inexpensive development of production masks with calibrated clock trees. With this method, skews induced by deterministic effects such as environment-induced process variations and power supply drops may be compensated, even though no method may be available to accurately predict these effects before manufacturing.
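
The first calibration technique described above (trimming capacitors attached to a pair of tree branches) can be illustrated with a toy model. The sketch below is not the algorithm of [48]: it assumes, purely for illustration, that every attached capacitor adds the same delay increment `delta` to its branch, that the skew measurement is exact, and that only the skew (not the overall delay) is minimized. All names and values are hypothetical.

```python
def calibrate_pair(base1, base2, caps1, caps2, delta):
    """Greedily cut capacitors from the slower branch while |skew| shrinks.

    base1, base2: branch delays with no capacitor attached.
    caps1, caps2: number of capacitors initially attached to each branch.
    delta: delay added by one capacitor (assumed identical for all).
    Returns the remaining capacitor counts and the final skew.
    """
    while True:
        # "Measure" the skew between the two clock nets.
        skew = (base1 + caps1 * delta) - (base2 + caps2 * delta)
        if skew > 0 and caps1 > 0 and abs(skew - delta) < abs(skew):
            caps1 -= 1  # branch 1 is slower: cut one of its capacitors
        elif skew < 0 and caps2 > 0 and abs(skew + delta) < abs(skew):
            caps2 -= 1  # branch 2 is slower: cut one of its capacitors
        else:
            return caps1, caps2, skew


# Hypothetical pair of branches with an initial skew of 8 delay units.
result = calibrate_pair(100, 92, 4, 4, 3)
```

In this toy case the residual skew is bounded by the delay increment of a single capacitor, which suggests why the granularity of the trimming elements limits the achievable calibration accuracy.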

## VI. Built-in Self-Compensation

The literature available in support of this approach comes mainly from industry, which tends to adopt a "pragmatic" approach. With this approach, the emphasis is put on the detection or measurement of the resultant of all process variations, in order to compensate them either on the device level [41, 42] or by self-adjustment [43, 44].

Cox et al. [41] have developed a control circuit methodology, which measures the relative performance of a CMOS chip, and transmits a digital state code to output drivers and clock generation circuits, in order to monitor the device characteristics in presence of process variations, temperature and voltage fluctuations.

Chengson et al. [43] developed a self-adjusting synchronization system for clock distribution. This system receives a periodic digital clock signal as a reference, and generates multiple clock signals that dynamically synchronize with the clock reference. This structure was the critical part of an error-tolerant computer system.

Besides these types of variations, Watson et al. [44] focus on compensating load variations when drivers are used to increase the fanout of a clock distribution network. Watson et al. use a form of delay adjustment, termed absolute, to ensure that a driver delay remains a multiple of the clock period.

Since the effect of process variations becomes more important with scaled technologies, Asahina et al. [42] use AC and DC fluctuations of macrocell characteristics in ASICs in order to detect such variations and compensate for them. With this method, clock delay variations (skew) as well as noise are reduced.

## VII. Conclusion

To our knowledge, this is the first tutorial addressing the design of clock distribution networks in presence of process variations. Based on deterministic as well as probabilistic modeling, it was observed that process-induced skew makes synchronous multidimensional processor arrays infeasible. Various sources of this problem were pointed out, including local and global variations. Solutions to the problem range from design rules, applied to the design of digital and analog devices, to built-in self-adjustment techniques. The evolution of algorithms and their limitations in skew minimization was also reviewed.

## Chapter 3.

## Importance of Spatial Variations in the MOS Transistor Time Constant

In a first phase, we seek to determine to what extent process variations can desynchronize a structure that is balanced a priori, i.e., one where the architecture of the system using the clock is regular, where the clock distribution tree is regular, and where the loads of the clock terminal nodes are identical. A typical example of system architecture (an array of processors with identical loads) is treated using an analytic approach, which can be considered when simple models of process variations can be devised. This chapter proposes such an approach, applicable to the VLSI and WSI integration levels. The delay variations of an inverter are summarized as deterministic variations in the time constants of MOS transistors and metal interconnections.

Similarly to Shoji's approach in Chapter 2, Section IV, this chapter suggests models of process variations at the VLSI and WSI levels and then, by analyzing their impact on clock skew, derives strategies that the designer can incorporate early in the design process of the clock distribution network. In our case, these strategies consist, at the VLSI level, of an effort to make the cloud of clock terminal points symmetric and, at the WSI level, of particular orientations to be given to the circuits in order to make them less sensitive to process variations.

This chapter deals with the problem of clocking large, high-speed digital systems as well as with deterministic skew modeling. To clock a large digital system, the conventional method uses a set of metallic lines organized as a tree. This method is limited by the bandwidth of the clock network. Another limitation of existing solutions is that the available skew models do not take process variations into account. In order to provide a reliable skew model and to avoid frequency limitations, we rely on an approach that distributes the clock using an H-tree whose branches are built from minimum-sized inverters rather than metal. With such a structure, we obtain the highest clock frequency achievable with a given technology. As an example, clock frequencies on the order of 1 GHz are possible with a $1.2 \mu \mathrm{m}$ CMOS technology.

From the skew modeling standpoint, we develop an analytic expression of the skew between two leaves of the H-tree, a skew that we consider to be the difference between the delays of the paths leading from the root to the two leaves in question. The skew upper bound obtained has an order of complexity which, with respect to the size $D$ of the H-tree, is the same as the one derived from the Fisher \& Kung model, i.e., $\Omega\left(D^{2}\right)$. This holds for communications from one physical edge of the tree to the other as well as for neighbor-to-neighbor communications. In contrast, the probabilistic model of Steiglitz \& Kugelmass predicts a complexity on the order of $\Theta(D \times \log D)$.

In an H-tree implemented with metallic lines, the skew between two leaves is obviously bounded by the maximum delay between the root and the leaves. However, with a logic-based H-tree, the leaf-to-leaf skew grows faster than this delay when the tree is subject to process variations such as a uniform gradient of the transistor time constant.

This chapter also proposes generalizations of the skew model to the following cases:

- chips in a silicon wafer subject to a non-uniform gradient.
- an H-tree combining logic and interconnections.


# Pipelined H-trees for High-Speed Clocking of Large Integrated Systems in Presence of Process Variations 

Mohamed NEKILI, Guy BOIS and Yvon SAVARIA<br>Department of Electrical Engineering, Ecole Polytechnique of Montreal<br>P.O. Box 6079, Station "Centre-Ville", Montréal,<br>PQ, Canada H3C 3A7, (514)-340.4737<br>Email: BOIS@VLSI.POLYMTL.CA SAVARIA@VLSI.POLYMTL.CA

Abstract. This paper addresses the problem of clocking large high-speed digital systems, as well as deterministic skew modeling, a related problem. A conventional method for clocking a large digital system is to use a set of metallic lines organized as a tree. This method is limited by the bandwidth of the clock network. Another limitation of existing solutions is that available skew models do not directly take into account process variations. In order to provide a reliable skew model, and to avoid the frequency limitation, we propose a novel approach that distributes the clock with an H-tree, whose branches are composed of minimum-sized inverters rather than metal. With such a structure, we obtain the highest clocking rate achievable with a given technology. Indeed, clock rates around 1 GHz are possible with a 1.2 $\mu \mathrm{m}$ CMOS technology. From the skew modeling standpoint, we derive an analytic expression of the skew between two leaves of the H-tree, which we consider to be the difference between the two root-to-leaf delays. The skew upper bound obtained has an order of complexity which, with respect to the H-tree size $D$, is the same as the one that may be derived from the Fisher \& Kung model for both side-to-side and neighbor-to-neighbor communications, i.e., $\Omega\left(D^{2}\right)$, whereas the Steiglitz and Kugelmass probabilistic model predicts $\Theta(D \times \sqrt{\log D})$. In an H-tree implemented with metallic lines, the leaf-to-leaf skew is obviously bounded by the delay between the root and the leaves. However, with the logic-based H-tree proposed in this paper, we arrive at a non-obvious result, which states that the leaf-to-leaf skew grows faster than the root-to-leaf delay in presence of a uniform transistor time constant gradient. This paper also proposes generalizations of the skew model to (i) the case of chips in a wafer subject to a smooth but non-uniform gradient and (ii) the case of H-tree configurations mixing logic and interconnections.

## I. Introduction

The evolution of VLSI chips toward larger die sizes and faster clock speeds makes clock design an increasingly important issue. A striking example of what can be accomplished with aggressive clock design is the DEC Alpha chip [49], designed to operate at more than 200 MHz. At such speeds, clock skew becomes a very significant problem. Available literature dealing with skew $[50,51,3,17,52,53,54,2,56]$ approaches the problem from both deterministic and probabilistic standpoints.

In the deterministic approaches, Friedman and Powell [52] emphasize the use of a hierarchical clock distribution, while others $[50,51,54,56]$ suggest equalizing the lengths of the different paths followed by the clock throughout the circuit. Shoji [17] suggests an approach that guarantees a symmetry between the paths that propagate "0" and "1". This symmetry ensures proper operation despite some types of process variations [17]. Except for the work of Fisher and Kung [3], which provides bounds on skew, the other authors do not deal with the analytic modeling of system skew.

In the probabilistic approaches, Kugelmass and Steiglitz [53] consider the delay of a clock signal along a given path as a sum of delays along path segments, each of these segments behaving according to a probabilistic law. By assuming independence between these delays, the total delay, as well as the skew, can then be described by a normal law. Because it assumes independence and a linear dependence of delay on line length, their approach oversimplifies reality. Other authors [2] consider the skew as a dispersion in the physical parameters of a circuit (e.g., geometrical dimensions) and in the process (e.g., sensitivity to temperature).

The work that is most directly related to that presented in this paper is the work of Fisher and Kung [3]. These authors have developed two deterministic skew models (the difference model and the summation model), from which they determined bounds on skew. However, these models do not directly refer to a process variation model. The difference model tends to be unrealistically optimistic, whereas, under the summation model, Fisher and Kung reached a pessimistic result which states that, from a skew standpoint, synchronous systems are not feasible with large two-dimensional arrays.

In order to avoid the frequency limitation when using metallic lines, we propose a logic-based H-tree structure that provides the highest clocking rate achievable with a given technology (section II). To provide a reliable skew model, we suggest a model based on delay differences combined with a model of electrical variations in the process parameters (section III). Under this model, we derive an analytic expression of the skew between any leaf pair, which we consider to be the difference between their root-to-leaf delays. Even though the model of electrical variations described in this paper assumes a VLSI scale, section IV suggests a generalization of this model when VLSI chips are considered as tiles on a wafer. Section V analyzes the performance (in terms of skew and clock rate) of H-tree configurations based on the combination of logic and interconnections. Except for the logic-based H-tree approach, section V does not claim originality for the other existing and well-known approaches to distributing the clock, i.e., metallic and mixed. Rather, it compares the behavior of the three approaches when facing process variations and points out the advantages of each.

Thus, this paper analyzes three methods of implementing H-tree based clock networks: the metallic H-tree illustrating the conventional scheme, the logic-based H-tree as an alternative solution to the problems faced by the metallic H-tree and, finally, the mixed H-tree presented as the general case.

The skew model proposed in this paper does not consider variations not due to the process itself. For instance, a hot spot in the chip may produce a temperature gradient that introduces parametric variations, which would depend on how a circuit is exercised. Another example would be the parameter variations caused by environmental and power supply variations.

## II. Logic-Based H-Tree: An Extension of the Pipeline Clocking Method

A common way to reduce the skew is the H-tree technique $[50,51,3,54]$. This consists of distributing a clock such that the interconnections which carry the signals towards the sub-blocks of the circuit are of equal length (Fig. 3.1). If the signals going to two different sub-blocks are equally delayed, they will remain perfectly synchronous. This is the case of tree leaves in Fig. 3.1. The only uncontrollable skew is the one due to parameter variations in the process (oxide thickness, thresholds of receiver transistors, height and width of interconnections, etc.).

Conventionally, the different paths of the H-tree are implemented as metallic lines. With metallic lines, in general, only one clock event is in transit between the root of the tree and its leaves. This implies a clock frequency limited by the inverse of the delay of an event going from root to leaves. Because this delay increases as the square of the tree size with long metallic lines in a conventional silicon process, the designer must reduce the clock frequency when dealing with large trees. To partially overcome this problem, one can, of course, use pipeline techniques [50] [3] by appropriately inserting buffers in the clock path. Fisher \& Kung proposed pipeline clocking for technologies and systems where wires are slow and logic is fast. That is exactly what happens when components are scaled down under constant voltage: the time constant of logic is reduced approximately as the square of the scaling factor, whereas the time constant of metallic interconnections is invariant. The increase in clock frequency brought by the pipeline techniques is, however, limited by the presence of wires. In fact, completely removing wires between buffers eliminates the need for large buffers and thus allows buffers of minimum size to be used. The clock distribution link becomes a chain of inverters, which allows a new signal to be inserted once the previous one has propagated through a pair of inverters along the clock path.


Fig. 3.1. A Logic-Based H-tree (the H-tree branches are implemented using chains of minimum-sized inverters)

Another advantage of removing the wires between inverters is that it eases skew modeling. Indeed, since no realistic and/or experimental characterization of wire delay variations has been reported in the literature, designing the H-tree with inverters only allows the skew to be expressed solely on the basis of known variations of the transistor time constants along the H-tree.

Therefore, as an extension of the concept of "pipeline clocking" proposed by Fisher \& Kung [3], we suggest implementing the clock lines with a chain of minimum-sized inverters (Fig. 3.2). In other words, we distribute the clock through logic rather than metallic interconnections. With such a configuration, the time separation between two successive events may be as small as the delay of a minimum-sized inverter driving another one of the same size. This delay can be regarded as a practical minimum limit on the timing resolution that can be achieved with a given CMOS process. However, even in this extreme case of pipelining, where wires are completely removed and inverters pushed to their limit, there are limitations: the intrinsic bandwidth of the technology (the inverters act as low-pass filters that attenuate the signal considerably beyond a technology-dependent cut-off frequency) and the phenomenon of dispersion due to unbalanced rise and fall times.


Fig. 3.2. Detail of the Logic-Based variant of an H-tree based clock network

With this logic-based H-tree, the maximum bandwidth is theoretically estimated to be $\frac{1}{6 \tau}$, $\tau$ being the time constant of the technology used and $6 \tau$ the delay of a pair of minimum-sized inverters [58]. In practice, with the $1.2 \mu \mathrm{m}$ CMOS technology of Northern Telecom used for detailed simulations [59] with the HSPICE [60] simulator (Fig. 3.3), this bandwidth is estimated to be as high as 1.11 GHz, whereas the theoretical value would be between 6 and 7 GHz. The significant difference between the theoretical and simulated estimates of the maximum bandwidth is partly related to parasitics and dynamic effects, such as input rise time and effective loading, which are considered by HSPICE but not by the simplified theoretical model. Above 1.11 GHz, the signal is rapidly attenuated because the number of stages in a reasonably sized clock network is large.
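The theoretical estimate $\frac{1}{6\tau}$ can be illustrated with a short numerical sketch. The value $\tau = 25$ ps used here is not given in the text; it is a hypothetical choice that happens to place the result inside the quoted 6 to 7 GHz range, and the function name is ours.

```python
def max_clock_rate_hz(tau_s: float) -> float:
    """Theoretical bandwidth 1 / (6 * tau) of a chain of minimum-sized inverters."""
    if tau_s <= 0.0:
        raise ValueError("the time constant must be positive")
    return 1.0 / (6.0 * tau_s)


# Hypothetical time constant (assumption, not a value from the paper).
TAU_EXAMPLE_S = 25e-12

# Simulated cut-off frequency quoted in the text for the 50-stage chain.
F_SIMULATED_HZ = 1.11e9

f_theory = max_clock_rate_hz(TAU_EXAMPLE_S)
print(f"theoretical limit : {f_theory / 1e9:.2f} GHz")
print(f"simulated / theory: {F_SIMULATED_HZ / f_theory:.2f}")
```

Under this assumed $\tau$, the simulated cut-off is a small fraction of the theoretical limit, which is the gap the paragraph attributes to parasitics and dynamic effects.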


Fig. 3.3. Distributing a 1.11 GHz clock through a 50-stage chain of successive minimum-sized inverters

Distributing clocks at such a high speed is not trivial, but this extremely high predicted theoretical limit justifies devoting a great deal of effort to exploring the feasibility of this extreme case of pipelining. In fact, this potential maximum clock rate exceeds the typical maximum clock frequencies used in VLSI circuits implemented in comparable CMOS technology by more than one order of magnitude. In order to operate at such high frequencies, the main performance parameter that must be characterized is the skew. We will not be concerned with the end-to-end delay, provided that the skew between the H-tree leaves (those which directly communicate through combinational logic) is bounded by reasonable values [3].

Of course, dissipated power is not a trivial issue. Indeed, an additional cost in power dissipation is involved when considering an H-tree designed using logic. However, this is the price to pay for achieving high frequencies, because the problem of high power dissipation at high speed is intrinsic to CMOS. On the other hand, due to the regeneration effect of the inverters, which produces sharper transitions, the logic-based structure used at moderate frequencies would reduce the short-circuit current dissipation as compared to that of a conventional H-tree (buffers driving wires). Furthermore, a detailed analysis of power dissipation, which we could not include in the present paper due to space limitations, has shown that, by limiting the H-tree size, a logic-based H-tree can still operate at relatively high frequencies without going beyond what state-of-the-art heat removal can handle.

Also, one has to consider the extra routing for the Vdd and Gnd buses. However, since these buses also feed the surrounding logic (the processors of the array) and not only the H-tree, they do not represent an extra routing cost.

## III. Skew Modeling in a Logic-Based H-Tree Subject to a Uniform and Oriented Transistor Time Constant Gradient

## III.1- Assumptions

The model developed in the following is based on a set of assumptions. Some of these are not as general as one might wish, but one must be careful to avoid arriving at intractably complex models. It is possible, however, to adapt the proposed methodology to less restrictive assumptions and, to our knowledge, no authors have proposed or used more realistic assumptions up to now.

A1) Using an inspection model [57], the delay along a path is determined by the sum of the transistor time constants ${ }^{2}$ of inverters along the given path.

A2) We assume that the chip is subject to process variations that lead to an oriented transistor time constant gradient, with a relative magnitude $\epsilon$ and an orientation $\theta$ relative to the horizontal axis of the wafer. In other words, the inverter time constant is assumed to change by $\epsilon \%$ over a distance $2D$ in the direction $\theta$.

The relative magnitude $\epsilon$ can be expressed as follows:

$$
\begin{equation*}
\varepsilon=2 \times \mathrm{D} \times \lambda \times \tau_{0} \tag{3.1}
\end{equation*}
$$

where $\lambda=2 \times 10^{-6} \mu \mathrm{m}^{-1}$ is the relative variation of the transistor time constant per unit distance over the wafer (as in [63], we assume a transistor time constant variation of $20 \%$ around the mean value $\tau_{0}$ over a distance of 10 cm).
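As a quick check of these figures, the sketch below recomputes $\lambda$ from the stated 20 % variation over 10 cm and then evaluates Eq. 3.1. The H-tree size $D = 1$ cm and the normalized $\tau_0 = 1$ are hypothetical values chosen only for illustration, as are the function and variable names.

```python
# Fractional variation per micrometre: 20 % over 10 cm, as assumed in the text.
VARIATION = 0.20
DISTANCE_UM = 10.0 * 1e4           # 10 cm expressed in micrometres

lam = VARIATION / DISTANCE_UM      # about 2e-6 per micrometre


def gradient_magnitude(d_um: float, tau0: float, lam_per_um: float = lam) -> float:
    """Eq. 3.1: epsilon = 2 * D * lambda * tau0."""
    return 2.0 * d_um * lam_per_um * tau0


# Hypothetical H-tree size D = 1 cm and normalised tau0 = 1 (assumptions).
eps = gradient_magnitude(d_um=1.0e4, tau0=1.0)
print(f"lambda = {lam:.1e} per um, epsilon = {eps:.3f} * tau0")
```

With these assumed values, the time constant changes by about 4 % of $\tau_0$ across the distance $2D$.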

This assumption takes its roots in the observation of smooth variations of resistance over large sections of gallium arsenide wafers [55]. Moreover, Pavasovic et al. [11] recently addressed the spatial variability of a parameter, which is related to the time constant of a transistor (and thus to the skew), for both subthreshold [11] and above threshold [12] operation: the drain current. They reported experimental measurements from large transistor arrays with device sizes typical of digital and analog VLSI systems. As a result, three different position-dependent behaviors and a random variation were observed.

Among the quasi-deterministic effects reported by Pavasovic et al., there are:

- a "gradient" effect, which appears as a very low-frequency spatial variation of the transistor drain current along the array;
- a "striation" effect, which features a low-frequency sinusoidal spatial variation;
- and an "edge" effect.

Due to the sinusoidal nature of the "striation" effect, we assume a compensation such that it approximately acts as a gradient type. Moreover, as pointed out by Pavasovic et al., the "striation" effect is relatively weak for small transistors because it is hidden by the random effect. This is the case when minimum-sized inverters are used for implementing the H-tree.

An "edge" effect was also observed and characterized. The transistor characteristics were observed to vary as a function of whether the transistor is uniformly surrounded by other structures or isolated. In the experiments performed by Pavasovic et al., a contrast in drain current was observed at the physical frontiers between the arrays (the transistor size varying from one array to another) as well as on their peripheries. With the logic-based approach, however, all the inverters are identical and they are generally surrounded uniformly throughout the wafer (due to the linear shape of the H-tree); thus the "edge" effect can be made small.

On the other hand, it appears from [11] and [12] that the random component is less significant than the quasi-deterministic components. Also, the random component tends to be negligible when the sum of the Gaussian drain-current distributions is averaged over large areas of silicon.

For the reasons mentioned above, and for the purpose of keeping the skew model tractable, we considered only the "gradient" effect as a meaningful first step.

A3) In agreement with the previous assumption, even though the effects of process variations are significant over large areas of silicon, we assume that they are negligible within the small area used for the layout of a minimum-sized inverter (where the transistor time constant is estimated to vary by $0.002 \%$ around its mean value).
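The $0.002 \%$ figure is consistent with $\lambda = 2 \times 10^{-6} \mu \mathrm{m}^{-1}$ if the inverter cell is on the order of 10 $\mu$m long. That cell length is our inference, not a value stated in the text; the sketch below only makes the arithmetic explicit.

```python
LAMBDA_PER_UM = 2e-6   # relative time constant variation per micrometre (from A2)


def variation_within_cell_percent(cell_um: float, lam: float = LAMBDA_PER_UM) -> float:
    """Relative time constant variation, in percent, across a cell of given length."""
    return 100.0 * lam * cell_um


# Hypothetical inverter length (assumption inferred from the 0.002 % figure).
D_INV_UM = 10.0
pct = variation_within_cell_percent(D_INV_UM)
print(f"variation across one inverter: {pct:.4f} %")
```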

A4) The transistor time constant gradient behaves similarly for the N and P transistors of an inverter. This assumption may be removed at the price of increased model complexity.

## III.2- Logic-Based H-Tree Definition

Before developing a model of the skew, the H-tree itself must first be modeled. Without loss of generality, the H-tree has N hierarchical levels, where N denotes the tree depth. Level 0 corresponds to the root branch, and level N to the branches that support leaves. For instance, the H-tree described in Fig. 3.1 is drawn for $\mathrm{N}=4$. Horizontal branches have even ranks, and vertical branches odd ranks. With such a topology, the tree can distribute the clock to an array of $2^{\mathrm{N}}$ identical processors.

The H-tree topology is completely defined by:

- $Y(0)$, the ordinate of the level 0 branch.
- The series $\{l(i)\}$ of branch lengths. For example, the topology in Fig. 3.1 is such that:

$$
\begin{equation*}
l(2 k+1)=l(2 k+2)=\frac{D}{2^{k+1}} \quad \text { with } k \in\left[0, \frac{N}{2}-1\right] \tag{3.2}
\end{equation*}
$$

- and the depth $N$ of the H-tree.

In order to simplify the analysis, the depth N is assumed to be even. A similar set of equations could be developed for odd depth trees.
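The branch-length series of Eq. 3.2 can be tabulated with a few lines of code. This is a minimal sketch: the function name is ours, the length $l(0)$ of the root branch is not fixed by Eq. 3.2 and is therefore left out, and the values $D = 16$, $N = 4$ are arbitrary.

```python
def branch_lengths(d: float, n: int) -> dict[int, float]:
    """Lengths l(1)..l(N) following Eq. 3.2, for an even depth N."""
    if n < 2 or n % 2 != 0:
        raise ValueError("the depth N is assumed to be even and at least 2")
    lengths = {}
    for k in range(n // 2):
        # l(2k+1) = l(2k+2) = D / 2^(k+1)
        lengths[2 * k + 1] = lengths[2 * k + 2] = d / 2 ** (k + 1)
    return lengths


# Arbitrary example: D = 16 (any length unit), depth N = 4 as in Fig. 3.1.
lengths = branch_lengths(16.0, 4)
num_processors = 2 ** 4            # the tree feeds 2^N identical processors
print(lengths, num_processors)
```

Each pair of consecutive levels (one vertical, one horizontal branch) halves the branch length, which is what gives the tree its self-similar H shape.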

From a geometric standpoint, an inverter, when laid out, has a rectangular shape with some length $D_{\text{inv}}$. Each inverter of a branch is balanced in order to avoid a skew due to the asymmetry of delay propagation of logical "0" and "1" [17]; this is factored in by considering that the input capacitance of an inverter is $M$ times greater than that of an N-type transistor (we generally fix $M$ to a value between 3 and 4). Then, each inverter sees the next one as a capacitance of $M \times C_{g}$, where $C_{g}$ is the effective gate capacitance of a minimum-sized N-transistor, except for the last inverter of a branch, which simultaneously sees the first two inverters of the next two branches, i.e., a capacitance of $2 \times M \times C_{g}$. This effect was neglected in the following analysis. Although the branch splits do limit the maximum bandwidth, electrical simulations using HSPICE with a $1.2 \mu \mathrm{m}$ CMOS technology have shown that a 50-stage chain of inverters features a cut-off frequency of 1110 MHz, while the same chain with branch splits every 10 inverters features a cut-off frequency of 910 MHz. This represents a reduction in cut-off frequency of about $18 \%$, which remains acceptable and is not as significant as we could have expected, i.e., when compared to the $50 \%$ reduction expected from the doubling of the gate capacitance at the branch splits. Moreover, a 910 MHz maximum bandwidth is still a very high clock frequency as compared to typical clock operation rates in a CMOS technology.

A path $c$ between the root (level 0) and a leaf (level $N$) of the H-tree is an oriented route consisting of a sequence of horizontal and vertical branches. This path is completely defined by a set of direction variables $d_{u}^{v}$ that describe the moving direction between two consecutive branches $u$ and $v$, even or odd, such that:

$$
\begin{align*}
& \left\{{ }_{x}^{c} d_{2 p-1}^{2 p+1}\right\}, \quad p \in\left[1, \frac{N}{2}\right], \quad \text { where } { }_{x}^{c} d_{2 p-1}^{2 p+1}= \begin{cases}+1 & \text { when moving right } \\
-1 & \text { when moving left }\end{cases} \tag{3.3}\\
& \left\{{ }_{y}^{c} d_{2 p-2}^{2 p}\right\}, \quad p \in\left[1, \frac{N}{2}\right], \quad \text { where } { }_{y}^{c} d_{2 p-2}^{2 p}= \begin{cases}+1 & \text { when moving up } \\
-1 & \text { when moving down }\end{cases} \tag{3.4}
\end{align*}
$$

For example, ${ }_{x}^{c} d_{2 p-1}^{2 p+1}$ represents the direction variable assigned to path $c$ when moving (along the $x$ axis) from the vertical branch with rank $2p-1$ to the following vertical branch with rank $2p+1$ (Fig. 3.4.a). Note that, in order to simplify the illustrations in Fig. 3.4, the branches of the H-tree are shown as lines rather than as sequences of inverters. Also, given the fact that the H-tree is composed of identical inverters, the width of all branches is constant.


Fig. 3.4. Series used in delay calculation

The gradient of a transistor time constant is modeled by a vector taking its origin at point $(0,0)$, with a magnitude $\epsilon$ and an orientation $\theta$ relative to the $x$ axis. Note that $\epsilon$ is the relative time constant variation of a channel area square along a distance $2D$. Thus, the time constant of a channel area square with coordinates $(x, y)$ is:

$$
\begin{equation*}
\tau(\mathrm{x}, \mathrm{y})=\tau_{0}+\frac{\mathrm{x}}{2 \mathrm{D}} \varepsilon \cos \theta+\frac{\mathrm{y}}{2 \mathrm{D}} \varepsilon \sin \theta \tag{3.5}
\end{equation*}
$$

where $\tau_{0}$ represents the value of the product $R \times C_{g}$ at point $(0,0)$ in the wafer system of coordinates. At the geometrical point $(0,0)$, the process variations are assumed to be zero (Eq. 3.5). Therefore, the transistor time constant of an inverter implemented at that location can be determined from the typical values of the technology process, as reflected in the SPICE technology model.
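Eq. 3.5 translates directly into a small function. The sketch below is only an illustration: the function name, argument names and numerical values are ours, and $\tau_0$ is normalized to 1.

```python
import math


def tau_at(x: float, y: float, tau0: float, eps: float, theta: float, d: float) -> float:
    """Eq. 3.5: time constant at (x, y) under a gradient of magnitude eps, orientation theta."""
    return tau0 + (x * math.cos(theta) + y * math.sin(theta)) * eps / (2.0 * d)


# Hypothetical values: tau0 = 1 (normalised), eps = 0.04, D = 100 length units.
TAU0, EPS, D = 1.0, 0.04, 100.0

at_origin = tau_at(0.0, 0.0, TAU0, EPS, theta=0.3, d=D)          # no variation at (0, 0)
along_x = tau_at(2.0 * D, 0.0, TAU0, EPS, theta=0.0, d=D)         # tau0 + eps after 2D
across = tau_at(0.0, 2.0 * D, TAU0, EPS, theta=0.0, d=D)          # perpendicular: unchanged
print(at_origin, along_x, across)
```

Moving a distance $2D$ along the gradient direction changes the time constant by $\epsilon$, while moving perpendicular to it leaves the time constant unchanged.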

## III.3- Delay Formulation Along a Path of the H-Tree

Taking into account the gradient orientation, horizontal and vertical branches are not affected in the same way by a time constant gradient. Thus, the total delay along a path $c$ of the H-tree may be expressed as:

$$
\begin{aligned}
\mathrm{t}_{\mathrm{c}} & =\sum \text { delays of even rank branches } \\
& +\sum \text { delays of odd rank branches }
\end{aligned}
$$

Table 3.1 summarizes the oriented time constant gradient components (i.e., the variations of the time constant in the horizontal and vertical directions) that affect each of the H-tree's horizontal and vertical branches. The H-tree branches used to calculate these components are illustrated in Fig. 3.4.

Table 3.1. Horizontal and vertical time constant gradient components

| Type of branch | Horizontal gradient <br> component | Vertical gradient <br> component |
| :---: | :---: | :---: |
| Horizontal branch <br> of rank $2 \mathrm{p}, p \in\left[0, \frac{N}{2}\right]$ | $\mathrm{X}(\mathrm{i}, 2 \mathrm{p}) \frac{\varepsilon}{2 D} \cos \theta$ | $\mathrm{Y}(2 \mathrm{p}) \frac{\varepsilon}{2 \mathrm{D}} \sin \theta$ |
| Vertical branch <br> of rank $2 \mathrm{q}+1$, <br> $q \in\left[0, \frac{N}{2}-1\right]$ | $\mathrm{X}(2 \mathrm{q}+1) \frac{\varepsilon}{2 \mathrm{D}} \cos \theta$ | $\mathrm{Y}(\mathrm{j}, 2 \mathrm{q}+1) \frac{\varepsilon}{2 \mathrm{D}} \sin \theta$ |

In Table 3.1:

- $X(2q+1)$ designates the $x$ coordinate of a vertical branch of rank $2q+1$. Referring to Fig. 3.4.c, $X(2q+1)$ is recursively defined as follows:

$$
\begin{align*}
X(2 q+1) & =X(2 q-1)+{ }_{x}^{c} d_{2 q-1}^{2 q+1} \, l(2 q) \\
& =X(1)+\sum_{j=1}^{q}{ }_{x}^{c} d_{2 j-1}^{2 j+1} \, l(2 j) \tag{3.6}
\end{align*}
$$

where $X(1)$ denotes the $x$ coordinate of a vertical branch of rank 1.

- $X(i, 2p)$ designates the $x$ coordinate of an inverter at position $i$ within the horizontal branch of rank $2p$. Referring to Fig. 3.4.a, $X(i, 2p)$ is recursively defined as follows:

$$
\begin{aligned}
& X(i, 0)=i D_{\text{inv}} \\
& X(i, 2)=X(1)+{ }_{x}^{c} d_{1}^{3} \, i D_{\text{inv}} \\
& \text { For } p \geq 2, \quad X(i, 2 p)=X(2 p-1)+{ }_{x}^{c} d_{2 p-1}^{2 p+1} \, i D_{\text{inv}}
\end{aligned}
$$

By substituting the results of Eq. (3.6),

$$
\begin{equation*}
X(i, 2 p)=X(1)+\sum_{j=1}^{p-1}{ }_{x}^{c} d_{2 j-1}^{2 j+1} \, l(2 j)+{ }_{x}^{c} d_{2 p-1}^{2 p+1} \, i D_{\text{inv}} \tag{3.7}
\end{equation*}
$$

where, as mentioned earlier, $D_{\text{inv}}$ is the inverter length. In the remainder of this paper, the coordinate of an inverter is measured at its end. Having that coordinate measured at the beginning would have little impact on the results. In fact, for a logic-based H-tree whose root-to-leaf path consists of thousands of successive inverters, an error equivalent to the delay variation of one inverter has a negligible effect on the total skew. Also, the inverter cell is so small that its coordinates (and borders) are well approximated by its center of mass.

- $Y(2p)$ designates the $y$ coordinate of a horizontal branch of rank $2p$. Referring to Fig. 3.4.b, $Y(2p)$ is recursively defined as follows:

$$
\begin{align*}
Y(2 p) & =Y(2 p-2)+{ }_{y}^{c} d_{2 p-2}^{2 p} l(2 p-1) \\
& =Y(0)+\sum_{i=1}^{p}{ }_{y}^{c} d_{2 i-2}^{2 i} l(2 i-1) \tag{3.8}
\end{align*}
$$

- $Y(j, 2q+1)$ designates the $y$ coordinate of an inverter at position $j$ within the vertical branch of rank $2q+1$. Referring to Fig. 3.4.d, $Y(j, 2q+1)$ is recursively defined as follows:

$$
\begin{aligned}
& Y(j, 1)=Y(0)+{ }_{y}^{c} d_{0}^{2} \, j D_{\text{inv}} \\
& \text { For } q \geq 1, \quad Y(j, 2 q+1)=Y(2 q)+{ }_{y}^{c} d_{2 q}^{2 q+2} \, j D_{\text{inv}}
\end{aligned}
$$

By substituting the results of Eq. (3.8),

$$
\begin{equation*}
Y(j, 2 q+1)=Y(0)+\sum_{i=1}^{q}{ }_{y}^{c} d_{2 i-2}^{2 i} \, l(2 i-1)+{ }_{y}^{c} d_{2 q}^{2 q+2} \, j D_{\text{inv}} \tag{3.9}
\end{equation*}
$$
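The recursions of Eqs. 3.6 and 3.8 can be exercised by enumerating every root-to-leaf path and computing where it ends. This is a sketch under stated assumptions: $X(1)$ is taken equal to $l(0)$ (the end of the root branch), the leaf is taken to sit at the end of the last horizontal branch, and all names and lengths are hypothetical.

```python
from itertools import product


def leaf_position(dx, dy, l, y0=0.0):
    """End point of a path given its direction variables (Eqs. 3.6 and 3.8).

    dx[p-1] and dy[p-1] are the +1/-1 directions on the horizontal branch 2p
    and the vertical branch 2p-1; l[r] is the length of the branch of rank r.
    """
    x, y = l[0], y0                      # assumption: X(1) = l(0), Y(0) = y0
    for p in range(1, len(dx) + 1):
        y += dy[p - 1] * l[2 * p - 1]    # Y(2p)   = Y(2p-2) + d * l(2p-1)
        x += dx[p - 1] * l[2 * p]        # X(2p+1) = X(2p-1) + d * l(2p)
    return x, y


def all_leaves(n, l, y0=0.0):
    """Set of leaf positions reached by the 2^N possible paths of a depth-N tree."""
    half = n // 2
    signs = (+1, -1)
    return {
        leaf_position(dx, dy, l, y0)
        for dx in product(signs, repeat=half)
        for dy in product(signs, repeat=half)
    }


# Hypothetical depth-4 tree with D = 40: l(1) = l(2) = 20, l(3) = l(4) = 10.
lengths = {0: 20.0, 1: 20.0, 2: 20.0, 3: 10.0, 4: 10.0}
leaves = all_leaves(4, lengths)
print(len(leaves), sorted(leaves)[:4])
```

Each combination of the $N$ direction variables reaches a distinct leaf, so the enumeration yields $2^{N}$ positions laid out on a regular grid.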

Now, using the previous definitions, we can define the propagation delay $t_{c}$ along a path $c$ of the H-tree of depth $N$ as:

$$
\begin{align*}
t_{c}=M \times w_{n} & \left[\sum_{p=0}^{\frac{N}{2}} \sum_{i=1}^{\frac{l(2 p)}{D_{\text{inv}}}} \frac{\tau_{0}+(Y(2 p) \sin \theta+X(i, 2 p) \cos \theta) \frac{\varepsilon}{2 D}}{w_{n}}\right. \\
& \left.+\sum_{q=0}^{\frac{N}{2}-1} \sum_{j=1}^{\frac{l(2 q+1)}{D_{\text{inv}}}} \frac{\tau_{0}+(X(2 q+1) \cos \theta+Y(j, 2 q+1) \sin \theta) \frac{\varepsilon}{2 D}}{w_{n}}\right] \tag{3.10}
\end{align*}
$$

where $w_{n}$ is the width of the N-type transistor of the basic inverter. We note from Eq. 3.10 that the term $w_{n}$ cancels out, meaning that the total delay along a given path does not depend on the inverter size. This simplification assumes that interconnect capacitances are neglected, which is reasonable in this case where inverters are directly connected, i.e., with no wire between the inverters, so that the only interconnect capacitance results from a very short direct metallic connection and the drain capacitance of the N and P transistors. This capacitance is comparable to that of an inverter gate. One way to take this capacitance into account is to adjust the transistor time constant to an accurate effective value while keeping the inverters at a minimal size, since there is no need to keep the inverters large in the absence of wires. Another alternative is to use inverters 4 to 5 times the minimum size, making the parasitic capacitance relatively smaller than the capacitance of an inverter gate. In both cases, the delay and skew models developed later in this paper hold within a constant factor. In the rest of this section, we maintain the basic inverter at a minimal size. All the inverters used in the H-tree are identical. The tree branch width is then constant throughout the system.
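Eq. 3.10 can be evaluated by brute force, walking the path inverter by inverter and summing $M \tau(x, y)$. The sketch below does so for a depth-2 tree. It assumes $X(1) = l(0)$, measures each inverter coordinate at its end as stated above, and uses hypothetical values throughout ($M = 3$, $\tau_0 = 1$, $D = 40$, $D_{\text{inv}} = 10$); the function name is ours.

```python
import math


def path_delay(dx, dy, l, d_inv, tau0, eps, theta, d, m=3.0, y0=0.0):
    """Delay t_c of Eq. 3.10 with w_n cancelled, summed inverter by inverter."""
    grad = eps / (2.0 * d)

    def tau(x, y):                       # local time constant, Eq. 3.5
        return tau0 + (x * math.cos(theta) + y * math.sin(theta)) * grad

    def count(rank):                     # number of inverters in a branch
        return round(l[rank] / d_inv)

    x, y = 0.0, y0
    # Root branch (rank 0): X(i, 0) = i * D_inv.
    total = sum(tau(i * d_inv, y) for i in range(1, count(0) + 1))
    x = l[0]                             # assumption: X(1) is the end of the root branch
    for p in range(1, len(dx) + 1):
        sy, sx = dy[p - 1], dx[p - 1]
        # Vertical branch of rank 2p-1: x fixed, y advances inverter by inverter.
        total += sum(tau(x, y + sy * j * d_inv) for j in range(1, count(2 * p - 1) + 1))
        y += sy * l[2 * p - 1]           # Y(2p), Eq. 3.8
        # Horizontal branch of rank 2p: y fixed, x advances inverter by inverter.
        total += sum(tau(x + sx * i * d_inv, y) for i in range(1, count(2 * p) + 1))
        x += sx * l[2 * p]               # X(2p+1), Eq. 3.6
    return m * total


lengths = {0: 20.0, 1: 20.0, 2: 20.0}    # hypothetical depth-2 tree
common = dict(l=lengths, d_inv=10.0, tau0=1.0, theta=0.0, d=40.0, m=3.0)

t_ideal = path_delay([+1], [+1], eps=0.0, **common)    # no process variation
t_right = path_delay([+1], [+1], eps=0.08, **common)   # leaf reached moving right
t_left = path_delay([-1], [+1], eps=0.08, **common)    # leaf reached moving left
print(t_ideal, t_right - t_left)
```

Without variation, the delay is simply $M \tau_0$ times the number of inverters on the path. With a gradient along the $x$ axis, the two leaves that differ only in their horizontal direction no longer see the same delay.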

After moving the constants outside the inner sums, we can express the delay $t_{c}$ as a sum of two components, such that:

$$
t_{c}=r_{c}+s_{c}
$$

where:

$$
\begin{align*}
\mathrm{r}_{\mathrm{c}} & =\frac{\mathrm{M} \tau_{0}}{\mathrm{D}_{\text {inv }}} \sum_{\mathrm{p}=0}^{N} \mathrm{l}(\mathrm{p}) \\
\text { and } \mathrm{s}_{\mathrm{c}} & =\alpha_{\mathrm{c}} \sin \theta+\beta_{\mathrm{c}} \cos \theta, \tag{3.11}
\end{align*}
$$

with:

$$
\begin{align*}
\alpha_{c} & =M \frac{\varepsilon}{2 D}\left[\sum_{p=0}^{\frac{N}{2}}\left(\frac{l(2 p)}{D_{\text{inv}}} Y(2 p)\right)+\sum_{q=0}^{\frac{N}{2}-1} \sum_{j=1}^{\frac{l(2 q+1)}{D_{\text{inv}}}} Y(j, 2 q+1)\right] \tag{3.12}\\
& =M \frac{\varepsilon}{2 D}\left[\frac{l(0)}{D_{\text{inv}}} Y(0)+\sum_{p=1}^{\frac{N}{2}}\left(\frac{l(2 p)}{D_{\text{inv}}} Y(2 p)\right)\right. \\
& \quad\left.+\sum_{j=1}^{\frac{l(1)}{D_{\text{inv}}}}\left(Y(0)+{ }_{y}^{c} d_{0}^{2} \, j D_{\text{inv}}\right)+\sum_{q=1}^{\frac{N}{2}-1} \sum_{j=1}^{\frac{l(2 q+1)}{D_{\text{inv}}}} Y(j, 2 q+1)\right] \\
\beta_{c} & =M \frac{\varepsilon}{2 D}\left[\sum_{p=0}^{\frac{N}{2}} \sum_{i=1}^{\frac{l(2 p)}{D_{\text{inv}}}} X(i, 2 p)+\sum_{q=0}^{\frac{N}{2}-1}\left(\frac{l(2 q+1)}{D_{\text{inv}}} X(2 q+1)\right)\right] \tag{3.13}\\
& =M \frac{\varepsilon}{2 D}\left[\sum_{i=1}^{\frac{l(0)}{D_{\text{inv}}}} i D_{\text{inv}}+\sum_{i=1}^{\frac{l(2)}{D_{\text{inv}}}}\left(X(1)+{ }_{x}^{c} d_{1}^{3} \, i D_{\text{inv}}\right)+\sum_{p=2}^{\frac{N}{2}} \sum_{i=1}^{\frac{l(2 p)}{D_{\text{inv}}}} X(i, 2 p)\right. \\
& \quad\left.+\frac{l(1)}{D_{\text{inv}}} X(1)+\sum_{q=1}^{\frac{N}{2}-1}\left(\frac{l(2 q+1)}{D_{\text{inv}}} X(2 q+1)\right)\right]
\end{align*}
$$

While the component $r_{c}$ is analytically eliminated in the skew definition (Eq. 3.14 in the next section), the component $s_{c}$ contributes to the skew between the H-tree leaves. Notice that a perfect manufacturing process, with $\epsilon=0$ (i.e., without variations), corresponds to $s_{c}=0$.
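The reason $r_{c}$ drops out of the skew is that it depends only on the branch lengths, which are the same for every root-to-leaf path. A minimal sketch of the first line of Eq. 3.11, with hypothetical values and a function name of our own:

```python
def nominal_delay(l: dict[int, float], d_inv: float, tau0: float, m: float) -> float:
    """r_c of Eq. 3.11: M * tau0 / D_inv times the total path length."""
    return m * tau0 / d_inv * sum(l.values())


# Hypothetical depth-2 tree: three branches of length 20, inverters of length 10.
lengths = {0: 20.0, 1: 20.0, 2: 20.0}
r_c = nominal_delay(lengths, d_inv=10.0, tau0=1.0, m=3.0)
print(r_c)   # M * tau0 times the 6 inverters on the path
```

No direction variable appears in the computation, so two leaves of the same tree always share the same $r_{c}$ and only the $s_{c}$ terms survive in the difference.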

## III.4- Skew Formulation

Based on Eq. (3.11) and by substituting the $X(2q+1)$, $X(i, 2p)$, $Y(2p)$ and $Y(j, 2q+1)$ terms appearing in Eqs. (3.12) and (3.13) with their developments (Eqs. 3.6 to 3.9), the skew $s_{12}$ between two leaves, 1 and 2, of the H-tree can be expressed as:

$$
\begin{equation*}
\mathrm{s}_{12}=\mathrm{t}_{1}-\mathrm{t}_{2}=\mathrm{s}_{1}-\mathrm{s}_{2}=\mathrm{a} \sin \theta+\mathrm{b} \cos \theta \tag{3.14}
\end{equation*}
$$

where $\mathrm{a}=\alpha_{1}-\alpha_{2}$ and $\mathrm{b}=\beta_{1}-\beta_{2}$, such that:

$$
\begin{align*}
b=M \frac{\varepsilon}{2 D} & \left[\sum_{i=1}^{\frac{l(2)}{D_{\text{inv}}}}\left({ }_{x}^{1} d_{1}^{3}-{ }_{x}^{2} d_{1}^{3}\right) i D_{\text{inv}}\right. \\
& +\sum_{p=2}^{\frac{N}{2}}\left(\sum_{i=1}^{\frac{l(2 p)}{D_{\text{inv}}}}\left(\sum_{j=1}^{p-1}\left[\left({ }_{x}^{1} d_{2 j-1}^{2 j+1}-{ }_{x}^{2} d_{2 j-1}^{2 j+1}\right) l(2 j)\right]+\left({ }_{x}^{1} d_{2 p-1}^{2 p+1}-{ }_{x}^{2} d_{2 p-1}^{2 p+1}\right) i D_{\text{inv}}\right)\right) \\
& \left.+\sum_{q=1}^{\frac{N}{2}-1}\left(\frac{l(2 q+1)}{D_{\text{inv}}} \sum_{j=1}^{q}\left[\left({ }_{x}^{1} d_{2 j-1}^{2 j+1}-{ }_{x}^{2} d_{2 j-1}^{2 j+1}\right) l(2 j)\right]\right)\right] \tag{3.16}
\end{align*}
$$
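Only the expression of $b$ (Eq. 3.16) is reproduced here. Applying the same substitution to Eq. 3.12, with Eqs. 3.8 and 3.9, the companion expression of $a$ would presumably read as follows. This is a reconstruction from the surrounding equations, and the tag 3.15 is inferred from the gap in the numbering:

$$
\begin{align*}
a=M \frac{\varepsilon}{2 D} & \left[\sum_{p=1}^{\frac{N}{2}}\left(\frac{l(2 p)}{D_{\text{inv}}} \sum_{i=1}^{p}\left[\left({ }_{y}^{1} d_{2 i-2}^{2 i}-{ }_{y}^{2} d_{2 i-2}^{2 i}\right) l(2 i-1)\right]\right)\right. \\
& +\sum_{j=1}^{\frac{l(1)}{D_{\text{inv}}}}\left({ }_{y}^{1} d_{0}^{2}-{ }_{y}^{2} d_{0}^{2}\right) j D_{\text{inv}} \\
& \left.+\sum_{q=1}^{\frac{N}{2}-1}\left(\sum_{j=1}^{\frac{l(2 q+1)}{D_{\text{inv}}}}\left(\sum_{i=1}^{q}\left[\left({ }_{y}^{1} d_{2 i-2}^{2 i}-{ }_{y}^{2} d_{2 i-2}^{2 i}\right) l(2 i-1)\right]+\left({ }_{y}^{1} d_{2 q}^{2 q+2}-{ }_{y}^{2} d_{2 q}^{2 q+2}\right) j D_{\text{inv}}\right)\right)\right] \tag{3.15}
\end{align*}
$$

The terms in $Y(0)$ cancel in the difference because both paths share the same root branch, in the same way that $X(1)$ and the root-branch sum cancel in Eq. 3.16.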

At this level, the skew (Eq. 3.14) is a function of $\mathrm{N}+1$ variables: N differences of direction variables (Eqs. 3.17 and 3.18), which may also be treated as parameters, and an orientation $\theta$ of the time constant gradient. In fact, $a$ and $b$ can be expressed as follows:

$$
\begin{gather*}
\mathrm{a}=\sum_{u=1}^{\frac{\mathrm{N}}{2}} \mu_{\mathrm{u}} \mathrm{y}_{\mathrm{u}} ; \quad \mathrm{b}=\sum_{v=1}^{\frac{\mathrm{N}}{2}} \lambda_{v} \mathrm{x}_{v}  \tag{3.17}\\
\mathrm{x}_{\mathrm{v}}={ }_{\mathrm{x}}^{1} \mathrm{~d}_{2 v-1}^{2 v+1}-{ }_{x}^{2} \mathrm{~d}_{2 v-1}^{2 v+1} ; \quad \mathrm{y}_{\mathrm{u}}={ }_{\mathrm{y}}^{1} \mathrm{~d}_{2 u-2}^{2 u}-{ }_{y}^{2} \mathrm{~d}_{2 u-2}^{2 u} \tag{3.18}
\end{gather*}
$$

where $\mu_{\mathrm{u}}$ and $\lambda_{\mathrm{v}}$ are coefficients that depend on the H-tree topology and on elements $\mathrm{D}_{\text {inv }}, \mathrm{X}, \epsilon, \mathrm{D}$. Due to the definition of a direction variable (Eqs. 3.3 and 3.4), parameters $x_{v}$ and $y_{u}$ are elements of the set $\{-2,0,2\}$.
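The coefficients can also be evaluated numerically from the differences of direction variables. In the sketch below, `coeff_b` follows Eq. 3.16 term by term, while `coeff_a` is our own analogous derivation from Eqs. 3.8, 3.9 and 3.12 and should be read as an assumption. All names and numerical values ($M = 3$, $\epsilon = 0.08$, $D = 40$, $D_{\text{inv}} = 10$) are hypothetical.

```python
import math


def _count(l, rank, d_inv):
    """Number of inverters in the branch of the given rank."""
    return round(l[rank] / d_inv)


def coeff_b(x, l, d_inv, m, eps, d):
    """Coefficient b of Eq. 3.16; x[v-1] is the difference x_v of Eq. 3.18."""
    half = len(x)
    total = 0.0
    for p in range(1, half + 1):                       # horizontal branches, rank 2p
        offset = sum(x[j - 1] * l[2 * j] for j in range(1, p))
        for i in range(1, _count(l, 2 * p, d_inv) + 1):
            total += offset + x[p - 1] * i * d_inv
    for q in range(1, half):                           # vertical branches, rank 2q+1
        offset = sum(x[j - 1] * l[2 * j] for j in range(1, q + 1))
        total += _count(l, 2 * q + 1, d_inv) * offset
    return m * eps / (2.0 * d) * total


def coeff_a(y, l, d_inv, m, eps, d):
    """Coefficient a, derived by analogy with Eq. 3.16 (assumption); y[u-1] is y_u."""
    half = len(y)
    total = 0.0
    for p in range(1, half + 1):                       # horizontal branches, rank 2p
        offset = sum(y[i - 1] * l[2 * i - 1] for i in range(1, p + 1))
        total += _count(l, 2 * p, d_inv) * offset
    for q in range(0, half):                           # vertical branches, rank 2q+1
        offset = sum(y[i - 1] * l[2 * i - 1] for i in range(1, q + 1))
        for j in range(1, _count(l, 2 * q + 1, d_inv) + 1):
            total += offset + y[q] * j * d_inv
    return m * eps / (2.0 * d) * total


def skew(a, b, theta):
    """Eq. 3.14: s_12 = a sin(theta) + b cos(theta)."""
    return a * math.sin(theta) + b * math.cos(theta)


# Depth-4 example: the two leaves differ only in the first horizontal direction,
# so x = (2, 0) and y = (0, 0); each difference is an element of {-2, 0, 2}.
lengths = {0: 20.0, 1: 20.0, 2: 20.0, 3: 10.0, 4: 10.0}
params = dict(l=lengths, d_inv=10.0, m=3.0, eps=0.08, d=40.0)
b = coeff_b([2, 0], **params)
a = coeff_a([0, 0], **params)
print(a, b, skew(a, b, 0.0))
```

Because a difference at an upper level shifts every inverter below it, the coefficient attached to $x_{1}$ accumulates contributions from all deeper branches, which is why skew between leaves that separate near the root is the largest.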

## III.5- An Upper Bound Result on Clock Skew

A necessary condition for the skew $s_{12}$ to be maximum is that:

$$
\begin{equation*}
\frac{\partial \mathrm{s}_{12}}{\partial \theta}=0 \tag{3.19}
\end{equation*}
$$

Let us express the partial derivatives of the first and second orders relative to the gradient orientation $\theta$.

$$
\begin{gather*}
\frac{\partial \mathrm{s}_{12}}{\partial \theta}=\mathrm{a} \cos \theta-\mathrm{b} \sin \theta \\
\frac{\partial \mathrm{~s}_{12}}{\partial \theta}=0 \text { for } \theta_{0}=\arctan \frac{\mathrm{a}}{\mathrm{~b}}  \tag{3.20}\\
\frac{\partial^{2} \mathrm{~s}_{12}}{\partial \theta^{2}}=-\mathrm{s}_{12}
\end{gather*}
$$

By combining Eqs. 3.14 and 3.20, we get a skew upper bound:

$$
\begin{equation*}
\mathrm{s}_{12}=\mathrm{a} \sin \left(\arctan \left(\frac{\mathrm{a}}{\mathrm{~b}}\right)\right)+\mathrm{b} \cos \left(\arctan \left(\frac{\mathrm{a}}{\mathrm{~b}}\right)\right) \tag{3.21}
\end{equation*}
$$

which depends on the value of the N-tuple ( $\mathrm{x}_{\mathrm{v}}, \mathrm{y}_{\mathrm{u}}$ ) (Eqs. 3.17 and 3.18), i.e., the pair of H-tree leaves we are considering. In the following, we address the issue of which pair of H-tree leaves leads to the highest possible skew.
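
As a quick numerical check of Eqs. 3.14, 3.20 and 3.21, the following Python sketch evaluates $s_{12}(\theta)$ for illustrative values of a and b and compares the stationary value at $\theta_0 = \arctan(a/b)$ with a brute-force scan over $\theta$. The values of `a` and `b` and all function names are assumptions made for illustration; they are not taken from the analysis. For $b > 0$, the stationary value should coincide with $\sqrt{a^2 + b^2}$.

```python
import math


def skew(a, b, theta):
    """Eq. 3.14: s12 = a*sin(theta) + b*cos(theta)."""
    return a * math.sin(theta) + b * math.cos(theta)


def stationary_orientation(a, b):
    """Eq. 3.20: orientation where d(s12)/d(theta) = 0 (requires b != 0)."""
    return math.atan(a / b)


def scan_max(a, b, steps=100_000):
    """Brute-force maximum of s12 over theta sampled in [0, 2*pi)."""
    return max(skew(a, b, 2.0 * math.pi * k / steps) for k in range(steps))


# Illustrative values only (hypothetical, chosen with b > 0).
a, b = 7.0, 5.0
theta0 = stationary_orientation(a, b)
bound = skew(a, b, theta0)  # Eq. 3.21 evaluated at theta0
print(math.degrees(theta0), bound, math.hypot(a, b), scan_max(a, b))
```

When $b < 0$ the same stationary point gives the minimum rather than the maximum, which is consistent with the sign condition on the second derivative in Eq. 3.20.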

Fig. 3.5 shows the effect of varying parameters $a$ and $b$ on skew. We can look at this figure from two angles, with different consequences. If we consider $a$ and $b$ as continuous variables, the skew clearly diverges when $a$ and $b$ increase, and the center of Fig. 3.5 is a kind of inflection point in three dimensions that does not lead to an extremum. However, if we limit $a$ and $b$ to the values illustrated in Fig. 3.5, the function $s_{12}$ clearly has a maximum and a minimum. This interpretation corresponds to the case of an H-tree whose depth is finite, and where parameters $a$ and $b$ are limited to a finite range determined by those of the N-tuple of parameters $\left(\mathrm{x}_{\mathrm{v}}, \mathrm{y}_{u}\right)$.


Fig. 3.5. Skew variation versus the configuration of direction variables

At this level, the next hurdle to overcome is to determine, for an H-tree of depth N , the two leaves, and therefore the orientation of the time constant gradient, which lead to a maximum skew.

Fig. 3.5 shows that the maximum skew is reached in the region where $b$ is at its most positive value. Let us consider $E$ as the set of pairs ( $\mathrm{x}_{\mathrm{j}}, \mathrm{y}_{\mathrm{i}}$ ), where ( $\mathrm{i}, \mathrm{j}$ ) belongs to $\left[1, \frac{N}{2}\right]^{2}$, such that $b$ is maximum. Then, all the elements of $E$ satisfy the relation $\frac{\partial^{2} s_{12}}{\partial \theta^{2}}<0$ and they lead to a maximum of the $s_{12}$ function. However, only two pairs will produce the upper bound. The first pair corresponds to the parameter $a$ at its most positive value, and the second corresponds to the parameter $a$ at its most negative value. Because all the $\mu_{\mathrm{u}}$ and $\lambda_{\mathrm{v}}$ coefficients of Eqs. 3.17 and 3.18 are positive, the two pairs described above are obtained for $\left(\mathrm{x}_{\mathrm{v}}, \mathrm{y}_{\mathrm{u}}\right)=(2,2)$ and $\left(\mathrm{x}_{\mathrm{v}}, \mathrm{y}_{\mathrm{u}}\right)=(2,-2)$. The first pair depicts the case of leaves L1 and L2 (Fig. 3.6) and the second corresponds to leaves L3 and L4 (Fig. 3.6).

The above result supports the strong intuition which states:

There cannot be a larger skew than the one observed for the two leaves located at the ends of the first diagonal of the H-tree, where the displacement along all the branches that constitute the path leading to the first leaf occurs in the ascending direction of the gradient, while the displacement along all the branches that constitute the path leading to the second leaf occurs in the descending direction of the gradient.

Indeed, we should not expect more skew than what is obtained when sending the same signal along two paths, one path following the direction of monotonically increasing process variations and the other path following the direction of monotonically decreasing process variations.

Due to the perspective view, Fig. 3.5 seems to present two different skew upper bounds reached at the two corners of the surface. These corners correspond to the two pairs ( $\mathrm{x}_{\mathrm{j}}, \mathrm{y}_{\mathrm{i}}$ ) described so far. However, another geometrical point of view reveals that these two pairs lead to an identical skew upper bound. Indeed, when substituting the values of $(\mathrm{a}, \mathrm{b})$ corresponding to these two pairs into Eq. 3.21, we arrive at the same value. In the following, due to the symmetry in the problem, we can then focus only on the first pair.

Thus, we can describe the location of the two leaves located at the ends of the first diagonal (Fig. 3.6) of the H-tree as:

$$
\begin{gather*}
{ }_{x}^{1} d_{2 q-1}^{2 q+1}=-{ }_{x}^{2} d_{2 q-1}^{2 q+1}=1 \quad \forall q \in\left[1, \frac{N}{2}\right]  \tag{3.22a}\\
\text { and }{ }_{y}^{1} d_{2 q}^{2 q+2}=-{ }_{y}^{2} d_{2 q}^{2 q+2}=1 \quad \forall q \in\left[1, \frac{N}{2}\right] \tag{3.22b}
\end{gather*}
$$



Fig. 3.6. The two pairs of leaves that produce the skew upper bound

Then, the upper bound of $s_{12}$ is the following:

$$
\begin{equation*}
\max \left(s_{12}\right)=a_{\max } \sin \left(\arctan \frac{a_{\max }}{b_{\max }}\right)+b_{\max } \cos \left(\arctan \frac{a_{\max }}{b_{\max }}\right) \tag{3.23}
\end{equation*}
$$

where $a_{\max }$ and $b_{\max }$ come from the combination of Eqs. (3.15), (3.16), (3.22a) and (3.22b). Thus:

$$
\begin{align*}
& a_{\text {max }}=2 M \frac{\varepsilon}{2 D}\left[\sum_{p=1}^{\frac{N}{2}}\left(\frac{l(2 p)}{D_{\text {inv }}} \sum_{i=1}^{p} l(2 i-1)\right)+\frac{l(1)}{2}\left(\frac{l(1)}{D_{\text {inv }}}+1\right)\right. \\
& \left.+\sum_{q=1}^{\frac{N}{2}-1}\left[\frac{l(2 q+1)}{2}\left(\frac{l(2 q+1)}{D_{\text {inv }}}+1\right)+\frac{l(2 q+1)}{D_{\text {inv }}} \sum_{i=1}^{q} l(2 i-1)\right]\right]  \tag{3.24}\\
& b_{\text {max }}=2 M \frac{\varepsilon}{2 D}\left[\frac{l(2)}{2}\left(\frac{l(2)}{D_{\text {inv }}}+1\right)+\sum_{p=2}^{\frac{N}{2}}\left[\frac{l(2 p)}{2}\left(\frac{l(2 p)}{D_{\text {inv }}}+1\right)+\frac{l(2 p)}{D_{\text {inv }}} \sum_{j=1}^{p-1} l(2 j)\right]\right. \\
& \left.+\sum_{q=1}^{\frac{N}{2}-1}\left(\frac{l(2 q+1)}{D_{\text {inv }}} \sum_{j=1}^{q} l(2 j)\right)\right] \tag{3.25}
\end{align*}
$$

and the corresponding gradient orientation is:

$$
\begin{equation*}
\theta_{0}=\arctan \frac{a_{\max }}{b_{\max }} \tag{3.26}
\end{equation*}
$$

We now have the expression of the skew upper bound between H-tree leaves, as well as the corresponding orientation of the time constant gradient, as a function of the H-tree topology only.

From Appendix A of [61], we get:

$$
\begin{gather*}
\max \left(\mathrm{s}_{12}\right)=2 \mathrm{t} \frac{\tau_{0}}{\beta} \lambda \mathrm{M} \frac{\mathrm{D}^{2}}{\mathrm{D}_{\mathrm{inv0}}} \quad \text { and }  \tag{3.27}\\
\theta_{0} \approx \arctan \frac{7}{5}=54.46^{\circ} \tag{3.28}
\end{gather*}
$$

where:

- $t$ is a parameter that depends on the H-tree topology, such that:

$$
\mathrm{t}=1.43 \text { for the } \mathrm{l}(2 \mathrm{k}+1)=\mathrm{l}(2 \mathrm{k}+2)=\frac{\mathrm{D}}{2^{\mathrm{k}+1}} \text { topology }
$$

- $\tau_{0}$ is the transistor time constant when using a $1.2 \mu \mathrm{~m}$ CMOS technology.
- $\mathrm{D}_{\mathrm{inv0}}$ is the length of the basic inverter with which the H-tree is built.
- $\beta$ is the scaling factor from a $1.2 \mu \mathrm{~m}$ CMOS technology.

In these calculations, we assumed that the size of the chip $\mathrm{D} \gg \mathrm{D}_{\text {inv }}$.
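
The orientation of Eq. 3.28 can be cross-checked by evaluating the bracketed sums of Eqs. 3.24 and 3.25 directly for the topology $l(2k+1)=l(2k+2)=D/2^{k+1}$. The Python sketch below does this; the common prefactor $2M\varepsilon/(2D)$ is omitted because it cancels in the ratio $a_{\max}/b_{\max}$. The tree depth `N = 16` is an assumed value, the function names are hypothetical, and the last quantity (`t_est`) is only a normalised magnitude that happens to land near the quoted $t = 1.43$; it is not the derivation given in Appendix A of [61].

```python
import math


def branch_length(rank, D):
    """Length l(rank), rank >= 1, for l(2k+1) = l(2k+2) = D / 2**(k+1)."""
    return D / 2 ** ((rank + 1) // 2)


def a_max_bracket(N, D, D_inv):
    """Bracketed sum of Eq. 3.24 (prefactor 2*M*eps/(2*D) left out)."""
    first = branch_length(1, D)
    total = first / 2 * (first / D_inv + 1)
    for p in range(1, N // 2 + 1):
        odd_sum = sum(branch_length(2 * i - 1, D) for i in range(1, p + 1))
        total += branch_length(2 * p, D) / D_inv * odd_sum
    for q in range(1, N // 2):
        seg = branch_length(2 * q + 1, D)
        odd_sum = sum(branch_length(2 * i - 1, D) for i in range(1, q + 1))
        total += seg / 2 * (seg / D_inv + 1) + seg / D_inv * odd_sum
    return total


def b_max_bracket(N, D, D_inv):
    """Bracketed sum of Eq. 3.25 (prefactor 2*M*eps/(2*D) left out)."""
    second = branch_length(2, D)
    total = second / 2 * (second / D_inv + 1)
    for p in range(2, N // 2 + 1):
        seg = branch_length(2 * p, D)
        even_sum = sum(branch_length(2 * j, D) for j in range(1, p))
        total += seg / 2 * (seg / D_inv + 1) + seg / D_inv * even_sum
    for q in range(1, N // 2):
        even_sum = sum(branch_length(2 * j, D) for j in range(1, q + 1))
        total += branch_length(2 * q + 1, D) / D_inv * even_sum
    return total


# Assumed depth; die side and inverter length in micrometres (1 cm, 12.4 um).
N, D, D_inv = 16, 1.0e4, 12.4
a = a_max_bracket(N, D, D_inv)
b = b_max_bracket(N, D, D_inv)
ratio = a / b                                 # should approach 7/5
theta0_deg = math.degrees(math.atan(ratio))   # Eq. 3.26
t_est = math.hypot(a, b) * D_inv / D ** 2     # normalised magnitude
print(ratio, theta0_deg, t_est)
```

For deep trees with $D \gg D_{\text{inv}}$, the two sums tend to $\frac{7}{6}D^2/D_{\text{inv}}$ and $\frac{5}{6}D^2/D_{\text{inv}}$, which gives the ratio 7/5 used in Eq. 3.28.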

It is of interest to derive some concrete numerical results from the previous analysis. Using the parameters $\tau_{0}=24 \mathrm{ps}$, $\beta=1$ ($1.2 \mu \mathrm{~m}$ CMOS technology), $\mathrm{M}=4$, $\mathrm{D}_{\mathrm{inv} 0}=12.4 \mu \mathrm{~m}$ and $\mathrm{t}=1.43$ (topology of Fig. 3.1), Eq. (3.27) predicts that, on a silicon die of $1 \mathrm{~cm} \times 1 \mathrm{~cm}$, we can propagate a 100 MHz clock with a skew upper bound not exceeding one tenth of the clock period, that is to say, 1 ns. Moreover, for the same silicon area, when the technology is scaled down, this skew upper bound is reduced in proportion to the scaling factor $\beta$, thus allowing a clock rate that increases in the same proportion. Also, we will see in section III.6 that the clock rate can be doubled for neighbor-to-neighbor communications between processors.

The gradient orientation which maximizes the skew (Eq. 3.28) is about $10^{\circ}$ off from the intuitive $45^{\circ}$ one might expect due to the presence of the trigonometric functions $\sin \theta$ and $\cos \theta$ in the expression of the transistor time constant gradient (Eq. 3.5). This phenomenon, which is thoroughly characterized in [61], is associated with the step increase of the transistor time constant variation between two successive colinear segments of the tree.

We can now state important results:

- In the case of multi-dimensional arrays, where there exist direct communications between the leaves at opposite sides of the array (side-to-side communications), Eq. (3.27) shows that the skew upper bound grows like the square of the H-tree size. Note that the skew between two processors communicating through a third processor is not relevant in the determination of the skew upper bound.
- In a logic-based H-tree subject to a uniform gradient of the transistor time constant, the leaf-to-leaf skew grows faster than the root-to-leaf delay according to Eq. (3.27). This quadratic growth of the skew relative to the H-tree size is due to the space dimension added by the time constant gradient (Eq. 3.1).


## III.6- Two-Dimensional Arrays

When the array is multi-dimensional, some processors may communicate with distant processors in the array. In this case, Eq. (3.27) shows that the skew upper bound grows like the square of the H-tree size.

However, in an array of processors where each processor communicates only with its 8 neighbors (Fig. 3.7), one might hope to keep the skew constant irrespective of the array size, since the distance between these neighbors can be kept constant when the array grows in size. Unfortunately, this optimistic vision turns out to be false. In order to develop a detailed analysis of the maximum skew between neighbors, we first locate the neighbors that lead to the maximum skew, and then determine the relationship between their skew and the array size.


Fig. 3.7. A processor and its 8 neighbors

To ease the process of locating maximum-skew neighbors, let us first denote by $(\mathrm{I}, \mathrm{J})$ the indexes of a processor $\mathrm{P}_{\mathrm{r}}$ in the array, such that the lower left processor and the upper right processor will be respectively indexed as $(1,1)$ and $\left(2^{\frac{\mathrm{N}}{2}}, 2^{\frac{\mathrm{N}}{2}}\right)$. We define the following sets of processors $\mathrm{P}_{\mathrm{r}}(\mathrm{I}, \mathrm{J})$ :

$$
\begin{gathered}
A_{r}=\left\{P_{r}(I, J) / I \geq \frac{2^{\frac{N}{2}}}{2}+1\right\} ; A_{l}=\left\{P_{r}(I, J) / I \leq \frac{2^{\frac{N}{2}}}{2}\right\} \\
A_{t}=\left\{P_{r}(I, J) / J \geq \frac{2^{\frac{N}{2}}}{2}+1\right\} ; A_{b}=\left\{P_{r}(I, J) / J \leq \frac{2^{\frac{N}{2}}}{2}\right\}
\end{gathered}
$$

For example, Fig. 3.8 illustrates the sets $A_{r}$ and $A_{l}$, which could be superimposed on Fig. 3.1.


Fig. 3.8. An example of sub-regions in Fig. 3.1

The two neighbors that will produce the maximum skew cannot belong simultaneously to $A_{t}$ or simultaneously to $A_{b}$, because they would miss the opportunity to diverge at the branches ranked 1 (Fig. 3.1). The same analysis holds for $A_{r}$ or $A_{l}$ with the branches of rank 2. Then, the only possibility that remains valid for the neighbor-to-neighbor skew is the pairs of processors that are located at the center of the array, and which communicate in a diagonal direction, i.e., $P_{r}\left(\frac{2^{\frac{N}{2}}}{2}+1, \frac{2^{\frac{N}{2}}}{2}+1\right)$ with $P_{r}\left(\frac{2^{\frac{N}{2}}}{2}, \frac{2^{\frac{N}{2}}}{2}\right)$ or $P_{r}\left(\frac{2^{\frac{N}{2}}}{2}, \frac{2^{\frac{N}{2}}}{2}+1\right)$ with $P_{r}\left(\frac{2^{\frac{N}{2}}}{2}+1, \frac{2^{\frac{N}{2}}}{2}\right)$. Note that the divergence at the primary levels of the H-tree produces the main part of the skew between any pair of leaves in the system.
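
The geometric part of this argument can be checked by enumeration. The Python sketch below lists every 8-neighbor pair whose two members fall in different halves of the array both horizontally ($A_l$ versus $A_r$) and vertically ($A_b$ versus $A_t$). It does not compute any skew; it only confirms which pairs survive the two exclusion rules. The function name and the tested depths are assumptions made for illustration.

```python
from itertools import product


def straddling_neighbor_pairs(N):
    """8-neighbour pairs split across both A_l/A_r and A_b/A_t (N even)."""
    n = 2 ** (N // 2)   # processors per side of the array
    half = n // 2       # last index belonging to A_l (and to A_b)
    pairs = set()
    for i, j in product(range(1, n + 1), repeat=2):
        for di, dj in product((-1, 0, 1), repeat=2):
            if (di, dj) == (0, 0):
                continue
            i2, j2 = i + di, j + dj
            if not (1 <= i2 <= n and 1 <= j2 <= n):
                continue
            splits_left_right = (i <= half) != (i2 <= half)
            splits_top_bottom = (j <= half) != (j2 <= half)
            if splits_left_right and splits_top_bottom:
                pairs.add(frozenset({(i, j), (i2, j2)}))
    return pairs


for N in (2, 4, 6):
    print(N, sorted(sorted(pair) for pair in straddling_neighbor_pairs(N)))
```

For each depth tried, only the two diagonal pairs around the center of the array remain, which matches the pairs singled out in the text.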

In a detailed analysis [61] based on a computerized exhaustive enumeration for N ranging from 2 to 14, we observed that the skew upper bound between neighbors is indeed obtained for these pairs. Following a procedure similar to the one that leads to Eq. 3.27, we obtained the following expression for the skew between these selected pairs of processors:

$$
\begin{equation*}
\max \left(\mathrm{s}_{12}\right)=\mathrm{t} \frac{\tau_{0}}{\beta} \lambda \mathrm{M} \frac{\mathrm{D}^{2}}{\mathrm{D}_{\mathrm{inv} 0}} \tag{3.29}
\end{equation*}
$$

Equation 3.29 shows that the maximum skew between neighbors is half the maximum skew for side-to-side communications (Eq. 3.27). This difference is partly explained by the fact that the paths leading to the processors that produce the maximum skew for neighbor-to-neighbor communications are located within a square of side $D$, whereas the paths of the two processors that produce the maximum skew in side-to-side communications are located within a square of side $2D$.

Table 3.2 summarizes the skew complexity as a function of the H-tree size, as may be derived from the deterministic Fisher \& Kung's summation model [3], the probabilistic Steiglitz and Kugelmass's model [53] and the uniform gradient model proposed in this paper.

Table 3.2. Skew complexities versus H-tree size according to some authors

| Skew Model | Steiglitz \& Kugelmass's Probabilistic Model | Fisher \& Kung's Summation Model combined with a Uniform Delay Gradient | Uniform Gradient Model |
| :---: | :---: | :---: | :---: |
| Skew complexity | $\Theta(D \times \sqrt{\log D})$ | $\Omega\left(D^{2}\right)$ | $\Theta\left(D^{2}\right)$ |

As can be noticed from Table 3.2, we predict the same order of complexity as Fisher \& Kung, i.e., a $\Theta\left(D^{2}\right)$, but the uniform gradient model is definitely more pessimistic than the Steiglitz and Kugelmass's probabilistic model.

## IV. Generalization to Large-Area Parametric Variations

The skew analysis developed above could be easily generalized to the case of a wafer containing several VLSI circuits. However, the distribution of the transistor time constant could be subject to a more complex gradient, such as a gaussoid variation (Fig. 3.9).

With respect to the wafer axes ( $\mathrm{X}, \mathrm{Y}$ ), an example of a gaussoid distribution could be expressed as:

$$
\begin{gathered}
\mathrm{G}(\mathrm{X}, \mathrm{Y})=\mu_{\max } \times \mathrm{e}^{-\gamma\left(\mathrm{X}^{2}+\mathrm{Y}^{2}\right)}+g_{\min } \\
\text { where } \mathrm{G}(0,0)=\mathrm{g}_{\min }+\mu_{\max } \text { and } \mathrm{G}(\infty, \infty)=\mathrm{g}_{\min }
\end{gathered}
$$

$g_{\min }, \mu_{\max }$ and $\gamma$ are process-dependent constants.


Fig. 3.9. Gaussoid Distribution of a Transistor Time Constant

However, despite the complexity of the gradient, its distribution for VLSI chips of typical dimensions can be approximated by a uniform gradient (Eq. 3.5), in which case the skew analysis developed in this paper would produce a good approximation. Thus, the problem consists of finding the triplet $\left(\tau_{0}, \epsilon, \theta\right)$ for each chip in the wafer, where $\tau_{0}, \epsilon$, and $\theta$ are defined in Eq. 3.5. If we consider that $\mathrm{G}_{\mathrm{LT}}, \mathrm{G}_{\mathrm{LB}}, \mathrm{G}_{\mathrm{RT}}$ and $\mathrm{G}_{\mathrm{RB}}$ are the values of $\mathrm{G}(\mathrm{X}, \mathrm{Y})$ at the four corners of a chip, respectively the left-top, left-bottom, right-top and right-bottom corners in the ( $\mathrm{x}, \mathrm{y}$ ) axes, then, similarly to the way we established Eq. 3.5, we can write:

$$
\begin{gathered}
\theta=\operatorname{Arctan}\left(\frac{\mathrm{G}_{\mathrm{RT}}-\mathrm{G}_{\mathrm{RB}}}{\mathrm{G}_{\mathrm{LT}}-\mathrm{G}_{\mathrm{LB}}}\right) \\
\epsilon=\sqrt{\left(\mathrm{G}_{\mathrm{RT}}-\mathrm{G}_{\mathrm{RB}}\right)^{2}+\left(\mathrm{G}_{\mathrm{LT}}-\mathrm{G}_{\mathrm{LB}}\right)^{2}} \\
\tau_{0}=\mathrm{G}_{\mathrm{LB}}
\end{gathered}
$$
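
To make the per-chip procedure concrete, the Python sketch below samples a gaussoid $G(X, Y)$ at the four corners of a square chip and then transcribes the three expressions above literally. All numerical values ($\mu_{\max}$, $\gamma$, $g_{\min}$, the chip position and its size) are hypothetical, as are the function names. The expression for $\theta$ is undefined when $G_{LT} = G_{LB}$, so the example chip is deliberately placed away from the wafer axes.

```python
import math


def gaussoid(x, y, mu_max, gamma, g_min):
    """G(X, Y) = mu_max * exp(-gamma * (X^2 + Y^2)) + g_min."""
    return mu_max * math.exp(-gamma * (x * x + y * y)) + g_min


def chip_triplet(cx, cy, side, mu_max, gamma, g_min):
    """(tau0, epsilon, theta) for a square chip centred at (cx, cy).

    Literal transcription of the three expressions in the text;
    theta is undefined when G_LT equals G_LB.
    """
    h = side / 2.0
    g_lt = gaussoid(cx - h, cy + h, mu_max, gamma, g_min)
    g_lb = gaussoid(cx - h, cy - h, mu_max, gamma, g_min)
    g_rt = gaussoid(cx + h, cy + h, mu_max, gamma, g_min)
    g_rb = gaussoid(cx + h, cy - h, mu_max, gamma, g_min)
    theta = math.atan((g_rt - g_rb) / (g_lt - g_lb))
    epsilon = math.hypot(g_rt - g_rb, g_lt - g_lb)
    tau0 = g_lb
    return tau0, epsilon, theta


# Hypothetical wafer profile (ps) and a 1 cm chip centred at (3 cm, 2 cm).
params = dict(mu_max=5.0, gamma=0.01, g_min=24.0)
tau0, epsilon, theta = chip_triplet(3.0, 2.0, 1.0, **params)
print(tau0, epsilon, math.degrees(theta))
```

Repeating the call for each chip site on the wafer would give one triplet per chip, which is the input needed by the fixed-orientation skew search discussed next.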

Because the gradient orientation $\theta$ is now fixed for a given chip, the skew model developed in this paper holds only up to the end of section III.4. Therefore, a procedure different from the one described in section III.5 has to be used in order to find the maximum skew of neighbor-to-neighbor and side-to-side communications for each site on a wafer. For instance, if $\theta$ is known to belong to the interval $\left[0, \frac{\pi}{2}\right]$, the search for the maximum skew for side-to-side communications consists of finding the terms $a$ and $b$ such that the skew expression (Eq. 3.14) is maximum. Because the terms $\sin \theta$ and $\cos \theta$ are both positive in the present case, this skew expression is maximum when the parameters $a$ and $b$ are both at their most positive values. This is the case (as shown in section III.5) for the skew of the two leaves located at the ends of the first diagonal of the H-tree (Fig. 3.6).

Thus, depending on the position of a chip and on the nature of the communications in the array, a manufacturer could attempt predicting the skew distribution for a chip lot, as well as which chip on a wafer is expected to perform better.

## V. Trade-offs between Skew and Clock Rate

This section analyzes a more general configuration, the hybrid clock distribution H-tree, which is laid out using both logic and interconnection. As proposed by some authors [63], one may add metallic interconnections between the inverters and increase the size of these inverters. All solutions considered so far (the metallic H-tree and the logic-based H-tree) may be viewed as special cases of the hybrid H-tree. Let us now discuss the advantages of this hybrid H-tree in terms of skew and clock rate. The logic-based H-tree presented in Fig. 3.2 allows a clock frequency that, for large chips, is limited by the skew rather than by the minimum time between two successive events propagated through the H-tree (degree of pipelining). As a consequence, subject to the $10 \%$ rule of thumb relating the skew to the clock period, clock frequencies of at most 200 MHz are allowed for synchronizing a 1 cm by 1 cm array of processors, when considering neighbor-to-neighbor communications (section III.6). The hybrid H-tree configuration [63], where the degree of pipelining is lowered, allows a trade-off between clock frequency and skew. Fig. 3.10 illustrates such a relaxation of the degree of pipelining on a path of the H-tree, which goes from the root to a processor.

Fig. 3.10. Use of drivers and interconnection in an H-tree metallic path (Mixed H-tree)

Given that the structure of Fig. 3.1 is pipelined, to insert the same kind of event (transition to 0 or 1), one has to wait until the previous occurrence of the event propagates to the third stage. Another general rule is to limit the skew to $10 \%$ of the clock period. Then, a clock can be propagated through the structure of Fig. 3.10 with a frequency as large as:

$$
\begin{equation*}
f_{\max }=\frac{1}{\max \left(2 \times T_{\text {segment }}, 10 \times \text { skew }\right)} \tag{3.32}
\end{equation*}
$$

where $T_{\text {segment }}$ is the delay per line segment. Because the time constant of an interconnection square is a million times smaller than that of the logic (Appendix C of [62]), adding interconnections in the logic-based H-tree (Fig. 3.2) makes the skew accumulate more slowly along the line. In Appendix C of [62], Dally modeled each line segment as an interconnect resistance with a lumped interconnect capacitance ( $\pi$-model).

Assuming that $\tau_{i}$ and $\tau_{l}$ are respectively the time constants of minimum squares of interconnection and logic (transistors), and if $T$ is the total delay along a line with a length $L$, then the delay variation (skew) due to process variations of $\tau_{i}$ and $\tau_{l}$ can be written as:

$$
d T=\frac{\partial T}{\partial \tau_{i}} \times d \tau_{i}+\frac{\partial T}{\partial \tau_{l}} \times d \tau_{l}=\text { skew }
$$

Dally also developed an analysis predicting the design which minimizes the end-to-end delay. Note that, in general, minimizing path delay guarantees neither minimum skew nor maximum bandwidth. Thus, it is just a heuristic design which will turn out to provide a good trade-off between skew and bandwidth.

From Appendix C of [62], we get the optimal delay per segment of line, which is:

$$
\begin{equation*}
T_{\text {segment }}=6 \times(1+\sqrt{2}) \times \tau_{l}=\beta \times \tau_{l} \tag{3.33}
\end{equation*}
$$

Combining Eq. 3.33 with the optimal segment length given by Dally in Appendix C of [62], we deduce the following optimal delay on a line with total length $L$ :

$$
T=(2+\sqrt{2}) \sqrt{3} \times \sqrt{\tau_{i} \times \tau_{l}} \times L
$$

which is obtained for a configuration where drivers are inserted in a metallic H-tree such that the optimal segment length and driver size are respectively 2683 and 63 squares of $1.2 \mu \mathrm{~m}$ by $1.2 \mu \mathrm{~m}$ (for a $1.2 \mu \mathrm{~m}$ technology).

Thus,

$$
\begin{equation*}
d T=\frac{2+\sqrt{2}}{2} \sqrt{3} \times L\left(\sqrt{\frac{\tau_{l}}{\tau_{i}}} \times d \tau_{i}+\sqrt{\frac{\tau_{i}}{\tau_{l}}} \times d \tau_{l}\right) \tag{3.34}
\end{equation*}
$$

Due to the lack of knowledge regarding the physical relationship between $\tau_{i}$ and $\tau_{l}$, we will use the following empirical equation (inspired by Appendix C of [62]) to characterize this relationship:

$$
\tau_{i}=\frac{5}{\lambda} \times 10^{-7} \times \tau_{l}
$$

where $\lambda$ is the technology's minimum feature size. This allows us to rewrite Eq. 3.34 based only on variations of the transistor time constant:

$$
d T=\gamma \times L \times d \tau_{l}
$$

where $\gamma$ is a constant related to $\lambda$.
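
One way to obtain $\gamma$ explicitly is to differentiate the empirical relation directly, i.e. to assume $d\tau_i = k \, d\tau_l$ with $k = \frac{5}{\lambda} \times 10^{-7}$, and substitute into Eq. 3.34. Under that assumption both terms in the parenthesis of Eq. 3.34 reduce to $\sqrt{k}$, giving $\gamma = (2+\sqrt{2})\sqrt{3}\sqrt{k}$ per square of line. The Python sketch below evaluates this; the function names are hypothetical and the step $d\tau_i = k \, d\tau_l$ is an interpretation, not a statement taken from [62].

```python
import math


def interconnect_ratio(lam):
    """k = tau_i / tau_l from the empirical relation tau_i = (5/lam) * 1e-7 * tau_l."""
    return 5.0e-7 / lam


def gamma_coefficient(lam):
    """gamma in dT = gamma * L * d(tau_l), assuming d(tau_i) = k * d(tau_l)."""
    k = interconnect_ratio(lam)
    prefactor = (2.0 + math.sqrt(2.0)) / 2.0 * math.sqrt(3.0)
    # Eq. 3.34: sqrt(tau_l / tau_i) * d(tau_i) + sqrt(tau_i / tau_l) * d(tau_l)
    first_term = math.sqrt(1.0 / k) * k   # contribution of d(tau_i)
    second_term = math.sqrt(k)            # contribution of d(tau_l)
    return prefactor * (first_term + second_term)


# lam = 0.6 is the value the text associates with 1.2 um CMOS.
print(gamma_coefficient(0.6))
```
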
While the transistor time constant $\tau_{l}$ varies from $\tau_{1}$ to $\tau_{2}$ along the line, the line delay T varies as follows:

$$
\Delta T=\int_{T_{1}}^{T_{2}} d T=\int_{\tau_{1}}^{\tau_{2}} \gamma \times L \times d \tau_{l}=\gamma \times L \times\left(\tau_{2}-\tau_{1}\right)
$$

Using the linearity of the transistor time constant gradient with respect to space (assumption A2, section III.1), we can write that $\tau_{2}-\tau_{1}=c \times L \quad(p s)$, where $c$ is a constant. The skew ( $\Delta T$ ) can then be rewritten as growing quadratically with respect to line length $L$:

$$
\begin{equation*}
\Delta T=\alpha \times L^{2}(p s) \tag{3.35}
\end{equation*}
$$

where $\alpha=31 \times 10^{-8}$, assuming a $1.2 \mu \mathrm{~m}$ CMOS technology $(\lambda=0.6)$ and a $20 \%$ variation of the transistor time constant along a 10 cm line [63]. Under these conditions, with $\mathrm{L}=1 \mathrm{~cm}$ (i.e., 8333 squares of $1.2 \mu \mathrm{~m}$ by $1.2 \mu \mathrm{~m}$ ), Eq. 3.35 gives a skew value of 21.51 ps. Taking into account the effect of an orientation of the transistor time constant gradient on a given path in the H-tree (as is the case with the uniform gradient model, Eq. 3.5) leads to a lower skew, because of the presence of the trigonometric functions $\cos \theta$ and $\sin \theta$, whose values are between 0 and 1. Moreover, Eq. 3.33 gives $T_{\text {segment }}=347.62$ ps. According to Eq. 3.32, the maximal clock frequency is then about 1.44 GHz. One could compare this 1.44 GHz result with the 1.11 GHz clock frequency obtained in Fig. 3.3, using HSPICE simulations. Of course, the former is an optimistic theoretical value based on approximate technological parameters taken from Appendix C of [62], and simulations would yield a lower performance. From [64], achieving the maximum performance would require a re-optimization of the configuration, taking into account the complete transistor model.
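
The figures quoted in this paragraph can be reproduced approximately with the short Python sketch below. It rests on several assumptions that are not stated explicitly in the text: $\tau_l$ is taken equal to the 24 ps value quoted earlier for the transistor time constant, the "20% variation along a 10 cm line" is read as $c = 0.2\,\tau_l$ divided by the number of 1.2 µm squares in 10 cm, and $\gamma$ is taken as $(2+\sqrt{2})\sqrt{3}\sqrt{5\times10^{-7}/\lambda}$, which follows from Eq. 3.34 if $d\tau_i$ is proportional to $d\tau_l$. All names are hypothetical. Small differences from the quoted values are expected from rounding.

```python
import math

TAU_L_PS = 24.0    # assumed transistor time constant (ps), 1.2 um CMOS
LAMBDA = 0.6       # lambda value quoted for 1.2 um CMOS
SQUARE_UM = 1.2    # side of one square (um)


def squares(length_cm):
    """Number of 1.2 um squares along a line of the given length."""
    return length_cm * 1.0e4 / SQUARE_UM


def alpha_coefficient(variation=0.2, line_cm=10.0):
    """alpha = gamma * c in Eq. 3.35, in ps per square^2 (assumed reading)."""
    gamma = (2.0 + math.sqrt(2.0)) * math.sqrt(3.0) * math.sqrt(5.0e-7 / LAMBDA)
    c = variation * TAU_L_PS / squares(line_cm)   # tau2 - tau1 = c * L
    return gamma * c


def segment_delay_ps():
    """Eq. 3.33: T_segment = 6 * (1 + sqrt(2)) * tau_l."""
    return 6.0 * (1.0 + math.sqrt(2.0)) * TAU_L_PS


def f_max_ghz(t_segment_ps, skew_ps):
    """Eq. 3.32, with times in ps (1/ps = 1000 GHz)."""
    return 1.0e3 / max(2.0 * t_segment_ps, 10.0 * skew_ps)


alpha = alpha_coefficient()
skew_ps = alpha * squares(1.0) ** 2       # Eq. 3.35 for L = 1 cm
t_segment = segment_delay_ps()
print(alpha, skew_ps, t_segment, f_max_ghz(t_segment, skew_ps))
```

With these inputs, the segment delay term dominates the maximum in Eq. 3.32 at $L = 1$ cm, which is why the resulting frequency depends only on $T_{\text{segment}}$.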

## V.1- Skew Dominance in the Determination of the Maximal Clock Rate

According to Eq. 3.32, the skew determines the maximal clock rate $f_{\text {max }}$ if:

$$
10 \times \Delta T>2 \times T_{\text {segment }}
$$

Using Eqs. 3.33 and 3.35, we obtain:

$$
\alpha \times L^{2}>\frac{\beta \times \tau_{l}}{5} \text { i.e., } \quad L>\sqrt{\frac{\beta \times \tau_{l}}{5 \times \alpha}}=L_{0}
$$

Numerically, for a $1.2 \mu \mathrm{~m}$ technology $(\lambda=0.6), \mathrm{L}_{0}=1.8 \mathrm{~cm}$. This is the line length above which the skew determines $f_{\max }$ (Fig. 3.11). In this regime, due to the quadratic growth of the skew with respect to the line length (Eq. 3.35), $\mathrm{f}_{\max }$ decreases as the inverse square of the line length. Below the limit $L_{0}$, the delay of the first two stages of the line dominates. Fig. 3.11 illustrates these relationships: the quadratic growth of the skew as well as the non-linear behavior of $f_{\max }$. Indeed, the maximum in Eq. 3.32 produces a switch between the inverse of a quadratic dependence and the fixed bandwidth of the minimal delay design.
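
The crossover length can be recomputed from the expression for $L_0$ with a few lines of Python. The sketch assumes $\tau_l = 24$ ps, $\alpha = 31 \times 10^{-8}$ ps per square squared, $\beta = 6(1+\sqrt{2})$ as in Eq. 3.33, and lengths counted in 1.2 µm squares; the function names are hypothetical.

```python
import math

SQUARE_CM = 1.2e-4   # side of one 1.2 um square, in cm


def crossover_length_squares(beta, tau_l_ps, alpha):
    """L0 = sqrt(beta * tau_l / (5 * alpha)), in squares."""
    return math.sqrt(beta * tau_l_ps / (5.0 * alpha))


def f_max_ghz(length_squares, beta, tau_l_ps, alpha):
    """Eq. 3.32 with T_segment = beta * tau_l and skew = alpha * L^2 (ps)."""
    t_segment = beta * tau_l_ps
    skew = alpha * length_squares ** 2
    return 1.0e3 / max(2.0 * t_segment, 10.0 * skew)


beta = 6.0 * (1.0 + math.sqrt(2.0))   # Eq. 3.33
tau_l_ps, alpha = 24.0, 31e-8         # assumed values for 1.2 um CMOS
L0 = crossover_length_squares(beta, tau_l_ps, alpha)
print(L0 * SQUARE_CM)                              # crossover length in cm
print(f_max_ghz(0.5 * L0, beta, tau_l_ps, alpha))  # delay-limited regime
print(f_max_ghz(2.0 * L0, beta, tau_l_ps, alpha))  # skew-limited regime
```

Doubling the line length beyond $L_0$ divides $f_{\max}$ by four in this model, while any length below $L_0$ gives the same delay-limited value.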

Table 3.3 summarizes the performance, in terms of skew and maximal clock rate, of the different configurations of a clock distribution H-tree considered in this paper. Considering the skew and the clock rate simultaneously, this table shows that a mixed H-tree provides a very good trade-off for an H-tree clock design.

The analysis performed in this section shows that:

1. A metallic H-tree achieves the lowest skew (due to the low resistivity of the metallic H-tree paths), while the clock rate is low because no simultaneous events are accepted within the H-tree.
2. Due to the high degree of pipelining, a logic-based H-tree will achieve the highest clock rate, but with a high skew due to the high resistivity of H-tree logic paths.
3. A hybrid H-tree combines the advantages of the other two configurations, thus leading to a trade-off featuring a low skew and a high clock rate.

Area and power dissipation are also key elements in determining the best trade-off. However, analyzing these trade-offs is beyond the scope of this paper.

Let us conclude this section by stressing the fact that Dally's design, which minimizes overall line delay, was used here in a heuristic manner, and in general there is no reason why it would represent the best compromise between skew and bandwidth, which are really different performances. Determining the optimal compromise between skew and bandwidth remains an open question.


Fig. 3.11. Skew dominance in the determination of the maximal clock rate (notice that the two curves do not follow the same ordinate scale; one is a frequency, the other a time difference)

Table 3.3. Summary of the H-tree performances with different configurations

| Type of H-tree | Skew | Maximal Clock Rate |
| :---: | :---: | :---: |
| Metallic H-tree | Lowest | Low |
| Logic-Based H-tree | High | Highest |
| Mixed H-tree | Low | High |

## VI. Conclusion

Under the assumption of a unified model of manufacturing process variations, a deterministic expression of the skew upper bound has been established using a logic-based H-tree to clock an array of processors. Implementing the H-tree with logic rather than metallic connections (conventional method), process variations were modeled as variations of the transistor channel time constant subject to a uniform gradient with given magnitude and orientation. Under these conditions, we argued that this structure can support the best clocking rate achievable with a given technology. We also showed that for side-to-side communications, this logic-based H-tree has a skew upper bound that is obtained with the opposite leaves on the first diagonal of the rectangle that circumscribes the H-tree leaves, and, for neighbor-to-neighbor communications, with the pair of processors communicating in a diagonal direction at the center of the array. In terms of complexity regarding the H-tree size D, we predict the same order of complexity as Fisher \& Kung for both side-to-side and neighbor-to-neighbor communications, i.e., a $\Theta\left(D^{2}\right)$, whereas the Steiglitz and Kugelmass's probabilistic model predicts $\Theta(D \times \sqrt{\log D})$. We also proposed generalizations of the skew model to large-area parametric variations and to the case of H-tree configurations mixing logic and interconnections. This led us to observe that Dally's design, which minimizes the end-to-end delay, provides a good compromise between bandwidth and skew, without being optimal with respect to these parameters for all system sizes.

## VII. Acknowledgments

We would like to acknowledge the INI (Institut National de formation en Informatique) in Algiers and CIDA (Canadian International Development Agency) for their financial support to this project. This work has also been supported by a strategic grant and by operating grants from the Natural Sciences and Engineering Research Council of Canada. Also, we would like to thank the reviewers for their helpful comments.

## Chapter 4.

## Design of Reliable, Low-Power Clock Distribution Networks

We saw in Chapter 3 to what extent manufacturing process variations limit the clock frequency that can be distributed over large silicon areas. The aim of this chapter is to show the importance of power dissipation in determining the overall performance of the clock distribution trees described in Chapter 3, namely the logic-based H-tree and the hybrid H-tree.

For the logic-based H-tree, the two constraints, skew and power dissipation, are considered simultaneously to predict the maximum clock frequency of the system. An H-tree built from active logic consumes power that must be added to that of the processors at the leaves of the tree. To avoid the risk that the total consumption exceeds the maximum dissipation capacity of the integrated circuit, it is relevant to evaluate the power component contributed by the clock distribution tree, all the more so because logic is present everywhere in the tree. On either side of a limiting system size, the system can operate in one of two modes, one determined by the skew constraint and the other by the power constraint. The effect of power appears to be felt only below this limiting system size.

As for the hybrid H-tree (Fig. 4.1), the signal is injected through a large amplifier at the root of the H-tree, and further amplifiers are added at the bifurcations of the branches [50].

Unlike the case of the logic-based H-tree, the capacitance of the interconnections between two successive amplifiers is significant, forcing the designer to increase the size of the amplifiers above the minimum value. This raises both the switching power and the short-circuit power, which makes the power constraint more severe and thus affects system performance. A complete study of the power-delay trade-off in a clock distribution tree is presented in reference [68]. However, the effect of manufacturing process variations is not taken into account there. An effort to include this effect was undertaken in the work of Xi and Dai [37]; however, as we will show in Chapter 5, section III.3, the modeling of this effect in that work is not reliable.

To show what is at stake regarding power in high-performance resistive-capacitive interconnections, which is the case in a large tree operating at high speed, we present two examples of amplifier design with particular emphasis on minimizing the dissipated power. However, to simplify the problem, we considered a linear interconnection instead of a tree.

Fig. 4.1. Amplification in a hybrid H-tree

The first structure is based on a conference paper [69] that we published at MWSCAS (Midwest Symposium on Circuits \& Systems), Lafayette, Louisiana, United States, August 1994. That paper presents a variant of the parallel regeneration technique (PRT) [64] with reduced power and area. The idea is to depart from the assumption of uniform amplifier size along the line and to adjust the progression of these sizes, which leads to a variant that we named VPRT (Variable PRT). An analytical calculation and electrical simulations are carried out to determine the improvement brought by VPRT in terms of an ATP metric, where $A$ is the area occupied by the regenerators (the area occupied by the line being identical to that of the uniform PRT design), $T$ is the propagation delay of the signal on the line, and $P$ is the power dissipated by the regenerators (excluding the $\mathrm{CV}^{2}$ term due to the line, since it is similar to that of the uniform PRT design). In conclusion, when long interconnections are used, the uniform PRT design tends to occupy twice the area of VPRT and to consume twice the power.

The second structure is based on a conference paper [70] that we published at ISCAS (International Symposium on Circuits \& Systems), London, June 1994. In this case, the amplification circuit uses a dual power supply and a clipped signal, which increases the signal propagation speed while reducing the power consumed.

Each node in an integrated circuit is associated with a power consumption on the order of $f_{d} C_{L} V^{2}$, where $f_{d}$ is the average switching frequency, $C_{L}$ the total capacitance of the node, and $V$ the peak-to-peak voltage on that node. Generally, for CMOS circuits supplied by $V_{dd}$, the peak-to-peak voltage $V$ equals $V_{dd}$. That paper focuses on reducing power through the quadratic term in $V$.

Because of the complexity of the theoretical model of the amplification circuit, finding an optimal circuit configuration by systematically exploring all design parameters is a lengthy task. In our case, we used a two-step heuristic that relies on electrical simulations exploring the design parameters over a limited range. The method considers the circuit configuration produced at each step of the heuristic. Compared with the conventional technique, in which the amplifier is a simple inverter of non-minimal size inserted at regular intervals in the interconnection to be regenerated, this heuristic produces two circuits with delay and area reductions of $57 \%$ and $14 \%$, respectively, for a first circuit and of $46 \%$ and $87.7 \%$ for a second circuit.

The full text of these two conference papers is provided in Appendices B and C.

# Design of Low-Power and Reliable Logic-Based H-trees

M. Nekili, G. Bois and Y. Savaria<br>Ecole Polytechnique de Montreal<br>Department of Electrical \& Computer Engineering<br>P.O. Box 6079, Station "Centre-Ville", Montreal, Quebec, Canada H3C 3A7<br>Phone: (514) 340-4711 ext. 4737; E-mail: nekili@vlsi.polymtl.ca


#### Abstract

This paper addresses the problem of low-power and reliable logic-based H-trees, i.e., clock distribution trees implemented with active logic only. While this type of H-tree features an advantageous bandwidth, it is limited by skew and power constraints. The effects of these constraints on system performance are analyzed, as well as their domains of dominance.


## I. Introduction

The increasing popularity of portable applications and the high costs of system cooling have increased the importance of low-power design [68,93]. The clock net accounts for a significant fraction of the system power dissipation, as it switches most frequently and drives large capacitive loads. The clock network power dissipation is typically one third of the total power dissipation in CMOS VLSI systems [94], and it can represent more than half the total power in some designs. Also, reliability issues are of increasing importance as transistor feature size decreases [95]. Therefore, designing clock distribution networks that are both low-power and reliable is important. In this paper, reliability and low-power issues are analyzed simultaneously in the design of a particular clock distribution structure: a logic-based H-tree (Fig. 4.2), i.e., an H-tree that is built using only active logic. In the past, clock trees were implemented with metallic wires only. With the increase in complexity and die sizes, drivers are inserted in order to drive the large capacitive loads in the tree. Implementing the tree with minimum-sized inverters [5] makes very high clock frequencies possible in CMOS technology (more than 1 GHz for a $1.2 \mu \mathrm{~m}$ CMOS technology). Indeed, the clock period is limited only by the delay of a pair of minimum-sized inverters in that technology.

This paper is organized as follows. Section II formulates the power and skew constraints in the logic-based H-tree. Then, the global system performance is calculated in Section III. In Section IV, the dominance of the constraints is studied as a function of the system size.


Fig. 4.2. Logic-Based H-tree

## II. Formulation of Constraints

Besides the power consumed by the processors, the logic-based H-tree dissipates power that must be taken into account. To avoid the risk of exceeding the maximal dissipation capacity of the system, it is important to assess the power dissipation component contributed by the clock distribution network, especially because the logic-based H-tree is implemented with active logic. The state of the art in heat removal capabilities is around $10 \frac{\mathrm{~W}}{\mathrm{~cm}^{2}}$ [50,62,65], which is the maximal power density that must not be exceeded; otherwise the integrated circuit may be damaged.

Let us assume that clock edges are injected into the H-tree in such a way that each clock edge causes one inverter to switch at a rate of $\frac{1}{T}$, where $T$ is the clock period. Although the static power dissipation remains negligible, each inverter in the H-tree consumes dynamic power. This power consists of two components: one related to capacitance switching (switching power) and the other related to short-circuit current (short-circuit power), which is consumed when the N and P transistors are both ON during clock transitions.

If $f$ is the clock frequency, $V_{d d}$ the power-supply voltage and $C_{g}$ the gate capacitance of a minimum-sized NMOS transistor, the switching power [66] is given by:

$$
M \times C_{g} \times V_{d d}^{2} \times f
$$

where $M$ is the number of $C_{g}$ included in the inverter's input capacitance. With a logic-based H-tree, where tree branches are composed of chains of identical and minimum-sized inverters, $M$ is equal to $\frac{\mu_{N}}{\mu_{P}}+1$, where $\mu_{N}$ and $\mu_{P}$ are respectively the mobilities of charge carriers in the N and P transistors. This gives the P transistor a size double that of the N transistor in order to balance their time constants. As a consequence, rise and fall times of the clock signal at the input of the inverter are equal. In such conditions, the short-circuit power can be written as [67]:

$$
\frac{\beta}{12}\left(V_{d d}-2 \times V_{T}\right)^{3} \times \frac{\tau}{T}
$$

where $\beta$ and $V_{T}$ represent respectively the gain factor and the threshold voltage of a MOS transistor, and $\tau$ is the rise (or fall) time of the clock signal at the inverter's input, i.e., $M \times \tau_{N}$, where $\tau_{N}$ is the time constant of a minimum-size NMOS transistor. Therefore, the total power consumption of the H -tree is:

$$
P_{H}=c_{1} \times f \times n
$$

where

$$
\begin{gathered}
c_{1}=\left(\frac{\beta}{12}\left(V_{d d}-2 \times V_{T}\right)^{3} \times \tau_{N}+C_{g} \times V_{d d}^{2}\right) \times M \\
n=3 \times 2^{\frac{N}{2}} \times \frac{D}{D_{i n v}}
\end{gathered}
$$

and n is the number of inverters in the clock tree (see Appendix A), N is the maximal number of levels in the H -tree (the number of tree leaves is $2^{\mathrm{N}}$ ), D is half the side of the square area used to implement the chip containing the H-tree and, $\mathrm{D}_{\mathrm{inv}}$ is the physical size of the inverter used to build the clock tree.
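
As a rough numerical sketch of the expressions above, the following Python snippet evaluates $c_{1}$, the inverter count $n$ and $P_{H}$. The device values are those quoted in the example of Section III; the values chosen for `n_levels`, `d` and the clock frequency in the usage lines, as well as all function names, are illustrative assumptions rather than values taken from the paper.

```python
def c1_energy(beta, vdd, vt, tau_n, cg, m):
    """Energy drawn per clock cycle by one inverter (J): the constant c_1."""
    short_circuit = beta / 12.0 * (vdd - 2.0 * vt) ** 3 * tau_n
    switching = cg * vdd ** 2
    return (short_circuit + switching) * m


def inverter_count(n_levels, d, d_inv):
    """n = 3 * 2^(N/2) * D / D_inv, with D and D_inv in the same length unit."""
    return 3.0 * 2.0 ** (n_levels / 2.0) * d / d_inv


def h_tree_power(c1, f, n):
    """P_H = c_1 * f * n, in watts."""
    return c1 * f * n


# Device values from the Section III example (1.2 um CMOS, Vdd = 5 V, M = 4).
C1 = c1_energy(beta=35e-6, vdd=5.0, vt=0.75, tau_n=24e-12, cg=2.4e-15, m=4)

# Hypothetical tree: N = 4 levels, D = 0.2 cm, D_inv = 12.4 um, f = 100 MHz.
n = inverter_count(n_levels=4, d=0.2, d_inv=12.4e-4)
P_H = h_tree_power(C1, 100e6, n)

print(f"c1 = {C1:.3e} J, n = {n:.0f} inverters, P_H = {P_H * 1e3:.1f} mW")
```

With these numbers the switching term $C_{g} V_{dd}^{2}$ is roughly twenty times larger than the short-circuit term, so the switching energy dominates $c_{1}$ in this example.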

Besides the power consumption in the H-tree, the array of processors synchronized by the clock also consumes power. Each processor dissipates a power $P_{r}$ such that:

$$
P_{r}=c_{2} \times f
$$

where $\mathrm{c}_{2}$ is a constant reflecting the number of gates that switch at the frequency $f$ in the processor. Similarly to the constant $\mathrm{c}_{1}$, the constant $\mathrm{c}_{2}$ is a sum of terms related to switching and short-circuit powers.

Consequently, the power dissipated per unit area in the system (clock tree and processor array), implemented in a square silicon area of side $2 \times D$, is:

$$
P_{d}=\frac{P_{H}+2^{N} \times P_{r}}{4 \times D^{2}}
$$

The state-of-the-art in heat removal from integrated circuits puts a limit on the power density such that $[50,62,65]$ :

$$
\begin{equation*}
P_{d} \leq P_{d}(\max )=10 \frac{\mathrm{~W}}{\mathrm{~cm}^{2}} \tag{4.1}
\end{equation*}
$$

Two other constraints must be taken into account: the system tolerance to clock skew and the intrinsic limitation on the tree depth (i.e., the number of levels).

In order to avoid a synchronization problem in a synchronous system, the clock skew is generally kept under a fraction $p$ of the clock period $T$, as a rule of thumb. With a logic-based H-tree subject to a gradient of process variations, this constraint can be expressed as [5]:

$$
\begin{equation*}
\text{skew} = 2 \times t \times M \times \tau_{N} \times \lambda \times \frac{D^{2}}{D_{inv}} \leq p T \tag{4.2}
\end{equation*}
$$

where $t$ is a constant related to the topology of the clock distribution network and $\lambda$ the percentage of variation in the transistor time constant per unit distance along the silicon wafer. An immediate consequence of this constraint is an upper bound on the size $D$ of the H-tree, such that:

$$
\begin{equation*}
D_{\max }=\left(\frac{c_{3}}{f}\right)^{\frac{1}{2}} \tag{4.3}
\end{equation*}
$$

where $c_{3}$ is a constant related to the system tolerance such that:

$$
c_{3}=\frac{p \times D_{i n v}}{2 \times t \times M \times \tau_{N} \times \lambda}
$$
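
The step from Eq. 4.2 to Eq. 4.3 can be spelled out as follows, using only $T = 1/f$ and the symbols already defined:

$$
2 \times t \times M \times \tau_{N} \times \lambda \times \frac{D^{2}}{D_{inv}} \leq \frac{p}{f}
\;\Longleftrightarrow\;
D^{2} \leq \frac{p \times D_{inv}}{2 \times t \times M \times \tau_{N} \times \lambda} \times \frac{1}{f} = \frac{c_{3}}{f}
\;\Longleftrightarrow\;
D \leq \left(\frac{c_{3}}{f}\right)^{\frac{1}{2}} = D_{\max}
$$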

The skew expression in Eq. 4.2 is the result of an analytical approach [5, section III-E] that considers the H-tree depth as independent of the H-tree size $D$ [61, appendix A]. This assumption would hold if the processors at the H-tree leaves had zero physical dimension. In practice, a processor occupies some space. Without loss of generality, let us assume that a processor has a square shape of side $D_{p}$ and that an H-tree leaf distributes the clock signal to the center of this processor (Fig. 4.3).

Hence, the H-tree depth can theoretically increase until the distance between two neighboring processors is equal to zero. Based on Fig. 4.3, this constraint can be expressed as:

$$
\begin{equation*}
l(N) \geq \frac{D_{p}}{2} \tag{4.4}
\end{equation*}
$$

where $l(N)$ is the length of a branch at level $N$ of the H -tree.


Fig. 4.3 Effect of processor size on H-tree depth

Based on Eq. 2 in [5], one can prove that the previous constraint sets an upper bound $\mathrm{N}_{\text {sup }}$ on N , such that:

$$
\begin{equation*}
N_{\text {sup }}=2 \times \log _{2}\left(2 \times \frac{D}{D_{p}}\right) \tag{4.5}
\end{equation*}
$$
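
A minimal Python sketch of Eq. 4.5 is given below. The function names are hypothetical, and rounding $N_{sup}$ down to an integer number of levels is an assumption made here for illustration. The usage example uses $D = 0.4$ cm and $D_{p} = 0.1$ cm, chosen so that the chip of side $2D$ tiles exactly into processors.

```python
import math


def n_sup(d, d_p):
    """Upper bound on H-tree depth, Eq. 4.5: N_sup = 2 * log2(2 * D / D_p)."""
    return 2.0 * math.log2(2.0 * d / d_p)


def max_processors(d, d_p):
    """Number of leaves of the deepest tree that fits: 2^floor(N_sup)."""
    return 2 ** math.floor(n_sup(d, d_p))


# Chip side 2*D = 0.8 cm, processor side 0.1 cm: an 8 x 8 array of processors.
depth = n_sup(0.4, 0.1)
leaves = max_processors(0.4, 0.1)
print(depth, leaves)
```

In this example the bound evaluates to six levels and 64 leaves, which matches the $8 \times 8$ tiling of the chip area by processors.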

## III. Consequences on Global Performance

Let us now apply simultaneously the constraints of power density (Eq. 4.1), of system tolerance to skew (Eq. 4.2) and of H-tree depth (Eq. 4.4). By replacing the value of $\mathrm{D}_{\max }$ (Eq. 4.3) in Eq. 4.5 and by developing the inequality in Eq. 4.1, an upper bound on clock frequency is set, such that:

$$
\begin{equation*}
f \leq \frac{P_{d}(\max )}{\frac{3}{2} \times \frac{c_{1}}{D_{p} \times D_{i n v}}+\frac{c_{2}}{D_{p}^{2}}}=f_{\max } \tag{4.6}
\end{equation*}
$$

(note here the use of the mathematical relationship $2^{\log _{2}(a)}=a$ ).
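
For readers who want the intermediate algebra, one way to reach Eq. 4.6 is sketched below. It assumes that the tree is built to its maximal depth, $N = N_{sup}$, so that Eq. 4.5 gives $2^{N/2} = 2D/D_{p}$ and $2^{N} = 4D^{2}/D_{p}^{2}$:

$$
\begin{aligned}
n &= 3 \times \frac{2D}{D_{p}} \times \frac{D}{D_{inv}} = \frac{6 D^{2}}{D_{p} \times D_{inv}} \\
P_{d} &= \frac{c_{1} \times f \times n + 2^{N} \times c_{2} \times f}{4 \times D^{2}}
      = f \times \left(\frac{3}{2} \times \frac{c_{1}}{D_{p} \times D_{inv}} + \frac{c_{2}}{D_{p}^{2}}\right) \leq P_{d}(\max)
\end{aligned}
$$

The size $D$ cancels in the second line, so the resulting bound on $f$ depends only on $c_{1}$, $c_{2}$, $D_{p}$ and $D_{inv}$.
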
To provide some numbers, let us consider the following example:

- a minimum-sized NMOS transistor implemented in a $1.2 \mu \mathrm{~m}$ CMOS technology has the following characteristics: $\beta=35 \mu \mathrm{~A} / \mathrm{V}^{2} ; V_{T}=0.75 \mathrm{~V} ; \tau_{N}=24 \mathrm{ps} ; C_{g}=2.4 \mathrm{fF}$.
- the basic inverter used to build the H -tree is such that: $\mathrm{M}=4$ and $D_{i n v}=12.4 \mu \mathrm{~m}$.
- the processor constant $c_{2}$ is calculated such that each square of $20 \mu m \times 20 \mu m$ contains a gate, with an input capacitance $M \times C_{g}$ and a switching time $M \times \tau_{N}$, operating at the frequency of the clock tree. The switching activity of these gates depends on the processor that is considered. A switching activity of $100 \%$ corresponds to the simple case where the power density is uniform with that of the clock tree. A less optimistic case would be a switching activity of $10 \%$.
- processor size $D_{p}=0.1 \mathrm{~cm}$.
- power supply $V_{d d}$ of 5 V .

Table 4.1 summarizes the performance parameters $f_{\max }, D_{\max }$ and $N_{s u p}$ as a function of the processor switching activity. First, $f_{\max }$ is calculated using Eq. 4.6. To obtain $D_{\max }$, $f_{\max }$ is substituted in Eq. 4.3. Finally, $D_{\max }$ is used to calculate $N_{\text {sup }}$ (Eq. 4.5).

With a switching activity of $100 \%$, an array of 32 processors ($N$ set to 5) can operate at a clock frequency of around 150 MHz. For a switching activity of $10 \%$, an array of 256 processors ($N$ set to 8) can run at a clock frequency greater than 1 GHz.

Table 4.1: Performance vs. processor switching activity

| Performance | $100 \%$ <br> activity | $10 \%$ <br> activity |
| :---: | :---: | :---: |
| $f_{\max }$ | 151 MHz | 1007 MHz |
| $D_{\max }$ | 0.37 cm | 0.96 cm |
| $N_{\text {sup }}$ | 5.83 | 8.56 |
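
The $f_{\max}$ row can be approximately reproduced with the short Python sketch below. The way $c_{2}$ is built here (one gate per $20 \mu \mathrm{m} \times 20 \mu \mathrm{m}$ square of the processor, each gate costing the same energy per cycle as a tree inverter, scaled by the switching activity) is one reading of the example and should be treated as an assumption; the function and variable names are illustrative.

```python
def c1_energy(beta, vdd, vt, tau_n, cg, m):
    """Energy per cycle of one minimum-sized inverter (J), i.e. c_1."""
    short_circuit = beta / 12.0 * (vdd - 2.0 * vt) ** 3 * tau_n
    switching = cg * vdd ** 2
    return (short_circuit + switching) * m


def f_max(c1, c2, d_p, d_inv, pd_max=10.0):
    """Eq. 4.6. Lengths in cm, pd_max in W/cm^2, result in Hz."""
    return pd_max / (1.5 * c1 / (d_p * d_inv) + c2 / d_p ** 2)


D_P = 0.1           # processor side, cm
D_INV = 12.4e-4     # inverter size, cm (12.4 um)
GATE_PITCH = 20e-4  # one gate per 20 um x 20 um square, cm

c1 = c1_energy(beta=35e-6, vdd=5.0, vt=0.75, tau_n=24e-12, cg=2.4e-15, m=4)
gates_per_processor = (D_P / GATE_PITCH) ** 2  # about 2500 gates

# Assumption: c2 = activity * (number of gates) * c1
f_100 = f_max(c1, 1.0 * gates_per_processor * c1, D_P, D_INV)
f_10 = f_max(c1, 0.1 * gates_per_processor * c1, D_P, D_INV)

print(f"100% activity: {f_100 / 1e6:.0f} MHz")
print(f" 10% activity: {f_10 / 1e6:.0f} MHz")
```

Under these assumptions the $100 \%$ case evaluates to about 151 MHz, in line with the table, and the $10 \%$ case evaluates to slightly above 1 GHz, consistent with the statement in the text.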

## IV. Dominance of Constraints

An interesting comparison can be made between the result developed in [5, section III-E] and the results reported here. For this purpose, let us analyze separately the constraints of skew and power as a function of the H-tree size. In this analysis, a switching activity of $100 \%$ is adopted for the processors. Also, in order to reduce the problem to two constraints instead of three, as many processors as Eq. 4.5 allows are implemented. Under these conditions, the power and skew constraints set two, a priori different, maximal clock frequencies $f_{P}^{\max}$ and $f_{skew}^{\max}$:

$$
\begin{gather*}
f_{P}^{\max}=\frac{P_{d}(\max)}{\frac{3}{2} \times \frac{c_{1}}{D_{p} \times D_{inv}}+\frac{c_{2}}{D_{p}^{2}}}  \tag{4.7}\\
f_{skew}^{\max}=\frac{p \times D_{inv}}{2 \times t \times M \times \tau_{N} \times \lambda} \times \frac{1}{D^{2}} \tag{4.8}
\end{gather*}
$$

As was concluded in [5], $f_{skew}^{\max}$ decreases quadratically with the H-tree size $D$. However, $f_{P}^{\max}$ is independent of $D$. This can be explained by the fact that the system area is completely occupied by processors ($N=N_{sup}$) and also because the power constraint is based on the power density instead of the power itself. Moreover, $f_{P}^{\max}$ is identical to the maximal clock frequency (Eq. 4.6) that resulted from the simultaneous application of power and skew constraints. Indeed, the terms related to the skew constraint (Eq. 4.2) do not appear in Eq. 4.6. A plot of $f_{P}^{\max}$ and $f_{skew}^{\max}$ (Fig. 4.4) helps to understand this result. In order to operate properly, the system must satisfy both constraints, i.e., the clock frequency $f$ must be such that:

$$
f \leq \min \left(f_{P}^{\max }, f_{s k e w}^{\max }\right)
$$

Based on Fig. 4.4, one can conclude that the maximal clock frequency is determined by the power constraint for $D<0.37 \mathrm{~cm}$ and by the skew constraint for $D>0.37 \mathrm{~cm}$. At intersection point $I$, the two constraints lead to identical maximal clock frequencies. This corresponds to a case where the maximal clock frequency can be represented with either $f_{P}^{\max }$ or $f_{\text {skew }}^{\text {max }}$.

In summary, the system is governed by two modes of operation:
i) $D<0.37 \mathrm{~cm}$ : a maximal clock frequency equal to $f_{P}^{\max }$, i.e., 151 MHz .
ii) $D>0.37 \mathrm{~cm}$ : a maximal clock frequency equal to $f_{\text {skew }}^{m a x}$, which will decrease when the H -tree size D increases.

In particular, for $D=0.5 \mathrm{~cm}$, the skew constraint determines the maximal clock frequency and the numerical result obtained in [5, section III-E] is not affected by the power constraint.
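
The two modes of operation can be sketched in a few lines of Python. Since the values of $p$, $t$ and $\lambda$ are not listed in this section, the constant $c_{3}$ is back-calculated here from the reported intersection point ($D = 0.37$ cm, $f = 151$ MHz) using $c_{3} = f \times D^{2}$ (Eq. 4.3); this calibration and all names below are assumptions made for illustration.

```python
F_P_MAX = 151e6  # Hz, power-limited frequency for 100% activity (Table 4.1)
D_CROSS = 0.37   # cm, size at the intersection point I

# Assumption: c3 is calibrated from the intersection point, c3 = f * D^2.
C3 = F_P_MAX * D_CROSS ** 2


def f_skew_max(d, c3=C3):
    """Eq. 4.8: skew-limited frequency, decreasing as 1 / D^2."""
    return c3 / d ** 2


def max_clock(d, f_p=F_P_MAX, c3=C3):
    """The system must satisfy both constraints: f <= min(f_P, f_skew)."""
    return min(f_p, f_skew_max(d, c3))


def dominant_constraint(d, f_p=F_P_MAX, c3=C3):
    """Name of the constraint that sets the maximal clock frequency."""
    return "power" if f_p <= f_skew_max(d, c3) else "skew"


for d in (0.2, 0.37, 0.5):
    print(d, dominant_constraint(d), f"{max_clock(d) / 1e6:.1f} MHz")
```

For a size below the intersection the sketch returns the constant power-limited frequency, while for $D = 0.5$ cm it returns a lower, skew-limited value.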

If a power budget was allocated to the system instead of being constrained by a power density limit, the maximal clock frequency due to this constraint would be a function of $D^{2}$, since the power would not be divided by the system area when setting the power constraint. Indeed, it is as if $f_{P}^{\max }$ (Eq. 4.7) was multiplied by the term $D^{2}$. However, $f_{s k e w}^{\max }$ (Eq. 4.8) remains a function of $\frac{1}{D^{2}}$. These two functions will put the system in two operation modes similar to those in Fig. 4.4, except that when the power constraint dominates, it features a maximal clock frequency that increases with the system size (while it was constant in the case of a power density constraint). Also, the intersection point I is located at a value of D which is the square root of that obtained with a power density constraint. Therefore, the power constraint will cease to be dominant at values of system size that are smaller.


Fig. 4.4. Domains of dominance of power and skew constraints

## V. Conclusion

This paper has explored simultaneously the effects of skew and power constraints on the performance of logic-based H-trees. Depending on H-tree size, the system features two modes of operation. It was shown that power is the dominant constraint for small systems, while clock skew is the dominant constraint for large systems.

## Chapter 5.

## Sources of Spatial Variations in the MOS Transistor Time Constant

The need to characterize process variations and their effects on timing parameters has been noted in the literature, particularly for large clock distribution systems operating at high speed [73]. To our knowledge, the only published work in this direction is due to Pavasovic et al. [11] and Andreou et al. [12]. These authors addressed the spatial variability of the drain current (which is related to the transistor time constant), for both above-threshold and subthreshold operation. They reported experimental measurements made on large transistor arrays with device sizes typical of digital or analog VLSI systems. As a result, they observed three different position-dependent behaviors (which they named "edge", "striation" and "gradient" effects) and a random component. The "edge" effect shows up as an abrupt decrease or increase in transistor current at the edges of the arrays. The "striation" effect appears as a position-dependent variation in transistor current that follows a quasi-sinusoidal oscillation of low spatial frequency. The "gradient" effect appears as a position-dependent spatial variation of much lower spatial frequency. Random variations attributed to "white noise" were also observed. Although their study shed new light on the "fine structure" behind transistor parameter mismatch, the parameter used for this exploration (the drain current) does not directly characterize timing performance. More recently, Gneiting and Jalowiecki [13] characterized the variations of timing parameters. They measured oscillation periods of ring oscillators implemented in different dies, in a WSI context, operating at frequencies from 60 to 90 MHz. In that work, the spread of the periods is combined with other extracted transistor parameters in order to predict the spread in delay and clock skew in typical WSI clock distribution networks. However, no spatial correlation was established. Moreover, on the basis of their measurements, the authors of [13] assumed that variations across a wafer are negligible. In light of the phenomena we observed, whose results are reported in the paper that follows, this assumption appears highly questionable, not only at the scale of the silicon wafer but also at the die level.

This chapter presents a large-scale spatial characterization of the variations of a fabrication process through a parameter directly involved in timing issues: the time constant of MOS transistors. Variations of this parameter over large silicon areas have direct consequences on the synchronization and timing performance of high-speed integrated systems. This characterization is carried out by measuring the oscillation period of CMOS ring oscillators operating at 500 MHz. These oscillators are implemented at different positions on different dies, in different silicon wafers.

This chapter is organized as follows. Section II describes the experimental device that was characterized. Section III presents an interpretation of the different spatial characterization curves describing the variations of the transistor time constant within a die, between dies, and between silicon wafers. Section IV provides a spectral analysis of the clock signal and studies the spatial correlations of different pairs of ring oscillators.

# Spatial Characterization of Process Variations via MOS Transistor Time Constants in VLSI \& WSI 

Mohamed Nekili, Yvon Savaria \& Guy Bois<br>Ecole Polytechnique of Montreal<br>Department of Electrical Engineering and Computer Science<br>VLSI Laboratory<br>Telephone: 1-(514) 340-4737; E-mail: nekili@vlsi.polymtl.ca


#### Abstract

This paper is the first large-scale experimental characterization of spatial process variations for a parameter that is directly involved in timing issues: the MOS transistor time constant. This is achieved by measuring the oscillation period of high-speed (500 MHz) CMOS ring oscillators that are implemented at different locations on individual dies and over wafers. Novel phenomena are observed, improving our understanding of how process variations affect the performance of synchronous systems, particularly in clock distribution networks. We observed four components contributing to period variations: an environment-dependent component, a process-dependent component of lower spatial frequency, a random component analogous to white noise and a component depending on the geometry of the power-supply distribution network.


## I. Introduction

The limitations due to fluctuations in process parameters have been anticipated at least since the 1970s [71], and to compensate for transistor mismatch, the design of high-precision analog circuits has traditionally followed a set of rules based on empirical knowledge acquired from experience [19]. With MOS technology scaling, transistor mismatch has gained much more importance [6], particularly because process tolerances do not necessarily scale in proportion to geometries [72]. Indeed, miniaturization has been pursued very aggressively, and it is not clear that circuit tolerances have progressed accordingly. Moreover, it has been shown [3,5] that the effects of process variations limit the accuracy of synchronization mechanisms in large integrated systems, due to the quadratic growth of clock skew with respect to system size.

The need for a characterization of process variations and their effects on timing parameters has been noted in the literature, especially in relation to large high-speed synchronous integrated systems [73]. To our knowledge, the only published work in this direction is due to Pavasovic et al. [11] and Andreou et al. [12]. These authors have recently addressed the spatial variability of the drain current (related to the time constant of the transistor channel), for both subthreshold [11] and above-threshold [12] operation. They reported experimental measurements from large transistor arrays with device sizes typical of digital and analog VLSI systems. As a result, three different position-dependent behaviors (namely, "edge", "striation" and "gradient" effects) and a random variation were observed. The "edge" effect manifested itself as a drastic decrease or increase of transistor current at the edges of transistor arrays. The "striation" effect appeared as a position-dependent variation in transistor current following a quasi-sinusoidal oscillation of low spatial frequency. The "gradient" effect was observed as a position-dependent spatial variation, but of much lower frequency. Random variations attributed to "white noise" were also observed. Although their study has provided new insight into the "fine structure" behind transistor mismatch, the parameter used for such an exploration (the transistor drain current) does not directly characterize timing performance. More recently, Gneiting and Jalowiecki [13] characterized process variations of timing parameters. They measured time periods of ring oscillators implemented in different dies in a WSI context and running at frequencies ranging from 60 to 90 MHz. In that work, the spread of the time period is combined with other extracted transistor parameters in order to predict the spread in delay and skew of typical WSI clock distribution networks. However, no spatial correlation was established. Moreover, based on their measurements, the authors of [13] assumed that on-wafer variations are negligible. Due to new phenomena that we observed, and in light of results reported in this paper, this assumption appears very questionable, not only at wafer scale, but also at die level.

The present paper is the first large-scale characterization of spatial process variations in a parameter that is directly involved in timing issues: the MOS transistor time constant. Variations of this parameter over large areas of silicon have direct consequences on the synchronization and timing performances of high-speed integrated systems. This characterization is achieved by measuring the oscillation period of high-speed ( 500 MHz ) CMOS ring oscillators that are implemented at different locations on individual dies and on wafers.

This paper is organized as follows. Section II describes the experimental device that was characterized. Section III presents an interpretation of the different spatial characterization curves describing the process variations on-die, from die to die and from wafer to wafer. A primary application of this work is the design of clock distribution networks. However, the reported findings could also influence analog design methods and timing analysis in digital systems. Section IV provides a spectral analysis of the clock signal and studies spatial correlations of different pairs of ring oscillators. Finally, our findings and conclusions are summarized in Section V.

## II. Circuit Design

The spatial characterization presented in this paper is based on frequency measurements of a set of identical oscillators implemented at different locations on individual dies, as well as on dies distributed over a wafer. Since the sensitivity of synchronous integrated systems to process variations is expected to increase with the clock frequency, the oscillator is a chain of 3 minimum-sized inverters (Fig. 5.1). Minimizing the number of stages gives the maximum frequency produced by this type of oscillator, and using minimum-sized inverters makes parameter fluctuations linked to geometry variations substantial and more likely to be detectable in measurements.

The direct relationship between the measured parameter (oscillator clock period $T_{osc}$) and the transistor time constant $T_{r}$ is clearly illustrated by Shoji in [74]. For an $N$-stage ring oscillator, Shoji expresses the oscillator period as:

$$
T_{o s c}=2 \times N \times T_{r}
$$

where $T_{r}$ is the ratio between the load capacitance of any inverter in the chain and the transconductance. In the remainder of this paper, the term period will refer to the oscillator's clock period $\mathrm{T}_{\text {osc }}$ and N is fixed to 3 . Also, depending on the context, the oscillator is sometimes called a cell.
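
As a small illustration of this relation, the Python sketch below converts between the oscillator period and the time constant $T_{r}$ for the 3-stage oscillator. The function names are hypothetical, and the 2000 ps period used in the example is simply the period of a 500 MHz oscillation.

```python
def oscillator_period(n_stages, t_r):
    """Shoji's relation: T_osc = 2 * N * T_r (same time unit as t_r)."""
    return 2 * n_stages * t_r


def time_constant_from_period(t_osc, n_stages=3):
    """Invert the relation to recover T_r from a measured period."""
    return t_osc / (2 * n_stages)


def frequency_mhz(t_osc_ps):
    """Oscillation frequency in MHz for a period given in picoseconds."""
    return 1e6 / t_osc_ps


t_osc_ps = 2000.0  # period of a 500 MHz oscillation, in ps
t_r_ps = time_constant_from_period(t_osc_ps)
print(f"f = {frequency_mhz(t_osc_ps):.0f} MHz, T_r = {t_r_ps:.1f} ps")
```

Because the relation is linear, a relative variation of the measured period maps directly onto the same relative variation of $T_{r}$.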

In order to drive a probe of $50 \Omega$ impedance while maintaining a sufficient peak-to-peak voltage, a $38.4 \mu \mathrm{~m}$ wide PMOS transistor is inserted between the oscillator and the probe (Fig. 5.2). Using HSPICE [75], an electrical simulation of the layout in Fig. 5.2, with the parameters of a typical process for the $1.2 \mu \mathrm{~m}$ Nortel CMOS technology [59], showed that the probe should measure a signal of approximately 200 mV peak-to-peak amplitude with a frequency of 520 MHz (Fig. 5.3). To measure such a high frequency at the oscillator's output, a microwave 3-pin probe [76] is used (Fig. 5.4). In Fig. 5.4, the upper and lower metal buses are the distribution buses for power and ground respectively, while the left and right pads are tied to ground in order to obtain a controlled impedance and to minimize antenna effects. The oscillator's output is tied to the central pad (signal).

To characterize the effect of process variations along the two dimensions, a test chip in the $1.2 \mu \mathrm{~m}$ Nortel CMOS technology was submitted to the Canadian Microelectronics Corporation (CMC). The test chip is composed of 4 separate segments of 20 oscillation cells each (Fig. 5.5). In order to achieve the best spatial resolution in our characterization, the oscillation cells are kept at a minimal distance, which is determined by the geometry of the probe and the minimal distance required between passivation windows. This test chip (Fig. 5.5) is aligned along the borders of a die in a multi-project prototyping environment (Fig. 5.7). Each segment has independent power \& ground buses to supply the oscillation cells from Vdd \& Gnd pads (Fig. 5.6).

This configuration has been adopted in order to minimize the partitioning of the die, whose core was used to implement the chips of other users, as can be seen in Fig. 5.7. Instead of this "O" configuration, " H " or " C " structures could have been implemented as well. The die is 1.44 cm by 1.44 cm and the wafer contains 55 dies (Fig. 5.8).

The equipment used to perform the experimental measurements included (Fig. 5.9):

- an HP54120B Digitizing Oscilloscope Mainframe with an HP54122A DC-to-12.4GHz Four Channel Test Set.
- a MicroZoom microscope (Bausch \& Lomb).
- Microwave probes [76] for signal measurements.
- and DC probes for power \& ground supply.

Experiments were first performed on 10 different dies from an initial wafer returned by the CMC (Canadian Microelectronics Corporation), and then 2 other wafers were requested in order to perform characterizations on a WSI scale. The frequency measurements were repeatable within 10 ps of the clock period. Also, a jitter between 60 ps and 190 ps was observed on the probed signal.

## III. Discussion of Experimental Results

Early experimental results showed a good match with HSPICE simulations. Indeed, Fig. 5.10 shows the measured output signals of two typical oscillation cells, with peak-to-peak voltages of 163 mV and 190 mV, and respective clock frequencies of 500 MHz and 475 MHz.

At the die level, Fig. 5.11 shows a characterization of the clock period for oscillators implemented on the four sides of three different dies. All data analysis and plotting in this section were done with MATLAB [77]. Dies a) and b) were chosen randomly from a collection of 10 individual loose dies, probably belonging to a first wafer (the original placement of dies is not traceable after scribing the wafer). Die c) was located in a typical region of a second wafer (non-scribed). Depending on the location of the oscillator within the die, the oscillation period varies between 1945 ps and 2670 ps, which represents a measured variation of $37 \%$ between minimum and maximum values. This is a substantial variation that could seriously affect the synchronization of a high-performance chip if it were experienced by buffers in a clock tree. These measurements confirm the validity of some assumptions made in the literature $[78,5]$ about the magnitude of process variations on a WSI scale. Visually, when comparing the shapes of the four die sides in the three dies, no particular constant property seems to emerge a priori. This remark can be reinforced by calculating the correlation coefficient$^{1}$ [79] between the period distributions of two pairs of dies: die a) with die b), and die a) with die c). In the calculation of the correlation coefficient, all period measurements along the four sides of a die are considered as a vector of 80 elements. As a result, the period distributions of the three dies seem to be quasi-uncorrelated. Indeed, we obtained correlation coefficient values of 0.01 and 0.04 respectively for the above-mentioned pairs of dies. Given the apparently random nature of the period distribution, the following explores some of its statistical properties. For this purpose, in each die, the measured periods of the 80 oscillation cells are considered as observations of a discrete random variable. Table 5.1 summarizes the mean, $\mu$, the standard deviation, $\sigma$, as well as the variation coefficient, $\frac{\sigma}{\mu}$, of the period distributions in dies a), b) and c).
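The statistics used in this paragraph (mean, standard deviation, variation coefficient and the correlation coefficient of footnote 1) can be sketched in a few lines. This is only an illustrative sketch: the analysis in this work was done in MATLAB, the function names are hypothetical, the short vectors `die_p` and `die_q` are made-up values rather than measurements, and the standard deviation is assumed to be the population form (division by $N$).

```python
import math


def mean(values):
    """Arithmetic mean of a sequence of period measurements (ps)."""
    return sum(values) / len(values)


def std(values):
    """Population standard deviation (divides by N; an assumption)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def variation_coefficient(values):
    """sigma / mu, returned as a fraction (multiply by 100 for percent)."""
    return std(values) / mean(values)


def correlation(x, y):
    """Correlation coefficient of two equal-length observation vectors."""
    if len(x) != len(y):
        raise ValueError("both vectors need the same number of observations")
    mx, my = mean(x), mean(y)
    covariance = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return covariance / (std(x) * std(y))


# Hypothetical period vectors (ps); a real die would contribute 80 values.
die_p = [2200.0, 2350.0, 2100.0, 2280.0, 2150.0, 2400.0]
die_q = [2050.0, 2010.0, 2090.0, 2030.0, 2120.0, 2000.0]

print(f"sigma/mu of die_p: {100 * variation_coefficient(die_p):.1f} %")
print(f"correlation(die_p, die_q): {correlation(die_p, die_q):+.2f}")
```

With real data, each die would be passed as its full 80-element vector of periods measured along the four sides.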

Table 5.1 shows that, from one die to another, the magnitude of period variations differs by a factor of 2.6. Some of these distributions (die b) of Fig. 5.11) appear to be almost Gaussian (see histogram in Fig. 5.12), which raises the possibility that period variations result from the summation of a large number of random phenomena. Also, maximum and minimum values are not equidistant from the mean value (which seems to be closer to the maximum value in most dies). In fact, the ratios between the distance from the mean value to the maximum and the distance from the mean value to the minimum are respectively 2.12, 2.5, and 1.06 for dies a), b) and c) of Fig. 5.11.

1. Given two random and discrete distributions $X$ and $Y$ (vectors consisting of $N$ observations each), the correlation coefficient is defined as $\rho_{X Y}=\frac{\frac{1}{N} \sum_{i=1}^{N}(X(i)-\mu(X))(Y(i)-\mu(Y))}{\sigma(X) \, \sigma(Y)}$, where $\mu$ and $\sigma$ are respectively the mean and standard deviation operators. When $\rho_{X Y}=1$, $X$ and $Y$ are fully correlated, and when $\rho_{X Y}=0$, $X$ and $Y$ are uncorrelated.

Table 5.1: Means \& standard deviations of period distributions at the die level

| Period distribution of | $\mu$ (ps) | $\sigma$ (ps) | $\frac{\sigma}{\mu}$ (%) |
| :---: | :---: | :---: | :---: |
| Die a) of Fig. 5.11 | 2233 | 132 | 6 |
| Die b) of Fig. 5.11 | 2220 | 95 | 4.3 |
| Die c) of Fig. 5.11 | 2053 | 48 | 2.3 |

Observations of the clock period at a wafer scale revealed local and global effects in addition to "white noise". The nature of surrounding devices (environmental component) seems to profoundly shape the time constant of active devices, and large-scale variations show pronounced "edge" effects capable of drastically reducing the worst-case performance of chips. Also, the geometry of the power-supply distribution network plays a significant role.

## III.1 The environmental component

One of the ideas behind implementing oscillators along the sides of the die was to allow an interpolation of process variations for regions at the core of the die, based on measured values at die sides. The four sides were chosen because they provide a linear coverage of the two plane dimensions, X and Y. The hope was to be able to interpolate a surface representation of the plane based on linear measurements. Our experimental results show that this is unfortunately not possible, based on two main observations:
i. A pattern of period variations seems to repeat itself throughout the wafer. Fig. 5.13 shows the variations of clock period along 5 adjacent dies in the X \& Y directions, for two different wafers. Measured periods of 20 cells per die (one side) are represented in both directions. One can clearly notice the "periodicity" of the clock period values with position in both directions. An intuitive a priori explanation of such a periodicity is proposed in the following. If we recall that inside the die, several unrelated designs of other users are implemented in a multi-project fabrication run (Fig. 5.7), it becomes clear that the electrical environment along a given die side varies when we move from end to end. However, since oscillation cells located at the same position in all dies have identical electrical environments (dies are replicated over the wafer), moving along an orthogonal direction, such as the X-axis or Y-axis, over more than one die will lead to the same environment every 20 cells (the total number of cells implemented along one die side). Indeed, the clock period behaves as if it were modulated or shaped by the electrical devices in the immediate environment of the oscillation cells along the die side. However, it tends to return to approximately the same value every 20 positions, and the sudden increases and decreases happen at approximately the same positions within the period over the 5 die sides. The presence of a pattern is confirmed by the values of the correlation coefficients between the 5 spatial periods (consecutive groups of 20 observations) taken two by two in the X direction. We obtain correlation values as high as 0.61 (first and second periods), 0.76 (second and third periods), 0.77 (third and fourth periods), and 0.73 (fourth and fifth periods). Consequently, interpolating the internal regions from the die borders, which might have different electrical environments, seems to be unfeasible in the presence of an environmental component of this nature and magnitude.
ii. Important variations occur even for oscillators that are immediate neighbors. This phenomenon is shown in Fig. 5.14, where the clock periods of oscillators implemented along adjacent sides of two neighboring dies across scribe lines$^{2}$ are compared. The magnitude of period variations, when moving across the scribe line, exceeds $20 \%$ of the average period. This can again be explained by the difference in environment of any pair of neighbors. It shows how sudden the spatial process variation can be, even over short distances ($700 \mu \mathrm{~m}$). Note that the distance between two neighbors along the same die side is comparable to that between two neighbors (each belonging to a different die) across the scribe line. Calculation of correlation coefficients shows in fact that immediate neighbors across scribe lines are not as highly correlated (as low as 0.55 for horizontal sides). This observation reinforces the conclusion that it is unfeasible to interpolate the amplitude of process variations from one region of a die to another.
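
The test for a repeating pattern in observation i can be sketched as follows: the series of 100 periods is cut into 5 consecutive groups of 20 observations, and the correlation coefficient is computed between adjacent groups. This is a minimal sketch with hypothetical function names; the series `clean` and `noisy` are synthetic (a sine-shaped pattern with optional Gaussian noise), not the measured data of Fig. 5.13.

```python
import math
import random


def pearson(x, y):
    """Correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def adjacent_period_correlations(periods, cells_per_die=20):
    """Correlate each group of `cells_per_die` observations with the next one."""
    if len(periods) % cells_per_die:
        raise ValueError("expected a whole number of die sides")
    groups = [periods[i:i + cells_per_die]
              for i in range(0, len(periods), cells_per_die)]
    return [pearson(groups[k], groups[k + 1]) for k in range(len(groups) - 1)]


# Synthetic example: the same 20-cell pattern repeated over 5 die sides.
random.seed(0)
pattern = [2200 + 150 * math.sin(2 * math.pi * i / 20) for i in range(20)]
clean = pattern * 5
noisy = [p + random.gauss(0, 40) for p in clean]

print([round(c, 2) for c in adjacent_period_correlations(noisy)])
```

A perfectly repeated pattern yields coefficients of 1 between all adjacent groups; noise and large-scale drift pull the values down, which is the regime of the 0.61 to 0.77 values reported above.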

With current design strategies, the presence of the environmental variation described above seems to be inevitable. Indeed, it is intrinsic to the fact that a clock distribution network, for instance, is always surrounded by the logic it feeds. The clock period "periodicity" mentioned earlier leaves little doubt about the direct influence of environment variability on the period of the oscillation cells. This variability is due to the fact that inside the square ring (Fig. 5.5) of oscillation cells (delimiting the die), designs of other users are implemented (Fig. 5.7). Consequently, each cell faces a region with different electrical characteristics. Depending on which electronic devices the designer implements in the surroundings of a clock distribution network, some regions of the tree will be faster than others. Note that even though a chip designer knows the nature and position of the devices he implements, the effect of these devices on the neighboring active logic of the clock distribution network, when fabricated under a real manufacturing process, remains unknown. In this paper, this effect is called the "environment" phenomenon. Therefore, in the absence of such knowledge, considering the transistor time constant (or other similar characteristics influencing the parametric performance of digital or analog integrated circuits) as a random variable (as we did in Table 5.1) seems realistic.

The "environment" effect can be understood as a generalization of the "edge" effect mentioned by Pavasovic et al. [11] and Andreou et al. [12]. This "edge" effect was observed as occurring only at the edges of a transistor array [11, 12]. According to these authors, a probable cause of this effect is strain-induced shifts in the characteristics of the devices. During the annealing processes, stress in the middle of the transistor array tends to propagate outward and accumulates at the periphery. Based on our findings, an alternative to the above explanation would again be the environmental variation. In fact, "edge" effects were observed in [11, 12] both at the inner borders of a cluster of arrays and at the outer borders of the arrays. In their experiment, transistors in an array have the same size, and they are directly abutted to each other. Moreover, transistors in different arrays have different sizes. Therefore, the amount of inter-space between transistors varies only at the borders between the arrays. Also, transistors at outer array borders face open areas, while transistors inside the array have other transistors acting as their environment. Note that such an effect was not observed inside the arrays [11, 12], again because of the uniformity of the transistor environment in those regions.

## III.2 The process component

The behavior of the clock period variations cannot be explained just by environment-related dependencies coupled with low-amplitude "white noise". Indeed, a second component can be observed in Fig. 5.13. This component varies according to the location of the oscillation cell on the wafer. It can be easily noticed, for example, from the varying amplitude of the maximal peak within a die side, along the X \& Y directions. This type of variation is probably due to large-scale non-uniformity in the process (variations in transistor threshold, transistor channel length and width over long distances).

This large-scale component is shown in Fig. 5.15 (for 2 different non-scribed wafers), where the same oscillation cell is considered in all dies in order to keep the environment constant. This cell, one$^{3}$ from the right side of the die in Fig. 5.7, is called the right cell. Periods of cells at the extreme left border of the wafer are also represented in Fig. 5.15. A representative of these cells is one$^{4}$ from the left side of the die in Fig. 5.7, and it is called the left cell. Each wafer contains 55 dies. The period of right cells in dies at the right border of the wafer (open areas) confirms again the importance of the environmental effect, because these cells have a different environment from that of the corresponding right cells in the dies of internal wafer regions (Fig. 5.15). In fact, one can visually notice (Fig. 5.8) that, for the former, only "light"$^{5}$ masks (virtual dies, almost open areas) are implemented on the right of the cells, instead of "solid" masks (physical dies) for the latter. This phenomenon resembles the "edge" effect observed by Pavasovic et al. [11] and Andreou et al. [12] on the outer edges of their arrays, in the sense that both are caused by the presence of open areas. In most cases, cells in regions of higher logic density seem to produce lower clock periods than the cells facing open areas (cells corresponding to numbers in bold characters in Fig. 5.16). Also, peaks in oscillator period can be reached in the middle region of the wafer's left border. That region corresponds to a location at which the wafer appears stressed, probably due to mechanical wafer manipulation. Colour changes are clearly noticeable on both left and right sides of the wafer. If no spatial criterion is established for considering dies for subsequent tests, a die implemented in such regions of the wafer might pass the test and substantially affect the worst-case performance of chip designs implemented in such a die. This observation is significant, since VLSI parts are usually designed to meet a target specification with worst-case process parameters. Our results indicate that the location of worst-case process parameters may be predictable.

3. We have chosen, without any particular criterion, the cell in the middle of the side.
4. Idem.
5. Three types of areas can be visually noticed in Fig. 5.8: i) "solid" die masks, ii) "light" die masks and iii) open areas. The "solid" masks are the physical masks of the designs which are implemented in the central part of the wafer. These are bordered by "light" die masks (no design implemented but only a light trace of the physical masks), which are themselves surrounded by an outer area made of silicon and absolutely open and empty. Ring oscillators implemented near the second and third types of areas do not seem to behave in a similar way.

Once again, as with our first look at the environmental component in Fig. 5.11, Fig. 5.15 reveals a priori no deterministic or constant pattern in the distribution of the period variation when the electrical environment is fixed by considering the same cell in different dies over the wafer. Moreover, the period distributions of right cells that have the same environment (thus excluding the wafer's right border, i.e., a subset of the Fig. 5.15 observations) seem to be quasi-uncorrelated (a correlation coefficient of 0.04 was obtained). Considering these period distributions as discrete random variables, Table 5.2 summarizes their means $\mu$, standard deviations $\sigma$, as well as their variation coefficients $\frac{\sigma}{\mu}$. A comparison between Table 5.1 and Table 5.2, on the basis of the parameter $\frac{\sigma}{\mu}$, shows that the period variation is in most cases smaller at wafer scale than at die scale. It is an indication of the relative importance of the process (or large-scale) component compared to the total period variations summarized in Table 5.1, when two-dimensional period distributions (in both X \& Y directions) are considered.

Table 5.2: Means \& standard deviations of period distributions at the wafer level

| Period distribution in | $\mu$ (ps) | $\sigma$ (ps) | $\frac{\sigma}{\mu}$ (%) |
| :---: | :---: | :---: | :---: |
| Wafer 1 | 2144 | 61 | 2.85 |
| Wafer 2 | 2162 | 37.5 | 1.73 |

## III.3 The power-supply component

This section shows how the geometry of the power supply distribution network affects the clock period at the output of ring oscillators.

A segment of ring oscillators like those in Fig. 5.5 can be modeled according to the schematic in Fig. 5.17. In our experiment, the power signals, Vdd and Gnd, were supplied via DC probes and pads located at the ends of the segment, and then distributed to the ring oscillators through parallel rails. As a consequence, each ring oscillator inherits a pair of local power supplies, noted (P$i$, G$i$), $i \in[1,20]$. The distance between two adjacent ring oscillators in our test design is approximately 0.7 mm. This experimental set-up introduces the possibility of an ohmic voltage drop between the oscillators. Indeed, HSPICE electrical simulations show that power-supply fluctuations take place along the Vdd rail (Fig. 5.18a) as well as along the Gnd rail (Fig. 5.18b). The DC probe parameters used in the electrical simulations were 5 pF, $5 \Omega$ and 10 nH. We noticed in the simulations that the Vdd signal is subject to ringing, with a higher amplitude when moving away from the Vdd pad. The same applies to the Gnd signal. The voltage difference, P$i$ - G$i$, which determines the clock period, is shown in Fig. 5.18c, where the top curve corresponds to the difference P1 - G1, followed in order by the curves corresponding to P$i$ - G$i$ for $i \in[2,20]$. This difference can drop by as much as 200 mV below the nominal 5 V supply voltage, with a 70 mV peak-to-peak amplitude (comparable to measurements). Also, the power-supply current oscillates around 12.3 mA with a 0.5 mA peak-to-peak amplitude and a frequency corresponding to the clock period at the output of the ring oscillators. This current results from the combined activity of the 20 ring oscillators tied to the power-supply rails. This, of course, affects the clock period. An additional period variation is thus added by the power-supply fluctuations and is illustrated in Fig. 5.19, which shows the clock period versus the position of the ring oscillator along the segment. Considering this variation as a discrete random variable, we find a variation coefficient ($\frac{\sigma}{\mu}$) of $1.4 \%$, which constitutes a significant portion of the total variations summarized in Table 5.1. On a larger time scale, the clock period is also subject to a modulation (Fig. 5.20).
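
The purely ohmic part of this effect can be illustrated with a static resistive ladder. The sketch below is a simplification under stated assumptions, not the HSPICE model used in this work: each oscillator is assumed to draw an equal DC share of the 12.3 mA average current, the rail resistance per segment (`r_seg`, set to 0.5 Ω) is a hypothetical value, and the probe inductance and capacitance, and therefore the ringing, are ignored. All function and parameter names are illustrative.

```python
def rail_drops(n_cells, i_cell, r_seg):
    """Cumulative ohmic drop (V) at each cell of a rail fed from one end.

    Index 0 is the cell closest to the pad. Segment k (counting from the
    pad) carries the current of all cells located beyond it.
    """
    drops, total = [], 0.0
    for k in range(n_cells):
        total += (n_cells - k) * i_cell * r_seg
        drops.append(total)
    return drops


def local_supplies(n_cells=20, vdd=5.0, i_cell=12.3e-3 / 20, r_seg=0.5,
                   vdd_pad_left=True, gnd_pad_left=False):
    """Static estimate of P_i - G_i (V) for cells numbered from the left."""
    d = rail_drops(n_cells, i_cell, r_seg)
    vdd_drop = d if vdd_pad_left else d[::-1]   # loss along the Vdd rail
    gnd_rise = d if gnd_pad_left else d[::-1]   # rise along the Gnd rail
    return [vdd - a - b for a, b in zip(vdd_drop, gnd_rise)]


# Pads at opposite ends of the segment, as in the set-up of Fig. 5.17.
opposite = local_supplies()
# Both pads at the left end, for comparison.
same_end = local_supplies(gnd_pad_left=True)

print(f"opposite ends: min {min(opposite):.3f} V, max {max(opposite):.3f} V")
print(f"same end:      min {min(same_end):.3f} V, max {max(same_end):.3f} V")
```

In this static picture, moving a pad changes the profile of the local supply along the segment, which at least suggests why the pad placement matters. It does not reproduce the ringing or the curve ordering obtained in the HSPICE simulations.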

In the power-supply distribution network of Fig. 5.17, the Vdd and Gnd signals are at opposite ends of the segments. We noticed that moving the Vdd pad to the extreme left end of the segment leads to substantial, and so far unexplained, changes in the measured clock period.

Note that this type of clock period variation component depends solely on the geometry of the power-supply distribution network, and should be considered in the circuit design phase. It manifests itself through an oscillating current along the power-supply rails. Due to the resistance between ring oscillators, it involves voltage variations that directly affect the clock period. In a typical VLSI or WSI circuit, this current can also be affected by the power dissipation in the areas between the ring oscillators, i.e., the activity in different regions of the circuit. Power dissipation seems in general to be a major player in determining power-supply fluctuations.

To summarize, power supply current and associated voltage drops emerge as a significant contribution in the observed fluctuations of the oscillator clock period.

Based on the quasi-random phenomena observed in the environmental, process and power-supply components, the effectiveness of some conventional techniques to compensate for process variations [17] becomes questionable. For instance, a recent attempt using such techniques [37] aimed at constructing a clock distribution tree by separately sizing PMOS and NMOS transistors. Shoji's technique [17] assumes that all PMOS transistors (on which the buffers of the clock tree are based) are subject to the same type of process (fast, slow or typical). The same was assumed to apply to NMOS transistors. However, the characterization performed in this paper shows (see histogram in Fig. 5.12) that active devices, such as inverters, are subject to large dispersions in their time constants depending on the position of the transistor on the wafer and on the nature of its surroundings. As a consequence, only a small proportion of the transistor population is close to the worst or best case. Therefore, designing a clock system based on transistor worst-case performance will cause a substantial penalty to the clock frequency, whereas in practice one might take advantage of the variations of transistor performance throughout the wafer.

## IV. Spectral Analysis and Spatial Correlations

In order to better understand the relationships between the different components of the clock period variation characterized in section III, we applied a one-dimensional spatial FFT on the period variations over five dies in the X direction of Fig. 5.13a, as well as in the Y direction. Also, we analyzed the spatial correlations between various cells.

## IV.1. FFT-based Analysis

The objective in performing a spectral analysis on the clock period (treated as a signal) is to check if there are some particular spatial frequencies at which signal power is observed. An answer to this question is provided in Fig. 5.21. The FFT is performed on the 100 period observations of Fig. 5.13a along the X direction, as well as in the Y direction. Note that the DC component was not shown to avoid masking the phenomena of interest. This analysis reveals that the signal power is shared between:

- A DC component corresponding to the average clock period.
- Several spectral components, out of which three appear at the following angular spatial frequencies:

$$
\omega_{1}=1 \times \frac{2 \pi}{100} ; \omega_{5}=5 \times \frac{2 \pi}{100} ; \quad \omega_{10}=10 \times \frac{2 \pi}{100} ;
$$

The first component, $\left(\omega_{1}=\frac{2 \pi}{100}\right)$, refers to a large-scale component with a spatial period of 100 cells, while the second component, $\left(\omega_{5}=\frac{2 \pi}{20}\right)$, confirms the "periodicity" visually noticed in Fig. 5.13. In other words, the value of the clock period along the X direction follows a periodic pattern that more or less repeats itself every 20 observations (which corresponds to returning to the same cell but in a different die). The third component, $\left(\omega_{10}=\frac{2 \pi}{10}\right)$, is actually a harmonic of H5 (Fig. 5.21a). Note that H15 (Fig. 5.21b) can also be considered a harmonic of H5. Harmonics H10 and H15 are, in fact, located at integer multiples of the frequency of H5.

Note that MATLAB shifts the harmonics one position to the right: the DC component (corresponding to an angular spatial frequency of zero) is in fact reported as the first frequency component. Also, note that Fourier analysis produces symmetric harmonics.

- A noise component corresponding to the rest of the FFT points. The elements of the noise magnitude are distributed around a mean value of approximately 1.3 ps. This noise probably reflects the spatial "white" noise introduced by Pelgrom in [6], which describes short-distance process variations. If this is the case, the standard deviation of this noise should tend toward zero as more period measurements are added, along the X direction, for example. According to Pelgrom, this kind of noise is related to distributions of ion-implanted, diffused, or substrate ions; local mobility fluctuations; oxide granularity and oxide charges.

A similar spectral analysis was performed for the two directions in Fig. 5.13b. The power (in picoseconds squared) is shared among the various components of the period signal as shown in Table 5.3. The power of a harmonic is calculated as the square of its magnitude. The power in the large-scale component is the sum of the powers in component H1 and its immediate neighbors, which are clearly well above the noise floor (Fig. 5.21a and Fig. 5.21b). The same applies to the environmental component, composed of H5, H10 and H15 and their respective immediate neighbors. The squared magnitudes of the remaining FFT points are summed to form the noise component. Since some noise contribution is also included in components H1 through H15, and in order not to count noise more than once, an amount of power corresponding to the noise mean value is subtracted from the power of the spectrum components associated with the large-scale and environmental components.
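
The partition of spectral power described above can be sketched as follows. This is an illustrative reading of the procedure, not the MATLAB code used in this work, and it rests on several assumptions: the DFT is normalised by the number of samples (so that the DC power equals the squared mean), only the first half of the symmetric spectrum is used, "immediate neighbors" means one bin on each side, and the noise-floor power removed from the signal bins is reassigned to the noise component. The function names and the synthetic test signal are hypothetical.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitude of each DFT bin, normalised by N so that bin 0 is the mean."""
    n = len(samples)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples))) / n
            for k in range(n)]


def partition_power(samples, large_scale=(1,), environmental=(5, 10, 15),
                    halfwidth=1):
    """Split one-sided spectral power (ps^2) into the components of the text."""
    n = len(samples)
    power = [m * m for m in dft_magnitudes(samples)]
    last = n // 2  # the spectrum is symmetric, keep bins 1..N/2

    def bins(centres):
        return {k for c in centres
                for k in range(c - halfwidth, c + halfwidth + 1)
                if 1 <= k <= last}

    ls_bins = bins(large_scale)
    env_bins = bins(environmental) - ls_bins
    noise_bins = [k for k in range(1, last + 1)
                  if k not in ls_bins and k not in env_bins]
    floor = sum(power[k] for k in noise_bins) / len(noise_bins)

    result = {"dc": power[0], "large_scale": 0.0, "environmental": 0.0,
              "noise": sum(power[k] for k in noise_bins)}
    for name, group in (("large_scale", ls_bins), ("environmental", env_bins)):
        for k in group:
            kept = max(power[k] - floor, 0.0)  # remove the mean noise power
            result[name] += kept
            result["noise"] += power[k] - kept  # assumption: move it to noise
    return result


# Synthetic 100-point series: mean 2000 ps plus components at H1 and H5.
series = [2000 + 30 * math.cos(2 * math.pi * i / 100)
          + 20 * math.cos(2 * math.pi * 5 * i / 100) for i in range(100)]
print(partition_power(series))
```

Because Python lists are zero-based, bin `k` here corresponds directly to harmonic H`k`, with the DC term at index 0, whereas MATLAB reports the DC term as its first element.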

Table 5.3 shows that, in most cases, the contribution of noise to the total period variation is 11 to $15 \%$. However, it can be as large as $20 \%$ in the particular case of direction X in Fig. 5.13b, where the environmental component was almost buried in noise. Also, depending on the wafer considered (Fig. 5.13a or Fig. 5.13b) and on the direction along the wafer, the environmental and large-scale components have a variable significance. The power carried by the DC component remains much larger than that of all components of the period variation.
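
The noise shares quoted above follow directly from the entries of Table 5.3. The short check below recomputes them; the dictionary layout and the function name are illustrative only, while the numbers are those of the table.

```python
# Power per component, in ps^2, copied from Table 5.3.
table_5_3 = {
    "X, Fig. 5.13a": {"environmental": 314, "large_scale": 511, "noise": 123},
    "Y, Fig. 5.13a": {"environmental": 423, "large_scale": 290, "noise": 128},
    "X, Fig. 5.13b": {"environmental": 670, "large_scale": 304, "noise": 251},
    "Y, Fig. 5.13b": {"environmental": 375, "large_scale": 741, "noise": 132},
}


def shares(components):
    """Percentage of the total variation carried by each component."""
    total = sum(components.values())
    return {name: 100.0 * value / total for name, value in components.items()}


for direction, components in table_5_3.items():
    total = sum(components.values())
    noise_pct = shares(components)["noise"]
    print(f"{direction}: total {total} ps^2, noise {noise_pct:.1f} %")
```

The totals reproduce the "Total variation" row of the table, and the noise share stays between roughly 11 % and 15 % except for direction X of Fig. 5.13b, where it is about 20 %.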

Table 5.3: Power in various components of the clock signal (in ps$^{2}$)

| Component | Direction X in Fig. 5.13a | Direction Y in Fig. 5.13a | Direction X in Fig. 5.13b | Direction Y in Fig. 5.13b |
| :---: | :---: | :---: | :---: | :---: |
| Environmental | 314 | 423 | 670 | 375 |
| Large-scale | 511 | 290 | 304 | 741 |
| Noise | 123 | 128 | 251 | 132 |
| Total variation | 948 | 841 | 1225 | 1248 |
| DC | $4.25 \times 10^{6}$ | $4.51 \times 10^{6}$ | $4.87 \times 10^{6}$ | $4.47 \times 10^{6}$ |

## IV.2. Spatial correlations on-die and from-die-to-die

In the following, we are interested in the spatial correlations (Fig. 5.22) between:
1- immediate neighbors (physical neighbors on the same die),
2- cells with the same environment in different dies (these can be considered as neighbors at WSI scale, since there is only one cell with an identical environment on each die),
3- and cells which share neither the same die nor the same electrical environment (non-neighbors).

For each type, a number of pairs were constructed from Fig. 5.13a along the X direction.
For each pair of cells, the absolute value of the difference in clock period is considered.
Table 5.4 summarizes the mean value of the distribution of these differences for the various types of neighbors.

Table 5.4: Means of absolute period differences

| Type of cells | $\mu$ (ps) |
| :---: | :---: |
| Immediate neighbors | 17 |
| Same-environment neighbors | 70 |
| Non-neighbors | 50 |

Note that along 5 die sides, there are 20 groups of 4 pairs of same-environment neighbors, such that cell pairs of different groups have different environments. Table 5.4 shows that the period differences are 1.4 to 4.2 times larger between same-environment neighbors than for other types of cell pairs. Despite the fact that a certain "periodicity" seems to be respected for same-environment neighbors in Fig. 5.13, the gradient of changes in the magnitude of the clock period seems to be better conserved from die to die than the magnitude itself. Also, over long distances (5 dies in our case), the large-scale component modulates the environmental component, thus affecting the magnitude levels to which same-environment neighbors are set in different dies.
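
The three kinds of cell pairs can be enumerated as in the sketch below, given the 100 periods measured along 5 die sides. This is a hypothetical helper, not the procedure actually used: in particular, the text does not say which non-neighbor pairs were selected, so the sketch assumes all pairs of cells lying on different dies and at different positions. The ramp used as input is synthetic.

```python
def mean_abs_differences(periods, cells_per_die=20):
    """Mean |period difference| (ps) for the three kinds of cell pairs."""
    if len(periods) % cells_per_die:
        raise ValueError("expected a whole number of die sides")
    n_dies = len(periods) // cells_per_die
    die = [periods[d * cells_per_die:(d + 1) * cells_per_die]
           for d in range(n_dies)]

    # Type 1: adjacent cells on the same die side.
    immediate = [abs(die[d][c] - die[d][c + 1])
                 for d in range(n_dies) for c in range(cells_per_die - 1)]
    # Type 2: same cell position in two adjacent dies (20 groups of 4 pairs
    # for 5 die sides).
    same_env = [abs(die[d][c] - die[d + 1][c])
                for d in range(n_dies - 1) for c in range(cells_per_die)]
    # Type 3 (assumption): every pair on different dies, different positions.
    non_neighbors = [abs(die[d][c] - die[e][k])
                     for d in range(n_dies) for e in range(d + 1, n_dies)
                     for c in range(cells_per_die)
                     for k in range(cells_per_die) if c != k]

    def mean(values):
        return sum(values) / len(values)

    return {"immediate": mean(immediate),
            "same_environment": mean(same_env),
            "non_neighbors": mean(non_neighbors)}


# Synthetic input: a steady ramp of 1 ps per cell over 5 die sides.
ramp = [float(i) for i in range(100)]
print(mean_abs_differences(ramp))
```

On such a ramp, which mimics a pure large-scale gradient, same-environment neighbors differ more than immediate neighbors simply because they are 20 cells apart. This is consistent with the remark above that the large-scale component shifts the levels of same-environment neighbors from die to die.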

## V. Conclusion

In this paper, we have presented the first large-scale experimental characterization of spatial process variations on a parameter that is directly involved in timing issues: the MOS transistor time constant. This is achieved by measuring the oscillation period of high-speed (500 MHz) CMOS ring oscillators that are implemented at different locations on individual dies and over wafers.

An environmental phenomenon that modulates the clock period was discovered. That phenomenon manifests itself as a strong periodic influence due to variations in surrounding electrical devices, and it can be viewed as a generalization of the "edge" effect observed and reported in previous work. Also, a large-scale process component induces shifts in the environmental part of the period variations. This component is characterized by a much lower spatial frequency. Based on their measurements, Gneiting et al. assumed in [13] that on-wafer variations are negligible. In light of the new phenomena that we observed, their assumption appears to be very questionable, not only at wafer scale, but particularly at die level. Similarities can be drawn between the "striation" effect observed by Pavasovic et al. [11] and Andreou et al. [12] and the environmental component (in terms of "periodicity"), as well as between the "gradient" effect and the process component (in terms of low frequency). Also, important "edge" effects were observed on the wafer borders. The effects of the power-supply distribution network on the clock period variations were also analyzed and characterized.

## Acknowledgment

We would like to acknowledge Prof. John Currie for making available the facilities of the LISA laboratory, Normand Gravel for his helpful experience in wafer probing and Ali Rahal from the POLYGRAM laboratory for helpful discussions on microwave issues, all at Ecole Polytechnique of Montreal. The authors also acknowledge the Canadian Microelectronics Corporation and NORTEL for manufacturing our experimental device and for providing full wafers for testing. This work was also partly supported by grant 0GP0006574 from the Natural Sciences and Engineering Research Council of Canada and by a grant of the Synergie program of the province of Quebec.


Fig. 5.1. Ring oscillator (schematic)


Fig. 5.2. Ring oscillator and inverter's output (photograph)


Fig. 5.3. HSPICE electrical simulation


Fig. 5.4. Oscillation cell (photograph)


Fig. 5.5. Our design (photograph)


Fig. 5.6. Vdd \& Gnd pads (photograph)


Fig. 5.7. The die core and our design (plot)


Fig. 5.8. Wafer (photograph)


Fig. 5.9. Equipment (photograph)


Fig. 5.10. Two signal outputs (photograph)


Fig. 5.11. Four-side representation of clock period at the die level.


Fig. 5.12. Histogram of period distribution in die b) of Fig. 5.11.


Fig. 5.13. Correlation along X \& Y directions


(b) vertical pair of die sides.

Fig. 5.14. Correlations between neighbors

(a) first non-scribed wafer

(b) second non-scribed wafer

Fig. 5.15. Spatial characterization of clock period on WSI scale

```
1938 1938 2482 2185 2179 2197 2164 2349 1938 1938
1938 2402 2186 2211 2237 2206 2169 2137 2331 1938
1938 2448 2268 2184 2171 2149 2188 2181 2209 1938
2295 2166 2148 2160 2148 2156 2169 2248 2140 2290
3485 2142 2145 2170 2150 2368 2177 2216 2181 2194
1938 2329 2125 2154 2101 2153 2135 2161 2168 1938
1938 2321 2085 2155 2151 2138 2117 2143 1938 1938
1938 1938 2230 2087 2137 2183 2100 2000 1938 1938
```

Fig. 5.16. Numerical values of the process component
(Note: Since the wafer is not rectangular, non-existent dies have been replaced by the minimal period value, i.e., 1938 ps)


Fig. 5.17. Schematic of the power-supply distribution network


Fig. 5.18. Voltage fluctuations along power-supply rails.

(a) in X direction of Fig. 5.13a

(b) in Y direction of Fig. 5.13a

Fig. 5.21. One-dimensional FFT of period distribution


Fig. 5.22. Types of neighbors

## Chapitre 6.

# Techniques de minimisation du biais de synchronisation par calibration de délai 

Minimiser le biais de synchronisation dans un réseau de distribution d'horloge reste I'un des soucis majeurs dans les systèmes synchrones de grande taille et opérant à grande vitesse. Malgré l'application de techniques de minimisation du biais de synchronisation pendant la phase de conception du circuit, les variations du procédé de fabrication réintroduisent un biais substantiel lors de la fabrication. En effet, nous avons montré au chapitre 3 l'évolution quadratique du biais de synchronisation dû aux variations du procédé de fabrication avec la taille du réseau de distribution d'horloge, lorsque ces variations sont décrites par un gradient uniforme. Cela limite évidemment le fonctionnement à grande vitesse. De plus, des mesures directes sur la constante de temps des transistors MOS à différentes positions d'une tranche (chapitre 5) confirment l'existence de variations substantielles. Ces variations, nous l'avons vu au chapitre 5, s'expriment en plusieurs composantes liées à l'environnement des transistors, aux variations à grande distance du procédé, aux fluctuations de l'alimentation et à des variations aléatoires locales. L'idée préconisée dans ce chapitre consiste à mesurer la résultante des différentes composantes et à la compenser en égalisant les délais des différents chemins d'un arbre de distribution d'horloge après fabrication du circuit.

We present two techniques for calibrating clock skew ${ }^{1}$ that rely on a laser. In a hybrid-tree context, either the parasitic capacitance of the interconnection network or the buffer structure is used as the calibration parameter. For the interconnections (section II), external capacitors are attached to the branches, and laser cuts modify the capacitive load of the branches and hence the corresponding delays. For the buffer structure (section III), two methods are presented. The first is based on measuring the frequencies of oscillators placed at the buffer locations, and the second relies on varying the buffer size. Calibration comes at the cost of laser cuts that are easy to perform, or of redesigning a few fabrication masks. To validate these methods, test circuits were designed and/or fabricated in Nortel's $0.8 \mu \mathrm{~m}$ BiCMOS technology. The interconnection calibration circuit was fabricated, but a design error prevented it from being tested. The buffer calibration circuits have not yet been fabricated. Throughout this chapter, Nortel's $0.8 \mu \mathrm{~m}$ BiCMOS technology is assumed by default.

Although we independently recognized the value of the structures proposed in this chapter, the idea of calibrating the capacitance and the buffer size is not new. The principle was introduced in [47] and applied to input/output pads. In that case, calibration is performed on every fabricated chip. The merit of the present chapter, however, is to show its application to clock distribution trees and to subject only a limited set of prototype chips to calibration.

1. In French: *biais de synchronisation* (BS), i.e., clock skew.

# Minimizing Process-Induced Skew Using Delay Tuning 

M. Nekili, Y. Savaria and G. Bois<br>Ecole Polytechnique de Montreal<br>Department of Electrical \& Computer Engineering<br>P.O. Box 6079, Station "Centre-Ville", Montreal, Quebec, Canada H3C 3A7<br>Phone: (514) 340-4711 ext. 4737; E-mail: nekili@vlsi.polymtl.ca

Abstract: Tolerance to process-induced skew remains one of the major concerns in large-area and high-speed clock distribution networks. Indeed, despite the availability of some efficient exact-zero skew algorithms that can be applied during circuit design, the skew remains an important performance limiting factor after chip manufacturing. This paper presents techniques to minimize this kind of skew using delay tuning in buffered clock trees.

## I. Introduction

Minimizing skew remains one of the major concerns in large-area and high-speed clock distribution networks. Despite the availability of exact-zero skew algorithms for minimizing skew during the circuit design phase, a substantial amount of skew can be induced during chip manufacturing. Therefore, the design of clock distribution networks in the presence of process variations becomes an important issue [95]. Process-induced skew has various sources, some of which are hard to model. Therefore, the idea behind this paper consists of measuring the combined effect of the different sources and compensating for it by balancing the delays of clock paths in the clock distribution network.

The present paper applies zero-skew techniques to minimize process-induced clock skew using delay calibration in buffered clock trees. In a bottom-up fashion, these techniques perform a set of clock signal measurements at a limited number of clock nets and then balance the delays of the clock tree branches by modifying the electrical features of either the interconnections or the buffers.

A first technique (section II) consists of connecting several capacitors to the branches of the clock tree that are closest to the root. The clock skew is measured between two clock nets and accordingly, a laser alternately cuts capacitors from the corresponding pair of tree branches, until both skew and delay are minimized at the current tree level. A $500-\mathrm{MHz}$ experimental chip was fabricated with tree branches laid out over a 1.4 cm by 1.4 cm die, using the Nortel $0.8 \mu \mathrm{~m}$ BiCMOS technology.

A second technique (section III) implements a clock buffer as a number of minimum-sized inverters in parallel, all linked using the top metal layer. According to the skew measurement at a pair of clock nets, a laser cuts inverters from one of the corresponding pair of buffers until the skew is minimized at the current tree level. Once the calibration phase is finished, the designer adapts the final buffer sizes into the circuit design. This can be done by changing the top metal layer and vias only, which allows rapid and relatively inexpensive development of production masks with calibrated clock trees. With this method, skews induced by deterministic effects such as environment-induced process variations and power supply drops may be compensated, even if no accurate method is available to predict these effects before manufacturing.

Although the interest of using the basic structures proposed in this paper was developed independently, the idea of calibrating interconnections and buffers in order to reduce clock skew is a concept initially proposed by Lee and Murphy [47]. This concept was applied to the I/O frame and they proposed submitting all fabricated chips to calibration. The merit of the present paper is to show its application in the context of clock distribution networks and to show how deterministic but hard to predict effects can be compensated in situ to derive a mass calibration from a small set of prototypes. This paper will address some of the practical issues related to our proposal such as how to measure actual skews. Also, test circuits are proposed.

## II. Delay Tuning via Interconnections

This first technique is based on observing a clock signal at a limited number of clock nets and then, depending on the skew between these nets, adjusting the capacitive load on the clock paths in order to reduce the skew. This adjustment can be performed using laser cuts. This idea is illustrated in Fig. 6.1, where a 4-level H-tree [50] is used to feed a 16-processor array. The clock signal is injected at the clock root. Then, the signal travels through tree branches of decreasing width and reaches the tree leaves (processors). In order to measure the clock signal, windows are opened in the passivation layer at a limited number of clock nets. Capacitors are tied to the different tree branches. In practice, these capacitors will be added to the longest branches only, i.e., those closest ${ }^{1}$ to the tree root.

Some of these capacitors will be disconnected from their branch using a laser in order to minimize the total branch capacitance, and thus, the delay of this branch. To balance the delays of two different clock paths, capacitors will be disconnected from branches of the slowest path.

## II.1. Algorithm

In general, for an N-level tree, the minimization process can be described by the following algorithm, which can be run automatically or manually by an operator:

## Algorithm TuningViaInterconnect;

Step 1: Consider the set of primary levels.
Step 2: Consider branch pairs which are the closest to the tree leaves.
Step 3: Consider a pair of adjacent branches among these pairs.
Step 4: Observe clock signals at the ends of this pair of branches.
Step 5: IF the clock signals do not arrive at the same time
THEN disconnect capacitors from the slowest path until delays are balanced.
EndIF
Step 6: Disconnect capacitors from both paths in order to reach an equilibrium point where the capacitors are reduced as much as possible, allowing the clock to operate at its maximum speed. It is important to disconnect capacitors alternately from the two paths in order to avoid creating an imbalance that could not be re-balanced later. This step is repeated until no capacitor is left on one of the paths.
Step 7: IF any pair of adjacent branches in the current primary level remains unprocessed
THEN GOTO Step 3
ELSE IF other primary levels remain to be processed
THEN consider the next primary level closer to the clock root; GOTO Step 3
EndIF
EndIF

## EndOf TuningViaInterconnect;

For the example of Fig. 6.1, the two primary levels considered by the algorithm are levels 1 and 2. The algorithm begins with primary level 2:

- Pair 1: consisting of branches b12 and b22.
- Pair 2: consisting of branches b32 and b42.

Considering pair 1, the clock signal is observed at points P1(2) and P2(2). Depending on measured skew, capacitors are disconnected from the branches of this pair according to Steps 5 and 6. Since there remains another pair of adjacent branches in the current primary level (level 2), Step 7 returns to Step 3. Pair 2 is then considered, points P3(2) and P4(2) observed, and Steps 5 and 6 processed. Since there are no more adjacent branches at primary level 2, Step 7 moves to primary level 1 and then to Step 3.

At primary level 1, there is a single pair: the one consisting of branches b11 and b21. The clock signal is observed at points $\mathrm{P} 1(1)$ and $\mathrm{P} 2(1)$ then Steps 5 and 6 are processed. Since there are no more adjacent branches and primary levels to be processed, this completes the algorithm.
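The walk-through above can be mimicked with a small simulation. The following is only a sketch under stated assumptions, not the procedure used on the test chip: each capacitor is assumed to add a fixed 5 ps to its branch, measurements are assumed exact, and the `Branch` class, the function names and all delay values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Branch:
    name: str
    base_delay_ps: float        # delay with every tuning capacitor removed (hypothetical)
    caps: int                   # tuning capacitors still connected
    cap_delay_ps: float = 5.0   # delay added per connected capacitor (assumed constant)

    @property
    def delay_ps(self) -> float:
        return self.base_delay_ps + self.caps * self.cap_delay_ps

    def cut(self) -> None:
        """Model one laser cut: disconnect one capacitor from the branch."""
        assert self.caps > 0
        self.caps -= 1


def tune_pair(a: Branch, b: Branch) -> float:
    """Steps 4 to 6 for one pair of adjacent branches; returns the residual skew."""
    # Step 5: cut from the slower branch as long as one more cut reduces the skew.
    while True:
        slow, fast = (a, b) if a.delay_ps >= b.delay_ps else (b, a)
        skew = slow.delay_ps - fast.delay_ps
        if slow.caps == 0 or skew <= slow.cap_delay_ps / 2:
            break
        slow.cut()
    # Step 6: cut alternately from both branches until one has no capacitor left.
    while a.caps > 0 and b.caps > 0:
        a.cut()
        b.cut()
    return abs(a.delay_ps - b.delay_ps)


def tune_tree(primary_levels):
    """Steps 1, 2, 3 and 7: process primary levels from the leaves toward the root."""
    return {(a.name, b.name): tune_pair(a, b)
            for pairs in primary_levels for a, b in pairs}


# Hypothetical delays for the branches of Fig. 6.1 (level 2 first, then level 1).
b12, b22 = Branch("b12", 100.0, 8), Branch("b22", 112.0, 8)
b32, b42 = Branch("b32", 105.0, 8), Branch("b42", 101.0, 8)
b11, b21 = Branch("b11", 200.0, 8), Branch("b21", 230.0, 8)
residual = tune_tree([[(b12, b22), (b32, b42)], [(b11, b21)]])
print(residual)
```

With these made-up numbers the residual skews are 2 ps, 1 ps and 0 ps for the three pairs, i.e., each pair is balanced to within half a capacitor step, which is the calibration resolution of this model.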

Note that, as it is, the algorithm does not guarantee that all clock paths are equal in delay. Indeed, subject to the accuracy of the measurements and the resolution of the calibration, if d(bij) is the delay of branch bij after running the algorithm, branch delays satisfy the following: $\mathrm{d}(\mathrm{b} 12)=\mathrm{d}(\mathrm{b} 22)$ and $\mathrm{d}(\mathrm{b} 32)=\mathrm{d}(\mathrm{b} 42)$ in primary level 2 and $\mathrm{d}(\mathrm{b} 11)=\mathrm{d}(\mathrm{b} 21)$ in primary level 1. Therefore, the paths linking $\mathrm{P} 1(0)$ to $\mathrm{P} 1(2)$ and to $\mathrm{P} 2(2)$, respectively, have the same delay. However, the path linking $\mathrm{P} 1(0)$ and $\mathrm{P} 1(2)$ and the path linking $\mathrm{P} 1(0)$ and $\mathrm{P} 3(2)$ are not necessarily equalized in delay because, in general, $\mathrm{d}(\mathrm{b} 12) \neq \mathrm{d}(\mathrm{b} 32)$.

In the case of Fig. 6.1, after the processing of level 2 branches, a solution to calibrate all paths would be to consider branch delay targets given by the following equations when processing branches b11 and b21.

$$
\begin{array}{ll}
d(b11)=d(b21)+(d(b32)-d(b12)) & \text { if } d(b12)<d(b32) \\
d(b11)+(d(b12)-d(b32))=d(b21) & \text { if } d(b12)>d(b32)
\end{array}
$$
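As a numerical check, the target can be read as a path-balance condition. The sketch below assumes, as the branch numbering suggests but the text does not state explicitly, that b11 feeds the pair (b12, b22) and b21 feeds the pair (b32, b42), so that full calibration requires d(b11) + d(b12) = d(b21) + d(b32). The function name and the delay values are hypothetical.

```python
def level1_target_ps(d_b21: float, d_b12: float, d_b32: float) -> float:
    """Delay that b11 must reach so that d(b11) + d(b12) == d(b21) + d(b32)."""
    return d_b21 + (d_b32 - d_b12)


# Hypothetical branch delays (ps) once level 2 has been processed.
d_b12, d_b32, d_b21 = 110.0, 105.0, 230.0
d_b11 = level1_target_ps(d_b21, d_b12, d_b32)

path_to_p1 = d_b11 + d_b12   # assumed route: root -> b11 -> b12 -> P1(2)
path_to_p3 = d_b21 + d_b32   # assumed route: root -> b21 -> b32 -> P3(2)
print(d_b11, path_to_p1, path_to_p3)
```

Here d(b12) > d(b32), so b11 has to be made 5 ps faster than b21 for the two root-to-leaf paths to match.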

## II.2. Test chip

To test this algorithm, a test circuit (Fig. 6.2) was designed in Nortel BiCMOS $0.8 \mu \mathrm{~m}$ technology [80]. In this test circuit, the clock tree has a C shape occupying the borders of a square die with a 1.4 cm side. The clock signal is generated by a 7-stage ring oscillator composed of minimum-sized inverters (Fig. 6.3), which produces a $500-\mathrm{MHz}$ frequency. In order to drive the large capacitance of the tree branches, a chain of inverters [81] is inserted at the oscillator's output. The numbers on the inverters indicate their size (1 for minimum size, 2 for double, etc.): the first inverter is of minimum size, the second is double size, and so on, covering the geometric progression $2^{0}, 2^{1}, 2^{2}, 2^{3}, 2^{4}$ and $2^{5}$. To prevent the calibration of one branch from affecting that of the adjacent branch, the last stage in the chain of inverters can be split into 2 buffers with a size of $2^{4}$. Each of these two buffers drives a branch. The clock signal is then carried to the branch ends using the TopMetal layer. Due to area constraints, the wire width is maintained at minimum size.
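A quick back-of-the-envelope check of these figures is possible with the standard first-order ring-oscillator relation f = 1/(2·N·t_d), where t_d is the average stage delay. The sketch below is illustrative only: the helper names are hypothetical, and the relation ignores the extra load that the driver chain places on the oscillator.

```python
def stage_delay_ps(freq_hz: float, n_stages: int) -> float:
    """Average per-stage delay implied by f = 1 / (2 * N * t_d), in picoseconds."""
    return 1e12 / (2 * n_stages * freq_hz)


def driver_chain(n_stages: int, ratio: int = 2) -> list:
    """Inverter sizes of a geometrically tapered chain: ratio**0, ratio**1, ..."""
    return [ratio ** k for k in range(n_stages)]


t_d = stage_delay_ps(500e6, 7)         # 7-stage oscillator at 500 MHz
sizes = driver_chain(6)                # sizes 2^0 .. 2^5
split_last_stage = [sizes[-1] // 2] * 2  # last stage split into two 2^4 buffers
print(round(t_d, 1), sizes, split_last_stage)
```

The 500-MHz figure thus corresponds to roughly 143 ps per minimum-sized inverter stage, and splitting the final $2^{5}$ stage into two $2^{4}$ buffers keeps the total drive strength unchanged while decoupling the two branches.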

An HSPICE [75] electrical simulation shows the shape of the signal at the end of a branch (Fig. 6.4). Each branch (metal line) is modeled using a distributed RC network. Due to line attenuation, the peak-to-peak signal voltage is only 3 V. In that design, in order to measure the $500-\mathrm{MHz}$ clock signal, a microwave probe is used [76]. This type of probe requires 2 ground signals adjacent to the signal to be measured, in order to avoid antenna effects and provide an environment with controlled impedance (Fig. 6.5). Note that the area occupied by the pad used for observing the signal is significant. However, this observation is done for a limited number of clock nodes (primary levels only). Therefore, the area allocated to observation pads remains limited when the number of tree levels increases.
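To give a feel for how a distributed RC branch responds to added capacitance, the sketch below uses the Elmore delay of an RC ladder. This is a first-order estimate swapped in for illustration, not the HSPICE model used for Fig. 6.4. The line resistance and capacitance are hypothetical, and the tuning capacitor is lumped at the far end of the branch as a simplifying assumption.

```python
def elmore_delay(r_total: float, c_total: float, n_segments: int,
                 c_load: float = 0.0) -> float:
    """Elmore delay (s) of a uniform RC line split into n_segments L-sections."""
    r = r_total / n_segments
    c = c_total / n_segments
    delay = 0.0
    for i in range(1, n_segments + 1):
        # The resistance of segment i charges every capacitor at or beyond node i.
        downstream_c = c * (n_segments - i + 1) + c_load
        delay += r * downstream_c
    return delay


# Hypothetical branch: 1 kOhm and 1 pF in total, with and without a 50 fF capacitor.
bare = elmore_delay(1000.0, 1e-12, 100)
loaded = elmore_delay(1000.0, 1e-12, 100, c_load=50e-15)
print(bare, loaded - bare)
```

In this model the bare line delay tends to R·C/2 as the number of segments grows, and each capacitor attached at the branch end adds R·C_cap, which is the delay a laser cut would remove.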

The capacitors are implemented using plates based on GATE and CAP layers of the BiCMOS $0.8 \mu \mathrm{~m}$ technology (Fig. 6.6). These layers offer the maximum capacitance per unit area, among the layers of this technology. If such linear capacitors are not available with a given process, a normal transistor with drain and source to ground would suffice. The capacitors are then tied to the TopMetal layer using a contact, Metal1, VIA1, Metal2, VIA2 chain of layers. The resulting structure is then tied to the branch through a TopMetal bridge. When needed, this capacitor can be isolated from the branch by a simple transversal laser cut across the TopMetal bridge.

## III. Delay Tuning via Oscillators \& Buffers

The delay tuning technique via interconnections (section II) involves a laser surgery operation along all the branches of primary levels in the clock distribution tree. When the clock tree is made of buffers and interconnections, the number of locations requiring laser cuts can be reduced by performing the delay tuning on buffers rather than interconnections. Moreover, the main source of process-induced skew lies in the buffers. Therefore, it is more natural to solve the problem at this level.

Previous work [14, 96] has shown large on-die and on-wafer variations in transistor time constants. A large scale experimental characterization of spatial process variations was performed on a parameter that is directly involved in timing issues: the MOS transistor time constant. This was achieved by measuring the oscillation period of high-speed ( 500 MHz ) CMOS ring oscillators that are implemented at different locations on individual dies and over wafers. An environmental phenomenon that modulates the clock period was observed. That phenomenon manifests itself as a strong periodical influence due to variations in surrounding electrical devices. Also, a large-scale process component induces shifts into the environmental part of the period variations. This component is characterized by a much lower spatial frequency. Also, important "edge" effects were observed on the wafer borders. The effects of the power-supply distribution network on the clock period variations were analyzed and proved to be significant.

To compensate for such variations in the transistor time constants of buffers, a delay tuning structure is suggested. It consists of $M$ fixed-size inverters, which can be configured in a serial mode (Fig. 6.7a) or in a parallel mode (Fig. 6.7b).

In a general case, where clock leaves are distributed non-uniformly throughout a silicon area and with different loads, in order to minimize process-induced skew, two methods are proposed: the oscillator-based method and the buffer-based method. These methods are developed in the following.

## III.1. Oscillator-Based Method

This first method is based on the idea of delaying the determination of buffer sizes until the chip is manufactured and the buffer time constants are measured. During the design phase, an exact-zero skew algorithm is applied to synthesize the clock distribution network with an effort on minimizing the components of skew not associated with process variations. This is done by delay balancing of clock tree branches. During the design phase, all buffers are replaced by ring oscillators.

After chip manufacturing, the frequencies of the ring oscillators are measured in order to provide the designer with data about the behavior of the time constants of the clock buffers and especially to sense the environmental component of transistor time constant variations. For each buffer, the net result of deterministic parametric process variations is captured at its location. Based on these measurements, final buffer sizes are determined so that branch delays remain balanced despite the presence of process variations. The steps of this methodology are as follows:

## Algorithm TuningViaBuffer;

Step 1: Apply an exact-zero skew algorithm to balance tree delays before manufacturing.
Step 2: Use TopMetal layer (highest layer in the process) and Via2 layer to configure delay tuning structure in oscillator mode (Fig. 6.7a) and place it at the locations of buffers.

Step 3: Fabricate a first chip prototype.
Step 4: Measure the frequencies of ring oscillators and save this data in a Look-Up Table (LUT) in correspondence with the locations of buffers in the clock tree.

Step 5: Use TopMetal and Via2 layers to reconfigure all ring oscillators in buffer mode (Fig. 6.7b) while re-running an exact-zero skew algorithm to balance clock tree branches of the primary levels only, based on data in LUT.

Step 6: Re-submit to silicon foundry the masks of TopMetal and Via2 layers.
Step 7: Fabricate second chip prototype.

## EndOf TuningViaBuffer;

If the results of skew are satisfactory, volume production can be started.
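Steps 4 and 5 can be illustrated with a deliberately simplified stand-in. Instead of re-running an exact-zero skew algorithm, the sketch below rescales each buffer in proportion to the oscillator period measured at its location. It assumes that the oscillation period is proportional to the local transistor time constant and that buffer delay scales as time constant divided by buffer size for a fixed load. The function names, buffer labels and period values are all hypothetical.

```python
def build_lut(measured_period_ps: dict, nominal_period_ps: float) -> dict:
    """Step 4: map each buffer location to its relative time constant."""
    return {loc: period / nominal_period_ps
            for loc, period in measured_period_ps.items()}


def resize_buffers(design_sizes: dict, lut: dict) -> dict:
    """Step 5 (simplified): scale sizes so that tau / size keeps its design value.

    Sizes are counted in minimum-sized inverters, hence the rounding.
    """
    return {loc: max(1, round(size * lut[loc]))
            for loc, size in design_sizes.items()}


# Hypothetical measurements at three buffer locations (nominal period: 2000 ps).
measured = {"B1": 2000.0, "B2": 2100.0, "B3": 1900.0}
lut = build_lut(measured, nominal_period_ps=2000.0)
new_sizes = resize_buffers({"B1": 16, "B2": 16, "B3": 16}, lut)
print(new_sizes)
```

A location that oscillates 5 % slower than nominal receives a proportionally larger buffer, and a faster one a smaller buffer. A real flow would also account for the change in load seen by the upstream stage, which this sketch ignores.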

## III.2. Buffer-Based Method

Instead of re-running an exact-zero skew algorithm after manufacturing the first prototype, one can take advantage of the measurement step (step 4) to laser tune the buffer sizes and see the effect of this operation instantaneously. At step 2, the delay tuning structure is configured in buffer mode so that the clock signal at the buffer output can be observed. An algorithm similar to TuningViaInterconnect can be used to guide the delay balancing along primary levels of the tree. Once the remaining skew is kept below acceptable boundaries, the designer reflects the buffer sizes into TopMetal and Via2 masks. This method has the advantage of increasing the chances of success when checking the second chip prototype. Also, the buffers act as drivers placed in a real environment (under real clock and power supply), which makes it possible to capture complete information about deterministic environment-induced delay variations in the clock tree.

The delay tuning of buffers can be performed using laser cuts as shown in Fig. 6.8. Each laser cut disconnects a minimum-sized inverter from the buffer structure, increasing its output impedance and decreasing its input capacitance. This affects the delay of the tree branch containing this buffer in order to balance branch delays for a given tree level.
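The effect of successive cuts can be sketched with a first-order RC model. The assumptions here are mine and deliberately crude: a buffer made of m parallel minimum-sized inverters has output resistance R_min/m, its delay is 0.69·R·C_load, and the reduction in input capacitance seen by the upstream stage is ignored. Under this model a cut slows the buffer down, so inverters are removed from the faster of the two buffers. All names and numerical values are hypothetical.

```python
def buffer_delay_ps(m: int, r_min_kohm: float, c_load_ff: float,
                    tau_scale: float = 1.0) -> float:
    """First-order delay of m parallel minimum-sized inverters (kOhm * fF = ps)."""
    assert m >= 1
    return 0.69 * tau_scale * (r_min_kohm / m) * c_load_ff


def cuts_to_balance(m_fast: int, m_slow: int, r_min_kohm: float, c_load_ff: float,
                    tau_fast: float, tau_slow: float) -> int:
    """Number of laser cuts on the faster buffer that minimizes the skew."""
    slow_delay = buffer_delay_ps(m_slow, r_min_kohm, c_load_ff, tau_slow)

    def skew(m: int) -> float:
        return abs(buffer_delay_ps(m, r_min_kohm, c_load_ff, tau_fast) - slow_delay)

    cuts = 0
    # Keep cutting while one more cut still reduces the skew.
    while m_fast - cuts > 1 and skew(m_fast - cuts - 1) < skew(m_fast - cuts):
        cuts += 1
    return cuts


# Hypothetical pair: 16 inverters each, one location 10 % faster than the other.
cuts = cuts_to_balance(16, 16, r_min_kohm=10.0, c_load_ff=1000.0,
                       tau_fast=0.9, tau_slow=1.0)
print(cuts)
```

With these values two cuts bring the skew from about 43 ps down to about 12 ps, and a third cut would overshoot, which shows how the inverter count sets the tuning resolution.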

## III.3. Impact on prototyping delays

With the Nortel $0.8 \mu \mathrm{~m}$ BiCMOS technology [80], accessed through the silicon services of the Canadian Microelectronics Corporation, the delay between the submission of the first prototype and its fabrication is approximately 16 weeks. The procedure suggested here is as follows. A fraction of a wafer lot is fabricated using TopMetal and Via2 layers and contains the delay tuning structure configured in a specific mode, and the rest of the lot is fabricated without these layers (delay tuning structure not configured). Once measurements are performed on the first portion of the lot, the final configuration of the TopMetal and Via2 layers is determined and is then reflected in the masks. The fabrication of the second prototype uses the rest of the lot and requires only one additional week (according to the Nortel foundry). The total prototyping delay is thus 17 weeks (16 + 1) instead of the 32 weeks (2 × 16) that two complete fabrication runs would take.

## III.4. Test chip proposals

In the experiment suggested in Fig. 6.2, process variations are due essentially to the two clock branches, i.e., the interconnections. Variable capacitors were tied to the interconnection in order to compensate for the variations. Gneiting and Jalowiecki [13] have shown that interconnections contribute approximately $10 \%$ of the total variation in clock signal delay, i.e., skew. More skew is involved when buffers are added. In order to test the efficiency of buffer-based compensation, two configurations are suggested below:

III.4.1. Test on delay: Fig. 6.9 illustrates two branches distributing a data signal to locations at both ends of a chip. The accumulation of variations in the buffer time constants on each of the data signal paths leads to a delay variation that is observed when comparing the signals measured through the passivation windows. The insertion of this test circuit in a silicon die causes no partitioning of the multi-project prototyping die and leaves the die core empty. This allows an easy placement of the designs of other foundry users.

III.4.2. Test on skew: Fig. 6.10 illustrates a clock distribution tree with 4 leaves. The clock signal generation and driving logic is placed at the die center in order to minimize die partitioning (2 partitions only). Depending on the path followed by the clock signal from root to leaves, the buffers on that path are affected differently by process variations. This leads to a skew between the signals observed at the passivation windows.

## IV. Conclusion

In this paper, methods for minimizing process-induced skew using delay tuning were presented. Based on measurements of the clock signal at strategic locations in the clock distribution tree, delay tuning is performed either through interconnections or buffers, using laser cuts and reconfiguring a few masks. Also, experiments for testing these methods have been proposed.


Fig. 6.1. Delay tuning via interconnections


Fig. 6.2. Test chip


Fig. 6.3. Logic for generating and driving the clock signal


Fig. 6.4. HSPICE simulation of the clock signal (trace: clock signal at branch end)


Fig. 6.5. Pads for microwave probes


Fig. 6.6. Capacitor layout


Fig. 6.7. Configuration modes for the delay tuning structure (a) oscillator mode ; (b) buffer mode


Fig. 6.8. Laser tuning of buffer size


Fig. 6.9. Test on delay


Fig. 6.10. Test on skew

## Chapter 7.

## General Conclusion

This thesis is, to our knowledge, the first to aim at a global understanding of process variations in synchronous systems. The interest of devoting a thesis to this topic is clear, given that process variations are now regarded as a first-order issue and limitation for all high-speed integrated systems.

The state-of-the-art review presented in Chapter 2 is the first effort in the literature to synthesize the contributions in this field. It traces the evolution of the field and the successive obstacles that researchers have overcome, and it also covers the issues and challenges that remain. We explored how researchers have separately attempted to model the phenomenon, to understand its sources, or to propose solutions. This thesis has attempted to bring all of these approaches together.

In Chapter 3, we set out to determine the magnitude of the effect of process variations on clock skew in a regular architecture, such as a processor array, implemented in a VLSI or WSI context. We proposed first-order models of the effect of process variations on the MOSFET time constant, in the form of a uniform oriented gradient (VLSI) and a non-uniform gradient (WSI). We also proposed a new way of distributing a clock: an H-tree built entirely from minimum-sized inverters. This configuration allows the highest operating frequency that can be distributed over a clock tree, and this frequency is independent of the system size. We derived an expression for the upper bound on the skew that such a structure can produce under the process-variation models considered. For processors that communicate over long distances (e.g., processors at opposite edges of the array), this upper bound is reached for the pair of processors located at the ends of the first diagonal of the rectangle circumscribing the processor array. For processors that communicate over short distances (immediate neighbors), the bound is reached for the pair of processors that communicate along the diagonal at the center of the array. As a result, it is difficult to preserve high speed when arrays are large. In both cases, the bound is reached when the gradient orientation is approximately parallel to the first diagonal. The resulting complexities are quadratic in the system size. They are of the same order as those predicted by the deterministic models found in the literature, but they are more pessimistic than those of probabilistic models. We also proposed a generalization of this model to parametric variations over a large area (WSI scale) and to clock distribution trees combining logic and interconnections (hybrid trees). We concluded with a comparison of the performance (skew and maximum operating frequency) of metal trees, logic-based trees, and hybrid trees. It follows that logic-based trees are the best in terms of maximum operating frequency and metal trees are the best in terms of skew.

In Chapter 4, we sought to complete the synthesis of system constraints by including the power constraint, which, with portable systems, has become an essential and unavoidable issue. We therefore studied the skew and power-dissipation constraints simultaneously in order to predict the performance of a clock distribution system. This is, in our view, one of the first studies of its kind, if not the first. We first explored the effect of the skew and power constraints on the synthesis of the logic-based H-tree. On either side of a limiting system size, the system can operate in two modes, one determined by the skew constraint and the other by the power constraint. The effect of power appears to be felt only below this limiting system size. For the hybrid H-tree, we proposed two clock-buffering structures that relax the power constraint. The first is based on a non-uniform progression of buffer sizes along a clock path. This structure tends to occupy half the area of the uniform progression and to consume half the power when long interconnections are used. The second is based on a buffer circuit that uses a dual supply and a clipped signal, which increases the signal propagation speed while reducing power consumption. A heuristic combined with electrical simulations led to power savings on the order of $64 \%$ compared with the conventional method.

In Chapter 5, we proposed a spatial characterization of the effect of process variations through a parameter directly involved in timing issues: the MOSFET time constant. This characterization was carried out by measuring the oscillation period of CMOS ring oscillators operating at 500 MHz. These oscillators were implemented at different positions, on different dies, on different silicon wafers. The goal was to map the MOSFET time constant at the VLSI and WSI scales. We discovered an environmental phenomenon that modulates the frequency of the ring oscillators. It manifests itself as a strong periodic influence due to variations in the nature of the electrical components surrounding the oscillator. In addition, another component, of lower spatial frequency and acting at large scale, induces drifts in the environmental component. Very pronounced "edge" effects were observed along the periphery of the wafer. Finally, this experiment showed that the power-supply distribution network has a non-negligible effect on the oscillator frequency.

The spatial variations of the MOSFET time constant turned out to have diverse sources and, in some cases, to elude modeling. The solution advocated in Chapter 6 was therefore to measure the resultant of the different components and to compensate for it by equalizing the delays of the different paths of a clock distribution tree after the circuit has been fabricated. We presented two skew calibration techniques that rely on a laser. Calibration is performed at the clock buffers and at the interconnections. For the interconnections, external capacitors are attached to the branches, and laser cuts modify the capacitive load of the branches and hence the corresponding delays. For the buffers, two methods were presented. The first is based on measuring the frequencies of oscillators placed where buffers are planned, and the second relies on varying the buffer size. In both cases, a few masks of the topmost fabrication layers are reconfigured to account for the effect of process variations and to compensate for deterministic variations in the final version of the circuit.

This thesis nevertheless has certain limitations that should be pointed out. The model developed in Chapter 3 does not account for the capacitances at the tree leaves. These capacitive loads depend on the nature of the processors in the synchronized array and can be large. The consequence is a reduction in system bandwidth because, in the logic-based H-tree for example, the delay of the last inverter driving a leaf capacitance may be too long compared with that of a pair of minimum-sized inverters. One way to restore the bandwidth is to drive the leaf capacitance with the classical inverter chain [81]. The algorithmic methods presented in Chapter 2 solve this problem by building the tree bottom-up, which makes it possible to account for leaf loads from the outset. Chapter 4 does not propose an analytical model of the trade-off between clock skew and dissipated power for the hybrid H-tree. Writing such a model requires an analytical expression for the power consumption of a buffer driving a load that is not purely capacitive but resistive-capacitive. Moreover, as shown in Chapter 2, section V, the best skew results are obtained when buffers are inserted within the branches. The calculation would then be more difficult than when insertion is done at the branching points. In addition, the RC interconnection model of a tree path is less straightforward to write given the branching over several tree levels. In Chapter 5, our contribution lay much more in qualitative analysis (identification of the sources of time-constant variations and spectral analysis) than in quantitative correlation. Further effort is needed in this direction. Also, an open question concerning the component due to power-supply fluctuations is how to find the optimal distribution network. Some of the experimental circuits proposed to validate the delay calibration approach (Chapter 6) remain to be tested, and others remain to be fabricated and tested. Furthermore, these techniques involve an additional prototyping phase and require laser equipment. A more flexible approach would be to resort to electrical self-compensation techniques. Indeed, PLL circuits can be used to detect the skew between two leaves of the clock tree, convert it into a pulse, and use it to balance the delays of the two clock paths. Another open question would be to explore the issues of clock distribution in the presence of process variations in FPGA and MCM contexts.

We have published or submitted for publication all of our work. In addition, we designed and fabricated two experimental circuits in Nortel's $1.2 \mu \mathrm{~m}$ CMOS and $0.8 \mu \mathrm{~m}$ BiCMOS technologies in order to give our results a more solid foundation. The first was tested successfully. However, a design error prevented testing of the second. Fabrication of the buffer calibration circuits is planned.

## References

[1] KUGELMASS, S.D. et STEIGLITZ, K. (1988) "A Probabilistic Model for Clock Skew" Proceedings of International Conference on Systolic Arrays, publié par IEEE, NY, USA.
[2] AFGHAHI, A. et SVENSSON, C. (1989) "Calculation of Clock Path Delay and Skew in VLSI Synchronous Systems" European Conference on Circuit Theory and Design, IEE conference publication No. 308, London, England, pp. 265-269.
[3] FISHER, A.L. et KUNG, H.T. (1985) "Synchronizing Large VLSI Processor Arrays" IEEE Transactions on Computers, Vol. C-34, No. 8, août.
[4] NEKILI, M., SAVARIA, Y., BOIS, G. et BENNANI, M. (1993) "Logic-Based H-Trees for Large VLSI Processor Arrays: A Novel Skew Modeling and High-Speed Clocking Method" Proceedings of the 5th International Conference on Microelectronics (ICM'93), Saudi Arabia, pp. 144-147.
[5] NEKILI, M., BOIS, G. et SAVARIA, Y. (1997) "Pipelined H-trees for High-Speed Clocking of Large Integrated Systems in Presence of Process Variations" IEEE Transactions on VLSI Systems, Vol.5, No.2, pp. 161-174, juin.
[6] PELGROM, M.J., DUINMAIJER, A.C.J. et WELBERS, A.P.G. (1989) "Matching Properties of MOS Transistors", IEEE J. Solid-State Circuits, Vol. SC-24, No. 5.
[7] SHYU, J.B., TEMES, G.C. et YAO, K. (1982) "Random Errors in MOS capacitors" IEEE JSSC, vol. SC-17, pp. 1070-1075.
[8] SHYU, J.B., TEMES, G.C. et KRUMMENACHER, F. (1984) "Random Errors Effects in Matched MOS Capacitors and Current Sources", IEEE JSSC, vol. SC-19, pp. 948-955.
[9] LAKSHMIKUMAR, K.R., HADAWAY, R.A. et COPELAND, M.A. (1986) "Characterization and Modeling of Mismatch in MOS Transistors for Precision Analog Design", IEEE JSSC, Vol. SC-21, pp. 1057-1066.
[10] KARHUNEN, E.S., FERNANDEZ, F.V. et VAZQUEZ, R. (1997) "Mismatch Distance Term Compensation in Centroid Configurations with Nonzero-Area Devices", ISCAS, pp. 1644-1647, Hong Kong.
[11] PAVASOVIC, A., ANDREOU, A.G., et WESTGATE, G.R. (1994) "Characterization of Subthreshold MOS-Mismatch in Transistors for VLSI Systems", Journal for VLSI Signal Processing, juin.
[12] ANDREOU, A.G. et BOAHEN, K.A. (1994) "Neural Information Processing II" dans "Analog VLSI" par M. Ismail et T. Fiez, McGraw-Hill.
[13] GNEITING, T.M. et JALOWIECKI, I.P. (1995) "Influence of Process Parameter Variations on the Signal Distribution Behavior of Wafer Scale Integration Devices", IEEE Trans. on Components, Packaging and Manufacturing Technology, Part B, Vol. 18, No. 3, août.
[14] NEKILI, M., SAVARIA, Y. et BOIS, G. (1997) "Characterization of Process Variations via MOS Transistor Time Constants in VLSI \& WSI". Accepté pour publication au Journal of Solid-State Circuits, octobre.
[15] MALY et al. (1988) "VLSI Prediction and Estimation", IEEE Trans. on CAD, Vol. CAD-5, No.1, p. 117, janvier.
[16] KEEZER, D.C. et NIGAM, N. (1993) "A comparative study of clock distribution approaches for WSI" dans Proceedings of the IEEE International Wafer Scale Integration, pp. 243-251.
[17] SHOJI, M. (1986) "Elimination of Process-dependent Clock Skew in CMOS VLSI" IEEE Journal of Solid-State Circuits, Vol. SC-21, No. 5, octobre.
[18] THEUNE, D. et al. (1992) "HERO: Hierarchical EMC-Constrained Routing", ICCAD, pp. 468-471.
[19] VITTOZ, E.A. (1985) "The Design of High-Performance Analog Circuits on Digital CMOS Chips", IEEE Journal of Solid-State Circuits, Vol. SC-20, No. 3, juin.
[20] HANAN, M. (1966) "On Steiner's Problem with Rectilinear Distance", J. SIAM on Applied Mathematics, Vol. 14, No.2, USA, mars.
[21] KRUSKAL, J.B. (1956) "On the Shortest Spanning Subtree of a Graph", Proceedings of American Mathematics Society, pp. 48-50.
[22] JACKSON, M.A.B. et al. (1990) "Clock Routing for High-Performance ICs" Design Automation Conference, pp. 573-579, IEEE/ACM.
[23] KAHNG, A. et al. (1991) "High-Performance Clock Routing Based on Recursive Geometric Matching" Design Automation Conference, pp. 322-327, IEEE/ACM.
[24] ZHU, Q. et DAI, W.W. (1992) "Perfect-balance Planar Clock Routing with Minimal Path-Length", ICCAD, pp. 473-476.
[25] CHOU, N. et CHENG, C. (1993) "Wire Length and Delay Minimization in General Clock Net Routing" ICCAD, pp. 552-555.
[26] FRIEDMAN, E.G. (1995) "Clock Distribution Networks in VLSI Circuits and Systems" IEEE Press, p. 17.
[27] PULLELA, S., MENEZES, N. et PILLAGE, L.T. (1993) "Reliable Non-Zero Clock Trees Using Wire Width Optimization" Proc. of ACM/IEEE Design Automation Conference, pp. 165-170, juin.
[28] MENEZES, N., BALIVADA, A., PULLELA, S. et PILLAGE, L.T. (1993) "Skew Reduction in Clock Trees Using Wire Width Optimization" Proc. of IEEE Custom Integrated Circuits Conference, pp. 9.6.1-9.6.4, mai.
[29] PULLELA, S. et al., (1993) "Skew and Delay Optimization for Reliable Buffered Clock Trees" ICCAD, pp. 556-562.
[30] TSAY, R-S. (1991) "Exact Zero Skew", Proceedings of ICCAD'91, pp. 336-339, novembre.
[31] TSAY, R-S. (1993) "An Exact Zero-Skew Clock Routing Algorithm", IEEE Trans. on CAD, pp. 242-249, février.
[32] KHAN, W. et al. (1992) "Zero Skew Clock Routing in Multiple-Clock Synchronous Systems", ICCAD, pp. 464-467.
[33] LI, Y. et JABRI, M.A. (1992) "A Zero-Skew Clock Routing Scheme for VLSI Circuits" ICCAD, pp. 458-461.
[34] EDAHIRO, M. (1993) "Delay Minimization for Zero-Skew Routing" ICCAD, pp. 563-566.
[35] ELMORE, W.C. (1948) "The Transient Response of Damped Linear Networks with Particular Regard to Wide Band Amplifiers" Journal of Applied Physics, Vol. 19, pp. 55-63.
[36] CHO, J.D. et SARRAFZADEH, M. (1993) "A Buffer Distribution Algorithm for High-Speed Clock Routing", 30th ACM/IEEE Design Automation Conference, pp. 537-540.
[37] XI, J.G. et DAI, W.W-M. (1995) "Buffer Insertion and Sizing Under Process Variations for Low Power Clock Distribution", 32nd Design Automation Conference.
[38] CHEN, Y.P. et WONG, D.F. (1996) "An Algorithm for Zero-Skew Clock Tree Routing with Buffer Insertion", European Design \& Test Conference.
[39] TELLEZ, G.E. et SARRAFZADEH, M. (1997) "Minimal Buffer Insertion in Clock Trees with Skew and Slew Rate Constraints", IEEE Trans. on CAD of Integrated Circuits \& Systems, Vol. 16, No. 4.
[40] LIN, S. et WONG, C.K. (1994) "Process-Tolerant Clock Skew Minimization", ICCAD.
[41] COX, D.T. et al. (1989) "VLSI Performance Compensation for Off-Chip Drivers and Clock Generation", IEEE Custom Integrated Circuits Conference, pp. 14.3.1-14.3.4
[42] ASAHINA, K. et al. (1993) "Output Buffer with On-Chip Compensation Circuit" IEEE Custom Integrated Circuits Conference, pp. 29.1.1-29.1.4.
[43] CHENGSON, D. et al. (1990) "A Dynamically Tracking Clock Distribution Chip with Skew Control" IEEE Custom Integrated Circuits Conference, pp. 15.6.1-15.6.4.
[44] WATSON, R.B. et al. (1992) "Clock Buffer Chip with Absolute Delay Regulation over Process and Environmental Variations" IEEE Custom Integrated Circuits Conference, pp. 25.2.1-25.2.5.
[45] NASSIF, S.R., STROJWAS, A.J. et DIRECTOR, S.W. (1986) "A Methodology for Worst-Case Analysis of Integrated Circuits", IEEE Trans. on CAD, Vol. CAD-5, No. 1, pp. 104-113, janvier.
[46] NEVES, J.L. et FRIEDMAN, E.G. (1996) "Optimal Clock Skew Scheduling Tolerant to Process Variations", 33rd Design Automation Conference.
[47] LEE, C.M. et MURPHY, B.T. (1987) "Trimmable Loading Elements to Control Clock Skew," Patent \#4,639,615, AT\&T Bell Laboratories, January 27, 1987; IEEE Journal of Solid-State Circuits, Vol. SC-22, No. 6, pp. 1220, décembre.
[48] NEKILI, M., SAVARIA, Y. \& BOIS, G. (1997) "Minimizing Process-induced Skew Using Delay Calibration in Clock Distribution Networks", IEEE International Workshop on Clock Distribution Networks, Atlanta, Georgia, USA, 9-10 octobre.
[49] COMERFORD, R. (1992) "How DEC Developed Alpha", IEEE Spectrum, Vol. 29, No. 7, pp. 26-31.
[50] BAKOGLU, H.B. (1990) "Circuits, Interconnections and Packaging for VLSI" Addison-Wesley.
[51] BAKOGLU, H.B., WALKER, J.T., et MEINDL, J.D. (1986) "A Symmetric Clock-distribution Tree and Optimized High-speed Interconnections for Reduced Clock Skew in ULSI \& WSI Circuits" Proceedings of IEEE International Conference on Computer Design: VLSI in Computers, ICCD.
[52] FRIEDMAN, E.G. et POWELL, S. (1986) "Design and Analysis of a Hierarchical Clock Distribution System for Synchronous Standard Cell/Macrocell VLSI" IEEE Journal of Solid-State Circuits, vol. SC-21, No. 2, avril.
[53] KUGELMASS, S.D. et STEIGLITZ, K. (1990) "An Upper Bound on Expected Clock Skew in Synchronous Systems" IEEE Transactions on Computers, Vol. C-39, No. 12, pp. 1475-1477, décembre.
[54] DHAR, S., FRANKLIN, M.A. et WANN, D.F. (1984) "Reduction of Clock Delays in VLSI Structures" Proceedings of International Conference on Computer Design (ICCD84) Port Chester NY, 8-11 octobre.
[55] REPETA, M. (1987) "Fabrication et Caractérisation de MESFET sur des Couches Épitaxiales d'Arséniure de Gallium" Thèse de M.Ing., Département de génie physique, École Polytechnique de Montréal, avril.
[56] BASSETT, P.D., GLASSER, L.A., RETTBERG, R.D. (1987) "Dynamic Delay Adjustment: A Technique for High-Speed Asynchronous Communication" Proceedings of the Fourth MIT Conference.
[57] SAVARIA, Y. (1988) "Conception et Verification des Circuits VLSI" éditions de l'Ecole Polytechnique de Montreal.
[58] MEAD, C. et CONWAY, L. (1980) "Introduction to VLSI Systems", Reading, MA, Addison-Wesley.
[59] BROWN, D. et SCOTT, A. (1990) "Design rules and process parameters for the Northern Telecom CMOS4S process", Canadian Microelectronics Corporation, Report IC90-01, 8 février.
[60] Meta-Software (1990) "HSPICE H9001" User's Manual .
[61] NEKILI, M., BOIS, G. et SAVARIA, Y. (1994) "Deterministic Skew Modeling and High-Speed Clocking using Logic-Based and Mixed H-trees within Integrated Systems", rapport technique, EPM/RT 94/09, Ecole Polytechnique de Montreal.
[62] DALLY, W.J. (1987) "A VLSI Architecture for Concurrent Data Structures" Kluwer Academic Publishers.
[63] KEEZER, D.C., et JAIN, V.K. (1992) "Design and Evaluation of Wafer Scale", Proceedings of the 1992 Conference on WSI, pp. 168-175.
[64] NEKILI, M. et SAVARIA, Y. (1993) "Parallel Regeneration of Interconnections in VLSI \& ULSI Circuits" IEEE International Symposium on Circuits and Systems, Chicago, Illinois, 3-6 mai.
[65] MCDONALD, J.F. et al., (1984) "Trials of Wafer Scale Integration" IEEE Spectrum, octobre.
[66] WESTE, N. et ESHRAGHIAN, K. (1993) "Principles of CMOS VLSI Design" Reading, MA: Addison-Wesley.
[67] VEENDRICK, H.J.M. (1984) "Short-Circuit Dissipation of Static CMOS Circuitry and Its Impact on the Design of Buffer Circuits", IEEE JSSC, Vol. SC-19, No. 4, août.
[68] CONG, J. et KOH, C-K. (1994) "Simultaneous Driver and Wire Sizing for Performance and Power Optimization", IEEE Trans. on VLSI, Vol. 2, No. 4, pp. 408-425, décembre.
[69] NEKILI, M., SAVARIA, Y. et BOIS, G. (1994) "A Variable-Size Parallel Regenerator for Long Integrated Interconnections", Proceedings of Midwest Symposium on Circuits and Systems (MWSCAS'94), Lafayette, Louisiana, USA, août.
[70] NEKILI, M., SAVARIA, Y. et BOIS, G. (1994) "A Fast Low-Power Driver for Long Interconnections in VLSI Systems" International Symposium on Circuits \& Systems ISCAS'94 Proceedings, London, England, Vol. 4, pp. 4.343-4.346, 2-5 juin.
[71] KEYES, R.W. (1975) "Physical Limits in Digital Electronics" Proc. IEEE, Vol. 63, No. 5, pp. 740-767.
[72] LI, E.H., et NG, H.C. (1991) "Parameter Sensitivity for Narrow-Channel MOSFET's", IEEE Electron Device Letters, Vol. 12, No. 11, pp. 608-610, novembre.
[73] FRIEDMAN, E.G. (1995) "Clock Distribution Networks in VLSI Circuits \& Systems", Chapitre 1, IEEE Press.
[74] SHOJI, M. (1988) "CMOS Digital Circuit Technology", chapter 4, section 7, p. 176, Prentice-Hall inc.
[75] Meta-Software, inc. (1992) HSPICE User's Manual.
[76] "High Performance Microwave Probes", Picoprobe Model 40A, by GGB Industries inc., Florida, USA.
[77] MATLAB (1992), "High-Performance Numeric Computation \& Visualization Software", Reference Guide, The MathWorks Inc.
[78] KEEZER, D.C. et JAIN, V.K. (1992) "Design and Evaluation of Wafer Scale Clock Distribution," dans Proc. IEEE Int. Conf. Wafer Scale Integration, pp. 168-175.
[79] BEAUCHAMP, K., et YUEN, C. (1979) "Digital Methods for Signal Analysis", Georges Allen \& Unwin.
[80] Nortel inc., (1993) "Design rules for CMC 0.8-micron BiCMOS", une version de NTE BATMOS, ICI-040R02, février.
[81] LIN, H.C. et LINHOLM, L.W. (1975) "An optimized Output Stage for MOS Integrated Circuits" IEEE Journal of Solid-State Circuits, Vol. SC-10, No. 2, pp. 106-109, avril.
[82] SINHA, A.K., COOPER, J.A. et LEVINSTEIN, H.J. (1982) "Speed Limitation due to Interconnect Time Constants in VLSI Integrated Circuits" IEEE Electron Device Letters, vol. EDL-3, pp.90-92, avril.
[83] SARASWAT, K.C. et MOHAMMADI, F. (1982) "Effect of Scaling of Interconnections on the Time Delay of VLSI Circuits" IEEE JSSC, vol. SC-17, No.2, pp. 275-280, avril.
[84] KEYES, R.W. (1982) "The Wire Limited Logic Chip" IEEE JSSC, Vol. SC-17, No. 6, pp. 1232-1233, décembre.
[85] DHAR, S., FRANKLIN, M.A. et WANN, D.F. (1984) "Reduction of Clock Delays in VLSI Structures" Proceedings of International Conference on Computer Design (ICCD84) Port Chester NY, 8-11 octobre.
[86] NEKILI, M. et SAVARIA, Y. (1992) "Optimal Methods for Driving Long Interconnections in VLSI Circuits", IEEE International Symposium on Circuits and Systems, San Diego, California, USA, pp. 21-24, 10-13 mai.
[87] LIU, D. et SVENSSON, C. (1993) "Trading Speed for Low Power by Choice of Supply and Threshold Voltages", IEEE JSSC, Vol. 28, No.1, pp. 10-17, janvier.
[88] CHEN, H. et KANG, S. (1987) "Performance Optimization for DOMINO CMOS Circuit Modules" Proceedings of the International Conference on Computer Design, pp. 522-525.
[89] BAKOGLU, H.B. et MEINDL, J. (1985) "Optimal Interconnection Circuits for VLSI" IEEE Trans. on Electron Devices, Vol. ED-32, No. 5, pp. 903-909, mai.
[90] GLASSER, L.A. et DOBBERPUHL, D.W. (1985) "The Design and Analysis of VLSI Circuits" Addison Wesley.
[91] MEAD, C. et REM, M. (1982) "Minimum Propagation Delays in VLSI" IEEE Journal of Solid-State Circuits, vol. SC-17, No. 4, pp. 773-775, août.
[92] SAKURAI, T. (1983) "Approximation of Wiring Delay in MOSFET LSI" IEEE Journal of Solid-State Circuits, vol. SC-18, pp. 418-426, août.
[93] VITTAL, A. et MAREK-SADOWSKA, M. (1997) "Low-Power Buffered Clock Tree Design", IEEE Trans. on CAD of Integrated Circuits and Systems, Vol. 16, No. 9, septembre.
[94] LIU, D. et SVENSSON, C. (1994) "Power Consumption Estimation in CMOS VLSI Circuits", IEEE J. Solid-State Circuits, Vol. 29, pp. 663-670, juin.
[95] NEKILI, M., SAVARIA, Y. et BOIS, G. (1998) "Design of Clock Distribution Networks in Presence of Process Variations", IEEE 8th Great Lakes Symposium on VLSI, pp. 95-103, Louisiana, pp. 95-102, 19-21 février.
[96] NEKILI, M. (1998) "Synthesis of Clock Distribution Networks in Presence of Process Variations", Ph.D. dissertation, Ecole Polytechnique de Montréal, Québec, Canada, juin.

## Appendix A.

## Number of inverters in a Logic-Based H-tree

The number $n$ of inverters in the Logic-Based H-tree is calculated as the ratio of:

- $L$, the sum of the lengths of all H-tree branches, and
- $D_{\text{inv}}$, the size of an inverter.

A branch at level $i$ of the H-tree has a length $l(i)$, such that:

$$
l(2 k+1)=l(2 k+2)=\frac{D}{2^{k+1}}\left(k \in\left[0, \frac{N}{2}-1\right]\right)
$$

As a consequence, $L$ can be written as follows:

$$
\begin{aligned}
L &= D+\sum_{i=1}^{N} 2^{i} \times l(i)=D+\sum_{j=1}^{\frac{N}{2}}\left(2^{2 j-1} \times l(2 j-1)+2^{2 j} \times l(2 j)\right) \\
&= D+\sum_{j=1}^{\frac{N}{2}}\left(2^{2 j-1} \times \frac{D}{2^{j}}+2^{2 j} \times \frac{D}{2^{j}}\right)=D+\frac{3}{2} \times D \times \sum_{j=1}^{\frac{N}{2}} 2^{j}=\left(3 \times 2^{N / 2}-2\right) \times D
\end{aligned}
$$

and therefore, neglecting the constant 2 with respect to the exponential term, the parameter $n$ becomes:

$$
n=3 \times 2^{N / 2} \times \frac{D}{D_{i n v}}
$$
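
The closed form above can be checked numerically. The following is a minimal sketch, assuming an even number of levels $N$ and exact rational arithmetic; the function names are illustrative and not part of the original text.

```python
from fractions import Fraction


def branch_length(i, D):
    """Length l(i) of a branch at level i (1-based): l(2k+1) = l(2k+2) = D / 2**(k+1)."""
    k = (i - 1) // 2
    return Fraction(D) / 2 ** (k + 1)


def total_length(N, D):
    """L = D + sum over levels i = 1..N of 2**i * l(i), for an even N."""
    return Fraction(D) + sum(2 ** i * branch_length(i, D) for i in range(1, N + 1))


def inverter_count(N, D, D_inv):
    """Approximate n = 3 * 2**(N/2) * D / D_inv (the constant 2 is neglected)."""
    return 3 * 2 ** (N // 2) * Fraction(D) / Fraction(D_inv)


# Compare the level-by-level sum with the closed form (3 * 2**(N/2) - 2) * D for D = 1.
for N in (2, 4, 6, 8):
    assert total_length(N, 1) == 3 * 2 ** (N // 2) - 2
```

For small $N$ the neglected constant is not negligible (for $N = 2$ the exact length is $4D$ against an approximation of $6D$), so the approximation for $n$ is best read as a large-$N$ estimate.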

## Appendix B.

# A Variable-Size Parallel Regenerator for Long Integrated Interconnections

Mohamed NEKILI, Yvon SAVARIA and Guy BOIS<br>Department of Electrical Engineering and Computer Science<br>P.O. Box 6079, Station "Centre-Ville", Montreal, Quebec, Canada H3C 3A7<br>Phone: (514) 340-4737; Email: nekili@vlsi.polymtl.ca


#### Abstract

This paper presents a low-power and low-area variant of the recently proposed parallel regeneration technique (PRT), thus providing an improved technique for the regeneration of long integrated interconnects. Taking advantage of the particular design of the regenerator in PRT, we propose a variant (called VPRT) in which the regenerators along the interconnection have a variable size. Electrical simulations involving different interconnection lengths and technological processes show that the interconnection delay obtained with VPRT is smaller than with PRT. A performance analysis combining area (A), delay (T), and power dissipation (P) shows that VPRT leads to an ATP metric at least 4 times better than with PRT.


## I. INTRODUCTION

In general, delay on a long integrated interconnection grows as the square of its length. Furthermore, as component sizes are shrunk [82] [83], the switching delay of logic decreases, but the size of chips tends to grow. Due to the increased complexity of the systems being integrated, interconnection delays grow and become a critical performance bottleneck. Indeed, complex VLSI circuits increasingly rely on long interconnections [84].

In previous work [85] [86] [64], different structures were proposed to solve this problem. In [64], we suggested a regeneration technique which was shown to achieve an AT performance metric lower than conventional methods (RID), based on repetitively inserted drivers [86]. In this paper, we propose a low-power and low-area variant of this technique, called VPRT. We also propose a performance analysis of this regenerating structure under a more general metric: an ATP metric (A for area, T for delay and P for power).

## II. CIRCUIT DESCRIPTION

The regeneration structure (Fig. B.1) introduced in [64] uses a regenerator (transistors $T_{1}$ through $T_{4}$) which is inserted at regular intervals in the interconnection to be regenerated, thus dividing the interconnection into segments. Let us assume that the line is first precharged to $V_{dd}$ by means of transistors $T_{1}$, controlled by a globally available and properly deskewed clock signal. Distributing such a signal with low skew and characterizing this skew has been discussed elsewhere [4]. In this case, only the discharge process needs to be analyzed, since the propagation delay of a logical "1" is negligible. The signal to be sent through the interconnection is then presented to the gate of the emitter transistor, $T_{e}$, which initiates the discharge of the line at point $B$. At different points along the interconnection, a pull-down transistor ($T_{3}$) accelerates this discharge as soon as a sense gate (transistors $T_{2}$ and $T_{4}$) detects this transition. At the end of the interconnection, a detector is used in order to detect the signal as early as possible.

Due to the complexity of the circuit in Fig. B.1, which follows from the high degree of coupling between stages, the analysis in [64] was simplified, first by assuming that the regenerators are regularly spaced, and second by using heuristics so that the design of the regenerator is mainly determined by one parameter (the width $w_{3}$ of transistor $T_{3}$). We also assumed that the different regenerators were of the same size. Based on these assumptions and on electrical simulations, our previous research led us to the best $w_{3}$ in terms of an AT metric.

Because the discharge process is initiated from left to right in Fig. B.1, in practice, a regenerator drives only the part of the interconnection it sees toward the interconnection output. It is then obvious that the more we move toward the beginning of the line, the more powerful a regenerator needs to be. For that reason, we suggest a variant of the regular structure proposed in [64], where the regenerators are still regularly spaced, whereas their sizes decrease as we move toward the interconnection output. A simplified diagram of the proposed structure is shown in Fig. B. 2 (without the emitter and the detector).

The size reduction of the regenerators leads to a reduction of the total area, as well as of the power dissipation. The delay is also expected to decrease, because the parasitic capacitances with which the regenerators load the interconnection are smaller.

## III. PERFORMANCE ANALYSIS UNDER AN ATP METRIC

Power dissipation is today a very important limitation to the computing capacity of VLSI chips [87]. Therefore, the current section presents a performance analysis of VPRT (Fig. B.2) under a metric that includes power dissipation.

Let us first assume that:
- $\alpha$ is the optimal regenerator size (the sum of the widths of transistors $T_{1}$ to $T_{4}$) determined for the regular design in [64];
- $P_{0}$ is the power dissipated by a minimum-sized regenerator;
- $n_{s}$ is the number of segments in the interconnection determined for PRT in [64];
- a regenerator area is described by the active area of transistors $T_{1}$ to $T_{4}$ (Chen's model [88]), i.e., the sum of the transistor widths, if we keep all transistor channel lengths at the minimum allowed by the technology used, that is $2 \lambda$, where $\lambda$ is the technology minimum feature size.

We now configure the structure of Fig. B.2 as follows (from left to right):

- the first regenerator has a size of $\alpha$,
- at each of the following stages, the regenerator size is linearly decreased by an amount of

$$
\frac{\alpha-\beta}{n_{s}-1}
$$

until we reach the last regenerator, whose size is $\beta$. The parameter $\beta$ is the optimal regenerator size when a regenerator drives only one interconnection segment. Of course, any other non-linear decreasing rate of the regenerator size can be chosen, at the cost, however, of an increase in the analysis complexity. Moreover, as mentioned earlier, due to the intrinsic complexity of the circuit in Fig. B.1, no analytical delay model of manageable complexity could be developed, and therefore no analytical optimization was performed.

Thus, the total area A occupied by the regenerators (the area occupied by the interconnection being the same for PRT and VPRT) can then be expressed as:

$$
\begin{gathered}
A=2 \lambda \times\left(\alpha+\left[\alpha-1 \times\left(\frac{\alpha-\beta}{n_{s}-1}\right)\right]\right. \\
+\left[\alpha-2 \times\left(\frac{\alpha-\beta}{n_{s}-1}\right)\right] \\
\left.+\ldots+\left[\alpha-\left(n_{s}-2\right) \times\left(\frac{\alpha-\beta}{n_{s}-1}\right)\right]+\beta\right)
\end{gathered}
$$

This sum can also be rewritten as:

$$
\begin{equation*}
A=2 \lambda \times \frac{n_{s} \times(\alpha+\beta)}{2} \tag{B.1}
\end{equation*}
$$

By contrast, the regular design (PRT) occupies an area of:

$$
2 \lambda \times n_{s} \times \alpha
$$
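
The comparison between the two areas can be sketched numerically by summing the regenerator sizes term by term. This is only an illustration: the function names are hypothetical, sizes are taken as integers in arbitrary width units, and the common factor $2\lambda$ is dropped because it cancels in the ratio. The $n_s$ sizes form an arithmetic series from $\alpha$ down to $\beta$, so their sum is $n_s(\alpha+\beta)/2$.

```python
from fractions import Fraction


def vprt_sizes(alpha, beta, n_s):
    """Sizes of the n_s regenerators, decreasing linearly from alpha to beta (n_s >= 2)."""
    step = Fraction(alpha - beta, n_s - 1)
    return [alpha - k * step for k in range(n_s)]


def prt_over_vprt(alpha, beta, n_s):
    """Ratio of the PRT total size (n_s * alpha) to the summed VPRT sizes."""
    return Fraction(n_s * alpha) / sum(vprt_sizes(alpha, beta, n_s))


# The term-by-term sum matches the arithmetic-series value n_s * (alpha + beta) / 2.
assert sum(vprt_sizes(10, 2, 5)) == Fraction(5 * (10 + 2), 2)

# When beta is much smaller than alpha, the ratio approaches 2.
print(float(prt_over_vprt(100, 1, 16)))
```

Because the dissipated power is taken as proportional to the regenerator size, the same ratio applies to power, and the product of the two ratios gives the area-power gain.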

As confirmed by Tables B.1 and B.2, based on HSPICE electrical simulations (Fig. B.3), the total delay T is smaller with the structure of Fig. B.2 (VPRT), due to the lower parasitic capacitances of the regenerators. In Tables B.1 and B.2, the delays of both the regular design (PRT) and the variable-size design (VPRT) are listed as a function of:

- the technology process: CMOS4S ($1.2 \mu \mathrm{~m}$) and BATMOS $^{1}$ ($0.8 \mu \mathrm{~m}$) of Northern Telecom,
- and the interconnection length (5 cm, 10 cm, 20 cm and 40 cm). These length values are typical of today's largest interconnections in ULSI and WSI.

Fig. B.3 shows an HSPICE electrical simulation for a 10 cm line using the CMOS4S and BATMOS processes. This simulation shows the output voltage at point Out (Fig. B.1) after a voltage step has been injected into the line through point In (Fig. B.1).


Fig. B. 3. Output voltage of a 10 cm line versus time using PRT and VPRT with CMOS4S and BATMOS

Tables B.1 and B.2 also show the delay T achieved with the conventional method (RID) [64]. RID is based on repetitively inserted drivers, which are simple fine-tuned inverters. In terms of delay, we notice from these tables that VPRT performs up to 4.5 times better than RID. Also, the interconnection speed of VPRT is approximately invariant with the interconnection length.

The power $P$ dissipated by the regenerators (excluding the $\mathrm{CV}^{2}$ of the wire, which is the same for all configurations) is mainly due to transistors $T_{2}$ and $T_{3}$, and it is proportional to their widths (Fig. B.1). This power is then proportional to the area of the regenerator. Thus, similarly to the analysis of area, we can prove that the total dissipated power by the regenerators is:

$$
\begin{equation*}
P=\frac{n_{s} \times(\alpha+\beta)}{2} \times P_{0} \tag{B.2}
\end{equation*}
$$

against $n_{s} \times \alpha \times P_{0}$ for PRT.
Using Eq. B.1, Eq. B.2 and the corresponding values of area and power with PRT, we come to the conclusion that, when the number of segments becomes large (i.e., $\beta \ll \alpha$), which is the case for long interconnections, the regular design (PRT) (Fig. B.1) tends to occupy an area two times larger than VPRT (Fig. B.2) and tends to dissipate two times the power dissipated by VPRT. In this case, even neglecting the speed improvement (i.e., a lower T), VPRT (Fig. B.2) tends to offer an ATP metric at least 4 times better than PRT.

## IV. CONCLUSION

We have presented a low-power and low-area variant of the recently proposed parallel regeneration technique (PRT), thus providing higher performance for the regeneration of long integrated interconnects. Taking advantage of the particular design of the regenerator in PRT, we proposed a variant (called VPRT) in which the regenerator size is decreased along the interconnection. As a consequence, the total area occupied by the regenerators as well as the power dissipation are reduced. The regenerators' parasitic capacitances being smaller with VPRT, the delay was also expected to decrease. This was confirmed using electrical simulations involving different interconnection lengths and different technological processes. Using a performance analysis combining area (A), delay (T), and power dissipation (P), we showed that VPRT leads to an ATP metric at least 4 times better than with PRT.


Fig. B.1. The Parallel Regeneration Technique (PRT) introduced in [64]


Fig. B. 2. A Variable-size PRT (VPRT)

Table B.1: Interconnection Delays (CMOS4S)

| Technology <br> and <br> Interconnection <br> Length | PRT [64] | VPRT | RID [64] |
| :---: | :---: | :---: | :---: |
| CMOS4S <br> 5 cm | 1.1 ns | 0.99 ns | 4.51 ns |
| CMOS4S <br> 10 cm | 2.37 ns | 2.34 ns | 9.04 ns |
| CMOS4S <br> 20 cm | 5.94 ns | 5.8 ns | 18.10 ns |
| CMOS4S <br> 40 cm | 16 ns | 14.9 ns | 36.17 ns |
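
The delay ratios discussed in the text can be recomputed from the CMOS4S figures of Table B.1. The snippet below is a small sketch; the dictionary layout and the helper name `speedup` are illustrative choices, and the numbers are transcribed from the table.

```python
# Delays in ns transcribed from Table B.1 (CMOS4S), keyed by line length in cm.
delays_ns = {
    5: {"PRT": 1.1, "VPRT": 0.99, "RID": 4.51},
    10: {"PRT": 2.37, "VPRT": 2.34, "RID": 9.04},
    20: {"PRT": 5.94, "VPRT": 5.8, "RID": 18.10},
    40: {"PRT": 16.0, "VPRT": 14.9, "RID": 36.17},
}


def speedup(length_cm, baseline, method="VPRT"):
    """Delay of the baseline method divided by the delay of the given method."""
    row = delays_ns[length_cm]
    return row[baseline] / row[method]


for length_cm in delays_ns:
    print(f"{length_cm:>2} cm: RID/VPRT = {speedup(length_cm, 'RID'):.2f}, "
          f"PRT/VPRT = {speedup(length_cm, 'PRT'):.2f}")
```

The largest RID/VPRT ratio occurs for the shortest line, and the gain of VPRT over PRT in delay alone is modest; the main benefit of VPRT lies in area and power.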

Table B.2: Interconnection Delays (BATMOS)

| Technology <br> and <br> Interconnection <br> Length | PRT [64] | VPRT | RID [64] |
| :---: | :---: | :---: | :---: |
| BATMOS <br> 5 cm | 0.80 ns | 0.75 ns | 3.68 ns |
| BATMOS <br> 10 cm | 2.40 ns | 2.30 ns | 7.41 ns |
| BATMOS <br> 20 cm | 5.75 ns | 5.60 ns | 14.84 ns |
| BATMOS <br> 40 cm | 14.27 ns | 13.68 ns | 29.66 ns |

## Appendix C.

# A FAST LOW-POWER DRIVER for LONG INTERCONNECTIONS in VLSI SYSTEMS 

Mohamed NEKILI, Yvon SAVARIA and Guy BOIS

Department of Electrical Engineering and Computer Science
P.O. Box 6079, Station A, Montreal, Quebec, Canada H3C 3A7

Phone: (514)-340-4737; Email: nekili@vlsi.polymtl.ca


#### Abstract

Propagation delays of signals on long interconnections and power consumption are very important limitations to the computing capacity of VLSI chips. This paper therefore presents a low-power circuit for the fast propagation of signals on long interconnections in VLSI systems. Using electrical simulations with HSPICE guided by a heuristic, we present two designs that perform better in terms of speed and area than the conventional method. With both designs, power dissipation is reduced by 64%.


## I. INTRODUCTION

In general, the delay on a long integrated interconnection grows as the square of its length. Furthermore, as component sizes are shrunk [82] [83], switching delay of logic decreases, but the size of the chips tends to grow. Due to the increased complexity of the systems being integrated, interconnection delays grow and they become a critical performance bottleneck. Indeed, complex VLSI circuits increasingly rely on long interconnections [84].

In previous work [85] [89] [64], different structures were proposed to solve this problem. However, these publications focused on trade-offs between area and delay, and they do not take into account power consumption, which is today a very important limitation to the computing capacity of VLSI chips [87].

Each node in an integrated circuit has an associated power consumption of $f_{d} C_{L} V^{2}$ [90], where $f_{d}$ is the average switching frequency, $C_{L}$ is the total node capacitance and $V$ is the voltage swing on that node. Generally, for CMOS circuits supplied at $V_{dd}$, the voltage swing $V$ is $V_{dd}$. This paper focuses on reducing power consumption through the quadratic term in $V$. A low-power circuit is presented. This circuit uses a dual supply and a reduced signal swing, which allows the propagation speed to be increased while reducing power consumption.

## II. CIRCUIT DESCRIPTION

The delay of a long line, with distributed resistive and capacitive components, grows as the square of its length. This results in a propagation speed which decreases with line length [62] [57]. To avoid this dependence, a common solution consists in regularly inserting repeaters in the line (Fig. C.1) [85] [91] [86]. Let us consider the components in Fig. C.1. The line segments of Fig. C. 1 may be modeled in different ways [92]. A good approximation of line delay in the interconnection is obtained with the $\pi$ model described in Fig. C.2. This line model allows an accuracy better than $3 \%$ in delay calculations [92].
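
The effect of repeater insertion on the length dependence can be illustrated with a simple estimate. The sketch below is not the model used in the paper: it assumes an Elmore-style delay of $rcL^{2}/2$ for a distributed RC line, ideal repeaters with a fixed delay, and hypothetical values for the per-unit-length resistance and capacitance.

```python
def line_delay(r_per_cm, c_per_cm, length_cm):
    """Elmore-style estimate for a distributed RC line: r * c * length^2 / 2."""
    return 0.5 * r_per_cm * c_per_cm * length_cm ** 2


def repeated_line_delay(r_per_cm, c_per_cm, length_cm, n_segments, t_repeater):
    """Same line cut into n equal segments, each followed by an ideal repeater."""
    segment_cm = length_cm / n_segments
    return n_segments * (line_delay(r_per_cm, c_per_cm, segment_cm) + t_repeater)


# Hypothetical values, chosen only for illustration (not taken from the paper).
R_PER_CM = 100.0     # ohm per cm
C_PER_CM = 2e-12     # farad per cm
T_REPEATER = 0.2e-9  # seconds per repeater
SPACING_CM = 2.5     # assumed distance between repeaters

for length_cm in (5, 10, 20, 40):
    n_segments = int(length_cm / SPACING_CM)
    bare = line_delay(R_PER_CM, C_PER_CM, length_cm)
    repeated = repeated_line_delay(R_PER_CM, C_PER_CM, length_cm, n_segments, T_REPEATER)
    print(f"{length_cm:>2} cm: bare {bare * 1e9:.2f} ns, repeated {repeated * 1e9:.2f} ns")
```

With a fixed repeater spacing, the repeated-line delay grows linearly with length, whereas the unbroken line grows quadratically. The estimate ignores the output resistance and input capacitance of the repeaters, which a real design must include.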


Fig. C. 1. Insertion of drivers for line regeneration


Fig. C.2. A $\pi$ line model

Several configurations [85] [89] [64] have been proposed for the driver in Fig. C.1. However, as we mentioned earlier, these configurations perform very well in terms of area and delay, but they do not take into account the very important issue of power consumption. In order to have a low-power interconnection, we suggest as a driver (Fig. C.1) the one described in Fig. C.3, which consists of:

1. a (0, 5 V) wing (transistors Tn1, Tp1),
2. a (p, q) wing (transistors Tn2, Tp2),
3. a conventional inverter (transistors Tn3, Tp3).


Fig. C.3. A clamping regenerator

The role of the $(0,5 \mathrm{~V})$ wing is to rapidly initiate the charge or discharge of the output. This wing is activated by inverter I1. When the clamping regenerator output is about to reach $p$ (or $q$ ) volts, activating the ( $\mathrm{p}, \mathrm{q}$ ) wing makes the output converge either to p (or q ) volts (Fig. C.4). The (p,q) wing is activated by inverter I2, while inverter I1 is inhibited to avoid an overshoot. The low level on output is thus p . The circuit is symmetric with respect to a high-to-low transition on In, and one easily deduces that the high voltage on Out is $q$.

Assuming overshoot has been avoided, the circuit switches from $p$ to $q$ and from $q$ to $p$ at a speed determined by the 5 V supply of I1, I2, $\mathrm{Tp1}$ and $\mathrm{Tn1}$.

As mentioned earlier (section I), the voltage swing on the output node determines the power consumption. That is particularly true when the line load, $C_{L}$, is much larger than the internal capacitances of the clamping regenerator. To a first approximation, reducing the voltage swing from $V_{dd}$ to $(q-p)$ reduces power dissipation by

$$
\left(1-\frac{(q-p)^{2}}{V_{dd}^{2}}\right) \times 100 \%
$$

The above equation neglects internal power dissipation of the regenerator.

To better understand the principle of operation of this circuit, let us consider a low-to-high transition on In. Assuming the circuit was stable, Out was high at the beginning, and the outputs of I1 and I2 are respectively high and low. Thus, $\mathrm{Tp2}$ and $\mathrm{Tn1}$ are active at the beginning of the transition. The transition of In to high pulls Out toward 0 through $\mathrm{Tn1}$ and $\mathrm{Tn3}$. When Out crosses the threshold of I2, this inverter starts activating $\mathrm{Tn2}$. After a delay determined by I1, $\mathrm{Tn1}$ is switched off. If the internal delays are properly balanced, the output will clamp to $p$ just in time.

Fig. C.4 shows an HSPICE simulation of the successive outputs of the 16 clamping regenerators driving a 40 cm line. A $1.2 \mu \mathrm{~m}$ CMOS technology is used in these simulations, and the widths of the clamping regenerator transistors have been chosen so that the output of the Fig. C.3 circuit swings strictly between 1 V and 4 V, rather than the conventional 0 V and 5 V. When stable, the output is said to be clamped.


Fig. C.4. Outputs of successive drivers with a good design of the clamper

However, the apparent simplicity hides two major problems which may appear due to a bad selection of the transistor widths of the clamping regenerator (Fig. C.5):


Fig. C.5. Problems due to a bad design of the clamper

1. The output may experience a large overshoot, thus producing an output outside the $[p, q]$ interval. This is illustrated in Fig. C.5 by the overshoot of amplitude D. Late activation of the $(p, q)$ wing brings the output back toward $q$. However, the power dissipation is then larger than expected. This happens, for instance, when inverter I2 is much slower than inverter I1.
2. At the end of the discharge process of input In, if the supply voltage $p$ is such that the transistors $\mathrm{Tn2}$ and $\mathrm{Tn3}$ are still conductive while the $(p, q)$ wing is finishing the charge process of the output, there is a conflict which sets the output to a value smaller than $q$. This is illustrated by the deviation d in Fig. C.5. This deviation is propagated to the next drivers in the interconnection, and HSPICE simulations showed that it increases until it reaches the threshold of a driver (about 2.5 V), in which case it causes the remaining drivers to oscillate (Fig. C.6). Of course, the symmetric problem applies to the charge process.


Fig. C.6. Oscillations caused by a bad design of the clamping regenerator

To prevent this latter problem, $p$ and $q$ must satisfy $p \leq V_{sn}$ and $q \geq V_{dd}-\left|V_{sp}\right|$, where $V_{sn}$ and $V_{sp}$ are respectively the thresholds of the N and P transistors. However, these conditions limit the reduction of power dissipation one can get. Fortunately, one can find some designs of the clamping regenerator (case of Fig. C.4) that violate the above rule, but where the deviation converges asymptotically to a stable, safe value as the number of stages crossed by the regenerated signal increases. For the remainder of this paper, $p$ and $q$ are respectively set to 1 V and 4 V (note that the thresholds of the P and N transistors are respectively -0.8 V and 0.7 V in our technology), i.e., a power reduction of $64 \%$.
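
As a worked check of the figures above, the short sketch below evaluates the first-order power reduction formula and the threshold rule for the values quoted in the text ($V_{dd} = 5$ V, thresholds of 0.7 V and -0.8 V). The function names are ours and purely illustrative; like the formula, the computation neglects the internal dissipation of the regenerator.

```python
def power_reduction_percent(p, q, vdd):
    """First-order saving when the swing is reduced from vdd to (q - p)."""
    return (1.0 - (q - p) ** 2 / vdd ** 2) * 100.0


def satisfies_threshold_rule(p, q, vdd, vsn, vsp):
    """True when p <= Vsn and q >= Vdd - |Vsp| (the safe condition in the text)."""
    return p <= vsn and q >= vdd - abs(vsp)


VDD, VSN, VSP = 5.0, 0.7, -0.8  # values quoted in the text

# Choice retained in this paper: p = 1 V, q = 4 V.
print(power_reduction_percent(1.0, 4.0, VDD))
print(satisfies_threshold_rule(1.0, 4.0, VDD, VSN, VSP))

# Largest saving allowed by the threshold rule: p = Vsn, q = Vdd - |Vsp|.
print(power_reduction_percent(VSN, VDD - abs(VSP), VDD))
```

With these numbers, the threshold rule alone would cap the first-order saving at about $51 \%$ (swing of 3.5 V), which is why a design violating the rule is attractive when its deviation remains bounded.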

## III. RESULTS AND DISCUSSIONS

Due to the complexity of a theoretical model of the clamping regenerator, finding an optimal design by systematically investigating its design parameters is a long and difficult task. Instead, we used a heuristic which relies on extensive electrical simulations and which explores the design parameters within limited ranges. This heuristic consists of two steps, each of which leads to a design that we compare to the conventional method [64] in terms of delay and area.

As a first step, we investigate the line delay versus the clamping regenerator size (Fig. C.7) for an arbitrary but acceptable line segment length (2.5 cm, thus subdividing the line into 4 segments) and a 10 cm line (a typical value for long interconnections). The widths of transistors $\mathrm{Tn1}$, $\mathrm{Tn2}$, $\mathrm{Tn3}$, $\mathrm{Tp1}$, $\mathrm{Tp2}$ and $\mathrm{Tp3}$ are all fixed to the same value Tw. The minimum delay in Fig. C.7 is obtained for the largest transistor width Tw considered (i.e., $240 \mu \mathrm{~m}$). Due to the asymptotic shape of the Fig. C.7 curve, further increasing this width brings little reduction in delay, while substantially increasing the area. The result of this first step is thus a minimal-delay design (noted Design \#1) within the Tw limits in Fig. C.7.

In order to avoid the use of non-conventional techniques when laying out very wide transistors, the second step of the search heuristic keeps Tw at its minimal considered value (i.e., $60 \mu \mathrm{~m}$ in Fig. C.7), thus minimizing the driver area. The results of this second step are illustrated in Fig. C.8, which shows the line delay versus the line segment length, fixed to the value SL (a square is $1.2 \mu \mathrm{~m}$ long), whose maximum is the whole line length, that is, 10 cm. Under these conditions, this second step of the heuristic produces a design (noted Design \#2) where the widths of transistors $\mathrm{Tn1}$, $\mathrm{Tn2}$, $\mathrm{Tn3}$, $\mathrm{Tp1}$, $\mathrm{Tp2}$ and $\mathrm{Tp3}$ are fixed to $60 \mu \mathrm{~m}$, and the line segment length corresponds to the minimum delay in Fig. C.8, i.e., 5 cm, thus dividing the line into 2 segments.

For both designs described above, N and P transistors of inverters I1 and I2 are fixed respectively to $12 \mu \mathrm{~m}$ and $24 \mu \mathrm{~m}$.
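
The two-step search described above can be sketched as follows. This is only an outline of the procedure under stated assumptions: `simulate_delay` stands for an electrical simulation of the whole line (HSPICE in our case) and must be supplied by the user, while `toy_delay` is an arbitrary placeholder with made-up constants whose only purpose is to make the example executable. It does not model the clamping regenerator and does not reproduce Fig. C.7 or Fig. C.8.

```python
def two_step_search(simulate_delay, widths_um, segment_lengths_cm, default_segment_cm):
    """Return (design1, design2), each as (width_um, segment_cm, delay).

    simulate_delay(width_um, segment_cm) is a user-supplied stand-in for an
    electrical simulation of the complete line (hypothetical interface).
    """
    # Step 1: segment length fixed, sweep the common transistor width Tw.
    w1 = min(widths_um, key=lambda w: simulate_delay(w, default_segment_cm))
    design1 = (w1, default_segment_cm, simulate_delay(w1, default_segment_cm))

    # Step 2: Tw kept at its smallest value, sweep the segment length SL.
    w2 = min(widths_um)
    s2 = min(segment_lengths_cm, key=lambda s: simulate_delay(w2, s))
    design2 = (w2, s2, simulate_delay(w2, s2))
    return design1, design2


def toy_delay(width_um, segment_cm, line_cm=10.0):
    """Placeholder delay model with arbitrary constants (NOT a circuit model)."""
    return line_cm * (60.0 / width_um + 0.16 * segment_cm + 1.0 / segment_cm)


design1, design2 = two_step_search(
    toy_delay,
    widths_um=[60, 120, 240],
    segment_lengths_cm=[1.25, 2.5, 5.0, 10.0],
    default_segment_cm=2.5,
)
print("Design #1 (width, segment, delay):", design1)
print("Design #2 (width, segment, delay):", design2)
```

Because each step sweeps a single parameter, the number of simulations grows with the sum, rather than the product, of the two ranges; the price is that the result is a good design within the explored ranges, not a guaranteed optimum.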


Fig. C.7. Line delay versus driver size


Fig. C.8. Line delay versus segment length

Table C.1 compares, in terms of area and delay, the method based on the clamping regenerator (Design \#1 and Design \#2) with the conventional regeneration method (the RID variant) [64]. With this conventional method, called RID, the driver is a simple fine-tuned inverter. The driver size and the line segment length are the result of analytical derivations based on a theoretical model of the interconnection and the driver. However, the delay results given in Table C.1 are all based on HSPICE simulations. The area parameter in this table considers only the area of the driver transistor channels, calculated in units measuring $1.2 \mu \mathrm{~m}$ by $1.2 \mu \mathrm{~m}$ each.

Table C.1: Comparisons with RID

| Method | Delay (ns) | Area (units) |
| :---: | :---: | :---: |
| RID [64] | 9 | 5870 |
| Design \#1 | 3.85 | 5040 |
| Design \#2 | 4.92 | 720 |

Table C.1 shows that, in terms of delay and area respectively, the reductions achieved with respect to the RID method are $57 \%$ and $14 \%$ for Design \#1 and about $45 \%$ and $87.7 \%$ for Design \#2.
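
The percentages can be recomputed directly from the entries of Table C.1, as in the sketch below. The simple area × delay product printed alongside is an assumption on our part, given as one possible combined figure of merit.

```python
# Delay (ns) and channel area (units of 1.2 um x 1.2 um) copied from Table C.1.
TABLE_C1 = {
    "RID": (9.0, 5870),
    "Design #1": (3.85, 5040),
    "Design #2": (4.92, 720),
}


def reduction_percent(reference, value):
    """Relative reduction of `value` with respect to `reference`, in percent."""
    return (reference - value) / reference * 100.0


ref_delay, ref_area = TABLE_C1["RID"]
for name, (delay, area) in TABLE_C1.items():
    print(
        f"{name}: delay reduction {reduction_percent(ref_delay, delay):.1f}%, "
        f"area reduction {reduction_percent(ref_area, area):.1f}%, "
        f"area x delay = {area * delay:.0f}"
    )
```

The RID row serves as the reference, so both of its reductions are zero.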

One goal behind the Fig. C.1 structure is to make the speed independent of the line length. Table C.2 shows the line delay for different line lengths with a $1.2 \mu \mathrm{~m}$ technology, using the design which achieves the smallest area-delay metric in Table C.1, i.e., Design \#2. We notice from Table C.2 that the speed decreases when the line length increases; however, it converges asymptotically to a minimum value near $1.57 \mathrm{~cm} / \mathrm{ns}$. The signal speed is stable for lines longer than 40 cm.

Table C.2: Speed Stability

| Line Length | Line Delay | Signal Speed |
| :---: | :---: | :---: |
| 5 cm | 2.3 ns | $2.17 \mathrm{~cm} / \mathrm{ns}$ |
| 10 cm | 5.7 ns | $1.75 \mathrm{~cm} / \mathrm{ns}$ |
| 20 cm | 12.44 ns | $1.61 \mathrm{~cm} / \mathrm{ns}$ |
| 40 cm | 25.43 ns | $1.57 \mathrm{~cm} / \mathrm{ns}$ |
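
The signal speed column is simply the line length divided by the line delay. The following sketch recomputes it from the two other columns of Table C.2 and also prints the relative drop in speed between consecutive rows, which is one simple way (our choice, not a metric defined in this paper) of quantifying the convergence.

```python
# (line length in cm, simulated line delay in ns) copied from Table C.2.
TABLE_C2 = [(5, 2.3), (10, 5.7), (20, 12.44), (40, 25.43)]

speeds = [length / delay for length, delay in TABLE_C2]
for (length, delay), speed in zip(TABLE_C2, speeds):
    print(f"{length:>3} cm  {delay:>6.2f} ns  {speed:.2f} cm/ns")

# Relative drop in speed from one row to the next, in percent.
drops = [(a - b) / a * 100.0 for a, b in zip(speeds, speeds[1:])]
print([f"{d:.1f}%" for d in drops])
```

Each doubling of the line length costs less speed than the previous one, which is consistent with an asymptotic limit.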

## IV. CONCLUSION

A low-power circuit for driving long interconnections in VLSI systems has been presented. This circuit is inserted repeatedly in the line in order to avoid speed reduction when the line length increases. We showed that some problems may result from a bad design of this circuit, and we proposed solutions to these problems.

Using electrical simulations with HSPICE, guided by a heuristic, we presented two designs that perform better than the conventional method in terms of speed and area. With both designs, power dissipation is reduced by $64 \%$.



[^0]:    1. Clock skew (biais de synchronisation), defined as the difference in the arrival time of the clock signal at two distinct points of the system.
    2. Variations in the fabrication process.
[^1]:    1 Pipelining often refers to inserting registers in a datapath which is synchronized by a clock. However, the concept of pipeline clocking has been repeatedly used over the last decade [50, 3, 54 \& 63] and was coined by Fisher \& Kung [3].

[^2]:    2 The transistor time constant is defined here as the product $\mathrm{R} \times \mathrm{Cg}$, where R denotes the resistance of a minimum-sized square transistor channel and Cg the capacitance of a minimum-sized transistor gate. A similar definition can be found in [58, p.4]. A practical method of estimating $\mathrm{R} \times \mathrm{Cg}$ is to simulate a chain of inverters with SPICE. If the chain is long enough, the transistor time constant measured in the middle of the chain is independent of boundary conditions such as the waveform shape at the input of the chain. This is mainly due to the regeneration effect of inverters.

[^3]:    2. These are pre-determined virtual lines along which cuts are made in the wafer to separate the dies. There is a scribe line between any adjacent pair of dies (along their adjacent sides).
[^4]:    1. These branches are called primary branches.
[^5]:    1. We did not take advantage of the fast bipolar transistors offered in this process.
