Application des systèmes de calcul à haute performance dans les études électrothermiques à l'échelle nanoscopique by Gholami, Bahman
UNIVERSITÉ DU QUÉBEC 
MÉMOIRE PRÉSENTÉ À 
L'UNIVERSITÉ DU QUÉBEC À TROIS-RIVIÈRES 
COMME EXIGENCE PARTIELLE 
DE LAMAÎTRISE EN GÉNIE ÉLECTRIQUE 
PAR 
BAHMAN GHOLAMI 
APPLICATION DES SYSTÈMES DE CALCUL À HAUTE PERFORMANCE DANS 
LES ÉTUDES ÉLECTROTHERMIQUES À L'ÉCHELLE NANOSCOPIQUE 
AVRIL 2011 
  
 
 
 
Université du Québec à Trois-Rivières 
Service de la bibliothèque 
 
 
Avertissement 
 
 
L’auteur de ce mémoire ou de cette thèse a autorisé l’Université du Québec 
à Trois-Rivières à diffuser, à des fins non lucratives, une copie de son 
mémoire ou de sa thèse. 
Cette diffusion n’entraîne pas une renonciation de la part de l’auteur à ses 
droits de propriété intellectuelle, incluant le droit d’auteur, sur ce mémoire 
ou cette thèse. Notamment, la reproduction ou la publication de la totalité 
ou d’une partie importante de ce mémoire ou de cette thèse requiert son 
autorisation.  
Résumé 
La miniaturisation des MOSFETs a été le carburant de l'industrie semi-conducteur pour 
quelque dizaine d'années. L'industrie continuera, probablement, son développement à 
l'avenir en créant de nouveaux concepts et technologies. À long terme, pour arriver à une 
miniaturisation extrême en comparaison avec l'état de l'art, de nouveaux concepts de 
transistors sont prévus pour remplacer les MOSFETs conventionnelles. Même dans la 
prochaine décennie, des corrections dans la modélisation et conception des MOSFETs sont 
prévues. Au même temps que la miniaturisation arrive aux barrières physiques, de 
nouveaux phénomènes voient le jour. L'un de ces phénomènes est le transport balistique. 
Le transport balistique et les autres phénomènes reliés à physique quantique, qui 
devienne dominant quand les tailles de composant atteignent un certain niveau, ont besoin 
d'algorithmes de calcul très lourds pour être modélisés avec précision. L'utilisation des 
techniques de calcul à haute performance (HPC) est une solution prometteuse pour le 
problème de modélisation et conception de MOSFETs du futur. 
Dans ce travail, un outil de modélisation créé par les chercheurs à l'université de 
Perdue est employé comme un algorithme qui modélise de nouveaux effets de transport 
quantique dans les MOSFETs à double grille. Cet algorithme est parallélisé avec MA TLAB 
et exécuté sur une infrastructure HPC de consortium CLUMEQ, sous la plate-forme Calcul 
Canada. 
Abstract 
MOSFET scaling has been the fuel of semiconductor industry for decades. The 
industry, most probably, will continue its development in the future; but for that to happe n, 
new concepts and technologies have to be created along the way, as it has been the case in 
the past decades. In the long run, to achieve extreme scaling in comparison to the state of 
the art, new concepts of transistors are expected to replace conventional MOSFETs; even in 
the next decade, corrections to MOSFET modeling and design are expected to be necessary 
as scaling reaches sorne physical barriers and exposes new phenomena, one of which is 
ballistic transport. 
Ballistic transport and other quantum physics related phenomena, that become 
dominant when device sizes reach certain levels, need very computationally expensive 
algorithms to be modeled precisely. The use of high performance computing techniques is a 
hopeful solution to the problem of modeling and designing future MOSFETs. 
In this work, a modeling tool created by researchers at Purdue University IS 
employed as an algorithm that models new quantum transport effects in double gate 
MOSFETs. This algorithm is parallelized under MA TLAB and fUn on HPC infrastructure 
consortiums CLUMEQ and SHARCNET, under Compute Canada platform. 
Dedication 
To my parents whose devotion and sacrifice is endless. 
To Azar, with love. 
Acknowledgments 
l thank Professor Adam W. Skorek for great motivation and courage that he gave 
me to start this project and for his continued support and trust throughout my academic 
program at UQTR. 
l also thank Simon Delisle for his kind and valuable help. 
l wou Id also like to thank Vladimir Timochevski from CLUMEQ for his effective 
help. 
Contents 
Résumé ....... .. ..... .......... .... ............ ............... ..... ..... ...... ................................ ..... .. ..................... ii 
Abstract ....... ............. ..... .. ......... .... ... .... ............................. .. .......... ..... ............................. .... .... ii 
Dedication ..... ......... .... ... .. ........... ............ ...... ..... ....... ........ ........... ..... ............ .... ..................... iv 
Acknow ledgrnents ...... ..... ....... ... .............. ..... ..... ..... ........ ....... .... ....... .......... ... ...... .............. .. .. .. v 
Contents ....... ..... ... .............. ... .................. ... ...... .. ........ ... ... ... ...... ....... ........................ ........ ..... vi 
List of tables ..... ............................ ..... ........ ...... .... ............. ..... ... ... ................. .. .. ................ ... .. ix 
List of figures .. .. .... ............ ..................... ......... ................. ................................ .. ...... ... ... ......... x 
List of symbols .. ... ........ .. .. ...... .. .. .................... ..... ... .. .... .. .... ...... ............................. .......... .... xiii 
Résumé français .......... .. ... .......... ............. ...... ....................... ............... : ... .. .... .................... .. . FI 
Chapitre 1 - Introduction ........... ................. .............. .... ..... ............................. ... .... .................. 1 
1.1 Nano scale electro-thermal phenomena ............ ...................... .... .... .... .... ................. 1 
1.2 Problem ................ .............. .............. .... ....... ...... ................. ...... ........... ... ................ 15 
1.3 Role of paraUel computations ........ .. .... ........................................ .... .... ............ ....... 18 
1.4 Objective ........... ...... ..... ... .... ..... ... ..... ... ............. ............................. .. .... ............. ... ... 20 
1.5 Methodology ..................... .. ................. ... .......... ... ........ .. ............. .. ............ ...... ............... 21 
Vil 
1.6 Thesis structure ...... . ........ .... ............... ........ .... ... ... ... . .. . ........ .. .. .. .... 23 
Chapitre 2 - Nanoelectronics devices and Nano thermal constrains ..................................... 25 
2.1 Nanoelectronics devices .. ......... .................. .. ..... ... ....... .......... ... ..... .... ....... .. .. .. .... .. .. 25 
2.1.1 Introduction ............... ..... ... .... .... .... ...... ... .............. .. ........ .... ...... .. .. ........ ....... 25 
2.1.2 Structure and Operation of MOSFET ............. ..... ...... ................................. 29 
Chapitre 3 - High Performance Computing Systems ........ .... ... .... ...... ....... ..... ... ...... ... .... ... ... .46 
3.1 Parallelization ... ....... ... ..... ...... ...................... .... ....... ... ......... ....... ... .......................... 47 
3.2 Infrastructure .. .... .. .... ....... ... ....................... ....... ...... ..... ....... .. ... .. ............................. 50 
3.2.1 Accessing Colosse : ....... ....... .... .. .... ........... ............. ......... ... .. ........ .............. 59 
Chapitre 4 - Modeling ........ ........... ...... ... ..................... .. .. ............ ... ..................... ..... .. .. ....... .. 65 
4.1 Physical and Mathematical modeling ......... .... ... .......... ... ........................................ 65 
4.1.1 Material considerations ................... ... .... ............. ... ..................................... 67 
4.1.2 MOSFET theory and design .......... ....................... ............ ....... .. .......... ..... .. 68 
4.1.3 Transistor structure ........ .............. ...... ..... .... ...... ....................................... ... 70 
4.1.4 Modeling and simulation ............... ........ ......... ......................................... ... 77 
4.2 Geometry and fabrication ............... .. .. ....... .......... ........ ........................ ..... .............. 79 
4.2 .1 Fabrication Process .............. ... ......... ...... ......... ... .. ....... ......................... .... ... 79 
4.2.2 Packaging ............ ..... ............. ... ....... ...... ...... .................... ........... ................. 88 
4.3 Numerical modeling ..... .................. .. ................. ....... .... ... ......... ... .. ......................... 96 
viii 
Chapitre 5 - Implementation .......... .. ......... ...... .. ........ .... ... .................................. .. ...... ... ..... . 109 
5.1 Mathematical formulation ....... ...................... .. ..... ..... ......... ... ... .. ...................... .... 109 
5.2 Algorithm establishment .... ... .......... ....... .. ....... ...... .. ........... .. ............ ... ........ ....... .. 124 
5.3 Parallelization .. .............. .. .... ....... ...... .. ........... .... ... .............. .. ............ ....... .. ..... ...... 131 
5.4 Implementation ... ............. .......... ....... ............................................. ... ... .. .......... ..... 136 
Chapitre 6 - Validation ............................. .... ..... ... ..... ...................... .. ..... ...... ..... ..... .. ..... .... .. 143 
6.1 Comparison of seriaI and parallel code performances ....... .............. .. ... .... ..... .... .. 143 
6.2 Validation ................. .. ... ... ... ........... .. ........... ...... .. ..... .................. ...... .. ... .......... ... .. 145 
6.3 Parametric analysis .... .... .. ... ............ .. ................. ........ ...... ... ... ... ..... ... .. .. ...... .......... 147 
Chapitre 7 - Conclusion ...... .... ..... ... .......... ... ... .......... ...... .............................. ..... ............. .. .. 152 
References .... .. .......................... .. .... ... .......... .... .................. ......... ............ ................ ... .......... 155 
Table 1-1 
Table 3-1 
Table 4-1 
Table 4-2 
Table 4-3 
Table 6-1 
List of tables 
Thermal conductivities of a few materials used in semiconductor 
device fabrication, Phonon boundary scattering significantly 
reduces the thermal conductivity ofa 10 nm thin silicon film ....................... .4 
Data storage in colosse ............ ....... ... .. ...... ... ....... ...... ... ... ................. ............. 61 
MOSFET equation parameters .. .. ............................................................ ... ... 72 
Simulation device parameters ..... .... ......... ...... .. ....................................... ..... . 98 
Bias parameters .... .. ... .... ... ...... ........... ........ .. .... ......................................... ... .. 99 
Convergence data for simulations with two and four workers .............. .. ... .146 
List of figures 
.Figure 1-1 Device power density growth over the tirne. Interesting to note is 
that the vertical axis is logarithrnic and the horizontal one is 
linear so an exponential trend is obvious. Sorne fact to compare: 
the power density in a hot plate is in the order of 10 W /cm2, in a 
nuclear reactor in the order of 100 W / cm 2 and at the surface of 
the sun about 7000 W / cm 2 • Data compiled by F. Labonte, 
Stanford ......... ..... ........ ............ .... ... ... .. ........ ............ ..... ......... ... ....... ......... ... .... . 9 
Figure 1-2 (a) bulk silicon (b) (strained) silicon/germanium on insulator (c) 
multiple-gate or FinFET devices. In more advanced geometries 
(b,c) thermal conductivity ofmaterials is lower and heat 
removal is harder. In band c for the gate length of 10-nm, the 
thickness ofsemiconductor is 30% to 50% of the gate length. 
Fig. (c) is from the ITRS [7] ...................................... .. .......... .................... .. .. .. 9 
Figure 2-1 Structure of a NMOS transistor .. ...................................................... .. ........... 30 
Figure 2-2 Nanoelectronic devices .. ........................ ......... .... .. ..... ........... ... ... ......... ..... ... .. 34 
Figure 2-3 Quantum weIl ofa RTD ........ .. .... .. .. .... ...... .... ............ .. .................................. 37 
Figure 2-4 Structure ofa RTT ................................... .. ...... ...... ................ ..... .. ................. 38 
Figure 2-5 Structure of an RTT .. ... ... ... ...... .. .... ...... ..... ...... ............ .. .. ............. ................. .40 
Figure 2-6 Switching of solid-state nanoelectronic devices . .. ........................................ 42 
Figure 2-7 A hybrid RTD-FET. ........ .. ...... .... .......... .. .......... .. .................. .. .. ................ .. .. 44 
Figure 3-1 OpenMP schematic in comparison to single processing ............... .. ...... .. ...... 50 
Figure 3-2 Placement of the electronic devices ........................................................ .. .... 55 
Figure 3-3 Side-view of the structure ........ .... ...................... .. .......................................... 57 
Figure 3-4 Structure viewed from above ......... .. ........................................................ .. .. . 58 
Figure 4-1 Output characteristics of a 200 nm channel length device ........ .... .. .. .......... .. 74 
Figure 4-2 Threshold voltages ........ .............. ..... ... .. ..... .... ............ .. ....... ..... .. ........ ...... ..... 75 
xi 
Figure 4-3 Transfer characteristics of 100 nm and 50 nm devices ........... .............. ........ 76 
Figure 4-4 Sub threshold slopes ... .......................... .... ...... .. ................ ............................. 76 
Figure 4-5 Semi recessed LOCOS process ............ .... ........................ ............................. 81 
Figure 4-6 Preparation for LOCOS process ........................ .............................. .... .......... 83 
Figure 4-7 Mask used in LOCOS process to etch oxide and nitride layers ..... ...... ......... 83 
Figure 4-8 Finished LOCOS process with bottom gate area to be implanted 
(red) and oxide (blue) ... .......... .................. ...... ................................ .. ............. 83 
Figure 4-9 Amorphous silicon deposited on top of oxide layer, also LPO 
layer (green) and Nickel (violet) are deposited to start 
crystallization ................................................................................................ 85 
Figure 4-10 Mask used to etch the LPO ............................................................................ 85 
Figure 4-11 Crystallized channel, ready to be implanted with B ions ................... .. .. ...... . 87 
Figure 4-12 Mask used to etch the channel silicon . .............. ........................ .................... 87 
Figure 4-13 Deposition ofupper gate polysilicon layer.. .................. ........................ .... .. .. 87 
Figure 4-14 Mask used to form the upper gate . .... .......................................... .................. 88 
Figure 4-15 Electroless Ni-Au UBM ........ .. ............ ...................................... .................... 90 
Figure 4-16 Reflowed solder bumps on electroless Ni-Au UBM ..................................... 90 
Figure 4-17 A "Quad Flat No lead" structure .................................... ............ .......... .. ....... 92 
Figure 4-18 Pin Grid AITay at the bottom ofa processor ................................................. 93 
Figure 4-19 An LGA processor .. .............. .. ................................... ..................... ...... ......... 94 
Figure 4-20 BGA RAM lCs connected to a PCB ........ .................... .................. ...... ......... 95 
Figure 4-21 Drain CUITent for different gate lengths ...................................................... .1 00 
Figure 4-22 Drain current for different gate isolator thicknesses ...... .. .... .................... ... 100 
Figure 4-23 Drain CUITent for different gate isolator dielectric constants ......................... 101 
Figure 4-24 Drain CUITent for different Gaussian doping profile slopes ......................... 1 0 1 
Figure 4-25 Device parameters ....................................................................................... 1 03 
xii 
Figure 4-26 Valleys and electron masses ..... ..... .. ....... ... .... .. ....................... .... ..... .. ... .. ..... 107 
Figure 5-1 Deviee grid ........................................ .... .................. .. .................................. 113 
Figure 5-2 Algorithm for mode1ing scattering ..... .. .. .. .......... .... ..................................... 121 
Figure 6-1 Speed-up for different numbers of workers ..... .. .. .. ..................................... 145 
Figure 6-2 Measured and theoretical speed-up ..................... .. ...................................... 149 
Figure 6-3 Measured speed-up is normalized to Gustafson's law's 
prediction .... .................... .... .... ..................................................................... 150 
List of symbols 
C Capacitance 
Cox Oxide capacitance 
D Si' D Di effects of source and drain on the local density function for subband i 
D ji Local density 
E Electric field 
Ei Bound state energy of sub band i 
Longitudinal en erg y 
E pI Maximum energy of subband i 
f frequency 
Fn Quasi-Fermi energy 
G Green's function 
G+ Hermitian conjugate of G 
Gn , GP Correlation functions describing electron and hole density 
G:~ mth diagonal entry of Gn 
h Plank constant 
H Single-electron effective mass Hamiltonian 
xiv 
1 Current 
Drain CUITent 
CUITent spectrum at terminal m 
Constant 
Current in sources of scattering or carrier 
J CUITent density 
Constant 
L Inductance 
m Mass 
Electron effective mass in z direction 
• <m > Average of electron effective mass in the transport direction 
n Electron density 
n(E,m) Electron density spectrum 
Donor and acceptor' s doping concentrations 
Grid node numbers in x and z directions 
Effective density of states in the conduction band 
Electron density constant 
Electron density constant with scattering 
xv 
P Momentum 
P Role density 
p Parallelizable percentage of algorithm 
p Number of processors 
P2D; Constant 
q Elementary charge constant 
q<T > A verage of electron state lifetime 
Q Charge 
r Space vector 
Ron Channel on resistance 
S Speed-up factor 
t Time 
TSDi (E) Source-to drain transition in terms of electron energy 
Tjki Transmission from k to j in subband i 
v Speed of electron 
Vd Drain voltage 
Vdd Operating voltage 
VDS Drain-source voltage 
xvi 
Vg Gate voltage 
VGS Gate-source voltage 
~ ~h Threshold voltage , 
Vsat Saturation velocity 
W Energy 
a Non-parallelizable part of algorithm 
Beta Caughy-Thomas parameter 
Dielectric constant 
Body Fermi en erg y 
mu low Low field mobility 
Iln Mobility of electron 
Ils , IlD Contact Fermi energies of source and drain 
Il j Fermi potential 
Self-energy matrix 
L , Lin and LOII1 Self-energy functions 
L:: mth Diagonal entry of Lill 
\}' (r) Field operator 
\}'i(Z) Envelope function of subband i 
XVII 
3 1/ 2 Fermi-Dirac integral of order Yi 
3 _1/ 2 Fermi integral oforder -112 
Résumé français 
Introduction 
La miniaturisation des MOSFETs a été le carburant de l'industrie semi-conducteur pour 
quelque dizaine d 'années. L' industrie continuera, probablement, son développement à 
l' avenir en créant de nouveaux concepts et technologies. À long terme, pour arriver à une 
miniaturisation extrême en comparaison avec l'état de l'art, de nouveaux concepts de 
transistors sont prévus pour remplacer les MOSFETs conventionnelles. Même dans la 
prochaine décennie, des corrections dans la modélisation et conception des MOSFETs sont 
prévues. Au même temps que la miniaturisation arrive aux barrières physiques, de 
nouveaux phénomènes voient le jour. L'un de ces phénomènes est le transport balistique. 
Le transport balistique et les autres phénomènes reliés à physique quantique, qui 
devienne dominant quand les tailles de composant atteignent un certain niveau, ont besoin 
d'algorithmes de calcul très lourds pour être modélisés avec précision. L'utilisation des 
techniques de calcul à haute performance (HPC) est une solution prometteuse pour le 
problème de modélisation et conception de MOSFETs de l'avenir. 
Dans ce travail, un outil de modélisation créé par les chercheurs à l'université de 
Perdue est employé comme un algorithme qui modélise de nouveaux effets de transport 
quantique dans les MOSFETs à double grille. Cet algorithme est parallélisé avec MA TLAB 
et exécuté sur une infrastructure HPC de consortium CLUMEQ, sous la plate-forme Calcul 
F2 
Canada. 
Objectif 
Aujourd'hui, la miniaturisation des MOSFETs envisage un obstacle majeur qui est la 
consommation de puissance. Dans les technologies courent, la miniaturisation augment la 
densité de puissance (figure 1), la génération de chaleur et la température de chip à un 
niveau très élevé dans le quel la fonctionnalité et la fiabilité de composant sont menacés 
[5]. 
Densité de chaleur élevée et génération non-uniforme de chaleur, qui est présent dans 
les nano-MOSFETs, créent les gradients forts de température [4]. 
Température élevée pourrais créer des problèmes variés: décharge électrostatique, 
électro-migration, surcharge électrique, courent de perte, erreur de synchronisation, auto-
échauffement, point chauds, stress thermique, etc. [4] 
Propriétés de transfert de chaleur dans l'échelle macroscopique ne sont pas identiques 
avec ces propriétés dans l'échelle nanoscopique. Conduction de chaleur dans l'échelle 
nanoscopique est balistique et similaire à radiation thermique [6] . 
Dans les échelles très petits, «points chauds» de l'échelle nanoscopique, créés dans la 
région drain de transistor, augment résistance série de drain et résistance d' injection de 
source. Dans les géométries non-conventionnelles qui utilisent nouvelles matériaux (par 
exemple dans transistor ultra-mince, finFET ou nanofil) ce phénomène est même plus fort. 
En plus, augmentation de ratio de surface à volume dans les nouvelles géométries (figure 
2), augment la résistance thermique de frontière des matériaux. Avec les nouvelles 
géométries il est plus difficile de faire sortir la chaleur créée à l' intérieur de composant. 
F3 
Nouvelles matériaux utilisés dans ces nouveaux composants ont les conductivités 
thermiques beaucoup plus faible que les matériaux utilisés dans les composants bulks 
(Table 1) [5]. 
Figure 1 
1000 
-N 
E 
u 
~ 100 
~ 1;; 
c 
a 
10 
1 
• 
4 AMD 
• Intel 
• PowerPC 
-Trend 
• 
• 
.. . 
• ri' 
. 4 
• •• 
• 
1990 1994 1998 2002 2006 2010 
Densité de puissance dans le temps ([9]) 
F4 
Strained Si, Ge, SiGe 
Burled oxide 
Bulk Silicon SIlicon substrate 
Figure 2 (a) Silicium bulk, (b) Silicium/Germanium tendu sur isolant, (c) 
composants multi-grille ou FinFET. Conductivité thermique dans 
technologies plus avancés (b et c) est plus faible. (c) est pris de ITRS 
Table 1 
Material 
Si 
Ge 
Silicide 
Si (10 nm) 
Si02 
([9]) 
Conductivités thermiques des matériaux utilisés dans la fabrication de 
semi-conducteurs ([5]) 
Thermal Conductivity (W/m*K) 
148 
60 
40 
13 
1.4 
En plus, dans les composant où l'une ou plus d'une des dimensions de semi-conducteur 
de canal est plus petite que « mean free path » d'électron ou phono n, le déplacement des 
F5 
transporteurs est décrit par les modèles de transport balistique. Transport balistique des 
particules crée les points chauds et cela change la modèle thermique de composant. Dans 
ces tailles, où les films semi-conducteurs sont plus minces que «mean free path », 
conductivité thermique de ces films est diminuée à cause de confinement de phonon et 
diffusion de frontières [5]. 
Avant arriver à ce niveau de miniaturisation, les points chauds apparaissait qu'à dans le 
niveau de circuit et ils étaient reliés à l'architecture de chip. Mais maintenant, avec 
miniaturisation extrême, les points chauds nanoscopique sont créés à l'intérieur de 
transistors [9]. 
Miniaturisation conventionnelle de CMOS se compose de réduire la longueur de grille 
et la largeur de diélectrique de grille et augmenter dopage du canal. Dans les MOSFETs 
plane, dopage élevé du canal diminue les effets de canal court, mais il aussi réduit la 
mobilité et augment courent de perte. La fonction principale de solutions multi-grille et 
« fully-depleted» ultra-mince est de diminuer ces effets indésirables et améliorer la 
performance. 
MOSFETs à double grilles sont les composants avec deux électrodes de grille, placés à 
deux côtés opposés du canal. Cette structure donne un meilleur control sur canal. Ces deux 
électrodes de grille peuvent être deux surfaces d'une grille ou deux électrodes 
indépendantes, commandés par deux signaux de grille [20] . 
Présentement, la demande de calcul élevée d'analyse électrothermique nanoscopique 
cause des difficultés dans l'optimisation de conception. Pour arriver aux conceptions 
efficientes de circuits basés sur nano transistors, cet obstacle doit être battu [15] . 
F6 
Technique rapide d'analyse thermique, basée sur transport classique Fourier, peut 
servir pour optimiser les conceptions de chips et de refroidissement. Mais cette technique 
est incapable de décrire les effets nanoscopiques, parce qu'elle manque la description 
quantique de phonons. D'autres techniques sont développées pour modéliser ces effets: 
méthodes dynamiques moléculaires, équation de transport de Boltzmann et modèle de 
diffusion balistique. Mais complexité de calcul reste d'être un obstacle. Il y a beaucoup de 
différence entre les modèles d'analyse thermique nanoscopique de chip, optimisées pour 
efficacité et les modèles qui sont optimisées pour précision [14]. 
Une solution pour le problème de simulation de modèles thermique exigeantes est 
d'employer les algorithmes parallèles et les implémenter dans les systèmes de calcul à haut 
performance [15]. 
Quand la miniaturisation des MOSFETs entre l'échelle nanoscopique, nouvelles effets 
de nature quantique deviendront important. Dans quelque domaine de conception de 
MOSFETs (par exemple concentration de dopage ou épaisseur de couche d'oxyde) les 
couches sont tellement minces que nous pouvons déjà compter les atomes. Dans les autres 
domaines, ces effets quantiques vont entrer la scène dans le future proche. Pour aller plus 
loin que régime 10 nm, c'est nécessaire de modéliser ces effets. Mais les simulations 
précises prennent trop de temps. Facilités de calcul à haut performance créent une 
opportunité pour ces simulation d'être fais raisonnablement vite et avec la précision 
désirée. 
Dans ce travail, l'objectif est de proposer et implémenter un algorithme de parallélisme 
pour un outil de simulation des effets b~listiques dans les MOSFETs à l'échelle 
nanoscopique à doubles grilles. Cela est fait dans le but de diminuer le temps de simulation 
F7 
pour amver à utiliser les modèles plus précises et faire les recherches paramétriques 
nécessaire pour conceptions optimisées. Le composant considéré est étudié dans les aspects 
physiques et mathématiques. Outil de simulation considéré est étudié en détailles et les 
possibilités de parallélisation dans logiciel MA TLAB sont considérés et comparés. À la fin, 
l'algorithme de parallélisation est implémenté et le code parallèle est généré et exécuté sur 
une facilité de calcul à haute performance de Calcul Canada. Les résultats sont présentés, 
validés (par comparaison avec les résultats de code original) et étudiés concernant 
l'efficacité de parallélisation. 
Méthodologie 
NanoMOS, un outil open source développé par les chercheurs à l'Université Purdue, est 
un simulateur de MOSFETs à double grille à l'échelle nanoscopique. Il considère transport 
balistique et résout les équations Schrôdinger et Poisson pour trouver les distributions de 
charge et courent à l'intérieur de composant. Dans ce travail, le code de NanoMOS a été 
étudié concernant les possibilités de parallélisation et un algorithme de parallélisation a été 
implémenté sur ce code. Le résultat a été testé en utilisant les facilités HPC de Calcul 
Canada. Les résultats de simulation et les facteurs d'accélération de temps sont présentés et 
discutés pour différent nombre de nœuds de calcul. Ce travail, étant une expérience de 
parallélisation, peut être utilisée dans implémentation d'algorithmes parallèles sur les 
modèles plus complexes qui facilitent les études paramétriques et optimisations nécessaires 
pour un design précise de composants nanoélectroniques. 
Dans l'exécution parallèle d'un code, la charge de travail est divisée et distribuée parmi 
différents nœuds de calcul. Idéalement la durée de l'exécution de code va être divisée par le 
nombre des processeurs engagés. Aussi la mémoire totale disponible sera le résultat de 
F8 
multiplication de la mémoire d'un processeur avec le nombre des processeurs. Ce qui est 
important dans parallélisation, c 'est la façon de diviser le job et la méthode de 
communiquer parmi les nœuds [15]. 
Calcul Canada est un collectif de consortia des infrastructures HPC, géographiquement 
distribués mais travaillant dans un système unique nationale. Un compte de Calcul Canada, 
offert aux professeurs et chercheurs, donne la possibilité de travailler avec les machines de 
calcul dans toutes les infrastructures de Calcul Canada et de profiter des supports 
techniques, programmes éducatifs etc. [21]. 
CLUMEQ, l'un des consortia de Calcul Canada, est composé de deux superordinateurs 
situé à l'Université McGill, Montréal et Université Laval, Québec [22]: 
Colosse: une grappe de 960 nœuds (chacun avec 8 cores processeur et 24 GB de 
RAM), totalement 7680 cores et 23 TB mémoire, avec réseau infiniband QDR et capacité 
de système de fichier de 1 PB [22]. 
Krylov: une grappe de 48 nœuds (21 nœuds avec 4 cores et 8 GB de RAM et 27 nœuds 
de 8 cores et 16 GB de RAM), totalement 300 cores, avec réseau infiniband SDR [22]. 
De côté logiciel, NanoHUB est un ressource sur ligne éducationnel et académique dans 
la domaine de nanotechnologie. Il inclue cours sur ligne et documents d'apprentissage ainsi 
qu'un grand nombre de simulateurs de phénomènes et composants nanoscopiques. 
NanoHUB est formé par contributions de plus de 600 chercheurs et éducateurs et le nombre 
de ses utilisateurs augment rapidement [34]. 
F9 
NanoMOS est un simulateur open source offert par NanoHUB et écrit avec MA TLAB. 
L'utilisateur a le choix de faire rouler l'outil NanoMOS à distance ou télécharger le code et 
l'exécuter sur son ordinateur [29] , [34]. 
Modèle conventionnelle de transport de transporteurs est dérivé d'équation de transport 
de Boltzmann. Cette modèle est basée sur diffusion de transporteurs et décrit bien le cas où 
canal est longue. Mais dans les composants avec les canaux très courts, modèle quasi-
balistique de transport doit être employé, comme il est employé dans NanoMOS. Dans cet 
outil, formalisme de Fonction non-équilibre de Green (NEGF), qui est une méthode de 
résoudre des équations différentiels inhomogènes, est employé pour résoudre l'équation de 
Schrôdinger et calculer les densités de charge et courent. En plus, Équation de Poisson est 
résolu avec méthode de Newton pour calculer profil de potentiel [20], [29]. 
NanoMOS a été créé plus tôt avec un regarde algorithmique et mathématique, et pas 
avec un stress sur codage. C'est parce que le défi, ici, c'est de trouver des solutions pour les 
problèmes de modélisation physique et résolution des modèles complexe mathématiques et 
pas faire un codage optimal. MA TLAB a été choisi comme la langue de programmation 
pour profiter de ses fonctions avancées. NanoMOS a environ 5000 lignes de code distribué 
dans différents fichiers fonctions de MA TLAB qui sont appelés selon le type de simulation 
demandé par utilisateur [36]. 
Pour décrire brièvement la façon de fonctionnement de ce code, il faut dire qu' à part 
des étapes de donner les entrées, initialisation et calculs primaires, NanoMOS fait ses 
calculs en un nombre des itérations d'une boucle de calcul. Cette boucle doit satisfaire un 
paramètre de condition qui représente la convergence de réponses produits par les 
itérations. À l'intérieur de cette boucle, deux étapes sont pris: au début, fonction 
FIO 
chargenanomos.m est appelé pour calculer la densité de charge en utilisant le modèle de 
transport choisi. Après, fonction poisson.m est appelé pour calculer le nouveau profil de 
potentiel en utilisant les densités produites par chargnanomos.m. La différence maximum 
entre nouveau profil de potentiel et ancien profil de potentiel est comparée avec une critère 
de convergence, défini par l'utilisateur. Si le critère n'est pas respecté, la boucle est répétée 
[28]. 
Après avoir compris le fonctionnement du code, il faut l'étudier pour trouver les 
stratégies possibles de parallélisation. NanoMOS, en bref, performe un série d'algorithmes 
de solution des équations sur les nœuds d'une grille de l'espace. Cela est répété pour 
chaque point de bias. Dans quelque sous-programme de NanoMOS, ces opérations sont 
faites pour tous les « subbands» et « valleys» d'électrons aussi. Considérant cette 
structure, quelque stratégie de parallélisation est suggérée. La première c'est de distribuer 
les calculs en distribuant la grille de l'espace parmi les processeurs. L'autre est de 
paralléliser sur les points de bias. L'autre solution pouvait être parallélisation de 
l'algorithme de calcul (effectué sur chaque point de grille et bias) soi-même. Ça vaut dire 
trouver les possibilités de parallélisation de l'algorithme de solution des équations [28]. 
Parallélisation sur les points de bias est une solution simple et directe et en même temps 
inclusif et lourde en termes de routines. C'est simple et directe, parce que c'est l'algorithme 
le plus claire. Dans NanoMOS, la boucle de calcul principale (solution auto-cohérente) est 
répétée pour tous les points de bias. C'est inclusif et lourd parce qu'avec cette stratégie, 
presque toutes les routines de calcul sont engagées dans parallélisation. Chacun parmi eux 
doit être examiné pour compatibilité avec algorithme parallèle. Ses variables doivent être 
passés et retournés correctement. Finalement, quand un grand nombre de routines sont 
Fll 
inclus, au moment de faire des modifications ou ajouter des lignes de code, plus d'erreurs 
doivent être réglées. Les calculs reliés aux points de bias sont indépendants de l'un à 
l'autre. Elles ont seulement besoin de quelque initialisation, qui est généralement fait à 
l'extérieur de partie parallèle de code [28]. 
Pour implémenter l'algorithme de parallélisation sur les nœuds de grille, quelque 
décision doit être prise. Dans cette approche, des modifications sont réalisées à l'intérieur 
de toutes les routines qui travaillent avec nœuds de grille. Chaque routine reçoit ses entrées 
dans la forme des grands matrices qui incluant les données pour tous les nœuds. La routine 
fait ses manipulations et produit les matrices de sortie. Alors modifications parallèles sont 
faites indépendamment dans chaque routine. Cela augment « overhead » de cette stratégie. 
En plus, il faut voir si les routines ont des boucles assez longues pour chaque point de 
grille. Cela est nécessaire pour que la parallélisation soit efficient. Parce qu'avec les 
boucles courtes (peu de calcul pour chaque nœud) créer un nouveau nœud de calcul n'est 
pas profitable. En effet, la plus part des routines n'ont pas des boucles longues désirées. 
Une solution algorithmique peut être de diviser tous les nœuds de grille en quelque groupe 
et envoyer les calculs de chaque groupe à un processeur [28]. 
La possibilité de parallélisation sur « valle ys » et « subband » est plus limitée que les 
deux solutions discutées. Seulement deux routines répètent leurs calculs pour les 
« valleys» et« subband» et ils sont, tous les deux, des parties de solution d'équation de 
transport. Alors dans chaque itération de solution auto-cohérente, juste la moitié 
d'algorithme est parallélisée. Une solution peut être de paralléliser sur les points de grille 
quand c' est mieux et paralléliser sur les « subbands» et« valleys» quand c'est préféré. 
De toute façons complexité et fragilité de cette solution est remarquable [28]. 
Fl2 
Parallélisation d'algorithme ne semble pas très prometteur non plus. La plus part des 
calculs sont effectués dans la boucle auto-cohérent, mais les deux routines principales de 
cette boucle (chargenanomos.m et poisson.m) sont dépendants de résultats de l'un à l'autre. 
Alors ils ne peuvent pas être exécutés en parallèle. Les autre possibilités de parallélisation 
d'algorithme ne sont pas assez significatif [28]. 
Pour conclure, quand il y a plus d'un point de bias, parallélisation sur les points de bias 
est très efficient; Parce que la plus part de charge de calcul est parallèlisée et aussi parce 
que peu de communication nécessaire. De l'autre côté plus de routines sont engagées et les 
paramètres et les appels des routines sont plus nombreux et l'algorithme est plus complexe. 
Quand il y a juste un point de bias, un mélange des autres solutions doit être utilisé. Dans 
ce cas, la solution n'est pas aussi efficiente. Parallélisation est moins complexe pour chaque 
routine, mais il y a un nombre considérable d'opérations parallèles à concevoir et réaliser. 
Ça vaut dire que les sessions parallèles sont indépendantes pour chaque routine et ça 
augment « overhead »considérablement [28] . 
Dans ce projet, NanoMOS est parallélisé sur les points de bias. Une série de points de 
bias avec tensions de grille identiques et tensions de drain variantes est utilisée pour les 
simulations. Fonction «parfor » de MATLAB est utilisée pour paralléliser la boucle auto-
cohérent de code. Le code parallèle est exécuté sur une grappe et les résultats sont présentés 
[28], [39]. 
Codage parallèle dans MATLAB est fait en utilisant «MA TLAB Parallel Computing 
Toolbox » qui inclut les fonctions parallèles et les fonctions de préparation de réseau de 
nœuds de calcul. « MA TLAB Distributed Computing Server» est utilisé pour gérer la 
FI3 
grappe et communiquer avec «scheduler» ou «job manager» de grappe. Les nœuds de 
calcul sont appelés « worker )) et le nœud principal est appelé « client )) [39]. 
Résultats 
Dans ce projet, la diminution de la durée d' exécution de code est le but principal. 
Facteur d'accélération est le paramètre qui représente augmentation de la vitesse 
d'exécution. Les limitations de facteur d'accélération sont discutées. En bref, la partie non-
parallèlisable de code impose une limite au facteur d'accélération [20], [40]. 
Comme il est discuté en haut, parallélisation sur points de bias produit le moins « 
overhead )) parmi toutes les solutions et abandonne très peu de la charge de calcul à 
l'extérieur de partie parallélisée de code. En d 'autres termes, cette solution a le plus court 
route critique (route nécessairement sériaI d'exécution) [20], [28]. 
Code parallèle est exécuté sur 2, 4, 8, 16 et 28 nœuds sur grappe Krylov de CLUMEQ. 
Points de bias sont générés avec tensions de drain qui augmentent de 0 V à 1.62 V avec les 
étapes de 0.1 V. Quand la tension dépasse 1.62 V (pour 28 nœuds) taille d' étapes est 
diminuée à 0.06 V pour éviter les tensions élevée de drain qui effectue la vitesse 
d' exécution de code. [22], [39]. 
Facteur d' accélération est présentée pour différents nombres de nœuds de calcul dans 
figure 3. 
F14 
20 
18 
16 
14 
12 
la 
8 
6 
4 
2 
a 
2 workers 4 workers 8 workers 16 workers 28 workers 
. speed-up 
Figure 3 Facteur d'accélération mesuré par auteur pour différents nombres de 
processeurs 
Considérant les discussions sur algorithme choisi, il est possible de dire que 
l'augmentation du facteur d' accélération ne sera pas saturée facilement avec une 
augmentation de nombre des nœuds et la tendance vue dans figure 3 peut être répétée pour 
les nombres plus grand de nœuds de calcul. 
Validation quantitative de ces résultats est faite par comparaison de résultats de code 
parallèle et code originale. C'est notable que modifications reliées à parallélisation ne 
changent pas les fonctionnalités de calcul de code. Alors les résultats produits seront 
identiques. Le vecteur de convergence des donnés est utilisé pour cette validation. 
FI5 
Pour voir l'effet de nombre de « workers » sur efficacité de code parallèle, les facteurs 
d'accélération produits sont comparés avec la prédiction idéale (sans considérer les 
communications, initialisations et tolérances) dans figure 4 [28], [41]. 
30 
25 
20 
15 
10 
5 
0 
2 workers 4 workers 8 workers 16 workers 28 workers 
Figure 4 
...... measured speed-up _Gustafson's law's predicted speed-up 
Facteur d'accélération mesuré par auteur, et facteur d'accélération 
théorique 
Pour pouvoIr mIeux comparer, les résultats normalisés aux prédictions théoriques 
idéales sont présentés dans figure 5. 
Fl 6 
0.9 
0.8 
0.7 
0.6 
0.5 
0.4 
0.3 
0.2 
0.1 
0 
2 workers 4 workers 8 workers 16 workers 28 workers 
Figure 5 
• measured speed-up normalized to Gustafson's law speed-up 
Facteur d' accélération mesuré par auteur normalisé pour prédiction de 
la loi de Gustafson 
C' est difficile à reconnaître un motif dans figure 5. La raison est que la durée de 
l' exécution de code parallèle est de même ordre de grandeur que les tolérances de grappe 
utilisé. Il a été un des problèmes graves au moment de faire les mesures. Différents 
techniques ont été utilisés pour diminuer l'effet des tolérances. De toute façon, une pente 
vers le bas est visible quand l'ensemble des résultats sont considérés. Ça vaut dire que 
quand le nombre des nœuds de calcul augment, les mesures suivent la prédiction théorique 
moins étroitement. Cela est à cause de la petite augmentation de la durée de l'exécution 
pour les points de bias avec tensions de drain plus élevées. Cela cause le code parallèle 
souffre plus que le code série, alors efficacité diminue. Mais cela ne justifie pas l'utilisation 
de point de bias concentrer à proximité de zéro (seulement pour avoir des résultats plus 
idéal), parce que ça ne sera pas la façon que le code parallèle est destiné à être utilisé en 
Fl7 
pratique; parce que le changement de point de bias est probablement nécessaire pour 
optimiser une conception. 
Conclusion 
Un algorithme de parallélisation est proposé et implémenté sur le code d'un outil de 
simulation balistique de MOSFETs à l'échelle nanoscopique à doubles grilles. Pour créer 
une base théorique, différents concepts des composants nanoélectroniques sont étudiés. le 
composant considéré est étudié dans les aspects physiques et mathématiques et 
technologiques et les obstacles de sa modélisation sont considérés. Calculs parallèles est 
proposé comme une méthode de conquérir les obstacles de simulations précises et 
complètes de ce type de composants. Facilités de calcul à haute performance sous le nom 
Calcul Canada sont présentés et étudiés comme infrastructures de cette recherche. Un outil 
de simulation de composant considéré est choisi et étudié en détaille. Les scénarios possible 
de parallélisation de cet outil dans logiciel MATLAB sont présentés et discutés. Le 
meilleur algorithme de parallélisation est choisi et implanté et un code parallèle est généré 
qui est exécuté sur infrastructures de Calcul Canada. Les résultats de calculs parallèles sont 
présentés et validés. Les résultats sont étudiés concernant l'efficacité de parallélisation. 
Facteurs d'accélération produits sont satisfaisants et la diminution de la durée de simulation 
est atteinte comme c'était attendu. Algorithme et code parallèle généré sont accessibles 
pour travaux additionnelles des chercheurs. 
Chapitre 1 - Introduction 
1.1 Nanoscale electro-thermal phenomena 
N anotechnology in definition is the science of manipulation of systems with at least one 
dimension in the range of 1 to 100 nrn. Nanoscale engineering uses different properties of 
materials in this scale to create new devices [1] . 
Nanotransistors are one of the most important nanoscale systems and are now subject to 
great research and study and hopefully will replace today's transistors in near future. 
Naturally designing and producing nanotransistors with desired characteristics is of great 
importance. In the following discussion we see that temperature and thermal phenomena, 
with many other factors , have a huge effect on transistors characteristics. So it is essential 
to be able to model these effects correctly and efficiently. 
Transistors, as one of the main nanosystems, are affected by temperature in different 
ways: First, transistor delay changes with temperature: higher temperature lowers the 
threshold voltage and decreases the mobility, causing the delay to change (increase or 
decrease depending on the strength of each effect).Second, high temperature increases the 
resistances, causing the interconnect delays to grow. Third, leakage power changes 
radically with temperature changes. The relation is in fact exponential. Leakage power 
itself can be very dangerous. This CUITent increases the temperature, so a positive feedback 
is formed that can lead to the buming of the device. Fourth, reliability problems, for 
2 
example negative temperature bias instability (NBTI), oxide breakdown, and electro 
migration increase with high temperature too [2], [3] . 
In technologies where the isolation from silicon substrate is created usmg buried 
silicon-dioxide layers, because this isolator has a low thermal conduction, self-heating is a 
more serious problem. Examples are silicon-on-insulator (SOI), strained-silicon-on 
insulator (SSOI) and tri-gate CMOS transistors [4]. 
The scaling of transistors has continued for decades now, providing an increasing 
processing power. Today, this scaling process is facing one major roadblock: power 
problem. In CUITent technologies, more scaling leads to power densities, heat generation 
and chip temperatures so high that the functionality and reliability of operation of the 
transistors are challenged. Today power density in chips is on the order of 100 W / cm 2 [5]. 
In addition to increased power densities, non-uniform heat generation, which appears in 
nanotransistors, creates sharp temperature gradients in these structures [4] . 
High temperature can cause a wide range of problems: electrostatic discharge (ESD) 
causing malfunctioning, electro migration and electrical overstress, leakage cUITent, timing 
failure, self-heating, hot spots, thermal stresses and so on[ 4]. 
Classical laws of physics do not predict the behavior of systems in nanoscale cOITectly 
and quantum physics must be employed to predict heat transfer in these scales. For 
example, Fourier law cannot model the thermal conductivity of super lattices and 
Boltzmann law is not successful in predicting radiation across small gaps[6]. 
3 
Sorne of the properties of heat transfer in macro scale are not present in nanoscale, 
sorne change. In nanoscale heat conduction is ballistic and similar to thermal radiation. On 
the other hand thermal conductivity becomes dependent ofmaterial property [6]. 
In very small scales a phenomena called nanometer-scale hot spot occurs In drain 
region of a transistor, causing an increase in the drain series resistance and source injection 
electrical resistances . In non-traditional transistor geometries using novel materials, for 
example in ultra-thin body, FinFET, or nanowire transistors, this effect is stronger and heat 
conduction is more affected. Aiso increased surface to volume ratio in new geometry 
transistors increases the effect of material boundary thermal resistance in the modeling of 
transistors that needs attention [5] . 
New geometries make it more difficult to rem ove the heat generated inside the device 
and new materials used in new devices have much smaller thermal conductivities than bulk 
silicon (Table 1-1). Aiso when one or more of device dimensions are less smaller than 
electron or phonon mean free path, sub-continuum effects, like ballistic electron and 
phonon transport, create hot spots in drain and change the thermal modeling of device. In 
addition to that, the non-local transport conditions displace these hot spots in a different 
way than classical heat diffusion theory prediction and produce a higher temperature rise 
than what is predicted by that theory. Just to make it worse, when a transistor has 
semiconductor films that are thinner than the phonon mean free path, the thermal 
conductivity of these films is reduced because of phonon confinement and boundary 
scattering [5]. 
Table 1-1 Thennal conductivities of a few materials used in semiconductor 
device fabrication, phonon boundary scattering significantly reduces 
the thennal conductivity of a 10 nm thin silicon film 
Pop, E. , Goodson, K.E., "Thennal phenomena in nanoscale transistors", The Ninth 
Intersociety Conference on Thermal and Thermomechanical 
Phenomena in Electronic Systems, ITHERM '04. 1-4 June 2004. 
Material Thennal Conductivity (W /mK) 
Si 148 
Ge 60 
Silicide 40 
Si (10 nm) 13 
Si02 1.4 
4 
Electron transport is called electrical CUITent and phonon transport is heat flow. When 
these two particles' transport is done under Ballistic conditions at nanometer length scales, 
a strong non-equilibrium between these energy carriers is caused. It means that the 
electron-phonon interaction will no longer be energetically or spatially unifonn. Electrons 
transfer their energy to the lattice by interacting with phonons. Optical phonons have no 
role in the thennal conductivity. It is longitudinal acoustic phonons that transport heat. This 
is the process of generation and transportation of heat in a transistor: applied voltage creates 
an electric field with its peak happening in drain. The electric field accelerates and gives 
energy to free charge carriers which, in the case of a MOSFET are conduction band or free 
electrons. Heated electrons scatter with many particles: each other, lattice vibrations 
(phonons), interfaces, imperfections or impurity atoms. But only their scattering with 
5 
phonons can transfer energy. In silicon and most of other semiconductors, optical phonon 
emission is responsible for most of high field Joule heating. So the phonons take energy 
from electrons and the lattice is heated by Joule heating. When the lattice's temperature 
increases, electronic transport properties of the environrnent change. For example, in bulk 
silicon the electron mobility decreases as T -2.4 in room temperature [5]. 
Optical phonons which take energy from electrons are slow moving particles that do 
not have any role in heat transport. But they decay into acoustic phonons which move fast 
and transport energy. Electron-phonon scattering time is in the order of tenths of 
picoseconds. But optical to acoustic phonon decay time is on the order of picoseconds. So 
that creates a population of optical phonons waiting to decay to acoustic phonons. But now 
if the generation rate of optical phonons from electrons in the process of Joule heating is 
greater than the rate of decay into acoustic modes, the population of optical mode phonons 
increases which, as mentioned above, affects electron transport properties of the 
environrnent. In the micro-transistors, the volumetric Joule heating rate is the dot product of 
the electric field (Ë) and CUITent density (]). But in nanotransistor this is not the case. 
Electrons are accelerated by electric field mostly in drain, where the field is strongest. Then 
electrons travel a "mean free path" before scattering with phonons and giving their energy 
to the lattice. Electrons release their energy to the lattice in several steps over several mean 
free path travelling. That creates a non-Iocality of phonon emission from the peak of 
electrical field position in nanotransistors. The electron-phonon mean free path is calculated 
as the product of electron-phonon scattering time, which is almost 0.05 Pico seconds in the 
high electrical field, and electron saturation velocity (10 7 cmls in silicon). So each mean 
free path is around5 nrn. In micro-scale transistors this distance is negligible compared to 
6 
distances around sorne mIcrons, or a little smaller. But in nanoscale transistors with 
channels with the lengths near 10 nm, it is a big displacement of energy. In other words 
nanometer-sized hot spots are displaced as much as sorne nanometers from the prediction 
of continuum theory about their position in drain. This in very small transistors is a 
significant change and must be modeled to predict the heat transportation correctly [5], [7]. 
By scaling, the density of devices in a chip increases, but at the same time there is 
almost no decrease in dynamic power dissipation, so volumetric heat generation density 
increases in a rate equal to scaling rate. When the dynamic power of due to nanotransistor 
technologies is considered: C eff Vdd 2 fit seems that reducing operating voltage ( Vdd ) will be 
ideal for reduction of dynamic power. But to keep the drive CUITent high enough, reducing 
Vdd must be accompanied with a reduction in threshold voltage (~). But source-drain 
leakage grows exponentially when ~ is decreased. So it is obvious that there is a tradeoff 
between performance and power consumption. When threshold voltage is lowered too 
much to lower the power, the performance suffers. So during the history of scaling, there 
has been a strong tendency to keep Vdd and ~ high and they have been scaled very lowly 
compared with the physical dimensions of devices. So over time transistors have had 
stronger electric fields and higher power densities. Now the fundamental barriers ahead of 
this technology have stopped it from continuing to go forward in expense of field 
magnitude and power density and a major shi ft in process design, performance targets, and 
architecture seems inevitable [8]. 
Dynamic power, being related to impedance, has an inverse relation with temperature. 
But leakage power increases with junction temperature. It means that if the increase in 
7 
power or thermal resistance happens, as it is happening in new technologies, leakage power 
will grow too. So a positive feedback is formed in which leakage power and temperature 
start to grow [8]. 
To sum up so far, as device size reaches 10 nm, circuit density passes one billion 
devices per centimeter. There is an agreement that the technological roadblock of scaling 
trend at this point is the power problem. In this view, power densities, heat generation, and 
chip temperatures are approaching levels so high that reliable operation of devices is no 
longer guaranteed (Figure 1-1)[9]. 
Here a more detailed discussion of heat transfer mechanism in lattice is presented. 
Phonons are Eigen modes of vibration of atoms in the solid state of lattice. This 
phenomenon is used in describing thermal behavior of materials. These vibrations are 
divided in two categories of longitudinal and transverse modes. In a different 
categorization, depending on if the neighboring atoms oscillate in phase or out of phase the 
phonon associated with the vibration is an acoustic or an optical phonon respectively. The 
name optical phonon cornes from the fact that an atomic dipole is formed by out of phase 
vibration, can interact with photons. High energy electrons generate optical phonons which 
in turn have to decay to acoustic phonons so that heat can be transported from the hot area 
of lattice. The ballistic travelling of particles makes the bulk parameter of thermal 
conductivity, which has been used to describe heat conduction in electronic thermal 
management models, an inaccurate approximation of heat transport for very short times, or 
very short distances [1]. 
Although thermal conductivity and heat diffusion equation are still used to explain the 
majority of thermal transport phenomena at nanoscale, with increasingly high drive fields 
8 
they stop responding to the changes, because in these conditions different populations of 
electrons and phonons are not in equilibrium. In other words, the situation is so far from 
classical conditions that a single temperature cannot be measured for the lattice. This non-
equilibrium of electrons and phonons and also the non-equilibrium between optical and 
acoustic phonons create nanoscale hot spots in short channel transistors. Another factor 
increasing this hot spot effect is the ballistic electron and phonon transport. Another 
phenomenon that changes the heat transfer in nanotransistors is thermal boundary 
resistance, formed at the interface of two solids. Currently there is not any theory to 
describe the thermal boundary resistance and thermal resistance of nanometers-thick layers . 
In sorne tries to describe these effects phonons are supposed to be particles that transport 
through an interface in a diffusion reflectance model. Sometimes phonons are supposed to 
be waves and their transport through an interface follows acoustic mismatch model. These 
theories have not been able to explain sorne of the experimental results. The choice 
between models depends on roughness of surfaces and scattering at the interface. Thermal 
conductivity is a bulk parameter; it is the average of effects of many different forms of 
phonons representing different modes of lattice vibrations. Because it is difficult to 
calculate each mode's exact share of heat transport, the complete understanding of heat 
transport in nanodevices is not yet reached [1]. 
1000 
-N 
e 
u 
~ 100 
~ 
~ 10 
1 
• 
... AMO 
• Intel 
• PowerPC 
-Trend 
• 
1990 1994 1998 2002 2006 2010 
9 
Pop, E., Sinha, S., Goodson, K.E., "Heat Generation and Transport in Nanometer-
Scale Transistors", Proceedings of the IEEE, vol. 94, Issue 8, pp. 1587 
- 1601, Aug. 2006 . 
. Figure 1-1 Device power density growth over the time. Interesting to note is that 
the vertical axis is logarithmic and the horizontal one is linear so an 
exponential trend is obvious. Sorne fact to compare: the power density 
in a hot plate is in the order of 10 W / cm 2 , in a nuclear reactor in the 
order of 100 W / cm 2 and at the surface of the sun about 7000 W / cm 2 
. Data compiled by F. Labonte, Stanford 
Strained Si, Ge, SiGe 
Buried oxide 
<a) Silicon substrats 
~----------------~ 
Bulk Silicon 
Pop, E., Sinha, S., Goodson, K.E., "Heat Generation and Transport in Nanometer-
Scale Transistors", Proceedings afthe IEEE, vol. 94, Issue 8, pp. 1587 
- 1601,Aug. 2006. 
Figure 1-2 (a) bulk silicon (b) (strained) silicon/germanium on insulator (c) 
multiple-gate or FinFET devices. In more advanced geometries (b,c) 
10 
thennal conductivity of materials is lower and heat removal is harder. 
In band c for the gate length of 10-nrn, the thickness of semiconductor 
is 30% to 50% of the gate length. (c) is from the ITRS [7] 
In advanced geometries heat removal is more complicated and harder and new 
materials, necessary to develop these architectures have lower thennal conductivities than 
bulk silicon (Table 1-1). It was said that self-heating in nanodevices is due to interaction of 
electrons and phonons. We saw how electron's mean free path in bulk silicon is about 5-10 
nni. Phonons also have their own mean free path that can be calculated to be about 200-300 
nrn in bulk silicon at room temperature. When channel lengths in future technologies 
reaches both these limits, ballistic transport will be the main transport mode of both 
electron and phonon, in other words, both CUITent and heat. These ballistic transports cause 
the energy carriers to have a non-equilibrium distribution. It was explained how different 
phonon modes affect heat flow differently. Worse yet, because ofphonon confinement and 
boundary scattering in geometries where semiconductor film is thinner than phonon mean 
free path, the thennal conductivity of the environment will be dramatically less than bulk 
silicon. So in thin-film and confined-geometry transistors [Figure 1-2 (b) and (c)] heat flow 
and cooling of transistor is more difficult and temperature is higher than bulk transistors 
[Figure 1-2(a)] with equal power input [9]. 
Joule heating, dot product of the macroscopic electric field and CUITent density fails to 
approximate the microscopie non-Iocality ofphonons in drain. This average model gives no 
understanding of electron energy exchanges with various phonon modes and it does not 
provide a spectral understanding ofphonons either [7] . 
Electrons lose their energy by electron-phonon scattering at a rate equal to the 
macroscopic relation: P=I.V. In channel lengths bellow 50 nm, phonon distribution 
11 
function affects the heat conductivity of the transistor. So the distribution of phonons must 
be taken into account in the design. In nanotransistors, for a drain size of around 10 nm 
where the field is strong, phonons receive energy from electrons in a distance up to 50 nm 
away from drain. Acoustic phonons transport the heat to the contacts and package 
boundaries. The propagation of acoustic phonon is faced a resistance when they scatter with 
other phonons, impurities, and boundaries. That causes a heat accumulation inside the 
device and creates higher temperature. More exactly, phonon population is increased inside 
the device and its effect is represented by a higher effective temperature [8]. 
Phonon boundary scattering can reduce the effective thermal conductivities ofultra-thin 
silicon films up to 10 times compared to that of bulk silicon [8]. 
Within sorne 10nm space around heat source, charge and heat energy transports do not 
follow the diffusion model. In this distance, electrons and phonons are in near-ballistic 
conditions causing highly non-equilibrium distributions of both electrons and phonons. If 
sorne certain quantum effects are small enough to be neglected, the solving Boltzmann 
transport equation (BTE) gives almost accurate distribution profiles for different phonon 
modes, leading to an almost exact profile ofheat distribution [8]. 
According to simulations, in electric field strengths equal to that of modern transistors, 
almost 2/3 of the heat generation is done through generation of optical phonons. But they 
are responsible for little to no share of heat transport, because they are much slower than 
long wavelength acoustic modes. So it is absolutely necessary to find out how these optical 
phonons decay into acoustic modes [8]. 
To understand this decay it is necessary to know the optical phonon lifetime for 
different conditions, inc1uding different power densities [8]. 
12 
This lifetime has been shown to be about 4 picoseconds for equilibrium conditions in 
300 degree of Kelvin. This time decreases as power density of the channel increases [8]. 
As the channel length is scaled the power density increases. The relation between 
power density and channel length for transistors with channel length around 10 nm is 
expected to be: r l.7 [10]. 
The thermal conductivity of silicon thin films and nanowires in the transistors with 
channel lengths beUow 100 nm has been shown by experiment to be two to five times 
bigger than that of bulk silicon. This is caused by phonon surface and interface scattering 
[1] . 
In macro scale, many different carriers are involved in heat transfer: phonons, electrons, 
photons are molecules. They aU behave like partic1es in these scales. In nanoscale on the 
other hand, carriers show wave behaviors a lot more often, so that the wave effects are 
often more important than partic1es behaviors, though that is not always the case[ Il]. 
Mean free path of a partic1e is the distance it travels between two collisions with other 
partic1es. These collisions can happen with two different types of partic1es. Nanoscale heat 
transfer phenomena can be seen in five categories depending on the relative sizes of mean 
free path and channel length [11]. 
When the structure is much bigger than the mean free path of the carriers, the diffusion 
regime can safely be used. This model then can be solved using conventional methods. 
Now depending on the relative size of the structure to the mean free path of energy carriers, 
i.e. electrons and phonons, two distinct regions can be identified. Inside each region a 
regime can be used that accounts for the wave properties of that carrier, instead of its 
13 
particle properties. There are also two transition zones where the size is comparable with 
the mean free path of a carrier and using one regime means accepting sorne approximation. 
In these zones scattering process that takes place in the environment determines if the 
diffusion regime should be used or ballistic regime. If scattering destroys the phase of the 
wave, the wave effects die out and diffusion can be used with sorne approximation. 
Inelastic scattering occurs when inelastic collisions, for example collisions between 
electrons and phonons or between phonons and phonons, destroy the phase of the wave. 
The when the structure is smaller than mean free path and the ballistic regime is used, it 
means that the scattering is neglected [11]. 
The ballistic behavior happens in highly scaled transistors when power density is above 
1000 W / cm 3 and channel length is well below 100 nm, in other word much less than 
relaxation length. In this situation the behavior of "hot electrons" and "hot phonons" needs 
to be analyzed [12]. 
The non-equilibrium situation of energy transport in nanotransistors changes their 
behavior to a great extent. This phenomenon reduces the trans-conductance of the 
transistors, in other word, the drive CUITent. Aiso it can increase sub threshold leakage 
power. These changes occur due to the fact that when electrons scatter and transfer their 
energy to optical phonons, because these phonon modes have low group velocities (~ 1000 
mis), they cannot transport the heat from the hot region. In 10 picoseconds they decay into 
fast moving acoustic phonons. But the delay means that there is an increase in energy 
density near hot spots, which affects electron transport [12]. 
So, nanoscale sizes create nanoscale hot spots in the transistor's drain, and that 
increases the drain series and source injection electrical resistances through creation of high 
14 
densities of carriers. This process is helped by use of new materials and modem geometries 
like ultrathin body, FinFET, or nanowire devices. Thermal analysis in these cases differs 
from classical diffusion theory and sub-continuum phenomena like ballistic electron 
transport play an important role in heat generation and heat transport [9]. 
Aiso in new geometries there is a greater surface to volume ratio which means 
boundary thermal resistance has a larger effect on heat transport in modem transistors [9] . 
Now that it is observed that heat conduction in nanostructures is much more limited 
than anticipated by the Fourier theory and that size effects produce higher temperatures, 
more effective thermal management measures have to be taken. As a positive side, these 
size effects might also offer the possibility of developing highly efficient thermoelectric 
(TE) materials and more efficient thermal management through direct cooling [13] . 
Size effects can cause limited heat conduction capability in super lattices of 
nanotransistors, which produce sorne thermal problems. But these effects also enhance the 
properties of thermoelectric (TE) devices, used in thermal management of semiconductor 
lasers. Maybe they have the same effect on thermal management of microelectronic devices 
too. In fact a factor called TE figure of merit (FOM) that indicates the efficiency of TE 
coolers, has shown high numbers in sorne III-V super lattices and TE super lattices 
meaning low thermal conductivities [13]. 
Heat transfer at macro scale follows sorne principle rules. One of them is the Fourier 
law that is basically a diffusion equation. Another useful factor to describe thermal 
behavior in macro sc ale is thermal conductivity which is a material property and 
independent of size and shape [6]. 
15 
High power density and temperature in electronic circuits reduces transistor carrier 
mobility and threshold voltage and increases interconnect resistance, it produces reliability 
problems like electro migration, dielectric breakdown, and negative body biasing. It also 
increases power consumption by increasing sub-threshold current. At the end cooling costs 
increase because of all the negative impacts of scaling on thermal management [14] . 
To solve a heat generation problem within the classical limits, Joule heat generation 
rates are calculated as dot product of electric field and CUITent density. Temperature profile 
can be found by solving heat diffusion equation derived from Fourier law. This method is 
applicable when the assumption that electrons and phonons, as current and heat carriers, are 
in local equilibrium is valid and also when the discussed sizes are bigger than sorne mean-
free-paths of carriers [12]. 
Self-heating in semiconductor device is a production of scattering of electrons with 
phonons. To find a fairly exact model of this process all of the scattering events have to be 
considered in the simulation. Solution of the Boltzmann Transport Equation (BTE) for both 
electrons and phonons can provide this goal. BTE is a semi-classical transport regime in 
which charge and energy carriers are particles between scattering events [12]. 
1.2 Problem 
In thermal aspect of micro and nanoelectronics two problems can be identified. One is 
the process of heat generation, heat transport and heat conduction in each device and the 
other one is the heat management at a system level [6]. 
Here we will discuss thermal analysis within each transistor. 
16 
Today, thermal management has to be applied at all levels of system. It starts from 
transistor and continues to the circuit and microarchitecture and finally to the packaging. 
However in modern thin-body devices with higher field levels thermal management within 
the transistors has become more important than before [10]. 
Before, hotspots used to appear only at the circuit level, due to the architecture of 
devices on the chip. Now, nanoscale hot spots are starting to appear inside the transistors 
[9]. 
Micro- and nanoscale electronic devices are capable of producing heat fluxes in small 
hot spots in the order of sorne thousands of watts per centimeter square. In the chip level 
sorne areas as big as a couple of hundred micrometers in diameter with a temperature 
10 - 400 C hotter than the surroundings can be formed because of difference in the work 
load of transistors. Naturally cooling systems are designed to fight the highest temperature 
on the chip and because the cooling systems usually cool all the area of the chip, this means 
that sorne of the cooling resources go to waste because of hot spots. If the hot spots are 
removed by correcting the design, cooling costs decrease, or there will be more cooling 
resources and more densities of devices become practical. The same goes for the nanoscale 
hot spots formed in the nanotransistors. In bulk silicon devices, the thick silicon substrate 
(sorne several hundred micrometers thick), with its high thermal conductivity, absorbs the 
heat from hot spots and distributes it to a size comparable to its own thickness. This is still 
much smaller than the whole circuit size, but one can observe that in this way bulk silicon 
substrate softens the extremely non-uniform power dissipation patterns. Now when the 
silicon wafer becomes thinner and wh en stacked chips or three dimensional chips are 
further developed, there will be much less heat dissipation from silicon substrate and hot 
17 
spots will be smaller and heat distribution function will be more steep, all that giving rise to 
higher temperatures [1] . 
Here heat flow paths will be discussed in a chip. Up to 70% of heat generated in a 
microprocessor exits through thermal interface materials and heat sink. 30% exits through 
multi-level interconnect dielectric structure, solder bumps and printed circuit board. To 
reach to a reliable system-Ievel thermal analysis, both of these heat flow paths must be 
simulated. Interconnects are the part that have received little attention as heat flow path, but 
they have more potential for further studies. Aiso CUITent thermal analysis is based on 
average power dissipation. But considering the growing non-uniformities in devices this 
has to change in future [4]. 
Investigations on nanoscale electronic properties started in the 1960s, but nanoscale 
thermal property is a new field of research. One reason might be that measuring thermal 
data in small sc ale is more difficult. In fact sorne new measurement techniques for 
temperature and heat flow at atomic scales have been developed in recent years: 
thermocouples placed at the tip of atomic force microscopes or scanning thermal 
microscopy (SThM) can measure temperature with 100-nm resolution. Sharp heated 
metallic tips in ultrahigh vacuum environment make scanning seebeck voltage 
measurement technique capable of 2-3 nm resolution measurements. Femto second laser 
technology measures heat transfer and acoustic velocities in thin-film sub micrometer 
structures. Thermo reflectance imaging in visible wavelengths measures the surface 
temperatures with sorne 100 nm spatial resolution and 1- 100 mK temperature resolution 
[1] . 
18 
Recently, as thennal aspect of design has become more important, researches have 
started to see interconnect heat dissipation and especially self-heating in modem transistors 
as an area with essential importance in device and circuit level design [8]. 
Although transport simulations have grown successfully recently, nanoscale self-
consistent electro-thennal simulations are still rare and undeveloped [8]. 
In the near future the dominant transistor geometries will be ultra-thin body devices 
su ch as multi-gate MOSFET (FinFET) and silicon-on-insulator (SOI) [14]. 
Considering the complexity of these modem devices, being able to model thennal 
characteristics of a given lC quickly and efficiently, so that thennal effects of any change in 
design can be simulated before the design is complete, is becoming more and more 
complicated [14]. 
Presently the high computation demand of nanoscale electro thennal analysis makes it 
very difficult to alter designs in response to thennal concems. To reach the efficient design 
of circuits based on nanotransistors, this obstacle must be overcome [15]. 
1.3 Role of parallel computations 
Conventional thennal analysis techniques of chips are so costly in tenns of computation 
resources and time that thennal analysis had no place in original circuit design. In fact, 
cooling was designed for the circuit at hand and sorne thennal optimization was perfonned 
only in the packaging and cooling system design. So the main opportunity to optimize the 
thennal characteristic, which is during circuit design, was lost. Now, fast thennal analysis 
techniques, based on classical Fourier transport, are emerging and they are employed to 
model heat transfer in chips and also cooling packages [14]. 
19 
These techniques, being fast enough, still use Fourier heat flow model which is a 
classical mode!. So they are incapable of simulating devices in nanoscale, because they do 
not describe phonon quantum thermal effects correctly. However sorne techniques have 
been developed to model nanoscale phonon heat transport in device-Ieve!. Sorne of them 
are molecular dynamics methods, Boltzmann transport equation (BTE), and ballistic-
diffusion mode!. Still computational complexity is a big obstacle in applying these methods 
in large-scale thermal analysis. Among them molecular dynamics methods are the most 
accurate models, because they directly simulate inter atomic interactions. Being so 
accurate, they demand extreme amounts of computational resources. The BTE methods 
simulate phonon transport and approximate ballistic phonon transport quite accurately. 
They are much less costly than molecular dynamics methods in terms of computational 
resources and it is much more practical to use them, but they are still very complex and are 
used only in device-Ievel analysis. The ballistic-diffusion model is an approximation of the 
Boltzmann transport equation. It is more efficient than the two other models, but still its 
complexity and demand of computational resources are not comparable with the Fourier 
mode!. So there is a large distance between nanoscale chip-package thermal analysis 
models optimized for efficiency and those optimized for accuracy. To enjoy the benefits of 
both in a thermal analysis model that is applicable to le design processes, this distance has 
to be shortened greatly [14]. 
During the last decade, several research studies of nanosystems have been done. But 
nanoscale heat transfer model is comparably a very young research subject [15]. 
20 
One solution to the problem of simulating computationally demanding thermal analysis 
is employing parallel algorithms and implementing them on high performance computing 
systems [15]. 
1.4 Objective 
As it was explained above, there is a need to create effective modeling tools for 
nanodevices. These tools inc1ude physical, mathematical and informatics aspects. The 
objective in this work is to contribute to creation and improvement of such modeling tools. 
Modeling heat generation and heat flow problems in nanotransistors is c10sely 
related to electrical modeling of these devices. Many of the presently available modeling 
tools created by researchers in this field are more often at the stage of electrical modeling 
using quantum mechanics phenomena, rather than working directly on heat generation and 
heat flow. The later requires access to established electrical models, which is still an 
objective in this field of research. Therefore, working on a modeling tool that employs 
quantum effects to model nanoMOSFETs is a suiting objective for this thesis. Electrical 
modeling of nanoMOSFETs is the primary step in designing, optimizing and finally 
manufacturing next generation of MOSFETs. Modeling techniques that incorporate 
quantum phenomena to model nanoMOSFETs and generate carrier (electron and phonon) 
distribution data along the way pave the way for exact heat analysis too. 
Another problem to face in this field, as explained in 1.3 , is the fact that physical 
models and mathematical tools that are used to de scribe nanodevices are of such 
complexity that simulations that are based on them are too computationally expensive. To 
attack this problem parallel computation has been suggested as an effective solution. In 
21 
fact, employing parallel computation to facilitate simulations that perform exact modeling 
of nanoMOSFETs is a new approach to improve modeling tools in this field. 
Considering above discussions, the objective of this thesis can be summarized as 
follows. A general understanding of nanodevices (with emphasis on nanoMOSFETs) and 
obstacles of their development is provided. A specific device is chosen and a modeling 
technique developed for that device is studied from literature in details. High performance 
computing algorithms and infrastructures are studied as tools of this research. A simulation 
algorithm that implements the studied modeling technique is taken as a base and different 
parallelization algorithms are considered to be implemented on it. High performance 
computation tools are used to generate and fUll a parallel version of the simulation 
algorithm in order to improve its results in terms of computation time. 
1.5 . Methodology 
Following the objectives mentioned above, the methodology of the research can be 
described under three categories: gathering theoretical information and related literature, 
accessing the tools, realization of objectives and generation of results. 
The flIst step, naturally, is literature review and providing the basic knowledge 
needed to go on with the project. This step includes study of electro-thermal phenomena in 
nanoscale, defining the main obstacles to be resolved in this field, an introduction to 
nanodevices, MOSFET operation principals, MOSFET geometry and fabrication processes, 
study of Double gate MOSFET and introduction to high performance computing. The main 
means of acquisition of this information is reading academic articles, books and theses and 
other scientific and technical documents. The usual method of finding these materials is 
22 
searching in IEEE archive of academic articles using IEEE official website or finding other 
academic articles using other online search engines, University library (mainly for books) 
and sorne specialized web sites (ITRS, NanoHUB, Compute Canada ... ) for specific 
documents. 
Finding and accessing the tools needed for this research refers to tools related to 
high performance computing and nanoMOSFET simulation tool and corresponding 
modeling technique. High performance computing infrastructures are gathered under 
Compute Canada platform. To gather information about the consortia and each machine 
and infrastructure and also to gain the permission to access these systems a researcher 
should open an account in Compute Canada website and seek remote access to the 
machines according to the regulations. By doing so, one can also bene fit from support of 
technical staff who have specialty in operation of the machines and also in use of high 
performance computing in different branches of science and engineering. To choose a 
simulation tool and the corresponding modeling technique, nanoHUB, a cyberinfrastructure 
of nanotechnology related resources, has been used. NanoHUB provides educational 
documents and applications and also simulation tools in this field. Researchers in Canada 
can seek access to these resources through nanohub website. A simulation tool for double 
gate nanoMOSFETs (NanoMOS) was chosen as the simulation algorithm to base this work 
on. The modeling method used in this tool is explained in details in the PhD thesis related 
to this too!. 
Realization of the objective of the research and generation of results includes study 
of selected simulation algorithm (NanoMOS) which is closely related to the modeling 
technique described in accompanying thesis, study of parallel algorithms applicable to the 
23 
simulation algorithm, study of options of high performance computing available and 
suitable for this project, applying the chosen parallel algorithm on the code of NanoMOS 
and tinally running the parallel code on a parallel processing machine. 
NanoMOS is an open source code written in MA TLAB. It applies Newton method 
to Schrodinger and Poisson equations. These equations are derived from the physical 
modeling of the double gate nanoMOSFET. A parailei algorithm based on parailelization 
over bias points of MOSFET is chosen after discussing different options. This algorithm is 
applied to the code using MATLAB Parailel Computing Toolbox. Three different consortia 
of Compute Canada (CLUMEQ, sharcnet and Westgrid) are studied as candidates for 
running the parallel code. Aiso running the code on a parailei array of machines in 
computer Lab of Department of Electrical and Computer Engineering of UQTR is studied. 
MATLAB uses its Distributed Computing Server to form a parailei array of individual 
machines. Finally the code is fUll on Krylov infrastructure from CLUMEQ. The parailei 
code is run in different tests with different numbers of parailei threads and duration of each 
fUll is measured. The results are compared to the duration of execution of the seriaI code 
doing the same tasks, which represents the contribution of this work. The results of parailei 
code are validated by comparison to the results of the original seriaI code and sorne 
parametric analysis of the duration times are presented. 
1.6 Thesis structure 
This text is structured in seven chapters. In the tirst chapter (introduction), studied 
phenomena, the subject and goal of this study and also the main tools employed in this 
research are introduced. In second chapter different categories of nanoelectronic devices are 
introduced. Then the type of devices studied in this work, as weil as their electro-thermal 
24 
constraints IS discussed. In third chapter, High performance computing concept and 
infrastructures are introduced. Physical and mathematical models and concepts related to 
the chosen device, its fabrication and packaging methods and geometrical aspects and 
finally numerical approach of modeling are discussed in chapter fOUf. In chapter five, 
mathematical formulation of the models used in simulation is discussed. Then algorithm of 
simulation is studied in detail and different possibilities of its parallelization are discussed. 
Then implementation of parallelization algorithm is explained. In chapter six, performance 
of parallel code is compared to the performance of seriaI code and also parallel code is 
validated by comparison of results. Chapter seven is a brief conclusion of this work. 
Chapitre 2 - N anoelectronics devices and N ano 
thermal constrains 
2.1 Nanoelectronics devices 
2.1.1 Introduction 
Nanotechnology and Nano-science takes different names when it employs different 
phenomena and is applied to domains of electrical, mechanical or optic engineering: Nano 
Electro Mechanical Systems (NEMS), Nano Opto Electro Mechanical Systems (NOEMS). 
Among more detailed applications of nana in electrical engineering, the most important 
ones are:nanoelectronics devices and in particular nanotransistors, nanotubes and 
Nanowires [15]. 
NEMS and NOEMS: Since development of photolithography Technique, fabrication 
of micro-electronics devices has advanced a lot. Advances in micro-electronics and le 
fabrication have had a great impact on development of micromechanical devices, Micro 
Electro Mechanical Systems (MEMS) and Micro Opto Electro Mechanical Systems 
(MOEMS) too. These techniques are the means of mass production and low prices, and at 
the same time they are high precision techniques, a feature that is essential to the sensitive 
nature of their products. Aiso miniaturization technologies of MEMS, in their advanced 
forms, help create NEMS and NOEMS [15]. 
26 
NOEMS and NEMS are nanoscale analogous technologies of MOEMS and MEMS. 
NEMS are simply electro mechanical structures (sensors or actuators)that have at li st one 
dimension with the size below 100 nm. Despite anal ogy, physics of NEMS and NOEMS is 
different from that of MEMS and MOEMS, because in nanoscale surface forces are even 
stronger and sometimes quantum effects describe the behavior of the system more 
accurately [15]. 
There are two main approaches toward fabrication of NEMS: Top-down and Bottom-
up. In Top-down approach bulk material is taken and by different techniques is formed into 
a structure. Nano-Lithography and Nano-Printing are two technologies included in this 
category [15]. 
In Bottom-up approach individual atoms and molecules are manipulated and used to 
assemble molecules, particles and structures that join together and form a certain system. 
Auto-assembly and dip-pen lithography are two technologies included in this category. 
Dip-pen lithography is a method of manipulation of single atoms using the tip of an atomic 
force microscope [15]. 
Carbon nanotubes: Carbon nanotube is a very recent discovery. It was first seen and 
discovered using an electronic microscope in 1991. They have many astonishing properties. 
Carbon nanotubes are chemically very stable. Each carbon atom is in bond with three other 
and the whole structure is the form of a cylinder with a circumference of sorne nanometers 
and a length possibly as big as sorne micrometers [15]. 
Having high heat conductivity, carbon nanotubes can have sorne applications in 
electronics. Aiso they can play the role of heat removal element in combinations with 
27 
plastics. This will take away the most serious barrier of using plastics in engines and other 
applications where a lot of heat is generated [15]. 
Possible applications of nanotubes in electronics too seem very exciting. Chirality is a 
characteristic of each carbon nanotube which in brief is the angle in which the graphite 
sheet is roUed over itself to make the nanotube. Chirality determines if the carbon nanotube 
is a conductor or it is a semiconductor. By having this property, carbon nanotube are 
completely unique among aU the materials. Also sorne types of them become 
superconductors in low temperatures. By connecting two metal electrodes by a nanotube, it 
is possible to make a Field Effect Transistors (FET). Electrons pass from the inside of the 
nanotube and it plays the role of silicon in conventional transistors. Additionally, carbon 
nanotubes waste very little electricity. In theory switches made from carbon nanotube can 
have 1000 times' higher frequency than conventional switches. If the obstacle of 
manipulation and assembly is overcome, they wiU definitely find vast applications in 
different NEMS systems like nanoengines, ultra high frequency osciUators, actuators, 
resonators, etc.[15]. 
Sorne methods have been suggested teUing how to separate conductor and 
semiconductor carbon nanotubes, but more development is needed. Also when making a 
structure using nanotubes, the impurities that enter the structure along with nanotubes, have 
been a problem (catalyzer particles, amorphous carbon, carbon fuUerenes). So sorne 
advances in purification are also needed for fabrication of nanotube structures [15]. 
Nanotransistors: For the past decades, scaling of transistors has made computers more 
and more powerful. But now fabrication limits and sorne quantum phenomena related to the 
28 
size and materials have slowed the scaling process of conventional field effect transistors 
(FET) [16]. 
To pass these barriers, sorne new devices, based onquantum effect and in atomic scales 
are being developed. These new devices that are supposed to replace conventional 
transistors and be used in ultra-dense circuits, have the ability to function as switches and 
amplifiers, as conventional transistors do. But they are based on different concept and are 
specially designed to perform successfully in nanoscale. FETswork with electrical currents 
flown by charge carriers that travel in masses through bulk semiconductor in the channel. 
nanodevices employ quantum mechanical phenomena that are more effective and ev en 
dominant in nanoscale. These charge carriers can move one by one through energy levels in 
lattices [16]. 
Here we will have a mostly qualitative discussion at length to explain how these next 
generation transistors differ from today transistors. There are two main categories of 
nanoelectronic devices: 
1. Solid-state quantum-effect nanoelectronic devices 
2. Molecular electronic devices [17]. 
Solid-state quantum-effect nanoelectronic devices are closer in structure and principles 
of operation to the conventional transistors. In these technologies fabrication methods that 
are being used will continue to exist and the geometries are more advance versions of 
conventional ones [17]. 
Molecular electronics instead employs completely different materials and structures 
than what exists now. This revolutionary change might have very beneficial features. 
29 
Molecules that are the structural basis of these technologies are already as small as a 
building block can be. Aiso aIl the molecules of the same kind are completely identical and 
the process of making them might be as simple and as cheap as mixing two solutions and 
they can be made in really huge numbers. AlI that is needed to be solved is what, if any, 
molecules and structures have the potential to form a controllable switch and also how to 
manipulate these tiny molecules and put them in their designed places in the circuits [16]. 
2.1.2 Structure and Operation of MOSFET 
The metal-oxide semiconductor FET (MOSFET), with its low power consumption and 
efficient fabrication method has been the dominant transistor in the industry for decades 
[16]. 
As it was said above, novel designs for nanoscale devices employ different physical 
laws than in MOSFETs to let electrical CUITent flow and to control it. But they aIl keep the 
basic concept used in MOSFETs. They aIl have a source, a drain and usually agate. What 
changes most, in structures, is the material and geometry of channel [16]. 
30 
(b) 
(c) 
GOLDHABER-GORDON D., MONTEMERLO M. S., LOVE J. c., OPITECK G. 
J., ELLENBOGEN 1. C., "Overview ofNanoelectronic Devices", 
Proceedings of the IEEE, vol. 85 , Issue: 4, pp. 521-540, 1997. 
Figure 2-1 Structure of an NMOS transistor 
The expression "metal-oxide-semiconductor field effect transistor" contains the names 
of materials that construct the MOSFET. Doped silicon, in spite of pure silicon, which is a 
bad conductor, has semi-conductor-like characteristics. That is because dopants impurities 
create sorne mobile charges in the silicon crystal. Negatively doped or N-doped silicon 
receives free electrons and positively doped or P-doped silicon has sorne extra electron 
vacancies or holes. Both free electrons and free holes are charge carriers and can pass the 
CUITent flow through silicon. MOSFETs have a crystalline substrate of doped silicon. Gate 
is a metal electrode that is insulated from semiconductor substrate by an insulating oxide. 
This is why it is called "Metal Oxide semiconductor". When the gate is charged to a certain 
voltage, the produced electric field in the substrate reaches a certain level than can affect 
31 
the movement of carriers and thus CUITent flow in the semiconductor. For example in a p-
doped MOSFET as in Figure 2-1 , ordinarily even when there is a difference of potential 
between source and drain, the carriers are too few to caITy electrical CUITent. But when gate 
voltage is increased, electrons are attracted to the gate and the channel will have more 
positive carriers and CUITent flows. The name "field effect transistor" cornes from this 
process [2]. 
Miniaturization of FETs: A very common tendency is to develop enhanced versions 
of FETs for nanoscale needs. But a valid question is if any geometry built on the same 
concept can work in that scale. For example because su ch a chip has to have very naITOW 
wires, CUITent can flow between wires and performance is affected. Aiso the technology to 
fabricate such small structures, uniformly and efficiently, has yet to be developed [16]. 
Here it is discussed why the potential of scaled down FETs to play the role of 
nanotransistors is limited and why new generations of devices are expected to emerge: 
1. When the FET is scaled to a great deal, the electric fields produced by bias voltage 
gets stronger and eventually it can cause an "avalanche breakdown" in which a high CUITent 
suddenly flows through material and damages it [16]. 
2. Heating in FETs is inevitable. It even increases with scaling. So, increased densities 
necessarily mean increased heating in the unit of surface. Especially that heat conductance 
of transistor is limited. So scaling and density have limits and in fact right now, the 
technology has reaches sorne ofthese limits [16] . 
3. When the channel gets very small, the number of dopants becomes so small that the 
uniformity of their distribution in the channel is not val id anymore. So there will be two 
32 
solutions: one is to stop using dopants as in a GaAs hetero structure. The other one is to 
make an absolutely uniform doping using molecular nanoelectronics [16]. 
4. With extreme scaling the tunnel is so thin that electrons tunnel from source to the 
drain through the semiconductor, even when the device is off. On the other hand sorne 
nanoelectronic devices use the same phenomenon to control CUITent [16]. 
5. The oxide insulator beneath gate prevents the electrons from tunneling between gate 
and drain. With scaling this leakage CUITent grows too[16]. 
So in nanoscale, where the quantum physics effects become dominant, building a 
device that relies on bulk properties of materials and uniform doping is going to be an 
exceedingly difficult job. Instead, because energy quantization and tunne1ing are common 
properties of materials in that scale, quantum effect devices and molecular electronics that 
employ these properties are going to have obvious advantages [16]. 
Solid state quantum effect devices and single electron nanoelectronic devices: 
These are sorne solid-state devices proposed to be used in nanoscale applications instead of 
the bulk-effect devices. AU of these devices have a structural common element, called 
"island", which plays the role of channel in FETs. Island is a small region made of 
semiconductor or metal in which electrons may be trapped or confined. These devices 
employ quantum mechanics effects to control the displacement of electrons in and out of 
the island. Three categories can be identified in regard to the number of dimensions in 
which electrons are confined in the island [16]: 
1. Quantum Dots (QDs) or artificial atoms: Electrons are confined in three dimensions 
and have zero degrees of freedom [16]. 
33 
2. Resonant Tunneling Deviees (RTDs): Electrons are confined In one or two 
dimensions and have two or one degrees of freedom [16). 
3. Single-Electron Transistors (SETs): Electrons are confined in no dimensions and 
have three degrees offreedom. (Figure 2-2) [16). 
By changing the shape and size and composition of an island, quantum effects act 
differently and island' s properties and consequently device' s properties change. As an 
example of composition, when an island is made of semiconductor, the electrons have a 
much more mobility in it than when it is made from metal. That is because the mean free 
path of electrons in metals is much smaller than semiconductors and passing electrons can 
travel through semiconductor with less collisions. So by changing the composition of the 
paths that electron might take, conductivity of the deviee can be controlled. For ex ample 
adding a semiconductor path might let the current flow by increasing the conductivity. Aiso 
semiconductors made from elements from group III and group V of periodic table [gallium 
arsenide (GaAs), aluminum arsenide (AlAs), ... ] have higher motilities than silicon (group 
IV).!t has been shown that we can make defect-free junctions between these III-V 
semiconductors more easily than between two group IV semiconductors(Si and Ge) [16] . 
Alternatives for Nanometer-Scale 
Electronic Switehes 
Aggresslvely Miniatunze 
Semieonduetor Transistors 
Quantum-Effeet 
Nanoeleetronie Deviees 
"Hybrid" Micro-Nano-
Electronie Devices 
Quantum Dots 
(QDs) 
Electrons on island have 
zero 
classical d.o.f. 
Solid-State 
Nanoeleetronic Devices 
Resonant Tunneling 
Deviees 
Electrons on island have 
one or two 
classical d.o.f. 
Resonant unneling 
Diodes (RTDs) 
Molecular Electronics 
Single-Electron 
Transistors (SETs) 
Electrons on island have 
three 
classical d.o.f. 
34 
ELLENBOGEN J. c., "A BriefOverview of Nanoelectronic Devices", 1998 
Government Microelectronics Applications Conference (GOMAC98,) 
Arlington, VA, 13-16 March 1998. 
Figure 2-2 Nanoelectronic devices 
Quantum effects in islands and potential wells: Islands being the switching area of a 
nanodevice have at least one dimension in the range of 5 nm to 1 OOnm. Island must be 
defined from the other regions in sorne ways. Many times the material that it is composed 
of is different from the surroundings in these cases the island is usually surrounded by the 
oxide of island's material or sorne other insulator. In sorne cases the material is not 
important and the boundaries of the island are shaped by electrodes put in a certain pattern. 
Anyway, what matters from an electrical point of view is that island is always separated 
from its surroundings by sorne potential energy barriers. In this way it can confine electrons 
within these potential walls [16]. 
35 
One negative point for quantum effect and single electron solid state devices is that 
usually the region that mobile electrons can occupy inside the island is just a small puddle 
in the middle of the island. The reason is that these electrons are repelled by surface 
charges on the boundaries of island. So the island has to be physically much bigger than the 
confinement region. Around the puddle there is a large portion of island's space called 
depletion region which is not used efficiently. This limits the miniaturization potential of 
these devices [16] . 
There are two quantum mechanical effects that act on confined electrons and on 
nanoscale islands that need to be discussed. First one is quantum states. Each partic1e in 
general can only occupy one energy level among a finite (in the practical energy limits) 
number of discrete energy levels called quantum states. The amount of energy in each level 
and the distribution of these discrete energy states depend on the physical and electrical 
properties of the environment. In an island electrical potential inside the walls effects the 
amount of energy an electron has to have to occupy one of the states, available inside the 
island. The distance between the walls also has an inverse relation with the difference of 
energy between the states. So a smaller island has quantum states more separated in energy. 
The other important quantum effect is tunneling: Depending on the magnitude of the 
potential barriers, if the distance between them is small enough (5 nm to 10 nm), electrons 
with energy levels less than the height of the barrier, have a chance to pass through the 
barrier on enter or exit the island. Of course these two effects do not cancel each other. So 
for an electron to tunnel through a barrier, there must be an available quantum state on the 
other side with an energy level low enough for the electron to occupy it. Aiso not every 
electron that has a chance to occupy a state on the other side of a barrier will tunnel. Let 's 
36 
see how these effects work in the context of a solid state quantum effect device: Bias 
voltage, applied to source and drain, pushes the mobile electrons in the source conduction 
band to pass through the island and reach the drain with lower potential. The island is 
separated from source and drain by high potential barriers, so electrons need to tunnel into 
the island and then tunnel out of it and into the drain region. For that to happen, for an 
electron that occupies a state in source with a certain energy level there must be an 
available state inside the island with the same energy level so that the electron can occupy it 
after tunneling. The same goes for the tunneling from island to the drain. The difference is 
that because drain has a lower potentiallevel, there is always a free state with energy lower 
than the energy of the electron inside the island. So once an electron get inside the island, 
the bias voltage can easily push it into the drain region. In bulk semiconductor because of 
the wide physicallimits, there are much more energy levels placed close to each other, both 
in the source and in the drain. These energy levels form almost a continuous band. But in a 
nanoscale potential weil the states are so discrete. (Figure 2-3) [16J. 
ENERGY 
unoccuplld 
Le .... ,. 
1sll!'xclted I!'Mlrg'j atate for 
N+l eleclront ln potlHlttel Wd 
I--........ -+- Lawe;a encrgy .111 for 
N ël/!!ettOI'lIl in potet'l.IJaJ 'A\l!t:I 
:~------------------~------~------__ ~~~ANCE 
HARROW 
,SMUla . ,. IMll000P 
SEIIl- 8ENI, (:O!U:u è_ ,ÇÇlIiCl\Jc;TOfli 
( ...... AlM1 ,:. • • 0AlIi. ) 
wlbe 
tl.i;HlXloV 
8EII~ 
ë OKcuetait 
«1 .... .1,10\ .. 
37 
GOLDHABER-GORDON D., MONTEMERLO M. S., LOVE J. C. , OPITECK G. 
1., ELLENBOGEN 1. C., "Overview of Nanoelectronic Deviees", 
Proceedings of the IEEE, vol. 85 , Issue: 4, pp. 521-540, 1997. 
Figure 2-3 Quantum weIl ofa RTD 
Resonant Tunneling Deviees: To tum a device on, the main idea is to lower the energy 
of the quantum states inside the island relative to the energy bands of source and drain so 
that electrons in the source band can enter the island. In the case of an RTD (Figure 2-4), 
this is done by increasing the bias voltage which puts sorne parts of source energy band 
equal to a free state in the potential weIl and let an electron from source occupy it [16]. 
Metal 
C 
Tunnel Barriers 
Metal 
"Potential Energy Weil" 
Energy 
Energy 
Occupied 
Conduction 
Band I---~I"""- Lowest-Energy 
Distance (b) 
(c) 
(N+1 )-Electron 
..... --- States in Weil 
1==~Transmitted Electrons 
38 
GOLDHABER-GORDON D. , MONTEMERLO M. S., LOVE J. C., OPITECK G. 
J. , ELLENBOGEN J. C., "Overview ofNanoelectronic Devices", 
Proceedings of the IEEE, vol. 85 ,Issue: 4, pp. 521-540,1997. 
Figure 2-4 Structure ofa RTT 
If bias voltage is not high enough to create an overlap between source conduction band 
and lowest free energy band in island, no electron can pass into potential weIl and so there 
is no CUITent passing. In this state the device is off or "out of resonance". [Figure2-4 (b)] If 
bias voltage is high enough to put an unoccupied state in island within the reach of an 
electron in source band, CUITent passes and device is on or "in resonance" (Which indicates 
39 
the frequency associated to the energy level of potential well is in resonance to the 
frequency associated to an electron's energy in source).[Figure2-4(c)]So we have a device 
whose CUITent is turned on and off according to a bias voltage. This is a two terminal 
resonant tunneling device or a resonant tunneling diode (RTD).This concept, further 
developed, leads to a three terminal device in which the relative state energies of source and 
island are adjusted by a third gate's voltage directly connected to the potential weIl. By 
controlling this voltage potential of states in the potential well can be controlled. So this is a 
device whose CUITent is controlled by a small voltage. It is called a resonant tunneling 
transistor (RTT). [Figure 2-5] RTT is capable of replacing MOSFET, working as switch or 
amplifier [16]. 
TUllnal 801r .... r.1 SubSir.nle 
. Of*iUllIl ~M~!W "....,,11" 
Oecuph:d ~
Con.dootkJn • 
~Il~y t Da,.' '. 1 
~ 
Plij"flCtl 
Lowest>-EII«91f 
IN .. l),EfCdrlM1 
. -4- StalH in YlC'JI 
.......... 
{ç} P~m&II~!iI 'lfllin·Cl II! '(QltiI!gll 
40 
GOLDHABER-GORDON D., MONTEMERLO M. S., LOVE 1. c., OPITECK G. 
1., ELLENBOGEN J. C., "Overview ofNanoelectronic Devices", 
Praceedings afthe IEEE, vol. 85 ,Issue: 4, pp. 521-540, 1997. 
Figure 2-5 Structure of an RTT 
We know that there is more than one free quantum state inside the island. Now if the 
energy difference between these states is bigger than the difference energy between source 
band edge and Fermi level, in other words if source conductance band is thinner than 
distance of island states from each other, for each time that source conductance band 
reaches the level of a free state in weH, a flow of CUITent will happen. So R TDs and R TTs 
are capable of multiple on and off states. This valuable feature means that a circuit can be 
implemented with fewer devices using RTDs and RTTs rather than MOSFETs. Because 
with multi state switches a function is realized with less switches. This leads to fewer 
41 
devices thus less heat generation. In Figure 2-6 (b) for a case where conductance band is 
mu ch thinner than potential differences in well, the CUITent peaks represent the moments 
wh en bias voltage (or in the case of RTT, the gate voltage)aligns a free energy state with 
the conduction band. As shown in Figure 2-6 (c), even when conduction band is wider than 
state energy differences, a step pattern is formed that might work as a multi-state logic [16]. 
J13LJ1:jL fLn .\ : ,~W ÇjW 
: (a) : . 
Current : : : 
RTD 
(M:» U) 
Current 
00 
(M~ U) 
Cu,reo' r 
SET 
(L\E« U) 
, 0 
, 0 
, 0 
, 0 
, 0 
, 0 
o 0 
(b) 
(d) 
Bias Voltage 
Bias Vo.ltage 
Bias Voltage 
42 
GOLDHABER-GORDON Do, MONTEMERLO Mo S., LOVE 1. C., OPITECK G. 
1., ELLENBOGEN 1. c., "Overview ofNanoeleetronie Deviees", 
Proceedings of the IEEE, vol. 85, Issue: 4, pp. 521-540, 1997. 
Figure 2-6 Switehing of solid-state nanoeleetronie devices 
To fabrieate these deviees layers of different III/V semieonduetors (for example GaAs 
and AIAs) are put together to form the source and island and drain. To make a resonant 
tunneling diode [Figure 2-4(a)] , two insulating barriers are used to envelope the island 
semieonduetor and ereate the potential weIl between them. Islands in RTDs are about 10 
nm wide [17]. 
43 
Hybrid Microelectronic-nanoelectronic Resonant Tunneling Transistors: 
Fabrication of RTTs is still a technological challenge. They are too small, too complex and 
too sensitive to be produced uniformly and efficiently with CUITent technology. There are 
devices designed for the period of transition from microelectronics to nanoelectronics that 
employ the advantage of multi state logic of quantum effect devices in the CUITent 
MOSFETs. IfRTDs are put into drain or source of a micro scale MOSFET an RTD-FET or 
a hybrid RTT is built whose source-drain CUITent has multi on/off states controlled by 
RTD's bias voltage.(Figure 2-7) In this way the logic density of circuit increases without 
scaling transistors or one can say that the CUITent high density of chips that is producing 
problems can do more in this way [16] , [17]. 
44 
RTOs 
Il 0.5 Jlm 
GOLDHABER-GORDON D., MONTEMERLO M. s., LOVE 1. c., OPITECK G. 
J., ELLENBOGEN 1. c., "Overview ofNanoelectronic Deviees", 
Praceedings afthe IEEE, vol. 85 , Issue: 4, pp. 521-540, 1997. 
Figure 2-7 A hybrid RTD-FET 
Anyway, on a larger time scale hybrid logic is mostly a practical step on along path that 
ends in production of purely nanoelectronic devices [16]. 
Single electron transistors: Single electron transistors are very similar to RTTs . They 
also have a potential well, isolated from source and drain. The difference is that here the 
island is so small that it can contain only a few electrons at the same time. In fact electrons 
in a process called Coulomb blocking repel the other electrons and stop them from entering 
the limited space of the island. So there is a balance between the force of bias voltage and 
the force of Coulomb blocking and no CUITent passes. Using a voltage induced into an 
electrode that acts as the gate in R TTs, potential of the island can be increase so that one 
electron from source can enter island. Then one electron willleave the island and go to the 
drain; hence electrical CUITent flows. Single electron transistors can only operate at low 
45 
temperatures because high thermal energy will destroy the order of this very sensitive 
device and will make electrons jump over the well in their high energy resonances, 
regardless of Coulomb blocking [15]. 
Chapitre 3 - High Performance Computing Systems 
Conventional chip-package thennal analysis techniques have been so slow that 
evaluating numerous design alternatives was prohibitively expensive during lC design. As 
a result, most thennal optimization was done during packaging and cooling solution design. 
Unfortunately, by that time the design is already tightly constrained. Recently, a number of 
researchers have developed fast thennal analysis techniques for use during the lC design 
process. Using these methods, heat transfer through chip and cooling package is modeled 
using the classical Fourier transport model [14]. 
Although sorne of these techniques are fast enough for use during lC design and within 
run-time thennal management techniques, they are all based on the Fourier heat flow 
model. This model cannot capture phonon quantum thennal effects and yields inaccurate 
results when used at length scales on the order of phonon mean free path. These 
observations are supported by the data presented in Section IV -A. Techniques with 
different fidelities and efficiencies have been developed to model nanoscale device-Ievel 
phonon heat transport, including molecular dynamics methods, Boltzmann transport 
equation (BTE), and ballistic-diffusion model. Computational complexity has been the 
primary challenge of adopting nanoscale heat transfer methods for large-scale lC chip-
package thennal analysis. Molecular dynamics methods model heat transfer by directly 
simulating inter atomic interactions. Approaches implemented using these methods are 
highly accurate. However, the y are extremely computationally expensive. The BTE method 
47 
and its variants model heat transfer by simulating the transport of phonons. It can 
accurately approximate ballistic phonon transport. BTE methods are much more efficient 
than molecular dynamics methods. However, their computational complexity remains 
prohibitive, and their use has been restricted to device-level analysis. The ballistic-diffusion 
based thermal analysis method is an approximation of the Boltzmann transport method. 
Although the ballistic-diffusion model is the most efficient of these, it is still much more 
computationally-demanding than the Fourier model. In addition, results from the ballistic-
diffusion model tend to have low fidelity. In summary, there is a gap between the efficiency 
and accuracy of nanoscale and chip-package thermal analysis techniques that must be 
closed if high-quality temperature-aware design techniques and run-time thermal 
management algorithrns are to be developed for ICs composed of nanoscale devices [14]. 
The study of nanosystems has been the subject of several searches in the last decade. 
But the works related to the heat transfer of these new components is still at the stage of 
development [15]. 
So, heat analysis, an increasingly important field in order to improve design, demands a 
lot of computation. A solution to this problem can be employing parallel algorithms using 
MPI standard and implementing them on high performance computing systems [3] , [15], 
[18]. 
3.1 Parallelization 
High performance computers (HPC) distribute the work between many processors 
which are put in a certain structure which is unique to each HPC facility (CLUMEQ, West 
48 
Grid ... ). By dividing the job between many processors and making them work in parallel, 
the computation time needed decreases dramatically. In fact if n processors work in 
parallel, the processing time is ideally divided by n. AIso, the memory available for the 
whole process is n times the memory of each processor. What needs attention in 
parallelization is how to distribute a job between nodes and how to transfer the data 
between them during the processing. Message passing Interface (MPI) is employed to 
create efficient connections between these processors [15]. 
According to Flynn's taxonomy [19], there are four processmg unit architectures 
imaginable to do a task. In Single Instruction Single Data stream (SISD) one processor 
perforrns one series of instructions on one set of data. This architecture is used in the 
traditional uniprocessor PC (not PCs with multi-core processors). A single Instruction 
Multiple Data stream (SIMD) is when one instruction is given to more than one processor 
to be perforrned on separate parts of data simultaneously. So parallelism (on processed 
data) is possible in this architecture. In Multiple Instruction Single Data stream (MISD) 
machines different instructions are given to different processors to be perforrned on a single 
piece of data. Here also a parallelization (this time on instructions) is forrned. This is an 
uncommon architecture which is ideal for applications where fault tolerance is low. 
Because different processing units work with the same data and they have to be in good 
agreement on its correctness. So each error can be discovered by other units. Multiple 
Instruction Multiple Data streams (MIMD) gives different instructions to different 
processors and they perforrn them in parallel on different data. MIMD can be implemented 
in two forrns : Single Program Multiple Data (SPMD) in which one program is executed 
independently on different processors and Multiple Program Multiple Data (MPMD) in 
49 
which different programs are executed on different processors at the same time. Usually in 
these applications one processor executes a "manager" pro gram which communicates with 
other processors executing another program. Each processor sends its results to the 
manager independently and there the results are fed into the manager pro gram [20]. 
SPMD architectures are very common and are realized in two forms: shared memory 
and distributed memory. In shared memory architecture, processing units have 
simultaneous access to one shared memory. So they can pass information by writing them 
on memory and no separate communication between them is needed. So they are fast, but 
instead they are more expensive and in programming level attention must be paid to avoid 
any conflict of tasks done by processors. In distributed memory architecture each processor 
has its own memory. Processors communicate with each other directly in order to pass 
information [15]. 
There are two SPDM algorithms, OpenMP and MPI Standard, developed respectively 
for shared memory architectures and shared and distributed memory architectures that are 
introduced here. 
OpenMP: Open multi-processing is a programmmg interface including sorne 
directives, libraries and variables that helps create a parallel code and execute it in C, C++ 
and Fortran languages. With OpenMP, a task, that has sorne elements that can be executed 
in parallel, is coded as a parallel algorithm. To do this it is desired to follow the mas ter 
thread of the code and fork it to a number of threads whenever a parallelizable element is 
reached. These threads can be executed separately. OpenMP gives each of these threads an 
ID number and sends them to different processors . When the parallel task is finished, the 
master thread continues to run until it reaches another parallel task (figure 3-1) [20] . 
50 
MPI Standard: Message Passing Interface (MPI) is a programming interface to create 
clusters of processors, placed in a HPC facility or in distant PCs, for parallel computing. 
MPI do es that by managing the communication between the processors. MPI works with 
both shared memory machines and distributed memory machines [20]. 
Parallel Task 1 Parallel Task Il Parallel Task III 
Master Thread 
Parallel Task 1 Parallel Task Il Parallel Task III 
Master Thread " -,;:; . ........ \ ,: 
--. ;- .. _-
, 
, 
wikipedia.org 
, 
" 
" 
, 
, 
) 
_: 
Figure 3-1 OpenMP schematic in comparison to single processing 
3.2 Infrastructure 
, 
8 
'. 
CLUMEQ cluster is used to implement the parallel code created usmg the 
NanoMOS 3.5 MATLAB code. CLUMEQ IS a part of Canadian HPC infrastructures 
gathered under the name Compute Canada. 
Compute Canada is national collective of consortia of HPC infrastructures existing 
all over Canada. These infrastructures are geographically distributed aver the country and 
cover the high performance computation needs of researchers, not in a regional sense, but 
51 
III a country-wide way. Although these infrastructures are located in seven different 
regions, they are joined together in Compute Canada as a net of distributed infrastructure 
accessible through one gateway. So having an account in Compute Canada, offered to 
researchers, professors, students .. . , gives you the right to execute yOUf code on any of the 
consortia and also bene fit from technical support, educational programs and other services 
offered by them [21]. 
These consortia are: ACEnet (Atlantic Computational Excellence Network,) 
CLUMEQ (Consortium Laval, Université du Québec, McGill and Eastern Québec), 
RQCHP (Réseau québécois de calcul de haute performance), HPCVL (High Performance 
Computing Virtual Laboratory, SciNet (based at University of Toronto), SHARCNET 
(Shared Hierarchical Academic Research Computiilg NETwork) and WestGrid (Western 
Canada Research Grid). Each of them includes sorne HPC infrastructures or 
supercomputers based in universities placed in that region. Compute Canada cooperates 
with these HPC consortia to plan future infrastructures and fund these projects, to provide 
the software and hardware and to provide technical support for the researchers in each 
regional facility [21]. 
CLUMEQ includes two supercomputers situated in Mcgill University, Montreal and 
Laval University, Quebec city. These two are: 
Colosse: A cluster of 960 nodes (8 processor cores and 24 GB of RAM each), with 
totally 7680 cores and 23 TB memory, with a full bisection Infiniband QDR network and a 
Lustre parallel file system of capacity up to 1 PB (currently 500 TB) [22]. 
52 
Krylov: A cluster of 48 nodes (4 or 8 processor cores and (300 cores total) and 8 or 16 
GB ofRAM each), with totally 300 cores, with Infiniband SDR network [22]. 
Supercomputers are a great and central point to this study. One very important aspect of 
a supercomputer is its structure, both electronically and mechanically. Computation power 
of a supercomputer is a direct result of its number of nodes and the power of each node. So, 
in respect to electronic hardware, the mode l, number and other properties (for example 
speed and memory) of processors used are the first features of a supercomputer to mention. 
Of course the data transfer cables and other hardware used in the system must have a 
capacity that covers the needs of the system and lets it work to its full computation power. 
Choosing and employing software is another concem in managing all these processors 
work in parallel. Bellow a structural discussion of Colosse is presented to coyer these 
hardware and software aspects. On the other hand, when the number of processors in a 
cluster is huge, the problem of heating becomes huge too, because processors are the device 
in a computer that produces the most heating. So when there are a lot of them in a 
computer, heating goes to a whole new level. Surprisingly extensive and expensive 
measures have to be taken to exit the heat in a supercomputer. To do that, a special cooling 
system has been used for colosse that forms the mechanical part of the structure. The size 
and power of these cooling devices is ev en greater than that of the electronic parts. This 
fact reinforces the necessity of studies of this kind and should be discussed here. So a 
description of the cooling system of Colosse will also be a part of this chapter [22]. 
As mentioned above, Colosse has 960 computational nodes. Each of these nodes 
includes 2 Intel Nehalem-EP processors each of which has 4 cores and 24 GB RAM. 
Colosse also has 40 nodes for infrastructural responsibilities. So in total there is 8000 cores 
53 
and 24 TB RAM. The nodes are connected with infiniband QDR cables with a speed equal 
to 40 Gb/s. The infrastructure nodes are connected with an ethemet connection to each 
other and to the outside world through web. Half of them make a parallel file system of 1 
PB capacity [22]. 
Infiniband is a certain kind of communication system that connects nodes mostly in 
HPC applications. It provides failover capability, which means in the case one device stop 
working it can switch to a stand by one. Infiniband link is also scalable. It is usually used to 
connect the processors to IIO and which in this case is mostly the RAM. Infmiband uses a 
switched fabric model in which the nodes are connected to each other through one or more 
routes. This is contrary to the broadcast networks which have a hierarchical model. SDR 
infiniband sends data one every clock cycle. QDR infiniband, on the other hand, is an 
infiniband that sends data four times in a clock period (on the rising and falling edges and 
between them.) So it is 4 times faster than an SDR infiniband. To find the moments where 
the clock is in 90 and 270 degrees it uses another clock with 90 degrees delay in relation to 
the main clock [20]. 
Lustre is a special file system developed by sun microsystems. File system is a 
method of storing and managing files in a computer. Lustre, made from mixing words linux 
and cluster, is meant to service clusters with as much as tens of thousands of nodes and 
sorne PB of memory and sorne hundreds of GB per second data transfer rate and at the 
same time have a high availability. This file system is also scalable. It is now widely 
employed in supercomputers [20] . 
54 
Colosse is assembled inside the remains of a particle accelerator in University of 
Laval campus. The particle accelerator was first built in the 60s. Lately after the colocsse 
project was started, in search for a place to set up the supercomputer, it was decided to use 
the former accelerator for this purpose. The final design was a very exiting plan: The old 
particle accelerator had a big silo that was abandoned for years. The plan was to tum this 
structure into the home ofnew supercomputer [22]. 
A three floor metal structure has been built inside the silo. Cabinets containing 
processors are put on these floors in a circular form so that they make a smaller cylinder 
inside the big cylinder which is the inside space of the silo. The first floor is where the 
infrastructure devices and servers connecting to the outside world are placed. Calculating 
processor cabinets are placed in the second and third floors . By this three floor design; 
accessible area to place the processors has almost tripled. So now there are a lot of extra 
space in the second and third floor to place more processors in future upgrades of the 
system (figure 3-2) [22]. 
CLUMEQ (www.clumeq.org) 
55 
Racks aligned in a 
circle around a 
centra l hot core; 
outside ring Îsa 
cold air plenum 
Second floor contains 
ail compute racks + 
~.~IjijjI~....,.;t-- core networking 
switches 
First floor contains 
file system & 
infrastructure nodes 
Figure 3-2 Placement of the electronic devices 
The circumference of this smaller cylinder is isolated from the space between two 
cylinders so that a warm tunnel of air can be formed inside it to exit the heat. This tunnel is 
fed by big fans placed bellow the small cylinder that blow the air from hot air tunnel into 
the cooling radiator in the ground floor and then to the co Id air area between two cylinders. 
In other words these fans produce a positive pressure between the cold air region and hot 
air region so that the cold air passes through processor cabinets into the hot air tunnel. By 
dividing the inside of the silo into two hot and co Id areas, the efficiency of use of fans 
increases, because power of fans is not used to blow hot air to the processors. So it is 
possible to use fans with less power or less speed of cooling air in the design. Also 
56 
distributing the cabinets in three floors (in comparison to putting aH of them in one floor) 
increases the surface of cooling and makes it more efficient. The circular placing of the 
cabinets decreases the length of the cables too. The circular structure of the silo itself and 
circular placement of processor cabinets are ideal for the flow of the cooling air, because 
they do not have any corners, as a square structure would have had (Figures 3-3 and Figure 
3-4) [22]. 
Free air 
cooling 
syste 
"",,,,"'.:--:::"=-H-Ifl!f--<"'""H 
~~jt 
'~.~~. 
Main 
cooUng 
system 
~~dFrr~~~tr:1 
CLUMEQ (www.clumeq.org) 
Figure 3-3 Side-view of the structure 
57 
Plan niveau 1 
CLUMEQ (www.clumeq.org) 
Figure 3-4 Structure viewed from above 
58 
cold a.ir plenum 
(32 m2) 
hot air core 
(25 m2) 
In addition to the main cooling system, which is in ground floor (beneath fans) and 
cools the hot air before it goes to cool air region, there is a system to use the co Id air from 
outside to do part of the cooling process. Aiso in the cold season the heat produced by 
processors is employed in heating system of the campus. AU this shows the great amount of 
heat produced by what we would have carelessly called "just sorne electronic devices" [22]. 
59 
In total, there is 56 cabinets on the metal structure in three floors which are cooled 
bya system with a ventilation capacity of 132,500 CFM (62,500 liters per second) and a 
cooling capacity of 1.2 MW [22]. 
In this section we will review the steps that need to be taken to connect to the colosse 
supercomputer and login to the system and then we will see how to start a new project and 
how to send and have your job executed on the system. First of all a researcher needs to 
make an account on Compute Canada website. Students can do that through the account of 
their supervisor and using the information given to them by their supervisor. Next, an 
account is created on the certain consortia chosen by the researchers and that lets them 
apply their job on the supercomputers within that consortia. Here we suppose that one has 
created the necessary accounts to access Colosse supercomputer and now wants to login to 
the system and use it to execute a parallel code [22]. 
3.2.1 Accessing Colosse: 
Login: Users do not have direct access to the computing nodes. They have to 
connect to the head node through a secure shell client. A shell client is a pro gram working 
as an interface between a user and an operating system. Referring to the outer space of 
something, the word "shell" is also generally used for cases in which a pro gram plays the 
role of interface to a system or program. For example a shell account is an account that a 
user has on a remote server and can access sorne data, memory or applications through that. 
A secure shell is a protocol used to make a secured and encrypted connection between two 
points. It can be used for example to connect to a she1l account. Secure shell (SSH) was 
created to replace connection protocols like telnet that do not encrypt the data [20]. 
60 
So as was rnentioned above users connect to the head node through an SSH client. Here 
they need to enter their user narne and password, defined wh en they created their accounts. 
In Linux and MacOS systems this is do ne easily by typing frorn a terminal [22]: 
$ ssh 
user@colosse.clurneq.ca 
A Windows user has to install sorne software to rnakes this SSH connection possible. 
These prograrns are accessible for free on the web. Sorne of thern can be found in the 
following address [22]: 
http ://www.openssh.com/windows.html 
Then it rnight be necessary to create a profile using the interfacing pro gram installed. 
To do that these information are needed [22] : 
Host name: colosse.clurneq.ca 
User narne: your user name 
Port number: 22 
File transfer: Now that the connection to colosse has been realized, user needs to use a 
secure proto col to send and receive data and files. Secure Copy (SCP) and SSH File 
Transfer Protocol (SFTP) are two protocols that could be used. These two are designed 
based on SSH, but they are compatible with other protocols too. Their job is to guarantee a 
safe transfer of files . SCP is just a means of transferring files , but the newer SFTP is more 
advanced and is also capable of doing some operations on the distant files too. Sorne of its 
advantages over SCP are the possibility of resurning a transfer after it has been interrupted, 
61 
and removing files remotely. Windows users need to install a program (examples: FileZiIla 
and CoreFTP) to add these protocols [22] . 
This is the File system structure accessible to users when they connect to colosse: 
Table 3-1 Data storage in colosse 
www.c1umeq.org 
,-
--
Path Description Quota 
F I Your account path 1 5GB ~ 
Group shared space. Rolds data or common 
/rap 100GB 
tools for the group 
Temporary scratch space. Tuned for fast 
/scratch 
IIO. 
Compiling: There are different softwares accessible on colosse that users can caU them 
and use them to compile their codes. This is done using the module commando Module is a 
command that make the shell ready for a curtain application. For example it creates the 
necessary variables and so on. In brief users call their desired application using this 
command [22]. 
Rere sorne useful module commands are mentioned that are taken from clumeq.ca: 
"Show aIl modules available on the system: 
62 
$ module avail 
Show cUITently loaded modules: 
$ module li st 
Load a module (This command is also used inside your Grid Engine submit files): 
$ module load <modulename> 
Unload a module: 
$ module unload <modulename> "[22] 
In the foUowing example, again taken from clumeq.ca, the first line caUs a compiler 
already existing on Colosse and the second line calls a simple C pro gram: 
"[USER@cyclops ~]$ module load compilers/intel/ll . l.059 
[USER@cyclops ~]$ $CC -0 hworld Iclumeq/example/hworld.c" [22] 
Running the code: To fUll a code on Colosse one should send it to Grid Engine (GE). 
There it will enter a queue and later will be fed to the cluster by GE [22]. 
These are sorne commands, taken from clumeq.ca, to work with GE: 
"qsubscriptyle - submits a new job, where scriptyle is a text file, containing the 
name of your executable program and execution options. Examples of script yle 
are explained below. 
qstat - shows a CUITent list of submitted jobs. Can be used with the foUowing 
arguments: 
qstat -ext - shows a listing of aU your queued and running jobs 
qstat -g c - shows used and available resources for each queues 
qstat -u "*" - shows jobs information for aIl users 
qstat -jJOB JD- shows details for a specifie queued or running job 
qstat -t - shows the resources used by your running jobs 
qdeVob JD - kills the job, or removes it from the queue. The job JD can be 
obtained by qstat commando 
qconf -sql- shows existing queues in the system" [22] 
63 
Here the structure of a script_file is explained and then the actual command m 
command line is mentioned. (From clumeq.ca) [22]: 
"#! Ibin/bash 
#$ -N JOB_NAME 
#$ -P PROJECT 
#$ -1 h_rt=00:05 :00 
#$ -pe mpi NB_CORE 
module load SOME MODULE 
/your/program/path/here 
The statements above have the foIlowing meaning: 
- The name of your job. 
- Your project name which is the CCDB RAP id. The format is abc-OOO-OO. 
- Estimated mn time ofthis job. For example, 5 minutes is 00:05:00 
- Parallel environment (PE). See the description of the various PEs below. 
- Commands to be executed on the compute node : 
$ qsub sub.sh 
Your job 212 ("MyJob") has been submitted" [22] 
64 
There are three parallel environments (PE) accessible on Colosse. As said above, the 
type of parallel environment chosen for the code to be executed in must be written as part 
of the submit commando These environments and their explanation and then the way they 
are addressed in a submit command and also the scipt_ file referring to them is taken from 
c1umeq.ca [22]: 
"mpi 
"-pe mpi" is used for mpi job. Those processes can be run anywhere on the 
machine using a "fill up" allocation rule (all slots allocated in anode 
before moving to the next one). "nb_slots" represents the number of cores 
requested. 
Dode 
"-pe node" The processes will run on compute nodes dedicated to this 
job. Your "nb _ slots" (number of core) should be divisible by 8. 
smp 
"-pe smp" is for non mpi jobs and mulithread jobs. This parallel 
environment allocates cores (slots) on a single node. The maximum number of 
slots (nb_slots) is 8" [22] 
"#$ -pe mpi NB_CORES 
#$ -pe node NB_CORES 
#$ -pe smp [NB_CORES]" [22] 
"#!/binlbash 
#$ -NMyJob 
#$ -P abc-OOO-OO 
#$ -1 h_rt=OO:05:00 
#$ -pe mpi 32 
module load mpi/openmpill.3.4_intel 
mpirun /your/program/path/here" [22] 
Chapitre 4 - Modeling 
4.1 Physical and mathematical modeling 
Semiconductor industry has been constantly developing in many different directions for 
many years. These improvements, according to their area of effectiveness, can be 
categorized into sorne trends of development called scaling trends: integration level, cost, 
speed, power, compactness and functionality. Most of these trends are the result of constant 
reduction of minimum feature size in semiconductor processes. Moore's law is a well-
known example of scaling trends: number of components on a chip is doubled every 24 
months. But still, scaling is not necessarily identical to feature size reduction in fabrication 
processes. For example now there is a tendency towards equivalent scaling, as opposed to 
geometrical scaling (which is basically the feature size reduction, resulting in Moore's law). 
Equivalent scaling is the improving effect of new materials, better designs, 3D integration, 
more efficient processes and ev en better software, benefitting the industry just as 
geometrical scaling has been doing [23]. 
Having made this distinction, we can retum to the subject of this study. Transistor 
scaling is mostly, but not only, a result of feature size scaling. Development of materials 
(like high k insulators), designs and architectures (SOI, thin-body and multi-gate 
MOSFETS) and new fabrication processes have great effect on transistor scaling too. Here, 
the importance and possible limits of the scaling will be discussed, in regard to 
performance, geometry, material and process issues. 
66 
ITRS predicts that physical gate length in MOSFETs will reach 1 ° mn in 2020, but at 
that scales difficulties are expected to appear that will challenge the whole CMOS scaling 
future. The effort needed to stretch CUITent CMOS technology through deep scales is 
becoming more and more costly. The costs of production has been increasing in a rate that 
there are questions whether it can continue in the same way for more than 15 years. It 
seems necessary to look for post-CMOS devices that improve device performance and cost 
per function. It is worth mentioning that part of this development load will be carried by 
new design and manufacturing paradigms (currently, performance of processors does not 
improve equal to the increase in the number of their transistors) and also by integrating 
technologies like Sip and SOc. Without doubt these new axis of development will change 
concept of computation systems, but in the long run, transistor technology has no way but 
to improve. So the move toward enhanced CMOS architectures and ultimately post-CMOS 
devices must be taken seriously. These devices can be quite similar to today's CMOS, only 
with new materials and designs, or they can be radically different, like quantum and 
molecular transistors. In any case, studies that target the edges of CMOS technology and 
apply new models to simulate them will help, either by helping scaling CMOS to its 
maximum potential or by creating an understanding of models that might describe future 
devices. Noting that in gate lengths of less than 10 nm non-classical effects will not be 
negligible; it seems a necessary choice to try to employ ballistic and semi-ballistic models 
in those lengths, as it is the goal ofthis project [23]. 
Conventional scaling of CMOS consists of reducing gate length, gate dielectric 
thickness and increasing channel doping. In planar MOSFETs high channel doping helps 
reduce short channel effects, but it also reduces mobility and increases leakage power 
67 
consumption. The mam function of multi-gate or ultra-thin body fully depleted SOI 
solutions is to fight these effects and improve performance. 
Off cUITent, consisting of gate and drain leakage currents, grows 10 times in each 
generation of CMOS. Leakage power has exponential relation with gate length, oxide 
thickness and threshold voltage. So as generation appear, a greater portion of total power 
leaks. This, supposing a needed computation power level, increases total power. Today 
almost 50% of total power is leaked [23]. 
Sorne challenges ahead of semiconductor industry and in sorne cases their respectable 
solutions were reviewed. In the following sub-chapters, this discussion is extended to a 
materials point of view and today's difficulties and future solutions are discussed. 
Qualitative and mathematical description of MOSFET is employed to find and discuss the 
main parameters in structure of a MOSFET that have impact on its performance. 
Improvement of these parameters is the focal point of industry in transistor scaling process. 
Realization of an actual MOSFET is studied. First, a fabrication process, designed for the 
specific choice of device in this study, is presented in details. Then packaging possibilities 
are discussed and sorne suggestions are made for an actual packaging process. At the end to 
make an actual design feasible, a series of simulations is performed to study the parameters 
that play a role in design ofhighly scaled MOSFETs. 
4.1.1 Material considerations 
Metal gate transistors beyond 16 nm technology, will need an Equivalent gate Oxide 
Thickness (EOT) of 0.6 nm. EOT is the thickness ofa silicon dioxide layer having the same 
effect as a layer made from a high k material. Providing these EOTs will be difficult, ev en 
68 
with very high dielectric constant insulators. At the same time, channel mobility is 
decreased because of interface effects between silicon and high k material. Interface 
between high mobility channel materials (Ge, SiGe and Ill-V materials) and high K 
dielectrics is more problematic than a Si-high k interface, because it these materials have 
more complex surfaces and that they do not have oxides that can function as insulators [23]. 
Interconnect scaling has its own set of problems and solutions. High conductivity 
metals and low permittivity dielectrics and low k dielectrics (to reduce parasitic capacitance 
and heat flow and increase speed) are needed to make the connectors smaller and the 
isolation thinner [23]. 
Of course fabrication processes play a crucial role in transistor scaling. Apart from the 
obvious fact that new processes make minimum feature size reduction possible, use of more 
efficient materials or designs might impose innovation in process. For example, to fight 
short channel effects it will be helpful to create a channel with a carefully chosen doping 
profile, instead of doping the channel evenly. Less doping near the substrate insulator and 
also near the source and drain interfaces and more doping in the middle of the channel is 
desirable. To be able to do this, a selective epitaxy of channel with Si, Ge and C is 
performed so that doping can be varied locally [23], [24] . 
To achieve higher drive CUITent, high transport materials can be used in channel: Ge on 
Insulator, III-V, nanowires, carbon nanotube [23]. 
4.1.2 MOSFET theory and design 
MOSFET is a voltage controlled CUITent source. Gate voltage controls the CUITent in a 
channel made of semiconductor placed between drain and source electrodes. What are the 
69 
parameters that affect its function? To give an answer, more discussion of its structure is 
needed. 
In an n-MOSFET, channel has p-type dopant and drain and source have n-type 
dopants. A voltage difference is put on drain and source electrodes but channel, even 
though it is doped, is not a conductor and CUITent does not pass. Gate electrode is place on 
top of the channel and is separated from it by a thin layer of insulator material, so that it has 
no electrical connection with channel. Now if a positive voltage is given to the gate, free 
electrons in channel are absorbed towards gate. So, a region is formed inside the channel, 
just under the insulator where electrons are gathering. If the gate voltage is high enough, 
there will be enough free electrons gathered in that region to be able to conduct CUITent 
from source to drain [25] . 
The amount of electrons gathered in the channel determines how easily CUITent flows. 
The insulator put between gate electrode and doped channel make a capacitor whose 
electric charge is calculated from equation 01 . Concentration of electrons is represented by 
Q. We see that gate voltage determines Q. SO gate is controlling channel conductance and 
thus the CUITent. Another conclusion that can be made is that by decreasing capacitance, C, 
MOSFET can be turned on more easily and it can pro duce higher drain CUITents. This, of 
course, is very important in scaling of MOSFETs. We will discuss this effect in more 
details [25]. 
Let's look at channel and see it as a resistor (Ron) through which drain CUITent 
passes. As gate voltage increases, more electrons join the channel and it conducts better, so 
resistivity of the channel decreases. In Id - Vd graph, this means that for higher gate 
70 
voltages, slope increases. Other parameters that influence Ron are length of channel and its 
width. It is obvious from ohmic resistance equation that Ron (slope of Id - Vd graph) 
increases by an increase in length or a decrease in width of channel. If we draw a graph 
showing Id for different V g s, we see that as explained above, by increasing Vg , Id (for a 
constant Vd ) increases. This rise starts from the point where electron channel is formed and 
device is tumed on and continues until saturation. If Ron is decreased (by a decrease in 
length or an increase in width), for a certain Vg , Id will be higher, so the slope of the graph 
between tum on point and saturation increases. Semiconductor industry has always tried to 
minimize channellength to reach higher drain currents. But the same does not go for width. 
More width can be reached easily. In the easiest solution two MOSFETs are connected in 
parallel, doubling the width. But this increases total gate capacitance (general capacitance 
equation) which might limit the speed. So a tradeoff is necessary here. Another parameter 
of equal importance as channel length is gate insulator thickness . A thinner insulator means 
that the capacitance between channel and gate is bigger, leading to more charges gathered 
in channel. We now that it means better channel conductance and less Ron ' It is an 
established trend in semiconductor industry to decrease insulator thickness to reach higher 
drain current, just as it was for channellength [25]. 
Current flow in channel is a drift CUITent. It means that carriers are accelerated by an 
electrical field, so there is a statistical velocity of carriers that me ans a displacement of 
charges and flow of CUITent (the other form is diffusion in which carriers move from high 
concentration region to low concentration region by diffusion) . This is shown in (4-1). This 
velocity, along with charge Q, in (4-2), yields to drain current. The charge is calculated 
71 
from (4-3) (pinch off considered). In this way Drain CUITent is calculated for device's 
terminal voltages as (4) . When (4-5), channel works in its linear region and (4-4) can be 
linearized to (4-6). It is in this region that Ron can be defined. This equation formulates all 
above discussion [25]. 
d~x) 
v=-Jl E=+Jl --
n n dt 
J=Q.v 
(4.1) 
(4.2) 
(4.3) 
(4.4) 
(4.5) 
(4.6) 
By increasing Vd ' Id increases. But MOSFETs reach saturation when Vd is higher than 
overdrive voltage (Vg - V,h). This happens because of pinch-off effect. Anyway, MOSFETs 
can be used in saturation region with a drain CUITent that can be controlled by gate voltage. 
The saturation CUITent is another important parameter in MOSFETs and can be calculated 
as in (4-7) [25]. 
(4.7) 
72 
Table 4-1 MOSFET equation parameters 
Symbol Quantity 
v Speed of electron 
f.Jn Mobility of electron 
E Electrical field 
V( x) voltage in point x 
Q (X) Charge in point x 
W Width 
Cox Capacity of oxide 
VGS Gate-source voltage 
VTH Threshold voltage 
ID Drain CUITent 
L Channellength 
Vos Drain-source voltage 
4.1 .3 Transistor structure 
SOI transistors are the CUITent answer to the scaling problems of MOSFETs. This 
architecture can be made using two different operation modes: partial depletion and fully 
depletion. 
73 
In a brief article [4], which is a primary presentation of an extensive study, these 
two modes are compared using different characteristics of transistors. The following plots, 
taken from reference [3], are based on fabricated models with gate lengths down to 25 nm 
[26]. 
The goal is to compare the benefits of these two types of transistors and in this way 
introduce the concept of dully depletion in transistors, which is employed in the modeling 
and simulation tool (NanoMOS 3.5) used in this study. In this article a lot of fabrication 
issues are explained which might be useful in the following weeks of our course. 
First let's review the depletion concept. In a MOSFET gate electrode is isolated from 
the channel by an insulator. If for example the channel is negatively doped, then inducing a 
negative voltage in the gate will repel the electrons from the region near to the gate. This 
means that carriers cannot use a portion of the area of the channel. In this way channel 
CUITent is controlled by gate voltage. These transistors are also called depletion mode 
MOSFETs [27]. 
Transistors with buried oxide of 100 nm and substrate thickness of 45 nm and 3nm 
oxide layer under the gate have been fabricated. Devices have n-doped channels and 
undoped channels for PD and FD modes respectively. 
In Figure 4-1 output characteristics of a 200nm gate length transistor for different gate 
overdrive voltages are shown. A kink effect is observed in doped tunnel of PD devices for 
higher voltages. This is due to floating body effect, which is the capacitor formed between 
the isolated silicon tunnel and the body oftransistor [20], [26]. 
74 
L. Dreeskornfeld, 1. Hartwich, E. Landgraf, R. 1. Luyken, W. Rôsner, T. Schulz, M. 
Stadele, D. Schmitt-Landsiedel, L. Risch, "Comparison ofpartially 
and fully depleted SOI transistors down to the sub 50nm gate length 
regime",Injineon Technologies-Corporate Research Otto-Hahn-Ring, 
Institute for Technical Electronics, TU Munich, Germany. 
Figure 4-1 Output characteristics of a 200 nm channellength device 
Undoped tunnel of PD device do es not have this undesired effect. In addition to that it 
has 30 % higher on cUITent, because of a higher mobility [26]. 
VIn [V] VIn [V] 
0,0 0,8 
-D ,2 o--D 0,6 
-D,4 >~ 0,4 • 
-D ,6 0,2 
-D ,8 ···· 0- -,. _. ,- - ,- - ,- . -.,...-- 0,0 
50 100 150 200 
gate length [nm] 
75 
L. Dreeskornfeld, 1. Hartwich, E. Landgraf, R. J. Luyken, W. Rosner, T. Schulz, M. 
Stadele, D. Schmitt-Landsiedel, L. Risch, "Comparison ofpartially 
and fully depleted SOI transistors down to the sub 50nm gate length 
regime", Infineon Technologies-Corporate Research Otto-Hahn-Ring, 
Institute for Technical Electronics, TU Munich, Germany. 
Figure 4-2 Threshold voltages 
Figure 4-2 shows that the threshold voltage of a PD device is quite constant for a range 
of channellengths, but FD devices have varying threshold voltages [26]. 
0,01 
lE-4 L.-____ -='~-~rtfttt 
lE-6 
Ê 
::1. lE-B 
< 
_ 01E_l0 
'1 E-12 
'1 E-14 
-1 ,5 -1 ,0 -0,5 0,0 0,5 1 ,0 
V., M 
0,01 • _ 
1 FD LG= 50nm ~ 
lE-4 ~F"'''''''' 
IE-6 /~-~ E 
:1. 
~ 1E-B 
_0 
Vos= 0.01V 
1 E-l0 0.5V 
1-OV 
1 E-12 
1 E-14 
-1 ,5 -1,0 -0,5 0,0 0 ,5 1,0 
Va [V) 
0 ,01 
1 E-4 
1 E-6 
E 
~lE-8 
~ 
_01E-l0 
lE-12 
lE-14 
0,01 
1E-4 
lE-6 
E 
:1. lE-8 
~ 
_0 lE- '10 
lE- '12 
lE-14 
1 PD LG=100nm 1 
~,J 
-1 ,5 -1 ,0 
, 
-0,5 
#.~.,.-
IF"v =O .01V 
1 os 0 .5V 
0.0 0.5 
1.0V 
1 .5\/ 
2 .0V 
1.0 
V G (Ill 
1 PD La=50nm 
-1,5 -1,0 -0,5 0.0 0,5 1 .0 
Va [V) 
76 
L. Dreeskomfeld, J. Hartwich, E. Landgraf, R. J. Luyken, W. Rësner, T. Schulz, M. 
Stadele, D. Schmitt-Landsiedel, L. Risch, "Comparison ofpartially 
and fully depleted SOI transistors down to the sub 50nm gate length 
regime",Injineon Technologies-Corporate Research Otto-Hahn-Ring, 
Institute for Technical Electronics, TU Munich, Germany. 
Figure 4-3 Transfer characteristics of 100 nm and 50 nm devices 
140 
~120 ~o 
~100 ~ 
;; _vds~.o~i 
80 -'-Vds~.SV 
~Vds=1 .0V 
60 -40- Vds=1 .SV 
--:?- Vds=2 .0V 
50 100 150 200 
gate length [nm) 
'0 
al 
'0 
:> 
.s 
en 
140 
120 
100 
80 
60 
50 
1 PDS (L) 1 
__ Vds=O. 01V 
--'-Vds=O.SV 
~Vds=l.OV 
-.1.-Vds= 1. SV 
~Vds=2.OV 
100 150 200 
gate length [nm] 
77 
L. Dreeskomfeld, 1. Hartwich, E. Landgraf, R. 1. Luyken, W. Rôsner, T. Schulz, M. 
Stadele, D. Schmitt-Landsiedel, L. Risch, "Comparison ofpartially 
and fully depleted SOI transistors down to the sub 50nm gate length 
regime", Infineon Technologies-Corporate Research Otto-Hahn-Ring, 
Institute for Technical Electronics, TU Munich, Germany. 
Figure 4-4 Sub threshold slopes 
Aiso as seen in transfer characteristics of 100 nm and 50 nm (Figure 4-3), PD devices 
have lower off CUITent levels, which is another advantage for PD devices (especially in 50 
nm channel length) [26]. 
Another advantage for FD devices is that as shown in Figure 4-4, they can have near 
ideal sub threshold slopes in smalliengths [26]. 
It is also concluded in this article that the channel length of FD devices is limited to 4 
times the thickness of silicon channel. This li mit is set to guarantee the good performance 
of the transistor. It is generally concluded that except in sorne cases, FD devices have 
better performance than PD devices [26]. 
4.1.4 Modeling and simulation 
Modeling and simulation tool used in this mas ter project is nanoMOS 3.5, which is 
accessible through nanohub.org. This tool is originally developed by Zhibin Ren. 
NanoMOS tool simulates thin body fully depleted double gated n-MOSFETs using 
different transport models [28]. 
Double gate MOSFETS are devices with two gate electrodes, placed at two opposite 
sides of channel. This architecture gives better control of channel. Generally, these two gate 
78 
electrodes can be two surfaces of one gate or two independent electrodes controlled by two 
gate signaIs [20]. 
This thesis tries to solve the problem of near ballistic, non-equilibrium transport in 
deeply scaled transistors using non-equilibrium Green's functions. This method replaces 
the quasi-equilibrium, scattering transport model, valid in transistors with longer channel, 
and is meant to describe quantum effects that occur in nanoscale [29]. 
In the introduction the scaling of MOSFETs is discussed in regard to the bulk and SOI 
technologies. Here a resume of the discussion is presented. 
When devices are scaled, a compromise between functionality and reliability IS 
inevitable. Functionality and reliability are affected by Short Channel Effects (SCEs) [8]. 
Performance related examples of these effects are: In sma11er channel lengths the 
threshold voltage ro11off causes the sub threshold swing to reduce. That makes it more 
difficult to turn the transistor off. Drain-induced barrier lowering (DIDL) effect causes the 
drain voltage to vary with threshold voltage. This is a challenge in designing circuits using 
these devices. Reliability examples of SCEs are gate tunneling CUITent junction tunneling 
CUITent. 
Higher channel doping concentration, Thinner insulator and a higher isolator dielectric 
constant in relation to semiconductor dielectric constant make SCE effects less significant 
and make deeper scaling possible [29]. 
On the other hand a low channel doping concentration decreases body effect 
coefficient, which increases sub threshold swing. To solve this conflict, a doping profile 
79 
can be use that gives the channel a lower doping concentration near the silicon-oxide 
interface and a higher doping concentration elsewhere [29]. 
These measures and sorne other make it possible to scale bulk silicon transistors down 
to 25 nm. After that leakage currents are too high. In partially depleted SOI MOSFETs, 
isolation causes charge build up in channel and floating body effect. This is reduced in a 
fully depleted SOI device [29]. 
In a double gate FD MOSFET the second gate reduces the SCEs and so high doping 
concentration is not very much needed to fight SCE anymore. Lower doping decreases the 
band to band tunneling junction leakage CUITent [29]. 
The introduction concludes that despite fabrication difficulties, thin body DG 
MOSFETs are the best choice for the continuation of scaling [29]. 
4.2 Geometry and fabrication 
4.2.1 Fabrication Pro cess 
Double Gate SOI MOSFET is believed to be a structure with high potential for being 
used in deeply scaled MOSFETs. [30] 
Here a fabrication process will be introduced that builds Double Gate SOI structures 
on wafers that are not exactly conventional SOI wafers. In this method, amorphous silicon 
is used as the base for making a film that behaves like an SOI film. More specifically 
amorphous silicon is recrystallized to form a high quality polysilicon layer that behaves 
similarly to single crystal silicon. This film is called large grain polysilicon on insulator or 
LPSOI. Double gate device that is made on this film has close characteristics as one made 
80 
on a conventional SOI film. Instead this method offers more flexibility and simplicity in 
fabrication process of devices. Solidworks is used to generate fabrication models and masks 
in this chapter [30], [33]. 
Bottom gate, in this method, is formed using a semi-recessed LOCOS process. It is 
necessary to briefly introduce this process. LOCal Oxidation of Silicon or LOCOS is a 
process that forms a silicon dioxide layer in a selected region on a silicon surface in a way 
that the silicon/oxide interface is placed beneath the surface. It was first developed to 
insulate different MOS transistors from each other. The steps to be taken to realize the 
selective oxidation on the silicon layer are as follows [20] : 
First a silicon dioxide layer and then a silicon nitride layer is deposited on the 
silicon using Chemical Vapor Deposition (CVD) method (figure 4-5, layers 2 and 3). Then 
these layers are etched in selected regions so that silicon is exposed. Now if the wafer is put 
in thermal oxidation environment, the coating layers will prevent the oxidation of silicon in 
unwanted regions, hence LOCal Oxidation of Silicon. Silicon nitride permits very low 
diffusion of oxygen in high temperature, so it is a good coating layer for selective 
oxidation. [20] 
When silicon is oxidized, its volume increases to 2.4 times. In a LOCOS process, 
this means that the coating layer is pushed upward. This expansion creates a strong tension 
between silicon and nitride layer that will damage the substrate. Oxide layer provides a 
solution: when heated during thermal oxidation, its viscosity is decreased so it Can absorb 
the tension between two layers. [20] 
CVD SiO, " 
IL 
IlL 
1\1. 
V. 
. ........ O 2 ....... '1 
VI. 
~, 
....... r ). ' 
'; ~': ( J ! 
(! ; 
................. , 
....... Cl ~ ': 3 j 
_ ......... ,, 1; 
VII. • • " 2>. ,_ .. _-_ ........... , ... ~ 
_ . .Al l 
www.wikipedia.org 
Figure 4-5 Semi recessed LOCOS process 
81 
When the oxide is grown to the desired amount, the oxidation is stopped and the nitride 
mask is removed. What was explained is semi-recessed LOCOS, in which oxidation starts 
from the actual level of silicon layer. In fully recessed LOCOS before starting thermal 
oxidation, a little silicon is etched so that oxidation starts from a lower level, which creates 
a more deeply buried local oxide. [20] 
In the following process, LOCOS method is employed to form the bottom gate. First 
bottom gate area is selected and coated with an oxide and a nitride layer (Figure 4-6) and 
then Thermal oxidation is performed. Figure 4-7 shows the mask used to etch oxide and 
nitride layers and leave the selected areas of silicon exposed to oxidation. This LOCOS 
process creates bottom gate edges with more planar-like features. It means that the process 
82 
softens the sharp edges ofbottom gate and that decreases the reliability problems caused by 
these edges (Figure 4-8). Then by Arsenic (a substance from group V) implant the bottom 
gate is doped. A high quality bottom gate oxide layer, with 270 Angstrom thickness, is also 
formed by thermal oxidation. [30] 
3D models are an created using solidworks [33]. 
83 
Figure 4-6 Preparation for LOCOS process 
Figure 4-7 Mask used in LOCOS process to etch oxide and nitride layers 
Figure 4-8 Finished LOCOS process with bottom gate area to be implanted (red) 
and oxide (blue) 
Now that bottom gate and its oxide layer are formed, to make the body of the transistor, 
500 to 1000 Angstrom of amorphous silicon is deposited on the gate oxide to be 
84 
crystaUized into an LPSOI layer. This is done through Metal Induced Lateral 
CrystaUization (MILC), which is itself a form of Metal Induced Crystallization (MIC). MIC 
is a method of transforming amorphous silicon to polycrystalline silicon in relatively low 
temperature using a metal as the seed for crystallization. An annealing step in temperatures 
between 150 and 400 is also included to let the crystals grow. In MILC method, instead of 
coating the amorphous silicon with the seed metal, metal is put in contact with silicon only 
in one small area and then the crystal is grown gradually and laterally, starting from the 
metal seed and expanding throughout the amorphous silicon. In this way metal 
contamination is mu ch lower because the silicon/metal contact is much more limited. In 
transistor applications, to grow a polycrystalline channel from amorphous silicon, metal can 
be deposited in source or drain region to better keep the channel from metal contamination 
[20], [31]. 
Here the amorphous silicon (deposited in a Low Pressure Chemical Vapor 
Deposition, LPCVD process using SiH4 as Silicon source) is transformed in a Nickel 
Induced Crystallization process in 580 C for 24 hours [30], [31]. 
Figure 4-9 Amorphous silicon deposited on top of oxide layer, also LPO layer 
(green) and Nickel (violet) are deposited to start crystallization. 
Figure 4-10 Mask used to etch the LPO 
85 
To create the seeding window for Nickel deposition, 3000 Angstrom of Low 
Temperature Oxide (LPO) is deposited and etched in the side. Then 100 Angstrom of Ni is 
deposited and lateral crystallization is started (Figure 4-9). Figure 4-10 shows the mask 
used to etch the LPO to make the contact window for Nickel. When crystallization is 
fini shed after several hours, the Ni is removed with H2S04:H202 (4: 1) in 70 C and then 
the sacrificial oxide layer is aiso removed with HF. Then the annealing process is 
performed at temperatures less than 850 C which increases the grain sizes and completes 
the process [30], [31]. 
86 
Then the channel is implanted with Band then 100 Angstrom oxide is grown as the 
top gate oxide layer (Figure 4-11). Figure 4-12 shows the mask used to etch the crystallized 
polysilicon of channel. Then a poly-crystalline layer is deposited on the oxide layer (Figure 
4-13). This poly layer is then etched to form the top gate (Figure 4-14) and at last the top 
gate and the source and drain regions are implanted with arsenic ions. [30] 
87 
Figure 4-11 Crystallized channel, ready to be implanted with B ions 
Figure 4-12 Mask used to etch the channel silicon 
Figure 4-13 Deposition of upper gate polysilicon layer 
88 
Figure 4-14 Mask used to form the upper gate 
4. 2.2 Packaging 
A processor lC as the main way semiconductor industry sells MOSFETs has been 
chosen as the subject of packaging study. There are two major steps in packaging of a 
processor: attaching the die to the package substrate and connecting the package substrate 
to the PCB [32]. 
Wire bonding is the primary and still most frequently used method of connecting die 
attachment. It consists of simply stretching a golden, copper or aluminum wire between the 
pads on the die and the connectors. Wire bonding can be used to conne ct dies to other dies 
or PCBs to other PCBs too. This technology has been developing considerably, but today 
and in the case of processors, where the frequency, density and power dissipation is high, 
wire bonding has been replaced by flip chip technology [20], [23]. 
No wire is used in flip chip technology, instead the die, with its pads made on its top 
side, is flipped and put directly on the board (this is why ït is also called Direct Chip Attach 
or DCA). Flip chip reduces the occupied area and height of a chip significantly. So it is 
89 
ideal for applications where space is limited. Aiso because bond wires are eliminated, 
inductance and capacitance of the connection are reduced up to 10 times and connection 
path is shortened 25 to 100 times. This gives it a much better electrical performance and 
suitable for high speed applications. In regard to VO capability, flip chip offers a big 
advantage: in wire bonding, putting pads far from perimeter of the die makes the die 
attachment very difficult and so the inside area of the die is never used. In flip chip 
technology all the surface of the die is used for connections. This increases the number of 
pads possible on a die, a much needed feature by increasingly dense processors. 
Mechanically, flip chip connections with their adhesive underfill in place, are the strongest 
connection type. At last, in high volume flip chip can be the least costly connection type 
[32]. 
90 
www.flipchips.com 
Figure 4-15 Electroless Ni-Au UBM 
www.flipchips.com 
Figure 4-16 Reflowed solder burnps on electroless Ni-Au UBM 
Solder burnp flip chip is the rnost cornrnon type of flip chips. There are four steps in a 
solder burnp flip chip process: First step is preparing pads to be connected to solder. The 
second step is to put the burnpers on the die pads. These burnpers are the bulging parts that 
rnake the connection. Then the die is aligned and attached to the substrate using a reflow 
91 
process (technically me ans deforming plastically). At last the space under the die is "under 
filled" by an insulate adhesive. In addition to electrical connection, bumps provide a 
thermal path for heat dissipation from die to substrate. "Sputtered UBMlprinted solder" is 
the chosen solder bump flip chip process. It has low cost, good composition control, 
compatibility with different alloys (lead-free, non-binary ... ) and acceptable precision (150 
um) [32]. 
"Micro lead frame" or "flat no lead" technology is a near "Chip Scale Package" (a 
package whose area is not more than 1.2 times the area of the die) package to substrate 
connection method which uses perimeter leads or pins (tens to two or three hundreds of 
pins in one or two rows) to connect to the board. Die is attached by wire bonding to the 
leads on the frame or perimeter of the package. The die is in contact with a thermal pad 
which can be on top or on the bottom of the package. Package is connected to the board by 
soldering using a solder paste. This is a low cost, reliable and thermally effective package 
[20], [23]. 
Wire 
bonding 
Silicon le Plastic encapsulation 
Die attach 
Section View of a QFN package 
www.wikipedia.org 
Figure 4-17 A "Quad Flat No lead" structure 
92 
"Grid array" packages are the technology to be employed wh en high va count chips 
with highest technical need are to be used. In this technology the connectors at the bottom 
of the package are patterned in a square array that might coyer the entire bottom surface of 
the package. Pin Grid Array (PGA) is a kind of grid array package in which connectors are 
pins coming out of package bottom surface. These pins can connect to the board with 
through holes or using a socket. From 1990 PGA has been the dominant package used for 
Intel and AMD processors. This continued almost until 2004 (Intel Pentium 4 and AMD 
Athlon 64) when Land Grid Array (LGA) appeared. In LGA the pins are moved to the 
socket. Instead on the package bottom surface, where pins would have been, there is an 
array of pads that make the contact with pins on socket. Pads are made of Cu covered with 
a layer of Au. Latest processors need more than 1000 pins and LGA provides a higher pin 
density, compared to PGA. This technology is currently the dominant package solution in 
processor industry. In "BalI Grid Array" (BGA) pins or pads, discussed above, are replaced 
with solder bumps. The package can be mounted on a PCB directly, without solder paste 
93 
and with a reflow process, with the least height and resulting in a very good connection. 
Fine pitch BGA (FBGA) is a branch of BGA with the highest density of contacts among 
package to substrate connection types. In 2014 it will reach to 100 urn array pitch. BGA 
packages have higher thermal conductivity between the package and the board and lower 
inductance, due to direct contacts [20], [23]. 
www.wikipedia.org 
Figure 4-18 Pin Grid Array at the bottom of a processor 
94 
www.wikipedia.org 
Figure 4-19 An LOA processor 
BOA has technical advantage over LOA. It has more bump density, smaller size and 
less parasitic effects and better thermal conductivity. But it needs to be connected to the 
board by a machine and cannot be removed and replaced. Today, big producers use LOA 
technology with a socket on the mother board for PC applications where removable 
processors are needed. BOA instead can be used when consumer's choice is limited and 
size and mechanical uniformity of the device is important (very low cost laptops and 
notebooks, for children or in laptop distribution programs in developing countries). In 
smaller processing units used in household devices, cars, watches and other electronic 
devices, BOA might have a wider application. 
95 
www.wikipedia.org 
Figure 4-20 BGA RAM lCs connected to a PCB 
Today, one important trend in material development is to make it green. In wire 
bonding, flip chip and die attachment techniques, lead free regulations have been accepted 
in many countries. Underfill materials, thermal interface materials and package substrates 
move toward halogen-free materials [23]. 
Substrate is the most expensive part of packages. It also imposes parasitic effects and 
performance limitations more than any other part of packaging. Smaller feature size, less 
thermal expansion and better electrical performance are the desired characteristics in 
substrates. To increase speed of circuits without parasitic capacitance effects from package, 
low k materials are used in substrate. Now materials with dielectric constant of 3.4 are 
usable in industry the ones with dielectric constant of 2.8 is in research. Moving toward 
halogen-free materials has challenged lowering dielectric constant [23] . 
96 
4.3 Numerical modeling 
NanoMOS 2.0 is a simulation tool for ballistic and non-equilibrium transport in deeply 
scaled MOSFETs. It is accessible on nanohub.org and employs three transport models with 
different levels of quantum effects consideration to simulate Thin Body Fully Depleted 
Double Gated n-MOSFETs [28]. 
After a brief introduction, sorne simple NanoMOS' simulation results are shown. Then 
NanoHUB website and NanoMOS as a tool will be explained in more details. 
NanoHUB is a nanotechnology web-based resource for educational and academic 
purposes. It inc1udes online courses and educational documents and learning modules to 
help students and researchers become familiar with nanotechnology in a general or in a 
technical and academic level. It also inc1udes a large number of simulators that simulate 
various phenomena and devices in nanoscale. These simulators vary in complexity and 
goal: sorne are designed for general education about a certain aspect of nanoscience and 
sorne perform detailed simulations on a specific device. NanoHUB benefits from the 
contribution of more than 600 researchers and educators and the number of its users and 
simulations done using its tools is rapidly growing [34] . 
One can sign in to nanoHUB as a researcher and access the tools inc1uding NanoMOS 
directly. In this way a communication window is opened between user's web browser and 
the tool through nanoHUB and input is given by the user to the system step by step and the 
program will run on a remote computer and the downloadable results are presented to the 
user. In many cases tools are open source and downloadable as it is the case for NanoMOS 
presently. In this way users can run the code on their own computer. In this project, of 
97 
course, aIl of the measurements and manipulations have been done on open source 
MATLAB code ofNanoMOS [29], [34]. 
In NanoMOS, User can choose to simulate a non-equilibrium thin-body transistor and 
calculate drain cUITent-drain to source voltage and drain cUITent-gate to source voltage 
plots. The selected device can be simulated using drift diffusion transport, semi classical 
ballistic transport and ballistic transport with Green's function approach. Each one of these 
models takes into account a curtain amount of quantum phenomena and produces the 
results with curtain exactitude. Naturally each one consumes a curtain amount of 
computation resources. Then user can choose a bias for their simulation including both 
gates and source and drain voltages and the signal step sizes. Then device geometry, doping 
profiles, wafer material and insulators dielectric constants are defined. At the end sorne 
simulation options (equations, convergence parameters and other technical options) are 
chosen [28]. 
User can choose among a wide range of results. In accordance to the importance of 
quantum aspect in these devices, most of thern are related to quantum pararneters (electron 
density profiles, energy profiles ... ), but it is also possible to see sorne simple electronic 
plots like the ones discussed above [28]. 
Using this tool, drain CUITent variations of thin-body fully-depleted double-gate 
MOSFET has been studied in relation to changes in different parameters. The goal is to 
observe and measure the effects of different pararneters, discussed in Physical and 
Mathernatical modeling sub-chapter, on MOSFET's performance. Table 4-2 contains 
device pararneters and Table 4-3 bias pararneters [28]. 
Table 4-2 Simulation device parameters 
parameter 
Temperature 
Source/drain doping 
concentration (lcm3) 
Source/drain Gaussian 
doping profile slope 
Material and wafer 
orientation 
Top/bottom insulator 
relative dielectric constant 
Quantity 
300 
1e+20 
o 
Si, 
wafer(OO 1), 
trans(100) 
3.9 
Body relative dielectric Il.7 
constant 
Silicon film thiclmess 
(nm) 
l.5 
Gate isolator thiclmess 1.5 
(nm) 
98 
99 
Table 4-3 Bias parameters 
parameter Quantity 
ToplBottom gate voltage 0.6 
(V) 
Source voltage (V) 0 
Drain voltage (V) 0: 0.05 : 1.4 
In Figure 4-21, drain CUITent is shown for gate lengths of 3, 5, 10 and 20 llffi. A 
respective decline in CUITent is observed that accounts for increase in canal resistance. 
By decreasing the thickness of gate isolators from 1.5 nm to 1 nm, the maximum drain 
CUITent is increased from 1.11e+3 to 1.21e+3 (J1A / ;.on). (Figure 4-22) 
Increasing the relative dielectric constant of gate isolators from 3.9 to 6, more charges 
are absorbed to canal and CUITent increases. (Figure 4-23) 
Finally by employing a doping profile in the channel, the performance of channel is 
improved and CUITent increases. (Figure 4-24) 
!.",. ( 
"" 1 / 
/ 
1.11.14,---------, 
Figure 4-21 Drain CUITent for different gate lengths 
~~b----"në6--------~ 
Ct.., "",,.(11) 
~~.---~Q.~--~---~ 
Drain~"'t:(V) 
Figure 4-22 Drain CUITent for different gate isolator thicknesses 
100 
·C' . 03 
.".---
o _ .... _ ..... _ . __ .-_., __ . __ .. 
0.6 \).0 : 
{.), a:nI;U~* (V1 
Figure 4-23 Drain CUITent for different gate isolator dielectric constants 
1,··03 ~"--
s /~' 
li ~ 1II0 / / 
O ;l:--_~ M C~ , ",-~ CG 04 
Ol al" ~ ... M Ora'n"ltl_II (Vj 
Figure 4-24 Drain CUITent for different Gaussian doping profile slopes 
101 
NanoMOS gets its input as an input file that includes different input data categories. 
The first thing that input file defines is device structure. Sizes and positions are defined 
with symmetry in the direction of x and with y coordination starting from top of the top 
oxide layer. As it is explained in more detailed in the next chapter, a grid is defined on the 
device's plan that is used in solving the model with matrix calculation method. The size of 
grid elements needs to be given as input too. These are device and grid input parameters as 
explained by [35] : 
nsd: SourcelDrain doping concentration ( 1 / cm3 ) 
nbody: Body doping concentration ( 1 / cm3 ) 
19top: Length of the top gate (nm) 
19bot: Length of the bottom gate (nm) 
lsd: Length of the Source/Drain (nm) 
overlap_s: Source extension length (nm) 
overlap _ d: Drain extension length (nm) 
dopslope_s: Source Gaussian doping profile slope (dec/nm) 
dopslope _ d: Drain Gaussian doping profile slope (dec/nm) 
tox_top: Top insulator thickness (nm) 
tox bot: Bottom insulator thickness (nm) 
tsi : Silicon film thickness (nm) 
temp: Lattice temperature (K) 
dx: Horizontal no de spacing (nm) 
dy: Vertical node spacing (nm) 
refine: Refining factor for vertical grid (used in schred.m) 
102 
A positive value for overlap means that gate extends over drain or source. A negative 
value refers to an underlap: gate does not reach drain or source in x direction. A Gaussian 
profile is assumed for doping profile in source and drain. "dopslope" parameters define the 
slope of this Gaussian profile. If they are set to zero, the doping profile is a step function. In 
solving Schrôdinger equation (in schred function) a refined grid is used. That is where 
103 
refine parameter is used. Bellow a plan of device and device parameters is shown [29] , 
[35]. 
, 
:-< 
t 
1 1 
:-< ~I 
t 
overlap 1 
i 
N+ (nsd) ~ xchar 
, 
i 
19top 
1 
>' U 1 
vgtop, "vfunc_top 
: .... 
Top oxide :' 
, 
, 
~ 
.. P (nbody) .. xchar~ 
, 
~ 
Bottom oxide 
lsd 
N+ 
U 
vgbot, 'wfunc_bot X 
1 
~I r 
19bot y 
ZhibinRen, with contributions by: Ramesh Venugopal, Jung-HoonRhew, Jing Guo, 
Dave Rumsey, Dr. SebastienGoasguen, Prof. SupriyoDatta and Prof. 
Mark S. Lundstrom, "NanoMOS 2.0 A 2D-Simulator for Double-gate 
MOSFETs", Manual, Purdue University, Decembre 2001. 
Figure 4-25 Deviee parameters 
Five transport models are available for simulations. In Classical Ballistic Transport 
model (CLBTE) and Quantum ballistic Transport model (QBTE) in each point of x 
(channel direction) a ID Schr5dinger equation is solved and subband profiles of electrons 
are found (ESUB(x)). Then in CLBTE carrier transport in each subband is found using 
thermionic emission. Thermionic emission is the flow of carriers over a potential barrier 
due to heat. In this model the tunneling through the barrier between source and 
" 
,. 
104 
semiconductor is not considered. In QBTE transport of carriers in x direction is calculated 
by solving a ID Schrôdinger equation in x direction using Non-Equilibrium Green's 
Function method (NEGF). Source-channel barrier quantum tunneling is considered in this 
model. Drift diffusion (DD) is a version of classical drift diffusion model with quantum 
corrections. Subband profiles in y direction are found, as in two previous cases, quantum 
mechanically by solving ID Schrôdinger equation. To find the carrier transport in each 
subband a ID drift diffusion equation is written in x direction. Mobility is found with 
Caughey-Tomas model (equ. 4.8) where *E is electrical field in x direction and other 
parameters are given by user [29] , [35]. 
f.1(E) = mu /ow (4.8) 
[1 + (E . mu /OW) bela ]lI bela 
ve/ sa! 
Energy transport model (et) is a ID energy transport model that works by calculating 
thermal flow in a gas of electrons moving in3D [29] , [35] . 
Quantum Dissipative Transport model (QDTE) calculates ESUB(x) or subband profiles 
quantum mechanically as in CLBTE. Then it employs Green's function method to find a 
dissipative model of transport, meaning that it accounts for scattering of carriers to describe 
the 2D transport [29] , [35]. 
Bellow a list of transport model related parameters that input file must contain is 
presented as explained in [26]. The last two are related to energy transport model [35]. 
model: transport Model (DD, CLBTE,QBTE,ET,QDTE) 
mu_low: Low field mobility (cm/s) 
beta: Caughy-Thomas parameter (dimensionless) 
vsat: Saturation velocity (cm/s). 
ELE_TAUW: Electron energy relaxation time (s) 
ELE _ CQ: Energy flux related parameter (dimensionless) 
These are bias parameters as explained in [35]: 
vgtop: Top gaie voltage 
vgbot: Bottom gate voltage 
vs: Source contact voltage 
vd: Drain contact voltage. 
vd_initial: Drain start voltage. 
vgstep: Step size for the gate voltage 
ngstep: Number of gate voltage steps 
vdstep: Step size for the drain voltage 
ndstep: Number of drain voltage steps 
105 
vd _init is the initial guess in iterative process of solving the equations for drain voltage. 
It is recommended that to increase the convergence rate of iterationsvd_init should be set 
bellow vd value. If ndstep is nonzero, drain CUITent will be plotted versus drain voltage and 
if it is zero, drain CUITent will be plotted versus gate voltage [35]. 
These are material parameters (mostly of quantum nature) as explained by [35]: 
wfunc_top: Top gate contact work function (eV) 
wfunc _bot: Bottom gate contact work function (eV) 
mlong: Longitudinal relative electron mass ratio 
mtrans: Transverse relative electron mass ratio 
eps_top: Top insulator relative dielectric constant 
kox _top: Top insulator relative dielectric constant 
kox bot: Bottom insulator relative dielectric constant 
dec_top: Conduction band offset between the substrate and the top gate insulator 
(eV) 
106 
dec _bot: Conduction band offset between the substrate and the bottom gate insulator 
(eV) 
Self-consistent solution of equations in NanoMOS is dependent on two divergence 
parameters. One is "dvpois" which sets the minimum difference needed between potential 
values resulted from two iterations of solving Poisson equation. If the difference between 
potentials is less than dvpois then the solution has converged to the final result. Other one is 
"dvmax" which does the same for the potentials resulted from self-consistent solution of 
Poisson equation and Schrôdinger equation [29], [35]. 
There are sorne options to be set in input file too. "oxyenetrate" is a flag that 
determines if, when solving 1 D Schrôdinger equation in y direction, the electron 
penetration in the oxide layer should be considered or not. In chapter 5 it is explained in 
more details how this is done. "dg" is another flag that couples the ramping of top and 
bottom gate voltages, if it is set to true. If it's faIse, the bottom gate will stay at "vgbot" 
value and does not change. When using drift diffusion model, user can choose to apply 
107 
Fermi-Dirac or Maxwell-Boltzmann statistics in solving the transport model. When using 
other models, Fermi-Dirac statistics is automatically applied. Electrons in different valleys 
have different effective masses and it influences their mobility (Figure 4-26). Valleys in y 
direction impose much more effective masses and are called unprimed valleys. "valleys" 
flag determines if only electrons in unprimed valleys should be considered in simulation or 
electrons in aIl valleys. "num_subbands" tells the number of subbands in each valley that 
must be considered in solving iD Schrôdinger equation. Usually only lowest subbands are 
occupied by electrons [29] , [35]. 
y 
Illy = Illt (0.19 m e) 
Illy =1111 (0.98 me) 
Illy = mt 
ZhibinRen, with contributions by: Ramesh Venugopal, Jung-HoonRhew, Jing Guo, 
Dave Rumsey, Dr. SebastienGoasguen, Prof. SupriyoDatta and Prof. 
Mark S. Lundstrom, "NanoMOS 2.0 A 2D-Simulator for Double-gate 
MOSFETs", Manual, Purdue University, Decembre 2001. 
Figure 4-26 Valleys and electron masses 
User can choose the plots to be saved and showed using following flags as explained in 
[35] : 
z 
x 
I_ V: Drain current versus drain voltage characteristics. 
Ec3d: Three-dimensional plot of the conduction band. 
Ne3d: Thre~-dimensional plot of the carrier concentration band. 
Ec_sub: Energy profile of the different subbands along the channel. 
Ne_sub: Carrier concentration of the different subbands along the channel. 
Te: Carrier temperature along the channel (eV). 
Ec _IV: Energy profile of the lowest subbands. 
Ne IV: Carrier concentration of the lowest subbands. 
108 
The I_ V characteristics will be plotted versus gate voltage Ifndstep is set to zero. If 
more than one bias point has been simulated, all other plots are generated for the last bias 
point that has been simulated [35]. 
At the end of each simulation, data like convergence history, I-V characteristics, 
subband energies, electron densities, potential profile and carrier temperature are saved to 
files with .dat and .psextensions [29], [35]. 
Chapitre 5 - Implementation 
5.1 Mathematical formulation 
As it has been stated qualitatively in previous chapters, conventional theories of carrier 
transport cannot predict the behavior of MOSFET channels in nanoscale. In this chapter we 
see a more detailed description of transport models that describe these devices more 
accurately. 
Conventional models of carrier transport are derived from Boltzmann Transport 
Equation (BTE). These models are formed on a scattering based view of carrier transport. 
This, being the case for long channel devices, fails to models quasi-ballistic transport that 
occurs in very short channel devices. [29] 
BTE deals with statistical distribution of particles in the matter. It is applied when a 
system is far from thermodynamic equilibrium, for example when there is an electric field 
applied to a region (which is the case with MOSFETs) or when a temperature gradient is 
applied. In these cases BTE can de scribe how CUITent flows (how charges move) or how 
heat flows (movement ofheat carriers) [30]. 
BTE describes statistically the position and momentum of particles as a function of 
time. f(r,p,t) is a density function so that f(r,p,t)drdpis the number ofparticles in a 
neighborhood of rand with a momentum within a neighborhood of p in time t . Now if an 
110 
external force Fis applied to partic1es with mass m, without considering any collisions, we 
will have [20], [29]: 
f(r +.E.dt,p + Fdt,t + dt)drdp = f(r,p,t)drdp (5-1) 
m 
Meaning that after time dt has passed, partic1es that were in r with momentum p, 
would have moved to x +.E. dt and their momentum would have changed to p + fdt . Now 
m 
if we add the effect of collisions of partic1es as pf !coll (founded by quantum calculations), 
pt 
we will have [20], [29]: 
Solving BTE is a hugely complicated problem and still it fails to capture any non-
c1assical carrier transport features. To do so, Non-Equilibrium Green's Function (NEGF) 
formalism is a very appropriate method to calculate nanoscale effects. Green's functions, in 
mathematics, are functions used to solve inhomogeneous differential equations with sorne 
boundary conditions. In Physics this name is often used more generally to refer to many 
correlation functions [20], [29]. 
In NEGF approach, electrons and phonons, which are carriers of CUITent and heat, 
create what is called quantum fields. Previously it was mentioned that these carriers do not 
have equilibrium distribution in nanoscale. NEGF is a method of solving the non-
equilibrium dynamic equation that de scribes non-equilibrium distribution of these partic1es. 
A field operator, \fi (r) (r being space-time, written as r = CF, t)) is defined in relation to 
quantum fields of these partic1es. Then Green's functions are written in as functions of 
111 
these field operators: < 'l' +(r2 )'I'(fj) > or< 'l'(fj)'I'+(r2 ) >. Brackets average field 
operators' product over states of the system. So the functions define the relation between a 
field operator in space-time fj and its conjugate in space-time r2 • In this way Green's 
functions describe the correlation of carriers in 'i and r2 [29]. 
Fourier transform is used to transform these correlation functions from time domain to 
energy domain and finally two equations that de scribe non-equilibrium transport can be 
derived as [29] : 
G(E) = [El - Ho(E) - ~(E)rl (5-3) 
Gn (E) = G(E)~in (E)G+ (E), G P (E) = G(E)~out (E)G+ (E) (5-4) 
G is retarded Green's function and G+ is its Hermitian conjugate or the advanced 
Green's function. Gn and GP are correlation functions that describe electron and ho le 
density. ~, ~in and ~out are self-energy functions. His the single-electron effective mass 
Hamiltonian, related to band structure and Hartree potential of electron-electron 
interactions. This scalar potential is found by solving the coupled Poisson equation. Solving 
these correlation functions self-consistently with Poisson equation will produce electron 
and current density spectrum equations [29]: 
n(E,m) = _1 G,: , (5-5) 
27r 
112 
Where n(E,m) is the electron density spectrum and m is a discretized unit cell. C; is the 
mth diagonal entry of e n and lm (E) is the current spectrum at terminal m. L~~ is the mth 
diagonal entry of Lin . q is the elementary charge constant and h is the Plank: constant [29]. 
To widen our view over this approach we see how this approach is 
implemented to a study of a MOSFET with ballistic transport. In such a simulation, 
reservoirs of carrier, which are source and drain, are considered in equilibrium. Their 
potential difference is represented as their different Fermi energy levels. The interaction 
between source and drain and the active region (channel), which constitutes the transport 
carrier system is represented inL, Lin and L Olif . Interactions of electrons with each other are 
represented in H as mentioned above. With no other interactions, the transport is ballistic. 
L in and L Olif can be written in terms of L and Fermi energy levels. Having the Poisson 
equation to solve for H, L can be solved too [29]. 
When simulating short channel MOSFETs, it is important to model short channel 
effects and ballistic transport in two dimensions. Here we will review the solution of 
Poisson equation and Schrodinger equation in 2D. NEGF technique is used to solve 
Schrodinger equation. This 2D solution can be very computationally expensive. In the case 
of ultra-scaled SOI MOSFET equations can be simplified using mode-space technique to a 
great extent. Poisson equation will serve to produce potential profile and transport equation 
produces charge and current densities. Modeled device is shown in Figure 5-1 . The grid 
elements in the Figure 5-1 will serve as elements ofmatrix calculation [29]. 
113 
Top Gate 
rx~ t ····· .. +·········;··········;···· ······· ; ·· ··· ·· ·· ·; · · ······ · ·;·· · ··· ··· ·~" ·~F~·"··· · .. ····,·· .. ······· ; ······ ·····t···········;····· · ····· ·r· ·· ·· ··· ·· · ;··· ·· ·· ··~---.- b -+--+---l~+--
z 
Source 
Bottom Gate 
ZhibinRen, "NANOSCALE MOSFETS: PHYSICS, SIMULATION AND 
DESIGN", thesis, Purdue University, Octobre 2001. 
Figure 5-1 Device grid 
To solve the Poisson equation, Gauss's law is used: 
#[&Ê(x, z)].dS = J q[p - n + N D - N A]dn 
n 
Drain 
(5-7) 
Ë is electric field, p and n are hole and electron densities and ND and N A are donor and 
acceptor's doping concentrations, q is elementary electric charge and & is dielectric 
constant of the materials. If we suppose that grid node numbers in x and z directions are Nx 
and N z ' solving Poisson will produce N x x N z potential values (one for each node). To find 
these unknowns N . x N z equations must be used, for which we can write and solve Poisson 
in aU of the internaI nodes or we can use boundary conditions in an of the boundary nodes. 
For the internaI nodes, using central difference approximation (field on a mesh boundary is 
114 
equal to the average of fields in central nodes of neighboring meshes) and replacing 
electrical field with potential, (5-7) can be written as (5-8) for anode mn. a and b are taken 
from Figure 5.1. By assigning smaller value for b the grid will be finer in z direction and a 
more precise result for the ultra-thin channel is possible. We suppose that p is zero in fully 
depleted ultra-thin body nMOSFET [29]. 
a b a b b a ab 
-b Vm- J.n +- v'n.n-J - 2(-b +-)v'n,n + - v'n.n+J + -b Vm+J.n = --q(ND - NA - n)mn (5-8) 
a a a ê 
ê is determined according to the material of the region in which the mesh is placed. 
When a mesh is on the boundary of Si and oxide (with ê Top and ê 801 for the material on the 
top and on the bottom) layers (5-8) becomes [29]: 
a V b (1 êBOI)V (a b)(l êBoI)V b (1 ê BOI )V a ê BoI V 
- +- +-- - -+- +-- +- +-- +--
b m-Ln 2 m.n-J b m,n 2 m.n+J b m+Ln a ê Top a ê Top a ê Top ê Top 
ab 
=--q(N -N -n) D A m.n 
ê Top 
(5-9) 
In boundary nodes the same can be done except for the nodes neighbor with gate, 
source or drain electrodes. In gate contacts Dirichlet boundary condition is valid which 
determines the node potential in agate region node equal to vacuum potential of the gate 
[29]. 
(5-10) 
At source/drain contacts, Neumann boundary conditions impose that the net charge 
is zero [29]: 
(5-11) 
11 5 
It means that potential is floating and will be determined in a way that net charge is 
zero in the contact. If a more fixed condition for potential is chosen, ballistic transport will 
not be properly modeled, especially that ballistic effects are stronger in these regions. For 
other boundary nodes we just need to suppose that no electric field exits the channel, 
meaning that the potential on the outer mesh is zero. For example for a no de on the left side 
and one on the top side (5-12) and (5-13) are valid. For the outer mesh of the node on the 
top left side both apply [29]. 
Vm n - Vm_1 n = 0 . . (5-12) 
v -V = 0 
m,n tn ,n+ 1 (5-13) 
We now can solve the system of linear equations built above and find the potential 
in nodes . Here, a technique is discussed that improves the conversion of the solution of this 
system of equations. In fact electron density (n) used in the above system as input can be 
written as: 
n =N 3 [(FJm.n+qVm.n] 
m.n C 1/ 2 k T 
B 
(5-14) 
Fn is quasi-Fermi energy, V is the old node potential, 3 "2 is Fermi-Dirac integral of 
order Y2 and Ne is effective density of states in the conduction band. Looking at (5-9) we 
see that an increase in potential, implies a decrease in electron density, but in (5-14) it is the 
opposite. So using (5-14) along with Poisson equation in each iteration williimit the jumps 
of potential caused by solving Poisson. This will stabilize the convergence of Poisson 
equation and make the algorithm more efficient. (5-14) is nonlinear, so in a way we have a 
nonlinear Poisson system of equations now. We can use a Newton-Raphson loop to solve 
116 
it. The Jacobian matrix of this solution is a very sparse matrix which saves sorne memory 
and time [29]. 
Here we discuss application of NEGF to the transport equations. In ballistic 
transport, NEGF method is equal to solving Schrôdinger equation with open boundary 
conditions. 
First, we will see the algorithm to solve Schrôdinger equation in ID. This is done by 
an effective mass approximation of Schrôdinger' s equation. We want to review its solution 
in ID, in the direction normal to channellength (across gates and channel) [29]. 
(5-15) 
Ei is bound state energy of sub band i and \{Ii(Z) is the envelope function of 
subbandi. q is elementary charge and m; is electron effective mass in z direction. From 
here bound state energies are calculated and used in (5-16) and (5-17) (that are related to 
the structure of lattice) to find the electron and hole densities [29]. 
(5-16) 
(5-17) 
n2D, and P2D, and kB are constants and f.1 is body Fermi energy. Then (5-18) is used 
to find the CUITent in unit ofwidth [29]. 
(5-18) 
l Oi is a constant and :3' /2 is Fermi-Dirac integral of order Y2 [29]. 
Il? 
Deep scaling of MOSFETs has caused tunneling through oxide layer a problem to 
be considered in design. In fact insulators have a finite band offset in relation to the channel 
and this offset can be passed by sorne carriers. To calculate the tunneling cUITent, an 
integral is taken of the probability function of electrons in oxide layer. So with oxide layers 
as thin as 1 nm, it is necessary to simulate electron penetration in insulator. To do this, 
when solving Schrodinger equation, zero boundary conditions must be put at the interface 
between insulator and contact, not at the interface between semiconductor and insulator 
[29]. 
So when the problem is solving Schrodinger in 2D for the grid shown in Figure 5.1, we 
can first slice the 2D surface in z direction and solve the ID Schrodinger as explained 
above for each slice. Electron penetration in insulator can be considered too. Then the 
Hamiltonian (total energy operator that helps calculate time evolution of quantum states) of 
the system is written in terms of two wave functions in the direction of width of the device 
and in the grid plane. The Hamiltonians size is related to the number of nodes: (N
x 
x NJ 2 , 
but with mode space approximation, its size becomes related to nodes only in x direction: 
N/. This, of course reduces the time of calculations. To be able to use mode space 
representation, we must be able to suppose that [20], [29] : 
d • 
-'P; (x,z) = 0 
dx 
(5-19) 
In other words, the changes of potential in z direction must be negligible. Then it is 
possible to rewrite 2D Schrodinger equation as a ID (a long the x direction) equation in 
terms of just one sub band. In other words, equation can be solved for each sub band 
separately. So the Hamiltonian can be crushed into independent smaller parts. The above 
11 8 
assumption fits SOI technology very weIl, because in a very thin channel, it is less likely 
that potential changes along the z direction. For bulk technology space mode representation 
is not applicable [29). 
The next step is writing retarded Green function as [29] : 
(5-20) 
In which E l = E - Ek is longitudinal energy and L is self-energy matrix. Using 
) 
Green's function, a 2D electron density matrix is calculated for each longitudinal energy 
level, as [29] : 
(5-21) 
Total 2D electron density matrix can be derived from (21). Then, to produce the 3D 
electron density, in each node along x direction, distribution function l'I'i(X,z)1 2 is 
calculated with total 2D electron density matrix. The result can be given to Poisson 
equation for a potential profile calculation [29]. 
CUITent can be calculated using [29]: 
(5-22) 
Total CUITent is found integration of (5-22) over aU valleys (El ) [29). 
Scattering is another phenomenon to be modeled in a MOSFET simulation. In a general 
view, scattering happens because of defects in lattice, impurities, rough insulator surfaces 
and high densities of carriers. More specificaUy, sources of scattering can be categorized 
119 
as: surface roughness scattering and electron scattering with impurities, phonons and other 
electrons [29]. 
U sually mobility (f.1) is used to represent the ease of electron displacement in an 
environment. In quantum analysis, this can be done through the concept of state lifetime, 
meaning that lifetime of a state of an electron is the time before a scattering event causes 
the electron to go to another state [29]. 
f.1 = q < : > (5-23) 
<m > 
q is the elementary charge and q < r > is the average of electron state lifetime and 
< m' > is the average of electron effective mass in the transport direction. A verages are 
taken over different sub band energies that an electron might occupy [29]. 
The overall algorithm for simulating scattering is first to solve 2D Schrôdinger 
equation as two ID problems in vertical direction and in the channel direction. This will 
give the vertical electron concentration and sub band profiles and electron concentration in 
channel direction. Then having 2D electron density, 2D Poisson equation is solved to find 
new potential profile [29]. 
Physically, scattering means that the state of carriers is altered by sorne scattering 
phenomena while passing through the channel. The total number of carriers remains 
constant during these transformations. To model this, sorne functions are defined (called 
Buttiker probe) that take out sorne of the carriers from the system and import the same 
number of carriers, only with different energy states. This can mimic what scattering does 
to the carriers [29]. 
120 
Figure 5-2, taken from [29] shows the process of self-consistent solution of Poisson 
and Schrôdinger equation along with two Buttiker probes as methods of modeling 
scattering [29]. 
No 
Gupss initial potential VOid 
and ).ln 's in the first Büttlker probe model 
or 
'3_1/ 2 ().l,,) 's in the second Büttiker probe model 
Solve the iD Shroclinger pquation 
in the confinement direction for 
subband profiles and wave functions 
Evaluate leakage current at 
each scatterer. m, 1 III ().ln) 
conservation 
No 
Solve for Iln 's in the first Büttiker probe model 
or 
Solve the Poisson 
equation for V New 
Check convergence: 
Compare V New with VOid 
Evaluate 
cU/Tent and stop 
'3_112 ().l,J 's in the second Büttiker probe model 
ZhibinRen, "NANOSCALE MOSFETS: PHYSICS, SIMULATION AND 
DESIGN", thesis, Purdue University, Octobre 2001. 
Figure 5-2 Algorithm for modeling scattering 
121 
122 
Using Boltzmann equation, source injects electrons into the semiconductor channel and 
they must pass over a potential barrier made by gate. In fact sub band energy of electrons 
must surpass a minimum energy that depends on gate potential. The densities of electrons 
in each subband on different sides of this barrier are described by two different equations 
(left is for source side and write is for drain side) [29]: 
1 Ep, dE 1 00 dE nlefi (x)=~n2Di{ln(1+ eIlS )+[ 1 f r;:;:::"5 - I12(f.1S- E )+ 1 f r;:;3_1/2(f.1D- E )]}(5-
...; 7r 0 "'; E ...; 7r E . ...; E 
p' 
24) 
(5-25) 
n2Di is a constant, EpI is the maximum energy of subband i , f.1s and f.1D are contact 
Fermi energies of source and drain and 3 _1/2 is the Fermi integral oforder -112 [29]. 
Using Green's functions, this density function is formed in one equation [29]: 
+00 
n(x) = 7 nOi {f dEP_1 12 (f.1s - E)Dsi (E, x) + 3 _112 (f.1D - E)DDi (E, x)]} (5-26) 
nOi is a constant and D Si and D Di represent the effects of source and drain on the 
local density function for subband i [29] . 
Ballistic CUITent at source or drain contact is [29]: 
123 
-+00 
l O / W = '7 l Oi f TSDi (E)[:3 _1/ 2 (,us - E) - :3_1/ 2 (,uO - E)]dE (5-27) 
-00 
l Oi is a constant and TSDi (E) is source-to drain transition III terms of electron 
energy. In classical analysis, TSOi (E) has a binary value. Above barrier energy it is equal to 
land below that energy it is equal to 0, meaning that the barrier is considered as a solid 
wall. In quantum analysis, TSOi (E) has a continuous range of values ab ove and below the 
barrier energy, meaning that there is always a possibility of passing or tunneling. Classical 
formula of (5-27) can be solved analytically, but quantum version is solved numerically and 
with iterations. [29]. 
Scattering centers are considered just like source and drain as sources of carrier. But 
they only change the state of carriers and not the number of them. The electron density 
function containing scattering effects is [29]: 
-+00 
n = LnOi L fdE[:3 _1/ 2 (,u j - E)Dji (E ,x)] 
1 } 
(5-28) 
- 00 
i represents subbands and j represent sources of scattering and carrier. Dji is local 
density and ,uj is Fermi potential [29]. 
Current in each source of scattering or carrier 1 ej is [29]: 
-+00 
l ej ='7 l 0i r fTjki (E)[:3 -1/ 2 (,u j - E) - :3_1/ 2 (,uk - E)]dE (5-29) 
-00 
i represents subbands and k represents sources of current. Tjki is transmission from k 
to j in subband i [29]. 
124 
5.2 Algorithm establishment 
NanoMOS has been created with a principally quantum physical and mathematical 
concem, rather than a coding approach. That is because the challenge in this quite new 
aspect of electronic device design is still finding solutions for physical modeling and 
solving complex models, Coding and optimization come afterwards. MATLAB has been 
chosen as the programming language for this tool in order to be able to benefit from 
advanced functions and routines that exist in this language [36]. 
NanoMOS has almost 4500 line of code distributed in different functions that are 
called according to the type of simulation chosen by user. Here These functions and 
routines are introduced briefly and then we will have a more detailed look at their position 
and functionality in the whole algorithm [36]. 
[36] divides the routines of NanoMOS to two group ofbig and small routines. This 
division is made in regard to the algorithmic importance of the routines. Big routines 
according to [36] are [28]: 
poisson: it solves Poisson equation. 
charge ... routines: they calculate charge density for different transport models. 
CUITent... routines: they calculate CUITent densities for different transport models. 
et: it handles energy transport model. 
cUITent_mat: it calculates the Fermi level for scattering transport model. 
schred: it solves Schrôdinger equation in transverse direction. 
saveoutput: it saves the data and plots them. 
125 
Main: it manages the flow of algorithm and caUs other routines. 
According to [36], small routines are [28]: 
nmos _ in: this is the function to caU in MA TLAB environment. It sets sorne input data 
and caUs main and sorne other routines. 
doping: it generates 2D doping profiles. 
fermi : It computes Fermi integrals. 
integral: it calculates classical charge density for classical quantum model. 
fprime: it calculates Fermi-Dirac integrals for Poisson routine. 
dummyand anti_dummy: they convert charge to quasi-Fermi level and vice versa for 
non-linear Poisson solver. 
dummy ~rime: it differentiates dummy functions. 
Now let's foUow the flow of algorithm as NanoMOS starts to work and performs a 
simulation. In this way aU of the functions will be introduced in more details and the way 
they work and relate to each other will be discussed. In this process we mainly foUow the 
calculation algorithm and neglect IIO coding. 
As it was mentioned, this tool gets started by typing "nmos in" in MA TLAB. 
nmos_in.m is the run file and input deck of NanoMOS. AU the input variables are set here 
and then the necessary functions are caUed. After defining sorne global variables, nmos_in 
caUs phys _ constants.m which simply sets sorne basic physical variables like 6' , permittivity 
constant and q the elementary charge [28]. 
126 
After that, nmos_in includes input variables discussed in last sub-chapter of chapter 
4.The first set of input is transport variables, including transport model, semiconductor 
material, and mobility model. Silicon with different lattice orientation, Germanium and 
sorne III-V materials are among material choices. There are Il choices of material in total. 
Then bias and device geometry variables, as discussed in sub-chapter 4-3, are set [28]. 
Having this input data,nmos_in calIs emass_cal and passes the material chosen by the 
user as input to this function. emas _cal assigns uses a simple MA TLAB switch function to 
choose between different sets of material physical data according to its input, material [28] . 
Then nmos_in continues to set mobility, grid, quantum parameters, solution and 
optional input variables. Grid spacing in x and y directions is set to 0.3 and 0.1 nm as 
default. It is interesting to pay attention to the fact that grid is more finely defined in y 
direction. This helps model more significant potential gradient in y direction. It is also 
noticeable that criterion_outer and criterion_inner (two parameters defining the 
convergence limits of inner and outer loops) are set at 1 e-3 and 1 e-6 [28]. 
Now nmos_in has defined aIl the input and can calI main.m. main.m does aIl the 
computation and returns the desired parameters. N mos _ in times the execution of main.m to 
show the computation time of the simulation at the end. When main returned its output 
variables, nmos_in will save the output in output. mat and then calI saveoutput.m. 
saveoutput.m plots the data as desired by user and does the post processing part [28]. 
Now we look at main.m and folIow its different branches. main.m sets a variable named 
"progress" to keep track of progress of the algorithm and show it to user as it finishes each 
part. Then it starts "initializing". During this process it first makes sure that criterion _outer 
is not too little for sorne transport models. Then it calculates total number ofbias points and 
127 
Then it creates two vectors containing gate and drain voltages for aIl the bias points 
specified by user. Then it reaITanges grid so that there is an integer number of meshes in 
each region (source, channel, drain) of device. Then it assigns sorne parameters that will be 
used in quantum equations (doping densities ... ). Finally it calls device_geo.m to draw the 
plan of the device with grid on it. This way main.m finishes its initial preparations and 
enters to the process of calculating sorne initial parameters [28]. 
First it allocates memory to sorne matrix parameters that will hold the key physical 
values of the device. These parameters will be used in the following parts and will pass 
sorne data to other functions. Sorne of them, like electron density and potential profile, are 
1 D vectors and have a value for every grid point. Others like subband energies and electron 
density for each subband are 3D matrices with values for every valley in every subband in 
every grid point along x direction. CUITent is predicted to have one value for every bias 
point [28]. 
Then it calls doping.m. It was mentioned in sub-chapter 4-3 that Gaussian doping 
profiles are supposed for source and drain whose slopes are determined by user. doping.m 
generates the functions of doping profile according to the input chosen by user over the 
grid. In fact, doping.m assigns a doping value to every grid point. Then main.m calls 
fprime.m which calculates the derivative of doping and saves it in a matrix. This matrix 
will be used by Poisson solver [28]. 
From now on, main.m starts self-consistent solution of Poisson and Schrôdinger 
functions. First it calculates mobility in relation to doping profile. To do so, it calls 
mobility.m and passes mobility model number to it. According to the temperature, electric 
field, material and mobility mode l, mobility.m caIculates the mobility of channel. 
128 
NanoMOS uses MATLAB "sparse" function to compress big matrices and increase the 
speed of calculations. At the end of computations it retrieves the full matrices using "full" 
function and then sends them to nmos_in.m [28]. 
Then main.m makes its initial guess. To do this, chargenanomos.m is called. 
chargenanomos.m is the routine that calls charge and CUITent routines that solve transport 
models. For the initial guess, chargenanomos is called with transport model 1 and a fixed 
initial value for drain voltage. Input variables that are passed to chargenanomos by main.m 
are old electron density, old potential energy profile, old subband energies, old electron 
densities or subbands and if it should calculate charge or CUITent. Chargenanomos, on the 
other hand, returns new electron density, subband energies, new electron densities for 
subbands, CUITent density, CUITent in each subband and electron tempe rature for subbands 
[28]. 
When chargenanomos is called for initial guess, it caUs charge_init.m, which sets 
initial values of electron density and potential profile for grid points. No transport equation 
is used to this point [28]. 
When main.m receives initial guess for charge density and potential profile, it calls 
poisson.m. poisson.m gets doping density, electron density and old potential profile vectors 
and produce new potential profile for grid points in a vector. poisson.m solves Poisson 
equation in 2D using Newton method. To do so it uses fermi .m function that takes Fermi 
integrals of desired order. poisson.m has a loop that satisfies a convergence parameter, 
criterion jnner, that is compared with the change in potential profile in each iteration[28]. 
129 
Then main.m enters what is called "CuITent calculation loop". But actually it solves 
transport model and Poisson equation self consistently and calculates all the desired energy 
profiles and charge densities in each bias point. It also calculates 1-V characteristics [28]. 
Here, main.m enters a "for loop" on gate voltage. Number of increments depends on 
number of gate voltage steps chosen by user in input document. Here only a vector is 
created that contains gate bias voltages and immediately the second loop starts for drain 
voltage. Its increment number depends on number of drain steps chosen by user. Here too a 
vector is created that contains drain bias point voltages. Inside these two loops or in other 
words for each bias point, there is a while loop to continue a self-consistent solution of 
Poisson and Schrôdinger until a convergence factor, criterion _outer is satisfied [28]. 
Inside this while loop, three main actions are taken: first chargenanomos.m is called to 
calculate charge density parameters according the transport model. As explained above, 
chargenanomos does it by calling one of charge ... m functions. One of these functions exists 
for each transport model. Bellow we will review their main functionalities and differences. 
When chargenanomos returns new charge densities, main.m calls poisson.m which 
computes new potential profile. Then the maximum difference between old and new 
potential profile vectors (happening at any grid point) is compared with criterion _outer. If it 
is bigger than criterion _outer, the loop continues until the responses of two equations 
converge to each other and the change in potential profilein one iterationbecomes very 
small. When this is achieved, while loop is ended and convergence data is saved [28]. 
With the final potential profile at hand, main.m caUs chargenanomos one last time to 
get the final charge density parameters. Then chargenanomos.m is called again to calculate 
CUITent. To do this, it calls one of current...m functions according to transport model. 
130 
Bellow these functions are explained. Having calculated CUITent for the present bias point, 
aIl of the ca1culated data, related to the present bias point is saved in separate matrices. 
Then the inner for loop is incremented and the whole process of self-consistent solution 
repeats for the next bias loop. When aIl of bias points have been treated, main.m finishes. 
The variables that main returns are: current density, CUITent density in subbands, electron 
temperature in subbands, electron density, electron density in subbands, conduction band 
profile, subband energies, Fermi energy level and convergence data [28]. 
Now we will have a quick look to charge and CUITent computing functions . There is a 
charge density computing function for each transport model. They need to solve the charge 
density problem in two dimensions. AH of them solve the Schrôdinger equation in 
confinement direction (gate to gate direction) and then apply their respective transport 
model in the channel direction and produce 2D results. The function that aU of them caU to 
solve Schrôdinger equation in ID is schred.m. schred.m takes potential energy profile as 
input and produces subband energies and wavefunctions in confinement direction. Current 
calculating functions act similar to charge computing functions (after they have received 
the results of schred.m). They just apply the formulas related to their respective transport 
model to the potential profile and other input data and calculate CUITent and carrier 
temperature [28] , [36]. 
Knowing the exact functionality of each routine and their role in total algorithm is 
necessary to consider the possibilities for parallelization of the algorithm. In the following 
section sorne of these possibilities wiU be presented and the parallelization algorithm 
implemented in this work will be discussed in more details. 
131 
5.3 Parallelization 
To parallelize an algorithm, one should look for parts of the algorithm that can be 
executed separately at the same time. These portions of the algorithm must be independent 
of each other, meaning that their development must happen without relying on each other's 
results. It is possible to establish connection between different processors, using Message 
Passing Interface (MPI) or other means, but it is a sensible operation and different 
parameters must be considered. Of course one needs to make sure that developed parallel 
algorithm will work correctly in all situations, meaning that different processing unites, 
communicating with each other, will send their respective messages at about the right time, 
otherwise improvement factor suffers. Aiso in HPC machines data transfer and 
communication is a much more time consuming process than data processing. So by 
dramatically increasing communication in an algorithm, advantages of parallelizing may be 
10st. Choosing the best parallel algorithm in this sense depends partially on choosing 
routines to be parallelized that need the least communication and also on distributing the 
job between as many parallel processors a possible, without letting data transfer time 
overcome the advantage of greater distribution. 
Here, l will try to identify and discuss different possible strategies of parallelization 
ofNanoMOS code. At the end the chosen strategy is explained in more details. 
NanoMOS, in brief, is a code that performs a series of computations and equation 
solving algorithms on a grid of space nodes for each bias point (and in sorne routines each 
subband and each valley). From this structure sorne parallelization strategies emerge 
instantly. First of all it seems promising to try to distribute the computation over grid nodes 
or bias points. Then, the computation algorithm for each grid node in each bias point can be 
132 
examined. Possibly it offers sorne parallelization opportunities too. In aIl of these 
considerations, independence of computations in parallel tasks is the first question to be 
asked [28]. 
Parallelization over bias points is a very straightforward and at the same time heavy 
solution. It is straightforward, because it has a lot of algorithmic clarity. In NanoMOS the 
main computation loop (self-consistent solution of Poisson and transport equations) is 
repeated for each bias point. It is heavy, because in this strategy, aIl of the computing 
routines are engaged in parallelization. Each one of them must be examined for 
compatibility with parallel algorithm, their variables must be passed and retumed in order 
and correctly and finally more errors must be dealt with wh en a large number of routines 
are involved in sorne code modifications. The computations related to bias points are 
independent from each other. They only need sorne common initializations that can be done 
outside parallel part of the code [28]. 
For parallelization over grid nodes sorne initial decisions need to be made. In this 
approach, modifications must be made in each routine that works on grid nodes. Each 
routine takes its input as large matrices, containing data for aIl of the nodes, performs its 
computation on the matrices and retums its variables as matrices too. So parallelization has 
to be done inside the routines, separately from parallelization in other routines. This 
increases the overhead computation to sorne extent. There are many routines working with 
bias points. Computations of each no de in these routine is generally independent of 
computations of other nodes. To choose sorne of these routines as candidates for 
parallelization, they should be checked to see if a large part of their computation is done 
inside loops with grid nodes as index or not. For example fprime.m and schred.m functions 
133 
have long for loops (for loops with a lot of computation lines) over grid nodes, poisson.m 
and charge ... m and current...m functions have short for loops (for loops with a few 
computation lines) and doping.m falls somewhere in between. Now if the coding language 
or parallel function requires a setup of parallel task with a large overhead each time a loop 
is being distributed, parallelizing short for loops may not be beneficial. But fprime.m is not 
part of self-consisting lop, so parallelizing it will not decrease the computation of the whole 
code very much. It should also be noted that the amount of computation for each grid node 
in any of the above applications is very little and not enough to be sent to one 
computational unite. Aiso the number of nodes is much bigger than most cluster sizes. That 
is why it seems necessary to divide the whole grid nodes into a number of groups and send 
aU the grid nodes in one group to one computing unit in cluster [28]. 
As for parallelization over valleys and subbands, charge ... m and current...m 
functions must be considered that have quite long for loops over these two parameters. The 
important point to consider here is that these functions are transport equation solvers. They 
are part of self-consistent solution and are coupled with Poisson solver. If only subbands 
and valleys are taken as parallelization parameters, Poisson equation and parts of transport 
equation solution (schred.m) are still being computed completely serially. The most 
compelling solution seems to be parallelizing over grid nodes in functions in which it is 
beneficial and paraUelizing over subbands and/or valleys in transport model solvers [28]. 
By looking at computation algorithm, of course, any solution must include self-
consistent loop. But the two main computational blocks inside the loop (Poisson and 
Schr6dinger solvers) are dependent on each other's final results, so they cannot be 
parallelized. If we look inside these two blocks, we would see that poisson.m does not offer 
134 
any algorithmic possibilities, but chargenanomos.m, depending on transport model, may 
offer sorne. For example, inside charge_dd.m there are sorne parallelizable parts of 
algorithm. Inside the "for loop" for valleys, the loop over subbands and the rest of 
computations seem to be independent and can be parallelized. This, by itself, is not a 
significant improvement, However it can be part of an extensive parallelization strategy 
[28]. 
Now let's make sorne conclusions about different strategies. When there are more 
than one bias point, parallelization over bias points is very efficient. Because in this 
strategy, parallel section includes almost all of computational load and there is not heavy 
communication overhead. Each parallel unit communicates with central node only at start 
and the end of its computation. Communication includes quite large matrices, but it is the 
same for other strategies too. This strategy needs dealing with a lot of functions and 
variables. It is more difficult to make sure that the whole algorithm of self-consistent 
solution can be run for all of the bias points simultaneously. Different parallel threads must 
be able to call the same functions at the same time and use common sets of data and 
variables. Wh en the goal is to parallel the computation for just one bias point, different 
solutions should be employed at the same time, as explained above. To include as much 
computation load as possible in the parallel algorithm, each function has to be parallelized 
over the variable that works best for it. Doing this in each function is simpler than 
parallelization for the whole algorithm, as in the case for bias points, but on the other hand 
it involves various parallel operations [28]. 
Parallelization of an older version of NanoMOS code is explained in [37]. This 
work has been done using MPI and PYM parallel toolboxes created for MATLAB by 
135 
another researcher ([6]). This work differs in two major aspects from my project. First, it 
was performed on a detailed physical model that took a much longer time to be simulated. 
Second, it was implemented with different software and hardware tools. At the time that 
this work was done, MA TLAB was not directly parallelizable and they used sorne 
additional toolboxes, written for MATLAB Aiso the HPC facility used in this work is 
different from one used in current work [37]. 
In this project, NanoMOS2.5 is parallelized over bias points. A series of bias points 
with equal gate voltages and different drain voltages are used for simulations. "parfor" 
function of MATLAB is used to parallelize self-consistent loop of NanoMOS. Then the 
parallel code is run on a cluster of processors. The results are presented in the next chapter 
[28], [39]. 
As it will be explained in more details in the next section, "parfor" is used to 
parallelize the self-consistent solution of potential and charge density solvers for different 
bias points [28], [39]. 
Parallel coding in MATLAB is done using MATLAB Parallel Computing Toolbox, 
which includes sorne parallel functions and is being increasingly integrated with sorne other 
toolboxes of MA TLAB to be able to support parallel computing in data processing in 
different toolboxes. MA TLAB Distributed Computing Server (MDCS) is used to connect a 
cluster of computing nodes that are managed by a scheduler or a job manager. MDCS 
cornes with MATLAB job manager that can be installed to manage the cluster, but it can be 
linked to and work with third party schedulers too. Each computing node in cluster 
(worker) has a license of MDCS and a MATLAB session mns on it with any toolboxes 
136 
necessary for the computation. The main node (client) has the Parallel Computing Toolbox 
installed on it and communicates with the cluster throughjob manager [39]. 
On the Krylov system, there is a procedure for running MATLAB parallel codes. The 
user must save the main MA TLAB code containing the parallel functions and all of the 
MA TLAB files that are called by this main file on the hard disc on the remote computer 
(Krylov system). Krylov has its own license of MATLAB Parallel Computing Toolbox 
(MPCT) and MA TLAB Distributed Computing Server. The user only provides the code 
files generated using their own li censes. Krylov also has its own scheduler which is 
connected to the MDCS. MDCS receives parallel jobs from MPCT and communicates 
accordingly with the scheduler of the system to create the parallel platform needed for that 
certain parallel job to be distributed on. In this way, on Krylov, the user does not need to 
deal with creating the parallel platform themselves. For example when a parfor is used, the 
user only determines the number of workers to be created by the MDCS by choosing the 
loop parameter of parfor. MDCS and scheduler will take that number and create an array of 
computers with the same number of workers to mn the parfor loop and a client computer to 
fUll the main code. The communications between the workers and the client is also 
managed by MDCS and scheduler. In a parfor application, these communications are 
already implemented in the definition of parfor and automatically performed when the 
parfor is being executed [22], [39]. 
5.4 Implementation 
In this section, the steps that have been taken in implementation of the algorithm 
described in previous section are discussed. This section has a very practical nature. It has 
little theoretical content and is not meant to be taken as a reference on coding or algorithm 
137 
development, but instead it might provide a more guided experience for students following 
a similar path and trying to make parallel modifications in MA TLAB environment. 
When a "parfor" loop is used in a MA TLAB code, sorne basic mIes are checked by 
MATLAB before running the code. If a MA TLAB code, containing "parfor", is mn without 
opening the "matlabpool", MATLAB runs the code in seriaI and executes "parfor" loop as 
a for loop. Nevertheless, it checks sorne mIes that are specified to "parfor". These mIes are 
laid down to make sure that "parfor" is understandable by MA TLAB and that the loop body 
is distributable to different workers. One of these mIes is about classification of variables. 
MATLAB accepts four types of variables inside a "parfor" loop [39]. 
The first problem, as it apparently happens in many cases, appeared with variable 
classification. A matrix can be indexed inside a "parfor" loop by a loop counter, but any 
other first level index of that matrix has to be a constant, a non-Ioop counter variable or a 
colon. The goal of this mIe is that MATLAB needs to be able to distribute only necessary 
part of the variables to each worker at the time of distributing parallel tasks (sliced variable) 
[39]. 
Any matrix indexing that violates this mIe, create an error. In the case of incorrectly 
indexed matrices, the indexing has to be broken. First it has to be used only with one of the 
indexes, and then the other one. There are examples in documentation that explain this. In 
NanoMOS code this happened with variable "conv" at the end of main.m. This is not a 
clear case of "nested for loop", but apparently " 1 : 1ength (converge )" causes the same 
indexing problem [28], [39]. 
conv (l: length (converge ), ii_vg , ii_vd ) converge ; 
138 
[28] 
To solve this, "co nv erge" is saved as in an indexed matrix inside the "parfor" loop: 
myconverge (:,i i_vd) = converge ; 
[28] 
And then, outside the "parfor" loop, it is put into " conv" : 
for (cnv_count = l: Nd_step +l) 
myconverge (:, ii_vd); 
end 
[28] 
Another type of error was references to variables that were considered as c1eared 
variables. This happened when a variable was defined and assigned a value before the 
"parfor" loop and inside main.m or one of other functions. Then when the variable was 
used inside the "parfor" loop, it created an error [28]. 
No c1ear explanation of the reason for this error was found in documentation, but 1 
believe it can be explained with the concept ofworkspaces in MATLAB. "Workspace" is a 
set of variables related to a session of using MA TLAB that are kept in memory while that 
session continues. But different workspaces might exist at the same time. Variables used in 
each MATLAB function form a separate workspace, so normally these variables are not 
accessible from other workspaces [39]. 
main.m is a function called by nmos_in.m. main.m, in tum, calls sorne functions and 
passes sorne variables to them. When "parfor" loop is started, different workers are created 
by MA TLAB which work with variables that might have been assigned sorne values 
139 
before, inside main.m. The problem is that the definition of these variables or their value 
assignment might be before the start of "parfor" loop. In this case, workers might find these 
variables not correctly defined. 
To solve these sorts of errors, it might be necessary to redefine sorne variables, or 
reassign desired values to sorne variables inside the "parfor" loop too. In doing this, we 
must pay enough attention not to min algorithm of calculations. 
Another important type of error that might happen in the process of modifying the code 
is that functions that are caUed by main.m do not retum their variables correctly. MA TLAB 
documentations about "parfor" do not discuss calling functions from inside a "parfor" very 
thoroughly. There is actuaUy little notes on caUing nested functions, but no clear cautions 
about calling functions. This, normally, wou Id mean that there should be no problems with 
caUing functions. But actuaUy sorne errors are created that say the retumed variables are 
not assigned. This might be due to the fact that different workers are caUing the same 
functions and working with the same variables at the same time. NaturaUy, function 
executions and function output variables should be separated on each worker, but when 
there are function caUs inside the loop, there seems to be sorne sort of separation problem. 
It seems to the writer that such complications are not very hard to imagine when a very 
high levellanguage is adapting paraUe1 computing. In any case, a solution to this problem 
can be replacing a function caU by actual code of that function inside the loop [39]. 
This solution has been employed to solve this problem in a very simple and effective 
way. Chargenanomos.m and poisson.m and function caUs inside these functions have been 
replaced. This solution necessitates sorne considerations that will be discussed below. 
140 
When code lines have replaced functions, function workspaces mlX. It means that 
variables that were being created and used in different workspaces are now being seen by 
the worker in one workspace. This makes a second round of variable checking necessary. 
This time it should be checked if variable name similarities exist between functions and if 
supposedly separate uses of variables in functions are being invaded after code 
replacements. These sorts of errors and algorithm malfunctions, when found, can be easily 
corrected by renaming variables. 
Sometimes there are parts of code that are not compatible with a parallei 
implementation. These parts of algorithm have to be rewritten or eliminated. In main.m 
there is a variable caUed "progress" that keeps the percentage of completion of the whole 
simulation. "progress" is incremented after each bias loop is fini shed and then it is printed 
for the user to see the progress of simulation. When a loop is distributed between workers, 
the linear concept of loop progress does not exist anymore. The progress is happening 
inside the workers and the client has no idea of the progress of each worker until it receives 
the results from workers. At the beginning of the loops over bias points, there are a few 
lines that check if aU bias points have been treated and increments "progress" and prints it. 
Since this part is not compatible with parallelization and it is not a crucial part og the 
algorithm (it has no computational value), it was eliminated from the parallel code [28]. 
The last point to be mentioned is the "route of execution". NanoMOS calls different 
routines according to the user' s choices of transport mode l, electron penetration limit, etc. 
More specificalIy, different Schrodinger solvers are called according to transport model and 
each one of them calI a different set of functions . Aiso in most of the routines, different 
routes are taken according to the state of sorne flags set by user. These flags are electron 
141 
penetration, mobility model, etc. Because the measurements on the improvement factor of 
parallel code was to be done using a fixed simulation setup (a fixed set of input data), all of 
the code replacements were done according to the route cOITesponding to a certain input 
set. To do simulations with different setups, the cOITesponding route must be carefully 
identified and necessary code replacements should be done along that route [28]. 
To give a practical reference of the function caUs in NanoMOS code, the "branching 
plan" of NanoMOS code is given here. The route cOITesponding to each transport model 
can be easily found using this list [28]. 
nmos _in: caUs main, phys _constants, emass _calI and saveoutput. 
Main: calls device _geo, fprime, doping, mobility, chargenanomos and poisson. 
chargenanomos: caUs anti_dummy and the foUowing functions for the mentioned 
purpose or according to mentioned transport model. The functions caUed by each one of 
charge ... and CUITent... functions are mentioned in parantheses. Chargenanomos: 
For initialization, caUs charge_init (does not caU any functions). 
For "drift diffusion transport" model, calls charge _ dd (schred, mobility, 
anti_dummy) and current_dd (mobility, anti_dummy). 
For "semi-baUistic transport" model, calls charge _ semib (schred, fermi) and 
sUITent_semib (fermi). 
For "quantum baUistic transport" model, caUs charge_qbte (schred, myquad, 
fermi, func_energy) and cUITent_qbte (fermi, func_energy) . 
For "energy transport" model, caUs charge_et (schred, et) and cUITent_et (et). 
142 
Poisson: calls dummy and dummy ~rime. 
emass call: calls emass which is written inside the same m file. 
func _ energu, dummy and dummy ~rime call fermi. 
phys_constants, [prime, device_geo, doping, mobility, anti_dummy, schred, et, fermi, 
myquad do not call any function. 
[28] 
Chapitre 6 - Validation 
6.1 Comparison of seriai and parallel code performances 
General goals of any parallelization are making the computation feasible (mostly 
when very large data sets are involved) and making the computation faster (both when very 
large data sets are involved and when the computation is very time consuming). In this 
project, shorter computation time is the main goal. But speed-up factor has its own limits. 
Amdahl ' s law (6.1) explains how non-parallelizable portion of an algorithm limits the 
speed-up factor [20] , [40] . 
S=_I_ 
I-P 
(6.1) 
S is the speed-up factor in relation to seriaI algorithm and P is the parallelizable 
percentage of algorithm. Amdahl ' s law ignores communication lost time and gives a 
maximum limit of S for a large number of processors. Gustafson' s law inc1udes a finite 
number of processors in calculation of maximum S [41]: 
Sep) = P- a(P-l) (6.2) 
S is speed-up and P is the number of processors and a is the part of the algorithm 
that is not parallelizable [41] . 
As it was explained in previous chapters, in a simulation with multiple bias points, 
in a one to one comparison, parallelization of NanoMOS over bias points is the most 
144 
inclusive strategy, in the sense that it leaves out the shortest "critical path" (the longest 
chain of dependent calculation) out of parallelization process and do es not add anything to 
it, While other strategies include seriaI execution of a larger part of algorithrn [20], [28]. 
Parallei code was fUll on 2, 4, 8, 16 and 28 workers on Krylov cluster ofCLUMEQ. 
Currently, Krylov's limit for the number of workers is 32. But it is not possible to run the 
code using aIl of the accessible units as workers, because at least one of the processing 
units has to take the role of client and there might be other limitations too. Bias points are 
distributed with drain voltage following a simple step function over the range of 0 to 1.62 V 
with 0.1 V steps. In the case that drain voltage exceeds 1.62 V, The step size is reduced (to 
0.06 V for 28 workers). It is worth mentioning that computation time in each run has a 
tolerance. Usually in sequential runs the tolerance is smaIl, but over a longer period oftime 
this to1erance might grow, which makes it difficult to have an exact reading of the 
computation time for different simulations. To minimize the effect of tolerance, each 
measurement was repeated a number of times. Then unacceptable results were e1iminated 
and an average of acceptable results was used as the computation time for each number of 
workers [22] , [39]. 
In Figure 6-1, speed up results (execution times of seriaI code normalized for 
parallei code execution time) are presented for different numbers of workers: 
20 --------------.-. 
18 
16 
14 
12 
10 
8 
6 
4 
2 
o 
2 workers 4 workers 8 workers 16 workers 
. speed-up 
Figure 6-1 speed-up for different numbers ofworkers 
145 
28 workers 
It is weIl expected to see a dramatic decrease in execution time as number of 
processors increase. By increasing the number of workers speed up will increase too. This 
algorithm, as discussed before, do es not saturate very soon as the numbers grow, because 
there is not extensive communication involved. 
6.2 Validation 
Quantitative validation of this work is fairly simple, because the parallel algorithm 
does not affect quantitative aspects of the algorithm. In other words, computations are 
performed in the same way in series and in parallel case. To verify this claim, convergence 
data (discussed in previous chapters) of simulations were compared. In Table 6-1 logs of 
convergence data for two parallel simulations with two and four workers (consequently 
with two and four bias points) are presented. 
146 
Table6-1 Convergence data for simulations with two and four workers 
Simulation 1 Simulation2 
Bias point 1 Bias point 2 Bias point 1 Bias point 2 Bias point 3 Bias point 4 
(0.0 V) (0.1 V) (0.0 V) (0.1 V) (0.2 V) (0.3 V) 
1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 
0.0222 0.0222 0.0222 0.0222 0.0329 0.1107 
0.0421 0.0392 0.0421 0.0392 0.0381 0.0377 
0.0375 0.0339 0.0375 0.0339 0.0327 0.0320 
0.0261 0.0225 0.0261 0.0225 0.0214 0.0207 
0.0153 0.0134 0.0153 0.0134 0.0130 0.0128 
0.0101 0.0098 0.0101 0.0098 0.0096 0.0095 
0.0076 0.0075 0.0076 0.0075 0.0074 0.0074 
0.0059 0.0059 0.0059 0.0059 0.0058 0.0060 
0.0046 0.0045 0.0046 0.0045 0.0045 0.0046 
0.0035 0.0035 0.0035 0.0035 0.0035 0.0035 
147 
0.0027 0.0026 0.0027 0.0026 0.0026 0.0026 
0.0020 0.0020 0.0020 0.0020 0.0020 0.0019 
0.0015 0.0015 0.0015 0.0015 0.0015 0.0014 
0.0011 0.0011 0.0011 0.0011 0.0011 0.0011 
0.0008 0.0008 0.0008 0.0008 0.0008 0.0008 
Two colurnns of the first simulation are in order equal to the first two colurnns of 
the second simulation. This c1early shows that two simulations follow the same 
computational sequence when the input data (here bias points) are equal. 
6.3 Parametric analysis 
Communication is an important factor in any parallelization strategy. As explained 
in the discussion about grid distribution strategy, it is not always beneficial to distribute a 
job among as many processors as possible. With a large number of processors, and little 
computation for each one to do, communication between the main node and computing 
nodes (to send data and to initialize the node) plus communication between nodes (as part 
of computing algorithm) might even out the time saved by distributing the computation. 
Depending on the algorithm and amount of communication there might be a maximum 
number of computing nodes in which speed-up is maximum and after that, by increasing 
the number of computing nodes, communication overcome the effect of parallelization. If 
limitations of resources and economic efficiency are considered too, the maximum number 
148 
of computing nodes might be ev en smaller that the number determined by the algorithm. In 
this work, because there is no communication between workers and there is little 
communication between client and workers, there seems to be no limit for speed-up. 
To see the effect of the number of workers on efficiency of parallel code, the speed-
up for different numbers of workers is presented along with Gustafson's law's speed-up 
prediction in Figure 6-2. Taking Gustafson's law's speed-up as the theoretical maximum 
value of S that is decreased to a practical value because of communication lost time and 
other delays, it is expected that Figure 6-2 gives us a measure of the impact of number of 
workers on efficiency of each experiment. Non-parallelizable portion of the algorithm has 
been estimated by measuring the computational time of "initializing" part of NanoMOS in 
relation to the total computational time [28], [41]. 
149 
--_._------------_._. __ ..... _ .. _ ... _-_._----_._ .. _-_.--------_ .. _-----_._._---_._--------, 
30 
25 
20 
15 
10 
5 
0 
2 workers 4 workers 8 workers 16 workers 28 workers 
-+-measured speed-up _ Gustafson's law's predicted speed-up 
Figure 6-2 Measured and theoretical speed-up 
To see the relationship between two lines more easily, measured speed-up IS 
normalized to Gustafson's law's prediction in. Figure 6-3. 
150 
,-------_._---_ .. _ .. _--
0.9 
0.8 
0.7 
0.6 
0.5 
0.4 
0.3 
0.2 
0.1 
0 
2 workers 4 workers 8 workers 16 workers 28 workers 
• measured speed-up normalized to Gustafson's law speed-up 
Figure 6-3 Measured speed-up is normalized to Gustafson's law's prediction 
It is difficult to observe a very obvious pattern in Figure 6-3. This is mostly due to 
the fact that parallel code fUll time is relatively small and so tolerances have a considerable 
effect on it. As a result calculated speed-up is strongly affected. Nevertheless a slight 
downward slope is visible in Figure 6-3 . In other words, as number ofprocessors increases, 
speed-up factor follows the Gustafson' s law less closely. This is because the difference in 
computational time for different bias points. Higher drain voltages impose longer 
computational time. In parallel execution of the code, the workers that finish their 
computation sooner, will return their results to client and then they will wait until workers 
that are computing bias points with higher drain voltage (hence with longer computation 
time) finish their job. It is clear that this means non-ideal use of resources . In other words, 
parallelization is not ideal as it is supposed in Gustafson's low. Computation times for drain 
voltages near to zero are quite close. Then when drain voltage becomes bigger (when there 
151 
are more bias points and more workers) computation time grows and the difference 
between workers run time lllcreases. That is why measured speed-up IS c10sest to 
theoretical prediction in the experiment with two workers. 
Chapitre 7 - Conclusion 
MOSFET scaling has continuously made higher processing power with lower price 
in smaller space possible for decades. Such a high yearly improvement factor, kept for such 
a long time, is by far an industrial exception. This dramatic and constant improvement has 
been realized to a great deal due to MOSFET scaling. And it is important to note that it has 
not happened without tireless efforts to go beyond technological limits of each time. In this 
sense, today's efforts in MOSFET scaling are a part of a long lasting trend. But in relation 
to MOSFET as a specific device, it seems that we are at a tuming point. 
As MOSFETs are scaled down to nanoscale, new physical effects become important 
in their perfonnance. Due to very small dimensions in today's devices, these effects have 
quantum nature. In sorne specific aspects of the device design, for example doping 
concentration and oxide layer thickness, we are already counting atoms; in other aspects 
and parts it may happen not in a very long time. 
In the long tenn, transistors might go under dramatic changes. Quantum transistors 
and molecular transistors are examples of possible candidates for post CMOS era. While 
there is still a long time until these technologies become available and MOSFET 
technology has still a considerable amount of improvement to make, it has been suggested 
that MOSFET scaling is starting to face sorne fundamental physical barri ers and MOSFET 
technology is reaching its scaling limit. 
153 
Whether or not we consider future transistors, quantum transistors for instance, a 
continuation of MOSFET concept, it seems crucial to ex tend CUITent microelectronic 
science to quantum and molecular scales. In fact, not only future MOSFETs will work 
under these new rules, but also future transistors will do so too. 
It is necessary to model these quantum phenomena effectively, if we want to be able 
to design MOSFETs for beyond 10 nm regimes. Exact simulation of these effects can be 
very time taking. Finding ways to carry out these computationally expensive simulations 
seems important for future designs. High performance computation (HPC) facilities create 
an opportunity for su ch simulations to be do ne in reasonable time and with desired 
preClSlOn. 
Another important scaling tool is novelty in geometry. New geometries decrease 
unwanted effects and open way for more scaling while keeping performance at an 
acceptable level. Double gate MOSFET has been suggested as an effective solution to short 
channel effects. 
NanoMOS, an open source tool developed by researchers at Purdue University, is a 
simulator of double gate MOSFETs in nanoscale. It considers ballistic transport in 
MOSFETs and solves Schrodinger and Poisson equations to find charge distribution and 
CUITent. 
In this work, NanoMOS code has been studied for parallelization possibilities and a 
parallelization algorithrn has been implemented on it. The result has been tested using HPC 
facilities of Compute Canada. Simulation results and time improvement factors have been 
presented and discussed for different numbers of processing nodes. This work, as a 
parallelization experience, can be used in future to implement parallel algorithrns on more 
154 
complex models, making it possible to perforrn parametric studies and model optimization 
and achieve a precise design strategy for nanoelectronic devices. 
References 
[1] Shakouri. A., "Nanoscale Thermal Transport and Microrefrigerators on a Chip", 
Proceedings of the IEEE, vol. 94, Issue 8, pp 1613-1638, Aug. 2006. 
[2] Yong Zhan, Goplen B. , Sapatnekar S.S., "Electrothermal analysis and optimization 
techniques for nanoscale integrated circuits", South Asia Pacific Conference on Design and 
Automation, 24-27 Jan. 2006. 
[3] Ble, S.V. , Skorek, A.W., "Parallel Approach to the Nanothermal Numerical 
Analysis", Canadian Conference on Electrical and Computer Engineering, CCECE '06, 
May 2006. 
[4] Liu, W., Asheghi, M., "Impact of phonon-boundary scattering and multilevel 
copper-dielectric interconnect system on self-heating of SOI transistors",Twenty First 
Annual IEEE Semiconductor Thermal Measurement and Management Symposium, 15-17 
March 2005. 
[5] Pop, E., Goodson, K.E., "Thermal phenomena in nanoscale transistors",The Ninth 
Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic 
Systems, ITHERM '04, 1-4 June 2004. 
[6] Issa, M., Skorek, A.W., "Nanoscale Thermal Analysis of Electronic Devices", 
Canadian Conference on Electrical and Computer Engineering, CCECE '06, May 2006. 
[7] Pop, E., Dutton, R., Goodson, K., "Detailed heat generation simulations via the 
Monte Carlo method", International Conference on Simulation of Semiconductor 
Processes and Devices, SISPAD 2003,3-5 Sept. 2003. 
[8] Rowlette, J., Pop, E., Sinha, S., Panzer, M., Goodson, K., "Thermal phenomena in 
deeply scaled MOSFETs",IEEE International Electron Devices Meeting, IEDM Technical 
Digest, 5-5 Dec. 2005. 
[9] Pop, E., Sinha, S., Goodson, K.E., "Heat Generation and Transport in Nanometer-
Scale Transistors",Proceedings of the IEEE, vol. 94, Issue 8, pp. 1587 - 1601, 
Aug. 2006. 
[10] Rowlette, J.A., Goodson, K.E., "Fully Coupled Nonequilibrium Electron-Phonon 
Transport in Nanometer-Scale Silicon FETs",IEEE Transactions on Electron Devices, 
Volume 55, Issue 1, Jan. 2008 Pages:220 - 232. 
[11] Skorek, A.W., Ble, S.V., Gryko-Nikitin, A., Nazarko, J., "Nanothermal 
Management in Nanoelectronics Systems", Canadian Conference on Electrical and 
Computer Engineering, CCECE 2007,22-26 April 2007. 
156 
[12] Rowlette, J., Pop, E., Sinha, S., Panzer, M., Goodson, K., 
"Thermal simulation techniques for nanoscale transistors", IEEE/ACM International 
Conference on Computer-Aided Design, ICCAD-2005, 6-10 Nov. 2005. 
[l3] Gang Chen, "Nanoscale heat transfer and nanostructured thermoelectrics", IEEE 
Transactions on Components and Packaging Technologies, vol. 29, Issue 2, pp. 238-246, 
June 2006. 
[14] Allec, N., Hassan, Z., Shang, L., Dick, R.P.; Yang, R.,"ThermaIScope: Multi-scale 
thermal analysis for nanometer-scale integrated circuits",IEEE/ACM International 
Conference on Computer-Aided Design, ICCAD 2008, 1O-l3 Nov. 2008. 
[15] Ble, S. V., "MODÉLISATION PARALLÈLE DES PHÉNOMÈNES 
NANOTHERMIQUES", Thesis in Electrical Engineering, University of Quebec in Trois-
Rivieres, 2009, p. 120. 
[16] GOLDHABER-GORDON D., MONTEMERLO M. S., LOVE J. c., OPITECK G. 
J., ELLENBOGEN 1. c., "Overview of Nanoelectronic Devices", Proceedings of the 
IEEE, vol. 85 ,Issue: 4, pp. 521-540, 1997. 
[17] ELLENBOGEN 1. C., "A Brief Overview of Nanoelectronic Devices",1998 
Government Microelectronics Applications Conference (GOMAC98,)Arlington, VA, 13-16 
March 1998. 
[18] Skorek, A.W., "High Performance Computing in nanoscale electrothermal modeling 
and simulations", 15th International Conference on Mixed Design of Integrated Circuits 
and Systems, MIXDES 2008,19-21 June 2008. 
[19] Flynn, M., Sorne Computer Organizations and Their Effectiveness, IEEE Trans. 
Comput., vol. C-21, pp. 948, 1972. 
[20] w~'W.wikipedia.org 
[21] www.computecanada.org 
[22] www.clumeq.ca 
[23] www. itrs.net 
[24] Deleonibus, S. Aid, M. de Salvo, B. Ernst, T. Faynot, O. Fedeli, 1.-M. Giffard, B. 
Le Royer, C. Poiroux, T. Robert, P. Sillon, N. Vinet, M., "New routs and divesifications for 
nanoelectronics by the end of the roadmap and beyond", Electron Devices and Solid-State 
Circuits, EDSSC 2008 IEEE International Conference on, 8-10 Dec 2008,pp. 1-6. 
[25] BehzadRazavi,Fundamentals ofmicroelectronics. USA, Wiley, 2008. 
157 
[26] L. Dreeskomfeld, 1. Hartwich, E. Landgraf, R. 1. Luyken, W. Rôsner, T. Schulz, M. 
Stiidele, D. Schmitt-Landsiedel, L. Risch, "Comparison of partially and fully depleted SOI 
transistors down to the sub 50nm gate length regime", Infineon Technologies-Corporate 
Research Otto-Hahn-Ring, Institute for Technical Electronics, TU Munich, Germany. 
[27] http://www.play-hookey.com/semiconductors/depletion mode mosfet.html 
[28] ZhibinRen; SebastienGoasguen; Akira Matsudaira; 
KurtisCantley; Mark Lundstrom; Xufeng Wang (2006), 
1 0254/nanohub-r 1305.11. (DOl: 1 0254/nanohub-r 1305.11). 
Shaikh S. Ahmed; 
"NanoMOS," 001: 
[29] ZhibinRen, "NANOSCALE MOSFETS: PHYSICS, SIMULATION AND 
DESIGN", thesis, Purdue University, Octobre 2001. 
[30] Xinnan Lin, ChuguangFeng, Shengdong Zhang. Wai-HungHo and iMansun Chan, 
"Double-Gate SOX MOSFET Fabrication from Bulk Silicon Wafer", SOI Conferenee, 
2001 IEEE International, Durango, CO, 2001, pp. 93-94. 
[31] Hongmei Wang, Mansun Chan, Singh Jagar, Vincent M. C. Poon, Ming 
Qin,Yangyuan Wang and Ping K. Ko, "Super Thin-Film Transistor with SOI CMOS 
Performance Formed by a Novel Grain Enhancement Method", Electron Deviees, IEEE 
Transactions on, vol. 47, Issue: 8, pp. 1580 - 1586,2000. 
[32] www.t1ipchip.com 
[33] www.solidworks.com 
[34] www.nanohub.org 
[35] ZhibinRen, with contributions by: Ramesh Venugopal, Jung-HoonRhew, Jing Guo, 
Dave Rumsey, Dr. SebastienGoasguen, Prof. SupriyoDatta and Prof. Mark S. Lundstrom, 
"NanoMOS 2.0 A 2D-Simulator for Double-gate MOSFETs", Manual, Purdue University, 
Decembre 2001. 
[36] SebastienGoasguen, "A guided tour of nanoMOS code and sorne tips on how to 
parallelize it", Electron Deviees at the NanolMolecular Scale, Summer School at uruc, 
May 21-22,2002. 
158 
[37] SebastienGoasguen, Ah. R. Butt, Kevin D. Colby and Mark S. Lundstrom, 
"Parallelization of the Nanoscale Device Simulator nanoMOS2.0Using a 100 Nodes Linux 
Cluster", Proceedings of the 2nd IEEE Conference on Nanotechnology, IEEE-NANO 
2002, November 2002, pp. 409-412. 
[38] http://atc.ugr.es/ javier-bin/mpitb eng 
[39] www.mathworks.com 
[40] Amdahl, G. (April 1967) "The validity of the single processor approach to achieving 
large-scale computing capabilities". In Proceedings of AFIPS Spring Joint Computer 
Conference, Atlantic City, N.J., AFIPS Press, pp. 483-85. 
[41] Gustafson John L., "Reevaluating Amdahl's Law",Communications of the ACM 
31(5),19898, pp.532-33. 
