22 research outputs found
BrainFrame: A node-level heterogeneous accelerator platform for neuron simulations
Objective: The advent of High-Performance Computing (HPC) in recent years has
led to its increasing use in the study of the brain through computational models. The
scale and complexity of such models are constantly increasing, leading to
challenging computational requirements. Even though modern HPC platforms can
often deal with such challenges, the vast diversity of the modeling field does
not permit a single (homogeneous) acceleration platform to effectively
address the complete array of modeling requirements. Approach: In this paper we
propose and build BrainFrame, a heterogeneous acceleration platform
incorporating three distinct acceleration technologies: a Dataflow Engine, a
Xeon Phi and a GP-GPU. The PyNN framework is also integrated into the platform.
As a challenging proof of concept, we analyze the performance of BrainFrame on
different instances of a state-of-the-art neuron model, modeling the Inferior-
Olivary Nucleus using a biophysically-meaningful, extended Hodgkin-Huxley
representation. The model instances take into account not only the neuronal-
network dimensions but also different network-connectivity circumstances that
can drastically change application workload characteristics. Main results: The
combination of the three HPC technologies demonstrated that BrainFrame is
better able to cope with the modeling diversity encountered. Our performance
analysis clearly shows that the model instance directly affects performance and
that all three technologies are required to cover all the model use cases.
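As a rough illustration of the per-neuron compute kernel that such platforms accelerate, the Python sketch below integrates the classic single-compartment Hodgkin-Huxley equations with textbook squid-axon constants. The extended, multi-compartment inferior-olive model used in the paper is considerably richer; this is only indicative of the arithmetic workload involved.

```python
# Minimal sketch: classic single-compartment Hodgkin-Huxley integration
# (textbook squid-axon constants, NOT the extended inferior-olive model).
import numpy as np

Cm, gNa, gK, gL = 1.0, 120.0, 36.0, 0.3          # uF/cm^2, mS/cm^2
ENa, EK, EL = 50.0, -77.0, -54.4                 # mV

def rates(V):
    am = 0.1 * (V + 40.0) / (1.0 - np.exp(-(V + 40.0) / 10.0))
    bm = 4.0 * np.exp(-(V + 65.0) / 18.0)
    ah = 0.07 * np.exp(-(V + 65.0) / 20.0)
    bh = 1.0 / (1.0 + np.exp(-(V + 35.0) / 10.0))
    an = 0.01 * (V + 55.0) / (1.0 - np.exp(-(V + 55.0) / 10.0))
    bn = 0.125 * np.exp(-(V + 65.0) / 80.0)
    return am, bm, ah, bh, an, bn

def hh_step(V, m, h, n, I_ext, dt=0.01):          # dt in ms
    am, bm, ah, bh, an, bn = rates(V)
    m += dt * (am * (1.0 - m) - bm * m)
    h += dt * (ah * (1.0 - h) - bh * h)
    n += dt * (an * (1.0 - n) - bn * n)
    I_ion = gNa * m**3 * h * (V - ENa) + gK * n**4 * (V - EK) + gL * (V - EL)
    V += dt * (I_ext - I_ion) / Cm
    return V, m, h, n

V, m, h, n = -65.0, 0.05, 0.6, 0.32
for _ in range(50_000):                           # 500 ms at dt = 0.01 ms
    V, m, h, n = hh_step(V, m, h, n, I_ext=10.0)
print(f"membrane potential after 500 ms: {V:.1f} mV")
```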
Classification of Resilience Techniques Against Functional Errors at Higher Abstraction Layers of Digital Systems
Nanoscale technology nodes bring reliability concerns back to the center stage of digital system design. A systematic classification of approaches that increase system resilience in the presence of functional hardware (HW)-induced errors is presented, dealing with higher system abstractions, such as the (micro)architecture, the mapping, and platform software (SW). The field is surveyed in a systematic way based on non-overlapping categories, which add insight into the ongoing work by exposing similarities and differences. HW and SW solutions are discussed in a similar fashion so that interrelationships become apparent. The presented categories are illustrated by representative examples from the literature to highlight their properties. Moreover, it is demonstrated how hybrid schemes can be decomposed into their primitive components.
Modeling and Mitigation of Parametric Time-Dependent Variability in Digital Systems
Current and future semiconductor technology nodes bring about a variety of challenges that pertain to the reliability and dependability of digital integrated systems. Compounds, such as high-κ materials in the transistor gate stack, tend to intensify the time-zero and time-dependent variability of transistors. A case in point is phenomena like Bias Temperature Instability (BTI) and Random Telegraph Noise (RTN). The use of such modern materials is also coupled with an aggressive downscaling trend, which further amplifies matter discretization within the transistor's channel. These are two major manufacturing trends that give rise to time-dependent variability in integrated digital systems. It stands to reason that, as the materials become more variable, the error rates experienced at the circuit or even system level are also intensified. Coupled with increasing integrated-circuit functionality (namely, the "More than Moore" trend), it is reasonable to expect that future digital chips will exhibit reliability profiles that vary across the chip's lifetime.
The goal of the research presented in the current text is to study the above observations with regard to both the modeling and the mitigation of time-dependent variability. The target systems are, naturally, digital integrated circuits, such as processors and their memories. The current research contributes specific reductions to practice that aim to confirm existing techniques and develop novel insight into the modeling and mitigation of time-dependent variability. As such, the current text is broadly split into two major parts.
The modeling part starts with the reiteration of atomistic modeling concepts, which are very useful in the analysis of phenomena like BTI and RTN, especially for deca-nanometer transistors. As a result, the complexity of atomistic models for integrated circuit aging analysis is made abundantly clear. In order to alleviate the complexity of circuit reliability analysis across the system lifetime, the current research has formulated the concept of the Compact Digital Waveform (CDW). This format targets regions of circuit operation that are similar from a waveform point of view (e.g. similar frequency or duty cycle) and abstracts them to a single point. This enables striding over circuit lifetime intervals, while retaining key features of atomistic reliability models (e.g. workload dependency).
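A minimal Python sketch of the CDW idea follows, under an assumed data layout and assumed tolerances (the segment tuples, f_tol and d_tol below are illustrative and not the thesis's actual format): consecutive waveform stretches with similar frequency and duty cycle are merged into single representative points, so aging analysis can stride over them instead of replaying every cycle.

```python
# Minimal sketch of Compact Digital Waveform (CDW)-style compaction.
def compact_waveform(segments, f_tol=0.1, d_tol=0.05):
    """segments: list of (duration_s, frequency_Hz, duty_cycle) tuples."""
    compacted = []
    for dur, f, d in segments:
        if compacted:
            cdur, cf, cd = compacted[-1]
            if abs(f - cf) / max(cf, 1e-12) < f_tol and abs(d - cd) < d_tol:
                # same operating region: extend the representative point
                w = cdur / (cdur + dur)
                compacted[-1] = (cdur + dur,
                                 w * cf + (1 - w) * f,
                                 w * cd + (1 - w) * d)
                continue
        compacted.append((dur, f, d))
    return compacted

trace = [(1e-3, 1.00e9, 0.50), (2e-3, 1.02e9, 0.52),   # active phase
         (5e-3, 1.00e6, 0.10), (4e-3, 1.05e6, 0.12)]    # near-idle phase
print(compact_waveform(trace))   # -> two representative points
```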
Still on the modeling front, the current research additionally contributes towards exposing transistor variability information to the architecture level. The metric of choice in this case is the component failure probability (Pfail), which system designers typically require in order to provision their systems with appropriate disabling or correction mechanisms. In order to derive the Pfail, this text features the Most Probable Failure Point (MPFP) method, which is applied to the cases of BTI/RTN variability. Two modeling approaches are used for these aging phenomena and observations are drawn regarding the importance of accurately capturing the standard deviation of the threshold-voltage variability. A statistical reformulation of the MPFP concept is also presented for handling standard cells, in addition to memory components.
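As a rough, first-order sketch of an MPFP-style derivation of Pfail (with made-up sensitivities, standard deviations and timing slack, not values from the thesis), the failure probability can be approximated by locating the failure-region point closest to the origin in a standardized Gaussian space of threshold-voltage shifts:

```python
# Minimal first-order MPFP sketch (assumed, illustrative numbers).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

sigma_vth = np.array([30.0, 30.0])   # assumed std-dev of delta-Vth per device [mV]
sens      = np.array([2.0, 1.5])     # assumed delay sensitivity [ps/mV]
slack     = 120.0                    # assumed timing slack [ps]

def limit_state(u):
    # u lives in standardized Gaussian space; g(u) <= 0 marks failure.
    return slack - sens @ (u * sigma_vth)

# MPFP: the failure-region point closest to the origin in standardized space.
res = minimize(lambda u: u @ u, x0=np.array([2.0, 2.0]),
               constraints=[{"type": "ineq", "fun": lambda u: -limit_state(u)}])
beta = np.sqrt(res.fun)              # reliability index
pfail = norm.cdf(-beta)              # first-order failure-probability estimate
print(f"beta = {beta:.2f}, Pfail ~ {pfail:.2e}")
```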
The failure of system components has traditionally been triggering some sort of reaction from the system, at least as far as detectable errors are concerned. Academia and industry have been using the term Reliability, Availability and Serviceability (RAS) to refer to such techniques. Their invocation is strictly coupled to the rate of errors that appear at the circuit level and typically comes at a measurable performance cost.
On the mitigation side, the starting point of the current research is a simple, rollback-based RAS technique that aims to recover a system from transient errors. This concept has been implemented on a research-grade many-core platform, in the data plane of which errors are injected at a user-defined rate. The rollbacks make sure the running application is brought back to an earlier correct state, so that execution can continue. This fail/stop model, however granular, creates a measurable overhead in the timely execution of the running application. It is reasonable to explore whether a portion of the injected errors can be left uncorrected in order to reduce this performance overhead. The current work explores this trade-off between application correctness, performance and energy budget, in search of the optimal operating points.
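A toy Python model of this trade-off follows, with an assumed error rate, rollback cost and a simple Poisson correctness model; none of these numbers come from the thesis, they merely illustrate how correctness drops as fewer errors are rolled back.

```python
# Toy correctness-vs-overhead model for partial rollback coverage (assumed numbers).
import numpy as np

T_app = 100.0    # assumed error-free execution time [s]
T_rb  = 0.5      # assumed cost of one rollback [s]
rate  = 0.2      # assumed injected-error rate [errors/s]

def expected_runtime(coverage):
    # every corrected error adds one rollback penalty on top of the base runtime
    return T_app + coverage * rate * T_app * T_rb

def prob_correct(coverage):
    # the run ends correctly only if no uncorrected error occurred (Poisson model)
    return np.exp(-(1.0 - coverage) * rate * T_app)

for c in (1.0, 0.99, 0.9, 0.5):
    print(f"coverage {c:4.2f}: runtime {expected_runtime(c):6.1f} s, "
          f"P(correct run) {prob_correct(c):.3f}")
```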
The final milestone of the current work is a generally applicable solution to the problem of dependable performance, in view of temporal RAS overheads such as the one illustrated above. To that end, the issue of dependable performance is formulated from scratch and a control-theoretic solution is proposed. More specifically, a PID controller is used to manipulate the frequency of a processor within which RAS schemes are invoked at the price of additional clock cycles. The concept is verified within the current research, both with simple simulations and on a real processing platform.
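A minimal sketch of such a control loop is given below, with illustrative gains and an assumed step change in RAS overhead; the actual controller design and actuation interface in the thesis may differ.

```python
# Minimal sketch: PID loop nudging clock frequency so that effective throughput
# stays on its set-point while RAS routines steal a varying fraction of cycles.
class PID:
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_err = 0.0, 0.0

    def step(self, error):
        self.integral += error * self.dt
        deriv = (error - self.prev_err) / self.dt
        self.prev_err = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv

target_ips = 1.0e9                 # desired instructions per second (set-point)
freq, ipc = 1.0e9, 1.0             # clock frequency [Hz], workload IPC (assumed)
pid = PID(kp=0.6, ki=0.3, kd=0.0, dt=1.0)

for epoch in range(20):
    ras_overhead = 0.1 if epoch < 10 else 0.3   # fraction of cycles lost to RAS
    achieved_ips = freq * ipc * (1.0 - ras_overhead)
    error = (target_ips - achieved_ips) / target_ips
    freq *= (1.0 + pid.step(error))             # actuate via frequency scaling
    print(f"epoch {epoch:2d}: f = {freq/1e9:.3f} GHz, IPS = {achieved_ips/1e9:.3f} G")
```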
Enhanced Cellular Materials through Multiscale, Variable-Section Inner Designs: Mechanical Attributes and Neural Network Modeling
In the current work, the mechanical response of multiscale cellular materials with hollow variable-section inner elements is analyzed, combining experimental, numerical and machine learning techniques. At first, the effect of multiscale designs on the macroscale material attributes is quantified as a function of their inner structure. To that end, analytical, closed-form expressions for the axial and bending inner element-scale stiffness are elaborated. The multiscale metamaterial performance is numerically probed for variable-section, multiscale honeycomb, square and re-entrant star-shaped lattice architectures. It is observed that a substantial normal, bulk and shear specific stiffness increase can be achieved, which differs depending on the upper-scale lattice pattern. Subsequently, extended mechanical datasets are created for the training of machine learning models of the metamaterial performance. Thereupon, neural network (NN) architectures and modeling parameters that can robustly capture the multiscale material response are identified. It is demonstrated that rather low-numerical-cost NN models can assess the complete set of elastic properties with substantial accuracy, providing a direct link between the underlying design parameters and the macroscale metamaterial performance. Moreover, inverse, multi-objective engineering tasks become feasible. It is shown that a unified machine-learning-based representation allows for the inverse identification of the inner multiscale structural topology and base material parameters that optimally meet multiple macroscale performance objectives, by coupling the NN metamaterial models with genetic algorithm-based optimization schemes.
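A small sketch of the forward-model-plus-inverse-design idea follows, using an entirely synthetic stiffness surrogate in place of the paper's finite-element dataset and a simple evolutionary loop instead of a full genetic-algorithm framework; the design parameters, functional form and target value are all assumptions made for illustration.

```python
# Minimal sketch: MLP forward model of stiffness + evolutionary inverse design
# (synthetic data and assumed functional form, not the paper's dataset).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# hypothetical design parameters: wall-thickness ratio t and section taper a
X = rng.uniform(0.05, 0.5, size=(2000, 2))
# placeholder analytical surrogate standing in for the FE-generated dataset
y = 12.0 * X[:, 0] ** 3 * (1.0 + 0.8 * X[:, 1])

model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=3000, random_state=0)
model.fit(X, y)

target = 0.4                                     # desired normalized stiffness
pop = rng.uniform(0.05, 0.5, size=(64, 2))
for _ in range(50):                              # (mu + lambda)-style search
    fitness = np.abs(model.predict(pop) - target)
    parents = pop[np.argsort(fitness)[:16]]
    children = parents[rng.integers(0, 16, size=48)] + rng.normal(0, 0.02, (48, 2))
    pop = np.clip(np.vstack([parents, children]), 0.05, 0.5)

best = pop[np.argmin(np.abs(model.predict(pop) - target))]
pred = model.predict(best.reshape(1, -1))[0]
print("best design (t, a):", best, "predicted stiffness:", round(pred, 3))
```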
Software simulation of temperature distribution and material aging in integrated circuits
Advances in the modern semiconductor industry give rise to various reliability aspects of electronic design. On the one hand, device behaviour is dominated by stochastic phenomena that may vary even between devices of the same technology. Timely examples of such mechanisms are Bias Temperature Instability (BTI) and Random Telegraph Noise (RTN). On the other hand, the inevitable increase of functional-block density in modern integrated circuits makes the temperature distribution a serious design constraint. As a result, fast yet accurate thermal profiling is of vital importance both at design time and at runtime.
The first part of the current work deals with the time- and workload-dependent device variability of modern downscaled technologies. BTI and RTN are incorporated in simulations of larger circuits, the parametric reliability of which is of major importance. There is a fundamental differentiation from the state of the art, since the atomistic approach towards BTI and RTN allows the observation of detailed workload dependency in the simulation results. This concept is materialized in a fully functional simulation framework of a 32-bit SRAM partition. Based on a real memory architecture, the workload of such a structure was extracted from a realistic application and applied to the circuit under test. Performance metrics of the circuit were monitored during the simulation of different workloads. Each different workload, or RunTime Situation (RTS), is characterized by a cumulative delay metric and a leakage energy value. This concept enables the clustering of RTSs into workload scenarios.
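A minimal sketch of the RTS-to-scenario clustering step follows, using synthetic (delay, leakage) fingerprints and plain k-means; the actual clustering criteria and metrics in the thesis may differ.

```python
# Minimal sketch: cluster RunTime Situations (RTSs) into workload scenarios
# (synthetic fingerprints, assumed units).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# hypothetical RTS fingerprints: (cumulative delay [ps], leakage energy [pJ])
rts = np.vstack([
    rng.normal([120.0, 3.0], [4.0, 0.2], size=(50, 2)),   # mostly-idle phases
    rng.normal([155.0, 5.5], [5.0, 0.3], size=(50, 2)),   # read-heavy phases
    rng.normal([180.0, 7.0], [6.0, 0.4], size=(50, 2)),   # write-heavy phases
])

scenarios = KMeans(n_clusters=3, n_init=10, random_state=0).fit(rts)
for k in range(3):
    members = rts[scenarios.labels_ == k]
    print(f"scenario {k}: {len(members)} RTSs, "
          f"mean delay {members[:, 0].mean():.1f} ps, "
          f"mean leakage {members[:, 1].mean():.2f} pJ")
```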
The second part of the current work deals with the numerical acceleration of the publicly available thermal simulator HotSpot-5.0. Extensive profiling of its source code revealed a CPU-intensive iterative method for the extraction of the transient solution. This method was replaced with a simplified equivalent that achieved the intended acceleration without imposing any accuracy degradation. The accelerated version of the tool was successfully incorporated into a broader tool that performs hierarchical thermal profiling of Multi-Processor System-on-Chip (MPSoC) floorplans. That way, the application spectrum of such thermal analysis tools is significantly broadened.
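For illustration, the following toy transient solve of a two-node thermal RC network (with made-up conductances, capacitances and power values) shows the kind of computation a HotSpot-style simulator advances at every time step; it is not the simplified method adopted in the work.

```python
# Toy transient thermal solve of a 2-node RC network (assumed values).
import numpy as np

C = np.array([0.02, 0.05])                 # thermal capacitance per node [J/K]
G = np.array([[0.8, -0.3],                 # thermal conductance matrix [W/K]
              [-0.3, 0.6]])
P = np.array([1.5, 0.4])                   # dissipated power per node [W]
T_amb, dt = 45.0, 1e-3                     # ambient [deg C], time step [s]

T = np.full(2, T_amb)
for step in range(2000):                   # 2 s of transient behaviour
    dTdt = (P - G @ (T - T_amb)) / C       # energy balance at each node
    T = T + dt * dTdt                      # explicit fixed-step update
print("temperatures after 2 s:", np.round(T, 2), "deg C")
```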
Failure probability of a FinFET-based SRAM cell utilizing the most probable failure point
Application requirements, along with the unceasing demand for an ever-higher scale of device integration, have driven technology towards an aggressive downscaling of transistor dimensions. This development is confronted with variability challenges, mainly the growing susceptibility to time-zero and time-dependent variations. To model such threats and estimate their impact on a system's operation, the reliability community has focused largely on Monte Carlo-based simulations and methodologies. When assessing yield and failure-probability metrics, an essential part of the process is to accurately capture the lower tail of a distribution. Nevertheless, the incapability of widely used Monte Carlo techniques to achieve such a task has been identified and, recently, state-of-the-art methodologies focusing on a Most Probable Failure Point (MPFP) approach have been presented. However, to strictly prove the correctness of such approaches and utilize them on a large scale, an examination of the concavity of the space under study is essential. To this end, we develop an MPFP methodology to estimate the failure probability of a FinFET-based SRAM cell, studying the concavity of the Static Noise Margin (SNM) while comparing the results against a Monte Carlo methodology.
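The lower-tail limitation that motivates MPFP methods is easy to reproduce with a toy, assumed-Gaussian SNM model; the mean, sigma and failure threshold below are illustrative, not values from the paper.

```python
# Toy illustration: plain Monte Carlo vs. the exact lower-tail probability
# of an assumed-Gaussian SNM distribution.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
mu_snm, sigma_snm = 180.0, 30.0        # assumed SNM mean / std-dev [mV]
fail_below = 30.0                      # cell "fails" if SNM drops below this

p_exact = norm.cdf((fail_below - mu_snm) / sigma_snm)   # ~2.9e-7

for n in (1_000, 100_000, 10_000_000):
    snm = rng.normal(mu_snm, sigma_snm, size=n)
    p_mc = np.mean(snm < fail_below)
    print(f"N = {n:>10,}: MC estimate = {p_mc:.2e}  (exact {p_exact:.2e})")
```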
Runtime interval optimization and dependable performance for application-level checkpointing
As aggressive integration paves the way for performance enhancement of many-core chips and technology nodes go below deca-nanometer dimensions, system-wide failure rates are becoming noticeable. Inevitably, system designers need to properly account for such failures. Checkpoint/Restart (C/R) can be deployed to prolong the dependable operation of such systems. However, it introduces additional overheads that lead to performance variability. We present a versatile dependability manager (DepMan) that orchestrates a many-core application-level C/R scheme, while being able to follow time-varying error rates. DepMan also contains a dedicated module that ensures on-the-fly performance dependability for the executing application. We evaluate the performance of our scheme using an error injection module both on the experimental Intel Single-Chip Cloud Computer (SCC) and on a commercial Intel i7 general-purpose computer. Runtime checkpoint-interval optimization adapts to a variety of failure rates without extra performance or energy costs. The inevitable timing overhead of C/R is reclaimed systematically with Dynamic Voltage and Frequency Scaling (DVFS), so that dependable application performance is ensured.
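A back-of-the-envelope sketch of interval adaptation and DVFS reclamation follows, using Young's first-order checkpoint-interval approximation and an assumed checkpoint cost; DepMan's actual optimization and actuation are more elaborate than this.

```python
# Sketch: Young-style checkpoint interval tracking a changing MTBF, plus the
# frequency bump DVFS would need to hide the resulting C/R overhead (assumed numbers).
import math

C = 2.0          # assumed cost of taking one checkpoint [s]
f_nom = 2.0e9    # nominal clock frequency [Hz]

def optimal_interval(mtbf):
    return math.sqrt(2.0 * C * mtbf)           # Young's first-order approximation

def overhead_fraction(interval, mtbf):
    # checkpointing cost plus expected re-execution after a failure, per interval
    return C / interval + interval / (2.0 * mtbf)

for mtbf in (3600.0, 600.0, 60.0):              # failure rate rising over time
    tau = optimal_interval(mtbf)
    ovh = overhead_fraction(tau, mtbf)
    f_needed = f_nom * (1.0 + ovh)              # reclaim the lost time with DVFS
    print(f"MTBF {mtbf:6.0f} s -> interval {tau:6.1f} s, "
          f"overhead {100*ovh:4.1f} %, frequency {f_needed/1e9:.2f} GHz")
```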