The reliability of an electronic device, i.e., whether it can function correctly over its designated lifetime in the field (such as 10 or 15 years), has become increasingly important in today's safety-critical applications such as automotive electronics. Traditionally, ageing analysis has been performed offline: a stress test is applied to accelerate the ageing process, and a model is then established to predict future ageing. This kind of offline method has the drawback that it cannot take into account the unique operating conditions and environment that a device experiences in the field. In this work, we present what is, to the best of our knowledge, the first cloud-based ageing monitoring system for Internet-of-Things (IoT) devices. It has three main advantages. First, one can check the ageing status of an IoT device remotely and continuously. Second, through data analysis in a cloud server, more accurate prediction can be achieved. Third, an alarm can be raised before an ageing hazard actually strikes, so that precautionary actions (such as online repair or a call-for-maintenance request) can be taken in advance to avoid a fatal system failure. A prototype system using test chips with built-in design-for-ageing-monitoring circuitry is demonstrated with measurement data collected through a cloud server.
I. INTRODUCTION
The aggressive scaling of integrated circuit technology brings numerous advantages, such as a smaller form factor, higher speed, and lower power. However, ensuring reliability over the long lifetime required by safety-critical applications (e.g., automotive electronics, biomedical electronics, and various IoT devices) is becoming more and more challenging [1], [2]. For example, an automotive IC is often required to operate in the field for more than 10 to 15 years under hostile conditions (e.g., −40 °C to 150 °C).
It is well known that reliability (in terms of the failure rate) is a function of time that follows a bathtub curve, divided into three stages: the infant mortality stage, the normal lifetime stage, and the final ageing stage. Ideally, the high infant mortality stage can be skipped by applying stress tests (with higher temperature and/or higher VDD) to eliminate weak devices with potential latent defects. The hope is that most shipped devices can then operate in a system during their normal lifetime stage with a reasonably low failure rate for a certain amount of time. After that, however, the ageing effect may start to take its toll and deteriorate a device's performance or even disrupt its functionality.
The associate editor coordinating the review of this manuscript and approving it for publication was Giovanni Merlino.
There are several different types of ageing mechanisms [3]-[6], e.g., Bias Temperature Instability (BTI), Electro-Migration (EM), Hot Carrier Injection (HCI), and Thin-Oxide Breakdown. When the ageing effect becomes serious, sudden functional failure can occur. As a result, a safety-critical electronic device should have a set of anti-ageing solutions covering the design stage, the manufacturing test stage, and the in-the-field stage, so as to meet the target reliability requirement.
Ageing prediction can be done statically by offline ageing analysis, as depicted in Fig. 1. At the cell level, the ageing phenomenon of a transistor or an interconnect under a specific process technology is first characterized by embedded ageing sensors (e.g., ring oscillators) that measure the Vth and/or performance degradation under certain temperature or VDD stress conditions (e.g., 195 °C for 6 months) [7]-[10]. This offline characterization process produces various fundamental cell-level ageing models. With the cell-level ageing models in place, a designer can then perform the ageing analysis [10]-[14], taking into account the Circuit Operation Statistics (COS), such as the signal probability, the current at each node, and the temperature profile. In some sense, COS information reflects how rigorously each transistor and interconnect has been exercised or stressed. A component-level ageing model can then be derived, describing the performance degradation of each circuit block or component over time. In some previous works, COS is referred to as workload; throughout this paper, we use these two terms interchangeably. Next, at the chip level, the ageing models of the components can be combined to predict the overall lifetime of a chip and, ideally, to reveal the ageing-vulnerable spots at the same time.
The above ageing analysis is becoming increasingly indispensable in the design process of a reliability-conscious IC product. Even though it may not be very accurate, it does provide a first-order estimate and early feedback to the design process for lifetime enhancement.
The inaccuracy of static ageing analysis mainly arises from the hypotheses used in mapping the cell-level ageing models to the component-level models. These hypotheses may not be consistent with reality. For example, it is still an open issue how a cell-level model derived from measurement data over a time window of just 6 months can be extrapolated to an extended lifetime such as 10 years. Also, the assumed COS information is often too simplified and does not fully reflect the true situation in the field. In reality, each chip is likely to have a unique ageing process based on its own Circuit Operation Statistics (COS). To take this into account, a more integrated method is needed.
To address this challenge, we propose a highly integrated cloud-based online ageing monitoring and prediction methodology in this work, with the following key features:
(1) Ageing monitors are inserted in the chip to collect online ageing information, as in some previous works [9], [10]. However, the way these data are utilized afterwards is different.
(2) An offline ageing characterization process (with some assumed stress test procedure) is performed on a number of fabricated chips to derive the so-called collective "accelerated ageing models".
(3) For the online operation, each individual chip periodically sends its ageing data to a designated cloud server as its "ageing history". Over time, more accurate IC-specific ageing and lifetime prediction is then performed in the cloud, and a warning message is sent to each chip that is under the threat of ageing. During this process, the collective "accelerated ageing models" are converted into an IC-specific ageing model, which takes into account the chip's unique "ageing history".
FIGURE 2. Ring-Oscillator based monitor [19].
The proposed cloud-based online ageing monitoring and prediction is particularly effective as compared to the previous works in two aspects. First, the genuine COS information through an IC's life cycle is faithfully reflected in the ageing history, to render accurate lifetime prediction. Second, the ageing threat can be detected quickly and therefore preventive measures can be taken in a more timely manner to avert the ageing-induced catastrophe. The preventive measures could take multiple forms, such as reconfiguration of the functional units in a chip, immediate replacement by a refreshing unit, or a warning signal calling for maintenance.
We believe that our work is the first to use the "ageing history" in conjunction with the "accelerated ageing model" to produce an IC-specific ageing model, thereby making accurate lifetime estimation possible for each individual IC.
The rest of this paper is organized as follows. In Section II, we first provide preliminaries. In Section III, we present the proposed online ageing monitoring and prediction methodology, including the ageing monitor, the overall system architecture, the model translation scheme, and the system integration. In Section IV, we report the silicon results based on fabricated test chips, and in Section V we conclude.
II. PRELIMINARIES
Ring-Oscillators (ROs) are commonly used as monitors for measuring the effects of Process, VDD voltage, Temperature, and Ageing (jointly called PVTA effects) [15]-[19]. Fig. 2 shows an example. The ring oscillator produces a clock signal whose clock period, called ROCP, carries the information of the combined PVTA effects. A time-to-digital converter turns the time-varying clock period into digital codes for further analysis, some capturing the average ROCP and some capturing the worst-case ROCP. It is possible to further decipher from the resulting digital ROCP codes the individual effect of each contributing factor, leading to useful information about the process status, the average and worst-case VDD voltages, and the temperature, respectively [15], [19].
In this work, we use ROs for the purpose of ageing monitoring. In general, it is similar to the monitoring of PVT effects, but different in the following two aspects:
(1) The ageing effect occurs much more slowly than the effects of the VDD voltage (changing in nanoseconds) and the temperature (changing in milliseconds). Usually, it takes weeks or even months for the ageing effect to become noticeable under a typical workload. Therefore, one only needs to take ageing samples once in a while (e.g., one sample per day).
(2) For monitoring dynamic VDD drop, one needs to capture and analyze not only the average, but also the worst-case Ring-Oscillator Clock Period (ROCP) over a monitoring interval. For monitoring ageing effects, we only need to capture the average ROCP values at some specific sampling times. However, these average ROCP values need to be measured under a "proper condition", so that the average performance of an RO monitor is affected by the ageing effect alone, and not by any other unwanted PVT effects. In our system, we use the following condition:
Whenever we attempt to sample an average ROCP value bearing only the ageing effect, we set the chip to operate in the idle mode for a while (e.g., a few seconds) so that it cools down to the ambient temperature, with only a small leakage current. The current drawn through the power/ground pads is then small and stable, so the VDD voltage driving our RO monitor is also stable, which removes the unwanted effect of dynamic VDD variation. The resulting average ROCP value is then compared to its time-zero reference (recorded when the chip was just installed on the system board), and their difference reflects only the accumulated ageing effect from time zero up to the current sampling time.
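The comparison against the time-zero reference can be illustrated with a minimal sketch. It assumes that ageing is expressed as the relative increase of the average ROCP in percent (consistent with the 10% threshold used later, but an assumption nonetheless); the function name and the sample readings are hypothetical.

```python
def ageing_percent(rocp_now, rocp_time_zero):
    """Relative increase of the average ROCP over its time-zero reference, in percent.

    Both values are assumed to be sampled in idle mode, so that temperature
    and dynamic VDD effects are negligible (assumption of this sketch).
    """
    if rocp_time_zero <= 0:
        raise ValueError("time-zero reference must be positive")
    return 100.0 * (rocp_now - rocp_time_zero) / rocp_time_zero


# Hypothetical readings in arbitrary time units.
print(ageing_percent(1.10, 1.00))
```

A monitor whose average clock period has grown from 1.00 to 1.10 units would thus be reported as roughly 10% aged.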
Certainly, one may need many RO monitors inserted in the chip, one for each selected site of interest. These RO monitors can be arranged into a special architecture, as shown in Fig. 3 , so they can share the same clock period measurement circuit. At any given time, only one RO monitor is active and transfers its output clock signal, RO_CLK, to the Clock Period Measurement (CPM) circuit.
III. PROPOSED CLOUD-BASED AGEING MONITORING
In this section, we introduce the proposed cloud-based online ageing monitoring methodology, including its benefits, architecture, operation flow, and test chip design.
A. OVERVIEW
For an IC used in an Internet of Things (IoT) system, the data produced during the monitoring process can be sent to the cloud via existing wireless internet connection [19] . Such a cloud-based monitoring methodology can have several benefits.
(1) The ageing history of an IC (consisting of the average ROCP samples recorded over its lifetime) can be checked at any time and from anywhere in the world.
(2) In the cloud, the ageing histories of all ICs of the same type can be gathered and compared. An IC that ages abnormally with respect to the population can therefore be easily identified as an outlier in terms of some features. This kind of peer-based over-ageing detection cannot be easily performed when each IC is only monitored individually inside its own system.
(3) As discussed earlier, a monitoring system is composed of not only hardware, but also software. In a cloud-based monitoring system, the software responsible for the "ageing data analysis" can be run in the cloud, rather than on the edge device containing the IC. Such an arrangement is more modular and flexible. Whenever there is a new version of the "ageing analysis algorithm", we only need to install it in the cloud. There is no need to update the software on the numerous edge devices, which saves considerable effort and cost.
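As an illustration of the peer-based over-ageing detection mentioned in benefit (2), the following sketch flags chips whose ageing lies far above the population median. The paper does not specify the outlier criterion; the median-absolute-deviation rule, the threshold `k`, the function name, and the data are all assumptions made for this example.

```python
from statistics import median


def find_over_aged(ageing_by_chip, k=3.0):
    """Return the IDs of chips whose ageing exceeds the population median
    by more than k median absolute deviations (hypothetical criterion)."""
    values = list(ageing_by_chip.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # all chips age (almost) identically: nothing to flag
    return sorted(chip for chip, v in ageing_by_chip.items()
                  if (v - med) / mad > k)


# Hypothetical ageing values (in percent) reported by five chips of one type.
population = {"chip_a": 1.0, "chip_b": 1.1, "chip_c": 0.9,
              "chip_d": 1.05, "chip_e": 4.0}
print(find_over_aged(population))
```

Only chips that age faster than their peers are flagged here; a chip that ages unusually slowly is not treated as a hazard in this sketch.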
In terms of the architecture, a cloud-based monitoring system can be divided into three different domains as shown in Fig. 4 , namely, (1) User domain, (2) Cloud domain, and (3) IoT Edge domain. This framework is not only scalable, but also easy-to-integrate in a way that the IoT edge devices are only responsible for producing the raw data, while the complicated back-end data analysis (which maps the raw data into meaningful ageing information) is performed at the cloud server. By such a hardware/software collaboration, the system becomes more flexible in its ability to distribute the functions among the edge devices and the server in a cost-effective manner.
In more detail, the User part comprises a ''control console'' that regulates the overall monitoring flow of all IoT devices. The Cloud part is located in a designated server (e.g., one in our lab), which stores the gathered raw data (mainly the average ROCP values) and also supports the subsequent data analysis to predict ageing and lifetime. The IoT Edge part is associated with each device under monitoring, responsible for producing the ROCP raw data and transmitting the raw data through the internet to the cloud part.
B. AGEING MONITORING FLOW
In this subsection, we explain our monitoring flow for "ageing and remaining lifetime prediction", as outlined in Fig. 5. It is divided into four phases. Phase 1 and Phase 2 are one-time efforts made on a number of selected ICs after they are fabricated. Phase 3 and Phase 4, on the other hand, are performed in the field at pre-defined monitoring intervals for each edge device.
Phase 1 (Perform Accelerated Ageing Process):
It is known that a stress test can be applied to accelerate the ageing process. This phase observes how a handful of ICs, selected for characterization during offline testing, age under stressed conditions. For example, in our test case we apply a "boosted supply voltage" of 2 times the rated VDD level, and the aged behaviors (reflected by the average ROCP values) are recorded over time. The ageing acceleration process continues until the IC under stress test has aged beyond a preset threshold (say 10%). In our test case, we have three ICs under stress, and we recorded the average ROCP values of all 16 RO monitors inserted in each IC once per day until all three ICs had aged by a threshold value of 10% (as compared to their time-zero references).
Phase 2 (Derive "Accelerated Ageing Model"):
Given the results of the above ageing acceleration process, an "accelerated ageing model" is built by fitting the accelerated ageing data. We use a polynomial of degree 5 as the fitting function, as shown in Fig. 6. The resulting "accelerated ageing model" is a function of time, denoted as A(t). Just like the timing model for a standard cell, our "accelerated ageing models" are derived in three cases: (1) worst-case, (2) typical-case, and (3) best-case. The worst-case "accelerated ageing model" is acquired by taking the top envelope of the "accelerated ageing data" at each recording time obtained during the accelerated ageing process. Similarly, the best-case model is acquired by taking the bottom envelope, while the typical-case model is acquired by taking the average of the "accelerated ageing data". As shown in Fig. 6, our accelerated ageing process reaches 10% ageing on these test chips in 14 days in the worst case, 24 days in the typical case, and 32 days in the best case. We use the following terms to assist our subsequent discussions.
• $T_{worst}$: The time needed to reach an ageing threshold (e.g., 10% ageing) using the worst-case accelerated ageing model.
• $T_{typical}$: The time needed to reach an ageing threshold (e.g., 10% ageing) using the typical-case accelerated ageing model.
• $T_{best}$: The time needed to reach an ageing threshold (e.g., 10% ageing) using the best-case accelerated ageing model.
Then, in our test case, $T_{worst}$ = 14 days, $T_{typical}$ = 24 days, and $T_{best}$ = 32 days. Taking $T_{typical}$ as a reference, we further calculate two ratios to be used later in the remaining lifetime prediction, namely, $\text{Ratio}_{\text{worst-to-typical}}$ = 14/24 = 0.58 (or −42%), and $\text{Ratio}_{\text{best-to-typical}}$ = 32/24 = 1.33 (or +33%). In some sense, the variation of accelerated ageing is in a range of [−42%, +33%].
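The envelope construction and the two ratios can be sketched as follows. This is a simplified illustration, not the authors' implementation: it works directly on daily samples with linear interpolation instead of the degree-5 polynomial fit, and the function names are hypothetical. The values 14, 24, and 32 days are those reported above.

```python
def envelopes(curves):
    """Per-day top envelope, average, and bottom envelope of several ageing curves.

    curves: list of equal-length lists, one per stressed chip, holding the
    ageing (in percent) recorded once per day.
    """
    days = list(zip(*curves))
    worst = [max(d) for d in days]              # top envelope
    typical = [sum(d) / len(d) for d in days]   # average
    best = [min(d) for d in days]               # bottom envelope
    return worst, typical, best


def days_to_threshold(series, threshold=10.0):
    """Day (linearly interpolated) at which a daily series first reaches threshold."""
    if series and series[0] >= threshold:
        return 0.0
    for day in range(1, len(series)):
        lo, hi = series[day - 1], series[day]
        if lo < threshold <= hi:
            return (day - 1) + (threshold - lo) / (hi - lo)
    return None  # threshold not reached within the recorded data


# Values reported in the text for the test case (days to reach 10% ageing).
T_WORST, T_TYPICAL, T_BEST = 14, 24, 32
ratio_worst = T_WORST / T_TYPICAL  # about 0.58, i.e. -42%
ratio_best = T_BEST / T_TYPICAL    # about 1.33, i.e. +33%
print(round(ratio_worst, 2), round(ratio_best, 2))
```

Applying `days_to_threshold` to each of the three envelopes would yield the three characteristic times, from which the ratios follow by simple division.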
Phase 3 (Derive "Predicted" Ageing Model):
While the accelerated ageing model describes how a device ages under a particular stress condition, the "predicted ageing model" in our system predicts the actual ageing process under normal workload in the field. This model is progressive and specific to each IC, in the sense that it takes into account the real ageing information of each IC as it operates in the field. We dynamically update this "predicted ageing model" over the entire lifetime of an edge device, based on the ageing information seen so far for that specific IC, and store it in the database in the cloud.
In our methodology, the predicted ageing model is in general a time-stretched version of the accelerated ageing model. It is derived by a successive approximation procedure in two stages, coarse-stretching and fine-stretching, as described in the following.
(Step 3.1) Derive the coarse ageing model, based on the following operation.
1) COARSE-STRETCHING OPERATION
Stretch the time-axis of the typical-case accelerated ageing model A(t), shown as the BLUE curve in Fig. 7, with a known stretching time unit, $T_{typical}$, into a function denoted as A(t/n), shown as the GREEN curve in Fig. 7. Here n is an integer, called the coarse-stretching parameter, which is incrementally determined by our successive approximation procedure so as to minimize the ageing prediction error.
Definition 1 (Ageing Prediction Error): At any given time, the actual ageing data accumulated so far for a device under monitoring forms an ageing history. The mean square error between an "ageing model" and the actual "ageing history" can be used to calculate the ageing prediction error.
Based on the above definition, the coarse-stretching parameter is determined by an iterative process that finds the positive integer n minimizing the ageing prediction error. Table 1 lists the ageing prediction error for n = 1, 2, ..., 43. The error drops monotonically from 3032% to 2.781% as n increases from 1 to 42, and then rises slightly to 2.789% when n is incremented to 43. As a result, the coarse-stretching procedure selects n = 42 as the final coarse-stretching parameter for this test case.
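The coarse-stretching search described above can be sketched as a simple loop over n. This is a minimal illustration under stated assumptions: the error is the plain mean square error of Definition 1, the search stops at the first n that no longer reduces the error (mirroring the behavior reported in Table 1), and the linear toy model and toy history are hypothetical stand-ins for the fitted A(t) and the real measurements.

```python
def mse(model, history):
    """Mean square error between model(t) and recorded (t, ageing) samples."""
    return sum((model(t) - a) ** 2 for t, a in history) / len(history)


def coarse_stretch(accel_model, history, n_max=1000):
    """Increase n from 1 and stop as soon as the error stops decreasing."""
    best_n, best_err = None, float("inf")
    for n in range(1, n_max + 1):
        err = mse(lambda t, n=n: accel_model(t / n), history)
        if err >= best_err:
            break
        best_n, best_err = n, err
    return best_n, best_err


# Toy accelerated model (hypothetical): linear, reaching 10% ageing in 24 days.
def toy_accel_model(t):
    return 10.0 * t / 24.0


# Toy ageing history generated with a true stretching factor of 5.
toy_history = [(t, toy_accel_model(t / 5)) for t in range(0, 101, 10)]
print(coarse_stretch(toy_accel_model, toy_history))
```

On the toy data the search recovers the stretching factor used to generate the history; with measured data the minimum error is of course non-zero.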
(Step 3.2) Derive the fine ageing model, by another successive approximation procedure.
2) FINE-STRETCHING OPERATION
Slightly stretch the time-axis of the typical-case accelerated ageing model called A(t) into a fine ageing model, by the following formula:
Note that n is the previously determined coarse-stretching parameter, while k is a fine-stretching parameter to be further decided in this stage.
In our test case, $T_{typical}$ = 24 days. Again, we gradually increase the value of k from 1, seeking an integer value such that the overall ageing prediction error (against the actual ageing history) is minimized.
TABLE 2. The ageing prediction error (%) versus the fine-stretching parameter, k. (Note that n has been fixed at 42 in the previous coarse-stretching process.)
Table 2 lists the ageing prediction error for k = 1, 2, ..., 14. The error drops monotonically from 2.7809% to 2.7726% as k increases from 1 to 13, and then rises slightly to 2.7742% when k is 14. As a result, the fine-stretching procedure selects k = 13 as the final fine-stretching parameter for this test case. As illustrated in Fig. 8, the final ageing model for this device is determined by ($T_{typical}$, n, k) = (24 days, 42, 13).
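The fine-stretching search can be sketched in the same way as the coarse one. Since the fine-stretching formula is not reproduced in the text above, the sketch assumes a stretched model of the form A(t / (n + k/T_typical)), i.e., k adds a fraction of one coarse step to the stretching factor. This form is an assumption of the example, not a statement of the authors' formula; under it, the typical-case time to reach the threshold would be n·T_typical + k = 42·24 + 13 = 1021 days, which is close to the 1020 days reported in Section IV-D. The toy model and history are hypothetical.

```python
def mse(model, history):
    """Mean square error between model(t) and recorded (t, ageing) samples."""
    return sum((model(t) - a) ** 2 for t, a in history) / len(history)


def fine_stretch(accel_model, history, n, t_typical):
    """Search integer k >= 1, assuming the form A(t / (n + k / t_typical)).

    Starts from the coarse model (k = 0) and stops as soon as the error
    stops decreasing.
    """
    best_k, best_err = 0, mse(lambda t: accel_model(t / n), history)
    for k in range(1, int(t_typical)):
        factor = n + k / t_typical
        err = mse(lambda t, f=factor: accel_model(t / f), history)
        if err >= best_err:
            break
        best_k, best_err = k, err
    return best_k, best_err


# Toy accelerated model (hypothetical): linear, reaching 10% ageing in 24 days.
def toy_accel_model(t):
    return 10.0 * t / 24.0


# Toy history generated with n = 5 and k = 7 under the assumed form.
toy_history = [(t, toy_accel_model(t / (5 + 7 / 24))) for t in range(0, 101, 10)]
print(fine_stretch(toy_accel_model, toy_history, n=5, t_typical=24))
```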
Phase 4 (Determine the Range of Remaining Lifetime):
FIGURE 9. A benchmark circuit "B17" embedded with 16 monitors, a logic Built-In Self-Test circuit, and a CPM circuit. Note that, in addition to supporting ageing monitoring, this test chip also supports PVT monitoring proposed in [19].
Once we have derived the final ageing model in the typical case for a specific IoT edge device, we can find its "retire time in the typical case". The retire time in the typical case is then multiplied by the two ratios derived previously from the "accelerated ageing models", i.e., $\text{Ratio}_{\text{worst-to-typical}}$ and $\text{Ratio}_{\text{best-to-typical}}$, to obtain the retire times in the worst case and in the best case. Once we know the retire times, we can calculate the remaining lifetimes as follows:
(Remaining Lifetime) = (Retire Time) − (Current Time)
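As a worked example of Phase 4, the sketch below applies the two ratios and the remaining-lifetime formula to the figures reported in Section IV-D (typical retire time of 1020 days, evaluated after 80 days of operation). Rounding the retire times to whole days is an assumption of this sketch, and the function name is hypothetical.

```python
def remaining_lifetimes(retire_typical, ratio_worst, ratio_best, current_time):
    """Remaining lifetime (in days) for the worst, typical, and best cases."""
    retire = {
        "worst": round(retire_typical * ratio_worst),
        "typical": round(retire_typical),
        "best": round(retire_typical * ratio_best),
    }
    # (Remaining Lifetime) = (Retire Time) - (Current Time)
    return {case: t - current_time for case, t in retire.items()}


print(remaining_lifetimes(1020, 0.58, 1.33, 80))
# → {'worst': 512, 'typical': 940, 'best': 1277}
```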
IV. MEASUREMENT RESULTS

A. TEST CHIP DESIGN
A benchmark circuit (B17) adopting the above ageing monitoring methodology has been designed and fabricated, as shown in Fig. 9. This is also the chip we used to conduct the experiments for PVT (Process, Voltage, and Temperature) monitoring in [19]. However, the methodology used in this paper for ageing prediction is very different from that used in [19], which is more concerned with capturing the worst-case dynamic VDD drops and the temperature variation. This test chip design is embedded with 16 monitors and a logic BIST (Built-In Self-Test) circuit. The CPM circuit contains a Time-to-Digital Converter [17] as well as a cell-based Phase-Locked Loop [20] produced by in-house compilers. Based on our test flow, the 16 inserted monitors take turns being observed, each producing an RO_CLK signal that reflects the ageing condition at the site it monitors.
B. HARDWARE COMPONENTS
Our test chips are fabricated in a 90nm CMOS process. In our prototype system, a chip under monitoring is put into the socket on the device board, which is in turn connected to the FPGA interface board, sensor hub, ambient temperature sensor, and two battery modules, as shown in Fig. 10. We have built three such prototype systems.
C. SOFTWARE COMPONENTS
In addition to hardware, there are software components in our prototype system, as listed in Table 3. These software components are implemented in six different kinds of programming languages: HTML, Javascript, PHP+SQL, Perl, Python, and C.
(1) In the User part, there are about 500 lines of HTML code that build the overall structure of the control console interface, with an additional 400 lines of Javascript code to communicate with the Cloud.
(2) In the Cloud part, there are four software components: the User communication agent, the Database manager, the PVT and ageing prediction component, and the Edge communication agent. The two communication agents provide the bridge between the User part and an Edge part for transmitting commands, ROCP raw data, and PVT and ageing information, with about 850 lines of PHP code. The database manager has 142 lines of PHP+SQL code to support the insert, delete, and search database functions. The ROCP modeling and PVT and ageing prediction component is the major software component of this system, realized by 1360 lines of code in Perl and Python. It is responsible for process calibration, temperature-aware VDD-drop prediction, and remaining lifetime estimation.
(3) In the Edge part, there are 336 lines of C code, used to perform wireless network connection (using API functions provided with LinkIt-ONE) and edge device regulation.
In the following, we present measurement results of our cloud-based monitoring system for ageing prediction.
D. AGEING AND REMAINING LIFETIME PREDICTION
We have applied the proposed "accelerated ageing process" to three test chips to derive the worst-case, typical-case, and best-case "accelerated ageing models". Then, we use the three models to predict the "final ageing model" for one chip by the coarse-stretching and fine-stretching operations. The retire times of this chip are indicated in Fig. 11 as {592, 1020, 1357} days in the worst case, the typical case, and the best case, respectively. With these retire times, the remaining lifetime can be calculated. For example, if the retire time is predicted when the chip has been operated for 80 days, then its remaining lifetime would be {592 − 80, 1020 − 80, 1357 − 80} = {512, 940, 1277} days = {1.4, 2.6, 3.5} years in the worst case, the typical case, and the best case, respectively. Note that these lifetimes may seem relatively short. This is partly because we operated this test chip in a non-stop manner when we collected its "normal ageing data". If a chip is operated in a more relaxed manner (e.g., spending most of its time in the OFF-state), then it could age more slowly.
E. APPLICATION NOTES
In reality, an IC could have many functional blocks. Each functional block could experience its own unique workload and thus age at a different pace. Even though the proposed methodology is only demonstrated on the ring oscillators used as the ageing sensors, it can be integrated with other run-time ageing sensors as well [21]-[25]. Each ageing sensor is used to collect the online ageing history of a specific functional block (such as cache memory, SRAM memory, or logic blocks). Then, based on the prebuilt accelerated ageing model of that functional block and its workload-aware online ageing history, one can accurately predict the ageing status and the remaining lifetime of each functional block. At an "ageing monitoring center", a scoreboard can be maintained to keep track of the ageing condition of each individual functional block.
V. CONCLUSION
For safety-critical applications, a convincing methodology that can demonstrate how long an IC can operate reliably under the influence of ageing is badly needed. Traditional purely software-based prediction may not be adequate, as the results can be rough. In light of this, we have proposed a more accurate method in this paper. Our contributions can be summarized in two aspects. First, the ageing and lifetime prediction becomes more credible, since it is derived not only from the offline "accelerated ageing model", but also from the online "ageing history" under normal workload. In addition, a unique "lifetime" prediction can be produced for each individual IC. Second, an ageing monitoring system contains both hardware and software, and involves data analysis for a large number of ICs spread around the globe. By adopting a cloud-based method, system integration and maintenance become easier as well. Measurement data from test chips have been used to demonstrate the effectiveness of the proposed methodology.
