ABSTRACT Fault-tolerant systems are indispensable for complex electronic systems, but there are some drawbacks to the traditional fault-tolerant schemes. The evolution of hardware for fault-tolerant systems can achieve good results. However, because of the highly time-consuming evolution process, the evolution approach may not meet the real-time constraint for fault recovery. In this paper, we propose an improved genetic algorithm for real-time circuit fault recovery. We establish a real-time fault-tolerant system on a ZYNQ chip. In this system, a fault analysis tree hierarchically monitors a circuit fault. When the fault occurs, the normal operation of the faulty circuit is temporarily maintained by a fault compensation mechanism. At the same time, we use an evolutionary mechanism combined with a fault recovery library and improved genetic algorithm to accelerate the evolution of the repair circuit and obtain a repaired circuit while the fault compensation mechanism is running. Ultimately, the improved algorithm significantly improves the fault-tolerant recovery rate and ensures the operations in real time. Thus, the improved genetic algorithm for real-time circuit fault recovery can meet the system's real-time constraints and improve the system's stability.
I. INTRODUCTION
In a complex and changeable environment, an electronic system may be influenced by many factors, and various unavoidable faults may appear in the system. If the faults are not recovered in a timely manner, the electronic system could face a collapse. For devices such as deep-sea detectors, underground metal detectors, space satellites and other electronic devices, if faults occur and the devices themselves do not have fault tolerance, the entire electronic system will face a serious situation. If the system cannot complete the recovery in time, it may fail to work, resulting in unpredictable losses. Therefore, it is important to design an electronic system that has fault tolerance.
Fault tolerance means that when the system fails, the system can adjust its own structure to maintain its normal functions [1]. Fault tolerance is mainly divided into active fault tolerance control and passive fault tolerance control [2], [3]. Passive fault tolerance mainly uses redundant hardware resources to maintain system functions, such as redundant technologies [4], [5]. Active fault tolerance maintains system stability through online tuning or reconfiguring systems, such as reconfigurable technologies [6], [7] and Evolution Hardware (EHW) technologies [8]. Redundancy and self-reconfiguration belong to traditional fault-tolerant technologies. When faults occur, the redundant technology uses a large amount of hardware resources to recover from them, which leads to a more complicated system layout and routing, while the self-reconfiguration technology reconstructs the whole system, which consumes a lot of time. The adaptive and self-repair characteristics of Evolution Hardware technology are well matched to fault-tolerant systems. In terms of resource usage and time consumption, EHW technology is widely regarded as better than the traditional fault-tolerant technologies. However, a fault-tolerant system requires not only fault tolerance but also real-time operation. If too much time is consumed in the evolution process, the circuit cannot be recovered within the constraint time. The system is then forced to stop running, and the reliability of the system is greatly affected.
The associate editor coordinating the review of this manuscript and approving it for publication was Christian Esposito.
Although some studies in the literature discuss methods that could satisfy the real-time condition, electronic systems have widely varying structures, so it is impossible to obtain a general way to satisfy the constraint time of every system. Designers can only complete the fault repair in as short a time as possible. Thus, it is uncertain whether the constraint time will be satisfied, and it cannot be guaranteed that the repair process does not affect the system operations. To ensure the repair performance of the fault-tolerant system and to guarantee that the fault recovery is completed without affecting normal functions of the system, this paper presents an improved genetic algorithm for real-time circuit fault recovery. The scheme is divided into two stages. In the design phase, we stratify the circuit system from the top to the bottom and use a modular design to monitor the input and output of each module. Using the detection principle of Fault Tree Analysis, it is easy to determine the fault location. Then, we preset a Fault Repair Library for the expected faults of the circuit. This library can also be used to initialize the evolution algorithm. In the operation phase, when a fault occurs, the Fault Compensation Mechanism is used to isolate the fault circuit from the system and maintain the normal functioning of the system. At the same time, the system queries the Fault Repair Library. If the fault is one of the expected faults, then the system directly reconstructs the repaired circuit; otherwise, it uses the Improved Genetic Algorithm to generate the repaired circuit. In this paper, we design and implement the fault-tolerant system on a ZYNQ chip. The experiment shows that our algorithm can meet the real-time requirements of the system and ensure the recovery capability. This approach can improve the system's stability without affecting the normal operations of the system.
The contributions of this paper are summarized as follows:
(1) To accurately locate the fault and reduce the time required for fault detection, the system is divided into modules in the system design stage. The inputs and outputs of each module are monitored, and when a failure occurs, the Fault Analysis Tree is used to quickly determine the fault circuit.
(2) To protect the online repair performance of the system, the normal operations of the system must not be affected while it is being repaired. We therefore propose a Fault Compensation Mechanism. During a certain period, the output of the Fault Compensation Mechanism can be considered to be correct, so the mechanism buys time for the fault repair and ensures real-time operation.
(3) To get a recovered circuit in the shortest amount of time, we propose an Improved Genetic Algorithm, which improves the success rate and convergence rate of the repair by modifying the crossover and mutation processes of the genetic algorithm. The repair meets the real-time constraint.
The remainder of this article is organized as follows. Section II addresses the relevant work on fault tolerance research. In Section III, a real-time fault repair scheme is proposed. Then, Section IV describes the experimental designs and the results. Section V summarizes this paper and gives a vision for future work.
II. RELATED WORK
With the continuous development of information technology and its application fields, the dominant and integrated functions of complex electronic systems in aviation and communications have been significantly enhanced. Therefore, the fault tolerance of electronic systems is extremely important. The most widely used fault-tolerant technology is redundant technology. Redundant technology uses redundant hardware resources instead of the original circuits to ensure the normal operations of the electronic systems. When a fault occurs, the redundant technology can quickly repair the system with its redundant hardware resources. The most common redundancy technology is triple modular redundancy (TMR). However, for complex electronic systems, if we only use redundant technology, a large amount of resources will be wasted. Additionally, when a common-mode fault occurs, multiple faults could occur in the TMR, and then the TMR would be unavailable [4]. Yang et al. [9] proposed a TMR architecture based on an evolution mechanism, which could effectively solve the common-mode fault problem by using an interactive two-stage evolution strategy to evolve the system circuits into redundant modules with different structures, but at the cost of more hardware resources. Another traditional fault-tolerant technology is reconstruction technology, which uses preset redundant system configuration information to repair the faults. It also takes up additional hardware resources and is very time-consuming in terms of system re-engineering [6], [7]. Evolution Hardware has the characteristics of self-organization, self-adaption and self-repair, and thus, it is a good match to fault-tolerant systems [10], [11]. NASA / JPL first started research on the fault tolerance of programmable devices. Liu et al. [12] mainly studied evolutionary self-repair of FPGAs and digital circuits after electromagnetic damage.
Gavie and Thompson [13] and Gavie [14] proposed an evolutionary repair architecture that could repair the FPGAs without interrupting the system functions, but it could only repair transient and permanent faults, and it was not very effective for time-lag faults. Zhang et al. [15], [16] proposed a new self-repair technique based on Evolution Hardware and compensation balance technology, which could achieve the self-repair of various circuits and devices through dynamic configurations. However, because the repair relies only on compensating the faulty output, if the same fault occurs in different periods, then the evolution repair is repeated, and time is consumed again. Wang et al. [17], [18], Lanchares et al. [19] and Mukherjee and Dhar [20] all proposed real-time fault-tolerant strategies based on the Evolution Hardware technology. However, the problem with these approaches is that the system cannot give an accurate repair time, and thus, the real-time aspect is only relative. To meet the real-time constraints, Ran et al. [21] proposed a population hybridization monkey-king genetic algorithm, and they improved the convergence rate and success rate by means of subpopulation hybridization. However, in the late evolution stage, when the chromosome fitness values are similar, the algorithm approaches the traditional genetic algorithm and could still fall into a local optimum.
The above schemes have been shown to have good fault tolerance and a high fault tolerance rate, but they have given less consideration to the real-time constraint. Additionally, the fault repair must be completed within the allowed repair period to improve the system's reliability, and both the real-time and fault-tolerance aspects must be considered to meet the actual needs of the fault-tolerant design.
III. REAL-TIME FAULT REPAIR SCHEME
Real-time fault repair does not simply mean repairing the fault in a short time. Real-time is a relative concept for an electronic system: once a fault occurs, as long as the fault does not affect the normal operations of the system before the repair is completed, the repair can be considered to be real time. The structure of the electronic system is complex and changeable. After a fault occurs, it is often difficult to determine the repair period of the fault circuit. Thus, it is difficult to determine whether the repair meets the real-time constraint. However, as long as the normal operations of the circuits can be ensured during the repair period, the repair can be regarded as real-time. We design and implement a fault-tolerant system on a ZYNQ chip. As shown in Figure 1, the system is divided into two parts: the evolution algorithm in the PS part and the hardware structure in the PL part. The system uses the fault detection mechanism to determine the faults and uses the fault compensation mechanism to isolate the fault circuit and maintain the normal system output. At the same time, the system uses the fault repair mechanism to query the fault repair library. If the fault is an expected fault, then the system directly recovers from it; otherwise, it uses the evolution algorithm to generate the recovery circuit and then downloads the reconfigurable circuit. Finally, it uses the corrector to correct the output to obtain the correct results.
A. FAULT DETECTION MECHANISM
The premise of using an EHW fault-tolerant system to complete the system self-repair is that it can quickly and accurately locate the fault circuit and then use the corresponding mechanism for fault tolerance [22] - [25] . Thus, it is necessary to use the fault diagnosis method to obtain the fault location. Fault diagnosis methods can be divided into two categories: fault diagnosis based on mathematical models and fault diagnosis based on artificial intelligence. In this paper, the fault diagnosis method is the artificial intelligence diagnosis method which is based on a fault tree (Fault Tree Analysis, FTA). The diagnosis process begins with the failure of the system, ''Why This Phenomenon Appears,'' and forms a ladder fault tree step by step along the fault tree. Through a heuristic search of the fault tree, finally the cause of the fault can be determined. Figure 2 shows the fault analysis tree.
In a large-scale circuit system design, it is necessary for all levels of the circuit to be divided into modules and for the corresponding circuit truth table information to be stored. When the fault tree is used for analysis, only by comparing the actual output of the circuit and the output value in the corresponding truth table can it detect whether the fault occurs. If the system fails, then it analyzes the actual outputs and the expected outputs from top to bottom, in such a way that the corresponding fault part of the circuit can be identified within a short time. 
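The top-down comparison described above can be illustrated with a minimal sketch. The `Node` class, its field names and the example gates are hypothetical and are not taken from the paper; the sketch only assumes that every module stores a truth table and that its current input and output can be observed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Bits = Tuple[int, ...]


@dataclass
class Node:
    """One module in the fault tree (hypothetical structure for illustration)."""
    name: str
    truth_table: Dict[Bits, Bits]   # expected output for each input pattern
    observed_input: Bits            # input currently seen by the monitor
    observed_output: Bits           # output currently produced by the module
    children: List["Node"] = field(default_factory=list)

    def is_faulty(self) -> bool:
        # A module is flagged when its actual output differs from the stored value.
        return self.truth_table[self.observed_input] != self.observed_output


def locate_fault(node: Node) -> Optional[str]:
    """Walk the tree from top to bottom and return the deepest faulty module."""
    if not node.is_faulty():
        return None
    for child in node.children:
        found = locate_fault(child)
        if found is not None:
            return found
    return node.name  # no faulty child: the fault sits in this module itself


AND_TABLE = {(0, 0): (0,), (0, 1): (0,), (1, 0): (0,), (1, 1): (1,)}
XOR_TABLE = {(0, 0): (0,), (0, 1): (1,), (1, 0): (1,), (1, 1): (0,)}

# The AND gate outputs 0 for input (1, 1), so the system output is wrong too.
and_gate = Node("and_gate", AND_TABLE, (1, 1), (0,))
xor_gate = Node("xor_gate", XOR_TABLE, (0, 0), (0,))
system = Node("system", {(1, 1, 0): (1,)}, (1, 1, 0), (0,), [and_gate, xor_gate])
faulty_module = locate_fault(system)
```

Only branches whose parent output is wrong are visited, which is why the search can stay short in a deep module hierarchy.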
B. FAULT RECOVERY LIBRARY
There could be many types of faults in the circuit, and thus, the possibility that a given fault re-occurs is uncertain. When a fault occurs, if we use Evolution Hardware to repair the fault, then the time spent on the evolution is large. For some faults, the evolution itself is especially time-consuming, and therefore, we use a preset fault repair library to accelerate the repair.
In the design of the circuit system, we analyze the possible circuit faults, apply the fault analysis tree to locate the fault position and then use EHW technology to generate the corresponding repair circuit configuration information. When a fault occurs, we search the configuration library, and if matching repair configuration information is found, then we directly download the configuration information to repair the fault circuit and complete the system's fault tolerance. Otherwise, if the library contains no configuration information for this fault, the chromosome strings already stored in the configuration library can be used directly as the initial population of the evolution algorithm. Thus, the initialization problem of the Evolution Hardware is resolved as well.
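The lookup-or-seed behaviour of the library can be sketched as follows. The function name `plan_repair`, the fault identifiers and the bit-list chromosome encoding are assumptions made for illustration; the paper does not specify how the library is indexed.

```python
import random


def plan_repair(fault_id, library, pop_size, chrom_len, rng=None):
    """Return ("reconfigure", chromosome) for an expected fault,
    otherwise ("evolve", initial_population) seeded from the library."""
    rng = rng or random.Random()
    if fault_id in library:
        # Expected fault: the stored configuration is downloaded directly.
        return "reconfigure", list(library[fault_id])
    # Unexpected fault: stored chromosomes seed the initial population.
    population = [list(c) for c in list(library.values())[:pop_size]]
    # Fill the remaining slots with random chromosomes (assumption of this sketch).
    while len(population) < pop_size:
        population.append([rng.randint(0, 1) for _ in range(chrom_len)])
    return "evolve", population


repair_library = {"stuck0_in3": [1, 0, 1, 1]}  # hypothetical entry
known = plan_repair("stuck0_in3", repair_library, pop_size=4, chrom_len=4)
action, seeds = plan_repair("unknown", repair_library, 4, 4, random.Random(0))
```

Seeding with repaired configurations starts the search from circuits that already work for related faults, which is the acceleration the section argues for.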
C. FAULT COMPENSATION MECHANISM
Since the fault circuit is isolated during the repair process, we cannot estimate the specific allowable recovery period within which the system must recover the corresponding circuit. Additionally, the isolated circuit cannot perform its normal function during the repair period, which could cause the system to stagnate. If the circuit repair takes too long or the repair fails, then during this time the system cannot receive the signal of the fault circuit, which leads to the entire system entering a paralyzed state. In a more serious situation, the system will collapse, which results in immeasurable losses. Although researchers have proposed various methods for fault tolerance, they have neglected whether the system can still operate normally while the faulty circuit is isolated during the repair. For these reasons, we propose a fault compensation mechanism to maintain the normal function of the faulty circuit during the repair process. The compensation mechanism uses the corresponding truth table of the detection system to supply the correct output signals through the multichannel analog switch (MUX). As shown in Figure 3, when a fault occurs, the detection system sends a signal (C1) to the MUX so that the MUX no longer uses the output of the faulty circuit (A0). Instead, the MUX obtains the truth table value (A1) of the corresponding circuit from the detection system and transfers it to the next-level circuit to maintain the normal circuit function. During normal operation of the system, the MUX outputs the output (A0) of the system under test. In this case, the MUX does not need to read the corresponding output values from the truth table, which reduces the exposure to truth table errors.
Since we use the truth table to detect circuit faults, the values in the truth table can be considered correct. However, while using the truth table to maintain the normal operation of the circuit, the link transmission process may fail. The longer we use it, the higher the failure possibility will become. Therefore, we need a highly efficient evolution recovery algorithm to speed up circuit repair and avoid errors in the compensation mechanism.
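The switching behaviour of the compensation mechanism can be modelled in a few lines. This is a behavioural sketch only, not the hardware design: the signal names A0 (circuit output), A1 (truth table value) and C1 (control signal) follow the description of Figure 3, while the function names and the AND-gate example are hypothetical.

```python
def detect_fault(a0, a1):
    """Control signal C1: 1 when the circuit output A0 disagrees with the
    truth table value A1, 0 otherwise."""
    return int(a0 != a1)


def mux(c1, a0, a1):
    """Forward the truth table value A1 while C1 is set, else the circuit output A0."""
    return a1 if c1 else a0


def compensated_step(circuit, truth_table, inputs):
    """Return (value passed to the next-level circuit, C1) for one input pattern."""
    a0 = circuit(inputs)
    a1 = truth_table[inputs]
    c1 = detect_fault(a0, a1)
    return mux(c1, a0, a1), c1


AND_TABLE = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}


def healthy_and(bits):
    return bits[0] & bits[1]


def stuck_at_0(bits):
    return 0  # models a faulty circuit whose output is stuck at 0


masked = compensated_step(stuck_at_0, AND_TABLE, (1, 1))
normal = compensated_step(healthy_and, AND_TABLE, (1, 1))
```

While C1 is set, the next-level circuit keeps receiving correct values, which is the window in which the evolutionary repair has to finish.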
D. IMPROVED GENETIC ALGORITHM
For the whole fault-tolerant system, whether the evolution algorithm can generate the fault repair circuit in the shortest time determines the stability and reliability of the fault-tolerant system, and thus, a good evolution algorithm is important for the design of an EHW fault-tolerant system.
In this paper, we propose an improved genetic algorithm (IGA), which improves the selection, crossover and mutation of the standard genetic algorithm. We find that the problems of a slow convergence rate and of local optima in the later stage of evolution can be mitigated well, and the real-time performance of the fault-tolerant system can be better satisfied.
1) SELECTION
In this paper, the selection operation uses a combination of the roulette selection method and the elite retention strategy to obtain the parent chromosomes. The roulette selection method is currently the most widely used selection method. Assuming that the total number of chromosomes in the population is $N$ and the fitness value of chromosome $i$ is $f_i$, the selection probability $P_i$ of chromosome $i$ is

$$P_i = \frac{f_i}{\sum_{j=0}^{N-1} f_j}$$

It is clear that the greater the fitness value of a chromosome, the higher the probability that it is selected. The basic steps of the roulette are as follows:
(1) Starting from chromosome 0, the fitness values of all chromosomes are added to obtain the total fitness Sum, as shown in equation (3.2).

$$\mathrm{Sum} = \sum_{i=0}^{N-1} f_i \qquad (3.2)$$

(2) For each chromosome $i$, the partial sum $\mathrm{PartS}_i$ of the fitness values from chromosome 0 to chromosome $i$ is calculated.

$$\mathrm{PartS}_i = \sum_{j=0}^{i} f_j$$

(3) A random integer Rand is generated in the (0, Sum) interval. Starting from chromosome 0, if Rand falls within the range $(\mathrm{PartS}_{i-1}, \mathrm{PartS}_i)$, chromosome $i$ is selected to enter the crossover operation.
The elite retention strategy is much simpler to operate than the roulette strategy. The idea is that the chromosome with the optimal fitness in the population does not undergo crossover and mutation but is replicated directly into the next generation. In this paper, we use the combination of the roulette selection method and the elite retention strategy. At the beginning of the selection, the optimal chromosome in the population directly becomes a parent. Then, starting from chromosome 0, we choose the remaining parent chromosomes by roulette. At the same time, in the crossover and mutation operations, if the fitness of the offspring is lower than that of the optimal chromosome, then the optimal chromosome remains unchanged and enters the next generation directly. This approach ensures that the optimal chromosome enters the next generation of the population, which speeds up the convergence of the genetic operations.
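The combination of roulette selection and elite retention can be sketched as below. The function names are hypothetical, and the sketch draws a floating-point Rand rather than an integer so that it also works for non-integer fitness values; this is a simplification, not the paper's exact procedure.

```python
import random
from itertools import accumulate


def roulette_pick(fitness, rng):
    """Return the index i whose interval (PartS_{i-1}, PartS_i) contains Rand."""
    part_sums = list(accumulate(fitness))      # PartS_0 ... PartS_{N-1}
    rand = rng.uniform(0, part_sums[-1])       # part_sums[-1] is Sum
    for i, part in enumerate(part_sums):
        if rand < part:
            return i
    return len(fitness) - 1                    # guard for the edge case rand == Sum


def select_parents(population, fitness, n_parents, rng):
    """Elite retention first, then roulette selection for the remaining parents."""
    elite = max(range(len(population)), key=lambda i: fitness[i])
    parents = [population[elite]]              # the best chromosome always survives
    while len(parents) < n_parents:
        parents.append(population[roulette_pick(fitness, rng)])
    return parents


population = ["a", "b", "c", "d"]
parents = select_parents(population, [1, 2, 3, 10], 4, random.Random(1))
```

Chromosomes with zero fitness own an empty interval and are never drawn, while the elite chromosome is kept regardless of the random draws.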
2) FITNESS TRANSFORMATION
With the above selection method, chromosomes with large fitness values have a higher selection probability in the early stages of evolution, and the selection pressure in favor of these chromosomes is high, which leads to a rapid decrease in the population diversity. At a later stage of the evolution, the population maintains a relatively stable diversity; however, the majority of the chromosomes in the population have a high fitness value. The difference between the mean fitness value and the maximum fitness value of the population becomes relatively small. The probabilities of the chromosomes being selected are almost the same, and the selection gradually becomes a random process. The population then shows no significant improvement over many generations. Therefore, in this paper, we apply a linear transformation to the fitness function value. Assuming that the original fitness function is $f$ and the function after the change is $F$, the linear transformation can be expressed as follows.

$$F = a f + b$$

In the above formula, the following conditions must be satisfied:
(1) The mean value of the original fitness should be equal to the mean value of the fitness after the calibration, to ensure that the expected copy number of the chromosome that has the average fitness is 1 in the next generation.

$$F_{avg} = f_{avg}$$

(2) The maximum fitness after transformation should be equal to the specified multiple of the original mean fitness, to control the copy number of the chromosome with the greatest fitness in the next generation.

$$F_{max} = c \cdot f_{avg}$$

Experiments show that the specified multiple $c$ can be in the range of 1.0-2.0. In other words, according to the above conditions, the coefficients $a$ and $b$ of the linear transformation can be determined. Using the linear transformation, the fitness gaps between the chromosomes are changed, the diversity in the population is maintained, and the convergence speed of the evolution algorithm is also greatly improved.
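Solving the two conditions for a linear transformation $F = a f + b$ gives $a = (c - 1) f_{avg} / (f_{max} - f_{avg})$ and $b = f_{avg} (f_{max} - c f_{avg}) / (f_{max} - f_{avg})$. The sketch below applies these coefficients; the function name and the default value of `c` are illustrative choices, not values reported in the paper.

```python
def linear_scale(fitness, c=1.5):
    """Scale fitness values so that the mean is preserved and the maximum
    becomes c times the mean (the two conditions stated above)."""
    f_avg = sum(fitness) / len(fitness)
    f_max = max(fitness)
    if f_max == f_avg:
        # All chromosomes are equally fit; nothing to rescale.
        return list(fitness)
    a = (c - 1.0) * f_avg / (f_max - f_avg)
    b = f_avg * (f_max - c * f_avg) / (f_max - f_avg)
    return [a * f + b for f in fitness]


scaled = linear_scale([1, 2, 3, 6], c=1.5)
```

For the example, the mean is 3 and the maximum is 6, so a = 0.5 and b = 1.5. Note that chromosomes far below the mean can receive negative scaled values when the population is tightly clustered near the maximum; this sketch does not handle that case.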
3) CROSS OPERATION
In the crossover operation, the crossover operator performs uniform crossover on the chromosomes. Based on the idea of the adaptive genetic algorithm, the crossover operator is adjusted according to the change in the chromosome fitness in the population. In the early stages of the evolution, increasing the crossover operator appropriately accelerates the convergence of the chromosomes, so that the population reaches a higher fitness more quickly, and it also improves the global search ability of the algorithm. As the number of generations increases, the average fitness of the population approaches the optimal fitness. At this time, if the crossover operator is too large, then the good patterns in the population will be destroyed and the evolution time will increase. Since the chromosome structures are already close to the optimal chromosome at this time, the chromosomes can be fine-tuned in the local space through the mutation operation, and the optimal chromosome can be reached in a shorter time. The crossover operator is calculated according to Equation (3.10), where $k_1$ and $k_2$ are constants, $f$ is the current fitness value of the chromosome, $f_{max}$ is the current maximum fitness value of the population, and $f_{avg}$ is the average fitness value of the population. When the fitness of the chromosome is less than the average fitness, we use a large crossover operator to accelerate the improvement of the chromosome, and the crossover operator is $k_1$. When the fitness of the chromosome is greater than the average fitness, the crossover operator is reduced to avoid damage to the excellent chromosome fragments.
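The following sketch shows one possible adaptive crossover rate together with uniform crossover. The text only fixes the behaviour below the average fitness (the rate equals $k_1$) and states that the rate is reduced above it; the linear decrease `k2 * (f_max - f) / (f_max - f_avg)` used here is an assumption borrowed from common adaptive genetic algorithms and should not be read as Equation (3.10). The function names and the default constants are hypothetical.

```python
import random


def crossover_rate(f, f_avg, f_max, k1=0.9, k2=0.6):
    """Adaptive crossover rate (assumed form, see the note above)."""
    if f < f_avg:
        return k1                      # weak chromosome: cross aggressively
    if f_max == f_avg:
        return k2                      # converged population: arbitrary choice of this sketch
    # Assumed reduction: falls linearly from k2 at f_avg to 0 at f_max.
    return k2 * (f_max - f) / (f_max - f_avg)


def uniform_crossover(parent_a, parent_b, rng):
    """Swap each gene between the two parents with probability 0.5."""
    child_a, child_b = list(parent_a), list(parent_b)
    for i in range(len(child_a)):
        if rng.random() < 0.5:
            child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b


low = crossover_rate(2, f_avg=5, f_max=10)
mid = crossover_rate(7.5, f_avg=5, f_max=10)
best = crossover_rate(10, f_avg=5, f_max=10)
kids = uniform_crossover([0, 0, 0, 0], [1, 1, 1, 1], random.Random(3))
```

Under this assumed form the best chromosome is never crossed, which is consistent with the elite retention described in the selection step.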
4) MUTATION OPERATION
The crossover operator is used to speed up the population convergence rate, but at the later stage of evolution, the average fitness of the population is close to the optimal chromosomal fitness [26], [27]. At this time, the chromosomes of the population must be fine-tuned by the mutation operation to escape local optima and enable the population to produce the optimal chromosome. By comparing the fitness of the optimal chromosome for the target circuit with the fitness value of the current chromosome, the mutation rate of the current chromosome is calculated to determine the number of mutations of the chromosome, and then, we can obtain the best chromosome by a local search. In the mutation rate formula, $L$ is the chromosome length, $Fit_{max}$ represents the maximum fitness of the algorithm specified by the user, $MutPer_{max}$ represents the maximum mutation rate given by the user, $MutRate_{max}$ represents the largest mutation number, $MutRate$ indicates the mutation rate, and POPSIZE represents the chromosome number. The mutation process is described as follows. The mutation number of each chromosome in the population is determined first, then a number $i$ in the $[0, L]$ interval is generated randomly, and the randomly chosen bits are mutated. Next, the fitness is evaluated again after the mutation. In this way, the optimal result is searched for in the global space, and each mutation retains only the higher-quality chromosome, which enables the algorithm to escape local optima.
Algorithm 1 Mutation
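A minimal sketch of a fitness-dependent mutation step is given below. The mutation formula itself is not reproduced in this text, so the sketch assumes that the largest mutation number is `L * MutPer_max` and that the number of flipped bits shrinks linearly as the fitness approaches `Fit_max`; both assumptions, the function names and the bit-list encoding are illustrative only.

```python
import random


def mutation_count(fit, fit_max, length, mut_per_max):
    """Number of bits to flip (assumed rule: proportional to the fitness gap)."""
    mut_rate_max = length * mut_per_max            # largest mutation number (assumed)
    gap = max(0.0, 1.0 - fit / fit_max)            # 0 when the chromosome is optimal
    return min(length, max(1, round(mut_rate_max * gap)))


def mutate(chromosome, fitness_fn, fit_max, mut_per_max, rng):
    """Flip randomly chosen bits and keep the mutant only if it is not worse."""
    fit = fitness_fn(chromosome)
    n = mutation_count(fit, fit_max, len(chromosome), mut_per_max)
    candidate = list(chromosome)
    for i in rng.sample(range(len(candidate)), n):
        candidate[i] ^= 1                          # flip bit i
    # Re-evaluate after mutation and retain the higher-quality chromosome.
    return candidate if fitness_fn(candidate) >= fit else list(chromosome)


# Toy fitness: number of ones in an 8-bit chromosome (Fit_max = 8).
improved = mutate([0] * 8, sum, fit_max=8, mut_per_max=0.5, rng=random.Random(0))
kept = mutate([1] * 8, sum, fit_max=8, mut_per_max=0.5, rng=random.Random(0))
```

Chromosomes far from the target are changed in many positions, while near-optimal chromosomes are only fine-tuned, which matches the local-search role the text assigns to mutation.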
For the Evolution Hardware fault-tolerant system, the proposed improved genetic algorithm is as follows:
(1) In the system design stage, the fault analysis tree is used to predict the possible faults in the system, and the corresponding fault configuration information is generated by the evolution algorithm and added to the fault repair library.
...
(6) ... go to (8); if the evolution reaches the generation limit, then go to (7), and the repair is stopped; otherwise, continue with operation (4).
(7) If the evolution fails, go to (9).
(8) According to the output of the VRC module and the actual output of the system under test, obtain the final output.
(9) The repair process is terminated.
IV. IMPLEMENTATION
The fault tolerance and real-time performance of the system are guaranteed by the fast convergence and the high convergence rate of the improved genetic algorithm. When the system is running, we can usually quickly detect the occurrence of a circuit fault through the fault detection system, but it usually takes too much time to repair the fault. Therefore, we use an 8-bit parity checker and a 2-bit multiplier as our experimental subjects. After a fault is detected, four different algorithms are used to repair the circuit in order to verify the feasibility of the scheme and the efficiency of the recovery algorithm: the improved genetic algorithm (IGA), the standard genetic algorithm (SGA), the particle swarm optimization algorithm (PSO) and the simulated annealing algorithm (SA). Then, the fault repair scheme is qualitatively compared with several types of evolution hardware fault-tolerant schemes proposed in recent years, to verify the optimization of the proposed scheme and the system's reliability.
It should be noted that this experiment is conducted after the fault injection; although the circuit failure has occurred, the normal output of the circuit is still maintained through the fault compensation mechanism. In other words, the real-time aspect of the fault recovery is ensured. When the fault occurs, the truth table is used to maintain the normal functioning of the system while the fault circuit is isolated and recovered. Over a short time, the error probability of the truth table is very low. As long as the fault can be repaired in the shortest time, the system will be able to operate normally.
A. IMPLEMENTATION OF THE PROGRAM AND PARAMETERS
In this experiment, the fitness evaluation of the EHW is calculated using internal evolution; in other words, each group of chromosomes is downloaded to the FPGA to assess the fitness, which accelerates the fitness calculation and reduces its time cost. Considering the time cost of reconfiguration and the computational cost of the fitness calculation, we adopt a more efficient chromosome coding to reduce the complexity of the evolution circuit and the computational cost. In this experiment, we use virtual reconfigurable circuit technology (Virtual Reconfigurable Circuits, VRC) [28]-[30]. Figure 4 shows the virtual reconfigurable circuit diagram of an 8-bit parity checker. The virtual reconfigurable circuit of the 8-bit parity checker uses a matrix scale of 8 * 4, with a total of 32 CFB modules. Each module contains two inputs and one output. The inputs of column 0 are derived from the data inputs of the entire circuit, and its outputs are used as the data inputs of the next column. In the circuit information configuration, we use the parallel mode: the eight CFB modules of the same column are configured together to save time. After the modules in the third column complete the corresponding function, the calculation result of the final 8-bit parity can be selected according to the output of the last column. The improved genetic algorithm uses the Cartesian genetic coding scheme. Cartesian Genetic Programming (CGP) is often used in evolution hardware and evolution circuit design, and it is suitable for combinatorial logic circuit evolution design [31]-[33]. In CGP, the circuit is represented as an acyclic graph, and the acyclic graph represents the configuration that can be used to evolve the circuit.
The configuration information of the chromosome structure defines the logical functions of the specific nodes, the internal connections between the nodes, the connection between the input and the nodes, and the connection between the output and the nodes. Each node unit in CGP can be represented as a gate or other data circuit element. The input data is operated by each node of the CGP, similar to obtaining the output of the circuit through the respective gate circuits. Through the continuous evolution of chromosomes, the dynamic structure of the circuit is adjusted to fully exploit the potential coding ability of the CGP.
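How a CGP-style chromosome configures an 8 * 4 matrix of two-input nodes can be illustrated with a small software model. The function set (AND, OR, XOR, pass-through), the gene layout `(function, input_a, input_b)` and the restriction that each column reads only from the previous column are assumptions of this sketch and are not the CFB definition used in the experiment.

```python
from itertools import product

ROWS, COLS = 8, 4
AND, OR, XOR, PASS = range(4)
FUNCS = {
    AND: lambda a, b: a & b,
    OR: lambda a, b: a | b,
    XOR: lambda a, b: a ^ b,
    PASS: lambda a, b: a,          # forwards its first input unchanged
}


def evaluate(chromosome, inputs, out_row=0):
    """Propagate the 8 input bits column by column and return one output bit."""
    signals = list(inputs)
    for col in range(COLS):
        genes = chromosome[col * ROWS:(col + 1) * ROWS]
        # Every node reads two signals from the previous column (or the inputs).
        signals = [FUNCS[f](signals[a], signals[b]) for f, a, b in genes]
    return signals[out_row]


def fitness(chromosome):
    """Number of truth table rows (out of 256) with the correct parity bit."""
    return sum(evaluate(chromosome, bits) == sum(bits) % 2
               for bits in product((0, 1), repeat=8))


def parity_chromosome():
    """A hand-built XOR tree; unused nodes are idle pass-through nodes."""
    idle = (PASS, 0, 0)
    col0 = [(XOR, 2 * r, 2 * r + 1) for r in range(4)] + [idle] * 4
    col1 = [(XOR, 0, 1), (XOR, 2, 3)] + [idle] * 6
    col2 = [(XOR, 0, 1)] + [idle] * 7
    col3 = [(PASS, 0, 0)] + [idle] * 7
    return col0 + col1 + col2 + col3


reference = parity_chromosome()
score = fitness(reference)
```

A fitness of 256 means the configured matrix reproduces the full truth table of the parity checker; an evolved repair would be accepted once it reaches the same score.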
B. EFFECTIVE FAULT INJECTION
As we know, when a microprocessor fails in a complex environment, we must use fault-tolerant technology to ensure the reliability of the circuits. In order to improve the practicality of fault-tolerant technology, we must be able to simulate actual and effective circuit failures in experiments. We therefore designed a flexible, easy-to-operate fault injection tool. It scans the Verilog HDL code and then accurately locates the variables in the code through syntax and semantic analysis, so users can select variables for fault injection according to their own needs. The fault injection tool is mainly composed of functional modules such as a graphical interface, a syntax semantic analyzer and a fault injection manager. As shown in Figure 5, through the graphical interface, the user can import the entire Verilog project into the tool, and after the syntax semantic analyzer has processed it, the user is presented with the variables into which faults can be injected. Syntax and semantic analysis is the core part of the tool. We use regular expressions and Nondeterministic Finite Automata (NFA) to set syntax and semantic rules to achieve code scanning and variable attribute recognition. Through syntax and semantic analysis, the tool can determine the hierarchical relationship between modules, generate a parsing list of models, establish a syntax tree for each element, and finally get all possible fault injection points. The main function of the fault injection manager is to obtain fault injection parameters and pass them to the underlying functions, and then use the functions to achieve fault injection. As shown in Figure 6, the parameters that can be set by the user are fault bits, fault models, and injection cycles. We have verified the practicability and reliability of the fault injection tool through many experiments. It can realize the accurate fault injection of the signal and can be effectively used for the reliability test of the fault tolerance mechanism.
C. ANALYSIS OF THE EXPERIMENTAL RESULTS
Making full use of the hardware/software cooperation offered by the ZYNQ chip, we run the various algorithms in the PS part to generate the corresponding fault repair information, and we transfer the chromosome configuration information over the AXI bus to the PL part, where the chromosome fitness is evaluated. In the PL part, we implement the fault detection mechanism, the fault repair library and the fault compensation mechanism. On this ZYNQ-based fault-tolerant platform, we use each algorithm to carry out the evolution fault-tolerance experiments. First, we predict the faults that could occur in the circuit and use the evolution algorithm in advance to generate the corresponding fault repair configuration information, which we store in the repair library. Next, we emulate a circuit failure with a fixed fault injection method, in which inputs of the VRC functional matrix are tied to constants (0 or 1) so that the circuit fails. We repeated the experiment several times with the fixed bit moved to different positions, so that the injected faults resemble those caused by environmental effects as closely as possible, which improves the credibility of the experiment. The important parameters of the algorithm and circuit in the experiment are shown in Table 1.
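The fixed fault injection and the fitness evaluation can be illustrated with a small software model. The Python sketch below is an assumption for illustration only: it forces one primary input of an 8-bit parity function to a constant rather than modeling the VRC functional matrix, and it scores a circuit by the fraction of truth-table rows it reproduces. The names `parity8`, `with_stuck_at` and `fitness` are hypothetical.

```python
from itertools import product

def parity8(bits):
    """Reference 8-bit parity: 1 when the number of 1 bits is odd."""
    return sum(bits) % 2

def with_stuck_at(circuit, index, value):
    """Wrap `circuit` so that input `index` is forced to the constant `value`."""
    def faulty(bits):
        forced = list(bits)
        forced[index] = value  # the injected stuck-at fault
        return circuit(forced)
    return faulty

def fitness(candidate, reference, n_inputs):
    """Fraction of truth-table rows on which `candidate` matches `reference`."""
    rows = list(product((0, 1), repeat=n_inputs))
    hits = sum(candidate(row) == reference(row) for row in rows)
    return hits / len(rows)

# Move the fixed bit across all input positions, as in the repeated experiments.
for position in range(8):
    broken = with_stuck_at(parity8, position, 0)
    print(position, fitness(broken, parity8, 8))
```

For a parity function, a stuck-at fault on any single input gives the wrong output on exactly the rows where that input differs from the forced constant, so each faulty circuit in this model matches half of the 256 rows.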
The experiments are carried out on the ZYNQ, and the results for the 8-bit parity checker and the 2-bit multiplier are shown in Figures 7 and 8. For the 8-bit parity checker, which has a relatively simple structure, the four algorithms differ considerably in the repair success rate of the fault-tolerant system. The best of them is the improved genetic algorithm: it completes all of the fault repair operations within 4000 generations, with a repair success rate of 75% within 1000 generations and 95% within 2000 generations. Its success rate is the highest, and its evolution time the shortest, among the four algorithms. Although the other three algorithms also reach a high success rate by approximately 4000 generations, they converge much more slowly than the improved algorithm.
For the 2-bit multiplier, which has a more complex structure, the improved genetic algorithm already reaches a high repair rate within 1000 generations, whereas the other algorithms perform poorly. Its repair rate is 80% within 2000 generations and 95% within 6000 generations, and by 10000 generations the fault repair is essentially complete.
Figures 7 and 8 show that the efficiency of the algorithms varies greatly. PSO optimizes the population chromosomes by sharing information between particles. Its overall convergence efficiency is high, but its early convergence is slow, and in the later stage the population diversity is poor and the global optimization ability is weakened. By contrast, the improved genetic algorithm adjusts the crossover and mutation operators according to chromosome fitness as the generations increase. In the early period of evolution, the average fitness of the population quickly reaches a high level. In the later period, because the chromosome fitness is close to the maximum fitness, the number of mutations decreases correspondingly, and a local search finally finds the best chromosome. Figure 9 shows the repair circuit and the chromosome coding structure of the 8-bit parity checker. Table 2 gives a comprehensive comparison of the two experiments, which can be used to analyze the applicability of the four algorithms to the fault-tolerant system on the ZYNQ platform. The improved genetic algorithm has the advantage in both the repair rate and the average number of generations needed for a repair. It raises the success rate of evolving a repair circuit and shortens the evolution time, which speeds up fault repair and, finally, satisfies the real-time requirement. Although the evolution takes some time and the required repair period cannot be accurately predicted, in the short term the truth table can maintain a normal output, which guarantees the normal output of the system during the repair period and gives the evolution process time to complete the repair, thus strengthening the system's reliability and stability.
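The fitness-dependent mutation behavior described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the operator used in the experiments: the linear rule in `mutation_count`, the bounds `max_mutations` and `min_mutations`, and the bit-list chromosome are all hypothetical choices made for the example.

```python
import random

def mutation_count(fitness, max_fitness, max_mutations=8, min_mutations=1):
    """Scale the number of mutated genes with the remaining fitness gap."""
    gap = (max_fitness - fitness) / max_fitness  # 1.0 = worst, 0.0 = perfect
    span = max_mutations - min_mutations
    return min_mutations + round(span * gap)

def mutate(chromosome, fitness, max_fitness, rng):
    """Flip as many distinct bits of a bit-list chromosome as the rule allows."""
    child = list(chromosome)
    k = min(mutation_count(fitness, max_fitness), len(child))
    for position in rng.sample(range(len(child)), k):
        child[position] ^= 1
    return child

rng = random.Random(0)        # fixed seed so the run is repeatable
parent = [0] * 16             # hypothetical 16-bit chromosome
for fit in (0, 128, 192, 256):
    print(fit, mutation_count(fit, 256), mutate(parent, fit, 256, rng))
```

With this rule, a chromosome far from the maximum fitness is perturbed in many positions, which favors exploration early in the run, while a nearly correct chromosome is changed in only one position, which approximates a local search. A crossover probability could be scaled by the same fitness gap.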
From the above experimental analysis, we conclude that the real-time fault repair scheme using the improved genetic algorithm achieves the highest fault repair rate in the shortest time among the algorithms tested, while the system is not affected by the fault. The fault repair schemes of Liu and Zhang also emphasize online and real-time repair. However, because complex circuit systems differ, the repair period at each level cannot be accurately obtained, and although those schemes can achieve the repair in a short time, neither can ensure the normal operation of the system during the repair time. In this paper, we propose a fault compensation mechanism that maintains the normal function of the system during fault recovery. We only need to reinforce the truth table so that the system output is unlikely to become faulty while the truth table is in use.
The fault compensation mechanism in this paper provides sufficient time for the repair. The fault repair can then be completed in the shortest possible time while both the real-time and the fault-tolerance requirements are met, which improves the reliability of the system.
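The interplay between compensation and repair discussed in this section can be sketched in software. The Python model below is illustrative only and rests on several assumptions: the truth table is a dictionary recorded from a reference function, faults are identified by hypothetical string keys, and `evolve` stands in for any evolution algorithm that accepts seed configurations and a generation budget. None of these names comes from the implementation on the ZYNQ.

```python
from itertools import product

def multiplier2(bits):
    """Reference 2-bit multiplier on inputs (a1, a0, b1, b0)."""
    a = bits[0] * 2 + bits[1]
    b = bits[2] * 2 + bits[3]
    return a * b

def build_truth_table(reference, n_inputs):
    """Record the module output for every input row at design time."""
    return {row: reference(row) for row in product((0, 1), repeat=n_inputs)}

def compensated_output(inputs, truth_table):
    """Serve the recorded output while the faulty module is isolated."""
    return truth_table[tuple(inputs)]

def recover(fault_id, repair_library, evolve, max_generations):
    """Return (configuration, source) for a detected fault.

    The repair library is queried first. On a miss, its entries seed the
    evolution algorithm, which may return None if no repair is found.
    """
    if fault_id in repair_library:
        return repair_library[fault_id], "library"
    seeds = list(repair_library.values())
    return evolve(seeds, max_generations), "evolution"

table = build_truth_table(multiplier2, 4)
library = {"stuck_at_0_in3": "cfg_A"}  # hypothetical preset repair entry
print(compensated_output((1, 1, 1, 1), table))
print(recover("stuck_at_0_in3", library, lambda seeds, g: None, 10000))
```

Because `compensated_output` depends only on the recorded table, the module output stays correct however long `recover` takes, which is the property that removes the need to know the repair period in advance.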
V. CONCLUSIONS
Existing fault repair schemes usually mention online repair, that is, repairing the circuit without affecting the normal operation of the system. However, the repair time required by a faulty circuit varies with the system structure and is uncertain, and the system isolates the faulty circuit after a fault occurs. If the repair cannot be completed within the repair period, the performance of the system is seriously affected. Existing repair schemes rarely address this problem. This paper mainly studies the real-time performance of system repair. Although we cannot determine the time limit for fault repair, we establish a fault compensation mechanism, so that the system can still operate normally after the faulty circuit is isolated. Even if the faulty circuit cannot be repaired within the repair period, the performance of the system is not affected. On this basis, we focus on improving the repair efficiency of evolution algorithms, obtaining repair circuits quickly through an efficient evolution algorithm to improve fault tolerance and real-time performance.
To let the fault analysis tree detect circuit failures quickly, we divide the circuit into modules by layering it at system design time and record the truth table of each module. The fault repair library is preset with quick fixes for failures that are likely to occur frequently, and its entries can also serve as the initial population for the evolution algorithm. After a fault occurs, the fault analysis tree quickly locates the faulty part. We then use the fault compensation mechanism to isolate the faulty circuit and use the truth table to maintain the normal output of the circuit. Next, we query the fault repair library. If it contains matching repair information, we repair the failure directly; otherwise, we use the improved genetic algorithm to obtain a repair circuit through fast iteration, and finally we download it to complete the repair. Compared with the work of other researchers, the real-time fault repair scheme in this paper offers improvements in three aspects.
(1) In the system design, the circuits at all levels are divided into modules, and we record their truth tables so that the fault analysis tree can analyze a failure quickly.
(2) We use the fault compensation mechanism to ensure that the system is not affected by the faulty circuit, which achieves online repair and preserves real-time operation during the repair.
(3) We obtain the repair circuit quickly through the improved genetic algorithm, which improves both the repair rate and the repair speed.
Traditional fault-tolerant technology mainly uses redundant resources to repair faults, but it only provides fault tolerance and has no fault detection capability. It is also difficult to provide redundant resources for an entire electronic system. Because the evolution repair technique is random, the optimal chromosome may not be obtained in the end; however, the fitness of the locally optimal chromosome is almost the same as that of the optimal chromosome, and it can satisfy most of the circuit functions. Therefore, we can still use the locally optimal chromosome to configure the repair circuit. If the fault is detected again, the evolution repair is performed again. Circuits of different scales place different requirements on fault-tolerant solutions, and some mechanisms in the fault-tolerant scheme proposed in this paper still need to be improved. Based on the existing problems, we will focus on the following aspects.
(1) The repair scheme proposed in this paper can effectively repair a simple integrated circuit. For large-scale complex circuits, however, the key to fault tolerance is whether the system can quickly detect the fault source after a fault occurs and localize it to a specific module. Therefore, we must establish a layered, fine-grained fault detection mechanism. For large-scale circuits, dynamic fault trees can effectively describe the causal relationships between various faults. In future research, we will focus on how to establish a highly efficient global fault detection mechanism based on the dynamic fault tree, to ensure that the fault source circuit can be accurately located after a fault occurs. This can reduce the size of the repair circuit and improve the repair efficiency.
(2) A complex system requires more complex fault detection mechanisms and analog switch circuits to maintain the truth tables. Therefore, based on the fine-grained fault detection mechanism, we also need to establish a corresponding fault compensation mechanism. When the detection mechanism locates the fault source circuit, the compensation mechanism can accurately obtain the compensation output of the faulty circuit, thereby maintaining the normal operation of the circuit.
(3) Because of the increasing complexity of the system, the large range of system parameters, and the multiple sources of interference, it is difficult to describe the system model accurately, and the degree of uncertainty of the system model also affects the performance of the control system. Therefore, the impact of system uncertainty on performance and stability needs to be considered. Sliding mode technology is strongly robust against system uncertainties and interference. In the future, we will also conduct research on adaptive sliding mode fault tolerance technology. We will build a fault model based on the operating state of the system and design a controller based on adaptive sliding mode technology. After a fault occurs, its influence on the system is compensated by adjusting the controller parameters or changing the controller structure, so as to improve the fault tolerance of the system.
