Abstract—Nano-crossbar arrays are area and power efficient structures, generally realized with self-assembly based bottom-up fabrication methods as opposed to relatively costly traditional top-down lithography techniques. This advantage comes with a price: very high process variations. In this work, we focus on the worst-case delay optimization problem in the presence of high process variations. As a variation tolerant logic mapping scheme, a fast hill climbing algorithm is proposed; it offers similar or better delay improvements with much smaller runtimes compared to the methods in the literature. Our algorithm first performs a reducing operation on the crossbar, motivated by the fact that the whole crossbar is not necessarily needed for the problem. This decreases the computational load by up to 72 percent for benchmark functions. Next, initial column mapping is applied. After these first two preparatory steps, the algorithm proceeds to the last step of hill climbing row search with column reordering, where the optimization for variation tolerance is performed. As an extension to this work, we directly apply our hill climbing algorithm to defective arrays to perform both defect and variation tolerance. Again, simulation results confirm the speed of our algorithm, up to 600 times faster than the related algorithms in the literature, without sacrificing defect and variation tolerance performance.

INTRODUCTION
NANO-CROSSBAR arrays emerged as a new computing scheme with an aim of solving the long-standing miniaturization problems of CMOS circuits [1], [2], [3]. Each crosspoint of an array is used as a switching element showing field-effect transistor (FET) like behaviour with programmability features [4], [5], [6]. Therefore, nano-crossbar arrays operate similarly to conventional programmable logic arrays (PLAs) [7], [8], [9], [10], [11], [12]. Structures of a conventional PLA and a nano-crossbar array are given in Fig. 1. Any logic function f can be implemented with properly placed devices on the AND/OR planes along with the corresponding input literals, both for conventional PLAs and for nano-crossbars. For the nano-crossbar array structure in Fig. 1, each crosspoint is either a transistor working as a switch between the power supply and the ground or a short circuit. If there is a transistor on a crosspoint, the corresponding literal line controls this transistor from its gate to switch between ON and OFF states. In a case where any of the vertical lines has a connection to the ground level, the output function f becomes logic 0.
Nano-crossbar arrays offer unique features such as small area, low power consumption, and easy manufacturability. Two fully operational crossbar implementations, a nanoprocessor and a finite-state machine, are shown to be feasible in [13] and [14]. However, high process variations and defects are a major problem that significantly affects circuit performance, especially delays. Consider a function f = x_1x_2 + x_2x_3 + x_1x_3 to be implemented on a 3 × 3 nano-crossbar array having 3 output lines for 3 products and 3 input lines for 3 literals. Suppose that the delay of each crosspoint varies between d and 10d, where d is a minimum delay time. Here, 6 of the 9 crosspoints should be configured as FETs, 2 on each product line, and the rest of them are configured as disconnected lines. The selection of these 6 crosspoints plays an important role for the worst-case delay. There is a total of 3! × 3! = 36 options, the number of orderings of input and output lines, and each selection gives a delay value between 2d and 20d, since there are 2 crosspoints on each product line. If we have a chance to measure the delay contribution of each crosspoint, then by trying these 36 options we can find the best solution. However, with an increase in the number of input lines (n) and output lines (m), the number of options n! × m! quickly grows beyond practical limits. Beyond a size of only 8 × 8, it is not feasible to use an exhaustive search. Indeed, this problem, commonly known as the variation tolerant logic mapping (VTLM) problem, is NP-complete [15].
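For small arrays, this exhaustive search is easy to express directly. The following Python sketch (illustrative only; the VM delay values are hypothetical) enumerates all 3! × 3! = 36 row/column orderings of the example above and keeps the smallest worst-case delay, computed as the maximum column sum of the element-wise product of the mapped FM and VM.

from itertools import permutations
import numpy as np

# Function matrix for f = x1x2 + x2x3 + x1x3: rows are literals x1..x3,
# columns are products; exactly two 1's per column (two FETs per product line).
FM = np.array([[1, 0, 1],
               [1, 1, 0],
               [0, 1, 1]])
# Hypothetical measured crosspoint delays, each between d = 1 and 10d.
VM = np.array([[3., 9., 5.],
               [8., 2., 7.],
               [4., 6., 1.]])

best = float('inf')
for rows in permutations(range(3)):
    for cols in permutations(range(3)):
        PM = FM[np.ix_(rows, cols)] * VM          # mapped performance matrix
        best = min(best, PM.sum(axis=0).max())    # worst-case column delay
print(best)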
In this work, we focus on the worst-case delay optimization problem in the presence of high process variations. We propose a fast hill climbing algorithm for the VTLM problem that improves the worst-case delay by up to 25 percent, with an average of 20 percent, for our benchmark set and randomly generated matrices. We observe that the whole crossbar does not necessarily need to be used for the worst-case delay optimization problem, so our algorithm first performs a reducing operation on the crossbar. This significantly decreases the computational load of the algorithm, by up to 72 percent for our standard benchmark set. Next, initial column mapping is applied. After these first two preparatory steps, the algorithm proceeds to the last step of hill climbing row search with column reordering, where the optimization for variation tolerance is performed.
Since our algorithm primarily eliminates crosspoints with the highest delay values, it can also be used for defect tolerance by assigning relatively high delay values to defective crosspoints. Thus, both defect and variation tolerance can be achieved. However, if defects are considered, the proposed matrix reducing approach, the first step of the algorithm, cannot be used, since defects spread over the whole crossbar must all be tolerated. Running our algorithm for defect and variation tolerance, we obtain up to 600 times lower runtimes compared to the related algorithms in the literature without sacrificing defect and variation tolerance performance. This clearly confirms the efficiency of the proposed algorithm.
Previous Work
Although defect/fault tolerant logic mapping for nano-crossbars has long been studied [16], research on variation tolerance is relatively new. First, Gojman and DeHon consider variations in crosspoint transistor parameters to accurately determine the placement of defects, as opposed to using randomly assigned defect maps [17]. They propose a post-fabrication mapping algorithm (VMATCH) to tolerate defects caused by variations in the threshold voltage values of crosspoint FETs. Using independent Gaussian distributions, they declare a crosspoint defective when the ON resistance of its FET is larger than the OFF resistance. As a result, this work can be considered as a transition between defect and variation tolerance methods. However, it does not directly focus on variation tolerant performance optimization.
As a complete variation tolerance methodology, Tunc and Tahoori propose a logic mapping algorithm based on simulated annealing [18]. Additionally, they offer a delay testing technique for nano-crossbar arrays to obtain the delay contributions of crosspoints, which is needed for constructing a variation matrix. Since the algorithm uses randomly selected iterations without progress monitoring, its efficiency is questionable; sufficient results cannot be achieved unless a relatively high number of trials is reached. On the other hand, our algorithm is designed to make continuous progress; that is why we call it a hill climbing algorithm. Another approach, based on integer linear programming, is proposed by Zamani et al. [15]. Although satisfactory delay results can be achieved by this systematic method, runtime values are even worse than those of the simulated annealing algorithm.
Yang et al. propose a different approach for the VTLM problem using a non-dominated sorting genetic algorithm [19]. While it finds near Pareto-optimal solutions, the time overhead of this algorithm is disadvantageous. On average, its runtimes are generally much higher (30-40×) than ours. In order to improve runtimes, Zhong et al. use a greedy re-assignment technique [20], originally proposed in [21]. We also use this method as an alternative to our initial column mapping method and compare optimization results.
Another evolutionary algorithm is proposed by Zhong et al. [22]; it is a bi-level multi-objective optimization algorithm that uses different approaches on row and column mappings, defined as lower and upper level problems. Every individual of the upper level problem must first be solved as a lower level problem, which puts a heavy burden on the lower level (row order) algorithm. Therefore, the overall algorithm performance mostly depends on the performance of the lower level algorithm, defined as a min-max-weight and min-weight-gap bipartite matching problem solved by a heuristic variant of the Hungarian method. A follow-up work, fundamentally based on the same approach, is also proposed [23]. It achieves both defect and variation tolerance using a memetic algorithm. Both of these algorithms obtain much better delay values than the algorithm in [19]. However, runtime is still an issue. Our algorithm gives delay values in the same range while having considerably lower runtimes.
There are also studies focusing on adding extra rows or columns for better variation tolerance at the cost of area yield. In their work, Zamani and Tahoori use row redundancies with duplicated input lines [24]. Their approach successfully reduces the critical path delay by an average of 50 percent. Along with the area overhead problem, these studies have a logical flaw: the amount of redundancy should be known before fabrication, but it can only be determined after fabrication (after the post-fabrication delay test).
Another evaluation factor is an algorithm's capability to tolerate both defects and variations. Among the mentioned studies, those in [18], [19], [22], and [23] can be applied, either directly or with slight modifications, for defect and variation tolerance. We consider all of these algorithms in the experimental results part. There are other algorithms targeting both defects and variations [25], [26], but they are outperformed by the considered algorithms. Note that our algorithm is also directly applicable for defect and variation tolerance.
Organization
The organization of this work can be summarized as follows. We introduce preliminaries for the VTLM problem and define our performance objectives in Section 2. We propose our VTLM algorithm in Section 3, with its steps detailed in the corresponding subsections. In these steps, we use a function matrix reducing method, an initial column mapping method, and a hill climbing row search with column reordering. In this section, we also explain how to use the proposed algorithm for both defect and variation tolerance. Experimental results are given in Section 4, and Section 5 concludes this study with insights and future directions.
PRELIMINARIES
In the variation tolerant logic mapping scheme, a target logic function and a nano-crossbar are generally represented by matrices, called a function matrix FM and a variation matrix VM, respectively. The goal of any mapping algorithm is to achieve a desired performance by determining the proper row and column orders for FM to be mapped on VM.
A binary matrix FM indicates a logic function to be mapped onto a nanoarray. As a general design topology, matrix rows and columns represent function literals and products, respectively. If a literal is included in a function product, the intersection of the corresponding row and column is tagged as '1' (the crosspoint behaves as a switch); otherwise '0' (the crosspoint is an open circuit; crossed lines are disconnected). An example of an FM and the mapping scheme on a nano-crossbar array is given in Fig. 2. As a representation of a crossbar, VM holds the switching delay values of all crosspoints. These values are determined by delay testing methods [18], explained as follows.
Delay Testing: all crosspoints act as FET type switches and stay constant at non-controlling values (logic 1 for NAND and logic 0 for NOR type). In this state, for each input, the tester applies falling and rising control signals while the other inputs stay at the same non-controlling value. Here, while switching an input i, the transition delay observed at output j is taken as the delay value of the crosspoint located at (i, j). To generate VM for the VTLM process, the average of the rising and falling transition times is used [18]. Note that if a crosspoint delay value is relatively high (more than 10×) compared with other crosspoint delays, or the crosspoint does not switch at all, it is considered defective, and defect tolerance methods are applied.
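A minimal sketch of this VM construction is given below, assuming the tester reports per-crosspoint rise and fall transition delays as two m × n arrays; the function name and the 10× defect threshold follow the description above, and a crosspoint that never switches is encoded as NaN.

import numpy as np

def build_vm(rise_delays, fall_delays, defect_factor=10.0):
    # Average rising and falling transition times form the VM entries.
    vm = (np.asarray(rise_delays) + np.asarray(fall_delays)) / 2.0
    # Crosspoints that never switch (NaN) or are ~10x slower than typical
    # are treated as defective and left to defect tolerance methods.
    typical = np.nanmedian(vm)
    defective = np.isnan(vm) | (vm > defect_factor * typical)
    return vm, defective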
For simulations, it is widely assumed that the measured delay values show a Gaussian (Normal) distribution with a mean μ, a standard deviation σ, and a coefficient of variation or relative standard deviation COV = σ/μ [18], [22], [23]. An example of a VM and the random delay value generation scheme are given in Fig. 3.
VM(i, j) = delay of crosspoint (i, j), drawn from Gaussian(μ, σ).  (1)
VM parameters: m is the number of horizontal lines/wires and n is the number of vertical lines. Note that FM and VM have the same size; no extra crossbar redundancy, a common practice in defect tolerance, is used. The performance matrix PM is the Hadamard (element-wise) product of the FM and VM matrices:

PM = FM ∘ VM.  (2)
Since the highest column delay, i.e., the sum of all delay values in a column, represents the worst-case delay for a FET type nanoarray, we define a row matrix FPM, the FET performance matrix, holding the column sums of PM: FPM(j) = Σ_i PM(i, j) for j = 1, ..., n.
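The definitions above translate directly into code. The following sketch (sizes, mean, and COV are arbitrary choices for illustration) draws a Gaussian VM, forms PM as the Hadamard product, and reduces PM to the row vector FPM of column delay sums, whose maximum is the worst-case delay.

import numpy as np

rng = np.random.default_rng(0)
m, n, mu, cov = 6, 6, 100.0, 0.2
VM = rng.normal(mu, cov * mu, size=(m, n))     # Gaussian(mu, sigma), COV = sigma/mu
FM = (rng.random((m, n)) < 0.4).astype(float)  # random FM with CR = 40%

PM = FM * VM                 # Equation (2): Hadamard product
FPM = PM.sum(axis=0)         # column delay sums (one value per product line)
worst_case = FPM.max()       # the quantity Objective 1 minimizes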
Objectives
In the literature, three main objectives are considered while developing variation tolerant delay optimization algorithms. The first is minimizing the worst-case delay, i.e., minimizing the highest valued column in FPM. The second is maximizing the best-case delay, i.e., maximizing the lowest valued column in FPM. Finally, the third is minimizing the difference between the worst-case and best-case delays. These objectives are defined in Equations (3), (4), and (5):

Objective 1: minimize max_j FPM(j),  (3)
Objective 2: maximize min_j FPM(j),  (4)
Objective 3: minimize ( max_j FPM(j) - min_j FPM(j) ).  (5)

In our work, we focus on Objective 1, since worst-case delay optimization is by far the most desired and used one in the literature. On the other hand, Objective 2 is generally used for defective crossbars, especially to take stuck-at-zero defects into account. Additionally, Objective 3 is valid for very specific applications requiring almost constant delay values.
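Given the FPM vector from the previous sketch, the three objectives reduce to a few lines; this helper only evaluates them for a fixed mapping, while the mapping algorithms search over row and column orders to optimize them.

def objectives(FPM):
    worst = FPM.max()          # Objective 1: minimize the worst-case delay
    best = FPM.min()           # Objective 2: maximize the best-case delay
    spread = worst - best      # Objective 3: minimize the delay spread
    return worst, best, spread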
Our general logic mapping scheme for worst-case delay optimization is given in Fig. 4. Here, we use the interchangeability of rows and columns of the function matrix and find a better mapping to achieve Objective 1.
PROPOSED ALGORITHM
In this section, we introduce all stages of our hill climbing algorithm, including function matrix reducing, initial column mapping, and hill climbing row search with column reordering. As a pre-processing method, we first use our FM reducing method, which finds unnecessary columns to be excluded.
Next, we start the mapping process. Here, the only tool that we have is interchanging rows and columns of FM. Since the total delay value in a column determines the worst-case delay, changing rows or columns has different effects on Objective 1. Therefore, we must decide whether to deal with rows or columns first. It is obvious that column ordering followed by row ordering is constructive; performance improvement is guaranteed. On the other hand, performing column ordering after row ordering is destructive; ordering columns undoes our initial effort of ordering rows. As a result, we first perform column mapping and then do precise tuning by row search. We repeat this process using column reorderings until a maximum number of reorderings is reached.
The workflow of the proposed algorithm is given in Fig. 5. The steps of the algorithm are given in the following three sections. The fourth section analyses the probabilistic context of our hill climbing algorithm, and the fifth section explains how to use the proposed algorithm for both defect and variation tolerance.
Function Matrix Reducing
Since we are only interested in the highest valued column of FPM, relatively low valued columns with low transistor counts, which do not determine the worst-case delay performance, are not needed. An example is given in Fig. 6. Here, we use a threshold defined as a percentage of the maximum number of 1's in a column of FM. We remove columns having values under this threshold from FM. Then we run our algorithm using the reduced form of FM to achieve Objective 1, and compare the results with those obtained with the standard, non-reduced FM. The figure tells us that up to a threshold of nearly 65 percent, there is no difference between the worst-case delay values. On the other hand, the runtime of the reduced algorithm decreases as the threshold increases, since the matrix gets smaller. As a result, by matrix reducing, we can achieve a 35 percent runtime improvement without any degradation of the worst-case delay for this specific example. Motivated by this, we propose a method that effectively finds the threshold.
First, for each column of FM, we determine lower and upper bounds on delay values as well as a specific limit value. Suppose that the ith column C_i of FM has a transistor count of T_i, the number of 1's. For each column of VM, the minimum and maximum sums of T_i elements are determined. Considering that VM has n columns, we have n minimum sum and n maximum sum values for each FM column. The lower bound lb_i for the ith column C_i is the lowest of the n minimum sum values of C_i. The upper bound ub_i is the highest of the n maximum sum values of C_i. Additionally, we define another bound lim_i as the highest of the n minimum sum values.
In the next step, we find the maximum valued lb_i and lim_i over 1 ≤ i ≤ n and call them lb_max and lim_max, respectively. Then we check whether ub_i is smaller than lim_max for every i with 1 ≤ i ≤ n. If it is, we remove the corresponding column from FM. The same procedure can be applied using lb_max. Using lb_max would allow us to remove columns without any increase in delay values. However, since using lim_max allows a considerably higher number of columns to be removed with very small delay increases, we prefer it.
A pseudo code of our reducing algorithm is given in Algorithm 1. To elucidate the algorithm, an example is given in Fig. 7. Here, a 6 × 6 sized FM and VM are used. Column C_1 has four 1's. In order to find the n = 6 maximum sum values, the sum of the 4 highest values in each VM column is calculated. The highest of these 6 sums is ub_1, found as 286 for this example. Similarly, to find the n minimum sum values, the sum of the 4 lowest values in each VM column is calculated. The lowest of these 6 minimum sums is lb_1 and the highest is lim_1, found as 111 and 210, respectively, for this example. After finding all ub_i, lb_i, and lim_i values, lb_max and lim_max are determined as 188 and 270. Then, the ub_i of each C_i is compared with lb_max or lim_max to determine the columns to be removed. Note that using lim_max as the bound removes more columns (C_2, C_4, C_6) than using lb_max, which removes two columns (C_4, C_6).
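A Python sketch of this reducing step is given below, following Algorithm 1 as described above. It assumes every FM column contains at least one 1, and returns the reduced FM together with the indices of the kept columns.

import numpy as np

def reduce_fm(FM, VM, use_lim=True):
    m, n = FM.shape
    sorted_vm = np.sort(VM, axis=0)                  # each VM column, ascending
    lb, ub, lim = [], [], []
    for i in range(n):
        t = int(FM[:, i].sum())                      # transistor count T_i
        min_sums = sorted_vm[:t, :].sum(axis=0)      # T_i lowest values per VM column
        max_sums = sorted_vm[m - t:, :].sum(axis=0)  # T_i highest values per VM column
        lb.append(min_sums.min())                    # lb_i
        lim.append(min_sums.max())                   # lim_i
        ub.append(max_sums.max())                    # ub_i
    bound = max(lim) if use_lim else max(lb)         # lim_max or lb_max
    keep = [i for i in range(n) if ub[i] >= bound]
    return FM[:, keep], keep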
Initial Column Mapping
We treat the columns of FM one by one, starting from the column having the highest number of 1's down to the lowest one. For each column of FM, lb_i is first determined in a similar way as in the matrix reducing step. The only difference is that here we determine lb_i among the unmapped columns of VM, as opposed to considering all of the columns. Then, the VM column corresponding to lb_i is mapped to the column of FM. Algorithm 2 gives a pseudo code for the proposed column mapping.
Algorithm 2. Proposed Initial Column Mapping
1: Input: FM_{m×n}, VM_{m×n}, column count n
2: Output: FM_initial
3: FM_sorted ← high-to-low sort of FM columns by '1' sums
4: for each FM_sorted column i do
5:   T_i ← transistor count of C_i
6:   for each VM column j do
7:     All_lb_{i,j} ← sum of the lowest T_i values in VM(j)
8:   end for
9:   Pos ← position of the minimum All_lb_i for C_i
10:  FM_initial(Pos) ← C_i
11:  remove column VM(Pos)
12: end for
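A direct Python rendering of Algorithm 2 is sketched below; it returns, for each FM column, the index of the VM column it is mapped to.

import numpy as np

def initial_column_mapping(FM, VM):
    m, n = FM.shape
    order = np.argsort(-FM.sum(axis=0))    # FM columns, most 1's first
    sorted_vm = np.sort(VM, axis=0)        # each VM column, ascending
    available = list(range(n))             # unmapped VM columns
    mapping = np.empty(n, dtype=int)
    for i in order:
        t = int(FM[:, i].sum())            # T_i for column C_i
        lbs = [sorted_vm[:t, j].sum() for j in available]
        mapping[i] = available.pop(int(np.argmin(lbs)))  # VM column giving lb_i
    return mapping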
Hill Climbing Row Search with Column Reordering
After determining an initial column map, our algorithm starts its search by finding the PM column having the worst-case delay value, corresponding to the highest value in FPM. Then, in each try, two FM rows are interchanged such that the delay value of this worst-case column is reduced by a maximum amount, followed by a repeated search for the worst-case PM column. The maximum amount is achieved by finding the highest and lowest values in the VM column corresponding to 1 and 0 values in the FM column, respectively; the corresponding FM rows are interchanged. This whole procedure is repeated until the new worst-case value is no longer smaller than the previous one. If the new worst-case value is only getting larger, the row search is stopped and the column reordering starts, considering the previous case having the most reducible worst-case delay.
After reaching a non-optimizable point, the algorithm checks which column had the worst-case delay value most often during the previous row search step. This column is switched with a column having the lowest transistor count. After each column reordering, the algorithm goes back to the row search. The number of column reorderings is upper-bounded by the number of columns. At the end, among all non-optimizable points, the best one is selected as the output.
A pseudo code of the row search and column reordering steps is given in Algorithm 3. To elucidate the algorithm, an example is given in Fig. 8 for a 12 × 12 sized array. Here, point 'A' represents an initial column map. Each downward slope shows the hill climbing row search in action. Column reorderings are shown with 'B' points. Finally, point 'C', having the smallest delay among all non-optimizable points, is selected as the output.
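A simplified Python sketch of the row search core is given below. Algorithm 3 also tracks the most frequent worst-case column for reordering and the best non-optimizable point; that bookkeeping is omitted here for brevity.

import numpy as np

def row_search(FM, VM):
    FM = FM.copy()
    wc_prev = (FM * VM).sum(axis=0).max()
    while True:
        FPM = (FM * VM).sum(axis=0)
        col = int(np.argmax(FPM))              # current worst-case column
        f, v = FM[:, col], VM[:, col]
        ones = np.flatnonzero(f == 1)
        zeros = np.flatnonzero(f == 0)
        if ones.size == 0 or zeros.size == 0:
            return FM
        r1 = ones[np.argmax(v[ones])]          # costliest '1' position
        r0 = zeros[np.argmin(v[zeros])]        # cheapest '0' position
        if v[r0] >= v[r1]:
            return FM                          # no swap reduces this column
        FM[[r1, r0], :] = FM[[r0, r1], :]      # interchange the two FM rows
        wc_new = (FM * VM).sum(axis=0).max()
        if wc_new >= wc_prev:                  # non-optimizable point reached
            FM[[r1, r0], :] = FM[[r0, r1], :]  # revert the last swap
            return FM
        wc_prev = wc_new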
Probabilistic Analysis of the Proposed Algorithm
Our proposed hill climbing algorithm uses binary switching between the 0's and 1's of the current worst-case delay column.
If there is no better worst-case delay on the whole array after all binary moves, the algorithm considers the current state a limit and conducts a column reordering to start the search with a new column layout. Here, we can build a basic probabilistic model to inspect the algorithm's characteristics. There are two cases causing a search to get stuck. First, binary changes on the current worst-case column do not find a better worst-case on that particular column. Second, there is no better worst-case delay on the whole array after checking all binary changes. Inspecting the first case, it is apparent that it can happen only if all the crosspoint delays at 0 locations are higher than those at 1 locations. Given that the number of 1's in a column is P_1, the number of 0's is r - P_1, where r is the row count. To generalize for random arrays, we can estimate P_1 as r × CR, where CR is the crosspoint ratio. For independent, identically distributed delays, this specific failing ordering requires the P_1 smallest of the r values to occupy exactly the 1 locations, so its probability is

P_Stuck = 1 / C(r, P_1) = P_1! (r - P_1)! / r!.  (6)

Note that CR = 0.5 gives the lowest value of this equation for any row count.
For the second case, since we have no information about the other columns except that they are lower than the inspected worst-case column, we can expect random outcomes for these columns on each binary row switch. Here, a successful case means finding a lower worst-case delay than that of the current worst-case column. We denote the current worst-case delay value as wc, acting as a condition value on each column's probabilistic outcome. For a binary row search to succeed, all of the columns must stay lower than wc. Denoting by P_i the probability that column i stays below wc, this probability is

P_Continue = Π_i P_i = Π_i Pr(FPM(i) < wc).  (7)
In conclusion, our algorithm fails either with probability P_Stuck or with probability (1 - P_Stuck)(1 - P_Continue):

P_Fail = P_Stuck + (1 - P_Stuck)(1 - P_Continue).  (8)

Note that finding better wc values means a lower P_Continue, which results in a higher P_Fail after each search step.
From Equation (8), we can say that for a constant row count, increasing the column count always results in a higher P_Fail, since there would be more P_i products decreasing the P_Continue value while P_Stuck stays constant. This means a faster arrival at the lowest wc value, finishing the row search and conducting a column reorder, which results in lower runtimes but higher wc values. On the other hand, for a constant column count, increasing the row count means a lower P_Stuck value, a higher wc value, and a lower coefficient of variation on each column. For considerably high row counts, we expect the column distributions to separate from each other, resulting in higher P_Continue values. These inferences are justified by the simulations in Section 4.
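Under the reconstructed Equation (6), P_Stuck is easy to evaluate numerically; the short check below (an illustration of the model, not of the full algorithm) confirms that CR = 0.5 minimizes it for a fixed row count.

from math import comb

def p_stuck(r, cr):
    p1 = max(1, round(r * cr))   # estimated '1' count P_1 = r * CR
    return 1.0 / comb(r, p1)     # Equation (6)

for cr in (0.3, 0.5, 0.7):
    print(cr, p_stuck(12, cr))   # smallest value occurs at CR = 0.5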
Proposed Algorithm on Defect and Variation Tolerance
As an extension of our hill climbing algorithm, we consider the defect and variation tolerant logic mapping (DVTLM) problem. For this purpose, we only update VM into DVM by assigning relatively high delay values, at least 100 times larger than the delay value at 3σ, to defective crosspoints. An example of a DVM is given in Fig. 9, representing defects as infinite delays. Note that for the DVTLM problem, the proposed matrix reducing approach, the first step of the proposed algorithm, cannot be used, since defects spread over the whole crossbar must all be tolerated.
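A sketch of this DVM update is shown below, assuming defects are drawn independently per crosspoint at a given rate; the 100 × (μ + 3σ) penalty follows the description above and stands in for an infinite delay.

import numpy as np

def make_dvm(VM, defect_rate, mu, sigma, seed=0):
    rng = np.random.default_rng(seed)
    DVM = VM.copy()
    defects = rng.random(VM.shape) < defect_rate  # independent per crosspoint
    DVM[defects] = 100.0 * (mu + 3.0 * sigma)     # >> any normal delay value
    return DVM, defects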
EXPERIMENTAL RESULTS
We present simulation results of our hill climbing algorithm for both the VTLM and DVTLM problems. To generate FMs, we use standard benchmarks from [27] as well as randomly generated benchmarks. For the randomly generated benchmarks, we use CR = 40%, given that the average CR value for the standard benchmarks is around 40 percent.
To generate VMs, we assume that each crosspoint, and correspondingly each delay value in the matrix, shows an independent Gaussian (Normal) distribution with a mean μ, a standard deviation σ, and a coefficient of variation COV = σ/μ. For further evaluation, we also consider different distributions including Weibull, exponential, and Beta distributions. We mainly use a COV value of 0.2, which is common practice in the literature [23], [17], [15] and is also supported by the reports of the International Technology Roadmap for Semiconductors (ITRS) [28], [29]. We also try different COV values between 0 and 0.3. For the DVTLM problem, we use defect rates between 5 and 40 percent, independently for each crosspoint. We define a performance parameter, the delay optimization rate, as the improvement percentage over the initial worst-case delay value obtained with random mapping: Delay Opt. Rate = (wc_initial - wc_final) / wc_initial × 100%.
We select a sample size of 2,000, around which average runtime and delay values become steady. All simulations are conducted in MATLAB running on a 3.50 GHz Intel Core i5 CPU (only a single core is used) with 16 GB memory.
Simulations for VTLM
We consider three state-of-the-art algorithms for comparison. The first is the memetic algorithm given in [23]. The second is the simulated annealing algorithm given in [18]. As the third, we developed a basic genetic algorithm inspired by [19] and [22]. The source code of these algorithms, as well as our proposed algorithm with supporting materials, is available at http://www.ecc.itu.edu.tr/images/a/a0/VTLM.zip. Before presenting time and delay values of our algorithm, we evaluate the first two steps of the algorithm, matrix reducing and initial column mapping. Note that both of these steps are algorithm independent, so they can be applied to any VTLM algorithm. We apply our matrix reducing method to three different algorithms; results are given in Table 1. Since the memetic algorithm in [23] is specifically designed for the DVTLM problem, it is not suitable for matrix reducing. We see that using reduced matrices provides an important time saving, up to 72 percent, at the cost of a slight increase in the worst-case delay of up to 3 percent. Note that these simulations are done using an upper limit of lim_max, previously defined in Section 3.1. If we used lb_max instead of lim_max, the delay increase would not happen at all; however, the time decrease rate would be lower.
We evaluate the proposed initial column mapping technique by comparing it with random mapping and with the greedy mapping method proposed in [20]; all of these methods are used with our hill climbing algorithm. Randomly generated benchmarks with different sizes are used for a fair comparison. Results, as delay optimization rates, are given in Table 2. We see that the proposed technique always gives the best result, which is 3-5 percent better than the greedy method from the literature [20] and 10-15 percent better than random mapping. The reason is that the greedy mapping proposed in [20], and also used in [22], determines the lb_i's using the sums of all delay values in a VM column, whereas we select the minimum sum of T_i delay values, where T_i is the number of 1's in the FM column to be mapped.
In Fig. 10, we compare the runtimes of the initial column mapping and matrix reducing steps with those of the hill climbing row search plus column reordering steps of the algorithm. We see that the runtime overhead of the initial column mapping and matrix reducing steps is quite low, always smaller than 5 percent, and it gets dramatically smaller as the matrix size increases.
The performance of our hill climbing algorithm is summarized in Fig. 11. Recall that the overwhelming proportion of the algorithm's computational load corresponds to the row search and column reordering steps. Since these steps are interleaved, a similar runtime behaviour is expected for changes in the row and column counts. This is shown in Fig. 11a. However, for the delay values, row and column counts have different effects, as shown in Fig. 11b. Since delays are calculated column-wise and, correspondingly, our algorithm aims to minimize column delays, a change in the number of columns affects delay optimization rates more than a change in the number of rows does. Also, a high row count converges to a 20 percent delay optimization rate because of the decreasing COV values of the column sums: the mean μ scales with the '1' count, while the standard deviation σ scales with the square root of the '1' count, so the COV of a column sum decreases as COV/√k for k ones. This property of Gaussian sums results in more separated column delay distributions as the row count increases. As a result, there are relatively fewer columns to be optimized, which yields effective and similar results.
We also make detailed comparisons of our algorithm with three different algorithms from the literature in Table 3. In terms of the worst-case delay, represented by "Delay Opt. Rate", our algorithm gives the best results in almost all cases. Here, our algorithm achieves similar optimization rates, around 20 percent, for all random array sizes. This demonstrates our algorithm's scalability to large logic functions, since the other algorithms' results get worse as the matrix size increases. Also, in terms of runtime, our algorithm gives by far the best results in all cases. Note that we run the memetic algorithm for at most 20 seconds; if the algorithm does not give any solution within this time span, it is unlikely to produce results within practical time limits. Note that we apply initial column mapping to all of these algorithms for a fair comparison.
In the following two sections, we further evaluate our algorithm for different variance values, and for different distributions.
Evaluation for Different Variance Values
Depending on the manufacturing technology, COV values may change. Future technologies might have lower or higher variations with better manufacturing costs. Therefore, we inspect the performance of our hill climbing algorithm, along with the other algorithms, for different COV values ranging from 0 to 0.3. Results are given in Fig. 12. We see that our proposed algorithm scales much better than the other search algorithms as the COV value increases.
Evaluation for Different Distribution Types
To inspect different possible manufacturing properties, we examine different distributions. Along with the symmetrical Gaussian distribution, we use Weibull and Beta distributions with slightly skewed density function curves, as well as an exponential distribution having an extremely skewed density function curve. All distributions have COV = 0.2 to preserve a fair comparison. Results are given in Table 4.
Here, compared to the Gaussian distribution, slight differences in optimization rates occur for the Weibull and Beta distributions. However, the exponential distribution has much higher rates for each algorithm, due to extreme worst-case delays far from the mean values. Examining the results, we see that our algorithm's superiority for the Gaussian distribution is even more pronounced for the other distributions.
Simulations for DVTLM
We define "Success Rate" as the ratio of cases or samples for which all defects are tolerated to the total number of cases.
Figs. 13 and 14 show the success rate of our algorithm for different defect rates and different benchmarks. Fig. 13 tells us that increasing the defect rate beyond 10 percent dramatically worsens the success rate, although, depending on the matrix structure, a 20 percent defect rate might be tolerated with a high success rate. In Fig. 14, we compare two cases: a constant column count with an increasing row count in Fig. 14a, and vice versa in Fig. 14b. Here, increasing the row count does not change the defect tolerance effectiveness, since our algorithm excels at row search. On the other hand, increasing the column count decreases the defect tolerance effectiveness, since more column formations must be tolerated with a limited row switching action. Tables 5 and 6 give detailed comparisons of our algorithm with three different algorithms for 5 and 10 percent defect rates, respectively. As expected, an increase in the defect rate not only decreases the success rate but also worsens the variation tolerance performance. Examining the numbers, we see that for the overwhelming majority of cases, our algorithm gives the best result. The results also confirm the superiority of our algorithm's speed.
CONCLUSION
In this work, we propose a variation tolerant logic mapping algorithm to optimize the worst-case delay of nano-crossbars. We show that our algorithm can be successfully used for defect tolerance, so defect and variation tolerance can be achieved at the same time. Simulations show that our algorithm runs considerably faster than the previously proposed algorithms while offering similar or better delay improvements. The difference between the delay optimization rates of our algorithm and the reference algorithms increases as the matrix gets larger, which demonstrates that our algorithm has better scalability for real-world applications. The proposed algorithm is technology independent and can be used for any technology using PLA-like computing, as well as for conventional CMOS PLA circuits.
As future work, we consider modifying and improving this algorithm for cascaded arrays to suit real-life applications. We also plan to extend this work to be applicable to transient variations, due mainly to degradations occurring in crossbars. Relatedly, area yield optimizations can be performed.
