My research areas include VLSI CAD, FPGA testing and trust design, fault-tolerant computing and parallel processing. We have also made a recent foray into optimization. Our work has received one best paper award, one most-influential paper award, one featured speaker recognition, and one best paper nomination, all at premier conferences. My research has been or is funded by NSF, DARPA and AFOSR, and by companies like Intel and Xilinx. Highlights of my research contributions in these areas are given below.
VLSI CAD
In VLSI CAD I have worked on partitioning, placement and routing (including incremental algorithms for the latter two), and logic and physical synthesis. My students and I have made the following contributions in these areas.
Partitioning
1. In swap-based partitioning I developed an algorithm, QuickCut, that improves the complexity of the well-known Kernighan-Lin algorithm from Θ(n² log n) to Θ(e log n), and empirically shows runtime improvements by factors of 5 to 50 [73]. Reducing the complexity of the Kernighan-Lin partitioning algorithm had been an open problem for two decades before the work in [73]. This work is mentioned in the textbook by Sabih H. Gerez, Algorithms for VLSI Design Automation, Wiley, 1998.
2. For move-based partitioning we have developed many novel and effective algorithms ranging from probability-based methods [13, 14, 65], to cluster-aware methods [10, 60, 64], to non-local information methods [14, 56], to methods for tackling constraints by intermediate relaxations [59], and to timing-driven partitioning [58]. All these techniques have been successful in giving the basic local-search, iterative-improvement process of move-based partitioners more non-local optimality properties. To date, these algorithms are among the best-performing flat partitioning methods. One of these works [65] earned a best paper award in 1996 at the prestigious Design Automation Conference.
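For readers less familiar with move-based partitioning, the quantity that iterative-improvement partitioners rank moves by can be sketched as follows. This is a generic textbook formulation of cut size and single-node move gain on a two-way graph partition, not QuickCut or any of the algorithms in the cited papers; the function names and the small example graph are illustrative assumptions.

```python
def cut_size(edges, side):
    """Number of edges whose endpoints lie on different sides."""
    return sum(1 for u, v in edges if side[u] != side[v])


def move_gain(node, edges, side):
    """Reduction in cut size if `node` is moved to the other side.

    Each incident edge that is currently cut becomes uncut (+1),
    and each incident edge that is currently uncut becomes cut (-1).
    """
    gain = 0
    for u, v in edges:
        if node not in (u, v):
            continue
        other = v if u == node else u
        gain += 1 if side[other] != side[node] else -1
    return gain


# Hypothetical 4-node example: nodes 0 and 1 on side 0, nodes 2 and 3 on side 1.
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
side = {0: 0, 1: 0, 2: 1, 3: 1}
before = cut_size(edges, side)
gain = move_gain(2, edges, side)
side[2] = 0  # apply the move
after = cut_size(edges, side)
```

The gain predicted before the move equals the observed drop in cut size, which is the purely local information that the non-local techniques above go beyond.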
Placement
1. A placement method SPADE for standard-cell VLSI circuits was created for wirelength optimization using the partition-driven paradigm and a number of novel concepts, chief of them being simultaneous-level partitioning and a logarithmically-graded balance-criterion as the partitioning proceeds hierarchically [53] .
2. Novel techniques using analytical programming approaches and network-flow were developed for a timing-driven (TD) incremental placement method FlowPlace that can significantly improve critical path delays of wirelength-optimized placements (by up to 34%) and of timing-optimized placements (by up to 10%) with about a 9% deterioration in wirelength (WL) [38] ; its runtime is about 12-18% of that for obtaining the original placements. Further, empirical evidence shows that FlowPlace's runtime grows only linearly with circuit size, making our techniques very scalable. This paper was accorded a featured speaker recognition in the premier International Conference on CAD (ICCAD), 2006. In [29] , we extended FlowPlace by including WL cost (along with timing cost), based on a probabilistic HPBB metric, in network flow based detailed placement; this reduces WL deterioration to about 6% with only a 1.7% reduction in performance. We also prove in [29] that our white-space satisfaction technique (embedded in the network flow based detailed placer) can successfully yield valid placements with very high probability.
3. Effective and theoretically-robust algorithms have also been developed for TD incremental placement under power constraints [37] as well as power-driven incremental placement under timing constraints [32]. Results show that for power optimization, we can achieve average improvements of 12.1%, 10.8% and 9.1% with no delay constraint, a 3% delay constraint and a -3% delay constraint, respectively. Here a negative constraint signifies a lower bound on the improvement of the constrained metric (delay, in this case), and a positive constraint signifies an upper bound on its deterioration. For delay optimization, we achieve average improvements of 16.8%, 11.6% and 9.1% under no constraints, a 3% power constraint and a -3% power constraint, respectively. I believe that our algorithms are significant advances in the state-of-the-art in placement algorithms for tackling both optimization and constraint metrics.
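The wirelength figures quoted in the placement results above are based on a bounding-box estimate. As a minimal sketch, assuming HPBB denotes the usual half-perimeter bounding-box measure, the deterministic version of that metric can be computed as below. The probabilistic variant used in [29] is not reproduced here, and the function names and coordinates are illustrative assumptions.

```python
def hpbb(pins):
    """Half-perimeter of the bounding box enclosing a net's pin positions."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_wirelength(nets, position):
    """Sum of the half-perimeter estimates over all nets of a placement."""
    return sum(hpbb([position[cell] for cell in net]) for net in nets)


# Hypothetical placement of three cells and two nets.
position = {"a": (0, 0), "b": (3, 4), "c": (1, 1)}
nets = [["a", "b"], ["a", "c", "b"]]
wl = total_wirelength(nets, position)
```

An incremental placer can evaluate the wirelength cost of moving one cell by recomputing this estimate only for the nets that contain it.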
Routing
1. Two of the best academic FPGA detailed routers ROAD and ROAD-HOP were developed in [46, 48] . These techniques outperformed the previous state-of-the-art in important metrics. For example, ROAD is 13 times faster than VPR (the best flat router) and has the same quality of results (number of tracks used). ROAD is an optimal detailed router and incorporates optimality-preserving speedup methods that result in its efficacy and time efficiency. These works introduced concepts of learning-based search space pruning that can be applied to the solution of other combinatorial optimization problems using a depth-first search mechanism. A prime example is graph coloring that itself has many applications in computer engineering and science.
2. In incremental routing, we have introduced the concept of bump-and-refit. Incremental routing is used for engineering-change-order (ECO) applications and fault reconfiguration. Use of this novel concept has resulted in significantly better routing completion rates, wirelengths and via usage than previous ripup-and-reroute approaches for both FPGAs and ASICs [9, 44, 54]. Further novel concepts, including that of Steiner-node slack tolerances, were introduced in [41] to yield a near-wirelength-optimal and guaranteed slack-satisfying timing-driven incremental routing method, TIDE, for ASICs that also obtains significant improvements over ripup-and-reroute approaches in the timing-driven context (e.g., 4-6 times fewer slack violations) while being about three times faster.
Logic and Physical Synthesis
1. In [36], we developed a network flow based timing-driven discrete cell-sizing algorithm that can incorporate total cell size constraints. We tested our algorithm on the ISCAS85 benchmark, and compared our results to an optimal solution produced by a dynamic programming method. The results for a 10% cell area increase constraint show that the improvement obtained by our method is only 1% worse (11.9% vs. 12.9%) than the optimal solution, while being 60 times faster than it. A significant extension of our method uses network flow iteratively on primal-dual formulations (the dual formulation optimizes cell area of non-critical paths of the circuit under delay constraints and allocates the saved area to the primal problem of minimizing delay in critical paths under area constraints) [33]. We compared our technique to the timing-optimization variation of the state-of-the-art method of [Hu, et al., DAC'07] and obtained 9% better timing results.
2. In [35], we proposed a post-placement physical synthesis algorithm, based on network flow, that can apply multiple circuit synthesis and placement transforms on a placed circuit to improve the critical path delay under area constraints by simultaneously considering the benefits and costs of all transforms (as opposed to considering them sequentially after applying each transform, as is done in most state-of-the-art methodologies). The circuit transforms we employed include incremental placement, two types of buffer insertion, cell resizing and cell replication, although our general technique is not limited to these. We also tie the "transform selection network graph" to a "detailed placement network graph" with TD arc costs for cell movements. This enables our algorithms to perform both physical synthesis and detailed placement together, and thereby to incorporate the detailed placement cost for each synthesis transform along with the basic cost of applying the transform in the circuit. Results on three sets of benchmarks under 3-10% area increase constraints show up to 48% and an average of 27.8% timing improvement. Our average improvement is relatively 40% better than applying the same set of transforms in a good sequential order that is used in many current techniques.
FPGA Testing and Trust Design
My students and I have developed several innovative and effective test and trust-design and verification techniques for FPGAs.
FPGA Trust Design and Verification
1. A novel trust design method for FPGA circuits that uses error-correcting code (ECC) structures for detecting design tampers (changes, deletion of existing logic, and addition of extra-design logic like Trojans) was proposed in [5]. We use two levels of randomization to thwart attempts by an adversary to discover the parity groups and inject tampers that mask each other and/or tamper with the testing circuit so that design tampers remain undetected: (a) randomization of the mapping of the ECC parity groups to the CLB (configuration logic block, i.e., logic cell) array; (b) randomization within each parity group of odd and even parities for different input combinations (classically, all ECC parity groups have even parities across all input combinations). These randomizations along with the error-detecting property of the underlying ECC lead to design tampers being uncovered with very high probabilities, as we show both analytically and empirically. Using the 2-D code as our underlying ECC and its 2-level randomization, our experiments with inserting 1-10 circuit CLB tampers and 1-5 extraneous logic CLBs in two medium-size circuits and a large RISC circuit implemented on a Xilinx Spartan-3 FPGA show very promising results of 100% tamper detection and 0% false alarms, obtained at a hardware overhead of only 7-10%.
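The two randomization levels described above can be illustrated with a small software model. This is only a sketch under simplifying assumptions: each CLB is reduced to a single output bit for a single input combination, the 2-D code is modeled as row and column parity groups over a grid, and all function names are hypothetical. It is not the hardware design of [5].

```python
import random


def parity(bits, group):
    """XOR of the bits at the given cell indices."""
    p = 0
    for i in group:
        p ^= bits[i]
    return p


def build_checks(bits, rows, cols, rng):
    """Form row and column parity groups over a randomly permuted grid."""
    assert len(bits) == rows * cols
    order = list(range(rows * cols))
    rng.shuffle(order)  # level (a): random mapping of cells to grid slots
    grid = [order[r * cols:(r + 1) * cols] for r in range(rows)]
    groups = [list(row) for row in grid]
    groups += [[grid[r][c] for r in range(rows)] for c in range(cols)]
    # level (b): each group is randomly assigned even (0) or odd (1) parity,
    # so a stored check bit alone does not reveal the group's data parity
    polarity = [rng.randint(0, 1) for _ in groups]
    check = [parity(bits, g) ^ p for g, p in zip(groups, polarity)]
    return groups, polarity, check


def is_tampered(bits, groups, polarity, check):
    """True if any parity group disagrees with its stored check bit."""
    return any((parity(bits, g) ^ p) != c
               for g, p, c in zip(groups, polarity, check))


rng = random.Random(0)  # fixed seed only to make the example repeatable
bits = [rng.randint(0, 1) for _ in range(12)]
groups, polarity, check = build_checks(bits, 3, 4, rng)
clean = is_tampered(bits, groups, polarity, check)
bits[5] ^= 1  # model a single-cell tamper
dirty = is_tampered(bits, groups, polarity, check)
```

In this model any single-cell change flips exactly one row group and one column group, so it is always flagged. The randomization matters against an adversary who would otherwise choose several changes that cancel within known groups.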
FPGA Testing
1. We developed 1- and 2-diagnosable built-in-self-testers (BISTers) that achieve very high diagnostic coverages for high fault densities (≈ 10%) that are expected to characterize permanent fault occurrences in future nano-scale CMOS and nanotechnology circuits [6, 45]. The 2-diagnosable BISTer design was the first to achieve a diagnosability greater than one. The paper [45] was nominated for a best paper award in 2004 at the prestigious Design Automation Conference.
2. We proposed probabilistic BIST techniques using the novel concept of iterative bootstrapping that achieve far greater diagnostic coverage at not only high fault densities, but also for clustered faults, a pattern that occurs frequently for fabrication defects [43] .
3. Interconnect BIST techniques were developed that can provably detect any number of interconnect faults as long as not all interconnects are faulty (a first), and that also have high diagnostic coverage [42] .
4. We designed a methodology based on a formal analysis of iterative bootstrapping that addresses for the first time the problem of detecting and diagnosing both interconnect and PLB (i.e., logic) faults in FPGAs without making any assumptions of any component (interconnects, PLBs) being fault-free. Significantly improved diagnostic coverages and reduced false positives were achieved with this methodology compared to state-of-the-art BIST methods that erroneously make such fault-free assumptions [40] .
Optimization
1. In [30] we proposed a new pivoting rule for the min-cost max-flow network Simplex method to determine the order of arc pivoting. In order to reduce the number of degenerate pivots (those that do not reduce the cost of the current solution), when choosing the pivoted-in arc, besides the standard reduced cost we also consider the probability that the resulting cycle is non-degenerate. A probability based reduced cost is devised to give priority to pivots that are likely to produce non-degenerate cycles. This technique can reduce the number of degenerate pivots by about 30%, and the total run time by 18% on average. However, this technique also causes an increase in the number of non-degenerate pivots, since some degenerate pivots are necessary steps for reaching non-degenerate cycles/pivots with large cost improvements. To address this issue, we developed the concept of necessary degenerate pivots and consider them for pivoting along with known and probabilistic non-degenerate pivots. This reduces the number of non-degenerate pivots (compared to not considering necessary degenerate pivots), helps in reaching negative cycles with large cost improvement, and ultimately reduces run time by an average of 29%.
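The role of degeneracy in the pivoting rule above can be made concrete with a short sketch. The first two functions use the standard network Simplex fact that a pivot changes the objective by the reduced cost times the flow pushed around the cycle, so a cycle with zero residual capacity on some arc gives a degenerate pivot. The selection rule in `pick_entering_arc` is purely an assumption for illustration (reduced-cost magnitude weighted by an estimated probability of non-degeneracy); the actual probability-based reduced cost is defined in [30] and is not reproduced here.

```python
def pivot_step(cycle_residuals):
    """Flow that can be pushed around the pivot cycle (its bottleneck)."""
    return min(cycle_residuals)


def cost_reduction(reduced_cost, cycle_residuals):
    """Objective decrease of a pivot; zero when the pivot is degenerate."""
    return -reduced_cost * pivot_step(cycle_residuals)


def pick_entering_arc(candidates):
    """Choose among (arc, reduced_cost, p_nondegenerate) candidates.

    Hypothetical scoring: |reduced cost| scaled by the estimated
    probability that the resulting cycle is non-degenerate.
    """
    return max(candidates, key=lambda c: -c[1] * c[2])[0]


useful = cost_reduction(-3, [2, 5, 4])   # bottleneck 2, cost drops by 6
wasted = cost_reduction(-3, [0, 5])      # bottleneck 0, degenerate pivot
chosen = pick_entering_arc([("a", -10, 0.1), ("b", -4, 0.9)])
```

Under this assumed scoring, arc "b" is preferred over arc "a" despite its smaller reduced cost, because it is far more likely to yield an actual cost reduction.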
Fault-Tolerant Computing
In fault-tolerant computing I, along with either my Ph.D. advisor or my students, have made the following contributions.
1. I have developed a range of novel and efficient methodologies for designing fault-tolerant multiprocessors that include use of covering graphs and graph automorphisms, and a structural application of error correcting codes (ECCs) to yield multiprocessors with very high average fault tolerance [11, 17, 23, 24, 25, 27, 69, 71, 76, 77, 78, 79]. In 1995, one of these papers [79] (published in 1988) was awarded the recognition of a most influential paper published in the first 25 years, 1971-1995.
2. Novel mantissa-based techniques were designed for significantly alleviating the well-known problem of round-off errors in algorithm-based fault tolerance techniques [2, 3, 20, 75].
3. The REMOD method for concurrent testing and fault tolerance in arithmetic circuits was developed that can accommodate any degree of fault tolerance desired, and has some of the lowest latency and hardware overheads [2, 19] .
4. Very effective hardware and software techniques were designed for fault tolerance in FPGAs [2, 15, 54, 55, 61, 68] .
5. Probably the first method for off-chip control-flow-checking of processors with on-chip caches [39] .
Parallel Processing
Our (my students' and my) accomplishments in this area are:
1. The first load-balancing method for irregular parallel computations, QE, that has analytically-proved performance [22, 74] . QE empirically yields performance efficiency of 80-90% (speedup factor using P processors is 0.8P to 0.9P ; P is the ideal speedup) on large application problems like the Traveling Salesman Problem and Mixed Integer Programming on various large multicomputers like the nCUBE2 with 1024 processors. This is among the highest consistent speedup yielded by any general load-balancing method.
2. A low-overhead informed randomized load-balancing algorithm called Random Seeking that was shown theoretically and empirically to be more efficient than previous randomized load-balancing algorithms [12, 67].
3. The first duplicate pruning strategies for parallel best-first search that have provable scalability [18, 72].
4. An adaptive load balancing method QE* that adapts to node granularity and density of the application, and the communication latency of the multicomputer. This was the first adaptive load balancer of its kind (multiple dimensions of adaptivity) and it achieved near-ideal speedup on the IBM SP-2 multicomputer for Mixed Integer Programming problems [8, 57].
5. A very efficient termination detection algorithm for general parallel computations that achieves the best performance in several important metrics including the all-important one of detection latency for which it is optimal [7].
7. A useful analysis, the first of its kind, of k-ary n-cube multicomputer interconnection architectures on a wide class of real parallel algorithms (divide-and-conquer) [66]. Previous analyses, while very useful and comprehensive, were for raw numerical message traffic and hypothetical message patterns.
8. NP-completeness proof for the subcube allocation problem and an effective algorithm for an approximate solution to this problem [26, 80] .
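As a small worked illustration of the efficiency figures quoted for QE in item 1 above, efficiency is speedup divided by the number of processors P, so an efficiency of 80-90% on P processors corresponds to a speedup of 0.8P to 0.9P. The helper names below are illustrative only.

```python
def efficiency(speedup, processors):
    """Fraction of the ideal speedup (equal to `processors`) achieved."""
    return speedup / processors


def speedup_range(processors, eff_low, eff_high):
    """Speedup interval implied by an efficiency interval."""
    return eff_low * processors, eff_high * processors


# For a 1024-processor machine at 80-90% efficiency.
low, high = speedup_range(1024, 0.8, 0.9)
```

For 1024 processors this gives a speedup of roughly 819 to 922 over a single processor.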
