Abstract: Split manufacturing (SM) and layout camouflaging (LC) are two promising techniques to obscure integrated circuits (ICs) from malicious entities during and after manufacturing. While both techniques enable protecting the intellectual property (IP) of ICs, SM can further mitigate the insertion of hardware Trojans (HTs). In this paper, we strive for the "best of both worlds," that is, we seek to combine the individual strengths of SM and LC. By jointly extending SM and LC techniques toward 3D integration, an up-and-coming paradigm based on stacking and interconnecting of multiple chips, we establish a modern approach to hardware security. Toward that end, we develop a security-driven CAD and manufacturing flow for 3D ICs in two variations, one for IP protection and one for HT prevention. Essential concepts of that flow are (i) "3D splitting" of the netlist to protect, (ii) obfuscation of the vertical interconnects (i.e., the wiring between stacked chips), and (iii) for HT prevention, a security-driven synthesis stage. We conduct comprehensive experiments on DRC-clean layouts of multi-million-gate DARPA and OpenCores designs (and others). Strengthened by extensive security analysis for both IP protection and HT prevention, we argue that entering the third dimension is eminent for effective and efficient hardware security.
INTRODUCTION
On the one hand, design practices by the industry attach importance to optimizing for power, performance, and area (PPA) at the level of physical design or design architecture (e.g., cache hierarchies, speculative execution). On the other hand, there are powerful attacks such as Meltdown [2], which skillfully exploit these very practices to extract sensitive data at runtime. Besides, malicious foundries may implement so-called hardware Trojans (HTs) which can help an adversary to extract sensitive data purposefully [3]. Apart from such concerns regarding the security and trustworthiness of hardware at runtime, protecting the hardware itself from threats such as piracy of design intellectual property (IP) or illegal overproduction is another challenge. That is because, to avoid the burgeoning cost associated with ever-shrinking technology nodes, most chip companies nowadays outsource the fabrication of their ICs to third-party foundries which are potentially untrustworthy. Moreover, the tools and know-how required for reverse engineering (RE) of even high-end ICs are becoming more accessible and less costly [4]; therefore, a malicious end-user obtaining the IP after production is another significant threat. Various schemes for IP protection have been put forth over the last decade, and most of them can be classified into logic locking (LL), layout camouflaging (LC), or split manufacturing (SM). These three classes consider different threats: SM seeks to protect against untrusted foundries, LC against untrusted end-users, and LL against both. Accordingly, there are different assumptions on the attackers' capabilities, different limitations, and different concepts for realization. We provide more details in Sec. 2, and the interested reader may also see [5].
Independent of hardware security, 3D integration has made significant strides over recent years. The concept of 3D integration is to stack and interconnect multiple chips/dies/tiers/layers, thereby promising "More Moore," i.e., to overcome the scalability bottleneck which is exacerbated by ever-increasing challenges for pitch scaling, routing congestion, process variations, et cetera [6] . Recent studies and prototypes have shown that 3D integration can indeed offer significant benefits over conventional 2D chips [7] , [8] , which can ultimately also help to thoroughly utilize the existing technology nodes. Besides, 3D integration advances manufacturing capabilities by various means such as parallel handling of wafers, higher yields for the smaller outlines of individual chips, and heterogeneous integration ("More than Moore").
In this paper, we propose a modern approach to hardware security. We show that 3D integration is an excellent candidate to combine the strengths of LC and SM in one scheme (Fig. 1). The key idea is to "3D split" the design into multiple tiers and to obfuscate (i.e., randomize and camouflage) the vertical interconnects between those tiers. Our approach is a significant advancement over prior art in IP protection: while LC may thwart end-user adversaries and SM may safeguard against fab-based adversaries, only our work can readily protect against both threats. We note that LL targets the same, but the viability of LL schemes depends on tamper-proof memories, which is an area of ongoing research [9]. Concerning HT prevention, the second major part of our work besides IP protection, we show that 3D integration advances the state-of-the-art in two ways. For one, components considered prone to HT insertion can be delegated to trusted facilities for fabrication of separate chips, hindering HT insertion to begin with. For another, when the decision on which components are prone to HT insertion is difficult, our notion of obfuscating vertical interconnects is essential to implement other, foundationally secure schemes with superior cost and scalability.

Fig. 1. Our scheme is based on 3D integration, particularly on face-to-face (F2F) 3D ICs. Through-silicon vias (TSVs) are used for external connections, and redistribution layers (RDLs) for vertical interconnects. Our security-centric splitting of the design across two tiers, along with obfuscation of the RDLs, allows for IP protection and prevention of HT insertion. That is because the two tiers, even when considered together by colluding foundries, reveal neither the entire design nor individual components prone to HT insertion.
The contributions of this work are as follows:
1) We put forward a practical threat model which is in line with the present-day business practices of design houses (Sec. 3). This model necessitates the application of both LC and SM in conjunction.
2) We leverage 3D integration for a modern approach to IP protection and HT prevention (Sec. 4). We combine the strengths of LC, SM, and prior art on HT prevention, all within one concept. The key idea is to "3D split" the design into two tiers (or more, in principle) and to obfuscate the vertical interconnects between those tiers. We explore two scenarios for commissioning different trusted and untrusted foundries, offering different or same technology node(s). In that section and throughout the paper, we use DARPA and OpenCores multi-million-gate designs (besides the well-known ITC-99 and ISCAS-85 benchmarks), to provide meaningful experimental studies.
3) We develop a security-driven CAD and manufacturing flow for face-to-face (F2F) 3D ICs, initially tailored for IP protection (Sec. 5). In addition to various steps required for an end-to-end 3D IC CAD flow, key concepts for the flow are security-driven partitioning techniques as well as obfuscation of the vertical interconnects in the F2F 3D IC. We implement our flow using Cadence Innovus and demonstrate its applicability on a broad set of benchmarks.
4) We conduct a thorough analysis of DRC-clean layouts, and we contrast with the prior art of LC or SM wherever applicable (Sec. 6). We further present an extensive security analysis, underpinned by analytical and empirical data as well as by a novel proximity-centric attack on 3D ICs (Sec. 6).
5) We extend our flow toward HT prevention (Sec. 7), specifically for a strong threat model where the fab-based attacker already holds the full netlist to begin with [10]. Initially we identify the limitations of prior art [10], [11]: the attainable level of security, scalability, and layout cost incurred to guarantee this level of protection. To tackle these limitations, we propose and implement a security-driven synthesis strategy along with our established 3D IC CAD flow. Here we also enable, for the first time, the concerned designer to purposefully protect structures of choice at design time.
6) We comprehensively study various scenarios to demonstrate the resilience and efficacy of our approach to HT prevention (Sec. 8). For example, for the ITC-99 benchmark b19, the attacker has only a chance of 0.25% for successful targeted HT insertion.

Fig. 2. The four main flavors of 3D integration. Metal layers are colored in green, active layers in brown (along with modules in blue), and bonding underfills in yellow. F2F 3D ICs allow for direct metal-to-metal bonding.
BACKGROUND

3D Integration and CAD Flows
3D integration has experienced significant traction over the recent years, for improving scalability as well as manufacturing and integration capabilities [6], [12], [13]. 3D integration can be broadly classified into four flavors (Fig. 2): (1) through-silicon via (TSV)-based 3D ICs, where chips are fabricated separately and then stacked, with the vertical inter-chip connections being realized by relatively large metal TSVs running through the entire silicon chips; (2) F2F stacking, where two chips or tiers are fabricated separately and then bonded together directly at their metal faces (along with TSVs only required for external connections); (3) monolithic 3D ICs, where multiple tiers are manufactured sequentially, with the vertical interconnects based on regular metal vias; (4) 2.5D integration, where chips are fabricated separately and then bonded to a system-level interconnect carrier, the interposer. Each flavor has its scope, benefits and drawbacks, and requirements for CAD and manufacturing processes [6]. F2F stacking has arguably emerged as the most promising (along with monolithic 3D ICs); various studies are actively streamlining efforts for commercial adoption, e.g., [7], [12], [14]. Note that prior art is inherently oblivious to hardware security. These studies carefully trade off intra-tier wiring with vertical interconnects across tiers. While vertical interconnects are the key feature of 3D integration, an overly large number of crossings/cuts can have a significant, counter-productive impact on PPA [12]. As we explain in Sec. 6.4, however, a large number of cuts is mandatory for a strong resilience against IP piracy. Hence, our flow comprises techniques to explore this security-cost trade-off.
Split Manufacturing
SM offers an interesting solution to safeguard the design IP, but only during manufacturing time. That is, SM cannot protect against malicious end-users. Traditionally, SM means that the device layer and few lower metal layers (front-end-of-line, FEOL) are fabricated using a high-end, potentially untrusted foundry, whereas the remaining interconnects (back-end-of-line, BEOL) are grown on top of the FEOL wafer by a trusted facility. Considering the different pitches of the FEOL/BEOL metal layers, SM supports a cost-aware supply chain. That is, as long as the FEOL and BEOL are separated at some higher metal layer, any low-end facility may be commissioned for the BEOL with relatively little commercial cost. 1 Now, the security promise of SM lies in the fact that an untrusted foundry only holds a part of the overall design, making it difficult to infer the complete design functionality, and thereby hindering an adversary from IP piracy or targeted insertion of HTs.
Existing CAD tools, however, due to their focus on design closure (and their so-far agnostic view on security), tend to leave hints for an FEOL-based adversary. For example, to honor PPA, any to-be-connected cells are typically placed close to each other. Based on this insight, Rajendran et al. [15] proposed a so-called proximity attack which models this principle to infer the missing BEOL connections.
Various placement-centric [16], [17] and/or routing-centric [17], [18] schemes have been proposed recently, which all aim to counter the efforts of proximity attacks [15], [16]. Among those defense schemes, lifting of wires above the split layer remains an intuitive way to obfuscate the IP. That is, the critical wires (as selected by the designer) are lifted, e.g., by constraining the router via routing pins in higher layers or by inserting artificial routing blockages. We conducted exploratory experiments on the randomized lifting of nets (Fig. 3); here we observe steady and significant increases in PPA cost. More comparative results, also on 2.5D/3D solutions, are given in Sec. 6.
We acknowledge that the basic idea for 3D SM was already envisioned in 2008 by Tezzaron [19]. Also, various studies are hinting at 3D integration for SM, but most have limitations or cover different scenarios. For example, Dofe et al. [20] remain on the conceptual level, while Xie et al. [21] and Imeson et al. [10] consider 2.5D integration where only wires are hidden from the untrusted foundry. We summarize the prior art on 3D SM in Table 1, along with that for 3D LC.
Layout Camouflaging
LC foils an adversary's efforts for RE of a chip. LC is accomplished during manufacturing by (i) dissolving optically distinguishable traits of standard cells, e.g., using look-alike gates [24] or secretly configured MUXes [25], (ii) selective doping implantation for threshold-voltage-based obfuscation [26], [27], or (iii) rendering the BEOL wires and/or vias resilient against RE [28]. It is important to note that most schemes require alterations to the FEOL process, which can be complex and costly. In any case, since physical obfuscation constitutes the secret for IP protection by LC, the involved manufacturing facilities have to be trusted; LC cannot protect against malicious fabs.

1. This consideration is also essential for adopting SM as a security scheme in practice. It seems more realistic to find some well-established and trustworthy facility which can, however, only offer large-pitch processing capabilities than to find some high-end facility which can also be trusted, among the few to begin with. See also Sec. 4.1.

Fig. 3. PPA cost (%) for look-alike LC [24] (left) and explicit lifting of randomly selected wires to M8 (right) in SM. Results are averaged across ITC-99 benchmarks. For the LC scheme (left), the impact on power and area is substantial, given that the NAND-NOR-XOR structure in [24] incurs 4× and 5.5× more area and power compared to a regular 2-input NAND gate. For the SM scheme (right), the cost for area is severe. That is because routing resources are relatively scarce for M8 (pitch = 0.84µm) and lifting of wires occupies further resources, which can only be obtained by enlarging the die outlines.
Powerful Boolean satisfiability (SAT)-based attacks [29] have questioned the efficacy of various LC schemes, leading to a "cat-and-mouse game" between adversaries and defenders. The base for these attacks are analytical models for all possible assignments of the camouflaged design parts and efficient pruning of the search space of assignments. Recent security schemes thus attempt to impose excessively complex problem instances for SAT solvers by (i) inserting dedicated challenging structures like camouflaged AND trees [30] , (ii) minimally modifying critical parts of the design functionality [31] , or (iii) full-chip camouflaging [28] .
Existing schemes tend to incur significant PPA overheads once LC is applied for large parts of the design. For example in [24] , camouflaging 50% of the design results in ≈150% overheads for power and area, respectively (Fig. 3) . Emerging schemes such as threshold-voltage-dependent LC still suffer from PPA overheads; see Sec. 6 for more comparative results. As for 3D integration, Yan et al. [23] proposed LC for monolithic 3D ICs, and Gu et al. [22] apply LC for 3D ICs, albeit using regular 2D LC schemes. Hence, while promising, both works still require trusted FEOL facilities.
Hardware Trojans
Apart from IP protection, the possibility for HT insertion by an untrustworthy foundry also raises concerns, especially for military applications. ICs that are "bugged" with HTs may (i) deviate from their specified functionality, (ii) leak sensitive information, and/or (iii) become unreliable or fail at particular points in time [3] . Trojans can be broadly classified into digital Trojans and physical Trojans, depending on their payload and trigger mechanisms. Digital HTs are activated by either a specific, rare input pattern or via "time bombs" on certain operations (or input patterns) being executed for a particular number of cycles. Physical Trojans are activated either by (i) aging effects such as electromigration, or (ii) internal or external side-channel triggers. In this paper, we limit the scope to digital Trojans.
In general, there are two classes of HT countermeasures: reactive, i.e., monitoring for HTs at runtime, and pro-active, i.e., seeking to prevent HT insertion during the design and/or manufacturing time. For reactive countermeasures, there are various approaches, e.g., based on current monitoring [32] or security wrappers [33] . For pro-active countermeasures, prior art covers, e.g., built-in self-authentication modules [34] and SM-centric obfuscation schemes [10] , [11] among others. Note that the fab-based adversary needs to comprehend the layout under attack for targeted HT insertion; that is why schemes like LL and SM can prevent HT insertion to a certain extent.
In this work, we focus on pro-active prevention of HTs. In particular, we consider a strong threat model where the attacker already holds the complete gate-level netlist [10] . We review the prior art [10] , [11] in more detail in Sec. 7.1.
A PRACTICAL THREAT MODEL
Here we put forward a novel, practical threat model which is in line with the business practices of present-day electronics companies. Consider the following scenario. Lacking its own fabrication facilities, a company commissions a potentially malicious foundry to manufacture their newest version of some chip. This new version is typically extended from some previous version (Fig. 4(a)); the reuse of IP modules and the re-purposing of proven architectures are well-known practices. For example, think of the flagship iPhone® by Apple®. The iPhone 7, based on the A10 chip, was launched in September 2016, and the iPhone X, based on the successor chip A11, was launched in September 2017. In such a scenario, it is intuitive that pirating the new IP can become significantly less challenging for fab-based adversaries. In case the same fab was already commissioned for the previous chip version, they readily hold the layout of that earlier version; otherwise, the adversaries can apply RE on chips of the previous version bought in the market. In any case, the adversaries can compare that new layout with the prior layout, to locate and focus on those parts which are different and unique. Recall that understanding the layout and its functionality in full is necessary for targeted HT insertion; this becomes notably less challenging as well. Now, the conclusion for this threat model is that both LC and SM are required for manufacturing of all different chip versions. LC is required to prevent RE of the current layout by any other fab commissioned for later chip versions, whereas SM is necessary to prevent the fab which is manufacturing the current version (and which is also tasked to implement LC) from readily inferring the complete layout of the current version. Prior art can only account for this scenario by applying SM on top of LC, which can exacerbate the individual overheads and shortcomings, as discussed in Sec. 2 and 6.
Next, we outline our scheme to combine SM and LC naturally while leveraging 3D integration.
3D INTEGRATION AS MODERN APPROACH TO IP PROTECTION AND TROJAN PREVENTION
The primary advancement we advocate for SM is to "3D split" the design into multiple tiers. That is, unlike regular SM in 2D where the layout is split into FEOL and BEOL, we split the design itself into two parts (or more, in principle). These parts are manufactured as separate chips, and then stacked and vertically interconnected following the F2F integration process (the latter without loss of generality). We suggest that 3D SM can be achieved either by commissioning different foundries or one foundry (Fig. 4(b)):
1) Different trusted and untrusted foundries (Sec. 4.1): Consider one trusted and one untrusted foundry, both with full FEOL and BEOL capabilities, but for different technology nodes. It is intuitive to delegate the sensitive parts to the trusted fab exclusively. While this approach is straightforward and inherently secure against fab-based adversaries, its practicality is limited, as we discuss below.
2) Untrusted foundries/foundry (Sec. 4.2): Consider one or more high-end but untrusted fab(s). This way, we can benefit from the latest technology but, naturally, have to obfuscate the design in such a way that the fab(s) cannot readily infer the whole layout, even when they are colluding. Once such strong protection is in place, it is economically more reasonable to commission only one fab.
It is important to note the following. First, we elaborate on both scenarios in this section, but we focus on the more relevant and practical Scenario 2) in the remainder of this work. Second, to further achieve security against (a) fab-based adversaries and (b) malicious end-users, we later on (a) randomize the vertical interconnects and (b) obfuscate those interconnects. Therefore, we then require a trusted BEOL facility, but no trusted FEOL facility. Assuming only an untrusted FEOL facility, the design has to be split in such a manner that an FEOL-based attacker cannot readily identify the vulnerable parts. This also requires randomization of the vertical interconnects using a trusted BEOL facility.
Different Trusted and Untrusted Foundries
Commissioning several foundries providing different trust levels and supporting different technology nodes holds two key implications as follows. First, it is intuitive to assign sensitive design parts to the chip manufactured by the trusted foundry, for example (i) some new IP to protect (Fig. 4(b) ), or (ii) parts considered vulnerable to targeted HT insertion ( Fig. 5(a) )-such an approach is secure by construction against fab-based adversaries for the following reasons. For IP piracy, there is no generic attack model in the literature yet which can infer missing connections and gates when given only a part of the overall design. We believe that such "black-box attacks" would be very challenging, if possible at all. For HT insertion, the adversary cannot perform targeted insertion once the vulnerable parts are delegated to the trusted chip exclusively.
Second, in case a trusted fab and another untrusted fab are commissioned in parallel, it is implied that these two fabs would support different nodes, with the trusted fab typically offering only access to an older technology. In fact, if the trusted foundry were able to offer the same high-end node, one could simply commission the trusted foundry for manufacturing of the whole design. Due to the different pitches for different technology nodes, however, only a fraction of the design can be delegated to the trusted low-end fabrication. Also, power and performance will be dominated by the low-end chip, where factors such as parasitics, level shifting, and clock synchronization may further exacerbate the overheads [14], [35].
Case Study on DARPA Common Evaluation Platform
Next, we study the scope for such heterogeneous 3D SM. 2 That is, we consider a case study where the sensitive logic is moved to a trusted fab, offering the older 90nm technology, whereas the remaining logic is delegated to an untrusted 45nm fab. The case study is based on a multi-million-gate System-on-Chip (SoC), provided by DARPA as a Common Evaluation Platform (CEP) [36]. An overview of the SoC architecture is given in Fig. 6(a); it is a one-master, ten-slaves system. The master is a re-engineered version of the OpenRISC processor, called OR1200. This OR1200 master executes code from a 128KB static RAM (SRAM), which we had to omit for our layout-level study, due to unavailability of memory macros in the considered 45nm library. 3 The slaves comprise cryptographic (crypto) modules, digital signal processing (DSP) modules, and a global positioning system (GPS) processing module. The Wishbone/DBUS connect the processor master with all other blocks, while UART provides a serial interface for off-chip communication.

2. Another independent study is provided in Sec. S.I.
Given that all considered crypto modules are public knowledge, the designer (attacker) would not be interested in protecting (retrieving) the related IP. However, the DSP modules and other parts may contain customized logic and sensitive IP worth to protect. As discussed before, the choice of which modules/logic to protect lies solely with the design house, as they can best judge which components require IP protection. For our case study, we assume the GPS processing module to be that sensitive asset.
Experimental Setup: We use the NanGate 45nm library [37] and Synopsys 90nm library [38]. Innovus 17.1 is used for layout generation and PPA evaluation; the setup details are further elaborated in Sec. 6.2.3. First, we synthesize the 2D baseline designs, considering the slow corners for the 45nm and 90nm nodes, respectively; the results are given in Table 2. Second, we partition the 45nm baseline design such that all logic of the GPS module can be delegated to a trusted 90nm chip. Here we follow the design flow shown in Fig. 7 in general, but there are some differences as follows: (a) we have to re-synthesize the GPS module for 90nm, with the same timing constraint as for the remaining modules, which is also to simplify the clock-tree synthesis (CTS) for the individual tiers [35]; (b) for layout evaluation, we revise the LIB and LEF files for the GPS tier; (c) we do not undertake any additional security-centric steps (highlighted in bold in Fig. 7) as the GPS IP is fully secured against adversaries residing in the 45nm fab. Finally, note that we assume supply voltages of 0.95V for 45nm and 0.9V for 90nm when generating the power numbers.

Results: The PPA results for this different-foundries 3D implementation are given in Table 3. We observe overheads of 63.09%, 30.45%, and 15.03% for area, power, and delay, respectively, when compared to the 2D 45nm baseline. However, when comparing the 3D implementation with the trusted 90nm baseline, our approach offers savings of 75.82%, 62.31%, and 13.1% for area, power, and delay, respectively. Also note that the 90nm GPS tier comprises only 15.63% of the total number of instances, although it incurs almost the same area footprint. In short, heterogeneous 3D integration may indeed provide benefits over a purely trusted fabrication (90nm), but it naturally cannot compete with advanced fabrication (45nm).
Summary
This approach of utilizing a trusted and another untrusted foundry for 3D integration may be inherently secure against fab-based adversaries, and may also offer some PPA benefits over a 2D implementation using only the trusted and old node, but it is also limited in practice. That is because one can delegate only a small fraction of the overall design to the trusted foundry (i.e., at least without incurring area cost), which limits the scale for IP protection, and the performance and power of this 3D approach cannot compete with the advanced but untrusted 2D node.
Untrusted Foundries
Engaging with (i) several untrusted foundries offering the same technology node or (ii) one untrusted foundry also holds some key implications as follows. First, power and performance of such "conventional" 3D integration can be expected to exceed those of the heterogeneous scenario above. We note that splitting of 2D IP modules within 3D ICs has been successfully demonstrated, e.g., in [39], albeit without hardware security in mind. Hence, savings from the folding of IP modules may provide some margin for security schemes. However, we also show in detail in this work that this margin naturally depends on the design and the measures applied for the scheme.
Second, although IP modules can be split across tiers, which may mislead an RE attacker (a malicious end-user), both tiers are still manufactured by some untrusted fab(s). This fact implies that LC schemes targeting the device level cannot help to protect from adversaries in those foundries. Interestingly, there is another LC flavor emerging, that is, the obfuscation of interconnects [28], [40]. We argue that obfuscation of interconnects is a natural match for F2F 3D integration: in between the two tiers, redistribution layers (RDLs) can be purposefully manufactured for obfuscation of the vertical interconnects (Fig. 1). Doing so only requires a trustworthy BEOL facility, which is a practical assumption given that BEOL fabrication is much less demanding than FEOL fabrication (owing to larger pitches and less complex processing steps). That is especially true for higher metal layers; note that RDLs reside between the F2F bonds which themselves are at higher layers. Chen et al. [40] consider real and dummy vias using magnesium (Mg) and magnesium-oxide (MgO), respectively, for obfuscation of interconnects. They demonstrate that real Mg vias oxidize quickly into MgO and, hence, can become indistinguishable from the other MgO dummy vias during RE. Without loss of generality, we assume our LC scheme to be based on the use of Mg/MgO vias for obfuscating the vertical interconnects of the 3D IC. Emerging interconnects such as those based on carbon nanotubes [41] may become relevant in the future as well.
METHODOLOGY FOR IP PROTECTION
Here we elaborate on our CAD and manufacturing flow for F2F 3D integration. The CAD flow is in parts inspired by Chang et al. [13], but note that we devise and implement our customized flow, with a particular focus on IP protection (Fig. 7). Our flow allows a concerned designer to explore the trade-offs between PPA and cuts, i.e., the number of F2F vertical inter-tier connections. Cuts are a crucial metric for the security analysis, as discussed in more detail in Sec. 6.4.
It is also important to note that we follow the call for layout anonymization [10]: we purposefully do not engage cross-tier optimization steps, to mitigate layout-level hints on the obfuscated BEOL/RDLs.
As for the F2F process, we propose the following security-centric modification. The wafers for the two tiers are fabricated by one (two) untrusted foundry (foundries) and then shipped to a trusted BEOL and stacking facility.
This trusted facility grows the obfuscated RDLs on top of one wafer, and continues with the regular F2F flow (i.e., flipping and bonding the second wafer on top).
Design Partitioning
After obtaining the post-routed 2D design, we partition the netlist into top and bottom groups, representing the tiers of the F2F IC. I/O ports are created for all vertical interconnects between the two groups, representing the F2F vias. Besides these F2F ports, we place primary I/Os at the chip boundary, as in conventional 2D designs. This is also practical for F2F integration where TSVs are to be manufactured at the chip boundary for primary I/Os and the P/G grid.
Random partitioning: A naive way for security-driven partitioning is to assign gates to the top/bottom groups randomly. While doing so, the number of cuts will be dictated by the number, type, and local inter-connectivity of gates being assigned to one group. Since random partitioning lacks any heuristic, it may either result in savings or overheads for power and/or performance, depending on the design, number of vertical interconnects induced, and randomness itself.
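The random strategy and the resulting cut count can be illustrated with a minimal Python sketch. This is not the flow implemented in the paper; the netlist representation (a mapping from net name to a driver gate and its sink gates), the function names, and the convention of counting one cut per net that crosses tiers are all assumptions made for illustration.

```python
import random

def random_partition(gates, seed=None):
    """Assign every gate to the 'top' or 'bottom' group uniformly at random."""
    rng = random.Random(seed)  # local generator, so runs are reproducible per seed
    return {gate: rng.choice(("top", "bottom")) for gate in gates}

def count_cuts(nets, tier_of):
    """Count nets whose driver and at least one sink sit in different groups.

    nets: dict mapping net name -> (driver gate, list of sink gates).
    Assumption: each crossing net needs one vertical F2F connection.
    """
    cuts = 0
    for driver, sinks in nets.values():
        if any(tier_of[sink] != tier_of[driver] for sink in sinks):
            cuts += 1
    return cuts

# Hypothetical toy netlist: g0 drives g1 and g2, g1 drives g3, g2 drives g3.
toy_nets = {
    "n0": ("g0", ["g1", "g2"]),
    "n1": ("g1", ["g3"]),
    "n2": ("g2", ["g3"]),
}
toy_tiers = random_partition(["g0", "g1", "g2", "g3"], seed=0)
print(toy_tiers, count_cuts(toy_nets, toy_tiers))
```

Repeating the call with different seeds gives a feel for how strongly the cut count, and with it the layout cost, varies with the randomness itself.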
Maximizing the cut-size: As already indicated (and further explored in Sec. 6.4), the larger the cut size, the more difficult IP piracy becomes. Hence, here we seek to increase the cut size as much as reasonably possible. First, timing reports for the 2D baseline are obtained, following which gates are randomly alternated along their timing paths toward the top/bottom groups. In the security-wise best case, which is also the worst case regarding power and performance, every other gate is assigned to the top and bottom group, respectively. There, for a path with n gates, 2n cuts are arising. In short, the trade-off is as follows: the larger the cut size, the more resilient the design, but the higher the layout cost. We study the impact of maximizing the cut-size in Sec. 6.2.
Timing-aware partitioning: Based on the insights regarding the cost-security trade-off for random partitioning and maximizing cuts, here we seek to reduce layout cost while maintaining strong protection. First, the available timing slack is determined for each gate. Then, based on a user-defined threshold, the critical gates remain in the bottom tier, whereas all other gates are moved to the top tier. This procedure is repeated with revised timing thresholds until an even utilization for both tiers is achieved. Note that it is difficult for an attacker to understand whether a path in the bottom/top group is critical or not (or even complete, for that matter). In other words, the attacker has to tackle both groups at once and, more importantly, resolve the randomized F2F vias and the obfuscated interconnects.
We advocate this partitioning strategy, especially for any flat design. In the remainder of this work, timing-aware partitioning is our default strategy unless otherwise noted.
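The iteration over timing thresholds can be sketched as follows. This is only an illustration under stated assumptions: per-gate slack and cell area are given as dictionaries, "even utilization" is approximated by an area share within a tolerance, and the threshold is revised by bisection, which is our choice for the sketch and not necessarily how the flow revises it.

```python
def timing_aware_partition(slack, area, tol=0.05, max_iter=50):
    """slack: gate -> timing slack; area: gate -> cell area.
    Gates with slack <= threshold are treated as critical and stay in the bottom tier."""
    total = sum(area.values())
    lo, hi = min(slack.values()), max(slack.values())
    tier_of, threshold = {}, hi
    for _ in range(max_iter):
        threshold = (lo + hi) / 2
        tier_of = {g: ("bottom" if s <= threshold else "top") for g, s in slack.items()}
        bottom_share = sum(area[g] for g, t in tier_of.items() if t == "bottom") / total
        if abs(bottom_share - 0.5) <= tol:
            break                      # both tiers are evenly utilized
        if bottom_share > 0.5:
            hi = threshold             # too many gates kept: lower the threshold
        else:
            lo = threshold             # too few gates kept: raise the threshold
    return tier_of, threshold

# Hypothetical example: ten unit-area gates with slacks 0.0, 0.1, ..., 0.9.
slack = {f"g{i}": i / 10 for i in range(10)}
area = {g: 1.0 for g in slack}
tiers, thr = timing_aware_partition(slack, area)
print(thr, sorted(g for g, t in tiers.items() if t == "bottom"))
```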
Hierarchical partitioning: This strategy is applied for designs with hierarchies in the top-level module. Inspired by [13], we separate modules with a large degree of connectivity across tiers, resulting in large numbers of cuts. Other modules are partitioned/placed to balance the utilization of both tiers. In short, this strategy serves to protect the IP as well as to limit layout cost for hierarchical designs.
Fig. 8. Normalized distances between to-be-connected F2F vias after randomization, for benchmark b17_1.
Fig. 9. Obfuscated switchbox in the RDL (between the bottom tier and the top tier), embedded in two RDL layers, exemplarily for bottom-to-top drivers. Each driver pin (downwards triangle) can connect to any sink pin (upwards triangle). All F2F ports are aligned with the pins of the switchbox here, for simplicity, whereas the top-tier ports are randomized in reality.
Planning of F2F Interconnects
After placing the bottom tier, the initial locations for F2F ports are determined in the vicinity of the drivers/sinks. Then, a security-driven, randomized placement of F2F ports is conducted, along with customized on-track legalization. Next, obfuscated switchboxes are placed, and the F2F ports are mapped to the top tier.
Randomization: It is easy to see that regular planning of F2F interconnects cannot be secure, as this aligns the ports for the bottom and top tier directly. That is, the untrusted foundry has direct access to both tiers and could simply stack them up to recover the complete design. Hence, we randomize the arrangement of F2F ports as follows (Fig. 8). We place additional F2F ports randomly, yet with the help of on-track legalization (see below), in the top RDL. These randomized ports are then routed through the RDLs toward the original F2F ports connecting with the bottom tier. In short, randomization of F2F vias is required to protect the design against fab-based adversaries during manufacturing.
Obfuscated switchboxes: To further protect against RE attacks from malicious end-users, we obfuscate the connectivity in the RDLs, using a customized switchbox (Fig. 9) . This switchbox allows stealthy one-to-one mapping of four drivers to four sinks. The essence of the switchbox is the use of Mg/MgO vias (recall Sec. 4.2), to cloak which driver connects to which sink. The pins of the switchbox represent the F2F ports. The pins are aligned with the routing tracks to enable proper utilization of routing resources. For randomization, the additional ports connecting with the top tier are used for rerouting during design closure.
On-track legalization: Each F2F port is moved inside the core boundary, toward the center point defined by all instances connected with this port. Next, we obtain the closest and still-unoccupied on-track location for actual placement. If need be, we step-wise increase the search radius considering a user-defined threshold.
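The legalization step can be sketched as below, assuming a uniform track grid with a single pitch and a rectangular core outline. All names and parameters are hypothetical. The search expands in square rings around the center point, so the location found is the nearest free one within the first ring that has a free track, which only approximates the closest location overall.

```python
def legalize_port(connected_xy, occupied, pitch, core, max_radius):
    """connected_xy: (x, y) of all instances connected to the port.
    occupied: set of (col, row) track indices already used (updated in place).
    core: (xmin, ymin, xmax, ymax); max_radius: user-defined limit, in tracks.
    Returns the chosen (col, row) or None if nothing is free within max_radius."""
    xmin, ymin, xmax, ymax = core
    # Center point of all connected instances, clamped to the core boundary.
    cx = min(max(sum(x for x, _ in connected_xy) / len(connected_xy), xmin), xmax)
    cy = min(max(sum(y for _, y in connected_xy) / len(connected_xy), ymin), ymax)
    c0, r0 = round(cx / pitch), round(cy / pitch)
    for radius in range(max_radius + 1):          # step-wise larger search radius
        ring = [(c, r)
                for c in range(c0 - radius, c0 + radius + 1)
                for r in range(r0 - radius, r0 + radius + 1)
                if max(abs(c - c0), abs(r - r0)) == radius]
        free = [(c, r) for c, r in ring
                if (c, r) not in occupied
                and xmin <= c * pitch <= xmax and ymin <= r * pitch <= ymax]
        if free:
            best = min(free, key=lambda t: (t[0] * pitch - cx) ** 2 + (t[1] * pitch - cy) ** 2)
            occupied.add(best)
            return best
    return None

occupied = set()
first = legalize_port([(0, 0), (2, 2)], occupied, pitch=1, core=(0, 0, 10, 10), max_radius=3)
second = legalize_port([(0, 0), (2, 2)], occupied, pitch=1, core=(0, 0, 10, 10), max_radius=3)
print(first, second)
```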
Design Closure
After the F2F via planning stage, both tiers are placed and routed separately, independent of each other. For sequential designs, we conduct CTS on both tiers independently, as suggested in [35] .
Recall that we do not engage in any cross-tier optimization, on purpose, to anonymize the individual tiers from each other. However, we apply intra-tier optimization. While routing the bottom tier, we also route the randomized and obfuscated RDLs with their switchboxes. Next, we encapsulate the top and bottom partitions in a wrapper netlist, and we assemble and implement the design, followed by generating a Standard Parasitic Exchange Format (SPEF) file that captures the RC parasitics of the F2F vias (modeled as regular vias, see below). Finally, we perform DRC checks, evaluate the PPA, and stream out separate DEF files for the top/bottom tiers and the RDL.
RESULTS FOR IP PROTECTION
Experimental Setup
Implementation and layout evaluation: Since there are no commercial tools available yet for (F2F) 3D ICs, we implement our CAD flow within Cadence Innovus 17.1, using custom TCL and Python scripts. Our implementation imposes negligible design runtime overheads. We use the NanGate 45nm library [37] for our experiments, with six metal layers for the baseline 2D setup and six layers each for the top and bottom tier in the F2F setup. The RDL comprises four duplicated layers of M8, from which two are used for embedding the obfuscated switchboxes, and two are used for randomizing the routing. F2F vias are modeled as M6 vias; while this is an optimistic assumption for now, F2F technology scaling can be expected to reach such dimensions. The PPA analysis is conducted for the slow process corner, using CCS libraries at 0.95V. For power analysis, we assume a switching activity of 0.2 for all primary inputs. We ensure that the layouts are free of any congestion by choosing appropriate utilization rates for the 2D baselines. This is essential to prevent any possible congestion from being carried forward in our 3D flow. All experiments are carried out on an Intel Xeon E5-4660 @ 2.2 GHz with CentOS 6.9. For Cadence Innovus, up to 16 cores are allocated.
Setup for security evaluation: Since we promote 3D SM, regular proximity attacks such as [15], [16] cannot be applied. Thus, we propose and publicly release a novel attack against 3D SM [42], also accounting for the RDL obfuscation underlying our scheme; see also Sec. 6.4. Attacks on our protected layouts are evaluated by commonly used metrics, i.e., the correct connection rate (CCR), percentage of netlist recovery (PNR) [18], and Hamming distance (HD). HD is calculated using Synopsys VCS with 1,000,000 test patterns. As for SAT-based RE attacks, we leverage the tool provided by [29], with the related time-out set to 72 hours.
Designs: The commonly considered benchmarks from the ISCAS-85 and ITC-99 suites are used for layout and security analysis. In addition, we also use two SoC benchmarks: the DARPA CEP [36] and the JPEG OpenCores design [43] . 
Security-Driven Layout Evaluation
Our flow allows trading off PPA against cuts; the latter dictate the resilience against IP piracy both during and after manufacturing. Figure 10 showcases the layout images for benchmark b22.
Random Partitioning and Maximizing the Cut-Size
Initially we study random partitioning of gates, by moving them randomly from the bottom to the top group in steps of 10%, up to 50%. As the strategy is randomized, we perform ten runs for each benchmark for any given percentage of gates to move. The resulting power and performance distributions are illustrated in Fig. 11 .
Interestingly, even for the security-wise best case of randomly moving 50% of the gates, some runs still provide better power and/or performance than the 2D baseline. The savings in performance can be attributed to the fact that, when splitting the design across the vertical dimension, we can obtain a reduction in wirelength, which helps to improve timing. We note that these improvements, on average, come at some expense of power, with related overheads in the range of 0-7% for lifting/moving 50% of gates.
While this demonstrates the potential for naive random partitioning, it is important to note that this finding only holds true as long as we refrain from randomizing the F2F ports and from using the obfuscated switchboxes for these experiments. In fact, once we seek to maximize the cuts, along with randomization of F2F ports and use of switchboxes, larger ITC-99 benchmarks such as b18_1 incur considerable overheads of up to 60% (Fig. 12). Here we also observe that large cut-sizes lead to an increase in routing congestion and total wirelength, thereby further increasing the total capacitance of the design. This offsets the performance benefits which regular, security-oblivious 3D integration can be expected to achieve [12].
In short, although these strategies offer strong resilience, a more aggressive PPA-security trade-off may be desired.
Timing-Aware Partitioning
This setup tackles that need for achieving security while maintaining reasonable PPA cost. We observe that even for larger ITC-99 benchmarks such as b18_1 and b19 (Fig. 13), there are some benefits when comparing the secure 3D designs to their 2D baseline. As explained in Sec. 5.1, since the most timing-critical gates are constrained to one tier, we induce significantly fewer cuts along the timing paths for the 3D design. For example, we observe a reduction of about 60% in timing-path cuts for ITC-99 benchmarks b18_1 and b19 when compared to random partitioning. To demonstrate the security implication of this setup, we plot the normalized distances between to-be-connected F2F vias in Fig. 8. This figure shows a wide variation across the inter-tier nets, whereas for regular, unprotected F2F stacking the distances would be all zero. Overall, the choice of partitioning lies with the designer, which she/he can trade off considering security and PPA cost, but timing-aware partitioning should be considered first, i.e., at least for non-hierarchical designs.
Fig. 11. Impact of randomly assigning gates (moving 10%, 30%, and 50% of gates) on performance overheads (top, in %) and power overheads (bottom, in %). Each boxplot represents ten runs. Note that the same benchmarks are applied for the top and bottom plots; benchmark labels are accordingly placed between those plots.
Case Study on CEP and JPEG
Besides the well-known benchmarks considered above, we also conduct case studies on two "real-world netlists," the DARPA CEP [36] and the JPEG OpenCores design [43] .
Thus, we also demonstrate our secure end-to-end CAD flow for practical 3D ICs.
Setup:
The utilization is set to 70% and 60% for CEP and JPEG, respectively, which ensures that the 2D baseline designs are devoid of any congestion. We use the NanGate 45nm library [37] . F2F vias are modeled as M10 vias. All 2D and 3D designs operate at iso-performance, with a timing constraint of 5ns (i.e., at 200 MHz). Further details are the same as in Sec. 6.1.
Results:
The results for the 2D baseline and secure 3D designs are provided in Table 4 . Regarding the footprint area/die outlines, both the secure 3D designs provide savings over their 2D baselines, namely 49.3% for CEP and 42% for JPEG. Regarding instance counts, we observe some overheads for both the secure 3D designs; as we do not apply any cross-tier and/or post-partitioning optimization, there is less leverage to reduce instance counts for the tools. Again, for regular, security-oblivious 3D F2F integration, one would expect savings/reductions in both wirelength and instance count, which ultimately also enables power savings [12] .
For our security-driven 3D flow, it depends on various aspects whether there are power/performance savings or overheads. First, recall that we randomize F2F vias and leverage obfuscated switchboxes (to deter fab-based adversaries and malicious end-users, respectively), which tends to increase wirelength, and thereby the driver strengths and/or buffer counts. Second, the designs have some impact by themselves. For example, for JPEG we note 8.72% higher power consumption, whereas for CEP we note a 6.49% power reduction. Third, partitioning plays an important role as well, as already discussed. Since CEP and JPEG are both hierarchical designs, we apply hierarchical partitioning here, which fully protects the system-level IP orchestration and any glue logic. To further protect individual modules, we can split them up across the two tiers; which modules to select and how to split is the designer's decision, also depending on the nature of the modules and the overall design [39]. Toward this end, we also performed an experiment on CEP where individual modules were partitioned, resulting in 20,863 F2F vias (3.25× as many as reported in Table 4). We maintain that both tiers are DRC-clean and free of congestion. We observe power and timing overheads of 9.59% and 13.64%, respectively, which implies that this 3D design can operate at around 176 MHz. Running the overall chip at this frequency also ensures that there is no loss in system functionality. Finally, note that the number of cuts we obtain here indicates strong resilience of the 3D designs; we discuss this further in Sec. 6.4.
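As a quick check of the operating frequency quoted above, and assuming that the 13.64% timing overhead applies directly to the 5 ns clock period of the iso-performance setup, the arithmetic works out as follows.

```latex
T_{\mathrm{3D}} = 5\,\mathrm{ns} \times (1 + 0.1364) = 5.682\,\mathrm{ns},
\qquad
f_{\mathrm{3D}} = \frac{1}{T_{\mathrm{3D}}} \approx 176\,\mathrm{MHz}.
```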
Comparison with Prior Art
LC schemes: Among others, threshold-voltage-dependent LC is gaining traction. Although promising concerning resilience, the PPA costs are considerable. For example, Akkaya et al. [27] report overheads of 9.2×, 6.6×, and 3.3× for power, performance, and area, respectively, when compared to a conventional 2-input NAND gate. Nirmala et al. [26] report 11.2× and 10.5× cost for power and area, respectively. Besides, for interconnects camouflaging, Patnaik et al. [28] report PPA overheads of 4.9%, 31.2%, and 25% for ITC-99 benchmark b17 at 60% LC. When compared to these schemes, we can provide significantly better PPA (except for [28] concerning power). Regarding prior art on 3D LC [22], [23], recall that they require a trusted FEOL facility; hence, their schemes are not directly comparable to ours. Also, at the time of writing, their libraries and protected designs were not available to us for a detailed study. Moreover, for [22], the authors leverage regular 2D LC schemes while using different technology nodes. Depending on the particular node and LC scheme, this may induce large PPA overheads, and technology-heterogeneous 3D integration may hold further complications [14], [35]. Recall that these concerns were our main motivation to advocate the use of uniform/same technologies and camouflaging of vertical interconnects.
SM schemes: In Table 5, we compare with studies on 2D SM. Overall, the placement-centric techniques by Wang et al. [16] are competitive concerning power and performance. However, as is always the case for regular SM, they can only avert fab-based adversaries, but not malicious end-users.
In Table 6, we compare with the security-driven 2.5D integration scheme by Xie et al. [21]. Their work is relevant as they propose a similar notion of security based on cut sizes. For the benchmarks the authors considered, we obtain on average 53% more cuts in our scheme. (For our cut sizes on larger benchmarks, refer to Table 7.) Regarding PPA, we observe significantly lower costs than [21] (see footnote 5). Besides, as with regular SM, their 2.5D scheme is not inherently resilient against malicious end-users, but our 3D scheme is.
Security Analysis and Attacks
Our Proximity Attack for 3D SM
To the best of our knowledge, there is no attack yet in the literature which can account for 3D SM in the context of IP piracy. Hence, we propose and implement such an attack, with a focus on one untrusted foundry (or two colluding foundries) and our RDL obfuscation. We provide this attack as a public release in [42] .
We assume that the attacker holds the layout files for the top and bottom tier, but, residing in the untrusted fab, she/he has no access to the trusted RDL (see footnote 6). Although she/he understands how many drivers are connecting from the bottom to the top tier and vice versa, she/he does not know which driver connects to which sink, given the randomization of F2F vias. Recall that we do not engage in cross-tier optimization, to mitigate any layout-level hints. Let us assume there are $d_{bot}$ drivers in the bottom and, independently, $d_{top}$ drivers in the top tier. Since we do not allow for fan-outs within the RDL (this would occupy more F2F vias than necessary), there are only one-to-one mappings; this results in $d_{bot}! \times d_{top}!$ possible netlists. Once switchboxes are used, however, the attacker can tackle groups of four drivers/sinks at once. Still, she/he has to resolve (a) which four top-tier drivers are connected to which four bottom-tier sinks and vice versa, and (b) the connectivity within the obfuscated switchboxes. For those cases, there are $4! \times \left( \left(\tfrac{1}{4} d_{bot}\right)! \times \left(\tfrac{1}{4} d_{top}\right)! \right)$ possible netlists remaining. Next, we outline the corresponding heuristics at the heart of our attack.
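To get a feeling for the size of this search space, the two expressions can be evaluated directly. The sketch below simply transcribes the expressions as stated in the text; the function names are hypothetical, and the driver counts used in the example are made up for illustration.

```python
from math import factorial, log10

def netlists_without_switchboxes(d_bot, d_top):
    """d_bot! x d_top! one-to-one mappings, as stated in the text."""
    return factorial(d_bot) * factorial(d_top)

def netlists_with_switchboxes(d_bot, d_top):
    """4! x ((d_bot/4)! x (d_top/4)!), as stated in the text.
    Assumes driver counts that are multiples of four (one switchbox per four drivers)."""
    if d_bot % 4 or d_top % 4:
        raise ValueError("driver counts must be multiples of four")
    return factorial(4) * factorial(d_bot // 4) * factorial(d_top // 4)

# Illustrative driver counts only, not taken from any benchmark.
for d in (8, 100, 1000):
    plain = netlists_without_switchboxes(d, d)
    boxed = netlists_with_switchboxes(d, d)
    print(f"d_bot = d_top = {d}: 10^{log10(plain):.0f} vs. 10^{log10(boxed):.0f} candidates")
```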
1) Unique mappings: Any driver in the bottom/top tier will feed only one sink in the top/bottom tier. Hence, an attacker will reconnect drivers and sinks individually. Moreover, she/he can identify all primary I/Os as they are implemented using wirebonds or TSVs, not randomized F2F vias.
2) Layout hints: Although the F2F vias are randomized, the attacker may try to correlate the proximity and orientation of F2F vias with their corresponding RDL connectivity. Toward this end, she/he can also investigate the routing toward the switchbox ports. Moreover, recalling the practical threat model, the attacker may be able to identify some known IP and confine the related sets of candidate F2F interconnects accordingly. Our attack is generic and can account for those scenarios, by keeping track of the candidate F2F pairings considered by the attacker.
3) Combinatorial loops: Both tiers and thus all active components are available to the attacker, hence
5. Concerning area, note that we report on die outlines, which is standard practice for 3D studies. Accordingly, for our result of -50%, the 3D IC and the 2D baseline require the same total silicon area. In other words, we incur 0% additional area cost. While Xie et al. report on additional area cost, they omit that their scheme requires an interposer which, being at least as large as the chips stacked onto it, incurs ≥100% cost. Still, when only comprising metal layers, we acknowledge that an interposer is less expensive than regular chips.
6. We discuss the implications for malicious end-users being able to access the obfuscated RDL further below.
The results in Table 7 indicate the efficiency of our proposed proximity attack (especially over the SAT-based attack [29], see below for that scenario). Here we assume that the attacker is able to infer all the driver-sink pairings for the switchboxes correctly; only the obfuscation within switchboxes remains to be attacked. This is a strong assumption, which renders our evaluation conservative. In fact, this scenario can be considered as an optimal proximity attack, as for all F2F connections the correct one is always among the considered candidates. With regard to CCR, PNR, and HD for the recovered netlists, our protection scheme can be considered as reasonably secure (Fig. 14). Although PNR, which represents the degree of similarity between the original and the recovered netlist [18], is around 30% or more for most benchmarks, HD approaches the ideal value of 50% for most benchmarks. In other words, although our attack can correctly recover some parts of the design, the overall functionality still remains obscured.
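Two of these metrics are simple to state in code. The sketch below assumes that the true and recovered driver-to-sink mappings are available as dictionaries, and that output responses of the original and recovered netlists have already been simulated for the same test patterns and stored as bit strings. Function names and example data are hypothetical, and the simulation itself is outside the sketch.

```python
def correct_connection_rate(true_map, recovered_map):
    """Share of drivers whose recovered sink equals the true sink."""
    correct = sum(1 for drv, snk in true_map.items() if recovered_map.get(drv) == snk)
    return correct / len(true_map)

def hamming_distance(outputs_ref, outputs_rec):
    """Average share of differing output bits over all test patterns.
    Both arguments are lists of equal-length bit strings, one per pattern."""
    total_bits = sum(len(o) for o in outputs_ref)
    differing = sum(a != b
                    for ref, rec in zip(outputs_ref, outputs_rec)
                    for a, b in zip(ref, rec))
    return differing / total_bits

# Made-up example: two of four connections recovered, two of eight output bits differ.
true_map = {"d0": "s0", "d1": "s1", "d2": "s2", "d3": "s3"}
recovered = {"d0": "s0", "d1": "s2", "d2": "s1", "d3": "s3"}
print(correct_connection_rate(true_map, recovered))
print(hamming_distance(["0000", "1111"], ["0011", "1111"]))
```

An HD near 0.5 means that the recovered netlist behaves, on average, like a random guess at the outputs, which is why 50% is regarded as the ideal value for the defender.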
SAT-Based Attacks
After manufacturing, the attacker can readily understand which four drivers/sinks are connected through the switchboxes, but she/he still has to resolve the obfuscation within the switchboxes themselves. The attacker may now leverage a working copy as an oracle and launch a SAT attack. Toward that end, we employ the attack proposed in [29] , and we model the problem using multiplexers. Empirical results are given in Table 7 . As expected, the SAT-based attack succeeds for smaller designs but runs into time-out for larger designs. This finding is also consistent with those reported by Xie et al. [21] for their security-driven 2.5D scheme, which has a security notion similar to our work.
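To illustrate what an oracle-guided attacker has to resolve for a single switchbox, the following sketch enumerates all 4! driver-to-sink assignments and keeps those that agree with the oracle on every input pattern. This is a brute-force stand-in for illustration only; it is not the SAT attack of [29] and not the multiplexer model used in our experiments. The driver and sink logic is entirely hypothetical.

```python
from itertools import permutations, product

def simulate(perm, inputs):
    """Toy two-tier circuit: four drivers feed four sinks through one switchbox.
    perm[i] is the index of the driver connected to sink i."""
    a, b, c, d = inputs
    drivers = (a & b, a ^ c, b | d, c & d)           # hypothetical driver-side logic
    s0, s1, s2, s3 = (drivers[perm[i]] for i in range(4))
    return (s0 ^ s1, s1 & s2, s2 | s3)               # hypothetical sink-side logic

def surviving_assignments(secret_perm):
    """Keep every assignment that matches the oracle on all 16 input patterns."""
    patterns = list(product((0, 1), repeat=4))
    oracle = [simulate(secret_perm, p) for p in patterns]
    return [perm for perm in permutations(range(4))
            if [simulate(perm, p) for p in patterns] == oracle]

secret = (2, 0, 3, 1)
survivors = surviving_assignments(secret)
print(len(survivors), "of 24 assignments are consistent with the oracle")
```

Exhaustive enumeration is feasible for one switchbox; a solver-based formulation is needed because the switchboxes of a whole design have to be resolved jointly.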
METHODOLOGY FOR TROJAN PREVENTION
So far we have leveraged the potential of 3D integration for IP protection. As elucidated earlier, insertion of hardware Trojans (HTs) by an untrustworthy foundry is another concerning threat. Security schemes like SM can hinder those adversaries from obtaining and fully understanding the netlist; hence, the adversaries may fail to insert HTs at particular targeted locations. However, there are many other parties in an IC supply chain which may leak the netlist to those fab-based adversaries. Therefore, following a strong threat model [10] , [11] (reviewed next), we leverage 3D integration to hinder such an advanced HT threat scenario.
Strong Threat Model
The security guarantee of k-security, as proposed by Imeson et al. [10], is as follows. Given a k-secure FEOL layout and the complete, final gate-level netlist (just before splitting into FEOL/BEOL), an attacker has only a chance of 1/k for successful HT insertion into a particular location (or an up to k times higher risk for having ≤ k HTs detected by subsequent inspection). To achieve this, the idea is to induce k isomorphic structures in the FEOL by carefully lifting wires to the BEOL. As a result, an attacker cannot uniquely map these k structures to the target in the netlist, but can only randomly guess with a probability of 1/k (Fig. 15). Imeson et al. [10] developed a greedy heuristic to select wires to be lifted to the BEOL, and also apply SAT to compute the security level, i.e., the minimal degree of isomorphism for any cell type found in the whole FEOL layout.
Fig. 14. Percentage of netlist recovery (PNR) [18], Hamming distance (HD), and correct connection rates (CCR), when the benchmarks are subjected to our 3D SM proximity attack [42]. Each box represents ten runs.
Fig. 15. k-security is to apply split manufacturing and purposefully lift wires to the BEOL (indicated by dashed wires) such that FEOL-based attackers cannot uniquely identify some or any parts of the netlist (with targets marked) available to them. Hence, targeted Trojan insertion becomes difficult. Here, with security level two, an attacker has a 50% chance each for correctly inserting Trojans into the targeted OR and NOR-NAND structures.
Discussion of Prior Art
Although it provides a formal foundation against HT insertion, k-security [10] has practical shortcomings as follows.
"-" The security level hinges on the underlying composition of gates which, in turn, is dictated by logical synthesis. The fewer the instances of some gate type in the FEOL, the lower the overall security level. That is because k-security is defined as the minimum security level across all gate types. The technology library plays a key role here as well: the "richer" the library in terms of gate types, the lower the average counts for some individual types, and the lower the overall security level. We report on this correlation in Table 8. In short, being agnostic to synthesis, [10] is at the mercy of the design tools.
"-" The reported layout cost is considerable. For the ISCAS-85 benchmark c432, for security level 8, the overheads are already 61%, 82%, and 207% for PPA, respectively. Hence, applying k-security on large designs can become prohibitive.
"-" Computing the security level is NP-complete and related to the problem of subgraph isomorphism. Imeson et al. [10] leverage SAT solvers and circuit partitioning to limit the computational cost. Still, we note that the authors did not investigate large-scale benchmarks with hundreds of thousands of gates. Also, while circuit partitioning is a common practice, here it hinders exploring the security for the whole netlist holistically, as sub-circuits can only be secured individually after partitioning.
"-" Imeson et al. [10] delegate only wires but no gates to the trusted 2.5D facility. This approach is the same as with classical SM in 2D ICs; hence, the complete potential of 3D integration is not utilized.
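The dependence on gate composition noted in the first item can be made concrete with a small sketch. It computes only an upper bound on the security level, namely the minimum instance count across all gate types in a netlist; the actual level may be lower, since it also depends on the connectivity that remains visible in the FEOL. The gate lists are invented for illustration.

```python
from collections import Counter

def max_security_level(gate_types):
    """Upper bound on the k-security level: the smallest instance count
    across all gate types present in the netlist."""
    counts = Counter(gate_types)
    return min(counts.values()), counts

# Hypothetical netlist synthesized with a small library ...
small_lib = ["NAND2"] * 5 + ["NOR2"] * 3 + ["INV"] * 7
# ... and with a richer library, where one rarely used type appears only once.
rich_lib = ["NAND2"] * 4 + ["NOR2"] * 3 + ["INV"] * 6 + ["XOR2"] * 1

print(max_security_level(small_lib)[0])
print(max_security_level(rich_lib)[0])
```

A single rarely used gate type is enough to pull the bound down for the whole layout, which is the reason why richer libraries tend to yield lower security levels.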
Li et al. [11] recently proposed an advancement for k-security, with the following contributions (labeled as "+").
"+" Additional dummy gates (and dummy wires) help to raise the security level as required for layouts initially holding only a few instances of some gate type. To model the insertion of dummy gates and simultaneous lifting of wires, Li et al. [11] consider the concept of spanning subgraph isomorphism.
"+" To limit computational cost, they propose a mixed-integer linear program (MILP)-based framework.
"+" To limit layout cost, they protect only selected vulnerable gates in the layout, not the whole design. They further impose a close but uniform arrangement of isomorphic instances, to limit placement disturbances without leaking layout-level hints.
Despite the advances proposed in [11], there are also some limitations (labeled as "-").
"-" As with Imeson et al. [10], Li et al. [11] apply partitioning on larger netlists, which effectively hinders securing industrial designs in their entirety. Also, Li et al. [11] do not investigate any large-scale designs.
"-" While Li et al. [11] may raise the security level using dummy gates (and dummy wires), these gates can impose significant area overheads (and timing overheads, due to routing congestion when having to lift all related wires). We report on the PPA impact for adding dummy gates/wires in Table 9. For the ITC-99 benchmark b17_1, for security level 20, there are up to 1.2× more gates and 5.2× more wires when compared to the unprotected designs.
In short, Imeson et al. [10] and Li et al. [11] provide a solid formal foundation to protect against targeted HT insertion, but there are practical limitations to both schemes.
Our Contributions
Here we tackle all the outlined shortcomings of [10], [11]. There are two key pillars for our work: (i) a security-driven synthesis strategy and (ii) an end-to-end CAD flow for preventing HTs in 3D ICs. The motivation for a security-driven synthesis strategy is that [10], [11] implement their protection on top of a given netlist, solely as an afterthought. In contrast, by delegating the construction of isomorphic structures to the synthesis stage, we effectively render the protection against HTs a design-time priority. Besides, our end-to-end CAD flow for 3D ICs, extended from the earlier part of this work, effectively raises the notion of k-security toward practical application for preventing HT insertion in large-scale designs.
7. In [10] it is only mandated that each vulnerable gate shall have k−1 isomorphic instances, whereas Li et al. [11] further require that none of these instances are vulnerable themselves which, arguably, provides a more stringent security notion. Dummy gates/wires are necessary for this purpose, to compensate whenever some of the isomorphic instances are vulnerable themselves. In this work, we follow the notion of [10] while our methods can be easily tailored toward [11] as well.
Table 8. ... [10]. All gates have driving strength X1. The maximal security levels equal the minimum count across the libraries' gate types (marked in boldface); the actual level may be lower.
Table 9. ... [11], netlists provided as courtesy by the authors of [11]. Setup lib-3 is as in Table 8, and lib-8 contains NAND2, NOR2, AND2, OR2, XOR2, XNOR2, INV, and BUF, all in X1 strength. Left: number of gates and wires for original, unprotected layouts. Right: number of additional dummy gates (D. Gates) and lifted wires (L. Wires) for security levels 10 (S10) and 20 (S20) when 10% of original gates are considered vulnerable.
Security-Driven Synthesis Stage
The essence of our security-driven synthesis strategy is as follows. For any netlist, we state that the designer can identify the structures vulnerable to HT insertion (e.g., by vulnerability analysis [11], [44]) and wants to protect them accordingly. In agreement with k-security, the designer then intends to induce many isomorphic instances of those structures in the FEOL. To take control of the layout cost, but also to advance scalability and the attainable level of security, we delegate this step of inducing isomorphic instances to the synthesis stage. This way, our approach can be considered as "secure by construction" (see footnote 8). We provide more details for the synthesis stage in Fig. 16 and below.
Based on some vulnerability analysis of choice [11], [44], the designer first identifies the vulnerable gates/structures. In this study, we leverage [11] to identify various structures which are covering the vulnerable gates as well as some surrounding gates (Fig. 17). It is understood that the designer can investigate as many structures as desired regarding (i) layout cost, (ii) the potential for inducing isomorphic instances, and (iii) the coverage of vulnerable gates. In fact, we explored in total 18 structures; the ones illustrated in Fig. 17 are the most promising ones for our empirical study (see footnote 9). Next, these structures are prepared for synthesis, i.e., they are defined as custom cells. Typically, one has to conduct library characterization when creating custom cells, but here we can refrain from this effort. That is because we decompose the structures again later on, i.e., we transform them back into their corresponding arrangement of simple two-input cells. It is essential, however, to track and preserve all gates related to the decomposed structures, by setting them as "don't touch," such that the design tools cannot interfere with these gates. We also keep track of the input/output wires of the structures, which have to be lifted later on to the BEOL to achieve k-security. Now, instead of library characterization, we leverage the characteristics of simple cells available in the library (e.g., NAND with the same number of inputs as the structure), but we adapt the Boolean functionality as needed. We note that utilizing the characteristics of simple cells saves not only effort but also "tricks" the synthesis tool, at least to some degree, into using more instances for those structures (see also Sec. 8.1, Table 10). That is presumably because the structures provide some complex Boolean functionality with little, indeed over-optimistic layout cost. With our iterative synthesis approach, we can thus impose more instances of the various structures as needed, while also correctly gauging the anticipated layout cost (by decomposing the structures).
Fig. 17. The seven structures considered in this work without loss of generality. They are constructed based on the vulnerability analysis of [11].
8. Although we do modify the netlist, we still assume, in agreement with [10], [11], that the attacker holds the final gate-level netlist, truthfully representing our security-driven synthesis stage. As a result, we do not imply security through obscurity.
9. See Sec. 8.1 for the layout cost of all the 18 structures investigated.
Once the synthesis iterations are completed, which is up to the designer to decide and should be based on the security level and/or synthesis-level PPA cost, the vulnerability of the final netlist is to be re-evaluated. The final security level hinges on how many vulnerable gates are covered by the isomorphic instances of all structures.
End-to-End CAD Flow
As indicated, our 3D IC CAD flow for preventing targeted HT insertion is extended from the earlier part of this work. However, there are some differences as follows.
1) Since the threat of HT insertion applies exclusively to manufacturing time, we do not require LC/obfuscation for the RDL. However, we require randomization of the routing paths within the RDL.
2) k-security is applied initially, to the whole netlist at once, using our security-driven synthesis strategy.
3) Our flow comprises techniques similar to those required for regular SM. More specifically, we leverage our customized lifting cells [18] to lift wires to the RDL as dictated by k-security.
Our flow enables the concerned designer to tackle both layout cost and resilience against HT insertion. To do so, we integrate our security-driven synthesis strategy as a key stage for the overall flow (Fig. 16 ). Further steps in our flow are partitioning, planning of F2F vias, and placement and SM-aware routing. We provide some details below.
Design partitioning: After conducting the security-driven synthesis strategy, we place and route the 2D design. The resulting netlist is partitioned into two groups, representing the top and bottom tiers of the F2F 3D IC. We define I/O ports for wires crossing the two tiers after partitioning; these ports become F2F vias later on.
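As a minimal sketch of how the tier-crossing wires could be identified, the snippet below lists all nets whose gates end up in different tiers. The dictionary-based netlist representation and the name `f2f_ports` are assumptions for illustration, not the format used by the actual tools.

```python
def f2f_ports(nets, tier_of):
    """Return the nets that cross tiers and hence need an F2F port.

    nets:    dict mapping a net name to the gates it connects (hypothetical format)
    tier_of: dict mapping each gate name to "top" or "bottom"
    """
    ports = []
    for net, gates in nets.items():
        # A net needs a port if its gates are spread over more than one tier.
        tiers = {tier_of[g] for g in gates}
        if len(tiers) > 1:
            ports.append(net)
    return sorted(ports)
```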
As for the impact of 3D partitioning on k-security, it is important to note the following. First, we ensure that all the gates of any decomposed isomorphic instance stay together in one tier. Second, since we apply the synthesis strategy on the complete netlist, i.e., before partitioning, there is no inherent limitation for isomorphic instances. Third, partitioning itself may, however, still impact the final security level. That is because once an attacker can retrace the same partitioning steps as the designer, the attacker can also infer which subset of isomorphic instances goes into which tier. It is easy to see that fully random partitioning would render this attacker's benefit void, but we found that this can impose considerable layout cost. Based on Sec. 5.1, we therefore propose a customized timing-driven and security-aware partitioning technique as follows.
First, we obtain the timing reports for the 2D baseline layout. Then, each critical timing path without any isomorphic instances is kept within one tier. Other paths, i.e., paths with some isomorphic instances or non-critical paths, are randomly partitioned across the two tiers. With this partitioning technique, the attacker cannot understand which isomorphic instances in the bottom/top tier relate to which in the netlist; hence, the definition of the security level as in [10] is maintained.
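The partitioning rules above can be sketched as follows. This is a simplified illustration under stated assumptions: paths are given as lists of gate names with a criticality flag, `instance_of` maps a gate to a hypothetical instance identifier, and tier balancing as well as any cost optimization are omitted. The function name `partition` and all parameters are illustrative.

```python
import random

def partition(gates, paths, instance_of, seed=None):
    """Assign each gate to "top" or "bottom".

    gates:       list of gate names
    paths:       list of (gate_list, is_critical) tuples from a timing report
    instance_of: dict mapping a gate to its isomorphic-instance id (absent/None if none)
    """
    rng = random.Random(seed)
    parent = {g: g for g in gates}

    def find(g):
        # Union-find lookup with path halving.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    def union(a, b):
        parent[find(a)] = find(b)

    # Rule 1: all gates of one isomorphic instance stay together.
    first_gate = {}
    for g in gates:
        inst = instance_of.get(g)
        if inst is not None:
            union(g, first_gate.setdefault(inst, g))

    # Rule 2: a critical path without isomorphic instances stays in one tier.
    for path_gates, is_critical in paths:
        if is_critical and all(instance_of.get(g) is None for g in path_gates):
            for g in path_gates[1:]:
                union(g, path_gates[0])

    # Rule 3: every resulting group is placed in a randomly chosen tier.
    tier_of_group, tiers = {}, {}
    for g in gates:
        root = find(g)
        if root not in tier_of_group:
            tier_of_group[root] = rng.choice(("top", "bottom"))
        tiers[g] = tier_of_group[root]
    return tiers
```

Grouping via union-find is a design choice of this sketch: it guarantees that overlapping critical paths and multi-gate instances are never split, at the price of potentially large groups.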
Planning of F2F interconnects: These steps are primarily the same as in Sec. 5.2, in particular, the randomization of F2F ports and the custom on-track legalization, but here we can refrain from implementing obfuscated switchboxes.
Wire lifting and design closure: Next, both tiers are placed and routed separately. In the absence of the obfuscated switchboxes leveraged for routing the RDL in Sec. 5.3, here we re-tailor our custom lifting cells [18] to enable wire lifting as required for k-security. That is, while routing the top/bottom tier, we route the regular metal layers, lift wires to the RDL with the help of the lifting cells, and route the RDL. For design closure, the top and bottom tiers are wrapped into one netlist. Again, we purposefully do not engage in any optimization across tiers, to maintain anonymized layouts. As in Sec. 5.3, we derive the SPEF from the wrapper netlist to capture the RC parasitics of F2F vias, and we evaluate the final layout cost.
RESULTS FOR TROJAN PREVENTION
The setup is the same as in Sec. 6, except that two M8 layers are used for the RDL. That is because, without the need for obfuscated switchboxes, the RDL is less complex.
Analysis of the Security-Driven Synthesis Stage
During the iterative, security-driven synthesis stage, note that we fix/preserve all gates of any isomorphic structure. We observe that doing so helps to guide the logic synthesis toward the remaining parts of the netlist not yet covered by some structures; we can increase the instance counts within a reasonable runtime. Since the synthesis iterations require only a few minutes for all the commonly considered benchmarks (e.g., 25-45 minutes even for ITC-99), we additionally explore the large-scale IBM superblue benchmarks for scalability of our synthesis strategy (Fig. 18). Here we observe that our strategy still imposes only little runtime cost, about 6.3% on average for the first iteration, and runtimes for successive iterations decrease further.
In Table 10, we report on isomorphic instances and their coverage for various benchmarks, based on 10% of all gates being identified as vulnerable ones [11]. We find that large parts of the netlists can be covered by isomorphic instances already after a few synthesis iterations, namely 36.5%, on average, after five iterations. This coverage provides strong protection beyond the 10% of gates targeted. For the large-scale IBM superblue benchmarks (not illustrated in Table 10) the coverage is even higher, at 59%, on average. Although we seek to cover all the gates which have been initially identified as vulnerable ones, some of those gates may not be covered in the final netlist, due to the iterative re-synthesis runs. We found that this effect is acceptable; of the 10% identified gates, there are 9% covered on average. Besides, we note that few of the critical paths contain isomorphic instances to begin with. That is because critical paths rarely contain vulnerable gates [11], [44]; related Trojans would be easy to detect by delay testing.
[Table 10, caption fragment: ... Fig. 17, netlist coverage, and final security level k for the iterative re-synthesis.]
As for security levels following the definition in [10], we can achieve significant levels already after the fifth iteration, ranging from 30 for ITC-99 benchmark b14 up to 1,221 for b19. Based on the counts of individual instances, one may also revisit the synthesis after ruling out some structures which tend to limit the security levels, e.g., structure (c) for Table 10. Hence, our strategy provides margin to the designer for both cases when lower levels are sufficient or higher levels are desired.
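To illustrate how ruling out a structure can raise the security level, consider the following minimal sketch. It assumes that the level k is bounded by the structure with the fewest isomorphic instances; this is a simplifying reading of the definition in [10], and the instance counts in the usage note are made up for illustration.

```python
def security_level(instance_counts, excluded=()):
    """Return (k, limiting structures) under the min-count assumption.

    instance_counts: dict mapping a structure label to its instance count
    excluded:        structure labels ruled out before re-synthesis
    """
    used = {s: n for s, n in instance_counts.items() if s not in excluded}
    if not used:
        return 0, []
    # Assumption: the rarest structure bounds the achievable level k.
    k = min(used.values())
    limiting = sorted(s for s, n in used.items() if n == k)
    return k, limiting
```

With the hypothetical counts `{"a": 200, "b": 75, "c": 12}`, the sketch reports structure "c" as limiting; excluding it lifts the bound to the count of structure "b".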
In Table 11, we report on the impact of some synthesis iterations and the final 2D layout cost. For the latter, wire lifting toward the BEOL as required by k-security is already accounted for. Overall, layout costs are acceptable; that is especially true when qualitatively comparing to [10], where already notably smaller benchmarks induced significant PPA cost. That is also because we can safely apply buffer insertion to tackle timing degradation, as can be seen below.
As indicated earlier, we also provide the distribution of layout cost for all 18 different structures we explored in Fig. 19.
There we contrast the re-synthesized netlists without any optimization to those after buffer insertion. Since all isomorphic instances are preserved, buffer insertion cannot interfere with those structures and security is not undermined. This simple technique helps to avoid large cost while "tricking" synthesis into using our custom cells; it forms the baseline for the remaining, 3D-centric experiments in Sec. 8.2. Finally, since ours is the first work to consider large-scale benchmarks for k-security, a direct comparison with prior works [10], [11] is impractical. For a qualitative comparison, our work allows for superior security levels, induces little layout cost, and is scalable, all by means of synthesis.
3D IC Layout Cost and Security Analysis
In Table 12, we report on the final layout cost for the F2F 3D ICs. Overall, costs are better than for the 2D setup (Table 11), especially for larger benchmarks. This key finding supports our objective to advance k-security for large-scale designs. In Fig. 20, we showcase the layouts for ITC-99 benchmark b18.
We also report on security levels in Table 12. The levels in 3D are the same as in 2D; as explained above, our 3D partitioning does not undermine security. In general, few if any proper attacks on k-security are available yet, and this is because the underlying notion of k-isomorphism is formally secure [45]. This also implies that otherwise effective attacks will not be applicable. For example, although we can tackle the missing RDL connections using a SAT attack as in [29] (the netlist available to the adversary can serve as a "virtual oracle" here), doing so is not practical. First, as we show in Table 12, SAT attacks become computationally expensive for large designs. Second, while a SAT attack may eventually provide a functionally equivalent assignment for the RDL connectivity, it cannot provide the structurally equivalent assignment required for attacking k-security any better than random guessing would.
CONCLUSION AND OUTLOOK
In this work, we demonstrate in detail how 3D integration is a naturally strong match to combine split manufacturing and camouflaging. In particular, we promote "3D splitting" a design across multiple tiers along with randomization and camouflaging of the vertical interconnects between the tiers. By doing so, we propose a modern approach to (i) IP protection and (ii) prevention of targeted Trojan insertion.
Using industrial tools and know-how, we develop security-driven CAD flows for face-to-face (F2F) 3D ICs, allowing us to tackle these two essential hardware security challenges. Among other steps, we propose several security-driven partitioning techniques, randomized planning of F2F ports, customized cells for obfuscation of vertical interconnects using Mg/MgO vias, and a security-driven synthesis strategy. The latter allows us to apply k-security (a formally secure scheme concerning Trojan insertion) for the first time on large-scale designs and also to protect sensitive design structures of choice readily.
We conduct comprehensive experiments on DRC-clean layouts, using commonly considered benchmarks as well as large-scale, "real-life" designs. Strengthened by extensive security analysis, we argue that leveraging 3D integration is highly promising for hardware security. Finally, we also put forward a practical threat model which accounts for the business practices of present-day design houses.
For future work, we plan to explore how 3D integration can also provide resilience against physical attacks such as invasive probing or exploitation of side-channel leakage. 
S.I STUDY ON DIFFERENT TECHNOLOGY NODES
Using the ITC-99 benchmarks, here we gauge the capabilities for heterogeneous 3D SM, assuming a trusted 180nm foundry and an untrusted 45nm foundry. More specifically, we leverage the OSU libraries [1]. These libraries hold the same number, type, and strengths of cells; this guarantees a fair comparison since CAD tools cannot leverage different versions of cells. Synopsys DC was used for synthesis, and place and route was performed using Cadence Innovus 17.1; see also Sec. 6.1 for details on the F2F 3D setup.
PPA results for an aggressive timing closure of the 2D baseline setup are given in Table S1. For the heterogeneous F2F 3D setup, we observe performance degradations as we lift more and more gates to the trusted low-end tier (Fig. S1). Also, note from Table S1 that area (and power) cost is ≈12X (and 9X) when contrasting 180nm to 45nm. To maintain a balanced utilization for both tiers, these findings imply that one should not lift more than ≈8% of the gates to the low-end tier. While such small-scale lifting provides a reasonable performance gain, especially from the perspective of commissioning only the 180nm foundry, it may not be enough to cover all the sensitive design parts.
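The ≈8% bound can be reproduced with a back-of-the-envelope calculation. The sketch below assumes that every gate has the same area within one node and that a gate in the 180nm tier occupies 12X the area of its 45nm counterpart; both are simplifications, and the function name is hypothetical. Balancing the two tiers then requires 12f = 1 - f for the lifted fraction f.

```python
def balanced_lift_fraction(area_ratio):
    """Fraction f of gates to lift so that both tiers occupy equal area.

    Assumes uniform gate area per node: area_ratio * f == 1 - f,
    hence f == 1 / (1 + area_ratio).
    """
    return 1.0 / (1.0 + area_ratio)

# Area cost of 180nm versus 45nm as reported in the text (about 12X).
fraction = balanced_lift_fraction(12)
print(f"lift at most {100 * fraction:.1f}% of the gates")
```

Under these assumptions the result is 1/13, i.e., slightly below 8%, which agrees with the bound stated above.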
These findings are in general agreement with those provided in Sec. 4.1.
Table S1, footnote: The 45nm node is four generations away from 180nm, and delays improve by ≈30% per generation [2]; surprisingly, delay degradations for OSU 180nm are considerably off from this expectation. We believe that this is due to the academic nature of the library.
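For reference, the expectation mentioned in the footnote can be worked out as follows. The sketch assumes that the ≈30% improvement compounds multiplicatively per generation; the function name is hypothetical.

```python
def expected_delay_ratio(improvement_per_generation, generations):
    """Delay of the newer node relative to the older one.

    Assumes the per-generation improvement compounds multiplicatively.
    """
    return (1.0 - improvement_per_generation) ** generations

ratio = expected_delay_ratio(0.30, 4)  # 180nm to 45nm, four generations
print(f"45nm delay is about {ratio:.2f}x that of 180nm")
print(f"180nm is expected to be about {1 / ratio:.1f}x slower")
```

Under this assumption, delays at 45nm would be roughly a quarter of those at 180nm, which is the expectation the observed OSU 180nm results deviate from.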
