The high volumes of data stored in the cloud, coupled with growing concerns about security and privacy, have motivated research on homomorphic encryption (HE), i.e., a technique that enables computation directly on encrypted data, obviating the need for prior decryption. Recent algorithmic advances have enabled a diverse set of homomorphic operations (e.g., addition, multiplication, and division). On the application side, recent work also suggests extensibility to secure, homomorphically encrypted content-addressable memories [or secure content-addressable memories (SCAMs)]. However, the large datawords that result from homomorphic data encodings (i.e., that must be stored/transferred for computation), compounded with the implicit computational complexity of HE, still impede the deployment of homomorphic computer hardware. As an alternative, computing-in-memory (CiM) architectures could significantly reduce the volume of data transfers for SCAM (and other) applications, leading to considerable energy savings and latency reduction. In this regard, we propose a CiM-compatible engine for SCAM (CiM-SCAM) and analyze the pros and cons of three different memory cells: a 6T CMOS SRAM and two memory cells based on ferroelectric field-effect transistors (FeFETs) (specifically 2T + 1FeFET and 1-FeFET designs). CiM-SCAM leverages in-place copy buffers (IPCBs), along with customized sense amplifiers that include two types of in-memory adders. Our results suggest that energy (and search time) improvements of up to 16× (3.2×) for 1-FeFET memory cells are possible, compared with an application-specific integrated circuit (ASIC) approach. Similar improvements are also possible with SRAM and 2T + 1-FeFET memory cells. For the latter, we achieve energy savings (speedups) of up to 13× (3.1×).
INDEX TERMS Computing-in-memory (CiM), emerging technologies, ferroelectric field-effect transistors (FeFETs), homomorphic encryption (HE), secure content-addressable memory (CAM), SRAMs.
VOLUME 5, NO. 2, DECEMBER 2019. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
I. INTRODUCTION
In the era of big data and the Internet of Things, there are growing concerns about the security and privacy of high volumes of data stored in the cloud. While highly secure, traditional encryption schemes, such as the Rivest-Shamir-Adleman (RSA) and the advanced encryption standard (AES), cannot fulfill data privacy requirements, as they require ciphertexts to be fully decrypted before any computation can be performed on the data. Alternatively, homomorphic encryption (HE) [1] supports user privacy by allowing a wide range of operations to be performed on encrypted data without loss of integrity. This may further rationalize the use of cloud-based compute resources, as a user's data would never need to be decrypted. Per [2] , HE schemes have been proven to be secure against a wide range of attacks.
While numerous application spaces could benefit from the computation on encrypted data, in this article, we consider compute kernels for data search engines, which encompass applications, such as online recommender systems [3] , facial recognition [4] , and secure biometric authentication [5] .
Toward this end, ternary content-addressable memories (TCAMs) are often used in search applications due to their inherently high parallelism and energy efficiency [6]. TCAM operations consist of performing bitwise XNOR operations between query bits and database content in parallel, followed by a word-wise AND that produces a match signal for each memory address. However, a traditional TCAM is not well suited for performing queries/searches with homomorphically encrypted data. This is because the homomorphic operations used for search, such as homomorphic XOR (HomXOR) and homomorphic AND (HomAND), must obey a specific protocol of operations established by the encryption scheme in order to guarantee that the integrity of the data is preserved after the operations. For instance, in the case of an encryption scheme named FHEW [7], multiplications are necessary in order to achieve HomAND functionality. Other operations, such as HomXOR and HomMULT, as well as arithmetic operations such as HomADD or HomSUB, need to be realized through rounds of multiplications, divisions, additions, subtractions, and modulo reductions [1], [7], [8].
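The XNOR-then-AND matching described above can be illustrated on plaintext data with a minimal sketch. The function names, the `care_mask` encoding of don't-care bits, and the list-of-tuples database are assumptions made for illustration only; they do not describe any particular TCAM circuit.

```python
def tcam_match(stored: int, care_mask: int, query: int, width: int) -> bool:
    """Return True when every cared-for bit of `stored` equals the query bit."""
    full = (1 << width) - 1
    xnor = ~(stored ^ query) & full        # bitwise XNOR: 1 where the bits agree
    # A don't-care position (care_mask bit = 0) is forced to "agree".
    agree = xnor | (~care_mask & full)
    return agree == full                   # word-wise AND over all bit results


def tcam_search(entries, query: int, width: int):
    """Evaluate every row (conceptually in parallel) and list matching addresses."""
    return [addr for addr, (stored, mask) in enumerate(entries)
            if tcam_match(stored, mask, query, width)]


# Hypothetical 4-bit database: (stored word, care mask) per address.
database = [(0b1010, 0b1111), (0b1000, 0b1100), (0b0101, 0b1111)]
matches = tcam_search(database, query=0b1010, width=4)
```

In this toy database, address 1 matches because its two low-order bits are marked as don't-care.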
In addition to requiring costly operations, such as multiplications and divisions, HE schemes require extremely large datawords due to the sizes of ciphertexts (e.g., in [9], the encryption of a 32-bit word in plain text results in 1 415 232 bits of ciphertext). In this context, dense memories may be useful toward scaling HE applications. For example, a near-memory processing solution based on the hybrid memory cube (HMC) was proposed to enable a more energy- and area-efficient implementation for searching on homomorphically encrypted data [10]. Still, while density is an essential characteristic of memories used in HE applications, near-memory processing solutions, such as in [10], cannot reduce the number of data transfers between the memory and a processing unit.
Alternatively, computing-in-memory (CiM) has emerged as a possible solution to the energy and latency overheads associated with transferring high volumes of data between memory and processor. As such, CiM could be helpful in data-intensive applications, such as HE. Furthermore, the unique characteristics of emerging logic and memory technologies enable the design of high-density and energy-efficient CiM architectures. For instance, researchers have proposed CiM designs that employ non-volatile resistive random-access memory (RRAM), spin-transfer torque magnetic random-access memory (STT-MRAM), phase-change memory (PCM), or ferroelectric field-effect transistor-based random-access memory (FeFET-RAM) [11] - [14] . Importantly, for the applications considered here, designs in [12] and [14] leverage modifications in the memory sense amplifier, to enable the addition of two memory words in a single step.
In this article, we propose the use of a novel in-memory (carry select) adder and accumulators that enable energy-efficient searches on homomorphically encrypted data. We leverage the fully additive algorithm proposed in [9] to implement word matching on a secure CAM (SCAM). Our CiM-SCAM engine consists of memory arrays, write buffers, and customized sense amplifiers that can perform Boolean operations and additions in memory. Three options for memory cells are considered, i.e., a 6T-SRAM [15], a 2T + 1FeFET [14], and a 1-FeFET [16] memory cell. We present the layout and area evaluation for the customized sense amplifiers, along with their respective in-memory adders. For the application considered here, the use of FeFET-based CiM rather than other emerging technologies can be attractive for two reasons. First, FeFETs have a voltage-based write that enables faster and more energy-efficient writes compared to STT-MRAM or RRAM. Second, the separate read and write paths in FeFET memory cells allow simultaneous read and write accesses to different cells located in the same memory array. The latter characteristic of FeFET-based memory cells is useful for building in-memory accumulators.
We compare the performance of CiM-SCAM to an application-specific integrated circuit (ASIC) proposed in [9]. The designs considered (SRAM, 1-FeFET, 2T + 1FeFET, and the ASIC) are implemented with a 65-nm CMOS technology node. Furthermore, we carried out an additional evaluation of the 2T + 1FeFET memory with the 22-nm PTM [17] as the underlying model for the single-domain LK-based FeFETs. The purpose is to build some intuition as to how the scaling of the device may affect figures of merit, such as delay and energy. Our results show that CiM-SCAM based on 2T + 1-FeFET memory cells enables energy savings and speedups of up to 13× and 3.1×, respectively, compared to the aforementioned ASIC approach. Benefits can be even higher when 2T + 1-FeFET memory cells based on long-term FeFETs are employed. Furthermore, implementations with 6T-SRAM can provide up to 9.3× energy savings compared to the ASIC approach, with search times that are similar to those of the 2T + 1-FeFET-based CiM-SCAM. Finally, 1-FeFET-based implementations represent a higher density option for CiM-SCAMs and enable energy savings of up to 16× compared to the ASIC approach.
II. BACKGROUND AND RELATED WORK
In this section, we discuss FeFET device structures, operating modes, and device models. Moreover, we review related work on CiM and HE.
A. FeFET DEVICE
The structure of a FeFET resembles that of a MOSFET, except that a layer of FE oxide is deposited in the gate stack. The equivalent circuit for a FeFET is depicted in Fig. 1(a), where we represent the FE and CMOS capacitances as C_FE and C_CMOS, respectively. The coupling between these capacitances can lead to a hysteretic effect and, hence, non-volatility. Fig. 1(b) depicts the simulated I-V characteristics, illustrating how hysteresis could span positive and negative gate-source voltages (V_gs). The information stored in a FeFET corresponds to one of two possible states: logic "0" (high V_th) and logic "1" (low V_th).
B. FeFET MODELS
We now review two FeFET device models. First, we present a single-domain model based on the time-dependent Landau-Khalatnikov (LK) equations [18] . Second, we introduce a multi-domain model for FeFETs based on the Preisach theory [19] .
1) LK-BASED FeFET MODEL
The LK equation (1) has been used to describe the switching behavior of FeFETs in previous work [14], [20]-[22]. In (1), α, β, and γ are the static coefficients and ρ is the kinetic coefficient associated with the FE material hafnium zirconium oxide (HZO). E and P refer to the electric field and polarization, respectively. For HZO, α = −7 × 10^9 m/F, β = 3.3 × 10^10 m^5/F/coul^2, and γ = −2 × 10^9 m^9/F/coul^4, and the FE thickness is 5.4 nm.
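For reference, the time-dependent LK relation is commonly written in the fifth-order polynomial form below. This is the standard textbook expression and is assumed here to correspond to (1), with E the electric field, P the polarization, α, β, and γ the static coefficients, and ρ the kinetic coefficient.

```latex
E = \alpha P + \beta P^{3} + \gamma P^{5} + \rho \frac{dP}{dt}
```

In steady state (dP/dt = 0), the expression reduces to the static E-P relation whose S-shaped curve underlies the hysteresis discussed in Section II-A.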
The LK equation assumes a single-domain ferroelectric (FE) material with a single coercive field for the whole FE thin film. Although this assumption of the LK-based model could be reasonable for a highly scaled FeFET device, the FeFET devices fabricated to date have multi-domain FE layers in which polarization switching does not occur abruptly. Phenomena such as non-saturated hysteresis loops and history effects cannot be captured by the LK-based model [18] .
2) PREISACH-BASED FeFET MODEL
The above-mentioned issues are addressed by a FeFET compact model based on the Preisach theory [19]. The Preisach-based multi-domain model is derived from experimental data. Unlike the single-domain LK-based model, the multi-domain Preisach model can reproduce the characteristics of FE devices that have been fabricated to date. The model assumes an FE film with multiple independent single-crystal domains with a distribution of coercive fields. Hence, the model can capture the behavior of both saturated and non-saturated hysteresis loops. Moreover, the multi-domain Preisach-based model tracks the FE history by employing an efficient turning-point tracking algorithm [23]. The model successfully reproduces the FE behavior by combining the aforementioned characteristics with a delay unit that can model polarization switching dynamics. Fig. 2 depicts the I_D versus V_gs characteristic of a FeFET device measured through experiments and simulations with the multi-domain Preisach model. The device is programmed/erased by applying −4-V/+4-V pulses to the gate. Measurement pulses, i.e., sweeps between −1 V and +1 V, are applied to read out the logic state of the FeFET, as illustrated in Fig. 2(a). A sufficiently wide memory window (MW) of ∼0.96 V, as well as I_ON/I_OFF ratios on the order of 10^4, lead to good sensing margins, as depicted in Fig. 2(b). The threshold voltage (V_th) could be shifted through body biasing or gate metal work function engineering to meet the requirements of particular designs, as illustrated in Fig. 2(c). In this article, we assume that V_th has been shifted to enable a read voltage in the range of 0.5-1.0 V.
C. RELATED WORK: CiM AND HE
CiM architectures aim to reduce the amount of data transfers between memory and processor by bringing computation inside the memory units. CiM designs can be either general purpose or application-specific. Examples of the former include designs that perform operations, such as the Boolean (N)OR, (N)AND, and X(N)OR [11] - [15] , [24] , arithmetic ADD [11] , [12] , [14] , [24] , as well as in-place data copy and carryless multiplication [15] . Studies suggest that CiM architectures could deliver energy savings and speedups for applications from a wide range of domains. Examples of application-specific CiM designs include the in-memory computation of dot products [25] and non-volatile TCAMs [26] that are suitable for pattern recognition and search applications.
Applications in the security domain can also significantly benefit from CiM. Two examples of such applications are the in-memory implementation of encryption engines and the word matching on an encrypted database. For the former, there are several works that propose the implementation of encryption schemes, such as AES in memory [27] , [28] . The results in [28] suggest that in-memory AES achieves energy savings due to reduced data transfers while enabling speedups of 80× compared to a standard CPU implementation. Regarding the latter, [29] proposes the use of modified hash functions that increase the level of security of the data stored in a binary CAM. However, the encryption scheme used does not allow for computing on ciphertexts as with HE.
Reference [9] proposes a modification to the homomorphic XOR and AND operations, as defined in the FHEW scheme [7], such that searches on a secure CAM do not require expensive operations, such as multiplications, in order to guarantee the integrity of the data at decryption. That is, word matching can be performed entirely through additions, resulting in faster word matching compared to a conventional implementation [30]. Specifically, the word matching function of a CAM employs an XOR-OR variant instead of the original XNOR-AND form, which can be costly to implement in hardware. In (2), x and y refer to words to be encrypted. While x_i ∈ {0, 1} represents a bit of a database entry, y_i ∈ {0, 1} refers to a bit of a query. The index i represents the individual bits of each word, whose length is specified by w. The ciphertexts for a 1-bit plaintext are denoted as c_x and c_y and can be as long as (n + 1) log q bits (with n = 1052 and log q = 42, i.e., 44 226 bits, for a medium security standard that uses an 80-bit key, as defined in [9]). Therefore, a 32-bit data word may translate to a homomorphically encrypted data vector of 32 × 44 226 = 1 415 232 bits, i.e., more than 1.4 million bits.
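The relation between the XNOR-AND and XOR-OR forms can be checked on plaintext bits with the short sketch below. It is only a plaintext analogue: it does not reproduce (2) or any encrypted arithmetic from [9], and all function names are hypothetical. The last function shows why an OR over XORs can be replaced by an accumulation, since the sum of the bitwise XORs is zero exactly when the two words match.

```python
from itertools import product


def match_xnor_and(x_bits, y_bits):
    """Conventional CAM form: AND over the bitwise XNORs."""
    return all((xi ^ yi) == 0 for xi, yi in zip(x_bits, y_bits))


def mismatch_xor_or(x_bits, y_bits):
    """XOR-OR variant: OR over the bitwise XORs (True means some bit differs)."""
    return any(xi ^ yi for xi, yi in zip(x_bits, y_bits))


def mismatch_count(x_bits, y_bits):
    """Additive analogue: accumulate the XORs; the sum is zero only on a match."""
    return sum(xi ^ yi for xi, yi in zip(x_bits, y_bits))


def forms_agree(w: int) -> bool:
    """Exhaustively compare the three forms over all pairs of w-bit words."""
    for x in product((0, 1), repeat=w):
        for y in product((0, 1), repeat=w):
            is_match = match_xnor_and(x, y)
            if is_match == mismatch_xor_or(x, y):
                return False
            if is_match != (mismatch_count(x, y) == 0):
                return False
    return True
```

The XOR-OR form yields the complement of the XNOR-AND result, so a zero (or false) output indicates a match.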
By leveraging the operations in FHEW, a SCAM matching function can be computed by using only additions and accumulations. Note that the query is negated before encryption (−c_{y_i}), which eliminates the need for subtractions in (2). However, adding encrypted bits [two numbers of (n + 1) log q bits each] is still a critical part of SCAM search for two reasons. First, large ciphertexts demand that high volumes of data be fetched from memory, which can be energetically demanding and time consuming. Second, as hardware resources are limited, one cannot build an adder tree of sufficiently large size to process (n + 1) log q inputs. Thus, as many as n clock cycles may be necessary to complete a single addition in the ASIC implementation of SCAM [9]. Recently, [10] proposed the use of dense 3-D memories along with near-memory processing for a secure CAM search. Although the proposed framework enables high density, the energy and latency overheads of data transfers could be further reduced with CiM.
III. CiM-COMPATIBLE SECURE CAM SEARCH
In this section, we propose a hardware implementation that enables searching for content in a homomorphically encrypted database. Rather than relying on CPU/GPU alternatives or ASICs, our framework is comprised of a CiM engine that implements the fully additive secure CAM (SCAM) search algorithm described in [9] .
A. IN-MEMORY FULLY ADDITIVE SEARCH
Computing on large ciphertexts without having to bring data to a processing unit could enable energy savings and speedups for the SCAM search application due to inherently high internal bandwidth of the memory. We propose a CiM-compatible hardware implementation of SCAM search (CiM-SCAM). A high-level view of CiM-SCAM is depicted in Fig. 3 . Here, a basic CiM module contains M × N memory cells plus necessary peripherals for the addition operations. CiM-SCAM lines are labeled from 1 to K and depicted in the dashed box. They comprise sets of CiM modules, in which encrypted words 1 to K are stored. In each CiM-SCAM line, a w-bit plaintext word maps to [w × (n + 1) log q] bits of ciphertext (∼173 KB). In addition, the encrypted query of the same size is also written/stored into the memory. Storing both query and memory entry into a CiM-SCAM line requires ∼346 KB of memory space in each CiM-SCAM line. For a search operation, the encrypted query data are sent to all the encrypted memory contents (e.g., K entries), and the comparison is done in parallel for all K CiM-SCAM lines. 
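The storage figures above follow from simple arithmetic, which the sketch below reproduces. The parameter values n = 1052, log q = 42, and w = 32 are taken from Section II-C, and the 64 × 64 module size from Section IV. The helper names and the reading of "KB" as 1024 bytes are assumptions.

```python
import math


def bits_per_encrypted_bit(n: int, log_q: int) -> int:
    """Ciphertext length, in bits, of one encrypted plaintext bit."""
    return (n + 1) * log_q


def word_ciphertext_bits(w: int, n: int, log_q: int) -> int:
    """Ciphertext length, in bits, of a w-bit plaintext word."""
    return w * bits_per_encrypted_bit(n, log_q)


def bits_to_kib(bits: int) -> float:
    """Convert bits to kibibytes (1 KiB = 1024 bytes)."""
    return bits / 8 / 1024


def modules_per_line(w: int, n: int, log_q: int, rows: int, cols: int) -> int:
    """M x N modules needed to hold one entry plus one query of equal size."""
    return math.ceil(2 * word_ciphertext_bits(w, n, log_q) / (rows * cols))


entry_bits = word_ciphertext_bits(w=32, n=1052, log_q=42)
entry_kib = bits_to_kib(entry_bits)        # roughly 173 KB per encrypted word
line_kib = 2 * entry_kib                   # roughly 346 KB for entry plus query
modules = modules_per_line(32, 1052, 42, rows=64, cols=64)
```

Under these assumptions, the module count agrees with the number of 64 × 64 arrays per SCAM line reported in Section IV.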
B. SRAM AND FeFET-BASED MEMORY CELLS
Our proposed CiM-SCAM engine is studied assuming three underlying memory arrays based on different memory cells. Furthermore, CiM-SCAM leverages standard and customized peripheral circuits for each design option. Below, we first describe the memory cell designs suitable for CiM-SCAM implementation and then describe the customized peripheral circuits in Section III-C.
The first option for a memory cell for CiM-SCAM is a standard 6T-SRAM design from [15] . The second option is a 1-FeFET [16] cell simulated with a multi-domain Preisach model [19] , while the third option is a 2T + 1FeFET design [14] based on the single-domain LK model [31] . Fig. 4 depicts the memory cells employed, along with their respective read and write mechanisms.
Note that the two options for FeFET memories employed in CiM-SCAM also assume two different underlying FeFET models and, hence, different device behavior. Notably, FeFET devices that have been fabricated to date exhibit behavior that is more in line with Preisach theory. In addition, the reading of 1-FeFET memory cells (studied via the Preisach model) demands that a suitable read voltage (typically on the order of 0.5-1.0 V) be applied to the gate of the device. Write voltages on the order of ±4 V with a 10-ns write time are used for programming and erasing [19].
Alternatively, other design efforts use LK-based models to capture FeFET device behavior, e.g., [20] , [22] , [31] . The 2T + 1FeFET memory cell considers FeFETs that exhibit ideal polarization switching, where the entire FE layer is assumed to be single domain. Read and programming voltages are in much smaller ranges, i.e., −0.5-1.0 V [14] , [20] , [22] . Although this behavior does not correspond to the FeFETs fabricated to date, the paradigm described by the LK-based FeFET models might be suitable to a scaled FeFET device.
C. iM-CSLA FOR FAST ADDITION
A number of CiM designs proposed, to date, use customized sense amplifiers and support in-memory ripple carry adder (RCA) designs [12] , [14] , [24] . While an iM-RCA [depicted in Fig. 5(a) ] is suitable for the addition of standard-length memory words, i.e., 32 or 64 bits, the long carry chain in the addition of (n + 1) log q-bit numbers could impose an extremely high latency, for example, in the case of CiM-SCAM. To address this issue, we propose a CiM architecture that leverages a novel, in-memory carry select adder (iM-CSLA) that accelerates the in-memory addition of (n + 1) log q-bit numbers by ∼9× compared to iM-RCA.
The iM-CSLA works for all three CiM-SCAM design options (i.e., based on 6T-SRAM, 2T + 1FeFET, or 1-FeFET memory cells) and is depicted in Fig. 5(b). As each CiM-SCAM word is extremely long [w × (n + 1) log q bits], we divide each word into segments of N bits, so that each segment fits into one array. The w encrypted bits of each entry and query are represented by 2w/M rows of memory blocks. Each N-bit adder constitutes a block B of the iM-CSLA, and a total of [(n + 1) log q]/N blocks (and arrays), rounded up, are needed to build one CiM-SCAM line (e.g., 692 for n = 1052, log q = 42, and N = 64).
Internally, the structure of each iM-CSLA block is still based on RCAs. Assuming an M × N memory array, the iM-CSLA circuit employs redundant carry and sum units to build an N-bit in-memory adder. The redundant units compute two sums/carries in parallel. The iM-CSLA can considerably speed up addition by multiplexing the results of these additions. Its mechanism works as follows. Two opposite carry-in values (C_in = 0 and C_in = 1) are given as inputs to the redundant carry and sum circuits at the least significant bit of every block B. This allows two different additions (with C_in = 0 and C_in = 1) to be computed simultaneously. Afterward, each block B takes a global carry input (C_block,B) from its predecessor block (C_block,B−1). The global carry selects the correct addition result in each block B through a layer of multiplexers. Area overheads due to the introduction of a layer of multiplexers, along with redundant adders, constitute a disadvantage of iM-CSLA and need to be carefully analyzed. Layout options for CiM-SCAM, along with a study of the impact of iM-CSLA with respect to the chip area, will be presented in Section IV.
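The carry-select mechanism described above can be summarized by a purely functional sketch. It models the arithmetic only (no circuit timing, sense amplifiers, or memory access), treats the operands as plain unsigned integers, and uses hypothetical function names. Each block is evaluated twice by a ripple-carry routine, once per carry-in value, and the incoming block carry then selects one of the two precomputed results.

```python
def ripple_add(a: int, b: int, cin: int, nbits: int):
    """Bit-serial ripple-carry addition of two nbits-wide values."""
    s, carry = 0, cin
    for i in range(nbits):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        s |= (ai ^ bi ^ carry) << i
        carry = (ai & bi) | (carry & (ai ^ bi))
    return s, carry


def carry_select_add(a: int, b: int, total_bits: int, block_bits: int) -> int:
    """Add two total_bits-wide operands block by block, carry-select style."""
    mask = (1 << block_bits) - 1
    n_blocks = -(-total_bits // block_bits)      # ceiling division

    # Step 1: every block computes both candidate results independently.
    candidates = []
    for blk in range(n_blocks):
        a_seg = (a >> (blk * block_bits)) & mask
        b_seg = (b >> (blk * block_bits)) & mask
        candidates.append((ripple_add(a_seg, b_seg, 0, block_bits),
                           ripple_add(a_seg, b_seg, 1, block_bits)))

    # Step 2: the block carry acts as the multiplexer select, block after block.
    result, carry = 0, 0
    for blk, (with_cin0, with_cin1) in enumerate(candidates):
        seg_sum, carry = with_cin1 if carry else with_cin0
        result |= seg_sum << (blk * block_bits)

    # Keep the final carry-out as the top bit so the result equals a + b.
    return result | (carry << (n_blocks * block_bits))
```

In hardware, the loop in step 1 corresponds to all blocks operating simultaneously, so the serial part of the addition is the chain of multiplexer selections in step 2 rather than a ripple through every bit.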
D. IN-PLACE COPY BUFFERS
The summation corresponds to the OR operation described in (2), i.e., the additions between the encrypted bits of a query and a database entry (y_i and x_i). In the context of CiM-SCAM, the results of the in-memory additions between encrypted bits [each (n + 1) log q bits long] need to be accumulated over the w bits of a word, as shown in (2), in order to produce an encrypted match result for each SCAM line. The encrypted match result is sent to the client, who can decrypt it with the appropriate key. The decrypted result indicates whether there are matches for any of the CiM-SCAM addresses.
We introduce in-place copy buffers (IPCBs) [depicted in Fig. 5(c)] in the CiM-SCAM design to implement accumulation more efficiently. When activated, an IPCB feeds the sum results to the write bitlines. SRAM-based CiM-SCAM can use an IPCB similar to the one proposed in [15]. As SRAMs have common write and read paths with a shared bitline (BL), one cannot read from and write into SRAM memory cells in the same clock cycle. Hence, CiM-SCAM based on SRAM needs an extra write cycle after each [(n + 1) log q]-bit addition to perform accumulation, i.e., to save the result into a memory destination through the IPCB.
On the other hand, FeFET memory cells have decoupled read and write paths that enable simultaneous reading and writing of memory cells located in the same memory array. Our IPCB design depicted in Fig. 5(c) can be employed in arrays based on 1-FeFET or 2T + 1FeFET memory cells. Per [14] , 2T + 1FeFET memory cells do not require negative voltages for writing and have write voltages in the range of 0.5-1.0 V. The aforementioned characteristics of 2T + 1FeFET enable the result of a [(n + 1) log q]-bit addition to be fed directly into the IPCB circuits that are activated in the compute cycle. However, 1-FeFET memories have write voltages in the range of ±4 V, which represents an issue when copying an addition result directly from the iM adder to a destination in the memory array. One possible solution to this problem would be to use a level shifter after each IPCB in order to generate appropriate write voltages in the ±4-V range (an option to be explored in our future work). Alternatively, our evaluation (to be presented in Section IV) accounts for a ±4-V/10-ns write for every addition of two [(n + 1) log q]-bit ciphertexts. To generate voltages in the appropriate ±4-V range, the write circuitry employs an operational amplifier (op-amp) voltage comparator [depicted in Fig. 5(d) ]. For practical purposes, we expect a similar overhead from our write circuitry compared to an actual level shifter design implementation.
E. CiM-SCAM LAYOUTS AND DATA PLACEMENT SCHEMES
For area evaluation (to be presented in Section IV-B), we have performed layouts for one column of CiM-SCAM peripherals using the OpenRAM library [32] in the Cadence Virtuoso environment. The customized sense amplifiers and in-memory adders in Fig. 6(a) and (b) are pitch-matched to SRAM and 2T + 1FeFET memory cells.
For the 1-FeFET design, we maximize density gains by introducing a column multiplexer that enables the sharing of peripherals between two columns of memory cells at distinct times. We enable the use of in-memory adders in this context by employing an interleaved data placement scheme. For instance, consider a memory array with N columns, in which the index j represents each individual column, i.e., j = 0 to N − 1. When storing an N-bit data block in the memory, we divide its content into two halves. The lower part of a data block contains bits from j = 0 to N/2 − 1, while the upper part contains bits from j = N/2 to N − 1. Rather than storing all the bits in an adjacent fashion, i.e., the jth bit followed by the (j + 1)th bit, from j = 0 to N − 1, we interleave the bits from the lower part with bits from the upper part of the data block. Such data placement enables the correct routing of the carry through the iM-adder circuitry. Fig. 7 depicts an example of interleaved data placement for a block size of 8 bits. Note that the compute cycle is divided into two equal (shorter) cycles to enable the complete carry propagation through all the bits in the adder.
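One possible interleaving consistent with this description is sketched below. The exact ordering shown in Fig. 7 is not reproduced here; the sketch assumes that adjacent physical columns 2k and 2k + 1 share peripheral k, with the lower-half bit placed first. The function names are hypothetical.

```python
def interleaved_layout(n_cols: int):
    """Logical bit index stored at each physical column, left to right."""
    if n_cols % 2:
        raise ValueError("the number of columns must be even")
    half = n_cols // 2
    layout = []
    for k in range(half):
        layout.append(k)          # lower-half bit, handled in the first sub-cycle
        layout.append(k + half)   # upper-half bit, handled in the second sub-cycle
    return layout


def peripheral_schedule(n_cols: int):
    """Logical bits seen by shared peripherals 0..N/2-1 in each sub-cycle."""
    layout = interleaved_layout(n_cols)
    return [layout[0::2], layout[1::2]]


layout_8 = interleaved_layout(8)
schedule_8 = peripheral_schedule(8)
```

With this placement, the shared peripherals process consecutive bits within each sub-cycle, so the carry ripples through the lower half first and the carry out of that half can then feed the upper half in the second sub-cycle.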
IV. EVALUATION
In this section, we present a simulation-based case study of CiM-SCAM, comparing our results to the ASIC implementation in [9]. As noted in Section III, the use of customized sense amplifiers and in-memory adders introduces area overheads in the memory peripherals. Here, we evaluate the area overheads of CiM-SCAMs based on SRAM, 2T + 1FeFET, and 1-FeFET memory cells. Furthermore, we compare the energy and search time of CiM-SCAM to the ASIC-based SCAM.
A. SIMULATION SETUP
In order to evaluate the performance and energy efficiency of CiM-SCAM, we perform SPICE simulations, where an interleaved data placement scheme (introduced in Section III-E) is assumed for 1-FeFET memories, and a regular (adjacent) data placement is employed for SRAM and 2T + 1FeFET memories. We use a CiM array size of 64 × 64. The customized sense amplifiers in [14] and [15] are used for the Boolean operations. In-memory adders (RCA and CSLA) are implemented, as described in Section III. Furthermore, CMOS PTM models from [17] and [33] are employed for simulations. We use a 65-nm technology node to simulate SRAM, 1-FeFET, and 2T + 1-FeFET memories. In addition, we evaluate a 2T + 1FeFET-based array at the 22-nm node in order to project figures of merit for a scaled device whose behavior might match well with the single-domain FeFET models.
B. RESULTS AND DISCUSSION
1) AREA
The area results for a 64 × 64 array are summarized in Table 1. We note that for all three implementations, i.e., SRAM, 2T + 1-FeFET, and 1-FeFET, the area of memory cells dominates the total CiM-SCAM area regardless of the iM adder used, despite the overhead introduced by the use of redundant adders and a multiplexer layer in iM-CSLA. The CiM peripherals account for less than 50% of the total array area in all cases, as indicated in Table 1. When using an iM-CSLA in lieu of an iM-RCA, we have an average area overhead of ∼12.7%. However, as the graph in Fig. 8 suggests, iM-CSLA offers speedups in search time that are not possible to achieve with iM-RCA, due to iM-RCA's long carry chain when adding large ciphertexts.
2) ENERGY AND PERFORMANCE
We compare the energy and performance of CiM-SCAM with an ASIC approach presented in [9] , which demonstrates superiority in terms of search time and energy compared to CPU implementations of the same SCAM algorithm, as described in the literature [9] , [10] . When comparing to the ASIC [9] implementation of SCAM, we account for data transfers and for the energy and time spent on computations. In our CiM-SCAM, each SCAM line comprises 692 arrays of 64 × 64 size, which is sufficiently large to contain encrypted bits of a word. As per the results reported in [9] , additions and accumulations in one SCAM line consume 11.41 nJ and take 9.47 µs to complete for the ASIC approach. Data transfers are not included in this result. In this article, communication overhead is considered for both CiM and ASIC approaches. For CiM, data transfers are reduced as the addition operands do not need to be fetched from memory to a processing unit. Hence, the communication overhead of the CiM solution consists mostly of local copy operations that are necessary to store intermediate results for accumulations carried out by the search algorithm. For ASIC, we consider that each SCAM line has a dedicated SRAM memory of 512-KB size, in which ciphertexts are stored. Using NVSIM [34] , we estimate the access time (and read/write energies) of the SCAM memory to be 5.16 ns (102/99.5 pJ).
Our results (depicted in Fig. 8) suggest energy and search time improvements of up to 16× and 3.2×, respectively, for 1-FeFET memory cells. Similar improvements are also possible with SRAM and 2T + 1-FeFET memory cells. For the latter, CiM-SCAM with iM-CSLA achieves 13× and 3.1× improvements in energy and performance, respectively, over the ASIC approach. Furthermore, our results for a 2T + 1FeFET memory array implemented with the scaled FeFET device and simulated with the LK-based model (dark gray bar in Fig. 8) suggest a potential improvement compared to memory cells built with the larger device.
Although many challenges regarding the fabrication of scaled FeFETs still remain, a more compact device would allow for faster, more reliable, and lower-power reads and writes, which would positively impact several figures of merit. High density is another advantage of FeFET-based memories. In this regard, the use of denser memories, such as 1-FeFET, is preferable when performing a search on larger quantities of ciphertext or for a higher level of security (longer key).
Performance and energy improvements are expected to be maintained at the same level, as ciphertext size increases. The benefits of CiM-SCAM are strongly dependent on the type of adder used (e.g., CSLA versus RCA). Note that RCA and CSLA adders are implemented in this work as a part of the CiM solution, as they compute on data that are stored in memory without the need for reads. Nevertheless, using an iM-RCA for adding large numbers in memory may be inefficient, as the carry needs to propagate through all bits for the operation to be completed. Consequently, we do not get much improvement from an iM-RCA compared to the ASIC solution (latency is in fact worse). However, in the case of the iM-CSLA, a significant improvement is observed. The benefit comes from not only the CSLA structure but also CiM, as the latter helps to significantly reduce the memory transfer overhead.
V. CONCLUSION
We have proposed a CiM-compatible engine for SCAM (CiM-SCAM), implemented with three options of memory cells (6T-SRAM, 2T + 1FeFET, and 1-FeFET). CiM-SCAM leverages customized peripherals to perform in-memory addition and accumulation. Our results show that CiM-SCAM based on 2T + 1-FeFET memory cells enable energy savings and speedups of up to 13× and 3.1× compared to an ASIC approach. Furthermore, implementations with 6T-SRAM can provide up to 9.3× energy savings compared to an ASIC, with search time similar to 2T + 1-FeFET- 
