This paper proposes an ultra-low-power crypto-engine achieving sub-pJ/bit energy and sub-1Kµm² area in 40nm CMOS, based on the Simon cryptographic algorithm. Energy and area efficiency are pursued via microarchitectural exploration, ultra-low-voltage operation with high resiliency via latch-based pipelines, and power reduction via multi-bit sequential elements. Comparison with the state of the art shows best-in-class energy efficiency and area, making the engine well suited for ubiquitous security in tightly constrained platforms, e.g. RFIDs and low-end sensor nodes.
Introduction
Energy and area efficiency are essential requirements to enable truly ubiquitous security via data encryption along the entire chain of trust, from IoT edge devices to the cloud. Existing cryptographic standards such as the Advanced Encryption Standard (AES) [1]-[3] are currently the preferred choice for 128-bit data and key sizes, or higher. However, the area and energy required by implementations with such data/key sizes are unaffordable in low-end edge devices. Also, 128-bit security is typically beyond the actual requirements of low-end devices, especially when they provide sparse and small amounts of data (e.g., real-time environmental sensors). The energy penalty of using a data wordlength larger than necessary is aggravated by the additional cost of the external FIFO buffers (~2.5Kgates and 25-30% of area-efficient AES designs [2]) used for data word aggregation (Fig. 1).
microarchitecture tends to be dominated by flip-flops, in terms of both area (50%) and energy (65%). The clocking energy contribution was reduced by adopting the pulsed latch as the sequencing element, as shown in Fig. 3(a). Indeed, a pulsed latch occupies 25% less area, and has 40% lower clock pin energy and 20% lower energy per cycle, than a flip-flop. Given the dominance of sequencing elements in the Simon engine, the lower area of pulsed latches shrinks the overall area and hence reduces the switched capacitance in both the datapath and the clock tree. Besides reducing energy, pulsed latches enable time borrowing, which provides resiliency against process/voltage/temperature variations [6].
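The percentages above can be combined into a rough first-order estimate of the design-level savings. The sketch below is illustrative only: it assumes that only the sequencing-element share of area and energy changes, that the 20% energy-per-cycle figure applies to the whole 65% flip-flop energy share, and that second-order effects (e.g. the lower switched capacitance of a smaller layout) are ignored. The function name `total_saving` is hypothetical.

```python
# First-order estimate of design-level savings when flip-flops are
# replaced by pulsed latches. Assumption (not from the paper): only the
# sequencing-element share changes; the rest of the design is unchanged.

def total_saving(seq_share: float, per_cell_saving: float) -> float:
    """Fraction of the total saved when a component that makes up
    `seq_share` of the total is reduced by `per_cell_saving`."""
    return seq_share * per_cell_saving

# Flip-flops: 50% of area, pulsed latch is 25% smaller.
area_saving = total_saving(seq_share=0.50, per_cell_saving=0.25)
# Flip-flops: 65% of energy, pulsed latch uses 20% less energy per cycle.
energy_saving = total_saving(seq_share=0.65, per_cell_saving=0.20)

print(f"estimated area saving:   {area_saving:.1%}")
print(f"estimated energy saving: {energy_saving:.1%}")
```

Under these assumptions the estimate comes to roughly 12.5% in area and 13% in energy for the whole engine.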
Additional energy and area savings were achieved by introducing multi-bit pulsed latches, i.e. sharing the same clock drivers across multiple pulsed latches, as shown in Fig. 3(b). The internal clock buffer was sized (2X) to balance the internal clock slope (which determines hold time and energy) against the cell area. This reduces the clocking energy by 40% compared to the 1-b latch-based design (see Fig. 4). The resulting 8-bit pulsed latch was designed as an addition to a commercial standard cell library, and for integration with an existing digital design flow.
Post Layout Simulation Results
The post-layout simulation results in Fig. 4 show the area and energy benefits (12-14%) of adopting multi-bit pulsed latches instead of conventional flip-flops. The proposed bit-parallel microarchitecture occupies 690µm² (Fig. 6). The maximum throughput at 0.9V, 25°C and typical corner is 443Mbps without time borrowing (which offers an extra 10% cycle reduction). The power consumption is 434µW at 443MHz, leading to an energy efficiency of 1.02Tbps/W. No timing violations occur under an externally generated duty cycle of 10-24%. The minimum energy point lies at 225mV, leading to an energy per bit of 0.104pJ/b.
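The efficiency figure follows directly from the reported throughput and power. As a quick check, the sketch below recomputes it from the numbers quoted above and converts it to energy per bit at 0.9V. The variable names are illustrative, and the ratio to the minimum energy point is a derived figure rather than one stated in the text.

```python
# Recompute the energy efficiency from the reported post-layout numbers.
throughput_bps = 443e6   # 443 Mbps at 0.9 V, 25 °C, typical corner
power_w = 434e-6         # 434 µW at 443 MHz

# Energy efficiency is throughput per unit power (bits/s per watt).
efficiency_bps_per_w = throughput_bps / power_w
# Energy per bit is its reciprocal (joules per bit).
energy_per_bit_j = power_w / throughput_bps

# Reported energy per bit at the minimum energy point (225 mV).
mep_energy_per_bit_j = 0.104e-12
ratio_to_mep = energy_per_bit_j / mep_energy_per_bit_j

print(f"efficiency at 0.9 V:     {efficiency_bps_per_w / 1e12:.2f} Tbps/W")
print(f"energy per bit at 0.9 V: {energy_per_bit_j * 1e12:.2f} pJ/b")
print(f"ratio to minimum energy point: {ratio_to_mep:.1f}x")
```

This reproduces the 1.02Tbps/W figure, which corresponds to about 0.98pJ/b at 0.9V, i.e. roughly 9.4 times the energy per bit at the minimum energy point.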
The baseline bit-serial microarchitecture occupies 861µm 
