Received 30 July 2023; revised 20 September 2023 and 29 September 2023; accepted 2 October 2023. Date of publication 5 October 2023; date of current version 5 December 2023.

Digital Object Identifier 10.1109/JXCDC.2023.3322292

# Energy Efficient Logic and Memory Design With Beyond-CMOS Magnetoelectric Spin–Orbit (MESO) Technology Toward Ultralow Supply Voltage

# ROHIT ROTHE<sup>1</sup> (Student Member, IEEE), HAI LI<sup>2</sup> (Member, IEEE), DMITRI E. NIKONOV<sup>2</sup> (Senior Member, IEEE), IAN A. YOUNG<sup>2</sup> (Life Fellow, IEEE), KYOJIN CHOO<sup>3</sup> (Member, IEEE), and DAVID BLAAUW<sup>1</sup> (Fellow, IEEE)

<sup>1</sup>Department of Electrical and Computer Engineering, University of Michigan, Ann Arbor, MI 48109 USA

<sup>2</sup>Components Research Group, Intel Corporation, Hillsboro, OR 97124 USA

<sup>3</sup>Swiss Federal Institute of Technology (EPFL), 1015 Lausanne, Switzerland

CORRESPONDING AUTHOR: R. ROTHE (rohitrr@umich.edu)

This work was supported in part by the Intel Semiconductor Corporation and in part by the Intel Science and Technology Center (ISTC)—Valleytronics.

This article has supplementary downloadable material available at https://doi.org/10.1109/JXCDC.2023.3322292, provided by the authors.

**ABSTRACT** Devices based on the spin as the fundamental computing unit provide a promising beyond-complementary metal–oxide–semiconductor (CMOS) device option, thanks to their energy efficiency and compatibility with CMOS. One such option is a magnetoelectric spin–orbit (MESO) device, an attojoule-class emerging technology promising to extend Moore's law. This article presents circuit design and optimization techniques for MESO device technology, such as device stacking and a canary circuit-based asynchronous clock pulse generation scheme. With these targeted circuit techniques, the MESO energy efficiency can be improved by $\sim 1.5 \times$. Novel architectures for arithmetic logic and an effective realization of in-memory computing that utilize the unique properties of this promising new technology are also proposed.

**INDEX TERMS** Beyond-complementary metal–oxide–semiconductor (CMOS) logic, canary circuits, in-memory computation, magnetoelectric (ME), SPICE, spin–orbit (SO).

#### I. INTRODUCTION

The very large-scale integration (VLSI) industry has always strived for improvements in performance, power, size, and cost with each technology generation. However, the returns from complementary metal-oxide-semiconductor (CMOS) scaling have started to diminish with recent technology nodes. The CMOS operating voltage has not been reduced at the same rate as density has increased because the threshold voltage has been reduced only marginally. Supply voltage scaling has become increasingly challenging, and OFF-transistor leakage current has limited the system's energy efficiency. This has hampered strategies for overcoming the CMOS power dissipation concern. An important avenue in the search for lower power and better performance is exploring beyond-CMOS approaches. Many alternatives have been proposed to complement CMOS and sustain the trajectory of Moore's law [1]. One of the leading candidates is the magnetoelectric spin-orbit (MESO) device [2], [3], [4], which promises attojoule-class energy efficiency with a supply voltage below 100 mV. MESO compares very well with other beyond-CMOS technologies as well as advanced CMOS processes [5], [6]. MESO technology exhibits an excellent throughput at very low power density and delay [6].

An MESO device consists of two primary blocks: 1) an input voltage-driven magnetoelectric (ME) capacitor that switches a ferromagnet (FM) and 2) a spin-orbit (SO) output module, in which the spin current from the FM layer creates a positive or negative output charge current, depending on the magnetization in the FM. This spin current flows into the inverse SO coupling (ISOC) conversion stack beneath the FM, which performs spin-to-charge conversion based on the inverse spin-Hall effect (ISHE) and the inverse Rashba–Edelstein effect (IREE) [7], [8], [9]. Depending on the spin current polarity, either positive or negative charge current flows into the metallic interconnect that drives the next logic gate.

To regulate the amount of charge current flowing through the ISOC stack, some combination of header or footer power-gating CMOS transistors is used. These transistors are clocked using multiphase clock pulses. This ensures that the device consumes DC current only when required for logic gate evaluation.

FIGURE 1. Modeling MESO device using a hybrid Verilog-A & SPICE model [10].

| Gate  | MESO devices | MESO area (devices + CMOS headers) | CMOS transistors | CMOS area |
|-------|--------------|------------------------------------|------------------|-----------|
| NOT   | 2            | $4F^2 + 4F^2$                      | 2                | $4F^2$    |
| NAND  | 4            | $8F^2 + 8F^2$                      | 4                | $8F^2$    |
| NOR   | 4            | $8F^2 + 8F^2$                      | 4                | $8F^2$    |
| 3-Maj | 4            | $8F^2 + 8F^2$                      | 14               | $28F^2$   |
| 5-Maj | 6            | $12F^2 + 12F^2$                    | 62               | $124F^2$  |
| 7-Maj | 8            | $16F^2 + 16F^2$                    | 282              | $564F^2$  |

TABLE 1. Comparison of number of devices required to implement logic gates in MESO versus CMOS transistors.

Fig. 1 shows the mapping of the MESO device to the simulation model. The nodes $n_1$ and $n_2$ are the ME capacitor nodes, whereas nodes $c_1$–$c_4$ represent the charge-based terminals of the SO module [10]. A hybrid SPICE and Verilog-A model based on the physics of the MESO device, as shown in Fig. 1, was described in [10]. Multiphysics coupling is computationally intensive for circuit design and simulation; this model instead captures all the primary physical behavior using a circuit approach. Using this model, this article demonstrates implementations of arithmetic operations, such as addition and multiplication, and of in-memory computing. An asynchronous clock generation scheme is proposed to reduce the DC through-current consumption of the power-gated MESO logic family. Furthermore, a memory architecture demonstrating how the MESO device is ideally positioned to perform efficient in-memory computing is also proposed.

#### **II. DEVICE STACKING**

| Time period (ns) | Stacked inverter VDD (mV) | Stacked inverter energy (fJ) | Nonstacked inverter VDD (mV) | Nonstacked inverter energy (fJ) |
|------------------|---------------------------|------------------------------|------------------------------|---------------------------------|
| 1                | 450                       | 0.889                        | 300                          | 1.526                           |
| 2                | 250                       | 0.547                        | 150                          | 0.768                           |
| 4                | 175                       | 0.529                        | 100                          | 0.681                           |
| 8                | 125                       | 0.537                        | 75                           | 0.768                           |

TABLE 2. Supply voltage and energy of stacked versus nonstacked MESO inverters for different time periods.

FIGURE 2. Stacking MESO devices to efficiently reuse the DC through current.

The fundamental logic unit using the MESO device is a majority gate. A three-input majority gate requires only four MESO devices, whereas an implementation in CMOS requires at least 14 transistors. Moreover, as the complexity of the logic increases, this difference escalates (as shown in Table 1); for example, a five-input majority gate requires six MESO devices as compared with 62 CMOS transistors. A single-sided MESO device (area: $2F \times 1F = 2F^2$) can be overlaid on a CMOS transistor [5] ($F$ = smallest feature size). The area of a CMOS transistor is $2F \times 2F = 4F^2$. The differential MESO device area can be extrapolated to $2F \times 2F = 4F^2$ (the same as a CMOS transistor). An example layout of a differential MESO device along with its header transistor is shown in Fig. 1. Table 1 also accounts for the area overhead of the header transistors required for the MESO implementation (area shown as MESO area + CMOS header area).
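The area entries of Table 1 can be compared directly. The short Python sketch below copies those entries (in units of $F^2$) and computes the ratio of all-CMOS area to total MESO area including headers; the dictionary and function names are illustrative and not part of the original work.

```python
# Area entries copied from Table 1, in units of F^2.
# Each tuple is (MESO device area, CMOS header area, all-CMOS gate area).
TABLE_1_AREA_F2 = {
    "NOT":   (4, 4, 4),
    "NAND":  (8, 8, 8),
    "NOR":   (8, 8, 8),
    "3-Maj": (8, 8, 28),
    "5-Maj": (12, 12, 124),
    "7-Maj": (16, 16, 564),
}

def area_ratio(gate):
    """Return all-CMOS area divided by total MESO area (devices plus headers)."""
    meso, header, cmos = TABLE_1_AREA_F2[gate]
    return cmos / (meso + header)

for gate in TABLE_1_AREA_F2:
    print(f"{gate:6s} CMOS/MESO area ratio = {area_ratio(gate):.2f}")
```

With the header overhead included, the simple gates in Table 1 occupy more area in MESO than in CMOS (ratio below 1), while the ratio grows quickly with the majority-gate fan-in.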

Header transistors control the flow of the power supply current into the MESO device so that alternating clock phases allow devices to cascade in logic gate stages. Stacking of MESO logic devices is proposed to reuse this current and allow multiple devices in a column of logic to function in parallel. Compared with the standard implementation [2], this enables $\sim 1.5 \times$ energy saving, as verified in simulations and shown in Table 2. In the current implementation, up to three devices can be stacked. The marginal energy saving and the requirement of a much higher supply voltage are deterrents against stacking more than three MESO devices.

Fig. 2 shows the stacking of two inverters. Devices (INP1 and INP2) and (INV1 and INV2) will share the DC through current controlled by the shared nMOS header. Another advantage of stacking devices is that it reduces the overhead of nMOS devices used for MESO clock pulse gating.

Comparing the stacked and nonstacked versions of MESO inverters versus CMOS technology yields the results shown in Fig. 3. Supply voltage as low as 100 mV can be used for the stacked inverter. The load capacitance does not severely impact the energy for the MESO versions, as shown in Fig. 3. This is because the largest contribution to the energy of an MESO device comes from the DC current that flows from $c_4$ to $c_3$. The generated charge current associated with the load capacitance is extremely small compared with this DC current. The canary circuit allows the DC current to flow only for the minimum required time. With more efficient SO modules in the future, the ratio of DC power to SO-generated power will decrease, enabling lower total power. As expected, the energy consumed by the CMOS inverter increases almost linearly with the load capacitance.

FIGURE 3. Energy comparison for MESO versus CMOS (12-nm FinFET).

#### **III. ASYNCHRONOUS CLOCK GENERATION**

Using multiphase clocks for timing the stages of logic has two main issues: 1) the clock pulses must be generated externally and distributed globally, incurring energy overhead and 2) the clock phases are not adjusted to the complexity of the logic gates they are clocking. The pulsewidth required for the magnetization to flip in a three-input majority gate is shorter than that required for a five- or a seven-input majority gate.

Many complex logic operations in CMOS can be expressed simply using majority gates. The carry-out and sum-out of a full adder can be generated by using three- and five-input majority gates, respectively. Priority encoders are essential digital components for many modern architectures. They require high-fan-in logic gates [11], such as four-input AND and eight-input AND gates, to support higher bit widths. These can be easily implemented in majority gate logic, for example

$$\text{AND4} = 7\text{-Majority}(A, B, C, D, \text{"0"}, \text{"0"}, \text{"0"}).$$
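The AND-from-majority construction can be checked exhaustively. The following is a minimal Python sketch of a behavioral majority function; `majority` and `and4` are illustrative names and model only the Boolean function, not the MESO device physics.

```python
from itertools import product

def majority(*bits):
    """Return 1 when more than half of an odd number of input bits are 1."""
    assert len(bits) % 2 == 1, "a majority gate needs an odd number of inputs"
    return int(sum(bits) > len(bits) // 2)

def and4(a, b, c, d):
    # Three inputs are tied to "0", so at least four of the seven inputs are 1
    # only when all four data inputs are 1.
    return majority(a, b, c, d, 0, 0, 0)

# Exhaustive check against the Boolean AND of the four inputs.
for a, b, c, d in product((0, 1), repeat=4):
    assert and4(a, b, c, d) == (a & b & c & d)
```

Tying the three spare inputs to "1" instead of "0" turns the same seven-input gate into a four-input OR.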

Table 3 shows that the time required for magnetization to flip varies widely with the complexity of the MESO majority gates (time difference of $6.93 \times$ from one-majority (buffer/inverter) to seven-input majority). If no canary circuit were used, a pessimistic clock pulse sized for a five-majority gate (2.32 ns, assuming five-majority is the highest complexity) would have to be used for every logic stage. In that case, the one-majority gate is unnecessarily ON for an extra (2.32 − 0.62) = 1.7 ns. Without a canary circuit, it is also necessary to allocate some margin for the worst-case timing. One option is to consider a seven-majority gate as the worst-case pulsewidth.

FIGURE 4. Example of logic data-path along with the corresponding canary circuit for clock generation.
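The cost of a single fixed pulsewidth can be tabulated from the flip times in Table 3. The Python sketch below is only an arithmetic illustration; the function name and the choice to omit gates slower than the chosen pulse are assumptions made for this example.

```python
# Magnetization flip times from Table 3, in nanoseconds, keyed by gate fan-in.
FLIP_TIME_NS = {1: 0.62, 3: 1.41, 5: 2.32, 7: 4.45}

def excess_on_time_ns(pulse_fan_in):
    """Extra header ON time per gate type when every stage uses one fixed pulse.

    pulse_fan_in selects which gate's flip time is used as the uniform pulse.
    Gates slower than the pulse are omitted because they would not finish.
    """
    pulse = FLIP_TIME_NS[pulse_fan_in]
    return {n: round(pulse - t, 2) for n, t in FLIP_TIME_NS.items() if t <= pulse}

print(excess_on_time_ns(5))  # uniform five-majority pulse
print(excess_on_time_ns(7))  # uniform seven-majority pulse
```

For a uniform five-majority pulse, the one-majority entry reproduces the 1.7 ns of unnecessary ON time quoted above.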

Designing a data path using MESO devices involves various combinations of majority gates, all of them requiring different lengths of time. Given the magnitude of the current when the headers are ON, it is paramount that a gate is ON for the minimum required amount of time.

Canary-based clocking, as shown in Fig. 4 (an example of data-path design with a canary circuit), is proposed to address this. The clock for each logic stage is generated locally depending on the type of majority gate(s) used in that logic stage, reducing overall circuit delay.

The worst-case gate delay from each column/data-path logic stage is replicated and used as a representative in the canary circuit. This ensures that the generated clock pulse can satisfy the time requirement for each gate in that data-path bit. For the worst-case evaluation, the devices in the canary circuit are initialized such that each stage flips the magnetization of the next one. Fig. 5(a) shows an example of such a canary circuit for a full adder. The three-input majority gate generates the carry-out. The generated carry-out, along with the inputs and carry-in, is used to generate the sum bit with the five-input majority gate.

The principle is to propagate the signal in the canary chain from the MESO devices U1–U6 and flip the magnetization of every gate. A signal can be considered to have propagated through an MESO gate (e.g., U4), and its headers can be disabled when the magnetization of the next gate (U5) flips completely. However, magnetization cannot be sensed electronically. Hence to ensure that this device (U5) has completely flipped, the differential output voltage of the following device (U6) is monitored. Once the threshold ( $\gamma$ ) is crossed, the U2 device can be safely switched OFF. The state machine for CLK(5) generation is shown in Fig. 5(b).

This sequence is implemented using an active-high gated D latch and comparators that compare the differential output voltages, as shown in Fig. 6. The simulation results for the canary circuit and the generated clock signals are shown in Fig. 7. The extra hardware required for the clock generation can be amortized over several logic stages running in parallel with the same clock and the same logic complexity (multiple bits in a word). The canary circuit tunes the clock pulsewidth based on each majority gate's logic size.

FIGURE 5. (a) Example of a canary circuit used for asynchronous clock generation (full adder implementation for carry and sum generation). (b) Flowchart of triggers and states involved in clock generation for CLK(5).

FIGURE 6. Circuit implementation for clock generation at each stage.

With the canary circuit, the computation of Fig. 5(a) would finish in 6.68 ns, whereas in its absence it would take 26.77 ns (seven-majority time $\times$ six stages $\approx 4.45 \times 6$ ns). The seven-majority gate time is assumed to be the worst case to accommodate the margin for variations. The total energy for the circuit of Fig. 5(a) (including the digital auxiliary circuits and excluding the comparators) is 220.4 fJ. Without the canary circuit, if each clock pulse is assumed to be that of a seven-majority gate, the energy would be 341.5 fJ. It should be noted that without the canary circuit, a clock pulse generator or clock tree would be required, which usually consumes power in the milliwatt range and involves significantly more design complexity to manage global timing skew.

FIGURE 7. Simulation results show the generated clock phases. Arrows show the triggers for CLK(5) generation. Here, $v_c$ is the differential voltage at each output.

Accounting for comparator energy would add to the canary circuit energy. A state-of-the-art comparator in 65 nm [12] consumes 30-fJ per conversion at a supply voltage of 1.2 V. An improvement of  $5 \times$  can be assumed by scaling to 12-nm CMOS FinFETs. Moreover, since MESO devices support ultralow voltages, reducing the supply voltage to 0.6 V would be highly beneficial. Assuming a 6-fJ energy overhead per comparator would result in a total energy of 220.4 fJ (canary) +  $6 \times 2 \times 6$  fJ (comparators) = 292.4 fJ for the entire canary circuit. Low voltage operation also enables inverter-based comparators [13] which have the potential to be lower in energy.
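The comparator overhead estimate can be reproduced with a few lines of arithmetic. The Python sketch below uses only figures quoted in the text; the variable names are illustrative, and reading the "6 × 2" factor as a count of twelve comparators is an assumption made for this example.

```python
CANARY_CORE_FJ = 220.4      # Fig. 5(a) circuit with canary, comparators excluded
NO_CANARY_FJ = 341.5        # same circuit with seven-majority pulses everywhere
COMPARATOR_65NM_FJ = 30.0   # comparator energy reported in [12]
SCALING_GAIN = 5.0          # improvement assumed in the text for 12-nm FinFET
NUM_COMPARATORS = 6 * 2     # the "6 x 2" factor in the text (assumed comparator count)

comparator_fj = COMPARATOR_65NM_FJ / SCALING_GAIN
canary_total_fj = CANARY_CORE_FJ + NUM_COMPARATORS * comparator_fj
energy_gain = NO_CANARY_FJ / canary_total_fj
latency_gain = 26.77 / 6.68   # latencies quoted for Fig. 5(a), in ns

print(f"comparator energy  : {comparator_fj:.1f} fJ")
print(f"canary total energy: {canary_total_fj:.1f} fJ")
print(f"energy gain        : {energy_gain:.2f}x")
print(f"latency gain       : {latency_gain:.2f}x")
```

Under these numbers the canary-based circuit still uses less energy than the fixed seven-majority pulse once the comparators are included, and the latency improvement is roughly fourfold.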

Table 4 lists the comparison of the latency and energy of a canary-based versus noncanary-based implementation. The energy is amortized over four circuit blocks each. For example, the canary for a 4-bit ripple carry adder consumes 0.68 pJ (accounting for the energy of comparators). Each 4-bit ripple carry adder would consume 0.76 pJ using the clock pulses generated by the canary circuit. Therefore, the total energy for the canary-based approach is 0.68 pJ + 4 × 0.76 pJ = 3.72 pJ, whereas, without the canary circuit, the four 4-bit ripple carry adders would consume $4 \times 1.08$ pJ = 4.32 pJ.

TABLE 3. Comparison of the time required for magnetization to flip for majority gates of different input complexity (no canary circuit involved).

| Gates      | Time Required |
|------------|---------------|
| 1-Majority | 0.62 ns       |
| 3-Majority | 1.41 ns       |
| 5-Majority | 2.32 ns       |
| 7-Majority | 4.45 ns       |

TABLE 4. Energy (amortized for four blocks of each) and latency comparison without and with the canary circuit.

| Components               | Energy without canary (pJ) | Latency without canary (ns) | Energy with canary (pJ) | Latency with canary (ns) |
|--------------------------|----------------------------|-----------------------------|-------------------------|--------------------------|
| Full Adder               | 1.02                       | 13.42                       | 0.90                    | 6.400                    |
| 4-bit Ripple Carry Adder | 4.32                       | 22.29                       | 3.72                    | 14.13                    |
| 4-bit Tree Multiplier    | 14.2                       | 40.07                       | 10.7                    | 24.89                    |
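The amortization for the ripple carry adder can be written out explicitly. The Python sketch below uses the per-block energies quoted in the text; the helper names and the break-even calculation are illustrative additions, not results reported in the article.

```python
def amortized_energy_pj(canary_pj, block_pj, n_blocks):
    """Total energy when one canary circuit clocks n_blocks identical blocks."""
    return canary_pj + n_blocks * block_pj

def min_blocks_for_saving(canary_pj, block_with_pj, block_without_pj):
    """Smallest block count for which sharing one canary saves energy."""
    saving_per_block = block_without_pj - block_with_pj
    if saving_per_block <= 0:
        return None  # the canary never pays for itself
    return int(canary_pj / saving_per_block) + 1

with_canary = amortized_energy_pj(0.68, 0.76, 4)   # 4-bit ripple carry adder values
without_canary = 4 * 1.08
print(f"with canary   : {with_canary:.2f} pJ")
print(f"without canary: {without_canary:.2f} pJ")
print("break-even blocks:", min_blocks_for_saving(0.68, 0.76, 1.08))
```

With these figures the shared canary starts to pay off from three adder blocks onward, which is consistent with amortizing it over four blocks in Table 4.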

## **IV. ARITHMETIC LOGIC DESIGN**

### A. 4-BIT RIPPLE CARRY ADDER

Having a majority gate as the fundamental logic unit allows the generation of complex logic with a much lower number of devices.

Typically, it is observed that the carry bit is the bottleneck in any architecture of an adder. However, in the MESO technology, carry is the output of a three-input majority gate as used in [14]

$$C_{\text{out}} = 3\text{-Majority}(A, B, C_{\text{in}})$$

$$S_{\text{out}} = 5\text{-Majority}(A, B, C_{\text{in}}, \sim C_{\text{out}}, \sim C_{\text{out}})$$

where A and B are the two inputs along with the input carry,  $C_{in}$ . Using the generated output carry bit and the three inputs, sum out ( $S_{out}$ ) can be obtained by using a five-input majority gate. The advantage of this design is that the carry bit can ripple through quickly and the corresponding sum bits can be generated later.
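The two majority expressions above can be verified against ordinary binary addition. The following Python sketch is a behavioral check only; `majority` and `full_adder` are illustrative names.

```python
from itertools import product

def majority(*bits):
    """Return 1 when more than half of an odd number of input bits are 1."""
    return int(sum(bits) > len(bits) // 2)

def full_adder(a, b, c_in):
    """Full adder built from one 3-input and one 5-input majority gate."""
    c_out = majority(a, b, c_in)
    n_c_out = 1 - c_out                      # inverted carry, used twice
    s_out = majority(a, b, c_in, n_c_out, n_c_out)
    return s_out, c_out

# Exhaustive check: 2*c_out + s_out must equal a + b + c_in.
for a, b, c_in in product((0, 1), repeat=3):
    s_out, c_out = full_adder(a, b, c_in)
    assert 2 * c_out + s_out == a + b + c_in
```

The doubled inverted carry acts as a weight of two: it cancels two of the inputs when the carry is set and adds two votes when it is not, which is why a single five-input gate yields the sum bit.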

Implementation of a 4-bit ripple carry adder is shown in Fig. 8. Carry generation for a full adder is just a three-input majority gate. Since the carry generation is only a one-gate operation, the carry can ripple very quickly. The sum is generated by using the generated carry output and the inputs.

The circuit block diagram of the 4-bit ripple carry adder consists of four cascaded full adders. Each full adder includes a carry generation block (three-input majority gate) and a sum generation block (five-input majority gate). The carry-out generation for the next bit and the sum generation of the current bit happen in parallel.

FIGURE 8. Block diagram for the 4-bit ripple carry adder along with its canary circuit for clock generation.

The operation is as follows.

- 1) CLK\_0  $\rightarrow$  Generate  $C_{\text{OUT}}\langle 0 \rangle$  using  $A\langle 0 \rangle$ ,  $B\langle 0 \rangle$ , and  $C_{\text{IN}}$ .
- 2) CLK\_1  $\rightarrow$  Use  $C_{\text{OUT}}\langle 0 \rangle$,  $A\langle 1 \rangle$, and  $B\langle 1 \rangle$  to generate  $C_{\text{OUT}}\langle 1 \rangle$. In parallel, use  $A\langle 0 \rangle$,  $B\langle 0 \rangle$,  $C_{\text{IN}}$,  $\sim C_{\text{OUT}}\langle 0 \rangle$, and  $\sim C_{\text{OUT}}\langle 0 \rangle$  to generate SUM$\langle 0 \rangle$.

Step 2 is repeated for the subsequent full adders enabling the rippling of carry.
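The phase-by-phase schedule above can be sketched behaviorally. The Python example below models each clock phase as one loop iteration and counts the phases; it is a logical sketch under that assumption and does not model device timing. Bit lists are least-significant-bit first, and all names are illustrative.

```python
def majority(*bits):
    """Return 1 when more than half of an odd number of input bits are 1."""
    return int(sum(bits) > len(bits) // 2)

def ripple_carry_add(a_bits, b_bits, c_in=0):
    """Add two LSB-first bit lists; return (sum_bits, carry_out, clock_phases)."""
    n = len(a_bits)
    carries = [c_in]
    sums = []
    # CLK_0: only the carry of bit 0 is generated.
    carries.append(majority(a_bits[0], b_bits[0], c_in))
    for k in range(1, n + 1):
        # CLK_k: carry of bit k (if any) and sum of bit k-1 evaluate in parallel.
        if k < n:
            carries.append(majority(a_bits[k], b_bits[k], carries[k]))
        n_c = 1 - carries[k]
        sums.append(majority(a_bits[k - 1], b_bits[k - 1], carries[k - 1], n_c, n_c))
    return sums, carries[n], n + 1

def to_bits(value, n):
    return [(value >> i) & 1 for i in range(n)]

sums, c_out, phases = ripple_carry_add(to_bits(15, 4), to_bits(14, 4))
print("SUM (MSB first):", "".join(str(b) for b in reversed(sums)))
print("C_OUT:", c_out, "clock phases:", phases)
```

For 15 + 14 this gives SUM = 1101 with a carry-out of 1, using five phases (CLK\_0 to CLK\_4) for the 4-bit case.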

The optimal design of a CMOS full adder requires 28 transistors. In comparison, an MESO full adder requires four devices for carry generation and six devices for sum generation. Accounting for the overhead of the header transistors, the MESO full adder requires ten MESO devices and ten transistors.

The 4-bit addition takes five clock phases to finish all the sum bits and final carry-out generation. As shown in Fig. 8, each clock phase except CLK\_0 has a sum generation block. The sum generation block is more complex than the carry generation block (five-input majority versus three-input majority). Hence, the canary circuit for this adder will consist of five-input majority gates for CLK\_1–CLK\_4 and a three-input majority gate for CLK\_0.

Fig. 9 shows the MESO circuit simulation result of the addition of 1111 (15) and 1110 (14). The output matches the expected output of 15 + 14 = 29 ($C_{\text{OUT}} = 1$ and SUM = 1101).

FIGURE 9. Simulation result of the 4-bit ripple carry adder.

TABLE 5. Wallace tree compression for a 4-bit multiplier.

| Weight | 128 | 64 | 32 | 16 | 8 | 4 | 2 | 1 |
|--------|-----|----|----|----|---|---|---|---|
| Wires  | 0   | 1  | 2  | 3  | 4 | 3 | 2 | 1 |
| Pass   |     | 1  | 2  | 1  | 1 |   |   | 1 |
| HA     |     |    |    | 1  |   |   | 1 |   |
| FA     |     |    |    |    | 1 | 1 |   |   |
| Wires  | 0   | 1  | 3  | 3  | 3 | 2 | 1 | 1 |
| Pass   |     | 1  |    |    |   |   | 1 | 1 |
| HA     |     |    |    |    |   | 1 |   |   |
| FA     |     |    | 1  | 1  | 1 |   |   |   |
| Wires  | 0   | 2  | 2  | 2  | 2 | 1 | 1 | 1 |

#### **B. 4-BIT TREE MULTIPLIER**

The compression tree for a 4-bit tree multiplier is shown in Table 5. The tree is compressed using the Wallace multiplier reduction scheme [15]. As shown in Table 5, the partial products must go through two stages of reduction before a 4-bit ripple carry adder in the final stage.
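The wire counts in Table 5 follow from simple bookkeeping: a half adder (HA) consumes two wires and a full adder (FA) three, and each leaves one sum wire in its column and sends one carry wire to the next higher weight. The Python sketch below replays the HA and FA rows of Table 5 under that rule; the function name is illustrative.

```python
def reduce_stage(wires, half_adders, full_adders):
    """Apply one Wallace reduction stage to per-column wire counts.

    All lists are ordered from weight 128 (index 0) down to weight 1 (index 7),
    matching the column order of Table 5.
    """
    out = [0] * len(wires)
    for i, count in enumerate(wires):
        ha, fa = half_adders[i], full_adders[i]
        passed = count - 2 * ha - 3 * fa      # wires forwarded unchanged
        assert passed >= 0, "more adder inputs than wires in this column"
        out[i] += passed + ha + fa            # each adder leaves one sum bit here
        if ha + fa:
            assert i > 0, "carry out of the most significant column"
            out[i - 1] += ha + fa             # each adder sends one carry up
    return out

initial = [0, 1, 2, 3, 4, 3, 2, 1]
stage1 = reduce_stage(initial, [0, 0, 0, 1, 0, 0, 1, 0], [0, 0, 0, 0, 1, 1, 0, 0])
stage2 = reduce_stage(stage1, [0, 0, 0, 0, 0, 1, 0, 0], [0, 0, 1, 1, 1, 0, 0, 0])
print(stage1)  # [0, 1, 3, 3, 3, 2, 1, 1]
print(stage2)  # [0, 2, 2, 2, 2, 1, 1, 1]
```

After the second stage no column holds more than two wires, so the remaining rows can be handed to the final ripple carry adder.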

The circuit block diagram for the multiplier is shown in Fig. 10. The first stage is the generation of partial products in CLK\_0

$$PP = 3\text{-Majority}(A, B, \text{"0"}).$$

The partial product is the AND operation on the two input bits. This is implemented in the MESO technology using a three-input majority with one input set as "0."
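As an illustration of this step, the Python sketch below forms all partial products of two 4-bit operands with a three-input majority gate whose third input is tied to "0", groups them by weight, and checks that their weighted sum equals the product. The function names and the dictionary layout are assumptions made for this example.

```python
def majority3(a, b, c):
    """Three-input majority: 1 when at least two inputs are 1."""
    return int(a + b + c >= 2)

def partial_products(a, b, n_bits=4):
    """Return {weight exponent: list of partial-product bits} for a * b."""
    columns = {}
    for i in range(n_bits):
        for j in range(n_bits):
            pp = majority3((a >> i) & 1, (b >> j) & 1, 0)   # AND via majority
            columns.setdefault(i + j, []).append(pp)
    return columns

cols = partial_products(13, 11)
heights = [len(cols[k]) for k in sorted(cols)]
total = sum(bit << k for k, bits in cols.items() for bit in bits)
print("column heights (weight 1 upward):", heights)
print("weighted sum of partial products:", total)
```

The column heights 1, 2, 3, 4, 3, 2, 1 match the first "Wires" row of Table 5 read from weight 1 to weight 64, and the weighted sum for 13 and 11 is 143.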

The two 4-bit inputs result in 16 partial products. Once the partial products are ready, CLK\_1–CLK\_4 are used for the two stages of reduction shown in the Wallace tree. The least significant partial product is passed on to the output directly. The next 10 bits of partial products are used in the first stage of the Wallace tree. The outputs from this stage and the next 4 bits of partial products are used in the second stage of the Wallace tree. Furthermore, once the ripple carry adder inputs are generated, pulses CLK\_5–CLK\_9 are used to perform the final ripple addition. At the end of CLK\_9, all the multiplication outputs are available.

FIGURE 10. Block diagram of the 4-bit multiplication. Ripple carry adder for final addition after Wallace tree compression.

FIGURE 11. Simulation result for the multiplication operation.

The canary circuit is used to generate the clock phases for the multiplication operation. Fig. 11 shows the MESO circuit simulation result for a multiplication of A = 1101(13) and B = 1011 (11). The output matches the expected result = 10001111 (143).

#### V. MEMORY DESIGN AND IN-MEMORY COMPUTATION

The magnetization in the MESO device FM is retentive [3], [4] and acts as the state output that can be monitored. Accordingly, a memory can be organized as shown in Fig. 12(a).

The structure makes use of differential word bit lines (WBL), write word lines (WWL), and read word lines (RWL). Each memory element is preceded by a write access device. A read-access MESO device is connected to a column of memory elements as can be seen in an example of in-memory addition in Fig. 13. Compared with an SRAM memory bank, the MESO technology does not require any sense amplifier. Fig. 12(b) shows the timing information of signals WBL, WWL, and RWL required for writing and reading from memory.

# IEEE Journal on Exploratory Solid-State Computational Devices and Circuits



FIGURE 12. Memory design (read and write modes). (a) Unit memory design using a write access device, memory element and a read access device. (b) Timing diagram for the WBL, WWL, and RWL for the write and read operation.

*WRITE:* WBL enables the input data to be written to write access MESO  $\rightarrow$  WWL = ON.

*READ:* RWL = ON $\rightarrow$ Read access MESO's magnetization is programmed with the data.

#### A. IN-MEMORY ADDITION

This memory organization enables in-memory computation. Multiple memory elements can be enabled simultaneously to obtain a majority gate operation at the bitline output of the read access MESO.

The in-memory compute operation for 2-bit addition is shown in Fig. 13. The steps for the in-memory addition are as follows.

- 1) Write inputs  $A\langle 1:0 \rangle$  and  $B\langle 1:0 \rangle$  in the memory WWL$\langle 0:3 \rangle$  (Clock count = 4).
- 2) $A\langle 0 \rangle$ and $B\langle 0 \rangle$ are used to generate carry in the WWL$\langle 0 \rangle$ and WWL$\langle 1 \rangle$ devices (Clock count = 5).
- 3) Write $C_{\text{OUT}}\langle 0 \rangle$ in write access devices to prepare for SUM$\langle 0 \rangle$ generation (Clock count = 6).
- 4) Write $C_{\text{OUT}}\langle 0 \rangle$ in the memory for SUM$\langle 0 \rangle$ generation (Clock count = 7).
- 5) Generate SUM$\langle 0 \rangle$  (Clock count = 8).

The inputs are written in the memory in step 1 (which takes four clock cycles). Steps 2–5 describe the generation of SUM$\langle 0 \rangle$ and are repeated to generate SUM$\langle 1 \rangle$. An extra transition clock cycle is required at the end to prepare for the next computation. Hence, it takes a total of 13 clock phases to perform the in-memory addition. Fig. 14 shows the simulation result for the in-memory addition. The inputs used are A = 11 (3) and B = 10 (2). The output matches the expected output of 5 with SUM = 01 and Carry Out = 1. This series of steps can be performed in parallel in multiple memory columns, leading to a high throughput.
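The clock-phase count can be reproduced from the steps above. The Python sketch below counts phases for the 2-bit case described in the text; extending the same formula to other bit widths is an assumption made here, not a result from the article, and the function names are illustrative.

```python
def in_memory_add_phases(n_bits):
    """Clock phases for in-memory addition, following the listed steps."""
    write_inputs = 2 * n_bits   # step 1: one phase per stored bit of A and B
    per_sum_bit = 4             # steps 2 to 5, repeated for every sum bit
    transition = 1              # extra phase to prepare the next computation
    return write_inputs + per_sum_bit * n_bits + transition

def expected_outputs(a, b, n_bits):
    """Reference sum bits (MSB first) and carry-out for comparison."""
    total = a + b
    sum_bits = format(total % (1 << n_bits), f"0{n_bits}b")
    carry_out = total >> n_bits
    return sum_bits, carry_out

print("phases for 2-bit addition:", in_memory_add_phases(2))
print("expected (SUM, carry) for 3 + 2:", expected_outputs(3, 2, 2))
```

For the 2-bit example this gives 4 + 8 + 1 = 13 phases, and the reference result for 3 + 2 is SUM = 01 with a carry-out of 1.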



FIGURE 13. Memory structure enabling in-memory addition operation.

#### **B. IN-MEMORY MULTIPLICATION**

For the in-memory multiplication, the read access MESO devices are used hierarchically to provide flexibility and parallelism and to enable complex operations. This is another advantage of the MESO memory fabric: hierarchical construction allows the fabric to grow in a multiplicative fashion. For the circuit example of Fig. 15, the READ<sub>1</sub> device is capable of reading from 15 memory elements directly (via READ<sub>01-04</sub>) or after a majority gate operation on this data. The circuit block diagram for 2-bit multiplication is shown in Fig. 15. The in-memory multiplication is achieved by obtaining the product in a bit-serial fashion [16].

FIGURE 14. In-memory addition simulation result. Here, A = 11 (3) and B = 10 (2) resulting in output sum = 5.

FIGURE 15. Memory structure enabling in-memory multiplication operation by using multilevel hierarchy.

The step-by-step operation for the in-memory multiplication is as follows.

- 1) $P_0 = \text{AND}(A_0, B_0)$.
- 2) $P_1 = \text{SUM}_{\text{OUT}}(A_0B_1, A_1B_0)$. The carry out (Carry<sub>1</sub>) is passed to the next step.
- 3) $P_2 = \text{SUM}_{\text{OUT}}(A_1B_1, \text{Carry}_1)$.
- 4) $P_3 = \text{Carry}_2$.

FIGURE 16. In-memory multiplication simulation result. Here, A = 11 (3) and B = 10 (2) resulting in output product = 6.
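The four bit-serial steps can be checked for every pair of 2-bit operands. The Python sketch below models AND and the sum and carry outputs with majority gates, as in the adder equations of Section IV; it is a behavioral sketch with illustrative names and does not model the memory array itself.

```python
def majority(*bits):
    """Return 1 when more than half of an odd number of input bits are 1."""
    return int(sum(bits) > len(bits) // 2)

def and_gate(a, b):
    return majority(a, b, 0)

def add_bits(x, y, c_in=0):
    """Return (sum_out, carry_out) using 3- and 5-input majority gates."""
    c_out = majority(x, y, c_in)
    n_c = 1 - c_out
    return majority(x, y, c_in, n_c, n_c), c_out

def multiply_2bit(a, b):
    """Bit-serial 2-bit multiplication; returns [P3, P2, P1, P0]."""
    a0, a1 = a & 1, (a >> 1) & 1
    b0, b1 = b & 1, (b >> 1) & 1
    p0 = and_gate(a0, b0)                                        # step 1
    p1, carry1 = add_bits(and_gate(a0, b1), and_gate(a1, b0))    # step 2
    p2, carry2 = add_bits(and_gate(a1, b1), carry1)              # step 3
    p3 = carry2                                                  # step 4
    return [p3, p2, p1, p0]

print(multiply_2bit(3, 2))  # [0, 1, 1, 0]
```

For A = 3 and B = 2 the sketch yields P = 0110, in agreement with the expected product of 6.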

The in-memory multiplication can be easily extended to arbitrary bit width in a bit-serial fashion. Furthermore, as described in [16], an in-memory CMOS multiplication needs 148 transistors per multiplication operation. However, the proposed architecture using MESO devices reduces the number of devices to 36 transistors and 36 MESO devices.

Fig. 16 shows the simulation result for the in-memory multiplication. The inputs used are A = 11 (3) and B = 10 (2). The output matches the expected output of 6 with P = 0110. These steps can be performed in parallel in multiple memory columns leading to high throughput.

#### **VI. CONCLUSION**

MESO technology presents unique circuit opportunities to implement logic functions operating at ultralow supply voltage, enabling significant energy efficiency improvements. Using a circuit simulation device model that has been verified with physics simulation [2], [10], this article presented device stacking to reuse the power supply current, canary clocking to minimize the time for which that current flows, and in-memory computing that exploits the state retention and read-out functionality innate to the MESO structure. These techniques will help meet the ever-growing need for energy-efficient computation in general-purpose computing and artificial intelligence.

#### REFERENCES

- [1] D. E. Nikonov and I. A. Young, "Overview of beyond-CMOS devices and a uniform methodology for their benchmarking," *Proc. IEEE*, vol. 101, no. 12, pp. 2498–2533, Dec. 2013, doi: 10.1109/JPROC.2013.2252317.
- [2] H. Li, C.-C. Lin, D. E. Nikonov, and I. A. Young, "Differential electrically insulated magnetoelectric spin-orbit logic circuits," *IEEE J. Explor. Solid-State Comput. Devices Circuits*, vol. 7, no. 1, pp. 18–25, Jun. 2021, doi: 10.1109/JXCDC.2021.3105524.
- [3] D. C. Vaz et al., "Functional demonstration of a fully integrated magnetoelectric spin-orbit device," in *IEDM Tech. Dig.*, Dec. 2021, p. 32, doi: 10.1109/IEDM19574.2021.9720677.
- [4] P. Debashis et al., "Low-voltage and high-speed switching of a magnetoelectric element for energy efficient compute," in *IEDM Tech. Dig.*, Dec. 2022, p. 36, doi: 10.1109/IEDM45625.2022.10019505.
- [5] S. Manipatruni et al., "Scalable energy-efficient magnetoelectric spin-orbit logic," *Nature*, vol. 565, no. 7737, pp. 35–42, Dec. 2018.

- [6] D. E. Nikonov and I. A. Young, "Benchmarking of beyond-CMOS exploratory devices for logic integrated circuits," *IEEE J. Explor. Solid-State Comput. Devices Circuits*, vol. 1, pp. 3–11, 2015, doi: 10.1109/JXCDC.2015.2418033.
- [7] E. Saitoh, M. Ueda, H. Miyajima, and G. Tatara, "Conversion of spin current into charge current at room temperature: Inverse spin-Hall effect," *Appl. Phys. Lett.*, vol. 88, no. 18, May 2006, Art. no. 182509.
- [8] J. C. R. Sánchez et al., "Spin-to-charge conversion using Rashba coupling at the interface between non-magnetic materials," *Nature Commun.*, vol. 4, no. 1, p. 2944, Dec. 2013.
- [9] Y. Shiomi et al., "Spin-electricity conversion induced by spin injection into topological insulators," *Phys. Rev. Lett.*, vol. 113, no. 19, Nov. 2014, Art. no. 196601.
- [10] H. Li et al., "Physics-based models for magneto-electric spin-orbit logic circuits," *IEEE J. Explor. Solid-State Comput. Devices Circuits*, vol. 8, no. 1, pp. 10–18, Jun. 2022, doi: 10.1109/JXCDC.2022.3143130.
- [11] S. K. Maurya and L. T. Clark, "Fast and scalable priority encoding using static CMOS," in *Proc. IEEE Int. Symp. Circuits Syst.*, Paris, France, Jun. 2010, pp. 433–436, doi: 10.1109/ISCAS.2010.5537688.
- [12] H. S. Bindra, C. E. Lokin, A.-J. Annema, and B. Nauta, "A 30fJ/comparison dynamic bias comparator," in *Proc. 43rd IEEE Eur. Solid State Circuits Conf. (ESSCIRC)*, Leuven, Belgium, 2017, pp. 71–74, doi: 10.1109/ESSCIRC.2017.8094528.
- [13] M.-T. Tan, J. S. Chang, and Y.-C. Tong, "A process-independent threshold voltage inverter-comparator for pulse width modulation applications," in *Proc. IEEE Int. Conf. Electron., Circuits Syst. (ICECS)*, vol. 3, Sep. 1999, pp. 1201–1204, doi: 10.1109/ICECS.1999.814385.
- [14] A. Jaiswal, A. Agrawal, and K. Roy, "Robust and cascadable nonvolatile magnetoelectric majority logic," *IEEE Trans. Electron Devices*, vol. 64, no. 12, pp. 5209–5216, Dec. 2017, doi: 10.1109/TED.2017.2766570.
- [15] C. S. Wallace, "A suggestion for a fast multiplier," *IEEE Trans. Electron. Comput.*, vol. EC-13, no. 1, pp. 14–17, Feb. 1964, doi: 10.1109/PGEC.1964.263830.
- [16] J. Wang et al., "A 28-nm compute SRAM with bit-serial logic/arithmetic operations for programmable in-memory vector computing," *IEEE J. Solid-State Circuits*, vol. 55, no. 1, pp. 76–86, Jan. 2020, doi: 10.1109/JSSC.2019.2939682.



**ROHIT ROTHE** (Student Member, IEEE) received the B.Tech. and M.Tech. degrees in electrical engineering from IIT Bombay, Mumbai, India, in 2018. He is currently pursuing the Ph.D. degree in electrical and computer engineering with the University of Michigan, Ann Arbor, MI, USA.

His current research interests include ultra-low-power analog very large-scale integration (VLSI) design, beyond-CMOS circuit exploration, low-power DC-DC converters, and Internet of Things (IoT) sensor systems.



**HAI LI** (Member, IEEE) received the B.S. degree in applied physics from the Huazhong University of Science and Technology, Wuhan, China, in 2011, and the M.S. and Ph.D. degrees in electrical and computer engineering from Carnegie Mellon University, Pittsburgh, PA, USA, in 2015 and 2016, respectively.

He developed a foundational theorem of next-generation storage systems at Carnegie Mellon University. After internships at Western Digital and Apple, he joined Intel Corporation in 2016, starting with the specialized technologies program. He is currently a Senior Research Scientist in the Exploratory Integrated Circuits group, Components Research, Hillsboro, OR, USA. With a focus on multiphysics simulation, he is working on emerging technologies in the joint area of novel devices, circuits, and computing paradigms, while managing university collaboration programs on nanotechnology and exploratory topics.

Dr. Li serves as a Co-Chair for Intel Emerging Technology Strategic Research Sector (SRS).



**DMITRI E. NIKONOV** (Senior Member, IEEE) received the M.S. degree in aeromechanical engineering from the Moscow Institute of Physics and Technology, Dolgoprudny, Russia, in 1992, and the Ph.D. degree in physics from Texas A&M University, College Station, TX, USA, in 1996.

He participated in the demonstration of the world's first laser without population inversion at Texas A&M University. He joined Intel Corporation in 1998 and is presently a Principal Engineer in the Components Research group, Hillsboro, OR, USA. He is responsible for simulation, benchmarking, and design of beyond-CMOS logic devices and neural network circuits, and for managing joint research programs with universities on nanotechnology and exploratory devices. He has 145 publications in refereed journals in quantum optics, lasers, nanoelectronics, spintronics, and neural networks, and 121 issued patents in integrated optic, electronic, and spintronic devices.



**IAN A. YOUNG** (Life Fellow, IEEE) was born in Melbourne, Australia. He received the B.S. and M.S. degrees in electrical engineering from the University of Melbourne, Melbourne, and the Ph.D. degree in electrical engineering from the University of California Berkeley, Berkeley, CA, USA, in 1978.

He is an Intel Senior Fellow and the Director of the Exploratory Integrated Circuits group in Technology Development at Intel Corporation. He is responsible for defining future circuit directions with emerging novel devices and identifying leading options to manufacture energy-efficient integrated circuits for computing in the beyond-CMOS era. He joined Intel Corporation in 1983, starting with the development of circuits for the world's first 1-Mb CMOS DRAM and the first 1-µm 64-Kb SRAM. He then led the design of three generations of SRAM products and manufacturing test vehicles and developed the first integration of an analog phase-locked loop (PLL)-based clocking generator circuit on a microprocessor while working on the 50-MHz Intel 486 processor design. He subsequently developed the core PLL clocking circuit building blocks used in each generation of Intel microprocessors through the 0.13-µm 3.2-GHz Pentium 4. From 2001 to 2010, he led a circuit design team doing research and development of integrated analog/digital mixed-signal high-speed serial I/O circuits for microprocessors and wireless RF CMOS synthesizers and transceiver circuits, in conjunction with the research and development of Intel's 90-nm, 65-nm, 32-nm, and 22-nm process technologies. Prior to Intel, he worked on analog/digital mixed-signal integrated circuits for telecommunications products at Mostek Corporation. He has authored or co-authored more than 300 technical papers and holds over 250 U.S. patents. At the University of California Berkeley, he did pioneering research on MOSFET switched-capacitor active filters. He also led exploratory research to enable optical I/O for microprocessors.

Dr. Young received three Intel Achievement Awards. He served as the Technical Program Committee Chairman of the 2005 International Solid-State Circuits Conference (ISSCC) and the Chairman of the 1997 and 1998 Symposium on VLSI Circuits. He is a three-time Guest Editor for special issues of the IEEE JOURNAL OF SOLID-STATE CIRCUITS and was the Founder and inaugural Editor-in-Chief of the IEEE JOURNAL ON EXPLORATORY SOLID-STATE COMPUTATIONAL DEVICES AND CIRCUITS. He received the 2009 International Solid-State Circuits Conference's Jack Raper Award for outstanding technology directions paper, and the 2018 IEEE Frederik Philips Award "for leadership in research and development on circuits and processes for the evolution of microprocessors."



**KYOJIN CHOO** (Member, IEEE) received the B.S. and M.S. degrees in electrical engineering from Seoul National University, Seoul, Republic of Korea, in 2007 and 2009, respectively, and the Ph.D. degree from the University of Michigan, Ann Arbor, MI, USA, in 2018.

He was an IC Design Engineer with the Image Sensor Development Team of Samsung Electronics, Yong-In, Republic of Korea, from 2009 to 2013, where he developed the signal readout chains for high-end mobile/DSLR image sensors, including the sensor for the Samsung NX1 camera. From 2015 to 2018, during his Ph.D., he interned with Apple Inc., Cupertino, CA, USA, and was a design consultant with Sony, San Jose, CA, USA. From 2018 to 2021, he was a Post-Doctoral Fellow with the University of Michigan, where he worked on mm-scale ultra-low-power imaging systems. Since 2021, he has been a Tenure-Track Assistant Professor with the Swiss Federal Institute of Technology, Lausanne (EPFL), Switzerland, leading the Mixed-Signal Integrated Circuits Laboratory (MSIC-Lab). He has authored/co-authored 25 IEEE articles and holds 17 U.S. patents. His research interests are charge-domain dynamic A/MS circuits, which apply to sensor interfaces, data converters, energy converters, high-speed links/timing generators, and millimeter-scale integrated systems to improve resource efficiency and add unique features.



**DAVID BLAAUW** (Fellow, IEEE) received the B.S. degree in physics and computer science from Duke University, Durham, NC, USA, in 1986, and the Ph.D. degree in computer science from the University of Illinois Urbana-Champaign, Champaign, IL, USA, in 1991.

Until August 2001, he worked for Motorola, Inc., Austin, TX, USA, where he was the manager of the High Performance Design Technology group and won the Motorola Innovation award.

Since August 2001, he has been on the faculty of the University of Michigan, where he is the Kensall D. Wise Collegiate Professor of EECS. He has published over 600 papers, has received numerous best paper awards, and holds 65 patents. Most recently, he has pursued research in cognitive computing using analog, in-memory neural networks for edge devices and genomics for precision health. He has researched ultra-low-power wireless sensors using subthreshold operation and low-power analog circuit techniques for millimeter systems. His research was recognized by MIT Technology Review as "one of the year's most significant innovations." His research group introduced so-called near-threshold computing, which has become a common concept in semiconductor design.

Dr. Blaauw was a General Chair of the IEEE International Symposium on Low Power and a member of the IEEE International Solid-State Circuits Conference (ISSCC) analog program subcommittee. He received the 2016 SIA-SRC faculty award for lifetime research contributions to the U.S. semiconductor industry.