

Received 27 November 2023, accepted 13 December 2023, date of publication 18 December 2023, date of current version 26 December 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3344197



# SIMPLY+: A Reliable STT-MRAM-Based Smart Material Implication Architecture for In-Memory Computing

TATIANA MOPOSITA<sup>1,2,3</sup>, (Member, IEEE), ESTEBAN GARZÓN<sup>1</sup>, (Member, IEEE), RAFFAELE DE ROSE<sup>1</sup>, (Senior Member, IEEE), FELICE CRUPI<sup>1</sup>, (Senior Member, IEEE), ANDREI VLADIMIRESCU<sup>4,5</sup>, (Life Fellow, IEEE), LIONEL TROJMAN<sup>2</sup>, (Senior Member, IEEE), AND MARCO LANUZZA<sup>1</sup>, (Senior Member, IEEE)

<sup>1</sup>Department of Computer Engineering, Modeling, Electronics and Systems, University of Calabria (UNICAL), 87036 Rende, Italy

Corresponding author: Tatiana Moposita (tatiana.moposita@dimes.unical.it)

The work of Esteban Garzón was supported in part by the Italian Ministry of University and Research (MUR) under Project PRIN 2020LWPKH7, and in part by the Italian MUR through the call "Horizon Europe 2021-2027 Programme" under Grant H25F21001420001. The work of Raffaele De Rose was supported by the Italian MUR under Project PRIN 2022ZNAMMH. The work of Marco Lanuzza was supported by the Italian MUR under Project PRIN 2020LWPKH7.

**ABSTRACT** This paper introduces SIMPLY+, an advanced Spin-Transfer Torque Magnetic Random-Access Memory (STT-MRAM)-based Logic-in-Memory (LIM) architecture that evolves from the previously proposed smart material implication (SIMPLY) logic scheme. More specifically, the latter is augmented with additional circuitry to improve the reliability of preliminary read operations. In this study, the proposed architecture is benchmarked against its conventional counterpart. The results show a significant reliability improvement: the nominal read margin (RM) increases by a factor of $\sim 3.4\times$ and, accordingly, the bit error rate (BER) drops by more than four orders of magnitude. These improvements come at minimal cost in terms of circuit area and complexity compared to the conventional SIMPLY design. Overall, this research establishes SIMPLY+ as a promising solution for the design of reliable and energy-efficient in-memory computing architectures.

**INDEX TERMS** Material Implication, SIMPLY, MTJ, STT-MRAM, in-memory computing.

## I. INTRODUCTION

Material implication (IMPLY) logic shows great potential as a prospective solution for defining Logic-in-Memory (LIM) architectures designed to execute fast and energy-efficient computations directly within memory units. This approach effectively mitigates the von Neumann bottleneck of conventional computing platforms, specifically the need to read/write data to/from off-chip memories [1], [2], [3], [4], [5], [6], [7], [8], [9]. However, the conventional IMPLY logic scheme faces significant challenges, including the degradation of the logic states and the limited design flexibility associated with operating voltages [10]. To overcome these drawbacks, an alternative smart IMPLY (SIMPLY) LIM scheme was recently proposed [10], [11], [12], [13]. The SIMPLY solution integrates an output comparator into the classical IMPLY scheme. This comparator is exploited to execute a preliminary 2-bit read operation, whose outcome is then used to execute the SET operation selectively, i.e., only when both inputs are at the low logic level [10].

The associate editor coordinating the review of this manuscript and approving it for publication was Jagadheswaran Rajendran.

<sup>2</sup>Laboratoire d'Informatique Signal Image Telecommunications et Electronique, Institut Supérieur d'Électronique de Paris (ISEP), 92130 Paris, France

<sup>3</sup>Faculté des Sciences et Ingénierie, Sorbonne Université, 75006 Paris, France

<sup>4</sup>Department of Electrical Engineering and Computer Sciences (EECS), University of California at Berkeley, Berkeley, CA 94720, USA

<sup>5</sup>Department of Electrical Engineering and Computer Sciences (EECS), Delft University of Technology (TU Delft), 2628 CD Delft, The Netherlands



Such an approach effectively alleviates the issue of logic state degradation, while also reducing energy consumption with minimal impact on circuit area and complexity when compared to the conventional IMPLY scheme [10].

Although the SIMPLY logic was originally proposed and validated for Resistive Random Access Memory (RRAM) devices [10], [11], [12], [13], recent investigations have extended its applicability to Spin-Transfer Torque Magnetic Random-Access Memory (STT-MRAM) devices [14], [15]. The latter represents an appealing option for LIM applications owing to faster read/write operations, low standby power consumption, and high endurance [16], [17], [18], [19], [20], [21], [22], [23]. Spin-orbit torque (SOT) MRAM technology has also been considered as a potential alternative for the development of Compute-in-Memory (CIM) architectures [24], [25], [26]. However, despite their potential for higher write speed, SOT devices require switching control that can increase power consumption and add circuit complexity [27].

According to findings in [14], the STT-MRAM-based SIMPLY scheme exhibits the advantages of improved energy-efficiency and reliability as compared to its IMPLY counterpart. However, as highlighted in [15], the reliability of STT-MRAM-based SIMPLY logic is significantly impacted by the preliminary read operation. This influence is primarily due to the relatively narrow read window inherent in STT-MRAM devices, constrained by their tunnel magnetoresistance (TMR) ratio [28], [29], [30]. Consequently, this limitation results in suboptimal read margins, which lead to a higher level of design complexity in the sensing circuitry [15].

In response to the above issue, this work introduces SIMPLY+, i.e., an advanced STT-MRAM-based SIMPLY logic scheme that allows enhanced read operation reliability compared to its conventional counterpart. This is achieved by using a common source (CS) stage with a diode-connected active load, with the aim of enlarging the read margins and thereby ensuring lower error rates. The SIMPLY+ scheme was evaluated by means of extensive Monte Carlo (MC) simulations and benchmarked against the conventional SIMPLY logic. In addition, we propose a further improved design of the SIMPLY+ cell, where the sense amplifier-based output comparator is replaced by a simpler two-stage inverter. The latter is designed to handle both 1-bit and 2-bit read operations within the same scheme, thus also enabling the sFALSE operation, which together with the SIMPLY operation forms a computationally complete logic basis [11].

The main contributions of this work can be summarized as follows:

- We propose SIMPLY+, a novel STT-MRAM-based LIM architecture that evolves from the previously proposed SIMPLY logic scheme.
- SIMPLY+ features dedicated low-complexity circuitry designed to improve the reliability of read operations in terms of read margin and bit error rate.
- We suggest an improved SIMPLY+ logic scheme by replacing the standard sense amplifier-based output comparator with a lower complexity two-stage buffer to considerably reduce the read energy penalty.

TABLE 1. STT-MTJ parameters (300 K).

| Parameter | Description                                   | Value     | Units                  |
|-----------|-----------------------------------------------|-----------|------------------------|
| $d$       | Diameter                                      | 30        | nm                     |
| $t_{FL}$  | FL thickness (variability)                    | 1.15      | nm                     |
| $t_{OX}$  | Oxide thickness (variability)                 | 0.85      | nm                     |
| $RA$      | Resistance-area product                       | 10        | $\Omega \cdot \mu m^2$ |
| $\eta$    | Spin-polarization factor                      | 0.66      | -                      |
| $V_H$     | Bias voltage for $TMR = 0.5 \cdot TMR(0)$     | 0.5       | V                      |
| $M_S$     | Saturation magnetization                      | 1.58      | T                      |
| $\alpha$  | Gilbert damping factor                        | 0.03      | -                      |
| $K_i$     | Interfacial perpendicular anisotropy constant | 1.3       | $\mathrm{mJ/m^2}$      |
| $\Delta$  | Thermal stability factor                      | $\sim$ 44 | -                      |

The rest of the paper is organized as follows. Section II details the scheme and operation of the SIMPLY+ logic, while also reporting and discussing simulation results as compared to its conventional counterpart. Section III describes the improved design of the SIMPLY+ cell along with the corresponding simulation results. Section IV compares the different SIMPLY-based solutions in terms of the main figures-of-merit. Finally, the main conclusions of the work are drawn in Section V.

## II. LOGIC-IN-MEMORY ARCHITECTURES: SIMPLY VS SIMPLY+

Fig. 1 shows the simplified top-level diagram of the reference STT-MRAM SIMPLY-based architecture [12]. A control logic block, equipped with analog tri-state buffers (TSBs), delivers the appropriate voltages to the STT-MRAM devices, while the sensing circuit implements the topology needed to perform the SIMPLY operation. The architecture also employs transistors to enable specific array columns. In the following, we focus on the design and operation of the STT-MRAM-based SIMPLY and SIMPLY+ schemes, which essentially differ in the implementation of the sensing scheme, and we report and discuss comparative simulation results.

In our work, the electrical behavior of the STT-MTJ devices is taken into account by an analytical macrospin-based Verilog-A compact model [31], whose main physical parameters for the considered 30 nm STT-MTJ device are summarized in Table 1 referring to room temperature (300 K) [32], [33], [34]. Note that the physical and electrical characteristics reported in Table 1 are consistent with the state-of-the-art parameters for nano-scaled MTJs [35], [36].

FIGURE 1. Top-level STT-MRAM SIMPLY architecture.
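As a quick sanity check on Table 1, the parallel-state resistance implied by the diameter and the RA product can be estimated as sketched below. This is only an illustration under stated assumptions: a circular cross-section and $R_P = RA / \text{area}$. The bias roll-off $TMR(V) = TMR(0) / (1 + V^2/V_H^2)$ is a commonly used form that is consistent with the definition of $V_H$ in Table 1, but it is an assumption here and not necessarily the exact expression used in the compact model [31]. The function names are hypothetical.

```python
import math

# Values taken from Table 1
DIAMETER_NM = 30.0   # MTJ diameter d (nm)
RA_OHM_UM2 = 10.0    # resistance-area product (Ohm * um^2)
V_H = 0.5            # bias voltage at which TMR halves (V)


def mtj_area_um2(diameter_nm: float) -> float:
    """Area in um^2, assuming a circular cross-section (assumption)."""
    radius_um = diameter_nm * 1e-3 / 2.0
    return math.pi * radius_um ** 2


def parallel_resistance_ohm(ra_ohm_um2: float, diameter_nm: float) -> float:
    """Low-bias parallel-state resistance, assuming R_P = RA / area."""
    return ra_ohm_um2 / mtj_area_um2(diameter_nm)


def tmr_at_bias(tmr0: float, v: float, v_h: float = V_H) -> float:
    """Assumed bias dependence: TMR halves when |v| equals v_h."""
    return tmr0 / (1.0 + (v / v_h) ** 2)


r_p = parallel_resistance_ohm(RA_OHM_UM2, DIAMETER_NM)
print(f"R_P ~ {r_p / 1e3:.1f} kOhm")
```

Under these assumptions the parallel resistance comes out at roughly 14 kΩ, and a read voltage equal to $V_H$ would halve the available TMR, which illustrates why the read window discussed in the following is narrow.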

#### A. SIMPLY AND SIMPLY+ LOGIC SCHEMES

Fig. 2(a) sketches the conventional SIMPLY scheme [12], which was introduced to overcome the shortcomings of the traditional IMPLY design [10], [11], [12], [13]. Indeed, both IMPLY and FALSE operations are performed more efficiently within this architecture as SIMPLY and sFALSE operations, respectively [11]. The IMPLY or SIMPLY operation involves two inputs and one output, according to the truth table reported in Fig. 2(b). The inputs are represented by the initially stored states of the two MTJs P and Q in Fig. 2(a), while the output is given by the final state of the MTJ Q after the execution. On the other hand, the FALSE or sFALSE operation is a one-input one-output operation that always results in a logic '0' stored in the considered MTJ. Accordingly, as shown in Fig. 2(a), the core of the conventional SIMPLY logic scheme consists of two MTJs (P and Q) and an NMOS tail transistor (Mn) driven by the Vbias voltage, here used to implement the load resistor R<sub>G</sub> [15]. Moreover, the SIMPLY scheme includes an output comparator and a control logic block with analog TSBs, shown at the top of Fig. 2(a), to apply the appropriate voltages to the MTJs and to manage the execution of the operations.

Within this scheme, ensuring the proper execution of the P IMPLY Q operation involves maintaining the state of MTJ P, regardless of the input combination. At the same time, the MTJ Q should only switch from '0' to '1' when P = Q = '0'. A preliminary read operation is performed with the aim of distinguishing the input combination P = Q = '0' from all other possibilities. To accomplish this, a proper voltage pulse with an amplitude V<sub>READ</sub> and width t<sub>READ</sub> is applied to the top electrode of both MTJ devices through the control logic block. Then the voltage V<sub>G</sub> across the transistor Mn is compared to an appropriate reference voltage $V_{REF}$ using the output comparator. In this way, the P = Q = '0' input combination is effectively detected, thus allowing the subsequent SET operation on MTJ Q to take place only in this specific case. This is achieved by applying an appropriate voltage pulse with amplitude V<sub>SET</sub> and width t<sub>SET</sub> on Q, while keeping the P driver in a high impedance (HI-Z) state, as shown in Fig. 2(b). On the contrary, for the other input combinations the control logic forces both MTJ drivers into a HI-Z state, thus enabling significant energy savings [10], [12], [15].

Similarly to the SIMPLY operation, the sFALSE operation also requires a preliminary read operation, but on a single device. Thereby, the subsequent RESET operation requiring a negative voltage pulse is performed only when the detected MTJ state is '1' [11].
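The read-then-conditionally-write behavior described above can be summarized with a purely logical sketch. It is an idealized model under the assumption of an error-free preliminary read; no voltages, timing, or device physics are represented, and the function names are hypothetical. The `nand` helper uses the well-known IMPLY/FALSE sequence with one work device to illustrate why the two operations form a complete basis; this sequence is not spelled out in the text above.

```python
from typing import Tuple


def preliminary_read_2bit(p: int, q: int) -> bool:
    """Idealized comparator outcome: True only for P = Q = '0'."""
    return p == 0 and q == 0


def simply(p: int, q: int) -> Tuple[int, int]:
    """P SIMPLY Q: P is preserved, Q ends up as (NOT P) OR Q."""
    if preliminary_read_2bit(p, q):
        q = 1  # SET pulse applied to Q while the P driver stays in HI-Z
    # For all other inputs both drivers stay in HI-Z: nothing is written.
    return p, q


def sfalse(q: int) -> int:
    """sFALSE Q: RESET only when the 1-bit read detects Q = '1'."""
    if q == 1:
        q = 0  # RESET pulse applied to Q
    return q


def nand(p: int, q: int, s: int = 1) -> int:
    """NAND built from sFALSE and SIMPLY using a work device s."""
    s = sfalse(s)        # s = 0
    _, s = simply(q, s)  # s = NOT q
    _, s = simply(p, s)  # s = (NOT p) OR (NOT q)
    return s


for p in (0, 1):
    for q in (0, 1):
        print(p, q, "->", simply(p, q)[1], "| NAND:", nand(p, q))
```

Only one of the four input combinations triggers a write pulse in `simply`, which is the source of the energy saving mentioned above.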

Fig. 2(c) shows the SIMPLY+ scheme proposed in this work, purposely designed to enhance the reliability of the preliminary read operation. The latter is the most critical operation for the STT-MRAM-based SIMPLY framework, owing to the relatively narrow read memory window offered by MTJ devices [15]. Our approach involves the use of a CS amplifier stage with a diode-connected load (M1-M2). The CS input is represented by the voltage $V_G$ at the drain terminal of the transistor Mn, while the output drives the gate terminal of the same transistor. This significantly enlarges the read margin between the $V_G$ voltage developed for the P = Q = '0' input combination and those developed for the other combinations, thus enabling more reliable operation, as demonstrated in the subsequent subsection.

FIGURE 2. (a) Conventional STT-MRAM-based SIMPLY basic cell with a tail transistor (instead of a resistor) and the output comparator. Detail of the tri-state buffer (TSB) topology at the top. (b) IMPLY truth table and time diagram of applied voltage pulses within the SIMPLY scheme when the sensing circuitry detects the input condition P=Q='0' and in all other cases [15]. (c) STT-MRAM-based SIMPLY+ scheme, including the tail transistor, a common source stage with diode-connected load, and the output comparator.

#### B. SIMULATION RESULTS AND DISCUSSION

In the following, we will present and discuss the simulation results referred to the preliminary two-device read operation in the SIMPLY+ scheme, while also benchmarking it against the conventional SIMPLY counterpart. All the reported data is based on electrical simulations performed at room temperature (300 K) by using the Cadence Virtuoso environment. Transistors' modeling refers to a commercial 65 nm, 1.2 V CMOS process, while our Verilog-A compact model [31] is used for the 30 nm STT-MTJ devices.

Fig. 3 shows the timing diagram for key signals involved in the preliminary read operation when using both the conventional SIMPLY and the enhanced SIMPLY+ approaches. The data refers to a nominal simulation with the transistor Mn size of Wn/Ln = 1 µm/3 µm and $V_{READ} = 0.5 \text{ V}$ in both schemes. In the SIMPLY scheme, Vbias is set to 1 V. From Fig. 3, when the EN and SN signals are switched ON in the TSBs, $V_{READ}$ is set to 0.5 V. This results in a corresponding $V_G$ voltage across transistor Mn that depends on the input combination, i.e., on the states of the MTJs P and Q. For the conventional SIMPLY architecture, the $V_G$ voltages obtained for the P=Q='0' and $P\neq Q$ cases yield a nominal RM of just 55 mV. On the other hand, owing to the additional CS stage, the SIMPLY+ scheme exhibits a nominal RM of 209 mV at 10 ns and 183 mV at 20 ns. This corresponds to an improvement of $3.8\times$ and $3.3\times$, respectively, when compared to its conventional counterpart.

The reliability of the preliminary read operation in both schemes was also investigated while considering the effect of process variations on transistor and MTJ devices. This analysis was performed by means of extensive MC simulations. Transistor process variations were included using statistical models provided by the commercial PDK. For the MTJs, we assumed Gaussian distributed variations in the adopted Verilog-A compact model by setting the variability (defined as the ratio of the standard deviation $(\sigma)$ to the mean value $(\mu)$) to 1% and 5% for the oxide thickness $(t_{OX})$ and the cross-section area, respectively [14], [15], [31].

FIGURE 3. Timing diagram of the signals involved in the SIMPLY and SIMPLY+ schemes during the preliminary read operation as obtained from nominal simulations at 300 K considering Wn/Ln = 1 µm/3 µm size for the transistor Mn, $V_{READ}=0.5$ V and Vbias = 1 V.

Fig. 4(a)-(d) and Fig. 5(a)-(b) show the MC simulation results obtained for the SIMPLY and SIMPLY+ architectures. More specifically, these figures report the statistical distributions of the voltage  $V_G$  for the different input combinations, while highlighting the corresponding estimated values for the RM, bit error rate (BER), and  $V_{REF}$ . The RM is evaluated both at the nominal corner (as given by the difference between the mean  $V_G$  values associated with the P=Q= '0' and  $P\neq Q$  cases) and at the  $3\sigma$  corner (RM $_{3\sigma}$ ), with  $\sigma$  being the standard deviation of  $V_G$  distributions. The BER refers to the failure probability in distinguishing the input combination P=Q= '0' from the other combinations during the preliminary read operation [15] and it is estimated by properly setting the  $V_{REF}$  used as input of the output comparator.

In particular, the appropriate  $V_{REF}$  is determined by the voltage value that results in the same BER for the cases P=Q='0' and  $P\neq Q$  [37]. It is worth pointing out that in our analysis, the BER was evaluated by assuming an ideally stable  $V_{REF}$  and an ideal comparator with zero offset.
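The metrics defined above can be approximated analytically if the two $V_G$ distributions are assumed to be Gaussian. The sketch below is such an approximation, not the procedure used to produce the figures: it takes the $3\sigma$ read margin as $(\mu_{high} - 3\sigma_{high}) - (\mu_{low} + 3\sigma_{low})$, places $V_{REF}$ where the two tail probabilities are equal, and uses the mean and standard deviation reported in Table 2 for the conventional SIMPLY scheme as example inputs. All function and field names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class ReadStats:
    rm_nom: float     # nominal read margin (mV)
    rm_3sigma: float  # read margin at the 3-sigma corner (mV)
    v_ref: float      # equal-BER reference voltage (mV)
    ber: float        # bit error rate at v_ref


def gaussian_tail(z: float) -> float:
    """Probability that a standard normal variable exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def read_stats(mu_low: float, sigma_low: float,
               mu_high: float, sigma_high: float) -> ReadStats:
    """Gaussian estimate; 'low' is P=Q='0', 'high' is the P!=Q case."""
    rm_nom = mu_high - mu_low
    rm_3sigma = rm_nom - 3.0 * (sigma_low + sigma_high)
    # Equal BER on both sides: (v_ref - mu_low) / sigma_low
    #                       == (mu_high - v_ref) / sigma_high == z
    z = rm_nom / (sigma_low + sigma_high)
    v_ref = mu_low + z * sigma_low
    return ReadStats(rm_nom, rm_3sigma, v_ref, gaussian_tail(z))


# Conventional SIMPLY, t_READ = 10 ns (values in mV from Table 2)
stats = read_stats(224.0, 6.24, 279.0, 7.02)
print(stats)
```

With these inputs the estimate gives a $3\sigma$ read margin of about 15 mV, a reference voltage close to 250 mV, and a BER on the order of $10^{-5}$, in line with the simulated values reported for this scheme.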

Fig. 4(a) and (c) show the MC simulation results achieved within the conventional SIMPLY scheme for $t_{READ}=10\,\mathrm{ns}$. From Fig. 4(a), the obtained nominal RM is equal to 55 mV, whereas the corresponding $RM_{3\sigma}$ is about 15 mV. From Fig. 4(c), the $V_{REF}$ to be used in the SIMPLY scheme is about 250 mV, which leads to a BER of $1.68 \times 10^{-5}$ for the P = Q = '0' and $P \neq Q$ input combinations, i.e., the worst-case BER.

TABLE 2. Comparative results SIMPLY vs SIMPLY+ for the preliminary read operation under process variations.

| Metric                              | SIMPLY ($t_{READ}$ = 10 ns) | SIMPLY+ ($t_{READ}$ = 10 ns) | SIMPLY+ ($t_{READ}$ = 20 ns) |
|-------------------------------------|-----------------------------|------------------------------|------------------------------|
| $\mu$ of $V_G$, P=Q='0' (mV)        | 224.0                       | 259.7                        | 289.5                        |
| $\mu$ of $V_G$, P≠Q (mV)            | 279.0                       | 462.2                        | 470.6                        |
| $\sigma$ of $V_G$, P=Q='0' (mV)     | 6.24                        | 16.83                        | 14.69                        |
| $\sigma$ of $V_G$, P≠Q (mV)         | 7.02                        | 16.20                        | 5.14                         |
| $RM_{nom}$ (mV)                     | 55.0                        | 202.5                        | 181.1                        |
| $RM_{3\sigma}$ (mV)                 | 15.2                        | 103.4                        | 121.6                        |
| Worst-case BER                      | $1.7 \times 10^{-5}$        | $4.4 \times 10^{-10}$        | $3.3 \times 10^{-20}$        |
| Energy (pJ)                         | 1.73                        | 2.60                         | 5.19                         |

Fig. 4(b) and (d) report the statistical results of the SIMPLY+ scheme for $t_{READ}=10\,\text{ns}$. The nominal RM and $RM_{3\sigma}$ values are respectively 202.5 mV and 103.4 mV, i.e., $3.7\times$ and $6.8\times$ larger than in the conventional SIMPLY scheme. The appropriate $V_{REF}$ is 362.9 mV, which corresponds to a worst-case BER of $4.37\times10^{-10}$, i.e., more than four orders of magnitude lower than that of its conventional counterpart.

The reliability of the preliminary read operation in the SIMPLY+ scheme can be further improved by lengthening the read voltage pulse. This can be observed in Fig. 5(a)-(b), which shows the statistical results of the SIMPLY+ scheme for $t_{READ} = 20 \text{ ns}$. Indeed, although the nominal RM decreases to 181.1 mV from the 202.5 mV obtained at $t_{READ} = 10 \text{ ns}$, increasing $t_{READ}$ to 20 ns leads to a $RM_{3\sigma}$ of about 121.6 mV, i.e., $8 \times$ and $1.2 \times$ larger than the conventional SIMPLY scheme and the SIMPLY+ scheme at $t_{READ} = 10 \text{ ns}$, respectively, owing to the reduced standard deviation of the V<sub>G</sub> distributions. This results in a worst-case BER of $3.32 \times 10^{-20}$ at $V_{REF} = 423.7 \,\text{mV}$, which corresponds to an improvement by more than fourteen and ten orders of magnitude as compared to the conventional SIMPLY scheme and the SIMPLY+ scheme for $t_{READ} = 10 \text{ ns}$, respectively. Such an improvement comes at the cost of higher energy consumption. This can be observed in Table 2, which summarizes the comparative results obtained within the SIMPLY and SIMPLY+ schemes for the preliminary read operation under process variations.
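The improvement factors quoted above follow directly from the reported values. The short script below simply redoes that arithmetic (ratios of read margins and the base-10 logarithm of BER ratios); it adds no new data, and the variable and function names are hypothetical.

```python
import math
from typing import Tuple

# (RM_nom [mV], RM_3sigma [mV], worst-case BER) as quoted in the text
SIMPLY_10NS = (55.0, 15.2, 1.68e-5)
SIMPLY_PLUS_10NS = (202.5, 103.4, 4.37e-10)
SIMPLY_PLUS_20NS = (181.1, 121.6, 3.32e-20)


def compare(new: Tuple[float, float, float],
            ref: Tuple[float, float, float]) -> Tuple[float, float, float]:
    """Return (RM_nom gain, RM_3sigma gain, BER improvement in decades)."""
    rm_gain = new[0] / ref[0]
    rm3_gain = new[1] / ref[1]
    ber_decades = math.log10(ref[2] / new[2])
    return rm_gain, rm3_gain, ber_decades


pairs = [
    ("SIMPLY+ 10 ns vs SIMPLY", SIMPLY_PLUS_10NS, SIMPLY_10NS),
    ("SIMPLY+ 20 ns vs SIMPLY", SIMPLY_PLUS_20NS, SIMPLY_10NS),
    ("SIMPLY+ 20 ns vs SIMPLY+ 10 ns", SIMPLY_PLUS_20NS, SIMPLY_PLUS_10NS),
]
for label, new, ref in pairs:
    rm_gain, rm3_gain, decades = compare(new, ref)
    print(f"{label}: RM x{rm_gain:.2f}, RM3s x{rm3_gain:.2f}, "
          f"BER better by {decades:.1f} decades")
```

Note that the nominal RM ratio for the 20 ns case relative to the 10 ns case is below one, while the $3\sigma$ ratio is above one, which reflects the narrower distributions at the longer pulse width.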

Our analysis was also extended with the aim of evaluating the effect of the tail transistor (Mn) sizing (in 1 µm steps) on SIMPLY+ performance during the preliminary read operation. In this regard, Fig. 6(a)-(f) show the color maps of the nominal RM, the RM at the $3\sigma$ corner, the worst-case read disturbance rate (RDR), i.e., referred to the case P = Q = '0' [15], the $V_{REF}$, the worst-case BER, and the worst-case overall read error rate (RER), both referred again to the case P = Q = '0' [15]. All this data was obtained from statistical simulations at 300 K, $V_{READ} = 0.5 \text{ V}$ and $t_{READ} = 10 \text{ ns}$ while varying the size (Ln and Wn) of Mn, i.e., its strength.

FIGURE 4. Simulation results of the conventional SIMPLY and SIMPLY+ scheme for the preliminary read operation under process variations at 300 K considering Wn/Ln = 1 µm/3 µm size for the tail transistor Mn, V<sub>READ</sub> = 0.5 V, Vbias = 1 V and t<sub>READ</sub> = 10 ns. (a-b) V<sub>G</sub> statistical distributions for the different input combinations and (c-d) estimation of the bit error rate (BER) and reference voltage (V<sub>REF</sub>).

FIGURE 5. Simulation results of the SIMPLY+ scheme for the preliminary read operation under process variations at 300 K considering Wn/Ln = 1 µm/3 µm size for the tail transistor Mn, V<sub>READ</sub> = 0.5 V and t<sub>READ</sub> = 20 ns. (a) V<sub>G</sub> statistical distributions for the different input combinations and (b) estimation of the bit error rate (BER) and reference voltage (V<sub>REF</sub>).

In particular, the RDR is an important metric to assess the reliability of the read operation performed within the SIMPLY/SIMPLY+ framework, as it refers to the probability of unintentionally switching the stored data during this operation [15], [17]. Accordingly, for a given Mn size, the overall RER is given by the combination of the RDR and BER. From Fig. 6(a)-(b), we can observe that the Mn sizing strongly affects the RM. More specifically, we can identify a relatively small design space for sizes around Wn/Ln = 1 µm/3 µm leading to nominal RM and $RM_{3\sigma}$ values in the neighborhood of 200 mV and 100 mV, respectively. More precisely, we obtain a nominal RM of 202.5 mV and a $RM_{3\sigma}$ of 103.4 mV according to Fig. 4(b). From Fig. 6(c), the worst-case RDR increases when increasing the strength of the transistor Mn, i.e., for larger Wn or smaller Ln, owing to the corresponding increase in the current flowing through the MTJ devices. In particular, for Wn/Ln = 1 µm/3 µm we obtain a worst-case RDR equal to $1.05 \times 10^{-9}$. An opposite trend as compared to that of the RDR can be seen for the $V_{REF}$ in Fig. 6(d), where its value tends to increase when decreasing the transistor strength. From Fig. 6(e), the worst-case BER expectedly shows a similar trend to the RM, hence with an optimal design space for Mn size around Wn/Ln = 1 µm/3 µm leading to BER values in the order of $10^{-10}$ as in Fig. 4(d). The discussed trends of the RDR and BER thus result in the map of the RER shown in Fig. 6(f), where its optimal value of $1.05 \times 10^{-9}$ is achieved at Wn/Ln = 1 µm/3 µm size, i.e., that used in the above analysis.





FIGURE 6. Simulation results of the SIMPLY+ scheme for the preliminary read operation under process variations at 300 K,  $V_{READ} = 0.5 \text{ V}$  and  $t_{READ} = 10 \text{ ns}$  while varying the size (Ln and Wn) of the tail transistor Mn: (a) nominal read margin (RM), (b) RM at the  $3\sigma$  corner, (c) worst-case read disturbance rate (RDR) referred to the case P = Q = '0', (d) reference voltage ( $V_{REF}$ ), (e) worst-case bit error rate (BER) referred to the case P = Q = '0', and (f) worst-case overall read error rate (RER) again referred to the case P = Q = '0'.

## III. IMPROVED SIMPLY+ DESIGN

In addition to the introduction of the CS stage as discussed in the previous section, we propose a further modification of the conventional SIMPLY scheme, which consists of replacing the conventional sense amplifier-based output comparator [11], [14], [15] with a lower complexity two-stage buffer, thus resulting in an enhanced SIMPLY+ scheme, as depicted in Fig. 7(a). The two-stage inverter-based output block acts as a sense amplifier to discriminate the  $V_G$  values associated with the different input combinations, using the logic threshold as the reference voltage. To this aim, skewed inverters are employed to strengthen the output data  $(V_{OUT})$  and to improve the capability to discriminate the input combinations.

The transistors highlighted in red in Fig. 7 are used to enable both 1-bit and 2-bit read operations within the same circuit block, needed to execute sFALSE and SIMPLY operations, respectively. This is achieved by properly setting the $S_{READ}$ signal, which allows adjusting the logic threshold of the first inverter in the output block on the basis of the operation to be performed. More specifically, as reported in Fig. 7, $S_{READ}$ = '0' ('1') enables the 1-bit (2-bit) read operation for the sFALSE (SIMPLY) execution. Fig. 8(a) shows the SIMPLY+ scheme during the sFALSE operation, i.e., when $S_{READ}$ = '0', which involves only one MTJ. Fig. 8(b) shows the truth table and time diagram of the voltage pulse applied to Q ($V_Q$) to execute the "sFALSE Q" operation.





FIGURE 7. Improved SIMPLY+ scheme, including a common source (CS) stage with diode-connected load and a two-stage inverter as output block.





FIGURE 8. (a) SIMPLY+ scheme during sFALSE operation. (b) sFALSE truth table and time diagram of the voltage pulse applied to Q ($V_Q$) to execute the "sFALSE Q" operation in the SIMPLY architecture when the comparator detects the Q = '1' case (red dashed line) and the Q = '0' case (blue).

The proposed approach is demonstrated in Fig. 9, which shows the timing diagram of the $V_{READ}$, $V_G$ and $V_{OUT}$ signals when performing the 2-bit and 1-bit read operations within the scheme of Fig. 7. Here, data refers to nominal simulations at 300 K and $V_{READ}=0.5\,\text{V}$. From this figure, we can observe that the proper tuning of the logic threshold through the $S_{READ}$ signal allows distinguishing the input combination P=Q='0' (i.e., $V_{OUT}$='0') from the others (i.e., $V_{OUT}$='1') in the 2-bit read operation, as well as the case Q='0' (i.e., $V_{OUT}$='0') from the case Q='1' (i.e., $V_{OUT}$='1') in the 1-bit read operation. The improved SIMPLY+ presents an average energy consumption of about 2.07 pJ at $t_{READ}=10\,\text{ns}$. This is $\sim 20\%$ less as compared to the SIMPLY+ scheme of Fig. 2(c), thus demonstrating the effectiveness of replacing the conventional sense-amplifier-based comparator with the two-stage buffer.

FIGURE 9. Timing diagram of $V_{READ}$, $V_{G}$ and $V_{OUT}$ signals involved in the preliminary 1-bit and 2-bit read operations to be performed respectively for sFALSE and SIMPLY operations within the improved SIMPLY+ scheme of Fig. 7, as obtained from nominal simulations at 300 K and $V_{READ} = 0.5 \, \text{V}$.

TABLE 3. Comparison of STT-MRAM-based SIMPLY logic schemes.

| Type               | Conventional SIMPLY  | SIMPLY+               | Improved SIMPLY+      |
|--------------------|----------------------|-----------------------|-----------------------|
| Memory Type        | MRAM                 | MRAM                  | MRAM                  |
| Load Type          | Transistor           | Transistor            | Transistor            |
| CMOS Tech          | 65 nm                | 65 nm                 | 65 nm                 |
| VDD                | 1.2 V                | 1.2 V                 | 1.2 V                 |
| $V_{READ}$         | 0.5 V                | 0.5 V                 | 0.5 V                 |
| $t_{READ}$         | 10 ns                | 10 ns                 | 10 ns                 |
| Reference Voltages | Yes                  | Yes                   | No                    |
| RM                 | 55.0 mV              | 202.5 mV              | 202.5 mV              |
| BER                | $1.7 \times 10^{-5}$ | $4.4 \times 10^{-10}$ | $4.4 \times 10^{-10}$ |
| Read Energy (avg)  | 1.73 pJ              | 2.60 pJ               | 2.07 pJ               |



## IV. COMPARISON RESULTS

Table 3 compares the different STT-MRAM SIMPLY-based architectures analyzed in this work. All SIMPLY schemes were evaluated under the same simulation conditions, i.e., VDD of 1.2 V, V<sub>READ</sub> of 0.5 V, t<sub>READ</sub> of 10 ns, 65 nm process, 1000 Monte Carlo samples. The designs present two types of sensing circuitry. The conventional SIMPLY and SIMPLY+ use the 10-transistor sense amplifier-based output comparator reported in [11], and need reference voltages (whose energy overhead is not considered here), which lead to an increase in terms of overall area and design complexity. For the improved SIMPLY+ alternative, the self-referenced CS stage allows the use of a lower-complexity 7-transistor two-stage output buffer. This saves read energy and avoids the need for additional biasing circuits, which potentially increase energy and may represent a source of variability for the overall system [37].

In terms of RM and BER, the proposed SIMPLY+ solutions are by far the best choice for a reliable design (refer to Section III). Such benefit is obtained at the expense of about 20% (improved SIMPLY+ vs SIMPLY) increase in terms of read energy consumption.
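The energy trade-off can be made explicit with the average read energies listed in Table 3. The snippet below only recomputes relative differences from those three values; the dictionary keys and the helper name are hypothetical.

```python
# Average read energy in pJ, taken from Table 3 (t_READ = 10 ns)
READ_ENERGY_PJ = {
    "SIMPLY": 1.73,
    "SIMPLY+": 2.60,
    "improved SIMPLY+": 2.07,
}


def relative_change_percent(new: float, ref: float) -> float:
    """Signed change of `new` with respect to `ref`, in percent."""
    return 100.0 * (new - ref) / ref


overhead_improved = relative_change_percent(
    READ_ENERGY_PJ["improved SIMPLY+"], READ_ENERGY_PJ["SIMPLY"])
overhead_plus = relative_change_percent(
    READ_ENERGY_PJ["SIMPLY+"], READ_ENERGY_PJ["SIMPLY"])
saving_buffer = relative_change_percent(
    READ_ENERGY_PJ["improved SIMPLY+"], READ_ENERGY_PJ["SIMPLY+"])

print(f"improved SIMPLY+ vs SIMPLY:  {overhead_improved:+.1f}%")
print(f"SIMPLY+ vs SIMPLY:           {overhead_plus:+.1f}%")
print(f"improved SIMPLY+ vs SIMPLY+: {saving_buffer:+.1f}%")
```

The first figure is the roughly 20% penalty quoted above, while the last one is the roughly 20% saving attributed to replacing the comparator with the two-stage buffer; the SIMPLY+ scheme with the original comparator costs about 50% more read energy than conventional SIMPLY.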

## V. CONCLUSION

In this work, we proposed SIMPLY+, a reliability-enhanced STT-MTJ-based LIM logic scheme. Our design exploits a commercial 65 nm CMOS PDK, as well as a macrospin-based Verilog-A compact model to describe the behavior of the adopted 30 nm diameter perpendicular STT-MTJ device. We evaluated SIMPLY+ circuit performance by means of extensive Monte Carlo simulations. As a main result of our study, the SIMPLY+ nominal read margin is about $3-4\times$ larger than that of its conventional SIMPLY counterpart. In addition, the improved SIMPLY+ also exhibits better performance in terms of BER, with an improvement by more than four orders of magnitude. These reliability advantages are obtained at the expense of a penalty of about 20% in terms of read energy.

Our results prove that the SIMPLY+ scheme is a very promising solution for designing reliable in-memory computing architectures.

#### **REFERENCES**

- [1] J. Borghetti, G. S. Snider, P. J. Kuekes, J. J. Yang, D. R. Stewart, and R. S. Williams, "'Memristive' switches enable 'stateful' logic operations via material implication," *Nature*, vol. 464, no. 7290, pp. 873–876, Apr. 2010.
- [2] A. Musello, E. Garzón, M. Lanuzza, L. M. Prócel, and R. Taco, "XNOR-bitcount operation exploiting computing-in-memory with STT-MRAMs," *IEEE Trans. Circuits Syst. II, Exp. Briefs*, vol. 70, no. 3, pp. 1259–1263, Mar. 2023.
- [3] S. Kvatinsky, G. Satat, N. Wald, E. G. Friedman, A. Kolodny, and U. C. Weiser, "Memristor-based material implication (IMPLY) logic: Design principles and methodologies," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 22, no. 10, pp. 2054–2066, Oct. 2014.

- [4] E. Garzón, A. Teman, M. Lanuzza, and L. Yavits, "AIDA: Associative in-memory deep learning accelerator," *IEEE Micro*, vol. 42, no. 6, pp. 67–75, Nov. 2022.
- [5] Q. Chen, X. Wang, H. Wan, and R. Yang, "A logic circuit design for perfecting memristor-based material implication," *IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.*, vol. 36, no. 2, pp. 279–284, Feb. 2017.
- [6] H. Mahmoudi, T. Windbacher, V. Sverdlov, and S. Selberherr, "Implication logic gates using spin-transfer-torque-operated magnetic tunnel junctions for intrinsic logic-in-memory," *Solid-State Electron.*, vol. 84, pp. 191–197, Jun. 2013.
- [7] M. Lanuzza, M. Margala, and P. Corsonello, "Cost-effective low-power processor-in-memory-based reconfigurable datapath for multimedia applications," in *Proc. Int. Symp. Low Power Electron. Design*, 2005, pp. 161–166.
- [8] H. Mahmoudi, T. Windbacher, V. Sverdlov, and S. Selberherr, "Reliability analysis and comparison of implication and reprogrammable logic gates in magnetic tunnel junction logic circuits," *IEEE Trans. Magn.*, vol. 49, no. 12, pp. 5620–5628, Dec. 2013.
- [9] F. M. Puglisi, L. Pacchioni, N. Zagni, and P. Pavan, "Energy-efficient logic-in-memory 1-bit full adder enabled by a physics-based RRAM compact model," in *Proc. 48th Eur. Solid-State Device Res. Conf. (ESSDERC)*, Sep. 2018, pp. 50–53.
- [10] F. M. Puglisi, T. Zanotti, and P. Pavan, "SIMPLY: Design of a RRAM-based smart logic-in-memory architecture using RRAM compact model," in *Proc. 49th Eur. Solid-State Device Res. Conf. (ESSDERC)*, Sep. 2019, pp. 130–133.
- [11] T. Zanotti, F. M. Puglisi, and P. Pavan, "Reconfigurable smart in-memory computing platform supporting logic and binarized neural networks for low-power edge devices," *IEEE J. Emerg. Sel. Topics Circuits Syst.*, vol. 10, no. 4, pp. 478–487, Dec. 2020.
- [12] T. Zanotti, F. M. Puglisi, and P. Pavan, "Smart logic-in-memory architecture for low-power non-von Neumann computing," *IEEE J. Electron Devices Soc.*, vol. 8, pp. 757–764, 2020.
- [13] T. Zanotti, F. M. Puglisi, and P. Pavan, "Circuit reliability analysis of in-memory inference in binarized neural networks," in *Proc. IEEE Int. Integr. Rel. Workshop (IIRW)*, Oct. 2020, pp. 1–5.
- [14] R. De Rose, T. Zanotti, F. M. Puglisi, F. Crupi, P. Pavan, and M. Lanuzza, "STT-MTJ based smart implication for energy-efficient logic-in-memory computing," *Solid-State Electron.*, vol. 184, Oct. 2021, Art. no. 108065.
- [15] R. De Rose, T. Zanotti, F. M. Puglisi, F. Crupi, P. Pavan, and M. Lanuzza, "Smart material implication using spin-transfer torque magnetic tunnel junctions for logic-in-memory computing," *Solid-State Electron.*, vol. 194, Aug. 2022, Art. no. 108390.
- [16] H. Cai, Y. Guo, B. Liu, M. Zhou, J. Chen, X. Liu, and J. Yang, "Proposal of analog in-memory computing with magnified tunnel magnetoresistance ratio and universal STT-MRAM cell," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 69, no. 4, pp. 1519–1531, Apr. 2022.
- [17] E. Garzón, R. De Rose, F. Crupi, L. Trojman, G. Finocchio, M. Carpentieri, and M. Lanuzza, "Assessment of STT-MRAMs based on double-barrier MTJs for cache applications by means of a device-to-system level simulation framework," *Integration*, vol. 71, pp. 56–69, Mar. 2020.
- [18] T.-N. Pham, Q.-K. Trinh, I.-J. Chang, and M. Alioto, "STT-BNN: A novel STT-MRAM in-memory computing macro for binary neural networks," *IEEE J. Emerg. Sel. Topics Circuits Syst.*, vol. 12, no. 2, pp. 569–579, Jun. 2022.
- [19] E. Garzón, B. Zambrano, T. Moposita, R. Taco, L.-M. Prócel, and L. Trojman, "Reconfigurable CMOS/STT-MTJ non-volatile circuit for logic-in-memory applications," in *Proc. IEEE 11th Latin Amer. Symp. Circuits Syst. (LASCAS)*, Feb. 2020, pp. 1–4.
- [20] T. Moposita, E. Garzón, F. Crupi, L. Trojman, A. Vladimirescu, and M. Lanuzza, "Efficiency of double-barrier magnetic tunnel junction-based digital eNVM array for neuro-inspired computing," *IEEE Trans. Circuits Syst. II, Exp. Briefs*, vol. 70, no. 3, pp. 1254–1258, Mar. 2023.
- [21] E. Garzón, M. Lanuzza, A. Teman, and L. Yavits, "AM4: MRAM crossbar based CAM/TCAM/ACAM/AP for in-memory computing," *IEEE J. Emerg. Sel. Topics Circuits Syst.*, vol. 13, no. 1, pp. 408–421, Mar. 2023.



- [22] H. Cai, Z. Bian, Y. Hou, Y. Zhou, J.-L. Cui, Y. Guo, X. Tian, B. Liu, X. Si, Z. Wang, J. Yang, and W. Shan, "A 28 nm 2Mb STT-MRAM computing-in-memory macro with a refined bit-cell and 22.4–41.5TOPS/W for AI inference," in *Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC)*, Feb. 2023, pp. 500–502.
- [23] Y.-C. Chiu, W.-S. Khwa, C.-Y. Li, F.-L. Hsieh, Y.-A. Chien, G.-Y. Lin, P.-J. Chen, T.-H. Pan, D.-Q. You, F.-Y. Chen, A. Lee, C.-C. Lo, R.-S. Liu, C.-C. Hsieh, K.-T. Tang, Y.-D. Chih, T.-Y. Chang, and M.-F. Chang, "A 22 nm 8 Mb STT-MRAM near-memory-computing macro with 8b-precision and 46.4–160.1 TOPS/W for edge-AI devices," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2023, pp. 496–498.
- [24] Y. Zhang, J. Wang, C. Lian, Y. Bai, G. Wang, Z. Zhang, Z. Zheng, L. Chen, K. Zhang, G. Sirakoulis, and Y. Zhang, "Time-domain computing in memory using spintronics for energy-efficient convolutional neural network," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 68, no. 3, pp. 1193–1205, Mar. 2021.
- [25] A. Nisar, S. Dhull, S. Shreya, and B. K. Kaushik, "Energy-efficient advanced data encryption system using spin-based computing-in-memory architecture," *IEEE Trans. Electron Devices*, vol. 69, no. 4, pp. 1736–1742, Apr. 2022.
- [26] B. Wu, H. Zhu, D. Reis, Z. Wang, Y. Wang, K. Chen, W. Liu, F. Lombardi, and X. S. Hu, "An energy-efficient computing-in-memory (CiM) scheme using field-free spin-orbit torque (SOT) magnetic RAMs," *IEEE Trans. Emerg. Topics Comput.*, pp. 1–12, 2023.
- [27] Z. Wang, Z. Li, Y. Liu, S. Li, L. Chang, W. Kang, Y. Zhang, and W. Zhao, "Progresses and challenges of spin orbit torque driven magnetization switching and application (Invited)," in *Proc. IEEE Int. Symp. Circuits Syst. (ISCAS)*, May 2018, pp. 1–5.
- [28] S. Wang and H. Cai, "Computing-in-Memory with enhanced STT-MRAM readout margin," *IEEE Trans. Magn.*, vol. 59, no. 11, pp. 1–5, Nov. 2023.
- [29] J. Ryu and K. Kwon, "Self-adjusting sensing circuit without speed penalty for reliable STT-MRAM," *Electron. Lett.*, vol. 53, no. 4, pp. 224–226, Feb. 2017.
- [30] T. Na, S. H. Kang, and S.-O. Jung, "STT-MRAM sensing: A review," *IEEE Trans. Circuits Syst. II, Exp. Briefs*, vol. 68, no. 1, pp. 12–18, Jan. 2021.
- [31] R. De Rose, M. Lanuzza, M. D'Aquino, G. Carangelo, G. Finocchio, F. Crupi, and M. Carpentieri, "A compact model with spin-polarization asymmetry for nanoscaled perpendicular MTJs," *IEEE Trans. Electron Devices*, vol. 64, no. 10, pp. 4346–4353, Oct. 2017.
- [32] J. J. Nowak, R. P. Robertazzi, J. Z. Sun, G. Hu, J.-H. Park, J. Lee, A. J. Annunziata, G. P. Lauer, R. Kothandaraman, E. J. O'Sullivan, P. L. Trouilloud, Y. Kim, and D. C. Worledge, "Dependence of voltage and size on write error rates in spin-transfer torque magnetic random-access memory," *IEEE Magn. Lett.*, vol. 7, pp. 1–4, 2016.
- [33] J. C. Sankey, Y.-T. Cui, J. Z. Sun, J. C. Slonczewski, R. A. Buhrman, and D. C. Ralph, "Measurement of the spin-transfer-torque vector in magnetic tunnel junctions," *Nature Phys.*, vol. 4, no. 1, pp. 67–71, Jan. 2008.
- [34] S. Ikeda, K. Miura, H. Yamamoto, K. Mizunuma, H. D. Gan, M. Endo, S. Kanai, J. Hayakawa, F. Matsukura, and H. Ohno, "A perpendicular-anisotropy CoFeB–MgO magnetic tunnel junction," *Nature Mater.*, vol. 9, no. 9, pp. 721–724, Jun. 2010.
- [35] G. Siracusano, R. Tomasello, M. D'Aquino, V. Puliafito, A. Giordano, B. Azzerboni, P. Braganca, G. Finocchio, and M. Carpentieri, "Description of statistical switching in perpendicular STT-MRAM within an analytical and numerical micromagnetic framework," *IEEE Trans. Magn.*, vol. 54, no. 5, pp. 1–10, May 2018.
- [36] H. Sato, E. C. I. Enobio, M. Yamanouchi, S. Ikeda, S. Fukami, S. Kanai, F. Matsukura, and H. Ohno, "Properties of magnetic tunnel junctions with a MgO/CoFeB/Ta/CoFeB/MgO recording structure down to junction diameter of 11 nm," *Appl. Phys. Lett.*, vol. 105, no. 6, Aug. 2014, Art. no. 062403.
- [37] Q. K. Trinh, S. Ruocco, and M. Alioto, "Novel boosted-voltage sensing scheme for variation-resilient STT-MRAM read," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 63, no. 10, pp. 1652–1660, Oct. 2016.



**TATIANA MOPOSITA** (Member, IEEE) received the B.E. degree in mechatronic engineering from Escuela Politécnica del Ejército (ESPE), Quito, Ecuador, in 2017, and the dual M.S. degree in nanoelectronics and electronics from University San Francisco de Quito, Ecuador, and the University of Calabria (UNICAL), Italy, in 2020. She is currently pursuing the Ph.D. degree with the Institut Supérieur d'Électronique de Paris (ISEP), Sorbonne Université, and UNICAL. Her research interests include the integration of spintronic devices with CMOS technology for in-memory computing.



**ESTEBAN GARZÓN** (Member, IEEE) received the Ph.D. degree in electronics engineering from the University of Calabria (UNICAL), Italy, in 2022. He is currently a Postdoctoral Researcher with the Department of Computer Engineering, Modeling, Electronics and Systems Engineering, UNICAL. He has coauthored more than 40 scientific papers in international journals and conferences and has participated in several IC tape-outs. His research interests include domain-specific hardware accelerators and electronics/spintronics, cryogenic memories, and standard and emerging technologies for logic, memory, and low-power applications.



**RAFFAELE DE ROSE** (Senior Member, IEEE) received the Ph.D. degree from the University of Calabria, Rende, Italy, in 2012. He is currently an Assistant Professor with the Department of Computer Engineering, Modeling, Electronics and Systems Engineering (DIMES), University of Calabria. He has authored more than 60 papers in peer-reviewed international journals and conferences. His recent research interests include device/circuit modeling and co-design in emerging technologies, ultralow-voltage and low-power CMOS circuit design, and hardware-level security. He is an Associate Editor of the IEEE Transactions on Very Large Scale Integration (VLSI) Systems.



**FELICE CRUPI** (Senior Member, IEEE) is currently a Full Professor of electronics with the Department of Computer Engineering, Modeling, Electronics and Systems Engineering (DIMES), University of Calabria, Rende, Italy. He is the author of more than 200 publications in international journals and conference proceedings. His primary research interests include electronic device reliability, the design of ultra-low power analog circuits, and early assessment of emerging technologies for logic and memory applications. He has served as a Technical Program Committee Member of the International Electron Devices Meeting and the International Reliability Physics Symposium.





**ANDREI VLADIMIRESCU** (Life Fellow, IEEE) received the M.S. and Ph.D. degrees in EECS from the University of California at Berkeley. He was a key contributor to the SPICE simulator at the University of California at Berkeley, releasing the SPICE2G6 production-level software in 1981. He pioneered electrical simulation on parallel computers with the CLASSIE simulator as a part of the Ph.D. degree. For many years, he was the Research and Development Director leading the design and implementation of innovative software and hardware electronic design automation products with Analog Devices, Inc., Daisy Systems, Analog Design Tools, Valid Logic, and Cadence. He is currently a Professor involved in research projects with the University of California at Berkeley, the Delft University of Technology (TU Delft), and the Institut Supérieur d'Électronique de Paris (ISEP), and an Industry Consultant. He has authored *The SPICE Book* (Wiley). His research interests include ultralow-voltage CMOS, design, simulation, and modeling of circuits with new devices and circuits for quantum computing.



**LIONEL TROJMAN** (Senior Member, IEEE) received the B.S. degree in physics from Aix-Marseille University, France, in 2002, the M.S. degree in physics applied to micro- and nano-electronics and in electrical engineering in microelectronic and telecommunication from École Polytechnique Universitaire de Marseille, Aix-Marseille University, in 2004, and the Ph.D. degree in electrical engineering from KU Leuven in partnership with imec, Belgium, in 2009. Since 2009, he has been a full-time Professor with the Department of Electrical and Electronics Engineering, USFQ, Ecuador. Since 2021, he has also been the Director of Research with the Institut Supérieur d'Électronique de Paris (ISEP) and the Director of the Laboratory of Informatic Signal Image Telecom and Electronics (LISITE), France. His current research interests include transport for ultra-scaled MOSFET (sub-100-nm) with UTEOT high-k dielectrics with conventional and new architectures (FinFET, FDSOI or TFET) for CMOS technologies but also for memory (ReRAM and MTJ) and GaN power electronic devices.



**MARCO LANUZZA** (Senior Member, IEEE) received the Ph.D. degree in electronic engineering from the Mediterranea University of Reggio Calabria, Reggio Calabria, Italy, in 2005. Since 2006, he has been with the University of Calabria (UNICAL), Rende, Italy, where he is currently an Associate Professor. He has authored more than 140 publications in international journals and conference proceedings. His research interests include the design of ultralow voltage circuits and systems, the development of efficient models and methodologies for leakage- and variability-aware designs, and the design of digital and analog circuits in emerging technologies. He is an Associate Editor of *Integration, the VLSI Journal*, and *International Journal of Circuit Theory and Applications*.
