

# Design and Evaluation of a 28-nm FD-SOI STT-MRAM for Ultra-Low Power Microcontrollers

# GUILLAUME PATRIGEON<sup>1</sup>, PASCAL BENOIT<sup>1</sup>, LIONEL TORRES<sup>1</sup>, SOPHIANE SENNI<sup>1</sup>, GUILLAUME PRENAT<sup>2</sup>, AND GREGORY DI PENDINA<sup>2</sup>

<sup>1</sup>Montpellier Laboratory of Informatics, Robotics and Microelectronics (LIRMM), CNRS, Université de Montpellier, 34090 Montpellier, France <sup>2</sup>SPINTEC, CNRS, CEA, Université Grenoble-Alpes, 38054 Grenoble, France

Corresponding author: Guillaume Patrigeon (guillaume.patrigeon@lirmm.fr)

This work was supported in part by the European Union's Horizon 2020 Research and Innovation Program (GREAT Project) under Grant 687973, and in part by the French National Research Agency (MASTA Project) under Grant ANR-15-CE24-0033-01.

**ABSTRACT** The complexity of embedded devices increases as today's applications demand ever more services. At the same time, the power consumption of systems-on-chip has significantly increased due to high-density integration and the high leakage power of current CMOS transistors. To address these issues, emerging technologies are considered. Spin-transfer torque magnetic random access memory (STT-MRAM) is seen as a promising alternative solution to traditional memories, thanks to its negligible leakage current, high density, and non-volatility. In this paper, we present the design and evaluation of a 128-kB STT-MRAM in a 28-nm FD-SOI technology with SRAM-like interface for ultra-low power microcontrollers. With 0.9-pJ/bit read in 5 ns and 3-pJ/bit write in 10 ns, this embedded non-volatile memory is suitable for devices that run at frequencies under 100 MHz. Considering a low-power application with duty-cycled behavior, we evaluate the STT-MRAM as a replacement of embedded Flash and SRAM by comparing single- and multi-memory architecture scenarios.

**INDEX TERMS** 28-nm FD-SOI, microcontroller, STT-MRAM, ultra-low-power.

# I. INTRODUCTION

Nowadays, embedded systems are widely used in various domains, and many applications impose tight constraints on designers and developers in terms of performance and energy consumption. While the complexity of embedded devices is continuously increasing, the power consumption of systems-on-chip remains a challenge for battery-powered applications, for which long autonomy has become a real need. The energy consumption of an Ultra-Low Power (ULP) microcontroller can be optimized at multiple levels. Many ULP applications have a periodic behavior, alternating between run and sleep phases ("duty-cycle" operation mode, Figure 1 (a)). The time spent in each phase depends on the application specifications and selected solutions. Even though sleep modes help to reduce the power consumption of a microcontroller, some energy is still lost during sleep phases. As a workaround, it is possible to power down a microcontroller during sleep phases (Figure 1 (b)), but for traditional architectures the system state is then lost, forcing a system reboot. Another solution is to insert Non-Volatile Logic (NVL) inside the architecture so that it can store its state before a shutdown and restore it after wake-up (normally-off computing, Figure 1 (c)). State recovery using NVL is faster and more energy efficient than a full restart [1], but this solution adds an area overhead and lengthens the go-to-sleep phase. This method also requires some energy to store the system state and restore it at wake-up. Comparing the traditional microcontroller with the normally-off solution, there is a trade-off between the energy lost in the sleep phase (Figure 1 (a)) and the backup energy overhead (Figure 1 (c)).
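The trade-off between sleep leakage and backup overhead can be written as a simple per-period energy balance. The symbols below are introduced here for illustration only and are not taken from the original text: $E_{\text{run}}$ is the energy of one run phase, $P_{\text{sleep}}$ the residual power in sleep mode, $t_{\text{sleep}}$ the sleep duration, and $E_{\text{backup}}$, $E_{\text{restore}}$ the energies needed to save and restore the state. The sketch assumes the run phase costs the same in both cases and neglects the longer go-to-sleep time.

$$
E_{\text{sleep-mode}} = E_{\text{run}} + P_{\text{sleep}}\, t_{\text{sleep}}, \qquad
E_{\text{normally-off}} = E_{\text{run}} + E_{\text{backup}} + E_{\text{restore}}
$$

$$
E_{\text{normally-off}} < E_{\text{sleep-mode}} \iff t_{\text{sleep}} > \frac{E_{\text{backup}} + E_{\text{restore}}}{P_{\text{sleep}}}
$$

Under these assumptions, normally-off computing pays off only when the sleep phase is long enough for the avoided leakage to exceed the backup and restore cost.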

Different approaches are studied to achieve the lowest power consumption for ULP applications. NVL is used to power down parts of a system's logic by saving their state into non-volatile registers, thereby avoiding the leakage current during sleep phases (Figure 1 (c)). Reference [2] integrates non-volatile cells directly into flip-flops, managed by

The associate editor coordinating the review of this manuscript and approving it for publication was Bora Onat.



FIGURE 1. Duty-cycle mode operation. (a) Sleep mode. (b) Power down mode. (c) Normally-off computing.

TABLE 1. Ultra-low-power microcontrollers.

|                        | Lallement 2018 [4]                   | Lin 2018 [6]                                               | Zwerg 2017 [1]                                                | Izumi 2015 [2]                                                                    | Singhal 2015 [5]                      | Khanna 2014 [3]                                                  |
|------------------------|--------------------------------------|------------------------------------------------------------|---------------------------------------------------------------|-----------------------------------------------------------------------------------|---------------------------------------|------------------------------------------------------------------|
| Technology             | 28 nm FD-SOI                         | 180 nm                                                     | 130 nm                                                        | 130 nm                                                                            | 90 nm                                 | 130 nm HVT                                                       |
| Power supply           | -                                    | 0.2V-1.1V                                                  | 1.2V/1.5V                                                     | 1.2V/3.0V                                                                         | 3V                                    | 1.5V                                                             |
| Frequency              | 16 MHz                               | -                                                          | 8 MHz                                                         | 24 MHz                                                                            | 16 MHz                                | 8 MHz                                                            |
| CPU                    | Cortex-M0+                           | MSP430                                                     | Cortex-M0+                                                    | Cortex-M0                                                                         | MSP430                                | 32-bit                                                           |
| Memory<br>architecture | 4 kB SRAM (code)<br>4 kB SRAM (data) | 128 B ROM (boot)<br>1 kB latch (code)<br>1 kB latch (data) | 96 kB ROM<br>2x 32 kB FRAM<br>2 kB SRAM<br>Non-Volatile Array | 16 kB 6T4C NVRAM<br>(code and data)<br>Non-Volatile Logic<br>(Ferroelectric NVFF) | 64 kB NVRAM<br>8 kB SRAM<br>MTCMOS FF | 10 kB ROM<br>64 kB FRAM<br>8 kB SRAM<br>Non-Volatile FeCap Array |
| Active power           | 2.7 pJ/cycle                         | 33 pJ/cycle @ 0.45V                                        | 150 µA/MHz @ 1.2V                                             | 6.14 µA                                                                           | 28.3 µW/MHz                           | 75 µW/MHz                                                        |
| Sleep power            | 0.7 µW @ 0.5V                        | -                                                          | NVL sleep: 0 W                                                | -                                                                                 | 0.32 µW @ 3V                          | Retention: 0.28 µW<br>Sleep: 0 W                                 |
| Wake-up                | -                                    | -                                                          | 380 nC, 438 µs                                                | -                                                                                 | -                                     | 2.4 nJ, 384 ns (125 MHz)                                         |
| Backup                 | -                                    | -                                                          | -                                                             | -                                                                                 | -                                     | 7.2 nJ, 320 ns (125 MHz)                                         |

TABLE 2. Commercial microcontrollers for low power applications.

|                        | STM32L0 series [7]<br>STMicroelectronics                     | STM32F0 series [8]<br>STMicroelectronics | LPC1100 series [9]<br>NXP Semiconductors                     | nRF51 series [10]<br>Nordic Semiconductors | PSoC 4 family [11]<br>Cypress Semiconductors | FM0+ family [12]<br>Cypress Semiconductors |
|------------------------|--------------------------------------------------------------|------------------------------------------|--------------------------------------------------------------|--------------------------------------------|----------------------------------------------|--------------------------------------------|
| Frequency              | 32 MHz                                                       | 48 MHz                                   | 50 MHz                                                       | 32 MHz                                     | 48 MHz                                       | 40 MHz                                     |
| CPU                    | ARM Cortex-M0+                                               | ARM Cortex-M0                            | ARM Cortex-M0<br>ARM Cortex-M0+                              | ARM Cortex-M0                              | ARM Cortex-M0<br>ARM Cortex-M0+              | ARM Cortex-M0+                             |
| Memory<br>architecture | Flash 8 to 192 kB<br>SRAM 2 to 20 kB<br>EEPROM 512 to 6144 B | Flash 16 to 256 kB<br>SRAM 4 to 32 kB    | Flash 4 to 256 kB<br>SRAM 2 to 36 kB<br>EEPROM 512 to 4096 B | Flash 128 to 256 kB<br>SRAM 16 to 32 kB    | Flash 8 to 256 kB<br>SRAM 2 to 32 kB         | Flash 56 to 88 kB<br>SRAM 6 kB             |

a dedicated controller, whereas [1] and [3] use non-volatile arrays outside the logic to save the system state. Depending on the application, some state-of-the-art solutions focus more on reducing active energy consumption than on removing sleep leakage power: [4] and [5] use body-biasing methods to achieve lower active energy, and [6] adapts its memory architecture. However, because the memories used by [6] are very small (2x 1 kB), that solution is limited to a small number of applications. Moreover, the microcontrollers from [6] and [4] must be initialized from an external device to obtain their program, which adds constraints to their integration in embedded devices. References [1], [5] and [3] use an architecture and memory sizes similar to commercial products (Table 2), with a non-volatile memory and a SRAM. Only [2] is based on a single non-volatile memory architecture; the other works presented in Table 1 separate the application's program and data into different memories. Moreover, [2] uses only non-volatile memories, which helps to reach the lowest energy consumption in sleep mode without performing data transfers to save the content of volatile parts.

References [1], [2] and [3] use ferroelectric memories when non-volatility is needed. Here we use another kind of NVM technology: Spin-Transfer Torque Magnetic Random Access Memory (STT-MRAM), which is seen as a promising alternative solution to traditional memories thanks to its negligible leakage current, high density, and non-volatility. By combining the 28-nm FD-SOI technology for CMOS and the STT-MRAM solution for the memory system, we investigate different architectural solutions to improve the energy efficiency, reliability, and performance of systems-on-chip for ULP applications. In comparison to FeRAM, STT-MRAM offers lower access latencies, higher retention and higher density ([13], [14]). In this work, we focus on the memory architecture and memory technologies for ULP devices. An overview of memory architecture solutions is introduced in Section II. Section III presents the design of an embedded 128-kB perpendicular peSTT-MRAM with 28-nm FD-SOI CMOS, whose evaluation is detailed in Section IV. Finally, Section V concludes this paper and provides directions for future research.

# **II. MEMORY ARCHITECTURE OVERVIEW**

## A. MICROCONTROLLER ARCHITECTURE

The specifications of an application determine which microcontroller to use. Manufacturers generally offer a large variety of microcontrollers to address the large number of current embedded applications and their specific constraints, as it is not possible to create one microcontroller architecture that fits all applications. Microcontrollers differ in package, number of input/output pins, processor, operating frequency, peripherals, communication interfaces, analog modules, low-power modes, memory technologies and memory capacities, and some are dedicated to specific kinds of applications (automotive, for example). However, there are some similarities between all these different devices. Typical microcontrollers include at least one processor, a non-volatile memory (usually Flash for code instructions and read-only data), a volatile memory (usually SRAM for application data), a power management unit, a clock management unit, input/output peripherals, communication modules (UART, SPI, I<sup>2</sup>C, USB, CAN...) and timers. This typical architecture is depicted in Figure 2. Some microcontrollers also include different types of non-volatile memories (ROM, EEPROM...) or have a multi-master system (multi-processor, Direct Memory Access (DMA)...).

## **B. MEMORY PARTITIONING**

The memory architecture of microcontrollers is constrained by the architecture of the processor. With processors having a single bus interface (like the ARM Cortex-M0), it is possible to store both code and application data in the same memory without affecting the performances of the system. Some



FIGURE 2. Typical microcontroller architecture.

other processors (like the ARM Cortex-M3) have multiple bus interfaces (for instruction, data, system...) and require dedicated memory architectures: as they are able to handle parallel transactions, a single memory with a single access bus decreases the overall system performance. Multi-master systems (when using specific modules like DMA) are also affected by the number of interfaces of the main memories.

The ARM Cortex-M architecture is widely used in commercial low-power microcontrollers, and in our evaluation we use one of its implementations that is available for academic projects: the ARM Cortex-M0 r1p0. This is a 3-stage 32-bit RISC processor that implements the ARMv6-M ISA, with a maximum frequency of 50 MHz. It includes a single AHB-Lite interface, 32 interrupt lines, 1 Non-Maskable Interrupt and a single-cycle multiplier.

In this study, we focus on memory architectures for a single master system (which is the Cortex-M0, with a single bus interface). As STT-MRAM is a non-volatile random access memory, it could be used to replace both Flash for code memory and SRAM for dynamic data memory. To evaluate the possible gains by using STT-MRAM, we compare different memory architectures, as illustrated in Figure 3: ① program code in Flash and dynamic data in SRAM (this is actually the



FIGURE 3. Memory architecture scenarios.

TABLE 3. STT-MRAM demonstrators for high performance (HP) and high density (HD).

|                                             | Toshiba [15]<br>TED 2017 (HP)                                            | Toshiba [16]<br>ISSCC 2016 (HP)                                                       | Toshiba [17]<br>ISSCC 2015 (HP)                   | Samsung [18]<br>IEDM 2016 (HD)                                                           | Qualcomm/TDK [19]<br>IEDM 2015 (HD)                          |
|---------------------------------------------|--------------------------------------------------------------------------|---------------------------------------------------------------------------------------|---------------------------------------------------|------------------------------------------------------------------------------------------|--------------------------------------------------------------|
| CMOS                                        | 20 nm                                                                    | 65 nm                                                                                 | 65 nm                                             | 28 nm LPP                                                                                | 40 LP, 6 metal levels                                        |
| Density                                     | -                                                                        | 4x 1 Mb                                                                               | 1 Mb                                              | 8 Mb                                                                                     | 1 Mb                                                         |
| Cell<br>architecture                        | 2T2MTJ (L2),<br>1T1MTJ (L3)                                              | 2T2MTJ                                                                                | 2T2MTJ                                            | 1T-1MTJ<br>(Cu backend)                                                                  | 1T-1MTJ<br>(MTJ between M4 M5)                               |
| Unit cell size                              | 3x Minsize                                                               | -                                                                                     | -                                                 | 0.0364 µm <sup>2</sup>                                                                   | 40 F <sup>2</sup> , 0.065 µm <sup>2</sup>                    |
| MTJ                                         | 16-43 nm                                                                 | -                                                                                     | 47 nm                                             | 38-45 nm                                                                                 | -                                                            |
| TMR, Rp, σ                                  | > 150%                                                                   | -                                                                                     | -                                                 | Rp ~1 kΩ,<br>TMR 180%<br>for high yield @85°C<br>(Sensing margin 25 σ<br>Rp variation 7%) | TMR 110%, Rp ~2 kΩ<br>(18 σ read windows)                    |
| Retention<br>Endurance                      | 10 <sup>12</sup>                                                         | -                                                                                     | -                                                 | 10 years @ 85°C<br>Up to 10 <sup>8</sup> cycles                                          | $> 10^{13}$ (10 ns Wpulse)                                   |
| Timing                                      | Wpulse 1-3 ns<br>@ Ic =40-100μA                                          | Read 3.3 ns @ 1.25V<br>(F = 300 MHz)                                                  |                                                   | 40 MHz<br>Wpulse 25 ns @ 1.2V<br>Wpulse 15 ns @ 1.5V                                     | 50 MHz<br>ReadAccess 16 ns (0-70°)<br>WriteAccess 20-100 ns, |
| IO width                                    | -                                                                        | -                                                                                     | -                                                 | x32/x64                                                                                  | x32/x64                                                      |
| Optimization,<br>Redundancy,<br>Repair, ECC | -                                                                        | Physically eliminated<br>read disturbance<br>write-verify-write,<br>read-modify-write | Hierarchical bit line for eliminating disturbance | Rows and columns<br>(activated)                                                          | Rows and columns<br>(not activated)                          |
| Power supply<br>(core/IO)                   | -                                                                        | 1.2V, local and global power gating                                                   | 1.2V, local and global power gating               | 1.0V / 1.8V                                                                              | 1.2V / 1.8V                                                  |
| Error bit count                             | WER < 6 (4 ns, 62 μA)<br>RER: 0 over 10 <sup>6</sup> reads (1 ns, 10 μA) | -                                                                                     | -                                                 | -                                                                                        | Zero (100% yield) No pair                                    |

architecture used in most commercial microcontrollers), ② Flash is replaced by STT-MRAM, ③ both program code and data are located in a single STT-MRAM, and ④ both program code and data are located in a single SRAM.

The memory architecture of commercial low-power microcontrollers using an ARM Cortex-M0 or an ARM Cortex-M0+ is in many cases composed of at least a Flash memory and a SRAM. The microcontroller families presented in Table 2 have 4 kB to 256 kB of Flash and 2 kB to 36 kB of SRAM. Given these capacities, we chose a main memory of 128 kB and an optional second memory of 16 kB for our evaluation.

## **III. MEMORY DESIGN**

This section presents the design of the 128 kB (1 Mb) peSTT-MRAM with 28-nm FD-SOI CMOS.

## A. SPECIFICATIONS

This memory is used as the main memory for the different architecture scenarios presented in Section II. To be compliant with these architectures, the memory has a single-port, synchronous, 32-bit wide SRAM-compatible interface. The chosen CMOS technology is the 28-nm FD-SOI from STMicroelectronics. To help identify the memory specifications for a 28-nm technology node, two versions of the memory array implementation have been considered, for High-Performance (HP) or High-Density (HD) applications, and the performance has been extrapolated for different sizes and options of the memory in a compiler-oriented approach (Table 3). As our system is limited to 50 MHz by the processor, an HD architecture is preferred. For this work we have focused

TABLE 4. Specifications of the MRAM.

| Parameter               | Value                                                               |
|-------------------------|---------------------------------------------------------------------|
| MTJ                     | 40 nm                                                               |
| TMR, Rp, δ              | ≥ 150%, 1 kΩ, 5%                                                    |
| Retention               | 10 years                                                            |
| Endurance               | ≥ 10<sup>12</sup>                                                   |
| Critical current (Ic)   | 40 ~ 100 µA                                                         |
| Density                 | 128 kB (1 Mb)                                                       |
| Timing                  | 20 ~ 100 ns (10 ~ 50 MHz)                                           |
| IO width                | 32                                                                  |
| Bit cell size           | Min MOS size: 0.0364 µm<sup>2</sup>                                 |
| Bit cell architecture   | 1T-1MTJ                                                             |
| Optimization techniques | Body bias<br>Quasi differential sensing<br>Source Line (SL) sharing |
| CMOS                    | 28-nm FD-SOI                                                        |
| Power supply            | 1.0 VDC                                                             |

on the parameters of [18] and used them to define the following specifications of the bit cell: 1 transistor 1 junction architecture (1T-1MTJ), 40 nm diameter MTJ with a parallel resistance (Rp) of 1 kΩ and a minimum TMR of 150%, 10 years retention with 10<sup>12</sup> write endurance. All the specifications of the bit cell and the memory are summarized in Table 4.

## **B. MEMORY ARCHITECTURE**

The memory is made of a single bank. The data are 32-bit wide, so the memory architecture is made of 32 IO blocks (Figure 4). Each IO contains 32 columns, 1 reference column,



FIGURE 4. Memory architecture, made of a single bank.



FIGURE 5. Architecture of one IO block, containing 32 columns (grey), read (blue), and write (green) circuitry and a 2-level multiplexer (yellow). The column decoder (purple) is shared between IOs.

the reading/writing circuits, and a 32-to-1 multiplexing stage used to select the addressed column, as depicted in Figure 5 and Figure 6.
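The organization described above (32 IO blocks, each with 32 data columns behind a 32-to-1 multiplexer) fixes the number of rows needed to reach 1 Mb. The following Python sketch works through that arithmetic. The row count and the way a word address is split into row and column indices are inferred assumptions for illustration; the paper does not state them, and the reference column is left out of the count.

```python
# Sketch of the array geometry implied by the text. The row count and the
# address split are inferred assumptions, not figures given in the paper.

CAPACITY_BITS = 128 * 1024 * 8   # 128 kB = 1 Mb
IO_BLOCKS = 32                   # one IO block per bit of the 32-bit data word
COLUMNS_PER_IO = 32              # data columns behind the 32-to-1 multiplexer


def rows_per_column() -> int:
    """Rows needed so that IO blocks x columns x rows covers the capacity."""
    return CAPACITY_BITS // (IO_BLOCKS * COLUMNS_PER_IO)


def split_word_address(word_addr: int) -> tuple[int, int]:
    """Split a 32-bit-word address into (row, column).

    The ordering (column index in the low bits) is hypothetical.
    """
    n_words = CAPACITY_BITS // IO_BLOCKS
    if not 0 <= word_addr < n_words:
        raise ValueError("word address out of range")
    return word_addr // COLUMNS_PER_IO, word_addr % COLUMNS_PER_IO


if __name__ == "__main__":
    print("rows per column:", rows_per_column())
    print("last word maps to (row, column):", split_word_address(32767))
```

Under these assumptions the array has 1024 rows, so a word address would need 10 row bits and 5 column-select bits.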

## C. DESIGN AND CHARACTERISATION

Based on these specifications, the memory was fully designed following a memory compiler approach, which makes it easy to generate memories of different sizes (up to 128 kB) according to the requirements of the application. The memory has been fully characterized at circuit level, using the electrical simulators Spectre and Ultrasim. As summarized in Table 5, one IO block requires 0.9 pJ to read the farthest bit in 5 ns (worst-case path) and 3.0 pJ to write it in 10 ns, which sets the maximum operating frequency without wait states to 100 MHz.



FIGURE 6. Columns and bit cells organization in the memory.

TABLE 5. Memory performances.

| Parameter          | Value               |
|--------------------|---------------------|
| Leakage            | 352 µA (352 µW)     |
| 1 bit read energy  | 0.9 pJ              |
| 1 bit write energy | 3.0 pJ              |
| Read latency       | 5 ns (200 MHz max)  |
| Write latency      | 10 ns (100 MHz max) |
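The frequency limits and the leakage power in Table 5 follow directly from the latencies and from the 1.0 V supply listed in Table 4. The short Python sketch below reproduces them, assuming that an access must complete within one clock period; the function names are ours.

```python
# Derive the Table 5 frequency limits and leakage power from its raw figures.
# Assumption: a zero-wait-state access must fit within one clock period.

READ_LATENCY_NS = 5.0
WRITE_LATENCY_NS = 10.0
LEAKAGE_CURRENT_UA = 352.0
SUPPLY_V = 1.0  # from Table 4


def max_frequency_mhz(latency_ns: float) -> float:
    """Highest clock frequency (MHz) whose period is at least latency_ns."""
    return 1e3 / latency_ns


def zero_wait_state_limit_mhz() -> float:
    """The slower operation (the write) bounds the single-cycle frequency."""
    return min(max_frequency_mhz(READ_LATENCY_NS),
               max_frequency_mhz(WRITE_LATENCY_NS))


def leakage_power_uw(current_ua: float, supply_v: float) -> float:
    """Static power in microwatts: P = I x V."""
    return current_ua * supply_v


if __name__ == "__main__":
    print(max_frequency_mhz(READ_LATENCY_NS), "MHz read limit")
    print(zero_wait_state_limit_mhz(), "MHz without wait states")
    print(leakage_power_uw(LEAKAGE_CURRENT_UA, SUPPLY_V), "uW leakage")
```

The write latency is the binding constraint, which is why the 100 MHz figure, not 200 MHz, is quoted as the limit without wait states.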

Since simulating a memory of this size at circuit level is almost impossible due to the huge simulation times and amounts of data, the global operation of the memory was simulated with Ultrasim over very short durations and with a degraded level of accuracy. The full characterization was made using the critical path of the memory, with parasitic capacitances and resistances of the access lines calculated using the Design Rules Manager (DRM) of the technology. Following a standard compiler approach, the simulations were performed for different sizes of the memory, so that the performance for a given size can be extrapolated.

Looking at area, the 128 kB peSTT-MRAM occupies about 58 000 µm<sup>2</sup>. This is around 3.5 times smaller than a SRAM of the same capacity (around 204 000 µm<sup>2</sup>).

## **IV. EVALUATION**

## A. SYSTEM DETAILS

The system is kept as simple as possible to focus the evaluation on the memory architecture. The Cortex-M0 and the memories are organized around a single-master AMBA3 AHB-Lite bus architecture. The assumptions for this evaluation are the following:

- The maximum operating frequency is chosen to perform each memory operation in one cycle.
- Both SRAM and STT-MRAM support 8-bit, 16-bit and 32-bit write operations.
- Each read operation is 32-bit wide.
- There are no interrupts, exceptions or events.
- There is no shadow memory operation (single master).

The energy of the bus matrix and of the memory controllers is not included in this work. The software benchmark used is the ULPMark from EEMBC [20]. CoreProfile (ULPMark-CP) is an application designed to reproduce a periodic behavior with active and sleep phases (see Figure 1 (a)). The active phase of the CoreProfile is composed of math functions (linear approximation, filtering), conversion tables, string search, table copy, sorting, data permutations and output toggling. This application code is mostly used to evaluate and compare the energy efficiency of Ultra-Low-Power microcontrollers for Internet of Things applications.

A comparison of the memories in terms of energy per operation is summarized in Table 6. The energy consumption of the SRAM comes from a 128 kB memory implementation in 28-nm FD-SOI from STMicroelectronics, the read energy of the 28-nm Flash memory is extrapolated from the data of [21], and the data of the STT-MRAM are based on the results presented in Section III. Although the STT-MRAM has a higher write energy (3.0 pJ/bit) than the SRAM (0.73 pJ/bit) for the same capacity (128 kB), it has the lowest energy cost for read operations. As we lack information about leakage, the evaluation of the active phase only takes into account the dynamic energy.

TABLE 6. Energy cost per operation for each memory.

| Operation    | Flash<br>128 kB | STT-MRAM<br>128 kB | SRAM<br>128 kB |
|--------------|-----------------|--------------------|----------------|
| 32-bit read  | 39.4 pJ         | 29 pJ              | 30.6 pJ        |
| 8-bit write  | -               | 24 pJ              | 5.85 pJ        |
| 16-bit write | -               | 48 pJ              | 11.7 pJ        |
| 32-bit write | -               | 96 pJ              | 23.4 pJ        |
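The STT-MRAM column of Table 6 can be recovered from the per-bit figures of Section III by scaling with the access width. The Python sketch below shows this, assuming that the energy grows linearly with the number of bits accessed and that only the addressed IO blocks are activated on a narrow write; both points are our reading of the table, not statements from the paper.

```python
# Scale the per-bit STT-MRAM energies (Section III) to per-access energies.
# Assumption: energy is proportional to the number of bits read or written.

STT_MRAM_READ_PJ_PER_BIT = 0.9
STT_MRAM_WRITE_PJ_PER_BIT = 3.0


def access_energy_pj(energy_per_bit_pj: float, width_bits: int) -> float:
    """Energy of one access of width_bits, in picojoules."""
    if width_bits not in (8, 16, 32):
        raise ValueError("unsupported access width")
    return energy_per_bit_pj * width_bits


if __name__ == "__main__":
    read32 = access_energy_pj(STT_MRAM_READ_PJ_PER_BIT, 32)
    print(f"32-bit read: {read32:.1f} pJ")
    for width in (8, 16, 32):
        energy = access_energy_pj(STT_MRAM_WRITE_PJ_PER_BIT, width)
        print(f"{width}-bit write: {energy:.0f} pJ")
```

The 32-bit read comes out at 28.8 pJ, which Table 6 rounds to 29 pJ, and the write energies match the table exactly.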

TABLE 7. Memory operations for one CoreProfile active phase (49767 cycles, 32773 executed instructions).

|                     | Code memory | Data memory |
|---------------------|-------------|-------------|
| Idle cycles         | 29532       | 38280       |
| Instruction fetches | 18693       | 0           |
| Total reads         | 1542        | 8151        |
| Total writes        | 0           | 3336        |
| 8-bit writes        | 0           | 1000        |
| 16-bit writes       | 0           | 516         |
| 32-bit writes       | 0           | 1820        |



FIGURE 7. Dynamic energy consumption of the memories.

## B. RESULTS

For each architecture scenario, the active phase of the application executes in 49767 cycles (32773 instructions executed). Table 7 shows the total count of the different memory operations in the code memory (that is, the location where read-only data and program code are stored) and in the data memory (volatile data). Because the program code is located in the code memory, no instruction fetch occurs in the data memory, and write operations occur only in the data memory. One instruction fetch corresponds to a 32-bit read.
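The counts in Table 7 appear to partition the 49767 cycles of the active phase: for each memory, idle cycles, instruction fetches, other reads and writes add up to the cycle count. This reading, which matches the assumption of one memory operation per cycle, is ours; the Python sketch below simply checks it.

```python
# Consistency check on Table 7: each memory's idle cycles plus accesses
# should equal the length of the active phase (our interpretation).

TOTAL_CYCLES = 49767

CODE_MEMORY = {"idle": 29532, "fetches": 18693, "reads": 1542, "writes": 0}
DATA_MEMORY = {"idle": 38280, "fetches": 0, "reads": 8151, "writes": 3336}
DATA_WRITES_BY_WIDTH = {8: 1000, 16: 516, 32: 1820}


def cycles_accounted(memory: dict) -> int:
    """Sum of idle cycles and single-cycle accesses for one memory."""
    return sum(memory.values())


if __name__ == "__main__":
    print(cycles_accounted(CODE_MEMORY) == TOTAL_CYCLES)
    print(cycles_accounted(DATA_MEMORY) == TOTAL_CYCLES)
    print(sum(DATA_WRITES_BY_WIDTH.values()) == DATA_MEMORY["writes"])
```

All three checks hold, so "Total reads" in Table 7 can be taken as reads other than instruction fetches.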

Figure 7 shows the estimated dynamic energy consumption of the memories, for the different architectures previously described in Figure 3. Each memory is represented by a different color to distinguish its contribution.

For scenarios ① and ②, where code and data are separated in two different memories and where only read operations are performed in the non-volatile memories, we observe that the dynamic energy consumption of the STT-MRAM is around 26% lower than that of the Flash. As the data memories are the same for both scenarios (16 kB SRAM), their contribution is equal. In total, the dynamic energy consumption of the memories is around 23% lower for the second scenario than for the first one. For scenarios ③ and ④, where a single memory is used for both code and data, we observe that the single-SRAM architecture consumes slightly less energy than the single-STT-MRAM architecture (around 12%), because of the higher write energy of the STT-MRAM. Memory operations in the 16 kB SRAM (which is used for application data only) require less energy than in the larger memories (128 kB SRAM and STT-MRAM), which is why scenarios ③ and ④ (with a single memory) have a higher dynamic energy consumption than scenario ② (with separate memories).
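Two of the percentages above can be reproduced from Tables 6 and 7 alone. The Python sketch below multiplies the operation counts by the per-access energies for the 128 kB memories. The per-access energies of the 16 kB SRAM are not given in the text, so the totals of scenarios ① and ② (and the 23% figure) are not recomputed here; the dictionary layout and function name are ours.

```python
# Dynamic energy of the 128 kB memories from Table 6 (pJ per access)
# and Table 7 (operation counts). The 16 kB SRAM is not modelled because
# its per-access energies are not given in the text.

ENERGY_PJ = {
    "flash": {"read32": 39.4},
    "stt_mram": {"read32": 29.0, "write8": 24.0, "write16": 48.0, "write32": 96.0},
    "sram": {"read32": 30.6, "write8": 5.85, "write16": 11.7, "write32": 23.4},
}

# Every instruction fetch is a 32-bit read, as are the other code-memory reads.
CODE_OPS = {"read32": 18693 + 1542}
DATA_OPS = {"read32": 8151, "write8": 1000, "write16": 516, "write32": 1820}


def dynamic_energy_nj(memory: str, *op_counts: dict) -> float:
    """Total dynamic energy (nJ) of `memory` for the given operation counts."""
    total_pj = 0.0
    for counts in op_counts:
        for op, n in counts.items():
            total_pj += n * ENERGY_PJ[memory][op]
    return total_pj / 1e3


if __name__ == "__main__":
    flash_code = dynamic_energy_nj("flash", CODE_OPS)
    mram_code = dynamic_energy_nj("stt_mram", CODE_OPS)
    single_mram = dynamic_energy_nj("stt_mram", CODE_OPS, DATA_OPS)
    single_sram = dynamic_energy_nj("sram", CODE_OPS, DATA_OPS)
    print(f"code memory: Flash {flash_code:.0f} nJ, STT-MRAM {mram_code:.0f} nJ "
          f"({100 * (1 - mram_code / flash_code):.0f}% lower)")
    print(f"single memory: STT-MRAM {single_mram:.0f} nJ, SRAM {single_sram:.0f} nJ "
          f"({100 * (1 - single_sram / single_mram):.0f}% lower)")
```

With these inputs the code-memory energy is about 797 nJ for Flash and 587 nJ for STT-MRAM (about 26% lower), and the single-memory totals are about 1047 nJ for STT-MRAM and 923 nJ for SRAM (about 12% lower), in line with the figures quoted above.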

Looking at power consumption in sleep mode, scenario ③ is the only one where the memory can be powered down without losing data. Every other scenario requires keeping the volatile memory in retention state: 879 nW for scenarios ① and ②, 13.3 $\mu$W for scenario ④. Moreover, an SRAM's retention state requires keeping a dedicated voltage regulator enabled, which adds another energy cost. For scenarios ① and ②, it is possible to save the SRAM content into the non-volatile memory and restore it at wake-up, but this operation takes time and energy, especially for the Flash memory in scenario ①. Finally, for scenario ④, the application code has to be loaded into SRAM from a non-volatile memory or an external source after each power-up.

Now considering duty-cycled behavior (Figure 1), we can evaluate the minimum sleep period required for scenario ③, which has the lowest leakage power, to compensate for its energy overhead in the active phase compared to scenarios ① and ②. After 148 ms, the energy saved during the sleep phase in scenario ③ compensates for its 130 nJ overhead compared to scenario ① in active mode. After 386 ms, the single STT-MRAM architecture of scenario ③ is more energy efficient than the STT-MRAM + SRAM architecture of scenario ②.
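The break-even reasoning can be sketched numerically. The Python snippet below is a minimal illustration that assumes the only sleep-mode saving is the 879 nW SRAM retention power of scenarios ① and ② (the voltage regulator cost is ignored) and that the STT-MRAM of scenario ③ draws no power when switched off. The function names are hypothetical.

```python
RETENTION_POWER_W = 879e-9  # SRAM retention power in scenarios 1 and 2


def implied_overhead_j(sleep_s, sleep_power_saved_w):
    """Active-phase energy overhead that a given sleep period offsets."""
    return sleep_s * sleep_power_saved_w


def break_even_sleep_s(overhead_j, sleep_power_saved_w):
    """Minimum sleep period needed to offset an active-phase overhead."""
    return overhead_j / sleep_power_saved_w


# Overheads implied by the reported break-even periods of 148 ms and 386 ms.
overhead_vs_1_j = implied_overhead_j(0.148, RETENTION_POWER_W)
overhead_vs_2_j = implied_overhead_j(0.386, RETENTION_POWER_W)

print(f"Overhead vs scenario 1: {overhead_vs_1_j * 1e9:.0f} nJ")
print(f"Overhead vs scenario 2: {overhead_vs_2_j * 1e9:.0f} nJ")
```

With these inputs the implied overheads are about 130 nJ and 339 nJ. For sleep periods longer than the break-even values, the single STT-MRAM architecture is the more energy-efficient choice over a full duty cycle.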

In terms of flexibility, scenarios ③ and ④ are the most interesting because they make it possible to adjust the memory size allocated to application code and data. Moreover, these architectures are simpler than those of scenarios ① and ②.

## **V. CONCLUSION**

We have designed a 128 kB (1 Mb) peSTT-MRAM in 28-nm FD-SOI CMOS with a single 32-bit port SRAM-like interface for low-power embedded applications. With 0.9 pJ/bit read in 5 ns and 3 pJ/bit write in 10 ns, this embedded non-volatile memory is suitable for low-power devices that run at frequencies under 100 MHz. We presented the evaluation of STT-MRAM, SRAM and Flash solutions for different memory architectures (single- and multiple-memory architectures). STT-MRAM is a more interesting solution than Flash, thanks to a lower read energy (26% gain). Moreover, it has a faster, more flexible and more energy-efficient write capability than traditional embedded Flash. When used as the sole memory of a system, the non-volatility of MRAM helps to reach the lowest power consumption in sleep mode, although this solution (one STT-MRAM for both code and data) is not the best for active mode. There is a trade-off between low power consumption in sleep mode and in active mode. To extend this study to the other parts of the system, we plan future work on the integration and evaluation of MRAM at various levels for duty-cycled ULP applications.

## REFERENCES

- M. Zwerg, S. Khanna, and S. Bartling, "Non-volatile logic SoC with software-hardware co-design and integrated supply supervisor for energy harvesting applications," in *Proc. IEEE 60th Int. Midwest Symp. Circuits Syst. (MWSCAS)*, Aug. 2017, pp. 887–889.
- [2] S. Izumi et al., "Normally off ECG SoC with non-volatile MCU and noise tolerant heartbeat detector," *IEEE Trans. Biomed. Circuits Syst.*, vol. 9, no. 5, pp. 641–651, Oct. 2015.
- [3] S. Khanna, S. C. Bartling, M. Clinton, S. Summerfelt, J. A. Rodriguez, and H. P. McAdams, "An FRAM-based nonvolatile logic MCU SoC exhibiting 100% digital state retention at VDD = 0 V achieving zero leakage with <400-ns wakeup time for ULP applications," *IEEE J. Solid-State Circuits*, vol. 49, no. 1, pp. 95–106, Jan. 2014.
- [4] G. Lallement, F. Abouzeid, M. Cochet, J.-M. Daveau, P. Roche, and J.-L. Autran, "A 2.7 pJ/cycle 16 MHz, 0.7 μW deep sleep power ARM cortex-M0+ core SoC in 28 nm FD-SOI," *IEEE J. Solid-State Circuits*, vol. 53, no. 7, pp. 2088–2100, Jul. 2018.
- [5] V. K. Singhal, V. Menezes, S. Chakravarthy, and M. Mehendale, "A 10.5 μA/MHz at 16 MHz single-cycle non-volatile memory access microcontroller with full state retention at 108 nA in a 90 nm process," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2015, pp. 1–3.
- [6] L. Lin, S. Jain, and M. Alioto, "A 595 pW 14 pJ/cycle microcontroller with dual-mode standard cells and self-startup for battery-indifferent distributed sensing," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2018, pp. 44–46.
- STMicroelectronics. STM32L0—ARM Cortex-M0+ Ultra-Low-Power MCUs. Accessed: Oct. 22, 2018. [Online]. Available: https://www.st. com/en/microcontrollers/stm32l0-series.html?querycriteria= productId=SS1817
- [8] STMicroelectronics. STM32F0—ARM Cortex-M0 Microcontrollers. Accessed: Oct. 22, 2018. [Online]. Available: https://www. st.com/en/microcontrollers/stm32f0-series.html?querycriteria= productId=SS1574
- [9] LPC1100 Series: Scalable Entry-Level Microcontrollers (MCUs) Based on Arm Cortex-M0+/M0 Cores/NXP. Accessed: Oct. 22, 2018.
   [Online]. Available: https://www.nxp.com/products/processorsand-microcontrollers/arm-based-processors-and-mcus/lpc-tortex-mmcus/lpc1100-cortex-m0-plus-m0:MC\_1392389687150
- [10] Low Power Short-Range Wireless Products—Nordic Semiconductor. Accessed: Apr. 15, 2019. [Online]. Available: https://www.nordicsemi.com/en/Products/Low%20power%20shortrange%20wireless
- [11] 32-Bit Arm Cortex-M0 PSoC 4. Accessed: Oct. 22, 2018. [Online]. Available: http://www.cypress.com/products/32-bit-arm-cortex-m0-psoc-4
- [12] ARM Cortex M0+ FM0+. Accessed: Oct. 22, 2018. [Online]. Available: http://www.cypress.com/products/fm0-32-bit-arm-cortexm0-microcontroller-mcu-families
- [13] J. S. Meena, S. M. Sze, U. Chand, and T.-Y. Tseng, "Overview of emerging nonvolatile memory technologies," *Nanosc. Res. Lett.*, vol. 9, no. 1, p. 526, 2014.
- [14] L. Cargnini, L. Torres, R. Brum, S. Senni, and G. Sassatelli, "Embedded memory hierarchy exploration based on magnetic random access memory," J. Low Power Electron. Appl., vol. 4, no. 3, pp. 214–230, Aug. 2014.

- [15] D. Saida *et al.*, "1×-to 2×-nm perpendicular MTJ switching at sub-3-ns pulses below 100 μA for high-performance embedded STT-MRAM for sub-20-nm CMOS," *IEEE Trans. Electron Devices*, vol. 64, no. 2, pp. 427–431, Feb. 2017.
- [16] H. Noguchi et al., "4 Mb STT-MRAM-based cache with memoryaccess-aware power optimization and write-verify-write / read-modifywrite scheme," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Jan. 2016, pp. 132–133.
- [17] H. Noguchi *et al.*, "A 3.3 ns-access-time 71.2 μW/MHz 1 Mb embedded STT-MRAM using physically eliminated read-disturb scheme and normally-off memory architecture," in *IEEE Int. Solid-State Circuits Conf.* (*ISSCC*) Dig. Tech. Papers, Feb. 2015, pp. 1–3.
- [18] Y. J. Song et al., "Highly functional and reliable 8 Mb STT-MRAM embedded in 28 nm logic," in *IEDM Tech. Dig.*, Dec. 2016, pp. 27.2.1–27.2.4.
- [19] Y. Lu *et al.*, "Fully functional perpendicular STT-MRAM macro embedded in 40 nm logic for energy-efficient IOT applications," in *IEDM Tech. Dig.*, Dec. 2015, pp. 26.1.1–26.1.4.
- [20] CPU Energy Benchmark—MCU Energy Benchmark—ULPMark— EEMBC Embedded Microprocessor Benchmark Consortium. [Online]. Available: https://www.eembc.org/ulpmark/
- [21] I. Kouznetsov, "Embedded 28-nm charge-trap NVM technology," presented at the Flash Memory Summit, Santa Clara, CA, USA, 2017.



**GUILLAUME PATRIGEON** received the Engineering degree in microelectronics and automatics from Polytech Montpellier, France, in 2015. He is currently pursuing the Ph.D. degree with the Montpellier Laboratory of Informatics, Robotics and Microelectronics (LIRMM), France, with a focus on ultra-low-power embedded systems-on-chip. From 2015 to 2016, he was an Electronics Engineer with the Research and Development Team, ELA Innovation Company, France.

**PASCAL BENOIT** received the Ph.D. degree in microelectronics from the University of Montpellier, France, in 2004, and the Habilitation degree from the Montpellier Laboratory of Informatics, Robotics and Microelectronics, University of Montpellier, in 2015, where he has been a Permanent Associate Professor, since 2005. He joined the Karlsruhe Institute of Technology, University of Karlsruhe, Germany, where he was a Scientific Assistant. He has coauthored more than 130 publications in books, journals, and conference proceedings, and holds five patents. His research is dedicated to the Internet of Things, from smart sensors to gateways, and addresses more specifically energy efficiency and security issues.



**LIONEL TORRES** received the master's and Ph.D. degrees from the University of Montpellier, in 1993 and 1996, respectively. From 1996 to 1997, he was an IP Core Methodology Research and Development Engineer with ATMEL Company. From 1997 to 2004, he was an Assistant Professor with Polytech Montpellier and also with the Montpellier Laboratory of Informatics, Robotics and Microelectronics (LIRMM), University of Montpellier. He was the Head of the Microelectronic Department, LIRMM, from 2007 to 2010, where he has been a Full Professor, since 2004. He is currently the Deputy Head of Polytech Montpellier, where he is in charge of research, industrial, and international relations. Since 2015, he has been the Head of the cluster of excellence NUMEV (Digital and Hardware Solutions and Modelling for the Environment and Life Sciences). He leads several European, national, and industrial projects in his research field. He has coauthored more than 50 journal papers and 150 conference publications, and holds ten patents. His research interests and skills concern system-level architecture, with a specific focus on security and cryptographic applications and non-volatile computing based on emerging technologies.



**SOPHIANE SENNI** received the M.S. and Ph.D. degrees from the University of Montpellier, Montpellier, France, in 2012 and 2015, respectively, where he is currently a Postdoctoral Researcher with the Montpellier Laboratory of Informatics, Robotics and Microelectronics (LIRMM). His research interests include new architectures for non-volatile computing based on magnetic RAM technology.



**GUILLAUME PRENAT** received the degree in engineering from the Grenoble Institute of Technology, France, in 2002, and the Ph.D. degree in analog and mixed-signal testing from the TIMA laboratory, in 2005. In 2006, he joined the spintronics laboratory SPINTEC to take charge of the design activity. He is currently a Researcher with the CEA-SPINTEC Laboratory. He is in charge of the Spintronics IC Design Staff of the laboratory. He has authored or coauthored 59 publications in these areas. He holds eight patents in these areas. He has co-edited the book *Spintronics-Based Computing* along with W. Zhao and has written chapters in several books dealing with spintronics. His research interests include the development of design tools for the hybrid CMOS/magnetic technology and the evaluation of hybrid non-volatile circuits (FPGA, processors, and so on) contributing to circumvent the limits of microelectronics.



**GREGORY DI PENDINA** received the master's degree from the University Joseph Fourier of Grenoble, France, in 2005, and the Ph.D. degree from the University of Grenoble, France.

He first spent 13 years as an Engineer in design verification at the CNRS Unit offering MPW fabrication service. He led research activities from 2008 to 2012, such as hybrid MEMS/CMOS fabrication and hybrid photovoltaic (PV) or organic photovoltaic (OPV) CMOS systems, at the IC and device level. In 2012, he joined Spintec Laboratory as an Engineer and a Researcher, where he is involved in several national and European projects as a Work Package Leader and Scientific Responsible. He is currently an Application Specific Integrated Circuit (ASIC) Designer, specialized in non-volatile applications, based mostly on MRAM technologies using all generations of magnetic tunnel junctions (MTJ). His background comprises full custom and digital IC design, hybrid process design kit (H-PDK) development, design for testing (DFT), and ASIC test and characterization. His research interests include ultra-low power design and radiation hardening using hybrid CMOS/MRAM technologies.

...