Received 5 April 2021; revised 28 May 2021; accepted 22 June 2021. Date of publication 28 June 2021; date of current version 28 September 2021.

*Digital Object Identifier 10.1109/JXCDC.2021.3092769*

# **Multirow Complementary-FET (CFET) Standard Cell Synthesis Framework Using Satisfiability Modulo Theories (SMTs)**

# **CHUNG-KUAN CHENG <sup>1</sup> (Life Fellow, IEEE), CHIA-TUNG HO <sup>2</sup> (Graduate Student Member, IEEE), DAEYEAL LEE <sup>2</sup> (Student Member, IEEE), and BILL LIN <sup>2</sup> (Member, IEEE)**

<sup>1</sup>Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA 92037 USA

<sup>2</sup>Department of Electrical and Computer Engineering, University of California San Diego, La Jolla, CA 92037 USA

CORRESPONDING AUTHOR: C.-T. HO (c2ho@ucsd.edu)

The work of Chung-Kuan Cheng and Bill Lin was supported by NSF under Grant CCF-1564302 and Grant IIS-1956339. This article has supplementary downloadable material available at https://doi.org/10.1109/JXCDC.2021.3092769, provided by the authors.

**ABSTRACT** With the relentless scaling of technology nodes, the track number reduction of conventional (Conv.) cells is starting to reach its limits due to limited routing resources, lateral p-n separations, and performance requirements. As a result, to exploit the benefits of 3-D architectures, complementary-FET (CFET) technology, which stacks P-FET on N-FET or vice versa, has been proposed to release the restriction of p-n separation and reduce in-cell routing congestion by enabling direct p-n connections. However, CFET standard cell (SDC) synthesis demands a holistic reconsideration of the multirow (MR) structure to maximize the cell and block-level area benefits, because the stacked structure and reduced cell height limit the in-cell routing tracks and routability. In this article, we propose a satisfiability modulo theory (SMT) based MR CFET SDC synthesis framework that simultaneously solves place-and-route to minimize the cell area by considering single-row and MR placement together. We enable explorations on upper/lower M0A/PC routing to leverage the shared-and-split structure across cell rows with the proposed MR dynamic complementary pin allocation scheme. We demonstrate that MR 2.5T CFET without and with upper/lower M0A/PC routing reduces the cell area by 16.44% and 20.61% on average, respectively, compared to 3.5T CFET. Moreover, MR 2.5T CFET SDCs achieve 13.43% and 14.40% less block-level area and total wirelength on average compared to 3.5T CFET SDCs.

**INDEX TERMS** Automated cell generation, cell synthesis, complementary-FET (CFET), multirow standard cell, placement, routing, satisfiability modulo theory (SMT), standard cell.

# **I. INTRODUCTION**

**WITH** the relentless scaling of VLSI technology beyond 5 nm, CFET standard cell (SDC) layout scaling of conventional (Conv.) FET structure is limited due to routing congestion, lateral p-n separations, and performance requirements. We are forced to look at 3-D device architecture with disruptive and innovative interconnect to enable area efficiency. Complementary-FET (CFET) technology [1]–[3], which stacks the P-FET on N-FET or vice versa, can relieve in-cell routing congestion of p-n connection such that SDC designers can continue cell layout reduction in sub-7 nm. Fig. [1](#page-1-0) shows an illustration of a Conv. cell structure as well as a CFET cell structure that stacks the P-FET on N-FET. Compared to the Conv. cell architecture, the shared or split gate and source/drain (G/S/D) structure provides flexible local interconnect connections.<sup>[1](#page-0-0)</sup>

Recently, feasible CFET-based SDC layouts have been successfully proposed [2], [4], [5]; therefore, CFET has become one of the promising cell structures in sub-7 nm and beyond. However, the severe in-cell routing congestion and limited routability at 2.5T demand an MR CFET SDC architecture to maximize the cell and block-level area benefits [6]. MR CFET SDC design demands holistic considerations of cell area, FET stacking, pin accessibility, routing

<span id="page-0-0"></span><sup>1</sup>If the G/S/D of P-FET and N-FET shares the same net connection, the G/S/D can be merged and connected to M0. On the contrary, the G/S/D are split and M0 drops tall and short vias to connect P-FET and N-FET, respectively.



**FIGURE 1. Illustration of conventional and CFET structure (top row). CFET shared and split gate, source, and drain structure (bottom row) [2], [4], [5].**

congestion, and block-level area due to the limited routing resources and the exploding conditional design rules of later physical design procedures. These considerations for SDC design rely on an automatic MR SDC layout synthesis framework, which supports MR structure, track number reduction, FET stacking, design rule changes, and so on.

# **A. Conv. SDC SYNTHESIS AUTOMATION**

For single-row (SR) SDC synthesis, several works have reported full automation of cell layout covering transistor-level placement and in-cell routing together [7], [8], but these approaches are not applicable in the multipatterning technologies in sub-5 nm. Also, several SDC synthesis automation works have been proposed for multipatterning technology [9]–[11], but the placement and routing are performed in separate operations. Recently, in [12], they proposed an approach that integrates the placement and routing with dynamic pin allocation interface using satisfiability modulo theories (SMTs) [13]. For MR SDC synthesis, a minimum width transistor placement method for MR structure using SAT has been proposed in [14], but this approach does not guarantee the optimal solution after routing due to the lack of considerations of multipatterning and design rules. Recently, Li *et al.* [15] developed an entire placement and routing flow for synthesizing MR SDCs, but the placement and routing are performed sequentially and the number of cell rows is not optimized in terms of cell area. These works focus on the Conv. cell structure optimizations, and thus, they are not available for CFET cell structure that has stackable P-/N-FET.

# **B. CFET SDC SYNTHESIS AUTOMATION**

A CFET SDC synthesis framework that performs FET place-and-route concurrently with a novel dynamic complementary pin allocation (DCPA) approach has been proposed in [16], [17]. However, these works focus on SR CFET SDC synthesis, and thus, they are not applicable to MR cell area optimization, which considers SR and MR structures together and various inter-row routing options (i.e., M0A/PC layers) in the MR CFET SDC structure.

In this article, we develop an MR CFET SDC synthesis automation framework that supports track number reduction, design rule changes, FET stacking alternatives, and M0A/PC as an inter-row routing option using concurrent FET placement and routing through an MR dynamic pin shape/allocation scheme, resulting in optimized cell layout with an optimum number of cell rows, various CFET SDC architectures, and design rule selections. Our optimized SDC layout has maximized pin accessibility and routability through the proposed routability-driven objectives and constraints. Our main contributions are as follows.

- <span id="page-1-0"></span>1) We develop the MR CFET SDC synthesis framework, including concurrent transistor placement and in-cell routing through a novel MR DCPA (MR-DCPA) scheme to generate optimum CFET SDC layouts across SR, MR, and various track number cell architectures.
- 2) We develop an MR-DCPA scheme to enable explorations of upper/lower M0A/PC for inter-row routing.
- 3) We formulate an integrated constraint satisfaction problem (CSP) for SMT solving, including not only place-and-route but also pin accessibility and design rule-related constraints, resulting in the optimized cell layout across SR, MR, and various track number cell architectures.
- 4) We propose a novel MR cell area objective to minimize the cell area considering SR and MR structures together.
- 5) For CFET cell area scaling, we explore the cell-level metrics when reducing three routing tracks (RTs) to two RTs with/without upper/lower M0A/PC for inter-row routing.
- 6) For block-level area scaling, we perform block-level analysis to explore the block-level area benefits by scaling three RTs to two RTs CFET cell structure with/without upper/lower M0A/PC for inter-row routing.

The remaining sections are organized as follows. Section [II](#page-1-1) describes our CFET SDC synthesis framework. Section [III](#page-5-0) presents our main experiments as scaling to the extreme two RTs CFET architecture. Section [IV](#page-7-0) concludes this article.

# <span id="page-1-1"></span>**II. SIMULTANEOUSLY PLACE-AND-ROUTE FOR CFET SDC SYNTHESIS FRAMEWORK**

We utilize an SMT-based constraints solving methodology for simultaneous place and route of CFET SDCs. In this section, we describe the detailed features of our framework: 1) overview of CFET SDC synthesis framework; 2) CFET cell architecture; 3) MR-DCPA; 4) MR cell area minimization; and 5) multiobjective optimization.



**FIGURE 2. Framework overview.**

#### **A. CFET SDC SYNTHESIS FRAMEWORK OVERVIEW**

Fig. [2](#page-2-0) shows the overview of our framework. Given cell netlist and layout specification, our framework generates an integrated CSP for automating CFET SDC layout, which strictly satisfies transistor placement, in-cell routing, conditional design rules, and pin-accessibility-driven constraints. Inspired by [12] and [16], individual constraints are combined by our novel MR-DCPA constraint. Our framework performs routability-driven lexicographic multiple-objective optimization by implementing: 1) MR cell area minimization; 2) edge-based pin separation (EB-PS); 3) M2 track use objectives; and 4) metal length (ML). We utilize five representative conditional design rules mentioned in [18] and [19], which are minimum area rule (MAR), end-of-line (EOL), via rule (VR), and multipattern-aware design rules (i.e., parallel run length (PRL)/step height rule (SHR)). The notations are shown in Table [1.](#page-2-1)

#### **B. CFET CELL ARCHITECTURE**

Our framework employs a CFET cell architecture and netlist information mentioned in [4], [5], and [20]. Fig. [3](#page-2-2) shows the grid-based placement and routing graph using four RTs P-on-N CFET example. The routing grid consists of four RTs with buried power rails and each layer is defined as unidirectional edges. We adopt supernodes [21] for the pin of FET (i.e., internal pin,  $P_{IN}$ ) or the I/O pin of a standard cell (i.e., external pin,  $P_{EX}$ ). The P-FET and N-FET regions are stacked up on the upper and lower M0A/PC layers, respectively. The M0 layer accesses the upper/lower FET pins through shared-and-split pin shapes, as shown in Fig. [1.](#page-1-0) Our framework supports stacking N-FET on P-FET by swapping the FET-related variables, a different number of RTs by adjusting *h* variable, and MR CFET SDC structures with  $R$  variable as described in  $(1)$ – $(4)$ . For inter-row routing, our framework also supports upper/lower M0A/PC routing, which is introduced in Section [II-C.2.](#page-4-0)

## **C. MULTIROW DYNAMIC COMPLEMENTARY PIN ALLOCATION**

MR-DCPA dynamically constructs the shared and split pin shapes of FETs for optimal in-cell and inter-row routing

#### <span id="page-2-1"></span>**TABLE 1. Notations for CFET cell synthesis framework.**

<span id="page-2-0"></span>



<span id="page-2-2"></span>**FIGURE 3. Grid-based placement, routing graph, and pin shape of P-FET/N-FET using four RTs P-on-N CFET example. The placement and routing grids are extended to MR structure accordingly.**

exploration of MR CFET structure. The MR-DCPA scheme for simultaneous place-and-route follows the same principle as in [12] for interconnecting placement and routing formulas using flow capacity variables [i.e.,  $C_m^n(v, u)$ ]. Here, we introduce the constraints for shared and split pin shapes of FETs and constraints for upper/lower M0A/PC inter-row routing.

<sup>2</sup>The symbol *d* is *L* (left), *R* (right), *F* (front), *B* (back), *U* (up), *D* (down), or a combination of these directions, e.g., *FL* means FrontLeft.



**FIGURE 4. Concept of MR-DCPA for four RTs P-on-N CFET cell structure. $p_1^P$: P-FET gate pin. $p_1^N$: N-FET gate pin. (a) Shared gate/source/drain pin shapes. (b) Split gate/source/drain pin shapes.**

#### 1) SHARED AND SPLIT PIN SHAPES OF FETs

Fig. [4](#page-3-0) shows the concept of MR-DCPA using four RTs P-on-N CFET as an example. When the pins of P-FET and N-FET are located at the same *x*-coordinate [i.e., $x(p_i^P) = x(p_j^N)$], the pin shapes (i.e., shared or split) at the corresponding column in the upper/lower M0A/PC layers are determined by the net information. For example, in Fig. [4\(](#page-3-0)a), if both of the gate pins $p_1^N$ and $p_1^P$ belong to the same net [i.e., $n(p_1^N) = n(p_1^P)$], a shared pin shape on upper/lower PC layers is selected and one of the corresponding flow variables (i.e., $f_m^n$) among four possible M0 access points (i.e., blue squares) is determined by the flow formulation. On the other hand, if each gate pin belongs to a different net [i.e., $n(p_1^N) \neq n(p_1^P)$], MR-DCPA selects one of two possible split pin shapes (i.e., top or bottom M0 access point for N-FET), as shown in Fig. [4\(](#page-3-0)b). Meanwhile, when the upper FET pin has a connection to the power rail (i.e., VDD or VSS), MR-DCPA selects the split pin shape without blocking the power rail connection of the upper FET pin. The expressions of shared and split pin shapes are shown as follows.

*Shared Pin-Shape Expressions:*

<span id="page-3-1"></span>
$$
\begin{gathered}
\bigwedge_{y=y_i^f,\dots,y_i^l-1} \left( f_m^n(v_{x,y,l}, v_{x,y+1,l}) = 1 \right) \\
l = \{PC^U/M0A^U,\ PC^L/M0A^L\} \\
n = n(p_{i,t}^P) = n(p_{j,s}^N), \quad x = x_t^P + i
\end{gathered} \tag{1}
$$

*Split Pin-Shape Expressions: Top Access for Lower FET (Type 1):*

<span id="page-3-3"></span>
$$
\begin{aligned}
& f_m^{n_1}(v_{x,y_i^l-1,l_1}, v_{x,y_i^l,l_1}) = 0 \\
& \wedge \left( \bigwedge_{y=y_i^f,\dots,y_i^l-2} (f_m^{n_1}(v_{x,y,l_1}, v_{x,y+1,l_1}) = 1) \right) \\
& \wedge \left( \bigwedge_{y=y_i^f,\dots,y_i^l-1} (f_m^{n_2}(v_{x,y,l_1}, p_{j,s}^L) = 0) \right) \\
& \wedge \left( f_m^{n_1}(v_{x,y_i^l,l_0}, p_{i,t}^U) = 0 \right)
\end{aligned} \tag{2}
$$

<span id="page-3-0"></span>*Bottom Access for Lower FET (Type 2):*

<span id="page-3-2"></span>
$$
\begin{aligned}
& f_m^{n_1}(v_{x,y_i^f,l_1}, v_{x,y_i^f+1,l_1}) = 0 \\
& \wedge \left( \bigwedge_{y=y_i^f+1,\dots,y_i^l-1} (f_m^{n_1}(v_{x,y,l_1}, v_{x,y+1,l_1}) = 1) \right) \\
& \wedge \left( \bigwedge_{y=y_i^f+1,\dots,y_i^l} (f_m^{n_2}(v_{x,y,l_0}, p_{j,s}^L) = 0) \right) \\
& \wedge \left( f_m^{n_1}(v_{x,y_i^f,l_1}, p_{i,t}^U) = 0 \right)
\end{aligned} \tag{3}
$$

*No Access for Lower FET (Type 3):*

<span id="page-3-4"></span>
$$
\begin{gathered}
\bigwedge_{y=y_i^f,\dots,y_i^l-1} (f_m^{n_1}(v_{x,y,l_1}, v_{x,y+1,l_1}) = 1)
\wedge \left( \bigwedge_{y=y_i^f,\dots,y_i^l} (f_m^{n_2}(v_{x,y,l_0}, p_{j,s}^L) = 0) \right) \\
l_0 = PC^L/M0A^L, \quad l_1 = PC^U/M0A^U \\
n_1 = n(p_{i,t}^U), \quad n_2 = n(p_{j,s}^L) \\
x = x_t^U + i, \quad \begin{cases} U = P, L = N, & \text{if } P\text{-on-N} \\ U = N, L = P, & \text{if } N\text{-on-P.} \end{cases}
\end{gathered} \tag{4}
$$

Algorithm [1](#page-4-1) uses the SMT if-then-else structure to generate the constraint that selects between shared and split pin shapes of FETs in MR structures. For each cell row, $y_i^f$ and $y_i^l$ are set for the corresponding shared and split pin-shapes selection (Lines 1 and 2). If N-FET and P-FET pins have the same net information, the shared pin shape is selected (Lines 3 and 4). Otherwise, the split pin shape is selected (Lines 6–31). The split pin shape consists of three types on upper/lower M0A/PC layers. Type1 and Type2 represent top ($y = h$) and bottom ($y = 1$) accesses for lower FET, respectively. If the net of lower FET pin is VSS or VDD, Type3 is used since there is no connection from M0 to lower FET pin (Lines 14 and 20). When the net of upper FET pin is VDD or VSS, Type2 is always selected in the odd

# **Algorithm 1** Shared and Split Pin-Shapes Selection

<span id="page-4-1"></span>*/\*Input: Given G(V,E); Output: MR-DCPA constraints; StackFlag*=*P-on-N/N-on-P.\*/*<br>
1: **for** $r = 1, 2, \ldots, R$ **do**<br>
2: &emsp;Set $y_i^f = (r-1)h + 1$, $y_i^l = rh$;<br>
3: &emsp;**if** $(n(p_{i,t}^P) = n(p_{j,s}^N)) \wedge (x(p_{i,t}^P) = x(p_{j,s}^N))$ **then** $\triangleright$ Shared Pin-Shape<br>
4: &emsp;&emsp;*Exp*. [\(1\)](#page-3-1) for P-FET and N-FET access.<br>
5: &emsp;**else if** $(n(p_{i,t}^P) \neq n(p_{j,s}^N)) \wedge (x(p_{i,t}^P) = x(p_{j,s}^N))$ **then** $\triangleright$ Split Pin-Shape<br>
6: &emsp;&emsp;**if** $(StackFlag = \text{P-on-N})$ **then** $\triangleright$ P-on-N CFET<br>
7: &emsp;&emsp;&emsp;**if** $(n(p_{i,t}^P) = \text{VDD})$ **then** $\triangleright$ VDD net at Upper FET pin<br>
8: &emsp;&emsp;&emsp;&emsp;**if** $r\%2 = 1$ **then**<br>
9: &emsp;&emsp;&emsp;&emsp;&emsp;*Exp*. [\(3\)](#page-3-2) for access Lower N-FET.<br>
10: &emsp;&emsp;&emsp;&emsp;**else if** $r\%2 = 0$ **then**<br>
11: &emsp;&emsp;&emsp;&emsp;&emsp;*Exp*. [\(2\)](#page-3-3) for access Lower N-FET.<br>
12: &emsp;&emsp;&emsp;&emsp;**end if**<br>
13: &emsp;&emsp;&emsp;**else if** $(n(p_{j,s}^N) = \text{VSS})$ **then** $\triangleright$ VSS net at Lower FET pin<br>
14: &emsp;&emsp;&emsp;&emsp;*Exp*. [\(4\)](#page-3-4) for access Upper P-FET.<br>
15: &emsp;&emsp;&emsp;**else**<br>
16: &emsp;&emsp;&emsp;&emsp;*Exp*. [\(2\)](#page-3-3) $\vee$ *Exp*. [\(3\)](#page-3-2) for access P-FET and N-FET.<br>
17: &emsp;&emsp;&emsp;**end if**<br>
18: &emsp;&emsp;**else if** $(StackFlag = \text{N-on-P})$ **then** $\triangleright$ N-on-P CFET<br>
19: &emsp;&emsp;&emsp;**if** $(n(p_{i,t}^P) = \text{VDD})$ **then** $\triangleright$ VDD net at Lower FET pin<br>
20: &emsp;&emsp;&emsp;&emsp;*Exp*. [\(4\)](#page-3-4) for access Upper N-FET.<br>
21: &emsp;&emsp;&emsp;**else if** $(n(p_{j,s}^N) = \text{VSS})$ **then** $\triangleright$ VSS net at Upper FET pin<br>
22: &emsp;&emsp;&emsp;&emsp;**if** $r\%2 = 1$ **then**<br>
23: &emsp;&emsp;&emsp;&emsp;&emsp;*Exp*. [\(2\)](#page-3-3) for access Lower P-FET.<br>
24: &emsp;&emsp;&emsp;&emsp;**else if** $r\%2 = 0$ **then**<br>
25: &emsp;&emsp;&emsp;&emsp;&emsp;*Exp*. [\(3\)](#page-3-2) for access Lower P-FET.<br>
26: &emsp;&emsp;&emsp;&emsp;**end if**<br>
27: &emsp;&emsp;&emsp;**else**<br>
28: &emsp;&emsp;&emsp;&emsp;*Exp*. [\(2\)](#page-3-3) $\vee$ *Exp*. [\(3\)](#page-3-2) for access P-FET and N-FET.<br>
29: &emsp;&emsp;&emsp;**end if**<br>
30: &emsp;&emsp;**end if**<br>
31: &emsp;**end if**<br>
32: **end for**

cell row for P-on-N stacking and even cell row for N-on-P stacking (Lines 9 and 25). Type1 is always selected in the even cell row for P-on-N stacking and odd cell row for N-on-P stacking (Lines 11 and 23). Otherwise, Type1 or Type2, whichever satisfies all the constraints and produces the optimal solution, is selected (Lines 16 and 28).
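The branching of Algorithm 1 can be restated as a small decision function. The following is only an illustrative sketch, not the authors' Perl/SMT-LIB generator: the function names `row_track_range` and `select_pin_shape`, the string labels for the shapes, and the net names in the usage are hypothetical, and the function assumes the P-FET and N-FET pins already share the same column.

```python
from typing import List, Tuple


def row_track_range(r: int, h: int) -> Tuple[int, int]:
    """First and last track index of cell row r (1-based), as set in Line 2."""
    return (r - 1) * h + 1, r * h


def select_pin_shape(net_p: str, net_n: str, r: int,
                     stack: str = "P-on-N") -> List[str]:
    """Candidate pin shapes for a P/N pin pair located in the same column.

    Returns ["shared"] for Exp. (1), or a list of split types:
    "type1" = Exp. (2), "type2" = Exp. (3), "type3" = Exp. (4).
    Two entries mean the solver is free to choose either one.
    """
    if stack not in ("P-on-N", "N-on-P"):
        raise ValueError("stack must be 'P-on-N' or 'N-on-P'")
    if net_p == net_n:
        return ["shared"]                            # Lines 3-4
    odd_row = (r % 2 == 1)
    if stack == "P-on-N":
        if net_p == "VDD":                           # power net on upper FET pin
            return ["type2"] if odd_row else ["type1"]   # Lines 8-11
        if net_n == "VSS":                           # power net on lower FET pin
            return ["type3"]                         # Line 14
    else:
        if net_p == "VDD":                           # power net on lower FET pin
            return ["type3"]                         # Line 20
        if net_n == "VSS":                           # power net on upper FET pin
            return ["type1"] if odd_row else ["type2"]   # Lines 22-25
    return ["type1", "type2"]                        # Lines 16 and 28
```

For instance, under these assumptions `select_pin_shape("VDD", "Y", 2)` yields the top-access shape, because the power rail of an even row in P-on-N stacking sits on the opposite side compared to an odd row.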

#### <span id="page-4-0"></span>2) M0A/PC ROUTING CONSTRAINTS

The routing grid is extended to upper/lower M0A/PC layers for simultaneous place-and-route using flow capacity variables [i.e., $C_m^n(v, u)$]. We consider the interaction between FET pin connection and FET stacking when using upper/lower M0A/PC for routing and formulate the following constraints.

#### a: ROUTING CONSTRAINT I

The upper/lower M0A/PC layers at the column in active FET can only be used for routing by the same net of the corresponding FET pin as described in the following equation:

$$
\begin{gathered}
\bigwedge_{n \neq n(p^F)} \left( \bigwedge_{y=y_i^f,\dots,y_i^l-1} (f_m^n(v_{x,y,l}, v_{x,y+1,l}) = 0) \right) \\
x = x(p^F), \qquad l = \begin{cases} PC^U/M0A^U, & \text{if } (F = P \wedge \text{P-on-N}) \vee (F = N \wedge \text{N-on-P}) \\ PC^L/M0A^L, & \text{if } (F = N \wedge \text{P-on-N}) \vee (F = P \wedge \text{N-on-P}). \end{cases}
\end{gathered} \tag{5}
$$
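To make Routing Constraint I concrete, the sketch below enumerates which flow variables would be forced to zero in one FET column. It is a minimal illustration under stated assumptions: the helper names `fet_layer` and `blocked_flows`, the layer label strings, and the tuple encoding of a flow variable are all hypothetical and are not taken from the framework itself.

```python
from typing import List, Tuple

Vertex = Tuple[int, int, str]          # (column x, track y, layer label)
Flow = Tuple[str, Vertex, Vertex]      # (net, from-vertex, to-vertex)


def fet_layer(fet_type: str, stack: str) -> str:
    """Layer selected by the case split of Eq. (5) for a 'P' or 'N' FET."""
    if fet_type not in ("P", "N") or stack not in ("P-on-N", "N-on-P"):
        raise ValueError("unexpected FET type or stacking flag")
    upper_type = "P" if stack == "P-on-N" else "N"
    return "PC_U/M0A_U" if fet_type == upper_type else "PC_L/M0A_L"


def blocked_flows(nets: List[str], pin_net: str, x: int,
                  y_first: int, y_last: int, layer: str) -> List[Flow]:
    """Vertical flows in column x that must be 0: every net other than the
    FET pin's own net, on every edge (y, y+1) inside the cell row."""
    return [(n, (x, y, layer), (x, y + 1, layer))
            for n in nets if n != pin_net
            for y in range(y_first, y_last)]
```

With a two-track row, each foreign net contributes exactly one blocked edge per active FET column, so the number of such constraints grows linearly with the number of nets.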

#### b: ROUTING CONSTRAINT II

If the upper FET pin connects to power rail (i.e., VDD or VSS), the lower layers (i.e., M0A/PC) at the same column cannot be used for inter-row routing as described in [\(6\)](#page-4-2)

<span id="page-4-2"></span>
$$
\begin{gathered}
\bigwedge_{\forall n \in N,\, n \neq n(p^U)} f_m^n(v_{x, y_i^l, l}, v_{x, y_{i+1}^l, l}) = 0, \quad l = \text{PC}^L/\text{M0A}^L \\
\text{if } (n(p^U) = n(\text{PR}_i)) \wedge (x = x(p^U)) \wedge (y_i^f \leq y(p^U) \leq y_i^l).
\end{gathered} \tag{6}
$$

#### **D. MULTIROW CELL AREA MINIMIZATION**

We introduce a novel MR cell area minimization objective, which considers the solutions of SR and MR structures simultaneously and generates the minimum cell area layouts with the optimum number of cell rows (Opt. CR). The maximum cell width is defined as the right-most vertical track occupied by an FET among all cell rows, as shown in [\(7\)](#page-4-3). Then, if any FET is placed in the $i$th cell row or in a cell row with an index larger than $i$, $W_i$ is set to $W_{\text{max}}$. Otherwise, $W_i$ is 0, as described in [\(8\)](#page-4-3). With (8), we can minimize the cell area while considering SR and MR structures simultaneously.

<span id="page-4-3"></span>
$$
W_{\text{max}} = \max\{x_t + w_t \mid t \in T\} \tag{7}
$$

$$
W_i = \begin{cases} W_{\text{max}}, & \text{if } i = 1 \\ W_{\text{max}}, & \text{if } \left( y_i^f \le y_t \le y_i^l \right) \forall t \in T \\ W_{\text{max}}, & \text{if } W_j = W_{\text{max}} \forall j > i \\ 0, & \text{otherwise.} \end{cases} \tag{8}
$$
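The row-width rule can be checked on a toy placement with a few lines of Python. This is a sketch that follows the prose description of (7) and (8) (a row counts toward the area if it, or any row with a larger index, holds an FET); the function name `row_widths`, the `(x, w, y)` tuple encoding of an FET, and the sample placements in the usage are assumptions made for illustration.

```python
from typing import List, Tuple


def row_widths(fets: List[Tuple[int, int, int]], h: int,
               num_rows: int) -> Tuple[int, List[int], int]:
    """Return (W_max, [W_1..W_R], sum of W_i) for a placement.

    fets: one (x, w, y) tuple per FET, with left column x, width w,
          and track index y (1-based across all cell rows).
    h:    number of routing tracks per cell row.
    """
    w_max = max(x + w for x, w, _ in fets)                     # Eq. (7)
    occupied = [
        any((r - 1) * h + 1 <= y <= r * h for _, _, y in fets)
        for r in range(1, num_rows + 1)
    ]
    widths = []
    for i in range(1, num_rows + 1):
        # Row i is charged W_max if it is the first row, or if row i or any
        # row above it contains an FET (prose reading of Eq. (8)).
        used = i == 1 or any(occupied[i - 1:])
        widths.append(w_max if used else 0)
    return w_max, widths, sum(widths)
```

A placement that fits entirely in the first row is charged one row of width, while a narrower placement that reaches the third row is charged three rows, which is how a single objective can trade width against row count.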

## **E. MULTIOBJECTIVE OPTIMIZATION (OPTIMAL PRIORITY)**


Our framework has multiple objectives associated with the placement and routing problems of standard cell layout design. The first objective is the cell area, which is defined as the sum of $W_i$ over all cell rows, as shown in (9). The second objective is EB-PS [17], which minimizes the summation of column- and edge-based pin costs [i.e., $SC(p)$ and $EC(p)$] of each SDC I/O pin in [\(10\)](#page-5-1). The third objective is the number of *M*2 tracks used for in-cell routing in [\(11\)](#page-5-1) [16]. The last objective is the weighted sum of routed metal segments and vias (i.e., total ML), as shown in [\(12\)](#page-5-1). In practice, the cell size has the highest priority because it has a direct impact on the area of a whole chip. The EB-PS should be considered as the second objective because inaccessible pins cannot be routed regardless of the routing resources [16]. Then, the number of *M*2 tracks is treated as a more important metric than total ML to maximize the routability by reserving upper routing resources. Therefore, our framework simultaneously optimizes these multiple objectives in the lexicographic order given in [\(13\)](#page-5-1) through an optimization feature of Optimization Modulo Theories (OMT) [13].

<span id="page-5-1"></span>**Min: Multi-Row Placement (Cell Area)** =

$$
\sum_{i=1,\ldots,R} W_i \tag{9}
$$

**Min: Pin-accessibility (EB-PS)** =

$$
\begin{gathered}
\sum_{p \in P_{\text{EX}}} \text{SC}(p) + \text{EC}(p) \\
\text{SC}(p) = \bigvee_{e_{v,q} \in E_k^M,\ k \in d_{\text{int}}(x(p)),\ q \in P_{\text{EX}},\ q \neq p} e_{v,q}^{n(q)} \\
\text{EC}(p) = \sum_{\substack{e_{v,u}^{n(p)},\ e_{v,u}^{n} \in E_k^M,\ k \in d_{\text{int}}(x(p)), \\ n \in N_{\text{EX}},\ n \neq n(p)}} e_{v,u}^n \\
N_{\text{EX}} = \{n(p) \mid p \in P_{\text{EX}}\}
\end{gathered} \tag{10}
$$

**Min: Routability (#M2 Track)** =

$$
\sum_{k=1}^{N} \bigvee_{e_{v,u} \in E_{k}^{M2}} m_{v,u} \tag{11}
$$

$$
\text{Min}: \text{Total Metal Length} = \sum_{e_{v,u} \in E} (w_{v,u} \times m_{v,u}) \tag{12}
$$

**Lexicographic Optimization:** (a) Cell Size, (b) EB-PS, (c) #M2 Track, (d) Total ML. (13)
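The effect of the lexicographic order in (13) can be illustrated without a solver: a candidate layout is preferred over another as soon as it wins on the highest-priority objective in which they differ. The sketch below relies on Python's built-in tuple comparison; the `Cost` type, its field names, and the three candidate cost vectors in the usage are hypothetical and only mimic the ordering, not the OMT search itself.

```python
from typing import Iterable, NamedTuple


class Cost(NamedTuple):
    """Objective values in the priority order of (13)."""
    cell_area: int      # (a) sum of W_i
    eb_ps: int          # (b) pin-accessibility cost
    m2_tracks: int      # (c) number of used M2 tracks
    metal_length: int   # (d) weighted metal length and vias


def lexicographic_best(candidates: Iterable[Cost]) -> Cost:
    """Pick the candidate that is minimal in lexicographic order.

    NamedTuple instances compare field by field, so min() already
    implements the (a) > (b) > (c) > (d) priority.
    """
    return min(candidates)
```

Note that a candidate with a much shorter metal length still loses if it has a larger cell area, which mirrors the stated priority of cell size over all routing metrics.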

#### <span id="page-5-0"></span>**III. EXPERIMENTAL RESULTS**

Our framework is implemented in Perl/SMT-LIB 2.0 standard-based formula and executed on a workstation with 2.4-GHz Intel Xeon E5-2620 CPU and 256-GB memory. The single-threaded SMT solver Z3 [13] (version 4.8.5) is used to produce the optimized solution in the proposed framework.

#### <span id="page-5-6"></span>**A. EXPERIMENTAL SETUP**

#### 1) SDC GENERATION

We use ASAP7 [20] SDC SPICE netlists as inputs of CFET SDCs. We adopt the same number of fingers from [20] for SDC layout generation in the following experiments. To evaluate the block-level power–performance–area (PPA) in early DTCO exploration, we select 30 representative SDCs [2], which are specified in Table [2.](#page-6-0) The number of FETs in each cell varies from 2 to 24. For SDC architecture, we generate 3.5T and 2.5T CFET SDCs with three and two RTs considering solution space with three cell rows structure through our framework, respectively. We use P-on-N stacking for all experiments since the cell metrics are not sensitive to the conditional design rules settings at 2.5T structure<sup>[3](#page-5-2)</sup> [17]. Here, we report the ML and number of vias separately since the parasitic resistances of metal and via are different [22] in Table [2.](#page-6-0)<sup>[4](#page-5-3)</sup> The 2.5T is the limit for CFET SDC structure since the split structure needs at least two access points from M0, as shown in Fig. [1.](#page-1-0) In addition, VR relaxation is required for split structure in two RTs structure. The conditional design rules [18], [19] are as follows: MAR/EOL/VR/PRL/SHR $= 1/1/0/1/2$. The minimum I/O pin opening constraint [16] is set to 3 for pin accessibility.

#### 2) BLOCK-LEVEL P&R

Three open-source RTL designs [23], namely M0 Core, M1 Core, and AES, which have 17k, 20k, and 14k instances, respectively, are adopted. We perform the block-level analysis through a place-and-route suite [24].

For BEOL, we set the contacted poly pitch (CPP), M0/M2 pitch,<sup>[5](#page-5-4)</sup> and the number of masks for each BEOL layer according to [25]. For M1, VIA12, and M2 layers, the grid-based conditional design rules' parameters are applied at block level using the same approach in [17]. The metals' pitch and width of layers above M2 are set based on [26]. Here, we adjust the offset of M4 to provide two horizontal RTs in each cell row to alleviate the limited routing resource in the extreme two RTs cell architecture.

The power delivery network consists of top power meshes (M8 and M9), intermediate power stripes (M3), and standard cell rails. The top power mesh is designed as spaces are allowed. Then, the power is delivered from M3, which is $4\times$ wider than signal wires, to M1 and M1 to Buried Power Rail (BPR) using stacked vias and SuperVia models [27], respectively. The M3 power stripes for the BPR standard cell rail are placed every 64 CPPs [28]. We use a threshold of 300 design rule violations (#DRVs),<sup>[6](#page-5-5)</sup> which is depicted as a red horizontal line in the figures representing the block-level P&R results, to measure the valid block-level area.
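The way the #DRVs threshold determines the valid block-level area can be sketched as a simple filter over a sweep of place-and-route runs. This is an illustrative reading of the criterion, not the actual flow: the function name `min_valid_area`, the `(area, drvs)` encoding of a run, and all sample numbers in the usage are hypothetical, and a run with exactly 300 #DRVs is assumed to count as valid.

```python
from typing import List, Optional, Tuple


def min_valid_area(runs: List[Tuple[float, int]],
                   drv_threshold: int = 300) -> Optional[float]:
    """Smallest block area whose run stays at or below the #DRVs threshold.

    runs: one (block_area, num_drvs) pair per place-and-route run.
    Returns None when no run satisfies the threshold.
    """
    valid = [area for area, drvs in runs if drvs <= drv_threshold]
    return min(valid) if valid else None
```

For example, a sweep of `[(100.0, 12), (95.0, 180), (90.0, 450)]` would report 95.0 as the minimum valid area, since the densest run exceeds the threshold.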

The experiments are organized as follows.

1) *Exp. III-B (Scaling to Extreme two RTs With Inter-Row Routing Options):* We compare the cell area, ML, #Vias, and #M2 Track with/without upper/lower M0A/PC for inter-row routing as scaling 3.5T-to-2.5T CFET structure using adaptive cell row number for cell area minimization.

<span id="page-5-2"></span><sup>3</sup>The difference of average cell area, #M2 Track, ML, and #Vias of P-on-N and N-on-P 2.5T CFET structures are 0%, 0%, 3.53%, and 2.22%, respectively.

<span id="page-5-3"></span><sup>4</sup>In the experiments, the weightings of via are  $4\times$  metal grid considering the parasitic resistance [22] and the weightings of M2 cost  $4\times$  metal grids for routability in objective [\(12\)](#page-5-1).

<span id="page-5-4"></span><sup>5</sup>The M0/M2 pitches and widths are 24 and 12 nm with two masks. The CPP and M1 pitch are 42 nm. The directions of metal layers are all unidirectional.

<span id="page-5-5"></span><sup>6</sup>As a common industrial practice, once the number of DRVs increases beyond 300, the block layout is deemed too troublesome to fix with laborious engineering change orders (ECOs).

<span id="page-6-0"></span>**TABLE 2. Experimental statistics of 3.5T CFET, 2.5T CFET, and 2.5T CFET with upper/lower M0A/PC routing (2.5T M0A/PC-R). CW: cell width (CPP). Opt. CR: optimum cell row. ML: metal length (not including vias). #Vias: number of vias. #M2 Track: number of used M2 tracks. CPP: contact poly pitch. Cell Area Impr.** = **[(3.5T CW** × **3.5T Opt. CR - 2.5T/(2.5T M0A/PC-R) CW** × **2.5T/(2.5T M0A/PC-R) Opt.CR)/(3.5T CW** × **3.5T Opt. CR)].**



2) *Exp. III-C (Block-Level Area Scaling):* We explore the minimum valid block-level areas with 300 #DRVs threshold for 3.5T CFET and 2.5T CFET with/without upper/lower M0A/PC routing.

# <span id="page-6-1"></span>**B. SCALING TO EXTREME TWO RTs WITH INTER-ROW ROUTING OPTIONS**

We explore the CFET SDC cell area benefits as reducing the number of tracks using the proposed MR CFET SDC synthesis framework with/without upper/lower M0A/PC for inter-row routing options.

#### 1) INTER-ROW ROUTING WITH METAL LAYERS ONLY

We compare the cell area, #M2 Tracks, ML, and #Vias of 3.5T CFET and 2.5T CFET with metal layers (i.e., M1) for inter-row routing in Table [2.](#page-6-0) The average runtime per cell is around 45 min for 2.5T CFET and 24 min for 3.5T CFET. When scaling from the 3.5T to the 2.5T CFET cell architecture, the average cell area is reduced by 16.44%, while the average ML, #Vias, and #M2 Track increase by 8.40%, 47.57%, and 1.23, respectively. The increase in ML, #Vias, and #M2 Track is caused by fewer in-cell routing resources and by the constraints of design rules and pin accessibility in the 2.5T CFET cell structure.

#### 2) ENABLE UPPER/LOWER M0A/PC FOR INTER-ROW ROUTING

We enable the inter-row routing with upper/lower M0A/PC in 2.5T CFET structure (2.5T M0A/PC-R CFET) and compare the cell area, #M2 Tracks, ML, and #Vias of 2.5T M0A/PC-R CFET with 3.5T CFET in Table [2.](#page-6-0) The average runtime per cell is around 43 min for 2.5T M0A/PC-R CFET. Compared to 3.5T CFET, 2.5T M0A/PC-R CFET achieves 20.61% and 1.33% smaller cell area and ML on average, with increases of 23.85% and 0.80 in the average #Vias and #M2 Track, respectively. Compared to 2.5T CFET, 2.5T M0A/PC-R CFET provides 4.03%, 8.98%, 16.08%, and 20.48% smaller cell area, ML, #Vias, and #M2 Track on average, respectively. This shows that enabling M0A/PC for routing can reduce not only cell size but also parasitic resistance in SDC.

Finally, Fig. [5\(](#page-7-1)a) summarizes the average cell area benefit of the representative 30 SDCs by track number reduction (i.e., 3.5T–2.5T) and M0A/PC routing option. Note that the cell areas of AOI21x1, AOI22x1, OAI21x1, and OAI22x1 with 2.5T CFET are still larger than 3.5T CFET due to the severe in-cell routing congestion. With M0A/PC layer routing enabled (i.e., 2.5T M0A/PC-R CFET) to maximize the area benefit of track number reduction, all SDC areas are smaller than 3.5T CFET.
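The cell-area arithmetic behind these comparisons can be reproduced from the formula given in the caption of Fig. 5 (CellArea = CW × CPP × CH × M2Pitch with CPP = 42 nm and M2Pitch = 24 nm). The sketch below is illustrative only: the function names are hypothetical, CH is assumed here to be the total cell height in tracks (track count per row times the number of cell rows), and the cell widths used in the usage are invented rather than taken from Table 2.

```python
CPP_NM = 42.0        # contacted poly pitch, from the Fig. 5 caption
M2_PITCH_NM = 24.0   # M2 pitch, from the Fig. 5 caption


def cell_area_nm2(cw_cpp: float, ch_tracks: float) -> float:
    """CellArea = CW x CPP x CH x M2Pitch, in square nanometers.

    cw_cpp:    cell width in CPPs.
    ch_tracks: total cell height in tracks (assumed: tracks per row
               multiplied by the number of cell rows).
    """
    return cw_cpp * CPP_NM * ch_tracks * M2_PITCH_NM


def relative_improvement(reference: float, candidate: float) -> float:
    """Fractional reduction of candidate relative to reference."""
    return (reference - candidate) / reference
```

As a hypothetical case, a single-row 3.5T cell of 4 CPPs has an area of 14112 nm², and a single-row 2.5T cell of 5 CPPs has 12600 nm², so the narrower track count still wins by roughly 10.7% despite the wider cell.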

# **C. BLOCK-LEVEL AREA SCALING**

We compare the block-level areas of 3.5T CFET SDCs, 2.5T CFET SDCs, and 2.5T M0A/PC-R CFET SDCs from Exp. [III-B](#page-6-1) using three open-source RTL designs [23]: M0 Core, M1 Core, and AES.<sup>[7](#page-6-2)</sup> M2–M7 are used for

<span id="page-6-2"></span><sup>7</sup>The worst negative slacks of M0 Core, M1 Core, and AES are carefully adjusted between 50 and −50 ps for a fair comparison in the block-level analysis.



**FIGURE 5. Cell and block-level area benefits by track reduction and M0A/PC routing: (I) 3.5T CFET (black), (II) 2.5T CFET (orange), and (III) 2.5T M0A/PC-R CFET (blue). (a) Cell area of representative 30 SDCs. (b) Block-level P&R results of M0 Core. The core area is improved by 13.20% by track number reduction and using M0A/PC for routing. The red arrow shows the 64 CPPs M3 power stripe grid. CellArea = CW × CPP × CH × M2Pitch, CPP = 42 nm, and M2Pitch = 24 nm.**

<span id="page-7-2"></span>**TABLE 3. Block-level placement and route results of 3.5T CFET, 2.5T CFET, and 2.5T CFET M0A/PC-R. #Inst: number of instances. SDC area: standard cell area, Total WL: total wirelength, Min. Area: minimum valid block-level area. Area Impr.** = **(Min. Area of 3.5T CFET - Min. Area of 2.5T CFET/(2.5T M0A/PC-R CFET))/(Min. Area of 3.5T CFET).**



block-level routing. The design rules of BEOLs and power delivery network are described in Section [III-A.](#page-5-6) In addition, to avoid dropping the SuperVia [27] on the upper/lower M0A/PC layers, which are used by inter-row routing, for connecting the BPR in the block level, we extract upper/lower M0A/PC layers as blockages.

The block-level P&R results of 3.5T CFET, 2.5T CFET, and 2.5T M0A/PC-R CFET are shown in Table [3.](#page-7-2) The valid minimum block-level area is obtained using a #DRVs threshold of 300 [17]. Compared to 3.5T CFET, the average minimum block-level area of M0 Core, M1 Core, and AES is reduced by 6.29% for 2.5T CFET and by 13.43% for 2.5T M0A/PC-R CFET; the average total wirelength is also reduced by 7.65% for 2.5T CFET and by 14.40% for 2.5T M0A/PC-R CFET. Fig. [5\(](#page-7-1)b) shows that 2.5T M0A/PC-R CFET achieves a 13.20% smaller core area than 3.5T CFET for the M0 Core design. This area benefit comes from the further cell area reduction obtained by connecting shared-and-split structures across cell rows through the M0A/PC layers.
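The two block-level quantities used here, the valid minimum area under a #DRVs threshold and the Area Impr. ratio defined in the caption of Table 3, can be sketched as follows. This is only an illustration under stated assumptions: the trial data are made up, the helper names are hypothetical, and treating a run as valid when its #DRVs is at most 300 (rather than strictly below 300) is an assumption, since the text does not state the comparison.

```python
DRV_THRESHOLD = 300  # #DRVs threshold from the text; "<=" is an assumption


def min_valid_area(trials, threshold=DRV_THRESHOLD):
    """Smallest block area whose P&R run stays within the DRV threshold.

    trials: iterable of (block_area, num_drvs) pairs, one per P&R run.
    """
    valid = [area for area, drvs in trials if drvs <= threshold]
    if not valid:
        raise ValueError("no P&R run satisfies the DRV threshold")
    return min(valid)


def area_improvement(ref_min_area, new_min_area):
    """Area Impr. = (Min. Area of reference - Min. Area of new) / reference."""
    return (ref_min_area - new_min_area) / ref_min_area


# Made-up sweeps of (block area in arbitrary units, #DRVs); NOT paper data.
trials_35t = [(1000.0, 12), (950.0, 180), (900.0, 450)]
trials_25t = [(900.0, 40), (850.0, 290), (800.0, 900)]

ref = min_valid_area(trials_35t)
new = min_valid_area(trials_25t)
print(f"Area Impr. = {100 * area_improvement(ref, new):.2f}%")
```

Averaging `area_improvement` over M0 Core, M1 Core, and AES would give the kind of average reduction reported above; the actual per-design values are those in Table 3.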

<span id="page-7-0"></span>In summary, we show that 2.5T M0A/PC-R CFET can not only achieve 20.61% smaller cell area on average but also provide 13.43% and 14.40% less block-level area and total wirelength on average, respectively, compared to 3.5T CFET SDCs. Leveraging the direct connection of shared-and-split structures between cell rows with M0A/PC layers can maximize the cell and block-level area benefits of reducing cell height to 2.5T.

#### <span id="page-7-1"></span>**IV. CONCLUSION AND FUTURE WORKS**

We propose an SMT-based MR CFET SDC synthesis framework that supports track number reduction, design rule selection, MR architectures, and different stacking options for cell-level and block-level area exploration. The novel MR-DCPA scheme enables the use of upper/lower M0A/PC layers for inter-row routing, which maximizes the advantage of the CFET shared-and-split structure across cell rows. In addition, the novel MR cell area objective explores SR and MR structures together and generates the minimum cell area. In the Supplementary Material, we demonstrate that the proposed MR cell area objective achieves 20.69%, 8.37%, and 3.33% smaller SDC areas on average than the TR, DR, and SR [17] structures, respectively. We then demonstrate that enabling upper/lower M0A/PC layers for inter-row routing achieves a 20.61% smaller cell area on average when scaling the cell structure from 3.5T to 2.5T. Moreover, we show that 2.5T CFET with M0A/PC layers for inter-row routing achieves 13.43% less block-level area and 14.40% less total wirelength on average than 3.5T CFET.

Important directions for future research include incorporating timing and power information of CFET for further PPA exploration at both the cell level and the block level, developing a CFET SDC synthesis framework that considers emerging monolithic 3-D integration [29], [30], and developing a process-variation-aware CFET SDC synthesis framework at both the FET and interconnect levels, for example by adding objectives/constraints related to reliability (e.g., the layout-dependent aging effect [33] and double vias for EM).

#### **REFERENCES**

- [1] L. Liebmann, "Design technology co-optimization for 3nm and beyond," in *IEDM Tech. Dig.* Short Course, Technol. Scaling EUV Era Beyond, 2019.
- [2] L. Liebmann *et al.*, "DTCO acceleration to fight scaling stagnation," *Proc. SPIE*, vol. 11328, Mar. 2020, Art. no. 113280C.
- [3] R.-H. Kim *et al.*, "IMEC N7, N5 and beyond: DTCO, STCO and EUV insertion strategy to maintain affordable scaling trend," *Proc. SPIE*, vol. 10588, Mar. 2018, Art. no. 105880N.
- [4] J. Smith, "Design technology co-optimization approaches for integration and migration to CFET and 3D logic," in *Proc. Surf. Preparation Cleaning Conf.*, 2019. [Online]. Available: https://www.linx-consulting.com/wpcontent/uploads/2019/04/02-01-J\_Smith-TEL-DTCO.pdf
- [5] S. Sherazi *et al.*, "CFET standard-cell design down to 3track height for node 3 nm and below," *Proc. SPIE*, vol. 10962, Mar. 2019, Art. no. 1096206.
- [6] Y.-X. Chiang *et al.*, "Designing and benchmarking of double-row height standard cells," in *Proc. ISVLSI*, Jul. 2018, pp. 64–69.
- [7] M. Guruswamy *et al.*, "Cellerity: A fully automatic layout synthesis system for standard cell libraries," in *Proc. DAC*, 1997, pp. 327–332.
- [8] A. M. Ziesemer and R. A. D. L. Reis, "Simultaneous two-dimensional cell layout compaction using MILP with ASTRAN," in *Proc. IEEE Comput. Soc. Annu. Symp. VLSI*, Jul. 2014, pp. 350–355.
- [9] P. Cremer *et al.*, "Automatic cell layout in the 7 nm era," in *Proc. ISPD*, 2017, pp. 99–106.
- [10] Y.-L. Li *et al.*, "NCTUcell: A DDA-aware cell library generator for Fin-FET structure with implicitly adjustable grid map," in *Proc. DAC*, 2019, pp. 1–6.
- [11] K. Jo, S. Ahn, J. Do, T. Song, T. Kim, and K. Choi, "Design rule evaluation framework using automatic cell layout generator for design technology cooptimization," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 27, no. 8, pp. 1933–1946, Aug. 2019.
- [12] D. Park *et al.*, "SP&R: Simultaneous placement and routing framework for standard cell synthesis in sub-7 nm," in *Proc. ASP-DAC*, Jan. 2020, pp. 345–350.
- [13] N. Bjørner *et al.*, "vZ-an optimizing smt solver," in *Proc. Int. Conf. Tools Algorithms Construct. Anal. Syst.* Berlin, Germany: Springer, 2015, pp. 194–199. [Online]. Available: https://link.springer.com/chapter/10.1007/978-3-662-46681-0\_14#citeas
- [14] T. Iizuka, M. Ikeda, and K. Asada, "Exact minimum-width multi-row transistor placement for dual and non-dual CMOS cells," in *Proc. IEEE Int. Symp. Circuits Syst.*, May 2006, p. 4.
- [15] Y.-L. Li *et al.*, "MCell: Multi-row cell layout synthesis with resource constrained MAX-SAT based detailed routing," in *Proc. ICCAD*, 2020, pp. 1–8.
- [16] C.-K. Cheng *et al.*, "A routability-driven complimentary-fet (CFET) standard cell synthesis framework using smt," in *Proc. ICCAD*, 2020, pp. 1–8.
- [17] C.-K. Cheng, C.-T. Ho, D. Lee, B. Lin, and D. Park, "Complementary-FET (CFET) standard cell synthesis framework for design and system technology co-optimization using SMT," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 29, no. 6, pp. 1178–1191, Jun. 2021.
- [18] D. Park *et al.*, "Grid-based framework for routability analysis and diagnosis with conditional design rules," *IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.*, vol. 39, no. 12, pp. 5097–5110, Dec. 2020.
- [19] D. Park *et al.*, "ROAD: Routability analysis and diagnosis framework based on SAT techniques," in *Proc. ISPD*, 2019, pp. 65–72.
- [20] L. T. Clark *et al.*, "ASAP7: A 7-nm finFET predictive process design kit," *Microelectron. J.*, vol. 53, pp. 105–115, Jul. 2016.
- [21] I. Kang *et al.*, "Fast and precise routability analysis with conditional design rules," in *Proc. SLIP*, 2018, p. 4.
- [22] L. T. Clark, V. Vashishtha, D. M. Harris, S. Dietrich, and Z. Wang, "Design flows and collateral for the ASAP7 7nm FinFET predictive process design kit," in *Proc. IEEE Int. Conf. Microelectronic Syst. Educ. (MSE)*, May 2017, pp. 1–4.
- [23] (2020). *OpenCores: Open-Source IP Cores*. [Online]. Available: https://opencores.org/
- [24] (2020). *Cadence Innovus User Guide*. [Online]. Available: http://www.cadence.com
- [25] S. Y. Sherazi *et al.*, "Standard-cell design architecture options below 5 nm node: The ultimate scaling of FinFET and Nanosheet," *Proc. SPIE*, vol. 10962, May 2019, Art. no. 1096202.
- [26] (2020). *LEF/DEF Language Reference*. [Online]. Available: http://www.ispd.cc/contests/18/lefdefref.pdf
- [27] A. Gupta *et al.*, "High-aspect-ratio ruthenium lines for buried power rail," in *Proc. IEEE Int. Interconnect Technol. Conf. (IITC)*, Jun. 2018, pp. 4–6.
- [28] B. Chava *et al.*, "DTCO exploration for efficient standard cell power rails," *Proc. SPIE*, vol. 10588, Mar. 2018, Art. no. 105880B.
- [29] K. Chang *et al.*, "Design automation and testing of monolithic 3D ICs: Opportunities, challenges, and solutions," in *Proc. ICCAD*, Nov. 2017, pp. 805–810.
- [30] B. Chehab *et al.*, "Design-technology co-optimization of sequential and monolithic CFET as enabler of technology node beyond 2 nm," *Proc. SPIE*, vol. 11614, Feb. 2021, Art. no. 116140D.
- [31] B. Yu, X. Xu, S. Roy, Y. Lin, J. Ou, and D. Z. Pan, "Design for manufacturability and reliability in extreme-scaling VLSI," *Sci. China Inf. Sci.*, vol. 59, no. 6, pp. 1–23, Jun. 2016.
- [32] Y. Ma *et al.*, "Self-aligned double patterning (SADP) compliant design flow," *Proc. SPIE*, vol. 8327, Mar. 2012, Art. no. 832706.
- [33] P. Ren *et al.*, "Adding the missing time-dependent layout dependency into device-circuit-layout co-optimization-new findings on the layout dependent aging effects," in *IEDM Tech. Dig.*, Mar. 2015, pp. 7–11.