
Hybrid Gate-Level Leakage Model for Monte Carlo Analysis on Multiple GPUs


Example of the proposed model and accuracy comparison result.

Abstract:

This paper proposes a hybrid gate-level leakage model for use with the Monte Carlo (MC) analysis approach, which combines a lookup table (LUT) model with a first-order exponential-polynomial model (first-order model, herein). For the process parameters having strong nonlinear relationships with the logarithm of leakage current, the proposed model uses the LUT approach for the sake of modeling accuracy. For the other process parameters, it uses the first-order model for increased efficiency. During the library characterization for each type of logic gate, the proposed approach determines the process parameters for which it will use the LUT model. It also determines, based on a user-defined threshold, the number of LUT data points that maximizes analysis efficiency with acceptable accuracy. The proposed model was implemented for gate-level MC leakage analysis using three graphics processing units. In experiments, the proposed approach exhibited average errors of <5% in both mean and standard deviation with reference to SPICE-level MC leakage analysis. In comparison, MC analysis with the first-order model exhibited errors of more than 90%. In terms of runtime, the proposed hybrid approach took only two to five times longer than MC analysis with the first-order model. In comparison with the full LUT model, the proposed hybrid model was up to one hundred times faster while increasing the average errors by only 3%. Finally, the proposed approach completed a leakage analysis of an OpenSparc T2 core of 4.5 million gates with a runtime of .
Published in: IEEE Access ( Volume: 2)
Page(s): 183 - 194
Date of Publication: 20 May 2017
Electronic ISSN: 2169-3536

SECTION I.

Introduction

Owing to the rapid growth of mobile application markets, power consumption has become one of the cardinal factors for VLSI chip designs. In particular, leakage power consumption, or static power consumption, accounts for almost one-half of the overall system power consumption. As the leakage power is consumed even while a chip is in an idle state, it directly affects the battery life of mobile devices. Thus, it is very important to reduce the leakage power of mobile VLSI chips, and, therefore, accurate leakage estimation has become one of the most important steps of mobile chip designs.

As semiconductor process technology scales down, process uncertainty grows and the variation in process parameters increases. To deal with the variation of process parameters, a worst-case corner analysis [1] has been widely used. This approach can be used successfully when the process variation is not too serious and the die-to-die (D2D) variation is the main contributor [2]. However, the continuous scaling down of semiconductor process technology has worsened the process parameter variation to a serious level, and leakage variation has also become considerably severe in state-of-the-art process technology. In addition, the within-die (WID) variation has continuously increased and has become comparable in size with the D2D variation [2], [3]. Therefore, a worst-case corner analysis of leakage has become overly pessimistic in many designs, which invalidates the analysis results [2], [4]–[6], particularly during gate-level design.

To address the problem of overly pessimistic leakage analysis, a first-order exponential-polynomial leakage current model (the first-order model) [7], which represents the leakage power under process variations, has been proposed for use with gate-level designs. The first-order model represents the logarithmic value of the leakage current as a linear function of the process parameters, such as channel length, oxide thickness, and threshold voltage. In addition, many approaches to a leakage analysis of gate-level VLSI designs, which use the first-order model, have been proposed. These include analytic gate-level statistical leakage estimation (SLE) methods [4], [5], [7]–[16] and Monte Carlo (MC) simulation-based gate-level leakage estimation methods [17], [18].

The first-order model is very efficient, and has been successfully used to analyze the leakage current of a VLSI design at the gate-level under process variations. However, as the process technology scaled down to below 65 nm [19], the nonlinear relationship between certain process parameters and the logarithmic value of the leakage current became much stronger. Consequently, the accuracy of the first-order model was strongly questioned for use with state-of-the-art technology [5], [6], [19], [20], and it exhibited significant errors compared to the results of circuit level analyses, obtained using the BSIM4 transistor model [19], [21].

To overcome the weakness of the first-order leakage model, which became clear with deep submicron process technology, a higher-order exponential-polynomial model [6] and a look-up table (LUT)-based gate leakage model [20] have been proposed. These models are intended for MC-based gate-level leakage analysis. The approach in [6] models the logarithmic value of the leakage current as a polynomial of gate channel length and threshold voltage; it uses a first-order expression for threshold voltage and a high-order expression for gate channel length. However, [6] did not deal with other important process variations such as oxide thickness [4], [5]. Thus, this model is more accurate than the first-order model when there are only two variation sources, namely gate channel length and threshold voltage variations. Compared to the higher-order exponential-polynomial model, the LUT-based model [20] represents the nonlinearity more accurately by representing the logarithmic value of the leakage current using a table. However, in this model, the leakage current needs to be estimated by interpolating the values in the LUT during analysis, and the number of computations for the interpolation increases at the rate of 2^{n} for piecewise linear interpolation, where n is the number of process parameters. This model is therefore computationally too complex to be applied to a large number of process parameters. Because the leakage current model is evaluated repeatedly for each logic gate during leakage current analysis, the high computational complexity of this model can be a heavy burden for today's large VLSI designs, which have hundreds of millions of logic gates. In addition, it is necessary to divide the ranges of the process parameters into many regions to maintain accuracy, which may increase the LUT to a burdensome size.

This paper proposes a hybrid leakage current model, developed for the gate-level MC analysis of VLSI designs. The proposed model combines the LUT-based model [20] with the first-order model, and takes advantage of the strengths of both. For accuracy, the proposed approach treats the process parameters that have strong nonlinear relationships with the logarithm of the leakage current as nonlinear process parameters and uses an LUT approach for them. The other process parameters are regarded as linear process parameters, for which the proposed approach uses a first-order model for efficiency.

The accuracy and the efficiency of the proposed model depend on the number of parameters which are considered nonlinear. In addition, the number of LUT data points used for the nonlinear parameters affects the accuracy and the efficiency of the proposed model. Thus, it is very important to determine the nonlinear parameters appropriately and the number of data points for the LUT of those parameters. When characterizing the leakage of each logic gate type, the proposed approach adaptively selects the process parameters which should be handled as nonlinear and it determines the number of their LUT data points for each input condition, based on the user-defined error threshold, so that it can obtain the maximum efficiency while maintaining acceptable accuracy.

The remainder of this paper is organized as follows. Section II briefly describes the existing leakage current models. Section III describes the proposed leakage current model and its characterization method in detail. Section IV presents the implementation of the MC simulation with the proposed model on a multiple NVIDIA GPU environment using a CUDA programming environment [22]. Section V presents the performance evaluation of the proposed approach in terms of accuracy and efficiency, and a comparison with existing well-known approaches. Finally, Section VI summarizes the paper and provides some concluding remarks.

SECTION II.

Background

A. First-Order Exponential-Polynomial Leakage Current Model

State-of-the-art SLE methods use the first-order exponential-polynomial model (the first-order model) [7], which is based on the assumption that major leakage mechanisms such as sub-threshold and gate tunneling leakages are exponentially affected by the process parameter variations. The first-order model of the leakage current approximates the polynomial exponent as a first-order linear model, as shown in (1).

I_{leak}=\exp\left(a_{0}+\sum_{i=1}^{n}a_{i}X_{i}+a_{n+1}R_{n+1}\right) \qquad (1)

where a_{0} through a_{n+1} denote the fitting coefficients determined through characterization, and X_{i} denotes a global source of variation that is correlated with parameters in other leakage terms [5]. The X_{i} include the D2D variation parameters and the correlated WID variation parameters. R_{n+1} is the sum of the varying process parameters that are not correlated with the other leakage parameters. The state-of-the-art SLE methods assume that all X_{i} and R_{n+1} are standard normal random variables. Because the exponent in (1) is then a linear combination of normal random variables, the existing SLE methods model the leakage current as a lognormal random variable.
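
As a minimal sketch of (1), the following Python functions evaluate the first-order model for one set of parameter values and draw MC samples from it. The function names and any coefficient values passed to them are hypothetical and serve only as illustration; they do not come from a characterized library.

```python
import math
import random

def first_order_leakage(a0, a, x, a_r, r):
    """Evaluate (1): exp(a0 + sum_i a[i] * x[i] + a_r * r)."""
    exponent = a0 + sum(ai * xi for ai, xi in zip(a, x)) + a_r * r
    return math.exp(exponent)

def sample_leakage(a0, a, a_r, num_samples, seed=0):
    """Draw MC samples with all X_i and R taken as standard normal variables."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        x = [rng.gauss(0.0, 1.0) for _ in a]
        r = rng.gauss(0.0, 1.0)
        samples.append(first_order_leakage(a0, a, x, a_r, r))
    return samples
```

Because the exponent is a sum of normal variables, every sample is strictly positive and the samples follow a lognormal distribution.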

B. LUT-Based Exponent Model

The authors of [20] proposed an LUT-based model for the exponent in (1) and demonstrated the accuracy of their model using two interpolation methods: piecewise linear (PWL) and cubic spline interpolations. The LUT stores the natural-logarithmic values of the leakage currents for the given sampling points and the values in the LUT are pre-calculated using a characterization process. Intermediate points between the sampling points can be calculated from the LUT using interpolation.

The PWL interpolation calculates the exponent value of an intermediate point on the straight line between the two sampling points adjacent to it. Cubic spline interpolation uses a special type of third-order piecewise polynomial interpolant called a spline, and it matches the first and second derivatives of two adjacent splines at their intersection point. Fig. 1 shows the first-order model and the PWL interpolation.

Fig. 1. Examples of first-order and PWL interpolations.

Multi-dimensional interpolation can be performed through recursive iteration. Fig. 2 shows an example of a two-dimensional interpolation. We assume that we have two process parameters, X and Y, and an LUT with 3 × 3 sample points \left({x_{i},y_{j}}\right) for i, j= 0, 1, and 2; we want to calculate a value for point \left({x,y}\right). First, interpolation with respect to X is performed, and we obtain three interpolated points \left({x,y_{i}}\right) for i=0, 1, and 2. Subsequently, interpolation over parameter Y is performed, and we obtain an interpolated value for \left({x,y}\right).

Fig. 2. Multi-dimensional interpolation.
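
The two-pass procedure described above can be sketched in Python as follows. This is an illustrative sketch, not the implementation of [20]; the function names and the table layout (table[i][j] holding the value at (x_i, y_j)) are assumptions made for the example.

```python
def pwl_1d(xs, vs, x):
    """Piecewise linear interpolation (extrapolation outside the range)."""
    # Pick the segment [xs[k], xs[k + 1]] that brackets x, clamped to the ends.
    k = 0
    while k < len(xs) - 2 and x > xs[k + 1]:
        k += 1
    t = (x - xs[k]) / (xs[k + 1] - xs[k])
    return vs[k] + t * (vs[k + 1] - vs[k])

def pwl_2d(xs, ys, table, x, y):
    """Two-dimensional PWL interpolation; table[i][j] is the value at (xs[i], ys[j])."""
    # Pass 1: interpolate over X for every y_j, giving the values at (x, y_j).
    column = [pwl_1d(xs, [table[i][j] for i in range(len(xs))], x)
              for j in range(len(ys))]
    # Pass 2: interpolate those values over Y to reach (x, y).
    return pwl_1d(ys, column, y)
```

Pass 1 mirrors the text by interpolating at every y_j; only the two values that bracket y actually influence the result, which is why a recursive formulation needs just 2^{n} table entries in n dimensions.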

Clearly, the LUT-based leakage model is much more accurate than the first-order model even when PWL interpolation is applied. However, the LUT model is incompatible with conventional SLE methods, and only the MC-based leakage analysis can currently handle the LUT model.

SECTION III.

Proposed Leakage Model and its Characterization

Although the LUT-based leakage current model [20] is reasonably accurate, its computational complexity can be a significant burden for the MC simulation when a number of varying process parameters are present. In practice, not all process parameters have a nonlinear relationship with the logarithmic value of the leakage current.

In this section, we present a novel leakage current model that combines the LUT-based leakage and first-order models.

A. Hybrid Leakage Current Model

The amount of leakage current in a cell is a function of the input state of the cell [23]. The proposed model expresses the leakage current of cell l as (2) when n+m process parameters are considered.

I_{l}=\sum_{\forall state\left(i\right)}P_{i}I_{l}^{i}=\sum_{\forall state\left(i\right)}P_{i}\exp\left\{f_{l}^{i}\left(X_{1},\ldots,X_{n+m}\right)\right\} \qquad (2)

where P_{i} is the probability of input state i of cell l, and I_{l}^{i} is the leakage current value of cell l for input state i. Similar to the first-order model, the proposed model is based on the assumption that the major leakage mechanisms are exponentially affected by the varying process parameters [16]. Accordingly, the proposed model fits f_{l}^{i}\left(X_{1},\ldots,X_{n+m}\right), which is the logarithmic value of I_{l}^{i}, instead of I_{l}^{i} itself.

Let X_{k} indicate the varying process parameters, which are normalized to a zero mean for k=1,\ldots,n+m. Among them, let X_{1},\ldots,X_{n} be n varying process parameters that have a strong nonlinear relationship with the logarithmic value of the leakage current (nonlinear process parameters), and let X_{n+1},\ldots,X_{n+m} be m varying process parameters with high linearity (linear process parameters).

The proposed model approximates f_{l}^{i}\left({X_{1},\ldots,X_{n+m}}\right) with an LUT of the first-order models. It combines an n -dimensional LUT for nonlinear process parameters X_{1},\ldots,X_{n}, and the first-order model for linear process parameters X_{n+1},\ldots,X_{n+m} to approximate f_{l}^{i}\left({X_{1},\ldots,X_{n+m}}\right) for an i-th input state of cell l in (2), and uses the PWL interpolation (extrapolation) to obtain the value of the point of interest.

Let \big\{y_{i}^{j}\big\} for j=1,\ldots,d_{i} be the set of LUT index values for nonlinear process parameter X_{i} for i=1,\ldots,n, where d_{i} is the number of data points in an LUT for X_{i}. For each LUT index \big(y_{1}^{j_{1}},\ldots,y_{n}^{j_{n}}\big), the proposed model approximates f_{l}^{i}\left(X_{1},\ldots,X_{n+m}\right) using the first-order model of the linear process parameters, which is pre-characterized while the nonlinear process parameters X_{i} are set to their index values y_{i}^{j_{i}}. Therefore, the proposed model contains \prod\limits_{i=1}^{n}d_{i} first-order models of the linear process parameters. For each data point required for interpolation, we can obtain the desired logarithmic value for \big(y_{1}^{j_{1}},\ldots,y_{n}^{j_{n}},x_{n+1},\ldots,x_{n+m}\big) by substituting the linear process parameters X_{i} for i=n+1,\ldots,n+m with the linear process parameter values x_{i}.

Table I shows an example LUT of the proposed model when n is given as 2 and a 3×3 table is used for X_{1} and X_{2}. As shown in the table, the LUT index values for X_{1} are -0.11, 0.02, and 0.07, and the index values for X_{2} are -0.13, -0.07, and 0.13. The proposed model has nine first-order polynomial equations for X_{3},\ldots,X_{2+m}.

TABLE I Example Look-Up Table of the Proposed Model.

For example, the (1,1) value of the LUT in Table I, corresponding to the index value \left(y_{1}^{1}=-0.11,y_{2}^{1}=-0.13\right), is obtained by computing the first-order model equation for varying process parameters X_{3},\ldots,X_{2+m}, which is pre-characterized with X_{1}=-0.11 and X_{2}=-0.13. The corresponding first-order model equation is given in the \left(1,1\right) element of Table I. Here, a_{0}^{-0.11,-0.13} represents the mean value of \left.\log\left(I_{l}^{i}\right)\right\vert_{X_{1}=-0.11,X_{2}=-0.13} for X_{3},\ldots,X_{2+m} when X_{1} and X_{2} are fixed at -0.11 and -0.13, respectively, and a_{k}^{-0.11,-0.13} indicates the fitting coefficients for X_{k} for k=3,\ldots,2+m when X_{1}=-0.11 and X_{2}=-0.13.

If the process parameter values for X_{1} and X_{2} are given as -0.05 and 0.1, respectively, the four LUT values located in \left(1,2\right), \left(1,3\right), \left(2,2\right), and \left(2,3\right) are needed for interpolation, and these values can be obtained by evaluating the first-order models in those cells. The desired \log\left(I_{l}^{i}\right) value can be obtained by performing two-dimensional PWL interpolation using these four values, as described in Section II-B, which can be easily implemented using a recursive function call.
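
The worked example can be traced with a short Python sketch. The LUT index values are those quoted from Table I; the coefficients in HYPOTHETICAL_LUT, the choice m = 2, and all function names are assumptions made for illustration, since the actual table entries are not reproduced in the text.

```python
X1_INDEX = [-0.11, 0.02, 0.07]   # LUT index values for X1 (Table I)
X2_INDEX = [-0.13, -0.07, 0.13]  # LUT index values for X2 (Table I)

def bracket(index_values, x):
    """Return the zero-based positions (lo, hi) of the two neighbors of x."""
    lo = 0
    while lo < len(index_values) - 2 and x > index_values[lo + 1]:
        lo += 1
    return lo, lo + 1

def first_order(coeffs, linear_values):
    """coeffs = (a0, a3, ..., a_{2+m}); returns a0 + sum_k a_k * x_k."""
    return coeffs[0] + sum(a * x for a, x in zip(coeffs[1:], linear_values))

def hybrid_log_leakage(lut, x1, x2, linear_values):
    """lut[i][j] holds the first-order coefficients of Table I cell (i+1, j+1)."""
    i_lo, i_hi = bracket(X1_INDEX, x1)
    j_lo, j_hi = bracket(X2_INDEX, x2)
    # Evaluate the four first-order models that surround (x1, x2).
    f = {(i, j): first_order(lut[i][j], linear_values)
         for i in (i_lo, i_hi) for j in (j_lo, j_hi)}
    # Interpolate over X1 first, then over X2 (two-dimensional PWL).
    t1 = (x1 - X1_INDEX[i_lo]) / (X1_INDEX[i_hi] - X1_INDEX[i_lo])
    t2 = (x2 - X2_INDEX[j_lo]) / (X2_INDEX[j_hi] - X2_INDEX[j_lo])
    at_lo = f[(i_lo, j_lo)] + t1 * (f[(i_hi, j_lo)] - f[(i_lo, j_lo)])
    at_hi = f[(i_lo, j_hi)] + t1 * (f[(i_hi, j_hi)] - f[(i_lo, j_hi)])
    return at_lo + t2 * (at_hi - at_lo)

# Hypothetical coefficients (a0, a3, a4) for each of the nine cells, m = 2.
HYPOTHETICAL_LUT = [[(-18.0 + 4.0 * y1 + 2.0 * y2, 0.5, -0.3)
                     for y2 in X2_INDEX] for y1 in X1_INDEX]
```

For X_{1} = -0.05 and X_{2} = 0.1, bracket returns the zero-based positions (0, 1) and (1, 2), which are the one-based cells (1,2), (1,3), (2,2), and (2,3) named in the text.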

Because the first-order model for X_{n+1},\ldots,X_{n+m} (which consists of m+1 terms) is necessary to compute each LUT value, the size of the model is given as (3).

size=\left(m+1\right)\times\prod\limits_{i=1}^{n}d_{i} \qquad (3)

In general, the cubic spline interpolation discussed in [20] shows more accurate results than the PWL interpolation for the same number of data points. However, if n data points are used in the interpolation, we should compute the derivatives of all n points for cubic spline interpolation. On the other hand, PWL interpolation only requires two data points, which are neighbors of the point of interest. The proposed method uses PWL interpolation to minimize the computational overhead resulting from the interpolation.

A one-dimensional interpolation requires two data points, and therefore, an n-dimensional interpolation requires 2^{n} data points. Each required data point is obtained by computing the first-order model for the linear parameters. This is clearly a significant burden for an MC simulation when n is large. In practice, however, the number of nonlinear varying process parameters is not large enough to make the MC simulation impractical. Moreover, even when only the one or two process parameters that have the most strongly nonlinear relationships with f_{l}^{i}\left(X_{1},\ldots,X_{n+m}\right) in (2) are treated as interpolating parameters, the proposed model improves the accuracy without a large burden in terms of computation time.
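
The storage and per-sample cost stated above can be made concrete with a small Python sketch. The function names are hypothetical, and the multiply-add count is only a rough estimate that ignores the interpolation arithmetic itself.

```python
from math import prod

def model_size(d, m):
    """Number of stored coefficients, as in (3): (m + 1) * prod(d_i)."""
    return (m + 1) * prod(d)

def evaluations_per_sample(n):
    """First-order models evaluated for one n-dimensional PWL lookup."""
    return 2 ** n

def multiply_adds_per_sample(n, m):
    """Rough cost estimate: 2^n first-order models with m products each."""
    return evaluations_per_sample(n) * m
```

For the 3×3 example of Table I with m = 2, model_size([3, 3], 2) gives 27 stored coefficients and each lookup evaluates four first-order models; with no interpolating parameters the model degenerates to a single first-order model with m + 1 coefficients.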

B. Gate Library Characterization

Unfortunately, similar to the conventional LUT-based models, it is difficult to estimate the maximum error bound and determine the appropriate number of data points (LUT index values) for each LUT parameter. Moreover, it is difficult to choose the LUT index values that minimize the error and determine which process parameters should be considered for the LUT. In particular, it is nearly impossible to find a global optimum solution for the overall parameter space.

Instead of finding a global optimum solution, the proposed characterization method inevitably finds a local optimum solution using a given error threshold on the sample space. The proposed method estimates the leakage current values while one process parameter varies in the sampling range and the other parameters are fixed at their nominal values (without a process variation for the other process parameters). The error of the first-order model is then estimated, and the LUT index values are determined through optimization. This characterization is performed for each input state of a gate. The detailed procedure is described as follows:

For each process parameter X_{i}: 1) let S be a set of sampling points s_{k} for k=1,\ldots,l, where l is the number of sampling points; 2) estimate the reference leakage current value I\left(s_{k}\right) for each sampling point s_{k}; 3) find a_{0} and a_{1} for the first-order model a_{0}+a_{1}X_{i} using the sampling data I\left(s_{k}\right); 4) estimate the maximum absolute relative error as in (4),

{\rm Error}=\max\limits_{s_{k}\in S}\left\vert\frac{I\left(s_{k}\right)-\exp\left(a_{0}+a_{1}s_{k}\right)}{I\left(s_{k}\right)}\right\vert \qquad (4)

and 5) if the error in (4) is greater than the user-defined error threshold, mark X_{i} as an interpolating process parameter; otherwise, mark X_{i} as a linear parameter.
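
Steps 3) to 5) can be sketched in Python as follows. The text does not state how a_{0} and a_{1} are fitted, so the least-squares fit in the logarithmic domain used here is an assumption, as are the function names.

```python
import math

def fit_first_order(samples, currents):
    """Least-squares fit of log(I) = a0 + a1 * s (an assumed fitting method)."""
    n = len(samples)
    logs = [math.log(c) for c in currents]
    mean_s = sum(samples) / n
    mean_l = sum(logs) / n
    sxx = sum((s - mean_s) ** 2 for s in samples)
    sxy = sum((s - mean_s) * (v - mean_l) for s, v in zip(samples, logs))
    a1 = sxy / sxx
    return mean_l - a1 * mean_s, a1

def max_relative_error(samples, currents, a0, a1):
    """Maximum absolute relative error of the fit, as in (4)."""
    return max(abs((i_ref - math.exp(a0 + a1 * s)) / i_ref)
               for s, i_ref in zip(samples, currents))

def classify_parameter(samples, currents, threshold):
    """Return 'interpolating' if the first-order error exceeds the threshold."""
    a0, a1 = fit_first_order(samples, currents)
    error = max_relative_error(samples, currents, a0, a1)
    return "interpolating" if error > threshold else "linear"
```

Here samples would hold the sampling points s_{k} of one parameter and currents the reference values I(s_{k}) obtained with the other parameters at their nominal values.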

For each interpolating process parameter X_{i}, the proposed method finds the minimum subset P of S to serve as the set of LUT index values, under the constraint that the error does not exceed the given error threshold, using a conventional optimization algorithm such as a genetic algorithm or a full search. The optimization problem can be formulated as (5),

\begin{aligned}
{\rm minimize}\quad & d_{i}=\sum_{k=1}^{l}x_{k}\\
{\rm s.t.}\quad & x_{k}=0\ {\rm or}\ 1\ {\rm for}\ k=1,\ldots,l\\
& P=\left\{s_{k}\ {\rm such\ that}\ x_{k}=1\right\}\\
& V=\left\{\log I\left(s_{k}\right)\ {\rm for}\ s_{k}\in P\right\}\\
& \max\limits_{s_{k}\in S}\left\vert\frac{I\left(s_{k}\right)-\exp\left(F\left(P,V,s_{k}\right)\right)}{I\left(s_{k}\right)}\right\vert\leq{\rm Error}_{threshold}
\end{aligned} \qquad (5)

where F\left(P,V,s_{k}\right) is a function that performs interpolation to find the value at s_{k}, i.e., the value of the underlying function V=f\left(P\right) at the query point s_{k}. In addition, x_{k} is a decision variable: sampling point s_{k} is chosen as an LUT index value when x_{k}=1. Set P is the set of LUT index values, and set V is the set of the corresponding logarithmic values of the reference leakage currents. In the above procedure, we simply use the maximum absolute relative error as the error metric. Other error metrics such as the squared sum of errors (SSE) can also be used.
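
A minimal Python sketch of (5) using a full search is given below. It assumes PWL interpolation for F and tries subsets in order of increasing size, so the first feasible subset is a minimum one. The function names are hypothetical, and the exhaustive enumeration is practical only for small sampling sets; for larger sets a heuristic such as the genetic algorithm mentioned above would be needed.

```python
import math
from itertools import combinations

def pwl(points, values, x):
    """PWL interpolation (extrapolation outside the range) of values at x."""
    k = 0
    while k < len(points) - 2 and x > points[k + 1]:
        k += 1
    t = (x - points[k]) / (points[k + 1] - points[k])
    return values[k] + t * (values[k + 1] - values[k])

def lut_error(samples, currents, chosen):
    """Constraint of (5): maximum relative error over all sampling points in S."""
    p = [samples[k] for k in chosen]
    v = [math.log(currents[k]) for k in chosen]
    return max(abs((i_ref - math.exp(pwl(p, v, s))) / i_ref)
               for s, i_ref in zip(samples, currents))

def select_lut_indices(samples, currents, threshold):
    """Full search for the smallest subset P of S that meets the threshold."""
    positions = range(len(samples))
    for size in range(2, len(samples) + 1):        # smallest d_i first
        for chosen in combinations(positions, size):
            if lut_error(samples, currents, chosen) <= threshold:
                return [samples[k] for k in chosen]
    return list(samples)                            # fall back to all of S
```

The sketch requires at least two index values because PWL interpolation needs two neighboring points.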

After the optimization, n interpolating parameters, i.e., X_{i} for i=1,\ldots,n, and the LUT index values for each interpolating parameter X_{i}, are determined. The corresponding LUT is n-dimensional, and the number of elements in the LUT is given by \prod\limits_{i=1}^{n}{d_{i}}. For each LUT element, the first-order model is characterized for X_{n+1},\ldots,X_{n+m}.

SECTION IV.

Implementation of the Proposed Model for MC Analysis on a Multiple GPU Environment Using CUDA

We implemented the proposed hybrid gate leakage model for the MC analysis under a multiple GPU environment using the NVIDIA CUDA platform [22]. The overall flow of the implemented MC analysis (Fig. 3) is similar to that of [18], but we modified the flow of [18] to consider the spatial correlation and utilize multiple GPUs. In Fig. 3, the white boxes represent the host-side processes (processes on a CPU), and the gray boxes represent the GPU-side processes. 1) A grid number is assigned to each cell in the netlist. 2) A principal component analysis (PCA) is performed on the CPU to determine a transformation matrix for transforming independent random values into correlated random values; eigenvalue decomposition is performed on the covariance matrix of each spatially correlated process parameter. 3) D2D random values and 4) spatially correlated random values are generated on a single GPU. 5) The generated D2D and correlated random values are copied to all GPUs. 6) MC simulation is performed on the GPUs. 7) One GPU accumulates the leakage current values of all partitions for each MC sample.

Fig. 3. Overall flow of the proposed method (blank box: process on a CPU, grayed box: process on a GPU).

In this procedure, to utilize multiple GPUs effectively, the netlist is partitioned with grid units. Fig. 4 shows an example of partitioning and multi-GPU assignment. The grids are equally divided with respect to the number of available GPUs. Fig. 4 shows that six grids exist and three GPUs are available. The six grids are divided uniformly into three partitions for the three GPUs. The leakage current in each partition is computed by the thread on each GPU, and therefore, three corresponding threads are executed on the three GPUs to compute the leakage current values of three partitions. Then, one GPU accumulates the leakage current values of three partitions to obtain the leakage current of a circuit.

Fig. 4. Partitioning and multi-GPUs.
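
The partitioning and accumulation steps can be sketched on the host side in Python. The text states only that the grids are divided equally among the GPUs; assigning contiguous grid numbers to each GPU is an assumption of this sketch, and the function names are hypothetical.

```python
def partition_grids(num_grids, num_gpus):
    """Split grid numbers 0..num_grids-1 into contiguous, near-equal parts."""
    base, extra = divmod(num_grids, num_gpus)
    partitions, start = [], 0
    for gpu in range(num_gpus):
        size = base + (1 if gpu < extra else 0)
        partitions.append(list(range(start, start + size)))
        start += size
    return partitions

def accumulate(partial_leakages):
    """Sum the per-partition leakage values, MC sample by MC sample."""
    return [sum(values) for values in zip(*partial_leakages)]
```

With six grids and three GPUs, as in Fig. 4, each GPU receives two grids, and accumulate adds the three partial results of every MC sample to obtain the leakage current of the circuit.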

Fig. 5 shows the program flow of each thread on a GPU for the implemented MC analysis. As mentioned previously, the netlist is divided into L partitions for the number of available GPUs. The MC simulation for the ith MC sample is conducted by L threads on L GPUs whose thread indexes are i. Each thread on a GPU is executed for an assigned partition of the netlist, as shown in Figs. 4 and 5.

Fig. 5. MC simulation flow on a GPU.

A. LUT Data Structure

The LUT index values are stored in a variable XGPU in constant memory to reduce their access latency. Because only a one-dimensional array is allowed in the constant memory of CUDA, XGPU is declared as a one-dimensional array. Let c be the number of characterized cells and v_{i} be the number of input states of cell i. Then, the length of XGPU is given as M\times\sum\limits_{i=1}^{c}v_{i}, where M is the maximum value of \sum\limits_{i=1}^{n}d_{i} over all input states and characterized cells. Therefore, the LUT index values of process parameter X_{i} for the k-th input state of cell j are located in {\bf X}_{i-1}\left[d_{i-1}\right] in a zero-based array ordering, where {\bf X}_{i} is a pointer to the start address of the LUT index values of process parameter X_{i} for the k-th input state of cell j, which is given as

{\bf X}_{i}=\&{\bf XGPU}\left[M\sum_{l=1}^{j-1}v_{l}+M\left(k-1\right)+\sum_{l=1}^{i-1}d_{l}\right] \qquad (6)

where the “&” symbol represents the address-of operator in the C language. Here, M\sum_{l=1}^{j-1}v_{l} for each cell is pre-calculated and stored in another array so that the proper LUT index values can be located easily.
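
The offset arithmetic of (6) can be checked with a short Python sketch. The function and argument names are hypothetical; i, j, and k are one-based as in the text, while the Python lists v and d are zero-based.

```python
def xgpu_offset(M, v, d, j, k, i):
    """Start offset in XGPU, as in (6), for parameter X_i (one-based i, j, k).

    v[c] is the number of input states of cell c + 1, and d[p] is the number
    of LUT data points of parameter X_{p+1} for this cell and input state.
    """
    cell_base = M * sum(v[:j - 1])     # M * sum_{l=1}^{j-1} v_l
    state_base = M * (k - 1)           # M * (k - 1)
    param_base = sum(d[:i - 1])        # sum_{l=1}^{i-1} d_l
    return cell_base + state_base + param_base

def cell_bases(M, v):
    """Pre-computed M * sum_{l=1}^{j-1} v_l for every cell j."""
    bases, total = [], 0
    for states in v:
        bases.append(M * total)
        total += states
    return bases
```

For example, with M = 6, two cells having 2 and 4 input states, and d = [3, 3], the index values of X_{2} for the third input state of the second cell start at offset 12 + 12 + 3 = 27.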

The LUT values, which are first-order polynomials for X_{n+1},\ldots,X_{n+m}, are stored in a three-dimensional array. The first index of the LUT signifies a cell, and the second index signifies the input state of the cell. Thus, {\rm LUT}[i][j] is a one-dimensional array that stores the n-dimensional LUT elements for input state j of cell i, and each element consists of (m+1) floating-point values for the corresponding first-order model equation. The LUT value of LUT index \left(\tilde{x}_{1},\ldots,\tilde{x}_{n}\right) for input state j of cell i is stored in {\rm LUT}[i][j][\tilde{x}_{1}\times s_{1}+\cdots+\tilde{x}_{n}\times s_{n}], where s_{i}=s_{i+1}\times d_{i+1} and s_{n}=m+1.
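
The stride recurrence can be illustrated with a small Python sketch. It assumes that the LUT index positions are zero-based and that n is at least 1; the function names are hypothetical.

```python
def lut_strides(d, m):
    """Strides s_1..s_n with s_n = m + 1 and s_i = s_{i+1} * d_{i+1}."""
    n = len(d)
    strides = [0] * n
    strides[n - 1] = m + 1
    for i in range(n - 2, -1, -1):
        strides[i] = strides[i + 1] * d[i + 1]
    return strides

def lut_offset(positions, strides):
    """Offset of the first coefficient of the element at zero-based positions."""
    return sum(x * s for x, s in zip(positions, strides))
```

For the 3×3 example with m = 2, the strides are [9, 3], so the last element (positions (2, 2)) starts at offset 24 and its three coefficients occupy the final entries of the 27-entry array, in agreement with (3).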

B. Cell Leakage Calculation

Fig. 6 shows the CUDA implementation of a cell leakage calculation. For process parameter X_{i}, where i\leq n, the proposed method finds lo and hi such that {\rm X}[lo] and {\rm X}[hi] are the nearest adjacent points to the corresponding random value {\rm x}[0]. By utilizing a binary search procedure, \log_{2}d_{i} operations are required to find the two adjacent points. The proposed method then computes the exponent values corresponding to {\rm X}[lo] and {\rm X}[hi] by calling itself recursively, and returns the interpolated value. If i>n, the proposed method computes the first-order model and returns it.

Fig. 6. CUDA implementation of leakage calculation on LUT.
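
The recursion described above can be sketched in Python. This is not the CUDA code of Fig. 6, which is not reproduced in the text; it is an illustrative host-side sketch that assumes a flat coefficient array laid out with the strides of Section IV-A, and all names are hypothetical.

```python
import math
from bisect import bisect_right

def find_neighbors(index_values, x):
    """Binary search for lo, hi so that index_values[lo], [hi] bracket x."""
    hi = bisect_right(index_values, x)
    hi = min(max(hi, 1), len(index_values) - 1)   # clamp: extrapolate at ends
    return hi - 1, hi

def log_leakage(i, offset, x, index_values, strides, lut, n):
    """Recursive evaluation; i is the zero-based parameter being processed."""
    if i >= n:
        # Linear parameters: evaluate the first-order model stored at offset.
        linear = x[n:]
        coeffs = lut[offset + 1:offset + 1 + len(linear)]
        return lut[offset] + sum(a * v for a, v in zip(coeffs, linear))
    lo, hi = find_neighbors(index_values[i], x[i])
    f_lo = log_leakage(i + 1, offset + lo * strides[i], x,
                       index_values, strides, lut, n)
    f_hi = log_leakage(i + 1, offset + hi * strides[i], x,
                       index_values, strides, lut, n)
    x_lo, x_hi = index_values[i][lo], index_values[i][hi]
    return f_lo + (x[i] - x_lo) / (x_hi - x_lo) * (f_hi - f_lo)

def cell_leakage(x, index_values, strides, lut):
    """Leakage current of one input state: exp of the interpolated exponent."""
    n = len(index_values)
    return math.exp(log_leakage(0, 0, x, index_values, strides, lut, n))
```

Each level of the recursion branches into two calls, so a lookup with n interpolating parameters evaluates 2^{n} first-order models, which matches the cost discussed in Section III-A.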

SECTION V.

Experimental Results

A. Experimental Environment

In the experiments, we evaluated the performance of the proposed approach in terms of its efficiency and accuracy. We used ISCAS-85 circuits and an OpenSparc T2 processor [24] as the benchmark circuits. Synopsys Design Compiler and Astro were used for synthesis and placement, respectively. We used two transistor models: an industrial 32-nm transistor model and the Predictive Technology Model (PTM) [25] with a high-performance 22-nm process based on the BSIM4 model.

The process parameter variations considered for the PTM were the gate channel length, gate oxide thickness, and doping concentration-dependent threshold voltage variations of both nMOS and pMOS transistors. Seven process parameter variations, including the gate channel length and gate oxide thickness variations, were considered for the 32-nm industrial transistor model. We assumed that these process variations were normally distributed, and that the D2D and WID variations contributed equal parts of each process variation. In addition, the three-sigma (3\sigma) values of the D2D and WID variations for each process parameter were set to 10% of the mean value (i.e., 3\sigma/\mu=0.1). Because the D2D and WID variations are independent, the 3\sigma values of the total process variations were approximately 14.14% of their mean values (\sqrt{10^{2}+10^{2}}\approx 14.14).

We employed analytic SLE and conventional MC-based leakage estimation methods as the benchmark methods. More specifically, we used a Wilkinson's method (WM)-based estimation (which uses Wilkinson's method to sum all of the leakage components modeled using the first-order model) [7], Chang's hybrid method (HC) [4], and VCA [5] as benchmark analytic SLE methods. The GPU-based MC method using the first-order model (F-MC) [18] was used as the benchmark MC-based leakage estimation method. Although the F-MC was proposed without considering the spatial correlation and multiple GPUs, we applied the same implementation scheme for the spatial correlation and multiple GPUs, described in Section IV, to the F-MC for experimental purposes. In addition, although the authors of [20] did not present a GPU-based MC method using their model, we implemented one in the same MC implementation flow as the proposed model, and used it as a benchmark method (L-MC). The higher-order exponential-polynomial model [6] was not employed because, among the process parameters considered in the experiments, it can handle only the gate channel length and threshold voltage variations.

The gate channel length was considered to be a spatially correlated parameter, and to consider the spatial correlation, we used the grid model in which the area of one region was 10 \mu{\rm m}\times\,10 \mu{\rm m}, which was adjusted from 40 \mu{\rm m}\times\,40 \mu{\rm m} in the 90-nm process [5] in consideration of the shrinkage achieved through advanced process technology. The amount of spatial correlation between regions was assumed to linearly decrease as the distance between the regions increased in generating the spatial-correlation matrix [26]. The spatial-correlation matrix was adjusted to be a positive semi-definite matrix using the method described in [27].

The proposed method (MC analysis using the proposed hybrid gate leakage model, H-MC), L-MC, and F-MC were implemented using the NVIDIA CUDA programming environment with CUBLAS [28], which is a CUDA version of BLAS [29]. HC was implemented on Mathworks MATLAB [30], and both WM and VCA were implemented using the C programming language. In the experiments, we used a Linux machine consisting of an Intel Xeon E5-2690 octa-core CPU with a 2.9-GHz clock frequency, 64 GB of memory, and three NVIDIA GeForce GTX 680 graphics cards [31].

In the characterization, the leakage current values were sampled with Synopsys HSPICE [32] for each process parameter over the range of $-19\%$ to $+19\%$ of its mean value in 1% increments. As mentioned before, the $3\sigma$ values of the total process variations were set at 14.14%, and we evaluated the error over the $\pm 4\sigma$ range. As a result, we obtained 39 sample points for each process parameter. Using these 39 samples, the interpolating parameters and the LUT index values were determined. The Mathworks MATLAB [30] optimization toolbox was used to determine the interpolating parameters, the LUT index values, and the fitting coefficients of the first-order model. Five sampling points, namely $-4\sigma$, $-2\sigma$, 0, $2\sigma$, and $4\sigma$, were used to compute the first-order model of the linear process parameters of each LUT element.
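The sampling range and the count of 39 points follow from simple arithmetic, sketched below. The sketch assumes that the D2D and WID variations are independent and combine in quadrature (which is consistent with the stated 14.14%), and that the $4\sigma$ bound is rounded up to a whole percent; both are assumptions for illustration, and the variable names are not from the paper.

```python
import math

d2d_3sigma_pct = 10.0   # 3-sigma of D2D variation, % of mean
wid_3sigma_pct = 10.0   # 3-sigma of WID variation, % of mean

# Independent components add in quadrature (assumption).
total_3sigma_pct = math.hypot(d2d_3sigma_pct, wid_3sigma_pct)

# Characterization covers +/- 4 sigma of the total variation.
four_sigma_pct = total_3sigma_pct * 4.0 / 3.0

# Round up to a whole percent and sample in 1% increments (assumption).
half_range_pct = math.ceil(four_sigma_pct)
offsets_pct = list(range(-half_range_pct, half_range_pct + 1))

print(round(total_3sigma_pct, 2), round(four_sigma_pct, 2), len(offsets_pct))
```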

For the LUT model [20], the number of LUT index values of each process parameter was set to the maximum number of LUT index values in the cell leakage library characterized for the proposed hybrid gate leakage model (thirteen points for the PTM and eight points for the industrial TR model), and the LUT index values were determined so as to minimize the errors in (4).

B. Accuracy Evaluation for the Leakage Model

In this experiment, we used ten ISCAS-85 benchmark circuits. For each circuit, we generated 200 random MC samples. In each sample, the D2D variations of the process parameters were drawn from normal distributions with $3\sigma/\mu = 10\%$, as mentioned earlier. The WID variations of the process parameters of each cell in each MC sample were drawn in the same way, with the same amount of variation as the D2D variations.

The leakage current of each MC sample was evaluated using the proposed hybrid gate leakage current model, the first-order model, and a LUT-based model [20]. We used the leakage current value of each MC sample obtained using Synopsys HSPICE as the reference value and estimated the absolute relative errors. In the “Leakage calculation” step shown in Fig. 5, Synopsys HSPICE was used to compute the leakage current value of the current input state of the current gate for the given process parameter values.

Table II shows the results of this experiment. The maximum relative error is the largest absolute relative error over all MC samples of all ISCAS-85 benchmark circuits, and the average relative error is the mean of those absolute relative errors. The coefficient of determination $R^{2}$ is reported as the minimum $R^{2}$ over all ISCAS-85 benchmark circuits. The first-order model was very inaccurate, with maximum relative errors of 98.5% and 75.6% for the 22-nm PTM and the industrial 32-nm transistor model, respectively. In particular, in terms of $R^{2}$, which measures how well the model replicates the observed outcomes, the first-order model scored $-0.277$ and 0.477 for the PTM and the industrial transistor model, respectively. On the other hand, the proposed model achieved high $R^{2}$ scores of greater than 0.97 for the PTM and 0.84 for the industrial model. Although the maximum error of the proposed model for the 20% error threshold was comparable with that of the first-order model, the proposed model showed a much lower average error and a much higher $R^{2}$ score for the PTM, and a comparable average error and a much higher $R^{2}$ score for the industrial model. Although the LUT model was more accurate than the proposed model, the proposed model was also quite accurate, with average errors of less than 4% when the error threshold was set to 3%, which is still within an acceptable range. Although the error threshold did not strictly bound the errors, the average errors were not far from the error threshold, and the accuracy of the proposed model increased as the error threshold decreased. The errors were therefore controllable to some extent, though not precisely.
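For reference, the three accuracy measures can be computed as in the sketch below. It assumes the conventional definition $R^{2} = 1 - SS_{res}/SS_{tot}$, which is consistent with the negative score reported for the first-order model (a squared correlation could not be negative); the function name and the sample values are illustrative only.

```python
def accuracy_metrics(reference, estimate):
    """Return (max relative error, average relative error, R^2).

    R^2 uses the conventional 1 - SS_res / SS_tot definition (assumed).
    """
    rel_errors = [abs(e - r) / abs(r) for r, e in zip(reference, estimate)]
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    r_squared = 1.0 - ss_res / ss_tot
    return max(rel_errors), sum(rel_errors) / len(rel_errors), r_squared

# Hypothetical values: a model that systematically halves the reference.
reference = [1.0, 2.0, 3.0, 4.0]
estimate = [0.5, 1.0, 1.5, 2.0]
max_err, avg_err, r2 = accuracy_metrics(reference, estimate)
print(max_err, avg_err, r2)
```

In this toy case the estimate tracks the trend of the reference perfectly, yet $R^{2}$ is negative because the residuals exceed the spread of the reference values, which is the same kind of outcome as the first-order model's score for the PTM.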

TABLE II Accuracy Evaluation Results of the Proposed Leakage Current Model for 200 Random Samples for Each ISCAS-85 Benchmark Circuit. ($R^{2}$: the Minimum Value of the Coefficient of Determination Among 10 Circuits.)

Fig. 7 shows the leakage current estimation results for 200 random samples of the c6288 ISCAS-85 benchmark circuit when the 22-nm PTM transistor model was used. The x-axis represents the reference value, and the y-axis represents the corresponding estimated value. The black solid line represents the reference values obtained using Synopsys HSPICE. As this figure shows, the results of the first-order model remained very far from the reference values. On the other hand, the results of the proposed and LUT models were relatively close to the reference values. In addition, the first-order model showed a tendency toward underestimation. In the characterization, the relative error was the objective to be minimized. Because an underestimate always yields a relative error of less than 100%, whereas the relative error of an overestimate is unbounded, the optimization produced fitting coefficients for the first-order model that underestimate the values.
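The bias toward underestimation can be reproduced with a toy fit. The sketch below is not the characterization flow of the paper: it fits a single constant to two hypothetical leakage values by brute-force search, minimizing either the maximum or the mean relative error. All names and values are illustrative assumptions.

```python
def best_constant(values, aggregate, steps=10000):
    """Brute-force the constant c that minimizes the aggregated relative error."""
    upper = max(values)
    best_c, best_err = None, float("inf")
    for k in range(1, steps + 1):
        c = upper * k / steps
        err = aggregate([abs(c - v) / v for v in values])
        if err < best_err:
            best_c, best_err = c, err
    return best_c, best_err

def mean(xs):
    return sum(xs) / len(xs)

values = [1.0, 10.0]   # hypothetical leakage currents, arbitrary units
c_max, err_max = best_constant(values, max)    # minimize the maximum relative error
c_avg, err_avg = best_constant(values, mean)   # minimize the mean relative error
print(c_max, err_max, c_avg, err_avg)
```

Both criteria place the fitted constant far below the arithmetic mean of the data (5.5), because errors on the large value are capped below 100% as long as the fit stays under it.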

Fig. 7. Leakage current estimation results for 200 random samples of c6288 and PTM 22-nm TR model. (a) First-order model. (b) LUT model. (c) Hybrid model: 3%. (d) Hybrid model: 5%. (e) Hybrid model: 10%. (f) Hybrid model: 20%.

C. Performance Comparison of the Leakage Estimation Methods

In this experiment, we evaluated the leakage current distributions of the benchmark circuits using the proposed and benchmark methods. To obtain the reference leakage current distribution, we used Synopsys HSPICE. Similar to the experiment presented in Section V-B, Synopsys HSPICE was called in the leakage calculation step shown in Fig. 5, and was used to compute the leakage current value of a cell (SPICE-MC).

As mentioned previously, the $3\sigma$ values of the D2D and WID process parameters were set to 10% of their mean values (i.e., $3\sigma/\mu = 10\%$), and therefore the $3\sigma$ values of the total process variations were approximately 14.14% of their mean values. All MC-based methods, including the MC simulation using Synopsys HSPICE, were run with 16,384 samples for each ISCAS-85 benchmark circuit. This sample count corresponds to a 95% confidence level with $\pm 3\%$ and $\pm 2.5\%$ sampling errors for the mean values of the PTM and industrial transistor model, respectively, and $\pm 2.5\%$ sampling errors for the standard deviation values of both the PTM and industrial transistor model [33].

Table III presents the average and maximum values of the absolute relative errors of the benchmark circuits. The percentile points of the analytic SLE methods were calculated from the inverse lognormal cumulative distribution function (CDF) using the estimated mean and standard deviation values. These percentile point errors were presented to evaluate the accuracy of the benchmark methods at the tail of the distribution.
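The percentile computation for the analytic methods can be sketched as follows. The sketch assumes the standard moment matching between a lognormal distribution and its mean and standard deviation; the function name and the numeric inputs are hypothetical and not taken from the experiments.

```python
import math
from statistics import NormalDist

def lognormal_percentile(mean, std, p):
    """Percentile point of a lognormal with the given mean and standard deviation.

    Moment matching: sigma^2 = ln(1 + (std/mean)^2), mu = ln(mean) - sigma^2 / 2.
    """
    sigma2 = math.log(1.0 + (std / mean) ** 2)
    mu = math.log(mean) - 0.5 * sigma2
    z = NormalDist().inv_cdf(p)          # standard normal quantile
    return math.exp(mu + math.sqrt(sigma2) * z)

# Hypothetical estimated statistics of total leakage (arbitrary units).
est_mean, est_std = 1.0, 0.5
p50 = lognormal_percentile(est_mean, est_std, 0.50)
p95 = lognormal_percentile(est_mean, est_std, 0.95)
p99 = lognormal_percentile(est_mean, est_std, 0.99)
```

Because the tail percentiles depend exponentially on the estimated parameters, errors in the mean and standard deviation are amplified at the tail of the distribution.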

TABLE III Estimation Results for Leakage Current Distribution for 10 ISCAS-85 Benchmark Circuits. The Average Values and the Maximum Values of Absolute Relative Errors for ISCAS-85 Benchmark Circuits are Represented in %. (HC: H. Chang's Method [4], WM: Wilkinson's Method [7], VCA: Virtual Cell Approximation Method [5], F-MC: Monte Carlo Simulation Using the First-Order Model [18], L-MC: Monte Carlo Simulation Using LUT-Based Model [20], and H-MC: Monte Carlo Simulation Using the Proposed Hybrid Gate Leakage Model.)

The proposed method (MC analysis with the proposed hybrid gate leakage model, H-MC) was clearly more accurate than all analytic SLE methods and F-MC for all statistics (mean, standard deviation, and percentile points) of both the PTM and the industrial transistor model. For the PTM, the benchmark methods based on the first-order model were inaccurate, with average and maximum errors of approximately 90% for both the mean and standard deviation values. On the other hand, H-MC was relatively accurate, with maximum errors of approximately 7% for both the mean and standard deviation values when the error threshold was set to 5%. Although L-MC showed the best accuracy among all methods, the differences between L-MC and the proposed H-MC for the 5% error threshold were less than 3% on average in both the mean and standard deviation values.

Similar results were obtained for the industrial 32-nm transistor model. The methods based on the first-order model (the analytic SLE methods and F-MC) were very inaccurate, with average errors of greater than 23% for the mean value and 50% for the standard deviation value. Among all methods considered, L-MC showed the best accuracy, as it did for the PTM, but the difference between the results of L-MC and H-MC for an error threshold of 5% was less than 4% on average in both the mean and standard deviation values.

Figs. 8 and 9 show the CDFs for the c6288 ISCAS-85 benchmark circuit with the PTM and the industrial transistor model, respectively. Consistent with the results presented in Table III, H-MC is visibly more accurate than the benchmark methods based on the first-order model.

Fig. 8. CDF comparison between the results of the proposed method, the first-order MC, MC using LUT model and SPICE MC simulation for c6288 benchmark circuit. (PTM 22-nm TR model and 5% error threshold for the proposed method.)

Fig. 9. CDF comparison between the results of the proposed method, the first-order MC, MC using LUT model and SPICE MC simulation for c6288 benchmark circuit. (Industrial 32-nm TR model and 5% error threshold for the proposed method.)

Table IV and Figs. 10 and 11 compare the runtimes of the proposed and benchmark methods when one GPU was used to perform the MC methods. Figs. 10 and 11 show the overall runtime used to analyze ten ISCAS-85 benchmark circuits for the PTM and industrial TR models, respectively, and Table IV details the results of the experiment. Note that the HC method was implemented using Mathworks MATLAB [30], and thus a direct comparison between HC and the other methods is unfair; we therefore omitted the HC runtime results. In [5], the runtimes of HC, WM, and VCA were compared, and the runtime of HC was shown to be comparable with that of VCA.

Fig. 10. Overall runtimes to analyze ten ISCAS-85 benchmark circuits for PTM 22-nm TR model.

Fig. 11. Overall runtimes to analyze ten ISCAS-85 benchmark circuits for industrial 32-nm TR model.

TABLE IV Runtime Comparison of the MC With the Proposed Model and Benchmark Methods. (WM: Wilkinson's Method [7], VCA: Virtual Cell Approximation Method [5], F-MC: the First-Order Exponential-Polynomial Based MC Simulation [18], L-MC: MC Simulation Using LUT-Based Model [20], and H-MC: MC Analysis With the Proposed Model.)

As shown in the table, VCA was the fastest method; it was at least 2.5 times faster than the proposed H-MC. F-MC ranked second, and the proposed H-MC had at least a 50% longer runtime than F-MC. Although H-MC was much slower than both VCA and F-MC, it was much more accurate. In addition, the computation speed of the proposed method can be improved by multiprocessing on multiple GPUs. Compared with L-MC, the proposed H-MC reduced the runtimes by 40–70% for the PTM and by over 90% for the industrial model. The computational complexity of L-MC increases at the rate of $2^{n}$, where $n$ is the number of process parameters. Therefore, the efficiency improvement of H-MC over L-MC depends strongly on the number of process parameters considered.
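One way to see where a $2^{n}$ growth can come from is multilinear interpolation in an $n$-dimensional table, which reads one entry per corner of the enclosing cell. The sketch below assumes this interpolation scheme for illustration; the LUT model of [20] may interpolate differently, and the function name and table contents are hypothetical.

```python
from bisect import bisect_right
from itertools import product

def multilinear_lookup(axes, table, point):
    """Interpolate in an n-D table; returns (value, number of table reads).

    axes:  list of n sorted lists of LUT index values
    table: dict mapping an index tuple to the stored value
    point: n query coordinates
    """
    lows, fracs = [], []
    for axis, x in zip(axes, point):
        i = min(max(bisect_right(axis, x) - 1, 0), len(axis) - 2)
        lows.append(i)
        fracs.append((x - axis[i]) / (axis[i + 1] - axis[i]))
    total, reads = 0.0, 0
    for corner in product((0, 1), repeat=len(axes)):   # 2^n corners
        weight = 1.0
        for bit, f in zip(corner, fracs):
            weight *= f if bit else 1.0 - f
        total += weight * table[tuple(i + b for i, b in zip(lows, corner))]
        reads += 1
    return total, reads

# Hypothetical 3-parameter table holding f(i, j, k) = i + 2j + 3k.
axes = [[0.0, 1.0, 2.0]] * 3
table = {(i, j, k): i + 2 * j + 3 * k
         for i in range(3) for j in range(3) for k in range(3)}
value, reads = multilinear_lookup(axes, table, (0.5, 1.25, 1.75))
print(value, reads)
```

Under this assumption, moving a parameter from the table into a first-order term halves the number of corner reads per evaluation, which is in line with the observation that the gain of H-MC over L-MC grows with the number of process parameters.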

Table V compares the runtimes of the proposed method and the benchmark methods for the combinational logic of the OpenSparc T2 core (4.5 million gates). As shown in Table V, VCA was still the fastest among all methods tested. However, the runtime of the proposed H-MC was significantly reduced as the number of GPUs increased. When three GPUs were used for the computation, the runtime was reduced to one-third of the runtime of a single GPU.

Fig. 12 shows the runtimes with respect to the number of GPUs used. In this experiment, we estimated the runtimes of the parallelization overhead and the MC process for four to eight GPUs through simulation. The overhead includes the data loading and the summation of the results from all GPUs. Fig. 12 shows that the parallelization overhead increases with the number of GPUs used. However, the overhead runtimes were small enough to be neglected compared with those of the MC simulation, because the data size to be transferred to each GPU, given as $LM \times (n+1) \times$ (the unit size of a floating-point variable) for $L$ GPUs, $M$ threads, and $n$ process parameters, is negligible relative to the 4 GB/s bandwidth (8-lane PCI Express 3.0 bus) between the GPU DRAM and the CPU DRAM. In addition, the amount of computation needed to accumulate the MC results of all partitions is also negligible compared with that of the MC simulation. Thus, the runtime of H-MC decreased as the number of GPUs used increased.
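A back-of-the-envelope check of the transfer overhead is sketched below. The thread count, the number of process parameters, and the 4-byte value size are hypothetical choices for illustration; only the size formula, the 4 GB/s bandwidth, and the roughly 10-min single-GPU runtime come from the text. The runtime model ignores the result summation and assumes ideal partitioning of the MC samples.

```python
def transfer_bytes(num_gpus, threads, num_params, bytes_per_value=4):
    """Data size from the text: L * M * (n + 1) * size of one floating-point value."""
    return num_gpus * threads * (num_params + 1) * bytes_per_value

def transfer_seconds(num_bytes, bandwidth_bytes_per_s=4e9):
    """Time to move num_bytes over a 4 GB/s link."""
    return num_bytes / bandwidth_bytes_per_s

def ideal_runtime(single_gpu_seconds, num_gpus, overhead_seconds):
    """Idealized model: MC work splits evenly, overhead is added on top."""
    return single_gpu_seconds / num_gpus + overhead_seconds

# Hypothetical configuration: 3 GPUs, 16,384 threads, 5 process parameters.
size = transfer_bytes(num_gpus=3, threads=16384, num_params=5)
overhead = transfer_seconds(size)
runtime_3gpu = ideal_runtime(600.0, 3, overhead)   # 600 s is about 10 min on one GPU
print(size, overhead, runtime_3gpu)
```

With these numbers the transfer takes well under a millisecond, so the three-GPU runtime stays essentially at one-third of the single-GPU runtime.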

Fig. 12. Runtime for OpenSparc T2 core of the proposed method for 5% error threshold for PTM 22-nm TR model. (Results for 4 to 8 GPUs are estimated values.)

TABLE V Runtime Comparison of the Proposed Method and Benchmark Methods for Combinational Logic Parts of OpenSparc T2 Core. (4.5 Million Gates and Runtime for VCA: 40 s for PTM and 91 s for Industrial Model.)

When a single GPU was used, the runtime of H-MC for a 5% error threshold was only approximately 10 min for 4.5 million gates; if more GPUs are available, the runtime of H-MC will be further decreased. We therefore claim that the computational overhead of MC using the proposed hybrid gate leakage current model is not a major problem in practical applications.

SECTION VI.

Conclusion

This paper proposed an accurate gate leakage current model that combines an LUT and the first-order exponential-polynomial model, along with its characterization method. The proposed hybrid gate leakage model uses an LUT for the varying process parameters that have a strong nonlinear relationship with the logarithm of the leakage current, and it stores a first-order model of the other varying process parameters as each table element of the LUT. By combining an LUT and the first-order exponential-polynomial model, the proposed model is more accurate than the first-order model and more efficient than the LUT-based model.
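The structure of the hybrid model can be illustrated with a minimal sketch for one LUT parameter. It assumes that each table element stores the coefficients of $\ln I = a_0 + \sum_i a_i \Delta p_i$ and that neighboring elements are blended by plain linear interpolation in the log domain; the interpolation actually characterized in the paper may differ, and all names and coefficient values are hypothetical.

```python
import math
from bisect import bisect_right

def hybrid_leakage(lut_axis, lut_coeffs, x_lut, dx_linear):
    """Evaluate a hybrid LUT / first-order leakage model (illustrative form).

    lut_axis:   sorted LUT index values of the strongly nonlinear parameter
    lut_coeffs: per index value, (a0, [a1, ..., am]) for ln I = a0 + sum(a_i * dx_i)
    x_lut:      value of the LUT parameter
    dx_linear:  deviations of the remaining (near-linear) parameters
    """
    i = min(max(bisect_right(lut_axis, x_lut) - 1, 0), len(lut_axis) - 2)
    t = (x_lut - lut_axis[i]) / (lut_axis[i + 1] - lut_axis[i])

    def log_leakage(entry):
        a0, slopes = entry
        return a0 + sum(a * dx for a, dx in zip(slopes, dx_linear))

    # Linear interpolation of ln I between the two neighboring table elements (assumed).
    ln_i = (1.0 - t) * log_leakage(lut_coeffs[i]) + t * log_leakage(lut_coeffs[i + 1])
    return math.exp(ln_i)

# Hypothetical characterization data: 3 LUT points, 1 first-order parameter.
lut_axis = [-0.1, 0.0, 0.1]
lut_coeffs = [(-20.0, [2.0]), (-19.0, [2.5]), (-17.0, [3.0])]
leak = hybrid_leakage(lut_axis, lut_coeffs, x_lut=0.05, dx_linear=[0.1])
```

Only the parameters placed on the LUT axis add table dimensions; every other parameter costs one multiply-add per table element read, which is the source of the efficiency gain over a full LUT.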

In the accuracy evaluation of the proposed hybrid gate leakage current model, the proposed model obtained high $R^{2}$ scores of close to 1. Although the error threshold could not bound the maximum and average relative errors, the average errors were close to the error threshold, and the proposed model showed accuracy comparable with that of the LUT model.

The proposed hybrid gate leakage model was implemented for an MC analysis in a multiple-GPU environment. In the MC simulation, the proposed method (MC with the proposed model) was slower than VCA, which is the fastest analytic SLE method among the benchmark methods used. However, the benchmark methods based on the first-order model showed large relative errors of approximately 90% in all statistics for the 22-nm PTM transistor model, and of approximately 50% in the standard deviation value and at the tail of the distribution for the industrial 32-nm transistor model. On the other hand, the proposed method was more efficient than the LUT-based MC method while achieving comparable accuracy.

Although the proposed method was slower than VCA, the runtime of the proposed method can be reduced by utilizing multiple GPUs. When only one GPU is used, the runtime is less than 10 min for a commercial design with 4.5 million gates, and if more GPUs are available, this runtime can be reduced further. The computational complexity of the proposed method therefore poses no problem.

In conclusion, we expect that an MC analysis using the proposed hybrid gate leakage model can provide accurate leakage analysis results within a practical runtime.
