Introduction
Owing to the rapid growth of mobile application markets, power consumption has become one of the cardinal factors for VLSI chip designs. In particular, leakage power consumption, or static power consumption, accounts for almost one-half of the overall system power consumption. As the leakage power is consumed even while a chip is in an idle state, it directly affects the battery life of mobile devices. Thus, it is very important to reduce the leakage power of mobile VLSI chips, and, therefore, accurate leakage estimation has become one of the most important steps of mobile chip designs.
As semiconductor process technology scales down, process uncertainty grows and the variation in process parameters increases. To deal with the variation of process parameters, a worst-case corner analysis [1] has been widely used. This approach can be used successfully when the process variation is not too serious and the die-to-die (D2D) variation is the main contributor [2]. However, the continuous scaling down of semiconductor process technology has worsened the process parameter variation to a serious level, and leakage variation has also become considerably severe in state-of-the-art process technology. In addition, the within-die (WID) variation has continuously increased and has become comparable in size with the D2D variation [2], [3]. Therefore, a worst-case corner analysis of leakage has become overly pessimistic in many designs, which invalidates the analysis results [2], [4]–[6], particularly during the gate-level design.
To address the problem of overly pessimistic leakage analysis, a first-order exponential-polynomial leakage current model (the first-order model) [7], which represents the leakage power under process variations, has been proposed for use with gate-level designs. The first-order model represents the logarithmic value of the leakage current as a linear function of the process parameters, such as channel length, oxide thickness, and threshold voltage. In addition, many approaches to a leakage analysis of gate-level VLSI designs, which use the first-order model, have been proposed. These include analytic gate-level statistical leakage estimation (SLE) methods [4], [5], [7]–[16] and Monte-Carlo (MC) simulation-based gate-level leakage estimation methods [17], [18].
The first-order model is very efficient, and has been successfully used to analyze the leakage current of a VLSI design at the gate-level under process variations. However, as the process technology scaled down to below 65 nm [19], the nonlinear relationship between certain process parameters and the logarithmic value of the leakage current became much stronger. Consequently, the accuracy of the first-order model was strongly questioned for use with state-of-the-art technology [5], [6], [19], [20], and it exhibited significant errors compared to the results of circuit level analyses, obtained using the BSIM4 transistor model [19], [21].
To overcome this weakness of the first-order leakage model, which became clear with deep submicron process technology, a higher-order exponential-polynomial model [6] and a look-up table (LUT)-based gate leakage model [20] have been proposed. These models are intended for MC-based gate-level leakage analysis. The approach in [6] models the logarithm of the leakage current as a polynomial of gate channel length and threshold voltage; it uses a first-order expression for threshold voltage and a high-order expression for gate channel length. However, [6] did not deal with other important process variations such as oxide thickness [4], [5]. Thus, this model is more accurate than the first-order model only when there are just two variation sources, namely the gate channel length and threshold voltage variations. Compared to the higher-order exponential-polynomial model, the LUT-based model [20] captures the nonlinearity more accurately by storing the logarithmic value of the leakage current in a table. However, in this model, the leakage current needs to be estimated by interpolating the values in the LUT during analysis, and the number of computations for the interpolation increases at the rate of
This paper proposes a hybrid leakage current model, developed for the gate-level MC analysis of VLSI designs. The proposed model combines the LUT-based model [20] with the first-order model, and takes advantage of the strengths of both. For accuracy, the proposed approach treats the process parameters that have a strongly nonlinear relationship with the logarithm of the leakage current as nonlinear process parameters, and uses an LUT approach for them. The other process parameters are regarded as linear process parameters, and the proposed approach uses a first-order model for them for efficiency.
The accuracy and the efficiency of the proposed model depend on the number of parameters that are considered nonlinear. In addition, the number of LUT data points used for the nonlinear parameters affects the accuracy and the efficiency of the proposed model. Thus, it is very important to choose appropriately both the nonlinear parameters and the number of LUT data points for each of them. When characterizing the leakage of each logic gate type, the proposed approach adaptively selects the process parameters that should be handled as nonlinear and determines the number of their LUT data points for each input condition, based on a user-defined error threshold, so that it can obtain the maximum efficiency while maintaining acceptable accuracy.
The remainder of this paper is organized as follows. Section II briefly describes the existing leakage current models. Section III describes the proposed leakage current model and its characterization method in detail. Section IV presents the implementation of the MC simulation with the proposed model on a multiple NVIDIA GPU environment using a CUDA programming environment [22]. Section V presents the performance evaluation of the proposed approach in terms of accuracy and efficiency, and a comparison with existing well-known approaches. Finally, Section VI summarizes the paper and provides some concluding remarks.
Background
A. First-Order Exponential-Polynomial Leakage Current Model
State-of-the-art SLE methods use the first-order exponential-polynomial model (the first-order model) [7], which is based on the assumption that major leakage mechanisms such as sub-threshold and gate tunneling leakages are exponentially affected by the process parameter variations. The first-order model of the leakage current approximates the polynomial exponents as a first-order linear model, as shown in (1).
$$I_{leak}=\exp\left(a_{0}+\sum_{i=1}^{n}a_{i}X_{i}+a_{n+1}R_{n+1}\right),\eqno{\hbox{(1)}}$$
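As a minimal sketch of how (1) is evaluated for one sample, the following Python function computes the leakage current from a set of coefficients. The function name, the argument names, and the reading of $X_{i}$ as process parameter deviations and $R_{n+1}$ as an additional random term are assumptions made for illustration; the coefficient values in the usage note are hypothetical.

```python
import math


def first_order_leakage(a0, a, x, a_r, r):
    """Evaluate (1): exp(a0 + sum_i a_i * X_i + a_{n+1} * R_{n+1}).

    a   -- coefficients a_1..a_n (assumed fitted during characterization)
    x   -- parameter values X_1..X_n for one sample
    a_r -- coefficient a_{n+1} of the extra term R_{n+1}
    r   -- value of R_{n+1} for this sample
    """
    if len(a) != len(x):
        raise ValueError("need one coefficient per process parameter")
    exponent = a0 + sum(ai * xi for ai, xi in zip(a, x)) + a_r * r
    return math.exp(exponent)
```

For example, `first_order_leakage(math.log(2.0), [0.5], [2.0], 0.0, 1.0)` evaluates `exp(ln 2 + 1)`, that is, `2e`.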
B. LUT-Based Exponent Model
The authors of [20] proposed an LUT-based model for the exponent in (1) and demonstrated the accuracy of their model using two interpolation methods: piecewise linear (PWL) and cubic spline interpolations. The LUT stores the natural-logarithmic values of the leakage currents for the given sampling points and the values in the LUT are pre-calculated using a characterization process. Intermediate points between the sampling points can be calculated from the LUT using interpolation.
The PWL interpolation calculates the exponent value of an intermediate point on the straight line that connects the two sampling points adjacent to it. Cubic spline interpolation uses a special type of third-order piecewise polynomial interpolant called a spline, and it matches the first and second derivatives of two adjacent splines at their common point. Fig. 1 shows the first-order model and the PWL.
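The PWL case can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation of [20]: the function name is hypothetical, the table stores natural-log leakage values as described above, and extrapolating along the end segment for out-of-range inputs is an assumption.

```python
import bisect
import math


def pwl_interpolate(index_values, log_leakages, x):
    """Piecewise-linear interpolation of the stored exponent at x.

    index_values -- sorted, distinct LUT sampling points
    log_leakages -- ln(I) at each sampling point
    """
    if len(index_values) != len(log_leakages) or len(index_values) < 2:
        raise ValueError("need at least two matching sampling points")
    # Find the segment [lo, hi] that contains x; clamp to the end segments.
    hi = min(max(bisect.bisect_right(index_values, x), 1), len(index_values) - 1)
    lo = hi - 1
    t = (x - index_values[lo]) / (index_values[hi] - index_values[lo])
    return log_leakages[lo] + t * (log_leakages[hi] - log_leakages[lo])


def lut_leakage(index_values, log_leakages, x):
    """Leakage current is the exponential of the interpolated exponent."""
    return math.exp(pwl_interpolate(index_values, log_leakages, x))
```

With the hypothetical table `index_values = [-1, 0, 1]` and `log_leakages = [1, 0, 1]`, the interpolated exponent at `x = 0.5` lies halfway between 0 and 1.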
Multi-dimensional interpolation can be performed through recursive iteration. Fig. 2 shows an example of a two-dimensional interpolation. We assume that we have two process parameters,
Clearly, the LUT-based leakage model is much more accurate than the first-order model even when PWL interpolation is applied. However, the LUT model is incompatible with conventional SLE methods, and only the MC-based leakage analysis can currently handle the LUT model.
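The recursive multi-dimensional interpolation described in this subsection can be sketched as follows, assuming PWL interpolation along every axis and a table stored as nested Python lists (both are assumptions for illustration). Each recursion level interpolates between two lower-dimensional results, so this sketch reads two table entries per dimension, which is why the cost grows quickly with the number of LUT parameters.

```python
import bisect


def _segment(axis, x):
    """Return (lo, hi, t) so that x = axis[lo] + t * (axis[hi] - axis[lo])."""
    hi = min(max(bisect.bisect_right(axis, x), 1), len(axis) - 1)
    lo = hi - 1
    return lo, hi, (x - axis[lo]) / (axis[hi] - axis[lo])


def interpolate_nd(axes, table, point):
    """Recursive PWL interpolation.

    axes  -- one sorted list of index values per process parameter
    table -- nested lists; table[i][j]... holds ln(I) at (axes[0][i], axes[1][j], ...)
    point -- one coordinate per process parameter
    """
    if not axes:
        return table  # all dimensions consumed: table is a scalar entry
    lo, hi, t = _segment(axes[0], point[0])
    v_lo = interpolate_nd(axes[1:], table[lo], point[1:])
    v_hi = interpolate_nd(axes[1:], table[hi], point[1:])
    return v_lo + t * (v_hi - v_lo)
```

For a hypothetical two-parameter table holding `x + y` on the grid `[0, 1] x [0, 2]`, `interpolate_nd([[0, 1], [0, 2]], [[0, 2], [1, 3]], [0.5, 1.0])` interpolates first along `y` and then along `x`.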
Proposed Leakage Model and its Characterization
Although the LUT-based leakage current model [20] is reasonably accurate, its computational complexity can be a significant burden for the MC simulation when a number of varying process parameters are present. In practice, not all process parameters have a nonlinear relationship with the logarithmic value of the leakage current.
In this section, we present a novel leakage current model that combines the LUT-based leakage and first-order models.
A. Hybrid Leakage Current Model
The amount of leakage current in a cell is a function of the input state of the cell [23]. The proposed model expresses the leakage current of cell
$$I_{l}=\sum_{\forall state(i)}P_{i}I_{l}^{i}=\sum_{\forall state(i)}P_{i}\exp\left\{f_{l}^{i}\left(X_{1},\ldots,X_{n+m}\right)\right\},\eqno{\hbox{(2)}}$$
Let
The proposed model approximates
Let
Table I shows an example LUT of the proposed model when
For example, the (1,1) value, corresponding to the index value
If process parameter values for
Because the first-order model for
$${\rm size}=\left(m+1\right)\times\prod_{i=1}^{n}d_{i}\eqno{\hbox{(3)}}$$
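The following Python sketch illustrates one way the hybrid model could be evaluated, and computes the table size in (3). It rests on several assumptions that are not confirmed by the text above: $n$ is taken to be the number of LUT (nonlinear) parameters with $d_{i}$ index values each, $m$ the number of linear parameters, each table entry is taken to hold the $m+1$ coefficients of a first-order polynomial, and PWL interpolation is applied to the evaluated polynomials. All names are hypothetical.

```python
import bisect
import math


def lut_size(m, d):
    """Table size per (3): (m + 1) coefficients at each of prod(d_i) grid points."""
    return (m + 1) * math.prod(d)


def _segment(axis, x):
    hi = min(max(bisect.bisect_right(axis, x), 1), len(axis) - 1)
    lo = hi - 1
    return lo, hi, (x - axis[lo]) / (axis[hi] - axis[lo])


def hybrid_exponent(axes, table, nonlinear, linear):
    """Interpolate over the nonlinear parameters; each leaf is [a0, a1, ..., am]."""
    if not axes:
        # Leaf entry: first-order polynomial in the linear parameters.
        return table[0] + sum(a * x for a, x in zip(table[1:], linear))
    lo, hi, t = _segment(axes[0], nonlinear[0])
    v_lo = hybrid_exponent(axes[1:], table[lo], nonlinear[1:], linear)
    v_hi = hybrid_exponent(axes[1:], table[hi], nonlinear[1:], linear)
    return v_lo + t * (v_hi - v_lo)


def hybrid_leakage(axes, table, nonlinear, linear):
    return math.exp(hybrid_exponent(axes, table, nonlinear, linear))
```

With one nonlinear parameter sampled at three points and one linear parameter, the table holds `lut_size(1, [3])`, that is, six coefficients.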
In general, the cubic spline interpolation discussed in [20] shows more accurate results than the PWL interpolation for the same number of data points. However, if
A one-dimensional interpolation requires two data points, and therefore, an
B. Gate Library Characterization
Unfortunately, similar to the conventional LUT-based models, it is difficult to estimate the maximum error bound and determine the appropriate number of data points (LUT index values) for each LUT parameter. Moreover, it is difficult to choose the LUT index values that minimize the error and determine which process parameters should be considered for the LUT. In particular, it is nearly impossible to find a global optimum solution for the overall parameter space.
Instead of searching for a global optimum solution, the proposed characterization method settles for a local optimum solution that satisfies a given error threshold on the sample space. The proposed method estimates the leakage current values while one process parameter varies over the sampling range and the other parameters are fixed at their nominal values (that is, without process variation). The error of the first-order model is then estimated, and the LUT index values are determined through optimization. This characterization is performed for each input state of a gate. The detailed procedure is described as follows:
For each process parameter
$${\rm Error}=\max_{s_{k}\in S}\left|\frac{I\left(s_{k}\right)-\exp\left(a_{0}+a_{1}s_{k}\right)}{I\left(s_{k}\right)}\right|,\eqno{\hbox{(4)}}$$
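The error measure in (4) can be sketched in Python as follows. How the coefficients $a_{0}$ and $a_{1}$ are obtained is not stated here, so this sketch assumes an ordinary least-squares fit of $\ln I$ against the sample values; the function names are hypothetical.

```python
import math


def fit_first_order(samples, currents):
    """Least-squares fit of ln(I) = a0 + a1 * s (assumed fitting method)."""
    n = len(samples)
    logs = [math.log(i) for i in currents]
    mean_s = sum(samples) / n
    mean_l = sum(logs) / n
    sxx = sum((s - mean_s) ** 2 for s in samples)
    sxy = sum((s - mean_s) * (l - mean_l) for s, l in zip(samples, logs))
    a1 = sxy / sxx
    return mean_l - a1 * mean_s, a1


def first_order_error(samples, currents):
    """Maximum absolute relative error of the first-order model, as in (4)."""
    a0, a1 = fit_first_order(samples, currents)
    return max(
        abs((i - math.exp(a0 + a1 * s)) / i) for s, i in zip(samples, currents)
    )
```

A parameter whose error stays below the threshold would be kept in the first-order part of the model; one whose error exceeds it would be a candidate for the LUT.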
For each interpolating process parameter
$$\eqalignno{
&{\rm minimize}\quad d_{i}=\sum_{k=1}^{l}x_{k}\cr
&{\rm s.t.}\quad x_{k}=0\ {\rm or}\ 1\ {\rm for}\ k=1,\ldots,l\cr
&\qquad\ \ P=\left\{s_{k}\ {\rm such\ that}\ x_{k}=1\right\}\cr
&\qquad\ \ V=\left\{\log I\left(s_{k}\right)\ {\rm for}\ s_{k}\in P\right\}\cr
&\qquad\ \ \max_{s_{k}\in S}\left|\frac{I\left(s_{k}\right)-\exp\left(F\left(P,V,s_{k}\right)\right)}{I\left(s_{k}\right)}\right|\leq{\rm Error}_{threshold}&{\hbox{(5)}}}$$
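To make the constraint in (5) concrete, the sketch below selects index values with a simple greedy heuristic: start from the two end samples and repeatedly add the sample with the largest relative error until the threshold is met. This heuristic is an illustrative stand-in, not the optimization used in this paper, and it does not guarantee the minimum $d_{i}$. PWL interpolation is assumed for $F$, and all names are hypothetical.

```python
import bisect
import math


def _pwl(xs, vs, x):
    hi = min(max(bisect.bisect_right(xs, x), 1), len(xs) - 1)
    lo = hi - 1
    t = (x - xs[lo]) / (xs[hi] - xs[lo])
    return vs[lo] + t * (vs[hi] - vs[lo])


def max_rel_error(samples, currents, chosen):
    """Left-hand side of the last constraint in (5), plus the worst sample index."""
    xs = [samples[k] for k in chosen]
    vs = [math.log(currents[k]) for k in chosen]
    worst, worst_k = 0.0, None
    for k, (s, i) in enumerate(zip(samples, currents)):
        err = abs((i - math.exp(_pwl(xs, vs, s))) / i)
        if err > worst:
            worst, worst_k = err, k
    return worst, worst_k


def select_index_values(samples, currents, threshold):
    """Greedy choice of P; samples must be sorted and distinct."""
    chosen = [0, len(samples) - 1]
    while True:
        worst, k = max_rel_error(samples, currents, chosen)
        # Stop when the threshold is met or no new sample can be added.
        if worst <= threshold or k in chosen:
            return [samples[j] for j in chosen]
        chosen = sorted(chosen + [k])
```

For nine hypothetical samples of `exp(s**2)` on `[-1, 1]` and a 10% threshold, the heuristic keeps five of the nine samples.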
After the optimization,
Implementation of the Proposed Model for MC Analysis on a Multiple GPU Environment Using CUDA
We implemented the proposed hybrid gate leakage model for the MC analysis under a multiple GPU environment using the NVIDIA CUDA platform [22]. The overall flow of the implemented MC analysis (Fig. 3) is similar to that of [18], but we modified the flow of [18] to consider the spatial correlation and to utilize multiple GPUs. In Fig. 3, the white boxes represent the host-side processes (processes on a CPU), and the gray boxes represent the GPU-side processes. 1) A grid number is assigned to each cell in the netlist. 2) A principal component analysis (PCA) is performed on the CPU to determine a transformation matrix for transforming independent random values into correlated random values; eigenvalue decomposition is performed on the covariance matrix of each spatially correlated process parameter. 3) D2D random values and 4) spatially correlated random values are generated on a single GPU. 5) The generated D2D and correlated random values are copied to all GPUs. 6) MC simulation is performed on the GPUs. 7) One GPU accumulates the leakage current values of all partitions for each MC sample.
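The PCA step of this flow can be illustrated with a small CPU-only Python sketch. It is not the implementation used in this work: the cyclic Jacobi eigen-solver is a standard textbook method chosen here only to keep the example dependency-free, and the function names and the covariance matrix in the usage note are hypothetical. The transformation matrix $T=V\Lambda^{1/2}$ satisfies $TT^{T}=C$, so multiplying independent standard normal values by $T$ yields values with covariance $C$.

```python
import math
import random


def jacobi_eigh(cov, sweeps=50, tol=1e-24):
    """Eigen-decompose a small symmetric matrix with cyclic Jacobi rotations.

    Returns (eigenvalues, V); column j of V is the j-th eigenvector.
    """
    n = len(cov)
    a = [row[:] for row in cov]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes a[p][q] (smaller-magnitude root).
                tau = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = math.copysign(1.0, tau) / (abs(tau) + math.sqrt(1.0 + tau * tau))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = t * c
                for k in range(n):  # A <- A J
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # A <- J^T A
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):  # V <- V J
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p], v[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    return [a[i][i] for i in range(n)], v


def pca_transform(cov):
    """T = V * sqrt(Lambda); tiny negative eigenvalues are clamped to zero."""
    eigvals, v = jacobi_eigh(cov)
    n = len(cov)
    return [[v[i][j] * math.sqrt(max(eigvals[j], 0.0)) for j in range(n)]
            for i in range(n)]


def correlated_sample(transform, rng):
    """Map independent N(0, 1) values to one correlated value per grid."""
    z = [rng.gauss(0.0, 1.0) for _ in transform]
    return [sum(t_ij * z_j for t_ij, z_j in zip(row, z)) for row in transform]
```

For a hypothetical three-grid covariance matrix, `correlated_sample(pca_transform(cov), random.Random(0))` returns one correlated random value per grid.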
In this procedure, to utilize multiple GPUs effectively, the netlist is partitioned with grid units. Fig. 4 shows an example of partitioning and multi-GPU assignment. The grids are equally divided with respect to the number of available GPUs. Fig. 4 shows that six grids exist and three GPUs are available. The six grids are divided uniformly into three partitions for the three GPUs. The leakage current in each partition is computed by the thread on each GPU, and therefore, three corresponding threads are executed on the three GPUs to compute the leakage current values of three partitions. Then, one GPU accumulates the leakage current values of three partitions to obtain the leakage current of a circuit.
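The partitioning and accumulation described above can be sketched on the host side in Python. Assigning contiguous grid numbers to each GPU is an assumption made for illustration (the actual assignment is shown in Fig. 4), and the function names are hypothetical.

```python
def partition_grids(num_grids, num_gpus):
    """Split grid numbers 0..num_grids-1 into num_gpus nearly equal partitions."""
    base, extra = divmod(num_grids, num_gpus)
    partitions, start = [], 0
    for gpu in range(num_gpus):
        size = base + (1 if gpu < extra else 0)  # spread any remainder
        partitions.append(list(range(start, start + size)))
        start += size
    return partitions


def accumulate(per_partition_leakage):
    """Sum partition leakages per MC sample.

    per_partition_leakage[p][s] is the leakage of partition p in MC sample s.
    """
    return [sum(values) for values in zip(*per_partition_leakage)]
```

With six grids and three GPUs, `partition_grids(6, 3)` gives two grids per GPU, and `accumulate` then adds the three partition results for every MC sample to obtain the circuit leakage.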
Fig. 5 shows the program flow of each thread on a GPU for the implemented MC analysis. As mentioned previously, the netlist is divided into
A. LUT Data Structure
The LUT index values are stored in a variable XGPU in constant memory to reduce the access latency of the LUT index values. Because only a one-dimensional array is allowed in the constant memory of the CUDA, XGPU is declared as a one-dimensional array. Let
$${\bf X}_{i}=\&{\bf XGPU}\left[M\sum_{l=1}^{j-1}v_{l}+M\left(k-1\right)+\sum_{l=1}^{i-1}d_{l}\right],\eqno{\hbox{(6)}}$$
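The offset arithmetic inside the brackets of (6) can be checked with a short Python sketch. The meaning attached to each symbol below is an assumption, because the definitions are not available in this text: $M$ is read as the number of XGPU entries reserved per input state, $v_{l}$ as the number of input states of cell type $l$, $j$ and $k$ as the one-based cell type and input state, and $d_{l}$ as the number of LUT index values of parameter $l$. The function name is hypothetical.

```python
def xgpu_offset(M, v, d, j, k, i):
    """Offset of X_i in the flat XGPU array, following the index in (6).

    M -- entries reserved per input state (assumed meaning)
    v -- v[l-1] is the number of input states of cell type l (assumed meaning)
    d -- d[l-1] is the number of index values of parameter l (assumed meaning)
    j, k, i -- one-based cell type, input state, and parameter index
    """
    return M * sum(v[: j - 1]) + M * (k - 1) + sum(d[: i - 1])
```

Under these assumptions, with `M = 8`, `v = [2, 4]`, and `d = [3, 2]`, the second parameter of the third state of the second cell type starts at offset `8*2 + 8*2 + 3`.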
The LUT values, which are first-order polynomials for
B. Cell Leakage Calculation
Fig. 6 shows the CUDA implementation of a cell leakage calculation. For process parameter
Experimental Results
A. Experimental Environment
In the experiments, we evaluated the performance of the proposed approach in terms of its efficiency and accuracy. We used ISCAS-85 circuits and an OpenSparc T2 processor [24] as the benchmark circuits. A Synopsys Design Compiler and Astro were used for synthesis and placement, respectively. We used two transistor models: an industrial 32-nm transistor model and the Predictive Technology Model (PTM) [25] with a high-performance 22-nm process based on the BSIM4 model.
The process parameter variations considered for the PTM were the gate channel length, gate oxide thickness, and both nMOS/pMOS doping concentration-dependent threshold voltage variations. Seven process parameter variations, which include the gate channel length and the gate oxide thickness variations, were considered for the 32-nm industrial transistor model. We assumed that these process variations were normally distributed, and that the D2D and WID components contributed equally to each variation. In addition, the three-sigma
We employed analytic SLE methods and a conventional MC-based leakage estimation method as the benchmark methods. More specifically, we used a Wilkinson's method (WM)-based estimation (which uses Wilkinson's method to sum all of the leakage components modeled using the first-order model) [7], Chang's hybrid method (HC) [4], and VCA [5] as benchmark analytic SLE methods. The GPU-based MC method using the first-order model (F-MC) [18] was used as the benchmark MC-based leakage estimation method. Although F-MC was proposed without considering the spatial correlation and multiple GPUs, we applied the same implementation scheme for the spatial correlation and multiple GPUs, described in Section IV, to F-MC for experimental purposes. In addition, although the authors of [20] did not present a GPU-based MC method using their model, we implemented one in the same MC implementation flow as the proposed model, and used it as a benchmark method (L-MC). The higher-order exponential-polynomial model [6] was not employed because, among the process parameters considered in the experiments, it can handle only the gate channel length and threshold voltage variations.
The gate channel length was considered to be a spatially correlated parameter, and to consider the spatial correlation, we used the grid model in which the area of one region was 10
The proposed method (MC analysis using the proposed hybrid gate leakage model, H-MC), L-MC, and F-MC were implemented using the NVIDIA CUDA programming environment with CUBLAS [28], which is a CUDA version of BLAS [29]. HC was implemented in Mathworks MATLAB [30], and both WM and VCA were implemented in the C programming language. In the experiments, we used a Linux machine consisting of an Intel Xeon E5-2690 octa-core CPU with a 2.9-GHz clock frequency, 64 GB of memory, and three NVIDIA GeForce GTX 680 graphics cards [31].
In the characterization, the leakage current values were sampled for each process parameter in the range of
For the LUT model [20], the number of LUT index values of each process parameter was set to the maximum number of LUT index values of the characterized cell leakage library for the proposed hybrid gate leakage model (thirteen points for the PTM, and eight points for the industrial transistor model), and the LUT index values were determined so as to minimize the error (4).
B. Accuracy Evaluation for the Leakage Model
In this experiment, we used ten ISCAS-85 benchmark circuits. For each circuit, we generated 200 random MC samples. The random values for the D2D variations in the process parameters in each sample were randomly generated as normal distributions with 10% of the
The leakage current of each MC sample was evaluated using the proposed hybrid gate leakage current model, the first-order model, and a LUT-based model [20]. We used the leakage current value of each MC sample obtained using Synopsys HSPICE as the reference value and estimated the absolute relative errors. In the “Leakage calculation” step shown in Fig. 5, Synopsys HSPICE was used to compute the leakage current value of the current input state of the current gate for the given process parameter values.
Table II shows the results of this experiment. The maximum relative error represents the maximum value among the absolute values of relative errors of all MC samples for all ISCAS-85 benchmark circuits. The average relative error represents the average value of the absolute values of relative errors of all MC samples for all ISCAS-85 benchmark circuits. The coefficient of determination
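The error statistics used in this experiment can be computed as in the following Python sketch. The maximum and average relative errors follow the description above; for the coefficient of determination, the standard definition $1-SS_{res}/SS_{tot}$ is assumed. The function names and the numbers in the usage note are hypothetical and are not results from this paper.

```python
def relative_errors(estimates, references):
    """Absolute relative error of each MC sample against its reference value."""
    return [abs((e - r) / r) for e, r in zip(estimates, references)]


def max_and_average_error(estimates, references):
    errors = relative_errors(estimates, references)
    return max(errors), sum(errors) / len(errors)


def r_squared(estimates, references):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_ref = sum(references) / len(references)
    ss_res = sum((r - e) ** 2 for e, r in zip(estimates, references))
    ss_tot = sum((r - mean_ref) ** 2 for r in references)
    return 1.0 - ss_res / ss_tot
```

For hypothetical references `[1, 2, 3, 4]` and estimates `[1.1, 1.9, 3.0, 4.2]`, the maximum relative error is 10% and the average is 5%.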
Fig. 7 shows the leakage current estimation results for 200 random samples of the c6288 ISCAS 85 benchmark circuit when a 22-nm PTM transistor model was used. The
C. Performance Comparison of the Leakage Estimation Methods
In this experiment, we evaluated the leakage current distributions of the benchmark circuits using the proposed and benchmark methods. To obtain the reference leakage current distribution, we used Synopsys HSPICE. Similar to the experiment presented in Section V-B, Synopsys HSPICE was called in the leakage calculation step shown in Fig. 5, and was used to compute the leakage current value of a cell (SPICE-MC).
As mentioned previously, the
Table III presents the average and maximum values of the absolute relative errors of the benchmark circuits. The percentile points of the analytic SLE methods were calculated from the inverse lognormal cumulative distribution function (CDF) using the estimated mean and standard deviation values. These percentile point errors were presented to evaluate the accuracy of the benchmark methods at the tail of the distribution.
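The percentile computation for the analytic SLE methods can be sketched with Python's standard library. The sketch assumes the total leakage is lognormal and converts the estimated mean and standard deviation of the leakage into the parameters $\mu$ and $\sigma$ of the underlying normal distribution before inverting the CDF; the function name and the numbers in the usage note are hypothetical.

```python
import math
from statistics import NormalDist


def lognormal_percentile(mean, std, p):
    """Percentile point p (0 < p < 1) of a lognormal with the given mean and std."""
    # Moment matching: mean = exp(mu + sigma^2 / 2), var = (exp(sigma^2) - 1) * mean^2.
    sigma2 = math.log(1.0 + (std / mean) ** 2)
    mu = math.log(mean) - sigma2 / 2.0
    z = NormalDist().inv_cdf(p)  # standard normal quantile
    return math.exp(mu + math.sqrt(sigma2) * z)
```

For example, `lognormal_percentile(2.0, 1.0, 0.99)` gives the 99th percentile point of a lognormal distribution with mean 2 and standard deviation 1, which lies in the upper tail where the benchmark methods are compared.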
The proposed method (MC analysis with the proposed hybrid gate leakage model, H-MC) clearly showed improved accuracy for all statistics (mean, standard deviation, and percentile points) of both the PTM and the industrial transistor model as compared with all analytic SLE methods and F-MC. For the PTM, the benchmark methods based on the first-order model were inaccurate, with average and maximum errors of approximately 90% for both the mean and standard deviation values. On the other hand, H-MC showed relatively accurate results, with maximum errors of approximately 7% for both the mean and standard deviation values when the error threshold was set to 5%. Although L-MC showed the best accuracy among all methods, the differences between L-MC and the proposed H-MC for the 5% error threshold were less than 3% on average in both the mean and standard deviation values.
Similar results were obtained for the industrial 32-nm transistor model. The first-order model based methods (analytic SLEs and F-MC) were very inaccurate, with average errors greater than 23% for the mean value and greater than 50% for the standard deviation value. Among all methods considered, L-MC showed the best accuracy, as it did for the PTM, but the difference between the results of L-MC and H-MC for an error threshold of 5% was less than 4% on average in both the mean and standard deviation values.
Figs. 8 and 9 show the CDFs for the c6288 ISCAS 85 benchmark circuit of the PTM and industrial transistor model, respectively. Similar to the results presented in Table III, we can easily ascertain that H-MC is more accurate than the first-order model based benchmark methods.
Table IV and Figs. 10 and 11 compare the runtimes of the proposed and benchmark methods when one GPU was used to perform the MC methods. Figs. 10 and 11 show the overall runtime used to analyze ten ISCAS-85 benchmark circuits for the PTM and the industrial transistor model, respectively, and Table IV details the results of the experiment. Note that the HC method was implemented using Mathworks MATLAB [30], so a direct comparison between HC and the other methods would be unfair; we therefore omitted the HC runtime results. In [5], the runtimes of HC, WM, and VCA were compared, and the runtime of HC was shown to be comparable with that of VCA.
As shown in the table, VCA showed the fastest results. VCA was at least 2.5 times faster than the proposed H-MC. F-MC ranked second. When compared with F-MC, the proposed method (H-MC) had at least a 50% longer runtime. Although H-MC was much slower than both VCA and F-MC, it showed much more accurate results. In addition, by applying multi-processing for multiple GPUs, the proposed method can improve the computation speed. When compared with the L-MC, the proposed H-MC reduced the runtimes by 40–70% for the PTM, and over 90% for the industrial model. The computational complexity of L-MC increases at the rate of
Table V shows the comparison results of the runtimes of the proposed method and the benchmark methods for a combinational logic of the OpenSparc T2 core (4.5 million gates). As shown in Table V, VCA was still the fastest among all methods tested. However, the runtime of the proposed H-MC was significantly reduced as the number of GPUs increased. When three GPUs were used for the computation, the runtime was reduced to one-third of the runtime of a single GPU.
Fig. 12 shows the runtimes with respect to the number of GPUs used. In this experiment, we estimated the runtimes for the parallelizing overhead and the MC process for four to eight GPUs through a simulation. The overhead includes the data loading and a summation of the results from all GPUs. Fig. 12 shows that the parallelized overhead increases as the number of GPUs used increases. However, the runtimes of the overhead were small enough to be neglected, as compared with those of the MC simulation, because the data size to be transferred to each GPU, given as
When a single GPU was used, the runtime of H-MC for a 5% error threshold was only approximately 10 min for 4.5 million gates; if more GPUs are available, the runtime of H-MC will be further decreased. We therefore claim that the computational overhead of MC using the proposed hybrid gate leakage current model is not a major problem in practical applications.
Conclusion
This paper proposed an accurate gate leakage current model that combines an LUT and the first-order exponential-polynomial model, along with its characterization method. The proposed hybrid gate leakage model uses an LUT for varying process parameters having strong nonlinear relationship with the logarithm of leakage current and uses the first-order model for other varying process parameters as each table element in LUT. By combining an LUT and the first-order exponential-polynomial model, the proposed model is more accurate than the first-order model and more efficient than the LUT-based model.
In the accuracy evaluation of the proposed hybrid gate leakage current model, the proposed model obtained high
The proposed hybrid gate leakage model was implemented for an MC analysis under a multiple GPU environment. In the MC simulation, the proposed method (MC with the proposed model) was slower than VCA, which is the fastest analytic SLE method among the benchmark methods used. However, the first-order model-based benchmark methods showed large relative errors of approximately 90% for the 22-nm PTM in all statistics, and 50% for the industrial 32-nm transistor model in the standard deviation value and at the tail of the distribution. In contrast, the proposed method was more efficient than the LUT-based MC method while achieving comparable accuracy.
Although the proposed method was slower than VCA, the runtime of the proposed method can be reduced by utilizing multiple GPUs. When only one GPU is used, the runtime is approximately 10 min for a commercial design with 4.5 million gates, and if more GPUs are available, this runtime can be reduced further. The computational cost of the proposed method is therefore not a major obstacle in practical applications.
In conclusion, we expect that an MC analysis using the proposed hybrid gate leakage model can provide accurate leakage analysis results within a practical runtime.