

Received 17 May 2023, accepted 12 June 2023, date of publication 15 June 2023, date of current version 21 June 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3286544

## RESEARCH ARTICLE

# A 0.57 mW@1 FPS In-Column Analog CNN Processor Integrated Into CMOS Image Sensor

BOHYEOK JEONG<sup>1</sup>, (Student Member, IEEE), JAEHWAN LEE<sup>2,3</sup>, JAIHYUK CHOI<sup>1</sup>, (Student Member, IEEE), MINKYU SONG<sup>1</sup>, (Member, IEEE), YOUNGDOO SON<sup>2,3</sup>, (Member, IEEE), AND SOO YOUN KIM<sup>1</sup>, (Member, IEEE)

<sup>1</sup>Department of Semiconductor Science, Dongguk University, Seoul 04620, South Korea
<sup>2</sup>Department of Industrial and Systems Engineering, Dongguk University, Seoul 04620, South Korea
<sup>3</sup>Data Science Laboratory (DSLAB), Dongguk University, Seoul 04620, South Korea

Corresponding authors: Soo Youn Kim (sooyoun@dgu.ac.kr) and Youngdoo Son (youngdoo@dongguk.edu)

This work was supported in part by National R&D Program through the National Research Foundation of Korea (NRF) funded by Ministry of Science and ICT under Grant 2023M3F3A2A01037928 and Grant RS-2023-00208412; in part by the Ministry of Trade, Industry & Energy (MOTIE) and the Korea Semiconductor Research Consortium (KSRC) Support Program for the Development of the Future Semiconductor Device under Grant 2001930; and in part by the IC Design Education Center (IDEC), South Korea, for Electronic Design Automation (EDA) tool.

**ABSTRACT** This article presents a high-performance, low-power analog convolutional neural network (CNN) circuit integrated into a CMOS image sensor (CIS) for face detection applications. The main block of the proposed in-column analog CNN circuit is an analog multiplication-and-accumulation (MAC) circuit consisting of an operational transconductance amplifier-based switched-capacitor circuit that enables programmable weights. With the proposed MAC, a 3-layer analog CNN processor is implemented in the column-parallel readout circuit of a conventional CIS. Furthermore, for low-power CNN operation, we use a low-resolution analog-to-digital converter with the proposed nonlinear quantization method, which increases the face-detection accuracy from 92.8% to 98.75% at 120 frames per second (FPS) with 2.8 V/1.5 V supply voltages. A prototype sensor with $160 \times 120$ effective image resolution was fabricated using a 110 nm CMOS image sensor process. The measurement results showed that the maximum power consumption was 0.57 mW and 4.02 mW at 1 FPS and 120 FPS, respectively.

**INDEX TERMS** CMOS image sensor, convolutional neural networks, face detection, multiplication-and-accumulation, nonlinear quantization.

#### I. INTRODUCTION

Recently, with the advancement of machine learning, various deep neural network (DNN)-based applications have been widely used in the Internet of Things (IoT) [1]. In the conventional DNN system architecture, image information from an external high-performance CMOS image sensor (CIS) is converted into digital output, and DNN tasks are performed on a neural processing unit (NPU) [2]. Operating two chips requires a large memory area to transmit extensive image data to the external NPU chip, resulting in decreased operation speed and increased power consumption. To overcome these issues of two-chip solutions, simple image processors adjacent to image sensors with switched-current-based multiplication-and-accumulation (MAC) and in-column convolutional neural network (CNN) processors in CIS have been proposed [3], [4], [5]. The switched-current-based MAC converts the pixel voltage into a current and performs a convolution operation. However, since errors in the current mirror circuit can lead to variations in the weight values, a high-performance current mirror is required. In addition, the frame rate is low (1 frame per second (FPS) in [3]). The in-column CNN processor in CIS showed a fast data-processing time (120 FPS in [4]). However, because it used a switched-capacitor-based analog MAC consisting of passive devices without an operational transconductance amplifier (OTA), the weights could not be changed, resulting in low face-detection accuracy due to the fixed weights. Furthermore, the functions of the algorithms are limited [4], [5].

The associate editor coordinating the review of this manuscript and approving it for publication was Michele Nappi.



<Conventional CIS with an external CNN processor>

FIGURE 1. Conventional CIS with CNN system and the proposed CIS system.



FIGURE 2. Accuracy of the proposed CIS by ADC resolution.

Therefore, we present an in-column CNN processor in a CIS capable of weight updates and high-speed operation (120 FPS) using the proposed switched-capacitor-based analog MAC with an OTA. In addition, for low-power operation within a limited area, a nonlinear quantization technique is proposed for a low-resolution analog-to-digital converter (ADC) to improve the face-detection accuracy. Fig. 1 shows a conventional CIS with a CNN system and the proposed CIS system. The conventional system performs CNN algorithms in the digital domain after converting the analog voltages representing extensive image data from the pixel array to digital codes with a high-performance ADC. In contrast, the proposed CIS has an in-column CNN processor structure with two integrated convolution layers for low-power operation and high face-detection accuracy. The proposed CIS consists of an analog MAC circuit and a memory block to perform convolution operations. The algorithm of the proposed analog CNN processor consists of a $2 \times 2$ convolution layer (stride = 1) and a $2 \times 2$ pooling layer (stride = 2). Fig. 2 shows the trend of face-detection accuracy with different ADC resolutions of the proposed CIS. The feature maps were quantized with ADC resolutions ranging from one to eight bits, and the accuracy was calculated after classification for each resolution. From the results, we found that ADCs with four or more bits had similar image classification accuracies of approximately 96%. We attribute this saturation above a certain ADC resolution to the fact that, in face detection, classification is performed by learning key elements of a human face, such as the eyes and nose.
Therefore, unlike CNN systems using a conventional CIS, which require a high-resolution ADC ($\geq$ 12 bits), the ADC resolution for quantizing feature map images in the proposed in-column CNN can be lowered (5 bits in this paper). Furthermore, with the proposed analog MAC with an OTA-based switched-capacitor circuit, weight updates are possible, improving the face-detection accuracy by approximately 2.62%, from 92.8% to 95.42%. In addition, with the proposed nonlinear quantization technique, the CIS achieves a face-detection accuracy of 97.50% and a power consumption of 4.02 mW at 120 FPS with a 5-bit ADC.
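The effect of ADC resolution on a quantized feature map can be illustrated with a plain uniform quantizer. The following Python sketch is only illustrative: the function names `quantize_uniform` and `worst_case_error`, the normalized input range of 0 to 1, and the mid-step reconstruction are assumptions and do not reproduce the authors' evaluation flow.

```python
def quantize_uniform(x, bits, lo=0.0, hi=1.0):
    """Map x in [lo, hi] to an integer code of the given bit width."""
    levels = 2 ** bits
    step = (hi - lo) / levels
    code = int((x - lo) / step)
    return max(0, min(levels - 1, code))  # clamp to the valid code range


def worst_case_error(bits, lo=0.0, hi=1.0):
    """Largest error when every code is reconstructed at its mid-step value."""
    return (hi - lo) / 2 ** bits / 2


# Sweep the 1-bit to 8-bit range described for Fig. 2.
for bits in range(1, 9):
    print(bits, 2 ** bits, worst_case_error(bits))
```

Each added bit halves the worst-case quantization error, so the accuracy saturation reported around four bits suggests that the classifier does not need the finer levels for this task.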

This article is organized as follows. Section II describes the operation of the proposed analog MAC and nonlinear quantization. Section III presents the measurement results with a prototype sensor. Finally, Section IV presents the conclusions of this paper.

#### II. THE PROPOSED ANALOG CNN PROCESSOR IN CIS SYSTEM

#### A. OVERALL SYSTEM ARCHITECTURE

Fig. 3 shows the proposed analog CNN processor architecture; it receives $160 \times 120$ image data as input and consists of $2 \times 2$ convolutional layers (stride = 1) and $2 \times 2$ pooling layers (stride = 2). The output 5-bit $40 \times 30$ feature map data are passed to a fully connected layer processed in software. Fig. 4 shows the proposed CIS architecture integrated with an in-column analog CNN processor. The CIS consists of a $160 \times 120$ pixel array and source followers and uses a rolling-shutter readout method. For analog CNN processing, each column contains three layers using the proposed analog MAC circuit and a nonlinear single-slope analog-to-digital converter (SSADC). Fig. 5 shows the two operation modes of the proposed CIS: CIS mode and CNN mode. The proposed CIS operates in either mode, depending on the operation of the analog convolution processor. In this model, two convolutional layers were implemented with three analog layers.

In CIS mode, the $160 \times 120$ pixel voltages are input to the analog convolution processor, and the first layer (Layer-1) performs correlated double sampling (CDS) on each pixel and stores the result. After that, through the read and store process of Layer-1 and the final layer (Layer-F), the $160 \times 120$ pixel voltage data are converted into a 5-bit CIS image by the SSADC.

In CNN mode, the $160 \times 120$ pixel voltages are input to the analog convolution processor as in CIS mode. Layer-1 performs CDS and the first convolutional weight operation (with a $2 \times 2$ convolutional mask). Then, the second layer (Layer-2) accumulates the output of Layer-1 in analog memories and applies average pooling and the second convolutional weight operation (with a $2 \times 2$ convolutional mask). After pooling, the $80 \times 60$ feature map data are stored in Layer-2, and the weights of the second $2 \times 2$ convolution layer are applied and the results stored. Polarity and pooling are performed in the same way as in Layer-2, and the $160 \times 120$ input pixel data are finally compressed and output as $40 \times 30$ feature map data. Finally, in Layer-F, the Layer-2 values are accumulated in analog memories, and the second average pooling is performed. The $40 \times 30$ feature map is converted into a 5-bit CNN image at the SSADC using a nonlinear ramp signal. The proposed CIS can retrain the algorithm models by obtaining noisy images from the CIS-mode output data. Based on the retraining results, weight updates were performed, improving the model accuracy by 2.62%, from 92.8% to 95.42%.

FIGURE 3. The proposed analog convolutional neural network algorithm.

FIGURE 4. The proposed CIS system architecture.

FIGURE 5. Operation mode of the proposed CIS.

#### B. THE PROPOSED ANALOG MAC CIRCUIT

Fig. 6 shows the proposed analog MAC circuit and its timing diagram for the CDS and convolution weight operations. The proposed analog MAC circuit with an auto-zeroing operation was implemented in a column pitch (12.8 $\mu$m in this paper) with a gain of approximately 50 dB. The OTA operates in a reset phase and a redistribution phase, controlled by the auto-zeroing (AZ) signal. In the reset phase, AZ is logic low, and the input transistors M1 and M6 are connected to the drain nodes of the current sources (M2 and M5) to sample the OTA current bias (the auto-zeroing voltage) onto $C_{AZ}$ under negative feedback. In the redistribution phase, AZ is logic high. M1 and M6 use the bias sampled on $C_{AZ}$, and the AZb switches are closed so that the OTA forms a cascode inverter structure. M3 and M4 set the bias voltages $V_{G1}$ and $V_{G6}$ while a static current determined by $V_{SG3}$ and $V_{GS4}$ flows during the reset phase (the so-called floating current source [6]). After the reset, M3 and M4 are bypassed. The OTA uses a negative feedback structure so that programmable weight values can be used during analog MAC operations. The analog MAC operations are discussed in more detail in the next section.

### IEEE Access

FIGURE 6. The proposed analog MAC circuit & timing diagram.

FIGURE 7. Convolution mask of the proposed CIS & convolution operation.

#### C. ANALOG CONVOLUTION LAYER

Fig. 7 shows the convolution mask and the operation of the convolution layer of the proposed CIS. The proposed CIS uses a $2 \times 2$ convolution mask (stride = 1) to improve the image classification accuracy and reduce data loss. Fig. 8 shows the operation process and convolution weight of the proposed conv1 layer. Equation (1) describes the CDS of the pixel voltage ($\Delta V_{PIX} = V_{rst} - V_{sig}$) in the proposed analog MAC. The capacitances of the analog MAC are $C_S = C_H = 400$ fF and $C_D = 200$ fF. The initial charge in the reset phase is given by $Q_{R1}$ in (1). When the signal voltage $V_{sig}$ of the pixel is input, the analog MAC operates in the signal phase, and the charge changes to $Q_{W1}$ in (1). Equating the two charges gives the CDS voltage $V_{out}$ in (1), which is input to the conv2 layer in the read phase.

$$\begin{aligned}
Q_{R1} &= C_S (V_{rst} - V_{ref}) + 2C_H (V_{ref} - V_{ref}) \\
Q_{W1} &= C_S (V_{sig} - V_{ref}) + 2C_H (V_{out} - V_{ref}) \\
V_{out} &= V_{H1,2} = \frac{C_S}{2C_H} (V_{rst} - V_{sig}) + V_{ref}
\end{aligned} \tag{1}$$
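The charge balance in (1) can be checked numerically. The Python sketch below is a behavioral illustration only: the function name `cds_output` and the example voltages are hypothetical, and the capacitances are the 400 fF values quoted in the text.

```python
# Capacitances quoted in the text, in farads.
C_S = 400e-15
C_H = 400e-15


def cds_output(v_rst, v_sig, v_ref, c_s=C_S, c_h=C_H):
    """Solve Q_R1 = Q_W1 from (1) for V_out."""
    # Reset phase: the 2*C_H term is zero because both plates sit at V_ref.
    q_r1 = c_s * (v_rst - v_ref)
    # Signal phase: Q_W1 = c_s*(v_sig - v_ref) + 2*c_h*(v_out - v_ref).
    # Setting Q_W1 equal to Q_R1 and solving for v_out:
    return (q_r1 - c_s * (v_sig - v_ref)) / (2 * c_h) + v_ref


# Hypothetical pixel voltages: a 0.4 V signal swing around a 1.0 V reference.
v_out = cds_output(v_rst=1.8, v_sig=1.4, v_ref=1.0)
print(v_out)
```

With $C_S = C_H$, the result is the reference voltage plus half of $\Delta V_{PIX}$, which is the $\frac{1}{2}\Delta V_{PIX}$ term that appears in the weight equations.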

VOLUME 11, 2023

After the CDS operation, the voltage is stored in $C_{H1}$ and $C_{H2}$ with respect to the virtual ground ($V_{AZ}$) of the analog MAC. The convolution weight is implemented using the stored voltage and the divide phase, as shown in Fig. 8. The implementation of the convolution weight by charge distribution follows (2).

$$V_{out} = V_{H2} = \frac{C_H}{C_H + C_D} \times \left(\frac{1}{2}\Delta V_{PIX}\right) \tag{2}$$

The convolution weight was implemented in the divide phase of the analog MAC. Through charge distribution, one third of the stored charge is transferred to $C_D$, so that two thirds of the voltage remains on $C_H$. Rst is the reset switch of the analog MAC that connects the node of $C_D$ to $V_{ref}$. The voltage obtained by repeating the weight implementation process $n$ times is given in (3).

$$V_{out} = V_{H2} = \left(\frac{C_H}{C_H + C_D}\right)^n \times \left(\frac{1}{2}\Delta V_{PIX}\right) = \alpha^n \times \left(\frac{1}{2}\Delta V_{PIX}\right) \tag{3}$$

The convolution weight is programmable and is defined as $\alpha^n$, where $n$ is the number of iterations of the divide and rst phases ($n = 0, 1, 2, 3$; the total number of weights is nine). The proposed analog MAC circuit uses the charge redistribution characteristics of the capacitors for programmable weights. Since the number of capacitors that can be integrated into one column pitch is limited, the accuracy is more limited than when arbitrary weights are used. Therefore, in this paper, we propose a method to improve the accuracy through a nonlinear ADC while using a 3-bit weight. Nonlinear ADC operations are discussed in more detail in Section II-D.
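The weight magnitudes implied by (3) follow directly from the capacitor ratio. The Python sketch below models the repeated divide and rst phases as ideal charge sharing; the helper names `divide_once` and `apply_weight` are hypothetical, and voltages are taken relative to $V_{ref}$ as an assumption.

```python
C_H = 400e-15  # hold capacitor, from the text
C_D = 200e-15  # divide capacitor, from the text
ALPHA = C_H / (C_H + C_D)  # 2/3 for these values


def divide_once(v_h):
    """One divide phase: C_H (at v_h) shares charge with C_D (reset to 0).

    Voltages are relative to V_ref, so a reset C_D holds zero charge.
    """
    return C_H * v_h / (C_H + C_D)


def apply_weight(v_h, n):
    """Repeat the divide and rst phases n times, as in (3)."""
    for _ in range(n):
        v_h = divide_once(v_h)
    return v_h


magnitudes = [ALPHA ** n for n in range(4)]  # n = 0, 1, 2, 3
print(magnitudes)
print(apply_weight(0.3, 2))  # equals 0.3 * ALPHA**2
```

Because every step multiplies by the same ratio, the available magnitudes are geometrically spaced rather than uniformly spaced, which is the limitation relative to arbitrary weights noted above.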

The conv1 layer implements only the magnitude of the convolution weight. Polarity operations and data accumulation are performed in the conv2 layer. Fig. 9 shows the polarity operation with the three phases of the conv2 layer. The inverted output voltage of the conv2 layer follows (4). First, in the reset phase of the conv2 layer, $C_S$ and $C_H$ are reset, and the charge is $Q_{R2}$ in (4). Second, in the write phase, the voltage $V_{FM}$ from the conv1 layer is input, and the charge changes to $Q_{W2}$. The voltage $V_{H1,2}$ is stored with the inverted polarity of $V_{FM}$.

$$\begin{aligned}
Q_{R2} &= C_S (V_{ref} - V_{ref}) + 2C_H (V_{ref} - V_{ref}) \\
Q_{W2} &= C_S (V_{FM} + V_{ref} - V_{ref}) + 2C_H (V_{out} - V_{ref}) \\
V_{out} &= V_{H1,2} = -\frac{C_S}{2C_H} V_{FM} + V_{ref}
\end{aligned} \tag{4}$$

FIGURE 8. The proposed analog MAC convolution weight operations.

The non-inverted sampling of the proposed analog MAC follows (5). First, when $V_{FM}$ is input, the initial charge in the write phase of the analog MAC and the reset phase of the OTA is $Q_{WR2}$. Second, the polarity phase of the analog MAC resets $C_S$ and changes the charge to $Q_{RL2}$. Finally, the output voltage $V_{out}$ is sampled with the non-inverted polarity of $V_{FM}$.

$$\begin{aligned}
Q_{WR2} &= C_S (V_{FM} + V_{ref} - V_{ref}) \\
Q_{RL2} &= C_S (V_{ref} - V_{ref}) + 2C_H (V_{out} - V_{ref}) \\
V_{out} &= V_{H1,2} = +\frac{C_S}{2C_H} V_{FM} + V_{ref}
\end{aligned} \tag{5}$$
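Equations (4) and (5) differ only in the sign of the sampled feature-map voltage. A minimal Python sketch of both cases follows, assuming that the conv2 layer uses the same $C_S = C_H = 400$ fF quoted for the analog MAC and that $V_{FM}$ is measured relative to $V_{ref}$; the function name `conv2_sample` and the example voltages are hypothetical.

```python
def conv2_sample(v_fm, v_ref, inverted, c_s=400e-15, c_h=400e-15):
    """Solve the charge balance of (4) or (5) for V_out."""
    q_in = c_s * v_fm  # charge that V_FM places on C_S
    # (4): 0 = q_in + 2*c_h*(v_out - v_ref)   -> inverted sample
    # (5): q_in = 2*c_h*(v_out - v_ref)       -> non-inverted sample
    sign = -1.0 if inverted else 1.0
    return sign * q_in / (2 * c_h) + v_ref


# Hypothetical feature-map voltage of 0.2 V around a 1.0 V reference.
print(conv2_sample(0.2, 1.0, inverted=True))
print(conv2_sample(0.2, 1.0, inverted=False))
```

The two outputs are placed symmetrically around $V_{ref}$, which is how a single weight magnitude from the conv1 layer can serve as either a positive or a negative mask coefficient.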

Fig. 10 shows the proposed analog CNN architecture. The processes shown in Figs. 8 and 9 implement the magnitude and polarity of the convolution mask, and the masks for columns M and M+1 shown in Fig. 7 are implemented. The CNN of the proposed CIS has a $2 \times 2$ mask (stride = 1) structure and receives data from one row at a time. Before the data of pixel row N are read, the weight of mask row N-1 is applied to the row N-1 data stored in the conv1 layer. The data of columns M and M+1 are received and accumulated through the MUX. After processing mask row N-1, the conv1 layer performs CDS and the weight operation on pixel row N. The calculated 2-row data (one row of the feature map) are accumulated in the conv2 layer for the pooling operation. When two rows of the feature map have accumulated in the conv2 layer by repeating the previous process, $2 \times 2$ pooling (stride = 2) is performed by connecting the M+1 column with the switches $S_{ML}$ and $S_{MR}$. The convolution and pooling processes of the conv2 and convF layers are identical. Finally, the 4-row pixel data are compressed into 1-row feature map data by the analog CNN.

Furthermore, the proposed analog CNN operates with $2 \times 2$ average pooling (stride = 2). First, the 2-row feature map data produced by the conv1 and conv2 layers are accumulated in the conv2 and convF layers. Second, average pooling is performed by connecting the M and M+1 columns using a binning switch ($S_{Bin}$). The $V_{FM}$ stored in $C_{H1,2}$ is redistributed as an average voltage. The average pooling operation of the proposed analog MAC follows (6).

$$\begin{aligned}
Q_{P,M} &= 2C_H (V_{FM,M}) + 2C_H (V_{FM,M+1}) \\
Q_{P,M+1} &= 4C_H (V_{out} - V_{ref}) \\
V_{out} &= V_{H1,2} = \frac{1}{2} (V_{FM,M} + V_{FM,M+1}) + V_{ref}
\end{aligned} \tag{6}$$

By operating with the average pooling layers, the $160 \times 120$ input data are compressed and output as $40 \times 30$ feature map data. CIS mode requires 160 column ADCs, whereas CNN mode requires only 40. As a result, the power consumption at 120 FPS is 4.54 mW in CIS mode and 4.02 mW in CNN mode, a reduction of 11%.
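The data reduction from $160 \times 120$ to $40 \times 30$ can be traced with a purely behavioral model of the convolution and pooling sequence. The Python sketch below is not the circuit: it uses ideal arithmetic, ignores the capacitor-ratio gain of the analog stages, and assumes that the last row and column are replicated at the border so that the sizes come out as $80 \times 60$ and $40 \times 30$ (the text does not state how borders are handled). The mask values and all function names are hypothetical.

```python
def conv2x2(img, mask):
    """2x2 convolution with stride 1; the border is replicated (assumption)."""
    rows, cols = len(img), len(img[0])
    out = []
    for r in range(rows):
        r1 = min(r + 1, rows - 1)
        out_row = []
        for c in range(cols):
            c1 = min(c + 1, cols - 1)
            out_row.append(
                mask[0][0] * img[r][c] + mask[0][1] * img[r][c1]
                + mask[1][0] * img[r1][c] + mask[1][1] * img[r1][c1]
            )
        out.append(out_row)
    return out


def avg_pool2x2(img):
    """2x2 average pooling with stride 2."""
    return [
        [(img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4.0
         for c in range(0, len(img[0]) - 1, 2)]
        for r in range(0, len(img) - 1, 2)
    ]


def feature_map(img, mask1, mask2):
    """conv -> pool -> conv -> pool, the layer order described in the text."""
    x = avg_pool2x2(conv2x2(img, mask1))   # 160x120 -> 80x60
    return avg_pool2x2(conv2x2(x, mask2))  # 80x60 -> 40x30


ALPHA = 2.0 / 3.0
# Hypothetical masks built from signed powers of ALPHA.
MASK1 = [[1.0, -ALPHA], [ALPHA ** 2, -ALPHA ** 3]]
MASK2 = [[ALPHA, ALPHA], [-1.0, ALPHA ** 2]]

image = [[0.5] * 160 for _ in range(120)]  # flat test image, 120 rows of 160
fm = feature_map(image, MASK1, MASK2)
print(len(fm[0]), len(fm))  # width and height of the feature map
```

Since each output row of the final feature map depends on four input rows, only one conversion per four pixel rows and per four columns is needed, which is the origin of the reduced ADC activity in CNN mode.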

#### D. NONLINEAR QUANTIZATION TO IMPROVE ACCURACY

Fig. 11 shows the data conversion process and images of the SSADC in each mode of the proposed CIS. In CIS mode, the pixel voltage after CDS is applied to the SSADC comparator input node $V_{LF}$, and the ramp signal is applied to $V_{ramp}$ to output a 5-bit CIS image. In CNN mode, the feature map (FM) calculated by the analog CNN is applied to $V_{LF}$. A 5-bit CNN image is output by applying the CNN offset signal shown in Fig. 12, which makes negative calculations possible. Based on the output 5-bit feature map data, face detection achieved an accuracy of 92.8%. Because of the random noise generated in the hardware implementation, this accuracy was 3.2% lower than that of the existing software model. To compensate for the reduced accuracy, software retraining was conducted with noise-containing data, and through the weight update and regularization of the fully connected layer, the accuracy was improved to 95.42%.

FIGURE 10. The proposed analog CNN architecture.

FIGURE 11. Data conversion operations of the proposed CIS.

FIGURE 12. Image classification accuracy and weight changes according to regularization lambda.

Fig. 12 shows how the weights and the image classification accuracy change with the regularization lambda of the fully connected layer. Regularization drives the less critical weights of the fully connected layer to zero and allows only the highly critical weights to contribute to the classification. As lambda increased, the weights and accuracy decreased, but the critical feature map data could be identified. As a result, classification accuracy improves for features such as eyes, nose, cheekbones, and hair. The feature map data with nonzero weights were the 5-bit codes 0–3 and 20–31, which are close to the minimum and maximum data conversion values of the feature map and correspond to the start and end of the CNN ramp signal, respectively. The other codes are of low importance for the CNN because their weights are zero.

FIGURE 13. Comparison of the accuracy of face detection with linear and nonlinear signals.

Fig. 13 compares the accuracies obtained with linear and nonlinear ramp signals. The proposed CIS can apply a double gain slope to the highly critical values (codes 0–1 and 26–31 of a normal 5-bit ramp), which has the same effect as using a high-resolution ADC. In the ADC of the proposed CIS, noise can be reduced compared with the conventional SSADC because the ADC quantization noise is smaller than the thermal noise [7], [8]. For the less critical values, gain slopes of 1 and 0.5 are applied so that the ADC remains at 5 bits, which improves the face-detection accuracy by 2.08%, from 95.42% to 97.5%. The benefit of the proposed technique grows with the bit resolution: in the 8-bit resolution SSADC, nonlinear quantization improved the accuracy by 3.18%, from 95.59% to 98.75%. Using the proposed technique, the proposed CIS achieves high face-detection accuracy even with a low-resolution ADC and reduces the ADC area and power.
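The idea of spending more resolution on the critical codes can be sketched as a quantizer with non-uniform step widths. In the Python sketch below, a gain of 2 is interpreted as a step half as wide as a gain-1 step, and a gain of 0.5 as a step twice as wide; this interpretation, the assignment of gain 1 to codes 2–7 and 20–25 and gain 0.5 to codes 8–19, the normalized full scale, and the function names are all assumptions. Only the double-gain code ranges (0–1 and 26–31) are taken from the text.

```python
from bisect import bisect_right


def ramp_thresholds(gains, full_scale=1.0):
    """Upper decision threshold of each code; step width is proportional to 1/gain."""
    widths = [1.0 / g for g in gains]
    scale = full_scale / sum(widths)
    thresholds, level = [], 0.0
    for w in widths:
        level += w * scale
        thresholds.append(level)
    return thresholds


def quantize(v, thresholds):
    """Return the code whose step contains v, clamped to the last code."""
    return min(bisect_right(thresholds, v), len(thresholds) - 1)


# 32 codes: gain 2 at both ends (from the text), gain 1 and 0.5 in between
# (the split between gain 1 and gain 0.5 is assumed for illustration).
GAINS = [2.0] * 2 + [1.0] * 6 + [0.5] * 12 + [1.0] * 6 + [2.0] * 6
nonlinear = ramp_thresholds(GAINS)
linear = ramp_thresholds([1.0] * 32)

v = 0.02  # a value near the bottom of the range
print(quantize(v, linear), quantize(v, nonlinear))
```

Near the ends of the range, the nonlinear ramp separates values that the linear 5-bit ramp maps to the same code, while the total number of codes stays at 32.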

#### **III. IMPLEMENTATION RESULTS**

Fig. 14 shows the layout and summary of the CIS with the proposed integrated analog CNN processor. The proposed CIS was designed in a 110 nm 1P4M CIS process with a chip area of $3.3 \times 3.6 \text{ mm}^2$. The pixel is a 4-transistor (4T) active pixel sensor read out with a rolling shutter, and the image resolution is $160 \times 120$. The total power consumption is 4.02 mW at 120 frames/s with a global clock of 20 MHz.

Fig. 15 shows the power consumption in each mode of the proposed CIS. As shown in Fig. 5, the CIS and CNN modes have different readout operations. In CIS mode, the proposed CIS uses the analog CNN processor as the CDS circuit and analog memory, outputs $160 \times 120$ 5-bit data, and consumes 4.54 mW at 120 FPS. In CNN mode, the proposed CIS converts the $160 \times 120$ input data into $40 \times 30$ 5-bit output data, as it compresses 4-row pixel data into 1-row data with the analog CNN operation. Since only 40 of the 160 column ADCs are used, the power consumption decreases by 11% to 4.02 mW.

FIGURE 14. Chip photograph.

| Power consumption @ 120 FPS | CIS mode | CNN mode |
|-----------------------------|----------|----------|
| Pixel                       | 1.08 mW  | 1.08 mW  |
| Digital                     | 0.31 mW  | 0.26 mW  |
| Analog                      | 3.15 mW  | 2.7 mW   |
| Total                       | 4.54 mW  | 4.02 mW (11% power reduction) |

FIGURE 15. Measured power consumption reduction.
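The relative saving between the two modes follows from the measured totals. The short Python sketch below only restates that arithmetic with the values given in the text; the variable names are hypothetical.

```python
ADC_COLUMNS_CIS = 160  # column ADCs active in CIS mode
ADC_COLUMNS_CNN = 40   # column ADCs active in CNN mode
P_CIS_MW = 4.54        # measured total power in CIS mode at 120 FPS
P_CNN_MW = 4.02        # measured total power in CNN mode at 120 FPS

adc_fraction = ADC_COLUMNS_CNN / ADC_COLUMNS_CIS
power_reduction_pct = 100.0 * (P_CIS_MW - P_CNN_MW) / P_CIS_MW

print(adc_fraction)
print(power_reduction_pct)
```

The total falls by a much smaller fraction than the number of active ADCs because the pixel array and the analog MAC layers operate on the full $160 \times 120$ input in both modes.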

Fig. 16 shows the measurement environment and the system block diagram of the proposed CIS. The proposed CIS was measured using field-programmable gate array (FPGA) boards, an LED display projecting test images, a host PC, and PC software. The test images were displayed on a laptop LED display, and the proposed CIS output different images in the CIS and CNN modes. Both the CIS and CNN modes operate with only one ramp generator, as in a normal CIS. In this paper, the ramp signal was generated using an external digital-to-analog converter (DAC) on an FPGA board. However, a bidirectional gamma curve for the nonlinear characteristics can be implemented in almost the same area by modifying the counter circuit to change the counter clock frequency [8]. The output images are transmitted from the FPGA to the host PC and then to the PC software, where image classification and retraining are performed. In addition, the trained weights and nonlinear ramp signals can be adjusted externally through the host PC and FPGA. Images (input dataset) with $160 \times 120$ resolution consisting of 600 human faces and 600 non-face objects were used in the experiment. The entire dataset was divided at a 4:1 ratio into a training set and a test set. The feature map images were quantized using the proposed analog CNN circuits and nonlinear ADC, and the final fully connected layer performs the classification, from which the accuracy is estimated. The proposed CIS achieved 97.50% face-detection accuracy at a 5-bit resolution and 98.75% at an 8-bit resolution. Fig. 17 shows the images measured in the CIS and CNN modes. It should be noted that the CIS mode is used to control the focus before capturing images and to collect data for retraining.

FIGURE 16. Measurement environment & system.

FIGURE 17. Measured images of (a) CIS mode and (b) CNN mode of the proposed CIS.

#### TABLE 1. Performance comparison table for CIS-integrated FD.

|                              | ISCAS'19 [3]                       | Sensors'20 [4]          | ISSCC'22 [5]            | This work                                         |
|------------------------------|------------------------------------|-------------------------|-------------------------|---------------------------------------------------|
| Technology                   | 65 nm                              | 110 nm                  | 180 nm                  | 110 nm                                            |
| Supply voltage, analog [V]   | 2.5, 1.2                           | 3.3                     | 0.8                     | 2.8                                               |
| Supply voltage, digital [V]  | 0.77–1.1                           | 1.5                     | 0.8                     | 1.5                                               |
| Pixel readout                | 4T-APS                             | 4T-APS                  | 4T-PWM                  | 4T-APS                                            |
| Pixel size [µm<sup>2</sup>]  | 7×7                                | 3.2×3.2                 | 7.6×7.6                 | 3.2×3.2                                           |
| Pixel resolution             | 320×240                            | 160×120                 | 126×126                 | 160×120                                           |
| Maximum frame rate [FPS]     | 1                                  | 120                     | 250                     | 120                                               |
| Chip area [mm<sup>2</sup>]   | 3.6×4.4                            | 2.93×2.61               | 2.18×2.46               | 3.3×2.6                                           |
| Algorithm                    | FD & FR: analog-digital hybrid CNN | FD: analog CNN          | FD: analog-digital CNN  | FD: analog CNN                                    |
| Accuracy [%]                 | 1.5b<sup>a</sup>: 96.18            | 8b<sup>a</sup>: 89.33   | 8b<sup>a</sup>: 93      | 5b<sup>a</sup>: 97.50; 8b<sup>a</sup>: 98.75      |
| Total power @1 FPS [mW]      | 0.62                               | 0.16<sup>b</sup>        | -                       | 0.57<sup>b</sup>                                  |
| Total power @120 FPS [mW]    | -                                  | 1.12                    | -                       | 4.02                                              |
| Total power @250 FPS [mW]    | -                                  | -                       | 0.135                   | -                                                 |

<sup>a</sup> ADC resolution.

<sup>b</sup> Estimated power consumption for FD; power consumption is reduced by about 20% with the reduction of every 10 FPS in [9]

Table 1 compares the performance of CISs integrated with face-detection (FD) algorithms. The conventional CIS in [4] performs face detection in the column for integration and fast data processing, but its constraints, namely a $2 \times 2$ convolution layer (stride = 2), fixed weights, and an 8-bit resolution ADC, result in low face-detection accuracy. The out-of-array processor design in [3] achieves 96.18% accuracy at 0.62 mW through switched-current-based convolution and digital operations but has a slow data-processing rate of 1 FPS. The design in [5], which uses 4T pulse width modulation (PWM) pixels for fast data processing and low power consumption, consumes 0.135 mW at 250 FPS and achieves 93% face-detection accuracy. However, it is difficult to use the 4T-PWM readout method in typical mobile applications. The proposed CIS consumes 4.02 mW at 120 FPS with an in-column design using a general 4T-APS. The analog CNN processor with updatable weights and nonlinear quantization achieved a face-detection accuracy of 98.75% with an 8-bit resolution SSADC. The proposed CIS has the lowest power consumption, 0.57 mW, among the methods that achieved ≥95% face-detection accuracy, together with a fast data-processing speed of 120 FPS.

#### **IV. CONCLUSION**

This article proposes a CIS integrated with an in-column analog CNN processor that achieves a power consumption of 4.02 mW and 98.75% face-detection accuracy at 120 FPS operation. The proposed analog MAC circuit integrates analog memory and performs convolution and pooling operations, which makes low-resolution ADC usage and weight updates possible. The proposed CIS can achieve high face-detection accuracy with weight updates and the proposed nonlinear quantization technique, even when the face-detection accuracy is reduced by the random noise generated in the hardware implementation. The proposed CNN architecture can be implemented as an in-column CNN processor with a $160 \times 120$ image resolution, which can be effective for a variety of face-detection mobile applications.

#### REFERENCES

- [1] K. Bong, S. Choi, C. Kim, S. Kang, Y. Kim, and H. Yoo, "14.6 A 0.62 mW ultra-low-power convolutional-neural-network face-recognition processor and a CIS integrated with always-on Haar-like face detector," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2017, pp. 248–249.
- [2] R. Eki, S. Yamada, H. Ozawa, H. Kai, K. Okuike, H. Gowtham, H. Nakanishi, E. Almog, Y. Livne, G. Yuval, E. Zyss, and T. Izawa, "9.6 A 1/2.3inch 12.3Mpixel with on-chip 4.97TOPS/W CNN processor back-illuminated stacked CMOS image sensor," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, vol. 64, Feb. 2021, pp. 154–156.
- [3] J. Kim, C. Kim, K. Kim, and H. Yoo, "An ultra-low-power analog-digital hybrid CNN face recognition processor integrated with a CIS for always-on mobile devices," in *Proc. IEEE Int. Symp. Circuits Syst. (ISCAS)*, May 2019, pp. 1–5.
- [4] J. Choi, S. Lee, Y. Son, and S. Y. Kim, "Design of an always-on image sensor using an analog lightweight convolutional neural network," *Sensors*, vol. 20, no. 11, p. 3101, May 2020.
- [5] T. Hsu, G. Chen, Y. Chen, C. Lo, R. Liu, M. Chang, K. Tang, and C. Hsieh, "A 0.8 V intelligent vision sensor with tiny convolutional neural network and programmable weights using mixed-mode processing-in-sensor technique for image classification," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, vol. 65, Feb. 2022, pp. 1–3.
- [6] C. Young, A. Omid-Zohoor, P. Lajevardi, and B. Murmann, "5.3 A data-compressive 1.5b/2.75b log-gradient QVGA image sensor with multi-scale readout for always-on object detection," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2019, pp. 98–100.
- [7] G. K. De Teyou, H. Petit, P. Loumeau, H. Fakhoury, Y. Le Guillou, and S. Paquelet, "Statistical analysis of noise in broadband and high resolution ADCs," in *Proc. 21st IEEE Int. Conf. Electron., Circuits Syst. (ICECS)*, Dec. 2014, pp. 490–493.
- [8] H. Im, K. Park, J. H. Cho, H. S. Choo, and S. Y. Kim, "Design of a pseudowide dynamic range CMOS image sensor by using the bidirectional gamma curvature technique," *IEEE Trans. Circuits Syst. II, Exp. Briefs*, vol. 68, no. 5, pp. 1596–1599, May 2021.
- [9] I. Cevik, X. Huang, H. Yu, M. Yan, and S. Ay, "An ultra-low power CMOS image sensor with on-chip energy harvesting and power management capability," *Sensors*, vol. 15, no. 3, pp. 5531–5554, Mar. 2015.



**BOHYEOK JEONG** (Student Member, IEEE) received the B.S. degree in semiconductor science from Dongguk University, Seoul, South Korea, in 2021, where he is currently pursuing the M.S. degree in semiconductor science.

His current research interests include algorithms, architectures, and circuits for low-power CMOS image sensors in mobile devices.



**JAEHWAN LEE** received the B.S. degree in industrial and system engineering from Dongguk University, Seoul, South Korea, in 2021, where he is currently pursuing the M.S. degree in industrial and system engineering.



**JAIHYUK CHOI** (Student Member, IEEE) received the B.S. degree in semiconductor science from Dongguk University, Seoul, South Korea, in 2019, where he is currently pursuing the M.S. degree in semiconductor science.



**MINKYU SONG** (Member, IEEE) received the B.S., M.S., and Ph.D. degrees in electronics engineering from Seoul National University, South Korea, in 1986, 1988, and 1993, respectively. From 1993 to 1994, he was a Researcher with the Asada Laboratory, VDEC, The University of Tokyo, Japan, where he worked in the area of low-power VLSI design. From 1995 to 1996, he was a Researcher with the CMOS Analog Circuit Design Team, Samsung Electronics, South Korea. Since 1997, he has been a Professor with Dongguk University, South Korea. His major research interests include the design of CMOS analog circuits, mixed-mode circuits, and low-power digital circuits. He is a member of IEEK.



**YOUNGDOO SON** (Member, IEEE) received the B.S. degree in physics and the M.S. degree in industrial and management engineering from the Pohang University of Science and Technology, Pohang, South Korea, in 2010 and 2012, respectively, and the Ph.D. degree in industrial engineering from Seoul National University, Seoul, South Korea, in 2015.

He is currently an Associate Professor with the Department of Industrial and Systems Engineering and the Director of the Data Science Laboratory (DSLAB), Dongguk University, Seoul. His research interests include machine learning, neural networks, Bayesian methods, and their industrial and business applications.



**SOO YOUN KIM** (Member, IEEE) received the B.S. and M.S. degrees in semiconductor science from Dongguk University, Seoul, South Korea, in 2001 and 2003, respectively, and the Ph.D. degree in electrical and computer engineering from the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, USA, in 2013.

From 2003 to 2008, she was an Engineer with the Image Development Team, System LSI Division, Samsung Electronics, Yongin, South Korea. From 2013 to 2017, she was a Staff Engineer with Qualcomm Corporate Research and Development, San Diego, CA, USA. She is currently an Associate Professor with the Department of Semiconductor Science, Dongguk University. Her current research interests include low-power CMOS image sensors, computer vision sensors, and thermal-aware FinFET circuit design.

Dr. Kim received several awards and honors, including the IEEE ISCAS Best Paper Award (Runner-Up).
