# An Ultrasound Imaging System With On-Chip Per-Voxel RX Beamfocusing for Real-Time Drone Applications

Liuhao Wu, Student Member, IEEE, Jiaqi Guo, Graduate Student Member, IEEE, Rucheng Jiang, Graduate Student Member, IEEE, Yande Peng, Han Wu, Student Member, IEEE, Jiamin Li, Member, IEEE, Yang Luo, Student Member, IEEE, Liwei Lin, Member, IEEE, and Jerald Yoo, Senior Member, IEEE

Abstract-For drone vision and navigation, low-power 3-D depth sensing with robust operations against strong/weak light and various weather conditions is crucial. CMOS image sensor (CIS) and light detection and ranging (LiDAR) can provide high-fidelity imaging. However, CIS lacks depth sensing and has difficulty in low light conditions. LiDAR is expensive with issues of dealing with strong direct interference sources. Ultrasound imaging system (UIS), on the other hand, is robust in various weather and light conditions and is cost-effective. However, in air channel, it often suffers from long image reconstruction latency and low framerate. To address these issues, we present a UIS application-specific integrated circuit (ASIC) that adopts the one-shot transmitter (TX) and on-chip per-voxel receiver (RX) beamfocusing (PV-RXBF) image reconstruction scheme. The ASIC adopts the designs of fully differential charge-reuse high-voltage TX (FDCR-HVTX), digital back-end (DBE), and an on-chip power management unit (PMU). FDCR-HVTX generates 28-$V_{pp}$ pulses and reduces the average power consumption by 25% by charge reuse (CR). The DBE achieves 7.76-µs processing latency and 9.83M-FocalPoint/s throughput to effectively translate real-time 3-D image streaming at 24 frames/s. A prototype UIS, with an 8 × 8 bulk piezo transducer array, is assembled with the proposed ASIC and a wireless data transmission module [field-programmable gate array (FPGA) + ESP32] on an entry-level consumer drone, and the real-time wireless 3-D image streaming at 24 frames/s with a range of 7 m is verified while the drone is flying. The ASIC implemented in 180-nm 1P6M Standard CMOS occupies 32.5 mm<sup>2</sup> and consumes 142.3 mW.

Manuscript received 28 April 2022; revised 23 July 2022 and 17 August 2022; accepted 17 August 2022. Date of publication 27 September 2022; date of current version 24 October 2022. This article was approved by Associate Editor Vyshnavi Suntharalingam. This work was supported in part by the Agency for Science, Technology and Research (A\*STAR) Advance Manufacturing and Engineering (AME) Wavefront Computing Program under Grant A18A4b0055 and in part by the Samsung Electronics under Grant A-0005134-03-00. (*Liuhao Wu and Jiaqi Guo contributed equally to this work.*) (*Corresponding author: Jerald Yoo.*)

Liuhao Wu, Jiaqi Guo, Rucheng Jiang, Han Wu, and Yang Luo are with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore 117583 (e-mail: l.wu@u.nus.edu; jiaqi.g@u.nus.edu; rucheng.jiang@u.nus.edu; wuhan@u.nus.edu; yangluo@u.nus.edu).

Yande Peng and Liwei Lin are with the Department of Mechanical Engineering, University of California at Berkeley, Berkeley, CA 94720 USA (e-mail: yande\_p@berkeley.edu; lwlin@berkeley.edu).

Jiamin Li is with the School of Microelectronics, Southern University of Science and Technology (SUSTech), Shenzhen, Guangdong 518000, China (e-mail: lijm3@sustech.edu.cn).

Jerald Yoo is with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore 117583, and also with the N.1 Institute for Health, Singapore 117456 (e-mail: jyoo@nus.edu.sg).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/JSSC.2022.3202502.

Digital Object Identifier 10.1109/JSSC.2022.3202502

*Index Terms*—3-D imaging, all light condition, charge reuse (CR), depth sensing, drone, high-voltage transmitter (TX), low power, per-voxel RX beamfocusing (PV-RXBF), real time, standard CMOS, ultrasound.

## I. INTRODUCTION

UNMANNED aerial vehicles (UAVs), or drones, have been gaining popularity in recent years. Some consumer-grade drones can fly at up to 20 m/s, and 3-D depth-sensing systems for such UAVs should have low latency, sufficient framerate, and low power consumption. As a safety measure, low latency and high framerate are essential, especially for consumer models, as their operators often have less professional training. Unfortunately, due to power, size, and cost considerations, many existing drones do not have a 3-D depth-sensing capability and/or are limited to distance detection at the front side only [1]. To address these issues, CMOS image sensors (CISs), light detection and ranging (LiDAR), and ultrasound imaging systems (UISs) could be utilized. Fig. 1 compares these technologies. CISs [2], [3], [4], [5] offer the highest framerate with good image quality, but they are vulnerable to bright exposure and low light conditions. More importantly, they lack direct depth information, making them suboptimal as the onboard safety sensor for drones. LiDAR [6], [7], [8], [9] provides the best depth information, but it also suffers under lighting conditions such as direct sunlight. Moreover, a LiDAR system may require a power-hungry transmitter (TX), such as a >40-W pulsed laser diode [6], [7], and it is also an expensive solution, making it less attractive for consumer-grade drone applications. Ultrasound imaging [10], [11], [12], [13], [14], [15], [16], [17], [18], on the other hand, can provide low-power 3-D depth information and works well even in complex lighting conditions (bright/dark) as well as in rainy/foggy weather. Of course, this comes at the cost of lower spatial resolution when compared to CIS or LiDAR, but this is not a major issue in drone navigation. Previously, UISs [10], [11], [12], [13], [14] have shown successful results for biomedical applications. Due to the slow sound speed (343 m/s in air compared to 1540 m/s in tissue) and the long detection range compared to in vivo imaging, air-channel ultrasound

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/



† Power including AFE and On-Chip Image Reconstruction

Fig. 1. (a) Framerate versus power comparison of 3-D imaging technologies and (b) comparison of current 3-D imaging technologies.

imaging [15], [16], [17], [18] poses a different challenge: how to achieve an adequate framerate with a limited number of transmissions. A prior study [19] discussed techniques that increase framerate by trading off spatial resolution; although that study concerns cardiac ultrasound imaging, many of its concepts apply to the air channel as well. Multiline acquisition (MLA) is a framerate-improving technique in which multiple lines are rendered from one transmission. It reduces the number of transmissions while preserving the number of lines rendered, although it disperses the transmitted sound pressure and thus weakens the received echo. The highest framerate is achieved by combining one-transmission-per-volume (one-shot TX), max MLA, and parallel post-beamforming image reconstruction that is fast enough for its processing latency to be hidden within one pulse-to-echo (P2E) time through pipelining. Przybyla et al. [17] presented an air-channel rangefinder that applies one-shot TX and max MLA to reach 30 frames/s; however, its range was limited to 1 m, and it does not integrate an on-chip image reconstruction DBE. Another work [18] demonstrated real-time ultrasound imaging in the air channel at up to 3-m range and 29 frames/s, but the system is built from discrete components (no chip implementation), so its bulky form factor and high power consumption ( $\gg$ 10 W) are beyond what a common UAV can handle. For drone applications, high framerate and low latency are crucial; for example, Wu et al. [16] reported 4 frames/s, at which a 20-m/s drone flies 5 m between frames, significantly increasing the probability of the drone colliding with an object.

To be suitable for commercial UAVs, a 3-D imaging system should be both capable and not burdensome, which brings specific requirements in framerate, latency, range, and form factor. For UAVs flying at up to 20 m/s, avoiding a collision with a near obstacle (e.g., at 1 m) calls for a total latency, comprising acoustic and processing latency, of well below 50 ms, considering the additional latency of the control system and the maneuver. Similarly, a motion-like 24 frames/s is the target. As sound travels at 343 m/s in air, 24 frames/s leaves roughly 41.7 ms for each round trip, which sets a maximum range target of 7 m for the design. In terms of form factor, the system needs to be light and low power to be integrated into a small UAV system. The framerate and latency requirements motivate the need for an on-chip image reconstruction processor. For better image quality, more transducer channels are desired; however, due to the constraints on overall system size and chip area, an 8  $\times$  8 array is a sweet spot. High-frequency ultrasound transducers are also preferred for image quality; however, sound waves attenuate more at higher frequencies; therefore, 40 kHz is chosen as a tradeoff between range, image quality, and cost-effectiveness. The defined render volume should contain as many voxels as possible to preserve details, but the actual implementation needs to compromise with chip area and power. Nonetheless, a  $\pm 30^{\circ}$  field-of-view (FOV) is desired to mimic human central vision.
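The range and latency targets above follow from round-trip arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes one transmission per frame and the 343 m/s speed of sound quoted in the text.

```python
SPEED_OF_SOUND_AIR = 343.0  # m/s, value quoted in the text


def max_range_m(frames_per_second, speed=SPEED_OF_SOUND_AIR):
    """Farthest target whose echo returns within one frame period."""
    frame_period_s = 1.0 / frames_per_second   # one pulse-to-echo (P2E) slot
    return speed * frame_period_s / 2.0        # halve for the round trip


def pulse_to_echo_s(range_m, speed=SPEED_OF_SOUND_AIR):
    """Round-trip acoustic time for a target at range_m."""
    return 2.0 * range_m / speed


def travel_between_frames_m(drone_speed_mps, frames_per_second):
    """Distance a drone covers between two consecutive frames."""
    return drone_speed_mps / frames_per_second


print(round(max_range_m(24), 2))                  # 7.15 (m), hence the 7-m target
print(round(pulse_to_echo_s(7.0) * 1e3, 1))       # 40.8 (ms) for a 7-m target
print(travel_between_frames_m(20.0, 4))           # 5.0 (m) at 4 frames/s
```

At 24 frames/s the same 20-m/s drone covers less than 1 m between frames under these assumptions.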

This article presents a new ultrafast real-time air-channel UIS application-specific integrated circuit (ASIC) [20] that integrates on-chip image reconstruction digital back-end (DBE), coupled with a one-shot TX and receiver (RX) beamfocusing scheme to achieve lower power consumption (142.3 mW) than those of CIS and LiDAR systems and higher framerate (24 frames/s) than those of previous UIS, suitable for the UAV applications. Integrating a 64-channel 2-D phased transducer array, the ASIC processes 24 frames/s 9.83M-FocalPoints/s 3-D volumetric images at 7.76- $\mu$ s processing latency. The functionality and real-time operation of this UIS have been successfully verified with a drone-mounted prototype.

The rest of this article is organized as follows. Section II elaborates on the proposed UIS system architecture and the design considerations. Sections III–V present the design details of the key building blocks, including the per-voxel beam focusing DBE, the one-shot TX, and the power management unit (PMU). Section VI shows the implementation and measurement results, and finally, Section VII concludes this article.

## II. ULTRAFAST UIS ASIC OVERVIEW

### A. System Architecture

Ultrasound images are traditionally rendered by beamforming the transmitting or the receiving acoustic waves, transforming the echo strength envelope, and scan-converting the signal to images. In the previous works [15], [16], [17], [18], only prebeamforming steps are integrated into the analog domain, and the remaining image reconstruction steps, parallel postbeamforming till scan-conversion, are processed on off-chip components. The off-chip components constrain the overall system power and form factor.

The proposed work solves the problem by replacing the imaging pipeline with a novel on-chip architecture, which adopts a one-shot TX and per-voxel RX beamfocusing



Fig. 2. Proposed UIS system architecture.

(PV-RXBF) scheme that is capable of rendering the 4096 scanlines of a volumetric image within one P2E. This architecture overcomes the power and size limitations by maximally integrating the system into a single chip, and it reaches the highest framerate permitted by physical laws by rendering the entire volumetric image within one P2E.

Fig. 2 shows the architecture of the proposed UIS. Each RX chain consists of a low-noise amplifier (LNA), time-gain compensation (TGC), and a 10-bit SAR analog-to-digital converter (ADC), reusing the IPs from our previous works [10], [15]. The ASIC integrates the on-chip PV-RXBF DBE, the fully differential charge-reuse high-voltage TX (FDCR-HVTX), a high-voltage PMU, and a 64-channel analog front-end (AFE), implemented in 180-nm 1P6M standard CMOS. The FDCR-HVTX drivers can be configured to drive any of the 64 TX channels; however, for the presented demonstration system, only  $2 \times 2$  TX channels are enabled to minimize the grating lobes in the transmitted sound pressure. The system also includes a wireless interface (TRX) consisting of a field-programmable gate array (FPGA) and an ESP32 WiFi module, as well as a display user interface (UI) on a laptop.

### B. Per-Voxel RX Beamfocusing

In the proposed work, the system renders 3-D volumetric images at 24 frames/s, which appears fluid to the human eye, with an imperceptible 7.76- $\mu$s processing latency. The 3-D image is represented in a defined volume-of-interest (VOI) that has 64 steps in each of the azimuth and elevation directions and 100 steps in the distance direction.
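The VOI dimensions directly determine the scanline count and focal-point throughput quoted elsewhere in the article. A minimal arithmetic sketch, using only the numbers given in the text (the constant names are illustrative):

```python
# VOI dimensions and frame rate quoted in the text
AZIMUTH_STEPS = 64
ELEVATION_STEPS = 64
RANGE_STEPS = 100
FRAMES_PER_SECOND = 24

scanlines = AZIMUTH_STEPS * ELEVATION_STEPS            # lines per volume
voxels_per_frame = scanlines * RANGE_STEPS             # focal points per volume
focal_points_per_second = voxels_per_frame * FRAMES_PER_SECOND

print(scanlines, voxels_per_frame, focal_points_per_second)
# 4096 409600 9830400  (about 9.83M focal points/s)
```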

Every voxel in the defined VOI is a focal point of the RX beamformer; hence the name PV-RXBF. Unlike scanline-based beamforming, where each scanline corresponds to one set of delays and all voxels along that scanline are visualized from a single beamforming operation, PV-RXBF performs beamfocusing, a type of beamforming, for every voxel. At first glance, PV-RXBF might seem to perform an unnecessarily large number of beamfocusing operations; however, it is designed as a fully digital real-time implementation, in which the delay operation is merely a memory read access, which is cheap compared to reconfiguring delay cells in analog implementations. Additionally, beamfocusing to every



Fig. 3. DBE block diagram.

focal point improves image fidelity compared with merely steering the beam, which leaves voxels unfocused.

### C. Digital Hardware Mapping

Delay and sum (DAS) is the algorithm used in the DBE for beamfocusing. The basic principle of DAS is to apply delays to each transducer channel and then accumulate the data to produce highly correlated echo amplitude data of the focal point, as shown in the following equation:

$$O(v) = \sum_{n=1}^{M} I_n(t_{n,v}), \quad v := [\theta, \varphi, r] \tag{1}$$

where the output $O$ is the voxel intensity value. It is the summation over the $M$ channel inputs $I_n(t)$, where $I_n(t)$ is the echo intensity of channel $n$ received at time $t$, and $t$ differs for each channel and for each focal point $v$ that the DBE beamfocuses on. The voxel $v$ has three components: the lateral angle  $\theta$, the elevational angle  $\varphi$, and the axial distance $r$.
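Equation (1) can be illustrated with a toy software model. This is a minimal sketch, not the DBE implementation: it assumes the delays are already integer sample indices, and the four-channel data set is invented purely for illustration.

```python
def delay_and_sum(channel_samples, delays):
    """O(v) = sum over channels n of I_n(t_{n,v}).

    channel_samples: one list of samples per channel (index = sample time)
    delays: one integer sample index per channel for the focal point v
    """
    return sum(samples[t] for samples, t in zip(channel_samples, delays))


# Toy data (hypothetical): a unit echo reaches each of 4 channels at a different time.
arrival = [5, 6, 8, 7]
channels = [[1 if t == a else 0 for t in range(12)] for a in arrival]

focused = delay_and_sum(channels, arrival)          # delays match the echo
unfocused = delay_and_sum(channels, [5, 5, 5, 5])   # delays ignore the geometry

print(focused, unfocused)   # 4 1
```

When the per-channel delays match the true arrival times, the echoes add coherently; with mismatched delays, only one channel contributes.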

Fig. 3 depicts the DBE in the proposed ASIC. The DBE maps the beamfocusing operation to a cached digital architecture, and it consists of four major blocks, the beamfocus sequence generation (BSG), quadrant-symmetrical reused delay generation (QSR-DG), two-level port-per-channel memory (2L-PCM) (as the buffer), and the 64-to-1 vector adder. The first two modules generate the 64 delays for 64 transducer channels, and the buffer module applies the delays to the data

of each channel. With a vector adder at the end of the pipeline, this DBE accomplishes the DAS operation.

In analog implementations, DAS is done by placing delay cells in each channel's analog pipeline and accumulating the charge before the ADC converts the signal to digital. It is difficult to store and retrieve a large number of analog values (charges or voltages); therefore, obtaining the echo strength of another focal point usually requires sending another pulse and reconfiguring the delay cells, which consumes another P2E time. Depending on the medium and detection range, e.g., the air channel and a 7-m range, each additional DAS operation would take another 40 ms. In contrast, digital signals have the advantage that they can be easily stored in RAM cells. In this work, the input signals are first continuously buffered into a 2L cached memory without delays applied. Meanwhile, the set of 64 delays, one per channel, is generated on-chip. These delays are then used as addresses to read from the 2L-PCM. The delay aspect of DAS is thus embedded in the memory read access, because each delay $t$ from (1) can be expressed as an address that points to data stored at a later time.

The advantage of this digital implementation is its minimal latency. Latency includes the acoustic delay and the processing delay; the acoustic delay depends on how many P2E cycles are consumed for each rendered frame of 3-D images. The analog approaches rely on TX beamforming, which requires multiple P2E cycles per frame and thus results in a large acoustic delay. As an example, our defined VOI has 4096 scanlines. Even if  $4 \times$  MLA is applied, it still takes 1024 P2E cycles to complete one volume scan. This is especially severe in our use case, which is in the air channel and extends to 7 m: each P2E takes 40 ms, so the analog approach takes 40.96 s for a whole volume scan, which is unacceptable. In contrast, the proposed UIS's PV-RXBF adopts one-shot TX; therefore, only one P2E is needed. All voxels are beamfocused in the digital domain and output after a minimal processing latency. The downside of the proposed approach, of course, is that the transmitted pulse is unfocused and the received echo is weak. This is acceptable as long as the received echo from the targeted 7-m range is well above the noise floor; to ensure a sufficient echo level (even with unfocused pulses), we developed a TX driving at 28 V<sub>pp</sub>, presented in Section IV.
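The acoustic-delay comparison above reduces to counting transmissions. A small sketch using the figures from the text (the helper name is hypothetical):

```python
P2E_S = 0.040  # one pulse-to-echo time for the 7-m air channel, as quoted in the text


def volume_scan_time_s(scanlines, lines_per_transmission, p2e_s=P2E_S):
    """Acoustic time for one volume when each transmission yields a fixed number of lines."""
    transmissions = -(-scanlines // lines_per_transmission)  # ceiling division
    return transmissions * p2e_s


scanline_4x_mla = volume_scan_time_s(4096, 4)     # 1024 transmissions
one_shot = volume_scan_time_s(4096, 4096)         # 1 transmission (one-shot TX)

print(f"{scanline_4x_mla:.2f} s vs {one_shot * 1e3:.0f} ms")   # 40.96 s vs 40 ms
```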

Owing to the max-MLA RX scheme, the overall acoustic delay is reduced to the minimum of one P2E, which is 40 ms. Once the echo arrives, the processing latency is an imperceptible 7.76  $\mu$s. Overall, the DBE achieves true real-time 3-D ultrasound imaging in the air channel.

### D. Beamfocus Sequence

BSG is the first step of the overall pipeline, and it is critical for achieving real-time operation with a memory footprint small enough for on-chip implementation. The idea is to beamfocus on the voxel whose data arrive first, so that the system does not spend too much time waiting. Because new data come in and old data are discarded as in a first-in-first-out (FIFO) buffer, the voxel currently being beamfocused must not request data that are too old and have already been dropped from the buffer. When not-in-memory



Fig. 4. Illustration of the VOI. The beamfocus sequence is shown in quadrant Q1; read addresses for voxels in the other quadrants are generated by exploiting quadrant symmetry.

errors occur, the system would halt, which signifies that the current settings, FOV, and axial resolution are not achievable. Therefore, the beamfocus sequence must be tested in advance in simulation to ensure continuous operations.

Inside the DBE, the BSG module implements the proposed beamfocus sequence. In general, the sequence starts from the nearest plane and moves on to farther planes until the frame refreshes. Within each plane, as shown in Fig. 4, the sequence starts from the central voxels and moves outward following a zig-zag pattern, which is friendly to hardware implementation. The minimum cache size can be estimated by using software behavioral models to record the largest difference between the current and the oldest data accessed. In simulation, using the proposed beamfocus sequence at the 130-kHz sampling frequency that we use to sample the 40-kHz ultrasound wave for the  $\pm 30^{\circ}$  FOV, the theoretical minimum cache size for 64 channels is 2560 entries. In practice, the cache is designed to have 16 384 entries, so that it can support up to an 800-kHz sampling rate for the  $\pm 30^{\circ}$  FOV. Compared to a sequential order that scans each plane from the bottom-left corner pixel to the top-right pixel and proceeds from near planes to far planes, this sequence saves on average 20% of the required buffer size at sampling frequencies from 80 kHz to 5 MHz.
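The kind of behavioral model mentioned above can be sketched as follows. This is a simplified illustration under several assumptions that are not from the article: the element pitch (`PITCH_M`), a reduced VOI for speed, a centre-outward ordering obtained by sorting on angular distance (a stand-in for the actual zig-zag pattern), and a pipeline that never falls behind the incoming data. Its output is therefore not comparable to the 2560-entry figure.

```python
import math

FS_HZ = 130e3                    # ADC sampling rate quoted in the text
C_AIR = 343.0                    # speed of sound in air, m/s
PITCH_M = 0.01                   # ASSUMED element pitch (not given in the text)
SAMPLES_PER_M = FS_HZ / C_AIR    # one-way distance expressed in sampling cycles

# 8 x 8 planar array centred on the origin, coordinates in sample units
COORDS = [((ix - 3.5) * PITCH_M * SAMPLES_PER_M,
           (iy - 3.5) * PITCH_M * SAMPLES_PER_M)
          for ix in range(8) for iy in range(8)]


def channel_delays(theta, phi, r):
    """Transmit time r plus return path to every element, in sampling cycles."""
    x = r * math.sin(theta) * math.cos(phi)
    y = r * math.sin(phi)
    z = r * math.cos(theta) * math.cos(phi)
    return [r + math.sqrt((x - cx) ** 2 + (y - cy) ** 2 + z ** 2)
            for cx, cy in COORDS]


def centre_out_sequence(n_angles, n_ranges, r_step, fov_deg=30.0):
    """Near-to-far planes; within a plane, voxels sorted by distance from the centre."""
    half = math.radians(fov_deg)
    angles = [-half + (i + 0.5) * 2 * half / n_angles for i in range(n_angles)]
    plane = sorted(((t, p) for t in angles for p in angles),
                   key=lambda tp: tp[0] ** 2 + tp[1] ** 2)
    for k in range(1, n_ranges + 1):
        for t, p in plane:
            yield t, p, k * r_step


def required_depth(sequence):
    """Largest gap between the newest sample written and the oldest sample read."""
    newest = 0.0
    depth = 0.0
    for theta, phi, r in sequence:
        d = channel_delays(theta, phi, r)
        newest = max(newest, max(d))            # wait until all 64 samples exist
        depth = max(depth, newest - min(d))     # oldest sample still needed
    return math.ceil(depth) + 1


depth = required_depth(centre_out_sequence(n_angles=16, n_ranges=20, r_step=26.0))
print("per-channel buffer depth (samples):", depth)
```

Swapping in a different generator for the sequence allows two orderings to be compared with the same `required_depth` function, provided the timing model is refined to include pipeline throughput.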

## III. PER-VOXEL RX BEAMFOCUSING DBE

The DBE follows the logic shown in Fig. 5. The four tasks are performed by the BSG, QSR-DG, 2L-PCM, and vector adder, respectively, as shown in Fig. 3. Data coming from the AFE are written to the 2L-PCM with a write-through scheme, and write accesses always take priority over read accesses. As writes happen only when new data are sampled by the ADCs, their duty cycle is small compared to that of reads. The read accesses implement the conventional DAS algorithm. The BSG module tells the subsequent pipeline stages which voxel, or focal point, to beamfocus on next. The QSR-DG module generates 64 delay values that serve as the read addresses for the 2L-PCM. Only when all 64 addressed values have been written into the memory does the pipeline proceed by summing them; these 64 values can be viewed as the delayed values, analogous to data



Fig. 5. DBE write logic path and DAS (read) logic path.

passed through delay cells in an analog beamformer. The result is the echo intensity value of the voxel being beamfocused on. These values are then drawn on the display as either gray-scale or color-mapped voxels.

Fig. 6 shows the detailed operation of the DBE architecture; the blue and red curly arrows show the progression of the DBE pipeline operations. Until the ADCs finish converting new echo data at cycle 192, the whole pipeline is halted; it resumes when the 2L-PCM sends out signals both upstream (red arrows) and downstream (blue arrows) of the pipeline. Fig. 6 also conveys three important features of the DBE: 1) the 7.76- $\mu$s processing latency; 2) the 4× computation reduction from QSR; and 3) the throughput improvement from the cache memory architecture. They are explained as follows.

- 1) Cycle 1 flags the echo arrival. Until cycle 192, the DBE waits for the ADC to finish converting the new echo data. Once the conversion is done, the DBE starts to progress and outputs data after another two cycles for the delay application and sum actions. In total, the latency is 194 cycles, or 7.76  $\mu$s.
- 2) Given the coordinates of one voxel as input, the front end of QSR-DG (before QSR) computes one set of 64 delays, and the QSR submodule reuses this set four times, resulting in the four voxel outputs circled in green.
- 3) Each small blue arrow at the bottom is a voxel output; the intervals between them shrink as the L1 cache holds more recently used data and its hit rate increases.
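The latency figure in item 1) can be checked against the 25-MHz clock mentioned later in the text. A minimal arithmetic sketch (constant names are illustrative; the 40-ms P2E time is the value used in Section II):

```python
CLOCK_HZ = 25e6          # DBE clock frequency quoted in the text
ADC_DONE_CYCLE = 192     # cycle at which conversion of the new echo data completes
DAS_CYCLES = 2           # delay application and summation
P2E_S = 0.040            # acoustic pulse-to-echo time for the 7-m range

latency_cycles = ADC_DONE_CYCLE + DAS_CYCLES
latency_us = latency_cycles / CLOCK_HZ * 1e6
share_of_p2e = latency_us * 1e-6 / P2E_S     # processing latency relative to acoustic delay

print(latency_cycles, round(latency_us, 2))   # 194 7.76
print(f"{share_of_p2e:.4%} of one P2E")
```

The processing latency is a small fraction of a percent of the acoustic delay, which is why the acoustic delay dominates the end-to-end latency.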

### A. BSG Module

BSG is the first step in PV-RXBF. The hardware module outputs the coordinates of the voxel in the form of  $[\theta, \varphi, r]$ representing the lateral and elevation angle index and distance steps at the forward-facing direction of the VOI. The system is designed to support VOI up to the size of 64 steps in azimuth and elevation directions and 1024 steps in the forward-facing direction, although in measurement, we only utilize 100 steps in the forward-facing direction. The BSG module actively listens to the downstream QSR-DG module for a ready signal. Once the ready signal turns high, in the next cycle, the BSG module will output the next voxel coordinates in the embedded beamfocus sequence.

### B. QSR-DG Module

The QSR-DG is the second step. It receives directives from the BSG and generates 64 memory addresses corresponding to the 64 RX transducer channels. Specifically, it solves for the time  $t_{n,v}$  from (1), which can also be written as  $t_{n=1,\ldots,64}(v)$, since $v$ is the actual input and $n$ indexes the 64 channel values. The addresses are in units of ADC sampling cycles, and a counter is used to track the cycles. QSR-DG takes three clock cycles to generate one set of delays. At first glance, this may seem to create a big bubble in the pipeline; however, as explained later, the QSR module generates four sets of delays from one input set and keeps the pipeline fed for at least four cycles, so the three-cycle delay is not a concern.

$$t_n([\theta, \varphi, r]) = r + \sqrt{(r\sin\theta\cos\varphi - C_{n,x})^2 + (r\sin\varphi - C_{n,y})^2 + (r\cos\theta\cos\varphi - C_{n,z})^2} \tag{2}$$

The logic and implementation of QSR-DG are shown in (2) and Fig. 7. On the left side of (2), the time $t$ is technically the delay, measured from the time origin, that the DBE needs to apply to channel $n$. The voxel, or focal point, $v$ is represented by three indexes: the lateral, the elevation, and the axial. The first two are 5 bits each and the last is 7 bits, so the voxel index is 17 bits wide in total. Five-bit values suffice to represent 64 steps in the lateral and elevation angles because of the exploitation of quadrant symmetry, which will be explained later. The right side of (2) is the logic that the module implements. The module stores the parameters needed to calculate the axial distance (radius) $r$, the channel coordinates, and the trigonometric parameters  $\sin\theta\cos\varphi$,  $\sin\varphi$, and  $\cos\theta\cos\varphi$  used for scanline calculation. Parameters can be stored and derived directly in units of the ADC sampling cycle because the speed of sound and the sampling rate are constant. This removes the need for conversion from distance to time to memory addresses, saving the area and power that the additional hardware would cost. The radius $r$ is derived by an incrementor, as the axial distance always increases from near to far at a constant interval; this saves area and power compared to a naïve implementation with a multiplier and an adder. The derived $r$ is then used twice during the calculation: first as the multiplier of the trigonometric terms and second as the transmit time (in ADC sampling cycles). The first bundle of operations (a multiply, a subtract, and a square) is instantiated 17 times to cover all 17 coordinate terms: 8 each in the $x$ and $y$ axes and 1 in the $z$ axis. The 17 unique terms are reused across the 64 channels as inputs to the second bundle of operations: a sum, a square root, and an add that uses the radius $r$ as mentioned earlier. This second-bundle arithmetic logic unit (ALU) is instantiated 64 times, once per channel. The result, consisting of 64 memory addresses, is passed to the QSR module for reuse.
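The two-bundle structure described above can be mirrored in software. The following is a floating-point sketch, not the fixed-point hardware: it assumes a planar array with $C_{n,z} = 0$ by default, and the pitch value in the usage lines is an arbitrary placeholder in sampling-cycle units.

```python
import math


def delays_two_bundles(theta, phi, r, cx, cy, cz=0.0):
    """Delay grid per (2); cx and cy hold the distinct x and y element coordinates.

    Returns grid[iy][ix], the delay (in ADC sampling cycles) for the element
    at (cx[ix], cy[iy]).
    """
    # First bundle: multiply, subtract, square -> len(cx) + len(cy) + 1 unique terms
    x_terms = [(r * math.sin(theta) * math.cos(phi) - c) ** 2 for c in cx]
    y_terms = [(r * math.sin(phi) - c) ** 2 for c in cy]
    z_term = (r * math.cos(theta) * math.cos(phi) - cz) ** 2
    # Second bundle: sum, square root, add r -> one result per channel
    return [[r + math.sqrt(xt + yt + z_term) for xt in x_terms] for yt in y_terms]


# Usage with an 8 x 8 array; the pitch of 1.5 sampling cycles is an assumption.
axis = [(i - 3.5) * 1.5 for i in range(8)]
grid = delays_two_bundles(0.2, 0.1, 150.0, axis, axis)
addresses = [[round(t) for t in row] for row in grid]   # integer read addresses
print(len(x := axis) + len(axis) + 1, "unique terms,", sum(len(row) for row in grid), "delays")
```

With 8 distinct coordinates per axis this yields 17 first-bundle terms and 64 second-bundle results, matching the instance counts given in the text.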



Fig. 6. DBE datapath and essential control signals, three features: 1) 7.76- $\mu$ s latency, 2) 4× computation reduction (from QSR), and 3) increased throughput (hit rate).



Fig. 7. QSR-DG block diagram.

Except for  $\sin\theta \cos\varphi$, which is stored in a 32-bit fixed-point format, the other two terms are stored in 16-bit fixed-point formats, which provide sufficient precision. Storing parameters and computing the time delays on-chip dramatically reduces the on-chip memory requirement compared to storing all focal-point delays by brute force. For a setup of 64 transducers and a VOI of 409 600 voxels, if each delay value occupies 2 bytes, storing all delays would require 52 MB of on-chip storage, which is clearly infeasible for a 32.5-mm<sup>2</sup> 180-nm chip. Although it trades power for computation, this implementation requires only 6.2 KB of on-chip storage, which fits into an 8-KB 0.46-mm<sup>2</sup> on-chip static random access memory (SRAM).
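The storage comparison is straightforward to reproduce. A short sketch using the figures from the text; treating 6.2 KB as 6200 bytes is an assumption about the unit convention.

```python
CHANNELS = 64
VOXELS = 64 * 64 * 100          # 409 600 focal points in the VOI
BYTES_PER_DELAY = 2

lookup_table_bytes = CHANNELS * VOXELS * BYTES_PER_DELAY
PARAMETER_BYTES = 6.2e3         # on-chip parameter storage quoted in the text (assumed decimal KB)

print(lookup_table_bytes / 1e6)                        # 52.4288 (MB)
print(round(lookup_table_bytes / PARAMETER_BYTES))     # reduction factor, roughly 8500x
```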

### C. Exploiting Quadrant Symmetry

The QSR module is implemented as a submodule inside the QSR-DG; it is the last stage in Fig. 7. It exploits the symmetry properties of trigonometric functions, as shown in Fig. 4. The sine function changes sign when its input changes sign; therefore, looking at (2), if the sign of the corresponding channel coordinate ($C_{n,x}$ or $C_{n,y}$) is also changed, the time-of-flight (ToF) remains the same. This implies that each voxel and its $x$-, $y$-, and diagonally mirrored voxels all share the same set of 64 delay values; only the channel to which each delay applies changes. Instead of computing a new set of 64 delays, the QSR module rearranges the order of the delays. This saves both computation and the memory space needed for storing trigonometric parameters, since the delays of only one-fourth of the voxels are calculated.
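The symmetry argument can be checked numerically. The sketch below assumes an 8 × 8 planar array centred on the origin with an arbitrary pitch (`PITCH`, in sampling cycles, is a placeholder) and uses the delay expression of (2); mirroring a voxel then corresponds to reversing the channel order along the mirrored axis.

```python
import math

N = 8
PITCH = 1.5                                         # ASSUMED pitch in sampling cycles
AXIS = [(i - (N - 1) / 2) * PITCH for i in range(N)]  # symmetric about the origin


def delay_grid(theta, phi, r):
    """grid[iy][ix]: delay for the element at (AXIS[ix], AXIS[iy]), planar array."""
    x = r * math.sin(theta) * math.cos(phi)
    y = r * math.sin(phi)
    z = r * math.cos(theta) * math.cos(phi)
    return [[r + math.sqrt((x - cx) ** 2 + (y - cy) ** 2 + z ** 2) for cx in AXIS]
            for cy in AXIS]


def mirror_x(grid):
    return [row[::-1] for row in grid]      # reverse channel order along x


def mirror_y(grid):
    return grid[::-1]                       # reverse channel order along y


def same(a, b, tol=1e-9):
    return all(abs(p - q) <= tol for ra, rb in zip(a, b) for p, q in zip(ra, rb))


q1 = delay_grid(0.3, 0.2, 200.0)            # one voxel computed in quadrant Q1
print(same(delay_grid(-0.3, 0.2, 200.0), mirror_x(q1)))             # True
print(same(delay_grid(0.3, -0.2, 200.0), mirror_y(q1)))             # True
print(same(delay_grid(-0.3, -0.2, 200.0), mirror_x(mirror_y(q1))))  # True
```

One computed delay set therefore serves four voxels through reordering alone, which is the 4× reduction attributed to QSR.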

### D. Two-Level Port-Per-Channel Memory

With the 64 delays generated, the 2L-PCM module uses them as read addresses. The 2L-PCM has two levels of memory hierarchy: the first level (L1) is made of register files and the second level (L2) is made of SRAMs, as shown in Fig. 8. There are 64 register files, one per transducer channel, each containing 16 10-bit entries. The four SRAMs each have 4096 entries, or effectively 256 entries per channel. The benefit of SRAMs is their high memory density, which saves chip area. However, this SRAM IP has only one synchronous read-write port, which limits the overall bandwidth. With 409 600 voxels per frame and a target of 24 frames/s, the design needs to render 9.83M voxels/s, and each voxel needs 64 read accesses. Even without counting write accesses, at a 25-MHz clock the four SRAMs provide at best 100M accesses/s, about 6.3 times less than the required 629.1M read accesses/s. The L1 register files have, in total, 64 read-write ports. With a hit rate of 98.2% during operation for the targeted VOI, the design achieves the desired bandwidth at an acceptable 70% increase in area.
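The bandwidth argument can be reproduced with a few lines of arithmetic. This is a rough sketch using only the figures in the text; it ignores write traffic and assumes every L1 miss costs exactly one SRAM access.

```python
VOXELS_PER_S = 409_600 * 24            # focal points per second
READS_PER_VOXEL = 64                   # one read per channel
CLOCK_HZ = 25_000_000
SRAM_COUNT = 4                         # single-port SRAMs, one access per cycle each
L1_HIT_RATE = 0.982                    # hit rate quoted in the text

required_reads = VOXELS_PER_S * READS_PER_VOXEL
sram_accesses = SRAM_COUNT * CLOCK_HZ
shortfall = required_reads / sram_accesses
l2_reads = required_reads * (1.0 - L1_HIT_RATE)    # reads that miss L1

print(required_reads)                  # 629145600
print(round(shortfall, 2))             # 6.29
print(l2_reads < sram_accesses)        # True
```

Under these assumptions only about 11M reads/s reach the SRAMs, which leaves headroom within their 100M accesses/s.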



Fig. 9. Echo amplitude versus object distance for different driving voltages.

## IV. FULLY DIFFERENTIAL CHARGE-REUSE HIGH-VOLTAGE TX

Piezoelectric micromachined ultrasonic transducers (pMUTs) [21], [22], [23] and bulk piezoelectric transducers can be used for low-power air-channel ultrasound detection. Fig. 9 shows the echo amplitudes received by a 40-kHz bulk piezo transducer in a P2E measurement with a reflector at different distances. The echo signal strength attenuates exponentially with the distance of the object but is proportional to the driving voltage. To detect objects 7 m away with a sufficient signal-to-noise ratio (SNR), a higher transducer driving voltage is needed: a 20-$V_{pp}$ driving voltage provides 7-dB SNR at 7 m, and 28 $V_{pp}$ offers up to 13-dB SNR at 7 m. The charge redistribution TX proposed in [24] operates with only a 6-$V_{pp}$ driving voltage. Although the charge-recycling high-voltage TX (CRHV-TX) proposed in [10] and the universal energy recycling TX [16] improved the output swing to 13.2 and 14 $V_{pp}$, respectively, they are still insufficient for 7-m-range object detection. As shown in Fig. 10(a), the CRHV-TX [10] driving a bimorph pMUT [21] uses 4–2 $V_{DDH}$ to drive $V_p$ and 0–2 $V_{DDH}$ to drive $V_n$, resulting in a 4-$V_{DDH}$ differential driving signal. To increase the driving voltage even further, we propose a 28-$V_{pp}$ FDCR-HVTX driver in 180-nm standard CMOS. Using standard CMOS allows the ASIC to use a variety of IPs in both analog and digital blocks. The driver swings $V_p$ from 4 $V_{DDH}$ to 0 and $V_n$ from 0 to 4 $V_{DDH}$, so that it provides an 8-$V_{DDH}$ (28-$V_{pp}$) fully differential drive signal. Fig. 10(b) shows the FDCR-HVTX driving a bimorph pMUT; the proposed FDCR-HVTX achieves 2.12× the driving voltage ($V_{pp}$) of [10] using the same supply, which enables full utilization of the transducer and results in higher sound pressure for the 7-m detection range target. It should be noted that the FDCR-HVTX can drive bulk piezo transducers as well.
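The differential swings quoted above follow from the electrode levels. A small sketch, with levels expressed in units of $V_{DDH}$; the helper name is hypothetical, and the derived $V_{DDH}$ value is an inference from "8 $V_{DDH}$ = 28 $V_{pp}$" rather than a number stated in the text.

```python
def differential_vpp(vp_levels, vn_levels):
    """Peak-to-peak swing of Vp - Vn over the listed drive states."""
    diffs = [p - n for p, n in zip(vp_levels, vn_levels)]
    return max(diffs) - min(diffs)


# Electrode levels in units of V_DDH, taken from the description of Fig. 10
crhv_swing = differential_vpp((4, 2), (0, 2))   # CRHV-TX [10]: Vp 4 -> 2, Vn 0 -> 2
fdcr_swing = differential_vpp((4, 0), (0, 4))   # FDCR-HVTX:    Vp 4 -> 0, Vn 0 -> 4

v_ddh = 28.0 / fdcr_swing                       # inferred supply level, volts
ratio_vs_ref10 = 28.0 / 13.2                    # reported V_pp values

print(crhv_swing, fdcr_swing)                   # 4 8
print(v_ddh, round(ratio_vs_ref10, 2))          # 3.5 2.12
```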



Fig. 10. Cross section and drive voltage of bi-morph pinned dual-electrode pMUT driven by (a) CRHV-TX [10] and (b) proposed FDCR-HVTX.



Fig. 11. Proposed FDCR-HVTX circuit.

The proposed FDCR-HVTX circuit is shown in Fig. 11. It employs two high-voltage drivers (HVDs) to differentially drive the two electrodes of the transducer ($V_p$/$V_n$ in a bimorph pMUT or signal/ground in a bulk piezo transducer) in an alternating manner. The high-voltage switch (HVSW) is turned on only in the charge reuse (CR) phase. During the CR phase, unlike [25], the HVD enters a high-Z state controlled by the clock signal, which prevents charge from flowing back from the electrode to the HVD and enables CR. In addition, the high-Z mode reduces the crowbar current. The HVSW consists of four stacked NMOS transistors and body bias adaptation circuits. When the FDCR-HVTX operates, $V_p$ and $V_n$ alternately reach the high potential of 4 $V_{DDH}$. The body bias circuit [26] keeps the body of each NMOS at the lower of its source and drain potentials, ensuring that the diode between the body and the source (or drain) is never turned on. In the HVSW, $|V_{GS}|$ and $|V_{DS}|$ of each transistor do not exceed 1 $V_{DDH}$, which makes the switch compatible with the standard CMOS process.

Because the on-chip PMU delivers power to the entire ASIC and the energy available in UAV applications is stringently limited, an energy-efficient TX is critical. The control signals of the proposed FDCR-HVTX are shown in Fig. 12. The HVD and /HVD operate in an interleaved manner. The FDCR-HVTX operates in four phases (see Fig. 13). Phases 1 and 3 are the



Fig. 12. Control signals of FDCR-HVTX.



Fig. 13. Working principle of FDCR-HVTX.

charging and discharging phases, while phases 2 and 4 are the CR phases. In phase 1, the HVSW is turned off; $V_p$ is charged to 4 V<sub>DDH</sub>, and $V_n$ is discharged to GND. In phase 2, HVD and /HVD enter the high-Z mode and the HVSW is turned on. CR then takes place: the parasitic capacitor of $V_p$ (between $V_p$ and GND) discharges and "aids" by pulling up the parasitic capacitor of $V_n$ from GND to (ideally) 2 V<sub>DDH</sub>. Meanwhile, shorting $V_p$ and $V_n$ helps the transducer flip its voltage from 4 $V_{DDH}$ to $-4 V_{DDH}$. The ON-resistance of the HVSW is 125 $\Omega$. During this phase, no current is drawn from the supply. In phase 3, the HVSW is OFF, and $V_n$ is topped up (from the supply) to 4 V<sub>DDH</sub> starting from the CR level rather than from GND, so that less power is consumed. The OFF-resistance of the HVSW is 43.8 G$\Omega$, so it does not affect the driver efficiency. Similarly, in phase 4, HVD and /HVD are high-

pulled up to 4 $V_{DDH}$ later. The CR also happens from the parasitic capacitor of $V_n$ to the parasitic capacitor of $V_p$. Differing from [10], the CR phase is based on the differential signal regardless of the current flow direction (from $V_p$ to $V_n$ or from $V_n$ to $V_p$). In addition, the FDCR-HVTX can create the midlevel voltage by shorting the two electrodes instead of using an additional midlevel supply [27].

In a conventional HVTX, each electrode is charged from GND to 4 $V_{DDH}$ and then discharged back to GND. Theoretically, the total power consumption is given by the following equation:

$$P_{\text{total,conventional}} = CV^2 f \tag{3}$$

where *C* is the load capacitance, *V* is the driving voltage, and *f* is the pulse frequency. In the FDCR-HVTX, on the other hand, the discharging electrode "aids" the charging electrode by transferring charge to it (phases 2 and 4 of Fig. 13), so part of the charge is replenished without drawing from the supply. The supply then "tops up" the remainder, as shown in phase 3. Theoretically, CR brings both electrodes to 2 $V_{DDH}$. Hence, in the next charging phase, the electrode charges up from 2 $V_{DDH}$ instead of GND; that is, half of the initial charge on one electrode is reused for the other electrode. The total power consumption of the FDCR-HVTX is given by the following equation:

$$P_{\text{total},\text{FDCR-HVTX}} = \frac{1}{2}CV^2f. \tag{4}$$

In theory, this technique saves 50% power. However, due to the parasitic capacitance and switch series resistance, there will be power losses, so the actual power reduction will be less than 50%.
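As a quick numerical illustration of (3) and (4), the following Python sketch evaluates both expressions and the resulting ideal saving. The function names and the 14-V value used for *V* are assumptions made for illustration only; the 1.9-nF load and the 40-kHz pulse frequency are taken from the bulk piezo measurement in Section VI-B. The numbers are idealized and are not expected to match the measured curves in Fig. 17(b).

```python
def p_conventional(c_load, v_drive, f_pulse):
    """Eq. (3): supply power of a conventional HVTX, P = C * V^2 * f."""
    return c_load * v_drive ** 2 * f_pulse


def p_fdcr(c_load, v_drive, f_pulse):
    """Eq. (4): ideal FDCR-HVTX supply power, P = 0.5 * C * V^2 * f."""
    return 0.5 * c_load * v_drive ** 2 * f_pulse


def saving(p_ref, p_new):
    """Fractional power saving of p_new relative to p_ref."""
    return 1.0 - p_new / p_ref


if __name__ == "__main__":
    C_LOAD = 1.9e-9   # F, bulk piezo load from Section VI-B
    F_PULSE = 40e3    # Hz, pulse frequency from Section VI-B
    V_DRIVE = 14.0    # V, assumed value of V, for illustration only

    p_conv = p_conventional(C_LOAD, V_DRIVE, F_PULSE)
    p_cr = p_fdcr(C_LOAD, V_DRIVE, F_PULSE)
    print(f"conventional:    {p_conv * 1e3:.2f} mW")
    print(f"ideal FDCR-HVTX: {p_cr * 1e3:.2f} mW")
    print(f"ideal saving:    {saving(p_conv, p_cr):.0%}")
```

Because both expressions scale with $CV^2f$, the ideal saving is 50% for any choice of *C*, *V*, and *f*; the measured saving reported in Section VI-B is 25%.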

#### V. ON-CHIP POWER MANAGEMENT UNIT

An on-chip PMU reduces the system complexity, especially the test board size, by omitting the external voltage regulators for each power domain. To supply the FDCR-HVTX with 1/2/3/4 V<sub>DDH</sub>, a standard-CMOS-compatible 14-V multilevel-output dc–dc converter is proposed [see Fig. 14(a)]. As with the FDCR-HVTX, using standard CMOS enables the ASIC to use various IPs in both analog and digital blocks.

Charge pump-based boost conversion [16] (with open-loop control) requires a fixed switching frequency set for the peak power drain, which increases the hard-charging loss and switching overhead. A flying-capacitor cross-connected dc–dc boost converter proposed in [28] can reduce the $V_{\rm DS}$ stress of some transistors while providing a high conversion ratio and a single high-voltage output. Nonetheless, special high-voltage NMOS devices are still required because it does not reduce the $V_{\rm DS}$ and $V_{\rm GS}$ stress of all transistors.

In this work, we adopt a Dickson charge pump [29]-based dc–dc converter [see Fig. 14(a)], which achieves a high conversion ratio and high-voltage multilevel outputs. The power stage of the proposed converter consists of two NMOSs ( $M_{1-2}$ ), seven PMOSs ( $M_{3-9}$ ), three flying capacitors ( $C_{F1-F3}$ ), and four storage capacitors ( $C_{1-3}$  and  $C_{OUT}$ ). At the control stage,



Fig. 14. (a) Proposed 14 V multilevel dc-dc converter and (b) control signal waveforms of power-stage transistors.

pulsewidth modulation (PWM) [28] is employed with output feedback for the switch control. A single control loop generates two PWM control signals with a 180° phase shift. In addition, $M_1$ and $M_2$ are controlled in a nonoverlapping manner with $M_{3-4,7-8}$ and $M_{5-6,9}$, respectively, as shown in Fig. 14(b). To generate the control signal that swings between both 1 and 2 V<sub>DDH</sub>, the stress-relaxed multiple-output high-voltage level shifter proposed in [30] is used. In the power stage, transistors $M_{3-4}$, $M_{5-6}$, and $M_{7-8}$ are stacked in each stage so that the stress on each thick-oxide transistor does not exceed 1 V<sub>DDH</sub> (unlike [31], which requires stress-tolerant diodes). The storage capacitors (i.e., $C_{1-3}$) are introduced to stabilize the four voltage outputs (i.e., 1–4 $V_{DDH}$) and further ensure stress sharing. The four voltage outputs not only power and drive the FDCR-HVTX but also serve as the supply voltages of the level shifter required to control the transistors.

The boost converter is implemented with two inductors operating in three states in an interleaved manner (see Fig. 15). The red and blue paths represent the charging and discharging of  $L_1$  and  $L_2$ , respectively. In state-1,  $M_1$ ,  $M_{5-6}$ , and  $M_9$  are OFF and  $M_2$ ,  $M_{3-4}$ , and  $M_{7-8}$  are ON. While  $V_{in}$  is charging the inductor  $L_2$ , the inductor  $L_1$  can charge capacitors  $C_1$  and  $C_{F1}$ , while the flying capacitor  $C_{F2}$  (having 2 V<sub>DDH</sub> accumulated) charges capacitors  $C_3$  and  $C_{F3}$ . Therefore, from Fig. 15(a) the flying capacitor voltage can be written as shown in the following equation:

$$V_{C_{F1}} = V_{C_{F3}} - V_{C_{F2}} = \frac{V_{\text{in}}}{(1-D)} \tag{5}$$

where  $V_{C_{F1-3}}$  is the voltage of  $C_{F1-3}$ ,  $V_{in}$  is the input voltage, and D (D > 0.5) is the switching duty cycle of  $M_{1-2}$ .

In state-2, only $M_1$ and $M_2$ are ON. $V_{in}$ charges both $L_1$ and $L_2$, as shown in Fig. 15(b). In state-3, $M_1$, $M_{5-6}$, and $M_9$ are ON and $M_2$, $M_{3-4}$, and $M_{7-8}$ are OFF. While $V_{in}$ is charging the inductor $L_1$, the inductor $L_2$ charges capacitors $C_2$ and $C_{F2}$ together with $C_{F1}$ (having 1 V<sub>DDH</sub> accumulated) and simultaneously charges $C_{OUT}$ with $C_{F3}$ (having 3 V<sub>DDH</sub> accumulated). Therefore, from



Fig. 15. Working principle of the proposed converter. (a) State-1, (b) state-2, and (c) state-3.

Fig. 15(c) and the  $L_2$  discharging process, the flying capacitor voltage can be written as shown in the following equation:

$$V_{C_{F2}} - V_{C_{F1}} = V_{\text{OUT}} - V_{C_{F3}} = \frac{V_{\text{in}}}{(1-D)} \tag{6}$$

TABLE I $|V_{\rm DS}|$ of $M_1$–$M_9$ in Different States

|         | M<sub>1</sub>   | M<sub>2</sub>   | M<sub>3</sub>   | M<sub>4</sub>     | M<sub>5</sub>     | M<sub>6</sub>     | M<sub>7</sub>     | M<sub>8</sub>     | M<sub>9</sub>     |
|---------|-----------------|-----------------|-----------------|-------------------|-------------------|-------------------|-------------------|-------------------|-------------------|
| State-1 | V<sub>DDH</sub> | 0               | 0               | 0                 | V<sub>DDH</sub>   | V<sub>DDH</sub>   | 0                 | 0                 | V<sub>DDH</sub>   |
| State-2 | 0               | 0               | V<sub>DDH</sub> | V<sub>DDH</sub>/2 | V<sub>DDH</sub>/2 | V<sub>DDH</sub>/2 | V<sub>DDH</sub>/2 | V<sub>DDH</sub>/2 | V<sub>DDH</sub>/2 |
| State-3 | 0               | V<sub>DDH</sub> | V<sub>DDH</sub> | V<sub>DDH</sub>   | 0                 | 0                 | V<sub>DDH</sub>   | V<sub>DDH</sub>   | 0                 |

TABLE II $V_{\rm GS}$ of $M_1$–$M_9$ in Different States

|         | M<sub>1</sub>   | M<sub>2</sub>   | M<sub>3</sub>    | M<sub>4</sub>     | M<sub>5</sub>    | M<sub>6</sub>     | M<sub>7</sub>    | M<sub>8</sub>     | M<sub>9</sub>    |
|---------|-----------------|-----------------|------------------|-------------------|------------------|-------------------|------------------|-------------------|------------------|
| State-1 | 0               | V<sub>DDH</sub> | -V<sub>DDH</sub> | -V<sub>DDH</sub>  | 0                | 0                 | -V<sub>DDH</sub> | -V<sub>DDH</sub>  | 0                |
| State-2 | V<sub>DDH</sub> | V<sub>DDH</sub> | 0                | V<sub>DDH</sub>/2 | 0                | V<sub>DDH</sub>/2 | 0                | V<sub>DDH</sub>/2 | 0                |
| State-3 | V<sub>DDH</sub> | 0               | 0                | 0                 | -V<sub>DDH</sub> | -V<sub>DDH</sub>  | 0                | 0                 | -V<sub>DDH</sub> |

where $V_{OUT}$ is the output voltage of the last stage. The converter cycles periodically through State-1$\rightarrow$State-2$\rightarrow$State-3$\rightarrow$State-2.

From (5) and (6), the capacitor voltages for the proposed converter can be derived as shown in the following equation:

$$V_{C_{F3}} = \frac{3}{2} V_{C_{F2}} = 3 V_{C_{F1}} = \frac{3V_{\text{in}}}{(1-D)}. \tag{7}$$

The output voltage is derived from (6), which is given by

$$V_{\rm OUT} = \frac{V_{\rm in}}{(1-D)} + V_{C_{F3}} = \frac{4V_{\rm in}}{(1-D)}. \tag{8}$$

As shown in Fig. 15, the voltage stress on every power switch in each state does not exceed 1 V<sub>DDH</sub>, making the design compatible with the 180-nm standard CMOS process. Tables I and II summarize the voltage stresses in terms of $V_{DS}$ and $V_{GS}$ of the power switches ($M_{1-2}$ are NMOSs and $M_{3-9}$ are PMOSs). The regulated output voltage of 14 V is measured at the converter output, supporting up to 200-mW output power with the 180-nm standard CMOS process.
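The relations in (5)–(8) can be checked numerically. The Python sketch below computes the flying-capacitor and output voltages from an input voltage and duty cycle, and inverts (8) to find the duty cycle for a target output. The function names and the example input voltage are hypothetical; the text above does not state the actual $V_{in}$ or *D* used on the chip.

```python
def converter_voltages(v_in, duty):
    """Ideal capacitor voltages of the converter, following Eqs. (5)-(8).

    Each stage adds one step of V_in / (1 - D); the text requires D > 0.5.
    """
    if not 0.5 < duty < 1.0:
        raise ValueError("duty cycle must satisfy 0.5 < D < 1")
    step = v_in / (1.0 - duty)          # Eq. (5): V_CF1
    return {
        "V_CF1": step,
        "V_CF2": 2.0 * step,            # from Eq. (6): V_CF2 - V_CF1 = step
        "V_CF3": 3.0 * step,            # Eq. (7)
        "V_OUT": 4.0 * step,            # Eq. (8)
    }


def duty_for_output(v_in, v_out):
    """Invert Eq. (8): D = 1 - 4 * V_in / V_OUT."""
    return 1.0 - 4.0 * v_in / v_out


if __name__ == "__main__":
    V_IN = 0.875   # V, hypothetical input voltage, for illustration only
    V_OUT = 14.0   # V, regulated output reported in the text
    d = duty_for_output(V_IN, V_OUT)
    print(f"required duty cycle: {d:.3f}")
    for name, value in converter_voltages(V_IN, d).items():
        print(f"{name} = {value:.2f} V")
```

Under these ideal relations, a 14-V output corresponds to a step of 3.5 V per stage, that is, the four outputs sit at 1, 2, 3, and 4 V<sub>DDH</sub>.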

#### VI. MEASUREMENT AND IMPLEMENTATION RESULTS

#### A. Processing Latency and Throughput of DBE

The main outputs of this UIS are bundled in a 17-bit bus from the DBE. The bus consists of 16 bits of data for the voxel intensity and 1 bit for the VOXEL\_VALID signal. At the rising edge of the core clock, the corresponding data are valid for sampling if the valid signal is high. The coordinates of the voxel in the defined VOI are implicitly defined by the beamfocus sequence, as the architecture does not skip voxels. The processing latency after echo arrival is 12 ADC clock cycles plus 2 core clock cycles, which is 7.76 $\mu$s in total.

Fig. 16 shows the measurement results. ADC sampling is triggered at a frequency of 130.2 kHz. Each new set of ADC data makes a batch of voxels ready to be beamfocused, shown as a train of VOXEL\_VALID signals. Due to pipelined



Fig. 16. DBE measurement of processing latency and throughput.



Fig. 17. (a) Waveform of FDCR-HVTX driving bimorph pinned dual-electrode pMUT and bulk piezo at 28 $V_{pp}$. (b) Comparison of power versus $V_{pp}$ between FDCR-HVTX and HVTX.

operation, in which the ADCs sample the next set of data while the DBE is generating new voxels, the voxels that become ready once an echo arrives (new ADC data) appear as the next train of VOXEL\_VALID signals. In the top part of Fig. 16, the trains of VOXEL\_VALID shown right after the echo arrival appear quickly because they result from the previous echo arrival.

In the prototype, the train of VOXEL\_VALID is measured roughly 7.76 $\mu$s after the ADCs start to sample. TX\_EN is a test signal that shows when the TX transducers are excited to transmit ultrasound pulses. It also marks the start of a frame. In the measurement, the interval between two TX\_EN signals is about 40 ms, which translates to a 24 frames/s real-time framerate. With each frame containing 100 planes, there are in total 409 600 VOXEL\_VALID high signals per frame, which implies that the achieved throughput is 9.83M FocalPoints/s.
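The throughput figure follows directly from the numbers above. The short Python sketch below reproduces the arithmetic; the function and constant names are illustrative and do not come from the design itself.

```python
VOXELS_PER_FRAME = 409_600   # VOXEL_VALID pulses counted per frame
PLANES_PER_FRAME = 100       # planes per frame
FRAME_RATE_FPS = 24          # frames per second


def voxels_per_plane(voxels_per_frame, planes_per_frame):
    """Number of voxels in one plane of the frame."""
    return voxels_per_frame // planes_per_frame


def focal_points_per_second(voxels_per_frame, frame_rate_fps):
    """Throughput: voxels per frame multiplied by frames per second."""
    return voxels_per_frame * frame_rate_fps


def frame_period_ms(frame_rate_fps):
    """Frame period in milliseconds for a given frame rate."""
    return 1000.0 / frame_rate_fps


if __name__ == "__main__":
    print(voxels_per_plane(VOXELS_PER_FRAME, PLANES_PER_FRAME))
    print(focal_points_per_second(VOXELS_PER_FRAME, FRAME_RATE_FPS))
    print(f"{frame_period_ms(FRAME_RATE_FPS):.2f} ms")
```

The result is 4096 voxels per plane and 9 830 400 focal points per second, which rounds to the reported 9.83M FocalPoints/s; the 24 frames/s frame period of about 41.7 ms agrees with the measured TX\_EN interval of about 40 ms.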

## B. FDCR-HVTX Measurement

To verify the proposed FDCR-HVTX for different types of ultrasonic sensors, the measurements are done by driving both

TABLE III Comparison of the State-of-the-Art 3-D Imaging System

|                              | This work            | JSSC'22 [16]          | JSSC'15 [17]         | JSSC'21 [10]          | JSSC'21 [12]          | JSSC'16 [14]           | JSSC'21 [3]         | ISSCC'21 [7]                                      | TUFFC'21 [18]          |
|------------------------------|----------------------|-----------------------|----------------------|-----------------------|-----------------------|------------------------|---------------------|---------------------------------------------------|------------------------|
| Process                      | 180 nm Standard CMOS | 180 nm Standard CMOS  | 180 nm CMOS          | 180 nm Standard CMOS  | 180 nm BCD            | 180 nm BCD             | 65 nm BSI CMOS      | 40 nm Standard CMOS                               | No Chip Implementation |
| Sensor Type                  | UIS                  | UIS                   | UIS                  | UIS                   | UIS                   | UIS                    | CIS                 | LiDAR                                             | UIS                    |
| Transducer Type              | pMUT/PZT             | pMUT/PZT              | pMUT                 | pMUT                  | PZT                   | pMUT                   | -                   | -                                                 | PZT                    |
| Image Reconstruction         | on-chip              | off-chip              | off-chip             | off-chip              | off-chip              | off-chip               | on-chip             | on-chip                                           | off-chip               |
| Focal Points/s               | 9.8M                 | 1.63M                 | -                    | 155.5k                | -                     | -                      | -                   | 211.7k*                                           | 980.6k*                |
| Processing Latency           | 7.76 μs              | 250 ms (FPGA)         | -                    | - (software)          | -                     | - (FPGA)               | -                   | -                                                 | - (FPGA)               |
| Medium                       | Air                  | Air                   | Air                  | Biomedical            | Biomedical            | Fingerprint            | Air                 | Air                                               | Air                    |
| Framerate (fps)              | 24                   | 4                     | 30                   |                       | 5                     | 380                    | 60                  | 20                                                | 29                     |
| Range (m)                    | 7                    | 7                     | 1                    | -                     | 0.012                 | << 0.01                | 4                   | 150-200                                           | 3                      |
| Overall Power                | 142.3 mW             | 68.8 mW <sup>§*</sup> | 0.4 mW <sup>§*</sup> | 38.3 mW <sup>§*</sup> | RX: 6 mW <sup>§</sup> | 106.4 mW <sup>§*</sup> | 290 mW <sup>§</sup> | TX: 45 W <sup>¶</sup><br>RX: 1.2 W <sup>§</sup>   | >> 10 W <sup>†</sup>   |
| #Channel                     | 64                   | 16 (one side)         | 7                    | 36                    | 64                    | 56                     | -                   | -                                                 | 64                     |
| Field of view                | ±30°                 | ±30°                  | ±45°                 | ±30°                  | -                     | -                      | ±39°                | H25.2° V9.45°                                     | ±50°                   |
| Range Error                  | 7.3 mm @ 0.4 m       | -                     | 0.41 mm @ 0.5 m      | 0.83 mm               | 560 μm                | -                      | 3.6 mm              | 15-30 cm                                          | -                      |
| Max. Effective TX Voltage    | 28 V<sub>pp</sub>    | 14 V<sub>pp</sub>     | 5 V<sub>pp</sub>     | 13.2 V<sub>pp</sub>   | 30 V<sub>pp</sub>     | -                      | -                   | -                                                 | 20 V<sub>pp</sub>      |
| On-Chip DC-DC @Standard CMOS | Yes                  | No                    | No                   | No                    | No                    | No                     | No                  | No                                                | No                     |
| Area (mm<sup>2</sup>)        | 32.5                 | 25                    | 1.7                  | 11.8                  | 1.77                  | 23.7                   | 32.8                | 53.8                                              | -                      |

\* Estimated based on information provided in publications. † Discrete components (IC + FPGA + GPU). § Without on-chip post-beamforming. ¶ Pulsed laser diode TX peak power.



Fig. 18. (a) Proposed UIS mounted on a DJI Mavic Air 2 drone while it is flying and (b) chip micrograph of the proposed UIS ASIC.

the bimorph pinned dual-electrode pMUT and the bulk piezo transducers. Fig. 17(a) shows the waveform of the FDCR-HVTX at 28 V<sub>pp</sub> driving a 192-kHz pMUT with 0.4-nF $C_{FT}$ and a 40-kHz bulk piezo with 1.9-nF $C_{FT}$. A fully differential HVTX without CR is also implemented for comparison. Fig. 17(b) shows the power consumption of the FDCR-HVTX and the HVTX driving a 1.9-nF PZT with 40-kHz continuous pulses at different driving voltages ($V_{pp}$). Because of the nonideal CR control, which is caused by partial overlap of the control signals, the proposed FDCR-HVTX saves 25% of the power consumption at 28 V<sub>pp</sub>, less than the theoretical 50%.

#### C. Drone-Mounting UIS Prototype

A functioning prototype mounted on a drone is built to verify the proposed UIS architecture, as shown in Fig. 18(a). An $8 \times 8$, 1-cm-pitch, 40-kHz bulk piezo array is used for measurement. The 1-cm pitch is greater than one wavelength ($\lambda$), which inevitably leads to grating lobes; however, for the overall cost-effectiveness of the system, the commercially available 40-kHz 400SR100 bulk piezo transducers (outer diameter = 9.7 mm) are chosen to build the prototype. The spatial angle resolution is $2.7^{\circ}$.
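To give a feel for the grating-lobe issue mentioned above, the following Python sketch applies the textbook grating-lobe condition for a uniformly spaced array, $\sin\theta_g = \sin\theta_0 + m\lambda/d$ with nonzero integer $m$, along one array axis. It is not the authors' analysis: the function name is hypothetical, the speed of sound of 343 m/s is an assumed room-temperature value, and element directivity is ignored.

```python
import math


def grating_lobe_angles(pitch_m, freq_hz, steer_deg, c_sound=343.0, max_order=3):
    """Grating-lobe directions (degrees) of a uniform linear array.

    Uses sin(theta_g) = sin(theta_0) + m * lambda / d for nonzero integer m
    and keeps only the solutions that lie in real space (|sin| <= 1).
    c_sound = 343 m/s is an assumed speed of sound in air.
    """
    wavelength = c_sound / freq_hz
    sin_steer = math.sin(math.radians(steer_deg))
    angles = []
    for m in range(-max_order, max_order + 1):
        if m == 0:
            continue  # m = 0 is the main lobe itself
        s = sin_steer + m * wavelength / pitch_m
        if abs(s) <= 1.0:
            angles.append(math.degrees(math.asin(s)))
    return sorted(angles)


if __name__ == "__main__":
    PITCH = 0.01    # m, 1-cm pitch of the prototype array
    FREQ = 40e3     # Hz, transducer frequency
    for steer in (0.0, 15.0, 30.0):
        lobes = grating_lobe_angles(PITCH, FREQ, steer)
        print(f"steer {steer:5.1f} deg -> grating lobes at "
              + ", ".join(f"{a:.1f}" for a in lobes))
```

Under these assumptions the wavelength is roughly 8.6 mm, so a 1-cm pitch places grating lobes in real space even at broadside (close to ±59°), whereas a half-wavelength pitch would produce none.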

The FDCR-HVTX drives $2 \times 2$ TX channels at 40 kHz, because the transmit grating-lobe issue becomes more severe when more channels are used for TX. We tested and compared 8, 16, and 32 TX pulses; 16 pulses balance the maximum sound pressure generated against the power consumed by the TX, and with 16 pulses the TX is active for 0.96% of the overall operation time. The 64 RX channels convert the received echoes into electrical signals and digitize them for the DBE, where the voxels are calculated. To save wireless transmission bandwidth, only voxels whose intensity value passes the threshold are transmitted. Each passed intensity value is packed together with its VOI coordinates. The data packet contains 32 b, in which $\theta$, $\phi$, R, and intensity take 6, 6, 7, and 13 b, respectively. Once the data collection and processing of a frame are completed, these packets are transmitted to the ESP32 using a universal asynchronous receiver/transmitter (UART). Meanwhile, the ESP32 module starts to send the packed data in bytes to the remote UI display through Wi-Fi.
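The 32-b packet described above can be sketched in Python as follows. Only the field widths (6, 6, 7, and 13 b) come from the text; the field order within the word, the big-endian byte order, and all function names are assumptions made for illustration.

```python
THETA_BITS, PHI_BITS, R_BITS, INTENSITY_BITS = 6, 6, 7, 13  # 32 b in total


def pack_voxel(theta, phi, r, intensity):
    """Pack one voxel into a 32-bit word (assumed order: theta|phi|R|intensity)."""
    fields = ((theta, THETA_BITS), (phi, PHI_BITS),
              (r, R_BITS), (intensity, INTENSITY_BITS))
    word = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack_voxel(word):
    """Inverse of pack_voxel: return (theta, phi, r, intensity)."""
    intensity = word & ((1 << INTENSITY_BITS) - 1)
    word >>= INTENSITY_BITS
    r = word & ((1 << R_BITS) - 1)
    word >>= R_BITS
    phi = word & ((1 << PHI_BITS) - 1)
    word >>= PHI_BITS
    theta = word & ((1 << THETA_BITS) - 1)
    return theta, phi, r, intensity


def to_bytes(word):
    """Serialize the word as 4 bytes (big-endian is an assumption)."""
    return word.to_bytes(4, "big")


if __name__ == "__main__":
    packet = pack_voxel(theta=5, phi=10, r=99, intensity=1234)
    print(to_bytes(packet).hex())
    print(unpack_voxel(packet))
```

The field widths appear consistent with the rest of the system: 6 b each for $\theta$ and $\phi$ address 64 × 64 = 4096 voxels per plane, and 7 b for R cover the 100 planes of a frame.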

To remotely monitor the processed voxel information from drones, a high-frame-rate (>24 frames/s), multi-image, real-time display UI is developed. Fig. 19 shows the UI display interface, which has a top view, a depth view, and a drone camera view. The pixels in the top view and the depth view represent the VOI where detected objects reside. The distance is indicated by the color bar, and the transparency of the color represents the voxel intensity, which shows the strength of the echo. On the display device, upon detecting the UDP sockets from the ESP32, the application begins to cache the received data packets in memory and decodes them into coordinates and intensity information. The data transmission of each frame ends with a 4-byte flag. The UI then refreshes with the processed data, and it achieves a dynamic display at a rate of more than 24 frames/s.

Table III presents the comparison between this work and previous works. Fig. 18(b) is the ASIC die photograph, and Fig. 20(a) shows the power breakdown of the system. The measured SNR of the RX chain and range error from 0.1 to 7 m are shown in Fig. 20(b).



Fig. 19. UI image is taken from a live stream with the proposed UIS ASIC mounted on the drone using an 8  $\times$  8 bulk piezo array.



Fig. 20. (a) Power breakdown. (b) SNR and range error measurements of the UIS ASIC.

# VII. CONCLUSION

The presented UIS ASIC integrates a PV-RXBF DBE to achieve 7.76-$\mu$s processing latency and 24 frames/s real-time 3-D depth image reconstruction, and the FDCR-HVTX drives the transducers with 28 V<sub>pp</sub> pulses while reducing the average TX power consumption by 25%. The ASIC, fabricated in 180-nm standard CMOS, occupies 32.5 mm<sup>2</sup> and consumes 142.3 mW during operation. A working prototype on a consumer-grade entry-level drone, comprising the ASIC, an 8 × 8 bulk piezo transducer array, and a wireless data transmission module (FPGA + ESP32), successfully reconstructs images at 24 frames/s (with 100 planes per frame) at a 7-m range while the drone is flying.

# REFERENCES

- [1] (2020). Mavic Air 2 Specs. [Online]. Available: https://www.dji.com/sg/mavic-air-2/specs
- [2] D. Kim *et al.*, "Indirect time-of-flight CMOS image sensor with on-chip background light cancelling and pseudo-four-tap/two-tap hybrid imaging for motion artifact suppression," *IEEE J. Solid-State Circuits*, vol. 55, no. 11, pp. 2849–2865, Nov. 2020.
- [3] M.-S. Keel *et al.*, "A 1.2-mpixel indirect time-of-flight image sensor with 4-tap 3.5-μm pixels for peak current mitigation and multi-user interference cancellation," *IEEE J. Solid-State Circuits*, vol. 56, no. 11, pp. 3209–3219, Nov. 2021.

- [4] C. S. Bamji *et al.*, "1Mpixel 65nm BSI 320MHz demodulated TOF image sensor with 3μm global shutter pixels and analog binning," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2018, pp. 94–96.
- [5] C. S. Bamji *et al.*, "A 0.13 μm CMOS system-on-chip for a 512 × 424 time-of-flight image sensor with multi-frequency photo-demodulation up to 130 MHz and 2 GS/s ADC," *IEEE J. Solid-State Circuits*, vol. 50, no. 1, pp. 303–319, Jan. 2015.
- [6] H. Seo et al., "Direct TOF scanning LiDAR sensor with two-step multievent histogramming TDC and embedded interference filter," *IEEE J. Solid-State Circuits*, vol. 56, no. 4, pp. 1022–1035, Apr. 2021.
- [7] O. Kumagai et al., "7.3 A 189×600 back-illuminated stacked SPAD direct time-of-flight depth sensor for automotive LiDAR systems," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2021, pp. 110–112.
- [8] C. Zhang, S. Lindner, I. M. Antolovic, J. M. Pavia, M. Wolf, and E. Charbon, "A 30-frames/s, 252 × 144 SPAD flash LiDAR with 1728 dual-clock 48.8-ps TDCs, and pixel-wise integrated histogramming," *IEEE J. Solid-State Circuits*, vol. 54, no. 4, pp. 1137–1151, Apr. 2019.
- [9] K. Yoshioka *et al.*, "A 20-ch TDC/ADC hybrid architecture LiDAR SoC for 240×96 pixel 200-m range imaging with smart accumulation technique and residue quantizing SAR ADC," *IEEE J. Solid-State Circuits*, vol. 53, no. 11, pp. 3026–3038, Nov. 2018.
- [10] J. Lee *et al.*, "A 36-channel auto-calibrated front-end ASIC for a pMUT-based miniaturized 3-D ultrasound system," *IEEE J. Solid-State Circuits*, vol. 56, no. 6, pp. 1910–1923, Jun. 2021.
- [11] J. Lim, C. Tekes, E. F. Arkan, A. Rezvanitabar, F. L. Degertekin, and M. Ghovanloo, "Highly integrated guidewire ultrasound imaging system-on-a-chip," *IEEE J. Solid-State Circuits*, vol. 55, no. 5, pp. 1310–1323, May 2020.
- [12] D. M. van Willigen *et al.*, "A transceiver ASIC for a single-cable 64-element intra-vascular ultrasound probe," *IEEE J. Solid-State Circuits*, vol. 56, no. 10, pp. 3157–3166, Oct. 2021.
- [13] C. Sutardja, A. Singhvi, A. Fitzpatrick, A. Cathelin, and A. Arbabian, "Multi-watt-level 4.9-GHz silicon power amplifier for portable thermoacoustic imaging," *IEEE J. Solid-State Circuits*, vol. 57, no. 5, pp. 1421–1431, May 2022.
- [14] H.-Y. Tang et al., "3-D ultrasonic fingerprint sensor-on-a-chip," IEEE J. Solid-State Circuits, vol. 51, no. 11, pp. 2522–2533, Nov. 2016.
- [15] H. Wu et al., "A 7m-range, 4.3 mW/Ch. ultrasound ASIC with universal energy recycling TX for all-weather metamorphic robotic 3D vision system," in *Proc. IEEE Asian Solid-State Circuits Conf. (A-SSCC)*, Nov. 2021, pp. 1–3.
- [16] H. Wu *et al.*, "An ultrasound ASIC with universal energy recycling for > 7-m all-weather metamorphic robotic vision," *IEEE J. Solid-State Circuits*, early access, Jul. 4, 2022, doi: 10.1109/JSSC.2022.3182102.
- [17] R. J. Przybyla, H. Y. Tang, A. Guedes, S. E. Shelton, D. A. Horsley, and B. E. Boser, "3D ultrasonic rangefinder on a chip," *IEEE J. Solid-State Circuits*, vol. 50, no. 1, pp. 320–334, Jan. 2015.
- [18] G. Allevato *et al.*, "Real-time 3-D imaging using an air-coupled ultrasonic phased-array," *IEEE Trans. Ultrason., Ferroelectr., Freq. Control*, vol. 68, no. 3, pp. 796–806, Mar. 2021.
- [19] M. Cikes, L. Tong, G. R. Sutherland, and J. D'Hooge, "Ultrafast cardiac ultrasound imaging: Technical principles, applications, and clinical benefits," *JACC Cardiovascular Imag.*, vol. 7, no. 8, pp. 812–823, Aug. 2014.
- [20] L. Wu et al., "BatDrone: A 9.83M-focal-points/s 7.76µs-latency ultrasound imaging system with on-chip per-voxel RX beamfocusing for 7m-range drone applications," in *IEEE Int. Solid-State Circuits Conf.* (ISSCC) Dig. Tech. Papers, Feb. 2022, pp. 492–494.
- [21] Z. Shao, S. Pala, Y. Peng, and L. Lin, "Bimorph pinned piezoelectric micromachined ultrasonic transducers for space imaging applications," *J. Microelectromech. Syst.*, vol. 30, no. 4, pp. 650–658, Aug. 2021.
- [22] G.-L. Luo, Y. Kusano, and D. A. Horsley, "Airborne piezoelectric micromachined ultrasonic transducers for long-range detection," *J. Microelectromech. Syst.*, vol. 30, no. 1, pp. 81–89, Feb. 2021.
- [23] J. Tillak and J. Yoo, "A 23μW digitally controlled pMUT interface circuit for Doppler ultrasound imaging," in *Proc. IEEE Int. Symp. Circuits Syst. (ISCAS)*, May 2015, pp. 1618–1621.
- [24] J. Tillak, S. Akhbari, N. Shah, L. Radakovic, L. Lin, and J. Yoo, "A 2.34μJ/scan acoustic power scalable charge-redistribution pMUT interface system with on-chip aberration compensation for portable ultrasonic applications," in *Proc. IEEE Asian Solid-State Circuits Conf.* (A-SSCC), Nov. 2016, pp. 189–192.
- [25] Z. Luo and M.-D. Ker, "A high-voltage-tolerant and precise charge-balanced neuro-stimulator in low voltage CMOS process," *IEEE Trans. Biomed. Circuits Syst.*, vol. 10, no. 6, pp. 1087–1099, Dec. 2016.

- [26] H.-M. Lee and M. Ghovanloo, "An adaptive reconfigurable active voltage doubler/rectifier for extended-range inductive power transmission," *IEEE Trans. Circuits Syst. II, Exp. Briefs*, vol. 59, no. 8, pp. 481–485, Aug. 2012.
- [27] K. Chen, H.-S. Lee, A. P. Chandrakasan, and C. G. Sodini, "Ultrasonic imaging transceiver design for CMUT: A three-level 30-Vpp pulse-shaping pulser with improved efficiency and a noise-optimized receiver," *IEEE J. Solid-State Circuits*, vol. 48, no. 11, pp. 2734–2745, Nov. 2013.
- [28] M. Huang, Y. Lu, T. Hu, and R. P. Martins, "A hybrid boost converter with cross-connected flying capacitors," *IEEE J. Solid-State Circuits*, vol. 56, no. 7, pp. 2102–2112, Jul. 2021.
- [29] J. F. Dickson, "On-chip high-voltage generation in MNOS integrated circuits using an improved voltage multiplier technique," *IEEE J. Solid-State Circuits*, vol. SSC-11, no. 3, pp. 374–378, Jun. 1976.
- [30] V. Rana and R. Sinha, "Stress relaxed multiple output high-voltage level shifter," *IEEE Trans. Circuits Syst. II, Exp. Briefs*, vol. 65, no. 2, pp. 176–180, Feb. 2018.
- [31] V. A. K. Prabhala, P. Fajri, V. S. P. Gouribhatla, B. P. Baddipadiga, and M. Ferdowsi, "A DC–DC converter with high voltage gain and two input boost stages," *IEEE Trans. Power Electron.*, vol. 31, no. 6, pp. 4206–4215, Jun. 2016.



**Yande Peng** received the B.S. degree in materials physics from the University of Science and Technology of China, Hefei, China, in 2019. He is currently pursuing the Ph.D. degree with the Department of Mechanical Engineering, University of California at Berkeley, Berkeley, CA, USA.

His research interests focus on the development of piezoelectric micromachined ultrasonic transducers (PMUTs), including their designs, fabrication, and applications.



**Han Wu** (Student Member, IEEE) received the B.Eng. degree in electronic science and technology and the M.E. degree in microelectronics and solid-state electronics from the College of Optoelectronic Engineering, Chongqing University, Chongqing, China, in 2013 and 2016, respectively, and the Ph.D. degree from the Electrical and Computer Engineering Department, National University of Singapore, Singapore, in 2021.

He is currently a Research Fellow with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore. His research focuses on energy-efficient high-speed links, ultralow-power system-on-chip design, MEMS sensor interface circuit design, and so on.



**Liuhao Wu** (Student Member, IEEE) received the B.E. degree from the School of Electrical and Electronic Engineering, NTU, Singapore, in 2019. He is currently pursuing the Ph.D. degree with the Graduate School, Integrative Sciences and Engineering Programme (ISEP), National University of Singapore (NUS), Singapore, attached to the Department of Electrical and Computer Engineering, NUS. His research focuses on fast on-chip image-reconstruction digital processors for low-power ultrasound imaging system (UIS), power-efficient digital application-specific integrated circuit (ASIC) design, hardware-software co-design, and so on.



**Jiaqi Guo** (Graduate Student Member, IEEE) received the B.S. degree from Xidian University, Xi'an, China, in 2019, and the M.S. degree from the National University of Singapore (NUS), Singapore, in 2020, where he is currently pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering.

His current research interests include power management units (PMUs) and transmitter (TX) circuits and systems for an ultrasonic transmission imaging application.



**Jiamin Li** (Member, IEEE) received the B.Eng. and Ph.D. degrees in electrical engineering from the National University of Singapore (NUS), Singapore, in 2017 and 2021, respectively.

She is currently an Assistant Professor with the School of Microelectronics, Southern University of Science and Technology (SUSTech), Shenzhen, China. Her research interests include system and circuit design for the body area network, power management and mixed-signal integrated circuits, and systems for wearable applications.

Dr. Li was a recipient/co-recipient of the 2020–2021 IEEE SSCS Predoctoral Achievement Award, ISSCC 2020 Demonstration Session Certificate of Recognition, ISSCC 2020 Student Travel Grant Award, and ASSCC 2021 Student Design Contest (SDC) Best Design Award.



**Rucheng Jiang** (Graduate Student Member, IEEE) received the M.S. degree in electronic science and technology from Zhejiang University, Hangzhou, China, in 2017. He is currently pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore.

His research interests include energy-efficient analog-to-digital converters (ADCs) and amplifiers.



**Yang Luo** (Student Member, IEEE) received the B.Eng. degree in communication engineering from Soochow University, Suzhou, China, in 2020, and the M.Sc. degree from the National University of Singapore, Singapore, in 2022.

He is currently with the NUS Signal Processing and VLSI Laboratory, Prof. Jerald Yoo's Group. His research focuses on IC design for ultrasound imaging systems (UISs) with piezoelectric micromachined ultrasonic transducer (pMUT) arrays, including algorithm development and energy-efficient system-on-chip (SoC) design.



**Liwei Lin** (Member, IEEE) received the B.S. degree in mechanical engineering from the Power Mechanical Engineering Department, National Tsing Hua University, Hsinchu, Taiwan, in 1986, and the M.S. and Ph.D. degrees in mechanical engineering from the Mechanical Engineering Department, University of California-Berkeley (UC Berkeley), Berkeley, CA, USA, in 1991 and 1993, respectively.

He is currently the James Marshall Wells Professor with the Department of Mechanical Engineering and the Co-Director of the Berkeley Sensor and Actuator Center, UC Berkeley. He has authored or coauthored more than 300 journal papers. His research interests include microelectromechanical systems (MEMS), nanoelectromechanical systems, nanotechnology, design and manufacturing of microsensors and microactuators, development of micro-machining processes by silicon surface/bulk micromachining, micromolding process, and mechanical issues in MEMS, such as heat transfer, solid/fluid mechanics, and dynamics.

Dr. Lin is a fellow of ASME and a Subject Editor of the IEEE JOURNAL OF MICROELECTROMECHANICAL SYSTEMS.



**Jerald Yoo** (Senior Member, IEEE) received the B.S., M.S., and Ph.D. degrees from the Department of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2002, 2007, and 2010, respectively.

From 2010 to 2016, he was with the Department of Electrical Engineering and Computer Science, Masdar Institute, Abu Dhabi, United Arab Emirates, where he was an Associate Professor. From 2010 to 2011, he was also with the Microsystems Technology Laboratories (MTL), Massachusetts Institute of Technology (MIT), Cambridge, MA, USA, as a Visiting Scholar. Since 2017, he has been with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore, where he is currently an Associate Professor. He has pioneered research on low energy body-area-network (BAN) for communication/powering and wearable body sensor networks using the planar-fashionable circuit board for a continuous health monitoring system. He has authored book chapters in Biomedical CMOS ICs (Springer, 2010), Enabling the Internet of Things-From Circuits to Networks (Springer, 2017), The IoT Physical Layer (Springer, 2019), and Handbook of Biochips (Biphasic Current Stimulator for Retinal Prosthesis) (Springer, 2021). His current research interests include low-energy circuit technology for wearable biosignal sensors, flexible circuit board platform, BAN communication and powering, application-specific integrated circuit (ASIC) for piezoelectric micromachined ultrasonic transducers (PMUTs), and system-on-chip (SoC) design to system realization for wearable healthcare applications.

Dr. Yoo served as an IEEE Circuits and Systems Society (CASS) Distinguished Lecturer from 2019 to 2021 as well as an IEEE Solid-State Circuits Society (SSCS) Distinguished Lecturer from 2017 to 2018. He was a recipient or a co-recipient of several awards: IEEE International Solid-State Circuits Conference (ISSCC) 2020 Demonstration Session Award (Certificate of Recognition), IEEE International Symposium on Circuits and Systems (ISCAS) 2015 Best Paper Award (BioCAS Track), ISCAS 2015 Runner-Up Best Student Paper Award, the Masdar Institute Best Research Award in 2015, and the IEEE Asian Solid-State Circuits Conference (A-SSCC) Outstanding Design Award (2005). He was the Founding Vice-Chair of IEEE SSCS United Arab Emirates (UAE) Chapter and is currently the Chair of the IEEE SSCS Singapore Chapter. He is serving/served as a Technical Program Committee Member for the IEEE International Solid-State Circuits Conference (ISSCC), ISSCC Student Research Preview (Co-Chair), IEEE Asian Solid-State Circuits Conference (A-SSCC, Emerging Technologies and Applications Subcommittee Chair), and IEEE Custom Integrated Circuits Conference (CICC). He is also an Analog Signal Processing Technical Committee Member of IEEE Circuits and Systems Society.