Zhenman Fang - IEEE Xplore Author Profile

Showing 1-25 of 46 results

Results

This paper introduces SDA, the first effort to adapt the expensive stable diffusion (SD) model for edge FPGA deployment. First, we apply quantization-aware training to quantize its weights to 4-bit and activations to 8-bit (W4A8) with a negligible accuracy loss. Based on that, we propose a high-performance hybrid systolic array (hybridSA) architecture that natively executes convolution and ...
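For readers unfamiliar with the W4A8 notation, the sketch below shows generic uniform fake quantization of weights to 4 bits and activations to 8 bits. It illustrates the numeric format only, not SDA's actual quantization-aware training scheme; the `fake_quantize` helper is a hypothetical name.

```python
import numpy as np

def fake_quantize(x, num_bits, signed=True):
    """Uniform fake quantization: quantize to num_bits, then dequantize.

    A generic illustration of the W4A8 idea (4-bit weights, 8-bit
    activations), not the paper's actual QAT scheme.
    """
    if signed:
        qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    else:
        qmin, qmax = 0, 2 ** num_bits - 1
    scale = np.abs(x).max() / max(abs(qmin), qmax)
    q = np.clip(np.round(x / scale), qmin, qmax)
    return q * scale  # dequantized values seen by the rest of the network

rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 4)).astype(np.float32)
acts = rng.standard_normal((4,)).astype(np.float32)

w4 = fake_quantize(weights, num_bits=4)   # W4
a8 = fake_quantize(acts, num_bits=8)      # A8
print("max weight error:", np.abs(weights - w4).max())
```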
The computation of electron repulsion integrals (ERIs) is a key component of quantum chemical methods. The intensive computation and bandwidth demand of ERI evaluation present a significant challenge for quantum-mechanics-based atomistic simulations with hybrid density functional theory: due to the tens of trillions of ERI computations in each time step, practical applications are usually limit...
To improve the file storage efficiency of large datasets, big data analytics usually use common file formats, such as the Apache ORC (optimized row columnar) format, to encode and compress the data. However, this shifts the bottleneck from IO (especially with high-bandwidth SSDs) to computation on CPUs, which must decompress and decode the data. This paper presents FORC, a high-throughput stream...
The Bloom filter is one of the most widely used data structures in big data analytics to efficiently filter out vast amounts of noisy data. Unfortunately, prior Bloom filter designs only focus on single-input-stream acceleration, and can no longer match the increasing data rates offered by modern networks. To support large Bloom filters with low false-positive rate and high throughput, we present B...
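As background for what the accelerator parallelizes, here is a minimal single-stream Bloom filter in Python. The hash construction and parameters are illustrative choices, not the paper's design.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over an m-bit array.

    Illustrative only -- the paper's accelerator parallelizes this across
    multiple input streams in hardware; this sketch shows only the base
    data structure it builds on.
    """
    def __init__(self, m_bits: int, k_hashes: int):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive k hash positions by salting a single hash function.
        for i in range(self.k):
            h = hashlib.sha256(i.to_bytes(2, "little") + item).digest()
            yield int.from_bytes(h[:8], "little") % self.m

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: bytes) -> bool:
        # False positives are possible; false negatives are not.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter(m_bits=1 << 16, k_hashes=4)
bf.add(b"alpha")
print(bf.might_contain(b"alpha"), bf.might_contain(b"beta"))  # True False (probably)
```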
Many studies have demonstrated that 4-bit precision quantization can maintain accuracy levels comparable to those of floating-point deep neural networks (DNNs). This has sparked keen interest in the efficient acceleration of such compressed DNNs, especially 4-bit convolutions, on edge devices. However, we observe that conventional systolic array (SA) architectures, widely adopted for DNN acc...
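One well-known way to recover efficiency for 4-bit operands on wide multipliers is to pack two products into a single multiply. The sketch below shows the unsigned version of that classic trick; it is offered only as context and is not necessarily the architecture this paper proposes (signed operands would need correction terms).

```python
def packed_mul_u4(w0: int, w1: int, a: int):
    """Compute two unsigned 4-bit products with a single wider multiply.

    A classic low-precision DSP-packing trick, not necessarily this
    paper's scheme: since each product w*a fits in 8 bits for 4-bit
    unsigned operands, w1 can be packed into the high byte.
    """
    assert 0 <= w0 < 16 and 0 <= w1 < 16 and 0 <= a < 16
    packed = (w1 << 8) | w0          # one 12-bit operand
    result = packed * a              # one hardware multiply
    p0 = result & 0xFF               # low byte  = w0 * a
    p1 = (result >> 8) & 0xFF        # high byte = w1 * a
    return p0, p1

print(packed_mul_u4(5, 9, 13))  # (65, 117) == (5*13, 9*13)
```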
Triangle counting (TC) is one of the fundamental computing patterns in graph computing and social networks. Due to its high memory-to-computation ratio and random memory access patterns, it is nontrivial to accelerate TC's performance. In this work, we propose a high-performance TC (HiTC) accelerator to speed up triangle counting on high-bandwidth memory (HBM)-equipped FPGAs via software/hardware ...
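For reference, a plain software baseline for triangle counting looks like the following: the per-edge neighbor-set intersection is the source of the random memory accesses the abstract mentions. This is a generic sketch, not HiTC's hardware algorithm.

```python
def count_triangles(edges):
    """Count triangles by intersecting the neighbor sets of each edge's endpoints."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    count = 0
    for u, v in edges:
        # Each triangle {u, v, w} is found once per edge, i.e. 3 times total.
        count += len(adj[u] & adj[v])
    return count // 3

edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
print(count_triangles(edges))  # 1
```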
Deep learning-based image compression has made great progress recently. However, some leading schemes use a serial context-adaptive entropy model to improve the rate-distortion (R-D) performance, which is very slow. In addition, the complexities of the encoding and decoding networks are quite high and not suitable for many practical applications. In this paper, we propose four techniques to balanc...
Recent advancements in deep learning-based image compression are notable. However, prevalent schemes that employ a serial context-adaptive entropy model to enhance rate-distortion (R-D) performance are markedly slow. Furthermore, the complexities of the encoding and decoding networks are substantially high, rendering them unsuitable for some practical applications. In this paper, we propose two te...
Recently, learning-based image compression approaches have achieved superior performance over classical image compression methods. However, their complexities remain quite high. In this paper, we propose two efficient modules to reduce the complexity. First, we introduce a selective kernel residual module into the core network, which effectively expands the receptive field and captures global info...
Binary neural network (BNN), where both the weight and the activation values are represented with one bit, provides an attractive alternative to deploy highly efficient deep learning inference on resource-constrained edge devices. However, our investigation reveals that, to achieve satisfactory accuracy gains, state-of-the-art (SOTA) BNNs, such as FracBNN and ReActNet, usually have to incorporate ...
Today's big data query engines are constantly under pressure to keep up with the rapidly increasing demand for faster processing of more complex workloads. In the past few years, FPGA-based database acceleration efforts have demonstrated promising performance improvement with good energy efficiency. However, few studies target the programming and design automation support to leverage the FPGA acc...
In recent years, there has been increasing adoption of FPGAs in datacenters as hardware accelerators, where a large population of end users are software developers. While high-level synthesis (HLS) facilitates software programming, it is still challenging to scale large accelerator designs on modern datacenter FPGAs that often consist of multiple dies and memory banks. More specifically, routing c...
Binary neural network (BNN) has recently presented a promising opportunity for deep learning inferences on resource-constrained edge devices. Using extreme data precision, i.e., 1-bit weight and 1-bit activation, BNN not only significantly reduces the network memory footprint, but also trades massive multiply-accumulate operations for much cheaper logical XNOR and population count operations. Howe...
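The XNOR/popcount substitution the abstract refers to can be seen in a few lines: with {-1, +1} values encoded as bits, a dot product reduces to one XOR (the complement of XNOR) and one population count. A minimal sketch, independent of any particular BNN:

```python
def binary_dot(w_bits: int, a_bits: int, n: int) -> int:
    """Dot product of two length-n {-1,+1} vectors encoded as bit masks.

    Encoding: bit=1 means +1, bit=0 means -1. Then
        dot = 2 * popcount(XNOR(w, a)) - n
    which is the XNOR/popcount substitution for multiply-accumulate.
    """
    xnor_pop = n - bin((w_bits ^ a_bits) & ((1 << n) - 1)).count("1")
    return 2 * xnor_pop - n

# +1 +1 -1 -1  vs  +1 -1 -1 +1  ->  1 - 1 + 1 - 1 = 0
w = 0b1100
a = 0b1001
print(binary_dot(w, a, n=4))  # 0
```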
Stochastic rounding is crucial in the low-bit (e.g., 8-bit) training of deep neural networks (DNNs) to achieve high accuracy. One of the drawbacks of prior studies is that they require a large number of high-precision stochastic rounding units (SRUs) to guarantee low-bit DNN accuracy, which involves considerable hardware overhead. In this paper, we use extremely low-bit SRUs (ESRUs) to save a larg...
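For context, generic stochastic rounding works as below: a value rounds up with probability equal to its fractional part, which makes the rounding unbiased in expectation. This is the textbook operation, not the paper's low-bit ESRU approximation of it.

```python
import numpy as np

def stochastic_round(x, rng):
    """Round x to an integer with probability proportional to proximity.

    E.g. 2.3 rounds to 3 with probability 0.3 and to 2 with probability
    0.7, so the result is unbiased in expectation -- the property that
    makes stochastic rounding useful for low-bit DNN training.
    """
    floor = np.floor(x)
    return floor + (rng.random(x.shape) < (x - floor)).astype(x.dtype)

rng = np.random.default_rng(42)
x = np.full(100_000, 2.3)
print(stochastic_round(x, rng).mean())  # close to 2.3
```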
While vision transformers (ViTs) have continuously achieved new milestones in the field of computer vision, their sophisticated network architectures with high computation and memory costs have impeded their deployment on resource-limited edge devices. In this paper, we propose a hardware-efficient image-adaptive token pruning framework called HeatViT for efficient yet accurate ViT acceleration on...
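Token pruning itself can be sketched generically: score the tokens, keep the top fraction, and preserve their order. The `prune_tokens` helper and its scoring input below are illustrative assumptions, not HeatViT's actual selector.

```python
import numpy as np

def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Keep only the highest-scoring tokens (e.g. by attention importance).

    A generic sketch of image-adaptive token pruning: in a real ViT,
    `scores` would come from a lightweight learned predictor or
    head-averaged attention, not random numbers.
    """
    n_keep = max(1, int(len(tokens) * keep_ratio))
    keep = np.argsort(scores)[-n_keep:]   # indices of the top tokens
    return tokens[np.sort(keep)]          # preserve the original token order

rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 64))   # 14x14 patch tokens, embedding dim 64
scores = rng.random(196)
print(prune_tokens(tokens, scores).shape) # (98, 64)
```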
Accurately and promptly detecting multiscale small objects that span only tens of pixels in remote sensing images (RSI) remains challenging. Most of the existing solutions primarily design complex deep neural networks to learn strong feature representations for objects separated from the background, which often results in a heavy computation burden. In this article, we propose an accurate yet fast o...
The emergence of high-bandwidth memory (HBM) brings new opportunities to boost the performance of sorting acceleration on FPGAs, which was conventionally bounded by the available off-chip memory bandwidth. However, it is nontrivial for designers to fully utilize this immense bandwidth. First, the existing sorter designs cannot be directly scaled at the increasing rate of available off-chip bandwid...
Vision transformers (ViTs) are emerging with significantly improved accuracy in computer vision tasks. However, their complex architecture and enormous computation/storage demand create an urgent need for new hardware accelerator design methodologies. This work proposes an FPGA-aware automatic ViT acceleration framework based on the proposed mixed-scheme quantization. To the best of our knowledge, thi...
The most advanced ASIC-based approximate adders focus on gate- or transistor-level approximating structures. However, due to architectural differences between ASICs and FPGAs, comparable performance gains cannot be obtained for FPGA-based approximate adders by reusing ASIC-based approximation techniques. In this paper, we propose a method for designing a low-error approximate adder that effectively deploy...
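As a reference point for the error/cost trade-off, the lower-part OR adder (LOA) is a classic approximate-adder baseline: it replaces the low-bit carry chain with a bitwise OR. The sketch below shows that baseline only, not the FPGA design this paper proposes.

```python
def loa_add(a: int, b: int, k: int, width: int = 16) -> int:
    """Lower-part OR adder (LOA): a classic approximate-adder baseline.

    The low k bits are approximated with a bitwise OR (no carry chain),
    and only the high bits use an exact adder with no carry-in from the
    low part. Shown as context for the error/cost trade-off the paper
    targets -- not the paper's proposed FPGA design.
    """
    mask = (1 << k) - 1
    low = (a | b) & mask                  # approximate low part
    high = ((a >> k) + (b >> k)) << k     # exact high part, no carry-in
    return (high | low) & ((1 << width) - 1)

a, b = 1234, 5678
print(loa_add(a, b, k=4), a + b)  # approximate vs. exact sum (6910 vs. 6912)
```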
Because of their high accuracy, deep neural networks (DNNs) have achieved amazing success in security-critical systems such as medical devices. It has recently been demonstrated that Adversarial Bit Flip Attacks (BFAs), which flip a very small number of bits in DNN hardware, can result in catastrophic accuracy loss. The reliance on test data, however, is a significant drawback of previous st...
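The attack primitive is easy to demonstrate: flipping a single exponent bit of a float32 weight changes its magnitude by dozens of orders of magnitude. A minimal sketch of the mechanism only; real BFAs search a quantized model for the most damaging bits.

```python
import numpy as np

# Flip one bit of a float32 weight via its raw integer representation.
w = np.array([0.5], dtype=np.float32)
bits = w.view(np.uint32)                # reinterpret the same memory as uint32
bits ^= np.uint32(1 << 30)              # flip the most significant exponent bit
print("0.5 ->", w[0])                   # ~1.7e38 after a single bit flip
```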
Recently, deep neural networks (DNNs) have been deployed in safety-critical systems such as autonomous vehicles and medical devices. Shortly after that, the vulnerability of DNNs was revealed by stealthy adversarial examples, where crafted inputs, formed by adding tiny perturbations to original inputs, can lead a DNN to produce misclassifications. To improve the robustness of DNNs, some algorithmic...
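A minimal concrete instance of such a perturbation is the fast gradient sign method (FGSM) applied to a one-layer model, where the input gradient has a closed form. This is a generic textbook example, not the attack or defense studied in this paper; real attacks backpropagate through a full DNN.

```python
import numpy as np

def fgsm_perturb(x, w, b, y, eps):
    """FGSM-style adversarial perturbation on a logistic-regression model.

    The cross-entropy gradient w.r.t. the input is (p - y) * w in closed
    form, so no autograd is needed for this toy example.
    """
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))   # predicted P(y=1)
    grad_x = (p - y) * w                     # d(cross-entropy)/dx
    return x + eps * np.sign(grad_x)         # small step that increases the loss

rng = np.random.default_rng(1)
w = rng.standard_normal(8)
b = 0.0
x = rng.standard_normal(8)
y = 1.0
x_adv = fgsm_perturb(x, w, b, y, eps=0.3)
orig = 1 / (1 + np.exp(-(x @ w + b)))
adv = 1 / (1 + np.exp(-(x_adv @ w + b)))
print(f"P(y=1): clean {orig:.3f} -> adversarial {adv:.3f}")
```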
Deep neural networks (DNNs) are increasingly being deployed in safety-critical systems such as personal healthcare devices and self-driving cars. In such DNN-based systems, error resilience is a top priority since faults in DNN inference could lead to mispredictions and safety hazards. For latency-critical DNN inference on resource-constrained edge devices, it is nontrivial to apply conventional r...
Recently, the convolutional neural network (CNN)-based approach for on-satellite ship detection in synthetic aperture radar (SAR) images has received increasing attention since it does not rely on predefined imagery features and distributions that are required in conventional detection methods. To achieve high detection accuracy, most of the existing CNN-based methods leverage complex off-the-shel...
In this paper, we develop a framework called MAPLE to enable aging-aware FPGA architecture exploration. The core idea is to efficiently model the aging-induced delay degradation at the coarse-grained FPGA basic block level using deep neural networks (DNNs). For each type of FPGA basic block, such as LUTs and DSPs, we first characterize its accurate delay degradation via transistor-level SPICE...