Wenbo Yin - IEEE Xplore Author Profile

Showing 1-25 of 37 results


Efficient memory allocation is essential for high-performance computing and neural networks implemented on FPGAs, particularly in scenarios demanding rapid data processing and substantial bandwidth. Traditional DDR memory often fails to meet these requirements. To address this, a dynamic memory allocator on FPGA utilizing High Bandwidth Memory (HBM) is developed. Our design includes Data Dispatch, L...
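The entry above is cut off, but the core idea it names is dynamic memory allocation on FPGA backed by HBM. As a rough software illustration of dynamic allocation in general (not a model of the paper's Data Dispatch hardware), the following Python sketch implements a first-fit free-list allocator over a fixed address range; the class name and sizes are hypothetical.

```python
# Minimal software model of first-fit dynamic allocation over a fixed
# address range; illustration only, not the paper's FPGA/HBM allocator.
class FreeListAllocator:
    def __init__(self, size):
        self.free = [(0, size)]              # list of (start, length) free blocks

    def alloc(self, length):
        for i, (start, blen) in enumerate(self.free):
            if blen >= length:               # first block that fits
                if blen == length:
                    self.free.pop(i)
                else:
                    self.free[i] = (start + length, blen - length)
                return start
        return None                          # allocation failure

    def release(self, start, length):
        self.free.append((start, length))    # no coalescing in this sketch

heap = FreeListAllocator(1 << 20)            # hypothetical 1 MiB address space
a = heap.alloc(4096)
b = heap.alloc(8192)
heap.release(a, 4096)
print(a, b)                                  # 0 4096
```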
Conventional design space exploration (DSE) flows for coarse-grained reconfigurable arrays (CGRAs) are based on black-box optimization methods, which are slow and ineffective. This paper proposes FCE, a fast CGRA DSE framework that presents the CGRA design space to the application data flow graph (DFG) mapper and directly employs it to explore the architecture in the mapping process. FCE achieves o...
Convolutional neural networks (CNNs) and transformer neural networks have been adopted in a wide range of applications such as natural language processing and computer vision. Coarse-grained reconfigurable architectures (CGRAs) are highly suitable for CNN and transformer applications due to their high flexibility and energy efficiency. However, current implementations of CGRA for CNNs and transfor...
Coarse-grained reconfigurable architecture (CGRA) is a type of reconfigurable computing architecture suitable for emerging applications that require dynamic compilation hardware. However, the resource utilization of existing CGRAs is low due to the lack of flexibility across varied application granularity. In this paper, we propose a CGRA framework for multiple dataflow lanes (MDCRA). It supports p...
In this paper, we propose a domain-specific framework that integrates Chisel-based Coarse-grained reconfigurable architecture (CGRA) Modeling, RTL generation, Architecture Graph Intermediate Representation (IR), dataflow graph (DFG) Mapping, interconnect exploration, and physical implementation. Within this framework, we propose an interconnect exploration flow based on a novel interconnect archi...
Coarse-grained reconfigurable array (CGRA) hardware design optimization is hampered by time-consuming design exploration and evaluation methods. This paper proposes CDE, a novel CGRA Development Environment with a graph-analysis-based fast design space exploration (DSE) framework and an accurate hardware evaluation tool. CDE significantly improves the efficiency of CGRA architecture developmen...
Coarse-grained reconfigurable architectures (CGRAs) are garnering increasing attention as domain-specific accelerators owing to their high flexibility and energy efficiency. Mapping encompasses both placement and routing, constituting a crucial part of the CGRA toolchain. Achieving high mapping quality while minimizing mapping time has been a key objective. In this paper, we propose TransMap, an e...
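Since several entries on this page concern DFG-to-CGRA mapping, a tiny worked example may help fix the terms: placement assigns DFG nodes to PEs, and routing must realize each DFG edge on the interconnect. The sketch below only checks a hand-made placement against neighbor-to-neighbor links; it is a generic illustration, not TransMap or any other mapper listed here.

```python
# Check that a hand-made placement of a small DFG onto a PE mesh is
# routable with neighbor-to-neighbor links only; generic illustration.
dfg_edges = [("a", "b"), ("b", "c")]                 # data dependencies
placement = {"a": (0, 0), "b": (0, 1), "c": (1, 1)}  # node -> (row, col) PE

def is_routable(edges, place):
    for src, dst in edges:
        (r1, c1), (r2, c2) = place[src], place[dst]
        if abs(r1 - r2) + abs(c1 - c2) != 1:         # producer and consumer not adjacent
            return False
    return True

print(is_routable(dfg_edges, placement))             # True: every edge uses one N2N hop
```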
Coarse-Grained Reconfigurable Arrays (CGRAs) are promising accelerators in the rapidly evolving field of high-performance computing (HPC). However, their potential is limited by the inability of compilers to efficiently map complex application kernels to architectures. In this paper, we propose an architecture-agnostic mapping framework called AGILE, which has a loosely coupled flow that contains ...
Adopting specialized accelerators such as Coarse-Grained Reconfigurable Architectures (CGRAs) alongside CPUs to enhance performance within specific domains is an astute choice. However, the integration of heterogeneous architectures introduces complex challenges for compiler design. Simultaneously, the ever-expanding scale of workloads imposes substantial burdens on deployment. To address the above ch...
Coarse-Grained Reconfigurable Arrays (CGRAs) are attracting more and more attention for their high flexibility and energy efficiency. Due to the limited resources, mapping large data flow graphs (DFGs) that represent application kernels onto a CGRA is difficult, for which partitioning is employed. However, existing partitioning methods in the CGRA domain are unable to handle large kernels. In this ...
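To make the partitioning problem above concrete, here is a deliberately naive size-bounded BFS partitioner that splits a DFG into pieces small enough for a fixed-size array. It is a baseline sketch under assumed inputs, not the partitioning method proposed in the paper, and it ignores dependence direction and inter-partition communication cost.

```python
# Naive size-bounded BFS partitioning of a DFG; baseline sketch only.
from collections import deque

def partition(nodes, edges, max_size):
    adj = {n: [] for n in nodes}
    for u, v in edges:                       # treat the DFG as undirected here
        adj[u].append(v)
        adj[v].append(u)
    unassigned, parts = set(nodes), []
    while unassigned:
        seed = next(iter(unassigned))
        part, queue = [], deque([seed])
        while queue and len(part) < max_size:
            n = queue.popleft()
            if n in unassigned:
                unassigned.remove(n)
                part.append(n)
                queue.extend(m for m in adj[n] if m in unassigned)
        parts.append(part)
    return parts

# Splits a 5-node chain into chunks of at most 2 nodes each.
print(partition(["a", "b", "c", "d", "e"],
                [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")], 2))
```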
Coarse-grained reconfigurable architecture (CGRA) is gradually becoming a highly promising accelerator due to its flexibility and power efficiency. However, most CGRA front-end compilers focus on the innermost body of regular loops with a pure data flow. Therefore, we propose CO-Compiler, an LLVM-based CGRA front-end compiler to generate an optimized control-data flow graph (CDFG), which can...
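As a small picture of what "compiling a loop body into a data flow graph" means, the sketch below uses Python's ast module to pull operand-to-result edges out of one statement. It is only an analogy for the front-end step; the actual CO-Compiler works on LLVM IR and also handles control flow, which is not shown here.

```python
# Extract a tiny data flow view (source operand -> result, plus an op count)
# from one assignment statement; an analogy for DFG construction only,
# unrelated to the LLVM-based CO-Compiler flow itself.
import ast

def dfg_edges(stmt):
    assign = ast.parse(stmt).body[0]          # e.g. "c = a * b + a"
    target = assign.targets[0].id
    ops = [n for n in ast.walk(assign.value) if isinstance(n, ast.BinOp)]
    loads = [n.id for n in ast.walk(assign.value) if isinstance(n, ast.Name)]
    return [(src, target) for src in loads], len(ops)

edges, num_ops = dfg_edges("c = a * b + a")
print(edges, num_ops)   # one edge per operand use of a/b into c, and 2 operations
```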
Due to its high energy efficiency and flexibility, coarse-grained reconfigurable architecture (CGRA) has gained increasing attention. Temporal CGRA is a typical category of CGRA that supports single-cycle context switching and time-multiplexes hardware resources to perform spatial and temporal computations. Although multiple temporal CGRAs have been proposed, an architecture with rich design para...
In today’s tech-driven society, the emphasis on data privacy and security has skyrocketed. As technology progresses, the emergence of new encryption algorithms and advanced attack technologies drives the need for algorithm upgrades. With rising hardware costs and demand for flexible cryptographic platforms, single-algorithm accelerators are insufficient, making versatile accelerators supporting...
When an application is accelerated with a Coarse-Grained Reconfigurable Architecture (CGRA), it is compiled into a Data Flow Graph (DFG). In conventional CGRA frameworks, only one DFG is accelerated in each epoch. Consequently, single-context CGRAs cannot fully utilize hardware resources when executing multi-kernel applications. In this paper, we propose a dynamic partial reconfigurable CGRA framework ...
Coarse-grained reconfigurable architecture (CGRA) is an emerging computing architecture that provides a trade-off between energy efficiency and flexibility. Although the CGRA mapping problem has been explored for many years, efficiently mapping complex loop kernels to large-scale CGRAs is still challenging. In this paper, we present GRAFT, a GRaph neural network (GNN) Adaptive Framework for efficien...
Coarse-grained reconfigurable architecture (CGRA), composed of word-level processing elements (PEs) and interconnects, has emerged as a promising architecture due to its high performance, energy efficiency, and flexibility. Although multiple CGRA frameworks have been proposed, a complete heterogeneous CGRA exploration framework with tunable interconnect flexibility and fast design space exploratio...
Temporal Coarse-Grained Reconfigurable Architecture (CGRA) is a typical category of CGRA that supports single-cycle context switching and time-multiplexes hardware resources to perform both spatial and temporal computations. Compared with spatial CGRAs, it can be used in area- and power-budget-constrained scenarios, at the cost of throughput. Therefore, achieving minimum Initializati...
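The entry above is cut off mid-word, but temporal CGRAs are usually scheduled around a modulo-scheduling-style initiation interval (II). As general background rather than the paper's specific metric, the standard lower bound on II combines a resource bound and a recurrence bound; the numbers below are invented for illustration.

```python
# Standard lower bound on the initiation interval (II) in modulo scheduling
# for time-multiplexed arrays; background only, with made-up kernel sizes.
import math

num_ops, num_pes = 14, 4                 # hypothetical kernel and array sizes
res_mii = math.ceil(num_ops / num_pes)   # resource-constrained bound
rec_mii = 3                              # assumed recurrence (loop-carried) bound
mii = max(res_mii, rec_mii)
print(mii)                               # 4 in this example
```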
Coarse-Grained Reconfigurable Architecture (CGRA) is a domain-specific reconfigurable architecture. Generally, a CGRA consists of IO, memory, coarse-grained processing elements (PEs), and interconnects. Usually, the ALU in a PE contains a relatively complete set of operations, and most interconnects adopt neighbor-to-neighbor (N2N) [1], switch-based [2], and a combination of the connecti...
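To visualize the neighbor-to-neighbor (N2N) interconnect style cited above, the following sketch builds the N2N adjacency of a small PE array. The 4x4 size is an arbitrary example, not the architecture studied in the paper.

```python
# Build the neighbor-to-neighbor (N2N) adjacency of a small PE array.
ROWS, COLS = 4, 4                          # arbitrary example array size

def n2n_neighbors(r, c):
    cand = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(i, j) for i, j in cand if 0 <= i < ROWS and 0 <= j < COLS]

adjacency = {(r, c): n2n_neighbors(r, c) for r in range(ROWS) for c in range(COLS)}
print(adjacency[(0, 0)])   # corner PE has two neighbors: [(1, 0), (0, 1)]
```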
Coarse-grained reconfigurable architecture (CGRA) is a promising accelerator design choice due to its high performance and power efficiency in computation- or data-intensive application domains, such as security, multimedia, digital signal processing, machine learning, and high-performance computing. A CGRA consists of coarse-grained processing elements (PEs) and interconnects that determine the ...
Coarse-grained reconfigurable architecture (CGRA) accelerators are a promising solution in fields such as deep learning and edge computing. One of the difficult problems in designing the accelerator is mapping the data flow graph (DFG) to the architecture efficiently. Previous mapping algorithms based on graph structure mainly focus on optimization during the placement and routing stage, which may f...
Nonvolatile memory express (NVMe) is a high-performance and scalable PCI express (PCIe)-based interface for host software communicating with NVMs, including NAND Flash and storage class memories (SCMs). NVMe solid-state drives (SSDs) have been deployed in cloud platforms and data centers for a variety of I/O-intensive applications due to their performance benefits compared to SATA/SAS SSDs...
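For readers unfamiliar with NVMe, the host-device interaction is built around paired submission and completion queues with doorbell registers. The toy model below mimics only that queue/doorbell hand-off in plain Python; real NVMe command layouts, register offsets, and interrupt handling are not modeled, and nothing here reflects the paper's design.

```python
# Toy model of an NVMe-style submission/completion queue pair with a doorbell.
from collections import deque

class ToyQueuePair:
    def __init__(self):
        self.sq, self.cq = deque(), deque()

    def submit(self, command):
        self.sq.append(command)          # host writes a command into the SQ

    def ring_doorbell(self):
        while self.sq:                   # "device" consumes SQ entries
            cmd = self.sq.popleft()
            self.cq.append({"cmd": cmd, "status": "OK"})

    def poll_completions(self):
        done = list(self.cq)             # host reaps completion entries
        self.cq.clear()
        return done

qp = ToyQueuePair()
qp.submit({"opcode": "READ", "lba": 0, "len": 8})
qp.ring_doorbell()
print(qp.poll_completions())
```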
Coarse-Grained Reconfigurable Arrays (CGRAs) provide sufficient flexibility in domain-specific applications with high hardware efficiency, which makes CGRAs suitable for fast-evolving fields such as neural network acceleration and edge computing. To meet the requirement of fast evolution, we propose FastCGRA, a modeling, mapping, and exploration platform for large-scale CGRAs. FastCGRA suppor...
NAND-Flash-based SSDs have been widely employed in diverse computing domains and storage systems due to their higher performance and lower power consumption than HDDs. There have been various studies exploring the internal parallelism inside SSDs, including channel/way/plane-level interleaving and cache-mode pipelining. However, most current studies are based on simulators or focus on...
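As a concrete picture of the channel/way/plane interleaving mentioned above, this sketch stripes consecutive logical page numbers across an assumed SSD geometry; the geometry constants are invented for illustration, not taken from the paper.

```python
# Stripe a logical page number (LPN) across channel/way/plane levels.
CHANNELS, WAYS, PLANES = 8, 4, 2             # hypothetical SSD geometry

def locate(lpn):
    channel = lpn % CHANNELS
    way = (lpn // CHANNELS) % WAYS
    plane = (lpn // (CHANNELS * WAYS)) % PLANES
    return channel, way, plane

for lpn in range(4):
    print(lpn, locate(lpn))   # consecutive pages land on different channels
```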
Load balancing is one of the most important network services in cloud data centers. However, traditional load balancers are increasingly overstretched by the explosive growth of big data; their latency and throughput fall far short of the performance requirements. Based on the high parallelism and flexibility of Field Programmable Gate Array (FPGA), this paper presents a load balancing scheme...
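The simplest building block behind such load balancers is a stateless hash of a flow's 5-tuple onto a backend list, which keeps packets of one flow pinned to one server. The sketch below shows that idea in software only; it is not the FPGA scheme this paper proposes, and the addresses are made up.

```python
# Stateless hash-based flow-to-backend assignment; software illustration only.
import zlib

backends = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]    # hypothetical server pool

def pick_backend(five_tuple):
    key = "|".join(str(f) for f in five_tuple).encode()
    return backends[zlib.crc32(key) % len(backends)]   # same flow -> same backend

flow = ("192.168.1.5", 40321, "10.1.2.3", 80, "TCP")
print(pick_backend(flow))
```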
This paper presents a high-throughput, low-latency multi-version management KVS accelerator, implemented on an FPGA-CPU heterogeneous architecture, supporting PUT, GET, DELETE, and GET_RANGE operations. A pipelined Hash Engine architecture is proposed to improve the accelerator throughput and avoid data consistency issues. B-Tree processing engines are set in parallel and designed in pipeline t...
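To make the accelerator's interface concrete, here is a purely functional software model of the four supported operations (PUT, GET, DELETE, GET_RANGE) over a dictionary plus a sorted key list; it says nothing about the pipelined Hash Engine or the parallel B-Tree engines themselves, and all names are illustrative.

```python
# Functional model of the four KVS operations named above; software sketch only.
import bisect

class TinyKVS:
    def __init__(self):
        self.data = {}
        self.keys = []                       # sorted keys for range queries

    def put(self, key, value):
        if key not in self.data:
            bisect.insort(self.keys, key)
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

    def delete(self, key):
        if key in self.data:
            del self.data[key]
            self.keys.remove(key)

    def get_range(self, lo, hi):             # inclusive range [lo, hi]
        i = bisect.bisect_left(self.keys, lo)
        j = bisect.bisect_right(self.keys, hi)
        return [(k, self.data[k]) for k in self.keys[i:j]]

kvs = TinyKVS()
kvs.put("a", 1); kvs.put("c", 3); kvs.put("b", 2)
kvs.delete("c")
print(kvs.get("a"), kvs.get_range("a", "b"))   # 1 [('a', 1), ('b', 2)]
```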