I. Introduction
In recent years, academia and industry have proposed a large number of customized spatial accelerators to handle various tensor computations efficiently [1][2][3][4]. Although these accelerators share similar overall functionality, their architectural details, performance, and power consumption can vary significantly depending on how tasks are allocated and in what order they are executed. This combination of task allocation and execution order is referred to as the dataflow, and finding a suitable dataflow is a key issue in the design and deployment of spatial accelerators.

Existing accelerators adopt a variety of dataflows. TPU [2] uses a weight-stationary dataflow on a systolic array, which offers the versatility to support multiple applications. Cambricon [3] employs an H-tree broadcast dataflow to reduce congestion and power consumption in long-distance data transfers. Eyeriss [1] introduces a row-stationary dataflow tailored to convolution operations, maximizing weight reuse and minimizing power consumption. Magnet [4] proposes a local output-stationary dataflow and a local weight-stationary dataflow, which exploit data reuse across multiple levels of the memory hierarchy by keeping data resident in local caches.
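To make the notion of dataflow concrete, the following minimal sketch contrasts two loop orderings of the same dense matrix multiplication C = A × W: a weight-stationary schedule, in which a weight element is held fixed while all partial products that need it are generated, and an output-stationary schedule, in which each output element is accumulated to completion before moving on. The NumPy formulation and function names are illustrative assumptions of ours and are not taken from any of the cited designs; on real hardware the "stationary" loop levels correspond to data pinned in processing-element registers rather than software loops.

```python
import numpy as np

def matmul_weight_stationary(A, W):
    """Weight-stationary schedule (illustrative): each weight W[k, n] is read
    once and held fixed while every partial product that uses it is computed."""
    M, K = A.shape
    K2, N = W.shape
    assert K == K2
    C = np.zeros((M, N))
    for k in range(K):
        for n in range(N):
            w = W[k, n]            # weight stays "resident" for the inner loop
            for m in range(M):
                C[m, n] += A[m, k] * w   # partial sums stream in and out
    return C

def matmul_output_stationary(A, W):
    """Output-stationary schedule (illustrative): each output C[m, n] is
    accumulated to completion locally, so partial sums are never written back
    until they are final."""
    M, K = A.shape
    _, N = W.shape
    C = np.zeros((M, N))
    for m in range(M):
        for n in range(N):
            acc = 0.0              # partial sum stays local until fully reduced
            for k in range(K):
                acc += A[m, k] * W[k, n]   # weights and inputs stream past
            C[m, n] = acc
    return C

# Both schedules compute the same result; they differ only in which operand
# is kept stationary and therefore in which data movements are amortized.
A = np.random.rand(4, 8)
W = np.random.rand(8, 3)
assert np.allclose(matmul_weight_stationary(A, W), matmul_output_stationary(A, W))
```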