1. Introduction
Convolutional neural networks (CNNs) [19], [25], [67], [49], [40], [34] and vision Transformers [11], [54], [72], [39], [59], [41] have driven substantial advances in visual recognition. Despite concerted efforts to scale up vision models for higher accuracy [75], [38], [51], their high computational cost hinders deployment in resource-constrained scenarios. Research on improving the inference efficiency of deep networks spans multiple directions, including lightweight architecture design [23], [77], [22], pruning [15], [20], [70], and quantization [30], [73]. In contrast to traditional models, which follow a static computational graph at test time, dynamic networks [16], [3], [35], [61], [18], [64], [65], [79], [78] can adapt their computation to the complexity of each input, yielding promising results in efficient visual recognition.