
Exploiting Attention-to-Motion via Transformer for Versatile Video Frame Interpolation



Abstract:

Video Frame Interpolation (VFI) aims to synthesize realistic intermediate frames from preceding and following video frames. Although many VFI methods perform well on specific motion types, their versatility in handling both large and small motions remains limited. In this work, we propose ATM-VFI, a novel hybrid CNN-Transformer architecture that effectively combines the strengths of the CNN (efficiency and attention to fine detail) and the transformer (effective use of global information). It utilizes an Attention-to-Motion (ATM) module and adopts a dual-branch (local and global) mechanism to formulate motion estimation intuitively and to estimate global and local motion adaptively. Furthermore, we introduce a four-phase training procedure leveraging small-to-medium and large motion datasets to enhance versatility and training stability. Extensive experiments demonstrate that the proposed ATM-VFI algorithm outperforms state-of-the-art methods. It interpolates video frames well across a variety of motion types while maintaining high efficiency.
Date of Conference: 06-11 April 2025

Conference Location: Hyderabad, India

SECTION I.

Introduction

Video Frame Interpolation (VFI) synthesizes intermediate frames from preceding and following frames. It is critical in frame rate up-conversion, slow-motion generation [8], view synthesis [4], [38], and video compression [34].

Existing VFI methods can be categorized by both architecture (CNN, transformer, or hybrid) and methodology: (1) kernel-based methods [1], [2], [11], [13], [23]–[25], [31], which leverage local convolutions or attention mechanisms over patches and perform VFI in a single-stage procedure; (2) flow-based methods [14], [16], [21], [22], [26], [27], [30], [36], which typically follow a two-stage pipeline of motion estimation and synthesis refinement, where the intermediate frame is synthesized by forward warping [6], [22] or backward warping [35].

The demand for versatile VFI has grown significantly, particularly in applications like sports broadcasting and action videos where object motion varies dramatically. Recent works [3], [19] have highlighted the challenges when objects move at very high speeds, leading to a degradation of interpolation quality. This creates a critical need for VFI methods that can handle both subtle movements and extreme motion cases. Although many VFI methods achieve impressive results, they often excel in specific scenarios but struggle in others. Methods optimized for small-to-medium motions typically fail with large displacements, while those designed for large motions tend to compromise fine detail accuracy.

Fig. 1: Rank of the performance of the proposed ATM-VFI method compared to seven state-of-the-art methods [6], [7], [12], [14], [18], [30], [36] in terms of (a) the PSNR and (b) the SSIM values on eight benchmarks.

To address these limitations, we propose a versatile framework that effectively handles both large and small-to-medium motions. As shown in Figure 1, our method achieves excellent performance across eight widely used benchmarks, demonstrating unprecedented versatility.

Although CNNs can estimate large motions through a hierarchical structure that enlarges the receptive field, this approach risks error accumulation. Transformers excel at capturing long-range dependencies but require a high computational cost.

In this paper, we propose ATM-VFI, a hybrid CNN-transformer architecture. To tackle the challenging motion estimation problem, we apply the proposed ATMFormer to improve the motion estimation result. In addition, we employ cross-scale feature fusion to avoid losing information about small objects. Moreover, we separate the computation of large and small motions into two distinct branches so that each branch focuses solely on one motion type, allowing global motion estimation to be activated adaptively based on scene requirements.

Our contributions are summarized as follows:

  • We design a hybrid CNN-Transformer architecture that integrates the advantages of the CNN (efficiency and strong use of local information) and the transformer (strong use of global information).

  • We propose a dual-branch motion estimation approach. The local branch focuses on small-to-medium motions, while the global branch focuses on large motions. This design handles different types of motion well and optimizes each branch for its corresponding motion type.

  • We introduce a four-phase training procedure using two training sets: local motion pretraining is conducted first, followed by global motion pretraining, and the two datasets are then trained jointly to fine-tune the result. This procedure avoids overfitting and ensures training stability.

  • Our method achieves the best versatility when evaluated on benchmarks with different motion types.

SECTION II.

Method

The overall pipeline of our method is shown in figure 2. Given two frames $I_0$ and $I_1$, our goal is to generate an intermediate frame $I_t$, where $t \in (0, 1)$. We first extract and fuse multi-scale features from the inputs using a shared pyramid CNN encoder (section II-A). Then, the proposed ATMFormer (section II-B) is applied to adaptively estimate global and local motion. The extracted features and the estimated motion are then jointly processed for enhancement (section II-C). The estimated intermediate frame $\tilde I_t$ is obtained by
\begin{gather*}
\tilde I_t = \mathbf{M} \odot \overleftarrow{w}\left(I_0, \mathbf{F}_{t \to 0}\right) + (\mathbf{1} - \mathbf{M}) \odot \overleftarrow{w}\left(I_1, \mathbf{F}_{t \to 1}\right), \tag{1}\\
\mathbf{F}_{t \to i} = \uparrow\!\left(\mathbf{F}_{t \to i}^{\text{global}}\right) + \mathbf{F}_{t \to i}^{\text{local}}, \tag{2}
\end{gather*}
where $\overleftarrow{w}$ denotes backward warping, $\odot$ is the element-wise product, $\uparrow$ is bilinear up-sampling, $\mathbf{F}_{t \to i}$ is the optical flow toward the $i$th input frame, and $\mathbf{M} \in [0, 1]$ is the fusion mask that blends the warped frames based on occlusion and motion boundaries.

Finally, a residual $\Delta I_t$ produced by the RefineNet is added to $\tilde I_t$ to address occlusion and enhance sharpness:
\begin{equation*}
I_t = \tilde I_t + \Delta I_t. \tag{3}
\end{equation*}
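As a concrete illustration, the following PyTorch sketch implements Eqs. (1)–(3) under stated assumptions: backward_warp and refine_net are hypothetical stand-ins rather than the authors' implementation, the flow channel order is assumed to be (dx, dy), and the factor-of-two rescaling of flow magnitudes after up-sampling is a common convention that Eq. (2) leaves implicit.

```python
import torch
import torch.nn.functional as F


def backward_warp(img, flow):
    """Backward-warp `img` by sampling it at positions displaced by `flow` (B, 2, H, W)."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=img.device),
        torch.arange(w, device=img.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).float()          # (2, H, W), channel order (x, y)
    coords = base.unsqueeze(0) + flow                    # per-pixel sampling positions
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0              # normalize to [-1, 1] for grid_sample
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                 # (B, H, W, 2)
    return F.grid_sample(img, grid, mode="bilinear", align_corners=True)


def synthesize(i0, i1, f_t0_global, f_t1_global, f_t0_local, f_t1_local, mask, refine_net=None):
    # Eq. (2): compose the up-sampled global flow with the local flow
    # (the *2 magnitude rescaling is a common convention left implicit in Eq. (2)).
    up2 = lambda f: F.interpolate(f, scale_factor=2, mode="bilinear", align_corners=False) * 2.0
    f_t0 = up2(f_t0_global) + f_t0_local
    f_t1 = up2(f_t1_global) + f_t1_local
    # Eq. (1): blend the two backward-warped input frames with the fusion mask M.
    i_t = mask * backward_warp(i0, f_t0) + (1.0 - mask) * backward_warp(i1, f_t1)
    # Eq. (3): add the residual predicted by RefineNet (the input layout here is an assumption).
    if refine_net is not None:
        i_t = i_t + refine_net(torch.cat((i_t, i0, i1), dim=1))
    return i_t
```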

A. Multi-Scale Feature Extraction and Fusion

To extract multi-scale features effectively, we employ a pyramid CNN encoder built upon blocks of two 3 × 3 convolution layers with strides 2 and 1, respectively. The pyramid structure extracts five levels of features $\mathbf{X}_i^k$ for each input frame $I_i$, where $k \in [0, 4]$ denotes the pyramid level.
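A minimal PyTorch sketch of such a pyramid encoder is given below; the channel widths, the PReLU activation, and whether level 0 is already down-sampled are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class PyramidEncoder(nn.Module):
    """Five-level pyramid encoder; each level is a block of two 3x3 convolutions
    with strides 2 and 1 (channel widths and activations are illustrative)."""

    def __init__(self, channels=(3, 32, 48, 64, 96, 128)):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels[k], channels[k + 1], 3, stride=2, padding=1),
                nn.PReLU(channels[k + 1]),
                nn.Conv2d(channels[k + 1], channels[k + 1], 3, stride=1, padding=1),
                nn.PReLU(channels[k + 1]),
            )
            for k in range(5)
        )

    def forward(self, frame):
        feats, x = [], frame
        for block in self.blocks:          # pyramid levels k = 0, ..., 4
            x = block(x)
            feats.append(x)
        return feats                       # [X^0, X^1, X^2, X^3, X^4]
```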

After feature extraction, feature fusion is performed to capture cross-scale information. We propose a dual-branch cross-scale fusion strategy as follows (a minimal sketch is given after the list):

  • The local branch fuses the features $\mathbf{X}_i^{\{1,2,3\}}$ into $\mathbf{X}_i^{\text{local}}$. It focuses on capturing detailed local information and is particularly beneficial for refining motion estimation.

  • The global branch fuses the features $\mathbf{X}_i^{\{2,3,4\}}$ into $\mathbf{X}_i^{\text{global}}$. It incorporates deeper features with a larger receptive field, which is advantageous for capturing the motion of large or fast-moving objects.
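The following sketch shows one plausible realization of this cross-scale fusion; the resampling target (the finest level in each group) and the 1×1-convolution merge are assumptions for illustration, and the channel widths follow the hypothetical encoder above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossScaleFusion(nn.Module):
    """Fuse a group of pyramid levels by resizing them to the finest level in the
    group and merging with a 1x1 convolution (the fusion operator is an assumption)."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.proj = nn.Conv2d(sum(in_channels), out_channels, kernel_size=1)

    def forward(self, feats):
        target = feats[0].shape[-2:]       # spatial size of the finest level in the group
        resized = [feats[0]] + [
            F.interpolate(f, size=target, mode="bilinear", align_corners=False)
            for f in feats[1:]
        ]
        return self.proj(torch.cat(resized, dim=1))


# Hypothetical usage with the encoder sketched above:
#   feats = PyramidEncoder()(frame)                              # [X^0, ..., X^4]
#   x_local  = CrossScaleFusion((48, 64, 96), 96)(feats[1:4])    # levels {1, 2, 3}
#   x_global = CrossScaleFusion((64, 96, 128), 128)(feats[2:5])  # levels {2, 3, 4}
```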

Algorithm 1: Attention-to-Motion (ATM)

B. ATMFormer

The core of ATMFormer is our proposed Attention-to-Motion (ATM) module (summarized in algorithm 1). It processes the multi-scale features $\mathbf{X}_0$ and $\mathbf{X}_1$ and outputs the estimated optical flow $\mathbf{F}$ and the enhanced features $\mathbf{X}_0'$ and $\mathbf{X}_1'$.

The ATM module first concatenates $\mathbf{X}_0$ and $\mathbf{X}_1$ into $\mathbf{Y}_0$, with $\mathbf{Y}_1$ obtained by concatenation in the reverse order. Then, the query $\mathbf{Q} \in \mathbb{R}^{h \times (M \times M) \times d}$ is computed by a linear projection of $\mathbf{Y}_0$, where $M$, $h$, and $d$ denote the window size, the number of heads, and the vector dimension, respectively. $\mathbf{K}$ and $\mathbf{V} \in \mathbb{R}^{h \times (M \times M) \times d}$ are computed by two independent projections of $\mathbf{Y}_1$. The multi-head attention matrix $\mathbf{A} \in \mathbb{R}^{h \times (M \times M) \times (M \times M)}$ is calculated as
\begin{equation*}
\mathbf{A} = \mathrm{Softmax}\!\left(\mathbf{Q}\mathbf{K}^T / \sqrt{d}\right). \tag{4}
\end{equation*}

The vanilla cross-attention output is then $\mathbf{A}\mathbf{V}$.

The proposed ATMFormer differs fundamentally from previous transformer-based approaches in its motion estimation mechanism. While previous methods use transformers to model long-range dependencies and output feature maps or embeddings that require additional processing modules for motion estimation, our method explicitly converts the attention matrix $\mathbf{A}$ into inter-frame motion vectors through a direct dot product with the relative coordinates $\mathbf{R}$. As shown in figure 3b, for each entry $\mathbf{X}_{0,ij}$ in the local window, a relative coordinate map $\mathbf{R}_{ij} \in \mathbb{R}^{(M \times M) \times 2}$ is precomputed, where each position indicates the relative offset from $(i, j)$. The ATMFormer initially predicts two optical flows, $\mathbf{F}_{\text{shift}=0}$ and $\mathbf{F}_{\text{shift}=M/2}$, which requires no extra learnable parameters and makes it more computationally efficient.
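Below is a minimal, single-window PyTorch sketch of this attention-to-motion idea under simplifying assumptions (no window partitioning or shifting; class, tensor, and parameter names are illustrative): cross-attention between the concatenated frame features yields the attention matrix A, and a flow vector per query position is read off A by a dot product with the precomputed relative coordinates.

```python
import torch
import torch.nn as nn


class ATMWindow(nn.Module):
    """Cross-attention over one M x M window, returning both the enhanced tokens
    and a flow read directly off the attention matrix (no extra parameters)."""

    def __init__(self, dim, window=8, heads=4):
        super().__init__()
        assert dim % heads == 0
        self.h, self.d = heads, dim // heads
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        # rel[p, q] = offset (dy, dx) from window position p to position q
        # (the coordinate order is a convention assumed here).
        coords = torch.stack(
            torch.meshgrid(torch.arange(window), torch.arange(window), indexing="ij"),
            dim=-1,
        ).reshape(-1, 2).float()                              # (M*M, 2)
        self.register_buffer("rel", coords[None, :, :] - coords[:, None, :])

    def forward(self, y0, y1):
        # y0: tokens of [X0 || X1], y1: tokens of [X1 || X0], each (B, M*M, dim).
        b, n, _ = y0.shape
        q = self.to_q(y0).view(b, n, self.h, self.d).transpose(1, 2)
        k = self.to_k(y1).view(b, n, self.h, self.d).transpose(1, 2)
        v = self.to_v(y1).view(b, n, self.h, self.d).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d ** 0.5, dim=-1)   # Eq. (4)
        enhanced = (attn @ v).transpose(1, 2).reshape(b, n, -1)                 # vanilla output AV
        # Attention-to-motion: expected relative offset per query position,
        # averaged over the heads; interpreted as the flow within this window.
        flow = torch.einsum("bhpq,pqc->bpc", attn, self.rel) / self.h
        return flow, enhanced
```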

We leverage the ATMFormer to estimate the global inter-frame motion $\mathbf{F}^{\text{global}}$ using the global branch feature $\mathbf{X}_i^{\text{global}}$. Then, we apply a 2× bilinear up-sampling to $\mathbf{F}^{\text{global}}$ and use it to warp $\mathbf{X}_i^{\text{local}}$. Subsequently, $\mathbf{X}_i^{\text{local}}$ is fed to another ATMFormer to estimate the local inter-frame motion $\mathbf{F}^{\text{local}}$. Separating motion estimation into local and global branches has several benefits in our method (a minimal sketch of this cascade follows the list):

  1. Each branch acts as an independent expert that focuses solely on the motion type it is tasked with, thereby producing optimal results.

  2. If the inter-frame motion is known to be small beforehand, global motion estimation can be adaptively disabled to reduce the computational cost and avoid potential error propagation.
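A minimal sketch of this global-then-local cascade is given below; atmformer_global, atmformer_local, and backward_warp are hypothetical stand-ins (each ATMFormer is assumed to return a dense flow field), and the sign convention and flow rescaling used for pre-warping are assumptions.

```python
import torch.nn.functional as F


def estimate_motion(x0_global, x1_global, x0_local, x1_local,
                    atmformer_global, atmformer_local, backward_warp,
                    use_global=True):
    """Global-then-local motion estimation; the global branch can be skipped."""
    if use_global:
        # Coarse flow from the global-branch features (benefit 1: a dedicated expert).
        flow_global = atmformer_global(x0_global, x1_global)
        flow_up = F.interpolate(flow_global, scale_factor=2, mode="bilinear",
                                align_corners=False) * 2.0   # magnitude rescaling assumed
        # Pre-warp the local features so the local branch only resolves the
        # remaining small-to-medium motion (sign convention assumed).
        x0_local = backward_warp(x0_local, flow_up)
        x1_local = backward_warp(x1_local, -flow_up)
    else:
        flow_global = None   # benefit 2: skip global estimation for small motions
    flow_local = atmformer_local(x0_local, x1_local)
    return flow_global, flow_local
```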

Fig. 2: Overview of the proposed ATM-VFI architecture. (Input examples are from SNU-FILM [2].)

Fig. 3: Illustration of the ATMFormer architecture.

C. Joint Enhancement for Feature and Motion

We concatenate the outputs of the ATMFormer, $\mathbf{X}_0'$ and $\mathbf{X}_1'$, and perform feature enhancement using the self-attention Swin transformer [15]. We warp the enhanced features based on the estimated bidirectional flows $\mathbf{F}_{t \to 0}^{\text{local}}$ and $\mathbf{F}_{t \to 1}^{\text{local}}$. We then concatenate $[\mathbf{X}_0', \mathbf{X}_1', \mathbf{F}_{t \to 1}^{\text{local}}, \mathbf{F}_{t \to 0}^{\text{local}}, \mathbf{M}]$ and up-sample the concatenated tensor using a pyramid CNN decoder.

In contrast to most coarse-to-fine methods ([7], [14], [26], [30], [32]) that predict residuals to refine the flow from the coarser level, our method jointly performs up-sampling and refinement. The optical flow $\mathbf{F}^l$ and the fusion mask $\mathbf{M}^l$ at level $l$ (a larger $l$ indicates a coarser level) are supervised using the warping loss (section II-D) to mitigate error propagation for the final output flow $\mathbf{F}$.
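The sketch below illustrates one possible form of such a joint up-sampling-and-refinement level; the channel layout (a bidirectional 4-channel flow plus a 1-channel mask kept as logits) and the module structure are assumptions, not the paper's exact decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointUpsampleRefine(nn.Module):
    """One decoder level: up-sample features, bidirectional flow (4 ch), and mask
    logits (1 ch) together, then refine all of them with a single conv block."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch + 5, out_ch, 3, padding=1),
            nn.PReLU(out_ch),
            nn.Conv2d(out_ch, out_ch + 5, 3, padding=1),
        )

    def forward(self, feat, flow, mask_logit):
        up = lambda t, s=1.0: F.interpolate(t, scale_factor=2, mode="bilinear",
                                            align_corners=False) * s
        feat, flow, mask_logit = up(feat), up(flow, 2.0), up(mask_logit)
        out = self.block(torch.cat((feat, flow, mask_logit), dim=1))
        feat, d_flow, d_mask = torch.split(out, (out.shape[1] - 5, 4, 1), dim=1)
        # The refined flow F^l and mask M^l at this level can be supervised by
        # the warping loss of section II-D before being passed to the next level.
        return feat, flow + d_flow, mask_logit + d_mask
```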

D. Loss Functions

To supervise the final synthesized frame, we adopt multi-scale Laplacian functions to define the reconstruction loss:
\begin{equation*}
\mathcal{L}_{\text{rec}} = \mathcal{L}_{\text{lap}}\left(\hat I_t, I_t\right) = \sum_k \left\| L^k\left(\hat I_t\right) - L^k\left(I_t\right) \right\|_1, \tag{5}
\end{equation*}
where $L^k$ denotes the $k$th layer of the Laplacian pyramid, and $\hat I_t$ and $I_t$ are the predicted and ground-truth intermediate frames, respectively. To supervise the quality of the estimated flows and the fusion mask, we also employ the warping loss:
\begin{equation*}
\mathcal{L}_{\text{warp}} = \sum_l \mathcal{L}_{\text{lap}}\left(\hat I_t^l, I_t^l\right), \tag{6}
\end{equation*}
where $\hat I_t^l$ is the intermediate frame obtained at the $l$th level of the pyramid CNN. For ATM-VFI-pct, we also adopt perceptual losses, including the VGG loss [10] and the style loss [5]:
\begin{align*}
\mathcal{L}_{\text{VGG}} &= \frac{1}{L}\sum_{l}^{L} \left\| \phi_l\left(\hat I_t\right) - \phi_l\left(I_t^{GT}\right) \right\|_1, \tag{7}\\
\mathcal{L}_{\text{style}} &= \frac{1}{L}\sum_{l}^{L} \left\| M\left(\hat I_t\right) - M\left(I_t^{GT}\right) \right\|_2, \tag{8}
\end{align*}
where $\phi_l$ is the feature from the $l$th VGG layer and $M(\cdot)$ is the auto-correlation (Gram) function of $\phi_l$. The full objective function is
\begin{equation*}
\mathcal{L} = \mathcal{L}_{\text{rec}} + \lambda_{\text{warp}} \mathcal{L}_{\text{warp}} + \lambda_{\text{VGG}} \mathcal{L}_{\text{VGG}} + \lambda_{\text{style}} \mathcal{L}_{\text{style}}. \tag{9}
\end{equation*}

We set $\lambda_{\text{warp}}$ to 0.25, and set $\lambda_{\text{VGG}}$ and $\lambda_{\text{style}}$ to 0.05 and $5 \times 10^{-9}$, respectively, when fine-tuning ATM-VFI-pct, so as to balance the loss terms.
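For reference, a minimal PyTorch sketch of the multi-scale Laplacian loss in Eq. (5) is given below; the pyramid depth and the 5×5 binomial blur kernel are common choices and not necessarily those used in the paper.

```python
import torch
import torch.nn.functional as F


def gaussian_blur(x):
    # 5x5 binomial (approximately Gaussian) kernel applied per channel.
    k1d = torch.tensor([1.0, 4.0, 6.0, 4.0, 1.0], device=x.device, dtype=x.dtype)
    kernel = (k1d[:, None] * k1d[None, :]) / 256.0
    kernel = kernel.expand(x.shape[1], 1, 5, 5).contiguous()
    return F.conv2d(x, kernel, padding=2, groups=x.shape[1])


def laplacian_pyramid(img, levels=5):
    pyr, current = [], img
    for _ in range(levels - 1):
        down = F.avg_pool2d(gaussian_blur(current), 2)
        up = F.interpolate(down, size=current.shape[-2:], mode="bilinear",
                           align_corners=False)
        pyr.append(current - up)   # band-pass detail at this scale
        current = down
    pyr.append(current)            # low-frequency residual
    return pyr


def laplacian_loss(pred, gt, levels=5):
    # Eq. (5): sum of L1 distances between corresponding pyramid levels.
    return sum(
        F.l1_loss(p, g)
        for p, g in zip(laplacian_pyramid(pred, levels), laplacian_pyramid(gt, levels))
    )
```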

SECTION III.

Experiments

Our models are trained on the Vimeo90K [35] training set and the X-TRAIN [32] dataset. We evaluate our method on various benchmarks with different resolutions and motion styles, including (1) Vimeo90K [35], resolution: 448×256. (2) X-TRAIN [32], resolution: 768×768. (3) UCF101 [33], resolution: 256×256. (4) SNU-FILM [2], resolution: 1280×720. (5) Xiph [20]: a 4K resolution dataset with 8 clips.

Regarding the training details, we first warm up our model for 2k steps and train it with an AdamW [17] optimizer. All training data are augmented with random flipping, rotation, and time reversal. The training process has four phases (a minimal schedule sketch is given after the list):

  1. Local Motion Pretraining. We first disable global motion estimation and train our model on Vimeo90K (randomly cropped to 256×256 patches).

  2. Global Motion Pretraining. We then freeze the network parameters except for the components related to global motion estimation (the yellow part of figure 2). These unfrozen components are trained on X-TRAIN (randomly cropped to 448×448 patches).

  3. Joint Motion Training. Then, we train the entire network on Vimeo90K during odd-numbered epochs and on X-TRAIN during even-numbered epochs.

  4. Fine-tuning. In this phase, we apply the residual refinement module to fine-tune the results.
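The following sketch outlines this four-phase schedule; the attribute names (use_global, global_branch, use_refinement), the epoch counts, and the dataset used in phase 4 are hypothetical placeholders.

```python
def run_training(model, vimeo_loader, xtrain_loader, optimizer, train_one_epoch,
                 epochs=(100, 40, 60, 30)):
    # Phase 1: local motion pretraining on Vimeo90K, global branch disabled.
    model.use_global = False
    for _ in range(epochs[0]):
        train_one_epoch(model, vimeo_loader, optimizer)

    # Phase 2: global motion pretraining on X-TRAIN; only the parameters of the
    # global motion components are left trainable.
    model.use_global = True
    for p in model.parameters():
        p.requires_grad = False
    for p in model.global_branch.parameters():
        p.requires_grad = True
    for _ in range(epochs[1]):
        train_one_epoch(model, xtrain_loader, optimizer)

    # Phase 3: joint motion training, alternating the two datasets every epoch.
    for p in model.parameters():
        p.requires_grad = True
    for epoch in range(epochs[2]):
        loader = vimeo_loader if epoch % 2 == 0 else xtrain_loader
        train_one_epoch(model, loader, optimizer)

    # Phase 4: fine-tuning with the residual refinement module enabled
    # (the dataset used in this phase is a placeholder here).
    model.use_refinement = True
    for _ in range(epochs[3]):
        train_one_epoch(model, vimeo_loader, optimizer)
```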

TABLE I: Quantitative comparison (PSNR/SSIM) of the VFI results on different benchmarks. The best and the second-best results are colored in red and blue, respectively. OOM denotes an out-of-memory issue when evaluating on an NVIDIA RTX 3090 GPU.

We evaluate the proposed method against state-of-the-art methods in table I. For fairness, we disable test-time augmentation when testing all methods using their public source code. Furthermore, we group these methods into two categories according to their computational complexity.

To evaluate the performance under varying computational budgets, we provide three configurations: (i) ATM-VFI, the base network; (ii) ATM-VFI-lite, a lightweight version with 75% fewer parameters; and (iii) ATM-VFI-pct, the base network fine-tuned with the perceptual loss.

Quantitative Comparison: As reported in table I and figure 1, ATM-VFI achieves the best performance on most of the evaluated benchmarks. This performance is consistent across various motion types within the interpolated frames, demonstrating the versatility of the proposed method.

The effectiveness of our method under constrained computational resources is also assessed. As shown in table I, ATM-VFI-lite maintains high performance against other lightweight methods and achieves nearly the best scores across all benchmarks, verifying the robustness of the proposed architecture to limited computational resources.

Qualitative Comparison: We present a visual comparison between ATM-VFI-pct and state-of-the-art methods in figure 4. In cases of large and complex motion, our approach accurately estimates the intermediate frames, whereas other methods may produce blurred results and artifacts. The proposed method achieves superior performance both for the small-to-medium motion scenarios in the Vimeo90K [35] dataset and for the fast-moving object scenarios in the DAVIS-480p [29] dataset.

SECTION IV.

Conclusion

In this paper, we proposed ATM-VFI, a novel hybrid CNN-transformer architecture for video frame interpolation. Our key contributions include a novel ATM module for robust motion estimation, a dual-branch design specialized in global and local motion estimation, and a four-phase training procedure leveraging both small-to-medium and large motion datasets to improve versatility. Experiments demonstrated that ATM-VFI outperforms state-of-the-art methods on eight benchmarks and is robust to diverse resolutions and motion types.

Fig. 4: Qualitative comparison of the ground truth (GT) and the results of RIFE [7], IFRNet [12], VFIformer [18], EMA-VFI [36], SGM-VFI [14], and the proposed ATM-VFI on (a) Vimeo90K [35] and (b) DAVIS-480p [29].
