Introduction
Video Frame Interpolation (VFI) aims to synthesize intermediate frames from the preceding and following frames. It is critical in frame rate up-conversion, slow-motion generation [8], view synthesis [4], [38], and video compression [34].
Existing VFI methods can be categorized by both architecture (CNN, transformer, or hybrid) and methodology: (1) kernel-based methods [1], [2], [11], [13], [23]-[25], [31], which apply local convolutions or attention mechanisms over patches and perform VFI in a single-stage procedure; (2) flow-based methods [14], [16], [21], [22], [26], [27], [30], [36], which typically follow a two-stage pipeline of motion estimation and synthesis refinement, where the intermediate frame is synthesized by forward warping [6], [22] or backward warping [35].
The demand for versatile VFI has grown significantly, particularly in applications like sports broadcasting and action videos where object motion varies dramatically. Recent works [3], [19] have highlighted the challenges when objects move at very high speeds, leading to a degradation of interpolation quality. This creates a critical need for VFI methods that can handle both subtle movements and extreme motion cases. Although many VFI methods achieve impressive results, they often excel in specific scenarios but struggle in others. Methods optimized for small-to-medium motions typically fail with large displacements, while those designed for large motions tend to compromise fine detail accuracy.
To address these limitations, we propose a versatile framework that effectively handles both large and small-to-medium motions. As shown in Figure 1, our method achieves excellent performance across eight widely used benchmarks, demonstrating unprecedented versatility.
Although CNNs can estimate large motions through a hierarchical structure that enlarges the receptive field, this approach risks error accumulation. Transformers excel at capturing long-range dependencies but require a high computational cost.
In this paper, we propose ATM-VFI, a hybrid CNN-transformer architecture. To tackle the challenging motion estimation problem, the proposed ATMFormer is applied to improve the motion estimation results. In addition, we employ cross-scale feature fusion to avoid losing information about small objects. Moreover, we separate the estimation of large and small motions into two distinct branches, each focusing solely on one motion type, which allows global motion estimation to be adaptively activated according to scene requirements.
Our contributions are summarized as follows:
We design a hybrid CNN-transformer architecture that integrates the advantages of CNNs (efficiency and strong modeling of local information) and transformers (strong modeling of global information).
We propose a dual-branch motion estimation approach. The local branch focuses on small-to-medium motions, while the global branch focuses on large motions. This design handles different types of motion well and optimizes each branch for its corresponding motion type.
We introduce a four-phase training procedure using two training sets: local motion pretraining, global motion pretraining, joint training on both datasets, and a final fine-tuning stage. This procedure avoids overfitting and ensures training stability.
Our method achieves the best versatility when evaluated on benchmarks covering different motion types.
Method
The overall pipeline of our method is shown in figure 2. Given two frames I0 and I1, our goal is to generate an intermediate frame It, where t is within (0, 1). We first extract and fuse multi-scale features from the inputs using a shared pyramid CNN encoder (section II-A). Then, the proposed ATMFormer (section II-B) is applied to adaptively estimate global and local motion. The extracted features and the estimated motion are then jointly processed for enhancement (section II-C). The estimated intermediate frame ${\tilde I_t}$ is synthesized as \begin{gather*} {\tilde I_t} = {\mathbf{M}} \odot \overleftarrow{w} \left( {I_0},{{\mathbf{F}}_{t \to 0}} \right) + ({\mathbf{1}} - {\mathbf{M}}) \odot \overleftarrow{w} \left( {I_1},{{\mathbf{F}}_{t \to 1}} \right),\tag{1} \\ {{\mathbf{F}}_{t \to i}} = \uparrow \left( {\mathbf{F}}_{t \to i}^{global} \right) + {\mathbf{F}}_{t \to i}^{local},\tag{2}\end{gather*} where ${\mathbf{M}}$ is the fusion mask, $\odot$ denotes element-wise multiplication, $\overleftarrow{w}(\cdot)$ denotes backward warping, and $\uparrow(\cdot)$ denotes up-sampling.
Finally, a residual ΔIt predicted by the RefineNet is added to obtain the final frame: \begin{equation*}{I_t} = {\tilde I_t} + \Delta {I_t}.\tag{3}\end{equation*}
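To make the synthesis in Eqs. (1)-(3) concrete, the following is a minimal PyTorch sketch (not the authors' implementation): backward warping via `grid_sample`, mask-based blending, and residual addition. The tensor layout and the helper names `backward_warp` and `synthesize` are our own assumptions.

```python
import torch
import torch.nn.functional as F

def backward_warp(img, flow):
    """Backward-warp `img` (B,C,H,W) with a flow field (B,2,H,W) given in pixels."""
    B, _, H, W = img.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=img.device),
                            torch.arange(W, device=img.device), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float()              # (2,H,W) pixel grid
    coords = base.unsqueeze(0) + flow                        # sampling positions
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    gx = 2.0 * coords[:, 0] / (W - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                     # (B,H,W,2)
    return F.grid_sample(img, grid, mode="bilinear", align_corners=True)

def synthesize(I0, I1, F_t0, F_t1, mask, residual):
    """Eqs. (1) and (3): masked blend of the two warped frames plus a residual."""
    I_tilde = mask * backward_warp(I0, F_t0) + (1.0 - mask) * backward_warp(I1, F_t1)
    return I_tilde + residual
```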
A. Multi-Scale Feature Extraction and Fusion
To extract multi-scale features effectively, we employ a pyramid CNN encoder built from blocks of two 3 × 3 convolution layers with strides 2 and 1, respectively. The pyramid structure extracts five levels of features.
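As an illustrative sketch of this encoder, with two 3×3 convolutions per block (strides 2 and 1) and five levels; the channel widths and the PReLU activations are assumptions, since the paper does not specify them.

```python
import torch.nn as nn

class PyramidEncoder(nn.Module):
    """Shared pyramid encoder: each block halves the resolution (stride 2)
    and refines the features (stride 1); five levels are returned."""
    def __init__(self, in_ch=3, chs=(32, 48, 64, 96, 128)):  # channel widths are assumed
        super().__init__()
        self.blocks = nn.ModuleList()
        prev = in_ch
        for c in chs:
            self.blocks.append(nn.Sequential(
                nn.Conv2d(prev, c, 3, stride=2, padding=1), nn.PReLU(c),
                nn.Conv2d(c, c, 3, stride=1, padding=1), nn.PReLU(c),
            ))
            prev = c

    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)          # X^0 (finest) ... X^4 (coarsest)
        return feats
```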
After feature extraction, feature fusion is performed to capture cross-scale information. We propose a dual-branch cross-scale fusion strategy as follows:
The local branch fuses features ${\mathbf{X}}_i^{\{1,2,3\}}$ into ${\mathbf{X}}_i^{\text{local}}$. It focuses on capturing detailed local information and is particularly beneficial for refining motion estimation.
The global branch fuses features ${\mathbf{X}}_i^{\{2,3,4\}}$ into ${\mathbf{X}}_i^{\text{global}}$. It incorporates deeper features with a larger receptive field, which is advantageous for capturing the motion of large or fast-moving objects. A minimal sketch of this fusion is given below.
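One plausible realization of this cross-scale fusion, sketched here: the selected pyramid levels are resized to a common resolution, concatenated, and projected by a 1×1 convolution. The target resolution, the 1×1 projection, and the channel widths are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleFusion(nn.Module):
    """Fuse a subset of pyramid levels into a single feature map."""
    def __init__(self, in_chs, out_ch):
        super().__init__()
        self.proj = nn.Conv2d(sum(in_chs), out_ch, 1)

    def forward(self, feats):
        # Resize every selected level to the resolution of the first one.
        h, w = feats[0].shape[-2:]
        resized = [feats[0]] + [
            F.interpolate(f, size=(h, w), mode="bilinear", align_corners=False)
            for f in feats[1:]
        ]
        return self.proj(torch.cat(resized, dim=1))

# Usage with the five pyramid levels X^0..X^4 (channels as in the encoder sketch):
# local_fusion  = CrossScaleFusion((48, 64, 96), 64)    # fuses X^{1,2,3}
# global_fusion = CrossScaleFusion((64, 96, 128), 64)   # fuses X^{2,3,4}
# X_local  = local_fusion([X1, X2, X3])
# X_global = global_fusion([X2, X3, X4])
```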
B. ATMFormer
The core of ATMFormer is our proposed Attention-to-Motion (ATM) module (summarized in algorithm 1). It processes the multi-scale features X0 and X1 and outputs the estimated optical flow F together with enhanced features.
The ATM module first concatenates X0 and X1 into Y0, with Y1 obtained by concatenation in the reverse order. The query Q, key K, and value V are derived from Y0 and Y1, and the attention matrix is computed as \begin{equation*}{\mathbf{A}} = Softmax\left( {{\mathbf{Q}}{{\mathbf{K}}^T}/\sqrt d } \right),\tag{4}\end{equation*} while the vanilla cross-attention output is A·V.
The proposed ATMFormer differs fundamentally from previous transformer-based approaches in its motion estimation mechanism. While previous methods used transformers to model long-range dependencies and output feature maps or embeddings that require additional processing modules for motion estimation, our method explicitly converts the attention matrix A into inter-frame motion vectors through a direct dot product with the relative coordinates R. As in figure 3b, each entry X0,ij within the local window is assigned a relative coordinate, and the corresponding motion vector is obtained as the attention-weighted sum of these coordinates (i.e., A·R).
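A toy, single-window sketch of the Attention-to-Motion idea under these definitions: the cross-attention matrix of Eq. (4) is multiplied by the relative coordinates R to yield a motion vector, and by V to yield the usual cross-attention features. The window size, the linear projections, and the class interface are assumptions.

```python
import torch
import torch.nn as nn

class AttentionToMotion(nn.Module):
    """Single-window illustration: the attention matrix A is converted into a
    flow vector via A @ R, alongside the vanilla cross-attention output A @ V."""
    def __init__(self, dim, window=7):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5
        # Relative coordinates R of every position in a window x window
        # neighborhood w.r.t. the window center, shape (window*window, 2).
        ys, xs = torch.meshgrid(torch.arange(window), torch.arange(window), indexing="ij")
        rel = torch.stack((xs, ys), dim=-1).float() - (window - 1) / 2
        self.register_buffer("R", rel.reshape(-1, 2))

    def forward(self, y0, y1):
        # y0: (B, N, dim) tokens of the query frame inside one window
        # y1: (B, M, dim) tokens of the other frame, M = window*window
        A = torch.softmax(self.q(y0) @ self.k(y1).transpose(1, 2) * self.scale, dim=-1)  # Eq. (4)
        flow = A @ self.R            # (B, N, 2): expected displacement, i.e., A . R
        feat = A @ self.v(y1)        # vanilla cross-attention output, A . V
        return flow, feat
```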
We leverage the ATMFormer to estimate the global inter-frame motion Fglobal using the global-branch features Xglobal; likewise, the local motion Flocal is estimated from the local-branch features Xlocal.
Each branch acts as an independent expert, focusing solely on the motion type it is tasked with, thereby producing optimal results.
If the inter-frame motion is known to be small beforehand, global motion estimation can be adaptively disabled to reduce computational cost and avoid potential error propagation.
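A small sketch of the adaptive combination in Eq. (2): the global flow, when available, is up-sampled before being added to the local flow, and passing `None` models disabling the global branch. The bilinear up-sampling and the magnitude rescaling (standard practice, not stated explicitly in Eq. (2)) are assumptions.

```python
import torch
import torch.nn.functional as F

def combine_flows(flow_local, flow_global=None):
    """Eq. (2): F_{t->i} = upsample(F^global) + F^local.
    flow_global=None models adaptively disabling the global branch."""
    if flow_global is None:
        return flow_local
    up = F.interpolate(flow_global, size=flow_local.shape[-2:],
                       mode="bilinear", align_corners=False)
    # Flow magnitudes must be rescaled when the spatial resolution changes.
    scale_w = flow_local.shape[-1] / flow_global.shape[-1]
    scale_h = flow_local.shape[-2] / flow_global.shape[-2]
    up = torch.stack((up[:, 0] * scale_w, up[:, 1] * scale_h), dim=1)
    return flow_local + up
```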
Figure 2: Overview of the proposed ATM-VFI architecture (input examples are from SNU-FILM [2]).
C. Joint Enhancement for Feature and Motion
We concatenate the outputs of the ATMFormer, i.e., the estimated motion and the enhanced features, as the input to this stage.
In contrast to most coarse-to-fine methods [7], [14], [26], [30], [32], which predict residuals to refine the flow from the coarser level, our method jointly performs up-sampling and refinement. The optical flow Fl and the fusion mask Ml at level l (a larger l indicates a coarser level) are supervised using the warping loss (section II-D) to mitigate error propagation in the final output flow F.
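A hedged sketch of one such level: the coarser flow and mask are up-sampled, concatenated with the finer-level features, and mapped directly to the finer-level flow and mask (rather than to a residual only). The channel sizes, the sigmoid on the mask, the 2× inter-level scale, and the single convolutional block are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointUpsampleRefine(nn.Module):
    """One level of joint up-sampling and refinement. The input flow holds the
    concatenated flows to both frames (4 channels); the mask has 1 channel."""
    def __init__(self, feat_ch, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4 + 1 + 2 * feat_ch, hidden, 3, padding=1), nn.PReLU(hidden),
            nn.Conv2d(hidden, 4 + 1, 3, padding=1),
        )

    def forward(self, flow, mask, feat0, feat1):
        h, w = feat0.shape[-2:]
        # Up-sample and rescale the coarser flow (assuming a 2x gap between levels).
        flow_up = 2.0 * F.interpolate(flow, size=(h, w), mode="bilinear",
                                      align_corners=False)
        mask_up = F.interpolate(mask, size=(h, w), mode="bilinear", align_corners=False)
        out = self.net(torch.cat((flow_up, mask_up, feat0, feat1), dim=1))
        flow_l, mask_l = out[:, :4], torch.sigmoid(out[:, 4:5])
        return flow_l, mask_l   # each (F^l, M^l) is also supervised by the warping loss
```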
D. Loss Functions
To supervise the final synthesized frame, we adopt a multi-scale Laplacian loss as the reconstruction loss:
\begin{equation*}{\mathcal{L}_{{\text{rec}}}} = {\mathcal{L}_{{\text{lap}}}}\left( {{{\hat I}_t},{I_t}} \right) = \sum\limits_k {{{\left\| {{L^k}\left( {{{\hat I}_t}} \right) - {L^k}\left( {{I_t}} \right)} \right\|}_1}} ,\tag{5}\end{equation*}
where $L^k(\cdot)$ denotes the $k$-th level of the Laplacian pyramid.
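A minimal sketch of this reconstruction loss, assuming a 5-level Laplacian pyramid built with a fixed binomial blur kernel (the pyramid depth and kernel are not specified in the paper):

```python
import torch
import torch.nn.functional as F

def gaussian_blur(x):
    """Blur each channel with a fixed 5x5 binomial kernel (depthwise conv)."""
    k1d = torch.tensor([1., 4., 6., 4., 1.], device=x.device)
    k2d = torch.outer(k1d, k1d)
    k2d = (k2d / k2d.sum()).expand(x.shape[1], 1, 5, 5)
    return F.conv2d(x, k2d, padding=2, groups=x.shape[1])

def laplacian_pyramid(x, levels=5):
    pyr = []
    for _ in range(levels - 1):
        blurred = gaussian_blur(x)
        down = F.avg_pool2d(blurred, 2)
        up = F.interpolate(down, size=x.shape[-2:], mode="bilinear", align_corners=False)
        pyr.append(x - up)          # band-pass residual L^k
        x = down
    pyr.append(x)                   # lowest-frequency level
    return pyr

def laplacian_loss(pred, target, levels=5):
    """Eq. (5): sum over levels of the (averaged) L1 distance between pyramids."""
    return sum((p - t).abs().mean()
               for p, t in zip(laplacian_pyramid(pred, levels),
                               laplacian_pyramid(target, levels)))
```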
To supervise the intermediate flows and masks at each level (section II-C), the warping loss accumulates the same Laplacian loss over all levels:
\begin{equation*}{\mathcal{L}_{{\text{warp}}}} = \sum\limits_l {{\mathcal{L}_{{\text{lap}}}}} \left( {\hat I_t^l,I_t^l} \right),\tag{6}\end{equation*}
where $\hat I_t^l$ denotes the frame synthesized at level $l$.
For fine-tuning the perceptual variant (ATM-VFI-pct), we further adopt a VGG perceptual loss and a style loss:
\begin{align*} & {\mathcal{L}_{{\text{VGG}}}} = \sum\limits_{l=1}^L {{{\left\| {{\phi _l}\left( {{{\hat I}_t}} \right) - {\phi _l}\left( {I_t^{GT}} \right)} \right\|}_1}} /L,\tag{7} \\ & {\mathcal{L}_{{\text{style}}}} = \sum\limits_{l=1}^L {{{\left\| {M\left( {{{\hat I}_t}} \right) - M\left( {I_t^{GT}} \right)} \right\|}_2}} /L,\tag{8}\end{align*}
where $\phi_l(\cdot)$ denotes the $l$-th of the $L$ selected VGG feature maps.
The overall loss is a weighted combination of these terms:
\begin{equation*}\mathcal{L} = {\mathcal{L}_{{\text{rec}}}} + {\lambda _{{\text{warp}}}}{\mathcal{L}_{{\text{warp}}}} + {\lambda _{{\text{VGG}}}}{\mathcal{L}_{{\text{VGG}}}} + {\lambda _{{\text{style}}}}{\mathcal{L}_{{\text{style}}}}.\tag{9}\end{equation*}
We set λwarp to 0.25; for fine-tuning ATM-VFI-pct, λVGG and λstyle are set to 0.05 and $5\times10^{-9}$, respectively, to balance the loss terms.
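For completeness, Eq. (9) with the reported weights can be written as the small helper below; the individual loss terms are assumed to be computed elsewhere (e.g., by a routine such as the `laplacian_loss` sketch above).

```python
def total_loss(l_rec, l_warp, l_vgg=0.0, l_style=0.0,
               lam_warp=0.25, lam_vgg=0.05, lam_style=5e-9):
    """Eq. (9). The perceptual terms (l_vgg, l_style) are only active when
    fine-tuning the ATM-VFI-pct variant; otherwise they default to zero."""
    return l_rec + lam_warp * l_warp + lam_vgg * l_vgg + lam_style * l_style
```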
Experiments
Our models are trained on the Vimeo90K [35] training set and the X-TRAIN [32] dataset. We evaluate our method on various benchmarks with different resolutions and motion styles, including (1) Vimeo90K [35], resolution: 448×256. (2) X-TRAIN [32], resolution: 768×768. (3) UCF101 [33], resolution: 256×256. (4) SNU-FILM [2], resolution: 1280×720. (5) Xiph [20]: a 4K resolution dataset with 8 clips.
Regarding the training details, we first warm up our model for 2k steps and train it with the AdamW [17] optimizer. All training data are augmented with random flipping, rotation, and time reversal. The training process has four phases (a schematic sketch follows the list):
Local Motion Pretraining. We first deactivate global motion estimation and train our model on Vimeo90K (randomly cropped to 256×256 patches).
Global Motion Pretraining. We then freeze the network parameters except for the components related to global motion estimation (the yellow part of figure 2) and train the unfrozen components on X-TRAIN (randomly cropped to 448×448 patches).
Joint Motion Training. Then, we train the entire network on Vimeo90K during odd-numbered epochs and on X-TRAIN during even-numbered epochs.
Fine-tuning. In this phase, we apply the residual refinement module to fine-tune the results.
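The schematic sketch below shows how these four phases could be orchestrated. The epoch counts, the loader names, the fine-tuning data, and the attributes `use_global`, `use_refine`, and `global_motion_parameters()` are our own placeholders; the paper only specifies the phase order, the frozen components, and the odd/even dataset alternation.

```python
def four_phase_training(model, vimeo_loader, xtrain_loader, train_one_epoch,
                        epochs=(100, 50, 100, 30)):  # epoch counts are assumptions
    def set_requires_grad(params, flag):
        for p in params:
            p.requires_grad = flag

    # Phase 1: local motion pretraining with the global branch disabled.
    model.use_global = False
    for _ in range(epochs[0]):
        train_one_epoch(model, vimeo_loader)

    # Phase 2: train only the global-motion components on X-TRAIN.
    model.use_global = True
    set_requires_grad(model.parameters(), False)
    set_requires_grad(model.global_motion_parameters(), True)  # assumed helper
    for _ in range(epochs[1]):
        train_one_epoch(model, xtrain_loader)

    # Phase 3: joint training, alternating datasets by epoch parity.
    set_requires_grad(model.parameters(), True)
    for epoch in range(epochs[2]):
        loader = vimeo_loader if epoch % 2 == 1 else xtrain_loader
        train_one_epoch(model, loader)

    # Phase 4: fine-tune with the residual refinement module (Eq. (3)) enabled.
    model.use_refine = True
    for _ in range(epochs[3]):
        train_one_epoch(model, vimeo_loader)
```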
We evaluate the proposed method against state-of-the-art methods in table I. For fairness, we disable test-time augmentation when testing all methods using their public source code. Furthermore, we group these methods into two categories according to their computational complexity.
To evaluate the performance under varying computational budgets, we provide three configurations: (i) ATM-VFI, the base network; (ii) ATM-VFI-lite, a lightweight version with 75% fewer parameters; and (iii) ATM-VFI-pct, the base network fine-tuned with the perceptual loss.
Quantitative Comparison: As reported in table I and figure 1, ATM-VFI achieves the best performance on most of the evaluated benchmarks. This performance is consistent across various motion types within the interpolated frames, demonstrating the versatility of the proposed method.
The effectiveness of our method under constrained computational resources is also assessed. As shown in table I, ATM-VFI-lite performs strongly against other lightweight methods and achieves nearly the best scores across all benchmarks, verifying the robustness of the proposed architecture under limited computational resources.
Qualitative Comparison: We present a visual comparison between ATM-VFI-pct and state-of-the-art methods in figure 4. In cases of large and complex motion, our approach accurately estimates the intermediate frames, whereas other methods may produce blurred results and artifacts. The proposed method achieves superior performance both for small-to-medium motion scenarios in the Vimeo90K [35] dataset and for fast-moving object scenarios in the DAVIS-480p [29] dataset.
Conclusion
In this paper, we proposed ATM-VFI, a novel hybrid CNN-transformer architecture for video frame interpolation. Our key contributions include the ATM module for robust motion estimation, a dual-branch design specialized for global and local motion estimation, and a four-phase training procedure that leverages both small-to-medium and large motion datasets to improve versatility. Experiments demonstrated that ATM-VFI outperforms state-of-the-art methods on eight benchmarks and is robust to diverse resolutions and motion types.