Introduction
Video frame interpolation is a fundamental problem in computer vision that aims to synthesize plausible intermediate frames between any two adjacent frames. It is challenging because it requires a generative framework that can model motion information and generate frames that are both spatially and temporally consistent. Video frame interpolation is used in numerous applications, such as video frame rate conversion [1], slow-motion video generation [2] and video compression [3]. It helps surmount the temporal constraints of camera sensors and has drawn increasing attention in the image and video processing community.
Video frame interpolation is complicated for three reasons: (1) The variability of real-world scenes makes modeling video sequences considerably difficult. (2) Large motions commonly appear in videos, which makes capturing temporal and spatial consistency more troublesome. (3) A video interpolation framework may fail to generate realistic frames with accurate content when local occlusions occur in the video.
Most existing video interpolation methods use convolutional neural networks (CNNs). These methods can be divided into two categories: those based on estimating interpolation convolution kernels [4], [5] and those based on optical flow estimation [2], [6], [7]. The first technique combines motion estimation and pixel synthesis into a single process, producing a convolution kernel for each pixel. It can capture both the motion information between the original consecutive frames and the coefficients for pixel synthesis via a convolutional neural network. The other technique first estimates optical flow between the input frames and then interpolates intermediate frames from the estimated dense correspondences; the accuracy of optical flow estimation directly affects the quality of the generated frame. However, both kinds of CNN-based techniques synthesize intermediate frames indirectly. On the one hand, they produce unrealistic and blurry results if the intermediate stage fails: the estimated optical flow or kernels are commonly inaccurate in motion regions affected by occlusion, blur or sudden illumination change. On the other hand, computing optical flow or a kernel for each pixel requires considerable time and memory, which significantly increases the difficulty of video frame interpolation. Recently, generative adversarial networks (GANs) have achieved great success in image and video generation [8], [12], [48]. A GAN contains a generator network and a discriminator network, which are trained against each other; through adversarial learning, the generator can achieve satisfactory results.
Building an effective frame interpolation model for natural images is hard because video sequences are complex and high-dimensional. Blurred results and artifacts arise easily when processing high-resolution images, because objects of different sizes appear in the image. Although we can stack multiple convolution operations to enlarge the receptive field, the problem of long-range dependence then becomes more prominent [15], [49]. The proposed method uses a multi-scale structure that takes images at different size levels as input and can therefore model global information and local features more accurately. The effective image patch size becomes smaller going up the multi-scale pyramid, so the network can process finer local image features.
In this paper, a novel and explicit video frame interpolation method is proposed that relieves the problems caused by erroneous intermediate processes. Inspired by recent advances in generative adversarial networks [8]–[11] for image synthesis [12]–[14] and video prediction [15]–[17], we introduce a frame interpolation framework using multi-scale dense attention generative adversarial networks, i.e., FI-MSAGAN. FI-MSAGAN is an effective, end-to-end trainable, fully convolutional neural network that takes two consecutive frames at arbitrary resolution and directly produces a high-quality intermediate frame. The proposed FI-MSAGAN progressively reconstructs intermediate frames in a coarse-to-fine manner. At each scale, the frame generator uses residual blocks [18], [19] and skip connections to form a dense network structure, and a frame discriminator judges whether the input frame patches are fake or real, which helps the model capture fine details and texture structure. To make the network focus on large moving objects and handle complex motion, we introduce an attention module in the generator. In addition, a sequence discriminator is designed to provide feedback that helps the generator capture the temporal consistency between video frames. The training process of FI-MSAGAN is end-to-end and guided by a comprehensive loss function that contains a multi-scale frame adversarial loss, a sequence adversarial loss, a perceptual loss and a reconstruction loss. The main contributions of this paper are summarized as follows:
We propose an efficient multi-scale dense attention generative adversarial network (FI-MSAGAN) for video frame interpolation. Our method utilizes global and local information to produce realistic intermediate frames directly, without optical flow estimation.
We design a dense attention generator to better reconstruct the temporal and spatial consistency of video sequences. The generator consists of a synthesis module and an attention module; the attention map produced by the attention module makes the generator adaptively focus on dynamic areas.
We construct a sequence adversarial loss through a sequence discriminator, while the frame discriminators recover high-frequency structure. The total loss of the network includes four terms: a multi-scale frame adversarial loss, a sequence adversarial loss, a feature perceptual loss and a reconstruction loss, which together further improve the quality of the interpolated frame.
Related Work
Many advanced approaches for video frame interpolation explicitly or implicitly presume consistent motion across video sequences. Conventional video frame interpolation approaches usually consist of two stages: motion estimation and pixel synthesis. Phase-based methods, optical flow and local convolution kernels can be used to capture motion consistency. Mahajan et al. [21] hold that a given pixel in the interpolated frames traces out a path in the original video sequences; based on this intuition, they copy and move pixel gradients from the inputs to the interpolated frames along this path. Meyer et al. [1] present an efficient method that computes in-between frames by simple per-pixel phase propagation across the levels of a multi-scale pyramid, but it fails in cases of large appearance changes.
Optical flow prediction is the key technology for motion estimation and determines the accuracy of the interpolation results. However, flow-based methods are often challenged by large motions and brightness changes [22]–[24]. Deep learning has made dramatic progress in image and video recognition, and most state-of-the-art optical flow models use deep learning [25], which suggests that CNNs can capture motion information between frames. To obtain better results, many researchers merge optical flow estimation and video frame interpolation into a single model [2], [6], [7], [26]–[28]. Liu et al. [6] design a deep network with a voxel flow layer to synthesize video frames by flowing pixel values from the input video volume; the resulting DVF is trained in an unsupervised manner and can easily be extended to video extrapolation. Jiang et al. [2] first use a flow-computation U-Net to estimate bi-directional flow, which is linearly fused to obtain a rough intermediate flow; a second flow-interpolation U-Net then refines the approximate flow and visibility maps, and the intermediate frame is synthesized by applying the visibility maps to the warped images. Niklaus and Liu [27] apply pixel-wise contextual information extracted by a pre-trained network to the estimated bidirectional flow and use a frame synthesis network to produce the interpolated frame in a context-aware fashion. Zhang et al. [7] use a 3D U-Net feature extractor to exploit spatio-temporal context and rebuild texture, together with a coarse-to-fine architecture to improve optical flow estimation. Li et al. [28] propose a lightweight network that estimates optical flow at the feature level and introduce a new Sobolev loss to achieve better results.
Different from optical-flow-based models, Niklaus et al. [4] develop a network (AdaConv) that estimates a spatially-adaptive convolution kernel for each pixel. However, the memory demand grows quickly because the model needs large kernels to deal with large motion. Niklaus et al. [5] improve AdaConv by approximating each regular 2D kernel with a pair of 1D kernels; the use of local separable convolution kernels significantly reduces the number of model parameters. Several papers [29], [30] use CNNs to produce intermediate frames directly, yet these models are simple and struggle to generate reasonable results.
After the invention of generative adversarial networks [8], many studies applied this framework to generate images in the context of image-to-image translation [12], [31], super resolution [32], and video prediction [15]–[17], [48]. Mathieu et al. [15] first applied adversarial learning to video prediction, employing an image gradient loss in a multi-scale architecture and noticeably reducing blurring artifacts. In parallel, many works have improved the performance of GANs themselves [13], [33], [34], [36]. WGAN [33] and LSGAN [36] introduce new discriminator loss functions to relieve training instability, while BigGAN [13] and ProGAN [34] allow the generator to map noise to high-resolution, realistic images. Inspired by these advances, our network uses a multi-scale generative adversarial strategy to produce the middle frame in an explicit manner.
Proposed Approach
In this section, we describe our video frame interpolation method in detail. First, we define the problem and introduce notation. Then, we present the proposed multi-scale frame interpolation network. Finally, we describe the loss functions used to train the network.
A. Problem Description
Our goal is to synthesize the intermediate frame $I_s$ between two successive frames $I_1$ and $I_2$.
In this work, we take advantage of generative adversarial networks to perform video frame interpolation more effectively. The generator receives the consecutive frames $I_1$ and $I_2$ and directly outputs the intermediate frame \begin{equation*} I_{s} =G\left( I_{1},I_{2} \right).\tag{1}\end{equation*}
B. FI-MSAGAN Structure for Video Frame Interpolation
Our proposed method adopts a multi-scale structure consisting of several generators and discriminators that take images of different sizes as inputs. The multi-scale structure combines global information and local details for photo-realistic frame generation. To preserve the spatial and temporal consistency of the generated video, we introduce a dense attention generator. Moreover, we use both frame discriminators and a sequence discriminator to distinguish fake data from real. The generators and discriminators are described as follows.
1) Multi-Scale Dense Attention Generator
We construct the proposed video frame interpolation method using a multi-scale pyramidal structure, as shown in Figure 1. There are four scale levels in the network, and the generator at each level is a subnetwork with several residual blocks, denoted $G_i$ ($i=1,\ldots,4$), where $i=1$ is the coarsest scale. Let $I_1^i$ and $I_2^i$ denote the input frames downsampled to scale $i$ and $\mathrm{U}$ denote upsampling; the intermediate frame at scale $i$ is synthesized as \begin{equation*} I_{s}^{i} =\begin{cases} G_{i}\left( I_{1}^{i},I_{2}^{i} \right), & i=1 \\ G_{i}\left( I_{1}^{i},I_{2}^{i},\mathrm{U}I_{s}^{i-1} \right), & i=2,3,4. \end{cases}\tag{2}\end{equation*}
Figure 1. Overview of our multi-scale frame interpolation network. The left side shows the multi-scale generator (yellow blocks) and the right side shows the multi-scale frame discriminator.
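To make the coarse-to-fine synthesis of Eq. (2) concrete, the following sketch shows one possible TensorFlow implementation. It is only illustrative: the helper list `generators`, the use of bilinear resizing for the image pyramid, and the channel-wise concatenation of inputs are assumptions rather than the exact FI-MSAGAN interfaces.

```python
import tensorflow as tf

def coarse_to_fine_interpolate(I1, I2, generators, num_scales=4):
    """Sketch of Eq. (2): synthesize the intermediate frame from coarse to fine.

    I1, I2: input frames [B, H, W, 3], with H, W divisible by 2**(num_scales - 1).
    generators: list [G_1, ..., G_4] of per-scale networks, coarsest scale first.
    """
    h, w = I1.shape[1], I1.shape[2]
    I_s = None
    for i in range(num_scales):
        # Spatial size at scale i + 1 (scale 1 is the coarsest level).
        factor = 2 ** (num_scales - 1 - i)
        size = (h // factor, w // factor)
        I1_i = tf.image.resize(I1, size)            # I_1^i
        I2_i = tf.image.resize(I2, size)            # I_2^i
        if I_s is None:
            inp = tf.concat([I1_i, I2_i], axis=-1)  # first case of Eq. (2)
        else:
            up = tf.image.resize(I_s, size)         # U I_s^{i-1}: upsampled estimate
            inp = tf.concat([I1_i, I2_i, up], axis=-1)
        I_s = generators[i](inp)                    # I_s^i
    return I_s  # the full-resolution intermediate frame
```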
The entire generator network is a cascade of sub-networks with a common structure at each scale. Generator $G_i$ takes the input frames downsampled to scale $i$ and, for $i>1$, the upsampled intermediate frame from the previous level, and refines it into the intermediate frame at the current scale.
Figure 2. The structure of the generator $G_i$ at each scale, consisting of a synthesis module and an attention module (purple rectangle).
Recent works have achieved great success in image-to-image translation [12], [31], [37], [38]. These works adopt an architecture similar to [39] for the generator, which contains stride-2 convolutions for down-sampling, several residual blocks for feature-space translation between domains, and fractionally-strided convolutions with stride 1/2 for up-sampling. However, inspired by [15], we remove the down-sampling and up-sampling convolution layers and use only a few residual blocks in our multi-scale generator architecture; both the synthesis module and the attention module follow this design.
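As a rough illustration of this design, the Keras sketch below builds one per-scale sub-network from stride-1 convolutions and residual blocks only, so the spatial resolution is preserved throughout. The filter width, kernel sizes and output activation are illustrative assumptions, and only the synthesis path is shown (the attention branch is described next).

```python
from tensorflow.keras import layers, Model

def residual_block(x, filters=64):
    """Residual block with an identity skip connection [18], [19]."""
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    return layers.Add()([x, y])

def build_scale_subnetwork(in_channels, num_res_blocks, filters=64):
    """One per-scale sub-network: stride-1 convolutions only, no down-/up-sampling."""
    inp = layers.Input(shape=(None, None, in_channels))
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(inp)
    for _ in range(num_res_blocks):
        x = residual_block(x, filters)
    out = layers.Conv2D(3, 3, padding="same", activation="sigmoid")(x)  # RGB frame
    return Model(inp, out)

# Example: the coarsest synthesis sub-network sees two RGB frames (6 channels).
G1_synthesis = build_scale_subnetwork(in_channels=6, num_res_blocks=4)
```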
Video frame synthesis is a complicated and challenging task due to large and complex motion and occlusions. An effective video interpolation algorithm should be able to focus accurately on moving objects. However, a network built only from a series of convolution operations cannot achieve this property, because a traditional convolution structure accounts only for short-range dependencies, limited by its kernel size. Recent research shows that, motivated by the human perception procedure, attention mechanisms are advantageous in computer vision, e.g., image classification [40], image-to-image translation [41], [42] and video classification [20]. Rather than processing a single image or a sequence using only local information, attention allows the network to focus on the most relevant parts of the features as needed.
We address the above issues by inserting an attention module that produces an attention map. As shown in the purple rectangle in Figure 2, the attention module shares its inputs with the synthesis module and outputs an attention map $am_i$ whose values lie in $[0,1]$.
For example, at scale $i$, the final intermediate frame is obtained by blending the output $I_s^{i^{\prime}}$ of the synthesis module with the input frame $I_1^i$ according to the attention map: \begin{equation*} I_{s}^{i} =am_{i} \otimes I_{s}^{i^{\prime}}+\left( 1-am_{i} \right)\otimes I_{1}^{i},\tag{3}\end{equation*} where $\otimes$ denotes element-wise multiplication.
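A minimal sketch of the blending step in Eq. (3), assuming a single-channel attention map that is broadcast over the colour channels; high values keep the synthesized content and low values copy pixels from the first input frame.

```python
import tensorflow as tf

def attention_blend(I_syn, I1, attn_map):
    """Eq. (3): I_s^i = am_i * I_s^{i'} + (1 - am_i) * I_1^i (element-wise).

    I_syn:    output of the synthesis module at this scale, [B, H, W, 3].
    I1:       first input frame at the same scale, [B, H, W, 3].
    attn_map: attention map in [0, 1], [B, H, W, 1], broadcast over channels.
    """
    return attn_map * I_syn + (1.0 - attn_map) * I1
```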
2) Multi-Scale Frame Discriminator and Sequence Discriminator
Our network has a multi-scale structure, so we provide each generator with a corresponding frame discriminator. In other words, the whole network incorporates several pairs of generators and discriminators that process input images of different sizes, and these GAN pairs generate the final intermediate frame from coarse to fine. Frame discriminators at different scales have different receptive fields relative to the original image size. The coarsest-scale frame discriminator has the largest receptive field and the best global view, which guides the generator to produce globally consistent frames, while the frame discriminator at the finest level is adept at guiding the generator to produce more exquisite details. At the same time, this design makes training the generator easier and more stable.
The architecture of the frame discriminator $D_i$ is shown in Figure 3. It judges whether input frame patches are real or fake, providing an adversarial signal that encourages the generator to recover high-frequency structure and fine texture.
Figure 3. The structure of the frame discriminator $D_i$.
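The exact discriminator layout is given in Figure 3; the sketch below is only a plausible patch-style stand-in consistent with the description above, in which strided convolutions produce a grid of per-patch real/fake logits. Depth, widths and the LeakyReLU slope are assumptions.

```python
from tensorflow.keras import layers, Model

def build_frame_discriminator(base_filters=64, num_layers=4):
    """Illustrative frame discriminator: strided convolutions followed by a
    1-channel convolution that scores each receptive-field patch as real/fake."""
    inp = layers.Input(shape=(None, None, 3))
    x = inp
    for k in range(num_layers):
        x = layers.Conv2D(base_filters * 2 ** k, 4, strides=2, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)
    logits = layers.Conv2D(1, 4, padding="same")(x)  # per-patch logits
    return Model(inp, logits)
```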
Temporal consistency of the generated intermediate frames is crucial for video frame interpolation. Therefore, we design a sequence discriminator $D_S$ that takes a short frame sequence as input and judges whether it is a real sequence of consecutive frames or a fake sequence containing the generated intermediate frame, providing feedback that helps the generators capture temporal consistency.
Figure 4. The structure of the sequence discriminator $D_S$.
C. Loss Functions
Our multi-scale video frame interpolation approach is based on GANs [8], which optimize the following objective function: \begin{align*} \min_{G}\max_{D}\; &\mathbb{E}_{x\sim p_{data}(x)}\left[ \log D(x) \right] \\ &+\,\mathbb{E}_{x\sim p_{model}(x)}\left[ \log\left( 1-D(x) \right) \right].\tag{4}\end{align*}
We use multi-scale frame discriminators and a sequence discriminator, so there are two kinds of adversarial loss. The adversarial loss for the frame discriminator at level $i$ is \begin{align*} {\mathcal{L}}_{GAN}^{i}=&\,\mathbb{E}_{I_{r}^{i} \sim p_{data}}\left[ \log D_{i}\left( I_{r}^{i} \right) \right] \\ &+\,\mathbb{E}_{I_{s}^{i} \sim p_{model}}\left[ \log\left( 1-D_{i}\left( I_{s}^{i} \right) \right) \right],\tag{5}\end{align*} where $I_r^i$ denotes the real intermediate frame at scale $i$.
The multi-scale frame adversarial loss is the weighted sum over all levels: \begin{equation*} {\mathcal{L}}_{F} =\sum\limits_{i=1}^{4} {\lambda_{i} {\mathcal{L}}_{GAN}^{i}},\tag{6}\end{equation*} where $\lambda_i$ weights the contribution of scale $i$.
The sequence discriminator $D_S$ distinguishes the real sequence $Rseq$, consisting of consecutive ground-truth frames, from the fake sequence $Fseq$, in which the middle frame is replaced by the generated one. The sequence adversarial loss is \begin{equation*} {\mathcal{L}}_{S} =\mathbb{E}_{p_{data}}\left[ \log D_{S}\left( Rseq \right) \right]+\mathbb{E}_{p_{model}}\left[ \log\left( 1-D_{S}\left( Fseq \right) \right) \right].\tag{7}\end{equation*}
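The sketch below shows one way the adversarial terms of Eqs. (5)–(7) could be computed, assuming the discriminators output logits and that the sequence discriminator receives the three frames stacked along the channel axis; the log terms are written with binary cross-entropy, which is equivalent up to sign and numerically more stable. All helper arguments are illustrative.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def frame_adversarial_losses(frame_discriminators, real_frames, fake_frames, weights):
    """Eqs. (5)-(6): weighted multi-scale frame adversarial loss.

    real_frames / fake_frames: per-scale lists of I_r^i and I_s^i (coarse to fine).
    Returns (d_loss, g_loss): the discriminator objective of Eq. (5) as a BCE
    minimization, plus the generator term that pushes D_i(I_s^i) towards "real".
    """
    d_loss, g_loss = 0.0, 0.0
    for D_i, real, fake, w in zip(frame_discriminators, real_frames, fake_frames, weights):
        real_logits, fake_logits = D_i(real), D_i(fake)
        d_loss += w * (bce(tf.ones_like(real_logits), real_logits)
                       + bce(tf.zeros_like(fake_logits), fake_logits))
        g_loss += w * bce(tf.ones_like(fake_logits), fake_logits)
    return d_loss, g_loss

def sequence_adversarial_losses(D_s, I1, I_real, I_fake, I2):
    """Eq. (7): the sequence discriminator sees Rseq = (I1, I_real, I2)
    and Fseq = (I1, I_fake, I2), here stacked along the channel axis."""
    real_logits = D_s(tf.concat([I1, I_real, I2], axis=-1))
    fake_logits = D_s(tf.concat([I1, I_fake, I2], axis=-1))
    d_loss = (bce(tf.ones_like(real_logits), real_logits)
              + bce(tf.zeros_like(fake_logits), fake_logits))
    g_loss = bce(tf.ones_like(fake_logits), fake_logits)
    return d_loss, g_loss
```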
We use a pixel-wise reconstruction loss in the L1 norm, which produces sharper results than MSE: \begin{equation*} {\mathcal{L}}_{f} =\mathbb{E}_{p_{data}}\left[ \left\| I_{s} -I_{r} \right\|_{1} \right].\tag{8}\end{equation*}
Inspired by the perceptual loss used for single-image super-resolution and style transfer [39], we employ the conv5_4 features $\phi$ of the VGG network [44] as a feature perceptual loss: \begin{equation*} {\mathcal{L}}_{vgg} =\mathbb{E}_{p_{data}}\left[ \left\| \phi\left( I_{s} \right)-\phi\left( I_{r} \right) \right\|_{2} \right].\tag{9}\end{equation*}
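As a sketch of Eq. (9), the snippet below extracts the block5_conv4 feature maps (the Keras name for conv5_4) from a frozen, ImageNet-pretrained VGG19 and compares them with a mean squared distance; the assumption that frames lie in [0, 1] and the exact distance normalization are illustrative.

```python
import tensorflow as tf

# Frozen VGG19 feature extractor up to conv5_4 ("block5_conv4" in Keras).
_vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
_phi = tf.keras.Model(_vgg.input, _vgg.get_layer("block5_conv4").output)
_phi.trainable = False

def perceptual_loss(I_s, I_r):
    """Eq. (9): distance between VGG feature maps of synthesized and real frames."""
    # Map frames from [0, 1] to the preprocessing expected by Keras VGG19.
    f_s = _phi(tf.keras.applications.vgg19.preprocess_input(I_s * 255.0))
    f_r = _phi(tf.keras.applications.vgg19.preprocess_input(I_r * 255.0))
    return tf.reduce_mean(tf.square(f_s - f_r))
```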
Ultimately, the total loss used to train our multi-scale dense attention frame interpolation network is as follows, where the weights $\lambda_{GAN}$ and $\lambda_{vgg}$ balance the individual terms: \begin{equation*} {\mathcal{L}}=\lambda_{GAN}\left( {\mathcal{L}}_{F} +{\mathcal{L}}_{S} \right)+{\mathcal{L}}_{f} +\lambda_{vgg} {\mathcal{L}}_{vgg}.\tag{10}\end{equation*}
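A one-function sketch of Eq. (10) combining the four terms; the default weights shown are placeholders only, not the values used in our experiments.

```python
import tensorflow as tf

def total_generator_loss(l_frame_adv, l_seq_adv, I_s, I_r, l_vgg,
                         lambda_gan=0.05, lambda_vgg=0.1):  # placeholder weights
    """Eq. (10): L = lambda_GAN * (L_F + L_S) + L_f + lambda_vgg * L_vgg."""
    l_rec = tf.reduce_mean(tf.abs(I_s - I_r))  # Eq. (8): L1 reconstruction
    return lambda_gan * (l_frame_adv + l_seq_adv) + l_rec + lambda_vgg * l_vgg
```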
Experiments
In this section, we perform extensive experiments to demonstrate the effectiveness of our approach. First, we explain the experimental details. Then, we provide an ablation study to demonstrate the contribution of each proposed component, i.e., 1) the multi-scale generator, 2) the attention module and 3) the sequence adversarial loss. Finally, we compare our method with state-of-the-art video frame interpolation methods and evaluate them both qualitatively and quantitatively.
A. Experiment Details
Any available video can be used to train our network, and no labels are required. We use the Adobe240-fps videos collected by [45], which contain real and diverse scenes. We extract three consecutive frames from these videos to form frame triplets and, to train the model effectively, discard triplets in which the first and third frames are almost identical. Finally, 50k triplets are selected; 783 randomly selected triplets are used for testing and the rest for training. When comparing with other methods, we also use UCF101 [46] to verify network performance. The triplets are cropped to fixed-size patches for training.
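An illustrative sketch of the triplet construction described above: consecutive triplets are collected from a decoded video and those whose first and third frames are nearly identical are discarded. The difference measure and threshold are assumptions.

```python
import numpy as np

def build_triplets(frames, diff_threshold=0.01):
    """frames: list of H x W x 3 uint8 arrays from one video, in temporal order.
    Returns (I1, I_gt, I2) triplets whose endpoints differ enough to contain motion."""
    triplets = []
    for k in range(len(frames) - 2):
        first = frames[k].astype(np.float32) / 255.0
        third = frames[k + 2].astype(np.float32) / 255.0
        # Discard near-static triplets: mean absolute difference below threshold.
        if np.mean(np.abs(first - third)) < diff_threshold:
            continue
        triplets.append((frames[k], frames[k + 1], frames[k + 2]))
    return triplets
```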
To train our model, we use Adam optimization [47]. The batch size is set to 8, and the numbers of residual blocks in the generators are 4, 6, 6 and 8 from coarse to fine. We implement our network in TensorFlow on an NVIDIA GeForce GTX 1080 Ti.
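Finally, a condensed sketch of one alternating training step with Adam, in the spirit of the setup above. The callables `generator_step_loss` and `discriminator_step_loss` are hypothetical wrappers around the loss sketches given earlier; learning rates and other optimizer settings are omitted and should follow [47].

```python
import tensorflow as tf

g_opt = tf.keras.optimizers.Adam()  # settings per [47]; batch size 8 as noted above
d_opt = tf.keras.optimizers.Adam()

def train_step(batch, generator_step_loss, discriminator_step_loss, g_vars, d_vars):
    """One alternating update of the generators and the discriminators.

    generator_step_loss(batch) -> (g_loss, I_s): total generator loss of Eq. (10)
        plus the synthesized frame. discriminator_step_loss(batch, I_s) -> d_loss:
        summed frame and sequence discriminator losses. Both are placeholders.
    """
    with tf.GradientTape() as tape:
        g_loss, I_s = generator_step_loss(batch)
    g_opt.apply_gradients(zip(tape.gradient(g_loss, g_vars), g_vars))

    with tf.GradientTape() as tape:
        d_loss = discriminator_step_loss(batch, tf.stop_gradient(I_s))
    d_opt.apply_gradients(zip(tape.gradient(d_loss, d_vars), d_vars))
    return g_loss, d_loss
```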
B. Ablation Study
We perform an ablation study to prove the effectiveness of our contributions: the multi-scale generator structure, the attention module and the sequence adversarial loss. The baseline model is a plain GAN framework without the multi-scale setting or the attention module in the generator, trained with the original-scale frame adversarial loss, the frame reconstruction loss and the feature perceptual loss. The Adobe240-fps test set is used in these experiments. The results of the ablation study are shown in TABLE 1, where our full model uses all three proposed contributions.
As reported in TABLE 1, the baseline model achieves acceptable PSNR and SSIM values, showing that the GAN framework can be applied to the video frame interpolation task. However, the predicted intermediate frame is visually unsatisfactory, as shown in Figure 5(b): the lettering on the moving car is blurry. PSNR and SSIM are significantly improved by inserting our proposed components into the baseline model, and according to both quantitative metrics our full model outperforms all other ablation variants. Comparing rows 1 and 2 of TABLE 1 shows that the multi-scale structure is effective because the model can integrate global and local information. The generator with only the attention module is slightly worse than the one with only the multi-scale structure; this may be because the multi-scale loss can only be constructed with the multi-scale structure, and multiple generators and discriminators also improve the training stability of the GAN. Using the multi-scale structure and the attention module simultaneously gives better performance. Comparing the last two rows of TABLE 1, our full model captures spatial and temporal consistency owing to the sequence discriminator.
Figure 5. Video frame interpolation results of the ablation study, together with the ground-truth frame.
The results of models with different combinations of components are shown in Figure 5. Although the baseline model is feasible, it still produces blurry results. The baseline model with the multi-scale structure generates more realistic frames than the model with attention only, and the middle frame produced by our full model is the sharpest. The latter two models are very similar in visual perception, but the full model has better quantitative performance.
C. Comparison With State-of-the-Art Methods
To verify the competitiveness of the proposed method, we compare our network with several state-of-the-art video frame interpolation methods and evaluate them both qualitatively and quantitatively. The comparison methods include DVF [6], SepConv [5] and Super SloMo [2], which are representative of different video frame interpolation techniques: DVF and SepConv are recent CNN-based methods, while Super SloMo uses optical flow estimated by a CNN to assist in generating the intermediate frame. Due to the lack of pretrained models, we train these methods on our data with the official implementations. We make quantitative and qualitative comparisons on UCF101 and Adobe240-fps, so the comparison results are persuasive. The PSNR and SSIM of the different methods are shown in TABLE 2 and TABLE 3.
As reported in TABLE 2 and TABLE 3, our method outperforms the other methods overall. The SSIM value of our model is slightly lower than those of SepConv and Super SloMo on the UCF101 test set, but the PSNR score of our model is the best among these approaches, which confirms that our method is effective. Although the SSIM value of our method is not the highest on the UCF101 test data, the intermediate frames generated by our model have better visual quality, as shown in Figure 7; this may be because numerical metrics are not always consistent with human perception. We also list the runtimes of the different methods in TABLE 3.
Figure 6. Results of different methods and our model on the Adobe240-fps dataset. The bottom right of each frame shows the details in the red rectangular area. Our model recovers better detail.
Figure 7. Results of different methods and our model on the UCF101 dataset. Column (a) shows the real frames; columns (b) to (e) show the results of DVF, SepConv, Super SloMo and our model. The bottom right of each frame shows the details in the red rectangular area. Our model recovers better detail.
Some frame interpolation results of the compared methods and our model are shown in Figure 6 and Figure 7. On the Adobe240-fps test set, our model produces more realistic intermediate frames, as shown in Figure 6(e) and (j): in Figure 6(e) the front tyre of the red car shows less blurring, and in Figure 6(j) the feet of the person riding the bicycle are sharper, while the chain and transmission look more realistic. Figure 7 shows results of these methods on the UCF101 dataset, where column (e) is the result of our method: the edges of the hula hoop and the paddle are better reconstructed, and the texture of the sole is more vivid. The intermediate frames generated by our model have better visual quality and more accurate spatial and temporal consistency. Thanks to the multi-scale structure and the attention mechanism, our model can focus on moving objects and combine global and local features to synthesize the intermediate frame more faithfully.
Conclusion
In this paper, we propose a novel multi-scale dense attention generative adversarial network (FI-MSAGAN) for video frame interpolation. First, we establish a generative adversarial framework with a multi-scale loss to produce the middle frame in a coarse-to-fine manner; by receiving rough results from lower levels, the generators can better combine global and local information. Second, an attention module is embedded in the generator so that the network can accurately focus on moving objects and handle large motion precisely. Third, we adopt a sequence discriminator to judge whether a frame sequence is real or fake, so that the generators can capture spatial and temporal consistency via its feedback. The ablation study demonstrates the effectiveness of each contribution, and comparative experiments with several state-of-the-art video frame interpolation methods, evaluated both qualitatively and quantitatively, show that our method outperforms them.
Recent works have shown that traditional numerical metrics (PSNR, SSIM) are not always consistent with human perception, so designing new metrics better aligned with human cognition would be helpful for evaluating video frame interpolation models. Furthermore, combining GANs with optical flow estimation is also worthy of future research.