Introduction
Video understanding is a research field that leverages machine learning to analyze video data. This technology has a wide range of applications, including automated driving, surveillance systems, and video generation. A typical task in video understanding is action recognition, which involves classifying human actions from short video clips. In traditional studies, handcrafted descriptors like dense trajectories [1] and improved dense trajectories [2] were developed to classify actions. Recent studies have used deep learning models to classify human actions directly from RGB frame sequences [3], [4], [5]. Nevertheless, accessing RGB frames requires decoding because most videos are compressed for efficient storage. While some studies have even incorporated optical flow as an auxiliary input to boost accuracy [6], [7], [8], [9], [10], [11], it also has the same limitation because it is computed from RGB frame sequences. This decoding limits the deployment of action recognition models on mobile or edge devices.
Videos frequently contain redundant information, such as backgrounds, and video compression reduces such redundancy by converting RGB frames into different features, including I-frames, motion vectors, and residuals. The I-frames are stored as images, whereas the motion vectors and residuals represent only the changes from previous RGB frames. Compressed video action recognition directly classifies actions from compressed video features, as depicted in Fig. 1. We can omit the decoding process and reduce the computational cost by using compressed video features as inputs. Wu et al. [12] showed that this method achieves competitive classification accuracy against conventional RGB-based action recognition but with lower computational complexity defined by floating-point operations (FLOPs). Some studies have leveraged the low computational complexity of compressed action recognition to deploy action recognition models on mobile and edge devices [13], [14].
Fig. 1. Illustration of compressed video action recognition and traditional RGB frame-based action recognition. Compressed video action recognition is more efficient than the RGB frame-based approach because decoding is not required to obtain I-frames, motion vectors, and residuals from compressed video files.
Most conventional methods for compressed video action recognition use multiple networks to process the I-frames, motion vectors, and residuals of compressed video features. Some studies optimized these networks independently and fused their predictions for inference [12], [14], [15], [16], [17]. Others jointly optimized them by weakly connecting their hidden layers for performance improvement [13], [18]. Such models therefore incur the computational complexity of multiple networks.
The computational complexity of multiple networks may be unnecessary because most parameters in deep networks are unused and removable after training, as shown in various studies [19], [20], [21]. For efficient ensembling in image classification, the multi-input multi-output (MIMO) model exploits these unused parameters by creating independent subnetworks within a single network [22]. The subnetworks process multiple images independently within one feedforwarding step of the parent network and make separate predictions for each image. As a result, the MIMO model achieved accuracy competitive with multiple networks while reducing the computational complexity.
Inspired by the MIMO model, we have proposed a multi-stream single network (MussNet) for efficient compressed video action recognition in our previous study [23]. This model trains a single network in the MIMO manner and creates three independent subnetworks within a single network. The subnetworks in the MussNet model independently process I-frames, motion vectors, and residuals instead of the multiple networks used in the previous methods. The proposed model achieved competitive accuracy against multiple networks while reducing the overall computational complexity.
The limitation of the current MussNet model is that this model only processes inputs independently and cannot fuse the input features inside the network. Recent studies have shown that intermediate fusion, which fuses hidden vectors of multiple networks at some intermediate points of the feedforwarding step, improves the accuracy of the multi-stream models [5], [10], [11]. However, in the MussNet model, features of I-frames, motion vectors, and residuals are contained in the shared hidden vectors, and how to perform the intermediate fusion using the subnetworks is non-trivial.
To overcome this limitation, we expand our previous study by proposing a novel module named the Extract, Fuse, and Scale (EFS) module. The EFS module learns to disentangle and fuse the features of I-frames, motion vectors, and residuals from the shared hidden vectors, allowing the MussNet model to perform intermediate fusion. In addition, this module adds only a small computational overhead to the feedforwarding step. The MussNet model with the EFS module improves accuracy over the original MussNet model while maintaining the efficiency of a single network in terms of computational complexity. The contributions of this study can be summarized as follows:
We propose the EFS module, which improves the accuracy of the MussNet model with only a small additional computational cost.
We experimentally show that the MussNet model with the EFS module achieves competitive accuracy against our multiple network-based baselines while reducing the computational complexity.
We analyze the EFS module and clarify that extending the MussNet model to intermediate fusion improves accuracy. We also show that the EFS module succeeds in disentangling and fusing features from the shared hidden vectors while adding only about 1% to the GFLOPs of the original MussNet model.
Related Work
A. Compressed Video Action Recognition
Pioneering studies [24], [25] on compressed video action recognition only used motion vectors as an easy-to-use alternative to optical flow, which is expensive to compute. These studies still used decoded RGB frames and did not use compressed video features other than motion vectors. Wu et al. [12] first proposed the CoViAR method, which classifies videos using only compressed video features. They employed three 2DCNNs corresponding to the compressed video features and trained them independently. After training, the final prediction was computed by averaging the predictions of the three networks. Li et al. [26] showed that compressed video action recognition remains feasible in the practical scenario where compressed videos are transmitted from other devices and some packets are dropped. Some studies focused on the efficiency of compressed video action recognition and extended it to different tasks, such as real-time object tracking [27] and facial expression recognition [28].
Subsequent studies have developed more efficient or effective compressed video action recognition methods by replacing backbone networks with different lightweight networks and employing additional components to maintain the accuracy of the CoViAR method. For example, CV-C3D [17] and MFCD-Net [29] used 3DCNNs; Wu et al. [15] and Guo et al. [30] used ResNet18 [31] as their backbone network and trained it using knowledge distillation [32]; TTP [14] combined MobileNetV2 [33] with an efficient yet effective fusion method. Other studies estimated optical flow from motion vectors and residuals and used the estimated optical flow to improve accuracy. The DMC-Net method [16] trained the optical flow estimator in a supervised manner using actual optical flow for training. The subsequent SIFP method [18] developed an unsupervised approach to train the optical flow estimator without actual optical flow. The proposed MussNet provides another approach for efficient compressed video action recognition.
B. Efficient Ensemble
Our method is inspired by recent ensemble methods that estimate prediction uncertainty or improve out-of-distribution robustness by feeding the same input (e.g., an image) into multiple networks and fusing their predictions. The problem with such ensemble methods is the expensive computation and memory cost of training and testing multiple networks. Various approaches, such as Monte Carlo Dropout [34] and Snapshot [35], were proposed to address this problem. Havasi et al. proposed the MIMO method [22], which uses a single MIMO network instead of multiple single-input single-output networks. Unlike other methods, their method processes multiple inputs with one feedforwarding step of a single network. They showed that independent subnetworks are obtained in a single network through MIMO learning. We extend the MIMO method to compressed video action recognition to obtain a single network that processes compressed video features simultaneously.
Method
A. Background
1) Compressed Video Features
This study used the MPEG-4 Part 2 format [36] to compress videos, following the previous study [12]. This format arranges RGB frames into groups of pictures (GOPs), where each GOP contains a fixed number of frames and starts with an I-frame, followed by several P-frames. The codec stores I-frames as standard RGB images and P-frames as the changes in the RGB values from the previous frames. Specifically, the differences between P-frames and the previous frames are represented by motion vectors and residuals. The motion vectors represent the coarse, block-level motion with respect to the reference frame, and the residuals store the remaining pixel-wise differences after motion compensation.
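To make the relationship between these features concrete, the following sketch reconstructs a P-frame from its reference frame using block-wise motion compensation. It is a simplified illustration rather than the actual codec logic: the function name, the macroblock size, and the assumption of one motion vector per 16x16 block are ours.

import numpy as np

def reconstruct_p_frame(reference, motion_vectors, residual, block=16):
    """Toy reconstruction of a P-frame from its reference frame.

    reference:      (H, W, 3) uint8 RGB frame (e.g., the preceding frame).
    motion_vectors: (H // block, W // block, 2) integer (dy, dx) offsets,
                    one vector per macroblock, pointing into the reference.
    residual:       (H, W, 3) int16 pixel-wise correction added after
                    motion compensation.
    """
    h, w, _ = reference.shape
    predicted = np.zeros_like(reference, dtype=np.int16)
    for by in range(0, h, block):
        for bx in range(0, w, block):
            dy, dx = motion_vectors[by // block, bx // block]
            # Copy the matching block from the reference frame (clipped to bounds).
            sy = np.clip(by + dy, 0, h - block)
            sx = np.clip(bx + dx, 0, w - block)
            predicted[by:by + block, bx:bx + block] = reference[sy:sy + block, sx:sx + block]
    # The residual corrects whatever block-level motion compensation missed.
    return np.clip(predicted + residual, 0, 255).astype(np.uint8)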
2) Naïve Fusion Methods for Compressed Videos
To utilize multiple inputs, including compressed videos, it is crucial to effectively fuse their information to obtain accurate predictions. The design of the fusion method has been extensively studied as it significantly affects classification performance. This study focused on three naïve fusion methods: early, late, and intermediate fusion.
Early fusion, depicted in Fig. 2-(a), is a straightforward method of fusing compressed video features. This method concatenates compressed video features in advance and feeds the concatenated features into a single network. The advantage of the early fusion method is that it only requires one network to process compressed video features, leading to lower computational costs than other fusion methods. However, early fusion often yields poorer classification performance than other methods, which limits its use in action recognition.
Late fusion, depicted in Fig. 2-(b), is another simple fusion method for compressed videos, which has been used in many previous methods [12], [14], [16], [17], [26]. In compressed video action recognition, the late fusion method uses three networks and independently classifies actions from each compressed video feature using these networks. The final prediction is obtained by averaging the three predictions. Late fusion can significantly improve classification performance compared to early fusion. However, late fusion only linearly fuses the predictions from compressed videos and cannot nonlinearly fuse compressed video information, leaving room for further improvement in accuracy.
Intermediate fusion, depicted in Fig. 2-(c), is a more complex method that fuses information from the input features during the hidden-layer processing. This method also uses three networks and trains them to classify each compressed video feature, similar to the late fusion method. In addition, intermediate fusion aggregates the hidden-layer outputs of the networks and passes their information to one of the networks, enabling the nonlinear fusion of compressed video features. Among conventional methods, SIFP [18], MEACI-Net [37], and He et al.'s method [13] employ intermediate fusion to improve accuracy.
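The following is a minimal PyTorch sketch contrasting the three naïve fusion schemes. The backbone, the channel counts (3 for I-frames and residuals, 2 for motion vectors), and the module names are placeholder assumptions for illustration, not the architectures used in this paper.

import torch
import torch.nn as nn

def make_backbone(in_channels, feat_dim=512):
    # Placeholder feature extractor; any 2D CNN could be used here.
    return nn.Sequential(
        nn.Conv2d(in_channels, feat_dim, kernel_size=7, stride=4, padding=3),
        nn.ReLU(inplace=True),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
    )

class EarlyFusion(nn.Module):
    """One network over channel-concatenated I-frame, motion vector, and residual."""
    def __init__(self, num_classes, feat_dim=512):
        super().__init__()
        self.net = make_backbone(3 + 2 + 3, feat_dim)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, iframe, mv, residual):
        x = torch.cat([iframe, mv, residual], dim=1)
        return self.head(self.net(x))

class LateFusion(nn.Module):
    """Three networks; their softmax predictions are averaged."""
    def __init__(self, num_classes, feat_dim=512):
        super().__init__()
        self.nets = nn.ModuleList([make_backbone(c, feat_dim) for c in (3, 2, 3)])
        self.heads = nn.ModuleList([nn.Linear(feat_dim, num_classes) for _ in range(3)])

    def forward(self, iframe, mv, residual):
        logits = [h(n(x)) for n, h, x in zip(self.nets, self.heads, (iframe, mv, residual))]
        return torch.stack([l.softmax(dim=-1) for l in logits]).mean(dim=0)

class IntermediateFusion(nn.Module):
    """Like late fusion, but MV/residual features are also injected into the I-frame stream."""
    def __init__(self, num_classes, feat_dim=512):
        super().__init__()
        self.nets = nn.ModuleList([make_backbone(c, feat_dim) for c in (3, 2, 3)])
        self.mix = nn.Linear(2 * feat_dim, feat_dim)  # mono-directional fusion path
        self.heads = nn.ModuleList([nn.Linear(feat_dim, num_classes) for _ in range(3)])

    def forward(self, iframe, mv, residual):
        f_i, f_m, f_r = (n(x) for n, x in zip(self.nets, (iframe, mv, residual)))
        f_i = f_i + self.mix(torch.cat([f_m, f_r], dim=-1))  # fuse inside the network
        logits = [h(f) for h, f in zip(self.heads, (f_i, f_m, f_r))]
        return torch.stack([l.softmax(dim=-1) for l in logits]).mean(dim=0)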
B. Multi-Stream Single Network
We aim to simultaneously process I-frames, motion vectors, and residuals using a single network for efficient compressed video action recognition. However, early fusion, which trains a single network to classify actions from the concatenated I-frames, motion vectors, and residuals, leads to poor accuracy as described in Sec. III-A2.
To overcome this problem, we proposed the MussNet model. This model was inspired by the MIMO model, originally proposed for efficient ensembling [22]. The original MIMO model ensembles predictions from inputs of the same feature type. Instead, we ensemble predictions from different feature types: I-frames, motion vectors, and residuals. As depicted in Fig. 3, our model is a single network with three prediction heads corresponding to I-frames, motion vectors, and residuals, respectively. The prediction heads are trained to classify videos only from their corresponding features by simultaneously feeding I-frames, motion vectors, and residuals extracted from different videos into MussNet. This training encourages MussNet to form independent subnetworks, each of which classifies videos from one of the compressed video features using the corresponding prediction head. The resulting subnetworks can then be used for late or intermediate fusion.
1) Training
Let $x_{i}^{I}$, $x_{j}^{M}$, and $x_{k}^{R}$ denote the I-frames, motion vectors, and residuals sampled from (possibly different) videos $i$, $j$, and $k$, and let $t_{i}$, $t_{j}$, and $t_{k}$ denote their ground-truth labels. The MussNet model with parameters $\theta$ and prediction heads $y^{I}$, $y^{M}$, and $y^{R}$ is trained to minimize the sum of cross-entropy losses \begin{equation*} \sum _{(V, l) \in \{(I, i), (M, j), (R, k)\}} -t_{l} \log p_{\theta} (y_{l}^{V}|x_{i}^{I}, x_{j}^{M}, x_{k}^{R}). \tag{1}\end{equation*}
This optimization promotes the independence of model predictions because each compressed video feature does not provide significant information for predicting the labels of the other two features, as the classes of the input videos may differ. For example, the motion vectors of video $j$ carry little information about the label $t_{i}$ of video $i$, so the I-frame head must rely on $x_{i}^{I}$ alone.
Algorithm 1 Train the MussNet Model
Require: Network parameters $\theta$, training set of compressed video features
Ensure: Updated parameters $\theta$
while not converged do
  Sample videos $i$, $j$, $k$ and their features $x_{i}^{I}$, $x_{j}^{M}$, $x_{k}^{R}$ with labels $t_{i}$, $t_{j}$, $t_{k}$
  Compute the loss in Eq. 1 and update $\theta$ by gradient descent
end while
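A minimal PyTorch-style training step corresponding to Eq. 1 is sketched below. The `mussnet` callable, which returns one logit vector per prediction head, and the label dictionary are hypothetical interfaces used only for illustration.

import torch.nn.functional as F

def mussnet_training_step(mussnet, optimizer, iframes, mvs, residuals, labels):
    """One MIMO-style update: each prediction head classifies the video whose
    feature it is responsible for (Eq. 1).

    iframes/mvs/residuals: features of three *different* videos i, j, k.
    labels: dict holding the ground-truth labels t_i, t_j, t_k.
    """
    # The single network sees all three features at once (e.g., channel-concatenated)
    # and returns one logit vector per prediction head.
    logits_i, logits_m, logits_r = mussnet(iframes, mvs, residuals)

    loss = (F.cross_entropy(logits_i, labels["iframe"])       # head I predicts video i
            + F.cross_entropy(logits_m, labels["mv"])         # head M predicts video j
            + F.cross_entropy(logits_r, labels["residual"]))  # head R predicts video k

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()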
2) Inference
During the inference phase, the MussNet model makes predictions from the same video, as depicted in Fig. 3-(b). The final prediction for video $i$ is obtained by averaging the outputs of the three prediction heads:\begin{equation*} p_{\theta} (y_{i}|x_{i}^{I}, x_{i}^{M}, x_{i}^{R}) = \frac {1}{3}\sum _{V \in \{I, M, R\}} p_{\theta} (y_{i}^{V}|x_{i}^{I}, x_{i}^{M}, x_{i}^{R}). \tag{2}\end{equation*}
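A corresponding inference sketch for Eq. 2 is shown below, again assuming the hypothetical `mussnet` interface from the training sketch.

import torch

@torch.no_grad()
def mussnet_predict(mussnet, iframe, mv, residual):
    """Average the three heads' predictions for one video (Eq. 2)."""
    logits_i, logits_m, logits_r = mussnet(iframe, mv, residual)
    probs = torch.stack([logits_i.softmax(dim=-1),
                         logits_m.softmax(dim=-1),
                         logits_r.softmax(dim=-1)])
    return probs.mean(dim=0)  # final class distribution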
C. Extract, Fuse, and Scale Module
To extend the MussNet model to intermediate fusion, we develop the extract, fuse, and scale (EFS) module depicted in Fig. 4. The EFS module consists of shallow networks and can be placed after any hidden layer. For intermediate fusion, the EFS module must aggregate the features of I-frames, motion vectors, and residuals that come from the same videos. However, to construct subnetworks during training, the MussNet with the EFS module has to simultaneously process compressed video features from different videos, just as in the case without the EFS module. To satisfy these two demands of intermediate fusion and independent processing, we adopt a strategy in which the EFS module aggregates the intermediate features of the same videos by collecting them from multiple feedforwarding steps.
Let $h_{ijk}$ denote the hidden vector obtained when the I-frame of video $i$, the motion vector of video $j$, and the residual of video $k$ are fed into the network.

The features of the motion vectors and residuals of video $i$ are extracted by the extract submodule from the hidden vectors $h_{lim}$ and $h_{noi}$, whose motion vector and residual inputs come from video $i$:\begin{align*} (\hat {h}^{M}_{i}, \hat {h}^{R}_{m}) &= \text {Extract}(h_{lim}), \tag{3}\\ (\hat {h}^{M}_{o}, \hat {h}^{R}_{i}) &= \text {Extract}(h_{noi}), \tag{4}\end{align*}
where only $\hat {h}^{M}_{i}$ and $\hat {h}^{R}_{i}$ are used in the following step. The extracted vectors are concatenated ($\oplus$) and fused by the fuse submodule:\begin{equation*} \hat {h}_{i} = \text {Fuse}(\hat {h}^{M}_{i} \oplus \hat {h}^{R}_{i}), \tag{5}\end{equation*}
To incorporate the fused features $\hat {h}_{i}$ into the hidden vector $h_{ijk}$ of the subnetwork that processes the I-frame of video $i$, the simplest approach is to add $\hat {h}_{i}$ directly. However, such an unconditional addition injects the fused features into every channel of $h_{ijk}$, regardless of whether they are useful.
To solve this problem, we introduce the scale submodule (Fig. 4-(c)), inspired by squeeze-and-excitation [38]. This submodule generates scaling weights for every channel of $h_{ijk}$ from $h_{ijk}$ itself, so that the fused features are injected only into the channels where they are beneficial.
In this study, the scale submodule consists of global average pooling (GAP) followed by multi-layer perceptrons (MLPs). GAP reduces the number of elements of $h_{ijk}$ by averaging over the spatial dimensions, and the MLPs transform the pooled vector into channel-wise scaling weights.
Using the scale submodule and the output of the fuse submodule $\hat {h}_{i}$, the hidden vector $h_{ijk}$ is updated as \begin{equation*} h_{ijk} \leftarrow h_{ijk} + \sigma (\text {Scale}(h_{ijk})) \cdot \hat {h}_{i}, \tag{6}\end{equation*} where $\sigma$ denotes the sigmoid function.
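The following is a minimal PyTorch sketch of the EFS module for a hidden tensor of shape (B, C, H, W). The use of 1x1 convolutions for the extract and fuse submodules and the reduction ratio of the scale submodule are our assumptions; the paper's exact layer configuration is not reproduced here.

import torch
import torch.nn as nn

class EFSModule(nn.Module):
    """Sketch of the Extract, Fuse, and Scale submodules (Eqs. 3-6)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Extract: split motion-vector and residual features out of a shared hidden vector.
        self.extract = nn.Conv2d(channels, 2 * channels, kernel_size=1)
        # Fuse: merge the two extracted features back into one tensor.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)
        # Scale: squeeze-and-excitation style channel gate computed from h_ijk itself.
        hidden = max(channels // reduction, 1)
        self.scale = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),  # GAP
            nn.Flatten(),
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
        )

    def extract_mv_res(self, h):
        h_mv, h_res = self.extract(h).chunk(2, dim=1)
        return h_mv, h_res

    def forward(self, h_ijk, h_lim=None, h_noi=None):
        if h_lim is None and h_noi is None:
            # Inference: all features come from the same video (Eq. 7).
            h_mv, h_res = self.extract_mv_res(h_ijk)
        else:
            # Training: collect video i's MV/residual features from the
            # feedforwarding steps that actually contain them (Eqs. 3-4).
            h_mv, _ = self.extract_mv_res(h_lim)
            _, h_res = self.extract_mv_res(h_noi)
        fused = self.fuse(torch.cat([h_mv, h_res], dim=1))          # Eq. 5
        gate = torch.sigmoid(self.scale(h_ijk)).unsqueeze(-1).unsqueeze(-1)
        return h_ijk + gate * fused                                  # Eq. 6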
Algorithm 2 Procedure of the EFS Module for Training
Require: Hidden vectors $h_{ijk}$, $h_{lim}$, and $h_{noi}$
Ensure: Updated $h_{ijk}$
Compute $\hat {h}_{i}$ from $h_{lim}$ and $h_{noi}$ using Eqs. 3-5
Update $h_{ijk}$ using Eq. 6
Inference: The EFS module does not need multiple input videos for inference, similar to the MussNet model. When all compressed video features come from the same video $i$, the shared hidden vector is $h_{iii}$, and the extract submodule obtains both features from it:\begin{equation*} (\hat {h}_{i}^{M}, \hat {h}_{i}^{R}) = \text {Extract}(h_{iii}), \tag{7}\end{equation*} after which the fuse and scale submodules update $h_{iii}$ in the same way as in training.
Algorithm 3 Procedure of the EFS Module for Inference
Require: Hidden vector $h_{iii}$
Ensure: Updated $h_{iii}$
Compute $\hat {h}_{i}$ from $h_{iii}$ using Eqs. 7 and 5
Update $h_{iii}$ using Eq. 6
D. Information Routing
To update other hidden vectors in addition to $h_{ijk}$, the EFS module requires additional feedforwarding steps that contain the motion vectors and residuals of the corresponding videos. Naïvely preparing such extra inputs for every hidden vector would increase the training cost.
In this study, we propose the information routing technique to overcome this problem (Fig. 5). This technique is applicable under mini-batch learning with a reasonable batch size (e.g., 32 or 64). The extra motion vector and residual features needed to optimize the EFS module are collected from within the mini-batch; therefore, no videos beyond the mini-batch are required to optimize the MussNet model and the EFS module.
Fig. 5. Information routing for the efficient optimization of the MussNet model with the EFS modules. The colors of the arrows indicate which compressed video features are focused on. Some EFS module procedures are omitted for clarity.
Let $K$ denote the mini-batch size. A mini-batch consists of the I-frames, motion vectors, and residuals of $K$ videos:\begin{align*} B^{I} &:= \{x^{I}_{1}, x^{I}_{2}, \ldots, x^{I}_{K}\}, \tag{8}\\ B^{M} &:= \{x^{M}_{1}, x^{M}_{2}, \ldots, x^{M}_{K}\}, \tag{9}\\ B^{R} &:= \{x^{R}_{1}, x^{R}_{2}, \ldots, x^{R}_{K}\}. \tag{10}\end{align*}
In our optimization, we require sets of compressed video features sampled from different videos. To create such sets from $B^{I}$, $B^{M}$, and $B^{R}$, we shuffle the motion vectors and residuals using random permutations $\phi ^{M}$ and $\phi ^{R}$ of $\{1, \ldots, K\}$:\begin{align*} B^{M} &\leftarrow \{x^{M}_{\phi ^{M}_{1}}, x^{M}_{\phi ^{M}_{2}}, \ldots, x^{M}_{\phi ^{M}_{K}}\}, \tag{11}\\ B^{R} &\leftarrow \{x^{R}_{\phi ^{R}_{1}}, x^{R}_{\phi ^{R}_{2}}, \ldots, x^{R}_{\phi ^{R}_{K}}\}. \tag{12}\end{align*}
The MussNet model concatenates $B^{I}$ and the shuffled $B^{M}$ and $B^{R}$ and processes them in a single feedforwarding step. Inside the EFS modules, the extracted motion vector and residual features are permuted back to the original video order so that the features of the same videos are fused:\begin{align*} \hat {h}^{M} &\leftarrow \{\hat {h}^{M}_{1}, \hat {h}^{M}_{2}, \ldots, \hat {h}^{M}_{K}\}, \tag{13}\\ \hat {h}^{R} &\leftarrow \{\hat {h}^{R}_{1}, \hat {h}^{R}_{2}, \ldots, \hat {h}^{R}_{K}\}. \tag{14}\end{align*}
Algorithm 4 Mini-Batch Learning of the MussNet Model With the EFS Modules Using the Information Routing
Require: Mini-batch of compressed video features $B^{I}$, $B^{M}$, and $B^{R}$
Ensure: Model output
// Get indices that permute the shuffled elements back to the original order.
for $V \in \{M, R\}$ do
  Sample a random permutation $\phi ^{V}$ of $\{1, \ldots, K\}$ and compute its inverse
end for
// Shuffle motion vectors and residuals (Eqs. 11 and 12).
// Start the feedforwarding step.
for each layer of the MussNet model do
  Compute the hidden vectors; if the layer is followed by an EFS module, extract $\hat {h}^{M}$ and $\hat {h}^{R}$
  // Permute to the original order (Eqs. 13 and 14), then apply the fuse and scale submodules.
end for
Compute the model output from the final hidden vectors
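A PyTorch sketch of the shuffling and inverse-permutation steps is given below. The function names are ours, and the way the MussNet model consumes the routed batch is simplified.

import torch

def route_minibatch(iframes, mvs, residuals):
    """Shuffle motion vectors and residuals within the mini-batch (Eqs. 11-12)
    and return the inverse permutations needed to restore the original order
    inside the EFS modules (Eqs. 13-14)."""
    k = iframes.size(0)
    perm_m, perm_r = torch.randperm(k), torch.randperm(k)
    inv_m, inv_r = torch.argsort(perm_m), torch.argsort(perm_r)
    return iframes, mvs[perm_m], residuals[perm_r], inv_m, inv_r

def unshuffle_extracted(h_mv, h_res, inv_m, inv_r):
    """Permute the extracted features back to the original video order so that
    the features of the same video are fused together."""
    return h_mv[inv_m], h_res[inv_r]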
Experiments
A. Datasets
We evaluated the proposed method on two widely used public datasets for action recognition: UCF-101 [39] and HMDB-51 [40].
The UCF-101 dataset consists of 13,320 video clips across 101 action categories. All clips are collected from YouTube and have a fixed 25 FPS with a resolution of 320x240 pixels.
The HMDB-51 dataset consists of 6,770 video clips across 51 action categories, where each action category has a minimum of 101 clips collected from various sources, including movies and public databases. Although video clips have various FPS and resolutions, we converted them to 25 FPS and a resolution of
B. Implementation Details
We employed ResNet18, ResNet34, and ResNet50 [31] with temporal shift modules (TSM) [8] as backbone networks. These networks consist of 2D convolution layers, and TSMs adapt them for video processing without increasing the number of parameters or computational complexity by shifting parts of hidden vector channels to the future and past frames. In our experiments, TSMs are placed before every ResNet block. The EFS modules consist of shallow networks for the extract and fuse submodules and GAP followed by MLPs for the scale submodule, as described in Sec. III-C, and are placed after the ResNet blocks.
To conduct our experiments, we resized all videos to a resolution of
The backbone networks were pre-trained on the ImageNet dataset [41] using the MIMO method with three prediction heads, thus creating multiple subnetworks in the backbone networks. After pre-training the backbone networks, the MussNet was constructed by adding randomly initialized TSMs and EFS modules. Then, the networks were optimized using stochastic gradient descent with Nesterov's momentum [42]. The learning rate $\eta$ at epoch $t$ follows a linear warm-up for the first 10 epochs and a cosine decay afterward, where $T$ is the total number of epochs and $\eta _{peak}$ is the peak learning rate:\begin{align*} \eta = \begin{cases} \displaystyle \frac {t}{10}\eta _{peak} & \text {if } t \leq 10, \\ \displaystyle 0.5 \times \left({1 + \cos \frac {t-10}{T-10}\pi }\right)\eta _{peak}& \text {otherwise}, \end{cases} \tag{15}\end{align*}
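Eq. 15 can be transcribed directly as a Python function; the values of the peak learning rate and the total number of epochs are left as arguments since they are not reproduced here.

import math

def learning_rate(t, total_epochs, eta_peak):
    """Warm-up then cosine decay schedule of Eq. 15 (epochs are 1-indexed)."""
    if t <= 10:
        return (t / 10.0) * eta_peak  # linear warm-up
    return 0.5 * (1.0 + math.cos((t - 10) / (total_epochs - 10) * math.pi)) * eta_peak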
C. Baseline Methods
We will show that MussNet can achieve similar accuracy to the late and intermediate fusion methods while keeping the efficiency of a single network. Hence, we selected the naïve early, late, and intermediate fusion methods described in Sec. III-A2 as our baseline methods. For a fair comparison, the same backbone networks of MussNet are used for those baseline methods in our experiments.
The early fusion baseline is equivalent to the MussNet trained with compressed video features extracted from the same videos during training. The late fusion baseline uses three backbone networks to classify videos from each compressed video feature. The intermediate fusion baseline is based on the late fusion baseline but has mono-directional information paths from the motion vector and residual networks to the I-frame network. Each mono-directional information path concatenates the outputs of selected layers of the motion vector and residual networks, transforms them with a shallow fusion layer, and injects the result into the corresponding layer of the I-frame network.
We used the standard ImageNet pre-trained parameters provided by the PyTorch library [44] to initialize the baseline networks. Then, the baseline networks were fine-tuned using the same optimization strategy as in Sec. IV-B.
D. Comparison With Baselines
We trained the MussNet with EFS modules and our baselines and present the training results in Fig. 6. Comparing the baselines, we found that early fusion performed significantly worse than the other baselines in terms of accuracy. Specifically, the early fusion accuracies were approximately 10 points lower on the UCF-101 dataset and 15-20 points lower on the HMDB-51 dataset than those of the other baselines. More precisely, the accuracy improvement of late fusion over early fusion with {ResNet18-TSM, ResNet34-TSM, ResNet50-TSM} was {12.2, 11.7, 9.7} points on the UCF-101 dataset and {17.9, 16.4, 15.6} points on the HMDB-51 dataset. These findings demonstrate that early fusion is unsuitable for effectively classifying compressed video features with a single network. In addition, intermediate fusion achieved better accuracy than the early and late fusion methods in most cases; incorporating intermediate fusion into our model is therefore expected to yield better accuracy.
Fig. 6. Comparison between the proposed method and baselines with respect to the accuracy and computational complexity (GFLOPs). We used ResNet18-TSM, ResNet34-TSM, and ResNet50-TSM as backbone networks for the proposed method and baselines (ResNet18-TSM consumes the lowest GFLOPs). We report the average scores of all train-test splits.
Comparing the proposed method to the baselines with the same network architecture, the computational complexity of the MussNet was less than half that of late fusion and intermediate fusion, as the MussNet uses only one network for classification. Despite this, the proposed method achieved accuracy comparable to late and intermediate fusion. These findings indicate that the proposed method is better suited than early fusion for training a single network on compressed videos. Furthermore, employing three ResNet18-TSMs consumes as much computational complexity as a single ResNet50-TSM. Comparing the proposed method with ResNet50-TSM to the intermediate fusion baseline with ResNet18-TSM in Fig. 6, we found that the proposed method achieved better accuracy. This result illustrates that the MussNet with EFS modules also improves the accuracy of naïve late fusion or intermediate fusion models without increasing computational complexity.
E. Analysis of the MussNet Model and the EFS Module
We conducted ablation experiments to evaluate how much our proposed methods contribute to the efficiency and accuracy of the MussNet model. This section presents the validation results based on the ResNet34-TSM model using the UCF-101 and HMDB-51 datasets. We chose ResNet34-TSM for the ablation study because it offers a good trade-off between accuracy and computational complexity. All experimental results of the ablation study are summarized in Table 4.
From our experiments, we obtained the following insights into the MussNet model and the EFS module.
1) Creating Subnetworks in the MussNet Model is the Most Important for Accuracy
The MussNet model is optimized using compressed video features from different videos to create subnetworks in a single network. We first evaluated whether creating subnetworks in the MussNet model contributes to the classification performance or not. For this evaluation, we trained the MussNet model using compressed video features from the same videos.
As shown in the row w/o MIMO training (Early fusion) of Table 4, when the MussNet model was optimized using compressed video features from the same videos, it achieved only 77.5% accuracy on UCF-101 and 44.0% accuracy on HMDB-51. These results were 10.1 and 19.0 points worse than the full method, respectively. This was the largest performance degradation observed in our ablation study; thus, we conclude that creating subnetworks is the most important factor for the accuracy of the MussNet model.
2) Extending the MussNet Model into Intermediate Fusion Improves Accuracy
We also evaluated whether extending the MussNet model to intermediate fusion improves accuracy. For this evaluation, we trained the MussNet model without the EFS modules and summarize the experimental results in the row w/o EFS module of Table 4.
As shown in the results, the MussNet model without EFS modules achieved 87.2% accuracy on UCF-101 and 62.0% accuracy on HMDB-51. Compared with the full method, the EFS modules improved accuracy by 0.8 points on UCF-101 and 1.0 points on HMDB-51. These results show that integrating the EFS module into the MussNet model improves its accuracy.
However, this experiment alone is not sufficient to claim that extending the MussNet model to intermediate fusion improves accuracy, because the additional computations of the EFS module might themselves be responsible for the improvement. To remove this possibility, we used the inference computation of the EFS modules during training, disabling intermediate fusion while keeping the computation procedures. Specifically, we replaced Eqs. 3-5 with the following equations:\begin{align*} (\hat {h}_{j}^{M}, \hat {h}_{k}^{R}) &= \text {Extract}(h_{ijk}), \tag{16}\\ \hat {h}_{i}&=\text {Fuse}(\hat {h}_{j}^{M} \oplus \hat {h}_{k}^{R}), \tag{17}\end{align*} so that the EFS module performs the same computations but no longer fuses features of the same video across feedforwarding steps.
As a result, disabling intermediate fusion achieved 81.7% accuracy on UCF-101 and 50.6% accuracy on HMDB-51. Comparing these results with those of w/o EFS module, we found that merely adding the computations while keeping late fusion worsened accuracy rather than improving it. This confirms that, as claimed, extending the MussNet model to intermediate fusion is what improves accuracy.
3) The EFS Module Adds Only About 1% to the GFLOPs of the MussNet Model
The comparison of the full method and w/o EFS module also showed that the EFS module increases the GFLOPs of the MussNet model by only about 1%. Specifically, the full method used 29.3 GFLOPs, only 0.3 GFLOPs more than w/o EFS module. This observation shows that the EFS module can be used without compromising the efficiency of the single network-based approach.
4) The Scale Submodule is Essential in the EFS Module
We evaluated whether the scale submodule contributes to accuracy. For this evaluation, we removed the scale submodule from the EFS module and directly added the fuse submodule output $\hat {h}_{i}$ to the hidden vector, replacing Eq. 6 with\begin{equation*} h_{ijk} \leftarrow h_{ijk} + \hat {h}_{i}. \tag{18}\end{equation*}
5) The Later EFS Modules Learn Better Disentanglement of Features
We also analyzed whether the outputs of the extract submodules only hold features from either motion vectors or residuals.
For this analysis, we fixed our model's parameters and trained new classifiers that predict the labels corresponding to I-frames, motion vectors, or residuals from the outputs of the extract submodules, $\hat {h}^{M}$ and $\hat {h}^{R}$.
Accuracy of action recognition from outputs of each extract submodule. The bar colors show types of compressed video features with labels to be estimated by classifiers.
From our experiments, we found that the 2nd and 3rd extract submodules disentangled the features of motion vectors and residuals from the hidden vectors, although the disentanglement of the 2nd submodule was not as clear as that of the 3rd submodule. In contrast, the 1st submodule failed to disentangle the features, as the accuracy for motion vectors was consistently better than that for residuals.
These results indicate that our proposed method learns to extract the corresponding features from the hidden vectors for intermediate fusion. The 1st submodule likely failed to disentangle the features because the transformation by the 1st ResNet block was insufficient, and the intermediate fusion at the 1st EFS module contributes less to classification than the 2nd and 3rd submodules. When the extract submodules fail to disentangle features, the EFS modules fuse noisy features into the subnetworks, making the optimization of the MussNet model more challenging. Therefore, improving the extract submodules is a promising direction for effective intermediate fusion in a single network-based approach.
F. Comparison With Previous Studies
Finally, we compared our method with conventional compressed video action recognition methods and summarize the results in Table 5. The results show that the MussNet model with the EFS modules was one of the most efficient methods in terms of computational complexity, even when ResNet50-TSM was used as the backbone. Only MTFD was more efficient than the MussNet model with ResNet34-TSM and ResNet50-TSM; however, the MussNet model with ResNet18-TSM was more efficient than MTFD while using the same backbone network. This is because MTFD still uses three ResNet18 networks to classify the compressed video features, whereas the MussNet model uses only one.
The accuracy comparison shows that our model was competitive with most previous methods, indicating that a single network can achieve the same level of accuracy as multiple networks. However, recent models such as SIFP, TEMSN, and MTFD were more accurate than the MussNet model while maintaining low computational complexity. One reason for this accuracy gap is that these methods introduce additional techniques, such as knowledge distillation, whereas the MussNet model is optimized using only the cross-entropy loss.
From this comparison, we consider that a single network-based method is a promising direction for efficient compressed video action recognition. The advantage of the single network-based method against previous methods is the efficiency even when relatively large networks (e.g., ResNet50-TSM) are used as the backbone network. However, the MussNet model does not reach the accuracy of state-of-the-art efficient compressed video action recognition methods.
Discussion
While the MussNet was one of the most efficient compressed video action recognition models in terms of computational complexity, our comparison with previous studies revealed that its accuracy does not match that of state-of-the-art methods, even when the EFS modules are introduced. We believe the reason for this gap is that we used only the standard cross-entropy loss for training, whereas state-of-the-art methods employ the cross-entropy loss together with other modules or loss terms to improve classification performance. For instance, DMC-Net and SIFP employ shallow modules that estimate optical flow from motion vectors and residuals and use the estimated optical flow as an additional input. Wu et al. used knowledge distillation from powerful yet heavy networks to optimize lightweight networks. Given that our model uses subnetworks instead of multiple networks, introducing most of the modules or loss terms used in previous studies is feasible. However, the MussNet model must still create subnetworks within the backbone network, and directly introducing these modules and loss terms may make training unstable. Therefore, introducing additional modules and loss terms to improve the accuracy of the MussNet model remains future work toward efficient yet effective compressed video action recognition.
A limitation of the proposed method is that it requires the backbone network to have sufficient capacity to hold multiple subnetworks. While we used the ResNet family, which consists of standard convolution layers, as our backbone, some modern networks use sparser layers, such as depthwise separable convolution [33] and MBConv [45], [46], to achieve more accurate yet efficient predictions. These sparse layers have fewer parameters than standard convolution layers and thus reduced capacity. Such modern networks can limit the power of the MussNet; hence, the MussNet needs to be improved to make the most of their limited capacity. Recently proposed techniques for improving MIMO-based ensembles [47], [48], [49] may help resolve this problem, but we leave this for future work.
Conclusion
In this study, we introduced the MussNet and the EFS module for efficient compressed video action recognition. The proposed method requires only one network to process compressed video features, reducing the overall computational complexity while maintaining the classification accuracy of the naïve late and intermediate fusion baselines.
Our experiments demonstrated that the MussNet achieved classification performance comparable to previous compressed video action recognition methods with significantly lower computational complexity. However, our model was optimized only with the cross-entropy loss, while some previous studies used more effective loss terms and modules. Furthermore, the MussNet is a MIMO-based method, so techniques that improve the MIMO method are also applicable to the MussNet. In future work, we will integrate such additional loss terms, modules, and techniques into our method to improve accuracy.
Appendix A
Practical Implementation of MussNet
In Listing 1, we show a PyTorch-like implementation of the MussNet model and the EFS module.
Appendix B
Tabular Results of Comparison
We show the tabular results of the comparison with the baselines in Table 6.