I. Introduction
Because facial expressions are inherently dynamic, many works explore spatio-temporal features for facial expression recognition (FER) from videos. Although a multi-frame sequence carries richer information, and the temporal correlation between consecutive frames is usually helpful for recognition, video also introduces significant redundancy. Moreover, because the muscle movements in FER videos are subtle, the signal-to-noise ratio (SNR) can be low [1], [2].