1. INTRODUCTION
As portable devices and social media applications become widely accessible, users are growing accustomed to capturing information through short videos, which are characterized by multiple modalities, limited-length time series, and multi-faceted assessments. A short video typically consists of multiple modalities, including image (cover image), video (video content), and text (video title, author information, etc.), and is tagged with indicators such as "clicks", "likes", "shares", and "comments". According to statistics, the monthly active users of TikTok and Douyin (two popular video-sharing and social media platforms) have reached 1.2 billion [1] and 934 million [2], respectively. With such a large user community, short videos account for a large portion of internet traffic and have become a commercially important carrier, contributing to commercial promotion and revenue. A practical issue arising from short videos is assessing the value that a video may potentially produce, i.e., Short Video Quality Assessment (SVQA).