IEEE Transactions on Circuits and Systems for Video Technology | All Volumes | IEEE Xplore

Issue 3 • March 2023


Table of Contents

Publication Year: 2023, Page(s): C1 - C4


IEEE Transactions on Circuits and Systems for Video Technology publication information

Publication Year: 2023, Page(s): C2 - C2


Visual tempos show the dynamics of action instances, characterizing the diversity of the actions, such as walking slowly and running quickly. To facilitate action recognition, it is essential to capture visual tempos. To this end, previous methods sample raw videos at multiple frame rates or integrate multi-scale temporal features. These methods inevitably introduce two-stream networks or feature-…
This paper presents a perception-aware decomposition and fusion framework for underwater image enhancement (UIE). Specifically, a general structural patch decomposition and fusion (SPDF) approach is introduced. SPDF is built upon the fusion of two complementary pre-processed inputs in a perception-aware and conceptually independent image space. First, a raw underwater image is pre-processed to pro…
Cloud computing offers advantages in handling the exponential growth of images but also raises privacy concerns about outsourced private images. Reversible data hiding (RDH) over encrypted images has emerged as an effective technique for securely storing and managing confidential images in the cloud. Most existing schemes only work on uncompressed images. However, almost all images are transmitted a…
Zero-shot Learning (ZSL) aims to recognize novel classes through seen knowledge. The canonical approach to ZSL leverages a visual-to-semantic embedding to map the global features of an image sample to its semantic representation. These global features usually overlook the fine-grained information that is vital for knowledge transfer between seen and unseen classes, rendering these features sub-op…
Distortions from the spatial and temporal domains have been identified as the dominant factors governing visual quality. Although both have been studied independently in deep-learning-based user-generated content (UGC) video quality assessment (VQA) through frame-wise distortion estimation and temporal quality aggregation, much less work has been dedicated to their integration with deep represent…
The prevalence of short-video applications places greater demands on video quality assessment (VQA). User-generated content (UGC) videos are captured in unprofessional environments and thus suffer from various dynamic degradations, such as camera shaking. To cover these dynamic degradations, existing recurrent neural network-based UGC-VQA methods can only provide implicit modeling, which is…
In real-world crowd counting applications, the crowd densities in an image vary greatly. When facing density variation, humans tend to locate and count the targets in low-density regions and reason about the number in high-density regions. We observe that CNNs focus on local information correlation using a fixed-size convolution kernel, while the Transformer can effectively extract the semantic crowd…
Deep learning models are vulnerable to adversarial examples, as small perturbations of the input can cause wrong predictions. Most existing work on adversarial image generation tries to achieve attacks on most models, while few efforts are made to guarantee the perceptual quality of the adversarial examples. High-quality adversarial examples mat…
Self-attention-based video inpainting methods have achieved promising progress by establishing long-range correlation over the whole video. However, existing methods generally rely on global self-attention, which directly searches for missing contents among all reference frames but lacks accurate matching and effective organization of contents, which often blurs the result owing to the loss of…
In object detection, precise object representation is a key factor in successfully classifying and locating the objects in an image. Existing methods usually use rectangular anchor boxes or a set of points to represent objects. However, these methods either introduce background noise or miss the continuous appearance information inside the object, and thus cause incorrect detection results. In this paper, …
Visual place recognition is a challenging problem in robotics and autonomous systems because scenes undergo appearance and viewpoint changes in a changing world. Existing state-of-the-art methods rely heavily on CNN-based architectures. However, CNNs cannot effectively model image spatial structure information due to their inherent locality. To address this issue, this paper proposes a novel Tra…
Underwater enhanced images (UEIs) are affected not only by the color cast and haze caused by light attenuation and scattering, but also by the over-enhancement and texture distortion introduced by enhancement algorithms. However, existing underwater image quality assessment (UIQA) methods mainly focus on the inherent distortion caused by underwater optical imaging and ignore the widespread artificia…
The security protection of Ultra-High-Definition (UHD) video faces serious challenges due to changeable application scenarios. The video business depends heavily on the video format structure, which makes format compliance of the encryption algorithm essential. Existing HEVC Selective Encryption (SE) algorithms find it difficult to encrypt with format compliance while remaining independent of the en…
Camouflaged object detection is a challenging task that aims to identify objects whose texture is similar to the surroundings. This paper proposes to amplify the subtle texture difference between camouflaged objects and the background for camouflaged object detection by formulating multiple texture-aware refinement modules to learn texture-aware features in a deep convolutional neural network. T…
Unsupervised representation learning for videos has recently achieved remarkable performance owing to the effectiveness of contrastive learning. Most works on video contrastive learning (VCL) pull all snippets from the same video into the same category, even if some of them are from different actions, leading to temporal collapse, i.e., the snippet representations of a video are invariable with th…
Zero-shot learning (ZSL), an emerging topic in recent years, aims to distinguish unseen-class images by training the classifier on images from seen classes. Existing works often build embeddings between the global feature space and the attribute space, which, however, neglect the rich information in image parts. Discriminative information is usually contained in the image parts, e.g., black and whit…
Motivated by the intuition that the critical step of localizing a 2D image in the corresponding 3D point cloud is establishing 2D-3D correspondence between them, we propose the first feature-based dense correspondence framework for addressing the challenging problem of 2D image-to-3D point cloud registration, dubbed CorrI2P. CorrI2P is mainly composed of three modules, i.e., feature embedding, sym…
Establishing visual correspondences across semantically similar images is challenging due to intra-class variations, viewpoint changes, repetitive patterns, and background clutter. Recent approaches focus on cost aggregation to achieve promising performance. However, these methods fail to jointly utilize local and global cues to suppress unreliable matches. In this paper, we propose a cost aggrega…
Semantic segmentation is important for scene understanding. To handle scenes with adverse illumination conditions in natural images, thermal infrared (TIR) images are introduced. Most existing RGB-T semantic segmentation methods follow three cross-modal fusion paradigms, i.e., encoder fusion, decoder fusion, and feature fusion. Some methods, unfortunately, ignore the properties of RGB and TIR …
Zero-shot learning (ZSL) typically suffers from the domain shift issue, since the projected feature embeddings of unseen samples mismatch the corresponding class semantic prototypes, making it very challenging to fine-tune an optimal visual-semantic mapping for the unseen domain. Some existing transductive ZSL methods solve this problem by introducing unlabeled samples of the unseen domain, in …
Existing automated dubbing methods are usually designed for Professionally Generated Content (PGC) production, which requires massive training data and training time to learn a person-specific audio-video mapping. In this paper, we investigate an audio-driven dubbing method that is more feasible for User Generated Content (UGC) production. There are two unique challenges in designing a method for UGC…
This paper presents a novel Convolutional Neural Network (CNN) architecture for 2D human pose estimation from RGB images that balances high 2D human pose/skeleton estimation accuracy with rapid inference. It is thus suitable for safety-critical embedded AI scenarios in autonomous systems, where computational resources are typically limited and fast execution is often required, but accuracy…
Point clouds obtained by 3D scanning or reconstruction are usually accompanied by noise. Filtering-based point cloud denoising methods are simple and effective, but they are limited by their manually defined coefficients. Deep learning has shown excellent ability in automatically learning parameters. In this paper, a filtering network named PointFilterNet (PFN for short) is proposed to denoise point…

Contact Information

Editor-in-Chief
Wenwu Zhu
Tsinghua University
Beijing
China
wwzhu@tsinghua.edu.cn