Controllable Video Generation With Text-Based Instructions

Abstract:

Most existing studies on controllable video generation either transfer disentangled motion to an appearance without detailed control over the motion, or generate videos of simple actions, such as the movement of arbitrary objects, conditioned on a control signal from users. In this study, we introduce the Controllable Video Generation with text-based Instructions (CVGI) framework, which allows text-based control over the action performed in a video. CVGI generates videos in which hands interact with objects to perform the desired action, generating hand motions with detailed control through text-based instructions from users. By incorporating a motion estimation layer, we divide the task into two sub-tasks: (1) control signal estimation and (2) action generation. In control signal estimation, an encoder models actions as a set of simple motions by estimating low-level control signals for text-based instructions given the initial frames. In action generation, generative adversarial networks (GANs) generate realistic hand-based action videos as a combination of hand motions conditioned on the estimated low-level control signals. Evaluations on several datasets (EPIC-Kitchens-55, BAIR robot pushing, and Atari Breakout) show the effectiveness of CVGI in generating realistic videos and in controlling actions.
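
The two-stage split described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the learned encoder is replaced by a hypothetical word-to-motion lookup table, the GAN by a deterministic renderer that moves a single "hand" pixel, and every name here (`MOTIONS`, `find_hand`, `estimate_control_signals`, `generate_video`) is an assumption made only for illustration. It shows only how an instruction and an initial frame might first be turned into low-level control signals, which then condition frame generation.

```python
from typing import List, Tuple

Frame = List[List[int]]    # toy binary frame; 1 marks the "hand"
Control = Tuple[int, int]  # hypothetical low-level control signal: (d_row, d_col)

# Hypothetical word-to-motion table standing in for a learned encoder.
MOTIONS = {"left": (0, -1), "right": (0, 1), "up": (-1, 0), "down": (1, 0)}


def find_hand(frame: Frame) -> Tuple[int, int]:
    """Return (row, col) of the first pixel set to 1."""
    for r, row in enumerate(frame):
        for c, value in enumerate(row):
            if value == 1:
                return r, c
    raise ValueError("no hand pixel in frame")


def estimate_control_signals(instruction: str, initial_frame: Frame) -> List[Control]:
    """Stage 1 stand-in: map an instruction and the initial frame to a
    sequence of simple motions, skipping moves that would leave the frame."""
    height, width = len(initial_frame), len(initial_frame[0])
    r, c = find_hand(initial_frame)
    signals: List[Control] = []
    for word in instruction.lower().split():
        if word not in MOTIONS:
            continue  # words without a known motion are ignored
        dr, dc = MOTIONS[word]
        if 0 <= r + dr < height and 0 <= c + dc < width:
            r, c = r + dr, c + dc
            signals.append((dr, dc))
    return signals


def generate_video(initial_frame: Frame, signals: List[Control]) -> List[Frame]:
    """Stage 2 stand-in: render one new frame per control signal."""
    height, width = len(initial_frame), len(initial_frame[0])
    r, c = find_hand(initial_frame)
    video = [initial_frame]
    for dr, dc in signals:
        r, c = r + dr, c + dc
        frame = [[0] * width for _ in range(height)]
        frame[r][c] = 1
        video.append(frame)
    return video


first = [[0, 0, 0],
         [0, 1, 0],
         [0, 0, 0]]
controls = estimate_control_signals("move right then down", first)
video = generate_video(first, controls)
```

In this sketch the text never reaches the generator directly; it only sees the estimated control signals, which mirrors the division into sub-tasks (1) and (2) above.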

Published in: IEEE Transactions on Multimedia (Volume: 26)

Page(s): 190 - 201

Date of Publication: 29 March 2023

DOI: 10.1109/TMM.2023.3262972

I. Introduction

Deep architectural models such as convolutional neural networks (CNNs) and generative adversarial networks (GANs) enable the generation of high-dimensional data such as images [1], [2], [3], [4], [5], [6] and videos [7], [8], [9], [10], [11], [12]. These models can also manipulate given high-dimensional data according to a desired manipulation. For example, image manipulation and editing architectures [13], [14], [15], [16] allow users to transfer the style of one image to another.
