Introduction
Movies, one of the most complex visual art forms, simulate experiences that communicate ideas, stories, perceptions, feelings, beauty, or atmosphere through the use of moving images. In recent years, there has been increasing computing research on various aspects of movies, such as movie shot retrieval [1], action recognition [2], and question answering [3]. Many movie-domain datasets [2]–[8] have also been proposed to facilitate movie understanding. These datasets are growing larger and often include diverse annotations such as subtitles, plots, and descriptions.
Despite the proliferation of movie-related datasets, far too little attention has been paid to the story structure in movies: the way they express the stories.
Naturally, videos are the carrier of movie stories. However, stories are only implicitly contained in video, which consists of shots, visual elements, sounds, dialogues, etc. We argue that video clips, as the final format of film making, usually mix multiple storylines, making the story structure complex and unclear. For example, the minimum video clip in many existing movie datasets [3], [7] is minutes long. Likewise, the corresponding text descriptions are highly generalized (plots or synopses). Long videos and high-level text make optimization difficult in certain tasks. Some datasets use cartoons to highlight story structure [9], [10]; cartoons favour simple and clear story content but ignore real-world visual elements. We agree that a clear story structure is beneficial: it makes a dataset a better testbed for movie story understanding.
Actually, stories in movies can be explicitly expressed using the storyboard, a graphic layout of sequential illustrations and images to visually tell a story. Storyboards are useful in filming movies to express the key moments and outline the events. The usage of storyboards shows that movie clip videos can be condensed to keyframes. Inspired by this, we aim to construct a storyboard-based dataset to clearly outline story structure to facilitate story understanding. We name the dataset Script-to-Storyboard (Sc2St), along with which we introduce a new processing task in the movie domain: contextual text-based image retrieval. Figure 1 shows a storyboard sample and the retrieval task. A storyboard in our work is composed of a series of still video frames, with drawings/pictures of sequential key events and shots in a film.
In the Sc2St task, we take a multi-sentence paragraph (or script, for short) as the story description. Then, for each movie frame, given its query text, the task asks a model to retrieve the best match for the current frame from a prepared list of candidates. These candidates are carefully collected in-movie and cross-movie frames, serving as a benchmark for evaluating retrieval performance. Section 4.1 defines the task and the evaluation setting in detail.
This task differs from conventional text-based retrieval, such as text-based image and text-based video retrieval. Text-based image retrieval uses images as visual content, but it only considers independent text-image pairs without additional context. Text-based video retrieval uses video clips as visual content. Although videos contain frames that provide in-clip context, densely sampled neighbouring frames look much the same, resulting in redundant visual information and an unclear contextual structure. Moreover, the text-based video retrieval task ignores cross-video context, making it similar to text-based image retrieval except for its use of clips rather than images. The Sc2St task poses deeper challenges: in a storyboard, a future keyframe is related not only to its textual caption but also to the script telling the story, and it should be visually coherent with previous keyframes. Existing context-aware image retrieval methods model context only from the text [11], [12], e.g., building textual context from the sentence level to the paragraph level. In comparison, the Sc2St task requires a model to capture contextual information from both visual and textual streams.
Fig. 1 Contextual retrieval task using the Script-to-Storyboard dataset. A storyboard's future keyframes are retrieved one by one, given a movie summary script comprising multiple sentences that depict the whole story, the keyframe history, and the sentences for future frames. Each round of retrieval selects from a list of frame candidates.
In constructing the Sc2St dataset, we took into account that there are already ample datasets in the movie domain, such as Refs. [2]–[8]. To avoid repeated annotation, we selected the LSMDC [6] movie collection as a basis and provide further fine-grained text and keyframe annotations for it. We processed all 165 available movies in three steps. Firstly, we produced fine-grained text by systematically parsing the original video-level text descriptions into sentences; using rule-based methods to recognize "splittable" sentences, we further split those into sub-sentences to generate finer, elementary text instances. Secondly, we identified keyframes by selecting the most representative images from the video clips for each text instance using similarity-based and manual screening, taking into account both objective and subjective factors. Thirdly, we formed stories as storyboards containing sequences of text-keyframe pairs whose keyframes are temporally close in the original movie, to ensure semantic storyline coherence. The storyboard length is set to ten items by default in our work but can be varied. The final Sc2St dataset consists of ~20.4k storyboards with ~61.2k unique keyframes and ~44k unique text instances (sentences or sub-sentences describing the keyframes). Statistical analysis reveals that the newly created Sc2St dataset has several advantages: (i) finer-grained text than existing movie datasets, (ii) highly diverse textual descriptions, and (iii) better story coherency than existing story understanding datasets. To tackle the contextual retrieval task on the new dataset, we propose a recurrent model framework with three variants targeting dual-way context encoding. We have conducted extensive experimental comparisons to existing text-based image and text-based video retrieval approaches. We further evaluate human performance for comparison using an online testing system. We also provide detailed analyses and discussions to demonstrate the effectiveness of our model in capturing contextual information in the new task.
In summary, our main contributions are:
a new benchmark dataset, Sc2St, in the movie domain with a novel story structure: it comprises storyboards with coherent stories, fine-grained texts, and semantic keyframes,
the contextual retrieval task (script-to-storyboard): given a script, it aims to retrieve future keyframes, respecting both the corresponding text and coherence with the keyframe history,
baseline methods for this task, with extensive experimental comparison to existing methods and human performance, and
discussions of the effectiveness of our methods and of the potential of our dataset for image generation.
Related Work
2.1 Related Datasets
The proposed dataset has two defining features: its content is in the movie domain, and its storyboard form relates to story understanding. We therefore review related datasets from two perspectives: movie datasets and datasets involving sequential story-based tasks.
2.1.1 Movie Datasets
MovieQA [3] was collected from 408 movies and targets story and video understanding using question-answering. The data include video clips, plots, subtitles, scripts, etc. However, there are some concerns about using MovieQA for story understanding: video clips are not always provided, the text sources vary greatly in detail and are not rich in content, and the video clips are minutes long and lack fine-grained timestamps. Going beyond MovieQA, we provide a keyframe-based story dataset with fine-grained annotation. MovieGraphs [4] provides graph-based annotations of detailed social situations from 51 movies. The graphs consist of various nodes capturing the characters' presence, their emotional and physical attributes, their relationships and interactions, etc. MovieGraphs focuses mainly on using graphs for movie situation recognition, while our work targets contextual movie story understanding using textual and visual data. AVA [2] is an action recognition dataset sourced from 430 movies, with annotations covering 80 atomic visual actions in space and time on 15-min video clips. Although the AVA dataset has densely labelled person-centric actions, the clips used are only part of the original movies and the action information is not rich enough for story understanding compared to textual descriptions. MSA [5] contains 327 movies for movie story understanding via matching between movie segments and synopsis paragraphs. It gives each movie a synopsis and provides annotated associations between corresponding synopsis paragraphs and movie segments. Although MSA also splits each movie segment into multiple shots (events), unlike our fine-grained text and keyframe pairs, it may match one synopsis paragraph to many movie shots owing to the high-level descriptive nature of the movie synopsis. LSMDC [6] consists of nearly 128k video clips annotated with detailed descriptive sentences from around 200 movies. The textual descriptions are collected from transcribed audio description (AD), which gives a descriptive narration of the important visual elements of movie clips for visually impaired people. The use of AD ensures that the textual description captures the key story as well as the necessary details, unlike high-level synopses/plots or redundant subtitles with little visual narrative. Our proposed dataset is based on LSMDC; the differences are: (i) we use semantic keyframes rather than videos as the visual form, (ii) we further prepare fine-grained text-keyframe pairs by aligning each sub-sentence with a key image, and (iii) we introduce additional character and meta-annotations to enrich the existing dataset. Condensed Movies [8] targets long-range understanding of the narrative structures of movies. It consists of around 36k movie key scenes with high-level descriptions and character face tracks. The collected movie segments are freely available on YouTube and the number of movies involved is much larger than for other movie datasets. However, as each movie includes roughly ten segments (the key scenes), the clip duration is long while the description is short, and learning the relationship between coarse text and complex video is difficult. The recent MovieNet [7] is a holistic dataset for movie understanding. It was sourced from 1.1k movies and comprises annotations of movie trailers, photos, plot descriptions, character information, scene boundaries, descriptions, etc.
Despite the various, large-scale annotations, this dataset shares a problem with Condensed Movies: the aligned movie segments and descriptions are at a high level, i.e., the text source is a synopsis paragraph and the clips are up to a few minutes.
2.1.2 Story Datasets
PororoQA [9] focuses on video story question-answering on cartoon videos, with around 16k scene-dialogue pairs. The dialogues contain fine-grained sentences for scene descriptions, so the Pororo dataset offers not only rich descriptive detail but also simple and coherent story structures. Related research also uses Pororo for story-based image generation [13]. However, the cartoon domain restricts the diversity of genres, scenes, and characters compared to the hundreds of movies in our dataset. CoDraw [10] is a collaborative image-drawing game built from visual, movable clip-art objects. The game has two players communicating to construct a scene: a teller describes an abstract scene while a drawer reconstructs the scene or asks for details. The collected dataset contains ~10k multi-round dialogues with corresponding scenes. CoDraw is similar to our dataset in that each round's scene image is like a movie keyframe and each dialogue tells a coherent story. However, unlike our movie-based dataset, CoDraw is based on cartoon art in which the visual elements are simple and the possible actions are predefined. The Visual Storytelling dataset [14] was proposed with the aim of generating image sequences from language. It has a similar structure to ours: consecutive images, each provided with a corresponding description. However, the data are collected from the Internet, which constrains how image sequences capable of forming a story are assembled: for example, some topics such as "birthday" or "party" were manually pre-defined. Also, some images are missing because they were deleted by their posters.
2.2 Related Methods
Visual-language retrieval is the most closely related area, particularly text-based image and text-based video retrieval; we also investigate retrieval tasks involving sequential encoding. Text-to-image generation tasks are also discussed since our Sc2St dataset has the potential for sequential image generation.
2.2.1 Text-Based Image Retrieval
Existing works focus on visual-semantic embedding for learning the similarities between the two modalities. Some methods have been proposed to improve the ranking losses used, such as exploiting hardest negative pairs [15] or instance losses [16] to improve the discriminative representation, and projection classification loss [17] to categorize one modality representation vector to another. Other methods further employ fine-grained image-sentence matching [18], [19], e.g., matching words with image regions. Recently, the pre-training-based transformers [20] have become more popular, achieving significant performance gains in multiple tasks. Representative visual-language transformers include Uniter [21] and Oscar [22], which leverage word-region alignment to learn the image-text representations.
2.2.2 Text-Based Video Retrieval
A widely used approach in video-language retrieval is to learn a joint embedding space for texts and videos [23]. Recent state-of-the-art methods [23], [24] follow the mixture-of-experts (MoE) paradigm [25], combining several different embeddings from pre-trained models for video representation. Since videos are composed of frames, an image can be treated as a single-frame video and represented by image features [26], so the MoE framework can also be adopted for our task.
2.2.3 Contextual Retrieval
Among retrieval tasks, contextual retrieval considers the context present in the structure of either the text or the images/videos. The context usually exists in sequential data, e.g., the sequence of sentences in a text or the consecutive frames of a video. Ref. [11] proposes a hierarchical (sentence-to-story) text encoder to encode each sentence and retrieve the relevant images. Ref. [12] aims to create storyboards by combining retrieval and style transfer; its story-to-image retriever also uses hierarchical text encoding, from word to sentence to story. However, these methods capture context only from the text. The method most similar to ours is the Contextual Mixture of Embedding Experts model (CMoEE) [8], which adds context from both past and future movie clips to learn text-video similarity. Its experts include not only movie scene/object representations but also character embeddings. However, it performs no textual context modelling and imposes no global story description as a constraint.
2.2.4 Text to Visual Content Generation
Reed et al. [27] first proposed the text-to-image (T2I) model using conditional generative adversarial networks (GANs). Afterwards, other research made various improvements, e.g., to image quality by using coarse-to-fine structure [28], [29], text-based image consistency based on an attention mechanism [30]–[32], etc. Recently, large-scale T2I models have brought remarkable advances in realistic image synthesis [33]–[35]. Instead of generating a single image, Ref. [13] can synthesize a series of cartoon images using a recurrent-based generative model. Similarly, Ref. [36] iteratively generates images from continual linguistic instructions at multiple steps.
Some applications take intuitive user input, e.g., paragraphs or multiple sentences, for content creation [37], such as text-guided storytelling [14] and video editing [38]. Ref. [14] focuses on sequential image retrieval, while Ref. [38] creates video montages from retrieved video shots based on user-specified texts. These applications share a similar sequential retrieval form with Sc2St, but the aims differ: the Sc2St dataset naturally contains movie story context and is built with the objective of contextual retrieval analysis, whereas Refs. [14], [38] use general topics (tours, parties, or animals) to create the video context with the objective of content creation.
The Sc2St Dataset
3.1 Dataset Construction
The successful use of storyboards in film making reveals their power to illustrate a condensed story. Their sequential structure explicitly delivers a concise yet coherent visual story, so we adopt the storyboard as our data form. Similar image-based story datasets include CoDraw [10] and Visual Storytelling [14]. However, Ref. [10] uses synthetic cartoon data that is limited in scene and character variety and lacks the generality of the real world, while Ref. [14] collects image-text pairs independently from the Internet and constructs stories manually; as a result, story cohesion may not be strong and neighbouring image styles often differ. We therefore choose the movie domain: movies not only naturally contain underlying storylines, but also provide rich real-world visual content accompanied by textual descriptions, subtitles, plots, etc.
3.1.1 Data Source Selection
To construct the Sc2St dataset, we started from several existing movie datasets [2]–[8], which cover a wide range of movies with various annotations. Using existing datasets brings two advantages. First, it provides alignment with an existing dataset format, and it is convenient to use a familiar dataset for research: for example, MS-COCO [39] has drawn great attention, with follow-up work providing additional annotations [40], [41] on top of the original dataset. Second, it saves significant time and labour, since some annotations, e.g., subtitles and descriptions, do not change once collected for a movie.
We carefully explored existing movie understanding datasets, such as MovieQA [3], LSMDC [6], MovieNet [7], and Condensed Movies [8]. After investigating their availability, accessibility, and richness, we selected LSMDC as the data source on which to build our dataset. LSMDC stems from an active movie understanding challenge and has well-maintained movie videos and textual annotations. The entire LSMDC has 204 movies of various genres, including action, science fiction, family, documentary, etc. In particular, LSMDC provides short clips (of several seconds) with corresponding detailed descriptions; in contrast, other datasets such as Ref. [8] only contain brief descriptions for lengthy movie segments (of several minutes). Table 2 quantitatively compares the average clip duration, showing that LSMDC has shorter clips (4.1 s). In addition, the text annotations of LSMDC were collected from two main sources, movie scripts and audio descriptions (AD). The latter are usually used to help people with visual impairments understand movies, so are more accurate and descriptive, and they are well (manually) aligned with the corresponding movie clips to form video-text pairs. Unlike other annotation formats (e.g., movie synopses, subtitles, and plots), the annotated text here provides a narrative description of the storyline as well as the key visual elements in the movie clips. Finally, after excluding the movies without accessible annotations, e.g., blind test sets, we obtained 165 movies in total, each with hundreds of video clips and textual annotations.
Although each movie in LSMDC is split into hundreds of clips of varying duration (typically a few seconds), the video clips cannot be directly used for our task since they contain consecutive frames. Also, the corresponding text description for one clip generally contains multiple sentences and needs further splitting. We then perform several processing steps to build a keyframe-based story dataset with fine-grained textual description: (i) text processing which parses sub-text from the original description, (ii) keyframe selection that extracts the most representative keyframe image representing the given text description, and (iii) story formation to build stories of a specific length for the retrieval task. We next describe these three processing steps in detail.
3.1.2 Text Processing
The descriptions of the original text-video pairs in LSMDC usually comprise multiple sentences or sub-sentences. Using them directly would require selecting multiple keyframes covering different scenes, whereas we believe that elementary sentence-image pairs are more helpful for fine-grained text-image understanding. To construct fine-grained matches, we designed an automatic two-step text processing procedure. First, we perform language analysis using the spaCy tool to parse and split the text into sentences. Second, we further split the sentences into sub-sentences using a rule-based method: based on an analysis of the text, we collected a list of conjunctions (e.g., "then"), delimiters (e.g., semicolons), and other particular symbols (e.g., consecutive dashes) at which to split. An example is shown in Fig. 2, where scripts (1) and (2) originate from the same sentence, and the corresponding frames are well matched to each sub-sentence. This clearly demonstrates that fine-grained extraction of sub-sentences assists the selection of more specific keyframes. We next describe how the keyframes are selected and aligned with the given text.
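To make the two-step procedure concrete, the sketch below shows one possible implementation, assuming spaCy's en_core_web_sm pipeline and an illustrative (not the paper's actual) list of split markers.

```python
# A minimal sketch of the two-step text processing: sentence splitting with spaCy,
# then rule-based sub-sentence splitting. The split markers below are illustrative;
# the paper collects its own list from an analysis of the LSMDC text.
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed spaCy English pipeline

# Hypothetical markers: a conjunction ("then"), a delimiter (";"), consecutive dashes.
SPLIT_PATTERN = re.compile(r"\s*(?:\band then\b|\bthen\b|;|--+)\s*", flags=re.IGNORECASE)

def split_description(description: str) -> list[str]:
    """Parse a clip-level description into sentences, then into sub-sentences."""
    sentences = [s.text.strip() for s in nlp(description).sents]
    sub_sentences = []
    for sent in sentences:
        parts = [p.strip() for p in SPLIT_PATTERN.split(sent) if p.strip()]
        sub_sentences.extend(parts if parts else [sent])
    return sub_sentences

# Example: one LSMDC-style description becomes two elementary text instances.
print(split_description("He grabs the rope; then he swings across the street."))
```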
Fig. 2 A storyboard example from the Spiderman movie. This example is composed of 10 keyframe-description pairs, but can be used to construct storyboards of flexible length. The characters' names are shown in uppercase for clarity. Note that each description can be part of its original LSMDC clip-level description, such as descriptions (1) and (2): this enables pairing with fine-grained keyframes in our dataset.
3.1.3 Keyframe Selection and Alignment
Videos are composed of successive frames. We first sample raw frames at 5 frames per second from the original LSMDC video clips. The sampled frames are usually redundant and often contain similar or even duplicated visual information. To select keyframes that are highly expressive and well matched to the text, an intuitive idea is to use a rule-based method, such as selecting a frame at a fixed position (e.g., the middle) or selecting a frame at random. However, a raw clip in LSMDC may contain unrelated frames at its start or end due to labelling errors. For example, Fig. 3 shows two frames of Dobby wrongly included in the gateau scene (Harry Potter and the Chamber of Secrets). Therefore, we use a two-step process: semantic alignment based on scoring the frame-text match, followed by human screening to catch errors.
We compute the text-image similarity using a universal pre-trained model [42], which has also been used for constructing multi-modal datasets [43], [44] by aligning images and text. Then the frames are sorted based on their cosine similarity to the query text. By default, the top-ranked frame is taken as the keyframe, as it shares the most semantic features with the text instance.
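The following sketch illustrates this ranking step, assuming a CLIP-style image-text model in place of the unnamed universal pre-trained model [42]; the checkpoint name, frame paths, and query text are placeholders.

```python
# Sketch of the automatic keyframe selection: rank sampled frames by their
# cosine similarity to the query (sub-)sentence using a CLIP-style model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_frames(frame_paths: list[str], query_text: str) -> list[tuple[str, float]]:
    """Return (frame_path, similarity) pairs sorted by decreasing similarity."""
    images = [Image.open(p).convert("RGB") for p in frame_paths]
    inputs = processor(text=[query_text], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # image_embeds / text_embeds are returned L2-normalised by the model.
    sims = (out.image_embeds @ out.text_embeds.T).squeeze(1)
    order = sims.argsort(descending=True)
    return [(frame_paths[i], sims[i].item()) for i in order.tolist()]

# The top-ranked frame is taken as the keyframe before manual screening, e.g.:
# ranked = rank_frames(["clip_0001/f001.jpg", "clip_0001/f002.jpg"], "SOMEBODY walks in.")
```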
We then perform a manual screening process to check whether the selected keyframes are reasonable from a human point of view. Specifically, we show the keyframes selected by image-text similarity to three experienced annotators and ask them to check whether each keyframe (i) matches the text, and (ii) is more representative than other similar frames. If all annotators agree with the initial selection, it is kept; otherwise, voting determines the final keyframe. In the case of a tie, further discussion is held, followed by another vote.
3.1.4 Story Formation
After obtaining fine-grained text descriptions and aligned keyframes, the final step is to construct the Sc2St data samples in story form. A storyboard is a sequence of consecutive keyframes, each paired with a sentence or sub-sentence, such that all the sentences together form the script telling the whole story. The clips from which the keyframes are derived should be temporally close (<10 s apart) to ensure semantic storyline coherence. In our implementation, we set a fixed image sequence length (10 in our experiments) for each storyboard to balance the feasibility and difficulty of the task. Story formation has two main steps: clip grouping and within-group story formation.
The video clips in the original LSMDC are cut from movies, and neighbouring clips may have a gap between them. The gap length is the difference between the current clip's starting timecode and the previous clip's ending timecode. A small gap indicates that the neighbouring clips very likely belong to the same scene. After experimentation, we chose 10 s as the threshold for grouping clips.
For each group, starting from the first clip, we add each successive clip's associated keyframes with paired sentences to form a storyboard until its length is at least the required story length. If the storyboard has a greater length, we remove the extra keyframe-text pairs at the end to provide the required storyboard length. After a storyboard is generated, we move to the next clip and restart grouping.
Here, we specify that a story has a fixed length, following Refs. [13], [14] for easier evaluation and benchmarking. Note that we could easily generate storyboards with flexible lengths for different experiments and scenarios using the above methods. Figure 2 presents an example of a storyboard with its script consisting of sentences or sub-sentences. More storyboard samples can be found in the Appendix.
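The sketch below outlines one possible implementation of the clip grouping and within-group story formation described above, under the stated 10 s threshold and fixed storyboard length; the Clip record and its fields are hypothetical.

```python
# Minimal sketch of clip grouping and story formation; data structures are illustrative.
from dataclasses import dataclass

@dataclass
class Clip:
    start: float   # starting timecode (seconds)
    end: float     # ending timecode (seconds)
    pairs: list    # (sub_sentence, keyframe_path) pairs for this clip

def group_clips(clips: list[Clip], max_gap: float = 10.0) -> list[list[Clip]]:
    """Group temporally close clips: a gap above max_gap starts a new group."""
    groups, current = [], [clips[0]]
    for prev, cur in zip(clips, clips[1:]):
        if cur.start - prev.end <= max_gap:
            current.append(cur)
        else:
            groups.append(current)
            current = [cur]
    groups.append(current)
    return groups

def form_storyboards(group: list[Clip], length: int = 10) -> list[list]:
    """Greedily accumulate keyframe-text pairs until the target length is reached."""
    storyboards, buffer = [], []
    for clip in group:
        buffer.extend(clip.pairs)
        if len(buffer) >= length:
            storyboards.append(buffer[:length])   # drop extra pairs at the end
            buffer = []                           # restart from the next clip
    return storyboards
```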
3.2 Dataset Analysis
3.2.1 Overview
The final Sc2St dataset consists of ~20.4k storyboards covering ~61.2k distinct keyframe images, ~204k sentences (~44k distinct sentences), ~21k unique words, and ~2.9k characters. Taking the 10-image storyboards as an example, most scripts in the Sc2St dataset contain 80–127 words (the 20th and 80th percentiles, respectively), with around 108 words on average. Using part-of-speech analysis, Fig. 4(a) shows the distribution of unique nouns and verbs per script, and Fig. 5 illustrates a word cloud of the most frequently used verbs, nouns, attributes, and characters. The following analyzes the dataset characteristics from various perspectives.
Fig. 3 In the LSMDC dataset, frames at the edge of a shot can be unrelated to the scene described by the text due to manual clip segmentation errors. Red boxes show such frames, which belong to the previous shot.
3.2.2 Descriptive and Fine-Grained Text
Compared to existing movie and story understanding datasets, the text annotations in our Sc2St dataset have two differentiating features: they are descriptive and fine-grained. Table 1 shows the characteristics of the text. Most existing datasets use high-level (e.g., plot or synopsis) or raw (e.g., subtitle) textual annotations. The Sc2St dataset instead relies on descriptive sources (audio descriptions and movie scripts), which are both descriptive and aware of the visual content. Although other datasets such as MovieGraphs and Condensed Movies also contain descriptions, their corresponding videos are much longer. We categorize the visual units into short clips, long clips, and segments by increasing video duration; movie clips have detailed text annotation while segments have coarse annotation. Tables 1 and 2 summarize information about the visual units. Only LSMDC and PororoQA contain short clips. Our Sc2St dataset has finer text than the original LSMDC, e.g., 1.08 sentences per clip (#sents/unit) compared to 1.0 in LSMDC: on average, a video clip is described by more (sub-)sentences. The duration-per-sentence (dur./sent) values also confirm this. Among image-based datasets, our word and sentence statistics are similar to those of the Visual Storytelling dataset.
Fig. 4 (a) Distribution of unique verbs and nouns within the storyboard scripts in the Sc2St dataset. (b) Percentage coverage of unique words (y-axis) against dataset coverage (x-axis), compared to other image-sequence-based story datasets; VST is the Visual Storytelling [14] dataset. A steeper curve means the unique words are used up more quickly as dataset coverage grows. The Sc2St dataset has a gentler curve and thus more diverse text.
Fig. 5 Word clouds of the most frequent verbs, nouns, attributes, and characters in our Sc2St dataset.
3.2.3 Text Diversity
In terms of text diversity, we compare our Sc2St dataset to other similar story datasets containing sequential text-image pairs: Pororo [9], CoDraw [10], and Visual Storytelling (VST) [14]. Figure 4(b) shows the cumulative coverage of unique words as the proportion of each dataset covered grows: the Sc2St curve rises more gently than those of the other datasets, indicating a more diverse vocabulary.
3.2.4 Story Coherency
For story understanding, datasets in the movie domain usually use videos as the visual content, and face several challenges in reflecting a clear story structure. First, minutes-long videos contain many shots, resulting in complex, disjointed story structures with varied backgrounds and characters [3], [7], [8]. Second, the discrepancy between the dense video format and simple textual descriptions leads to poor cross-modality alignment [7], [8]. Third, shorter video clip-text pairs often ignore the larger context [6]. These problems were noted in Ref. [9], which proposed the cartoon-video-based Pororo dataset to leverage the simple storylines of cartoon art to maintain story coherency. Instead of videos, a derived image version, Pororo-SV, uses images extracted from the videos to build stories for Story Visualization [13]. Similarly, CoDraw and Visual Storytelling use image sequences to reflect the context. Our proposed Sc2St dataset also uses a sequential approach to tell stories, while its movie-based visual content is richer than that of cartoon datasets and more coherent than open-domain image sequences. Statistics are shown in Table 1 (last column).
Benchmark Evaluation
In this section, we first elaborate on the task definition with a specifically designed evaluation protocol, and then evaluate both baselines and the state-of-the-art in text-based image and video retrieval on the proposed Sc2St dataset. We further propose our own approaches with three variants targeting the contextual retrieval task. Finally, we show how we adapt the evaluation protocol to human participants.
4.1 Contextual Retrieval Task
4.1.1 Task Definition
Given a paragraph of movie script whose sentences describe the whole story, retrieval proceeds in rounds: at each round, the model receives the query sentence for the current keyframe, the full script, and the keyframe history, and must retrieve the best-matching keyframe from a prepared list of candidates.
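One possible formalisation of a retrieval round (the notation is ours, not the paper's): given the script $S=\{s_{1},\dots,s_{N}\}$, the keyframe history $F_{<t}=\{f_{1},\dots,f_{t-1}\}$, and the candidate set $\mathcal{C}_{t}$, round $t$ retrieves
\begin{equation*}\hat{f}_{t}=\operatorname*{arg\,max}_{c\in\mathcal{C}_{t}}\ \mathrm{sim}\left(c,\ \Phi\left(s_{t},S,F_{<t}\right)\right)\end{equation*}
where $\Phi$ encodes the query sentence together with its textual and visual context and $\mathrm{sim}$ is a learned text-image similarity.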
4.1.2 Dataset Splits
The Sc2St dataset contains 20,413 storyboards, each of which has 10 images and thus requires 9 rounds of retrieval. The dataset is split into 16,330 storyboards for training (80%), 1022 for validation (5%), and 3061 for testing (15%), giving 146,970 training, 9198 validation, and 27,549 testing rounds (183,717 in total).
4.1.3 Candidate Keyframes
We prepare the candidates as follows. For each ground-truth image in a storyboard sample, the candidate set includes 1 correct image and 99 incorrect images of two kinds: similar images (~70% of the candidates) and random images (~30%). For the similar images, we first extract 1024-dimensional features for all keyframe images using the 121-layer DenseNet [45] and compute the cosine similarity matrix over all keyframes. Each keyframe is then assigned a set of its most similar candidates, drawn from both the same movie and other movies, with about 30% from the same movie to ensure a certain level of difficulty. The random images are randomly selected from the other keyframes of the current movie (series) and of other movies, again in the same ratio.
Note that there are no overlapping candidates across the different data splits to avoid data leakage. The candidates can include those that do not belong to any storyboard, i.e., isolated ones not qualified to form a story.
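A minimal sketch of this candidate construction from precomputed DenseNet-121 features is given below; the proportions follow the text (~70% similar, with ~30% of those from the same movie, and ~30% random), but the array layout and argument names are illustrative, and the random portion is simplified.

```python
# Sketch: build 99 distractor candidates for one ground-truth keyframe from
# precomputed features (feats: N x 1024) and per-keyframe movie ids (movie_ids: N).
import numpy as np

def build_candidates(anchor_idx, feats, movie_ids, n_total=99,
                     n_similar=70, same_movie_ratio=0.3, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sims = normed @ normed[anchor_idx]                       # cosine similarity
    sims[anchor_idx] = -np.inf                               # exclude the ground truth

    same = np.flatnonzero(movie_ids == movie_ids[anchor_idx])
    other = np.flatnonzero(movie_ids != movie_ids[anchor_idx])
    n_same = int(round(n_similar * same_movie_ratio))

    similar = np.concatenate([
        same[np.argsort(-sims[same])][:n_same],              # hardest in-movie distractors
        other[np.argsort(-sims[other])][:n_similar - n_same],
    ])
    # Fill the remainder with random keyframes (the paper again keeps ~30% of these
    # from the same movie; that detail is omitted here for brevity).
    pool = np.setdiff1d(np.arange(len(feats)), np.append(similar, anchor_idx))
    random_part = rng.choice(pool, size=n_total - n_similar, replace=False)
    return np.concatenate([similar, random_part])
```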
4.1.4 Evaluation Metrics
Each round of retrieval is prepared with a list of 100 keyframe candidates. The tasks are automatically evaluated using retrieval ranking scores on the candidate lists: (i) recall@k (Rk, for k = 1, 5, 10), the percentage of rounds in which the ground-truth keyframe is ranked within the top k candidates, (ii) the mean rank (Mean) of the ground-truth keyframe, and (iii) the mean reciprocal rank (MRR).
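A small sketch of these ranking metrics, computed from the 1-based rank of the ground-truth keyframe in each round's candidate list:

```python
# Ranking metrics (Rk, mean rank, MRR) over a set of retrieval rounds.
import numpy as np

def ranking_metrics(gt_ranks, ks=(1, 5, 10)):
    ranks = np.asarray(gt_ranks, dtype=float)
    metrics = {f"R{k}": 100.0 * np.mean(ranks <= k) for k in ks}  # recall@k (%)
    metrics["Mean"] = float(ranks.mean())                         # mean rank
    metrics["MRR"] = float(np.mean(1.0 / ranks))                  # mean reciprocal rank
    return metrics

print(ranking_metrics([1, 3, 12, 50, 2]))  # toy example with five rounds
```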
4.2 Baselines
We consider two baselines to evaluate whether methods are better than chance. The prior baseline gives random results over the candidates without using any inputs, while the similarity baseline ranks the candidates by descending cosine similarity between each candidate image and the image from the previous round.
4.3 State-of-the-Art
No methods directly target our task, and the most closely related research targets text-based image and video retrieval. Considering the types of methods and involvement of context encoding, we classify existing methods into four groups: text-based image retrieval, video retrieval, pre-training-based visual transformers, and contextual retrieval.
4.3.1 Text-Based Image Retrieval (i-)
We compare to two recent text-based image retrieval methods: SCO [46], which learns sentence-image similarity, and CAMP [19], which performs word- and region-level similarity learning. For fairness of comparison, we use Faster R-CNN [47] to extract region-level visual features and the method of Ref. [48] to encode word embeddings.
4.3.2 Text-Based Video Retrieval (V-)
Mixture of embedding experts (MoEE) models are widely used in text-based video retrieval [23], [49]. The standard procedure learns text-video similarity using a weighted combination of multiple expert embeddings as the video representation. Although our visual modality is images, we treat each image as a single-frame video following Ref. [26]. We use the following experts for keyframe representation: scene features from a DenseNet161 model [45] pre-trained on the Places365 dataset, object features from an SENet154 model [50] pre-trained on ImageNet, and a character embedding that encodes the top-100 characters mentioned in the text.
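A minimal sketch of such an MoEE-style similarity is shown below: each expert embedding is projected into a joint space, compared to the projected text, and the per-expert similarities are combined with text-predicted weights. The dimensions and module structure are illustrative, not the exact baseline implementation.

```python
# Sketch of a mixture-of-embedding-experts similarity for (text, keyframe) pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEESimilarity(nn.Module):
    def __init__(self, text_dim=512, expert_dims=(2208, 2048, 100), joint_dim=256):
        super().__init__()
        # One projection per expert (e.g., scene, object, character) and per text branch.
        self.vis_proj = nn.ModuleList(nn.Linear(d, joint_dim) for d in expert_dims)
        self.txt_proj = nn.ModuleList(nn.Linear(text_dim, joint_dim) for _ in expert_dims)
        self.gate = nn.Linear(text_dim, len(expert_dims))   # text-conditioned expert weights

    def forward(self, text_feat, expert_feats):
        weights = F.softmax(self.gate(text_feat), dim=-1)   # (B, n_experts)
        sims = []
        for i, expert in enumerate(expert_feats):
            t = F.normalize(self.txt_proj[i](text_feat), dim=-1)
            v = F.normalize(self.vis_proj[i](expert), dim=-1)
            sims.append((t * v).sum(-1))                    # cosine similarity per expert
        return (weights * torch.stack(sims, dim=-1)).sum(-1)  # weighted combination
```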
4.3.3 Pre-Training-Based ViT (P-)
Recently, vision-language transformers based on pre-training have been widely used for multi-modal alignment. We compare to the representative UNITER [21] and OSCAR [22] models, both of which use object tags and regions detected in images to better learn text-image alignment. We first extract detected objects with region features from the keyframes using Faster R-CNN [47], then fine-tune the pre-trained models on text-keyframe pairs with the object features. The [CLS] token representation is used for the subsequent retrieval task.
4.3.4 Contextual Retrieval (C-)
We further compare to methods involving contextual retrieval, which exist for both text-based image and text-based video retrieval. For the former, we compare to a neural story illustration method [11], StoryShow. It uses a hierarchical GRU network to learn a representation of the input story while keeping coherence between sentences, retrieving a sequence of ordered images; this is similar to the setting of the Sc2St task. For the text-based video domain, we compare to the Contextual MoEE (CMoEE) [8], which learns the text-video similarity score using weighted expert features from the current and past video clips. We replace the original video clips with keyframes and use the same experts as in the MoEE model. As our contextual retrieval task involves both text and frame context, these two methods are a better fit for our task than the other methods.
4.4 Proposed Methods
4.4.1 Approach
The storyboards contain rich visual and textual context, namely the keyframes and text descriptions. To better capture the contextual information in both sequences, we base our model on a recurrent architecture; Fig. 6 illustrates the framework. Specifically, at each time step, the model encodes the current query text and the keyframe history, fuses them with a context encoder into a context representation, and scores the candidate keyframes against this representation.
Fig. 6 Proposed recurrent model architecture using three contextual fusion encoders. After encoding the text query and keyframe history separately, a context encoder fuses the two streams into a context gist used to score the candidate keyframes.
In the implementation, the image encoder is derived from a pre-trained Inception-v3 [51] model; it serves as a general feature extractor that converts the images in the storyboard history into feature vectors.
4.4.2 Late Fusion (LF)
In the late fusion encoder, the text and image features are directly concatenated and then processed by a multilayer perceptron (MLP). This simple approach fuses the two modality features into a joint semantic space.
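A minimal sketch of such a late fusion encoder (dimensions are illustrative):

```python
# Late fusion (LF): concatenate the text query representation and the recurrent
# image-history representation, then apply an MLP to obtain the fused context gist.
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    def __init__(self, text_dim=512, image_dim=512, joint_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + image_dim, joint_dim),
            nn.ReLU(inplace=True),
            nn.Linear(joint_dim, joint_dim),
        )

    def forward(self, r_text, r_image_history):
        return self.mlp(torch.cat([r_text, r_image_history], dim=-1))
```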
4.4.3 Gated Fusion (GF)
The gated fusion encoder has a gating mechanism that controls what information is passed on or forgotten, as in a gated linear unit (GLU) [52]. Here we propose a cross-gating mechanism: two gated layers are used, one filtering the image history information by the text information and the other filtering the text information by the image history information. The two outputs are then combined to give the context gist, as formulated by
\begin{equation*}g_{t}=G_{S}(r_{S}^{t})r_{I}^{t-1}+G_{I}(r_{I}^{t-1})r_{S}^{t}\tag{1}\end{equation*}
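A minimal sketch of this cross-gating, under our reading of Eq. (1) as an element-wise product of sigmoid gates with the opposite modality (dimensions are illustrative):

```python
# Gated fusion (GF): a gate computed from one modality filters the other modality,
# and the two gated streams are summed to give the context gist g_t.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.gate_text = nn.Linear(dim, dim)    # G_S: gate computed from the text query
        self.gate_image = nn.Linear(dim, dim)   # G_I: gate computed from the image history

    def forward(self, r_text, r_image_history):
        gated_image = torch.sigmoid(self.gate_text(r_text)) * r_image_history
        gated_text = torch.sigmoid(self.gate_image(r_image_history)) * r_text
        return gated_image + gated_text         # context gist g_t, cf. Eq. (1)
```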
4.4.4 Self-Attention Fusion (SA)
We leverage an attention mechanism [20] to connect the visual and textual information. Specifically, self-attention (SA) is used to bridge the two recurrent outputs, where we adapt the non-local concept from Ref. [53] to implement the SA module for our data inputs. Firstly, the image and text contexts are concatenated and fed into an MLP to obtain the fused context vectors $\boldsymbol{u}_{1},\dots,\boldsymbol{u}_{N}$, over which self-attention is computed as \begin{align*}\boldsymbol{u}_{j}^{\prime} & = \boldsymbol{W}_{v} \sum\limits_{i=1}^{N} w_{j,i} \boldsymbol{u}_{i},\tag{2}\\ w_{j,i} & = \frac{\exp(\alpha_{j,i})}{\sum\limits_{k=1}^{N}\exp(\alpha_{j,k})},\tag{3}\end{align*} where $\alpha_{j,i}$ is the pairwise affinity between $\boldsymbol{u}_{j}$ and $\boldsymbol{u}_{i}$, and $\boldsymbol{u}_{j}^{\prime}$ is the attended output.
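A compact sketch of such a non-local self-attention module over the fused context vectors; the projection names (theta, phi, W_v) follow the non-local formulation and the dimensions are illustrative:

```python
# Self-attention (SA) fusion over N fused context vectors, cf. Eqs. (2) and (3).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionFusion(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.theta = nn.Linear(dim, dim)   # query projection
        self.phi = nn.Linear(dim, dim)     # key projection
        self.w_v = nn.Linear(dim, dim)     # output projection W_v

    def forward(self, u):                  # u: (B, N, dim) fused context vectors
        alpha = self.theta(u) @ self.phi(u).transpose(1, 2)   # (B, N, N) affinities
        w = F.softmax(alpha, dim=-1)                          # Eq. (3): weights w_{j,i}
        return self.w_v(w @ u)                                # Eq. (2): attended outputs
```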
4.5 Human Evaluation
To evaluate human performance on the contextual retrieval task, we built an online user study system (Online Human Evaluation: https://sc2st.com) using the same testing set. As the testing set includes ~3k storyboard samples covering ~27k rounds, we randomly chose 30 samples from it to reduce the testing size, giving a total of 270 retrieval rounds. To align the human evaluation with the ranking-based evaluation protocol while keeping it practical, in each retrieval round participants choose and rank between 1 and 10 images (rather than exactly 10) according to their confidence; when a participant is confident in the images already chosen and ranked, they need not select more, and the remaining slots are automatically filled with random frames. Overall, we obtained human evaluation results from 14 participants on 321 data samples.
Results and Discussion
5.1 Quantitative Results
Table 3 summarises the quantitative results for the various methods. Our methods perform favourably against all existing methods under all metrics. The three context encoders show only subtle differences in R5 and R10, while for R1 the GF and SA encoders perform slightly better than the LF encoder, with SA the best. The pre-training-based transformers (UNITER [21] and Oscar [22]) outperform the classic similarity-based text-based visual retrieval methods (SCO [46], CAMP [19], and MoEE [49]). For the context-based methods (C-StoryShow [11] and C-CMoEE [8]), the better performance of C-CMoEE indicates that visual context carries greater weight than textual context; we validate this further in the ablation study (Section 5.3). Human subjects achieve the best results on all metrics. Mean and MRR results are absent for the user study, as these metrics require rankings over all candidates, which is impractical in a user study.
5.2 Qualitative Results
Figure 7 shows the top-5 selection results for an entire story example using our model with the LF context encoder. Given the initial keyframe and its description, for each round the model must predict the next keyframe conditioned on the text description and the frame history. The top selected frames share semantic similarity: for example, the candidates in the third row concern a neon sign, while Rank-1 and Rank-4 in the sixth row show a person using a phone. Text-based image similarity alone is insufficient, as similar candidates make the task challenging, whereas our context-aware method can leverage the visual and textual history for better retrieval. Later rounds receive more contextual information and their results tend to be better (rounds 4–10). We next quantitatively verify this observation.
Fig. 7 Qualitative retrieval results: the top-5 selected keyframes given the script, denoted by green boxes.
5.3 Effectiveness of Context Encoding
The core of our proposed approach is the recurrent architecture with context encoders that capture contextual information. To assess whether the method effectively uses this context, Table 4 presents results for the early (rounds 2–4), middle (rounds 5–7), and late (rounds 7–9) temporal stages of storyboards using our method (LF). For comparison, the MoEE results are also shown for each temporal stage. The results show that our method performs better in the middle and late stages than in the early stage, meaning that the availability of more history information in later rounds matters and that our model successfully exploits previous context for the retrieval task. Non-contextual models (such as MoEE) do not show this trend; indeed, their performance in earlier stages is better than in later stages.
5.4 Effectiveness of Dual Context
To show the separate contributions of the visual and textual context used in our method, we designed experiments using only the visual context or only the textual context, and compared them to the full model using both. Table 5 shows the results using the LF fusion module. Using the visual context alone, namely the frame history, performs better than using the textual context alone on all metrics, suggesting that the visual elements of the context are more important. Performance is further boosted by adding the textual context, especially for R1, by about 52%.
5.5 Exploring Potential Applications
As the Sc2St dataset has a clear story structure, we hope to further explore its application to other movie-related scenarios besides contextual retrieval. One possible application is script-guided storyboard generation which, unlike the usual text-to-image (T2I) task, requires a sequence of images to be generated. T2I remains challenging: most research has focused on generating single objects (e.g., birds or flowers) from text, and the quality of generated complex scenes is not yet ideal [32], [54]. The movie domain, involving varied characters, objects, and scenes, is even more challenging. Recent advances in T2I are driven by scaling models on large datasets. Such models, with billions of parameters trained from abundant data on hundreds or thousands of GPUs, show the potential to generate realistic images from text [33]–[35], [55]. We therefore examined the generation quality of recently available state-of-the-art T2I models: Dall.E-2 [35] and CogView-2 [34]. Specifically, each keyframe in a storyboard is generated one by one given its paired text. Some results are shown in Fig. 8. Note that the shown images are carefully picked: each sentence can generate multiple images from different random initializations, and we manually screened for the most suitable ones. Dall.E-2 generates more realistic images than CogView-2. Simple scenes (e.g., the owl) are well synthesized by both models, while complex scenes are harder for CogView-2. The generated images lack coherence and have inconsistent styles, which is to be expected since no contextual information is used. As T2I benefits from large-scale modelling, we hope our dataset can help fill the gap towards more coherent and realistic storyboard generation in future.
5.6 Limitations and Future Work
First, the number of movies included is limited: our Sc2St dataset provides fine-grained storyboards based on 165 movies. Inspired by the recent large-scale movie dataset [7], there is potential to extend our storyboard annotations to more movies. Second, the evaluation of contextual generation deserves further investigation. We carefully designed the benchmark evaluation for the Sc2St contextual retrieval task, but automatic perceptual evaluation of generated results remains challenging and is left for future work. Third, although we discussed potential text-to-image generation and qualitative results using our dataset, other applications such as storytelling and video creation/editing would also benefit from it.
Fig. 8 Text-to-image generation results on our dataset using Dall.E-2 and CogView-2.
Conclusions
In this paper, we proposed a new script-to-storyboard dataset (Sc2St) together with a contextual retrieval task in the movie domain. The dataset features a new data form, the storyboard, which consists of sequential keyframe images with corresponding textual descriptions. A storyboard has the advantage of an explicit, clear, and coherent story structure over the implicit storyline in movies. Compared to existing movie datasets, the Sc2St dataset contains fine-grained, highly diverse text annotations, and the newly annotated keyframes are semantically matched to the text. Using the new dataset, we benchmarked the contextual retrieval task with an automatic ranking-based evaluation protocol. We proposed baselines with three variants to accomplish the task and compared them to state-of-the-art methods as well as human performance. Quantitative results demonstrated that our approach performs better by successfully leveraging contextual information from both the text and the image history. Finally, we explored the potential of our dataset for the generation task.
Declaration of Competing Interest
The authors have no competing interests to declare that are relevant to the content of this article.
ACKNOWLEDGEMENTS
This research was supported by RCUK grant CAMERA (EP /M023281/1, EP/T022523/1), the Centre for Augmented Reasoning (CAR) at the Australian Institute for Machine Learning, and a gift from Adobe.
Appendix
Implementation
Our proposed models were implemented in the PyTorch deep learning framework [56]. Adam [57] was used as the optimisation method, with the learning rate set to 0.0003 and scheduled with a cosine annealing strategy. For all models, the batch size was set to 128. The best model was selected using the mean ranking performance on a hold-out validation set.
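A minimal sketch of this optimisation setup; the model, data, and epoch count are stand-ins, as only the optimiser, learning rate, schedule, and batch size are stated here.

```python
# Adam (lr = 3e-4) with cosine annealing and batch size 128, as stated above.
import torch
import torch.nn as nn

model = nn.Linear(512, 512)                                   # stand-in for the retrieval model
num_epochs = 30                                               # illustrative value, not from the paper
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

for epoch in range(num_epochs):
    for _ in range(10):                                       # stand-in for the training loader
        x = torch.randn(128, 512)                             # batch size 128
        loss = model(x).pow(2).mean()                         # stand-in loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                                          # anneal the learning rate each epoch
```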
Additional Storyboard Samples
Figures A1–A3 show more storyboard samples from different movies, demonstrating that the keyframes in the storyboards summarize the condensed visual information reflected in the textual description. The sequences of keyframes provide a coherent story, which one can understand without the original long and redundant video information.
Human Evaluation Interface
The human evaluation interface is shown in Fig. A4. A demonstration is provided at https://sc2st.com.