
The Video Authentication and Camera Identification Database: A New Database for Video Forensics




Abstract:

Modern technologies have made the capture and sharing of digital video commonplace; the combination of modern smartphones, cloud storage, and social media platforms has enabled video to become a primary source of information for many people and institutions. As a result, it is important to be able to verify the authenticity and source of this information, including identifying the source camera model that captured it. While a variety of forensic techniques have been developed for digital images, less research has been conducted toward the forensic analysis of videos. In part, this is due to a lack of standard digital video databases, which are necessary to develop and evaluate state-of-the-art video forensic algorithms. In this paper, to address this need, we present the video authentication and camera identification (Video-ACID) database, a large collection of videos specifically collected for the development of camera model identification algorithms. The Video-ACID database contains over 12,000 videos from 46 physical devices representing 36 unique camera models. Videos in this database are hand-collected in a diversity of real-world scenarios, are unedited, and have known and trusted provenance. In this paper, we describe the qualities, structure, and collection procedure of Video-ACID, which includes clearly marked videos for evaluating camera model identification algorithms. Finally, we provide baseline camera model identification results on these evaluation videos using state-of-the-art deep-learning techniques. The Video-ACID database is publicly available at misl.ece.drexel.edu/video-acid.
Topic: Digital Forensics through Multimedia Source Inference
Published in: IEEE Access (Volume: 7)
Page(s): 76937 - 76948
Date of Publication: 10 June 2019
Electronic ISSN: 2169-3536



SECTION I.

Introduction

The capture and spread of digital multimedia have exploded over the last several decades. Higher-quality and widely available cameras, like those in modern smartphones, as well as internet applications, such as social media and cloud storage, have allowed the average person to easily document anything from a family vacation to an academic lecture. However, in some scenarios, such as news reporting, legal proceedings, and national security operations, it is critical to know the source and integrity of a given image or video.

To address these issues, researchers have developed forensic algorithms that verify the authenticity and source of digital content [1]–[4]. For example, techniques have been developed to identify the processing history of digital images [5]–[12], perform image forgery detection [13]–[24], and identify an image’s source device [25]–[27] and source camera model [28]–[34]. The development of these algorithms has been significantly aided by the availability of several high-quality forensic databases, such as the Dresden Image Database [35] and the VISION database [36].

While much of forensics research has focused on images, the increasing importance of video has created a growing need for new video forensic techniques. Currently, researchers have developed forensic algorithms to identify video manipulation and forgery [37], [38], detect frame deletion [39], [40], and identify a video’s source device [41]–[43] and camera model [44]. While this research provides forensic analysts with important investigative capabilities, the development of video forensic algorithms has proceeded at a much slower pace than that of image forensic algorithms. One significant reason for this is the lack of a widely available database of videos suitable for developing and benchmarking forensic algorithms. Though some existing forensic databases contain videos, such as the VISION database, the number of videos in these databases is not sufficiently large to train and evaluate modern data-driven video forensic algorithms. As a result, there is a significant need for a large database of unaltered videos of known provenance that is suitable for use in forensic research.

To fill this gap, we present the Video Authentication and Camera Identification database (Video-ACID). This database is a carefully constructed collection of videos purposely made for the development and evaluation of video camera model identification algorithms. While the database is intentionally built with camera model identification techniques in mind, we note that its properties, such as diverse codec parameters, make it useful for the development and evaluation of many other forensic algorithms.

The Video-ACID database contains over 12,000 videos from 46 different devices, representing 36 camera models from 18 different manufacturers. Each device was used to capture, on average, over 250 videos. To represent real-world scenarios, these videos were manually captured to depict a diversity of content, lighting conditions, and motion. Videos are unedited and in the format directly output by cameras to which we had physical access. Additionally, videos in this dataset have already been used to benchmark a state-of-the-art video camera model identification system [44].

The remainder of this paper is structured as follows. In Section II, we discuss the need for a standard database of videos designed for the development of camera model identification techniques, the limitations of existing databases, and the desired properties of a video forensics database. In Section III, we detail the creation process, the attributes of the videos and camera models, and the organization of the Video-ACID database. Finally, in Section IV, we evaluate the Video-ACID database by conducting a series of experiments using a state-of-the-art video camera model identification system.

SECTION II.

Motivation

In the past decade, many algorithms have been developed to determine the origin and integrity of digital images. However, less attention has been paid to videos. One significant reason is that there is no up-to-date database suitable for developing video forensic analysis techniques. Furthermore, capturing a large number of videos is expensive and time-consuming. To facilitate video forensics research, it is critical to provide researchers with a standard video database that has the properties suitable for developing new algorithms.

In this section, we discuss existing relevant databases and their limitations for benchmarking video forensic algorithms. In particular, we analyze their shortcomings with respect to video camera model identification algorithms. Additionally, we outline important properties of a good database for training and benchmarking video forensic algorithms. Although we built this database with camera model identification in mind, we note that many of these properties translate well to other video forensic tasks.

A. Existing Datasets

The Dresden Image Database [35] is a commonly used [45]–[48] forensic dataset. This database contains over 14,000 images from 25 different camera models. While the dataset was originally intended for source device identification, it has been used to benchmark numerous forensic algorithms, including image source camera model identification [46], [49], image forgery detection [47], and more [48]. However, recent research has shown that features learned by image camera model identification algorithms do not transfer well to video source identification [44]. Since the Dresden Image Database does not contain any videos, it is not useful for developing algorithms that require feature extraction directly from video frames. Nevertheless, the flexibility and ubiquity of the Dresden database motivated us to adopt similar qualities in constructing our own dataset, such as a large number of camera models, many videos of natural scenery, and a database structure that is easy to interact with.

The recently published VISION dataset [36] contains approximately 300 images and between 10 and 30 videos from each of 35 distinct devices. This dataset also features images and videos that have been uploaded to and downloaded from social media platforms such as Facebook and WhatsApp. The dataset was collected for the purpose of device identification based on sensor noise patterns. While the VISION database has been useful for developing a number of forensic algorithms [43], [50], it does not contain a sufficient number of unique, unedited videos to train state-of-the-art video camera model identification algorithms.

Concurrent to the development of our proposed database, a collection of datasets has been developed called the Multimedia Forensics Challenge (MFC) evaluation datasets [51]. The MFC datasets were built for a number of purposes, including evaluating image and video device (not camera model) identification, forgery detection, and event verification. While these MFC datasets contain many images and videos, none are specifically built with camera model identification in mind, highlighting the need for such a database.

B. Motivating Qualities

We now discuss the properties that make a video database suitable for the development of state-of-the-art algorithms. We considered both the properties important for a general video forensic dataset and the requirements specific to a video camera model identification dataset.

1) Number of Videos

Since many forensic algorithms are trained by first extracting features from a large amount of data, the database should contain a large number of data points (i.e., video clips). Recent forensic algorithms rely increasingly on data-driven techniques, such as convolutional neural networks. These neural networks require a large amount of training data to achieve high accuracies [52]. Therefore, a dataset suitable for the development of these data-driven techniques must have a large number of individual data points.

2) Number of Camera Models

In a real-world scenario, a forensic investigator may be tasked with differentiating between a large number of camera models. It is important that a camera model identification algorithm be able to differentiate between many camera models. As a result, it is necessary that a camera model identification database contain data from many different classes, i.e., camera models. Additionally, a large number of camera models can help increase data variety when the database is used for other forensic tasks.

3) Diversity of Camera Models

A forensic investigator may encounter a breadth and depth of camera models. That is, they may encounter camera models from different manufacturers as well as different camera types, such as camcorders, DSLRs, or cell phones. Additionally, they may need to differentiate between very similar camera models, such as a Samsung Galaxy S5 and Samsung Galaxy S7, which are likely to contain similar forensic traces. Therefore, the database should include a diverse set of camera models.

4) Known and Trusted Provenance

An accurate correspondence between label and data point is critical for a successful classifier [53], [54]. For this reason, it is important that videos are collected by trusted agents who have full control over the capture devices, as opposed to crowd-sourced through the internet. Additionally, for forensic research, it is important to know the history of digital content, including processing history. In the case of camera model identification, it is desirable that videos be unedited and in the format directly output from the camera.

5) Content Diversity

It is important that a video forensics dataset include videos captured in many different scenes and environments. This is because multimedia forensic algorithms should be robust to variations in depicted content. That is, a video forensic algorithm should not operate differently when presented with different scenery, lighting environments, motion, etc. To ensure this, training must be performed on videos captured in a variety of settings that are, ideally, representative of all scenarios that may be encountered.

6) Duplicate Devices

It is possible, particularly when using deep learning techniques, that a camera model identification algorithm learns device-specific features in addition to the desired model-specific features. For this reason, it is important that a video database oriented toward camera model identification provide some means by which researchers and algorithm developers can study the influence of these device-specific features on their algorithms. One way to do this is to capture videos using multiple devices of the same make and model.

7) Codec Diversity

The codec and codec parameters used to compress a video are an important consideration for video forensics techniques [2], [37], [39], [55]. These encoding parameters can affect both the video’s perceptual quality, as well as the forensic traces left in the video. It is important for a forensic video dataset to include videos that represent the most popular and modern compression techniques. This includes a diversity of encoding parameters.

In light of this, we provide a brief explanation of modern video coding techniques. Modern video compression takes advantage of the temporal redundancy of successive frames. Videos encoded with the H.264 codec are compressed in sets known as Groups of Pictures (GOPs). The first frame of each GOP, known as an I-frame, is coded similarly to a JPEG image. The rest of the GOP comprises P- and B-frames, which are predicted from temporally local frames. Some cameras further compress frames by recording only every other row of each captured frame, alternating which rows are recorded. This is known as interlaced scanning, as opposed to progressive scanning. There are many other parameters to consider when encoding a video, such as the number of frames per second and the amount of information stored per frame. These parameters are further discussed in Section III.
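The frame-type structure described above can be inspected with standard tools. As a minimal sketch (not part of the paper), the following Python snippet assumes the ffprobe utility from FFmpeg is installed and on the PATH; it returns the sequence of picture types (I, P, B) for a clip's video stream, and the filename in the usage comment is hypothetical.

import subprocess

def frame_types(video_path):
    # Query the picture type of every frame in the first video stream.
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "frame=pict_type", "-of", "csv=p=0", video_path],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip().strip(",") for line in out.splitlines() if line.strip()]

# Example (hypothetical file): print the first 30 picture types of a clip.
# print("".join(frame_types("M30_DA_T0001.mp4")[:30]))   # e.g. "IPPPP..." or "IPBBPBB..."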

The above qualities are those we find most important to the usability of a dataset. We considered these when designing our proposed dataset. In the following section, we describe the design of the dataset, and how these qualities were considered and implemented.

SECTION III.

The Video-ACID Dataset

The Video Authentication and Camera Identification Database (Video-ACID) contains 12,173 total videos from 46 devices, comprising 36 unique camera models. These cameras span a variety of device categories, manufacturers, and models. Specifically, our database contains videos from 19 different smartphones or tablets. We also include videos from 10 point-and-shoot digital cameras, 3 digital camcorders, 2 single-lens reflex cameras, and 2 action cameras. Our dataset contains videos from 18 different camera manufacturers, including Apple, Asus, Canon, Fujifilm, GoPro, Google, Huawei, JVC, Kodak, LG, Motorola, Nikon, Nokia, Olympus, Panasonic, Samsung, Sony, and Yi. For nine of the camera models in the Video-ACID database, videos were collected using two or more physical devices of the same make and model. Table 1 shows the make and model of each class, as well as some properties of the videos recorded using those cameras.

TABLE 1 Highlighted Video Properties Associated With Each Camera Model

A. Capture Procedure

Videos were captured by hand by a team of researchers who had physical access to each device. Additionally, these videos are unaltered, in the original format directly output by the camera. In order to ensure consistency and avoid biases across the Video-ACID database, the following guidelines were developed and used during data collection.

1) Device Settings

Cameras can often be configured to capture videos with many different parameters, including frame size and frame rate. Videos in this dataset were captured with each camera operating at its highest quality setting, usually 1080p at 30 frames per second. Digital zoom can introduce distortions into multimedia content and the associated forensic traces, so during the data collection process, all cameras were left at their default zoom level. Many of these camera models have multiple image sensors on a single device. In these cases, we use the higher-quality rear-facing camera, as opposed to the front-facing “selfie” camera.

2) Duration

The videos captured for the Video-ACID dataset are each roughly five seconds or more in duration. This duration was chosen in light of several constraints. First, videos must be long enough to exhibit forensically significant behavior, setting a lower bound of several GOP sequences in duration. Second, some forensic algorithms operate on individual frames, as opposed to an entire video. In light of this, we would like to maximize the number of frames available in each video. Third, data collection is an expensive process, and we would like to maximize the number of videos that can be recorded in a given time period. We found that videos of roughly five seconds in duration satisfy all of these constraints.

3) Content

Content diversity is important for many forensic tasks. We collected videos of a variety of different scenes with each camera, including near- and far-field focus, indoor and outdoor settings, varied lighting conditions, horizontal and vertical capture, and varied backgrounds such as greenery, urban sprawl, and snowy landscapes. All videos incorporate some sort of motion or change in scene content, lessening the redundancy of frames within a single video. This motion is typically in the form of panning or rotating the camera, or changing the distance of the camera from the scene. Figure 1 shows still frames from different videos in the Video-ACID database, demonstrating the range and variety of scene content.

FIGURE 1. Sample frames from captured videos.

B. Camera Capture Properties

Table 1 summarizes the collected videos and the capture parameters of each device. This table contains information about the capture and compression properties of videos recorded by each camera model, such as the resolution, codec profile, frame rate mode, and GOP structure.

1) Resolution

The resolution of a video corresponds to the width and height of the video in pixels. Additionally, a suffix of ‘I’ or ‘P’ is added to indicate the scan type of the video. A video with an interlaced scan is displayed by updating every other row of a frame to reduce the amount of data that needs to be stored. A progressive video is displayed by updating every row of the display for each frame.

2) Codec Profile

Of the 36 camera models used to collect data, most encode captured videos according to the H.264 video coding standard. Associated with this standard are Profiles and Levels, indicating the complexity and speed required to decode the given video. However, two of these cameras encode video using the MJPEG codec, which does not have a profile or level.

The profile of a video determines the complexity needed to decode that video, while the level indicates the speed required. For example, “Baseline” and “Constrained Baseline” videos use Context-Adaptive Variable-Length Coding (CAVLC), while “Main” and “High” profile videos use Context-based Adaptive Binary Arithmetic Coding (CABAC). Both entropy coding schemes are lossless; however, CABAC is much more computationally intensive to encode and decode than CAVLC. Notably, B-frames are not available when using the “Baseline” profiles, but are available when encoding “Main” and “High” profile video.

While the Profile indicates a video stream’s complexity in terms of the capability necessary to decode it, a video’s Level indicates the bitrate necessary to decode the stream. For example, decoding a 1080p video at 30 FPS requires a decoder capable of Level 4 or above. Most modern flagship smartphones use the “High” profile, while older phones, cheaper phones, and point-and-shoot digital cameras are more likely to use the “Main” or “Baseline” profiles.
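These stream-level properties can be read directly from a file's metadata. The following sketch, again assuming the ffprobe utility is installed, queries the codec name, profile, level, scan type, and average frame rate of the primary video stream; the exact values reported depend on the FFmpeg build, and the example filename and output are hypothetical.

import json, subprocess

def stream_info(video_path):
    # Read selected metadata fields of the first video stream as JSON.
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=codec_name,profile,level,field_order,avg_frame_rate",
         "-of", "json", video_path],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)["streams"][0]

# stream_info("M30_DA_T0001.mp4") might return something like:
# {"codec_name": "h264", "profile": "High", "level": 40,
#  "field_order": "progressive", "avg_frame_rate": "30/1"}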

3) Frame Rate

The Video-ACID dataset contains a mix of “Variable” and “Fixed” frame rate video. In fixed frame rate videos, each frame is displayed for the same amount of time as every other. In variable frame rate videos, the timing between frames can change. For example, a camera may detect fast motion in a scene, and increase the frame rate to be able to better capture this motion.

4) GOP Structure

The length and sequence of a Group of Pictures (GOP) is not fixed by the codec. Instead, as long as a GOP starts with an I-frame, each encoder is allowed to determine its own sequence of P- and B-frames. A video’s GOP sequence is usually parameterized by the length of the GOP, i.e., the number of frames between I-frames, and the maximum number of B-frames allowed between anchor frames. In Table 1, the N value is the number of frames between I-frames, and the M value is the maximum number of B-frames between P-frames. For MJPEG-encoded videos there are no predicted frames, so the distance between I-frames is 1.

As seen in Table 1, many cameras use different M and N parameters. Across similar devices from the same manufacturer, however, these parameters are likely to be constant. For example, all the Samsung devices use M=1 and N=30, while Google’s devices use M=1 and N=29.
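Given the picture-type sequence produced by the earlier ffprobe sketch, these GOP parameters can be estimated roughly as follows. This is illustrative only: n_est is the spacing between the first two I-frames, and max_b_run is the longest run of consecutive B-frames; how max_b_run maps onto the M value reported in Table 1 depends on the convention of the analysis tool used.

def gop_parameters(types):
    # types: list of picture types such as ["I", "P", "B", ...] from frame_types().
    i_positions = [k for k, t in enumerate(types) if t == "I"]
    n_est = i_positions[1] - i_positions[0] if len(i_positions) > 1 else len(types)

    max_b_run = run = 0
    for t in types:
        run = run + 1 if t == "B" else 0
        max_b_run = max(max_b_run, run)
    return n_est, max_b_run

# n_est, max_b_run = gop_parameters(frame_types("M30_DA_T0001.mp4"))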

C. Structure

Within Video-ACID, we provide two datasets: a “Full” dataset and a “Duplicate Devices” dataset. The Full dataset contains all videos from all camera models. The Duplicate Devices dataset contains videos from camera models represented by multiple devices. We organize these videos in the following way:

1) Full Dataset

The Full dataset contains all videos from all devices in the Video-ACID database. We split this dataset into disjoint sets of training and evaluation videos. To do this, we randomly select 25 videos from each camera model to act as the evaluation set. In the case of multiple devices of the same make and model, these 25 videos are randomly split across the devices. The rest of the videos are left for training.
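For reference, a minimal sketch of this kind of per-model split is shown below; the directory layout, file extension, and random seed are illustrative and are not the exact scripts used to build the released database.

import random
from pathlib import Path

def split_model(model_dir, n_eval=25, seed=0):
    # Gather videos from all devices of one camera model and hold out n_eval of them.
    videos = sorted(Path(model_dir).rglob("*.mp4"))
    random.Random(seed).shuffle(videos)
    return videos[n_eval:], videos[:n_eval]   # (training videos, evaluation videos)

# train_videos, eval_videos = split_model("M30_Samsung_Galaxy_S7")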

Many existing camera model identification algorithms operate using patches of an image or video. In light of this, we selected an additional 25 evaluation videos from the Nikon Coolpix S3700, because its small frame size limits the number of unique non-overlapping patches that can be extracted. Table 2 shows the total number of training and evaluation videos for each camera model.

TABLE 2 Folder Names and Number of Videos in the Full Dataset

Within the root directory of our “Full” dataset, we have a directory for training data, and another for evaluation data. Within these directories, there is a subdirectory for each camera model. Camera models are identified by both a model number, from 0 to 35, and a name describing the make and model of the camera. Within each of these camera model directories, we separate videos by the device which captured them. For most models, this is just a single subdirectory labeled “DeviceA”. When multiple devices of the same model were used to capture videos, there are multiple subdirectories, “DeviceA”, “DeviceB”, etc.

The videos are named according to the following scheme: MXX_DY_T0000.mp4. MXX is the model number assigned to the camera. DY is the device identifier, e.g., DA for Device A. The prefix of the video number is either ‘T’ or ‘E’, indicating whether the video belongs to the training set or the evaluation set, respectively. Finally, a four-digit number indexes videos captured by the same device. For example, M30_DA_E0010.mp4 refers to the 11th evaluation video captured by device A of a Samsung Galaxy S7. The full file path is then “eval/M30_Samsung_Galaxy_S7/DeviceA/M30_DA_E0010.mp4”.
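As an illustration of how this naming scheme can be consumed programmatically, the following sketch parses a Video-ACID filename into its model, device, split, and index fields; the regular expression and field names are our own and are not part of the release.

import re

NAME_RE = re.compile(r"M(?P<model>\d{2})_D(?P<device>[A-Z])_(?P<split>[TE])(?P<index>\d{4})\.mp4$")

def parse_name(filename):
    # Decompose a Video-ACID filename such as "M30_DA_E0010.mp4".
    m = NAME_RE.search(filename)
    if m is None:
        raise ValueError(f"unrecognized Video-ACID filename: {filename}")
    return {
        "model": int(m.group("model")),                        # camera model number, 0-35
        "device": m.group("device"),                           # device letter, e.g. "A"
        "split": "train" if m.group("split") == "T" else "eval",
        "index": int(m.group("index")),                        # per-device video index
    }

# parse_name("M30_DA_E0010.mp4")
# -> {"model": 30, "device": "A", "split": "eval", "index": 10}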

2) Duplicate Devices Dataset

The Duplicate Devices dataset contains only those camera models from which videos were captured using multiple devices. This dataset is useful for studying the device dependence of various forensic algorithms. Table 3 lists the nine camera models and the number of training and evaluation videos in this Duplicate Devices dataset.

TABLE 3 Folder Names and Number of Videos in the Duplicate Devices Dataset

From each of these camera models we select the A device and the B device. Videos from the A devices are divided into Train-A and Eval-A, where the Eval-A directory contains 25 videos from each model, and the Train-A directory contains the rest. The videos from the B devices are divided the same way. One camera model, the Google Pixel 1, has videos from three devices. Device C from the Google Pixel 1 is excluded from the Duplicate Devices dataset.

These directories are structured similarly to the “Full” set, with different model numbers prefixing the model names. Because the device is implied by the directory, the device subdirectories are removed, and the videos sit directly beneath the camera model directory. For example, train-A/M03_Kodak_Ektra/M03_DA_0010.mp4 is the file path pointing to the 11th training video captured by the A device of the Kodak Ektra.

SECTION IV.

Applications and Baseline Evaluation

In this section, we evaluate the quality of our dataset as a benchmark for forensic algorithms by conducting the following series of experiments. First, we conducted benchmark experiments for camera model identification on our Full dataset. Second, we conducted experiments on our Duplicate Devices dataset to investigate device generalization.

A. Camera Model Identification

In our first experiment, we trained and evaluated a state-of-the-art video camera model identification system on our Full dataset. This result establishes a baseline camera model identification accuracy on the Video-ACID database.

To do this, we trained and evaluated a state-of-the-art camera model identification system [44] on the Full dataset. Briefly, this system is a CNN trained to output patch-level camera model identification decisions. Neuron activations from the CNN are then fused across multiple patches to render a video-level camera model identification decision.

1) Classifier Training

To train the CNN, we used the training videos in our Full dataset and followed the procedure in [44]. To extract training patches, we start with one camera model and randomly select a training video. We then choose three I-frames from the video, also at random. From each of these three frames, we store every non-overlapping 256×256 patch, beginning at the top-left corner of the frame. This process is repeated until 10,000 patches have been extracted for the camera model, and then for every other camera model, yielding a total of 360,000 patches, which are then randomly shuffled.
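The tiling step of this procedure can be sketched as follows. Locating and decoding the randomly selected I-frames (for example, with FFmpeg or OpenCV) is omitted here, so the snippet only illustrates how a decoded frame is divided into non-overlapping 256×256 patches starting from the top-left corner; it is not the authors' exact pipeline.

import numpy as np

def extract_patches(frame, patch_size=256):
    # frame: H x W x 3 array of a decoded I-frame; returns non-overlapping patches.
    h, w = frame.shape[:2]
    patches = []
    for y in range(0, h - patch_size + 1, patch_size):
        for x in range(0, w - patch_size + 1, patch_size):
            patches.append(frame[y:y + patch_size, x:x + patch_size])
    return patches

# A 1920x1080 frame yields 7 x 4 = 28 non-overlapping 256x256 patches.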

Our Full dataset primarily contains videos encoded with the H.264 family of codecs (H.264, MPEG-4, etc.). However, the videos in our dataset captured by the Nikon Coolpix S3700 and the Fujifilm Finepix S8600 are encoded using MJPEG. This codec lacks features of the H.264 family of codecs, such as predictive coding and variable block sizes. To investigate the effect of the codec on the forensic traces in a video, we train two versions of the classifier. The first classifier is trained on all 36 camera models in our database. The second is trained using only the camera models which produce H.264-encoded video.

2) Patch Classification

First, we evaluate and compare the patch-level camera model identification accuracy of the trained CNNs. To do this, we created an evaluation set by repeating the training patch extraction procedure, but using videos from the evaluation set instead. For each camera model, we extracted 1,000 evaluation patches, yielding a total of 36,000 evaluation patches. We then used the trained classifiers to predict the source camera model of the evaluation patches.

The average single-patch classification accuracy on the evaluation set is shown in the second column of Table 4. The CNN trained on the H.264-only dataset correctly identified the source camera model of 79.6% of evaluation patches. The CNN trained on both H.264 and MJPEG videos achieved 81.7% accuracy.

TABLE 4 Single-Patch and Fusion Accuracy of Video Camera Model Identification

In Table 5 we show the average per-class, single-patch accuracy of each trained system. For example, the CNN trained only on H.264 video correctly classifies 60.8% of patches taken from the Kodak Ektra, while the CNN trained on both H.264 and MJPEG videos correctly classifies 80.0% of these patches. The accuracies across camera models range from 50% to 99%.

TABLE 5 Single-Patch Per-Class Accuracy of Video Camera Model Identification
TABLE 6 Per-Class Fusion Accuracy of Video Camera Model Identification
TABLE 7 Single-Patch Accuracy of Camera Model Identification System With Varying Training and Evaluation Devices
TABLE 8 P=F=3 Fusion Accuracy of Camera Model Identification System With Varying Training and Evaluation Devices

Table 9 shows the confusion matrix obtained using the classifier trained with all 36 camera models. Similarly, Table 10 shows the confusion matrix obtained using the classifier trained only on H.264-encoded videos.

TABLE 9 Confusion Matrix of Camera Model Identification System’s Single-Patch Accuracy When Trained on All Video-ACID Camera Models
TABLE 10 Confusion Matrix of Camera Model Identification System’s Single-Patch Accuracy When Trained Only on H.264-Encoded Videos

3) Video-Level Classification

Next, we present camera model identification results on whole videos. To do this, we apply the fusion technique described in [44]. Briefly, this technique fuses neuron activations from P patches in each of F frames. In this work, we choose P=3 and F=3, for a total of 9 patches to fuse.

To evaluate each trained system’s fusion accuracy, we randomly selected three I-frames from an evaluation video. Each frame was divided into a set of 256×256 non-overlapping patches, and three patches were randomly selected from each frame. We then used the fusion system in [44] to produce a video-level classification decision. This was repeated for every evaluation video.
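As a simplified stand-in for this aggregation step, the following sketch averages per-patch class scores over the P×F = 9 selected patches and takes the argmax. The fusion system in [44] operates on neuron activations with a learned fusion stage rather than plain score averaging, so this is only meant to illustrate how patch-level outputs are combined into a video-level decision; the prediction call in the usage comment is hypothetical.

import numpy as np

def video_level_prediction(patch_scores):
    # patch_scores: (num_patches, num_classes) array of per-patch class scores.
    return int(np.argmax(np.mean(patch_scores, axis=0)))

# scores = cnn.predict(np.stack(selected_patches))   # hypothetical CNN; shape (9, 36)
# predicted_model = video_level_prediction(scores)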

The average video-level classification accuracy on the Full dataset is shown in the third column of Table 4. For the H.264-only dataset, 96.9% camera model identification accuracy was achieved. For the evaluation set comprising both H.264 and MJPEG videos, 96.0% accuracy was achieved. For either CNN, fusing multiple patches results in a boost in accuracy of over 14% compared to single-patch classification. These results are consistent with those reported in [44].

In Table 6 we show the per-class video-level classification accuracy of each trained system. Fusing the activations of the CNN trained only on H.264 video results in the correct classification of 84% of videos from the Kodak Ektra, whereas fusing the activations of the CNN trained on both H.264 and MJPEG videos achieves 100% accuracy. Across all camera models, the video-level accuracy is between 64% and 100%.

B. Device Generalization

In the second experiment, we evaluate device-dependency effects using the Duplicate Devices dataset.

Using the procedure outlined in Section IV-A, we extracted 10,000 I-frame patches of size 256×256 from each device in the Train-A set. We did the same for devices in the Train-B set. From each device in each evaluation set, we also extracted 1,000 I-frame patches. This resulted in two training sets, each comprising 90,000 patches, and two evaluation sets, each with 9,000 patches. We trained the camera model identification system once on each training set and evaluated its performance on each of the evaluation sets.

Table 7 shows the average single-patch accuracy of each classifier for each dataset. As shown in Table 7, a CNN trained on only the B devices is able to correctly classify 79.1% of patches from those same devices. That CNN can also correctly classify 74.5% of patches from the A devices. While the CNN trained on only the A devices can correctly classify 82.0% of patches from the A devices, this accuracy falls to 64.7% for patches from the B devices.

To evaluate video-level camera model identification accuracy on this dataset, we employ the fusion system in [44]. For each video in each evaluation set, we randomly selected three I-frames and randomly selected three patches from each of these. We then averaged the accuracy across videos from the A devices, and those from the B devices.

The accuracy of each classifier on each video set is shown in Table 8. Both CNNs, when evaluating videos from the same devices used during training, achieve close to 95% accuracy. The CNN trained on the B devices correctly classifies 92.0% of videos from the A devices, which is comparable to its accuracy on videos from the B devices. However, the CNN trained on the A devices correctly classifies only 81.8% of videos from the B devices.

When training on the B devices and evaluating on videos from the A devices with fusion, the accuracy approaches that of training and evaluating on the same devices. Interestingly, the system trained on the A devices does not achieve the same accuracy when classifying videos from the B devices.

These results show that when training on one device and evaluating on another, classification performance decreases relative to training and evaluating on the same device. This suggests that a single device may not be representative of its entire camera model class. Overfitting to single devices can affect state-of-the-art video camera model identification systems, and the Video-ACID database provides the means for studying this effect.

SECTION V.

Conclusion

In this paper, we proposed a new standard database, Video-ACID, designed for the study of multimedia forensics on videos. The Video-ACID database contains 12,173 video clips of various scenes that were manually captured using 46 unique devices representing 36 camera models. The large number of video clips ensures sufficient and diverse data for developing and evaluating state-of-the-art forensic algorithms. In particular, it satisfies the need for developing source identification algorithms for videos. By conducting a series of experiments, we provided benchmark results for a state-of-the-art forensic video source identification algorithm on the Video-ACID database. Moreover, the dataset will grow in both the number of devices and the number of represented camera manufacturers and models, and future work will involve further benchmark evaluations using Video-ACID.

ACKNOWLEDGMENTS

The authors would like to thank Hunter Kippen, Keyur Shah, Shengbang Feng, Belhassen Bayar, and Michael Spanier for their assistance in data collection for this project.


Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office, the Defense Forensics Science Center, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.
