<span id="page-0-0"></span>

Received February 14, 2019, accepted February 28, 2019, date of publication March 11, 2019, date of current version April 18, 2019. *Digital Object Identifier 10.1109/ACCESS.2019.2904196*

# 4K Real Time Software Solution of Scalable HEVC for Broadcast Video Application

# RONAN PAROIS<sup>1</sup>, WASSIM HAMIDOUCHE<sup>2</sup>, PIERRE-LOUP CABARAT<sup>2</sup>, MICKAEL RAULET<sup>1</sup>, NATY SIDATY<sup>2</sup>, AND OLIVIER DÉFORGES<sup>2</sup>

<sup>1</sup>ATEME Rennes, 35700 Rennes, France

<sup>2</sup>VAADER Team, Institute of Electronic and Telecommunication of Rennes (IETR) UMR 6164, INSA Rennes, 35042 Rennes, France

Corresponding author: Wassim Hamidouche (wassim.hamidouche@insa-rennes.fr)

**ABSTRACT** Scalable high-efficiency video coding (SHVC) is the scalable extension of the high-efficiency video coding (HEVC) standard. SHVC enables spatial, quality, bit-depth, color gamut, and codec scalability. The architecture of the SHVC encoder is based on multiple instances of the HEVC encoder, where each instance encodes one video layer. This architecture has the advantages of being modular and close to the native HEVC coding block scheme. However, the closed-loop SHVC architecture requires the complete decoding of the reference lower-layer frames to decode a higher-quality layer, which considerably increases the complexity of both the encoding and decoding processes. In this paper, we propose an end-to-end 4K real-time SHVC solution, including both a software encoder and a software decoder, for video broadcast applications. The SHVC codec relies on low-level optimizations for the Intel x86 platform and on parallel processing to speed up encoding and decoding. The proposed encoder enables real-time processing of 4Kp30 video in 2× spatial scalability on a $4 \times 10$-core Intel Xeon processor (E5-4627V3) running at 2.6 GHz. In addition, the SHVC decoder decodes the lower-quality layer in full HD ($1920 \times 1080$p30) resolution on an advanced RISC machine (ARM) Neon mobile platform, and the enhancement layer in UHD ($3840 \times 2160$p30) on a laptop fitted with a 4-core Intel i7 processor running at 2.7 GHz. Finally, experimental results show that the proposed solution reaches a rate-distortion performance close to that of the SHVC reference software model (SHM), with speedups of 37 and 66 in intra and inter coding configurations, respectively.

**INDEX TERMS** Scalable video coding, HEVC, SHVC extension, real time video codecs.

# **I. INTRODUCTION**

With the massive consumption of video content, videos are stored and delivered in several formats that differ in resolution, frame rate, quality, bit depth and codec, in order to cover a wide range of user requirements. These requirements stem from the available bandwidth and memory, the supported codec, the display, computing and energy capabilities of the device, and the desired content quality. However, encoding and delivering the video in all these formats considerably increases both storage and bandwidth requirements. The Scalable High efficiency Video Coding (SHVC) extension [1], [2] has been designed by the Joint Collaborative Team on Video Coding (JCT-VC) as Annex H of the High Efficiency Video Coding (HEVC) standard [3] to encode the video in several layers (formats). SHVC is based on the HEVC standard and supports spatial, quality, bit-depth, color-gamut and codec scalability. The SHVC

The associate editor coordinating the review of this manuscript and approving it for publication was Giovanni Angiulli.

extension leverages inter-layer predictions to improve the Rate Distortion (RD) performance by up to 30% under the Common Test Conditions (CTC) [4], compared to the simulcast coding configuration, which consists of independent HEVC encodings of the same video in various formats. This gain can be further enhanced with an optimal bitrate allocation strategy between SHVC layers, as proposed in [5], [6]. Compared to the previous Scalable Video Coding (SVC) extension [7], SHVC offers two main advantages. First, the coding architecture of SHVC remains simple: it is based on the core HEVC standard, and inter-layer prediction requires only high-level changes. Second, SHVC was released only one and a half years after HEVC. These two advantages shorten the time-to-market after adoption and support its deployment in more video applications, not restricted to video conferencing services as was the case for SVC.

Advanced Television Systems Committee (ATSC) 3.0 has considered several broadcasting scenarios for which SHVC



<span id="page-1-0"></span>**FIGURE 1.** ATSC3.0 broadcasting scenario with SHVC codec.

has been identified as a serious candidate solution for video coding [8]. Fig. [1](#page-1-0) illustrates one ATSC 3.0 broadcasting scenario in which the SHVC encoder encodes the video in two layers, with HD and UHD (2×) spatial resolutions, which are then broadcast in Motion Picture Expert Group 2 - Transport Stream (MPEG-2 TS) within two Physical Layer Pipes (PLP) [9], [10]. The end user receiving the two layers decodes either the Base Layer (BL) for HD quality or both layers for UHD quality, depending on the display, energy and mobility constraints of the receiving device.

The closed-loop architecture of the SHVC extension requires the decoding of all reference layer frames to encode or decode a higher-quality layer frame. This increases both encoding and decoding complexity compared to a single-layer coding configuration. Moreover, the SHVC extension introduces additional processing to rescale the reference frames used by the Enhancement Layers (EL) for inter-layer predictions. In this paper, we propose a complete solution for a 4K real-time SHVC codec, including both SHVC encoder and decoder software. The proposed SHVC encoder, called SHVC-ATEME Encoder (SHVC-AE), is based on the professional *ATEME* HEVC software encoder, HEVC-ATEME Encoder (HEVC-AE) [11]. The SHVC decoder is based on the open source real-time *OpenHEVC* decoder [12]. The most time-consuming coding and decoding operations are optimized with Single Instruction Multiple Data (SIMD) methods for x86 platforms.

The down-sampling (encoder side) and up-sampling (both encoder and decoder) operations, required for spatial scalability in SHVC, are also optimized to speed them up and minimize the delay they introduce. The HEVC high-level parallel processing solutions, including tile, slice and wavefront [13], are supported by the HEVC *ATEME* encoder and the *OpenHEVC* decoder. The encoding layers and the down/up-sampling functions are pipelined and processed in parallel to take advantage of multi-core platforms and further minimize the end-to-end delay. The encoding solution enables real-time processing of $3840 \times 2160$p video at 30 fps on a $4 \times 10$-core Intel Xeon processor (E5-4627V3) running at 2.6 GHz. The decoder enables real-time decoding of the BL in full HD ($1920 \times 1080$p) resolution on a mobile Advanced RISC Machine (ARM) platform, and of the enhancement layer in UHD ($3840 \times 2160$p) at 30 fps on a laptop fitted with a 4-core Intel i7 processor running at 2.7 GHz.

The rest of this paper is organized as follows. Section [II](#page-1-1) provides details on the SHVC extension and the existing implementations of HEVC and its scalable extension. The architecture of the SHVC encoder and decoder is described in Section [III.](#page-3-0) The performance of the SHVC codec in terms of coding efficiency and speed is presented and discussed in Section [IV.](#page-6-0) Section [V](#page-10-0) describes the complete end-to-end SHVC demonstration in a broadcast environment. Finally, Section [VI](#page-12-0) concludes the paper.

# <span id="page-1-1"></span>**II. RELATED WORKS**

# A. SHVC EXTENSION

The SHVC extension [2] enables several types of scalability not supported by SVC, such as color-gamut and bit-depth scalability. These two types of scalability make it possible to switch from Standard Dynamic Range (SDR) to High Dynamic Range (HDR) formats within one bitstream [14]. SHVC defines high-level syntax elements, mostly in the Video Parameter Set (VPS) header. These syntax elements provide information on the video layers, such as the number of layers and, for each layer, its resolution, bit depth and inter-layer dependencies. The SHVC encoder architecture consists of *L* HEVC encoders within a single encoder, one per layer, where *L* is the number of layers: one BL and $L - 1$ ELs. In the case of SHVC spatial scalability, the BL HEVC encoder encodes a down-sampled version of the original video and feeds the first EL encoder with the decoded picture and its Motion Vectors (MVs). The BL is the first layer ($l = 1$) and encodes the lowest resolution of the video. The EL encoder of layer $l$ ($l = 2, \ldots, L$) encodes a higher-resolution video, using the decoded picture from a lower layer as an additional reference picture (included in the reference picture lists). The inter-layer reference picture is up-sampled and its MVs are up-scaled to match the resolution of the layer being encoded. The up-sampling operation is standardized and is performed by 8-tap and 4-tap interpolation filters for luma and chroma samples, respectively. The down-sampling operation carried out to produce the lower-resolution video is not standardized and can be considered a pre-processing operation. Fig. [2](#page-2-0) shows a block diagram of the SHVC encoder encoding two layers in a spatial scalability configuration. In the case of quality scalability (same resolution), the encoding process remains unchanged, except that the picture used for inter-layer prediction is not up-sampled and its MVs are not up-scaled. As shown in Fig. [2,](#page-2-0) the outputs of the two encoders are multiplexed to form one bitstream that conforms to SHVC.
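The MV up-scaling step described above can be illustrated with a minimal Python sketch. It only scales each motion-vector component by the ratio between the EL and BL picture sizes; the exact fixed-point arithmetic of the standard is not reproduced, and the function name `upscale_motion_vector` is a hypothetical helper introduced for illustration.

```python
from fractions import Fraction


def upscale_motion_vector(mv, bl_size, el_size):
    """Scale a base-layer motion vector (mvx, mvy) to the enhancement-layer grid.

    Simplified illustration: each component is multiplied by the EL/BL size
    ratio and rounded. The normative SHVC process uses fixed-point scaling
    factors that are not reproduced here.
    """
    scale_x = Fraction(el_size[0], bl_size[0])
    scale_y = Fraction(el_size[1], bl_size[1])
    return (round(mv[0] * scale_x), round(mv[1] * scale_y))


bl_size = (1920, 1080)  # HD base layer
el_size = (3840, 2160)  # UHD enhancement layer (2x spatial scalability)
print(upscale_motion_vector((5, -3), bl_size, el_size))  # (10, -6)
```

In quality scalability both layers share the same resolution, so the ratio is 1 and the vector is returned unchanged, which matches the statement that no MV up-scaling is needed in that case.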

The HEVC standard version 2 defines two SHVC profiles: Scalable Main and Scalable Main 10 [2]. The Scalable Main profile enables a BL that conforms to the Main HEVC profile, while the Scalable Main 10 profile allows a BL that conforms



<span id="page-2-0"></span>**FIGURE 2.** Block diagram of the SHVC encoder encoding two layers in spatial scalability.

to the Main 10 HEVC profile. The fourth HEVC version defines four more scalable profiles: three for a BL in monochrome format with bit depths of 8, 12 and 16 (Scalable Monochrome, Scalable Monochrome 12 and Scalable Monochrome 16), and one Scalable Main 4:4:4 profile that conforms to the Main 4:4:4 HEVC profile.

# B. REAL TIME VIDEO CODECS

In this section, we give a brief description of the existing SVC, HEVC and SHVC encoder and decoder solutions. The software *openSVC* decoder [15] was developed to offer an open source real-time decoder for the SVC extension. It is written in C and supports the Scalable Baseline profile, offering all the tools needed to handle spatial, temporal and fidelity scalability. The *openSVC* decoder is up to 50 times faster than the SVC reference software decoder, the Joint Scalable Video Model (JSVM) [16]. The authors in [17] proposed an SVC video encoder dedicated to HD video conferencing applications. This encoder combines slice-level parallelism for frame encoding with block-level parallelism for the up-sampling and interpolation filter processes. The baseline encoder is optimized in SIMD using Streaming SIMD Extensions (SSE) 2 instructions. On an 8-core Intel Xeon E5-2687W processor running at 3.1 GHz, the parallel encoder encodes a 720p30 video in real time at different bitrates. The slice partitioning introduces a slight loss in rate-distortion coding efficiency.

Recently, several hardware [18]–[22] and software [12], [23]–[26] HEVC decoders have been developed. The hardware solutions offer fast HEVC decoder implementations enabling real-time decoding of 4Kp60 [19] and even 8Kp60 [21] video with very low energy consumption [20]. On the other hand, software HEVC decoder implementations offer flexibility and fast time-to-market, and are well suited for quick adaptation to evolutions of the standard. In addition, software decoders can be easily optimized for several platforms not dedicated to video processing, including Intel x86 [12] and ARM/Neon [26], using SIMD instructions.

There are a number of hardware [27] and software [11], [28]–[31] implementations of the HEVC encoder. The two open source software HEVC encoders, *Kvazaar* and *x265*, enable real-time encoding of 4K videos, using both parallel processing (frame, tile and wavefront) and low-level optimizations through SIMD instructions. In addition, these solutions use algorithmic optimizations to avoid the full rate-distortion search, especially at the level of quad-tree partitioning and intra prediction. These algorithmic optimizations reduce the encoding complexity at the expense of a bitrate increase [32], [33].

For the SHVC encoder, the authors in [34] leverage the existing correlation between layers to select the Coding Unit (CU) size at the EL by restricting the CU depth range, in order to reduce the encoding complexity for quality scalability. This method skips specific depth levels that are rarely used in the previous frame and in neighboring CUs, which further reduces the full search set and decreases the coding complexity with an RD performance similar to that of the original SHVC encoder. The work in [35] proposes a method to predict CU modes based on the co-located CU within the reference quality layer. This solution enables up to 51% complexity reduction while maintaining the overall quality of the original SHVC coding. Finally, the authors in [36] developed an efficient Coding Tree Unit (CTU) decision method that combines a temporal-spatial searching order algorithm at the BL with a fast inter-layer searching algorithm at the EL to speed up SHVC encoding.

The major drawback of the SHVC solutions mentioned above [34]–[38] is that they do not achieve real-time operation. In fact, the complexity reductions offered by these solutions are around 50%, corresponding to a speedup of 2 with respect to the SHVC reference software Model (SHM) encoder. However, reaching real-time encoding of 4K video with the SHM encoder in spatial scalability requires a speedup of 40 to 80, depending on the coding configuration (Intra/Inter). In addition, these solutions, which use coding decisions of the BL encoder at the EL encoder, cannot be integrated into professional encoders since, as specified in the SHVC extension, only the decoded BL frame and its associated MVs shall be available at the EL encoder, without the coding decisions. To overcome these limitations, we propose an end-to-end solution that meets the real-time constraint, which is imperative for broadcast applications. Hence, in this paper we focus on the software implementation of a real-time SHVC encoder and decoder on multi-core Intel x86 platforms. The SHVC encoder is based on the professional *ATEME* core encoder, which includes SIMD instructions for the Intel x86 platform, algorithmic optimizations and parallelism. The real-time SHVC decoder is based on the core HEVC decoder, *OpenHEVC*, which optimizes the most time-consuming operations in SIMD for the x86 platform and takes advantage of multi-core processors to speed up the decoding



<span id="page-3-1"></span>**FIGURE 3.** SHM encoder processing.

process through tile, wavefront and frame parallelism. To the best of our knowledge, no SHVC codec other than the SHM [39], developed by the JCT-VC, is available to evaluate the proposed algorithmic contributions. In addition, the SHM is not designed for real-time processing.
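The relation between the complexity reductions and the speedups quoted in this section can be checked with a short Python sketch. It assumes that a complexity reduction $r$ means the encoding time is multiplied by $1 - r$; the function names are hypothetical and introduced only for illustration.

```python
def speedup_from_reduction(reduction):
    """Speedup obtained when the encoding time is reduced by `reduction` (0 <= r < 1)."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def reduction_for_speedup(speedup):
    """Complexity reduction needed to reach a given speedup (speedup >= 1)."""
    if speedup < 1.0:
        raise ValueError("speedup must be at least 1")
    return 1.0 - 1.0 / speedup


print(speedup_from_reduction(0.50))  # 2.0
print(reduction_for_speedup(40))     # 0.975
```

Under this assumption, a speedup of 40 corresponds to removing 97.5% of the encoding time, which is far beyond what a 50% complexity reduction provides.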

The SHM encoder achieves a high rate-distortion performance since it relies on full-search rate-distortion optimization, at the expense of coding speed. Moreover, the SHM neither includes low-level optimizations nor uses parallel processing. Fig. [3](#page-3-1) illustrates the sequential architecture of the SHM encoder encoding two video layers. First, the input video is pre-processed, which corresponds to down-sampling in spatial scalability, and then encoded with the BL encoder. The decoded BL frame is then processed by the *data rescaling* block to rescale the BL output data. In spatial scalability, this block up-samples the decoded frame and up-scales its MVs. Finally, the EL encoder block encodes the original video, using the rescaled decoded BL frame as an additional reference frame. The SHM software performs these four main encoding operations in sequential order, which increases both the encoding time and the end-to-end latency compared to a fully pipelined architecture on a multi-core platform. In this paper, the proposed real-time SHVC codec is compared to the SHM codec in terms of rate-distortion performance for the encoder, and in terms of speedup and processed frames per second (fps) for both encoder and decoder.
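The difference between a sequential and a fully pipelined execution of these four operations can be sketched with a simple timing model in Python. The stage durations below are hypothetical values chosen for illustration, not measurements from the SHM or from the proposed codec, and the model assumes that each pipeline stage runs on its own core.

```python
# Hypothetical per-frame stage durations in milliseconds (not measured values).
stage_ms = {
    "down-sampling": 10.0,
    "BL encoding": 25.0,
    "data rescaling": 12.0,
    "EL encoding": 30.0,
}


def sequential_frame_interval(stages):
    """Time between two output frames when the stages run one after another."""
    return sum(stages.values())


def pipelined_frame_interval(stages):
    """Time between two output frames when each stage runs on its own core.

    In steady state the slowest stage dictates the throughput.
    """
    return max(stages.values())


seq = sequential_frame_interval(stage_ms)
pipe = pipelined_frame_interval(stage_ms)
print(f"sequential: {seq:.1f} ms/frame, {1000.0 / seq:.1f} fps")
print(f"pipelined:  {pipe:.1f} ms/frame, {1000.0 / pipe:.1f} fps")
```

With these illustrative numbers the sequential organization outputs one frame every 77 ms, while the pipelined one outputs a frame every 30 ms.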

# <span id="page-3-0"></span>**III. PROPOSED REAL TIME SHVC CODEC**

## A. REAL TIME SHVC DECODER

The SHVC decoder consists of multiple instances of the *OpenHEVC* HEVC decoder, where each instance decodes one SHVC layer. In the proposed architecture, the SHVC pixel up-sampling and MV up-scaling operations are carried out at the block level by the EL decoder. This architecture enables both fast and low-latency decoding, since only the blocks used as reference are up-sampled in spatial scalability and efficient parallel decoding is performed between layers. The up-sampling operation, which consists of an 8-tap filter for luma and a 4-tap filter for chroma components, is optimized with SIMD instructions for Intel x86 and embedded ARM Neon processors. Moreover, the most complex HEVC decoding operations, including the Discrete Cosine Transform (DCT)/Discrete Sine Transform (DST) transforms and the motion compensation filters, are optimized in the core HEVC decoder (*OpenHEVC*) for these two platforms. The *OpenHEVC* decoder supports the wavefront, tile and frame-based parallel processing solutions, which decode CTU rows,



<span id="page-3-2"></span>**FIGURE 4.** Frame-based parallel decoding in the scalable OpenHEVC decoder.

tiles and frames in parallel, respectively. Wavefront and tile parallel processing in the core *OpenHEVC* decoder can be activated for all SHVC layers when the corresponding tool is enabled by the encoder. The frame-based parallel decoding mechanism in the core *OpenHEVC* decoder has been extended to support parallel decoding of frames from different layers.

Fig. [4](#page-3-2) illustrates the frame-based parallel decoding of two SHVC video layers encoded in $2 \times$ spatial scalability. In total, six frames (three at each layer) are decoded in parallel, with inter and inter-layer control mechanisms ensuring that the block used as reference is available (already decoded) to perform inter and inter-layer predictions. When the block used as reference is not available (not yet decoded), the threads of the dependent blocks wait until the reference block is decoded. Once a thread completes the decoding of a block, it wakes up all threads waiting for this block. Moreover, as long as the EL frame is not fully decoded, the reference BL frame is not released, since it can still be used as reference by the EL decoder. The proposed decoder supports several types of scalability, including spatial, quality, color gamut, bit depth and codec scalability with a BL coded by the Advanced Video Coding (AVC) standard [40].
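The wait and wake-up behaviour described above can be sketched in Python with a condition variable. This is a minimal illustration and not the *OpenHEVC* implementation: the class `DecodedBlockTracker`, its methods and the block identifiers are hypothetical names introduced here.

```python
import threading


class DecodedBlockTracker:
    """Tracks decoded blocks and lets dependent threads wait for a reference block."""

    def __init__(self):
        self._cond = threading.Condition()
        self._decoded = set()

    def mark_decoded(self, block_id):
        # Record the block and wake up every thread waiting on any block.
        with self._cond:
            self._decoded.add(block_id)
            self._cond.notify_all()

    def wait_until_decoded(self, block_id, timeout=None):
        # Block the caller until the reference block has been decoded.
        with self._cond:
            return self._cond.wait_for(lambda: block_id in self._decoded, timeout)


tracker = DecodedBlockTracker()
events = []
reference = ("BL", 0, 0)  # hypothetical identifier: (layer, frame, block)


def decode_el_block():
    tracker.wait_until_decoded(reference)  # inter-layer dependency
    events.append("EL block decoded")


def decode_bl_block():
    events.append("BL block decoded")
    tracker.mark_decoded(reference)


el_thread = threading.Thread(target=decode_el_block)
bl_thread = threading.Thread(target=decode_bl_block)
el_thread.start()
bl_thread.start()
el_thread.join()
bl_thread.join()
print(events)  # ['BL block decoded', 'EL block decoded']
```

Even though the EL thread is started first, it cannot proceed until the BL thread signals that the reference block is available, so the decoding order is preserved.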

## B. REAL TIME HEVC ENCODER

The proposed SHVC encoder (SHVC-AE) relies on the core HEVC software encoder (HEVC-AE) developed by *ATEME*. As for the SHVC decoder, SHVC-AE creates multiple instances of the core HEVC-AE to encode the SHVC layers. The software HEVC-AE is also optimized with SSE2 instructions to speed up, on Intel platforms, the main HEVC coding operations, including Intra prediction, motion compensation filters, DCT/DST transforms and in-loop filters. The encoding steps in the HEVC-AE, involving video acquisition, pre-processing, Group of Pictures (GOP) construction and coding decision, are pipelined as illustrated in Fig. [5.](#page-4-0) The first step manages the video acquisition from a file or from an external



<span id="page-4-0"></span>**FIGURE 5.** Pipeline of the encoding steps in the HEVC-AE.

device (camera, Serial Digital Interface (SDI) card). Then, the pre-processing step adapts the input source video format to the encoder input format, including color conversion and bit-depth adaptation. The GOP construction module assigns a specific Picture Order Count (POC) to each picture. Then, the bitrate estimation module estimates the bitrate allocated to each picture so as to meet the target bitrate with the highest video quality. This step may introduce latency depending on the GOP configuration, since all pictures of the GOP are required. Finally, the coding decision step performs rate-distortion minimization over a pre-defined set of HEVC coding configurations, ending up with the most efficient coding tools within the considered set:

$$
\{C_k^*\}_{k=1}^M = \underset{\{\vec{C}_k\}_{k=1}^M}{\arg \min} \sum_{i=1}^M \left(J_{i|_{\vec{C}_i}}\right) \tag{1}
$$

where *M* is the number of coding parameters, $\vec{C}_k$ is the set of all coding configurations tested for coding parameter *k*, and *J* is the RD cost to minimize, computed by Equation [\(2\)](#page-4-1), where λ, *D* and *R* are the Lagrangian parameter, the distortion and the bitrate, respectively.

<span id="page-4-1"></span>
$$
J = D + \lambda \cdot R. \tag{2}
$$
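As an illustration of Equation (2), the following Python sketch selects, among a few candidate configurations, the one with the lowest RD cost. The candidate names and their distortion and rate values are hypothetical and are not taken from the HEVC-AE.

```python
def rd_cost(distortion, rate, lam):
    """Lagrangian rate-distortion cost J = D + lambda * R."""
    return distortion + lam * rate


def best_configuration(candidates, lam):
    """Return the name of the candidate with the lowest RD cost.

    `candidates` maps a configuration name to a (distortion, rate) pair.
    """
    return min(candidates, key=lambda name: rd_cost(*candidates[name], lam))


# Hypothetical (distortion, rate) measurements for three coding configurations.
candidates = {
    "mode_a": (120.0, 40.0),
    "mode_b": (100.0, 70.0),
    "mode_c": (150.0, 20.0),
}

print(best_configuration(candidates, lam=0.8))  # mode_a (J = 152)
print(best_configuration(candidates, lam=2.0))  # mode_c (J = 190)
```

A larger λ penalizes the bitrate more strongly, which shifts the decision towards the configuration with the lowest rate.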

The number of configurations *H* to be tested by the encoder is the product of the numbers of configurations of the *M* parameters, expressed as follows:

$$
H = \prod_{i=1}^{M} \dim\left(\vec{C}_{i}\right). \tag{3}
$$
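The product in Equation (3) grows quickly with the number of coding parameters, which the following Python sketch illustrates. The parameter names and option lists are hypothetical and do not correspond to the setups of the HEVC-AE.

```python
from math import prod


def count_configurations(options_per_parameter):
    """Number of configurations H: product of the option counts of all parameters."""
    return prod(len(options) for options in options_per_parameter.values())


# Hypothetical coding parameters and the options tested for each of them.
parameters = {
    "cu_depth": [0, 1, 2, 3],
    "intra_modes": ["planar", "dc", "angular"],
    "transform_skip": [False, True],
}

print(count_configurations(parameters))  # 24
```

Restricting the options of a single parameter, for example keeping only two CU depths, halves the number of configurations in this example.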

The coding decision is the most complex step within the HEVC-AE pipeline. As shown in Fig. [5](#page-4-0), the coding decision takes more than one real-time cycle. To support real-time encoding, the duration of this step should be lower than the duration of one frame (one real-time cycle, equal to $\frac{1}{video\, fps}$ seconds).
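The real-time constraint can be expressed as a simple time budget, as in the Python sketch below. The step durations used in the example are hypothetical.

```python
def real_time_cycle_ms(fps):
    """Duration of one real-time cycle (one frame period) in milliseconds."""
    return 1000.0 / fps


def fits_real_time(step_ms, fps):
    """True if a pipeline step is shorter than one real-time cycle."""
    return step_ms < real_time_cycle_ms(fps)


print(round(real_time_cycle_ms(30), 2))  # 33.33
print(fits_real_time(30.0, 30))          # True
print(fits_real_time(40.0, 30))          # False
```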

Three different optimizations are carried out to reduce the coding decision duration so that it fits within a real-time cycle. The first one consists of SIMD optimization of the most complex coding operations, including Intra prediction, motion compensation filters, DCT/DST and in-loop filters. The second optimization consists of defining restricted sets of coding configurations to be tested by the encoder. This optimization led to three coding setups named *FILE*, *LIVE* HD and *LIVE* UHD. The *FILE* setup considers a large set of coding configurations, targeting a high video quality at the expense of coding speed, while the *LIVE* HD and *LIVE* UHD setups test reduced sets of coding configurations, favoring coding speed to fulfill the real-time requirements of HD and UHD resolutions, respectively. Table [1](#page-5-0) gives the coding tools tested in the *FILE*, *LIVE* HD and *LIVE* UHD setups. The complexity reduction of HEVC encoders has been widely investigated in the literature [33]. The derivation of these setups is therefore not investigated in this paper, which focuses on the parallel and optimized software implementation of the SHVC codec.

The third optimization considers parallel processing at different levels of the encoder to take advantage of multi-core platforms. The core HEVC-AE supports the Tile parallel processing defined in the HEVC standard: HEVC-AE can process multiple independent rectangular regions (Tiles) of one frame in parallel. This speeds up the coding decision step at the expense of a slight coding performance loss caused by Tile partitioning. The Tile parallel processing is activated only in the *LIVE* setups.

The second level of parallelism, called CTU-parallelism, processes the CTU rows of the frame in parallel. CTU-parallelism differs from the wavefront parallelism proposed in HEVC [13] in that the entropy engine is not initialized at each CTU row. This improves the coding efficiency of the Context-Adaptive Binary Arithmetic Coding (CABAC) engine, since it is not re-initialized, at the expense of increased memory usage. In fact, CTU-parallelism performs all encoding operations of the CTU rows in wavefront except the CABAC, which is performed in sequential order once the coding of all CTU rows is completed. This solution increases the memory usage since all coding decisions are stored and then processed by the CABAC engine once the last CTU of the previous row is encoded (CABAC context is available).
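The benefit of processing CTU rows in wavefront order can be estimated with the Python sketch below. It relies on several assumptions that are not stated in the paper: a 64×64 CTU size, one time step per CTU, one thread per CTU row, and the usual HEVC wavefront dependency in which a CTU waits for its left neighbour and for the top-right CTU of the row above (a lag of two CTUs between consecutive rows). The sequential CABAC pass is not modelled.

```python
from math import ceil


def wavefront_start_step(x, y, lag=2):
    """Earliest step at which CTU (x, y) can start under a `lag`-CTU row offset."""
    return x + lag * y


def wavefront_total_steps(cols, rows, lag=2):
    """Steps needed to process the whole frame with one thread per CTU row."""
    return wavefront_start_step(cols - 1, rows - 1, lag) + 1


CTU_SIZE = 64                  # assumed CTU size in luma samples
cols = ceil(3840 / CTU_SIZE)   # 60 CTU columns in a UHD frame
rows = ceil(2160 / CTU_SIZE)   # 34 CTU rows in a UHD frame

print(cols * rows)                        # 2040 steps when processed sequentially
print(wavefront_total_steps(cols, rows))  # 126 steps in wavefront order
```

Under these assumptions, the coding decisions of a UHD frame take about 16 times fewer steps in wavefront order than in sequential order, provided enough cores are available.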

The third level of parallelism is performed between the coding decision steps of different frames. Several frames




<span id="page-5-0"></span>**TABLE 1.** Coding configurations of the FILE, LIVE HD and LIVE UHD HEVC-AE setups.

are encoded in parallel, where the main process manages the inter-frame dependencies, ensuring that the block used as reference is available within the reference frame. The main process (manager) launches the frame encodings in parallel (threads) and manages all communications between concurrent threads. The frame-based parallelism speeds up the encoding process without impacting the coding quality and can therefore be activated, like the CTU-parallelism, in the *FILE*, *LIVE* HD and *LIVE* UHD setups.


## <span id="page-5-2"></span>C. REAL TIME SHVC ENCODER

The SHVC-AE creates multiple instances of the core HEVC-AE to encode the SHVC layers. The support of the SHVC standard introduces two new operations to the core HEVC-AE: down-sampling and up-sampling. In spatial scalability, the down-sampling operation builds, from the source video, the frames to be encoded by the BL encoder, while the up-sampling operation creates, from the decoded BL frame, the frame used by the EL encoder as reference for inter-layer predictions. Fig. [6](#page-5-1) shows the pipeline of the coding steps in the SHVC-AE encoding two layers. The down-sampling and up-sampling operations, illustrated in red and pink, are performed by the BL and EL encoders, respectively. To perform inter-layer prediction, the EL requires the coding information from the BL. This means that the beginning of the coding decision on the EL needs to be synchronized with the end of the coding decision on the BL. Therefore, a latency of three cycles is introduced, corresponding to the down-sampling step, the inter-layer synchronization and the up-sampling step. Fig. [6](#page-5-1) also shows that the durations of both the up-sampling and down-sampling operations are higher than one real-time cycle. Two optimizations are proposed to speed up these operations: SIMD optimization and parallelism. The up-sampling is a standardized operation and consists of 8-tap and 4-tap filters for luma and chroma components, respectively. The down-sampling is not standardized and is also carried out in this paper with 8-tap and 4-tap filters for luma and chroma components. These two operations are performed as a convolution product between the pixels and the filter coefficients:

$$
s_m = \sum_{i=-(\lceil W/2 \rceil - 1)}^{\lceil W/2 \rceil} c_{i+\lceil W/2 \rceil - 1} \cdot p_{m+i} \tag{4}
$$



<span id="page-5-1"></span>**FIGURE 6.** Pipeline of the encoding steps in the [SHVC-AE.](#page-0-0)

where $c_i$ are the filter coefficients, $p_m$ is the pixel value at position $m$, $s_m$ is the output of the filter at position $m$, and $W$ is the size of the filter. In this paper, *W* is equal to 8 and 4 for luma and chroma components, respectively.
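The filtering operation can be sketched in plain Python as follows, with the index $i$ running over the $W$ taps from $-(\lceil W/2 \rceil - 1)$ to $\lceil W/2 \rceil$ so that the coefficient index stays within $0, \ldots, W-1$. The coefficients used below are the HEVC half-sample luma interpolation taps, chosen only for illustration. The coefficients actually used by the encoder for down-sampling are not given in this section, and the edge clamping is an assumption.

```python
from math import ceil

# HEVC half-sample luma interpolation taps, used here only as an example (sum = 64).
LUMA_TAPS = [-1, 4, -11, 40, 40, -11, 4, -1]


def filter_sample(pixels, m, coeffs):
    """Apply a W-tap filter at position m of a 1D row of pixels.

    The taps cover positions m - (ceil(W/2) - 1) to m + ceil(W/2).
    Positions outside the row are clamped to the nearest edge (assumption).
    """
    w = len(coeffs)
    half = ceil(w / 2)
    total = 0
    for i in range(-(half - 1), half + 1):
        pos = min(max(m + i, 0), len(pixels) - 1)
        total += coeffs[i + half - 1] * pixels[pos]
    return total


flat_row = [100] * 16
ramp_row = list(range(16))

print(filter_sample(flat_row, 8, LUMA_TAPS) >> 6)  # 100: a flat signal is preserved
print(filter_sample(ramp_row, 8, LUMA_TAPS) / 64)  # 8.5: half-way between samples 8 and 9
```

The division by 64 normalizes the result because the example taps sum to 64. An integer implementation would use a rounding shift instead.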

For the 8-tap filter, the 2D convolution product requires 8 multiplications and 7 additions per sample in each of the horizontal and vertical directions. SSE3 instructions define several functions to perform arithmetic operations on registers of 64 and 128 bits. The 8-tap filter for 8 luma positions (pixels) can be performed with only 4 multiplications (*\_mm\_maddubs\_epi16*) and three additions (*\_mm\_add\_epi16*), on 64 bits and 128 bits for bit depths of 8 and 10, respectively. Fig. [7](#page-6-1) illustrates horizontal 8-tap filters performed by SSE3 instructions. The down-sampling and up-sampling operations can also be conducted in parallel on multi-core processors. To optimize the memory access, we propose to process the three color components in parallel. Moreover, the frame of each component is partitioned in four horizontal regions (two for chroma) of similar height equal

| p[-3] | p[-2] | p[-1] | p[0] | p[1] | p[2] | p[3] | p[4] |
|-------|-------|-------|------|------|------|------|------|
| × (_mm_maddubs_epi16) | | | | | | | |
| c[0] | c[1] | c[2] | c[3] | c[4] | c[5] | c[6] | c[7] |
| + (_mm_add_epi16) | | | | | | | |
| p[1]   p[0]   p[1]                                                                                                                                                                                                                                                                                                  |                                                             |  | p[2] | p[3] | p[4] |      |     |                       |  |
|                                                                                                                                                                                                                                                                                                                     | $x \ (mm \text{maddubs\_epi16})$                            |  |      |      |      |      |     |                       |  |
| $\frac{1}{2}$ |                                                             |  |      |      |      |      |     |                       |  |
|                                                                                                                                                                                                                                                                                                                     | $\pm$<br>$\text{\textcolor{blue}{\text{(mm\_add\_epi16)}}}$ |  |      |      |      |      |     |                       |  |
| $\left\lfloor \frac{1}{2} \right\rfloor$                                                                                                                                                                                                                                                                            | p[2]  p[3]  p[4]                                            |  |      |      |      |      |     |                       |  |
|                                                                                                                                                                                                                                                                                                                     |                                                             |  |      |      |      |      |     | x (_mm_maddubs_epi16) |  |
| $  c_V[4] c_V[5] c_V[6] c_V[7] $                                                                                                                                                                                                                                                                                    |                                                             |  |      |      |      |      |     |                       |  |
|                                                                                                                                                                                                                                                                                                                     | $\text{\textcircled{m}}$ $\text{and}\text{epi16}$<br>$\pm$  |  |      |      |      |      |     |                       |  |
| $\frac{1}{2}$ p[3]                                                                                                                                                                                                                                                                                                  | p[4]                                                        |  |      |      |      |      |     |                       |  |
| X (_mm_maddubs_epi16)                                                                                                                                                                                                                                                                                               |                                                             |  |      |      |      |      |     |                       |  |
| $  c_V[6]  c_V[7]  $                                                                                                                                                                                                                                                                                                |                                                             |  |      |      |      |      |     |                       |  |
|                                                                                                                                                                                                                                                                                                                     |                                                             |  |      |      |      |      |     |                       |  |
| s 0                                                                                                                                                                                                                                                                                                                 |                                                             |  |      |      |      |      |     |                       |  |

<span id="page-6-1"></span>**FIGURE 7.** Convolutional product optimized in SSE instructions with buffer sizes of 64 and 128 bits.

<span id="page-6-2"></span>**TABLE 2.** Coding gains in terms of [Bjøntegaard Delta Bit Rate \(BD-BR\)](#page-0-0) of I, P and B slices with the [SHM](#page-0-0) in [Random Access \(RA\)](#page-0-0) coding configuration.

| Videos              | Gain with I, P and B slices | Gain with I and P slices | Gain with I slices |
|---------------------|-----------------------------|--------------------------|--------------------|
| Traffic             | $-26.80\%$                  | $-24.10\%$               | $-18.00\%$         |
| PeopleOnStreet      | $-45.10\%$                  | $-20.60\%$               | $-7.70\%$          |
| Kimono1             | $-42.70\%$                  | $-29.34\%$               | $-14.50\%$         |
| ParksScene          | $-27.70\%$                  | $-23.0\%$                | $-15.80\%$         |
| Cactus              | $-28.00\%$                  | $-18.60\%$               | $-6.60\%$          |
| BasketballDrive     | $-31.50\%$                  | $-14.60\%$               | $-2.90\%$          |
| BQTerrace           | $-8.20\%$                   | $-7.00\%$                | $-1.90\%$          |
| Average             | $-30.00\%$                  | $-19.61\%$               | $-9.63\%$          |
| Gain vs. total gain | $100\%$                     | $69.38\%$                | $33.03\%$          |

to the frame height / 4 (frame height / 2 for chroma) and the width of the frame. These four regions are also processed in parallel resulting in 8 threads (4 for luma and 4 for chroma) running in parallel for both up-sampling and down-sampling processes. It should be noted that this partitioning does not have an impact on the [RD](#page-0-0) performance since it is only used for parallel processing of the up-sampling process.
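The region-based partitioning described above can be sketched as follows. This is only an illustration under stated assumptions: `band_ranges`, `process_band` and `process_plane` are hypothetical helpers, `smooth_row` is a stand-in 3-tap filter rather than the actual resampling filter, and Python threads are used to show the structure of the partitioning, not to reproduce the reported speed.

```python
from concurrent.futures import ThreadPoolExecutor


def band_ranges(height, bands):
    """Split [0, height) into `bands` contiguous row ranges of similar height."""
    base, extra = divmod(height, bands)
    ranges, start = [], 0
    for b in range(bands):
        stop = start + base + (1 if b < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def smooth_row(row):
    """Stand-in for the resampling filter: 3-tap average with border clamping."""
    n = len(row)
    return [(row[max(x - 1, 0)] + row[x] + row[min(x + 1, n - 1)]) // 3
            for x in range(n)]


def process_band(plane, rows):
    """Filter the rows of one horizontal region; the input plane is read-only."""
    start, stop = rows
    return [smooth_row(plane[y]) for y in range(start, stop)]


def process_plane(plane, bands):
    """Process each horizontal region in its own thread and reassemble the plane."""
    tasks = band_ranges(len(plane), bands)
    with ThreadPoolExecutor(max_workers=bands) as pool:
        parts = list(pool.map(lambda r: process_band(plane, r), tasks))
    return [row for part in parts for row in part]


luma = [[(x * y) % 256 for x in range(16)] for y in range(8)]
sequential = process_band(luma, (0, len(luma)))
parallel = process_plane(luma, bands=4)
```

Since every region reads from the shared input plane and writes only its own rows, the partitioned result is identical to the unpartitioned one, which is consistent with the statement that the partitioning has no impact on the RD performance. For a 4:2:0 UHD frame, `band_ranges(2160, 4)` for luma and `band_ranges(1080, 2)` for each chroma plane both yield regions of 540 rows.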

TABLE [2](#page-6-2) gives the coding gains in terms of [BD-BR](#page-0-0) metric [41], [42] of the [SHM](#page-0-0) with respect to single-layer coding configuration (i.e., [EL](#page-0-0) coded with [SHM](#page-0-0) versus [EL](#page-0-0) coded with [HEVC\)](#page-0-0). It provides the bitrate reduction of the [EL](#page-0-0) when inter-layer prediction is activated on I slices only, on I and P slices, and on I, P and B slices in the [RA](#page-0-0) coding configuration illustrated in Fig. [4.](#page-3-2) We can notice from TABLE [2](#page-6-2) that inter-layer prediction on I slices brings 33 % of the total [SHVC](#page-0-0) gain whereas I slices represent only 1 % to 4 % of the slices in an [RA](#page-0-0) bitstream, depending on the video frame rate [\(fps\)](#page-0-0). The P slices representing between 8 % and 11 % in [RA](#page-0-0)

<span id="page-6-3"></span>**TABLE 3.** Configurations of the [SHVC-AE.](#page-0-0)



bitstream bring 36 % of the total [SHVC](#page-0-0) gain. Finally, the B slices representing around 88 % of the slices in [RA](#page-0-0) bitstream bring on average 30 % of the total [SHVC](#page-0-0) gain. We can use these statistics to reduce the [SHVC-AE](#page-0-0) complexity with a slight impact on the coding gain. We propose in this paper to disable inter-layer prediction in the [SHVC-AE](#page-0-0) on the B slices of the highest temporal layer since these frames are not used as reference in inter prediction and bring the lowest [SHVC](#page-0-0) gain (frames id 1 and 3 in Fig. [4\)](#page-3-2). This optimization concerns only the [RA](#page-0-0) coding configuration and speeds up the coding process since up-sampling is not performed for the B slices of the highest temporal layer. Moreover, this technique also decreases the decoder complexity since the B slices of the highest temporal layer are not up-sampled. In fact, the proposed [SHVC](#page-0-0) decoder architecture up-samples only the blocks used as reference for inter-layer prediction.
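The per-slice-type shares quoted above follow from the last row of TABLE 2 by simple differences, and the proposed restriction reduces to a one-line decision rule. The sketch below makes both explicit; the function name and its parameters are hypothetical and only illustrate the rule, they are not taken from the [SHVC-AE](#page-0-0) code.

```python
# Last row of TABLE 2: cumulative share of the total inter-layer gain (%).
SHARE_I = 33.03       # inter-layer prediction on I slices only
SHARE_IP = 69.38      # on I and P slices
SHARE_IPB = 100.0     # on I, P and B slices

# Contribution of each slice type, obtained by differencing the cumulative shares.
contribution = {
    "I": SHARE_I,
    "P": SHARE_IP - SHARE_I,
    "B": SHARE_IPB - SHARE_IP,
}


def inter_layer_prediction_enabled(slice_type, temporal_id, max_temporal_id):
    """Hypothetical rule: skip inter-layer prediction (and hence up-sampling)
    for B slices that belong to the highest temporal layer."""
    return not (slice_type == "B" and temporal_id == max_temporal_id)
```

With these numbers the P slices account for about 36 % and the B slices for about 31 % of the total gain, matching the rounded figures given in the text.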

The [SHVC-AE](#page-0-0) inherits the parallelism from the core [HEVC-AE.](#page-0-0) The [SHVC-AE](#page-0-0) encoder encodes each layer in parallel and can also use Tile parallelism when running in *LIVE* setups. The CTU-parallelism can also be used on each Tile or on the whole frame to speed up the video coding process in both *FILE* and *LIVE* setups since it does not reduce the compression performance. Moreover, the frame-based parallelism is also extended to process in parallel the [BL](#page-0-0) and [EL](#page-0-0) of several frames. The main manager can launch the encoding of several [SHVC](#page-0-0) frames in parallel, with synchronization between concurrent encodings. It should be noted that the [BL](#page-0-0) and [EL](#page-0-0) of one frame are always processed in sequential order and only other operations in the pipeline are carried out in parallel between layers of one frame. For the [SHVC-AE,](#page-0-0) we define three setups: *FILE* and *LIVE*, which use the *FILE* and *LIVE* UHD single-layer encoder setups at both layers, respectively, and *LIVE*+, which uses the *LIVE* HD setup on the [BL](#page-0-0) and the *LIVE* UHD one on the [EL.](#page-0-0) TABLE [3](#page-6-3) summarizes the activated coding tools in the three considered setups for the proposed [SHVC-AE.](#page-0-0)

# <span id="page-6-0"></span>**IV. RESULTS AND PERFORMANCE EVALUATION**

## A. EXPERIMENTAL SETUP

The experimental tests for the [SHVC-AE](#page-0-0) have been carried out on a  $4 \times 10$ -cores Intel Xeon processor (E5-4627V3)

<span id="page-7-0"></span>**TABLE 4.** Test video sequences.

| Sequence | Name            | Resolution         | Frame rate (fps) | Number of frames |
|----------|-----------------|--------------------|------------------|------------------|
| UHD1     | PeopleOnStreet  | $3840 \times 2160$ | 30               | 150              |
| UHD2     | Brest           | $3840 \times 2160$ | 60               | 600              |
| A1       | Traffic         | $2560 \times 1600$ | 30               | 150              |
| A2       | PeopleOnStreet  | $2560 \times 1600$ | 30               | 150              |
| B1       | Kimono1         | $1920 \times 1080$ | 24               | 240              |
| B2       | ParksScene      | $1920 \times 1080$ | 24               | 240              |
| B3       | Cactus          | $1920 \times 1080$ | 50               | 500              |
| B4       | BasketballDrive | $1920 \times 1080$ | 50               | 500              |
| B5       | BQTerrace       | $1920 \times 1080$ | 60               | 600              |

<span id="page-7-1"></span>**TABLE 5.** [BD-BR](#page-0-0) performance of the [SHVC-AE EL](#page-0-0) in comparison with [HEVC-AE,](#page-0-0) [SHM EL](#page-0-0) with HM and [SHVC-AE EL](#page-0-0) with [SHM EL](#page-0-0) in [AI](#page-0-0) coding configuration.



running at 2.6 GHz. Several test video sequences from the [SHVC CTC](#page-0-0) and *4-EVER* French collaborative project (*Brest*), described in TABLE [4,](#page-7-0) have been considered in this study. These videos are encoded with the [SHVC](#page-0-0) reference software [\(SHM\)](#page-0-0) encoder and the proposed *ATEME* [SHVC](#page-0-0) encoder [\(SHVC-AE\)](#page-0-0) in three setups: *FILE*, *LIVE* and *LIVE*+. The videos are encoded in $2\times$ spatial scalability with eight [Quantization Parameter \(QP\)](#page-0-0) pairs, $(QP_{BL}, QP_{EL}) \in \{(22, 22), (22, 24), (26, 26), (26, 28), (30, 30), (30, 32), (34, 34), (34, 36)\}$, and three [GOP](#page-0-0) coding configurations: [All Intra \(AI\), Low Delay P \(LD. P\)](#page-0-0) and [RA.](#page-0-0) The performance of the [SHVC-AE](#page-0-0) is assessed in terms of coding speed in [fps,](#page-0-0) speed-up compared to [SHM](#page-0-0) and rate-distortion with respect to [SHM](#page-0-0) and [HEVC-AE](#page-0-0) single-layer using the [BD-BR](#page-0-0) metric [41], [42]. The proposed real time [SHVC](#page-0-0) decoder is assessed in terms of decoding frame rate in [fps](#page-0-0) and speed-up compared to the reference [SHM](#page-0-0) decoder. The performance of the decoder is evaluated on two platforms: a laptop fitted with a 4-core Intel i7-6820HQ CPU for both layers and an octa-core Exynos 5410 [System on Chip \(SoC\)](#page-0-0) for the [BL](#page-0-0) resolution. This [SoC](#page-0-0) is based on the big.LITTLE configuration including a cluster of 4 ARM Cortex-A15 cores and a cluster of 4 ARM Cortex-A7 cores. The Tile parallelism activated in *LIVE* setups splits the video frame in 4 tiles ($2\times2$) of the same size.
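To make the $2\times2$ Tile split concrete, the sketch below computes tile rectangles with the uniform-spacing rule of HEVC, in which tile boundaries are aligned on CTUs. The CTU size of 64 and the helper name `uniform_tiles` are assumptions for illustration; the encoder's actual tile configuration is not detailed in the text.

```python
def uniform_tiles(width, height, cols=2, rows=2, ctu=64):
    """Return (x, y, w, h) rectangles of a uniformly spaced cols x rows tile grid.

    Boundaries are aligned on CTUs; `ctu=64` is an assumed CTU size.
    """
    ctus_w = -(-width // ctu)    # picture width in CTUs (ceiling division)
    ctus_h = -(-height // ctu)   # picture height in CTUs
    col_edges = [((i * ctus_w) // cols) * ctu for i in range(cols + 1)]
    row_edges = [((j * ctus_h) // rows) * ctu for j in range(rows + 1)]
    col_edges[-1], row_edges[-1] = width, height  # last CTU column/row may be partial
    return [
        (col_edges[i], row_edges[j],
         col_edges[i + 1] - col_edges[i], row_edges[j + 1] - row_edges[j])
        for j in range(rows)
        for i in range(cols)
    ]


tiles_uhd = uniform_tiles(3840, 2160)
```

Under these assumptions a $3840 \times 2160$ frame yields four tiles that are $1920$ pixels wide and $1088$ or $1072$ pixels high, so the tiles have the same size up to CTU granularity.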

## B. THE PROPOSED [SHVC-AE](#page-0-0) FILE

TABLE [5](#page-7-1) gives the performance of the [SHVC-AE](#page-0-0) *FILE* in terms of [BD-BR](#page-0-0) with respect to the reference [SHM](#page-0-0) encoder



<span id="page-7-2"></span>**TABLE 6.** [BD-BR](#page-0-0) performance of the [SHVC-AE EL](#page-0-0) in comparison with [HEVC-AE,](#page-0-0) [SHM EL](#page-0-0) with HM and [SHVC-AE EL](#page-0-0) with [SHM EL](#page-0-0) in [LD. P](#page-0-0) coding configuration.

in [AI](#page-0-0) coding configuration. The first column shows the bitrate saving of the [SHVC-AE EL](#page-0-0) with respect to [HEVC-AE](#page-0-0) encoding the [EL](#page-0-0) in single-layer configuration. The inter-layer prediction in the proposed [SHVC-AE](#page-0-0) enables on average 36 % bitrate reduction while [SHM](#page-0-0) enables 34.5 %. The inter-layer prediction is more efficient in [SHVC-AE](#page-0-0) than in the [SHM](#page-0-0) since the two single-layer encoders in [SHM](#page-0-0) are more efficient in terms of compression than the [HEVC-AEs](#page-0-0) encoding the two layers. This lower coding performance is mainly caused by the restrictions on the coding tools set in the core [HEVC-AE](#page-0-0) *FILE*. Therefore, [SHVC-AE](#page-0-0) uses more inter-layer prediction compared to the [SHM,](#page-0-0) which has more efficient Intra coding tools to encode the two layers. The last column in TABLE [5](#page-7-1) shows that the reference [SHM](#page-0-0) encoder outperforms the proposed [SHVC-AE](#page-0-0) by 6.2 % on average in terms of [BD-BR,](#page-0-0) which is mainly caused by restrictions in the *FILE* setup.
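For readers unfamiliar with the [BD-BR](#page-0-0) metric used throughout this section, the sketch below follows the usual Bjøntegaard procedure: the logarithm of the bitrate is interpolated as a cubic function of the PSNR, both curves are averaged over their common PSNR interval, and the mean difference is converted to a percentage. It assumes exactly four rate points per curve (so that the interpolating polynomial is a cubic and Simpson's rule is exact); the function names and the sample values are illustrative and are not the scripts or measurements used for the reported results.

```python
import math


def _lagrange(xs, ys, x):
    """Evaluate at x the polynomial passing through the points (xs, ys)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def _mean_log_rate(psnr, rate, lo, hi):
    """Mean of the interpolated log-rate over [lo, hi] (Simpson's rule)."""
    logs = [math.log(r) for r in rate]
    mid = 0.5 * (lo + hi)
    return (_lagrange(psnr, logs, lo)
            + 4.0 * _lagrange(psnr, logs, mid)
            + _lagrange(psnr, logs, hi)) / 6.0


def bd_rate(ref_rate, ref_psnr, test_rate, test_psnr):
    """Average bitrate difference (%) of the test curve versus the reference curve.

    Negative values mean that the test encoder needs less bitrate for the same PSNR.
    """
    lo = max(min(ref_psnr), min(test_psnr))
    hi = min(max(ref_psnr), max(test_psnr))
    diff = (_mean_log_rate(test_psnr, test_rate, lo, hi)
            - _mean_log_rate(ref_psnr, ref_rate, lo, hi))
    return (math.exp(diff) - 1.0) * 100.0


# Made-up sample curves: the test encoder uses 10 % less bitrate at equal PSNR.
ref_rate, psnr = [1000.0, 2000.0, 4000.0, 8000.0], [32.0, 35.0, 38.0, 41.0]
test_rate = [0.9 * r for r in ref_rate]
saving = bd_rate(ref_rate, psnr, test_rate, psnr)
```

In this synthetic case `saving` is approximately -10, i.e. a 10 % bitrate saving, which is how the negative entries in the BD-BR tables should be read.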

TABLE [6](#page-7-2) gives the performance of the [SHVC-AE](#page-0-0) *FILE* in terms of [BD-BR](#page-0-0) in comparison with the reference [SHM](#page-0-0) encoder in [LD. P](#page-0-0) coding configuration. The inter-layer prediction enables a bitrate saving of 30.4 % on average while the [SHM](#page-0-0) reference encoder reaches 53.6 %. This difference is mainly introduced by the restrictions on intra and inter coding tools in the proposed [SHVC-AE.](#page-0-0) Moreover, the restrictions on inter coding tools also impact the inter-layer prediction efficiency since the same tools are used for both inter and inter-layer predictions.

TABLE [7](#page-8-0) gives the performance of the [SHVC-AE](#page-0-0) in terms of [BD-BR](#page-0-0) in comparison with the reference [SHM](#page-0-0) encoder in [RA](#page-0-0) coding configuration. In this coding configuration, the inter-layer prediction enables an average bitrate reduction of 18.5 % and 24.6 % for the [SHVC-AE](#page-0-0) and [SHM](#page-0-0) encoders, respectively. As in the two previous configurations, the loss in coding efficiency of the [SHVC-AE](#page-0-0) compared to the reference [SHM](#page-0-0) encoder is mainly caused by restricted coding tools in the *FILE* configuration. In addition, disabling the inter-layer prediction for the B slices of the highest temporal layer also decreases the inter-layer gain, which impacts the global coding performance of the [SHVC-AE.](#page-0-0)

<span id="page-8-0"></span>**TABLE 7.** [BD-BR](#page-0-0) performance of the [SHVC-AE EL](#page-0-0) in comparison with [HEVC-AE,](#page-0-0) [SHM EL](#page-0-0) with HM and [SHVC-AE EL](#page-0-0) with [SHM EL](#page-0-0) in [RA](#page-0-0) coding configuration.

| Sequences | BD-BR: SHVC-AE vs HEVC-AE | BD-BR: SHM vs HM | BD-BR: SHVC-AE vs SHM |
|-----------|---------------------------|------------------|-----------------------|
| A1        | $-19.1\%$                 | $-21.7\%$        | $31.0\%$              |
| A2        | $-27.7\%$                 | $-36.6\%$        | $39.2\%$              |
| B1        | $-30.2\%$                 | $-38.1\%$        | $43.4\%$              |
| B2        | $-17.1\%$                 | $-22.5\%$        | $33.1\%$              |
| B3        | $-15.0\%$                 | $-22.3\%$        | $35.9\%$              |
| B4        | $-12.5\%$                 | $-24.8\%$        | $45.8\%$              |
| B5        | $-8.2\%$                  | $-6.3\%$         | $59.2\%$              |
| Average   | $-18.5\%$                 | $-24.6\%$        | $41.1\%$              |

<span id="page-8-2"></span>**TABLE 8.** Speed-up (Sp) performance in % of the [SHVC-AE](#page-0-0) compared to both single-layer [HEVC-AE](#page-0-0) and [SHM](#page-0-0) encoder.



We can also notice from these results that the gain brought by the inter-layer prediction depends on the characteristics of the video sequence, including its spatial and temporal information as well as its resolution.

Fig. [8](#page-8-1) shows the weighted PSNR ($wPSNR = (6 \cdot YPSNR + UPSNR + VPSNR)/8$) performance versus the bitrate of the proposed [SHVC-AE](#page-0-0) and [SHM](#page-0-0) encoder in the three coding configurations for *BasketballDrive* and *BQTerrace* video sequences. The difference between the curves of the two encoders remains similar at the four printed bitrates. Moreover, this difference is higher in [LD. P](#page-0-0) configuration at all bitrates, which explains the high bitrate loss in [LD. P](#page-0-0) configuration, especially for the *BQTerrace* video (B5).
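The weighted PSNR defined above can be computed as in the following minimal sketch. The `psnr` helper uses the standard definition from the mean squared error and is included only for completeness; the function names and the sample values are illustrative.

```python
import math


def psnr(mse, bit_depth=8):
    """Standard PSNR in dB from the mean squared error of one component."""
    peak = (1 << bit_depth) - 1
    return 10.0 * math.log10(peak * peak / mse)


def weighted_psnr(y_psnr, u_psnr, v_psnr):
    """wPSNR = (6 * YPSNR + UPSNR + VPSNR) / 8, as defined in the text."""
    return (6.0 * y_psnr + u_psnr + v_psnr) / 8.0


example = weighted_psnr(40.0, 42.0, 44.0)
```

The 6:1:1 weighting gives the luma component three quarters of the total weight, so the result stays close to the luma PSNR.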

TABLE [8](#page-8-2) gives the speed-up performance of the [SHVC-AE](#page-0-0) compared to both single-layer [HEVC-AE](#page-0-0) and the [SHM](#page-0-0) encoder in the three considered [GOP](#page-0-0) configurations. The speed-up of an encoder of encoding time *EC*1 with respect to the reference encoder of encoding time *EC*2 is computed as follows:

$$
Sp = \frac{EC2}{EC1} \cdot 100\% \tag{5}
$$
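Equation (5) can be evaluated as in the short sketch below, which also converts a speed-up percentage into the corresponding relative encoding time or speed factor. The helper names are hypothetical; the numerical values come from the averages discussed in this section.

```python
def speed_up_percent(ec1, ec2):
    """Sp of Eq. (5): ec1 is the evaluated encoder's time, ec2 the reference's."""
    return ec2 / ec1 * 100.0


def speed_factor(sp_percent):
    """How many times faster than the reference the evaluated encoder is."""
    return sp_percent / 100.0


def relative_time(sp_percent):
    """Encoding time of the evaluated encoder relative to the reference."""
    return 100.0 / sp_percent


# Sp = 44 % versus single-layer encoding: about 2.3 times the encoding time.
slowdown_vs_single_layer = relative_time(44.0)
# Sp = 3750 % versus SHM: 37.5 times faster.
gain_vs_shm = speed_factor(3750.0)
```

A value below 100 % therefore means that the evaluated encoder is slower than the reference, and a value above 100 % that it is faster.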

The speed-up of the [SHVC-AE](#page-0-0) in [AI](#page-0-0) configuration is on average around 44 % compared to the [Single Layer](#page-0-0) [\(SL\)](#page-0-0) encoding. The [SHVC-AE](#page-0-0) is almost two times slower than the [HEVC-AE](#page-0-0) encoding the equivalent [EL.](#page-0-0) The complexity of the SHVC-AE, with respect to the single layer



<span id="page-8-1"></span>**FIGURE 8.** Rate-distortion performance of the [SHM](#page-0-0) and [SHVC-AE](#page-0-0) encoders using three [GOP](#page-0-0) coding configurations, in FILE setup, for BasketballDrive (B4) and BQTerrace (B5) videos.

[HEVC-AE,](#page-0-0) is caused by the additional processing introduced by the [SHVC](#page-0-0) extension, including the up-sampling and down-sampling operations as well as the encoding of the [BL.](#page-0-0) For [LD. P](#page-0-0) and [RA](#page-0-0) coding configurations, the speed-up is on average 90 % and 98 %, respectively. The complexity increase versus the single-layer encoder is significantly reduced in these two inter coding configurations since both single-layer and

<span id="page-9-0"></span>**TABLE 9.** [BD-BR](#page-0-0) performance (%) of the [SHVC-AE](#page-0-0) LIVE and LIVE+ in comparison with the single-layer [HEVC-AE](#page-0-0) LIVE.

| Seq. | SHVC-AE LIVE vs HEVC-AE LIVE |           |           | SHVC-AE LIVE+ vs HEVC-AE LIVE |           |           |
|------|------------------------------|-----------|-----------|-------------------------------|-----------|-----------|
|      | AI                           | LD. P     | RA        | AI                            | LD. P     | RA        |
| UHD1 | $-53.9\%$                    | $-43.3\%$ | $-33.5\%$ | $-55.8\%$                     | $-45.3\%$ | $-35.2\%$ |
| UHD2 | $-31.0\%$                    | $-21.0\%$ | $-1.4\%$  | $-32.1\%$                     | $-21.7\%$ | $-1.8\%$  |
| A1   | $-49.8\%$                    | $-36.7\%$ | $-18.9\%$ | $-51.4\%$                     | $-37.9\%$ | $-19.8\%$ |
| A2   | $-53.9\%$                    | $-42.1\%$ | $-32.5\%$ | $-56.4\%$                     | $-44.8\%$ | $-35.1\%$ |
| B1   | $-51.1\%$                    | $-35.3\%$ | $-30.2\%$ | $-55.9\%$                     | $-38.3\%$ | $-29.3\%$ |
| B2   | $-40.9\%$                    | $-28.2\%$ | $-13.3\%$ | $-42.4\%$                     | $-28.9\%$ | $-13.8\%$ |
| B3   | $-40.7\%$                    | $-32.7\%$ | $-17.7\%$ | $-43.1\%$                     | $-34.4\%$ | $-18.3\%$ |
| B4   | $-34.6\%$                    | $-24.6\%$ | $-15.5\%$ | $-39.5\%$                     | $-29.0\%$ | $-17.8\%$ |
| B5   | $-29.5\%$                    | $-20.0\%$ | $-1.8\%$  | $-31.0\%$                     | $-21.2\%$ | $-3.4\%$  |
| Av.  | $-42.8\%$                    | $-31.5\%$ | $-18.3\%$ | $-45.3\%$                     | $-33.5\%$ | $-19.4\%$ |

<span id="page-9-1"></span>**TABLE 10.** [BD-BR](#page-0-0) performance of the [SHVC-AE](#page-0-0) LIVE and LIVE+ in comparison with [SHVC-AE](#page-0-0) FILE.



scalable encoders use inter predictions. In these configurations, the slight complexity increase is related to the [BL](#page-0-0) encoding as well as the up-sampling and down-sampling operations. On the other hand, the speed-up of the proposed [SHVC](#page-0-0) encoder with respect to [SHM](#page-0-0) is on average equal to 3750 %, 6690 % and 6650 % in the three considered configurations. The different optimizations and parallel processing introduced at the level of the core [HEVC-AE](#page-0-0) and its scalable extension in the *FILE* setup speed up the encoder by 37 times in [AI](#page-0-0) configuration and 66 times in the two inter coding configurations [LD. P](#page-0-0) and [RA.](#page-0-0) Therefore, the *FILE* setup of the proposed [SHVC-AE](#page-0-0) enables a high [RD](#page-0-0) performance with an efficient use of the inter-layer prediction and a significant speed-up compared to the [SHM](#page-0-0) encoder. The last row of TABLE [8](#page-8-2) gives the average encoding frame rate in [fps](#page-0-0) of the proposed [SHVC-AE](#page-0-0) in *FILE* setup. We can notice that the frame rate is around 1 [fps](#page-0-0) in the three coding configurations, which is far from real time performance. Therefore, this setup can be used only for offline encoding in the cloud to reach a high video quality, but it does not enable real time encoding for live HD/UHD video broadcasting.

## C. THE PROPOSED [SHVC-AE](#page-0-0) LIVE & LIVE+

To reach real time performance, we propose two setups of the [SHVC-AE:](#page-0-0) *LIVE* which uses *LIVE* UHD setup

<span id="page-9-2"></span>



of the single-layer encoder [HEVC-AE](#page-0-0) for both layers, and *LIVE*+ setup which uses *LIVE* HD setup of the single-layer encoder for the [BL](#page-0-0) and *LIVE* UHD setup for the [EL.](#page-0-0) TABLE [9](#page-9-0) provides the [BD-BR](#page-0-0) performance of the [SHVC-AE](#page-0-0) *LIVE* and *LIVE*+ in comparison with the single-layer [HEVC-AE](#page-0-0) *LIVE* UHD. The average results show that the [SHVC-AE](#page-0-0) benefits well from inter-layer prediction in both *LIVE* and *LIVE*+ setups, with [BD-BR](#page-0-0) savings of 42.8 %, 31.5 % and 18.3 % in the three coding configurations for *LIVE* setup and 45.3 %, 33.5 % and 19.4 % for *LIVE*+ setup in comparison with a single-layer [HEVC-AE](#page-0-0) in *LIVE* UHD setup encoding the [EL.](#page-0-0) Therefore, the *LIVE*+ setup of the [SHVC-AE](#page-0-0) benefits more from the inter-layer prediction than the *LIVE* setup since the [BL](#page-0-0) is of higher quality when encoded with the *LIVE* HD single-layer encoder.

TABLE [10](#page-9-1) gives the [BD-BR](#page-0-0) performance of the [SHVC-AE](#page-0-0) *LIVE* and *LIVE*+ in comparison with [SHVC-AE](#page-0-0) *FILE*. We can notice that the restrictions in *LIVE* UHD setup at both layers (SHVC *LIVE* setup) and *LIVE* UHD setup at the [EL](#page-0-0) only (SHVC *LIVE*+ setup) significantly reduce the rate-distortion performance, by 25.2 %, 39.5 %, 59.8 % and 18.1 %, 30.3 %, 49.2 %, respectively, in the three coding configurations. The [SHVC-AE](#page-0-0) in *LIVE*+ setup has higher performance than in *LIVE* setup thanks to the higher efficiency of the [BL](#page-0-0) encoder in *LIVE* HD setup, which results in more efficient inter-layer predictions.

TABLE [11](#page-9-2) gives the encoding frame rate performance of the [SHVC-AE](#page-0-0) in *LIVE* and *LIVE*+ setups. We can notice that the two *LIVE* and *LIVE*+ setups enable almost the same coding frame rate performance. In fact, the additional complexity of the *LIVE*+ setup is introduced by the *LIVE* HD setup at the [BL](#page-0-0) which represents less than 10% of the whole scalable encoder complexity in [LD. P](#page-0-0) and [RA](#page-0-0) coding configurations. Moreover, we can also notice that both configurations enable real time encoding of all considered video sequences even with a  $3840\times2160p$  30 fps format.



<span id="page-10-1"></span>**TABLE 12.** Decoding frame rate performance of the OpenHEVC decoder on 4 cores i7 laptop decoding [SHVC](#page-0-0) bitstreams encoded by [SHM](#page-0-0) encoder and [SHVC-AE](#page-0-0) in LIVE and LIVE+ setups.

<span id="page-10-2"></span>**TABLE 13.** Decoding frame rate performance of the OpenHEVC decoder decoding the [BL](#page-0-0) encoded by [HEVC-AE](#page-0-0) in FILE and LIVE setups on mobile [ARM](#page-0-0) platform.

| Seq.     | Decoding frame rate (fps) |       |     |                 |       |     |
|----------|---------------------------|-------|-----|-----------------|-------|-----|
|          | HEVC-AE BL FILE           |       |     | HEVC-AE BL LIVE |       |     |
|          | AI                        | LD. P | RA  | AI              | LD. P | RA  |
| UHD1     | 20                        | 27    | 33  | 22              | 27    | 34  |
| UHD2     | 18                        | 27    | 40  | 18              | 29    | 41  |
| A1       | 36                        | 59    | 78  | 42              | 54    | 71  |
| A2       | 33                        | 48    | 57  | 38              | 53    | 61  |
| B1       | 62                        | 123   | 105 | 64              | 137   | 108 |
| B2       | 56                        | 103   | 88  | 60              | 106   | 95  |
| B3       | 52                        | 100   | 93  | 59              | 109   | 96  |
| B4       | 57                        | 101   | 99  | 63              | 126   | 96  |
| B5       | 50                        | 90    | 102 | 57              | 101   | 109 |
| Av. UHD  | 19                        | 27    | 36  | 20              | 28    | 37  |
| Av. A    | 34                        | 53    | 67  | 40              | 53    | 66  |
| Av. B    | 55                        | 103   | 98  | 61              | 116   | 101 |
| Max. UHD | 26                        | 36    | 53  | 29              | 41    | 53  |
| Max. A   | 49                        | 79    | 101 | 55              | 79    | 81  |
| Max. B   | 67                        | 167   | 140 | 74              | 166   | 146 |
| Min. UHD | 13                        | 19    | 24  | 13              | 19    | 24  |
| Min. A   | 26                        | 34    | 42  | 29              | 37    | 45  |
| Min. B   | 41                        | 63    | 74  | 46              | 68    | 80  |

## D. THE PROPOSED OpenHEVC DECODER

The decoding frame rate (in [fps\)](#page-0-0) performance of the *OpenHEVC* decoder on a 4-core i7 laptop is provided in TABLE [12](#page-10-1) for three encoders including [SHVC-AE](#page-0-0) *FILE* and *LIVE*+ setups and [SHM](#page-0-0) in three coding configurations: [AI,](#page-0-0) [LD. P](#page-0-0) and [RA.](#page-0-0) We can notice that on average the decoder

reaches a real time decoding of  $3840 \times 2160p$  30 fps videos in inter configurations [\(LD. P](#page-0-0) and [RA\)](#page-0-0) and higher than 60 [fps](#page-0-0) and 107 [fps](#page-0-0) for videos of classes A and B [\(LD.](#page-0-0) [P](#page-0-0) and [RA\)](#page-0-0), respectively. The decoding performance is on average slightly higher for [SHM](#page-0-0) bitstream since the reference encoder decreases the bitstream size compared to the proposed *ATEME* encoders leading to lower complexity at the decoder side. For high bitrate configurations (min), the real time decoding is not reached for videos in UHD and 2K (class A) resolutions. The [RA](#page-0-0) coding configuration leads to the fastest decoding performance since this configuration enables the highest [RD](#page-0-0) coding performance and the up-sampling operation is not performed on the highest temporal B slices.

TABLE [13](#page-10-2) provides the decoding frame rate of the proposed *OpenHEVC* decoder when decoding the [BL](#page-0-0) on the [ARM](#page-0-0) mobile platform, for bitstreams encoded with the [SHVC-AE](#page-0-0) in *LIVE* and *FILE* setups in the three coding configurations. The decoder enables real time decoding of the [BL](#page-0-0) in full HD resolution ($1920 \times 1080$p) for the [RA](#page-0-0) configuration on the embedded [ARM](#page-0-0) platform. Moreover, the [BL](#page-0-0) is decoded in real time for videos of classes A and B in the three coding configurations.

# <span id="page-10-0"></span>**V. REAL TIME UHD HDR VIDEO DEMONSTRATION**

The proposed encoder and decoder enable real time processing of UHDp30 video sequences on the considered platforms for the [RA](#page-0-0) coding configuration. This allows the integration of the solution into a broadcast channel context. In our demonstration, as illustrated in Fig. [9](#page-11-0), we consider a broadcast context composed of a camera or a streamer, an [SHVC](#page-0-0) encoder, an [SHVC](#page-0-0) decoder, and both a UHD [HDR](#page-0-0) compliant TV screen and a smartphone with an HD [SDR](#page-0-0) screen. This set-up simulates a streaming application with end-to-end transmission over cable and network.

<span id="page-11-0"></span>**FIGURE 9.** SHVC streaming in real-time context.

First, the camera or the streamer sends the captured uncompressed video to the SHVC encoder through an [SDI](#page-0-0) link. [SDI](#page-0-0) is a standard for transferring uncompressed video over cable. The [SDI](#page-0-0) link retained for the experiment is composed of $4\times3G$ [SDI](#page-0-0) cables allowing a 12 Gbps maximum bit-rate. Sending uncompressed UHD content in 4:2:2 format with 10 bits per sample requires less than 10 Gbps, as shown by the following calculations, where the factor 2 is the average number of samples per pixel in 4:2:2 (one luma sample plus one shared chroma sample):

$$\text{Bit-rate}_{\text{UHD1}}\ (3840 \times 2160\text{p}30) = 3840 \times 2160 \times 30 \times 2 \times 10 = 4.976\ \text{Gbps}$$

$$\text{Bit-rate}_{\text{UHD2}}\ (3840 \times 2160\text{p}60) = 3840 \times 2160 \times 60 \times 2 \times 10 = 9.953\ \text{Gbps}$$

The device used to receive the uncompressed video on the encoder side is the DTA-2174 produced by Dektec [43], which supports $4\times3G$ [SDI](#page-0-0) reception. Each uncompressed frame sent on the [SDI](#page-0-0) link is in V210 format. The V210 format is a 4:2:2 representation with 10 bits per sample in which the luma (Y) and chroma (U and V) samples are packed in a sequence such as $U_0, Y_0, V_0, Y_1, U_2, Y_2, V_2, Y_3, \ldots$ Once the DTA-2174 receives a frame, the frame is converted from V210 to a 4:2:0 planar representation before encoding. The DTA-2174 is hosted on the same $4 \times 10$ cores Intel Xeon (E5-4627V3) platform as the [SHVC-AE](#page-0-0).

In the case of [HDR](#page-0-0) coding, the uncompressed video is first sent to a preprocessing device before the encoder. This preprocessing device produces the metadata used by [HDR](#page-0-0) displays. It can be used for different [HDR](#page-0-0) technologies such as:

- [Perceptual Quantizer \(PQ\)](#page-0-0), proposed by the [Society of Motion Picture and Television Engineers \(SMPTE\)](#page-0-0) in specification ST-2084, defining a transfer function for [HDR](#page-0-0) displays with 10 bits per pixel and a BT.2020 color-gamut,
- [HDR10](#page-0-0), exploiting [PQ](#page-0-0) and specification [SMPTE](#page-0-0) ST-2086, defining information transfer for color calibration in [HDR](#page-0-0) displays with static size of metadata,
- [HDR10+](#page-0-0), proposed by *Samsung* and *Amazon Video*, exploiting specification [SMPTE](#page-0-0) ST-2094-40 and enhancing HDR10 with dynamic size of metadata,
- Dolby Vision, proposed by *Dolby*, similar to [HDR10+](#page-0-0) (using [PQ](#page-0-0) and [SMPTE](#page-0-0) ST-2094-40) but with luminosity adaptation for [HDR](#page-0-0) displays on TV,
- [Hybrid Log Gamma \(HLG\)](#page-0-0), proposed by the [British Broadcasting Corporation \(BBC\)](#page-0-0) and the [Japan Broadcasting Corporation \(NHK\)](#page-0-0) in specification ARIB STD-B67, defining another transfer function for [HDR](#page-0-0) displays with 10 bits per pixel and a BT.2020 color-gamut,
- [SL-HDR1](#page-0-0), proposed by *STMicroelectronics*, *Philips International* and *Technicolor* in specification ETSI TS 103 433, relying on [SMPTE](#page-0-0) ST-2087, ST-2086, ST-2094-20 and ST-2094-30 with dynamic size of metadata sent in a [Supplemental Enhancement Information \(SEI\)](#page-0-0) message.

<span id="page-11-1"></span>**FIGURE 10.** Schematic of the [MPEG-2 TS](#page-0-0) packets structure.

In the proposed application, we only use SL-HDR1 for [SDR-HDR](#page-0-0) backward compatibility, but other [HDR](#page-0-0) technologies can be employed for all layers. The [SL-HDR1](#page-0-0) metadata is added to the [SDI](#page-0-0) messages in ancillary data packets. Once received by the DTA-2174, the metadata is placed in an [SEI](#page-0-0) message and passes through the encoding process.
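The SDI input stage described above can be sketched in a few lines of Python. The sketch is illustrative only and rests on stated assumptions: the function names are hypothetical, the V210 word layout (three 10-bit samples per little-endian 32-bit word, six pixels per 16-byte group) follows the common definition of the format rather than anything stated in this paper, and the vertical averaging used for the 4:2:2 to 4:2:0 conversion is a placeholder, since the filter used in the demonstration is not specified. Chroma samples are indexed here by order of appearance, so `u[1]` corresponds to $U_2$ in the notation used earlier.

```python
import struct


def uncompressed_bitrate_gbps(width, height, fps, samples_per_pixel=2, bit_depth=10):
    """Raw video bit-rate in Gbps (1e9 bits per second).

    In 4:2:2, each pixel carries one luma sample plus, on average, one
    chroma sample (U and V are shared by two pixels), hence the factor 2.
    """
    return width * height * fps * samples_per_pixel * bit_depth / 1e9


def unpack_v210_group(block):
    """Unpack one 16-byte V210 group (6 pixels) into Y, U and V sample lists.

    Assumed layout: four little-endian 32-bit words, each holding three
    10-bit samples in bits 0-9, 10-19 and 20-29 (bits 30-31 are unused).
    """
    words = struct.unpack("<4I", block)
    s = [(w >> shift) & 0x3FF for w in words for shift in (0, 10, 20)]
    # Sample order in the group: U0 Y0 V0 | Y1 U1 Y2 | V1 Y3 U2 | Y4 V2 Y5
    y = [s[1], s[3], s[5], s[7], s[9], s[11]]
    u = [s[0], s[4], s[8]]
    v = [s[2], s[6], s[10]]
    return y, u, v


def chroma_422_to_420(chroma_rows):
    """Halve the vertical chroma resolution by averaging each pair of rows.

    Assumption: a simple rounded average; the paper does not specify the
    down-sampling filter applied before encoding.
    """
    return [
        [(a + b + 1) >> 1 for a, b in zip(top, bottom)]
        for top, bottom in zip(chroma_rows[0::2], chroma_rows[1::2])
    ]


if __name__ == "__main__":
    for fps in (30, 60):
        rate = uncompressed_bitrate_gbps(3840, 2160, fps)
        print(f"UHD p{fps}: {rate:.3f} Gbps, fits in 4x3G SDI: {rate < 12.0}")
```

Both UHD bit-rates stay below the 12 Gbps offered by the $4\times3G$ link, which is consistent with the choice of the DTA-2174 for the capture stage.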

Then, the [SHVC-AE](#page-0-0), running on the $4 \times 10$ cores Intel Xeon processor (E5-4627V3), processes the received frame as explained in Section [III-C](#page-5-2). Once encoding is complete, the [SHVC](#page-0-0) bitstream is packed in [MPEG-2 TS](#page-0-0) packets and sent to the decoder through an Internet Protocol (IP) link. As illustrated in Fig. [10](#page-11-1), an [MPEG-2 TS](#page-0-0) packet is composed of a payload containing the encoded bitstream, also called the [Elementary Stream \(ES\)](#page-0-0), and a header containing information on the payload. This information includes, for instance, the type of transmitted data, which can be video but also audio or subtitles. The type of data is identified by the syntax element called the [Packet Identifier \(PID\)](#page-0-0). In the broadcast environment, there are two main specifications:

- [Digital Video Broadcasting \(DVB\)](#page-0-0) used in Africa, Europe, Middle East, Oceania and South Asia,
- [ATSC](#page-0-0) used in North America and South Korea.

They rely on standards such as [HEVC](#page-0-0) for video coding or [MPEG-2 TS](#page-0-0) for IP transmission. In the case of [SHVC](#page-0-0), they define different [PID](#page-0-0) specifications: [DVB](#page-0-0) recommends a different [PID](#page-0-0) for each scalable layer, while [ATSC](#page-0-0) recommends a single [PID](#page-0-0) for the video [ES](#page-0-0). Our solution supports both recommendations, and the default configuration uses the [ATSC](#page-0-0) one.
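To make the role of the PID concrete, the following minimal sketch parses the 4-byte header of a 188-byte MPEG-2 TS packet and selects packets by PID. The header field layout follows the MPEG-2 Systems specification [10]. Everything else is an assumption made for illustration: the PID values `0x100` and `0x101`, the two mapping tables and the function names are hypothetical and are not taken from the DVB or ATSC specifications or from the proposed solution.

```python
TS_PACKET_SIZE = 188
SYNC_BYTE = 0x47

# Hypothetical PID assignments, for illustration only.
DVB_STYLE = {0x100: {"BL"}, 0x101: {"EL"}}  # one PID per scalable layer
ATSC_STYLE = {0x100: {"BL", "EL"}}          # a single PID for the video ES


def parse_ts_header(packet):
    """Decode the fixed 4-byte header of an MPEG-2 TS packet."""
    if len(packet) != TS_PACKET_SIZE or packet[0] != SYNC_BYTE:
        raise ValueError("not a valid MPEG-2 TS packet")
    return {
        "payload_unit_start": bool(packet[1] & 0x40),
        "pid": ((packet[1] & 0x1F) << 8) | packet[2],  # 13-bit identifier
        "adaptation_field_control": (packet[3] >> 4) & 0x3,
        "continuity_counter": packet[3] & 0x0F,
    }


def select_packets(packets, pid_to_layers, wanted_layers):
    """Keep the packets whose PID carries at least one wanted layer."""
    selected = []
    for packet in packets:
        layers = pid_to_layers.get(parse_ts_header(packet)["pid"], set())
        if layers & wanted_layers:
            selected.append(packet)
    return selected


def make_packet(pid):
    """Build a dummy payload-only TS packet for the given PID (demo helper)."""
    header = bytes([SYNC_BYTE, 0x40 | ((pid >> 8) & 0x1F), pid & 0xFF, 0x10])
    return header + bytes(TS_PACKET_SIZE - len(header))


if __name__ == "__main__":
    stream = [make_packet(0x100), make_packet(0x101), make_packet(0x100)]
    base_only = select_packets(stream, DVB_STYLE, {"BL"})
    print(f"{len(base_only)} of {len(stream)} packets kept for a BL-only receiver")
```

With one PID per layer, a receiver that only needs the BL can discard EL packets at the transport level. With a single PID, it receives the whole video ES and has to separate the layers after demultiplexing, for example using the `nuh_layer_id` field of the NAL unit header.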

The [MPEG-2 TS](#page-0-0) packets are then transferred to the *OpenHEVC* decoder through cables and to a smartphone through the network. For display on the TV, the *OpenHEVC* decoder is integrated into the GPAC player, as proposed in [44], to manage the reception of [MPEG-2 TS](#page-0-0) packets. Both the [BL](#page-0-0) and the [EL](#page-0-0) are decoded to enable UHD display. Once decoded, UHD frames are sent to the UHD HDR TV through a [High-Definition Multimedia Interface \(HDMI\)](#page-0-0) link. If present, the [SEI](#page-0-0) message containing information for [HDR](#page-0-0) display passes through the decoder and is used by the TV. Otherwise, the TV displays the UHD content in [SDR](#page-0-0). The smartphone, on the other hand, receives the [MPEG-2 TS](#page-0-0) packets through the network and processes only the [BL](#page-0-0) for an HD display. It does not use the [SEI](#page-0-0) message containing the information for [HDR](#page-0-0) display, so only [SDR](#page-0-0) can be displayed.
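The layer selection performed by each receiver can be illustrated at the NAL unit level. The sketch below is not the *OpenHEVC* or GPAC API. It is a hypothetical, minimal example that decodes the two-byte HEVC NAL unit header and keeps only the NAL units up to a target layer. It assumes the two-layer configuration of the demonstration, with the BL carried with `nuh_layer_id` 0 and the EL with `nuh_layer_id` 1; the helper names are illustrative.

```python
def parse_nal_header(nal_unit):
    """Decode the two-byte HEVC/SHVC NAL unit header.

    Bit layout: forbidden_zero_bit (1), nal_unit_type (6),
    nuh_layer_id (6), nuh_temporal_id_plus1 (3).
    """
    if len(nal_unit) < 2:
        raise ValueError("NAL unit shorter than its header")
    b0, b1 = nal_unit[0], nal_unit[1]
    return {
        "nal_unit_type": (b0 >> 1) & 0x3F,
        "nuh_layer_id": ((b0 & 0x01) << 5) | (b1 >> 3),
        "temporal_id": (b1 & 0x07) - 1,
    }


def keep_layers(nal_units, max_layer_id):
    """Drop every NAL unit that belongs to a layer above max_layer_id."""
    return [
        nal for nal in nal_units
        if parse_nal_header(nal)["nuh_layer_id"] <= max_layer_id
    ]


def make_nal(nal_unit_type, layer_id, temporal_id=0, payload=b""):
    """Build a NAL unit with the given header fields (demo helper)."""
    b0 = (nal_unit_type << 1) | (layer_id >> 5)
    b1 = ((layer_id & 0x1F) << 3) | (temporal_id + 1)
    return bytes([b0, b1]) + payload


if __name__ == "__main__":
    stream = [make_nal(1, 0), make_nal(1, 1), make_nal(1, 0)]
    smartphone_input = keep_layers(stream, max_layer_id=0)  # BL only, HD
    tv_input = keep_layers(stream, max_layer_id=1)          # BL + EL, UHD
    print(len(smartphone_input), len(tv_input))
```

In this sketch the smartphone path corresponds to `max_layer_id=0` and the TV path to `max_layer_id=1`. Handling of the SEI message is left out, since it does not affect which layers are decoded.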

The real time end-to-end video transmission of UHD [SDR](#page-0-0) content ($3840\times2160$ pixels) at 30 fps with 10 bits per pixel was demonstrated in [45] and [46] for codec scalability. We improve this demonstration by adding [HDR](#page-0-0) support on all layers. In this demonstration, the [SHVC-AE](#page-0-0) performs only spatial scalability, with 10 bits per pixel and BT.2020 color-gamut on both layers. The backward compatibility between [SDR](#page-0-0) and [HDR](#page-0-0) is enabled by the [SL-HDR1](#page-0-0) technology.

# <span id="page-12-0"></span>**VI. CONCLUSION**

In this paper, we have proposed a complete software implementation of the scalable extension of the [HEVC](#page-0-0) standard. This solution includes both an [SHVC](#page-0-0) encoder and decoder based, respectively, on the core professional [HEVC](#page-0-0) encoder ([HEVC-AE](#page-0-0)) and the open source real time [HEVC](#page-0-0) decoder (*OpenHEVC*). Several optimizations have been integrated into the proposed scalable [HEVC](#page-0-0) encoder ([SHVC](#page-0-0)), resulting in three setups of the encoder: *FILE*, *LIVE* and *LIVE*+. The [SHVC-AE](#page-0-0) in *FILE* setup reaches a rate-distortion performance close to that of the reference [SHM](#page-0-0), with a speed-up of 37 and 66 in Intra and Inter coding configurations, respectively. The [SHVC-AE](#page-0-0) in *LIVE* and *LIVE*+ setups enables real time encoding of $3840 \times 2160$p30 video with an efficient inter-layer prediction. The complete solution, including the [SHVC-AE](#page-0-0) and the scalable *OpenHEVC* decoder, enables real time encoding and decoding of $3840 \times 2160$p30 videos on a multi-core Intel Xeon platform. Moreover, the scalable *OpenHEVC* decoder can decode the [BL](#page-0-0) in HD resolution on an [ARM](#page-0-0) mobile platform.

Several improvements to the [SHVC-AE](#page-0-0) can be investigated in future work. First, the proposed encoder can be extended to support the encoding of more than two layers (*N* layers). In addition, it would be interesting to investigate the performance of the encoder with other types of scalability, including quality, bit-depth, color gamut and codec scalability. Finally, more algorithmic optimizations can be performed to improve the coding efficiency of the [HEVC-AE](#page-0-0) encoder, especially in Inter coding configuration.

#### **REFERENCES**

- [1] G. J. Sullivan, J. M. Boyce, Y. Chen, J. R. Ohm, C. A. Segall, and A. Vetro, "Standardized extensions of high efficiency video coding (HEVC)," *IEEE J. Sel. Topics Signal Process.*, vol. 7, no. 6, pp. 1001–1016, Dec. 2013.
- [2] J. M. Boyce, Y. Ye, J. Chen, and A. K. Ramasubramonian, "Overview of SHVC: Scalable extensions of the high efficiency video coding standard," *IEEE Trans. Circuits Syst. Video Technol.*, vol. 26, no. 1, pp. 20–34, Jan. 2016.
- [3] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," *IEEE Trans. Circuits Syst. Video Technol.*, vol. 22, no. 12, pp. 1649–1668, Dec. 2012. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6316136
- [4] V. Seregin and Y. He, *Common SHM Test Conditions and Software Reference Configurations*, document JCTVC-Q1009, Apr. 2014.
- [5] T. Biatek, W. Hamidouche, J.-F. Travers, and O. Deforges, "Optimal bitrate allocation in the scalable HEVC extension for the deployment of UHD services," *IEEE Trans. Broadcast.*, vol. 62, no. 4, pp. 826–841, Dec. 2016.
- [6] X. HoangVan, J. Ascenso, and P. Pereira, "Improving SHVC performance with a joint layer coding mode," in *Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP)*, Shanghai, China, Mar. 2016, pp. 1145–1149.
- [7] H. Schwarz, D. Marpe, and T. Wiegand, "Overview of the scalable video coding extension of the H.264/AVC standard," *IEEE Trans. Circuits Syst. Video Technol.*, vol. 17, no. 9, pp. 1103–1120, Sep. 2007. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4317636
- [8] *ATSC Standard: Video HEVC (A/341)*, Standard, Feb. 2019. [Online]. Available: https://www.atsc.org/wp-content/uploads/2017/05/A341-2019-Video-HEVC.pdf
- [9] K. Park, Y. Lim, and D. Y. Suh, "Delivery of ATSC 3.0 services with MPEG media transport standard considering redistribution in MPEG-2 TS format," *IEEE Trans. Broadcast.*, vol. 62, no. 1, pp. 338–351, Mar. 2016.
- [10] *Generic Coding of Moving Pictures and Associated Audio Information— Part 1: Systems*, document ISO/IEC 13818-1, 2018.
- [11] R. Parois, W. Hamidouche, J. Vieron, M. Raulet, and O. Deforges, "Efficient parallel architecture for a real-time UHD scalable HEVC encoder," in *Proc. 25th Eur. Signal Process. Conf. (EUSIPCO)*, Aug. 2017, pp. 1465–1469.
- [12] W. Hamidouche, M. Raulet, and O. Déforges, "4K real-time and parallel software video decoder for multilayer HEVC extensions," *IEEE Trans. Circuits Syst. Video Technol.*, vol. 26, no. 1, pp. 169–180, Jan. 2016. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=7273890
- [13] C. C. Chi *et al.*, "Parallel scalability and efficiency of HEVC parallelization approaches," *IEEE Trans. Circuits Syst. Video Technol.*, vol. 22, no. 12, pp. 1827–1838, Dec. 2012.
- [14] P. Bordes, P. Andrivon, X. Li, Y. Ye, and Y. He, "Overview of color gamut scalability," *IEEE Trans. Circuits Syst. Video Technol.*, vol. 27, no. 7, pp. 1580–1594, Jul. 2017.
- [15] M. Blestel and M. Raulet, "Open SVC decoder: A flexible SVC library," in *Proc. ACM Int. Conf. Multimedia*, Oct. 2010, pp. 1463–1466.
- [16] *Joint Scalable Video Model JSVM*. Accessed: 2010. [Online]. Available: https://www.hhi.fraunhofer.de/en/departments/vca/researchgroups/image-video-coding/research-topics/svc-extension-of-h264avc/jsvm-reference-software.html
- [17] S. Sanz-Rodríguez, M. Álvarez-Mesa, T. Mayer, and T. Schierl, "A parallel H.264/SVC encoder for high definition video conferencing," *Signal Process. Image Commun.*, vol. 30, pp. 89–106, Jan. 2015.
- [18] P.-T. Chiang *et al.*, "A QFHD 30-frames/s HEVC decoder design," *IEEE Trans. Circuits Syst. Video Technol.*, vol. 26, no. 4, pp. 724–735, Apr. 2015.
- [19] M. Abeydeera, M. Karunaratne, G. Karunaratne, K. De Silva, and A. Pasqual, "4K real-time HEVC decoder on an FPGA," *IEEE Trans. Circuits Syst. Video Technol.*, vol. 26, no. 1, pp. 236–249, Jan. 2016.
- [20] C.-C. Ju, "A 0.2 nJ/pixel 4 K 60 fps main-10 HEVC decoder with multi-format capabilities for UHD-TV applications," in *Proc. IEEE 40th Eur. Solid State Circuits Conf. (ESSCIRC)*, Rome, Italy, Sep. 2014, pp. 195–198.
- [21] D. Zhou, "An 8K H.265/HEVC video decoder chip with a new system pipeline design," *IEEE J. Solid-State Circuits*, vol. 52, no. 1, pp. 113–126, Jan. 2017.
- [22] M. Tikekar, C.-T. Huang, C. Juvekar, V. Sze, and A. P. Chandrakasan, "A 249-Mpixel/s HEVC video-decoder chip for 4K Ultra-HD applications," *IEEE J. Solid-State Circuits*, vol. 49, no. 1, pp. 61–72, Jan. 2014. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6636099
- [23] F. Bossen, B. Bross, K. Suhring, and D. Flynn, "HEVC complexity and implementation analysis," *IEEE Trans. Circuits Syst. Video Technol.*, vol. 22, no. 12, pp. 1685–1696, Dec. 2012.
- [24] C. C. Chi, M. Alvarez-Mesa, J. Lucas, B. Juurlink, and T. Schierl, "Parallel HEVC decoding on multi- and many-core architectures," *J. Signal Process. Syst.*, vol. 71, no. 3, pp. 247–260, Jun. 2013.
- [25] (2017). *Libde265 Decoder*. [Online]. Available: https://github.com/strukturag/libde265
- [26] C. C. Chi, M. Alvarez-Mesa, B. Bross, B. Juurlink, and T. Schierl, "SIMD acceleration for HEVC decoding," *IEEE Trans. Circuits Syst. Video Technol.*, vol. 25, no. 5, pp. 841–855, May 2015.
- [27] *Snapdragon 810 Processor Product Brief*, Qualcomm, San Diego, CA, USA, 2014.
- [28] MulticoreWare. (2017). *x265 HEVC Encoder/H.265 Video Codec*. [Online]. Available: http://x265.org/
- [29] Vantrix. (2017). *F265 Open Source HEVC/H.265 Project*. [Online]. Available: http://vantrix.com/f-265-2/
- [30] UltraVideoGroup. (2017). *Kvazaar HEVC Encoder*. [Online]. Available: http://ultravideo.cs.tut.fi/#encoder
- [31] M. Viitanen, A. Koivula, A. Lemmetti, J. Vanne, and T. D. Hämäläinen, "Kvazaar HEVC encoder for efficient intra coding," in *Proc. Int. Symp. Circuits Syst. (ISCAS)*, Lisbon, Portugal, May 2015, pp. 1662–1665.
- [32] A. Mercat, F. Arrestier, W. Hamidouche, M. Pelcat, and D. Menard, "Energy reduction opportunities in an HEVC real-time encoder," in *Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP)*, Mar. 2017, pp. 1158–1162.
- [33] G. Correa, P. Assuncao, L. Agostini, and L. A. D. S. Cruz, *Complexity-Aware High Efficiency Video Coding*. Cham, Switzerland: Springer, 2016.
- [34] X. Li, M. Chen, Z. Qu, J. Xiao, and M. Gabbouj, "An effective CU size decision method for quality scalability in SHVC," *Multimedia Tools Appl.*, vol. 76, no. 6, pp. 8011–8030, Mar. 2017.
- [35] H. R. Tohidypour, M. T. Pourazad, and P. Nasiopoulos, "An encoder complexity reduction scheme for quality/fidelity scalable HEVC," *IEEE Trans. Broadcast.*, vol. 62, no. 3, pp. 664–674, Sep. 2016.
- [36] C.-C. Wang, Y.-S. Chang, and K.-N. Huang, "Efficient coding tree unit (CTU) decision method for scalable high-efficiency video coding (SHVC) encoder," in *Proc. Recent Adv. Image Video Coding*, Nov. 2016.
- [37] W.-J. Chiang, J.-J. Chen, and Y.-H. Tsai, "A fast SHVC coding scheme based on base layer co-located CU and cross-layer PU mode information," in *Proc. IEEE Int. Conf. Multimedia Expo Workshops (ICMEW)*, Jul. 2017, pp. 381–386.
- [38] H. R. Tohidypour, H. Bashashati, M. T. Pourazad, and P. Nasiopoulos, "Online-learning-based mode prediction method for quality scalable extension of the high efficiency video coding (HEVC) standard," *IEEE Trans. Circuits Syst. Video Technol.*, vol. 27, no. 10, pp. 2204–2215, Oct. 2017.
- [39] (2017). *SHVC Reference Software (SHM)*. [Online]. Available: https://hevc.hhi.fraunhofer.de/svn/svn\_SHVCSoftware/
- [40] T. Wiegand, G. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," *IEEE Trans. Circuits Syst. Video Technol.*, vol. 13, no. 7, pp. 560–576, Jul. 2003. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1218189
- [41] G. Bjontegaard, "Calculation of average PSNR differences between RD-curves," in *Proc. VCEG*, Austin, TX, USA, Apr. 2001, pp. 2–4.
- [42] G. Bjontegaard, *Improvements of the BD-PSNR Model*, document ITU-T SG16 Q, Jul. 2008.
- [43] *Dektec Digital Video BV*, document DTA-2174, Leaflet, Jun. 2015.
- [44] P.-L. Cabarat, W. Hamidouche, O. Deforges, M. Raulet, and J. Le Feuvre, "4K real-time video streaming in hybrid codec scalability SHVC configuration," in *Proc. IEEE Conf. Design Architecture Signal Image Process. (DASIP)*, Rennes, France, Oct. 2016, pp. 1–3.
- [45] R. Parois, W. Hamidouche, E. G. Mora, M. Raulet, and O. Deforges, "Demo: UHD live video streaming with a real-time scalable HEVC encoder," in *Proc. Conf. Design Architectures Signal Image Process. (DASIP)*, Oct. 2016, pp. 235–236.
- [46] P.-L. Cabarat, W. Hamidouche, and O. Déforges, "Real-time and parallel SHVC hybrid codec AVC to HEVC decoder," in *Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP)*, Mar. 2017, pp. 3046–3050.



RONAN PAROIS received the Ph.D. degree in signal and image processing from INSA Rennes, in 2018. He focuses on the scalable extension of the HEVC standard. He is currently a Research and Development Engineer with ATEME. His research interest includes real-time implementations of software video codecs.



WASSIM HAMIDOUCHE received the Ph.D. degree in signal and image processing from the University of Poitiers, France, in 2010. From 2011 to 2012, he was a Research Engineer with the Canon Research Centre, Rennes, France. Since 2015, he has been an Associate Professor with INSA Rennes. He is currently a member of the Institute of Electronics and Telecommunications of Rennes (IETR), UMR CNRS 6164. His research interests include video coding, efficient real time and parallel architectures for the new generation video coding standards, multimedia transmission over heterogeneous networks, and multimedia content security.



PIERRE-LOUP CABARAT received the M.S. degree in signal and image processing from the University of Rennes 1, France, in 2014. He has been a Research Engineer with the Institute of Electronics and Telecommunications of Rennes (IETR), UMR CNRS 6164, since 2016. His research interests include video coding, and efficient real-time and parallel architectures for the next generation of video coding standards.



MICKAEL RAULET received the Ph.D. degree in electronic and signal processing from INSA in collaboration with Mitsubishi Electric ITE, Rennes, France, in 2006. Until 2014, he was a Researcher in rapid prototyping of video coding standards with the Research Institute of Electronics and Telecommunications of Rennes (IETR), where he was also a Project Leader of several French and European projects. Until 2014, he was also a member of IRT B-COM, a new research institute. In 2015, he joined ATEME, Rennes, where he is currently in charge of a research team on video compression. Since 2007, he has been involved in the ISO/IEC MPEG standardization activities as a Reconfigurable Video Coding Expert. He has authored 3 book chapters and more than 80 international conference and journal papers. His particular interests include dataflow programming, signal processing systems, and video coding. He served as a member for the Technical Committee of the Design and Implementation of Signal Processing Systems (DISPS) of the IEEE Signal Processing Society and the Circuits and Systems for Video Technology Editorial Board.



NATY SIDATY received the Engineering degree in telecommunications and electronics from the National Engineering School of Tunis, Tunisia, in 2010, the master's degree in telecommunications and electronics from Limoges University, France, in 2011, and the Ph.D. degree in signal and image processing from the University of Poitiers, in 2015. He is currently a Postdoctoral Researcher with the IETR Lab/INSA of Rennes, France. His research interests include visual attention modeling, video quality assessment [standard dynamic range (SDR) and high dynamic range (HDR)], video security [high-efficiency video coding (HEVC) perceptual encryption], and new coding tools (HEVC and JEM).



**OLIVIER DÉFORGES** received the Ph.D. degree in image processing, in 1995. In 1996, he joined the Department of Electronic Engineering, National Institute of Applied Sciences of Rennes (INSA), Scientific and Technical University. He is currently a Professor with INSA. He is a member of the Institute of Electronics and Telecommunications of Rennes (IETR), UMR CNRS 6164. He has authored more than 180 technical papers. His principal research interests include image and video lossy and lossless compression, image understanding, fast prototyping, and parallel architectures.
