
Data-Driven Variable Synthetic Aperture Imaging Based on Semantic Feedback


Abstract:

Synthetic aperture imaging, which has proven to be an effective approach for imaging occluded objects, remains one of the challenging problems in the field of computational imaging. Most related research focuses on a fixed synthetic aperture, which is usually accompanied by mixed observation angles and foreground defocus blur; both frequently reduce the perspective effect and degrade the imaging quality of the occluded object. To solve this problem, we propose a novel data-driven variable synthetic aperture imaging method based on semantic feedback. The semantic content of interest for better de-occluded imaging is the foreground occlusion rather than the whole scene. Therefore, unlike other methods that work at the pixel level, we start from the semantic layer and present a feedback-based semantic labeling method. The semantic labeling map deeply mines the visual data in the synthetic image and preserves the semantic information of the foreground occluder. Under the semantic feedback strategy, the semantic labeling map is passed back to the synthetic imaging process. The proposed data-driven variable synthetic aperture imaging operates on two levels: an adaptively changeable imaging aperture driven by synthetic depth and perspective angle, and light ray screening driven by the visual information in the semantic labeling map. On this basis, the hybrid camera view and the superimposition of foreground occlusion can be removed. Evaluations on several complex indoor scenes and real outdoor environments demonstrate the superiority and robustness of our proposed approach.
Published in: IEEE Access (Volume: 7)
Page(s): 166021 - 166042
Date of Publication: 14 November 2019
Electronic ISSN: 2169-3536

SECTION I.

Introduction

As one of the most active research branches, occlusion handling has attracted increasing attention due to its broad application prospects and important practical value. Typical applications include video surveillance and monitoring [1]–[3], object recognition and tracking [4]–[6], automatic visual navigation [7]–[9], etc. In this area, occluded object imaging is still a major obstacle to achieving satisfactory system performance; in particular, the imaging quality directly determines the efficiency and precision of the overall system. For example, lawbreakers often use shelter to evade traditional surveillance, which occurs around us every day, and then the imaging quality of the occluded object becomes important for public security. Recently, along with advances in computational photography [10], [11], synthetic aperture imaging [12]–[14] has proven to be an effective method for occlusion removal and seeing through to occluded objects. Meanwhile, more and more camera arrays with different configurations for data acquisition have appeared, such as circular, arc, linear and planar arrangements. However, this kind of approach shares a common problem: it suffers from low-resolution reconstructed images when multiple occlusions occur. In this work, building on synthetic aperture imaging, we focus on high-performance imaging of occluded objects in complex backgrounds with multiple shelters. Fig. 1 shows an example of an occluded object imaging result. Despite serious occlusion by the man in front, the occluded target is imaged by the proposed method with clear facial characteristics, and the imaging performance is greatly improved compared with the traditional synthetic aperture imaging algorithm.

FIGURE 1. An example of occluded object imaging. The occluded object of interest is marked in red and she is completely blocked by foreground occlusion. Above: the observation scene; Below: enlarged foreground occlusion region, occluded object imaging results with the traditional synthetic aperture method and the proposed method.

Although some research works have been carried out on occluded object imaging [15], [16], there are still significant technical challenges requiring further study [17]. More specifically, they cover the following points: (1) For an occluded object, interference from foreground rays is one of the key factors affecting imaging quality; worse still, imaging clarity declines significantly as the number of occlusion layers increases. (2) The synthetic aperture also affects the imaging result. A fixed synthetic aperture has higher imaging efficiency but lacks flexibility for different occlusion situations; it is therefore usually accompanied by complex light field information from different views and foregrounds, which leads to a drop in imaging performance. (3) The degree of mutual occlusion between multiple objects and moving occluders is variable, which makes it difficult to enhance the clearness and contrast of the hidden object image. (4) Typically the imaging conditions are unfavorable, particularly in crowded environments with complex backgrounds and plenty of interfering factors.

To achieve better synthetic aperture imaging performance, many scholars have conducted related investigations and achieved impressive progress over the last decades [18]–[21]. One common approach is to calculate and synthesize all camera images directly [22], improving the occluded object imaging quality through additional processing of the source light field data or the preliminary synthetic image. For example, Joshi et al. [23] propose a natural sequence matting algorithm applied after synthesis to obtain better imaging performance; in their work, high-frequency components in the reconstructed image are regarded as noise and occlusion when focusing on the occluded target. In 2013, Pei et al. [24] solve this problem by modeling it as an occluder labeling problem: they utilize energy minimization to label pixels occupied by the occluder in each camera view and average the unlabeled pixels to reconstruct the hidden object. Xiao et al. [25] employ a global optimization framework and present a novel iterative reconstruction approach that refines the reconstruction result through a coarse-to-fine strategy. Another kind of algorithm is carried out during the procedure of synthetic imaging [26]: the occluded image is reconstructed by filtering the light information and eliminating occlusion rays. For instance, Yang et al. [27] select the optimal camera view during synthesis based on multiple-label energy minimization; this method is robust to complex occluders and severe occlusion. Later, building on the notion of fully visible pixels, they [19] investigate all-in-focus synthetic aperture imaging; by using maximum chromatic aberration as the energy minimization term in a multi-layer visibility propagation method, the scene is divided into different visible layers. In order to remove unwanted reflections and obstructions, Xue et al. [28] separate occlusions according to their different motions in 2015 and reconstruct the occluded object. In 2018, Pei et al. [29] propose a novel method to estimate foreground and background using image matting via energy minimization. Although these approaches can effectively recover the occluded object image and demonstrate good results in their own experiments, most of them are based on a fixed synthetic aperture, which has inherent limitations on perspective imaging, and they mainly focus on mathematical algorithms while paying little attention to scene semantic information.

In order to solve this problem, we first explore the factors affecting imaging performance from the camera array itself. The structure of the array, especially its baseline, is one of the key parameters influencing the final imaging result. The array baseline is closely associated with the synthetic aperture, and in most existing systems the two are equal: once the array baseline is selected, the synthetic aperture is determined too. However, such a fixed aperture lacks the flexibility to meet different application requirements. A small synthetic aperture collects light rays over a limited angle, which restricts its perspective ability, whereas a large synthetic aperture has a better de-occlusion effect but is accompanied by more hybrid view data. According to the discussion above, we employ a variable synthetic aperture in this paper. But the light field data then become more complex, covering different observation views: light rays from foreground, noise and background are mixed together, which is another critical factor degrading occluded object imaging quality. Therefore, this paper needs to extract the most useful object light rays from all light field data for synthesis. The information extraction contains two parts: (1) under a sufficient perspective imaging angle, we screen effective cameras from the array; (2) for the foreground occlusion, only the light rays that carry the visual information of the occluded target participate in synthesis, and the other redundant data are eliminated.

Fig. 2 provides an intuitive view of the proposed synthetic aperture imaging principle. We first collect the light field data of the observation scene using a camera array. Then, through two levels of visual data screening, only the useful light rays of the object (marked in green) are retained. Specifically, the two-tiered visual data screening is driven by the following two kinds of data. Firstly, the degree and location of occlusion vary from object to object, so the synthetic aperture differs with each object's concrete condition; Fig. 2 symbolically depicts the variable synthetic aperture of every pixel based on its perspective angle and object depth. Accordingly, optimal sub-array selection driven by perspective angle and object location is required. Secondly, we make an in-depth analysis of the source light field data, as shown in the bottom right corner of Fig. 2. This paper principally explores the semantic information contained in each light ray and labels the rays by their meaning; object light rays can then be separated from the mixed visual data with the help of this semantic information. As can be seen, driven by the semantic data, the useful light rays marked in green are retained and the other irrelevant rays are removed. In addition, it is noteworthy that the proposed variable synthetic aperture imaging and the semantic mining are iterated in turn: semantic information explored from the synthetic image is fed back to the labeling map for the next synthesis. Through the above analysis, compared with general methods, occluded object imaging can be implemented with better perspective ability and higher imaging quality.

FIGURE 2. The research principle of the proposed data-driven synthetic aperture imaging method. It incorporates two meanings: (1) Driven by perspective angle and object depth, the sub-array is varied for better de-occluded imaging; (2) To filter foreground occlusion and irrelevant information, light ray screening driven by semantic information is presented. In this way, light marked in green is retained and our imaging result is greatly improved compared with the traditional approach.

This paper systematically investigates the fundamental idea above and proposes a novel data-driven variable synthetic aperture imaging algorithm based on semantic feedback. More specifically, with a camera array for light field data acquisition, we scan the whole observation scene depth by depth. For each depth's imaging, the proposed algorithm consists of two components: a feedback-based semantic labeling algorithm, and data-driven variable synthetic aperture imaging. The first part builds upon the success of deep object detectors to explore scene semantic information. To improve imaging performance, the semantic content we are concerned with in this paper is mainly the foreground occlusion, especially the positional relationship between a potential occluder and the synthetic depth. So if an object that may act as an occluder for targets behind it is detected in the synthetic image, we calculate the detection information back to the array and record its depth information on the labeling map. In this way, the semantic labeling map is updated; at the same time, the object information mined from the synthetic image is fed back to the semantic labeling map for the next synthesis. In the second part, based on the semantic labeling map, we present a data-driven variable synthetic aperture imaging approach for occluded object reconstruction. The "variable aperture" is driven by two aspects of data: the first is the synthetic depth and the optimum perspective angle, which extract the useful synthetic view from the mixed array observation angles, whereas the second is the visual data in the semantic labeling map, which filter out the foreground occlusion. In this way, only cameras with an effective view of the object's light rays participate in imaging, and other irrelevant data are removed. With increasingly comprehensive information in the semantic labeling map, the proposed data-driven variable synthetic aperture algorithm based on semantic feedback implements a high-performance de-occlusion imaging system.

The main contributions of this paper are twofold.

  • We propose a data-driven variable synthetic aperture imaging algorithm to achieve better perspective ability and higher de-occluded imaging performance simultaneously. It removes foreground occlusion with a two-tiered variable aperture: one tier is driven by object depth and perspective angle to change the synthetic aperture for effective sub-array selection; the other is driven by the different occlusion conditions to adjust the aperture through optimal light ray selection. Thus, the synthetic aperture in this paper is flexible rather than fixed. The proposed method can adaptively fit the specific situation of every occluded object and provide its best synthetic image.

  • We present a novel semantic feedback strategy which serves as the basis for data-driven variable synthetic aperture imaging. Different from other methods, we pay attention to the semantic meaning of each light ray, and this is the core component driving the synthetic aperture change. Specifically, with the help of an effective object detector, we first mine the foreground semantic content from the synthetic image in the forward pass and preserve it in the semantic labeling map. Meanwhile, we feed the semantic information back to the visual data mining that determines the next synthetic aperture. The proposed method gives full play to the role of scene semantic information and greatly improves synthetic aperture imaging performance.

On the basis of the semantic feedback strategy and the data-driven variable synthetic aperture imaging approach, a de-occlusion system with better perspective ability and higher synthetic imaging quality is implemented. To evaluate its performance, we apply it in both challenging indoor scenarios and real outdoor environments, including serious occlusion, multi-layer shelters, complex mutual occlusion, etc. Experiments demonstrate that our work maintains satisfactory imaging results and achieves an obvious advantage over other methods.

The remainder of this paper is organized as follows. In Section II, we propose a novel high-quality occluded object imaging algorithm based on the semantic feedback strategy and data-driven variable synthetic aperture imaging. The performance of the proposed method and comparative experimental results with other approaches are presented in Section III. Finally, we conclude the paper in Section IV.

SECTION II.

Data-Driven Variable Synthetic Aperture Imaging Based on Semantic Feedback

In this section, we introduce our novel data-driven variable synthetic aperture imaging algorithm based on semantic feedback to achieve better occluded object imaging performance. Suppose the occluded object is denoted as A . According to the observation scene and the foreground occlusion, the best perspective angle is \alpha . As shown in Fig. 3, if the depth of the object location is D , the best aperture width B can be calculated as follows:\begin{equation*} B=2 \times D \times \tan \frac {\alpha }{2} \tag{1}\end{equation*}
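As a minimal sketch of Eq. (1) (function name and units are our own), the following Python snippet maps a perspective angle and a synthetic depth to the required aperture width; with the 4° angle tuned in Section III, a 3 m object depth needs roughly a 21 cm aperture:

import math

def aperture_width(depth, angle_deg):
    """Eq. (1): B = 2 * D * tan(alpha / 2).

    depth     -- synthetic depth D of the object (any length unit)
    angle_deg -- perspective angle alpha, in degrees
    """
    alpha = math.radians(angle_deg)
    return 2.0 * depth * math.tan(alpha / 2.0)

print(aperture_width(3.0, 4.0))  # -> ~0.2095, about 21 cm of aperture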


FIGURE 3. The calculation model of the proposed variable synthetic aperture imaging algorithm. Left: with the suitable perspective angle (labeled in green), the synthetic aperture differs for different object depths; Right: the visual data acquisition of each camera can be regarded as a beam of light rays that includes information of foreground, object and background.

Cameras within this scope will participate in the synthesis, which is favorable for removing the confounding views of multiple cameras. Then, the object of interest is inversely mapped to each camera, which can be formulated as a beam of light rays, as shown on the right of Fig. 3. Due to the mutual occlusion between targets, the light rays of one camera fall into the following three groups:\begin{align*} L_{i}(A)=&\{ \underbrace {L_{i{f}_{1}}, L_{i{f}_{2}},\ldots, L_{i{f}_{m}}}_{foreground}, \\&\underbrace {L_{i{o}_{1}}, L_{i{o}_{2}},\ldots, L_{i{o}_{n}}}_{object}, \\&\underbrace {L_{i{b}_{1}}, L_{i{b}_{2}},\ldots, L_{i{b}_{q}}}_{background}\}\tag{2}\end{align*}


Take the ith camera in the array as an example: the light rays reflecting the foreground are denoted as L_{if} and labeled in red in Fig. 3. The blue rays L_{ib} carry the background information. The rays L_{io} , marked in green, are the useful object light rays for occluded object imaging. As can be seen, L_{ib} has little effect on image quality degradation, whereas L_{if} is the major adverse contributor to de-occlusion imaging.

Therefore, to obtain better imaging quality, our aim is to remove the foreground from all visual data obtained by the camera array. That means only the light rays of the object L_{io} and background L_{ib} may participate in synthesis. However, which light rays are blocked by foreground occlusion is unknown; for each pixel P , we need to distinguish each light ray according to its specific content. So how to mine and screen the source light field data is the keystone of this paper. This problem can be summarized as (3), where L_{A_{i}} is the corresponding light ray in each valid camera. Through \mathscr {F} , the value of pixel P can be synthesized:\begin{equation*} P=\mathscr {F}(L_{A_{1}}, L_{A_{2}}, \ldots, L_{A_{i}}, \ldots,L_{A_{N}}) \tag{3}\end{equation*}
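As a minimal sketch of Eq. (3) under the naive choice of \mathscr {F} (a plain average, i.e. classic synthetic aperture focusing; the names are hypothetical):

import numpy as np

def synthesize_pixel(rays):
    """Naive instance of Eq. (3): fuse the rays L_{A_1},...,L_{A_N} gathered
    for one pixel P by plain averaging. Section II-B below replaces this
    with semantically screened fusion.

    rays -- array of shape (N, 3), one RGB sample per valid camera
    """
    return np.asarray(rays, dtype=np.float64).mean(axis=0)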


An illustration of the proposed algorithm is shown in Fig. 4. Through the multi-camera array, we first collect the light field data of the observation scene, as shown on the left of the figure. Then, we detail the feedback-based semantic labeling algorithm, which lays a solid foundation for the follow-up synthesis; this algorithm flow is shown in the upper-right module of the figure. Next, on the basis of the semantic labeling map, a two-tiered data-driven variable synthetic aperture imaging algorithm is discussed, and the occluded object is imaged with high quality and a satisfactory perspective effect.

FIGURE 4. An illustration of the proposed data-driven variable synthetic aperture imaging method. Our approach contains two main parts: semantic labeling map based on feedback and data-driven variable synthetic aperture imaging. In the first part, using the source light field data collected by the camera array, we scan the observation scene depth by depth. The current depth semantic information is mined with a deep object detector, and the semantic labeling map is updated based on feedback. Then, driven by the semantic labeling map and the object's concrete condition, a variable synthetic aperture imaging method is presented and the occluded object is reconstructed with high performance.

A. Semantic Labeling Map Based on Feedback

Source light field data are the basis of synthetic aperture imaging, so we begin with a brief introduction of our data acquisition device. A camera array is one of the main and most commonly used devices for acquiring light field data and it has a wide range of applications [30], so we choose a camera array as our data collection equipment. In particular, this paper takes a linear array as an example to introduce the proposed method; the details of the camera array are given in Section III. On this basis, we then initialize the array imaging parameters. This step calculates the exact relative positions between cameras and determines the imaging range. We adopt the parallel parallax method, which offers high accuracy, fast processing speed and robustness to scene variations. After calibration, the camera array can be used for light field data acquisition.

With the light field data acquisition in place, we first introduce the proposed feedback-based semantic labeling algorithm, which is the key determinant of de-occlusion performance. It consists of three modules: semantic labeling map initialization, forward scene semantic mining, and semantic labeling map update with feedback.

1) Semantic Labeling Map Initialization

Utilizing the multi-camera array mentioned earlier, the light field data of the observation scene can be captured. We organize these data into several sets of image sequences, where the images of one set are captured by all cameras in the array at the same time; thus an image set consists of N images from N cameras. Assume that the image obtained by the ith camera is denoted as I_{i} and that its size is w_{i} \times h_{i} . There are N images for one synthesis. If one pixel is regarded as a light ray in the light field, the N images contain N \times w_{i} \times h_{i} light rays. All of them are the source visual data for imaging, carrying information of foreground, background and object. The depth of observation ranges from R_{L} to R_{H} .

In addition, to mine scene semantic information, the proposed method also constructs an empty semantic labeling map D_{i} corresponding to each image in the sequence. The size of D_{i} equals that of I_{i} , namely w_{i} \times h_{i} . The depth of each pixel is saved in its associated semantic labeling map.
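A minimal sketch of this initialization (the -1 sentinel for "no depth information yet", rendered dark blue in Fig. 6, is our assumption):

import numpy as np

def init_labeling_maps(images):
    """Create one empty semantic labeling map D_i per source image I_i,
    of the same w_i x h_i size; each cell will later hold the depth of
    the corresponding light ray. -1 marks 'no information yet'."""
    return [np.full(img.shape[:2], -1.0, dtype=np.float32) for img in images]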

2) Forward Scene Semantic Mining

After camera array initialization, we start the synthetic aperture imaging system. To achieve better imaging quality, the first step in our work is to mine the semantic information hidden in each light ray and save it in the semantic labeling map. Concretely, this subsection contains two modules: depth-by-depth scanning, and forward semantic information mining.

Firstly, the proposed method scans the whole scene depth by depth. Scanning depth is measured relative to the camera array: a low scanning depth means the plane is close to the array, and vice versa. The observation range [R_{L}, R_{H}] has already been calculated during initialization. The scanning step depends on the user's application requirements and the processor's computing ability; to balance the two factors, it should take an appropriate value, neither too small nor too big. Too big a step may miss important object information, whereas too small a step incurs heavy computational cost; within the computational capacity of the server, the smaller the scanning step the better. If the suitable step is denoted as s , the total number of scanning depths is M , the ratio of the scanning range to the depth step. Take the jth depth as an example, at which our system is currently scanning. The semantic labeling map has already been updated after the (j-1)th depth scan. As shown in Fig. 5, foreground located in front that may obscure the target is marked in red or yellow in the semantic labeling map; different colors mean foreground occlusions located at different depths.
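The scan itself reduces to enumerating M = (R_{H} - R_{L})/s depth planes. A small sketch, with an assumed observation range and the 10 mm step reported in Section III:

def scanning_depths(r_l, r_h, step):
    """Depth planes for the depth-by-depth scan, ordered from near the
    camera array (low depth) to far; M is the range/step ratio."""
    m = int(round((r_h - r_l) / step))
    return [r_l + j * step for j in range(m + 1)]

# Assumed 1.0-3.5 m observation range with the paper's 10 mm step
print(len(scanning_depths(1.0, 3.5, 0.01)))  # 251 synthesis passes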

FIGURE 5. Semantic labeling algorithm based on feedback. With the source light field image and the previous semantic labeling map, we first mine scene semantic information with a deep object detector. In this figure, the green rectangle is the object region and it is inversely propagated to the semantic labeling map for the next depth synthesis.

Based on the current semantic labeling map calculated from the previous depths' synthesis, we synthesize the scene at this depth using the proposed data-driven variable synthetic aperture imaging algorithm; the synthetic process is introduced in detail in the follow-up section. With the synthetic image of the current depth scan, this paper then describes the proposed forward semantic mining method.

Concretely, the forward semantic mining algorithm adopts an object detector to mine scene semantic information from the synthetic image at the current depth. The type of object detector determines the occlusion types we can remove, and different detectors have different prospects for de-occlusion. Both object detectors based on deep neural networks and those based on traditional hand-crafted features can be employed. Carefully comparing the two, deep object detectors have two main advantages: (1) they have been successfully applied in many fields with higher accuracy and robustness than traditional methods, which is beneficial for deeply exploring the observation scene's visual data; (2) they come in many varieties, so different aspects of scene semantic information can be mined well, and the choice of detector remains flexible according to actual demand. Our approach is therefore built on a deep object detector. The detector is used as an information extraction tool and is not our research focus, so we only describe one detector type in this paper to verify the algorithm's efficiency.

As the face is the most common object of interest in many applications, we use it as the running example in the following sections to explain how the proposed algorithm works. In this paper, the face detector employed to detect potential faces in the synthetic image is trained with a deep network. Concretely, we utilize MobileNet-SSD as our deep learning network. It is a combination of MobileNet [31] and SSD (Single Shot Multibox Detector) [32] that absorbs the advantages of both: the high detection accuracy and low runtime of SSD, and the low complexity and ease of implementation of MobileNet. For the network framework, the configuration of MobileNet-SSD from Conv0 to Conv13 is fully consistent with the MobileNet V1 model. Following the SSD framework, eight convolution layers are added behind MobileNet's Conv13, and features are extracted from six feature maps of different scales for object detection. The deep learning network is an important factor affecting detector performance, as is the training database. Our detector is trained on the WIDER FACE database [33], which contains 32,203 images selected from the publicly available WIDER dataset. The total number of labeled faces is 393,703, covering various illuminations, scales, poses and facial expressions. All images are divided into 61 classes and 40% of each class is randomly chosen for training.

As the object detection result, the position of each face is preserved as \{x_{L}, y_{L}, x_{R}, y_{R}\} , where \{x_{L}, y_{L}\} and \{x_{R}, y_{R}\} are the top-left and bottom-right coordinates of the face region, respectively. As shown in Fig. 5, this object information is the semantic data our system is concerned with at the current depth. Moreover, the face detector above is just one example of an object detector in this paper; the proposed method pairs different detectors with different foreground occlusions and is universal across different shields.
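A small sketch of how such detections can be packaged for the feedback step (the record layout beyond the stated \{x_{L}, y_{L}, x_{R}, y_{R}\} box plus the current depth is our assumption):

from dataclasses import dataclass

@dataclass
class FaceBox:
    """One detected face: top-left (x_l, y_l) and bottom-right (x_r, y_r)
    corners in synthetic-image coordinates."""
    x_l: int
    y_l: int
    x_r: int
    y_r: int

def tag_with_depth(boxes, current_depth):
    # Pair every detection with the scanning depth at which it was found;
    # these (box, depth) records are what the feedback step writes back.
    return [(box, current_depth) for box in boxes]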

Through the procedure above, we finish mining semantic information from the current synthetic image: possible occluding-object information is obtained together with its depth and position. In other words, the semantic information in the synthetic image is propagated forward. Put slightly differently, this paper does not mine all semantic information in the synthetic image, but only object information. This is because the mutual occlusion between targets, which most actual monitoring scenarios focus on, is caused by the objects in front rather than by other static subjects. In addition, the above describes semantic mining from the synthetic image at one depth; once all depths of the observation scene are scanned, all semantic information is explored.

3) Semantic Labeling Map Update Based on Feedback

An object detected at the current depth in the last section may be a foreground shield for subsequent depth synthesis. To be well prepared for the next depth synthesis, the semantic labeling map needs to be updated along with the scanning. This processing flow is shown intuitively in the right module of Fig. 5.

The left module of Fig. 5 shows the light field source image and its semantic labeling map. In the current depth scan (the depth is labeled in green in the middle part), we obtain the synthetic result using the previous semantic labeling map and the proposed data-driven variable synthetic aperture imaging method; the details of this imaging algorithm are presented in Subsection B. Then, with the forward scene semantic mining algorithm discussed in the last section, object information is mined from the synthetic image. After that, as the right module shows, the semantic information is inversely propagated to the next synthesis, which is what we introduce in detail in this section. From this figure, one can see that the semantic labeling map update is in fact semantic information feedback: the scene semantic information extracted from the synthetic image by forward semantic mining is transmitted back for the next depth synthesis.

The specific semantic feedback method is as follows. Firstly, with the object position and depth from the semantic mining result, the proposed method inversely calculates these object parameters back to the array. More specifically, in Section II-A 1), the transformation relationship between the cameras and the observation scene was obtained beforehand by camera array calibration. According to this relation, the object information above can be mapped back to the corresponding camera images. The object information contains depth, size and position. Suppose pixel P is a point of object A in the synthetic image; it is calculated back to the N cameras in the array, where I_{i} is the corresponding image and i =\{1,2,3, {\dots },N\} . The pixels associated with P in each image can be denoted as \{P'_{1},P'_{2},P'_{3}, {\dots },P'_{N}\} ; they reflect the object's visual data from different observation angles. Using the same method as for P , all pixels in the object region marked in the last section can be mapped back to each camera.

In the second step, with the inverse mapping result of the object, we update all semantic labeling maps. The semantic labeling maps shown on the left of Fig. 5 have a one-to-one relationship with the cameras in the array. The size of I_{i} and its semantic labeling map is the same, and they are aligned with each other at the pixel level; the target information contained in every pixel is preserved at the corresponding position in its semantic labeling map. In this figure, the semantic information mined at the current depth is marked in green. With these correspondences, we write this information into the semantic labeling map. For each pixel P , there are two situations when updating: (1) if the position where P is located is still blank from initialization, we fill it with the current depth value, indicating that this light ray comes from the present depth and there exists an object of interest; (2) if the position has already been used, a foreground lies ahead of the current depth, so we do not update its depth value.
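The two-case rule can be written down directly; a sketch assuming the -1 "blank" sentinel from the initialization above:

def update_labeling_map(label_map, pixels, depth, empty=-1.0):
    """Feedback update for one camera's semantic labeling map.

    label_map -- 2-D float array aligned with image I_i
    pixels    -- (row, col) back-projections P'_i of the object region
    depth     -- current scanning depth
    """
    for r, c in pixels:
        if label_map[r, c] == empty:   # case (1): blank since initialization
            label_map[r, c] = depth    # this ray comes from the present depth
        # case (2): already labeled -> a closer foreground exists, keep it
    return label_map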

The previous semantic labeling map only contains semantic information of three depths, marked in yellow, red and dark red respectively. With the object information detected at the current depth (labeled in green), we feed it back to the current semantic labeling map according to the proposed method; the updated result is shown on the right side. From an overall point of view, the semantic labeling map gains some green annotated depth data. In detail, as Fig. 6 shows, different colors represent objects located at different depths. Positions that have not been updated since initialization are annotated in dark blue to indicate no depth information. The object information in the updated semantic labeling map may be the foreground occlusion for the next synthesis; therefore, as seen in the figure, we feed it back for better synthesis at subsequent depths.

FIGURE 6. The detailed semantic labeling map. Different colors represent different object depths. Specially, dark blue means no information since initialization, indicating no foreground occlusion located in front of this position at the current depth.

To conclude, after the steps above are completed, the semantic labeling map for the current scan is generated with the proposed semantic feedback method. This map records the precise depth of each ray through visual data analysis. On this basis, occluded object imaging can be implemented in the next section with better performance.

B. Data-Driven Variable Synthetic Aperture Imaging

The synthetic aperture imaging mentioned above is described in detail in this part. It is the final but most important step of our occluded object imaging system. Different from traditional synthetic aperture imaging methods, this paper proposes a novel data-driven variable synthetic aperture imaging algorithm.

In order to obtain better synthetic aperture imaging performance, we first examine the principle of occluded object imaging. As one of the most important branches of computational imaging, synthetic aperture imaging based on multi-view light field information overcomes the limitation of the traditional single-view imaging mechanism. When occlusion occurs, a large virtual synthetic aperture has a shallow depth of focus, which can effectively weaken foreground occlusion and realize "perspective" detection of the occluded target. However, traditional synthetic aperture imaging adjusts the focus depth through purely mathematical calculation of the observation depth factor in the imaging model, ignoring object semantic information. Meanwhile, synthesis methods that treat all occluded objects equally at the pixel level also cause a drop in imaging quality. Taking the two factors into account, we propose a two-tiered data-driven variable synthetic aperture imaging algorithm.

The variable synthetic aperture is driven by two aspects of data. On the one hand, in order to achieve a satisfactory perspective effect, we choose appropriate camera data driven by perspective angle and synthetic depth: the cameras with less occlusion within the effective imaging range are selected. On the other hand, to improve imaging quality, we screen the light ray data driven by the semantic labeling map, which stores the scene semantic information: only light rays that reflect object information and are not blocked by the foreground participate in the synthesis. The detailed flow is shown in Fig. 7. The proposed data-driven variable synthetic aperture imaging includes two parts: optimal sub-array selection driven by synthetic depth and perspective angle, and optimal light ray selection driven by the semantic labeling map.

FIGURE 7. Data-driven variable synthetic aperture imaging algorithm. It is a two-tiered variable aperture strategy. First, for a better perspective effect, we select the optimal sub-array. Second, to remove foreground occlusion and irrelevant information, we select the optimal light rays based on the semantic labeling map.

1) Optimal Sub-Array Selection Driven by Synthetic Depth and Perspective Angle

As we can see from the upper half of Fig. 7, driven by synthetic depth and perspective angle, the proposed method first adjusts the synthetic aperture by selecting the optimal sub-array. Only part of the cameras in the array are retained for synthesis (labeled in green).

To illustrate our sub-array selection principle more directly, we model the computational flow of imaging, as shown in Fig. 8, taking two depths marked in red and green as an example. As mentioned above, the perspective angle has an appropriate value for occluded object imaging, neither too large nor too small. The best perspective angle is determined by the actual application requirements, including foreground occlusion size and distribution density, camera spacing in the array, special site constraints, and deployment cost. Thus the best perspective angle needs to be designed for each system individually, with empirical fine-tuning, but the parameter is fixed for each specific de-occlusion system. Under a fixed perspective angle, the synthetic aperture differs for different synthetic depths. The lower half of Fig. 8 displays this intuitively from an overhead view: the green part and the red part have the same perspective angle, but their aperture widths differ due to their different synthetic depths. In other words, the synthetic aperture varies from depth to depth. Assuming the perspective angle is \alpha and the synthetic depth is D, the aperture width B can be calculated by (1).

FIGURE 8. Variable synthetic aperture imaging based on optimal sub-array selection. Red and green mark two object depths with different aperture widths. After the synthetic aperture width is determined, we choose the sub-array with minimum occlusion, as marked by the blue line.

However, more than one sub-array may meet the perspective angle requirement when the aperture width is small; in the example in Fig. 8, three candidate sub-arrays marked in different colors are shown on the right side. How do we choose the optimal sub-array? Our ultimate goal is to improve occluded object imaging quality, so the proposed method compares all possible sub-arrays using the semantic labeling map and selects the one with minimum foreground occlusion. The judgement is implemented based on the semantic labeling map obtained in the previous section.
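A sketch of this selection under our assumptions (contiguous sub-arrays whose size n_cams follows from the aperture width B of Eq. (1) and the camera spacing; the occlusion cost is counted as the number of rays already labeled closer than the synthetic depth, with -1 meaning unlabeled):

import numpy as np

def select_subarray(label_maps, depth, n_cams):
    """Choose the n_cams contiguous cameras whose labeling maps contain
    the fewest rays labeled in front of the synthetic depth, i.e. the
    sub-array with minimum foreground occlusion (the blue one in Fig. 8)."""
    best_start, best_cost = 0, float("inf")
    for start in range(len(label_maps) - n_cams + 1):
        window = label_maps[start:start + n_cams]
        # count rays whose labeled depth lies in front of the current depth
        cost = sum(int(np.sum((m >= 0) & (m < depth))) for m in window)
        if cost < best_cost:
            best_start, best_cost = start, cost
    return list(range(best_start, best_start + n_cams))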

2) Optimal Light Ray Selection Driven by Semantic Labeling Map

After optimal sub-array selection, this paper further screens the light rays driven by the semantic labeling map for better imaging quality, as shown in the lower half of Fig. 7. The entire synthesis is made up of every pixel's synthesis: when all pixels finish their synthetic calculation in turn, the occluded object imaging is complete. The synthesis of each pixel is the same, so we take one pixel P of the synthetic image as an example to introduce the proposed method. First, its position is inversely calculated to the n cameras of the optimal sub-array above. With the corresponding source images I_{i} , the related light rays can be denoted as a set:\begin{equation*} \{ L_{I_{1}}, L_{I_{2}}, \ldots, L_{I_{i}}, \ldots,L_{I_{n}}\}\tag{4}\end{equation*}


The most immediate solution is to take their average as the final synthetic value of pixel P . But unfocused light rays from the foreground occlusion would then also enter the calculation and degrade imaging performance; this is exactly the drawback of the traditional research route. Different from this, the proposed method handles the synthesis driven by scene semantic information to remove the foreground light rays. The specific process is as follows.

First of all, we employ the semantic labeling map to distinguish occlusion information. The semantic labeling map calculated in the last subsection preserves every light ray's depth. Let L_{I_{i}} be a light ray in the light ray set of P . The proposed method compares the current scanning depth D_{C} with the labeled depth D_{L_{Ii}} of L_{I_{i}} in the semantic labeling map. This is formulated as:\begin{equation*} \begin{cases} D_{C} > D_{L_{Ii}}, & L_{I_{i}}~\text {is foreground} \\ D_{C} \leq D_{L_{Ii}}, & \text {otherwise} \\ \end{cases}\tag{5}\end{equation*}


If D_{C} is greater than D_{L_{Ii}} , the scene information captured by L_{I_{i}} is located in front of the current depth and, on further analysis, may be foreground occlusion. Conversely, it indicates that no occlusion lies before the current depth. Using the formula above, every light ray in the set can be judged against the semantic labeling map.

Then, we synthesize every pixel with a variable aperture. The basic idea is to change the synthetic aperture according to the specific situation of each pixel. Specifically, a light ray whose labeled depth is lower than the current depth is regarded as foreground occlusion and is not allowed to participate in the synthesis, whereas the other light rays synthesize this pixel. The process is repeated until the whole image is synthesized. With this semantic-driven selection of synthetic data, the mixed views of the object and the light rays of the foreground occlusion are filtered out, and this paper achieves occluded object imaging with strong perspective ability and high imaging quality.
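Putting Eq. (5) and the per-pixel averaging together, a sketch of the screened synthesis (the fall-back for a fully blocked pixel is our assumption, not specified by the paper):

import numpy as np

def synthesize_pixel_screened(rays, labeled_depths, d_current, empty=-1.0):
    """Screened fusion of one pixel: rays whose labeled depth D_L lies in
    front of the scanning depth D_C image a foreground occluder (Eq. (5))
    and are discarded; the surviving rays are averaged.

    rays           -- (n, 3) RGB samples from the optimal sub-array
    labeled_depths -- (n,) depths of those rays from the labeling maps
    """
    keep = (labeled_depths == empty) | (labeled_depths >= d_current)
    if not np.any(keep):             # assumed fall-back: keep all rays
        keep = np.ones(len(rays), dtype=bool)
    return rays[keep].mean(axis=0)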

By alternately iterating the feedback-based semantic labeling map and the data-driven variable synthetic aperture imaging, all occluded objects in the observation scene can be imaged. Multiple shelters and severely hidden objects are handled well by the semantic mining strategy and the data-driven variable synthetic aperture imaging algorithm.

SECTION III.

Experiment Results and Analysis

A. Experimental Setup

1) Camera Array

In order to verify and evaluate the performance of the proposed imaging algorithm, a multi-camera array is designed and all experiments are conducted on it. For an occluded object imaging system, the first and foremost process is camera array construction: appropriate imaging equipment not only provides comprehensive and adequate data support for algorithm implementation, but also affects the final imaging performance. Therefore, this paper describes our imaging equipment, the linear camera array, in detail. Fig. 9 shows these multi-camera arrays; they are developed by ourselves, from camera selection through internal structure design and array arrangement to the final camera array construction. Next, we introduce the camera arrays from the following aspects: array arrangement, camera selection, and data transmission.

FIGURE 9. The linear camera arrays used as data acquisition devices. As shown, the two arrays are constructed with different camera types. Left: linear array with modular cameras. Right: linear array with commercially available network cameras.

Firstly, regarding array arrangement, the proposed method imposes no limitation on the array configuration, because our only requirement is that the camera array capture enough light rays of the occluded object; the configuration itself does not matter. This paper takes the linear arrangement as an example: all cameras are distributed along an approximately straight line. The array baseline is influenced by the observation scope and occlusion size; a large observation scope or occlusion size corresponds to a wide baseline for comprehensive monitoring. The spacing between two adjacent cameras can be adjusted flexibly to the user's imaging requirements, and spacings can be equal or different: regions with high target density call for more cameras, i.e. smaller camera spacing, and vice versa. Since the density distribution of human flow in our observation area is unknown, we use uniform camera spacing in this array. Moreover, a linear distribution provides benefits in many ways, including better perspective imaging, more extensive horizontal coverage, and ease of production. Concretely, the two camera arrays consist of 8 cameras; the distance between the first camera and the last camera is 200cm, and the camera spacing in this array is 30cm.

Secondly, we discuss the selection of cameras for the array. It is generally known that the camera is the fundamental component for data acquisition. Cameras suitable for arrays fall mainly into two categories, industrial and non-industrial, each with its respective weaknesses and strengths. Industrial cameras have better acquisition quality, but they are heavy and costly. Non-industrial cameras are lighter and cheaper, with a slight drop in imaging performance, and they impose fewer constraints on the deployment environment, which benefits system deployment. After considering the above and the requirements of practical application, we concluded that a self-built array of non-industrial cameras is more suitable for the current research and for future productization. In addition, we chose two kinds of non-industrial cameras to test the method's generality across camera types: a modular camera with convenient deployment (Fig. 9, left) and a commercially available network camera with steady performance (Fig. 9, right). Concretely, the resolution of each camera is 1920\times 1080 and their focal length is uniformly 4mm.

The third aspect is the data transmission method. USB interfaces and networks are the two most commonly used approaches. With the development of networks, network-based data transmission is more convenient and supports remote control. Considering actual application needs, the data in this paper are transmitted over the network.

2) Database

Databases are important for the objectivity of system performance evaluation. Many factors impact the final imaging quality, including occlusion degree, object spacing and the number of occlusion layers. In this paper, we evaluate the proposed method on a public database and a self-built database.

This paper first employs the UCSD light field database [23] to test de-occlusion imaging performance. It is a public database established by Joshi et al. in 2007. The data acquisition equipment is a multi-camera array arranged on an 8\times 1 grid. Five people walk around the outdoor scene and mutual occlusion occurs randomly between them, so occlusion degrees and object spacings vary. The database contains a total of 276 frames with an image resolution of 640\times480 . Such comprehensive and diverse visual data benefit system performance evaluation.

To explore the influence of these factors more clearly, we also build a new occluded object imaging database to demonstrate the advantage and generalization of the proposed algorithm. All data in this database are collected by the linear multi-camera array introduced above. The database covers different occlusion situations, such as occlusion degrees ranging from 0% to 100%, different object spacings, different numbers of occlusion layers, and objects in unknown, complex occluded scenes. Both indoor and outdoor environments are included.

3) Imaging System Setting

The proposed occluded object imaging method is deployed on a computer equipped with a 2.80GHz CPU, 8GB RAM and a GTX 1060 GPU. Our array calibration program is implemented in MATLAB, and the occluded object imaging in Microsoft Visual C++ with OpenCV. The processing time consists of two parts: multi-camera calibration, which takes 54s, and occluded object imaging, which needs 8ms for one depth synthesis. In addition, for our observation target (faces), we set the scanning step to 10mm on this processor. Through mathematical calculation and empirical adjustment, the optimum perspective angle after several trials is 4°.

4) Evaluation Metrics

Imaging quality is evaluated by the Peak Signal-to-Noise Ratio (PSNR). It is defined as:\begin{align*} PSNR=&10 \log _{10}\left ({\frac {\left ({2^{n}-1 }\right)^{2}}{MSE} }\right) \\ MSE=&\frac {1}{w\times h}\sum _{i=0}^{w-1}\sum _{j=0}^{h-1}\left \|{ I\left ({i,j}\right)-R\left ({i,j}\right) }\right \|^{2}\tag{6}\end{align*}


MSE is the mean square error between the current image I and the reference image R; h and w are the height and width of I; n is the number of bits per pixel, which is 8 in this paper. We only compute the PSNR of the reconstructed face region for a more accurate evaluation of de-occlusion imaging quality. A larger PSNR indicates higher imaging performance, and vice versa.
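A direct implementation of Eq. (6) restricted to the detected face rectangle (the names are ours):

import numpy as np

def psnr_region(img, ref, box, bits=8):
    """PSNR of the reconstructed face region (Eq. (6)).

    img, ref -- synthetic and reference images as uint8 arrays of equal size
    box      -- (x_l, y_l, x_r, y_r) face rectangle to evaluate
    """
    x_l, y_l, x_r, y_r = box
    i = img[y_l:y_r, x_l:x_r].astype(np.float64)
    r = ref[y_l:y_r, x_l:x_r].astype(np.float64)
    mse = np.mean((i - r) ** 2)
    if mse == 0:
        return float("inf")          # identical regions
    return 10.0 * np.log10(((2 ** bits - 1) ** 2) / mse)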

B. Experimental Results

We evaluate the performance of the proposed method on four occluded object imaging tasks: imaging under different occlusion degrees and object spacings, imaging through multi-layer occlusion, and imaging in complex indoor scenes and real outdoor environments. To demonstrate the imaging results more clearly, some key local regions are enlarged and shown in the upper right corner. The following parts discuss the imaging performance under the different occlusion situations in detail.

1) Different Occlusion Degree and Object Spacing

The distribution density of occlusion is one of the important factors affecting the imaging result; in particular, the occlusion degree and object spacing directly determine the amount of perspective light rays. This paper evaluates imaging performance under different occlusion degrees (0%, 50% and 100%) and object spacings (varying over the interval [40cm, 240cm]). Figs. 10 and 11 show the occluded object imaging quality intuitively, and Table 1 summarizes the quantitative results.

TABLE 1. Quantitative Evaluation of Different Occlusion Degree and Object Spacing
FIGURE 10. Some examples of occluded object imaging results with different occlusion degrees. All results are obtained under the same object spacing (160cm) and different occlusion degrees. The upper right corner is the magnification of the key area.

FIGURE 11. Some examples of occluded object imaging results with different object spacings. All results are obtained under complete occlusion and different object intervals. The upper right corner is the magnification of the key area.

Under the same object spacing of 160cm, as Fig. 10 displays, the person behind is covered by the first person to degrees of 0%, 50% and 100% respectively. Despite the occlusion, our system maintains high de-occluded imaging quality as the degree increases; notably, the proposed algorithm images the occluded object with high performance even when it is completely obscured by the foreground. Meanwhile, to investigate the influence of object spacing, we vary the spacing at a constant occlusion degree of 100%. The experiments show that imaging performance gradually decreases as object spacing increases: a large object spacing means a far imaging distance, and long-range imaging naturally yields worse quality. In addition, the spacing should not be too small either, because the occluded target's light information captured by the camera array then decreases accordingly, and insufficient perspective light rays lead to non-smooth synthetic edges, as shown in Fig. 11 (100% occluded with 40cm spacing). To sum up, when the spacing is approximately 120cm, the imaging performance is optimal. As for the quantitative results shown in Table 1, there are two tendencies: going across a row, PSNR decreases; down a column, PSNR reduces slightly as the occlusion degree increases. As per the previous analysis, this agrees with our qualitative results in the figures.

From the above experiments and discussion, the proposed method can return a clear image of the occluded object even in challenging scenes with relatively large crowd density and serious occlusion of targets, which indicates effective de-occlusion performance.

2) Imaging Through Multi-Layer Occlusion

It is generally known that an increase in the number of occlusion layers makes occluded object imaging more difficult. Therefore, we design a set of experiments to test the system's ability on multi-layer occlusion. As shown in Fig. 12, five persons (labeled A, B, C, D, E from front to back) stand in a line facing the camera array. From the reference view, only the first man can be seen; the other people are seriously blocked by the person in front of them.

FIGURE 12. Some examples of imaging through multi-layer occlusion. Left: the reference view of the observation scene with five people in a line. Persons A, B, C, D and E are occluded by different numbers of layers. By removing different numbers of occlusion layers, all occluded objects are reconstructed with high quality.

The imaging results of removing foreground occlusion with different numbers of layers are shown from top to bottom and left to right. Person B is obscured by one occlusion layer, caused by person A; based on the proposed data-driven variable synthetic aperture imaging approach, person B is de-occluded and imaged well. As the number of foreground layers increases, person E at the end of the line is blocked by four occlusion layers formed by the four people in front. Our approach sees through these layers and produces an object image with clear facial features, as marked in the red box. In these experiments, our system removes occlusion layer by layer and returns high-resolution images of the occluded objects. In addition, although the de-occluded imaging quality of the synthetic image decreases as the number of layers increases, the imaging result behind four occlusion layers still has a PSNR above 25, which is sufficient for practical applications. The above results prove that our system not only has a strong ability to image through multiple occlusion layers, but also ensures superior imaging quality at the same time.

3) Complex Indoor Scene

In this part, we evaluate the performance of our proposed method in a complex indoor scene with unknown object spacing, multiple occlusion layers, various occlusion degrees (no occlusion, partial occlusion and serious occlusion), different imaging distances and complicated mutual occlusion.

As shown in Fig. 13, there are 12 people in the observation scene at high density. The nearest and farthest persons are no more than 3.5m apart. Some targets are seriously occluded and people stand almost next to each other. After depth-by-depth synthesis, we perspectively image every person in the scene. The right half of Fig. 13 shows the synthetic imaging results, with the region of interest enlarged in the upper right corner for more detail. Blue rectangles label objects that are not occluded in the reference view, whereas occluded objects are marked in red. As shown, three seriously occluded objects are clearly imaged at their respective depths. Moreover, all reconstructed object information is summarized in the bottom left corner. The proposed method can thus obtain omni-directional object information when monitoring occluded scenarios. Experimental results demonstrate that objects in an observation scene with mutual occlusion are imaged with distinguishable facial features; in particular, occluded objects achieve performance equivalent to that of the other objects. Therefore, even though the scene is crowded and the light information from some targets is limited, the proposed approach returns high-performance de-occluded images, which proves its effectiveness in removing foreground occlusion.

FIGURE 13. Some examples of imaging results based on the proposed method in a complex indoor scene. The observation scene contains a crowd with complicated occlusion. People under serious occlusion are marked in red and others in blue. All results are summarized in the lower left corner.

4) Real Outdoor Environment

The previous four sets of experiments were conducted in indoor environments. This paper also evaluates the algorithm's performance in real outdoor environments. Compared with indoor settings, outdoor experiments involve more interfering factors, such as changeable illumination and complex backgrounds; these uncertainties increase the difficulty of de-occluded imaging. This paper selects two typical observation scenes for the experiments: a building entrance and a pedestrian walkway with heavy foot traffic. Fig. 14 and Fig. 15 display some key frames and their imaging results.

FIGURE 14. Some examples of imaging results at a building entrance. As shown, an array of 8 cameras is constructed facing the entrance. Pedestrian density increases from top to bottom. The region of interest is boxed in a red rectangle on the left, and the occluded object is enlarged in the right column.

FIGURE 15. Some examples of imaging results on a pedestrian walkway. The observation scene contains no occlusion, serious occlusion, multi-layer occlusion and crisscross occlusion. The object of interest is marked in red. Under all of these occlusion situations, the occluded object is successfully imaged.

As an important monitoring area in many practical applications, a building entrance poses many difficulties for occluded object imaging. Fig. 14 provides some examples and their corresponding imaging results. From top to bottom, pedestrian density increases gradually. In Frame 735, two people are simultaneously occluded by the man in front; our proposed method reconstructs both of them with clear facial features. Frame 1129 is captured when the crowd is comparatively dense, and the girl blocked by the foreground occlusion is imaged well. Therefore, our system maintains acceptable performance at an outdoor entrance.

Pedestrian walkways are often crowded and complex; in particular, mutual occlusion between people hinders public security applications. As Fig. 15 shows, one person stands in the observation scene while others cause mutual occlusion of her. Different occlusion situations occur, including no occlusion (Frame 197), serious occlusion (Frame 278), multi-layer occlusion (Frame 377) and crisscross occlusion (Frame 300). Nevertheless, as the right column of Fig. 15 shows, the object of interest is successfully de-occluded and imaged.

Generally, a gap in imaging quality remains between indoor scenes and real outdoor environments, because the outdoor environment is more complex and we solve only part of the occluded object imaging problem here; the rest will be considered in future work. Nevertheless, whether at a building entrance or on a pedestrian walkway, our system returns occluded object images with acceptable performance, effectively and efficiently.

C. The Performance Comparison

1) Comparison Method

In order to further explore the imaging performance of our proposed method, we choose as contrast approaches the traditional synthetic aperture imaging, which directly averages all pixels, and imaging algorithms based on pixel-level mathematical calculation.

a: Traditional Synthetic Aperture Imaging [34]

With the improvement of camera array performance and the expansion of its application scope, synthetic aperture imaging has become one of the most commonly used methods for occluded object imaging. This approach presumes that the light rays from the foreground occlusion are in the minority among all light field rays. Using the camera array initialization parameters, it takes the mean of the light rays as the synthetic pixel value at each position. After all pixels are processed, the hidden object is reconstructed. This method does not depend on the type of foreground occlusion, so it is highly versatile.
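
A minimal sketch of this averaging step is given below. It assumes each camera view has already been warped onto the chosen synthetic focal plane (for example, via planar homographies computed from the calibration parameters), and it illustrates the general principle rather than the specific implementation of [34].

```python
# Sketch of the pixel-averaging step in traditional synthetic aperture
# imaging; assumes views are already warped to the synthetic focal plane.
import numpy as np

def synthesize_by_averaging(warped_views):
    """Each synthetic pixel is the mean of the corresponding rays from all
    cameras; minority occluder rays are blurred out rather than removed."""
    stack = np.stack([v.astype(np.float64) for v in warped_views], axis=0)
    return stack.mean(axis=0)
```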

b: Yang et al. [35]

In order to reconstruct the whole scene through occlusion, they present a synthetic aperture imaging method based on light field visibility analysis. They first segment the scene into several visibility layers and transfer visibility information between the layers through an optimization framework. Visibility and optimal focus depth estimation on each layer is formulated as a multi-label energy minimization problem that integrates the previous layers' visibility masks, multi-view intensity consistency, and a depth smoothness constraint. However, their assumption that fully visible pixels satisfy a unimodal constraint over focus depth is violated in some situations.

c: Pei et al. [29]

This method aims to achieve an all-in-focus “seeing through” image by removing the defocus blur in synthetic aperture imaging. It reformulates synthetic aperture imaging as an image matting problem and estimates regions at different depths via an energy minimization formulation. After refocusing on the background and the hidden object, a high-performance “see through” synthetic aperture image is produced by compositing the foreground and background. However, only two focus depths can be synthesized at a time, which may lose part of the scene information about other targets.
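
The compositing step can be illustrated with the standard matting equation. The sketch below assumes the alpha matte and the two refocused layers have already been estimated by their energy-minimization step, which is omitted here; it is an illustration of the compositing idea, not a reproduction of [29].

```python
# Sketch of two-layer alpha compositing, I = alpha*F + (1 - alpha)*B;
# the matte estimation of [29] itself is not reproduced here.
import numpy as np

def composite(foreground, background, alpha):
    """Blend a refocused foreground layer F with the refocused
    hidden-object/background layer B using a per-pixel alpha matte."""
    if alpha.ndim == foreground.ndim - 1:  # broadcast H x W matte over channels
        alpha = alpha[..., None]
    return alpha * foreground + (1.0 - alpha) * background
```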

2) Experiment on Self-Built Light Field Database

In this part, we compare the proposed method with traditional synthetic aperture imaging on our self-built light field database. As for the specific implementation, the traditional synthetic aperture imaging follows [34]: we implemented the code according to the described steps and completed these tests. Both the source light field data and the camera array parameters used by the two methods are identical. To make the experiment more comprehensive, we analyze system performance from both qualitative and quantitative perspectives.

a: Qualitative Comparison

We conduct comparative experiments in two scenarios: a non-crowded scene with serious occlusion (Fig. 16 and Fig. 18) and a crowded scene with complex occlusion (Fig. 17 and Fig. 19). Fig. 16 and Fig. 17 are captured indoors, whereas Fig. 18 and Fig. 19 are from real outdoor environments. The reference camera view is displayed on the left of these figures and the imaging results on the right. The three columns on the right show the details of the foreground occlusion, the result of traditional synthetic aperture imaging, and the result of the proposed method.

FIGURE 16. Qualitative comparison of our approach against traditional synthetic aperture imaging in a non-crowded indoor scene with serious occlusion. The front person (marked in blue) completely blocks the second person (marked in red) in the reference view.

FIGURE 17. Qualitative comparison of our approach against traditional synthetic aperture imaging in a crowded indoor environment with complex occlusion. The observation scene contains many people with mutual occlusion. We evaluate imaging performance under different occlusion degrees. Blue: no occlusion, Orange: partial occlusion, Red: total occlusion.

FIGURE 18. Qualitative comparison of our approach against traditional synthetic aperture imaging in a non-crowded outdoor scene with serious occlusion. The front person (marked in blue) completely blocks the second person (marked in red) in the reference view.

FIGURE 19. Qualitative comparison of our approach against traditional synthetic aperture imaging on a crowded outdoor walkway with complex occlusion. The observation scene contains many people with mutual occlusion. We evaluate imaging performance under different occlusion degrees. Blue: no occlusion, Orange: serious occlusion, Red: total occlusion.

Next, we discuss and analyze these qualitative results one by one. Firstly, Fig. 16 contains a girl, marked with a red box, who is 100% occluded by the man in front; their imaging results are provided separately on the right side. For the front person, who is not occluded, the traditional synthetic aperture imaging method achieves imaging quality comparable to ours. But through occlusion, the imaging performance of our method is obviously better when reconstructing the hidden object behind. Similarly, in Fig. 18 our result recovers clearer facial characteristics than the contrast method. We then turn to the comparison results in crowded environments. Imaging results under different occlusion degrees (no occlusion, partial occlusion, serious occlusion and total occlusion) are provided in Fig. 17 and Fig. 19. With the gradual superposition of defocused foreground light rays, the traditional synthetic aperture imaging method suffers from blurry results at every occlusion degree, whereas our method still obtains better imaging performance.

The image blurring in traditional synthetic aperture imaging results is caused by the overlap of multi-layered shadows. Analysis of the synthetic aperture imaging principle shows that these shadows come from the foreground occlusion, so as the foreground grows, the shadows accumulate. Once the light rays reflected by the foreground occluder are in the majority, the imaging quality at that position is greatly degraded. The proposed imaging method, however, labels semantic information depth by depth during scanning; on this basis, the defocused light rays of the foreground occlusion can be removed during synthesis by the data-driven variable synthetic aperture and semantic feedback, as the sketch below illustrates.
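
The ray-screening idea can be sketched as follows. This is only an illustration of the semantic feedback principle under simplifying assumptions (grayscale views already warped to the focal plane, and a binary per-view occluder mask derived from the semantic labeling map), not the full proposed pipeline.

```python
# Sketch of semantic ray screening: rays labeled as foreground occluder are
# excluded before averaging. An illustration of the feedback idea only.
import numpy as np

def masked_synthesis(warped_views, occluder_masks):
    """warped_views: (N, H, W) views warped to the focal plane.
    occluder_masks: (N, H, W) booleans, True where the semantic labeling
    map marks a ray as coming from a foreground occluder."""
    valid = (~occluder_masks).astype(np.float64)
    count = valid.sum(axis=0)
    summed = (warped_views.astype(np.float64) * valid).sum(axis=0)
    # Average surviving rays; where every ray was screened out, fall back
    # to the plain mean so the pixel is still defined.
    return np.where(count > 0, summed / np.maximum(count, 1.0),
                    warped_views.mean(axis=0))
```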

b: Quantitative Comparison

For the quantitative evaluation standard, we again employ PSNR to measure the imaging performance for each occluded object, as described in Section III A. In addition, to exclude other environmental influences, an indoor scene is selected for the quantitative comparison. In this experiment, all people stand in a line; except for the first person, each of them is completely blocked by the person in front, so the farther back a person stands, the more occlusion layers cover them. As Fig. 20 shows, statistics of the imaging results are compiled for each number of occlusion layers, giving the specific PSNR values of the two methods. The no-occlusion case has the highest imaging quality, and performance decreases gradually as the layer number increases. Remarkably, the proposed approach achieves a PSNR improvement of more than 1.15 on each occluded object, and our advantage becomes more pronounced as the number of occlusion layers grows. This is again the result of the superposition of defocused foreground images: the more foreground occlusion there is, the greater its impact on occluded object imaging, which is naturally understandable.

FIGURE 20. Quantitative comparison of our approach against traditional synthetic aperture imaging. Objects under different numbers of occlusion layers are marked in different colors, as shown in the upper half. We compute the PSNR of each marked region and plot the results in a bar graph. The imaging performance of the proposed method, shown in blue, is better than that of the contrast approach.

Both the qualitative and quantitative results indicate the superior performance of the proposed method over the traditional synthetic aperture imaging algorithm. The lower reconstruction quality of the comparative method is due to the superimposition of foreground defocus blur and the interference introduced by redundant array perspectives. The proposed method uses semantic information mining to filter out foreground light rays; meanwhile, by prioritizing perspective information, we obtain a more effective sub-array view. The imaging performance improvement of our data-driven variable synthetic aperture imaging method based on semantic feedback therefore follows naturally.

3) Experiment on UCSD Light Field Database

In this section, we compare our approach with traditional synthetic aperture imaging [34], Yang et al.'s method [35] and Pei et al.'s method [29] on the UCSD light field database. All data are collected outdoors with a high degree of crowding. Specifically, the observation scene contains five people: one of them walks back and forth within the effective observation scope, while the other four pass randomly in front of him and block him, with mutual occlusion also occurring among the four. Different occlusion degrees, various object spacings and unknown synthetic depths are all included in the database, which makes it well suited to comprehensively testing the robustness of de-occlusion algorithms.

Fig. 21 shows the comparison between our method and the other three methods. The man in yellow crosses in front of the crowd and successively occludes the other three people; Frames 210, 213 and 217 capture these three occlusion situations, as shown in the first column. From left to right are the imaging results of traditional synthetic aperture imaging [34], Yang et al.'s method [35], Pei et al.'s method [29] and the proposed method. The region of interest containing the occluded object is enlarged for detail on the left. The results for Yang et al.'s method and Pei et al.'s method are the de-occluded imaging results reported in [35] and [29], respectively. As can be seen, each method has a certain perspective effect and can reconstruct the occluded target, but the results vary greatly and each has its own characteristics. Without additional processing during synthesis, the traditional approach provides the outline of the occluded target and is robust to complex foreground occlusions with irregular edges. Yang et al.'s method [35] greatly improves the imaging of the occluded object. Pei et al. [29] also succeed in improving the sharpness of the occluded object image, with the background and the target both in focus in a single synthesis. For the proposed method, we employ a human body detector to obtain a more objective comparison. Our system successfully removes the foreground occlusion caused by the man in yellow, and the occluded object is imaged with high performance under scene semantic analysis.

FIGURE 21. Comparison results on the UCSD light field database. The original sequences are provided at left, followed from left to right by the occluded object imaging results of traditional synthetic aperture imaging [34], Yang et al. [35], Pei et al. [29] and the proposed method.

Analyzing how each method works, the contrast results can be summarized as follows. The traditional method [34] averages all pixels and ignores the occlusion's outline, size and type, so it handles complex foreground occlusion well at the expense of imaging quality. Yang et al. [35] aim to implement a depth-free all-in-focus imaging system, improving imaging performance by analyzing the focusing characteristics of the light field; but if a scene point is not visible in at least two cameras, a hole appears in the synthetic image. By estimating the occluded object and the background separately, Pei et al. [29] return a fully focused synthetic image. Unlike these methods, which rely on focusing features, the proposed method removes foreground occlusion from a semantic point of view: according to the specific situation of each pixel, we adjust the synthetic aperture as driven by the visual data. Especially when focusing features are not distinctive, our method still works well. With the development of semantic mining methods and the improvement of their performance, the proposed method has solid potential for further gains in imaging quality, perspective through multi-layer occlusion, and so on.

SECTION IV.

Conclusion

This paper proposes a novel data-driven variable synthetic aperture imaging algorithm based on semantic feedback. From the light field data collected by a multi-camera array, we first mine semantic information in the observation scene with an object detector and preserve this visual data in a semantic labeling map, through which the location information of each light ray is labeled. Then, based on the proposed feedback strategy, the method back-propagates this semantic information to the synthetic aperture imaging process. Finally, guided by the semantic labeling map, we put forward a data-driven variable synthetic aperture imaging approach: the light rays to be synthesized are selected under a variable synthetic aperture driven by two kinds of data, namely the object depth and perspective angle for a better de-occlusion effect, and the semantic meaning of each ray for better imaging quality. Different numbers of cameras in the array, and different numbers of light rays, participate in the synthesis of each pixel. In this way, an occluded object imaging system that sees through foreground occlusion is successfully implemented. Extensive experimental results under challenging occlusion situations confirm that the proposed method achieves encouraging de-occluded imaging results, and in contrast experiments against other methods on both a self-built database and a public database, our system shows superior performance.

Furthermore, the current work can be developed and extended to multiple tasks with different scene semantic mining algorithms. Meanwhile, object recognition after de-occluded imaging can also be considered as an extension of the existing work. We will investigate these worthwhile problems in future work.

ACKNOWLEDGMENT

The authors would like to thank S. Chen and J. Shu from Xidian University for data collection. Our deepest gratitude also goes to the anonymous reviewers for their suggestions, which helped improve this article.
