Introduction
Tracking the migration and population of birds provides essential scientific data for monitoring the ecosystem and identifying potential climate problems that can directly affect the human environment. So far, counting crowded birds has relied on human observation, which is costly yet yields low precision.
Meanwhile, deep learning methods have been applied to bird detection, mostly adopting RCNN [6] to detect birds in various circumstances by surrounding each bird with a bounding box [2], [8], [20], [23], [24], [28]. However, these works are limited to situations where the birds are very sparse, since the accuracy of bounding box detection methods decreases sharply as the objects become crowded.
For crowded object counting, most studies have focused on crowd counting, where the density map regression scheme is usually adopted [9], [11], [13]–[15], [22]. The density map regression scheme trains the network to regress a manually created target density map by minimizing the MSE loss between the target density map and the output density map. For crowd counting, the target density map is constructed by placing a 2D Gaussian kernel centered on each dot location, which always points to the head of a person. Since each Gaussian kernel sums to one, the global count is obtained by summing all the values in the density map. However, when constructing target density maps for birds, the circular 2D Gaussian kernel is not appropriate due to the random occlusion of bird parts (Fig. 1). Moreover, a Gaussian kernel of fixed width cannot model the various scales of the birds in images.
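The Gaussian-kernel construction described above can be sketched as follows. This is a minimal illustration, not the implementation used by the cited crowd counting models: the function names, the kernel size of 7, the width `sigma=1.5`, and the dot coordinates are all assumptions chosen for the example.

```python
import math

def gaussian_kernel(size, sigma):
    """Return a size x size 2D Gaussian kernel normalized to sum to one."""
    half = size // 2
    raw = [[math.exp(-((y - half) ** 2 + (x - half) ** 2) / (2 * sigma ** 2))
            for x in range(size)] for y in range(size)]
    total = sum(map(sum, raw))
    return [[v / total for v in row] for row in raw]

def target_density_map(height, width, dots, size=7, sigma=1.5):
    """Place one fixed-width kernel on each (row, col) dot annotation."""
    density = [[0.0] * width for _ in range(height)]
    kernel = gaussian_kernel(size, sigma)
    half = size // 2
    for cy, cx in dots:
        for ky in range(size):
            for kx in range(size):
                y, x = cy + ky - half, cx + kx - half
                # Kernel mass falling outside the image is dropped, so dots
                # near the border contribute slightly less than one.
                if 0 <= y < height and 0 <= x < width:
                    density[y][x] += kernel[ky][kx]
    return density

# Three hypothetical dot annotations, all far enough from the border.
dots = [(5, 5), (10, 12), (14, 6)]
density = target_density_map(20, 20, dots)
estimated_count = sum(map(sum, density))
print(estimated_count)  # close to the number of dots
```

Because the kernel width is the same for every dot, a bird that fills half of the image and a bird a few pixels wide receive identical targets, which is the scale limitation noted above.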
Sample images from our proposed dataset, CBD-6000. Various scenes of occluded birds with diverse scales are included.
Arteta et al. [1] were the first to apply the density map regression scheme to counting crowded penguins. Their network is denoted as PenNet for comparison in this paper. To solve the problem of using 2D Gaussian kernels for birds, they constructed a uniform target density map in two steps: first estimating the penguin regions, and then assigning to each pixel inside a connected penguin region a uniform value equal to the number of penguins inside the region divided by the number of pixels composing it (Fig. 2-(d)). This way, the ground truth global count equals the sum of all the values in the uniform target density map. PenNet is trained to regress this uniform target density map. However, the dataset proposed in [1] does not provide ground truth segmentation masks. Thus, they initially estimate the size of the penguins from the distribution of the multiple-dot annotations to roughly obtain the penguin regions, leading to inaccurate uniform target density maps. Moreover, the uniform target density map ignores local density variation, which confuses the network, since the pixel-wise MSE loss forces it to regress the diverse local densities of the birds to uniform values (Fig. 2-(e)).
(a): An input bird image. (b): Target density map constructed by placing 2D Gaussian kernels. (c): Output density map of [15], which is trained by the density map regression scheme using (b) as the target. (d): Uniform target density map. (e): Output density map of [1] when trained by the density map regression scheme using (d) as the target. (f): Density activation map generated by the same network as (e) but trained with our DAM counting scheme.
Since the target density maps are not adequate to properly model the density of birds, we propose a new counting scheme that does not require a target density map but outperforms the density map regression scheme. The contributions of our work are summarized as follows:
We propose a new counting scheme, called DAM counting, which generates a density activation map (DAM), a concept we introduce in this work. The DAM is a density map from the CNN's perspective: it has high activation values in the regions the network focuses on to count birds precisely. The network is trained to autonomously learn where and how much the DAM should be activated so that the sum of all values in the DAM estimates the number of birds. DAM counting can effectively replace the existing density map regression scheme, bringing a remarkable increase in counting accuracy;
Our DAM counting scheme enables the network to count different birds of various scales via its two inner (strong and weak) segmentation regularizers, which effectively incorporate manually created ground truth segmentation masks. Note that we are the first to apply ground truth segmentation masks of birds, whereas PenNet [1] predicts a rough segmentation mask from the distribution of the multiple-dot annotations, yielding imprecise counting;
We propose a DAM count loss that trains the network to autonomously generate DAMs such that the sum of all values in the DAM gives a precise estimate of the global count. Note that the count loss used in previous models [7], [10], [19], [25], [27] trains the network to regress the global count directly via a fully connected network, which is known to be difficult to train due to the direct mapping from 2D input images to scalar global counts;
We are the first to address the crowded bird counting problem with various scales of different bird species by proposing the first crowded bird dataset, CBD-6000, which includes 6,477 bird images for seven different species of birds where the maximum number of birds is 614 (Fig. 1).
Related Work
A. Bird Counting
Currently, in bird counting practice, traditional methods are still used to count crowded birds [5]. Since counting the birds individually in images is very time-consuming, these methods rely on techniques such as grouping and grid counting, which are performed by highly trained humans (birders). Grouping is a technique where a birder counts five or ten birds at a time. Grid counting is a technique where a birder divides the scene into grids of uniform size, counts the number of birds in one grid, and multiplies that count by the number of grids. However, these methods tend to have low precision because the birders count only roughly.
Meanwhile, deep learning methods were developed to detect birds automatically. These studies [2], [8], [18], [20], [24] mostly used variations of RCNN [6] to detect birds by surrounding each bird with a bounding box. The number of birds can be obtained by counting the generated bounding boxes. Although bounding box detection-based methods perform well in sparse situations, they are not capable of counting birds in highly crowded and occluded scenes.
B. Crowd Counting
When counting crowded objects, especially crowded people, regression-based methods are preferred since they do not need to count the objects individually. [22] conceptually divided the regression-based counting methods by their different regression targets into the global count regression scheme and the density map regression scheme.
The global count regression scheme [10], [25], [26] gradually encodes the extracted features of the input image into smaller feature sizes and uses fully connected layers at the end to output a single number. This scheme uses a global count loss that regresses the network output to the global count. However, these methods have a limitation that pooling the entire image into a single number ignores the spatial information of the objects.
While the global count regression scheme maps a 2D input image to a single number, the density map regression scheme [3], [9], [13]–[15], [21], [30] converts a 2D input image to a 2D density map that retains the spatial information of the objects. The global count is obtained by integrating the whole density map. The density map regression scheme requires a manually created target density map as the regression target. When counting crowded people in images, the target density map is created by placing a fixed-width 2D Gaussian kernel centered on each dot location, which always points to the center of the head of each person. People with completely occluded heads are always ignored. The resulting target density map correlates well with the distribution of human heads, since heads are the most visible part when seen from a higher altitude. However, the density of birds cannot be modeled with the fixed-width 2D Gaussian kernel for the following reasons. First, predicting bird centers is usually difficult since their bodies are variously shaped and deformable. Second, due to the various motions and occlusions of birds, it is hard to consistently annotate the birds on the same part of the body (see the first image in Fig. 1). Thus, the 2D Gaussian kernels will be placed on various regions of the birds, depending on the exposed parts and poses. This can confuse the network during learning, as shown in Fig. 2-(c). Finally, the shapes of the birds can vary greatly, especially between images where the birds appear at a large scale (first image in Fig. 1) and images where they appear at a small scale (third image in Fig. 1).
C. Density Map Regression for Birds
Arteta et al. [1] first applied the density map regression scheme to count crowded penguins. We denote their network as PenNet. As mentioned before, since penguins are not circular-shaped, they proposed a uniform target density map as the new regression target for the density map regression scheme (Fig. 2-(d)). The construction of the uniform target density map requires segmentation masks indicating the penguin regions. Each pixel in a connected area of the segmentation mask is assigned the same value: the number of penguins inside the connected area divided by the number of pixels composing the area. However, the dataset used to train PenNet does not provide manually created segmentation masks, so a rough segmentation mask is estimated from the distribution of the multiple-dot annotations as an alternative.
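The construction of the uniform target density map can be sketched as below. This is an illustrative reading of the description above, not the code of [1]: the function names are hypothetical, 4-connectivity is an assumption (the text does not state which connectivity is used), and the tiny mask and dots are made up for the example.

```python
from collections import deque

def connected_regions(mask):
    """Return the 4-connected foreground regions as lists of (row, col)."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            queue, pixels = deque([(y, x)]), []
            seen[y][x] = True
            while queue:  # breadth-first flood fill
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(pixels)
    return regions

def uniform_density_map(mask, dots):
    """Give every pixel of a region the value (objects in region) / (pixels)."""
    density = [[0.0] * len(mask[0]) for _ in mask]
    for pixels in connected_regions(mask):
        members = set(pixels)
        n_objects = sum(1 for dot in dots if dot in members)
        value = n_objects / len(pixels)
        for y, x in pixels:
            density[y][x] = value
    return density

mask = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
dots = [(0, 0), (1, 1), (2, 5)]  # two objects in the left region, one in the right
density = uniform_density_map(mask, dots)
print(sum(map(sum, density)))  # close to len(dots) when every dot lies in the mask
```

In this sketch a dot that falls outside the mask contributes nothing, so the map sums to the global count only when the mask covers every annotated object, which is one way a rough mask can make the target inaccurate.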
The PenNet is designed to predict the segmentation mask, compute the uniform target density map, and perform density map regression via a multi-task learning framework. The first branch outputs a segmentation mask, and the second branch outputs a density map. The segmentation mask of the first branch is regressed by the rough segmentation mask computed from multiple-annotated dots. Note that the output segmentation mask is also very rough. The uniform target density map is constructed by the output segmentation mask and used as the regression target for the second branch, which performs density map regression. Note that the regression target for density map regression changes every time the output segmentation mask changes. However, as shown in Fig. 2-(d), the uniform target density map ignores the local density variations of the birds. This confuses the network, leading to imprecise counting of the birds (Fig. 2-(e)). In contrast, our proposed DAM counting scheme does not require a target density map. Also, it incorporates manually created ground truth segmentation masks for precise information of the scale and shape of the birds.
Method
Zhou et al. [29] proposed the class activation map (CAM), which highlights the regions that a CNN relies on to discriminate between classes when classifying images. Inspired by [29], we propose a new counting scheme, called DAM counting, which produces a density activation map (DAM). The DAM is a density map from the CNN's perspective: it has high activation values in the regions the network focuses on to count birds precisely. We apply our DAM counting scheme to previous state-of-the-art counting models [1], [12], [15], which were originally trained by the density map regression scheme.
A. Overall Pipeline
Suppose there is a density map regression-based counting network whose architecture consists of an encoder and a decoder. We denote these two parts as Baseline Encoder and Decoder 3 in Fig. 3. When our DAM counting scheme is applied to this counting network, two more decoder branches (Decoder 1 and Decoder 2) with the same architecture as Decoder 3 are added. Decoder 1 outputs a segmentation mask, while Decoder 2 and Decoder 3 each output a DAM. Decoder 1 and Decoder 2 together implement strong segmentation regularization, which produces a strong DAM. Decoder 3 implements weak segmentation regularization, which produces a weak DAM. The two DAMs are concatenated with the output (low-level features) of Baseline Encoder and fed into the attention module. Inspired by [13], the attention module outputs two probabilistic attention maps, which are used to fuse the two DAMs as a weighted sum, yielding the final DAM. Finally, the global count is computed by summing all the values in the final DAM. Fig. 3 shows the pipeline of our proposed DAM counting scheme.
The pipeline of our proposed DAM counting scheme. Blue arrows indicate loss functions between the output and the regression target.
B. Segmentation Regularization
We propose two (strong and weak) segmentation regularizers, which remove the non-bird regions aggressively and moderately, respectively. The objective is to drive the values of non-bird regions in the DAM to zero so that the network focuses on bird regions only. Together with our proposed DAM count loss, the network autonomously learns to activate bird regions so that the sum of all values in the DAM approaches the ground truth global count. Moreover, the segmentation regularizers incorporate the ground truth segmentation masks, which indicate the exact size and shape of the birds. Therefore, the network is able to take into account the various scales and the diverse species of the birds.
1) Strong Segmentation Regularization
As shown in Fig. 3, the strong segmentation regularization consists of Decoder 1 and Decoder 2. Decoder 1 outputs the segmentation mask \begin{equation*} D_{strong} = s \odot D_{int}\tag{1}\end{equation*}
\begin{equation*} \mathcal {L}_{count}(D_{strong},N_{gc}) \!=\! \frac {1}{B} \sum _{b}^{B} |N_{b,gc} \!- \! \sum _{i}^{H}\sum _{j}^{W} D_{b,strong}^{(i,j)} |\tag{2}\end{equation*}
\begin{equation*} \mathcal {L}_{strong} = \mathcal {L}_{count}(D_{strong},N_{gc}) + \omega \mathcal {L}_{ce}(s,S)\tag{3}\end{equation*}
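Equations (1) to (3) can be traced with a small numeric sketch. It reads s as the segmentation mask predicted by Decoder 1, D_int as the unmasked map produced by Decoder 2, and S as the ground truth mask; this reading of the symbols, the use of a pixel-averaged binary cross-entropy for L_ce, the value of omega, and all function names are assumptions made for illustration only.

```python
import math

def masked_dam(s, d_int):
    """Eq. (1): element-wise product of the predicted mask and the raw map."""
    return [[sv * dv for sv, dv in zip(s_row, d_row)]
            for s_row, d_row in zip(s, d_int)]

def count_loss(dams, counts):
    """Eq. (2): batch mean of |ground truth count - sum of the DAM|."""
    return sum(abs(n - sum(map(sum, dam)))
               for dam, n in zip(dams, counts)) / len(dams)

def binary_cross_entropy(s, S, eps=1e-7):
    """Pixel-averaged binary cross-entropy (an assumed form of L_ce)."""
    total, n_pixels = 0.0, 0
    for s_row, S_row in zip(s, S):
        for p, t in zip(s_row, S_row):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
            n_pixels += 1
    return total / n_pixels

def strong_loss(s_batch, d_int_batch, S_batch, counts, omega=1.0):
    """Eq. (3): count loss on the strong DAM plus weighted mask loss."""
    d_strong = [masked_dam(s, d) for s, d in zip(s_batch, d_int_batch)]
    ce = sum(binary_cross_entropy(s, S)
             for s, S in zip(s_batch, S_batch)) / len(s_batch)
    return count_loss(d_strong, counts) + omega * ce

# One hypothetical 2 x 2 image containing two birds in the left column.
s = [[0.9, 0.1], [0.8, 0.2]]      # predicted bird probability per pixel
d_int = [[1.0, 5.0], [1.0, 5.0]]  # raw activations before masking
S = [[1, 0], [1, 0]]              # ground truth mask
print(strong_loss([s], [d_int], [S], [2]))
```

The example shows why this regularizer is described as aggressive: the large raw activations in the right column are scaled down by the low mask probabilities before they reach the count loss, so the strong DAM can only be as good as the predicted mask.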
In general, the strong segmentation regularization works effectively for most of the cases. However, there are two limitations. First, it is highly dependent on the predicted segmentation mask
Visual comparison. (a) Input bird image. (b) Density map generated by PenNet+. The small density map in the corner is the uniform target density map that the network should target. (c) Density activation map generated by PenNet++DAM.
Test MAE comparison for every epoch between the network trained by the density map regression scheme (black) and the network trained by our proposed DAM counting scheme (red).
Visual results of ablation study. (a) Input bird image. (b) The output DAM when using only strong segmentation regularizer. (c) The output DAM when using only weak segmentation regularizer. (d) The fused DAM via the attention module.
2) Weak Segmentation Regularization
We propose weak segmentation regularization that can alleviate the limitations of the strong segmentation regularization. Weak segmentation regularization incorporates Decoder 3 and generates a weak DAM, \begin{equation*} \mathcal {L}_{MSE}(D_{weak},S) = \frac {1}{B} \sum _{b}^{B} (1-S_{b})(D_{b,weak} - S_{b})^{2}\tag{4}\end{equation*}
We combine \begin{equation*} \mathcal {L}_{weak} = \mathcal {L}_{count}(D_{weak},N_{gc}) + \omega \mathcal {L}_{MSE}(D_{weak},S)\tag{5}\end{equation*}
This regularization is called weak segmentation regularization because the non-bird regions are not aggressively removed like the strong segmentation regularization. Instead, it moderately suppresses the non-bird regions via the modulated MSE loss, which lowers the possibility of missing bird regions in the DAMs. Thus, due to the fusion of the weak DAM and the strong DAM via the attention module, the weak DAM supplements the missed bird regions in the strong DAM, as shown in the first row of Fig. 6. Also, for very small-scaled birds, the moderate suppression of the weak segmentation regularization allows the network to activate some of the non-bird regions near the birds (Fig. 6-(c)). This feature is fused with the strong DAM and appears as weak activation clouds surrounding the strong activations in the final DAM, as shown in Fig. 6-(d). Thus, the weak DAM alleviates the second limitation of the strong segmentation regularization by allowing the network to generate more spread activations rather than peaky activations.
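The modulated MSE loss of Eq. (4) and the combined loss of Eq. (5) can be sketched as below. Eq. (4) as written does not show how the per-pixel terms are reduced, so averaging over pixels is an assumption here, as are the function names, the value of omega, and the toy inputs.

```python
def count_loss(dams, counts):
    """Eq. (2): batch mean of |ground truth count - sum of the DAM|."""
    return sum(abs(n - sum(map(sum, dam)))
               for dam, n in zip(dams, counts)) / len(dams)

def modulated_mse(d_weak_batch, S_batch):
    """Eq. (4): squared error weighted by (1 - S), averaged over pixels."""
    per_image = []
    for dam, mask in zip(d_weak_batch, S_batch):
        terms = [(1 - t) * (v - t) ** 2
                 for d_row, S_row in zip(dam, mask)
                 for v, t in zip(d_row, S_row)]
        per_image.append(sum(terms) / len(terms))
    return sum(per_image) / len(per_image)

def weak_loss(d_weak_batch, S_batch, counts, omega=1.0):
    """Eq. (5): count loss on the weak DAM plus weighted modulated MSE."""
    return (count_loss(d_weak_batch, counts)
            + omega * modulated_mse(d_weak_batch, S_batch))

S = [[1, 0], [0, 0]]               # one bird pixel, three background pixels
d_weak = [[0.7, 0.2], [0.0, 0.4]]  # hypothetical weak DAM
print(modulated_mse([d_weak], [S]))
print(weak_loss([d_weak], [S], [1], omega=2.0))
```

With a binary mask, the weight (1 - S) is zero on bird pixels, so the loss leaves activations inside bird regions unconstrained and only pushes background activations toward zero with a quadratic penalty. Small background activations next to a bird therefore cost little, which matches the moderate suppression described above.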
C. Attention Module
We finally propose an attention module that fuses the weak DAM with the strong DAM to alleviate the limitations of the strong DAM. The two resulting DAMs, \begin{equation*} D_{final} = K_{1} \odot D_{strong} + K_{2} \odot D_{weak}\tag{6}\end{equation*}
\begin{equation*} \mathcal {L}_{total} = \mathcal {L}_{strong} + \mathcal {L}_{weak} + \omega \mathcal {L}_{count}(D_{final},N_{gc})\tag{7}\end{equation*}
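The fusion in Eq. (6) and the total loss in Eq. (7) can be illustrated with the following sketch. The text only states that the attention maps are probabilistic, so producing K_1 and K_2 with a per-pixel softmax over two logit maps is an assumption, as are the function names and the example values.

```python
import math

def attention_maps(logits_1, logits_2):
    """Per-pixel softmax over two logit maps, so that K1 + K2 = 1 everywhere."""
    k1, k2 = [], []
    for row_1, row_2 in zip(logits_1, logits_2):
        out_1, out_2 = [], []
        for a, b in zip(row_1, row_2):
            m = max(a, b)  # subtract the max for numerical stability
            ea, eb = math.exp(a - m), math.exp(b - m)
            out_1.append(ea / (ea + eb))
            out_2.append(eb / (ea + eb))
        k1.append(out_1)
        k2.append(out_2)
    return k1, k2

def fuse(k1, d_strong, k2, d_weak):
    """Eq. (6): D_final = K1 * D_strong + K2 * D_weak, element-wise."""
    return [[a * s + b * w for a, s, b, w in zip(r1, rs, r2, rw)]
            for r1, rs, r2, rw in zip(k1, d_strong, k2, d_weak)]

def global_count(dam):
    """The estimated count is the sum of all values in the DAM."""
    return sum(map(sum, dam))

def total_loss(l_strong, l_weak, l_count_final, omega=1.0):
    """Eq. (7): sum of both regularized losses and the weighted final count loss."""
    return l_strong + l_weak + omega * l_count_final

d_strong = [[2.0, 0.0]]  # peaky activation, misses the second pixel
d_weak = [[1.0, 1.0]]    # more spread activation
k1, k2 = attention_maps([[0.0, 0.0]], [[0.0, 0.0]])  # equal logits give 0.5 / 0.5
d_final = fuse(k1, d_strong, k2, d_weak)
print(d_final, global_count(d_final))
```

In this toy case the second pixel, which the strong DAM leaves at zero, receives activation from the weak DAM after fusion, while the total count is preserved.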
Experiments
A. Crowded Bird Dataset (CBD-6000)
We propose the first crowded bird dataset, called CBD-6000, which contains 6,477 photos of 7 bird species collected via Google search. It also includes single-dot annotations of the birds and manually created segmentation masks for the bird regions in each image. The dot annotations are placed on the center of the visible part of each bird, so their location on the body varies with the random occlusions of the birds. The number of birds ranges from 1 to 614 per image. Of the 6,477 images, 6,067 are used for training and the remaining 410 are used for testing. Specific details of CBD-6000 are shown in Table 1. Note that the dot annotations are not used in our DAM counting scheme; they are provided for constructing the target density maps required by the existing density map regression scheme.
B. Implementation Details
Training images are all resized to
The original PenNet [1] is modified to use the ground truth segmentation masks when constructing the uniform target density maps, instead of rough estimates of them. Moreover, the original PenNet relies on an older segmentation network architecture, FCN8s [16]. For a stricter comparison, we replace FCN8s with a more recent SOTA segmentation architecture, DeepLabv3+ [4]. The modified PenNet is denoted as PenNet+.
The crowd counting models, CSRNet [12] and CAN [15], place a fixed-width 2D Gaussian kernel on the locations of the dot annotations to construct a target density map. However, since bird shapes are generally not circular, such a fixed-width 2D Gaussian kernel is not appropriate, and these models would be expected to perform better with the uniform target density map as the regression target. To compare our DAM counting scheme against the strongest form of the density map regression scheme, we additionally implement the two models, CSRNet [12] and CAN [15], with the uniform target density map as the regression target, and denote them as CSRNet+ and CAN+, respectively.
To assess the effectiveness of our proposed DAM counting scheme, we incorporate it into the three previous state-of-the-art counting models, PenNet, CSRNet, and CAN, and denote the resulting models as PenNet++DAM, CSRNet+DAM, and CAN+DAM, respectively.
C. Results Comparison
We use the mean absolute error (MAE) as our evaluation metric, which represents the counting accuracy of the model and is given by \begin{equation*} MAE = \frac {1}{M} \sum _{m}^{M} |N_{m,gt} - N_{m,est}|\tag{8}\end{equation*}
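As a worked instance of Eq. (8), the sketch below computes the MAE over a few test images. The function name and the counts are hypothetical and serve only to illustrate the arithmetic.

```python
def mean_absolute_error(gt_counts, est_counts):
    """Eq. (8): average absolute difference between true and estimated counts."""
    if len(gt_counts) != len(est_counts) or not gt_counts:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(g - e) for g, e in zip(gt_counts, est_counts)) / len(gt_counts)

# Hypothetical ground truth and estimated counts for M = 3 test images.
gt_counts = [12, 140, 3]
est_counts = [10.5, 151.0, 3.0]
mae = mean_absolute_error(gt_counts, est_counts)
print(mae)  # (1.5 + 11.0 + 0.0) / 3
```

Because the errors are not normalized by the true count, images with many birds tend to dominate the MAE, which is one reason to report it separately for each range of bird counts.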
Table 2 shows the experimental results for each range of the number of birds. For all three networks (CSRNet, CAN, PenNet), the accuracy increases significantly when the density map regression scheme is replaced with our proposed DAM counting scheme. Notably, we see a substantial increase in the counting accuracy for the range of 101~614. In this range, the birds are very small-scaled and might therefore be modeled reasonably well with a 2D Gaussian kernel. Nevertheless, the results show that our method outperforms all the other methods in this range as well. Moreover, compared to a pre-trained YOLOv3 detector [17] on our crowded bird dataset, our method also shows better performance even when the birds are sparse and large-scaled (1~10).
Since manual segmentation masks are expensive to obtain, we performed an experiment with masks produced by a pre-trained segmentation network (DeepLabv3 [4]). For birds that DeepLabv3 could not detect, we used a fixed-size circle with a 5-pixel radius. The resulting accuracies with these segmentation masks are shown in Table 3, and they are comparable with those obtained with the manual segmentation masks.
We also tested our method on the penguin dataset [1], as shown in Table 4. Even though very rough segmentation masks are used, obtained by simply drawing circles of 20-pixel radius around the dot annotations, the counting accuracy of our method is higher than that of PenNet [1], which implies that our method is robust to cheaply obtained segmentation masks.
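The cheap masks described above, circles drawn around dot annotations, can be generated with a sketch like the following. The function name is hypothetical, and the tiny grid and radius in the example are chosen for readability; the text uses a 20-pixel radius for the penguin dataset and a 5-pixel radius for birds missed by the segmentation network.

```python
def circle_mask(height, width, dots, radius):
    """Binary mask with a filled circle of the given radius around each dot."""
    mask = [[0] * width for _ in range(height)]
    r2 = radius * radius
    for cy, cx in dots:
        # Only scan the bounding box of the circle, clipped to the image.
        for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
            for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
                if (y - cy) ** 2 + (x - cx) ** 2 <= r2:
                    mask[y][x] = 1
    return mask

# One hypothetical dot annotation in the middle of a 9 x 9 image.
mask = circle_mask(9, 9, [(4, 4)], radius=2)
for row in mask:
    print("".join("#" if v else "." for v in row))
```

Overlapping circles simply merge into one region, so in crowded scenes such a mask marks where birds are without separating individuals.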
Fig. 4 compares the density maps generated by PenNet+ [1] with the density activation maps generated by PenNet++DAM. The visual results show that PenNet++DAM autonomously learns where to activate for precise bird counting. For the large-scaled birds in the second image of Fig. 4-(a), it is interesting to see that the DAM is more highly activated on the heads of the birds than on their bodies. This resembles how a human would count: for this image, people tend to count the heads of the birds. Surprisingly, the network learns this autonomously, without any additional information related to the head regions. On the other hand, even though PenNet+ is trained with a uniform target density map (the right corner image of Fig. 4-(b)), it produces a density map with large variations over the entire bird region. This implies that the network is confused by the uniform target density map, which ignores local density variations (Fig. 2-(d)). Similarly, the third row of Fig. 4 shows that our DAM counting scheme can autonomously learn to activate the parts of the birds that are necessary for precise counting in diverse scenes. For the small-scaled birds in the sixth and seventh images of Fig. 4-(a), which are too small for the head and body regions to be distinguished, PenNet++DAM additionally activates some of the non-bird regions surrounding the birds, like a cloud. This feature is fused in from the weak DAM by the attention module to compensate for the limitations of the strong DAM, and is discussed in the following ablation study. Overall, the autonomous activation of our DAM counting scheme remarkably increases the counting accuracy compared to the density map regression scheme, which directly regresses the uniform target density map.
D. Ablation Study
In this section, we perform ablation studies to show the limitations of each regularizer and the necessity of the attention module. As shown in Table 5, the segmentation regularization in our DAM counting scheme is essential, and the fusion via the attention module significantly increases the accuracy for all ranges of the number of birds. The strong segmentation regularizer generally performs better than the weak segmentation regularizer. However, the strong segmentation regularizer shows poor accuracy in the 51~100 range. The images in this range often contain extreme occlusions, which make the bird regions difficult to detect correctly. Because the strong regularizer depends heavily on the correctness of the segmentation mask, it can undercount when the mask misses a bird region. The fusion of the strong DAM and the weak DAM via the attention module supplements the missed regions of the strong DAM, as shown in the first row of Fig. 6, which leads to more precise counting. The 101~614 range also experiences a significant increase in counting accuracy due to the attention module. As shown in the second row of Fig. 6, the activation in the non-bird regions of the weak DAM is fused with the strong DAM and appears as weak activation clouds surrounding the strong activations in the final DAM. By generating more spread activations around the peaky activations of the strong DAM, the network is able to count small-scaled birds precisely.
Conclusion
In this paper, we propose a new counting scheme, called DAM counting, which generates a density activation map (DAM) to count crowded birds of various scales. Most current counting models are trained via the density map regression scheme; however, the regression target density maps are not appropriate for modeling the density of birds. Our proposed DAM counting scheme does not require target density maps; instead, it allows the network to learn effective activation regions from a CNN perspective for bird counting. To handle crowded birds of various scales, strong and weak segmentation regularizers are proposed in conjunction with an attention module. The experimental results show that our proposed DAM counting scheme significantly boosts the counting performance of existing SOTA counting models. Our proposed DAM counting scheme is expected to be applicable to counting other animals besides birds. Challenging problems that remain in bird counting include ambiguity between background and birds, unseen patterns of bird occlusion, out-of-focus birds, and birds that appear only partially at image boundaries. Our future work will address these problems and apply our method to other animals.