Introduction
Adequate crop management is the key to reaching high and sustainable food production in a framework characterized by increasing water scarcity and reduced availability of agricultural land [1]. In this context, hyperspectral (HS) remote sensing has widely demonstrated its potential in supporting precision agriculture tasks [2]. However, despite the success cases documented in research environments, its exploitation in the end-user community is often limited to the provision of very basic indicators of crop health status.
Indeed, HS data can provide much more information about crops, including dry mass (DM) and nitrogen (N) content [3]. These parameters are usually retrieved by means of machine learning techniques requiring adequate training samples for model calibration. This involves collecting the biophysical variable of interest through field sampling and subsequent laboratory analysis, a task that is time-consuming and expensive, sometimes requiring the destruction of plants [3]. Therefore, it is not viable for many farmers, especially if the number of samples to be collected for model calibration is high, as reported by most of the literature.
As stated by Huuskonen and Oksanen [4], sampling zones should be large enough to make the costs of soil sampling lower than the benefits introduced by the adoption of precision agriculture methodologies. This is possible when crop fields are expected to be homogeneous or when local information, such as that concerning the effects of a particular fertilization treatment [3], is not needed. In this case, satellites offer a reliable data source to infer the crop status at a large scale, provided that the available spectral resolution is suitable for the scope [5]. However, several studies have indicated that a more detailed description of an area's variability is beneficial for precision agriculture applications [6]. Therefore, precision agriculture is often implemented by means of cameras mounted on drones due to the high spatial and spectral resolution this sensing configuration can ensure [7], [8], [9]. This means that, in most cases, the local scale is considered attractive by stakeholders and that the figure of one sample per hectare indicated in some literature [10] could be insufficient, or insignificant for the problem at hand, if samples are randomly selected.
In this context, active learning techniques allow the data-driven identification of the most representative samples for machine learning model calibration, exploiting criteria based on data diversity or uncertainty [11]. Most of these techniques require the number of samples to be selected as a processing parameter. Therefore, it is possible to set it as low as possible, consistently with the amount of field work the end-user is willing to perform. In this regard, assuming that the farmer aims to maximize the model accuracy with the least possible fieldwork, two questions arise from the standpoint of the data analyst: 1) how much can the calibration data be reduced without significant degradation of the model prediction capability? 2) is it possible to design strategies to mitigate the effects of the reduction of model calibration data?
The objective of this article is to provide an answer to both questions, which are in any case interrelated, because the reduction of field sampling is justified only if the estimation performance is comparable with that obtained using an extended calibration set, or not significantly penalized. Indeed, to the best of our knowledge, the literature has underestimated the problem of calibration data cardinality, as most of the reviewed works aimed to achieve the best performance through the development of new techniques or the fine-tuning of existing ones. Therefore, this work aims to demonstrate that combining active learning techniques for sample selection with ensemble learning can allow for a drastic reduction in field sampling activities with a negligible loss in model prediction performance.
The rest of this article is organized as follows. The overall methodology is introduced in Section II, together with the available data. Experimental results are presented and discussed in Sections III and IV, respectively. Conclusions are drawn in Section V.
Methodology
The block diagram of the proposed methodology is shown in Fig. 1. It represents an incremental monitoring framework starting with the first flight over the study area. These data are used to extract processing variables that, as suggested in [3], [7], and [12], are represented by calibrated reflectance bands and/or a collection of vegetation indices. Indeed, the indications provided in the literature about the set of variables to be used in this kind of application are quite variegated, as they may depend on the scene, the type of crop, and the parameters to be estimated. Therefore, we chose to oversize the set of regression variables by calculating more than 100 HS indices to be subsequently screened.
Block diagram of the proposed methodology. Blocks with rounded corners identify processing activities. The adopted acronyms are as follows: AL—active learning, VIP—variable importance in projection, PLSR—partial least square regression, GB—gradient boosting.
According to Fig. 1, the obtained datacube, composed of both reflectance values and spectral indices, is exploited for the active sample selection, with the objective to identify the most informative areas to be investigated for the retrieval of machine learning calibration data. In particular, this step is performed by exploiting an ensemble of active sampling techniques, as detailed in Section II-A.
Following the active learning phase, the reduced datacube is further processed to screen the input variables used to train the regression algorithms for parameter estimation. According to the incremental paradigm, the regression implemented for the ith acquisition is calibrated using also the samples collected to calibrate previous ones. In this regard, as better explained in Section II-B, three prediction configurations are exploited. The first two are obtained from a single predictor; the third configuration is represented by an ensemble obtained from their average.
A. Active Learning
As mentioned above, active learning is implemented to select the most informative samples for machine learning model calibration. The purpose is to extend the forecasting capability of an existing model through the addition of a few samples extracted from newly available data, allowing new predictions with limited field work [11].
As discussed in [11], active learning techniques are based on uncertainty [13] or diversity criteria [14]. In the first case, samples are ranked by means of their uncertainty: the higher the uncertainty, the better their rank. Among these methods, those based on a variance-based pool of regressors (PAL) are probably the most interesting [15]. They start from the generation of n random subsets of the available training set. Each subset is used to train a regressor delivering a prediction for the samples of the test set. This way, the samples in the test set are coupled with n predictions, each one having a certain variance against the original training set. The higher the variance, the higher the uncertainty associated with a specific sample [11]. This methodology is thoroughly discussed in [16] and adopted in [3].
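The variance-based pool of regressors can be sketched as follows. This is a minimal illustration of the idea described above, not the exact implementation of [15] and [16]; the choice of linear base regressors, the subset fraction, and the function name are assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def pal_uncertainty(X_train, y_train, X_pool, n_models=10, frac=0.7, seed=0):
    """Rank candidate samples by the disagreement (variance) of a pool of
    regressors, each trained on a random subset of the labeled data."""
    rng = np.random.default_rng(seed)
    m = X_train.shape[0]
    preds = np.empty((n_models, X_pool.shape[0]))
    for i in range(n_models):
        # train each regressor on a random subset of the training set
        idx = rng.choice(m, size=max(2, int(frac * m)), replace=False)
        model = LinearRegression().fit(X_train[idx], y_train[idx])
        preds[i] = model.predict(X_pool)
    # higher prediction variance -> higher uncertainty -> better rank
    return np.argsort(preds.var(axis=0))[::-1]
```

The returned indices order the unlabeled pool from most to least uncertain, so the top entries are the candidates for field sampling.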
Active learning techniques based on diversity criteria select samples based on the dissimilarities they introduce in the training dataset [11]. Several metrics can be used to assess such dissimilarity, such as the Euclidean distance [17] or the cosine angle distance [18].
In this work, two different active learning techniques have been exploited. Both are literature techniques relying on the sample diversity criterion, although they use different metrics, namely the Euclidean distance and Shannon's entropy [19].
The first examined technique is greedy sampling [20]. It exploits the Euclidean distance as a discriminator. The algorithm assumes a pool of N samples, initially unlabeled, from which K are to be selected for labeling. The first one is the sample closest to the centroid of the pool. The remaining K-1 samples are selected incrementally as follows.
Without loss of generality, assume that the first k samples have already been chosen. For each of the remaining N-k unlabeled samples, the algorithm first computes its distance to each of the k labeled ones
\begin{equation*}
d_{nm}^x = \left\| {{{x}_n} - {{x}_m}} \right\|,\quad m = 1, \ldots, k;\ n = k + 1, \ldots, N. \tag{1}
\end{equation*}
Then, the shortest distance from the set of labeled samples is computed as
\begin{equation*}
d_n^x = \mathop {\min }\limits_m d_{nm}^x,\quad n = k + 1, \ldots, N. \tag{2}
\end{equation*}
Finally, the sample with the maximum $d_n^x$ is selected for labeling.
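The greedy sampling procedure of (1) and (2) can be sketched as follows; this is a minimal illustration of the scheme described above, and the function name is ours:

```python
import numpy as np

def greedy_sampling(X, k):
    """Select k samples from pool X (N x d): the first is the one closest to
    the pool centroid; each next one maximizes its shortest distance (2)
    to the already selected samples."""
    selected = [int(np.argmin(np.linalg.norm(X - X.mean(axis=0), axis=1)))]
    while len(selected) < k:
        # d_n = min over selected m of ||x_n - x_m||, per (1)-(2)
        d = np.linalg.norm(X[:, None, :] - X[selected][None, :, :], axis=2).min(axis=1)
        d[selected] = -1.0  # never reselect a labeled sample
        selected.append(int(np.argmax(d)))
    return selected
```

The max-min rule spreads the selected samples across the feature space, which is the diversity criterion the technique relies on.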
The second active learning technique exploited for sample selection has been introduced in [12]. It exploits Shannon's entropy as a diversity measure [19]
\begin{equation*}
H = - \mathop \sum \limits_{n = 1}^N {{P}_n}{{\log }_2}{{P}_n} \tag{3}
\end{equation*}
where ${{P}_n}$ is the probability associated with the nth histogram bin.
According to Amitrano et al. [12], [21], this principle can be exploited for active sample selection as follows. After proper segmentation of the study area:
1) consider each image segment (i.e., a plot) and calculate the spatial average of each regression variable. Referring to Fig. 2, the image segment is represented, as an example, by the orange region, while the maps of the available regression variables are rendered by the blue rectangles;
2) calculate the histogram of the regression variables starting from a randomly selected plot. Assuming that it is the orange region in Fig. 2, the histogram is calculated along the third dimension of the datacube formed by the regression variables;
3) calculate the entropy ${{H}_1}$ of the histogram according to (3);
4) add another plot to the dataset by appending the (averaged) values of its regressors to the vector considered in step 2);
5) calculate the entropy ${{H}_2}$ of the new vector (see the right part of Fig. 2). If ${{H}_2} > {{H}_1}$, mark the region as informative; otherwise, discard it;
6) continue the processing by adding new regions until all the available plots have been considered. At each iteration, compare the entropy against the last available value.
According to the incremental paradigm, when new acquisitions become available, the processing is repeated starting from the first image of the series. This means that the histograms used for entropy calculations retain all the information already available. The rationale behind the technique is that if the entropy increases, the information content of the histogram increases as well. As a consequence, the plot causing such an increase should be considered for sampling.
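The steps above can be sketched as follows, assuming the plots have already been segmented and spatially averaged; the function names and the histogram bin count are illustrative assumptions, not the settings of [12]:

```python
import numpy as np

def shannon_entropy(values, bins=16):
    """Entropy (3) of the histogram of a 1-D feature vector."""
    counts, _ = np.histogram(values, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def entropy_selection(plot_means, start=0, bins=16):
    """plot_means: (n_plots, n_vars) array of spatially averaged regression
    variables, one row per plot. A plot is marked informative when appending
    its averaged regressors increases the cumulative histogram entropy."""
    pool = list(plot_means[start])                    # step 2): histogram seed
    selected = [start]
    h_last = shannon_entropy(np.asarray(pool), bins)  # step 3): H1
    for i in range(len(plot_means)):
        if i == start:
            continue
        candidate = pool + list(plot_means[i])        # step 4): append plot i
        h_new = shannon_entropy(np.asarray(candidate), bins)
        if h_new > h_last:                            # step 5): H2 > H1
            pool, h_last = candidate, h_new
            selected.append(i)                        # informative plot
    return selected
```

A plot whose averaged regressors replicate values already in the pool leaves the histogram (and its entropy) unchanged and is discarded; a plot introducing new components flattens the histogram and raises the entropy, so it is retained.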
B. Feature Selection
As reported in Fig. 1, following the active learning phase, data are screened to remove the least informative variables for the problem at hand.
In this regard, the literature has highlighted that the large amount of data usually ingested into regression algorithms can contain irrelevant information that can reduce the estimation performance [22]. This possibility is tackled by calculating the variable importance in projection (VIP) score [23] according to the following procedure.
First, a tentative regression is calculated using the partial least squares regression (PLSR) algorithm [24]. In this phase, the purpose is only the calculation of the VIP score, which is a measure of the importance of a specific variable in the regression. The VIP score is defined, for each variable j, as the sum, over the latent variables f, of its PLS-weight values
\begin{equation*}
\text{VI}{{\mathrm{P}}_j} = \sqrt {\frac{{J \cdot \mathop \sum \nolimits_{f = 1}^F w_{jf}^2 \cdot SS{{Y}_f}}}{{F \cdot \mathop \sum \nolimits_{f = 1}^F SS{{Y}_f}}}} \ \tag{4}
\end{equation*}
where $J$ is the number of input variables, $F$ the number of latent variables, $w_{jf}$ the PLS weight of variable $j$ for latent variable $f$, and
\begin{equation*}
SS{{Y}_f} = {\bm{b}}_f^2\,{\bm{t}}_f^T{{{\bm{t}}}_f} \tag{5}
\end{equation*}
where ${\bm{b}}_f$ and ${{\bm{t}}}_f$ are the regression coefficient and the score vector of the fth latent variable, respectively.
C. Regression
The calibration set retrieved via active learning is used to train two of the most exploited machine learning algorithms for vegetation traits estimation, i.e., a PLSR [24] and a gradient boosting (GB) regression tree [26]. The calibration procedure is quite simple. The areas indicated as the most informative according to active learning criteria are sampled to retrieve ground truth data. Then, the model is fit according to the classic regression schema summarized as follows.
PLSR is a consolidated linear statistical approach suitable for the analysis of multicollinear spectral datasets, thus fully exploiting redundant information [22]. It represents the literature standard for crop biophysical parameter estimation from HS data [3], [7], [12]. In this phase, the optimum number of components is determined through a leave-one-out cross validation [27]. It consists of building models leaving out the calibration samples one by one: if 30 samples are available, each model is built using 29 of them. The sample left out changes at each iteration and is used for model evaluation via root-mean-square error (RMSE) calculation. The process is repeated running regressions with different numbers of components, and the cumulative RMSE is calculated for each try. Finally, the optimal number of components is selected as the one corresponding to the lowest cumulative RMSE.
GB, the second regression algorithm used for prediction, is a machine learning technique suitable for regression and classification tasks. GB builds a prediction model from an ensemble of weak estimators, typically decision trees, that minimize an arbitrary differentiable loss function [26]. GB has been widely exploited in the literature for vegetation parameter estimation [21], [28]. In this work, a least-squares boosting has been implemented using as input the same variables selected for PLSR through VIP scoring. Given n variables, the maximum number of decision splits is n-1, the number of predictors randomly selected for each split is one third of the total, and the number of learning cycles is set to 100.
The two techniques are also jointly exploited within an ensemble. In this case, the final result is given by the average of the two predictions.
D. Data
Data used for this study have been acquired in two different campaigns. The first dataset was collected by the University of Wageningen using the WageningenUR HS Mapping System, which is able to acquire data in 101 spectral bands in the range [450, 950] nm with a 5 nm interval [29]. The test site is a crop field cultivated with ryegrass. It is divided into 60 rectangular plots measuring 1.5 × 8 m. A total of 15 different fertilization treatments have been applied to groups of 4 plots. The dataset is composed of five acquisitions, each one supplied with field data concerning DM and N content. Ground sampling was performed through destructive methods and laboratory analysis on 15-05-2014 (Day 1), 14-10-2014 (Day 2), 09-05-2017 (Day 3), 29-08-2017 (Day 4), and 26-10-2017 (Day 5), concurrently with drone acquisitions [3]. The DM and N content statistics for this dataset are reported in Table I. Their overall averages and standard deviations are 2.00 ± 1.20 t/ha for DM and 43.4 ± 19.8 kg/ha for N content.
The second dataset has been acquired by the Italian National Research Council (CNR). Two field campaigns (2–7 July and 31 July–1 August 2018) were implemented to measure dry biomass and leaf nitrogen content. Leaf biochemical variables were measured in the laboratory from samples collected on the last fully developed leaf of three plants. This operation was repeated for 31 maize plots. These data have been collected concurrently with the HS images acquired by the HyPlant-DUAL instrument on 7 and 30 July. The sensor is able to acquire information in the spectral range [450, 2300] nm in a total of 157 spectral bands with a 10 nm interval [30]. The statistics for this dataset are reported in Table II.
Experimental Results
Experiments have been implemented considering an operational scenario. The problem is to estimate the DM and the N content of crops with no prior knowledge of the area. To this end, the workflow to be followed is as follows. At first, the area is surveyed using aerial remote sensing. Then, active sampling is implemented. Finally, the plots indicated by such an algorithm are sampled to retrieve calibration data. Clearly, both active sampling and calibration shall be completed in a short time after the survey in order to avoid data leakage issues.
Predictions of vegetation DM and N content have been implemented using three different sampling configurations exploiting 1) only entropy-based active learning, 2) only greedy sampling, and 3) a combination of the two techniques. For each configuration, two settings for the number of calibration samples s to be measured have been tested, i.e., s = 10 and s = 6. The third configuration is represented by the joint exploitation of the two active learning techniques, each bringing s/2 samples to the calibration set. It is worthwhile to remark that these values have been chosen empirically with the purpose of drastically reducing the amount of calibration data compared to the literature in which the processed datasets were first exploited for vegetation parameter extraction. This means that the number of calibration samples is not optimized for the specific dataset.
As discussed in [12], the incremental monitoring framework includes in the ith regression the samples collected to implement previous ones. All the experiments have been implemented on a standard 8-core machine with 32 GB of RAM. Codes have been written in the IDL and MATLAB languages for active learning and regression, respectively.
The performance of the different processing configurations, i.e., the agreement between machine learning predictions and laboratory data, has been assessed in terms of the RMSE. For the Wageningen dataset, the results are summarized in Tables III and IV for DM and N content estimation, respectively. As for the CNR dataset, experimental results are reported in Table V for both DM and N content estimation. According to Moran [31], an accuracy in the order of 70%–75% in assessing most crop or soil conditions is considered by stakeholders sufficient for crop management to improve farm profitability.
In Fig. 3, the obtained results are reported in the plane (R2, RMSE). In particular, Fig. 3(a) reports the DM experiment results, while Fig. 3(b) concerns the N ones. In these pictures, each setting is represented by a different marker. Black and green markers refer to experiments with setting s = 10 and s = 6, respectively. These data represent graphically those provided in Tables III and IV. Moreover, data relevant to two more settings, i.e., s = 8 and s = 12, have also been plotted to support the discussion developed in Section III-A. They correspond to magenta and red markers, respectively. Overall averages are depicted by solid dots.
Experimental results represented the plane (R2, RMSE). (a) DM experiment. (b) N experiment. Each setting is represented by a different marker. Red markers: s = 12, black markers: s = 10, magenta markers: s = 8, and green markers: s = 6. Overall averages indicated by solid markers.
A. Wageningen Dataset
Setting s = 10, the lowest average RMSE for the DM experiment has been obtained using ensemble regression calibrated with ensemble active learning. Using s = 6, the best performance has been obtained by PLSR calibrated with greedy sampling and ensemble regression calibrated with entropy-based active learning. In particular, RMSE values for the settings s = 10 and s = 6 are equal to 0.17 t/ha and 0.21 t/ha, corresponding to 8.5% and 10.5% of the total average DM, respectively.
As for N estimation, for both the settings s = 10 and s = 6, the best performance was provided by ensemble regression calibrated with greedy sampling. In particular, the obtained RMSE values were equal to 3.22 kg/ha and 3.93 kg/ha, corresponding to 7.41% and 9.1% of the total average N content.
B. CNR Dataset
DM estimation provided the lowest RMSEs using PLSR regression calibrated by ensemble active learning for both the settings s = 6 and s = 10. Their values, equal to 0.65 t/ha and 0.67 t/ha, correspond to 12.1% and 12.4% of the average DM, respectively.
As for the N estimation, the best result using s = 10 (RMSE = 11.7 kg/ha) is given by PLSR calibrated with greedy sampling and by ensemble regression calibrated with ensemble active learning. Using s = 6, the best performance (RMSE = 11.1 kg/ha) is provided by GB regression calibrated with entropy-based active learning. Compared with the average N content, these values correspond to about 30.2% for the case s = 10 and to about 28.7% for the case s = 6, respectively.
Discussion
The principal novelty introduced in this article is the exploitation of ensembles of active learning and regression techniques for vegetation parameter estimation using a calibration set significantly reduced compared to previous literature studies. In those studies, the main focus was the optimization of estimation methodologies, mostly based on regression, through active learning [3] and/or the ad hoc selection of regression variables [7], while the problem of the amount of calibration data feeding such regressions has been underestimated. The major finding presented here to the community is that reliable vegetation parameter estimates can be obtained by reducing the calibration set by up to 80%, provided that opportune active sampling and regression strategies are adopted. In particular, it has been shown that, despite the significant reduction of the calibration set, the proposed methodology provides results fully comparable with, and sometimes slightly better than, those reported in the literature.
For the Wageningen dataset, reference results have been produced in [3] and [12]. In particular, Franceschini et al. [3] exploited uncertainty-based active learning, obtaining an RMSE of 0.32 t/ha using about 30 calibration samples per acquisition. Amitrano et al. [12] selected roughly the same number of samples using entropy-based active learning, reporting an RMSE of 0.20 t/ha. These values are fully comparable with those provided by the proposed methodology. In particular, using the setting s = 10, the average performance improvement is 4.5%. Using the setting s = 6, the performance worsening introduced is, on average, 3.5%.
In the case of the N experiment, considering the literature RMSEs of 6.50 kg/ha [3] and 4.68 kg/ha [12], the proposed methodology introduces a performance improvement of 5.41% for the setting s = 10 and of 3.77% for the setting s = 6.
As a general comment, the obtained results are quite flat regardless of the algorithm setting used, especially concerning the DM experiment. Slight differences arise in the N experiment, for which the RMSE varies between 7.41% and 11.3% and between 9.1% and 20.1% of the total average N content for the settings s = 10 and s = 6, respectively.
For this dataset, ensemble regression was the best performing in three experiments out of four. Interestingly, the first acquisition was always the one characterized by the highest RMSEs. This means that the exploitation of incremental sampling was beneficial as the increase of calibration data available for subsequent acquisitions is helpful for the building of the predictive model.
As for the CNR dataset, literature results have been produced in [30] and [32] only for the N case. In particular, Candiani et al. [30] obtained an RMSE of 7.1 kg/ha through a hybrid approach, exploiting radiative transfer models and machine learning regression algorithms, calibrated with all the available samples. In [32], N was estimated by exploiting a self-supervised learning technique, including a denoising convolutional autoencoder and a multilayer perceptron trained on simulated data, achieving an RMSE of 7.9 kg/ha. The application of the proposed methodology introduced a performance worsening in the range of 9.8%–11.9% setting s = 10 and of 8.3%–10.3% setting s = 6.
For this dataset, PLSR was the best performing regressor, providing the lowest RMSE in three experiments out of four. Interestingly, ensemble active learning resulted in the best sampling methodology in three experiments out of four as well. This is probably related to the high standard deviation of the dataset and to the fact that, as better explained in the following, the combination of multiple active sampling methodologies avoids the polarization of the selection toward specific features, which is beneficial when the study area is inhomogeneous due to different crop treatments, growing stages, or structural characteristics.
As a general comment, the RMSE values reported in Tables III–V show cases in which the setting s = 6 outperforms s = 10. However, the diagram reported in Fig. 3, collecting the experiments performed on both datasets, highlights that the best regressions, on average, are those obtained using the setting s = 10, as the corresponding markers (the black ones) are located below and to the right of those for the setting s = 6 (the green ones). This means that such experiments are characterized by a lower RMSE and a higher R2. This is also evident when observing the overall averages, identified by solid circular markers.
As expected, this trend is confirmed when other settings, in terms of number of calibration samples, are used. As shown in Fig. 3, the RMSE tends to reduce as the number of calibration samples increases. Interestingly, the reduction is almost linear for both DM and N experiments. Similarly, the increase in the average R2 is almost linear with the increase in the number of calibration points.
As for the sampling methodology, the obtained results revealed that the joint exploitation of entropy-based active learning and greedy sampling is a viable option, as it provides the best estimates in half of the performed experiments, although combined with different regression techniques.
Indeed, the two adopted active learning methodologies exploit different concepts of diversity to select the most informative samples for the problem at hand. As explained in [12], in the case of entropy-based active learning, the diversity criterion is given by the entropy of the cumulative histogram obtained by appending, one plot at a time, the features extracted from the spectral response of new plots to the list constituted by those already considered. When a new spectral feature introduces different components in the dataset, the histogram of the extended feature set becomes flatter.
In the case of greedy sampling [20], the discriminator is the distance between the sample under consideration and the centroid of the cluster consisting of the samples already selected as informative. The further the sample, the higher the information it brings to the pool.
The two techniques give different representations of the data variance, which is related to the shape of a histogram calculated without any data projection, in the case of entropy-based active learning, and to the eigenvalues of the covariance matrix, as the outcome of a change of the basis of data, in the case of greedy sampling. This makes the behavior of the two techniques different in the active selection of calibration data.
In this regard, the data reported in Table VI show the number of samples selected via active learning, categorized based on the fertilization treatments applied to the Wageningen test site for the setting s = 6. From the table, it emerges that the two techniques tend to privilege different treatments. As an example, entropy-based active sampling is strongly polarized toward treatments 1, 13, and 14. On the other hand, greedy sampling, although also privileging treatment 1, selects a plot belonging to treatment 13 only once and no plots belonging to treatment 14. In addition, entropy-based active sampling does not select plots belonging to treatments 2, 4, and 6, despite their being, together with treatment 1, the most represented in greedy sampling, which, in general, shows a selection more evenly distributed across the treatments. In fact, overall, greedy sampling selects samples belonging to 11 treatments out of 15, while the ratio for entropy-based active learning is 6 out of 15.
As given in the last row of Table VI, the combination of the two techniques tends to merge the two behaviors. As expected, the active learning ensemble is strongly polarized toward treatment 1, which is the most represented using the single techniques. All the other selected samples are redistributed among the treatments according to the maximization of the specific criterion applied to 3 samples per technique instead of 6. As a result, 8 treatments out of 15 are represented in the pool. This probably prevents the insurgence of outliers and provides stable estimates of the parameters of interest due to the reduced polarization of the calibration set toward specific treatments, which would otherwise be overrepresented compared to the single active learning framework. It is worthwhile to remark that the sum of the samples selected by ensemble active learning can be lower than that given by the single active learning methods, as some samples can be selected by both of them.
The diversity introduced by the exploitation of different active learning techniques can be evaluated also at a spectral level. In this regard, Fig. 4 reports the spectra of the plots selected by the proposed active learning methodology for the case s = 6, CNR dataset, and active learning ensemble. Solid and dashed lines refer to entropy-based active sampling and greedy sampling, respectively.
CNR dataset, spectra of the plots selected using active learning ensemble with the setting s = 6. Solid and dashed lines refer to entropy-based active sampling and greedy sampling, respectively.
As reported in [33], the waveband centered at around 550 nm provides excellent sensitivity to plant nitrogen content. This is mostly confirmed by the plotted spectra, as plots characterized by a higher nitrogen content exhibit a more pronounced reflectance peak at that wavelength, except for the plot represented by the black curve, which is the one characterized by the lowest biomass and nitrogen content. In this case, the anomalous behavior is probably due to a different growing stage. Interestingly, entropy-based active learning tends to select plots characterized by a low nitrogen content. Conversely, greedy sampling is polarized toward plots showing a high nitrogen content.
As for the biomass, indications about the most informative wavelengths can be found in [33] and [34]. According to these works, most of the information is carried by the wavelengths included in the interval 690–1100 nm. Areas characterized by high biomass usually exhibit a lower response in the visible spectrum compared with low-biomass ones. This behavior is then inverted at red-edge and NIR wavelengths [34]. This phenomenology is mostly respected by the spectra plotted in Fig. 4.
As in the nitrogen case, the two adopted active learning techniques show opposite behaviors. Greedy sampling privileges the areas characterized by a lower reflectance in the reference interval of wavelengths. Conversely, entropy-based sampling is more polarized toward high-reflectance regions. This is positive for increasing the diversity of the calibration dataset.
Although the proposed methodology is able to provide results fully comparable to the literature, some open points deserve further investigation. First, information about fertilization treatments should be considered in the process, as it can represent an additional baseline for guiding active learning algorithms. Moreover, the statistical mechanisms governing the selection of samples, and how their reduction impacts model performance in relation to the scene characteristics, should be better explained, as their full understanding could provide indications about the optimal number of calibration samples to be used for regression. Finally, the effect of increasing the ensemble size should be explored, as it may contribute to the reduction of the estimation error.
Conclusion
HS remote sensing is a powerful tool for vegetation parameter estimation, provided that adequate calibration samples are collected to train machine learning models. However, the costs associated with field sampling make farmers reluctant to implement it. As a consequence, the full exploitation of remote sensing data in agriculture is still an open issue.
The purpose of this work was to demonstrate that appropriate regression techniques and data-driven sampling methodologies allow for dramatically reducing the need for calibration data, up to 80% compared to the reference literature, with performance fully comparable or negligibly degraded.
This has been assessed by processing two different datasets. In the first case, the accuracy of the prediction of ryegrass DM and nitrogen content was slightly better than in the literature. In the second one, the estimation of maize nitrogen content worsened by about 10% compared to previous studies. The registered absolute error is within the range indicated as acceptable by stakeholders.
Interestingly, the results revealed that the exploitation of ensembles, both at regression and active learning levels, is beneficial. First, the joint use of more regression methodologies prevents the insurgence of outliers as their combination mitigates possible failures of the single technique. Moreover, the use of different active learning techniques for calibration sample selection enables an improved representation of data variance. The exploitation of different diversity criteria tends to better distribute samples across the different crop features, preventing an excessive polarization of the calibration set. The best results have been obtained using this setting, independently from the regression methodology, in half of the performed experiments.
To the best of our knowledge, these findings are new and indicate that the use of ensembles of regressors and active sampling techniques is a promising tool to support precision agriculture tasks with a significant reduction of the field work required to retrieve calibration data. Moreover, the ensemble allows obtaining more stable and accurate results compared with the exploitation of a single regressor and/or active learning methodology. Beyond the estimation scores, which are important but not the central aspect of our study, the takeaway message is that the relation between the number of calibration samples and the quality of the estimates is not straightforward, i.e., it is not obvious that more calibration points yield better results. A reasoned strategy aiming at the optimization of calibration data can lead to results comparable to those retrievable using more samples for model fitting.
The proposed framework is oriented toward operational environments, in which the reduction of field work and its associated costs is key for the full adoption of remote sensing technologies. However, some open points need further research. The first is generalization: the obtained results are promising but need to be assessed in more variegated scenarios of crops and sensing instruments. Moreover, it would be interesting to investigate the effects of fertilization treatments on active sampling, the optimization of the number of calibration samples, and the enlargement of the ensemble size.