Introduction
Air pollution, regardless of emission source, has become an important topic in environmental research and management [1], [2], particularly in urban areas [3]. Effective air-quality control requires an accurate and reliable solution for urban air pollution forecasting. In the state of New South Wales (NSW), Australia, the key airborne pollutants such as particulate matter (
Air pollution prediction is a challenging task due to the dynamic and non-stationary nature of the time series of pollutants. For this, statistical methods have been widely applied, integrating spatial correlations with trends and seasonal patterns to estimate the chronological dependency of historical and future values [6]. Commonly-used techniques based on autoregressive integrated moving-average (ARIMA) with different variants (e.g., VARIMA, SARIMA) [7], [8] have contributed to a large number of studies, involving data that are often stationary, detrended or deseasonalized. These conditions may however be unfulfilled for real-world applications, wherein temporal distributions are prone to not only locations but also environmental perturbations and data drifts [9].
Improving the accuracy and reliability of the forecast would require an approach that can learn from the big data to be less dependent on the assumptions of model-based or statistics-based methods. In this regard, machine learning (ML) techniques such as artificial neural networks [10], support vector machine [11], random forest [1], K-nearest neighbors or naïve Bayes [12] have been applied to the time-series prediction of airborne pollutants. However, these shallow-learning methods require intensive data processing procedures involving high computational latency before training and during prediction [9]. As an advanced version of ML, deep learning (DL) models, with the ability to learn “deeper” from multiple layers, can produce superior performance comparable to human experts [13]. Among the state-of-the-art DL networks, the convolutional neural network (CNN) and recurrent neural network (RNN) demonstrate their high performance for comprehensive learning of spatial and temporal features respectively in many applications [14], [15], including air quality forecasting [16]. The long short-term memory (LSTM) [17] as a variant of RNN, has proved its robustness in capturing long-term dependency of time and causal features of the inputs, especially pollutant concentrations and meteorological values. Therefore, the LSTM is particularly promising in air-quality estimation [18], [19].
For air-quality forecast, most of the learning-based models result in point-wise estimation at each time step, which can be considered as deterministic. As such, good predictions can be achieved only when training data and remaining data share the same statistical properties (e.g., the same distribution). This condition is impractical for deep learning with air pollution data in the presence of spatial-temporal correlations and influences of external changes in emissions, weather patterns and the multifaceted factors of environmental volatility [20]. Indeed, as mentioned in a recent survey [21], uncertainties associated with those conditions and incoming data imperfectness, such as missing and out-of-distribution values, pose a challenge in deep learning, for which the integration of probabilistic methods such as Bayesian reasoning into deep neural networks [22] is worth exploring to deal with uncertainties.
In this paper, we propose a new deep-learning model that can incorporate spatially dispersed features when learning the time series of airborne pollutant concentrations via the fusion of historical observations and predicted values from the CCAM-CTM output. The proposed model handles data uncertainties and imperfectness by combining a recurrent neural network with Bayesian inference, forming the long short-term memory-Bayesian neural network (LSTM-BNN). In addition, we aim to optimally approximate the probability density function (PDF) to produce the distribution of the pollution forecast at each time step. To this end, Bayesian modeling of uncertainties with variational inference is applied to both training and forecasting tasks. The quantification of uncertainties with the proposed technique provides an effective treatment of the inference bias associated with the conventional Gaussian assumption for distributions of the forecast values and significantly reduces the number of samples required.
The contributions of this paper include:
An effective LSTM-BNN framework using an LSTM single-step recursive forecast model in combination with Bayesian reasoning for data fusion of air pollution observations and existing numerical estimations.
A new algorithm for kernel density estimation with subdivision tuning (KDEST), developed for the air pollutant’s probability density function with a reduced amount of samples of forecast distributions for uncertainty quantification.
A new imputation algorithm for spatially-adjusted multivariate imputation by chained equations (SAMICE), developed to adaptively impute missing observation data of the target location based on correlation with neighbor stations.
A potential application of the proposed model, with spatial inference, integrated into the system managing all stations in a region to achieve the required forecast accuracy and reliability, as verified through extensive experiments.
The paper is organized as follows. After the Introduction, Section II presents the proposed LSTM-BNN framework. Section III is devoted to the handling of uncertainties with the approximation for Bayesian inference and the proposed KDEST technique. Section IV presents the imputation of spatio-temporal distributions of air pollutants and the SAMICE algorithm development. The results and discussion from comparison and ablation analysis for different model configurations in various seasons are included in Section V, indicating the potential of applying the proposed approach for NSW suburbs. Finally, a conclusion is drawn in the last section, Section VI.
Deep Learning Framework for Air-Pollutant Forecast
In this section, we introduce the datasets used for training, validating and forecasting, describe their collection and preprocessing, and then present the proposed framework built on the LSTM-BNN model.
A. Air-Quality Data from Observations and Numerical Model
The air-quality data are collected from two sources: (i) observations (OBS), measured from air quality stations, and (ii) numerical model’s predicted values from CCAM-CTM.
1) Observations - OBS
The observations come from open databases managed and published by the NSW Government through an application programming interface (API), which provides air pollution information from over 50 state-run stations across NSW [23]. This includes the main pollutants such as
2) CCAM-CTM
From the numerical models, pollutant concentrations are available for up to 72-hour forecasts obtained from the combination of two numerical models currently used in the state of NSW, Australia:
The Conformal Cubic Atmospheric Model (CCAM) is a 3D cubic atmospheric model which uses a non-hydrostatic, semi-implicit, semi-Lagrangian dynamical core to simulate climate and weather at fine resolutions. It accounts sufficiently for the local topography, atmospheric processes and associated climate impacts or extreme weather features (e.g., tropical cyclones or bushfires) [24].
The Chemical Transport Model (CTM) is currently deployed for predictions of particles ($PM_{2.5}$ and $PM_{10}$), $NO$, $NO_{2}$ and $O_{3}$. This model employs data of emissions and anthropogenic sources from the air quality inventory of NSW-Sydney Greater Metropolitan Region (GMR), calculated emissions for marine aerosol, wind-blown dust, volatile organic compounds (VOC), as an integration of the sources and distribution sizes of air pollutants [25].
The combined CCAM-CTM has been implemented in NSW since 2017 to flexibly scale the predictions at different resolutions (80 km
3) Accuracy and Reliability
An essential requirement for air pollution forecasting is to maintain its accuracy. Here, the CCAM-CTM model requires highly accurate capture of variable emissions sources as the main inputs in order to infer estimation outputs via multiple chemical reactions and physical equations [25]. However, as a result of inaccurate predictions of organic compounds and other chemical species, the model displays unreliable results at different seasons such as overestimation and underestimation of
Another problem is incomplete OBS data due to missing information or imperfect conditions. This may occur at stations and low-cost sensors because of unexpected instrument failures or various impacts of a volatile environment [26]. Missing information degrades the capacity to learn the dynamic features and extreme events in the time series. Moreover, data-driven models become ineffective when inputs are incomplete or variables are absent.
Data drift also remains a problem for air quality prediction. According to a recent report covering 2012–2018 in the NSW GMR [27], the pollutant concentrations vary significantly from year to year, especially for the particle levels. Therefore, the forecast performance is inevitably affected by the concept drift problem in air-quality data when using any learning technique with a pre-trained model [9]. As such, historical data may be insufficient to handle the prediction in the coming periods given chaotic changes of air pollution.
To overcome these issues, we develop an effective technique for kernel density estimation with subdivision tuning (KDEST) to improve prediction accuracy and smooth the distribution shape of the limited samples obtained from our LSTM-BNN model. In addition, an algorithm for spatially-adjusted multivariate imputation by chained equations (SAMICE) is proposed to handle missing information or abrupt changes in concentration levels by updating the spatio-temporal distributions of new incoming data from nearby stations.
4) Data Partition and Model Configuration
Each variable in this work comprises approximately 30,000 hourly-averaged values from March 2018 to August 2021, used for training, validating and testing the model. The raw data are scanned to remove outliers (negative values or extreme values more than 3 times higher than the average), resample missing time steps, and impute missing values. After preprocessing, the inputs are transformed into three-dimensional arrays, i.e., number of samples, number of time steps, and number of features, to create a set of spatio-temporal data for fitting the forecast model. Then, the transformed dataset is divided into training, validation and testing sets with splitting ratios of 80%, 10% and 10% of the total samples, respectively. This selection, together with the sliding-window method in the training and testing processes, accounts for the temporal nature of the time series used in the LSTM forecast model. Indeed, assigning 80% of the total to the training set covers all seasonal patterns, extreme events (e.g., bushfires in the Black Summer of 2019–2020 in NSW, Australia) and other episodes of air pollutants. Hence, the distributions of various features can be considered as fully learned by the proposed model during the training process. The 10% of hourly-averaged data (approximately 3,000 values, equivalent to 125 days) assigned to each of the validation and testing sets can sufficiently evaluate the generalization capacity of our model for a particular season of the year. Before training, the data were normalized to the interval [0, 1] to speed up convergence and reduce the prediction bias.
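The preprocessing steps described above (normalization to [0, 1], sliding-window reshaping into samples, time steps and features, and an 80/10/10 chronological split) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' pipeline: the function names, the 24-step window and the choice of the first column as the target are assumptions made only for the example.

```python
import numpy as np

def min_max_normalize(data):
    """Scale each feature column to [0, 1]; also return the (min, max) used."""
    lo = data.min(axis=0)
    hi = data.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    return (data - lo) / span, (lo, hi)

def make_windows(data, n_steps, target_col=0):
    """Turn a (time, features) array into (samples, n_steps, features) inputs
    and next-step targets taken from column `target_col`."""
    X = np.stack([data[i:i + n_steps] for i in range(len(data) - n_steps)])
    y = data[n_steps:, target_col]
    return X, y

def chronological_split(X, y, train=0.8, val=0.1):
    """Split without shuffling so validation and test samples follow training in time."""
    n = len(X)
    i_train = int(n * train)
    i_val = int(n * (train + val))
    return ((X[:i_train], y[:i_train]),
            (X[i_train:i_val], y[i_train:i_val]),
            (X[i_val:], y[i_val:]))

# Synthetic stand-in for hourly data: 1000 hours, 3 features (hypothetical values).
rng = np.random.default_rng(0)
raw = rng.gamma(shape=2.0, scale=10.0, size=(1000, 3))
scaled, (lo, hi) = min_max_normalize(raw)
X, y = make_windows(scaled, n_steps=24)
train_set, val_set, test_set = chronological_split(X, y)
```

For simplicity the sketch computes the scaling statistics on the full series; computing them on the training portion only is a common alternative when the test period must remain strictly unseen.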
Based on empirical experiments with our real data, the selected hyperparameters are summarized in Table 1. Here, the activation function is the Rectified Linear Unit (ReLU), and the Adam optimizer is chosen for training because it applies an adaptive stochastic optimization method that is suitable for time series. Finally, the early stopping method is also applied with
The proposed model architecture and related functions are developed with Keras, a high-level neural network API in Python, running on TensorFlow, an open-source library for machine learning tasks. We train the model on the Interactive High Performance Computing (iHPC) server with an NVIDIA Quadro RTX 6000 GPU.
B. Fusion of Observation and Model Data
Methods for multistep-ahead predictions can be categorized as (i) direct or one-shot forecast and (ii) recursive forecast. The former produces a sequence of multiple time steps at one prediction, which is suitable for stationary data with seasonal patterns (e.g., temperature). This method may however face the uncertainty problem in air pollutants, causing instability in predictive performance, especially for the long-term forecast (e.g., 48h-, 60h- or 72h-future values). On the other hand, the latter method, using a one-step model recursively with its new input updated by the latest prediction values, can flexibly produce multiple forecast time steps in an iterative manner to reach the standard period of 72-hour forecast.
To mitigate the accumulated errors, we apply a recursive model for single-step forecast with the fusion of real-world observations and predicted CCAM-CTM data. Here, as only historical observations are available, the most recent forecast value from the model output is fed back into the input sequence for the next time-step forecast. Taking advantage of the multistep-ahead forecast from physical modeling, the predicted data from the CCAM-CTM model are combined with observations and previous forecast values to enhance the knowledge of future trends, which helps reduce the predictive error at each forecast time step. During operation, OBS data from the monitoring stations and low-cost sensor networks are updated hourly and replace the corresponding previous forecasts in the input sequence, suppressing the uncertainty those forecasts carry. Apart from improving prediction performance, this method also helps prevent data leakage in DL with neural networks.
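The recursive scheme described above can be sketched as a simple loop in which each one-step forecast is appended to the input window while the numerical-model value for the same hour is supplied as an extra input. The callable `one_step_model`, the blending rule in `toy_model` and all numbers are hypothetical placeholders for illustration; the trained LSTM-BNN would take the place of `toy_model`.

```python
import numpy as np

def recursive_forecast(one_step_model, obs_history, ctm_forecast, horizon=72):
    """Roll a single-step model forward `horizon` hours.

    obs_history  : latest observed values forming the initial input window
    ctm_forecast : numerical-model values for the hours being forecast
    one_step_model(window, ctm_value) -> float is a hypothetical callable
    """
    window = np.asarray(obs_history, dtype=float).copy()
    forecasts = np.empty(horizon)
    for t in range(horizon):
        y_hat = one_step_model(window, ctm_forecast[t])
        forecasts[t] = y_hat
        # Drop the oldest value and feed the newest forecast back into the window.
        window = np.append(window[1:], y_hat)
    return forecasts

def toy_model(window, ctm_value):
    """Placeholder predictor: blend of the last window value and the CTM value."""
    return 0.7 * window[-1] + 0.3 * ctm_value

history = np.array([20.0, 22.0, 21.0, 23.0])
ctm = np.full(72, 30.0)
pred = recursive_forecast(toy_model, history, ctm)
```

In operation, when a new hourly observation arrives, the window would be rebuilt from the observed value instead of the forecast for that hour before the loop is run again.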
C. Proposed Architecture
In our approach, a recurrent neural network (RNN) model is used as the core of our proposed framework owing to its robustness in modeling sequences and its flexibility with respect to different prediction scenarios such as one-to-one, one-to-many, many-to-one or many-to-many time-step predictions [13]. To control the flow of information from input sequences containing the long-term patterns of airborne pollutant time series, overcome the vanishing and exploding gradient issues in RNNs and, more importantly, improve the forecast accuracy via quantification of uncertainties, we propose a hybrid deep learning model, LSTM-BNN, which integrates the LSTM network with Bayesian inference. To implement the LSTM-BNN predictive model, recurrent layers are interleaved with Monte-Carlo (MC) dropout layers for regularization, prevention of overfitting and quantification of uncertainties during prediction [28].
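The role of MC dropout at prediction time can be illustrated with a short sketch: dropout stays active during inference, so repeated forward passes through the same network yield a set of forecast samples whose spread reflects model uncertainty. A small dense network with fixed weights stands in for the stacked LSTM layers here; the weights, the dropout rate of 0.2 and the 50 passes are assumptions chosen for illustration only.

```python
import numpy as np

def mc_dropout_predict(x, W1, b1, W2, b2, rate=0.2, n_samples=50, seed=0):
    """Draw `n_samples` stochastic predictions with dropout kept on at inference."""
    rng = np.random.default_rng(seed)
    samples = np.empty(n_samples)
    for s in range(n_samples):
        hidden = np.maximum(0.0, x @ W1 + b1)        # ReLU hidden layer
        mask = rng.random(hidden.shape) >= rate      # keep each unit with prob. 1 - rate
        hidden = hidden * mask / (1.0 - rate)        # inverted-dropout rescaling
        samples[s] = float(hidden @ W2 + b2)
    return samples

# Hypothetical fixed weights so the example is fully reproducible.
x = np.array([1.0, 2.0, 3.0, 4.0])
W1 = np.full((4, 8), 0.1)
b1 = np.zeros(8)
W2 = np.linspace(0.1, 0.8, 8)
b2 = 0.0

samples = mc_dropout_predict(x, W1, b1, W2, b2)
mean_forecast, spread = samples.mean(), samples.std()
```

Setting `rate=0.0` removes the randomness and every pass returns the same deterministic output, which is a quick way to check that the spread comes from dropout alone.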
The sequential structure of an LSTM layer includes multiple memory cells with inputs \begin{equation*} C_{t} = f_{t}*C_{t-1}+i_{t}*\tilde {C_{t}}, \tag{1}\end{equation*}
\begin{align*} f_{t} &= \sigma (W_{f}.[h_{t-1}, x_{t}] + b_{f}), \tag{2}\\ i_{t} &= \sigma (W_{i}.[h_{t-1}, x_{t}] + b_{i}), \tag{3}\\ \tilde {C_{t}} &= tanh(W_{C}.[h_{t-1}, x_{t}] + b_{C}). \tag{4}\end{align*}
\begin{equation*} h_{t} = o_{t}*tanh(C_{t}), \tag{5}\end{equation*}
\begin{equation*} o_{t} = \sigma (W_{o}.[h_{t-1}, x_{t}] + b_{o}), \tag{6}\end{equation*}
\begin{align*} \sigma (x) &= \frac {1}{1+e^{-x}}, \tag{7}\\ tanh(x) &= \frac {e^{x}-e^{-x}}{e^{x}+e^{-x}}. \tag{8}\end{align*}
The DL parameters
The proposed LSTM-BNN framework is depicted in Fig. 1, where its inputs combine CCAM-CTM and the real-world OBS collected from monitoring stations and low-cost sensor networks with data being normalized in
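As an illustration of Eqs. (1)-(8), one update of a single LSTM memory cell can be written in a few lines of NumPy. This is a sketch of the standard LSTM formulation, in which the output gate has its own weight matrix $W_{o}$ and bias $b_{o}$; the dictionary layout of `params` and the dimensions in the example are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    """Logistic function, Eq. (7)."""
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x_t, h_prev, C_prev, params):
    """One LSTM memory-cell update following Eqs. (1)-(6)."""
    z = np.concatenate([h_prev, x_t])                       # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])        # forget gate, Eq. (2)
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])        # input gate, Eq. (3)
    C_tilde = np.tanh(params["W_C"] @ z + params["b_C"])    # candidate state, Eq. (4)
    C_t = f_t * C_prev + i_t * C_tilde                      # cell state, Eq. (1)
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])        # output gate, Eq. (6)
    h_t = o_t * np.tanh(C_t)                                # hidden state, Eq. (5)
    return h_t, C_t

# Example with 2 input features and 3 hidden units; all-zero weights make
# every gate equal to 0.5 and the candidate state equal to 0.
n_in, n_hidden = 2, 3
params = {}
for name in ("f", "i", "C", "o"):
    params["W_" + name] = np.zeros((n_hidden, n_hidden + n_in))
    params["b_" + name] = np.zeros(n_hidden)

h_t, C_t = lstm_cell_step(np.ones(n_in), np.zeros(n_hidden), np.ones(n_hidden), params)
```

With these zero weights the cell state is simply halved, which makes the gating arithmetic easy to verify by hand.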
Bayesian Inference for Model Uncertainty Handling
As uncertainties in DL models are inevitable, we integrate the Bayesian inference method in the LSTM model for uncertainty quantification to improve the accuracy and reliability of the forecast.
A. Bayesian Inference in Neural Network
The degree of belief on the neural network model specifications based on new information (observations) can be inferred from the posterior \begin{equation*} p(\omega |x) = \frac {p(x,\omega)}{P(x)} \Longrightarrow p(\omega |x) = \frac {p(x|\omega)p(\omega)}{P(x)}, \tag{9}\end{equation*}
For every new parameter \begin{equation*} p(y_{forecast}|X_{new}) = \int p(y_{forecast}|\omega)p(\omega |X_{new})d\omega. \tag{10}\end{equation*}
Since the posterior of weights \begin{equation*} D_{KL}(q(\omega)||p(\omega |x))= \int q(\omega)log \frac {q(\omega)}{p(\omega |x)}d\omega. \tag{11}\end{equation*}
\begin{align*} \min \limits _{\theta } D_{KL}(q(\omega, \theta)||p(\omega |x)) &= \min \limits _{\theta } \mathbb {E}_{q(\omega, \theta)} \left [{log q(\omega, \theta) }\right. \\ &\qquad \left.{ {-} log p(\omega |x) }\right], \tag{12}\end{align*}
First, through multiple samplings of forecast values
B. Kernel Density Estimation With Subdivision Tuning
The Gaussian distribution assumption is widely used in probabilistic models for applications with a large number of samples for each time step, as justified by the central limit theorem (CLT) [32]. In air-quality forecasting, however, drawing that many samples makes multi-step-ahead estimation over a large region computationally expensive. Moreover, in addition to inevitable uncertainties, direct Gaussian-based inference from observations with non-normal distributions could introduce bias. As a remedy, and taking into account the recursive fusion of CCAM-CTM and OBS for large-area prediction, we propose to subdivide the data to tune the parameters of the kernel density estimation (KDE), a non-parametric method commonly used to infer the smooth shape of distributions from observed data [33]. The idea is to obtain an optimal approximation of the probability density function (PDF) at each forecast time step.
To estimate the distribution density at point \begin{equation*} \hat {f}(x) = \frac {1}{nh}\sum _{i=1}^{n}K \left({\frac {x-x_{i}}{h} }\right), \tag{13}\end{equation*}
\begin{align*} ISE & = \int [f(x) - \hat {f}(x)]^{2} dx \\ & = \int f^{2}(x) dx - 2 \int f(x) \hat {f}(x) dx + \int \hat {f}^{2}(x) dx, \tag{14}\end{align*}
\begin{equation*} J_{ISE} = \int \hat {f}^{2}(x) dx - \frac {2}{n}\sum _{i=1}^{n}\hat {f}(x_{i}). \tag{15}\end{equation*}
\begin{equation*} \hat {f}_{j}(x) = \frac {1}{(n-k)h_{j}}\sum _{i=1}^{n-k}K_{j} \left({\frac {x-x_{i}}{h_{j}} }\right), \tag{16}\end{equation*}
\begin{equation*} J_{ISE}(h_{j}) = \frac {1}{k}\left({\sum _{i=1}^{k} \int \hat {f_{j}}^{2}(x) dx }\right) - \frac {2}{k}\sum _{i=1}^{k}\hat {f_{j}}(x_{i}). \tag{17}\end{equation*}
The procedure of finding optimized bandwidth (
Algorithm 1 KDEST
Input: Forecast posterior distribution at a time step
Output: Optimal bandwidth (
Define bandwidth range
Divide the posterior distribution into
for each
end for
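The bandwidth search can be sketched in NumPy along the lines of Eqs. (13), (16) and (17): the forecast samples are split into subdivisions, each subdivision is held out in turn, a Gaussian-kernel density is fitted on the remaining samples, and the bandwidth with the lowest average score is retained. This is an illustrative reading, not the authors' implementation; the Gaussian kernel, the number of subdivisions, the candidate bandwidth range and the grid-based search for the density maximum are all assumptions.

```python
import numpy as np

def gaussian_kde(x, samples, h):
    """Eq. (13) with a Gaussian kernel: density at points x for bandwidth h."""
    u = (np.atleast_1d(x)[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(samples) * h * np.sqrt(2.0 * np.pi))

def integral_of_squared_kde(samples, h):
    """Closed form of the integral of f_hat^2 for a Gaussian kernel."""
    d = samples[:, None] - samples[None, :]
    s2 = 2.0 * h**2  # variance of the convolution of two Gaussian kernels
    return np.exp(-d**2 / (2.0 * s2)).sum() / (len(samples)**2 * np.sqrt(2.0 * np.pi * s2))

def select_bandwidth(samples, bandwidths, n_subdivisions=5):
    """Return the candidate bandwidth with the lowest averaged held-out score."""
    folds = np.array_split(np.arange(len(samples)), n_subdivisions)
    scores = []
    for h in bandwidths:
        fold_scores = []
        for held_out in folds:
            fit = np.delete(samples, held_out)            # the n - k fitting samples
            held_density = gaussian_kde(samples[held_out], fit, h)
            fold_scores.append(integral_of_squared_kde(fit, h) - 2.0 * held_density.mean())
        scores.append(np.mean(fold_scores))
    return bandwidths[int(np.argmin(scores))]

def kde_point_forecast(samples, h, grid_size=512):
    """Location of the density maximum, searched on a regular grid."""
    grid = np.linspace(samples.min() - 3.0 * h, samples.max() + 3.0 * h, grid_size)
    return grid[np.argmax(gaussian_kde(grid, samples, h))]

# 50 skewed synthetic samples standing in for one time step of forecast samples.
rng = np.random.default_rng(1)
mc_samples = rng.gamma(shape=2.0, scale=5.0, size=50)
candidates = np.linspace(0.5, 5.0, 10)
h_opt = select_bandwidth(mc_samples, candidates)
point = kde_point_forecast(mc_samples, h_opt)
```

For a skewed sample such as this one, the density maximum generally differs from the sample mean, which is the situation in which a Gaussian summary of the forecast distribution would be biased.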
To illustrate our inference for forecasting three air pollutants
Histograms and inferences of CCAM-CTM (black line), Gaussian inference by the mean of distribution (green line) and the maximum likelihood estimation (red line) from KDEST, compared to the ground truth OBS (blue line).
Spatio-Temporal Distribution Imputation
Missing information and imperfections in data recording are important issues in air quality prediction. This section is devoted to the imputation techniques developed in this work for air-pollutant forecasts.
A. Correlation of Spatio-Temporal Profiles of Air Pollutants
Correlations of measurements at regional air-quality stations are widely used to select appropriate features of air pollutants or meteorology for imputation, training and forecasting in association with spatial relation analysis [37]. For levels of a pollutant collected at 19 air-quality stations over the Greater Metropolitan Region (GMR) of Sydney in 2021 [23], the correlations vary widely with location, which is a concern for data imputation. Short-term temporal correlations also vary episodically due to complex dispersion, meteorological impacts, emission conditions and chemical reactions of the pollutants, resulting in intermediate correlations of an air pollutant at a target station with other stations. For example, at Liverpool station in early 2021, distinct changes in the correlation for ozone occurred at Cook & Phillip, Randwick and Earlwood with respect to Liverpool, in a window of 48-hour observations during the
Correlations of ozone at Liverpool with respect to other stations in the
B. Spatially-Adjusted Multivariate Imputation By Chained Equations
In environmental monitoring, measured observations are occasionally absent due to volatile outdoor conditions, system failures or communication problems [26]. Such information loss may corrupt the model whenever an input sequence is incomplete. Moreover, because of the concept drift problem [9], the forecast becomes unreliable if the absent information is inferred only from the statistical properties of historical data. As a remedy, to deal with the drift of incoming data by referring to nearby stations through inter-station correlations, we develop here a novel online imputation technique, called spatially-adjusted multivariate imputation by chained equations (SAMICE). The proposed technique, modified from the multiple imputation method by chained equations (MICE) [38], also treats a feature with missing values as a dependent variable and the remaining variables as predictors in a multiple regression model. Here, however, only the most correlated variables, rather than all features in the dataset, are involved in the regression-based imputation from the predictive distributions of the fitted model.
Let us consider the whole OBS dataset \begin{equation*} y_{-i} = y_{j}\cdot \alpha _{-i} + \beta _{-i}, \tag{18}\end{equation*}
The idea behind our SAMICE algorithm is to utilize only the most correlated variables to improve the regressive-based imputation. Here, a faster convergence with higher reliability is expected to result by reducing uncertainties from predictions after multiple cycles of imputation. For that, the correlation between the target station \begin{equation*} r_{ij} = \frac { \sum _{i,j} (y_{i} - \bar {y_{i}})(y_{j} - \bar {y_{j}}) } {\sqrt {\sum _{i}(y_{i} - \bar {y_{i}})^{2}\sum _{j}(y_{j} - \bar {y_{j}})^{2}}}, \tag{19}\end{equation*}
The correlation-based SAMICE regression model is now formulated to impute missing values at station \begin{align*} y_{-i} = \begin{cases} \displaystyle y_{j-remain}\cdot \alpha _{-i} + \beta _{-i} & (\text {if} r_{ij} \geq r_{thr}) \\ \displaystyle \bar {y}_{i} & (\text {if} r_{ij} < r_{thr}, \forall j), \end{cases} \tag{20}\end{align*}
Algorithm 2 SAMICE
Input:
Sequences of incoming data
Sequences of incoming data
Output: Imputed data for
Set the spatial correlation threshold
for
Compute Pearson’s correlation coefficients
if
end if
end for
if
end if
Obtain imputed values to form the set
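The correlation-gated regression of Eq. (20) can be sketched as follows: neighbor stations whose Pearson correlation with the target station reaches the threshold are used as regressors in a least-squares fit, and the mean of the observed target values is used when no neighbor qualifies. This is a simplified single-pass illustration rather than the full chained-equation procedure; the function name, the NaN encoding of missing values and the synthetic series are assumptions.

```python
import numpy as np

def samice_impute(target, neighbours, r_thr=0.9):
    """Impute NaNs in `target` (1-D) from `neighbours` (time x stations).

    Neighbours with Pearson correlation >= r_thr on jointly observed rows are
    used as regressors; otherwise the observed target mean is used (Eq. 20).
    """
    target = np.asarray(target, dtype=float)
    neighbours = np.asarray(neighbours, dtype=float)
    observed = ~np.isnan(target)
    selected = []
    for j in range(neighbours.shape[1]):
        ok = observed & ~np.isnan(neighbours[:, j])
        if ok.sum() > 2:
            r = np.corrcoef(target[ok], neighbours[ok, j])[0, 1]
            if r >= r_thr:
                selected.append(j)
    imputed = target.copy()
    missing = ~observed
    fallback = target[observed].mean()
    if not selected:
        imputed[missing] = fallback
        return imputed
    # Rows where every selected neighbour is available.
    complete = ~np.isnan(neighbours[:, selected]).any(axis=1)
    fit_rows = observed & complete
    A = np.column_stack([neighbours[fit_rows][:, selected], np.ones(fit_rows.sum())])
    coef, *_ = np.linalg.lstsq(A, target[fit_rows], rcond=None)
    fill_rows = missing & complete
    B = np.column_stack([neighbours[fill_rows][:, selected], np.ones(fill_rows.sum())])
    imputed[fill_rows] = B @ coef
    imputed[missing & ~complete] = fallback  # neighbours also missing at these times
    return imputed

# Synthetic example: one strongly correlated neighbour and one unrelated series.
t = np.arange(48)
truth = 30.0 + 10.0 * np.sin(2.0 * np.pi * t / 24.0)
target = truth.copy()
target[[5, 17, 30]] = np.nan
rng = np.random.default_rng(0)
neighbours = np.column_stack([2.0 * truth + 1.0, rng.normal(size=48)])
filled = samice_impute(target, neighbours, r_thr=0.9)
```

Restricting the regressors to highly correlated stations keeps unrelated series, such as the second column in this example, from adding noise to the imputed values.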
Results and Discussion
In this work, we considered the air-pollutant forecast in two main periods: summer (January 2021) and winter (late May and early June 2021) to evaluate the performance and reliability of our LSTM-BNN model in comparison with the current CCAM-CTM for respectively two key pollutants, the ozone and
A. Evaluation Metrics
For performance evaluation on the forecast of the time-series data for the concerned air pollutants, collected at a number of monitoring stations in NSW, widely-adopted metrics are used here:
The mean absolute error ($MAE$):
\begin{equation*} MAE = \frac {1}{n}\sum _{i=1}^{n} |y_{i} - \hat {y_{i}}|, \tag{21}\end{equation*}
The root mean square error ($RMSE$):
\begin{equation*} RMSE = \sqrt {\frac {1}{n}\sum _{i=1}^{n} (y_{i} - \hat {y_{i}})^{2}}, \tag{22}\end{equation*}
The Pearson’s correlation ($r$):
\begin{equation*} r = \frac {\sum (x_{i} - \bar {x})(y_{i} - \bar {y})}{\sqrt {\sum (x_{i} - \bar {x})^{2}\sum (y_{i} - \bar {y})^{2}}}, \tag{23}\end{equation*}
and the coefficient of determination ($R^{2}$):
\begin{equation*} R^{2} =1 - \frac {\sum (y_{i} - \hat {y_{i}})^{2}}{\sum (y_{i} - \bar {y})^{2}}, \tag{24}\end{equation*}
where $y_{i}$ and $\hat {y_{i}}$ are respectively the measured observations and forecast values of variable $y$ at the $i^{th}$ instant (similarly for variable $x$), $\bar {x}$ and $\bar {y}$ denote sample means, and $n$ is the number of inspected samples. Lower values of $RMSE$ and $MAE$ or higher values of $r$ and $R^{2}$ indicate better performance.
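The four metrics can be computed directly with NumPy, as in the short sketch below. The function name and the sample values are illustrative only; Pearson's $r$ is taken between observations and forecasts and is computed about the sample means, following the standard definition.

```python
import numpy as np

def forecast_metrics(y_true, y_pred):
    """Return MAE, RMSE, Pearson's r and R^2 for observations vs. forecasts."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))                               # Eq. (21)
    rmse = np.sqrt(np.mean(err**2))                          # Eq. (22)
    r = np.corrcoef(y_true, y_pred)[0, 1]                    # Pearson's correlation
    r2 = 1.0 - np.sum(err**2) / np.sum((y_true - y_true.mean())**2)  # Eq. (24)
    return {"MAE": mae, "RMSE": rmse, "r": r, "R2": r2}

obs = np.array([10.0, 12.0, 14.0, 16.0])
pred = np.array([11.0, 11.0, 15.0, 15.0])
scores = forecast_metrics(obs, pred)
```

For these four values every absolute error equals 1, so $MAE$ and $RMSE$ are both 1, and $R^{2}$ is $1 - 4/20 = 0.8$.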
B. LSTM-BNN Performance
In the following, we illustrate the high performance of the proposed LSTM-BNN model in forecasting the two air pollutants of interest (
1) Recursive Forecast
Here, the LSTM-BNN data are sampled with 50 and 300 values per distribution respectively for KDEST inference (KDEST-50) and Gaussian-based inference (GAUSSIAN-300) to benchmark with the measurements of observations (OBS) as the ground truth, and the predicted values of CCAM-CTM as a physical model currently used for air quality estimation in NSW state. We also compared the results with those obtained from a hybrid deep learning model, the CNN-LSTM constructed by 1D-CNN layers concatenated with LSTM layers [16]. From the predicted profiles for
Comparison of recursive forecast profiles with LSTM-BNN: KDEST-50 (red), GAUSSIAN-300 (green), CCAM-CTM (dashed black), CNN-LSTM (orange) and ground truth OBS (blue dot) from
While ozone prediction with CCAM-CTM often has large errors, the LSTM-BNN approach provides a rather accurate forecast, even at the higher pollutant concentrations that arise from its diurnal characteristics. Forecast values of the CNN-LSTM model in general fit OBS as well as those of LSTM-BNN but display underpredictions at some peaks of concentrations such as the forecasts in the midday of the
Notably, with the proposed KDEST, the number of samples can be reduced from 300 down to 50 without performance loss. This can result in some improvement in computational efficiency and enable possibilities for prediction with missing data. To further illustrate the advantage of the proposed KDEST algorithm, we conducted an ablation study with smaller numbers of sampled data in addition to KDEST-50, i.e., 5 (KDEST-5), 10 (KDEST-10), 20 (KDEST-20) and 30 (KDEST-30).
An extensive comparison was conducted for predictions with the CCAM-CTM model, two deterministic DL networks including an LSTM model and the hybrid CNN-LSTM model, the proposed LSTM-BNN with KDEST at various numbers of sampled data, and a Gaussian-inference LSTM-BNN (GAUSSIAN-300) model without KDEST. Table 2 summarizes the comparison based on the metrics
It can be seen from the ablation study that the proposed LSTM-BNN with KDEST sampled at 30 data points (KDEST-30) performs best overall for forecasting ozone, with a 10.73% improvement in MAE and a 31.9% improvement in RMSE compared to CCAM-CTM, and a 54.3% reduction in processing time compared to the LSTM-BNN with Gaussian inference at 300 samples. Moreover, predictions using the proposed model with KDEST-30 have a higher coefficient of determination (
2) Direct Forecast
Numerical models like CCAM-CTM often require a large number of values from air emissions inventory and meteorologies as their inputs. Also, unavailable inputs from these models may cause an accuracy problem in prediction. As such, we consider here the use of the proposed LSTM-BNN model to forecast multiple values with only historical data of OBS and CCAM-CTM. For this, we conducted an extensive ablation study for direct predictions with short-term forecast horizons of 6, 12, 24, 36, 48, 60 and 72 hours ahead. The profiles (left) and scatter plots (right) for
Forecast performance (left) and scatter plot (right) of direct-forecast models for
A comprehensive experiment was also conducted on different combinations of input lengths and output horizons. Table 3 summarizes a representative performance evaluation for the direct forecast of fine particles in the wintertime. It shows that the forecast accuracy is acceptable with
C. Suburban Scale Air Pollution Forecast
With the availability of data recorded at air-quality stations only, the missing information or observation at a location required for LSTM-BNN can be imputed by using the proposed SAMICE algorithm, based on correlations with nearby monitoring stations. In this work, 15 stations located in [
1) Spatial Data Imputation with SAMICE
As mentioned previously, our LSTM-BNN framework with SAMICE imputation can provide forecasts of air pollutants in terms of spatial distributions at a location of interest. This merit can be verified by comparing the profiles of forecast values obtained with imputation by the conventional MICE and by the proposed SAMICE algorithm.
Benchmarked to the ground-truth observations, the accuracy enhancement of SAMICE over MICE can be seen in Figs. 6(a) and (b) for the predicted profiles of
Predicted air-pollutant profiles from incoming data with random missing values at a ratio of 0.2, imputed by MICE and SAMICE for the targeted station at Liverpool.
Table 4 summarizes the improvement of SAMICE over MICE in predicting the three air pollutants under consideration, with data randomly dropped at ratios varying from 0.1 to 0.5 and a threshold of 0.9 for Pearson’s coefficient
2) Suburban Air-Pollutant Distributions
Applying the proposed framework to the Sydney GMR grid, spatial distributions of the air pollutant forecast can be obtained at any suburb or location of interest shown in the map of Fig. 7(a). Figures 7(b)-(e) present the comparisons between the distributions of real observations (left) and 72-hour forecast (right) respectively for
Spatial distributions of real observations versus 72h-forecast of
The spatial distribution maps represent the forecast dispersion of the three air pollutants quite accurately, in line with the evaluation given in Table 4 for Liverpool station. For example, particles tend to move south-east while ozone displays a high concentration in the west during winter 2021. More importantly, combined with the meteorology forecast, this makes it possible to predict the potential risk of air pollution in any suburb or local area. Given the availability of low-cost wireless sensor networks, the approach is also promising for microclimate analysis.
Conclusion
This paper has presented a long short-term memory Bayesian neural network (LSTM-BNN) as a new deep learning model to improve accuracy and reliability of the air pollution forecast, particularly for two main air pollutants