
D2CL: A Dense Dilated Convolutional LSTM Model for Sea Surface Temperature Prediction



Abstract:

Accurately predicting sea surface temperature (SST) is practically important to many applications, such as weather forecasting, ocean environment protection, and marine disaster prevention. The major challenge in predicting SST is to capture both the spatial and temporal characteristics of SST, which has not yet been well addressed by existing methods. In this work, we propose a novel dense dilated convolutional LSTM (D2CL) model to predict SST. D2CL first integrates a dilated convolutional network and LSTM to learn spatial and temporal features from the SST data simultaneously. Then, it uses multiple feature extractors of different dilated kernel sizes to learn features of varying scales. Finally, it introduces dense connections to reduce feature loss during training and achieves SST prediction in an encoder–decoder style. We have conducted extensive experiments over two real datasets to validate the effectiveness of D2CL and all the proposed enhancing techniques. As suggested by the experimental results, D2CL outperforms the existing methods and can achieve accurate SST prediction for as long as seven days into the future.
Page(s): 12514–12523
Date of Publication: 17 November 2021


CC BY: This article is licensed under a Creative Commons Attribution 4.0 License; see https://creativecommons.org/licenses/by/4.0/ for details.
SECTION I.

Introduction

Approximately three quarters of the earth’s surface is covered by sea, which plays an important role in shaping global climate. Specifically, the sea surface temperature (SST) has great effects on many environmental issues (e.g., climate change, marine disasters, and ocean acidification) and environmental phenomena (e.g., ocean currents and El Niño-Southern Oscillation). Therefore, accurately predicting SST is of practical significance and could benefit many environment-related research activities and applications.

However, SST is difficult to predict because it changes dynamically over both time and space. On the one hand, during the daytime, SST changes with the position of the sun, which is a major factor affecting SST by heating the water. On the other hand, different sea areas usually have different SST records due to differences in radiation and evaporation efficiency.

Up to now, various approaches have been proposed for SST prediction. Existing SST prediction approaches can be roughly divided into three categories, i.e., physical models, traditional machine learning models, and deep learning models.

  1. Physical models: Typical physical models, e.g., the general circulation model (GCM) [1], the CMIP5 model [2], and the MyOcean Project model [3], are based on physical theories to model the spatial and temporal dependency in the SST sequence and require domain knowledge to design.

  2. Traditional machine learning-based models: Machine learning models, including the Markov model [4]–[6], regression models [7], [8], and the support vector machine (SVM) [5], [9]–[11], predict SST by training models on both spatial and temporal features extracted from the historical SST data.

  3. Deep learning-based models: Deep learning methods have also been applied to SST prediction due to their strong predictive power on temporal and spatial data [12], [13]. For example, the long short-term memory (LSTM) model predicts SST by taking advantage of the temporal dependency in the SST sequence, and the convolutional neural network (CNN) model treats the SST data as a 2-D map and predicts SST by learning the spatial dependency between different regions.

Existing approaches for SST prediction have already achieved good prediction performance with a small deviation error. For example, the mean absolute error (MAE) of MyOcean is only about 0.5. However, some critical application scenarios, such as coral bleaching and weather prediction, still require more accurate SST prediction. For example, corals are highly sensitive to temperature, and a small change in SST could result in coral bleaching. Therefore, this work aims to exploit the correlations between the spatial dependency and the temporal dependency in SST to further improve the accuracy of SST prediction, which has not yet been considered by the existing approaches.

To address the abovementioned issue, we propose a dense dilated convolutional LSTM (D2CL) model that can effectively learn both the spatial and temporal features of SST for more accurate SST prediction by combining the dilated convolutional network and LSTM. The dilated convolution can effectively learn spatial and temporal correlations in SST and learn features of multiple granularities. Meanwhile, the dense structure could help to minimize the information loss when training the prediction model.

The main contributions of this article are threefold.

  1. We proposed the D2CL model that can learn the spatial and temporal features from historical SST data simultaneously by integrating dilated convolutional network and LSTM.

  2. We developed multiple feature extractors of different dilated kernel sizes to learn the features of multiple scales and introduced dense connection to maximize information transfer between model layers.

  3. We conducted extensive experiments over two real SST datasets to demonstrate the effectiveness of the proposed model and the enhancing techniques.

The rest of this article is organized as follows. We review the related work in Section II, and Section III gives the problem definition of SST prediction. The details of the D2CL model are presented in Section IV. Section V evaluates the proposed model via experiments, followed by the conclusion of this work in Section VI.

SECTION II.

Related Work

This section presents a detailed review on the existing SST prediction methods, which can be roughly grouped into three categories, i.e., physical models, traditional machine learning-based models, and deep learning-based models.

A. Physical Models

Physical models use Newton’s laws of motion, the law of conservation of energy, the seawater equation of state, etc., to predict SST. Typical physical models include the GCM [1], the Coupled Model Intercomparison Project (CMIP) model [2], and the MyOcean Project model [3]. The GCM is based on the Navier–Stokes equations, combined with solar radiation, ocean latent heat, and other ocean dynamics parameters, to simulate the changes of SST. The CMIP model is an ensemble model that integrates multiple physical models to achieve more accurate SST prediction. MyOcean integrates different data sources to produce an analyzed dataset based on physical models.

Physical models are usually of high complexity and interpretability, but require a good understanding of the dynamic mechanism of SST. However, SST is determined by many factors, which makes it difficult to learn and monitor such dynamics.

B. Traditional Machine Learning-Based Models

Different from the physical models, machine learning models are based on probability theory and predict SST by learning the underlying patterns from the historical SST data. Major machine learning models for SST prediction include the Markov model [4], the pattern searching model [6], regression models [7], [8], and the SVM model [5], [9]–[11].

  1. Markov model: Xue et al. [4] proposed a seasonally varying Markov model, constructed in a multivariate empirical orthogonal function (EOF) space of the observed SST and sea level, for predicting the SST in the tropical Pacific.

  2. Pattern searching model: Agarwal et al. [6] developed a pattern searching model to find similar temporal cycles and select the best match based on the interseasonal changes to predict SST.

  3. Regression model: Laepple et al. [7] introduced a regression model for SST prediction and found that there exist correlations between SST and hurricanes. Kug et al. [8] also developed a regression-based model to predict SST dynamically.

  4. SVM model: Lins et al. [9], [10] combined SVM and particle swarm optimization to predict the SST along the northeastern Brazilian coast. Aguilar et al. [5], [11] introduced the warm water volume as an additional feature to enhance SST prediction with an SVM model.

Traditional machine learning models mainly focus on extracting the temporal features of SST and cannot well capture the spatial features.

C. Deep Learning-Based Models

Classic neural networks, e.g., the feed-forward neural network (FNN) and the artificial neural network, were applied to SST prediction more than a decade ago. For example, Tripathi et al. [14] predicted the SST in the Indian Ocean with the FNN model. Tangang et al. [15] introduced wind-stress and SST anomalies for seasonal SST prediction with the FNN model. Garcia-Gorriz et al. [16] used meteorological variables (e.g., sea level pressure, air temperature, and wind) as inputs and conducted SST prediction in the western Mediterranean Sea with the FNN model. Wu et al. [11] proposed a nonlinear model that uses a multilayer perceptron neural network to predict the SST in the tropical Pacific.

Recently, deep learning methods have also been applied for SST prediction. Some studies aim to capture the temporal features in SST sequence. For example, Zhang et al. [12] used the LSTM model to predict the SST in Bohai; Patil et al. [17] proposed the wavelet neural network for the prediction of daily SST; Yang et al. [18] combined the Markov random field with LSTM to predict the SST in Bohai. Some other studies aim to capture the spatial features in the SST records of different regions. For example, Zheng et al. [13] proposed to use the CNN model for SST prediction, and Xiao et al. [19] proposed the convolutional LSTM model for SST prediction.

Most existing deep learning methods for SST prediction learn either the temporal features or the spatial features of the SST data. However, they fail to consider the correlation between the two types of features, which inevitably reduces the prediction accuracy. In contrast, our D2CL model can learn both the spatial and temporal features of SST, thus achieving higher prediction accuracy than the existing methods.

SECTION III.

Problem Definition and Notations

For SST prediction, the sea surface is usually divided into grid regions of the same size according to longitude and latitude. The SST records of each grid region are obtained from the observation equipment in the grid. All the grid regions form an R \times C matrix X_i that represents the SST at a certain time slot T_i, where R and C correspond to the numbers of grid regions along the longitude and latitude, respectively. All the matrices from the historical SST records form a time sequence X_{1}, X_{2},\ldots, X_{t}.

Example 1:

Fig. 1 illustrates the East China Sea (ECS), located within 23^\circ N\text{--}33^\circ N, 117^\circ E\text{--}131^\circ E. In this work, we focus on the region within 23^\circ 00^{\prime } N\text{--}33^\circ 10^{\prime } N, 122^\circ 50^{\prime } E\text{--}129^\circ 50^{\prime } E, which is divided into 40 \times 30 grid regions, each of size 0.25^\circ \times 0.25^\circ. The daily temperature record in each grid region is the average of all the observations collected from the equipment in that grid region. Fig. 2 presents the sequence of SST records for one grid region of ECS in 2015. According to the curve, the SST starts to increase in January, reaches its peak in September, and then decreases. \square

Fig. 1. ECS, where the rectangle highlights the region of interest in this work.

Fig. 2. SST sequence for one grid region in ECS in 2015.

In practice, we usually try to learn knowledge from a period of historical SST records and use the learned knowledge for predicting the SST in the future. In this work, we also follow this principle and formally define the problem of SST prediction as follows.

Definition 1 (SST Prediction):

Given u historical SST records X_{t-u+1},X_{t-u+2},\ldots,X_{t} for all regions, the SST prediction problem is to predict the next v SST records, i.e., \begin{equation*} X_{t+1},\ldots,X_{t+v} = \mathop {\arg \max }_{X_{t+1},\ldots,X_{t+v}} p\left(X_{t+1},\ldots,X_{t+v}\mid X_{t-u+1},\ldots,X_{t}\right) \tag{1} \end{equation*}

where p(\cdot) is the probability of the given records.

For example, given the SST records of the past 30 days, we could predict the SST records of the next seven days, where u=30 and v=7.
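To make the input and output formats concrete, the following minimal NumPy sketch shows how such (history, target) training pairs could be assembled from a sequence of daily R \times C SST grids. The function name and the random placeholder data are illustrative, not from the article.

```python
import numpy as np

def make_samples(sst, u=30, v=7):
    """Slide a window over a (t, R, C) array of daily SST grids, pairing
    each u-day history with the following v days, as in Definition 1."""
    X, Y = [], []
    for t in range(u, sst.shape[0] - v + 1):
        X.append(sst[t - u:t])   # X_{t-u+1}, ..., X_t
        Y.append(sst[t:t + v])   # X_{t+1}, ..., X_{t+v}
    return np.stack(X), np.stack(Y)

# One year of 40 x 30 grids -> inputs (n, 30, 40, 30), targets (n, 7, 40, 30).
sst = np.random.rand(365, 40, 30).astype(np.float32)  # placeholder data
X, Y = make_samples(sst)
```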

The Nomenclature summarizes the notations used in this article.

SECTION IV.

Methodology

A. Model Architecture

Fig. 3 illustrates the architecture of D2CL, which is an encoder–decoder model from the global view. D2CL receives the historical SST records as input and predicts the future SST records. D2CL consists of dilated ConvLSTM layers, each of which contains stacked dilated ConvLSTM blocks (DC blocks) of different dilation rates. In addition, the layers of D2CL are densely connected to maximize information transfer between layers.

Fig. 3. Architecture of the D2CL model, where the color indicates the difference in dilation rate.

B. Encoder and Decoder

In this work, we developed an encoder–decoder model to compress the historical SST sequence X_{t-u+1},X_{t-u+2},\ldots,X_{t} to generate a hidden feature vector H, which is then used for predicting the future SST records X_{t+1},X_{t+2},\ldots,X_{t+v}.

As illustrated in Fig. 3, the encoder consists of L dilated ConvLSTM layers. We introduce a concatenation operation between adjacent dilated ConvLSTM layers, which concatenates the outputs of all preceding layers to form the input of the next layer. The number of dilated ConvLSTM layers is set according to experiments; specifically, the encoder in our model consists of two dilated ConvLSTM layers, i.e., L=2. The decoder consists of three dilated ConvLSTM operations and a fully connected layer, and the connection in the decoder is feed-forward.

C. Dilated ConvLSTM Layer

Existing SST prediction models usually use single-scale feature extraction and cannot learn the hidden features well, leading to low prediction accuracy. To overcome this limitation, each dilated ConvLSTM layer in D2CL uses three feature extractors, i.e., dilated ConvLSTM (DC) blocks, with different dilation rates to extract multiscale hidden features.

1) Dilated ConvLSTM Block

Each DC block consists of batch normalization (BN), a dilated ConvLSTM, and a rectified linear unit (ReLU). The BN operation normalizes the data to reduce overfitting. The dilated ConvLSTM operation learns features from the data with a large receptive field. The ReLU operation filters out the negative values to improve model efficiency.
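As an illustration, a DC block of this shape can be sketched with tf.keras, whose ConvLSTM2D layer accepts a dilation_rate argument; the article reports a TensorFlow implementation, but the filter count and other hyperparameters below are our assumptions rather than the paper's settings.

```python
import tensorflow as tf

def dc_block(x, filters=16, kernel_size=3, dilation_rate=1):
    """One DC block: batch normalization, dilated ConvLSTM, then ReLU.

    x has shape (batch, time, rows, cols, channels); return_sequences=True
    keeps the per-day outputs so that blocks can be stacked.
    """
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ConvLSTM2D(
        filters, kernel_size,
        dilation_rate=dilation_rate,
        padding="same",
        return_sequences=True)(x)
    return tf.keras.layers.Activation("relu")(x)
```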

Fig. 4 illustrates a 2-D dilated convolutional operation, which is constructed by inserting “holes” (zeros) between the pixels, corresponding to the grid regions, of the convolutional kernel. Fig. 4(a)–(c) shows the normal convolutional kernel, the convolutional kernel with one hole, and the convolutional kernel with two holes, respectively. Generally, for a convolutional kernel of size k \times k, the size of the corresponding dilated kernel is k_d \times k_d, where k_d is computed as \begin{equation*} k_d = k + (k-1)(r-1) \tag{2} \end{equation*}

where r is called the dilation rate.
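For instance, (2) says that a 3 \times 3 kernel dilated at rates 1, 2, and 3 spans 3 \times 3, 5 \times 5, and 7 \times 7 receptive fields, respectively, which the following one-line helper (an illustration of the formula, not code from the article) confirms.

```python
def dilated_kernel_size(k: int, r: int) -> int:
    """Effective size k_d of a k x k kernel with dilation rate r, per (2)."""
    return k + (k - 1) * (r - 1)

assert [dilated_kernel_size(3, r) for r in (1, 2, 3)] == [3, 5, 7]
```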

Fig. 4. Normal convolutional kernel and dilated convolutional kernels.

The dilated convolutional operation is denoted as \begin{equation*} Y = X \ast _l W_{k_d} \tag{3} \end{equation*}

where X is the input data, W_{k_d} is the kernel filter, \ast _l is the dilated convolutional operator, and Y is the output result.

The structure of the dilated ConvLSTM operation is illustrated in the DC block of Fig. 3. It consists of multiple sequential dilated ConvLSTM cells. The output of the encoder is the output H of the last dilated ConvLSTM cell in the last layer of the encoder. H contains all the hidden features from the previous layers and is then used as the input of the decoder.

Example 2:

We use the historical SST data of 30 days to predict the SST records within next seven days, i.e., u=30 and v=7. In this case, the dilated ConvLSTM operation in the encoder has 30 dilated ConvLSTM cells, and the dilated ConvLSTM operation in the decoder has seven dilated ConvLSTM cells.

The structure of the dilated ConvLSTM cell is illustrated in Fig. 5, where the forget gate f_t decides which information to discard from the cell state; the input gate i_t decides which new information to store in the cell state; \tilde{C}_t is a new candidate value that, scaled by the input gate, updates the cell state C_t; and the output gate o_t decides which parts of the cell state to output.

Fig. 5. Structure of the dilated ConvLSTM cell.

The gates, i.e., the input gate, forget gate, and output gate, and the internal state in the dilated ConvLSTM are formulated by the following equations: \begin{align*} i_t &=\sigma \left(W_{xi}\ast _l X_t+W_{hi}\ast _l h_{t-1}+W_{ci}\circ C_{t-1}+b_i\right) \\ f_t &=\sigma \left(W_{xf}\ast _l X_t+W_{hf}\ast _l h_{t-1}+W_{cf}\circ C_{t-1}+b_f\right) \\ \tilde{C}_t &= \tanh \left(W_{xc}\ast _l X_t+W_{hc}\ast _l h_{t-1}+b_c\right) \\ C_t &=f_t\circ C_{t-1}+i_t\circ \tilde{C}_t \\ o_t &=\sigma \left(W_{xo}\ast _l X_t + W_{ho} \ast _l h_{t-1} + W_{co} \circ C_t + b_o \right) \\ h_t &=o_t\circ \tanh \left(C_t\right) \end{align*}

where i_t, f_t, and o_t are the input gate, forget gate, and output gate, respectively; C_t is the cell state; C_{t-1} is the previous cell state; \tilde{C}_t is the candidate state; h_t is the hidden state; the W terms are weight matrices and the b terms are biases; X_t is the current input data; h_{t-1} is the previous hidden output; and \circ denotes the elementwise (Hadamard) product.
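Written out directly from these equations, one step of the cell could look like the following NumPy sketch, where dconv is an assumed helper implementing the dilated convolution of (3) and the dictionary keys mirror the subscripts above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_cell_step(x_t, h_prev, c_prev, W, b, dconv):
    """One dilated ConvLSTM cell step; dconv(x, w) is an assumed dilated
    convolution, and W, b are dicts keyed like the equation subscripts."""
    i = sigmoid(dconv(x_t, W["xi"]) + dconv(h_prev, W["hi"])
                + W["ci"] * c_prev + b["i"])        # input gate i_t
    f = sigmoid(dconv(x_t, W["xf"]) + dconv(h_prev, W["hf"])
                + W["cf"] * c_prev + b["f"])        # forget gate f_t
    c_tilde = np.tanh(dconv(x_t, W["xc"]) + dconv(h_prev, W["hc"]) + b["c"])
    c = f * c_prev + i * c_tilde                    # new cell state C_t
    o = sigmoid(dconv(x_t, W["xo"]) + dconv(h_prev, W["ho"])
                + W["co"] * c + b["o"])             # output gate o_t
    h = o * np.tanh(c)                              # hidden state h_t
    return h, c
```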

2) Multiscale Feature Extraction

Fig. 6(a) presents the feature extraction ranges of three DC blocks B_{m,1}, B_{m,2}, and B_{m,3} with the same dilation rate r=2. In this case, as illustrated by the figure, the features of many grids cannot be well extracted in each block, which is the well-known “gridding” issue [20].

Fig. 6. Feature extraction ranges of B_{m,1}, B_{m,2}, and B_{m,3} with varying dilation rates, where the grids marked in blue contribute to the calculation of the center grid marked in red through three DC blocks with kernel size 3 \times 3. (a) Feature extraction ranges of DC blocks B_{m,1}, B_{m,2}, and B_{m,3} with the same dilation rate r=2. (b) Feature extraction ranges of DC blocks B_{m,1}, B_{m,2}, and B_{m,3} with dilation rates r=1, 2, and 3, respectively.

To solve this issue, DC blocks should have different dilation rates to make the receptive fields of a series of dilated convolutions fully cover all the grids in the last block B_{m,3}. Therefore, in D2CL, we set the dilation rates of blocks B_{m,1}, B_{m,2}, and B_{m,3} to 1, 2, and 3, respectively. Such a design enables D2CL to extract all the features, as illustrated in Fig. 6(b).
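Reusing the dc_block sketch above, one such layer might be wired as follows; stacking the three blocks with rates 1, 2, and 3 and concatenating their outputs along the channel axis matches (5), while the filter width remains an assumed placeholder.

```python
def dilated_convlstm_layer(x, filters=16, kernel_size=3):
    """One dilated ConvLSTM layer: three stacked DC blocks with dilation
    rates 1, 2, and 3, whose outputs are concatenated as in (5)."""
    b1 = dc_block(x, filters, kernel_size, dilation_rate=1)
    b2 = dc_block(b1, filters, kernel_size, dilation_rate=2)
    b3 = dc_block(b2, filters, kernel_size, dilation_rate=3)
    return tf.keras.layers.Concatenate(axis=-1)([b1, b2, b3])
```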

D. Dense Connection

A drawback of the existing deep learning models for SST prediction is that the gradient cannot propagate directly from the later layers to the earlier layers. This may impede information transmission within the model and further degrade the prediction performance.

To address this issue, we build direct connections from each layer to all the subsequent layers. By doing this, the dilated ConvLSTM layer L_m can receive the outputs of all the preceding layers L_0,L_1,\ldots,L_{m-1} as inputs, i.e., \begin{equation*} I_{L_m} = O_{L_0} \oplus O_{L_1} \oplus \ldots \oplus O_{L_{m-1}} \tag{4} \end{equation*}

where I_{L_m} is the input of layer L_m, and \oplus is a concatenation operation.

According to Fig. 3, each dilated ConvLSTM layer contains three DC blocks. Therefore, the output of layer L_m consists of the output of three DC blocks O_{B_{m,1}},O_{B_{m,2}},\text{ and }O_{B_{m,3}}, i.e., \begin{equation*} O_{L_m} = O_{B_{m,1}} \oplus O_{B_{m,2}} \oplus O_{B_{m,3}} \tag{5} \end{equation*}

where O_{L_m} represents the output of layer L_m.

Example 3:

Assume that we use the historical SST data of 30 days to predict the SST records in the next seven days, and the raw historical SST data form the input of the first dilated ConvLSTM layer L_1. For ease of representation, we denote this input by O_{L_0}=I_{L_1}. Then, the output of layer L_1 is O_{L_1}=O_{B_{1,1}} \oplus O_{B_{1,2}} \oplus O_{B_{1,3}}. Similarly, the input and output of layer L_2 are I_{L_2}=O_{L_0} \oplus O_{L_1} and O_{L_2}=O_{B_{2,1}} \oplus O_{B_{2,2}} \oplus O_{B_{2,3}}, respectively; the input and output of layer L_3 are I_{L_3}=O_{L_0} \oplus O_{L_1} \oplus O_{L_2} and O_{L_3}=O_{B_{3,1}} \oplus O_{B_{3,2}} \oplus O_{B_{3,3}}, respectively.

With the dense connection mechanism in D2CL, we can maximize the information transfer between the model layers and further improve the prediction performance.
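A compact sketch of this dense wiring, building on the dilated_convlstm_layer sketch above (the layer count and filter width are placeholders), could be:

```python
def dense_encoder(x, num_layers=2, filters=16):
    """Densely connected encoder: layer L_m receives the concatenation of
    the raw input and the outputs of all preceding layers, as in (4)."""
    outputs = [x]
    for _ in range(num_layers):
        inp = outputs[0] if len(outputs) == 1 else \
            tf.keras.layers.Concatenate(axis=-1)(outputs)
        outputs.append(dilated_convlstm_layer(inp, filters))
    return outputs[-1]   # hidden features fed to the decoder
```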

SECTION V.

Experiments

Experiments over two real datasets have been conducted to evaluate the effectiveness of the proposed model. We first analyze the impact of major parameters on the performance of D2CL. Then, we compare D2CL with three baseline approaches, i.e., LSTM, CNN, and ConvLSTM, and one physical approach, i.e., MyOcean, to demonstrate its superiority.

A. Datasets

In our experiments, we use the Optimum Interpolation SST (OISST) dataset from the National Oceanic and Atmospheric Administration (NOAA).1 The OISST data contain the daily SST records from 1981 to 2015. Specifically, we selected the SST records in ECS and the South China Sea (SCS) to evaluate the D2CL model.

  1. ECS data: The ECS covers the area of [23^\circ 00^{\prime } N\text{--}33^\circ 10^{\prime } N, 122^\circ 50^{\prime } E\text{--}129^\circ 50^{\prime } E], and is divided into 40 \times 30 grids of size 0.25^\circ \times 0.25^\circ.

  2. SCS data: The SCS covers the area of [11^\circ 55^{\prime } N\text{--}21^\circ 50^{\prime } N, 111^\circ 00^{\prime } E\text{--}117^\circ 50^{\prime } E], and is divided into 40 \times 26 grids of size 0.25^\circ \times 0.25^\circ.

B. Training and Experimental Settings

Following the settings of the existing approaches [12], we also set u=30 and v=7, i.e., predicting the SST in the next seven days with the historical SST records of the preceding 30 days.

The D2CL model is trained to predict X_{t+1},\ldots,X_{t+v} with the historical SST records X_{t-u+1},X_{t-u+2},\ldots,X_{t} by minimizing the following loss function: \begin{equation*} \text{Loss} =||X_v - D(E(X_u))||^2 \tag{6} \end{equation*}

where X_v is the true SST, X_u are the historical SST records, E(\cdot) is the encoder operation of D2CL, and D(\cdot) is the decoder operation of D2CL. The loss function is optimized with the Nadam optimizer [21].

The SST records in both the ECS and SCS data are split into three subsets: 60% for training, 30% for validation, and 10% for testing. We use the mean squared error (MSE) as the loss function and set the maximum number of training epochs to 1000 in all experiments. Training stops early if the loss does not improve for ten consecutive epochs.
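Under these settings, the training could be wired up as in the sketch below, which reuses the dense_encoder sketch from Section IV. For brevity, the article's decoder (three dilated ConvLSTM operations plus a fully connected layer) is replaced here by a single ConvLSTM head, so this is a simplified stand-in rather than the exact D2CL decoder.

```python
inputs = tf.keras.Input(shape=(30, 40, 30, 1))     # u = 30 days of 40 x 30 grids
h = dense_encoder(inputs)                          # hidden features H
h = tf.keras.layers.ConvLSTM2D(7, 3, padding="same")(h)  # (batch, 40, 30, 7)
outputs = tf.keras.layers.Permute((3, 1, 2))(h)    # (batch, 7, 40, 30): v = 7 days
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="nadam", loss="mse")       # squared error of (6) with Nadam
# model.fit(X_train[..., None], Y_train, epochs=1000,
#           validation_data=(X_val[..., None], Y_val),
#           callbacks=[tf.keras.callbacks.EarlyStopping(patience=10)])
```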

All the experiments are conducted on a 64-core Intel Xeon processor with 512 GB of RAM and two NVIDIA RTX 2080 Ti GPUs. The D2CL model is implemented with TensorFlow 1.13.0. Although it takes hours to train the model, D2CL can produce predictions in real time once trained.

C. Evaluation Metrics

We use four evaluation metrics, i.e., MSE, root-mean-squared error (RMSE), MAE, and mean-absolute-percentage error (MAPE), to measure the performance of the SST prediction models. Let \hat{x} and x be the predicted value and the observed value, respectively. The four evaluation metrics are calculated as follows: \begin{align*} \text{MSE} &= \frac{1}{n} \sum _{i}(x_i - \hat{x}_i)^2 \tag{7} \\ \text{RMSE} &= \sqrt{\frac{1}{n} \sum _{i}(x_i - \hat{x}_i)^2} \tag{8} \\ \text{MAE} &= \frac{1}{n} \sum _i |x_i - \hat{x}_i| \tag{9} \\ \text{MAPE} &= \frac{100\%}{n}\sum _{i}\left|\frac{x_i-\hat{x}_i}{x_i}\right| \tag{10} \end{align*}

where n is the number of predictions. The smaller these metrics are, the better the models perform.
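These metrics are straightforward to compute; a small helper of our own (the MAPE of (10) assumes no observed value is zero) is:

```python
import numpy as np

def evaluate(x, x_hat):
    """MSE, RMSE, MAE, and MAPE of (7)-(10) for observed x, predicted x_hat."""
    x, x_hat = np.asarray(x), np.asarray(x_hat)
    err = x - x_hat
    mse = np.mean(err ** 2)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(err)),
        "MAPE": 100.0 * np.mean(np.abs(err / x)),
    }
```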

D. Varying the Kernel Size

The kernel size has a great influence on the receptive field: the larger the receptive field, the more features can be extracted. However, to capture more subtle features, we need to narrow the receptive field, i.e., reduce the kernel size. Table II gives the results of D2CL for different kernel sizes, where the best results are highlighted in boldface. The D2CL model with kernel size k=3 outperforms that with kernel size k=2. However, the model performs worse when the kernel size k is further increased to 5. Hence, increasing the kernel size does not always improve the prediction performance, and we set the kernel size to k=3 in the following experiments.

TABLE I Notations and Their Meanings

TABLE II Results of D2CL for Different Kernel Sizes

E. Varying the Number of Layers in Encoder

Table III presents the results of D2CL while varying the number of layers in the encoder, where the best results are highlighted in bold. Compared with the one-layer D2CL, the two-layer D2CL decreases the MSE, RMSE, MAE, and MAPE by 0.02, 0.01, 0.01, and 0.11, respectively. However, the three-layer D2CL achieves worse performance than the two-layer one.

TABLE III Results of D2CL for Different Numbers of Layers in the Encoder

In general, increasing the depth of the model helps to improve prediction accuracy. However, as the depth increases, the vanishing gradient problem emerges and affects the updates of the model parameters. In addition, stacking three or more layers costs considerably more computational resources. Therefore, in the following experiments, we use the D2CL model with two layers by default.

F. Effectiveness of Enhancing Techniques

To validate the effectiveness of the proposed enhancing techniques, i.e., dense connection (D), triple extractors (T), and dilated convolution (C), we evaluate the performance of D2CL without each of them. According to Table IV, after removing D, T, and C, respectively, the resulting models D2CL-D, D2CL-T, and D2CL-C all perform worse than D2CL. These results demonstrate that all three techniques improve the prediction accuracy of D2CL.

TABLE IV Results of D2CL and Its Variants

G. Model Comparison

Table V tabulates the results of LSTM [12], CNN [13], ConvLSTM [19], and D2CL on the ECS data. D2CL outperforms the other three models in terms of all four evaluation metrics. Specifically, the MSE of D2CL is only 0.39, while those of LSTM, CNN, and ConvLSTM are 0.48, 0.62, and 0.63, respectively.

TABLE V Results of LSTM, CNN, ConvLSTM, and D2CL on ECS Data

Similarly, Table VI tabulates the results of LSTM, CNN, ConvLSTM, and D2CL on the SCS data, among which D2CL achieves the best performance.

TABLE VI Results of LSTM, CNN, ConvLSTM, and D2CL on SCS Data

Furthermore, we compare the performance of CNN, ConvLSTM, and D2CL while varying the number of days to be predicted. Figs. 7 and 8 present the results of the three models on the ECS data and SCS data, respectively. According to both figures, the performance of all three models decreases as the number of predicted days increases. D2CL achieves the best performance when predicting SST for the next two or more days. For one-day prediction, however, ConvLSTM is better than D2CL. This is because D2CL takes advantage of the spatio-temporal relationships among the days to be predicted to achieve better prediction; for one-day prediction, there is no such information, which makes D2CL perform worse than ConvLSTM.

Fig. 7. Prediction results in ECS while varying the number of days to be predicted.

Fig. 8. Prediction results in SCS while varying the number of days to be predicted.

H. Comparison With the Physical Model

We conducted more experiments to compare the performance of our model with the physical approach MyOcean, using OISST as the ground truth. The results of MyOcean were downloaded directly from the Copernicus Marine Service2 and cover the period from January 2015 to December 2019. We again consider ECS and use a spatial resolution of 0.25^\circ \times 0.25^\circ.

According to Table VII, MyOcean achieves quite good performance in terms of MAE and works better than CNN and ConvLSTM. Our D2CL model, however, reaches a higher prediction accuracy than the other three methods in terms of all four evaluation metrics.

TABLE VII Results of MyOcean, CNN, ConvLSTM, and D2CL on ECS Data

I. Correlation Analysis

Here, we apply the historical SST data of 30 days {\bf X}_u=(X_1,X_2,\ldots,X_{30}) to predict the SST of the next seven days {\bf Y}_v=(Y_1,Y_2,\ldots,Y_7), where X_i (i=1,2,\ldots,30) represents the SST records of all grid regions on the ith historical day and Y_j (j=1,2,\ldots,7) corresponds to the SST records of all grid regions on the jth future day. Therefore, each data sample of both the ECS data and the SCS data consists of a pair ({\bf X}_u, {\bf Y}_v), where u=30 and v=7. To explore the correlation between each day X_i in {\bf X}_u and each day Y_j in {\bf Y}_v, we calculate the Pearson product-moment correlation coefficient \text{pccs}_{i,j} between X_i and Y_j as follows: \begin{equation*} \text{pccs}_{i,j} = \frac{\text{cov}(X_i,Y_j)}{\rho (X_i)\rho (Y_j)} \tag{11} \end{equation*}

where \text{cov}(X_i,Y_j) is the covariance between X_i and Y_j, \rho (X_i) is the standard deviation of X_i, and \rho (Y_j) is the standard deviation of Y_j.
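A direct way to reproduce such a heatmap is sketched below. This is our own reading of the procedure: each day's grid is flattened, the coefficient is computed per sample, and the coefficients are then averaged over all samples.

```python
import numpy as np

def correlation_heatmap(X_u, Y_v):
    """Average Pearson coefficient pccs_{i,j} of (11) between historical day i
    and future day j; X_u has shape (n, 30, R, C), Y_v has shape (n, 7, R, C)."""
    n, u = X_u.shape[:2]
    v = Y_v.shape[1]
    heat = np.zeros((v, u))
    for s in range(n):                       # per-sample coefficients
        for i in range(u):
            xi = X_u[s, i].ravel()           # flatten day i's grid
            for j in range(v):
                heat[j, i] += np.corrcoef(xi, Y_v[s, j].ravel())[0, 1]
    return heat / n                          # rows: future days, cols: historical days
```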

Figs. 9 and 10 show the heatmaps of the average \text{pccs}_{i,j} for all the samples on the ECS data and SCS data, respectively. In the two figures, the horizontal axis corresponds to the 30 historical days, whereas the vertical axis corresponds to the seven days to be predicted.

Fig. 9. Correlation of the historical SST records with the predicted future SST records on the ECS data.

Fig. 10. Correlation of the historical SST records and the predicted future SST records on the SCS data.

From Fig. 9, we can see that the odd future days, i.e., the 1st, 3rd, 5th, and 7th future days, share a similar pattern of correlation with the historical days, and the even future days, i.e., the 2nd, 4th, and 6th future days, likewise share a similar pattern. Specifically, the odd future days have strong correlation with the 2nd to 5th, 11th to 13th, 20th to 23rd, and 28th to 30th historical days, whereas the even future days have strong correlation with the 5th to 9th, 14th to 19th, and 23rd to 27th historical days. There is no clear reason why the SST changes are periodic on the ECS data, because the dynamics of SST are very complicated. To our knowledge, such periodicity may be due to regular sea surface winds or ocean currents.

However, the periodic correlation patterns on the SCS data, as shown in Fig. 10, are not as obvious as those on the ECS data. That is, by comparing Figs. 9 and 10, it is clear that the correlation between the future days and the historical days is stronger on the ECS data than on the SCS data. On the one hand, we speculate that the SST in SCS is influenced by both the Indian Ocean Warm Pool and the Pacific Ocean, which leads to weak periodic patterns. On the other hand, the dynamics of SST in SCS are weaker than those in ECS, and the predictions on the SCS data are thus more accurate than those on the ECS data.

J. Visualization Analysis

To provide a clear view of the advantage of D2CL, we visualize the absolute errors, i.e., the MAE, of CNN, ConvLSTM, and D2CL over seven days on the ECS data [cf. Fig. 11(a)]. In each subplot, the x-axis and y-axis correspond to the longitude and latitude, respectively, and each subplot has 40 \times 30 grids of size 0.25^\circ \times 0.25^\circ. Blue indicates a small absolute error, whereas red indicates a large one. According to the visualization, the errors of the CNN model are obvious, especially for the 1st, 6th, and 7th days. Although most predictions of the ConvLSTM model are close to the observed values, there are still some obvious errors in the areas on the left side of ECS. Compared with the CNN and ConvLSTM models, the predicted results of our D2CL model are almost the same as the true values on the ECS data.

Fig. 11. Absolute errors of CNN, ConvLSTM, and D2CL in seven predicted days. (a) ECS. (b) SCS.

We also visualize the prediction errors of CNN, ConvLSTM, and D2CL when predicting SST for the next seven days on the SCS data [cf., Fig. 11(b)]. Both the ConvLSTM model and D2CL model achieve better prediction performance than CNN.

SECTION VI.

Conclusion

In this work, we proposed a new SST prediction model, D2CL, which can learn spatial and temporal features simultaneously via the dilated ConvLSTM operation. The D2CL model uses multiple feature extractors of different dilation rates to learn features of multiple scales and introduces dense connections to maximize the information transfer between model layers. According to the experiments on real datasets, our model outperforms the existing methods. In the future, we will try to exploit more external features, e.g., wind speed and shortwave radiation, to further improve the prediction accuracy.
