Introduction
Energy production and use is the single biggest contributor to global warming, accounting for roughly two-thirds of human-induced greenhouse gas emissions [1]. The International Energy Agency estimates that a push for electric mobility, electric heating, and electricity access could lead to a 90% rise in power demand by 2040 [2]. Furthermore, the U.S. Energy Information Administration (EIA) estimates that the industrial and commercial sectors consume 50% of the total energy production [3]. Hence, efficient energy management in buildings will prove crucial in combating environmental hazards such as environmental degradation and carbon dioxide emissions [4].
In addition to environmental impact, energy efficiency measures in buildings provide economic benefits in terms of reduced overall operating costs. Large electricity consumers commonly pay premium prices for demand peaks. For example, the Independent Electricity System Operator (IESO) charges its large consumers fees based on their contribution to the top five province-wide demand peaks, while another category of consumers is charged premiums based on their monthly peak [5].
Load forecasting has been attracting research and industry attention because of its importance for energy production planning and scheduling. The rapid increase of smart meter use has created opportunities for load forecasting at the individual building and household level, thus facilitating budget planning, identifying savings opportunities, and reducing the energy footprint.
Energy consumption data from smart meters is used, often together with meteorological data, to build models capable of inferring future energy consumption. One way of building these systems is by training Machine Learning (ML) models using historical data and then using these trained models to predict future loads. If the model produces a predicted load pattern similar to the actual one, the interested parties can make cost-effective decisions based on these predicted values. Examples of ML models for load forecasting include Neural Networks (NN) [6], Support Vector Regression (SVR) [7], and deep learning [8].
Feedforward neural networks (FFNN) [9] and Deep Neural Networks (DNNs) [8] have achieved favorable results [8], [9]; however, FFNN and many Deep Learning (DL) architectures are not designed to capture time dependencies since they only take the current input to calculate predictions. Recurrent Neural Networks (RNNs) are capable of capturing time dependencies as their nodes establish a directed graph along a sequence [10]. This allows RNNs to consider the current input along with the previously received inputs and makes them suitable for time-series data.
While RNNs have an advantage over DNNs in analyzing temporal dynamic behaviour [11], in language translation a Sequence-to-Sequence (S2S) RNN, which combines an encoder RNN and a decoder RNN, has shown greater success [12]. The encoder RNN is tasked with encoding information into a fixed-length vector, which the decoder RNN uses to sequentially produce translation outputs [12]. However, in these models the encoder is burdened with compressing all necessary information into this fixed-length vector. In language translation, this was addressed using attention mechanisms [13], [14], which allow the decoder to look back at the encoder outputs to find the most relevant information.
This paper proposes an S2S RNN with Attention for load forecasting and evaluates the prediction accuracy of different attention mechanisms with varied forecasting horizons. The S2S RNN from neural machine translation, which is a classification task, is adapted for the regression task of load forecasting. To accommodate the S2S RNN, a sample generation approach based on a sliding window is applied. An attention mechanism is added to ease the connection between the encoder and decoder. Bahdanau attention [13] and three variants of Luong attention [14] are considered, as well as three RNN cells: vanilla RNN, Gated Recurrent Units (GRU), and Long-Short-Term Memory (LSTM). The results show that S2S with Bahdanau attention outperforms DNNs, S2S RNNs, and S2S with other attention mechanisms. As expected, the accuracy decreases as the forecasting horizon expands; however, an increased input sequence length does not always lead to improved accuracy.
The rest of the paper is organized as follows: Section II discusses the background, Section III presents the related work, Section IV describes the methodology, Section V explains the experiments and corresponding results, and finally Section VI concludes the paper.
Background
This section introduces RNNs, S2S RNNs, and attention mechanisms.
A. Recurrent Neural Networks
Recurrent Neural Networks have an architecture similar to FFNN, but with the addition of recurrent connections to the same neurons in the previous time step. The output at each time step is based on both the current input and the input at the previous time steps, making RNNs good at modeling temporal behaviours found in time series data.
As illustrated in Fig. 1, RNNs take a sequence of inputs and, at each time step $t$, compute the output from the current input $x_{[t]}$ and the previous hidden state $h_{[t-1]}$:\begin{equation*} y_{[t]}= f^\circ (x_{[t]}, h_{[t-1]})\tag{1a}\end{equation*}
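To make (1a) concrete, the following is a minimal sketch of unrolling a vanilla RNN cell over a sequence in PyTorch; the feature and hidden sizes are illustrative only.

import torch
import torch.nn as nn

# Unroll an RNN cell over a sequence of T inputs, as in (1a).
# f (features) and h (hidden size) are illustrative values.
f, h, T = 9, 64, 12
cell = nn.RNNCell(input_size=f, hidden_size=h)

x = torch.randn(T, 1, f)      # sequence of T inputs (batch of 1)
h_t = torch.zeros(1, h)       # initial hidden state
outputs = []
for t in range(T):
    h_t = cell(x[t], h_t)     # h_[t] depends on x_[t] and h_[t-1]
    outputs.append(h_t)       # output y_[t] at each time step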
Traditional or Vanilla RNNs are mainly trained using back-propagation through time (BPTT) [15]; however, this method can lead to the vanishing gradient problem for longer sequences [16]. Long-Short-Term Memory (LSTM) networks [17] were designed to overcome this problem; therefore, they are capable of storing information for longer periods of time and the model can make better predictions.
LSTMs are comprised of cells which contain gates responsible for learning which data in a given sequence should be kept and which data can be forgotten. The LSTM cell contains three gates (input $i$, forget $f$, and output $o$) together with a candidate state $g$, and operates as follows:\begin{align*} i_{[t]}=&\sigma (W_{xi}x_{[t]} + b_{xi} + W_{hi}h_{[t-1]} + b_{hi})\tag{2a}\\ f_{[t]}=&\sigma (W_{xf}x_{[t]} + b_{xf} + W_{hf}h_{[t-1]} + b_{hf})\tag{2b}\\ g_{[t]}=&\tanh (W_{xg}x_{[t]} + b_{xg} + W_{hg}h_{[t-1]} + b_{hg})\tag{2c}\\ o_{[t]}=&\sigma (W_{xo}x_{[t]} + b_{xo} + W_{ho}h_{[t-1]} + b_{ho})\tag{2d}\\ c_{[t]}=&f_{[t]}\odot c_{[t-1]} + i_{[t]}\odot g_{[t]} \tag{2e}\\ h_{[t]}=&o_{[t]}\odot \tanh (c_{[t]})\tag{2f}\end{align*} where $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, $W$ and $b$ are the weights and biases, and $c_{[t]}$ is the cell state.
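For clarity, the gate computations (2a)-(2f) can be written directly with tensor operations; this is a sketch with illustrative sizes, randomly drawn weights, and the two bias terms of each gate merged into one.

import torch

# Direct evaluation of the LSTM cell equations (2a)-(2f); sizes illustrative.
f, h = 9, 64
x_t = torch.randn(f)                            # current input x_[t]
h_prev, c_prev = torch.zeros(h), torch.zeros(h) # previous hidden and cell states

W = {g: torch.randn(h, f) for g in "ifgo"}      # input-to-hidden weights
U = {g: torch.randn(h, h) for g in "ifgo"}      # hidden-to-hidden weights
b = {g: torch.zeros(h) for g in "ifgo"}         # biases (merged)

i_t = torch.sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # (2a) input gate
f_t = torch.sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # (2b) forget gate
g_t = torch.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])      # (2c) candidate state
o_t = torch.sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # (2d) output gate
c_t = f_t * c_prev + i_t * g_t                                 # (2e) cell state
h_t = o_t * torch.tanh(c_t)                                    # (2f) hidden state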
To simplify the LSTM model, the Gated Recurrent Unit (GRU) was recently introduced [18]. GRU merges the memory state and hidden state into a single hidden state and combines the input and forget gates into an update gate. As GRUs have fewer parameters, convergence is achieved faster than with LSTMs; nevertheless, GRUs contain sufficient gates and states for long-term memory retention.
B. Sequence to Sequence RNNs
Sequence to Sequence (S2S or Seq2Seq) RNNs [12] consist of an encoder and a decoder RNN, as illustrated in Fig. 2. A sequence of inputs $x_{[{1}]},\ldots ,x_{[T]}$ is passed to the encoder, which processes each element and produces hidden states $h_{[j]}$; these hidden states are then summarized into a context vector $\vec {c}$:\begin{align*} h_{[j]}=&f^{*}(x_{[j]}, h_{[j-1]}) \tag{3}\\ \vec {c}=&q(\{h_{[{1}]},\ldots ,h_{[T]}\})\tag{4}\end{align*}
The context vector is an encoded representation of the input sequence that is passed to the decoder RNN, which extracts information at each unraveled time step to obtain the output sequence $\dot {y}_{[{1}]},\ldots ,\dot {y}_{[N]}$:\begin{equation*} \dot {y}_{[i]} = g^{*}(\dot {y}_{[i-1]}, {h^{*}}_{[i-1]} )\tag{5}\end{equation*}
The use of two RNNs strengthens consecutive sequence prediction, while also allowing the time dimensionality of inputs and outputs to vary [12]. Although load forecasting does not require varying lengths, it can benefit from strong consecutive sequence prediction.
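As an illustration of the encoder-decoder idea in (3)-(5), the following is a minimal sketch with GRU cells (a hypothetical configuration, not the exact architecture evaluated later): the encoder summarizes the input sequence into a context vector, which initializes the decoder that then predicts $N$ values one step at a time.

import torch
import torch.nn as nn

# Minimal S2S sketch with GRU cells; sizes are illustrative.
f, h, T, N = 9, 64, 48, 12
encoder = nn.GRUCell(input_size=f, hidden_size=h)
decoder = nn.GRUCell(input_size=1, hidden_size=h)  # decoder input: previous prediction
out_layer = nn.Linear(h, 1)

x = torch.randn(T, 1, f)       # input sequence (batch of 1)
h_enc = torch.zeros(1, h)
for j in range(T):             # encoder, Eq. (3)
    h_enc = encoder(x[j], h_enc)
context = h_enc                # Eq. (4): context taken from the final hidden state

y_prev = torch.zeros(1, 1)     # initial decoder input
h_dec = context                # decoder initialized with the context vector
predictions = []
for i in range(N):             # decoder, Eq. (5)
    h_dec = decoder(y_prev, h_dec)
    y_prev = out_layer(h_dec)
    predictions.append(y_prev)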
C. Attention Mechanism
In S2S models, the encoder is responsible for compressing all significant information of an input sequence into a single context vector; for long sequences, this fixed-length vector becomes a bottleneck. Attention mechanisms address this by allowing the decoder to consider all encoder hidden states when producing each output.
The Bahdanau [13] attention (BA) mechanism was the first form of attention for S2S models. With BA, the encoder hidden states $h_{[j]}$ are used to compute a separate context vector $c^{b}_{[i]}$ for each decoder time step $i$; the decoder hidden state and output are computed as:\begin{align*} {h}^{b}_{[i]}=&f^{b}([\dot {y}_{[i-1]}; c^{b}_{[i]}], h_{[i-1]}^{b}) \tag{6}\\ \dot {y}_{[i]}=&g^{b}(\dot {y}_{[i-1]}, c^{b}_{[i]}, h^{b}_{[i]})\tag{7}\end{align*}
The context vector $c^{b}_{[i]}$ is computed as a weighted sum of the encoder hidden states:\begin{equation*} c^{b}_{[i]} = \sum _{j=1}^{T} \alpha ^{b}_{[ij]}h_{[j]}\tag{8}\end{equation*} where the attention weights are obtained by applying the Softmax function to the attention energies:\begin{equation*} \alpha ^{b}_{[ij]} = \frac {\text {exp}(e^{b}_{[ij]})}{\sum _{m=1}^{T} \text {exp}(e^{b}_{[im]})}\tag{9}\end{equation*} and the energies are computed with a score function $S$:\begin{equation*} e^{b}_{[ij]} = S(h^{b}_{[i-1]}, h_{[j]})\tag{10}\end{equation*} The attention weight $\alpha ^{b}_{[ij]}$ reflects the importance of the encoder hidden state $h_{[j]}$ for producing the output at decoder step $i$.
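The following sketch shows how the context vector in (8)-(10) can be computed for one decoder step with an additive score function in the spirit of [13]; the dimensions and parameterization are illustrative.

import torch
import torch.nn as nn

# Bahdanau-style attention for one decoder step, Eqs. (8)-(10); sizes illustrative.
T, h = 48, 64
H = torch.randn(T, h)          # encoder hidden states h_[1..T]
h_dec_prev = torch.randn(h)    # previous decoder hidden state h^b_[i-1]

W = nn.Linear(2 * h, h)        # additive score parameters
v = torch.randn(h)

# Score S(h^b_[i-1], h_[j]) for every encoder position j, Eq. (10)
concat = torch.cat([h_dec_prev.expand(T, h), H], dim=1)   # (T, 2h)
energies = torch.tanh(W(concat)) @ v                      # (T,)

alphas = torch.softmax(energies, dim=0)                   # Eq. (9)
context = alphas @ H                                      # Eq. (8): weighted sum of h_[j]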
Luong et al. [14] developed global and local attention-based models for machine translation, differing in whether the attention is concentrated on a few input positions or on all of them. In the remainder of this paper, Luong attention (LA) will refer to the global model. The main difference between BA and LA is that BA applies the attention mechanism before the variables are passed through the respective RNN cell, while LA applies the mechanism to the outputs of that cell. Luong et al. [14] presented attention variants differing in the score functions:\begin{align*} S(h^{l}_{[i]}, h_{[j]})= \begin{cases} \displaystyle {h^{l}_{[i]}}^\intercal h_{[j]} ~& \textit {dot}\\ \displaystyle {h^{l}_{[i]}}^\intercal W(h_{[j]}) & \textit {general}\\ \displaystyle {v}^\intercal \tanh (W(cat(h^{l}_{[i]},h_{[j]} ))) & \textit {concat} \end{cases} \\\tag{11}\end{align*}
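The three score functions in (11), evaluated for a single decoder hidden state against all encoder hidden states, can be sketched as follows (parameterization illustrative).

import torch
import torch.nn as nn

# Luong score functions from Eq. (11); sizes are illustrative.
T, h = 48, 64
H = torch.randn(T, h)          # encoder hidden states h_[j]
h_dec = torch.randn(h)         # current decoder hidden state h^l_[i]

W_gen = nn.Linear(h, h, bias=False)
W_cat = nn.Linear(2 * h, h, bias=False)
v = torch.randn(h)

score_dot = H @ h_dec                              # dot
score_general = W_gen(H) @ h_dec                   # general
score_concat = torch.tanh(
    W_cat(torch.cat([h_dec.expand(T, h), H], dim=1))) @ v   # concat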
Related Work
This section discusses related load forecasting works as well as the S2S models in other domains.
A. Load Forecasting
Load forecasting can be classified into three main categories: short, medium, and long-term [19], [20]; however, there is no clear distinction between these categories. In our work, short-term refers to the next hour, medium-term to the next few hours up to a day ahead, and long-term to a day or more ahead.
There are many approaches to load forecasting (physics-, statistics-, and machine learning-based), but this section focuses on machine learning-based models as our work belongs to this category. Support Vector Regression (SVR) and NNs have been very popular: several studies considered NN and SVM models for estimating energy loads [9], [21] and some compared their performance [9]. The accuracy and conclusions varied depending on data sets, features, system architectures, and similar factors. NN applications for energy forecasting are not new [22]–[24], but as the field of neural networks and deep learning has been evolving rapidly, so has NN-based forecasting. Jetcheva et al. [25] proposed a NN model for day-ahead building-level load forecasting with an ensemble-based approach for parameter selection, whereas Chae et al. [26] and Yuan et al. [27] considered a NN model with a Bayesian regularization algorithm. Araya et al. [28] proposed an ensemble framework for anomaly detection in building energy consumption; they included prediction-based classifiers (SVR and random forest) as their base forecasting models. Convolutional Neural Networks (CNNs) have also been used for load forecasting [29]; they outperform SVM models while achieving comparable results to NNs and other deep learning methods [30]. Approaches based on AutoRegressive Integrated Moving Average (ARIMA) have also been proposed [31].
Recently, RNNs have been gaining popularity for load forecasting because of their ability to capture time dependencies in data. Kong et al. [32] proposed an LSTM-based RNN model for short-term residential load forecasting. Likewise, Shi et al. [33] also focused on short-term forecasting; they proposed a novel pooling-based deep recurrent neural network (PDRNN) for residential consumers. Short to medium-term aggregate load forecasting was considered by Bouktif et al. [34]; they coupled a standard LSTM model with a genetic algorithm (GA). In a different work, the same authors [35] proposed an RNN with multiple sequences of inputs to capture the most relevant time lags. Yu et al. [36] combined GRU with dynamic time warping (DTW) for daily peak load forecasting. A time-dependency convolutional neural network (TD-CNN) and a cycle-based long short-term memory (C-LSTM) network have also been used to improve the accuracy of short-term load forecasting [37].
As can be seen from recent works on load forecasting [32], [33], [36], RNNs have been outperforming other approaches. Our work differs by focusing on S2S RNNs, which have shown great success in modeling time dependencies in language translation. Marino et al. [19] used standard LSTM and LSTM-based S2S models for residential load forecasting; our work differs by means of different sample generation, a different connection of the encoder and decoder, the use of attention mechanisms, and a longer prediction sequence length. Zheng et al. [38] proposed a hybrid algorithm that combines similar days (SD) selection, empirical mode decomposition (EMD), and LSTM neural networks. Whereas the work of Zheng et al. [38] proposed a unique hybrid model, its S2S LSTM-based model is identical to the one used by Marino et al. [19]. Rahman et al. [39] developed two S2S LSTM-based models for medium to long-term forecasting. Our work differs from all three S2S works [19], [38], [39] by means of different sample generation, a different connection of the encoder and decoder, and the use of attention mechanisms. In our previous work [40], we presented initial results on S2S RNNs for energy forecasting; in contrast, this work focuses on adding attention mechanisms to the S2S models and evaluating their performance with different RNN cells and forecasting lengths.
B. Sequence-to-Sequence Models
Sequence to Sequence models have been used not only for load forecasting, but also for a number of other tasks. Also known as encoder-decoder RNNs, these models have become increasingly popular in tackling classification problems. They were originally developed by Cho et al. [18] to improve the performance of statistical machine translation (SMT). That work proposed not only a novel model architecture, but also a novel RNN cell structure, which later became known as the GRU unit. The work by Sutskever et al. [12] introduced a slight variation of the S2S RNN framework for translation from English to French. In contrast to Cho et al. [18], Sutskever et al. [12] used LSTM in place of GRU; moreover, the two works differ in how they connect the encoder and decoder.
Examples of S2S use in other domains include the work of Venugopalan et al. [41] on LSTM-based S2S models for generating descriptions of real-world videos and the work of Kawano et al. [42] on predicting changes in protein stability.
While S2S RNN models have found success in several domains, the encoder in S2S is burdened with the need to represent all information in a fixed-length vector. Thus, an attention mechanism, otherwise known as an “alignment model”, was added to these models by Bahdanau et al. [13] and also by Luong et al. [14]: these mechanisms have been described in Section II. Both were designed for machine translations, whereas our work adapts them for load forecasting. Moreover, we evaluate performance of different attention models with different cell types and different time horizons.
Methodology
This section first introduces the features and evaluation process. Next, the sample generation and the proposed BA and LA S2S RNNs for load forecasting are described.
A. Features and Evaluation Process
Data sets obtained from smart meters typically contain the reading date and time with corresponding energy consumption. From those attributes, additional features are extracted and the resulting data set contains nine features: month, day of year, day of month, weekday, weekend, holiday, hour, season, and energy usage.
The data set was divided into a training and test set: the first 80% of data was used for training and the last 20% for testing. This validation process was chosen to ensure that the model is built using older data and tested on newer data.
Standardization was applied to bring all variables into similar ranges. The values of each feature in the data were transformed to have zero-mean and unit-variance:\begin{equation*} \tilde {x} = \frac {x - \mu }{\sigma }\tag{12}\end{equation*}
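A sketch of the split and standardization described above; computing the mean and standard deviation on the training portion only is an assumption here, made to avoid information leakage into the test set.

import numpy as np

# Chronological 80/20 split followed by per-feature standardization, Eq. (12).
# data: (num_readings, 9 features); random values used only for illustration.
data = np.random.rand(1000, 9)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

mu = train.mean(axis=0)            # per-feature mean (training data)
sigma = train.std(axis=0)          # per-feature standard deviation
train_std = (train - mu) / sigma
test_std = (test - mu) / sigma     # test data scaled with training statistics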
B. Sample Generation
Sample generation here refers to the process of transforming data into the input and target samples to be passed to the ML model. The same approach is used as in Sehovac et al. [40]. An input sample is represented as a matrix whose rows are the feature vectors $x_{[j]}$ for $T$ consecutive readings:\begin{align*} x_{[j]}=&[ \text {Month}_{[j]}~\text {DayOfYear}_{[j]}~\text {DayOfMonth}_{[j]} \\&\text {Weekday}_{[j]}~\text {Weekend}_{[j]}~\text {Holiday}_{[j]}~\text {Hour}_{[j]} \\&\text {Season}_{[j]}~\text {Usage}_{[j]}]\tag{13}\end{align*}
For each input sample, one target sample $y$ of length $N$ is generated, consisting of the energy usage values for the $N$ readings that follow the input window:\begin{equation*} y = [\text {Usage}_{[{1}]},~\ldots \,~\text {Usage}_{[i]},~\ldots \,~\text {Usage}_{[N]}]\tag{14}\end{equation*}
Fig. 3 illustrates the sample generation process for the training set.
For the test set, samples are generated somewhat differently. As illustrated in Fig. 4, the sliding window for the test set shifts sequentially, with the overlap equal to the target length $N$.
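A simplified sketch of the sliding-window sample generation of Figs. 3 and 4, under the assumption that training windows slide by one reading while test windows shift by the target length $N$, and that energy usage is the last column of the feature matrix.

import numpy as np

# Sliding-window sample generation; data has shape (num_readings, 9 features).
def make_samples(data, T, N, step):
    X, y = [], []
    for start in range(0, len(data) - T - N + 1, step):
        X.append(data[start:start + T])               # input matrix, T x f
        y.append(data[start + T:start + T + N, -1])   # next N usage values
    return np.array(X), np.array(y)

data = np.random.rand(5000, 9)
X_train, y_train = make_samples(data[:4000], T=48, N=12, step=1)   # training: slide by 1
X_test, y_test = make_samples(data[4000:], T=48, N=12, step=12)    # test: shift by N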
C. S2S Prediction With BA
S2S load forecasting proposed by Sehovac et al. [40] is augmented by adding Bahdanau Attention (BA). Whereas Section II-C gives a generic breakdown of the BA mechanism, here we give the process of adapting BA to S2S models for load forecasting. The overall process is illustrated in Fig. 5 and details are provided in Algorithm 1.
Algorithm 1 S2S-BA_{Train}(G = (E_{T}, D_{N}, \tanh, \text{Softmax}, W^{2h\rightarrow h}, W^{1+2h\rightarrow h}, W^{h\rightarrow 1}, v, P))
Input: Model G consisting of: encoder E_{T}, decoder D_{N}, activation functions tanh and Softmax, fully-connected layers W^{2h\rightarrow h}, W^{1+2h\rightarrow h}, W^{h\rightarrow 1}, vector v, and initial weights P.
Output: Trained model G.
  Generate input samples and corresponding target vectors
  Initialize the model weights
  for each epoch do
    for each batch do
      initialize the encoder hidden state
      initialize the first decoder input
      for time step j from 1 to T do   # Encoder
        update the encoder hidden state h_{[j]}
      for time step i from 1 to N do   # Decoder, Equations (15)-(19)
        compute the attention energies, weights, and context vector
        update the decoder hidden state and compute the output \dot{y}_{[i]}
      compute the loss
      BPTT(loss)
Return: Trained model G.
The encoder process in Algorithm 1 is the same as in Sehovac et al. [40] since an identical encoder $E_{T}$ is used: the encoder processes the input sequence and produces the hidden states $h_{[j]}$, which are stacked into the matrix $H\in \mathbb {R}^{T\times h}$, with the final hidden state serving as the context vector.
The context vector is used as the initial hidden state of the decoder $D_{N}$. At each decoder time step $i$, the attention energies are computed as follows:\begin{align*} \lambda _{1}=&[H^{b}_{[i-1]}; {H}]\in \mathbb {R}^{T\times 2h} \tag{15a}\\ \lambda _{2}=&\tanh (W^{2h\rightarrow h}(\lambda _{1}))\in \mathbb {R}^{T\times h} \tag{15b}\\ e^{b}_{[ij]}=&\langle \lambda _{2}, v \rangle \in \mathbb {R}^{T},\quad \text {where}~ v\in \mathbb {R}^{h} \tag{15c}\end{align*} Here, $H^{b}_{[i-1]}$ denotes the previous decoder hidden state repeated across the $T$ encoder positions, and $H$ is the matrix of encoder hidden states.
Concatenation is performed first (15a), followed by a fully-connected layer and the tanh activation function (15b), and finally an inner product with the vector $v$ (15c). The attention weights are then obtained as:\begin{equation*} \alpha ^{b}_{[ij]} = \frac {\text {exp}(e^{b}_{[ij]})}{\sum _{k=1}^{T} \text {exp}(e^{b}_{[ik]})}\tag{16}\end{equation*}
Equation (16) is the Softmax of the attention energies computed in (15c). These attention weights are then used in an inner product with the matrix of encoder hidden states $H$ to obtain the context vector:\begin{align*} c^{b}_{[i]}=&\langle \alpha ^{b}_{[ij]}, \mathbf {H} \rangle \tag{17a}\\=&\sum _{j=1}^{T} \alpha ^{b}_{[ij]}h_{[j]}\tag{17b}\end{align*}
The next step is to concatenate the context vector with the previously predicted output $\dot {y}_{[i-1]}$ and pass the result, together with the previous hidden state, through the GRU cell:\begin{align*} \lambda _{3}=&[\dot {y}_{[i-1]};c^{b}_{[i]}]\in \mathbb {R}^{1+h} \tag{18a}\\ h^{b}_{[i]}=&\text {GRU}_{Cell}(\lambda _{3}, h^{b}_{[i-1]}) \in \mathbb {R}^{h} \tag{18b}\end{align*}
The last step is to pass the current hidden state, the context vector, and the previous output through a function as given in equation (7). In this work, we concatenate all three variables and pass this vector through two fully-connected layers (Algorithm 1):\begin{align*} \lambda _{4}=&[\dot {y}_{[i-1]};c^{b}_{[i]};h^{b}_{[i]}]\in \mathbb {R}^{1+2h} \tag{19a}\\ \lambda _{5}=&W^{1+2h \rightarrow h}(\lambda _{4}) \tag{19b}\\ \dot {y}_{[i]}=&W^{h \rightarrow 1}(\lambda _{5}) \in \mathbb {R}^{1} \tag{19c}\end{align*}
This concludes the process for one time step; after $N$ time steps, the complete predicted sequence is obtained, the loss is computed, and the weights are updated through backpropagation through time (BPTT).
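A compact sketch of one S2S-BA decoder time step following (15)-(19); the layer names and sizes are illustrative, and the encoder hidden states are assumed to be stacked into a $T \times h$ matrix $H$.

import torch
import torch.nn as nn

# One S2S-BA decoder step, Eqs. (15)-(19); sizes illustrative.
T, h = 48, 64
H = torch.randn(T, h)            # encoder hidden states
h_prev = torch.randn(1, h)       # previous decoder hidden state h^b_[i-1]
y_prev = torch.randn(1, 1)       # previous prediction

W_att = nn.Linear(2 * h, h)      # W^{2h->h}
v = torch.randn(h)
gru = nn.GRUCell(input_size=1 + h, hidden_size=h)
W_out1 = nn.Linear(1 + 2 * h, h) # W^{1+2h->h}
W_out2 = nn.Linear(h, 1)         # W^{h->1}

lam1 = torch.cat([h_prev.expand(T, h), H], dim=1)    # (15a)
energies = torch.tanh(W_att(lam1)) @ v               # (15b)-(15c)
alphas = torch.softmax(energies, dim=0)              # (16)
context = (alphas @ H).unsqueeze(0)                  # (17)

lam3 = torch.cat([y_prev, context], dim=1)           # (18a)
h_cur = gru(lam3, h_prev)                            # (18b)

lam4 = torch.cat([y_prev, context, h_cur], dim=1)    # (19a)
y_cur = W_out2(W_out1(lam4))                         # (19b)-(19c)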
D. S2S Prediction With LA
This section explains the process of adapting LA to S2S RNN for load forecasting: Fig. 6 provides the overview whereas Algorithm 2 shows the details. The encoder takes the identical approach as in BA; the encoder loop in Algorithm 2 is identical to that of Algorithm 1, respective to their own variables. Hence, at the final encoder time step $T$, the matrix of encoder hidden states $H$ is obtained and the final hidden state initializes the decoder.
Algorithm 2 S2S-LA-General_{Train}(G = (E_{T}, D_{N}, \tanh, \text{Softmax}, W^{2h\rightarrow h}, W^{h\rightarrow h}, W^{h\rightarrow 1}, P_{0}))
Input: Model G consisting of: encoder E_{T}, decoder D_{N}, activation functions tanh and Softmax, fully-connected layers W^{2h\rightarrow h}, W^{h\rightarrow h}, W^{h\rightarrow 1}, and initial weights P_{0}.
Output: Trained Model G.
  Generate input samples and corresponding target vectors
  Initialize the model weights
  for each epoch do
    for each batch do
      initialize the encoder hidden state
      initialize the first decoder input
      for time step j from 1 to T do   # Encoder
        update the encoder hidden state h_{[j]}
      for time step i from 1 to N do   # Decoder, Equations (20)-(24)
        compute the current hidden state, attention energies, weights, and context vector
        compute the attentional hidden state and the output \dot{y}_{[i]}; append it to the prediction sequence
      compute the loss
      BPTT(loss)
Return: Trained model G.
The key difference between BA and LA is that in BA the energies are computed first, whereas the first step in LA is to compute the current hidden state $h^{l}_{[i]}$:\begin{equation*} h^{l}_{[i]} = \text {GRU}_{Cell}(\dot {y}_{[i-1]}, h^{l}_{[i-1]})\tag{20}\end{equation*}
The next step is to compute the attention energies $e^{l}_{[ij]}$. With the dot score function, the energies are simply the inner product of the encoder hidden states $H$ and the current decoder hidden state:\begin{equation*} e^{l}_{[ij]} = \langle {H}, h^{l}_{[i]} \rangle \in \mathbb {R}^{T}\tag{21}\end{equation*}
For the general score function, $H$ is first passed through a fully-connected layer before the inner product:\begin{align*} \lambda _{6}=&W^{h \rightarrow h}({H}) \in \mathbb {R}^{T \times h} \tag{22a}\\ e^{l}_{[ij]}=&\langle \lambda _{6}, h^{l}_{[i]} \rangle \in \mathbb {R}^{T}\tag{22b}\end{align*}
The remaining score function, concat, computes the energies similarly to Equations (15). However, the difference is that LA uses the current decoder hidden state $h^{l}_{[i]}$ in the concatenation with $H$, whereas BA uses the previous hidden state $h^{b}_{[i-1]}$.
Continuing with Algorithm 2, the attention weights $\alpha^{l}_{[ij]}$ and the context vector $c^{l}_{[i]}$ are computed in the same manner as in BA, following Equations (16) and (17).
The next step in Algorithm 2 is to compute the attentional hidden state $\hat{h}^{l}_{[i]}$:\begin{align*} \lambda _{7}=&[h^{l}_{[i]};c^{l}_{[i]}]\in \mathbb {R}^{2h} \tag{23a}\\ \hat {h}^{l}_{[i]}=&\tanh (W^{2h \rightarrow h}(\lambda _{7})) \in \mathbb {R}^{h} \tag{23b}\end{align*}
The output is then obtained by passing the attentional hidden state through a fully-connected layer:\begin{equation*} \dot {y}_{[i]} = W^{h\rightarrow 1}(\hat {h}^{l}_{[i]})\tag{24}\end{equation*}
Note that the attentional hidden state and the current hidden state have different functions in LA. The attentional hidden state $\hat{h}^{l}_{[i]}$ is used to compute the output $\dot{y}_{[i]}$, whereas the current hidden state $h^{l}_{[i]}$ is passed to the next decoder time step.
This concludes the LA mechanism for one time step. After $N$ time steps, the loss is computed and the weights are updated through BPTT.
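Analogously, a sketch of one S2S-LA decoder step with the general score, following (20)-(24); sizes and layer names are illustrative.

import torch
import torch.nn as nn

# One S2S-LA (general score) decoder step, Eqs. (20)-(24); sizes illustrative.
T, h = 48, 64
H = torch.randn(T, h)            # encoder hidden states
h_prev = torch.randn(1, h)       # previous decoder hidden state h^l_[i-1]
y_prev = torch.randn(1, 1)       # previous prediction

gru = nn.GRUCell(input_size=1, hidden_size=h)
W_gen = nn.Linear(h, h, bias=False)   # W^{h->h}, general score
W_att = nn.Linear(2 * h, h)           # W^{2h->h}
W_out = nn.Linear(h, 1)               # W^{h->1}

h_cur = gru(y_prev, h_prev)                        # (20)
energies = W_gen(H) @ h_cur.squeeze(0)             # (22a)-(22b)
alphas = torch.softmax(energies, dim=0)            # as in (16)
context = (alphas @ H).unsqueeze(0)                # as in (17)

lam7 = torch.cat([h_cur, context], dim=1)          # (23a)
h_att = torch.tanh(W_att(lam7))                    # (23b): attentional hidden state
y_cur = W_out(h_att)                               # (24)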
Evaluation
The proposed approach was evaluated on a real-world dataset from a commercial building provided by an industry partner. The dataset contained one year and three months of energy load data recorded in five-minute intervals (12 readings per hour).
A. Experiments
All S2S models were tested for four different prediction lengths $N$, corresponding to one hour, four hours, ten hours, and twenty-four hours ahead:\begin{equation*} \vec {N} = [12, ~48, ~120, ~288]\tag{25}\end{equation*}
Input Case 1: input $T=12$, predict each $N$ of $\vec {N}$
Input Case 2: input $T=48$, predict each $N$ of $\vec {N}$
Input Case 3: input $T=120$, predict each $N$ of $\vec {N}$
Input Case 4: input $T=288$, predict each $N$ of $\vec {N}$
The four input cases combined with the four prediction lengths make for a total of 16 cases. All models were trained for 10 epochs, since this was sufficient to reach an acceptable level of convergence. The RNN hyperparameters used to compute the results were:
Number of layers: 1, with the exception of Non-S2S-3L which had three layers
Hidden dimension size: $h = [64,~128]$
Cell state dimension size (LSTM): $c = [64,~128]$
Batch size: $B = 256$
Learning rate: 0.001
The models evaluated were as follows:
S2S-o model from the work of Sehovac et al. [40] with GRU/LSTM/RNN cells (3 models)
S2S-BA model with GRU/LSTM/RNN cells (3 models)
S2S-LA model with GRU/LSTM/RNN cells; each cell is combined with one of three attention score functions: dot, general, concat (9 models)
Non-S2S RNN, one layer with GRU/LSTM/RNN cell (3 models)
Non-S2S RNN, three layers with GRU/LSTM/RNN cell (3 models)
DNN model with sizes: small, medium, and large (3 models)
We define a model type as one of the seven listed above (the three LA score functions are considered separately) and a model as a combination of model type and cell type. Each model is different for each of the 16 cases in terms of input length $T$ and prediction length $N$.
The two Non-S2S RNN models are conventional RNN models, and thus cannot have a prediction length longer than the input sequence length. Hence, these models are only used when $N \le T$.
The DNN model took the same input matrix as the other models, flattened into a vector of length $T \times f$, where $f$ is the number of features. Three DNN sizes were considered:
DNN-small: Input layer (size $T \times f$) $\Join ~512 \Join ~256 \Join ~128 \Join$ Output layer (size $N$)
DNN-medium: Input layer $\Join ~512^{\Join 3} \Join ~256 \Join ~128 \Join$ Output layer
DNN-large: Input layer $\Join ~1024^{\Join 2} \Join ~512^{\Join 3} \Join ~256^{\Join 2} \Join$ Output layer
The notation $\Join$ denotes a fully-connected connection between consecutive layers, and $x^{\Join k}$ denotes $k$ consecutive layers of size $x$.
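For illustration, the DNN-small architecture can be expressed in PyTorch as follows; the ReLU activations between layers are an assumption, as the activation function is not specified above.

import torch.nn as nn

# DNN-small: flattened input of size T*f, hidden layers 512-256-128, output of size N.
# ReLU activations are assumed; T, f, and N take the values of the given case.
def dnn_small(T, f, N):
    return nn.Sequential(
        nn.Linear(T * f, 512), nn.ReLU(),
        nn.Linear(512, 256), nn.ReLU(),
        nn.Linear(256, 128), nn.ReLU(),
        nn.Linear(128, N),
    )

model = dnn_small(T=48, f=9, N=12)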
The accuracy measures used throughout this work are the Mean Absolute Error (MAE) and the Mean Absolute Percentage Error (MAPE):\begin{align*} \text {MAE}=&\frac {1}{n} \sum _{i=1}^{n} \left |{ y_{i} - \hat {y}_{i} }\right | \tag{26}\\ \text {MAPE}=&\frac {100\%}{n} \sum _{i=1}^{n} \left |{ \frac {y_{i} - \hat {y}_{i}}{y_{i}} }\right |\tag{27}\end{align*} where $y_{i}$ is the actual value, $\hat {y}_{i}$ is the predicted value, and $n$ is the number of predictions.
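Both metrics can be computed directly, for example with NumPy, where y and y_hat denote the actual and predicted load vectors (values below are illustrative).

import numpy as np

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))                  # Eq. (26)

def mape(y, y_hat):
    return 100.0 * np.mean(np.abs((y - y_hat) / y))    # Eq. (27)

y = np.array([120.0, 130.0, 125.0])
y_hat = np.array([118.0, 133.0, 124.0])
print(mae(y, y_hat), mape(y, y_hat))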
Since this work randomizes the training samples and uses randomly initialized weights, five random seeds were used for each case so that each model sees a different randomized order of training samples and different initial weights.
B. Results and Discussion
The proposed approach was implemented in Python with the PyTorch tensor library [43]. Experiments were conducted using GPUs on two different machines: the first contained two NVIDIA GeForce RTX 2080 Ti GPU cards, and the second contained one NVIDIA GeForce GTX 1060 GPU card.
The following four subsections present the results obtained for the four input cases. The first figure in each of the four subsections shows the best results achieved by each model type, regardless of cell or hidden dimension size $h$.
The second figure in each input case subsection analyzes the performance of the cells used in the models and compares cell performance across the prediction lengths $N$ for the given input length $T$.
1) Input Case 1: One Hour
This subsection analyzes the results obtained for input length $T = 12$ (one hour); Fig. 7 shows the best results achieved by each model type.
It is important to note that the S2S-o models perform comparably to the attention models. A short input length such as $T = 12$ places little burden on the encoder to compress information into the context vector, so the benefit of attention is limited.
Fig. 8 shows how accuracy changes with the increase of prediction length $N$ for each cell type.
The Vanilla RNN cell achieved comparable results with the two attention models S2S-LA-general and S2S-LA-concat. In addition, for the Vanilla RNN cell, every attention model outperformed the S2S-o model for the longer prediction lengths.
For the GRU cell, results are similar among all models across all prediction lengths, with the majority of attention models performing slightly better than the S2S-o model for the longer prediction lengths.
MAE results show the same patterns as MAPE for all four input cases; therefore, MAE graphs are not included for the remaining three cases.
2) Input Case 2: Four Hours
This subsection analyzes the results obtained for input length $T = 48$ (four hours); Fig. 9 shows the best results achieved by each model type.
The comparison of the DNN model results between Fig. 7 and Fig. 9 shows that the accuracy slightly increases as the input length $T$ increases.
Fig. 10 analyzes the performance of different cells. It can be seen that for all cells the Non-S2S RNN models perform the worst as the prediction length increases.
Like with the Vanilla cell, the GRU cell results in Fig. 10 show the majority of attention models outperforming the S2S-o model as the prediction length $N$ increases.
3) Input Case 3: Ten Hours
This subsection provides the results obtained for input length $T = 120$ (ten hours); Fig. 11 shows the best results achieved by each model type.
In case 3, as can be observed from Fig. 12, for the Vanilla RNN cell, both Non-S2S RNN models obtain better results than the S2S-o model for some prediction lengths.
For the GRU and LSTM models, the Non-S2S RNN models perform the worst for all prediction lengths. Similar to the GRU and LSTM cell results for input cases 1 and 2, the majority of GRU-based attention models outperform the S2S-o model as $N$ increases.
4) Input Case 4: Twenty-Four Hours
For this case, the input length is $T = 288$ (twenty-four hours), the longest input length considered.
With the Vanilla RNN cells, the Non-S2S RNN models performed the best for some prediction lengths.
As in input case 3, the GRU and LSTM Non-S2S RNN models performed the worst for all prediction lengths. As in other input cases, the LSTM-based S2S-o model outperformed all other LSTM models for some prediction lengths.
C. Discussion
This subsection analyzes the model results for varying input lengths $T$.
From Fig. 15, it can be seen that, for the longest prediction length $N = 288$, the S2S attention models achieve the lowest MAPE.
Analyzing the best MAPE achieved over the varied input lengths $T$ reveals different patterns for the different model types.
Mostly, the S2S attention models share a similar pattern: versions with longer input lengths do not necessarily outperform those with shorter inputs.
As expected, the best S2S-o models favour shorter input lengths, since longer inputs burden the encoder with compressing more information into the fixed-length context vector.
Lastly, the DNN model in Fig. 15 shows the clearest signs of improvement over the input length versions: its accuracy improves as the input length $T$ increases.
Overall, it can be concluded that as the prediction length $N$ increases, the accuracy of all models decreases; nevertheless, the S2S attention models, and the S2S-BA model in particular, achieve the best accuracy for longer prediction lengths.
However, the preferred model may not necessarily be the S2S-BA model, or any attention model for that matter. The attention models contain more parameters than the other models, with the S2S-BA model containing the most parameters. Table 2 shows the total number of weights and biases for each S2S model and each cell type. It can be observed that as the cell changes from Vanilla to GRU and to LSTM, the number of parameters increases. Also, for each cell, the number of parameters increases when attention is added to the S2S-o model: in increasing order of the number of weights, the attention models are LA-dot, LA-general, LA-concat, and BA. Nonetheless, the S2S-o model achieves comparable results to all attention-based models for each prediction length while being faster to train because of having fewer parameters. Hence, if the interested party's main objective is high accuracy irrespective of training speed, the S2S-BA model is preferred. If a slight decrease in accuracy is acceptable, a reduction in training time can be achieved by using the S2S-o model.
NNs with a large number of weights, such as S2S with attention, are prone to overfitting; thus, we have examined the train and test losses. An example plot for the S2S-BA model with GRU cell shows the train and test losses decreasing together and remaining close, indicating that the model does not overfit.
Conclusion
Continuously increasing electricity consumption and its impact on global warming are escalating the importance of energy efficiency and conservation. Load forecasting contributes to energy management efforts through improved production planning and scheduling, budget planning, and by identifying savings opportunities. Feedforward neural networks and Support Vector Regression have had great success in load forecasting; however, Recurrent Neural Networks have an advantage because of their ability to model time dependencies.
This paper proposes a Sequence to Sequence Recurrent Neural Network (S2S RNN) with Attention for electrical load forecasting. The RNN provides the ability to model time dependencies, and the S2S approach strengthens this ability by using two RNNs: an encoder and a decoder. Moreover, an attention mechanism was added to ease the connection between the encoder and decoder and further improve load forecasting. The proposed solution was evaluated with four attention mechanisms, three RNN cells (vanilla, LSTM, and GRU), different forecasting horizons, and different input lengths. As expected, the forecasting accuracy decreases as the prediction length increases; however, the decline is much steeper with DNN models than with S2S-o or S2S models with attention. Overall, S2S with Bahdanau attention outperformed all other models. It is important to note that with S2S attention models, accuracy does not continuously increase with the increase of the input sequence length. For longer prediction lengths, the advantage of the S2S attention models over the remaining models becomes more pronounced.
This work evaluated 348 NNs on a single data set, but further experiments with different datasets are needed to draw more generic conclusions. Future work will also consider other time series data and explore industrial applications.