Introduction
Earthquake detection [1]–[3] and earthquake early warning (EEW) [4], [5] are central tasks of seismological research. Many data processing techniques used in traditional seismology originated in an era of small datasets and limited computing power. Low-cost MEMS acceleration sensors have been used extensively in Internet of Things (IoT) monitoring systems over the last few years because of their low installation and operation costs; examples include wireless sensor networks (WSN) [6], [7] and the community seismic network (CSN) [8]. They have great potential to replace traditional seismic networks, whose coverage is rarely dense because of high installation and operation costs. However, the large noise inherent in a low-cost MEMS acceleration sensor reduces the quality of the recorded data [9], so a novel approach is needed to handle data with different signal-to-noise ratios (SNR).
In recent years, machine learning (ML) has been widely applied to earthquake detection [10]–[13], including first-arrival recognition [14]–[16] and source location [17], [18]. Compared with other time series (stock prices, WiFi signals), a high-dimensional seismic sequence has many implicit features (the evolution of individual components and of the components jointly, etc.) that are difficult to capture. Khan et al. [19] developed an artificial neural network (ANN) model [20], [21] to detect seismic events from manually selected labels. Many researchers have also built seismic detection models on different convolutional neural network (CNN) architectures [3], [22], [23]. Regardless of the method adopted, the models mentioned above are supervised: a real seismic waveform must be identified by subjectively selecting feature labels, which affects the detection performance of the model. Therefore, ML-based seismic detection methods are increasingly designed to eliminate the influence of such subjective factors.
The performance of a deep learning (DL) algorithm depends on the size and quality of the training dataset. In our work, however, we use low-cost MEMS sensors rather than smartphones to record ground-motion signals, and these records are polluted by various noise sources (human activities or the sensors themselves), resulting in a shortage of high-quality (high-SNR) seismic data. Too small a training dataset easily leads to overfitting [5]. Current research addresses this problem mainly through the following three approaches:
To Use Transformation: To overcome incomplete real seismic waveforms and low SNR, Dokht et al. [24] designed a general deep convolutional network for seismic event and phase detection based on time-frequency representations. Saad and Chen [25] used automated unsupervised approaches to extract waveform signals from continuous microseismic data according to the time-frequency representation of the microseismic traces; the approach also works in low-SNR environments and confirms that a waveform-based reverse-time migration method can be used in the model to improve the resolution of microseismic imaging.
To Train a Generalized Deep Learning Model on a Small Dataset: Using a generalized deep learning architecture to extract the most representative features from limited training data, Saad et al. [26] proposed the SCALODEEP model to detect ground-motion signals. Similarly, Zhu et al. [27] proposed a CNN-based phase recognition classifier (CPIC) for phase detection and picking on small and medium-sized training datasets, while Saad and Chen [28] used a capsule neural network (CapsNet) to identify and detect earthquakes automatically, confirming that it can learn from small datasets with good generalization performance.
To Develop a Data Augmentation Approach: Data augmentation is another effective way to increase the number of data samples. The conditional GAN [29] has been used to generate seismic datasets effectively, and Wang et al. [30] developed EarthquakeGen to generate short seismic waveforms and verified their plausibility. Although seismic data can be generated by inferring the implicit and explicit characteristics of the seismic waveform, it is not easy to ensure the diversity or efficiency of the generated data.
In this paper, we propose a novel deep generative model (DGM) for data augmentation. The design of such a generation model depends on how well the distribution pattern of the original data can be measured. Speech synthesis based on the Hidden Markov Model (HMM) has proven effective at synthesizing acceptable speech [31], but because the HMM is discrete it cannot represent a continuous space. The LSTM model [32] for natural language is applied to capture and memorize the long- and short-term features of sequences and generate realistic text, but compared with the traditional recurrent neural network (RNN) it is harder to train because of its many parameters. Zhao et al. [33] proposed an LSTM-based seq2seq model with an attention mechanism to improve the efficiency and quality of text summarization, which, however, lacks coherence in the generated text. Although different DL models have been developed in these studies to generate time series, they cannot fully represent the distribution of the original data, and the performance of the generation models falls short of expectations. Therefore, we designed a variant of the GAN structure, called EQGAN, that automatically captures features across dimensions and through time.
In addition, we mix real seismic data and noise data at the data input layer and integrate the Wasserstein Distance (WD) and spectral normalization (SN) to improve the stability of model training, overcome mode collapse, and generate high-quality earthquake data that approximate records from acceleration sensors. Since there is no one-to-one correspondence between generated and real data, evaluating the quality of the generated data is not easy. The Fréchet Inception Distance (FID) [34], [35] has been proposed in previous studies to evaluate the similarity between generated and real images, but for non-image data an accurate evaluation of a GAN model remains a challenge. EQGAN's seismic data generation ability is therefore analyzed qualitatively in this study through visual representation, frequency-domain, and autocorrelation schemes, and a new quantitative error evaluation scheme is designed based on high-throughput screening (HTS) theory, which demonstrates the stability and efficiency of our model.
The rest of this article is organized as follows. In Section II, we describe the basic theory of the GAN and design the DGM by analyzing the data distribution patterns and characteristics of real seismic data. Section III provides details of data collection and preprocessing. We then present the experimental results and evaluate our model with different metrics in Section IV. Finally, Section V gives the conclusion and future work.
Theory and Model Design
In this section, we first present the basic theoretical framework of the standard GAN in Part A, then analyze the distribution characteristics of seismic data in Part B, and finally describe the algorithm and model design of EQGAN in Part C.
A. Theoretical Basis
Three-component earthquake data are a series of discrete measurements captured from a continuous process. The acceleration components change to different degrees at different times, and across dimensions at the same time. To extract data features automatically with an ML model, rather than with a traditional ANN, and thereby infer implicit data features, we introduce the GAN framework to build a generation model that fits the real data distribution.
In 2014, Goodfellow et al. [36] proposed the GAN, an unsupervised learning framework (Fig.1c) in which the network is trained with backpropagation alone, avoiding Markov chains, and which marked a breakthrough in deep learning. The basic idea of the GAN comes from two-player zero-sum game theory; the framework consists of a generator and a discriminator. The generator aims to learn and capture the potential distribution of real data as faithfully as possible while producing new data. The discriminator is essentially a binary classifier whose purpose is to identify, as accurately as possible, whether incoming data come from the real data or from the generator. This optimization process is a minimax game in which the two sides continuously improve their generation or discrimination ability, with the goal of finding a Nash equilibrium between them. The performance of the generation task depends on the design of this adversarial mechanism. The objective function can be expressed as follows:\begin{align*}\min _{G} \max _{D} V(D, G)=& E_{x \sim p_{r}(x)}[\log (D(x))] \\ &+E_{z \sim p_{z}(z)}[\log (1-D(G(z)))]\tag{1}\end{align*}
System structure of different generation models. (a) NN, (b) LSTM, (c) Normal GAN, (d) EQGAN.
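As a concrete illustration of Eq. (1), the following sketch shows one adversarial training step in TensorFlow 2 with toy dense generator and discriminator networks. The layer sizes, optimizers, and the non-saturating generator loss are illustrative assumptions, not the actual EQGAN architecture of Fig.1d.

```python
import tensorflow as tf

# Hypothetical toy generator/discriminator; the real EQGAN architecture differs (see Fig. 1d).
latent_dim, seq_len = 64, 3500

generator = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(latent_dim,)),
    tf.keras.layers.Dense(seq_len * 3),
    tf.keras.layers.Reshape((seq_len, 3)),          # 3-channel waveform
])

discriminator = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(seq_len, 3)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1),                        # logit: real vs. generated
])

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)

@tf.function
def train_step(real_batch):
    z = tf.random.normal([tf.shape(real_batch)[0], latent_dim])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake = generator(z, training=True)
        d_real = discriminator(real_batch, training=True)
        d_fake = discriminator(fake, training=True)
        # Discriminator maximizes log D(x) + log(1 - D(G(z)))  (Eq. 1)
        d_loss = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
        # Generator minimizes log(1 - D(G(z))), here in the common non-saturating form
        g_loss = bce(tf.ones_like(d_fake), d_fake)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
    return d_loss, g_loss
```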
B. Analysis of Seismic Data Features
Due to the complex, high-dimensional structure of seismic data, its distribution pattern cannot be represented directly. The evolution of the seismic waveform is mapped into a controllable observation sequence that exhibits multiple multi-dimensional time-series correlations. To ensure good training performance and generate diverse seismic waves, we systematically analyze the evolution of earthquake waves: (i) fully extract the implicit and explicit characteristics of the seismic sequence's temporal evolution and spatial dimensions; (ii) establish a realistic earthquake-sequence generation model, with the adversarial mechanism designed and adjusted to meet the requirements of a high-quality generation task. Different time-series distribution patterns lead to different constructions of the mapping space. To complete the seismic data generation task, we need to find the parameter set $\theta^{*}$ that best characterizes the spatial distribution pattern of the observed sequences:\begin{equation*} \theta ^{*}=\arg \max _{\theta } \prod _{n=1}^{N} P\left ({x^{n}, \theta }\right) \tag{2}\end{equation*}
Since an earthquake sequence has three components and each sample depends on its own history, the likelihood in Eq. (2) can be factorized into a feature-extraction term and a data-generation term:\begin{align*}\theta ^{*}=& \arg \max _{\theta } \prod _{n=1}^{N} \prod _{i=1}^{3} \underbrace {P\left ({x_{i}^{n} \mid x_{i}^{1 \rightarrow n-1}, \theta _{f}}\right)}_{\text {feature extraction}} \cdot \underbrace {P\left ({y_{i}^{n} \mid x_{i}^{n}, \theta _{g}}\right)}_{\text {data generation}},\quad \theta =\left \{{\theta _{f}, \theta _{g}}\right \} \tag{3}\end{align*}
However, it is infeasible to solve this equation directly to measure the evolution of the time series, so we introduce the Kullback-Leibler (KL) divergence [37], [38]:\begin{equation*} K L\left ({\mathbb {P}_{r} \| \mathbb {P}_{g}}\right)=\int _{\mathcal {X}} P_{r}(x) \log \frac {P_{r}(x)}{P_{g}(x)} \mathrm {d} x \tag{4}\end{equation*}
The optimal parameter combination can be obtained by minimizing the KL divergence between the generated and real data distributions.
C. Algorithm and Model Design
To expand the scope of seismic-sequence analysis and extract the long- and short-range spatiotemporal correlations of the earthquake-wave evolution process, we introduce the LSTM [39] algorithm (Fig.1b) to capture the hidden evolution relationships within earthquake sequences: crucial information that must be remembered over long horizons is retained, while unimportant information is forgotten. The input information $x_{t}$ and the previous hidden state $h_{t-1}$ are combined to form the input gate:\begin{equation*} i_{t}=\sigma \left ({W_{i} \cdot \left [{h_{t-1}, x_{t}}\right]+b_{i}}\right) \tag{5}\end{equation*}
\begin{align*}h_{t}=& \underbrace {\sigma \left ({W_{o}\left [{h_{t-1}, x_{t}}\right]+b_{o}}\right)}_{\text {output gate}} \\ &\cdot \tanh \underbrace {\Big(\underbrace {f_{t} \times C_{t-1}}_{\text {forget gate}}+\underbrace {i_{t} \times \tanh \left ({\tilde {C}_{t}}\right)}_{\text {input gate}}\Big)}_{\text {state unit}} \tag{6}\end{align*}
Among them, $\sigma$ is the sigmoid function, $W_{i}$, $W_{o}$, $b_{i}$, and $b_{o}$ are the gate weights and biases, $f_{t}$ and $i_{t}$ are the forget- and input-gate activations, $C_{t-1}$ and $\tilde{C}_{t}$ are the previous cell state and the candidate state, and $h_{t}$ is the hidden output at time $t$.
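To make Eqs. (5)-(6) concrete, the following NumPy sketch implements a single LSTM step; the weight shapes, random toy inputs, and gate stacking are illustrative assumptions rather than the configuration used in EQGAN.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following Eqs. (5)-(6).
    W/b hold the weights and biases of the input, forget, output, and candidate gates."""
    z = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W["i"] @ z + b["i"])          # input gate, Eq. (5)
    f_t = sigmoid(W["f"] @ z + b["f"])          # forget gate
    o_t = sigmoid(W["o"] @ z + b["o"])          # output gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])      # candidate state
    c_t = f_t * c_prev + i_t * c_tilde          # state unit in Eq. (6)
    h_t = o_t * np.tanh(c_t)                    # hidden output, Eq. (6)
    return h_t, c_t

# Toy dimensions: 3-channel acceleration sample, 8 hidden units.
rng = np.random.default_rng(0)
n_in, n_h = 3, 8
W = {k: rng.normal(scale=0.1, size=(n_h, n_h + n_in)) for k in "ifoc"}
b = {k: np.zeros(n_h) for k in "ifoc"}
h, c = np.zeros(n_h), np.zeros(n_h)
for x_t in rng.normal(size=(5, n_in)):          # a short 3-component sequence
    h, c = lstm_step(x_t, h, c, W, b)
```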
In the earthquake data generation process, measuring the distance between the low-dimensional manifolds of the distribution patterns in a high-dimensional hidden space more accurately drives the generated sequences toward the objective. On the other hand, highly scattered data points in the training dataset lead to irregular gradient propagation, which cannot guarantee training stability. It is therefore unreasonable to train the GAN by minimizing the KL divergence to make the two distributions approach each other. To solve this problem we introduce the Wasserstein distance [42], [43] to describe the similarity between the generated and real data distributions; the underlying theory of the WD is explained in Appendix 3.\begin{equation*} W\left ({P_{r}, P_{g}}\right)=\inf _{\gamma \in \Pi \left ({P_{r}, P_{g}}\right)} \mathbb {E}_{(x, y) \sim \gamma }[\|x-y\|] \tag{7}\end{equation*}
Compared with the KL divergence, the WD provides a smooth and meaningful distance even when two distributions lie on low-dimensional manifolds with little or no overlap. It can effectively measure pattern differences between sub-manifolds of high-dimensional distributions and describe the similarity of the two distributions, which fundamentally solves the problem of vanishing gradients.
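In practice, Eq. (7) is not computed directly; it typically enters training through its Kantorovich-Rubinstein dual, in which a Lipschitz-constrained critic estimates the distance. The following sketch, under that assumption, shows the resulting critic and generator losses for Keras-style models like those above; it illustrates WGAN-style training and is not the exact EQGAN loss.

```python
import tensorflow as tf

def wasserstein_losses(critic, generator, real_batch, latent_dim=64):
    """Critic and generator losses under the dual form of Eq. (7).
    `critic` and `generator` are assumed to be Keras models; the critic must be
    kept Lipschitz, e.g. via the spectral normalization of Eq. (8)."""
    z = tf.random.normal([tf.shape(real_batch)[0], latent_dim])
    fake = generator(z, training=True)
    d_real = critic(real_batch, training=True)
    d_fake = critic(fake, training=True)
    # Critic maximizes E[D(x_real)] - E[D(x_fake)], so its loss is the negative.
    critic_loss = tf.reduce_mean(d_fake) - tf.reduce_mean(d_real)
    # Generator minimizes -E[D(x_fake)].
    gen_loss = -tf.reduce_mean(d_fake)
    return critic_loss, gen_loss
```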
To further ensure the stability of EQGAN training, the discriminator must satisfy a Lipschitz constraint. We therefore apply spectral normalization [44], dividing the weight matrix $W$ by its spectral norm $\sigma(W)$:\begin{equation*} \bar {W}_{\mathrm {SN}}(W):=W / \sigma (W)\tag{8}\end{equation*}
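A minimal NumPy sketch of Eq. (8) follows, estimating the spectral norm $\sigma(W)$ with power iteration (the usual trick in spectral normalization); the iteration count and random initialization are illustrative assumptions.

```python
import numpy as np

def spectral_normalize(W, n_iter=20):
    """Approximate Eq. (8): divide W by its largest singular value sigma(W),
    estimated with power iteration."""
    u = np.random.default_rng(0).normal(size=W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v) + 1e-12
        u = W @ v
        u /= np.linalg.norm(u) + 1e-12
    sigma = u @ W @ v                  # spectral-norm estimate
    return W / sigma

W = np.random.default_rng(1).normal(size=(64, 32))
W_sn = spectral_normalize(W)
print(np.linalg.svd(W_sn, compute_uv=False)[0])   # ~1.0, so the Lipschitz constraint holds
```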
The overall framework of our EQGAN model is summarized in Fig.1d, in which the LSTM, attention, NN, SN, and WD components are combined to extract seismic data features and generate new data; the model is implemented in Python 3.6 with TensorFlow 2.0.
Data Collection and Preprocessing
Data collection and preprocessing are the prerequisites of model training and analysis. Here we present the experimental data source and preprocessing to better explain the model workflow.
A. Dataset
Earthquake events occur unpredictably and cannot be reproduced, so a dedicated data acquisition scheme is needed. The earthquake datasets used for model training come mainly from the National Research Institute for Earth Science and Disaster Resilience (NIED) [45] databases; we also integrated seismic data recorded by our own sensors. Events with magnitudes from 4 to 8 recorded between April 2009 and May 2019 were selected from the NIED database and converted to units of g. In addition, the data of 120 stations for the three earthquakes of Tottori (2000) (M6.61), Niigata (2004) (M6.63), and Chuetsuoki (2007) (M6.8) were downloaded from the United States Geological Survey (USGS) database [46], and small events (2020, approximately M2.5) recorded by our MEMS sensor in Korea were included as well. The sampling rate of all events is 100 Hz, and each record is provided in three channels.
B. Data Preprocessing and Training Details
To facilitate training and analysis, we preprocess the seismic data. Each record (all are sampled at 100 data points per second) retains only 3600 data points containing the P-wave and S-wave; after removing 50 abnormal points at each end, the final length is 3500 points. The training and test datasets are split in a 7:3 ratio. The experiments were carried out on the Ubuntu 18.04 operating system, and the learning rate was set to
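A minimal sketch of this preprocessing follows; the window-selection rule, array shapes, and variable names are illustrative assumptions, while the 3600-point window, the 50-point end trims, and the 7:3 split follow the description above.

```python
import numpy as np

def preprocess(record, keep=3600, trim=50):
    """Trim a 100 Hz, 3-channel record (shape [n_samples, 3]) to a window
    containing the P- and S-waves, then drop the noisy endpoints.
    The centering rule (around the peak of the first channel) is a placeholder."""
    start = int(np.argmax(np.abs(record[:, 0])) - keep // 2)
    start = min(max(start, 0), len(record) - keep)
    window = record[start:start + keep]
    return window[trim:keep - trim]                # 3500 x 3

# Hypothetical dataset of raw records and a 7:3 train/test split.
rng = np.random.default_rng(0)
records = [rng.normal(size=(6000, 3)) for _ in range(10)]
data = np.stack([preprocess(r) for r in records])
split = int(0.7 * len(data))
train, test = data[:split], data[split:]
```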
Results and Discussion
A key question is how to evaluate the quality of the generated data. With this in mind, we design a variety of evaluation schemes in this section:
Initially, we analyze the performance of the EQGAN model to generate data through visual appearance.
Then, we compare and analyze the frequency domain of generated and real data.
From the perspective of the seismic data distribution pattern, we introduce paired scatter plots to compare the distributions of generated, real, and noise data points and the correlations among different channels. We also compare the performance of other generation models.
Also, to make the evaluation more reliable, we introduce the Mean Squared Error (MSE), Mean Absolute Percentage Error (MAPE), and WD as error quantification indexes, and use high-throughput screening theory to design a novel quantitative accuracy evaluation method for generation models.
Finally, we compare the computational complexity of different models and discuss the robustness of our model from the perspective of computational cost.
Through an analysis of generated data, real data, and noise data, results of our model cast a new light on augmenting seismic data.
A. Visual Appearance
The accuracy of earthquake detection depends not only on the first arrival of the earthquake waves but also on their amplitude and frequency. The visual appearance of generated data is therefore one of the primary indicators of its quality. Fig.2b-c highlights seismic data generated by our EQGAN model, which present characteristics similar to those of real seismic waveforms. The arrival times of the P-wave and S-wave are clearly observable in the different dimensions of the generated data. It is worth noting that compared with Channel
The diversity and fidelity of the data generated by our EQGAN model, (a) The fundamental waveform of seismic data, (b) and (c) represent the waveform of generated data.
From the perspective of diversity, real earthquake sequences sometimes contain small vibrations such as foreshocks or aftershocks, which is consistent with seismological observation. Our model can generate not only a single event but also sequences with aftershocks, and it captures the characteristics of each. At the same time, Fig.2c shows that the amplitudes of the generated data also differ markedly. These varied features demonstrate the diversity of the data produced by our generation model, which coincides closely with real seismic data recorded by acceleration sensors.
B. Frequency Domain Analysis
To further confirm the quality of the generated data, we use the Fast Fourier Transform (FFT) to obtain the frequency spectra of the generated and real data [47] (Fig.3). Because the generated data do not correspond sample by sample to the real data, an exact frequency-domain measurement over the whole dataset is not straightforward; instead, we evaluate the fluctuation of the frequency range by randomly selecting 100 real samples and 100 generated samples for testing. As can be seen, the frequency content of the real data remains within 0–40 Hz. Similarly, most of the frequency content of the generated data stays within the same range, and no spurious components appear outside the real-data frequency band, which further suggests that the real and generated data are highly similar in frequency range.
The comparison of the frequency domain between real earthquake data and generated data. (a) Real data frequency domain, (b) Generated data frequency domain.
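As an illustration of this frequency-domain check, the following sketch estimates the band containing most of a trace's spectral energy from its FFT; the 99% energy criterion and the synthetic stand-in trace are assumptions for demonstration only.

```python
import numpy as np

def dominant_band(waveform, fs=100.0, energy=0.99):
    """Return the frequency (Hz) below which `energy` of the spectral power lies,
    using the FFT of a single channel sampled at `fs` Hz."""
    spectrum = np.abs(np.fft.rfft(waveform)) ** 2
    freqs = np.fft.rfftfreq(len(waveform), d=1.0 / fs)
    cumulative = np.cumsum(spectrum) / np.sum(spectrum)
    return freqs[np.searchsorted(cumulative, energy)]

# Stand-in trace; in the actual check, 100 real and 100 generated samples would be
# passed through the same function and their bands compared against the 0-40 Hz range.
rng = np.random.default_rng(0)
sample = rng.normal(size=3500)
print(dominant_band(sample))
```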
C. Autocorrelation Distribution Analysis
Exploring further with another evaluation method, we randomly select one record from each corresponding dataset and use a scatter-matrix diagram to visualize it. Fig.4 can be divided into two parts: the scatter plots show that, for all kinds of data (real, generated, and noise data), any two channels of
Different data distribution patterns including real data, noise data, and generated data by different generation models.
For seismic data, the distribution of points is relatively scattered and weakly correlated because of the difference in amplitude between the P-wave and S-wave; in contrast, the distribution of non-seismic data points is uniform and concentrated. The autocorrelation distribution map shows that both real and generated data present an approximately Gaussian distribution pattern on Channel
Even though this method reflects the good quality of the data generated by the EQGAN model, it is difficult to distinguish false-positive data generated by the standard GAN and NN models. Moreover, it can only be applied to randomly selected individual records rather than evaluating a dataset as a whole.
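The scatter-matrix view of Fig.4 can be reproduced with a short pandas sketch like the one below; the synthetic sample and the channel names ch1-ch3 are placeholders, since the actual channel labels are not specified here.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical 3-channel record (3500 x 3); the real analysis uses one record
# drawn at random from the real, generated, and noise datasets.
rng = np.random.default_rng(0)
sample = pd.DataFrame(rng.normal(size=(3500, 3)), columns=["ch1", "ch2", "ch3"])

# Off-diagonal panels show the inter-channel scatter; the diagonal shows each
# channel's own amplitude distribution, as in Fig. 4.
pd.plotting.scatter_matrix(sample, diagonal="kde", alpha=0.3, figsize=(6, 6))
plt.show()
```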
D. Comparative Analysis of Different Generation Models
Although the previous evaluation methods verify the potential of EQGAN for earthquake-sequence generation, one limitation is that they only qualitatively evaluate individual generated records, which is a common weakness in the evaluation of many ML models. To further establish the performance of EQGAN, we design a scheme to quantitatively assess the generated data based on the MSE, MAPE, and WD while also evaluating the performance of other generation models. (i) First, eight representative seismic records are selected from all real datasets as the standard dataset.
(ii) Then, we calculate the MSE between each sample dataset and the standard dataset:
\begin{equation*} MSE=\sum _{i=1}^{N} \frac {\left ({c_{i}-r_{i}}\right)^{2}}{N} \tag{9}\end{equation*}
where $c_{i}$ represents the sample data, $r_{i}$ is the standard data, and $N$ is the data length. We use real data, noise data, and data generated by the different models as sample datasets and compute the MSE (Eq. 9) against the standard dataset. The mean, minimum, and standard deviation of the MSE vector are extracted as characteristic parameters for experimental confirmation; the different characteristic parameters show similar distribution patterns. Fig.5a displays the distribution difference between each sample dataset and the real dataset when the MSE is minimized. The distribution of the dataset generated by EQGAN has the highest similarity with the real dataset, while there are cliff-like differences between the distributions of the other sample datasets and the real dataset.
Distribution diagram of error quantification index. (a) Minimizing MSE, (b) Minimizing MAPE, (c) Minimizing WD. Here, the y-axis (Vertical) represents the distribution probability, and the x-axis (Horizontal) shows the value of the corresponding error-index.
Comparing the models built on different ML algorithms, the results reveal that the overall similarity between the sample datasets and the real dataset is: EQGAN > GAN > LSTM > NN.
Although the MSE reflects the actual error between generated and real data, judging generation ability from the MSE alone is not convincing. Consequently, we apply the same scheme to calculate the mean absolute percentage error (MAPE) between each sample dataset and the standard dataset (Eq 10):\begin{equation*} MAPE=\frac {1}{N}\sum _{i=1}^{N} \frac {\left |{c_{i}-r_{i}}\right |}{r_{i}} \times 100 \% \tag{10}\end{equation*}
MAPE is a statistical index for measuring prediction accuracy that considers both the error between the predicted and actual values and the ratio of that error to the actual value; the closer the MAPE is to 0, the more similar the two groups of data. Calculating the MAPE of the generated, real, noise, and standard datasets shows that the distribution patterns of the real and generated datasets are very similar. Fig.5b confirms the same ordering of similarity between the datasets generated by different models and the real dataset: EQGAN > GAN > LSTM > NN. From a statistical perspective, both the MSE and MAPE are broadly applicable to quantitative error analysis, whereas the WD specifically measures the difference between the probability distributions of two pieces of high-dimensional data. Accordingly, we apply the corresponding scheme to calculate the WD (Eq 11) and further analyze the quality of the generated data; the results are shown in Fig.5c.\begin{equation*} W_{p}(\mu, \nu)=\left ({\inf _{\gamma \in \Gamma (\mu, \nu)} \int _{\mathcal {X} \times \mathcal {X}}\|x-y\|^{p} d \gamma (x, y)}\right)^{1 / p} \tag{11}\end{equation*}
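For reference, the following sketch computes the three error indexes of Eqs. (9)-(11) for a pair of single-channel traces; the small epsilon in the MAPE and the use of the one-dimensional empirical WD over amplitude values are practical assumptions, not part of the original definitions.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def mse(c, r):
    """Eq. (9): mean squared error between a sample trace c and a standard trace r."""
    return np.mean((c - r) ** 2)

def mape(c, r, eps=1e-8):
    """Eq. (10): mean absolute percentage error (eps guards against zero amplitudes)."""
    return np.mean(np.abs(c - r) / (np.abs(r) + eps)) * 100.0

def wd(c, r):
    """Empirical 1-D Wasserstein distance between the amplitude distributions,
    a practical stand-in for Eq. (11) on single-channel traces."""
    return wasserstein_distance(c, r)

# Hypothetical sample/standard traces of equal length.
rng = np.random.default_rng(0)
standard = rng.normal(size=3500)
sample = standard + 0.1 * rng.normal(size=3500)
print(mse(sample, standard), mape(sample, standard), wd(sample, standard))
```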
Although the previous evaluation methods already examine the quality of the generated data, to prove the robustness and stability of EQGAN we further adopt high-throughput screening (HTS) theory to analyze the performance of the different generation models. HTS is an essential technique in drug research and development based on experiments at the molecular and cellular level, in which microplates serve as the experimental carriers to screen high-quality candidates automatically [48]–[50]. The quality of the screening depends on the design of the microplate, and some candidates show false-positive results under different screening plates [51], which corresponds closely to the error evaluation scheme we designed with the MSE, MAPE, and WD. Fig.6 presents our basic screening process for the generated data. Given the significant differences among the data generated by the different models, we normalize all datasets under each error index to eliminate dimensional effects and make the seismic datasets comparable. It is worth noting that normalization reduces the differences among the data generated by different models and changes their distributions; hence, we try different normalization methods for the MSE, MAPE, and WD and choose the best one to process the data (Fig.7).
Distribution of different error quantification indexes with normalization. Here, the y-axis (Vertical) represents the distribution probability, and the x-axis (Horizontal) shows the value of the corresponding error-index.
Comparing Fig.5 with Fig.7, we find that the generated datasets are more similar to the real dataset after normalization. This does not mean that their distribution pattern is changed by the normalization process; only the differences among the data are reduced, which does not affect the validity of the statistical analysis. We can therefore conclude that training the generation task within the GAN framework works better than a single-algorithm model, and that the generation performance of our EQGAN model is better than that of the standard GAN.
Furthermore, based on the MSE, MAPE, and WD, we calculate the correlation between the sample datasets generated by the different models and the real dataset. The scatter-plot matrix directly reveals the correlation between the generated and real datasets under the different quantitative indexes, while the correlation matrix quantifies and summarizes the strength of their linear relationships (Fig.8a). The correlation coefficient between the dataset generated by EQGAN and the real dataset is 0.11. Although this looks low, it is much higher than that of the other generation models, which shows both that the generation performance of EQGAN is strong and that the generated data are not simply copies of the real data.
Stability and performance analysis from different models. (a) shows the correlation between datasets generated by different models and real datasets, (b) Comparison of performance and stability of different models under different filter screening conditions, (c) Accuracy analysis of the same amount of data from different generation models filtered by different filters (1-filter represents MSE, 2-filter denotes MSE+MAPE, 3-filter denotes MSE + MAPE + WD).
Based on the above results, we then use each generation model to complete 10 consecutive generation tasks, producing 2,000 data samples each time, to further verify the performance of the different models. The error-bar diagram (Fig.8b) gives the maximum, minimum, and mean of the MSE, MAPE, and WD for the datasets generated by each model, allowing us to assess the efficiency and stability of the models. Under the MSE, EQGAN completes the generation task most stably, but the generation performance of the standard GAN appears highest; the reason is that normalization of the GAN-generated data introduces false-positive samples, which inflates its apparent efficiency. At the same time, under the more complex MAPE and WD indexes, the results suggest that the data generated by the LSTM- and NN-based models are the most stable and accurate.
Finally, using the HTS method, we filter the 2,000 data samples generated by each model with filters of increasing complexity built from the MSE, MAPE, and WD. Fig.8c shows that the generation performance of the different models is: EQGAN (81%) > GAN (74%) > NN (21%) > LSTM (2%), implying that EQGAN possesses strong generalization and stability for intelligently handling diverse earthquake series.
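A minimal sketch of this HTS-style screening follows, reusing the mse, mape, and wd helpers above; the thresholds and the survival rule (best score against the standard set) are illustrative assumptions rather than the paper's exact filters.

```python
import numpy as np

def hts_filter(samples, standards, metrics, thresholds):
    """Sequentially screen generated samples, HTS-style: a sample survives only if,
    for each metric in turn, its best score against the standard set stays below
    the corresponding threshold. Thresholds here are placeholders."""
    survivors = list(samples)
    for metric, thr in zip(metrics, thresholds):
        survivors = [s for s in survivors
                     if min(metric(s, r) for r in standards) < thr]
    return survivors

# Example with the mse/mape/wd helpers defined above and hypothetical data:
# kept = hts_filter(generated_samples, standard_samples,
#                   metrics=[mse, mape, wd], thresholds=[0.05, 30.0, 0.1])
# acceptance_rate = len(kept) / len(generated_samples)
```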
E. Computational Complexity
One might expect our EQGAN model, which combines several algorithms, to have high computational complexity; this is, however, not the case. The quality of an algorithm is usually judged by three considerations: (i) the time consumed during execution; (ii) the resources occupied during execution, such as memory; and (iii) how easy the algorithm is to understand, implement, and verify. Different algorithms therefore need to be selected in different situations. Because a large amount of data must be processed, we first consider the difficulty of data processing, which is no different in LSTM or GAN than in EQGAN; still, EQGAN has clear advantages in time complexity and ease of implementation. This paper mainly discusses the time complexity of the algorithms and their feasibility; Appendix 5 gives the complexity measures of the different algorithms.
Conclusion
A new DGM called EQGAN is proposed in this paper to capture the multi-dimensional temporal evolution of seismic sequences and generate high-quality seismic sequences containing P-waves and S-waves. To verify the performance of the EQGAN model, we compare it against the standard GAN, NN, and LSTM models: we qualitatively evaluate the quality of the generated data from the distribution pattern and frequency domain, and quantitatively analyze the similarity between generated and real data by combining the statistical indexes MSE, MAPE, and WD with the seismic data. Experimental results show that the efficiency of the data generated by our EQGAN model reaches 81% (the generation performance of the standard GAN, LSTM, and NN models is 72%, 2%, and 21%, respectively), which further demonstrates the excellent performance and stability of our generation model.
Even though the current approach is not as easy to interpret as traditional supervised training models, the data screening and evaluation scheme based on HTS theory is highly consistent with the distribution pattern of seismic data. No apparent defect prevents the extension of the EQGAN model, which also promotes the application and innovation of ML algorithms in seismology. These findings provide a potential mechanism for data augmentation. We also expect that EQGAN may generate seismic sequences similar to those recorded by a sensor at a specific location, providing a more convenient data-support scheme for earthquake prediction. Based on the generated data, we will further develop and train earthquake detection models to improve the accuracy and robustness of the EEW system. Moreover, with fault-detection/identification techniques as a future research direction [52], we will design fault-detection equipment for regulating and maintaining sensors under abnormal conditions to improve the quality of the data recorded by our sensors.