Journals & Magazines >IEEE Access >Volume: 8

A Genetic Programming-Driven Data Fitting Method

The flow chart of optimization construction for hybrid model.

Abstract:

Data fitting is the process of constructing a curve, or a set of mathematical functions, that has the best fit to a series of data points. Different with constructing a f...Show More

Metadata

Abstract:

Data fitting is the process of constructing a curve, or a set of mathematical functions, that has the best fit to a series of data points. Different with constructing a fitting model from same type of function, such as the polynomial model, we notice that a hybrid fitting model with multiple types of function may have a better fitting result. Moreover, this also shows better interpretability. However, a perfect smooth hybrid fitting model depends on a reasonable combination of multiple functions and a set of effective parameters. That is a high-dimensional multi-objective optimization problem. This paper proposes a novel data fitting model construction approach. In this approach, the model is expressed by an improved tree coding expression and constructed through an evolution search process driven by the genetic programming. In order to verify the validity of generated hybrid fitting model, 6 prediction problems are chosen for experiment studies. The experimental results show that the proposed method is superior to 7 typical methods in terms of the prediction accuracy and interpretability.

The flow chart of optimization construction for hybrid model.

Published in: IEEE Access ( Volume: 8)

Page(s): 111448 - 111459

Date of Publication: 15 June 2020

Electronic ISSN: 2169-3536

DOI: 10.1109/ACCESS.2020.3002563

Funding Agency:

Contents

CCBY - IEEE is not the copyright holder of this material. Please follow the instructions via https://creativecommons.org/licenses/by/4.0/ to obtain full-text articles and stipulations in the API documentation.

SECTION I.

Introduction

The goal of constructing a data fitting model is to seek a set of functions, which can describe the approximate correlation among a group of variables, and subject to constraints. It can be acted as a kind of data characterization or prediction tool. Generally, this method can be broadly divided into two categories, the model with a concrete function expression and the model based on some intelligent calculation approaches. The polynomial model and neural network are the typical one of former and latter respectively. In recent years, the ensemble learning and deep learning have been applied to deal with data fitting problem and show outstanding performance. However, the training process of them is relatively complicated. More importantly, the training-driven model turns out to be a black box which cannot showcase the coupling relationship among variables, making it difficult to comprehend and further utilize. Accordingly, the model with concrete function expression still has its advantages. But, the traditional methods, such as polynomial model, require a prior hypothesis including the type and number of used functions. In most cases, these are unknowable. And, a group of optimized parameters are also desired. Therefore, how to generate a fitting model with reasonable structure while optimizing related parameters has become a key problem.

It is found that the hybrid fitting model with lower complexity and higher fitting accuracy can be constructed by mixing different types of functions. But, constructing such a model firstly calls for mechanisms with more effective coding expression and optimization ability. Concerning this issue, this paper proposes a method for constructing the hybrid fitting model based on representation by tree coding and co-optimization of model structure and parameters by evolutionary search. Major contributions of this paper include:

The improved expression tree coding mechanism is proposed to express hybrid fitting model. In this coding mechanism, each node is composed of structure part and multiplier factor part. When its structure changes, the variable length of tree coding makes it possible to express the new model flexibly, which lay the foundation for searching and optimizing the model. Moreover, compared with traditional coding, the complexity of improved expression tree coding is reduced.
The optimization mechanism of hybrid fitting model based on improved genetic programming (GP) is proposed. This mechanism for co-optimization of model structure and parameters by evolutionary search, which can improve fitting accuracy of the model, reduce its complexity and make it possible to enhance its interpretability.

The remaining part of this work is organized as follows. Analysis of related research on fitting model and its optimization mechanism in Section II. Introduction of the proposed method for tree coding expression and relevant evolution and optimization of hybrid fitting model in Section III. Then, the results and its discussion in Section IV. Finally, some conclusion and future work in Section V.

SECTION II.

Related Works

This section focuses on the discussion of related works of the data fitting method based on intelligent computing, fitting approaches with explicit function expression, and the mechanism for optimizing them.

A. Data Fitting Model Based on Intelligent Computing Method

With the development of machine learning, the data fitting model based on the intelligent computing method has been widely applied. The support vector machine (SVM), for instance, is a mature means. Karimi et al. [1] developed binary SVM model for urban expansion prediction by selecting the most appropriate kernel function and its parameters. Sousa et al. [2] used the genetic algorithm to optimize the SVM model that helps forecast the classification and recovery rate of urban waste. Although SVM can solve nonlinear and local minimum problems, it is difficult to deal with a ton of data [3]. Ensemble learning as a practical method, such as random forest (RF), eXtreme gradient boosting (XGBoost) have been developed on the basis of Bagging and Boosting [4]. In [5], RF was used to construct model for predicting mortality rate of patients suffering acute renal injury, and in [6], a new ultra-short-term offline prediction model of photovoltaic characteristics based on RF was proposed. The results show that the prediction of RF is highly accurate, but the final result is limited by the prediction performance of each decision tree. In [7], XGBoost, a typical boosting algorithm, is used to avoid overfitting problems and establish an efficient energy load prediction model in residential buildings. Based on XGBoost, a C-A-XGBoost sales prediction model was proposed for focusing on the characteristics of commodity sales and the trend fitting of data series [8]. The experimental results suggest that the prediction is more accurate. Neural network is an effective nonlinear data fitting method [9], [10]. In recent years, with the development of convolutional neural network (CNN), the data prediction model based on it has been more widely used. In [11], a CNN-based framework for predicting the next day’s direction of movement for the indices of S&P 500, NASDAQ, DJI, NYSE, and RUSSELL.

A data fitting model combining linear regression and the deep belief network model has been proposed [12]. In addition, the long short-term memory network (LSTM) has special structure of memory and gate, which is also frequently used to solve problems of prediction [13]–[15].

The data fitting method based on the intelligent computing shows excellent performance, but its working mechanism is complex, especially the model generated by training, which is a black box and cannot describe the detailed relations between different variables in the data. In many tasks of data fitting and prediction, interpretability of the model is of great significance. For example, the reference [16] pointed out that the energy consumption model of conveyor based on BP neural network is not conducive to describe the problems of controlling optimization, while the model on the basis of function expression is more reasonable.

B. Data Fitting Model Based on Function Expression

The data fitting model based on function expression, in addition to the simpler structure, can express the coupling relationship between different variables in a clearer way as well. The polynomial model, a variant of the linear model, is a typical example adaptable to the nonlinear relationship [17]. The Gaussian distribution model is widely applied for its robustness and computational efficiency [18]. In addition, the Lasso regression can effectively deal with problems of high-dimensional data by constructing penalty function to obtain a more detailed model [19], [20]. A prediction method for wind power combining Lasso regression [21] shortens computing time greatly. Combined with Lasso regression to predict power consumption in [22], the output of Lasso regression shows that the power consumption of Guangdong Province is closely related to the historical consumption, the proportion of the secondary industry and the permanent population. In [23], a linear piecewise fitting model is given to forecast yield automatically from temperature, reactor volume and reactant concentration. Due to the complexity of chemical reaction, it is difficult for experts to make clear of the rules in yield prediction, and as it pointed out, the piecewise fitting model is easier to understand than SVR. An effective algorithm was proposed in [24] to identify the key segmentation features and the number of final segmentation points. Each segment was fitted with a multivariate linear regression function. But when the continuous trial is taken to automatically determine the number of data areas, the calculation is inefficient and may lead to over fitting.

Before being built, the data fitting model based on function expression generally calls for given structural hypothesis of the model, and following identification of relevant parameters. Such model has a simple working mechanism and clear expression of practical problems. However, for an unknown problem, it is often difficult to offer a reasonable structural hypothesis, and the optimization of related parameters will also directly affect the performance of the model.

C. Optimization Machanism for Model Parameters

In [25], the thermal error compensation model of the machine tool based on the exponential model in use of the least square method to optimize the estimation equation of axial deformation of spindle and time. In [26], the quasi newton method was used to optimize the parameters of multivariate nonlinear regression model, and obtained the regression model of dry matter the potato contains. In [27], a multiple nonlinear regression model was established by studying the influence of various operation parameters on the thermal environment. By defining two objective functions, maximum exergy efficiency and the minimum total cost, then the downhill simplex method is used to optimize parameters. In recent years, the research of optimization regression model based on evolutionary algorithms has been developed rapidly. The regression equation between the stress of solder joint and the structural parameters was established [28], and the genetic algorithm (GA) optimizes the structural parameters of solder joint, the optimal combination of structural parameters with the minimum stress of solder joint is available. In [29], a prediction model of crude oil price based on wavelet transformation and multiple linear regression, and particle swarm optimization (PSO) is used to optimize the model parameters. Chen et al. [30] introduced particle calculation into PSO to optimize the nonlinear model composed of multiple regression models. Sheng et al. [31] adopted expectation maximization (EM) [32], a common approach to estimate of the optimal super parameters, to optimize the Gaussian mixture regression for estimating the charge of electric vehicles. Such studies at present mainly focus on optimizing parameters in the model, but research for a mixture of the better model structure and the related parameters has not been reported.

SECTION III.

Proposed Approach

The interpretability of a model is critical for many data fitting and prediction tasks. Obviously, a model with clear function expression meets more of this requirement. This study found that the hybrid data fitting model (referred to as the hybrid model) mixed by different types of functions can make the model much less complicated while ensuring the fitting accuracy, which does more favor to comprehension and analysis of the data fitting model. But the main problem is that how to optimize the structure and parameters of the hybrid model.

A. Analysis of the Hybrid Model

Suppose the data set $S$ consists of $N$ data points, which can be expressed as $S=\{X_{i},Y_{i} \}$ , $X_{i} =[x_{1},x_{2},\ldots,x_{d}],i=1,2,\ldots,N$ . In $S$ , $X_{i}$ is the input of the $i$ th sample point and $Y_{i}$ is the output, $d$ stands for the dimension of $X_{i}$ , and $N$ is the number of samples.

Definition 1:

The definition of the hybrid model $\psi$ is as shown in Equation 1

$\begin{equation*} \psi =\sum \limits _{k=1}^{K} {w_{k} g_{k} },\tag{1}\end{equation*}$ View Source

where

$K$

is the number of subfunctions,

$w_{k}$

is the multiplier factor of the

$k$

-th subfunction,

$g_{k}$

is the

$k$

-th subfunction composed of an exponential function, a logarithmic function, a Gaussian function and a power function.

Definition 2:

Model error. The mean absolute error is used to evaluate the error of $\psi$ , and the model error $f_{error}$ can be expressed as

$\begin{equation*} f_{error} =\frac {1}{n}\sum \limits _{i=1}^{n} {\left |{ {Y'_{i} -Y_{i}} }\right |},\tag{2}\end{equation*}$ View Source

where

$Y_{i}$

is the actual value of the

$i$

th sample point in

$\xi$

${Y}'_{i}$

is the calculated value about

$\psi$

$\xi$

, and

$n$

is the number of sample points in

$\xi$

is the observation sample set,

$\xi \in S$

Definition 3:

Model complexity. The number of subfunctions $K$ is the complexity of $\psi$ .

In general, $\psi$ is the superposition of multiple subfunctions, so more of them contributes to a more complex model.

Take the following example to illustrate the differences in different models. Table 1 shows the structure, error, complexity and the fitting degree of eight models that all adopt the least square method to optimize the relevant parameters. The fitting degree is calculated by decision coefficient ( $R^{2}$ ). The larger it is, the better the fitting effect is. Figure 1 shows the fitting curves of three models with better performance.

TABLE 1 Comparisons of Model Complexity and Fitting Degree

FIGURE 1.

Comparison of fitting effects among different models.

A Genetic Programming-Driven Data Fitting Method

Alerts

Abstract:

Metadata

Abstract:

Funding Agency:

Introduction

Related Works

A. Data Fitting Model Based on Intelligent Computing Method

B. Data Fitting Model Based on Function Expression

C. Optimization Machanism for Model Parameters

Proposed Approach

A. Analysis of the Hybrid Model

Definition 1:

Definition 2:

Definition 3:

B. Coding Mechanism of Hybrid Model

C. Optimization Mechanism of Hybrid Model

step1:

step2:

step3:

step4:

step5:

D. Time Complexity Analysis for Proposed Method

Experiments

A. Data Sets

B. Evaluation Metrics

C. Comparison of Experimental Results

D. Comparison of I-Et Coding and Traditional Coding

E. Interpretability of the Hybrid Model

F. Parameters Discussion

1) Influence of Weight on Model Error and Complexity

2) Influence of Probability P of Search Operator on Model Error

3) Influence of Probability P_{cs}P_{cs} of Different Crossover Strategies on Model Error

4) Influence of Probability \text{P}_{gs}\text{P}_{gs} of Generating Gaussian Node on Model Error

Conclusion

References

IEEE Account

Purchase Details

Profile Information

Need Help?

3) Influence of Probability $P_{cs}$ of Different Crossover Strategies on Model Error

4) Influence of Probability $\text{P}_{gs}$ of Generating Gaussian Node on Model Error