
A Genetic Programming-Driven Data Fitting Method


The flow chart of optimization construction for hybrid model.

Abstract:

Data fitting is the process of constructing a curve, or a set of mathematical functions, that best fits a series of data points. Unlike a fitting model built from a single type of function, such as the polynomial model, we observe that a hybrid fitting model mixing multiple types of functions can achieve a better fit. Moreover, it also shows better interpretability. However, a well-fitting smooth hybrid model depends on a reasonable combination of multiple functions and a set of effective parameters, which constitutes a high-dimensional multi-objective optimization problem. This paper proposes a novel approach to constructing data fitting models. In this approach, the model is expressed by an improved tree coding and constructed through an evolutionary search process driven by genetic programming. To verify the validity of the generated hybrid fitting model, six prediction problems are chosen for experimental study. The experimental results show that the proposed method is superior to seven typical methods in terms of prediction accuracy and interpretability.
Published in: IEEE Access ( Volume: 8)
Page(s): 111448 - 111459
Date of Publication: 15 June 2020
Electronic ISSN: 2169-3536

SECTION I.

Introduction

The goal of constructing a data fitting model is to seek a set of functions that describes the approximate correlation among a group of variables, subject to constraints. Such a model can serve as a tool for data characterization or prediction. Generally, these methods can be broadly divided into two categories: models with a concrete function expression and models based on intelligent computing approaches, with the polynomial model and the neural network as typical examples of the former and the latter respectively. In recent years, ensemble learning and deep learning have been applied to data fitting problems and show outstanding performance. However, their training process is relatively complicated. More importantly, the trained model is a black box that cannot reveal the coupling relationship among variables, making it difficult to comprehend and further utilize. Accordingly, the model with a concrete function expression still has its advantages. But traditional methods, such as the polynomial model, require a prior hypothesis about the type and number of functions used, which are unknowable in most cases, as well as a group of optimized parameters. Therefore, how to generate a fitting model with a reasonable structure while optimizing the related parameters has become a key problem.

We find that a hybrid fitting model with lower complexity and higher fitting accuracy can be constructed by mixing different types of functions. However, constructing such a model first calls for a more effective coding expression and stronger optimization ability. Concerning this issue, this paper proposes a method for constructing the hybrid fitting model that represents the model by tree coding and co-optimizes the model structure and parameters by evolutionary search. The major contributions of this paper include:

  1. An improved expression tree coding mechanism is proposed to express the hybrid fitting model. In this coding mechanism, each node is composed of a structure part and a multiplier factor part. When the structure changes, the variable length of the tree coding makes it possible to express the new model flexibly, which lays the foundation for searching and optimizing the model. Moreover, compared with traditional coding, the complexity of the improved expression tree coding is reduced.

  2. An optimization mechanism for the hybrid fitting model based on improved genetic programming (GP) is proposed. This mechanism co-optimizes the model structure and parameters by evolutionary search, which can improve the fitting accuracy of the model, reduce its complexity and enhance its interpretability.

The remainder of this work is organized as follows. Section II analyzes related research on fitting models and their optimization mechanisms. Section III introduces the proposed tree coding expression and the corresponding evolutionary optimization of the hybrid fitting model. Section IV presents the results and their discussion. Finally, Section V gives conclusions and future work.

SECTION II.

Related Works

This section discusses related work on data fitting methods based on intelligent computing, fitting approaches with explicit function expressions, and the mechanisms for optimizing them.

A. Data Fitting Model Based on Intelligent Computing Method

With the development of machine learning, data fitting models based on intelligent computing methods have been widely applied. The support vector machine (SVM), for instance, is a mature means. Karimi et al. [1] developed a binary SVM model for urban expansion prediction by selecting the most appropriate kernel function and its parameters. Sousa et al. [2] used the genetic algorithm to optimize an SVM model that helps forecast the classification and recovery rate of urban waste. Although SVM can handle nonlinear problems and local minima, it has difficulty dealing with large amounts of data [3]. Ensemble learning methods, such as random forest (RF) and eXtreme gradient boosting (XGBoost), have been developed on the basis of Bagging and Boosting [4]. In [5], RF was used to construct a model for predicting the mortality rate of patients suffering acute renal injury, and in [6], a new ultra-short-term offline prediction model of photovoltaic characteristics based on RF was proposed. The results show that the prediction of RF is highly accurate, but the final result is limited by the prediction performance of each decision tree. In [7], XGBoost, a typical boosting algorithm, is used to avoid overfitting and establish an efficient energy load prediction model for residential buildings. Based on XGBoost, a C-A-XGBoost sales prediction model was proposed that focuses on the characteristics of commodity sales and the trend fitting of data series [8]; the experimental results suggest that its prediction is more accurate. The neural network is an effective nonlinear data fitting method [9], [10]. In recent years, with the development of the convolutional neural network (CNN), data prediction models based on it have been more widely used. In [11], a CNN-based framework was proposed for predicting the next day's direction of movement for the indices of S&P 500, NASDAQ, DJI, NYSE, and RUSSELL.

A data fitting model combining linear regression and the deep belief network model has been proposed [12]. In addition, the long short-term memory network (LSTM) has special structure of memory and gate, which is also frequently used to solve problems of prediction [13]–​[15].

Data fitting methods based on intelligent computing show excellent performance, but their working mechanism is complex; in particular, the model generated by training is a black box that cannot describe the detailed relations between different variables in the data. In many data fitting and prediction tasks, the interpretability of the model is of great significance. For example, reference [16] pointed out that a conveyor energy consumption model based on a BP neural network is not conducive to describing control optimization problems, while a model based on a function expression is more suitable.

B. Data Fitting Model Based on Function Expression

The data fitting model based on a function expression, in addition to its simpler structure, can also express the coupling relationship between different variables more clearly. The polynomial model, a variant of the linear model, is a typical example adaptable to nonlinear relationships [17]. The Gaussian distribution model is widely applied for its robustness and computational efficiency [18]. In addition, Lasso regression can effectively deal with high-dimensional data by constructing a penalty function to obtain a more refined model [19], [20]. A wind power prediction method combining Lasso regression [21] greatly shortens computing time. Lasso regression was applied to predict power consumption in [22]; its output shows that the power consumption of Guangdong Province is closely related to the historical consumption, the proportion of the secondary industry and the permanent population. In [23], a linear piecewise fitting model is given to forecast yield automatically from temperature, reactor volume and reactant concentration. Due to the complexity of the chemical reaction, it is difficult for experts to discern the rules of yield prediction, and as pointed out there, the piecewise fitting model is easier to understand than SVR. An effective algorithm was proposed in [24] to identify the key segmentation features and the number of final segmentation points, with each segment fitted by a multivariate linear regression function. But when continuous trials are used to automatically determine the number of data regions, the calculation is inefficient and may lead to overfitting.

Before such a model is built, the data fitting model based on a function expression generally calls for a structural hypothesis of the model, followed by identification of the relevant parameters. Such a model has a simple working mechanism and expresses practical problems clearly. However, for an unknown problem it is often difficult to offer a reasonable structural hypothesis, and the optimization of the related parameters also directly affects the performance of the model.

C. Optimization Mechanism for Model Parameters

In [25], a thermal error compensation model of the machine tool based on the exponential model used the least squares method to optimize the estimation equation relating the axial deformation of the spindle and time. In [26], the quasi-Newton method was used to optimize the parameters of a multivariate nonlinear regression model, obtaining a regression model of the dry matter content of potatoes. In [27], a multiple nonlinear regression model was established by studying the influence of various operating parameters on the thermal environment; with two objective functions defined, maximum exergy efficiency and minimum total cost, the downhill simplex method was then used to optimize the parameters. In recent years, research on optimizing regression models with evolutionary algorithms has developed rapidly. In [28], a regression equation between the stress of a solder joint and its structural parameters was established, and the genetic algorithm (GA) optimized the structural parameters so that the optimal combination with the minimum solder joint stress could be obtained. In [29], a prediction model of the crude oil price based on wavelet transformation and multiple linear regression was proposed, and particle swarm optimization (PSO) was used to optimize the model parameters. Chen et al. [30] introduced granular computing into PSO to optimize a nonlinear model composed of multiple regression models. Sheng et al. [31] adopted expectation maximization (EM) [32], a common approach to estimating the optimal hyperparameters, to optimize Gaussian mixture regression for estimating the charge of electric vehicles. Such studies mainly focus on optimizing the parameters of a model; research on jointly optimizing the model structure and the related parameters has not been reported.

SECTION III.

Proposed Approach

The interpretability of a model is critical for many data fitting and prediction tasks. Obviously, a model with a clear function expression better meets this requirement. This study finds that a hybrid data fitting model (referred to as the hybrid model) mixing different types of functions can make the model much less complicated while ensuring fitting accuracy, which facilitates comprehension and analysis of the model. The main problem is how to optimize the structure and parameters of the hybrid model.

A. Analysis of the Hybrid Model

Suppose the data set S consists of N data points, expressed as S=\{X_i, Y_i\} , X_i =[x_1,x_2,\ldots,x_d] , i=1,2,\ldots,N . In S , X_i is the input of the i -th sample point and Y_i is the output, d stands for the dimension of X_i , and N is the number of samples.

Definition 1:

The hybrid model \psi is defined as shown in Equation 1, \begin{equation*} \psi =\sum \limits _{k=1}^{K} {w_{k} g_{k} },\tag{1}\end{equation*}

where K is the number of subfunctions, w_{k} is the multiplier factor of the k -th subfunction, and g_{k} is the k -th subfunction, composed of exponential, logarithmic, Gaussian and power functions.
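As a concrete illustration, Equation 1 is just a weighted sum of callable subfunctions. The sketch below is illustrative only, not the paper's implementation; the particular subfunctions and coefficients are invented for the example (they echo the model \psi_r used later in Figure 3).

```python
import math

def hybrid_model(x, terms):
    """Evaluate Equation 1: psi(x) = sum_k w_k * g_k(x).
    terms is a list of (w_k, g_k) pairs, where g_k maps x to a real value."""
    return sum(w * g(x) for w, g in terms)

# An (unnormalized) Gaussian subfunction N(x, mu, sigma).
def gaussian(x, mu=4.5, sigma=0.8):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Example mixing a power (polynomial) term, a Gaussian term and a constant.
terms = [
    (7.5, lambda x: 0.01 * x ** 2 - 0.12 * x + 0.51),  # power term
    (0.45, gaussian),                                   # Gaussian term
    (0.96, lambda x: 1.0),                              # constant term
]
y = hybrid_model(2.0, terms)
```

Here the model complexity K is simply `len(terms)`, which is the quantity the search later tries to keep small.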

Definition 2:

Model error. The mean absolute error is used to evaluate the error of \psi , and the model error f_{error} can be expressed as \begin{equation*} f_{error} =\frac {1}{n}\sum \limits _{i=1}^{n} {\left |{ {Y'_{i} -Y_{i}} }\right |},\tag{2}\end{equation*}

where Y_{i} is the actual value of the i -th sample point in \xi , {Y}'_{i} is the value calculated by \psi , n is the number of sample points in \xi , and \xi \subseteq S is the observation sample set.
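Equation 2 is straightforward to compute; a minimal sketch (illustrative, not the authors' code) over paired observation and prediction lists:

```python
def mae(y_true, y_pred):
    """Equation 2: mean absolute error over n observed sample points."""
    assert len(y_true) == len(y_pred)
    return sum(abs(yp - yt) for yt, yp in zip(y_true, y_pred)) / len(y_true)
```

For example, `mae([1, 2, 3], [2, 2, 5])` gives `1.0`.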

Definition 3:

Model complexity. The number of subfunctions K is the complexity of \psi .

In general, \psi is a superposition of multiple subfunctions, so more subfunctions yield a more complex model.

Take the following example to illustrate the differences among models. Table 1 shows the structure, error, complexity and fitting degree of eight models, all of which adopt the least squares method to optimize the relevant parameters. The fitting degree is measured by the coefficient of determination (R^{2} ): the larger it is, the better the fit. Figure 1 shows the fitting curves of the three models with the best performance.

TABLE 1 Comparisons of Model Complexity and Fitting Degree

FIGURE 1. Comparison of fitting effects among different models.

From Table 1, three findings stand out. First, when the polynomial model is used to fit the data, the fitting degree can be improved by moderately increasing the complexity of the model; but beyond a certain level, the fitting degree is no longer significantly enhanced, as in the change from the 12-order polynomial to the 14-order polynomial. Second, the Gaussian function leads to a higher fitting degree; for example, the fitting degree of the model with five Gaussian subfunctions is close to that of the 12-order polynomial model. Third, the hybrid model improves the fitting degree and reduces model complexity by optimizing the combination structure of the subfunctions, as the last two models in Table 1 highlight. In addition, the overall shape of the hybrid model reflects the effect of subfunction superposition. A model with a single type of function can only capture changes in data details at the cost of an increasingly complex model, as with the polynomial model. Instead, the hybrid model shown in Figure 1 performs better, featuring multiple types of subfunctions and lower model complexity.

B. Coding Mechanism of Hybrid Model

The coding expression is the basis of constructing the hybrid model. To accommodate the search operation, an expression tree is used to encode the hybrid model. However, decoding is a high-frequency operation in the search process, and complicated models would make this calculation too expensive. To avoid this, the hybrid model is encoded by the improved expression tree (I-ET).

In the I-ET coding mechanism, a node consists of two parts: the multiplier factor and the structure. In other words, the structure part (sp) of each node is associated with a randomly generated multiplier factor part (mp). sp is an element selected from the function set F=\{F_1,F_2,\ldots,F_m \} or the terminal set T=\{x_1,x_2,\ldots,x_d,c\} , where T is the set of input variables and the constant c . In fact, g_{k} in the hybrid model is a subtree composed of several nodes, each including mp and sp. As a special case, a variable node in g_{k} is changed into a Gaussian node with a certain probability P_{gs} . Figure 2 compares the traditional expression tree coding of a node with the I-ET coding.
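A node in this coding might be represented as follows. This is a sketch based on the description above, not the paper's data structure; the field names, the value ranges for mp and the Gaussian attributes, and the default P_{gs} are all assumptions.

```python
import random

class IETNode:
    """One I-ET node: a structure part (sp) paired with a multiplier
    factor part (mp). A variable node becomes a Gaussian node with
    probability p_gs and then carries mean/variance attributes."""
    def __init__(self, sp, children=(), p_gs=0.1):
        self.sp = sp                          # element of F (e.g. '+') or T (e.g. 'x1', 'c')
        self.mp = random.uniform(-1.0, 1.0)   # multiplier factor part (range assumed)
        self.children = list(children)
        self.is_gaussian = False
        if sp.startswith('x') and random.random() < p_gs:
            self.is_gaussian = True
            self.mean = random.uniform(0.0, 10.0)     # attribute range assumed
            self.variance = random.uniform(0.1, 2.0)  # attribute range assumed

# A tiny tree: the subfunction x1 + c, with every node carrying its own mp.
root = IETNode('+', children=[IETNode('x1'), IETNode('c')])
```

The key point is that mp lives inside the same node as sp, rather than requiring separate multiplier and product nodes as in traditional coding.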

FIGURE 2. Coding structure of node.

Suppose the hybrid model \psi contains K subfunctions whose internal structure is not taken into consideration; then K-1 nodes are required to connect them. With traditional expression tree coding and I-ET coding, the number of nodes required for one subfunction is 3 and 1 respectively. Therefore, for K subfunctions, the total number of nodes is 4K-1 in the former coding and 2K-1 in the latter. Undoubtedly, I-ET coding is less complicated.
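The node counts above can be checked directly; a sketch of the arithmetic, under the stated assumption of 3 nodes per subfunction in traditional coding and 1 in I-ET coding:

```python
def nodes_traditional(k):
    """Traditional coding: 3 nodes per subfunction plus k-1 connector nodes."""
    return 3 * k + (k - 1)  # = 4k - 1

def nodes_iet(k):
    """I-ET coding: 1 node per subfunction plus k-1 connector nodes."""
    return k + (k - 1)      # = 2k - 1
```

For K = 5 subfunctions, traditional coding needs 19 nodes while I-ET coding needs 9.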

Additionally, the flexibility of this coding helps to express structural changes in the model. As shown in Figure 3, a new model is obtained when the subfunction \psi _{z} = 0.55N(x ,8.5,0.9) is added to the original hybrid model \psi _{r} = 7.5 (0.01x^{2} -0.12x+0.51)+ 0.45N(x ,4.5, 0.8)+0.96. Obviously, increasing or decreasing the number of subfunctions in the original model only calls for adding or deleting the related branches of the tree coding, without a great change of the overall coding structure.

FIGURE 3. Coding structure before and after adding subfunction.

C. Optimization Mechanism of Hybrid Model

Genetic programming [33]–[35] can search the structure of expression tree coding, and based on this idea an optimization mechanism can be designed to construct the hybrid model. The specific steps are as follows.

  • step1:

    Initialization. The population is composed of NP randomly generated I-ET coding individuals, expressed as pop=\{\psi _{1},\ldots,\psi _{NP} \},j=1,\ldots,NP . Constructing \psi _{j} needs some initial parameters, such as the maximal depth D of the model, the function set F and the terminal set T ; the initial node depth of the model is 1. The specific construction process is as follows.

    1. If the depth of the current node is less than D , an element is randomly selected from F\cup T ; otherwise, it is selected from T . The selected element is taken as the sp of the current node and associated with a randomly generated mp. If sp belongs to F , turn to 2); otherwise turn to 3).

    2. Determine the number of branches according to the number of child nodes required by the sp of the current node; for example, if sp is +, the number of child nodes is 2. The depth of the current node is increased by 1, and the process returns to 1) to construct the child nodes.

    3. If the sp of the current node is a variable in T , this variable is set as a Gaussian node with probability P_{gs} , and the mean and variance attributes that the Gaussian node requires are added, randomly generated within a certain range. This node becomes the terminal node of the branch; that is, the branch stops growing.

  • step2:

    Iterative search. In this process, the search operator probability P is used to choose between the crossover and mutation operators to generate the offspring individuals.

    1. Crossover operation. It can be expressed as \{o_{1},o_{2} \}=\psi _{1} \otimes \psi _{2} , in which \otimes is the crossover operator, \psi _{1} and \psi _{2} are the two parent individuals randomly selected from the population, and o_{1} and o_{2} are the two offspring individuals produced by crossover. The overall process of crossover is shown in Figure 4. The concrete steps of crossing \psi _{1} and \psi _{2} are as follows.

      1. Select the crossover points of \psi _{1} and \psi _{2} separately by generating random numbers according to the number of nodes of each model.

      2. According to the probability P_{cs} , choose one of two crossover strategies to perform the crossover operation. Figure 4 demonstrates the comparison between the two. Crossover strategy 1 takes the sp and mp of the crossover point as a whole, and directly exchanges the subtrees of the two parents rooted at the crossover points. In crossover strategy 2, the mp of the two crossover points is exchanged first, and then the subtrees rooted at the crossover points are exchanged likewise; in other words, mp is not exchanged together with sp.

        Obviously, the information processing granularity of the two crossover strategies is different. The first exchanges information with the subfunction as the unit, aiming to search for different subfunction combinations, while the second exchanges only the structure of the subfunctions without the relevant parameters, so as to keep the multiplier factor of the node changing after the model is initialized.

    2. Mutation operation. It can be defined as o=\odot (\psi) , in which \odot is the mutation operator, \psi is the parent individual selected randomly from the population, and o is the offspring individual produced by mutation. The specific mutation process of \psi is as follows. First, randomly select the mutation point of \psi . Second, since the attribute values of Gaussian nodes determine the particularity of the hybrid model, different mutation strategies are used according to whether the mutation point is a Gaussian node.

      When the mutation point is a non-Gaussian node, as shown in Figure 5, the subtree whose root node is the mutation point is deleted, and a new randomly generated subtree is inserted.

      When the mutation point is a Gaussian node, there are four ways of mutation: 1) mutation of the entire Gaussian node, namely, the mean, variance and mp. 2) mutation of the mp. 3) mutation of the mean. 4) mutation of the variance.

      One of them is randomly selected to perform the mutation operation. Figure 6 shows the process of the first way of mutation.

  • step3:

    Model evaluation and selection. The offspring individuals generated by searching compete with the parent ones, and the superior are selected to form the next generation population pop^{gen+1} . The model error f_{error} and complexity f_{node} are used to evaluate a model: the former adopts the formula in Definition 2, while the latter uses the number of nodes of the I-ET coding. Apparently, this is a bi-objective optimization process.

    When the two competing individuals \psi _{1} and \psi _{2} satisfy Equation 3, \begin{align*} \begin{cases} {f_{error} (\psi _{1})\le f_{error} (\psi _{2})} \\ {f_{node} (\psi _{1})\le f_{node} (\psi _{2})}, \end{cases}\tag{3}\end{align*}

    that is, when \psi _{1} dominates \psi _{2} , \psi _{2} is eliminated. When \psi _{1} and \psi _{2} do not dominate each other, the fitness is used for comparison. The fitness fit is calculated as shown in Equation 4, \begin{equation*} fit=\alpha f_{error}^{sf} +(1-\alpha)f_{node}^{sf},\tag{4}\end{equation*}

    where \alpha adjusts the proportion of the two objectives, and f_{error}^{sf} and f_{node}^{sf} are the normalized values given by \begin{align*} f_{error}^{sf}=&\frac {f_{error} (\psi)-f_{error}^{min}}{f_{error}^{max} -f_{error}^{min}}, \tag{5}\\ f_{node}^{sf}=&\frac {f_{node} (\psi)-f_{node}^{min} }{f_{node}^{max} -f_{node}^{min}},\tag{6}\end{align*}

    where f_{error}^{min} and f_{error}^{max} are respectively the minimum and maximum of f_{error} in the current population, and f_{node}^{min} and f_{node}^{max} are the minimum and maximum of f_{node} .

  • step4:

    If the number of individuals in the next population pop^{gen+1} is less than NP, turn to step2 to continue searching; otherwise, turn to step5.

  • step5:

    Increase the current iteration counter gen by 1. If gen has reached the maximum, output the best model; otherwise, go to step2.
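The evaluation-and-selection rule of step3 (Equation 3 dominance, falling back to the scalarized fitness of Equations 4-6) can be sketched as follows. Representing an individual as an (f_error, f_node) tuple is an assumption made for illustration; the sketch is not the authors' implementation.

```python
def select(p1, p2, pop_errors, pop_nodes, alpha=0.5):
    """Pick the survivor among two competing individuals, each given as an
    (f_error, f_node) tuple. pop_errors/pop_nodes are the objective values
    of the current population, used for the normalization of Eqs. 5-6."""
    def dominates(a, b):  # Equation 3: no worse on both objectives
        return a[0] <= b[0] and a[1] <= b[1]
    if dominates(p1, p2):
        return p1
    if dominates(p2, p1):
        return p2
    # Neither dominates the other: compare the fitness of Equation 4.
    def norm(v, vals):    # Equations 5 and 6 (min-max normalization)
        lo, hi = min(vals), max(vals)
        return 0.0 if hi == lo else (v - lo) / (hi - lo)
    def fit(p):
        return alpha * norm(p[0], pop_errors) + (1 - alpha) * norm(p[1], pop_nodes)
    return p1 if fit(p1) <= fit(p2) else p2
```

With alpha = 0.5, an individual that is much simpler but only slightly less accurate can win the comparison, which is what drives the search toward compact hybrid models.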

FIGURE 4. Crossover operation of models.

FIGURE 5. Mutation operation in the non-Gaussian node.

FIGURE 6. Mutation operation in the Gaussian node.

D. Time Complexity Analysis for Proposed Method

Suppose the population size is NP, and the maximal number of generations is NG during the execution of the algorithm.

In the initialization, the core operation is to randomly select nodes from the function set and terminal set, so the time complexity of initializing a model \psi using I-ET coding is O(N) , where N is the number of nodes of the model, and the time complexity of step1 is O(NP \times N) . In step2.1, the time complexity of selecting the crossover individuals is O(NP) , and finding the crossover point by traversing the tree is O(N) , so the time complexity of step2.1 is O(NP \times N) . In step2.2, the time complexity of selecting the mutation individual is O(NP) and of finding the mutation point by traversing the tree is O(N) , so the time complexity of step2.2 is O(NP \times N) . Therefore, the time complexity of step2 is O(NP \times N) . In step3, the time complexity of calculating the model complexity f_{node} is O(NP \times N) , and that of the model error f_{error} is O(NP \times N\times n) , where n is the number of sample points, so the time complexity of step3 is O(NP \times N\times n) .

Overall, the time complexity of the proposed algorithm is O(NG \times NP \times N\times n) . As the cost of building a good data fitting model, this is acceptable.

SECTION IV.

Experiments

The proposed method was implemented in MATLAB 2018a on a PC (2.3 GHz, 8 GB RAM, Windows 10). To verify its performance, six data prediction problems were selected for experimental study. Table 2 shows the relevant parameter settings of the method during optimization: D limits the infinite growth of the tree, P chooses between the crossover and mutation operators, P_{cs} chooses the crossover strategy, P_{gs} governs the generation of Gaussian nodes, and \alpha adjusts the balance between f_{error} and f_{node} .

TABLE 2 Parameters Setting

A. Data Sets

Six data sets from the UCI machine learning repository are taken as test cases. Among them, the data set Hydrodynamics contains 308 instances, each represented by seven attributes; predicting the residual resistance of a ship at the beginning of design is of great value for evaluating the ship's performance. The data set Energy efficiency contains data for 768 building shapes described by eight attributes, ranging from the surface area to the overall height; the purpose is to establish the relationship between the heating or cooling load and these eight attributes. The data set Concrete contains data on 1030 different concrete samples described by eight attributes, such as cement, water and fly ash, aiming at identifying the relation between the compressive strength of concrete and the eight attributes. The last two data sets, Red wine quality and White wine quality, contain 1599 red and 4898 white wine samples respectively, with the aim of building a model that predicts the quality of a wine from its 11 physicochemical features. The specific information of the above data sets is shown in Table 3.

TABLE 3 Information of Data Sets

B. Evaluation Metrics

This paper adopts 5-fold cross-validation to evaluate the performance of the proposed method. The algorithm runs independently ten times, and the Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) between the predicted and actual values are calculated; the final result is the average over all runs. MAE adopts f_{error} in Definition 2, and RMSE is calculated as \begin{equation*} e_{\textrm {RMSE}} =\sqrt {\frac {1}{n}\sum \limits _{i=1}^{n} {\left ({{Y'_{i} -Y_{i}} }\right)^{2}}},\tag{7}\end{equation*}

where Y_{i} is the observed value of the i -th sample point, {Y}'_{i} is the predicted value, and n is the number of sample points in the test set, which covers part of the sample points in the data set S .
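Equation 7 can be computed directly alongside the MAE of Definition 2; a minimal sketch (illustrative, not the authors' code):

```python
import math

def rmse(y_true, y_pred):
    """Equation 7: root mean square error over the test set."""
    n = len(y_true)
    return math.sqrt(sum((yp - yt) ** 2 for yt, yp in zip(y_true, y_pred)) / n)
```

RMSE penalizes large individual deviations more heavily than MAE, which is why the two metrics can rank the compared methods differently.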

C. Comparison of Experimental Results

This paper selects seven comparison algorithms, including the classical SVR, RF and XGBoost, and several recently proposed methods: ALAMO from reference [37], OPLRA from [23], and the two approaches named PROA and PROB from [24]. The classical methods in the experiment use the sklearn algorithm package under Python 3.7 with the following main parameters: the kernel of SVR is rbf; the number of decision trees used in both RF and XGBoost is 200; the learning rate of XGBoost is 0.1. The results of the other methods are taken directly from the original references. Table 4 and Table 5 respectively present the average MAE and RMSE results of each method.

TABLE 4 Comparison of MAE Results of Each Method on the 6 Data Sets

TABLE 5 Comparison of RMSE Results of Each Method on the 6 Data Sets

First, compare the proposed method in Table 4 with the three classical ones, SVR, RF and XGBoost. Of the three, XGBoost gets the minimum MAE. The proposed method is superior to SVR and RF on all six data sets, and works better than XGBoost on five sets, the exception being Cooling. The proposed method also performs better than the methods from the references, ALAMO, OPLRA, PROA and PROB, except on Cooling, where it is inferior to OPLRA, PROA and PROB.

All in all, the proposed method gets the minimum MAE on five data sets, while XGBoost has the minimum on Cooling.

According to the RMSE results in Table 5, PROA performs better than the other methods from the references, although RF surpasses PROA on Cooling, Concrete and White, and XGBoost also does better than PROA on four data sets, the exceptions being Heating and Red. It turns out that the proposed method is clearly preferable to PROA and XGBoost in terms of RMSE.

To evaluate the performance of each compared method more comprehensively, the following scoring strategy is adopted. For each data set, the methods are ranked by their MAE and RMSE: the method with the lowest prediction error scores 10 points, the next one 9 points, and so on. For lack of partial results, the score of each method is averaged over the five data sets excluding Red; in addition, ALAMO is excluded from the RMSE scoring. The final average score represents the overall performance of a method: the higher the score, the better the method. The average scores of the different methods are shown in Figure 7.
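The scoring strategy above can be sketched as follows; the method names and error values in the example are invented for illustration, not taken from Tables 4-5.

```python
def average_scores(errors_by_dataset):
    """Rank the methods on each data set by prediction error: the best
    scores 10 points, the next 9, and so on. A method's final score is
    its average over the data sets where it has a result."""
    totals, counts = {}, {}
    for errors in errors_by_dataset.values():
        ranked = sorted(errors, key=errors.get)  # lowest error first
        for rank, method in enumerate(ranked):
            totals[method] = totals.get(method, 0) + (10 - rank)
            counts[method] = counts.get(method, 0) + 1
    return {m: totals[m] / counts[m] for m in totals}

# Hypothetical errors on two data sets.
scores = average_scores({
    'Hydro':   {'proposed': 0.30, 'XGBoost': 0.35, 'SVR': 0.50},
    'Heating': {'proposed': 0.25, 'XGBoost': 0.28, 'SVR': 0.40},
})
```

Averaging per-data-set ranks rather than raw errors keeps data sets with very different error scales from dominating the comparison.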

FIGURE 7. Average score of each method in MAE (left) and RMSE (right).

Figure 7 makes the comparison easier: the proposed method obtains the highest score under both MAE and RMSE, and XGBoost and PROA also perform well. The two indicators give somewhat different rankings; under RMSE, RF is very competitive and ties with PROA.

This paper also applies the Welch t-test at the 5% significance level to compare the performance differences between methods, using the MAE values of each method on each data set. The test results are shown in Table 6, where "+" indicates that the proposed method is significantly better than the compared one, "≈" indicates no significant difference between the two, and "-" indicates that the proposed method is significantly worse.
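The Welch statistic accommodates unequal variances between the two samples, which is why it is preferred here over the standard t-test. A minimal pure-Python sketch of the statistic and the Welch-Satterthwaite degrees of freedom (in practice, `scipy.stats.ttest_ind` with `equal_var=False` computes the same test along with the p-value):

```python
import math

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom
    (Welch-Satterthwaite) for two independent samples with
    possibly unequal variances."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    se2 = va / na + vb / nb                          # squared standard error
    t = (ma - mb) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df
```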

TABLE 6 Results of the Statistical Significance Test

As Table 6 shows, the proposed method is superior to SVR on all six data sets. Compared with RF, it is worse on Cooling and statistically indistinguishable on Concrete, but superior on the remaining four data sets. Compared with XGBoost, it is inferior on Cooling and comparable on Concrete and Hydro, yet better on the remaining three.

D. Comparison of I-ET Coding and Traditional Coding

To evaluate the influence of the I-ET coding and the traditional tree coding on model performance, the training error (train f_{error}), the test error (test f_{error}), the average complexity of the population over the whole optimization process (avg f_{node}), and the running time are compared on the 6 test problems. The two coding mechanisms are each run ten times under the same hardware, software environment, and stopping condition. Table 7 compares train f_{error}, test f_{error}, and avg f_{node}; Figure 8 shows the running time and avg f_{node} on the 6 data sets; Figure 9 plots the average population complexity of the two coding mechanisms throughout the optimization process on Hydro.

TABLE 7 Comparison of Two Coding Mechanisms

FIGURE 8. Variation of running time and avg f_{node} of the two coding mechanisms on the 6 data sets.

FIGURE 9. Variation curve of the average complexity of the population on Hydro.

Table 7 shows that the I-ET coding outperforms the traditional coding on all three indicators across all data sets: train f_{error}, test f_{error}, and avg f_{node} are reduced by 24.5%, 37.0%, and 36.2% on average, respectively, so the I-ET coding is more effective in both accuracy and complexity. Meanwhile, Figure 8 shows that the running time correlates strongly with the avg f_{node} of the population: the larger avg f_{node} is, the longer the run takes. Figure 9 further confirms that the I-ET coding yields lower model complexity than the traditional coding. In summary, the I-ET coding reduces model complexity and running time markedly while also visibly lowering the training and test errors.
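The average reduction percentages quoted above follow the usual relative-reduction formula, sketched here for clarity (the per-data-set values behind the averages are in Table 7):

```python
def reduction(traditional, improved):
    """Relative reduction (%) of an indicator under I-ET coding
    compared with the traditional coding."""
    return 100.0 * (traditional - improved) / traditional
```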

E. Interpretability of the Hybrid Model

Taking the Hydro set as an example, the interpretability of the proposed model and the polynomial model is analyzed. The expression of the 3rd-order polynomial model y_{p} is given in Equation 8, and the expression of the hybrid model \psi obtained by the evolutionary search in this paper is given in Equation 9 (both shown at the bottom of this page).

The expressions show that the 3rd-order polynomial model y_{p} is large and complex, with a prediction error of 1.406. The 4th-order polynomial model reaches 0.643, but its expression is too complicated to list here. Compared with y_{p}, the hybrid model \psi generated by the proposed method has a simpler structure and a smaller error of only 0.483, making it more accurate than both the 3rd-order and 4th-order polynomial models.

In addition, the simpler structure of the hybrid model makes the prediction results easier to analyze. For example, on the Hydro set, the model constructed by the proposed method highlights the attributes x_{1}, x_{3}, x_{4}, and x_{6}, which correspond respectively to the longitudinal position of the center of buoyancy, the length-displacement ratio, the beam-draught ratio, and the Froude number. On the basis of these results, it becomes straightforward to analyze which attributes, or combinations of attributes, have the greatest impact on the residual resistance of the ship. The proposed method therefore yields a far less complex model that admits further interpretation.

F. Parameters Discussion

Take the Hydro set as an example once again to analyze the four major parameters.

1) Influence of Weight on Model Error and Complexity

When the fitness of the model is calculated, the weight \alpha is an important parameter: it balances model error against model complexity and determines the evolutionary direction of the population.

If \alpha is too small, complexity dominates and the whole population evolves toward simpler models; but a model with very low complexity carries too little information and tends to be inaccurate. Conversely, if \alpha is too large, reducing the model error drives the evolution, and the resulting model is likely to be overly complicated. Figure 10 shows the influence of different values of \alpha on model error and complexity: when \alpha is between 0.6 and 0.8, both accuracy and simplicity of the model are maintained.

FIGURE 10. Influence of different weights on model performance.
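The trade-off controlled by \alpha can be sketched as a weighted aggregation of the two objectives. This is an illustrative form under the assumption of a normalized complexity term; the paper's exact fitness function is not reproduced in this section:

```python
def fitness(f_error, f_node, alpha, max_node):
    """Weighted aggregation of model error and normalized complexity.
    alpha near 1 favors accuracy; alpha near 0 favors simplicity.
    (Illustrative form only -- not the paper's exact definition.)"""
    return alpha * f_error + (1 - alpha) * (f_node / max_node)
```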

2) Influence of Probability P of Search Operator on Model Error

In the search process, the genetic operation used to produce offspring is selected according to the probability P, a key parameter influencing the performance of GP. A larger P raises the probability of crossover, so new individuals are generated faster, but the structures of excellent individuals are also destroyed more quickly. A smaller P raises the probability of mutation, which makes the search more random. Computing P with a self-adaptive mechanism that responds to the state of the population in the evolutionary process lets the algorithm track the population's dynamics. Figure 11 shows the influence of different strategies for P on the model error over the iterations: the error curve with P fixed at 0.9 lies above the curve obtained with the self-adaptive P. With the self-adaptive mechanism determining the search-operator probability, the choice between genetic operations improves the search performance of the algorithm.

FIGURE 11. Error curve of P under different mechanisms.
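The role of P can be sketched as follows. The decaying schedule shown here is only one plausible stand-in for a self-adaptive rule; the paper adapts P to the population state, a rule not detailed in this section:

```python
import random

def pick_operation(p, rng=random):
    """Choose crossover with probability p, otherwise mutation."""
    return "crossover" if rng.random() < p else "mutation"

def adaptive_p(gen, max_gen, p_start=0.9, p_end=0.5):
    """One plausible schedule: decay the crossover probability
    linearly over the run, favoring exploration of structures early
    and parameter refinement late. (Illustrative only.)"""
    return p_start + (p_end - p_start) * gen / max_gen
```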

3) Influence of Probability P_{cs} of Different Crossover Strategies on Model Error

When the crossover strategy is selected, the probability P_{cs} also affects the model error. If P_{cs} is too small, the search granularity is too coarse; if it is too large, the search concentrates on combining subfunctions and the relevant parameters are searched too little. Figure 12 shows that as P_{cs} increases, the model error first decreases and then increases; when P_{cs} is between 0.5 and 0.7, the two crossover strategies are well balanced.

FIGURE 12. Influence of P_{cs} on model error.

4) Influence of Probability P_{gs} of Generating Gaussian Node on Model Error

When individuals are initialized or mutated, P_{gs} is the probability that an ordinary variable node becomes a Gaussian node; the Gaussian function helps reduce model error. If P_{gs} is too small, the Gaussian function rarely appears in the hybrid model, and the model error remains large. If P_{gs} is too large, the model contains superfluous Gaussian functions and too little information from the other subfunctions, which likewise enlarges the error.

Figure 13 shows the influence of different P_{gs} values on the model error: as P_{gs} grows, the overall error first decreases and then increases, reaching its minimum when P_{gs} is 0.5.

FIGURE 13. Influence of P_{gs} on model error.
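The generation of terminal nodes under P_{gs} can be sketched as below. The tuple-based node encoding and the parameter ranges are hypothetical, chosen only to illustrate the mechanism:

```python
import math
import random

def gaussian(x, mu, sigma):
    """Gaussian basis function used as a candidate node."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

def make_terminal(p_gs, variables, rng=random):
    """With probability p_gs, wrap a randomly chosen variable in a
    Gaussian node with random parameters; otherwise emit a plain
    variable node. (Hypothetical node encoding for illustration.)"""
    var = rng.choice(variables)
    if rng.random() < p_gs:
        return ("gauss", var, rng.uniform(-1.0, 1.0), rng.uniform(0.1, 1.0))
    return ("var", var)
```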

SECTION V.

Conclusion

This paper presents a method for constructing a hybrid data fitting model based on I-ET coding and evolutionary search. Experiments on six UCI data sets and comparisons with other methods show that the proposed approach produces hybrid models with higher prediction accuracy and lower complexity. The paper also examines how the I-ET coding adopted in the proposed method reduces computational complexity, and how the choice of four important parameters affects the performance of the constructed model.

In future work, the following directions will be pursued. First, the effectiveness of the proposed approach was verified only on UCI data sets; as a general framework, it can be applied to other practical prediction problems. Second, the coding of the hybrid model is a fundamental issue that directly affects the efficiency of the whole algorithm; although the I-ET coding reduces model complexity, the search process remains expensive, so more efficient and simpler coding structures deserve further exploration. Finally, multi-objective optimization techniques need further study in order to construct hybrid models with better performance.
