Introduction
The goal of constructing a data fitting model is to find a set of functions that describes the approximate correlation among a group of variables, subject to given constraints. Such a model can serve as a tool for data characterization or prediction. Broadly, fitting methods fall into two categories: models with a concrete function expression and models based on intelligent computing approaches; the polynomial model and the neural network are typical examples of the former and the latter, respectively. In recent years, ensemble learning and deep learning have been applied to data fitting problems and show outstanding performance. However, their training processes are relatively complicated. More importantly, the training-driven model is a black box that cannot reveal the coupling relationships among variables, making it difficult to comprehend and further utilize. Accordingly, models with concrete function expressions retain clear advantages. Yet traditional methods, such as the polynomial model, require a prior hypothesis about the type and number of functions used, and in most cases these are unknowable; a set of optimized parameters is also required. Therefore, how to generate a fitting model with a reasonable structure while optimizing the related parameters has become a key problem.
It is found that a hybrid fitting model with lower complexity and higher fitting accuracy can be constructed by mixing different types of functions. However, constructing such a model first calls for a more effective coding expression and stronger optimization ability. Concerning this issue, this paper proposes a method for constructing the hybrid fitting model based on tree-coding representation and co-optimization of model structure and parameters by evolutionary search. The major contributions of this paper are:
An improved expression tree coding mechanism is proposed to express the hybrid fitting model. In this coding mechanism, each node is composed of a structure part and a multiplier factor part. The variable length of the tree coding makes it possible to express a new model flexibly when its structure changes, which lays the foundation for searching and optimizing the model. Moreover, compared with traditional coding, the complexity of the improved expression tree coding is reduced.
An optimization mechanism for the hybrid fitting model based on improved genetic programming (GP) is proposed. This mechanism co-optimizes the model structure and parameters by evolutionary search, which improves the fitting accuracy of the model, reduces its complexity, and makes it possible to enhance its interpretability.
The remainder of this work is organized as follows. Section II analyzes related research on fitting models and their optimization mechanisms. Section III introduces the proposed tree coding expression and the corresponding evolution and optimization of the hybrid fitting model. Section IV presents the results and their discussion. Finally, Section V gives conclusions and future work.
Related Works
This section discusses related work on data fitting methods based on intelligent computing, fitting approaches with explicit function expressions, and mechanisms for optimizing them.
A. Data Fitting Model Based on Intelligent Computing Method
With the development of machine learning, data fitting models based on intelligent computing methods have been widely applied. The support vector machine (SVM), for instance, is a mature technique. Karimi et al. [1] developed a binary SVM model for urban expansion prediction by selecting the most appropriate kernel function and its parameters. Sousa et al. [2] used a genetic algorithm to optimize an SVM model that helps forecast the classification and recovery rate of urban waste. Although SVM can handle nonlinear problems and avoid local minima, it has difficulty coping with very large data sets [3]. Ensemble learning methods, such as random forest (RF) and eXtreme gradient boosting (XGBoost), have been developed on the basis of bagging and boosting [4]. In [5], RF was used to construct a model for predicting the mortality rate of patients suffering acute renal injury, and in [6], a new ultra-short-term offline prediction model of photovoltaic characteristics based on RF was proposed. The results show that RF predictions are highly accurate, but the final result is limited by the prediction performance of each decision tree. In [7], XGBoost, a typical boosting algorithm, was used to avoid overfitting and establish an efficient energy load prediction model for residential buildings. Based on XGBoost, a C-A-XGBoost sales prediction model was proposed to capture the characteristics of commodity sales and the trend of the data series [8]; the experimental results suggest that its predictions are more accurate. The neural network is an effective nonlinear data fitting method [9], [10]. In recent years, with the development of the convolutional neural network (CNN), data prediction models based on it have become more widely used. In [11], a CNN-based framework was proposed for predicting the next day's direction of movement of the S&P 500, NASDAQ, DJI, NYSE, and RUSSELL indices.
A data fitting model combining linear regression and the deep belief network has also been proposed [12]. In addition, the long short-term memory network (LSTM), with its special memory and gate structure, is frequently used to solve prediction problems [13]–[15].
Data fitting methods based on intelligent computing show excellent performance, but their working mechanisms are complex; in particular, the trained model is a black box that cannot describe the detailed relations between the variables in the data. In many data fitting and prediction tasks, the interpretability of the model is of great significance. For example, reference [16] pointed out that a conveyor energy consumption model based on a BP neural network is ill-suited to control optimization problems, while a model based on a function expression is more reasonable.
B. Data Fitting Model Based on Function Expression
The data fitting model based on a function expression, in addition to having a simpler structure, can express the coupling relationship between variables more clearly. The polynomial model, a variant of the linear model, is a typical example that adapts to nonlinear relationships [17]. The Gaussian distribution model is widely applied for its robustness and computational efficiency [18]. In addition, Lasso regression can effectively deal with high-dimensional data by constructing a penalty function to obtain a more parsimonious model [19], [20]. A wind power prediction method incorporating Lasso regression [21] greatly shortens computing time. In [22], Lasso regression was used to predict power consumption; its output shows that the power consumption of Guangdong Province is closely related to historical consumption, the proportion of the secondary industry, and the permanent population. In [23], a linear piecewise fitting model is given to forecast yield automatically from temperature, reactor volume, and reactant concentration. Owing to the complexity of chemical reactions, it is difficult for experts to articulate the rules of yield prediction, and as the authors point out, the piecewise fitting model is easier to understand than SVR. An effective algorithm was proposed in [24] to identify the key segmentation features and the number of final segmentation points, fitting each segment with a multivariate linear regression function. However, when continuous trials are used to automatically determine the number of data regions, the calculation is inefficient and may lead to overfitting.
Before it is built, a data fitting model based on a function expression generally calls for a structural hypothesis of the model, followed by identification of the relevant parameters. Such a model has a simple working mechanism and expresses practical problems clearly. However, for an unknown problem it is often difficult to offer a reasonable structural hypothesis, and the optimization of the related parameters also directly affects the performance of the model.
C. Optimization Mechanism for Model Parameters
In [25], a thermal error compensation model of a machine tool based on an exponential model used the least squares method to optimize the equation relating axial spindle deformation to time. In [26], the quasi-Newton method was used to optimize the parameters of a multivariate nonlinear regression model, yielding a regression model of the dry matter content of potatoes. In [27], a multiple nonlinear regression model was established by studying the influence of various operating parameters on the thermal environment; after defining two objective functions, maximum exergy efficiency and minimum total cost, the downhill simplex method was used to optimize the parameters. In recent years, research on optimizing regression models with evolutionary algorithms has developed rapidly. In [28], a regression equation between solder joint stress and structural parameters was established, and a genetic algorithm (GA) optimized the structural parameters to obtain the combination with the minimum stress. In [29], a crude oil price prediction model based on wavelet transformation and multiple linear regression was proposed, and particle swarm optimization (PSO) was used to optimize the model parameters. Chen et al. [30] introduced granular computing into PSO to optimize a nonlinear model composed of multiple regression models. Sheng et al. [31] adopted expectation maximization (EM) [32], a common approach to estimating optimal hyperparameters, to optimize Gaussian mixture regression for estimating the charge of electric vehicles. Current studies mainly focus on optimizing the parameters of a model; research that jointly searches for a better model structure and the related parameters has not been reported.
Proposed Approach
The interpretability of a model is critical for many data fitting and prediction tasks, and a model with a clear function expression better meets this requirement. This study found that a hybrid data fitting model (referred to as the hybrid model), mixing different types of functions, can make the model much less complicated while maintaining fitting accuracy, which facilitates the comprehension and analysis of the model. The main problem is how to optimize the structure and parameters of the hybrid model.
A. Analysis of the Hybrid Model
Suppose the data set contains $n$ samples $\{(X_{i}, Y_{i})\}_{i=1}^{n}$, where $X_{i}$ is the input vector of the $i$-th sample and $Y_{i}$ is its target value.
Definition 1: Hybrid model. The hybrid model $\psi$ is defined as \begin{equation*} \psi =\sum \limits _{k=1}^{K} {w_{k} g_{k}},\tag{1}\end{equation*} where $g_{k}$ is the $k$-th subfunction, $w_{k}$ is its weight, and $K$ is the number of subfunctions.
Definition 2: Model error. The mean absolute error is used to evaluate the error of the model $\psi$: \begin{equation*} f_{error} =\frac {1}{n}\sum \limits _{i=1}^{n} {\left |{ {Y'_{i} -Y_{i}} }\right |},\tag{2}\end{equation*} where $Y'_{i}$ is the model output and $Y_{i}$ is the actual value of the $i$-th sample.
Definition 3: Model complexity. The number of subfunctions $K$ contained in the hybrid model is used to measure its complexity.

In general, a good fitting model should achieve a low error with a low complexity; the two objectives usually conflict, so a trade-off between them has to be made.
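As a concrete illustration of Definitions 1–3, the following minimal Python sketch builds a small hybrid model, evaluates the error of Equation (2), and reports the complexity. The particular subfunctions and weights are invented for the example and are not values from the paper.

```python
import numpy as np

# Illustration of Definitions 1-3: a hybrid model is a weighted sum of
# subfunctions g_k (Eq. 1), its error is the MAE of Eq. (2), and its
# complexity is the number of subfunctions K.
subfunctions = [
    lambda x: x,                        # linear term
    lambda x: x**2,                     # polynomial term
    lambda x: np.exp(-(x - 1.0)**2),    # Gaussian term
]
weights = [0.5, 0.2, 1.3]               # the w_k of Eq. (1), assumed here

def hybrid_model(x):
    """psi(x) = sum_k w_k * g_k(x), Eq. (1)."""
    return sum(w * g(x) for w, g in zip(weights, subfunctions))

def model_error(X, Y):
    """Mean absolute error of the model, Eq. (2)."""
    return np.mean(np.abs(hybrid_model(X) - Y))

complexity = len(subfunctions)          # K in Definition 3
```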
Consider the following example to illustrate the differences between models. Table 1 shows the structure, error, complexity, and fitting degree of eight models, all of which adopt the least squares method to optimize the relevant parameters. The fitting degree is calculated by the coefficient of determination ($R^{2}$).
From Table 1, three findings stand out. First, when the polynomial model is used to fit the data, the fitting degree can be improved by moderately increasing the complexity of the model; but once the complexity reaches a certain level, the fitting degree no longer improves significantly, as shown by the change from the 12th-order to the 14th-order polynomial. Second, the Gaussian function leads to a higher fitting degree: for example, the fitting degree of the model with five Gaussian subfunctions is close to that of the 12th-order polynomial model. Finally, the hybrid model improves the fitting degree and reduces the model complexity by optimizing the combination structure of the subfunctions, as the last two models in Table 1 highlight. In addition, the overall shape of the hybrid model reflects the superposition of its subfunctions. A model with a single type of function can only capture the details of the data at the cost of increasing complexity, as with the polynomial model. Instead, the hybrid model shown in Figure 1 works better, combining multiple types of subfunctions with less model complexity.
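To make the setup behind Table 1 concrete: once the subfunction types are fixed, the least squares step reduces to solving a linear system for the weights. The sketch below, on toy data with assumed Gaussian centers and widths, shows how a mixed polynomial-plus-Gaussian basis is fitted; it is illustrative only and does not reproduce the models of Table 1.

```python
import numpy as np

# Once the subfunction structure is fixed, the weights w_k of a hybrid
# (polynomial + Gaussian) model follow from linear least squares.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y = np.sin(2 * x) + 0.1 * rng.standard_normal(x.size)  # toy data

# Design matrix: low-order polynomial terms plus two Gaussian bumps.
# Centers and widths here are illustrative assumptions.
basis = np.column_stack([
    np.ones_like(x), x, x**2,
    np.exp(-(x + 1.5)**2 / 0.5),
    np.exp(-(x - 1.5)**2 / 0.5),
])
w, *_ = np.linalg.lstsq(basis, y, rcond=None)  # optimal weights
fit = basis @ w
print("MAE:", np.mean(np.abs(fit - y)))
```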
B. Coding Mechanism of Hybrid Model
The coding expression is the basis of constructing the hybrid model. To accommodate the search operations, an expression tree is used to encode the hybrid model. However, decoding is a high-frequency operation in the search process, and a complicated model makes this calculation too expensive. To avoid this, the hybrid model is encoded by the improved expression tree (I-ET).
In the I-ET coding mechanism, a node consists of two parts: the multiplier factor and the structure. In other words, the structure part (sp) of each node is associated with a randomly generated multiplier factor part (mp). The sp is an element selected from either the function set $F$ or the terminal set $T$.
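A possible node layout for the I-ET coding is sketched below in Python; the field names and the dataclass form are assumptions, since the paper does not prescribe an implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Sketch of an I-ET node: every node couples a structure part (sp)
# with a multiplier factor part (mp).
@dataclass
class IETNode:
    sp: str                       # element of F (e.g. '+', '*') or T (a variable)
    mp: float                     # randomly generated multiplier factor
    children: List["IETNode"] = field(default_factory=list)
    # Attributes used only when the node is a Gaussian node:
    mean: Optional[float] = None
    var: Optional[float] = None
```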
Suppose a hybrid model is given; its I-ET coding is illustrated in Figure 2, where each node carries its sp together with the associated mp.
Additionally, the flexibility of such coding helps to express structural changes in the model. As shown in Figure 3, a new model can be obtained when a subfunction of the model is changed; the variable-length coding expresses the new structure directly.
C. Optimization Mechanism of Hybrid Model
Genetic programming [33]–[35] can search the structure of expression tree coding, and following its idea, an optimization mechanism can be designed to construct the hybrid model. The specific steps are as follows.
Step 1: Initialization. The population is composed of NP randomly generated I-ET coding individuals, expressed as $pop=\{\psi _{1},\ldots,\psi _{NP}\}$. Constructing each $\psi _{j}$, $j=1,\ldots,NP$, needs some initial parameters, such as the maximal depth $D$ of the model, the function set $F$, and the terminal set $T$; the initial node depth of the model is 1. The specific constructing process is as follows (a code sketch is given after this list).

1) If the depth of the current node is less than $D$, an element is randomly selected from $F\cup T$; otherwise, it is selected from $T$. The selected element is taken as the sp of the current node and associated with a randomly generated mp. If the sp belongs to $F$, turn to 2); otherwise turn to 3).

2) Identify the number of branches according to the number of child nodes required by the sp of the current node. For example, if the sp is +, the number of child nodes is 2. Increase the depth by 1 and return to 1) to construct the child nodes.

3) If the sp of the current node is a variable in $T$, the variable is set as a Gaussian node with probability $P_{gs}$, and the Gaussian node's attributes, a mean and a variance, are randomly generated within a given attribute range. The node is taken as the terminal node of the branch; that is, the branch stops growing.
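The construction of Step 1 can be sketched as the recursive routine below, reusing the IETNode class from the earlier sketch. The function set, terminal set, multiplier range, Gaussian attribute ranges, and population size are assumptions made for illustration.

```python
import random

FUNCS = {'+': 2, '*': 2}          # assumed function set F with arities
TERMS = ['x1', 'x2']              # assumed terminal set T (variables)

def grow(depth, D, p_gs=0.3):
    """Step 1 sketch: build a random I-ET of maximal depth D."""
    if depth < D:
        sp = random.choice(list(FUNCS) + TERMS)   # pick from F ∪ T
    else:
        sp = random.choice(TERMS)                 # depth reached: pick from T
    node = IETNode(sp=sp, mp=random.uniform(-1, 1))
    if sp in FUNCS:                               # function node: grow children
        node.children = [grow(depth + 1, D, p_gs) for _ in range(FUNCS[sp])]
    elif random.random() < p_gs:                  # variable becomes Gaussian node
        node.mean = random.uniform(-1, 1)         # attribute ranges are assumed
        node.var = random.uniform(0.1, 1.0)
    return node

population = [grow(1, D=4) for _ in range(50)]    # NP = 50 (assumed)
```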
Step 2: Iterative search. In this process, the search operator probability $P$ is used to choose the crossover or mutation operator and to generate the offspring individuals.

Crossover operation. It can be expressed as $\{o_{1},o_{2}\}=\psi _{1} \otimes \psi _{2}$, in which $\otimes$ is the crossover operator, $\psi _{1}$ and $\psi _{2}$ are the two parent individuals randomly selected from the population, and $o_{1}$ and $o_{2}$ are the two offspring individuals produced by crossover. The overall process of crossover is shown in Figure 4. The concrete steps of crossing $\psi _{1}$ and $\psi _{2}$ are as follows (a code sketch is given after this list).

1) Select the crossover points of $\psi _{1}$ and $\psi _{2}$ separately by generating random indices according to the number of nodes of each model.

2) According to the probability $P_{cs}$, choose one of two crossover strategies; Figure 4 compares the two. Crossover strategy 1 takes the sp and mp of the crossover point as a whole and directly exchanges the subtrees of the two parents rooted at the crossover points. Crossover strategy 2 first exchanges the mp of the two crossover points and then exchanges the subtrees rooted at them; in other words, the mp is not exchanged together with the sp.

Obviously, the information processing granularity of the two crossover strategies differs. The first exchanges information with whole subfunctions as the unit, aiming to search for different subfunction combinations; the second exchanges only the structure of the subfunctions, without the relevant parameters, so that the multiplier factors of the nodes keep changing after model initialization.
Mutation operation. It can be defined as $o=\odot (\psi)$, in which $\odot$ is the mutation operator, $\psi$ is the parent individual selected randomly from the population, and $o$ is the offspring individual produced by mutation. The specific mutation process of $\psi$ is as follows. First, randomly select a mutation point of $\psi$. Second, since the attribute values of a Gaussian node determine the particularity of the hybrid model, different mutation strategies are applied according to whether the mutation point is a Gaussian node.

When the mutation point is a non-Gaussian node, as shown in Figure 5, the subtree whose root is the mutation point is deleted, and a new randomly generated subtree is inserted in its place.

When the mutation point is a Gaussian node, there are four ways of mutation: 1) mutation of the entire Gaussian node, namely the mean, the variance, and the mp; 2) mutation of the mp; 3) mutation of the mean; 4) mutation of the variance.

One of them is randomly selected to perform the mutation. Figure 6 shows the process of the first way of mutation. A code sketch of the operator follows.
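The sketch below reuses grow and random_node from the earlier sketches; the attribute ranges are the same illustrative assumptions.

```python
import copy
import random

def mutate(parent, D=4, p_gs=0.3):
    """Sketch of the mutation operator o = ⊙(ψ) described above."""
    off = copy.deepcopy(parent)
    point = random_node(off)                      # random mutation point
    if point.mean is None:
        # Non-Gaussian node (Figure 5): replace the subtree rooted at
        # the mutation point with a new randomly generated subtree.
        point.__dict__ = grow(1, D, p_gs).__dict__
    else:
        # Gaussian node: pick one of the four ways of mutation.
        way = random.randrange(4)
        if way in (0, 2):
            point.mean = random.uniform(-1, 1)    # assumed attribute range
        if way in (0, 3):
            point.var = random.uniform(0.1, 1.0)  # assumed attribute range
        if way in (0, 1):
            point.mp = random.uniform(-1, 1)      # new multiplier factor
    return off
```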
Step 3: Model evaluation and selection. The offspring individuals generated by searching compete with the parent ones, and the superior are selected to form the next generation population $pop^{gen+1}$. The model error $f_{error}$ and the complexity $f_{node}$ are used to evaluate a model: the former adopts the calculation formula in Definition 2, while the latter uses the number of nodes of the I-ET coding. Apparently, this is a process of bi-objective optimization (sketched in code after this step list).

When two competing individuals $\psi _{1}$ and $\psi _{2}$ satisfy \begin{align*} \begin{cases} {f_{error} (\psi _{1})\le f_{error} (\psi _{2})} \\ {f_{node} (\psi _{1})\le f_{node} (\psi _{2})}, \end{cases}\tag{3}\end{align*} $\psi _{1}$ dominates $\psi _{2}$, and $\psi _{2}$ is eliminated. When $\psi _{1}$ and $\psi _{2}$ do not dominate each other, the fitness is used for comparison. The fitness $fit$ is calculated as \begin{equation*} fit=\alpha f_{error}^{sf} +(1-\alpha)f_{node}^{sf},\tag{4}\end{equation*} where $\alpha$ is the proportion for balancing the two objectives, and $f_{error}^{sf}$ and $f_{node}^{sf}$ are the normalized values: \begin{align*} f_{error}^{sf}=&\frac {f_{error} (\psi)-f_{error}^{min}}{f_{error}^{max} -f_{error}^{min}}, \tag{5}\\ f_{node}^{sf}=&\frac {f_{node} (\psi)-f_{node}^{min} }{f_{node}^{max} -f_{node}^{min}},\tag{6}\end{align*} where $f_{error}^{min}$ and $f_{error}^{max}$ are respectively the minimum and maximum of $f_{error}$ in the current population, and $f_{node}^{min}$ and $f_{node}^{max}$ are the minimum and maximum of $f_{node}$.

Step 4: If the number of individuals in the next population $pop^{gen+1}$ is less than NP, turn to Step 2 to continue searching; otherwise, turn to Step 5.

Step 5: Increase the current number of iterations $gen$ by 1. If $gen$ has reached the maximum, output the best model; otherwise, go to Step 2.
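Step 3 can be sketched as the following selection routine, where f_error and f_node are assumed helpers implementing Definition 2 and the I-ET node count; the small epsilon guards against identical objective extremes in the normalization of Equations (5) and (6).

```python
# f_error(psi) and f_node(psi) are assumed helpers returning the MAE of
# Definition 2 and the I-ET node count; they are passed in explicitly.

def dominates(a, b, f_error, f_node):
    """Eq. (3): a dominates b if it is no worse in both objectives."""
    return f_error(a) <= f_error(b) and f_node(a) <= f_node(b)

def fitness(psi, pop, f_error, f_node, alpha=0.5):
    """Eq. (4) with the min-max normalization of Eqs. (5)-(6)."""
    errs = [f_error(p) for p in pop]
    nodes = [f_node(p) for p in pop]
    eps = 1e-12                        # guard against identical extremes
    e_sf = (f_error(psi) - min(errs)) / (max(errs) - min(errs) + eps)
    n_sf = (f_node(psi) - min(nodes)) / (max(nodes) - min(nodes) + eps)
    return alpha * e_sf + (1 - alpha) * n_sf     # lower is better

def select(child, parent, pop, f_error, f_node, alpha=0.5):
    """Step 3: dominance first, fitness as the tie-breaker."""
    if dominates(child, parent, f_error, f_node):
        return child
    if dominates(parent, child, f_error, f_node):
        return parent
    fc = fitness(child, pop, f_error, f_node, alpha)
    fp = fitness(parent, pop, f_error, f_node, alpha)
    return child if fc <= fp else parent
```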
D. Time Complexity Analysis for Proposed Method
Suppose the population size is NP, and the maximal number of generations is NG during the execution of the algorithm.
In the initialization, the core operation is to randomly select nodes from the function set and the terminal set. So, the time complexity of initializing a model is

Overall, the time complexity of the proposed algorithm is
Experiments
The proposed method was implemented in MATLAB 2018a on a PC (2.3 GHz, 8 GB RAM, Windows 10). To verify the performance of the proposed method, six data prediction problems were selected for experimental study. Table 2 shows the relevant parameter settings of the method in the optimization process.
A. Data Sets
Six data sets from the UCI machine learning repository are taken as test cases. The data set Hydrodynamics contains 308 instances, each represented by seven attributes; predicting the residual resistance of a ship at the beginning of design is of great value for evaluating the ship's performance. The data set Energy efficiency contains 768 building shapes described by eight attributes, such as surface area and overall height; the purpose is to establish the relationship between the heating or cooling load and these eight attributes. The data set Concrete contains 1030 concrete samples described by eight attributes, such as cement, water, and fly ash, aiming at identifying the relation between the compressive strength of concrete and these attributes. The last two data sets, Red wine quality and White wine quality, contain 1599 red and 4898 white wine samples respectively, with the aim of building a model that predicts the quality of wine from its 11 physicochemical features. The specific information of the above data sets is shown in Table 3.
B. Evaluation Metrics
This paper adopts 5-fold cross validation to evaluate the performance of the proposed method. The algorithm is run independently 10 times, and the Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) between the predicted and the actual values are calculated; the final result is the average over all runs. MAE adopts the calculation formula in Equation (2), and RMSE is calculated as \begin{equation*} e_{\textrm {RMSE}} =\sqrt {\frac {1}{n}\sum \limits _{i=1}^{n} {\left ({{Y'_{i} -Y_{i}} }\right)^{2}}}.\tag{7}\end{equation*}
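A sketch of this evaluation protocol with scikit-learn's KFold follows; fit_and_predict stands in for any of the compared methods and is an assumed callable, not part of the paper.

```python
import numpy as np
from sklearn.model_selection import KFold

def mae(y_pred, y_true):
    """Mean absolute error, Eq. (2)."""
    return np.mean(np.abs(y_pred - y_true))

def rmse(y_pred, y_true):
    """Root mean square error, Eq. (7)."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def evaluate(fit_and_predict, X, Y, runs=10, seed=0):
    """5-fold cross validation, averaged over independent runs."""
    maes, rmses = [], []
    for r in range(runs):
        kf = KFold(n_splits=5, shuffle=True, random_state=seed + r)
        for tr, te in kf.split(X):
            y_hat = fit_and_predict(X[tr], Y[tr], X[te])
            maes.append(mae(y_hat, Y[te]))
            rmses.append(rmse(y_hat, Y[te]))
    return float(np.mean(maes)), float(np.mean(rmses))
```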
C. Comparison of Experimental Results
This paper selects seven comparison algorithms: the classical SVR, RF, and XGBoost, and several recently proposed methods, namely ALAMO [37], OPLRA [23], and the two approaches PROA and PROB from [24]. The classical methods are implemented with the sklearn package under Python 3.7. The main parameters are as follows: the kernel of SVR is RBF; the number of decision trees used in both RF and XGBoost is 200; the learning rate of XGBoost is 0.1. The results of the other methods are quoted directly from the original references. Table 4 and Table 5 present the average MAE and RMSE of each method, respectively.
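The stated baseline configuration corresponds roughly to the setup below (note that XGBoost typically ships as its own package rather than within sklearn); unspecified parameters are left at their defaults.

```python
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

# Baselines as stated above: RBF-kernel SVR, 200 trees for both RF and
# XGBoost, XGBoost learning rate 0.1. Everything else is default.
baselines = {
    'SVR': SVR(kernel='rbf'),
    'RF': RandomForestRegressor(n_estimators=200),
    'XGBoost': XGBRegressor(n_estimators=200, learning_rate=0.1),
}
```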
First, compare the proposed method in Table 4 with the three classical ones, SVR, RF, and XGBoost. Among the three, XGBoost obtains the smallest MAE. The proposed method is superior to SVR and RF on all six data sets, and outperforms XGBoost on five sets, the exception being Cooling. The proposed method also performs better than the methods from the references, ALAMO, OPLRA, PROA, and PROB, except on Cooling, where it is inferior to OPLRA, PROA, and PROB.

All in all, the proposed method obtains the minimum MAE on five data sets, while XGBoost has the minimum on Cooling.
According to the RMSE in Table 5, PROA performs better than the other methods from the references; RF surpasses PROA on Cooling, Concrete, and White, and XGBoost also does better than PROA on four data sets, the exceptions being Heating and Red. Even so, the proposed method is clearly preferable to PROA and XGBoost on RMSE.
To evaluate the overall performance of each method more comprehensively, the following scoring strategy is adopted. For each data set, the methods are ranked by their MAE and RMSE: the method with the lowest prediction error scores 10 points, the next scores 9 points, and so on. Because some results are unavailable, the score of each method is averaged over the five data sets excluding Red; in addition, ALAMO is excluded from the RMSE scoring. The final average score represents the overall performance of a method, with a higher score meaning better performance. The average scores of the different methods are shown in Figure 7.
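The scoring strategy can be written down directly; the sketch below assumes an errors table mapping each method to its per-data-set errors, with ties broken arbitrarily by the sort.

```python
def average_scores(errors):
    """Rank-based scoring described above: per data set, the method
    with the lowest error gets 10 points, the next gets 9, and so on;
    the final score is the average over data sets.

    `errors` maps method name -> list of errors (one per data set)."""
    methods = list(errors)
    n_sets = len(next(iter(errors.values())))
    scores = {m: 0.0 for m in methods}
    for j in range(n_sets):
        ranked = sorted(methods, key=lambda m: errors[m][j])
        for rank, m in enumerate(ranked):
            scores[m] += 10 - rank          # 10, 9, 8, ... by rank
    return {m: s / n_sets for m, s in scores.items()}
```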
Figure 7 makes it easier to compare the performance of these methods. The proposed method gets the highest score on both MAE and RMSE; XGBoost and PROA also perform well. In addition, there are some differences between scoring by MAE and by RMSE: when RMSE is taken as the scoring indicator, RF is very competitive and ties with PROA.
This paper also adopts Welch's t-test, at the 5% significance level, to compare the differences in performance between the methods, using the MAE values of each method on each data set. The test results are shown in Table 6, in which “+” indicates that the proposed method is significantly superior to the comparison method, “≈” means no significant difference between the two, and “-” indicates that the proposed method is significantly inferior.
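Welch's t-test is available in SciPy as ttest_ind with equal_var=False. The sketch below shows how the “+”/“≈”/“-” entries of Table 6 could be derived from per-run MAE samples; the comparison direction is an assumption about how the table was built.

```python
from scipy.stats import ttest_ind

def compare(mae_proposed, mae_other, alpha=0.05):
    """Welch's t-test (unequal variances) on per-run MAE samples.
    Returns '+', '-' or '≈' following the notation of Table 6."""
    t, p = ttest_ind(mae_proposed, mae_other, equal_var=False)
    if p >= alpha:
        return '≈'                  # no significant difference
    # Significant: the sign follows whichever method has lower mean MAE.
    return '+' if t < 0 else '-'
```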
As seen from Table 6, the proposed method is superior to SVR on all six data sets. Compared with RF, the proposed method is worse on Cooling and shows no significant difference on Concrete, but is superior on the remaining four data sets. Compared with XGBoost, the proposed method is inferior on Cooling, performs similarly on Concrete and Hydro, and is better on the remaining three sets.
D. Comparison of I-ET Coding and Traditional Coding
To evaluate the influence of the I-ET coding and the traditional tree coding on model performance, the training error, the test error, and the coding complexity of the two schemes are compared; the results are listed in Table 7.
It can be seen from Table 7 that the I-ET coding is superior to the traditional one in all three indicators on all data sets: compared with the traditional coding, it achieves lower training and test errors with fewer nodes.
E. Interpretability of the Hybrid Model
Taking the Hydro set as an example, the interpretability of the proposed model and of the polynomial model is analysed by comparing their expressions. The comparison suggests that the 3rd-order polynomial model is considerably more complex, while the hybrid model expresses the relationship between the attributes and the output with far fewer terms.
In addition, the simplified structure of the hybrid model makes the analysis of prediction results more convenient. For example, in the results on the Hydro set, the model constructed by the proposed method highlights which attributes are most important to the prediction.
F. Parameters Discussion
Take the Hydro set as an example once again to analyze the four major parameters.
1) Influence of the Weight $\alpha$ on Model Error and Complexity
When the fitness of a model is calculated, the weight $\alpha$ in Equation (4) adjusts the proportion between the model error and the model complexity: by Equation (4), a larger $\alpha$ puts more emphasis on reducing the error, while a smaller $\alpha$ favors a less complex model.
2) Influence of Probability P of Search Operator on Model Error
In the search process, the genetic operation used to produce offspring is selected based on the probability $P$.
3) Influence of the Probability $P_{cs}$ of Different Crossover Strategies on Model Error
When the crossover operation is performed, the probability $P_{cs}$ determines which of the two crossover strategies is selected.
4) Influence of the Probability $P_{gs}$ of Generating Gaussian Nodes on Model Error
When individuals are initialized or mutated, the probability $P_{gs}$ determines whether a variable node is set as a Gaussian node.
Figure 13 demonstrates the influence of different values of $P_{gs}$ on the model error.
Conclusion
This paper presents a method of constructing a hybrid data fitting model based on I-ET coding and evolutionary search. The approach is shown to be effective by experiments on six UCI data sets and by comparison with the results of other methods. The results suggest that the proposed method can produce hybrid models with higher prediction accuracy and lower complexity. This paper also discusses how the I-ET coding adopted in the proposed method reduces the computational complexity, and how the selection of four important parameters relates to the performance of the constructed model.
In future work, the following directions will be pursued. First, the effectiveness of the proposed approach was verified only on UCI data sets; as a general framework, it can be applied to other practical prediction problems. Second, the coding of the hybrid model is the fundamental problem, directly affecting the efficiency of the whole algorithm; although I-ET coding reduces the complexity of the model, it is still expensive in the search process, and a more efficient and simpler coding structure could further improve the search operations. Finally, multi-objective optimization techniques need to be studied further, so as to construct hybrid models with better performance.