
A Novel Hybrid Machine Learning Algorithm for Limited and Big Data Modeling With Application in Industry 4.0

Hybrid Machine Learning.

Abstract:

To meet the challenges of manufacturing smart products, manufacturing plants have been radically changed to become smart factories underpinned by Industry 4.0 technologies. The transformation is assisted by the employment of machine learning techniques that can deal with modeling both big and limited data. This manuscript reviews these concepts and presents a case study that demonstrates the use of a novel intelligent hybrid algorithm for Industry 4.0 applications with limited data. In particular, an intelligent algorithm is proposed for robust data modeling of nonlinear systems based on input-output data. In our approach, a novel hybrid data-driven method combining the Group Method of Data Handling and Singular Value Decomposition is adapted to find an offline deterministic model, combined with Pareto multi-objective optimization to overcome the overfitting issue. An Unscented Kalman Filter is also incorporated to update the coefficients of the deterministic model and increase its robustness against data uncertainties. The effectiveness of the proposed method is examined on a set of real industrial measurements.

Published in: IEEE Access ( Volume: 8)

Page(s): 111381 - 111393

Date of Publication: 04 June 2020

Electronic ISSN: 2169-3536

DOI: 10.1109/ACCESS.2020.2999898

CCBY - IEEE is not the copyright holder of this material. Please follow the instructions via https://creativecommons.org/licenses/by/4.0/ to obtain full-text articles and stipulations in the API documentation.

Nomenclature

Abbreviation	Expansion

AC	Actor-Critic
CNN	Convolutional Neural Network
DL	Deep Learning
DNN	Deep Neural Networks
D&PL	Distributed and Parallel Learning
DBN	Deep Belief Networks
DDPG	Deep Deterministic Policy Gradient
DOE	Design of Experiments
DQN	Deep Q-Network
DT	Decision Tree
GMDH	Group Method of Data Handling
GA	Genetic Algorithm
GPR	Gaussian Process Regression
IL	Incremental Learning
JK	JackKnife
KNN	K-Nearest Neighbor
LDA	Linear Discriminant Analysis
MAB	Multi-Armed Bandit
ML	Machine Learning
MCS	Monte Carlo Simulation
NN	Neural Networks
NSGA	Nondominated Sorting Genetic Algorithm
PCA	Principal Component Analysis
PDF	Probability Density Function
PG	Policy Gradient
QDA	Quadratic Discriminant Analysis
ResNet	Residual Network
RNN	Recurrent Neural Network
RF	Random Forest
RL	Reinforcement Learning
RVM	Relevance Vector Machine
SARSA	State-Action-Reward-State-Action
SVD	Singular Value Decomposition
SOM	Self Organised Maps
SVM	Support Vector Machine
SVR	Support Vector Regression
TL	Transfer Learning
UKF	Unscented Kalman Filter

SECTION I.

Introduction

Current key challenges of manufacturing processes can be summarized as (i) adoption of advanced manufacturing technologies, (ii) the growing importance of manufacturing high value-added products, (iii) increasing process complexity, uncertainty and dynamism, and (iv) utilization of advanced knowledge, data science, and AI systems [1]–[3]. Among these challenges, data collection can be particularly demanding and resource intensive, since it is costly, time consuming and compute-intensive. As such, the amount of data available to build accurate models is often limited. System identification, decision making, and predictive analytics based on limited data may reduce production yields, increase production costs or decrease enterprise competitiveness. In the case of limited data, the size of the dataset is not enough to train a reliable model. For example, when the size of the training dataset is less than the number of unknown parameters of the model, e.g. the unknown weights of an artificial neural network architecture, we face the limited data challenge and conventional learning methods might not work properly. In such cases, one still needs to develop appropriate data models with small variance of forecasting error and good accuracy based on these small data sets.

On the other hand, in some cases one has to deal with big data, where the data produced by the system is characterized by large volume, variety, veracity and velocity. Big data is an ambiguous term; it generally refers to data sizes that are difficult to manage, observe, acquire, store, process and analyze using prevalent database tools. The underlying processes are often highly complex, with strongly dynamic uncertainties, and demand heavy computational effort even to find a simple model. Machine Learning (ML) offers effective solutions to challenging issues in various industrial applications [4]. ML includes the computer algorithms and statistical methods required for data-driven control, estimation, prediction, classification, or clustering. Although ML is effective in many ways, some existing ML techniques have limitations, such as over-fitting or under-fitting due to the nature of the data, poor generalizability and poor long-term prediction ability [5], [6]. Hybrid ML techniques can potentially capture more characteristics of complex systems to overcome these limitations. Hybrid ML is based on developing algorithms that couple model-based and data-driven learning systems. Despite research in hybrid data-driven ML techniques, they are not yet widely used, mainly because of their high computational complexity [7], [8]. This manuscript aims to provide a review of common data modeling techniques for limited and big data based on ML approaches, with a list of their advantages and disadvantages. It also provides an overview of some recent research on data modeling techniques under limited or big data constraints for various industrial applications, as illustrated in Fig. 1.

FIGURE 1.

Common modeling techniques for limited or big data scenarios.

Based on the advantages and limitations of ML, the paper introduces an intelligent hybrid data-driven algorithm that is robust against limited and big data constraints, reducing the computational cost and increasing the estimation accuracy. This manuscript has two main parts. The first part provides a brief review of some well-known ML techniques. The second part introduces a hybrid algorithm and applies it to a modeling task with limited data constraints.

This manuscript is organized as follows. Sections 2 and 3 present modeling techniques for limited and big data scenarios. Section 4 introduces a novel intelligent algorithm for robust modeling of nonlinear processes. Section 5 shows an industrial case study that can be modeled by the proposed intelligent algorithm. Concluding remarks are provided in Section 6.

SECTION II.

Machine Learning Techniques to Model Limited or Big Data

One of the distinguishing features of the Industry 4.0 concept is data-informed decision making. In some applications, data collection can be a challenging and expensive task, leading to the limited data problem. The level of required modeling accuracy depends on sample size. To make reliable statistical tests with small sample sizes, Design of Experiments (DOE) methods have been proposed, such as Response Surface, Taguchi and Factorial [9]. Development of DOE based on Taguchi methods to reduce the number of industrial experimental tests requires reliable data modeling [10], [11]. In some other industrial applications, large-scale data is produced, and sophisticated machine learning needs to be employed to process the data. In the following sections, we review a number of machine learning techniques, from conventional to recent, that are often used in industrial applications.

To process, analyze, predict and support decision-making based on limited or big data, Machine Learning (ML) techniques can be used. ML techniques are able to observe, store and model data with high nonlinearity and uncertainty. ML can readily identify trends and patterns in complex systems treated as black-box (or gray-box) models. There are different techniques in ML, and the choice of method depends on factors such as the nature of the dataset, the scope of the problem, and the desired outcomes. Usual ML tasks include regression, modeling, prediction, classification and clustering. Generally, there are three major ML categories: supervised, unsupervised and reinforcement learning. Recently, semi-supervised learning methods have also been increasingly developed and used in many applications.

Supervised learning fits a model to labeled training data until the model achieves a desired level of accuracy on the training data. It is usually performed when the final values of the output variable or the class labels are known, so that one can define an error function between the output of the model and that of the system. Supervised learning is used in classification (predicting a label) or regression (predicting a quantity). Some supervised learning techniques are: rule-based systems, regularization, Bayesian, ensemble, Neural Networks (NNs), instance-based, decision trees and explicit regression. Many of these supervised learning techniques are used for both regression and classification in industrial cases.

Unsupervised learning is used to find patterns or hidden structures in datasets that have not been categorized or labeled. It typically focuses on exploratory analysis, dimensionality reduction, and feature extraction. Examples of unsupervised learning include: Principal Component Analysis (PCA), k-Means, Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Self Organized Maps (SOM).

Reinforcement Learning (RL) differs from supervised and unsupervised learning in that it works with data from a dynamic environment. The goal of reinforcement learning is not to cluster or label data but to find the best sequence of actions that will generate the optimal outcome. Reinforcement learning solves this problem by allowing a piece of software called the agent to explore, interact with, and learn from the environment (system), which involves a trade-off between exploration and exploitation. Within the agent, there is a function that takes in state observations (the inputs) and maps them to actions (the outputs). This single function takes the place of all of the individual subcomponents of the control system. In the RL nomenclature, this function is called the policy. Given a set of observations, the policy decides which action to take. Commonly used RL algorithms include on-policy methods such as PG and SARSA, off-policy methods such as DQN and DDPG, as well as AC and MAB approaches.
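
The exploration-exploitation trade-off mentioned above can be illustrated with a toy multi-armed bandit. The following is a minimal sketch, not a method from the paper: the epsilon-greedy rule, the arm means, the noise range, the step count and the seed are all assumptions chosen for illustration.

```python
import random

def epsilon_greedy_bandit(arm_means, steps=2000, epsilon=0.1, seed=0):
    """Run an epsilon-greedy policy on a toy multi-armed bandit.

    arm_means are hypothetical expected rewards; each reward is the mean
    plus small uniform noise. Returns value estimates and pull counts.
    """
    rng = random.Random(seed)
    estimates = [0.0] * len(arm_means)
    counts = [0] * len(arm_means)
    for _ in range(steps):
        if rng.random() < epsilon:
            # Exploration: try a random action.
            arm = rng.randrange(len(arm_means))
        else:
            # Exploitation: take the action with the best current estimate.
            arm = max(range(len(arm_means)), key=lambda a: estimates[a])
        reward = arm_means[arm] + rng.uniform(-0.05, 0.05)
        counts[arm] += 1
        # Incremental update of the running average reward for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

estimates, counts = epsilon_greedy_bandit([0.1, 0.5, 0.9])
best_arm = max(range(len(estimates)), key=lambda a: estimates[a])
print("best arm:", best_arm, "pull counts:", counts)
```

Here the policy is the rule that maps the current estimates to an action; a larger `epsilon` explores more at the cost of pulling known good arms less often.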

Some popular techniques for clustering, classification, regression and advanced machine learning are discussed in the following.

A. Classification

Classification is one of the most popular methods in ML, with many potential applications in industrial settings. It is a data mining technique that predicts categorical class labels based on observations. Appropriate models need to be constructed to define the important data classes. In order to build a classifier, one often divides the data into three parts: training, test and validation data. The training dataset is used to find the unknown parameters of the model, which is then verified using the test data and validated using the validation data.
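
The three-way division of the data can be sketched as follows. This is only an illustration: the function name `split_dataset`, the 60/20/20 fractions and the fixed seed are assumptions, not values given in this section.

```python
import random

def split_dataset(samples, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle samples and split them into training, test and validation parts.

    The fractions and the seed are illustrative assumptions; whatever is
    left after the training and test parts becomes the validation part.
    """
    if not 0 < train_frac + test_frac < 1:
        raise ValueError("train_frac + test_frac must lie strictly between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return train, test, validation

# Hypothetical labelled samples: (feature, label) pairs.
data = [(x, x % 2) for x in range(20)]
train, test, validation = split_dataset(data)
print(len(train), len(test), len(validation))
```

Shuffling before splitting avoids any ordering in the raw measurements leaking into one of the subsets.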

If $\mathbf {X}=\{\boldsymbol {x}_{i}\}_{i=1}^{N}\subset \mathbb {R}^{q}$ is a set of patterns and $\mathbf {Y}=\{y_{i}\}_{i=1}^{N}\subset \mathbb {R}$ is the corresponding set of labels, then $\left \{{\left ({\boldsymbol {x}_{1},y_{1} }\right),\ldots,(\boldsymbol {x}_{N},y_{N}) }\right \}$ is a training set of $q$-dimensional patterns.

The classifier can assign an appropriate label to every unlabeled pattern, i.e. allocate it to the most appropriate class. A number of approaches have been developed to improve classifier robustness and generalizability. Dimensionality reduction through feature selection is an effective technique to improve the generalizability of classification tasks. Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA) are two frequently used feature extraction methods [12], [13]. There are many classification approaches, each with its own pros and cons. One frequently used, simple yet powerful classification method is the K-Nearest Neighbor (KNN) classifier [14]. Support Vector Machine (SVM) is a supervised machine learning method that can be used for both classification and regression. SVM builds a separating hyperplane that maximizes the margin between the two classes [15]. SVM offers high prediction accuracy and low prediction time, is non-parametric, and is robust to outliers [5]. However, its limitations include high computational cost, poor ability to manage uncertainty, and the need for a cross-validation procedure to determine hyper-parameters [5].

Relevance Vector Machine (RVM) employs a Bayesian framework to infer the weights, so that Probability Density Functions (PDFs) of the outputs can be obtained instead of point estimates. RVM provides performance comparable to SVM, while utilizing arbitrary kernel functions with high sparsity and offering probabilistic predictions [16]. High sparsity means that a significant number of weights are zero, leading to more computationally efficient models. Advantages of RVM include the ability to generate PDFs directly, being non-parametric, realizing high sparsity, and avoiding the cross-validation process. Its limitations, however, include the large volume of data required for modeling, the substantial time and memory consumed during training, a tendency to fall into local optima, and a risk of over-fitting [6].
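
As a concrete illustration of the KNN classifier mentioned above, the following sketch labels a query pattern by majority vote among its nearest training patterns. The Euclidean distance, the choice `k=3` and the toy data are assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-dimensional patterns with two class labels.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, (0.5, 0.5)))
print(knn_predict(points, labels, (5.5, 5.5)))
```

KNN has no training phase; all the cost is paid at prediction time, when distances to every stored pattern are computed.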

A Decision Tree (DT) has a flowchart-like tree structure with numerous nodes and branches. It is a fast and easy method with decent performance in many classification tasks. Bayesian classification is a statistical approach that learns the distribution of instances to predict class membership probabilities.

B. Regression

The regression technique is closely related to classification; the difference is that it seeks a pattern to determine numerical values. Regression modeling tries to find the relationship between a response or dependent variable $y$ and independent variables $x_{1},\ldots,x_{k}$ . In other words, regression modeling aims to develop a model that predicts the response variable(s) based on the independent variables. Linear regression has been widely used in the literature for fitting a measurable response variable as a function of one or more independent predictor variables. The least squares regression method [17] is the most commonly used approach for fitting a regression line. NNs, as a tool for nonlinear regression, are considered a well-developed technique that can be applied in many areas of process identification, control, and model prediction [18]. Support Vector Regression (SVR) is an extension of SVM for building nonlinear regression models [9]. SVR maps the data into a higher-dimensional feature space and fits a linear function with minimum complexity in that space [19]. Kernel-based probabilistic and nonparametric models such as Gaussian Process Regression (GPR) produce a stochastic function as the regression output, with several advantages: they provide a covariance that quantifies the uncertainty level, are non-parametric, and are flexible [20]. However, GPR performance is highly affected by the choice of kernel function, and its computational cost is high [21].

Thin Plate Spline (TPS) is a prevalent, noise-insensitive technique that can be used in data fitting and prediction. TPS chooses a function that minimizes an integral representing the bending energy of a surface [22]. An advantage of TPSs is that they do not require any a priori knowledge of the functional form of the data or the relationship of interest. While complex data visualization is a key strength of thin plate splines, it is also a limitation: their three-dimensionality makes the inclusion of confidence intervals difficult, as the visual may become too complex to interpret [23]. The Taylor Polynomial (TP) is a popular nonlinear model suited to interpolation and extrapolation in curve fitting [22].
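
The least squares fit of a regression line mentioned above has a simple closed form for one predictor. The sketch below assumes a single independent variable and uses made-up data; the function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical measurements lying exactly on y = 2x + 1.
slope, intercept = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(slope, intercept)
```

The slope minimizes the sum of squared residuals; the intercept then forces the line through the point of means.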

C. Clustering

Clustering segments data into groups based on data similarity. It is used to identify outliers, and the resulting groups themselves may be the matter of interest. Clustering can be achieved by various algorithms and is an iterative process (involving trial and error). Some popular clustering techniques are: K-means, Fuzzy K-means, Hierarchical, NN, and Gaussian Mixture. K-means is a partitioning method that divides data into K exclusive clusters. Each cluster has a centroid (or center), and the sum of distances from all objects to their cluster center is minimized. Example neural network architectures for clustering are: (i) self-organizing maps and (ii) competitive layers. Gaussian Mixture works well when clusters have different sizes and are correlated; it assumes that the data is drawn from a fixed number K of normal distributions. In general, for clustering: (i) no method is perfect for every data modeling task (the choice depends on the data), (ii) the process is iterative, so different algorithms should be explored, and (iii) one should beware of local minima (global optimization can help).
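
The K-means procedure described above alternates between assigning points to the nearest centroid and moving each centroid to the mean of its members. The following is a minimal sketch under assumed inputs: the initial centroids are supplied by hand, and the toy points are hypothetical.

```python
import math

def assign_clusters(points, centroids):
    """Assign every point to the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, initial_centroids, max_iterations=100):
    """Iterate assignment and centroid update until the centroids stop moving."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iterations):
        clusters = assign_clusters(points, centroids)
        # Each centroid moves to the mean of its members (empty clusters stay put).
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
            for old, members in zip(centroids, clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign_clusters(points, centroids)

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, initial_centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

Because the result depends on the initial centroids, a poor start can end in a local minimum, which is the caveat noted in point (iii) above.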

D. Ensemble Techniques

Ensemble techniques combine numerous models to create a learning method with better performance than the individual classifiers. In this technique, the unseen data (test data) is passed to each individual classifier, which returns a vote. The ensemble then returns the final class prediction based on the majority of the classifiers' votes (Fig. 2). This technique is appropriate when there is not enough data available to represent the data distribution. It is also a decent option when the choice of computational model is uncertain, and when a single classifier is not able to solve a complex problem. The technique is used in many industrial applications, such as intrusion detection, malware and fraud detection, remote sensing, and speech and identity recognition. Random Forest (RF) is a well-known ensemble technique consisting of many decision trees, each of which is built by sampling with replacement [24], [25]. Another ensemble technique is the Adaptive Boosting (AdaBoost) algorithm, which can be used for regression and/or classification problems in industry.

FIGURE 2.

An example of ensemble learning [26].
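
The voting step of an ensemble can be sketched in a few lines. The three threshold rules below are hypothetical stand-ins for trained classifiers, chosen only to show how individual votes are combined.

```python
from collections import Counter

def majority_vote(classifiers, sample):
    """Return the class predicted by most of the individual classifiers."""
    votes = [classify(sample) for classify in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical weak classifiers: simple threshold rules on a 2-D sample.
classifiers = [
    lambda s: "high" if s[0] > 5 else "low",
    lambda s: "high" if s[1] > 5 else "low",
    lambda s: "high" if s[0] + s[1] > 10 else "low",
]

print(majority_vote(classifiers, (6, 3)))    # votes: high, low, low
print(majority_vote(classifiers, (6, 4.5)))  # votes: high, low, high
```

Using an odd number of voters on a two-class problem avoids ties; with more classes a tie-breaking rule would be needed.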

E. Resampling Techniques

Resampling techniques are very prevalent methods because of their accuracy, robustness, simplicity and high generalizability.

Resampling generates new data using different methods without relying on a theoretical distribution. It is used when knowledge of the data distribution is very limited or unavailable [8]. Resampling repeatedly generates new data with or without replacement. Bootstrapping and randomization methods are examples of resampling techniques with and without replacement, respectively. Some popular resampling techniques are JackKnife (JK), Monte Carlo Simulation (MCS) and exact test methods. MCS is repeated random sampling over many possible scenarios to obtain numerical results and estimates.
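
Bootstrapping and the jackknife can be sketched for a simple statistic such as the mean. The data values, the number of resamples and the seed below are assumptions for illustration.

```python
import random
import statistics

def bootstrap_means(data, n_resamples=1000, seed=0):
    """Means of resamples drawn from `data` with replacement (bootstrap)."""
    rng = random.Random(seed)
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)]

def jackknife_means(data):
    """Means of the leave-one-out subsets of `data` (jackknife)."""
    return [statistics.fmean(data[:i] + data[i + 1:]) for i in range(len(data))]

# Hypothetical small sample of measurements.
data = [2.0, 4.0, 6.0, 8.0]
boot = bootstrap_means(data)
jack = jackknife_means(data)
print("bootstrap spread of the mean:", statistics.stdev(boot))
print("jackknife means:", jack)
```

The spread of the resampled means gives an estimate of the uncertainty of the sample mean without assuming any particular distribution for the data.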

F. Representation Learning

Representation learning is a technique to predict or classify unstructured data, which is useful in various ML tasks, such as dimensionality reduction. It determines a lower-dimensional representation that captures the input configurations of the original dataset. It can offer a solution for big data by facilitating significant improvements in statistical and computational efficiency. The hidden representations seek to retain the core information that ML statistical processes need to model the full properties of the data, such as vertex content and topological structure. Following this, modeling and analytics tasks can be readily carried out with vector-based and conventional machine learning algorithms.

G. Deep Learning

Deep Learning (DL) can automatically find hierarchical representations of data sets, using deep architectures in supervised and/or unsupervised learning strategies. By using these strategies, DL has the potential to capture complicated patterns in highly nonlinear big data with large feature spaces. Different DL architectures, such as Deep Neural Networks (DNNs) and Recurrent Neural Networks (RNNs), stack numerous data processing layers, in contrast to commonly shallow algorithms [27].

A DNN requires many more parameters than traditional systems, which incurs a large cost during online evaluation. A newer effort aimed at reducing the DNN model size while keeping the accuracy improvements uses Singular Value Decomposition (SVD). SVD is applied to the weight matrices of the DNN, and the model is then restructured based on the inherent sparseness of the original matrices. After restructuring, the DNN model size can be reduced significantly with negligible accuracy loss [28].
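
The size reduction can be made concrete with a short worked equation. This is only a sketch, assuming that the restructuring keeps the $k$ largest singular values of an $m\times n$ weight matrix $\mathbf {W}$; the symbols $m$, $n$, $k$ and the example sizes are illustrative and are not taken from [28].

$$\mathbf {W}=\mathbf {U}\boldsymbol {\Sigma }\mathbf {V}^{T}\approx \underbrace {\mathbf {U}_{k}\boldsymbol {\Sigma }_{k}}_{m\times k}\;\underbrace {\mathbf {V}_{k}^{T}}_{k\times n},\qquad mn\;\longrightarrow \;k(m+n)\ \text {parameters}$$

The factorized layer is smaller whenever $k<mn/(m+n)$. For instance, with the hypothetical sizes $m=n=2048$ and $k=256$, the parameter count drops from $4{,}194{,}304$ to $1{,}048{,}576$, a fourfold reduction.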

Recent architectures proposed for DL, such as Convolutional Neural Networks (CNNs), Deep Belief Networks (DBNs) and Group Method of Data Handling-type Neural Networks (GMDH-type NN), can handle difficult learning problems in classification and regression tasks. Improvements in CNN architecture can be categorized as parameter optimization, regularization, and structural reformulation. Depending upon the type of architectural modification, CNNs can be broadly categorized into seven classes, namely: (i) spatial exploitation, (ii) depth, (iii) multi-path, (iv) width, (v) feature map exploitation, (vi) channel boosting, and (vii) attention, as shown in Fig. 3 [29]. Other DL architectures include Inception [30], deep Residual Network (ResNet) [31], and VGG16 [32]. Szegedy et al. [30] proposed Inception-v4, which combines the Inception architecture with residual connections to accelerate network training. The core idea of ResNet is a so-called “identity shortcut connection” that skips one or more layers [31]. Moreover, VGG16 is a convolutional neural network model proposed by Simonyan and Zisserman [32] that improves on AlexNet by replacing large kernel-sized filters with multiple smaller kernel-sized filters one after another. GMDH is a family of inductive algorithms for computer-based mathematical modeling of multi-parametric datasets that features fully automatic structural and parametric optimization of models. The GMDH-type NN is a way of using self-organizing networks and has been shown to result in successful applications in a broad range of areas [33].

FIGURE 3.

Recent taxonomy of deep CNN architectures [29].

H. Distributed and Parallel Learning

The Distributed and Parallel Learning (D&PL) technique uses parallel processing across a distributed network, distributing the learning process to make ML algorithms more efficient. D&PL addresses a technical limitation of classical ML, which naturally needs the whole dataset to fit within local memory. D&PL techniques divide the data vertically (by features within the training set) or horizontally (by instances within the training set). In most cases the division is horizontal, as this is the most natural choice for most applications. Although D&PL architectures can be leveraged in classification and regression tasks, classification has found more widespread application because it is simpler to implement.
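
The difference between horizontal and vertical division can be sketched on a small table of rows. The function names and the round-robin assignment of rows and columns to parts are assumptions made for this example.

```python
def split_horizontal(rows, n_parts):
    """Partition by instances: each part holds complete rows."""
    return [rows[i::n_parts] for i in range(n_parts)]

def split_vertical(rows, n_parts):
    """Partition by features: each part holds some columns of every row."""
    n_features = len(rows[0])
    column_groups = [range(i, n_features, n_parts) for i in range(n_parts)]
    return [[tuple(row[j] for j in cols) for row in rows] for cols in column_groups]

# Hypothetical training set: three instances with four features each.
rows = [(1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12)]
print(split_horizontal(rows, 2))
print(split_vertical(rows, 2))
```

With a horizontal split every worker sees all features and can train a complete model on its share of instances, which is one reason this division is the more common choice.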

I. Transfer Learning

Transfer Learning (TL) leverages the heterogeneity of the dataset, which is typical of big data processing and analytics. It is motivated by the velocity characteristic of big data, where training new models can consume significant resources and time. A wide range of models can be trained using TL in a more efficient manner through classifying a group of domains. This approach can be used for regression, classification or clustering tasks. Fig. 4 shows a typical set of training performance curves for ML models with and without TL. The model with TL not only starts at a higher performance but also converges faster towards an optimal solution. This can be attributed to the old model being closer to the solution as a starting point than a random initialization, so that when updating via gradient descent through back-propagation, or other applicable algorithms, fewer iterations are required to converge to the solution for the new task.

FIGURE 4.

Transfer learning technique: (a) model with a higher initial performance, (b) improvement with a greater rate of performance (c) a faster training timeframe [28].
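
The effect of a better starting point on the number of gradient descent iterations can be shown on a one-parameter toy problem. This is a sketch only: the quadratic loss, the learning rate, the tolerance and both starting values are assumptions, with the "pre-trained" start simply placed near the new optimum.

```python
def gradient_descent_steps(w0, target, lr=0.1, tol=1e-3, max_steps=10_000):
    """Count gradient descent steps on the loss (w - target)^2 until |w - target| < tol."""
    w = w0
    for step in range(max_steps):
        if abs(w - target) < tol:
            return step
        # Gradient of (w - target)^2 with respect to w is 2 * (w - target).
        w -= lr * 2.0 * (w - target)
    return max_steps

new_task_optimum = 3.0
# Hypothetical starting points: one far away (stand-in for a random
# initialization) and one close (stand-in for a model trained on a related task).
cold_steps = gradient_descent_steps(w0=-7.0, target=new_task_optimum)
warm_steps = gradient_descent_steps(w0=2.5, target=new_task_optimum)
print("steps from scratch:", cold_steps, "steps with transfer:", warm_steps)
```

The warm start needs fewer iterations because its initial error is smaller, mirroring the higher initial performance and shorter training time sketched in Fig. 4.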

J. Active Learning

Active Learning (AL) seeks to shift ML strategies for big data from using huge volumes of unlabelled data to using labelled data. AL can be used in a broad range of supervised and semi-supervised ML scenarios. Obtaining labelled data is time consuming, and AL can reduce the associated cost by classifying a subset of points in the original data distribution. Fig. 5 provides an intuitive example of AL. The AL process identifies key points in a dataset which, if labelled, make the remaining points easier to classify. Similar approaches can be taken for regression modeling.

FIGURE 5.

Active learning technique: (a) an instance labelled as belonging to class 1 (red), with the other points depicted in grey; (b) using active learning, two points (white circles) are identified and labelled as class 1 (red) and class 2 (blue); (c) a k-Means machine learning technique is then used to more easily classify the other points.

K. Kernel-Based Learning

Kernel-based learning techniques use nonlinear kernel functions, along with well-known methods such as SVMs, to map the input space of a dataset to a higher-dimensional feature space. This allows greater expressive power and a better chance of performing various analyses with the same dataset [34]. However, it often entails significantly higher computational complexity than linear kernels. It may appear counterintuitive to build a higher-dimensional representation of a big dataset, but these methods have been shown to be successful in many applications. One often must trade off maximizing computational efficiency against minimizing the impact of increasing the size of the dataset. Fig. 6 illustrates the kernel mapping process for simple prediction models.

FIGURE 6.

Kernel mapping process.
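
The benefit of mapping to a higher-dimensional feature space can be illustrated with an explicit feature map. The sketch below is an assumption-laden toy: it uses the map $x \mapsto (x, x^2)$ and hand-picked weights, whereas kernel methods normally evaluate such mappings implicitly through inner products.

```python
def feature_map(x):
    """Map a scalar input to the two-dimensional feature vector (x, x^2)."""
    return (x, x * x)

def linear_predict(features, weights, bias):
    """Linear decision rule in feature space: class 1 if the score is positive."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if score > 0 else 0

def best_threshold_accuracy(xs, labels):
    """Best accuracy of any single-threshold rule on the raw 1-D input."""
    points = sorted(xs)
    thresholds = [points[0] - 1.0] + [(a + b) / 2 for a, b in zip(points, points[1:])]
    best = 0.0
    for t in thresholds:
        for positive_above in (True, False):
            preds = [int((x > t) == positive_above) for x in xs]
            accuracy = sum(p == y for p, y in zip(preds, labels)) / len(xs)
            best = max(best, accuracy)
    return best

# Hypothetical data: class 1 lies on both sides of class 0 on the real line.
xs = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1, 1, 0, 0, 0, 1, 1]

raw_accuracy = best_threshold_accuracy(xs, labels)
mapped_preds = [linear_predict(feature_map(x), weights=(0.0, 1.0), bias=-2.0) for x in xs]
mapped_accuracy = sum(p == y for p, y in zip(mapped_preds, labels)) / len(xs)
print("raw input:", raw_accuracy, "feature space:", mapped_accuracy)
```

No single threshold separates the classes on the raw input, but a linear rule on the squared feature does, at the cost of working in a larger space.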

L. Multi-Objective Learning

Learning algorithms in ML can be divided into three categories: single-objective learning, scalarized multi-objective learning, and Pareto-based multi-objective learning [35]. There are two main weaknesses of scalarized multi-objective learning: (i) determining an appropriate hyperparameter $\lambda $ that properly reflects the purpose of the user is not trivial, and (ii) only a single solution can be obtained, from which little insight into the problem can be gained [36]. Most multi-objective ML problems can be solved using Pareto-based multi-objective optimization, particularly given the great success of multi-objective optimization using evolutionary algorithms and other population-based stochastic search methods [37]. Pareto-based multi-objective learning approaches are more powerful than learning algorithms with a scalar cost function in addressing various ML tasks [36]. One gains deeper insight into the learning problem by analysing the Pareto front composed of multiple Pareto-optimal solutions [38]. Unlike single-objective optimization, multi-objective optimization reduces the chances of falling into a local minimum. Moreover, in a multi-objective optimization framework, improving one of the objective functions might worsen the other objective function(s); thus, there is no single point that optimizes all the objective functions. Multi-objective optimization algorithms based on evolutionary algorithms [39] have been proposed in the literature, e.g. NSGA-II [16], NSGA-III [40], multi-objective uniform-diversity differential evolution (MUDE) [41], and on-line variable-fidelity meta-model assisted Multi-Objective Genetic Algorithm (OLVFM-MOGA) [42].
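
The notion of a Pareto front can be sketched with a simple non-dominated filter. The code below assumes that every objective is to be minimized; the candidate tuples are made-up values for two objectives and do not come from the paper.

```python
def dominates(a, b):
    """True if solution a is no worse than b in every objective and better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(solutions):
    """Keep the solutions that no other solution dominates (all objectives minimized)."""
    return [s for s in solutions if not any(dominates(other, s) for other in solutions)]

# Hypothetical candidates scored on two objectives to be minimized.
candidates = [(0.10, 0.30), (0.15, 0.20), (0.20, 0.25), (0.30, 0.10), (0.12, 0.35)]
front = pareto_front(candidates)
print(front)
```

Each point on the front represents a different trade-off between the objectives, which is why analysing the whole front gives more insight than a single scalarized optimum. Evolutionary methods such as NSGA-II build on this dominance test but add ranking and diversity mechanisms not shown here.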

M. Hybrid Data-Driven Learning

As stated in previous sections, ML techniques have some limitations, and a hybrid ML technique can potentially capture more characteristics of complex systems to overcome these limitations. Recent studies of hybrid models combining different ML techniques have shown promising results [22], [37]. There are various types of frameworks for developing hybrid models, and it is not known which hybrid model performs best in data-driven learning. Some recent hybrid data-driven methods are: a combination of NN and EKF, the GPR technique with an ARD kernel [21], SVM and SVR models optimized by GA [43], and RVM with incremental learning [44]. ANNs have become popular practical solutions to engineering problems. In the following, we introduce a novel algorithm of this type and apply it to an ML task under limited data constraints.

SECTION III.

A Novel Algorithm for Robust Modeling of Nonlinear Processes

In Industry 4.0 applications, an offline approximate model needs to be developed that can express the relationship between the inputs and outputs of an industrial process. Off-line modeling provides an initial identification model of the complex nonlinear system. This model is then updated using limited data obtained from the online process. The online adaptation requires real-time algorithms, while algorithms with long runtimes, such as evolutionary optimization methods, can only be used for offline modeling [15]. To address the needs of both the offline and online parts, we propose a new algorithm that builds a robust model from input-output data using an offline deterministic modeling part and an online updating part.

In the first part, the structure (topology) of the NN and the initial values of its coefficients are extracted based on deterministic input-output data. In other words, the uncertainties included in the observed data are not considered and the modeling is done based on nominal values of input-output data. It is obvious that there are many sources of uncertainties, such as human error, laboratory equipment errors, or errors due to changes in environmental factors, which can affect the data. In order to obtain a robust model, all these sources of uncertainties must be considered as a percentage of the variation around the reported nominal value [45]. In the second part, the coefficients of the obtained model are updated using MCS in combination with UKF to achieve a robust model. In the following, we give details of these steps.

A. Off-Line Modeling Based on Deterministic Data

In the off-line deterministic modeling part, the modified Taguchi DOE is first used to make reliable statistical tests with a small sample size [16]. Then, a polynomial model is trained to explain the relationship between the inputs and outputs of the industrial process. To this end, one can use GMDH-type NNs with multi-objective optimization and SVD to overcome both overfitting and singularity. In the proposed algorithm, we use the fuzzy adaptive mutation proposed in [46] to reach global optimum solutions. As shown in Fig. 7, in the first part (left), a primary model is extracted, and the derived model is used as an input to the second part, which builds a more robust model. This model is not sensitive to the properties of the data and can perform reasonably with both limited and big data.

FIGURE 7.

A novel algorithm to create a robust model for industrial processes with available input-output data.

As shown in Fig. 7, the input-output data are divided into a training set and a prediction set. The training set, which consists of 60% of all input–output data pairs, is used for training the neural network model. The prediction set consists of the remaining 40% of the data, which are unseen during the training process and are used to test how well the trained model captures the relationship between the inputs and outputs of the process.

For acceptable performance of a GMDH-type NN, the topology and the polynomial coefficients of each neuron must be properly determined. Here, SVD is used to determine the polynomial coefficients of each neuron, and a multi-objective Genetic Algorithm (GA) is used to find the optimal topology of the GMDH-type NN. In the multi-objective optimization, the modeling and prediction errors are considered simultaneously as objectives. Using the Non-dominated Sorting Genetic Algorithm II (NSGA-II) [15], Pareto-optimal non-dominated models are obtained with respect to these two objective functions.

An alphabetical chromosome is used to encode the structure and topology of the general structure of GMDH (GS-GMDH). In the conventional GMDH, each neuron is built from a combination of two neurons in the adjacent layer, while in the GS-GMDH, neurons in all previous layers can be used to build a new neuron. The alphabetical coding of such a GS-GMDH is shown in Fig. 8. In a GS-GMDH neural network, neuron ac in the first hidden layer is connected to the output layer by passing directly through the second hidden layer. Therefore, the name of the output neuron (the network’s output) includes ac twice, as acac. In other words, a virtual neuron named ac is constructed in the second hidden layer and used with abac in the same layer to make the output neuron abacacac, as shown in Fig. 8.

The evolutionary process starts by randomly generating an initial population of alphabetical chromosomes, each a candidate solution. Then, using crossover, mutation, and tournament selection, the entire population of symbolic strings improves gradually based on the training and prediction errors.
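To make the role of SVD concrete, the following is a minimal sketch of fitting the coefficients of a single two-input quadratic neuron by an SVD-based pseudo-inverse, with a 60/40 split into training and prediction sets. The quadratic form matches the six-term neuron polynomials reported later in equation (8a); the synthetic data, the function names, and the singular-value cutoff `rcond` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def quad_features(u, v):
    """Design matrix of a two-input quadratic neuron:
    y = a0 + a1*u + a2*v + a3*u^2 + a4*v^2 + a5*u*v."""
    return np.column_stack([np.ones_like(u), u, v, u**2, v**2, u * v])

def fit_neuron_svd(u, v, y, rcond=1e-10):
    """Least-squares coefficients via the SVD pseudo-inverse (rcond is an assumed cutoff)."""
    A = quad_features(u, v)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Discard near-zero singular values so a singular design matrix does not blow up.
    s_inv = np.zeros_like(s)
    mask = s > rcond * s.max()
    s_inv[mask] = 1.0 / s[mask]
    return Vt.T @ (s_inv * (U.T @ y))

def mse(a, u, v, y):
    """Mean squared error of the neuron with coefficients a."""
    return float(np.mean((quad_features(u, v) @ a - y) ** 2))

# Synthetic, noise-free data (hypothetical; not the data of Table 1).
rng = np.random.default_rng(0)
m = 30
u = rng.uniform(1.0, 2.0, m)
v = rng.uniform(1.0, 2.0, m)
true_a = np.array([1.0, -0.5, 0.3, 0.2, -0.1, 0.05])
y = quad_features(u, v) @ true_a

n_train = round(0.6 * m)  # 60% for training, the remaining 40% for prediction
coeffs = fit_neuron_svd(u[:n_train], v[:n_train], y[:n_train])
training_error = mse(coeffs, u[:n_train], v[:n_train], y[:n_train])
prediction_error = mse(coeffs, u[n_train:], v[n_train:], y[n_train:])
```

In a full GMDH design, `training_error` and `prediction_error` would be the two objectives evaluated for each candidate topology.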

FIGURE 8.

The alphabetical chromosome representing GS-GMDH.

To achieve high modeling accuracy, the polynomial degree is usually increased, which often reduces the generalizability (prediction capability) of the model because of overfitting. The designer therefore chooses a trade-off from the Pareto non-dominated solutions by compromising between the two objective functions. A polynomial model relating the inputs and outputs of the industrial process is then developed using the selected GMDH structure. The derived NN topology is used as the input to the second part of the proposed algorithm.

B. Online Update to Make the Model Robust

In the second part (right side of Fig. 7) of the proposed algorithm, the model obtained in the first part is modified to enhance its robustness. To this end, the coefficients of the derived model are updated using UKF to capture the uncertainties in the input-output data. To obtain a robust model, all sources of uncertainty are considered as a percentage variation around the nominal values of the main data table. To take the uncertainties into account, $N$ data tables are built around the nominal values using MCS. The network’s error on each data table is calculated according to \begin{equation*} E_{j}=\frac {\sum _{i=1}^{k} \left ({y_{model}-y_{actual} }\right)^{2} }{k},\quad j=1,2,\ldots,N\tag{1}\end{equation*} where the sum runs over the $k$ samples of data table $j$.

The goal of the UKF is to obtain the network coefficients that minimize the mean and variance of the network error. Minimizing the mean of the error, $mean\left ({E_{j} }\right)$, reduces the total network error over all tables, while minimizing the variance of the error, $var\left ({E_{j} }\right)$, reduces the variation of the error around its average value. The mean and variance of the error are combined according to equation (2), and $F$ is minimized in a recursive process.\begin{equation*} F=Mean\left ({E_{j} }\right)+var\left ({E_{j} }\right)\tag{2}\end{equation*}
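Equations (1) and (2) can be illustrated with a short sketch that builds $N$ perturbed data tables and evaluates $E_{j}$ and $F$ for a given model. The text does not state the sampling distribution used in the MCS, so a uniform variation of ±5% around each nominal entry is assumed here, as is the population variance; the toy model and nominal table are hypothetical.

```python
import numpy as np

def make_mcs_tables(D, n_tables, pct=0.05, seed=0):
    """Build n_tables perturbed copies of the nominal table D (rows are samples).
    Assumption: every entry varies uniformly within +/- pct of its nominal value."""
    rng = np.random.default_rng(seed)
    factors = rng.uniform(1.0 - pct, 1.0 + pct, size=(n_tables, *D.shape))
    return D[None, :, :] * factors

def robustness_cost(model, tables):
    """Return (E, F): E[j] follows equation (1) on table j, F follows equation (2)."""
    # The last column of each table holds the measured output; the rest are inputs.
    E = np.array([np.mean((model(t[:, :-1]) - t[:, -1]) ** 2) for t in tables])
    return E, float(E.mean() + E.var())

def toy_model(X):
    """Hypothetical stand-in for the trained polynomial model."""
    return X[:, 0] + 2.0 * X[:, 1]

# Hypothetical nominal table: two inputs and one output per row.
X_nominal = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 1.8], [1.2, 2.0]])
D_nominal = np.column_stack([X_nominal, toy_model(X_nominal)])

tables = make_mcs_tables(D_nominal, n_tables=1000)
E, F = robustness_cost(toy_model, tables)
```

A model whose predictions change little under the perturbations yields both a small mean and a small variance of `E`, and therefore a small `F`.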

The UKF equations for determining the GMDH-type neural network coefficients are as follows:

($i$) The weight vector of the network and its covariance matrix are initialized with:\begin{align*} \hat {a}_{0}=&E\left [{ a_{0} }\right] \tag{3}\\ P_{0}=&E\left [{ \left ({a_{0}-\hat {a}_{0} }\right)\left ({a_{0}-\hat {a}_{0} }\right)^{T} }\right]\tag{4}\end{align*} where $a=\left \{{a_{1},a_{2},\ldots,a_{s} }\right \}$ is the vector of coefficients of the neural network and $s$ is the number of coefficients.
($ii$) The time-update equations are:\begin{align*} \hat {a}_{k}^{-}=&\hat {a}_{k-1} \tag{5}\\ P_{a,k}^{-}=&P_{a,k-1}+R_{r,k-1}\tag{6}\end{align*} where $k$ indicates the time.
($iii$) The sigma points and the measurement update are computed according to the basic equations of the UKF.
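The three steps above can be sketched as a single UKF parameter-estimation step. The time update follows equations (5) and (6); the sigma points and measurement update are not spelled out in the text, so the standard scaled unscented transform is assumed here. The measurement-noise variance `R_e`, the scaling parameters `alpha`, `beta`, `kappa`, the scalar-output assumption, and the toy model `g` are all illustrative choices rather than the authors' settings.

```python
import numpy as np

def ukf_parameter_step(a, P, x, d, g, R_r, R_e, alpha=1.0, beta=2.0, kappa=0.0):
    """One UKF step for coefficient estimation with a scalar model output g(x, a)."""
    s = a.size
    # Time update, equations (5) and (6).
    a_prior = a.copy()
    P_prior = P + R_r
    # Sigma points from the scaled unscented transform (assumed, standard form).
    lam = alpha**2 * (s + kappa) - s
    S = np.linalg.cholesky((s + lam) * P_prior)
    sigma = np.vstack([a_prior, a_prior + S.T, a_prior - S.T])  # shape (2s+1, s)
    w_m = np.full(2 * s + 1, 1.0 / (2.0 * (s + lam)))
    w_c = w_m.copy()
    w_m[0] = lam / (s + lam)
    w_c[0] = w_m[0] + (1.0 - alpha**2 + beta)
    # Measurement update: propagate each sigma point through the model.
    y_sig = np.array([g(x, p) for p in sigma])
    y_hat = w_m @ y_sig
    dy = y_sig - y_hat
    P_yy = w_c @ dy**2 + R_e                 # innovation variance (scalar)
    P_ay = (w_c * dy) @ (sigma - a_prior)    # cross-covariance, shape (s,)
    K = P_ay / P_yy                          # Kalman gain
    a_new = a_prior + K * (d - y_hat)
    P_new = P_prior - np.outer(K, K) * P_yy
    return a_new, P_new

def g(x, a):
    """Hypothetical polynomial model, linear in its coefficients."""
    return a[0] + a[1] * x + a[2] * x**2

# Recover known coefficients from noise-free samples (synthetic data).
true_a = np.array([1.0, -2.0, 0.5])
rng = np.random.default_rng(0)
a_est, P_est = np.zeros(3), np.eye(3)
for x in rng.uniform(1.0, 2.0, 200):
    a_est, P_est = ukf_parameter_step(a_est, P_est, x, g(x, true_a), g,
                                      R_r=1e-6 * np.eye(3), R_e=1e-4)
```

For the robust-modeling loop, the scalar measurement would be replaced by the quantity driving the cost $F$; the sketch only shows the mechanics of one coefficient update.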

Algorithm 1 shows the pseudo-code of the proposed algorithm and summarizes the steps.

Algorithm 1 The Proposed Algorithm for Robust Modeling of Data

Part A: offline modeling:

A1. Enter matrix of experimental data set D[X(m,n) Y(m,1)] % $m$ and $n$ are the number of experimental data samples and the number of inputs, respectively.

A2. T=D(1:t,n) and V=D(t+1:m,n) % Making training ${(}T{)}$ and validation ${(}V{)}$ data sets

A3. Pa=$[{Y}_{{\textit {1}}}~Y_{{\textit {2}}}\ldots Y_{k}{]}$ % Pareto optimum design of GMDH and finding $k$ non-dominated solutions ${(}Y_{{\textit {1}}}\ldots \text{Y}_{k}{)}$

A4. $\text{Y}_{model}=F{(}X{)}$ % Select trade-off model from Pareto front and finding polynomial model

Part B: on-line updating:

B1. DT = $[{D}_{1} D_{2} \ldots D_{N}{]}$ % Building $N$ data table sets around the nominal values of $D$ using MCS.

B2. $E_{j}\!=\!\frac {\sum \limits _{i=1}^{k} \left ({y_{model}-y_{actual} }\right)^{2}}{k}$ % calculating the error of $Y_{model}$ on all data tables ($j=1\ldots N$)

B3. $F=Mean\left ({E_{j} }\right)+var\left ({E_{j} }\right)$ % finding the cost function based on the mean and variance of the derived errors

B4. $\hat {a}_{k}^{-}=\hat {a}_{k-1},\quad P_{a,k}^{-}=P_{a,k-1}+R_{r,k-1}$ % Update coefficients of $Y_{model}$ using UKF

B5. if ${(}F > \varepsilon {)}$ go to B2 else go to B6 % $\varepsilon $ is the threshold value of $F$ (equation 2)

B6. $Y_{robust}=G{(}X{)}$ % finding the robust model

SECTION IV.

Industrial Case Study

To examine the performance of the proposed algorithm, an industrial case study is considered. To this end, a 27-sample (L27) modified Taguchi DOE (Table 1) was constructed at a carbon fiber production line. The PAN fiber inputs were chosen as the controlling parameters based on the feasible processing window of the pilot plant: temperatures of 227, 230, 233, and 236°, space velocities of 20, 25, 30, and 35 m/h, and stretching ratios of 1.0, 2.0, 3.0, and 4.0%. The measured output was a physical property. To reduce the number of experiments, the Taguchi design was modified by adding some marginal operating parameters, as listed in Table 1.

TABLE 1 Input-Output Experimental Data of Carbon Fiber Production Line

NSGA-II and SVD are employed to design the optimal topology and to find the optimal values of the polynomial coefficients of the GMDH, respectively. The limited input-output data show that the inputs are not of the same order of magnitude. For example, $x_{1}$ is of order $10^{3}$, while $x_{3}$ is of order $10^{0}$. Therefore, all inputs are mapped to the interval [1, 2] using the linear mapping in equation (7), which keeps the relative impact of each data set unchanged.\begin{equation*} x_{i}^{M}=1+\frac {\left ({x_{i}-x_{i}^{min} }\right)}{\left ({x_{i}^{max}-x_{i}^{min} }\right)},\quad i=1,2,3\tag{7}\end{equation*} where $x_{i}^{M}$ and $x_{i}$ are the mapped and actual values of the input variables, respectively, and $x_{i}^{max}$ and $x_{i}^{min}$ are the maximum and minimum values of each input variable. Other mapping methods, such as logarithmic normalization, could not be employed here because they would change the impact of each data set.
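As a small illustration of equation (7), the sketch below maps each input column to the interval [1, 2]. The sample rows simply pair the listed factor levels for demonstration and are not rows of Table 1; the function name is an assumption.

```python
import numpy as np

def map_inputs(X):
    """Apply equation (7) column by column: x_M = 1 + (x - x_min) / (x_max - x_min)."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return 1.0 + (X - x_min) / (x_max - x_min)

# Illustrative rows: temperature, space velocity, stretching ratio.
X = np.array([
    [227.0, 20.0, 1.0],
    [230.0, 25.0, 2.0],
    [233.0, 30.0, 3.0],
    [236.0, 35.0, 4.0],
])
X_mapped = map_inputs(X)
```

Because the mapping is linear, the relative spacing of the levels within each input is preserved, which is the property the text relies on when ruling out logarithmic normalization.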

Separate training and validation sets are used to demonstrate the prediction capability of the optimized GMDH-type neural networks. The training set is composed of 18 of the 30 input–output data pairs. The validation set, consisting of the 12 input–output data samples that are unseen during the training process, is used solely for checking the performance of the trained model. GMDH-type neural networks are employed to fit a polynomial curve to the output of the model as a function of the effective input parameters. Considering two objective functions (namely, Training Error, TE, and Prediction Error, PE), NSGA-II is used for the Pareto multi-objective optimization of the GMDH-type neural network. A population size of 80, a crossover probability of 0.9, a mutation probability of 0.1, and 300 generations were found to be suitable in a trial-and-error study. The resultant Pareto front, representing a set of non-dominated superior solutions, is shown in Fig. 9. In this figure, points A and B designate solutions with the best PE and TE, respectively. As compared with solution ‘A’, point ‘C’ shows a very small increase in the TE value (about 0.7%) but a substantial improvement (about 83%) in PE. Consequently, when the two objectives are simultaneously taken into consideration, point ‘C’ may be considered the best trade-off solution.
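The notion of non-dominated solutions used for the Pareto front can be illustrated with a minimal filter over (TE, PE) pairs, where both objectives are minimized. This is only the dominance check, not NSGA-II itself, and the candidate values are hypothetical rather than the results behind Fig. 9.

```python
import numpy as np

def non_dominated(points):
    """Return indices of points that no other point dominates (both objectives minimized)."""
    pts = np.asarray(points, dtype=float)
    keep = []
    for i, p in enumerate(pts):
        # q dominates p if q is no worse in every objective and strictly better in one.
        no_worse = np.all(pts <= p, axis=1)
        strictly_better = np.any(pts < p, axis=1)
        if not np.any(no_worse & strictly_better):
            keep.append(i)
    return keep

# Hypothetical (TE, PE) pairs for five candidate GMDH topologies.
candidates = [(0.10, 0.50), (0.12, 0.20), (0.20, 0.15), (0.15, 0.30), (0.25, 0.40)]
front = non_dominated(candidates)
```

In this toy set the last two candidates are dominated by the second one, so only the first three remain as trade-off options for the designer.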

FIGURE 9.

Final front including non-dominated solutions in plane of objective functions.

The structure of the neural network corresponding to design point C obtained by NSGA-II is shown in Fig. 10. The reasonable behavior of the GMDH-type neural network model on the training and validation data is shown in Fig. 11. Moreover, Fig. 12 shows a correlation coefficient of 92.36% between the actual and predicted values, indicating a reasonably accurate model. Table 2 shows the values of TE and PE corresponding to points A, B, and C.

TABLE 2 TE and PE of Optimum Design Points Obtained in Fig. 7

FIGURE 10.

Optimized structure of GMDH-type neural network of design point C.

FIGURE 11.

Performance of the optimized GMDH model on both actual and validation data sets corresponding to optimum point C.

FIGURE 12.

Correlation coefficient between the actual values and predicted ones corresponding to optimum point C.

To check the robustness of the obtained model, a ±5% variation around the nominal values reported in Table 1 is considered, and 1000 random data tables are constructed using MCS. Fig. 13 shows the upper and lower bounds of the values predicted by this model. The large deviations from the nominal values indicate poor robustness of the derived model. UKF is therefore used to improve its robustness. Note that the GMDH structure shown in Fig. 10 is retained and only the coefficients of the equations are updated by UKF. The equations governing the neurons of the model derived by UKF are as follows: \begin{align*} y_{1}=&1.29-0.01x_{1}-0.0546x_{2}+0.0135x_{1}^{2} \\&+\,0.0156x_{2}^{2}-0.00718x_{1}x_{2} \tag{8a}\\ y_{model}=&-4.24+7.787y_{1}-0.0129x_{3}-2.71y_{1}^{2} \\&+\,0.003x_{3}^{2}+0.0022y_{1}x_{3}\tag{8b}\end{align*}
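Equations (8a) and (8b) can be evaluated directly, and the spread of the predictions under a ±5% input variation can be estimated in the same Monte Carlo fashion. The sketch below assumes that the inputs are the mapped values in [1, 2], that only the inputs are perturbed, and that the perturbation is uniform; none of these details is stated explicitly in the text, and the nominal rows are illustrative.

```python
import numpy as np

def y_model(x1, x2, x3):
    """Two-neuron polynomial model of equations (8a) and (8b)."""
    y1 = (1.29 - 0.01 * x1 - 0.0546 * x2 + 0.0135 * x1**2
          + 0.0156 * x2**2 - 0.00718 * x1 * x2)
    return (-4.24 + 7.787 * y1 - 0.0129 * x3 - 2.71 * y1**2
            + 0.003 * x3**2 + 0.0022 * y1 * x3)

def prediction_bounds(X_nominal, n_tables=1000, pct=0.05, seed=0):
    """Lower and upper prediction bounds over n_tables perturbed input tables.
    Assumption: each input varies uniformly within +/- pct of its nominal value."""
    rng = np.random.default_rng(seed)
    factors = rng.uniform(1.0 - pct, 1.0 + pct, size=(n_tables, *X_nominal.shape))
    X = X_nominal[None, :, :] * factors
    preds = y_model(X[..., 0], X[..., 1], X[..., 2])  # shape (n_tables, n_samples)
    return preds.min(axis=0), preds.max(axis=0)

# Illustrative nominal inputs in mapped coordinates (not rows of Table 1).
X_nominal = np.array([[1.0, 1.0, 1.0], [1.5, 1.5, 1.5], [2.0, 2.0, 2.0]])
lower, upper = prediction_bounds(X_nominal)
```

The gap between `upper` and `lower` for each sample plays the role of the tolerance margin that Figs. 13 and 14 compare before and after the UKF update.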

FIGURE 13.

Upper and lower bounds of the values predicted by the deterministic model on 1000 random input-output tables.

Fig. 14 shows the upper and lower bounds of the values predicted by the robust model. Employing UKF is seen to be effective in reducing the tolerance margins. The results of the intelligent algorithm helped the carbon fiber production line cope with a reduced number of experiments, cost constraints, and limited data. The same procedure can be used on big data to find a robust model relating the inputs and outputs of any nonlinear industrial process. Also, DL methods such as CNNs, DBNs, and RNNs, which were mentioned in Section II, can be used for very large (terabyte-scale) data sets. One of the most immediate and impactful outcomes of technological evolution is the vast advancement in automation through data modeling. As intelligent algorithms for robust data modeling continue to advance, so will automation. Although our model aimed to increase the estimation accuracy, the computational effort is also reduced, by 2 ms in offline modeling and 3 ms in online modeling.

FIGURE 14.

Upper and lower bounds of the values predicted by the robust model on 1000 random input-output tables.

SECTION V.

Conclusion

In this paper, we reviewed some existing machine learning techniques that can deal with issues commonly observed in industrial applications and arising from limited-data or big-data constraints. These techniques can be used to efficiently address various challenges, such as increased complexity, uncertainty, and dynamism, related to data processing and analytics of manufacturing processes. Machine learning is a powerful tool for many industrial applications, and its importance is further enhanced by the increased use of data collection and sensor technologies as part of Industry 4.0 implementation, which generates valuable data sources. We proposed a new intelligent algorithm that can model limited industrial data and applied it to a real industrial case study with data collected from a carbon fiber production line. Pareto multi-objective optimization with two objective functions was employed to design the topology of the GMDH and overcome overfitting. Using the proposed algorithm, a deterministic model was obtained that fit the actual data well (correlation coefficient of 92.36%). We further used the UKF approach to improve the robustness of the model against uncertainties. The intelligent algorithm helped the carbon fiber production line cope with a reduced number of experiments, cost constraints, and limited data. The proposed algorithm has the potential to be used in any other industrial setting in which the aim is to obtain a reliable and robust model relating inputs and outputs.
