Introduction
In the era of big data, data analysis has become one of the fundamental tools for extracting information. Owing to the abundance of available data, data analysis and machine learning have become critical components of data-driven decision-making [1]. Data is stored in both structured (e.g., spreadsheets, databases) and unstructured (e.g., text, email, social media, images, videos) forms. Structured data is stored as tabular data, with each column containing a distinct feature and each row containing a distinct instance. Traditional classifiers such as Support Vector Machines (SVM) [2], Logistic Regression (LogReg) [3] and tree-based algorithms [4] are usually preferred for tabular data analysis. These models provide satisfactory results with limited amounts of data and generally outperform deep learning (DL) models on tabular datasets [5].
However, traditional classifiers tend to perform poorly on large datasets [6], while DL models comparatively perform better on large datasets due to their ability to learn complex patterns in the data [7]. In most cases, Convolutional Neural Networks (CNNs) [8] show outstanding performance in classifying images [9], and they have become highly effective in analyzing unstructured data [10]. CNNs stack multiple convolutional and pooling layers on top of each other. The use of convolutional kernels and the stacking of multiple layers lead to a hierarchical feature abstraction of the input. The initial layers find low-level features such as edges and corners within an input image. This information is then passed to the deeper layers, which detect high-level features by combining the low-level ones. These attributes make CNNs highly effective in analyzing data that have spatial relationships amongst their features. However, in its raw form, tabular data does not contain spatial relationships amongst its features, making CNNs unsuitable for tabular data; consequently, the full strength of CNNs' learning capacity has not yet been utilized on tabular data. Furthermore, CNNs perform much better with big data than traditional classifiers, and the rise in data availability further emphasises the case for using CNNs on large tabular datasets.
Tabular data is the most popular form of available data [11]. Unfortunately, CNNs cannot be directly applied to tabular data for the reasons mentioned above. This has motivated researchers to convert tabular data into images using embedding techniques. While converting data into images, each feature is assigned a pixel position within the image, and an image dataset is then created from the tabular dataset. This enables CNNs to use the relationships within the features and learn accordingly. To the best of our knowledge, three recent studies have introduced embedding techniques that make CNNs suitable for tabular data analysis [11], [12], [13]. In [12] the authors introduce DeepInsight, a technique where non-image datasets are converted into image datasets and forwarded to DL models. A similar approach is introduced in the Image Generator for Tabular Data (IGTD) [13]: features are assigned to pixel positions, each instance is converted into an image, similar features are placed next to each other in the image, and the values of the features are mapped to pixel intensities. However, IGTD is specifically designed for gene expression prediction and not as a generic solution for tabular datasets. This issue is tackled in [11], where the authors introduce two techniques, Equal Font-SuperTML (EFTML) and Variable Font-SuperTML (VFTML). VFTML gives greater image space to the more relevant features, while EFTML provides equal space for each feature. Although, in theory, VFTML should produce better results than EFTML, it fails to do so in practice. Further analysis shows that SuperTML possesses a few shortcomings: the method is not the most space-efficient when converting data points to images, and no parameters are used to assign the canvas space for each feature in VFTML. These shortcomings also apply to the IGTD and DeepInsight techniques.
Feature analysis is critical for tabular data analysis, yet none of the studies [11], [12], [13] use statistical tools for feature analysis in their experiments. IGTD groups similar features adjacently, but does not consider the associativity of the features with the class. SuperTML applies variable font sizes to the features but does not consider the statistical relations amongst them. Furthermore, none of the previous studies focus on exploratory data analysis and its impact on the performance of CNNs. The methods mentioned above also require manual assignment of pixel positions for each feature, which is a tedious, time-consuming and error-prone process. Nevertheless, the results from [11] indicate that CNNs are much more effective than traditional classifiers for tabular tasks and can largely produce better results than them. The manual dataset creation and the feature-associativity issue are the current shortcomings of SuperTML; their details are explained in Section II. This study aims to develop a usable yet robust method for tabular dataset creation.
In this paper, the Dynamic Weighted Tabular Method (DWTM) is proposed for applying Convolutional Neural Networks (CNNs) to tabular datasets. A high-level, pictorial description of the DWTM methodology is shown in Figure 1. DWTM is the first method of its kind that uses feature weights to create images for applying CNNs. After the feature weights are computed, each instance in the dataset is converted into an image, with space assigned to each feature based on its weight, and the features are inserted into the image canvas accordingly. The primary emphasis of this study is to create image datasets from tabular data using an automated procedure that enables CNN models for tabular data analysis while prioritizing the essential features. Additionally, the designed system should be robust enough to deal with datasets of all types and sizes. The proposed method uses statistical techniques (i.e., Pearson Correlation and Chi-Square) to compute the weight of each feature. The features are then arranged in descending order of their calculated weights, and each feature is assigned a portion of the image canvas based on the ratio of its weight to the total. The insertion algorithm developed for DWTM follows a best-fit approach to ensure maximum utilization of the image canvas space. All data points are converted into images and fed into CNNs (i.e., ResNet-18 [14], DenseNet [15] and Inception [16]) for analysis. To the best of our knowledge, no previous study has proposed this approach. In this study, DWTM is applied to six benchmark datasets and the results are compared with both previous studies and traditional classifiers. DWTM demonstrates better results on all datasets. In short, the study makes the following contributions:
We build a novel automated embedding technique, DWTM, for applying CNN models to tabular datasets.
We develop a tool that classifies tabular datasets using CNN models, based on dynamically computed feature importance for both categorical and continuous feature sets.
We apply DWTM to six benchmark tabular classification datasets and demonstrate significant improvement over popular methods (i.e., SuperTML, IGTD) and the traditional classifiers.
The rest of the paper is organized as follows: Section II briefly describes previous work on similar topics, Section III presents the materials and methodology used for this technique, Section IV reports the experiments and results obtained using this method, Section V contains the discussion, and finally, the study is concluded in Section VI.
Literature Review
In recent times, DL has become a fundamental tool for machine learning applications [17], [18]. DL is applied in a wide range of domains such as computer vision (CV) [19], natural language processing (NLP) [20], [21] and speech recognition (SR) [22]. In this section, the main ideas from previous studies related to Convolutional Neural Network (CNN) methods for tabular data tasks are discussed. CNNs are popular due to their unparalleled success in classifying images. Architectures like AlexNet [23], VGG [24] and deep Residual Networks [14] achieve state-of-the-art performance on the ImageNet dataset. Additionally, due to CNNs' success in extracting features from given vectors [25], CNNs are now an ideal choice for NLP and image classification tasks.
Despite the success of CNNs in other fields, most previous attempts to use CNNs for tabular data analysis found that traditional ML models work far better than CNN models [5]. Hence, CNNs for tabular data remained unexplored for an extended period. However, significant progress has been made in this area in recent times. Xu et al. [26] introduce the Conditional GAN (CTGAN), which uses mode-specific normalization and a conditional generator. Mode-specific normalization deals with multimodal and non-Gaussian distributions, while the conditional generator deals with imbalanced columns. The authors find that their model learns better distributions than Bayesian network-based models. Butorovic et al. [27] propose a novel method for using CNNs in tabular data analysis known as Tabular Convolution (TAC). Feature vectors are transformed into kernels using the kernel method and convolved with a base image. The authors use ResNet with TAC for classifying gene expression and find that the results using TAC are similar to the best results that ML classifiers produce.
In another study, the DeepInsight method is proposed [12]. The method converts non-image data to images and uses them as input for CNNs, allowing CNNs to work on different types of data, including tabular data. However, this method does not work well if the dataset is small, as it creates only a limited number of images for input. In [13] the authors propose the Image Generator for Tabular Data (IGTD). The method uses an embedding technique to convert tabular data into images by assigning features to pixel positions based on the similarity of the features. CNNs applied to the converted image datasets outperform DeepInsight [12] and traditional classifiers in predicting cancer cell lines and molecular descriptors of drugs.
Sun et al. [28] propose the Super Characters method for sentiment classification, whereby texts are converted into images using two-dimensional embeddings. This removes the need for a separate word-embedding step, as the images are used directly as input to CNNs. Further investigation shows that the results obtained using this method consistently outperform other methods for sentiment analysis. Building on this work, the authors introduce the SuperTML method [11], which applies the same two-dimensional embedding idea to tabular data. For each instance, a separate image is created, and a different textbox is allocated in the image for each feature, into which the corresponding value is inserted. The authors propose two variations of SuperTML, the EF and VF variations: the features are given equal importance in the EF variation, while the most important feature receives the largest font size in the VF variation. Results show that SuperTML produces state-of-the-art performance on benchmark tabular datasets.
Despite the recent success of DL methods on tabular data, none of them properly emphasizes feature importance, which is a critical element in tabular data analysis. Previous studies predominantly use static methods, which may not work well for all tabular datasets. The method proposed in this study dynamically allocates image canvas space to the features based on their strength of association with the class label. We present a comparison in Table 1, which lists different tabular CNN-based models for classification and regression problems and their corresponding performance.
To the best of our knowledge, DeepInsight [12] was the first method to apply CNNs to tabular data by converting it to images. The DeepInsight method successfully classified RNA sequences with 99% accuracy. However, the robustness of this method was not tested in the study: it was successful on a large dataset but has not been applied to smaller datasets to date, and its success has only been shown on one type of tabular data. Another limitation of the DeepInsight method is that it produces large images; hence, the computational cost and learning time for the CNN architectures are quite high [13]. IGTD [13] tackles the issue of large images and creates much more compact images from the same tabular data. The method is more flexible and uses the added benefit of clustering similar features together to boost the learning capabilities of CNNs. However, a problem similar to DeepInsight's remains, as IGTD is yet to be tested on multiple types of tabular datasets, and the achieved performance leaves room for improvement.
SuperTML [11] proves to be much more effective than the IGTD and DeepInsight methods. Firstly, the method is applied to four different types of tabular datasets, covering large, small and multi-class cases. SuperTML provides over 90% accuracy on the smaller datasets (i.e., Iris and Wine) but fails to match this performance on the larger Adult dataset. The VF variation uses the largest font size for the most important feature and reduces the font size for the less important features, while the EF variation does not take font size into consideration. The VF variation should therefore work better, but the paper reports that the EF variation performs much better. Further analysis of the method shows that VF-SuperTML has some technical flaws in its feature importance handling, leading to poorer performance. Figure 2 is a VF-SuperTML-generated image of the Adult dataset. The reduced font sizes mean that some of the features are distributed chaotically within the image, and they may be too small relative to their importance or relevance scores. The issue arises from the fact that VF-SuperTML does not use any mathematical computation of feature importance to determine the corresponding text size; instead, variable font sizes are assigned based on a priority list of the features. Furthermore, this affects the image negatively, as more blank space is left within the image, from which the CNN learns nothing. EF-SuperTML works better in this case because, when equal fonts are used, the reduced font sizes of the lower-priority features are increased, leading to much better use of the image canvas space. In contrast, a sample image produced by the DWTM method of this paper is shown in Figure 4 to illustrate how the method uses the image canvas space more efficiently to create better images for tabular datasets. This directly affects the results achieved by the CNNs, as shown in Table 8: DWTM outperforms SuperTML on all three common datasets.
Methodology
This section presents the methodology of the Dynamic Weighted Tabular Method (DWTM) in detail, along with the datasets used to evaluate it. Tabular datasets contain multiple features, some of which are more strongly associated with the class than others. In the experiments for this study, statistical techniques (i.e., the Pearson Correlation and the Chi-Square test) are used to compute the weight of each feature. The features are then arranged in descending order of their calculated weights, and each feature is assigned space in the image canvas based on its corresponding weight. The method requires four inputs: the length and height of the image, the r-score of each feature, and the maximum number of characters required for each feature. In our embedding technique, we embed the features into the canvas based on their strength (r-score). Many feature values have long floating-point representations; for the sake of space, we keep only a portion of the characters after the decimal point when embedding a value into the canvas and trim the rest. The maximum number of characters refers to the longest such character sequence required for each feature in the tabular dataset. It is an important parameter when converting to images, as the number of characters determines the space allocation of the features by setting their length and height accordingly. DWTM calculates the length, height and area required for each feature using the ratio of its weight to the sum of the weights of all features and distributes the image canvas space accordingly.
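To make the character-budget step concrete, the following minimal sketch (written against pandas, with hypothetical helper names; it is not the authors' implementation) derives the maximum character count per feature and trims a value to that budget before it is drawn on the canvas.

# Illustrative sketch of the character-budget step (helper names are hypothetical).
import pandas as pd

def max_chars_per_feature(df: pd.DataFrame) -> dict:
    """Maximum number of characters needed to print any value of each feature."""
    return {col: int(df[col].astype(str).str.len().max()) for col in df.columns}

def trim_value(value, max_chars: int) -> str:
    """Render a feature value as text, keeping at most `max_chars` characters."""
    return str(value)[:max_chars]

# Example: a long floating-point value trimmed to a 5-character budget.
print(trim_value(0.73412598201, 5))  # -> "0.734"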
The overview of the methodology is shown earlier in Figure 1. Algorithm 1 represents the algorithm used for this method. The repository provided with this paper is autonomous and robust and requires only a single input of the tabular dataset. When a new dataset is loaded, the method identifies the present feature set and computes the feature importance from the latest dataset for generating the image dataset.
Algorithm 1 The DWTM Method
Input: tabular dataset; image canvas length l and height h; r-score of each feature; maximum number of characters F_char of each feature
Output:
  s contains the starting point for each feature
  f contains the font size for each feature
Initialize:
  Feature Attributes: weight ratio of each feature (Eq. 1)
  Box Attributes: length, height and area of each feature box
Procedure:
for each feature: compute its box attributes (Algorithm 2)
end for
while (uninserted feature boxes remain)
  Feature-Insertion-in-Image-Canvas (Algorithm 3)
end while
A. Structured Data to Image Algorithm
B. Weight Computation of the Features
The Pearson Correlation coefficient is used to calculate the weight of each feature by determining the associativity between the feature and the class. The Pearson Correlation technique is widely used in research for finding the associativity between attributes or variables [29]. The Pearson r-score ranges from −1 to +1 and corresponds to the strength of associativity of a feature with the class label; a negative value indicates negative associativity. The r-scores of the selected features are calculated, and the weight of each feature is then computed from its r-score as the ratio of its score to the sum of all scores. However, the Pearson Correlation is not applicable to categorical data [30]. Hence, the Chi-Square test is used when there are categorical features in the dataset, and Cramer's V is then calculated to find the strength of associativity between the two variables [31].
\begin{equation*} F_{ratio} = \frac{r_{i}}{\sum_{j=1}^{n} r_{j}} \tag{1}\end{equation*}
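A minimal sketch of this weight computation is shown below. It assumes a pandas DataFrame with a numerically encoded class column and uses SciPy for the Pearson and chi-square statistics; the helper names are illustrative, not the authors' implementation.

# Sketch of Eq. (1): |r| for numeric features, Cramer's V for categorical ones.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, pearsonr

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Strength of association between two categorical variables."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.values.sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else 0.0

def feature_weights(df: pd.DataFrame, target: str) -> pd.Series:
    """Per-feature weight ratio F_ratio = r_i / sum(r), sorted in descending order."""
    y = df[target]
    scores = {}
    for col in df.columns.drop(target):
        if pd.api.types.is_numeric_dtype(df[col]):
            scores[col] = abs(pearsonr(df[col], y)[0])   # assumes y is numerically encoded
        else:
            scores[col] = cramers_v(df[col], y)          # chi-square-based strength
    r = pd.Series(scores).sort_values(ascending=False)
    return r / r.sum()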
C. Canvas Size Allocation
In this subsection, the area required for each feature in the image canvas is calculated from the weight ratio $F_{ratio}$ obtained in Equation (1), the canvas length $l$ and height $h$, and the maximum number of characters $F_{char}$ of the feature:
\begin{align*} F_{Area} &= F_{ratio} \times l \times h \tag{2}\\ F_{Area} &= F_{Height} \times F_{Length} \tag{3}\\ F_{Height} &= \sqrt{\frac{F_{Area}}{F_{char}}} \tag{4}\\ F_{Length} &= F_{Height} \times F_{char} \tag{5}\end{align*}
Algorithm 2 Feature Box Computation
Input: weight ratio F_ratio and maximum character count F_char of each feature; canvas length l and height h
Output: a box b (length b.l, height b.h, area b.a) for each feature
Procedure:
for each feature i
  b.a = F_ratio_i * l * h
  b.h = sqrt(b.a / F_char_i)
  b.l = b.h * F_char_i
end for
Insert the computed boxes into the list SB in descending order of feature weight
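The box computation of Equations (2)-(5) can be sketched as follows (illustrative names; a box whose length exceeds the canvas is later handled by the trim step of Algorithm 4):

# Sketch of Algorithm 2: box area, height and length from the feature's weight ratio.
import math

def feature_box(f_ratio: float, canvas_l: int, canvas_h: int, f_char: int) -> dict:
    area = f_ratio * canvas_l * canvas_h      # Eq. (2)
    height = math.sqrt(area / f_char)         # Eq. (4)
    length = height * f_char                  # Eq. (5); note length * height == area (Eq. 3)
    return {"area": area, "height": int(height), "length": int(length)}

# Example: a feature holding 10% of the total weight, printed with at most
# 6 characters on a 128 x 128 canvas: height is roughly 16 px, length roughly 99 px.
print(feature_box(0.10, 128, 128, 6))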
D. Optimized Canvas Area Division
After computing all the dimensions for each feature, it is possible to insert the features into each image. However, the points at which these features should be inserted into the image canvas for an optimal layout are still unknown. Hence, a best-fit solution to the problem is designed. Each feature is treated as a box of length F_Length and height F_Height, as computed in Algorithm 2.
Algorithm 3 Feature Insertion in Image Canvas
Input: canvas dimensions m × n; sorted list of feature boxes SB
Output: starting point s and font size f of each inserted feature
Procedure:
I = [m × n]  (canvas occupancy matrix, initially empty)
for each box SB[k] in SB
  if no free region of size SB[k].l × SB[k].h exists in I then
    Canvas not available
    Continue
  end if
  if a free region is found then
    Canvas is available
    Insert feature into image canvas, store its starting point and mark the region in I as filled
    pop SB[k] from SB
  end if
end for
if SB != empty then Feature-Trim (Algorithm 4) and repeat the insertion
end if
A sample simulation of this procedure is shown in Figure 3. Features are inserted one by one into the image in descending order of their feature weights. For each candidate position, the algorithm checks whether the block of rows and columns matching the length and height of the feature box is empty. If there is not enough space, the features are systematically trimmed until they can be inserted. If the space is empty, the feature is added to the canvas, the space is marked as filled for the corresponding feature and the starting point is stored. The process is repeated until all the features are given space in the image canvas, at which point the dimensions and starting points of all the features are available. The height of each feature is equivalent to its font size in the image.
In some cases, a feature fails to be inserted by the algorithm. This situation occurs when no single location offers enough contiguous space for the feature box, even though the overall blank space in the image would be sufficient. The features that face this difficulty fall in the lower half of the priority list. To overcome this issue, a trim step is used: the font size of each uninserted feature is decreased by one until all the features are inserted into the image. To do this, Algorithm 4 reduces the dimensions of the remaining features, and the feature-insertion procedure is called again. This process continues until each feature is inserted or reaches a font size of 0. As a result, the least important features heuristically receive less space or are removed altogether.
Algorithm 4 Feature Trim
Input: list of uninserted feature boxes SB; decrement n
Output: SB with reduced box dimensions
Procedure:
for each box b in SB
  b.l = b.l - n
  b.a = b.l * b.h
end for
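A simplified sketch of Algorithms 3 and 4 together is given below, assuming NumPy; the occupancy grid, field names and the shrink rule are illustrative simplifications rather than the authors' exact procedure. Boxes are placed in descending weight order, and a box that fits nowhere is shrunk step by step until it fits or vanishes.

# Simplified sketch of best-fit insertion (Algorithm 3) with trimming (Algorithm 4).
import numpy as np

def place_boxes(boxes, canvas_h, canvas_w):
    """boxes: list of dicts with 'height' and 'length', sorted by weight (descending)."""
    grid = np.zeros((canvas_h, canvas_w), dtype=bool)        # False = free pixel
    placements = {}
    for idx, box in enumerate(boxes):
        h, w = box["height"], box["length"]
        while h > 0 and w > 0:
            spot = _find_free_spot(grid, h, w)
            if spot is not None:
                r, c = spot
                grid[r:r + h, c:c + w] = True                # mark the region as filled
                placements[idx] = (r, c, h, w)               # starting point + final size
                break
            h, w = h - 1, max(w - 1, 1)                      # trim step: shrink and retry
    return placements

def _find_free_spot(grid, h, w):
    """Scan the canvas for the first top-left corner where an h x w box fits."""
    H, W = grid.shape
    for r in range(H - h + 1):
        for c in range(W - w + 1):
            if not grid[r:r + h, c:c + w].any():
                return (r, c)
    return None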
E. Image Creation
The algorithm returns each feature's starting point, area, length, and height. Then, using OpenCV [33], each feature is inserted one after another into the image canvas using the information determined earlier, and an image is created from all the features of each data point. The values are rounded to the maximum number of characters allowed for their features and then drawn using a monospace-style font available in OpenCV. Figure 4 presents a sample image produced by this method on one of the classification datasets.
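A sketch of this drawing step using OpenCV's cv2.putText is given below; the placement tuples and the font-scale fitting are illustrative and assume the boxes computed earlier.

# Sketch of the image-creation step with OpenCV's Hershey fonts.
import cv2
import numpy as np

def render_instance(values, boxes, canvas_h=128, canvas_w=128,
                    font=cv2.FONT_HERSHEY_DUPLEX):
    """Draw one data point: values[i] is printed inside boxes[i] = (row, col, h, w)."""
    img = np.zeros((canvas_h, canvas_w, 3), dtype=np.uint8)          # black canvas
    for text, (r, c, h, w) in zip(values, boxes):
        scale = cv2.getFontScaleFromHeight(font, max(h - 2, 1))      # fit text to box height
        cv2.putText(img, str(text), (c, r + h - 1), font, scale,
                    color=(255, 255, 255), thickness=1, lineType=cv2.LINE_AA)
    return img

# Example: three trimmed feature values drawn at precomputed starting points.
sample = render_instance(["5.1", "3.5", "1.4"],
                         [(0, 0, 40, 80), (40, 0, 30, 60), (70, 0, 20, 40)])
cv2.imwrite("sample_instance.png", sample)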
F. The Convolutional Neural Networks Used in the Experiments
The image dataset is created based on the above methodology. The dataset is then split into training and testing sets and forwarded to the DL models for training. In this case, the popular Residual Network-18 [14], Densely Connected Convolutional Network [15] and Inception [16] models are used. The Residual Network (ResNet) uses residual learning to address the degradation problem in deep neural networks. ResNet introduces shortcut connections that enable the training of deeper networks, which helps DL models perform better on classification tasks as more layers are added. The study in [14] introduces networks up to 1000 layers deep. The ResNet-18 model is preferred for this study.
The study in [15] builds on the idea of shortcut connections between layers and proposes the Densely Connected Convolutional Network (DenseNet). All layers with matching feature maps are connected in this network, which significantly boosts the information flow between the layers. Results from the study show that DenseNet performs better than other state-of-the-art architectures in terms of computational efficiency. In [16] the authors propose GoogLeNet, also known as the Inception network. This network applies multiple kernels at the same level, enabling it to tackle overfitting and to reduce the computational expense of deep neural networks, which makes Inception feasible for big data tasks [34]. An updated version of the network, Inception-v3 [35], provides state-of-the-art results for computer vision tasks while being much cheaper computationally. In this study, ReLU activation is used; ReLU is linear for positive inputs and outputs zero for negative inputs. Learning rates between 0.0001 and 0.00005 and 30 epochs are used to train the models. Additionally, Stochastic Gradient Descent (SGD) with exponential learning rate decay, Adam and Adamax are used for optimizing the CNN architectures.
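The paper does not specify the implementation framework; the sketch below assumes PyTorch/torchvision and shows how ImageNet-pretrained ResNet-18 and DenseNet-121 backbones (Inception-v3 is handled analogously) could be adapted to the number of classes of a tabular task. The DenseNet depth and the learning rate are illustrative choices.

# Model setup sketch (framework assumed: PyTorch / torchvision).
import torch
import torch.nn as nn
from torchvision import models

def build_model(arch: str, num_classes: int) -> nn.Module:
    """ImageNet-pretrained backbone with a fresh classification head."""
    if arch == "resnet18":
        model = models.resnet18(pretrained=True)
        model.fc = nn.Linear(model.fc.in_features, num_classes)
    elif arch == "densenet":
        model = models.densenet121(pretrained=True)          # depth is an assumption
        model.classifier = nn.Linear(model.classifier.in_features, num_classes)
    else:
        raise ValueError("Inception-v3 is set up analogously via models.inception_v3")
    return model

model = build_model("densenet", num_classes=2)               # e.g., a binary medical dataset
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)    # learning rate is illustrative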
G. Benchmark Dataset Selection
Several standard image datasets (MNIST, CIFAR and ImageNet) are used for evaluating state-of-the-art CNN networks on image data. However, no such standard dataset exists for evaluating tabular methods [36]. Hence, to evaluate DWTM, a standard set of datasets needs to be selected for tabular data analysis. DWTM is applied to the selected benchmark datasets, and its results are compared with those of the traditional classifiers and other DL-based techniques.
The number of features, number of instances, number of classes and the type of input need to be considered when selecting benchmark tabular datasets. Previous studies demonstrated that traditional classifiers are not well suited to dealing with categorical data [36]. Furthermore, multiclass problems tend to raise the difficulties of classification, especially when the data is limited. All these issues are considered when selecting the datasets in this study to evaluate DWTM. Future researchers are encouraged to use these datasets for evaluating classification methods on tabular data, thus making these datasets the benchmark for evaluation. The details of the selected datasets are summarized in Table 2.
The Iris dataset is the most commonly used dataset in the ML literature and is therefore included in the evaluation. The dataset consists of data from 150 iris plants, with four features recorded for each instance and three classes of iris plants, each having 50 instances. The Wine dataset requires the method to classify wine samples using the available chemical data of wines grown in a region of Italy. Both the Iris and Wine datasets test a method's effectiveness in dealing with multiclass problems while using a limited amount of data. The Iris dataset also tests the method's effectiveness in classification using only four features, and its class label is a categorical variable.
Medical data analysis using machine learning is essential at present, so emphasis is given to medical datasets. In disease diagnosis, it is critical that models are not biased towards one class [37]; the scope of the proposed method for disease diagnosis is therefore also evaluated. The medical datasets selected for this study are the Cleveland dataset, the Early Stage Diabetes Risk Prediction dataset [38] and the Breast Cancer Wisconsin dataset [39]. The Cleveland dataset consists of data from 303 patients related to each patient's risk of heart disease. In this study, the 14 features recommended by previous studies on the UCI website are used for the experiments. The class column, condition, contains values from zero to four, indicating the patient's risk of heart disease; values from one to four are converted to one to identify the presence of heart disease in a patient. The Early Stage Diabetes Risk Prediction dataset [38] was collected at the Sylhet Diabetes Hospital in Bangladesh, where doctors administered a questionnaire to 520 patients and 17 features were recorded from the tests and questionnaires. The patients were then tested for diabetes, and the result of this test represents the class for each patient. The Breast Cancer Wisconsin dataset [39] contains data from 699 patients, collected using fine needle aspirates of breast masses. There are ten features in the dataset, and class values two and four indicate that the tumor is benign or malignant, respectively. The Breast Cancer dataset contains only numerical data, while the Diabetes and Cleveland datasets contain both numerical and categorical variables, thus testing the durability of the method on all types of medical datasets.
To test the robustness of the proposed method, datasets containing fewer than 1000 instances are selected alongside large datasets. The Adult dataset is one of the most popular datasets used in DL surveys on tabular data [36]. The dataset contains information on 48,842 individuals, and the class represents their income. The class column is binary, with zero representing individuals who earn at most $50,000 per year and one representing those who earn more.
Results and Analysis
Numerous ablation studies are conducted to fine-tune the DWTM methodology. The Cleveland dataset is used for the ablation studies as it is imbalanced and quite difficult to classify. Afterward, the models with the best parameters are selected and tested on the benchmark datasets to evaluate the performance of DWTM.
A. Ablation Study
To determine the effect of the parameters on the CNN models, we fine-tune several hyperparameters, such as the optimizer, training type, font type, font size, image size and type of network, to select the best available configuration.
1) Selected CNN Architectures
We apply three different CNN models (i.e., ResNet-18, Inception, DenseNet) for image classification using the DWTM algorithm. The results from the analysis of the CNN models are shown in Table 3. ResNet-18 produces the best results initially; however, when testing the datasets with more random seeds, we find that DenseNet is more robust. ResNet-18 still produces the best results at the start, but through each training phase its decrease in loss is much smaller than that achieved by the DenseNet model. The DenseNet model, on the other hand, performs poorly initially, but its performance improves substantially after each epoch, and it presents consistent results regardless of the dataset. The Inception model shows moderate performance, ranging between the ResNet-18 and DenseNet models, and also produces high accuracy regardless of the dataset. However, we observe that the Inception network tends to suffer from bias issues when random weights are used. In contrast, the ResNet-18 and DenseNet models may require longer training with random weights, but they always manage to produce significant scores. In light of the above, we recommend using DenseNet for consistency and training it for up to 30 epochs.
2) Optimizers and Learning Rate
We test several optimizers to minimize the loss functions of the CNN models. Stochastic Gradient Descent (SGD) with exponential learning rate decay (ELRD) is reported as the best-performing optimizer [40], and the Adam optimizer performed most consistently in previous studies [41]. The results from our experiments are shown in Table 4. For SGD, a separate set of parameters is required for each dataset, which makes its tuning a time-consuming and tedious process. In some scenarios, when varying random seed values, SGD without ELRD failed to improve performance after a certain number of epochs; with ELRD, SGD achieved outstanding performance for some random seeds. However, the Adam and Adamax optimizers perform well regardless of the dataset and random seed when used with the default parameters mentioned in [42]. AdaGrad and RMSProp also perform well in many circumstances, but Adam combines the advantages of AdaGrad and RMSProp while using bias correction to ensure divergence does not occur. On the Cleveland dataset, the Adam optimizer trains the network much faster than the Adamax variant. Thus, the Adam optimizer is preferred for this study. Learning rates of 0.001 for SGD and 0.0005 for Adam are used to find the optimal results on the benchmark datasets.
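For reference, the two optimizer settings compared above can be set up as follows in PyTorch (an assumed framework); the SGD momentum and the exponential decay factor are illustrative, since the paper does not state them.

# Optimizer setup sketch: Adam with library defaults vs. SGD with exponential LR decay.
import torch

def make_optimizers(model: torch.nn.Module):
    adam = torch.optim.Adam(model.parameters(), lr=0.0005)                 # Adam LR from the text
    sgd = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)      # momentum assumed
    elrd = torch.optim.lr_scheduler.ExponentialLR(sgd, gamma=0.95)         # decay factor assumed
    return adam, sgd, elrd   # call elrd.step() once per epoch when training with SGD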
3) Image Size
DWTM creates image datasets automatically, unlike the methods proposed in [11], [12], and [13]. We create image datasets in multiple sizes. Increasing the image size means that pixel positions can be assigned more accurately and rounding errors are minimized. Furthermore, CNNs can learn from the data more efficiently if the images are larger. The caveat is that larger images increase the training time of the CNNs, and the DWTM method also takes much longer to assign pixel positions to each feature, as larger images require iterating over a larger area. Hence, a small image size is used at the start and is gradually increased until the loss function of the CNN models is minimized. The results of our experiments, shown in Table 5, indicate that an image size of 128 by 128 is the best for tabular datasets. Larger images were also tested, but the loss-per-epoch values for image sizes of 128 and 256 are quite similar, while the DWTM method requires almost four times as long to assign pixel positions and create the larger images. With images of size 64 and 96, the CNNs take much longer to minimize the loss values, increasing the overall training time of the method.
4) Font Type and Font Scale
Four font types (i.e., plain, simplex, complex, and duplex) are considered for this study. Table 6 presents the results from analyzing these fonts. The complex and duplex font types achieve better loss and validation accuracy values than the plain and simplex font types. The plain and simplex fonts perform similarly, so only one of their results is shown in the table. The complex and duplex font types provide identical results in the majority of our experiments; hence, both are viable and effective options for applying DWTM. Image sizes of 128 by 128 are used to study the effect of font types. Furthermore, the font scale, which controls the thickness of the font, is also varied. It is observed that the loss value improves with increasing font scale: the increased thickness of the fonts representing the feature values gives the CNNs more pixels from which to capture the feature information.
5) Individual Feature Fonts
The effect of individual feature font sizes is also considered when comparing the performance of DWTM to SuperTML, which also uses the ResNet-18 architecture and was applied to the Iris and Adult datasets. ResNet-18 achieves equivalent performance for both techniques on the Iris dataset. However, there is a huge boost in performance on the Adult dataset, where DWTM-ResNet-18 produces 100% accuracy compared to the 87.60% accuracy of SuperTML. This shows the remarkable effect of DWTM on the performance of CNN models on tabular datasets: the relevance-weighted feature embedding provides the CNNs with more pixels for learning the important features.
6) Pretrained vs Random Weights
The three CNN networks used for this study are trained in two ways. Initially, the networks are trained from scratch with random weights; afterward, they are tested with weights pre-trained on the ImageNet dataset. Previous studies have shown the effectiveness of transfer learning for CNN models [43], and a similar observation is made in the conducted experiments. All three models minimize the loss much more quickly when using the ImageNet pre-trained weights. ResNet-18, in particular, shows a significant improvement in loss per epoch and converges to its best minimum loss value within 30 epochs when using pre-trained weights. Hence, pre-trained weights are preferred for tabular data analysis with the CNNs.
7) Best Parameters
A summary of the findings from the conducted ablation studies and the final parameter values used in the experiments on the benchmark datasets is shown in Table 7.
B. Results
The selected parameters mentioned in subsection IV-A are used to test the performance of DWTM on the selected benchmark tabular datasets in subsection IV-B1. Afterward, DWTM is also applied to Kaggle competitions to see how it matches up against the state-of-the-art methods used for tabular data; the results on these competitions are discussed in subsection IV-B2.
1) Results on the Benchmark Datasets
The results from the experiments are displayed in Table 8. DWTM+R, DWTM+D, and DWTM+I refer to the use of the ResNet-18, DenseNet and InceptionV1 models, respectively. Results from the TML [11] and IGTD [13] methods are also included for the Iris, Wine and Adult datasets. We also apply three traditional classifiers (i.e., Logistic Regression (LogReg), Random Forests (RF) and Support Vector Machine (SVM)) on these datasets and the results are presented here for comparison.
From Table 8, it can be observed that DWTM provides significantly better results on the Cleveland dataset than the traditional classifiers: the DenseNet, Inception and ResNet-18 models all produce 100% accuracy. The results show that DWTM is a viable option for disease diagnosis due to its ability to produce unbiased predictions. The Cleveland dataset has only 303 instances, yet DWTM helps the CNN models deliver better results than the traditional classifiers. The method again proves successful in disease diagnosis on the Diabetes dataset, where the CNN models provide balanced results that the traditional classifiers fail to deliver; this provides further evidence of the method's ability to achieve state-of-the-art results on medical datasets. DWTM shows further robustness on the Breast Cancer dataset, as each CNN model produces close to the maximum possible scores using the proposed method.
The Iris dataset is the most popular dataset from UCI. It is also the smallest dataset used here, containing only 150 instances, four features and three classes of Iris plants. Despite the minimal information available, DWTM achieves maximum performance on the dataset, matching the performance of the traditional classifiers. The Wine dataset contains only 178 instances and, like Iris, three classes. Table 8 shows the results of our experiments on the Wine dataset, where the CNNs consistently produce results that are better than or equivalent to those of the traditional classifiers. The Wine and Iris results demonstrate DWTM's ability to deal with multiclass tabular datasets. The results on the Adult dataset, also shown in Table 8, cover the largest dataset used for evaluation in this study, with over 48,000 instances; here the DWTM models perform far better than the traditional classifiers, showing that DWTM can be applied to datasets of any dimension. A summary of the loss per epoch of the CNNs on the test datasets is shown in Figure 5.
In the majority of cases, our experiments achieve very high test accuracy (greater than 98%), which may appear to indicate overfitting. Overfitting of this kind can occur when the test and training sets share common instances, or when the dataset is imbalanced and the test data comes largely from the majority class. The datasets considered here are popular and well balanced, and cross-validation is applied during testing. Moreover, these well-known datasets have classes whose features can be separated distinctly from one another. Therefore, we achieve near-perfect results for four of the datasets, while two datasets show less than 100% accuracy. To further validate the process, we consider multiple random seeds, and the models achieve these high accuracies on the testing and validation sets in four of the cases, as mentioned in Section IV-A.
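The seed-robustness check described here can be sketched as repeated stratified splits, where evaluate stands in for the full DWTM-plus-CNN pipeline (the function name and the number of seeds/folds are illustrative):

# Sketch of a multi-seed, stratified cross-validation check against overfitting.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def seed_robustness(X, y, evaluate, seeds=(0, 1, 2, 3, 4), n_splits=5):
    """Mean and std of test accuracy over several seeds and stratified folds."""
    scores = []
    for seed in seeds:
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        for train_idx, test_idx in skf.split(X, y):
            scores.append(evaluate(X[train_idx], y[train_idx], X[test_idx], y[test_idx]))
    return float(np.mean(scores)), float(np.std(scores))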
2) Results on Kaggle Competitions
DWTM is applied to live Kaggle competitions to observe how it generalizes to tabular datasets. For this study, the ongoing competition Spaceship Titanic is selected; as of 20-07-2022, there are 19,282 entries from 2,322 teams. DWTM is applied to the training set to obtain the feature weights, the same weights are used on the test dataset, and the data is then converted to images. The ResNet-18, DenseNet and Inception models are applied to the dataset, and the ResNet-18 and DenseNet models achieve 100% accuracy on the validation sets, while the highest score on the leaderboard is 86.509. This shows the immense potential of DWTM to achieve high performance levels on any type of tabular data.
Discussion
In the past, CNN models were not very effective with tabular datasets. Our study shows that, with DWTM, CNN models consistently outperform the traditional classifiers on tabular datasets. Furthermore, it was previously assumed that DL only performs well with large amounts of data [44]; our results contradict this assumption and provide further evidence of the usability of CNN models for small datasets.
The major difference between deep learning and traditional machine learning models lies in the feature selection and classification steps. Deep learning methods such as CNNs have an automated feature extraction process in the convolution steps [45]: the convolution layers apply several filters to extract prominent features, and they find features easily if the objects in the image canvas are readily distinguishable. Furthermore, images from different classes represent different distributions of the objects. Our DWTM embedding positions the features on the canvas in a manner that is easily recognizable by the filters in the convolution layers, and thus helps the later stages of the CNN, such as the flatten layer and the softmax activation, classify the instances easily.
The loss per epoch on the benchmark datasets is shown in Figure 5. All three CNN models eventually demonstrate similar performance on the Wine dataset; however, ResNet and DenseNet converge much faster than Inception. Furthermore, on smaller datasets the CNNs take longer to reach minimal loss values, whereas on the comparatively large Adult dataset they reach optimal loss values rapidly. This suggests that DWTM is better suited to larger tabular datasets.
DWTM is also a viable option for classification tasks on medical datasets. The method deals with class bias much better than the traditional classifiers and outperforms them on the Cleveland, Diabetes and Breast Cancer datasets. Notably, on the Cleveland dataset, the traditional classifiers all produce sensitivity scores below 0.80, whereas the CNN models produce high sensitivity and specificity scores. DWTM also shows that it can handle multiclass tasks with ease, providing outstanding results on both the Iris and Wine datasets that are better than (or similar to) those of SuperTML and the traditional classifiers. The Adult dataset is the only dataset used with a vast number of instances, and DWTM surpasses the performance of both the traditional classifiers and SuperTML by a considerable margin. All these results show the robustness of the method, as it works effectively on small, large and multiclass datasets, and they indicate that DWTM makes CNN models much more beneficial than traditional classifiers for tabular data tasks.
From these experiments, it can also be noted that the Adamax and Adam optimizers produce the best results, followed by Stochastic Gradient Descent (SGD) with exponential learning rate decay. Based on the experiments, our recommendation is to use the Adam optimizer with default parameters initially; if Adam fails to provide decent results, the learning rate and bias-correction values can be varied. Theoretically, however, SGD with momentum reaches the minima most effectively if the proper parameter values are assigned, so if the Adam optimizer fails to provide good results, experiments should be conducted with SGD to find its best parameters. If good results are still not achieved, recent variants of Adam such as NAdam [46] and ND-Adam [47], or combinations of Adam with SGD [48], can be considered. A learning rate of 0.001 alongside the Adam optimizer tends to reach the desired outcome most often, while learning rates closer to 0.0005 work best with SGD.
ResNet-18 is highly efficient with tabular data in our experiments. However, if the models are trained for up to 30 epochs, DenseNet proves to be the best option, and it is therefore recommended to use DenseNet on tabular datasets with DWTM. ResNet-18 is the best option when rapid or real-time classification is required; since large datasets require a significant amount of training time, ResNet-18 is also a more realistic solution for big data and larger datasets, and it has more room for improvement due to the availability of deeper ResNet networks [14]. The Inception network is the third-best network for tabular data, and it is recommended if DenseNet fails to perform.
Although previous studies on SuperTML [11], DeepInsight [12] and IGTD [13] use the idea of converting non-image datasets into image datasets for deep learning applications, none of them utilizes feature weights for classification. IGTD is an updated version of DeepInsight for tabular data; however, it was specifically designed for gene expression data, so the study [13] has limited applicability and IGTD is not a general solution for all tabular datasets. On the other hand, SuperTML was applied to numerous tabular datasets and achieved high accuracy compared to the traditional classifiers; as a result, SuperTML is considered the benchmark for applying CNNs to tabular data. On top of this, the study in [36] states that it is typically challenging to deal with categorical data using DL models. The DWTM embedding technique allows CNNs to take categorical data as input and deal with it efficiently. To the best of our knowledge, this makes DWTM the only technique that can handle categorical data with CNNs on structured datasets.
IGTD and SuperTML provide different ideas for inserting feature values into the images: SuperTML inserts them as text using OpenCV, while IGTD represents the values through pixel intensities. The text-based approach of SuperTML is used for DWTM, since SuperTML is the benchmark technique as discussed above. The pixel-intensity approach of IGTD was also considered; however, that technique has not dealt with categorical data, and at present there is no way of representing categorical data with pixel intensities during embedding without creating artificial numerical relationships between the categories. Hence, the pixel-intensity technique is not used.
The results show that DWTM provides similar or better outcomes than SuperTML. Comparing the performance of the two methods on the Adult dataset shows that the use of weighted font sizes has a significant impact on the performance of the CNNs. Without feature weights, the CNN provides an accuracy of 87.60%, as shown by the results of SuperTML; with DWTM, which uses feature weights based on correlation with the class label and determines the font size mathematically from these weights, each CNN model produces 100% accuracy. DWTM is also more dynamic and robust than SuperTML, using the novel approach of feature weights to create the images, and it performs remarkably well on small datasets compared to the DeepInsight technique. DWTM provides the largest space to the most important features based on the assigned weights; as a result, the CNN models are likely to learn the more complex patterns from the essential features. To the best of our knowledge, no previous study has used feature weights for CNN models. On top of this, DWTM uses an entirely automated procedure, unlike the previous techniques.
This study provides a foundation for feature analysis for DL models. In this study, the Pearson Correlation coefficient is used for assigning weights; future studies can use other statistical techniques and evaluate which options are best for calculating feature weights. Pearson Correlation and Cramer's V are preferred for these experiments as they can quantify the strength of associativity of a feature with the class. Alternatively, future researchers can use Fisher's correlation transformation [49], which works better than plain Pearson Correlation with highly correlated features and is another option for calculating feature weights. Another popular statistical technique is the Analysis of Variance (ANOVA) [50], which works better when the class contains three or more levels; hence, ANOVA is usually preferred over Pearson Correlation for multiclass datasets and regression tasks, and it can also be used with DWTM for assigning weights to the features. With the introduction of DWTM, feature analysis may become essential for CNN applications on tabular data, and combining various feature analysis techniques with DWTM has the potential to produce state-of-the-art results in all kinds of tabular data tasks.
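As an illustration of this suggested extension (not part of DWTM itself), normalized ANOVA F-scores from scikit-learn's f_classif could replace the Pearson/Cramer's V weights for multiclass problems:

# Possible alternative weighting: normalized ANOVA F-scores in place of Eq. (1).
import pandas as pd
from sklearn.feature_selection import f_classif

def anova_weights(X: pd.DataFrame, y) -> pd.Series:
    """Per-feature weight ratio based on ANOVA F-scores (numeric features only)."""
    f_scores, _ = f_classif(X, y)
    w = pd.Series(f_scores, index=X.columns)
    return w / w.sum()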
In the experiments, a maximum of 30 epochs is used. Since the CNN models are pre-trained on the ImageNet dataset, 30 epochs are sufficient in most cases, although increasing the number of epochs may produce even better results. Additionally, models like ResNet-152, Inception-v4 and other CNN architectures show substantial learning capabilities, and using them with DWTM can further increase the performance of CNNs on tabular data. Feature selection is a vital part of tabular data analysis; in the future, DWTM can be upgraded further to produce the best subset of features for CNN applications on tabular datasets.
DWTM proves to be an effective tool for tabular data classification. Nevertheless, CNNs and the method itself are computationally expensive; for simple datasets it is therefore better to use the cost-effective traditional classifiers, and DWTM should only be used when these classifiers fail. DWTM is also limited to classification tasks at present; in the future, a viable regression package is required to make CNNs usable for tabular regression tasks. Furthermore, using pixel intensities instead of rendered values should theoretically give the CNNs more information to complete the prediction tasks effectively, and further analysis needs to be conducted to compare the performance of DWTM using embedded text values and pixel intensities.
Conclusion
In this paper, we have developed a feature embedding technique named DWTM that dynamically assigns feature weights for tabular datasets and uses them with different CNN architectures. We have applied DWTM to six benchmark datasets and compared the results with popular existing methods (i.e., SuperTML, IGTD) and three traditional classifiers. We observe that DWTM usually outperforms (with an average accuracy of 98%) the traditional classifiers and the previously mentioned CNN-based methods on the benchmark datasets. Additionally, the method is robust enough to be used on various types of datasets (e.g., multiclass, large or small). To the best of our knowledge, this is the first feature embedding study that computes the strength of the features of a tabular dataset and allows any CNN architecture to be applied for classification tasks. This study can be considered a new benchmark for embedding techniques on tabular data.
Supplementary Materials
The implementation of DWTM is available at: https://github.com/Ifraham/Dynamic-Weighted-Tabular-Method. The readme file contains information on how to apply DWTM. The package is available in the directory labeled DWTM, and the experiments on the benchmark datasets are available in the directory labeled Experiments.