Introduction
Deep Neural Networks (DNNs) have emerged as highly accurate Machine Learning (ML) models for a wide range of applications, particularly in the domains of computer vision and natural language processing (NLP). In these domains, Convolutional Neural Networks (CNNs) and Transformer-based models have dominated, boasting impressive architectures, such as AlexNet [1], VGG [2], ResNet [3], SqueezeNet [4], MobileNet [5], RoBERTa [6], GPT-3 [7], and LLaMA [8]. These well-established architectures, coupled with publicly available datasets, have significantly accelerated progress in image processing and NLP tasks. Although there are not many publicly available DNN architectures specifically designed for tabular data [9], the practical significance of tabular data is highlighted by the fact that approximately 25% of DNNs deployed in Google's data centers employ Multi-Layer Perceptron (MLP) networks [10], which are likely to process tabular input data.
One challenging characteristic of DNNs is their massive number of trainable parameters, often numbering in the millions. This abundance of parameters requires several orders of magnitude more computational resources than traditional linear ML models [11], thus making DNNs one of the most resource-intensive workloads in cloud data centers [12]. Resource demand intensifies when large online companies, such as Meta, routinely handle tens of trillions of inference requests daily [11], necessitating near real-time inference with minimal computational cost. The resource demand pertains to both DNN inference and training, which involve memory-intensive and computation-intensive matrix operations. As DNNs have continued to grow in complexity, the size of their weight matrices has surpassed the on-chip memory capacity of general-purpose processors. Consequently, off-chip memory access has emerged as a critical bottleneck, dominating the execution time, a problem often referred to as the memory wall in the context of the Von Neumann computer architecture [13]. Off-chip memory access also consumes several orders of magnitude more energy than on-chip memory access [14].
To address these computational challenges, a literature review [15] identified three optimization techniques that are frequently mentioned in the scientific literature, alongside other solution proposals resulting from research. The frequently mentioned techniques are quantization, which reduces the bit-width of DNN weights; teacher-student frameworks [16], [17], wherein smaller student DNNs are trained using logits from larger teacher DNNs; and pruning, which eliminates the least significant elements of DNNs after training. These methods have been incorporated into mainstream deep learning (DL) libraries, such as PyTorch and TensorFlow. This study investigates pruning, that is, whether the original model's predictive performance can be maintained by selectively removing specific neurons, rather than employing bit-width reduction through quantization or the teacher-student paradigm.
Pruning individual connections from a DNN is called unstructured pruning, and it produces sparse weight matrices that are inefficient to process on commodity hardware. In contrast, structured pruning, which we focus on in this study, avoids this pitfall by removing whole CNN filters, entire layers, or neurons in the fully connected (FC) layers. FC layers are the fundamental structural elements of DNNs trained with tabular data. After the pruning process, whether unstructured or structured, the resulting DNN must undergo fine-tuning, that is, retraining for a limited number of epochs, to regain the predictive performance lost in pruning [18], [19], [20].
Our research problem addresses the impact of neuron removal in the context of tabular data, focusing on optimizing resource utilization without compromising the model's predictive performance. Because much of the research has so far concentrated on computer vision and NLP, the effects of pruning in the context of tabular data are underexplored. We present an experimental study on tabular data comparing the effects of post-training neuron pruning and fine-tuning with the alternative approach of removing neurons from the original architecture and training from scratch. Owing to the limited availability of established architectures and datasets for DNNs trained with tabular data for experimental purposes, we opted to investigate winning solutions from Kaggle competitions [21]. We identified and utilized three open-source implementations for which the input data were publicly available for research. We used DNN inference latency as a proxy for DNN resource consumption.
Our findings are:
Removing neurons and training a smaller DNN from scratch leads to higher predictive performance than structured pruning and fine-tuning.
Structured pruning can be used to pinpoint from which layers neurons can be removed from the initial DNN architecture with the smallest drop in the model's predictive performance.
Smaller initial DNN architectures can have on-par or higher predictive performance than the original DNN.
Fine-tuning after structured pruning helps control the predictive performance lost in pruning.
Fine-tuning for only one epoch after pruning recovered most of the lost predictive performance. Further fine-tuning was less beneficial.
A pruned and fine-tuned DNN had a worse predictive performance than the original DNN in all our experiments.
When neurons were removed, the relative reduction in inference latency was smaller than the relative reduction in neuron count.
The remainder of this article is organized as follows. Section II provides an overview of DNN pruning and its applicability to tabular data. Section III describes the research methodology and our strategies for neuron removal in DNNs trained with tabular datasets. The experimental results are presented in Section IV, followed by a discussion in Section V, validity and reliability considerations in Section VI, and conclusions in Section VII.
Background
DNNs can be categorized based on two primary criteria: the structure of the data they process and their architectural design. In the domain of computer vision, DNNs process images represented as three-dimensional tensors, with one dimension corresponding to image width, another to image height, and the third to the red, green, and blue color channels. The architecture of computer vision DNNs predominantly comprises convolutional layers. These layers consist of trainable filters that extract features from the images by sliding over the image and applying the filter multiple times per image. Consequently, convolutions are typically computationally intensive but not as demanding in terms of memory. Because of this characteristic of CNNs, extensive research has been conducted on pruning CNNs that handle image inputs.
The standard architecture for DNNs processing sequential inputs, such as text sequences in NLP, was historically a Recurrent Neural Network (RNN) and its subtype Long Short-Term Memory (LSTM) neural network. Recently, the transformer architecture [22], which utilizes the attention mechanism, has begun to supplant RNNs for handling sequential inputs.
DNNs trained on tabular data lack de facto architectures. The commonality lies in the input data format, which is organized into rows and columns. In certain instances, if the rows can be chronologically ordered, they may be treated as sequences suitable for processing using transformer architectures. A distinctive feature of tabular data is the presence of two primary column types: continuous numeric and categorical. The categorical columns are converted into inputs for the DNN using trainable embeddings. The categorical columns are typically utilized in Deep Learning Recommendation Models (DLRM), and the associated embeddings comprise the majority of the parameters in the DNN [23], [24]. However, not all tabular datasets contain categorical columns, nor can their rows be chronologically ordered. The fundamental building blocks for DNNs trained on tabular data are the FC layers. Our research focuses on these FC layers.
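As a hedged illustration, the following minimal PyTorch sketch shows how such a tabular DNN can combine trainable embeddings for categorical columns with FC layers over the concatenated features; all column counts, cardinalities, and layer widths are hypothetical placeholders rather than values from the studied models.

```python
import torch
import torch.nn as nn

class TabularMLP(nn.Module):
    """Minimal tabular DNN: embeddings for categorical columns + FC layers.

    Column counts, cardinalities, and layer widths are illustrative
    placeholders, not values from any of the studied models.
    """
    def __init__(self, num_continuous=10, cardinalities=(20, 7), emb_dim=8,
                 hidden=256, num_outputs=1):
        super().__init__()
        # One trainable embedding table per categorical column.
        self.embeddings = nn.ModuleList(
            [nn.Embedding(card, emb_dim) for card in cardinalities]
        )
        in_features = num_continuous + emb_dim * len(cardinalities)
        self.fc = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_outputs),
        )

    def forward(self, x_cont, x_cat):
        # x_cont: (batch, num_continuous), x_cat: (batch, num_categorical)
        embedded = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeddings)]
        x = torch.cat([x_cont] + embedded, dim=1)
        return self.fc(x)

model = TabularMLP()
out = model(torch.randn(4, 10),
            torch.stack([torch.randint(0, 20, (4,)),
                         torch.randint(0, 7, (4,))], dim=1))
print(out.shape)  # torch.Size([4, 1])
```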
DNN pruning is a technique employed to reduce the size of a DNN by eliminating elements that have the least impact on the model's predictive performance. The roots of DNN pruning can be traced back three decades [25], [26] when computational resources were significantly scarce.
The landscape of DNN research transformed a decade ago with the introduction of the groundbreaking CNN architecture AlexNet [1], which boasted 60 million trainable parameters. The adoption of Graphics Processing Units (GPUs) for DNN training was a pivotal milestone in the field. However, it soon became apparent that the most accurate CNN architectures were excessively over-parameterized, necessitating network compression through weight pruning [18], [27], also known as unstructured pruning.
Studies have shown that pruning more than 90% of the network's weights can be conducted without significantly reducing the DNN's predictive performance [18], [27]. Nevertheless, pruning individual weights, or connections, yields sparse weight matrices typically presented in Compressed Sparse Row (CSR) format. This sparsity can lead to irregular memory access patterns, resulting in slower computations than matrix operations involving contiguous memory areas unless customized hardware is employed [28], [29].
An alternative to the unstructured weight pruning approach is structured pruning, which targets DNN elements such as layers, CNN filters, or neurons within FC layers. Although the effectiveness of structured pruning CNN filters with a minimal decrease in predictive performance is well-established [29], [30], [31], less empirical evidence exists on the impact of structured pruning neurons in FC layers, which are particularly prevalent in the realm of tabular data.
For structured pruning, specific criteria for element removal must be established. One approach involves eliminating elements whose values fall below a predefined threshold, such as the L1 or L2 norm of all elements [18]. These pruning criteria can be applied globally across the entire DNN or individually to each layer. Principal Component Analysis (PCA) can be applied to individual DNN layers to identify redundant elements effectively [32].
Popular DL libraries, such as PyTorch, offer support for structured pruning [33]. However, it is essential to note that the implementation sets the weights of pruned elements to zero without completely removing them from the DNN [34].
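A minimal sketch with a hypothetical stand-alone FC layer illustrates this behavior: after structured pruning, entire weight rows are zeroed, but the layer's shape and parameter count are unchanged.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(64, 32)  # hypothetical FC layer with 32 output neurons
params_before = sum(p.numel() for p in layer.parameters())

# Zero out the 25% of output neurons (weight-matrix rows) with the smallest L2 norm.
prune.ln_structured(layer, name="weight", amount=0.25, n=2, dim=0)

zeroed_rows = int((layer.weight.abs().sum(dim=1) == 0).sum())
print(layer.weight.shape)   # still torch.Size([32, 64])
print(zeroed_rows)          # 8 rows are now all zeros
print(sum(p.numel() for p in layer.parameters()) == params_before)  # True
```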
DNNs in different application domains have different retraining cycles depending on the stability of the input data. In certain application domains, frequent changes occur in the underlying concepts within the training data, referred to as concept drift [35]. This necessitates recurring retraining of DNNs from scratch to adapt to the shifts in data. However, employing pruning and fine-tuning in each retraining cycle introduces repetitive overhead, potentially inflating the training cost beyond the savings achieved during inference. The following question arises: Is it possible to train a smaller DNN from scratch instead of pruning to achieve predictive performance equivalent to its larger, over-parameterized counterpart? Toward this end, the lottery ticket hypothesis [36] proposes that there exist smaller sub-networks, known as “winning tickets,” capable of achieving or surpassing the original DNN's predictive performance when trained from scratch. Moreover, structured pruning and fine-tuning may yield worse predictive performance than training smaller-than-original models from scratch [37].
Research Questions and Methodology
We followed an experimental research methodology to compare the two approaches.
A. Research Questions
Our research aimed to optimize the resource utilization of DNNs when applied to tabular data by removing neurons from FC layers without compromising model performance. We compare the performance of models resulting from removing neurons and training from scratch to models trained with the original architecture and then structurally pruned and fine-tuned. The research questions were as follows:
RQ1: Does removing neurons and training from scratch yield more accurate models than pruning and fine-tuning?
RQ2: What is the impact of neuron removal on inference resource consumption?
B. Materials: Three Kaggle Competition Winners
Our objective was to experiment with DNN architectures that have proven to work well for a particular problem with a given tabular dataset. In contrast to the situation for CNNs trained with image datasets, there is a scarcity of known effective DNN architectures for tabular datasets. Even for TabNet [9], there are no established benchmark tabular datasets comparable to those that exist for images, such as CIFAR, MNIST, and ImageNet.
To identify successful models for tabular input data, we examined Kaggle, an online platform hosting ML competitions. We discovered that some winning solutions feature both an open-source DNN architecture and publicly available tabular datasets. We analyzed solutions from 304 Kaggle competitions, spanning from 2010 to 2022, listed in [21], to select those meeting the following criteria:
The input data is tabular.
The winning solution includes a DNN.
The code for the DNN is available as open-source.
The DNN is implemented using one of the popular DL libraries, PyTorch or TensorFlow.
The datasets used in the competition are publicly available for research purposes.
We found three winning solutions that met these criteria and decided to experiment with them:
Experiment 1 (E1): Mechanisms of Action (MoA) Prediction is a multi-label classification problem for identifying the most effective drugs for diseases [38].
Experiment 2 (E2): Google Brain Ventilator Pressure Prediction is a regression problem for predicting the optimal input pressure for a ventilator machine connected to the lungs of a patient who cannot breathe unaided [39].
Experiment 3 (E3): Predicting Molecular Properties is a regression problem for predicting the coupling between atom pairs in molecules [40].
C. Experimentation Settings
1) Tooling
All three selected Kaggle competition winners use PyTorch. For structured pruning, we used the torch.nn.utils.prune.ln_structured function implemented in PyTorch [33]. The structured pruning implementation in PyTorch sets all weights of a pruned neuron to zero but does not remove the neuron from the DNN. Therefore, we implemented an additional library to remove zeroed neurons from the DNN, available in [41].
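The following simplified sketch conveys the idea of the removal step for a pair of consecutive FC layers sized as in E1's third-stage DNN; it is an illustration only and not the implementation in [41], which handles complete model structures.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def shrink_fc_pair(fc1: nn.Linear, fc2: nn.Linear):
    """Rebuild two consecutive Linear layers with the zeroed neurons of fc1
    removed; fc2 loses the corresponding incoming weight columns."""
    keep = fc1.weight.abs().sum(dim=1) != 0            # surviving output neurons
    new_fc1 = nn.Linear(fc1.in_features, int(keep.sum()))
    new_fc2 = nn.Linear(int(keep.sum()), fc2.out_features)
    with torch.no_grad():
        new_fc1.weight.copy_(fc1.weight[keep])
        new_fc1.bias.copy_(fc1.bias[keep])
        new_fc2.weight.copy_(fc2.weight[:, keep])
        new_fc2.bias.copy_(fc2.bias)
    return new_fc1, new_fc2

fc1, fc2 = nn.Linear(206, 1024), nn.Linear(1024, 1024)
prune.ln_structured(fc1, name="weight", amount=0.75, n=2, dim=0)
prune.remove(fc1, "weight")            # make the zeroed weights permanent
small_fc1, small_fc2 = shrink_fc_pair(fc1, fc2)
print(small_fc1.out_features, small_fc2.in_features)  # 256 256
```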
We found that two of the three DNNs were too heavy to train on an ordinary laptop, so we used the Puhti high-performance computing (HPC) environment [42] to train them. Structured pruning itself is a lightweight operation that can be executed on any hardware.
2) Approach 1 (A1): Neuron Removal and Training From Scratch
We removed neurons from the original DNN without pruning by decreasing the number of neurons in the selected FC layers. We then allowed PyTorch to initialize the weights randomly using its default initialization strategy and trained the DNNs from scratch.
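A minimal sketch of this approach, using an E1-like stack of FC layers with assumed ReLU activations and with normalization and dropout layers omitted, is shown below; PyTorch's default initialization is applied when the module is constructed.

```python
import torch.nn as nn

def build_model(first_layer_width: int = 1024) -> nn.Sequential:
    """Illustrative E1-like third-stage MLP (206 inputs, two FC layers,
    206 outputs); activations and omitted details are assumptions."""
    return nn.Sequential(
        nn.Linear(206, first_layer_width),
        nn.ReLU(),
        nn.Linear(first_layer_width, 1024),
        nn.ReLU(),
        nn.Linear(1024, 206),
    )

# Approach A1: remove 60% of the neurons from the first FC layer up front
# and train this smaller model from scratch with default random initialization.
smaller_model = build_model(first_layer_width=int(1024 * 0.4))
```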
3) Approach 2 (A2): Structured Pruning and Fine-Tuning
We used L2 norm pruning in all the pruning experiments mentioned in this article. In most cases, L2 norm pruning has been shown to perform better than L1 norm pruning [18]. After removing the pruned neurons from the DNN, we fine-tuned the DNN weights by further training with the original training data for 1-3 epochs.
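The fine-tuning step can be sketched as follows; the pruned model is assumed to already have its zeroed neurons removed, and the optimizer, learning rate, and loader names are illustrative rather than the competitions' exact settings.

```python
import torch

def fine_tune(pruned_model, train_loader, loss_fn, epochs=1, lr=1e-3):
    """Fine-tune a pruned DNN on the original training data for a few epochs."""
    optimizer = torch.optim.Adam(pruned_model.parameters(), lr=lr)
    pruned_model.train()
    for _ in range(epochs):                      # we fine-tuned for 1-3 epochs
        for inputs, targets in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(pruned_model(inputs), targets)
            loss.backward()
            optimizer.step()
    return pruned_model
```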
4) Layer Selection
We removed neurons only from the FC layers in both approaches, A1 and A2, as described below. All other types of layers, e.g., convolutional layers and attention heads of transformers, were scoped out. In each experiment, we chose the specific FC layers for neuron removal based on the structure of the DNN.
E1: We selected the first of the three FC layers (see Fig. 2). Removing neurons from one FC layer affects the number of weights in the next layer, as shown in Fig. 1. In practice, the percentage of parameters removed was greater than the percentage of neurons removed. Table 1 illustrates this using the DNN architecture from E1.
E2: The DNN comprises two separate blocks of two FC layers. We selected the first layer from both blocks (see Fig. 5(a) and (b)). Removing neurons from the second layers would have broken the DNN's internal structure, and it would have also been a less efficient approach for reducing the size of the DNN.
E3: We selected the first FC layers (see Fig. 8) in the feed-forward part of the transformer block [22]. These layers have the largest number of parameters in the DNN. In this case, removing neurons from the other FC layers would have broken the DNN's internal structure.
5) Data Collection and Measurements
We used inference latencies as a proxy to estimate the impact of neuron removal on inference resource consumption in all three experiments. We measured the latencies using DNNs with neurons removed and trained from scratch, although how the neurons were removed is irrelevant to the latency measurement itself. The latencies were measured by running inference on the validation dataset of each experiment with gradient computation disabled. We used CPUs for the inference measurements when possible, although GPUs were used for training. Because there was a relatively high variance in the inference latency measurements, we repeated the measurements ten times for each DNN architecture and present the mean values and result ranges.
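A minimal sketch of the latency measurement is given below; the model and data-loader objects are assumed to be those of the respective experiment.

```python
import time
import torch

@torch.no_grad()                                 # gradient computation disabled
def measure_latency(model, val_loader, repeats=10):
    """Run the validation set through the model and report the mean and the
    range of the wall-clock latency over repeated runs."""
    model.eval()
    timings = []
    for _ in range(repeats):                     # ten repetitions in our experiments
        start = time.perf_counter()
        for inputs, _ in val_loader:
            model(inputs)
        timings.append(time.perf_counter() - start)
    return sum(timings) / repeats, min(timings), max(timings)
```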
Results
In this section, we present and compare the results of the experiment of neuron removal and training from scratch (A1) to pruning and fine-tuning (A2) for the three Kaggle competition winners (E1–E3).
A. E1: Mechanisms of Action (MoA) Prediction
1) DNN Description
The task in this competition was to predict the probabilities for 206 classes. The input data includes 872 continuous numeric variables and three categorical variables. The winning solution is an ensemble of several DNNs [38]. The most accurate model in the ensemble, which we used in our experiments, was a 3-stage DNN consisting of three sequentially trained DNNs. The first stage DNN consumes all the 875 input variables and produces 206 output values. The second and third stage DNNs consume 206 inputs and produce 206 outputs.
The neurons were removed from the third stage DNN. The third stage DNN has 1,478,456 parameters, two FC layers with 1,024 neurons, and an output layer with 206 neurons. More than 99% of the parameters are in the FC layers. The model was trained for 25 epochs. Model performance was measured using the binary cross-entropy loss on the validation dataset. We achieved the best combination of model size and validation loss by removing neurons from the first layer, as shown in Fig. 2. Removing even a small percentage of neurons from the second layer increased the validation loss, and removing neurons from the last layer would have changed the shape of the output.
2) Results for A1
The results are shown in Fig. 3(a). We removed neurons from the first layer gradually, from 5% to 95% of the neurons in incremental steps of 5%, and observed the impact on the validation loss. The weights were initialized randomly with different values each time the DNN was trained from scratch. We trained each DNN ten times and present the mean values and range of results in Fig. 3(a). The figure shows that the validation loss decreased steadily with fewer neurons in the first FC layer, up to 75% removal. It seems that the third stage DNN has too much capacity and that the highest predictive power comes from the first and second stage DNNs. Inference for the third stage DNN can be executed with only 35% of the original total parameters while obtaining equal or higher predictive performance than the original model.
3) Results for A2
The pruning results are shown in Fig. 3(b). Pruning neurons from the first FC layer had a negligible impact on the validation loss for up to 40% of neurons pruned. Higher pruning percentages steadily increased the validation loss.
We fine-tuned the pruned models for 1-3 epochs to determine whether the increase in validation loss could be controlled by fine-tuning. Surprisingly, in this experiment, fine-tuning was even harmful in some cases. Fine-tuning increased the validation loss for models that had less than 60% of neurons pruned from the first FC layer. For higher pruning percentages, fine-tuning recovered some but not all of the validation loss. Fine-tuning for more than one epoch was beneficial only when more than 70% of the neurons from the first FC layer were pruned.
4) Inference Latencies
The inference latencies using the CPUs are presented in Fig. 4. The latency decreased steadily when more neurons were removed, but the relative decrease was not as high as that of the neuron removal percentage. When removing neurons and training from scratch, the lowest validation loss on average was reached when 60% of the neurons were removed from the first FC layer. This corresponds to 51% fewer total parameters than in the original model. With this smaller model, the inference latency was 18% lower than that of the original model.
B. E2: Google Brain Ventilator Pressure Prediction
1) DNN Description
In the winning solution [39] of the Kaggle competition, the first three blocks of the DNN are positional encoders implemented with LSTMs, which capture the time-series nature of the input data. The LSTMs are followed by three transformers [22], which are in turn followed by convolutional layers. Each LSTM and transformer contains two FC layers. The time-series input data had two categorical identifier columns and five continuous numeric columns. This DNN has 9,219,073 parameters, of which 3,939,840 are in FC layers. We removed neurons from the first FC layers following the positional encoder LSTMs (Fig. 5(a)) and the transformer attention heads (Fig. 5(b)). The model was trained for 150 epochs.
We performed preliminary experiments by varying the percentage of pruned neurons from 0 to 30% from the selected FC layers. The validation loss increased more when pruning neurons from the FC layer after LSTMs than from the feed-forward part of the transformers. Based on this, we removed or pruned a fixed 5% of neurons from the first FC layers after the LSTMs and varied the neuron pruning percentage from the transformer's feed-forward part from 5% to 30%.
2) Results for A1
The results are shown in Fig. 6(a). We removed neurons from the transformer's first FC layer gradually from 5% to 95% of the neurons in incremental steps of 5% and observed the impact on the validation loss. Considerable fluctuations observed in the validation loss across different training sessions may provide evidence supporting the lottery ticket hypothesis [36]. Sometimes, but not always, the validation loss for a smaller DNN trained from scratch was lower than that for the original DNN, depending on the random weight initialization. To ensure the robustness of the findings, we trained each smaller DNN architecture ten times and present the mean values and result ranges in Fig. 6(a).
3) Results for A2
The pruning results are shown in Fig. 6(b). We pruned neurons from the transformer's feed-forward part in incremental steps of 5%, only up to 30%, as the validation loss increased rapidly and we assumed that further pruning would not produce more accurate models. After pruning, we fine-tuned the models for 1-3 epochs. For each fine-tuning epoch, the validation loss remained lower than that of the models pruned without fine-tuning. However, fine-tuning for only one epoch resulted in the lowest validation loss; additional fine-tuning epochs worsened it.
4) Inference Latencies
The inference latencies using the CPUs are presented in Fig. 7. When removing neurons and training from scratch, the lowest validation loss on average was reached when 40% of the neurons were removed from the transformer's FC layers. This corresponds to 8% fewer total parameters than the original model. With this smaller model, the inference latency was 5% lower than that of the original model.
C. E3: Predicting Molecular Properties
1) DNN Description
The winning solution [40] is a DNN with 118,695,244 parameters. It follows the transformer architecture in [22]. This is by far the largest DNN covered in our experiments, with 70,491,750 parameters in FC layers. The remaining parameters are in embeddings, attention heads, convolutions, and normalization layers. Most of the parameters are in 14 identical transformer blocks [22]. The first FC layer, proj1 in the feed-forward part of the transformer, was selected for neuron removal (see Fig. 8). Each of these 14 proj1 layers has 2,473,800 parameters, and in total, they have 34,633,200 parameters, which is approximately 30% of all parameters in the DNN. The model was trained for 250 epochs.
This competition used the logarithm of the Mean Absolute Error (MAE) as the evaluation metric. The validation error values are negative because logarithmic functions yield negative values for inputs between 0 and 1. Nevertheless, smaller values indicate a more accurate model.
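For reference, the metric is commonly written as the log of the MAE, averaged over the T scalar coupling types (with $n_t$ observations of type $t$); this restatement is ours rather than a quotation of the competition rules:

$$
\text{score} = \frac{1}{T}\sum_{t=1}^{T}\log\!\left(\frac{1}{n_t}\sum_{i=1}^{n_t}\left|y_i - \hat{y}_i\right|\right).
$$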
2) Results for A1
The neuron removal and pruning results are shown in Fig. 9. It was not feasible for us to systematically iterate over different neuron removal approaches and train this large DNN for each of them from scratch. It took three days to train the model using four V100 GPUs in parallel in HPC Puhti [42]. Instead, we trained the original DNN once and performed preliminary experiments with pruning strategies that would have the least impact on the validation loss and would not break the DNN structure.
Based on the preliminary pruning results, we removed 10%, 20%, and 30% of neurons from each of the 14 proj1 layers and trained the model from scratch. We repeated the training from scratch twice for each removal percentage. The dots in Fig. 9 mark the validation error averages for training from scratch. In all three cases, the validation error was lower than for the original DNN. This is a similar result to E1 but with a DNN with two orders of magnitude more parameters.
3) Results for A2
With pruning, we were able to drop 6% of neurons from proj1 layers (3.5% of all the model parameters) without increasing the validation error. When pruning more neurons from the proj1 layers, the validation error increased rapidly. After fine-tuning for one epoch, the validation error became closer to the baseline validation error but remained above it. Fine-tuning for two and three epochs increased the validation error.
4) Inference Latencies
The inference latencies are shown in Fig. 10. Owing to the size of the DNN, we also used a GPU for inference. When removing neurons and training from scratch, the lowest validation loss on average was reached when 20% of the neurons were removed from the FC proj1 layers. This corresponds to 12% fewer total parameters than the original model. With this smaller model, the inference latency was 5% lower than that of the original model.
Discussion
This section presents our responses to the research questions, discusses additional observations from the experiments, and suggests future research directions.
A. Answer to RQ1: Does Removing Neurons and Training From Scratch Yield More Accurate Models Than Pruning and Fine-Tuning?
In all three experiments, removing neurons and training from scratch produced more accurate models than pruning and fine-tuning. DNNs with neurons removed achieved even higher predictive performance than the original model until a certain threshold was reached. Pruning resulted in, at best, on-par predictive performance with the original model. In E1 and E3, it was possible to prune 35% and 6% of the neurons, respectively, from the selected FC layers without fine-tuning and without losing the model's accuracy. When pruning and fine-tuning, the drop in predictive performance was more controlled than without fine-tuning, but the predictive performance of the original model was not reached in any of the experiments. In E1, it was surprising that the pruned and fine-tuned models exhibited worse predictive performance than those pruned without fine-tuning, even when up to 60% of neurons were removed. This suggests that, in certain cases, fine-tuning after pruning may be detrimental to the model's performance. The reasons for this phenomenon warrant further investigation in future research. In E2 and E3, fine-tuning for one epoch recovered most of the predictive performance lost by pruning, whereas fine-tuning for more epochs was not helpful. In E1, fine-tuning for two and three epochs resulted in higher model accuracy than fine-tuning for a single epoch, particularly when the pruning percentage exceeded 75% of the neurons in the first FC layer.
B. Answer to RQ2: What is the Impact of Neuron Removal on Inference Resource Consumption?
We used DNN inference latency as a proxy for inference resource consumption. In the measurements, we used models with neurons removed and trained from scratch because they were proven more accurate than the pruned and fine-tuned models in our experiments. In all three experiments, the inference latency was lower for models with removed neurons; however, the relative latency drop was smaller than the relative drop in the number of neurons and parameters. In E1, removing 52% of the total parameters lowered inference latency by 18%. In E2, removing 8% of the total parameters lowered inference latency by 5%. In E3, removing 12% of the total parameters reduced the inference latency by 5%. It should be noted that in all cases above, the model with removed neurons had better predictive performance than the original model.
In addition, note that for the model in E1, the percentage of parameters in the FC layers was significantly higher than in E2 and E3: 99% of the parameters in E1 are in FC layers, in contrast to only 43% and 58% in E2 and E3, respectively. A higher percentage of parameters in the FC layers leads to a better total compression rate via neuron removal. In other words, the higher compression rate and lower inference latencies for E1 were not due to the smaller model but to the higher percentage of parameters in the FC layers.
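A rough relation makes this explicit: if a fraction $f_{\text{FC}}$ of all model parameters is tied to the targeted FC layers (including the incoming weights of their successor layers) and neuron removal eliminates a fraction $r_{\text{FC}}$ of those parameters, the total reduction is approximately

$$
r_{\text{total}} \approx f_{\text{FC}} \cdot r_{\text{FC}}, \qquad r_{\text{total}} \le f_{\text{FC}},
$$

so E1, with $f_{\text{FC}} \approx 0.99$, can reach a far higher total compression than E2 or E3 regardless of how aggressively its FC layers are reduced.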
Despite observing a non-negligible latency drop in all cases, it is essential to recognize that neuron removal, while showing some impact, may not always be the optimal approach for reducing the DNN inference resource consumption. There are alternative methods, such as weight quantization, that have the potential to deliver favorable outcomes. Exploring these alternative strategies may enhance efficiency and performance gains in optimizing DNN inference resource utilization. Finding an optimal combination of different strategies is an interesting research direction.
C. Applicability of Studied Approaches
In all our experiments, Approach 1 of removing neurons and training from scratch consistently outperformed Approach 2 of training, pruning, and fine-tuning. However, structured pruning and fine-tuning can still prove valuable, particularly when the original DNN is sufficiently large to make training from scratch too expensive for multiple experiments. This was evident in our analysis of E3, where it was not feasible to systematically remove neurons and retrain the DNN to compare the results to pruning, apart from a few data points. Similarly, it would be infeasible to systematically search for an ideal size for the same DNN. Thus, overshooting the initial architecture size and then applying structured pruning is a more sensible approach.
In scenarios involving concept drift in the input data [35], a continuous training approach that includes new data is required. In such cases, applying pruning and fine-tuning introduces overhead for each model retraining cycle. To mitigate this non-negligible overhead, a less frequent or even one-shot neuron removal strategy may be more suitable. Our results underscore the potential advantages of neuron removal and training from scratch, particularly in scenarios characterized by evolving data distributions. Future research endeavors should aim to determine the optimal frequency of architecture search under continuous concept drift, as it is possible that an adaptable model architecture, with more neurons instead of fewer, could better accommodate shifting data distributions and prevent gradual model performance decline over time.
The careful selection of the pruned FC layers has a profound effect on the resulting total parameter count of the DNN. When neurons are pruned from an FC layer that feeds into another layer, the subsequent layer also shrinks because fewer weights are required for its incoming connections. In specific cases, such as experiment E1, pruning 75% of the neurons from the first FC layer translated to a 64% reduction in parameters across the entire DNN (see Table 1).
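A worked example with the E1 layer sizes (206 inputs, two FC layers of 1,024 neurons, 206 outputs; biases and normalization parameters ignored for simplicity) makes the effect concrete:

$$
\underbrace{206 \cdot 1024}_{\text{FC}_1} + \underbrace{1024 \cdot 1024}_{\text{FC}_2} + \underbrace{1024 \cdot 206}_{\text{output}} \approx 1.47\,\text{M}
\quad\longrightarrow\quad
206 \cdot 256 + 256 \cdot 1024 + 1024 \cdot 206 \approx 0.53\,\text{M},
$$

a reduction of roughly 64%, because both the weight rows of the pruned layer and the incoming weight columns of the subsequent layer are eliminated.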
It is noteworthy that the size of the selected layer is more relevant to the compression ratio than the position of the layer within the DNN. If the DNN's largest FC layer is not the first layer, then the largest layer remains a suitable target for neuron removal. However, the optimal layer for neuron removal varies on a case-by-case basis. Some FC layers contribute more to the DNN's predictive performance than other FC layers of the same size do. In E1, removing neurons from the first layer decreased the predictive performance to a lesser extent than removing neurons from the second layer, despite both layers having the same number of neurons. In E2, removing neurons from the positional encoder (see Fig. 5(a)) decreased the predictive performance more than removing neurons from the feed-forward part of the transformer (see Fig. 5(b)), although these FC layers also have the same number of neurons.
D. Future Work
Pruning all FC layers uniformly appears to result in a lower predictive performance than the selective pruning of FC layers. This suggests that structured pruning, even without fine-tuning, could be utilized to identify layers suitable for weight quantization. Because pruning without fine-tuning involves significantly lighter computations than training, it can be viewed as a complementary method rather than a competing one for quantization and teacher-student frameworks. We propose selective weight quantization based on pruning as a potential research direction. Automating the layer selection process by comparing the predictive performance of alternative DNN architectures produced by structured pruning can assist DL practitioners in optimizing DNN architectures.
In all experiments, we removed neurons from the original DNN and retrained it without touching any hyper-parameters. For the resulting DNN, an updated set of hyper-parameters might lead to more accurate models or faster convergence during training. Hyperparameter tuning in this context remains a future research topic.
It should be noted that the DNNs in both E2 and E3 contain transformer blocks similar to those in large language models (LLMs). In both cases, it was possible to remove a significant number of neurons from the feed-forward part of the transformer. If this generalizes to the transformer architecture as such, it could offer a generic way to reduce the size and resource consumption of LLMs.
DLRMs are DNNs trained on tabular data containing categorical columns, and most of their parameters are typically associated with trainable embeddings [23], [24]. Pruning FC layers in DLRMs is likely to have only a minor impact on the total number of parameters. Investigating this aspect could be a potential direction for future research.
Validity and Reliability
DNNs sourced from Kaggle competitions exhibited a noticeable trend of over-parameterization. This observation is hardly surprising given that Kaggle competitions prioritize high predictive performance over the efficient utilization of computational resources. It would be intriguing to explore the potential benefits of structured pruning and fine-tuning on a DNN that is deliberately designed to be less over-parameterized. Such an endeavor could lead to models that strike a balance between size and predictive performance, potentially aligning closer to the optimal resource allocation.
In industrial DL applications, CPUs are often preferred for inference because of their lower cost and better availability than GPUs [11]. Therefore, we used CPUs for inference latency benchmarks in experiments E1 and E2. Unfortunately, the model in E3 was too large to run extensive benchmarks using CPUs, so we opted to use GPUs.
It is essential to acknowledge the diversity in tabular data types, making the “one size fits all” DNN architecture concept impractical. Variations in input data characteristics, such as tabular time-series compared to non-sequential data, necessitate tailored approaches. A sample of three proven DNN architectures for three somewhat different problems may not represent a wide variety of possible architectures that work well with tabular data. Interestingly, two DNNs in this study adopted a transformer architecture [22], which has gained prominence in NLP. Notably, we found that significant reductions in the number of neurons in the transformer's feed-forward section did not result in less accurate models. This suggests the possibility of similar outcomes with LLMs used in NLP applications.
The selected Kaggle competition winners do not include any DLRMs, which represent a significant category of DNNs trained on tabular data. Conducting similar experiments using DLRMs is a topic for future research.
Conclusion
In our study, we compared two strategies for optimizing the resource consumption of DNNs trained on tabular data: neuron removal and training from scratch compared to structured pruning of neurons and fine-tuning. We applied these strategies to the three Kaggle competition-winning DNNs.
Structured pruning followed by fine-tuning results in models with lower predictive performance than training smaller DNNs from scratch. This finding suggests that a progressive model development approach, starting with a smaller architecture and increasing complexity, is more effective in achieving optimal performance for both training and inference. However, structured pruning can be used to pinpoint the FC layers that have the least impact on the DNN's predictive performance when neurons are removed, which is how we used it in our experiments.
Fine-tuning after pruning, while capable of recovering some of the model's predictive performance lost in pruning, introduces an additional training overhead. Despite its potential to yield lighter models for inference, this overhead may not be justifiable when the DNN requires continuous training owing to concept drift. Moreover, based on our experiments, fine-tuning for more than one epoch was less beneficial.
We also considered the feasibility of employing structured pruning and fine-tuning for large DNNs with billions of parameters, such as state-of-the-art LLMs. Conducting a systematic DNN architecture search with LLMs becomes increasingly impractical because of computational limitations and resource demands. Instead, pruning and fine-tuning can be used to convert a large DNN to a lighter one for inference. However, the original training data must be available for fine-tuning, and a moderate decrease in the model's predictive performance should be tolerated.
The DNN inference latency served as a representative measure of the DNN's resource usage in our experiments. When neurons were removed, the relative decrease in inference latency was smaller than the relative decrease in neuron count. This suggests the potential for more efficient strategies, such as weight quantization, to minimize resource consumption.
ACKNOWLEDGMENT
The authors would like to acknowledge CSC – IT Center for Science, Finland, for the computational resources.