Introduction
Recently, the adoption of deep learning models across various domains and tasks has increased considerably. In parallel, the number of layers and parameters required to achieve adequate performance on high-complexity tasks continues to grow, and with it the amount of memory required for model training [1]. However, because the memory capacity of hardware accelerators cannot be physically expanded, memory-reduction techniques [2], [3] and distributed training [4] have become popular. Distributed training uses multiple accelerators either to handle models that demand a significant amount of memory for training or to expedite the training process through weight replication. Splitting a model across multiple devices for training is referred to as model parallelism, and the variant that divides the model into layers and assigns the resulting stages to devices is called layer pipelining [5]. Dividing the model along the weight tensors of each layer, rather than by layers, is known as tensor parallelism [6]. Conversely, the distributed training method that accelerates training by replicating the entire set of model weights across multiple devices to increase the minibatch size is called data parallelism [7].
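To make the layer-pipelining idea concrete, the following minimal sketch (in PyTorch; the helper name `split_into_stages` and the naive even split by layer count are our own illustration, not the partitioning scheme of any particular system) shows how a sequential model could be cut into contiguous stages, one per device.

```python
import torch
import torch.nn as nn

def split_into_stages(model: nn.Sequential, num_stages: int):
    """Partition a sequential model into contiguous layer groups (stages).

    Illustrative even split by layer count; real systems balance stages by
    per-layer compute and memory cost.
    """
    layers = list(model.children())
    per_stage = (len(layers) + num_stages - 1) // num_stages
    return [nn.Sequential(*layers[i:i + per_stage])
            for i in range(0, len(layers), per_stage)]

# Example: a toy model split across 4 stages, assuming one GPU per stage.
model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(16)])
stages = split_into_stages(model, num_stages=4)
for rank, stage in enumerate(stages):
    stage.to(f"cuda:{rank}" if torch.cuda.is_available() else "cpu")
```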
The training process in deep learning is roughly divided into two steps: a forward propagation step and a backpropagation step. The former predicts the result from the input data using the model weights and stores intermediate operation values (activations); the latter calculates gradients to adjust the current weights of the model using the prediction obtained in the forward propagation step. Therefore, in layer pipelining, where the model is divided across multiple devices, the output activations and gradients of each stage must be transmitted through network communication. In other words, forward propagation begins with the input data in the first stage, and each stage transmits its output activation over the network to the subsequent stage until the last stage is reached. The last stage computes the gradient from its output and the loss function and initiates backpropagation back toward the first stage in accordance with the chain rule. Finally, when the backpropagation of the first stage is completed, the iteration terminates, and the weights are updated with the obtained gradients and the optimizer for the next iteration.
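As an illustration of this interstage traffic, the sketch below (a simplified, hypothetical stage loop that assumes `torch.distributed` is already initialized with one process per stage and that the activation shape is known in advance) shows an intermediate stage receiving an activation, sending its output forward, and later relaying gradients in the opposite direction.

```python
import torch
import torch.distributed as dist

def run_middle_stage_microbatch(stage, rank, act_shape, device):
    """Process one microbatch on an intermediate stage: receive the input
    activation from the previous stage, run forward and pass the output on,
    then relay gradients in the opposite direction during backpropagation."""
    # Forward propagation: receive the activation, compute, send the output.
    x = torch.empty(act_shape, device=device)
    dist.recv(x, src=rank - 1)
    x.requires_grad_(True)
    y = stage(x)
    dist.send(y.detach().contiguous(), dst=rank + 1)

    # Backpropagation: receive the gradient of the output, backpropagate
    # through this stage, and send the gradient of the input upstream.
    grad_y = torch.empty_like(y)
    dist.recv(grad_y, src=rank + 1)
    y.backward(grad_y)
    dist.send(x.grad, dst=rank - 1)
```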
In this structure, dependencies exist between adjacent stages, creating pipeline bubbles during which devices wait in an idle state. To achieve high training throughput, devices must be kept busy for as long as possible. The most popular method for reducing bubbles is therefore pipeline parallelism based on the microbatch technique applied in GPipe [8], as shown in Figure 1. The figure also illustrates that the time required to compute backpropagation is generally twice that needed for forward propagation. Pipeline parallelism accelerates training by splitting a minibatch into microbatches, i.e., multiple smaller batch units, and processing them in parallel, overcoming the dependency-induced limitation that only one stage can compute at a time. In other words, as the minibatch is divided into more microbatches, the parallelism of the computation increases and, consequently, the training time decreases. However, if the minibatch is divided beyond a certain number of microbatches, training throughput decreases because of inefficient computation, typically caused by the small batch size and frequent network communication overhead. Therefore, to reach the maximum training throughput, an appropriate number of microbatches, M, must be established [9]. The variables employed in this study are shown in Table 1. GPipe, as microbatch-based pipeline parallelism, has the advantage of accelerating the training process while keeping training convergence easy to achieve, because it obtains the same precise gradients in each iteration as training on a single device. However, GPipe completely separates the forward propagation and backpropagation phases, so all activations of the minibatch must be kept in device memory, which limits the increase in the minibatch size N. Therefore, the integration of a recomputation technique into GPipe was proposed [8]. In the forward propagation phase, this technique first performs predictions without storing activations in order to initiate the backpropagation pipeline; the actual forward propagation is then conducted by recomputing the activations necessary for backpropagation immediately before it executes. Thus, the activation recomputation technique is very memory-friendly because it requires only one microbatch of activation memory to conduct training. However, it reduces training throughput by approximately 20% because the forward propagation operation must be performed twice [10]. PipeDream-Flush [11] applies PipeDream’s one-forward-one-backward (1F1B) mechanism [12], with a flush at every minibatch, together with GPipe’s microbatch technique for improved memory-consumption efficiency. The 1F1B mechanism alternates forward propagation and backpropagation once the pipeline enters the steady phase, instead of performing backpropagation only after all forward propagation steps are complete. Here, the steady phase refers to the period after the pipeline warm-up phase, i.e., after the point in Figure 1 at which the last stage initiates forward propagation for the first microbatch. Thus, to enter the steady phase, each stage of PipeDream-Flush performs the number of forward propagations required for the pipeline to fill completely and keeps the activations of those microbatches in memory (warm-up phase).
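As a single-stage illustration of the microbatch technique (our own simplified sketch, not GPipe itself; the pipelined overlap across stages is omitted), a minibatch can be chunked so that all forward passes run before all backward passes:

```python
import torch

def gpipe_like_step(stage, inputs, targets, loss_fn, num_microbatches):
    """Split a minibatch into microbatches, run all forward passes first
    (storing the per-microbatch losses), then run all backward passes,
    mirroring GPipe's schedule on a single stage."""
    micro_inputs = inputs.chunk(num_microbatches)
    micro_targets = targets.chunk(num_microbatches)

    # Forward phase: the losses (and their activations) of every microbatch
    # stay alive until backpropagation, which is why GPipe's activation
    # memory grows with the number of microbatches.
    losses = [loss_fn(stage(x), t) for x, t in zip(micro_inputs, micro_targets)]

    # Backward phase: gradients accumulate over all microbatches, yielding
    # the same gradient as one large minibatch (up to averaging).
    for loss in losses:
        (loss / num_microbatches).backward()
```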
In the steady phase where the 1F1B mechanism is applied, backpropagation is performed immediately after each forward propagation; hence, an activation whose gradient has been computed can be released from memory right away. Consequently, PipeDream-Flush needs to maintain only up to N activations, depending on the stage. Unlike in GPipe, which must maintain all M activations, the memory occupied by activations does not increase beyond a certain number of microbatches. Therefore, even if a larger minibatch is given, training can proceed without memory exhaustion as long as the microbatch size remains the same. However, PipeDream-Flush requires activation memory for N microbatches in the first stage but only one in the last stage. Accordingly, there is an imbalance in memory consumption between devices, as shown in Figure 1, and the severity of this imbalance increases with the number of stages in the pipeline. To mitigate this, the amount of weights trained on each stage can be adjusted; however, this in turn causes differences in computation time between stages. Thus, although PipeDream-Flush maintains the same training throughput as GPipe while consuming less memory, it still has to keep a large number of activations depending on the stage, and its memory consumption increases linearly as the number of pipeline stages grows, which is a drawback.
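The per-stage imbalance can be read directly off the 1F1B schedule. The following small generator (our own illustration, using generic `num_stages`/`num_microbatches` arguments rather than the notation of Table 1) emits the operation order for one stage of a PipeDream-Flush-style pipeline and tracks the peak number of activations held, which decreases from roughly the pipeline depth at the first stage to one at the last.

```python
def one_f_one_b_schedule(stage, num_stages, num_microbatches):
    """Return the (op, microbatch) order for one stage of a PipeDream-Flush
    style pipeline and the peak number of activations alive on that stage."""
    warmup = min(num_stages - stage - 1, num_microbatches)
    ops = []
    fwd = bwd = alive = peak = 0

    # Warm-up phase: forwards only, each one leaving its activation in memory.
    for _ in range(warmup):
        ops.append(("F", fwd))
        fwd, alive = fwd + 1, alive + 1
        peak = max(peak, alive)
    # Steady phase (1F1B): every new activation is released right after the
    # backward pass that consumes it.
    while fwd < num_microbatches:
        ops.append(("F", fwd))
        fwd, alive = fwd + 1, alive + 1
        peak = max(peak, alive)
        ops.append(("B", bwd))
        bwd, alive = bwd + 1, alive - 1
    # Cool-down phase: drain the remaining backward passes before the flush.
    while bwd < num_microbatches:
        ops.append(("B", bwd))
        bwd, alive = bwd + 1, alive - 1
    return ops, peak

# Example: 4 stages, 8 microbatches -> peak activations per stage: [4, 3, 2, 1]
print([one_f_one_b_schedule(s, 4, 8)[1] for s in range(4)])
```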
Schematic of pipeline parallelism methods.
Therefore, in this study, we propose a new synchronous pipeline parallelism technique that uses broadcast-based forward propagation to train large-scale neural networks efficiently. The technique constructs the pipeline by grouping subsets of stages with linear pipeline broadcast, one of the broadcast parallel patterns. The proposed approach improves on PipeDream-Flush in two key ways. First, it reduces the imbalance in activation memory consumption by almost half without adjusting the weights assigned to each device, enabling memory-friendly training even in pipelines with many stages. Second, although it partially utilizes the activation recomputation technique, the proposed method minimizes training throughput degradation by optimizing the pipeline. Moreover, owing to this structural optimization, additional training acceleration is achievable depending on the model’s characteristics.
Related Works
A. Weight-Replication-Based Acceleration Methods
Synchronous pipeline parallelism, as discussed in the previous section, has limitations: training cannot be accelerated beyond a certain level, and a large amount of activation memory must be maintained. To address this, methods that improve memory consumption by replicating the weights of the model have been proposed. In double-model pipeline parallelism, the weights of the stages are replicated at the symmetric point so that each device holds two partial weight sets. For example, the first device has the weights of the first stage
B. Asynchronous Pipeline Parallelism Methods
Synchronous pipeline parallelism employs a flush barrier at each iteration to empty the pipeline, preventing uncertainty in training; a weight update is performed after every flush barrier. Hence, this approach cannot completely eliminate bubbles. To address this issue, various asynchronous pipeline parallelism methods have been proposed to minimize uncertainty while maximizing training throughput. PipeDream [12], an asynchronous pipeline that operates in minibatch units, encounters no bubbles once it reaches the steady phase while performing 1F1B. To reduce uncertainty in training, PipeDream uses a weight stashing technique that stores weights until the gradient for the weights used in forward propagation reaches the backpropagation pipeline. However, this results in the consumption of a large amount of weight memory as
Methodology
A. Broadcast Collective Communication
Broadcast is a parallel pattern that transmits and copies data from the root process to the remaining processes. It can be implemented with various algorithms, such as the flat tree, chain tree, and binary tree, and the optimal algorithm depends on the conditions under which the broadcast is used. A flat tree (Figure 2a) is a tree with one level; because the transmission load is concentrated on the root process, it is appropriate when the data to be transmitted are small [19]. A chain tree (Figure 2b) is a tree with as many levels as there are processes. Unlike the flat tree, it distributes the transmission load instead of concentrating it on the root process. To reduce the communication cost of the chain tree, it can be pipelined using the segmentation technique [20], as shown in Figure 2c. This technique is known as linear pipeline broadcast, and it is very effective when the message size is large.
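For concreteness, a segmented chain broadcast can be sketched with mpi4py point-to-point calls as follows; this is an illustrative re-implementation over a 1-D NumPy buffer, not the tuned algorithm inside Open MPI, and `linear_pipeline_bcast` is a hypothetical helper name.

```python
import numpy as np
from mpi4py import MPI

def linear_pipeline_bcast(buf, root, comm, num_segments):
    """Chain-tree broadcast with segmentation: each rank forwards a segment
    to its successor as soon as it has received it, so the segments flow
    through the chain like a pipeline."""
    rank, size = comm.Get_rank(), comm.Get_size()
    pos = (rank - root) % size        # position in the chain starting at root
    prev = (rank - 1) % size
    nxt = (rank + 1) % size
    for seg in np.array_split(buf, num_segments):
        if pos > 0:
            comm.Recv(seg, source=prev)   # receive this segment from predecessor
        if pos < size - 1:
            comm.Send(seg, dest=nxt)      # forward it down the chain

# Example (run with `mpirun -np 4 python bcast.py`): broadcast 64 MB from rank 0.
comm = MPI.COMM_WORLD
data = np.arange(16 * 1024 * 1024, dtype=np.float32) if comm.Get_rank() == 0 \
    else np.empty(16 * 1024 * 1024, dtype=np.float32)
linear_pipeline_bcast(data, root=0, comm=comm, num_segments=8)
```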
Implementations of broadcast collective communication. Linear pipeline broadcast, which uses a chain tree together with the segmentation technique, is advantageous compared to the flat tree when the message to be broadcast is large.
B. Broadcast-Based Forward Propagation
Since the forward propagation pipelines of existing methods are connected through peer-to-peer communication, there is a dependency between stages, which limits improvements in training throughput and memory consumption. To address this issue, we propose broadcast-based forward propagation. The proposed method groups stages into sets of three and transmits forward-propagation data by broadcast on every other handoff, as shown in Figure 1. For example, in the last three stages, the output of the
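Although the surrounding description is abbreviated here, the gist can be sketched with mpi4py as follows: three consecutive stages form a sub-communicator, the producing stage broadcasts its output activation to both followers, and the remaining forward handoffs stay peer-to-peer. The group construction below is our own illustration (the ranks and buffer size are placeholders).

```python
import numpy as np
from mpi4py import MPI

def make_stage_group(comm, three_stage_ranks):
    """Sub-communicator over three consecutive stages; rank 0 of the new
    communicator is the stage that produces the activation, and the other
    two stages receive it in a single broadcast."""
    group = comm.Get_group().Incl(three_stage_ranks)
    return comm.Create(group)  # collective over comm; non-members get COMM_NULL

# Example (placeholders): global ranks 3, 4, 5 form one broadcast group.
comm = MPI.COMM_WORLD
sub = make_stage_group(comm, [3, 4, 5])
if sub != MPI.COMM_NULL:
    act = np.empty(4 * 1024 * 1024, dtype=np.float32)  # output activation buffer
    if comm.Get_rank() == 3:
        act[:] = 1.0          # stand-in for the activation computed by stage 3
    sub.Bcast(act, root=0)    # root 0 of `sub` corresponds to global rank 3
```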
Flow of microbatch-based input data in pipeline parallelism.
C. Adapting Activation Recomputation
Although broadcast-based forward propagation can shorten the network time, it requires a prediction step using the replicated weights. This creates extra bubbles of the corresponding duration in the pipeline because the prediction must be performed after the warm-up phase. Therefore, the proposed method uses these bubbles to let the stage that shares the weights perform activation recomputation. This allows those stages to replace half of the activation memory with recomputation without stalling the entire pipeline (refer to
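The recomputation itself can be expressed with PyTorch's activation checkpointing; a minimal, generic sketch (not the paper's implementation) is:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

def forward_with_recompute(stage: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Run the stage without storing its intermediate activations; they are
    recomputed during the backward pass, so between the two passes only the
    stage input and output remain resident in memory."""
    if not x.requires_grad:
        # checkpoint needs a grad-requiring input so the backward pass is
        # routed through the recomputation (typical for a freshly received
        # activation, which is a leaf tensor).
        x = x.requires_grad_(True)
    return checkpoint(stage, x)

# Usage: y = forward_with_recompute(stage, x); later, y.backward(grad_y)
# triggers a second forward pass over `stage` before computing gradients.
```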
The proposed method receives input data by performing linear pipeline broadcast and peer-to-peer communication alternately.
D. Broadcast-Based Forward Propagation Stacking
In pipeline parallelism, synchronizing a large number of weights can cause significant performance degradation, with the synchronization overhead outweighing the acceleration gained from weight replication. The proposed approach reduces this overhead by structuring the pipeline so that only half of the stages need to replicate weights through broadcast-based forward propagation (the leaf nodes of the chain tree), and it limits the maximum number of partial weight sets to be replicated to one, making it suitable for large-scale training. Because only half of the stages replicate the weights of the previous stage, peer-to-peer communication is used for the remaining handoffs instead of broadcast collective communication. This distributes the network communication and avoids the traffic bottlenecks that would arise from concentrating communication on a single process. Similar to CPPipe, the proposed method uses the bubbles between the weight update step and the beginning of the next iteration to mask the synchronization overhead. If the pipeline is not optimally configured with broadcast in the given environment, starting from the first stage, as shown in Case 2 of Figure 5, can be disadvantageous in terms of synchronization time and bubble reduction. Therefore, the proposed method stacks the broadcast pipeline from the last stage
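One way to picture the stacking rule is the small helper below (entirely our own illustration; it assumes disjoint three-stage groups, which may differ from the paper's exact grouping): groups are laid out starting from the last stage, so any stages that cannot form a complete group are left at the front of the pipeline and keep plain peer-to-peer forwarding.

```python
def broadcast_groups_from_last(num_stages, group_size=3):
    """Enumerate broadcast stage groups backwards from the last stage.

    Returns the groups (in pipeline order) and the leftover front stages that
    do not participate in broadcast-based forward propagation.
    """
    groups, end = [], num_stages
    while end - group_size >= 0:
        groups.append(list(range(end - group_size, end)))
        end -= group_size
    return list(reversed(groups)), list(range(end))

# Example: 8 stages -> leftover [0, 1]; groups [[2, 3, 4], [5, 6, 7]] are
# stacked from the last stage.
print(broadcast_groups_from_last(8))
```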
Stacking the broadcast-based pipeline from the bottom can secure more time for synchronization. Moreover, this case has a shorter warm-up phase duration.
E. Network Optimization
Since the proposed method is designed for large-scale neural network training, it aims to minimize memory consumption and limit training throughput degradation when training weight-intensive models. However, from a training throughput perspective, the proposed method is also very well suited to compute-intensive models with very large input data, such as high-resolution images. Broadcast-based forward propagation reduces the waiting time for data reception at each stage, as shown in Figure 6. A very large activation requires considerable time to send and receive between stages. The following performance improvement can be expected when linear pipeline broadcast is employed. Assuming that the size of the input data for all stages is α, the transfer time with the flat-tree broadcast is \begin{equation*}T_{\text{transfer}}^{\text{flat tree}} = T_{\text{latency}} + 2 \times \alpha \times T_{\text{byte}}. \tag{1}\end{equation*}
When the computation times of the last three stages all differ, PipeDream-Flush suffers from long reception times, including waiting. In contrast, the proposed method, in which broadcast-based forward propagation is applied, can partially alleviate the computation time imbalance between stages through network communication optimization. The dotted box indicates that network communication and activation recomputation are performed in parallel.
We will assume in the following that communication is full-duplex. When linear pipeline broadcast is used, the transfer time becomes \begin{equation*}T_{\text{transfer}}^{\text{linear pipeline}} = 2 \times T_{\text{latency}} + \alpha \times T_{\text{byte}} + \frac{\alpha}{\beta} \times T_{\text{byte}}. \tag{2}\end{equation*}
Thus, the network communication time decreases as the values of
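To illustrate Eqs. (1) and (2), the short calculation below plugs in illustrative (not measured) latency and per-byte costs for a 10G-Ethernet-like link and compares the two transfer times over a range of message sizes; the crossover in favor of linear pipeline broadcast appears once α is sufficiently large.

```python
def flat_tree_time(alpha, t_latency, t_byte):
    """Eq. (1): the root sends the alpha-byte message to both children."""
    return t_latency + 2 * alpha * t_byte

def linear_pipeline_time(alpha, t_latency, t_byte, beta):
    """Eq. (2): the message is split into beta segments and forwarded along
    the chain while the next segment is still being received (full duplex)."""
    return 2 * t_latency + alpha * t_byte + (alpha / beta) * t_byte

# Illustrative 10G-Ethernet-like numbers: 50 us latency, ~0.8 ns per byte.
t_lat, t_byte = 50e-6, 0.8e-9
for alpha in (64 * 1024, 16 * 1024 * 1024, 256 * 1024 * 1024):  # message sizes
    flat = flat_tree_time(alpha, t_lat, t_byte)
    pipe = linear_pipeline_time(alpha, t_lat, t_byte, beta=8)
    print(f"{alpha / 2**20:8.2f} MiB  flat={flat * 1e3:7.3f} ms  "
          f"pipeline={pipe * 1e3:7.3f} ms")
```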
Evaluation
A. Experimental Environment and Configuration
We conducted experiments in the following environment to measure and evaluate the performance of the proposed method. We employed six nodes connected by 10G Ethernet, and each node was equipped with 2 NVIDIA V100 GPUs with 16 GB device memory. Therefore, we conducted experiments using 12 GPUs. The deep learning framework used for training was PyTorch 1.13.1 based on CUDA 11.6 [21]. Additionally, Open MPI 4.1.6 [22] and MPI4Py 3.1.6 [23] were adopted together to implement distributed training. Initially, VGG19 [24] and ResNet152 [25] were employed as compute-intensive models, and RESISC45 [26], comprising 3-channel
B. Compute-Intensive Models
First, Figure 7 illustrates the training throughput for ResNet152 and VGG19. The x-axis represents the microbatch size, and the left y-axis represents the training throughput. When the number of segments is 1, Proposed (1) is equivalent to the flat-tree broadcast. The upper graph in Figure 7 displays the experimental results with a minibatch size of 128. Being an asynchronous pipeline parallelism method, PipeDream-2BW demonstrated significantly higher training throughput than the other methods. However, compared to PipeDream-Flush, a synchronous parallelism method that uses barriers to avoid introducing uncertainty during training, the proposed method achieved higher training throughput on both models and for most microbatch sizes. For the proposed method, an improvement in training throughput was observed with two or more segments, i.e., when linear pipeline broadcast is used, which suggests that linear pipeline broadcast is very effective. The second row of Figure 7 presents the experimental results for a minibatch size of 256, which yields a longer steady phase than a minibatch size of 128. Therefore, all methods, except PipeDream-2BW, which maintains a steady phase until the end of training, exhibited higher training throughput at a minibatch size of 256. As with the minibatch size of 128, the proposed method achieved the highest training throughput among the synchronous pipeline parallelism methods. The peak training throughput of ResNet152 and VGG19 was observed at microbatch sizes of 8 and 4, respectively, 14.6% and 12.6% higher than that of PipeDream-Flush. PipeDream-Flush and PipeDream-2BW experienced a degradation in training throughput for ResNet152 as the microbatch size decreased, whereas their training throughput increased for VGG19. The proposed method yielded comparable results, except when ResNet152 was trained with a microbatch size of 4; in this case, the proposed method achieved training throughput similar to that of PipeDream-Flush, as the small batch size limited the time available for network optimization. The total amounts of network communication data between the stages of the two models, ResNet152 and VGG19, are 67.4 MB and 91.9 MB when the microbatch size is 8 (
Comparison of the training throughput and peak memory performance on the RESISC45 dataset
The line graphs in Figure 7 illustrate the memory consumption of ResNet152 and VGG19, with the right y-axis indicating the maximum memory consumption. At a minibatch size of 128, PipeDream-2BW consumes more memory than PipeDream-Flush because of weight duplication. Compared to PipeDream-Flush, the proposed method significantly reduces peak memory consumption by eliminating the need for almost half of the activation memory while keeping the weights duplicated for acceleration to a minimum. At a minibatch size of 256, all methods exhibit comparable memory consumption levels due to the 1F1B technique. At the microbatch sizes where the two models reached peak training throughput, the proposed method exhibited 30.1% and 12.0% lower peak memory consumption than PipeDream-Flush, respectively, demonstrating superior performance with higher training throughput and reduced memory consumption. The peak memory reduction is smaller for VGG19 than for ResNet152 because peak training throughput is achieved at different microbatch sizes: the weight memory size remains constant, while the activation memory consumed decreases with the smaller batch size, lowering the ratio of activation memory to total memory. In contrast, the replicated weight memory used to accelerate VGG19 is 50.6 MB, significantly less than the 87.8 MB used for ResNet152.
C. Weight-Intensive Models
Figure 8 shows the training throughput for GPT2 and GPT2-Medium. Unlike in the ResNet152 and VGG19 experiments, the training throughput of the proposed method on GPT2 remained consistent, with no significant changes with respect to the number of segments. This is because the weights are allocated uniformly across stages owing to the nature of the model, which consists of multiple transformer layers. In other words, if there is no variance in computation time between stages, the potential for network optimization through broadcast-based forward propagation with the linear pipeline approach during the steady phase is reduced, as explained in the network optimization subsection. Nevertheless, the proposed method exhibited outstanding performance without any reduction in training throughput, as broadcast-based forward propagation offset a significant portion of the overhead incurred by activation recomputation. GPT2-Medium could not be trained with any method except the proposed one because of insufficient memory, rendering a performance comparison impossible. The line graph in Figure 8 compares the peak memory consumption of the GPT2 models. For GPT2, the peak training throughput was achieved at a microbatch size of 4, where the proposed method reduced peak memory by 36.0% compared to PipeDream-Flush without any significant degradation in training throughput. The reduction in maximum memory consumption achieved by the proposed method is smaller at a microbatch size of 4 than at 8, for the same reason that the peak memory reduction is smaller for VGG19 than for ResNet152, as explained in the previous subsection. In summary, the proposed method demonstrates superior performance by significantly reducing memory consumption while maintaining the same training throughput as PipeDream-Flush when training GPT2.
Comparison of the training throughput and memory consumption performance for GPT2
D. Bubble Reduction
Figure 9 shows the measured duration of the warm-up phase in each iteration of the pipeline, demonstrating the effectiveness of the network optimization achieved by broadcast-based forward propagation, explained in Section III-B, and of the architectural design of the proposed method. As PipeDream-Flush and PipeDream-2BW share the same warm-up phase, no significant difference was observed between them. However, the proposed method reduced the bubbles during the warm-up phase significantly for all models. For ResNet152 and VGG19, the proposed method exhibited the smallest bubbles when the number of segments for broadcast-based forward propagation was set to 8, with a microbatch size of 8, indicating that linear pipeline broadcast is highly effective for enhancing synchronous pipeline parallelism. Numerically, the proposed method achieves accelerations of 29.4% and 42.2% compared to the baseline PipeDream-Flush. In contrast, for GPT2, there was no significant difference in bubble reduction beyond two segments, owing to the uniform weight distribution across the stages. Nevertheless, since the size of the activation transferred over the network is larger than that of the compute-intensive models (198.0 MB when the microbatch size is 8 and
Comparison of pipeline bubble reduction performance during the warm-up phase.
Conclusion
In this study, we analyzed the limitations of existing synchronous and asynchronous pipeline parallelism techniques, such as interstage activation imbalance and inefficient memory consumption, and proposed a broadcast-based forward propagation method using partial weight replication to address these issues. We demonstrated that implementing broadcast-based forward propagation with linear pipeline broadcast can significantly reduce the cost of network communication among the three stages in the forward propagation pipeline, thereby increasing training throughput. However, broadcast-based forward propagation introduces a bubble due to the added prediction step; we therefore utilize this idle time to apply an activation recomputation technique, reducing activation memory by nearly half. Consequently, the proposed method demonstrated excellent performance on compute-intensive and weight-intensive models in our experiments, achieving a significant reduction in peak memory consumption without compromising training throughput. In particular, for compute-intensive models, the proposed method operates very efficiently in the presence of imbalances in interstage computation time and activation data transmission, showing higher training throughput than the baseline PipeDream-Flush. In future work, we will focus on minimizing the prediction and activation recomputation steps in broadcast-based forward propagation, as these are directly related to the reduction of pipeline bubbles.