Introduction
Recently, the adoption of deep learning models across various domains and tasks has increased considerably. In parallel, the number of layers and parameters required to achieve adequate performance on high-complexity tasks continues to grow, and with it the amount of memory required for model training [1]. However, because the memory capacity of hardware accelerators cannot be physically expanded, memory-reduction techniques [2], [3] and distributed training [4] have become popular. Distributed training uses multiple accelerators either to handle models that demand a significant amount of memory for training or to expedite the training process through weight replication. Splitting a model across multiple devices for training is referred to as model parallelism, and the variant that divides the model into layers and assigns the resulting stages to devices is called layer pipelining [5]. Dividing the model along the weight tensors of each layer, rather than by layers, is known as tensor parallelism [6]. Conversely, the distributed training method that accelerates training by replicating the entire set of model weights across multiple devices to increase the minibatch size is called data parallelism [7].
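To make the layer-pipelining idea concrete, the following minimal sketch (in PyTorch; the helper name `split_into_stages` and the naive even split by layer count are our own illustration, not the partitioning scheme of any particular system) shows how a sequential model could be cut into contiguous stages, one per device.

```python
import torch
import torch.nn as nn

def split_into_stages(model: nn.Sequential, num_stages: int):
    """Partition a sequential model into contiguous layer groups (stages).

    Illustrative even split by layer count; real systems balance stages by
    per-layer compute and memory cost.
    """
    layers = list(model.children())
    per_stage = (len(layers) + num_stages - 1) // num_stages
    return [nn.Sequential(*layers[i:i + per_stage])
            for i in range(0, len(layers), per_stage)]

# Example: a toy model split across 4 stages, assuming one GPU per stage.
model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(16)])
stages = split_into_stages(model, num_stages=4)
for rank, stage in enumerate(stages):
    stage.to(f"cuda:{rank}" if torch.cuda.is_available() else "cpu")
```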
The training process in deep learning is roughly divided into two steps: a forward propagation step and a backpropagation step. The former predicts the result from the input data using the model weights and stores intermediate operation values (activations); the latter calculates gradients to adjust the current weights of the model using the prediction obtained in the forward propagation step. Therefore, in layer pipelining, where the model is divided across multiple devices, the output activations and gradients of each stage must be transmitted through network communication. In other words, forward propagation begins with the input data in the first stage, and each stage transmits its output activation over the network to the subsequent stage until the last stage is reached. The last stage computes the gradient from its output and the loss function and initiates backpropagation back toward the first stage in accordance with the chain rule. Finally, when the backpropagation of the first stage is completed, the iteration terminates, and the weights are updated with the obtained gradients and the optimizer for the next iteration.
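As an illustration of this interstage traffic, the sketch below (a simplified, hypothetical stage loop that assumes `torch.distributed` is already initialized with one process per stage and that the activation shape is known in advance) shows an intermediate stage receiving an activation, sending its output forward, and later relaying gradients in the opposite direction.

```python
import torch
import torch.distributed as dist

def run_middle_stage_microbatch(stage, rank, act_shape, device):
    """Process one microbatch on an intermediate stage: receive the input
    activation from the previous stage, run forward and pass the output on,
    then relay gradients in the opposite direction during backpropagation."""
    # Forward propagation: receive the activation, compute, send the output.
    x = torch.empty(act_shape, device=device)
    dist.recv(x, src=rank - 1)
    x.requires_grad_(True)
    y = stage(x)
    dist.send(y.detach().contiguous(), dst=rank + 1)

    # Backpropagation: receive the gradient of the output, backpropagate
    # through this stage, and send the gradient of the input upstream.
    grad_y = torch.empty_like(y)
    dist.recv(grad_y, src=rank + 1)
    y.backward(grad_y)
    dist.send(x.grad, dst=rank - 1)
```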
In this structure, dependencies exist between adjacent stages, creating pipeline bubbles during which devices wait in an idle state. To achieve high training throughput, devices must be kept busy for as long as possible. The most popular method for reducing bubbles is therefore pipeline parallelism based on the microbatch technique applied in GPipe [8], as shown in Figure 1. The figure also illustrates that the time required to compute backpropagation is generally twice that needed for forward propagation. Pipeline parallelism accelerates training by splitting a minibatch into microbatches, i.e., multiple smaller batch units, and processing them in parallel, overcoming the dependency-induced limitation that only one stage can compute at a time. In other words, as the minibatch is divided into more microbatches, the parallelism of the computation increases and, consequently, the training time decreases. However, if the minibatch is divided beyond a certain number of microbatches, training throughput decreases because of inefficient computation, typically caused by the small batch size and frequent network communication overhead. Therefore, to reach the maximum training throughput, an appropriate number of microbatches, M, must be established [9]. The variables employed in this study are shown in Table 1. GPipe, as microbatch-based pipeline parallelism, has the advantage of accelerating the training process while keeping training convergence easy to achieve, because it obtains the same precise gradients in each iteration as training on a single device. However, GPipe completely separates the forward propagation and backpropagation phases, so all activations of the minibatch must be kept in device memory, which limits the increase in the minibatch size N. Therefore, the integration of a recomputation technique into GPipe was proposed [8]. In the forward propagation phase, this technique first performs predictions without storing activations in order to initiate the backpropagation pipeline; the actual forward propagation is then conducted by recomputing the activations necessary for backpropagation immediately before it executes. Thus, the activation recomputation technique is very memory-friendly because it requires only one microbatch of activation memory to conduct training. However, it reduces training throughput by approximately 20% because the forward propagation operation must be performed twice [10]. PipeDream-Flush [11] applies PipeDream’s one-forward-one-backward (1F1B) mechanism [12], with a flush at every minibatch, together with GPipe’s microbatch technique for improved memory-consumption efficiency. The 1F1B mechanism alternates forward propagation and backpropagation once the pipeline enters the steady phase, instead of performing backpropagation only after all forward propagation steps are complete. Here, the steady phase refers to the period after the pipeline warm-up phase, i.e., after the point in Figure 1 at which the last stage initiates forward propagation for the first microbatch. Thus, to enter the steady phase, each stage of PipeDream-Flush performs the number of forward propagations required for the pipeline to fill completely and keeps the activations of those microbatches in memory (warm-up phase).
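As a single-stage illustration of the microbatch technique (our own simplified sketch, not GPipe itself; the pipelined overlap across stages is omitted), a minibatch can be chunked so that all forward passes run before all backward passes:

```python
import torch

def gpipe_like_step(stage, inputs, targets, loss_fn, num_microbatches):
    """Split a minibatch into microbatches, run all forward passes first
    (storing the per-microbatch losses), then run all backward passes,
    mirroring GPipe's schedule on a single stage."""
    micro_inputs = inputs.chunk(num_microbatches)
    micro_targets = targets.chunk(num_microbatches)

    # Forward phase: the losses (and their activations) of every microbatch
    # stay alive until backpropagation, which is why GPipe's activation
    # memory grows with the number of microbatches.
    losses = [loss_fn(stage(x), t) for x, t in zip(micro_inputs, micro_targets)]

    # Backward phase: gradients accumulate over all microbatches, yielding
    # the same gradient as one large minibatch (up to averaging).
    for loss in losses:
        (loss / num_microbatches).backward()
```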
In the steady phase where the 1F1B mechanism is applied, backpropagation is performed immediately after each forward propagation; hence, an activation whose gradient has been computed can be released from memory right away. Consequently, PipeDream-Flush needs to maintain only up to N activations, depending on the stage. Unlike in GPipe, which must maintain all M activations, the memory occupied by activations does not increase beyond a certain number of microbatches. Therefore, even if a larger minibatch is given, training can proceed without memory exhaustion as long as the microbatch size remains the same. However, PipeDream-Flush requires activation memory for N microbatches in the first stage but only one in the last stage. Accordingly, there is an imbalance in memory consumption between devices, as shown in Figure 1, and the severity of this imbalance increases with the number of stages in the pipeline. To mitigate this, the amount of weights trained on each stage can be adjusted; however, this in turn causes differences in computation time between stages. Thus, although PipeDream-Flush maintains the same training throughput as GPipe while consuming less memory, it still has to keep a large number of activations depending on the stage, and its memory consumption increases linearly as the number of pipeline stages grows, which is a drawback.
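The per-stage imbalance can be read directly off the 1F1B schedule. The following small generator (our own illustration, using generic `num_stages`/`num_microbatches` arguments rather than the notation of Table 1) emits the operation order for one stage of a PipeDream-Flush-style pipeline and tracks the peak number of activations held, which decreases from roughly the pipeline depth at the first stage to one at the last.

```python
def one_f_one_b_schedule(stage, num_stages, num_microbatches):
    """Return the (op, microbatch) order for one stage of a PipeDream-Flush
    style pipeline and the peak number of activations alive on that stage."""
    warmup = min(num_stages - stage - 1, num_microbatches)
    ops = []
    fwd = bwd = alive = peak = 0

    # Warm-up phase: forwards only, each one leaving its activation in memory.
    for _ in range(warmup):
        ops.append(("F", fwd))
        fwd, alive = fwd + 1, alive + 1
        peak = max(peak, alive)
    # Steady phase (1F1B): every new activation is released right after the
    # backward pass that consumes it.
    while fwd < num_microbatches:
        ops.append(("F", fwd))
        fwd, alive = fwd + 1, alive + 1
        peak = max(peak, alive)
        ops.append(("B", bwd))
        bwd, alive = bwd + 1, alive - 1
    # Cool-down phase: drain the remaining backward passes before the flush.
    while bwd < num_microbatches:
        ops.append(("B", bwd))
        bwd, alive = bwd + 1, alive - 1
    return ops, peak

# Example: 4 stages, 8 microbatches -> peak activations per stage: [4, 3, 2, 1]
print([one_f_one_b_schedule(s, 4, 8)[1] for s in range(4)])
```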
Schematic of pipeline parallelism methods.
Therefore, in this study, we propose a new synchronous pipeline parallelism technique that uses broadcast-based forward propagation to train large-scale neural networks efficiently. The technique constructs the pipeline by grouping subsets of stages with linear pipeline broadcast, one of the broadcast parallel patterns. The proposed approach improves on PipeDream-Flush in two key ways. First, it reduces the imbalance in activation memory consumption by almost half without adjusting the weights assigned to each device, enabling memory-friendly training even in pipelines with many stages. Second, although it partially utilizes the activation recomputation technique, the proposed method minimizes training throughput degradation by optimizing the pipeline. Moreover, owing to this structural optimization, additional training acceleration is achievable depending on the model’s characteristics.
Related Works
A. Weight-Replication-Based Acceleration Methods
Synchronous pipeline parallelism, as discussed in the previous section, has limitations: training cannot be accelerated beyond a certain level, and a large amount of activation memory must be maintained. To address this, methods that improve memory consumption by replicating the weights of the model have been proposed. In double-model pipeline parallelism, the weights of the stages are replicated at the symmetric point so that each device holds two partial weight sets. For example, the first device has the weights of the first stage
B. Asynchronous Pipeline Parallelism Methods
Synchronous pipeline parallelism employs a flush barrier at each iteration to empty the pipeline, preventing uncertainty in training; a weight update is performed after every flush barrier. Hence, this approach cannot completely eliminate bubbles. To address this issue, various asynchronous pipeline parallelism methods have been proposed to minimize uncertainty while maximizing training throughput. PipeDream [12], an asynchronous pipeline that operates in minibatch units, encounters no bubbles once it reaches the steady phase while performing 1F1B. To reduce uncertainty in training, PipeDream uses a weight stashing technique that stores weights until the gradient for the weights used in forward propagation reaches the backpropagation pipeline. However, this results in the consumption of a large amount of weight memory as
Methodology
A. Broadcast Collective Communication
Broadcast is a parallel pattern that transmits and copies data from the root process to the remaining processes. It can be implemented with various algorithms, such as the flat tree, chain tree, and binary tree, and the optimal algorithm depends on the conditions under which the broadcast is used. A flat tree (Figure 2a) is a tree with one level; because the transmission load is concentrated on the root process, it is appropriate when the data to be transmitted are small [19]. A chain tree (Figure 2b) is a tree with as many levels as there are processes. Unlike the flat tree, it distributes the transmission load instead of concentrating it on the root process. To reduce the communication cost of the chain tree, it can be pipelined using the segmentation technique [20], as shown in Figure 2c. This technique is known as linear pipeline broadcast, and it is very effective when the message size is large.
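For concreteness, a segmented chain broadcast can be sketched with mpi4py point-to-point calls as follows; this is an illustrative re-implementation over a 1-D NumPy buffer, not the tuned algorithm inside Open MPI, and `linear_pipeline_bcast` is a hypothetical helper name.

```python
import numpy as np
from mpi4py import MPI

def linear_pipeline_bcast(buf, root, comm, num_segments):
    """Chain-tree broadcast with segmentation: each rank forwards a segment
    to its successor as soon as it has received it, so the segments flow
    through the chain like a pipeline."""
    rank, size = comm.Get_rank(), comm.Get_size()
    pos = (rank - root) % size        # position in the chain starting at root
    prev = (rank - 1) % size
    nxt = (rank + 1) % size
    for seg in np.array_split(buf, num_segments):
        if pos > 0:
            comm.Recv(seg, source=prev)   # receive this segment from predecessor
        if pos < size - 1:
            comm.Send(seg, dest=nxt)      # forward it down the chain

# Example (run with `mpirun -np 4 python bcast.py`): broadcast 64 MB from rank 0.
comm = MPI.COMM_WORLD
data = np.arange(16 * 1024 * 1024, dtype=np.float32) if comm.Get_rank() == 0 \
    else np.empty(16 * 1024 * 1024, dtype=np.float32)
linear_pipeline_bcast(data, root=0, comm=comm, num_segments=8)
```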
Implementations of broadcast collective communication. Linear pipeline broadcast, which uses a chain tree together with the segmentation technique, is advantageous compared to the flat tree when the message to be broadcast is large.
B. Broadcast-Based Forward Propagation
Since the forward propagation pipelines of existing methods are connected through peer-to-peer communication, there is a dependency between stages, which limits improvements in training throughput and memory consumption. To address this issue, we propose broadcast-based forward propagation. The proposed method groups stages into sets of three and transmits forward-propagation data by broadcast on every other handoff, as shown in Figure 1. For example, in the last three stages, the output of the
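Although the surrounding description is abbreviated here, the gist can be sketched with mpi4py as follows: three consecutive stages form a sub-communicator, the producing stage broadcasts its output activation to both followers, and the remaining forward handoffs stay peer-to-peer. The group construction below is our own illustration (the ranks and buffer size are placeholders).

```python
import numpy as np
from mpi4py import MPI

def make_stage_group(comm, three_stage_ranks):
    """Sub-communicator over three consecutive stages; rank 0 of the new
    communicator is the stage that produces the activation, and the other
    two stages receive it in a single broadcast."""
    group = comm.Get_group().Incl(three_stage_ranks)
    return comm.Create(group)  # collective over comm; non-members get COMM_NULL

# Example (placeholders): global ranks 3, 4, 5 form one broadcast group.
comm = MPI.COMM_WORLD
sub = make_stage_group(comm, [3, 4, 5])
if sub != MPI.COMM_NULL:
    act = np.empty(4 * 1024 * 1024, dtype=np.float32)  # output activation buffer
    if comm.Get_rank() == 3:
        act[:] = 1.0          # stand-in for the activation computed by stage 3
    sub.Bcast(act, root=0)    # root 0 of `sub` corresponds to global rank 3
```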
Flow of microbatch-based input data in pipeline parallelism.
C. Adapting Activation Recomputation
Although broadcast-based forward propagation can shorten the network time, it requires a prediction step using the replicated weights. This creates extra bubbles of the corresponding duration in the pipeline because the prediction must be performed after the warm-up phase. Therefore, the proposed method uses these bubbles to let the stage that shares the weights perform activation recomputation. This allows those stages to replace half of the activation memory with recomputation without stalling the entire pipeline (refer to
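The recomputation itself can be expressed with PyTorch's activation checkpointing; a minimal, generic sketch (not the paper's implementation) is:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

def forward_with_recompute(stage: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Run the stage without storing its intermediate activations; they are
    recomputed during the backward pass, so between the two passes only the
    stage input and output remain resident in memory."""
    if not x.requires_grad:
        # checkpoint needs a grad-requiring input so the backward pass is
        # routed through the recomputation (typical for a freshly received
        # activation, which is a leaf tensor).
        x = x.requires_grad_(True)
    return checkpoint(stage, x)

# Usage: y = forward_with_recompute(stage, x); later, y.backward(grad_y)
# triggers a second forward pass over `stage` before computing gradients.
```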
The proposed method receives input data by performing linear pipeline broadcast and peer-to-peer communication alternately.
D. Broadcast-Based Forward Propagation Stacking
In pipeline parallelism, synchronizing a large number of weights can cause significant performance degradation, with the synchronization overhead outweighing the acceleration gained from weight replication. The proposed approach reduces this overhead by structuring the pipeline so that only half of the stages need to replicate weights through broadcast-based forward propagation (the leaf nodes of the chain tree), and it limits the maximum number of partial weight sets to be replicated to one, making it suitable for large-scale training. Because only half of the stages replicate the weights of the previous stage, peer-to-peer communication is used for the remaining handoffs instead of broadcast collective communication. This distributes the network communication and avoids the traffic bottlenecks that would arise from concentrating communication on a single process. Similar to CPPipe, the proposed method uses the bubbles between the weight update step and the beginning of the next iteration to mask the synchronization overhead. If the pipeline is not optimally configured with broadcast in the given environment, starting from the first stage, as shown in Case 2 of Figure 5, can be disadvantageous in terms of synchronization time and bubble reduction. Therefore, the proposed method stacks the broadcast pipeline from the last stage
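One way to picture the stacking rule is the small helper below (entirely our own illustration; it assumes disjoint three-stage groups, which may differ from the paper's exact grouping): groups are laid out starting from the last stage, so any stages that cannot form a complete group are left at the front of the pipeline and keep plain peer-to-peer forwarding.

```python
def broadcast_groups_from_last(num_stages, group_size=3):
    """Enumerate broadcast stage groups backwards from the last stage.

    Returns the groups (in pipeline order) and the leftover front stages that
    do not participate in broadcast-based forward propagation.
    """
    groups, end = [], num_stages
    while end - group_size >= 0:
        groups.append(list(range(end - group_size, end)))
        end -= group_size
    return list(reversed(groups)), list(range(end))

# Example: 8 stages -> leftover [0, 1]; groups [[2, 3, 4], [5, 6, 7]] are
# stacked from the last stage.
print(broadcast_groups_from_last(8))
```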
Stacking the broadcast-based pipeline from the bottom can secure more time for synchronization. Moreover, this case has a shorter warm-up phase duration.
E. Network Optimization
Since the proposed method is designed for large-scale neural network training, it aims to minimize memory consumption and limit training throughput degradation when training weight-intensive models. However, from a training throughput perspective, the proposed method is also very well suited to compute-intensive models with very large input data, such as high-resolution images. Broadcast-based forward propagation reduces the waiting time for data reception at each stage, as shown in Figure 6. A very large activation requires considerable time to send and receive between stages. The following performance improvement can be expected when linear pipeline broadcast is employed. Assuming that the size of the input data for all stages is α, the transfer time with the flat-tree broadcast is \begin{equation*}T_{\text{transfer}}^{\text{flat tree}} = T_{\text{latency}} + 2 \times \alpha \times T_{\text{byte}}. \tag{1}\end{equation*}
When the computation times of the last three stages all differ, PipeDream-Flush suffers from long reception times, including waiting. In contrast, the proposed method, in which broadcast-based forward propagation is applied, can partially alleviate the computation time imbalance between stages through network communication optimization. The dotted box indicates that network communication and activation recomputation are performed in parallel.
We will assume in the following that communication is full-duplex. When linear pipeline broadcast is used, the transfer time becomes \begin{equation*}T_{\text{transfer}}^{\text{linear pipeline}} = 2 \times T_{\text{latency}} + \alpha \times T_{\text{byte}} + \frac{\alpha}{\beta} \times T_{\text{byte}}. \tag{2}\end{equation*}
Thus, the network communication time decreases as the values of
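To illustrate Eqs. (1) and (2), the short calculation below plugs in illustrative (not measured) latency and per-byte costs for a 10G-Ethernet-like link and compares the two transfer times over a range of message sizes; the crossover in favor of linear pipeline broadcast appears once α is sufficiently large.

```python
def flat_tree_time(alpha, t_latency, t_byte):
    """Eq. (1): the root sends the alpha-byte message to both children."""
    return t_latency + 2 * alpha * t_byte

def linear_pipeline_time(alpha, t_latency, t_byte, beta):
    """Eq. (2): the message is split into beta segments and forwarded along
    the chain while the next segment is still being received (full duplex)."""
    return 2 * t_latency + alpha * t_byte + (alpha / beta) * t_byte

# Illustrative 10G-Ethernet-like numbers: 50 us latency, ~0.8 ns per byte.
t_lat, t_byte = 50e-6, 0.8e-9
for alpha in (64 * 1024, 16 * 1024 * 1024, 256 * 1024 * 1024):  # message sizes
    flat = flat_tree_time(alpha, t_lat, t_byte)
    pipe = linear_pipeline_time(alpha, t_lat, t_byte, beta=8)
    print(f"{alpha / 2**20:8.2f} MiB  flat={flat * 1e3:7.3f} ms  "
          f"pipeline={pipe * 1e3:7.3f} ms")
```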
Evaluation
A. Experimental Environment and Configuration
We conducted experiments in the following environment to measure and evaluate the performance of the proposed method. We employed six nodes connected by 10G Ethernet, and each node was equipped with 2 NVIDIA V100 GPUs with 16 GB device memory. Therefore, we conducted experiments using 12 GPUs. The deep learning framework used for training was PyTorch 1.13.1 based on CUDA 11.6 [21]. Additionally, Open MPI 4.1.6 [22] and MPI4Py 3.1.6 [23] were adopted together to implement distributed training. Initially, VGG19 [24] and ResNet152 [25] were employed as compute-intensive models, and RESISC45 [26], comprising 3-channel
B. Compute-Intensive Models
First, Figure 7 illustrates the training throughput for ResNet152 and VGG19. The x-axis represents the microbatch size, and the left y-axis represents the training throughput. When the number of segments is 1, Proposed (1) is equivalent to the flat-tree broadcast. The upper graph in Figure 7 displays the experimental results with a minibatch size of 128. Being an asynchronous pipeline parallelism method, PipeDream-2BW demonstrated significantly higher training throughput than the other methods. However, compared to PipeDream-Flush, a synchronous parallelism method that uses barriers to avoid introducing uncertainty during training, the proposed method achieved higher training throughput on both models and for most microbatch sizes. For the proposed method, an improvement in training throughput was observed with two or more segments, i.e., when linear pipeline broadcast is used, which suggests that linear pipeline broadcast is very effective. The second row of Figure 7 presents the experimental results for a minibatch size of 256, which yields a longer steady phase than a minibatch size of 128. Therefore, all methods, except PipeDream-2BW, which maintains a steady phase until the end of training, exhibited higher training throughput at a minibatch size of 256. As with the minibatch size of 128, the proposed method achieved the highest training throughput among the synchronous pipeline parallelism methods. The peak training throughput of ResNet152 and VGG19 was observed at microbatch sizes of 8 and 4, respectively, 14.6% and 12.6% higher than that of PipeDream-Flush. PipeDream-Flush and PipeDream-2BW experienced a degradation in training throughput for ResNet152 as the microbatch size decreased, whereas their training throughput increased for VGG19. The proposed method yielded comparable results, except when ResNet152 was trained with a microbatch size of 4; in this case, the proposed method achieved training throughput similar to that of PipeDream-Flush, as the small batch size limited the time available for network optimization. The total amounts of network communication data between the stages of the two models, ResNet152 and VGG19, are 67.4 MB and 91.9 MB when the microbatch size is 8 (
Comparison of the training throughput and peak memory performance on the RESISC45 dataset
The line graphs in Figure 7 illustrate the memory consumption of ResNet152 and VGG19, with the right y-axis indicating the maximum memory consumption. At a minibatch size of 128, PipeDream-2BW consumes more memory than PipeDream-Flush because of weight duplication. Compared to PipeDream-Flush, the proposed method significantly reduces peak memory consumption by eliminating the need for almost half of the activation memory while keeping the weights duplicated for acceleration to a minimum. At a minibatch size of 256, all methods exhibit comparable memory consumption levels due to the 1F1B technique. At the microbatch sizes where the two models reached peak training throughput, the proposed method exhibited 30.1% and 12.0% lower peak memory consumption than PipeDream-Flush, respectively, demonstrating superior performance with higher training throughput and reduced memory consumption. The peak memory reduction is smaller for VGG19 than for ResNet152 because peak training throughput is achieved at different microbatch sizes: the weight memory size remains constant, while the activation memory consumed decreases with the smaller batch size, lowering the ratio of activation memory to total memory. In contrast, the replicated weight memory used to accelerate VGG19 is 50.6 MB, significantly less than the 87.8 MB used for ResNet152.
C. Weight-Intensive Models
Figure 8 shows the training throughput for GPT2 and GPT2-Medium. Unlike in the ResNet152 and VGG19 experiments, the training throughput of the proposed method on GPT2 remained consistent, with no significant changes with respect to the number of segments. This is because the weights are allocated uniformly across stages owing to the nature of the model, which consists of multiple transformer layers. In other words, if there is no variance in computation time between stages, the potential for network optimization through broadcast-based forward propagation with the linear pipeline approach during the steady phase is reduced, as explained in the network optimization subsection. Nevertheless, the proposed method exhibited outstanding performance without any reduction in training throughput, as broadcast-based forward propagation offset a significant portion of the overhead incurred by activation recomputation. GPT2-Medium could not be trained with any method except the proposed one because of insufficient memory, rendering a performance comparison impossible. The line graph in Figure 8 compares the peak memory consumption of the GPT2 models. For GPT2, the peak training throughput was achieved at a microbatch size of 4, where the proposed method reduced peak memory by 36.0% compared to PipeDream-Flush without any significant degradation in training throughput. The reduction in maximum memory consumption achieved by the proposed method is smaller at a microbatch size of 4 than at 8, for the same reason that the peak memory reduction is smaller for VGG19 than for ResNet152, as explained in the previous subsection. In summary, the proposed method demonstrates superior performance by significantly reducing memory consumption while maintaining the same training throughput as PipeDream-Flush when training GPT2.
Comparison of the training throughput and memory consumption performance for GPT2
D. Bubble Reduction
Figure 9 shows the measured duration of the warm-up phase in each iteration of the pipeline, demonstrating the effectiveness of the network optimization achieved by broadcast-based forward propagation, explained in Section III-B, and of the architectural design of the proposed method. As PipeDream-Flush and PipeDream-2BW share the same warm-up phase, no significant difference was observed between them. However, the proposed method reduced the bubbles during the warm-up phase significantly for all models. For ResNet152 and VGG19, the proposed method exhibited the smallest bubbles when the number of segments for broadcast-based forward propagation was set to 8, with a microbatch size of 8, indicating that linear pipeline broadcast is highly effective for enhancing synchronous pipeline parallelism. Numerically, the proposed method achieves accelerations of 29.4% and 42.2% compared to the baseline PipeDream-Flush. In contrast, for GPT2, there was no significant difference in bubble reduction beyond two segments, owing to the uniform weight distribution across the stages. Nevertheless, since the size of the activation transferred over the network is larger than that of the compute-intensive models (198.0 MB when the microbatch size is 8 and
Comparison of pipeline bubble reduction performance during the warm-up phase.
Conclusion
In this study, we analyzed the limitations of existing synchronous and asynchronous pipeline parallelism techniques, such as interstage activation imbalance and inefficient memory consumption, and proposed a broadcast-based forward propagation method using partial weight replication to address these issues. We demonstrated that implementing broadcast-based forward propagation with linear pipeline broadcast can significantly reduce the cost of network communication among the three stages in the forward propagation pipeline, thereby increasing training throughput. However, broadcast-based forward propagation introduces a bubble due to the added prediction step; we therefore utilize this idle time to apply an activation recomputation technique, reducing activation memory by nearly half. Consequently, the proposed method demonstrated excellent performance on compute-intensive and weight-intensive models in our experiments, achieving a significant reduction in peak memory consumption without compromising training throughput. In particular, for compute-intensive models, the proposed method operates very efficiently in the presence of imbalances in interstage computation time and activation data transmission, showing higher training throughput than the baseline PipeDream-Flush. In future work, we will focus on minimizing the prediction and activation recomputation steps in broadcast-based forward propagation, as these are directly related to the reduction of pipeline bubbles.