Abstract:
The transformer architecture has revolutionized many applications, such as large language models. This progress has been largely enabled by distributed training, yet communication remains a significant bottleneck. This paper examines the communication behavior of transformer models, focusing on the data exchanged by different parallelism schemes in multi-node/multi-GPU training. We use GPT-based language models as a case study because of their prevalence, and we validate our empirical results against analytical models. Our analysis reveals practical insights and potential areas for further optimization in framework and HPC middleware design.
Published in: IEEE Micro (Early Access)