
Releasing the Power of In-Network Aggregation With Aggregator-Aware Routing Optimization



Abstract:

By offloading part of the aggregation computation from the logically centralized parameter servers to network devices such as programmable switches, In-Network Aggregation (INA) is a general, effective, and widely used approach to reduce network load and thereby alleviate the communication bottlenecks suffered by large-scale distributed training. Since INA takes effect only when the associated traffic traverses the same in-network aggregator, the key to exploiting INA lies in routing control. However, existing proposals fall short in this regard and are thus far from optimal, because they select routes for INA-supported traffic without comprehensively considering the characteristics, limitations, and requirements of the network environment, the aggregator hardware, and the distributed training jobs. To fill this gap, in this paper we systematically establish a mathematical model that formulates i) the up-down routing constraints of Clos datacenter networks, ii) the limitations imposed by the pipeline hardware structure of modern programmable switches, and iii) the various aggregator-aware routing optimization goals required by distributed training tasks under different parallelism strategies. Based on this model, we develop ARO, an Aggregator-aware Routing Optimization solution for INA-accelerated distributed training applications. To be efficient, ARO incorporates a suite of search-space pruning designs that exploit the model's characteristics, improving solving time by tens of times with negligible performance loss. Extensive experiments show that ARO finds near-optimal results for large-scale routing optimization within tens of seconds and achieves 1.8–4.0× higher throughput than the state-of-the-art solution.
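To fix the core idea behind aggregator-aware routing, the following is a minimal, assumed sketch (not the paper's ARO formulation or notation): on a toy two-leaf, two-spine Clos fragment, it brute-forces one aggregator spine per training job so that all of a job's flows meet at the same aggregator (the precondition for INA), subject to a made-up per-switch aggregator-slot limit standing in for the pipeline hardware constraints, and it minimizes the busiest leaf-to-spine uplink. All names, capacities, and traffic volumes are invented for illustration; ARO solves a much richer model with a pruned search space.

```python
# Toy aggregator-aware placement sketch (illustrative only; not the ARO algorithm).
from itertools import product

LEAVES = ["L1", "L2"]
SPINES = ["S1", "S2"]
AGG_SLOTS = {"S1": 1, "S2": 1}   # hypothetical aggregator capacity per spine switch

# job -> ({leaf: workers under that leaf}, per-worker gradient volume, e.g. in MB)
JOBS = {
    "jobA": ({"L1": 2, "L2": 2}, 10.0),
    "jobB": ({"L1": 1, "L2": 1}, 10.0),
}

def max_uplink_load(assignment):
    """Load on the busiest leaf->spine uplink when each job aggregates at its chosen spine."""
    load = {(leaf, spine): 0.0 for leaf in LEAVES for spine in SPINES}
    for job, spine in assignment.items():
        placement, volume = JOBS[job]
        for leaf, n_workers in placement.items():
            # every worker flow of this job must climb to the job's aggregator
            load[(leaf, spine)] += n_workers * volume
    return max(load.values())

def feasible(assignment):
    """Respect per-switch aggregator slots (a stand-in for pipeline resource limits)."""
    used = {spine: 0 for spine in SPINES}
    for spine in assignment.values():
        used[spine] += 1
    return all(used[spine] <= AGG_SLOTS[spine] for spine in SPINES)

candidates = [
    dict(zip(JOBS, choice))
    for choice in product(SPINES, repeat=len(JOBS))
    if feasible(dict(zip(JOBS, choice)))
]
best = min(candidates, key=max_uplink_load)
print("aggregator placement:", best, "| bottleneck uplink load:", max_uplink_load(best))
```

Even in this toy form, the structure of the real problem is visible: the routing choice, the switch-side resource limit, and the congestion objective interact, which is why routes chosen without aggregator awareness can be far from optimal.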
Published in: IEEE/ACM Transactions on Networking ( Volume: 32, Issue: 5, October 2024)
Page(s): 4488 - 4502
Date of Publication: 08 July 2024


I. Introduction

Nowadays, machine learning (ML), especially deep learning, has demonstrated great capabilities and achieved remarkable success in numerous fields such as machine vision [1], natural language processing [2], weather prediction [3], content generation [4], and game playing [5]. With the development of ML, new and more advanced models are constantly being proposed, and both model sizes and training dataset scales are growing explosively. To complete model training in a reasonable time, distributed machine learning (DML), especially under the data-parallelism paradigm, has become an inevitable design choice. However, simply enlarging the cluster to increase compute capacity often fails to deliver the corresponding performance improvements. During data-parallel distributed training, to guarantee the convergence of the global model, workers have to periodically synchronize their locally computed gradients or updated model parameters [6]. As confirmed by recent studies [7], [8], [9], [10], as the scale of the training cluster increases, the communication cost of model synchronization gradually becomes a prominent performance bottleneck for the entire training process.
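To make the synchronization step described above concrete, the sketch below (assumed for illustration; it is not code from the paper and uses no real training framework) simulates parameter-server-style data-parallel training: each worker produces a local gradient, the gradients are averaged, and the averaged update is applied to the shared model. The per-step traffic converging on the aggregation point grows linearly with the number of workers, which is exactly the load that in-network aggregation offloads.

```python
# Toy data-parallel synchronization loop (illustrative only).
import random

NUM_WORKERS = 4
MODEL_SIZE = 8            # number of parameters (tiny, for illustration only)
LEARNING_RATE = 0.1
BYTES_PER_FLOAT = 4

model = [0.0] * MODEL_SIZE

def local_gradient(step, worker_id):
    """Stand-in for the gradient a worker computes on its own data shard."""
    random.seed(step * 100 + worker_id)          # deterministic toy gradients
    return [random.uniform(-1.0, 1.0) for _ in range(MODEL_SIZE)]

for step in range(3):
    # "push": every worker sends its full gradient toward the parameter server
    grads = [local_gradient(step, w) for w in range(NUM_WORKERS)]
    # aggregation: the server (or an in-network aggregator) averages the gradients
    avg = [sum(g[i] for g in grads) / NUM_WORKERS for i in range(MODEL_SIZE)]
    # "pull": the averaged update is applied to the shared model
    model = [p - LEARNING_RATE * g for p, g in zip(model, avg)]

    push_traffic = NUM_WORKERS * MODEL_SIZE * BYTES_PER_FLOAT
    print(f"step {step}: traffic converging on the aggregation point = {push_traffic} bytes")
```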

