I. Introduction
Over the past decade, machine learning techniques have achieved tremendous success and have been widely employed in applications such as email filtering, advertising recommendation, speech recognition, machine translation, and computer vision [1], [2], [3], [4], [5]. With the increasing popularity of machine learning and the rapid development of new technologies, the volume of training data for real-world learning tasks has grown from gigabytes to terabytes and even petabytes. Data-parallel distributed training has become the key to obtaining models over such massive amounts of data within a reasonable time [2], [3], [4].