I. Introduction
Over the past decade, machine learning techniques have achieved remarkable success and have been widely applied to tasks such as email filtering, advertising recommendation, speech recognition, machine translation, and computer vision [1]–[4]. With the increasing popularity of machine learning and the rapid development of new technologies, the amount of training data for a realistic learning task has grown from gigabytes to terabytes and even petabytes. Data-parallel distributed training has therefore become the key to obtaining a trained model over such massive data within a reasonable time [1]–[3].