Abstract:
Pre-trained Language Models in Natural Language Processing have become increasingly computationally expensive and memory demanding. The recently proposed computation-adaptive BERT models facilitate deployment in practical applications. Training such a BERT model involves jointly optimizing subnets of varying sizes, which is challenging because the subnets interfere with one another. The larger subnets in particular can deteriorate when there is a large performance gap between the smallest subnet and the super-net. In this work, we propose Neural grafting to boost BERT subnets, especially the larger ones. Specifically, we regard the less important sub-modules of a BERT model as less active and reactivate them via layer-wise Neural grafting. Experimental results show that the proposed method improves the average performance of BERT subnets on six datasets of the GLUE benchmark. The subnet that performs comparably to the super-net (BERT-Base) reduces inference latency by around 67% on GPU and 70% on CPU. Moreover, we compare two Neural grafting strategies under varied experimental settings, hoping to shed light on the application scenarios of Neural grafting.
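The following sketch is not the paper's Neural grafting procedure; it only illustrates, assuming the HuggingFace transformers and PyTorch packages, how a depth-reduced BERT subnet of the kind described above can be built and its inference latency compared against the full BERT-Base super-net. The helper name mean_latency and the 4-layer depth choice are hypothetical.

```python
# Illustrative sketch (not the paper's Neural grafting method): keep the first
# k encoder layers of bert-base-uncased as a "subnet" and compare its forward
# latency against the full model on the available device.
import time
import torch
from transformers import BertModel, BertTokenizer

def mean_latency(model, inputs, device, n_runs=20):
    """Average forward-pass time in milliseconds on the given device."""
    model = model.to(device).eval()
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        for _ in range(3):                      # warm-up forward passes
            model(**inputs)
        if device.type == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_runs):
            model(**inputs)
        if device.type == "cuda":
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    return elapsed / n_runs * 1e3

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("An example GLUE-style input sentence.", return_tensors="pt")

full = BertModel.from_pretrained("bert-base-uncased")

# Hypothetical subnet: keep only the first 4 of the 12 transformer layers.
subnet = BertModel.from_pretrained("bert-base-uncased")
subnet.encoder.layer = subnet.encoder.layer[:4]
subnet.config.num_hidden_layers = 4

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
t_full = mean_latency(full, inputs, device)
t_sub = mean_latency(subnet, inputs, device)
print(f"full BERT-Base: {t_full:.1f} ms, 4-layer subnet: {t_sub:.1f} ms "
      f"({1 - t_sub / t_full:.0%} latency reduction)")
```

The measured reduction depends on hardware, batch size, and sequence length; the 67%/70% figures reported in the abstract refer to the authors' own subnet and setup, not to this sketch.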
Published in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 04-10 June 2023
Date Added to IEEE Xplore: 05 May 2023