I. Introduction
The emergence of next-generation wireless networks, particularly 6G, heralds a transformative era in connectivity, ushering in a wide array of applications. Benefiting from data-oriented signal processing techniques, future wireless networks are expected not only to pursue accurate communication at the bit level but also to offer a range of new functionalities such as semantic-aware intelligent tasks. Joint signal processing at the transceiver is expected to improve the end-to-end system gain by sensing and exploiting the intrinsic nature of source signals. Among these scenarios, audio communication remains indispensable.

The traditional transceiver design for audio communication follows a divide-and-conquer paradigm. Audio codecs [1], [2], [3] are responsible for compressing the audio, while rate control works in cooperation with the codec to strategically allocate bits across and within audio frames, thereby optimizing the overall communication efficiency. Transmission robustness, in turn, is ensured by channel coding and other error control techniques. Channel coding is designed to achieve a low error rate on average. In practice, however, residual bit errors often remain and manifest as uncorrectable errors, leading to packet loss, in which case the general solution is to request retransmission of the lost packets. Retransmission, however, is suitable only for scenarios with short round-trip times (RTTs). For most real-time communication (RTC) applications, e.g., FaceTime and WeChat, audio frames are expected to be played as soon as they are decoded, so erroneous frames cannot be concealed via retransmission. Resending erroneous audio packets adds to the overall delay, ultimately leading to poor user quality of experience (QoE).