1. Introduction
Despite their many successes, deep neural networks remain prone to catastrophic forgetting [37], whereby a model's performance on old tasks degrades significantly while it learns to solve new tasks. Catastrophic forgetting has become a major challenge in continual learning (CL) scenarios, where the model is trained on a sequence of tasks with limited or no access to old training data. The ability to learn continually without forgetting is crucial to many real-world applications, such as computer vision [36], [46], intelligent robotics [30], and natural language processing [6], [23]. In these settings, an agent learns from a stream of new data or tasks, but training on the old data is restricted by storage limitations, growing training time, or even privacy concerns.