1. Introduction
Generative modeling plays a crucial role in machine learning, particularly in applications such as image [13], [14], [17], [38], [44], voice [37], [42], and text synthesis [2], [51]. Diffusion models have demonstrated impressive capabilities in producing high-quality samples across diverse domains. Compared with generative adversarial networks (GANs) [8] and variational autoencoders (VAEs) [20], diffusion models sidestep issues such as mode collapse and posterior collapse, resulting in a more stable training process.

However, the substantial computational cost of diffusion models remains a critical bottleneck hampering their widespread adoption. This cost can be attributed to two primary factors. First, diffusion models typically require hundreds of denoising steps to generate images, rendering the procedure considerably slower than that of GANs. Prior efforts [21], [27], [29], [44] have addressed this challenge by seeking shorter and more efficient sampling trajectories, thereby reducing the number of denoising steps. Second, the large network architectures of diffusion models demand considerable time and memory, particularly for foundation models pretrained on large-scale datasets, e.g., LDM [38] and Stable Diffusion. Our work tackles the latter challenge, focusing on the compression of diffusion models.