1. Introduction
In comparison to GAN-based models [7], [17], [25], contemporary state-of-the-art text-to-image (T2I) diffusion models exhibit enhanced capabilities in producing high-fidelity images [12], [37], [44], [49]. Given the remarkable cross-modality alignment of T2I models, generative techniques hold significant potential for enhancing image classification [2], [4]. For instance, a straightforward approach entails augmenting the existing training dataset with synthetic images generated by feeding categorical textual prompts to a T2I diffusion model. However, a review of prior approaches that employ T2I diffusion models for image classification reveals that the central challenge of generative data augmentation for domain-specific datasets is producing samples with both a faithful foreground and a diverse background. Depending on whether a reference image is used in the generative process, we divide these methods into two groups:
Figure 1. Strategies to expand domain-specific datasets for improved classification. Row 1: vanilla distillation from a pretrained text-to-image (T2I) model, which risks generating outputs with reduced faithfulness. Row 2: intra-class augmentation, which tends to yield samples with limited diversity in order to maintain high fidelity to the original class. Rows 3 and 4: our proposed inter-class augmentation, which edits a reference image using images from other classes in the training set, significantly enriching the dataset with more diverse samples.
Text-guided knowledge distillation [52], [57] generates new images from scratch using category-related prompts to expand the dataset. For off-the-shelf T2I models, such vanilla distillation presumes comprehensive knowledge of the target domain, which can be problematic for domain-specific datasets: insufficient domain knowledge easily renders the distillation less effective. For example, vanilla T2I models struggle to generate images that accurately represent specific bird species based solely on their names (see Row 1 of Fig. 1).
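As a rough illustration only, the following sketch shows what vanilla text-guided distillation amounts to in practice: synthesizing extra training images purely from categorical prompts, with no reference image. It assumes the Hugging Face diffusers library and a public Stable Diffusion checkpoint; the class names, prompt template, and output paths are hypothetical placeholders, not the settings used by the cited methods [52], [57].

```python
# Minimal sketch of vanilla text-guided distillation: synthesize extra training
# images from categorical prompts alone, with no reference image.
# Model ID, class names, prompt template, and paths are illustrative assumptions.
import os
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

class_names = ["Baltimore Oriole", "Indigo Bunting"]  # hypothetical fine-grained classes
os.makedirs("synthetic", exist_ok=True)

for name in class_names:
    prompt = f"a photo of a {name}, a type of bird"
    for i in range(4):  # a few synthetic samples per class
        image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
        image.save(f"synthetic/{name.replace(' ', '_')}_{i}.png")
```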
Generative data augmentation [1], [69] employs generative models to enhance existing images. DA-Fusion [58], for instance, translates a source image into multiple edited versions within the same class. This strategy, termed intra-class augmentation, primarily introduces intra-class variations. While it retains much of the original image's layout and visual details, it yields limited background diversity (see Row 2 of Fig. 1), and synthetic images with constrained diversity may not sufficiently enhance the model's ability to discern foreground concepts.
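For contrast, the sketch below illustrates intra-class augmentation with an off-the-shelf image-to-image (SDEdit-style) pipeline: a real training image is translated into several edited variants of the same class. This is only an illustration of the idea, not the exact DA-Fusion procedure; the file paths, prompt template, and strength values are assumptions.

```python
# Simplified sketch of intra-class augmentation: translate a real training image
# into several edited variants of the same class via an image-to-image pipeline.
# Not the exact DA-Fusion procedure; paths, prompt, and strengths are assumptions.
import os
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

source = Image.open("train/Baltimore_Oriole/0001.jpg").convert("RGB").resize((512, 512))
prompt = "a photo of a Baltimore Oriole, a type of bird"
os.makedirs("augmented", exist_ok=True)

# Lower `strength` keeps more of the source layout and details (higher fidelity,
# lower diversity); higher `strength` edits the image more aggressively.
for i, strength in enumerate([0.3, 0.5, 0.7]):
    variant = pipe(prompt=prompt, image=source, strength=strength,
                   guidance_scale=7.5).images[0]
    variant.save(f"augmented/Baltimore_Oriole_var{i}.png")
```

The strength knob makes the trade-off discussed above explicit: values low enough to preserve the foreground also preserve most of the background, which is exactly why intra-class augmentation yields limited background diversity.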