I. Introduction
Diffusion models, which are trained by adding noise to images and then learning to denoise them, are a deep learning technique that has been applied in fields such as image generation and noise removal [1]–[3]. In particular, methods that utilize diffusion models have been proposed to transform the pose of a subject in an image [4], and recently, pose transformation methods driven by input text prompts have been introduced [5]. However, existing prompt-based pose transformation methods have limitations in preserving details within the image, such as facial features or accessories [6]. For example, if a user transforms the pose of an input image using the text prompt “Pose with head turned to right”, details included in the image (e.g., a necklace) may not be preserved. As shown in Fig. 1, Output Image #1 preserves these details, whereas Output Image #2 loses them.