
Sound-Guided Semantic Image Manipulation



Abstract:

The recent success of generative models shows that leveraging a multi-modal embedding space makes it possible to manipulate an image using text information. However, manipulating an image with sources other than text, such as sound, is difficult due to the dynamic characteristics of those sources. In particular, sound can convey vivid emotions and dynamic expressions of the real world. Here, we propose a framework that directly encodes sound into the multi-modal (image-text) embedding space and manipulates an image from that space. Our audio encoder is trained to produce a latent representation from an audio input, which is forced to be aligned with the image and text representations in the multi-modal embedding space. We use a direct latent optimization method based on the aligned embeddings for sound-guided image manipulation. We also show that our method can mix different modalities, i.e., text and audio, which enriches the variety of image modifications. Experiments on zero-shot audio classification and semantic-level image classification show that our proposed model outperforms other text- and sound-guided state-of-the-art methods.
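As a toy illustration of how an audio encoder can be aligned with a shared image-text embedding space, the sketch below computes a CLIP-style contrastive (InfoNCE) loss over a batch of paired audio and text embeddings. All arrays, dimensions, and the temperature value are hypothetical stand-ins, not the paper's actual models or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)
BATCH, DIM = 4, 16  # toy batch size and embedding dimension

def l2norm(x):
    """L2-normalise each row so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Stand-ins for real model outputs: rows i of `audio` and `text` are
# embeddings of the i-th paired (audio clip, caption) example.
audio = l2norm(rng.normal(size=(BATCH, DIM)))  # audio-encoder outputs
text = l2norm(rng.normal(size=(BATCH, DIM)))   # frozen text-encoder outputs

tau = 0.07                       # temperature (illustrative value)
logits = audio @ text.T / tau    # pairwise similarity matrix

# InfoNCE: each audio clip should match its own caption (the diagonal)
# against all other captions in the batch.
log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -np.mean(np.diag(log_softmax))
print(f"contrastive loss: {loss:.3f}")
```

Minimising this loss pulls each audio embedding toward its paired text embedding and pushes it away from the other captions in the batch, which is what forces the audio representation into the existing image-text space.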
Date of Conference: 18-24 June 2022
Date Added to IEEE Xplore: 27 September 2022

Conference Location: New Orleans, LA, USA


1. Introduction

Image manipulation has been widely studied in the field of computer vision due to its usefulness in photo-realistic manipulation applications, social media image sharing, and image-based advertisement. An image can be used to transfer its style onto a target image [14], [13]. Modifying specific parts of a human face image, such as hairstyle or hair color, is also useful in image manipulation applications [49], [34]. The purpose of semantic image manipulation is to generate a novel image that preserves the identity of the source image while incorporating the semantic information of the user's intention. In this paper, we tackle the semantic image manipulation task, i.e., the task of modifying an image with user-provided semantic cues. To apply the user's intention to the image, a mixture of sketches and text has been used for image manipulation and synthesis [33], [49]. User intention can be expressed by drawing a sketch [33] or writing text with semantic meaning [49], [13].

Modified images with sound-guided semantic image manipulation. Our method manipulates source images (top row) according to user-provided sounds (middle row) to produce semantically manipulated images (bottom row).
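The manipulation step described in the abstract can be caricatured as direct latent optimization: starting from a latent code, follow the gradient that increases the similarity between the embedding of the generated image and the target sound embedding. The sketch below uses a random linear map as a stand-in for the generator-plus-encoder composition and a finite-difference gradient; every component here is an illustrative assumption, not the paper's actual StyleGAN/CLIP pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
D_LATENT, D_EMBED = 8, 16

# Hypothetical stand-in: G maps a latent code directly to an embedding.
# In the real method this would be a pretrained generator followed by a
# multi-modal image encoder.
G = rng.normal(size=(D_EMBED, D_LATENT))
target = rng.normal(size=D_EMBED)      # stand-in for the audio embedding
target /= np.linalg.norm(target)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def loss(w):
    # negative cosine similarity between the generated embedding and target
    return -cosine(G @ w, target)

def grad(w, eps=1e-5):
    # central finite differences; real code would use autograd instead
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (loss(w + e) - loss(w - e)) / (2 * eps)
    return g

w = rng.normal(size=D_LATENT)          # latent code being optimized
before = -loss(w)
for _ in range(300):                   # direct latent optimization loop
    w -= 0.1 * grad(w)
after = -loss(w)
print(f"cosine before: {before:.3f}, after: {after:.3f}")
```

The point of the toy is only the shape of the procedure: the generator and encoder stay frozen, and gradient descent moves the latent code alone until the resulting embedding aligns with the sound-derived target.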

References
1. R. Abdal, Y. Qin and P. Wonka, "Image2StyleGAN: How to embed images into the StyleGAN latent space?", Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4432-4441, 2019.
2. J.-B. Alayrac, A. Recasens, R. Schneider, R. Arandjelovic, J. Ramapuram, J. De Fauw, et al., "Self-supervised multimodal versatile networks", NeurIPS, vol. 2, no. 6, 2020.
3. Y. Aytar, C. Vondrick and A. Torralba, "See, hear, and read: Deep aligned representations", CoRR abs/1706.00932, 2017.
4. H. Brouwer, "Audio-reactive latent interpolations with StyleGAN", NeurIPS 2020 Workshop on Machine Learning for Creativity and Design, 2020.
5. F. Caba Heilbron, V. Escorcia, B. Ghanem and J. Carlos Niebles, "ActivityNet: A large-scale video benchmark for human activity understanding", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961-970, 2015.
6. H. Chen, W. Xie, A. Vedaldi and A. Zisserman, "VGGSound: A large-scale audio-visual dataset", ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 721-725, 2020.
7. L. Chen, S. Srivastava, Z. Duan and C. Xu, "Deep cross-modal audio-visual generation", Proceedings of the Thematic Workshops of ACM Multimedia 2017, pp. 349-357, 2017.
8. Y. Chen, Y. Xian, A. Koepke, Y. Shan and Z. Akata, "Distilling audio-visual knowledge by compositional contrastive learning", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7016-7025, 2021.
9. J. Deng, J. Guo, N. Xue and S. Zafeiriou, "ArcFace: Additive angular margin loss for deep face recognition", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4690-4699, 2019.
10. H. Dong, S. Yu, C. Wu and Y. Guo, "Semantic image synthesis via adversarial learning", Proceedings of the IEEE International Conference on Computer Vision, pp. 5706-5714, 2017.
11. A. El-Nouby, S. Sharma, H. Schulz, D. Hjelm, L. E. Asri, S. E. Kahou, et al., "Tell, draw, and repeat: Generating and modifying images based on continual linguistic instruction", Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10304-10312, 2019.
12. C. Fellbaum, "WordNet", Theory and Applications of Ontology: Computer Applications, pp. 231-243, 2010.
13. L. Gatys, A. Ecker and M. Bethge, "A neural algorithm of artistic style", Journal of Vision, vol. 16, no. 12, pp. 326-326, 2016.
14. L. A. Gatys, A. S. Ecker and M. Bethge, "Image style transfer using convolutional neural networks", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2414-2423, 2016.
15. J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, et al., "Audio Set: An ontology and human-labeled dataset for audio events", 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 776-780, 2017.
16. A. Guzhov, F. Raue, J. Hees and A. Dengel, "AudioCLIP: Extending CLIP to image, text and audio", 2021.
17. A. Guzhov, F. Raue, J. Hees and A. Dengel, "ESResNe(X)t-fbsp: Learning robust time-frequency transformation of audio", arXiv preprint, 2021.
18. W. Hao, Z. Zhang and H. Guan, "CMCGAN: A uniform framework for cross-modal visual-audio mutual generation", Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, 2018.
19. S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold et al., "CNN architectures for large-scale audio classification", 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 131-135, 2017.
20. D. Jeong, S. Doh and T. Kwon, "TräumerAI: Dreaming music with StyleGAN", arXiv preprint, 2021.
21. W. Jiang, N. Xu, J. Wang, C. Gao, J. Shi, Z. Lin, et al., "Language-guided global image editing via cross-modal cyclic mechanism", Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2115-2124, 2021.
22. T. Karras, S. Laine and T. Aila, "A style-based generator architecture for generative adversarial networks", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401-4410, 2019.
23. T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen and T. Aila, "Analyzing and improving the image quality of StyleGAN", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8110-8119, 2020.
24. W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev et al., "The Kinetics human action video dataset", arXiv preprint, 2017.
25. C.-C. Lee, W.-Y. Lin, Y.-T. Shih, P.-Y. Kuo and L. Su, "Crossing you in style: Cross-modal style transfer from music to visual arts", Proceedings of the 28th ACM International Conference on Multimedia, pp. 3219-3227, 2020.
26. B. Li, X. Qi, T. Lukasiewicz and P. H. Torr, "ManiGAN: Text-guided image manipulation", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7880-7889, 2020.
27. P. Mazumder, P. Singh, K. K. Parida and V. P. Namboodiri, "AVGZSLNet: Audio-visual generalized zero-shot learning by reconstructing label features from multi-modal embeddings", Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3090-3099, 2021.
28. A. Nagrani, S. Albanie and A. Zisserman, "Learnable PINs: Cross-modal embeddings for person identity", Proceedings of the European Conference on Computer Vision (ECCV), pp. 71-88, 2018.
29. S. Nam, Y. Kim and S. J. Kim, "Text-adaptive generative adversarial networks: Manipulating images with natural language", Advances in Neural Information Processing Systems (NeurIPS), 2018.
30. T.-H. Oh, T. Dekel, C. Kim, I. Mosseri, W. T. Freeman, M. Rubinstein, et al., "Speech2Face: Learning the face behind a voice", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7539-7548, 2019.