EmoStyle: Emotion-Aware Semantic Image Manipulation with Audio Guidance

Qiwei Shen, Junjie Xu, Jiahao Mei, Xingjiao Wu, Daoguo Dong

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

With the rapid development of generative models, image manipulation is receiving increasing attention. Beyond the text modality, several designs have explored using audio to manipulate images. However, existing methods mainly focus on image generation conditioned on semantic alignment and ignore the affective information conveyed by the audio. We propose an Emotion-aware StyleGAN Manipulator (EmoStyle), a framework in which affective information from audio is explicitly extracted and then used during image manipulation. Specifically, we first use the multi-modal model ImageBind for initial cross-modal retrieval between images and music, and select the music-related image for further manipulation. Simultaneously, by extracting sentiment polarity from the lyrics of the audio, we generate an emotionally rich auxiliary music branch to accentuate the affective information. We then use pre-trained encoders to encode the audio and the audio-related image into the same embedding space. With the aligned embeddings, we manipulate the image via direct latent optimization. We conduct objective and subjective evaluations on the generated images, and the results show that our framework can generate images that express the specific human emotions conveyed in the audio.
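The retrieval and latent-optimization steps described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the embeddings are hand-written 2-D vectors rather than ImageBind outputs, the `encode` argument is a hypothetical stand-in for the pre-trained generator and encoder, gradients are taken by finite differences instead of backpropagation through StyleGAN, and the lyric-sentiment branch is omitted. The function names, the regularization weight `lam`, and the step size are all assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(audio_emb, image_embs):
    """Return the index of the image embedding closest to the audio embedding."""
    return max(range(len(image_embs)),
               key=lambda i: cosine(audio_emb, image_embs[i]))


def optimize_latent(w0, audio_emb, encode, steps=200, lr=0.1, lam=0.01, eps=1e-4):
    """Toy direct latent optimization.

    Minimizes (1 - cos(encode(w), audio_emb)) + lam * ||w - w0||^2 by
    gradient descent, using central finite differences for the gradient.
    The second term keeps the edited latent close to the starting latent.
    """
    def loss(w):
        reg = sum((a - b) ** 2 for a, b in zip(w, w0))
        return 1.0 - cosine(encode(w), audio_emb) + lam * reg

    w = list(w0)
    for _ in range(steps):
        grad = []
        for i in range(len(w)):
            w_plus = w[:]
            w_plus[i] += eps
            w_minus = w[:]
            w_minus[i] -= eps
            grad.append((loss(w_plus) - loss(w_minus)) / (2 * eps))
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


def identity_encoder(w):
    """Hypothetical stand-in: treats the latent itself as the image embedding."""
    return w


# Toy data: one audio embedding and three candidate image embeddings.
audio = [0.5, 1.0]
images = [[1.0, 0.0], [0.6, 0.8], [0.0, -1.0]]

best = retrieve(audio, images)  # pick the most audio-related image
edited = optimize_latent(images[best], audio, identity_encoder)
```

In a real pipeline the latent would be a StyleGAN code and `encode` would render an image and embed it, so the same loss would pull the generated image toward the audio embedding while the regularizer limits drift from the retrieved image.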

Original language: English
Article number: 3193
Journal: Applied Sciences (Switzerland)
Volume: 14
Issue number: 8
State: Published - Apr 2024
Externally published: Yes

Keywords

  • affective information
  • audio-based image manipulation
  • generative model
  • image manipulation
