Abstract
To address the problems of existing video-to-sound generation methods, such as noticeable noise, poor realism, and misalignment with the video, we propose a sound generation method based on timing-aligned visual feature mapping. First, we design a temporally constrained feature aggregation window that extracts an integrated visual feature from the video sequence. Second, the integrated visual feature is transformed into multi-frequency audio features by a spatio-temporally matched cross-modal mapping network. Finally, an audio decoder converts the audio features into a Mel-spectrogram, which is fed to a vocoder to output the final waveform. Qualitative and quantitative experiments on the VAS dataset show that the proposed method significantly improves audio quality, temporal alignment, and perceived realism.
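The three-stage pipeline the abstract describes (temporal aggregation → cross-modal mapping → Mel-spectrogram decoding) can be sketched as follows. This is a minimal illustration only: the function names, window size, feature dimensions, and the use of fixed linear projections in place of the paper's trained networks are all assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D, A, M = 32, 512, 256, 80   # frames, visual dim, audio-feature dim, mel bins
visual_feats = rng.standard_normal((T, D))  # per-frame visual features (hypothetical)

def aggregate_window(feats, win=4):
    """Temporally constrained aggregation: average each window of `win`
    consecutive frames into one integrated visual feature."""
    n = feats.shape[0] // win
    return feats[: n * win].reshape(n, win, -1).mean(axis=1)

# Stand-in for the cross-modal mapping network: a fixed linear projection
# from visual feature space to audio feature space (the paper uses a
# trained spatio-temporally matched network).
W_map = rng.standard_normal((D, A)) * 0.01

def cross_modal_map(v):
    return np.tanh(v @ W_map)

# Stand-in audio decoder: project each audio feature to a Mel-spectrogram frame.
W_dec = rng.standard_normal((A, M)) * 0.01

def decode_mel(a):
    return a @ W_dec

integrated = aggregate_window(visual_feats, win=4)  # (8, 512) integrated visual features
audio_feats = cross_modal_map(integrated)           # (8, 256) multi-frequency audio features
mel = decode_mel(audio_feats)                       # (8, 80)  Mel-spectrogram frames for a vocoder
```

Because each integrated feature is tied to a fixed window of video frames, each generated Mel frame inherits that window's timestamp, which is what keeps the output temporally aligned with the video.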
| Translated title of the contribution | Sound Generation Method with Timing-Aligned Visual Feature Mapping |
|---|---|
| Original language | Chinese (Traditional) |
| Pages (from-to) | 1506-1514 |
| Number of pages | 9 |
| Journal | Jisuanji Fuzhu Sheji Yu Tuxingxue Xuebao/Journal of Computer-Aided Design and Computer Graphics |
| Volume | 34 |
| Issue number | 10 |
| DOIs | |
| State | Published - Oct 2022 |
| Externally published | Yes |