Abstract
To address the problems of existing methods, such as noticeable noise, weak realism, and poor synchronization with the video, we propose a sound generation method based on timing-aligned visual feature mapping. First, we design a feature aggregation window based on temporal constraints, which extracts integrated visual features from the video sequence. Second, a cross-modal mapping network with spatio-temporal matching transforms the integrated visual features into multi-frequency audio features. Finally, an audio decoder produces a Mel-spectrogram from the audio features, and a vocoder converts it into the final waveform. We conducted qualitative and quantitative experiments on the VAS dataset, and the results show that the proposed method significantly improves audio quality, timing alignment, and audience perception.
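As a rough illustration of the first step only, the sketch below aggregates per-frame visual features over a temporal window so that each output row stays aligned with one video frame. The abstract does not describe the actual window design, so everything here is an assumption made for illustration: the centered window, the mean pooling, the edge padding, the `window` parameter, and the function name `aggregate_window`.

```python
import numpy as np


def aggregate_window(frame_feats: np.ndarray, window: int = 5) -> np.ndarray:
    """Mean-pool per-frame features over a centered temporal window.

    frame_feats: array of shape (T, D), one D-dimensional feature per frame.
    Returns an array of shape (T, D); row t summarizes the frames around t.
    Mean pooling is an assumed stand-in, not the paper's aggregation rule.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    num_frames = len(frame_feats)
    # Repeat the first and last frames so every window has the same length.
    padded = np.pad(frame_feats, ((half, half), (0, 0)), mode="edge")
    # Stack the shifted copies and average them along the window axis.
    stacked = np.stack([padded[i:i + num_frames] for i in range(window)])
    return stacked.mean(axis=0)


if __name__ == "__main__":
    feats = np.random.default_rng(0).normal(size=(8, 4))  # 8 frames, 4 dims
    print(aggregate_window(feats, window=3).shape)
```

Because the output keeps one row per input frame, a downstream mapping network could consume it without losing frame-level timing, which is the property the abstract emphasizes.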
| Translated title of the contribution | Sound Generation Method with Timing-Aligned Visual Feature Mapping |
|---|---|
| Original language | Traditional Chinese |
| Pages (from-to) | 1506-1514 |
| Number of pages | 9 |
| Journal | Jisuanji Fuzhu Sheji Yu Tuxingxue Xuebao/Journal of Computer-Aided Design and Computer Graphics |
| Volume | 34 |
| Issue | 10 |
| DOI | |
| Publication status | Published - Oct 2022 |
| Externally published | Yes |
Keywords
- auto-encoder
- cross-modal
- sound generation
- timing alignment