MFFCN: Multi-layer Feature Fusion Convolution Network for Audio-visual Speech Enhancement

EasyChair Preprint 4850, 6 pages • Date: January 2, 2021

Abstract

The purpose of speech enhancement is to extract the target speech signal from a mixture of sounds generated by several sources. Speech enhancement can potentially benefit from visual information about the target speaker, such as lip movements and facial expressions, because the visual aspect of speech is essentially unaffected by the acoustic environment. To fuse audio and visual information, an audio-visual fusion strategy is proposed that goes beyond simple feature concatenation and learns to automatically align the two modalities, leading to more powerful representations that increase intelligibility in noisy conditions. The proposed model fuses audio-visual features layer by layer and feeds these fused features to each corresponding decoding layer. Experimental results show relative improvements of 6% to 24% on test sets over the audio-only modality, depending on the audio noise level. Moreover, PESQ increases significantly from 1.21 to 2.06 in our -15 dB SNR experiment.

Keyphrases: audio-visual, multi-layer feature fusion convolution network (MFFCN), speech enhancement
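The layer-by-layer fusion idea described in the abstract can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: the encoder, downsampling step, and concatenation-based fusion here are simplifying assumptions (the paper's model learns to align the two modalities rather than simply concatenating them), but the routing of one fused feature per encoder depth to the matching decoder layer mirrors the described architecture.

```python
import random

def downsample(x):
    # Stand-in for one convolution + downsampling encoder layer:
    # average adjacent pairs, halving the feature length.
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]

def encode(x, n_layers):
    # Run the toy encoder, keeping the feature map from every layer,
    # since each depth contributes one fused feature to the decoder.
    feats = []
    for _ in range(n_layers):
        x = downsample(x)
        feats.append(x)
    return feats

def fuse(audio_feat, visual_feat):
    # Simplest fusion strategy: concatenation along the feature axis.
    # The paper's model goes beyond this by learning an alignment.
    return audio_feat + visual_feat

# Toy inputs standing in for audio and visual encoder inputs.
audio = [random.random() for _ in range(64)]
visual = [random.random() for _ in range(64)]

audio_feats = encode(audio, 3)
visual_feats = encode(visual, 3)

# Layer-by-layer fusion: one fused feature per encoder depth,
# each routed to the corresponding decoding layer.
fused = [fuse(a, v) for a, v in zip(audio_feats, visual_feats)]
```

With three encoder layers and inputs of length 64, the per-layer features have lengths 32, 16, and 8, so the fused features fed to the three decoder layers have lengths 64, 32, and 16.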