Monaural Speech Enhancement with Selective Local and Non-Local Attention
Lea Girard
Faculty of Engineering and Information Technology, University of Melbourne, Melbourne, Australia
Mina Bae
Faculty of Engineering and Information Technology, University of Melbourne, Melbourne, Australia
Keywords: Monaural Speech Enhancement, Selective Attention, Deep Learning, Audio Signal Processing
Abstract
Monaural speech enhancement remains a formidable challenge in audio signal processing, primarily because a single-channel recording lacks the spatial cues that typically help separate target speech from background interference. Recent advances in deep learning have significantly improved the quality and intelligibility of enhanced speech, yet balancing the extraction of fine-grained local acoustic features against the modeling of global contextual dependencies remains an open problem. This paper presents a novel framework that integrates a selective local and non-local attention mechanism to dynamically model both short-term phonetic characteristics and long-term acoustic environments. The local attention module focuses on preserving transient speech components and high-frequency details, while the non-local attention module captures long-range dependencies, aiding the suppression of stationary and non-stationary noise over extended temporal receptive fields. Furthermore, a selective gating mechanism adaptively fuses the outputs of the two attention branches, allocating computational focus according to the instantaneous characteristics of the input signal. Comprehensive evaluations on standard benchmark datasets demonstrate that the proposed architecture achieves state-of-the-art performance across multiple objective metrics, including perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI). The results indicate that dynamically fusing local and global contexts significantly mitigates speech distortion and residual noise artifacts, particularly under low signal-to-noise ratio conditions.
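The gated fusion of a local and a non-local attention branch described in the abstract can be illustrated with a minimal sketch. This is not the architecture evaluated in the paper: the function names, the windowed self-attention used for the local branch, the per-frame sigmoid gate, and all numeric values are assumptions chosen only to make the idea concrete, and the features attend to themselves without learned projections.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(frames, window=None):
    """Scaled dot-product self-attention over a list of T feature vectors.

    window=None: every frame attends to all frames (non-local branch).
    window=k:    each frame attends only to frames within +/- k (local branch).
    """
    dim = len(frames[0])
    out = []
    for t, query in enumerate(frames):
        if window is None:
            lo, hi = 0, len(frames)
        else:
            lo, hi = max(0, t - window), min(len(frames), t + window + 1)
        keys = frames[lo:hi]
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in keys]
        weights = softmax(scores)
        out.append([sum(w * key[d] for w, key in zip(weights, keys))
                    for d in range(dim)])
    return out

def selective_fusion(frames, gate_weights, gate_bias=0.0, window=1):
    """Fuse the two branches with a per-frame gate g in (0, 1).

    Assumed gate form: sigmoid of a linear map applied to the
    concatenated local and non-local outputs of each frame.
    """
    local = self_attention(frames, window=window)
    non_local = self_attention(frames)
    fused, gates = [], []
    for loc, glob in zip(local, non_local):
        z = sum(w * v for w, v in zip(gate_weights, loc + glob)) + gate_bias
        g = 1.0 / (1.0 + math.exp(-z))
        gates.append(g)
        # Convex combination: g weights the local branch, 1 - g the non-local one.
        fused.append([g * a + (1.0 - g) * b for a, b in zip(loc, glob)])
    return fused, gates

# Toy input: 5 frames with 2 features each (hypothetical values).
frames = [[0.1, 0.3], [0.2, -0.1], [0.9, 0.8], [-0.4, 0.5], [0.0, 0.2]]
gate_weights = [0.5, -0.25, 0.1, 0.3]  # hypothetical, stands in for learned weights
fused, gates = selective_fusion(frames, gate_weights)
print(len(fused), len(fused[0]))  # 5 2
```

Because the gate is computed per frame, a frame dominated by a transient can lean on the local branch while a frame in steady background noise can lean on the non-local one. In a trained model the gate weights would be learned jointly with the attention parameters.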