IJIGSP Vol. 18, No. 2, 8 Apr. 2026
Speech Enhancement, Nested U-Net Architecture, Multi-Scale Feature Extraction, Dual-Path Higher-Order Information Interaction, Time-Frequency Attention, Feature Calibration, Long-Range Dependency Modeling
Speech enhancement plays a vital role in improving the perceptual quality and intelligibility of speech signals degraded by environmental noise, particularly in modern network-based communication and signal processing systems. Traditional U-Net architectures capture local spectral details effectively but struggle to model long-range dependencies and may propagate residual noise through skip connections. Transformer-based models provide strong global context modeling but often fail to retain fine-grained spectral cues. To overcome these limitations, this paper presents a Nested U-Net-based speech enhancement framework that incorporates Multi-Scale Feature Extraction, Feature Calibration, and a Dual-Path Higher-Order Information Interaction with Time-Frequency Attention module. Multi-Scale Feature Extraction blocks in both the encoder and decoder extract multi-resolution spectral patterns, while the nested topology strengthens hierarchical feature reuse. At the bottleneck, a stack of four Dual-Path Higher-Order Information Interaction with Time-Frequency Attention modules captures long-range temporal and spectral dependencies, and Feature Calibration adaptively filters encoder features to reduce noise transfer across skip connections. Extensive experiments on the Common Voice and LibriSpeech datasets demonstrate that the proposed model achieves superior perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and signal-to-distortion ratio (SDR) scores, particularly under moderate (0 dB) signal-to-noise ratio conditions. The results confirm that the framework delivers robust enhancement and consistently outperforms several recent state-of-the-art methods in terms of speech quality, intelligibility, and noise suppression.
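For illustration, the following minimal PyTorch sketch shows how two of the building blocks described above, a multi-scale feature extraction block and a time-frequency attention gate, can be realized. The class names, kernel sizes, reduction factor, and pooling-based gating here are our assumptions for exposition, not the authors' implementation.

import torch
import torch.nn as nn


class MultiScaleBlock(nn.Module):
    """Illustrative multi-scale feature extraction block (hypothetical design).

    Runs parallel 2-D convolutions with different kernel sizes over a
    (batch, channels, time, freq) spectrogram tensor and fuses the
    branches with a 1x1 convolution, so each output position mixes
    spectral patterns captured at several receptive-field sizes.
    """

    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2)
            for k in kernel_sizes
        )
        self.fuse = nn.Conv2d(channels * len(kernel_sizes), channels, 1)
        self.act = nn.PReLU()

    def forward(self, x):
        multi = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.act(self.fuse(multi)) + x  # residual connection


class TimeFrequencyAttention(nn.Module):
    """Illustrative time-frequency attention gate (hypothetical design).

    Pools out the frequency axis to form a temporal context and the
    time axis to form a spectral context, estimates a sigmoid mask
    along each axis, and rescales the feature map with the broadcast
    product of the two masks.
    """

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 1)

        def gate():
            return nn.Sequential(
                nn.Conv1d(channels, hidden, 1),
                nn.ReLU(inplace=True),
                nn.Conv1d(hidden, channels, 1),
                nn.Sigmoid(),
            )

        self.time_gate, self.freq_gate = gate(), gate()

    def forward(self, x):
        # x: (batch, channels, time, freq)
        t_mask = self.time_gate(x.mean(dim=3)).unsqueeze(3)  # (B, C, T, 1)
        f_mask = self.freq_gate(x.mean(dim=2)).unsqueeze(2)  # (B, C, 1, F)
        return x * t_mask * f_mask


if __name__ == "__main__":
    x = torch.randn(2, 32, 100, 161)  # (batch, channels, frames, bins)
    y = TimeFrequencyAttention(32)(MultiScaleBlock(32)(x))
    print(y.shape)  # torch.Size([2, 32, 100, 161])

The broadcast product of the two sigmoid masks forms a separable time-frequency gate, which is far cheaper than full 2-D self-attention while still modulating the feature map along both the temporal and spectral axes.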
Shaik AreefaBegam, Sunnydayal Vanambathina, "Nested U-Net-Based Speech Enhancement with Multi-Scale Feature Extraction and Dual-Path Time-Frequency Feature Modeling", International Journal of Image, Graphics and Signal Processing (IJIGSP), Vol. 18, No. 2, pp. 107-123, 2026. DOI: 10.5815/ijigsp.2026.02.07