Speech Enhancement Based on a Two-Branch Nested U-Net Architecture Using TS-Conformer

Pages: 133-150

Author(s)

Hanna Deepa Mallolu 1 Sunnydayal Vanambathina 1,*

1. School of Electronics Engineering, VIT-AP University, Amaravati, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijigsp.2026.03.07

Received: 22 Jan. 2026 / Revised: 25 Feb. 2026 / Accepted: 25 Mar. 2026 / Published: 8 Jun. 2026

Index Terms

Speech Enhancement, TS-Conformer, Nested U-Net, Two-Branch Decoding

Abstract

Transformers capture long-range dependencies well through self-attention, but they face several limitations in speech processing tasks. In particular, they can lack the inherent inductive biases needed to efficiently model the local, fine-grained temporal and spectral structures that are critical for speech perception, which results in suboptimal handling of fine details. To address this issue, this paper introduces a speech enhancement (SE) network that integrates a two-stage conformer (TS-Conformer) into a two-branch nested U-Net framework. The nested U-Net employs dual decoding branches for simultaneous spectral mapping and mask estimation, enabling complementary learning of speech characteristics. The TS-Conformer sequentially models temporal and frequency dependencies to improve contextual representation while maintaining local continuity. In addition, a complex feature extraction unit (CFEU-i) is incorporated to enhance multi-scale feature learning in the complex domain. By combining hierarchical feature extraction with sequential spectro-temporal modeling, the proposed method suppresses noise while preserving speech quality. Experimental results demonstrate that the proposed NUNet-Conformer outperforms recent SE approaches in terms of Signal-to-Distortion Ratio (SDR), Short-Time Objective Intelligibility (STOI), and Perceptual Evaluation of Speech Quality (PESQ).
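The two ideas named in the abstract, sequential time-then-frequency modeling and two-branch decoding, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `two_stage` and `fuse_branches`, the `(batch, channels, time, freq)` layout, the use of plain callables in place of conformer blocks, and the additive fusion of the masking and mapping branches are all assumptions made for illustration; the abstract does not specify how the two branch outputs are combined.

```python
import numpy as np


def two_stage(features, time_fn, freq_fn):
    """Apply a sequence model along time, then along frequency.

    features: array of shape (batch, channels, time, freq) (assumed layout).
    time_fn, freq_fn: hypothetical stand-ins for conformer blocks; each maps
    an array of shape (n_sequences, seq_len, channels) to the same shape.
    """
    b, c, t, f = features.shape
    # Stage 1: every frequency bin is treated as one sequence over time.
    x = features.transpose(0, 3, 2, 1).reshape(b * f, t, c)
    x = time_fn(x).reshape(b, f, t, c)
    # Stage 2: every time frame is treated as one sequence over frequency.
    x = x.transpose(0, 2, 1, 3).reshape(b * t, f, c)
    x = freq_fn(x).reshape(b, t, f, c)
    # Restore the (batch, channels, time, freq) layout.
    return x.transpose(0, 3, 1, 2)


def fuse_branches(noisy_spec, mask, mapped_spec):
    """Combine the two decoder outputs (assumed additive fusion).

    noisy_spec: complex spectrogram of the noisy input.
    mask: output of the masking branch, applied element-wise.
    mapped_spec: complex spectrogram from the spectral-mapping branch.
    """
    return mask * noisy_spec + mapped_spec


def identity(x):
    """Placeholder sequence model that leaves its input unchanged."""
    return x


# Small usage example with random features and placeholder sequence models.
rng = np.random.default_rng(0)
feats = rng.standard_normal((2, 4, 5, 3))
out = two_stage(feats, identity, identity)
print(out.shape)
```

In a trained model the placeholders would be replaced by learned blocks; the reshaping is the part that makes the temporal stage see one frequency bin at a time and the frequency stage see one frame at a time.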

Cite This Paper

Hanna Deepa Mallolu, Sunny Dayal Vanambathina, "Speech Enhancement Based on a Two-Branch Nested U-Net Architecture Using TS-Conformer", International Journal of Image, Graphics and Signal Processing (IJIGSP), Vol.18, No.3, pp. 133-150, 2026. DOI:10.5815/ijigsp.2026.03.07
