Shaik AreefaBegam

Workplace: School of Electronics Engineering, VIT-AP University, Amaravati-522237, Andhra Pradesh, India

E-mail: areefabegum.24phd7023@vitap.ac.in

ORCID: https://orcid.org/0009-0009-0541-7132

Research Interests: speech enhancement, statistical signal processing, blind source separation, machine learning

Biography

Shaik AreefaBegam was born in Jaggayyapet, Andhra Pradesh, India. She received her B.Tech. degree in Electronics and Communication Engineering from JNTU Kakinada, India, in 2012, and her M.Tech. degree in VLSI Design from JNTU Kakinada, India, in 2014. Her research interests include speech enhancement, statistical signal processing, blind source separation, and machine learning.

Author Articles
Nested U-Net-Based Speech Enhancement with Multi-Scale Feature Extraction and Dual-Path Time-Frequency Feature Modeling

By Shaik AreefaBegam, Sunnydayal Vanambathina

DOI: https://doi.org/10.5815/ijigsp.2026.02.07, Pub. Date: 8 Apr. 2026

Speech enhancement plays a vital role in improving the perceptual quality and intelligibility of speech signals degraded by environmental noise, particularly in modern communication networks and signal processing systems. Traditional U-Net architectures capture local spectral details effectively but struggle to model long-range dependencies and may propagate residual noise through skip connections. Transformer-based models provide strong global context modeling but often fail to retain fine-grained spectral cues. To overcome these limitations, this paper presents a Nested U-Net-based speech enhancement framework that incorporates Multi-Scale Feature Extraction, Feature Calibration, and a Dual-Path Higher-Order Information Interaction with Time-Frequency Attention module. The Multi-Scale Feature Extraction blocks in both the encoder and decoder extract multi-resolution spectral patterns, while the nested topology strengthens hierarchical feature reuse. At the bottleneck, a stack of four Dual-Path Higher-Order Information Interaction with Time-Frequency Attention modules captures long-range temporal and spectral dependencies, and Feature Calibration adaptively filters encoder features to reduce noise transfer through the skip connections. Extensive experiments on the Common Voice and LibriSpeech datasets demonstrate that the proposed model achieves superior perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and signal-to-distortion ratio (SDR) scores, particularly under moderate (0 dB) signal-to-noise ratio conditions. The results confirm that the framework provides robust enhancement performance and consistently outperforms several recent state-of-the-art methods in terms of speech quality, intelligibility, and noise suppression.
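The multi-scale feature extraction idea mentioned in the abstract can be illustrated with a minimal sketch: apply filters of several widths to each spectrogram frame and stack the results, so that later layers see both fine and coarse spectral patterns. This is a hypothetical toy illustration in NumPy, not the authors' implementation; the function name `multi_scale_features`, the box-filter kernels (standing in for learned convolutions), and the chosen kernel sizes are all assumptions.

```python
import numpy as np

def multi_scale_features(spectrogram, kernel_sizes=(3, 5, 7)):
    """Toy multi-scale feature extractor.

    Smooths each time frame along the frequency axis with averaging
    kernels of several widths and stacks the results along a new
    'scale' axis, mimicking parallel convolution branches.

    spectrogram: array of shape (frames, freq_bins).
    Returns: array of shape (len(kernel_sizes), frames, freq_bins).
    """
    scales = []
    for k in kernel_sizes:
        kernel = np.ones(k) / k  # box filter as a stand-in for a learned conv
        smoothed = np.stack(
            [np.convolve(frame, kernel, mode="same") for frame in spectrogram]
        )
        scales.append(smoothed)
    return np.stack(scales)

# Toy usage: 10 frames x 16 frequency bins of random "magnitude" values
spec = np.abs(np.random.default_rng(0).normal(size=(10, 16)))
feats = multi_scale_features(spec)
print(feats.shape)  # (3, 10, 16)
```

In the paper's architecture these branches would be learned convolutions whose outputs feed the nested encoder-decoder; the sketch only shows the shape bookkeeping of combining several receptive-field sizes.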

Other Articles