Harika Vuyyuru

Work place: Department of Computer Science and Information Technology, K L University, Vaddeswaram, Guntur District 522302, India

E-mail: harikasiri2601@gmail.com

Website: https://orcid.org/0009-0004-1901-7504

Research Interests:

Biography

Harika Vuyyuru completed her B.Tech in Computer Science and Information Technology (CSIT) from Koneru Lakshmaiah (KL) University. During her academic journey, she specialized in Machine Learning, Data Science, and Artificial Intelligence, areas that fascinated her because of their immense potential to transform the future of technology. As part of her academic research, she worked on a paper focused on multimodal emotion recognition, which explores how machines can better understand human emotions by combining multiple inputs such as speech, facial expressions, and text. This project allowed her to experiment with advanced deep learning models and multimodal data processing techniques. It not only strengthened her technical expertise but also gave her valuable research experience with real-world applications in fields such as healthcare, interactive systems, and digital well-being.

Author Articles
Weighted Late Fusion based Deep Attention Neural Network for Detecting Multi-Modal Emotion

By Srinivas P. V. V. S., Shaik Nazeera Khamar, Nohith Borusu, Mohan Guru Raghavendra Kota, Harika Vuyyuru, Sampath Patchigolla

DOI: https://doi.org/10.5815/ijigsp.2026.01.07, Pub. Date: 8 Feb. 2026

In the field of affective computing research, multi-modal emotion detection has gained popularity as a way to boost recognition robustness and overcome the limitations of processing a single type of data. Human emotions can be characterized through a variety of modalities, including physiological indicators, facial expressions, and neuroimaging techniques. Here, a novel deep attention mechanism is used for detecting multi-modal emotions. Initially, audio and video data are collected. For dimensionality reduction, the audio features are extracted using the Constant-Q chromagram and Mel-Frequency Cepstral Coefficients (MM-FC2). After extraction, audio feature generation is carried out by a Convolutional Dense Capsule Network (Conv_DCN). For the video data, key frame extraction is carried out using enhanced spatial-temporal and Second-Order Gaussian kernels; Second-Order Gaussian kernels are a powerful tool for extracting features from video data and converting them into a format suitable for image-based analysis. Video feature generation is then performed with DenseNet-169. Finally, all the extracted features are fused, and emotions are detected using a Weighted Late Fusion Deep Attention Neural Network (WLF_DAttNN). The method is implemented in Python and achieves an accuracy of 97% on the RAVDESS dataset and 96% on the CREMA-D dataset.
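To make the pipeline concrete, the sketch below illustrates two of the steps named in the abstract: extracting the Constant-Q chromagram and MFCC audio descriptors, and combining per-modality class probabilities through a simple weighted late fusion. This is a minimal illustration, not the authors' code; the librosa calls are standard, while the function names, pooling strategy, and fusion weights are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's implementation): audio descriptors
# from the abstract (MFCC + Constant-Q chromagram) and a weighted late fusion of
# per-modality class probabilities. Fusion weights here are placeholder values.
import librosa
import numpy as np

def extract_audio_features(wav_path, n_mfcc=13):
    """Return a clip-level audio descriptor: mean-pooled MFCC and Constant-Q chroma."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape (n_mfcc, frames)
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr)          # shape (12, frames)
    # Mean-pool over time to obtain a fixed-length vector per utterance.
    return np.concatenate([mfcc.mean(axis=1), chroma.mean(axis=1)])

def weighted_late_fusion(audio_probs, video_probs, w_audio=0.4, w_video=0.6):
    """Fuse class probabilities from two modality branches with scalar weights."""
    fused = w_audio * np.asarray(audio_probs) + w_video * np.asarray(video_probs)
    return fused / fused.sum()  # renormalize to a probability distribution

# Example: the two branches disagree; the weighted fusion favors the video branch.
audio_probs = [0.6, 0.3, 0.1]   # e.g. happy, sad, neutral
video_probs = [0.2, 0.7, 0.1]
print(weighted_late_fusion(audio_probs, video_probs).argmax())  # -> 1 (sad)
```

In the paper itself the fusion weights feed a deep attention network (WLF_DAttNN) rather than a fixed scalar combination; the fixed weights above simply show the late-fusion idea.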

Other Articles