International Journal of Innovative Research in Computer and Communication Engineering

ISSN Approved Journal | Impact factor: 8.771 | ESTD: 2013 | Follows UGC CARE Journal Norms and Guidelines



TITLE Multimodal Emotion Recognition using Video, Audio and Facial Expressions
ABSTRACT Emotion is a key component of human communication and decision-making. With the rapid growth of AI and human-computer interaction, there is a growing demand for systems that can infer and respond to human emotions in a robust and context-aware manner. Unimodal emotion recognition methods that rely only on facial expressions or only on speech often fail in real-world conditions affected by noise, occlusion and illumination changes. In this work we present a multimodal emotion recognition system that jointly exploits video-based facial expressions, facial landmarks and audio features extracted from speech. The system processes an input video by decomposing it into synchronized visual and audio streams. On the visual side, faces are detected, aligned using facial landmarks and passed through a ResNet-34 convolutional neural network (CNN) to learn discriminative appearance features. Geometric information is captured through a landmark multi-layer perceptron and fused with the CNN embedding. On the audio side, the speech track is converted into Mel-Frequency Cepstral Coefficients (MFCCs) and fed to a dedicated CNN that learns emotional cues from pitch, energy and spectral shape. High-level face and audio embeddings are concatenated at the feature level and modeled over time using a Long Short-Term Memory (LSTM) network, followed by fully connected layers with softmax activation to classify emotions such as happiness, sadness, anger, fear and neutral. The system is implemented in Python using PyTorch for deep learning and Streamlit for an interactive web interface. Experiments on FER facial images and the RAVDESS audio-visual dataset show that the proposed multimodal fusion model achieves higher accuracy and F1-score than unimodal baselines, especially under noisy or partially occluded conditions. The approach is suitable for applications in education, healthcare, customer support and emotion-aware virtual assistants.
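To make the described pipeline concrete, the sketch below shows one possible PyTorch realization of the feature-level fusion architecture from the abstract (ResNet-34 appearance branch, landmark MLP, MFCC CNN, concatenation, LSTM, softmax classifier). It is a minimal illustration, not the authors' implementation: the 68-point landmarks, 40 MFCC coefficients, 224x224 face crops, layer widths and the module names LandmarkMLP, AudioCNN and MultimodalEmotionNet are all assumptions chosen for clarity.

```python
# Minimal sketch of the multimodal fusion model described in the abstract.
# All sizes (68 landmarks, 40 MFCCs, 5 classes, hidden dims) are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet34


class LandmarkMLP(nn.Module):
    """Embeds flattened (x, y) facial landmarks (assumed 68 points) into a geometric feature."""
    def __init__(self, n_points: int = 68, out_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_points * 2, 256), nn.ReLU(),
            nn.Linear(256, out_dim), nn.ReLU(),
        )

    def forward(self, lm):                       # lm: (B, T, 2 * n_points)
        return self.net(lm)


class AudioCNN(nn.Module):
    """Learns emotional cues from per-frame MFCC patches (assumed 40 coefficients)."""
    def __init__(self, out_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, out_dim)

    def forward(self, mfcc):                     # mfcc: (B, T, 1, 40, n_audio_frames)
        b, t = mfcc.shape[:2]
        x = self.conv(mfcc.flatten(0, 1)).flatten(1)       # (B*T, 32)
        return self.fc(x).reshape(b, t, -1)                 # (B, T, out_dim)


class MultimodalEmotionNet(nn.Module):
    """ResNet-34 appearance + landmark MLP + audio CNN -> concatenation -> LSTM -> classifier."""
    def __init__(self, n_classes: int = 5):
        super().__init__()
        self.face_cnn = resnet34(weights=None)
        self.face_cnn.fc = nn.Identity()         # 512-d appearance embedding per frame
        self.landmark_mlp = LandmarkMLP()
        self.audio_cnn = AudioCNN()
        self.lstm = nn.LSTM(512 + 128 + 128, 256, batch_first=True)
        self.classifier = nn.Linear(256, n_classes)

    def forward(self, frames, landmarks, mfcc):
        # frames: (B, T, 3, 224, 224); landmarks: (B, T, 136); mfcc: (B, T, 1, 40, n_audio_frames)
        b, t = frames.shape[:2]
        face = self.face_cnn(frames.flatten(0, 1)).reshape(b, t, -1)   # (B, T, 512)
        geom = self.landmark_mlp(landmarks)                             # (B, T, 128)
        audio = self.audio_cnn(mfcc)                                    # (B, T, 128)
        fused = torch.cat([face, geom, audio], dim=-1)                  # feature-level fusion
        out, _ = self.lstm(fused)                                       # temporal modeling
        return self.classifier(out[:, -1])       # logits; apply softmax for class probabilities


# Example forward pass with random tensors: a batch of 2 clips, 8 sampled frames each.
model = MultimodalEmotionNet()
logits = model(torch.randn(2, 8, 3, 224, 224),
               torch.randn(2, 8, 136),
               torch.randn(2, 8, 1, 40, 50))
print(logits.shape)  # torch.Size([2, 5])
```

In this sketch the three per-frame embeddings are simply concatenated before the LSTM, matching the feature-level fusion described in the abstract; the choice of hidden sizes and the single-layer LSTM are placeholders that would be tuned in practice.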
AUTHOR DIVYA V. JINGADE, KISHORE S. K., MEHAK H., RANJAN RAVI BHAT, PROF. NAVYA K. G.
Department of Computer Science and Design, Bapuji Institute of Engineering and Technology, Davanagere, Karnataka, India
Assistant Professor, Department of Computer Science and Design, Bapuji Institute of Engineering and Technology, Davanagere, Karnataka, India
VOLUME 177
DOI 10.15680/IJIRCCE.2025.1312037
PDF pdf/37_Multimodal Emotion Recognition using Video, Audio and Facial Expressions.pdf
KEYWORDS