International Journal of Innovative Research in Computer and Communication Engineering
| TITLE | A Survey on Multimodal Human-Computer Interaction Systems using Voice, Vision and Gesture Recognition |
|---|---|
| ABSTRACT | Human-Computer Interaction (HCI) has evolved from traditional keyboard-and-mouse interfaces to intelligent systems capable of understanding natural human behavior. Multimodal interaction systems integrate multiple communication modalities, such as voice commands, face recognition, gesture recognition, and visual perception, to improve interaction efficiency and accessibility. This survey presents an overview of multimodal HCI systems, focusing on AI-based virtual assistants that combine speech recognition, computer vision, and gesture-based interaction. Recent advances in deep learning and real-time processing have enabled intelligent assistants to perform tasks such as face detection, object detection, and application control through voice commands. Technologies such as speech-to-text engines, YOLO-based object detection, and graphical user interfaces play a significant role in developing robust multimodal systems. This paper reviews existing multimodal interaction approaches, discusses commonly used algorithms and tools, compares different techniques, and highlights major challenges such as latency, accuracy, and computational complexity. The survey is intended to serve as a foundation for a future project that aims to design and implement a real-time multimodal virtual assistant capable of interacting with users through voice, vision, and gestures. (An illustrative sketch of such a pipeline appears after the table below.) |
| AUTHOR | Ankita Yadav, Anushri Raut, Hriday Panchmukh, Rajbeer Sachar; Diploma Students, Dept. of A.N., AISSMS Polytechnic, Pune, India |
| VOLUME | 180 |
| DOI | 10.15680/IJIRCCE.2026.1401087 |
| PDF | pdf/87_A Survey on Multimodal Human-Computer Interaction Systems using Voice, Vision and Gesture.pdf |
| KEYWORDS | |
| REFERENCES | 1. N. Mohamed, M. B. Mustafa, and N. Jomhari, “A Review of the Hand Gesture Recognition System: Current Progress and Future Directions,” IEEE Access, vol. 9, pp. 152785–152806, 2021. 2. H. M. Yishak and L. Li, “Advanced Face Detection with YOLOv8: Implementation and Integration into AI Modules,” Open Access Library Journal, vol. 11, p. e112474, 2024. Available: https://doi.org/10.4236/oalib. 3. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779–788. 4. D. Bhonde, K. Mongse, L. Naikwar, N. Dwivedi, and O. Mahulkar, “Gesture and Voice-Based Personal Computer Control System,” International Journal on Advanced Electrical and Computer Engineering, vol. 14, no. 1, 2025. 5. T. Baltrušaitis, C. Ahuja, and L.-P. Morency, “Multimodal Machine Learning: A Survey and Taxonomy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 2, pp. 423–443, 2019. 6. S. Oviatt, “Multimodal Interfaces,” in The Human–Computer Interaction Handbook, 2nd ed., CRC Press, 2012, pp. 413–432. |
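
To make the pipeline described in the abstract concrete, the following is a minimal sketch of a voice-driven detection loop of the kind this survey covers: a speech-to-text engine listens for a spoken command, and a YOLO detector processes a webcam frame when asked. This is an illustrative sketch under stated assumptions, not the surveyed system's implementation: the `ultralytics` YOLOv8 model file (`yolov8n.pt`), the Google Web Speech recognizer, the command words, and the camera index are all assumptions chosen for brevity.

```python
# Minimal sketch of a voice-driven object-detection loop.
# Assumes the `ultralytics`, `SpeechRecognition`, `opencv-python`,
# and `pyaudio` packages are installed. Model name, command words,
# and camera index are illustrative assumptions, not details taken
# from the surveyed paper.

import cv2
import speech_recognition as sr
from ultralytics import YOLO


def listen_for_command(recognizer: sr.Recognizer) -> str:
    """Capture one utterance from the default microphone and transcribe it."""
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source, duration=0.5)
        audio = recognizer.listen(source, phrase_time_limit=3)
    try:
        return recognizer.recognize_google(audio).lower()
    except (sr.UnknownValueError, sr.RequestError):
        return ""  # speech was unintelligible or the service was unreachable


def main() -> None:
    model = YOLO("yolov8n.pt")    # small pretrained detector (assumed choice)
    recognizer = sr.Recognizer()
    camera = cv2.VideoCapture(0)  # default webcam

    print("Say 'detect' to run object detection, 'quit' to exit.")
    while True:
        command = listen_for_command(recognizer)
        if "quit" in command:
            break
        if "detect" in command:
            ok, frame = camera.read()
            if not ok:
                continue
            results = model(frame)         # run YOLO on one frame
            annotated = results[0].plot()  # draw boxes and class labels
            cv2.imshow("Detections", annotated)
            cv2.waitKey(1)

    camera.release()
    cv2.destroyAllWindows()


if __name__ == "__main__":
    main()
```

Even this toy loop surfaces the challenges the abstract highlights: the blocking `listen` call adds latency between command and detection, and running recognition and detection concurrently (e.g., in separate threads) is a common way real systems trade computational complexity for responsiveness.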