Vol. 2 No. 07 (2025): International Journal of Science and Technology

MULTIMODAL SPEECH PROCESSING

Published 16-05-2025

How to Cite

MULTIMODAL SPEECH PROCESSING. (2025). INTERNATIONAL JOURNAL OF SCIENCE AND TECHNOLOGY, 2(07), 33-38. https://doi.org/10.70728/tech.v2.i07.0012

Abstract

The aim of this research is to evaluate how effectively multimodal speech processing techniques improve speech recognition accuracy. The key issue it addresses is the integration of audio, visual, and contextual cues in real-time applications. Addressing this issue requires data from diverse speech datasets that span several modalities, including video input, audio recordings, and contextual language information.

This study examines the effectiveness of multimodal speech processing techniques in enhancing speech recognition accuracy, focusing on the integration of audio, visual, and contextual cues for real-time applications. It identifies significant challenges in current speech recognition systems and emphasizes the need for diverse speech datasets that combine modalities such as video input, audio recordings, and contextual language information. Key findings show that combining visual cues with auditory signals considerably improves recognition rates, particularly in environments with ambient noise, and markedly reduces the misunderstanding of critical messages. These findings are especially significant for the healthcare sector, where effective communication between patients and providers is essential for accurate diagnosis and treatment. Improved speech recognition can enhance telemedicine services, assistive technologies for individuals with speech impairments, and patient-provider interactions in general, ultimately leading to better health outcomes. The study contributes to the existing body of knowledge in speech processing and highlights the potential for transformative applications in healthcare, encouraging future research and development that uses multimodal approaches to improve communication in clinical settings.
