Model and Algorithms for Classifying Anomalous Phenomena Based on the Convergence of Acoustic-Visual Signals

Authors

  • N. Ravshanov, Digital Technologies and Artificial Intelligence Development Research Institute
  • B.I. Boborakhimov, Digital Technologies and Artificial Intelligence Development Research Institute
  • M.I. Berdiev, National Guard of the Republic of Uzbekistan

DOI:

https://doi.org/10.71310/pcam.6_70.2025.07

Keywords:

dynamic weighting, spatiotemporal representation learning, attention-based alignment, robust anomaly recognition, real-world surveillance

Abstract

This paper proposes a Context-adaptive Audio-Visual Neural Network (CAVN) model for anomaly detection in public safety systems. Existing approaches rely primarily on visual data and employ simple fusion strategies for combining modalities, which limits their ability to capture complex semantic relationships. The proposed model consists of four main components: a visual feature extraction module based on the SlowFast architecture, an audio feature extraction module based on the Audio Spectrogram Transformer (AST), a fusion module based on a bidirectional cross-attention mechanism, and a temporal context aggregation module based on a Transformer encoder. The main scientific novelty of the model lies in its adaptive modality balancing mechanism, which dynamically adjusts the relative importance of the modalities under different conditions (dark/bright, noisy/quiet). Experimental results demonstrate that the proposed CAVN model outperforms existing methods in both overall accuracy and accuracy under dark conditions. Ablation studies confirm the contribution of each module to the overall performance.
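The abstract describes the architecture only at a high level. Below is a minimal PyTorch sketch of the two fusion-side components it names: bidirectional cross-attention between audio and visual token sequences with an adaptive modality-balancing gate, followed by Transformer-encoder temporal aggregation. All dimensions, the mean-pooled gating formulation, and the two-class head are illustrative assumptions, not the authors' published implementation.

import torch
import torch.nn as nn


class BidirectionalCrossAttentionFusion(nn.Module):
    """Fuses visual and audio token sequences with cross-attention in both
    directions, then re-weights the two modalities with a learned gate.
    Dimensions and gating design are illustrative assumptions."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.v2a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.a2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Adaptive modality balancing: a gate over pooled features yields
        # per-sample weights, so e.g. a dark scene can lean on audio.
        self.gate = nn.Sequential(nn.Linear(2 * dim, 2), nn.Softmax(dim=-1))

    def forward(self, vis: torch.Tensor, aud: torch.Tensor) -> torch.Tensor:
        # vis: (B, Tv, dim) visual tokens; aud: (B, Ta, dim) audio tokens.
        vis_ctx, _ = self.v2a(vis, aud, aud)  # visual queries attend to audio
        aud_ctx, _ = self.a2v(aud, vis, vis)  # audio queries attend to visual
        v_pool, a_pool = vis_ctx.mean(dim=1), aud_ctx.mean(dim=1)  # (B, dim)
        w = self.gate(torch.cat([v_pool, a_pool], dim=-1))         # (B, 2)
        return w[:, :1] * v_pool + w[:, 1:] * a_pool               # (B, dim)


class TemporalAggregator(nn.Module):
    """Aggregates per-segment fused features over time with a Transformer
    encoder and classifies the clip as normal vs. anomalous."""

    def __init__(self, dim: int = 512, num_heads: int = 8, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(dim, 2)

    def forward(self, fused_seq: torch.Tensor) -> torch.Tensor:
        # fused_seq: (B, T, dim), one fused vector per temporal segment.
        return self.head(self.encoder(fused_seq).mean(dim=1))


if __name__ == "__main__":
    fusion, temporal = BidirectionalCrossAttentionFusion(), TemporalAggregator()
    vis = torch.randn(4, 32, 512)  # stand-in for projected SlowFast features
    aud = torch.randn(4, 64, 512)  # stand-in for projected AST embeddings
    fused = torch.stack([fusion(vis, aud) for _ in range(8)], dim=1)  # (4, 8, 512)
    print(temporal(fused).shape)   # torch.Size([4, 2])

In this sketch the gate's softmax weights stand in for the dynamic modality balance: inputs whose visual tokens carry little information (e.g., dark scenes) can shift weight toward the audio branch during training.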

References

Baltrusaitis T., Ahuja C., Morency L.P. 2019. Multimodal machine learning: A survey and taxonomy // IEEE Transactions on Pattern Analysis and Machine Intelligence. – Vol. 41, No. 2. – P. 423-443.

Feichtenhofer C., Fan H., Malik J., He K. 2019. SlowFast networks for video recognition // Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). – P. 6202-6211.

Arnab A., Dehghani M., Heigold G., Sun C., Lucic M., Schmid C. 2021. ViViT: A video vision transformer // Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). – P. 6836-6846.

Gong Y., Chung Y.A., Glass J. 2021. AST: Audio Spectrogram Transformer // Proceedings of Interspeech. – P. 571-575.

Gong Y., Lai C.I., Chung Y.A., Glass J. 2022. SSAST: Self-supervised audio spectrogram transformer // Proceedings of the AAAI Conference on Artificial Intelligence. – Vol. 36, No. 10. – P. 10699-10709.

Baevski A., Zhou Y., Mohamed A., Auli M. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations // Advances in Neural Information Processing Systems. – Vol. 33. – P. 12449-12460.

Nagrani A., Yang S., Arnab A., Jansen A., Schmid C., Sun C. 2021. Attention bottlenecks for multimodal fusion // Advances in Neural Information Processing Systems. – Vol. 34. – P. 14200-14213.

Huang P.Y., Sharma V., Xu H., Ryali C., Fan H., Li Y., Feichtenhofer C. 2023. MAViL: Masked audio-video learners // Advances in Neural Information Processing Systems. – Vol. 36.

Girdhar R., El-Nouby A., Liu Z., Singh M., Alwala K.V., Joulin A., Misra I. 2023. ImageBind: One embedding space to bind them all // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). – P. 15180-15190.

Lu J., Batra D., Parikh D., Lee S. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks // Advances in Neural Information Processing Systems. – Vol. 32. – P. 13-23.

Li J., Selvaraju R., Gotmare A., Joty S., Xiong C., Hoi S. 2021. Align before fuse: Vision and language representation learning with momentum distillation // Advances in Neural Information Processing Systems. – Vol. 34. – P. 9694-9705.

Radford A., Kim J.W., Hallacy C., Ramesh A., Goh G., Agarwal S., Sastry G., Askell A., Mishkin P., Clark J., Krueger G., Sutskever I. 2021. Learning transferable visual models from natural language supervision // Proceedings of the International Conference on Machine Learning (ICML). – P. 8748-8763.

Duong H.T., Le V.T., Hoang V.T. 2023. Deep learning-based anomaly detection in video surveillance: A survey // Sensors. – Vol. 23, No. 11. – Article 5024.

Nayak R., Pati U.C., Das S.K. 2021. A comprehensive review on deep learning-based methods for video anomaly detection // Image and Vision Computing. – Vol. 106.

Rezaee K., Rezakhani S.M., Khosravi M.R., Moghimi M.K. 2021. A survey on deep learning-based real-time crowd anomaly detection for secure distributed video surveillance // Personal and Ubiquitous Computing. – Vol. 28. – P. 135-151.

Georgescu M.I., Barbalau A., Ionescu R.T., Khan F.S., Popescu M., Shah M. 2021. Anomaly detection in video via self-supervised and multi-task learning // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). – P. 12742-12752.

Wang L., Tian J., Zhou S., Shi H., Hua G. 2023. Memory-augmented appearance-motion network for video anomaly detection // Pattern Recognition. – Vol. 138.

Liu Z., Nie Y., Long C., Zhang Q., Li G. 2021. A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction // Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). – P. 13588-13597.

Papoutsakis K., Papadogiorgaki M., Sarris N. 2024. Deep crowd anomaly detection: State-of-the-art, challenges, and future research directions // Artificial Intelligence Review. – Vol. 57.

Gao J., Yang H., Gong M., Li X. 2024. Audio-visual representation learning for anomaly events detection in crowds // Neurocomputing. – Vol. 582.

Wu P., Liu X., Liu J. 2022. Weakly supervised audio-visual violence detection // IEEE Transactions on Multimedia. – Vol. 25. – P. 4412-4423.

Leporowski B., Bakhtiarnia A., Bonnici N., Muscat A., Zanella L., Wang Y., Iosifidis A. 2023. Audio-visual dataset and method for anomaly detection in traffic videos. – http://arxiv.org/abs/2305.15084

Lin W., Gao J., Wang Q., Li X. 2021. Learning to detect anomaly events in crowd scenes from synthetic data // Neurocomputing. – Vol. 436. – P. 248-259.

Bamaqa A., Bahattab A., Khojandi B. 2022. SIMCD: Simulated crowd data for anomaly detection and prediction // Expert Systems with Applications. – Vol. 203.

Brousmiche M., Rouat J., Dupont S. 2022. Multimodal attentive fusion network for audio-visual event recognition // Information Fusion. – Vol. 85. – P. 52-59.

Shaikh M.B., Chai D., Islam S.M.S., Akhtar N. 2024. Multimodal fusion for audio-image and video action recognition // Neural Computing and Applications. – Vol. 36. – P. 5499-5513.

Middya A.I., Nag B., Roy S. 2022. Deep learning based multimodal emotion recognition using model-level fusion of audio-visual modalities // Knowledge-Based Systems. – Vol. 244.

He K., Zhang X., Ren S., Sun J. 2016. Deep residual learning for image recognition // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). – P. 770-778.

Hershey S. et al. 2017. CNN architectures for large-scale audio classification // Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). – P. 131-135.

Dosovitskiy A. et al. 2021. An image is worth 16x16 words: Transformers for image recognition at scale // Proceedings of the International Conference on Learning Representations (ICLR).

Vaswani A. et al. 2017. Attention is all you need // Advances in Neural Information Processing Systems. – Vol. 30. – P. 5998-6008.

Hendrycks D., Gimpel K. 2016. Gaussian error linear units (GELUs). – https://arxiv.org/abs/1606.08415

Lin T.-Y., Goyal P., Girshick R., He K., Dollár P. 2017. Focal loss for dense object detection // Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). – P. 2980-2988.

Published

2026-01-11

Section

Articles