Towards the Conceptual Retrieval of Multimedia Documentary: A Survey
Menoufia Journal of Electronic Engineering Research
Article 15, Volume 28, Issue 2, July 2019, Pages 259-286
Document Type: Original Article
DOI: 10.21608/mjeer.2019.62785
Authors
Ahmed Ghozia*; Gamal Attiya; Nawal El-Fishawy
Dept. of Computer Science and Engineering, Faculty of Electronic Engineering, Menoufia University, Egypt.
Abstract
Billions of active online users continuously feed the world with multimedia Big Data through their smartphones and PCs. These heterogeneous productions exist on different social media platforms, such as Facebook and Twitter, delivering a composite message in the form of audio, visual, and textual signals. Analyzing multimedia Big Data to understand the intended message has been a challenge for audio, image, video, and text processing researchers. Thanks to recent advances in deep learning algorithms, researchers have been able to improve the performance of multimedia Big Data analytics and understanding techniques. This paper presents a survey on how a multimedia file is analyzed, the key challenges facing multimedia analysis, and how deep learning is helping to conquer and advance beyond those challenges. Future directions of multimedia analysis are also addressed. The aim is to remain objective throughout this study, presenting both empowering enhancements and inescapable shortcomings, in the hope of raising fresh questions and stimulating new research frontiers for the reader.
Keywords
Multimedia analysis; video understanding; image classification; speech recognition; natural language processing; deep learning