Impact of Small Files on Hadoop Performance: Literature Survey and Open Points
Menoufia Journal of Electronic Engineering Research
Article 7, Volume 28, Issue 1, January 2019, Pages 109-120
Document Type: Original Article
DOI: 10.21608/mjeer.2019.62728
Authors
Tharwat EL-SAYED* 1; Mohamed Badawy 1; Ayman El-Sayed 2
1 Computer Science & Eng. Dept., Faculty of Electronic Eng., Menoufia University, Menouf 32952, Egypt
2 Computer Science & Eng. Dept., Faculty of Electronic Eng., Menoufia University, Menouf 32952, Egypt
Abstract
Hadoop is an open-source framework written in Java and used for big data processing. It consists of two main components: the Hadoop Distributed File System (HDFS) and MapReduce. HDFS stores the data, while MapReduce distributes an application's tasks across the cluster and processes them in parallel. Many researchers have recently employed Hadoop for processing big data. Their results indicate that Hadoop performs well with large files (files larger than the DataNode block size). However, Hadoop's performance degrades with small files, that is, files smaller than the block size. This is because small files consume memory on both the DataNode and the NameNode and increase the execution time of applications (i.e., they decrease MapReduce performance). In this paper, the small-files problem in Hadoop is defined, and the existing approaches to solving it are classified and discussed. In addition, some open points are presented that must be considered when designing a better approach to improving Hadoop performance on small files.
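To give a rough sense of the NameNode memory cost mentioned in the abstract, the following minimal sketch compares the metadata footprint of one large file with that of the same volume of data stored as many small files. It is an illustration under stated assumptions, not the paper's model: the figure of about 150 bytes per namespace object is a commonly cited rule of thumb, the 128 MiB block size is an assumed configuration, and the function names are hypothetical. Directories and replication are ignored.

```python
import math

# Assumptions (not taken from the paper):
BYTES_PER_OBJECT = 150            # rule-of-thumb NameNode heap cost per file or block
BLOCK_SIZE = 128 * 1024 ** 2      # assumed block size of 128 MiB


def namenode_objects(file_sizes, block_size=BLOCK_SIZE):
    """Count namespace objects: one per file plus one per block it occupies."""
    blocks = sum(math.ceil(size / block_size) for size in file_sizes)
    return len(file_sizes) + blocks


def namenode_bytes(file_sizes, block_size=BLOCK_SIZE,
                   bytes_per_object=BYTES_PER_OBJECT):
    """Estimate NameNode heap used by the metadata of the given files."""
    return namenode_objects(file_sizes, block_size) * bytes_per_object


total = 10 * 1024 ** 3                              # 10 GiB of data in both cases
one_large = [total]                                 # a single 10 GiB file
many_small = [1024 ** 2] * (total // 1024 ** 2)     # 10240 files of 1 MiB each

if __name__ == "__main__":
    print("one large file :", namenode_bytes(one_large), "bytes of metadata")
    print("many small files:", namenode_bytes(many_small), "bytes of metadata")
```

Under these assumptions the single large file needs 81 objects (1 file and 80 blocks), while the small files need 20480 objects (10240 files and 10240 blocks), so the same data volume costs the NameNode roughly 250 times more memory. Each small file would also typically become its own map task, which is one reason execution time grows as well.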
Keywords
Hadoop; Big Data; Small Files; Hive; HBase; Amazon EMR; S3DistCp