A Comparative Study for Arabic Text Classification Based on BOW and Mixed Words Representations

Sallam, Rouhia M.; Mousa, Hamdy; Hussien, Mahmoud

doi:10.21608/ijci.2016.33954

IJCI. International Journal of Computers and Information
Article 3, Volume 5, Issue 1, June 2016, Pages 24-34
Document Type: Original Article
DOI: 10.21608/ijci.2016.33954
Authors
Rouhia M. Sallam*¹; Hamdy Mousa²; Mahmoud Hussien³
¹Faculty of Applied Sciences, Taiz University, Yemen
²Faculty of Computers and Information, Menoufia University
³Faculty of Computers and Information, Menoufia University, Egypt
Abstract
This paper compares two feature representation methods for Arabic text classification. The first is bag of words (BOW), that is, word-level unigrams. The second is a mixed-words representation, which combines bag-of-words features with pairs of adjacent words in different proportions. The main objective is to measure the accuracy of each method and to determine which representation is more accurate for Arabic text classification. Each method uses normalization and stemming. The results show that the mixed-words representation achieves the highest accuracy, 98.61%, when normalization is used.
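The two representations can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how features are ranked or how the proportions are set, so this sketch assumes that terms are ranked by raw corpus frequency and that a hypothetical parameter `bigram_share` gives the fraction of the feature set drawn from adjacent word pairs. The function names are illustrative, and normalization and stemming are assumed to have been applied to the tokens beforehand.

```python
from collections import Counter


def bigrams(tokens):
    """Return adjacent word pairs as tuples, e.g. [("a", "b"), ("b", "c")]."""
    return list(zip(tokens, tokens[1:]))


def mixed_vocabulary(docs, n_features, bigram_share):
    """Build a feature list mixing single words and adjacent word pairs.

    Assumption: features are ranked by raw corpus frequency, and
    `bigram_share` (0.0 to 1.0) is the fraction of features taken from pairs.
    With bigram_share=0.0 this reduces to a plain bag-of-words vocabulary.
    """
    unigram_counts = Counter()
    bigram_counts = Counter()
    for tokens in docs:
        unigram_counts.update(tokens)
        bigram_counts.update(bigrams(tokens))

    n_bigrams = round(n_features * bigram_share)
    n_unigrams = n_features - n_bigrams
    vocab = [term for term, _ in unigram_counts.most_common(n_unigrams)]
    vocab += [pair for pair, _ in bigram_counts.most_common(n_bigrams)]
    return vocab


def vectorize(tokens, vocab):
    """Count how often each vocabulary entry (word or pair) occurs in a document."""
    counts = Counter(tokens)
    counts.update(bigrams(tokens))
    return [counts[term] for term in vocab]


if __name__ == "__main__":
    # Placeholder tokens stand in for normalized, stemmed Arabic words.
    docs = [["a", "b", "c"], ["a", "b", "d"]]
    bow_vocab = mixed_vocabulary(docs, n_features=4, bigram_share=0.0)
    mixed_vocab = mixed_vocabulary(docs, n_features=4, bigram_share=0.5)
    print(bow_vocab)
    print(mixed_vocab)
    print(vectorize(docs[1], mixed_vocab))
```

Storing pairs as tuples keeps them distinct from single-word features, so a word can never collide with a pair in the vocabulary. Varying `bigram_share` is one way to try the "different proportions" the abstract mentions.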
Keywords
Arabic Text Categorization; Frequency Ratio Accumulation Method; Term and Document Frequency; Features Selection; Bag of Words; Mixed Words
