A Comparative Study for Arabic Text Classification Based on BOW and Mixed Words Representations
IJCI. International Journal of Computers and Information
Article 3, Volume 5, Issue 1, June 2016, Page 24-34 PDF (646.14 K)
Document Type: Original Article
DOI: 10.21608/ijci.2016.33954
Authors
Rouhia M. Sallam
1Faculty of Applied Sciences, Taiz University, Yemen
2Faculty of Computers and Information, Menoufia University
3Faculty of Computers and Information, Menofia University, Egypt
Abstract
This paper compares two methods of feature representation for Arabic text classification: bag of words (BOW), meaning word-level unigrams, and a mixed-words representation. The mixed-words representation combines a bag of words with pairs of adjacent words in different proportions. The main objective is to measure the accuracy of each method and to determine which is more accurate for Arabic text classification under each representation mode. Each method is applied with normalization and stemming. The results show that the mixed-words representation achieves the highest accuracy, 98.61%, when normalization is used.
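The two representations can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `bigram_share` parameter, and the reading of "different proportions" as "keep a fraction of the most frequent adjacent-word pairs" are all assumptions made for illustration, and tokenization, normalization, and stemming are assumed to have been done already.

```python
from collections import Counter


def adjacent_pairs(tokens):
    """Return every pair of adjacent words as a single space-joined feature."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def bow_features(tokens):
    """Bag of words: word-level unigram counts."""
    return Counter(tokens)


def mixed_features(tokens, bigram_share=0.5):
    """Unigram counts plus a share of the adjacent-word pairs.

    bigram_share is a hypothetical knob: the fraction of distinct
    adjacent-word pairs (most frequent first) added to the unigrams.
    """
    features = bow_features(tokens)
    pairs = Counter(adjacent_pairs(tokens))
    keep = int(len(pairs) * bigram_share)
    features.update(dict(pairs.most_common(keep)))
    return features


# Example with already-tokenized Arabic text (illustrative input only).
tokens = ["جامعة", "القاهرة", "في", "مصر"]
print(bow_features(tokens))
print(mixed_features(tokens, bigram_share=1.0))
```

With `bigram_share=0.0` the mixed representation reduces to plain BOW, which makes it easy to compare the two feature sets on the same classifier.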
Keywords
Arabic Text Categorization; Frequency Ratio Accumulation Method; Term and Document Frequency; Feature Selection; Bag of Words and Mixed Words