A Novel Scalable and Effective Partitioning Approach for Big Data Reduction
IJCI. International Journal of Computers and Information
Article 2, Volume 6, Issue 1, January 2019, Pages 9-19
Document Type: Original Article
DOI: 10.21608/ijci.2019.35122
Authors
M. G. Malhat
1 Computer Science Dept., Faculty of Computers and Information, Menoufia University, Egypt
2 Computer Science Dept., Faculty of Computers and Information, Menofia University, Egypt
3 Faculty of Computer and Information, Menoufia University
4 Computer Science Dept., Faculty of Computers and Information, Menofia University, Egypt
Abstract
The continuous growth in data size makes traditional instance selection methods ineffective for reducing big training datasets on a single machine. Recent approaches to this problem partition the training dataset into subsets and then apply an instance selection method to each subset separately. However, the performance of instance selection methods applied to subsets degrades, especially as the number of partitioned subsets increases. In this work, we propose a novel, scalable, and effective automated partitioning approach called overlapped distance-based class-balance partitioning. This approach distributes the training instances among the partitioned subsets based on a given distance metric and ensures that the data classes are equally represented in the subsets. Moreover, an instance may be assigned to two subsets when it satisfies a dynamic threshold. We implement the proposed approach and empirically test its scalability and effectiveness using the condensed nearest neighbor method on eight standard datasets. The proposed approach is compared empirically and analytically with the stratification partitioning approach and with a non-overlapped version of our approach with respect to 1) the reduction rate, classification accuracy, and effectiveness metrics, and 2) scalability as the number of subsets increases. The comparison results demonstrate that our approach is more scalable and effective than the other partitioning approaches on these standard datasets.
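To make the partition-then-select pipeline concrete, the following is a minimal sketch of the baseline setup the abstract compares against: stratification partitioning followed by the condensed nearest neighbor (CNN) method on each subset. It does not implement the proposed overlapped distance-based class-balance partitioning, whose distance rule and dynamic threshold are not specified here. All function names, the round-robin stratification, the Euclidean metric, and the reduction-rate formula (fraction of instances removed) are illustrative assumptions.

```python
import math
from collections import defaultdict


def stratified_partition(y, n_subsets):
    """Spread the indices of each class round-robin over n_subsets (assumed scheme)."""
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)
    subsets = [[] for _ in range(n_subsets)]
    for indices in by_class.values():
        for k, idx in enumerate(indices):
            subsets[k % n_subsets].append(idx)
    return subsets


def nearest_label(point, X, y, store):
    """Label of the stored instance closest to point (1-NN, Euclidean distance)."""
    best = min(store, key=lambda i: math.dist(point, X[i]))
    return y[best]


def condensed_nn(X, y, indices):
    """Hart's CNN on one subset: keep only instances the current store misclassifies."""
    if not indices:
        return []
    store = [indices[0]]
    changed = True
    while changed:  # repeat passes until no instance is added
        changed = False
        for i in indices:
            if i in store:
                continue
            if nearest_label(X[i], X, y, store) != y[i]:
                store.append(i)
                changed = True
    return store


def reduce_by_partitions(X, y, n_subsets):
    """Apply CNN to each stratified subset separately and merge the retained indices."""
    selected = []
    for subset in stratified_partition(y, n_subsets):
        selected.extend(condensed_nn(X, y, subset))
    return sorted(selected)


def reduction_rate(n_original, n_selected):
    """Fraction of training instances removed (assumed definition)."""
    return 1.0 - n_selected / n_original


# Toy data: two well-separated classes of four points each.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
     (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
selected = reduce_by_partitions(X, y, n_subsets=2)
rate = reduction_rate(len(X), len(selected))
```

On this toy data each subset retains one instance per class, so half of the instances are removed. Because every subset is condensed independently, the merged result grows with the number of subsets even though a single instance per class would suffice here, which illustrates the kind of degradation the abstract attributes to increasing the subset count.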
Keywords
Big Data; Data Mining; Data Reduction; Instance Selection; Data Partitioning