A Novel Scalable and Effective Partitioning Approach for Big Data Reduction

Malhat, M. G.; Elmenshawy, M.; Mousa, Hamdy; Elsisi, A. B.

doi:10.21608/ijci.2019.35122

IJCI. International Journal of Computers and Information
Article 2, Volume 6, Issue 1, January 2019, Pages 9-19
Document Type: Original Article
DOI: 10.21608/ijci.2019.35122
Authors
M. G. Malhat¹; M. Elmenshawy²; Hamdy Mousa³; A. B. Elsisi⁴
¹Computer Science Dept., Faculty of Computers and Information, Menoufia University, Egypt
²Computer Science Dept., Faculty of Computers and Information, Menoufia University, Egypt
³Faculty of Computers and Information, Menoufia University
⁴Computer Science Dept., Faculty of Computers and Information, Menoufia University, Egypt
Abstract
The continuous growth of data size makes traditional instance selection methods ineffective for reducing big training datasets on a single machine. Recent approaches to this problem partition the training dataset into subsets before applying an instance selection method to each subset separately. However, the performance of the instance selection methods applied to the subsets degrades, especially as the number of subsets increases. In this work, we propose a novel scalable and effective automated partitioning approach, called overlapped distance-based class-balance partitioning. This approach distributes the training instances among the subsets based on a given distance metric and ensures that the data classes are equally represented in each subset. Moreover, an instance may be assigned to two subsets if it satisfies a dynamic threshold. We implement the proposed approach and empirically test its scalability and effectiveness using the condensed nearest neighbor method over eight standard datasets. The proposed approach is compared empirically and analytically with the stratification partitioning approach and a non-overlapped version of our approach with respect to 1) the reduction rate, classification accuracy, and effectiveness metrics, and 2) scalability as the number of subsets increases. The comparison results demonstrate that our approach is more scalable and effective than the other partitioning approaches on these standard datasets.
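For orientation, the following is a minimal sketch of the baseline pipeline the abstract compares against: stratified partitioning, then the condensed nearest neighbor (CNN) method applied to each subset, then the reduction rate. It is not the proposed overlapped distance-based class-balance approach, whose distance rule and dynamic threshold are not specified in the abstract. All function names are hypothetical, instances are assumed to be (feature tuple, label) pairs, and the reduction rate is assumed to be the fraction of training instances removed.

```python
import random
from collections import defaultdict


def sq_dist(a, b):
    # Squared Euclidean distance between two feature tuples.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_label(store, point):
    # Label of the stored instance closest to `point` (1-NN rule).
    return min(store, key=lambda inst: sq_dist(inst[0], point))[1]


def condensed_nn(instances):
    # Hart's CNN: start from one instance and keep adding every instance
    # that the current store misclassifies, until a full pass adds nothing.
    if not instances:
        return []
    kept = [0]
    kept_set = {0}
    changed = True
    while changed:
        changed = False
        for i, (features, label) in enumerate(instances):
            if i in kept_set:
                continue
            store = [instances[k] for k in kept]
            if nearest_label(store, features) != label:
                kept.append(i)
                kept_set.add(i)
                changed = True
    return [instances[k] for k in kept]


def stratified_partition(instances, n_subsets, seed=0):
    # Deal the shuffled members of each class round-robin over the subsets,
    # so every subset keeps roughly the original class proportions.
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for inst in instances:
        by_class[inst[1]].append(inst)
    subsets = [[] for _ in range(n_subsets)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for j, inst in enumerate(members):
            subsets[j % n_subsets].append(inst)
    return subsets


def reduce_by_partition(instances, n_subsets, seed=0):
    # Apply CNN to each subset independently and merge the selections.
    selected = []
    for subset in stratified_partition(instances, n_subsets, seed):
        selected.extend(condensed_nn(subset))
    reduction_rate = 1.0 - len(selected) / len(instances)
    return selected, reduction_rate


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic two-class data, used only to exercise the sketch.
    data = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(40)]
    data += [((rng.gauss(10, 1), rng.gauss(10, 1)), 1) for _ in range(40)]
    selected, rate = reduce_by_partition(data, n_subsets=4)
    print(len(selected), round(rate, 3))
```

Because each subset is condensed in isolation, instances near a class boundary in the full dataset may look unremarkable within a small subset. This is one plausible reading of why the abstract reports degraded performance as the number of subsets grows.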
Keywords
Big Data; Data mining; Data Reduction; Instance Selection; Data Partitioning
