METHODOLOGY FOR SELECTING MICROARRAY BIOMARKER GENES FOR CANCER CLASSIFICATION

El Houby, E; Yassin, N

doi:10.21608/ijicis.2015.10908

International Journal of Intelligent Computing and Information Sciences
Article 3, Volume 15, Issue 1, January 2015, Page 25-39
Document Type: Original Article
Authors
E El Houby¹; N Yassin²
¹Engineering Division, Systems & Information Department, National Research Centre, El Buhouth Street, Dokki, Cairo, Egypt
²Engineering Division, Systems & Information Department, National Research Centre, El Buhouth Street, Dokki, Cairo, Egypt
Abstract
In the analysis of microarray gene expression data, it is very difficult to obtain satisfactory classification results with machine learning techniques because of the dimensionality problem: gene expression data are very high dimensional, while datasets usually contain only a few tens of samples. Microarray data include many redundant and noisy genes, and numerous genes carry information that is irrelevant to classification. The best combination of gene selection and classification is required to identify biomarker genes from thousands of genes. In this research, a methodology has been developed to eliminate noisy, irrelevant, and redundant genes and to find a small set of significant, informative biomarker genes that can classify cancer datasets with high accuracy. The process consists of two phases: gene selection and classification. In the gene selection phase, the genes are ranked according to their ranking scores, using two statistical approaches: class separability and the t-test. Then, starting from the highest-ranked genes, different subsets of genes are used to classify the dataset until the highest possible accuracy is reached. Two data mining techniques are used for classification: K-Nearest Neighbor and Support Vector Machine. The proposed method has been used to classify 7 benchmark gene expression cancer datasets. The results show that the proposed methodology can identify a small subset of relevant predictive genes and can achieve high prediction accuracy with this small subset for different datasets. The accuracy and the subset of biomarker genes have been identified for the different cancer datasets.
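
The two-phase procedure described in the abstract (rank the genes, then classify with growing subsets of the top-ranked genes) might be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not give the exact scoring formula, subset sizes, validation scheme, or classifier settings, so the Welch t-statistic, leave-one-out validation, k = 3, the cap of 10 genes, and the synthetic toy data are all assumptions, and every function name is hypothetical. Only the t-test ranking and the K-Nearest Neighbor classifier are shown; class separability and the Support Vector Machine are left out.

```python
import math
import random


def t_score(a, b):
    """Absolute Welch t-statistic between two lists of expression values (assumed scoring)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    denom = math.sqrt(var_a / len(a) + var_b / len(b))
    return abs(mean_a - mean_b) / denom if denom > 0 else 0.0


def rank_genes(X, y):
    """Return gene (column) indices sorted from highest to lowest t-score."""
    scores = []
    for g in range(len(X[0])):
        class0 = [row[g] for row, label in zip(X, y) if label == 0]
        class1 = [row[g] for row, label in zip(X, y) if label == 1]
        scores.append(t_score(class0, class1))
    return sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k nearest training samples (Euclidean distance)."""
    nearest = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))[:k]
    votes = [train_y[i] for i in nearest]
    return max(set(votes), key=votes.count)


def loo_accuracy(X, y, genes, k=3):
    """Leave-one-out accuracy of k-NN using only the given genes."""
    Xs = [[row[g] for g in genes] for row in X]
    hits = 0
    for i in range(len(Xs)):
        train_X = Xs[:i] + Xs[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += knn_predict(train_X, train_y, Xs[i], k) == y[i]
    return hits / len(Xs)


def select_subset(X, y, max_genes=10, k=3):
    """Try the top 1, 2, ..., max_genes ranked genes; keep the smallest best subset."""
    ranking = rank_genes(X, y)
    best_genes, best_acc = [], -1.0
    for size in range(1, max_genes + 1):
        acc = loo_accuracy(X, y, ranking[:size], k)
        if acc > best_acc:  # strict '>' keeps the smaller subset on ties
            best_genes, best_acc = ranking[:size], acc
    return best_genes, best_acc


def make_toy_data(n_per_class=10, n_genes=50, shift=6.0, seed=0):
    """Synthetic data (assumption): genes 0 and 1 differ between classes, the rest are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            row[0] += shift * label
            row[1] -= shift * label
            X.append(row)
            y.append(label)
    return X, y


X, y = make_toy_data()
ranking = rank_genes(X, y)
genes, acc = select_subset(X, y)
print("selected genes:", genes, "leave-one-out accuracy:", acc)
```

For brevity the sketch ranks the genes once on all samples before validating. A more careful evaluation would repeat the ranking inside each validation fold, because ranking on the full dataset lets the held-out sample influence gene selection and can inflate the reported accuracy.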

