Use of biclustering for missing value imputation in gene expression data

Gene expression data shows the expression values of tens of thousands of genes under hundreds of experimental conditions [1]. The data is useful for applications such as analysis of cellular processes, prediction of gene functions and diagnosis of diseases [2, 3]. However, some values in gene expression data are missing because of image corruption, dust or scratches on the slides, or experimental errors. As many subsequent analysis tools work only on complete datasets, recovery of missing values is necessary. A straightforward approach is to repeat the experiment, but this might not be feasible for economic reasons or because samples are limited. Thus, computational imputation of missing values becomes necessary and crucial.

Early approaches to missing value imputation simply replace the missing values with zeros or row averages. Later, methods that explore coherence inside the gene expression data were developed. There are two main ways to explore the coherence information, namely the global and the local approaches [4]. The global approaches assume a global covariance structure in all genes [5-10], while the local approaches exploit local correlations existing in subsets of genes for estimation [11, 12]. Hybrid approaches that combine local and global information have also been proposed [13]. In addition, external information such as gene ontology [14], external datasets [15] or histone acetylation information [16] can be exploited for missing value imputation.
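The two early baselines mentioned above, zero replacement and row-average replacement, can be sketched as follows. This is a minimal illustration only: it assumes that rows correspond to genes and columns to conditions, and that missing entries are marked with NaN. The function names and the toy matrix are hypothetical and do not come from the article.

```python
import math


def impute_zero(matrix):
    """Replace every missing entry (NaN) with 0.0."""
    return [[0.0 if math.isnan(v) else v for v in row] for row in matrix]


def impute_row_average(matrix):
    """Replace each missing entry (NaN) with the mean of the observed
    values in the same row (assumed here to be one gene)."""
    result = []
    for row in matrix:
        observed = [v for v in row if not math.isnan(v)]
        # Assumption: fall back to 0.0 when a row has no observed values.
        fill = sum(observed) / len(observed) if observed else 0.0
        result.append([fill if math.isnan(v) else v for v in row])
    return result


nan = float("nan")
# Hypothetical toy data: 2 genes (rows) x 3 conditions (columns).
expression = [
    [1.0, nan, 3.0],
    [2.0, 4.0, nan],
]

zero_filled = impute_zero(expression)
row_filled = impute_row_average(expression)
```

Neither baseline uses any relationship between genes, which is the gap that the coherence-based methods described above aim to close.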

Recently, a multi-stage approach to clustering and missing value imputation was proposed [17]. The rationale is that both clustering and missing value imputation explore coherence inside the gene expression data. The combination of missing value imputation and clustering was found to improve the accuracy of missing value imputation [18]. In reality, however, related genes often co-express under certain conditions only [19]. Biclustering, which groups genes and conditions simultaneously, should therefore be performed to characterize coherence inside the gene expression data.

Hence, biclustering, instead of clustering, should be combined with missing value imputation. In this article, we present a framework that combines biclustering and missing value imputation. A model-based method is developed for imputing missing values inside biclusters. In this way, coherence found inside biclusters can be used to improve missing value imputation. As a result, both the accuracy of missing value imputation and the quality of the identified biclusters can be enhanced by the proposed framework.

(Author: K.O. Cheng, N.F. Law, W.C. Siu

Published by Sciedu Press)