Abstract
Using mRNA-Seq and clinical data for 469 clear cell renal cell carcinoma (ccRCC) samples from The Cancer Genome Atlas (TCGA), we develop a protocol to identify patients likely to have early recurrence of their disease. We first split the data into a training set of 380 samples and a test set of 89 samples. Using the training set, we identify genes whose outlier status (high or low mRNA expression) is predictive of recurrence, based on the log-rank p-value of the Kaplan-Meier (KM) recurrence-free survival curves. We find a significant overlap between the genes identified as predictive biomarkers in Reads Per Kilobase per Million mapped reads (RPKM) normalized data and those identified in Raw Reads mRNA-Seq data. Using 80 consensus genes that are predictive in both RPKM and Raw Reads data, we define an outlier-based risk score R to stratify patients into two groups: a high-risk (early recurrence) group (R < 2) and a low-risk (late recurrence) group (R > 2). The KM recurrence curves for this stratification show excellent separation in both the training and test sets. Restricting the analysis to patients who had recurrence within two years (109 cases) and those who had no recurrence in five years (107 cases), we find that the risk predictor achieves approximately 80% sensitivity and specificity. The 80 genes identified by the outlier analysis were then used to develop a more intuitive classifier based on Generalized Matrix Learning Vector Quantization (GMLVQ). This method stratifies samples into risk classes by defining class prototypes in feature space together with an appropriate distance metric. GMLVQ identified a subset of 12 genes that predict recurrence with high accuracy, which suggests that an assay based on a small number of genes might be able to predict recurrence in ccRCC.
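To make the prototype-and-metric idea behind GMLVQ concrete, the following is a minimal sketch of the classification step only, not the authors' implementation. It assumes the standard GMLVQ distance d(x, w) = (x - w)^T Λ (x - w) with Λ = Ω^T Ω, and assigns a sample to the class of its nearest prototype. The two-gene feature vectors, the prototype positions, the class labels, and the matrix Ω are all hypothetical toy values; in practice both the prototypes and Ω would be learned from the training set. The abstract does not state how the 12-gene subset was selected; reading the diagonal of Λ as a per-gene relevance, as done in `relevances`, is a common GMLVQ practice and is an assumption here.

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def gmlvq_distance(x: Vector, w: Vector, omega: Matrix) -> float:
    """Squared adaptive distance d(x, w) = (x - w)^T Omega^T Omega (x - w)."""
    diff = [xi - wi for xi, wi in zip(x, w)]
    # Project the difference with Omega, then take the squared Euclidean norm.
    projected = [sum(o * d for o, d in zip(row, diff)) for row in omega]
    return sum(p * p for p in projected)


def classify(x: Vector, prototypes: Sequence[Vector],
             labels: Sequence[str], omega: Matrix) -> str:
    """Return the label of the prototype nearest to x under the GMLVQ distance."""
    distances = [gmlvq_distance(x, w, omega) for w in prototypes]
    return labels[distances.index(min(distances))]


def relevances(omega: Matrix) -> List[float]:
    """Diagonal of Lambda = Omega^T Omega, one value per feature (gene)."""
    n_features = len(omega[0])
    return [sum(row[j] ** 2 for row in omega) for j in range(n_features)]


if __name__ == "__main__":
    # Hypothetical toy setup: two genes, one prototype per risk class.
    prototypes = [[2.0, 0.0], [0.0, 2.0]]
    labels = ["high-risk", "low-risk"]
    sample = [1.5, 1.8]

    # With the identity matrix the distance is plain squared Euclidean.
    identity = [[1.0, 0.0], [0.0, 1.0]]
    # A hypothetical learned Omega that down-weights the second gene.
    omega = [[1.0, 0.0], [0.0, 0.1]]

    print(classify(sample, prototypes, labels, identity))  # low-risk
    print(classify(sample, prototypes, labels, omega))     # high-risk
    print(relevances(omega))
```

In this toy example the same sample changes class once the metric down-weights the second gene, which illustrates why a learned relevance matrix can point to a small subset of informative genes.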