Evaluating the comparative performance of popular gene set test methods

Gene set testing is a method of characterizing differential expression between sample groups by grouping functionally related genes together into gene sets. Statistical tests for differential expression are performed on the gene sets rather than on individual genes. In recent years, a number of gene set test methods have been developed and independently tested, but little has been done to do a wide scale comparison of these gene set methods. We use a custom simulation framework to compare the statistical power and false discovery rate of a number of current gene set test methods (including mvGST, ROAST, CAMERA, ROMER, GlobalTest, GSA, PAGE, SAFE, sigPathway, and GSEAlm) over several combinations of realistic and relevant biological parameters. In total, 100 unique test scenarios were performed using 8,700 distinct simulated data sets. We found that all of the methods are subject to the classic trade-off between the power and the FDR. Some of the methods were more powerful but failed to maintain the FDR in some cases, while other methods maintained the FDR at the expense of power. In general, the gene set test methods either did not hold power or maintain the FDR in the presence of realistic conditions with sample sizes of 2 to 4 that are commonly used in real gene expression experiments. We also found that none of the methods performed well under realistic inter-gene correlation (dependent gene expression values within gene set) and conclude that further research and development is needed in this area.