Background: High-throughput techniques bring novel tools but also statistical
challenges to genomic research. Identifying genes with differential expression
between species is an effective way to discover evolutionarily conserved
transcriptional responses. To remove systematic variation between species and
allow a fair comparison, normalization serves as a crucial pre-processing step
that adjusts for varying sample sequencing depths and other confounding
technical effects.
Results: In this paper, we propose a scale-based normalization (SCBN) method
that combines the available knowledge of conserved orthologous genes with a
hypothesis testing framework. To account for the different gene lengths and
unmapped genes between species, we formulate the problem from the perspective
of hypothesis testing and search for the optimal scaling factor, namely the
one that minimizes the deviation between the empirical and nominal type I
errors.
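
The search for a scaling factor can be illustrated with a small Python sketch.
This is not the SCBN implementation. It assumes a Poisson model in which, given
the total count of a conserved gene, the count in the first species is binomial;
the "exposure" terms (standing in for gene length times sequencing depth), the
grid, the simulated data, and all function names are hypothetical choices made
for this example only.

```python
import math
import random

def two_sided_pvalue(x, y, ex, ey, c):
    """Normal-approximation p-value for H0: E[x]/ex == c * E[y]/ey.

    Assumption: under a Poisson model, conditional on n = x + y,
    x ~ Binomial(n, p) with p = c*ex / (c*ex + ey).
    """
    n = x + y
    if n == 0:
        return 1.0  # no reads, so no evidence against H0
    p = c * ex / (c * ex + ey)
    z = (x - n * p) / math.sqrt(n * p * (1.0 - p))
    return math.erfc(abs(z) / math.sqrt(2.0))

def empirical_type1_error(genes, c, alpha):
    """Fraction of conserved genes rejected at level alpha for a given c."""
    rejected = sum(two_sided_pvalue(*g, c) < alpha for g in genes)
    return rejected / len(genes)

def search_scaling_factor(genes, grid, alpha=0.05):
    """Grid search for the c minimizing |empirical - nominal| type I error."""
    return min(grid, key=lambda c: abs(empirical_type1_error(genes, c, alpha) - alpha))

def poisson(rng, lam):
    """Knuth's Poisson sampler (adequate for the moderate means used here)."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

# Simulated conserved genes: tuples (x, y, ex, ey), generated with c = 2.0.
rng = random.Random(0)
true_c = 2.0
genes = []
for _ in range(300):
    rate = rng.uniform(0.01, 0.05)   # expression per unit exposure
    ex = rng.uniform(500, 3000)      # exposure in species 1
    ey = rng.uniform(500, 3000)      # exposure in species 2
    genes.append((poisson(rng, true_c * rate * ex), poisson(rng, rate * ey), ex, ey))

grid = [0.5 + 0.05 * i for i in range(71)]   # candidate factors 0.5 .. 4.0
c_hat = search_scaling_factor(genes, grid)
print("selected scaling factor:", c_hat)
```

On data simulated this way, the selected factor should fall near the value used
to generate the counts (2.0 here). A finer grid or a root-finding search could
replace the coarse grid if more precision were needed.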
Conclusions: Simulation studies show that the proposed method performs
significantly better than the existing competitor in a wide range of settings.
Analysis of an RNA-seq dataset from different species supports the same
conclusion, with the proposed method outperforming the existing method. For
practical applications, we have also developed an R package named "SCBN",
which is available at
http://www.bioconductor.org/packages/devel/bioc/html/SCBN.html