Public Science Insights: Increase the precision of gene expression measurement in cancer research

Posted Wed, Aug 03, 2016

Expression arrays and next-generation sequencing technologies have been widely used to investigate the molecular mechanisms of many genetic diseases, including cancer. Despite the initial claim that normalization of sequencing data might not be necessary, many normalization methods have since been developed for sequencing data, including RNA-seq data. The goal of normalization is to provide an accurate measure of gene expression.

In the past, methods for RNA-seq data normalization focused primarily on two tasks: adjusting for variation and noise introduced by GC content and mappability, and defining a reference expression value for each sample so that gene expression can be compared across samples. In general, adjusting for GC content and mappability is straightforward; for example, a regression model can be fitted to remove the systematic variation from these two sources. Defining a reference expression value for each sample requires additional assumptions, for example that differentially expressed genes account for a very small portion of the genome, and that the median expression value of the non-differentially expressed genes is the same in all samples. Given the nature of most genomic diseases, these assumptions appear reasonable. However, this median-driven normalization approach is valid only if the disease under investigation does not involve large structural genomic alterations, such as copy number alterations (CNAs). Unfortunately, cancer is not one of those diseases.
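The two conventional steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any particular published method: expression is assumed to be on a log scale, the trend is assumed to be linear in GC content and mappability, and the function names (remove_gc_mappability_trend, center_on_median) are hypothetical.

```python
from statistics import median


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def remove_gc_mappability_trend(log_expr, gc, mappability):
    """Fit log_expr ~ 1 + gc + mappability by least squares and remove
    the fitted covariate effects, keeping the overall expression level.
    Assumes gc and mappability are not collinear (hypothetical helper)."""
    n = len(log_expr)
    design = [[1.0, g, m] for g, m in zip(gc, mappability)]
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(3)]
           for i in range(3)]
    xty = [sum(row[i] * y for row, y in zip(design, log_expr))
           for i in range(3)]
    _, beta_gc, beta_map = solve_linear_system(xtx, xty)
    mean_gc = sum(gc) / n
    mean_map = sum(mappability) / n
    return [y - beta_gc * (g - mean_gc) - beta_map * (m - mean_map)
            for y, g, m in zip(log_expr, gc, mappability)]


def center_on_median(log_expr):
    """Median-driven reference: subtract the sample's median log expression."""
    reference = median(log_expr)
    return [y - reference for y in log_expr]
```

Centering every sample on its own median is exactly the step that relies on the assumption that the median gene is comparable across samples, which is the assumption that copy number alterations can break.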

Before we explain the potential limitations of existing RNA-seq data normalization methods, let's take a look at how DNA sequencing data, such as whole genome sequencing and whole exome sequencing data, are normalized for samples with CNAs. Note that, unlike RNA, where adjacent genes might have huge differences in expression, adjacent DNA sequences almost always have exactly the same copy number unless there are structural alterations. Therefore, DNA sequencing data have much less variation than RNA-seq data, and very often the exact copy number of genomic regions can be estimated with high confidence. Consequently, normalization of DNA sequencing data is usually performed by using genomic regions with normal copy number (n=2) as the reference regions. Note that these regions may be very different from the regions with the median DNA copy number.
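A small worked example may make the difference between a normal-copy-number reference and a median reference concrete. The sketch below is illustrative only: the read depths are made up, the diploid flags are assumed to be known in advance, and both function names are hypothetical.

```python
from statistics import median


def estimate_copy_number(depth, is_diploid):
    """Scale read depth so that regions flagged as diploid have copy number 2.
    `is_diploid` is an assumed, externally supplied flag per region."""
    reference_depths = [d for d, flag in zip(depth, is_diploid) if flag]
    if not reference_depths:
        raise ValueError("at least one diploid reference region is required")
    reference = median(reference_depths)
    return [2.0 * d / reference for d in depth]


def estimate_copy_number_from_median(depth):
    """Naive alternative: treat the median-depth region as copy number 2."""
    reference = median(depth)
    return [2.0 * d / reference for d in depth]


# Toy sample in which half of the regions are amplified.
depth = [100, 100, 150, 150, 150, 200]
is_diploid = [True, True, False, False, False, False]

print(estimate_copy_number(depth, is_diploid))
# [2.0, 2.0, 3.0, 3.0, 3.0, 4.0]
print(estimate_copy_number_from_median(depth)[2])
# 2.0  (the three-copy regions are mistaken for normal)
```

Because most regions in this toy sample are amplified, the median depth sits on a three-copy region, so the median reference shifts every estimate downward.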

Ideally, we would apply the same strategy to RNA-seq data normalization, that is, define the reference gene(s) based on the expression values of genes with normal DNA copy number, since it is widely known that genomic regions with higher copy number are associated with higher gene expression due to the additional transcription templates. While this seems a reasonable approach, identifying genomic regions with normal copy number from RNA-seq data alone is extremely difficult, if not impossible, because RNA-seq data are far more variable than DNA sequencing data.

We recently proposed an integrated RNA-seq data normalization method that can improve the precision of gene expression measurement. Integrated analysis refers to combining data from different sources and analysing them together, so that results are obtained within a single estimation framework. More specifically, DNA and RNA sequencing data are obtained from the same patients, regions with normal DNA copy number are estimated from the DNA sequencing data, and RNA-seq data normalization is then performed using these regions as reference regions.
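The idea can be illustrated with a deliberately simplified sketch. It is not the published method: it assumes log2 expression values, per-gene integer copy numbers already called from the DNA data, and a reference set made of genes with copy number 2 in every sample being compared. All names and numbers are hypothetical.

```python
from statistics import median


def shared_diploid_genes(copy_number_by_sample):
    """Indices of genes whose DNA copy number is 2 in every sample."""
    n_genes = len(copy_number_by_sample[0])
    return [g for g in range(n_genes)
            if all(cn[g] == 2 for cn in copy_number_by_sample)]


def integrated_normalize(log_expr_by_sample, copy_number_by_sample):
    """Center each sample on the median log expression of the shared
    diploid genes (the reference set defined from the DNA data)."""
    ref_genes = shared_diploid_genes(copy_number_by_sample)
    if not ref_genes:
        raise ValueError("no gene has normal copy number in all samples")
    normalized = []
    for log_expr in log_expr_by_sample:
        reference = median([log_expr[g] for g in ref_genes])
        normalized.append([y - reference for y in log_expr])
    return normalized


def median_normalize(log_expr_by_sample):
    """Conventional alternative: center each sample on its own median."""
    return [[y - median(sample) for y in sample]
            for sample in log_expr_by_sample]


# Toy data: the tumor has one extra log2 unit of sequencing depth, and its
# last three genes are amplified (one further log2 unit of expression).
normal_expr = [5.0, 6.0, 7.0, 8.0, 9.0]
tumor_expr = [6.0, 7.0, 9.0, 10.0, 11.0]
normal_cn = [2, 2, 2, 2, 2]
tumor_cn = [2, 2, 4, 4, 4]

print(integrated_normalize([normal_expr, tumor_expr], [normal_cn, tumor_cn]))
# [[-0.5, 0.5, 1.5, 2.5, 3.5], [-0.5, 0.5, 2.5, 3.5, 4.5]]
print(median_normalize([normal_expr, tumor_expr]))
# [[-2.0, -1.0, 0.0, 1.0, 2.0], [-3.0, -2.0, 0.0, 1.0, 2.0]]
```

In this toy case the DNA-guided reference leaves the two unaltered genes unchanged between samples and shows the amplified genes as one unit higher, whereas per-sample median centering reports the unaltered genes as one unit lower and hides the amplification effect.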

Defining the reference regions is one of the most challenging issues in data normalization, especially RNA-seq data normalization. We anticipate that the integrated normalization approach will not only provide a tool for data normalization, but also shed light on how future cancer genomic studies might be planned and designed.

Dr. Shengping Yang and Dr. Zhide Fang are authors of the recently published paper "An Integrated Approach for RNA-seq Data Normalization", available for download now in Cancer Informatics.
