Gene Expression - What is it?

The genome (the full set of genes) is a store of biological information, but on its own it is unable to release that information to the cell. Utilization of the biological information requires the coordinated activity of enzymes and other proteins, which participate in a complex series of biochemical reactions referred to as genome expression.
The initial product of genome expression is the transcriptome, a collection of RNA molecules derived from those protein-coding genes whose biological information is required by the cell at a particular time. The transcriptome directs synthesis of the final product of genome expression, the proteome (the cell's set of proteins), in a process termed translation. The series of events in genome expression is shown below.
Figure: The gene expression process. Picture from Genomes 2 by T.A. Brown.

Video for Regulation of Gene expression.

Before moving on to microarrays, let's look at a few more facts about RNA. A typical bacterium contains 0.05 to 0.10 pg of RNA, making up about 6% of its total weight. A mammalian cell, being much larger, contains more RNA, 20 to 30 pg in all, but this represents only 1% of the cell as a whole. It is important to understand that not all of this RNA constitutes the transcriptome: only the RNA transcribed from protein-coding genes, which can be translated into protein, belongs to it. Most cellular RNA does not fall into this category. RNA can be classified as coding or noncoding: coding RNAs code for proteins and noncoding RNAs do not. The RNA content of the cell is summarized in the following diagram.
Figure: RNA classes. Taken from Genomes 2 by T.A. Brown.

Microarrays - why?
Genomics: Back in the twentieth century, scientists studied a small set of genes at a time, mutating them and analysing the proteins they encode. Analysing a much bigger set of genes this way, say the whole human genome with its tens of thousands of genes, is impractical when the analysis of a single gene can take years. Microarrays reduce this burden: in a single experiment, scientists can measure expression across the whole genome in a matter of hours. This has turned 20th-century genetics into 21st-century genomics, the study of thousands of genes at a time.

To get a brief idea of how a DNA microarray works, you can check the animation linked below.
This high-throughput expression profiling technique can be used to compare the level of gene transcription in clinical conditions in order to:
  • identify diagnostic or prognostic biomarkers;
  • classify diseases (e.g., tumors with different prognosis that are indistinguishable by microscopic examination);
  • monitor the response to therapy; and
  • understand the mechanisms involved in the genesis of disease processes.

Microarrays can be broadly classified according to at least three criteria:
  • length of the probes;
  • manufacturing method; and
  • number of samples that can be simultaneously profiled on one array.

According to the length of the probes, arrays can be classified into "complementary DNA (cDNA) arrays," which use long probes of hundreds or thousands of base pairs (bps), and "oligonucleotide arrays," which use short probes (usually 50 bps or less).
Manufacturing methods include "deposition" of previously synthesized sequences and "in-situ synthesis." In-situ technologies include photolithography, inkjet printing, and electrochemical synthesis.
The third criterion for the classification of microarrays refers to the number of samples that can be profiled on one array. Single-channel arrays analyze a single sample at a time, whereas multiple-channel arrays can analyze two or more samples simultaneously.

Types of studies that can be conducted with microarray data:
  • Finding differences in expression levels between predefined groups of samples. For example, identifying genes that are differentially expressed between obese and non-obese persons.
  • A second application is prediction of class (predictive modeling) based on gene expression profiling. An example would be predicting blood cancer from a blood expression profile.
  • The third type of application involves analyzing a given set of gene expression profiles with the goal of discovering subgroups that share common features. For example, the expression profiles of a large number of women with blood cancer could be measured with the goal of identifying subgroups of patients who have a similar gene expression profile.

Here Dr. Sean Davis (NCI) talks about microarray preprocessing.

In microarray experiments, the term "probe" describes the nucleotide sequence that is attached to the microarray surface. The word "target" refers to what is hybridized to the probes. In general, the probes are shorter than the genes. After the desired probes have been selected for all genes of interest, they are printed on a solid substrate. Samples containing the target genes are processed, labeled with a fluorescent dye (Cy3 or Cy5), hybridized to the array, and scanned, producing an image file.

Picture showing the Affymetrix Gene chip (taken from

Screen Shot 2013-12-14 at 1.42.19 PM.pngScreen Shot 2013-12-14 at 1.43.55 PM.png

In Affymetrix expression arrays, 25-mer oligonucleotide probes (that is single-stranded DNA of length 25 bases) are immobilized on an array, and each gene of interest has about a dozen 25-mers that match different portions of the mRNA.

The primary function of microarrays is to measure the abundance of mRNA for thousands of probes (or even more) simultaneously, to see which genes are being expressed in diseased cell types. The primary measurements are the array intensities, which are analyzed statistically to detect differences in gene expression between disease types.

Let us try to understand a toy example for the technique of microarray with differential gene expression.

This example is taken from the book A Practical Approach to Microarray Data Analysis by Berrar Daniel P.

Suppose we want to study the expression of four genes a, b, c, and d in two tumor types, A and B, comparing diseased cells with normal cells. First, we obtain the tissue or serum cells, extract the mRNA, convert it to cDNA, and label this target with a red dye. Similarly, we take a reference sample (a normal sample) and label it with a green dye. The colors red and green are arbitrary; they do not imply that the actual reporter molecules are indeed red or green. Hybridization is then performed between the labelled cDNAs and the probes on the array. After washing off the excess cDNA, we observe and measure the color intensities. On our four-gene array we observe one red spot, one green spot, and two orange spots.

Basically, what this tells us is that (a) gene a is much more active (highly expressed) in the target than in the reference sample, as the red-dyed sequences have massively outcompeted the green-dyed sequences in hybridizing to the probes representing gene a. By a similar argument, we can say (b) that gene c is under-expressed in the target sample compared to the reference, and (c) that the expression of genes b and d seems balanced, that is, the transcript abundance in the target is roughly the same as that in the reference. Hence, we see an orange color for these two spots.

To detect the relative abundances of reference and target mRNAs, we use a device equipped with a laser and a microscope. The intensity of the emitted light allows us to estimate quantitatively the relative abundances of transcribed mRNA. Suppose that for patient 1 with tumor A we obtained the following expression ratios: r(a) = 5.00, r(b) = 0.98, r(c) = 0.33, and r(d) = 0.89. In this scheme, a value close to 1.00 means balanced expression, 2.00 means the target mRNA abundance is twice as high as that in the reference, and 0.50 means the reference abundance is twice as high as that of the target.
To complete our four-gene, ten-patient experiment, we must repeat the entire procedure ten times and produce one array per patient. Once all arrays are done, we record and integrate the expression profiles of all patients in a single data matrix, along with any other information needed to analyze the data. The table pictured below shows the data matrix after integrating the data.
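The ratio scale is lopsided: twofold over-expression is 2.00, while twofold under-expression is 0.50. A common way to make the two directions symmetric is to take the base-2 logarithm of each ratio. Here is a minimal sketch using the four ratios quoted for patient 1; the function name `log2_ratio` is just an illustrative choice, and the use of log2 here is an assumption rather than something the toy example itself prescribes.

```python
import math

# Expression ratios (target / reference) quoted in the text for patient 1.
ratios = {"a": 5.00, "b": 0.98, "c": 0.33, "d": 0.89}

def log2_ratio(r):
    """Return the base-2 logarithm of an expression ratio (hypothetical helper)."""
    return math.log2(r)

# On the log scale, 0 means balanced expression, +1 means twofold higher in the
# target, and -1 means twofold higher in the reference.
log_ratios = {gene: log2_ratio(r) for gene, r in ratios.items()}

for gene, value in log_ratios.items():
    print(f"gene {gene}: ratio = {ratios[gene]:.2f}, log2 ratio = {value:+.2f}")
```

With this transform, genes b and d land close to 0, gene a is clearly positive, and gene c is clearly negative.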

Observations on the microarray data

Screen Shot 2013-12-12 at 11.50.04 PM.png

  • First, we observe missing values for patient 5 and gene b, and for patient 8 and gene c. There may be many reasons for missing information, but there are techniques such as mean substitution, expectation maximization, and maximum likelihood to deal with missing data.
  • Second, for tumor A patients the expression levels of gene a tend to be higher than the baseline level of 1.00 by a factor of two or more. At the same time, for tumor B patients, a's expression levels tend to be lower than the reference level by a factor of two or more. This differential expression pattern of gene a suggests that the gene may be involved in the events deciding the destiny of the tumor cell in terms of developing into either of the two forms. Whether this difference is statistically significant at a given confidence level remains to be seen.
  • Third, there also seems to be a differential expression pattern for gene c. However, here we observe a tendency towards under-expression for tumor A and over-expression for tumor B.
  • Fourth, most expression values of gene b and d appear to be about the base line of 1.00, suggesting that the two genes are not differentially expressed across the studied tumors.
  • Fifth, we observe that high expression levels of gene a are often matched by a low level of gene c for the same patient, and vice versa. This suggests that the two genes are (negatively) co-regulated.
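Two of these observations can be sketched in a few lines: filling a missing value with the mean of the observed ones, and checking whether two genes move in opposite directions. The numbers below are HYPOTHETICAL and chosen only for illustration; they are not the values from the data matrix pictured above. The helper names `impute_mean` and `pearson` are also just illustrative.

```python
from statistics import mean

# HYPOTHETICAL expression ratios for five patients (not the table's real values).
# None marks a missing measurement.
gene_a = [5.0, 4.2, 3.1, 0.4, 0.3]
gene_b = [0.98, None, 1.05, 1.10, 0.95]
gene_c = [0.33, 0.40, 0.50, 2.5, 3.0]

def impute_mean(values):
    """Replace each missing value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = sum((xi - mx) ** 2 for xi in x) ** 0.5
    sy = sum((yi - my) ** 2 for yi in y) ** 0.5
    return cov / (sx * sy)

gene_b_filled = impute_mean(gene_b)
r_ac = pearson(gene_a, gene_c)

print("gene b after mean imputation:", gene_b_filled)
print("correlation between gene a and gene c:", round(r_ac, 2))
```

A negative correlation between a and c is what "negatively co-regulated" would look like numerically. Mean substitution is the simplest of the techniques listed; the others need more machinery.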

File Formats for Microarray data

All Affymetrix GeneChips are scanned in an Affymetrix scanner, and the initial quantification of features is performed with Affymetrix software.
The software works with several file types:
  • .EXP Contains basic information about the experiment.
  • .DAT Contains the raw image. The DAT file holds a 16-bit intensity image in a proprietary format. The file structure consists of a 512-byte header followed by the raw image data. The image shown above is a 4733 by 4733 grid of pixels, so the total file size is 2 * 4733^2 + 512 = 44803090 bytes (about 45 MB).
  • .CEL Contains the feature quantifications.
  • .CDF Library file that maps between features, probes, probe sets, and genes, including which probes belong to which probe set.
  • .CHP Contains gene expression levels, as assessed by the Affymetrix software.
  • .GIN Library file containing information about the probe sets, such as the gene name associated with each probe set.
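The DAT file size quoted above follows directly from the layout described: two bytes per 16-bit pixel plus a 512-byte header. A small sketch of the arithmetic (the function name `dat_file_size` is a made-up helper, not part of any Affymetrix tool):

```python
HEADER_BYTES = 512      # header size stated in the text
BYTES_PER_PIXEL = 2     # a 16-bit intensity takes two bytes

def dat_file_size(side_pixels):
    """Size in bytes of a square 16-bit image with the header described above."""
    return BYTES_PER_PIXEL * side_pixels ** 2 + HEADER_BYTES

size = dat_file_size(4733)
print(size, "bytes, about", round(size / 1e6), "MB")
```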

Microarray analysis workflow

Screen Shot 2014-03-19 at 7.28.20 PM.png

Microarray data analysis in Brief
1. Microarray experiments typically involve the measurement of the expression levels of many thousands of genes in only a few biological samples.

2. Some of the key steps in preprocessing are (1) image quantification; (2) data exploration, such as scatter plots; (3) background adjustment, normalization, and summarization; and (4) quality assessment.

3. One of the most common visualization methods for microarray data is the scatter plot. It compares the gene expression values of two samples. Most data points typically fall on a 45-degree line, but transcripts that are up- or downregulated are positioned off the line. The two plots below show such scatter plots.

Log of Intensity values

Raw intensity values

The main feature of this scatter plot (and most such plots of microarray data) is the substantial correlation between the expression values in the two conditions being compared. Another feature is the preponderance of low-intensity values, which indicates that most genes are expressed at a low level. A further adjustment is to create an MA plot (figure below), which essentially tilts a scatter plot on its side. The purpose of an MA plot is to investigate intensity bias. The x axis represents the mean of the log intensity values, so that transcripts expressed from low to high levels run from left to right. The y axis represents the log ratio of the signal intensities in the two samples.
In the plot, transcripts (genes) that are more upregulated have higher y-axis values and downregulated transcripts have lower y-axis values.
Such plots help to highlight regulated transcripts, and they also help to visualize aberrant structures in the data.
More generally, MA plots are used to study the dependence between the log ratio of two variables and the mean of the same two variables. The log ratios of the two measurements are called M values (from “minus” on the log scale) and are shown on the vertical axis. The mean values of the two measurements are called A values (from “average” on the log scale) and are shown on the horizontal axis.
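The M and A values can be computed directly from a pair of intensities. The sketch below assumes base-2 logarithms, which is the usual convention but is not stated explicitly above; the intensity pairs are HYPOTHETICAL and the function name `ma_values` is illustrative.

```python
import math

def ma_values(x, y):
    """Return (M, A) for two positive intensities x and y.

    M is the difference of the log intensities (the log ratio),
    A is the average of the log intensities.
    """
    log_x, log_y = math.log2(x), math.log2(y)
    return log_x - log_y, 0.5 * (log_x + log_y)

# HYPOTHETICAL intensity pairs (sample 1, sample 2) for three transcripts.
pairs = [(1024.0, 256.0), (512.0, 512.0), (64.0, 256.0)]

for x, y in pairs:
    m, a = ma_values(x, y)
    print(f"x = {x:7.1f}, y = {y:7.1f}  ->  M = {m:+.1f}, A = {a:.1f}")
```

A transcript with equal intensity in both samples gets M = 0 regardless of how bright it is, which is why a systematic drift of M away from 0 along the A axis points to intensity bias.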

4. The term “normalization” as applied to microarray data does not refer to the normal (Gaussian) distribution; instead it refers to the process of correcting two or more data sets prior to comparing their gene expression values. It is necessary to normalize microarray data because, for example, the Cy3 and Cy5 dyes are incorporated into cDNA with different efficiencies. Without normalization, it would not be possible to accurately assess the relative expression of samples labeled with those dyes: genes that are actually expressed at comparable levels would show a ratio different from one (that is, a log ratio different from zero).
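One very simple way to correct a dye bias is to shift all log ratios so that their median becomes zero, on the assumption that most genes are not differentially expressed. This is only one of several normalization strategies and is not necessarily the one used by any particular software package. The intensities below are HYPOTHETICAL, constructed so that the red channel reads 1.5 times too bright.

```python
import math
from statistics import median

def median_center(log_ratios):
    """Shift log ratios so that their median is zero (global median normalization)."""
    shift = median(log_ratios)
    return [m - shift for m in log_ratios]

# HYPOTHETICAL two-channel intensities. Four genes are truly unchanged but the
# red channel is 1.5x too bright; the fourth gene is truly upregulated.
red = [150.0, 300.0, 600.0, 1200.0, 75.0]
green = [100.0, 200.0, 400.0, 100.0, 50.0]

raw_m = [math.log2(r / g) for r, g in zip(red, green)]
norm_m = median_center(raw_m)

for before, after in zip(raw_m, norm_m):
    print(f"raw log ratio = {before:+.2f}, normalized = {after:+.2f}")
```

After centering, the unchanged genes sit at a log ratio of zero and only the truly regulated gene stands out.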

5. Robust Multiarray Analysis (RMA): RMA is a method that performs background correction, normalization, and summarization (averaging) of probe-level feature intensities extracted from .CEL files on Affymetrix platforms.
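The normalization step in RMA is quantile normalization, which forces every array to share the same distribution of intensities. The sketch below shows only that step on HYPOTHETICAL toy data. It is not the RMA implementation: it skips background correction and summarization, and it makes no special provision for tied values.

```python
from statistics import mean

def quantile_normalize(arrays):
    """Quantile-normalize a list of equal-length intensity lists (one per array)."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: the mean of the k-th smallest value across arrays.
    reference = [mean(s[k] for s in sorted_arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        # Indices of this array's values, ordered from lowest to highest.
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]  # replace each value by the reference value of its rank
        normalized.append(out)
    return normalized

# HYPOTHETICAL intensities for four probes on three arrays.
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(arrays)
for row in normalized:
    print([round(v, 2) for v in row])
```

Each array keeps the rank order of its probes, but after normalization all arrays contain exactly the same set of values.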

6. Expression Ratios: How can you decide which genes are significantly regulated in a microarray experiment? One approach is to calculate the expression ratio between control and experimental cases and to rank the genes by it. You might apply an arbitrary cutoff, such as a threshold of at least twofold up- or downregulation, and define the genes that pass it as genes of interest.
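The ranking-and-cutoff idea can be sketched as follows, reusing the four ratios quoted earlier for patient 1 in the toy example. Working on the log2 scale lets one threshold cover both up- and downregulation; the function name `rank_by_fold_change` and its `cutoff` parameter are illustrative choices.

```python
import math

def rank_by_fold_change(ratios, cutoff=2.0):
    """Rank genes by absolute log2 ratio and keep those at or beyond the cutoff."""
    threshold = math.log2(cutoff)
    ranked = sorted(ratios.items(),
                    key=lambda item: abs(math.log2(item[1])),
                    reverse=True)
    return [(gene, r) for gene, r in ranked if abs(math.log2(r)) >= threshold]

# Ratios quoted in the toy example for patient 1.
ratios = {"a": 5.00, "b": 0.98, "c": 0.33, "d": 0.89}

genes_of_interest = rank_by_fold_change(ratios)
print(genes_of_interest)
```

Genes a and c pass the twofold cutoff while b and d do not. Keep in mind that a fold-change cutoff says nothing about statistical significance.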

7. The goal of inferential statistical analysis of microarray data is to test the hypothesis that some genes are differentially expressed in an experimental comparison of two or more conditions. We formulate the null hypothesis H(0) that there is no difference in signal intensity across the conditions being tested, and the alternative hypothesis H(1) that there are differences in gene expression levels. We define and calculate a test statistic, which is a value that characterizes the observed gene expression data. We then reject or fail to reject the null hypothesis based on the test statistic at a chosen significance level alpha, commonly 0.05 (that is, we reject H(0) when p < 0.05).
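As a rough illustration for a single gene, the sketch below computes a two-sample t statistic (Welch's form) and then obtains a p-value by an exact permutation test, which avoids needing the t distribution. The permutation approach is swapped in here for convenience and is not the procedure described above. The log ratios are HYPOTHETICAL, and the function names are illustrative.

```python
from itertools import combinations
from statistics import mean, variance

def t_statistic(x, y):
    """Welch's t statistic for two independent groups."""
    standard_error = (variance(x) / len(x) + variance(y) / len(y)) ** 0.5
    return (mean(x) - mean(y)) / standard_error

def permutation_p_value(x, y):
    """Two-sided exact permutation p-value based on the t statistic."""
    pooled = x + y
    observed = abs(t_statistic(x, y))
    n = len(pooled)
    hits = total = 0
    # Try every way of relabeling the pooled values into two groups.
    for idx in combinations(range(n), len(x)):
        chosen = set(idx)
        group_x = [pooled[i] for i in idx]
        group_y = [pooled[i] for i in range(n) if i not in chosen]
        total += 1
        if abs(t_statistic(group_x, group_y)) >= observed - 1e-12:
            hits += 1
    return hits / total

# HYPOTHETICAL log2 ratios of one gene in five tumor A and five tumor B patients.
tumor_a = [2.3, 1.9, 2.1, 1.6, 2.5]
tumor_b = [-1.2, -0.8, -1.5, -1.0, -0.6]

alpha = 0.05
p = permutation_p_value(tumor_a, tumor_b)
print("t =", round(t_statistic(tumor_a, tumor_b), 2), " p =", round(p, 4))
print("reject H(0)" if p < alpha else "fail to reject H(0)")
```

In a real experiment this test is repeated for thousands of genes, so the p-values need a correction for multiple testing before any gene is declared differentially expressed.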

8. A variety of test statistics may be applied to microarray data. These tests are all used to derive P values that help assess the likelihood that particular genes are regulated. Some of the tests are listed in the table below.

Goal | Parametric test | Nonparametric test
Compare one group to a hypothetical value | One-sample t-test | Wilcoxon test
Compare two unpaired groups | Unpaired t-test | Mann–Whitney test
Compare two paired groups | Paired t-test | Wilcoxon test
Compare three or more unmatched groups | One-way ANOVA | Kruskal–Wallis test
Compare three or more matched groups | Repeated-measures ANOVA | Friedman test
Table adapted from Motulsky (1995) and Zolman (1993).

Learning Goals