This section describes the analysis of gene expression and the use of the GEO database, R, and Bioconductor to analyze microarray data. It also describes the file formats used to store gene expression data.

The video discusses gene expression and NCBI's GEO website. It is a three-part lecture series that introduces gene expression and explains how the data are stored in the GEO database.

General Information for R and Bioconductor

To get started with R and Bioconductor, it is important to know where to find help for the numerous functions, classes, and concepts you are about to come across. The ? operator is the most immediate source of information about R objects. More specialized sources of help are the R and Bioconductor mailing lists.

To make installing Bioconductor packages as easy as possible, Bioconductor provides a web-accessible script called biocLite, which installs any Bioconductor package along with its dependencies. You can also use biocLite to install packages hosted on CRAN. Once the script has been sourced, the following command installs Bioconductor packages using biocLite:
biocLite(c("graph", "GEOquery"))
The command update.packages can be used to check for and install new versions of already installed packages, including Bioconductor packages when it is pointed at the Bioconductor repository.
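As a hedged sketch of the same installation step on current releases: Bioconductor 3.8 and later distribute the BiocManager package in place of the biocLite script. The helper names missing_packages and install_bioc are hypothetical, introduced only for illustration. The default dry run reports what would be installed without changing the library.

```r
# Return the subset of `pkgs` that is not yet installed (base R only).
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

# Hypothetical helper: install only the missing packages via BiocManager.
# With dry_run = TRUE (the default) nothing is installed.
install_bioc <- function(pkgs, dry_run = TRUE) {
  todo <- missing_packages(pkgs)
  if (dry_run || length(todo) == 0) {
    return(invisible(todo))
  }
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install(todo)
  invisible(todo)
}

# Dry run: list which of the two packages still need installing
todo <- install_bioc(c("graph", "GEOquery"))
print(todo)
```

Calling install_bioc(c("graph", "GEOquery"), dry_run = FALSE) performs the installation. Calling BiocManager::install() with no arguments offers to update installed packages.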
Because genomic microarray data can be very complex, the Biobase package contains standardized data structures to represent genomic data. The ExpressionSet class is designed to combine several different sources of information into a single convenient structure. The data in an ExpressionSet consist of:
  • assayData: Expression data from microarray experiments (assayData is used to hint at the methods used to access different data components, as we show below).
  • metadata: A description of the samples in the experiment (phenoData), metadata about the features on the chip or technology used for the experiment (featureData), and further annotations for the features, for example gene annotations from biomedical databases (annotation).
  • experimentData: A flexible structure to describe the experiment.
If you have access to .CEL or other files produced by microarray chip manufacturer hardware, the usual strategy is to use a Bioconductor package such as affyPLM, affy, oligo, limma, or arrayMagic to read these files. These packages have functions (e.g., ReadAffy, expresso, or justRMA in affy) that read CEL files, perform preliminary preprocessing, and represent the resulting data as an ExpressionSet or other type of object.
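The CEL-reading strategy can be sketched as follows, assuming the affy package and the annotation (CDF) package for the chip are installed. The helper name normalize_cel_dir and the directory argument are hypothetical. ReadAffy and rma are functions from affy, and RMA is only one of several possible preprocessing choices.

```r
# Hypothetical helper: read every CEL file in `cel_dir` and return
# RMA-preprocessed expression values as an ExpressionSet.
normalize_cel_dir <- function(cel_dir) {
  stopifnot(dir.exists(cel_dir))
  if (!requireNamespace("affy", quietly = TRUE)) {
    stop("The affy package is required; install it from Bioconductor first.")
  }
  # AffyBatch object holding the probe-level intensities
  raw <- affy::ReadAffy(celfile.path = cel_dir)
  # Background correction, quantile normalization, and summarization
  affy::rma(raw)
}
```

A call such as normalize_cel_dir("path/to/cel/files") (a placeholder path) would then return an ExpressionSet ready for downstream analysis.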

The following diagram shows the R packages and functions used in each step of microarray data analysis.
Figure: R packages in microarray data analysis
Building an ExpressionSet by hand
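# Locate the example expression data shipped with Biobase
dataDirectory <- system.file("extdata", package="Biobase")
exprsFile <- file.path(dataDirectory, "exprsData.txt")
# Read the tab-delimited file into a matrix (rows = features, columns = samples)
exprs <- as.matrix(read.table(exprsFile, header=TRUE, sep="\t", row.names=1,
                              as.is=TRUE))
head(exprs) # to see the expression matrix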
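Once an expression matrix is in memory, it can be wrapped in an ExpressionSet together with sample annotation. The following is a minimal sketch, assuming Biobase is installed. The simulated matrix, the gene and sample names, and the group labels are made up so that the block runs without any data files.

```r
library(Biobase)

set.seed(1)
# Simulated expression matrix: 5 features (rows) by 4 samples (columns)
mat <- matrix(rnorm(20), nrow = 5,
              dimnames = list(paste0("gene", 1:5), paste0("sample", 1:4)))

# Sample annotation; row names must match the column names of the matrix
pheno <- data.frame(group = c("control", "control", "case", "case"),
                    row.names = colnames(mat))

# Combine assay data and sample annotation into one object
eset <- ExpressionSet(assayData = mat,
                      phenoData = AnnotatedDataFrame(pheno))

dim(exprs(eset))    # expression matrix: features x samples
pData(eset)$group   # sample annotation stored in phenoData

# Subsetting keeps the expression values and the annotation in sync
case_only <- eset[, eset$group == "case"]
```

The accessor exprs returns the assay data and pData returns the sample annotation, so the pieces described above stay together in one object.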
GEO distributes its data files in a format known as SOFT, which stands for Simple Omnibus Format in Text. There are four types of GEO SOFT file:

GEO Platform (GPL) - These files describe a particular type of microarray. They are annotation files.
GEO Sample (GSM) - Files that contain all the data from the use of a single chip. For each gene there will be multiple scores including the main one, held in the VALUE column.
GEO Series (GSE) - Lists of GSM files that together form a single experiment.
GEO Dataset (GDS) - These are curated files that hold a summarised combination of a GSE file and its GSM files. They contain normalised expression levels for each gene from each sample (i.e. just the VALUE field from the GSM file).
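To make the SOFT layout concrete, here is a small base R sketch that parses a single sample record. In SOFT files, lines starting with ^ name an entity, lines starting with ! carry attributes, lines starting with # describe table columns, and the data table sits between table_begin and table_end markers. The record below is made up (it is not real GEO data), and parse_soft_sample is a hypothetical helper. In practice the GEOquery package, for example through its getGEO function, downloads and parses these files.

```r
# A tiny, made-up GSM record in SOFT layout (not real GEO data)
soft_lines <- c(
  "^SAMPLE = GSM_EXAMPLE",
  "!Sample_title = made-up sample",
  "!Sample_platform_id = GPL_EXAMPLE",
  "#ID_REF = probe identifier",
  "#VALUE = normalized signal",
  "!sample_table_begin",
  "ID_REF\tVALUE",
  "probe_1\t7.5",
  "probe_2\t3.25",
  "!sample_table_end"
)

# Hypothetical helper: split one SOFT sample record into its parts
parse_soft_sample <- function(lines) {
  begin <- grep("^!sample_table_begin", lines)
  end   <- grep("^!sample_table_end", lines)
  header <- lines[seq_len(begin - 1)]

  # "^SAMPLE = <accession>" names the entity
  entity <- grep("^\\^SAMPLE", header, value = TRUE)[1]
  sample_id <- sub("^\\^SAMPLE *= *", "", entity)

  # "!key = value" lines become a named character vector
  attr_lines <- grep("^!", header, value = TRUE)
  keys   <- trimws(sub("^!([^=]*)=.*$", "\\1", attr_lines))
  values <- trimws(sub("^[^=]*=", "", attr_lines))

  # The tab-delimited table sits between the begin and end markers
  tab <- read.delim(text = lines[(begin + 1):(end - 1)],
                    stringsAsFactors = FALSE)

  list(sample = sample_id,
       attributes = setNames(values, keys),
       table = tab)
}

rec <- parse_soft_sample(soft_lines)
rec$sample
rec$attributes[["Sample_platform_id"]]
rec$table$VALUE   # the main score for each probe
```

The VALUE column extracted here is the field that a curated GDS file collects across all samples of a series.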

Video Showing Differential Gene Expression Data Analysis
Video 1

Video 2

Video showing the analysis of microarray gene expression data from a study cohort comprising 190 samples from patients with acute lymphoblastic leukemia (ALL), from Den Boer et al. The main goal of this tutorial is to introduce:
1. Unsupervised and supervised classification
2. Training, testing and prediction against a random set
2. Training, testing and prediction against a random set
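The two ideas can be illustrated with a base R sketch on simulated data. This is not the ALL cohort and not the method used in the video. The data, the class labels, and the helper names train_centroids and predict_centroid are assumptions made for illustration, and nearest-centroid classification stands in for whatever classifier the tutorial uses.

```r
set.seed(42)
n_genes <- 50
n_per_class <- 20

# Simulated data: the first 10 genes are shifted upward in class B
class_a <- matrix(rnorm(n_genes * n_per_class), nrow = n_genes)
class_b <- matrix(rnorm(n_genes * n_per_class), nrow = n_genes)
class_b[1:10, ] <- class_b[1:10, ] + 2
x <- t(cbind(class_a, class_b))   # samples in rows, genes in columns
y <- factor(rep(c("A", "B"), each = n_per_class))

# Unsupervised: hierarchical clustering never sees the labels
clusters <- cutree(hclust(dist(x), method = "average"), k = 2)
table(cluster = clusters, label = y)

# Supervised: mean expression profile (centroid) of each class
train_centroids <- function(train_x, train_y) {
  sapply(levels(train_y), function(cl) {
    colMeans(train_x[train_y == cl, , drop = FALSE])
  })
}

# Assign each new sample (row of newx) to the closest centroid
predict_centroid <- function(newx, centroids) {
  d <- sapply(colnames(centroids), function(cl) {
    rowSums(sweep(newx, 2, centroids[, cl])^2)
  })
  factor(colnames(centroids)[max.col(-d)], levels = colnames(centroids))
}

# Hold out a random test set, train on the remaining samples
test_idx <- sample(nrow(x), size = 10)
train_x <- x[-test_idx, ]
train_y <- y[-test_idx]
pred <- predict_centroid(x[test_idx, ], train_centroids(train_x, train_y))
accuracy <- mean(pred == y[test_idx])

# Random baseline: same procedure trained on permuted labels
perm_y <- sample(train_y)
perm_pred <- predict_centroid(x[test_idx, ], train_centroids(train_x, perm_y))
perm_accuracy <- mean(perm_pred == y[test_idx])

c(accuracy = accuracy, permuted = perm_accuracy)
```

Comparing the test accuracy with the accuracy obtained from permuted labels shows how much of the classifier's performance reflects real signal rather than chance.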