Assignment 2 : Identifying relationship of genes to Down syndrome


In this assignment you will analyze genetic Microarray data for association with Down's Syndrome. Down syndrome is caused by an extra copy of all or part of chromosome 21. We are most likely to see differential expression of some of the genes in chromosome 21. The data is from an experiment that was performed using the Affymetrix GeneChip Human U133A arrays. It includes 25 samples taken from 10 human subjects and 4 different tissues.The raw data is available at Files/Assignment 2 folder on IU Canvas site includes the .CEL files and phenotype data. The data from the .CEL files are normalized by various algorithms (RMA ,MA5) and then they are converted to GDS files which we download differential expression.

For this project, you will use Biocondutor with the microarray assay data to answer some questions.

1. Preprocessing

Preprocessing Affymetrix data is a well-defined process consisting of steps to correct, normalize and summarize the data. First we need to import the “raw” data in .CEL format and the PHENODATA;
  • Create a working directory where you dump all your .CEL files and the Phenodata.txt.
  • Install the affy and affycoretools packages using Bioconductor . You can use my script at github or use this.
  • Import the above two packages
  • Read all the *.CEL files and Phenodata using
    • data <- ReadAffy()
    • pData(data)<- read.table("Phenodata.txt", header=T, row.names=1, sep="\t")

Answer the following questions:

Q1. How types of tissues are in the Phenodata.txt ?
Q2. Give the summary of the Affybatch object data provided when you import
Q3. Plot histograms and boxplot of the affy object using hist() and boxplot(). Does distribution of the intensities show it requires normalization? Also show a plot for disease type "Down syndrome" and "Normal" using red and green colors respectively (note: There are 11 cases of Downs Syndrome and 14 cases of Normal in the dataset).

2. Summarizing expression values

  • In affy package there are several methodologies available to correct, normalize and summarize expression values.
  • Pass the affy object to rma() or ma5() to get the expression set.
  • Now use the function boxplot(exprs( your affy obect )) to plot the box plot

Q4. Does the box plot indicate the data is normalized?

3. Exploratory Analysis

  • Convert the data to an expression set using exprs function and perform principal component on the transposed data with correlation equal True
    • expr.prcomp <- prcomp(t(e),cor=TRUE). # e is the expression set.
    • Check to see how you can distinguish the data is it based on Disease type or it is based on Tissue type.

Q5. How many principal components do you get? Plot the two principal component plots based on Disease type and Tissue type and explain which best describes the data.

Here is some code to help you generate the principal components
 **# Getting Groups and Group Names**
 groups = as.numeric(pData(mydata)[,1])
 gn = levels(pData(mydata)[,1]) #GroupNames
 groups
 gn
 
 **# Creating data with two principal components with groups**
 comp<-data.frame(expr.prcomp$x[,1:2],groups)
 **# Creating a Group column with different classes**
 comp$Group<-NA
 comp$Group[comp$groups==1] <-"Downs Syndrome"
 comp$Group[comp$groups==2] <- "Normal"
 
 **# Plotting the data point using ggplot2**
 library(ggplot2)
 qplot(x=PC1,y=PC2,data=comp,color=factor(Group)) +
      theme_bw()+geom_point(size=6)+labs(colour="Groups")
 
 
 

4. Analysis of Differential Gene expression

We will use limma package from Bioconductor to perform the analysis of differential gene expression

library(limma)
**#Creating a two groups compare design matrix**
 
Group<- factor(pData(data)[,1 ] , levels = levels(pData(data)[,1]))
design<- model.matrix(~Group)
 
**#Fitting a linear model for each gene in the expression set eset given the design matrix**
 
fit <-lmFit(eset,design)
 
**#Given a microarray linear model fit, compute moderated t-statistics, moderated F-#statistic, and log-od#ds of differential expression by empirical Bayes moderation of the #standard errors towards a common value**
fit<-eBayes(fit)
topGenes<-topTable(fit,coef=2,adjust="BH",n=50)
Q7. Use the topTable function with coefficient =2 and adjust ="BH" to get top 100 most statistically significant differentially expressed genes and write it to an excel table (you need to write the code for that).

Next we are going to select those genes that have adjusted p-value below 0.05, to create a list of small number of highly significant genes and create a heatmap.

selected <- p.adjust(fit$p.value[, 2]) <0.05
esetSel <- eset [selected, ]
esetSel
heatmap(exprs(esetSel)

Q8. Show your Heat Map. Can you tell what are the gene names? Though the microarray gene chip names are given, you need to convert it into genes names. See the Gene expression video for this.

5. Annotation of Genes


Q9. Use the code below to annotate the genes and represent as an HTML file with ENSEMBL links. Make a screen shot of the page and post it as image in your pdf file.

**#To verify which platform was used.**
eset@annotation
 
biocLite("hgu133a.db")
biocLite("annotate")
biocLite("R2HTML")
library(hgu133a.db)
library(annotate)
library(R2HTML)
ID <- rownames(topGenes) # tab is the object from topTable function
Symbol <- getSYMBOL(ID, "hgu133a.db")
Name <- as.character(lookUp(ID, "hgu133a.db", "GENENAME"))
Ensembl <- as.character(lookUp(ID, "hgu133a.db", "ENSEMBL"))
Ensembl <- ifelse(Ensembl=="NA", NA,
                  paste("<a href='http://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=",
                        Ensembl, "'>", Ensembl, "</a>", sep=""))
tmp <- data.frame(ID=ID, Symbol=Symbol, Name=Name, Ensembl=Ensembl, stringsAsFactors=F)
tmp[tmp=="NA"] <- NA
HTML(tmp, "out.html", append=F); write.table(tmp ,file="target.txt",row.names=F, sep="\t")
Report all your answers to questions in a PDF document.