I mostly work on human gene expression datasets that are preprocessed. Recently, I thought it is a good idea to have expression data from different tissues as a reference/control for a comparison against disease/perturbation expression datasets. I wrote a R script to obtain expression data from GEO. I thought script is worth to share since it took some time to compile from different sources on web. Many script statements comes from this nice tutorial. I have just provided code here, you should try commands in R to check output. I will keep on updating page with various analysis.
You need to install GEOquery and Biobase packages from Bioconductor repository and load them in R by using commands given below. I am also installing limma which has many functions for expression analysis.
## I install packages in R >
biocLite(c("GEOquery","Biobase", "limma"))## Now load them >
Gene Expression Omnibus (GEO) have publicly available datasets in few formats. One of them is GDS, which includes curated normalised expression files along with sample annotation. You can read more about formats on GEO help pages. First get GEO accession number for a particular dataset and feed it to GEOquery function. This function download expression data from GEO and load it in R for you. The command is
# GDS596 is a accession of my data of interest
> gds <- getGEO('GDS596', destdir=".")
gds object has two important elements (thanks to getGEO function writers), 1. Metadata associated with samples/arrays 2. Expression table
# First I check column headers to # have an rough information about content > names(Meta(gds)) # Data information > Meta(gds)$title # Title of the study > Meta(gds)$sample_organism # Organism source > Meta(gds)$sample_type # sample type # Array information > Meta(gds)$sample_count # sample counts > Meta(gds)$description # Sample description # Probe related information > Meta(gds)$platform # Platform name for annotation package > Meta(gds)$feature_count # probe counts
Checking expression data
> colnames(Table(gds)) # columns headers of expression data # First two columns are features and, # GSM prefixed columns represents expression > Table(gds)[1:10,1:10] # This is much better > dim(Table(gds)) # I check how many samples and probes sets exists
> eset <- GDS2eSet(gds, do.log2=TRUE) > eset > str(eset) # here you can check the structure of ExpressionSet for more Info.
ExpressionSet is a environment specially created for gene expression data analysis. The idea was to have a object which consists of Expression table, Metadata associated with- Samples and Probes in a single object. If you do any operation on lets say, rows or columns in expression table, it should also be applied on Probes or Sample annotations as well. There are three important elements you can look for (dont try third element which is exression matrix) a. pData(eset) for sample information and b. featureData(eset) for probe level information c. exprs(eset) for expression matrix. Lets see what information each element has
> head(pData(eset)) > names(pData(eset))
I do not need certain samples so I will exclude them from eset
> eset=eset[,pData(eset)$tissue!="colorectal adenocarcinoma" & pData(eset)$tissue!="leukemia chronic myelogenous K562" & pData(eset)$tissue!="leukemia lymphoblastic molt4" & pData(eset)$tissue!="leukemia promyelocytic hl60" & pData(eset)$tissue!="lymphoma burkitts daudi" & pData(eset)$tissue!="lymphoma burkitts raji" & pData(eset)$tissue!="721-B-lymphoblasts"] > dim(eset) # checking dimension # I would like to know if you have some better ideas to do the same....
This element is quite complicated so I look at it’s structure first.
I can see that @data element object contains gene ID, symbol etc. Let check this element
> head(featureData(eset)@data) # Not so good...hmm..Lets check first 10 rows and 5 columns > featureData(eset)@data[1:10,1:5] # Oke, looks better than before. Lets check what is there in each column > names(featureData(eset)@data)
With quick glance at the data, I see few missing values in column “Gene ID” and also few gene IDs concatanated with /// are represented with the same probe identifier. I want to get rid of them to avoid unnecessary complications in further analysis.
> eset=eset[featureData(eset)@data$"Gene ID"!='',] # getting rid of probes with no Gene IDs > dim(eset) > probes=featureData(eset)@data$"ID"[grep('///', featureData(eset)@data$"Gene ID", invert=TRUE)] # Getting probe names matching to single gene > eset=eset[match(probes,rownames(eset)),] # subsetting probes > dim(eset) # here I have probes matching to single gene only # Making sure everything has worked out well > length(unique(featureData(eset)@data$"Gene symbol" )) > sum(featureData(eset)@data$"ID"!=rownames(featureData(eset))) # it should be zero > sum(featureData(eset)@data$"ID"!=rownames(exprs(eset))) # it should be zero
I want expression for genes rather than corresponding probes. Expression often represented with multiple probes for a single gene. Many people work with probes but I like to work on genes. Best way is to average out expression of multiple probes corresponding to the same gene. Limma package provide avereps function to do this. Input is expression set and a vector of gene identifiers for corresponding rows,
> exp=avereps(exprs(eset), ID=featureData(eset)@data$"Gene ID") > dim(exp) > exp[1:10,1:5]
I have got average expression profiles for gene identifiers. Rownames of exp are now gene identifiers provided to avereps function. Now, I play a trick to map averaged expression profiles to eset.
> eset=eset[match(rownames(exp), featureData(eset)@data$"Gene ID"),] > exprs(eset)<-exp > exprs(eset)[1:10,1:5] > dim(eset) > boxplot(exprs(eset))
This expression data is ready, if medians (black lines) of boxes in boxplot are along the line (which you dont see in the boxplot). This data need further processing, especially array normalization. Microarray normalization itself is a field of active research which can not be cover here. Finally, before jumping into biological discovery, you should check outlier samples, genes, uniform distribution of gene expression etc. Once you known that the data is good, then only you should go for subsequent analysis. Data can be used for differential expression analysis, regulatory network predictions, co-expressed gene analysis etc. I appreciate if someone point out errors or has better commands than used over here. This is for now and will add chunk of code for expression analysis soon.