典型医学设计实验GEO数据分析 (step-by-step) - 数据获取到标准化
GEO是當今最大、最全的公共基因數(shù)據(jù)資源庫,包括基因的表達、突變、修飾等信息,涵蓋幾乎所有的疾病,且單個實驗檢測樣品數(shù)目較多,是我們分析、學(xué)習(xí)的很好資源。
實驗設(shè)計
原始文章對14個潰瘍性結(jié)腸炎病人 (Ulcerative colitis, UC)和15個克羅恩病病人 (Crohn’s disease, CD)的發(fā)炎組織和未發(fā)炎組織活檢采樣,用Affy芯片檢測基因表達譜,研究發(fā)炎組織和未發(fā)炎組織的基因表達差異。(Genome-wide Pathway Analysis Using Gene Expression Data of Colonic Mucosa in Patients with Inflammatory Bowel Disease. Inflamm Bowel Dis. 雜志影響因子不高,但是領(lǐng)域的專業(yè)雜志)
檢測并安裝依賴的包
a = rownames(installed.packages())install_bioc <- c("Biobase","oligoClasses","ArrayExpress", "pd.hugene.1.0.st.v1", "hugene10sttranscriptcluster.db", "oligo", "arrayQualityMetrics","limma", "topGO", "ReactomePA", "clusterProfiler", "gplots", "ggplot2","geneplotter", "RColorBrewer", "pheatmap", "dplyr","stringr","genefilter")for(i in install_bioc) {if(! i %in% a) BiocManager::install(i, update=F)}# General Bioconductor packages library(Biobase) library(oligoClasses)# Annotation and data import packages library(ArrayExpress) library(pd.hugene.1.0.st.v1) library(hugene10sttranscriptcluster.db)# Quality control and pre-processing packages library(oligo) library(arrayQualityMetrics)# Analysis and statistics packages library(limma) library(topGO) library(ReactomePA) library(clusterProfiler)# Plotting and color options packages library(gplots) library(ggplot2) library(geneplotter) library(RColorBrewer) library(pheatmap)# Formatting/documentation packages #library(rmarkdown) #library(BiocStyle) library(dplyr) #library(tidyr)# Helpers: library(stringr) library(matrixStats) library(genefilter) #library(openxlsx) #library(devtools)從ArrayExpress下載原始數(shù)據(jù)
實驗檢測采用的是Affy芯片 (A-AFFY-141 - Affymetrix GeneChip Human Gene 1.0 ST Array [HuGene-1_0-st-v1]),原始數(shù)據(jù)為CEL格式,存儲于ArrayExpress,索引號是E-MTAB-2967。
使用getAE函數(shù)下載原始數(shù)據(jù)和注釋數(shù)據(jù)到當前工作目錄 (Rmd所在目錄),返回一個包含所有下載文件名的列表。
# type: 'raw' to download and extract only the raw data, 'processed' to download and extract only the processed data or 'full' to have both raw and processed data. # anno_AE <- getAE("E-MTAB-2967", type = "raw")若自動下載不成功,手動點擊下載圖中File部分所有文件,放置到當前目錄。(也可去后臺回復(fù)affy數(shù)據(jù)?獲取)
# type: 'raw' to download and extract only the raw data, 'processed' to download and extract only the processed data or 'full' to have both raw and processed data. # 修改參數(shù)local=T,讀入當前目錄的數(shù)據(jù) anno_AE <- getAE("E-MTAB-2967", type = "raw", local=T)## Warning in getAE("E-MTAB-2967", type = "raw", local = T): No processed data ## files found in directory /disk1/train/GEO/f1000anno_AE里面存儲了所有的文件的路徑和名字信息。并把原始數(shù)據(jù)解壓縮,釋放出CEL文件到當前目錄,方便后續(xù)讀取。
anno_AE## $path ## [1] "/disk1/train/GEO/f1000" ## ## $rawFiles ## [1] "164_I_.CEL" "164_II.CEL" "183_I.CEL" "183_II.CEL" "2114_I.CEL" ## [6] "2114_II.CEL" "2209_A.CEL" "2209_B.CEL" "2255_I.CEL" "2255_II.CEL" ## [11] "2400_I.CEL" "2400_II.CEL" "2424_A.CEL" "2424_B.CEL" "255_I.CEL" ## [16] "255_II.CEL" "2826_I.CEL" "2826_II.CEL" "2853_I.CEL" "2853_II.CEL" ## [21] "2978_I.CEL" "2978_II.CEL" "2987_I.CEL" "2987_II.CEL" "2992_I.CEL" ## [26] "2992_II.CEL" "2995_I.CEL" "2995_II.CEL" "321_I.CEL" "321_II.CEL" ## [31] "3222_I.CEL" "3222_II.CEL" "3223_I.CEL" "3223_II.CEL" "3226_I.CEL" ## [36] "3226_II.CEL" "3233_I.CEL" "3233_II.CEL" "3258_I.CEL" "3258_II.CEL" ## [41] "3259_I.CEL" "3259_II.CEL" "3262_I.CEL" "3262_II.CEL" "3266_I.CEL" ## [46] "3266_II.CEL" "3269_I.CEL" "3269_II.CEL" "3271_I.CEL" "3271_II.CEL" ## [51] "3302_I.CEL" "3302_II.CEL" "3332_I.CEL" "3332_II.CEL" "848_A.CEL" ## [56] "848_B.CEL" "888_I.CEL" "888_II.CEL" ## ## $rawArchive ## [1] "E-MTAB-2967.raw.1.zip" "E-MTAB-2967.raw.2.zip" ## ## $processedFiles ## NULL ## ## $processedArchive ## character(0) ## ## $sdrf ## [1] "E-MTAB-2967.sdrf.txt" ## ## $idf ## [1] "E-MTAB-2967.idf.txt" ## ## $adf ## [1] "A-AFFY-141.adf.txt"原始數(shù)據(jù)解釋
ArrayExpress的每個數(shù)據(jù)根據(jù)MAGE-TAB?(MicroArray Gene Expression Tabular)指南存儲,包含5種不同類型的文件:
-
IDF (Investigation Description Format):研究描述文件,包含實驗信息如題目、描述、提交者聯(lián)系方式、實驗操作過程等。
-
ADF (Array Design Format): 芯片設(shè)計文件
-
SDRF (Sample and Data Relationship Format): 樣本屬性文件,如實驗分組、處理方式、取樣部位等。
-
原始數(shù)據(jù)文件 (raw)
-
加工后的數(shù)據(jù)文件 (processed data files)
ExpressionSets數(shù)據(jù)結(jié)構(gòu)描述
組學(xué)數(shù)據(jù)通常比較復(fù)雜,包含很多不同的部分,如實驗樣品信息、基因組注釋信息和實驗數(shù)據(jù),對應(yīng)到芯片數(shù)據(jù)是樣品屬性和分組信息、不同來源的基因ID信息和功能注釋信息、基因表達矩陣。
為了更好的組織這些數(shù)據(jù),Bioconductor的Biobase包定義了一個標準化的數(shù)據(jù)結(jié)構(gòu) (ExpressionSet類)存儲這些數(shù)據(jù)。ExpressionSet類包含下面幾部分:
-
assayData: 芯片實驗的表達數(shù)據(jù),探針在行,樣品名字在列,用函數(shù)exprs獲取
-
metaData
-
phenoData: 樣品描述信息,樣品名字在行,實驗分組、處理方式、取樣部位在列。通常是SDRF文件的信息的讀入。用函數(shù)pData獲取。
-
featureData: 基因注釋特征信息,行為探針名字,列為探針對應(yīng)的基因、轉(zhuǎn)錄本名字和相應(yīng)的注釋信息等。用函數(shù)fData獲取。
-
experimentData: 對實驗描述的其它信息。
ExpressionSet類可以幫我們協(xié)調(diào)數(shù)據(jù)修改、提取過程中的所有信息的一致性。但是要注意phenoData的行名字必須與assayData的列名字一致,都是樣品標識符;assayData的行名字必須與featureData的行名字一致,都是基因標識符,具體見下圖:
導(dǎo)入數(shù)據(jù),存儲為”ExpressionSet”
讀入SDRF數(shù)據(jù)。
sdrf_location <- file.path("E-MTAB-2967.sdrf.txt") SDRF <- read.delim(sdrf_location)我們查看下,讀進來的數(shù)據(jù)長什么樣子,有哪些信息?
SDRF[1:3,1:5]總共有哪些樣品相關(guān)的信息?取樣個體標記、物種、疾病、取樣部分、是否患病、raw data存儲的位置和名字等。
t(SDRF[1, ])## 1 ## Source.Name "164_I" ## Characteristics.individual. "164" ## Characteristics.organism. "Homo sapiens" ## Characteristics.disease. "Crohn's disease" ## Characteristics.organism.part. "colon" ## Characteristics.phenotype. "non-inflamed colonic mucosa" ## Material.Type "organism part" ## Protocol.REF "P-MTAB-41361" ## Protocol.REF.1 "P-MTAB-41363" ## Extract.Name "164_I" ## Protocol.REF.2 "P-MTAB-41364" ## Labeled.Extract.Name "164_I:Biotin" ## Label "biotin " ## Protocol.REF.3 "P-MTAB-41366" ## Assay.Name "164_I_" ## Technology.Type "array assay" ## Array.Design.REF "A-AFFY-141" ## Term.Source.REF "ArrayExpress" ## Protocol.REF.4 "P-MTAB-41367" ## Array.Data.File "164_I_.CEL" ## Comment..ArrayExpress.FTP.file. "ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-2967/E-MTAB-2967.raw.1.zip" ## Factor.Value.disease. "Crohn's disease" ## Factor.Value.phenotype. "non-inflamed colonic mucosa"Array.Data.File存儲了樣品名字信息 (CEL文件名),用作行名字。把SDRF數(shù)據(jù)表轉(zhuǎn)為AnnotatedDataFrame格式用于下游構(gòu)建ExpressionSet對象。
rownames(SDRF) <- SDRF$Array.Data.File SDRF <- AnnotatedDataFrame(SDRF)Affymetrix芯片的原始數(shù)據(jù)是CEL文件,里面包含檢測到的探針雜交密度值,代表原始基因表達量。另外每個CEL還含有額外信息,如芯片類型、掃描時間等,常用于做批次校正。
oligo包的read.celfiles函數(shù)可以讀取這些文件。(雖然不影響,但有幾個文件命名不規(guī)律,如164_I_多寫了個下劃線,有幾個樣品用A,B代替了I,II。不規(guī)范的名字是不太利于批量分析和下游識別的,引以為戒。)
raw_data <- oligo::read.celfiles(filenames = as.character(SDRF$Array.Data.File),verbose = FALSE, phenoData = SDRF) stopifnot(validObject(raw_data))獲得的raw_data就是一個ExpressionSet對象。raw_data@phenoData@data(或者pData(raw_data))是對應(yīng)的樣品屬性信息,與SDRF信息一致。raw_data@annotation獲取注釋平臺信息pd.hugene.1.0.st.v1。
str(raw_data)## Formal class 'GeneFeatureSet' [package "oligoClasses"] with 9 slots ## ..@ manufacturer : chr "Affymetrix" ## ..@ intensityFile : chr NA ## ..@ assayData :<environment: 0x815aba0> ## ..@ phenoData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots ## .. .. ..@ varMetadata :'data.frame': 23 obs. of 2 variables: ## .. .. .. ..$ labelDescription: chr [1:23] NA NA NA NA ... ## .. .. .. ..$ channel : Factor w/ 2 levels "exprs","_ALL_": 2 2 2 2 2 2 2 2 2 2 ... ## .. .. ..@ data :'data.frame': 58 obs. of 23 variables: ## .. .. .. ..$ Source.Name : Factor w/ 58 levels "164_I","164_II",..: 1 2 3 4 5 6 7 8 9 10 ... ## .. .. .. ..$ Characteristics.individual. : int [1:58] 164 164 183 183 2114 2114 2209 2209 2255 2255 ... ## .. .. .. ..$ Characteristics.organism. : Factor w/ 1 level "Homo sapiens": 1 1 1 1 1 1 1 1 1 1 ... ## .. .. .. ..$ Characteristics.disease. : Factor w/ 2 levels "Crohn's disease",..: 1 1 1 1 1 1 1 1 1 1 ... ## .. .. .. ..$ Characteristics.organism.part. : Factor w/ 1 level "colon": 1 1 1 1 1 1 1 1 1 1 ... ## .. .. .. ..$ Characteristics.phenotype. : Factor w/ 2 levels "inflamed colonic mucosa",..: 2 1 2 1 2 1 2 1 2 1 ... ## .. .. .. ..$ Material.Type : Factor w/ 1 level "organism part": 1 1 1 1 1 1 1 1 1 1 ... ## .. .. .. ..$ Protocol.REF : Factor w/ 1 level "P-MTAB-41361": 1 1 1 1 1 1 1 1 1 1 ... ## .. .. .. ..$ Protocol.REF.1 : Factor w/ 1 level "P-MTAB-41363": 1 1 1 1 1 1 1 1 1 1 ... ## .. .. .. ..$ Extract.Name : Factor w/ 58 levels "164_I","164_II",..: 1 2 3 4 5 6 7 8 9 10 ... ## .. .. .. ..$ Protocol.REF.2 : Factor w/ 1 level "P-MTAB-41364": 1 1 1 1 1 1 1 1 1 1 ... ## .. .. .. ..$ Labeled.Extract.Name : Factor w/ 58 levels "164_I:Biotin",..: 1 2 3 4 5 6 7 8 9 10 ... ## .. .. .. ..$ Label : Factor w/ 1 level "biotin ": 1 1 1 1 1 1 1 1 1 1 ... ## .. .. .. ..$ Protocol.REF.3 : Factor w/ 1 level "P-MTAB-41366": 1 1 1 1 1 1 1 1 1 1 ... ## .. .. .. ..$ Assay.Name : Factor w/ 58 levels "164_I_","164_II",..: 1 2 3 4 5 6 7 8 9 10 ... ## .. .. .. ..$ Technology.Type : Factor w/ 1 level "array assay": 1 1 1 1 1 1 1 1 1 1 ... ## .. .. .. ..$ Array.Design.REF : Factor w/ 1 level "A-AFFY-141": 1 1 1 1 1 1 1 1 1 1 ... ## .. .. .. ..$ Term.Source.REF : Factor w/ 1 level "ArrayExpress": 1 1 1 1 1 1 1 1 1 1 ... ## .. .. .. ..$ Protocol.REF.4 : Factor w/ 1 level "P-MTAB-41367": 1 1 1 1 1 1 1 1 1 1 ... ## .. .. .. ..$ Array.Data.File : Factor w/ 58 levels "164_I_.CEL","164_II.CEL",..: 1 2 3 4 5 6 7 8 9 10 ... ## .. .. .. ..$ Comment..ArrayExpress.FTP.file.: Factor w/ 2 levels "ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-2967/E-MTAB-2967.raw.1.zip",..: 1 1 1 1 1 1 1 1 1 1 ... ## .. .. .. ..$ Factor.Value.disease. : Factor w/ 2 levels "Crohn's disease",..: 1 1 1 1 1 1 1 1 1 1 ... ## .. .. .. ..$ Factor.Value.phenotype. : Factor w/ 2 levels "inflamed colonic mucosa",..: 2 1 2 1 2 1 2 1 2 1 ... ## .. .. ..@ dimLabels : chr [1:2] "rowNames" "columnNames" ## .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot ## .. .. .. .. ..@ .Data:List of 1 ## .. .. .. .. .. ..$ : int [1:3] 1 1 0 ## ..@ featureData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots ## .. .. ..@ varMetadata :'data.frame': 0 obs. of 1 variable: ## .. .. .. ..$ labelDescription: chr(0) ## .. .. ..@ data :'data.frame': 1102500 obs. of 0 variables ## .. .. ..@ dimLabels : chr [1:2] "featureNames" "featureColumns" ## .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot ## .. .. .. .. ..@ .Data:List of 1 ## .. .. .. .. .. ..$ : int [1:3] 1 1 0 ## ..@ experimentData :Formal class 'MIAME' [package "Biobase"] with 13 slots ## .. .. ..@ name : chr "" ## .. .. ..@ lab : chr "" ## .. .. ..@ contact : chr "" ## .. .. ..@ title : chr "" ## .. .. ..@ abstract : chr "" ## .. .. ..@ url : chr "" ## .. .. ..@ pubMedIds : chr "" ## .. .. ..@ samples : list() ## .. .. ..@ hybridizations : list() ## .. .. ..@ normControls : list() ## .. .. ..@ preprocessing : list() ## .. .. ..@ other : list() ## .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot ## .. .. .. .. ..@ .Data:List of 2 ## .. .. .. .. .. ..$ : int [1:3] 1 0 0 ## .. .. .. .. .. ..$ : int [1:3] 1 1 0 ## ..@ annotation : chr "pd.hugene.1.0.st.v1" ## ..@ protocolData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots ## .. .. ..@ varMetadata :'data.frame': 2 obs. of 2 variables: ## .. .. .. ..$ labelDescription: chr [1:2] "Names of files used in 'exprs'" "Run dates for files used in 'exprs'" ## .. .. .. ..$ channel : Factor w/ 2 levels "exprs","_ALL_": 2 2 ## .. .. ..@ data :'data.frame': 58 obs. of 2 variables: ## .. .. .. ..$ exprs: Factor w/ 58 levels "164_I_.CEL","164_II.CEL",..: 1 2 3 4 5 6 7 8 9 10 ... ## .. .. .. ..$ dates: Factor w/ 58 levels "2010-06-09T11:39:13Z",..: 56 55 9 10 6 8 58 57 5 7 ... ## .. .. ..@ dimLabels : chr [1:2] "rowNames" "columnNames" ## .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot ## .. .. .. .. ..@ .Data:List of 1 ## .. .. .. .. .. ..$ : int [1:3] 1 1 0 ## ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot ## .. .. ..@ .Data:List of 4 ## .. .. .. ..$ : int [1:3] 3 5 1 ## .. .. .. ..$ : int [1:3] 2 42 0 ## .. .. .. ..$ : int [1:3] 1 3 0 ## .. .. .. ..$ : int [1:3] 1 0 0# 如果用Rstudio, View(raw_data)顯示更好 # View(raw_data)查看樣本屬性信息
head(Biobase::pData(raw_data))## Source.Name Characteristics.individual. Characteristics.organism. ## 164_I_.CEL 164_I 164 Homo sapiens ## 164_II.CEL 164_II 164 Homo sapiens ## 183_I.CEL 183_I 183 Homo sapiens ## 183_II.CEL 183_II 183 Homo sapiens ## 2114_I.CEL 2114_I 2114 Homo sapiens ## 2114_II.CEL 2114_II 2114 Homo sapiens ## Characteristics.disease. Characteristics.organism.part. ## 164_I_.CEL Crohn's disease colon ## 164_II.CEL Crohn's disease colon ## 183_I.CEL Crohn's disease colon ## 183_II.CEL Crohn's disease colon ## 2114_I.CEL Crohn's disease colon ## 2114_II.CEL Crohn's disease colon ## Characteristics.phenotype. Material.Type Protocol.REF ## 164_I_.CEL non-inflamed colonic mucosa organism part P-MTAB-41361 ## 164_II.CEL inflamed colonic mucosa organism part P-MTAB-41361 ## 183_I.CEL non-inflamed colonic mucosa organism part P-MTAB-41361 ## 183_II.CEL inflamed colonic mucosa organism part P-MTAB-41361 ## 2114_I.CEL non-inflamed colonic mucosa organism part P-MTAB-41361 ## 2114_II.CEL inflamed colonic mucosa organism part P-MTAB-41361 ## Protocol.REF.1 Extract.Name Protocol.REF.2 Labeled.Extract.Name ## 164_I_.CEL P-MTAB-41363 164_I P-MTAB-41364 164_I:Biotin ## 164_II.CEL P-MTAB-41363 164_II P-MTAB-41364 164_II:Biotin ## 183_I.CEL P-MTAB-41363 183_I P-MTAB-41364 183_I:Biotin ## 183_II.CEL P-MTAB-41363 183_II P-MTAB-41364 183_II:Biotin ## 2114_I.CEL P-MTAB-41363 2114_I P-MTAB-41364 2114_I:Biotin ## 2114_II.CEL P-MTAB-41363 2114_II P-MTAB-41364 2114_II:Biotin ## Label Protocol.REF.3 Assay.Name Technology.Type Array.Design.REF ## 164_I_.CEL biotin P-MTAB-41366 164_I_ array assay A-AFFY-141 ## 164_II.CEL biotin P-MTAB-41366 164_II array assay A-AFFY-141 ## 183_I.CEL biotin P-MTAB-41366 183_I array assay A-AFFY-141 ## 183_II.CEL biotin P-MTAB-41366 183_II array assay A-AFFY-141 ## 2114_I.CEL biotin P-MTAB-41366 2114_I array assay A-AFFY-141 ## 2114_II.CEL biotin P-MTAB-41366 2114_II array assay A-AFFY-141 ## Term.Source.REF Protocol.REF.4 Array.Data.File ## 164_I_.CEL ArrayExpress P-MTAB-41367 164_I_.CEL ## 164_II.CEL ArrayExpress P-MTAB-41367 164_II.CEL ## 183_I.CEL ArrayExpress P-MTAB-41367 183_I.CEL ## 183_II.CEL ArrayExpress P-MTAB-41367 183_II.CEL ## 2114_I.CEL ArrayExpress P-MTAB-41367 2114_I.CEL ## 2114_II.CEL ArrayExpress P-MTAB-41367 2114_II.CEL ## Comment..ArrayExpress.FTP.file. ## 164_I_.CEL ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-2967/E-MTAB-2967.raw.1.zip ## 164_II.CEL ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-2967/E-MTAB-2967.raw.1.zip ## 183_I.CEL ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-2967/E-MTAB-2967.raw.1.zip ## 183_II.CEL ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-2967/E-MTAB-2967.raw.1.zip ## 2114_I.CEL ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-2967/E-MTAB-2967.raw.1.zip ## 2114_II.CEL ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-2967/E-MTAB-2967.raw.1.zip ## Factor.Value.disease. Factor.Value.phenotype. ## 164_I_.CEL Crohn's disease non-inflamed colonic mucosa ## 164_II.CEL Crohn's disease inflamed colonic mucosa ## 183_I.CEL Crohn's disease non-inflamed colonic mucosa ## 183_II.CEL Crohn's disease inflamed colonic mucosa ## 2114_I.CEL Crohn's disease non-inflamed colonic mucosa ## 2114_II.CEL Crohn's disease inflamed colonic mucosa這么展示更利于查看
t(Biobase::pData(raw_data)[1,])## 164_I_.CEL ## Source.Name "164_I" ## Characteristics.individual. "164" ## Characteristics.organism. "Homo sapiens" ## Characteristics.disease. "Crohn's disease" ## Characteristics.organism.part. "colon" ## Characteristics.phenotype. "non-inflamed colonic mucosa" ## Material.Type "organism part" ## Protocol.REF "P-MTAB-41361" ## Protocol.REF.1 "P-MTAB-41363" ## Extract.Name "164_I" ## Protocol.REF.2 "P-MTAB-41364" ## Labeled.Extract.Name "164_I:Biotin" ## Label "biotin " ## Protocol.REF.3 "P-MTAB-41366" ## Assay.Name "164_I_" ## Technology.Type "array assay" ## Array.Design.REF "A-AFFY-141" ## Term.Source.REF "ArrayExpress" ## Protocol.REF.4 "P-MTAB-41367" ## Array.Data.File "164_I_.CEL" ## Comment..ArrayExpress.FTP.file. "ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-2967/E-MTAB-2967.raw.1.zip" ## Factor.Value.disease. "Crohn's disease" ## Factor.Value.phenotype. "non-inflamed colonic mucosa"篩選并保留關(guān)心的樣本屬性信息,個體信息 (Source.Name,Characteristics.individual.),疾病信息 (Factor.Value.disease.), 發(fā)炎與否 (Factor.Value.phenotype.)。
Biobase::pData(raw_data) <- Biobase::pData(raw_data)[, c("Source.Name", "Characteristics.individual.", "Factor.Value.disease.", "Factor.Value.phenotype.")]原始數(shù)據(jù)質(zhì)控
exprs(raw_data)可獲取原始的表達信息,行代表芯片探針在芯片上的位置,列代表每個樣品。
Biobase::exprs(raw_data)[1:5, 1:5]## 164_I_.CEL 164_II.CEL 183_I.CEL 183_II.CEL 2114_I.CEL ## 1 4496 5310 4492 4511 2872 ## 2 181 280 137 101 91 ## 3 4556 5104 4379 4608 2972 ## 4 167 217 99 79 82 ## 5 89 110 69 58 47對表達數(shù)據(jù)取對數(shù),并進行主成分分析。每個點代表一個樣品,顏色代表是否發(fā)炎,形狀代表疾病類型。從圖中可以看出,在第一主成分兩種疾病區(qū)分比較明顯,在第二主成分疾病的區(qū)別也略大于是否發(fā)炎。而我們的關(guān)注點是發(fā)炎,而不是不同類型的疾病,需要在下游分析中考慮移除疾病的影響。
exp_raw <- log2(Biobase::exprs(raw_data)) PCA_raw <- prcomp(t(exp_raw), scale. = FALSE)percentVar <- round(100*PCA_raw$sdev^2/sum(PCA_raw$sdev^2),1) sd_ratio <- sqrt(percentVar[2] / percentVar[1])dataGG <- data.frame(PC1 = PCA_raw$x[,1], PC2 = PCA_raw$x[,2],Disease = pData(raw_data)$Factor.Value.disease.,Phenotype = pData(raw_data)$Factor.Value.phenotype.,Individual = pData(raw_data)$Characteristics.individual.)ggplot(dataGG, aes(PC1, PC2)) +geom_point(aes(shape = Disease, colour = Phenotype)) +ggtitle("PCA plot of the log-transformed raw expression data") +xlab(paste0("PC1, VarExp: ", percentVar[1], "%")) +ylab(paste0("PC2, VarExp: ", percentVar[2], "%")) +theme(plot.title = element_text(hjust = 0.5))+coord_fixed(ratio = sd_ratio) +scale_shape_manual(values = c(4,15)) + scale_color_manual(values = c("darkorange2", "dodgerblue4"))另外,需要看下原始數(shù)據(jù)的表達分布。從圖中看出,不同的芯片原始表達分布不均一,需要做合適的標準化處理。
# oligo::boxplot(raw_data, target = "core", # main = "Boxplot of log2-intensitites for the raw data") dataBoxplot <- data.frame(t(exp_raw), Sample=colnames(exp_raw),Disease = pData(raw_data)$Factor.Value.disease.,Phenotype = pData(raw_data)$Factor.Value.phenotype.,Individual = pData(raw_data)$Characteristics.individual.)dataBoxplot <- reshape2::melt(dataBoxplot, id.vars=c("Sample","Disease","Phenotype","Individual"))ggplot(dataBoxplot, aes(x=Sample, y=value)) + geom_boxplot(aes(color=Disease)) + theme_bw() + theme(axis.text.x=element_text(angle=45, hjust=1, vjust=1))arrayQualityMetrics可以對芯片進行更多層面的檢測,并生成一個檢測報告。
arrayQualityMetrics(expressionset = raw_data,outdir = "ysx",force = TRUE, do.logtransform = TRUE,intgroup = c("Factor.Value.disease.", "Factor.Value.phenotype."))背景校正
鑒于探針的熒光強度會受到非特異性雜交和光學(xué)檢測系統(tǒng)噪音的影響,因此需要對信號強度進行背景校正,從而能更真實的反映和比較特異性雜交產(chǎn)生的信號,更好的反應(yīng)基因表達情況。
芯片間標準化 (calibration)
不同的芯片雜交受很多因素影響,比如反轉(zhuǎn)錄效率,標記或雜交反應(yīng),芯片物理性質(zhì),試劑批次和實驗室環(huán)境等,校正后,才可以在不同芯片之間可比。
Summarization
在Affymetrix平臺,一個轉(zhuǎn)錄本通常用多個探針檢測。對每個基因來說,需要根據(jù)所有探針的信息估算出其相對于總的轉(zhuǎn)錄本的表達值,也就是Summarization。隨后就可以使用注釋數(shù)據(jù)集獲得基因的symbols或ENSEMBL ID。我們這個平臺的注釋信息存儲在hugene10sttranscriptcluster.db包中。
head(ls("package:hugene10sttranscriptcluster.db"))## [1] "hugene10sttranscriptcluster" ## [2] "hugene10sttranscriptcluster_dbconn" ## [3] "hugene10sttranscriptcluster_dbfile" ## [4] "hugene10sttranscriptcluster_dbInfo" ## [5] "hugene10sttranscriptcluster_dbschema" ## [6] "hugene10sttranscriptcluster.db"Affymetrix芯片”probesets”的變更
在采用芯片檢測樣品前,需要先提取RNA,加上一個熒光標簽,并進行至少一輪PCR擴增以增加檢測量。傳統(tǒng)上這一步是根據(jù)mRNA``3'端的poly-A序列設(shè)計引物完成反轉(zhuǎn)。這樣最終的擴增產(chǎn)物里面3’端的序列會更多些。而且如果起始RNA片段化了或降解了,5’端被檢測到的幾率更低了。因此設(shè)計探針時,更多的探針也是針對3’端設(shè)置的,這就是3' IVT Affymetrix芯片。它是基于探針集的,探針集的一組固定探針用于檢測一個特定的基因或轉(zhuǎn)錄本。在一些情況下,如選擇性Poly-A位點出現(xiàn)時,一個基因需要設(shè)計多個探針集檢測。
而后來的Gene/ExonAffymetrix芯片是基于外顯子的,探針集來源于外顯子區(qū)。這時為了獲得基因的表達,需要做兩步summarization,第一步是從探針集獲得外顯子的表達,然后根據(jù)多個來源于同一個基因或轉(zhuǎn)錄本的外顯子獲取基因的表達。因此需要合適的基因注釋包,如我們這用到的hugene10sttranscriptcluster.db。
Gene芯片相比于Exon芯片探針更少,只保留Exon芯片的一部分”good”探針。在Exon芯片上至少需要4個探針來代表一個Exon,而在Gene芯片上,很多代表一個基因的探針集只有3個或更少的探針。
如下圖所示,用單一顏色代表一個探針集,一個基因的探針集用一個顏色集合表示,如所有黃色探針代表一個外顯子,所有黃色、橙色、紅色探針集都用于檢測同一個基因。
左圖中,每個外顯子或探針集包含很多探針 (同樣顏色的塊出現(xiàn)多次),因此做探針集或外顯子水平的summarization是合適的。而對于gene芯片,每個探針集的探針數(shù)目很少,因此在探針集或外顯子水平的summarization是不推薦的,雖然也可以通過包hugene10stprobeset.db來完成。
而且在Gene和Exon芯片上不再有錯配探針 (mismatch probes)。錯配探針本來是用做背景校正排除非特異性雜交和背景噪音的,但被更好的計算方式取代。
一步完成背景校正、標準化、表達整合
oligo包提供了一個函數(shù)rma允許一步完成基于去卷積的背景校正、分位數(shù)校正標準化 (quantile normalization)和RMA (robust multichip average) 表達整合 (summarization)。
相對log表達 (RLE) 質(zhì)控分析
我們先利用rma函數(shù)進行背景校正和summarization獲得基因的表達量,而略過標準化。rma函數(shù)輸出的結(jié)果已經(jīng)做過對數(shù)處理。
palmieri_eset <- oligo::rma(raw_data, target = "core", normalize = FALSE)## Background correcting ## Calculating Expression然后進行RLE (Relative Log Expression)分析,首先計算每個轉(zhuǎn)錄本在所有樣品中的表達中位數(shù),然后每個轉(zhuǎn)錄本的表達減去這個中位數(shù),整理格式進行繪圖。
row_medians_assayData <- Biobase::rowMedians(as.matrix(Biobase::exprs(palmieri_eset)))# 減去中值 RLE_data <- sweep(Biobase::exprs(palmieri_eset), 1, row_medians_assayData)RLE_data <- as.data.frame(RLE_data) # 轉(zhuǎn)換為長格式 RLE_data_gathered <- tidyr::gather(RLE_data, patient_array, log2_expression_deviation)ggplot2::ggplot(RLE_data_gathered, aes(patient_array, log2_expression_deviation)) + geom_boxplot(outlier.shape = NA) + ylim(c(-2, 2)) + theme(axis.text.x = element_text(colour = "aquamarine4", angle = 60, size = 6.5, hjust = 1 ,face = "bold"))圖中,Y-軸代表每個芯片中轉(zhuǎn)錄本表達與該轉(zhuǎn)錄本在所有芯片表達中位數(shù)的偏差。箱體越延展表明芯片中轉(zhuǎn)錄本的表達量與中位表達差別越大,箱體在Y-軸的偏移表示大部分轉(zhuǎn)錄本的表達定量存在系統(tǒng)的增高或降低。這可能存在質(zhì)量問題或是檢測批次效應(yīng)造成的。因此如果一個樣品的boxplot的形狀和中值與其它樣品差別很大,需要進一步評估是否為異常樣本,并考慮移除。
圖中5個中位數(shù)在-0.7~-0.8之間的樣品?2826_I,?2826_II,?3262_II,3302_II和3332_II可能是異常樣品。需要結(jié)合后面的聚類分析,再確認是否移除。
校正三部曲
background-correct, normalize and summarize.
palmieri_eset_norm <- oligo::rma(raw_data, target = "core")## Background correcting ## Normalizing ## Calculating Expression參數(shù)target定義summarization的程度,默認是core表示獲得基因水平的表達 (using transcript clusters containing “safely” annotated genes) (對gene芯片,只有這么一個選項)。如果是外顯子芯片,還可以選擇extended和full。如果想獲得外顯子水平的表達,也可以用probeset做參數(shù),但是一般不推薦。
除了RMA還有其它的背景校正和標準化方法,但RMA通常是一個好的默認選擇。RMA在所有芯片間使用quantile normalization標準化方法使得不同樣品檢測的表達值可比。
出校準化后的表達值
norm_exprs <- Biobase::exprs(palmieri_eset_norm) norm_exprs <- data.frame(ID=rownames(norm_exprs), norm_exprs) write.table(norm_exprs, "normalized_probe_expr.tsv", sep="\t", row.names=F, quote=F)校準化后數(shù)據(jù)的質(zhì)量評估
主成分分析
exp_palmieri <- Biobase::exprs(palmieri_eset_norm) PCA <- prcomp(t(exp_palmieri), scale = FALSE)percentVar <- round(100*PCA$sdev^2/sum(PCA$sdev^2),1) sd_ratio <- sqrt(percentVar[2] / percentVar[1])dataGG <- data.frame(PC1 = PCA$x[,1], PC2 = PCA$x[,2],Disease = Biobase::pData(palmieri_eset_norm)$Factor.Value.disease.,Phenotype = Biobase::pData(palmieri_eset_norm)$Factor.Value.phenotype.)ggplot(dataGG, aes(PC1, PC2)) +geom_point(aes(shape = Disease, colour = Phenotype)) +ggtitle("PCA plot of the calibrated, summarized data") +xlab(paste0("PC1, VarExp: ", percentVar[1], "%")) +ylab(paste0("PC2, VarExp: ", percentVar[2], "%")) +theme(plot.title = element_text(hjust = 0.5)) +coord_fixed(ratio = sd_ratio) +scale_shape_manual(values = c(4,15)) + scale_color_manual(values = c("darkorange2", "dodgerblue4"))第一主坐標軸大體可以分開發(fā)炎與未發(fā)炎樣品,第二主坐標軸分開兩種不同的疾病。(從圖上看,還是疾病的區(qū)分更明顯)
熱圖聚類分析
通過熱圖展示樣品之間的相似度關(guān)系,并且標記樣品的屬性如是否發(fā)炎或是否來自同一個疾病。
# 生成行注釋 phenotype_names <- ifelse(str_detect(pData(palmieri_eset_norm)$Factor.Value.phenotype., "non"), "non_infl.", "infl.")disease_names <- ifelse(str_detect(pData(palmieri_eset_norm)$Factor.Value.disease., "Crohn"), "CD", "UC")annotation_for_heatmap <- data.frame(Phenotype = phenotype_names, Disease = disease_names)row.names(annotation_for_heatmap) <- row.names(pData(palmieri_eset_norm))計算Manhattan距離,并進行聚類分析
# 使用dist計算樣本之間的距離,首選做下轉(zhuǎn)置,因為dist是對列進行操作的 # dist默認是歐式距離,反應(yīng)直線路徑的平方距離 # 這里用Manhattan距離,計算的是直角路徑的絕對距離,計算結(jié)果更穩(wěn)定 dists <- as.matrix(dist(t(exp_palmieri), method = "manhattan"))rownames(dists) <- row.names(pData(palmieri_eset_norm)) hmcol <- rev(colorRampPalette(RColorBrewer::brewer.pal(9, "YlOrRd"))(255)) colnames(dists) <- NULL # 設(shè)置對角距離為NA,增強顏色的展示力 diag(dists) <- NA# www.ehbio.com/ImageGP ann_colors <- list(Phenotype = c(non_infl. = "chartreuse4", infl. = "burlywood3"),Disease = c(CD = "blue4", UC = "cadetblue2")) pheatmap(dists, col = (hmcol), annotation_row = annotation_for_heatmap,annotation_colors = ann_colors,legend = TRUE, treeheight_row = 0,legend_breaks = c(min(dists, na.rm = TRUE), max(dists, na.rm = TRUE)), legend_labels = (c("small distance", "large distance")),main = "Clustering heatmap for the calibrated samples")熱圖中上部黃色的條帶可能是異常樣品 (2826_II, 3262_II, 3271_I, 2978_II 和 3332_II),部分樣品與之前RLE圖推測可能是異常的樣品重合,可以考慮移除。這里為了跟原文一致,所有樣品都保留了下來。
另外arrayQualityMetrics有更精確的計算異常樣品的方式,我們后續(xù)也會提供基于網(wǎng)絡(luò)圖距離的異常值剔除方法。
過濾低表達基因
芯片數(shù)據(jù)中有一大部分探針的信號值與背景值相差不大,而且在不同樣品中差別不大,不能給我們的生物問題提供更多信息。(注:雖然有部分表達低的基因表達變化比較小時就可以有比較大的作用,但因為其表達低,檢測噪音也大,結(jié)果可信度一般)
首選繪制下每個轉(zhuǎn)錄本在所有芯片中的表達中位數(shù)的分布 (共計33,297轉(zhuǎn)錄本)。可以看到左側(cè)區(qū)域富集低表達的轉(zhuǎn)錄本,具體選擇移除的閾值主觀性比較強,這里選擇表達中位數(shù)需要大于4,給后續(xù)分析多保留一些基因。也可以排序一下,保留前75%的基因。
palmieri_medians <- rowMedians(Biobase::exprs(palmieri_eset_norm))hist_res <- hist(palmieri_medians, 100, col = "cornsilk1", freq = FALSE, main = "Histogram of the median intensities", border = "antiquewhite4",xlab = "Median intensities") man_threshold <- 4hist_res <- hist(palmieri_medians, 100, col = "cornsilk", freq = FALSE, main = "Histogram of the median intensities",border = "antiquewhite4",xlab = "Median intensities")abline(v = man_threshold, col = "coral4", lwd = 2)假如最小的實驗組有5個樣品,如果一個轉(zhuǎn)錄本檢測到表達值大于之前設(shè)定的閾值 (man_threshold)的樣本數(shù)小于5,則移除。
首先獲得各個實驗組的樣品數(shù)
no_of_samples <- table(paste0(pData(palmieri_eset_norm)$Factor.Value.disease., "_", pData(palmieri_eset_norm)$Factor.Value.phenotype.)) no_of_samples ## ## Crohn's disease_inflamed colonic mucosa ## 15 ## Crohn's disease_non-inflamed colonic mucosa ## 15 ## ulcerative colitis_inflamed colonic mucosa ## 14 ## ulcerative colitis_non-inflamed colonic mucosa ## 14# 獲得最小樣品數(shù) samples_cutoff <- min(no_of_samples)# sum(x > man_threshold):轉(zhuǎn)錄本在多少樣品中表達值高于給定閾值 idx_man_threshold <- apply(Biobase::exprs(palmieri_eset_norm), 1,function(x){sum(x > man_threshold) >= samples_cutoff}) table(idx_man_threshold)## idx_man_threshold ## FALSE TRUE ## 10493 22804# 提取過濾轉(zhuǎn)錄本 palmieri_manfiltered <- subset(palmieri_eset_norm, idx_man_threshold)注釋轉(zhuǎn)錄本簇
在我們構(gòu)建線性模型做差異分析之前,先把探針映射到Gene symbol,并只保留這部分基因用于后續(xù)分析。
# 使用select函數(shù),獲得探針I(yè)D對應(yīng)的Gene symbol和描述 anno_palmieri <- AnnotationDbi::select(hugene10sttranscriptcluster.db,keys = (featureNames(palmieri_manfiltered)),columns = c("SYMBOL", "GENENAME"),keytype = "PROBEID") # 過濾掉沒有Symbol的探針 anno_palmieri <- subset(anno_palmieri, !is.na(SYMBOL)) head(anno_palmieri)如果一個探針對應(yīng)于多個Gene symbol,我們沒有辦法判斷到底是哪個,只能移除。
# 按探針I(yè)D分組 anno_grouped <- group_by(anno_palmieri, PROBEID)# 計算每組Symbol的個數(shù) anno_summarized <- dplyr::summarize(anno_grouped, no_of_matches = n_distinct(SYMBOL))head(anno_summarized)# 過濾Symbol個數(shù)大于1的 anno_filtered <- filter(anno_summarized, no_of_matches > 1)head(anno_filtered)probe_stats <- anno_filtered nrow(probe_stats)## [1] 1763有1763個探針I(yè)D對應(yīng)多個Gene Symbol,從Assya數(shù)據(jù)中移除這些探針I(yè)D。
# featureNames(palmieri_manfiltered): 獲取所有探針名字 ids_to_exlude <- (featureNames(palmieri_manfiltered) %in% probe_stats$PROBEID)table(ids_to_exlude)## ids_to_exlude ## FALSE TRUE ## 21041 1763palmieri_final <- subset(palmieri_manfiltered, !ids_to_exlude)validObject(palmieri_final)## [1] TRUE前面是從assay數(shù)據(jù)中移除了這些探針I(yè)D,后面需要從feature data中移除:
head(anno_palmieri)現(xiàn)在的feature data是空的,增加一列PROBID。
# 增加一列PROBID fData(palmieri_final)$PROBEID <- rownames(fData(palmieri_final)) # 使用left_join根據(jù)PROBID列,合并,增加注釋信息 fData(palmieri_final) <- left_join(fData(palmieri_final), anno_palmieri)## Joining, by = "PROBEID"# restore rownames after left_join rownames(fData(palmieri_final)) <- fData(palmieri_final)$PROBEID validObject(palmieri_final)## [1] TRUE總結(jié)
以上是生活随笔為你收集整理的典型医学设计实验GEO数据分析 (step-by-step) - 数据获取到标准化的全部內(nèi)容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: PATH和path,傻傻分不清
- 下一篇: 改写教科书!人类细胞可将RNA序列写入D