复现nature communication PCA原图|代码分析(一)
復現PCA原圖之bulk RNA-seq
NGS系列文章包括NGS基礎、轉錄組分析?(Nature重磅綜述|關于RNA-seq你想知道的全在這)、ChIP-seq分析?(ChIP-seq基本分析流程)、單細胞測序分析?(重磅綜述:三萬字長文讀懂單細胞RNA測序分析的最佳實踐教程 (原理、代碼和評述))、DNA甲基化分析、重測序分析、GEO數據挖掘(典型醫學設計實驗GEO數據分析 (step-by-step) - Limma差異分析、火山圖、功能富集)等內容。
2020年4月14日,Sanger研究團隊于nature communication在線發表了題為Single-cell transcriptomics identifies an effectorness gradient shaping the response of CD4+ T cells to cytokines的研究內容,作者使用蛋白質組學、bulk RNA-seq和單細胞轉錄組測序對人體40,000個以上的na?ve and memory CD4+ T cells進行分析,發現細胞類型之間的細胞因子反應差異很大。memory T細胞不能分化為Th2表型,但可以響應iTreg極化獲得類似Th17的表型。單細胞分析表明,T細胞構成了一個轉錄連續體(transcriptional continuum),從幼稚到中樞和效應記憶T細胞,形成了一種效應梯度,并伴隨著趨化因子和細胞因子表達的增加。最后,作者表明,T細胞活化和細胞因子反應受效應梯度的影響。
我們這次復現一個Fig1cPCA圖和Fig2aPCA圖,這個說起來容易、做起來可就不一定容易的plot!
以上是Fig1c原圖,圖注為“PCA plots from the whole proteome ?of TN and TM cells. Different colors correspond to cell types and different shades to stimulation time points. PCA plots were derived using 47 naive and 47 memory T cell samples for RNAseq”,作者使用不同處理方式對human naive (TN) and memory (TM) CD4+ T cells進行處理,然后收集不同時間點的sample進行bulk RNA-seq。
以上為Fig 2a原圖,圖注為“PCA plot from the full transcriptome of TN and TM cells following five days of cytokine stimulations. Only stimulated cells were included in this analysis. PCA plots were derived using 20 naive and 21 memory T cells samples ”
我們需要復現該圖之前,先需要下載數據,可以點擊https://www.opentargets.org/projects/effectorness 對bulk RNA-Seq的count數據和metadata數據進行下載,然后進行以下步驟:
加載R包
library(annotables) library(SummarizedExperiment) library(DESeq2) library(ggplot2) library(cowplot) library(rafalib) library(dplyr) library(reshape2) library(RColorBrewer) library(pheatmap) library(limma) library(ggrepel)數據格式化
## 加載counts矩陣 counts <- read.table("NCOMMS-19-7936188_bulk_RNAseq_raw_counts.txt", header=T, row.names=1) View(counts)# 行是基因的Ensembl ID,列是sample_ID,數據有94個樣品## 加載sample metadata sample.info <- read.table("NCOMMS-19-7936188_bulk_RNAseq_metadata.txt", header=T, row.names=1, stringsAsFactors = T) View(sample.info) ## 以上數據有10列:sample_id、cell_type、cytokine_condition、stimulation_time、donor_id、sex、age、sequencing_batch、cell_culture_batch和rna_integrity_number。## 可以看出celltype有兩種:naive和memory;cytokine_condition有7種:INFB,resting,iTreg,Th0,Th1,Th2,Th17;stimulation_time有兩種:16h和5d。# 數據格式化 ## 將所有注釋轉換為合適的數據類型 sample.info$sequencing_batch <- factor(sample.info$sequencing_batch) sample.info$cell_culture_batch <- factor(sample.info$cell_culture_batch) sample.info$activation_status <- ifelse(sample.info$cytokine_condition == "Resting", "Resting", "Activated")從Ensembl87/GRCh38獲取基因注釋
## Fetching gene annotations from Ensembl87/GRCh38 features.data <- data.frame(grch38) features.data <- features.data[!duplicated(features.data[c("ensgene")]),] rownames(features.data) <- features.data$ensgene features.data <- features.data[,-1]NGS基礎 - 參考基因組和基因注釋文件
將基因注釋轉換為最合適的數據類型:
## Converting gene annotations to the most suitable data types gene.info$Gene_id <- as.character(gene.info$Gene_id) gene.info$Start <- as.numeric(gene.info$Start) gene.info$End <- as.numeric(gene.info$End) gene.info$Strand <- factor(gene.info$Strand)創建relevant metadata的變量:
## Creating a variable with relevant metadata for the study meta <- list(Study="Mapping cytokine induced gene expression changes in human CD4+ T cells",Experiment="RNA-seq panel of cytokine induced T cell polarisations",Laboratory="Trynka Group, Wellcome Sanger Institute",Experimenter=c("Eddie Cano-Gamez","Blagoje Soskic","Deborah Plowman"),Description="To study cytokine-induced cell polarisation, we isolated human naive and memory CD4+ T cells in triplicate from peripheral blood of healthy individuals. Next, we polarised the cells with different cytokine combinations linked to autoimmunity and performed RNA-sequencing.",Methdology=c("Library Prep: Illumina TruSeq (Poly-A capture method)", "Sequencing Platform: Illumina HiSeq 2500","Read Alignment: STAR (MAPQ > 20)","Read Quantification: featureCounts","Reference: Ensembl 87 (GRCh38/hg38)"),Characteristics="Data type: Raw counts",Date="September, 2019",URL="https://doi.org/10.1101/753731" )建立SummarizedExperiment object
# CREATING SUMMARIZED EXPERIMENT OBJECT RNA.experiment <- SummarizedExperiment(assays=list(counts=as.matrix(counts)),colData=sample.info,rowData=gene.info,metadata=meta)注意:這里我的報錯了。。。
說的其實蠻明白的,于是運行以下代碼:
rownames(sample.info) <- NULL再次建立SummarizedExperiment object
# CREATING SUMMARIZED EXPERIMENT OBJECT RNA.experiment <- SummarizedExperiment(assays=list(counts=as.matrix(counts)),colData=sample.info,rowData=gene.info,metadata=meta)saveRDS(RNA.experiment, "bulkRNAseq_summarizedExperiment.rds")加載數據
exp <- readRDS("bulkRNAseq_summarizedExperiment.rds")基因過濾
根據以下條件篩選基因:
刪除映射到Y染色體的基因;
刪除映射到HLA區域的基因(即Ensembl 87/GRCh38中的“chr6 28,000,000-34,000,000”);
僅保留蛋白質編碼基因和lincRNA(Long intergenic non-coding RNA);
刪除低表達的基因(即至少30 reads)。
注意:DESeq2在進行統計測試時執行嚴格的自動過濾。
exp <- exp[!is.na(rowData(exp)$Chr)] exp <- exp[rowData(exp)$Chr != "Y",] exp <- exp[!(rowData(exp)$Chr == "6" & rowData(exp)$Start > 28000000 & rowData(exp)$End < 34000000),] exp <- exp[rowData(exp)$Biotype == "protein_coding" | rowData(exp)$Biotype == "lincRNA",] exp <- exp[rowSums(assay(exp))>30,]將數據轉換為DESeqDataSet。由于此代碼中沒有進行統計測試,因此design暫時為空(即?1)。
dds <- DESeqDataSet(exp, design= ~ 1) rowData(dds) <- rowData(exp)DESeqDataSet是DESeq2包(DESeq2差異基因分析和批次效應移除)中儲存reads count以及統計分析過程中的數據的一個“對象”,在代碼中常表示為“dds”。
一個DESeqDataSet對象必須關聯相應的design公式。design公式指明了要對哪些變量進行統計分析。該公式(上文中的design = ~batch + condition)以短波浪字符開頭,中間用加號連接變量。design公式可以修改,但是相應的差異表達分析就需要重新做,因為design公式是用來估計統計模型的分散度以及log2 fold change的。
合并技術重復
寫在前面:DESeq2 provides a function collapseReplicates which can assist in combining the counts from technical replicates into single columns of the count matrix. The term technical replicate implies multiple sequencing runs of the same library. You should NOT collapse biological replicates using this function.
將相同條件的技術復制合計到同一列中:
dds$ID <- factor(paste0(dds$donor_id, "_", dds$cell_type, "_", dds$stimulation_time, "_", dds$cytokine_condition)) dds.coll <- collapseReplicates(object=dds, groupby=dds$ID) dds.coll <- estimateSizeFactors(dds.coll) #在DESeq2中,通過estimateSizeFactors函數為每個樣本計算一個系數,稱之為sizefactor;saveRDS(dds.coll, "rawCounts_bulkRNAseq_DESeq2.rds")數據歸一化
通過log2對數據進行歸一化:
rld.coll <- rlog(dds.coll, blind=TRUE) saveRDS(rld.coll, "rlogCounts_bulkRNAseq_DESeq2.rds")數據可視化
定義函數:執行主成分分析(PCA)(一文看懂PCA主成分分析)并返回每個樣本的前6個PC值,以及樣本注釋和每個成分所解釋的方差百分比。
get.pcs <- function(exp){pca <- prcomp(t(assay(exp)))pVar <- pca$sdev^2/sum(pca$sdev^2)intgroup=as.data.frame(colData(exp))d <- data.frame(PC1=pca$x[, 1], PC2=pca$x[, 2], PC3=pca$x[, 3],PC4=pca$x[, 4], PC5=pca$x[, 5], PC6=pca$x[, 6], intgroup)d <- list(d,pVar[1:6])names(d) <- c("PCs","pVar")return(d) }對rlog計數矩陣執行主成分分析:
pcs.rld <- get.pcs(rld.coll)可視化PCA的結果
ggplot(data=pcs.rld$PCs, aes(x=PC1, y=PC2, color=cell_type, shape=activation_status, alpha = stimulation_time)) +geom_point(size = 8) +xlab(paste0("PC1:", round(pcs.rld$pVar[1]*100), "% variance")) +ylab(paste0("PC2: ", round(pcs.rld$pVar[2]*100), "% variance")) +scale_colour_manual(values = c("#5AB4AC","#AF8DC3")) + scale_alpha_discrete(range = c(0.5,1)) +coord_fixed() + theme_bw() +theme(panel.grid = element_blank())刪除由PCA標識的異常樣本(即262_CD4_Memory_16h_iTreg):
dds.coll <- dds.coll[, dds.coll$ID != "262_CD4_Memory_16h_iTreg"] rld.coll <- rld.coll[, rld.coll$ID != "262_CD4_Memory_16h_iTreg"] saveRDS(dds.coll, "rawCounts_bulkRNAseq_DESeq2.rds") saveRDS(rld.coll, "rlogCounts_bulkRNAseq_DESeq2.rds")在clean data中進行PCA:
pcs.rld <- get.pcs(rld.coll)繼續可視化PCA結果
ggplot(data=pcs.rld$PCs, aes(x=PC1, y=PC2, color=cell_type, shape=activation_status, alpha = stimulation_time)) +geom_point(size = 8) +xlab(paste0("PC1:", round(pcs.rld$pVar[1]*100), "% variance")) +ylab(paste0("PC2: ", round(pcs.rld$pVar[2]*100), "% variance")) +scale_colour_manual(values = c("#5AB4AC","#AF8DC3")) + scale_alpha_discrete(range = c(0.5,1)) +coord_fixed() + theme_bw() +theme(panel.grid = element_blank())這下分布確實真的好看,去掉離群值后不同細胞的分散度更為合理。
然后我們可以看看原圖是什么樣子噠!
其實還蠻像的,,,其實好看的圖都是“做”出來的。。。。
細胞類型和時間點特定分析
將naive T細胞和memory T細胞樣本分為兩個不同的DESeqDataSet。
dds.naive <- dds.coll[,dds.coll$cell_type=="CD4_Naive"] dds.memory <- dds.coll[,dds.coll$cell_type=="CD4_Memory"]rld.naive <- rld.coll[,rld.coll$cell_type=="CD4_Naive"] rld.memory <- rld.coll[,rld.coll$cell_type=="CD4_Memory"]naive T細胞
naive T細胞的早期刺激
僅對16h-stimulated naive T cells進行PCA:
rld_naive16h <- rld.naive[, (rld.naive$activation_status == "Activated") & (rld.naive$stimulation_time=="16h")] pcs_naive16h <- get.pcs(rld_naive16h)ggplot(data=pcs_naive16h$PCs, aes(x=PC1, y=PC2, color=as.factor(donor_id))) +geom_point(size = 8) + xlab(paste0("PC1: ", round(pcs_naive16h$pVar[1]*100), "% variance")) +ylab(paste0("PC2: ", round(pcs_naive16h$pVar[2]*100), "% variance")) +scale_colour_brewer(palette = "Dark2") + coord_fixed() + theme_bw() +theme(panel.grid = element_blank())\#按照不同樣本id進行分析;去掉個體間變異性:
assay(rld_naive16h) <- removeBatchEffect(assay(rld_naive16h), batch = factor(as.vector(colData(rld_naive16h)$donor_id)))重新計算PCA:
pcs_naive16h <- get.pcs(rld_naive16h)ggplot(data=pcs_naive16h$PCs, aes(x=PC1, y=PC2, color=cytokine_condition)) +geom_point(size = 8) + geom_label_repel(aes(label=cytokine_condition, color=cytokine_condition)) +xlab(paste0("PC1: ", round(pcs_naive16h$pVar[1]*100), "% variance")) +ylab(paste0("PC2: ", round(pcs_naive16h$pVar[2]*100), "% variance")) +scale_colour_brewer(palette = "Dark2") +scale_fill_brewer(palette = "Dark2") +coord_fixed() + theme_bw() +theme(panel.grid = element_blank(), legend.position = "none")naive T細胞的晚期刺激
rld_naive5d <- rld.naive[, (rld.naive$activation_status == "Activated") & (rld.naive$stimulation_time=="5d")] pcs_naive5d <- get.pcs(rld_naive5d)ggplot(data=pcs_naive5d$PCs, aes(x=PC1, y=PC2, color=as.factor(donor_id))) +geom_point(size = 8) + xlab(paste0("PC1: ", round(pcs_naive16h$pVar[1]*100), "% variance")) +ylab(paste0("PC2: ", round(pcs_naive16h$pVar[2]*100), "% variance")) +scale_colour_brewer(palette = "Dark2") + coord_fixed() + theme_bw() +theme(panel.grid = element_blank())去掉個體間變異性:
assay(rld_naive5d) <- removeBatchEffect(assay(rld_naive5d), batch = factor(as.vector(colData(rld_naive5d)$donor_id)))重新計算PCA:
pcs_naive5d <- get.pcs(rld_naive5d)ggplot(data=pcs_naive5d$PCs, aes(x=PC1, y=PC2, color=cytokine_condition)) +geom_point(size = 8) + geom_label_repel(aes(label=cytokine_condition, color=cytokine_condition)) +xlab(paste0("PC1: ", round(pcs_naive16h$pVar[1]*100), "% variance")) +ylab(paste0("PC2: ", round(pcs_naive16h$pVar[2]*100), "% variance")) +scale_colour_brewer(palette = "Dark2") +scale_fill_brewer(palette = "Dark2") +coord_fixed() + theme_bw() +theme(panel.grid = element_blank(), legend.position = "none")原圖
可以看出和原圖還是很像的,分類也是蠻清楚的!
Memory T cells
memory T細胞的早期刺激
again……
Performing PCA on 16h-stimulated memory T cells only.
rld_memory16h <- rld.memory[, (rld.memory$activation_status == "Activated") & (rld.memory$stimulation_time=="16h")] pcs_memory16h <- get.pcs(rld_memory16h)ggplot(data=pcs_memory16h$PCs, aes(x=PC1, y=PC2, color=as.factor(donor_id))) +geom_point(size = 8) + xlab(paste0("PC1: ", round(pcs_naive16h$pVar[1]*100), "% variance")) +ylab(paste0("PC2: ", round(pcs_naive16h$pVar[2]*100), "% variance")) +scale_colour_brewer(palette = "Dark2") + coord_fixed() + theme_bw() +theme(panel.grid = element_blank())去掉個體間變異性:
assay(rld_memory16h) <- removeBatchEffect(assay(rld_memory16h), batch = factor(as.vector(colData(rld_memory16h)$donor_id)))重新計算PCA:
pcs_memory16h <- get.pcs(rld_memory16h)ggplot(data=pcs_memory16h$PCs, aes(x=PC1, y=PC2, color=cytokine_condition)) +geom_point(size = 8) + geom_label_repel(aes(label=cytokine_condition, color=cytokine_condition)) +xlab(paste0("PC1: ", round(pcs_naive16h$pVar[1]*100), "% variance")) +ylab(paste0("PC2: ", round(pcs_naive16h6$pVar[2]*100), "% variance")) +scale_colour_brewer(palette = "Dark2") +scale_fill_brewer(palette = "Dark2") +coord_fixed() + theme_bw() +theme(panel.grid = element_blank(), legend.position = "none")memory T細胞的晚期刺激
again……again…….
Performing PCA on 5 days-stimulated memory T cells only.
rld_memory5d <- rld.memory[, (rld.memory$activation_status == "Activated") & (rld.memory$stimulation_time=="5d")] pcs_memory5d <- get.pcs(rld_memory5d)ggplot(data=pcs_memory5d$PCs, aes(x=PC1, y=PC2, color=as.factor(donor_id))) +geom_point(size = 8) + xlab(paste0("PC1: ", round(pcs.n16$pVar[1]*100), "% variance")) +ylab(paste0("PC2: ", round(pcs.n16$pVar[2]*100), "% variance")) +scale_colour_brewer(palette = "Dark2") + coord_fixed() + theme_bw() +theme(panel.grid = element_blank())assay(rld_memory5d) <- removeBatchEffect(assay(rld_memory5d), batch = factor(as.vector(colData(rld_memory5d)$donor_id)))重新計算PCA
pcs_memory5d <- get.pcs(rld_memory5d)ggplot(data=pcs_memory5d$PCs, aes(x=PC1, y=PC2, color=cytokine_condition)) +geom_point(size = 8) + geom_label_repel(aes(label=cytokine_condition, color=cytokine_condition)) +xlab(paste0("PC1: ", round(pcs.n16$pVar[1]*100), "% variance")) +ylab(paste0("PC2: ", round(pcs.n16$pVar[2]*100), "% variance")) +scale_colour_brewer(palette = "Dark2") +scale_fill_brewer(palette = "Dark2") +coord_fixed() + theme_bw() +theme(panel.grid = element_blank(), legend.position = "none")原圖
基本分布還是差不多的,,,,
后面需要做什么差異分析的,就靠你自己了!
校對排版?| 生信寶典
你可能還想看
PCA主成分分析實戰和可視化 附R代碼和測試數據
用了這么多年的PCA可視化竟然是錯的!!!
什么?你做的差異基因方法不合適?
單細胞分群后,怎么找到Marker基因定義每一類群?
往期精品(點擊圖片直達文字對應教程)
后臺回復“生信寶典福利第一波”或點擊閱讀原文獲取教程合集
總結
以上是生活随笔為你收集整理的复现nature communication PCA原图|代码分析(一)的全部內容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 分享我们承建的三篇NAR的数据库
- 下一篇: 生信宝典之傻瓜式(六)查找转录因子的靶基