代码分析 | 单细胞转录组Normalization详解
標準化加高變基因選擇
NGS系列文章包括NGS基礎、轉錄組分析?(Nature重磅綜述|關于RNA-seq你想知道的全在這)、ChIP-seq分析?(ChIP-seq基本分析流程)、單細胞測序分析?(重磅綜述:三萬字長文讀懂單細胞RNA測序分析的最佳實踐教程 (原理、代碼和評述))、DNA甲基化分析、重測序分析、GEO數據挖掘(典型醫學設計實驗GEO數據分析 (step-by-step) - Limma差異分析、火山圖、功能富集)等內容。
最近找到了芬蘭CSC-IT科學中心主講的生物信息課程(https://www.csc.fi/web/training/-/scrnaseq )視頻,官網上還提供了練習素材以及詳細代碼,今天就來練習一下單細胞數據Normalization以及高變基因選擇的過程。
在本練習中,我們將數據標準化,然后比較兩個軟件包scran和Seurat分析得到的高變基因,并使用scater軟件包進行可視化(Hemberg-lab單細胞轉錄組數據分析(十二)- Scater單細胞表達譜tSNE可視化)。
knitr::opts_chunk$set(echo = TRUE) library(scater) library(scran) library(Seurat) options(stringsAsFactors = FALSE) set.seed(32546)可以使用以下命令下載數據,使用的數據是上次QC部分中使用的10X數據集-1k PBMCs using 10x v3 chemistry
system("curl -O http://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_filtered_feature_bc_matrix.h5")讀入數據并設置SingleCellExperiment對象:
# setwd("~/scrna-seq2019/day1/2-normalization/") pbmc.mat <- Read10X_h5("pbmc_1k_v3_filtered_feature_bc_matrix.h5") pbmc.sce <- SingleCellExperiment(assays = list(counts = as.matrix(pbmc.mat))) pbmc.sce <- pbmc.sce[rowSums(counts(pbmc.sce) > 0) > 2,] isSpike(pbmc.sce, "MT") <- grepl("^MT-", rownames(pbmc.sce)) pbmc.sce <- calculateQCMetrics(pbmc.sce) colnames(colData(pbmc.sce)) ## [1] "is_cell_control" ## [2] "total_features_by_counts" ## [3] "log10_total_features_by_counts" ## [4] "total_counts" ## [5] "log10_total_counts" ## [6] "pct_counts_in_top_50_features" ## [7] "pct_counts_in_top_100_features" ## [8] "pct_counts_in_top_200_features" ## [9] "pct_counts_in_top_500_features" ## [10] "total_features_by_counts_endogenous" ## [11] "log10_total_features_by_counts_endogenous" ## [12] "total_counts_endogenous" ## [13] "log10_total_counts_endogenous" ## [14] "pct_counts_endogenous" ## [15] "pct_counts_in_top_50_features_endogenous" ## [16] "pct_counts_in_top_100_features_endogenous" ## [17] "pct_counts_in_top_200_features_endogenous" ## [18] "pct_counts_in_top_500_features_endogenous" ## [19] "total_features_by_counts_feature_control" ## [20] "log10_total_features_by_counts_feature_control" ## [21] "total_counts_feature_control" ## [22] "log10_total_counts_feature_control" ## [23] "pct_counts_feature_control" ## [24] "pct_counts_in_top_50_features_feature_control" ## [25] "pct_counts_in_top_100_features_feature_control" ## [26] "pct_counts_in_top_200_features_feature_control" ## [27] "pct_counts_in_top_500_features_feature_control" ## [28] "total_features_by_counts_MT" ## [29] "log10_total_features_by_counts_MT" ## [30] "total_counts_MT" ## [31] "log10_total_counts_MT" ## [32] "pct_counts_MT" ## [33] "pct_counts_in_top_50_features_MT" ## [34] "pct_counts_in_top_100_features_MT" ## [35] "pct_counts_in_top_200_features_MT" ## [36] "pct_counts_in_top_500_features_MT"過濾質量較差的細胞:
pbmc.sce <- filter(pbmc.sce, pct_counts_MT < 20) pbmc.sce <- filter(pbmc.sce,total_features_by_counts > 1000 &total_features_by_counts < 4100)創建一個新的assay用于比較標準化和未標準化數據。
assay(pbmc.sce, "logcounts_raw") <- log2(counts(pbmc.sce) + 1) plotRLE(pbmc.sce[,1:50], exprs_values = "logcounts_raw", style = "full")運行PCA的方式降維(Hemberg-lab單細胞轉錄組數據分析(十一)- Scater單細胞表達譜PCA可視化)并將結果保存到新對象raw.sce中:
raw.sce <- runPCA(pbmc.sce, exprs_values = "logcounts_raw") scater::plotPCA(raw.sce, colour_by = "total_counts") ## Warning: 'add_ticks' is deprecated. ## Use '+ geom_rug(...)' instead.繪制B細胞標記基因MS4A1的表達:
plotReducedDim(raw.sce, use_dimred = "PCA", by_exprs_values = "logcounts_raw",colour_by = "MS4A1") ## Warning: 'add_ticks' is deprecated. ## Use '+ geom_rug(...)' instead.以上圖例告訴了你什么?
其實可以看出來B細胞是和其他細胞混在一起的,區分度并不明顯。
標準化:Log
Seurat的默認標準化方法是:每個細胞的某一count先除以該細胞的總count,然后乘以scale因子-10000,再做個對數轉換(單細胞分析Seurat使用相關的10個問題答疑精選!)。
要使用Seurat的標準化方法,我們先把來自SCE對象中的過濾數據用來創建Seurat對象,實施標準化后,再將結果轉換回SingleCellExperiment對象,用于繪制比較圖。
pbmc.seu <- CreateSeuratObject(counts(pbmc.sce), project = "PBMC") pbmc.seu <- NormalizeData(pbmc.seu) pbmc.seu.sce <- as.SingleCellExperiment(pbmc.seu) pbmc.seu.sce <- calculateQCMetrics(pbmc.seu.sce)執行PCA(一文看懂PCA主成分分析)并使用plotRLE和plotReducedDim檢查標準化結果。這次使用“logcounts”繪圖(“logcounts”是默認值,可省略此設置)。檢查一些標記基因,例如GNLY(NK細胞)或LYZ單核細胞)。
plotRLE(pbmc.seu.sce[,1:50], style = "full") pbmc.seu.sce <- runPCA(pbmc.seu.sce) scater::plotPCA(pbmc.seu.sce, colour_by = "total_counts") plotReducedDim(pbmc.seu.sce, use_dimred = "PCA", colour_by = "MS4A1") ## Warning: 'add_ticks' is deprecated. ## Use '+ geom_rug(...)' instead.標準化:scran
scran中的標準化過程是基于Lun等人(2016)的反卷積方法:先合并許多細胞的計數以避免drop-out問題,然后將基于池的量化因子“去卷積”為cell-based factors,以進行特定細胞的標準化。而標準化之前對細胞進行聚類并非必要的,不過減少cluster中細胞之間的差異表達基因數量確實可以提高標準化的準確性。
qclust <- quickCluster(pbmc.sce) pbmc.sce <- computeSumFactors(pbmc.sce, clusters = qclust) summary(sizeFactors(pbmc.sce)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.1836 0.6380 0.8207 1.0000 1.2021 2.7399 pbmc.sce <- normalize(pbmc.sce) ## Warning in .get_all_sf_sets(object): spike-in set 'MT' should have its own ## size factors plotRLE(pbmc.sce[,1:50], exprs_values = "logcounts", exprs_logged = FALSE, style = "full") pbmc.sce <- runPCA(pbmc.sce) scater::plotPCA(pbmc.sce, colour_by = "total_counts") ## Warning: 'add_ticks' is deprecated. ## Use '+ geom_rug(...)' instead. plotReducedDim(pbmc.sce, use_dimred = "PCA", colour_by = "MS4A1") ## Warning: 'add_ticks' is deprecated. ## Use '+ geom_rug(...)' instead.高變基因選擇
高變基因選擇: scran
在用于發現highly variable gene(HVG)的方法中,scran是在沒有spike-ins并假設大多數基因沒有可變表達的情況下, 先使用整個數據將趨勢擬合到技術差異中,然后通過從總方差中減去趨勢的擬合值得到對于每個內源基因的方差的生物學組成。
fit <- trendVar(pbmc.sce, use.spikes = NA) decomp <- decomposeVar(pbmc.sce, fit) top.hvgs <- order(decomp$bio, decreasing = TRUE) head(decomp[top.hvgs,]) ## DataFrame with 6 rows and 6 columns ## mean total bio ## <numeric> <numeric> <numeric> ## S100A9 2.1798394106162 9.81200838511402 8.70848113171717 ## S100A8 1.94717610389369 8.74191022471459 7.73278930922898 ## LYZ 2.12848173541277 8.71130204756805 7.62859300817667 ## HLA-DRA 2.26709495525931 5.56946701594201 4.43065009839707 ## IGKC 1.03117091153778 4.64610433899492 3.94960628330731 ## CD74 2.85525470083599 4.66046377538451 3.29006358386426 ## tech p.value FDR ## <numeric> <numeric> <numeric> ## S100A9 1.10352725339686 0 0 ## S100A8 1.00912091548561 0 0 ## LYZ 1.08270903939138 0 0 ## HLA-DRA 1.13881691754493 0 0 ## IGKC 0.69649805568761 0 0 ## CD74 1.37040019152025 6.00496990101363e-272 3.15447280938074e-269 plot(decomp$mean, decomp$total, xlab = "Mean log-expression", ylab = "Variance")o <- order(decomp$mean)lines(decomp$mean[o], decomp$tech[o], col = "red", lwd = 2)該趨勢不是非常適合此數據,這可能是因為數據集很小,只有大約1000個細胞。TrendVar函數的parametric = TRUE選項可以解決此問題。
我們使用5%的false discovery rate (FDR)來選擇生物學組成明顯大于零的基因。
hvg.out <- decomp[which(decomp$FDR <= 0.05),] hvg.out <- hvg.out[order(hvg.out$bio, decreasing=TRUE),] plotExpression(pbmc.sce, features = rownames(hvg.out)[1:10])高變基因選擇: Seurat
Seurat 3中的默認方法是方差穩定變換(方差齊性的一種變換)。通過數據擬合趨勢可以預測每個基因隨其均值的變化。對于每個基因,在所有細胞中計算標準化值的方差,并用于對特征進行排名。默認情況下返回top2000的基因。
pbmc.seu <- FindVariableFeatures(pbmc.seu, selection.method = "vst") top10 <- head(VariableFeatures(pbmc.seu), 10) vplot <- VariableFeaturePlot(pbmc.seu) LabelPoints(plot = vplot, points = top10, repel = TRUE) ## When using repel, set xnudge and ynudge to 0 for optimal results ## Warning: Transformation introduced infinite values in continuous x-axisSeurat的VariableFeatures中包含有多少scran檢測到的可變基因呢?通過以下代碼我們可以看出1119個基因的重合
table(rownames(hvg.out) %in% VariableFeatures(pbmc.seu)) #### FALSE TRUE ## 1973 1119 sessionInfo() ## R version 3.5.1 (2018-07-02) ## Platform: x86_64-pc-linux-gnu (64-bit) ## Running under: CentOS release 6.10 (Final) ## ## Matrix products: default ## BLAS/LAPACK: /homeappl/appl_taito/opt/mkl/11.3.0/compilers_and_libraries_2016.0.109/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so ## ## locale: ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C ## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 ## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 ## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C ## [9] LC_ADDRESS=C LC_TELEPHONE=C ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C ## ## attached base packages: ## [1] parallel stats4 stats graphics grDevices utils datasets ## [8] methods base ## ## other attached packages: ## [1] Seurat_3.0.0 scran_1.10.2 ## [3] scater_1.10.1 ggplot2_3.1.1 ## [5] SingleCellExperiment_1.4.1 SummarizedExperiment_1.12.0 ## [7] DelayedArray_0.8.0 BiocParallel_1.16.5 ## [9] matrixStats_0.54.0 Biobase_2.42.0 ## [11] GenomicRanges_1.34.0 GenomeInfoDb_1.18.1 ## [13] IRanges_2.16.0 S4Vectors_0.20.1 ## [15] BiocGenerics_0.28.0 ## ## loaded via a namespace (and not attached): ## [1] Rtsne_0.15 ggbeeswarm_0.6.0 ## [3] colorspace_1.4-0 ggridges_0.5.1 ## [5] dynamicTreeCut_1.63-1 XVector_0.22.0 ## [7] BiocNeighbors_1.0.0 listenv_0.7.0 ## [9] npsurv_0.4-0 bit64_0.9-7 ## [11] ggrepel_0.8.0 codetools_0.2-16 ## [13] splines_3.5.1 R.methodsS3_1.7.1 ## [15] lsei_1.2-0 knitr_1.22 ## [17] jsonlite_1.6 ica_1.0-2 ## [19] cluster_2.0.7-1 png_0.1-7 ## [21] R.oo_1.22.0 sctransform_0.2.0 ## [23] HDF5Array_1.10.1 httr_1.4.0 ## [25] compiler_3.5.1 assertthat_0.2.1 ## [27] Matrix_1.2-17 lazyeval_0.2.1 ## [29] limma_3.38.3 htmltools_0.3.6 ## [31] tools_3.5.1 rsvd_1.0.0 ## [33] igraph_1.2.4.1 gtable_0.3.0 ## [35] glue_1.3.1 GenomeInfoDbData_1.2.0 ## [37] RANN_2.6.1 reshape2_1.4.3 ## [39] dplyr_0.8.0.1 Rcpp_1.0.1 ## [41] gdata_2.18.0 ape_5.3 ## [43] nlme_3.1-137 DelayedMatrixStats_1.4.0 ## [45] gbRd_0.4-11 lmtest_0.9-36 ## [47] xfun_0.6 stringr_1.4.0 ## [49] globals_0.12.4 irlba_2.3.3 ## [51] gtools_3.8.1 statmod_1.4.30 ## [53] future_1.12.0 edgeR_3.24.3 ## [55] zlibbioc_1.28.0 MASS_7.3-51.1 ## [57] zoo_1.8-4 scales_1.0.0 ## [59] rhdf5_2.26.2 RColorBrewer_1.1-2 ## [61] yaml_2.2.0 reticulate_1.12 ## [63] pbapply_1.4-0 gridExtra_2.3 ## [65] stringi_1.2.4 caTools_1.17.1.2 ## [67] bibtex_0.4.2 Rdpack_0.11-0 ## [69] SDMTools_1.1-221.1 rlang_0.3.1 ## [71] pkgconfig_2.0.2 bitops_1.0-6 ## [73] evaluate_0.12 lattice_0.20-38 ## [75] ROCR_1.0-7 purrr_0.3.2 ## [77] Rhdf5lib_1.4.2 labeling_0.3 ## [79] htmlwidgets_1.3 bit_1.1-14 ## [81] cowplot_0.9.4 tidyselect_0.2.5 ## [83] plyr_1.8.4 magrittr_1.5 ## [85] R6_2.3.0 gplots_3.0.1 ## [87] pillar_1.3.1 withr_2.1.2 ## [89] fitdistrplus_1.0-14 survival_2.43-3 ## [91] RCurl_1.95-4.11 tsne_0.1-3 ## [93] tibble_2.0.1 future.apply_1.2.0 ## [95] hdf5r_1.2.0 crayon_1.3.4 ## [97] KernSmooth_2.23-15 plotly_4.9.0 ## [99] rmarkdown_1.11 viridis_0.5.1 ## [101] locfit_1.5-9.1 grid_3.5.1 ## [103] data.table_1.12.2 metap_1.1 ## [105] digest_0.6.18 tidyr_0.8.2 ## [107] R.utils_2.7.0 munsell_0.5.0 ## [109] beeswarm_0.2.3 viridisLite_0.3.0 ## [111] vipor_0.4.5撰文:Tiger
編輯:生信寶典
總結
以上是生活随笔為你收集整理的代码分析 | 单细胞转录组Normalization详解的全部內容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 在家吃饭保平安,华人学者研究发现,经常在
- 下一篇: 教育部推出首批490门“国家精品在线开放