當(dāng)前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

Hemberg-lab单细胞转录组数据分析（四）

發(fā)布時間：2025/3/15 编程问答 24 豆豆

生活随笔收集整理的這篇文章主要介紹了 Hemberg-lab单细胞转录组数据分析（四）小編覺得挺不錯的,現(xiàn)在分享給大家,幫大家做個參考.

Hemberg-lab單細胞轉(zhuǎn)錄組數(shù)據(jù)分析（一）

Hemberg-lab單細胞轉(zhuǎn)錄組數(shù)據(jù)分析（二）

Hemberg-lab單細胞轉(zhuǎn)錄組數(shù)據(jù)分析（三）

收藏|北大生信平臺"單細胞分析、染色質(zhì)分析"視頻和PPT分享

生信老司機以中心法則為主線講解組學(xué)技術(shù)的應(yīng)用和生信分析心得 - 限時免費

測序文庫拆分 (Demultiplexing)

文庫拆分因使用的前期Protocol不同或構(gòu)建的流程不同需要有對應(yīng)的處理方式。我們認為最靈活可用的文庫拆分工具是zUMIs (https://github.com/sdparekh/zUMIs/wiki/Usage)，可以用來拆分和比對大部分基于UMI的建庫方式。對于Smartseq2或其他雙端全長轉(zhuǎn)錄本方案，數(shù)據(jù)通常已經(jīng)拆分好了。例如GEO或ArrayExpress之類的公共數(shù)據(jù)存儲庫會要求小規(guī)模或plate-based scRNASeq數(shù)據(jù)拆分好再上傳，并且很多測序服務(wù)商提供的數(shù)據(jù)都是自動拆分好的。如果使用的分析流程依賴于拆分好的數(shù)據(jù)但測序服務(wù)商提供的數(shù)據(jù)沒有拆分時就需要自己拆分。因為不同的建庫方案引入的barcode序列的長度和位置不同，通常都需要自己寫腳本解決。

對于所有數(shù)據(jù)類型，”文庫拆分”涉及從一端或雙端短序列中識別和移除細胞條形碼序列 (cell barcode)。如果提前知道加入的細胞條形碼，比如數(shù)據(jù)來自基于PCR板的方案，只需要找到條形碼并與條形碼庫作比對，歸類于與之最相似的那個就可以 (根據(jù)條形碼的設(shè)計，一般允許最多1-2個錯配)。這些數(shù)據(jù)通常在比對之前先做拆分，從而可以并行比對，提高效率。

我們有公開可用 (<>)的 perl腳本，可以拆分任何plate-based的建庫方案生成的數(shù)據(jù)，不管有沒有UMI。操作如下：

# https://github.com/tallulandrews/scRNASeqPipeline # perl 1_Flexible_UMI_Demultiplexing.pl read1.fq read2.fq b_structure index mismatch prefix\n"; # ? ?print STDERR " # ? ?read1.fq : barcode/umi containing read # ? ?read2.fq : non-barcode containing read # ? ?b_structure : a single string of the format C##U# or U#C## # ? ? ? ?where C## is the cell-barcode and U# is the UMI. # ? ? ? ?e.g. C10U4 = a 10bp cell barcode followed by a 4bp UMI # ? ?index : file containg a single column of expected cell-barcodes. # ? ? ? ?if equal to \"UNKNOWN\" script will output read counts for each unique barcode. # ? ?mismatch : maximum number of permitted mismatches (recommend 2) # ? ?prefix : prefix for output fastq files. perl 1_Flexible_UMI_Demultiplexing.pl 10cells_read1.fq 10cells_read2.fq "C12U8" 10cells_barcodes.txt 2 Ex

運行過程輸出如下：

## ?Doesn't match any cell: 0 ## ?Ambiguous: 0 ## ?Exact Matches: 400 ## ?Contain mismatches: 0 ## ?Input Reads: 400 ## ?Output Reads: 400 ## Barcode Structure: 12 bp CellID followed by 8 bp UMI # https://github.com/tallulandrews/scRNASeqPipeline # perl 1_Flexible_FullTranscript_Demultiplexing.pl read1.fq read2.fq b_pos b_len index mismatch prefix # ? ? read1.fq : barcode containing read # ? ? ? read2.fq : non-barcode containg read # ? ? b_pos : position of cell-barcode in the read. [\"start\" or \"end\"] # ? ?b_len : length of cell-barcode (bp) # ? ?index : file contain a single column of expected barcodes # ? ?mismatch : maximum number of permitted mismatches (recommend 2) # ? ?prefix : prefix for output fq files. perl 1_Flexible_FullTranscript_Demultiplexing.pl 10cells_read1.fq 10cells_read2.fq "start" 12 10cells_barcodes.txt 2 Ex ## ## Doesn't match any cell: 0 ## Ambiguous: 0 ## Exact Matches: 400 ## Contain Mismatches: 0 ## Input Reads: 400 ## Output Reads: 400

對于包含UMI的數(shù)據(jù)，文庫拆分包含把UMI code加到來源于基因區(qū)的序列的名字上。如果數(shù)據(jù)來源于droplet-based protocol或者SeqWell，實際加入的barcode序列的種類多于捕獲到的細胞數(shù)時，為了避免生成特別多的文件，一般也把cell-barcode加到測序reads的名字后面。在這種情況下，文庫拆分是在定量過程中進行的，有利于識別來源于完整細胞而不是背景噪聲中的cell-barcode序列。

鑒定含有細胞的液滴/微孔

液滴型scRNA-seq方法中只有一小部分的液滴包含珠狀物和一個完整細胞。然而生物實驗不會那么理想，有些RNA會從死細胞或破損細胞中漏出來。所以沒有完整細胞的液滴有可能捕獲周圍環(huán)境游離出的少了RNA并且走完測序環(huán)節(jié)出現(xiàn)在最終測序結(jié)果中。液滴大小、擴增效率和測序環(huán)節(jié)中的波動會導(dǎo)致”背景”和真實細胞最終獲得的文庫大小變化很大，使得區(qū)分哪些文庫來源于背景哪些來源于真實細胞變得復(fù)雜。

大多數(shù)方法使用每個barcode對應(yīng)的總分子數(shù)(如果是UMI)或總reads數(shù)的分布來尋找一個”break point”區(qū)分來自于真實細胞的較大的文庫和來自于背景的較小的文庫。下面加載一些包含大細胞和細胞的模擬數(shù)據(jù)實際操作下：

umi_per_barcode <- read.table("droplet_id_example_per_barcode.txt.gz") truth <- read.delim("droplet_id_example_truth.gz", sep=",")

練習(xí): 有多少唯一的條形碼被檢測到？數(shù)據(jù)里多少來自真細胞？為了簡化計算，寫代碼排除掉少于10個分子的條形碼。

答案:

# 每一行代碼對應(yīng)上面一個問題 dim(umi_per_barcode)[1] dim(truth)[1] umi_per_barcode <- umi_per_barcode[umi_per_barcode[,2] > 10,]

一個找尋每個條形碼對應(yīng)的分子數(shù)突然下降的拐點的方法：

barcode_rank <- rank(-umi_per_barcode[,2]) plot(barcode_rank, umi_per_barcode[,2], xlim=c(1,8000))

可以看到文庫大小分布類似一條指數(shù)曲線，為了看的更清楚，對數(shù)據(jù)進行l(wèi)og轉(zhuǎn)換。

log_lib_size <- log10(umi_per_barcode[,2]) plot(barcode_rank, log_lib_size, xlim=c(1,8000))

轉(zhuǎn)換后，拐點更明顯了，我們可以手動估計拐點的位置來區(qū)分背景噪音和真實細胞，但是用算法來識別要更精確，可重用性更強。

# inflection point o <- order(barcode_rank) log_lib_size <- log_lib_size[o] barcode_rank <- barcode_rank[o]rawdiff <- diff(log_lib_size)/diff(barcode_rank) inflection <- which(rawdiff == min(rawdiff[100:length(rawdiff)], na.rm=TRUE))plot(barcode_rank, log_lib_size, xlim=c(1,8000)) abline(v=inflection, col="red", lwd=2)

threshold <- 10^log_lib_size[inflection]cells <- umi_per_barcode[umi_per_barcode[,2] > threshold,1] TPR <- sum(cells %in% truth[,1])/length(cells) Recall <- sum(cells %in% truth[,1])/length(truth[,1]) c(TPR, Recall) ## [1] 1.0000000 0.7831707

另一種方法是構(gòu)建混合模型，找出較高和較低分布相交的位置。但是數(shù)據(jù)可能不會像假定的分布那么好：

set.seed(-92497) # mixture model require("mixtools")mix <- normalmixEM(log_lib_size) ## number of iterations= 43 plot(mix, which=2, xlab2="log(mol per cell)")p1 <- dnorm(log_lib_size, mean=mix$mu[1], sd=mix$sigma[1]) p2 <- dnorm(log_lib_size, mean=mix$mu[2], sd=mix$sigma[2]) if (mix$mu[1] < mix$mu[2]) {split <- min(log_lib_size[p2 > p1]) } else {split <- min(log_lib_size[p1 > p2]) }

練習(xí)?使用分割點識別細胞并計算TPR和Recall。

答案

cells <- umi_per_barcode[umi_per_barcode[,2] > 10^split,1]TPR <- sum(cells %in% truth[,1])/length(cells) Recall <- sum(cells %in% truth[,1])/length(truth[,1]) c(TPR, Recall)

第三種方法，CellRanger假設(shè)真實細胞文庫大小變化在10倍以內(nèi)，用期望的細胞數(shù)目估計區(qū)間的分布。

n_cells <- length(truth[,1]) # CellRanger totals <- umi_per_barcode[,2] totals <- sort(totals, decreasing = TRUE) # 99th percentile of top n_cells divided by 10 thresh = totals[round(0.01*n_cells)]/10 plot(totals, xlim=c(1,8000)) abline(h=thresh, col="red", lwd=2)

練習(xí)?用這個閾值識別細胞并計算TPR和Recall

答案

cells <- umi_per_barcode[umi_per_barcode[,2] > thresh,1]TPR <- sum(cells %in% truth[,1])/length(cells) Recall <- sum(cells %in% truth[,1])/length(truth[,1]) c(TPR, Recall)

最后一個工具，EmptyDrops?(https://github.com/MarioniLab/DropletUtils, 目前還是測試版。它的輸入是所有液滴的?基因 x 細胞分子計數(shù)矩陣，從具有極低reads計數(shù)的液滴中估計background RNA的分布，然后尋找與background不同的基因表達譜的液滴視為真實細胞。background RNA通常看起來非常類似于群體中最大的細胞的表達譜，這個方法通常也會和拐點結(jié)合。因此，EmptyDrops是唯一能夠在高度多樣化樣本中鑒定非常小的細胞的方法。

下面提供了這個方法的運行代碼：（我們會根據(jù)官方及時更新代碼）

require("Matrix") raw.counts <- readRDS("droplet_id_example.rds")require("DropletUtils") # emptyDrops set.seed(100) e.out <- emptyDrops(my.counts) is.cell <- e.out$FDR <= 0.01 sum(is.cell, na.rm=TRUE) plot(e.out$Total, -e.out$LogProb, col=ifelse(is.cell, "red", "black"),xlab="Total UMI count", ylab="-Log Probability")cells <- colnames(raw.counts)[is.cell]TPR <- sum(cells %in% truth[,1])/length(cells) Recall <- sum(cells %in% truth[,1])/length(truth[,1]) c(TPR, Recall)

總結(jié)

以上是生活随笔為你收集整理的Hemberg-lab单细胞转录组数据分析（四）的全部內(nèi)容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網(wǎng)站內(nèi)容還不錯，歡迎將生活随笔推薦給好友。

上一篇：有了它，快速学会RStudio应用
下一篇：冻存样品对单细胞测序影响大吗？