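This R package annotates single-cell data automatically with an average accuracy of 83%, but my results showed some problems after using it | full code included...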

Published: 2025/3/15

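You have probably noticed that single-cell data are dataset-dependent: many markers that are expressed very clearly in someone else's data from the same tissue are much less pronounced in your own, for example KLRF1 in NK cells. Automatic cell annotation emerged in response to this. Many R packages for cell annotation already exist, such as SingleR and Cellassign (cellassign: a single-cell annotation tool for tumor microenvironment analysis (September, Nature)).

Today's subject is another automatic annotation R package. It was developed by a research team at Zhejiang University and published in iScience on March 27, 2020 under the title scCATCH: Automatic Annotation on Cell Types of Clusters from Single-Cell RNA Sequencing Data. The paper reports an average annotation accuracy of 83% on six test datasets (Pancreas, Brain, Lung, PBMCs, Brain, and Brain). You can try it yourself: https://github.com/ZJUFanLab/scCATCH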

How the annotation works

Steps:

(A) Paired comparison of clusters to identify the potential marker genes for each cluster. Compared with every other cluster, genes significantly upregulated in only one cluster (log10 fold change ≥0.25, p < 0.05) and expressed in more than a quarter of cells (≥25%) would be considered marker genes. p values were obtained through the Wilcoxon test. * indicates p < 0.05.
(B) Construction of tissue-specific cell taxonomy reference databases (CellMatch) with tissue-specific cell markers reported in the literature from humans or mice.
(C) Evidence-based score and annotation. For each cluster, cell types were scored on the basis of validated marker genes and their supporting literature, and the cell type with the highest score (top 1) was determined for the cluster.
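
To make the thresholds in step (A) concrete, here is a minimal base-R sketch that applies them to one gene in a toy two-cluster comparison. It is only an illustration and not scCATCH's internal code: the helper name `is_candidate_marker`, the expression values, the pseudocount of 1 in the fold change, and the one-sided test are all assumptions made for the example. The real workflow also requires the gene to pass against every other cluster.

```r
# Toy sketch of the per-gene marker test in step (A); NOT scCATCH's internal code.
# is_candidate_marker() is a hypothetical helper written for this example.
is_candidate_marker <- function(expr_in, expr_out,
                                cell_min_pct = 0.25, logfc = 0.25, pvalue = 0.05) {
  # Fraction of cells in the cluster of interest that express the gene
  pct_in <- mean(expr_in > 0)
  # log10 fold change of mean expression (pseudocount of 1 is an assumption)
  lfc <- log10((mean(expr_in) + 1) / (mean(expr_out) + 1))
  # One-sided Wilcoxon rank-sum test: is expression higher inside the cluster?
  p <- suppressWarnings(
    wilcox.test(expr_in, expr_out, alternative = "greater")$p.value
  )
  pct_in >= cell_min_pct && lfc >= logfc && p < pvalue
}

# Hypothetical expression values for one gene in two clusters
cluster_a <- c(5, 6, 7, 8, 6, 7, 9, 5)
cluster_b <- c(0, 1, 0, 0, 1, 0, 2, 0)

is_marker_in_a <- is_candidate_marker(cluster_a, cluster_b)  # gene is high in cluster A
is_marker_in_b <- is_candidate_marker(cluster_b, cluster_a)  # gene is low in cluster B
```

With these toy values the gene passes all three filters for cluster A and fails the fold-change filter for cluster B.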

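Features:

CellMatch contains 353 cell types and 686 subtypes across 184 tissue types, 20,792 cell-specific marker genes, and 2,097 human and mouse references.
scCATCH mainly consists of two functions, "findmarkergenes" and "scCATCH", which together annotate each identified cluster automatically.
scCATCH can be used to annotate scRNA-seq data from cancer tissue.
scCATCH can handle large single-cell transcriptome datasets with more than 10,000 cells and more than 15 clusters.

Note:
(1) Only human or mouse data can be annotated;
(2) In the database, reference resources are plentiful for human tissues and tumors and for normal mouse tissues, but scarce for mouse tumor tissues.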

Installing the R package

# download the source package of scCATCH_2.0.tar.gz and install it
install.packages(pkgs = 'scCATCH_2.0.tar.gz', repos = NULL, type = 'source')

Or:

# install devtools and install scCATCH
install.packages(pkgs = 'devtools')
devtools::install_github('ZJUFanLab/scCATCH')

Loading the data

We use the author's test data, which contains 203 mouse kidney cells. The author loads a Seurat object directly here; you can download it from https://github.com/ZJUFanLab/scCATCH/tree/master/data:

library(scCATCH)
load("mouse_kidney_203_Seurat.RData")

Building the marker gene list

clu_markers <- findmarkergenes(object = mouse_kidney_203_Seurat,
                               species = 'Mouse',
                               cluster = 'All',
                               match_CellMatch = FALSE,
                               cancer = NULL,
                               tissue = NULL,
                               cell_min_pct = 0.25,
                               logfc = 0.25,
                               pvalue = 0.05)

Here match_CellMatch = FALSE means the genes are not matched against CellMatch first (when the number of cells exceeds 10,000 or the number of clusters exceeds 15, matching against the CellMatch reference first is strongly recommended).

The return value is a list containing new_data_matrix and clu_markers.

Annotating the potential marker genes

clu_ann <- scCATCH(object = clu_markers$clu_markers,
                   species = 'Mouse',
                   cancer = NULL,
                   tissue = 'Kidney')

Before annotation:

library(Seurat)  # DimPlot() and NoLegend() come from Seurat
DimPlot(mouse_kidney_203_Seurat, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()

After annotation:

new.cluster.ids <- clu_ann$cell_type  # match the predicted cell types to the clusters
names(new.cluster.ids) <- clu_ann$cluster
mouse_kidney_203_Seurat <- RenameIdents(mouse_kidney_203_Seurat, new.cluster.ids)  # rename the clusters
DimPlot(mouse_kidney_203_Seurat, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
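
The renaming step relies on a named character vector that maps cluster IDs to cell types. The base-R sketch below shows that lookup on toy data, without Seurat. The data frame `clu_ann_toy`, its placeholder cell type names, and the per-cell cluster vector are hypothetical stand-ins, not real scCATCH output.

```r
# Toy stand-in for the annotation table; the real one is returned by scCATCH().
clu_ann_toy <- data.frame(
  cluster   = c("1", "2", "3"),
  cell_type = c("Cell type A", "Cell type B", "Cell type C"),
  stringsAsFactors = FALSE
)

# Named vector: names are cluster IDs, values are predicted cell types
new_ids <- clu_ann_toy$cell_type
names(new_ids) <- clu_ann_toy$cluster

# Hypothetical per-cell cluster assignments, looked up by name
cell_clusters <- c("2", "1", "3", "2")
cell_labels <- unname(new_ids[cell_clusters])
```

Indexing by name is why the cluster IDs must be character strings that match the names exactly; a numeric index would select by position instead.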

(1) Next I wanted to see what happens if I change species = 'Mouse' to Human. This was the result:

The reason is obvious: the marker gene symbols differ in letter case, so there is no way they could be annotated!
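
A small base-R illustration of that case mismatch follows. Both vectors are hypothetical examples written in the usual mouse and human symbol conventions; they are not entries pulled from CellMatch.

```r
# Toy illustration: exact string matching fails between mouse-style and human-style symbols.
cluster_markers <- c("Klrf1", "Cd3e", "Nkg7")   # mouse convention: first letter capitalised
human_reference <- c("KLRF1", "CD3E", "NKG7")   # human convention: all upper case

n_exact <- sum(cluster_markers %in% human_reference)            # case-sensitive overlap
n_nocase <- sum(toupper(cluster_markers) %in% human_reference)  # overlap ignoring case
```

Here the case-sensitive overlap is 0 and the case-insensitive overlap is 3. Upper-casing is only used to show where the mismatch comes from; it is not a valid way to map mouse genes to human orthologs, so the species argument should simply match the data.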

(2) What happens if I change tissue to 'Bone marrow'?

This shows that it will end up assigning a cell type no matter what, so the tissue type must not be wrong!
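
This behavior is what any "take the top-scoring candidate" rule would produce. The sketch below uses hypothetical scores, not scCATCH output, to show that picking the maximum always returns a label, and how a user-side cutoff (the `min_score` value is an arbitrary assumption) could flag weak calls for manual review.

```r
# Toy sketch: the top-scoring candidate is returned even when every score is weak.
# Names and scores are hypothetical, not produced by scCATCH.
scores <- c("Neutrophil" = 0.12, "Erythroid cell" = 0.10, "B cell" = 0.08)

best <- names(scores)[which.max(scores)]   # always yields some cell type

# Hypothetical user-side guard: treat low-confidence calls as unknown
min_score <- 0.5
label <- if (max(scores) >= min_score) best else "Unknown"
```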

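(3) I also wanted to try what happens when cancer is set to Renal Cell Carcinoma.

However, when I applied it to my own PBMC data, I did find some annotation errors: for example, it labeled a large fraction of fairly obvious monocytes as Treg, although the subtyping within B cells was good. The figure below shows the author's results from testing different automatic annotation R packages on different datasets:

The author will naturally claim higher accuracy to show that the package is practical, but keep in mind that automatic annotation is also dataset-dependent, and no package's accuracy is guaranteed on a given dataset. All you can do is try several of them; after all, in most cases this serves as supporting evidence for your own manual assignment of cell subpopulations.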
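
One simple way to use automatic annotation as supporting evidence is to tabulate it against your manual labels. The base-R sketch below does this for hypothetical per-cluster labels; both vectors are made up for illustration and should be replaced with your own calls.

```r
# Hypothetical per-cluster labels: manual calls versus automatic calls
manual <- c("Monocyte", "Monocyte", "B cell", "B cell", "NK cell", "T cell")
auto   <- c("Treg",     "Monocyte", "B cell", "B cell", "NK cell", "T cell")

# Fraction of clusters where the two label sets agree
agreement <- mean(manual == auto)

# Cross-tabulation shows which manual types were relabeled as what
confusion <- table(manual = manual, auto = auto)
print(confusion)
```

Off-diagonal entries in the table, such as a monocyte cluster called Treg, are the clusters worth rechecking against known markers.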

References

Shao et al., scCATCH: Automatic Annotation on Cell Types of Clusters from Single-Cell RNA Sequencing Data, iScience, Volume 23, Issue 3, 27 March 2020. [doi: 10.1016/j.isci.2020.100882](https://www.sciencedirect.com/science/article/pii/S2589004220300663). PMID: [32062421](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7031312/)
