PCR duplicates in sequencing, and removing PCR duplicate reads with samtools rmdup
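One step in library preparation is:
PCR amplification of the adapter-ligated DNA fragments.
Ideally, once the genomic DNA has been fragmented, each DNA fragment is sequenced once and only once.
This step, however, runs 6 amplification cycles, so each DNA fragment ends up with as many as 2^6 = 64 copies. When the amplified product is "sprinkled" onto the flowcell, two copies of the same original fragment may anchor to two different beads; the two reads they produce are PCR duplicates.
In general, when the PCR duplication rate is too high, the same total number of reads carries far less information about the genome.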
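The arithmetic above can be checked with a short sketch. It assumes ideal PCR in which every cycle exactly doubles each fragment (real efficiency is lower), and both function names are made up for illustration.

```python
def copies_after_pcr(cycles: int) -> int:
    """Copies of one fragment after `cycles` rounds, assuming perfect doubling."""
    return 2 ** cycles


def unique_reads(total_reads: int, duplication_rate: float) -> int:
    """Reads left once duplicates at the given rate (0.0 to 1.0) are discarded."""
    return round(total_reads * (1.0 - duplication_rate))


print(copies_after_pcr(6))            # 64
print(unique_reads(1_000_000, 0.30))  # 700000
```

With a 30% duplication rate, one million reads carry roughly the information of 700,000 independent ones, which is why a high duplication rate wastes sequencing depth.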
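How samtools rmdup removes PCR duplicate reads
Data from randomly fragmented libraries needs PCR duplicate removal; target-specific capture data does not.
The official documentation for samtools rmdup is at: http://www.htslib.org/doc/samtools.html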
Remove potential PCR duplicates: if multiple read pairs have identical external coordinates, only retain the pair with highest mapping quality. In the paired-end mode, this command ONLY works with FR orientation and requires ISIZE is correctly set. It does not work for unpaired reads (e.g. two ends mapped to different chromosomes or orphan reads).
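The rule quoted above (identical external coordinates, keep the pair with the highest mapping quality) can be sketched in a few lines. This is only an illustration of the idea, not the samtools implementation: the `ReadPair` record and `remove_duplicates` function are hypothetical, and real tools also take strand, orientation, and library into account.

```python
from typing import NamedTuple


class ReadPair(NamedTuple):
    name: str
    chrom: str
    start: int  # leftmost mapped position of the pair
    end: int    # rightmost mapped position of the pair
    mapq: int   # mapping quality


def remove_duplicates(pairs):
    """Keep one pair per (chrom, start, end), preferring the highest MAPQ."""
    best = {}
    for pair in pairs:
        key = (pair.chrom, pair.start, pair.end)
        if key not in best or pair.mapq > best[key].mapq:
            best[key] = pair
    return list(best.values())


pairs = [
    ReadPair("r1", "chr10", 100, 350, 60),
    ReadPair("r2", "chr10", 100, 350, 37),  # same outer coordinates as r1
    ReadPair("r3", "chr10", 120, 350, 60),  # different start, so not a duplicate
]
kept = remove_duplicates(pairs)
print([p.name for p in kept])  # ['r1', 'r3']
```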
Test it on a small paired-end dataset:
samtools rmdup tmp.sorted.bam tmp.rmdup.bam
[bam_rmdup_core] processing reference chr10...
[bam_rmdup_core] 2 / 12 = 0.1667 in library
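The summary line in the log above reads as "2 of 12 examined entries were treated as duplicates". A small sketch for pulling those numbers out of such a line follows; the helper name is hypothetical, and the regular expression assumes the log format stays exactly as shown here.

```python
import re

LOG_LINE = "[bam_rmdup_core] 2 / 12 = 0.1667 in library"


def parse_rmdup_summary(line):
    """Return (duplicates, total, rate) from an rmdup summary line, or None."""
    match = re.search(r"(\d+) / (\d+) = ([\d.]+)", line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), float(match.group(3))


duplicates, total, rate = parse_rmdup_summary(LOG_LINE)
print(duplicates, total, rate)       # 2 12 0.1667
print(round(duplicates / total, 4))  # 0.1667
```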
samtools rmdup performs poorly on paired-end data; many people recommend the MarkDuplicates function of Picard instead.
Removing PCR duplicates with samtools
ref: the samtools manual
samtools markdup [-l length] [-r] [-s] [-T] [-S] in.algsort.bam out.bam
-l INT     Expected maximum read length of INT bases. [300]
-r         Remove duplicate reads.
-s         Print some basic stats.
-T PREFIX  Write temporary files to PREFIX.samtools.nnnn.mmmm.tmp
-S         Mark supplementary reads of duplicates as duplicates.
Four steps are needed:
samtools sort -n xxx.bam -o xxx.sort.bam
samtools fixmate -m xxx.sort.bam xxx.fixmate.bam  # note: fixmate in samtools 1.2 has no -m option
samtools sort xxx.fixmate.bam -o xxx.positionsort.bam
samtools markdup -r xxx.positionsort.bam xxx.markdup.bam  # note: in samtools 1.2 the duplicate-removal subcommand is rmdup, and version 1.2 raised an error; in practice rmdup from version 1.3 was used
All in one pipeline (each "-" stands for standard input or standard output):
samtools sort -n xxx.bam | samtools fixmate -m - - | samtools sort - | samtools markdup -r - xxx.markdup.bam
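For scripting, the four steps can be assembled as argument lists before anything is executed. The sketch below only builds and prints the commands; the function name and the file-naming scheme are assumptions that mirror the xxx.* names used above. Running the commands would additionally need a samtools build that provides markdup and fixmate -m.

```python
import shlex


def markdup_commands(input_bam, prefix):
    """Build the sort / fixmate / sort / markdup commands as argument lists."""
    name_sorted = f"{prefix}.sort.bam"
    fixmate = f"{prefix}.fixmate.bam"
    position_sorted = f"{prefix}.positionsort.bam"
    deduplicated = f"{prefix}.markdup.bam"
    return [
        ["samtools", "sort", "-n", "-o", name_sorted, input_bam],  # sort by read name
        ["samtools", "fixmate", "-m", name_sorted, fixmate],       # add mate score tags
        ["samtools", "sort", "-o", position_sorted, fixmate],      # sort by coordinate
        ["samtools", "markdup", "-r", position_sorted, deduplicated],  # remove duplicates
    ]


commands = markdup_commands("xxx.bam", "xxx")
for command in commands:
    print(shlex.join(command))
```

Each list can be passed to subprocess.run(command, check=True) so that a failing step stops the workflow instead of feeding a broken file to the next one.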
At the SAM/BAM level:
picard
ref: Picard Tools - By Broad Institute
Usage:
java -jar picard.jar MarkDuplicates \
    I=xxx.sorted.bam \
    O=xxx.sorted.markdup.bam \
    M=xxx.markdup.txt
To remove the duplicates outright instead of only marking them:
java -jar picard.jar MarkDuplicates \
    REMOVE_DUPLICATES=true \
    I=xxx.sorted.bam \
    O=xxx.sorted.markdup.bam \
    M=xxx.markdup.txt
References:
https://www.jianshu.com/p/73483070379b
http://www.bio-info-trainee.com/2003.html
https://www.jianshu.com/p/879c5e9ed56e