操作记录-2020-11-13:精简代码处理ChIP_seq数据
今天準備嘗試編寫一組精簡代碼用于處理ChIP_seq數據,希望能成功吧。
1.建立相應目錄
對新數據建立對應實驗人員(lizexing)、測序類型(ChIP_seq)和日期(2020_11_13)的目錄。
# 建立后如下: (base) zexing@DNA:~/projects/lizexing/ChIP_seq/2020_11_13$# 新建對應的目錄 mkdir raw_data clean_data bam bam_bw bam_sort sam macs2_bdgdiff macs2_callpeak matrix_reference_point matrix_scale_regions fastqc_report MD5_txt scripts_log2.檢查數據完整性
cat md5.txt > check_md5sum.txt && md5sum -c check_md5sum.txt操縱記錄如下
(base) zexing@DNA:~/projects/lizexing/ChIP_seq/2020_11_13/clean_data$ cat *.txt > check_md5sum.txt && md5sum -c check_md5sum.txt Scr-S-L-KQ_FKDL202604880-1a_1.clean.fq.gz: OK Scr-S-L-KQ_FKDL202604880-1a_2.clean.fq.gz: OK shTgm1-2-S-L-KQ_FKDL202604881-1a_1.clean.fq.gz: OK shTgm1-2-S-L-KQ_FKDL202604881-1a_2.clean.fq.gz: OK shTgm2-1-S-L-KQ_FKDL202604882-1a_1.clean.fq.gz: OK shTgm2-1-S-L-KQ_FKDL202604882-1a_2.clean.fq.gz: OK3.根據需要精簡文件名稱
操作記錄如下:
(base) zexing@DNA:~/projects/lizexing/ChIP_seq/2020_11_13/clean_data$ ll total 2.7G drwxrwxr-x 2 zexing zexing 4.0K 11月 16 11:45 . drwxrwxr-x 15 zexing zexing 4.0K 11月 13 10:36 .. -rw-rw-r-- 1 zexing zexing 476 11月 16 11:31 check_md5sum.txt -rw-rw-r-- 1 zexing zexing 152 11月 16 10:50 MD5_Scr-S-L-KQ_FKDL202604880-1a.txt -rw-rw-r-- 1 zexing zexing 162 11月 16 11:03 MD5_shTgm1-2-S-L-KQ_FKDL202604881-1a.txt -rw-rw-r-- 1 zexing zexing 162 11月 16 11:21 MD5_shTgm2-1-S-L-KQ_FKDL202604882-1a.txt -rw-rw-r-- 1 zexing zexing 401M 11月 16 10:57 Scr_1.clean.fq.gz -rw-rw-r-- 1 zexing zexing 410M 11月 16 11:01 Scr_2.clean.fq.gz -rw-rw-r-- 1 zexing zexing 495M 11月 16 11:11 shTgm1-2_1.clean.fq.gz -rw-rw-r-- 1 zexing zexing 510M 11月 16 11:20 shTgm1-2_2.clean.fq.gz -rw-rw-r-- 1 zexing zexing 423M 11月 16 11:26 shTgm2-1_1.clean.fq.gz -rw-rw-r-- 1 zexing zexing 437M 11月 16 11:27 shTgm2-1_2.clean.fq.gz4. 在Linux服務器中對ChIP_seq數據進行處理并常規call peak。
vim新建ChIP_seq_script_1將數據質控、比對、格式轉換、排序、生成目錄、bamCoverage命令轉換文件格式和macs2 callpeak綜合在一起。
#!/bin/bash # 上面一行宣告這個script的語法使用bash語法,當程序被執行時,能夠載入bash的相關環境配置文件。 # Program # This program is used for ChIP-seq data analysis. # History # 2020/11/13 zexing First release # 設置變量${dir}為常用目錄 dir=/f/xudonglab/zexing/projects/lizexing/ChIP_seq/2020_11_13 # 用戶名稱和日期需要更改# 對數據進行質控 fastqc -t 16 -o ${dir}/fastqc_report/ ${dir}/clean_data/*.fq.gz# 利用for循環進行后續操作 for i in Scr shTgm1-2 shTgm2-1 # 樣品名稱需要修改 do # 對數據進行比對 bowtie2 -t -p 16 -x /f/xudonglab/zexing/reference/UCSC_mm10/bowtie2_index/mm10 -1 ${dir}/clean_data/${i}_1.clean.fq.gz -2 ${dir}/clean_data/${i}_2.clean.fq.gz -S ${dir}/sam/${i}.sam# 對數據進行格式轉換 samtools view -@ 16 -S ${dir}/sam/${i}.sam -1b -o ${dir}/bam/${i}.bam# 對數據進行排序 samtools sort -@ 16 -l 5 -o ${dir}/bam_sort/${i}.bam.sort ${dir}/bam/${i}.bam# 對數據生成目錄 samtools index -@ 16 ${dir}/bam_sort/${i}.bam.sort # bamCoverage命令轉換文件格式 bamCoverage -p 16 -v -b ${dir}/bam_sort/${i}.bam.sort -o ${dir}/bam_bw/${i}.bam.sort.bw# 使用macs2進行常規callpeak macs2 callpeak -t ${dir}/bam_sort/${i}.bam.sort -f BAM -g mm -B -q 0.05 --outdir ${dir}/macs2_callpeak/ -n ${i}done在后臺運行ChIP_seq_script_1:
nohup bash ChIP_seq_script_1 > ChIP_seq_script_1_log &5. 使用deeptools軟件繪制熱圖/密度圖
1. computeMatrix scale-regions模式計算信號強度并用plotHeatmap/plotProfile作圖
scale-regions模式計算的是區域形式,所以指定作圖位置的BED或GTF格式文件為macs2 callpeak生成的后綴為peaks.narrowPeak的BED文件。
vim新建ChIP_seq_script_2,腳本如下:
#! /bin/bash # 上面一行宣告這個script的語法使用bash語法,當程序被執行時,能夠載入bash的相關環境配置文件。 # Program: # This program is used for computeMatrix scale-regions. #History: # 2020/11/03 zexing First release # In the scale-regions mode, all regions in the BED file are stretched or shrunken to the length (in bases) indicated by the user. # 參數-R 指定作圖位置的BED或GTF格式文件,可用#標記同一組區域,默認無。 # 參數-S 輸入bigwig文件。 # 參數-o 指定輸出為文件名用于plotHeatmap, plotProfile # 參數-b上游(默認0bp),-a下游(默認0bp)設定感興趣的區域,如果該區域是基因,則為基因TSS上游或TES下游。 # 參數--skipZeros設定0分區域的處理 # 參數-p 設置線程數bed=/f/xudonglab/zexing/projects/lizexing/ChIP_seq/2020_11_13/macs2_callpeak bw=/f/xudonglab/zexing/projects/lizexing/ChIP_seq/2020_11_13/bam_bw results=/f/xudonglab/zexing/projects/lizexing/ChIP_seq/2020_11_13/matrix_scale_regionscomputeMatrix scale-regions \ -R ${bed}/Scr_peaks.narrowPeak ${bed}/shTgm1-2_peaks.narrowPeak ${bed}/shTgm2-1_peaks.narrowPeak \ -S ${bw}/Scr.bam.sort.bw ${bw}/shTgm1-2.bam.sort.bw ${bw}/shTgm2-1.bam.sort.bw \ -o ${results}/matrix_scale_threegroups.gz \ -b 1000 -a 1000 \ -p 16# 使用plotHeatmap對結果繪制熱圖并聚類 plotHeatmap -m ${results}/matrix_scale_threegroups.gz \ -o ${results}/scale_threegroups_heatmap.png \ --dpi 750 \ --whatToShow 'heatmap and colorbar' \ --startLabel "Start" \ --endLabel "End" \ --regionsLabel Scr-peak shTgm1-2-peak shTgm2-1-peak \ --samplesLabel Scr shTgm1-2 shTgm2-1 # 使用plotProfile對結果繪制密度圖并聚類 plotProfile -m ${results}/matrix_scale_threegroups.gz \ -o ${results}/scale_threegroups_profile.png \ --dpi 750 \ --legendLocation upper-right \ --startLabel "Start" \ --endLabel "End" \ --regionsLabel Scr-peak shTgm1-2-peak shTgm2-1-peak \ --samplesLabel Scr shTgm1-2 shTgm2-1 \ --perGroup后臺運行ChIP_seq_script_2腳本如下
nohup bash ChIP_seq_script_2 > ChIP_seq_script_2_log &2. computeMatrix reference-point模式計算信號強度并用plotHeatmap/plotProfile作圖
reference-point模式計算的是峰值高點模式,所以指定作圖位置的BED或GTF格式文件為macs2 callpeak生成的后綴為summits.bed的BED文件。
vim新建ChIP_seq_script_3,腳本如下:
#! /bin/bash # 上面一行宣告這個script的語法使用bash語法,當程序被執行時,能夠載入bash的相關環境配置文件。 # Program: # This program is used for computeMatrix reference_point. #History: # 2020/11/03 zexing First release # Reference-point refers to a position within a BED region(e.g., the starting point). In this mode, only those genomicpositions before (upstream) and/or after(downstream) of the reference point will be plotted. # 參數-R 指定作圖位置的BED或GTF格式文件,可用#標記同一組區域,默認無。 # 參數-S 輸入bigwig文件。 # 參數-o 指定輸出為文件名用于plotHeatmap, plotProfile # 參數-b上游(默認0bp),-a下游(默認0bp)設定感興趣的區域,如果該區域是基因,則為基因TSS上游或TES下游。 # 參數--skipZeros設定0分區域的處理 # 參數-p 設置線程數dir=/f/xudonglab/zexing/projects/lizexing/ChIP_seq/2020_11_13 bed=${dir}/macs2_callpeak bw=${dir}/bam_bw results=${dir}/matrix_reference_point computeMatrix reference-point \ -R ${bed}/Scr_summits.bed ${bed}/shTgm1-2_summits.bed ${bed}/shTgm2-1_summits.bed \ -S ${bw}/Scr.bam.sort.bw ${bw}/shTgm1-2.bam.sort.bw ${bw}/shTgm2-1.bam.sort.bw \ -o ${results}/matrix_reference_threegroups.gz \ -b 1000 -a 1000 \ -p 16# 使用plotHeatmap對結果繪制熱圖并聚類 plotHeatmap -m ${results}/matrix_reference_threegroups.gz \ -o ${results}/reference_threegroups_heatmap.png \ --dpi 750 \ --whatToShow 'heatmap and colorbar' \ --refPointLabel "Peak_center" \ --regionsLabel Scr-peak shTgm1-2-peak shTgm2-1-peak \ --samplesLabel Scr shTgm1-2 shTgm2-1 # 使用plotProfile對結果繪制密度圖并聚類 plotProfile -m ${results}/matrix_reference_threegroups.gz \ -o ${results}/reference_threegroups_profile.png \ --dpi 750 \ --legendLocation upper-right \ --refPointLabel "Peak_center" \ --regionsLabel Scr-peak shTgm1-2-peak shTgm2-1-peak \ --samplesLabel Scr shTgm1-2 shTgm2-1 \ --perGroup后臺運行ChIP_seq_script_3腳本如下
nohup bash ChIP_seq_script_3 > ChIP_seq_script_3_log &6.在Linux服務器對ChIP_seq數據不同樣本間的差異peak進行處理
1. 使用macs2 predictd預測插入片段長度
vim新建ChIP_seq_script_4 預測插入片段長度確定均值。
對于插入片段長度,大多數的轉錄因子chip_seq數據推薦值為200, 大部分組蛋白修飾的chip_seq數據推薦值為147。具體實驗可以根據預測來確定均值。
#!/bin/bash # 上面一行宣告這個script的語法使用bash語法,當程序被執行時,能夠載入bash的相關環境配置文件。 # Program # This program is used for ChIP-seq data analysis. # History # 2020/11/13 zexing First release # 設置變量${dir}為常用目錄 dir=/f/xudonglab/zexing/projects/lizexing/ChIP_seq/2020_11_13 # 用戶名稱和日期需要更改 # 利用for循環進行后續操作 for i in Scr shTgm1-2 shTgm2-1 # 樣品名稱需要更改 do macs2 predictd -i ${dir}/bam_sort/${i}.bam.sort done后臺運行ChIP_seq_script_4腳本如下
nohup bash ChIP_seq_script_4 > ChIP_seq_script_4_log &在運行結果日志“ChIP_seq_script_4_log”中查看預測片段長度,操作記錄如下:
(base) zexing@DNA:~/projects/lizexing/ChIP_seq/2020_11_13/scripts_log$ egrep "predicted fragment length is" ChIP_seq_script_4_log INFO @ Mon, 16 Nov 2020 17:38:09: # predicted fragment length is 275 bps INFO @ Mon, 16 Nov 2020 17:38:57: # predicted fragment length is 292 bps INFO @ Mon, 16 Nov 2020 17:39:36: # predicted fragment length is 284 bps2. 使用macs2 callpeak查看測序深度
vim新建ChIP_seq_script_5 查看測序深度。
#!/bin/bash # 上面一行宣告這個script的語法使用bash語法,當程序被執行時,能夠載入bash的相關環境配置文件。 # Program # This program is used for ChIP-seq data analysis. # History # 2020/11/13 zexing First release # 設置變量${dir}為常用目錄 dir=/f/xudonglab/zexing/projects/lizexing/ChIP_seq/2020_11_13 # 用戶名稱和日期需要更改 # 利用for循環進行后續操作 for i in Scr shTgm1-2 shTgm2-1 # 樣品名稱需要更改 do macs2 callpeak -t ${dir}/bam_sort/${i}.bam.sort --outdir ${dir}/macs2_bdgdiff/ -n ${i} -q 0.05 -g mm -B --nomodel --extsize 260 egrep "tags after filtering in treatment|tags after filtering in control" ${dir}/macs2_bdgdiff/${i}_peaks.xls >> ${dir}/macs2_bdgdiff/${i}_tags.txt done后臺運行ChIP_seq_script_5腳本如下
nohup bash ChIP_seq_script_5 > ChIP_seq_script_5_log &查看各樣本的測序深度,操作記錄如下:
(base) zexing@DNA:~/projects/lizexing/ChIP_seq/2020_11_13/macs2_bdgdiff$ cat Scr_tags.txt # tags after filtering in treatment: 42361873. 使用macs2 bdgdiff提取樣品間差異peak
vim新建ChIP_seq_script_6 提取樣品間差異peak。
#! /bin/bash #上面一行宣告這個script的語法使用bash語法,當程序被執行時,能夠載入bash的相關環境配置文件。 # Program: # This program is used for calling differential binding events of ChIP-seq data by macs2. # History: # 2020/11/06 zexing First release # 用法說明: # usage: macs2 bdgdiff [-h] --t1 T1BDG --t2 T2BDG --c1 C1BDG --c2 C2BDG # [-C CUTOFF] [-l MINLEN] [-g MAXGAP] [--d1 DEPTH1] # [--d2 DEPTH2] [--outdir OUTDIR] # (--o-prefix OPREFIX | -o OFILE OFILE OFILE) # 可選參數: # optional arguments: # 參數 --t1是讀取MACS pileup bedGraph for condition 1. # 參數 --t2是讀取MACS pileup bedGraph for condition 2. # 參數 --c1是讀取MACS control lambda bedGraph for condition 1. # 參數 --c2是讀取MACS control lambda bedGraph for condition 2. # 參數 -g 是Maximu gap to merge nearby differential regions. # 參數 -l Minimum length of differential region. Try bigger value to remove small regions. DEFAULT: 200 # 參數 --d1 Sequencing depth (# of non-redundant reads in million) for condition 1. # 參數 --d2 Sequencing depth (# of non-redundant reads in million) for condition 2. # 參數 --o-prefix diff_c1_vs_c2保存輸出文件名。# 設置變量${dir}為常用目錄 dir=/f/xudonglab/zexing/projects/lizexing/ChIP_seq/2020_11_13/macs2_bdgdiffmacs2 bdgdiff --t1 ${dir}/Scr_treat_pileup.bdg --c1 ${dir}/Scr_control_lambda.bdg --d1 4236187 \ --t2 ${dir}/shTgm1-2_treat_pileup.bdg --c2 ${dir}/shTgm1-2_control_lambda.bdg --d2 5103633 \ -g 60 -l 260 --o-prefix ${dir}/diff_Scr_vs_shTgm1-2macs2 bdgdiff --t1 ${dir}/Scr_treat_pileup.bdg --c1 ${dir}/Scr_control_lambda.bdg --d1 4236187 \ --t2 ${dir}/shTgm2-1_treat_pileup.bdg --c2 ${dir}/shTgm2-1_control_lambda.bdg --d2 4491626 \ -g 60 -l 260 --o-prefix ${dir}/diff_Scr_vs_shTgm2-1后臺運行ChIP_seq_script_6腳本如下
nohup bash ChIP_seq_script_6 > ChIP_seq_script_6_log &其中-d1和-d2的值就是第二步運行時輸出的reads數目,-o參數指定輸出文件的前綴。運行成功后,會產生3個文件
diff_Scr_vs_shTgm1-2_c3.0_cond1.bed # 保存在condition1中上調的peakdiff_Scr_vs_shTgm1-2_c3.0_cond2.bed # 保存了在condition2中上調的peakdiff_Scr_vs_shTgm1-2_c3.0_common.bed # 保存的是沒有達到閾值的,非顯著差異peak上述3個文件格式是完全相同的,最后一列的內容為log10 likehood ratio值,用來衡量兩個條件之間的差異,默認閾值為3,大于閾值的peak為組間差異顯著的peak, 這個閾值可以通過-c參數進行調整。
7.在RStudio中利用ChIPseeker進行基因注釋及轉換
此部分使用上一步生成的bed文件在本地機Rstudio中,利用ChIPseeker進行操作。
1. 對常規macs2 callpeak的BED文件進行注釋
narrowPeak文件是針對一段peak區域注釋,summits.bed是針對peak峰值注釋。
# 編輯腳本如下: # This script is used for analysis of daizhongye ChIP-seq data # History # Lizexing 2020-11-13 First release # 原始測序數據經過在服務器上進行bowtie2比對和macs2 callpeak分析后,得到的bed文件, # 將其下載之本地電腦后進行后續操作 # 安裝ChIPseeker包 # BiocManager::install("ChIPseeker") # 設置工作目錄 setwd("G:/daizhongye/ChIP-seq/2020_10_29/macs2_callpeak/") #加載ChIPseeker包 library(ChIPseeker) # 加載基因組注釋庫 # 安裝小鼠注釋包 # BiocManager::install("TxDb.Mmusculus.UCSC.mm10.knownGene") # 安裝人的注釋包 # BiocManager::install("TxDb.Hsapiens.UCSC.hg19.knownGene") # 讀取chipseq峰的bed文件 Scr_peak <- readPeakFile("G:/daizhongye/ChIP-seq/2020_10_29/macs2_callpeak/Scr_summits.bed") shTgm1_2_peak <- readPeakFile("G:/daizhongye/ChIP-seq/2020_10_29/macs2_callpeak/shTgm1-2_summits.bed") shTgm2_1_peak <- readPeakFile("G:/daizhongye/ChIP-seq/2020_10_29/macs2_callpeak/shTgm1-2_summits.bed")# 注釋,TSS的范圍可自定義 # 加載小鼠基因組注釋包 require(TxDb.Mmusculus.UCSC.mm10.knownGene) # 對txdb進行指定 txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene# 進行注釋 Scr_peakAnno <- annotatePeak(Scr_peak, tssRegion = c(-3000, 3000), TxDb = txdb) shTgm1_2_peakAnno <- annotatePeak(shTgm1_2_peak, tssRegion = c(-3000, 3000), TxDb = txdb) shTgm2_1_peakAnno <- annotatePeak(shTgm2_1_peak, tssRegion = c(-3000, 3000), TxDb = txdb)# 輸出結果 # 設置工作目錄 setwd("G:/daizhongye/ChIP-seq/2020_10_29/peakanno/") write.table(Scr_peakAnno, file = "Scr_peak.txt",sep = '\t', quote = FALSE, row.names = FALSE) write.table(shTgm1_2_peakAnno, file = "shTgm1_2_peak.txt",sep = '\t', quote = FALSE, row.names = FALSE) write.table(shTgm2_1_peakAnno, file = "shTgm2_1_peak.txt",sep = '\t', quote = FALSE, row.names = FALSE)# 對Scr數據分布進行繪圖 tiff("Scr_peakAnno_1.tiff") plotAnnoBar(Scr_peakAnno) dev.off()tiff("Scr_peakAnno_2.tiff") vennpie(Scr_peakAnno) dev.off()tiff("Scr_peakAnno_3.tiff") plotAnnoPie(Scr_peakAnno) dev.off()tiff("Scr_peakAnno_4.tiff") plotDistToTSS(Scr_peakAnno) dev.off()#對一組數據分布進行繪圖 tiff("shTgm1_2_peakAnno_1.tiff") plotAnnoBar(shTgm1_2_peakAnno) dev.off()tiff("shTgm1_2_peakAnno_2.tiff") vennpie(shTgm1_2_peakAnno) dev.off()tiff("shTgm1_2_peakAnno_3.tiff") plotAnnoPie(shTgm1_2_peakAnno) dev.off()tiff("shTgm1_2_peakAnno_4.tiff") plotDistToTSS(shTgm1_2_peakAnno) dev.off()# 對二組數據分布進行繪圖 tiff("shTgm2_1_peakAnno_1.tiff") plotAnnoBar(shTgm2_1_peakAnno) dev.off()tiff("shTgm2_1_peakAnno_2.tiff") vennpie(shTgm2_1_peakAnno) dev.off()tiff("shTgm2_1_peakAnno_3.tiff") plotAnnoPie(shTgm2_1_peakAnno) dev.off()tiff("shTgm2_1_peakAnno_4.tiff") plotDistToTSS(shTgm2_1_peakAnno) dev.off()2. 利用下述腳本對提取出來的cond1.bed和cond2.bed進行注釋
參考文章:CHIP-seq流程學習筆記(6)-peak注釋軟件ChIPseeker
# 編輯腳本如下: # This script is used for Annotate peaks from macs2_bdgdiff of daizhongye ChIP-seq data。 # History # Lizexing 2020-11-07 First release # 利用macs2 bdgdiff命令得到差異化的peak后,得到3種類型的bed文件,將其下載之本地電腦后進行操作。 # 安裝ChIPseeker包 # BiocManager::install("ChIPseeker") # 設置工作目錄 setwd("G:/daizhongye/ChIP-seq/2020_10_29/macs2_bdgdiff/") #加載ChIPseeker包 library(ChIPseeker) # 加載基因組注釋庫 # 安裝小鼠注釋包 # BiocManager::install("TxDb.Mmusculus.UCSC.mm10.knownGene") # 安裝人的注釋包 # BiocManager::install("TxDb.Hsapiens.UCSC.hg19.knownGene") # 讀取差異化peak的bed文件 Scr_vs_shTgm1_2_cond1 <- readPeakFile("G:/daizhongye/ChIP-seq/2020_10_29/macs2_bdgdiff/diff_Scr_vs_shTgm1-2_c3.0_cond1.bed") Scr_vs_shTgm1_2_cond2 <- readPeakFile("G:/daizhongye/ChIP-seq/2020_10_29/macs2_bdgdiff/diff_Scr_vs_shTgm1-2_c3.0_cond2.bed") Scr_vs_shTgm2_1_cond1 <- readPeakFile("G:/daizhongye/ChIP-seq/2020_10_29/macs2_bdgdiff/diff_Scr_vs_shTgm2-1_c3.0_cond1.bed") Scr_vs_shTgm2_1_cond2 <- readPeakFile("G:/daizhongye/ChIP-seq/2020_10_29/macs2_bdgdiff/diff_Scr_vs_shTgm2-1_c3.0_cond2.bed")# 注釋,TSS的范圍可自定義 # 加載小鼠基因組注釋包 require(TxDb.Mmusculus.UCSC.mm10.knownGene) # 對txdb進行指定 txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene# 進行注釋 Scr_vs_shTgm1_2_cond1_peakAnno <- annotatePeak(Scr_vs_shTgm1_2_cond1, tssRegion = c(-3000, 3000), TxDb = txdb) Scr_vs_shTgm1_2_cond2_peakAnno <- annotatePeak(Scr_vs_shTgm1_2_cond2, tssRegion = c(-3000, 3000), TxDb = txdb) Scr_vs_shTgm2_1_cond1_peakAnno <- annotatePeak(Scr_vs_shTgm2_1_cond1, tssRegion = c(-3000, 3000), TxDb = txdb) Scr_vs_shTgm2_1_cond2_peakAnno <- annotatePeak(Scr_vs_shTgm2_1_cond2, tssRegion = c(-3000, 3000), TxDb = txdb)# 保存注釋結果 write.table(Scr_vs_shTgm1_2_cond1_peakAnno, file = "Scr_vs_shTgm1_2_cond1_peakAnno.csv",sep = '\t', quote = TRUE, row.names = FALSE) write.table(Scr_vs_shTgm1_2_cond2_peakAnno, file = "Scr_vs_shTgm1_2_cond2_peakAnno.csv",sep = '\t', quote = TRUE, row.names = FALSE) write.table(Scr_vs_shTgm2_1_cond1_peakAnno, file = "Scr_vs_shTgm2_1_cond1_peakAnno.csv",sep = '\t', quote = TRUE, row.names = FALSE) write.table(Scr_vs_shTgm2_1_cond2_peakAnno, file = "Scr_vs_shTgm2_1_cond2_peakAnno.csv",sep = '\t', quote = TRUE, row.names = FALSE)3. 利用下述腳本將注釋文件中的gene_id轉化為gene_symbol
參考文章:R中常用函數使用說明
# 構建待處理基因集的向量 setwd("G:/daizhongye/ChIP-seq/2020_10_29/macs2_bdgdiff/")Scr_vs_shTgm1_2_cond1 <- read.csv("G:/daizhongye/ChIP-seq/2020_10_29/macs2_bdgdiff/Scr_vs_shTgm1_2_cond1_peakAnno.csv", header = TRUE,sep = "\t" ) Scr_vs_shTgm1_2_cond2 <- read.csv("G:/daizhongye/ChIP-seq/2020_10_29/macs2_bdgdiff/Scr_vs_shTgm1_2_cond2_peakAnno.csv", header = TRUE,sep = "\t" ) Scr_vs_shTgm2_1_cond1 <- read.csv("G:/daizhongye/ChIP-seq/2020_10_29/macs2_bdgdiff/Scr_vs_shTgm2_1_cond1_peakAnno.csv", header = TRUE,sep = "\t" ) Scr_vs_shTgm2_1_cond2 <- read.csv("G:/daizhongye/ChIP-seq/2020_10_29/macs2_bdgdiff/Scr_vs_shTgm2_1_cond2_peakAnno.csv", header = TRUE,sep = "\t" )# 提取待處理基因集中的gene_id一列并轉化為向量格式 Scr_vs_shTgm1_2_cond1_V1 <- as.vector(Scr_vs_shTgm1_2_cond1[, 14]) Scr_vs_shTgm1_2_cond2_V1 <- as.vector(Scr_vs_shTgm1_2_cond2[, 14]) Scr_vs_shTgm2_1_cond1_V1 <- as.vector(Scr_vs_shTgm2_1_cond1[, 14]) Scr_vs_shTgm2_1_cond2_V1 <- as.vector(Scr_vs_shTgm2_1_cond2[, 14])# 由鼠的gene_id轉化到gene_symbol library("clusterProfiler") library("org.Mm.eg.db") Scr_vs_shTgm1_2_cond1_V2 <- bitr(Scr_vs_shTgm1_2_cond1_V1, # 待轉化的文件名fromType = "ENTREZID", # fromType是指你的數據ID類型是屬于哪一類的toType = "SYMBOL", # toType是指你要轉換成哪種ID類型,可以寫多種,也可以只寫一種OrgDb = org.Mm.eg.db) # Orgdb是指對應的注釋包是哪個Scr_vs_shTgm1_2_cond2_V2 <- bitr(Scr_vs_shTgm1_2_cond2_V1, # 待轉化的文件名fromType = "ENTREZID", # fromType是指你的數據ID類型是屬于哪一類的toType = "SYMBOL", # toType是指你要轉換成哪種ID類型,可以寫多種,也可以只寫一種OrgDb = org.Mm.eg.db) # Orgdb是指對應的注釋包是哪個Scr_vs_shTgm2_1_cond1_V2 <- bitr(Scr_vs_shTgm2_1_cond1_V1, # 待轉化的文件名fromType = "ENTREZID", # fromType是指你的數據ID類型是屬于哪一類的toType = "SYMBOL", # toType是指你要轉換成哪種ID類型,可以寫多種,也可以只寫一種OrgDb = org.Mm.eg.db) # Orgdb是指對應的注釋包是哪個Scr_vs_shTgm2_1_cond2_V2 <- bitr(Scr_vs_shTgm2_1_cond2_V1, # 待轉化的文件名fromType = "ENTREZID", # fromType是指你的數據ID類型是屬于哪一類的toType = "SYMBOL", # toType是指你要轉換成哪種ID類型,可以寫多種,也可以只寫一種OrgDb = org.Mm.eg.db) # Orgdb是指對應的注釋包是哪個# 查看轉化后的結果 View(Scr_vs_shTgm1_2_cond1_V2) View(Scr_vs_shTgm1_2_cond2_V2) View(Scr_vs_shTgm2_1_cond1_V2) View(Scr_vs_shTgm2_1_cond2_V2)# 保存差異gene_symbol以便后續處理 write.csv(Scr_vs_shTgm1_2_cond1_V2, "G:/daizhongye/ChIP-seq/2020_10_29/macs2_bdgdiff/Scr_vs_shTgm1_2_cond1_V2.txt", row.names = FALSE) write.csv(Scr_vs_shTgm1_2_cond2_V2, "G:/daizhongye/ChIP-seq/2020_10_29/macs2_bdgdiff/Scr_vs_shTgm1_2_cond2_V2.txt", row.names = FALSE) write.csv(Scr_vs_shTgm2_1_cond1_V2, "G:/daizhongye/ChIP-seq/2020_10_29/macs2_bdgdiff/Scr_vs_shTgm2_1_cond1_V2.txt", row.names = FALSE) write.csv(Scr_vs_shTgm2_1_cond2_V2, "G:/daizhongye/ChIP-seq/2020_10_29/macs2_bdgdiff/Scr_vs_shTgm2_1_cond2_V2.txt", row.names = FALSE)# 更改列名稱進行合并文件 colnames(Scr_vs_shTgm1_2_cond1_V2)[1] <- "geneId" colnames(Scr_vs_shTgm1_2_cond2_V2)[1] <- "geneId" colnames(Scr_vs_shTgm2_1_cond1_V2)[1] <- "geneId" colnames(Scr_vs_shTgm2_1_cond2_V2)[1] <- "geneId"# 合并轉化后的文件 Scr_vs_shTgm1_2_cond1_V3 <- merge(Scr_vs_shTgm1_2_cond1_V2, Scr_vs_shTgm1_2_cond1, by = "geneId") Scr_vs_shTgm1_2_cond2_V3 <- merge(Scr_vs_shTgm1_2_cond2_V2, Scr_vs_shTgm1_2_cond2, by = "geneId") Scr_vs_shTgm2_1_cond1_V3 <- merge(Scr_vs_shTgm2_1_cond1_V2, Scr_vs_shTgm2_1_cond1, by = "geneId") Scr_vs_shTgm2_1_cond2_V3 <- merge(Scr_vs_shTgm2_1_cond2_V2, Scr_vs_shTgm2_1_cond2, by = "geneId")# 對結果進行輸出保存 write.csv(Scr_vs_shTgm1_2_cond1_V3, "G:/daizhongye/ChIP-seq/2020_10_29/macs2_bdgdiff/Scr_vs_shTgm1_2_cond1_V3.csv", row.names = FALSE, quote = TRUE) write.csv(Scr_vs_shTgm1_2_cond2_V3, "G:/daizhongye/ChIP-seq/2020_10_29/macs2_bdgdiff/Scr_vs_shTgm1_2_cond2_V3.csv", row.names = FALSE, quote = TRUE) write.csv(Scr_vs_shTgm2_1_cond1_V3, "G:/daizhongye/ChIP-seq/2020_10_29/macs2_bdgdiff/Scr_vs_shTgm2_1_cond1_V3.csv", row.names = FALSE, quote = TRUE) write.csv(Scr_vs_shTgm2_1_cond2_V3, "G:/daizhongye/ChIP-seq/2020_10_29/macs2_bdgdiff/Scr_vs_shTgm2_1_cond2_V3.csv", row.names = FALSE, quote = TRUE)4. 利用intersect()函數將ChIP-seq和RNA-seq中相同的基因提取出來
參考文章:R中常用函數使用說明
# 利用intersect函數對RNA_seq和ChIP_seq中的交集gene_symbol進行提取。 # 參考文章:https://blog.csdn.net/woodcorpse/article/details/80494605?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522160471856319195264746932%2522%252C%2522scm%2522%253A%252220140713.130102334..%2522%257D&request_id=160471856319195264746932&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2~all~baidu_landing_v2~default-1-80494605.pc_first_rank_v2_rank_v28&utm_term=R%E4%B8%ADintersect&spm=1018.2118.3001.4449 # 交集intersect:兩個向量的交集,集合可以是數字、字符串等# ChIP_seq中的差異peak在上面已經定義,分別是: # Scr_vs_shTgm1_2_cond1即Scr中上調的peak,即敲除shTgm1_2后降低的peak ChIP_seq_group_1_down_genes <- as.vector(Scr_vs_shTgm1_2_cond1_V2$SYMBOL) # Scr_vs_shTgm1_2_cond2即敲除shTgm1_2后升高的peak ChIP_seq_group_1_up_genes <- as.vector(Scr_vs_shTgm1_2_cond2_V2$SYMBOL) # Scr_vs_shTgm2_1_cond1即Scr中上調的peak,即敲除shTgm2_1后降低的peak ChIP_seq_group_2_down_genes <-as.vector(Scr_vs_shTgm2_1_cond1_V2$SYMBOL) # Scr_vs_shTgm2_1_cond2即敲除shTgm2_1后升高的peak ChIP_seq_group_2_up_genes <- as.vector(Scr_vs_shTgm2_1_cond2_V2$SYMBOL)# RNA_seq中的差異gene需要再重新讀入,分別是: # 敲除shTgm1_2后降低的genes RNA_seq_group_1_down_genes <- read.csv("G:/daizhongye/RNA-seq/2020_10_29/Rtreatment/significant_different_genes/significant_pvalue_different_genes_group_1_genecount_down.csv", header = TRUE,sep = ",") # 敲除shTgm1_2后升高的genes RNA_seq_group_1_up_genes <- read.csv("G:/daizhongye/RNA-seq/2020_10_29/Rtreatment/significant_different_genes/significant_pvalue_different_genes_group_1_genecount_up.csv", header = TRUE,sep = ",") # 敲除shTgm2_1后降低的genes RNA_seq_group_2_down_genes <- read.csv("G:/daizhongye/RNA-seq/2020_10_29/Rtreatment/significant_different_genes/significant_pvalue_different_genes_group_2_genecount_down.csv", header = TRUE,sep = ",") # 敲除shTgm2_1后升高的genes RNA_seq_group_2_up_genes <- read.csv("G:/daizhongye/RNA-seq/2020_10_29/Rtreatment/significant_different_genes/significant_pvalue_different_genes_group_2_genecount_up.csv", header = TRUE,sep = ",")# 將RNA_seq中的差異genes對應的一列提取出來,并轉化為向量形式 RNA_seq_group_1_down_genes_V1 <- as.vector(RNA_seq_group_1_down_genes$X) RNA_seq_group_1_up_genes_V1 <- as.vector(RNA_seq_group_1_up_genes$X) RNA_seq_group_2_down_genes_V1 <- as.vector(RNA_seq_group_2_down_genes$X) RNA_seq_group_2_up_genes_V1 <- as.vector(RNA_seq_group_2_up_genes$X)# 提取ChIP_seq和RNA_seq中共有的gene交集 group_1_down_genes <- intersect(ChIP_seq_group_1_down_genes,RNA_seq_group_1_down_genes_V1) group_1_up_genes <- intersect(ChIP_seq_group_1_up_genes,RNA_seq_group_1_up_genes_V1) group_2_down_genes <- intersect(ChIP_seq_group_2_down_genes,RNA_seq_group_2_down_genes_V1) group_2_up_genes <- intersect(ChIP_seq_group_2_up_genes,RNA_seq_group_2_up_genes_V1)總結
以上是生活随笔為你收集整理的操作记录-2020-11-13:精简代码处理ChIP_seq数据的全部內容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 格兰杰检验的基本步骤_Toda-Yama
- 下一篇: 自然环境资源数据集分享——资源环境数据云