StringTie in Detail

Published: 2023/12/8

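StringTie is a fast and efficient assembler of RNA-Seq alignments into potential transcripts. It uses a novel network flow algorithm, together with an optional de novo assembly step, to assemble and quantify full-length transcripts representing the multiple splice variants of each gene locus. Its input can include not only the short-read alignments that other transcript assemblers also accept, but also alignments of longer sequences assembled from those reads. To identify differentially expressed genes between experiments, StringTie's output can be processed by specialized software such as Ballgown or Cuffdiff, or by other programs (DESeq2, edgeR, etc.).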

Download and installation

Installing from source

wget http://ccb.jhu.edu/software/stringtie/dl/stringtie-2.2.1.tar.gz
tar -zxvf stringtie-2.2.1.tar.gz
cd stringtie-2.2.1
make release

Installing from GitHub

git clone https://github.com/gpertea/stringtie
cd stringtie
make release

Installing with conda (recommended: the quickest and easiest route)

conda install stringtie -c bioconda

Usage in detail

Basic usage: stringtie <aligned_reads.bam> [options]*

The usage text below comes from StringTie v1.3.3b. Run stringtie -h to see the options of the version you actually installed.

StringTie v1.3.3b usage:

stringtie <input.bam ..> [-G <guide_gff>] [-l <label>] [-o <out_gtf>] [-p <cpus>]
 [-v] [-a <min_anchor_len>] [-m <min_tlen>] [-j <min_anchor_cov>] [-f <min_iso>]
 [-C <coverage_file_name>] [-c <min_bundle_cov>] [-g <bdist>] [-u]
 [-e] [-x <seqid,..>] [-A <gene_abund.out>] [-h] {-B | -b <dir_path>}

Assemble RNA-Seq alignments into potential transcripts.

Options:

--version : print just the version at stdout and exit.
-G : reference annotation to use for guiding the assembly process (GTF/GFF3). The output then contains both the expressed known transcripts and novel transcripts. Options -B, -b, -e and -C require this option.
--rf : assume a stranded library of type fr-firststrand (most commonly produced by dUTP protocols; others include NSR and NNSR).
--fr : assume a stranded library of type fr-secondstrand (e.g. Ligation, Standard SOLiD).
-l : name prefix for output transcripts (default: STRG).
-f : minimum isoform fraction (default: 0.1). Sets the minimum abundance of a predicted isoform as a fraction of the most abundant transcript assembled at the same locus. Lower-abundance transcripts are often artifacts of incompletely spliced precursors of processed transcripts.
-m : minimum assembled transcript length (default: 200).
-o : output path/file name for the assembled transcripts GTF (default: stdout). A full path can be given here, in which case directories are created as needed.
-a : minimum anchor length for junctions (default: 10).
-j : minimum junction coverage (default: 1). At least this many spliced reads must align across a junction. The number can be fractional, because some reads align in more than one place: a read that aligns in n places contributes 1/n to the junction coverage.
-t : disable trimming of predicted transcripts based on coverage (default: coverage trimming is enabled). By default StringTie adjusts the predicted start and/or stop coordinates of a transcript where the coverage of the assembled transcript drops sharply.
-c : minimum reads per bp coverage to consider for transcript assembly (default: 2.5). A transcript whose coverage falls below this threshold is left out of the output.
-v : verbose (log bundle processing details).
-g : gap between read mappings triggering a new bundle (default: 50). Reads mapped closer together than this distance are processed in the same bundle.
-C : output a file with the reference transcripts that are covered by reads (requires -G).
-M : fraction of bundle allowed to be covered by multi-hit reads (default: 0.95).
-p : number of threads (CPUs) to use (default: 1).
-A : gene abundance estimation output file.
-B : enable output of Ballgown table files (*.ctab), which are created in the same directory as the output GTF (requires -G, -o recommended). The files contain coverage data for the reference transcripts given with -G.
-b : enable output of Ballgown table files, but create them under the directory given as <dir_path> instead of the directory of the -o output.
-e : only estimate the abundance of the given reference transcripts (requires -G). Read alignments are processed only to estimate and output the assembled transcripts that match the reference transcripts given with -G. Transcripts that do not match the reference are skipped, which makes processing much faster.
-x : do not assemble any transcripts on the given reference sequence(s). The argument is a single reference sequence name (e.g. -x chrM) or a comma-separated list of names (e.g. -x 'chrM,chrX,chrY'). This can speed up StringTie, especially when the mitochondrial genome is excluded: its genes can have very high coverage in some cases while being of no interest for a particular RNA-Seq analysis.
-u : no multi-mapping correction (default: correction enabled).
-h : print this usage message and exit.

Transcript merge usage mode:

stringtie --merge [Options] { gtf_list | strg1.gtf ...}

With this option StringTie will assemble transcripts from multiple input files, generating a unified non-redundant set of isoforms. In this mode the following options are available:

-G <guide_gff> : reference annotation to include in the merging (GTF/GFF3).
-o <out_gtf> : output file name for the merged transcripts GTF (default: stdout).
-m <min_len> : minimum input transcript length to include in the merge (default: 50).
-c <min_cov> : minimum input transcript coverage to include in the merge (default: 0).
-F <min_fpkm> : minimum input transcript FPKM to include in the merge (default: 1.0).
-T <min_tpm> : minimum input transcript TPM to include in the merge (default: 1.0).
-f <min_iso> : minimum isoform fraction (default: 0.01).
-g <gap_len> : gap between transcripts to merge together (default: 250).
-i : keep merged transcripts with retained introns; by default these are not kept unless there is strong evidence for them.
-l <label> : name prefix for output transcripts (default: MSTRG).
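The options above can be combined into a typical two-pass workflow: assemble each sample, merge the assemblies, then re-estimate abundances against the merged annotation. The following is only a sketch in Python of how such command lines might be put together. The helper names (build_stringtie_command, build_merge_command) and all file names are hypothetical; only the flags themselves (-G, -o, -p, -l, -A, -e, -B, --merge) come from the usage text above.

```python
import shlex


def build_stringtie_command(bam, out_gtf, guide_gff=None, threads=1,
                            label=None, gene_abund=None,
                            estimate_only=False, ballgown=False):
    """Return the argument list for one StringTie run (hypothetical helper)."""
    if (estimate_only or ballgown) and guide_gff is None:
        # The usage text states that -e and -B require -G.
        raise ValueError("-e and -B require a reference annotation (-G)")
    cmd = ["stringtie", bam, "-o", out_gtf, "-p", str(threads)]
    if guide_gff is not None:
        cmd += ["-G", guide_gff]
    if label is not None:
        cmd += ["-l", label]
    if gene_abund is not None:
        cmd += ["-A", gene_abund]
    if estimate_only:
        cmd.append("-e")
    if ballgown:
        cmd.append("-B")
    return cmd


def build_merge_command(gtf_files, out_gtf, guide_gff=None):
    """Return the argument list for 'stringtie --merge' (hypothetical helper)."""
    cmd = ["stringtie", "--merge", "-o", out_gtf]
    if guide_gff is not None:
        cmd += ["-G", guide_gff]
    return cmd + list(gtf_files)


# Made-up file names, for illustration only.
assemble = build_stringtie_command("s1.sorted.bam", "s1.gtf",
                                   guide_gff="ref.gtf", threads=4, label="S1")
merge = build_merge_command(["s1.gtf", "s2.gtf"], "merged.gtf",
                            guide_gff="ref.gtf")
quantify = build_stringtie_command("s1.sorted.bam", "ballgown/s1/s1.gtf",
                                   guide_gff="merged.gtf", threads=4,
                                   gene_abund="s1.gene_abund.tab",
                                   estimate_only=True, ballgown=True)

# Print shell-quoted commands; pass the lists to subprocess.run to execute them.
for cmd in (assemble, merge, quantify):
    print(shlex.join(cmd))
```

Building argument lists rather than shell strings avoids quoting problems with unusual file names; the printed lines can be pasted into a terminal when StringTie is installed.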

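Keep the following points in mind when using StringTie:

First, aligned_reads.bam is the input file, and it must be sorted by genomic position. The accepted_hits.bam file produced by TopHat can be used directly, whereas HISAT2 output must first be sorted with samtools sort before it can serve as input.
Second, every spliced read alignment in the input BAM file (that is, every alignment spanning at least one junction) must carry the XS tag, which indicates the genomic strand that produced the RNA from which the read was sequenced. Alignments produced by TopHat and by HISAT2 (when run with --dta, which reports alignments tailored for transcript assemblers) already contain the XS tag. Other read mappers may not add it, so check for it before moving on. Note: always run HISAT2 with the --dta option, otherwise the results will suffer.
Third, a reference annotation file in GTF/GFF3 format can optionally be given to StringTie. In that case StringTie prefers the "known" genes from the annotation, and for those that are expressed it computes coverage, TPM and FPKM values. It also produces additional transcripts that are not present in the annotation. Note that when the -e option is not used, a reference transcript must be fully covered by reads to be included in StringTie's output. In that case, transcripts that StringTie assembles from the data and that are absent from the annotation are output as well.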
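The first two requirements (coordinate-sorted input, and an XS tag on every spliced alignment) can be checked before running StringTie. The sketch below assumes the alignments are available as SAM text including the header, for example from samtools view -h. The function name check_sam_lines and the sample records are made up for illustration. It treats an alignment as spliced when its CIGAR string contains an N operation, and it reads the sort order from the SO field of the @HD header line.

```python
def check_sam_lines(lines):
    """Return (sorted_by_coordinate, names_of_spliced_reads_missing_XS).

    Hypothetical helper operating on SAM text lines, header included.
    """
    sorted_by_coordinate = False
    missing_xs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            # The @HD header line records the sort order in its SO field.
            if line.startswith("@HD") and "SO:coordinate" in line.split("\t"):
                sorted_by_coordinate = True
            continue
        fields = line.split("\t")
        cigar = fields[5]    # column 6 of a SAM record
        tags = fields[11:]   # optional tags start at column 12
        is_spliced = "N" in cigar  # N marks a skipped region (intron)
        has_xs = any(tag.startswith("XS:A:") for tag in tags)
        if is_spliced and not has_xs:
            missing_xs.append(fields[0])
    return sorted_by_coordinate, missing_xs


# Made-up records: r1 is unspliced, r2 is spliced with XS, r3 is spliced without XS.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t*\t*\tNH:i:1",
    "r2\t0\tchr1\t200\t60\t20M500N30M\t*\t0\t0\t*\t*\tNH:i:1\tXS:A:+",
    "r3\t16\tchr1\t300\t60\t25M300N25M\t*\t0\t0\t*\t*\tNH:i:1",
]
is_sorted, missing = check_sam_lines(example)
print("coordinate-sorted:", is_sorted)
print("spliced reads without XS:", missing)
```

If any read names are reported, the aligner was probably run without the option that adds strand information (for HISAT2, --dta).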

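Output

Main outputs

GTF file: the assembled transcripts (option -o)

Tab-delimited file: gene abundances (option -A)

GTF file: the reference transcripts that are fully covered by reads (option -C)

*.ctab files: input for downstream differential expression analysis with Ballgown (option -B)

GTF file: in merge mode, a single merged GTF file

GTF file: assembled transcripts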

seqname: chromosome, contig, or scaffold.
source: the source of the GTF file.
feature: feature type, e.g. exon, transcript, mRNA, 5'UTR.
start: start position, using a 1-based index.
end: end position, using a 1-based index.
score: confidence score for the assembled transcript. This field is currently not used, and StringTie outputs the constant value 1000 if the transcript has a connection to a read alignment bundle.
strand: '+' for the forward strand, '-' for the reverse strand.
frame: frame or phase of CDS features. StringTie does not use this field and simply records a ".".
attributes:
- gene_id: A unique identifier for a single gene and its child transcripts and exons, based on the alignments' file name.
- transcript_id: A unique identifier for a single transcript and its child exons, based on the alignments' file name.
- exon_number: A unique identifier for a single exon, starting from 1, within a given transcript.
- reference_id: The transcript_id in the reference annotation (optional) that the instance matched.
- ref_gene_id: The gene_id in the reference annotation (optional) that the instance matched.
- ref_gene_name: The gene_name in the reference annotation (optional) that the instance matched.
- cov: The average per-base coverage for the transcript or exon.
- FPKM: Fragments per kilobase of transcript per million read pairs. This is the number of pairs of reads aligning to this feature, normalized by the total number of fragments sequenced (in millions) and the length of the transcript (in kilobases).
- TPM: Transcripts per million. This is the number of transcripts from this particular gene normalized first by gene length, and then by sequencing depth (in millions) in the sample. TPM was defined by B. Li and C. Dewey.
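To make the field layout concrete, here is a small Python sketch that splits one GTF line into the nine columns listed above and turns the attributes column into a dictionary. The function name parse_gtf_line is hypothetical and the sample line is made up for illustration; it assumes the usual GTF convention of key "value"; pairs in the last column.

```python
import re

GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attributes"]

# Matches pairs such as: gene_id "STRG.1"
_ATTRIBUTE_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_gtf_line(line):
    """Parse one tab-separated GTF record into a dict (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GTF_COLUMNS):
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])  # 1-based
    record["end"] = int(record["end"])      # 1-based
    # Attribute values are kept as strings; convert cov/FPKM/TPM as needed.
    record["attributes"] = dict(_ATTRIBUTE_RE.findall(record["attributes"]))
    return record


# Made-up transcript line in the layout described above.
sample = ('chr1\tStringTie\ttranscript\t11869\t14409\t1000\t+\t.\t'
          'gene_id "STRG.1"; transcript_id "STRG.1.1"; '
          'cov "12.5"; FPKM "3.2"; TPM "4.1";')
record = parse_gtf_line(sample)
print(record["feature"], record["start"], record["end"])
print(record["attributes"]["transcript_id"], float(record["attributes"]["TPM"]))
```

Comment lines beginning with # would need to be skipped before calling the function on a real file.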

Tab file: gene abundances

Column 1 / Gene ID: The gene identifier comes from the reference annotation provided with the -G option. If no reference is provided this field is replaced with the name prefix for output transcripts (-l).
Column 2 / Gene Name: This field contains the gene name in the reference annotation provided with the -G option. If no reference is provided this field is populated with ‘-’.
Column 3 / Reference: Name of the reference sequence that was used in the alignment of the reads. Equivalent to the 3rd column in the .SAM alignment.
Column 4 / Strand: ‘+’ denotes that the gene is on the forward strand, ‘-’ for the reverse strand.
Column 5 / Start: Start position of the gene (1-based index).
Column 6 / End: End position of the gene (1-based index).
Column 7 / Coverage: Per-base coverage of the gene.
Column 8 / FPKM: normalized expression level in FPKM units (see previous section).
Column 9 / TPM: normalized expression level in TPM units (see previous section).
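The nine columns above can be read with a few lines of Python. This is a sketch under two assumptions that are not stated in the text: that the file written by -A starts with a single header row, and that every data row has exactly nine tab-separated fields. The function name read_gene_abundance and the sample rows are made up for illustration.

```python
import csv
import io

COLUMNS = ["gene_id", "gene_name", "reference", "strand",
           "start", "end", "coverage", "fpkm", "tpm"]


def read_gene_abundance(handle, has_header=True):
    """Read a gene abundance table into a list of dicts (hypothetical helper)."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # assumption: the first row holds column titles
    genes = []
    for row in reader:
        if len(row) != len(COLUMNS):
            continue  # skip blank or malformed rows
        gene = dict(zip(COLUMNS, row))
        for key in ("start", "end"):
            gene[key] = int(gene[key])
        for key in ("coverage", "fpkm", "tpm"):
            gene[key] = float(gene[key])
        genes.append(gene)
    return genes


# Made-up table; with a real file, pass open("gene_abund.tab") instead.
sample_text = (
    "Gene ID\tGene Name\tReference\tStrand\tStart\tEnd\tCoverage\tFPKM\tTPM\n"
    "G1\tGENEA\tchr1\t+\t100\t900\t12.5\t3.2\t4.1\n"
    "G2\t-\tchr2\t-\t500\t1500\t40.0\t10.0\t12.8\n"
)
genes = read_gene_abundance(io.StringIO(sample_text))

# Rank genes by TPM, highest first.
ranked = sorted(genes, key=lambda gene: gene["tpm"], reverse=True)
for gene in ranked:
    print(gene["gene_id"], gene["gene_name"], gene["tpm"])
```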

As far as I can tell, this file is essentially the attributes from the output GTF, extracted and broken out into separate columns.

References:
https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual#input
