Quality control of sequencing data
What is quality control?
Quality control is the process of improving data quality by removing identifiable errors from the data. It is the first step performed after the data is acquired.
How critical is quality control?
The more unknowns about the genome under study, the more important it is to correct any errors.
When aligning against well-studied and well-understood genomes, we can recognize and identify errors by their
alignments. When assembling a genome de novo, errors can derail the process; hence, it is more
important to filter with higher stringency.
When do we perform quality control?
Quality control is performed at different stages:
Pre-alignment: “raw data” - the protocols are the same regardless of what analysis will follow.
Post-alignment: “data filtering” - the protocols are specific to the analysis that is being performed.
How reliable are QC tools?
Does quality control introduce errors?
How does read quality trimming work?
Originally, the reliability of sequencing decreased along the read. A common correction is to work backwards from the end of each read and remove low-quality measurements from it. This is called trimming.
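The idea of working backwards from the end of a read can be sketched in a few lines of Python. This is a minimal illustration, not how any particular tool does it: the function names, the Phred+33 quality encoding and the threshold of 20 are assumptions made for the example. Real trimmers such as Trimmomatic offer more elaborate rules, for example sliding windows.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    Assumes Phred+33 encoding (an assumption for this sketch).
    """
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(sequence, quality_string, min_quality=20):
    """Walk backwards from the 3' end and drop bases until one
    reaches min_quality. Returns the trimmed (sequence, quality) pair."""
    scores = phred_scores(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


if __name__ == "__main__":
    # 'I' encodes Phred 40; '#' and '!' encode Phred 2 and 0.
    print(trim_3prime("ACGTACGTAC", "IIIIIIII#!"))
```

Note that this simple rule stops at the first good base it meets from the end, so a low-quality stretch in the middle of the read is left untouched.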
Why do we need to trim adapters?
How do we trim adapters?
Trim adapters with trimmomatic:
trimmomatic SE SRR519926_1.fastq output.fq ILLUMINACLIP:adapter.fa:2:30:5
Trimming adapter sequences - is it necessary?
Removing adapter sequences, a process called read trimming or clipping, is one of the first steps in analyzing NGS data. With more than 30 published adapter trimming tools, there is ample choice. Yet it is debated whether this step is really as important as the number of tools suggests, or whether this time-consuming step can be skipped for many NGS applications.
Why do adapters contaminate my sequences?
Adapters have to be ligated to every single DNA molecule during library preparation. For Illumina short read sequencing, the corresponding protocols involve (in most cases) a DNA fragmentation step, followed by the ligation of certain oligonucleotides to the 5’ and 3’ ends. These 5’ and 3’ adapter sequences have important functions in Illumina sequencing, since they hold barcoding sequences, forward/reverse primers (for paired-end sequencing) and the important binding sequences for immobilizing the fragments to the flowcell and allowing bridge-amplification.
When are adapter sequences observed in the reads?
In common short read sequencing, the DNA insert (original molecule to be sequenced) is downstream from the read primer, meaning that the 5’ adapters will not appear in the sequenced read. But, if the fragment is shorter than the number of bases sequenced, one will sequence into the 3’ adapter. To make it clear: In Illumina sequencing, adapter sequences will only occur at the 3’ end of the read and only if the DNA insert is shorter than the number of sequencing cycles (see picture below)!
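The read-through described above can be mimicked with a small Python sketch. Everything here is illustrative: the function names are hypothetical, the adapter is the 13-base prefix commonly shared by Illumina adapters and is used only as a stand-in, and matching is exact, whereas real trimmers tolerate sequencing errors.

```python
# Assumed adapter for illustration: the 13-base prefix commonly
# shared by Illumina adapters.
ADAPTER = "AGATCGGAAGAGC"


def simulate_read(insert, adapter, cycles):
    """Bases read in `cycles` cycles: the insert first, then the 3' adapter."""
    return (insert + adapter)[:cycles]


def clip_3prime_adapter(read, adapter, min_overlap=3):
    """Remove the adapter if it occurs in full, or as a partial
    prefix at the very 3' end of the read (exact matching only)."""
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Partial adapter: longest adapter prefix that matches the read's suffix.
    longest = min(len(adapter), len(read)) - 1
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read


if __name__ == "__main__":
    short_insert = "ACGT" * 6   # 24 nt, shorter than 30 cycles
    long_insert = "ACGT" * 10   # 40 nt, longer than 30 cycles
    for insert in (short_insert, long_insert):
        read = simulate_read(insert, ADAPTER, cycles=30)
        print(read, "->", clip_3prime_adapter(read, ADAPTER))
```

With the 24-nucleotide insert, the last six cycles read adapter bases, which the clipping step removes again; with the 40-nucleotide insert the read never reaches the adapter and is returned unchanged.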
How often that happens largely depends on the NGS protocol used. Think about it: how often will you sequence into the 3’ adapters when performing common RNA-Seq? After mRNA enrichment, cDNA synthesis (using a reverse transcriptase) and DNA fragmentation, the protocols typically involve a size selection. When using a MiSeq in 2x300 paired-end mode, one will select molecules that are longer than the read length, in our example greater than 600 nucleotides in length. However, it is technically impossible to obtain one specific fragment size; one rather gets a distribution of fragment lengths (see picture). Thus, even when selecting for large fragment sizes, a certain fraction of reads will still contain adapter sequence. For RNA-Seq you will observe that only 0.2 - 2% of reads contain adapter sequences.
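To get a feeling for how the fragment length distribution drives adapter contamination, one can estimate the fraction of inserts that are shorter than the read length. The sketch below assumes, purely for illustration, that insert sizes follow a normal distribution; the function name and all numbers are hypothetical, and real libraries are often skewed, so treat the output as a rough guide only.

```python
from statistics import NormalDist


def fraction_read_through(mean_insert, sd_insert, read_length):
    """Fraction of inserts shorter than the read length, i.e. reads
    expected to run into the 3' adapter, under a normal model
    of insert sizes (an assumption of this sketch)."""
    return NormalDist(mu=mean_insert, sigma=sd_insert).cdf(read_length)


if __name__ == "__main__":
    # Hypothetical libraries: same spread, different mean insert size.
    for mean in (200, 300, 400):
        frac = fraction_read_through(mean, sd_insert=50, read_length=150)
        print(f"mean insert {mean} nt: {frac:.4%} of reads reach the adapter")
```

The larger the mean insert size relative to the read length, the smaller the expected share of reads containing adapter sequence.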
Summary
Adapter contamination leads to NGS alignment errors and an increased number of unaligned reads, since the adapter sequences are synthetic and do not occur in the genomic sequence. There are applications (e.g. small RNA sequencing) where adapter trimming is essential: with a fragment size of around 24 nucleotides, one will definitely sequence into the 3’ adapter. But there are also applications (transcriptome sequencing, whole genome sequencing, etc.) where adapter contamination can be expected to be so small (due to an appropriate size selection) that one could consider skipping adapter removal and thereby save time and effort.