序列比对之Biostrings包
生活随笔
收集整理的這篇文章主要介紹了
序列比对之Biostrings包
小編覺得挺不錯的,現在分享給大家,幫大家做個參考.
基本概念
Biostrings包很重要的3個功能是進行Pairwise sequence alignment 和Multiple sequence alignment及 Pattern finding in a sequence
序列比對一般有2個過程:
1)構建計分矩陣公式(the scoring matrix formulation)
2)比對(alignment itself)
global alignment methods (全局比對):align every residue in the sequences ,例如Needleman-Wunsch algorithm.
local alignment technique(局部比對):align regions of high similarity in the sequences,例如Smith-Waterman algorithm
安裝
if("Biostrings" %in% rownames(installed.packages()) == FALSE) {source("http://bioconductor.org/biocLite.R");biocLite("Biostrings")}
suppressMessages(library(Biostrings))
ls('package:Biostrings')
----------------Pairwise sequence alignment---------------
步驟:首先構建罰分規則,然后按照規則進行比對。用pairwiseAlignment()函數
舉例1:核酸序列
(myScoringMat <- nucleotideSubstitutionMatrix(match = 1, mismatch = -1, baseOnly = TRUE))#構建罰分規則
gapOpen <- 2 #gap分為2
gapExtend <- 1 #延伸gap分為1
sequence1 <- "GAATTCGGCTA" #序列1
sequence2 <- "GATTACCTA" #序列2
myAlignment <- pairwiseAlignment(sequence1, sequence2,
substitutionMatrix = myScoringMat, gapOpening = gapOpen,
gapExtension = gapExtend, type="global", scoreOnly = FALSE) #進行比對
myAlignment
舉例2:對蛋白序列進行比對
蛋白比對會更復雜,因此模型更多,
data(package="Biostrings") #查看所有數據集 data(BLOSUM62) #這里選擇BLOSUM62數據 subMat <- "BLOSUM62" #賦值 gapOpen <- 2 gapExtend <- 1 sequence1 <- "PAWHEAE" sequence2 <- "HEAGAWGHE"
myAlignProt <- pairwiseAlignment(sequence1, sequence2, substitutionMatrix = subMat, gapOpening = gapOpen, gapExtension = gapExtend, type="global", scoreOnly = FALSE) #全局比對
myAlignProt2 <- pairwiseAlignment(sequence1, sequence2, substitutionMatrix = subMat, gapOpening = gapOpen, gapExtension = gapExtend, type="local",scoreOnly = FALSE) ##局部比對
可以看到局部比對返回的是,高度相似的序列部分.
3)可視化,對于序列可以用最經典的對角線來可視化(以人和黑猩猩的hemoglobin beta為例)
library(seqinr) # 為了讀取fasta序列
myseq <- read.fasta(file = "F:/R/Bioconductor/biostrings/prtein_example_seq.fas")
dotPlot(myseq[[1]], myseq[[2]], col=c("white", "red"), xlab="Human", ylab="Chimpanzee")
##########Multiple sequence alignment############
一般多序列比對可以用于進化分析
install.packages("muscle") #需要安裝該包,因為該包在我的版本上沒法安裝,所以這里就不講了
library(muscle)
######Phylogenetic analysis and tree plotting########
這里先不做分析
#########blast格式的解析######
install.packages("RFLPtools",dependencies=TRUE)
library(RFLPtools)
data(BLASTdata) #先查看數據集了解一下相關數據格式情況
head(BLASTdata)
colnames(BLASTdata)
DIR <- system.file("extdata", package = "RFLPtools") #用自帶數據集
MyFile <- file.path(DIR, "BLASTexample.txt")
MyBLAST <- read.blast(file = MyFile)
mySimMat <- simMatrix(MyBLAST) #可以根據blast結果用來生成相似性矩陣,太厲害了
#########Pattern finding in a sequence######
library(Biostrings)
mynucleotide <- DNAString("aacataatgcagtagaacccatgagccc")
matchPattern(DNAString("ATG"), mynucleotide) #示例1
matchPattern("TAA", mynucleotide) #示例2
##以下函數可以用來尋找orf(需要修改)
myCodonFinder <- function(sequence){
startCodon = DNAString("ATG") # 指定起始密碼子
stopCodons = list("TAA", "TAG", "TGA") # 指定終止密碼子
codonPosition = list() #initialize the output to be returned as a list
codonPosition$Start = matchPattern(startCodon, sequence) # search start codons
x=list()
for(i in 1:3){ # iterate over all stop codons
x[[i]]= matchPattern(DNAString(stopCodons[[i]]), sequence)
codonPosition$Stop=x
}
return(codonPosition) # returns results
}
StartStops <- myCodonFinder(mynucleotide)
總結
以上是生活随笔為你收集整理的序列比对之Biostrings包的全部內容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 负载均衡篇 不同层次的负载均衡(2/3/
- 下一篇: WKE——Webkit精简的纯C接口的浏