搜索
查看: 381|回复: 0

[software] Biostrings

[复制链接]

29

主题

29

帖子

149

积分

管理员

Rank: 9Rank: 9Rank: 9

积分
149
发表于 2018-9-18 11:27:23 | 显示全部楼层 |阅读模式
基本概念
Biostrings包很重要的3个功能是进行Pairwise sequence alignment 和Multiple sequence alignment和Pattern finding in a sequence
序列比对一般有2个过程:
1)构建计分矩阵公式(the scoring matrix formulation)
2)比对(alignment itself)
global alignment methods (全局比对):align every  residue in the sequences ,例如Needleman-Wunsch algorithm.
local alignment technique(局部比对): align regions of high similarity in the sequences,例如Smith-Waterman algorithm
安装

[Python] 纯文本查看 复制代码
if("Biostrings" %in% rownames(installed.packages()) == FALSE) {source("http://bioconductor.org/biocLite.R");biocLite("Biostrings")}
suppressMessages(library(Biostrings))
ls('package:Biostrings')


----------------Pairwise sequence alignment---------------
步骤:首先构建罚分规则,然后按照规则进行比对。用pairwiseAlignment()函数
举例1:核酸序列
[Python] 纯文本查看 复制代码
(myScoringMat <- nucleotideSubstitutionMatrix(match = 1, mismatch = -1, baseOnly = TRUE))#构建罚分规则
gapOpen <- 2             #gap分为2
gapExtend <- 1           #延伸gap分为1
sequence1 <- "GAATTCGGCTA"  #序列1
sequence2 <- "GATTACCTA"    #序列2
myAlignment <- pairwiseAlignment(sequence1, sequence2, 
                                 substitutionMatrix = myScoringMat, gapOpening = gapOpen,
                                 gapExtension = gapExtend, type="global", scoreOnly = FALSE)  #进行比对
myAlignment

举例2:对蛋白序列进行比对,蛋白比对会更复杂,因此模型更多
[Python] 纯文本查看 复制代码
data(package="Biostrings")  #查看所有数据集
data(BLOSUM62)               #这里选择BLOSUM62数据
subMat <- "BLOSUM62"         #赋值
gapOpen <- 2
gapExtend <- 1
sequence1 <- "PAWHEAE"
sequence2 <- "HEAGAWGHE"
myAlignProt <- pairwiseAlignment(sequence1, sequence2, 
                                 substitutionMatrix = subMat, gapOpening = gapOpen, gapExtension = 
                                   gapExtend, type="global", scoreOnly = FALSE)  #全局比对
myAlignProt2 <- pairwiseAlignment(sequence1, sequence2, 
                                 substitutionMatrix = subMat, gapOpening = gapOpen, gapExtension = 
                                   gapExtend, type="local",scoreOnly = FALSE)  ##局部比对

3)可视化,对于序列可以用最经典的对角线来可视化(以人和黑猩猩的hemoglobin beta为例)
[Python] 纯文本查看 复制代码
library(seqinr) # 为了读取fasta序列
myseq <- read.fasta(file = "F:/R/Bioconductor/biostrings/prtein_example_seq.fas")
dotPlot(myseq[[1]], myseq[[2]], col=c("white", "red"), xlab="Human", ylab="Chimpanzee")


##########Multiple sequence alignment##############
因为我R版本不支持musle包,这里就不演示了
#########Pattern finding in a sequence######
[Python] 纯文本查看 复制代码
library(Biostrings)
mynucleotide <- DNAString("aacataatgcagtagaacccatgagccc")
matchPattern(DNAString("ATG"), mynucleotide)  #示例1
matchPattern("TAA", mynucleotide)       #示例2
##寻找起始终止密码子
myCodonFinder <- function(sequence){
  startCodon = DNAString("ATG") # 指定起始密码子
  stopCodons = list("TAA", "TAG", "TGA") # 指定终止密码子
  codonPosition = list() #initialize the output to be returned as a list
  codonPosition$Start = matchPattern(startCodon, sequence) # search start codons
  x=list()
  for(i in 1:3){ # iterate over all stop codons
    x[[i]]= matchPattern(DNAString(stopCodons[[i]]), sequence)
    codonPosition$Stop=x
  }
  return(codonPosition) # returns results
}
StartStops <- myCodonFinder(mynucleotide)

通过以上,大家对biostrings处理全局、局部比对、模式收索应该有所了解了吧。








本帖子中包含更多资源

您需要 登录 才可以下载或查看,没有帐号?立即注册

x
回复

使用道具 举报

您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

QQ|手机版|小黑屋|生信技能树    

GMT+8, 2019-2-21 22:49 , Processed in 0.031194 second(s), 25 queries .

Powered by Discuz! X3.2

© 2001-2013 Comsenz Inc.