Downloading all viral reference genomes from NCBI

Posted 2016-9-14 18:21:12
Last edited by Panda姐 on 2016-9-14 18:39
I found the first three methods below by searching Biostars. They are for reference only; additions are welcome.

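Method 1: https://www.biostars.org/p/13646/
This is an old post from five years ago. It says to run a keyword search in the NCBI search box, then use Send to to save all the records in whatever format you need.
I have only used this approach to download the sequences of a single gene, and I suspect it can miss some data.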


Method 2: https://www.biostars.org/p/98218/
Two and a half years later, the same person answered a similar question, saying to download all the gbvrl*.seq.gz files from the ftp://ftp.ncbi.nih.gov/genbank/ directory.
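
A minimal sketch of Method 2, assuming wget is installed and that the server still accepts anonymous FTP with wildcard matching. The function name fetch_gbvrl, the default output directory, and the WGET override are illustrative names, not something taken from the Biostars answer:

```shell
# Directory named in the Biostars answer.
GENBANK_URL="ftp://ftp.ncbi.nih.gov/genbank"

# Download every GenBank viral division file (gbvrl*.seq.gz) into a directory.
# Set WGET=echo to print the arguments instead of downloading (dry run).
fetch_gbvrl() {
    gbvrl_outdir="${1:-genbank_viral}"
    mkdir -p "$gbvrl_outdir"
    # The URL is quoted so the local shell does not expand the wildcard;
    # wget itself matches the pattern against the FTP directory listing.
    # -c resumes partial files, -P sets the download directory.
    ${WGET:-wget} -c -P "$gbvrl_outdir" "$GENBANK_URL/gbvrl*.seq.gz"
}

# Example:
#   fetch_gbvrl genbank_viral
```

The files are GenBank flat files, so they still need to be converted if you want FASTA.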



Method 3: https://www.biostars.org/p/198783/

ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/
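
Once the files from this directory are on disk, they still have to be unpacked and merged. A small sketch, assuming the nucleotide files follow the viral.*.genomic.fna.gz naming pattern (that is how I remember the listing, so check it first). merge_fasta_gz is a hypothetical helper name:

```shell
# Concatenate gzip-compressed FASTA files into one uncompressed FASTA file
# and print how many sequence records it contains.
# Usage: merge_fasta_gz OUTPUT.fna INPUT1.gz [INPUT2.gz ...]
merge_fasta_gz() {
    merged_out="$1"
    shift
    # gzip -cd writes the decompressed data to stdout and keeps the .gz files.
    gzip -cd "$@" > "$merged_out"
    # Every FASTA record starts with a ">" header line.
    grep -c '^>' "$merged_out"
}

# Example (after downloading the files into the current directory):
#   merge_fasta_gz viral.genomic.fna viral.*.genomic.fna.gz
```

The printed count is a quick sanity check that nothing was truncated during the download.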


Method 4: What I ended up downloading were the files whose names start with "all" in this directory.

ftp://ftp.ncbi.nih.gov/genomes/Viruses/

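Method 5: Download the assembly_summary.txt file, filter out the FTP paths of the genomes you want, then batch-download them with wget. I tried this to download all bacterial reference genomes, but the download never succeeded; it kept getting stuck at the anonymous login step.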

http://www.ncbi.nlm.nih.gov/genome/viruses/about/
ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/viral/
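
A rough sketch of the filtering step in Method 5. It assumes the assembly_summary.txt layout as I know it (column 11 is version_status, column 20 is ftp_path; check the header line of your copy) and that each assembly directory holds a file named <directory name>_genomic.fna.gz. Rewriting ftp:// to https:// is my own workaround for the anonymous login hang and assumes the server exposes the same paths over HTTPS. genome_urls is a hypothetical helper name:

```shell
# Print one download URL per "latest" assembly listed in an
# assembly_summary.txt file (tab-separated, comment lines start with "#").
# Assumed columns: 11 = version_status, 20 = ftp_path.
genome_urls() {
    awk -F '\t' '
        /^#/ { next }                        # skip the header/comment lines
        $11 == "latest" && $20 != "na" {
            url = $20
            sub(/^ftp:/, "https:", url)      # assumption: same path is served over HTTPS
            n = split(url, parts, "/")       # last path component is the assembly directory name
            print url "/" parts[n] "_genomic.fna.gz"
        }' "$1"
}

# Example:
#   wget -c https://ftp.ncbi.nlm.nih.gov/genomes/refseq/viral/assembly_summary.txt
#   genome_urls assembly_summary.txt > genome_urls.txt
#   wget -c -i genome_urls.txt
```

To narrow the selection, add more conditions to the awk pattern, for example a test on the assembly_level column.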




As for how the files in the two directories from Methods 4 and 5 differ, I found a post on Biostars:
Dear Colleague,

is a legacy (the old genome site) directory that will eventually be retired (archived and no longer updated).
This one:
is a new directory from NCBI FTP genome site reorganization. Here is the information about the new site:
To access current and actively updated genome assembly data, use the following three directories on the NCBI Genomes FTP site: genbank, refseq, and all.
genbank is a directory of primary genome assembly data and contains assembled genome sequences and associated annotations (if available) that sequencing centers or individual investigators submitted to GenBank or to another member of the International Nucleotide Sequence Database Collaboration (INSDC). You should use this directory if you are interested in obtaining all submitted genome assemblies and your main focus is not accessing genome annotation. The directory is organized by taxonomic groups and you will be able to browse it directly.
refseq is a directory of NCBI-derived genome assembly data containing assembled genomes that NCBI RefSeq staff selected from the primary INSDC data. You should use the refseq directory if you are interested in annotation data that are of high quality and regularly maintained. The sequences of a RefSeq genomic assembly are a copy of those present in the corresponding INSDC assembly. In some cases the copy may not be completely identical as the RefSeq staff may (1) remove smaller pieces (known as contigs) of a sequence or reported contaminants or (2) add non-nuclear genome sequences (for example, mitochondrion) to the assembly. To find primary GenBank (INSDC) assemblies used to create the RefSeq assemblies, use the assembly reports files. All RefSeq genome assemblies have annotations that RefSeq staff either propagated from the primary records or provided through NCBI prokaryotic or eukaryotic genome annotation pipelines. The number of genomic assemblies present in the refseq directory is smaller than that in the genbank directory. The directory is organized by taxonomic groups and you will be able to browse it directly.
all is a directory that combines the contents of the genbank and refseq directories. Each individual assembly data file is contained in an individual sub-directory. The all directory holds many thousands of sub-directories and you should only access it as a path to a known assembly.
Many of the sub-directories are for old versions of assemblies; these are archival and the RefSeq staff will not update them with new data or data in new file formats.
All other directories on the NCBI Genomes FTP site are legacy directories and we will be sequentially archiving them. If you are using any of these directories, pay attention to their update dates to assure that you are obtaining current data. If you find a directory missing, check if it has already been moved into the archive directory, which you will also find on the Genomes FTP site. Read more about the FTP genomes site structure and learn details on the site reorganization, content, file formats, downloading instructions, and future plans.
Pasted from: https://www.biostars.org/p/192391/





Posted 2016-9-14 18:27:05
Hey, junior labmate!~
Your question is quite complicated and calls for a tip. Please click http://www.bio-info-trainee.com/donate to leave one, thanks.
Original poster | Posted 2016-9-14 18:28:36

What's up, senior?
Posted 2016-9-14 18:55:29

It's just the two of us here!
Original poster | Posted 2016-9-14 20:14:12

Well, tomorrow is a holiday.
Posted 2016-9-14 21:17:55

In that case, happy Mid-Autumn Festival!
May all your wishes come true~~~
Posted 2017-2-10 17:36:21
Thanks for sharing, nice one. - from 生信菜鸟
Posted 2018-5-2 14:07:17
Last edited by alienzj on 2018-5-2 14:59

There is a nice tool for this:
ncbi-genome-download
For example, to download all viral sequences in the RefSeq database (FASTA format):
ncbi-genome-download --format fasta viral
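Original poster | Posted 2018-6-19 15:28:00
alienzj wrote on 2018-5-2 14:07:
There is a nice tool for this:
ncbi-genome-download
For example, to download all viral sequences in the RefSeq database (FASTA format):

Mm, thanks a lot. Still, it's better to understand how it works, hehe.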
Posted 2018-8-8 22:29:04
Nice post!!!