搜索
楼主: Jimmy

生信编程直播第0题-生信编程很简单!

[复制链接]

10

主题

52

帖子

559

积分

版主

Rank: 7Rank: 7Rank: 7

积分
559
QQ
发表于 2017-4-10 22:32:23 | 显示全部楼层
本帖最后由 旭日早升 于 2017-4-12 15:37 编辑

201-python-无名

Dongye老师又来讲python了,感觉每次老师讲的都是编程思想风暴啊,上次的讲的“从1学python”给我这个初学者提供了很多编程思考,这次也是直接从实例出发给我们讲解了编程的高级思维,我感觉这种思维对于生信人员同样重要,尤其是入门者,当你还是一块白板,你可以从开始就建立好的编程思维方式和习惯。
老师的讲解从抛出几个问题出发:1、为什么40行代码变成了200行;2、你上个月写的代码在哪里;3、别人的工具为什么那么好用;4、我的代码为什么只能用一次?可以说这几个问题我都存在,基本上每次遇到问题都要重新编程,代码的可移植性和重复利用都很差。相信对于很多人也是一样。通过前面几道题的训练,对于基本的编程我们应该还可以应对了,但是这种好的思维和习惯就不是那么容易建立的,老师至少从经验上给我们讲解了需要注意的一些问题啊。通过前后两个版本的代码的演示,我收获很大。其实第一版的代码相较于我自己写的习惯已经算是精炼了,但是第二版给我的震撼更大,第二版通过参数、函数的使用等等,直接将代码变成了处理fastq的工具包,之后不管是调用这个工具还是在这个工具的基础上添加新的功能都十分方便啊。通过学习我知道在编程能力和技巧上都十分欠缺,需要努力补课。
本次课学到的知识点:1、使用optparse处理参数(重要,需要再深入学习);2、对于python中and,or的理解。3、默认参数,异常参数的处理等细节问题。
[Python] 纯文本查看 复制代码
# Description: Cainiao_0_1_v1, fastq file, [url=http://www.biotrainee.com/thread-834-1-1.html]http://www.biotrainee.com/thread-834-1-1.html[/url]
# Data: 2017-04-09
 

import os,sys,re
from collections import OrderedDict

args = sys.argv
filename = args[1]
cut_5 = int(args[2])
cut_3 = int(args[3])

os.chdir("./")

##cut 5' 3'
with open(filename, 'rt') as f:
        while True:
                line1 = f.readline().rstrip()
                if not (line1 and f):
                        break 
                line2 = f.readline().rstrip()
                line3 = f.readline().rstrip()
                line4 = f.readline().rstrip()

                print(line1)
                print(line2[cut_5:][:-cut_3])
                print(line3)
                print(line4[cut_5:][:-cut_3])

##length
length_dic = OrderedDict()

with open(filename, 'rt') as f:
        while True:
                line1 = f.readline().rstrip()
                if not (line1 and f):
                        break 
                line2 = f.readline().rstrip()
                line3 = f.readline().rstrip()
                line4 = f.readline().rstrip()

                length = len(line2)
                if length not in length_dic:
                        length_dic[length] = 0
                length_dic[length] += 1

for length, num in length_dic.items():
        print("%s\t%s" % (length, num))

##fastq2fasta
with open(filename, 'rt') as f:
        while True:
                line1 = f.readline().rstrip()
                if not (line1 and f):
                        break 
                line2 = f.readline().rstrip()
                line3 = f.readline().rstrip()
                line4 = f.readline().rstrip()

                print(">"+line1)
                print(line2)
##GC%
NUM = 0
GC = 0
with open(filename, 'rt') as f:
        while True:
                line1 = f.readline().rstrip()
                if not (line1 and f):
                        break 
                line2 = f.readline().rstrip().upper()
                line3 = f.readline().rstrip()
                line4 = f.readline().rstrip()

                NUM += len(line2)
                GC += line2.count("G") + line2.count("C")
print(NUM)
print(GC*1.0/NUM) 

# Description: Cainiao_0_1_v2, fastq tools from teacher Dongye

from optparse import OptionParser
from collections import OrderedDict
import sys

args = sys.argv

def cut_fastq(filename, cut_5=0, cut_3=0):
        """
        from 5',3' cut bases
        :param filename: fastq filename
        :param cut_5: number cut 5', >0
        :param cut_3: number cut 3', >0
        :return:
        """
        if cut_3 < 0:
                raise ValueError('cut_3: %s < 0' % cut_3)
        if cut_5 < 0:
                raise ValueError('cut_5: %s < 0' % cut_5)
        with open(filename) as if_fq:
                while if_fq:
                        line1 = if_fq.readline().rstrip()
                        if not (line1 and if_fq):
                                break 
                        line2 = if_fq.readline().rstrip()
                        line3 = if_fq.readline().rstrip()
                        line4 = if_fq.readline().rstrip()

                        print(line1)
                        print(cut_3 and line2[cut_5:][:-cut_3] or line2[cut_5:]) #very useful, if cut_3 and cut_5 is 0, the total read will return
                        print(line3)
                        print(cut_3 and line4[cut_5:][:-cut_3] or line4[cut_5:])


def length_distribution(filename):
        """
        reads length statistic

        :param filename: fastq filename
        :return: 
        """
        reads_len = OrderedDict()
        with open(filename, 'rt') as if_fq:
                while if_fq:
                        line1 = if_fq.readline().rstrip()
                        if not (line1 and if_fq):
                                break 
                        line2 = if_fq.readline().rstrip()
                        line3 = if_fq.readline().rstrip()
                        line4 = if_fq.readline().rstrip()

                        length = len(line2)
                        if length not in reads_len:
                                reads_len[length] = 0
                        reads_len[length] += 1

        for length, num in reads_len.items():
                print("%s\t%s" % (length, num))
        return reads_len


def fq_2_fa(filename):
        """
        tranform fastq to fasta

        :param filename: fastq file
        :return:
        """
        with open(filename, 'rt') as if_fq:
                while if_fq:
                        line1 = if_fq.readline().rstrip()
                        if not (line1 and if_fq):
                                break 
                        line2 = if_fq.readline().rstrip()
                        line3 = if_fq.readline().rstrip()
                        line4 = if_fq.readline().rstrip()

                        print(">"+line1)
                        print(line2)


def fq_length(filename):
        """
        total read length 

        :param filename: fastq file
        :return:
        """
        num = 0
        with open(filename, 'rt') as if_fq: 
                while if_fq:
                        line1 = if_fq.readline().rstrip()
                        if not (line1 and if_fq):
                                break 
                        line2 = if_fq.readline().rstrip().upper()
                        line3 = if_fq.readline().rstrip()
                        line4 = if_fq.readline().rstrip()

                        num += len(line2)
        print(num)
        return num


def count_gc(filename):
        """
        GC proportion  

        :param filename: fastq file
        :return:
        """
        GC = 0
        with open(filename, 'rt') as if_fq: 
                while if_fq:
                        line1 = if_fq.readline().rstrip()
                        if not (line1 and if_fq):
                                break 
                        line2 = if_fq.readline().rstrip().upper()
                        line3 = if_fq.readline().rstrip()
                        line4 = if_fq.readline().rstrip()
                        GC += line2.count("G") + line2.count("C")

        # use function
        SUM = fq_length(filename)

        print("GC: %s" % (GC * 1.0 / SUM))
        return GC * 1.0 / SUM 

def main(args):
        parser = OptionParser()
        parser.add_option("-f", "--fastq", dest="filename", help="fastq filename", metavar="FILE")
        parser.add_option("--cut", "--cut-fastq", dest="cut", help="cut fastq file", action="store_true", default=False)
        parser.add_option("-5", "--cut-5", dest="cut_5", type=int, help="cut fastq N(N>0) bases from 5", metavar="INT", default=0)
        parser.add_option("-3", "--cut-3", dest="cut_3", type=int, help="cut fastq N(N>0) bases from 3", metavar="INT", default=0)
        parser.add_option("--len-dis", dest="len_dis", help="sequence length distribution", action="store_true", default=False)
        parser.add_option("--fasta", "--fastq2fasta", dest="fasta", help="convert fastq to fasta", action="store_true", default=False)
        parser.add_option("--len", "--fastq-length", dest="fq_len", help="total reads number", action="store_true", default=False)
        parser.add_option("--gc", "--count-gc", dest="gc", help="print GC%", action="store_true", default=False)

        (options, args) = parser.parse_args()
        if not options.filename:
                parser.print_help()
        filename = options.filename

        if options.cut:
                cut_5 = options.cut_5
                cut_3 = options.cut_3
                cut_fastq(filename, cut_5, cut_3)

        if options.len_dis:
                length_distribution(filename)

        if options.fq_len:
                fq_length(filename)

        if options.fasta:
                fq_2_fa(filename)

        if options.gc:
                count_gc(filename)


if __name__ == '__main__':
        main(args)

最后感谢老师的辛苦准备和讲解,收获很大。


Part2:有了前面的fastq tool的编写,这次fasta文件就相对容易一点了,我自己首先用编写fastq的方式尝试编写了,然后看视频与老师的代码做了对比,主要功能实现了,但是还有些细节需要学习。首先是读入文件输出,我每次都是先整个把文件读入后存入字典,然后再对字典进行遍历和输出,这样一来就会有两次循环,而老师在读入的时候就进行判断和处理,直接输出到文件,比较简便。在一个就是参数的处理,这一次老师引入了可变参数的概念,这样就可以方便的引入更多参数,当然也可以不引入,十分方便。下面把我的代码和老师的都贴上。
[Python] 纯文本查看 复制代码
# Description: Cainiao_0_2_v1, fasta tool, [url]http://www.biotrainee.com/thread-834-1-1.html[/url]
# Data: 2017-04-11
# 

from optparse import OptionParser
from collections import OrderedDict
import sys, re, random 

args = sys.argv

def complement_seq(filename):
	"""
	the sequence complement 
	:param filename: fastq filename
	:return:
	"""
	dic = {"A":"T","T":"A","C":"G","G":"C","a":"t","t":"a","c":"g","g":"c"}
	fa_seq = OrderedDict()
	compl_seq = OrderedDict()
	with open(filename) as if_fa:
		for line in if_fa:
			line = line.rstrip()
			if line.startswith(">"):
				lst = line.split("|")
				name = lst[3]
				if name not in fa_seq:
					fa_seq[name] = ""
				continue
			fa_seq[name] += line
	for name, seq in fa_seq.items():
		compl_seq[name] = ""
		for i in range(len(seq)):
			compl_seq[name] += dic[seq[i]]
		print(">%s\n%s\n" % (name, compl_seq[name]))
	return compl_seq

def reverse_seq(filename):
	"""
	the sequence reverse
	:param filename: fastq filename
	:return:
	"""
	fa_seq = OrderedDict()
	rever_seq = OrderedDict()
	with open(filename) as if_fa:
		for line in if_fa:
			line = line.rstrip()
			if line.startswith(">"):
				lst = line.split("|")
				name = lst[3]
				if name not in fa_seq:
					fa_seq[name] = ""
				continue
			fa_seq[name] += line
	for name, seq in fa_seq.items():
		rever_seq[name] = seq[::-1]
		print(">%s\n%s\n" % (name, rever_seq[name]))
	return rever_seq


def DNA_2_RNA(filename):
	"""
	transform DNA sequence to RNA sequence 
	:param filename: fastq filename
	:return:
	"""
	fa_seq = OrderedDict()
	RNA_seq = OrderedDict()
	with open(filename) as if_fa:
		for line in if_fa:
			line = line.rstrip()
			if line.startswith(">"):
				lst = line.split("|")
				name = lst[3]
				if name not in fa_seq:
					fa_seq[name] = ""
				continue
			fa_seq[name] += line
	for name, seq in fa_seq.items():
		RNA_seq[name] = seq.replace("T", "U").replace("t", "u")
		print(">%s\n%s\n" % (name, RNA_seq[name]))
	return RNA_seq


def upper_or_lower(filename, upper, lower):
	"""
	out upper or lower sequence
	:param filename: fastq filename
	:param upper: to upper case
	:param lower: to lower case 
	:return:
	"""
	fa_seq = OrderedDict()
	out_seq = OrderedDict()
	with open(filename) as if_fa:
		for line in if_fa:
			line = line.rstrip()
			if line.startswith(">"):
				lst = line.split("|")
				name = lst[3]
				if name not in fa_seq:
					fa_seq[name] = ""
				continue
			fa_seq[name] += line
	for name, seq in fa_seq.items():
		if upper:
			out_seq[name] = seq.upper()
		elif lower:
			out_seq[name] = seq.lower()
		else:
			out_seq[name] = seq
		print(">%s\n%s\n" % (name, out_seq[name]))
	return out_seq


def retrieve(filename, ID):
	"""
	retrieve sequence
	:param filename: fastq filename
	:param ID: sequence ID
	:return:
	"""
	fa_seq = OrderedDict()
	retrieve_seq = OrderedDict()
	with open(filename) as if_fa:
		for line in if_fa:
			line = line.rstrip()
			if line.startswith(">"):
				lst = line.split("|")
				name = lst[3]
				if name not in fa_seq:
					fa_seq[name] = ""
				continue
			fa_seq[name] += line
	for name, seq in fa_seq.items():
		if name == ID:
			retrieve_seq[name] = seq
			print(">%s\n%s\n" % (name, retrieve_seq[name]))
	return retrieve_seq


def main(args):
	parser = OptionParser()
	parser.add_option("-f", "--fasta", dest="filename", help="fasta filename", metavar="FILE")
	parser.add_option("--complement", dest="complement", help="complement sequence", action="store_true", default=False)
	parser.add_option("--reverse", dest="reverse", help="reverse sequence", action="store_true", default=False)
	parser.add_option("--RNA", "--DNA2RNA", dest="RNA", help="convert DNA to RNA", action="store_true", default=False)
	parser.add_option("--upper", dest="upper", help="transform to upper case", action="store_true", default=False)
	parser.add_option("--lower", dest="lower", help="transform to lower case", action="store_true", default=False)
	parser.add_option("--retrieve", "--extract", dest="retrieve", help="retrieve sequence by ID", action="store_true", default=False)
	parser.add_option("--ID", "--name", dest="ID", help="the sequence ID to retrieve", metavar="ID")


	(options, args) = parser.parse_args()
	if not options.filename:
		parser.print_help()
	filename = options.filename

	if options.complement:
		complement_seq(filename)

	if options.reverse:
		reverse_seq(filename)

	if options.RNA:
		DNA_2_RNA(filename)

	if options.upper:
		upper = options.upper 
		lower = options.lower
		upper_or_lower(filename, upper, lower)
	if options.lower:
		upper = options.upper 
		lower = options.lower
		upper_or_lower(filename, upper, lower)
	if options.retrieve:
		ID = options.ID 
		retrieve(filename, ID)

if __name__ == '__main__':
	main(args)

# Description: Cainiao_0_2_v2, fasta tool from teacher Dongye
# Data: 2017-04-12
# 

from optparse import OptionParser
import sys

args = sys.argv

def complement_seq(filename, out_filename):
	"""
	output the complementary sequence 

	:param filename: fastq filename
	:param out_filename: output filename
	:return:
	"""
	complement = {"A":"T","T":"A","C":"G","G":"C","a":"t","t":"a","c":"g","g":"c"}
	with open(filename) as F_in:
		with open(out_filename, 'w') as F_out:
			for line in F_in:
				line = line.rstrip()
				if not line.startswith(">"):
					seq = ""
					for i in line:
						seq += complement[i]
				else:
					seq = line 
				F_out.write("%s\n" % (seq))

def reverse_seq(filename, out_filename):
	"""
	the sequence reverse

	:param filename: fastq filename
	:param out_filename: output filename 
	:return:
	"""
	print("reverse_seq")
	with open(filename) as F_in:
		with open(out_filename, 'w') as F_out:
			seq = ""
			for line in F_in:
				line = line.rstrip()
				if not line.startswith(">"):
					seq += line 
				else:
					if seq:
						seq = seq[::-1]
						F_out.write("%s\n" % (seq))
					F_out.write("%s\n" % (line))
					seq = ""
			if seq:
				seq = seq[::-1]
				F_out.write("%s\n" % (seq))


def DNA2RNA(filename, out_filename):
	"""
	transform DNA sequence to RNA sequence 

	:param filename: fastq filename
	:param out_filename: output 
	:return:
	"""
	print("DNA2RNA")
	complement = {"A":"A","T":"U","C":"C","G":"G","a":"a","t":"u","c":"c","g":"g"}
	with open(filename) as F_in:
		with open(out_filename, 'w') as F_out:
			for line in F_in:
				line = line.rstrip()
				if not line.startswith(">"):
					seq = ""
					for i in line:
						seq += complement[i] 
				else:
					seq = line
				F_out.write("%s\n" % (seq))


def format_output(filename, out_filename, **kwargs):
	"""
	out upper or lower sequence

	:param filename: fastq filename
	:param out_filename: output
	:param kwargs: variable args 
	:return:
	"""
	print("format_output")
	capital = kwargs["capital"] if "capital" in kwargs else True
	with open(filename) as F_in:
		with open(out_filename, 'w') as F_out:
			for line in F_in:
				line = line.rstrip()
				if not line.startswith(">"):
					if capital:
						seq = line.upper()
					else:
						seq = line.lower()
				else:
					seq = line
				F_out.write("%s\n" % (seq))


def main(args):
	parser = OptionParser()
	parser.add_option("-f", "--fasta", dest="filename", help="fasta filename", metavar="FILE")
	parser.add_option("-o", "--output", dest="out_filename", help="output fasta filename", metavar="FILE")
	parser.add_option("--com", "--complement_seq", dest="complement", help="complement fasta sequence", action="store_true", default=False)
	parser.add_option("--rev", "--reverse_seq", dest="reverse", help="reverse fasta sequence", action="store_true", default=False)
	parser.add_option("--rna", "--dna2rna", dest="rna", help="convert DNA to RNA", action="store_true", default=False)
	parser.add_option("--upper", dest="format_upper", help="output upper fasta sequence", action="store_true", default=False)
	parser.add_option("--lower", dest="format_lower", help="output lower fasta sequence", action="store_true", default=False)

	(options, args) = parser.parse_args()
	if not options.filename:
		parser.print_help()
	filename = options.filename
	out_filename = options.out_filename
	if options.complement:
		complement_seq(filename, out_filename)
	elif options.reverse:
		reverse_seq(filename, out_filename)
	elif options.rna:
		DNA2RNA(filename, out_filename)
	if options.format_upper:
		format_output(filename, out_filename, capital=True)
	elif options.format_lower:
		format_output(filename, out_filename, capital=False)


if __name__ == '__main__':
	main(args)



回复 支持 反对

使用道具 举报

0

主题

5

帖子

89

积分

注册会员

Rank: 2

积分
89
发表于 2017-4-13 00:19:14 | 显示全部楼层
本帖最后由 lingboling002 于 2017-4-13 00:25 编辑

生信第3次作业,通过小甲鱼视频的入门学习,选择用python答题。
[Python] 纯文本查看 复制代码
import sys

args = sys.argv

filename = args[1]
cut_5 = int(args[2])
cut_3 = int(args[3])

with open(filename) as FIN:
    while True:
        line_1 = FIN.readline().strip('\n')
        if not (line_1 and FIN):
            break
        line_2 = FIN.readline().strip('\n')
        line_3 = FIN.readline().strip('\n')
        line_4 = FIN.readline().strip('\n')

        print(line_1)
        print(line_2[cut_5:][:-cut_3])
        print(line_3)
        print(line_4[cut_5:][:-cut_3])

本帖子中包含更多资源

您需要 登录 才可以下载或查看,没有帐号?立即注册

x
回复 支持 反对

使用道具 举报

0

主题

5

帖子

89

积分

注册会员

Rank: 2

积分
89
发表于 2017-4-13 00:46:02 | 显示全部楼层
本帖最后由 lingboling002 于 2017-4-13 00:48 编辑
lingboling002 发表于 2017-4-13 00:19
生信第3次作业,通过小甲鱼视频的入门学习,选择用python答题。
[Python] 纯文本查看 复制代码
import sys

args ...[/quote][mw_shl_code=python,true]import sys

args = sys.argv

filename = args[1]

dis = {}
with open(filename) as FIN:
    while True:
        line_1 = FIN.readline().strip('\n')
        if not (line_1 and FIN):
            break
        line_2 = FIN.readline().strip('\n')
        line_3 = FIN.readline().strip('\n')
        line_4 = FIN.readline().strip('\n')

        length = len(line_2)
        if not length in dis:
            dis[length] = 0
        dis[length] +=1

for  length, num in dis.items():
    print ("%s\t%s" % (length,num))
序列长度分布统计,结果为 80 353334
回复 支持 反对

使用道具 举报

0

主题

5

帖子

89

积分

注册会员

Rank: 2

积分
89
发表于 2017-4-13 00:57:40 | 显示全部楼层
lingboling002 发表于 2017-4-13 00:19
生信第3次作业,通过小甲鱼视频的入门学习,选择用python答题。
[Python] 纯文本查看 复制代码
import sys

args ...[/quote]

[mw_shl_code=python,true]import sys

args = sys.argv

filename = args[1]

NUM = 0
GC = 0

with open(filename) as FIN:
    while True:
        line_1 = FIN.readline().strip('\n')
        if not (line_1 and FIN):
            break
        line_2 = FIN.readline().strip('\n').upper()
        line_3 = FIN.readline().strip('\n')
        line_4 = FIN.readline().strip('\n')

        NUM += len(line_2)
        GC += line_2.count("G") + line_2.count("C")
print(NUM)
print(GC*1.0/NUM)

结果为总碱基数28266720    GC含量为 0.5049683868521003


回复 支持 反对

使用道具 举报

0

主题

19

帖子

241

积分

中级会员

Rank: 3Rank: 3

积分
241
发表于 2017-4-24 02:21:42 | 显示全部楼层
dongye 发表于 2017-3-18 19:04
生信编程直播第零题第一讲,Fastq文件简单统计处理
by dongye

请问您的代码可以直接要怎么运行呢?我不会用pycharm, 如果用terminal的话,应该怎么跑呢?
回复 支持 反对

使用道具 举报

0

主题

19

帖子

241

积分

中级会员

Rank: 3Rank: 3

积分
241
发表于 2017-4-24 03:33:23 | 显示全部楼层
本帖最后由 朝曦曦 于 2017-4-24 05:29 编辑

学员205 python 求助
[AppleScript] 纯文本查看 复制代码
import os

os.chdir('/Users/qianzhao/Desktop/biotrainee/lesson0/TMP/')

import sys
from optparse import OptionParser
args = sys.argv

def complement_seq(filename, out_filename):
    """
    输出互补序列
    
    :param filename:
    :param out_filename:
    :return:
    """
    complement = {'A': 'T', 'G': 'C', 'C': 'G', 'T': 'A'}
    with open(filename) as F_in:
        with open(out_filename, 'w') as F_out:
            for line in F_in:
                line = line.strip('\n')
                if not line.startswith('>'):
                    seq =''
                    for i in line:
                        seq += complement[i]
                else:
                    seq = line
                F_out.write('%s\n' % (seq))[/i][i]   
    return out_filename
    F_in.closed
complement_seq(HIV_sequence.fasta, outHIV_seq.fasta)



这段代码报错内容为
NameError                                 Traceback (most recent call last)
<ipython-input-12-a1098ff9bf08> in <module>()
     23
     24
---> 25 complement_seq(HIV_sequence.fasta, outHIV_seq.fasta)
     26 print(outHIV_seq.fasta)
     27

NameError: name 'HIV_sequence' is not defined
我实在不知道怎么回事儿,求解答。。

回复 支持 反对

使用道具 举报

11

主题

52

帖子

280

积分

中级会员

Rank: 3Rank: 3

积分
280
发表于 2017-5-7 14:21:40 | 显示全部楼层
朝曦曦 发表于 2017-4-24 03:33
学员205 python 求助
[mw_shl_code=applescript,true]import os

第一个参数应该是一个字符串,你现在这样写表示一个变量,字符串要用单引号或者双引号括起来
回复 支持 反对

使用道具 举报

11

主题

52

帖子

280

积分

中级会员

Rank: 3Rank: 3

积分
280
发表于 2017-5-7 14:22:43 | 显示全部楼层
朝曦曦 发表于 2017-4-24 02:21
请问您的代码可以直接要怎么运行呢?我不会用pycharm, 如果用terminal的话,应该怎么跑呢?
...

python code.py args1 args2
你看我第一期视频,里面是有两种执行方式的
回复 支持 反对

使用道具 举报

0

主题

16

帖子

149

积分

注册会员

Rank: 2

积分
149
发表于 2017-5-13 16:51:39 | 显示全部楼层
本帖最后由 xxnku 于 2017-5-17 21:31 编辑

关于python的内建模块OptionParser感觉好难...第1讲的视频代码,
对Fastq的操作by dongye:
[Python] 纯文本查看 复制代码
"""
简介:fastq文件处理工具
作者:dongye
时间:2017年5月11日19:37
"""

from optparse import OptionParser    #导入OptionParser类
import sys

args = sys.argv

def cut_fastq(filename,cut_5=0,cut_3=0):
    
    if cut_3 < 0:
        raise ValueError ('cut_3:%s < 0' % cut_3)
    if cut_5 < 0:
        raise ValueError ('cut_5:%s < 0' % cut_5)
    
    with open(filename) as if_fq:
        while if_fq:
            line_1 = if_fq.readline().strip('\n')
            if not (if_fq and line_1):
                break
            line_2 = if_fq.readline().strip('\n')
            line_3 = if_fq.readline().strip('\n')
            line_4 = if_fq.readline().strip('\n')

            print (line_1)
            print (cut_3 and line_2[cut_5:][:-cut_3] or line_2[cut_5:])
            print (line_3)
            print (cut_3 and line_4[cut_5:][:-cut_3] or line_4[cut_5:])

def length_distribution(filename):
    
    reads_len = {}
    with open(filename) as if_fq:
        while if_fq:
            line_1 = if_fq.readline().strip('\n')
            if not (if_fq and line_1):
                break
            line_2 = if_fq.readline().strip('\n')
            line_3 = if_fq.readline().strip('\n')
            line_4 = if_fq.readline().strip('\n')

            length = len(line_2)
            if not length in reads_len:
                reads_len[length] = 0
            else:
                reads_len[length] += 1

    for length, num in reads_len.items():
        print ('%s\t%s' %(length,num))
    return reads_len

def fq_2_fa(filename):

    with open(filename) as if_fq:
        while if_fq:
            line_1 = if_fq.readline().strip('\n')
            if not (if_fq and line_1):
                break
            line_2 = if_fq.readline().strip('\n')
            line_3 = if_fq.readline().strip('\n')
            line_4 = if_fq.readline().strip('\n')

            print ('>' + line_1)
            print (line_2)

def fq_length(filename):
    """
    统计fastq文件总的碱基数量
    """

    num = 0
    with open(filename) as if_fq:
        while if_fq:
            line_1 = if_fq.readline().strip('\n')
            if not (if_fq and line_1):
                break
            line_2 = if_fq.readline().strip('\n')
            line_3 = if_fq.readline().strip('\n')
            line_4 = if_fq.readline().strip('\n')

            num += len(line_2)
    print (num)
    return num

def count_gc(filename):
    gc = 0

    with open(filename) as if_fq:
        while if_fq:
            line_1 = if_fq.readline().strip('\n')
            if not (if_fq and line_1):
                break
            line_2 = if_fq.readline().strip('\n').upper()
            line_3 = if_fq.readline().strip('\n')
            line_4 = if_fq.readline().strip('\n')

            gc += line_2.count('G') + line_2.count('C')

    # 函数的好处
    SUM = fq_length(filename)
    print ('GC: %s' %(gc * 1.0 / SUM))
    return gc * 1.0 / SUM

def main(args):

    parser = OptionParser()    #创建一个OptionParser()对象

    parser.add_option("-f","--fastq",dest="filename",help="fastq filename",metavar="FILE")
    #使用add_option来定义命令行参数,--fastq和-f分别实长短参数名
    #dest=:表示此option在经过解析后的options对象中成员的名字
    #metavar=:表示显示到help中option的默认值

    parser.add_option("--cut","--cut-fastq",dest="cut",help="cut fastq file",action="store_true",default=False)
    #action有一组固定的值(store,store_ture,store_false和其他)可以选,action的默认值是store——表示将命令行的参数值保存在options的对象里,如此处的-f,-5和-3。store_true和store_false——表示用于处理命令行参数后面 不带值的情况,如此处的--cut,--len-dis,--fasta,--len和--gc
    #default=:表示此option的默认值

    parser.add_option("-5","--cut-5",dest="cut_5",type=int,help="cut fastq N(N>0) bases from 5",metavar="INT",default=0)
    #type=:表示此option的类型,默认为string

    parser.add_option("-3","--cut-3",dest="cut_3",type=int,help="cut fastq N(N>0) bases from 3",metavar="INT",default=0)

    parser.add_option("--len-dis",dest="len_dis",help="sequence length distribution",action="store_true",default=False)

    parser.add_option("--fasta","--fastq2fastq",dest="fasta",help="conver fastq to fasta",action="store_true",default=False)

    parser.add_option("--len","--fastq-length",dest="fq_len",help="total reads number",action="store_true",default=False)

    parser.add_option("--gc","--count-gc",dest="gc",help="print GC%",action="store_true",default=False)

    (options,args) = parser.parse_args()
    #调用parse_args()来解析程序的命令行,默认处理sys.args[1:]

    if not options.filename:
        parser.print_help()
    
    filename = options.filename

    if options.cut:
        cut_5 = options.cut_5
        cut_3 = options.cut_3
        cut_fastq(filename,cut_5,cut_3)

    if options.len_dis:
        length_distribution(filename)

    if options.fq_len:
        fq_length(filename)

    if options.fasta:
        fq_2_fa(filename)

    if options.gc:
        count_gc(filename)

if __name__ == '__main__':
    main(args)



对Fasta的操作by xxnku:
[Python] 纯文本查看 复制代码
#encoding = utf-8
"""
简介:求fasta文件中每个序列的互补序列
date:2017年5月18:54
"""
import sys
from collections import OrderedDict

args = sys.argv

seq = OrderedDict()
tmp_dit = {'A':'T','G':'C','C':'G','T':'A','N':'N'}

with open(args[1]) as f:

    for line in f:
        line = line.strip('\n')
        if line.startswith('>'):
            seq_id = line
            seq[seq_id] = ''
        else:
            for i in line:
                seq[seq_id] += tmp_dit[i]

for seq_id, sequence in seq.items():
    print ('%s\n%s' %(seq_id,sequence))

#encoding = utf-8
"""
简介:求fasta文件中每个序列的反向序列
date:2017年5月13日17:23
"""

import sys

args = sys.argv

with open(args[1]) as f:

    while True:
        line_1 = f.readline().strip('\n')
        if not line_1:
            break
        line_2 = f.readline().strip('\n')
        line_2 = line_2[::-1]

        print (line_1)
        print ('3\''+ line_2 + '5\'')

#encoding = utf-8
"""
简介:求fasta文件中dna to rna
date:2017年5月13日17:35
"""

import sys

args = sys.argv

with open(args[1]) as f:

    while True:
        line_1 = f.readline().strip('\n')
        if not line_1:
            break
        line_2 = f.readline().strip('\n').upper()
        line_2 = line_2.replace('T','U')

        print (line_1)
        print (line_2)



对Fasta的操作 by dongye:
[Python] 纯文本查看 复制代码
#encoding = utf-8

"""
简介:fasta_tools,所有的结果都打印出来
作者:dongye
date:2017年5月16日20:11
"""
import sys
from collections import OrderedDict
from optparse import OptionParser

args = sys.argv

def complement_seq(filename):
    tmp_dic = {'A':'T','G':'C','C':'G','T':'A','N':'N'}
    seq = OrderedDict()

    with open(filename) as f:
        for line in f:
            line = line.rstrip()
            if line.startswith('>'):
                seq_id = line
                seq[seq_id] = ''
            else:
                line = line.upper()
                for i in line:
                    seq[seq_id] += tmp_dic[i]
            
        for ID,sequence in seq.items():
            print ('%s\n%s' %(ID,sequence))

def reverse_seq(filename):

    with open(filename) as f:

        while True:
            line_1 = f.readline().rstrip()
            if not line_1:
                break
            line_2 = f.readline().rstrip()

            line_2 = line_2[::-1]

            print (line_1)
            print (line_2)

def dna_2_rna(filename):

    with open(filename) as f:

        while True:

            line_1 = f.readline().rstrip()
            if not line_1:
                break
            line_2 = f.readline().rstrip().upper()

            line_2 = line_2.replace('T','U')

            print (line_1)
            print (line_2)

def format_output(filename,**kwargs):

    capital = kwargs["capital"] if "capital" in kwargs else True
    #将if...else语句缩减为单一的条件表达式,语法为expression1 if A else expression2,如果A为True,条件表达式的结果为expression1,否则为expression2

    with open(filename) as f:

        while True:

            line_1 = f.readline().rstrip()
            if not line_1:
                break
            line_2 = f.readline().rstrip()

            if capital:
                line_2 = line_2.upper()
            else:
                line_2 = line_2.lower()

            print (line_1)
            print (line_2)

def sort_seq(filename,sort_type):

    with open(filename) as f:
        fasta = {}

        for line in f:
            line = line.strip()
            if line.startswith('>'):
                ID = line
                fasta[ID] = ''
            else:
                fasta[ID] += line

        if sort_type == 'id':
            fasta = sorted(fasta.items(),key=lambda i:i[0])
            #python3中好像不能用iteritems()...不过iterms()可以实现iteritems()同样的效果
        elif sort_type == 'len':
            fasta = sorted(fasta.items(),key=lambda i:len(i[1]))
        else:
            fasta = fasta.items()

        for k,v in fasta:
            print ('%s\n%s'%(k,v))

def main(args):
    parser = OptionParser()

    parser.add_option("-f","--fasta",dest="filename",help="fasta filename",metavar="FILE")

    parser.add_option("--com","--complent_seq",dest="complement",help="complent fasta sequence",action="store_true",default=False)

    parser.add_option("--rev","--reverse_seq",dest="reverse",help="reverse fasta sequence",action="store_true",default=False)

    parser.add_option("--rna","--dna2rna",dest="rna",help="convert DNA to RNA",action="store_true",default=False)

    parser.add_option("--upper",dest="format_upper",help="output upper fasta sequence",action="store_true",default=False)
    
    parser.add_option("--lower",dest="format_lower",help="output lower fasta sequence",action="store_true",default=False)

    parser.add_option("--sort_type",dest="sort_type",help="output fasta by sort name or seq len",default=False)

    (options,args) = parser.parse_args()

    if not options.filename:
        parser.print_help()

    filename = options.filename

    if options.complement:
        complement_seq(filename)

    if options.reverse:
        reverse_seq(filename)

    if options.rna:
        dna_2_rna(filename)

    if options.format_upper:
        format_output(filename,capital=True)

    if options.format_lower:
        format_output(filename,capital=False)

    if options.sort_type:
        sort_seq(filename,options.sort_type)

if __name__ == '__main__':
    main(args)
回复 支持 反对

使用道具 举报

0

主题

16

帖子

149

积分

注册会员

Rank: 2

积分
149
发表于 2017-5-13 16:59:20 | 显示全部楼层
dongye 发表于 2017-3-18 19:04
生信编程直播第零题第一讲,Fastq文件简单统计处理
by dongye

老师你好,第43行如果写成print line_2[cut_5:-cut_3] 有没有什么不妥
回复 支持 反对

使用道具 举报

您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

QQ|手机版|小黑屋|生信技能树 ( 粤ICP备15016384号  

GMT+8, 2019-11-13 03:56 , Processed in 0.039278 second(s), 23 queries .

Powered by Discuz! X3.2

© 2001-2013 Comsenz Inc.