搜索
查看: 5382|回复: 27

生信刷题打卡贴

[复制链接]

10

主题

59

帖子

269

积分

版主

Rank: 7Rank: 7Rank: 7

积分
269
发表于 2016-9-8 11:35:15 | 显示全部楼层 |阅读模式
本帖最后由 bioinfo何婷 于 2016-10-20 11:17 编辑

在国内本科是生命学院的,来到国外后被老师说编程能力不行,早痛不如晚痛,开一个贴记录生信刷题生活。

给大家介绍一个超酷的网站:Rosalind,在quroa上可是被强推哦。


占一个坑,明天开始刷题。恩,我用的是java,不要问我为什么不用python...老师用什么我就跟着用什么。哈哈哈。。。

啊,对了这种刷题的网站是不能把自己写的代码公布出来的。我打算把我觉得很妙,想了半天或者搞混淆的知识点的东西在这里记录下来,留给自己反思。

更新中)
四楼 Finding a motif in DNA
十楼 Computing GC content
十一楼 Rabbit and Recurrence Relations
十二楼 Translating RNA into Protein
十三楼 Mendel's First Law
二十六楼 Consensus and Profile







上一篇:生信分析人员如何系统入门perl?
下一篇:详细讲解如何抽取fasta文件里面指定序列名的序列
回复

使用道具 举报

10

主题

59

帖子

269

积分

版主

Rank: 7Rank: 7Rank: 7

积分
269
 楼主| 发表于 2016-9-13 07:23:55 | 显示全部楼层
本帖最后由 bioinfo何婷 于 2016-9-13 07:25 编辑

Rabbit and Recurrence Relations
Biological background
no!It's just a math problem.
Problem
A sequence is an ordered collection of objects (usually numbers), which are allowed to repeat. Sequences can be finite or infinite. Two examples are the finite sequence (π,−2‾√,0,π)and the infinite sequence of odd numbers (1,3,5,7,9,…). We use the notation an to represent the n-th term of a sequence.

A recurrence relation is a way of defining the terms of a sequence with respect to the values of previous terms. In the case of Fibonacci's rabbits from the introduction, any given month will contain the rabbits that were alive the previous month, plus any new offspring. A key observation is that the number of offspring in any month is equal to the number of rabbits that were alive two months prior. As a result, if Fn represents the number of rabbit pairs alive after the n-th month, then we obtain the Fibonacci sequence having terms Fn that are defined by the recurrence relation Fn=Fn−1+Fn−2 (with F1=F2=1 to initiate the sequence). Although the sequence bears Fibonacci's name, it was known to Indian mathematicians over two millennia ago.

When finding the n-th term of a sequence defined by a recurrence relation, we can simply use the recurrence relation to generate terms for progressively larger values of n. This problem introduces us to the computational technique of dynamic programming, which successively builds up solutions by using the answers to smaller cases.

Given: Positive integers n≤40 and k≤5.
Return: The total number of rabbit pairs that will be present after n
months if we begin with 1 pair and in each generation, every pair of reproduction-age rabbits produces a litter of k rabbit pairs (instead of only 1 pair).

Key points:
1.面对对象的递归函数实现。
2.得用长整型,题目数据给的数字太大了。http://www.nowamagic.net/librarys/veda/detail/2314 这个博客总结的递归思想很好。


回复 支持 0 反对 1

使用道具 举报

4

主题

29

帖子

119

积分

注册会员

Rank: 2

积分
119
发表于 2016-9-8 12:55:11 | 显示全部楼层
加油啊,光看书确实没用,要实践起来
回复 支持 1 反对 0

使用道具 举报

634

主题

1182

帖子

4030

积分

管理员

Rank: 9Rank: 9Rank: 9

积分
4030
发表于 2016-9-8 11:44:53 | 显示全部楼层
先赞一个,我会follow你的
你这个问题很复杂,需要打赏,请点击 http://www.bio-info-trainee.com/donate 进行打赏,谢谢
回复 支持 反对

使用道具 举报

10

主题

59

帖子

269

积分

版主

Rank: 7Rank: 7Rank: 7

积分
269
 楼主| 发表于 2016-9-9 04:45:58 | 显示全部楼层
本帖最后由 bioinfo何婷 于 2016-9-10 01:03 编辑

Finding a motif in DNA
Biological Background:

1 找motif的目的:Finding the same interval of DNA in the genomes of two different organisms(often taken from different species) is highly suggestive that the interval hasthe same function in both organisms.
2 motif定义:We define a motif as such a commonly shared interval of DNA.
3 repeats定义:The situation is complicated by the fact that genomes are riddled with intervals of DNAthat occur multiple times (possibly with slight modifications), called repeats.

Problem:
Given two strings s and t, t is a substring of s if t is contained as a contiguous collection of symbols in s (as a result, t must be no longer than s).
The position of a symbol in a string is the total number of symbols found to its left, including itself (e.g., the positions of all occurrences of 'U' in "AUGCUUCAGAAAGGUCUUACG" are 2, 5, 6, 15, 17, and 18). The symbol at position i of s is denoted by s.
A substring of s can be represented as s[j:k], where j and k represent the starting and ending positions of the substring in s; for example, if s = "AUGCUUCAGAAAGGUCUUACG", then s[2:5]= "UGCU".
The location of a substring s[j:k] is its beginning position j; note that t will have multiple locations in s if it occurs more than once as a substring of s (see the Sample below).

  Given: Two DNA strings s and t (each of length at most 1 kbp).
  Return: All locations of t as a substring of s.

Key point:
1 用到对字符串操作的indexOf函数。若长序列s里面找到短序列t,则返回找到的起始位置,若没找到就返回-1,并且可以加上参数i指示从哪个位置开始找,即用 (int)n=s.indexOf((str)t,(int)i)就可以返回第二个第三个等motif的位置
2 做到了找到第一个motif的位置后,下面就是要加上循环,让他输出所有位置
[Java] 纯文本查看 复制代码
while(n!=-1){
n=str1.indexOf(str2, n+1);
j=n+1; 
System.out.print(j+" ");}

3 输出结果的时候一个超小的知识点老是搞混,每次总是要试半天。现在写在这里记录:
System.out.print(n+1 + “\n”);   //返回变量n+1的值并回车一行
System.out.print('n');   //返回n的ASCII码值
System.out.print("n");   //返回真正的n

回复 支持 反对

使用道具 举报

9

主题

25

帖子

182

积分

版主

Rank: 7Rank: 7Rank: 7

积分
182
发表于 2016-9-9 06:46:52 | 显示全部楼层
回复

使用道具 举报

10

主题

59

帖子

269

积分

版主

Rank: 7Rank: 7Rank: 7

积分
269
 楼主| 发表于 2016-9-9 06:58:06 | 显示全部楼层

哇 这么早就上论坛 嘻嘻
回复 支持 反对

使用道具 举报

634

主题

1182

帖子

4030

积分

管理员

Rank: 9Rank: 9Rank: 9

积分
4030
发表于 2016-9-9 07:10:58 | 显示全部楼层
bioinfo何婷 发表于 2016-9-9 06:58
哇 这么早就上论坛 嘻嘻

你们的确有一点早,加油~
你这个问题很复杂,需要打赏,请点击 http://www.bio-info-trainee.com/donate 进行打赏,谢谢
回复 支持 反对

使用道具 举报

10

主题

59

帖子

269

积分

版主

Rank: 7Rank: 7Rank: 7

积分
269
 楼主| 发表于 2016-9-9 11:22:35 | 显示全部楼层
Jimmy 发表于 2016-9-9 07:10
你们的确有一点早,加油~

Jimmy忠实粉丝!
回复 支持 反对

使用道具 举报

10

主题

59

帖子

269

积分

版主

Rank: 7Rank: 7Rank: 7

积分
269
 楼主| 发表于 2016-9-10 00:45:47 | 显示全部楼层
本帖最后由 bioinfo何婷 于 2016-9-10 00:47 编辑

这样一个个比较算法复杂度实在太高了,用哈希表会快很多。但是这个解决办法看不懂...他这个计算子序列的hashvalue赋值了什么吗?Value不应该是DNA序列吗,KEY是坐标...小白,求好心人帮助理解下
Algorithm idea

    1 Compute the hash values of all substrings in DNA string.
    2 Before comparing motif and a substring character-by-character, first compare their hash values.

Hash Value

for(i=0; i<dna.length(); i++){
int ans=0, i;
string temp=dna.substr(i, motif.length());
    if((hashValue(temp)==hashValue(motif)) && isIdentical(temp, motif)){
            pos=i+1;
            cout<<pos<<" ";
    }
}

回复 支持 反对

使用道具 举报

10

主题

59

帖子

269

积分

版主

Rank: 7Rank: 7Rank: 7

积分
269
 楼主| 发表于 2016-9-10 05:16:27 | 显示全部楼层
Computing GC content
Biological background
1 计算GC含量的目的:The biological analog of identifying unknown text ariseswhen researchers encounter a molecule of DNA deriving from an unknown species.Because of the base pairing relations of the two DNA strands, cytosine and guaninewill always appear in equal amounts in a double-stranded DNA molecule.Thus, to analyze the symbol frequencies of DNA for comparison against a database, we compute themolecule's GC-content, or the percentage of its bases that are either cytosine or guanine.#生词symbol frequency频数#
2 通过计算GC含量比较物种节省计算量:In practice, the GC-content of most eukaryotic genomes hovers around 50%. However, becausegenomes are so long, we may be able to distinguish species based on very small discrepanciesin GC-content; furthermore, most prokaryotes have a GC-content significantly higher than 50%,so that GC-content can be used to quickly differentiate many prokaryotes and eukaryotes by using relatively small DNA samples.#生词eukaryotic genomes真核基因组生物,prokaryotes原核生物#

Problem
The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.

  Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).
  Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind   allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.


Key points
1 我这个可能又是效率很低的解决办法,思路是这样的一行行读取,遇到FASTA的格式header line先输出,不是题头的DNA序列计算每行碱基G和C的含量,当找不到G/C了,即是遇到了下一个DNA序列的header line,此时把之前所有行的GC加起来就得到一个序列的GC含量。然后这从中呢,算每一行的G/C含量复制了昨天找motif的代码,都是一行里面一个个对比找嘛。所以这里肯定能优化用哈希,而且总的思路就不应该一行行读取,感觉应该一开始就split一下变成一整分字符串数组,一分DNA序列一分DNA序列这么计算...可能是这样吧,但是没有做到。不过傻办法做对了。科科!
2 这个地方搞混了:字符串数组的长度表示str.length;
                  字符串的长度表示str.length();
                  这点好坑啊...


回复 支持 反对

使用道具 举报

您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

QQ|手机版|小黑屋|生信技能树 ( 粤ICP备15016384号  

GMT+8, 2019-8-23 18:06 , Processed in 0.045960 second(s), 26 queries .

Powered by Discuz! X3.2

© 2001-2013 Comsenz Inc.