
Let's take open online courses together

Posted on 2018-1-1 06:02:24
12.31 Goal: Coursera JHU Data Science Specialization
Two hours every day.
Deadline January 22; progress recorded below.





OP | Posted on 2018-1-4 06:04:14
1.2
JHU Data Science Specialization - Getting and Cleaning Data (course 3)
#week 3
#subsetting and sorting
##logicals: ands and ors
##dealing with missing values by using which()
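A quick sketch of these ideas, using a hypothetical toy data frame X:

X <- data.frame(var1 = c(2, 5, 1, 4, 3), var2 = c(NA, 10, NA, 6, 9))
X[X$var1 <= 3 & X$var2 > 8, ]   # AND: both conditions must hold (NA comparisons give NA rows)
X[X$var1 <= 3 | X$var2 > 8, ]   # OR: either condition may hold
X[which(X$var2 > 8), ]          # which() returns indices and silently drops the NAs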
##sorting with sort()
sort(X$var1, decreasing = TRUE)
##when there are NA values and we want them placed last:
sort(X$var2, na.last = TRUE)
##ordering: order the rows by var1 and var2
X[order(X$var1, X$var2), ]
##ordering with plyr
arrange(X, var1)
arrange(X, desc(var1))
##adding rows and columns
X$var4 <- rnorm(5)
Y <- cbind(X, rnorm(5))  # bind rows with rbind()
##summarizing data
head(,n=)
tail(,n=)
summary()
str()
##making table
##by default table() does not show NA values
table(restData$zipCode, useNA="ifany")
##checking for missing value
sum(is.na(restData$council))
any(is.na(restData$council))
all(restData$zipCode > 0)
##row and column sums
colSums(is.na(restData))
##Values with specific characteristics
table(restData$zipCode %in% c("21","34"))
restData[restData$zipCode %in% c("21","34"), ]
##cross tabs
data(UCBAdmissions)
DF = as.data.frame(UCBAdmissions)
summary(DF)
##break the values down by combinations of properties
xt <- xtabs(Freq ~ Gender + Admit, data = DF)
#creating new variables
##creating sequences
##with the by argument: step by 2
s1 <- seq(1,10,by=2)
##defining the length
s2 <-seq(1,10,length=3)
##making a sequence the same length as x
x<- c(1,3,8,25,100); seq(along =x)
##subsetting variables
restData$nearMe = restData$neighborhood %in% c("A","B")
table(restData$nearMe)
##creating binary variables
restData$zipWrong = ifelse(restData$zipCode <0, TRUE,FALSE)
table(restData$zipWrong, restData$zipCode <0)
##creating categorical variables
restData$zipGroups = cut(restData$zipCode, breaks = quantile(restData$zipCode))
table(restData$zipGroups)
##easier cutting
library(Hmisc)
restData$zipGroups = cut2(restData$zipCode, g=4)
table(restData$zipGroups)
##creating factor variables
restData$zcf <- factor(restData$zipCode)
##using the mutate function to get a new variable in the plyr library
restData2 = mutate(restData, zipGroups=cut2(zipCode,g=4))
#reshaping data
##melting dataframe
library(reshape2)  # provides melt() and dcast()
mtcars$carname <- rownames(mtcars)
carMelt <- melt(mtcars, id = c("carname","gear","cyl"), measure.vars = c("mpg","hp"))

##casting data frames
cylData <-dcast(carMelt,cyl~variable)
##applying a function within groups
tapply(InsectSprays$count, InsectSprays$spray, sum)
##split
spIns = split(InsectSprays$count, InsectSprays$spray)
##lapply
sprCount = lapply(spIns,sum)
##combine
unlist(sprCount)
##or
sapply(spIns,sum)
##another way, plyr package
ddply(InsectSprays,.(spray),summarize,sum=sum(count))






OP | Posted on 2018-1-1 09:51:33
12.31
JHU Data Science Specialization - Getting and Cleaning Data (course 3)
#week 1
##Tidy data: 1. each variable you measure should be in one column; 2. each different observation of that variable should be in a different row; 3. include a row at the top of each file with the variable names.
##The code book: there should be a section called "study design" with a thorough description of how the data were collected, and a section called "code book" that describes each variable and its units.
##The instruction list: 1. the input for the script is the raw data; 2. the output is the processed, tidy data; 3. there are no parameters to the script.
##Loading flat files with read.table(): 1. reads the data into RAM, so big data causes problems; 2. important parameters: file, header, sep, row.names, nrows.
##Reading Excel files:
library(readxl)
a <- read_xlsx("file")

##Reading XML: 1. frequently used to store structured data; 2. components: markup (labels that give the text structure) and content (the real text).
library(XML)
fileUrl <- "file"
doc <- xmlTreeParse(fileUrl, useInternal = TRUE)
rootNode <- xmlRoot(doc)
xmlName(rootNode)

##Reading JSON: 1. lightweight data storage; 2. a common format for data returned by application programming interfaces; 3. similar structure to XML.
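A minimal sketch with the jsonlite package, using the GitHub API example commonly shown in the course:

library(jsonlite)
jsonData <- fromJSON("https://api.github.com/users/jtleek/repos")
names(jsonData)        # JSON fields become data-frame columns
jsonData$owner$login   # nested objects become nested data frames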
##Using data.table: 1. inherits from data.frame, so all functions that accept a data.frame work on a data.table; 2. written in C, so it is much faster.
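A small sketch of the data.table syntax, with made-up data:

library(data.table)
DT <- data.table(x = rnorm(9), y = rep(c("a", "b", "c"), each = 3))
DT[2, ]                 # subset rows as with a data.frame
DT[, mean(x), by = y]   # compute mean(x) within each group of y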
#swirl-dplyr
##displaying a data.frame with tbl_df is great!


Couldn't find R syntax highlighting on the forum.

OP | Posted on 2018-1-3 05:18:02

1.1
JHU Data Science Specialization - Getting and Cleaning Data (course 3)
#week 2
#Reading from MySQL
##install MySQL on your computer; install RMySQL; connect and list databases (connecting to hg19)
install.packages("RMySQL")
library(RMySQL)

ucscDb <- dbConnect(MySQL(),user="genome",host="genome-mysql.cse.ucsc.edu")
result <- dbGetQuery(ucscDb,"show databases;");
dbDisconnect(ucscDb);

#connect to the hg19 database on the same server
hg19 <- dbConnect(MySQL(),user="genome",db="hg19",host="genome-mysql.cse.ucsc.edu")
#count the tables in the database
allTables<-dbListTables(hg19)
length(allTables)
#show the 1st to 5th tables
allTables[1:5]
#look at the hg19 database and check the fields of the affyU133Plus2 table
dbListFields(hg19,"affyU133Plus2")
#check the num of rows/records
dbGetQuery(hg19,"select count(*) from affyU133Plus2")
#Read from the table
affyData <- dbReadTable(hg19,"affyU133Plus2")
head(affyData)
#select a specific subset
query <- dbSendQuery(hg19,"select * from affyU133Plus2 where misMatches between 1 and 3")
affMis <- fetch(query);
quantile(affMis$misMatches)
#remember to close the result and the connection every time
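#the two DBI calls the note above refers to:
dbClearResult(query)   # release the pending result set
dbDisconnect(hg19)     # close the connection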

#Reading from HDF5
##HDF5: for storing large data sets
#R HDF5 package
#load the biocLite function
source("http://bioconductor.org/biocLite.R")
biocLite("rhdf5")
#create groups
#write to groups
#write a data set
#reading data
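A minimal sketch of those steps, following the course slides, using a scratch file example.h5:

library(rhdf5)
created <- h5createFile("example.h5")
created <- h5createGroup("example.h5", "foo")   # create a group
A <- matrix(1:10, nrow = 5)
h5write(A, "example.h5", "foo/A")               # write a data set into the group
readA <- h5read("example.h5", "foo/A")          # read the data back
h5ls("example.h5")                              # list the contents of the file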

#Reading from the web
#webscraping
#getting data off webpages
con = url("http://cn.dealmoon.com/")
htmlCode = readLines(con)
close(con)

##parsing with XML
library(XML)
url <- "http://cn.dealmoon.com/"
html <- htmlTreeParse(url,useInternalNodes = T)
#check the title of the page
xpathSApply(html,"//title",xmlValue)

##alternative: GET from the httr package
library(httr)
html2 = GET(url)
content2 = content(html2,as="text")
parsedHtml = htmlParse(content2,asText = T)
xpathSApply(parsedHtml,"//title",xmlValue)
#accessing websites with password
pg2 = GET("url", authenticate("user","passwd"))
#status 200 indicates the request succeeded

#Reading from APIs
##API: application programming interface
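A sketch of the Twitter example used in the course; the keys and tokens below are placeholders you would obtain from the Twitter developer site:

library(httr)
myapp <- oauth_app("twitter", key = "yourConsumerKey", secret = "yourConsumerSecret")
sig <- sign_oauth1.0(myapp, token = "yourToken", token_secret = "yourTokenSecret")
homeTL <- GET("https://api.twitter.com/1.1/statuses/home_timeline.json", sig)
json1 <- content(homeTL)   # content() parses the returned JSON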

#Reading from other sources


OP | Posted on 2018-1-6 11:44:27
1.5
JHU Data Science Specialization - Regression Models (course 7)
Fell behind for a few days :P
Let's skip ahead to the interesting parts~

#using Galton's data
install.packages("UsingR")
library(UsingR)
library(reshape2)   # for melt()
library(ggplot2)
#checking the marginal distributions
data("galton")
long <- melt(galton)
g <- ggplot(long, aes(x = value, fill = variable))
g <- g + geom_histogram(colour = "black", binwidth = 1)
g <- g + facet_grid(. ~ variable)
g

#demonstration: the least-squares estimate is the empirical mean
library("manipulate")
library("ggplot2")
myHist <- function(mu){
    mse <- mean((galton$child - mu)^2)
    g <- ggplot(galton,aes(x=child))+geom_histogram(fill="salmon",color="black",binwidth = 1)
    g <- g+geom_vline(xintercept = mu,size =3)
    g <- g+ggtitle(paste("mu=",mu, ", MSE =", round(mse,2),sep=""))
    g
}
manipulate(myHist(mu),mu=slider(62, 74, step=0.5))

#comparing children's heights and their parents' heights
ggplot(galton, aes(x = parent, y = child)) + geom_point()
#avoid overplotting
library(dplyr)
freqData <- as.data.frame(table(galton$child, galton$parent))
names(freqData) <- c("child", "parent", "freq")
freqData$child <- as.numeric(as.character(freqData$child))
freqData$parent <- as.numeric(as.character(freqData$parent))
g <- ggplot(filter(freqData, freq > 0), aes(x = parent, y = child))
g <- g + scale_size(range = c(2, 20), guide = "none")
g <- g + geom_point(colour = "grey50", aes(size = freq + 20), show.legend = FALSE)
g <- g + geom_point(aes(colour = freq, size = freq))
g <- g + scale_colour_gradient(low = "lightblue", high = "white")
g
#instead of using manipulate: regression through the origin on centered data
lm(I(child - mean(child)) ~ I(parent - mean(parent)) - 1, data = galton)





OP | Posted on 2018-1-10 12:09:33

1.9
JHU Data Science Specialization - Regression Models
#week 2
#Linear regression for prediction
#fit price on carat with the diamond data (from UsingR)
fit <- lm(price ~ carat, data = diamond)
#getting a more interpretable intercept by centering carat
fit1 <- lm(price ~ I(carat - mean(carat)), data = diamond)
#predicting new values
newx <- c(0.16, 0.27, 0.34)
coef(fit)[1] + coef(fit)[2] * newx
#more useful way
predict(fit, newdata = data.frame(carat = newx))

#residuals represent the variation left unexplained by the model. We emphasize the difference between residuals and errors: errors are the unobservable deviations from the true regression line with the unknown coefficients, while residuals are the observable deviations from the line with the estimated coefficients. In that sense, the residuals are estimates of the errors.
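Two quick checks that follow from this, assuming fit is the diamond model above:

e <- resid(fit)
sum(e)                   # ~0: residuals sum to zero when the model has an intercept
sum(e * diamond$carat)   # ~0: residuals are orthogonal to the regressor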
#residual and residual variation
#residual plot (assumes x and y are defined, e.g. x <- diamond$carat; y <- diamond$price)
g = ggplot(data.frame(x = x, y = resid(lm(y ~ x))), aes(x = x, y = y))
g = g + geom_hline(yintercept = 0, size = 2)
g = g + geom_point(size = 7, colour = "black", alpha = 0.4)
g = g + geom_point(size = 5, colour = "red", alpha = 0.4)
g = g + xlab("X") + ylab("residual")
g

#inference in regression
#get the full output in lm
summary(fit)
#get the coefficients
summary(fit)$coefficients

#For a given value of x, the interval estimate for the mean of the dependent variable, E[Y | X = x], is called the confidence interval.
#calculating a 95% confidence interval
attach(faithful)
#in the faithful data set, build a 95% confidence interval for the mean eruption duration at a waiting time of 80 minutes
fit <- lm(eruptions ~ waiting)
#create a new data frame that sets the waiting-time value
newdata <- data.frame(waiting = 80)
#apply predict() with the predictor in the newdata argument; set interval = "confidence" and use the default 0.95 confidence level
predict(fit, newdata, interval = "confidence")
detach(faithful)

Two questions on quiz 2 stumped me; noting them here.

lm(mpg ~ 1, data = mtcars) fits a model containing only an intercept.
predict(fit) then gives the fitted points lying on the regression line.



OP | Posted on 2018-1-11 05:45:15
1.9
JHU Data Science Specialization - Regression Models
#week 2&3
some coding exercises
#use the residuals (fit$residuals) of the model to estimate the standard deviation (sigma) of the error; n is the number of points in Galton's dataset (928)
sigma <- sqrt(sum(fit$residuals^2) / (n - 2))
#or look at the "sigma" portion of the summary of fit
summary(fit)$sigma
#or use deviance(), the residual sum of squares
sqrt(deviance(fit) / (n - 2))

# The intercept is really the coefficient of a special regressor that takes the same value, 1, at every observation. lm() includes this regressor by default.
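A sketch verifying this with Galton's data: an explicit column of 1s (with the automatic intercept removed via -1) reproduces the default fit.

library(UsingR); data(galton)
ones <- rep(1, nrow(galton))
coef(lm(child ~ ones + parent - 1, data = galton))   # explicit intercept regressor
coef(lm(child ~ parent, data = galton))              # same coefficients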

OP | Posted on 2018-1-14 23:51:26

1.13
JHU Data Science Specialization - Regression Models
#week 3
#multivariable regression analyses
A multivariable linear model adjusts the regression estimate for the other variables: it is as if the effect of the other variables had been removed from both the predictor and the response.
n = 100
x = rnorm(n)
x2=rnorm(n)
x3=rnorm(n)
y=1+x+x2+x3+rnorm(n,sd=.1)
#proof of the last sentence
ey=resid(lm(y~x3 + x2))
ex=resid(lm(x~x2+x3))
coef(lm(ey~ex -1))
coef(lm(y~x+x2+x3))

In machine learning, a linear regression model should be the first model you try.
#Simpson's paradox: conclusions can reverse completely once you adjust for another variable.
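The classic illustration uses the UCBAdmissions data from the cross-tabs section above:

DF <- as.data.frame(UCBAdmissions)
xtabs(Freq ~ Gender + Admit, data = DF)          # aggregated: men appear to be admitted more often
xtabs(Freq ~ Gender + Admit + Dept, data = DF)   # stratified by department, the gap largely disappears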
#residuals
#influence measures: run ?influence.measures to see the full suite available in the stats package.
First case: a big cloud of uncorrelated data generated as pairs of independent standard normals, with an outlier added at (10, 10).
#how to create it
n <- 100
x <- c(10,rnorm(n))
y <- c(10,rnorm(n))
plot(x,y,frame =FALSE,cex=2,pch=21,bg="lightblue",col="black")
abline(lm(y~x))


Three problems from the quiz worth recording:

The dfbeta of a particular observation is the difference between a regression coefficient (say for age, or education, in the well-worn salary example) calculated from all of the data and the same coefficient calculated with that observation deleted, scaled by the standard error computed with the observation deleted.



The diagonal entries of the hat matrix measure the leverage of the observations: the higher the value, the more influential the point. So the hat diagonal of the most influential point is 0.9945734 (point 5).
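A sketch computing both quantities on the (10, 10) outlier example above:

fit <- lm(y ~ x)               # x, y from the outlier example; the outlier is observation 1
round(dfbetas(fit)[1, 2], 3)   # standardized dfbeta of the slope for the outlier
round(hatvalues(fit)[1], 3)    # hat diagonal: the outlier's leverage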


OP | Posted on 2018-1-23 08:06:46
1.22
JHU Data Science Specialization - GLMs
#week 4
#What are the least squares and maximum likelihood estimation methods?
Maximum likelihood estimation (MLE):
The likelihood function indicates how likely the observed sample is as a function of the possible parameter values, so maximizing it picks the parameters most likely to have produced the observed data. From a statistical point of view, MLE is usually recommended for large samples because it is versatile, applies to most models and types of data, and produces the most precise estimates.
Least squares estimation (LSE):
Least squares estimates are calculated by fitting the regression line that minimizes the sum of squared deviations of the data points from the line (the least-squares error). In reliability analysis, the line and the data are plotted on a probability plot.
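For the Gaussian linear model the two methods coincide; a small sketch with simulated data:

set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
coef(lm(y ~ x))                       # least squares
coef(glm(y ~ x, family = gaussian))   # maximum likelihood; identical estimates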

OP | Posted on 2018-1-25 11:19:52
1.24
JHU Data Science Specialization - Logistic Regression
#week 4
#part1
#a logistic regression fit; y and x stand in for the outcome and predictor columns of a data frame d
fit <- glm(y ~ x, data = d, family = "binomial")

Poisson GLM
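A minimal sketch of a Poisson GLM with simulated count data:

set.seed(2)
x <- rnorm(100)
counts <- rpois(100, lambda = exp(1 + 0.5 * x))   # log link: the log of the mean is linear in x
fitP <- glm(counts ~ x, family = "poisson")
summary(fitP)$coefficients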
