[R] Working through R for Data Science: ggplot

Posted on 2018-9-8 11:13:58
2.First steps
ggplot() creates a coordinate system that you can add layers to.
ggplot(data = mpg) creates an empty graph and chooses the dataset to use in it.
You complete your graph by adding one or more layers to ggplot().
The function geom_point() adds a layer of points to your plot, which creates a scatterplot.
Other geom functions define other types of plot.
Every geom function takes a mapping argument, which defines how variables in your dataset are mapped to visual properties.
The mapping argument is always paired with aes(), and the x and y arguments of aes() specify which variables to map to the x and y axes.
Example: ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
Template: ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
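To make the template concrete, here is a minimal sketch that fills it in twice with different geoms. The choice of geom_boxplot() and of class as the x variable is mine, not from the notes; it assumes ggplot2 is installed, which ships the mpg dataset.

```r
library(ggplot2)

# <DATA> = mpg, <GEOM_FUNCTION> = geom_point, <MAPPINGS> = x and y
p_points <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# Same template, different geom and mapping (an illustrative choice)
p_box <- ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = class, y = hwy))

p_points  # printing a plot object draws it
p_box
```

Only the geom function and the mappings change between the two plots; the ggplot(data = ...) call stays the same.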
Exercises
1.Run ggplot(data = mpg). What do you see?
An empty graph.
2.How many rows are in mpg? How many columns?
234 rows and 11 columns.
3.What does the drv variable describe? Read the help for ?mpg to find out.
It describes the type of drive train: f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive.
4.Make a scatterplot of hwy vs cyl.
ggplot(data = mpg) + geom_point(mapping = aes(x = cyl, y = hwy))
5.What happens if you make a scatterplot of class vs drv? Why is the plot not useful?
Both variables are categorical, so the points pile up on a small grid of class/drv combinations. The plot shows which combinations exist but not how many cars fall in each, so it reveals no obvious relationship between class and drv.
3.Aesthetic mappings
You can add a third variable, like class, to a two dimensional scatterplot by mapping it to an aesthetic.
ggplot(data = mpg) +    geom_point(mapping = aes(x = displ, y = hwy, color = class))
Aesthetics you can choose from:
color        size        alpha (transparency of the points)        shape (ggplot2 only uses six shapes at a time, so groups beyond the sixth go unplotted)
Once you add an aesthetic, ggplot2 selects a reasonable scale to use with it and constructs a legend that explains the mapping between levels and values.
For x and y aesthetics, ggplot2 does not create a legend, but it creates an axis line with tick marks and a label. The axis line acts as a legend; it explains the mapping between locations and values. (A continuous legend, in effect?)
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), color = "blue")  # sets the colour of the points manually; note where the comma goes: color sits outside aes()!
Exercises
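1.What’s gone wrong with this code? Why are the points not blue?
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
Here color sits inside aes(), so ggplot2 treats "blue" as data rather than as a colour name: a constant with the single value "blue". That one level gets the first colour of the default discrete palette (a salmon red), and a legend appears. To colour the points blue, put color = "blue" outside aes().
2.Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?
Printing mpg shows a three-letter type abbreviation under each column name (<chr>, <int>, <dbl>). The <chr> columns (manufacturer, model, trans, drv, fl, class) are categorical; the numeric columns (displ, year, cyl, cty, hwy) are stored as numbers, although year and cyl take only a few distinct values.
3.Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?
color: a gradient; size: the point size scales continuously with the value, and the legend shows a few representative sizes; shape: fails with an error, because a continuous variable cannot be mapped to shape.
4.What happens if you map the same variable to multiple aesthetics?
The variable is shown through several aesthetics at once. It works, but the aesthetics are redundant because they all encode the same information.
5.What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)
It sets the width of the border for shapes that have one (such as 21). For example, giving shape 21 (a circle with a border) stroke = 5 and size = 5 produces concentric circles.
6.What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)?
The expression is evaluated to a logical variable, and the points are coloured by TRUE/FALSE.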
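The difference between mapping and setting a colour, and the logical mapping from the last exercise, can be checked without looking at the plots. This is a small sketch that assumes ggplot2's layer_data() helper, which returns the data a layer is drawn with; the object names are mine.

```r
library(ggplot2)

mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))    # "blue" is data
fixed <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")    # "blue" is a colour
logical <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = displ < 5)) # TRUE/FALSE per row

# Colours each layer actually draws with
unique(layer_data(mapped)$colour)   # a single palette colour, not "blue"
unique(layer_data(fixed)$colour)    # "blue"
unique(layer_data(logical)$colour)  # two colours, one per logical value
```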
4.Common problems
One common problem when creating ggplot2 graphics is to put the + in the wrong place: it has to come at the end of the line, not the start.
ggplot(data = mpg)
+ geom_point(mapping = aes(x = displ, y = hwy))  # do not write it like this
5.Facets
Split your plot into facets, subplots that each display one subset of the data.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)  # the variable passed to facet_wrap() should be discrete
To facet your plot on the combination of two variables, use facet_grid().
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl)
If you prefer to not facet in the rows or columns dimension, use a . instead of a variable name,
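e.g. + facet_grid(. ~ cyl)  # this lays the panels out in a single row; facet_grid(class ~ .) lays them out in a single column
Exercises
1.What happens if you facet on a continuous variable?
Every distinct value gets its own panel, i.e. the variable is treated as discrete.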
2.What do the empty cells in plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?
ggplot(data = mpg) +    geom_point(mapping = aes(x = drv, y = cyl))
3.What plots does the following code make? What does . do?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)  # panels stacked in one column; the . stands in for a variable, so there is only one column of panels
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)  # panels laid out in one row
4.Take the first faceted plot in this section:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?
Faceting separates the x-y relationship of each class more cleanly. The drawback is that the overall x-y relationship becomes harder to see. With a larger dataset, faceting probably makes it easier to tell the classes apart.
5.Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?
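nrow: number of rows; ncol: number of columns.
facet_grid() lays the panels out directly by two variables: facet_grid(drv ~ cyl) puts drv in the rows and cyl in the columns (like a larger, discrete pair of vertical and horizontal axes), so the numbers of rows and columns are fixed by the data.
6.When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?
So that the panels are not cramped: most displays are wider than they are tall, so there is more room across the columns.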
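Exercise 2 above is left unanswered in these notes. One way to approach it, sketched here with base R's table(), is to count the cars in every drv/cyl combination: the combinations with a count of zero are the ones that show up as empty panels in facet_grid(drv ~ cyl).

```r
library(ggplot2)  # only needed for the mpg dataset

# Number of cars for every combination of drv and cyl
combos <- table(drv = mpg$drv, cyl = mpg$cyl)
combos

# Row/column positions of the combinations with no cars at all
which(combos == 0, arr.ind = TRUE)
```

The same combinations are the grid positions without a point in the scatterplot of drv against cyl.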
6.Geometric Objects
1.What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?
geom_line(); geom_boxplot(); geom_histogram(); geom_area()
2.Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
3.What does show.legend = FALSE do? What happens if you remove it?
Why do you think I used it earlier in the chapter?
It hides the legend for that layer; if you remove it, the legend is shown.
It was used earlier to compare the group aesthetic with what ggplot2 does when you map an aesthetic to a discrete variable.
4.What does the se argument to geom_smooth() do?
It displays the confidence interval around the smooth (the grey band).
#se = standard error; it adds standard error bands.
5.Will these two graphs look different? Why/why not?
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point() +geom_smooth()  
ggplot() +
geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
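No. They look exactly the same: both layers use the same data and mappings, whether these are declared once in ggplot() or repeated in each geom.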
6.Recreate the R code necessary to generate the following graphs.
1)ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point() + geom_smooth()
2)ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point() +
geom_smooth(mapping = aes(group = drv))
3)ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() + geom_smooth()
4)ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point(mapping = aes(color = drv)) + geom_smooth()
5)ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = drv)) +
geom_smooth(mapping = aes(linetype = drv))
6)ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(fill = drv), shape = 21, color = "white", size = 5, stroke = 5) +
geom_smooth()  # x and y moved into ggplot() so geom_smooth() can use them; the border width is set with stroke
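Exercises 5 and 6 rely on one rule: mappings given in ggplot() are global and inherited by every layer, while mappings given inside a geom apply to that layer only. A short sketch of that rule follows; se = FALSE is my choice to keep the plot uncluttered, and layer_data() is used only to inspect the result.

```r
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +  # global: x and y
  geom_point(mapping = aes(color = drv)) +                     # local: colour, points only
  geom_smooth(se = FALSE)                                      # inherits x and y only

length(unique(layer_data(p, 1)$colour))  # one colour per drv value
length(unique(layer_data(p, 2)$group))   # the smooth is fitted to a single group
```

Moving color = drv into ggplot() would make the smooth layer inherit it too, giving one line per drv value.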
7.Statistical transformations
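geom function -> stat function -> computes new variables -> geometric object
ggplot(data = diamonds) + stat_count(mapping = aes(x = cut))
is equivalent to
ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))
Cases where you name a stat explicitly:
1.You need to override a geom's default stat
e.g. stat_count() fits when the height of the bar is generated by counting rows. Sometimes the counts are already a column in the data and you want to plot them as they are; then use stat = "identity".
2.You need to override the default mapping from the computed variables to aesthetics
e.g. geom_bar() maps the computed count to bar height by default; to draw a bar chart of proportions instead, specify y = ..prop..
ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
#group = 1 treats all diamonds as a single group. Without it, each cut is its own group, so every prop is 100%.
3.You want to draw attention to the statistical transformation in your code
stat_summary() uses geom_pointrange() by default; geom_pointrange() uses stat_identity() by default.
To reproduce the stat_summary() plot with geom_pointrange(), specify stat = "summary".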
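The effect of group = 1 can be seen in the variables the stat computes. The sketch below assumes ggplot2 3.3.0 or later, where after_stat(prop) is the newer spelling of ..prop.., and uses layer_data() to print the computed columns.

```r
library(ggplot2)

no_group <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))
one_group <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

# Each cut is its own group: every prop is 1
layer_data(no_group)[, c("x", "count", "prop")]

# One group for all diamonds: the props add up to 1
layer_data(one_group)[, c("x", "count", "prop")]
```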
Exercises
  • What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?
    geom_pointrange
    ggplot(data = diamonds) + geom_pointrange(mapping = aes(x = cut, y = depth), stat = "summary", fun.ymin = min, fun.ymax = max, fun.y = median)  # in ggplot2 >= 3.3.0 these arguments are named fun.min, fun.max and fun
  • What does geom_col() do? How is it different to geom_bar()?
    It creates a bar chart and the heights of the bars represent values in the data.
    geom_col uses stat_identity: it leaves the data as is.
    geom_bar uses stat_count by default: it counts the number of cases at each x position.
  • Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?
    They share the same arguments: mapping, data, position, ... and inherit.aes.
  • What variables does stat_smooth() compute? What parameters control its behaviour?
    computed variables:
    y        predicted value
    ymin        lower pointwise confidence interval around the mean
    ymax        upper pointwise confidence interval around the mean
    se        standard error
    parameters:
    span        controls the "wiggliness" of the default loess smoother. The span is the fraction of points used to fit each local regression. Small numbers make a wigglier curve, larger numbers make a smoother curve.
    method        chooses the modelling function
  • In our proportion bar chart, we need to set group = 1. Why? In other words what is the problem with these two graphs?
    ggplot(data = diamonds) +
      geom_bar(mapping = aes(x = cut, y = ..prop..))  # every prop is 100%, so all 5 bars have the same height
    ggplot(data = diamonds) +
      geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..))  # 5 identical bars: each is a stack of one segment per color, and every segment has the same height
8.Position Adjustments
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, colour = cut))  # each bar gets a differently coloured outline
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut))  # each bar gets a different fill colour
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity))  # within each bar, colour by clarity to show how much of that cut each clarity makes up
1.position = "stack" (stacked)
ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = clarity))  # the bars are automatically stacked. Each colored rectangle represents a combination of cut and clarity.
# series is a data frame with time, value and type columns; it is not defined in this post
ggplot(series, aes(time, value, group = type)) +
  geom_line(aes(colour = type), position = "stack") +
  geom_point(aes(colour = type), position = "stack")  # with position = "stack" the lines are stacked one above the other and never cross
ggplot(series, aes(time, value, group = type)) +
  geom_line(aes(colour = type)) +
  geom_point(aes(colour = type))  # without it the default is identity: points and lines stay where the raw data put them
2.position = "identity"
Places each object exactly where it falls in the context of the graph.
This is not very useful for bars, because it overlaps them. To see that overlapping we either need to make the bars slightly transparent by setting alpha to a small value, or completely transparent by setting fill = NA.
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
geom_bar(alpha = 1/5, position = "identity")
ggplot(data = diamonds, mapping = aes(x = cut, colour = clarity)) +
geom_bar(fill = NA, position = "identity")  # look closely: the bars for each clarity overlap; without transparency the shorter bars would be hidden behind the taller ones
3.position = "fill"
Works like stacking, but every stack has the same total height, so each segment shows a proportion within its group.
ggplot(data = diamonds) +   
geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
4.position = "dodge" (side by side)
Places overlapping objects directly beside one another. This makes it easier to compare individual values.
It is like drawing a separate small bar chart for each group.
ggplot(data = diamonds) +    geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
5.position = "jitter" (not for bar charts; used for scatterplots)
Avoids overplotting.
Adds a small amount of random noise to each point. This spreads the points out because no two points are likely to receive the same amount of random noise.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")  # shows better where the mass of the data is; equivalent to geom_jitter()
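The bar-chart adjustments above can also be compared numerically. This is a rough sketch using layer_data() to read the rectangle coordinates ggplot2 computes; the object names are mine.

```r
library(ggplot2)

base <- ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity))
stacked <- layer_data(base + geom_bar(position = "stack"))
filled  <- layer_data(base + geom_bar(position = "fill"))
dodged  <- layer_data(base + geom_bar(position = "dodge"))

# Top of each stack: the total count per cut for "stack", 1 for "fill"
tapply(stacked$ymax, as.integer(stacked$x), max)
tapply(filled$ymax, as.integer(filled$x), max)

# "dodge" keeps the heights but gives each clarity its own horizontal slot
length(unique(stacked$xmin))  # one left edge per cut
length(unique(dodged$xmin))   # one left edge per cut/clarity bar
```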
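Summary:
  • dodge: objects step aside, e.g. the side-by-side layout of a bar chart.
  • fill: normalises the data first, then stacks so that every bar reaches the top of the plotting area.
  • identity: stays put; no position adjustment.
  • jitter: shakes points randomly so that overlapping ones peek out.
  • stack: piles objects on top of one another.
    The code below applies them to a density plot:
    p <- ggplot(diamonds, aes(x = price, fill = cut, color = cut))

    p + stat_density(position = "stack")
    p + stat_density(position = "identity")
    p + stat_density(position = "dodge")
    p + stat_density(position = "fill")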

Exercises
1.What is the problem with this plot? How could you improve it?
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point()  # overplotting: many points share the same position, so most of them are hidden
# improved
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter()
2.What parameters to geom_jitter() control the amount of jittering?
width        Amount of horizontal jitter. The jitter is added in both positive and negative directions, so the total spread is twice the value specified here. If omitted, defaults to 40% of the resolution of the data: this means the jitter values will occupy 80% of the implied bins.
height        Amount of vertical jitter, with the same rules as width.
3.Compare and contrast geom_jitter() with geom_count().
Both of them make overplotting visible.
geom_count() counts the overlapping points and shows the count as the size of the point, so it keeps the exact position of the points but shows their number only approximately.
geom_jitter() randomly moves the overlapping points apart, so it shows the exact number of points but not their real position.
4.What’s the default position adjustment for geom_boxplot()? Create a visualisation of the mpg dataset that demonstrates it.
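dodge2 (a variant of dodge)
ggplot(data = mpg) + geom_boxplot(mapping = aes(x = drv, y = displ, fill = class))  # fill = class added so that several boxes share one drv value and sit side by side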
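As a small illustration of exercise 2, the sketch below jitters only horizontally by setting height = 0. The values 0.3 and 0 are arbitrary choices of mine, and set.seed() is there only to make the random noise repeatable.

```r
library(ggplot2)

set.seed(1)
p <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.3, height = 0)  # horizontal noise only

d <- layer_data(p)
all(d$y == round(d$y))  # y values are still whole numbers
any(d$x != round(d$x))  # x values have been moved off the grid
p
```

Jittering in one direction only keeps the values on the other axis exact.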
9.Coordinate systems
The default is the Cartesian coordinate system. #honestly hard to see how that name ended up as 笛卡尔 in Chinese
1.coord_flip() switches the x and y axes.
2.coord_quickmap() sets the aspect ratio correctly for maps. This is very important if you’re plotting spatial data with ggplot2
3.coord_polar() uses polar coordinates. Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.
Exercises
  • Turn a stacked bar chart into a pie chart using coord_polar().
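    p <- ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity))
    p + geom_bar()
    p + geom_bar() + coord_polar()  # Coxcomb-style chart: one wedge per cut
    p + geom_bar(width = 1) + coord_polar(theta = "y")  # pie-like chart of concentric rings; width = 1 removes the gaps between the rings
    p + geom_bar(position = "fill") + coord_polar(theta = "y")  # multi-ring chart: every ring is filled to a full circle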

  • What does labs() do? Read the documentation.
    It sets the text of the axis labels, and also the title, subtitle, caption and legend titles.
  • What’s the difference between coord_quickmap() and coord_map()?
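    coord_quickmap() is a quick approximation that sets an aspect ratio suited to the map and keeps straight lines straight; it works best for small areas.
    coord_map() projects a portion of the earth onto a flat plane with a real map projection (Mercator by default). It is more accurate for large areas, but slower and harder to tune.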
  • What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?
    coord_fixed()         A fixed scale coordinate system forces a specified ratio between the physical representation of data units on the axes. The ratio represents the number of units on the y-axis equivalent to one unit on the x-axis. The default, ratio = 1, ensures that one unit on the x-axis is the same length as one unit on the y-axis.
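    geom_abline() adds a reference line; by default it is y = x (intercept 0, slope 1).
    Without coord_fixed() the two axes use different scales, so the y = x line is not drawn at 45 degrees, which distorts a visual judgement of the slope of the relationship.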
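The plot this last exercise refers to is missing from the post. As far as I recall, R for Data Science builds it roughly as below; treat the exact code as my reconstruction rather than a quote.

```r
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline() +  # defaults: intercept = 0, slope = 1, i.e. the line y = x
  coord_fixed()    # default ratio = 1: one unit on x is as long as one unit on y

layer_data(p, 2)[, c("intercept", "slope")]  # the reference line's parameters
p
```

Points above the line are cars whose highway mileage is higher than their city mileage.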
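10.The layered grammar of graphics
ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(
    mapping = aes(<MAPPINGS>),
    stat = <STAT>,
    position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>
How a plot is built: dataset -> statistical transformation -> geometric object -> map fill to the corresponding variable -> place in a coordinate system -> map x/y to the corresponding variables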
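As a worked instance of the full template, here is one possible way to fill in all seven slots. The particular choices (diamonds, position = "fill", coord_flip(), faceting by color) are mine and only meant as an illustration.

```r
library(ggplot2)

p <- ggplot(data = diamonds) +                 # <DATA>
  geom_bar(                                    # <GEOM_FUNCTION>
    mapping = aes(x = cut, fill = clarity),    # <MAPPINGS>
    stat = "count",                            # <STAT> (geom_bar's default)
    position = "fill"                          # <POSITION>
  ) +
  coord_flip() +                               # <COORDINATE_FUNCTION>
  facet_wrap(~ color)                          # <FACET_FUNCTION>
p
```

Leaving out stat, position, the coordinate function or the facet function falls back to the defaults, which is why most plots need only the data, the geom and the mappings.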




Original poster | Posted on 2018-9-8 11:15:31
Crying. I wrote this up so nicely in markdown, and it looks hideous once uploaded.