搜索
查看: 555|回复: 0

[R] <R for Data Science> 读书笔记8 | 第十二章: Tidy data

[复制链接]

12

主题

18

帖子

155

积分

注册会员

Rank: 2

积分
155
发表于 2018-11-23 15:17:50 | 显示全部楼层 |阅读模式
本帖最后由 冒刷 于 2018-11-23 15:23 编辑

12  Tidy data | tidyr package
12.1 准备工作
[AppleScript] 纯文本查看 复制代码
library(tidyverse)
12.2 整理数据
以下是对同一组数据(变量有country, year, population和cases)的四种不同整理方式(只有table1是整洁的,因只有table1的每一列是一个变量)

保持数据整洁的三条原则:
  • 每个变量必须有自己的列
  • 每个观测值必须有自己的行
  • 每个数值必须有对应的单元格

尤其在R的运用中,保持这种结构的一致性十分重要!dplyr, ggplot2和其他包都是针对这样的数据结构进行工作的。
[AppleScript] 纯文本查看 复制代码
table1[/size][/font][/color][/align]# A tibble: 6 x 4
#  country      year  cases population
#  <chr>       <int>  <int>      <int>
# 1 Afghanistan  1999    745   19987071
# 2 Afghanistan  2000   2666   20595360
# 3 Brazil       1999  37737  172006362
# 4 Brazil       2000  80488  174504898
# 5 China        1999 212258 1272915272
# 6 China        2000 213766 1280428583
library(ggplot2)
ggplot(table1, aes(year, cases)) + geom_line(aes(group = country), colour = "red") + geom_point(aes(colour = country)) 


12.3 Spreading and gathering
tidy数据时,找出哪些是变量,是否有一个变量在多列存在?找出哪些是观测值,是否同意观测值在多行存在?解决这两个问题是需要用到的函数有gather()和spread()。

12.3.1 Gathering
以table4a表为例,列名1999和列名2000不是变量名称,每一行不是一次观测值而是两次观测值。
[AppleScript] 纯文本查看 复制代码
#troubleshooting for table4a[/size][/font][/color][/align]table4a_2 <- table4a %>% gather(`1999`, `2000`, key = "year", value = "cases")
table4a_2
# A tibble: 6 x 3
#  country     year   cases
#  <chr>       <chr>  <int>
# 1 Afghanistan 1999     745
# 2 Brazil           1999   37737
# 3 China           1999  212258
# 4 Afghanistan  2000    2666
# 5 Brazil            2000   80488
# 6 China           2000  213766
# 由于1999和2000不是syntactic names(因为不是字母开头),所以须以反引号引用。

table4b
# A tibble: 3 x 3
#  country         `1999`     `2000`
# * <chr>            <int>      <int>
# 1 Afghanistan   19987071   20595360
# 2 Brazil       172006362  174504898
# 3 China       1272915272 1280428583
table4b_2 <- table4b %>% gather(`1999`, `2000`, key = "year", value = "population")
table4b_2
# A tibble: 6 x 3
#  country          year     population
#  <chr>             <chr>           <int>
# 1 Afghanistan 1999    19987071
# 2 Brazil           1999   172006362
# 3 China           1999  1272915272
# 4 Afghanistan 2000    20595360
# 5 Brazil           2000   174504898
# 6 China           2000  1280428583

# 使用dplyr::left_join()函数将table4a_2和table4b_2合并成一个tibble
dplyr::left_join(table4a_2, table4b_2)
# Joining, by = c("country", "year")
# A tibble: 6 x 4
#  country            year     cases      population
#  <chr>              <chr>   <int>        <int>
# 1 Afghanistan  1999     745         19987071
# 2 Brazil            1999     37737     172006362
# 3 China            1999     212258   1272915272
# 4 Afghanistan   2000    2666        20595360
# 5 Brazil             2000    80488      174504898
# 6 China            2000     213766    1280428583
12.3.2 Spreading
[AppleScript] 纯文本查看 复制代码
table2
# A tibble: 12 x 4
#   country           year     type               count
#   <chr>             <int>   <chr>              <int>
# 1 Afghanistan  1999    cases             745
# 2 Afghanistan  1999    population     19987071
# 3 Afghanistan  2000    cases             2666
# 4 Afghanistan  2000    population     20595360
# 5 Brazil           1999     cases             37737
# 6 Brazil           1999     population     172006362
# 7 Brazil           2000     cases             80488
# 8 Brazil           2000     population     174504898
# 9 China          1999      cases             212258
# 10 China        1999     population      1272915272
# 11 China        2000     cases              213766
# 12 China        2000     population      1280428583

# troubleshooting table2
table2 %>% spread(key = type, value = count)
# A tibble: 6 x 4
#  country            year   cases      population
#  <chr>              <int>   <int>       <int>
# 1 Afghanistan  1999    745        19987071
# 2 Afghanistan  2000    2666      20595360
# 3 Brazil            1999    37737    172006362
# 4 Brazil            2000    80488    174504898
# 5 China           1999     212258  1272915272
# 6 China           2000     213766  1280428583
12.4 Separating and uniting
[AppleScript] 纯文本查看 复制代码
table3
# A tibble: 6 x 3
#  country            year     rate             
#  * <chr>            <int>    <chr>            
# 1 Afghanistan  1999     745/19987071     
# 2 Afghanistan  2000     2666/20595360    
# 3 Brazil            1999     37737/172006362  
# 4 Brazil            2000     80488/174504898  
# 5 China           1999      212258/1272915272
# 6 China           2000      213766/1280428583
12.4.1 Separate
以分隔符为界对某一列进行分列,默认在遇到非字母数字符号时视作分隔符。也可使用sep进行分隔符定制。
[AppleScript] 纯文本查看 复制代码
table3 %>% separate(rate, into = c("cases", “population"))[/size][/font][/color][/align]# OR
table3 %>% separate(rate, into = c("cases", "population"), sep = “/")

# A tibble: 6 x 4
#  country           year    cases       population
#  * <chr>           <int>   <chr>         <chr>     
# 1 Afghanistan  1999     745        19987071  
# 2 Afghanistan  2000     2666      20595360  
# 3 Brazil            1999     37737    172006362 
# 4 Brazil            2000     80488    174504898 
# 5 China           1999      212258  1272915272
# 6 China           2000      213766  1280428583
12.4.2 Unite
Separate 的反面,将多列合并为一列。
[AppleScript] 纯文本查看 复制代码
table5[/size][/font][/color][/align]# A tibble: 6 x 4
#  country          century year  rate             
# * <chr>            <chr>   <chr> <chr>            
# 1 Afghanistan   19         99    745/19987071     
# 2 Afghanistan   20         00    2666/20595360    
# 3 Brazil             19         99    37737/172006362  

table5 %>% unite(new, century, year)
# A tibble: 6 x 3
#  country               new     rate             
#  <chr>                 <chr>   <chr>            
# 1 Afghanistan      19_99   745/19987071     
# 2 Afghanistan      20_00   2666/20595360    
# 3 Brazil                19_99   37737/172006362  

table5 %>% unite(new, century, year, sep = "")
# A tibble: 6 x 3
#  country            new     rate             
#  <chr>              <chr>   <chr>            
# 1 Afghanistan   1999    745/19987071     
# 2 Afghanistan    2000   2666/20595360    
# 3 Brazil             1999    37737/172006362  
12.5 Missing values
[AppleScript] 纯文本查看 复制代码
stocks <- tibble([/size][/font][/color][/align]  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)
stocks %>% spread(year, return)
# make the implicit missing value explicit by putting years in the columns
# A tibble: 4 x 3
#      qtr `2015` `2016`
#    <dbl>  <dbl>  <dbl>
# 1     1   1.88   NA   
# 2     2   0.59   0.92
# 3     3   0.35   0.17
# 4     4   NA     2.66
stocks %>% spread(year, return) %>% gather(year, return, `2015`:`2016`, na.rm = TRUE)
# turn explicit missing values implicit
# A tibble: 6 x 3
#      qtr year  return
# * <dbl> <chr>  <dbl>
# 1     1 2015    1.88
# 2     2 2015    0.59
# 3     3 2015    0.35
# 4     2 2016    0.92
# 5     3 2016    0.17
# 6     4 2016    2.66
stocks %>% complete(year, qtr)
# make missing values explicit in tidy data
# A tibble: 8 x 3
#     year   qtr return
#    <dbl> <dbl>  <dbl>
# 1  2015     1   1.88
# 2  2015     2   0.59
# 3  2015     3   0.35
# 4  2015     4  NA   
# 5  2016     1  NA   
# 6  2016     2   0.92
# 7  2016     3   0.17
# 8  2016     4   2.66
这里有两处数据缺失
  • Explicitly missing: 2015年第四季度的return变量明确地为NA缺失
  • Implicitly missing: 2016年第一季度没有return变量的记录
    [AppleScript] 纯文本查看 复制代码
    Treatment <- tribble(
         ~person,       ~Treatment, ~response,
         "Rick Whit",     1,                7,
         NA,                 2,               10,
         NA,                 3,               9,
         "kath Burke",   1,               4,
     )
    Treatment
    # A tibble: 4 x 3
    #  person         Treatment response
    #  <chr>             <dbl>    <dbl>
    # 1 Rick Whit          1        7
    # 2 NA                    2       10
    # 3 NA                    3        9
    # 4 kath Burke        1        4
    Treatment %>% fill(person)
    # A tibble: 4 x 3
    #  person        Treatment response
    #  <chr>            <dbl>    <dbl>
    # 1 Rick Whit          1        7
    # 2 Rick Whit          2       10
    # 3 Rick Whit          3        9
    # 4 kath Burke        1        4
    # fill()会将距离最近找到的缺失值邻近的未缺失值填补当前遇到的缺失值
    12.6 Case study (an example to tidy the dataset)
    [AppleScript] 纯文本查看 复制代码
    tidyr::who
    # tidyr::who是取自14年WHO全球结核报告的数据集,数据十分杂乱,需要清理!
    who1 <- who %>% gather(new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE)
    who1 %>% count(key)
    # A tibble: 56 x 2
    #        key                    n
    #      <chr>                <int>
    # 1 new_ep_f014      1032
    # 2 new_ep_f1524    1021
    # 3 new_ep_f2534    1021
    # 4 new_ep_f3544    1021
    # 5 new_ep_f4554    1017
    # 6 new_ep_f5564    1017
    # 7 new_ep_f65        1014
    # 8 new_ep_m014    1038
    # 9 new_ep_m1524  1026
    #10 new_ep_m2534 1020
    # ... with 46 more rows
    who2 <- who1 %>% mutate(key = stringr::str_replace(key, "newrel", “new_rel"))
    who3 <- who2 %>% separate(key, c("new", "type", "sexage"), sep = “_")
    who4 <- who3 %>% select(-new, -iso2, -iso3)
    who5 <- who4 %>% separate(sexage, c("sex", "age"), sep = 1)
    # separate sexage into sex and age by splitting after the first character


本帖子中包含更多资源

您需要 登录 才可以下载或查看,没有帐号?立即注册

x



上一篇:R数据科学(十) 使用stringr处理字符串
下一篇:linux两个文件内容的匹配,并替换
回复

使用道具 举报

您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

QQ|手机版|小黑屋|生信技能树 ( 粤ICP备15016384号  

GMT+8, 2019-10-19 01:24 , Processed in 0.030448 second(s), 27 queries .

Powered by Discuz! X3.2

© 2001-2013 Comsenz Inc.