本帖最后由 冒刷 于 2018-12-4 11:01 编辑
13 Relational data | relational database management (RDBMS) 13.1 Intro
通常我们需要对多个表格进行合并处理,多个表格的数据称为relational data,构建两两表格间的关系。 ● Mutating joins, 根据一个表格的匹配观测值中添加新变量给另一个表格。 ● Filtering joins, 根据一个表格的观测值是否匹配过滤另一个表格的数据。
13.1.1 准备 library(tidyverse) library(nycflights13)
13.2 nycflights13 上图为"nycflights13包"含有的表格,以及表格间的数据关联性,建立两两表格间寻找关系的习惯,focus到表格间感兴趣的变量的关系建立。
13.3 Keys 用于连接两两表格关系的变量称为keys。一个key由单个/多个变量组成,能够唯一标识观测值,如planes能够通过tailnum唯一(特异)标识,或者weather能够通过year,month,day,hour和origin唯一(特异)标识。 有两种类型的keys: ● primary key: 在自己的表格中特异标识一个观测值。如weather$origin。 ● foreign key: 在别的表格中特异标识一个观测值。如airports$faa。
确定primary keys是否特异标识某一观测值的示例:[AppleScript] 纯文本查看 复制代码 weather %>% count(year, month, day, hour, origin) %>% filter(n>1)[/color][/size][/font][/align][font=Helvetica][size=2]# use filter() to find rows/cases where conditions are true.
# A tibble: 3 x 6
# year month day hour origin n
# <dbl> <dbl> <int> <int> <chr> <int>
# 1 2013 11 3 1 EWR 2
# 2 2013 11 3 1 JFK 2
# 3 2013 11 3 1 LGA 2
flights表格很难简单通过变量组合达到唯一标识某一观测值。[AppleScript] 纯文本查看 复制代码 flights %>% count(year, month, day, flight) %>% filter(n>1)[/color][/size][/font][/align][font=Helvetica][size=2]# A tibble: 29,768 x 5
# year month day flight n
# <int> <int> <int> <int> <int>
# 1 2013 1 1 1 2
# 2 2013 1 1 3 2
# 3 2013 1 1 4 2
# 4 2013 1 1 11 3
# 5 2013 1 1 15 2
# 6 2013 1 1 21 2
# 7 2013 1 1 27 4
# 8 2013 1 1 31 2
# 9 2013 1 1 32 2
#10 2013 1 1 35 2
# ... with 29,758 more rows
# issue: 某一架飞机一天之内只飞一次的概率极低,这里的flights表格缺少primary key
解决办法:使用mutate()和row_number()增加一个primary key,称为surrogate key。
13.4 Mutating joins | 用于组合一对表格的数据[AppleScript] 纯文本查看 复制代码 flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier)[/color][/size][/font][/align][font=Helvetica][size=2]flights2
# A tibble: 336,776 x 8
# year month day hour origin dest tailnum carrier
# <int> <int> <int> <dbl> <chr> <chr> <chr> <chr>
# 1 2013 1 1 5 EWR IAH N14228 UA
# 2 2013 1 1 5. LGA IAH N24211 UA
# 3 2013 1 1 5 JFK MIA N619AA AA
# 4 2013 1 1 5 JFK BQN N804JB B6
# 5 2013 1 1 6 LGA ATL N668DN DL
# 6 2013 1 1 5 EWR ORD N39463 UA
# 7 2013 1 1 6 EWR FLL N516JB B6
# 8 2013 1 1 6 LGA IAD N829AS EV
# 9 2013 1 1 6 JFK MCO N593JB B6
#10 2013 1 1 6 LGA ORD N3ALAA AA
# ... with 336,766 more rows
# add the full airline name to the flights2 data
flights2 %>% select(-origin, -dest) %>% left_join(airlines, by = "carrier")
# A tibble: 336,776 x 7
# year month day hour tailnum carrier name
# <int> <int> <int> <dbl> <chr> <chr> <chr>
# 1 2013 1 1 5 N14228 UA United Air Lines Inc.
# 2 2013 1 1 5 N24211 UA United Air Lines Inc.
# 3 2013 1 1 5 N619AA AA American Airlines Inc.
# 4 2013 1 1 5 N804JB B6 JetBlue Airways
# 5 2013 1 1 6 N668DN DL Delta Air Lines Inc.
# 6 2013 1 1 5 N39463 UA United Air Lines Inc.
# 7 2013 1 1 6 N516JB B6 JetBlue Airways
# 8 2013 1 1 6 N829AS EV ExpressJet Airlines Inc.
# 9 2013 1 1 6 N593JB B6 JetBlue Airways
#10 2013 1 1 6 N3ALAA AA American Airlines Inc.
# ... with 336,766 more rows
# 还可使用mutate()达到同样的目的
flights2 %>% select(-origin, -dest) %>% mutate(name = airlines$name[match(carrier, airlines$carrier)])
# A tibble: 336,776 x 7
# year month day hour tailnum carrier name
# <int> <int> <int> <dbl> <chr> <chr> <chr>
# 1 2013 1 1 5 N14228 UA United Air Lines Inc.
# 2 2013 1 1 5 N24211 UA United Air Lines Inc.
# 3 2013 1 1 5 N619AA AA American Airlines Inc.
# 4 2013 1 1 5 N804JB B6 JetBlue Airways
# 5 2013 1 1 6 N668DN DL Delta Air Lines Inc.
# 6 2013 1 1 5 N39463 UA United Air Lines Inc.
# 7 2013 1 1 6 N516JB B6 JetBlue Airways
# 8 2013 1 1 6 N829AS EV ExpressJet Airlines Inc.
# 9 2013 1 1 6 N593JB B6 JetBlue Airways
#10 2013 1 1 6 N3ALAA AA American Airlines Inc.
# ... with 336,766 more rows 13.4.1 理解joins
join通过指定的key完成匹配,num of dots = num of matches = num of rows in the output.
13.4.2 Inner join (仅保留两个表格中均出现的观测值)
[AppleScript] 纯文本查看 复制代码 x %>% inner_join(y, by = "key")[/size][/font][/align][font=Helvetica][size=2]# A tibble: 2 x 3
# key val_x val_y
# <dbl> <chr> <chr>
# 1 1 x1 y1
# 2 2 x2 y2
# R use by to tell dplyr which variable is the key
# inner join discards unmatched rows in the output ( not appropriate for use in analysis due to high probability of losing observations !!! )
13.4.3 Outer joins (保留在两个表格中至少出现过一次的观测值) 三种类型的outer joins: ● Left join: 保留表格x的所有观测值 (最常用) ● right join: 保留表格y的所有观测值 ● full join: 保留两个表格的所有观测值
13.4.4 Duplicate keys - 其中一个表格有重复变量,在一对多的关系中常见
[AppleScript] 纯文本查看 复制代码 left_join(x, y, by = “key”)
# A tibble: 4 x 3
# key val_x val_y
# <dbl> <chr> <chr>
# 1 1 x1 y1
# 2 2 x2 y2
# 3 2 x3 y2
# 4 1 x4 y1
2. 两个表格均有重复变量,通常是个错误,因为两个表格,keys都不能唯一标识一个观测值,当join时,得到的是所有可能的组合……[AppleScript] 纯文本查看 复制代码 left_join(x, y, by = “key”)[/size][/font]
[font=Helvetica][size=2]# A tibble: 6 x 3
# key val_x val_y
# <dbl> <chr> <chr>
# 1 1 x1 y1
# 2 2 x2 y2
# 3 2 x2 y3
# 4 2 x3 y2
# 5 2 x3 y3
# 6 3 x4 y4
13.4.5 Defining the key columns ● 默认by = NULL,代表使用两个表格中的所有变量进行join[AppleScript] 纯文本查看 复制代码 flights2 %>% left_join(weather) ● 字符向量,by = “x”,仅使用x变量对两个表格进行join[AppleScript] 纯文本查看 复制代码 flights2 %>% left_join(planes, by = “tailnum”)[/color][/size][/font][/align][font=Helvetica][size=2]# 两个表格的year并不是一致的对应关系,指定唯一的tailnum对表格进行合并 ● 命名的字符向量,by = c(“a” = “b”),匹配x表格的变量a和y表格的变量b[AppleScript] 纯文本查看 复制代码 flights2 %>% left_join(airports, c("dest" = “faa"))[/color][/size][/font][/align][font=Helvetica][size=2]flights2 %>% left_join(airports, c("origin" = “faa”))
13.5 Filtering joins ● semi_join(x, y): 保留x表格中所有和y表格匹配的观测值 - anti_join(x, y): 丢弃x表格中所有和y表格匹配的观测值
[AppleScript] 纯文本查看 复制代码 # find the top three most popular destinations
top_dest <- flights %>% count(dest, sort = TRUE) %>% head(3)
top_dest
# A tibble: 3 x 2
# dest n
# <chr> <int>
# 1 ORD 17283
# 2 ATL 17215
# 3 LAX 16174
# find each flight that went to one of above destinations
flights %>% filter(dest %in% top_dest$dest) [AppleScript] 纯文本查看 复制代码 flights %>% semi_join(top_dest)
[AppleScript] 纯文本查看 复制代码 # anti-join
flights %>% anti_join(planes, by = "tailnum") %>% count(tailnum, sort = TRUE)
13.6 偶尔过滤时有用的操作- 前提:当希望将一个复杂的过滤器分解为简单部分时,所有以下操作都基于一个完整的行,比较每个变量的值,x和y表格有相同的变量。
intersect(x, y): 返回x和y表格的交集 union(x, y): 返回在x和y表格中唯一特异的观测值 setdiff(x, y): 返回在x表格中有,但在y表格中没有的观测值 [AppleScript] 纯文本查看 复制代码 df1 <- tribble([/size][/font][/align]
[*][font=Helvetica][size=2] ~x, ~y,
1, 1,
2, 1
)
df2 <- tribble(
~x, ~y,
1, 1,
1, 2
)
intersect(df1, df2)
# A tibble: 1 x 2
# x y
# <dbl> <dbl>
# 1 1 1
union(df1, df2)
# A tibble: 3 x 2
# x y
# <dbl> <dbl>
# 1 1 2
# 2 2 1
# 3 1 1
setdiff(df1, df2)
# A tibble: 1 x 2
# x y
# <dbl> <dbl>
# 1 2 1
setdiff(df2, df1)
# A tibble: 1 x 2
# x y
# <dbl> <dbl>
# 1 1 2
|