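Text processing is something every programming language should support, but not every language focuses on it. R is statistical software, and text processing is not its main strength. Even so, R's base functions, or the stringr package combined with regular expressions, cover the common string operations.
This article shows how to carry out common string operations with R's base functions and with the functions in the stringr package. Note two major differences between the base functions and the stringr functions.
- Argument order differs. Most base functions take the processing rule (the pattern) as the first argument and the object to process as the second; stringr functions take the object to process first and the rule second.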
grep(pattern, x, ...)
str_detect(string, pattern, ...)
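- Execution semantics differ. Many base functions expect the processing rule to be a single element; even when a call with several elements succeeds, usually only the first one is used. stringr functions generally operate on several elements: the shorter vector is recycled until the lengths agree, and the operation is then applied position by position.
The common string operations covered here are: length calculation, case conversion, sorting, whitespace trimming, duplication, concatenation, splitting, extraction, matching, and replacement. Only the behavior of the functions is shown; the rules of regular expressions are not covered.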
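The two differences can be illustrated with a minimal sketch. The vectors `x` and `patterns` are made up for illustration, and the `mapply` loop is only one possible base R workaround for pairing patterns with strings. The stringr call is guarded because it assumes the package is installed.

```r
x <- c("apple", "banana", "pear")   # illustrative data
patterns <- c("app", "xyz", "ear")  # one hypothetical pattern per element of x

# Base R: the pattern comes first and must be a single string.
one_pattern <- grepl("an", x, fixed = TRUE)

# Pairing pattern i with string i in base R needs an explicit loop.
pairwise <- mapply(grepl, patterns, x,
                   MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)

print(one_pattern)  # FALSE  TRUE FALSE
print(pairwise)     #  TRUE FALSE  TRUE

# stringr: the data comes first; equal-length patterns are paired directly.
if (requireNamespace("stringr", quietly = TRUE)) {
  print(stringr::str_detect(x, patterns))
}
```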
String length
- Length of a character vector: length
length returns the number of elements in a character vector, not the number of characters in a string.
name = c("李白","杜甫", "Shakespeare")
length(name)
## [1] 3
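- Length of a string: nchar, str_length and str_count
nchar, str_length and str_count all compute the length of a string. Because operations in R are usually vectorized, each of them returns the length of every element when given a character vector.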
library(stringr)
nchar(name)
## [1] 2 2 11
str_length(name)
## [1] 2 2 11
str_count(name)
## [1] 2 2 11
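Although str_count can compute string lengths, it is more often used to count how many times a given pattern occurs. It works as described earlier: the shorter vector is recycled until the lengths agree, and the pattern is then counted in the string at the same position.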
fruit <- c("apple", "banana", "pear", "pineapple")
str_count(fruit) # number of characters in each element
## [1] 5 6 4 9
str_count(fruit, "a") # "a" is recycled to length 4 and counted in the element at each position
## [1] 1 3 1 1
str_count(fruit, "p")
## [1] 2 0 1 3
str_count(fruit, "e")
## [1] 1 0 1 2
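str_count(fruit, c("a", "b", "p", "p")) # patterns are paired with elements position by position
## [1] 1 1 1 3
str_count(c("a.", "...", ".a.a"), "\\.") # count the literal "." with an escaped regex
## [1] 1 3 2
str_count(c("a.", "...", ".a.a"), fixed(".")) # count the literal "." with fixed()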
## [1] 1 3 2
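For comparison, occurrences can also be counted in base R. The helper `count_fixed` below is a hypothetical name introduced for this sketch; it treats the pattern as literal text rather than as a regular expression.

```r
fruit <- c("apple", "banana", "pear", "pineapple")

# Hypothetical helper: count literal occurrences of `pattern` in each element.
count_fixed <- function(x, pattern) {
  m <- gregexpr(pattern, x, fixed = TRUE)  # positions of all matches per element
  lengths(regmatches(x, m))                # number of matches per element
}

a_counts   <- count_fixed(fruit, "a")
dot_counts <- count_fixed(c("a.", "...", ".a.a"), ".")

print(a_counts)    # 1 3 1 1
print(dot_counts)  # 1 3 2
```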
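Case conversion
- tolower: converts the elements of a vector to lower case
- toupper: converts the elements of a vector to upper case
- casefold: converts the elements of a vector to lower or upper case (upper = FALSE gives lower case; upper = TRUE gives upper case)
- chartr: translates characters according to a specified mapping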
x <- c("Hellow", "World", "!")
tolower(x)
## [1] "hellow" "world" "!"
toupper(x)
## [1] "HELLOW" "WORLD" "!"
casefold(x) # upper = FALSE by default
## [1] "hellow" "world" "!"
casefold(x, upper = T)
## [1] "HELLOW" "WORLD" "!"
chartr('ol', 'pm', x) # o becomes p, l becomes m
## [1] "Hemmpw" "Wprmd" "!"
DNA <- "AtGCtttACC" # DNA is a character vector of length 1
tolower(DNA)
## [1] "atgctttacc"
toupper(DNA)
## [1] "ATGCTTTACC"
chartr("Tt", "Uu", DNA) # T becomes U, t becomes u
## [1] "AuGCuuuACC"
chartr("Tt", "UU", DNA)
## [1] "AUGCUUUACC"
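Because chartr maps every character in a single pass, it can also swap characters for one another. As a small sketch, assuming an upper-case DNA string (the variable names are illustrative), the complement and reverse complement can be computed like this:

```r
dna <- "ATGCTTTACC"

# Swap each base for its complement in one pass: A<->T, G<->C.
complement <- chartr("ATGC", "TACG", dna)

# Reverse the complemented string character by character.
reverse_complement <- paste(rev(strsplit(complement, "")[[1]]), collapse = "")

print(complement)          # "TACGAAATGG"
print(reverse_complement)  # "GGTAAAGCAT"
```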
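String sorting functions: sort, str_sort, order and str_order
order and str_order return the index positions of the strings in sorted order;
sort and str_sort return the sorted strings themselves.
Note that the base functions order and sort and the stringr functions str_sort and str_order use slightly different default sorting rules: the base functions follow the collation of the current locale, while the stringr functions default to locale = "en".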
order(name)
## [1] 3 2 1
str_order(name)
## [1] 3 1 2
sort(name)
## [1] "Shakespeare" "杜甫" "李白"
str_sort(name)
## [1] "Shakespeare" "李白" "杜甫"
In addition, str_order and str_sort can order digits embedded in strings by their numeric value when numeric = TRUE is set.
x = c("R1", "R3", "R11", "R4")
str_order(x)
## [1] 1 3 2 4
str_order(x, numeric = T)
## [1] 1 2 4 3
str_sort(x)
## [1] "R1" "R11" "R3" "R4"
str_sort(x, numeric = T)
## [1] "R1" "R3" "R4" "R11"
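The relationship between order and sort, and a base R way of getting the numeric ordering, can be sketched as follows. Stripping the "R" prefix with sub is an assumption that only fits labels of this exact shape.

```r
x <- c("R1", "R3", "R11", "R4")

# order() gives the permutation; indexing with it reproduces sort().
idx <- order(x)
same <- identical(x[idx], sort(x))

# Numeric ordering in base R: strip the "R" prefix and order by the number.
num <- as.numeric(sub("^R", "", x))
natural <- x[order(num)]

print(same)     # TRUE
print(natural)  # "R1" "R3" "R4" "R11"
```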
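Whitespace trimming function: str_trim
str_trim removes whitespace from strings. The side argument selects whether to trim the start ("left"), the end ("right"), or both (the default). Whitespace in the middle of a string is never removed.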
fruit = c(" apple", "pear ", "ban ana")
str_trim(fruit)
## [1] "apple" "pear" "ban ana"
str_trim(fruit, side = "left")
## [1] "apple" "pear " "ban ana"
str_trim(fruit, side = "right")
## [1] " apple" "pear" "ban ana"
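Base R offers trimws for the same job, and interior spaces can be removed with gsub when that is really what is wanted. A minimal sketch using the same data:

```r
fruit <- c(" apple", "pear ", "ban ana")

trimmed  <- trimws(fruit)                       # both sides, like str_trim(fruit)
left     <- trimws(fruit, which = "left")       # leading whitespace only
no_space <- gsub(" ", "", fruit, fixed = TRUE)  # every space, including interior ones

print(trimmed)   # "apple" "pear" "ban ana"
print(left)      # "apple" "pear " "ban ana"
print(no_space)  # "apple" "pear" "banana"
```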
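String duplication functions: rep and str_dup
Both rep and str_dup duplicate strings. rep repeats the elements of the vector, so the vector gets longer; str_dup repeats the content of each element, so the length of the vector stays the same.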
rep(c("mn", "xy", "abc", "ef"), 1:4)
## [1] "mn" "xy" "xy" "abc" "abc" "abc" "ef" "ef" "ef" "ef"
str_dup(c("mn", "xy", "abc", "ef"), 1:4)
## [1] "mn" "xyxy" "abcabcabc" "efefefef"
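Base R (3.3.0 and later) has strrep, which behaves like str_dup. The sketch below also contrasts the times and each arguments of rep; the small vectors are illustrative.

```r
x <- c("mn", "xy", "abc", "ef")

dup_each <- strrep(x, 1:4)                 # repeat the content of each element
by_times <- rep(c("a", "b"), times = 2)    # repeat the whole vector
by_each  <- rep(c("a", "b"), each = 2)     # repeat each element in place

print(dup_each)  # "mn" "xyxy" "abcabcabc" "efefefef"
print(by_times)  # "a" "b" "a" "b"
print(by_each)   # "a" "a" "b" "b"
```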
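String concatenation functions: paste and str_c
Both the base function paste and stringr's str_c concatenate strings. By default paste separates the pieces with a single space (" "), while str_c uses the empty string (""). The sep argument sets the separator.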
paste("A", 1:4)
## [1] "A 1" "A 2" "A 3" "A 4"
paste("A", 1:4, sep = "-")
## [1] "A-1" "A-2" "A-3" "A-4"
paste("A", 1:4, sep = "-", collapse = "+")
## [1] "A-1+A-2+A-3+A-4"
paste(c("A","B", NA, "C"), 1:4) # NA is pasted as the string "NA"
## [1] "A 1" "B 2" "NA 3" "C 4"
str_c(c("A","B", NA, "C"), 1:4) # NA propagates: the result is NA
## [1] "A1" "B2" NA "C4"
paste has one more use: setting the collapse argument joins the results into a single string.
y = c("张三", "李四", "王五", "赵六")
paste(x, y, sep = "-", collapse = "; ")
## [1] "R1-张三; R3-李四; R11-王五; R4-赵六"
paste(x, collapse = "; ")
## [1] "R1; R3; R11; R4"
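When no separator is wanted, paste0 is a shortcut for paste(..., sep = ""). If NA should stay NA, as it does with str_c, one possible base R approach is to mask the result with ifelse. The variable names below are illustrative.

```r
labels <- c("A", "B", NA, "C")

# paste0() is paste() with sep = ""; NA still becomes the text "NA".
plain <- paste0(labels, 1:4)

# Keep missing values missing, mimicking the behaviour of str_c().
na_safe <- ifelse(is.na(labels), NA_character_, paste0(labels, 1:4))

print(plain)    # "A1" "B2" "NA3" "C4"
print(na_safe)  # "A1" "B2" NA   "C4"
```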
String splitting functions: strsplit, str_split and str_split_fixed
strsplit, str_split and str_split_fixed all split strings, but strsplit and str_split return a list, whereas str_split_fixed returns a matrix.
author = c(" 鲁 迅", "贾 平 凹 ", " 余 秋 雨 ", "司马 相 如")
strsplit(author, " ")
## [[1]]
## [1] "" "鲁" "迅"
##
## [[2]]
## [1] "贾" "平" "凹"
##
## [[3]]
## [1] "" "余" "秋" "雨"
##
## [[4]]
## [1] "司马" "相" "如"
str_split(author, " ")
## [[1]]
## [1] "" "鲁" "迅"
##
## [[2]]
## [1] "贾" "平" "凹" ""
##
## [[3]]
## [1] "" "余" "秋" "雨" ""
##
## [[4]]
## [1] "司马" "相" "如"
str_split_fixed(author, " ", n = 3)
## [,1] [,2] [,3]
## [1,] "" "鲁" "迅"
## [2,] "贾" "平" "凹 "
## [3,] "" "余" "秋 雨 "
## [4,] "司马" "相" "如"
unlist flattens the list returned by strsplit or str_split into a vector.
unlist(strsplit(author, " "))
## [1] "" "鲁" "迅" "贾" "平" "凹" "" "余" "秋" "雨"
## [11] "司马" "相" "如"
unlist(str_split(author, " "))
## [1] "" "鲁" "迅" "贾" "平" "凹" "" "" "余" "秋"
## [11] "雨" "" "司马" "相" "如"
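Of the three splitting functions, str_split_fixed returns a character matrix, which makes the individual pieces easy to index later. Both str_split and str_split_fixed have an argument n: in str_split it is optional and the result is always a list, while in str_split_fixed it is required. When n is smaller than the number of pieces a string would split into, the remainder is left unsplit in the last piece; when n is larger, str_split_fixed fills the extra columns with empty strings.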
str_split(author," ")
## [[1]]
## [1] "" "鲁" "迅"
##
## [[2]]
## [1] "贾" "平" "凹" ""
##
## [[3]]
## [1] "" "余" "秋" "雨" ""
##
## [[4]]
## [1] "司马" "相" "如"
str_split(author, " ", n = 2)
## [[1]]
## [1] "" "鲁 迅"
##
## [[2]]
## [1] "贾" "平 凹 "
##
## [[3]]
## [1] "" "余 秋 雨 "
##
## [[4]]
## [1] "司马" "相 如"
str_split(author, " ", n = 5)
## [[1]]
## [1] "" "鲁" "迅"
##
## [[2]]
## [1] "贾" "平" "凹" ""
##
## [[3]]
## [1] "" "余" "秋" "雨" ""
##
## [[4]]
## [1] "司马" "相" "如"
str_split_fixed(name, " ", n = 3)
## [,1] [,2] [,3]
## [1,] "李白" "" ""
## [2,] "杜甫" "" ""
## [3,] "Shakespeare" "" ""
str_split_fixed(author, " ", n = 2)
## [,1] [,2]
## [1,] "" "鲁 迅"
## [2,] "贾" "平 凹 "
## [3,] "" "余 秋 雨 "
## [4,] "司马" "相 如"
str_split_fixed(author, " ", n = 5)
## [,1] [,2] [,3] [,4] [,5]
## [1,] "" "鲁" "迅" "" ""
## [2,] "贾" "平" "凹" "" ""
## [3,] "" "余" "秋" "雨" ""
## [4,] "司马" "相" "如" "" ""
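The empty strings in the results above come from leading and trailing spaces. One way to avoid them, sketched here in base R, is to trim before splitting and then pad the pieces into a rectangular matrix by hand. The padding step is an illustrative stand-in for what str_split_fixed does, not its actual implementation.

```r
author <- c(" 鲁 迅", "贾 平 凹 ", " 余 秋 雨 ", "司马 相 如")

# Trim first so that no empty pieces appear at either end.
parts <- strsplit(trimws(author), " ", fixed = TRUE)
n_parts <- lengths(parts)

# Pad every element with "" up to the longest one and bind into a matrix.
width <- max(n_parts)
mat <- t(vapply(parts,
                function(p) c(p, rep("", width - length(p))),
                character(width)))

print(n_parts)  # 2 3 3 3
print(mat)
```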
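String extraction
- substr(x, start, stop): extracts the substring of x from position start to position stop.
- substring(text, first, last = 1000000L): extracts the substring of text from first to last; last defaults to 1000000, so it may be omitted.
- str_sub(string, start = 1L, end = -1L): extracts the substring of string from start to end; start defaults to 1 and end to -1 (the last character), so both may be omitted.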
txt <- c("Hello, the World!","I'm Chinese", "I love China.", "我爱我的国!")
substr(txt, 1, 5)
## [1] "Hello" "I'm C" "I lov" "我爱我的国"
substring(txt, 1, 5)
## [1] "Hello" "I'm C" "I lov" "我爱我的国"
str_sub(txt, 1, 3)
## [1] "Hel" "I'm" "I l" "我爱我"
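substr(txt[1], c(1,2,3,4), c(2,3,4,5)) # only the first start/stop pair is used
## [1] "He"
substring(txt[1], c(1,2,3,4), c(2,3,4,5)) # the string is recycled to match the four start/stop pairs
## [1] "He" "el" "ll" "lo"
str_sub(txt[1], c(1,2,3,4), c(2,3,4,5)) # the string is recycled to match the four start/stop pairs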
## [1] "He" "el" "ll" "lo"
substr(txt, c(1,2,3,4), c(2,3,4,5))
## [1] "He" "'m" "lo" "的国"
substring(txt, c(1,2,3,4), c(2,3,4,5))
## [1] "He" "'m" "lo" "的国"
str_sub(txt, c(1,2,3,4), c(2,3,4,5))
## [1] "He" "'m" "lo" "的国"
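- strtrim(x, width): truncates each string in x from the start to the given display width; width is recycled to the length of x. A Chinese character typically has a display width of 2, so width has to be doubled accordingly.
- word(string, start = 1L, end = start, sep = fixed(" ")) from stringr: extracts words (substrings) from a sentence. string is a string or a character vector; start is a numeric vector giving the position of the first word to extract; end is a numeric vector giving the position of the last word; sep is the separator between words, a space by default.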
strtrim(txt, 7)
## [1] "Hello, " "I'm Chi" "I love " "我爱我"
word(txt, 2)
## [1] "the" "Chinese" "love" NA
strtrim(txt, c(1,2,3,4)) # one width per element
## [1] "H" "I'" "I l" "我爱"
word(txt, c(1,2)) # start is recycled to 1, 2, 1, 2
## [1] "Hello," "Chinese" "I" NA
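The default end = -1L of str_sub hints that negative positions count from the end of the string. As a small sketch with illustrative data, the last n characters can be taken either with that feature or, in base R, by computing the start position from nchar. The stringr call is guarded because it assumes the package is installed.

```r
x <- c("Hello", "World!")
n <- 3

# Base R: work out the start position from the string length.
tail_base <- substring(x, nchar(x) - n + 1)

print(tail_base)  # "llo" "ld!"

# stringr: a negative start counts from the end of the string.
if (requireNamespace("stringr", quietly = TRUE)) {
  print(stringr::str_sub(x, -n))
}
```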
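String matching
match, grep, grepl, str_detect, str_locate and str_locate_all, str_match and str_match_all, and str_extract and str_extract_all all search for matches in strings, but they differ in the details. match performs many-to-many exact matching, yet for each value it returns only the index of the first position matched; the other functions handle both exact and partial matches.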
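| Function | How it works | Result type |
|---|---|---|
| match | Many-to-many exact matching; returns only the index of the first position matched | Integer vector |
| grep | One pattern against many elements, exact or partial; returns the index of every matching element | Integer vector |
| grepl | One pattern against many elements, exact or partial; returns one logical value per element (the l in grepl stands for logical) | Logical vector |
| str_detect | Position by position (the shorter vector is recycled to the length of the longer one, likewise below), exact or partial; returns one logical value per position | Logical vector |
| str_locate | Position by position, finds the first exact or partial match; column 1 is the start of the match and column 2 its end; NA when nothing matches | Integer matrix |
| str_locate_all | Position by position, finds every exact or partial match and returns their locations | List of integer matrices |
| str_match | Position by position, finds the first exact or partial match and returns the matched text; NA when nothing matches | Character matrix |
| str_match_all | Position by position, finds every exact or partial match and returns the matched text | List of character matrices |
| str_extract | Position by position, finds the first exact or partial match and returns the matched text; NA when nothing matches. The values are the same as those of str_match | Character vector |
| str_extract_all | Position by position, finds every exact or partial match and returns the matched text | List of character vectors |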
match("cd", c("ab", "cd", "ef", "mn")) # exact match
## [1] 2
match("cd", c("ab", "cdx", "ef", "mn")) # partial matches are not found
## [1] NA
match("cd", c("ab", "cd", "ef", "cd", "mn")) # only the first matching position is returned
## [1] 2
match(c("cd", "mn"), c("ab", "cd", "ef", "mn"))
## [1] 2 4
match(c("cd", "mn"), c("ab", "cd", "ef", "cd", "mn"))
## [1] 2 5
grep("cd", c("cxcx", "mn", "mn", "cd", "ef", "mn", "cxcx")) # exact match
## [1] 4
grepl("cd", c("cxcx", "mn", "mn", "cd", "ef", "mn", "cxcx")) # exact match
## [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE
grep("cd", c("cxcx", "mn", "mn", "cdxcdx", "ef", "mn", "cxcx")) # partial match
## [1] 4
grepl("cd", c("cxcx", "mn", "mn", "cdxcdx", "ef", "mn", "cxcx")) # partial match
## [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE
grep("cd", c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx")) # one pattern against many elements, exact or partial; every match is reported
## [1] 1 4 7
grepl("cd", c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx")) # one pattern against many elements, exact or partial; every match is reported
## [1] TRUE FALSE FALSE TRUE FALSE FALSE TRUE
grep(c("cd", "mn"), c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx")) # several patterns supplied, but only the first one ("cd") is actually used
## Warning in grep(c("cd", "mn"), c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", :
## argument 'pattern' has length > 1 and only the first element will be used
## [1] 1 4 7
grepl(c("cd", "mn"), c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx")) # several patterns supplied, but only the first one ("cd") is actually used
## Warning in grepl(c("cd", "mn"), c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", :
## argument 'pattern' has length > 1 and only the first element will be used
## [1] TRUE FALSE FALSE TRUE FALSE FALSE TRUE
str_detect(c("cd", "mn"), c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx", "mn", "cd"))
## Warning in stri_detect_regex(string, pattern, negate = negate, opts_regex =
## opts(pattern)): longer object length is not a multiple of shorter object length
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
str_detect(c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx", "mn", "cd"), c("cd", "mn"))
## Warning in stri_detect_regex(string, pattern, negate = negate, opts_regex =
## opts(pattern)): longer object length is not a multiple of shorter object length
## [1] TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
str_locate(c("cd", "mn"), c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx", "mn", "cd"))
## Warning in stri_locate_first_regex(string, pattern, opts_regex = opts(pattern)):
## longer object length is not a multiple of shorter object length
## start end
## [1,] NA NA
## [2,] NA NA
## [3,] NA NA
## [4,] NA NA
## [5,] NA NA
## [6,] NA NA
## [7,] NA NA
## [8,] 1 2
## [9,] 1 2
str_locate(c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx", "mn", "cd"), c("cd", "mn"))
## Warning in stri_locate_first_regex(string, pattern, opts_regex = opts(pattern)):
## longer object length is not a multiple of shorter object length
## start end
## [1,] 1 2
## [2,] 1 2
## [3,] NA NA
## [4,] NA NA
## [5,] NA NA
## [6,] 1 2
## [7,] 1 2
## [8,] 1 2
## [9,] 1 2
str_locate_all(c("cd", "mn"), c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx", "mn", "cd"))
## Warning in stri_locate_all_regex(string, pattern, omit_no_match = TRUE, : longer
## object length is not a multiple of shorter object length
## [[1]]
## start end
##
## [[2]]
## start end
##
## [[3]]
## start end
##
## [[4]]
## start end
##
## [[5]]
## start end
##
## [[6]]
## start end
##
## [[7]]
## start end
##
## [[8]]
## start end
## [1,] 1 2
##
## [[9]]
## start end
## [1,] 1 2
str_locate_all(c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx", "mn", "cd"), c("cd", "mn"))
## Warning in stri_locate_all_regex(string, pattern, omit_no_match = TRUE, : longer
## object length is not a multiple of shorter object length
## [[1]]
## start end
## [1,] 1 2
## [2,] 4 5
##
## [[2]]
## start end
## [1,] 1 2
##
## [[3]]
## start end
##
## [[4]]
## start end
##
## [[5]]
## start end
##
## [[6]]
## start end
## [1,] 1 2
##
## [[7]]
## start end
## [1,] 1 2
## [2,] 4 5
##
## [[8]]
## start end
## [1,] 1 2
##
## [[9]]
## start end
## [1,] 1 2
str_match(c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx", "mn", "cd"), c("cdx", "mn"))
## Warning in stri_match_first_regex(string, pattern, opts_regex = opts(pattern)):
## longer object length is not a multiple of shorter object length
## [,1]
## [1,] "cdx"
## [2,] "mn"
## [3,] NA
## [4,] NA
## [5,] NA
## [6,] "mn"
## [7,] "cdx"
## [8,] "mn"
## [9,] NA
str_match(c("cdx", "mn"), c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx", "mn", "cd"))
## Warning in stri_match_first_regex(string, pattern, opts_regex = opts(pattern)):
## longer object length is not a multiple of shorter object length
## [,1]
## [1,] NA
## [2,] NA
## [3,] NA
## [4,] NA
## [5,] NA
## [6,] NA
## [7,] NA
## [8,] "mn"
## [9,] "cd"
str_match_all(c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx", "mn", "cd"), c("cdx", "mn"))
## Warning in stri_match_all_regex(string, pattern, omit_no_match = TRUE,
## opts_regex = opts(pattern)): longer object length is not a multiple of shorter
## object length
## [[1]]
## [,1]
## [1,] "cdx"
## [2,] "cdx"
##
## [[2]]
## [,1]
## [1,] "mn"
##
## [[3]]
## [,1]
##
## [[4]]
## [,1]
##
## [[5]]
## [,1]
##
## [[6]]
## [,1]
## [1,] "mn"
##
## [[7]]
## [,1]
## [1,] "cdx"
## [2,] "cdx"
##
## [[8]]
## [,1]
## [1,] "mn"
##
## [[9]]
## [,1]
str_match_all(c("cdx", "mn"), c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx", "mn", "cd"))
## Warning in stri_match_all_regex(string, pattern, omit_no_match = TRUE,
## opts_regex = opts(pattern)): longer object length is not a multiple of shorter
## object length
## [[1]]
## [,1]
##
## [[2]]
## [,1]
##
## [[3]]
## [,1]
##
## [[4]]
## [,1]
##
## [[5]]
## [,1]
##
## [[6]]
## [,1]
##
## [[7]]
## [,1]
##
## [[8]]
## [,1]
## [1,] "mn"
##
## [[9]]
## [,1]
## [1,] "cd"
str_extract(c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx", "mn", "cd"), c("cdx", "mn"))
## Warning in stri_extract_first_regex(string, pattern, opts_regex =
## opts(pattern)): longer object length is not a multiple of shorter object length
## [1] "cdx" "mn" NA NA NA "mn" "cdx" "mn" NA
str_extract(c("cdx", "mn"), c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx", "mn", "cd"))
## Warning in stri_extract_first_regex(string, pattern, opts_regex =
## opts(pattern)): longer object length is not a multiple of shorter object length
## [1] NA NA NA NA NA NA NA "mn" "cd"
str_extract_all(c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx", "mn", "cd"), c("cdx", "mn"))
## Warning in stri_extract_all_regex(string, pattern, simplify = simplify, : longer
## object length is not a multiple of shorter object length
## [[1]]
## [1] "cdx" "cdx"
##
## [[2]]
## [1] "mn"
##
## [[3]]
## character(0)
##
## [[4]]
## character(0)
##
## [[5]]
## character(0)
##
## [[6]]
## [1] "mn"
##
## [[7]]
## [1] "cdx" "cdx"
##
## [[8]]
## [1] "mn"
##
## [[9]]
## character(0)
str_extract_all(c("cdx", "mn"), c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx", "mn", "cd"))
## Warning in stri_extract_all_regex(string, pattern, simplify = simplify, : longer
## object length is not a multiple of shorter object length
## [[1]]
## character(0)
##
## [[2]]
## character(0)
##
## [[3]]
## character(0)
##
## [[4]]
## character(0)
##
## [[5]]
## character(0)
##
## [[6]]
## character(0)
##
## [[7]]
## character(0)
##
## [[8]]
## [1] "mn"
##
## [[9]]
## [1] "cd"
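Since grepl uses only the first pattern, testing several patterns against every string in base R takes an explicit loop. A minimal sketch with sapply, using a shortened version of the data above:

```r
x <- c("cdxcdx", "mny", "mn", "cd", "ef")
patterns <- c("cd", "mn")

# One column per pattern, one row per element of x.
hits <- sapply(patterns, grepl, x = x, fixed = TRUE)

# Elements that contain at least one of the patterns.
any_hit <- x[rowSums(hits) > 0]

print(hits)
print(any_hit)  # "cdxcdx" "mny" "mn" "cd"
```

Note that this checks every pattern against every string, which is different from the position-by-position pairing that the stringr functions perform. Pairing vectors of unequal length also relies on recycling that newer stringr releases restrict: as far as the stringr changelog indicates, from version 1.5.0 only length-1 vectors are recycled, so calls like the ones above may stop with an error instead of a warning.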
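String replacement functions: sub, gsub, str_replace and str_replace_all
Although sub, gsub, str_replace and str_replace_all are used for string replacement, strictly speaking R has no function that replaces text in place, because R passes arguments by value rather than by reference.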
text = c("Hellow, Adam Adam!", "Hi, Paul Adam !", "How are you, Adam, Ava.")
sub(pattern = "Adam", replacement = "world", text)
## [1] "Hellow, world Adam!" "Hi, Paul world !"
## [3] "How are you, world, Ava."
gsub(pattern = "Adam", replacement = "world", text)
## [1] "Hellow, world world!" "Hi, Paul world !"
## [3] "How are you, world, Ava."
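As shown above, the "replacement" leaves the original strings unchanged; to change the original variable, the result has to be assigned back to it. The difference between sub and gsub is that sub replaces only the first match in each string, however many there are, while gsub replaces every match.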
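The point about reassignment can be checked with a short sketch. The strings are a shortened, illustrative version of the data above.

```r
text <- c("Hello, Adam Adam!", "Hi, Paul Adam!")
original <- text

# sub() returns a new vector; `text` itself is untouched.
replaced <- sub("Adam", "world", text)
unchanged <- identical(text, original)

# Assign the result back to actually update the variable.
text <- gsub("Adam", "world", text)

print(replaced)   # "Hello, world Adam!" "Hi, Paul world!"
print(unchanged)  # TRUE
print(text)       # "Hello, world world!" "Hi, Paul world!"
```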
stringr has str_replace, which like sub replaces only the first match, and str_replace_all, which like gsub replaces every match.
sub(c("H"), c("I"),c("HacHgd", "aeHfgH", "defg"))
## [1] "IacHgd" "aeIfgH" "defg"
str_replace(c("HacHgd", "aeHfgH", "defg"), c("H"), c("I"))
## [1] "IacHgd" "aeIfgH" "defg"
gsub(c("H"), c("I"),c("HacHgd", "aeHfgH", "defg"))
## [1] "IacIgd" "aeIfgI" "defg"
str_replace_all(c("HacHgd", "aeHfgH", "defg"), c("H"), c("I"))
## [1] "IacIgd" "aeIfgI" "defg"
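Unlike sub and gsub, str_replace and str_replace_all are not limited to a single pattern: they can also take several patterns and replacements and pair them with the strings position by position. (The mechanism is the same as before: the shorter vectors are recycled before matching.)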
sub(c("H","a"), c("I", "b"),c("HacHgd", "aeHfgH", "defg")) # only "H" takes part in the replacement
## Warning in sub(c("H", "a"), c("I", "b"), c("HacHgd", "aeHfgH", "defg")):
## argument 'pattern' has length > 1 and only the first element will be used
## Warning in sub(c("H", "a"), c("I", "b"), c("HacHgd", "aeHfgH", "defg")):
## argument 'replacement' has length > 1 and only the first element will be used
## [1] "IacHgd" "aeIfgH" "defg"
str_replace(c("HacHgd", "aeHfgH", "defg"), c("H","a"), c("I", "b"))
## Warning in stri_replace_first_regex(string, pattern,
## fix_replacement(replacement), : longer object length is not a multiple of
## shorter object length
## [1] "IacHgd" "beHfgH" "defg"
gsub(c("H","a"), c("I", "b"),c("HacHgd", "aeHfgH", "defg"))
## Warning in gsub(c("H", "a"), c("I", "b"), c("HacHgd", "aeHfgH", "defg")):
## argument 'pattern' has length > 1 and only the first element will be used
## Warning in gsub(c("H", "a"), c("I", "b"), c("HacHgd", "aeHfgH", "defg")):
## argument 'replacement' has length > 1 and only the first element will be used
## [1] "IacIgd" "aeIfgI" "defg"
str_replace_all(c("HacHgd", "aeHfgH", "defg"), c("H","a"), c("I", "b"))
## Warning in stri_replace_all_regex(string, pattern,
## fix_replacement(replacement), : longer object length is not a multiple of
## shorter object length
## [1] "IacIgd" "beHfgH" "defg"
str_replace(c("HacHgd", "aeHfgH", "defg"), c("H","a","g", "d"), c("I", "b","H","e")) # the result now has length 4
## Warning in stri_replace_first_regex(string, pattern,
## fix_replacement(replacement), : longer object length is not a multiple of
## shorter object length
## [1] "IacHgd" "beHfgH" "defH" "HacHge"
str_replace_all(c("HacHgd", "aeHfgH", "defg"), c("H","a","g", "d"), c("I", "b","H","e")) # the result now has length 4
## Warning in stri_replace_all_regex(string, pattern,
## fix_replacement(replacement), : longer object length is not a multiple of
## shorter object length
## [1] "IacIgd" "beHfgH" "defH" "HacHge"
In addition, str_replace_all can apply several replacements to every string at once when it is given a named vector (str_replace does not offer this).
y = c(c("I", "b"))
names(y) = c("H","a")
str_replace_all(c("HacHgd", "aeHfgH", "defg"),y)
## [1] "IbcIgd" "beIfgI" "defg"
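This multiple-replacement feature of str_replace_all can sometimes give unexpected results, because the replacements are applied one after another, so a later rule can act on the output of an earlier one. mgsub::mgsub produces a different result.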
y = c(c("a", "H"))
names(y) = c("H","a")
str_replace_all(c("HacHgd", "aeHfgH", "defg"),y)
## [1] "HHcHgd" "HeHfgH" "defg"
mgsub::mgsub(c("HacHgd", "aeHfgH", "defg"), c("H","a"),c(c("a", "H")))
## [1] "aHcagd" "Heafga" "defg"
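The two results can be reproduced in base R, which may make the difference easier to see. This is a sketch, not how either package is implemented: a loop of gsub calls stands in for sequential replacement, and chartr, which only works for single-character rules, stands in for simultaneous replacement.

```r
x <- c("HacHgd", "aeHfgH", "defg")
rules <- c(H = "a", a = "H")  # name = pattern, value = replacement

# Sequential: each rule sees the output of the previous one,
# so the "a" produced by the first rule is turned back into "H".
sequential <- x
for (from in names(rules)) {
  sequential <- gsub(from, rules[[from]], sequential, fixed = TRUE)
}

# Simultaneous: every character is translated once, from the original text.
simultaneous <- chartr("Ha", "aH", x)

print(sequential)    # "HHcHgd" "HeHfgH" "defg"
print(simultaneous)  # "aHcagd" "Heafga" "defg"
```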