13 min read

R中字符串处理:函数实现

处理文本是每一种计算机语言都应该具备的功能,但不是每一种语言都侧重于处理文本。R是统计类软件,处理文本并不是它的主项。然而借助R的基础函数或者stringr包结合正则表达式可以实现字符串的常见处理。

本文主要介绍如何通过R语言的基础函数和stringr包中的函数实现字符串的常见处理。特别注意的的是R中的基础函数和stringr包函数有两个很大的不同。

  • 书写方式不同。大多数基础函数处理规则多作为第一参数,而被处理对象放在第二位置;stringr包中的函数被处理对象为第一参数,而处理规则作为第二参数。
grep(pattern, x, ...)
str_detect(string, pattern, ...)

字符串的常见处理主要有:字符串长度计算,大小写转化,排序,空格去除,复制,拼接,分割,提取,匹配查询,替换(这里只介绍函数的实现效果,而不涉及正则表达式原则)。 * 执行的原理不一致。有很多基础函数的处理规则往往是针对单元素的,即使强制用多元素能成功执行,但结果也往往只执行第一个元素;而stringr包中的函数通常可对多元素执行操作,执行操作时将短的字符串重复,长度一致后在相同位置执行。

字符串长度计算

  • 字符向量长度计算函数:length

函数length返回字符向量的长度,而非字符串中字符的长度。

name = c("李白","杜甫", "Shakespeare")
length(name)
## [1] 3
  • 字符串长度计算函数:nchar,str_length和str_count

函数nchar,str_length和str_count均可计算字符串的长度。由于R通常是向量化操作,所以nchar,str_length和str_count对于字符向量可以返回字符向量中每个元素的长度。

library(stringr)
nchar(name)
## [1]  2  2 11
str_length(name)
## [1]  2  2 11
str_count(name)
## [1]  2  2 11

尽管函数str_count可以现实字符向量中字符串长度的计算,但是更多的时候用来计算特定字符串出现的次数,其计算的原理前面已经提到过:串短的字符串重复,长度一致后在相同位置特定字符串个数的统计。

fruit <- c("apple", "banana", "pear", "pineapple")
str_count(fruit) # 字符向量中字符长度计算
## [1] 5 6 4 9
str_count(fruit, "a") # "a"重复四次,与fruit在相同位置,查询"a"出现的次数
## [1] 1 3 1 1
str_count(fruit, "a")
## [1] 1 3 1 1
str_count(fruit, "p")
## [1] 2 0 1 3
str_count(fruit, "e")
## [1] 1 0 1 2
str_count(fruit, c("a", "b", "p", "p")) # 按位置一一对应查询
## [1] 1 1 1 3

str_count(c("a.", "...", ".a.a"), "\\.") # 字符"."次数查询
## [1] 1 3 2
str_count(c("a.", "...", ".a.a"), fixed(".")) # 字符"."次数查询
## [1] 1 3 2

字符大小写转化

  • 函数tolower,将向量中的元素转化为小写字母
  • 函数toupper,将向量中的元素转化为大写字母
  • 函数casefold,将向量中的元素转化为小或大写字母(upper = F,转化为小写;upper = T,转化为大写)
  • 函数chartr,按指定的规则进行转换
x <- c("Hellow", "World", "!") 
tolower(x)
## [1] "hellow" "world"  "!"
toupper(x)
## [1] "HELLOW" "WORLD"  "!"
casefold(x) # 默认upper = F
## [1] "hellow" "world"  "!"
casefold(x, upper = T)
## [1] "HELLOW" "WORLD"  "!"
chartr('ol', 'pm', x) # o转化为p,l转化为m
## [1] "Hemmpw" "Wprmd"  "!"
DNA <- "AtGCtttACC" # DNA为长度为1的字符向量
tolower(DNA) 
## [1] "atgctttacc"
toupper(DNA) 
## [1] "ATGCTTTACC"
chartr("Tt", "Uu", DNA) # T转化为U,t转化为u
## [1] "AuGCuuuACC"
chartr("Tt", "UU", DNA) 
## [1] "AUGCUUUACC"

字符串排序函数:sort,str_sort和order和str_order

order和str_order按一定条件有序返回字符串在向量中位置的索引值;

sort和str_sort直接按一定条件有序返回字符串。

这里,R的基础函数order,sort与函数str_sort,str_order默认的排序规则是略有差异的。

order(name)
## [1] 3 2 1
str_order(name)
## [1] 3 1 2
sort(name)
## [1] "Shakespeare" "杜甫"        "李白"
str_sort(name)
## [1] "Shakespeare" "李白"        "杜甫"

此外,还需要强调一下str_order和str_sort函数可以对字符串中的数字按数字顺序处理。

x = c("R1", "R3", "R11", "R4")
str_order(x)
## [1] 1 3 2 4
str_order(x, numeric = T)
## [1] 1 2 4 3
str_sort(x)
## [1] "R1"  "R11" "R3"  "R4"
str_sort(x, numeric = T)
## [1] "R1"  "R3"  "R4"  "R11"

字符串中空格去除函数:str_trim

函数str_trim可以去除字符串中的空格,通过参数side设置去除字符串开头,结尾,结尾和开头中的空格,但不能去除字符串中间的空格。

fruit = c(" apple", "pear ", "ban ana")
str_trim(fruit)
## [1] "apple"   "pear"    "ban ana"
str_trim(fruit, side = "left")
## [1] "apple"   "pear "   "ban ana"
str_trim(fruit, side = "right")
## [1] " apple"  "pear"    "ban ana"

字符串复制函数:rep和str_dup

函数rep和str_dup均可对字符串进行复制。函数rep会使向量中元素个数重复,向量长度会增加;函数str_dup使向量中每个元素值重复,向量的长度不增加。

rep(c("mn", "xy", "abc", "ef"), 1:4)
##  [1] "mn"  "xy"  "xy"  "abc" "abc" "abc" "ef"  "ef"  "ef"  "ef"
str_dup(c("mn", "xy", "abc", "ef"), 1:4)
## [1] "mn"        "xyxy"      "abcabcabc" "efefefef"

字符串拼接函数:paste和str_c

R中基础函数paste和str_c都可以实现字符串的拼接,函数paste拼接时默认使用“,”,函数str_c拼接时默认使用""。参数sep可以设置拼接使用的拼接符。

paste("A", 1:4) 
## [1] "A 1" "A 2" "A 3" "A 4"
paste("A", 1:4, sep = "-") 
## [1] "A-1" "A-2" "A-3" "A-4"
paste("A", 1:4, sep = "-", collapse = "+") 
## [1] "A-1+A-2+A-3+A-4"
paste(c("A","B", NA, "C"), 1:4) # NA 参与拼接
## [1] "A 1"  "B 2"  "NA 3" "C 4"
str_c(c("A","B", NA, "C"), 1:4) # NA 不参与拼接
## [1] "A1" "B2" NA   "C4"

paste函数还有一个用法,设置collapse参数,连成一个字符串。

y = c("张三", "李四", "王五", "赵六")
paste(x, y, sep = "-", collapse = "; ")
## [1] "R1-张三; R3-李四; R11-王五; R4-赵六"
paste(x, collapse = "; ")
## [1] "R1; R3; R11; R4"

字符串分割函数:strsplit,str_split和str_split_fixed

函数strsplit,str_split和str_split_fixed均可实现字符串的分割,但strsplit和str_split返回结果为列表,而str_split_fixed返回结果为矩阵。

author = c(" 鲁 迅", "贾 平 凹 ", " 余 秋 雨 ", "司马 相 如")
strsplit(author, " ")
## [[1]]
## [1] ""   "鲁" "迅"
## 
## [[2]]
## [1] "贾" "平" "凹"
## 
## [[3]]
## [1] ""   "余" "秋" "雨"
## 
## [[4]]
## [1] "司马" "相"   "如"
str_split(author, " ")
## [[1]]
## [1] ""   "鲁" "迅"
## 
## [[2]]
## [1] "贾" "平" "凹" ""  
## 
## [[3]]
## [1] ""   "余" "秋" "雨" ""  
## 
## [[4]]
## [1] "司马" "相"   "如"
str_split_fixed(author, " ", n = 3)
##      [,1]   [,2] [,3]    
## [1,] ""     "鲁" "迅"    
## [2,] "贾"   "平" "凹 "   
## [3,] ""     "余" "秋 雨 "
## [4,] "司马" "相" "如"

函数unlist可将函数strsplit和str_split返回结果列表转化为向量。

unlist(strsplit(author, " "))
##  [1] ""     "鲁"   "迅"   "贾"   "平"   "凹"   ""     "余"   "秋"   "雨"  
## [11] "司马" "相"   "如"
unlist(str_split(author, " "))
##  [1] ""     "鲁"   "迅"   "贾"   "平"   "凹"   ""     ""     "余"   "秋"  
## [11] "雨"   ""     "司马" "相"   "如"

三个字符串分割函数strsplit,str_split和str_split_fixed中,str_split_fixed的返回结果为数据框,方便对后期结果的引用。此外,函数str_split和str_split_fixed中都有参数n,但str_split中的参数可设置也可不设置,函数返回结果依旧是列表;str_split_fixed中参数n必须设置。其中参数n小于最大分割个数时,后面的不再分隔;参数n超过最大分割数时,后面内容为空。

str_split(author," ") 
## [[1]]
## [1] ""   "鲁" "迅"
## 
## [[2]]
## [1] "贾" "平" "凹" ""  
## 
## [[3]]
## [1] ""   "余" "秋" "雨" ""  
## 
## [[4]]
## [1] "司马" "相"   "如"
str_split(author, " ", n = 2)
## [[1]]
## [1] ""      "鲁 迅"
## 
## [[2]]
## [1] "贾"     "平 凹 "
## 
## [[3]]
## [1] ""          "余 秋 雨 "
## 
## [[4]]
## [1] "司马"  "相 如"
str_split(author, " ", n = 5) 
## [[1]]
## [1] ""   "鲁" "迅"
## 
## [[2]]
## [1] "贾" "平" "凹" ""  
## 
## [[3]]
## [1] ""   "余" "秋" "雨" ""  
## 
## [[4]]
## [1] "司马" "相"   "如"
str_split_fixed(name, " ", n = 3)
##      [,1]          [,2] [,3]
## [1,] "李白"        ""   ""  
## [2,] "杜甫"        ""   ""  
## [3,] "Shakespeare" ""   ""
str_split_fixed(author, " ", n = 2)
##      [,1]   [,2]       
## [1,] ""     "鲁 迅"    
## [2,] "贾"   "平 凹 "   
## [3,] ""     "余 秋 雨 "
## [4,] "司马" "相 如"
str_split_fixed(author, " ", n = 5)
##      [,1]   [,2] [,3] [,4] [,5]
## [1,] ""     "鲁" "迅" ""   ""  
## [2,] "贾"   "平" "凹" ""   ""  
## [3,] ""     "余" "秋" "雨" ""  
## [4,] "司马" "相" "如" ""   ""

字符串提取

  • 函数substr(x, start,stop):对字符串x截取从start到stop的子字符串。
  • 函数substring(text,first, last = 1000000L):对字符串x截取从first到last的子字符串,last默认值为1000000,可以不传参。
  • str_sub(x, start = 1L, end = -1L):对字符串x截取从first到last的子字符串,last和end有默认值为,可以不传参。
txt <- c("Hello, the World!","I'm Chinese", "I love China.", "我爱我的国!")
substr(txt, 1, 5)
## [1] "Hello"      "I'm C"      "I lov"      "我爱我的国"
substring(txt, 1, 5)
## [1] "Hello"      "I'm C"      "I lov"      "我爱我的国"
str_sub(txt, 1, 3)
## [1] "Hel"    "I'm"    "I l"    "我爱我"

substr(txt[1], c(1,2,3,4), c(2,3,4,5)) # 只对第一个元素有效
## [1] "He"
substring(txt[1], c(1,2,3,4), c(2,3,4,5)) # 重复短元素,在相同位置匹配
## [1] "He" "el" "ll" "lo"
str_sub(txt[1],  c(1,2,3,4), c(2,3,4,5)) # 重复短元素,在相同位置匹配
## [1] "He" "el" "ll" "lo"
substr(txt, c(1,2,3,4), c(2,3,4,5))
## [1] "He"   "'m"   "lo"   "的国"
substring(txt, c(1,2,3,4), c(2,3,4,5))
## [1] "He"   "'m"   "lo"   "的国"
str_sub(txt,  c(1,2,3,4), c(2,3,4,5))
## [1] "He"   "'m"   "lo"   "的国"
  • 函数strtrim(x,width):对字符串x从开头截取指定width的子字符串,参数均可循环使用。对于中文字符,一个字符的长度为2,因此width也要设置为2倍宽度。
  • stringr包中的函数word(string,start = 1L, end = start, sep = fixed(" ")):用于从语句中提取单词(字符串)。string为字符串或字符串向量;start为数值向量给出提取的开始位置;end为数值向量给出提取的结束位置;sep为单词间分隔符,默认为空格。
strtrim(txt, 7) 
## [1] "Hello, " "I'm Chi" "I love " "我爱我"
word(txt, 2)
## [1] "the"     "Chinese" "love"    NA

strtrim(txt,  c(1,2,3,4)) # 重复短元素,在相同位置匹配
## [1] "H"    "I'"   "I l"  "我爱"
word(txt, c(1,2)) # 重复短元素,在相同位置匹配
## [1] "Hello,"  "Chinese" "I"       NA

字符串匹配查询

函数match、grep,grepl,str_detect,str_locate和str_locate_all,str_match和str_match_all均可实现字符串的匹配查询,但又略有不同。函数match可实现多对多的完全匹配,但同一个值仅能返回到第一次匹配到的位置索引值;其它函数则可实现完全或者部分匹配。

函数名称 操作原理 结果类型
match 多对多的完全匹配查询,仅返回第一次匹配到的位置索引值 数字型向量
grep 仅能一对多的部分或完全匹配查询,有几个返回几个位置索引值 数字型向量
grepl 仅能一对多的部分或完全匹配查询,有几个返回几个位置逻辑值,grepl中的l代表返回位置逻辑值 逻辑型向量
str_detect 在相同位置(短字符串自动重复跟长字符串长度相同,下同),完全或部分匹配查询,返回位置逻辑值 逻辑型向量
str_locate 在相同位置,查询第一次完全或部分匹配的字符串,返回位置索引值矩阵;第一列为匹配字符的起始位置,第二列为终止位置;不能匹配的返回NA 数字型矩阵
str_locate_all 在相同位置,查询所有完全或部分匹配的字符串,返回位置索引值列表 数字型列表
str_match 在相同位置,查询每个位置第一次完全或部分匹配的字符串,返回匹配到的字符串,不能匹配的返回NA 字符值矩阵
str_match_all 在相同位置,查询每个位置所有完全或部分匹配的字符串,返回匹配到的字符串 字符值列表
str_extract 在相同位置,查询每个位置第一次完全或部分匹配的字符串,返回匹配到的字符串,不能匹配的返回NA。与str_match结果值一样 字符值向量
str_extract_all 在相同位置,查询每个位置所有完全或部分匹配的字符串,返回匹配到的字符串 字符值列表
match("cd", c("ab", "cd", "ef", "mn"))  # 完全匹配
## [1] 2
match("cd", c("ab", "cdx", "ef", "mn")) # 部分匹配不能实现
## [1] NA
match("cd", c("ab", "cd", "ef", "cd", "mn")) # 仅返回第一次被匹配到的位置向量
## [1] 2
match(c("cd", "mn"), c("ab", "cd", "ef", "mn"))
## [1] 2 4
match(c("cd", "mn"), c("ab", "cd", "ef", "cd", "mn"))
## [1] 2 5
grep("cd", c("cxcx", "mn", "mn", "cd", "ef", "mn", "cxcx")) # 完全匹配
## [1] 4
grepl("cd", c("cxcx", "mn", "mn", "cd", "ef", "mn", "cxcx")) # 完全匹配
## [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
grep("cd", c("cxcx", "mn", "mn", "cdxcdx", "ef", "mn", "cxcx")) # 部分匹配
## [1] 4
grepl("cd", c("cxcx", "mn", "mn", "cdxcdx", "ef", "mn", "cxcx")) # 部分匹配
## [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
grep("cd", c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx")) # 一对多完全或部分匹配查询,有几个匹配几个
## [1] 1 4 7
grepl("cd", c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx")) # 一对多完全或部分匹配,有几个匹配几个
## [1]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
grep(c("cd", "mn"), c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx")) # 多对多完全或部分匹配查询,其实只有第一个"cd"参与了匹配
## Warning in grep(c("cd", "mn"), c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", :
## argument 'pattern' has length > 1 and only the first element will be used
## [1] 1 4 7
grepl(c("cd", "mn"), c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx")) # 多对多完全或部分匹配查询,其实只有第一个"cd"参与了匹配
## Warning in grepl(c("cd", "mn"), c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", :
## argument 'pattern' has length > 1 and only the first element will be used
## [1]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
str_detect(c("cd", "mn"), c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx", "mn", "cd")) 
## Warning in stri_detect_regex(string, pattern, negate = negate, opts_regex =
## opts(pattern)): longer object length is not a multiple of shorter object length
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
str_detect(c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx", "mn", "cd"), c("cd", "mn"))
## Warning in stri_detect_regex(string, pattern, negate = negate, opts_regex =
## opts(pattern)): longer object length is not a multiple of shorter object length
## [1]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
str_locate(c("cd", "mn"), c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx", "mn", "cd")) 
## Warning in stri_locate_first_regex(string, pattern, opts_regex = opts(pattern)):
## longer object length is not a multiple of shorter object length
##       start end
##  [1,]    NA  NA
##  [2,]    NA  NA
##  [3,]    NA  NA
##  [4,]    NA  NA
##  [5,]    NA  NA
##  [6,]    NA  NA
##  [7,]    NA  NA
##  [8,]     1   2
##  [9,]     1   2
str_locate(c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx", "mn", "cd"), c("cd", "mn"))
## Warning in stri_locate_first_regex(string, pattern, opts_regex = opts(pattern)):
## longer object length is not a multiple of shorter object length
##       start end
##  [1,]     1   2
##  [2,]     1   2
##  [3,]    NA  NA
##  [4,]    NA  NA
##  [5,]    NA  NA
##  [6,]     1   2
##  [7,]     1   2
##  [8,]     1   2
##  [9,]     1   2

str_locate_all(c("cd", "mn"), c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx", "mn", "cd")) 
## Warning in stri_locate_all_regex(string, pattern, omit_no_match = TRUE, : longer
## object length is not a multiple of shorter object length
## [[1]]
##      start end
## 
## [[2]]
##      start end
## 
## [[3]]
##      start end
## 
## [[4]]
##      start end
## 
## [[5]]
##      start end
## 
## [[6]]
##      start end
## 
## [[7]]
##      start end
## 
## [[8]]
##      start end
## [1,]     1   2
## 
## [[9]]
##      start end
## [1,]     1   2
str_locate_all(c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx", "mn", "cd"), c("cd", "mn"))
## Warning in stri_locate_all_regex(string, pattern, omit_no_match = TRUE, : longer
## object length is not a multiple of shorter object length
## [[1]]
##      start end
## [1,]     1   2
## [2,]     4   5
## 
## [[2]]
##      start end
## [1,]     1   2
## 
## [[3]]
##      start end
## 
## [[4]]
##      start end
## 
## [[5]]
##      start end
## 
## [[6]]
##      start end
## [1,]     1   2
## 
## [[7]]
##      start end
## [1,]     1   2
## [2,]     4   5
## 
## [[8]]
##      start end
## [1,]     1   2
## 
## [[9]]
##      start end
## [1,]     1   2
str_match(c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx", "mn", "cd"), c("cdx", "mn")) 
## Warning in stri_match_first_regex(string, pattern, opts_regex = opts(pattern)):
## longer object length is not a multiple of shorter object length
##       [,1] 
##  [1,] "cdx"
##  [2,] "mn" 
##  [3,] NA   
##  [4,] NA   
##  [5,] NA   
##  [6,] "mn" 
##  [7,] "cdx"
##  [8,] "mn" 
##  [9,] NA
str_match(c("cdx", "mn"), c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx", "mn", "cd"))
## Warning in stri_match_first_regex(string, pattern, opts_regex = opts(pattern)):
## longer object length is not a multiple of shorter object length
##       [,1]
##  [1,] NA  
##  [2,] NA  
##  [3,] NA  
##  [4,] NA  
##  [5,] NA  
##  [6,] NA  
##  [7,] NA  
##  [8,] "mn"
##  [9,] "cd"

str_match_all(c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx", "mn", "cd"), c("cdx", "mn")) 
## Warning in stri_match_all_regex(string, pattern, omit_no_match = TRUE,
## opts_regex = opts(pattern)): longer object length is not a multiple of shorter
## object length
## [[1]]
##      [,1] 
## [1,] "cdx"
## [2,] "cdx"
## 
## [[2]]
##      [,1]
## [1,] "mn"
## 
## [[3]]
##      [,1]
## 
## [[4]]
##      [,1]
## 
## [[5]]
##      [,1]
## 
## [[6]]
##      [,1]
## [1,] "mn"
## 
## [[7]]
##      [,1] 
## [1,] "cdx"
## [2,] "cdx"
## 
## [[8]]
##      [,1]
## [1,] "mn"
## 
## [[9]]
##      [,1]
str_match_all(c("cdx", "mn"), c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx", "mn", "cd"))
## Warning in stri_match_all_regex(string, pattern, omit_no_match = TRUE,
## opts_regex = opts(pattern)): longer object length is not a multiple of shorter
## object length
## [[1]]
##      [,1]
## 
## [[2]]
##      [,1]
## 
## [[3]]
##      [,1]
## 
## [[4]]
##      [,1]
## 
## [[5]]
##      [,1]
## 
## [[6]]
##      [,1]
## 
## [[7]]
##      [,1]
## 
## [[8]]
##      [,1]
## [1,] "mn"
## 
## [[9]]
##      [,1]
## [1,] "cd"
str_extract(c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx", "mn", "cd"), c("cdx", "mn")) 
## Warning in stri_extract_first_regex(string, pattern, opts_regex =
## opts(pattern)): longer object length is not a multiple of shorter object length
## [1] "cdx" "mn"  NA    NA    NA    "mn"  "cdx" "mn"  NA
str_extract(c("cdx", "mn"), c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx", "mn", "cd"))
## Warning in stri_extract_first_regex(string, pattern, opts_regex =
## opts(pattern)): longer object length is not a multiple of shorter object length
## [1] NA   NA   NA   NA   NA   NA   NA   "mn" "cd"

str_extract_all(c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx", "mn", "cd"), c("cdx", "mn")) 
## Warning in stri_extract_all_regex(string, pattern, simplify = simplify, : longer
## object length is not a multiple of shorter object length
## [[1]]
## [1] "cdx" "cdx"
## 
## [[2]]
## [1] "mn"
## 
## [[3]]
## character(0)
## 
## [[4]]
## character(0)
## 
## [[5]]
## character(0)
## 
## [[6]]
## [1] "mn"
## 
## [[7]]
## [1] "cdx" "cdx"
## 
## [[8]]
## [1] "mn"
## 
## [[9]]
## character(0)
str_extract_all(c("cdx", "mn"), c("cdxcdx", "mny", "mn", "cd", "ef", "mnz", "cdxcdx", "mn", "cd"))
## Warning in stri_extract_all_regex(string, pattern, simplify = simplify, : longer
## object length is not a multiple of shorter object length
## [[1]]
## character(0)
## 
## [[2]]
## character(0)
## 
## [[3]]
## character(0)
## 
## [[4]]
## character(0)
## 
## [[5]]
## character(0)
## 
## [[6]]
## character(0)
## 
## [[7]]
## character(0)
## 
## [[8]]
## [1] "mn"
## 
## [[9]]
## [1] "cd"

字符串替换函数:sub, gsub,str_replace和str_replace_all

尽管sub和gsub,str_replace和str_replace_all可用于字符串的替换,但严格地说R语言没有字符串替换的函数,因为R语言不管什么操作对参数都是传值不传址。

text = c("Hellow, Adam Adam!", "Hi, Paul Adam !", "How are you, Adam, Ava.")
sub(pattern = "Adam", replacement = "world", text)
## [1] "Hellow, world Adam!"      "Hi, Paul world !"        
## [3] "How are you, world, Ava."
gsub(pattern = "Adam", replacement = "world", text)
## [1] "Hellow, world world!"     "Hi, Paul world !"        
## [3] "How are you, world, Ava."

可以看到:虽然说是“替换”,但原字符串并没有改变,要改变原变量我们只能通过再赋值的方式。 sub和gsub的区别是前者只做一次替换(不管有几次匹配),而gsub把满足条件的匹配都做替换。

stringr包中也有类似函数sub的函数str_repalce做一次替换,以及类似函数gsub的str_repalce_all函数把满足条件的匹配都做替换。

sub(c("H"), c("I"),c("HacHgd", "aeHfgH", "defg"))
## [1] "IacHgd" "aeIfgH" "defg"
str_replace(c("HacHgd", "aeHfgH", "defg"), c("H"), c("I"))
## [1] "IacHgd" "aeIfgH" "defg"
gsub(c("H"), c("I"),c("HacHgd", "aeHfgH", "defg"))
## [1] "IacIgd" "aeIfgI" "defg"
str_replace_all(c("HacHgd", "aeHfgH", "defg"), c("H"), c("I"))
## [1] "IacIgd" "aeIfgI" "defg"

与sub和gsub不同,stringr包中的函数str_repalce和replace_all不仅可以实现一个字符串的查询替换,也可以实现多个字符串在相同位置的针对查询替换。(其实本质是一样的,就是短的字符向量重复完成匹配)

sub(c("H","a"), c("I", "b"),c("HacHgd", "aeHfgH", "defg")) # 只有H参与了查询替换
## Warning in sub(c("H", "a"), c("I", "b"), c("HacHgd", "aeHfgH", "defg")):
## argument 'pattern' has length > 1 and only the first element will be used
## Warning in sub(c("H", "a"), c("I", "b"), c("HacHgd", "aeHfgH", "defg")):
## argument 'replacement' has length > 1 and only the first element will be used
## [1] "IacHgd" "aeIfgH" "defg"
str_replace(c("HacHgd", "aeHfgH", "defg"), c("H","a"), c("I", "b"))
## Warning in stri_replace_first_regex(string, pattern,
## fix_replacement(replacement), : longer object length is not a multiple of
## shorter object length
## [1] "IacHgd" "beHfgH" "defg"
gsub(c("H","a"), c("I", "b"),c("HacHgd", "aeHfgH", "defg"))
## Warning in gsub(c("H", "a"), c("I", "b"), c("HacHgd", "aeHfgH", "defg")):
## argument 'pattern' has length > 1 and only the first element will be used
## Warning in gsub(c("H", "a"), c("I", "b"), c("HacHgd", "aeHfgH", "defg")):
## argument 'replacement' has length > 1 and only the first element will be used
## [1] "IacIgd" "aeIfgI" "defg"
str_replace_all(c("HacHgd", "aeHfgH", "defg"), c("H","a"), c("I", "b"))
## Warning in stri_replace_all_regex(string, pattern,
## fix_replacement(replacement), : longer object length is not a multiple of
## shorter object length
## [1] "IacIgd" "beHfgH" "defg"
str_replace(c("HacHgd", "aeHfgH", "defg"), c("H","a","g", "d"), c("I", "b","H","e")) #此时返回结果长度为4
## Warning in stri_replace_first_regex(string, pattern,
## fix_replacement(replacement), : longer object length is not a multiple of
## shorter object length
## [1] "IacHgd" "beHfgH" "defH"   "HacHge"
str_replace_all(c("HacHgd", "aeHfgH", "defg"), c("H","a","g", "d"), c("I", "b","H","e"))#此时返回结果长度为4
## Warning in stri_replace_all_regex(string, pattern,
## fix_replacement(replacement), : longer object length is not a multiple of
## shorter object length
## [1] "IacIgd" "beHfgH" "defH"   "HacHge"

此外,函数str_repalce_all还可以实现多个字符串的同时替换(str_replac没有此功能)。

y = c(c("I", "b"))
names(y) = c("H","a")
str_replace_all(c("HacHgd", "aeHfgH", "defg"),y)
## [1] "IbcIgd" "beIfgI" "defg"

针对函数str_repalce_all的多个字符串的同时替换功能,有时会出现意想不到的结果,而mgsub::mgsub可以产生另外一种结果。

y = c(c("a", "H"))
names(y) = c("H","a")
str_replace_all(c("HacHgd", "aeHfgH", "defg"),y)
## [1] "HHcHgd" "HeHfgH" "defg"
mgsub::mgsub(c("HacHgd", "aeHfgH", "defg"), c("H","a"),c(c("a", "H")))
## [1] "aHcagd" "Heafga" "defg"

- - - - - -

参考文章

  1. 字符串处理
  2. R语言字符串处理包stringr