regex - Dictionary expansion in R


I'm looking for a quick, efficient solution to expand a dictionary (df1)

                     pattern cat1 cat2
    1          I want [food]    a    b
    2 I'm [amplifier] [pos].    a    b

    df1 <- data.frame(pattern=c("I want [food]", "I'm [amplifier] [pos]"),
                      cat1=c("a", "c"), cat2=c("b", "d"), stringsAsFactors=FALSE)

that has string patterns with categories enclosed within square brackets []. These indicate categories that appear in an additional data frame in dictionary format (df2).

         pattern  category
    1      pizza      food
    2    hot dog      food
    3      chips      food
    4       very amplifier
    5  very much amplifier
    6      happy       pos
    7 optimistic       pos

    df2 <- structure(list(pattern = c("pizza", "hot dog", "chips", "very",
    "very much", "happy", "optimistic"), category = c("food", "food",
    "food", "amplifier", "amplifier", "pos", "pos")), .Names = c("pattern",
    "category"), row.names = c(NA, -7L), class = "data.frame")
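To make the dictionary format concrete, here is a minimal base R sketch that groups the df2 terms by category. The name `lookup` is only an illustrative helper and is not used anywhere else in this post.

```r
# df2 re-created here so the snippet runs on its own
df2 <- data.frame(
  pattern  = c("pizza", "hot dog", "chips", "very", "very much",
               "happy", "optimistic"),
  category = c("food", "food", "food", "amplifier", "amplifier",
               "pos", "pos"),
  stringsAsFactors = FALSE
)

# One character vector of terms per category (hypothetical helper name)
lookup <- split(df2$pattern, df2$category)

lookup[["food"]]       # terms that can replace [food]
lookup[["amplifier"]]  # terms that can replace [amplifier]
```

Each bracketed name in df1 is a key into this lookup, and every term under that key should produce one expanded row.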

I want to create an extended data.frame that takes df1 and expands it with df2, so that it looks like this:

                  pattern cat1 cat2
    1        I want pizza    a    b
    2       I want hotdog    a    b
    3        I want chips    a    b
    4           I'm happy    c    d
    5      I'm more happy    c    d
    6      I'm optimistic    c    d
    7 I'm more optimistic    c    d

    output <- structure(list(pattern = c("I want pizza", "I want hotdog", "I want chips",
    "I'm happy", "I'm more happy", "I'm optimistic",
    "I'm more optimistic"), cat1 = c("a", "a", "a", "c", "c",
    "c", "c"), cat2 = c("b", "b", "b", "d", "d", "d", "d")), .Names = c("pattern",
    "cat1", "cat2"), row.names = c(NA, -7L), class = "data.frame")

Here's what I'd do:

    library(stringi)
    library(data.table)
    setDT(df1)
    setDT(df2)

    # matches a bracketed category such as [food] and captures its name
    capture_patt = "\\[(\\w+)\\]"
    df1[, {
        cats = stri_match_all(pattern, regex = capture_patt)[[1]][, 2]
        new_patt = gsub(capture_patt, "%s", pattern)

        # one column of candidate terms per category, cross-joined
        subs = do.call(CJ, lapply(cats, function(cat)
          df2[.(category = cat), on="category", pattern]
        ))

        .(res = do.call(sprintf, c(.(fmt = new_patt), subs)))
    }, by=names(df1)]

    #                   pattern cat1 cat2                       res
    # 1:          I want [food]    a    b              I want chips
    # 2:          I want [food]    a    b            I want hot dog
    # 3:          I want [food]    a    b              I want pizza
    # 4: I'm [amplifier] [pos].    a    b           I'm very happy.
    # 5: I'm [amplifier] [pos].    a    b      I'm very optimistic.
    # 6: I'm [amplifier] [pos].    a    b      I'm very much happy.
    # 7: I'm [amplifier] [pos].    a    b I'm very much optimistic.

How it works.

The objects are...

  • cats are the categories we need to grab
  • new_patt is the sprintf-ready version of the pattern
  • subs is the table of substitutions that must be made
  • res is the new column
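To see what each of these objects holds, here is a rough base R walk-through for a single pattern. It is only a sketch: it uses regmatches and expand.grid as stand-ins for stri_match_all and CJ, so the row order can differ from the data.table version, and `patt` is a name made up for the example.

```r
# Dictionary re-created here so the snippet runs on its own
df2 <- data.frame(
  pattern  = c("pizza", "hot dog", "chips", "very", "very much",
               "happy", "optimistic"),
  category = c("food", "food", "food", "amplifier", "amplifier",
               "pos", "pos"),
  stringsAsFactors = FALSE
)

patt <- "I'm [amplifier] [pos]"   # one value of df1$pattern (illustrative)
capture_patt <- "\\[(\\w+)\\]"

# cats: the category names found between square brackets
tokens <- regmatches(patt, gregexpr(capture_patt, patt))[[1]]
cats <- gsub("\\[|\\]", "", tokens)

# new_patt: the pattern with every [category] swapped for %s
new_patt <- gsub(capture_patt, "%s", patt)

# subs: all combinations of terms, one column per category
# (expand.grid varies the first column fastest, unlike CJ)
subs <- expand.grid(
  lapply(cats, function(ct) df2$pattern[df2$category == ct]),
  stringsAsFactors = FALSE
)

# res: fill the %s placeholders, one result per row of subs
res <- do.call(sprintf, c(list(fmt = new_patt), unname(as.list(subs))))
res
```

Here cats has two entries, so subs has two columns and four rows, and res contains the four expanded sentences.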

The trickier functions are...

  • CJ takes the cross product, like expand.grid in MrFlick's answer.
  • do.call(f, list_o_args) passes a list of args to a function.
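If those two are unfamiliar, this small base R sketch may help. It uses expand.grid rather than CJ so that it runs without data.table, and the vectors `amp` and `pos` are made up for the example.

```r
# Two small term vectors, invented for illustration
amp <- c("very", "very much")
pos <- c("happy", "optimistic")

# Cross product: one row for every (amp, pos) combination
grid <- expand.grid(amp = amp, pos = pos, stringsAsFactors = FALSE)
nrow(grid)  # 2 * 2 combinations

# do.call(f, args) calls f with the elements of args as its arguments,
# so the next two calls do the same thing
a <- sprintf("I'm %s %s", grid$amp, grid$pos)
b <- do.call(sprintf, list("I'm %s %s", grid$amp, grid$pos))
identical(a, b)
```

The do.call form matters in the answer because the number of categories in a pattern is not known in advance, so the argument list has to be built at run time.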
