R - Can't call a REST API with foreach() using parallelization


I'm loading data from an online API. The data is paginated, so I need to make consecutive calls.

Therefore I set up a parallelized foreach() loop and rbind() the output.

Here's the code:

    library('foreach')
    library('parallel')
    library('jsonlite')

    registerDoMC(cores = parallel::detectCores())

    data <- foreach(page = 1:10, .combine = rbind) %dopar% {
        raw.data <- fromJSON(paste(endpoint, '&page=', page, sep = ''))
        raw.data <- raw.data$results
        data.piece <- raw.data[c('id', 'scraper', 'title', 'text', 'ts', 'url', 'pertinence', 'source')]
        data.piece
    }

`endpoint` is the REST URL.

The loop returns NULL and, furthermore, finishes almost immediately (each call should actually need a couple of seconds).

So it seems the calls are skipped. If I run the same code without parallelization, it works without problems.

I bumped into a similar situation, and adapting my code to your situation yields the following:

    library(jsonlite)
    library(dplyr)
    library(foreach)
    library(doParallel)

    fetch.data <- function(page) {
        # confirm the url we are fetching data from ...
        url = 'http://api.paginated/?page='
        endpoint = paste0(url, page)
        print(paste0('fetching data => ', endpoint))
        raw.data <- fromJSON(endpoint, flatten = TRUE)
        raw.data
    }

    # start one worker per core and register the cluster as the %dopar% backend
    no_cores <- detectCores()
    cluster <- makeCluster(no_cores)
    registerDoParallel(cluster)
    t.start <- Sys.time()
    data <- foreach(page=1:10, .combine=bind_rows, .packages=c('jsonlite')) %dopar% {
        # pause on every 4th page to stay within the API's rate limit
        if (page %% 4 == 0) Sys.sleep(1)
        page_data <- fetch.data(page)
        page_data <- page_data$results
        data.piece <- page_data[c('id', 'scraper', 'title', 'text', 'ts', 'url', 'pertinence', 'source')]
        data.piece
    }
    t.end <- Sys.time()
    # the cluster was created explicitly with makeCluster(), so stop it explicitly
    stopCluster(cluster)
    print(t.end - t.start)

This code worked for me recently. The only thing you have to take care of is to stay within the API's throttling limits. This may mean you have to slow down your script, for example by waiting 1 second on every 4th page.
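If a fixed pause turns out not to be enough, one alternative is to retry a failed page with growing pauses. This is only a rough sketch and not part of the code above: `fetch_with_backoff`, `max_tries` and `base_wait` are hypothetical names, and it assumes a throttled request surfaces as an R error inside the fetch function.

```r
# Hypothetical helper: call fetch(page), and on error wait and try again.
# The pause doubles after each failed attempt (base_wait, 2 * base_wait, ...).
fetch_with_backoff <- function(fetch, page, max_tries = 3, base_wait = 1) {
    for (attempt in seq_len(max_tries)) {
        # wrap the value in a list so a legitimate NULL result is not
        # mistaken for a failure
        result <- tryCatch(list(value = fetch(page)), error = function(e) NULL)
        if (!is.null(result)) return(result$value)
        if (attempt < max_tries) Sys.sleep(base_wait * 2^(attempt - 1))
    }
    stop('giving up on page ', page, ' after ', max_tries, ' attempts')
}
```

Inside the loop you would then call `fetch_with_backoff(fetch.data, page)` instead of `fetch.data(page)`. Whether this helps depends on how the API reports throttling, so treat it as a starting point.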

