I'm loading data from an online API. The data is paginated, so I need to make consecutive calls. I therefore set up a parallelized foreach() loop that rbind()s the output. Here's the code:
```r
library('foreach')
library('doMC')       # provides registerDoMC()
library('parallel')
library('jsonlite')

registerDoMC(cores = parallel::detectCores())

data <- foreach(page = 1:10, .combine = rbind) %dopar% {
  raw.data <- fromJSON(paste(endpoint, '&page=', page, sep = ''))
  raw.data <- raw.data$results
  data.piece <- raw.data[c('id', 'scraper', 'title', 'text', 'ts',
                           'url', 'pertinence', 'source')]
  data.piece
}
```
Here `endpoint` holds the REST URL.
The loop returns NULL, and furthermore it finishes almost instantly (each call should actually take a couple of seconds). So it seems the calls are being skipped. If I run the same code sequentially instead of in parallel, it works without problems.
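One way to find out what is happening on the workers is to stop the loop from silently swallowing failures. foreach has a standard `.errorhandling` argument; with `'pass'`, an error raised on a worker comes back as a condition object you can inspect instead of the whole result collapsing. A minimal sketch (not the original code; the `stop()` call stands in for a failing API request):

```r
library(foreach)
library(doParallel)

cluster <- makeCluster(2)
registerDoParallel(cluster)

# With .errorhandling = 'pass', a worker error is returned as a
# condition object in the result list instead of aborting the loop.
res <- foreach(i = 1:4, .errorhandling = 'pass') %dopar% {
  if (i == 3) stop('simulated failure on worker')  # stands in for a failed API call
  i * 2
}

stopCluster(cluster)

sapply(res, inherits, 'error')
# the third element is the error object; the rest are the computed values
```

Running the real loop this way should show whether the workers are erroring (e.g. because jsonlite is not loaded on them, which the `.packages` argument fixes) rather than genuinely skipping the calls.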
I bumped into a similar situation, and adapting my code to your situation yields the following:
```r
library(jsonlite)
library(dplyr)
library(foreach)
library(doParallel)

fetch.data <- function(page) {
  # confirm the url you are fetching data from ...
  url <- 'http://api.paginated/?page='
  endpoint <- paste0(url, page)
  print(paste0('fetching data => ', endpoint))
  raw.data <- fromJSON(endpoint, flatten = TRUE)
  raw.data
}

no_cores <- detectCores()
cluster <- makeCluster(no_cores)
registerDoParallel(cluster)

t.start <- Sys.time()
data <- foreach(page = 1:10, .combine = bind_rows,
                .packages = c('jsonlite')) %dopar% {
  if (page %% 4 == 0) Sys.sleep(1)   # stay within the API's throttling limits
  page_data <- fetch.data(page)
  page_data <- page_data$results
  data.piece <- page_data[c('id', 'scraper', 'title', 'text', 'ts',
                            'url', 'pertinence', 'source')]
  data.piece
}
t.end <- Sys.time()

stopCluster(cluster)   # shuts down the workers created by makeCluster()
print(t.end - t.start)
```
This code worked for me recently. The one thing you have to take care of is to stay within the API's throttling limits. That may mean slowing the script down - for example, waiting 1 second after every 4th page, as above.
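If the API actively rejects requests when you go too fast, a fixed sleep may not be enough. A hedged sketch of a retry-with-backoff wrapper (my own addition, not part of the answer above): retry a failing call a few times, waiting longer between attempts. The `flaky.fetch` demo function is hypothetical and stands in for `fetch.data()`:

```r
# Retry `fun` up to max.tries times, sleeping longer after each failure.
with.retries <- function(fun, max.tries = 3, base.wait = 1) {
  for (attempt in seq_len(max.tries)) {
    result <- tryCatch(fun(), error = function(e) e)
    if (!inherits(result, 'error')) return(result)
    if (attempt == max.tries) stop(result)
    Sys.sleep(base.wait * attempt)  # linear backoff; tune to the API's limits
  }
}

# Demo with a stand-in for fetch.data() that fails twice, then succeeds:
calls <- 0
flaky.fetch <- function() {
  calls <<- calls + 1
  if (calls < 3) stop('429 Too Many Requests')  # simulated throttling error
  'payload'
}
with.retries(flaky.fetch, base.wait = 0.1)
# returns 'payload' on the third attempt
```

Inside the foreach body you would wrap the API call as `with.retries(function() fetch.data(page))`, so a transient throttling error does not kill the whole loop.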