Creating a custom source for reading from Cloud Datastore using the latest Python Apache Beam / Cloud Dataflow SDK
The Cloud Dataflow Python SDK was recently made available, and I decided to use it. Unfortunately, support for reading from Cloud Datastore is not there yet, so I have to fall back on writing a custom source so that I can utilize the promised benefits of dynamic splitting, progress estimation, etc. I did study the documentation thoroughly, but I am unable to put the pieces together to speed up the entire process.
To be more clear, my first approach was:
- querying Cloud Datastore
- creating a ParDo function and passing the returned query results to it.

But it took 13 minutes just to iterate over 200k entries.
So I decided to write a custom source that reads the entities efficiently. I am unable to achieve this due to my lack of understanding of how to put the pieces together. Can someone please show me how to create a custom source for reading from Datastore?
Edit: the gist for my first approach is: https://gist.github.com/shriyanka/cbf30bbfbf277deed4bac0c526cf01f1
Thank you.
In the code you provided, access to Datastore happens before the pipeline is constructed:
query = client.query(kind='user').fetch()
This executes the whole query and reads all the entities before the Beam SDK gets involved at all. More precisely, fetch() returns a lazy iterable over the query results, and it gets iterated over when you construct the pipeline, at beam.Create(query) - but, once again, this happens in your main program, before the pipeline starts. Most likely, this is what's taking 13 minutes, rather than the pipeline itself (but please feel free to provide a job ID so we can take a deeper look). You can verify this by making a small change to your code:
query = list(client.query(kind='user').fetch())
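To see why this one-line change is a useful diagnostic, here is a minimal pure-Python illustration; `simulated_fetch` is a hypothetical stand-in for Datastore's `fetch()`, not the real client API. A lazy iterable does no work when it is created, so the cost shows up wherever iteration actually happens:

```python
def simulated_fetch():
    """Stands in for client.query(...).fetch(): returns a lazy iterator.

    Nothing is read at call time; work happens only when the result
    is iterated over (in the real API, each step may issue an RPC).
    """
    for i in range(3):
        yield {"id": i, "kind": "user"}

# Calling the "fetch" does not read anything yet...
lazy_results = simulated_fetch()

# ...but wrapping it in list() forces the whole query to run right here,
# in the main program, before any pipeline is constructed.
eager_results = list(simulated_fetch())

print(type(lazy_results).__name__)  # generator: nothing consumed yet
print(eager_results)
```

If the 13 minutes move from pipeline construction to the `list(...)` call after this change, the time is being spent in the sequential fetch, not in the pipeline.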
However, I think your intention was to both read and process the entities in parallel.
For Cloud Datastore in particular, the custom source API is not the best choice for that. The reason is that the underlying Cloud Datastore API itself does not currently provide the properties necessary to implement the custom source "goodies" such as progress estimation and dynamic splitting, because its querying API is very generic (unlike, say, Cloud Bigtable, which always returns results ordered by key, so e.g. you can estimate progress by looking at the current key).
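To illustrate the point about key-ordered reads: when results arrive sorted by key over a known key range, progress can be estimated from the current key alone. A hypothetical sketch with integer keys for simplicity (real Bigtable row keys are byte strings, and `estimate_progress` is an illustration, not an actual SDK function):

```python
def estimate_progress(start_key, end_key, current_key):
    """Estimate read progress as the fraction of the key range consumed.

    This only works because results arrive ordered by key, which Cloud
    Bigtable guarantees and a generic Datastore query does not.
    """
    span = end_key - start_key
    if span <= 0:
        return 1.0
    # Clamp to [0, 1] in case current_key falls outside the range.
    return min(1.0, max(0.0, (current_key - start_key) / span))

print(estimate_progress(0, 1000, 250))  # 0.25
```

Dynamic splitting relies on the same property: an ordered key range can be cut at any intermediate key and the remainder handed to another worker.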
We are rewriting the Java Cloud Datastore connector to use a different approach: it uses a ParDo to split the query into sub-queries, and another ParDo to read each of the sub-queries. Please see this pull request for details.