Creating aggregate metrics from JSON logs in Apache Spark


I am getting started with Apache Spark. I have a requirement to convert a JSON log into flattened metrics, which can also be considered as a simple CSV.

For example:

  "orderid":1,   "orderdata": {   "customerid": 123,   "orders": [     {       "itemcount": 2,       "items": [         {           "quantity": 1,           "price": 315         },         {           "quantity": 2,           "price": 300         },        ]     }   ] } 

This can be considered as a single JSON log, and I want to convert it into:

orderid,customerid,totalvalue,units
      1,       123,       915,    3

I was going through the Spark SQL documentation, and I can get hold of the individual values using something like "select orderid, orderdata.customerid from order", but I am not sure how to get the summation of all the prices and units.
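For instance, something like the following gets me the individual values (a rough sketch; the file path and the table name order_logs are just placeholders):

>>> df = sqlContext.read.json("logs.json")      # placeholder path to the JSON log file
>>> df.registerTempTable("order_logs")          # placeholder temp table name
>>> sqlContext.sql("select orderid, orderdata.customerid from order_logs").show()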

What would be the best practice to get this done using Apache Spark?

Try:

>>> from pyspark.sql.functions import *
>>> doc = {"orderdata": {"orders": [{"items": [{"quantity": 1, "price": 315}, {"quantity": 2, "price": 300}], "itemcount": 2}], "customerid": 123}, "orderid": 1}
>>> df = sqlContext.read.json(sc.parallelize([doc]))
>>> # explode orders, then items, and aggregate per (orderid, customerid)
>>> df.select("orderid", "orderdata.customerid", explode("orderdata.orders").alias("order")) \
...     .withColumn("item", explode("order.items")) \
...     .groupBy("orderid", "customerid") \
...     .agg(sum("item.quantity"), sum(col("item.quantity") * col("item.price")))
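If you want the column names to line up with the target CSV, you can alias the aggregates and show the result; a minimal extension of the above (the names units and totalvalue are just my choice):

>>> result = df.select("orderid", "orderdata.customerid", explode("orderdata.orders").alias("order")) \
...     .withColumn("item", explode("order.items")) \
...     .groupBy("orderid", "customerid") \
...     .agg(sum("item.quantity").alias("units"),
...          sum(col("item.quantity") * col("item.price")).alias("totalvalue"))
>>> result.show()

For the sample document this produces a single row with orderid 1, customerid 123, units 3 and totalvalue 915, i.e. exactly the target line.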

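If you would rather stay in Spark SQL, the same aggregation can be written with LATERAL VIEW explode. A sketch, assuming a HiveContext (or a Spark 2.x SparkSession, where this syntax works in spark.sql) and a temp table name of my choosing:

>>> df.registerTempTable("order_logs")   # hypothetical table name
>>> sqlContext.sql("""
...     SELECT orderid,
...            orderdata.customerid AS customerid,
...            SUM(item.quantity * item.price) AS totalvalue,
...            SUM(item.quantity) AS units
...     FROM order_logs
...     LATERAL VIEW explode(orderdata.orders) o AS ord
...     LATERAL VIEW explode(ord.items) i AS item
...     GROUP BY orderid, orderdata.customerid
... """).show()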