I am getting started with Apache Spark. I have a requirement to convert a JSON log into flattened metrics, which can also be considered a simple CSV.
For example:
"orderid":1, "orderdata": { "customerid": 123, "orders": [ { "itemcount": 2, "items": [ { "quantity": 1, "price": 315 }, { "quantity": 2, "price": 300 }, ] } ] }
This can be considered a single JSON log, and I want to convert it into:
orderid, customerid, totalvalue, units
1, 123, 915, 3
I am going through the Spark SQL documentation and can get hold of the individual values with "select orderid, orderdata.customerid from order", but I am not sure how to do the summation of the prices and units.

What would be the best practice to get this done using Apache Spark?
Try:
>>> import json
>>> from pyspark.sql.functions import *
>>> doc = {"orderdata": {"orders": [{"items": [{"quantity": 1, "price": 315},
...                                            {"quantity": 2, "price": 300}],
...                                  "itemcount": 2}],
...                      "customerid": 123},
...        "orderid": 1}
>>> # read.json expects JSON strings, so serialize the dict first
>>> df = sqlContext.read.json(sc.parallelize([json.dumps(doc)]))
>>> # explode the orders array, then the nested items array,
>>> # and aggregate per (orderid, customerid)
>>> df.select("orderid", "orderdata.customerid", explode("orderdata.orders").alias("order")) \
...     .withColumn("item", explode("order.items")) \
...     .groupBy("orderid", "customerid") \
...     .agg(sum("item.quantity").alias("units"),
...          sum(col("item.quantity") * col("item.price")).alias("totalvalue"))
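For the sample document this gives units = 1 + 2 = 3 and totalvalue = 1*315 + 2*300 = 915, matching the expected row. Appending .show() to the chain above should print something like:

+-------+----------+-----+----------+
|orderid|customerid|units|totalvalue|
+-------+----------+-----+----------+
|      1|       123|    3|       915|
+-------+----------+-----+----------+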
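If you would rather stay in SQL, as in the select you quoted, the same flattening can be expressed with LATERAL VIEW explode. A sketch, assuming a HiveContext (or a newer Spark version whose SQL parser supports LATERAL VIEW) and the hypothetical table name order_logs, since order is a reserved word:

>>> df.registerTempTable("order_logs")
>>> sqlContext.sql("""
...     SELECT orderid,
...            orderdata.customerid AS customerid,
...            SUM(item.quantity * item.price) AS totalvalue,
...            SUM(item.quantity) AS units
...     FROM order_logs
...     LATERAL VIEW explode(orderdata.orders) o AS ord
...     LATERAL VIEW explode(ord.items) i AS item
...     GROUP BY orderid, orderdata.customerid
... """)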
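And since the end goal is a simple CSV, the aggregated result can be written out directly. A minimal sketch, assuming Spark 2.x's built-in CSV writer (on 1.x the external spark-csv package offers the same via format("com.databricks.spark.csv")); here result is a hypothetical variable holding the DataFrame chain from above, and the output path is likewise made up:

>>> # 'result' is the aggregated DataFrame produced by the chain above
>>> result.write.csv("/tmp/order_metrics", header=True)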