I have the following dataset:
|-- created_at: timestamp (nullable = true)
|-- channel_source_id: integer (nullable = true)
|-- movie_id: integer (nullable = true)
I classify movie_id based on the following conditions (a DataFrame sketch covering all three follows the list):
number of times it has been played;

#count occurrences per movie_id
SELECT count(created_at) FROM logs GROUP BY movie_id
in what time range (created_at) the movie has been played;

#returns distinct movie_id
SELECT DISTINCT(movie_id) FROM logs
#for each movie_id, retrieve the hour it has been played
#when I have the result, apply a filter on the df to extract the intervals
SELECT created_at FROM logs WHERE movie_id = ?
number of different channel_source_id that have played the movie;

#count the number of channels that have played it
SELECT count(DISTINCT(channel_source_id)) FROM logs WHERE movie_id = ? GROUP BY movie_id
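All three metrics can also be computed in a single groupBy on movie_id with the DataFrame API instead of separate queries. A minimal sketch, assuming Spark 1.5+ (for the hour function) and a DataFrame named df with the schema above; the column names stats, plays, channels, first_hour and last_hour are my own, and the hour-based min/max is just one way to approximate the played range:

import org.apache.spark.sql.functions.{count, countDistinct, min, max, hour}

// one row per movie_id: play count, number of distinct channels,
// and the earliest/latest hour of day it was played
val stats = df.groupBy("movie_id").agg(
  count("created_at").as("plays"),
  countDistinct("channel_source_id").as("channels"),
  min(hour(df("created_at"))).as("first_hour"),
  max(hour(df("created_at"))).as("last_hour")
)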
I've written a simple table to guide me on the classification:
played 1 to 5 times, range between 00:00:00 - 03:59:59, 1 to 3 different channels >> movie type A
played 6 to 10 times, range between 04:00:00 - 07:59:59, 4 to 5 different channels >> movie type B
etc.
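Each row of this table can be turned into a classification column with when/otherwise on the aggregated stats. A sketch, assuming the stats DataFrame and the column names from the previous snippet, and my own reading of the table (the played range is checked by requiring both first_hour and last_hour to fall inside the interval); add one when per row of your table:

import org.apache.spark.sql.functions.when

// thresholds taken from the first two rows of the table above
val classified = stats.withColumn("movie_type",
  when(stats("plays").between(1, 5) &&
       stats("first_hour").between(0, 3) && stats("last_hour").between(0, 3) &&
       stats("channels").between(1, 3), "A")
  .when(stats("plays").between(6, 10) &&
        stats("first_hour").between(4, 7) && stats("last_hour").between(4, 7) &&
        stats("channels").between(4, 5), "B")
  .otherwise("unclassified"))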
I'm using Spark to import the file, but I'm lost on how I can perform the classification. Could someone give me a hand on where I should start?
import org.apache.spark.sql.SQLContext

def run() = {
  val sqlContext = new SQLContext(sc)  // sc: existing SparkContext
  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .options(Map("header" -> "true", "inferSchema" -> "true"))
    .load("/home/plc/desktop/movies.csv")
  df.printSchema()
}
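If you prefer to keep the SQL form of the queries above, you can register the loaded DataFrame as a temporary table and run them directly. A sketch, assuming the df loaded in run() and the Spark 1.x API (registerTempTable; Spark 2.x uses createOrReplaceTempView instead):

df.registerTempTable("logs")
// plays and distinct channels per movie, matching the SQL sketched earlier
val plays = sqlContext.sql(
  "SELECT movie_id, count(created_at) AS plays, " +
  "count(DISTINCT channel_source_id) AS channels " +
  "FROM logs GROUP BY movie_id")
plays.show()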