scala - Best way to order RDD elements in Apache Spark


I need to find an efficient way to globally sort a very large RDD while shuffling as little as possible. It must not run into performance problems, for example OutOfMemory exceptions and so on.

Thanks.

The amount of data you have to shuffle to get a sorted RDD is fixed, in the sense that every minimal solution shuffles as little as possible. The only thing that can be improved is pushing the sorting mechanism down into the shuffle, and that part is already handled by RDD.sortBy, OrderedRDDFunctions.sortByKey or JavaPairRDD.sortByKey.

So just choose whichever of these methods is applicable to your data. For example:

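    // Assumes `sc` is the SparkContext provided by spark-shell.
    // 100 standard-normal samples in 10 partitions, seed 323.
    val rdd = org.apache.spark.mllib.random.RandomRDDs.normalRDD(sc, 100, 10, 323L)
    rdd.sortBy(identity).take(3)
    // Array[Double] =
    //   Array(-2.678684754806642, -1.4394327869537575, -1.2573154896913827)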
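To see why the shuffle volume is fixed, it can help to look at a range-partitioned sort in miniature. Spark's sortBy and sortByKey sample the data to pick range boundaries, send each record to the partition that owns its range, and sort inside each partition. The following is only a rough local sketch of that idea in plain Scala, not Spark's actual implementation; the names RangeSortSketch, rangeSort, numPartitions, sampleSize and seed are made up for illustration.

```scala
// Spark-free simulation of a range-partitioned sort (illustrative only).
object RangeSortSketch {

  /** Sorts `data` by routing every element to exactly one range bucket
    * and then sorting each bucket independently.
    */
  def rangeSort(data: Seq[Double], numPartitions: Int, sampleSize: Int, seed: Long): Seq[Double] = {
    require(numPartitions > 0, "numPartitions must be positive")
    require(sampleSize > 0, "sampleSize must be positive")
    if (data.isEmpty) Seq.empty
    else {
      val indexed = data.toIndexedSeq
      val rng = new scala.util.Random(seed)

      // 1. Sample the input to estimate how the values are distributed.
      val sample = IndexedSeq.fill(sampleSize)(indexed(rng.nextInt(indexed.length))).sorted

      // 2. Take numPartitions - 1 boundaries from the sorted sample.
      val bounds = (1 until numPartitions).map(i => sample(i * sample.length / numPartitions))

      // 3. "Shuffle": each element moves once, to the bucket owning its range.
      //    The bucket index is the number of boundaries strictly below the value.
      val buckets = indexed.groupBy(x => bounds.count(_ < x))

      // 4. Sort each bucket locally; concatenating buckets in index order
      //    yields a globally sorted sequence.
      (0 until numPartitions).flatMap(i => buckets.getOrElse(i, IndexedSeq.empty[Double]).sorted)
    }
  }
}
```

Calling RangeSortSketch.rangeSort(Seq(3.0, -1.5, 2.25, 0.0, -4.0, 1.0), 3, 4, 323L) returns the same elements in ascending order. Every element is moved exactly once in step 3, which is the sense in which no correct global sort can shuffle less; the sample only affects how evenly the buckets are filled, not the correctness of the result.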
