I need to find an efficient way to sort an RDD of large dimensions globally, shuffling as little as possible. I also need to avoid performance problems, for example OutOfMemory exceptions, and so on.
Thanks
The amount of data you have to shuffle to sort an RDD is fixed, in the sense that every solution already shuffles as little as possible. What can be improved is pushing the sorting mechanism down into the shuffle itself, which is what RDD.sortBy, OrderedRDDFunctions.sortByKey, and JavaPairRDD.sortByKey do: they range-partition the data so each partition receives a contiguous range of keys, and sorting happens as part of the shuffle rather than as a separate pass.
So choose the method applicable to your data. For example:
val rdd = org.apache.spark.mllib.random.RandomRDDs.normalRDD(sc, 100, 10, 323L)
rdd.sortBy(identity).take(3)
// Array[Double] = Array(-2.678684754806642, -1.4394327869537575, -1.2573154896913827)
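If your data is already keyed, sortByKey sorts by key during the shuffle itself. A minimal sketch, assuming an existing SparkContext `sc` and made-up keys and values for illustration:

// Hypothetical pair RDD; sortByKey uses a range partitioner so that
// each partition holds a contiguous key range, and sorts within
// partitions as part of the shuffle.
val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))
pairs.sortByKey().collect()
// Array[(Int, String)] = Array((1,a), (2,b), (3,c))

Because the result is range-partitioned, downstream operations that need globally sorted output (such as take or a partition-by-partition scan) can read the partitions in order without a further shuffle.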