I need to find an efficient way of sorting an RDD of large dimensions globally, shuffling as little as possible. I also need to avoid performance problems such as OutOfMemory exceptions, and so on.
Thanks.
The amount of data you have to shuffle to get a sorted RDD is fixed, in the sense that every solution shuffles minimally, i.e. as little as possible. What can be improved is pushing the sorting mechanism down into the shuffle phase itself, which is what RDD.sortBy, OrderedRDDFunctions.sortByKey and JavaPairRDD.sortByKey do.
So choose the method applicable to your data. For example:
val rdd = org.apache.spark.mllib.random.RandomRDDs.normalRDD(sc, 100, 10, 323L)
rdd.sortBy(identity).take(3)
// Array[Double] =
// Array(-2.678684754806642, -1.4394327869537575, -1.2573154896913827)
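For key-value data, the same push-down applies through sortByKey. A minimal sketch, assuming a local SparkContext; the pair RDD and its contents are hypothetical illustration data:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for illustration only.
val conf = new SparkConf().setAppName("sortByKeyExample").setMaster("local[*]")
val sc = new SparkContext(conf)

// Build a pair RDD and sort it globally by key. The sort is performed
// during the shuffle (Spark range-partitions the keys, then sorts each
// partition), so no extra shuffle beyond the unavoidable one is incurred.
val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))
val sorted = pairs.sortByKey(ascending = true, numPartitions = 2)
sorted.collect()  // Array((a,1), (b,2), (c,3))
```

Because the result is range-partitioned, partition 0 holds the smallest keys, partition 1 the next range, and so on, which also makes subsequent per-partition processing of the sorted data cheap.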