I am looking for documentation on the implementation of the parallel LBFGS and OWLQN algorithms in the Spark 1.6 ML library.
I found this page for 1.6: http://spark.apache.org/docs/1.6.1/ml-advanced.html but it says nothing about parallelization.
For 2.0: http://spark.apache.org/docs/2.0.0/ml-advanced.html still nothing about parallelization.
Finally, I read the code [link1]. The method
def train(dataset: DataFrame): LogisticRegressionModel
seems to optimize the model using Breeze, but I can't find where the Spark functions (map, flatMap, reduce, ...) are called.
In the code [link2], map is used to compute the sub-gradients, which are then reduced to compute the gradient.
Thanks
In short, Spark uses the Breeze LBFGS and OWLQN optimization algorithms and provides each of them with a way to compute the gradient of the cost function at each iteration.
Spark's LogisticRegression class, for instance, utilizes a LogisticCostFun class which extends Breeze's DiffFunction trait. This cost function class implements the calculate abstract method, which has the signature:
override def calculate(coefficients: BDV[Double]): (Double, BDV[Double])
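To make the calculate contract concrete, here is a minimal sketch in plain Scala. It deliberately does not use Breeze: SimpleDiffFunction is a hypothetical stand-in for Breeze's DiffFunction trait, Array[Double] stands in for BDV[Double], and QuadraticCost is an invented toy cost, not anything from Spark.

```scala
// Hypothetical stand-in for Breeze's DiffFunction: given a point,
// return the cost and the gradient of the cost at that point.
trait SimpleDiffFunction {
  def calculate(coefficients: Array[Double]): (Double, Array[Double])
}

// Toy cost f(x) = sum_i (x_i - target_i)^2, whose gradient is 2 * (x - target).
class QuadraticCost(target: Array[Double]) extends SimpleDiffFunction {
  override def calculate(coefficients: Array[Double]): (Double, Array[Double]) = {
    val diff = coefficients.zip(target).map { case (x, t) => x - t }
    val loss = diff.map(d => d * d).sum
    val gradient = diff.map(_ * 2.0)
    (loss, gradient)
  }
}
```

An optimizer only ever needs this one method: it proposes coefficients and gets back the pair (cost, gradient), without knowing how either was computed.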
The calculate method utilizes the LogisticAggregator class, which is where the real work gets done. The aggregation class defines two important methods:
def add(instance: Instance): this.type // the gradient update equation is hard-coded here
def merge(other: LogisticAggregator): this.type // adds other's gradient to the current gradient
The add method defines how to update the gradient after adding a single data point, and the merge method defines how to combine two separate aggregators. This class is shipped to the executors, used to aggregate each data partition, and then used to combine the partition aggregators into a single aggregator. The final aggregator instance holds the cumulative gradient for the current iteration and is used to update the coefficients on the driver node. This process is controlled by the call to treeAggregate in the LogisticCostFun class:
val logisticAggregator = {
  val seqOp = (c: LogisticAggregator, instance: Instance) => c.add(instance)
  val combOp = (c1: LogisticAggregator, c2: LogisticAggregator) => c1.merge(c2)

  instances.treeAggregate(
    new LogisticAggregator(coeffs, numClasses, fitIntercept, featuresStd, featuresMean)
  )(seqOp, combOp)
}
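The add/merge pattern can be tried without a cluster. The following is a simplified sketch, assuming binary labels in {0, 1}, no intercept and no feature scaling. Point, GradAggregator and LocalAggregate are hypothetical names invented for this example, and the "partitions" are ordinary local Seqs rather than an RDD, so this only mimics the shape of the treeAggregate call and is not Spark's implementation.

```scala
// Hypothetical, simplified stand-ins; these are not Spark's classes.
case class Point(label: Double, features: Array[Double])

// Accumulates the logistic-loss gradient for binary labels in {0, 1}.
class GradAggregator(coefficients: Array[Double]) extends Serializable {
  val gradientSum: Array[Double] = Array.fill(coefficients.length)(0.0)
  var count: Long = 0L

  // Fold one data point into the running gradient sum.
  def add(p: Point): this.type = {
    val margin = p.features.zip(coefficients).map { case (x, w) => x * w }.sum
    val multiplier = 1.0 / (1.0 + math.exp(-margin)) - p.label
    var i = 0
    while (i < gradientSum.length) {
      gradientSum(i) += multiplier * p.features(i)
      i += 1
    }
    count += 1
    this
  }

  // Combine the partial result of another aggregator into this one.
  def merge(other: GradAggregator): this.type = {
    var i = 0
    while (i < gradientSum.length) {
      gradientSum(i) += other.gradientSum(i)
      i += 1
    }
    count += other.count
    this
  }

  // Average gradient over all points seen so far.
  def gradient: Array[Double] = gradientSum.map(_ / count)
}

object LocalAggregate {
  // seqOp folds each "partition" starting from a fresh aggregator;
  // combOp merges the per-partition results. Requires at least one partition.
  def aggregate(partitions: Seq[Seq[Point]], coefficients: Array[Double]): GradAggregator = {
    val seqOp = (c: GradAggregator, p: Point) => c.add(p)
    val combOp = (c1: GradAggregator, c2: GradAggregator) => c1.merge(c2)
    partitions
      .map(part => part.foldLeft(new GradAggregator(coefficients))(seqOp))
      .reduce(combOp)
  }
}
```

Because the gradient is a sum over data points, splitting the data into partitions and merging afterwards gives the same result as a single pass, up to floating-point rounding. That property is what makes the computation safe to distribute.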
You can think of it a bit more simply like this: Breeze implements several different optimization methods (e.g. LBFGS, OWLQN) and just requires that you tell the optimization method how to compute the gradient. Spark tells the Breeze algorithm how to compute the gradient via the LogisticCostFun class. LogisticCostFun simply says to ship out a LogisticAggregator instance to each partition, collect the gradient updates, and ship them back to be combined on the driver.
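The division of labor between the optimizer and the cost function can be sketched as follows. This is an illustration under loud assumptions: plain gradient descent stands in for LBFGS/OWLQN, the model is a one-dimensional least-squares fit rather than logistic regression, the partitions are local Seqs, and DriverLoopSketch, calculate and minimize are hypothetical names for this example only.

```scala
object DriverLoopSketch {
  // Partial result for one partition: (lossSum, gradientSum, count).
  type Partial = (Double, Double, Long)

  // Cost function for the 1-D least-squares model y ~ w * x. Each call
  // aggregates within every partition and then merges the partial results,
  // the way a cost function built on a distributed aggregate would.
  // Requires at least one partition and at least one data point.
  def calculate(partitions: Seq[Seq[(Double, Double)]])(w: Double): (Double, Double) = {
    val partials: Seq[Partial] = partitions.map { part =>
      part.foldLeft((0.0, 0.0, 0L)) { case ((loss, grad, n), (x, y)) =>
        val err = w * x - y
        (loss + 0.5 * err * err, grad + err * x, n + 1)
      }
    }
    val (lossSum, gradSum, n) = partials.reduce { (a, b) =>
      (a._1 + b._1, a._2 + b._2, a._3 + b._3)
    }
    (lossSum / n, gradSum / n)
  }

  // Plain gradient descent standing in for LBFGS/OWLQN: the optimizer only
  // sees a function from a coefficient to (cost, gradient).
  def minimize(f: Double => (Double, Double), init: Double,
               stepSize: Double, iterations: Int): Double = {
    var w = init
    for (_ <- 0 until iterations) {
      val (_, gradient) = f(w) // one full pass over the data per iteration
      w -= stepSize * gradient
    }
    w
  }
}
```

For example, DriverLoopSketch.minimize(w => DriverLoopSketch.calculate(parts)(w), 0.0, 0.1, 100) runs the loop on a value parts of type Seq[Seq[(Double, Double)]]. The optimizer itself is sequential; the parallelism lives entirely inside the cost function, which is why no map or reduce calls appear in the optimizer code.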