Apache Spark - Optimizer LBFGS OWLQN implementation


I am looking for documentation on the implementation of the parallel LBFGS and OWLQN algorithms in the Spark 1.6 ML library.

I found this page for 1.6: http://spark.apache.org/docs/1.6.1/ml-advanced.html but it says nothing about parallelization.

For 2.0: http://spark.apache.org/docs/2.0.0/ml-advanced.html still nothing about parallelization.

Finally, I read the code [link1]. The method

def train(dataset: DataFrame): LogisticRegressionModel

seems to optimize the model using Breeze, but I can't find where any Spark functions are called (map, flatMap, reduce, ...).

In the code [link2], map is used to compute the sub-gradients, which are then reduced to compute the gradient.

Thanks.

In short, Spark uses the Breeze LBFGS and OWLQN optimization algorithms and provides each of them with a way to compute the gradient of the cost function at each iteration.

Spark's LogisticRegression class, for instance, utilizes a LogisticCostFun class which extends Breeze's DiffFunction trait. This cost function class implements the calculate abstract method, which has the signature:

override def calculate(coefficients: BDV[Double]): (Double, BDV[Double])

The calculate method utilizes the LogisticAggregator class, which is where the real work is done. The aggregation class defines two important methods:

def add(instance: Instance): this.type // the gradient update equation is hard-coded here
def merge(other: LogisticAggregator): this.type // adds other's gradient to the current gradient

The add method defines how to update the gradient after adding a single data point, and the merge method defines how to combine two separate aggregators. This class is shipped to the executors, used to aggregate each data partition, and then used to combine the partition aggregators into a single aggregator. The final aggregator instance holds the cumulative gradient for the current iteration and is used to update the coefficients on the driver node. This process is controlled by the call to treeAggregate in the LogisticCostFun class:

val logisticAggregator = {
  val seqOp = (c: LogisticAggregator, instance: Instance) => c.add(instance)
  val combOp = (c1: LogisticAggregator, c2: LogisticAggregator) => c1.merge(c2)

  instances.treeAggregate(
    new LogisticAggregator(coeffs, numClasses, fitIntercept, featuresStd, featuresMean)
  )(seqOp, combOp)
}
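To make the add/merge pattern concrete, here is a minimal sketch in plain Scala with no Spark dependency. It is not Spark's actual LogisticAggregator: GradientAggregator, AggregatorSketch and the simplified Instance are hypothetical names, the model is binary logistic regression without intercept, weights or feature scaling, and local sequences stand in for RDD partitions. It assumes at least one partition and at least one data point overall.

```scala
// Hypothetical, simplified stand-in for Spark's Instance (label + features only).
final case class Instance(label: Double, features: Array[Double])

// Hypothetical stand-in for LogisticAggregator: accumulates loss and gradient sums.
final class GradientAggregator(coefficients: Array[Double]) extends Serializable {
  private val gradientSum = Array.fill(coefficients.length)(0.0)
  private var lossSum = 0.0
  private var count = 0L

  // Fold one data point into the running sums (the role of seqOp).
  def add(instance: Instance): this.type = {
    val margin = instance.features.zip(coefficients).map { case (x, w) => x * w }.sum
    val prediction = 1.0 / (1.0 + math.exp(-margin))
    val multiplier = prediction - instance.label
    var i = 0
    while (i < gradientSum.length) {
      gradientSum(i) += multiplier * instance.features(i)
      i += 1
    }
    // Log loss for a 0/1 label; not numerically hardened for extreme margins.
    lossSum += (if (instance.label > 0.5) -math.log(prediction) else -math.log(1.0 - prediction))
    count += 1
    this
  }

  // Combine the sums of two aggregators (the role of combOp).
  def merge(other: GradientAggregator): this.type = {
    var i = 0
    while (i < gradientSum.length) {
      gradientSum(i) += other.gradientSum(i)
      i += 1
    }
    lossSum += other.lossSum
    count += other.count
    this
  }

  def loss: Double = lossSum / count
  def gradient: Array[Double] = gradientSum.map(_ / count)
}

object AggregatorSketch {
  // Local imitation of an aggregate call: add within each "partition",
  // then merge the per-partition aggregators into one.
  def lossAndGradient(
      partitions: Seq[Seq[Instance]],
      coefficients: Array[Double]): (Double, Array[Double]) = {
    val perPartition = partitions.map { part =>
      part.foldLeft(new GradientAggregator(coefficients))((agg, inst) => agg.add(inst))
    }
    val total = perPartition.reduce((a, b) => a.merge(b))
    (total.loss, total.gradient)
  }
}
```

Because add and merge only accumulate sums, the result should not depend on how the data is split across partitions, up to floating-point rounding. That property is what lets the work be spread over executors.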

You can think of it a bit more simply like this: Breeze implements several different optimization methods (e.g. LBFGS, OWLQN) and only requires that you tell the optimization method how to compute the gradient. Spark tells the Breeze algorithm how to compute the gradient via the LogisticCostFun class. LogisticCostFun says to ship out a LogisticAggregator instance to each partition, collect the gradient updates, and ship them back to be combined on the driver.
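The division of labor described above can be sketched roughly as follows, again in plain Scala without Breeze or Spark. All names here (CostFunction, PartitionedSquaredError, OptimizerSketch) are hypothetical. The optimizer is plain gradient descent rather than LBFGS, and the cost is squared error rather than logistic loss, purely to keep the example short. The point is only that the optimizer sees a single calculate method, while the cost function decides how the gradient is assembled from per-partition pieces.

```scala
// Hypothetical stand-in for Breeze's DiffFunction: the optimizer only calls calculate.
trait CostFunction {
  def calculate(coefficients: Array[Double]): (Double, Array[Double])
}

// Squared-error cost over data already split into non-empty "partitions".
// Each partition yields a partial (loss, gradient); the partials are then summed,
// which is the role treeAggregate plays in Spark.
final class PartitionedSquaredError(partitions: Seq[Seq[(Double, Array[Double])]])
    extends CostFunction {
  private val n = partitions.map(_.size).sum.toDouble

  def calculate(w: Array[Double]): (Double, Array[Double]) = {
    val partials = partitions.map { part =>
      part.foldLeft((0.0, Array.fill(w.length)(0.0))) { case ((loss, grad), (y, x)) =>
        val err = x.zip(w).map { case (xi, wi) => xi * wi }.sum - y
        (loss + 0.5 * err * err, grad.zip(x).map { case (g, xi) => g + err * xi })
      }
    }
    val (lossSum, gradSum) = partials.reduce { (a, b) =>
      (a._1 + b._1, a._2.zip(b._2).map { case (p, q) => p + q })
    }
    (lossSum / n, gradSum.map(_ / n))
  }
}

object OptimizerSketch {
  // Plain gradient descent, NOT LBFGS: it knows nothing about partitions,
  // it just asks the cost function for a gradient on every iteration.
  def minimize(
      f: CostFunction,
      init: Array[Double],
      stepSize: Double,
      iterations: Int): Array[Double] = {
    var w = init.clone()
    for (_ <- 0 until iterations) {
      val (_, grad) = f.calculate(w)
      w = w.zip(grad).map { case (wi, gi) => wi - stepSize * gi }
    }
    w
  }
}
```

Swapping the optimizer for a different one would not touch the cost function, and changing how the gradient is gathered would not touch the optimizer. As far as I can tell, that is the same separation Spark relies on when it hands LogisticCostFun to Breeze.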

