I am looking for documentation on the implementation of the parallel LBFGS and OWLQN algorithms in the Spark 1.6 ML library.
I found this page for 1.6: http://spark.apache.org/docs/1.6.1/ml-advanced.html but it says nothing about the parallelization.
For 2.0: http://spark.apache.org/docs/2.0.0/ml-advanced.html there is still nothing about the parallelization.
Finally, I read the code [link1]. The method

    def train(dataset: DataFrame): LogisticRegressionModel

seems to optimize the model using Breeze, but I can't find where the Spark functions (map, flatMap, reduce, ...) are called.
In the code [link2], map is used to compute the sub-gradients, which are then reduced to compute the gradient.
Thanks
In short, Spark uses Breeze's LBFGS and OWLQN optimization algorithms and provides each of them with a way to compute the gradient of the cost function at each iteration.
Spark's LogisticRegression class, for instance, uses a LogisticCostFun class that extends Breeze's DiffFunction trait. This cost function class implements the abstract calculate method, which has the signature:

    override def calculate(coefficients: BDV[Double]): (Double, BDV[Double])

The calculate method uses the LogisticAggregator class, which is where the real work gets done. The aggregator class defines two important methods:

    def add(instance: Instance): this.type               // gradient update equation is hard-coded here
    def merge(other: LogisticAggregator): this.type      // adds other's gradient to the current gradient

The add method defines how to update the gradient after adding a single data point, and the merge method defines how to combine two separate aggregators. This class is shipped to the executors, used to aggregate each data partition, and then used to combine the partition aggregators into a single aggregator. The final aggregator instance holds the cumulative gradient for the current iteration and is used to update the coefficients on the driver node. This process is controlled by the call to treeAggregate in the LogisticCostFun class:

    val logisticAggregator = {
      val seqOp = (c: LogisticAggregator, instance: Instance) => c.add(instance)
      val combOp = (c1: LogisticAggregator, c2: LogisticAggregator) => c1.merge(c2)

      instances.treeAggregate(
        new LogisticAggregator(coeffs, numClasses, fitIntercept, featuresStd, featuresMean)
      )(seqOp, combOp)
    }

You can think of it a bit more simply like this: Breeze implements several different optimization methods (e.g. LBFGS, OWLQN) and only requires that you tell the optimization method how to compute the gradient. Spark tells the Breeze algorithm how to compute the gradient via the LogisticCostFun class. LogisticCostFun simply says to ship a LogisticAggregator instance out to each partition, collect the gradient updates, and ship them back to be combined on the driver.
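To make the add/merge pattern concrete without a cluster, here is a minimal sketch in plain Scala. It is not Spark's code: `Point`, `GradientAggregator`, `AggregationSketch` and the fixed-step gradient descent loop are hypothetical stand-ins for `Instance`, `LogisticAggregator`, the `calculate` method and Breeze's LBFGS respectively. Partitions are modelled as nested `Seq`s, and the sketch assumes binary labels with no intercept, feature scaling or regularization.

```scala
// Hypothetical, simplified stand-in for Spark's Instance.
final case class Point(label: Double, features: Array[Double])

// Hypothetical, simplified stand-in for LogisticAggregator (binary labels, no intercept).
final class GradientAggregator(coefficients: Array[Double]) extends Serializable {
  private val gradientSum = Array.fill(coefficients.length)(0.0)
  private var lossSum = 0.0
  private var count = 0L

  // Fold one data point into the running loss and gradient sums.
  def add(p: Point): this.type = {
    val margin = coefficients.indices.map(i => coefficients(i) * p.features(i)).sum
    val prediction = 1.0 / (1.0 + math.exp(-margin))
    val multiplier = prediction - p.label
    for (i <- gradientSum.indices) gradientSum(i) += multiplier * p.features(i)
    lossSum += (if (p.label > 0.5) -math.log(prediction) else -math.log(1.0 - prediction))
    count += 1
    this
  }

  // Add another aggregator's partial sums to this one.
  def merge(other: GradientAggregator): this.type = {
    for (i <- gradientSum.indices) gradientSum(i) += other.gradientSum(i)
    lossSum += other.lossSum
    count += other.count
    this
  }

  def loss: Double = lossSum / count
  def gradient: Array[Double] = gradientSum.map(_ / count)
}

object AggregationSketch {
  // One cost-function evaluation: fold within each "partition", then merge the partial results.
  def lossAndGradient(partitions: Seq[Seq[Point]], coefficients: Array[Double]): (Double, Array[Double]) = {
    val seqOp = (agg: GradientAggregator, p: Point) => agg.add(p)
    val combOp = (a: GradientAggregator, b: GradientAggregator) => a.merge(b)
    val combined = partitions
      .map(_.foldLeft(new GradientAggregator(coefficients))(seqOp))
      .reduce(combOp)
    (combined.loss, combined.gradient)
  }

  // Fixed-step gradient descent, used here only as a stand-in for the optimizer on the driver.
  def descend(partitions: Seq[Seq[Point]], init: Array[Double], step: Double, iterations: Int): Array[Double] =
    (1 to iterations).foldLeft(init) { (coeffs, _) =>
      val (_, grad) = lossAndGradient(partitions, coeffs)
      coeffs.zip(grad).map { case (c, g) => c - step * g }
    }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(
      Seq(Point(1.0, Array(1.0, 2.0))),
      Seq(Point(0.0, Array(1.0, -1.0)), Point(1.0, Array(1.0, 0.5)))
    )
    val fitted = descend(partitions, Array(0.0, 0.0), step = 0.5, iterations = 50)
    println(s"coefficients = ${fitted.mkString(", ")}")
    println(s"loss = ${lossAndGradient(partitions, fitted)._1}")
  }
}
```

The point of the sketch is that the optimizer only ever sees the pair returned by `lossAndGradient`. How that pair is computed (here a local fold and reduce, in Spark a `treeAggregate` over the data partitions) is invisible to it, which is why the parallelism does not show up in the optimizer code itself.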