What is the input format of org.apache.spark.ml.classification.LogisticRegression fit()?


In this example of training a LogisticRegression model, an RDD[LabeledPoint] is used as the input to the fit() method, and the comments say: "// We use LabeledPoint, which is a case class. Spark SQL can convert RDDs of case classes // into SchemaRDDs, where it uses the case class metadata to infer the schema."
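
As I understand it, the conversion that comment describes looks roughly like this in isolation (a sketch, assuming Spark 1.3+, where SchemaRDD became DataFrame, and an already-created SparkContext sc):

import org.apache.spark.sql.SQLContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext (assumed)
import sqlContext.implicits._        // brings the RDD-to-DataFrame conversions into scope

// LabeledPoint is a case class, so its fields (label: Double, features: Vector)
// are used to infer the schema.
val labeled = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0))
))
val df = labeled.toDF()  // the implicit conversion is applied here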

Where is this conversion happening? When I try this code:

val sqlContext = new SQLContext(sc)
import sqlContext._
val model = lr.fit(training)

where training is of type RDD[LabeledPoint], I get a compilation error stating that fit() expects a DataFrame. When I convert the RDD to a DataFrame, I get this exception:

An exception occured while executing the Java class. null: InvocationTargetException: requirement failed: Column features must be of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce StructType(StructField(label,DoubleType,false), StructField(features,org.apache.spark.mllib.linalg.VectorUDT@f71b0bce,true))

But this is confusing me. Why does it expect a Vector? It also needs labels. So I am wondering: what is the correct format?

The reason I am using ml.LogisticRegression and not mllib.LogisticRegressionWithLBFGS is that I want the elastic net implementation.
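
For completeness, the elastic net behaviour I am after is configured through setElasticNetParam (a sketch, assuming Spark 1.4+, where that parameter is available; the parameter values are only illustrative):

import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setMaxIter(100)          // illustrative value
  .setRegParam(0.01)        // illustrative regularization strength
  .setElasticNetParam(0.8)  // 0.0 = pure L2, 1.0 = pure L1; 0.8 is illustrative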

The exception says the DataFrame is expected to follow this structure:

StructType(StructField(label,DoubleType,false), StructField(features,org.apache.spark.mllib.linalg.VectorUDT@f71b0bce,true))

So I prepared the training data as a list of (label, features) tuples like this:

import org.apache.spark.mllib.linalg.Vectors

val training = sqlContext.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")
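
With the data in that (label, features) shape, fit() accepts it directly. A minimal usage sketch, assuming lr is the ml.LogisticRegression estimator from the question:

val lrModel = lr.fit(training)  // lr: an ml.LogisticRegression estimator (assumed)

// transform() appends rawPrediction, probability and prediction columns.
lrModel.transform(training).show()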
