I'm trying to filter DataFrame content using Spark 1.5's dropDuplicates() method. With fully populated tables (I mean no empty cells) it gives the correct result, but when the CSV source contains empty cells (I'll provide the source file), Spark throws an ArrayIndexOutOfBoundsException. What am I doing wrong? I've read the Spark SQL and DataFrames tutorial for version 1.6.2, but it does not describe DataFrame operations in detail. I'm also reading the book "Learning Spark. Lightning-Fast Big Data Analysis.", but it's written for Spark 1.5 and the operations I need are not described there. I'd be glad for either an explanation or a link to a manual. Thank you.
package data;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import java.util.Arrays;

public class TestDrop {
    public static void main(String[] args) {
        DropData dropData = new DropData("src/main/resources/distinct-test.csv");
        dropData.execute();
    }
}

class DropData {
    private String csvPath;
    private JavaSparkContext sparkContext;
    private SQLContext sqlContext;

    DropData(String csvPath) {
        this.csvPath = csvPath;
    }

    void execute() {
        initContext();
        DataFrame dataFrame = loadDataFrame();
        dataFrame.show();
        dataFrame.dropDuplicates(new String[]{"surname"}).show();
        // this one fails too: dataFrame.drop("surname")
    }

    private void initContext() {
        sparkContext = new JavaSparkContext(new SparkConf().setMaster("local[4]").setAppName("Drop test"));
        sqlContext = new SQLContext(sparkContext);
    }

    private DataFrame loadDataFrame() {
        JavaRDD<String> strings = sparkContext.textFile(csvPath);
        JavaRDD<Row> rows = strings.map(string -> {
            String[] cols = string.split(",");
            return RowFactory.create(cols);
        });

        StructType st = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("name", DataTypes.StringType, false),
                DataTypes.createStructField("surname", DataTypes.StringType, true),
                DataTypes.createStructField("age", DataTypes.StringType, true),
                DataTypes.createStructField("sex", DataTypes.StringType, true),
                DataTypes.createStructField("socialId", DataTypes.StringType, true)));

        return sqlContext.createDataFrame(rows, st);
    }
}
Sending a List instead of an Object[] results in the creation of Rows containing a single column with the whole list inside. That's what I was doing wrong.
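For illustration, here is a minimal sketch of that difference. RowFactory.create takes varargs (Object... values), so passing a List as a single argument yields a one-column Row, while passing an array spreads the values into one column each. The sample values below are made up purely for the example and don't come from the original CSV:

import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;

import java.util.Arrays;
import java.util.List;

public class RowArityExample {
    public static void main(String[] args) {
        // Hypothetical values, just to show the arity difference.
        List<String> cols = Arrays.asList("John", "Smith", "30", "m", "123");

        // Passing the List itself: the varargs see ONE argument, so the
        // resulting Row has a single column containing the whole list.
        Row wrong = RowFactory.create(cols);
        System.out.println(wrong.size()); // 1

        // Passing an Object[] (or the String[] from split() directly):
        // the values are spread, so the Row has one column per value,
        // matching the five-field schema.
        Row right = RowFactory.create(cols.toArray());
        System.out.println(right.size()); // 5
    }
}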