apache spark - Why does specifying the schema as DateType / TimestampType make querying extremely slow?
I'm using spark-csv 1.1.0 and Spark 1.5. I make the schema as follows:
import org.apache.spark.sql.types._

private def makeSchema(tableColumns: List[SparkSQLFieldConfig]): StructType = {
  new StructType(
    tableColumns.map(p => p.columnDataType match {
      case FieldDataType.Integer  => StructField(p.columnName, IntegerType, nullable = true)
      case FieldDataType.Decimal  => StructField(p.columnName, FloatType, nullable = true)
      case FieldDataType.String   => StructField(p.columnName, StringType, nullable = true)
      case FieldDataType.DateTime => StructField(p.columnName, TimestampType, nullable = true)
      case FieldDataType.Date     => StructField(p.columnName, DateType, nullable = true)
      case FieldDataType.Boolean  => StructField(p.columnName, BooleanType, nullable = false)
      case _                      => StructField(p.columnName, StringType, nullable = true)
    }).toArray
  )
}
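The DataFrame is then loaded with this schema through spark-csv and queried with simple aggregations, roughly like this (a simplified sketch; the path and column names here are just placeholders, not the actual code):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Pass the explicit schema to spark-csv so no type inference is needed.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(makeSchema(tableColumns))
  .load("/path/to/data.csv")

// Typical query: a simple aggregation.
df.groupBy("some_key_column").sum("some_numeric_column").show()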
But when there are DateType or TimestampType columns, querying the DataFrames becomes extremely slow. (The queries are just simple groupBy(), sum() and so on.)

With the same dataset, after commenting out the two lines that map Date to DateType and DateTime to TimestampType (that is, mapping them to StringType instead), the queries become much faster.

What could be the reason for this? Thank you very much!
We have found a possible answer to this problem.

When a column is specified as DateType or TimestampType, spark-csv tries to parse the dates with its internal formats for every line of every row, which makes the parsing much slower.

From the official documentation, it seems we can specify the format of the dates in an option, which I suppose should make the parsing faster.
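For example, something like this should help (a sketch only; the pattern is a placeholder, and the dateFormat option may require a spark-csv release newer than 1.1.0):

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  // Give spark-csv the exact pattern (java.text.SimpleDateFormat syntax)
  // so it does not have to fall back to its default parsing for every row.
  .option("dateFormat", "yyyy-MM-dd HH:mm:ss")
  .schema(makeSchema(tableColumns))
  .load("/path/to/data.csv")

If I read the spark-csv documentation correctly, the same dateFormat option applies to both DateType and TimestampType columns.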