apache spark - Why does specifying the schema as DateType / TimestampType make querying extremely slow?


I'm using spark-csv 1.1.0 with Spark 1.5. I make the schema as follows:

import org.apache.spark.sql.types._

private def makeSchema(tableColumns: List[SparkSQLFieldConfig]): StructType = {
  new StructType(
    tableColumns.map(p => p.columnDataType match {
      case FieldDataType.Integer  => StructField(p.columnName, IntegerType, nullable = true)
      case FieldDataType.Decimal  => StructField(p.columnName, FloatType, nullable = true)
      case FieldDataType.String   => StructField(p.columnName, StringType, nullable = true)
      case FieldDataType.DateTime => StructField(p.columnName, TimestampType, nullable = true)
      case FieldDataType.Date     => StructField(p.columnName, DateType, nullable = true)
      case FieldDataType.Boolean  => StructField(p.columnName, BooleanType, nullable = false)
      case _                      => StructField(p.columnName, StringType, nullable = true)
    }).toArray
  )
}
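For context, the schema is then handed to the spark-csv reader. The snippet below is only a rough sketch of that step; the SQLContext setup, the header option and the input path are placeholders, not my exact code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("csv-load"))
val sqlContext = new SQLContext(sc)

// tableColumns is the List[SparkSQLFieldConfig] passed to makeSchema above
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(makeSchema(tableColumns))
  .load("/path/to/data.csv")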

But when there are DateType or TimestampType columns, querying the DataFrames becomes extremely slow. (The queries are simple: groupBy(), sum() and so on.)
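For reference, the queries look roughly like this; the column names "category" and "amount" are made up for illustration:

// Hypothetical column names, just to show the shape of the queries
val result = df.groupBy("category").sum("amount")
result.show()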

With the same dataset, after commenting out the two lines that map Date to DateType and DateTime to TimestampType (that is, mapping both to StringType), the queries become much faster.

What is the possible reason for this? Thanks very much!

We have found a possible answer to this problem.

When a column is specified as DateType or TimestampType, spark-csv tries to parse the dates with its internal formats for each line of every row, which makes the parsing process much slower.

From the official documentation, it seems you can specify the date format in an option. I suppose that could make the parsing process faster.
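As a minimal sketch of what I mean, assuming a spark-csv version that supports the dateFormat option (it may need an upgrade from 1.1.0) and using a placeholder pattern and path:

// Not tested: dateFormat and the pattern below are assumptions based on
// the spark-csv documentation; giving the parser one explicit format
// should avoid the per-row format guessing described above.
val dfWithDates = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("dateFormat", "yyyy-MM-dd HH:mm:ss")
  .schema(makeSchema(tableColumns))
  .load("/path/to/data.csv")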

