apache spark - Why does specifying the schema to be DateType / TimestampType make querying extremely slow?


I'm using spark-csv 1.1.0 with Spark 1.5. I make the schema as follows:

import org.apache.spark.sql.types._

private def makeSchema(tableColumns: List[SparkSqlFieldConfig]): StructType = {
  new StructType(
    tableColumns.map(p => p.columnDataType match {
      case FieldDataType.Integer  => StructField(p.columnName, IntegerType, nullable = true)
      case FieldDataType.Decimal  => StructField(p.columnName, FloatType, nullable = true)
      case FieldDataType.String   => StructField(p.columnName, StringType, nullable = true)
      case FieldDataType.DateTime => StructField(p.columnName, TimestampType, nullable = true)
      case FieldDataType.Date     => StructField(p.columnName, DateType, nullable = true)
      case FieldDataType.Boolean  => StructField(p.columnName, BooleanType, nullable = false)
      case _                      => StructField(p.columnName, StringType, nullable = true)
    }).toArray
  )
}

But when there are DateType / TimestampType columns, querying the DataFrames is extremely slow. (The queries are just simple groupBy(), sum() and so on.)
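For context, here is a minimal sketch of how such a schema would typically be fed to spark-csv and then queried; the file path and column names below are made-up placeholders, not from the original setup:

import org.apache.spark.sql.SQLContext

// Assumes an existing SparkContext `sc` (Spark 1.5 / spark-csv 1.1.0 style).
val sqlContext = new SQLContext(sc)

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(makeSchema(tableColumns))   // the StructType built by makeSchema above
  .load("/path/to/data.csv")          // placeholder path

// The kind of simple aggregation described in the question.
df.groupBy("someStringColumn").sum("someNumericColumn").show()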

With the same dataset, after commenting out the two lines that map Date to DateType and DateTime to TimestampType (that is, mapping them to StringType instead), the queries become much faster.

What is the possible reason for this? Thanks a lot!

We have found a possible answer to this problem.

When a column is specified as DateType or TimestampType, spark-csv tries to parse the dates with its internal formats on every value of every row, which makes the parsing process much slower.

From the official documentation, it seems you can specify the date format as an option. I suppose this can make the parsing process much faster.
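As a rough sketch of that idea (not from the original post): spark-csv exposes a dateFormat option that takes a SimpleDateFormat pattern, though it may require a newer spark-csv release than 1.1.0. With a fixed pattern, the parser no longer has to guess the format for every value:

// Same sqlContext and schema as in the earlier sketch; the pattern below is an
// example and must match the actual data.
val dfWithDates = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("dateFormat", "yyyy-MM-dd HH:mm:ss")   // applied to DateType / TimestampType columns
  .schema(makeSchema(tableColumns))
  .load("/path/to/data.csv")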

