apache spark - Why does specifying the schema as DateType / TimestampType make querying extremely slow?


I'm using spark-csv 1.1.0 with Spark 1.5. I make the schema as follows:

import org.apache.spark.sql.types._

private def makeSchema(tableColumns: List[SparkSqlFieldConfig]): StructType = {
  new StructType(
    tableColumns.map(p => p.columnDataType match {
      case FieldDataType.Integer  => StructField(p.columnName, IntegerType, nullable = true)
      case FieldDataType.Decimal  => StructField(p.columnName, FloatType, nullable = true)
      case FieldDataType.String   => StructField(p.columnName, StringType, nullable = true)
      case FieldDataType.DateTime => StructField(p.columnName, TimestampType, nullable = true)
      case FieldDataType.Date     => StructField(p.columnName, DateType, nullable = true)
      case FieldDataType.Boolean  => StructField(p.columnName, BooleanType, nullable = false)
      case _                      => StructField(p.columnName, StringType, nullable = true)
    }).toArray
  )
}

But when there are DateType columns, querying the DataFrames is very slow. (The queries are simple: groupBy(), sum() and so on.)

With the same dataset, after I commented out the two lines that map Date to DateType and DateTime to TimestampType (that is, mapping them to StringType instead), the queries become much faster. A rough sketch of that change is shown below.
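For reference, the commented-out mapping looks roughly like this; with the two cases removed, Date and DateTime columns fall through to the default case and are loaded as plain strings:

// case FieldDataType.DateTime => StructField(p.columnName, TimestampType, nullable = true)
// case FieldDataType.Date     => StructField(p.columnName, DateType, nullable = true)
// With these two cases commented out, Date and DateTime columns hit the default
// case and are loaded as StringType, which skips date parsing entirely.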

What is the possible reason for this? Thanks very much!

We have found a possible answer to this problem.

When a column is specified as DateType or TimestampType, spark-csv tries to parse the dates with its internal formats for every line of every row, which makes the parsing process much slower.

From the official documentation, it seems we can specify the date format in the options. I suppose this can make the parsing process faster.
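A minimal sketch of what that could look like. The "dateFormat" option name is taken from the spark-csv documentation and may not exist in version 1.1.0; the file path, format pattern, and tableColumns value are assumptions for illustration:

import org.apache.spark.sql.SQLContext

// Sketch: give spark-csv an explicit date format so it does not have to try
// its internal formats on every row. "dateFormat" is assumed from the docs;
// the path and pattern below are placeholders.
val sqlContext = new SQLContext(sc)   // sc is the existing SparkContext

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("dateFormat", "yyyy-MM-dd HH:mm:ss") // SimpleDateFormat pattern for Date/Timestamp columns
  .schema(makeSchema(tableColumns))            // schema built above with DateType/TimestampType
  .load("hdfs:///path/to/data.csv")            // hypothetical path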

