apache spark - Why does specifying the schema as DateType / TimestampType make querying extremely slow? -


I'm using spark-csv 1.1.0 and Spark 1.5. I make the schema as follows:

private def makeSchema(tableColumns: List[SparkSQLFieldConfig]): StructType = {
  new StructType(
    tableColumns.map(p => p.ColumnDataType match {
      case FieldDataType.Integer  => StructField(p.ColumnName, IntegerType, nullable = true)
      case FieldDataType.Decimal  => StructField(p.ColumnName, FloatType, nullable = true)
      case FieldDataType.String   => StructField(p.ColumnName, StringType, nullable = true)
      case FieldDataType.DateTime => StructField(p.ColumnName, TimestampType, nullable = true)
      case FieldDataType.Date     => StructField(p.ColumnName, DateType, nullable = true)
      case FieldDataType.Boolean  => StructField(p.ColumnName, BooleanType, nullable = false)
      case _                      => StructField(p.ColumnName, StringType, nullable = true)
    }).toArray
  )
}
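For context, a schema built this way would be passed to the spark-csv reader when loading the file. A minimal sketch of that wiring, assuming a hypothetical file path and column config (not from the original post):

```scala
import org.apache.spark.sql.SQLContext

// Assumption: sqlContext, makeSchema and tableColumns are defined as in the question.
val schema = makeSchema(tableColumns)

val df = sqlContext.read
  .format("com.databricks.spark.csv")       // spark-csv 1.x data source
  .option("header", "true")                 // assumes the CSV has a header row
  .schema(schema)                           // apply the explicit schema instead of inferring
  .load("/path/to/data.csv")                // hypothetical path
```

With an explicit schema, spark-csv skips schema inference but must still convert each field of each row to the declared type while parsing.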

But when there are DateType columns, querying the DataFrames is extremely slow. (The queries are simple ones such as groupBy(), sum() and so on.)

With the same dataset, after I commented out the two lines that map Date to DateType and DateTime to TimestampType (that is, mapping them to StringType instead), the queries become much faster.

What is the possible reason for this? Thank you very much!

We have found a possible answer to this problem.

When a column is specified as DateType or TimestampType, spark-csv tries to parse the dates with its internal formats for each line of every row, which makes the parsing progress much slower.

From the official documentation, it seems we can specify the format of the dates in the options. I suppose that would make the parsing progress much faster.
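A sketch of what that looks like, assuming a spark-csv version that supports the `dateFormat` option (the pattern string and file path below are placeholders, not from the original post):

```scala
// Supplying an explicit dateFormat lets spark-csv parse date/timestamp
// columns with a single known pattern instead of trying formats per value.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("dateFormat", "yyyy-MM-dd HH:mm:ss") // assumption: match your data's format
  .schema(schema)                              // schema with DateType/TimestampType columns
  .load("/path/to/data.csv")                   // hypothetical path
```

The pattern follows `java.text.SimpleDateFormat` conventions, so it should match the literal layout of the date strings in the CSV file.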
