Apache Spark - Kafka to Spark Streaming to HDFS


I am using createDirectStream to integrate Spark Streaming with Kafka. Here is the code I used:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils
    import kafka.serializer.StringDecoder

    val ssc = new StreamingContext(new SparkConf, Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "sandbox:6667")
    val topics = Set("topic1")

    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

Now I want to store these messages in HDFS. Is this the right way to do it?

    messages.saveAsTextFiles("/tmp/spark/messages")

saveAsTextFiles("/tmp/spark/messages") persists the data to the local file system, and if the provided folder structure ("/tmp/spark/messages") happens to be part of the local HDFS, it will show up in the HDFS directory, because saveAsTextFiles leverages the same MapReduce APIs to write its output.
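Note that saveAsTextFiles takes a prefix (and an optional suffix) rather than a single path: Spark Streaming writes one directory per batch interval, named from the prefix and the batch timestamp. Also, createDirectStream yields (key, value) pairs, so you typically extract the values first. A minimal sketch:

    // Extract the message payloads and write one directory per
    // 10-second batch, e.g. /tmp/spark/messages-1439123456000.txt
    messages.map(_._2).saveAsTextFiles("/tmp/spark/messages", "txt")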

The above works in scenarios where the Spark executors and HDFS are on the same physical machines. In case the HDFS directory or URL is different, i.e. not on the same machines where the executors are running, it will not work.

In case you need to ensure the data is persisted in HDFS, the best practice is to provide the complete HDFS URL: saveAsTextFiles("hdfs://<host-name>:9000/tmp/spark/messages")
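Put together, a minimal sketch of the write with a fully qualified URL (assuming the NameNode listens on port 9000; on many distributions the default is 8020, so check your fs.defaultFS setting):

    messages.map(_._2)
      .saveAsTextFiles("hdfs://<host-name>:9000/tmp/spark/messages")

    ssc.start()             // start the streaming computation
    ssc.awaitTermination()  // block until the job is stopped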

Alternatively, you can leverage either of the following methods (see the sketch after this list):

  1. dstream.saveAsNewAPIHadoopFiles(<hdfs url location>)
  2. dstream.saveAsHadoopFiles(<hdfs url location>)
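Both methods are defined on DStreams of key/value pairs, which the direct stream above already is. A minimal sketch using the older Hadoop API's TextOutputFormat (the prefix/suffix arguments are again expanded to one directory per batch; the Text wrapping is my own illustration, not part of the original answer):

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapred.TextOutputFormat

    // Kafka keys may be null, so guard before wrapping them in Text.
    messages.map { case (key, value) =>
        (new Text(Option(key).getOrElse("")), new Text(value))
      }
      .saveAsHadoopFiles[TextOutputFormat[Text, Text]](
        "hdfs://<host-name>:9000/tmp/spark/messages", "txt")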
