Apache Spark - Kafka to Spark Streaming to HDFS
I am using createDirectStream to integrate Spark Streaming with Kafka. Here is the code I used:
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf(), Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "sandbox:6667")
val topics = Set("topic1")
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

Now I want to store the messages in HDFS. Is this the right way to do it?
messages.saveAsTextFiles("/tmp/spark/messages")
saveAsTextFiles("/tmp/spark/messages") will persist the data to the local file system; if the provided folder structure ("/tmp/spark/messages") happens to be part of a local HDFS, it will show up in the HDFS directory, because saveAsTextFiles leverages the same MapReduce APIs to write its output.
The above will work in scenarios where the Spark executors and HDFS are on the same physical machines, but if the HDFS directory or URL is different, i.e. HDFS is not on the same machines where the executors are running, it will not work.
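To see which file system an unqualified path like "/tmp/spark/messages" will resolve to on a given machine, you can inspect the Hadoop configuration on the classpath. A minimal sketch using the standard Hadoop client API (the printed URIs are illustrative):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Resolves against fs.defaultFS from whichever core-site.xml is on the classpath.
val hadoopConf = new Configuration()
println(FileSystem.get(hadoopConf).getUri)                                // e.g. file:/// or hdfs://sandbox:8020
println(new Path("/tmp/spark/messages").getFileSystem(hadoopConf).getUri) // file system this path would be written to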
If you need to ensure the data is persisted in HDFS, as a good practice you should provide the complete HDFS URL: saveAsTextFiles("hdfs://<host-name>:9000/tmp/spark/messages")
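For example, with the messages stream from the question, a sketch of a fully qualified write (<host-name> is a placeholder for your NameNode host, as above):

// Keep only the message values and write each batch under the qualified HDFS prefix.
messages.map { case (_, value) => value }
  .saveAsTextFiles("hdfs://<host-name>:9000/tmp/spark/messages")

ssc.start()
ssc.awaitTermination()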
Or you can leverage either of the following methods:

dstream.saveAsNewAPIHadoopFiles(<hdfs url location>)
dstream.saveAsHadoopFiles(<hdfs url location>)
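As a minimal sketch of the first variant, assuming you convert the (String, String) pairs from createDirectStream into Hadoop Text (the HDFS prefix is again a placeholder):

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

messages
  // Kafka message keys may be null, so guard before wrapping in Text.
  .map { case (key, value) => (new Text(Option(key).getOrElse("")), new Text(value)) }
  .saveAsNewAPIHadoopFiles(
    "hdfs://<host-name>:9000/tmp/spark/messages", // prefix (placeholder host)
    "txt",                                        // suffix appended to each batch directory
    classOf[Text],
    classOf[Text],
    classOf[TextOutputFormat[Text, Text]])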