apache spark - Kafka to Spark Streaming to HDFS -
I am using createDirectStream to integrate Spark Streaming with Kafka. Here is the code I used:
val ssc = new StreamingContext(new SparkConf, Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "sandbox:6667")
val topics = Set("topic1")
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)
Now I want to store the messages in HDFS. Is this the right way to do it?
messages.saveAsTextFiles("/tmp/spark/messages")
saveAsTextFiles("/tmp/spark/messages")

will persist the data to the file system of the machines where the executors run. If the provided folder structure ("/tmp/spark/messages") is part of the local HDFS, it will show up in the HDFS directory, because saveAsTextFiles leverages the same MapReduce APIs to write the output.
The above works in scenarios where the Spark executors and HDFS are on the same physical machines. If the HDFS directory or URL is different and not on the same machines where the executors are running, it will not work.
If you need to ensure the data is persisted in HDFS, as a best practice you should provide the complete HDFS URL:

saveAsTextFiles("hdfs://<host-name>:9000/tmp/spark/messages")

(Note the hdfs:// scheme; an http:// URL will not work here.)
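Putting the question's snippet together with a fully qualified HDFS URL, the end-to-end job might look like the sketch below. This is only an illustration under assumptions: the host name "sandbox", NameNode port 9000, and Kafka port 6667 come from the thread and must match your cluster's actual fs.defaultFS and broker list, and it targets the old direct-stream API (spark-streaming-kafka for Spark 1.x), where createDirectStream takes decoder type parameters.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaToHdfs {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("KafkaToHdfs"), Seconds(10))

    // Broker list and topic are taken from the question; adjust for your cluster.
    val kafkaParams = Map("metadata.broker.list" -> "sandbox:6667")
    val topics = Set("topic1")

    // DStream of (key, value) pairs; we keep only the message value.
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Fully qualified HDFS URL so the output lands in HDFS regardless of
    // which file system the executors would default to. "sandbox:9000" is
    // an assumed NameNode address; use your fs.defaultFS value.
    messages.map(_._2).saveAsTextFiles("hdfs://sandbox:9000/tmp/spark/messages")

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Each batch interval produces its own output directory named with the prefix plus the batch timestamp (e.g. /tmp/spark/messages-1459875651000), which is the standard saveAsTextFiles behavior.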
Alternatively, you can leverage either of the following methods:
dstream.saveAsNewAPIHadoopFiles(<hdfs url location>)
dstream.saveAsHadoopFiles(<hdfs url location>)
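These two methods are available on pair DStreams (DStream[(K, V)]). A minimal sketch of saveAsNewAPIHadoopFiles is below; the host/port "sandbox:9000" and the choice of Text/TextOutputFormat are illustrative assumptions, not something the answer prescribes.

```scala
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

// Assuming `messages` is the DStream[(String, String)] from the question.
// Convert keys and values to Hadoop Writables, then write each batch via
// the new-API TextOutputFormat under the given HDFS prefix and suffix.
messages
  .map { case (k, v) => (new Text(if (k == null) "" else k), new Text(v)) }
  .saveAsNewAPIHadoopFiles(
    "hdfs://sandbox:9000/tmp/spark/messages", // prefix (assumed NameNode address)
    "txt",                                    // suffix appended to each batch directory
    classOf[Text],
    classOf[Text],
    classOf[TextOutputFormat[Text, Text]])
```

saveAsHadoopFiles is the old-API (org.apache.hadoop.mapred) counterpart with the same prefix/suffix/key/value/output-format parameters; both write one directory per batch, just like saveAsTextFiles.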