Add method for conversion of RDD[GenericRecord] to DataFrame #216
Codecov Report
@@            Coverage Diff            @@
##           master    #216    +/-   ##
=========================================
+ Coverage   90.62%  90.82%  +0.2%
=========================================
  Files           5       6    +1
  Lines         320     327    +7
  Branches       49      50    +1
=========================================
+ Hits          290     297    +7
  Misses         30      30
Is there a reason why this isn't implemented as a UDF? Databricks' recommendation is to implement conversions like these as UDFs.
The issue with this implementation is that it works in batch mode but not in streaming, because createDataFrame doesn't work in streaming. Implementing it as a UDF guarantees that the conversion is supported in batch as well as in streaming.
Also, UDFs allow you to chain operations within a single Spark SQL query.
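To make the UDF suggestion above concrete, here is a minimal sketch of a UDF that decodes one field out of an Avro-encoded binary column. It is not code from this PR: the object name `AvroUdfSketch`, the `Event` schema, and the assumption that the column holds raw Avro binary (no schema-registry header) are all hypothetical.

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

// Hypothetical sketch, not part of this PR.
object AvroUdfSketch {
  // Assumed writer schema, kept as a JSON string so nothing unserializable is captured.
  val schemaJson: String =
    """{"type":"record","name":"Event","fields":[
      |{"name":"key","type":"string"},
      |{"name":"value","type":"int"}]}""".stripMargin

  // Parsed once per JVM, the first time a task on that JVM decodes a record.
  private lazy val schema: Schema = new Schema.Parser().parse(schemaJson)

  // Decode one Avro-binary payload and return its "key" field as a String.
  def decodeKey(bytes: Array[Byte]): String = {
    val reader  = new GenericDatumReader[GenericRecord](schema)
    val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
    reader.read(null, decoder).get("key").toString
  }

  // Column-level function: it can be applied to any DataFrame with a binary column.
  val keyFromAvro: UserDefinedFunction = udf((bytes: Array[Byte]) => decodeKey(bytes))
}
```

Assuming a DataFrame `df` with a binary column named `value`, it could be applied as `df.select(AvroUdfSketch.keyFromAvro(df("value")))`. Because the function operates on a column rather than on an RDD, the same call should apply wherever the data is already a DataFrame.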
The purpose of this was to convert an existing RDD to a DataFrame. I don't understand how this could be achieved with a UDF. My use case was streaming (traditional, not structured). I was consuming Avro from Kafka, iterating over the DStream with foreachRDD, then converting to DataFrames to perform windowed aggregations.
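The use case described above can be sketched roughly as follows. This is not the `RddUtils.scala` from this PR: `AvroRddSketch`, `recordToRow`, and `toDataFrame` are hypothetical names, and the sketch only handles flat records whose fields are primitives or strings. The Spark SQL schema is assumed to be supplied by the caller, for example written by hand or derived from the Avro schema.

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.StructType

// Hypothetical sketch, not the PR's RddUtils: flat records only.
object AvroRddSketch {
  // Copy the named fields of one Avro record into a Spark Row.
  def recordToRow(record: GenericRecord, fieldNames: Seq[String]): Row = {
    val values = fieldNames.map { name =>
      record.get(name) match {
        case s: CharSequence => s.toString // Avro decodes strings as Utf8; Spark expects String
        case other           => other      // boxed primitives pass through unchanged
      }
    }
    Row.fromSeq(values)
  }

  // Map every record to a Row, then attach the caller-supplied schema.
  def toDataFrame(spark: SparkSession,
                  rdd: RDD[GenericRecord],
                  sqlSchema: StructType): DataFrame = {
    // Only plain field names are captured by the closure, not the Avro schema.
    val fieldNames = sqlSchema.fieldNames.toSeq
    spark.createDataFrame(rdd.map(recordToRow(_, fieldNames)), sqlSchema)
  }
}
```

Inside `foreachRDD`, each batch could then be converted with something like `stream.foreachRDD { rdd => AvroRddSketch.toDataFrame(spark, rdd, sqlSchema).groupBy("key").count().show() }`, where `stream`, `spark`, and `sqlSchema` are assumed to exist in the surrounding job.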
Actually I suppose if you converted the RDD to a DataFrame containing a Row of Avro objects, then a UDF could be applied, but I still don't understand the benefit.
Is there any plan to merge this PR? If not, what would be the proposed way to address this issue? I have the same use case as @cbyn: Avro messages from Kafka, traditional streaming, converting the RDD to a DataFrame, aggregations, and dumping to Parquet.
I have the same use case. Any update on this?
I'm happy to do anything necessary to get a solution merged. But in the meantime it is pretty easy to use the code in this PR. All you need are RddUtils.scala (my addition) and SchemaConverters.scala. Let me know if I can help.
Just wanted to check whether this would be possible using Java. Any help or pointers would be much appreciated, as I'm a bit pressed for time. Many thanks.
Hi All
In response to issues #211 and #201