This repository has been archived by the owner on Dec 20, 2018. It is now read-only.

Add method for conversion of RDD[GenericRecord] to DataFrame #216

Open · cbyn wants to merge 7 commits into master
Conversation


@cbyn cbyn commented Feb 9, 2017

In response to issues #211 and #201
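For readers landing here from those issues, the shape of the conversion this PR adds can be sketched as below. This is a hedged illustration, not the exact contents of RddUtils.scala: the method name, parameters, and the flat-record handling are assumptions. It relies on spark-avro's `SchemaConverters.toSqlType` to derive the Spark SQL schema from the Avro schema.

```scala
import com.databricks.spark.avro.SchemaConverters
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.avro.util.Utf8
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.StructType

// Illustrative helper: convert an RDD of Avro GenericRecords to a DataFrame.
def rddToDataFrame(rdd: RDD[GenericRecord],
                   avroSchema: Schema,
                   sqlContext: SQLContext): DataFrame = {
  // Derive the Spark SQL schema from the Avro schema.
  val sqlSchema =
    SchemaConverters.toSqlType(avroSchema).dataType.asInstanceOf[StructType]
  // Capture only the field names in the closure: Avro's Schema class is
  // not serializable, so it must not be shipped to executors.
  val fieldNames = sqlSchema.fieldNames
  // Naive per-field copy: handles flat records with primitive fields.
  // Nested records, unions, and arrays would need recursive conversion.
  val rows: RDD[Row] = rdd.map { record =>
    Row.fromSeq(fieldNames.map { name =>
      record.get(name) match {
        case s: Utf8 => s.toString // Avro strings arrive as Utf8
        case other   => other
      }
    }.toSeq)
  }
  sqlContext.createDataFrame(rows, sqlSchema)
}
```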


codecov-io commented Feb 9, 2017

Codecov Report

Merging #216 into master will increase coverage by 0.2%.

@@            Coverage Diff            @@
##           master     #216     +/-   ##
=========================================
+ Coverage   90.62%   90.82%   +0.2%     
=========================================
  Files           5        6      +1     
  Lines         320      327      +7     
  Branches       49       50      +1     
=========================================
+ Hits          290      297      +7     
  Misses         30       30

Powered by Codecov. Last update 4a3284b...96caad2.

@cbyn cbyn mentioned this pull request Feb 9, 2017

@GaalDornick GaalDornick left a comment


Is there a reason why this isn't implemented as a UDF? Databricks' recommendation is to implement conversions like this as a UDF.

The issue with this implementation is that it only works in batch mode, not in streaming: createDataFrame doesn't work in streaming. Implementing a UDF guarantees that the conversion is supported in batch as well as streaming.
UDFs also let you chain operations within a single Spark SQL statement.
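To make the suggestion concrete, a UDF-style alternative might look like the following sketch: decode a binary Avro payload column inside Spark SQL, so the same expression works in both batch and streaming queries. The column name `value`, the field `someField`, and the helper name are all illustrative assumptions, not part of this PR.

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Illustrative helper: extract one field from a binary Avro column via a UDF.
def withDecodedField(df: DataFrame, writerSchemaJson: String): DataFrame = {
  val decode = udf { (payload: Array[Byte]) =>
    // Parsing the schema on every call is simple but slow; a production
    // version would cache the parsed Schema and reader per executor.
    val schema = new Schema.Parser().parse(writerSchemaJson)
    val reader = new GenericDatumReader[GenericRecord](schema)
    val record =
      reader.read(null, DecoderFactory.get().binaryDecoder(payload, null))
    Option(record.get("someField")).map(_.toString).orNull
  }
  // "value" is the conventional binary payload column from a Kafka source.
  df.withColumn("someField", decode(col("value")))
}
```

The trade-off is that a UDF extracts fields one expression at a time, whereas the RDD conversion materializes the whole record as a Row in one pass.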

@cbyn
Author

cbyn commented Apr 16, 2017

The purpose of this was to convert an existing RDD to a DataFrame. I don't understand how this could be achieved with a UDF. My use case was streaming (traditional, not structured). I was consuming Avro from Kafka, iterating over the DStream with foreachRDD, then converting to DataFrames to perform windowed aggregations.
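That workflow can be sketched as follows (classic Spark Streaming with DStreams, not Structured Streaming). It assumes a conversion helper like the one this PR adds; the `rddToDataFrame` signature and the column names are illustrative, not the exact code.

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.streaming.dstream.DStream

// Illustrative: per micro-batch, convert the RDD to a DataFrame and aggregate.
def aggregateStream(
    avroStream: DStream[GenericRecord],
    avroSchema: Schema,
    sqlContext: SQLContext,
    rddToDataFrame: (RDD[GenericRecord], Schema, SQLContext) => DataFrame
): Unit = {
  avroStream.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      val df = rddToDataFrame(rdd, avroSchema, sqlContext)
      // Aggregate each micro-batch; windowing would be applied upstream
      // with DStream.window(...) before foreachRDD.
      df.groupBy("userId").count().show()
    }
  }
}
```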

@cbyn
Author

cbyn commented Apr 16, 2017

Actually I suppose if you converted the RDD to a DataFrame containing a Row of Avro objects, then a UDF could be applied, but I still don't understand the benefit.

@shardnit

Is there any plan to merge this PR? If not, what would be a proposed way to address this issue? I have the same use-case as @cbyn - Avro message from Kafka, traditional streaming, convert RDD to DataFrame, aggregations, and dump to Parquet.

@johnnycaol

I have the same use case. Any update on this?

@cbyn
Author

cbyn commented Jun 17, 2017

I'm happy to do anything necessary to get a solution merged. But in the meantime it is pretty easy to use the code in this PR. All you need are RddUtils.scala (my addition) and SchemaConverters.scala. Let me know if I can help.

@mnarrell

Agreed. There are several excellent additions to this project that have stalled. This one, along with #217 and #220, would be a much-welcomed addition.

@JP-MRPhys

JP-MRPhys commented Nov 1, 2017

Just wanted to check whether this would be possible using Java? Any help or pointers would be much appreciated; I'm a bit pressed for time. Many thanks.

@pushpendra-jaiswal-90

Hi all, are there any plans to get this merged?

8 participants