Add forceSchema option to output to specified schema #222
base: master
Conversation
Codecov Report

```
@@            Coverage Diff            @@
##           master     #222     +/-  ##
=========================================
+ Coverage   90.46%   92.16%    +1.7%
=========================================
  Files           5        5
  Lines         325      383      +58
  Branches       51       73      +22
=========================================
+ Hits          294      353      +59
+ Misses         31       30       -1
```
Force-pushed from 6a8debd to 7e6b342

Force-pushed from 71733da to 7051abe
What is the status on this?

I know Spark 2.2.0 support has recently been merged, so there is some activity, but I'm curious who I need to reach out to in order to get some eyes on this PR. There's not a contributing doc AFAIK, so I'm at a loss as to how to push this forward.

I have additional enhancements and fixes to this code, but before I go through the exercise of rebasing and incorporating them, I'd like to know whether this might get looked at. Tagging some top contributors... @JoshRosen @marmbrus Thanks in advance.

@lindblombr could you resolve the conflict? Ping @cloud-fan @liancheng for review. Thanks.
If the spark schema doesn't match the specified avro schema, what shall we do? And shall we allow compatible schema changes, like int to long?
@cloud-fan If the schemas don't match, my thinking was that writing would fail (in some way). But if we want it to be more elegant, I'd need to implement some compatibility checking and better error handling. Otherwise, the errors can occur in any number of places, making problems difficult to diagnose. The specific intent of this was to handle the case of reading in a set of avro files and writing that same set of avro files out using the same writer schema. In this case, it works well. For cases outside of this, it would be the responsibility of the developer to ensure type consistency. If we want this change to be more generic, I can add handling for "forward compatible" types, like "int" => "long", "float" => "double", etc. Barring that, I would be happy to add some more detailed error handling, so we can say things like
etc. etc. Thanks so much for taking a look!
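The round-trip case described above (read a set of Avro files, write them back out under the original writer schema) might look roughly like the sketch below. This is illustrative only: the option name `forceSchema` comes from the PR title, but the exact API surface and file paths are assumptions, not the PR's actual code.

```scala
// Hedged usage sketch: round-trip Avro files through Spark while forcing
// the output to use the original writer schema. Paths and the option
// name/format are illustrative assumptions.
import org.apache.spark.sql.SparkSession

object AvroRoundTrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("avro-roundtrip").getOrCreate()

    // The original writer schema, e.g. loaded from an .avsc file.
    val writerSchemaJson = scala.io.Source.fromFile("user.avsc").mkString

    val df = spark.read
      .format("com.databricks.spark.avro")
      .load("input/*.avro")

    // Without a forced schema, spark-avro derives an Avro schema from the
    // Spark schema, which can alter record/namespace/field names; forcing
    // the writer schema keeps the output consistent with the input files.
    df.write
      .format("com.databricks.spark.avro")
      .option("forceSchema", writerSchemaJson)
      .save("output/")

    spark.stop()
  }
}
```

The point of the option, as described in the comment above, is exactly this read-then-rewrite scenario where the developer already guarantees type consistency.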
As a start, I think we can simply require the spark schema to be the same as the avro schema, while accepting namespace/field name differences.
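The check suggested above (exact structural match, ignoring names/namespaces), optionally extended with the forward-compatible promotions mentioned earlier (int => long, float => double), could be sketched along these lines. This is a minimal hypothetical illustration, not the PR's implementation; the function name and the set of handled types are assumptions.

```scala
// Hypothetical sketch: is a Spark SQL type acceptable for writing under a
// given Avro schema? Exact matches pass; int => long and float => double
// are allowed as forward-compatible promotions; record fields are matched
// positionally so name/namespace differences are ignored.
import scala.collection.JavaConverters._
import org.apache.avro.Schema
import org.apache.spark.sql.types._

object SchemaCompat {
  def isCompatible(sparkType: DataType, avroSchema: Schema): Boolean =
    (sparkType, avroSchema.getType) match {
      case (IntegerType, Schema.Type.INT)    => true
      case (IntegerType, Schema.Type.LONG)   => true  // int => long promotion
      case (LongType,    Schema.Type.LONG)   => true
      case (FloatType,   Schema.Type.FLOAT)  => true
      case (FloatType,   Schema.Type.DOUBLE) => true  // float => double promotion
      case (DoubleType,  Schema.Type.DOUBLE) => true
      case (StringType,  Schema.Type.STRING) => true
      case (BooleanType, Schema.Type.BOOLEAN) => true
      case (st: StructType, Schema.Type.RECORD) =>
        // Positional field match: field names may differ between the Spark
        // schema and the Avro record, per the review suggestion above.
        st.fields.length == avroSchema.getFields.size() &&
          st.fields.zip(avroSchema.getFields.asScala).forall {
            case (sparkField, avroField) =>
              isCompatible(sparkField.dataType, avroField.schema())
          }
      case _ => false
    }
}
```

Failing this check up front, with an error naming the offending field, would address the diagnosability concern raised earlier in the thread.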
This addresses #52
So, this patch might be a bit ugly and, given that I've done this internally, it may not be up to snuff, but I'm willing to get it cleaned up in the interest of making this a worthwhile contribution. It's currently running for us on a number of spark jobs and working properly. Please tear into it and provide feedback. I will try to have the unit tests added over the weekend.
Oh, and hello! :)