Add forceSchema option to output to specified schema #222
base: master
Conversation
Codecov Report

```
@@            Coverage Diff            @@
##           master     #222     +/-  ##
=========================================
+ Coverage   90.46%   92.16%    +1.7%
=========================================
  Files           5        5
  Lines         325      383      +58
  Branches       51       73      +22
=========================================
+ Hits          294      353      +59
+ Misses         31       30       -1
```
Force-pushed from 6a8debd to 7e6b342

Force-pushed from 71733da to 7051abe
What is the status on this?

I know Spark 2.2.0 support has recently been merged, so there is some activity, but I'm curious who I need to reach out to in order to get some eyes on this PR. There's not a contributing doc AFAIK, so I'm at a loss as to how to push this forward.

I have additional enhancements and fixes to this code, but before I go through the exercise of rebasing and incorporating them, I'd like to know whether this might get looked at. Tagging some top contributors... @JoshRosen @marmbrus Thanks in advance.

@lindblombr could you resolve the conflict? Ping @cloud-fan @liancheng for review. Thanks.
If the spark schema doesn't match the specified avro schema, what shall we do? And shall we allow compatible schema changes, like int to long?
@cloud-fan If the schemas don't match, my thinking was that writing would fail (in some way). But if we want it to be more elegant, I'd need to implement some compatibility checking and better error handling. Otherwise, the errors can occur in any number of places, making problems difficult to diagnose. The specific intent of this was to handle the case of reading in a set of avro files and writing that same set of avro files out using the same writer schema. In this case, it works well. For cases outside of this, it would be the responsibility of the developer to ensure type consistency. If we want this change to be more generic, I can add handling for "forward compatible" types, like "int" => "long", "float" => "double", etc. Barring that, I would be happy to add some more detailed error handling, so we can say things like
etc. etc. Thanks so much for taking a look!
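The round-trip case described above (read a set of Avro files, write them back out under the original writer schema) might look roughly like the sketch below. This is illustrative only: the option name `forceSchema` comes from the PR title, but the exact API surface and file paths are assumptions, not the PR's actual code.

```scala
// Hedged usage sketch: round-trip Avro files through Spark while forcing
// the output to use the original writer schema. Paths and the option
// name/format are illustrative assumptions.
import org.apache.spark.sql.SparkSession

object AvroRoundTrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("avro-roundtrip").getOrCreate()

    // The original writer schema, e.g. loaded from an .avsc file.
    val writerSchemaJson = scala.io.Source.fromFile("user.avsc").mkString

    val df = spark.read
      .format("com.databricks.spark.avro")
      .load("input/*.avro")

    // Without a forced schema, spark-avro derives an Avro schema from the
    // Spark schema, which can alter record/namespace/field names; forcing
    // the writer schema keeps the output consistent with the input files.
    df.write
      .format("com.databricks.spark.avro")
      .option("forceSchema", writerSchemaJson)
      .save("output/")

    spark.stop()
  }
}
```

The point of the option, as described in the comment above, is exactly this read-then-rewrite scenario where the developer already guarantees type consistency.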
As a start, I think we can simply require the spark schema to be the same as the avro schema, while accepting namespace/field name differences.
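The check suggested above (exact structural match, ignoring names/namespaces), optionally extended with the forward-compatible promotions mentioned earlier (int => long, float => double), could be sketched along these lines. This is a minimal hypothetical illustration, not the PR's implementation; the function name and the set of handled types are assumptions.

```scala
// Hypothetical sketch: is a Spark SQL type acceptable for writing under a
// given Avro schema? Exact matches pass; int => long and float => double
// are allowed as forward-compatible promotions; record fields are matched
// positionally so name/namespace differences are ignored.
import scala.collection.JavaConverters._
import org.apache.avro.Schema
import org.apache.spark.sql.types._

object SchemaCompat {
  def isCompatible(sparkType: DataType, avroSchema: Schema): Boolean =
    (sparkType, avroSchema.getType) match {
      case (IntegerType, Schema.Type.INT)    => true
      case (IntegerType, Schema.Type.LONG)   => true  // int => long promotion
      case (LongType,    Schema.Type.LONG)   => true
      case (FloatType,   Schema.Type.FLOAT)  => true
      case (FloatType,   Schema.Type.DOUBLE) => true  // float => double promotion
      case (DoubleType,  Schema.Type.DOUBLE) => true
      case (StringType,  Schema.Type.STRING) => true
      case (BooleanType, Schema.Type.BOOLEAN) => true
      case (st: StructType, Schema.Type.RECORD) =>
        // Positional field match: field names may differ between the Spark
        // schema and the Avro record, per the review suggestion above.
        st.fields.length == avroSchema.getFields.size() &&
          st.fields.zip(avroSchema.getFields.asScala).forall {
            case (sparkField, avroField) =>
              isCompatible(sparkField.dataType, avroField.schema())
          }
      case _ => false
    }
}
```

Failing this check up front, with an error naming the offending field, would address the diagnosability concern raised earlier in the thread.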
This addresses #52
So, this patch might be a bit ugly and, given that I've done this internally, it may not be up to snuff, but I'm willing to get it cleaned up in the interest of making this a worthwhile contribution. It's currently running for us on a number of spark jobs and working properly. Please tear into it and provide feedback. I will try to have the unit tests added over the weekend.
Oh, and hello! :)