Skip to content
This repository has been archived by the owner on Dec 20, 2018. It is now read-only.

Add forceSchema option to output to specified schema #222

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

lindblombr
Copy link

This addresses #52

So, this patch might be a bit ugly and, given that I've done this internally, it may not be up-to-snuff, but I'm willing to get it cleaned up in the interest of making this a worthwhile contribution. Its currently running for us on a number of spark jobs and working properly. Please tear into it and provide feedback. I will try to have the unit tests added over the weekend.

Oh, and hello! :)

@codecov-io
Copy link

codecov-io commented Mar 24, 2017

Codecov Report

Merging #222 into master will increase coverage by 1.7%.
The diff coverage is 94.36%.

@@            Coverage Diff            @@
##           master     #222     +/-   ##
=========================================
+ Coverage   90.46%   92.16%   +1.7%     
=========================================
  Files           5        5             
  Lines         325      383     +58     
  Branches       51       73     +22     
=========================================
+ Hits          294      353     +59     
+ Misses         31       30      -1

@lindblombr lindblombr changed the base branch from branch-3.2 to master March 25, 2017 01:22
@lindblombr lindblombr changed the base branch from master to branch-3.2 March 25, 2017 01:22
@lindblombr lindblombr changed the base branch from branch-3.2 to master March 25, 2017 02:09
@KamalKang
Copy link

What is the status on this?

@lindblombr
Copy link
Author

I know Spark 2.2.0 support has recently been merged, so there is some activity, but I'm curious who I need to reach out to in order to get some eyes on this PR. There's not a contributing doc AFAIK, so I'm at a loss as to how to push this forward.

@lindblombr
Copy link
Author

I have additional enhancements and fixes to this code, but before I go through the exercise of rebasing and incorporating them, I'd like to know whether this might get looked at. Tagging some top contributors... @JoshRosen @marmbrus

Thanks in advance.

@dbtsai
Copy link

dbtsai commented Apr 27, 2018

@lindblombr could you resolve the conflict?

Ping @cloud-fan @liancheng for review.

Thanks.

@cloud-fan
Copy link

If the spark schema doesn't match the specified avro schema, what shall we do? And shall we allow compatible schema changing like int to long?

@lindblombr
Copy link
Author

lindblombr commented May 3, 2018

@cloud-fan If the schemas don't match, my thinking was that writing would fail (in some way). But, if we want it to be more elegant, I'd need to implement some compatibility checking and better error handling. Otherwise, the errors can occur in any number of places, making problems difficult to diagnose. The intent of this was to handle, specifically, the case of reading in a set of avro files and writing that same set of avro files out using the same writer schema. In this case, it works well. For cases outside of this, it seems it would be the responsibility of the developer to ensure type consistency.

If we want this change to be more generic, I can add handling for "forward compatible" types, like "int" => "long", "float" => "double", etc.

Barring that, I would be happy to add some more detailed error handling, so we can say things like

  • SchemaMismatchException: Dataset to store has field foo which does not exist in forceSchema 'myschema'.
  • SchemaFieldTypeMismatchException: Dataset to store has field bar of type 'string' while forceSchema specified field bar of type 'long'
  • SchemaMismatchException: Dataset to store has no fields that match forceSchema
  • WARN field a of type int of dataset will be converted to Avro type long using forceSchema
  • ERROR field a of type long cannot be converted to Avro type int

etc. etc.

Thanks so much for taking a look!

@cloud-fan
Copy link

As a start, I think we can simply require the spark schema to be same as avro schema, while accepting namespace/field name difference.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants