Pick last file sorted by path for schema #269
base: master
Conversation
Codecov Report

@@            Coverage Diff             @@
##           master     #269      +/-   ##
==========================================
+ Coverage   92.21%   92.61%     +0.4%
==========================================
  Files           5        5
  Lines         321      325        +4
  Branches       43       41        -2
==========================================
+ Hits          296      301        +5
+ Misses         25       24        -1
  )
}
def sampleFilePath = if (conf.getBoolean(IgnoreFilesWithoutExtensionProperty, true)) {
  files.iterator.map(_.getPath).filter(_.getName.endsWith(".avro"))
files.map(_.getPath).sortBy(_.getName)...
It has the same result, right?
files can be a very large sequence; the iterator approach avoids creating two copies of that sequence. It is also not necessary to do a full sort just to get the first sorted element.
Are you saying it's not worth the optimization?
You are right about not sorting all the file names, but I don't think we need to convert it to an iterator. Maybe we can make it shorter, like files.map(_.getPath).minBy(_.getName)? We can create a function which accepts a Seq[Path], then check if it is empty before getting the minimal one.
The iterator is lightweight and avoids materialization.
minBy(_.getName) wouldn't work because we want to sort by the path, not just the filename (e.g. /some/path/x=1/part-0000.avro comes before /some/path/x=2/part-0000.avro).
minBy(_.toString) might work, but I don't feel too certain about it; I would rather use Comparable to do the right thing. Unfortunately, Path is just Comparable, not Comparable[Path], so Scala doesn't understand how to use it, which is why I resorted to calling compareTo directly.
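For illustration, a minimal sketch of that approach (the helper name is invented, not the PR's exact code): find the last path in sort order in a single pass, without building a sorted copy.

import java.io.FileNotFoundException
import org.apache.hadoop.fs.{FileStatus, Path}

// Hypothetical helper sketching the idea above. Path implements the raw
// Comparable interface rather than Comparable[Path], so Scala finds no
// implicit Ordering[Path]; calling compareTo directly sidesteps that.
def lastPathSorted(files: Seq[FileStatus]): Path =
  files.iterator
    .map(_.getPath) // iterator: no intermediate collection is materialized
    .reduceLeftOption((a, b) => if (a.compareTo(b) >= 0) a else b)
    .getOrElse(throw new FileNotFoundException("No Avro files found."))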
files.headOption.getOrElse {
  throw new FileNotFoundException("No Avro files found.")
}
files.iterator.map(_.getPath)
ditto
df1.write.avro(s"$tempDir/different_schemas/z=1")
val df2 = spark.createDataFrame(Seq(Tuple1("a"), Tuple1("b")))
df2.write.avro(s"$tempDir/different_schemas/z=2")
val df3 = spark.read.avro(s"$tempDir/different_schemas")
Maybe add a loop for the reading? I am not sure whether the file order will be the same every time.
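A sketch of that suggestion, reusing tempDir and df2 from the snippet above; the iteration count is arbitrary, and exact schema equality may depend on how nullability round-trips through Avro, so treat the assertion as illustrative:

// Repeat the read so a nondeterministic file listing cannot make the test
// pass by luck: with the last path sorted by path (z=2) chosen for the
// schema, every iteration should see df2's schema.
(1 to 10).foreach { _ =>
  val df = spark.read.avro(s"$tempDir/different_schemas")
  assert(df.schema === df2.schema)
}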
Have you considered using the schema from the newest data file to get the most up-to-date version of the schema? Or perhaps a configuration option to do that? It seems like most users would update their schemas in a backwards-compatible way, and using the most recent schema would expose newer fields in the schema.
That is not a bad idea. A switch seems reasonable. I would suggest doing this in a separate branch.
@cwlaird3 good idea
@koertkuipers @cwlaird3 I checked with @liancheng, who is a PMC member and one of the original authors of the data source project. This PR changes the behavior and could cause regressions for other users.
Currently it uses a random file to pick the schema. What would be an example of a user for whom things break by going from a random file to the last file?
I agree with @koertkuipers, but if there's still a concern, adding a configuration option to change the behavior could address that.
spark-avro already provides a mechanism for the user to provide a schema. The thing that is currently missing is merging of schemas across all files.
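For reference, that mechanism is the standard user-specified schema on the reader; a minimal sketch, with the column name and path made up:

import com.databricks.spark.avro._
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Supplying a schema up front bypasses file-based inference entirely, so it
// does not matter which file would otherwise be sampled.
val userSchema = StructType(Seq(StructField("name", StringType, nullable = true)))
val df = spark.read.schema(userSchema).avro("/some/path")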
By configuration I meant a flag to enable the behavior you've implemented here, not to provide a schema.
Oh, a flag to go from a random schema to a non-random schema? If someone can come up with a user for whom this pull request breaks their usage, I am up for that; otherwise no :)
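Purely for illustration, and not something adopted in this PR: such an opt-in flag might have looked like the sketch below. The option name is invented.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

// Hypothetical switch between the old arbitrary choice and the deterministic
// one; "deterministicSchemaFile" is an invented key, not a real option.
def pickSampleFile(files: Seq[FileStatus], conf: Configuration): Path = {
  val paths = files.map(_.getPath)
  if (conf.getBoolean("deterministicSchemaFile", false))
    paths.reduceLeft((a, b) => if (a.compareTo(b) >= 0) a else b)
  else
    paths.head // previous behavior: whichever file the listing returns first
}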
Picking the same file consistently for the schema avoids weird bugs where the schema of an Avro data source changes randomly or unexpectedly.