-
Notifications
You must be signed in to change notification settings - Fork 242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Orc writes don't fully support Booleans with nulls #11763
Conversation
Signed-off-by: Kuhu Shukla <[email protected]>
Signed-off-by: Kuhu Shukla <[email protected]>
build |
@@ -26,10 +26,17 @@ | |||
pytestmark = pytest.mark.nightly_resource_consuming_test | |||
|
|||
orc_write_basic_gens = [byte_gen, short_gen, int_gen, long_gen, float_gen, double_gen, | |||
string_gen, boolean_gen, DateGen(start=date(1590, 1, 1)), | |||
string_gen, BooleanGen(nullable=False), DateGen(start=date(1590, 1, 1)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is removing the test for nullable boolean values. Can we have an explicit test(s) that have a non-nullable struct with nullable values, or many different types, in it? I am fine if this is a follow on issue.
@@ -1243,6 +1243,14 @@ val GPU_COREDUMP_PIPE_PATTERN = conf("spark.rapids.gpu.coreDump.pipePattern") | |||
.booleanConf | |||
.createWithDefault(true) | |||
|
|||
val ENABLE_ORC_NULLABLE_BOOL = conf("spark.rapids.sql.format.orc.write.boolType.enabled") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just fall back for all booleans instead of only nullable ones? Spark already marks almost everything as nullable, so there is very little value in trying to distinguish between the two. But then I see things like #11762 where it scares me that CUDF might end up writing something out that they think is valid, but in practice is not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack. yes.
When I updated my tests for #11781 to write out 128000 rows I got crashes for boolean columns under ORC with the same error message that this is trying to work around. So even for boolean columns that are not-nullable under a struct that is we are going to have to fall back to the CPU. I think in general we just want to fall back to the CPU for all boolean columns on ORC writes. |
Thank you for the above finding @revans2 . I will update my patch and I see I have a few more tests to fix for the fallback as well. I expect the tests' change to be bigger than the actual change here. |
Signed-off-by: Kuhu Shukla <[email protected]>
build |
@@ -34,8 +34,11 @@ | |||
def test_write_hive_bucketed_table(spark_tmp_table_factory, file_format): | |||
num_rows = 2048 | |||
|
|||
# Use every type except boolean , see https://github.com/NVIDIA/spark-rapids/issues/11762 and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit remove space after boolean
@@ -29,8 +29,12 @@ def _restricted_timestamp(nullable=True): | |||
end=datetime(2262, 4, 11, tzinfo=timezone.utc), | |||
nullable=nullable) | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit remove extra newline
@pytest.mark.parametrize('orc_gens', bool_gen, ids=idfn) | ||
@pytest.mark.parametrize('orc_impl', ["native", "hive"]) | ||
@allow_non_gpu('ExecutedCommandExec', 'DataWritingCommandExec', 'WriteFilesExec') | ||
def test_write_round_trip_bools_only(spark_tmp_path, orc_gens, orc_impl): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is meant to fallback, right? I know some of those tests we put fallback in the name of it to be clear.
Signed-off-by: Kuhu Shukla <[email protected]>
build |
Getting close. Fixing hopefully the last test failure that comes from python test for schema evolution. WIP. |
Signed-off-by: Kuhu Shukla <[email protected]>
build |
The RAT errors seem unrelated. Appreciate inputs if I am missing anything here.
|
It's a build failure on Spark 4.0. Not blocking for merge. Tracked by #11822. |
bool_gen = [pytest.param([BooleanGen(nullable=True)], | ||
marks=pytest.mark.allow_non_gpu('ExecutedCommandExec','DataWritingCommandExec')), | ||
pytest.param([BooleanGen(nullable=False)], | ||
marks=pytest.mark.allow_non_gpu('ExecutedCommandExec','DataWritingCommandExec'))] | ||
@pytest.mark.parametrize('orc_gens', orc_write_gens_list, ids=idfn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pytest.mark.parametrize('orc_gens', orc_write_gens_list, ids=idfn) | |
@pytest.mark.parametrize('orc_gens', orc_write_gens_list, ids=idfn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nit that seemed to be missed. Would be nice to have whitespace separating the module variables from the methods.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuOrcFileFormat.scala
Outdated
Show resolved
Hide resolved
@revans2 @tgravescs requesting comments. I do seek brighter ways for tests that I had to sort of skip by changing types and such. I will open follow ons based on Bobby's earlier comment and any other you two might have. I have tried to add the same comment in as many places when we decide to revert the test part of this after the fix is in cudf. Thank u very much. |
val df = spark.createDataFrame(data).toDF("a") | ||
df.repartition(10).write.orc(file.getCanonicalPath) | ||
checkPredicatePushDown(spark, file.getCanonicalPath, 10, "a == true") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Signed-off-by: Kuhu Shukla <[email protected]>
Signed-off-by: Kuhu Shukla <[email protected]>
build |
Signed-off-by: Kuhu Shukla <[email protected]>
build |
bool_gen = [pytest.param([BooleanGen(nullable=True)], | ||
marks=pytest.mark.allow_non_gpu('ExecutedCommandExec','DataWritingCommandExec')), | ||
pytest.param([BooleanGen(nullable=False)], | ||
marks=pytest.mark.allow_non_gpu('ExecutedCommandExec','DataWritingCommandExec'))] | ||
@pytest.mark.parametrize('orc_gens', orc_write_gens_list, ids=idfn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nit that seemed to be missed. Would be nice to have whitespace separating the module variables from the methods.
Fixes #11736 and exposes #11762 which is why I am marking this WIP and seeing how I can work around this without impacting many tests in
orc_write_test.py