Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Orc writes don't fully support Booleans with nulls #11763

Merged
merged 10 commits into from
Dec 7, 2024

Conversation

kuhushukla
Copy link
Collaborator

Fixes #11736 and exposes #11762 which is why I am marking this WIP and seeing how I can work around this without impacting many tests in orc_write_test.py

@kuhushukla kuhushukla self-assigned this Nov 25, 2024
@kuhushukla kuhushukla marked this pull request as draft November 25, 2024 20:03
@kuhushukla kuhushukla changed the title Orc writes don't fully support Booleans with nulls [WIP] Orc writes don't fully support Booleans with nulls Nov 25, 2024
@kuhushukla kuhushukla marked this pull request as ready for review November 25, 2024 20:41
@kuhushukla
Copy link
Collaborator Author

build

revans2
revans2 previously approved these changes Nov 25, 2024
@@ -26,10 +26,17 @@
pytestmark = pytest.mark.nightly_resource_consuming_test

orc_write_basic_gens = [byte_gen, short_gen, int_gen, long_gen, float_gen, double_gen,
string_gen, boolean_gen, DateGen(start=date(1590, 1, 1)),
string_gen, BooleanGen(nullable=False), DateGen(start=date(1590, 1, 1)),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is removing the test for nullable boolean values. Can we have an explicit test(s) that have a non-nullable struct with nullable values, or many different types, in it? I am fine if this is a follow on issue.

@@ -1243,6 +1243,14 @@ val GPU_COREDUMP_PIPE_PATTERN = conf("spark.rapids.gpu.coreDump.pipePattern")
.booleanConf
.createWithDefault(true)

val ENABLE_ORC_NULLABLE_BOOL = conf("spark.rapids.sql.format.orc.write.boolType.enabled")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just fall back for all booleans instead of only nullable ones? Spark already marks almost everything as nullable, so there is very little value in trying to distinguish between the two. But then I see things like #11762 where it scares me that CUDF might end up writing something out that they think is valid, but in practice is not.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack. yes.

@revans2
Copy link
Collaborator

revans2 commented Nov 26, 2024

When I updated my tests for #11781 to write out 128000 rows I got crashes for boolean columns under ORC with the same error message that this is trying to work around. So even for boolean columns that are not-nullable under a struct that is we are going to have to fall back to the CPU. I think in general we just want to fall back to the CPU for all boolean columns on ORC writes.

@sameerz sameerz added the bug Something isn't working label Nov 27, 2024
@kuhushukla
Copy link
Collaborator Author

Thank you for the above finding @revans2 . I will update my patch and I see I have a few more tests to fix for the fallback as well. I expect the tests' change to be bigger than the actual change here.

@kuhushukla kuhushukla changed the title [WIP] Orc writes don't fully support Booleans with nulls Orc writes don't fully support Booleans with nulls Dec 2, 2024
@kuhushukla
Copy link
Collaborator Author

build

@@ -34,8 +34,11 @@
def test_write_hive_bucketed_table(spark_tmp_table_factory, file_format):
num_rows = 2048

# Use every type except boolean , see https://github.com/NVIDIA/spark-rapids/issues/11762 and
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit remove space after boolean

@@ -29,8 +29,12 @@ def _restricted_timestamp(nullable=True):
end=datetime(2262, 4, 11, tzinfo=timezone.utc),
nullable=nullable)


Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit remove extra newline

@pytest.mark.parametrize('orc_gens', bool_gen, ids=idfn)
@pytest.mark.parametrize('orc_impl', ["native", "hive"])
@allow_non_gpu('ExecutedCommandExec', 'DataWritingCommandExec', 'WriteFilesExec')
def test_write_round_trip_bools_only(spark_tmp_path, orc_gens, orc_impl):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is meant to fallback, right? I know some of those tests we put fallback in the name of it to be clear.

@kuhushukla
Copy link
Collaborator Author

build

@kuhushukla
Copy link
Collaborator Author

Getting close. Fixing hopefully the last test failure that comes from python test for schema evolution. WIP.

@kuhushukla
Copy link
Collaborator Author

build

@kuhushukla
Copy link
Collaborator Author

The RAT errors seem unrelated. Appreciate inputs if I am missing anything here.

Error: ] /home/runner/work/spark-rapids/spark-rapids/sql-plugin/src/main/spark400/scala/org/apache/spark/sql/nvidia/DFUDFShims.scala:27: type mismatch;
 found   : org.apache.spark.sql.Column
 required: org.apache.spark.sql.catalyst.expressions.Expression
[INFO] [Info] : org.apache.spark.sql.Column <: org.apache.spark.sql.catalyst.expressions.Expression?
[INFO] [Info] : false
Error: ] /home/runner/work/spark-rapids/spark-rapids/sql-plugin/src/main/spark400/scala/org/apache/spark/sql/nvidia/DFUDFShims.scala:28: type mismatch;
 found   : org.apache.spark.sql.catalyst.expressions.Expression
 required: org.apache.spark.sql.Column
[INFO] [Info] : org.apache.spark.sql.catalyst.expressions.Expression <: org.apache.spark.sql.Column?

@jlowe
Copy link
Contributor

jlowe commented Dec 5, 2024

The RAT errors seem unrelated

It's a build failure on Spark 4.0. Not blocking for merge. Tracked by #11822.

integration_tests/src/main/python/orc_write_test.py Outdated Show resolved Hide resolved
bool_gen = [pytest.param([BooleanGen(nullable=True)],
marks=pytest.mark.allow_non_gpu('ExecutedCommandExec','DataWritingCommandExec')),
pytest.param([BooleanGen(nullable=False)],
marks=pytest.mark.allow_non_gpu('ExecutedCommandExec','DataWritingCommandExec'))]
@pytest.mark.parametrize('orc_gens', orc_write_gens_list, ids=idfn)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@pytest.mark.parametrize('orc_gens', orc_write_gens_list, ids=idfn)
@pytest.mark.parametrize('orc_gens', orc_write_gens_list, ids=idfn)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nit that seemed to be missed. Would be nice to have whitespace separating the module variables from the methods.

integration_tests/src/main/python/schema_evolution_test.py Outdated Show resolved Hide resolved
integration_tests/src/main/python/orc_test.py Show resolved Hide resolved
@kuhushukla
Copy link
Collaborator Author

@revans2 @tgravescs requesting comments. I do seek brighter ways for tests that I had to sort of skip by changing types and such. I will open follow ons based on Bobby's earlier comment and any other you two might have. I have tried to add the same comment in as many places when we decide to revert the test part of this after the fix is in cudf. Thank u very much.

integration_tests/src/main/python/orc_test.py Show resolved Hide resolved
val df = spark.createDataFrame(data).toDF("a")
df.repartition(10).write.orc(file.getCanonicalPath)
checkPredicatePushDown(spark, file.getCanonicalPath, 10, "a == true")

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

@kuhushukla
Copy link
Collaborator Author

build

@kuhushukla
Copy link
Collaborator Author

build

bool_gen = [pytest.param([BooleanGen(nullable=True)],
marks=pytest.mark.allow_non_gpu('ExecutedCommandExec','DataWritingCommandExec')),
pytest.param([BooleanGen(nullable=False)],
marks=pytest.mark.allow_non_gpu('ExecutedCommandExec','DataWritingCommandExec'))]
@pytest.mark.parametrize('orc_gens', orc_write_gens_list, ids=idfn)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nit that seemed to be missed. Would be nice to have whitespace separating the module variables from the methods.

@kuhushukla kuhushukla merged commit fb2f72d into NVIDIA:branch-24.12 Dec 7, 2024
48 of 49 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Orc writes don't fully support Booleans with nulls
5 participants