Orc writes don't fully support Booleans with nulls #11763

kuhushukla · 2024-11-25T20:03:08Z

Fixes #11736 and exposes #11762, which is why I am marking this WIP while I look for a workaround that does not impact many tests in orc_write_test.py.

Signed-off-by: Kuhu Shukla <[email protected]>

kuhushukla · 2024-11-25T20:41:27Z

build

revans2 · 2024-11-25T20:48:44Z

integration_tests/src/main/python/orc_write_test.py

@@ -26,10 +26,17 @@
 pytestmark = pytest.mark.nightly_resource_consuming_test

 orc_write_basic_gens = [byte_gen, short_gen, int_gen, long_gen, float_gen, double_gen,
-        string_gen, boolean_gen, DateGen(start=date(1590, 1, 1)),
+        string_gen, BooleanGen(nullable=False), DateGen(start=date(1590, 1, 1)),


This removes the test for nullable boolean values. Can we have explicit tests that use a non-nullable struct containing nullable values, or many different types? I am fine if this is a follow-on issue.

revans2 · 2024-11-26T20:11:03Z

sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala

@@ -1243,6 +1243,14 @@ val GPU_COREDUMP_PIPE_PATTERN = conf("spark.rapids.gpu.coreDump.pipePattern")
    .booleanConf
    .createWithDefault(true)

+  val ENABLE_ORC_NULLABLE_BOOL = conf("spark.rapids.sql.format.orc.write.boolType.enabled")


Can we just fall back for all booleans instead of only nullable ones? Spark already marks almost everything as nullable, so there is very little value in trying to distinguish between the two. But then I see things like #11762 where it scares me that CUDF might end up writing something out that they think is valid, but in practice is not.

revans2 · 2024-11-26T21:21:08Z

When I updated my tests for #11781 to write out 128000 rows, I got crashes for boolean columns under ORC with the same error message that this PR is trying to work around. So even for non-nullable boolean columns under a nullable struct, we are going to have to fall back to the CPU. I think in general we just want to fall back to the CPU for all boolean columns on ORC writes.
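
A minimal sketch of the "fall back for every boolean" rule described above, written in plain Python for illustration only. The plugin's real check lives in Scala (GpuOrcFileFormat.scala is among the files this PR touches) and is not shown in this thread, so `DataType`, `contains_boolean`, `orc_write_target`, and the `bool_write_enabled` flag are all hypothetical names. The flag only loosely mirrors the `spark.rapids.sql.format.orc.write.boolType.enabled` config from the diff, and its default here is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataType:
    """Hypothetical stand-in for a Spark data type: a name plus nested child types."""
    name: str
    children: List["DataType"] = field(default_factory=list)


def contains_boolean(dt: DataType) -> bool:
    """Return True if dt is a boolean or has a boolean anywhere beneath it."""
    if dt.name == "boolean":
        return True
    return any(contains_boolean(child) for child in dt.children)


def orc_write_target(schema: List[DataType], bool_write_enabled: bool = False) -> str:
    """Pick where the ORC write runs.

    Nullability is deliberately ignored: any boolean column, at any depth,
    sends the whole write to the CPU unless the (assumed) flag is enabled.
    """
    if not bool_write_enabled and any(contains_boolean(col) for col in schema):
        return "CPU"
    return "GPU"


if __name__ == "__main__":
    nested = [DataType("struct", [DataType("int"), DataType("boolean")])]
    print(orc_write_target(nested))
    print(orc_write_target([DataType("long"), DataType("string")]))
```

Because the walk is recursive, a boolean nested inside a struct or array triggers the fallback just like a top-level boolean column.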

kuhushukla · 2024-11-27T16:40:20Z

Thank you for the above finding, @revans2. I will update my patch, and I see I have a few more tests to fix for the fallback as well. I expect the test changes to be bigger than the actual change here.

kuhushukla · 2024-12-02T18:40:44Z

build

tgravescs · 2024-12-02T18:45:32Z

integration_tests/src/main/python/datasourcev2_write_test.py

@@ -34,8 +34,11 @@
 def test_write_hive_bucketed_table(spark_tmp_table_factory, file_format):
    num_rows = 2048

+    # Use every type except boolean , see https://github.com/NVIDIA/spark-rapids/issues/11762 and


nit remove space after boolean

tgravescs · 2024-12-02T18:45:44Z

integration_tests/src/main/python/hive_write_test.py

@@ -29,8 +29,12 @@ def _restricted_timestamp(nullable=True):
                        end=datetime(2262, 4, 11, tzinfo=timezone.utc),
                        nullable=nullable)

+


nit remove extra newline

tgravescs · 2024-12-02T18:50:28Z

integration_tests/src/main/python/orc_write_test.py

+@pytest.mark.parametrize('orc_gens', bool_gen, ids=idfn)
+@pytest.mark.parametrize('orc_impl', ["native", "hive"])
+@allow_non_gpu('ExecutedCommandExec', 'DataWritingCommandExec', 'WriteFilesExec')
+def test_write_round_trip_bools_only(spark_tmp_path, orc_gens, orc_impl):


This is meant to fall back, right? For some of those tests we put "fallback" in the name to make that clear.

kuhushukla · 2024-12-03T17:07:46Z

build

kuhushukla · 2024-12-03T20:17:56Z

Getting close. Fixing what is hopefully the last test failure, which comes from the Python test for schema evolution. WIP.

kuhushukla · 2024-12-05T17:32:18Z

build

kuhushukla · 2024-12-05T20:00:40Z

The RAT errors seem unrelated. I would appreciate input if I am missing anything here.

Error: ] /home/runner/work/spark-rapids/spark-rapids/sql-plugin/src/main/spark400/scala/org/apache/spark/sql/nvidia/DFUDFShims.scala:27: type mismatch;
 found   : org.apache.spark.sql.Column
 required: org.apache.spark.sql.catalyst.expressions.Expression
[INFO] [Info] : org.apache.spark.sql.Column <: org.apache.spark.sql.catalyst.expressions.Expression?
[INFO] [Info] : false
Error: ] /home/runner/work/spark-rapids/spark-rapids/sql-plugin/src/main/spark400/scala/org/apache/spark/sql/nvidia/DFUDFShims.scala:28: type mismatch;
 found   : org.apache.spark.sql.catalyst.expressions.Expression
 required: org.apache.spark.sql.Column
[INFO] [Info] : org.apache.spark.sql.catalyst.expressions.Expression <: org.apache.spark.sql.Column?

jlowe · 2024-12-05T22:15:09Z

> The RAT errors seem unrelated

It's a build failure on Spark 4.0. Not blocking for merge. Tracked by #11822.

jlowe · 2024-12-05T22:20:35Z

integration_tests/src/main/python/orc_write_test.py

+bool_gen = [pytest.param([BooleanGen(nullable=True)],
+                                 marks=pytest.mark.allow_non_gpu('ExecutedCommandExec','DataWritingCommandExec')),
+            pytest.param([BooleanGen(nullable=False)],
+                         marks=pytest.mark.allow_non_gpu('ExecutedCommandExec','DataWritingCommandExec'))]
 @pytest.mark.parametrize('orc_gens', orc_write_gens_list, ids=idfn)


Suggested change (adds a blank line above the decorator)

@pytest.mark.parametrize('orc_gens', orc_write_gens_list, ids=idfn)

This is a nit that seemed to be missed. Would be nice to have whitespace separating the module variables from the methods.

kuhushukla · 2024-12-06T00:12:26Z

@revans2 @tgravescs requesting comments. I am looking for better approaches for the tests that I effectively had to skip by changing types and such. I will open follow-on issues based on Bobby's earlier comment and any others you two might have. I have tried to add the same comment in as many places as possible, so they are easy to find when we revert the test part of this after the fix is in cudf. Thank you very much.

tgravescs · 2024-12-06T16:23:52Z

tests/src/test/scala/org/apache/spark/sql/rapids/OrcFilterSuite.scala

+          val df = spark.createDataFrame(data).toDF("a")
+          df.repartition(10).write.orc(file.getCanonicalPath)
+          checkPredicatePushDown(spark, file.getCanonicalPath, 10, "a == true")
+


Suggested change

kuhushukla · 2024-12-06T19:24:39Z

build

kuhushukla · 2024-12-06T19:33:40Z

build

jlowe · 2024-12-06T22:16:08Z

integration_tests/src/main/python/orc_write_test.py

+bool_gen = [pytest.param([BooleanGen(nullable=True)],
+                                 marks=pytest.mark.allow_non_gpu('ExecutedCommandExec','DataWritingCommandExec')),
+            pytest.param([BooleanGen(nullable=False)],
+                         marks=pytest.mark.allow_non_gpu('ExecutedCommandExec','DataWritingCommandExec'))]
 @pytest.mark.parametrize('orc_gens', orc_write_gens_list, ids=idfn)


This is a nit that seemed to be missed. Would be nice to have whitespace separating the module variables from the methods.

Initial change, fix one failure in orc write test

022eb61

Signed-off-by: Kuhu Shukla <[email protected]>

kuhushukla self-assigned this Nov 25, 2024

kuhushukla marked this pull request as draft November 25, 2024 20:03

kuhushukla changed the title ~~Orc writes don't fully support Booleans with nulls~~ [WIP] Orc writes don't fully support Booleans with nulls Nov 25, 2024

kuhushukla added 2 commits November 25, 2024 14:08

Merge branch 'branch-24.12' into issue_11736

6d0e6f3

Allow tests for structs sans boolean in orc writes test

625e2ab

Signed-off-by: Kuhu Shukla <[email protected]>

kuhushukla marked this pull request as ready for review November 25, 2024 20:41

revans2 previously approved these changes Nov 25, 2024

revans2 reviewed Nov 26, 2024

revans2 mentioned this pull request Nov 26, 2024

Fix non-nullable under nullable struct write #11781

Merged

sameerz added the bug Something isn't working label Nov 27, 2024

Fix test failures and address comments

507c75f

Signed-off-by: Kuhu Shukla <[email protected]>

kuhushukla dismissed revans2’s stale review via 507c75f December 2, 2024 18:40

kuhushukla changed the title ~~[WIP] Orc writes don't fully support Booleans with nulls~~ Orc writes don't fully support Booleans with nulls Dec 2, 2024

tgravescs reviewed Dec 2, 2024

Fix tests and address review comments

eebf787

Signed-off-by: Kuhu Shukla <[email protected]>

Fix schema evolution test to avoid booleans for now

b26e62c

Signed-off-by: Kuhu Shukla <[email protected]>

jlowe reviewed Dec 5, 2024

tgravescs reviewed Dec 6, 2024

kuhushukla added 3 commits December 6, 2024 13:16

Address comments from reviews

5d582ba

Signed-off-by: Kuhu Shukla <[email protected]>

Merge remote-tracking branch 'refs/remotes/origin/branch-24.12' into …

6fa0db0

…issue_11736

Address a missed review comment

0ced895

Signed-off-by: Kuhu Shukla <[email protected]>

Address missed review comments

2e2202b

Signed-off-by: Kuhu Shukla <[email protected]>

jlowe approved these changes Dec 6, 2024

revans2 approved these changes Dec 6, 2024

tgravescs approved these changes Dec 6, 2024

kuhushukla merged commit fb2f72d into NVIDIA:branch-24.12 Dec 7, 2024
48 of 49 checks passed
