Document Hive text write serialization format checks
This commit documents the serialization format checks for writing
Hive text, and why the write-side check differs from the read-side one.

`spark-rapids` supports only '^A'-separated Hive text files for read
and write. This format is typically denoted in a Hive table's Storage
Properties with `serialization.format=1`.

If a Hive table is written with a different/custom delimiter, it is
denoted with a different value of `serialization.format`.  For instance,
a CSV table might be denoted by `serialization.format='',
field.delim=','`.

It was noticed in #11803
that:
1. On the [read
   side](https://github.com/NVIDIA/spark-rapids/blob/aa2da410511d8a737e207257769ec662a79174fe/sql-plugin/src/main/scala/org/apache/spark/sql/hive/rapids/HiveProviderImpl.scala#L155-L161), `spark-rapids` treats an empty `serialization.format` as `''`.
2. On the [write
   side](https://github.com/NVIDIA/spark-rapids/blob/aa2da410511d8a737e207257769ec662a79174fe/sql-plugin/src/main/scala/org/apache/spark/sql/hive/rapids/GpuHiveFileFormat.scala#L130-L136),
   an empty `serialization.format` is treated as `1`.
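
The discrepancy comes down to the default used when the property is absent. Here is a minimal, hypothetical sketch of the two defaults (not the actual plugin code; `storageProperties` is a stand-in for the table's storage properties map):

```scala
object SerializationFormatDefaults extends App {
  // `storageProperties` stands in for a Hive table's storage properties,
  // e.g. a CTAS destination whose properties have not been populated yet.
  val storageProperties: Map[String, String] = Map.empty

  // Read side: a missing `serialization.format` defaults to "", so a strict
  // equality check against "1" fails and the plugin falls back to the CPU.
  val readSideFormat = storageProperties.getOrElse("serialization.format", "")

  // Write side: a missing `serialization.format` defaults to "1", i.e. it is
  // assumed to be '^A'-separated, so legitimate cases like CTAS still pass.
  val writeSideFormat = storageProperties.getOrElse("serialization.format", "1")

  println(s"read side sees: '$readSideFormat', write side sees: '$writeSideFormat'")
}
```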

The read-side check is deliberately conservative: since the table being
read is pre-existing, its `serialization.format` should have been set already.

The write-side check is more lenient because there are legitimate cases
where a table might not have its `serialization.format` set yet.  (CTAS,
for one.)

This commit documents all the scenarios that need to be considered on
the write side.

Signed-off-by: MithunR <[email protected]>
mythrocks committed Dec 11, 2024
1 parent 4fbecbc commit c5507cd
Showing 1 changed file with 32 additions and 0 deletions.
@@ -127,6 +127,38 @@ object GpuHiveFileFormat extends Logging {
s"only $lazySimpleSerDe is currently supported for text")
}

// The check for serialization key here differs slightly from the read-side check in
// HiveProviderImpl::getExecs():
// 1. On the read-side, we do a strict check for `serialization.format == 1`, denoting
// '^A'-separated text. All other formatting is unsupported.
// 2. On the write-side too, we support only `serialization.format == 1`. But if
// `serialization.format` hasn't been set yet, it is still treated as '^A'-separated.
//
// On the write side, there are a couple of scenarios to consider:
// 1. If the destination table exists beforehand, `serialization.format` should have been
// set already, to a non-empty value. This will look like:
// ```sql
// CREATE TABLE destination_table( col INT, ... ); --> serialization.format=1
// INSERT INTO TABLE destination_table SELECT * FROM ...
// ```
// 2. If the destination table is being created as part of a CTAS, without an explicit
// format specified, then Spark leaves `serialization.format` unpopulated, until *AFTER*
// the write operation is completed. Such a query might look like:
// ```sql
// CREATE TABLE destination_table AS SELECT * FROM ...
// --> serialization.format is absent from Storage Properties. "1" is inferred.
// ```
// 3. If the destination table is being created as part of a CTAS, with a non-default
// text format specified explicitly, then the non-default `serialization.format` is made
// available as part of the destination table's storage properties. Such a table creation
// might look like:
// ```sql
// CREATE TABLE destination_table
// ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE
// AS SELECT * FROM ...
// --> serialization.format="", field.delim=",". Unsupported case.
// ```
// All these cases are covered by defaulting a missing `serialization.format` to `1`,
// and then checking for the '^A'-separated format.
val serializationFormat = storage.properties.getOrElse(serializationKey, "1")
if (serializationFormat != ctrlASeparatedFormat) {
meta.willNotWorkOnGpu(s"unsupported serialization format found: " +
