Document Hive text write serialization format checks
This commit documents the serialization format checks for writing Hive text, and why they differ from the read side.

`spark-rapids` supports only '^A'-separated Hive text files for read and write. This format tends to be denoted in a Hive table's Storage Properties with `serialization.format=1`. If a Hive table is written with a different/custom delimiter, it is denoted with a different value of `serialization.format`. For instance, a CSV table might be denoted by `serialization.format='', field.delim=','`.

It was noticed in #11803 that:

1. On the [read side](https://github.com/NVIDIA/spark-rapids/blob/aa2da410511d8a737e207257769ec662a79174fe/sql-plugin/src/main/scala/org/apache/spark/sql/hive/rapids/HiveProviderImpl.scala#L155-L161), `spark-rapids` treats an empty `serialization.format` as `''`.
2. On the [write side](https://github.com/NVIDIA/spark-rapids/blob/aa2da410511d8a737e207257769ec662a79174fe/sql-plugin/src/main/scala/org/apache/spark/sql/hive/rapids/GpuHiveFileFormat.scala#L130-L136), an empty `serialization.format` is seen as `1`.

The reason for the read side value is to be conservative: since the table is pre-existing, its value should have been set already. The reason for the write side is that there are legitimate cases where a table might not have its `serialization.format` set (CTAS, for one).

This commit documents all the scenarios that need to be considered on the write side.

Signed-off-by: MithunR <[email protected]>
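
To illustrate the asymmetry, here is a minimal Scala sketch of the two defaults. It is not the plugin's actual `HiveProviderImpl`/`GpuHiveFileFormat` code; the object name, helper names, and the plain `Map[String, String]` of storage properties are hypothetical. Only the `serialization.format` key and the `1` / `''` defaults come from the checks described above.

```scala
// Hypothetical sketch of the read-side vs. write-side defaults for
// `serialization.format`; not the actual plugin code.
object SerializationFormatDefaults {

  // '^A'-separated Hive text is the only format the plugin supports.
  val CtrlASeparated = "1"

  // Read side: the table already exists, so a missing value is treated
  // conservatively as "" (i.e. not recognized as a supported format).
  def readSideFormat(storageProps: Map[String, String]): String =
    storageProps.getOrElse("serialization.format", "")

  // Write side: a table created via CTAS may legitimately lack the key,
  // so a missing value is assumed to mean the default, '^A'-separated text.
  def writeSideFormat(storageProps: Map[String, String]): String =
    storageProps.getOrElse("serialization.format", CtrlASeparated)

  def isSupported(format: String): Boolean = format == CtrlASeparated

  def main(args: Array[String]): Unit = {
    val noFormatSet = Map.empty[String, String]                           // e.g. a CTAS target
    val csvLike     = Map("serialization.format" -> "", "field.delim" -> ",")

    println(isSupported(readSideFormat(noFormatSet)))   // false: conservative on read
    println(isSupported(writeSideFormat(noFormatSet)))  // true:  default to '^A' on write
    println(isSupported(writeSideFormat(csvLike)))      // false: custom delimiter, unsupported
  }
}
```

The sketch only restates the rationale above: being conservative on read is safe because a pre-existing table should already carry the property, while on write the property may legitimately be unset (CTAS, for one), so defaulting to `1` there is reasonable.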