Document Hive text write serialization format checks
This commit documents the serialization format checks for writing
Hive text, and why the write-side check differs from the read-side one.

`spark-rapids` supports only '^A'-separated Hive text files for read
and write. This format is typically denoted in a Hive table's Storage
Properties with `serialization.format=1`.

If a Hive table is written with a different/custom delimiter, it is
denoted with a different value of `serialization.format`.  For instance,
a CSV table might be denoted by `serialization.format='',
field.delim=','`.

It was noticed in #11803
that:
1. On the [read
   side](https://github.com/NVIDIA/spark-rapids/blob/aa2da410511d8a737e207257769ec662a79174fe/sql-plugin/src/main/scala/org/apache/spark/sql/hive/rapids/HiveProviderImpl.scala#L155-L161), `spark-rapids` treats an empty `serialization.format` as `''`.
2. On the [write
   side](https://github.com/NVIDIA/spark-rapids/blob/aa2da410511d8a737e207257769ec662a79174fe/sql-plugin/src/main/scala/org/apache/spark/sql/hive/rapids/GpuHiveFileFormat.scala#L130-L136),
   an empty `serialization.format` is treated as `1`.
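
The discrepancy comes down to the default used when the property is absent. Here is a minimal, hypothetical sketch of the two defaults (not the actual plugin code; `storageProperties` is a stand-in for the table's storage properties map):

```scala
object SerializationFormatDefaults extends App {
  // `storageProperties` stands in for a Hive table's storage properties,
  // e.g. a CTAS destination whose properties have not been populated yet.
  val storageProperties: Map[String, String] = Map.empty

  // Read side: a missing `serialization.format` defaults to "", so a strict
  // equality check against "1" fails and the plugin falls back to the CPU.
  val readSideFormat = storageProperties.getOrElse("serialization.format", "")

  // Write side: a missing `serialization.format` defaults to "1", i.e. it is
  // assumed to be '^A'-separated, so legitimate cases like CTAS still pass.
  val writeSideFormat = storageProperties.getOrElse("serialization.format", "1")

  println(s"read side sees: '$readSideFormat', write side sees: '$writeSideFormat'")
}
```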

The read-side check is deliberately conservative: since the table being
read is pre-existing, its `serialization.format` should have been set already.

The write-side check is more lenient because there are legitimate cases
where a table might not have its `serialization.format` set yet.  (CTAS,
for one.)

This commit documents all the scenarios that need to be considered on
the write side.

Signed-off-by: MithunR <[email protected]>
mythrocks committed Dec 11, 2024
1 parent 4fbecbc commit c5507cd
Showing 1 changed file with 32 additions and 0 deletions.
@@ -127,6 +127,38 @@ object GpuHiveFileFormat extends Logging {
s"only $lazySimpleSerDe is currently supported for text")
}

// The check for serialization key here differs slightly from the read-side check in
// HiveProviderImpl::getExecs():
// 1. On the read-side, we do a strict check for `serialization.format == 1`, denoting
// '^A'-separated text. All other formatting is unsupported.
// 2. On the write-side too, we support only `serialization.format == 1`. But if
// `serialization.format` hasn't been set yet, it is still treated as '^A'-separated.
//
// On the write side, there are a couple of scenarios to consider:
// 1. If the destination table exists beforehand, `serialization.format` should have been
// set already, to a non-empty value. This will look like:
// ```sql
// CREATE TABLE destination_table( col INT, ... ); --> serialization.format=1
// INSERT INTO TABLE destination_table SELECT * FROM ...
// ```
// 2. If the destination table is being created as part of a CTAS, without an explicit
// format specified, then Spark leaves `serialization.format` unpopulated, until *AFTER*
// the write operation is completed. Such a query might look like:
// ```sql
// CREATE TABLE destination_table AS SELECT * FROM ...
// --> serialization.format is absent from Storage Properties. "1" is inferred.
// ```
// 3. If the destination table is being created as part of a CTAS, with a non-default
// text format specified explicitly, then the non-default `serialization.format` is made
// available as part of the destination table's storage properties. Such a table creation
// might look like:
// ```sql
// CREATE TABLE destination_table
// ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE
// AS SELECT * FROM ...
// --> serialization.format="", field.delim=",". Unsupported case.
// ```
// All these cases are covered by defaulting a missing `serialization.format` to `1`,
// and then checking for the '^A'-separated format.
val serializationFormat = storage.properties.getOrElse(serializationKey, "1")
if (serializationFormat != ctrlASeparatedFormat) {
meta.willNotWorkOnGpu(s"unsupported serialization format found: " +
