[SPARK-24256][SQL] ExpressionEncoder should support user-defined types as fields of Scala case class and tuple #21310
Conversation
You need to add tests first. Could you?
I will investigate how we can add a test for this. Thoughts are welcome.
@viirya @cloud-fan before I add tests, could you take a look and advise whether the approach taken in this patch is acceptable?
I will take a look later today.
I'm not sure if I'm looking at the correct project, but it seems the spark-avro project doesn't have
@viirya thanks for the feedback. We internally customized the AvroEncoder based on the open-source PR, since it was never merged into spark-avro. We propose this feature because it should apply to every user-defined Encoder, not just AvroEncoder.
@fangshil, Avro is now in Spark. How does it relate to this PR? Should we go forward?
@fangshil also, could you point me to the PR that was not merged into spark-avro, so I can check it when I have some time?
@HyukjinKwon thanks for the update. What do you mean by "Avro is now in Spark"? The PR I mentioned is https://github.com/databricks/spark-avro/pull/215/files. I have been maintaining this PR internally at my company for a while, on Spark 2.3.
I think this PR is blocked on adding UDT support officially (it is currently internal). Maybe we can start a thread about UDT on the dev list.
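For context, the internal registration API referred to here is roughly the following; it is private[spark], so application code cannot call it today, and both class names in this sketch are hypothetical placeholders:

```scala
import org.apache.spark.sql.types.UDTRegistration

// Internal (private[spark]) registry mapping a user class to a
// UserDefinedType[_] implementation by fully-qualified name.
// Both class names below are hypothetical placeholders.
UDTRegistration.register(
  "com.example.MyType",    // the user's class
  "com.example.MyTypeUDT") // a UserDefinedType[MyType] implementation
```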
Can one of the admins verify this patch?
To summarize our discussion in this PR: the purpose of this patch is to enable ExpressionEncoder to work together with other types of Encoders, while it seems upstream prefers to go with UDT. Given this, we can close this PR and start the discussion on UDT in another channel.
There's a PR pending to add support for Dataset of Avro here: #22878
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/Encoders.scala
Lines 126 to 142 in 14b1312
```scala
/**
 * Creates an encoder for Java Bean of type T.
 *
 * T must be publicly accessible.
 *
 * supported types for java bean field:
 *  - primitive types: boolean, int, double, etc.
 *  - boxed types: Boolean, Integer, Double, etc.
 *  - String
 *  - java.math.BigDecimal, java.math.BigInteger
 *  - time related: java.sql.Date, java.sql.Timestamp
 *  - collection types: only array and java.util.List currently, map support is in progress
 *  - nested java bean.
 *
 * @since 1.6.0
 */
def bean[T](beanClass: Class[T]): Encoder[T] = ExpressionEncoder.javaBean(beanClass)
```
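For reference, a minimal usage sketch of the quoted API; Person here is a hypothetical public Java bean with a no-arg constructor and getters/setters, and spark is an existing SparkSession:

```scala
import org.apache.spark.sql.{Dataset, Encoder, Encoders}

// Derive an encoder for the hypothetical Person bean and build a Dataset
// from a local Seq, passing the encoder explicitly.
val personEncoder: Encoder[Person] = Encoders.bean(classOf[Person])
val ds: Dataset[Person] = spark.createDataset(Seq(new Person()))(personEncoder)
```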
If we adopt this enhancement, should the related documentation be updated as well? @fangshil
As @fangshil points out, because Spark's encoder-generating facilities in ScalaReflection and JavaTypeInference cannot be made aware of a user-defined Encoder[T], it is fairly inconvenient to work with a Dataset[T] for which such an encoder has been defined. He mentions two reasons:
Now, the first problem can perhaps be worked around, for example by implicitly defining an Encoder[(T, S)] whenever there is an implicit Encoder[T] and Encoder[S]. However, the second problem remains, and that is precisely what the present PR sets out to solve. I understand if the Spark community would prefer to take another approach to solving this problem, but then I'd like to find out what that approach is. For instance, is the consensus that the best approach is to create a UserDefinedType[T] and register it through the currently private UDTRegistration API? If so, could someone please point me to a thread on the Spark dev list that sheds light on the justification behind this choice, and on the timeline for making that API public? Finally, I'd like to ask why, even if the UserDefinedType[T] approach is preferred, the work in the present PR isn't being considered as a supplementary enhancement, one which many Spark users would find very convenient.
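For the first problem, the workaround mentioned above could look something like this, building on the existing public Encoders.tuple API; the implicit definition itself is a sketch, not part of this patch:

```scala
import org.apache.spark.sql.{Encoder, Encoders}

// Whenever encoders for both components are in implicit scope, derive an
// encoder for the pair from them. Encoders.tuple is existing public API.
implicit def tuple2Encoder[T, S](implicit
    et: Encoder[T],
    es: Encoder[S]): Encoder[(T, S)] = Encoders.tuple(et, es)
```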
What changes were proposed in this pull request?
Right now, ExpressionEncoder supports ser/de of primitive types, as well as Scala case classes, tuples, and Java bean classes. Spark's Dataset natively supports these types, but we find Dataset is not flexible enough for other user-defined types and encoders.
For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. Although we can use AvroEncoder to define Dataset with types being the Avro Generic or Specific Record, using such Avro typed Dataset has many limitations:
The limitation is that ExpressionEncoder does not support serde of Scala case class/tuple with subfields being any other user-defined type with its own Encoder for serde.
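To illustrate the limitation (MyAvroRecord and record are hypothetical placeholders), something like the following fails today, because ScalaReflection cannot derive an encoder for the Avro field even when an Encoder[MyAvroRecord] such as AvroEncoder is in scope:

```scala
// MyAvroRecord stands in for an Avro SpecificRecord class that has its own
// user-defined Encoder outside Spark.
case class Wrapper(id: Long, payload: MyAvroRecord)

import spark.implicits._ // derives case class encoders via ScalaReflection

// Fails at encoder-derivation time: the MyAvroRecord field is not a type
// that ScalaReflection understands.
val ds = spark.createDataset(Seq(Wrapper(1L, record)))
```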
To address this issue, we propose a trait as a contract (between ExpressionEncoder and any other user-defined Encoder) to enable case classes, tuples, and Java beans to support user-defined types.
With this proposed patch and a minor modification in AvroEncoder, we remove the above-mentioned limitation using the cluster-default conf spark.expressionencoder.org.apache.avro.specific.SpecificRecord = com.databricks.spark.avro.AvroEncoder$
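A rough sketch of the kind of contract such a trait could express; all names below are hypothetical and not taken from the actual patch:

```scala
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.types.DataType

// Hypothetical contract between ExpressionEncoder and an external Encoder.
// ExpressionEncoder would consult the cluster conf
//   spark.expressionencoder.<user type> = <implementing object>
// to find the implementation for a field type it does not natively support,
// then delegate schema, serialization, and deserialization to it.
trait ExternalFieldEncoder {
  // Catalyst type used to store the user-defined type.
  def schema(cls: Class[_]): DataType
  // Expression serializing a value of the user type into Catalyst format.
  def serializer(cls: Class[_], input: Expression): Expression
  // Expression deserializing Catalyst data back into the user type.
  def deserializer(cls: Class[_], input: Expression): Expression
}
```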
This is a patch we have implemented internally and have been using for a few quarters. We want to propose it upstream because we think it is a useful feature that makes Dataset more flexible with user types.
How was this patch tested?
I have tested this patch internally. I did not write a unit test, since the user-defined Encoder (AvroEncoder) is defined outside Spark.
I am looking for suggestions on how to write unit tests for this patch.