[SPARK-27388][SQL] encoder for objects defined by properties (ie. Avro) #24299
Conversation
With this addition, Avro fixed types are encoded correctly. Currently, Avro complex unions are not supported.
Overall: isn't this what the Encoder for Java beans is for and already supports? It isn't obvious to me that this should go in ScalaReflection.
Say we have an object A having the fields B, C, and D, where A and B are Avro objects, C is an Avro fixed object (with property bytes), and D is a Java enum.
(1)
{code}
implicit val exprEnc = ExpressionEncoder[A]()
val r: Dataset[A] = List(makeA).toDS()
val ds: Dataset[(B, C, D)] = r.map(e => (e.getB, e.getC, e.getD))
{/code}
(2)
{code}
implicit val exprA = Encoders.bean[A](classOf[A])
implicit val exprB = Encoders.bean[B](classOf[B])
implicit val exprC = Encoders.bean[C](classOf[C])
implicit val exprD = Encoders.bean[D](classOf[D])
val r: Dataset[A] = List(makeA).toDS()
val ds: Dataset[(B, C, D)] = r.map(e => (e.getB, e.getC, e.getD))
{/code}
Using the addition in this PR, the code in (1) works correctly, but if we use the Java bean encoder instead, as in (2), we have the following issues:
- the objects of type C (Avro fixed types) are not correctly encoded, because the property `bytes` of fixed types is not prefixed by set/get (see the sketch after this list)
- the Java bean encoder fails to create an encoder for a Java enum (an assertion fails because the inferred type is not a StructType, since an enum is saved as a String)
- the map fails to find encoders for B, C, and D, because while creating the encoder for the tuple (e.getB, e.getC, e.getD) it recursively searches for encoders of the tuple elements in ScalaReflection
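To make the fixed-type issue concrete, here is a minimal sketch of a class shaped like an Avro fixed type (FixedLike is a made-up name; Avro's generated fixed classes expose the property in this unprefixed style):
{code}
// Sketch: the byte-array property is exposed as a getter/setter pair both
// named "bytes", with no get/set prefix, so the stock Java bean encoder
// does not recognize it as a bean property.
class FixedLike {
  private var value: Array[Byte] = Array.emptyByteArray
  def bytes: Array[Byte] = value                    // getter, no "get" prefix
  def bytes(v: Array[Byte]): Unit = { value = v }   // setter, no "set" prefix
}
{/code}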
Also, we believe that the current implementation of Encoders.bean (JavaTypeInference) has a bug: line 136 of sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala should be
{code}
val properties = getJavaBeanReadableAndWritableProperties(other)
{/code}
and not
{code}
val properties = getJavaBeanReadableProperties(other)
{/code}
All the tests with the Java bean encoder were run with this correction.
All this shows that the addition in this PR must be in ScalaReflection; and since there are no encoders for java.util.List, java.util.Map, and Java enums, we added support for them in ScalaReflection.
@srowen This PR is mainly about adding support for Avro objects and making our code simpler.
Many thanks to @mazeboard for bringing this PR to my attention and for taking a crack at the problem of typed Avro Datasets in Spark! I feel it's a matter of due diligence for me to point to another PR supporting Avro-typed Datasets in Spark, namely #22878, which (full transparency) is the work of @xuanyuanking and myself. The approaches taken here and there are different, and it would seem so are the coverages of the Avro spec. I'd like to take the time to compare/contrast. I am more qualified to speak to the approach and capabilities introduced in #22878 (which has a history going back to Spark-Avro), so if I misread this PR in the process, @mazeboard, please do correct my understanding.

#24299

I'll summarize my reading of this PR's approach: to extend the existing Java bean encoding support so that Avro-generated classes can be encoded as well. One stated limitation is complex Union types (unions of two types where neither is null, or unions of more than two types) …. Correct my understanding if this is wrong, but because this approach is based on Reflection over the types of the generated classes ….

#22878

To summarize the approach of #22878: it creates an AvroEncoder which generates a pair of serializer/deserializer Expressions based on the AvroSchema (whether gleaned from a SpecificRecord class, or passed directly as a JSON string). Its work stands on the Avro efforts that were recently folded into Spark proper from the (now deprecated) Databricks/Spark-Avro project. However, to date, Spark-Avro does not provide a first-class Encoder, along with the efficiency and strong/static typing that entails. Being based on Spark-Avro, however, #22878 does gain the benefits of an AvroSchema-driven approach. To avoid confusion, I'm going to refer to this recently folded-in functionality as "Spark-Avro", referencing this portion of the current Spark project rather than the former deprecated project.

AvroSchema

Because #22878 generates its Encoder through the AvroSchema, we gain a couple of things:
- With #22878 you can create an Encoder directly from an AvroSchema, whether gleaned from a SpecificRecord class or passed as a JSON string ….
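For concreteness, a sketch of what schema-driven encoder creation could look like; the AvroEncoder.of entry points are recalled from #22878 and the old Spark-Avro work, so treat the exact names and signatures as assumptions rather than a verified API:
{code}
// Sketch only: AvroEncoder's package and signatures are assumptions based on
// #22878 / Spark-Avro; MyRecord is a hypothetical Avro-generated class.
import org.apache.avro.generic.GenericRecord

// From a SpecificRecord class:
implicit val specificEnc = AvroEncoder.of[MyRecord](classOf[MyRecord])

// Or directly from an Avro schema given as a JSON string:
val schemaJson =
  """{"type":"record","name":"MyRecord","fields":[{"name":"id","type":"long"}]}"""
implicit val genericEnc = AvroEncoder.of[GenericRecord](schemaJson)
{/code}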
Two implications: …
Coverage

Spark-Avro's ability to move between AvroSchema and Spark Dataset Schema also gives us the ability to traverse the DataType to create our Encoder's Ser/De Expressions rather than using Reflection. This gives us two immediate benefits: …
These two items mean the AvroEncoder in #22878 can generate an ….

Last thoughts

The PR goes a long way in support of Avro while still being very concise, which is definitely advantageous from a maintainability perspective. My concerns with a Reflection-based approach are: …
Parting words for #22878: …
Again, I'm very happy to see where this discussion goes as it evolves (:
This PR is not an extension of the existing Java Bean Encoder: it adds support for bean objects, java.util.List, java.util.Map, and Java enums to ScalaReflection. Unlike the existing Java bean encoder, properties can be named without the set/get prefix (this is one of the key points that allows the encoding of Avro fixed types; I believe the other key point is that the addition must live in ScalaReflection).

Reminder of Avro types: all Avro types are supported by this PR (maybe a better test including all Avro types is required), including simple union types and excluding complex union types. Simple Avro unions are [null, type1]; all other unions are complex. This PR supports simple unions but not complex unions, for the simple reason that the Avro compiler generates Java code with type Object for all complex unions, while fields with simple unions are typed as the non-null type of the union. Currently ScalaReflection does not have an encoder for the Object type, but we could modify ScalaReflection to use a default encoder (e.g. kryo) for Object (I do not know if this is advisable, but why not? Instead of throwing an error, we could use a default encoder for objects for which no encoder is found).

I do not understand the issue related to the Reflection-driven approach: all common Scala objects are encoded using reflection. I may be wrong, but as I tried to explain in this PR, we need to add types to ScalaReflection to be able to transform Datasets into other Datasets of embedded Avro types. As an example, the following map function transforms a Dataset[A] into a Dataset[(B, C)] (as in example (1) above):
{code}
val ds: Dataset[(B, C)] = r.map(e => (e.getB, e.getC))
{/code}
The map function will recursively use ScalaReflection to find encoders for the B and C types. (Do you know if this runs with the #22878 solution? Or does it complain at runtime with no encoder found?)

Finally, I did not understand the benefits of the AvroSchema-driven approach: for me, an Avro object is completely defined by its properties (which are derived from the Avro schema); the Avro compiler generates Java code with all the properties in the Avro schema.
Thanks @mazeboard for the reply and clarifying the support here for Unions. I've edited my post to a more targeted discussion of Complex Union types for clarity: …
Also, I now better understand that this PR expands Spark's ScalaReflection capabilities (rather than the Java bean encoder). I'm inclined to wait a bit for some more direction or questions from the Spark committers.
{code}
@@ -308,6 +309,19 @@ class ExpressionEncoderSuite extends CodegenInterpretedPlanTest with AnalysisTest
  encodeDecodeTest(Option("abc"), "option of string")
  encodeDecodeTest(Option.empty[String], "empty option of string")

  encodeDecodeTest(
{/code}
Thanks @bdrillard for the summary of previous work.
It seems #22878 covered this scenario? Please correct me if I'm missing something. https://github.com/apache/spark/pull/22878/files#diff-24ca7610c9c163779104e6c797713431R327
1. Currently all Avro types are supported, except complex unions.
2. For testing, add an Avro object having all possible types (a sketch follows): primitive types: null, boolean, int, long, float, double, bytes, string; complex types: Records, Enums, Arrays, Maps, Unions, Fixed.
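A rough sketch of what such a test could look like, following the suite's existing encodeDecodeTest pattern (FullAvroRecord and makeFullAvroRecord are hypothetical stand-ins for the generated test class and its builder):
{code}
// Hypothetical sketch: FullAvroRecord would be an Avro-generated class whose
// schema exercises every supported type (records, enums, arrays, maps,
// simple unions, fixed, and all primitives).
implicit val fullRecordEnc = ExpressionEncoder[FullAvroRecord]()

encodeDecodeTest(makeFullAvroRecord(), "avro object covering all supported types")
{/code}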
Just a note about the number of lines changed: the Java code for the Avro objects used in the tests has 1,252 lines; if we ignore those lines, the overall change is +288 -17.
I would like to give an example to explain why the PR's addition is needed. The test was done with spark-sql 2.4.0. The first transformation (map), from Dataset[Foo] to Dataset[Bar], fails ….
With the PR addition, the following code runs without errors:
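A plausible reconstruction of the snippet, assuming Foo and Bar are Avro-generated classes with Bar embedded in Foo (makeFoo and getBar are hypothetical):
{code}
// Reconstruction sketch: with this PR's additions, ScalaReflection can derive
// expression encoders for the Avro classes Foo and Bar directly.
implicit val fooEnc = ExpressionEncoder[Foo]()
implicit val barEnc = ExpressionEncoder[Bar]()

val foos: Dataset[Foo] = List(makeFoo()).toDS()
val bars: Dataset[Bar] = foos.map(_.getBar)
{/code}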
It is possible to make the first program work by adding the following implicit:
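One implicit that makes the stock-Spark program work is a kryo-based encoder (a sketch; the exact implicit used in the original comment may have differed, and the name barEnc is illustrative):
{code}
// Workaround sketch for stock Spark (without this PR): fall back to a
// kryo-serialized encoder for the embedded Avro type. This works for
// arbitrary classes but stores the object as a single opaque binary column.
implicit val barEnc: Encoder[Bar] = Encoders.kryo[Bar]
{/code}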
But this becomes rapidly a burden if we need to expose many embedded objects within the Avro object. |
Even though we prefer the solution in this PR, which modifies ScalaReflection, we also modified JavaTypeInference (in PR #24367) to be able to create encoders for Avro objects using the bean encoder. We currently have two solutions: the one in this PR and another using Encoders.bean.
…ively or both are not prefixed
The issue here is how the input and the convertedBack objects are compared. If we replace the check at ExpressionEncoderSuite.scala:442,
{code}
left.asInstanceOf[Comparable[Any]].compareTo(right) == 0
{/code}
by
{code}
left.asInstanceOf[Comparable[Any]].equals(right)
{/code}
the test for the Avro encoder passes, but unfortunately other tests fail. Equality of objects is tricky. The GenericData compare function
{code}
/** Comparison implementation. When equals is true, only checks for equality,
 * not for order. */
@SuppressWarnings(value="unchecked")
protected int compare(Object o1, Object o2, Schema s, boolean equals) {
{/code}
fails to compare Maps when the parameter equals is false.

I propose the following: replace lines 434-444 of ExpressionEncoderSuite,
{code}
val isCorrect = (input, convertedBack) match {
  case (b1: Array[Byte], b2: Array[Byte]) => Arrays.equals(b1, b2)
  case (b1: Array[Int], b2: Array[Int]) => Arrays.equals(b1, b2)
  case (b1: Array[Array[_]], b2: Array[Array[_]]) =>
    Arrays.deepEquals(b1.asInstanceOf[Array[AnyRef]], b2.asInstanceOf[Array[AnyRef]])
  case (b1: Array[_], b2: Array[_]) =>
    Arrays.equals(b1.asInstanceOf[Array[AnyRef]], b2.asInstanceOf[Array[AnyRef]])
  case (left: Comparable[_], right: Comparable[_]) =>
    left.asInstanceOf[Comparable[Any]].compareTo(right) == 0
  case _ => input == convertedBack
}
{/code}
by
{code}
val convertedBackRow = encoder.toRow(convertedBack)
val isCorrect = row == convertedBackRow
{/code}
With the proposed modification, all the tests in ExpressionEncoderSuite pass.
@srowen Hi, I would like to hear from you about the possibility of merging this PR. Many developer teams need to work with Datasets of SpecificRecords (i.e. Avro). I implemented a version that modifies the bean encoder (see #24367), but I believe that adding encoders to ScalaReflection, as done by this PR, is the way to go, for the reasons I explained in this PR. The solution is simple, with small changes/additions. Please let me know if this can be considered a viable solution and if someone can take this PR forward; as you said, JavaBean is pretty old and rarely used.
Can one of the admins verify this patch?
We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it!
What changes were proposed in this pull request?
This PR adds expression encoders for beans, java.util.List, java.util.Map, and Java enums. This addition makes it possible to encode Avro objects.
Beans are objects defined by properties. A property is defined by a setter and a getter function, where the getter return type is equal to the type of the setter's unique parameter and the getter and setter have the same name; if the getter name is prefixed by "get" then the setter name must be prefixed by "set". See the tests for bean examples.
Avro objects are beans, and thus we can create an expression encoder for Avro objects as follows:
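A sketch following example (1) earlier in the conversation (MyAvroRecord and makeMyAvroRecord are hypothetical stand-ins for an Avro-generated class and its builder):
{code}
// MyAvroRecord stands in for any Avro-generated SpecificRecord class; with
// this PR, ScalaReflection derives its encoder like any other bean.
implicit val avroEnc = ExpressionEncoder[MyAvroRecord]()

val ds: Dataset[MyAvroRecord] = List(makeMyAvroRecord()).toDS()
{/code}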
All Avro types, including fixed types and excluding complex union types, are supported by this addition.
Avro fixed types are beans with exactly one property: bytes.
Currently, complex Avro unions are not supported, because a complex union is declared as Object and there is no expression encoder for the Object type (one would need a custom serializer such as kryo).
How was this patch tested?
Currently only one encodeDecodeTest was added to ExpressionEncoderSuite; the test uses an Avro object with all accepted types (i.e. array, map, fixed, bytes, enum, ...).