[BUG] Intermittent crash on NDS query 96 with grace hopper cluster #11854
revans2 added the "? - Needs Triage" and "bug" labels on Dec 10, 2024.
Observed the same exception on query 7 with the Grace Hopper cluster in CI build 277.
Describe the bug
In CI we have been seeing occasional failures for NDS at scale factor 3k when running on a Grace Hopper cluster. The crash appears to happen only when we run with Parquet data that uses decimals, not floats, for many of the number types.
We need someone to go through all of the historic runs and see if we can fully understand what is happening here, before we dig into a single possible explanation.
One of the odd things is that for at least a few of the runs we see errors when trying to deserialize a task.
When we zoom in on the last part of the calls we see.
This is on Spark 3.4.3, so ShuffleMapTask.scala:87 is just trying to deserialize a (RDD[_], ShuffleDependency[_, _, _]) tuple of RDD + ShuffleDependency. The odd part is that this appears to pass on retry. Currently I suspect that it is some kind of memory/network corruption, because the Grace Hopper hardware we are running on is pre-production. However, it is not specific to a single node, and it is specific to a single query, which makes it more fun to try and debug.
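To illustrate why corruption would show up at exactly this point and then disappear on retry, here is a minimal sketch using plain Java serialization. It is not the Spark code path itself: FakeRdd, FakeShuffleDep, and the serialize/deserialize helpers are hypothetical stand-ins, and the sketch assumes the task binary is a Java-serialized object stream. It only shows that a single damaged byte makes deserialization fail, while a second attempt on an intact copy of the same bytes succeeds.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.util.Try

object TaskBinaryCorruptionSketch {
  // Hypothetical stand-ins for the real (RDD[_], ShuffleDependency[_, _, _]) pair.
  final case class FakeRdd(id: Int, name: String)
  final case class FakeShuffleDep(shuffleId: Int, numPartitions: Int)

  // Serialize any object with standard Java serialization.
  def serialize(value: AnyRef): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(value) finally out.close()
    bytes.toByteArray
  }

  // Deserialize and cast to the expected type; throws if the stream is damaged.
  def deserialize[T](data: Array[Byte]): T = {
    val in = new ObjectInputStream(new ByteArrayInputStream(data))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val original = (FakeRdd(1, "scan"), FakeShuffleDep(7, 200))
    val taskBinary = serialize(original)

    // Simulate in-flight corruption: flip every bit of the first header byte
    // in a copy, leaving the source bytes untouched.
    val corrupted = taskBinary.clone()
    corrupted(0) = (corrupted(0) ^ 0xFF).toByte

    val firstAttempt = Try(deserialize[(FakeRdd, FakeShuffleDep)](corrupted))
    println(s"first attempt failed: ${firstAttempt.isFailure}")

    // A retry that reads the intact bytes again succeeds.
    val retry = Try(deserialize[(FakeRdd, FakeShuffleDep)](taskBinary))
    println(s"retry recovered original: ${retry.toOption.contains(original)}")
  }
}
```

If the failures in CI follow this pattern, the exception type and the offset at which deserialization gives up would vary from run to run, which is one thing to look for when going through the historic runs.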