Replies: 1 comment
-
Hello @kkdatta. The reason the External Shuffle Service (ESS) must be disabled is that ESS is a different serving stack (from a YARN NodeManager) that uses files as its backend, and we don't have an implementation of a UCX-based shuffle that terminates at these external processes. We would really like to use GPU-resident blocks as much as possible, both for faster transfers and to avoid spill. As for dynamic allocation, Spark supports it without ESS via shuffle tracking. We haven't prioritized testing it, but in theory it should work, since we report back to Spark which executors hold specific block IDs via the same MapStatus APIs that regular Spark shuffle uses. Please feel free to reach out to us via [email protected] if you'd like to talk about your use case privately. We are happy to see interest in UCX and would love to learn about the system configuration and hardware being used.
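For reference, here is a minimal sketch of the stock Spark settings that turn on dynamic allocation with shuffle tracking while leaving ESS off. The three `spark.shuffle.service.*` / `spark.dynamicAllocation.*` keys are standard Spark configuration; the min/max executor values and the helper `to_spark_submit_args` are illustrative assumptions, and the RAPIDS shuffle manager settings from the linked guide would still need to be added alongside. As noted above, this combination has not been tested with UCX mode.

```python
# Sketch only: standard Spark settings for dynamic allocation without the
# External Shuffle Service. Not validated against the UCX shuffle mode.
DYNAMIC_ALLOCATION_WITHOUT_ESS = {
    # ESS stays off, as required for UCX mode.
    "spark.shuffle.service.enabled": "false",
    # Let Spark scale executors up and down with load.
    "spark.dynamicAllocation.enabled": "true",
    # Track shuffle files on executors so ESS is not needed.
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    # Illustrative bounds (assumptions); tune for your cluster.
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "8",
}


def to_spark_submit_args(conf):
    """Render a config dict as a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    print(" ".join(to_spark_submit_args(DYNAMIC_ALLOCATION_WITHOUT_ESS)))
```

With shuffle tracking, an executor that still holds shuffle data needed by an active job is not released, so scale-down may be slower than with ESS.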
-
https://docs.nvidia.com/spark-rapids/user-guide/latest/additional-functionality/rapids-shuffle.html#ucx-mode says ESS needs to be disabled, and hence dynamic allocation needs to be disabled as well. Is there any way to cater for the case of dynamically allocating executors depending on the load? Or are we saying that in UCX mode the number of executors needs to stay constant for the lifetime of the Spark context?