High CPU utilization with all query operators/stages GPU-based #11963
-
It could be a number of things causing the issue. We would need to do some profiling to really find out, and I am happy to do some for you; your use case is simple enough that I should be able to reproduce it locally. Be aware that we try to use the GPU for the things the GPU is good at and still use the CPU for the things it is good at. The CPU still wins at compression and decompression when you have lots of cores, so that is my guess, but I would have to run something to really see.
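For reference, here is a minimal spark-shell check of which codecs are in play; the defaults shown are just stock Spark defaults, not necessarily what this cluster is actually using:

```scala
// Minimal sketch: print the compression codecs involved. Shuffle blocks are
// compressed and decompressed on the CPU using spark.io.compression.codec,
// and the parquet input in this benchmark is gzip-compressed. The defaults
// shown here are stock Spark defaults, not this cluster's settings.
println(spark.conf.get("spark.io.compression.codec", "lz4"))             // shuffle/broadcast blocks
println(spark.conf.get("spark.sql.parquet.compression.codec", "snappy")) // codec used when writing parquet
```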
-
Thanks!
-
Setup scripts with all the versions, plus the spark-shell run commands for both the CPU and GPU clusters.
-
I first wanted to verify that I got similar results, because I was running in local mode with 12 CPU cores and 1 GPU instead of your setup of 4 GPUs with 12 CPU cores each. It is just a lot simpler to profile things in local mode. The query has three stages. The first stage reads in the parquet data and does a partial aggregation to drop the duplicates. The second stage finishes the deduplication and repartitions the data so that the window operation can happen. The last stage sorts the data, does the window operation, and writes the results out. For the first stage, about 9 CPU cores were fully utilized the entire time. For the second stage, I saw about 10 CPU cores fully utilized. The final stage only had about 3.5 CPU cores utilized. So yes, this does look like a lot of CPU being used, more than I would want or expect. I did some very simple hprof profiling.
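The query shape described above corresponds roughly to the following sketch; the column names and paths are placeholders, and the actual query is in the attached scripts:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Hypothetical reconstruction of the query shape described above. Column
// names and paths are placeholders; the real query is in the attached scripts.
val df = spark.read.parquet("s3://bucket/input/")       // stage 1: scan + partial aggregate
val deduped = df.dropDuplicates()                        // stage 2: final aggregate after the shuffle
val w = Window.partitionBy(col("c0")).orderBy(col("c1")) // stage 3: sort, window, write
deduped
  .withColumn("rn", row_number().over(w))
  .write.mode("overwrite").parquet("s3://bucket/output/")
```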
It looks like just about all of the slowness is related to shuffle, and most of that comes from shuffle serialization. We know that this is an issue and have been working on improving it. It is still a WIP, but you should hopefully start to see some improvements in 25.02. Just FYI, on my setup the query takes about 105 seconds to run.
-
Hello,
I have a general question about CPU utilization. It's unlikely to be a bug, just behavior I don't fully understand. I'm running benchmarks on synthetic data and I see surprisingly high CPU utilization. I was expecting that when all query parts are executed on the GPU, CPU utilization would be relatively low, since all the CPU does is fetch data from other workers during the shuffle. Instead I see 80-90% utilization for most of the query's runtime.
AWS instance used for GPU workers: g4dn.12xlarge - 48 cores / 192 GB RAM / 4 T4 GPUs / 900 GB NVMe / 50 Gbps
Am I missing something? What loads the CPU that much?
For comparison, I also ran the same query on a CPU-only instance, and there I also get 80-90% utilization, which makes sense since the CPUs handle all the operations in that case.
AWS instance used for CPU workers: r6id.24xlarge - 96 cores / 768 GB RAM / 5.7 TB NVMe / 37.5 Gbps
Setup
- 1 master: m5dn.xlarge (4 cores / 16 GB RAM / 150 GB NVMe / 25 Gbps)
- 1 worker: r6id.24xlarge (96 cores / 768 GB RAM / 5.7 TB NVMe / 37.5 Gbps)
Logic
Data
- S3
- 150 parquet files
- 22 GB gzip
- 50 million rows
Schema:
- 100 string columns
- 5 random chars each
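Data of roughly this shape can be generated with something like the following sketch; the output path is a placeholder and the actual generator is in the attached setup scripts:

```scala
import org.apache.spark.sql.functions._

// Sketch of a generator for the data shape described above: 50 million rows,
// 100 string columns of 5 random characters each, written as gzip parquet.
// The path and column names are placeholders; the real generator is in the
// attached scripts.
val cols = (0 until 100).map(i => substring(md5(rand().cast("string")), 1, 5).as(s"c$i"))
spark.range(50000000L)
  .select(cols: _*)
  .repartition(150)                              // roughly match the 150 input files
  .write.option("compression", "gzip")
  .parquet("s3://bucket/synthetic/")
```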