High CPU utilization with all query operators/stages GPU-based #11963
-
It could be a number of things causing the issue. We would need to do some profiling to really find out, and I am happy to do some for you; your use case is simple enough that I should be able to reproduce it locally. Be aware that we try to use the GPU for the things the GPU is good at and still use the CPU for the things it is good at. The CPU still wins at compression and decompression when you have lots of cores, so that is my guess, but I would have to run something to really see.
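For reference, here is a minimal spark-shell check of which codecs are in play; the defaults shown are just stock Spark defaults, not necessarily what this cluster is actually using:

```scala
// Minimal sketch: print the compression codecs involved. Shuffle blocks are
// compressed and decompressed on the CPU using spark.io.compression.codec,
// and the parquet input in this benchmark is gzip-compressed. The defaults
// shown here are stock Spark defaults, not this cluster's settings.
println(spark.conf.get("spark.io.compression.codec", "lz4"))             // shuffle/broadcast blocks
println(spark.conf.get("spark.sql.parquet.compression.codec", "snappy")) // codec used when writing parquet
```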
-
Thanks!
-
Setup scripts with all the versions, plus the spark-shell run commands for both the CPU and GPU clusters.
-
I first wanted to verify that I got similar results, because I was running in local mode with 12 CPU cores and 1 GPU instead of your setup of 4 GPUs with 12 CPU cores each. It is just a lot simpler to profile things in local mode. The query has three stages. The first stage reads in the parquet data and does a partial aggregation to drop the duplicates. The second stage finishes the deduplication and repartitions the data so that the window operation can happen. The last stage sorts the data, does the window operation, and writes the results out. For the first stage, about 9 CPU cores were fully utilized the entire time. For the second stage, I saw about 10 CPU cores fully utilized. The final stage only had about 3.5 CPU cores utilized. So yes, this does look like a lot of CPU being used, more than I would want or expect. I did some very simple hprof profiling.
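The query shape described above corresponds roughly to the following sketch; the column names and paths are placeholders, and the actual query is in the attached scripts:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Hypothetical reconstruction of the query shape described above. Column
// names and paths are placeholders; the real query is in the attached scripts.
val df = spark.read.parquet("s3://bucket/input/")       // stage 1: scan + partial aggregate
val deduped = df.dropDuplicates()                        // stage 2: final aggregate after the shuffle
val w = Window.partitionBy(col("c0")).orderBy(col("c1")) // stage 3: sort, window, write
deduped
  .withColumn("rn", row_number().over(w))
  .write.mode("overwrite").parquet("s3://bucket/output/")
```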
It looks like just about all of the slowness is related to shuffle, and most of that comes from shuffle serialization. We know that this is an issue and have been working on improving it. It is still a WIP, but you should hopefully start to see some improvements in 25.02. Just FYI, on my setup the query takes about 105 seconds to run.
-
Hello,
I have a general question about CPU utilization. It's unlikely to be a bug, just behavior I don't fully understand. I'm running benchmarks on synthetic data and I see surprisingly high CPU utilization. I was expecting that when all query parts are executed on the GPU, CPU utilization would be relatively low, since all the CPU does is fetch data from other workers during the shuffle. Instead I see 80-90% utilization for most of the query's runtime.
AWS instance used for GPU workers: g4dn.12xlarge - 48 cores / 192 GB RAM / 4 T4 GPUs / 900 GB NVMe / 50 Gbps
Am I missing something? What loads the CPU that much?
For comparison, I also ran the same query on a CPU-only instance, and there I also get 80-90% utilization, which makes sense since the CPUs handle all the operations in that case.
AWS instance used for CPU workers: r6id.24xlarge - 96 cores / 768 GB RAM / 5.7 TB NVMe / 37.5 Gbps
Setup
- 1 master: m5dn.xlarge (4 cores / 16 GB RAM / 150 GB NVMe / 25 Gbps)
- 1 worker: r6id.24xlarge (96 cores / 768 GB RAM / 5.7 TB NVMe / 37.5 Gbps)
Logic
Data
- S3
- 150 parquet files
- 22 GB gzip
- 50 million rows
Schema:
- 100 string columns
- 5 random chars each
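Data of roughly this shape can be generated with something like the following sketch; the output path is a placeholder and the actual generator is in the attached setup scripts:

```scala
import org.apache.spark.sql.functions._

// Sketch of a generator for the data shape described above: 50 million rows,
// 100 string columns of 5 random characters each, written as gzip parquet.
// The path and column names are placeholders; the real generator is in the
// attached scripts.
val cols = (0 until 100).map(i => substring(md5(rand().cast("string")), 1, 5).as(s"c$i"))
spark.range(50000000L)
  .select(cols: _*)
  .repartition(150)                              // roughly match the 150 input files
  .write.option("compression", "gzip")
  .parquet("s3://bucket/synthetic/")
```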