ImageNet running in Yarn, nodeManager memory keep on increasing #123
Comments
Thanks for posting the issue! Is this using SparkNet with Caffe (or TensorFlow)? We're trying to reproduce it at the moment.
This is using SparkNet with Caffe. Here is more information. I tried to correlate the times at which the memory leak happens. Here are the driver logs:

Tue Apr 26 10:25:26 PDT 2016 250.293, i = 1: weight = 0.008250123
Tue Apr 26 10:25:27 PDT 2016 250.763, i = 2: setting weights on workers
Tue Apr 26 10:25:29 PDT 2016 252.46, i = 2: training
Tue Apr 26 10:26:00 PDT 2016 284.225, i = 2: weight = 0.008250123
Tue Apr 26 10:26:01 PDT 2016 284.668, i = 3: setting weights on workers
Tue Apr 26 10:26:03 PDT 2016 286.562, i = 3: training
Tue Apr 26 10:26:36 PDT 2016 319.761, i = 3: weight = 0.008250123
Tue Apr 26 10:26:36 PDT 2016 320.2, i = 4: setting weights on workers
Tue Apr 26 10:26:38 PDT 2016 321.874, i = 4: training
Tue Apr 26 10:27:10 PDT 2016 354.226, i = 4: weight = 0.008250123
Tue Apr 26 10:27:11 PDT 2016 354.768, i = 5: setting weights on workers
Tue Apr 26 10:27:13 PDT 2016 356.384, i = 5: training
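For context, the log pattern above suggests a per-iteration loop on the driver that pushes the current weights to the workers and then trains. Below is a minimal sketch of such a loop, assuming the weights are shipped with a Spark broadcast; trainLoop, workersRDD, workerTrainStep, and averageWeights are hypothetical names, not SparkNet's actual API.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Hypothetical driver loop (not SparkNet's actual code) matching the log
// pattern above: each iteration sends the current weights to the workers,
// runs training there, and logs a weight value back on the driver.
def trainLoop(sc: SparkContext, iterations: Int): Unit = {
  var weights: Array[Float] = Array.fill(1000)(0.008f) // placeholder model weights
  for (i <- 1 to iterations) {
    println(s"i = $i: setting weights on workers")
    val broadcastWeights: Broadcast[Array[Float]] = sc.broadcast(weights)

    println(s"i = $i: training")
    // workerTrainStep is an assumption: each executor would load
    // broadcastWeights.value into its local Caffe net and run minibatches, e.g.
    // weights = workersRDD.map(_ => workerTrainStep(broadcastWeights.value)).reduce(averageWeights)

    println(s"i = $i: weight = " + weights(0))
    // Note: nothing releases broadcastWeights here, so each iteration leaves
    // another cached copy on the executors (this turns out to matter below).
  }
}
```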
Reading the mean image from a file for ImageNet will speed up the process. Here is the mean image for ImageNet:
Thanks a lot! On the CifarApp in local mode, the problem does not seem to occur; we are trying ImageNet now. We need to figure out if we can reproduce the memory leak without YARN, and then use a memory profiler to track it down. If it is easy for you to run a memory profiler with your current setup, that might help us diagnose and reproduce the bug.
Will run in a Spark standalone cluster.
Compared two heap dumps that have a 1 GB memory difference. float[] contributes about 300 MB, which pinpoints the data buffer of JavaNDArray. Each heap dump is 10 GB or more, so they are hard to load up. Hopefully this will help. Here is the code related to data in the Scala/Java part (I am going through it now):
java/libs/JavaNDUtils.java: public static final int[] copyOf(int[] data) { ...
Let us kill this bug so that we can benchmark ImageNet. :)
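For reference, here is a minimal sketch of the kind of structure the heap dump points at, assuming (as the dump suggests) that the ND-array is backed by a flat float[] data buffer and that a copyOf-style helper makes full defensive copies; NDArraySketch and its members are hypothetical names, not SparkNet's actual JavaNDArray/JavaNDUtils code.

```scala
import java.util.Arrays

// Hypothetical sketch: an n-dimensional array backed by a flat float[] buffer.
// Every defensive copy of the buffer allocates a new float[] of the full size,
// which is why retained copies show up as large float[] instances in a heap dump.
class NDArraySketch(val shape: Array[Int]) {
  val data: Array[Float] = new Array[Float](shape.product)

  // analogous to a copyOf helper for the int[] shape
  def shapeCopy: Array[Int] = Arrays.copyOf(shape, shape.length)

  // a full copy of the underlying buffer (the expensive one for ImageNet-sized data)
  def dataCopy: Array[Float] = Arrays.copyOf(data, data.length)
}
```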
Found the issue: it is due to broadcast variables accumulating over iterations. Here is the solution (unpersist and destroy the broadcast variable; I am using Spark 1.6.0):
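For illustration, here is a minimal sketch of that fix, assuming the Spark 1.6.0 Broadcast API (unpersist and destroy on org.apache.spark.broadcast.Broadcast); broadcastAndRelease and setWeightsAndTrain are hypothetical helpers, not the actual patch.

```scala
import scala.reflect.ClassTag

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Sketch of the fix: release each iteration's broadcast variable once the
// workers are done with it, so cached copies do not accumulate.
def broadcastAndRelease[T: ClassTag](sc: SparkContext, value: T)(use: Broadcast[T] => Unit): Unit = {
  val bcast = sc.broadcast(value)
  try {
    use(bcast)
  } finally {
    bcast.unpersist(blocking = true) // drop the cached copies on the executors
    bcast.destroy()                  // also release driver-side state; bcast is unusable afterwards
  }
}

// Usage per training iteration (setWeightsAndTrain stands in for whatever
// actually consumes the broadcast weights on the workers):
// broadcastAndRelease(sc, weights) { bw => setWeightsAndTrain(bw) }
```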
That's really great! Would you be interested in submitting a PR that fixes it (otherwise we'll do it and reference this issue)?
Would you please do it this time? I will submit PRs for other enhancements later. Thanks.
OK, I created PR #125. Thanks again for finding and fixing the problem! We are doing some more testing before merging it. Please also let us know about your experience on YARN (we are not running on YARN).
I have run ImageNet in YARN cluster mode and noticed that the NodeManager memory keeps increasing. It seems to be a memory leak in the C++/JNI code, since the CoarseGrainedExecutorBackend memory is very stable.
See the two processes (1127 keeps growing, while 1130 is very stable):
0 S yarn 1127 1125 0 80 0 - 2910 wait 13:15 ? 00:00:00 /bin/bash -c LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop/../../../CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop/lib/native:/opt/gpu/cuda/lib64:/data02/nhe/SparkNet/lib:/data02/nhe/cuda-7.0::/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop/lib/native /usr/lib/jvm/java-7-oracle-cloudera/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms22528m -Xmx22528m -Djava.io.tmpdir=/data02/yarn/nm/usercache/hdfs/appcache/application_1461609406099_0001/container_1461609406099_0001_02_000002/tmp '-Dspark.authenticate=false' '-Dspark.driver.port=56487' '-Dspark.shuffle.service.port=7337' '-Dspark.ui.port=0' -Dspark.yarn.app.container.log.dir=/data02/yarn/container-logs/application_1461609406099_0001/container_1461609406099_0001_02_000002 -XX:MaxPermSize=256m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://[email protected]:56487 --executor-id 1 --hostname bdalab12.samsungsdsra.com --cores 16 --app-id application_1461609406099_0001 --user-class-path file:/data02/yarn/nm/usercache/hdfs/appcache/application_1461609406099_0001/container_1461609406099_0001_02_000002/app.jar 1> /data02/yarn/container-logs/application_1461609406099_0001/container_1461609406099_0001_02_000002/stdout 2> /data02/yarn/container-logs/application_1461609406099_0001/container_1461609406099_0001_02_000002/stderr
0 S yarn 1130 1127 99 80 0 - 56878287 futex_ 13:15 ? 01:25:40 /usr/lib/jvm/java-7-oracle-cloudera/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms22528m -Xmx22528m -Djava.io.tmpdir=/data02/yarn/nm/usercache/hdfs/appcache/application_1461609406099_0001/container_1461609406099_0001_02_000002/tmp -Dspark.authenticate=false -Dspark.driver.port=56487 -Dspark.shuffle.service.port=7337 -Dspark.ui.port=0 -Dspark.yarn.app.container.log.dir=/data02/yarn/container-logs/application_1461609406099_0001/container_1461609406099_0001_02_000002 -XX:MaxPermSize=256m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://[email protected]:56487 --executor-id 1 --hostname bdalab12.samsungsdsra.com --cores 16 --app-id application_1461609406099_0001 --user-class-path file:/data02/yarn/nm/usercache/hdfs/appcache/application_1461609406099_0001/container_1461609406099_0001_02_000002/app.jar