ImageNet running in YARN, NodeManager memory keeps increasing #123

Closed · nhe150 opened this issue Apr 25, 2016 · 10 comments

nhe150 commented Apr 25, 2016

I have run the ImageNet app in YARN cluster mode and noticed that the NodeManager-reported memory keeps increasing. It seems to be a memory leak in the C++/JNI code, since the CoarseGrainedExecutorBackend JVM memory is very stable.

See the two processes below: 1127 keeps growing, while 1130 is very stable.

0 S yarn 1127 1125 0 80 0 - 2910 wait 13:15 ? 00:00:00 /bin/bash -c LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop/../../../CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop/lib/native:/opt/gpu/cuda/lib64:/data02/nhe/SparkNet/lib:/data02/nhe/cuda-7.0::/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop/lib/native /usr/lib/jvm/java-7-oracle-cloudera/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms22528m -Xmx22528m -Djava.io.tmpdir=/data02/yarn/nm/usercache/hdfs/appcache/application_1461609406099_0001/container_1461609406099_0001_02_000002/tmp '-Dspark.authenticate=false' '-Dspark.driver.port=56487' '-Dspark.shuffle.service.port=7337' '-Dspark.ui.port=0' -Dspark.yarn.app.container.log.dir=/data02/yarn/container-logs/application_1461609406099_0001/container_1461609406099_0001_02_000002 -XX:MaxPermSize=256m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://[email protected]:56487 --executor-id 1 --hostname bdalab12.samsungsdsra.com --cores 16 --app-id application_1461609406099_0001 --user-class-path file:/data02/yarn/nm/usercache/hdfs/appcache/application_1461609406099_0001/container_1461609406099_0001_02_000002/app.jar 1> /data02/yarn/container-logs/application_1461609406099_0001/container_1461609406099_0001_02_000002/stdout 2> /data02/yarn/container-logs/application_1461609406099_0001/container_1461609406099_0001_02_000002/stderr


0 S yarn 1130 1127 99 80 0 - 56878287 futex_ 13:15 ? 01:25:40 /usr/lib/jvm/java-7-oracle-cloudera/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms22528m -Xmx22528m -Djava.io.tmpdir=/data02/yarn/nm/usercache/hdfs/appcache/application_1461609406099_0001/container_1461609406099_0001_02_000002/tmp -Dspark.authenticate=false -Dspark.driver.port=56487 -Dspark.shuffle.service.port=7337 -Dspark.ui.port=0 -Dspark.yarn.app.container.log.dir=/data02/yarn/container-logs/application_1461609406099_0001/container_1461609406099_0001_02_000002 -XX:MaxPermSize=256m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://[email protected]:56487 --executor-id 1 --hostname bdalab12.samsungsdsra.com --cores 16 --app-id application_1461609406099_0001 --user-class-path file:/data02/yarn/nm/usercache/hdfs/appcache/application_1461609406099_0001/container_1461609406099_0001_02_000002/app.jar

robertnishihara (Member) commented:

Thanks for posting the issue! Is this using SparkNet with Caffe (or TensorFlow)? We're trying to reproduce it at the moment.

nhe150 (Author) commented Apr 26, 2016

This is using SparkNet with Caffe. Here is more information; I tried to correlate the times when the memory leak happens. Here are the logs.

The driver logs:
Tue Apr 26 10:25:09 PDT 2016 232.468, i = 1: collecting weights
Tue Apr 26 10:25:26 PDT 2016 249.743: collect took 17.274 s

Tue Apr 26 10:25:26 PDT 2016 250.293, i = 1: weight = 0.008250123
Tue Apr 26 10:25:26 PDT 2016 250.293, i = 2: broadcasting weights
Tue Apr 26 10:25:27 PDT 2016 250.763: broadcast took 0.47 s

Tue Apr 26 10:25:27 PDT 2016 250.763, i = 2: setting weights on workers
Tue Apr 26 10:25:29 PDT 2016 252.46: setweight took 1.697 s

Tue Apr 26 10:25:29 PDT 2016 252.46, i = 2: training
Tue Apr 26 10:25:44 PDT 2016 267.882, i = 2: collecting weights
Tue Apr 26 10:26:00 PDT 2016 283.696: collect took 15.814 s

Tue Apr 26 10:26:00 PDT 2016 284.225, i = 2: weight = 0.008250123
Tue Apr 26 10:26:00 PDT 2016 284.225, i = 3: broadcasting weights
Tue Apr 26 10:26:01 PDT 2016 284.668: broadcast took 0.443 s

Tue Apr 26 10:26:01 PDT 2016 284.668, i = 3: setting weights on workers
Tue Apr 26 10:26:03 PDT 2016 286.562: setweight took 1.894 s

Tue Apr 26 10:26:03 PDT 2016 286.562, i = 3: training
Tue Apr 26 10:26:18 PDT 2016 301.765, i = 3: collecting weights
Tue Apr 26 10:26:35 PDT 2016 319.236: collect took 17.406 s

Tue Apr 26 10:26:36 PDT 2016 319.761, i = 3: weight = 0.008250123
Tue Apr 26 10:26:36 PDT 2016 319.761, i = 4: broadcasting weights
Tue Apr 26 10:26:36 PDT 2016 320.2: broadcast took 0.439 s

Tue Apr 26 10:26:36 PDT 2016 320.2, i = 4: setting weights on workers
Tue Apr 26 10:26:38 PDT 2016 321.874: setweight took 1.674 s

Tue Apr 26 10:26:38 PDT 2016 321.874, i = 4: training
Tue Apr 26 10:26:54 PDT 2016 337.492, i = 4: collecting weights
Tue Apr 26 10:27:10 PDT 2016 353.704: collect took 16.211 s

Tue Apr 26 10:27:10 PDT 2016 354.226, i = 4: weight = 0.008250123
Tue Apr 26 10:27:10 PDT 2016 354.226, i = 5: broadcasting weights
Tue Apr 26 10:27:11 PDT 2016 354.768: broadcast took 0.542 s

Tue Apr 26 10:27:11 PDT 2016 354.768, i = 5: setting weights on workers
Tue Apr 26 10:27:13 PDT 2016 356.384: setweight took 1.616 s

Tue Apr 26 10:27:13 PDT 2016 356.384, i = 5: training
Tue Apr 26 10:27:28 PDT 2016 371.715, i = 5: collecting weights
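
For orientation, the phases in this log (broadcasting weights, setting weights on workers, training, collecting weights) correspond to a synchronous loop of roughly the shape sketched below. This is a self-contained toy version with made-up weight arrays and a placeholder local-training step, not the actual SparkNet/Caffe code.

import org.apache.spark.{SparkConf, SparkContext}

// Toy sketch of the broadcast / set-weights / train / collect-weights cycle in the log.
// All names and the "training" step are placeholders, not the real application code.
object SyncLoopSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sync-loop-sketch").setMaster("local[2]"))
    val numWorkers = 2
    val workers = sc.parallelize(0 until numWorkers, numWorkers)
    var netWeights = Array.fill(1000)(0.0f)              // stands in for the Caffe net weights

    for (i <- 1 to 5) {
      val broadcastWeights = sc.broadcast(netWeights)    // "broadcasting weights"
      netWeights = workers
        .map { _ =>
          val w = broadcastWeights.value.clone()         // "setting weights on workers"
          w.indices.foreach(j => w(j) += 0.001f)         // placeholder for local solver steps ("training")
          w
        }
        .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })  // "collecting weights"
        .map(_ / numWorkers)                             // average across workers
    }
    sc.stop()
  }
}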

The NodeManager logs, where memory keeps growing:
2016-04-26 10:25:28,265 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.1 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:25:31,312 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.1 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:25:34,360 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.1 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:25:37,404 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.1 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:25:40,451 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.1 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:25:43,498 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.1 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:25:46,525 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.2 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:25:49,549 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.2 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:25:52,595 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.2 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:25:55,640 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.2 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:25:58,685 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.2 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:01,732 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.6 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:04,779 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.6 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:07,826 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.6 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:10,870 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.6 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:13,917 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.6 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:16,963 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 9.6 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:20,012 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 10.3 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:23,057 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 10.3 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:26,103 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 10.3 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:29,148 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 10.3 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:32,194 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 10.3 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:35,239 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 10.3 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:38,285 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 10.3 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:41,329 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 10.3 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:44,377 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 10.5 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:47,424 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 10.5 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:50,471 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 10.5 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:53,532 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 10.5 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:56,578 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.0 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:26:59,624 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.0 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:02,669 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.0 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:05,715 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.0 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:08,760 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.0 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:11,813 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.0 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:14,859 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.0 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:17,905 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.0 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:20,951 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.0 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:23,998 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.0 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:27,045 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.0 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:30,093 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.4 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:33,139 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.4 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:36,200 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.4 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:39,246 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.4 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:42,291 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.4 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:45,337 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.4 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:48,383 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.4 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:51,416 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.4 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:54,450 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.4 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:27:57,482 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.4 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:28:00,530 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.4 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:28:03,578 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.4 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:28:06,624 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.9 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:28:09,670 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.9 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:28:12,716 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.9 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:28:15,760 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.9 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:28:18,822 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.9 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:28:21,868 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.9 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:28:24,916 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.9 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:28:27,965 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.9 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:28:31,012 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.9 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:28:34,059 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.9 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:28:37,106 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 11.9 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used
2016-04-26 10:28:40,148 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 42178 for container-id container_1461609406099_0007_01_000002: 12.3 GB of 13 GB physical memory used; 202.4 GB of 27.3 GB virtual memory used

nhe150 (Author) commented Apr 26, 2016

Reading the mean image from a file for ImageNet will speed up the process.
Here is the code:

import java.io.{FileInputStream, ObjectInputStream}

// Read the precomputed ImageNet mean (a serialized Array[Float]) from disk.
val fileName = sparkNetHome + "/imagenet.mean"
val in: ObjectInputStream = new ObjectInputStream(new FileInputStream(fileName))
val meanImage: Array[Float] = in.readObject().asInstanceOf[Array[Float]]
in.close()
logger.log("reading mean")

Here is the mean image for ImageNet:
imagenet.mean.zip
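
For completeness, here is a minimal sketch of how such a mean file can be produced with plain Java serialization, so that it matches the ObjectInputStream reader above. The environment variable, path, and array size are assumptions.

import java.io.{FileOutputStream, ObjectOutputStream}

// Write the mean image as a serialized Array[Float]; readObject() above returns the same array.
val sparkNetHome = sys.env.getOrElse("SPARKNET_HOME", ".")   // assumed location
val meanImage = new Array[Float](3 * 256 * 256)              // assumed channels * height * width; fill with real means
val out = new ObjectOutputStream(new FileOutputStream(sparkNetHome + "/imagenet.mean"))
out.writeObject(meanImage)
out.close()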

pcmoritz (Collaborator) commented:

Thanks a lot! On the CifarApp in local mode, the problem does not seem to occur; we are trying ImageNet now. We need to figure out whether we can reproduce the memory leak without YARN, and then use a memory profiler to track it down. If it is easy for you to run a memory profiler with your current setup, that might help us diagnose and reproduce the bug.
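
If attaching a full profiler inside a YARN container is awkward, one lightweight alternative (standard JMX beans only, nothing SparkNet-specific) is to periodically log heap and buffer-pool usage from inside the executor, for example:

import java.lang.management.{BufferPoolMXBean, ManagementFactory}
import scala.collection.JavaConverters._

// Log JVM heap usage plus direct/mapped buffer pools. Note that native memory
// allocated by Caffe through JNI does not show up in these numbers, so a flat
// heap with growing container RSS still points at native allocations.
def logMemoryUsage(tag: String): Unit = {
  val heap = ManagementFactory.getMemoryMXBean.getHeapMemoryUsage
  val pools = ManagementFactory.getPlatformMXBeans(classOf[BufferPoolMXBean]).asScala
  val buffers = pools.map(p => s"${p.getName}=${p.getMemoryUsed / (1 << 20)} MB").mkString(", ")
  println(s"[$tag] heap used = ${heap.getUsed / (1 << 20)} MB; buffers: $buffers")
}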

nhe150 (Author) commented Apr 26, 2016

I will run it in a Spark standalone cluster.
I have run jvisualvm on YARN: the CoarseGrainedExecutorBackend memory is stable, but the memory of the YARN container process tree that spawns the executor (via a shell script wrapper) keeps growing as posted above (I have not profiled that part).

nhe150 (Author) commented Apr 27, 2016

I compared two heap dumps with about a 1 GB difference in memory.

float[] contributes about 300 MB, which points to the data field of JavaNDArray.
byte[] contributes about 600 MB, which points to the buf field of ByteArrayOutputStream.

Each heap dump is 10 GB or more, so it is hard to load. Hopefully this will help. Here is all the code related to data in the Scala/Java part (I am going through it now):

java/libs/JavaNDUtils.java: public static final int[] copyOf(int[] data) {
java/libs/JavaNDUtils.java: return Arrays.copyOf(data, data.length);
java/libs/JavaNDUtils.java: // Remove element from position index in data, return deep copy
java/libs/JavaNDUtils.java: public static int[] removeIndex(int[] data, int index) {
java/libs/JavaNDUtils.java: assert(index < data.length);
java/libs/JavaNDUtils.java: int len = data.length;
java/libs/JavaNDUtils.java: System.arraycopy(data, 0, result, 0, index);
java/libs/JavaNDUtils.java: System.arraycopy(data, index + 1, result, index, len - index - 1);
java/libs/JavaNDArray.java: protected final float[] data;
java/libs/JavaNDArray.java: public JavaNDArray(float[] data, int dim, int[] shape, int offset, int[] strides) {
java/libs/JavaNDArray.java: this.data = data;
java/libs/JavaNDArray.java: public JavaNDArray(float[] data, int... shape) {
java/libs/JavaNDArray.java: this(data, shape.length, shape, 0, JavaNDUtils.calcDefaultStrides(shape));
java/libs/JavaNDArray.java: return new JavaNDArray(data, dim - 1, JavaNDUtils.removeIndex(shape, axis), offset + index * strides[axis], JavaNDUtils.removeIndex(strides, axis));
java/libs/JavaNDArray.java: return new JavaNDArray(data, dim, JavaNDUtils.copyOf(newShape), offset + JavaNDUtils.dot(lowerOffsets, strides), strides); // todo: why copy shape?
java/libs/JavaNDArray.java: data[ix] = value;
java/libs/JavaNDArray.java: return data[ix];
java/libs/JavaNDArray.java: System.arraycopy(data, offset, result, flatIndex, shape[dim - 1]);
java/libs/JavaNDArray.java: result[flatIndex] = data[offset + i * strides[dim - 1]];
java/libs/JavaNDArray.java: result[0] = data[offset];
java/libs/JavaNDArray.java: return new JavaNDArray(data, flatShape.length, flatShape, 0, JavaNDUtils.calcDefaultStrides(flatShape));
java/libs/JavaNDArray.java: return data;
scala/libs/JavaCPPUtils.scala: val data = new ArrayFloat
scala/libs/JavaCPPUtils.scala: val pointer = floatBlob.cpu_data
scala/libs/JavaCPPUtils.scala: data(i) = pointer.get(i)
scala/libs/JavaCPPUtils.scala: NDArray(data, shape)
scala/libs/JavaCPPUtils.scala: val buffer = blob.mutable_cpu_data()
scala/libs/JavaCPPUtils.scala: val buffer = blob.cpu_data()
scala/libs/Preprocessor.scala:// The Preprocessor is provides a function for reading data from a dataframe row
scala/libs/Preprocessor.scala:// The convert method in DefaultPreprocessor is used to convert data extracted
scala/libs/Preprocessor.scala:// from a dataframe into an NDArray, which can then be passed into a net. The
scala/libs/Preprocessor.scala: schema(name).dataType match {
scala/libs/Preprocessor.scala: schema(name).dataType match {
scala/libs/Preprocessor.scala: schema(name).dataType match {
scala/libs/Preprocessor.scala: } else if (name == "data") {
scala/libs/Preprocessor.scala: throw new Exception("The name is not label or data, name = " + name + "\n")
scala/libs/NDArray.scala: def apply(data: Array[Float], shape: Array[Int]) = {
scala/libs/NDArray.scala: if (data.length != shape.product) {
scala/libs/NDArray.scala: throw new IllegalArgumentException("The data and shape arguments are not compatible, data.length = " + data.length.toString + " and shape = " + shape.deep + ".\n")
scala/libs/NDArray.scala: new NDArray(new JavaNDArray(data, shape:_*))
scala/libs/CaffeNet.scala: // Preallocate a buffer for data input into the net
scala/libs/CaffeNet.scala: // data
scala/libs/CaffeNet.scala: def forward(rowIt: Iterator[Row], dataBlobNames: List[String] = ListString): Map[String, NDArray] = {
scala/libs/CaffeNet.scala: for (name <- dataBlobNames) {
scala/libs/CaffeNet.scala: val data = new ArrayFloat
scala/libs/CaffeNet.scala: blob.cpu_data.get(data, 0, data.length)
scala/libs/CaffeNet.scala: weightList += NDArray(data, shape)
scala/libs/CaffeNet.scala: blob.mutable_cpu_data.put(flatWeights, 0, flatWeights.length)

For buf, the only related code is in ScaleAndConverter:
val im = ImageIO.read(new ByteArrayInputStream(compressedImage))
val resizedImage = Thumbnails.of(im).forceSize(width, height).asBufferedImage()
Some(BufferedImageToByteArray(resizedImage))

Let us kill this bug so that we can benchmark ImageNet. :)
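
For context, here is a self-contained sketch of a decode/resize/re-encode step like the ScaleAndConverter snippet above, which is where ByteArrayOutputStream buffers of this kind come from. It assumes the Thumbnailator library, and the JPEG re-encode is a stand-in for BufferedImageToByteArray, whose implementation is not shown in this thread.

import java.awt.image.BufferedImage
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import javax.imageio.ImageIO
import net.coobird.thumbnailator.Thumbnails

// Decode a compressed image, force it to width x height, and re-encode it to bytes.
// Each call allocates a BufferedImage plus a growing ByteArrayOutputStream buffer.
def scaleAndConvert(compressedImage: Array[Byte], width: Int, height: Int): Option[Array[Byte]] = {
  val im = ImageIO.read(new ByteArrayInputStream(compressedImage))
  if (im == null) {
    None                                                       // image could not be decoded
  } else {
    val resized: BufferedImage = Thumbnails.of(im).forceSize(width, height).asBufferedImage()
    val buf = new ByteArrayOutputStream()
    ImageIO.write(resized, "jpg", buf)                         // stand-in for BufferedImageToByteArray
    Some(buf.toByteArray)
  }
}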

nhe150 (Author) commented Apr 28, 2016

Found the issue. It is due to broadcast variables accumulating across iterations.

Here is the solution (unpersist and destroy the broadcast variable; I am using Spark 1.6.0):
logger.log("setting weights on workers", i)
workers.foreach(_ => workerStore.get[CaffeSolver]("solver").trainNet.setWeights(broadcastWeights.value))
broadcastWeights.unpersist()
broadcastWeights.destroy()
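
Placed in the training loop, the two calls go right after the action that consumes the broadcast, since destroy() must not run while a task could still need the value. The following fragment is a sketch using the names from the snippet above, not the exact application code:

for (i <- 0 until numIterations) {
  val broadcastWeights = sc.broadcast(netWeights)
  logger.log("setting weights on workers", i)
  // foreach is an action, so every task has already read broadcastWeights.value when it returns
  workers.foreach(_ => workerStore.get[CaffeSolver]("solver").trainNet.setWeights(broadcastWeights.value))
  broadcastWeights.unpersist()   // drop the cached broadcast blocks on the executors
  broadcastWeights.destroy()     // release the remaining broadcast state (Spark 1.6 API)
  // ... training and weight collection for iteration i continue as before ...
}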

pcmoritz (Collaborator) commented:

That's really great! Would you be interested in submitting a PR that fixes it (otherwise we'll do it and reference this issue)?

nhe150 (Author) commented Apr 28, 2016

Would you please do it this time? I will submit a PR for other enhancements later. Thanks.

pcmoritz (Collaborator) commented:

OK, I created PR #125. Thanks again for finding and fixing the problem!

We are doing some more testing before merging it. Please also let us know about your experience on YARN (we are not running on YARN).
