Memory leak. #69
Comments
More details: I am using tensorflow 2.15.1, running on Ubuntu 22.04. I can reproduce it on both kernel 6.5 and 6.7. |
Hi @chao-camect, thanks for reporting this issue. Could you share a small reproducer so that we can investigate? |
I think you should be able to reproduce it by training any model... It must be leaking in some common operations.
To run it, download the data from [Kaggle](https://www.kaggle.com/c/dogs-vs-cats/data).
|
More context: I run it inside docker. I installed the deps inside docker using the following script:
|
Could you also run this environment check script and upload the result here? Thanks! https://github.com/intel/intel-extension-for-tensorflow/blob/main/tools/python/env_check.py |
The tool doesn't support tensorflow 2.15.
|
I suspect this is an issue with a common Intel library. |
@Disty0 Thanks for linking to the other issue. I agree with you. For me, it leaks 3MB-4MB every second. It must be some common operation. I don't get how it could have evaded Intel's own engineers... |
Hi @chao-camect, I am running the training script on Arc770 with the docker image that we published: https://intel.github.io/intel-extension-for-tensorflow/latest/docs/install/install_for_xpu.html#get-docker-container-from-dockerhub There are 25000 training images in total, and GPU Memory Used (MiB) stays stable at 4222. You should be able to run https://github.com/intel/intel-extension-for-tensorflow/blob/main/tools/python/env_check.py now; you also need to clone https://github.com/intel/intel-extension-for-tensorflow/blob/main/tools/python/config.json. Could you try again and upload the result here? |
It's CPU memory not GPU. |
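To check host (CPU) memory rather than GPU memory, a minimal sketch along these lines could be run inside the training process. It assumes Linux, since it reads `/proc/self/statm`. The helper names `rss_mib` and `growth_mib_per_s` are made up for illustration and are not part of TensorFlow or the extension.

```python
import os
import time


def rss_mib() -> float:
    """Resident set size of the current process in MiB (Linux /proc only)."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])  # second field: resident pages
    return resident_pages * os.sysconf("SC_PAGE_SIZE") / (1024 * 1024)


def growth_mib_per_s(samples):
    """Average growth rate between the first and last (seconds, MiB) sample."""
    (t0, m0), (t1, m1) = samples[0], samples[-1]
    return (m1 - m0) / (t1 - t0)


samples = []
for _ in range(3):
    samples.append((time.monotonic(), rss_mib()))
    time.sleep(0.1)  # in a real run, sample once per training step or per second
print(f"host RSS: {samples[-1][1]:.1f} MiB, growth: {growth_mib_per_s(samples):+.2f} MiB/s")
```

A steady positive growth rate across many steps would point at host memory, which GPU memory counters do not show.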
Did you install by following the steps here: https://intel.github.io/intel-extension-for-tensorflow/latest/docs/install/experimental/install_for_arc_gpu.html |
How many iterations have you trained when facing OOM? |
@chao-camect We observed memory usage increasing on the host during training. The developer team is looking into it, and we will post here when there are any updates. Thanks! |
Thanks for the prompt response! |
Hi @chao-camect
This is the memory usage trend I tested on Arc770 using the latest build: Let us know whether you can reproduce the result or not, thanks! |
Thanks. When will the weekly build be ready? I see that the latest version is 20240329. |
It should be this week. We will let you know when it is ready. |
intel_extension_for_tensorflow_lib_weekly 2.15.0.0.2.dev20240415 |
@chao-camect Please uninstall your previous intel-extension-for-tensorflow package. The package names are different. The install command is: A couple of things to be noted:
FYI. Good luck! |
No. It's still leaking, just slower. As you can see from your own graph... |
@chao-camect Yes. We compared the training on NVIDIA; this is the training result on A100, running the script you provided. From the pattern, memory usage increases between epochs but stays stable within an epoch. A100 behaves similarly. We suppose the memory leak is related to a specific workload (or operator/kernel).
We have checked your example for 1 epoch and found a 72-byte leak from itex. This weekly build fixes it. Note that the onednn primitive cache and the kernel cache from queue.submit consume memory; this behavior looks like a memory leak but actually it is NOT. Ignore such 'leaks' on python objects and tensorflow... |
It would be nice if you could separate that part from your 'bigger program' for us to test. BTW, did you update the driver? Which driver version are you using now? |
It'll take some time before I can get a minimized version for you to test with. Do you test the extension with a set of common models regularly? I don't think there is anything special in my model.
|
The driver version is OK. Of course we check models regularly. Meanwhile, if you can get a minimized version for us, that would be ideal. Thanks! |
Hi @chao-camect, could you try the environment variable below on your side, and let us know if the memory leak still exists? Thanks! export ITEX_LAYOUT_OPT=0 |
Looks like it did the trick. |
I believe the memory leak is gone with ITEX_LAYOUT_OPT=0. |
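When the training script is launched from Python rather than from a shell, the same workaround can be sketched as below. This assumes the variable is read when TensorFlow loads the extension, which the thread does not confirm, so the sketch sets it before any import. The helper `disable_itex_layout_opt` is a hypothetical name.

```python
import os
import sys
import warnings


def disable_itex_layout_opt() -> None:
    """Set ITEX_LAYOUT_OPT=0 for this process, mirroring `export ITEX_LAYOUT_OPT=0`."""
    # Assumption: the variable is read when TensorFlow loads the extension,
    # so setting it after the import may have no effect.
    if "tensorflow" in sys.modules:
        warnings.warn("tensorflow is already imported; ITEX_LAYOUT_OPT may be ignored")
    os.environ["ITEX_LAYOUT_OPT"] = "0"


disable_itex_layout_opt()
# import tensorflow as tf  # import only after the variable is set
```

Exporting the variable in the shell or in the docker environment, as suggested above, remains the simpler route.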
also, our fix is WIP... will let you know as soon as it works... |
@chao-camect, please try our latest weekly build to see if it works for your case, thanks! pip install --upgrade intel-extension-for-tensorflow-weekly[xpu] -f https://developer.intel.com/itex-whl-weekly |
I have been training using tensorflow + keras on nvidia GPU for a while.
Recently I experimented with A770. With some effort, I finally got it working, except that there is a memory leak.
The same code works fine on nvidia 3090, it uses about 8GB memory, very stably.
With A770, it starts with 8GB and grows very quickly until killed because of OOM.
I used tracemalloc to see where the leak is. No luck. So it's not in python code.
I haven't got time to get more details of it...
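For reference, the tracemalloc check described above can be sketched as a comparison of two snapshots. The list of bytearrays is only a stand-in for a training step, and `top_growth` is a hypothetical helper. tracemalloc only tracks allocations made through Python's memory allocators, so a leak inside a native library would not show up here, which is consistent with the conclusion that the leak is not in Python code.

```python
import tracemalloc


def top_growth(before, after, limit=5):
    """Return the source lines whose allocated size changed most between snapshots."""
    return after.compare_to(before, "lineno")[:limit]


tracemalloc.start()
before = tracemalloc.take_snapshot()

# Stand-in for a training step: keep roughly 1 MB of Python objects alive.
retained = [bytearray(1024) for _ in range(1000)]

after = tracemalloc.take_snapshot()
stats = top_growth(before, after)
tracemalloc.stop()

for stat in stats:
    print(stat)  # file, line number, size difference, allocation count difference
```

If the process's resident memory keeps growing while a comparison like this shows no matching growth, the allocations are most likely happening outside the Python allocators.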