Model global weights are suddenly zero within local training code #3132
Replies: 4 comments 2 replies
-
Update:
-
@virginiafdez Thanks for your interest and question! Could you switch to NVFlare 2.5 or the main branch and try again?
-
Hello! I've managed to continue debugging the application. The weights are actually not zero, but they are still causing my training to produce NaNs, which does not happen when SVT is not used. I've narrowed it down to the impact the privacy filters have on the normalisation layers (the NaNs only appear when eval() is used). As for switching to NVFlare 2.5: I am unable to, as the production system we are using is on 2.2.5 and cannot currently be updated.
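The eval()-only failure described above can be reproduced with the normalisation formula alone. A minimal pure-Python sketch (the values and `eps` are illustrative, not taken from the actual model): in eval() mode the stored running statistics are used directly, so if a filter perturbs the running variance below zero, the square root is undefined and frameworks return NaN, while train() mode recomputes batch statistics and is unaffected.

```python
import math

def batchnorm_eval(x, running_mean, running_var, eps=1e-5):
    """Eval-mode batch norm for a single value: uses the stored running
    statistics instead of batch statistics."""
    denom_sq = running_var + eps
    if denom_sq < 0:
        # sqrt of a negative variance is undefined -> NaN in frameworks
        return float("nan")
    return (x - running_mean) / math.sqrt(denom_sq)

# With clean statistics the output is well behaved:
clean = batchnorm_eval(1.0, running_mean=0.0, running_var=1.0)

# If a privacy filter pushes running_var below zero, eval() yields NaN:
noisy = batchnorm_eval(1.0, running_mean=0.0, running_var=-0.3)
print(clean, noisy)
```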
-
More updates on this issue, and a summary of what happens in each scenario. I've been varying the following:
Regardless of whether SVT is applied, with all the weights (the ones coming out of SVT) loaded, the training results are fine. If I evaluate in train() mode, the results are fine as well. The problem only appears when I use SVT and validate in eval() mode, as I should: that is when the NaNs show up.
Since the running means and variances are not learnable, does it make sense for them to go through the filters at all? I assume I could write a customised filter excluding specific keys and try it out. I was also curious about the minimal-impact parameters you can set for SVT to avoid modifying the weights too much (I know the point of SVT is precisely to modify the weights, but this is useful for testing when the filter has to be present and only its parameters can be tweaked). Beyond that, the issue is clarified, as this seems to be the root cause of the problem, unless anything on your side points in a different direction.
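As a starting point for the customised filter mentioned above, here is a minimal sketch. This is not the actual NVFlare filter API; `perturb` stands in for whatever SVT-style transformation the real filter applies, and the excluded key suffixes assume PyTorch-style state-dict naming.

```python
# Suffixes of keys that hold non-learnable normalisation statistics in
# PyTorch-style state dicts; these are tracked rather than trained, so
# perturbing them can break eval() mode.
EXCLUDED_SUFFIXES = ("running_mean", "running_var", "num_batches_tracked")

def filter_weights(weights, perturb):
    """Apply `perturb` to every entry except normalisation statistics.

    `weights` maps parameter names to values; `perturb` is whatever
    privacy transformation (e.g. SVT) should be applied to the rest.
    """
    return {
        name: value if name.endswith(EXCLUDED_SUFFIXES) else perturb(value)
        for name, value in weights.items()
    }

weights = {
    "conv1.weight": [0.5, -0.2],
    "bn1.running_mean": [0.1],
    "bn1.running_var": [1.0],
}
# Toy perturbation standing in for SVT: shift every value slightly.
filtered = filter_weights(weights, perturb=lambda v: [x + 0.01 for x in v])
print(filtered)
```

In NVFlare this logic would live inside a result filter configured for the job, wrapping the existing SVT transformation so the running statistics pass through untouched.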
-
Python version (python3 -V): 3.8
NVFlare version (python3 -m pip list | grep "nvflare"): 2.2.5
NVFlare branch (if running examples, please use the branch that corresponds to the NVFlare version, git branch): No response
Operating system: Ubuntu 22.04
Have you successfully run any of the following examples?
Please describe your question
I am relatively new to NVFlare, and I am running training code that I need to tidy up. While running it on the NVFlare simulator, I am seeing that the weights initially passed from the server to each site are set to zero.
The model pre-loads weights when it is created. In the scatter-and-gather workflow, I print said weights by accessing the ._global_weights attribute of the fl_ctx object, and they are fine (not zero).
But then, right after, the broadcast_and_wait function is called to launch the training task. If I go to the execute method of that task, which I imagine is the first thing that gets executed, and print the fl_ctx weights, they are zero.
Is there anything I should be aware of that might be causing this? Any function that is potentially zeroing or losing the weights that are correctly loaded in the server and passed on to the task?
Thanks a lot for any input!!
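One way to narrow down where the weights are lost is to log a small per-layer summary at each hand-off point (server side before broadcast, and inside the task's execute method) and compare. A minimal, framework-agnostic sketch (the helper name is hypothetical):

```python
def summarize_weights(weights):
    """Return a per-layer summary to spot all-zero or degenerate tensors."""
    summary = {}
    for name, values in weights.items():
        flat = list(values)
        summary[name] = {
            "n": len(flat),
            "min": min(flat),
            "max": max(flat),
            "all_zero": all(v == 0 for v in flat),
        }
    return summary

# Comparing the same dict on the server and in the executor quickly
# shows whether the zeros appear before or after transmission.
server_side = {"fc.weight": [0.3, -0.1, 0.7]}
client_side = {"fc.weight": [0.0, 0.0, 0.0]}
print(summarize_weights(server_side))
print(summarize_weights(client_side))
```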