best_FL_global_model.pt is selected from n-1 rounds of learning #802

parkeraddison · 2022-08-22T22:59:48Z

It looks like the best_FL_global_model.pt is only ever selected/saved in a round prior to aggregation (EventType.BEFORE_AGGREGATE), using the initial metrics (MetaKey.INITIAL_METRICS) computed before training took place at the client sites (ValidateType.BEFORE_TRAIN_VALIDATE).

This essentially means that the "best" FL global model can only be selected from the global models of the first n-1 rounds. If the global model improves in the nth round of learning, cross-site evaluation shows SRV_FL_global_model.pt actually outperforming SRV_best_FL_global_model.pt.

I'm wondering, could there be an additional event fired or method called after the rounds are complete to perform one final validation of the current FL global model across all sites and fire EventType.GLOBAL_BEST_MODEL_AVAILABLE accordingly?

holgerroth · 2022-08-23T16:38:10Z

Maintainer

Yes, your observation is correct. You could add a task to the workflow that does a final evaluation before completing the run.
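The timing issue can be illustrated with a small, framework-free sketch. It does not use the NVFlare API: `select_best_round`, the `final_validation` flag, and the scores are hypothetical and only model the assumption that the global model produced by one round is first validated at the start of the next round.

```python
from typing import List, Optional, Tuple


def select_best_round(
    val_scores: List[float], final_validation: bool = False
) -> Tuple[Optional[int], float]:
    """Return (round index, score) of the global model kept as "best".

    val_scores[r] is the validation score of the global model produced by
    round r. Toy assumption: that model is only scored at the start of
    round r + 1, so without a final pass the last model is never scored.
    """
    best_round, best_score = None, float("-inf")
    # Without a final validation, the last round's model stays invisible.
    visible = val_scores if final_validation else val_scores[:-1]
    for r, score in enumerate(visible):
        if score > best_score:
            best_round, best_score = r, score
    return best_round, best_score


scores = [0.70, 0.75, 0.80, 0.85]  # made-up scores of a still-improving run
lagging = select_best_round(scores)
fixed = select_best_round(scores, final_validation=True)
print(lagging)  # (2, 0.8): the last round's model is never considered
print(fixed)    # (3, 0.85): a final evaluation pass picks it up
```

With a model that is still converging, the lagging selection returns the second-to-last round, which matches the observation that SRV_FL_global_model.pt can outperform SRV_best_FL_global_model.pt.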

pygabc1 · 2023-03-20T02:05:56Z

Hi, does this mean that the metric (e.g., accuracy) of SRV_best_FL_global_model could be worse than the metric of SRV_FL_global_model? I have this situation now.

holgerroth · 2023-03-20T15:23:23Z

Maintainer

It could be if your model is still converging. Typically the best global model selection is useful to avoid overfitting the training data, assuming the validation performance will decrease later in training.

holgerroth Apr 26, 2023
Maintainer

No, the central flag only controls which dataset split is used. For central=True, each client uses the full training dataset. Can you try the 2.3.0 version? I think an issue was fixed in the global model selector that might have affected this behavior (#1401). You should see a "best" global model in the server workspace after completion.

pygabc1 May 1, 2023

Hi, Holger R. Roth, thank you for your reply again. As you recommended, to get global model selection I trained with "central=False". The following is the performance of best_FL_global_model.pt evaluated on the test data.

epochMax, AUROC
1, 0.886
2, 0.891
3, 0.886
5, 0.881
10, 0.873

Does this mean that the following lines in def train(self,…) do not work? (I know they update self.best_local_model_file.)

if val_auc_mean > self.best_acc:
    self.best_acc = val_auc_mean
    self.save_model(is_best=True)

Anyway, how can I resolve this issue, so that epochMax is not yet another parameter I have to tune? Would upgrading to version 2.3.0 still be a solution? (I am using version 2.2 on Linux with Python 3.8.)

holgerroth May 1, 2023
Maintainer

Thanks. Please try 2.3.0 and see if you observe the same issue.

pygabc1 May 4, 2023

Hi, Holger R. Roth, thank you for your reply again. I tried NVFlare version 2.3.0, but the result is exactly the same as above, which still shows that "epochMax is another parameter I have to tune". These results are from best_FL_global_model.pt.

epochMax, AUROC
1, 0.886
2, 0.891
3, 0.886
5, 0.881
10, 0.873

YuanTingHsieh Aug 5, 2023
Maintainer

@pygabc1

I think you are mixing up the local models with the "FL global" model.

The logic at https://github.com/NVIDIA/NVFlare/blob/2.2/examples/cifar10/pt/learners/cifar10_learner.py#L235-L239
saves the local best model epoch by epoch.

The logic at https://github.com/NVIDIA/NVFlare/blob/2.2/examples/cifar10/pt/learners/cifar10_learner.py#L304-L314
saves the local best model AFTER each client finishes a round of training.

All of the above logic applies only to the local model.

The "FL global" model is saved based on the metrics that each client returns.
These lines compute the evaluation of the global model on local data: https://github.com/NVIDIA/NVFlare/blob/2.2/examples/cifar10/pt/learners/cifar10_learner.py#L414-L419

This global best model is saved round by round.
If this round is better than the last round, then we update it.
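As a rough illustration of round-by-round global selection, here is a minimal sketch that is independent of NVFlare. `weighted_mean_metric` and `GlobalBestTracker` are hypothetical names, and both the sample-weighted averaging of client metrics and the comparison against the best value seen so far are assumptions made for the example, not a copy of the library's selector.

```python
from typing import Dict, Optional, Tuple

ClientResults = Dict[str, Tuple[float, int]]  # client -> (metric, n_val_samples)


def weighted_mean_metric(client_results: ClientResults) -> float:
    """Sample-weighted mean of the per-client validation metrics."""
    total = sum(n for _, n in client_results.values())
    if total == 0:
        raise ValueError("no validation samples reported")
    return sum(metric * n for metric, n in client_results.values()) / total


class GlobalBestTracker:
    """Remembers the round with the highest aggregated metric so far."""

    def __init__(self) -> None:
        self.best_metric = float("-inf")
        self.best_round: Optional[int] = None

    def update(self, round_num: int, client_results: ClientResults) -> bool:
        metric = weighted_mean_metric(client_results)
        if metric > self.best_metric:
            self.best_metric, self.best_round = metric, round_num
            return True  # a real workflow would save the global model here
        return False


tracker = GlobalBestTracker()
rounds = [
    {"site-1": (0.80, 100), "site-2": (0.90, 300)},  # made-up metrics
    {"site-1": (0.85, 100), "site-2": (0.85, 300)},
]
flags = [tracker.update(i, r) for i, r in enumerate(rounds)]
print(flags, tracker.best_round)
```

The point of the sketch is that the global selection only ever sees the metrics the clients report for the global model; the local best-model checkpoints never enter into it.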

If your "trainer" is configured for 20 epochs, it will run all 20 epochs because you DID NOT implement early stopping.
The server side then gets the model after 20 epochs, which performs worse than the model after 10 epochs.

So in your case, your "trainer" should return the model at epoch 10 to the server.

How to do "early stopping" is a general deep learning topic; you can find various resources online.
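For reference, a minimal patience-based early-stopping loop might look like the sketch below. It is framework-agnostic: `train_with_early_stopping`, `train_one_epoch`, `validate`, and `patience` are hypothetical names, the weights are a plain dict standing in for a model state, and the validation scores are made up.

```python
import copy
from typing import Callable, Dict, Optional, Tuple


def train_with_early_stopping(
    train_one_epoch: Callable[[int], Dict],
    validate: Callable[[Dict], float],
    max_epochs: int,
    patience: int,
) -> Tuple[Optional[Dict], int, float]:
    """Train for up to max_epochs and return the best-epoch weights.

    Stops once `patience` epochs have passed without a new best score.
    """
    best_weights, best_epoch, best_score = None, -1, float("-inf")
    for epoch in range(max_epochs):
        weights = train_one_epoch(epoch)
        score = validate(weights)
        if score > best_score:
            # Keep a copy so later epochs cannot overwrite the best state.
            best_weights = copy.deepcopy(weights)
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_weights, best_epoch, best_score


# Stand-in training and validation with made-up scores that peak at epoch 2.
fake_scores = [0.80, 0.86, 0.89, 0.88, 0.87, 0.86]
weights, epoch, score = train_with_early_stopping(
    train_one_epoch=lambda e: {"epoch": e},
    validate=lambda w: fake_scores[w["epoch"]],
    max_epochs=len(fake_scores),
    patience=2,
)
print(weights, epoch, score)  # {'epoch': 2} 2 0.89
```

In a federated setting, the client would send the returned best-epoch weights to the server instead of the weights from the last epoch it ran.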
