Training SOTA BNN without batchnorm layers, competitively with the current implementation #334
-
PS Sorry, one of the titles should read "Training Score MINUS Validation Score" — I realize that just using the dash is ambiguous. All "scores" are top-1 accuracies. For some reason, I couldn't get TensorFlow to correctly compute the top-5 accuracy.
-
PPS Whoops, forgot to add the link to the Larq paper! I'm sure you have copies, but for ease of clicking: https://arxiv.org/abs/2011.09398 And I added the wrong Spectrum link; it should have been this one: https://spectrum.chat/larq/general/replacing-unnecessary-batchnorm-layers-using-gaussian-order-statistics~38145cdc-5389-424b-9668-6d55591460ee
-
Hi @atbolsh, very interesting, thanks for sharing! The BN can be a pain to deal with in BNNs, so it would be awesome if we can get rid of them.
Do I understand correctly that during inference the models are the same? I.e. the normalization operations are still there, but arrived at in a different way?
Did the instability still occur if you only lowered the lr for the binary layers while keeping the fp lr at 0.1?
The team is very busy at the moment, so I'm not sure we have the resources to take this on anytime soon. However, feel free to send me an email at [email protected] if you would like to discuss it. We would be happy to accept a PR if it improves the current QuickNet, but it looks like it's not quite there yet. You can always open a draft PR for the time being. A separate repo may also make sense if you're thinking about turning this into a paper.
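(To make the question concrete: at inference a trained batchnorm is just a fixed affine transform, so a network with fixed normalization layers can be functionally identical at inference even though it was arrived at differently. A quick numpy sketch with made-up parameter values, not the actual QuickNet constants:)

```python
import numpy as np

# A trained batchnorm at inference:
#   y = gamma * (x - mu) / sqrt(var + eps) + beta
# folds into a fixed affine transform y = a * x + b.
gamma, beta, mu, var, eps = 1.5, 0.1, 0.3, 4.0, 1e-5  # illustrative values only

a = gamma / np.sqrt(var + eps)
b = beta - a * mu

x = np.linspace(-2.0, 2.0, 5)
bn_out = gamma * (x - mu) / np.sqrt(var + eps) + beta
folded_out = a * x + b

# The two are the same operation; only how (a, b) were obtained differs.
assert np.allclose(bn_out, folded_out)
```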
-
[Continued from https://spectrum.chat/larq/general/bnn-model-with-custom-kernel-quantizer~7d306aaf-71d0-435e-8345-4b12ec0a534e]
My hobby project is finally done!
I was able to rewrite QuickNet without any batchnorms and train it on ImageNet, achieving validation accuracy only very slightly worse than the control. However, my training so far has been limited to a single NVIDIA 1070, which means small batch sizes (256) and only 45 epochs; and I think there is evidence that my implementation has higher capacity and, if trained correctly (600 epochs), might surpass the control in performance.
Note on the experiments: I replace batchnorms that adjust to the real $\mu$ and $\sigma$ with fixed normalization layers that use a priori calculations of what $\mu$ and $\sigma$ should be. I derived how to renormalize after MaxPools, ReLUs, and ordinary layers (binarized or fully-connected), but I couldn't quickly find the correct normalization for MaxBlurPool, which is what QuickNet uses. Therefore, aside from a straightforward control (untrained networks from larq SOTA), I also implemented a version of the control where the MaxBlurPool is replaced with a more conventional MaxPool.
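For concreteness, the constants such fixed normalization layers need come from classical Gaussian order statistics. Here is a quick Monte Carlo sanity check of the two simplest cases, assuming i.i.d. standard-normal pre-activations (this is an illustrative sketch, not my actual layer code):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.standard_normal(n)

# Post-ReLU moments of a standard normal input (known closed forms):
#   E[ReLU(X)] = 1/sqrt(2*pi),  Var[ReLU(X)] = 1/2 - 1/(2*pi)
relu = np.maximum(x, 0.0)
mu_relu = 1.0 / np.sqrt(2.0 * np.pi)
var_relu = 0.5 - 1.0 / (2.0 * np.pi)

# Max of two independent standard normals (a 2-wide "max pool"):
#   E[max(X, Y)] = 1/sqrt(pi),  Var[max(X, Y)] = 1 - 1/pi
y = rng.standard_normal(n)
m = np.maximum(x, y)
mu_max = 1.0 / np.sqrt(np.pi)
var_max = 1.0 - 1.0 / np.pi

# A fixed normalization layer applies (z - mu) / sigma with these
# a priori constants instead of learned batch statistics:
z = (relu - mu_relu) / np.sqrt(var_relu)  # approximately zero-mean, unit-variance
```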
This version of a control network performs almost exactly the same as my version of the network.
I think I can probably sit down for a week and derive the correct renormalization for MaxBlurPool, which would allow the new network to rival the original version of QuickNet.
The Results:
In general, I think my network will do better with more training epochs: it overfits less, as you can see below, and the training curve seems to have a lot of room left to improve. I suspect that, after a full 600 epochs, the training accuracy will be approximately equal to the control's, while the validation accuracy might be significantly higher.
There is additional evidence that the new, batchnorm-less version of the network has higher capacity than the current implementation: specifically, the fact that it required a lower learning rate (see below).
Training details:
I used three different versions of QuickNetSmall: the one in larq.sota; one with a MaxPool instead of a MaxBlurPool; and one without batchnorms.
Following [larq paper], I used two different learning rates, one for the full-precision weights, and one for the binary weights. I used a 5-epoch burn-in to reach the maximum learning rate, followed by a cosine decay of the learning rate to 0. The main difference from the training described in [larq paper] is that I only had 45 epochs and a batch size of 256, due to hardware and time constraints.
Both versions of the control used a learning rate of 0.1 for the full-precision layers and 0.01 for the binary layers. The new network used 0.01 for the full-precision layers and 0.001 for the binary layers. I found this reduction necessary while debugging on the smaller ImageNette dataset: with the control's learning rates, the loss spikes at the start and training becomes unstable (sometimes the weights become NaN). It's worth noting that, watching the loss batch by batch during the first epoch, it goes up before coming down in all cases (control and new network alike), but the size of the spike varies; I think this is because of the role the regularizers play.
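The schedule described above can be sketched as follows. The warmup shape is an assumption (the post only says "5-epoch burn-in"; I assume linear here), and `lr_at_epoch` is a hypothetical helper name, not the actual training code:

```python
import math

def lr_at_epoch(epoch, max_lr, warmup_epochs=5, total_epochs=45):
    """Linear warmup to max_lr over warmup_epochs, then cosine decay to 0."""
    if epoch < warmup_epochs:
        # Assumed linear burn-in: reach max_lr by the end of epoch 5.
        return max_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))

# Two schedules for the batchnorm-free network:
# 0.01 for full-precision layers, 0.001 for binary layers.
fp_lr = [lr_at_epoch(e, 0.01) for e in range(45)]
bin_lr = [lr_at_epoch(e, 0.001) for e in range(45)]
```

In practice the two schedules would feed two optimizers (or optimizer groups), one per weight type, as in the dual-learning-rate setup from the Larq paper.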
However, much like dropout, a trainable batchnorm layer decreases the overall capacity of the network; therefore, I think the new batchnorm-free version will perform better once the network is actually trained close to capacity, instead of for this shorter training period.
Next Steps:
This result, in conjunction with the results from the autoencoder, strongly suggests that training without batchnorm is worth pursuing and may lead to better binary neural networks.
This needs to run on better hardware, for 600 epochs. In addition, I need to figure out how to renormalize a MaxBlurPool layer.
I really think there is a paper here, not to mention a potential update to larq.sota. I can share all the code; it's very readable and short. However, I think the final version that goes into larq would need to be redone in C instead of Python.
What do you think? What should be done?
What's the best way to share code? A small, separate repo, or a pull request, maybe?