Training SOTA BNN without batchnorm layers, competitively with the current implementation #334
-
PS Sorry, one of the titles should read "Training Score MINUS Validation Score" — I realize that just using the dash is ambiguous. All "scores" are top-1 accuracies. For some reason, I couldn't get TensorFlow to correctly compute the top-5 accuracy.
-
PPS Whoops, forgot to add the link to the Larq paper! I'm sure you have copies, but for ease of clicking: https://arxiv.org/abs/2011.09398 And I added the wrong Spectrum link; it should have been this one: https://spectrum.chat/larq/general/replacing-unnecessary-batchnorm-layers-using-gaussian-order-statistics~38145cdc-5389-424b-9668-6d55591460ee
-
Hi @atbolsh, very interesting, thanks for sharing! The BN can be a pain to deal with in BNNs, so it would be awesome if we can get rid of them.
Do I understand correctly that during inference the models are the same? I.e. the normalization operations are still there, but arrived at in a different way?
Did the instability still occur if you only lowered the lr for the binary layers while keeping the fp lr at 0.1?
The team is very busy at the moment, so I'm not sure we have the resources to take this on anytime soon. However, feel free to send me an email at [email protected] if you would like to discuss it. We would be happy to accept a PR if it improves the current QuickNet, but it looks like it's not quite there yet. You can always open a draft PR for the time being. A separate repo may also make sense if you're thinking about turning this into a paper.
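(To make the question concrete: at inference a trained batchnorm is just a fixed affine transform, so a network with fixed normalization layers can be functionally identical at inference even though it was arrived at differently. A quick numpy sketch with made-up parameter values, not the actual QuickNet constants:)

```python
import numpy as np

# A trained batchnorm at inference:
#   y = gamma * (x - mu) / sqrt(var + eps) + beta
# folds into a fixed affine transform y = a * x + b.
gamma, beta, mu, var, eps = 1.5, 0.1, 0.3, 4.0, 1e-5  # illustrative values only

a = gamma / np.sqrt(var + eps)
b = beta - a * mu

x = np.linspace(-2.0, 2.0, 5)
bn_out = gamma * (x - mu) / np.sqrt(var + eps) + beta
folded_out = a * x + b

# The two are the same operation; only how (a, b) were obtained differs.
assert np.allclose(bn_out, folded_out)
```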
-
[Continued from https://spectrum.chat/larq/general/bnn-model-with-custom-kernel-quantizer~7d306aaf-71d0-435e-8345-4b12ec0a534e]
My hobby project is finally done!
I was able to rewrite QuickNet without any batchnorms and train it on ImageNet, achieving validation accuracy only very slightly worse than the control. However, my training so far has been limited to a single NVIDIA 1070, which means small batch sizes (256) and only 45 epochs; and I think there is evidence that my implementation has higher capacity and, if trained correctly (600 epochs), might surpass the control in performance.
Note on the experiments: I replace batchnorms that adjust to the real $\mu$ and $\sigma$ with fixed normalization layers that use a priori calculations of what $\mu$ and $\sigma$ should be. I derived how to renormalize after MaxPools, ReLUs, and ordinary layers (binarized or fully-connected), but I couldn't quickly find the correct normalization for MaxBlurPool, which is what QuickNet uses. Therefore, aside from a straightforward control (untrained networks from larq SOTA), I also implemented a version of the control where the MaxBlurPool is replaced with a more conventional MaxPool.
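For concreteness, the constants such fixed normalization layers need come from classical Gaussian order statistics. Here is a quick Monte Carlo sanity check of the two simplest cases, assuming i.i.d. standard-normal pre-activations (this is an illustrative sketch, not my actual layer code):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.standard_normal(n)

# Post-ReLU moments of a standard normal input (known closed forms):
#   E[ReLU(X)] = 1/sqrt(2*pi),  Var[ReLU(X)] = 1/2 - 1/(2*pi)
relu = np.maximum(x, 0.0)
mu_relu = 1.0 / np.sqrt(2.0 * np.pi)
var_relu = 0.5 - 1.0 / (2.0 * np.pi)

# Max of two independent standard normals (a 2-wide "max pool"):
#   E[max(X, Y)] = 1/sqrt(pi),  Var[max(X, Y)] = 1 - 1/pi
y = rng.standard_normal(n)
m = np.maximum(x, y)
mu_max = 1.0 / np.sqrt(np.pi)
var_max = 1.0 - 1.0 / np.pi

# A fixed normalization layer applies (z - mu) / sigma with these
# a priori constants instead of learned batch statistics:
z = (relu - mu_relu) / np.sqrt(var_relu)  # approximately zero-mean, unit-variance
```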
This version of a control network performs almost exactly the same as my version of the network.
I think I can probably sit down for a week and derive the correct renormalization for MaxBlurPool, which would allow the new network to rival the original version of QuickNet.
The Results:
In general, I think my network will do better with more training epochs: it overfits less, as you can see below, and the training curve seems to have a lot of room left to improve. I suspect that, after a full 600 epochs, the training accuracy will be approximately equal to the control's, while the validation accuracy might be significantly higher.
There is additional evidence that the new, batchnorm-less version of the network has higher capacity than the current implementation: specifically, the fact that it required a lower learning rate (see below).
Training details:
I used three different versions of QuickNetSmall: the one in larq.sota; one with a MaxPool instead of a MaxBlurPool; and one without batchnorms.
Following [larq paper], I used two different learning rates, one for the full-precision weights, and one for the binary weights. I used a 5-epoch burn-in to reach the maximum learning rate, followed by a cosine decay of the learning rate to 0. The main difference from the training described in [larq paper] is that I only had 45 epochs and a batch size of 256, due to hardware and time constraints.
Both versions of the control used a learning rate of 0.1 for the full-precision layers and 0.01 for the binary layers. The new network used 0.01 for the full-precision layers and 0.001 for the binary layers. I found this reduction necessary while debugging on the smaller ImageNette dataset: with the control's learning rates, the loss spikes at the start and training becomes unstable (sometimes the weights become NaN). It's worth noting that, watching the loss batch by batch during the first epoch, it goes up before coming down in all cases (control and new network alike), but the size of the spike varies; I think this is because of the role the regularizers play.
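The schedule described above can be sketched as follows. The warmup shape is an assumption (the post only says "5-epoch burn-in"; I assume linear here), and `lr_at_epoch` is a hypothetical helper name, not the actual training code:

```python
import math

def lr_at_epoch(epoch, max_lr, warmup_epochs=5, total_epochs=45):
    """Linear warmup to max_lr over warmup_epochs, then cosine decay to 0."""
    if epoch < warmup_epochs:
        # Assumed linear burn-in: reach max_lr by the end of epoch 5.
        return max_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))

# Two schedules for the batchnorm-free network:
# 0.01 for full-precision layers, 0.001 for binary layers.
fp_lr = [lr_at_epoch(e, 0.01) for e in range(45)]
bin_lr = [lr_at_epoch(e, 0.001) for e in range(45)]
```

In practice the two schedules would feed two optimizers (or optimizer groups), one per weight type, as in the dual-learning-rate setup from the Larq paper.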
However, much like dropout, a trainable batchnorm layer decreases the overall capacity of the network; therefore, I think the new batchnorm-free version will perform better once the network is actually trained close to capacity, instead of for this shorter training period.
Next Steps:
This result, in conjunction with the results from the autoencoder, strongly suggests that training without batchnorm is worth pursuing and may lead to better binary neural networks.
This needs to run on better hardware, for 600 epochs. In addition, I need to figure out how to renormalize a MaxBlurPool layer.
I really think there is a paper here, not to mention a potential update to larq.sota. I can share all the code; it's very readable and short. However, I think the final version that goes into larq would need to be redone in C instead of Python.
What do you think? What should be done?
What's the best way to share code? A small, separate repo, or a pull request, maybe?