question about concurrently stacking the activation function #23
Thanks for your attention. We use the depthwise conv as an efficient implementation of our activation function, which is the same as Eq. (6) in our paper. Each element of the output of this activation depends on multiple non-linear inputs, which can be regarded as concurrent stacking.
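For readers trying to map this onto code, here is a minimal sketch (illustrative only; `StackedActivation`, `n`, and the initialization are assumptions, not the repository's actual class) of how a depthwise convolution applied after a pointwise non-linearity realizes Eq. (6): each output element is a weighted sum of several activated neighbors, with the per-channel scale and shift supplied by the BN.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StackedActivation(nn.Module):
    """Sketch of Eq. (6): ReLU followed by a per-channel (depthwise) conv + BN,
    so each output element mixes (2n+1)^2 activated inputs from its neighborhood."""

    def __init__(self, dim, n=3):
        super().__init__()
        self.dim, self.n = dim, n
        # one (2n+1)x(2n+1) kernel per channel -> depthwise conv (groups=dim)
        self.weight = nn.Parameter(0.02 * torch.randn(dim, 1, 2 * n + 1, 2 * n + 1))
        self.bn = nn.BatchNorm2d(dim)

    def forward(self, x):
        x = F.relu(x)                                  # pointwise non-linearity A(.)
        x = F.conv2d(x, self.weight, bias=None,
                     padding=self.n, groups=self.dim)  # weighted sum of activated neighbors
        return self.bn(x)                              # per-channel scale and shift
```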
Thank you for your reply! Are the learnable parameters gamma and beta of the BN in the activation function the a and b in Eq. (6)?
Not really. In fact, the BN can be merged into the conv in the activation. Then, the weight and bias of the merged conv are a and b in Eq. (6).
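As an illustration of that merge (a generic BN-folding sketch, not the repository's own fusion routine), an eval-mode `BatchNorm2d` can be absorbed into a preceding bias-free conv by rescaling the kernel by gamma/std and building a new bias from beta and the running statistics; the fused weight and bias then play the roles of a and b in Eq. (6).

```python
import torch

def fuse_bn_into_conv(conv_weight, bn):
    """Fold an eval-mode BatchNorm2d `bn` into a preceding bias-free conv.

    BN(W * x) = (gamma / std) * W * x + (beta - gamma * mean / std),
    so the fused kernel is W rescaled per output channel and the fused bias
    is beta - gamma * mean / std.
    """
    std = (bn.running_var + bn.eps).sqrt()
    scale = bn.weight / std                              # gamma / std, per channel
    fused_weight = conv_weight * scale.reshape(-1, 1, 1, 1)
    fused_bias = bn.bias - bn.running_mean * scale       # beta - mean * gamma / std
    return fused_weight, fused_bias
```

After fusion, a single `F.conv2d(relu(x), fused_weight, fused_bias, ...)` reproduces the conv-plus-BN pair at inference time.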
Excellent idea!
Is this actually true?
I wonder the same thing. Afaict from reading the code, the sequence of layers is as below. At training time:
At inference time (the LeakyReLU is gradually removed during training, the two 1x1 convs are fused, and the BatchNorms are fused into the preceding convs):
Considering that several of these blocks are stacked after each other, this seems somewhat like an inverted MobileNet block (7x7 depthwise -> 1x1 conv -> downsample -> ReLU), but I don't understand the activation sequence. Afaict it's just one activation followed by a regular depthwise convolution; perhaps I'm missing something? There's an ablation test in the paper (Table 1) showing a performance increase from 60% to 74% when comparing plain ReLU with your "activation" function. Is this then comparing "1x1 convs -> ReLU" with "1x1 convs -> ReLU -> 3x3 depthwise (n=1)"? If so, I can imagine you'd get a big jump in performance just from comparing 1x1 convs against 1x1+3x3 convs, since with only 1x1 convs and no depthwise convs the receptive field of the network would be much smaller.
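On the fusion of the two 1x1 convs mentioned above: once the LeakyReLU between them has decayed to the identity, two consecutive 1x1 convolutions are just a matrix product over the channel dimension and can be collapsed into one. A generic sketch of that merge (assumed shapes and function name, not the repository code):

```python
import torch

def merge_1x1_convs(w1, b1, w2, b2):
    """Collapse conv2(conv1(x)) for two 1x1 convs into a single 1x1 conv.

    Shapes: w1 (mid, cin, 1, 1), b1 (mid,), w2 (cout, mid, 1, 1), b2 (cout,).
    Only valid when there is no non-linearity left between the two convs.
    """
    m1 = w1.flatten(1)                          # (mid, cin)
    m2 = w2.flatten(1)                          # (cout, mid)
    w = (m2 @ m1).unsqueeze(-1).unsqueeze(-1)   # (cout, cin, 1, 1)
    b = m2 @ b1 + b2                            # the first bias is carried through conv2
    return w, b
```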
Your understanding is correct; we currently use a depthwise conv to implement the activation function we proposed, which is latency-friendly.
Hi, thank you for the innovative work. Would you mind explaining why the fused bias (b in Eq. (6)) is initialized to zeros? Why don't we let the bias be a learnable parameter, as in a normal convolution, since they actually let it be a learnable parameter in BNET? Also, I think that for both values of the deploy variable, self.bias can be initialized as None for cleaner code, since None and a zero bias are effectively the same in a convolution. Code before changes
Code after changes
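Since the before/after snippets are not reproduced above, the following is only a hypothetical sketch of the simplification being suggested (class and argument names are assumed, not the actual repository code): initialize `self.bias` as None regardless of the `deploy` flag, because passing `bias=None` to `F.conv2d` behaves the same as an all-zero bias.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActivationSketch(nn.ReLU):
    """Hypothetical sketch of the suggested simplification, not the repo code."""

    def __init__(self, dim, act_num=3, deploy=False):
        super().__init__()
        self.dim, self.act_num, self.deploy = dim, act_num, deploy
        self.weight = nn.Parameter(
            0.02 * torch.randn(dim, 1, 2 * act_num + 1, 2 * act_num + 1))
        # None behaves like an all-zero bias in conv2d; a later BN-fusion step
        # can replace it with the fused bias (b in Eq. (6)) for deployment.
        self.bias = None
        if not deploy:
            self.bn = nn.BatchNorm2d(dim)

    def forward(self, x):
        x = super().forward(x)                  # the ReLU part
        x = F.conv2d(x, self.weight, self.bias,
                     padding=self.act_num, groups=self.dim)
        return x if self.deploy else self.bn(x)
```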
Hi, thanks for the great work.
Can I understand the class 'activation(nn.ReLU)' as a combination of ReLU -> depthwise conv -> BN?
I don't seem to see concurrently stacked activation functions in the code. Does 'concurrently' mean multi-branch activation functions?
Thank you very much!