Wavesplit implementation #70
Hey Manuel, here are some first answers about the model:
Yes, the two are different. The separation stack uses residual blocks with dilation reinitialised to 1 after each block, while the speaker stack is simply a stack of dilated convolutions with dilation doubling at every layer.
Yes, it's 512 everywhere. 256 should not change much.
There is no mask: Wavesplit directly predicts the output channels with a final Conv1x1 that maps 512 channels to n, where n is the number of speakers.
Yes, we have an embedding for silence, which is treated as a speaker. However, in most experiments we assume the number of speakers is known, as in the standard settings of WSJ0-2mix or 3mix.
No, we use the raw mixture.
By L2 distance, do you mean the distance between speaker vectors and embeddings? Why not train with the classification loss?
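To make the difference between the two stacks concrete, here is a minimal sketch of the dilation schedules described in the answers above. The function names and the layer and block counts are hypothetical, and the doubling pattern inside each separation block is an assumption; the thread only states that the dilation is reset to 1 after each block.

```python
def speaker_stack_dilations(num_layers):
    """Speaker stack: dilation doubles at every layer (1, 2, 4, ...)."""
    return [2 ** layer for layer in range(num_layers)]


def separation_stack_dilations(num_blocks, layers_per_block):
    """Separation stack: dilation restarts at 1 at the start of each block.

    Assumption: the dilation doubles inside a block; only the reset to 1
    after each block is stated in the thread.
    """
    return [
        2 ** layer
        for _ in range(num_blocks)
        for layer in range(layers_per_block)
    ]


# Hypothetical sizes, chosen only to keep the printed lists short.
speaker_schedule = speaker_stack_dilations(4)
separation_schedule = separation_stack_dilations(2, 3)
print(speaker_schedule)
print(separation_schedule)
```

The practical consequence is that the speaker stack's receptive field keeps growing with depth, while the separation stack revisits fine time scales at the start of every block.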
Hi, thanks for your Wavesplit implementation! I think I need to use some ideas from Wavesplit in my research. Could you share the results you have so far? (It doesn't matter if the numbers don't match the paper.) Thanks a lot!
For now, our implementation of Wavesplit doesn't work at all, sorry. Would you like to help us implement it? It would be very welcome!
Thanks for your reply. I am sorry, but I am not very familiar with separation and might not have enough time for this. But if I get some positive results with the Wavesplit idea, I will certainly try to make a pull request. Anyway, thanks for your code; I think it is the only repo I can find on GitHub that is relevant to this paper.
Hi, I took a deeper look at your implementation and found two things. (1) It seems that the sign of the speaker loss is wrong (this is a problem in the paper: the sign in equations (4) and (5) is wrong). However, changing only this part does not seem to fully solve the problem. (2) I think the input to the separation stack is different from the paper. The speaker embeddings currently fed to the separation stack have the shape [batch_size * num_srcs, spk_embed, frames], but I think the paper has some averaging (the speaker embeddings should be [batch_size * num_srcs, spk_embed]). I still haven't got any positive results, but I think this information might be useful to you.
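The shape change raised in point (2) can be sketched with plain nested lists. This assumes a simple mean over the frame axis; the function name and the toy values are hypothetical, and the exact aggregation used in the paper may differ.

```python
def average_over_frames(embeddings):
    """Collapse [batch_size * num_srcs][spk_embed][frames] to
    [batch_size * num_srcs][spk_embed] by averaging over frames.

    Assumption: a plain arithmetic mean over the time axis.
    """
    return [
        [sum(frames) / len(frames) for frames in item]
        for item in embeddings
    ]


# Toy input: 2 items (batch_size * num_srcs), 3 embedding dims, 4 frames.
per_frame = [
    [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0], [2.0, 2.0, 2.0, 2.0]],
    [[4.0, 4.0, 4.0, 4.0], [1.0, 3.0, 1.0, 3.0], [0.0, 0.0, 0.0, 8.0]],
]
averaged = average_over_frames(per_frame)
print(len(averaged), len(averaged[0]))  # one vector per item, no frame axis
```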
Yes, it is indeed useful, thanks a lot for reporting! Maybe @lienz can barge in on this?
I completely overlooked this part, and you are right, there is averaging. I have addressed some of Neil's comments and have an updated version locally. I never pushed it, however, as it needs some code refactoring etc. I'll probably give it a try this weekend.
Hi @HuangZiliAndy, what error are you referring to? There is indeed an averaging, AFTER computing the loss of the speaker stack.
Hi @lienz, I am referring to equations (4) and (5). In my understanding, we are trying to reduce d(h_t^j, E_si) and enlarge d(h_t^j, E_sk) (reduce the distance to the target speaker and enlarge the distance to the others). However, the loss as defined in (4) and (5) would actually become larger when we do so. I think the sign in (4) and (5) is wrong. Please correct me if I have made a mistake, thanks!
That's correct, that is a bad typo! We will update the paper to correct this, thanks for spotting! We indeed wrote the log probability instead of the loss (which is -log_prob). (4) and (5) are -loss_speaker instead of loss_speaker.
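The sign issue can be checked numerically. The sketch below is not the paper's exact formulation: it assumes a softmax over negative squared distances as a stand-in for equations (4) and (5), and all names and values are hypothetical. It shows that the log probability rises as an embedding moves towards its target speaker, so the quantity to minimise is its negative.

```python
import math


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def speaker_log_prob(h, speaker_embeddings, target):
    """Log softmax over negative squared distances (assumed form)."""
    logits = [-squared_distance(h, e) for e in speaker_embeddings]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_norm = max_logit + math.log(
        sum(math.exp(logit - max_logit) for logit in logits)
    )
    return logits[target] - log_norm


def speaker_loss(h, speaker_embeddings, target):
    """The loss is the NEGATIVE log probability."""
    return -speaker_log_prob(h, speaker_embeddings, target)


# Two hypothetical speaker embeddings; speaker 0 is the target.
speaker_embeddings = [[1.0, 0.0], [0.0, 1.0]]
near_target = [0.9, 0.1]
far_from_target = [0.1, 0.9]

loss_near = speaker_loss(near_target, speaker_embeddings, 0)
loss_far = speaker_loss(far_from_target, speaker_embeddings, 0)
print(loss_near, loss_far)
```

Minimising the log probability itself would push the embedding away from its target, which matches the behaviour reported above.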
The new version of the paper is now online:
Thank you very much @lienz. I have also tried using oracle embeddings (one-hot) and could not get very good results (somewhere in the -10 dB SDR loss ballpark on development). I think the updated paper version is much clearer. However, IMO, it is not clear how you compute the speaker stack SDR loss for all layers. Do you use, for every layer, the output linear layer that maps from 512 to n speakers in a shared fashion? (I changed base branches because GitHub seems to have had problems lately with not updating pull requests when I push new commits.)
Hey @popcornell, one-hot instead of embeddings should be fine but a bit worse than using the actual embeddings. Also for the SDR loss we use at each layer of the separation stack a different Conv1x1 layer that maps from 512 to n speakers. We tried sharing those parameters as in Nachmani et al. but it was better to have a different Conv1x1 for each layer. Hope that helps!
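As a rough illustration of the per-layer output heads described above, the sketch below treats a Conv1x1 as a per-frame linear map and builds one independent head per separation layer. It is dependency-free and uses toy sizes (8 channels instead of 512); the helper names, the initialisation, and the random features are all hypothetical.

```python
import random


def make_conv1x1(in_channels, out_channels, rng):
    """A 1x1 convolution is a linear map applied independently to each frame."""
    weight = [
        [rng.gauss(0.0, 0.1) for _ in range(in_channels)]
        for _ in range(out_channels)
    ]
    bias = [0.0] * out_channels
    return weight, bias


def apply_conv1x1(head, features):
    """Map features [in_channels][frames] to [out_channels][frames]."""
    weight, bias = head
    num_frames = len(features[0])
    return [
        [
            sum(w * features[c][t] for c, w in enumerate(row)) + b
            for t in range(num_frames)
        ]
        for row, b in zip(weight, bias)
    ]


rng = random.Random(0)
# Toy sizes; the thread uses 512 channels and n = number of speakers.
in_channels, num_speakers, num_layers, num_frames = 8, 2, 3, 5

# One separate head per separation layer (not shared across layers).
heads = [make_conv1x1(in_channels, num_speakers, rng) for _ in range(num_layers)]

# Stand-in for the hidden features produced by each separation layer.
layer_features = [
    [[rng.random() for _ in range(num_frames)] for _ in range(in_channels)]
    for _ in range(num_layers)
]

# Each layer gets its own n-channel estimate, which an SDR loss could score.
layer_outputs = [apply_conv1x1(h, f) for h, f in zip(heads, layer_features)]
print(len(layer_outputs), len(layer_outputs[0]), len(layer_outputs[0][0]))
```

Sharing parameters would amount to reusing a single head for every layer; the reply above reports that separate heads worked better.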
Samu (@popcornell) has been trying to replicate the results in the paper for a little while now, but couldn't get close to it.
There might be some things that we missed in the implementation or small mistakes we didn't notice.
Neil (@lienz), would you mind having a look at the code, please? That would be really great! The description of the files is in the README of the recipe.
Note: the code is not in its final format. We will of course cite the paper when we merge it; this is just a draft to ask you for a review.