Does training work in the v4 branch? #33
You haven't done anything wrong. Because the v4 model has over 200 million parameters, training is very slow. I am currently experimenting with features such as offset noise, normalization, and CFG to make training more stable. Your results look quite normal; theoretically, the convergence time of the v4 model is close to that of SD 1.5. The previous three versions used smaller noise and predicted x0, resulting in faster training, whereas v4 uses the classic approach of predicting the noise as the target.
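The difference between the two objectives mentioned above can be sketched in a toy PyTorch training step. This is an illustrative sketch only, not the repo's actual code; the function and argument names here are assumptions.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0, t, alphas_cumprod, predict="eps"):
    """Toy diffusion training step contrasting the two targets.

    predict="x0":  the model regresses the clean sample directly
                   (the faster-converging choice of v1-v3).
    predict="eps": the model regresses the added noise, the classic
                   DDPM objective said to be used in v4.
    """
    noise = torch.randn_like(x0)
    a = alphas_cumprod[t].view(-1, 1)                # cumulative alpha per timestep
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise     # forward-noised sample
    pred = model(x_t, t)
    target = noise if predict == "eps" else x0
    return F.mse_loss(pred, target)
```

With x0-prediction the target is the clean sample itself, which tends to give a stronger learning signal early in training; noise-prediction is the more standard formulation but can take longer to produce non-noise outputs, consistent with the slow convergence described here.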
This is so cool! I understand now. I'm going to retrain and see.
@lpscr Were you able to get the model to converge?
@adelacvg I checked; you updated the model arch on
Yes, the previous training process was slow to converge due to issues with the UNet. Additionally, there were semantic problems caused by a bug in the diffusion training architecture from ControlNet. The current diffusion training framework is now based on Tortoise, eliminating the semantic faults. Furthermore, the architecture uses transformer blocks without up/down-sampling, leading to much faster convergence.
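A backbone of transformer blocks at a single resolution, as described above, can be sketched as follows. The class name, sizes, and timestep handling are illustrative assumptions, not the repo's actual architecture.

```python
import torch
from torch import nn

class FlatTransformerDiffusion(nn.Module):
    """Illustrative stack of transformer blocks at one resolution
    (no up/down-sampling), standing in for a U-Net-style backbone."""

    def __init__(self, dim=256, depth=4, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.time_emb = nn.Embedding(1000, dim)  # diffusion timestep embedding

    def forward(self, x, t):
        # x: (batch, seq_len, dim); condition on the timestep by adding
        # its embedding to every position, then run the block stack
        return self.blocks(x + self.time_emb(t).unsqueeze(1))
```

Keeping the sequence at one resolution avoids the resampling layers of a U-Net, which is one plausible reason for the faster convergence mentioned.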
Thanks :)
Regarding ContentVec, I chose it primarily to prevent timbre leakage. HuBERT and Whisper have noticeable timbre-leakage issues when trained with self-supervision. I have trained a model, and although there is some loss in audio quality in zero-shot scenarios, it performs better than the previous model at the same data scale.
Hi @adelacvg, is it possible to also transfer prosody and style from the NS2VC architecture, not just voice?
Certainly, but I believe that prosody and speed are better handled by a GPT or an acoustic model. The diffusion part, working as a good decoder, should suffice.
Just one more question: do semantic tokens like HuBERT, wav2vec, and ContentVec carry prosody information?
Of course. Prosody encompasses fundamental frequency, pause duration, intonation, and other essential information, and semantic tokens inherently carry duration and intonation information.
Yes, I have the same intuition because pronunciation is an integral part of linguistics. |
Hi @adelacvg, have you checked YODAS (https://huggingface.co/datasets/espnet/yodas), a 370k-hour dataset? Although the data quality is poor (some samples contain music or are empty), it is still useful data for VC pretraining.
@rishikksh20 Thank you very much for your suggestion. However, I'm currently short on GPU resources, and all GPUs are being used for experiments with the AR TTS model based on GPT. The pre-trained model may be trained when there are available GPUs. |
@adelacvg Everyone is GPU-poor; I am also waiting for my GPUs to free up. By the way, how is TTTS training progressing? Do you have any samples to share?
@rishikksh20 The model in the master branch of TTTS is based on Tortoise, and the results are comparable to Tortoise. I have provided a Colab link for testing the pre-trained model. For the v2 version, I would like to use a training method similar to VALL-E's, while still using diffusion as the decoder, in the hope of achieving better zero-shot results.
For v4 I am planning to train on Encodec features for better speaker generalization as commented here #16 (comment) . |
Hi! Thank you very much for your work and this amazing repo.
I tried training the v4 branch and something is going very wrong: after about 3 hours of training nothing changes, and I get only noise at every step. I ran:
1. python preprocess.py
2. python model1.py
29000 steps, v4 branch
In v3 or the main branch, after some steps I get this:
5000 steps, v3 or main branch
As you can see, in v4 I get only noise. Am I doing something wrong?
Can you please tell me whether training works in the v4 branch, or what I am doing wrong?
Thank you for your time.