Oracle containers seem to restart periodically #17
Comments
Need to understand why the memory is hitting that limit - could possibly be a memory leak.
Yeah it seems like a leak, it starts around 50MB and continues to grow up to the 250MB limit in ~16 hours. I spoke to Rob and he has the same issues on his nodes.
Same here. On the fuseapp I am getting this error:
And on other containers I get this:
Disabling THP seems to fix the restarting issues. Tested by myself and Rob for ~48 hours combined. The memory usage increases and gets freed close to the limit without causing the containers to restart. I have added some changes to the compose file and quickstart (qs) to fix these issues and put in PRs here: Probably requires some more time to test, but initial signs look promising! :)
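The PRs themselves aren't linked above; as a rough sketch of what disabling THP on a host typically involves (not necessarily the exact change in those PRs), something like:

```sh
# Disable Transparent Huge Pages until the next reboot (requires root).
# Illustrative only - the actual fix referenced above lives in the compose
# file / quickstart changes, not necessarily in these exact commands.
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

# Verify the current setting; the active value is shown in brackets.
cat /sys/kernel/mm/transparent_hugepage/enabled
```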
Ok, looked at this further today/last night and added some more log traces to the containers to figure out what exactly is going on. The cause of the issue is that watcher.js tries to add data to the amqp queue, but the container has run out of memory. Erlang then reshuffles the amqp container (either by dropping slots or writing to disk) to free up space; this slows the container down and trips the watchdog check. The reason for this is that the oracle containers are only given 10% CPU access - on my machine upping this to 15% solved the watchdog triggering issue, so I have upped it to 25% in git for safety.

The memory increase is not a leak, it's just rabbit (amqp) filling its buffers, queues and databases, which is expected behavior. GC gets triggered when OOMed (or close to it), which clears some memory.

Tested on a kvm with 3 vCPU (Xeon E3-1231v3) with a 15% cap on the oracle container: I will continue to monitor, but previously the containers on my machine were consistently dying after 8-16 hours.

Obviously this 25% may need to be adjusted depending on validators' system specs, so it might be worth mentioning it somewhere in the docs and reviewing at a later date :)
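For reference, a minimal illustration of what raising the CPU allowance from 10% to 25% of a core looks like via the docker CLI (the real change sits in the compose file, and the container name below is a placeholder, not from the repo):

```sh
# Placeholder container name; the actual fix belongs in the compose file.
docker update --cpus 0.25 oracle_initiate-change_1

# Equivalent using the older period/quota flags (100ms period, 25% quota):
docker update --cpu-period 100000 --cpu-quota 25000 oracle_initiate-change_1
```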
@LiorRabin I just noticed that the CPU limits for docker are relative per core, i.e. if two cores are present the max is 2.0 (it doesn't scale to 1.0). This means that any more than one core is currently completely wasted! I can modify the quick start to scale the number depending on how many CPU cores are present, if needed?
@Andrew-Pohl Go for it :)
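A rough sketch of the per-core scaling idea discussed above, assuming the quickstart derives the limit from the core count (the 25% share, variable names and container name are illustrative):

```sh
#!/usr/bin/env bash
# Give each oracle container a fixed share of the whole machine rather than
# of a single core, since docker's --cpus value is expressed in cores.
CPU_SHARE_PER_CORE=0.25
CORES=$(nproc)
CPU_LIMIT=$(awk -v s="$CPU_SHARE_PER_CORE" -v c="$CORES" 'BEGIN { printf "%.2f", s * c }')

echo "Setting oracle CPU limit to ${CPU_LIMIT} (machine has ${CORES} cores)"
docker update --cpus "$CPU_LIMIT" oracle_initiate-change_1
```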
I have noticed on my nodes that the oracle containers are periodically restarting themselves due to errors. I have attached an example log from the initiate-change container. It has been observed on the following containers:
• initiate-change
• collected-signatures
• signature-request
• rewarded-on-cycle
• affirmation-request
I did notice that I am hitting the 250MB limit on some of the oracle containers which may be resulting in the issues.
log.txt
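Not part of the original report, but one quick way to check whether these containers are actually sitting at the 250MB limit (the grep pattern assumes the container names match the list above):

```sh
# Sample current memory usage for the oracle containers listed above.
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}" \
  | grep -E "initiate-change|collected-signatures|signature-request|rewarded-on-cycle|affirmation-request"
```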