Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Oracle containers seem to restart periodically #17

Open
Andrew-Pohl opened this issue Apr 21, 2020 · 7 comments
Open

Oracle containers seem to restart periodically #17

Andrew-Pohl opened this issue Apr 21, 2020 · 7 comments

Comments

@Andrew-Pohl
Copy link
Member

Andrew-Pohl commented Apr 21, 2020

I have noticed on my nodes that the oracle containers are periodically restarting themselves due to errors. I have attached an example log from the initiate-change container. It has been observed on the following containers:

• initiate-change
• collected-signatures
• signature-request
• rewarded-on-cycle
• affirmation-request

image

I did notice that I am hitting the 250MB limit on some of the oracle containers which may be resulting in the issues.

image

log.txt

@LiorRabin
Copy link
Contributor

Need to understand why the memory is hitting that limit - could possibly be a memory leak.

@Andrew-Pohl
Copy link
Member Author

Need to understand why the memory is hitting that limit - could possibly be a memory leak.

Yeah it seems like a leak, it starts around 50MB and continues to grow up to the 250 limit in ~16hours. I spoke to Rob and he has the same issues on his nodes

@CrackerHax
Copy link

Same here. On the fuseapp I am getting this error:

[2020-05-15 23:40:44.223 +0000] INFO  (1 on 2a7aac20d111): emitRewardedOnCycle
[2020-05-15 23:42:54.175 +0000] ERROR (1 on 2a7aac20d111): Invalid JSON RPC response: ""
    Error: Invalid JSON RPC response: ""
        at Object.InvalidResponse (/node_modules/truffle-hdwallet-provider/dist/index.js:15:637106)
        at t.i.onreadystatechange (/node_modules/truffle-hdwallet-provider/dist/index.js:15:748901)
        at t.e.dispatchEvent (/node_modules/truffle-hdwallet-provider/dist/index.js:1:142373)
        at t._setReadyState (/node_modules/truffle-hdwallet-provider/dist/index.js:15:753708)
        at t._onHttpRequestError (/node_modules/truffle-hdwallet-provider/dist/index.js:15:757376)
        at ClientRequest.<anonymous> (/node_modules/truffle-hdwallet-provider/dist/index.js:15:754873)
        at ClientRequest.emit (events.js:198:13)
        at TLSSocket.socketErrorListener (_http_client.js:392:9)
        at TLSSocket.emit (events.js:198:13)
        at emitErrorNT (internal/streams/destroy.js:91:8)
        at emitErrorAndCloseNT (internal/streams/destroy.js:59:3)
        at process._tickCallback (internal/process/next_tick.js:63:19)

And on other containers I get this:

npm ERR! code ELIFECYCLE
npm ERR! errno 11
npm ERR! [email protected] watcher:collected-signatures: ./scripts/start-worker.sh watcher collected-signatures-watcher
npm ERR! Exit status 11
npm ERR!
npm ERR! Failed at the [email protected] watcher:collected-signatures script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.

npm ERR! A complete log of this run can be found in:
npm ERR!     /root/.npm/_logs/2020-05-15T23_50_10_352Z-debug.log
{"level":60,"time":1589585661282,"msg":"Max processing time reached","name":"watcher-native-erc-rewarded-on-cycle","v":1}
npm ERR! code ELIFECYCLE
npm ERR! errno 11
npm ERR! [email protected] watcher:rewarded-on-cycle: ./scripts/start-worker.sh watcher rewarded-on-cycle-watcher
npm ERR! Exit status 11
npm ERR!
npm ERR! Failed at the [email protected] watcher:rewarded-on-cycle script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.

npm ERR! A complete log of this run can be found in:
npm ERR!     /root/.npm/_logs/2020-05-15T23_34_21_464Z-debug.log

> [email protected] watcher:rewarded-on-cycle /bridge-oracle
> ./scripts/start-worker.sh watcher rewarded-on-cycle-watcher

@Andrew-Pohl
Copy link
Member Author

Disabling THP seems to fix the restarting issues. Tested by myself and Rob for ~48hrs combined. The memory usages increases and gets freed close to the limit without causing the containers to restarts.

Have added some changes to the compose file and qs to fix these issues and put in PRs here:
#18
fuseio/fuse-network#36

Probably requires some more time to test, but initial signs look promising! :)

@Andrew-Pohl
Copy link
Member Author

Andrew-Pohl commented May 20, 2020

Ok looked at this further today/ last night added some more log traces to the containers to figure out what exactly is going on.

The cause of the issue is that watcher.js tries to add data to the amqp queue, but the container has ran out of memory. Erlang then reshuffles the amqp container (either by dropping slots or writing to disk) to free up space, this slows the container down and trips the watchdog check. The reason for this is that the oracle containers are only given 10% cpu access, on my machine upping to 15% solved the watchdog triggering issue. So have upped to 25% in git for safety.

The memory increase is not a leak it's just rabbit (amqp) filling it's buffers, queues and databases which is expected behavior. GC gets triggered when OOMed (or close to it) which clears some memory.

Tested on a kvm with 3vcpu (Xeon E3-1231v3) with 15% cap on oracle container:
• Run time as of writing 40hrs, no container errors or restarts
• Ram usage at start 900MB after ~12hrs 1.6-1.8GBs (stays here for the duration)
• Swap usage didn't check at start but stabilizes around 1.3GB
• 15min load average ~0.44 (~15% cpu usage)

dockers

I will continue to monitor but previously the containers on my machine were consistently dying after 8-16hrs.

Obviously this 25% may need to be adjusted depending on validators system specs so might be worth mentioning it somewhere in the docs and review at a later date :)

@Andrew-Pohl
Copy link
Member Author

@LiorRabin I just noticed that the CPU limitations for docker is relative per core. IE if two cores are present the max is 2.0 (it doesn't scale to 1.0), this means that any more than one core is currently completely wasted!. I can modify the quick start to scale the number depending on how many CPU cores are present if needed?

@LiorRabin
Copy link
Contributor

@Andrew-Pohl Go for it :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants