Oracle containers seem to restart periodically #17
Comments
Need to understand why the memory is hitting that limit - could possibly be a memory leak.
Yeah it seems like a leak, it starts around 50MB and continues to grow up to the 250MB limit in ~16 hours. I spoke to Rob and he has the same issues on his nodes.
Same here. On the fuseapp I am getting this error:
And on other containers I get this:
Disabling THP seems to fix the restarting issues. Tested by myself and Rob for ~48 hours combined. The memory usage increases and gets freed close to the limit without causing the containers to restart. I have added some changes to the compose file and quickstart (qs) to fix these issues and put in PRs here: Probably requires some more time to test, but initial signs look promising! :)
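The PRs themselves aren't linked above; as a rough sketch of what disabling THP on a host typically involves (not necessarily the exact change in those PRs), something like:

```sh
# Disable Transparent Huge Pages until the next reboot (requires root).
# Illustrative only - the actual fix referenced above lives in the compose
# file / quickstart changes, not necessarily in these exact commands.
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

# Verify the current setting; the active value is shown in brackets.
cat /sys/kernel/mm/transparent_hugepage/enabled
```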
Ok, looked at this further today/last night and added some more log traces to the containers to figure out what exactly is going on. The cause of the issue is that watcher.js tries to add data to the amqp queue, but the container has run out of memory. Erlang then reshuffles the amqp container (either by dropping slots or writing to disk) to free up space; this slows the container down and trips the watchdog check. The reason for this is that the oracle containers are only given 10% CPU access - on my machine upping this to 15% solved the watchdog triggering issue, so I have upped it to 25% in git for safety.

The memory increase is not a leak, it's just rabbit (amqp) filling its buffers, queues and databases, which is expected behavior. GC gets triggered when OOMed (or close to it), which clears some memory.

Tested on a kvm with 3 vCPU (Xeon E3-1231v3) with a 15% cap on the oracle container: I will continue to monitor, but previously the containers on my machine were consistently dying after 8-16 hours.

Obviously this 25% may need to be adjusted depending on validators' system specs, so it might be worth mentioning it somewhere in the docs and reviewing at a later date :)
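For reference, a minimal illustration of what raising the CPU allowance from 10% to 25% of a core looks like via the docker CLI (the real change sits in the compose file, and the container name below is a placeholder, not from the repo):

```sh
# Placeholder container name; the actual fix belongs in the compose file.
docker update --cpus 0.25 oracle_initiate-change_1

# Equivalent using the older period/quota flags (100ms period, 25% quota):
docker update --cpu-period 100000 --cpu-quota 25000 oracle_initiate-change_1
```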
@LiorRabin I just noticed that the CPU limits for docker are relative per core, i.e. if two cores are present the max is 2.0 (it doesn't scale to 1.0). This means that any more than one core is currently completely wasted! I can modify the quick start to scale the number depending on how many CPU cores are present, if needed?
@Andrew-Pohl Go for it :)
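A rough sketch of the per-core scaling idea discussed above, assuming the quickstart derives the limit from the core count (the 25% share, variable names and container name are illustrative):

```sh
#!/usr/bin/env bash
# Give each oracle container a fixed share of the whole machine rather than
# of a single core, since docker's --cpus value is expressed in cores.
CPU_SHARE_PER_CORE=0.25
CORES=$(nproc)
CPU_LIMIT=$(awk -v s="$CPU_SHARE_PER_CORE" -v c="$CORES" 'BEGIN { printf "%.2f", s * c }')

echo "Setting oracle CPU limit to ${CPU_LIMIT} (machine has ${CORES} cores)"
docker update --cpus "$CPU_LIMIT" oracle_initiate-change_1
```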
I have noticed on my nodes that the oracle containers are periodically restarting themselves due to errors. I have attached an example log from the initiate-change container. It has been observed on the following containers:
• initiate-change
• collected-signatures
• signature-request
• rewarded-on-cycle
• affirmation-request
I did notice that I am hitting the 250MB limit on some of the oracle containers which may be resulting in the issues.
log.txt
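Not part of the original report, but one quick way to check whether these containers are actually sitting at the 250MB limit (the grep pattern assumes the container names match the list above):

```sh
# Sample current memory usage for the oracle containers listed above.
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}" \
  | grep -E "initiate-change|collected-signatures|signature-request|rewarded-on-cycle|affirmation-request"
```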