Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zetaclient stops connecting to external chain RPCs after receiving HTTP error codes. #3328

Open
Tracked by #2016
CharlieMc0 opened this issue Dec 30, 2024 · 8 comments
Labels
bug Something isn't working zetaclient Issues related to ZetaClient

Comments

@CharlieMc0
Copy link
Member

CharlieMc0 commented Dec 30, 2024

Describe the Bug
When zetaclientd gets an HTTP error back from the BTC RPC it stops working and never retries until zetaclientd is restarted.

To Reproduce
Would need to replicate the the 409 or other HTTP errors codes from the BTC RPC. It isn't clear if this is limited to BTC or an issue for the RPC connectivity for all networks.

Expected Behavior
Ideally all attempts to connect to RPCs for external networks should have a backoff timer and continue to retry even if they receive errors from the external RPC server.

Screenshots
If applicable, add screenshots to help explain your problem.

Screenshot 2024-12-29 at 6 07 06 PM

You can see from the logs one one validator was falling behind on BTC blocks
Screenshot 2024-12-29 at 6 07 32 PM
Screenshot 2024-12-29 at 6 11 04 PM

@CharlieMc0 CharlieMc0 added bug Something isn't working zetaclient Issues related to ZetaClient labels Dec 30, 2024
@gartnera
Copy link
Member

gartnera commented Jan 2, 2025

Both of the places were FetchUTXOs is called do not exit if error. So if this is really happening then the log messages are probably unrelated.

What we really need in cases like this is a goroutine dump via curl -o goroutine.dump http://localhost:6061/debug/pprof/goroutine?debug=2

My best guess is that one of the RPCs just hung. The bitcoin rpc client doesn't take contexts or have any configurable timeouts on calls.

err := ob.FetchUTXOs(ctx)
if err != nil {
// log debug log if the error if no wallet is loaded
// this is to prevent extensive logging in localnet when the wallet is not loaded for non-Bitcoin test
// TODO: prevent this routine from running if Bitcoin node is not enabled
// https://github.com/zeta-chain/node/issues/2790
if !strings.Contains(err.Error(), "No wallet is loaded") {
ob.logger.UTXOs.Error().Err(err).Msg("error fetching btc utxos")
} else {
ob.logger.UTXOs.Debug().Err(err).Msg("No wallet is loaded")
}
}

// refresh unspent UTXOs and continue with keysign regardless of error
err := observer.FetchUTXOs(ctx)
if err != nil {
signer.Logger().
Std.Error().
Err(err).
Msgf("SignGasWithdraw: FetchUTXOs error: nonce %d chain %d", nonce, chain.ChainId)
}

@ws4charlie
Copy link
Contributor

Seems we only kept 7 days of logs in Datadog and above log prints are not tracked any more. After searching above log prints in code, the initial error was on Bitcoin RPC GetBlockCount (error getting block height), and then it failed FetchUTXOs. When FetchUTXOs fails, zetaclient won't be able to get balance and outbound will not work properly.

@gartnera As you said, the Bitcoin RPCs are designed to NOT take a context, so they should be able to resume/reconnect by themselves. The root cause is not looking clear with above logs.

@ws4charlie
Copy link
Contributor

ws4charlie commented Jan 6, 2025

The whole process was:

  • We have a watchRPCStatus goroutine that NEVER stops for Bitcoin chain.
  • Every mininute, this goroutine either print RPC latency is OK or an error message.
  • The last heart beat of validator1-us-east-1-mainnet was on 16:20:43 as shown in below screenshot.
  • The heart beat resumed after restarting zetaclient 18:58:33.

Here are my guesses after investigation into logs.

  1. The watchRPCStatus go-routine got stuck on this call blockTime, err := rpc.CheckRPCStatus(ob.btcClient, tssAddress) for 2.5 hours.
  2. The ObserveInbound go-routine also got stuck, for similar reason, on this line err := ob.ObserveInbound(ctx)

The reason is that both go-routines either NEVER work silently, they print message on every iteration. To be specific,
watchRPCStatus prints RPC latency is OK or error;
ObserveInbound prints observeInboundBTC: block XXX has YYY txs
We see no logs like above after last heart beat.

On which lines the above two go-routines got stuck?
A go routine stops execution only if it is blocked by a lock or IO. There is no lock used in above go-routines (no crash as well) so the IO is only what we can suspect.

In our case, the RPC status check got stuck on one of the following RPC calls in function CheckRPCStatus
GetBlockCount, GetBlockHash, GetBlockHeader, ListUnspentMinMaxAddresses.

The inbound go routine got stuck on one of the two lines (not sure which one):

  • currentBlock, err := ob.btcClient.GetBlockCount() at line 76 that invokes GetBlockCount.
  • res, err := ob.GetBlockByNumberCached(int64(blockNumber)) at line 108, which invokes GetBlockHash, GetBlockHeader, GetBlockVerboseTx internally.

Restarting the program solved the RPC errors was a interesting hint. The 409 error indicates a request conflict with the current state of the target resource and was resolved by a restart?

The link to Grafana logs
image

@CharlieMc0
Copy link
Member Author

CharlieMc0 commented Jan 6, 2025

What's a good way to avoid this in the future? We need to ensure zetaclient can handle unusual responses from RPCs without going offline.

Something like a RPC heartbeat fails to work after 5 minutes restart the go-routine for that network? If we can't handle the error directly when it happens.

@ws4charlie
Copy link
Contributor

ws4charlie commented Jan 6, 2025

This feels to restarting the zetaclient reset/released a process-level resource (don't know what was it), therefor solved the RPC hung problem. The above issue was like (for instance):
zetaclient go routine calls GetBlockCount and the call just got stuck right on the line and never returned. So the program execution was unexpectedly hung (the Bitcoin RPCs were designed return, without a context). A remedy to this would be external and operational in stead of self diagnosis inside zetaclient.

We would leave this issue under monitoring for a while before taking effective actions so we can collect and identify some patterns/clues of the errors.

@gartnera
Copy link
Member

gartnera commented Jan 6, 2025

What's a good way to avoid this in the future? We need to ensure zetaclient can handle unusual responses from RPCs without going offline.

  1. Completely reimplement the bitcoin rpc such that it takes contexts and properly returns on timeout/context cancelation.
  2. Add an in process watchdog that must be bumped every N seconds. If it is not bumped the process would automatically restart.

Medium to high effort

@swift1337
Copy link
Contributor

swift1337 commented Jan 7, 2025

Completely reimplement the Bitcoin RPC such that it takes contexts and properly returns on timeout/context cancellation.

I agree, this is the only reliable way of solving this.

We need to write a wrapper around github.com/btcsuite/btcd/btcjson that uses the same DTOs but has proper HTTP & ctx support. Only used methods should be implemented, no need to code all supported RPC methods.

We can also leverage this moment to bring proper retries, logs, and Prometheus metrics.


UPD: Inspected btcjson, it shouldn't be hard to implement our own client.

btcjson.MarshalCmd(...) marshals types into {"jsonrpc":"1.0","method":"...","params":["..."],"id":1} rpc calls

@ws4charlie
Copy link
Contributor

@gartnera I agree with the short/medium term to setup watchdog to automatically restart. Let's monitor how frequently the issue would happen moving forward and find some patterns. If necessary, re-implementing (or fork and add context) Bitcoin RPC is always not a bad idea if RPC calls hang a lot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working zetaclient Issues related to ZetaClient
Projects
None yet
Development

No branches or pull requests

5 participants