Dangling leases problem #265

Open
TormenTeDx opened this issue Nov 14, 2024 · 7 comments
Labels
P1 repo/provider Akash provider-services repo issues

Comments

@TormenTeDx

I think my provider pod restarted yesterday, and after that I lost all my current leases. They are gone from the provider but still active on the chain. New leases work fine, but the old ones are now dangling leases on the chain.

Please look at the number of leases on the screenshot.
image

The graph shows 30 and the value at the top shows 15. 15 is the number of leases I currently have on my provider, meaning they exist as pods. 30 is the number of leases currently on the chain. So 30 - 15 = 15: there are 15 dangling leases on the chain for my provider. I checked on akashdash and it also shows 30.
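
The arithmetic above can be written as a tiny helper. This is only a sketch: `dangling_count` is a hypothetical name, and how the two counts are collected (dashboard, chain query, `kubectl`) is left to the operator.

```shell
# Hypothetical helper: dangling leases = active leases on chain
# minus leases the provider is actually serving as pods.
dangling_count() {
  on_chain=$1      # active leases reported by the chain
  on_provider=$2   # leases present on the provider
  echo $((on_chain - on_provider))
}

dangling_count 30 15   # figures from the report above, prints 15
```

Any non-zero result suggests the provider and the chain disagree and the leases should be compared one by one.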

They won't close when they run out of $. They only close when a withdrawal happens (manually or automatically) and there is no more $ on the lease. But if you set a long withdrawal period, for example 100h to get paid once a week, they stay active the whole time. The lease stays active and the $ won't be paid until the withdrawal time. The only problem is that they lock your 0.5 AKT bid deposit.

I just ran a manual withdrawal for all these leases, and a few of them closed right after.
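
For anyone wanting to do the same in bulk, here is a dry-run sketch that only prints commands. It assumes the `tx market lease withdraw` subcommand and these flags exist in your `provider-services` version (confirm with `--help` before running anything), and `print_withdraw_commands` is a hypothetical helper name.

```shell
# Reads "owner dseq gseq oseq provider" lines (tab- or space-separated) on stdin
# and prints, without executing, one withdraw command per lease.
# ASSUMPTION: subcommand and flags are not taken from this thread; verify with --help.
print_withdraw_commands() {
  while read -r owner dseq gseq oseq provider; do
    [ -n "$owner" ] || continue   # skip blank lines
    echo "provider-services tx market lease withdraw --owner $owner --dseq $dseq --gseq $gseq --oseq $oseq --provider $provider --from $provider"
  done
}
```

Printing first makes it possible to review each command. Transactions sent from the provider address can also cause an `account sequence mismatch` in the running provider, so restarting it afterwards is likely needed.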

I don't know exactly when it happened, but I know the pod restarted for some reason and after that I had no leases on my provider.

I don't know exactly how to reproduce this. I know it happened right after the pod restarted. It has happened occasionally before - maybe 2-3 dangling leases in the past 6 months - but never like this, where all the leases were just gone.

I noticed it also happens on other providers. Whenever these two numbers differ, there are definitely some dangling leases on the chain, and I'm pretty sure it happens during a restart of the provider pod.

@chainzero
Collaborator

@TormenTeDx - have you attempted to heal the dangling leases by executing the script referenced in this doc:

https://akash.network/docs/providers/provider-faq-and-guide/#dangling-deployments

While the title of this section is "Dangling Deployments", it actually cures both dangling deployments (i.e. when a lease has been closed but the deployment in K8s remains) and dangling leases (i.e. when the lease on chain is active but no deployment exists in K8s).

This is a direct link to the suggested script to cure both scenarios:

https://gist.github.com/andy108369/f211bf6c06f2a6e3635b20bdfb9f0fca

While this is not a permanent fix - it should cure the current issue and close out all stale leases for your provider on the blockchain.

Within that script, this is the section that addresses dangling leases. Run the script as is and in its entirety; this section is isolated below only to show the relevant part.

## 4) delete orphaned leases
####

# active leases without actual deployments

PROVIDER="$(kubectl -n akash-services exec -i $(kubectl -n akash-services get pods -l app=akash-provider --output jsonpath='{.items[0].metadata.name}') -- sh -c "echo \$AKASH_FROM")"

LEASEDATA="$(provider-services query market lease list --provider $PROVIDER --gseq 0 --oseq 0 --page 1 --limit 10000 --state active -o json)"
NSDATA="$(kubectl get ns -o json)"
# For each active on-chain lease, count the namespaces whose labels match its lease ID.
echo "$LEASEDATA" | jq -r '.leases[].lease.lease_id | [.owner, .dseq, .gseq, .oseq, .provider] | @tsv' | while read -r owner dseq gseq oseq provider; do
  IS_EMPTY=$(echo "$NSDATA" | jq --arg dseq "$dseq" --arg oseq "$oseq" --arg gseq "$gseq" --arg owner "$owner" --arg provider "$provider" -r '.items[] | select(.metadata.labels."akash.network/lease.id.dseq"==$dseq and .metadata.labels."akash.network/lease.id.gseq"==$gseq and .metadata.labels."akash.network/lease.id.oseq"==$oseq and .metadata.labels."akash.network/lease.id.owner"==$owner and .metadata.labels."akash.network/lease.id.provider"==$provider) | length' | wc -l);

  # Despite its name, IS_EMPTY holds the number of matching namespaces; 0 means the lease has no deployment in K8s.
  if [[ "$IS_EMPTY" -eq 0 ]]; then
    echo "=== Found orphaned lease ==="
    #echo kubectl get ns -l akash.network/lease.id.dseq=$dseq,akash.network/lease.id.gseq=$gseq,akash.network/lease.id.oseq=$oseq,akash.network/lease.id.owner=$owner,akash.network/lease.id.provider=$provider
    #echo kubectl -n lease get manifest -l akash.network/lease.id.dseq=$dseq,akash.network/lease.id.gseq=$gseq,akash.network/lease.id.oseq=$oseq,akash.network/lease.id.owner=$owner,akash.network/lease.id.provider=$provider
    ns=$(provider-services show-cluster-ns --dseq $dseq --owner $owner --provider $provider)
    echo kubectl -n $ns get all
    echo "ACTION: close this lease if you can't find it is really running on your K8s cluster:"
    echo kubectl -n akash-services exec -i $(kubectl -n akash-services get pods -l app=akash-provider --output jsonpath='{.items[0].metadata.name}') -- bash -c \"provider-services tx market bid close --owner $owner --dseq $dseq --gseq $gseq --oseq $oseq --from $provider\"
    echo "NOTE: However, executing an action from the provider address will cause the \`account sequence mismatch\` afterwards. Make sure to restart akash-provider service once done with running the \`tx market bid close\` command! Ideally, make sure akash-provider services is stopped first."
  fi
done
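
The check above comes down to a set difference between the leases active on chain and the leases that have a namespace in K8s. The sketch below shows that idea on plain text files and does not replace the script. `compare_leases` and the one-ID-per-line file format are assumptions for illustration; producing the two files from `provider-services` and `kubectl` output is left out.

```shell
# compare_leases CHAIN_FILE K8S_FILE
# Each file holds one lease ID per line, e.g. "owner/dseq/gseq/oseq/provider".
compare_leases() {
  chain_sorted=$(mktemp)
  k8s_sorted=$(mktemp)
  sort -u "$1" > "$chain_sorted"
  sort -u "$2" > "$k8s_sorted"
  # Only in the chain list: lease active on chain, no deployment in K8s.
  comm -23 "$chain_sorted" "$k8s_sorted" | sed 's/^/dangling-lease /'
  # Only in the K8s list: deployment still present, lease already closed.
  comm -13 "$chain_sorted" "$k8s_sorted" | sed 's/^/dangling-deployment /'
  rm -f "$chain_sorted" "$k8s_sorted"
}
```

Sorting both inputs first matters because `comm` expects sorted files. Reporting both directions covers the dangling-deployment and dangling-lease cases described earlier in this comment.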

@TormenTeDx
Author

Yes, it found orphaned leases. I closed all of them. My numbers look good now.

@TormenTeDx
Author

OK, here is some extra info. It 100% happens when the pod restarts.
I also noticed two things. First, sometimes while the pod is restarting, the endpoint won't respond to curl or grpcurl: the command just stalls and returns nothing. After about a minute the pod restarts automatically, and after that restart it works fine. I don't know why it happens, and it's only sometimes.
Second, I was lucky enough to check my provider while this was happening. The pod had been restarting very frequently over the previous 15-20 minutes. I checked the endpoint and it was unresponsive. The pod restarted one more time, and after around 20-30 seconds of running I ran the grpcurl command and it worked, but there were no leases. All leases had been removed from my provider, but they were still on the blockchain. So the leases are removed from the provider when the pod restarts, and sometimes after a restart the pod isn't fully working: it shows as running, but grpcurl won't work. When this happens, the pod restarts itself after a while and then the leases are gone. I don't know if it helps, but that's what I noticed.
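
A stalled endpoint like the one described can be told apart from a plain failure by wrapping the probe in a time limit. This is a minimal sketch, assuming the coreutils `timeout` command is available. `probe_endpoint` is a hypothetical helper, and the host in the usage comment is a placeholder.

```shell
# probe_endpoint SECONDS COMMAND [ARGS...]
# Prints "responsive", "stalled" (time limit hit) or "error" (command failed).
probe_endpoint() {
  limit=$1
  shift
  status=0
  timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
  case $status in
    0)   echo responsive ;;
    124) echo stalled ;;    # exit status timeout uses when the limit expires
    *)   echo error ;;
  esac
}

# Usage (placeholder host, adjust to your provider):
#   probe_endpoint 10 curl -sk https://provider.example.com:8443/status
```

Logging the result with a timestamp each time the pod restarts would help show whether the lost leases line up with the "running but unresponsive" state.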

@chainzero
Collaborator

@TormenTeDx - could you provide precise, ordered steps to re-create the issue? And please confirm the Akash provider version active on the provider.

For example:

Step 1 - launch a number of deployments using XYZ SDL (provide link to SDL)

Step 2 - restart the provider pod

Step 3 - observed behavior: the deployments are now closed but the leases remain on chain

I believe the steps above capture what should be used to reproduce the issue, but I wanted to make sure. I also want to confirm the provider version and SDL used, as I believe there was an earlier thought that this could be related to specific SDL/deployment types. I want to first focus on provider functionality and can later focus on reporting issues if need be.

@SGC41

SGC41 commented Nov 26, 2024

Pod restart and all leases lost...
I can see no rhyme or reason why it happened.

Running triple control planes and triple etcd on individual servers, version v1.28.6 like most providers, with the latest recommended charts and such.

If memory serves, in the past this issue was sometimes caused by RPC nodes or control planes being suddenly lost.
But that was a long time ago, and I'm not sure it ever really got fixed.

My provider pod didn't restart excessively.
It's possible that it restarted twice in a row, but no more than that.

Deleting the dangling leases is hardly a real fix. It can take months for a provider to accumulate customers.

If we imagine it takes 3 months to fill a provider, and then this happens and one has to start over, that would halve the earnings in that period, assuming a similar progression of customers. And that's not counting the customers who would be permanently done with the provider, or even with Akash, because of such an event.

I'm not currently aware of any obvious path to creating dangling leases, but the fact that leases can be lost so easily is a bad look for Akash and providers.

@TormenTeDx
Author

There are actually 2 ways it can happen.

  1. The pod restarts twice in a row, drops all current leases, and they all end up dangling.
  2. When you scale the provider down and back up, some of the leases go dangling.

I don't know why it happens. It only happens sometimes. One day I scaled the provider down and up just to check something, and about 80% of the leases went dangling. I tried the same thing 1h later and nothing happened.

@TormenTeDx
Author

Small update:
Recently I changed the IP on my master node, and my provider went down. I couldn't get it working again, so I had to recreate the cluster.
The leases that had been working fine before I broke the cluster were in a dangling state on the chain once I came back with the recreated cluster.


3 participants