Graph remembers Gateway IP address for listing OCM shares #10846

Open
wkloucek opened this issue Jan 9, 2025 · 11 comments · May be fixed by #10916
Comments

@wkloucek
Contributor

wkloucek commented Jan 9, 2025

Describe the bug

I'm testing OCM sharing with owncloud/ocis-charts#840 and cs3org/reva#5033

Steps to reproduce

  1. start the ocm-install deployment example in minikube
  2. log in to each of the two oCIS installations with a different user
  3. establish an OCM trust relationship between those two users
  4. share a file via OCM to user x on instance B
  5. as user x, list the OCM shares and see that you received a file share
  6. run kubectl rollout restart deploy -n $ns gateway, where $ns is the namespace of instance B / user x
  7. wait until the previous gateway pod is gone and only the new gateway pod is there
  8. list shares again

Expected behavior

Everything works as before

Actual behavior

We don't see the OCM share anymore.

Graph service complains:

graph-587d848cbf-2shrr graph {"level":"debug","service":"graph","error":"generalException: stat:rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 10.244.7.190:9142: connect: connection reset by peer\"","shareid":"b7e559ef-54dc-4184-9284-b4c6493c6d97","remoteshareid":"815ea2cb-95de-434c-8eaf-d682a420e607","time":"2025-01-09T15:06:50Z","line":"github.com/owncloud/ocis/v2/services/graph/pkg/service/v0/utils.go:565","message":"could not stat received ocm share, skipping"}

The gateway IP being used is that of the old gateway pod:

k get pods -n $ns -o wide
NAME                                 READY   STATUS        RESTARTS        AGE     IP             NODE       NOMINATED NODE   READINESS GATES
...
gateway-66665b8659-85f4s             1/1     Running       0               14s     10.244.7.202   minikube   <none>           <none>
gateway-84d55df46d-pdn25             1/1     Terminating   1 (6m33s ago)   6m36s   10.244.7.190   minikube   <none>           <none>

Setup

see issue description

Additional context

Restarting the graph service doesn't help here.

@kobergj kobergj moved this from Qualification to Prio 1 in Infinite Scale Team Board Jan 9, 2025
@kobergj
Collaborator

kobergj commented Jan 9, 2025

Prio 1 to find out if this is a release blocker. Can be deprioritized if not.

@dj4oC
Contributor

dj4oC commented Jan 9, 2025

It is working until the pods are restarted, right?
In that case, it would not block an RC1, but it would block a final release, right?

@wkloucek
Contributor Author

It is working until the pods are restarted, right?

We can never guarantee that pods will NOT be restarted, so it might break sooner than you'd wish.

@wkloucek
Contributor Author

A similar but different message shows up during OCM sharing:

ocm-64f9c566d8-vdxp7 ocm {"level":"error","service":"ocm","pkg":"rgrpc","traceid":"53cb70a07c24fe9cab98c0ffc1aa22dc","error":"rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 10.96.5.119:9142: i/o timeout\"","status":{"code":15,"message":"error listing spaces","trace":"53cb70a07c24fe9cab98c0ffc1aa22dc"},"filters":[{"type":4,"Term":{"SpaceType":"+grant"}},{"type":6,"Term":{"User":{"idp":"https://xxxx","opaque_id":"xxx"}}}],"time":"2025-01-13T19:51:53Z","line":"github.com/cs3org/reva/[email protected]/internal/grpc/services/storageprovider/storageprovider.go:580","message":"failed to list storage spaces"}

@kobergj
Collaborator

kobergj commented Jan 13, 2025

It is probably not related to OCM but to the service registry issues. It should be reproducible without OCM.

@kobergj kobergj self-assigned this Jan 22, 2025
@kobergj
Collaborator

kobergj commented Jan 22, 2025

Insight: Restarting the ocm service fixes the problem

@wkloucek
Contributor Author

Insight: Restarting the ocm service fixes the problem

it fixes the error in the graph service? 😆

@kobergj
Collaborator

kobergj commented Jan 22, 2025

Yes. The connection issue with the gateway is coming from the ocm service. It talked to the gateway using a gateway client that was initialized only at startup, so it never asked NATS for recent registry changes. The fix is in the PR above.
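
For illustration, here is a minimal Go sketch of that difference. The type and function names are hypothetical, not the actual reva/oCIS code (the real change is in the PR linked above): the broken pattern resolves the gateway address once at startup and keeps dialing it, while the fixed pattern asks the registry for the current address on every call.

```go
package main

import "fmt"

// registry is a stand-in for the NATS-backed service registry (hypothetical).
type registry interface {
	Lookup(service string) (string, error)
}

// staleGateway resolves the gateway address once, at construction time.
// After a gateway rollout the cached address points at the terminated pod,
// producing errors like "connection reset by peer".
type staleGateway struct{ addr string }

func newStaleGateway(r registry) (*staleGateway, error) {
	addr, err := r.Lookup("gateway")
	if err != nil {
		return nil, err
	}
	return &staleGateway{addr: addr}, nil // addr is frozen from here on
}

func (g *staleGateway) Stat(ref string) error { return dial(g.addr, ref) }

// freshGateway asks the registry for the current address on every call,
// so it follows the new pod after a kubectl rollout restart.
type freshGateway struct{ reg registry }

func (g *freshGateway) Stat(ref string) error {
	addr, err := g.reg.Lookup("gateway")
	if err != nil {
		return err
	}
	return dial(addr, ref)
}

func dial(addr, ref string) error {
	fmt.Printf("stat %s via gateway at %s\n", ref, addr)
	return nil
}

// fakeRegistry exists only to make the sketch runnable.
type fakeRegistry map[string]string

func (f fakeRegistry) Lookup(s string) (string, error) { return f[s], nil }

func main() {
	reg := fakeRegistry{"gateway": "10.244.7.202:9142"}
	fresh := freshGateway{reg: reg}
	_ = fresh.Stat("ocm share b7e559ef")
}
```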

@kobergj kobergj assigned 2403905 and unassigned kobergj Jan 22, 2025
@wkloucek
Contributor Author

Yes. The connection issue with gateway is coming from ocm service.

Ah, the graph service just "repeats" the error message of OCM in this case?

@kobergj
Collaborator

kobergj commented Jan 22, 2025

Ah, the graph service just "repeats" the error message of OCM in this case?

Yes, exactly. We thought it was the graph:gateway connection, but in fact it is ocm:gateway that throws this error. Therefore we should be able to find the same error first in the ocm logs; it is then repeated by the gateway(?) and graph.
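
For context, a minimal, purely illustrative Go sketch (not the actual graph code) of what the log line from utils.go:565 suggests: graph iterates over the received OCM shares, stats each one, and skips any share whose stat fails, which is why the share silently disappears from the listing instead of surfacing an error.

```go
package main

import (
	"errors"
	"fmt"
)

// receivedShare and statFunc are illustrative stand-ins, not the graph service's real types.
type receivedShare struct {
	ID       string
	RemoteID string
}

type statFunc func(share receivedShare) error

// listOCMShares mirrors the behaviour visible in the log: shares that cannot be
// stat'ed (e.g. because the gateway connection is broken further down the chain)
// are skipped, so the listing simply comes back shorter instead of failing.
func listOCMShares(shares []receivedShare, stat statFunc) []receivedShare {
	var listed []receivedShare
	for _, s := range shares {
		if err := stat(s); err != nil {
			fmt.Printf("could not stat received ocm share %s, skipping: %v\n", s.ID, err)
			continue
		}
		listed = append(listed, s)
	}
	return listed
}

func main() {
	shares := []receivedShare{{ID: "b7e559ef", RemoteID: "815ea2cb"}}
	// Simulate the broken ocm:gateway connection: every stat fails, so the listing comes back empty.
	down := func(receivedShare) error { return errors.New("connection reset by peer") }
	fmt.Println("listed shares:", listOCMShares(shares, down))
}
```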

@kobergj
Collaborator

kobergj commented Jan 22, 2025

Fun Fact: The SQL invite manager has the same flaw. But we don't care because we don't support it.

@unbekanntes-pferd unbekanntes-pferd moved this from Prio 1 to In progress in Infinite Scale Team Board Jan 23, 2025