IMDSv2 Requests Causing Context Cancelled #398
Comments
What version of Kiam are you using? There was support added for IMDSv2 in 3.6-rc1 |
👋 We are currently running 3.6-rc1. Sorry, forgot to include that 🙄, it's been a hell of a week 💤 😴 EDIT: Something else I noticed is that only a fraction of these requests are actually showing a 502 back to the client; most complete with 200 and no message about "context canceled" |
So the Kiam agent just proxies any requests for /api/token to the actual instance metadata API; it doesn't forward them to the Kiam server. cc @rbvigilante |
This is interesting. We've been running 3.6rc1 as well; I tested As I understand it, Kiam is proxying more or less 100% transparently now (not even setting |
Just facing the same issue with the latest official 3.6 release:
|
We're running the official 3.6 release too now; I'll see if I can reproduce this as soon as I get a chance (probably on Friday) |
The duration in @project0's logs is consistently ~1s. Could this be some timeout issue, @rbvigilante? |
The proxy just uses the default transport in go
var DefaultTransport RoundTripper = &Transport{
	Proxy: ProxyFromEnvironment,
	DialContext: (&net.Dialer{
		Timeout:   30 * time.Second,
		KeepAlive: 30 * time.Second,
		DualStack: true,
	}).DialContext,
	ForceAttemptHTTP2:     true,
	MaxIdleConns:          100,
	IdleConnTimeout:       90 * time.Second,
	TLSHandshakeTimeout:   10 * time.Second,
	ExpectContinueTimeout: 1 * time.Second,
}
The only default timeout that's 1 second is the ExpectContinueTimeout so might be worth investigating that, perhaps the AWS token api is a bit slow to respond? |
There seem to be similar problems within the SDKs, aws/aws-sdk-go#2972.
I don't think it's the issue here (as we proxy through an SSL-protected gRPC connection), but it's good to know ;-) |
I'm still definitely able to get tokens out of the metadata API through Kiam 3.6. I'm still not sure if we have anything requesting IMDSv2 with any great volume internally; I'll try to find out. @project0 or @dmizelle - would it be possible for one of you to try running something like |
@rbvigilante I just tested it within a container, but I was not able to reproduce the 502 which appears randomly in the logs. Just to be clear, it seems to work in general. I just cannot find the correlation that explains why we get those timeouts in the logs. It looks like it fails only in really rare cases. |
Right - so the problem as you see it is intermittent. I wonder whether this is being caused by Kiam, or by the metadata API itself (or some combination of the metadata API being occasionally slow to respond and Kiam's proxy cutting the request off in that scenario). It seems weird that the metadata API could get that slow, though. Hmm. |
IMDSv2 introduces some breaking security features, including a max hop count that can be modified in the instance metadata options. By default it is 1 now, so if Kiam is running in a container, IMDSv2 calls will FAIL but seem to fall back to IMDSv1 after a long period of time; otherwise it seems to just hang, so this could be the reason the context is cancelled. |
Yeah, I thought the max hops thing might be a problem, but it seems like IMDSv2 works through Kiam most of the time, which seems to preclude that being the issue we're seeing here. I can confirm that I'm able to get a token from IMDSv2, then use that token to pull information out of the metadata service. |
Tell me if I'm wrong, but if the server runs with "hostNetwork" it should have only one hop? So it should pass with the TTL of 1? |
That should work. |
Can confirm our Kiam servers run on the host network. |
I ran a simple test and can confirm IMDSv2 works with hostNetwork:
In a non-"hostNetwork" pod, So, I don't see any problems...? |
Hi, I tried some things yesterday to improve security on my nodes. It seems related to the native AWS SDK "ec2metadata" usage here: https://github.com/uswitch/kiam/blob/master/pkg/aws/sts/resolver_detect_arn.go#L46 |
After some tests, it's strange... Changing the max hops limit doesn't work any better!
|
I'm also seeing 502 status responses for
Why is that? I thought, based on the description of the design, that Kiam is supposed to handle all the metadata service requests so that it can provide fast cached responses? Does anyone have ideas of things I can try, to resolve the high latency issue? Or can I provide you with any additional info? Edit: Manually setting the hop limit on the EKS nodes to 2 didn't make any difference. |
We noticed a similar-sounding problem with a Java-based application run by one of our teams, where traffic to Kiam spiked really hard for a fifteen-minute period before the application's token expired, causing everything to slow down. This happened because the AWS SDK for Java is coded to request a new token fifteen minutes before the old one expires, but Kiam only invalidated its internal cache five minutes before. We fixed this by adding |
@rbvigilante I guess I'm unclear why this would suddenly begin to be a problem in Kiam 3.6, whereas 3.5 was working fine. Has Kiam's behavior for session refresh changed in 3.6? I guess the behavior of the AWS SDK in that situation is to request new credentials from Kiam, Kiam gives back the cached credentials, but the SDK sees that the "new" credentials expire in <15m and therefore requests new credentials again? And this continues until Kiam generates new credentials? I'll try adjusting my |
So did setting the refresh interval fix your issue, @rehevkor5? Or are there any other workarounds? |
Yes, it did. |
Could you help me please, I see the same error which says kiam-server:
kiam-agent:
I get intermittent errors which say I tried setting the session-refresh to 15 minutes, but then the application threw |
When a downstream deployment that uses a recent version of aws-java-sdk attempts to request credentials, it ends up using IMDSv2 (a PUT request to /latest/api/token) but it seems that the kiam-agent reports back a 502 to the client.
Eventually, the client falls back to requesting credentials in the old way, but it adds about a second of additional latency for our applications.
Looking through the logs on the kiam-agent side, I see the following:
From the kiam-server side, I do see that it ends up logging that it has credentials:
We see this in our test environment as well, so I'm going to throw KIAM into debug log level and enable GRPC logging.