issues retrieving .cvmfsreflog on Nikhef stratum1 (ruif) since Sunday #1
Comments
Can you paste the complete URL? I can try downloading the reflog and check if it looks healthy. From which version did you upgrade to version 2.9.4?
Update: I think this did not have anything to do with the upgrade at all. We just seem to be tipping over a threshold where the number of concurrent requests to squid on that machine causes the kernel to start dropping connection requests. Since part of the sync process is a localhost retrieval (but still via squid) of the .cvmfsreflog, this sometimes times out and breaks the update process. For now, we've raised net.ipv4.tcp_max_syn_backlog to 1024. That seems to have addressed the immediate problem, but I feel we are reaching the point where having a single machine as the stratum-1 is no longer feasible. I don't know if there is an IPv6 equivalent setting in sysctl, or how these kernel parameters are organised (as in, are the ipv4 settings actually for both IPv4 and IPv6?). We also don't know if these settings are sane and sound.
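For reference, a minimal Python sketch for checking the current values of the kernel settings discussed here. It assumes a Linux host where sysctl values are exposed under /proc/sys; the helper names `sysctl_path` and `read_sysctl` are made up for illustration, and the extra settings listed alongside net.ipv4.tcp_max_syn_backlog are only a guess at what is worth looking at.

```python
from pathlib import Path


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its file under root (hypothetical helper)."""
    return Path(root).joinpath(*name.split("."))


def read_sysctl(name, root="/proc/sys"):
    """Return the setting as a stripped string, or None if it cannot be read."""
    try:
        return sysctl_path(name, root).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    # Assumption: these three settings are the ones relevant to the SYN backlog.
    for name in (
        "net.ipv4.tcp_max_syn_backlog",
        "net.ipv4.tcp_syncookies",
        "net.core.somaxconn",
    ):
        print(name, "=", read_sysctl(name))
```

The `root` parameter exists only so the lookup can be pointed at a test directory; on a real machine the default is what you want.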
According to the stratum 1's MRTG file descriptor plot, your front-end squid hasn't had more than 520 simultaneous connections, at least when checked every 5 minutes. So it doesn't seem logical to me that the backlog was an issue. Were you seeing errors in /var/log/messages? The CPU time shown for your squid also isn't very high, even if you have only one worker defined, so I'm not sure what's going on. If you have only one worker you might want to increase the number anyway, because you seem to have had connection problems. The FNAL stratum 1 is more heavily loaded than yours, using one old machine (and a backup), but they do have 3 workers configured. The RAL stratum 1 is even more heavily loaded than FNAL's, also on one machine (with a backup). I don't know how many workers they have defined. They typically have about 3 times as many simultaneous connections as the NIKHEF stratum 1.
There was another setting (I think max open file descriptors) that was on the low side, but increasing it did not have immediate effect. It does show in the graph, though.
I think I'll try to increase the number of workers first.
net.ipv4.tcp_max_syn_backlog is the setting for incomplete TCP connections. Here's the description from the listen(2) man page, which accepts a backlog parameter:
So it sounds like another thing that you could do is enable syncookies. On the FNAL and OSG Stratum 1s, net.ipv4.tcp_max_syn_backlog is 2048. I read that the default value depends on the amount of RAM, so I suspect your machine has less RAM than they do. OSG has 64G and FNAL has 192G. It doesn't sound like the number of workers has anything to do with this problem, so it might not be worth bothering with that. That mostly helps when squid starts getting close to running 100% on one core, which does not seem to be the case for you. It makes sense that increasing the number of file descriptors didn't help. Squid sets the backlog value (which, as the man page above says, is only for completed and not yet processed connections) to 25% of the number of file descriptors. You can monitor that backlog in real time with
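One way to check whether the kernel has actually been dropping connection requests, not necessarily the method meant in the comment above, is to read the listen-queue counters in /proc/net/netstat. The sketch below is a rough illustration that assumes a Linux host; the function names are hypothetical, and the 25% figure in `squid_listen_backlog` is simply taken from the comment above and has not been checked against the Squid source.

```python
def parse_netstat(text):
    """Parse /proc/net/netstat-style text into {prefix: {counter: value}}."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    stats = {}
    # Counters come in pairs of lines: a header line with the counter names,
    # followed by a line with the values, both starting with the same prefix.
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix = header[0].rstrip(":")
        stats.setdefault(prefix, {}).update(
            zip(header[1:], (int(v) for v in values[1:]))
        )
    return stats


def listen_queue_counters(text):
    """Return the TcpExt counters related to listen-queue overflows and drops."""
    tcp_ext = parse_netstat(text).get("TcpExt", {})
    return {name: tcp_ext.get(name) for name in ("ListenOverflows", "ListenDrops")}


def squid_listen_backlog(max_fds):
    """Listen backlog as 25% of the file descriptor limit (figure from the thread)."""
    return max_fds // 4


if __name__ == "__main__":
    try:
        with open("/proc/net/netstat") as f:
            print(listen_queue_counters(f.read()))
    except OSError:
        print("/proc/net/netstat is not available on this system")
```

These counters are cumulative since boot, so what matters is whether they grow between two readings taken while the reflog retrieval is timing out.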
I've increased the memory on ruif to 128GB. Let's see how that holds up.
We're seeing errors since yesterday where retrieval of the .cvmfsreflog either takes a long time or times out completely. Software was updated to 2.9.4-1 on October 18.
I just now rebooted the server.