issues retrieving .cvmfsreflog on Nikhef stratum1 (ruif) since Sunday #1

Open
dvandok opened this issue Oct 24, 2022 · 7 comments

dvandok commented Oct 24, 2022

We've been seeing errors since yesterday where retrieval of the .cvmfsreflog either takes a long time or times out completely. The software was updated to 2.9.4-1 on October 18.

We have just now rebooted the server.

jblomer commented Oct 24, 2022

Can you paste the complete URL? I can try downloading the reflog and check if it looks healthy.

From which version did you upgrade to version 2.9.4?

dvandok commented Oct 24, 2022

Update: I think this did not have to do with the upgrade at all. We just seem to be tipping over a threshold where the number of concurrent requests to squid on that machine causes the kernel to start dropping connection requests. Since part of the sync process is a localhost retrieval (still via squid) of the .cvmfsreflog, this sometimes times out and breaks the update process.

For now, we've raised net.ipv4.tcp_max_syn_backlog to 1024. That seems to have addressed the immediate problem, but I feel we are reaching the point where having a single machine as the stratum-1 is no longer feasible.

I don't know whether there is an IPv6 equivalent of this setting in sysctl, or how these kernel parameters are organised (do the ipv4 settings actually apply to both IPv4 and IPv6?). We also don't know whether these values are sane and sound.
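
For reference, roughly what we did, written out as a sketch (the drop-in file name below is just an example, and whether the ipv4 knob also covers TCP over IPv6 is exactly the part we're unsure about):

# check the current value
sysctl net.ipv4.tcp_max_syn_backlog
# raise it on the running kernel
sysctl -w net.ipv4.tcp_max_syn_backlog=1024
# persist it across reboots (example file name)
echo 'net.ipv4.tcp_max_syn_backlog = 1024' > /etc/sysctl.d/90-stratum1.conf
sysctl --system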

DrDaveD commented Oct 24, 2022

According to the stratum 1's MRTG file descriptor plot, your front-end squid hasn't had more than 520 simultaneous connections, at least when sampled every 5 minutes. So it doesn't seem logical to me that the backlog was an issue. Were you seeing errors in /var/log/messages?

The CPU time also shown for your squid isn't very high, even if you have only one worker defined, so I'm not sure what's going on. If you have only one worker, you might want to increase it anyway, because you seem to have had connection problems. The FNAL stratum 1 is more heavily loaded than yours, using one old machine (and a backup), but they do have 3 workers configured. The RAL stratum 1 is even more heavily loaded than FNAL's, also on one machine (with a backup); I don't know how many workers they have defined. They typically have about 3 times as many simultaneous connections as the NIKHEF stratum 1.
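
If you do want to try that, the knob is the workers directive in squid.conf; a sketch (3 is only an example, mirroring what I believe FNAL runs):

# /etc/squid/squid.conf (path may differ on your installation)
workers 3

Note that with SMP workers some other parts of the config (cache_dir in particular) may need per-worker adjustments, so check the squid documentation before rolling it out.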

dvandok commented Oct 24, 2022

There was another setting (I think the maximum number of open file descriptors) that was on the low side, but increasing it did not have an immediate effect. It does show in the graph, though.
As for the logs, this is the kind of message we got; these are the last of them (after this, the increased backlog made them go away):

okt 24 12:31:49 ruif.nikhef.nl kernel: net_ratelimit: 331 callbacks suppressed
okt 24 12:31:49 ruif.nikhef.nl kernel: TCP: drop open request from 127.0.0.1/55906
okt 24 12:31:49 ruif.nikhef.nl kernel: TCP: drop open request from 62.44.127.153/35016
okt 24 12:31:49 ruif.nikhef.nl kernel: TCP: drop open request from 80.188.228.154/49954
okt 24 12:31:49 ruif.nikhef.nl kernel: TCP: drop open request from 154.23.220.9/49713
okt 24 12:31:50 ruif.nikhef.nl kernel: TCP: drop open request from 154.23.220.9/59211
okt 24 12:31:50 ruif.nikhef.nl kernel: TCP: drop open request from 131.154.128.11/42666
okt 24 12:31:50 ruif.nikhef.nl kernel: TCP: drop open request from 154.23.220.9/7524
okt 24 12:31:50 ruif.nikhef.nl kernel: TCP: drop open request from 130.246.183.246/38394
okt 24 12:31:50 ruif.nikhef.nl kernel: TCP: drop open request from 0000:0000:0000:0000:0000:0000:0000:0001/51500
okt 24 12:31:50 ruif.nikhef.nl kernel: TCP: drop open request from 80.188.228.154/54263

dvandok commented Oct 24, 2022

I think I'll try to increase the number of workers first.

DrDaveD commented Oct 24, 2022

net.ipv4.tcp_max_syn_backlog is the setting for incomplete TCP connections. Here's the description from the man page of listen(2), the call that accepts a backlog parameter:

       The  behavior of the backlog argument on TCP sockets changed with Linux
       2.2.  Now it specifies the  queue  length  for  completely  established
       sockets  waiting  to  be  accepted, instead of the number of incomplete
       connection requests.  The maximum length of the  queue  for  incomplete
       sockets  can be set using /proc/sys/net/ipv4/tcp_max_syn_backlog.  When
       syncookies are enabled there is no logical maximum length and this set-
       ting is ignored.  See tcp(7) for more information.

       If    the   backlog   argument   is   greater   than   the   value   in
       /proc/sys/net/core/somaxconn, then it is  silently  truncated  to  that
       value;  the  default  value  in  this  file  is 128.  In kernels before
       2.4.25, this limit was a hard coded value, SOMAXCONN,  with  the  value
       128.

So it sounds like another thing that you could do is enable syncookies. On the FNAL and OSG Stratum 1s, net.ipv4.tcp_max_syn_backlog is 2048. I read that the default value depends on the amount of RAM, so I suspect your machine has less RAM than they do. OSG has 64G and FNAL has 192G.
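
As a sketch, enabling syncookies and matching the FNAL/OSG backlog would look something like this (the values are just the numbers quoted above; add the same settings to a file under /etc/sysctl.d/ of your choosing to make them persistent):

sysctl -w net.ipv4.tcp_syncookies=1
sysctl -w net.ipv4.tcp_max_syn_backlog=2048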

It doesn't sound like the number of workers has anything to do with this problem, so it might not be worth bothering with that. That mostly helps when squid starts getting close to running 100% on one core, which does not seem to be the case for you.

It makes sense that increasing the number of file descriptors didn't help. Squid sets the backlog value (which, as the man page above says, is only for completed but not yet accepted connections) to 25% of the number of file descriptors. You can monitor that backlog in real time with ss -nl|grep 80. The third column is the lesser of the value that squid set and net.core.somaxconn, and the second column is the current number of waiting connections.
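
As a concrete example of that check (assuming the front-end squid listens on port 80; the exact column layout can differ between ss versions):

ss -ltn 'sport = :80'
# For LISTEN sockets, Recv-Q is the current number of connections waiting to be
# accepted, and Send-Q is the effective backlog, i.e. the lesser of what squid
# requested and net.core.somaxconn.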

dvandok commented Oct 26, 2022

I've increased the memory on ruif to 128GB. Let's see how that holds up.
