Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: nsd stops forking with new zone data #417

Open
pettai opened this issue Dec 27, 2024 · 10 comments
Open

bug: nsd stops forking with new zone data #417

pettai opened this issue Dec 27, 2024 · 10 comments

Comments

@pettai
Copy link

pettai commented Dec 27, 2024

It seems our NSD secondary has triggered some sort of intermittent bug
This happens several weeks/months of running nsd stops forking with the new zone data.

A manual nsd-control transfer some.zone. or even nsd-control force_transfer some.zone won’t work.
Only restart of nsd solves the problem.

The logs doesn’t give a hint other than process nsd[3185521] stopped reporting updates.

[...]
Dec 27 05:41:04 sunic nsd[1413]: zone some.zone. serial 1735277408 is updated to 1735278004
Dec 27 05:46:55 sunic nsd[1413]: xfrd: zone some.zone. committed "received update to serial 1735278005 at 2024-12-27T05:46:55 from 2001:6b0:: TSIG verified with key tsig-sunet"
Dec 27 05:46:58 sunic nsd[3185521]: zone some.zone. received update to serial 1735278005 at 2024-12-27T05:46:55 from 2001:6b0:: TSIG verified with key tsig-sunet of 1369 bytes in 9.7e-05 seconds
Dec 27 05:47:02 sunic nsd[1413]: zone some.zone. serial 1735278004 is updated to 1735278005
Dec 27 05:07:59 sunic nsd[1413]: zone some.zone. serial 1735275607 is updated to 1735275608
Dec 27 05:10:42 sunic nsd[1413]: zone some.zone. serial 1735275608 is updated to 1735275609
Dec 27 05:11:04 sunic nsd[1413]: zone some.zone. serial 1735275609 is updated to 1735276204
[...]

And nsd-control zone status says state: ok, (which seems odd since several updates has failed to update the zone.)

zone:	some.zone
	state: ok
	served-serial: "1735279204 since 2024-12-27T06:01:01"
	commit-serial: "1735311605 since 2024-12-27T15:01:10"
	wait: "3510 sec between attempts"

Unfortunately, we restarted nsd since nothing else worked.
So we don't know if nsd[3185521] died or got in a defunct state.

Currently we run nsd 4.11.0 based on debians build/package , but this issue has happened with nsd 4.10.x too.

# nsd -v
NSD version 4.11.0
Written by NLnet Labs.

Configure line: --build=aarch64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-option-checking --disable-silent-rules --libdir=${prefix}/lib/aarch64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --with-configdir=/etc/nsd --with-nsd_conf_file=/etc/nsd/nsd.conf --with-pidfile=/run/nsd/nsd.pid --with-dbfile=/var/lib/nsd/nsd.db --with-zonesdir=/etc/nsd --with-xfrdfile=/var/lib/nsd/xfrd.state --disable-largefile --disable-recvmmsg --enable-root-server --enable-mmap --enable-ratelimit --enable-zone-stats --enable-systemd --enable-checking --enable-dnstap
Event loop: libevent 2.1.12-stable (uses epoll)
Linked with OpenSSL 3.0.2 15 Mar 2022

Copyright (C) 2001-2024 NLnet Labs.  This is free software.
There is NO warranty; not even for MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE.

Something that perhaps is less common with our nsd configuration, is that we use nsd's cpu-affinity and we also have catalog zones enabled. Let me know if more data (and what data) is needed.

@wtoorop
Copy link
Member

wtoorop commented Dec 30, 2024

Thanks @pettai ,
As @ttyS4 pointed out on the mailing-list, this is indeed quite similar to issue #338 .
I plan to implement some watchdog processes that can log, and potentially terminate, the stage in the reload process when it is taking long.

@pettai
Copy link
Author

pettai commented Jan 2, 2025

Now it happened again, and indeed it looks like it spins CPU while waiting for something...
(and nsd: xfrd lives on)

top - 00:09:30 up 12 days, 17:03,  1 user,  load average: 1.77, 1.90, 2.43
Tasks: 557 total,   2 running, 555 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.2 us,  0.0 sy,  0.0 ni, 97.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  31992.6 total,  21231.5 free,   1987.5 used,   8773.6 buff/cache
MiB Swap:   8191.5 total,   8191.5 free,      0.0 used.  27109.5 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
2376126 nsd       20   0 2369904 860688   3176 R  99.7   2.6   2151:59 nsd: load
 716345 nsd       20   0   84100  42852   7500 S   0.0   0.1  29:06.50 nsd: xfrd
2382233 nsd       20   0 2391532 868300   3588 S   0.0   2.7  14:43.47 nsd: server 1
2382234 nsd       20   0 2396148 870592   3588 S   0.0   2.7  14:57.94 nsd: server 2
[...]
2382236 nsd       20   0 2369904 857520      8 S   0.0   2.6   0:11.71 nsd: old-main

I tried activate more verbose logging, but that didn't do anything...
nsd-control reload didn't do much either...

Jan  2 00:08:37 sunic nsd[716345]: new control connection from 127.0.0.1
Jan  2 00:08:37 sunic nsd[716345]: control cmd:  verbosity 5
Jan  2 00:08:37 sunic nsd[716345]: remote control operation completed
Jan  2 00:08:58 sunic nsd[716345]: new control connection from 127.0.0.1
Jan  2 00:08:58 sunic nsd[716345]: remote control connection authenticated
Jan  2 00:08:58 sunic nsd[716345]: control cmd:  verbosity 2
[...]
Jan  2 00:12:37 sunic systemd[1]: nsd.service: Reload operation timed out. Killing reload process.

zonestatus on one domain that doesn't update:

zone:	sunet.se
	state: ok
	served-serial: "2024123100 since 2024-12-31T14:31:02"
	commit-serial: "2025010101 since 2025-01-02T00:01:01"
	wait: "79091 sec between attempts"

@pettai
Copy link
Author

pettai commented Jan 2, 2025

Thanks @pettai , As @ttyS4 pointed out on the mailing-list, this is indeed quite similar to issue #338 . I plan to implement some watchdog processes that can log, and potentially terminate, the stage in the reload process when it is taking long.

At first glance yes, but we don't have verification configured, so I think it's something different

@wtoorop
Copy link
Member

wtoorop commented Jan 2, 2025

Now it happened again, and indeed it looks like it spins CPU while waiting for something...

Thanks! The spinning is new and useful information! This is indeed quite different from #338 .
What is also different in your situation is that systemd monitors the reload process somehow? And then kills it if it takes too long? How? I'll have a look in how debian handles nsd with systemd!

@ttyS4
Copy link

ttyS4 commented Jan 2, 2025

@pettai you might want to increase the verbosity in the config file then do a fresh start because after it got stuck it might not make any difference anymore...

@ttyS4
Copy link

ttyS4 commented Jan 2, 2025

@wtoorop possibly useful links:

https://www.freedesktop.org/software/systemd/man/latest/systemd.service.html#ExecReload=
https://www.freedesktop.org/software/systemd/man/latest/systemd.service.html#Type=

https://sources.debian.org/src/nsd/4.11.0-1/debian/nsd.service/

We use debian (nsd 4.9.1) without systemd because unexpected magic by systemd like shown here leads to confusion and is not helpful for us.

@pettai i expect you used systemctl reload nsd which engaged systemd leading to killing these processes?

@wtoorop
Copy link
Member

wtoorop commented Jan 2, 2025

Thanks @ttyS4 Those are very useful links, thanks!, but I think we're already onto the issue @pettai is seeing and it is probably not related but a new bug introduced with nsd 4.11.0.

@pettai pettai changed the title nsd stops forking with new zone data bug: nsd stops forking with new zone data Jan 6, 2025
@pettai
Copy link
Author

pettai commented Jan 7, 2025

Some compilation output from building NSD 4.11.0, could it be related?

[...]
gcc -Wdate-time -D_FORTIFY_SOURCE=2 -DNDEBUG -I. -I./simdzone/include -g -O2 -ffile-prefix-map=/build/nsd-4.11.0=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -W -Wall -Wextra -Wdeclaration-after-statement -I/usr/include/google -c server.c
server.c: In function 'server_reload_handle_sigchld':
server.c:2384:46: warning: unused parameter 'sig' [-Wunused-parameter]
 2384 | static void server_reload_handle_sigchld(int sig, short event,
      |                                          ~~~~^~~
server.c:2384:57: warning: unused parameter 'event' [-Wunused-parameter]
 2384 | static void server_reload_handle_sigchld(int sig, short event,
      |                                                   ~~~~~~^~~~~
[...]

@pettai
Copy link
Author

pettai commented Jan 7, 2025

With regards to the the comment, here's another warning with regards to xfrd + catalog zone processing:

gcc -Wdate-time -D_FORTIFY_SOURCE=2 -DNDEBUG -I. -I./simdzone/include -g -O2 -ffile-prefix-map=/build/nsd-4.11.0=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -W -Wall -Wextra -Wdeclaration-after-statement -I/usr/include/google -c xfrd-catalog-zones.c
xfrd-catalog-zones.c: In function 'xfrd_process_catalog_consumer_zone':
xfrd-catalog-zones.c:724:64: warning: suggest braces around empty body in an 'if' statement [-Wempty-body]
  724 |                                                         NULL)));
      |                                                                ^

@pettai
Copy link
Author

pettai commented Jan 9, 2025

Hi, any updates/finding on this issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants