
Zone synchronization fails sporadically #411

Open
maintain3r opened this issue Nov 22, 2024 · 6 comments

@maintain3r

Hello NLnetLabs Team,
I'm reaching out to you as a last resort.
I have NSD servers running as one primary and multiple secondaries, all pointing to the same primary as the source of truth. The primary and secondary NSD servers don't know anything about each other.
The primary NSD is configured to allow XFR queries from the subnets of the secondary NSD hosts.
Running nsd-control transfer on the secondary hosts fetches the zones configured in the nsd.conf file when a secondary instance bootstraps. The same command is configured as a cron job and runs every 5 minutes, so any zone change I make on the primary host appears on the secondaries within 5 minutes at most.
The reason I don't have a notify section configured on the primary host is that the secondary hosts are cloud VMs and can be replaced by the cloud provider for any reason and at any time. This makes it impossible to preconfigure the primary NSD with the IP addresses of any secondary servers.
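For reference, a rough sketch of what this setup looks like, with placeholder values (192.0.2.1 for the primary, 198.51.100.0/24 for the secondary subnet, example.com for the zone) rather than the real addresses:

# nsd.conf on the primary (placeholders)
zone:
    name: "example.com"
    zonefile: "example.com.zone"
    # allow zone transfers from the secondary subnet
    provide-xfr: 198.51.100.0/24 NOKEY

# nsd.conf on each secondary (placeholders)
zone:
    name: "example.com"
    zonefile: "example.com.zone"
    # pull from the primary; the primary does not send notifies
    request-xfr: AXFR 192.0.2.1 NOKEY

# cron entry on each secondary, polling every 5 minutes
*/5 * * * * /usr/sbin/nsd-control transfer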

Recently the sync between the primary and the secondary hosts stopped working.
A quick check on the secondary hosts with "nsd-control zonestatus" showed that one of the zones had its state as "refreshing".
I tried to run "nsd-control transfer" in my terminal, but that didn't change anything. Then I ran "nsd-control force_transfer",
which performs a zone transfer ignoring the Serial field of the SOA record of all zones, and the issue was fixed.
Repeating "nsd-control zonestatus" then showed all zones with "status: ok".
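To be explicit, the sequence on the affected secondary was roughly the following (example.com stands in for the real zone name; without a zone argument the commands act on all secondary zones):

nsd-control zonestatus                   # zone stuck in "refreshing"
nsd-control transfer example.com         # serial-checked transfer, changed nothing
nsd-control force_transfer example.com   # transfer ignoring the SOA serial, fixed it
nsd-control zonestatus                   # all zones back to "status: ok"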
After checking the logs I saw some weird stuff:
nsd.log:
[timestamp] nsd[475]: error: xfrd: failed writing tcp Operation now in progress
[timestamp] nsd[475]: info: xfrd: zone example.com ignoring old serial (222/221) from
[timestamp] nsd[475]: info: xfrd: zone example.com bad transfer 0 from

These lines also appear randomly in the nsd.log file:
error: could not SSL_write crypto error:00000000:lib(0)::reason(0)

For some reason NSD on the secondary hosts didn't work properly.
But even if there was an issue with a zone transfer, I still don't understand why the cron job running 'nsd-control transfer' every few minutes didn't fix it. The TCP line from the log does not really tell me whether it was a communication issue with the primary server. The "old serial" lines are not informative either: how the secondary ended up with a higher Serial number than the primary makes no sense to me.
I'm not sure putting "nsd-control force_transfer" in cron is a good idea, as I don't want to fetch zones from the primary regardless of whether the zone's Serial number has changed.

Ubuntu srv 22.04
NSD version 4.3.9
Configure line: --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-option-checking --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --with-configdir=/etc/nsd --with-nsd_conf_file=/etc/nsd/nsd.conf --with-pidfile=/run/nsd/nsd.pid --with-dbfile=/var/lib/nsd/nsd.db --with-zonesdir=/etc/nsd --with-xfrdfile=/var/lib/nsd/xfrd.state --disable-largefile --disable-recvmmsg --enable-root-server --enable-mmap --enable-ratelimit --enable-zone-stats --enable-systemd --enable-checking --enable-dnstap
Event loop: libevent 2.1.12-stable (uses epoll)
Linked with OpenSSL 3.0.2 15 Mar 2022

Please let me know if you need any other details.

Thanks!

@k0ekk0ek

Hi @maintain3r.

I'm not sure putting "nsd-control force_transfer" in cron is a good idea, as I don't want to fetch zones from the primary regardless of whether the zone's Serial number has changed.

It's best to not forcefully trigger a transfer. Let's leave that as is for now.

The error almost looks like a timeout or something, but I really can't be sure yet. I'll try to analyze the problem further tomorrow.

@k0ekk0ek k0ekk0ek self-assigned this Nov 26, 2024
@maintain3r
Author

Hi @k0ekk0ek, thank you for your reply. Please let me know if you want me to check anything else.

@maintain3r
Author

Hello, any updates on this one?

@k0ekk0ek

k0ekk0ek commented Dec 4, 2024

Hi @maintain3r. Sorry for the delayed response; something else came up, and one of the other devs will need to pick this up instead.

@k0ekk0ek k0ekk0ek removed their assignment Dec 4, 2024
@wtoorop wtoorop self-assigned this Dec 4, 2024
@wtoorop
Member

wtoorop commented Dec 16, 2024

Hi @maintain3r, sorry for the late response. Since your secondaries are able to receive transfers, we don't think the "failed writing tcp" or "could not SSL_write crypto error" messages are problematic. The TCP error is likely related to TCP Fast Open (with zone transfers). The SSL_write error may be related to that as well (but then on the nsd-control channel).

The reason the secondary did not transfer is given by the message "ignoring old serial (222/221)": the secondary had a zone loaded with a higher serial than the one it saw on the primary, so it did not transfer.
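A quick way to confirm this is to compare the SOA serials as seen from both sides, for example (substitute your own zone and server addresses):

# serial as served by the primary
dig @192.0.2.1 example.com SOA +short

# serial currently loaded on the secondary
dig @127.0.0.1 example.com SOA +short

If the serial reported by the secondary is higher than the primary's, xfrd will keep logging "ignoring old serial" and skip the transfer.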

There could be many reasons for this. Are you certain there is only a single primary? Do you use a zone database (what version of NSD do you use, and what is the value of the database setting on the secondaries and on the primary)? Do the secondaries write out the zone files?

Maybe we can spot some potential causes if you share your configuration. You can send it to us privately by PGP-encrypting it, for example to me (willem at nlnetlabs.nl) with my PGP key: E5F8 F821 2F77 A498.

@maintain3r
Author

Hello @wtoorop and thank you for your attention to this topic.
Even if there was a TCP-related issue, the "ignoring old serial (222/221)" message remains a mystery to me.
It looks like the secondary node, which always lags behind the primary, somehow increased its serial value even though the TCP transfer was unsuccessful at some point. Then, the next time the cron job on the secondary NSD tried to fetch updates from the primary NSD, the secondary concluded that its serial was higher than the one on the primary and didn't transfer the zone. In fact, for all secondaries I have 2 primary NSD nodes, and the behaviour described is exactly the same for both; I left that out of the logs only for the sake of simplicity. The logs above are repeated identically with the IP address of the 2nd primary NSD node.
All secondaries have the IP addresses of both primary NSD nodes configured and use them as the source of truth.
All secondary NSD nodes save the zone files locally.
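A secondary zone entry roughly looks like this (placeholder addresses for the two primaries, not the real ones):

zone:
    name: "example.com"
    # zone contents are written to a local zone file
    zonefile: "example.com.zone"
    # both primaries are configured as transfer sources
    request-xfr: AXFR 192.0.2.1 NOKEY
    request-xfr: AXFR 192.0.2.2 NOKEY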

NSD on primary nodes:
NSD version 4.1.26
Ubuntu srv 20.04

NSD on secondary nodes:
Ubuntu srv 22.04
NSD version 4.3.9
Configure line: --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-option-checking --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --with-configdir=/etc/nsd --with-nsd_conf_file=/etc/nsd/nsd.conf --with-pidfile=/run/nsd/nsd.pid --with-dbfile=/var/lib/nsd/nsd.db --with-zonesdir=/etc/nsd --with-xfrdfile=/var/lib/nsd/xfrd.state --disable-largefile --disable-recvmmsg --enable-root-server --enable-mmap --enable-ratelimit --enable-zone-stats --enable-systemd --enable-checking --enable-dnstap
Event loop: libevent 2.1.12-stable (uses epoll)
Linked with OpenSSL 3.0.2 15 Mar 2022

The value of the database setting on the secondaries (and the primary):
database: "/var/lib/nsd/nsd.db"

The config is pretty much standard, but if you need anything else please let me know.

Thanks!
