implement data dump/crash dump for diagnosis of rare panics/crashes on (probably) data-driven defects #6684
Comments
I agree that can be frustrating; we can at least see where the read or write is happening, but without the race detector, we can't get more information (there is a performance hit to keep track of all that). The Go runtime is crashing the program, so there's nothing we can do to recover from it or get more information. If you build Caddy with the Go race detector enabled (which carries a noticeable performance penalty, just be advised), it will give you the information you desire. Set Closing, since there's nothing actionable for us, but feel free to continue the discussion if needed.
Thanks for the hints and advice, we'll investigate this for sure.
Thanks @mholt - I've added this to our custom builds and tests. Unfortunately, it doesn't look like Caddy shuts down cleanly, not even for a single HTTP request. With CADDY_VERSION=v2.7.6, a simulated panic done with panic("TEST-PANIC: Simulated panic for log testing") just returns CURL a
via CURL or simply an ugly browser error, and we're not seeing any benefit or output from XCADDY_RACE_DETECTOR. I understood that as related only to the Go race detection feature, though, not to dumps on panic. In the dev/test Caddyfile we have this on global scope and expected a separate "panic.log" to be written
but it only leads to 1 more line on stdout, with severity "debug" and not "panic", so in the debug.log, the first.
What I suggested (or expected) was something like what this quick Dump Context helper does
(maybe minus the ASCII art for TEST/QA/PROD :-) This at least leads me to the assumption that we may have had this specific issue, or a wider one, for a longer time already,
As you suspect, the race detector will only detect races, not panics. Panics, even on ordinary builds, will already print a stack trace. (Although I can't remember whether we strip so many debugging symbols that the production stack traces stop being useful. I think we don't, because the traces we've received in bug reports have all been helpful.) All we need to pinpoint most panics is a stack trace. Races need two bits of information, and the second one only comes with the race detector enabled. Also, the crash reporting a "concurrent map read/write" is not a panic; it's a hard-coded exit in the Go runtime, so we can't recover from it. So, if you see that the race detector caught a data race (you'll know: it'll say "DATA RACE" in all caps), then put that info into your linked issue. (Where I also ask for
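The difference between an ordinary panic and a runtime fatal error can be illustrated with a minimal Go sketch. It is not taken from Caddy: `recoverable`, `safeVars`, and `newSafeVars` are hypothetical names invented for this illustration. The first part shows that a deferred `recover()` catches a `panic(...)`; the second shows one common way to avoid unsynchronized map access in the first place, since the "concurrent map read and map write" fatal error never reaches `recover()`.

```go
package main

import (
	"fmt"
	"sync"
)

// recoverable runs fn and turns an ordinary panic into an error.
// This only works for panic(...) calls. Runtime fatal errors such as
// "concurrent map read and map write" terminate the process directly
// and never pass through recover().
func recoverable(fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	fn()
	return nil
}

// safeVars is a hypothetical mutex-guarded map (an assumption for this
// sketch, not Caddy's data structure). Guarding every access with the
// same lock is one way to rule out concurrent read/write on the map.
type safeVars struct {
	mu sync.RWMutex
	m  map[string]any
}

func newSafeVars() *safeVars { return &safeVars{m: make(map[string]any)} }

func (s *safeVars) Set(k string, v any) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[k] = v
}

func (s *safeVars) Get(k string) (any, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.m[k]
	return v, ok
}

func main() {
	// An ordinary panic is caught and reported as an error.
	err := recoverable(func() { panic("TEST-PANIC: Simulated panic for log testing") })
	fmt.Println(err)

	// Concurrent readers and writers are safe because of the lock.
	vars := newSafeVars()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			vars.Set("key", i)
			vars.Get("key")
		}(i)
	}
	wg.Wait()
	_, ok := vars.Get("key")
	fmt.Println(ok)
}
```

Running a variant of this without the lock under `go run -race` is the kind of situation in which the race detector would print its "DATA RACE" report.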
1. What version of Caddy are you running (caddy -version)?
now 2.8.4 with
xcaddy build --output bin/LRTcaddy --with github.com/christophcemper/[email protected] --with github.com/caddyserver/transform-encoder --with github.com/pteich/caddy-tlsconsul --with github.com/caddyserver/cache-handler
after 10 months of stable 2.7.6 with
xcaddy build --output bin/LRTcaddy --with github.com/christophcemper/[email protected] --with github.com/caddyserver/transform-encoder --with github.com/pteich/caddy-tlsconsul
which did not experience this crash. The very rare, previously unseen crash has now occurred only 2 times in 3 weeks.
It is unreproducible, yet it takes down production for multiple projects when it happens.
Such rare crashes should at least provide more details via a panic handler once they happen.
2. What are you trying to do?
Find out how to create a scenario for reproducing it.
Find out how to create a trimmed down config, by dumping payload data of current processing - not just static line numbers.
Some ideas to dump at least
i.e.
Currently we only have a naked stack trace as posted in #6683
which is not helpful (enough) to reduce the config or create a reproducible case.
3. What is your entire Caddyfile?
NA
4. How did you run Caddy (give the full command and describe the execution environment)?
/usr/bin/caddy run --environ --config /etc/caddy/Caddyfile
via a systemd service
systemctl status caddy
● caddy.service - Caddy
Loaded: loaded (/lib/systemd/system/caddy.service; enabled; vendor preset: enabled)
Active: active (running) since Sun 2024-11-10 11:45:13 UTC; 7h ago
Docs: https://caddyserver.com/docs/
Process: 1929477 ExecReload=/usr/bin/caddy reload --config /etc/caddy/Caddyfile --force (code=exited, status=0/SUCCESS)
Main PID: 1879405 (caddy)
Tasks: 39 (limit: 154411)
Memory: 2.2G
CGroup: /system.slice/caddy.service
└─1879405 /usr/bin/caddy run --environ --config /etc/caddy/Caddyfile
5. What did you expect to see?
Actually processed data, more than the stack trace below.
Oct 28 (intro for 1st goroutine - longer log attached)
Oct 28 15:05:08 web01 caddy[2621539]: fatal error: concurrent map read and map write
Oct 28 15:05:08 web01 caddy[2621539]: goroutine 3873976361 [running]:
Oct 28 15:05:08 web01 caddy[2621539]: github.com/caddyserver/caddy/v2/modules/caddyhttp.GetVar({0x1f9bd08?, 0xc015b73290?}, {0x1a68ef6, 0xe})
Right after
Oct 28 15:05:08 web01 caddy[2621539]: fatal error: concurrent map read and map write
it could print all possible details known, based on the ideas in point 2.
7. How can someone who is starting from scratch reproduce this behavior as minimally as possible?
Due to lack of data/config details in the panic log,
we are not able to reproduce the issue yet in any way,
nor do we know how to isolate this down to the host/domain that is causing this.