-
-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fatal error: concurrent map read and map write after upgrading 2.7.6 to 2.8.4 and/or adding github.com/caddyserver/cache-handler #6683
Comments
Thanks; a quick preliminary look suggests this may be a simple fix in the vars middleware. Will get back to this soon 🙂 |
Wow, thanks so much for the quick response. Maybe unrelated, but caught my eye: the xcaddy switched implicitly from our company standard golang 1.21.8 to 1.22.0, 1.22.8 or 1.22.9 depending on which log line I look at in our logs 🤔 I find that highly confusing atm tbh, |
Go itself now uses the latest version of the runtime when building. It downloads the latest patch version that matches the minimum minor version requirement. See https://tip.golang.org/doc/toolchain |
Without a config we cannot minimally reproduce this, and therefore can't fix this. Please share a minimally reproducing config. |
Sorry, we also cannot reproduce it as mentioned, and visible from the frequency, If you want to review the full production config in private, please let me know how to share. |
In addition to that, it's beyond my imagination how we would want to be able to reproduce some random web traffic pattern of X production web sites for an obvious timing/race condition problem that occurs every 3+ weeks... but maybe enlighten me how we could approach that, as mentioned. |
You're describing a race in map access. The odd thing is, the subject caddy/modules/caddyhttp/vars.go Lines 316 to 324 in 7088605
I, for one, cannot imagine 2 HTTP requests sharing the same context, so maybe we're missing something. If you can share minimal config, maybe we can all look at reproducing it with, perhaps, Vegeta. |
hmmm... maybe late, but that code snippet is just a read. Where is the write that conflicts? And how is correct concurrency ensured? Are there mutex'es? Is the context's variable "thread safe" already? I found @francislavoie 's mail in his profile, will mail our prod config there, for investigation, |
Configs are not private information. Post your config here if you want to get help. |
Since we disagree on this, I guess we'll just keep counting how many times Caddy takes down a dozen production websites and services and backend-APIs down in the next days & weeks based on this "policy". |
In addition to that, I want to re-confirm, again,
it is NOT reproducible, and I've not read any idea or approach how a probably data-driven race should be become reproducible,
|
half of the config refers to internal hosts, ips, names, etc. that are nobody's business, for example, this is true for our config, and most other company's internal system service configs. |
That tone is insulting. I have that policy because I'm a volunteer. I offer my time to the project in the open so that anyone who has similar issues may find their answers from the discussions and troubleshooting that happens in the open, in searchable/indexable parts of the internet. I have to draw the line somewhere to limit things and to keep a healthy open-source/life balance. If you want private support then you should consider sponsoring the project https://caddyserver.com/support and Matt will be able to prioritize helping you. |
Good morning @francislavoie Thanks for you fast responses, much appreciated (it was past midnight here in Austria). I want to apologize if you felt my response insulting or somehow related to your personal efforts. I of course agree that the idea to keep as much as possible in the open helps everyone. I still don't agree that revealing production config is something you can or should expect, always, In this case, especially, we don't even know how to isolate & minimize the config down, which is another issue where I'll share my ideas. As it stands now we're not even guessing, we're looking at a random stack trace together, Next steps for us:
Thanks also for the reminder on the sponsoring and paid support offers, we'll review those, too! |
I thought about this more @mohammed90 and have the following ideas/questions: Q1: Even if the map is really always used only by 1 HTTP request and not shared, is the memory safe? Q2: Are multiple go routines of handlers possibly accessing the map concurrently? Q3: Can we ensure that we synchronize the map access? Currently it is not synchronized. Concurrent map access without proper synchronization is a serious issue that can lead to memory corruption or at least outages that we manually need to cure. We could perhaps use sync.Mutex, sync.RWMutex (Better for Read-Heavy Workloads), or sync.Map (Built-in Concurrent Map). Q4: Can we get more detailed info on where (which URL, host, etc) this is happening on? I have some more suggestions on that here #6684 as it's a general problem with all panics/crashes of Caddy. |
The issue here is that multiple goroutines could be accessing the same context's vars map concurrently. This can happen in several scenarios (which I am not all aware of personally yet, or verified if those are sync'ed already)
To fix this, we should modify the var map implementation to be thread-safe. |
The middleware chain should never be concurrent. It's synchronous. Request matchers being part of the middleware chain should always be synchronous. Error handlers can only happen after an error in the main route middleware chain returns an error, and therefore is also synchronous. Rewrites being part of the middleware chain are also synchronous. Same with authentication. The only possible place is a spawned goroutine by a handler. I'm not aware of any in Caddy itself that does this and allows concurrent access to vars. You're also using plugins, one of those plugins may be doing that. |
Thanks for your insights and estimation, I agree, that's certainly possible. We have now on QA system already reverted back to CADDY_VERSION=v2.7.6 because those plugins also have had quite some changes, in addition to a few changes we've identified already in the core. For the core, from what we understood, also the reported vars.go has 4 revisions as diff between 2.7.6 and 2.8.4
|
I'm mostly away from my desk on the weekends, so I have yet to catch up on the whole discussion as it's only Monday morning here, but:
Anyway, I'm mostly repeating what Francis already said, but I hope this offers some clarity. If you can post the config (and ideally the request too) here, even if we can't reliably reproduce it, we can likely have a better sense of the problem and offer a viable solution rather than guessing. (We try not to guess here when possible.) What is the full output of |
Thanks so much @mholt and @francislavoie for clarifications and context,
Yes, makes total sense. RE:
We could easily operate such a "Recorder" and be able to re-run requests not just for that but possible if combined with backup/recover PIT from the of the system(s) even create "real world black box tests" of reference systems in the future. Does such a "HTTP Request Recorder & Replay" module/feature exist? If not we may build one actually. Next,
|
We've also just signed up for the Sponsor Startup plan @mholt with our startup AIPRM now (almost empty repo) |
v2.8.4 h1:q3pe0wpBj1OcHFZ3n/1nl4V4bxBrYoSoab7rL9BMYNk=
here we go
`caddy list-modules``
|
This is a very good detail to know. We use Caddy for a lot of reverse proxy work to internal services, probably 60% or more of all host use it. We started using the So that's worth diving deeper on: 1/ 60% or more of all hosts us reverse proxy It's been ~14 days until the first crash, but my current bet it on this specific addition, Possibly the same crash would even arise using the backported build... 🤔 |
@christophcemper Thanks for your sponsorship and the additional details. If you want to email your config to matt at dyanim dot com that could be helpful.
Nothing turnkey... that I know of. I did a search in the non-standard modules listed there for
To clarify, the goroutines (concurrency) we use in the reverse proxy is literally only I just don't know where that's happening. A config could be helpful. |
@mholt thanks for the insights and clarifications!
Turns out few minutes after I read this I managed to find a DATA RACE locally on my dev/fork of 2.7.6. in caddyhttp/responsewriter.go- will report separately. Regarding the mentioned case, a combo of cache module + reverse proxy + header modification I have an isolated config and tes setup, as well as build commands and binaries used. I'm posting here also, but Theory 1 - unusual combo of caching reverse proxyCaddyfileDEV-defect-6683-map-read-write-crash-simple-theory1
Makefile to run/build (but we used original amd64 binary also)
Test results of the simple case1-3 should be 200s, successfully reverse-proxied
the successful CURL also has some more. i.e. like this
I hope this helps |
FWIW, I failed to upload "defect-6683-map-read-write-crash-simple-theory1.zip that has it all with reference logs from mac and amd64 runs, with only 42MB |
Honestly, that's not that interesting. We're at 2.9.0-beta.3 now. If you're not testing against latest, your tests results are likely against stale code (you might be re-finding bugs that are already fixed). |
Thanks; what is the output of the race detector though? And... yeah, unfortunately a really old version like 2.7.6 might not be super helpful, but with the output of the race detector we can confirm either way. |
I've spent quite some time today to dive deeper. Yeah, old versions are less useful - unless they are more stable. Remember we downgraded due to instable 2.8.4? Yes, I understand where you are coming from here, from the viewpoint of ongoing development. However, besides ongoing development which is great, there's an operational reality. How stable will future versions be if the responsibility for stability is also passed on to add-on modules? I believe still not (as) stable at all (as we would like) We are at version 2.7.6 for now, again, So for us, the 2.9.0-beta.3 couldn't be less interesting for us at the current moment. Blindly trusting the "newer version being better"-idea with 2.8.4. got us into the troubles we're in... OK, and adding that cache module, which gets more and more suspicious tbh.
I would hope so, so we can backport. But I believe we need hardening all the way up with mutexes and timeouts thereof. So it doesn't seem solved, so far in both 2.8 and 2.9 but I haven't tested those, still focussing on the "last known good one", and neither new releases ensure sync/lock of Vars or Replacer Middleware as I understand. New things found out so far in my 2.7.6 fork1/ Vars middleware now got an InstrumentedRWMutex - with timeout logs etc. using a new lib I wrote today 2/ Vars middleware is used wrong by reverse proxy it seems, or in combo reverse proxy + cache module - some of the reverse proxy go routines seem to remain/not get cleaned up. 3/ Replacer middleware is also suffering from DATA RACEs - and is not yet sync'ed by me 4/ Reverse Proxy / Streamer are involved in DATA RACEs. It appears that when http.handlers.reverse_proxy gives us "msg":"streaming error" then this is more likely 5/ Caddy Module: http.handlers.cache is based on Souin cache - but when you look at the code and ServeHTTP there, you find a lot of Go routines, channels, parallelism... not finished review, but it contradicts what I understood is expected as not being concurrent per HTTP request. It appears this is where more troubles with unsynced/not locked Vars/Replacer middlewares and/or issues in Reverse Proxy / Streamer could combine. I received multiple DATA RACEs only by manual 1-user browsing, with references to local source folders only... but I expect a lot more once I deploy all that to QA system that make more sense to share. It's 2am again, so I end here, will provide more insights tomorrow. |
I believe this was fixed in v2.8.x. as of commit e7336cc. |
Thank you @mohammed90 - that's certainly worth investigating. I spent the whole day rewriting my rwmutexplus plus for this situation, so I can trace lock contention/timeouts etc. much better now. Everything I found was just re-confirmed, but in nicer, more readable logs (except when you turn up VerboseLevel to 4, then the Stdout/err explodes, too). I don't understand why but even on pre-prod hardware the Vars Middleware is really slow, and certainly suffers from concurrency that wasn't handled so far. It turns out that many shared read locks hang in there for 50+ ms (not micro, milli, like network pings!) for really trivial vars to get like "client_ip", many times over and over, even when only few humans operate on that webserver. Also the matchers.error module is similar slow. We have in there Average hold time: 1.367592ms and the client_ip one similar... all these millis add up, and using 20 x 1+ milli to get the client_ip for 1 request really leaves room to optmize
All these map accesses of course can cause that fatal if not coordinated... but apparently not very often, luckily 😅 here's how I build currently, in case you're interested I'll checkout that hint for the replace, thanks! |
@mohammed90 I looked into your commit and see how you implemented a mutex for that library (not reviewed/tested for conflicts, memory leaks, etc..) However, the problems are still elsewhere where there's unsync'ed code accessing the old way via Maps pulled directly from the Context object without any protection/sync. repl := r.Context().Value(caddy.ReplacerCtxKey).(*caddy.Replacer) I know this place well, because just 1 line above we have a very expensive/slow access to the Vars map actually (but done unsync'ed here) That is already changed to rwsyncplus in my fork here. ...and it shows how the timings are off across the code, same for the |
Last words: I merged the improvements in Var Middleware up to 2.8.4.
On init the map is setup with a Write Lock, for client_ip and that proxy bool and takes over 50ms for it. All the others reading the In all other versions out there, they access that map with whatever data it has. or not. If THAT is not a concurrency issue wiithin just this goroutine 1161 that needs mutex'es I don't know what is. The other question however is - what takes 50,000 micros to initialize that map there? for every request? 🤔 |
This merge to https://github.com/christophcemper/caddy/tree/v2.8.4-rwmutexplus2 |
Here's one cache-handler/souin related DATA RACE where the binary is from a stable/released fork running on QA here already @mholt @mohammed90 |
Afte having the Vars Middleware rwmutex'ed now in my fork, I could not help wonder why we wouldn't just use a standard Yeah, it uses more memory, but so does manual lock management + code complexity/risk and those are still like inferior locks like SQLite full table locks on the full map, while Can you maybe point me to some discussions/tests/insights from maybe many years ago to understand that background? (don't get me wrong - I don't like locks, as any locks are the natural enemy of performance, but hour long outages are not a price to pay for performance) |
The newer logs look nice, thanks for the updates -- but I'm still looking for the output of the race detector. I saw the one in souin, but it sounds like you're saying there is still a race in Caddy itself, and I can't diagnose it without the output of the race detector. The reason I hesitate to add more mutexing to the vars handler is because handlers generally shouldn't be accessing that concurrently during a request. I'd like to pin down exactly what's happening using the race detector output. |
Adding to what Matt says and probably mentioned earlier, the output of the race detector, i.e. with |
Yes, I agree - here's one DATA RACE I just got when starting up on my Mac Studio @mholt from this build version based on v2.8.4
|
That's very interesting -- thanks for posting that. I'll look into it tomorrow; but in the meantime, it's pretty clear from looking at the referenced lines, that it isn't the same as the fatal concurrent map read/write originally posted about, unfortunately. I would be expecting a data race with one of the reads or writes at |
@christophcemper Can you verify for me, to make sure I'm looking at the right lines of code (since you are running a modified fork of 2.8.4): what is this line of code for you: |
Sorry for potential confusion here @mholt as I tried to post as much as possible, I also posted a DATA RACE from the local caddy build I had, past midnight Thursday before leaving to a 3day weekend. /Users/christophc/Workspace/common/CaddyLRT/caddy-fork/modules/caddytls/tls.go:603 is here But I also meanwhile still/again got a "vars.go" related DATA RACE from the previous 2.7.6. (even using instrumented rwmutex already), |
New DATA RACE from the current 2.7.6. instrumented rwmutex already showed up over the weekend finally on QA system so this one is from the
|
I figured it out, I think. The request logging at the end of the response write so we can collect the response size. The size is then set in the replacer. The replacer is created and used per request. However -- your addition of the cache handler is what's causing the data race. Cache is inherently a data race between the cached value and the response from upstream. If you check Souin code, you'll find there are scenarios where it spins off a goroutine to fetch new value from upstream while fetching the value from cache and catching the winner. Both goroutines share the same I still don't think the mutexes in Caddy are needed because we already fixed one of them (for the replacer). Your issues stem from how the cache handler behaves. If you can try vanilla Caddy with the Pinging @darkweak for the cache/souin logic. |
Thanks for the further details @christophcemper, and for the theory @mohammed90 -- I think you may be right, I would also be curious on a vanilla Caddy build as that may narrow it down. Woke up to about 40+ notifications so I'm going to make my way through those but then I'll circle back to this. |
1. What version of Caddy are you running (
caddy -version
)?now 2.8.4 with
xcaddy build --output bin/LRTcaddy --with github.com/christophcemper/[email protected] --with github.com/caddyserver/transform-encoder --with github.com/pteich/caddy-tlsconsul --with github.com/caddyserver/cache-handler
after 10 month stable 2.7.6. with
xcaddy build --output bin/LRTcaddy --with github.com/christophcemper/[email protected] --with github.com/caddyserver/transform-encoder --with github.com/pteich/caddy-tlsconsul
that did not experience this very rare, unseen before crash
2. What are you trying to do?
Multiple production hosts, reverse proxies, etc.
We added the module
github.com/caddyserver/cache-handler
for a projectand (unfortunately) also upgraded from 2.7.6 to 2.8.4. in one step.
Since then we had 2 occassions of a crash of Caddy on Oct 28 and Nov 10
that caused a total outage and could only be remedied with a manual restart.
3. What is your entire Caddyfile?
Sorry we cannot share it in public, happy to send directly to mholt or other key people.
4. How did you run Caddy (give the full command and describe the execution environment)?
via a systemd service
5. What did you expect to see?
Stable caddy like all the time before.
6. What did you see instead (give full error messages and/or log)?
Oct 28 (intro for 1st goroutine - longer log attached)
Nov 10 (Intro for 1st go routine, longer log attached)
7. How can someone who is starting from scratch reproduce this behavior as minimally as possible?
Unfortunately, we could not reproduce the issue yet in any way, nor do we know how to isolate this down to the host/domain that is causing this.
We have not yet tried to down-grade to 2.7.6. and see if that happens again with the cache module,
but then if it only occurs every 3 or so weeks randomly we're not even sure if that makes sense at this point.
We would kindly ask for help to enable/change to more helpful logging details for this than the raw unqualified go stack-trace
caddy-crash-2024-11-10.log
caddy-crash-2024-10-28.log
The text was updated successfully, but these errors were encountered: