Spurious Thread_local_storage.Not_set exception #35

Open
amongonz opened this issue Oct 6, 2024 · 19 comments · May be fixed by #36

@amongonz

amongonz commented Oct 6, 2024

I've been getting spurious Thread_local_storage.Not_set exceptions coming from dpool when doing IO with moonpool-io inside child fibers running on the same Moonpool_fib.main runner as the parent. My tests work fine most of the time, except for the occasional run that fails.

When it fails, however, I always get the same trace (on moonpool 0.7):

Thread 1 killed on uncaught exception Thread_local_storage.Not_set
Called from Moonpool_dpool.work_.main_loop in file "src/dpool/moonpool_dpool.pp.ml", line 102, characters 12-29
Called from Thread.create.(fun) in file "thread.ml", line 48, characters 8-14
@c-cube
Owner

c-cube commented Oct 7, 2024

That's unfortunate (and a bit weird, afaict literally all calls to TLS.get_exn are properly guarded!). Is there a way to get some repro/regression test? :-)

Also: try to opam pin https://github.com/c-cube/thread-local-storage.git/#feat/debug-tls-get, it should give you a better backtrace.
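(For context, "guarded" here means something like the sketch below; it's purely illustrative and not actual Moonpool code. get_exn and Not_set are the names from the backtrace above, while set is an assumption about the thread-local-storage API.)

module TLS = Thread_local_storage

(* A guarded read: fall back to a lazily computed default instead of letting
   Not_set escape to the caller. *)
let get_or_init (key : 'a TLS.t) ~(init : unit -> 'a) : 'a =
  try TLS.get_exn key
  with TLS.Not_set ->
    let v = init () in
    TLS.set key v;
    v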

@amongonz
Author

amongonz commented Oct 7, 2024

I've got a repro. It turns out this happens even with a single fiber, and the exception isn't actually raised out of main; only the backtrace gets printed, which was enough to fail my cram tests.

let () =
  Printexc.record_backtrace true;

  let null = Moonpool_io.Unix.openfile "/dev/null" [ O_RDWR; O_CLOEXEC ] 0 in
  Fun.protect
    ~finally:(fun () -> Moonpool_io.Unix.close null)
    (fun () ->
      Moonpool_fib.main (fun _ ->
          let rec loop () =
            let buf = Bytes.create 1024 in
            match Moonpool_io.Unix.read null buf 0 (Bytes.length buf) with
            | 0 -> ()
            | read_n ->
              ignore (Moonpool_io.Unix.write null buf 0 read_n : int);
              loop ()
          in
          loop ()
      )
    )

Making a cram test for this file and running while dune test --force ; do :; done in bash eventually fails the test.

 Doesn't print an exception:
 
   $ ./main.exe
+  Thread 1 killed on uncaught exception Thread_local_storage.Not_set
+  Called from Moonpool_dpool.work_.main_loop in file "src/dpool/moonpool_dpool.pp.ml", line 102, characters 12-29
+  Called from Thread.create.(fun) in file "thread.ml", line 48, characters 8-14

Also: try to opam pin https://github.com/c-cube/thread-local-storage.git/#feat/debug-tls-get, it should give you a better backtrace.

My opam 2.1.5 didn't like that:

[ERROR] Could not retrieve Could not extract archive:
        Unknown archive type: /tmp/opam-259024-2a235b/thread-local-storage.git

@polytypic

polytypic commented Oct 7, 2024

Could you try adding a call / top-level side-effect

let () = Picos_io_select.configure () (* Can only be called once *)

or

let () = Picos_io_select.check_configured () (* Can be called multiple times *)

to the program such that it will be executed on the main thread before any threads or domains are spawned (and before using Moonpool_io)?

Either call will (among other things) configure signal masks so that the signal used by the Intr mechanism of Picos_io_select (SIGUSR2 by default), as well as SIGCHLD, will not be delivered to the current thread, nor, via signal mask inheritance, to any thread spawned afterwards. The threads spawned by Picos_io_select will then handle those signals and manage the signal masks for them.
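A minimal sketch of that ordering, applied to the repro above (everything else unchanged):

let () =
  (* Configure Picos_io_select's signal handling on the main thread, before
     Moonpool_fib.main spawns any worker threads or domains. *)
  Picos_io_select.configure ();
  Printexc.record_backtrace true;
  Moonpool_fib.main (fun _ ->
      (* ... the read/write loop from the repro goes here ... *)
      ())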

In the latest main version of Picos_io_select the program would give the error message:

not configured; please configure before spawning any threads or domains

instead of letting the Not_set exception propagate as in 0.5.0. Apologies for not handling that in 0.5.0! (The documentation of 0.5.0 says that the library will automatically configure with default arguments, but unfortunately that does not happen when using the Intr mechanism.)

BTW, for any new file descriptors that you create (and don't pass to other processes) it is recommended to put them into non-blocking mode. This should do it:

  let null = Moonpool_io.Unix.openfile "/dev/null" [ O_RDWR; O_CLOEXEC ] 0 in
+ Moonpool_io.Unix.set_nonblock null;
  Fun.protect

The underlying Picos_io.Unix library automatically handles EINTR.
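(Handling EINTR essentially amounts to a retry wrapper along these lines; this is just an illustration of the idea, not Picos's actual implementation.)

(* Retry a Unix call that was interrupted by a signal before it could complete. *)
let rec retry_on_eintr (f : unit -> 'a) : 'a =
  try f () with Unix.Unix_error (Unix.EINTR, _, _) -> retry_on_eintr f

(* For example: *)
let read_retrying fd buf pos len =
  retry_on_eintr (fun () -> Unix.read fd buf pos len)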

@amongonz
Author

amongonz commented Oct 8, 2024

Checking my tests again, I find it suspicious that the exception doesn't leak or crash any IO; it just prints that backtrace alongside the actual output, and the tests otherwise terminate successfully. To me, this feels more like a stray print.

Could you try adding a call / top-level side-effect [...]

I just tried both calls with the above repro, but neither fixed the spurious backtraces.

BTW, for any new file descriptors that you create (and don't pass to other processes) it is recommended to put them into non-blocking mode.

Noted, although the software I was testing moonpool and picos with only uses the standard streams for now.

@polytypic

polytypic commented Oct 8, 2024

Hmm... I can't seem to reproduce the issue.

Here is the program I tried: moonpool_issue_35.ml

It runs fine on Ubuntu GitHub action:

(cd _build/default/test && ./moonpool_issue_35.exe)
Read...
Got 0
OK

It also produces the same output on my macOS laptop.

You should be able to clone the repo and run dune test.

It also runs just fine without the Picos_io_select.configure () call. Reading from /dev/null seems to just immediately return 0 and the read does not block. (Can you read from /dev/null? 🤔) However, you should call Picos_io_select.configure () before spawning any threads or domains.

Note that you should have your entire moonpool application inside the Moonpool_fib.main call. That call starts the scheduler. If you call functions that need a scheduler before that it can cause issues.

Also, Fun.protect should be avoided or used with care in programs using fibers, because it doesn't protect against cancelation. In this particular case Fun.protect should work fine, because Unix.close is not a cancelable operation, but I want to note this issue as it could cause surprises at some point.
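Putting those points together, the repro could be restructured roughly like this (a sketch only, not a tested fix; Fun.protect is kept since Unix.close is not cancelable):

let () =
  Printexc.record_backtrace true;
  Moonpool_fib.main (fun _ ->
      (* Open, use and close the descriptor inside the scheduler. *)
      let null = Moonpool_io.Unix.openfile "/dev/null" [ O_RDWR; O_CLOEXEC ] 0 in
      Moonpool_io.Unix.set_nonblock null;
      Fun.protect
        ~finally:(fun () -> Moonpool_io.Unix.close null)
        (fun () ->
          let rec loop () =
            let buf = Bytes.create 1024 in
            match Moonpool_io.Unix.read null buf 0 (Bytes.length buf) with
            | 0 -> ()
            | read_n ->
              ignore (Moonpool_io.Unix.write null buf 0 read_n : int);
              loop ()
          in
          loop ()))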

@amongonz
Author

amongonz commented Oct 8, 2024

@polytypic I just cloned your repo: if I run while dune test --force ; do :; done and wait past 5-10 seconds of expected output, I again get:

Read...
Got 0
OK
Thread 1 killed on uncaught exception Thread_local_storage.Not_set
Called from Moonpool_dpool.work_.main_loop in file "src/dpool/moonpool_dpool.pp.ml", line 102, characters 12-29
Called from Thread.create.(fun) in file "thread.ml", line 48, characters 8-14

Sadly the loop doesn't stop on its own because the exit code is still 0; a cram test makes the failure easier to spot.

It runs fine on Ubuntu GitHub action

This is why I mentioned these are spurious and ran the tests in a loop: the trigger isn't deterministic; sometimes it refuses to show up for a minute and then it appears several times in a row.

@amongonz
Author

amongonz commented Oct 8, 2024

@polytypic I've forked your repo and changed the test suite to make reproduction easier. Depending on luck, it takes from a few seconds to a few minutes to reproduce:

$ time while dune test --force ; do :; done
File "test/test.t", line 1, characters 0-0:
diff --git a/_build/.sandbox/81e12ad9a439dc22a3d38fb96cf5752a/default/test/test.t b/_build/.sandbox/81e12ad9a439dc22a3d38fb96cf5752a/default/test/test.t.corrected
index e4b0053..4adc3e0 100644
--- a/_build/.sandbox/81e12ad9a439dc22a3d38fb96cf5752a/default/test/test.t
+++ b/_build/.sandbox/81e12ad9a439dc22a3d38fb96cf5752a/default/test/test.t.corrected
@@ -4,3 +4,6 @@ No backtraces:
   Read...
   Got 0
   OK
+  Thread 1 killed on uncaught exception Thread_local_storage.Not_set
+  Called from Moonpool_dpool.work_.main_loop in file "src/dpool/moonpool_dpool.pp.ml", line 102, characters 12-29
+  Called from Thread.create.(fun) in file "thread.ml", line 48, characters 8-14
                                   
real    0m12.307s
user    0m4.394s
sys     0m3.437s

@polytypic

polytypic commented Oct 8, 2024

Yes, it seems a thread is created in moonpool_dpool.ml in a static initializer:

(* special case for main domain: we start a worker immediately *)
let () =
  assert (Domain_.is_main_domain ());
  let w = { th_count = Atomic_.make 1; q = Bb_queue.create () } in
  (* thread that stays alive *)
  print_endline "moonpool_dpool.ml";
  ignore (Thread.create (fun () -> work_ 0 w) () : Thread.t);
  domains_.(0) <- Lock.create (Some w, None)

@c-cube Would it be possible to delay that so it only happens when something from Moonpool is actually called? (Personally I think one should avoid static initializers like the plague. They always come back to bite you at some point.)
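(The general shape of such on-demand initialization might look like the sketch below; names are illustrative, not Moonpool's, and, as the next comment notes, guaranteeing that it gets forced from the right domain is the tricky part.)

(* Spawn the long-lived worker thread the first time the pool is used,
   instead of in a top-level initializer at module load time. *)
let worker : Thread.t Lazy.t =
  lazy (Thread.create (fun () -> (* worker main loop goes here *) ()) ())

let ensure_worker_started () : unit = ignore (Lazy.force worker : Thread.t)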

Alternatively, do you really want that thread to receive POSIX signals? Maybe you could just block all the signals?

modified   src/dpool/moonpool_dpool.ml
@@ -148,7 +148,14 @@ let () =
   assert (Domain_.is_main_domain ());
   let w = { th_count = Atomic_.make 1; q = Bb_queue.create () } in
   (* thread that stays alive *)
-  ignore (Thread.create (fun () -> work_ 0 w) () : Thread.t);
+  ignore
+    (Thread.create
+       (fun () ->
+         Thread.sigmask SIG_BLOCK [ Sys.sigusr2 ] |> ignore;
+         work_ 0 w)
+       ()
+      : Thread.t);
   domains_.(0) <- Lock.create (Some w, None)
 
 let[@inline] max_number_of_domains () : int = Array.length domains_

The above prevents the issue, but blocking just SIGUSR2 is silly. You could probably just block all the signals you are not specifically interested in. Which signals would you like to receive in that thread?
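(For illustration, blocking a broader set in the worker before it enters its loop might look like this; the signal list is a made-up example rather than a recommendation.)

(* Block the signals this worker has no business handling; whatever the
   application actually wants to receive here would be left out of the list. *)
let block_uninteresting_signals () : unit =
  let blocked = [ Sys.sigusr1; Sys.sigusr2; Sys.sigchld; Sys.sighup; Sys.sigterm ] in
  ignore (Thread.sigmask Unix.SIG_BLOCK blocked : int list)

let spawn_worker (work : unit -> unit) : Thread.t =
  Thread.create (fun () -> block_uninteresting_signals (); work ()) ()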

@c-cube
Owner

c-cube commented Oct 8, 2024

I don't know how to run this static initializer from the main domain in a lazy way, alas. If I make it into some sort of lazy block it might be called from the wrong domain :-(. I fully blame the design of Domain for this; I'd rather not have to write the pool at all, honestly.

Blocking common signals from these threads seems a lot easier.

c-cube linked a pull request Oct 8, 2024 that will close this issue
@c-cube
Owner

c-cube commented Oct 8, 2024

Blocking signals seems to work for me; can you confirm, @amongonz?

@amongonz
Author

amongonz commented Oct 9, 2024

@c-cube Not entirely. I pinned moonpool to simon/fix-35 and retried my tests, but the proposed fix kind of turns it into a heisenbug: while the above repro has looped for almost an hour without triggering, the test suite that originally revealed the problem still triggers it within 10-50 seconds of looping (it used to take 1-10 s). So the fix makes the error harder to trigger, but it's still there.

I'm not sure how to make a stronger repro though, maybe by giving it an actual workload.[1] Luckily I was just trying out moonpool/picos, so I'm not personally in a hurry to find a fix.


Fun fact: while trying to pin down the source of our mystery exception, I realised that if I modify moonpool_dpool.ml here, I can consistently catch it when it happens, despite loop not using TLS at all (I think):

-     loop ()
+     try loop ()
+     with Thread_local_storage.Not_set as exn ->
+       Format.eprintf "MYSTERY: %s\n%s@." (Printexc.to_string exn)
+         (Printexc.get_backtrace ());
+       raise exn

But wrapping the body of loop instead fails to catch it before it kills the thread.[2] Is it coming from a safe point? I'm not familiar with the codebase, but could some signal handler be making an unguarded access to a TLS key at the wrong time, e.g. a use-after-free?
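(A self-contained illustration of that hypothesis, not Picos or Moonpool code: an exception raised from an OCaml signal handler surfaces at a safe point in whatever code happens to be running, so a try/with far away from any TLS use can still catch it. Exit stands in for Not_set here.)

let () =
  Sys.set_signal Sys.sigusr2 (Sys.Signal_handle (fun _ -> raise Exit));
  try
    Unix.kill (Unix.getpid ()) Sys.sigusr2;
    (* The pending handler runs at the next safe point, so Exit appears to be
       raised by this code even though nothing here raises it directly. *)
    print_endline "handler has not run yet"
  with Exit -> print_endline "caught the handler's exception in unrelated code"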

Footnotes

  1. The repro originally used stdin/stdout; I then switched both to /dev/null in an attempt to make the cram test simpler, but realised too late that this makes the read_n branch unused. Whoops.

  2. Depending on where I place the try/with, I've also seen it blamed on unix.ml.

@c-cube
Owner

c-cube commented Oct 9, 2024

I'm not aware of any signal handling in moonpool itself (besides the new one). Picos does have some signal handling though.

I have absolutely no idea where this is coming from 😬 . Async exceptions are a possibility but even then, where does the TLS access come from?

@polytypic

polytypic commented Oct 9, 2024

One unguarded TLS access, which slipped into Picos 0.5.0, is in a signal handler. Have you also pinned to use the latest main of Picos? The latest main will raise with an error message instead of letting the Not_set propagate.

If it is the signal handler, then it means that some thread / domain has been spawned before the signal mask has been set. At what point is the thread / domain running the loop in moonpool_dpool.ml spawned?

@amongonz
Author

amongonz commented Oct 9, 2024

@polytypic I had not pinned picos. To check whether this is already fixed to some degree, I've just pinned all picos and moonpool packages to the main branch, reinstalled them with opam, rebuilt my project from a clean state and tested again: I still hit the same output, possibly even sooner now (which is great for debugging!).

@polytypic

So, the way I found the earlier Thread.create call happening before the signal mask was set is that I simply added print_endline calls to all the places where threads were spawned:

Unstaged changes (6)
modified   src/core/fifo_pool.ml
@@ -164,6 +164,7 @@ let create ?on_init_thread ?on_exit_thread ?on_exn ?around_task ?num_threads
     (* function called in domain with index [i], to
        create the thread and push it into [receive_threads] *)
     let create_thread_in_domain () =
+      print_endline "fifo_pool.ml";
       let st = { idx = i; dom_idx; st = pool } in
       let thread = Thread.create (WL.worker_loop ~ops:worker_ops) st in
       (* send the thread from the domain back to us *)
modified   src/core/moonpool.ml
@@ -3,6 +3,7 @@ open Types_
 exception Shutdown = Runner.Shutdown
 
 let start_thread_on_some_domain f x =
+  print_endline "moonpool.ml";
   let did = Random.int (Domain_pool_.max_number_of_domains ()) in
   Domain_pool_.run_on_and_wait did (fun () -> Thread.create f x)
 
modified   src/core/ws_pool.ml
@@ -310,6 +310,7 @@ let create ?(on_init_thread = default_thread_init_exit_)
     (* function called in domain with index [i], to
        create the thread and push it into [receive_threads] *)
     let create_thread_in_domain () =
+      print_endline "ws_pool.ml";
       let thread = Thread.create (WL.worker_loop ~ops:worker_ops) st in
       (* send the thread from the domain back to us *)
       Bb_queue.push receive_threads (idx, thread)
modified   src/dpool/moonpool_dpool.ml
@@ -148,7 +148,14 @@ let () =
   assert (Domain_.is_main_domain ());
   let w = { th_count = Atomic_.make 1; q = Bb_queue.create () } in
   (* thread that stays alive *)
-  ignore (Thread.create (fun () -> work_ 0 w) () : Thread.t);
+  print_endline "moonpool_dpool.ml";
+  ignore
+    (Thread.create
+       (fun () ->
+         Thread.sigmask SIG_BLOCK [ Sys.sigusr2 ] |> ignore;
+         work_ 0 w)
+       ()
+      : Thread.t);
   domains_.(0) <- Lock.create (Some w, None)
 
 let[@inline] max_number_of_domains () : int = Array.length domains_
modified   src/fib/dune
@@ -2,7 +2,7 @@
  (name moonpool_fib)
  (public_name moonpool.fib)
  (synopsis "Fibers and structured concurrency for Moonpool")
- (libraries moonpool picos)
+ (libraries moonpool picos picos_io.select)
  (enabled_if
   (>= %{ocaml_version} 5.0))
  (flags :standard -open Moonpool_private -open Moonpool)
modified   src/private/domain_.ml
@@ -19,7 +19,11 @@ let recommended_number () = 1
 type t = Thread.t
 
 let get_id (self : t) : int = Thread.id self
-let spawn f : t = Thread.create f ()
+
+let spawn f : t =
+  print_endline "domain.ml";
+  Thread.create f ()
+
 let relax () = Thread.yield ()
 let join = Thread.join
 let is_main_domain () = true

And then I also added a print_endline call to the point where Picos_io_select.configure () was called. You could also add a print_endline to the point where moonpool now blocks signals. You want to ensure that neither threads nor domains are spawned before that point.

@polytypic

polytypic commented Oct 9, 2024

I've just pinned all picos and moonpool packages to the main branch [...] I still hit this same output

So, if you still get the Not_set exception, that would seem to indicate that the source of the TLS.get_exn call is not in Picos but somewhere inside Moonpool, perhaps.

the test suite which originally revealed it still triggers it within 10-50 seconds of looping

Would it be possible to get access to that original test suite or something trimmed from it that triggers the issue? I could try to find the place where a thread/domain is spawned before the signal mask is set and/or identify the place where TLS.get_exn is being called.

@amongonz
Author

amongonz commented Oct 9, 2024

Would it be possible to get access to that original test suite or something trimmed from it that triggers the issue?

I'll try to share a more faithful repro shortly, but I suspect my project's tests fail sooner simply because they do more work per test and run more tests per dune test invocation.

The "test suite" is just a few cram tests for a barebones LSP server I'm working on: your classic read loop reading bytes from stdin into a buffer until it can parse a message, then replying on stdout. I'm currently trying out fiber libraries to implement more complex upcoming features, but for now the IO and tasking is really that simple.

@polytypic

@amongonz Any news on a potential repro case? I could take a look.

@amongonz
Author

amongonz commented Oct 23, 2024

@polytypic I've fixed the repro in https://github.com/amongonz/picos-issue to perform some "actual work" (copy 1 MiB from /dev/zero to /dev/null in chunks of up to 1 KiB) and added a script to easily run the above test loop.

Once the test executable is built, running run-test-loop.sh consistently triggers the cram test failure within seconds. I'm running all moonpool and picos packages pinned to their main branches.
