Spurious Thread_local_storage.Not_set exception #35
That's unfortunate (and a bit weird; afaict literally all calls to …). Also: try to …
I've got a repro: I've learned that it happens even on a single fiber, and the exception isn't even being raised from …

```ocaml
let () =
  Printexc.record_backtrace true;
  let null = Moonpool_io.Unix.openfile "/dev/null" [ O_RDWR; O_CLOEXEC ] 0 in
  Fun.protect
    ~finally:(fun () -> Moonpool_io.Unix.close null)
    (fun () ->
      Moonpool_fib.main (fun _ ->
          let rec loop () =
            let buf = Bytes.create 1024 in
            match Moonpool_io.Unix.read null buf 0 (Bytes.length buf) with
            | 0 -> ()
            | read_n ->
              ignore (Moonpool_io.Unix.write null buf 0 read_n : int);
              loop ()
          in
          loop ()))
```

Making a cram test for this file and running it in a loop on bash …: most runs don't print an exception, but occasionally the test fails with:
```
$ ./main.exe
+ Thread 1 killed on uncaught exception Thread_local_storage.Not_set
+ Called from Moonpool_dpool.work_.main_loop in file "src/dpool/moonpool_dpool.pp.ml", line 102, characters 12-29
+ Called from Thread.create.(fun) in file "thread.ml", line 48, characters 8-14
```
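The cram test itself can be as small as the sketch below (file name and exact wording assumed); the expected-output section under the command is left empty, so any printed backtrace shows up as a diff and fails the test:

```
No backtraces:

  $ ./main.exe
```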
My opam 2.1.5 didn't like that: …
Could you try adding a call / top-level side effect

```ocaml
let () = Picos_io_select.configure () (* Can only be called once *)
```

or

```ocaml
let () = Picos_io_select.check_configured () (* Can be called multiple times *)
```

to the program, such that it will be executed on the main thread before any threads or domains are spawned (and before using …)? Either call will (among other things) configure signal masks such that the signal used by the … In the latest … instead of letting the …

BTW, for any new file descriptors that you create (and don't pass to other processes), it is recommended to put them into non-blocking mode. This should do it:

```diff
   let null = Moonpool_io.Unix.openfile "/dev/null" [ O_RDWR; O_CLOEXEC ] 0 in
+  Moonpool_io.Unix.set_nonblock null;
   Fun.protect
```

The underlying …
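For concreteness, here is a minimal sketch of where such a call could go in the repro above, assuming the repro lives in a single `main.ml`. Note that top-level initializers in linked libraries still run before this module's own top-level code, so depending on link order this may execute after a library has already spawned a thread:

```ocaml
(* Run the Picos IO configuration on the main thread, before the rest of
   the program spawns any threads or domains. *)
let () = Picos_io_select.check_configured ()

let () =
  Printexc.record_backtrace true;
  (* ... rest of the repro unchanged ... *)
  ()
```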
Checking my tests again, I find it suspicious that the exception doesn't leak or crash any IO; it just prints that backtrace alongside the actual output, and the tests otherwise terminate successfully. To me, this feels more like a forgotten print.
I just tried both calls with the above repro, but neither fixed the spurious backtraces.
Noted, although the software I was testing moonpool and picos with only uses the standard streams for now.
Hmm... I can't seem to reproduce the issue. Here is the program I tried: …

It runs fine on an Ubuntu GitHub action, and it also produces the same output on my macOS laptop. You should be able to clone the repo and run …

It also runs just fine without the … Note that you should have your entire moonpool application inside the … Also, …
@polytypic I just cloned your repo: if I run …
Sadly the loop doesn't stop on its own because the exit code is still 0; a cram test makes it easier to find.
This is why I mentioned these are spurious and ran these tests in a loop: the trigger isn't deterministic; sometimes it refuses to show up for a minute, and then it shows up several times in a row.
@polytypic I've forked your repo and changed the test suite to make it easier. Depending on luck, it takes from a few seconds to a few minutes to reproduce:

```
$ time while dune test --force ; do :; done
File "test/test.t", line 1, characters 0-0:
diff --git a/_build/.sandbox/81e12ad9a439dc22a3d38fb96cf5752a/default/test/test.t b/_build/.sandbox/81e12ad9a439dc22a3d38fb96cf5752a/default/test/test.t.corrected
index e4b0053..4adc3e0 100644
--- a/_build/.sandbox/81e12ad9a439dc22a3d38fb96cf5752a/default/test/test.t
+++ b/_build/.sandbox/81e12ad9a439dc22a3d38fb96cf5752a/default/test/test.t.corrected
@@ -4,3 +4,6 @@ No backtraces:
   Read...
   Got 0
   OK
+  Thread 1 killed on uncaught exception Thread_local_storage.Not_set
+  Called from Moonpool_dpool.work_.main_loop in file "src/dpool/moonpool_dpool.pp.ml", line 102, characters 12-29
+  Called from Thread.create.(fun) in file "thread.ml", line 48, characters 8-14

real	0m12.307s
user	0m4.394s
sys	0m3.437s
```
Yes, it seems a thread is created in …

```ocaml
(* special case for main domain: we start a worker immediately *)
let () =
  assert (Domain_.is_main_domain ());
  let w = { th_count = Atomic_.make 1; q = Bb_queue.create () } in
  (* thread that stays alive *)
  print_endline "moonpool_dpool.ml";
  ignore (Thread.create (fun () -> work_ 0 w) () : Thread.t);
  domains_.(0) <- Lock.create (Some w, None)
```

@c-cube Would it be possible to delay that to happen only when you call something from Moonpool? (Personally I think one should avoid static initializers like the plague. They always come back to bite you at some point.)

Alternatively, do you really want that thread to receive POSIX signals? Maybe you could just block all the signals?

```diff
modified src/dpool/moonpool_dpool.ml
@@ -148,7 +148,14 @@ let () =
   assert (Domain_.is_main_domain ());
   let w = { th_count = Atomic_.make 1; q = Bb_queue.create () } in
   (* thread that stays alive *)
-  ignore (Thread.create (fun () -> work_ 0 w) () : Thread.t);
+  ignore
+    (Thread.create
+       (fun () ->
+         Thread.sigmask SIG_BLOCK [ Sys.sigusr2 ] |> ignore;
+         work_ 0 w)
+       ()
+    : Thread.t);
   domains_.(0) <- Lock.create (Some w, None)

 let[@inline] max_number_of_domains () : int = Array.length domains_
```

The above prevents the issue, but blocking just …
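A broader variant of the same idea, sketched here under the assumption that blocking a larger set of signals in the worker is acceptable (the signal list is illustrative, not exhaustive; `Thread.sigmask` returns the previous mask, which is discarded):

```ocaml
(* Block a wider set of signals in the worker thread instead of just
   SIGUSR2, so no asynchronous signal can interrupt it. *)
let block_signals_in_worker () : unit =
  ignore
    (Thread.sigmask Unix.SIG_BLOCK
       [ Sys.sigalrm; Sys.sigchld; Sys.sighup; Sys.sigint; Sys.sigpipe;
         Sys.sigquit; Sys.sigterm; Sys.sigusr1; Sys.sigusr2; Sys.sigvtalrm ]
      : int list)
```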
I don't know how to run this static initializer from the main domain in a lazy way, alas. If I make it into some sort of lazy block, it might be called from the wrong domain :-(. I fully blame the design of … Blocking common signals from these threads seems a lot easier.
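To illustrate the wrong-domain hazard with a minimal sketch (not moonpool code): `Lazy.force` runs the body on whichever thread or domain forces the lazy first, which need not be the main domain:

```ocaml
(* The assertion can fail if a worker domain happens to force the lazy
   before the main domain does. *)
let pool_init : unit lazy_t =
  lazy
    (assert (Domain.is_main_domain ());
     ignore (Thread.create (fun () -> ()) () : Thread.t))

let ensure_init () = Lazy.force pool_init
```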
Blocking signals seems to work for me; can you confirm, @amongonz?
@c-cube Not entirely: I pinned moonpool to simon/fix-35 and retried my tests, but the proposed fix kinda turns it into a heisenbug. While the above repro has looped for almost an hour without triggering, the test suite which originally revealed it still triggers it within 10-50 seconds of looping (it used to take 1-10 s), so the error is harder to trigger but still there. I'm not sure how to make a stronger repro though, maybe giving it an actual workload. Luckily I was just trying out moonpool/picos, so I'm not personally in a hurry to find a fix.

Fun fact: trying to fence the source of our mystery exception, I realised that if I modify moonpool_dpool.ml here, I can consistently catch it when it happens, despite …

```diff
-    loop ()
+    try loop ()
+    with Thread_local_storage.Not_set as exn ->
+      Format.eprintf "MYSTERY: %s\n%s@." (Printexc.to_string exn)
+        (Printexc.get_backtrace ());
+      raise exn
```

But wrapping the body of …
I'm not aware of any signal handling in moonpool itself (besides the new one). Picos does have some signal handling though. I have absolutely no idea where this is coming from 😬. Async exceptions are a possibility, but even then, where does the TLS access come from?
One unguarded TLS access, which slipped into Picos 0.5.0, is in a signal handler. Have you also pinned to use the latest main of Picos? The latest main will raise with an error message instead of letting the …

If it is the signal handler, then it means that some thread / domain has been spawned before the signal mask has been set. At what point is the thread / domain running the …
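For illustration, the "guarded" pattern could look like the sketch below; this is not the actual Picos fix, and the `get_exn` accessor name is assumed from the thread-local-storage package:

```ocaml
(* Read a TLS slot defensively: a signal handler running on a thread
   that never set the value falls back to a default instead of killing
   the whole thread with Not_set. *)
let get_or_default (tls : 'a Thread_local_storage.t) ~(default : 'a) : 'a =
  try Thread_local_storage.get_exn tls
  with Thread_local_storage.Not_set -> default
```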
@polytypic I had not pinned picos. To check this isn't already fixed to some degree, I've just pinned all picos and moonpool packages to the main branch, reinstalled them on opam, rebuilt my project from a clean state and tested again: I still hit the same output, possibly even sooner now (which is great for debugging!)
So, the way I found the earlier …

```diff
Unstaged changes (6)
modified src/core/fifo_pool.ml
@@ -164,6 +164,7 @@ let create ?on_init_thread ?on_exit_thread ?on_exn ?around_task ?num_threads
   (* function called in domain with index [i], to
      create the thread and push it into [receive_threads] *)
   let create_thread_in_domain () =
+    print_endline "fifo_pool.ml";
     let st = { idx = i; dom_idx; st = pool } in
     let thread = Thread.create (WL.worker_loop ~ops:worker_ops) st in
     (* send the thread from the domain back to us *)
modified src/core/moonpool.ml
@@ -3,6 +3,7 @@ open Types_
 exception Shutdown = Runner.Shutdown

 let start_thread_on_some_domain f x =
+  print_endline "moonpool.ml";
   let did = Random.int (Domain_pool_.max_number_of_domains ()) in
   Domain_pool_.run_on_and_wait did (fun () -> Thread.create f x)
modified src/core/ws_pool.ml
@@ -310,6 +310,7 @@ let create ?(on_init_thread = default_thread_init_exit_)
   (* function called in domain with index [i], to
      create the thread and push it into [receive_threads] *)
   let create_thread_in_domain () =
+    print_endline "ws_pool.ml";
     let thread = Thread.create (WL.worker_loop ~ops:worker_ops) st in
     (* send the thread from the domain back to us *)
     Bb_queue.push receive_threads (idx, thread)
modified src/dpool/moonpool_dpool.ml
@@ -148,7 +148,14 @@ let () =
   assert (Domain_.is_main_domain ());
   let w = { th_count = Atomic_.make 1; q = Bb_queue.create () } in
   (* thread that stays alive *)
-  ignore (Thread.create (fun () -> work_ 0 w) () : Thread.t);
+  print_endline "moonpool_dpool.ml";
+  ignore
+    (Thread.create
+       (fun () ->
+         Thread.sigmask SIG_BLOCK [ Sys.sigusr2 ] |> ignore;
+         work_ 0 w)
+       ()
+    : Thread.t);
   domains_.(0) <- Lock.create (Some w, None)

 let[@inline] max_number_of_domains () : int = Array.length domains_
modified src/fib/dune
@@ -2,7 +2,7 @@
  (name moonpool_fib)
  (public_name moonpool.fib)
  (synopsis "Fibers and structured concurrency for Moonpool")
- (libraries moonpool picos)
+ (libraries moonpool picos picos_io.select)
  (enabled_if
   (>= %{ocaml_version} 5.0))
  (flags :standard -open Moonpool_private -open Moonpool)
modified src/private/domain_.ml
@@ -19,7 +19,11 @@ let recommended_number () = 1
 type t = Thread.t

 let get_id (self : t) : int = Thread.id self
-let spawn f : t = Thread.create f ()
+
+let spawn f : t =
+  print_endline "domain.ml";
+  Thread.create f ()
+
 let relax () = Thread.yield ()
 let join = Thread.join
 let is_main_domain () = true
```

And then I also added a …
So, if you still get the …

Would it be possible to get access to that original test suite, or something trimmed from it that triggers the issue? I could try to find the place where a thread/domain is spawned before the signal mask is set and/or identify the place where …
I'll try to share a more faithful repro shortly, but I suspect my project tests fail sooner simply because each test does more work and there are more tests per run of … The "test suite" is just a few cram tests for a barebones LSP server I'm working on: your classic read loop, reading bytes from stdin into a buffer until it can parse a message, then replying on stdout. I'm currently trying out fiber libraries to implement more complex upcoming features, but for now the IO and tasking really is that simple.
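For context, a sketch of the kind of read loop described, not the actual server code (the framing follows the LSP base protocol's `Content-Length` header):

```ocaml
(* Read one "Content-Length:"-framed message from [ic]: scan headers up
   to the blank line, then read exactly that many bytes of payload. *)
let read_message (ic : in_channel) : string =
  let rec read_headers len =
    match input_line ic with
    | "" | "\r" -> len (* blank line ends the headers *)
    | line ->
      (match String.index_opt line ':' with
      | Some i when String.sub line 0 i = "Content-Length" ->
        let v = String.sub line (i + 1) (String.length line - i - 1) in
        read_headers (int_of_string (String.trim v))
      | _ -> read_headers len)
  in
  let len = read_headers 0 in
  really_input_string ic len
```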
@amongonz Any news on a potential repro case? I could take a look.
@polytypic I've fixed the repro in https://github.com/amongonz/picos-issue to perform some "actual work" (copy 1 MiB from …). Once the test executable is built, by running …
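In case it helps to picture the workload, here is a hypothetical version of that copy loop, assuming `/dev/zero` as the source (the elided details in the repro repo may differ) and the same `Moonpool_io.Unix` calls as the original repro:

```ocaml
(* Copy 1 MiB from /dev/zero to /dev/null in 64 KiB chunks, to give each
   test run some actual IO work. *)
let copy_1mib () : unit =
  let src = Moonpool_io.Unix.openfile "/dev/zero" [ O_RDONLY; O_CLOEXEC ] 0 in
  let dst = Moonpool_io.Unix.openfile "/dev/null" [ O_WRONLY; O_CLOEXEC ] 0 in
  let buf = Bytes.create 65536 in
  let remaining = ref (1024 * 1024) in
  while !remaining > 0 do
    let n =
      Moonpool_io.Unix.read src buf 0 (min !remaining (Bytes.length buf))
    in
    ignore (Moonpool_io.Unix.write dst buf 0 n : int);
    remaining := !remaining - n
  done;
  Moonpool_io.Unix.close src;
  Moonpool_io.Unix.close dst
```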
I've been getting spurious `Thread_local_storage.Not_set` exceptions coming from dpool when doing IO with moonpool-io inside child fibers running on the same `Moonpool_fib.main` runner as the parent. My tests work fine most of the time, except for the seldom run that fails. When it fails, however, I always get the same trace (on moonpool 0.7):