
[spring 1.0] avoid spurious access violation on thread due to other timed out thread #30

Merged · 2 commits · Jan 21, 2025

Conversation

spoonincode (Member):
(At least with how spring uses EOS VM) all executing threads share the same executable pages. This causes a conflict when one executing thread hits a timeout condition, because a timeout sets the pages to non-executable, which forces execution to stop not just on the timed out thread but on all executing threads. The non-timed out threads treat this failure as an access violation: since they weren't expecting a timeout, the failure is assumed to be due to accessing invalid wasm memory (etc).
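
To make the failure mode concrete, here is a minimal standalone sketch of the mechanism (not spring or EOS VM code; the page size and the x86-64 `ret` encoding are assumptions for illustration) showing why revoking execute permission on shared code pages stops every thread running from them:

#include <sys/mman.h>
#include <chrono>
#include <cstring>
#include <thread>

int main() {
   // one executable mapping shared by all executing threads, as with the
   // compiled contract code in spring
   void* code = mmap(nullptr, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
   unsigned char ret_insn = 0xc3;            // x86-64 `ret`
   std::memcpy(code, &ret_insn, 1);

   auto run = [code] {
      auto fn = reinterpret_cast<void (*)()>(code);
      for (;;) fn();                         // execute from the shared pages
   };
   std::thread a(run), b(run);

   std::this_thread::sleep_for(std::chrono::milliseconds(100));
   // "timing out" one thread by revoking execute permission faults *both*
   // threads; b never exceeded any deadline but still takes a SIGSEGV
   mprotect(code, 4096, PROT_NONE);
   a.join();
   b.join();                                 // never reached; both threads fault
}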

One way to resolve this would be for each thread to maintain its own EOS VM "instance", but we have expressly avoided that in spring to avoid re-compiling contracts (and incurring all that overhead) on each thread, of which there can be many. Another way would be for each thread to maintain its own mapping of the shared compilation memory, similar to how EOS VM OC works. Unfortunately, that feels like quite a refactor in the current design.
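
For context, a hedged sketch of that second alternative (the Linux-specific memfd mechanism here is an assumption for illustration; EOS VM OC's actual implementation may differ): the compiled code lives in one shared backing object, and each thread maps its own executable view, so page permissions can be changed per thread without recompiling:

#include <sys/mman.h>
#include <unistd.h>

int main() {
   const size_t sz = 4096;
   int fd = memfd_create("compiled_code", 0);   // one shared compilation result
   ftruncate(fd, sz);
   // ...JIT output would be written through a separate PROT_WRITE view of fd...

   // each executing thread creates its own executable view of the same memory
   void* view_a = mmap(nullptr, sz, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0);
   void* view_b = mmap(nullptr, sz, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0);

   // timing out the thread running from view_a leaves view_b executable, so
   // other threads are unaffected
   mprotect(view_a, sz, PROT_NONE);

   munmap(view_a, sz);
   munmap(view_b, sz);
   close(fd);
}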

The approach this PR takes is, upon signal delivery, to identify whether the access violation occurred in the code pages and, if so, check a thread-local boolean that indicates whether this thread's timed_run has timed out. If the boolean is false, any access violation within the code pages on this thread must have been caused by a different thread timing out, so the faulting thread simply keeps retrying execution: the other thread's timed_run will eventually reset the memory permissions accordingly.
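
Condensed, the handler-side logic looks roughly like this (a sketch assembled from the hunks below; the handler name and surrounding control flow are simplified assumptions):

#include <atomic>
#include <csignal>
#include <cstddef>
#include <span>

inline thread_local std::atomic<bool> timed_run_is_timed_out{false};
inline thread_local std::span<std::byte> code_memory_range;

void fault_handler(int sig, siginfo_t* info, void*) {
   auto* fault = static_cast<std::byte*>(info->si_addr);
   const bool in_code_pages = fault >= code_memory_range.data() &&
                              fault <  code_memory_range.data() + code_memory_range.size();
   if (sig == SIGSEGV && in_code_pages &&
       !timed_run_is_timed_out.load(std::memory_order_acquire)) {
      // some other thread's timeout flipped the shared code pages to
      // non-executable; returning retries the faulting instruction until
      // that thread's timed_run() restores permissions
      return;
   }
   // ...otherwise handle as a real access violation or this thread's own timeout...
}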

The nice thing about this approach is that it ends up being a fairly small change. The downside is that the non-timed out threads enter a "SEGV storm" for a brief period before the timed out thread's timed_run restores permissions. That doesn't feel like a deal breaker, though. The approach does, however, introduce a layering violation between timed_run and the signal handling code due to the use of a global thread-local, which is unfortunate.

@@ -305,20 +305,21 @@ namespace eosio { namespace vm {

    template<typename Watchdog, typename F>
    inline void timed_run(Watchdog&& wd, F&& f) {
-      std::atomic<bool> _timed_out = false;
+      std::atomic<bool>& _timed_out = timed_run_is_timed_out;
Member:

This took me a bit to understand. Might be worth a comment explaining the combo of thread-local and atomic. Maybe something like: // timed_run_is_timed_out is a thread-local referenced here; the timer calls from a different thread, hence the atomic

Member:

Also, the name timed_run_is_timed_out at first made me think timed_run was another function that had timed out. Maybe name it timed_run_has_timed_out.

spoonincode (Member, Author):

I changed it to has, but I'm 50/50 on what the correct wording should be here. The timed out state is very transient and reset quickly, which makes it seem more like a "current status" than a stickier state change that latches.
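
For readers following along, a small runnable sketch of why both qualifiers are needed (the flag's declaration shape is taken from the hunk above; the spin loop stands in for contract execution): thread_local gives each executing thread its own flag, while atomic is required because the watchdog's deadline timer stores to the flag from a different thread than the one that reads it:

#include <atomic>
#include <thread>

// each executing thread gets its own flag (thread_local)...
inline thread_local std::atomic<bool> timed_run_has_timed_out{false};

int main() {
   // ...but the deadline timer fires on a different thread, so the store and
   // the executing thread's load must be atomic. Capturing a reference is how
   // the timer thread reaches *this* thread's thread_local instance:
   std::atomic<bool>& timed_out = timed_run_has_timed_out;
   std::thread timer([&timed_out] {
      timed_out.store(true, std::memory_order_release);    // watchdog expiry
   });
   while (!timed_out.load(std::memory_order_acquire)) { }  // "executing" thread
   timer.join();
}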


@@ -18,7 +18,13 @@ namespace eosio { namespace vm {

    inline thread_local std::atomic<sigjmp_buf*> signal_dest{nullptr};

    __attribute__((visibility("default")))
-   inline thread_local std::vector<std::span<std::byte>> protected_memory_ranges;
+   inline thread_local std::span<std::byte> code_memory_range;
Member:

Why not just name code_memory_range as code_range?

spoonincode (Member, Author):

In this case I think it mirrors the naming of the existing get_*_span() calls: s/_span$/_memory_range/ (i.e. keeping the prefix intact).

// a SEGV in the code range when timed_run_is_timed_out=false is due to a _different_ thread's execution activating a deadline
// timer. Return and retry executing the same code again. Eventually timed_run() on the other thread will reset the page
// permissions and progress on this thread can continue
if (sig == SIGSEGV && timed_run_is_timed_out.load(std::memory_order_acquire) == false)
   return;
spoonincode (Member, Author):

I am suspicious this needs to be SIGBUS on macOS, but so far I don't have a good environment to test it with. I think we may need to reevaluate this in the future.
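
If macOS does report these execute-permission faults as SIGBUS (untested, per the comment above), one way to hedge is to install the same handler for both signals; this is an illustrative sketch, not necessarily how the codebase registers its handlers:

#include <csignal>

void install_fault_handler(void (*handler)(int, siginfo_t*, void*)) {
   struct sigaction sa{};
   sa.sa_sigaction = handler;
   sa.sa_flags = SA_SIGINFO;
   sigemptyset(&sa.sa_mask);
   sigaction(SIGSEGV, &sa, nullptr);
   sigaction(SIGBUS, &sa, nullptr);   // macOS may deliver SIGBUS instead
}

// the code-range check would then accept either signal:
//    if ((sig == SIGSEGV || sig == SIGBUS) && ...)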

spoonincode merged commit 02751e0 into release/spring-1.0 on Jan 21, 2025 (10 checks passed)
spoonincode deleted the threaded_timedout_fix_s10 branch on January 21, 2025 at 20:14
Successfully merging this pull request may close these issues:
Test failure: performance_test_basic_read_only_trxs