
[spring 1.0] avoid spurious access violation on thread due to other timed out thread #30

Merged · 2 commits · Jan 21, 2025

Conversation

spoonincode (Member):
(At least with how spring uses EOS VM) all executing threads share the same executable pages. This causes a conflict when one executing thread hits a timeout condition, because a timeout sets the pages to non-executable, which forces execution to stop not just on the timed out thread but on all executing threads. The non-timed out threads treat this failure as an access violation: since they weren't expecting a timeout, the failure is assumed to be due to accessing invalid wasm memory (etc).
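
To make the failure mode concrete, here is a minimal standalone sketch of the mechanism (not spring or EOS VM code; the page size and the x86-64 `ret` encoding are assumptions for illustration) showing why revoking execute permission on shared code pages stops every thread running from them:

#include <sys/mman.h>
#include <chrono>
#include <cstring>
#include <thread>

int main() {
   // one executable mapping shared by all executing threads, as with the
   // compiled contract code in spring
   void* code = mmap(nullptr, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
   unsigned char ret_insn = 0xc3;            // x86-64 `ret`
   std::memcpy(code, &ret_insn, 1);

   auto run = [code] {
      auto fn = reinterpret_cast<void (*)()>(code);
      for (;;) fn();                         // execute from the shared pages
   };
   std::thread a(run), b(run);

   std::this_thread::sleep_for(std::chrono::milliseconds(100));
   // "timing out" one thread by revoking execute permission faults *both*
   // threads; b never exceeded any deadline but still takes a SIGSEGV
   mprotect(code, 4096, PROT_NONE);
   a.join();
   b.join();                                 // never reached; both threads fault
}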

One way to resolve this would be for each thread to maintain its own EOS VM "instance", but we have expressly avoided that in spring to avoid re-compiling contracts (and incurring all that overhead) on each thread, of which there can be many. Another way would be for each thread to maintain its own mapping of the shared compilation memory, similar to how EOS VM OC works. Unfortunately, that feels like quite a refactor in the current design.
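
For context, a hedged sketch of that second alternative (the Linux-specific memfd mechanism here is an assumption for illustration; EOS VM OC's actual implementation may differ): the compiled code lives in one shared backing object, and each thread maps its own executable view, so page permissions can be changed per thread without recompiling:

#include <sys/mman.h>
#include <unistd.h>

int main() {
   const size_t sz = 4096;
   int fd = memfd_create("compiled_code", 0);   // one shared compilation result
   ftruncate(fd, sz);
   // ...JIT output would be written through a separate PROT_WRITE view of fd...

   // each executing thread creates its own executable view of the same memory
   void* view_a = mmap(nullptr, sz, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0);
   void* view_b = mmap(nullptr, sz, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0);

   // timing out the thread running from view_a leaves view_b executable, so
   // other threads are unaffected
   mprotect(view_a, sz, PROT_NONE);

   munmap(view_a, sz);
   munmap(view_b, sz);
   close(fd);
}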

The approach this PR takes is, upon signal delivery, to identify whether the access violation occurred in the code pages and, if so, check a thread-local boolean that indicates whether this thread's timed_run has timed out. If the boolean is false, any access violation within the code pages on this thread must have been caused by a different thread timing out, so the faulting thread simply keeps retrying execution: the other thread's timed_run will eventually reset the memory permissions accordingly.
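
Condensed, the handler-side logic looks roughly like this (a sketch assembled from the hunks below; the handler name and surrounding control flow are simplified assumptions):

#include <atomic>
#include <csignal>
#include <cstddef>
#include <span>

inline thread_local std::atomic<bool> timed_run_is_timed_out{false};
inline thread_local std::span<std::byte> code_memory_range;

void fault_handler(int sig, siginfo_t* info, void*) {
   auto* fault = static_cast<std::byte*>(info->si_addr);
   const bool in_code_pages = fault >= code_memory_range.data() &&
                              fault <  code_memory_range.data() + code_memory_range.size();
   if (sig == SIGSEGV && in_code_pages &&
       !timed_run_is_timed_out.load(std::memory_order_acquire)) {
      // some other thread's timeout flipped the shared code pages to
      // non-executable; returning retries the faulting instruction until
      // that thread's timed_run() restores permissions
      return;
   }
   // ...otherwise handle as a real access violation or this thread's own timeout...
}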

The nice thing about this approach is that it ends up being a fairly small change. The downside is that the non-timed out threads enter a "SEGV storm" for a brief period before the timed out thread's timed_run restores permissions. That doesn't feel like a deal breaker, though. The approach does, however, introduce a layering violation between timed_run and the signal handling code due to the use of a global thread-local, which is unfortunate.

@@ -305,20 +305,21 @@ namespace eosio { namespace vm {

    template<typename Watchdog, typename F>
    inline void timed_run(Watchdog&& wd, F&& f) {
-      std::atomic<bool> _timed_out = false;
+      std::atomic<bool>& _timed_out = timed_run_is_timed_out;
Member:

This took me a bit to understand. Might be worth a comment explaining the combo of thread-local and atomic. Maybe something like: // timed_run_is_timed_out is a thread-local referenced here; the timer calls from a different thread, hence the atomic

Member:

Also, the name timed_run_is_timed_out at first made me think timed_run was another function that had timed out. Maybe name it timed_run_has_timed_out.

spoonincode (Member, Author):

I changed it to has, but I'm 50/50 on what the correct wording should be here. The timed out state is very transient and reset quickly, which makes it seem more like a "current status" than a stickier state change that latches.
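
For readers following along, a small runnable sketch of why both qualifiers are needed (the flag's declaration shape is taken from the hunk above; the spin loop stands in for contract execution): thread_local gives each executing thread its own flag, while atomic is required because the watchdog's deadline timer stores to the flag from a different thread than the one that reads it:

#include <atomic>
#include <thread>

// each executing thread gets its own flag (thread_local)...
inline thread_local std::atomic<bool> timed_run_has_timed_out{false};

int main() {
   // ...but the deadline timer fires on a different thread, so the store and
   // the executing thread's load must be atomic. Capturing a reference is how
   // the timer thread reaches *this* thread's thread_local instance:
   std::atomic<bool>& timed_out = timed_run_has_timed_out;
   std::thread timer([&timed_out] {
      timed_out.store(true, std::memory_order_release);    // watchdog expiry
   });
   while (!timed_out.load(std::memory_order_acquire)) { }  // "executing" thread
   timer.join();
}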


@@ -18,7 +18,13 @@ namespace eosio { namespace vm {

    inline thread_local std::atomic<sigjmp_buf*> signal_dest{nullptr};

    __attribute__((visibility("default")))
-   inline thread_local std::vector<std::span<std::byte>> protected_memory_ranges;
+   inline thread_local std::span<std::byte> code_memory_range;
Member:

Why not just name code_memory_range as code_range?

spoonincode (Member, Author):

In this case I think it mirrors the naming of the existing get_*_span() calls: s/_span$/_memory_range/ (i.e. keeping the prefix intact).

// a SEGV in the code range when timed_run_is_timed_out=false is due to a _different_ thread's execution activating a deadline
// timer. Return and retry executing the same code again. Eventually timed_run() on the other thread will reset the page
// permissions and progress on this thread can continue
if (sig == SIGSEGV && timed_run_is_timed_out.load(std::memory_order_acquire) == false)
   return;
spoonincode (Member, Author):

I am suspicious this needs to be SIGBUS on macOS, but so far I don't have a good environment to test it with. I think we may need to reevaluate this in the future.
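
If macOS does report these execute-permission faults as SIGBUS (untested, per the comment above), one way to hedge is to install the same handler for both signals; this is an illustrative sketch, not necessarily how the codebase registers its handlers:

#include <csignal>

void install_fault_handler(void (*handler)(int, siginfo_t*, void*)) {
   struct sigaction sa{};
   sa.sa_sigaction = handler;
   sa.sa_flags = SA_SIGINFO;
   sigemptyset(&sa.sa_mask);
   sigaction(SIGSEGV, &sa, nullptr);
   sigaction(SIGBUS, &sa, nullptr);   // macOS may deliver SIGBUS instead
}

// the code-range check would then accept either signal:
//    if ((sig == SIGSEGV || sig == SIGBUS) && ...)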

spoonincode merged commit 02751e0 into release/spring-1.0 on Jan 21, 2025 (10 checks passed)
spoonincode deleted the threaded_timedout_fix_s10 branch on January 21, 2025 at 20:14
Successfully merging this pull request may close these issues:
Test failure: performance_test_basic_read_only_trxs