[spring 1.0] avoid spurious access violation on thread due to other timed out thread #30
Conversation
include/eosio/vm/backend.hpp
Outdated
```diff
@@ -305,20 +305,21 @@ namespace eosio { namespace vm {
       template<typename Watchdog, typename F>
       inline void timed_run(Watchdog&& wd, F&& f) {
-         std::atomic<bool> _timed_out = false;
+         std::atomic<bool>& _timed_out = timed_run_is_timed_out;
```
This took me a bit to understand. Might be worth a comment explaining the combo of thread-local and atomic. Maybe something like: // timed_run_is_timed_out thread-local referenced here, the timer calls from a different thread hence the atomic
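For illustration, here is a minimal standalone sketch of the combination being discussed: a thread_local flag owned by the executing thread, written by a watchdog running on a different thread (hence the atomic). The names, the toy watchdog, and timed_run_sketch are illustrative only, not the actual eos-vm code.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

// One flag per executing thread: thread_local so every thread running
// timed_run() has its own timeout state; atomic because the watchdog timer
// writes it from a *different* thread.
inline thread_local std::atomic<bool> timed_run_has_timed_out{false};

template<typename F>
void timed_run_sketch(std::chrono::milliseconds deadline, F&& f) {
   std::atomic<bool>& timed_out = timed_run_has_timed_out; // reference handed to the timer thread
   timed_out.store(false, std::memory_order_release);

   // Toy watchdog: always fires after the deadline. A real watchdog would be
   // cancelled when f() returns early; that machinery is omitted here.
   std::thread watchdog([&timed_out, deadline] {
      std::this_thread::sleep_for(deadline);
      timed_out.store(true, std::memory_order_release); // cross-thread write, hence atomic
   });

   f(); // runs on the calling thread, which owns the thread_local
   watchdog.join();

   if (timed_out.load(std::memory_order_acquire))
      std::puts("watchdog fired (in this toy version it always does after the deadline)");
}

int main() {
   timed_run_sketch(std::chrono::milliseconds(10),
                    [] { std::this_thread::sleep_for(std::chrono::milliseconds(20)); });
}
```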
Also, the name timed_run_is_timed_out made me at first think timed_run was another function called timed_out. Maybe name it timed_run_has_timed_out instead.
I changed it to has, but I'm 50/50 on what the correct wording should be here. The timed-out state is very transient and is reset quickly, which makes it seem more like a "current status" than a stickier state change that latches.
include/eosio/vm/signals.hpp
```cpp
@@ -18,7 +18,13 @@ namespace eosio { namespace vm {
   inline thread_local std::atomic<sigjmp_buf*> signal_dest{nullptr};

   __attribute__((visibility("default")))
   inline thread_local std::vector<std::span<std::byte>> protected_memory_ranges;
   inline thread_local std::span<std::byte> code_memory_range;
```
Why not just name code_memory_range as code_range?
In this case I think it mirrors the naming of the existing get_*_span() calls: s/_span$/_memory_range/ (i.e. keeping the prefix intact).
include/eosio/vm/signals.hpp
Outdated
```cpp
// a SEGV in the code range when timed_run_is_timed_out=false is due to a _different_ thread's execution activating a deadline
// timer. Return and retry executing the same code again. Eventually timed_run() on the other thread will reset the page
// permissions and progress on this thread can continue
if (sig == SIGSEGV && timed_run_is_timed_out.load(std::memory_order_acquire) == false)
```
I am suspicious this needs to be SIGBUS on macOS. But so far I don't have a good environment to test it with. I think we may need to reevaluate it in the future.
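If macOS does report these faults as SIGBUS (untested, per the comment above), the check might be widened to something like the following hypothetical helper; the flag name follows the snippet above.

```cpp
#include <atomic>
#include <csignal>

// Flag name follows the snippet above; declared here only so the example is self-contained.
inline thread_local std::atomic<bool> timed_run_is_timed_out{false};

// Hypothetical check: treat SIGBUS the same as SIGSEGV for the
// "another thread timed out" retry path, in case macOS reports the fault on
// the now non-executable code pages as SIGBUS instead of SIGSEGV.
inline bool spurious_fault_from_other_threads_timeout(int sig) {
   return (sig == SIGSEGV || sig == SIGBUS) &&
          timed_run_is_timed_out.load(std::memory_order_acquire) == false;
}
```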
(At least with how spring uses EOS VM) all executing threads share the same executable pages. This causes a conflict when one executing thread hits a timeout condition, because a timeout sets the pages to non-executable, which forces execution to stop not just on the timed-out thread but on all executing threads. The non-timed-out threads treat this failure as an access violation: since they weren't expecting a timeout, the failure is assumed to be due to accessing invalid wasm memory (etc.).
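A rough, hypothetical illustration of that mechanism, assuming the timeout is enforced by revoking page permissions with mprotect(2) (the function names here are made up):

```cpp
#include <cstddef>
#include <sys/mman.h>

// Watchdog side: revoking execute permission on the *shared* code pages stops
// the timed-out thread, but it also faults every other thread that happens to
// be executing the same pages at that moment.
inline void stop_shared_code(void* code_pages, std::size_t size) {
   mprotect(code_pages, size, PROT_NONE);
}

// Later, timed_run() on the timed-out thread restores the permissions so the
// innocent threads can make progress again.
inline void restore_shared_code(void* code_pages, std::size_t size) {
   mprotect(code_pages, size, PROT_READ | PROT_EXEC);
}
```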
One way to resolve this is to have each thread maintain its own EOS VM "instance", but we have expressly avoided that in spring so as not to re-compile contracts (and incur all that overhead) on each thread, of which there can be many. Another way is for each thread to maintain its own mapping of shared compilation memory, similar to how EOS VM OC works, but unfortunately that feels like quite a refactor of the current design.
The approach this PR takes is, upon signal delivery, to identify whether the access violation was in the code pages and, if so, check a thread-local boolean that indicates whether timed_run is timed out or not. If this boolean is false, any access violation within the code pages on this thread must have been due to a different thread timing out, so the faulting thread simply keeps retrying execution: the other thread's timed_run will eventually reset the memory permissions accordingly.

The nice thing about this approach is that it ends up being a fairly small change. The downside is that the non-timed-out threads will enter a "SEGV storm" for a tiny period of time before the timed-out thread's timed_run restores the permissions. That doesn't feel like a deal breaker, though. The approach does, however, introduce a layering violation between timed_run and the signal handling code due to the use of a global thread local, which is unfortunate.
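For reference, a condensed, self-contained sketch of the retry path described above. It is not the exact eos-vm handler: the in_code_range helper and handler_sketch are illustrative, though timed_run_is_timed_out and code_memory_range mirror the declarations in the diffs above.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <signal.h>
#include <span>

// These mirror the declarations shown in the diffs above.
inline thread_local std::atomic<bool> timed_run_is_timed_out{false};
inline thread_local std::span<std::byte> code_memory_range;

// Illustrative helper: does the faulting address fall inside the shared code pages?
inline bool in_code_range(const void* addr) {
   auto p = reinterpret_cast<std::uintptr_t>(addr);
   auto b = reinterpret_cast<std::uintptr_t>(code_memory_range.data());
   return p >= b && p < b + code_memory_range.size();
}

// Condensed version of the handler logic; the real handler also deals with the
// protected_memory_ranges / signal_dest machinery visible in the diffs above.
inline void handler_sketch(int sig, siginfo_t* info, void* /*ucontext*/) {
   if (sig == SIGSEGV && in_code_range(info->si_addr) &&
       !timed_run_is_timed_out.load(std::memory_order_acquire)) {
      // A different thread's deadline revoked execute permission on the shared
      // pages. Returning from the handler retries the faulting instruction;
      // once that thread's timed_run() restores the permissions, this thread
      // resumes making progress.
      return;
   }
   // Otherwise the fault really belongs to this thread: take the normal
   // access-violation error path.
}
```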