-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sporadic panic in async-io thread with esp-sys-svc async TLS example #84
Comments
To be more concrete you never saw it before applying this optimization? |
To be honest, it seems to depend on code layout. It might trigger in other conditions, I have not managed to narrow it down. It smells like a race condition of some sort. |
cc @ivmarkov Any idea why this would be happening? |
The issue actually appears to be in our locking implementation, specifically |
No idea. Will try to reproduce once I'm back. |
@ivmarkov have you had any luck reproducing this ? Let me know if you need further input and testing from my part. |
Couldn't get to there yet. Sorry about that! Very likely will try to do something related to async-io in the next couple of days. |
I am running in the same issue as described by @jothan I have done some investigation and I believe the issue boils down to the pinning of event_listener In this case the address generated is not 32 bit aligned (ex : 0x3ffe860f not sure if this matters) but more importantly point back to an area of the stack where the first byte is not 0x0. Using GDB i can see that the value it points to is 0x3f (which is the MSB of an address used during the creation of the thread) since the remain bytes are 0 maybe 0xffe8610 would be the correct pinned address? let event_listener = EventListener::new();
let _ = event_listener.is_listening();
pin!(event_listener); The pinned address will be 32bit aligned pointing to a stack area where all bytes are 0 and the bug doesn't happen. For reference here are the two assembly dump of the function Hope that helps @ivmarkov |
@npmenard Thank you for the detailed analysis! I can reproduce the issue (though I haven't connected with GDB to examine the address), and I confirm that the dummy (Also, not sure if that matters, but I'm consistently reproducing the issue with opt-levels "s" and "z" (haven't tried with other opt levels yet, will do)). As for:
I assume by address 0x3ffe860f you mean the
Even more suspicious. This sounds a bit like a bug in the LLVM xtensa backend. @MabezDev - sorry for pulling you here, but do you think you and/or Andrei can help? There are assembly dumps attached to the previous comment. |
It's the address that would be printed when formatting event_listener with I have noticed something odd but I am not sure if it will shade some light, whenever the bug occurs the value printed by Lastly I am attaching a backtrace showing the code path that writes at the location in stack where ultimatley the Pin will point to. It happens during WindowOverflow8 exception is triggered when setting a thread local variable. Hopefully this helps a bit. Let me know if there is anything else I can do to help. |
With the new Rust 1.78 for ESP32 (in pre-release in the rust-build project), debug assertions for std are now enabled and seem to catch a similar alignment problem with an assertion in core::ptr::read::precondition_check. Full backtrace: backtrace.txt Partial backtrace:
|
Looking at the backtraces, it would seem that OnceCell does something weird that causes EventListener trouble down the line. |
Why would OnceCell be doing something weird and moreover, this something only happens with esp idf + xtensa and not on x86 and (likely) arm? I still find it more likely that the root cause is a miscompilation bug in the llvm xtensa backend. Will open a bug at the xtensa llvm fork shortly btw. And will test (again) with riscv32imc on the esp idf, though I'm relatively sure the issue did not happen there. |
@ivmarkov Doing something weird in this case is triggering the miscompilation. I don't doubt your analysis. |
Other than dereferencing a raw pointer, I don't see anything else weird, unfortunately... |
I know there is little to no progress here, for which I apologize. In the meantime, what I found to work for me is setting the version of We'll resume the work on finding the root cause shortly. |
Does this issue still occur with async-lock v3.4.0? |
@notgull It still panics in the same spot with debug assertions on (Rust 1.78), however, it does not crash when they are off. |
@ivmarkov Any news on the LLVM front ? I can reproduce unaligned pointers on the latest 1.80 toolchain. |
@MabezDev sorry for the ping ^^^ @jothan I know this is not really a proper fix, but you can give async-io-mini a try until a proper fix is in-place. |
I might give async-io-mini a try, my blocker is mostly my use of smol-hyper (that depends on regular async-io). I am also using some async-io timers, but I can always use some other timers. |
Did any of you end up opening an issue in https://github.com/espressif/llvm-project ? I can't find it. |
I didn't. Would you like to take it upon yourself? Would really appreciate that and you might be the right person, given your |
I can reliably reproduce a panic with this: use core::pin::pin;
use esp_idf_svc::io;
// Don't forget to raise the CONFIG_PTHREAD_TASK_STACK_SIZE_DEFAULT in `sdkconfig.defaults` to > 4K so that the
// `async-io` background thread can work fine
pub fn main() {
esp_idf_svc::sys::link_patches();
esp_idf_svc::log::EspLogger::initialize_default();
io::vfs::initialize_eventfd(5).unwrap();
// You can use `esp_idf_svc::hal::task::block_on` as well
async_io::block_on(pin!(async move { Result::<_, anyhow::Error>::Ok(()) }));
}
But removing This sounds a bit different from the report here, so can someone please verify if this is the actual issue? |
Is it? It is as if the code generated by the
$crate::pin::Pin::<&mut _> { __pointer: &mut { $value } } According to the report I quoted, the above code results in: |
Actually @npmenard says something even more interesting:
Sounds like |
Well I guess we can blame pin, but that might also be independent of our issue. What's weird is that pin seems to carbon-copy whatever you give it, which may be slightly problematic for a self-referencing generator. Whether this is related to an architecture-specific crash... I don't know. |
I'm not blaming pin. But so far, it seems we can only reproduce the mis-compilation when pin is used. I'm sure the mis-compilation has a bigger blast radius, but so far we have seen it happening when pin is somehow involved.
I would not expect anything else given that Rust arrays are |
Why do you expect Regardless, I can reproduce the issue (using different opt levels, I get very different crashes and occasionally something working), I'm trying to minimize it somehow. |
Because the
Good news - if you can minimize it, you would be our hero! :) |
So, in my case, the root cause of the crash is:
Now let's see if I can break a non-idf application with this knowledge :) |
Alignment of 64 bytes? :o How is that even possible? What might require such an alignment?
Yes this is aligned to 64bits, but obviously not to 64 bYtes. |
https://doc.rust-lang.org/std/mem/fn.align_of_val.html
The culprit is ConcurrentQueue, it raises the alignment requirement (with a contained |
I think you are getting there! |
@bugadani Did you already open an issue at https://github.com/espressif/llvm-project ? |
I've reported the issue internally |
Not sure what is that internal repo, but... I hope you'll inform us how it is developing :) |
I just read about the atomics-in-psram issue. I am using both cores with PSRAM on an ESP32-cam board. I am boxing some futures on the heap that might end up in PSRAM. I think I'm going to revert to a single threaded executor or disable the second core entirely for stability's sake. |
@bugadani Is there any development in the internal repo that you could share with us all? |
I don't have any updates. |
We have identified an issue with LLVM not upholding alignment requirements, and the LLVM team is working on a fix. I don't know if this is the root cause for this issue, but I'll update you with the results after the fix is done. |
Code: https://github.com/esp-rs/esp-idf-svc/blob/4fff46bba1be66ae3c6945d3b8bda30a589f6f6b/examples/tls_async.rs
Backtrace:
backtrace.txt
Crate versions:
Cargo.lock
I am running the example on a ESPcam board, with an esp32 (revision v3.0) Xtensa module (Tinker AI ESP32-S, to be exact).
The example usually crashes once or twice after reset before working.
I've only managed to reproduce this with an optimized debug build with codegen-units=1.
The text was updated successfully, but these errors were encountered: