ASPNet service crashing on linux-arm64 when targeted to net9.0 #110442
Comments
Tagging subscribers to this area: @dotnet/gc |
Thanks for reporting the issue. Does it repro frequently? Please share a dump (multiple if available) via email if possible. |
Yes, it fails periodically (but have no clue, what is the exact trigger). We have I will (try to) send you the dumps over mail. |
I have tried the trivial Might the behavior somehow depend (even non-intentionally) on the platform the application is built on? (we build/publish on windows machines and then deploy the result to linux servers): |
Thanks for sharing the dump. It appears that Thread::GetAllocContext is returning an invalid context. In .NET 9, this part of the code was touched in #103055 and #103607. @jkoritzinsky, do you think any of those changes would cause a race on arm64? @pavel-faltynek, it would be helpful if you could share a few more dumps -- I assume all of them AV with the same stack? Any other specific details and/or a repro would also be helpful. |
I know we had to do some fixes for during process shutdown (on Windows) in #103877. My first guess would be that this is happening because the alloc context for a thread was destroyed, but the Maybe there's a corresponding shutdown issue for Linux? I can't think of anything else without looking at the dump myself. |
In this particular case it doesn't look like it's occurring during shutdown. I have shared the dump with you offline. |
The crash dump shows that the runtime was not notified about the thread shutting down. It is most likely a managed/native interop problem (e.g. a bug in interop corrupting the unmanaged heap). I can see from the crash dump that your service uses a number of NuGet packages; the interop bug might be in one of them. It may be useful to run it on a checked build of CoreCLR to see whether that gives us any extra insights. Could you please give it a try?
|
@jkotas, it doesn't seem I can access this. I have tried to randomly accept "account and project creation" in |
@mangod9, generating more dumps might be the easier part. As far as I can remember all of the ones I loaded in WinDbg had the same call stack. I have added:
|
Regarding the nuget note: there is one which differs between Additionally one code update was performed for EDIT: Forgot the |
I have shared the checked build at https://github.com/jkotas/scratch/ |
Thank you, @jkotas. I have added dump 8 executed against the checked CLR.
It doesn't seem like a repeated thing. I found only a single instance (even though the service crashed many times). |
So just to clarify you hit the assert only once, but the service was still crashing with |
Same bug on the Docker runtime image for x86_64 with tag :9.0. On 9.0-alpine it works perfectly. |
Right. Single shot assert, unfortunately no documented relationship to the dump(s). So I have no clue, whether the dump 8 is anyhow connected to the assert or not. |
do you happen to have a standalone repro? |
heisenbug |
After leaving it to crash and restart freely for a longer time, there is an additional outcome:
|
I have added dump 9 which is verifiably related (matching
Stack:
|
Thanks for uploading more dumps; we will take a look. We might also need to get it to repro with the stress log enabled. We will provide instructions for that if required. |
I have automated part of the crash data collection, so other dumps strongly related to the observed/captured asserts may be available (if this is still something that would help you). Up to now, two more: 10, 11
|
Would it be possible to get it to repro with these env. vars and share dumps if possible? These might help with determining if any thread detach events are missed.
|
@mangod9, yes, I'm on it. Are the dumps enough, or will there be any other data showing up waiting for collection? |
Dumps should have the logs for now, but if those don't help we might have to provide a private build with additional logging/diagnostics enabled. |
OK, I have about 50 dumps, 7 of which recorded an assert, so I will pick some candidates and add them to the storage.
Adding the following ones (the time can be used to match the assert, if needed):
|
Hi @pavel-faltynek, I had a look at the dumps. I can see native stacks and some info, but I'm not able to load the DAC to extract the stress logs. I suspect some memory regions are missing from the dumps, perhaps due to the way in which the dumps were collected. There are env vars that can be set to collect dumps on crash with the necessary memory regions (more below).
The current theory for the issue is that after the thread destruction notification, a reverse pinvoke from a native library on the same thread causes a Thread object to be initialized again, which later would not get a destruction notification. I tried inducing that locally and am seeing some random issues, but I didn't see exactly the same crash stack. I'd like to see if it's the same issue occurring in your scenario. There probably would have been some indication in the stress log of that happening.
I also created a test PR that adds a clear log entry when the above issue occurs, along with a new config var that, when enabled, causes a fail-fast on thread reinitialization after cleanup (test PR), in hopes that we can also see a stack trace for what leads to it. I locally built (unsigned) release binaries of CoreCLR with the test PR (binaries). The assertion failures don't seem to point to a clear issue, and I'm not sure they're all a result of the issue, so I produced release binaries. The binaries should be compatible with runtime version 9.0.0.
Would you mind updating your build with those binaries temporarily, and collecting a crash dump with the following settings?
Please restore the original binaries for the runtime (don't use the checked binaries from before), and then update the runtime with the binaries I linked above.
Set env vars to collect a crash dump including the necessary memory regions (more info):
Enable stress log as before:
Set new config var from the test PR above to fail-fast on thread reinitialization after cleanup:
|
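For readers following along, here is a minimal sketch of how crash-dump collection with full memory could be switched on for a child process. It uses the documented .NET runtime variables `DOTNET_DbgEnableMiniDump`, `DOTNET_DbgMiniDumpType` and `DOTNET_DbgMiniDumpName`; the exact values requested in this thread may differ, and the helper name `crash_dump_env`, the dump directory and the file name pattern are assumptions made only for illustration.

```python
import os
from typing import Dict, Optional


def crash_dump_env(dump_dir: str, base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with .NET crash-dump collection enabled.

    Hypothetical helper: it only builds the environment, it does not start anything.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        # Ask the runtime to write a dump when the process crashes.
        "DOTNET_DbgEnableMiniDump": "1",
        # 4 = full dump, which includes all process memory.
        "DOTNET_DbgMiniDumpType": "4",
        # %p is expanded by the runtime to the crashing process id.
        # The directory and file name are assumptions for this sketch.
        "DOTNET_DbgMiniDumpName": f"{dump_dir}/coredump.%p",
    })
    return env
```

The returned dictionary could then be passed as `env=` to `subprocess.run` when launching the service, so the settings apply to that process only and not to the whole machine.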
@kouvel, thank you. Will do it and post the results or the additional questions, if any. |
I have
What I get:
|
Thanks @pavel-faltynek. From the stack it looks like indeed there is some thread reinitialization going on.
I believe that has to be done on a linux arm64 machine with lldb since it needs to load and run native components in the debugger (SOS). For managed stacks also, SOS (installed with Could you please share the dumps with me as well at [email protected]? |
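The thread-reinitialization theory from earlier in the thread can be illustrated with a toy model. This is not CoreCLR code: `ToyRuntime`, its methods and the log text are hypothetical names chosen for the sketch, and real thread attach/detach is far more involved. It only shows why a late callback after cleanup leaves per-thread state that nobody will ever tear down.

```python
class ToyRuntime:
    """Toy model of a runtime that tracks per-thread state for native threads."""

    def __init__(self):
        self.live = {}         # thread id -> per-thread state
        self.detached = set()  # ids that already got a destruction notification
        self.log = []          # stands in for a stress-log style record

    def reverse_pinvoke(self, tid):
        """Native code calls into managed code; the thread is attached lazily."""
        if tid not in self.live:
            if tid in self.detached:
                self.log.append(f"thread {tid} reinitialized after cleanup")
            self.live[tid] = {"alloc_context": object()}
        return self.live[tid]

    def on_thread_destroyed(self, tid):
        """Destruction notification, assumed to arrive once per OS thread."""
        self.live.pop(tid, None)
        self.detached.add(tid)


def simulate():
    rt = ToyRuntime()
    rt.reverse_pinvoke(1)      # first callback attaches thread 1
    rt.on_thread_destroyed(1)  # thread starts exiting, runtime cleans up
    rt.reverse_pinvoke(1)      # late callback from a native library's own teardown
    # No second notification arrives, so thread 1's state is now orphaned.
    return rt
```

In the model, the second `reverse_pinvoke` call recreates state for a thread the runtime already considers gone, which is the situation the fail-fast config var in the test PR is meant to catch.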
I was more or less afraid of this, thanks for the confirmation. |
Description
Aspnet service crashes on SIGSEGV when compiled for linux-arm64 and targeted to net9.0.
It does not crash on Windows at all, nor on Windows or Linux when targeted to net8.0.
Reproduction Steps
Unfortunately I have no repro steps (other than just running the service and sending a few http requests to it).
On Ubuntu 22.04.5 LTS (GNU/Linux 5.15.0-125-generic aarch64) it looks like it just randomly crashes.
Installed dotnet: 9.0.101.
Expected behavior
Don't crash even on linux-arm64, please 😁
Actual behavior
There is a crash report available which - when preprocessed by apport-unpack /var/crash/_usr_lib_dotnet_dotnet.1000.crash ~/crash - can provide a dump. As there is possibly sensitive data in the dump, I can share it for inspection only via some "more secure" channels, if needed. I'm far from being an expert here, but I'm able to open it in WinDbg and observe the following:
Under the "Stack" section, there is:
Regression?
No response
Known Workarounds
No response
Configuration
No response
Other information
No response