ASPNet service crashing on linux-arm64 when targeted to net9.0 #110442

Open
pavel-faltynek opened this issue Dec 5, 2024 · 31 comments

@pavel-faltynek

pavel-faltynek commented Dec 5, 2024

Description

ASP.NET service crashes with SIGSEGV when compiled for linux-arm64 and targeting net9.0.
It does not crash on Windows at all, and it does not crash on either Windows or Linux when targeting net8.0.

Reproduction Steps

Unfortunately I have no repro steps (other than just running the service and sending a few HTTP requests to it).
On Ubuntu 22.04.5 LTS (GNU/Linux 5.15.0-125-generic aarch64) it looks like it just randomly crashes.
Installed dotnet: 9.0.101.

Expected behavior

Don't crash even on linux-arm64, please 😁

Actual behavior

There is a crash report available which, when preprocessed by apport-unpack /var/crash/_usr_lib_dotnet_dotnet.1000.crash ~/crash, can provide a dump. As there is possibly sensitive data in the dump, I can share it for inspection only via some "more secure" channels, if needed. I'm far from being an expert here, but I'm able to open it in WinDbg and observe the following:

(12f144.12f14a): Signal SIGSEGV (Segmentation fault) code SEGV_MAPERR (Address not mapped to object) at 0xffbe9093ca2a
*** WARNING: Unable to verify timestamp for libcoreclr.so
libcoreclr!alloc_context::init_alloc_count [inlined in libcoreclr!SVR::GCHeap::FixAllocContext+0x14]:
0000ffff`850eca54 79406428 ldrh        w8,[x1,#0x32]

Under the "Stack" section, there is:

[0x0]   libcoreclr!alloc_context::init_alloc_count   (Inline Function)   (Inline Function)   
[0x1]   libcoreclr!SVR::GCHeap::FixAllocContext+0x14   0xffff44246540   0xffff84fd94f8   
[0x2]   libcoreclr!GCToEEInterface::GcEnumAllocContexts+0x54   0xffff44246570   0xffff850a6210   
[0x3]   libcoreclr!SVR::gc_heap::fix_allocation_contexts+0x28   (Inline Function)   (Inline Function)   
[0x4]   libcoreclr!SVR::gc_heap::garbage_collect+0x60   0xffff442465a0   0xffff850a3f44   
[0x5]   libcoreclr!SVR::gc_heap::gc_thread_function+0xca8   0xffff44246670   0xffff850a329c   
[0x6]   libcoreclr!SVR::gc_heap::gc_thread_stub+0x38   0xffff442466f0   0xffff84fdc420   
[0x7]   libcoreclr!<unnamed-namespace>::CreateNonSuspendableThread::$_0::operator()+0x3c   (Inline Function)   (Inline Function)   
[0x8]   libcoreclr!<unnamed-namespace>::CreateNonSuspendableThread::$_0::__invoke+0x50   0xffff44246710   0xffff852cafb0   
[0x9]   libcoreclr!CorUnix::CPalThread::ThreadEntry+0x1b8   0xffff44246740   0xffff8556d5c8   
[0xa]   libc_so+0x7d5c8   0xffff44246800   0x0   
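The faulting instruction word shown by the debugger above (79406428, ldrh w8,[x1,#0x32]) can be cross-checked by hand. The following is a minimal sketch that decodes the A64 LDRH (immediate, unsigned offset) encoding and, assuming the reported SEGV address is the effective address of that load, derives what the base register would have held. The function name is made up for illustration.

```python
def decode_ldrh_unsigned_offset(word: int) -> dict:
    """Decode an A64 LDRH (immediate, unsigned offset) instruction word."""
    # Bits 31..22 are fixed to 0b0111100101 for this encoding.
    if (word >> 22) != 0b0111100101:
        raise ValueError("not an LDRH (immediate, unsigned offset) encoding")
    imm12 = (word >> 10) & 0xFFF   # unsigned immediate, in halfword units
    rn = (word >> 5) & 0x1F        # base register number
    rt = word & 0x1F               # destination register number
    return {"rt": rt, "rn": rn, "offset": imm12 * 2}


insn = decode_ldrh_unsigned_offset(0x79406428)
fault_address = 0xFFBE9093CA2A  # from the SIGSEGV line in the report

# Assumption: the fault address is base register + immediate offset.
base = fault_address - insn["offset"]

print(f"ldrh w{insn['rt']}, [x{insn['rn']}, #{insn['offset']:#x}]")
print(f"x{insn['rn']} would have been {base:#x}")
```

Under that assumption the load reads a 16-bit field at offset 0x32 from a pointer that itself points into unmapped memory, which fits the SEGV_MAPERR code in the report.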

Regression?

No response

Known Workarounds

No response

Configuration

No response

Other information

No response

@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Dec 5, 2024
@vcsjones vcsjones added area-GC-coreclr and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Dec 5, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Dec 5, 2024
Contributor

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

@mangod9 mangod9 removed the untriaged New issue has not been triaged by the area owner label Dec 5, 2024
@mangod9 mangod9 added this to the 10.0.0 milestone Dec 5, 2024
@mangod9
Member

mangod9 commented Dec 5, 2024

Thanks for reporting the issue. Does it repro frequently? Please share a dump (multiple if available) via email if possible.

@pavel-faltynek
Author

pavel-faltynek commented Dec 5, 2024

Yes, it fails periodically (but I have no clue what the exact trigger is). We have systemd-managed restarts as well as a load balancer checking the service health over HTTP requests (so it's kind of active immediately after startup). At first I thought the first "non-health" (bigger) request crashes it, but this might not be the case.

I will (try to) send you the dumps over mail.
(Mail attachment failed, sent via alternative method).

@pavel-faltynek
Author

pavel-faltynek commented Dec 6, 2024

I have tried the trivial dotnet new webapi and dotnet new webapiaot directly on the target system (curling over http/https), additionally adding an explicit request to invoke GC.Collect(), but there was no issue at all. Everything was compiled/published on the target system, which leads me to a question:

Might the behavior somehow depend (even non-intentionally) on the platform the application is built on? (we build/publish on windows machines and then deploy the result to linux servers):
dotnet publish $project --configuration $configuration --framework $framework --runtime $runtime --no-self-contained --output $output
For the reported case, $framework = 'net9.0', $runtime = 'linux-arm64'.

@mangod9
Member

mangod9 commented Dec 6, 2024

Thanks for sharing the dump. It appears that Thread::GetAllocContext is returning an invalid context. In 9 this part of the code was touched in #103055 and #103607.

@jkoritzinsky, do you think any of those changes would cause a race on arm64?

@pavel-faltynek, would be helpful if you can share a few more dumps -- assume all of them AV with the same stack? And any other specific details and/or a repro would be helpful.

@jkoritzinsky
Member

I know we had to do some fixes for process shutdown (on Windows) in #103877.

My first guess would be that this is happening because the alloc context for a thread was destroyed while the Thread object for the thread is still around. However, the thread object cleans itself up in co-op mode, so it shouldn't be possible to race with the GC there.

Maybe there's a corresponding shutdown issue for Linux?

I can't think of anything else without looking at the dump myself.

@mangod9
Member

mangod9 commented Dec 7, 2024

In this particular case it doesn't look like it's occurring during shutdown. I have shared the dump with you offline.

@jkotas
Member

jkotas commented Dec 9, 2024

The crash dump shows that the runtime was not notified about the thread shutting down. It is most likely a managed/native interop problem (e.g. a bug in interop corrupting the unmanaged heap). I can see from the crash dump that your service uses a number of NuGet packages - the interop bug might be in one of them.

It may be useful to run it on a checked build of CoreCLR to see whether it gives us any extra insights. Could you please give it a try?

@pavel-faltynek
Author

@jkotas, it doesn't seem I can access this. I have tried to randomly accept "account and project creation" in dev.azure.com (after logging in to my "live/microsoft" account), but these portals are overcomplicated for me (meaning I'm not patient enough to get oriented). Any chance you could just share it? Thank you.

@pavel-faltynek
Author

pavel-faltynek commented Dec 9, 2024

@mangod9, generating more dumps might be the easier part. As far as I can remember all of the ones I loaded in WinDbg had the same call stack.

I have added:

  • dump number 3 (just redeployed and waited for the crash, so no explicit request, only balancer health checks), looks a bit different:
[0x0]   libcoreclr!Object::RawSetMethodTable   (Inline Function)   (Inline Function)   
[0x1]   libcoreclr!SVR::CObjectHeader::SetFree+0x10   (Inline Function)   (Inline Function)   
[0x2]   libcoreclr!SVR::gc_heap::make_unused_array+0x5c   0xffff619964d0   0xffffa406cb30   
[0x3]   libcoreclr!SVR::gc_heap::fix_allocation_context+0x64   (Inline Function)   (Inline Function)   
[0x4]   libcoreclr!SVR::GCHeap::FixAllocContext+0xf0   0xffff61996540   0xffffa3f594f8   
[0x5]   libcoreclr!GCToEEInterface::GcEnumAllocContexts+0x54   0xffff61996570   0xffffa4026210   
[0x6]   libcoreclr!SVR::gc_heap::fix_allocation_contexts+0x28   (Inline Function)   (Inline Function)   
[0x7]   libcoreclr!SVR::gc_heap::garbage_collect+0x60   0xffff619965a0   0xffffa4023f44   
[0x8]   libcoreclr!SVR::gc_heap::gc_thread_function+0xca8   0xffff61996670   0xffffa402329c   
[0x9]   libcoreclr!SVR::gc_heap::gc_thread_stub+0x38   0xffff619966f0   0xffffa3f5c420   
[0xa]   libcoreclr!<unnamed-namespace>::CreateNonSuspendableThread::$_0::operator()+0x3c   (Inline Function)   (Inline Function)   
[0xb]   libcoreclr!<unnamed-namespace>::CreateNonSuspendableThread::$_0::__invoke+0x50   0xffff61996710   0xffffa424afb0   
[0xc]   libcoreclr!CorUnix::CPalThread::ThreadEntry+0x1b8   0xffff61996740   0xffffa44ed5c8   
[0xd]   libc_so+0x7d5c8   0xffff61996800   0x0   
  • dump number 4, 5 (looks "standard"):
[0x0]   libcoreclr!alloc_context::init_alloc_count   (Inline Function)   (Inline Function)   
[0x1]   libcoreclr!SVR::GCHeap::FixAllocContext+0x14   0xffff559a6540   0xffff97f694f8   
[0x2]   libcoreclr!GCToEEInterface::GcEnumAllocContexts+0x54   0xffff559a6570   0xffff98036210   
[0x3]   libcoreclr!SVR::gc_heap::fix_allocation_contexts+0x28   (Inline Function)   (Inline Function)   
[0x4]   libcoreclr!SVR::gc_heap::garbage_collect+0x60   0xffff559a65a0   0xffff98033f44   
[0x5]   libcoreclr!SVR::gc_heap::gc_thread_function+0xca8   0xffff559a6670   0xffff9803329c   
[0x6]   libcoreclr!SVR::gc_heap::gc_thread_stub+0x38   0xffff559a66f0   0xffff97f6c420   
[0x7]   libcoreclr!<unnamed-namespace>::CreateNonSuspendableThread::$_0::operator()+0x3c   (Inline Function)   (Inline Function)   
[0x8]   libcoreclr!<unnamed-namespace>::CreateNonSuspendableThread::$_0::__invoke+0x50   0xffff559a6710   0xffff9825afb0   
[0x9]   libcoreclr!CorUnix::CPalThread::ThreadEntry+0x1b8   0xffff559a6740   0xffff984fd5c8   
[0xa]   libc_so+0x7d5c8   0xffff559a6800   0x0   

@pavel-faltynek
Author

pavel-faltynek commented Dec 9, 2024

Regarding the NuGet note: there is one package which differs between the net8.0 and net9.0 targets: Microsoft.AspNetCore.Mvc.Testing, which explicitly states that it is not compatible with net8.0 (in its latest version), but I believe this does not contribute to the standard service runtime (it's test related only).

Additionally, one code update was performed for net9.0: instead of creating X509Certificate2 directly from pfx storage, we now use X509CertificateLoader (there is a new SYSLIB0057 obsoletion warning). I have checked that reverting this update does not "remedy" the problem (we had some issues with pfx in the past on mobile platforms, so I wanted to be sure it's not the issue).

EDIT: Forgot the Microsoft.AspNetCore.Authentication.JwtBearer which also went from 8.0.10 to 9.0.0 (with no impact on behavior when reverting back).

@jkotas
Member

jkotas commented Dec 9, 2024

> Any chance you could just share it? Thank you.

I have shared the checked build at https://github.com/jkotas/scratch/

@pavel-faltynek
Author

pavel-faltynek commented Dec 10, 2024

Thank you, @jkotas. I have added dump 8 executed against the checked CLR.
Additionally, there is an assert in system log:

Assert failure(PID 1546607 [0x0017996f], Thread: 1546612 [0x179974]): (size >= Align (min_obj_size))
     File: /__w/1/s/src/coreclr/gc/gc.cpp:7896
     Image: /usr/lib/dotnet/dotnet

Doesn't seem like repeated thing. Found only single instance (even when the service crashed many times).

@mangod9
Member

mangod9 commented Dec 10, 2024

> Doesn't seem like repeated thing. Found only single instance (even when the service crashed many times).

So just to clarify you hit the assert only once, but the service was still crashing with FixAllocContext on the stack? Did you happen to capture a dump with the assert?

@askovpen

same bug on docker runtime image x86_64 on tag :9.0. On 9.0-alpine work perfect.

@pavel-faltynek
Author

> So just to clarify you hit the assert only once, but the service was still crashing with FixAllocContext on the stack? Did you happen to capture a dump with the assert?

Right. Single-shot assert, unfortunately with no documented relationship to the dump(s). So I have no clue whether dump 8 is in any way connected to the assert or not.

@mangod9
Member

mangod9 commented Dec 11, 2024

> same bug on docker runtime image x86_64 on tag :9.0. On 9.0-alpine work perfect.

do you happen to have a standalone repro?

@askovpen

> > same bug on docker runtime image x86_64 on tag :9.0. On 9.0-alpine work perfect.

> do you happen to have a standalone repro?

heisenbug

@pavel-faltynek
Author

After leaving it on its own, crashing/restarting freely for a longer time, there are additional outcomes:

Dec 10 10:54:31 Assert failure(PID 1546607 [0x0017996f], Thread: 1546612 [0x179974]): (size >= Align (min_obj_size))
Dec 10 10:54:31     File: /__w/1/s/src/coreclr/gc/gc.cpp:7896
Dec 10 10:54:31     Image: /usr/lib/dotnet/dotnet
Dec 10 17:17:57 Assert failure(PID 1565784 [0x0017e458], Thread: 1565789 [0x17e45d]): (size >= Align (min_obj_size))
Dec 10 17:17:57     File: /__w/1/s/src/coreclr/gc/gc.cpp:7896
Dec 10 17:17:57     Image: /usr/lib/dotnet/dotnet
Dec 10 17:44:42 Assert failure(PID 1567038 [0x0017e93e], Thread: 1567043 [0x17e943]): (acontext->get_home_heap() == 0) || (acontext->get_home_heap()->pGenGCHeap->heap_number < gc_heap::n_heaps)
Dec 10 17:44:42     File: /__w/1/s/src/coreclr/gc/gc.cpp:51409
Dec 10 17:44:42     Image: /usr/lib/dotnet/dotnet
Dec 10 18:40:53 Assert failure(PID 1569654 [0x0017f376], Thread: 1569659 [0x17f37b]): !CREATE_CHECK_STRING(pMT && pMT->Validate())
Dec 10 18:40:53     File: /__w/1/s/src/coreclr/vm/object.cpp:553
Dec 10 18:40:53     Image: /usr/lib/dotnet/dotnet
Dec 10 19:12:59 Assert failure(PID 1571145 [0x0017f949], Thread: 1571150 [0x17f94e]): (size >= Align (min_obj_size))
Dec 10 19:12:59     File: /__w/1/s/src/coreclr/gc/gc.cpp:7896
Dec 10 19:12:59     Image: /usr/lib/dotnet/dotnet
Dec 10 19:29:02 Assert failure(PID 1571875 [0x0017fc23], Thread: 1571880 [0x17fc28]): (acontext->get_home_heap() == 0) || (acontext->get_home_heap()->pGenGCHeap->heap_number < gc_heap::n_heaps)
Dec 10 19:29:02     File: /__w/1/s/src/coreclr/gc/gc.cpp:51409
Dec 10 19:29:02     Image: /usr/lib/dotnet/dotnet
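To see at a glance which assertions fire and how often, the header lines of the log above can be tallied by expression. This is a minimal sketch: the regular expression is written against the line shape shown in this thread, and the helper name is made up for illustration.

```python
import re
from collections import Counter

# Matches the first line of each assert entry as it appears in the system log.
ASSERT_RE = re.compile(
    r"Assert failure\(PID (?P<pid>\d+) \[0x[0-9a-fA-F]+\], "
    r"Thread: (?P<tid>\d+) \[0x[0-9a-fA-F]+\]\): (?P<expr>.*)$"
)


def tally_asserts(lines):
    """Count assert failures per asserted expression."""
    counts = Counter()
    for line in lines:
        match = ASSERT_RE.search(line)
        if match:
            counts[match.group("expr").strip()] += 1
    return counts


HOME_HEAP = ("(acontext->get_home_heap() == 0) || "
             "(acontext->get_home_heap()->pGenGCHeap->heap_number < gc_heap::n_heaps)")
SIZE = "(size >= Align (min_obj_size))"
METHOD_TABLE = "!CREATE_CHECK_STRING(pMT && pMT->Validate())"

log = [
    f"Dec 10 10:54:31 Assert failure(PID 1546607 [0x0017996f], Thread: 1546612 [0x179974]): {SIZE}",
    "Dec 10 10:54:31     File: /__w/1/s/src/coreclr/gc/gc.cpp:7896",
    f"Dec 10 17:17:57 Assert failure(PID 1565784 [0x0017e458], Thread: 1565789 [0x17e45d]): {SIZE}",
    f"Dec 10 17:44:42 Assert failure(PID 1567038 [0x0017e93e], Thread: 1567043 [0x17e943]): {HOME_HEAP}",
    f"Dec 10 18:40:53 Assert failure(PID 1569654 [0x0017f376], Thread: 1569659 [0x17f37b]): {METHOD_TABLE}",
    f"Dec 10 19:12:59 Assert failure(PID 1571145 [0x0017f949], Thread: 1571150 [0x17f94e]): {SIZE}",
    f"Dec 10 19:29:02 Assert failure(PID 1571875 [0x0017fc23], Thread: 1571880 [0x17fc28]): {HOME_HEAP}",
]

for expr, count in tally_asserts(log).most_common():
    print(count, expr)
```

Feeding the tally the raw journal output instead of the hand-copied list should work the same way, since non-matching lines (File:, Image:) are skipped.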

@pavel-faltynek
Author

pavel-faltynek commented Dec 12, 2024

I have added dump 9 which is verifiably related (matching PID) to the familiar assert:

Dec 12 13:22:12 Assert failure(PID 1645568 [0x00191c00], Thread: 1645573 [0x191c05]): (size >= Align (min_obj_size))
Dec 12 13:22:12     File: /__w/1/s/src/coreclr/gc/gc.cpp:7896
Dec 12 13:22:12     Image: /usr/lib/dotnet/dotnet

Stack:

[0x0]   libc_so+0x7f200   0xffff6a7a2120   0xffffb156a67c   
[0x1]   libc_so+0x3a67c   0xffff6a7a21f0   0xffffb1557130   
[0x2]   libc_so+0x27130   0xffff6a7a2210   0xffffb12e8ea8   
[0x3]   libcoreclr!PROCAbort+0x34   0xffff6a7a2360   0xffffb12e8d18   
[0x4]   libcoreclr!RaiseFailFastException+0x6c   0xffff6a7a2380   0xffffb115b2d4   
[0x5]   libcoreclr!FailFastOnAssert+0x24   0xffff6a7a2390   0xffffb115b11c   
[0x6]   libcoreclr!_DbgBreakCheck+0x28c   0xffff6a7a23a0   0xffffb115b358   
[0x7]   libcoreclr!_DbgBreakCheckNoThrow+0x84   0xffff6a7a3480   0xffffb115b668   
[0x8]   libcoreclr!DbgAssertDialog+0xc0   0xffff6a7a3500   0xffffb10212e4   
[0x9]   libcoreclr!SVR::gc_heap::fix_allocation_context+0xc8   0xffff6a7a3540   0xffffb0ef8e6c   
[0xa]   libcoreclr!GCToEEInterface::GcEnumAllocContexts+0x54   0xffff6a7a35c0   0xffffb101f0cc   
[0xb]   libcoreclr!SVR::gc_heap::fix_allocation_contexts+0x28   (Inline Function)   (Inline Function)   
[0xc]   libcoreclr!SVR::gc_heap::garbage_collect+0x60   0xffff6a7a35f0   0xffffb101cf18   
[0xd]   libcoreclr!SVR::gc_heap::gc_thread_function+0x918   0xffff6a7a3680   0xffffb101c600   
[0xe]   libcoreclr!SVR::gc_heap::gc_thread_stub+0x38   0xffff6a7a36f0   0xffffb0efc810   
[0xf]   libcoreclr!<unnamed-namespace>::CreateNonSuspendableThread::$_0::operator()+0x44   (Inline Function)   (Inline Function)   
[0x10]   libcoreclr!<unnamed-namespace>::CreateNonSuspendableThread::$_0::__invoke+0x54   0xffff6a7a3710   0xffffb12eea3c   
[0x11]   libcoreclr!CorUnix::CPalThread::ThreadEntry+0x370   0xffff6a7a3740   0xffffb15ad5c8   
[0x12]   libc_so+0x7d5c8   0xffff6a7a3800   0x0   

@mangod9
Member

mangod9 commented Dec 12, 2024

Thanks for uploading more dumps, we will take a look. We might also need to get it to repro with the stress log enabled. Will provide instructions for that if required.

@pavel-faltynek
Author

pavel-faltynek commented Dec 12, 2024

I have performed some automation in the process of crash data collection, so there might be available other dumps strongly related to the asserts observed/captured (if this is still something that would help you).

Up to now, two more:

10

Dec 12 14:46:34 Assert failure(PID 1651096 [0x00193198], Thread: 1651101 [0x19319d]): !CREATE_CHECK_STRING(pMT && pMT->Validate())
Dec 12 14:46:34     File: /__w/1/s/src/coreclr/vm/object.cpp:553
Dec 12 14:46:34     Image: /usr/lib/dotnet/dotnet

11

Dec 12 16:49:57 Assert failure(PID 1661208 [0x00195918], Thread: 1661213 [0x19591d]): (acontext->get_home_heap() == 0) || (acontext->get_home_heap()->pGenGCHeap->heap_number < gc_heap::n_heaps)
Dec 12 16:49:57     File: /__w/1/s/src/coreclr/gc/gc.cpp:51409
Dec 12 16:49:57     Image: /usr/lib/dotnet/dotnet

@mangod9
Member

mangod9 commented Dec 17, 2024

Would it be possible to get it to repro with these env. vars and share dumps if possible? These might help with determining if any thread detach events are missed.

DOTNET_LogFacility=0x01000000
DOTNET_LogLevel=1
DOTNET_StressLog=1
DOTNET_StressLogSize=1400000
DOTNET_TotalStressLogSize=18000000

@pavel-faltynek
Author

@mangod9, yes, I'm on it. Are the dumps enough, or will there be any other data showing up waiting for collection?

@mangod9
Member

mangod9 commented Dec 19, 2024

The dumps should have the logs for now, but if those don't help we might have to provide a private build with additional logging/diagnostics enabled.

@pavel-faltynek
Author

pavel-faltynek commented Dec 19, 2024

OK, I have about 50 dumps, of which 7 recorded an assert, so I will pick some candidates and add them to the storage.

Dec 19 11:28:35 Assert failure(PID 2307721 [0x00233689], Thread: 2307726 [0x23368e]): (acontext->get_home_heap() == 0) || (acontext->get_home_heap()->pGenGCHeap->heap_number < gc_heap::n_heaps)
Dec 19 11:28:35     File: /__w/1/s/src/coreclr/gc/gc.cpp:51409
Dec 19 11:28:35     Image: /usr/lib/dotnet/dotnet
Dec 19 11:34:26 Assert failure(PID 2308118 [0x00233816], Thread: 2308123 [0x23381b]): !CREATE_CHECK_STRING(pMT && pMT->Validate())
Dec 19 11:34:26     File: /__w/1/s/src/coreclr/vm/object.cpp:553
Dec 19 11:34:26     Image: /usr/lib/dotnet/dotnet
Dec 19 12:26:25 Assert failure(PID 2313360 [0x00234c90], Thread: 2313376 [0x234ca0]): (size >= Align (min_obj_size))
Dec 19 12:26:25     File: /__w/1/s/src/coreclr/gc/gc.cpp:7896
Dec 19 12:26:25     Image: /usr/lib/dotnet/dotnet
Dec 19 14:02:21 Assert failure(PID 2322372 [0x00236fc4], Thread: 2322377 [0x236fc9]): (size >= Align (min_obj_size))
Dec 19 14:02:21     File: /__w/1/s/src/coreclr/gc/gc.cpp:7896
Dec 19 14:02:21     Image: /usr/lib/dotnet/dotnet
Dec 19 14:48:58 Assert failure(PID 2326563 [0x00238023], Thread: 2326568 [0x238028]): !CREATE_CHECK_STRING(pMT && pMT->Validate())
Dec 19 14:48:58     File: /__w/1/s/src/coreclr/vm/object.cpp:553
Dec 19 14:48:58     Image: /usr/lib/dotnet/dotnet
Dec 19 14:59:57 Assert failure(PID 2327634 [0x00238452], Thread: 2327639 [0x238457]): (acontext->get_home_heap() == 0) || (acontext->get_home_heap()->pGenGCHeap->heap_number < gc_heap::n_heaps)
Dec 19 14:59:57     File: /__w/1/s/src/coreclr/gc/gc.cpp:51409
Dec 19 14:59:57     Image: /usr/lib/dotnet/dotnet
Dec 19 15:05:29 Assert failure(PID 2328026 [0x002385da], Thread: 2328031 [0x2385df]): !CREATE_CHECK_STRING(pMT && pMT->Validate())
Dec 19 15:05:29     File: /__w/1/s/src/coreclr/vm/object.cpp:553
Dec 19 15:05:29     Image: /usr/lib/dotnet/dotnet

Adding following ones (the time can be used to match the assert, if needed):

crash.2024-12-19_11-28-49.zip
crash.2024-12-19_11-34-39.zip
crash.2024-12-19_12-26-36.zip
crash.2024-12-19_15-49-44.zip
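Matching an archive to an assert by time can be done mechanically. The sketch below pairs each crash archive with the latest assert logged shortly before it. The 60-second window, the year argument (the log lines carry no year), and the function name are assumptions for illustration; parsing the month abbreviation relies on an English or C locale.

```python
from datetime import datetime, timedelta


def match_dumps_to_asserts(dump_names, assert_times, year=2024,
                           window=timedelta(seconds=60)):
    """Map each crash archive name to the closest preceding assert time, or None."""
    asserts = [datetime.strptime(f"{year} {t}", "%Y %b %d %H:%M:%S")
               for t in assert_times]
    matches = {}
    for name in dump_names:
        # File names look like crash.2024-12-19_11-28-49.zip
        stamp = name[len("crash."):-len(".zip")]
        dumped = datetime.strptime(stamp, "%Y-%m-%d_%H-%M-%S")
        candidates = [a for a in asserts
                      if timedelta(0) <= dumped - a <= window]
        matches[name] = max(candidates) if candidates else None
    return matches


assert_times = [
    "Dec 19 11:28:35", "Dec 19 11:34:26", "Dec 19 12:26:25",
    "Dec 19 14:02:21", "Dec 19 14:48:58", "Dec 19 14:59:57",
    "Dec 19 15:05:29",
]
dumps = [
    "crash.2024-12-19_11-28-49.zip",
    "crash.2024-12-19_11-34-39.zip",
    "crash.2024-12-19_12-26-36.zip",
    "crash.2024-12-19_15-49-44.zip",
]

for name, when in match_dumps_to_asserts(dumps, assert_times).items():
    print(name, "->", when if when else "no assert within the window")
```

With this window the first three archives pair with the 11:28:35, 11:34:26 and 12:26:25 asserts, while the last archive has no assert within a minute before it, so it may need a wider window or a manual check.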

@kouvel
Member

kouvel commented Dec 28, 2024

Hi @pavel-faltynek, I had a look at the dumps, I can see native stacks and some info but I'm not able to load the DAC to extract the stress logs. I suspect some memory regions are missing from the dumps, perhaps due to the way in which the dump was collected. There are env vars that can be set to collect dumps on crash with the necessary memory regions (more below).

The current theory for the issue is that after the thread destruction notification, a reverse pinvoke from a native library on the same thread causes a Thread object to be initialized again, which later would not get a destruction notification. I tried inducing that locally and am seeing some random issues but I didn't see exactly the same crash stack. I'd like to see if it's the same issue occurring in your scenario. There probably would have been some indication in the stress log of that happening. I also created a test PR that adds a clear log entry when the above issue occurs along with a new config var that when enabled causes a fail-fast on thread reinitialization after cleanup (test PR), in hopes that we can also see a stack trace for what leads to it.
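The sequence in that theory can be illustrated with a toy model. This is not runtime code: the class, its methods, and the bookkeeping are all hypothetical and only mimic the ordering of events described above (destruction notification first, then a reverse pinvoke on the same thread, then nothing to clean up the second record).

```python
class ToyRuntime:
    """Toy model of per-thread bookkeeping; all names are hypothetical."""

    def __init__(self, fail_fast_on_reinitialize=False):
        self.thread_records = {}   # OS thread id -> toy Thread record
        self.notified = set()      # thread ids that already got the notification
        self.fail_fast_on_reinitialize = fail_fast_on_reinitialize

    def reverse_pinvoke(self, tid):
        """Native code calls into managed code; set up a record if missing."""
        if tid in self.thread_records:
            return
        if tid in self.notified and self.fail_fast_on_reinitialize:
            # Mirrors the idea of failing fast on reinitialization after cleanup.
            raise RuntimeError(f"thread {tid} reinitialized after cleanup")
        self.thread_records[tid] = {"tid": tid}

    def destruction_notification(self, tid):
        """Delivered once per thread; cleans up the record that exists now."""
        self.thread_records.pop(tid, None)
        self.notified.add(tid)

    def stale_records(self, exited_tids):
        """Records still registered for threads that no longer exist."""
        return sorted(t for t in exited_tids if t in self.thread_records)


runtime = ToyRuntime()
runtime.reverse_pinvoke(1)            # thread 1 enters managed code
runtime.destruction_notification(1)   # thread 1 is exiting, record cleaned up
runtime.reverse_pinvoke(1)            # late callback on the same thread
print(runtime.stale_records({1}))     # record outlives the thread
```

In the model, anything that later walks the records (as a GC enumerating allocation contexts would) sees an entry for a thread that is already gone. Constructing the model with fail_fast_on_reinitialize=True raises at the late callback instead, which is the point where a stack trace is most useful.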

I locally-built (unsigned) release binaries of CoreCLR with the test PR (binaries) - the assertion failures don't seem to point to a clear issue, and I'm not sure they're all a result of the issue, so I produced release binaries. The binaries should be compatible with runtime version 9.0.0. Would you mind updating your build with those binaries temporarily, and collecting a crash dump with the following settings?

Please restore the original binaries for the runtime (don't use the checked binaries from before), and then update the runtime with the binaries I linked above.

Set env vars to collect a crash dump including the necessary memory regions (more info):

DOTNET_DbgEnableMiniDump=1
DOTNET_DbgMiniDumpType=2  
DOTNET_DbgMiniDumpName=/path/to/coredump

Enable stress log as before:

DOTNET_LogFacility=0x01000000
DOTNET_LogLevel=1
DOTNET_StressLog=1
DOTNET_StressLogSize=1400000
DOTNET_TotalStressLogSize=18000000

Set new config var from the test PR above to fail-fast on thread reinitialization after cleanup:

DOTNET_Thread_FailFastOnReinitialize=1
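Since the service in this thread is restarted by systemd, one way to apply all of the settings above is through Environment= lines in the unit or a drop-in. The sketch below only formats those lines; the helper name is made up, and where the lines go (unit file or drop-in) is left to the reader. Percent signs are doubled because systemd treats % as a specifier prefix in unit files, which matters if the dump name contains patterns such as %t or %p.

```python
def systemd_environment_lines(settings):
    """Format settings as systemd Environment= lines, escaping '%' as '%%'."""
    lines = []
    for key, value in settings.items():
        escaped = str(value).replace("%", "%%")
        lines.append(f'Environment="{key}={escaped}"')
    return lines


settings = {
    # Crash dump collection
    "DOTNET_DbgEnableMiniDump": "1",
    "DOTNET_DbgMiniDumpType": "2",
    "DOTNET_DbgMiniDumpName": "/path/to/coredump",  # placeholder path, as above
    # Stress log
    "DOTNET_LogFacility": "0x01000000",
    "DOTNET_LogLevel": "1",
    "DOTNET_StressLog": "1",
    "DOTNET_StressLogSize": "1400000",
    "DOTNET_TotalStressLogSize": "18000000",
    # Config var from the test PR binaries
    "DOTNET_Thread_FailFastOnReinitialize": "1",
}

print("[Service]")
for line in systemd_environment_lines(settings):
    print(line)
```

After changing a unit or drop-in, systemd needs a daemon-reload and a service restart before the new environment takes effect.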

@pavel-faltynek
Author

@kouvel, thank you. Will do it and post the results or the additional questions, if any.

@pavel-faltynek
Author

pavel-faltynek commented Dec 30, 2024

I have

  • wiped out the checked dotnet runtime
  • installed clean 9.0.101
  • patched the /usr/lib/dotnet/shared/Microsoft.NETCore.App/9.0.0 with the provided binaries (including the dbg ones)
  • setup additional envs (coredump-%t-%p for dump name)

What I get:

  • The dumps are now created at the specified place.
  • When the provided binaries location is added to the symbol search, some of the stack records are "more readable" (see below). No clue how to get libcrypto.so symbols...
  • On Windows (using WinDbg), I'm (still) unable to get the stresslog (tried !DumpLog -addr 0x01000000), so leaving it for more skilled people :-)
  • Typical stack trace (looks the same in every dump I checked, about 5 of them; will share a random selection in the 13 fail-fast directory).
[0x0]   libc_so+0xb6800   0xffbec1fca2b0   0xffff8984ac28   
[0x1]   libcoreclr!PROCCreateCrashDump+0x27c [/home/kount/e/dotnet/runtime/src/coreclr/pal/src/thread/process.cpp @ 2309]   0xffbec1fca2f0   0xffff8984bfb0   
[0x2]   libcoreclr!PROCCreateCrashDumpIfEnabled+0xbac [/home/kount/e/dotnet/runtime/src/coreclr/pal/src/thread/process.cpp @ 2526]   0xffbec1fca350   0xffff89849898   
[0x3]   libcoreclr!PROCAbort+0x2c [/home/kount/e/dotnet/runtime/src/coreclr/pal/src/thread/process.cpp @ 2559]   0xffbec1fca3e0   0xffff8984977c   
[0x4]   libcoreclr!RaiseFailFastException+0x14 [/home/kount/e/dotnet/runtime/src/coreclr/pal/src/thread/process.cpp @ 1276]   0xffbec1fca400   0xffff8972bac8   
[0x5]   libcoreclr!FailFastOnAssert+0x1c [/home/kount/e/dotnet/runtime/src/coreclr/utilcode/debug.cpp @ 63]   0xffbec1fca410   0xffff8972ba54   
[0x6]   libcoreclr!__FreeBuildAssertFail+0x1d8 [/home/kount/e/dotnet/runtime/src/coreclr/inc/volatile.h @ 505]   0xffbec1fca420   0xffff8980f7a8   
[0x7]   libcoreclr!TlsDestructionMonitor::Activate+0xf4 [/home/kount/e/dotnet/runtime/src/coreclr/vm/ceemain.cpp @ 1722]   (Inline Function)   (Inline Function)   
[0x8]   libcoreclr!ZTW22tls_destructionMonitor   0xffbec1fca4b0   0xffff894ec320   
[0x9]   libcoreclr!SetThread+0x20 [/home/kount/e/dotnet/runtime/src/coreclr/vm/threads.cpp @ 360]   (Inline Function)   (Inline Function)   
[0xa]   libcoreclr!SetupThread+0x278 [/home/kount/e/dotnet/runtime/src/coreclr/vm/threads.h @ 697]   0xffbec1fca4d0   0xffff894ed4d8   
[0xb]   libcoreclr!SetupThreadNoThrow+0x80 [/home/kount/e/dotnet/runtime/src/coreclr/vm/threads.cpp @ 15732480]   0xffbec1fca540   0xffff8957f790   
[0xc]   libcoreclr!JIT_ReversePInvokeEnterRare+0x58 [/home/kount/e/dotnet/runtime/src/coreclr/vm/jithelpers.cpp @ 4818]   0xffbec1fca5d0   0xffff85d37274   
[0xd]   0xffff85d37274   0xffbec1fca620   0xffbed31d58dc   
[0xe]   libcrypto_so+0x1758dc   0xffbec1fca670   0xffbed31d5998   
[0xf]   libcrypto_so+0x175998   0xffbec1fca690   0xffbed32b50d8   
[0x10]   libcrypto_so+0x2550d8   0xffbec1fca6b0   0xffbed31dd6f8   
[0x11]   libcrypto_so+0x17d6f8   0xffbec1fca6d0   0xffbed3228bc0   
[0x12]   libcrypto_so+0x1c8bc0   0xffbec1fca700   0xffbed31fd9ec   
[0x13]   libcrypto_so+0x19d9ec   0xffbec1fca730   0xffbed31fe230   
[0x14]   libcrypto_so+0x19e230   0xffbec1fca770   0xffff89aea3d4   
[0x15]   libc_so+0x7a3d4   0xffbec1fca790   0xffff89aed494   
[0x16]   libc_so+0x7d494   0xffbec1fca800   0xffff89b55edc   
[0x17]   libc_so+0xe5edc   0xffbec1fca920   0xffffffffffffffff   
[0x18]   0xffffffffffffffff   0xffbec1fca920   0x0   

@kouvel
Member

kouvel commented Dec 30, 2024

Thanks @pavel-faltynek. From the stack it looks like indeed there is some thread reinitialization going on.

> On Windows (using WinDbg), I'm (still) unable to get the stresslog (tried !DumpLog -addr 0x01000000)

I believe that has to be done on a linux arm64 machine with lldb since it needs to load and run native components in the debugger (SOS). For managed stacks also, SOS (installed with dotnet-sos tool) would need to be loaded and then clrstack command can be used. Then the command !dumplog /full/path/to/stress.log should work.

Could you please share the dumps with me as well at [email protected]?

@pavel-faltynek
Author

I was more or less afraid of this, thanks for the confirmation.
I have sent the dumps location to you.


7 participants