Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Patch skipping crashing in functions marked with [UnmanagedCallersOnly] #102767

Open
markoostveen opened this issue May 28, 2024 · 22 comments
Open
Assignees
Milestone

Comments

@markoostveen
Copy link

Description

This is a repost of my stackoverflow post, that I have found a temporary solution for here

C# function pointers causes access violation upon calling from C++

I have tried debugging, but when the issue occurs. In this case C++ calls the callback, but before it enters the function's scope the exception is thrown, but immediately returns with an error code.

For the sake of debugging, I've tried switching the calling convention from cdecl to stdcall to see if it has to do with stack corruption.

My assumptions are that:

  • static functions marked with [UnmanagedCallersOnly] have a static address, especially when an entrypoint is defined.
  • Thread.BeginThreadAffinity() followed by Thread.EndThreadAffinity() ensures that an unmanaged thread can successfully start executing managed code.

Good to know:

  • The function is being invoked by a thread not created in C#, there are many threads calling this function

Reproduction Steps

In the following bit of code, I'm using C# function pointers to omit using a delegate type, for faster performance. The code works fine when I use a delegate type marshalled as a function pointer, no exception no issues. However, when changing it from a delegate to a function pointer it sometimes throws an access violation.

Code is modified to remove namespaces and prefixes not relevant to the question

    public static class Engine
    {
        // Causing issues with sometimes throwing access violation
        [DllImport("engine", CallingConvention = CallingConvention.Cdecl)]
        public static unsafe extern UInt32 EventScheduler_ScheduleLocalEvent(
            int priority, UInt64 delay, delegate* unmanaged[Cdecl]<nint, void> callback, nint obj);

        // Works fine every time, but is much slower seems due to delegate instatiations
        [UnmanagedFunctionPointer(CallingConvention.Cdecl)]
        public delegate void EventScheduler_ScheduleLocalEvent_callbackDelegate(nint obj);
        [DllImport("engine", CallingConvention = CallingConvention.Cdecl)]
        public static unsafe extern UInt32 EventScheduler_ScheduleLocalEvent(
            int priority,
            UInt64 delay,
            [MarshalAs(UnmanagedType.FunctionPtr)] EventScheduler_ScheduleLocalEvent_callbackDelegate callback,
            nint obj);

    }

    public static class EventScheduler
    {
        public delegate void EventCallback();

        // attribute is only present in the function pointer senario
        [UnmanagedCallersOnly(CallConvs = new[] { typeof(CallConvCdecl ) }, EntryPoint = "ScheduleLocalEvent_EventCallbackFunction")]
        private static void ScheduleLocalEvent_EventCallbackFunction(nint handlePtr)
        {
            Thread.BeginThreadAffinity();
            GCHandle handle = GCHandle.FromIntPtr(handlePtr);
            var cb          = (EventScheduler.EventCallback)handle.Target!;
            cb();
            handle.Free();
            Thread.EndThreadAffinity();
        }
        public static UInt32 ScheduleLocalEvent(int priority, SimulationTime delayTime, EventCallback eventCallback)
        {
            GCHandle handle  = GCHandle.Alloc(eventCallback, GCHandleType.Normal);
            IntPtr handlePtr = GCHandle.ToIntPtr(handle);
            unsafe
            {
                return Engine.EventScheduler_ScheduleLocalEvent(
                    priority, delayTime, &ScheduleLocalEvent_EventCallbackFunction, handlePtr);
            }
        }
   }

On the C++ side

    #define API extern "C" __declspec(dllexport)
    
    API uint32_t EventScheduler_ScheduleLocalEvent(int priority, uint64_t delay, void(__cdecl* callback)(void* obj), void* obj)
    {
        ERSAssert(Ers::Core::InsideSubModel());
        Core::SyncManagerBase& syncManager = Ers::Core::GetSyncManagerBase();
        return syncManager.ScheduleEvent(priority, delay, [callback, obj]() { callback(obj); });
    }

Expected behavior

The thread should enter and execute the static function passed using a function pointer

Actual behavior

The thread doesn't enter the user-defined portion of the function. an Access Violation is thrown in a generated call visible in the IL during the native-to-managed-transition

Regression?

No response

Known Workarounds

Change the garbage collection mode by adding ```xml
true
true


### Configuration

Tested on PC's around the office with the same behavior. All were x64, issue occurred on both Windows and Linux. The issue was reproduced on .net8.0 and .net9.0.100-preview

### Other information

The issue doesn't appear always, i think it might be a race condition inside the garbage collector.
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label May 28, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label May 28, 2024
@am11 am11 added area-System.Runtime.InteropServices and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels May 28, 2024
@hez2010
Copy link
Contributor

hez2010 commented May 28, 2024

Can you try changing GCHandle.Alloc(eventCallback, GCHandleType.Normal) to GCHandle.Alloc(eventCallback, GCHandleType.Pinned)?

@jkotas
Copy link
Member

jkotas commented May 28, 2024

Can you try changing GCHandle.Alloc(eventCallback, GCHandleType.Normal) to GCHandle.Alloc(eventCallback, GCHandleType.Pinned)?

This may hide issue similar to how changing the GC mode hides the issue, but it won't fix the underlying problem.

Thread.BeginThreadAffinity() followed by Thread.EndThreadAffinity() ensures that an unmanaged thread can successfully start executing managed code.

Thread.Begin/EndThreadAffinity do nothing in .NET Core. You can delete these calls. (These calls were relevant in .NET Framework for low-level hosting, e.g. under MS SQL Server.)

I have tried debugging, but when the issue occurs. In this case C++ calls the callback, but before it enters the function's scope the exception is thrown, but immediately returns with an error code.

Could you please share more details about the location of the crash? Where is the origin exception thrown exactly?

@AaronRobinsonMSFT AaronRobinsonMSFT added the needs-author-action An issue or pull request that requires more info or actions from the author. label May 28, 2024
@markoostveen
Copy link
Author

markoostveen commented May 28, 2024

Can you try changing GCHandle.Alloc(eventCallback, GCHandleType.Normal) to GCHandle.Alloc(eventCallback, GCHandleType.Pinned)?

I've tried this before posting on StackOverflow, this didn't fix the issue. Also seemed to have no effect on the frequency it was occuring

Thread.Begin/EndThreadAffinity do nothing in .NET Core. You can delete these calls. (These calls were relevant in .NET Framework for low-level hosting, e.g. under MS SQL Server.)

Got it, that is nice to know!

Could you please share more details about the location of the crash? Where is the origin exception thrown exactly?

Yes, this is a screenshot where the exception get's thrown, though it only get's visibly thrown in .net9.0 preview in .net8.0 is calls abort and exists with an exit code indicating the Access Violation.
A portion is hidden as I wasn't allowed to show call paths from our project
image
This issue though always seems to happen in this call in the disassembly which I assume has something to do with setting up garbage collection
image

@dotnet-policy-service dotnet-policy-service bot removed the needs-author-action An issue or pull request that requires more info or actions from the author. label May 28, 2024
@jkotas
Copy link
Member

jkotas commented May 28, 2024

This issue though always seems to happen in this call in the disassembly which I assume has something to do with setting up garbage collection

Yes, this call is unmanaged->managed transition. It would be useful to know where the crash happens inside this call. I do not think that the Visual Studio will show it to you. You would need to run under low-level debugger like windbg or collect crash dump using (https://learn.microsoft.com/en-us/windows/win32/wer/collecting-user-mode-dumps).

@markoostveen
Copy link
Author

This issue though always seems to happen in this call in the disassembly which I assume has something to do with setting up garbage collection

Yes, this call is unmanaged->managed transition. It would be useful to know where the crash happens inside this call. I do not think that the Visual Studio will show it to you. You would need to run under low-level debugger like windbg or collect crash dump using (https://learn.microsoft.com/en-us/windows/win32/wer/collecting-user-mode-dumps).

Now this is where it gets interesting, while running it inside windbg the issue doesn't occur at all after running it many times, i'll try to get you a crash dump, any special settings you would like to be turned on for the minidump flags?

@jkotas
Copy link
Member

jkotas commented May 28, 2024

any special settings you would like to be turned on for the minidump flags?

Default settings should work fine, to start with at least.

@markoostveen
Copy link
Author

any special settings you would like to be turned on for the minidump flags?

Default settings should work fine, to start with at least.

Here you go
AWealthOfRows.exe.11692.dmp

@jkotas
Copy link
Member

jkotas commented May 28, 2024

Thank you for sharing the minidump. Here are the details of the crash:

00007ff9`9f1abbea 833d07ed2e0000  cmp     dword ptr [00007ff9`9f49a8f8],0 <- ACCESS_VIOLATION here
00007ff9`9f1abbf1 7405            je      Ers!Ers.EventScheduler.ScheduleLocalEvent_EventCallbackFunction()+0x38 (00007ff9`9f1abbf8)
00007ff9`9f1abbf3 e8c80ac75f      call    coreclr!JIT_DbgIsJustMyCode (00007ff9`fee1c6c0)

The crash is in "Just My Code" check. It is an extra code that gets inserted by the JIT at the start of every method when running under managed debugger like Visual Studio. It allows the managed debugger to provide better stepping experience. It explains why you are not able to reproduce the crash outside Visual Studio.

@jkotas
Copy link
Member

jkotas commented May 28, 2024

The exception details:

0:019> dt coreclr!_EXCEPTION_RECORD 0x0000003f`265ff4b0
   +0x000 ExceptionCode    : 0xc0000005
   +0x004 ExceptionFlags   : 0
   +0x008 ExceptionRecord  : (null) 
   +0x010 ExceptionAddress : 0x000001e0`7fe90860 Void
   +0x018 NumberParameters : 2
   +0x020 ExceptionInformation : [15] 0
0:019> dx -r1 (*((coreclr!unsigned __int64 (*)[15])0x3f265ff4d0))
(*((coreclr!unsigned __int64 (*)[15])0x3f265ff4d0))                 [Type: unsigned __int64 [15]]
    [0]              : 0x0 [Type: unsigned __int64]
    [1]              : 0x1e08017f56e [Type: unsigned __int64]

Notice that the ExceptionAddress is different than the actual address of the code (see my previous comment). It suggests that this is problem with patch skipping. The patch skipping is a managed debugger feature where it copies out a few instructions to different spot to execute them in isolation. The patch skipping has to update the relative addresses in the code when it copies the code out, but it was not done here for 00007ff99f49a8f8` address.

@dotnet/dotnet-diag-contrib PTLA

Copy link
Contributor

Tagging subscribers to this area: @tommcdon
See info in area-owners.md if you want to be subscribed.

@jkotas jkotas changed the title C# function pointers causes access violation upon calling from C++ Patch skipping crashing in functions marked with [UnmanagedCallersOnly] May 28, 2024
@markoostveen
Copy link
Author

markoostveen commented May 29, 2024

Awesome you've been able to pinpoint it so fast, that makes sense. Do you need anything else from me?
On an unrelated note, will a patch likely make it into .net9.0 still?

@hoyosjs
Copy link
Member

hoyosjs commented May 31, 2024

I can't see the bypass buffers in the dump sadly, but I think there's a few issues:

  • NativeWalker::DecodeInstructionForPatchSkip seems to be messing up and thinking this is a write instruction. This might have regressed with AVX512 support in .NET 8. Have you seen this in .NET 6?
  • I don't see anything handling RIP-relative operations for some operations. (like cmp in this case) in
    {
    patchBypassRW[1] = 0x8b; // MOV reg, mem
    _ASSERTE((int)sizeof(void*) <= SharedPatchBypassBuffer::cbBufferBypass);
    *(void**)bufferBypassRW = (void*)(patch->address + m_instrAttrib.m_cbInstr + dwOldDisp);
    }
    else
    {
    _ASSERTE(m_instrAttrib.m_cOperandSize <= SharedPatchBypassBuffer::cbBufferBypass);
    // Copy the data into our buffer.
    memcpy(bufferBypassRW, patch->address + m_instrAttrib.m_cbInstr + dwOldDisp, m_instrAttrib.m_cOperandSize);
    if (m_instrAttrib.m_fIsWrite)
    {
    // save the actual destination address and size so when we TriggerSingleStep() we can update the value
    pSharedPatchBypassBufferRW->RipTargetFixup = (UINT_PTR)(patch->address + m_instrAttrib.m_cbInstr + dwOldDisp);
    pSharedPatchBypassBufferRW->RipTargetFixupSize = m_instrAttrib.m_cOperandSize;
    }
    }
    }

@hoyosjs
Copy link
Member

hoyosjs commented May 31, 2024

No, even in 6.0 the decoder would classify the instruction as non-write. This is actually doing silly copy to and from the module data structure. But this is actually an unlikely root cause. Rather, the window of time between where we overwrite the displacement // Overwrite the *signed* displacement. and when another thread might be executing the CopyInstructionBlock but before it does the offset fixup this means we will be executing the old RIP relative addressing.

@hoyosjs
Copy link
Member

hoyosjs commented May 31, 2024

@markoostveen, does the native portion run on many threads and is this the first time they execute managed code?

@tommcdon
Copy link
Member

The crash is in "Just My Code" check. It is an extra code that gets inserted by the JIT at the start of every method when running under managed debugger like Visual Studio. It allows the managed debugger to provide better stepping experience. It explains why you are not able to reproduce the crash outside Visual Studio.

@markoostveen would you mind trying to disable Just My Code in the debugger to determine if issue reproduces without it enabled? Also would it be possible to share a VS solution containing a small C# console app and C++ DLL that we can use to reproduce and debug the issue?

@markoostveen
Copy link
Author

markoostveen commented May 31, 2024

@markoostveen, does the native portion run on many threads and is this the first time they execute managed code?

Yes the native portion is running on multiple threads, some started from C# but others started in native code. Those threads are executing jobs from a jobsystem(Written in my spare time) which in the end calls the managed callback as part of a job.

@markoostveen would you mind trying to disable Just My Code in the debugger to determine if issue reproduces without it enabled?

I've ran the program with both Just My Code enabled and disabled giving me the same behavior in both instances resulting in the same crash. If you want I can provide a dump where it is enabled and disabled

Also would it be possible to share a VS solution containing a small C# console app and C++ DLL that we can use to reproduce and debug the issue?

I'm not sure if I'll be able to provide this as I don't have an easy way to compress it into a smaller sharable console application. Sharing the entire app isn't possible due to company policy but I'll try to get clearance if need be.

@markoostveen
Copy link
Author

markoostveen commented May 31, 2024

Well, I got clearance to share it much faster than anticipated, although I can only share release binaries so you'll be able to run it.
I did have to build in a cutoff date so it'll only function until June 30th

It is a standalone executable running a small simulation (No GUI yet in C#)
AWealthOfRows.zip

While running you will see the following
image
Highlighted in blue shows the Thread ID, this ID is the thread printing the message which in this case is always the main thread as it's printing before and after the simulation not during.

The implementation of the method that is causing the issue is under namespace Ers.EventScheduler if you look at the Ers.dll in ILSpy or another IL inspection tool

To cause the issue I was encountering open visual studio using the following command in cmd

devenv /debugexe AWealthOfRows.exe

Then right click on the awealthofrows project click properties and change the debugger type to "Mixed (.Net Core, .Net 5+)"

@tommcdon tommcdon added this to the 9.0.0 milestone Jun 4, 2024
@tommcdon tommcdon removed the untriaged New issue has not been triaged by the area owner label Jun 4, 2024
@markoostveen
Copy link
Author

Hey, is there any progress on the issue? I just want to check-in. if there isn't any progress, I'll try to get clearance to extend the period cutoff date/killswitch.

@tommcdon
Copy link
Member

tommcdon commented Jul 1, 2024

Hey, is there any progress on the issue? I just want to check-in. if there isn't any progress, I'll try to get clearance to extend the period cutoff date/killswitch.

It's still under investigation. In the meantime, we have created a standalone repro so the original app shouldn't be necessary any longer

    internal class Program
    {
        static volatile int count = 0;

        [UnmanagedCallersOnly(CallConvs = new[] { typeof(CallConvCdecl) }, EntryPoint = "MyFoo")]
        static void MyFoo(nint obj)
        {
            Interlocked.Increment(ref count);
        }

        [DllImport("Dll1.dll", CallingConvention = CallingConvention.Cdecl)]
        static unsafe extern void TestFunction([MarshalAs(UnmanagedType.FunctionPtr)] delegate* unmanaged[Cdecl]<nint, void> callback);

        static unsafe void Main(string[] args)
        {
            Console.WriteLine("Hello, World!");
            const int numThreads = 1000;
            for (int i = 0; i < numThreads; i++)
            {
                TestFunction(&MyFoo);
            }
            while (count < numThreads)
            {
                Thread.Sleep(1);
            }
            Console.WriteLine("Hit enter to quit");
            Console.In.ReadLine();
        }
    }
#define API extern "C" __declspec(dllexport)

volatile LONG i = 0;

typedef void(__cdecl* MyCallback)(void* obj);

void DoStuff(MyCallback callback)
{
	callback((void*)i);
	InterlockedIncrement(&i);
}

DWORD WINAPI ThreadProc(
	_In_ LPVOID lpParameter
)
{
	DoStuff((MyCallback)lpParameter);
	return 0;
}


API void __cdecl TestFunction(void(__cdecl* callback)(void* obj))
{
	CreateThread(NULL, 0, ThreadProc, callback, 0, NULL);
}

@markoostveen
Copy link
Author

markoostveen commented Jan 2, 2025

Hi all,

Recently after changing some seemingly unrelated pieces of code our workaround has stopped working for some reason.

I have been delving into the issue and trying to debug the issue, it seems when I run it the following assertion is being hit

Assert failure(PID 23280 [0x00005af0], Thread: 49160 [0xc008]): m_instrAttrib.m_fIsCall

CORECLR! DebuggerPatchSkip::IsInPlaceSingleStep + 0x2F (0x00007ffc`3ceb2aff)
CORECLR! DebuggerPatchSkip::DebuggerPatchSkip + 0x2A8 (0x00007ffc`3cea1398)
CORECLR! DebuggerController::ActivatePatchSkip + 0x160 (0x00007ffc`3cea3450)
CORECLR! DebuggerController::DispatchPatchOrSingleStep + 0xF46 (0x00007ffc`3ceacd46)
CORECLR! DebuggerController::DispatchNativeException + 0xAB0 (0x00007ffc`3ceabba0)
CORECLR! Debugger::FirstChanceNativeException + 0x4F8 (0x00007ffc`3cedd4b8)
CORECLR! IsDebuggerFault + 0x8D (0x00007ffc`3c5273dd)
CORECLR! CLRVectoredExceptionHandlerPhase2 + 0x18C (0x00007ffc`3c51372c)
CORECLR! CLRVectoredExceptionHandler + 0x3D2 (0x00007ffc`3c513582)
CORECLR! CLRVectoredExceptionHandlerShim + 0x245 (0x00007ffc`3c514085)
    File: C:\Projects\temp\runtime\src\coreclr\debug\ee\controller.h:1537
    Image: C:\Projects\temp\runtime\artifacts\bin\coreclr\windows.x64.Debug\corerun.exe

Though when trying to delve deeping I'm unable to place a breakpoint on the actual location, I can't find a resource how I can actually debug the coreCLR where the issue is occurring, as I seem to be getting the following message.
Image

I have also been trying to catch the issue with post-mortem debugging in windbg but this doesn't seem to be working, anyone has any idea how to attach the debugger, since I can't just attach 2 invasive debuggers at once?

@markoostveen
Copy link
Author

markoostveen commented Jan 2, 2025

I've also disabled the feature FEATURE_EMULATE_SINGLESTEP to check if it was being related to the issue, but when I do, native only debugging works okay as expected, but then the breakpoint being hit is different, in that case, the 'address' member on struct DebuggerControllerPatch is NULL causing IsBound to return false in another assertion

@jkotas
Copy link
Member

jkotas commented Jan 2, 2025

m_instrAttrib.m_fIsCall

This is unrelated problem that was fixed by #109603

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: No status
Development

No branches or pull requests

8 participants