Before Windows NT 4.0, the graphical part of the Windows subsystem was implemented entirely in userland. Starting with NT 4.0, Microsoft moved a large part of the Window Manager and the Graphics Device Interface to kernel mode, in the Win32k.sys component. However, part of the implementation still lives in userland, and the kernel component needs to call back into user-mode code. To do so, Microsoft implemented a ‘reverse’ system call that allows the kernel to call userland code. The whole process has already been discussed and explained in previous articles, so we will not detail it again. Please refer to Tarjei Mandt's white paper, which contains a comprehensive description of the mechanism.
In this post, we detail how Windows (from Windows XP to Windows 8) uses this mechanism to load modules into running processes. Understanding the mechanism may allow you to use it for your own purposes, in particular as a way to inject custom DLLs into processes while running code in the kernel portion of the Windows operating system.
An article published by @zer0mem used the ‘reverse’ system call mechanism to execute code in user mode from kernel code. This post offers an alternative approach if you are free to drop a Windows binary file on the file system.
User-mode callbacks walkthrough
General mechanism
The function that lets kernel code call user code is exported by the Windows kernel under the name KeUserModeCallback. It takes five parameters: ApiNumber, the input buffer and its length in bytes, and OutputBuffer and OutputLength, which receive the result of the call.
Since this function is not present in any WDK header, you have to retrieve it dynamically with the help of an MmGetSystemRoutineAddress call.
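To make the dynamic lookup concrete, here is a minimal sketch of the function-pointer type one might declare for KeUserModeCallback. It uses plain C stand-ins for the WDK types so that it builds outside a driver, and it omits the calling convention (NTAPI in a real driver). The names KE_USER_MODE_CALLBACK, LOOKUP_ROUTINE and resolve_user_mode_callback are hypothetical; in a driver, the lookup hook would be replaced by a call to MmGetSystemRoutineAddress with a UNICODE_STRING holding L"KeUserModeCallback".

```c
#include <stddef.h>
#include <stdint.h>

/* Plain C stand-ins for the WDK types, so the sketch builds outside a driver. */
typedef int32_t  NTSTATUS;
typedef uint32_t ULONG;
typedef void    *PVOID;

/* Pointer type for KeUserModeCallback, following the five parameters
 * named in this post. The type name itself is hypothetical. */
typedef NTSTATUS (*KE_USER_MODE_CALLBACK)(
    ULONG  ApiNumber,     /* zero-based index into the callback table  */
    PVOID  InputBuffer,   /* kernel buffer copied onto the user stack  */
    ULONG  InputLength,   /* size of InputBuffer, in bytes             */
    PVOID *OutputBuffer,  /* receives the address of the result        */
    ULONG *OutputLength); /* receives the size of the result, in bytes */

/* Hypothetical lookup hook. In a driver this role is played by
 * MmGetSystemRoutineAddress. */
typedef PVOID (*LOOKUP_ROUTINE)(const wchar_t *name);

/* Returns the routine address, or NULL when no lookup hook is supplied
 * or the lookup itself fails. */
static KE_USER_MODE_CALLBACK resolve_user_mode_callback(LOOKUP_ROUTINE lookup)
{
    if (lookup == NULL)
        return NULL;
    return (KE_USER_MODE_CALLBACK)lookup(L"KeUserModeCallback");
}
```

A caller should treat a NULL result as "routine not available" and never invoke the pointer in that case.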
Essentially, this function copies its input parameters onto the userland stack and returns to userland code (in the ntdll!KiUserCallbackDispatcher function).
Callable functions are identified by the ApiNumber parameter. This is a zero-based index in an array accessible through the KernelCallbackTable field of the Process Environment Block.
This field is initialized when the user32 module is loaded in the process (before initialization, the field is NULL). The initialization takes place in the function user32!UserClientDllInitialize (the entry point of the user32 DLL) and basically makes the KernelCallbackTable field point to the non-exported user32!apfnDispatch symbol.
This table contains function pointers to various userland callable functions, all of them located in the user32 module. The contents (thus the index of the functions) and the length of the table depend on the operating system version.
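The index-based dispatch described above can be modelled in a few lines of user-mode C. This is a simplified sketch, not the code of ntdll or user32: FAKE_PEB, USER_CALLBACK, the two demo callbacks and the table_count field are all hypothetical, and the real callback signature is not shown here.

```c
#include <stddef.h>

/* Hypothetical callback signature used only by this model. */
typedef long (*USER_CALLBACK)(void *args, unsigned long length);

/* Hypothetical stand-in for the part of the PEB that matters here. */
typedef struct {
    const USER_CALLBACK *kernel_callback_table; /* NULL until user32 is loaded */
    unsigned long table_count; /* not in the real PEB; added to keep the sketch safe */
} FAKE_PEB;

static long cb_length(void *args, unsigned long length)
{
    (void)args;
    return (long)length;
}

static long cb_double_length(void *args, unsigned long length)
{
    (void)args;
    return (long)(length * 2);
}

/* Plays the role of user32!apfnDispatch in this model. */
static const USER_CALLBACK demo_table[] = { cb_length, cb_double_length };

static const FAKE_PEB demo_peb       = { demo_table, 2 };
static const FAKE_PEB no_user32_peb  = { NULL, 0 };

/* Calls entry api_number of the table, or returns -1 when the table is
 * missing or the index is out of range. */
static long dispatch_callback(const FAKE_PEB *peb, unsigned long api_number,
                              void *args, unsigned long length)
{
    if (peb->kernel_callback_table == NULL || api_number >= peb->table_count)
        return -1;
    return peb->kernel_callback_table[api_number](args, length);
}
```

The point of the model is that ApiNumber carries no meaning by itself: the same number selects a different function as soon as the table contents change between operating system versions.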
Here is a (truncated) example of the function table for a process on 32-bit Windows XP SP3:
Conditions for calling KeUserModeCallback
Before calling KeUserModeCallback, you must first check that the KernelCallbackTable field of the Process Environment Block is not NULL (KeUserModeCallback will not do this for you). This field is at offset 0x2c on a 32-bit system and 0x58 on a 64-bit system (from Windows XP to Windows 8). Skipping this check will eventually lead to a BSOD.
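As an illustration of this check, the sketch below reads the KernelCallbackTable field from a raw copy of a PEB using the two offsets quoted above. The helper names are hypothetical, and the sketch works on a byte buffer rather than on a live process, so how the PEB bytes are obtained is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offsets of PEB.KernelCallbackTable quoted in this post
 * (Windows XP to Windows 8). */
#define KCT_OFFSET_32 0x2c
#define KCT_OFFSET_64 0x58

/* Returns the offset of the field for the given PEB flavor. */
static size_t kernel_callback_table_offset(int is_64bit)
{
    return is_64bit ? KCT_OFFSET_64 : KCT_OFFSET_32;
}

/* Reads the pointer-sized field from a raw copy of the PEB.
 * The value is widened to 64 bits in both cases. */
static uint64_t read_kernel_callback_table(const unsigned char *peb, int is_64bit)
{
    if (is_64bit) {
        uint64_t value;
        memcpy(&value, peb + KCT_OFFSET_64, sizeof value);
        return value;
    } else {
        uint32_t value;
        memcpy(&value, peb + KCT_OFFSET_32, sizeof value);
        return value;
    }
}

/* Non-zero only when the table pointer is set, that is, when user32
 * has been initialized in the process. */
static int callback_table_is_ready(const unsigned char *peb, int is_64bit)
{
    return read_kernel_callback_table(peb, is_64bit) != 0;
}
```

A driver would perform the equivalent test on the PEB of the target process and skip the call entirely when the field is NULL.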
On Windows XP, the operating system places no condition on the state of the current thread when KeUserModeCallback is called, so it is safe to call the function whenever you want.
Starting from Windows Vista, things are different. The KeUserModeCallback function checks for the CallOutActive flag in the Flags field of the current _KTHREAD structure (this flag is set at least by the nt!KeExpandKernelStackAndCalloutEx function). If it is present, the operating system issues a bugcheck with the undocumented code 0x107.
On Windows 8, the Microsoft developers added further conditions that must hold for the call to succeed.
The first check performed by Windows 8 ensures that the current thread runs at PASSIVE_LEVEL. If not, the operating system issues a bugcheck with code 0x4A (IRQL_GT_ZERO_AT_SYSTEM_SERVICE).
Then, the operating system checks whether APCs are enabled. If not, the operating system issues a bugcheck with code 1 (APC_INDEX_MISMATCH).
Finally, the operating system checks the CallbackNestingLevel field of the current thread. If this value reaches 32, the function fails with status 0xC00000FD (STATUS_STACK_OVERFLOW). This field is maintained by KeUserModeCallback to record the number of nested calls to user-mode callbacks.
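The three Windows 8 checks can be summarized as a small decision function. The following is only a model of the order described above, written for user mode: THREAD_STATE and PRECHECK_RESULT are hypothetical types, and the sketch does not read any real kernel structure.

```c
#include <stdint.h>

#define PASSIVE_LEVEL                   0u
#define IRQL_GT_ZERO_AT_SYSTEM_SERVICE  0x4Au
#define APC_INDEX_MISMATCH              0x01u
#define STATUS_SUCCESS                  0x00000000u
#define STATUS_STACK_OVERFLOW           0xC00000FDu
#define MAX_CALLBACK_NESTING            32u

/* Hypothetical snapshot of the thread state relevant to the checks. */
typedef struct {
    unsigned irql;                   /* current IRQL                   */
    int      apcs_disabled;          /* non-zero when APCs are disabled */
    unsigned callback_nesting_level; /* nested user-mode callbacks      */
} THREAD_STATE;

/* Outcome of the model: a bugcheck code (0 when none) or a status. */
typedef struct {
    uint32_t bugcheck;
    uint32_t status;
} PRECHECK_RESULT;

/* Applies the checks in the order given in the text. */
static PRECHECK_RESULT model_win8_prechecks(const THREAD_STATE *t)
{
    PRECHECK_RESULT r = { 0, STATUS_SUCCESS };

    if (t->irql != PASSIVE_LEVEL) {
        r.bugcheck = IRQL_GT_ZERO_AT_SYSTEM_SERVICE;
        return r;
    }
    if (t->apcs_disabled) {
        r.bugcheck = APC_INDEX_MISMATCH;
        return r;
    }
    if (t->callback_nesting_level >= MAX_CALLBACK_NESTING)
        r.status = STATUS_STACK_OVERFLOW;
    return r;
}
```

A driver can use the same conditions defensively: verifying the IRQL and the APC state before the call turns a potential bugcheck into an ordinary error path.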
User-mode callback for loading a library
Among the functions in this table, the user32!__ClientLoadLibrary function pointer is of particular interest.
This functionality is natively used by win32k.sys to inject uxtheme.dll into running processes, allowing the operating system to apply visual styles to applications.
This operation is twofold. First, it effectively loads the module in the process memory, as if it were loaded by userland code. Then, a function called ThemeInitApiHook is invoked giving uxtheme.dll a chance to provide alternate implementations for various functions used by user32. We will not dive into the details of how this initialization function is called and what the patched functions are used for. We will just try to describe the parameters needed to load a module without calling any specific initialization function.
Function index
The first parameter requested is the ApiNumber. The value for the ‘load library’ feature depends on the operating system version.
Note that the index does not depend on the operating system flavor (32-bit or 64-bit).
Input buffer
The second and third parameters of the function are the input buffer and its associated length, in bytes.
The input buffer for the ‘load library’ feature is described by the following structure:
This structure is a specialization of a more general mechanism that exists in the win32k.sys driver for user-mode callbacks: it is composed of a fixed header (from dwBufferSize to bFixed) and variable-length data (starting from lpDLLPath).
dwBufferSize contains the length of the whole buffer, including the variable-length data.
dwAdditionalData contains the length of the variable-length data.
pbFree is a pointer to the end of the variable-length data.
We will not go into the implementation details of how this dynamic buffer is allocated or how the previous fields are used by the Windows routines. We only have to mimic the way the buffer is filled in order to call KeUserModeCallback.
Note: you can have a look at the win32k!AllocCallbackMessage and win32k!CaptureCallbackData functions called by win32k!ClientLoadLibrary if you want to understand how this structure is allocated and updated.
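As a rough aid, here is one way to lay out the Windows XP 32-bit flavor of the input buffer in C. This is a reconstruction, not the original listing: the field names come from this post, but the order of the fields between dwAdditionalData and bFixed, the name dwFixupsOffset, and the placement of offCbk as a trailing array are assumptions chosen so that the layout agrees with the offsets quoted in the Windows XP example of this post (string pointer at 0x1c, fix-up array at 0x24, string data at 0x28). All pointers are stored as 32-bit values so the layout is the same on any build host.

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit UNICODE_STRING, with the pointer kept as a 32-bit value. */
typedef struct {
    uint16_t Length;        /* in bytes, without the terminator          */
    uint16_t MaximumLength; /* in bytes                                  */
    uint32_t Buffer;        /* offset until fixed, then a real address   */
} UNICODE_STRING32;

/* Assumed Windows XP 32-bit layout of the 'load library' input buffer. */
typedef struct {
    uint32_t dwBufferSize;     /* length of the whole buffer             */
    uint32_t dwAdditionalData; /* length of the variable-length data     */
    uint32_t dwFixupsCount;    /* number of entries in offCbk            */
    uint32_t pbFree;           /* end of the variable-length data        */
    uint32_t dwFixupsOffset;   /* hypothetical name: offset of offCbk    */
    uint32_t bFixed;           /* FALSE while pointers are still offsets */
    UNICODE_STRING32 lpDLLPath; /* path of the module to load            */
    uint32_t lpfnNotify;       /* function to call after loading, or 0   */
    uint32_t offCbk[1];        /* offsets of the fields to fix up        */
    /* the characters of the DLL path follow the structure */
} LOAD_LIBRARY_INPUT_XP32;
```

With this layout, the fixed header is 0x18 bytes long and the path characters start right after the structure, at offset 0x28.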
Relocatable buffer
The buffer supplied by the caller of KeUserModeCallback resides in kernel memory. It must eventually reside in the user memory of the process (it is copied onto the userland stack) in order to be handled by the userland functions.
In order to make the buffer location-independent, the Windows developers implemented a simple mechanism based on ‘fix-ups’. If bFixed is FALSE, each pointer holds not an address but an offset relative to the beginning of the structure.
Let’s take for example the buffer passed to user32!__ClientLoadLibrary on 32-bit Windows XP:
The buffer contains one fix-up (dwFixupsCount = 1). The array holding this fix-up is at offset 0x24 from the beginning of the structure (thus at address 0xef5a7904). The first and only element of this array is the offset of the value to fix: the buffer pointer of the UNICODE_STRING (value 0x28 at offset 0x1c). Once fixed, this pointer holds the real memory address (0xef5a78e0 + 0x28 = 0xef5a7908).
This resolution is performed by the FixupCallbackPointers function of user32, after the buffer has been copied to the userland stack.
The code for this function looks like:
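As a rough illustration of that resolution step (a sketch written for this walkthrough, not the disassembled user32 code), the routine below applies 32-bit fix-ups to a local copy of the buffer. The helper names are hypothetical, and the fix-up count and array offset are passed as arguments rather than read from the header. The demo function replays the Windows XP numbers quoted above.

```c
#include <stdint.h>
#include <string.h>

/* Unaligned-safe 32-bit accessors on a byte buffer. */
static uint32_t load_u32(const unsigned char *p)
{
    uint32_t value;
    memcpy(&value, p, sizeof value);
    return value;
}

static void store_u32(unsigned char *p, uint32_t value)
{
    memcpy(p, &value, sizeof value);
}

/* buf:  local copy of the callback buffer
 * base: address the buffer occupies in the target process
 * Each entry of the fix-up array is the offset of a 32-bit field that
 * currently holds an offset; the field is rewritten as base + offset. */
static void apply_fixups32(unsigned char *buf, uint32_t base,
                           uint32_t fixup_count, uint32_t fixup_array_offset)
{
    uint32_t i;

    for (i = 0; i < fixup_count; i++) {
        uint32_t field_offset = load_u32(buf + fixup_array_offset + 4 * i);
        uint32_t relative     = load_u32(buf + field_offset);
        store_u32(buf + field_offset, base + relative);
    }
}

/* Replays the example: one fix-up stored at 0x24, pointing at the field
 * at 0x1c, which holds the relative value 0x28. */
static uint32_t demo_xp_fixup(void)
{
    unsigned char buf[0x40] = {0};

    store_u32(buf + 0x24, 0x1c);
    store_u32(buf + 0x1c, 0x28);
    apply_fixups32(buf, 0xef5a78e0u, 1, 0x24);
    return load_u32(buf + 0x1c);
}
```

In this replay, the field at offset 0x1c ends up holding 0xef5a7908, the address computed in the text.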
Load library-specific parameters
The first parameter in the dynamic part of the input buffer given to KeUserModeCallback is the name of the module to load; it is specified in the lpDLLPath field of the structure. The module is eventually loaded by the call to the kernel32!LoadLibraryExW function.
The second parameter passed in the buffer describes a function to call once the library is loaded and depends on the operating system version.
On Windows XP, the field (lpfnNotify) is an offset relative to the loaded module of the function to call. Starting from Windows Vista, the field (lpInitFunctionName) is the name of the function to call; this function must be exported because it is retrieved with the help of GetProcAddress.
To skip the initialization function call, simply specify a 0 value for lpfnNotify on Windows XP or specify no relocation for the function name (dwFixupsCount = 1 and offCbk[1] = 0) starting from Windows Vista.
Output buffer
On output, KeUserModeCallback fills the OutputBuffer and OutputLength parameters with the results of the call if it succeeds.
For the load library case, the contents of the whole output buffer have not been investigated. However, the beginning of the output buffer matches the structure:
The lpBaseAddress field contains the base address of the loaded module.
What about Wow64?
What we described so far is relevant for 32-bit processes on 32-bit operating systems and 64-bit processes on 64-bit systems. But what about 32-bit processes on 64-bit systems?
The good news is that it works the same way from the kernel's point of view, so what we explained is still relevant.
In a Wow64 process, the first change is that the KernelCallbackTable field of the Process Environment Block now points to wow64win module functions:
These functions perform an extra marshalling step between 64-bit and 32-bit structures.
Regarding the load library functionality, the wow64win!whcbClientLoadLibrary function first calls wow64win!FixupCaptureBuf64 which resolves the relative offsets.
The original buffer contains the raw data as received by the kernel:
In the original buffer, two fix-ups are declared. The pseudo-pointers are 64 bits long (in yellow and green in the previous image).
wow64win!FixupCaptureBuf64 replaces the relative offsets with absolute addresses. Since all relocations are performed, it sets the number of fix-ups to ‘0’.
The second step builds another buffer containing only the static part of the structure with a layout matching the 32-bit code expectations.
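To give an idea of what such a conversion involves, here is a minimal sketch that narrows a 64-bit UNICODE_STRING into its 32-bit counterpart once the pointer has been resolved. It is an illustration of the general idea only: the type and function names are hypothetical and the actual wow64win code has not been reproduced.

```c
#include <stdint.h>

/* UNICODE_STRING as seen by 64-bit code, pointer kept as an integer. */
typedef struct {
    uint16_t Length;
    uint16_t MaximumLength;
    uint64_t Buffer;
} UNICODE_STRING64;

/* UNICODE_STRING as expected by 32-bit code. */
typedef struct {
    uint16_t Length;
    uint16_t MaximumLength;
    uint32_t Buffer;
} UNICODE_STRING32;

/* Copies src into dst with a 32-bit pointer.
 * Returns 0 on success, or -1 when the address does not fit in 32 bits
 * and therefore cannot be handed to 32-bit code. */
static int thunk_unicode_string(const UNICODE_STRING64 *src,
                                UNICODE_STRING32 *dst)
{
    if (src->Buffer > UINT32_MAX)
        return -1;
    dst->Length        = src->Length;
    dst->MaximumLength = src->MaximumLength;
    dst->Buffer        = (uint32_t)src->Buffer;
    return 0;
}
```

The narrowing is only valid because the buffer was copied to the stack of a 32-bit thread, whose addresses lie below 4 GB.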
Control is then passed to the user32!__ClientLoadLibrary function, which performs the operations as if it were running on a 32-bit operating system.
Since the loading of the specified module is performed as if it were called by the userland process, the standard restrictions and behaviors are applicable. In particular, DLL redirection is in effect and loading of a DLL in c:\Windows\System32 will be automatically redirected to c:\Windows\SysWOW64.
Conclusions
It is possible to use the KeUserModeCallback function to load a custom library into processes, provided they use the user32 module. In practice, nearly all end-user applications use this module, so this should not be a strong constraint. Since this function and its parameters are not documented, this functionality is subject to change in future versions (even though it has not changed much over the past 15 years).
If you are interested in investigating other methods of executing user-mode code from the kernel, you can also have a look at the six-part series of articles published by Nynaeve.