In 2014, Thierry F. wrote an article about a technique that allows a driver to inject a DLL into a process. It was based on reverse engineering the PEB.KernelCallbackTable field, which is untyped and completely undocumented. As that article shows, behind this opaque pointer sits a big table of pointers related to User32.dll. This means that a process that does not load User32.dll will not have the PEB.KernelCallbackTable field initialized.
Furthermore, because this is completely undocumented by Microsoft, it’s also not supported at all. The pointer to User32!__ClientLoadLibrary has been located in PEB.KernelCallbackTable since at least Windows XP, but its index in the table changes with each new version of Windows.
What if I told you that there is a better-documented way to do the same thing, one that doesn’t rely on any reverse engineering or assembly code at all? And what if I added that the injected DLL is invisible to any API (GetModuleHandle, for example) and even to the !peb and !dlls commands in WinDbg? Is that enough to tickle your curiosity?
Just a note before we start: none of the code you’ll see here is bulletproof or production-ready. It is only a sample, a proof of concept to illustrate the technique. I am aware that there are some loopholes that need to be fixed; consider them an exercise left to the reader. You can find all the code detailed in this article in this repository: github.com/stormshield/Beholder-Win32
Note that the code in this article is also stripped of some details (such as support for injecting into WoW64 processes) for readability, but the GitHub repository holds the full code.
Peeping through the keyhole
When a program loads a DLL, a cascade of events happens to this DLL. Roughly:
- A handle is opened on the DLL
- A section with the mapped size of the DLL is created
- The DLL is mapped into this section
- The loader does some internal work, such as taking care of alignment, section access rights, etc.
- Finally, the DLL’s entry point (DllMain) is called
Again, this is a very rough description; the goal here is not to dive into every detail of this process. Anyway, what prevents a driver from doing exactly this? Opening a handle can easily be done with ZwCreateFile, and mapping the DLL with ZwCreateSection/ZwMapViewOfSection. That leaves two problematic steps: the job done by the loader and the execution of a function in this DLL. Fortunately, we can ask the Windows kernel to do that for us.
Before diving into the kernel code, let’s talk about the final objective here: the DLL. This DLL must NOT have any dependencies. The loader will not resolve them for us (at least with the technique described here), and it could be complicated (and even unsafe, since we are talking about userland memory) to do it manually. The loader will also not fill the IAT (Import Address Table) of the DLL, so the DLL should be able to run without any imports. Obviously, in that kind of situation, kernel32!LoadLibrary and kernel32!GetProcAddress are our best friends (although you will need to find them manually).
I also promised there would be no assembly, so it must be done fully in C. Now let’s take a look at the kernel code.
Dive into the rabbit hole
To inject a DLL into every process, we need to be in each process’ context. For that, nothing beats a notification callback. Let’s kill two birds with one stone and set up the LoadImage notification callback. Why this one? Because it is guaranteed to be called every time a process is created and, icing on the cake, it gives you the addresses of some interesting DLLs, like kernel32.dll. I suggest calling the function that injects the DLL into the current process when this callback notifies you of the kernel32 mapping.
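The check for "is this the kernel32 mapping?" can be sketched in portable C. This is only an illustration, not the repository's code: COUNTED_WSTR is a hypothetical stand-in for the UNICODE_STRING that the LoadImage callback receives as FullImageName (a counted UTF-16 string that is not necessarily NUL-terminated), and image_path_ends_with is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for UNICODE_STRING: Length is in bytes and the
   buffer is not necessarily NUL-terminated. */
typedef struct {
    uint16_t        Length;
    const uint16_t *Buffer;
} COUNTED_WSTR;

/* Lowercase ASCII letters only; enough to match a fixed DLL name. */
static uint16_t ascii_lower(uint16_t c)
{
    return (c >= 'A' && c <= 'Z') ? (uint16_t)(c + ('a' - 'A')) : c;
}

/* Returns 1 when the image path ends with the given ASCII suffix
   (case-insensitive), 0 otherwise. */
int image_path_ends_with(const COUNTED_WSTR *path, const char *suffix)
{
    size_t path_chars, suffix_chars = 0, i;

    if (path == NULL || path->Buffer == NULL || suffix == NULL) return 0;
    path_chars = path->Length / sizeof(uint16_t);
    while (suffix[suffix_chars] != '\0') suffix_chars++;
    if (suffix_chars == 0 || suffix_chars > path_chars) return 0;

    for (i = 0; i < suffix_chars; i++) {
        uint16_t a = ascii_lower(path->Buffer[path_chars - suffix_chars + i]);
        uint16_t b = ascii_lower((uint16_t)(unsigned char)suffix[i]);
        if (a != b) return 0;
    }
    return 1;
}
```

Including the leading backslash in the suffix ("\\kernel32.dll") avoids matching a file that merely ends with the same characters. In a real callback the image name pointer can be NULL, hence the defensive checks.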
In this function, we'll retrieve a handle on the current process:
Now, we’re going to create and map the section in the current process memory:
We map the DLL with read and execute rights only, which prevents userland from modifying it. We also set the SEC_IMAGE flag, which tells the kernel to map the file as an executable image. This means that it performs all the required fixups and alignment. But it will not resolve the imports of this DLL!
Now that our DLL is mapped, we can easily call one of its functions. But first, we may need to pass this function some parameters so that it can run properly. To do that, we first allocate some userland memory that will hold those parameters:
This page is read-only. At the very least, that prevents userland from tampering with your input parameters. To block modification even further, I set the SEC_NO_CHANGE flag. Now, even VirtualProtect or NtMapViewOfSection will fail if it targets our page. Note that you can also call MmSecureVirtualMemory on top of that. If you need your DLL to write some output parameters, I suggest allocating another page with write access, so that the input parameters remain unmodified no matter what.
Okay, one last step: we need to set up the input parameters of our DLL. Just create a custom structure with the input parameters and fill it. Here, I just need some information about kernel32.dll. An MDL is a nice way to get around the PAGE_READONLY protection set previously on the input parameters: it allows us to map the same page into kernel address space with both read and write access.
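As an illustration of such a custom structure, here is a minimal sketch. The field names and the fill_dll_params helper are hypothetical, not the ones from the repository. It uses fixed-width fields so that a 64-bit driver and a 32-bit DLL would agree on the layout, and it assumes a C11 compiler for _Static_assert.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parameter block shared by the driver and the DLL.
   No pointer-typed fields: their size would differ between a 64-bit
   driver and a 32-bit DLL. */
typedef struct _DLL_PARAMS {
    uint64_t Kernel32Base;  /* address of kernel32.dll in the target process */
    uint32_t Kernel32Size;  /* size of its mapped image, in bytes */
    uint32_t Reserved;      /* explicit padding, keeps the size at 16 bytes */
} DLL_PARAMS;

_Static_assert(sizeof(DLL_PARAMS) == 16, "layout must not depend on the target");
_Static_assert(offsetof(DLL_PARAMS, Kernel32Size) == 8, "unexpected field offset");

/* Fills the block. Returns 0 when the arguments are missing or when the
   image does not fit in the address space of a 32-bit target, 1 otherwise. */
int fill_dll_params(DLL_PARAMS *out, uint64_t base, uint32_t size,
                    int target_is_32bit)
{
    if (out == NULL || base == 0 || size == 0) return 0;
    if (target_is_32bit &&
        (base > 0xFFFFFFFFull || base + size > 0x100000000ull)) return 0;

    out->Kernel32Base = base;
    out->Kernel32Size = size;
    out->Reserved = 0;
    return 1;
}
```

In the driver, the `out` pointer would be the writable kernel-side mapping of the parameter page, while userland only ever sees the read-only view.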
The last piece of the puzzle is the execution of our DLL. To make this DLL as small as possible, I didn’t define any exports, only an entry point fixed at 0x1000. Just use RtlCreateUserThread and you’re good to go. I suggest retrieving the entry point’s offset dynamically by parsing the PE header, though.
RtlCreateUserThread has one major drawback: it’s not available on Windows 7. It is exported in XP and from Windows 8 onwards, though. So, if you need to support Windows 7, you will have to find another solution. I may or may not have something up my sleeve about this situation that I will reveal a bit later 😉
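Retrieving the entry point from the PE header could look like the sketch below. It is written as portable C with raw offsets from the PE format (e_lfanew at 0x3C, AddressOfEntryPoint at offset 16 of the optional header) instead of the Windows header structures, and pe_entry_point_rva is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>

/* Little-endian readers, safe on any host and for unaligned data. */
static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Stores the RVA of the entry point in *rva and returns 1, or returns 0
   when the buffer does not look like a PE image. */
int pe_entry_point_rva(const uint8_t *image, size_t size, uint32_t *rva)
{
    uint32_t nt;
    uint16_t magic;

    if (image == NULL || rva == NULL || size < 0x40) return 0;
    if (rd16(image) != 0x5A4D) return 0;           /* "MZ" */
    nt = rd32(image + 0x3C);                       /* e_lfanew */
    /* signature (4) + file header (20) + first 20 bytes of optional header */
    if (nt > size || size - nt < 4 + 20 + 20) return 0;
    if (rd32(image + nt) != 0x00004550) return 0;  /* "PE\0\0" */
    magic = rd16(image + nt + 24);
    if (magic != 0x10B && magic != 0x20B) return 0; /* PE32 or PE32+ */
    *rva = rd32(image + nt + 24 + 16);             /* AddressOfEntryPoint */
    return 1;
}
```

The thread start address is then the base address of the mapped view plus this RVA. The field sits at the same offset in 32-bit and 64-bit images, so one routine covers both.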
I’m the one who knocks
Now a thread has been created, even before the main thread of the process has reached the entry point of the program. This thread will not run right away, since we are in the middle of a DLL initialization. So I cannot wait for my new thread to finish, since that would deadlock the current process (and, in the end, the workstation). Yet I need to know when my thread is done in order to clean up the sections, read the output left by my userland code, etc. If only there was some kind of notification where I could register a custom callback that would be called whenever a thread has finished…
…
…
Oh wait!
Yup, you guessed it! PsSetCreateThreadNotifyRoutine will be our friend here. You just need to keep a context when you create the thread with RtlCreateUserThread, then, in the callback registered through PsSetCreateThreadNotifyRoutine, look for any terminating thread that matches the CLIENT_ID received as an output parameter of RtlCreateUserThread. You’ll end up with something like this:
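The original listing is not reproduced here, so what follows is only a userland model of the bookkeeping, not driver code. PENDING_INJECTION, remember_injection and on_thread_notify are hypothetical names, the IDs are plain integers standing in for the two fields of CLIENT_ID, and a real driver would protect the table with a lock because notifications can arrive concurrently.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 16

typedef struct {
    uintptr_t process_id;  /* stands in for CLIENT_ID.UniqueProcess */
    uintptr_t thread_id;   /* stands in for CLIENT_ID.UniqueThread */
    void     *context;     /* whatever must be cleaned up afterwards */
    int       in_use;
} PENDING_INJECTION;

static PENDING_INJECTION g_pending[MAX_PENDING];

/* Called right after the thread is created. Returns 0 when the table is full. */
int remember_injection(uintptr_t pid, uintptr_t tid, void *context)
{
    size_t i;
    for (i = 0; i < MAX_PENDING; i++) {
        if (!g_pending[i].in_use) {
            g_pending[i].process_id = pid;
            g_pending[i].thread_id = tid;
            g_pending[i].context = context;
            g_pending[i].in_use = 1;
            return 1;
        }
    }
    return 0;
}

/* Same shape as a thread notification: (process, thread, create flag).
   Returns the saved context when one of our threads terminates, else NULL. */
void *on_thread_notify(uintptr_t pid, uintptr_t tid, int create)
{
    size_t i;
    if (create) return NULL;  /* only terminations matter here */
    for (i = 0; i < MAX_PENDING; i++) {
        if (g_pending[i].in_use &&
            g_pending[i].process_id == pid &&
            g_pending[i].thread_id == tid) {
            g_pending[i].in_use = 0;
            return g_pending[i].context;  /* caller unmaps and frees */
        }
    }
    return NULL;
}
```

When the lookup returns a context, the caller knows the injected code has finished and can release the sections or read the output page.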
You will notice that the thread created by RtlCreateUserThread can finish even before the process has started! In some situations, this can be very useful. Note that you can also call PsGetThreadExitStatus to retrieve the exit code of your thread.
There is no spoon
Now, the final step: the DLL itself. It must be compiled in a certain way if you want it without any IAT. First, remove all dependencies and default libraries.
Then, disable the security checks that require the CRT. This means no /GS and no /RTC.
Set your entry point to whatever function you want.
And voilà! You're all set. You can obviously remove some options to make it smaller, but that’s up to you. It is possible to get a DLL smaller than 4KB that is still able to retrieve LdrLoadDll or GetProcAddress by itself. The function designated as the entry point of your DLL will have this prototype:
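The exact declaration lives in the repository; the sketch below only assumes the usual shape of a thread start routine, which takes a single pointer argument and returns a status code. PayloadEntry, PAYLOAD_CALL and the DLL_PARAMS fields are hypothetical names chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Possible MSVC build settings, given as an assumption:
   compiler /GS- /Oi (and no /RTC), linker /NODEFAULTLIB /ENTRY:PayloadEntry */

#if defined(_MSC_VER)
#define PAYLOAD_CALL __stdcall
#else
#define PAYLOAD_CALL
#endif

/* Hypothetical parameter block filled by the driver. */
typedef struct _DLL_PARAMS {
    uint64_t Kernel32Base;  /* address of kernel32.dll in this process */
    uint32_t Kernel32Size;  /* size of its mapped image, in bytes */
    uint32_t Reserved;
} DLL_PARAMS, *PDLL_PARAMS;

/* Entry point: receives the block prepared by the driver and returns
   the thread exit code (0 on success in this sketch). */
uint32_t PAYLOAD_CALL PayloadEntry(PDLL_PARAMS params)
{
    const uint8_t *kernel32;

    if (params == NULL || params->Kernel32Base == 0) return 1;

    kernel32 = (const uint8_t *)(uintptr_t)params->Kernel32Base;
    (void)kernel32;  /* export lookup on kernel32 would start from here */
    return 0;
}
```

The value returned here becomes the exit status of the injected thread, which the driver can read once the thread has terminated.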
PDLL_PARAMS is a pointer to a structure I defined earlier. It is passed as the start parameter of RtlCreateUserThread (the argument that follows the start address). Yup, you can share any kind of data blob between your driver and your DLL. Now, you’re free to parse kernel32 in order to search for LoadLibrary and GetProcAddress, all using plain C code. Even better, you don’t have to worry about things like relocations or relative addresses. For example, using strlen or strcmp like below on a constant string works like a charm, all thanks to the DLL format and the loader fixing everything for us. You can even use globals without any issue.
And, last but not least, if you want to use functions like strcmp, strlen, etc. as in the example above without any imports (which is recommended here), set the /Oi option in your DLL’s project. You can find the list of usable intrinsic functions here: msdn.microsoft.com/en-us/library/tzkfha43.aspx
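To give an idea of what parsing kernel32 for an export involves, here is a hedged sketch that walks the export directory of a mapped image using raw offsets from the PE format. find_export_rva and the other helper names are hypothetical, the name comparison is hand-written instead of relying on strcmp so that the block has no CRT dependency, and forwarded exports are simply reported as not found.

```c
#include <stddef.h>
#include <stdint.h>

static uint16_t rd16(const uint8_t *p) { return (uint16_t)(p[0] | (p[1] << 8)); }
static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* 1 when [rva, rva + len) lies inside the mapped image. */
static int in_image(size_t size, uint64_t rva, uint64_t len)
{
    return rva <= size && len <= size - rva;
}

/* Bounded, CRT-free comparison of the string at rva with name. */
static int name_matches(const uint8_t *base, size_t size, uint64_t rva,
                        const char *name)
{
    while (rva < size) {
        char c = (char)base[rva++];
        if (c != *name) return 0;
        if (c == '\0') return 1;
        name++;
    }
    return 0;
}

/* Returns the RVA of the named export, or 0 when it is absent, forwarded,
   or the image is malformed. */
uint32_t find_export_rva(const uint8_t *base, size_t size, const char *name)
{
    uint64_t nt, opt, dir, funcs, names, ords;
    uint32_t exp_rva, exp_size, n_funcs, n_names, i;
    uint16_t magic;

    if (base == NULL || name == NULL) return 0;
    if (!in_image(size, 0, 0x40) || rd16(base) != 0x5A4D) return 0;
    nt = rd32(base + 0x3C);                                   /* e_lfanew */
    if (!in_image(size, nt, 26) || rd32(base + nt) != 0x00004550) return 0;
    opt = nt + 24;
    magic = rd16(base + opt);
    if (magic == 0x10B) dir = opt + 96;                       /* PE32 */
    else if (magic == 0x20B) dir = opt + 112;                 /* PE32+ */
    else return 0;
    if (!in_image(size, dir, 8)) return 0;
    exp_rva = rd32(base + dir);                               /* data directory 0 */
    exp_size = rd32(base + dir + 4);
    if (exp_rva == 0 || exp_size < 40 || !in_image(size, exp_rva, exp_size))
        return 0;

    n_funcs = rd32(base + exp_rva + 20);
    n_names = rd32(base + exp_rva + 24);
    funcs = rd32(base + exp_rva + 28);                        /* AddressOfFunctions */
    names = rd32(base + exp_rva + 32);                        /* AddressOfNames */
    ords = rd32(base + exp_rva + 36);                         /* AddressOfNameOrdinals */

    for (i = 0; i < n_names; i++) {
        uint32_t fn;
        uint16_t idx;
        if (!in_image(size, names + 4ull * i, 4) ||
            !in_image(size, ords + 2ull * i, 2)) return 0;
        if (!name_matches(base, size, rd32(base + names + 4ull * i), name))
            continue;
        idx = rd16(base + ords + 2ull * i);
        if (idx >= n_funcs || !in_image(size, funcs + 4ull * idx, 4)) return 0;
        fn = rd32(base + funcs + 4ull * idx);
        if (fn >= exp_rva && fn - exp_rva < exp_size) return 0; /* forwarder */
        return fn;
    }
    return 0;
}
```

The address to call is the image base plus the returned RVA. A linear scan keeps the sketch short; since the name table is sorted, a binary search would also work.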
The cake is a lie
I must confess, I lied a bit in the introduction. We are not really loading a DLL, but using the structure of a DLL to wrap up some code. This allows us to create a file-backed section and map it into the userland memory of the current process while using documented functions. Even the loader helps us (a little bit at least) in our journey!
For me, this technique is more reliable than the one using KeUserModeCallback(). It works like a charm with both 32-bit and 64-bit processes: you just need to recompile for the desired architecture. WoW64 processes can be injected too, but mind your pointer sizes when filling the input parameters, since your driver will be 64-bit and your DLL 32-bit. The code above might need some adjustments here and there, but nothing dramatic. The only major drawback is that it won’t work on Windows 7, since RtlCreateUserThread is not exported there. In the end, it’s not perfect, but it’s pretty close. In fact, it may even be the best technique to inject code from a driver you’ve ever seen.