Showing posts with label linux. Show all posts

Friday, April 3, 2015

Linux Loadable Kernel Module in Assembly

Hello everyone! First of all, sorry for being silent for the last two years. There have been certain reasons for this. Anyway, I am back and I am going to share a portion of what I've learnt over this period.

Before I begin, as usual, a note for nerds: the code in this article is for demonstration purposes only and does not contain certain things, like error checking, that would otherwise be inevitable. 

I have recently seen tons of posts on the Internet about writing a kernel module for a pre-compiled kernel. Guys are doing good work, but there is one thing that I personally did not like - they all refer you to the configuration file for such a kernel, which may be obtained one way or another. Well, having the configuration of the running kernel makes it almost no different from building a module for a kernel you compiled yourself (just almost). The bottom line - if you want something to be done your way, do it yourself.

Tools used

Since building a kernel module written in C for the kernel you have no .config for may become a huge pain in certain parts of your body, I decided to go as low as possible and chose flat assembler (the good old flat assembler that may be found here). This wonderful instrument provides you with everything you may need when it comes to x86/x86_64 development (of course, most of your potential projects may be too complex for being implemented in assembly).


Target system

I was brave enough to perform this experiment on my dev machine running Debian with 3.2.0-4 kernel. Obviously, I do have proper kernel sources installed, but made no use of them in this example.


Loadable Kernel Module

I am not going to dive into the basics of Linux kernel structure and the way LKM support is implemented. It is simply irrelevant at this time. What we are interested in is the structure of a module. To put it simply, the structure of an LKM may be described as:
  1. .init.text section  - contains all the module initialization code.
  2. .exit.text section - contains all the cleanup code executed right before the module is unloaded.
  3. Module information.
  4. All the rest.
While we may keep "all the rest" out of it for now, we do need to take care of proper representation of the init/exit sections and module information. In fact, init/exit sections are not a problem at all - that's just code after all, whereas module information is a bit problematic. But, first things first.

.modinfo section

This section contains some strings that let the kernel identify our module as one that may be safely loaded and executed.

The first string tells the kernel about how our module is licensed:

"license=GPL"

You may use another license (e.g. "proprietary"), but that would make some symbols exported by the kernel invisible to your module.

The next one is

"depends="

here you should list modules your module depends on. Since our tiny module has no dependencies, we leave this string empty.

The last and the most important one is:

"vermagic=3.2.0-4-amd64 SMP mod_unload modversions "

this string tells us (and the kernel) which kernel the module was built for and what LKM handling options are enabled. However, the above string contains information that is good for building a module on my system, but it may (and almost certainly will) be wrong for your system. Don't worry, there is a simple way to get this string - run /sbin/modinfo on any *.ko file in your /lib/modules/`uname -r`/ directory.
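The .modinfo section is nothing more than a run of NUL-terminated "key=value" strings, so the vermagic value can also be pulled out of a module image programmatically. A minimal sketch, assuming the section bytes are already in memory; modinfo_lookup is a hypothetical helper name, not a kernel or libc function:

```c
#include <stddef.h>
#include <string.h>

/* Walks a buffer laid out like the .modinfo section: a sequence of
 * NUL-terminated "key=value" strings. Returns a pointer to the value of
 * the requested key, or NULL when the key is absent.
 * modinfo_lookup is a hypothetical helper, not a kernel or libc API. */
static const char *modinfo_lookup(const char *blob, size_t size, const char *key)
{
    size_t klen = strlen(key);
    size_t pos = 0;

    while (pos < size) {
        const char *entry = blob + pos;
        const char *end = memchr(entry, '\0', size - pos);
        size_t elen;

        if (end == NULL)
            break;                      /* unterminated tail, stop here */
        elen = (size_t)(end - entry);

        if (elen > klen && strncmp(entry, key, klen) == 0 && entry[klen] == '=')
            return entry + klen + 1;    /* value starts right after '=' */

        pos += elen + 1;                /* skip the string and its NUL */
    }
    return NULL;
}
```

Calling modinfo_lookup(blob, size, "vermagic") on the three strings shown above would hand back the text after "vermagic=".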

__versions section

You can try to build a module without this section and it may even load and do its job, but you will get some nasty complaints from the kernel on being tainted.

The purpose of this section is to make sure your module and the kernel are speaking the same language, meaning they use identical symbols. Its structure is rather simple - an array of checksum/name pairs, where the checksum takes 8 bytes (in my case, on an x86_64 system) and is followed by a 56-byte name (shorter names are padded with zeroes). Finding the proper values is not as simple if you do not have properly configured kernel sources, though. You would have to check the __versions section of some existing modules for the symbols you need. I would suggest doing so in IDA Pro, but any hex editor would suffice too.
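The layout described above can be written down as a C structure. This is only a sketch for x86_64: version_entry and version_entry_set are names made up for the illustration, and the checksum used in the usage note is the printk value from the listing further below.

```c
#include <stdint.h>
#include <string.h>

/* One entry of the __versions section on x86_64:
 * an 8-byte checksum followed by a 56-byte, zero-padded symbol name. */
struct version_entry {
    uint64_t crc;
    char     name[56];
};

/* Fills one entry. Names longer than 55 characters are truncated here,
 * which is a simplification made for this sketch. */
static void version_entry_set(struct version_entry *e, uint64_t crc,
                              const char *name)
{
    memset(e, 0, sizeof *e);                      /* zero padding for the name */
    e->crc = crc;
    strncpy(e->name, name, sizeof e->name - 1);   /* keep the final NUL */
}
```

For example, version_entry_set(&e, 0x27e1a049, "printk") produces the same 64 bytes as the last three lines of the assembly listing.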

.gnu.linkonce.this_module section

This section contains just one structure - module. I would not like to dive into specifics of this structure, after all, you can download kernel source and check include/linux/module.h file for struct module declaration. What is important to know, however, is that this structure contains the name of the module (as it would appear in lsmod's output) and pointers to module_init() and module_cleanup() functions.

Implementation

Well, seems like we've covered all the most important aspects. Let's get to the implementation itself. The following code may be compiled with flat assembler.

format ELF64
extrn printk

section '.init.text' executable
module_init:
push rdi
mov rdi, str1
xor eax, eax            ; printk is variadic, no vector registers are used
call printk
xor eax, eax            ; return 0 (success)
pop rdi
ret

section '.exit.text' executable
module_cleanup:
xor eax, eax
ret

section '.rodata.str1.1'
str1 db '<0> Here I am, gentlemen!', 0x0a, 0

section '.modinfo' align 10h
db 'license=GPL', 0
db 'depends=', 0
; vermagic from the original version of this listing:
; db 'vermagic=3.2.0-4-amd64 SMP mod_unload modversions ', 0
db 'vermagic=3.16.0-4-amd64 SMP mod_unload modversions ', 0

section '.gnu.linkonce.this_module' writable
this_module:
rb 18h
db 'simple_module', 0
rb 150h - ($ - this_module)     ; init pointer offset (148h in the pre-update listing)
dq module_init
rb 248h - ($ - this_module)     ; exit pointer offset (238h in the pre-update listing)
dq module_cleanup
dq 0

section '__versions'
; dq 0x568fba06                 ; module_layout checksum in the pre-update listing
dq 0x2ab9dba5
@@:
db 'module_layout', 0
rb 56 - ($ - @b)
dq 0x27e1a049
@@:
db 'printk', 0
rb 56 - ($ - @b)

Hope this article is helpful in some way. Thanks for reading and see you with the next post!

P.S. Updated the source to fit the latest kernel version.


Wednesday, March 21, 2012

Linux Threads Through a Magnifier: Remote Threads

Source code for this article may be found here.

Sometimes, a need may arise to start a thread in a separate process, and the need is not necessarily malicious. For example, one may want to replace library functions or to place some code between the executable and a library function. However, Linux does not provide a system call that would do anything similar to the CreateRemoteThread Windows API, despite the fact that I see people searching for such functionality. You may google for "CreateRemoteThread equivalent in Linux" yourself and see that at least 90% of the results end up with something like "why would you want to do that?" There is a certain type of people in forums who, most likely, think that if they do not have an answer, then it probably does not exist and no one would ever need it. Others truly believe that if they know why, they can tell you how to do that in another way. The latter is sometimes true, but most of the time the solution being requested is the only one acceptable, and that's what people refuse to understand.

So, let's say you need to inject a thread into a running process for whatever reason (maybe you want to perform a "DLL injection" the Linux way - your business). Although there is no specific system call to allow you that, there are plenty of other system calls and library functions that would "happily" assist you.

Unavoidable ptrace()
The first time you take a look at ptrace(), it is a bit frightening (just like ioctl()) - one function, lots of possible requests, and go figure out when and which parameter is being ignored. In practice, it is quite simple. This function is used by debuggers and in cases when one needs to monitor the execution of a process for whatever reason. We will use this function for thread injection in this article.

The first thing you would want to do is to attach to the target process:

   ptrace(PTRACE_ATTACH, pid, NULL, NULL);

PTRACE_ATTACH - request to attach to a running process;
pid - the ID of the process you want to attach to.

If the return value is 0 - voila, you are attached (the target is sent a SIGSTOP, so waitpid() for it before going any further). If it is -1, however, this means that an error has occurred and you need to check errno to know what has happened. You should keep in mind that on certain systems you may not be able to attach to a process which is not a descendant of the attaching one or has not specified it as tracer (using prctl()). For example, in Ubuntu this has been exactly the situation since Ubuntu 10.10. If you want to change that, you need to locate your ptrace.conf file (/etc/sysctl.d/10-ptrace.conf) and set kernel.yama.ptrace_scope to 0.
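ptrace(2) documents PTRACE_ATTACH as returning 0 on success and -1 with errno set on failure. A minimal sketch of an attach routine with the error handling in place; attach_and_wait is a hypothetical helper name, and returning a negative errno value is just a convention chosen for this example:

```c
#include <errno.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Attaches to pid and waits until the target actually stops.
 * Returns 0 on success or a negative errno value on failure.
 * attach_and_wait is a hypothetical helper name. */
static int attach_and_wait(pid_t pid)
{
    int status;

    if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1)
        return -errno;              /* e.g. -EPERM or -ESRCH */

    /* The target is sent SIGSTOP; wait until it has really stopped. */
    if (waitpid(pid, &status, 0) == -1)
        return -errno;

    return WIFSTOPPED(status) ? 0 : -ECHILD;
}
```

With the ptrace scope restriction described above in effect, attaching to an unrelated process would be expected to come back as -EPERM.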

Since I am using Ubuntu, I can only attach to a child process (unless I want some additional headache), and this is what I am going to cover in this article.


Preparations
As the first step, just like in the case of Windows, you need to write an injector. It will load the victim process, inject the shellcode and exit. This is the simplest part, and the skeleton of such a loader would look like this:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <sys/user.h>

int   main(int argc, char** argv)
{
   pid_t   pid;
   int     status;

   if(0 == (pid = fork()))
   {
      // We are in the child process, so we just ptrace() and execl()
      ptrace(PTRACE_TRACEME, 0, NULL, NULL);
      // The victim's argv[0] is its own path; the list must end with NULL
      execl(argv[1], argv[1], (char*)NULL);
   }
   else
   {
      // We are in the parent (injector)
      // Wait for exec in the child (it stops with SIGTRAP)
      waitpid(pid, &status, 0);
      // Options can only be set while the child is stopped;
      // they go into the fourth (data) argument
      ptrace(PTRACE_SETOPTIONS, pid, NULL, PTRACE_O_TRACEEXEC);
      
      // The rest of the code comes here

   }
   return 0;
}

As you can see, the loader forks and then behaves depending on the return value of the fork() function. If it returns 0, this means that we are in the child process (actually, you should check whether it returned -1, which would indicate an error), otherwise, it is a pid of the child process and we are in the parent.

Child
The child code does not have too many things to do. All that needs to be done is to tell the OS that it may be traced and replace itself with the victim executable by calling execl().

Parent
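In the case of the parent, the situation is quite different and much more complicated. You want to be notified when the victim process issues sys_execve: PTRACE_TRACEME in the child makes it stop at that point, and calling ptrace() with PTRACE_SETOPTIONS and PTRACE_O_TRACEEXEC (passed as the last argument, while the child is stopped) requests the dedicated exec event. The stop itself is picked up with a simple waitpid().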

When waitpid() returns (and you should check the return value for -1, which means error), it is still not the best time to start the injection. Especially, given that you may have no idea of what is where in the victim process. The next step is to wait for a system call to occur by telling the OS (and it would be good to skip a couple of system calls, so that the victim may initialize properly):

ptrace(PTRACE_SYSCALL, pid, NULL, NULL);

followed by a loop:

while(1)
{
   if(-1 == waitpid(pid, &status, 0))
   {
      //Some error occurred. Print a message and
      break;
   }

   if(WIFEXITED(status))
   {
      //The victim process has terminated. Print a message and
      break;
   }

   if(WIFSTOPPED(status))
   {
      // Here comes the actual injection code. Actually, all its stages.
   }
   
   if(WIFSIGNALED(status))
   {
      // The victim process received a signal and terminated. Print a message and
      break;
   }

}

// All done.
return 0;


Injection 
You should introduce a variable to count stages. Let's name it step

Stage 0 (step = 0)
I have not mentioned it, but ptrace() would notify you twice during a system call. First time right before the system call (so you can inspect registers), the second notification would arrive right after system call's completion (so you can inspect the return value). Therefore, this time we do nothing, but resume the traced victim:

ptrace(PTRACE_SYSCALL, pid, NULL, NULL);

and increment the stage variable.


Stage 1 (step = 1)
Backup victim's registers, portion of victim's code that would be overwritten with your shellcode and, finally, inject your shellcode.

Use ptrace(PTRACE_GETREGS, pid, NULL, regs) where regs is a pointer to struct user_regs_struct (declared in sys/user.h). The content of the victim's registers would be copied there.

Use ptrace(PTRACE_PEEKTEXT, pid, address_in_victim, NULL) to copy the executable code from the victim (to make a backup) and ptrace(PTRACE_POKETEXT, pid, address_in_victim, shellcode) where address_in_victim is what its name suggests (you obtain the initial value from victim's RIP on 64 or EIP on 32 bit systems). Shellcode, however, contains bytes of the code being injected packed into an unsigned long value. You, most probably, would have to make those calls for several iterations, as I do not think your shellcode would be at most 8 bytes.
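The backup-and-overwrite loop described above can be sketched as follows. pack_word and inject_code are hypothetical helper names, and padding a short tail with NOP (0x90) is a choice made for this example rather than something the technique requires, since the whole words are restored from the backup later anyway.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/* Packs up to sizeof(long) bytes of code into one word; a short tail is
 * padded with 0x90 (NOP). The padding byte is an assumption of this sketch. */
static long pack_word(const unsigned char *code, size_t len)
{
    long word;
    unsigned char bytes[sizeof word];

    memset(bytes, 0x90, sizeof bytes);
    memcpy(bytes, code, len < sizeof bytes ? len : sizeof bytes);
    memcpy(&word, bytes, sizeof word);
    return word;
}

/* Saves the victim's original words into backup[] and overwrites them
 * with the shellcode, one word per ptrace() call.
 * backup must have room for (len + sizeof(long) - 1) / sizeof(long) words.
 * Returns 0 on success, -1 on failure. */
static int inject_code(pid_t pid, unsigned long addr,
                       const unsigned char *code, size_t len, long *backup)
{
    size_t off, i = 0;

    for (off = 0; off < len; off += sizeof(long), i++) {
        errno = 0;
        backup[i] = ptrace(PTRACE_PEEKTEXT, pid, (void *)(addr + off), NULL);
        if (backup[i] == -1 && errno != 0)
            return -1;      /* -1 may also be valid data, so errno decides */

        if (ptrace(PTRACE_POKETEXT, pid, (void *)(addr + off),
                   (void *)pack_word(code + off, len - off)) == -1)
            return -1;
    }
    return 0;
}
```

Restoring the victim later is the same loop with PTRACE_POKETEXT writing the words from backup[] back to the same addresses.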

The start of your shellcode will allocate memory for the thread function (unless you are going to run code that already is there).

start:
   mov   rax, 9      ;sys_mmap
   mov   rdi, 0      ;requested address
   mov   rsi, 0x1000 ;one page
   mov   rdx, 7      ;PROT_READ | PROT_WRITE | PROT_EXEC
   mov   r10, 0x22   ;MAP_ANON | MAP_PRIVATE
   mov   r8, -1      ;fd
   mov   r9, 0       ;offset
   syscall
   db 0xCC

Increment stage variable. Resume the victim process with

ptrace(PTRACE_SINGLESTEP, pid, NULL, NULL);


Stage 2 (step = 2)
Ignore all stops until

0xCC == (unsigned char)(ptrace(PTRACE_PEEKTEXT, pid,
      ptrace(PTRACE_PEEKUSER, pid, offsetof(struct user, regs.rip), NULL), NULL) & 0xFF)

which would mean that you have reached your break point. Check victim's rax register for return value

retval = ptrace(PTRACE_PEEKUSER, pid, offsetof(struct user, regs.rax), NULL);

and abort if it contains an error code.
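A raw system call reports failure by returning -errno in rax, that is, a value between -4095 and -1; anything else is a valid result. This matters for sys_mmap, because a plain "negative means error" check is easy to get wrong. A small sketch (syscall_failed is a hypothetical helper name):

```c
/* The kernel reports system call errors in rax as -errno, i.e. a value
 * between -4095 and -1. Anything outside that range is a valid result,
 * such as an address returned by sys_mmap. */
static int syscall_failed(long rax)
{
    return rax < 0 && rax >= -4095;
}
```

With the retval read above, the injector would abort when syscall_failed(retval) is non-zero, and -retval is then the errno value.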

You have to increment the Instruction Pointer (RIP/EIP) before letting the victim resume:

ptrace(PTRACE_POKEUSER, pid, offsetof(struct user, regs.rip),
       ptrace(PTRACE_PEEKUSER,pid, offsetof(struct user, regs.rip), NULL) + 1);


Increment stage counter and 

ptrace(PTRACE_SINGLESTEP, pid, NULL, NULL);


Stage 3 (step = 3)
After allocating memory, your shellcode should copy the thread function there and, actually, create a thread (similar to this).

You should, again, ignore all stops as long as

0xCC != (unsigned char)(ptrace(PTRACE_PEEKTEXT, pid,
      ptrace(PTRACE_PEEKUSER, pid, offsetof(struct user, regs.rip), NULL), NULL) & 0xFF)

Once you get to this breakpoint, you know that the thread has been initiated and the injector has done what it was written for.

Now you have to restore the victim to its initial, pre-injection state by restoring the values of the registers:

ptrace(PTRACE_SETREGS, pid, NULL, regs);

and, which is even more important - you have to restore the backed up code by copying back the backed up unsigned longs.

The last thing would be detaching from the victim process:

ptrace(PTRACE_DETACH, pid, NULL, NULL);

At this point, your injector may safely exit, letting the victim continue execution.

Voila! You have just injected a thread into another process.

[Screenshot: output of the injector, the victim program and the injected thread]

P.S. Shared Object Injection (a la DLL injection)
Although injection of executable code is quite simple, injection of a shared object is a different story. Despite the fact that the Linux kernel provides the sys_uselib system call, it may be unavailable on some systems. With that in mind, you have several options:

  • Check whether the victim uses libdl (dlopen(), dlsym() and dlclose() functions), parse the image and obtain addresses of relevant functions. However, not every program uses libdl.
  • Use sys_uselib system call. However, it may be unavailable.
  • Write your own shared object loader. This may be a real pain, but you would be able to reuse it whenever you need.

Hope this post was helpful. See you at the next.

Saturday, March 17, 2012

Linux Threads Through a Magnifier: Local Threads

Source code for this article is here.

Threads are everywhere. Even now, when you browse this page, threads are involved in the process. Most likely, you have more than one tab opened in the browser and each one has at least one thread associated with it. The server supplying this page runs several threads in order to serve multiple connections simultaneously. There may be unnumbered examples for threads, but let us concentrate on one specific implementation thereof. Namely, Linux implementation of threads.

It is hard to believe, that earlier Linux kernels did not support threads. Instead, all the "threading" was performed entirely in user space by a pthread (POSIX thread) library chosen for specific program. This reminds me of my attempt to implement multitasking in DOS when I was in college - possible, but full of headache.

Modern kernels, on the contrary, have full support for threads, which, from kernel's point of view are so-called "Light-weight Processes". They are usually organized in thread groups, which, in turn, represent processes as we know them. As a matter of fact, the getpid libc function (and sys_getpid system call) return an identifier of a thread group.

Let me reiterate - the best explanation is an explanation by example. In this article, I am going to cover the process of thread creation on 64 bit Linux running on PC using FASM (flat assembler).


Clone, Fork, Exec...
There are several system calls involved in process manipulations. The most known one is sys_fork. This system call "splits" a running process in two - parent and child. While they both continue execution from the instruction immediately following the sys_fork invocation, they have different PID (process ID) or, as we now know - different TGID (thread group ID) as well as each one gets a different return value from sys_fork. The return value is a child TGID for the parent process and 0 for the child. In case of error, fork returns -1 and sets errno appropriately, while sys_fork returns a negative error code. 
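The three possible outcomes of fork() can be seen in a few lines of C. This is a minimal sketch using the libc wrapper rather than the raw system call; fork_and_collect is a hypothetical helper name.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks, lets the child exit with the given code and returns that code
 * as seen by the parent, or -1 on failure.
 * fork_and_collect is a hypothetical helper name. */
static int fork_and_collect(int code)
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;                  /* fork failed, errno holds the reason */

    if (pid == 0)
        _exit(code);                /* child: fork() returned 0 */

    /* parent: fork() returned the ID of the child */
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;

    return WEXITSTATUS(status);
}
```

Calling fork_and_collect(7) in the parent yields 7 once the child has exited.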

Exec does not return at all. Well, it formally has a return type of int, but getting a return value means, that the function failed. Exec* libc function or sys_execve system call are used in order to launch a new process. For example, if your application has to start another application, but you do not want or cannot, for any reason, execute system() function, then your application has to fork and the child process calls exec, thus, being replaced in memory by the new process. The execution of the new process starts normally from its entry point.

Clone - this is the function we are interested in. Clone is a libc wrapper for sys_clone Linux system call and is declared in the sched.h header as follows:

int clone(int (*fn)(void*), void *child_stack, int flags, void *arg, ...);

I encourage you to read the man page for clone libc function at http://linux.die.net/man/2/clone or with "man clone" :-) 


sys_clone
We are not going to deal with clone function here. There are lots of good resources on the internet which provide good examples for it. Instead, we are going to examine the sys_clone Linux system call.

First of all, let us take a look at the definition of the sys_clone in arch/x86/kernel/process.c:

long sys_clone(unsigned long clone_flags, unsigned long newsp,
               void __user *parent_tid, void __user *child_tid, struct pt_regs *regs)

Although, the definition looks quite complicated, in reality, it only needs clone_flags and newsp to be specified. 

But there is a strange thing - it does not take a pointer to the thread function as a parameter. That is normal - sys_clone only performs the action suggested by its name - clones the process. But how about libc's clone? - you may ask. As I have mentioned above, libc's clone is a wrapper, and what it does in addition to calling sys_clone is setting its return address in the cloned process to the address of the thread function. But let us examine it in more detail.

clone_flags - this value tells the kernel how we want our process to be cloned. In our case, as we want to create a thread rather than a separate process, we should use the following or'ed values:

CLONE_VM (0x100) - tells the kernel to let the original process and the clone run in the same memory space;
CLONE_FS (0x200) - both get the same file system information;
CLONE_FILES (0x400) - share file descriptors;
CLONE_SIGHAND (0x800) - both processes share the same signal handlers;
CLONE_THREAD (0x10000) - this tells the kernel, that both processes would belong to the same thread group (be threads within the same process);

SIGCHLD (0x11) - this is not a flag, this is the number of the SIGCHLD signal, which would be sent to the original process (thread) when the thread is terminated (used by wait functions).
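Or'ing the values listed above gives the actual number that ends up in the first argument of sys_clone. A small sketch, with the constants copied from the list (the T_ prefix is only there to avoid clashing with the definitions in sched.h and signal.h):

```c
/* Values as listed above; the T_ prefix avoids clashing with <sched.h>. */
enum {
    T_CLONE_VM      = 0x100,
    T_CLONE_FS      = 0x200,
    T_CLONE_FILES   = 0x400,
    T_CLONE_SIGHAND = 0x800,
    T_CLONE_THREAD  = 0x10000,
    T_SIGCHLD       = 0x11
};

/* All flags needed for a thread, or'ed together with the exit signal. */
static unsigned long thread_clone_flags(void)
{
    return T_CLONE_VM | T_CLONE_FS | T_CLONE_FILES |
           T_CLONE_SIGHAND | T_CLONE_THREAD | T_SIGCHLD;
}

/* The low byte of clone_flags carries the exit signal number. */
static int exit_signal(unsigned long flags)
{
    return (int)(flags & 0xff);
}
```

The signal number shares the same word with the flags because the kernel reserves the low byte of clone_flags for it.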

newsp - the value of the stack pointer for the cloned process (new thread). This value may be NULL, in which case the clone keeps using the stack pointer value of the original. For a regular (fork-like) clone this is harmless: as soon as the clone writes to the stack, the copy-on-write mechanism gives it new memory pages, leaving the stack of the original untouched. With CLONE_VM, however, the memory is really shared and there is no copy-on-write, so a new thread has to be given a stack of its own.


Stack Allocation
Due to the fact, that in most cases, you would want to allocate a new stack for a new thread, I cannot leave this aspect uncovered in this article. To make things easier, let us implement a small function, which would receive the size of the requested  stack in bytes and return a pointer to the allocated memory region.

Important note:
As Linux follows AMD64 calling convention when running in 64 bits, function parameters and system call arguments are passed via the following registers:
Function call: arguments 1 - 6 via RDI, RSI, RDX, RCX, R8, R9; additional arguments are passed on stack.
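System call: arguments 1 - 6 via RDI, RSI, RDX, R10, R8, R9 (R10 replaces RCX, because the syscall instruction overwrites RCX and R11); system calls take at most six arguments, so nothing is passed on stack.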


C declaration:
void* map_stack(unsigned long stack_size);

Implementation:
PROT_READ     = 1
PROT_WRITE    = 2
MAP_PRIVATE   = 0x002
MAP_ANON      = 0x020
MAP_GROWSDOWN = 0x100
SYS_MMAP      = 9

map_stack:
   push  rdi rsi rdx r10 r8 r9                 ;Save registers
   mov   rsi, rdi                              ;Requested size
   xor   rdi, rdi                              ;Preferred address (may be NULL)   
   mov   rdx, PROT_READ or PROT_WRITE          ;Memory protection
   mov   r10, MAP_PRIVATE or MAP_ANON or MAP_GROWSDOWN ;Allocation attributes
   xor   r8, r8                                ;File descriptor (-1)
   dec   r8     
   xor   r9, r9                                ;Offset - irrelevant, so 0
   mov   rax, SYS_MMAP                         ;Set system call number
   syscall                                     ;Execute system call
   pop   r9 r8 r10 rdx rsi rdi                 ;Restore registers
   ret 

Calling this function would be as easy as:

mov  rdi, size
call map_stack

This function returns either a negative error code as provided by sys_mmap or the address of the allocated memory region. As we specified MAP_GROWSDOWN attribute, the obtained address would point to the top of the allocated region instead of pointing to its bottom, thus, making it perfect to specify as a new stack pointer.


Creation of Thread
In this section, we will implement a trivial create_thread function. It would allocate a stack (of default size = 0x1000 bytes) for a new thread, invoke sys_clone and return either to the instruction following call create_thread or to the thread function, depending on the return value of sys_clone.

C declaration:
long create_thread(void(*thread_func)(void*), void* param);

As you may see, the return type of the thread_func is void, unlike the real clone function. I will show you why a bit later.

Implementation:
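CLONE_VM      = 0x100
CLONE_FS      = 0x200
CLONE_SIGHAND = 0x800
CLONE_THREAD  = 0x10000
SIGCHLD       = 0x11
SYS_CLONE     = 56

create_thread: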
   mov   r14, rdi    ;Save the address of the thread_func
   mov   r15, rsi    ;Save thread parameter
   mov   rdi, 0x1000 ;Requested stack size
   call  map_stack   ;Allocate stack
   mov   rsi, rax    ;Set newsp
   mov   rdi, CLONE_VM or CLONE_FS or CLONE_THREAD or CLONE_SIGHAND or SIGCHLD ;Set clone_flags
   xor   r10, r10    ;parent_tid
   xor   r8, r8      ;child_tid
   xor   r9, r9      ;regs
   mov   rax, SYS_CLONE
   syscall           ;Execute system call
   or    rax, 0      ;Check sys_clone return value
   jnz   .parent     ;If not 0, then it is the ID of the new thread
   push  r14         ;Otherwise, set new return address (thread_func)
   mov   rdi, r15    ;Set argument for the thread_func
   ret               ;Return to thread_func
.parent:
   ret               ;Return to parent (main thread)


Exiting Thread
Everyone who has ever searched the Web for Assembly programming tutorial for Linux is familiar with sys_exit system call. On 64 bit Intel platform it is call number 60. However, they all (tutorials) miss the point. Although, sys_exit works perfectly with single threaded hello-world-like applications, the situation is different with multithreaded ones. In general, sys_exit terminates thread, not a process, which, in case of a process with a single thread, is definitely enough, but may lead to strange artifacts (or even zombies) if, for example, a thread continues to print to stdout after you have terminated the main thread.

Now, the promised explanation of the thread_func return type. In our case (as in most cases) the thread_func does not return by means of the ret instruction. It just can't, as there is no return address on the stack, and even if you put one there - returning would not terminate the thread. Instead, you should implement something like this exit_thread function.

C declaration:
void exit_thread(long result);

Implementation:
SYS_EXIT = 60
exit_thread:
                         ; Result is already in RDI
   mov   rax, SYS_EXIT   ; Set system call number
   syscall               ; Execute system call


Exiting Process
By exiting process we usually mean total termination of the running process. Linux gracefully provides us with a system call which terminates a group of threads (process) - sys_exit_group (call number 231). The function for terminating the process is as simple as this:

C declaration:
void exit_process(long result);

Implementation:
SYS_EXIT_GROUP = 231
exit_process:
                             ; Result is already in RDI
   mov   rax, SYS_EXIT_GROUP ; Set system call number
   syscall                   ; Execute system call



Attached Source Code
The source code attached to this article (which may be found here) contains a trivial example of the application that creates thread with the method described above. In addition, it contains the list of system call numbers for both 32 and 64 bit platforms.

Note for Nerds:
The attached code is for demonstration purpose only and may not contain such important elements as checking for errors, etc.


32 bit Systems
If you decide to convert the code given above to run on 32 bit systems, that would be quite easy. First of all - change register names to appropriate 32 bit ones. 

Second thing is to remember how parameters are passed to system calls in 32 bit kernels. They are still passed through registers, but the registers are different. Parameters 1st through 5th are passed through EBX, ECX, EDX, ESI, EDI. The system call number is placed as usual in EAX, the same register is used to store return value upon system call's completion.

Third - use int 0x80 instead of syscall instruction.

Fourth - remember to change function prologues due to a different calling convention. While 64 bit systems use the AMD64 ABI, 32 bit systems use cdecl, passing arguments on stack by default.


Hope this article was interesting and helpful.

See you at the next (remote threads in Linux - stay tuned).


Friday, March 2, 2012

Dynamic Code Encryption as an Anti Dump and Anti Reverse Engineering measure

Source code for this article may be found here.


Too much has been said and written on how software vendors do not protect their products, so let me skip this. Instead, in this article, I would like to concentrate on those relatively easy steps which software vendors have to take in order to enhance their protection (using packers and protectors is good, but certainly not enough) by not letting the whole code appear in memory in readable form for a single moment.

Attack Vectors
Prior to dealing with "why attackers are able to x, y, z" let us map most frequent attack vectors in ascending order of their complexity.

Static Analysis - inspecting an executable in your favorite disassembler. It may be hard to believe, but majority of software products out there are vulnerable to static analysis, thus, showing us, that most of vendors do not care about proprietary algorithms' safety in addition to the fact, that they seem not to care about piracy  either (but they tend to cry about it all the time).

Dynamic Analysis - running an executable inside your favorite debugger. This is a direct consequence of the previous paragraph. If an attacker is able to see the whole code in the disassembler - he/she definitely can run it  in a debugger (even if this requires some minor patching).

Static Patching - this means changing the code located in the file of the executable. It may be changing one jump or adding a couple of dozens of bytes of attacker's own code in order to alter the way the program runs.

Dynamic Patching - similar to static patching in the idea behind the method. The only difference is, that dynamic patching is performed while the target executable is loaded into memory.

Dumping - saving the data in memory to a file on disk. This method may be very useful when examining a packed executable. Such memory dumps may be easily loaded into, for example, IDA and examined as if that was a regular executable (some additional actions may be required for better convenience, like rebasing the program or adjusting references to other modules).

In most cases, at least two of the aforementioned vectors would be present at the time of an attack.


Packers and Cryptors
Using different packers, cryptors and protectors is quite a known practice among software vendors. The problem with this is, that few of them go beyond packing the code in file and fully unpacking it in memory and, sometimes, protecting the packer itself. By saying "go beyond" I mean any implementation of anti debugging methods of any kind. Besides, such utilities do not prevent an attacker from obtaining a memory dump good enough to deal with. One or two check the consistency of the code, which may (yes - may, as it not necessarily can) prevent patching the code, but every wall has a door and it only matters how much effort opening that door may require. Bottom line is, that these types of protection may only be useful in preventing static analysis, but only if there is no relevant unpacker or decrypter.


Protectors
This is "the next step" in the evolution of packers. These provide a bit more options and tools to estimate how secure the environment is. In addition to packing the code, they also utilize code consistency checks, anti debugging tricks, license verification, etc. Protectors are good countermeasures to the first three (or even four) attack vectors. However, even if certain protector has some anti patching heuristics, it is only good as long as it (heuristics) is not reversed and either patched or fooled in any other way. 

Despite all the "good" in protectors, even such powerful tools are not able to do much in order to prevent an attacker from obtaining a memory dump, which may be obtained by either using ReadProcessMemory or injecting a DLL and dumping "from inside" while suspending all other threads.


Anything Else?
Yes, there are some basic protections provided by the operating system, like session separation, for example, which prevents creation of remote threads (used with DLL injection), but those are hardly worth even mentioning here.

The picture drawn here appears to be sad and hopeless enough. However, there are several good methods to add more protection to a software product and more pain in some parts of the body to attackers.


Code Obfuscation
While this method is widely used by protectors and, sometimes, by packers and cryptors (unfortunately, in most cases, only to protect themselves), it seems to be almost totally unknown to the rest of software vendors. In my opinion, branching the code more than is usually needed cannot be considered code obfuscation; it may rather be called an attempt to obfuscate an algorithm. The situation is such that even implementing something similar to this would be a significant improvement in vendors' efforts to protect their products.


Hiding the Code
Software vendors repeatedly fail to understand two facts: popular means more vulnerable (with regard to commercial solutions), and there is no magic cure, so they have to put some additional effort into protecting their products.

One of the options, which I would like to cover here, is dynamic encryption of executable code. This method promises that only certain parts of the code would be present in memory in readable (possible to disassemble) form, while the rest of the code (and preferably data) is encrypted.

I am still sure that the best way to explain something is by example. The small piece of C code described below is intended to show the principle of dynamic code encryption. It contains several functions in addition to main. The first is the one (the target) we are going to protect. It does nothing special, just calculates the factorial of 10 and prints it out. The main function invokes a decrypter in order to decrypt the target, calls the target (thus displaying the factorial of 10) and, finally, invokes the cryptor to encrypt the target back (hide it).

The code may be compiled for either Linux (using gcc) or Windows (using mingw32). It uses obfuscation code from here.


Target Function
Our target function is quite simple (it only calculates the factorial of a hardcoded number):

void func()
{
   __asm__ __volatile__("enc_start:");
   {  /* Braces are used here as we do not want IDA to track parameters */
      int i, f = 10;
      for( i = 9; i > 0; i--)
         f *= i;
      printf("10! = %d\n\n", f);
   }
   __asm__ __volatile__("enc_end:");
}

Did you notice the labels at the beginning and at the end of the function body? These labels are only used for getting the start address of the region to be decrypted/encrypted and calculating its length. Because these labels are not processed by the C compiler but are passed straight to the assembler, they are accessible from other functions by default. The rest of the code is enclosed in braces in order to put all the actions related to the variables i and f in the encrypted part of the function. Before being decrypted, this part of the function is just meaningless bytes to a disassembler.


Although in the attached code the initial encryption is performed at program start, in reality it should be done beforehand, probably with a third-party tool. You would only have to put some unique marking at the start and end of the region you want to encrypt. For example:


__asm__(".byte  0x0D, 0xF0, 0xAD, 0xDE");
void  func()
{
...
}
__asm__(".byte  0xAD, 0xDE, 0xAD, 0xDE");
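
To give an idea of what such a third-party tool would have to do, here is a minimal sketch that looks for the two markers in a binary image already loaded into a buffer. The names find_marker and find_marked_region are hypothetical and not part of the attached code. The sketch also assumes that the markers end up directly around the function in the output file, which the compiler is not obliged to guarantee, and that the marker bytes do not occur by chance anywhere else.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NOT_FOUND ((size_t)-1)

/* Hypothetical helper: offset of the first occurrence of 'marker'
   in 'image' at or after offset 'from', or NOT_FOUND. */
size_t find_marker(const unsigned char *image, size_t size, size_t from,
                   const unsigned char *marker, size_t marker_len)
{
    assert(image != NULL && marker != NULL && marker_len > 0);
    for (size_t off = from; off + marker_len <= size; off++)
        if (memcmp(image + off, marker, marker_len) == 0)
            return off;
    return NOT_FOUND;
}

/* Hypothetical helper: finds the bytes enclosed between the start and
   end markers used in the example above. On success stores the offset
   and length of the enclosed region and returns 0, otherwise -1. */
int find_marked_region(const unsigned char *image, size_t size,
                       size_t *start, size_t *length)
{
    static const unsigned char begin[] = { 0x0D, 0xF0, 0xAD, 0xDE };
    static const unsigned char end[]   = { 0xAD, 0xDE, 0xAD, 0xDE };

    size_t b = find_marker(image, size, 0, begin, sizeof begin);
    if (b == NOT_FOUND)
        return -1;

    size_t first = b + sizeof begin;   /* first byte after the start marker */
    size_t e = find_marker(image, size, first, end, sizeof end);
    if (e == NOT_FOUND)
        return -1;

    *start  = first;
    *length = e - first;
    return 0;
}
```

A real tool would then encrypt the bytes in that range and write the file back. It would probably also have to skip any alignment padding the compiler placed between the markers and the function.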


Encryption Algorithm
The choice of encryption algorithm is totally up to you. In this particular case, the algorithm is quite primitive (it does not even require a key):

b(i)   - byte at position i, for i = 0 .. length - 1
rol 1  - rotate the byte left by one bit
for i = 0; i < length - 1; i++
   b(i+1) = b(i+1) xor (b(i) rol 1)
b(0) = b(0) xor (b(length - 1) rol 1)
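
For the curious, this is roughly how the scheme might look in C together with its inverse. This is only a sketch and not necessarily the code from the attachment. It assumes that the region consists of bytes 0 .. length - 1, so the loop stops one byte short of the end and the first byte is mixed with the last one. The function names are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rotate an 8-bit value left by one bit. */
static uint8_t rol1(uint8_t b)
{
    return (uint8_t)((b << 1) | (b >> 7));
}

/* Encrypt in place: every byte is XORed with its predecessor rotated
   left by one bit (the predecessor is already encrypted, except for
   byte 0). Byte 0 is processed last and uses the final byte. */
void xor_chain_encode(uint8_t *buf, size_t len)
{
    assert(buf != NULL);
    if (len < 2)
        return;                       /* nothing to chain */
    for (size_t i = 0; i + 1 < len; i++)
        buf[i + 1] ^= rol1(buf[i]);
    buf[0] ^= rol1(buf[len - 1]);
}

/* Decrypt in place by undoing the steps above in reverse order:
   restore byte 0 first, then walk from the end towards the start. */
void xor_chain_decode(uint8_t *buf, size_t len)
{
    assert(buf != NULL);
    if (len < 2)
        return;
    buf[0] ^= rol1(buf[len - 1]);
    for (size_t i = len - 1; i > 0; i--)
        buf[i] ^= rol1(buf[i - 1]);
}
```

Since XOR is its own inverse, decoding only has to apply the same operations in the opposite order. Such a scheme hides the code from a disassembler, but it is by no means cryptographically strong.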

Execution Flow
So, let us assume that the program started with the function already encrypted. As this is just an example, we can get down to business right away:

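int main()
{
   /* Assumes a non-PIE build in which code addresses fit in 32 bits
      (e.g. 32-bit x86) */
   unsigned int  addr, len;
   /* Load the addresses of the assembler labels defined in func() */
   __asm__ __volatile__("movl  $enc_start, %0\n\t"
                        "movl  $enc_end, %1\n\t"
                        : "=r"(addr), "=r"(len));
   len -= addr;         /* length of the encrypted region */
   decode(addr, len);   /* restore the original code of func() */
   func();              /* prints the factorial of 10 */
   encode(addr, len);   /* hide func() again */
   return 0;
}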

The code above is self-explanatory enough. There is, however, one thing that needs to be mentioned: the decode and encode functions should take care of modifying the access rights of the memory region they are going to operate on. The following code may be used:

#ifdef WIN32
#include <windows.h>
#define SETRWX(addr, len)   {\
                               DWORD attr;\
                               VirtualProtect((LPVOID)((addr) &~ 0xFFF),\
                                  (len) + ((addr) - ((addr) &~ 0xFFF)),\
                                  PAGE_EXECUTE_READWRITE,\
                                  &attr);\
                            }
#define SETROX(addr, len)   {\
                               DWORD attr;\
                               VirtualProtect((LPVOID)((addr) &~ 0xFFF),\
                                  (len) + ((addr) - ((addr) &~ 0xFFF)),\
                                  PAGE_EXECUTE_READ,\
                                  &attr);\
                            }
#else
#include <sys/mman.h>
#define SETRWX(addr, len)   mprotect((void*)((addr) &~ 0xFFF),\
                                     (len) + ((addr) - ((addr) &~ 0xFFF)),\
                                     PROT_READ | PROT_EXEC | PROT_WRITE)
#define SETROX(addr, len)   mprotect((void*)((addr) &~ 0xFFF),\
                                     (len) + ((addr) - ((addr) &~ 0xFFF)),\
                                     PROT_READ | PROT_EXEC)
#endif

This is the only platform-dependent code in this sample.

Bottom Line
The example given above is really a simple one. Things would be at least a bit more complicated in real life. While there is only one encrypted function here, imagine that there are several. Some of them are encrypted without keys (like the one above), others require keys of different complexity. Several keys may be hardcoded (for those parts that are encrypted only to draw the attacker's attention away from the "real" thing), others should be computed on the fly.

Example:
Function A is encrypted without a key. When decrypted, it performs several operations and decrypts function B, which, in turn, encrypts function A back and calculates a key for function C based on the binary content of function A (or of A and B, so that a software breakpoint placed in either of them changes the bytes and spoils the key), or even on some other code in an unrelated place.
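
Here is a sketch of how a key might be computed from the binary content of a function. The choice of hash is mine and purely illustrative (32-bit FNV-1a), and derive_key is a hypothetical name. Any function that depends on every byte of the region would do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: derives a 32-bit key from the bytes of a code
   region using the FNV-1a hash. Changing any single byte of the region
   (for example, by planting an int3 breakpoint, opcode 0xCC) yields a
   different key, so whatever is decrypted with it turns into garbage. */
uint32_t derive_key(const uint8_t *code, size_t len)
{
    assert(code != NULL);
    uint32_t h = 2166136261u;         /* FNV-1a offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= code[i];
        h *= 16777619u;               /* FNV-1a prime */
    }
    return h;
}
```

Note that this only catches modifications of the code itself. Hardware breakpoints do not alter any bytes and would go unnoticed by such a check.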

Of course, there is no such thing as unbreakable protection. But the time it takes to break a certain protection makes the difference. A company that produces a software product which is cracked the next day can hardly benefit from all the hard work. On the other hand, it is totally possible to create protection schemes that would take months to crack.

I will try to cover additional possibilities and aspects of software protection in my future posts, in the hope of at least trying to change the situation.


Hope this post was helpful.
See you at the next!