While setting up Tika, I stumbled upon a fairly unlikely corner case in the Linux kernel networking code that prevented some of my packets from being delivered in the right place. After quite some digging through debug logs and kernel source code, I found the cause of this problem in the way the bridge module handles netfilter and iptables.
Just in case someone else actually finds himself in this situation and manages to find this blogpost, I'll detail my setup, the problem and its solution here.
Tika runs Debian wheezy, with a single network interface to the internet (which is not involved in this problem). Furthermore, Tika runs a number of lxc containers, which are isolated systems sharing the same kernel, but running a complete userspace of their own. Using kernel namespaces and cgroups, these containers obtain a fair degree of separation: each of them has its own root filesystem, a private set of mounted filesystems, separate user ids, separate network stacks, etc.
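For illustration, the network part of such a container's config in wheezy-era lxc looks roughly like this. The container name and address are made up, and br-lxc is the host-side bridge described below:

    # /var/lib/lxc/web/config (hypothetical container named "web")
    lxc.network.type = veth
    lxc.network.link = br-lxc
    lxc.network.flags = up
    lxc.network.ipv4 = 192.168.1.10/24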
Each of these containers then connects to the outside world using a virtual ethernet device. This is sort of a named pipe, but for ethernet. Each veth device has two ends, one inside the container and one outside, which are connected. On the inside, it just looks like each container has a single ethernet device, which is configured normally. On the outside, all of these veth interfaces are grouped together into a bridge device, br-lxc, which allows the containers to talk amongst themselves (just as if they were connected to the same ethernet switch). The bridge device in the host is configured with an IP address as well, to allow communication between the host and the containers.
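Roughly, the plumbing lxc performs for each container is equivalent to the following manual commands. This is just a sketch with made-up names and addresses, not my exact setup:

    # create the bridge and give the host an address on it
    brctl addbr br-lxc
    ip addr add 192.168.1.1/24 dev br-lxc
    ip link set br-lxc up

    # create a veth pair and attach the outside end to the bridge;
    # lxc moves the other end into the container's network namespace
    ip link add veth-web type veth peer name veth-web-c
    brctl addif br-lxc veth-web
    ip link set veth-web up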
Now, I have a few port forwarding rules: when traffic comes in on my public IP address on specific ports, it gets forwarded to a specific container. There is nothing special about this; it is just like forwarding ports to LAN hosts on a NAT router.
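In iptables terms, such a forward is a DNAT rule in the nat table's PREROUTING chain. A minimal sketch, with made-up addresses (1.2.3.4 standing in for the public IP, 192.168.1.10 for the web container):

    iptables -t nat -A PREROUTING -d 1.2.3.4 -p tcp --dport 80 \
      -j DNAT --to-destination 192.168.1.10:80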
A problem with port forwarding like this is that by default, packets coming in from the internal side cannot be properly handled. As an example, one of the containers is running a webserver, which serves a custom Debian repository on the apt.stderr.nl domain. When another container tries to connect to that, DNS resolution will give it the external IP of Tika, but connecting to that IP fails.

Usually, the DNAT rule used for port forwarding is configured to only process packets from the external network. But even if it would process internal packets, it would not work. The DNAT rule changes the destination address of these packets to point to my web container, so they get sent to the web container. However, the source address is unchanged. Since the containers have a direct connection (through the network bridge), reply packets get sent directly to the originating container - the host does not have a chance to "undo" the DNAT on the reply packets. For external connections, this is not a problem, because the host is the default gateway for the containers and the replies need to pass through the host to reach the external IP.
The most common solution to this is split-horizon DNS - make sure that all these domains resolve to the internal address of the web container, so no port forwarding is needed. For various practical reasons, this didn't work for me, so I settled for the other solution: apply SNAT in addition to DNAT, which causes the source address of the forwarded packets to be changed to the host's address, forcing replies to pass through the host. The Vuurmuur firewall I was using even had a special "bounce" rule for exactly this purpose (setting up a DNAT and SNAT iptables rule).
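A sketch of what such a "bounce" boils down to, reusing the made-up addresses from above (192.168.1.1 being the host's address on the bridge). The DNAT rule now also matches internal clients, and the SNAT rule rewrites their source address:

    # DNAT: internal clients connecting to the public IP also get forwarded
    iptables -t nat -A PREROUTING -d 1.2.3.4 -p tcp --dport 80 \
      -j DNAT --to-destination 192.168.1.10:80
    # SNAT: make the forwarded packets appear to come from the host,
    # so replies are forced back through it
    iptables -t nat -A POSTROUTING -s 192.168.1.0/24 -d 192.168.1.10 \
      -p tcp --dport 80 -j SNAT --to-source 192.168.1.1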
This setup worked perfectly - when connecting to the web container from other containers. However, when the web container tried to connect to itself (through the public IP address), the packets got lost. I initially thought the packets were dropped - they went through the PREROUTING chain as normal, but never showed up in the FORWARD chain. I also thought the problem was caused by the packet having the same source and destination addresses, since packets coming from other containers worked as normal. Neither of these turned out to be true, as I'll show below.
Since reproducing the problem on a different and/or simpler setup is always a good approach in debugging, I tried to reproduce the problem on my laptop, using a (single) regular ethernet device and applying DNAT and SNAT rules. This worked as expected, but when I added a bridge interface containing just the ethernet interface, it broke again. Adding a second (vlan) interface to the bridge uncovered that the problem was not traffic DNATed back to its source, but rather traffic DNATed back to the same bridge port it originated from - traffic from one bridge port DNATed to the other worked normally.
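For the record, that reproduction boiled down to something like this (interface names and addresses are examples):

    # with DNAT/SNAT rules on plain eth0: works as expected
    # now put the ethernet device in a bridge and retry:
    brctl addbr br0
    brctl addif br0 eth0
    ip addr add 10.0.0.2/24 dev br0   # host address moves to the bridge
    # -> DNAT back to the same bridge port now fails, while DNAT from a
    #    second (vlan) port on br0 to eth0 still works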
Digging down into the kernel sources for the bridge module, I uncovered this piece of code, which applies some special handling to exactly these DNATed packets on a bridge. It seems this is either a performance optimization, or a way to allow DNATing packets inside a bridge without having to enable full routing, though I find the exact effects of this code rather confusing.
I also found that setting the bridge device to promiscuous mode (e.g. by running tcpdump on it) makes everything work. Setting /proc/sys/net/bridge/bridge-nf-call-iptables to 0 also makes this work. That setting exists to prevent bridged packets from passing through iptables, but since this packet wasn't actually a bridged packet before PREROUTING, turning it off makes the packet be processed by the normal routing code instead, progressing through all the regular chains normally.
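In shell form, either of these made the packets flow again (br-lxc being my bridge):

    # option 1: put the bridge device in promiscuous mode
    ip link set dev br-lxc promisc on   # or leave tcpdump -i br-lxc running

    # option 2: keep bridged packets away from iptables entirely
    echo 0 > /proc/sys/net/bridge/bridge-nf-call-iptables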
Here's what I think happens:
1. br_handle_frame is called for the incoming packet, which triggers the NF_BR_PRE_ROUTING netfilter chain (i.e. the bridge / ebtables version, not the ip / iptables one).
2. The br_nf_pre_routing hook for NF_BR_PRE_ROUTING gets called. This interrupts (returns NF_STOLEN) the handling of the NF_BR_PRE_ROUTING chain, and calls the NF_INET_PRE_ROUTING chain.
3. The br_nf_pre_routing_finish finish handler gets called after completing the NF_INET_PRE_ROUTING chain, and resumes the handling of the NF_BR_PRE_ROUTING chain. However, because it detects that DNAT has happened, it sets the finish handler to br_nf_pre_routing_finish_bridge instead of the regular br_handle_frame_finish finish handler.
4. br_nf_pre_routing_finish_bridge runs. This sets skb->dev to the parent bridge, sets the BRNF_BRIDGED_DNAT flag and calls neigh->output(neigh, skb); which presumably resolves to one of the neigh_*output functions, each of which again calls dev_queue_xmit, which should (eventually) call br_dev_xmit.
5. br_dev_xmit sees the BRNF_BRIDGED_DNAT flag and calls br_nf_pre_routing_finish_bridge_slow instead of actually delivering the packet.
6. br_nf_pre_routing_finish_bridge_slow sets up the destination MAC address, sets skb->dev back to skb->physindev and calls br_handle_frame_finish.
7. br_handle_frame_finish calls br_forward. If the bridge device is set to promiscuous mode, this also delivers the packet up through br_pass_frame_up. Since enabling promiscuous mode fixes my problem, it seems likely that the packet manages to get all the way to here.
8. br_forward calls should_deliver, which returns false when skb->dev != p->dev (and "hairpin mode" is not enabled), causing the packet to be dropped.

This seems like a bug, or at least an unfortunate side effect. It seems there are currently two ways to work around this problem:
- Set the bridge device to promiscuous mode (e.g. by running tcpdump on it), so the packet does get delivered after all, as described above.
- Set /proc/sys/net/bridge/bridge-nf-call-iptables to 0, so there is no need for this DNAT + bridge stuff. The side effect of this solution is that bridged packets don't go through iptables, but that's really what I'd have expected in the first place, so this is not a problem for me.

Next up is reporting this to a kernel mailing list to confirm whether there is an actual kernel bug, or just a bug in my expectations :-)
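Incidentally, the should_deliver check above suggests a third possible escape: enabling "hairpin mode" on the originating bridge port, which allows a packet to be sent back out the port it arrived on. I have not tested this, but it can be toggled per port through sysfs (veth-web being a hypothetical port name):

    # untested: allow the port to send packets back out of itself
    echo 1 > /sys/class/net/br-lxc/brif/veth-web/hairpin_mode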
Update: Turns out this behaviour was previously spotted, but no consensus about a fix was reached.