Comment 11 for bug 1391339

Revision history for this message
Stefan Bader (smb) wrote :

It may look like a Xen issue. Though the t1.micro instances were using some 3.4 versions, too. One had some "kaos" in the version string; the other looked more like the common variant. If Amazon does not play tricks and apply different patches to the same Xen versions on different instance types (which I can only hope they don't), then the issue might be caused by the networking setup in dom0. Different versions of Xen likely mean different kernels in dom0 (though they could just as well have different kernels in dom0 with the same Xen version). And of course we have no clue whether they use standard bridging or maybe openvswitch, and if openvswitch, whether they use the in-kernel version or the upstream one...

So it might be a kernel issue, but not in the way you think. The fact that guest kernels before 3.13 handled GRO=on better is just because they did not actually use GRO (even when it was set on). So the guest kernel did not regress; it just uncovered a problem that likely existed before. From what we have gathered so far, my vague theory would be that something in the host kernel (counting openvswitch modules of any origin as part of that) causes GRO turned on inside a guest to be slower rather than faster (and also with a greater variance in the average). Some people with better networking skills said this could happen if the skb truesize goes wrong or a big skb ends up with many small fragments. If we ignore the case where two instances end up on the same host, we have a chain: the host receives the packets through its NIC, then routes/switches them onto the netback of the guest, which then makes them available to netfront inside the guest. And right now that area (host receiving and forwarding) looks suspicious.
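For anyone wanting to reproduce the comparison described above, a rough sketch of what I would run inside the guest is below. The interface name (eth0) and the iperf server address are placeholders for your setup; this is just the GRO-on vs. GRO-off measurement, not a fix.

```shell
#!/bin/sh
# Sketch: compare receive throughput in the guest with GRO on vs. off.
# Assumes the guest NIC is eth0 and an iperf server is already running
# on another instance (replace <server-ip> with its address).

IFACE=eth0
SERVER=<server-ip>

for gro in on off; do
    # Toggle generic receive offload on the netfront device.
    sudo ethtool -K "$IFACE" gro "$gro"
    # Confirm the setting actually took effect.
    ethtool -k "$IFACE" | grep generic-receive-offload
    # Run a few receive-side throughput samples to see the variance too.
    for i in 1 2 3; do
        iperf -c "$SERVER" -t 10 -R
    done
done
```

On affected instance types the expectation from this theory would be that the gro=on runs come out slower and noisier than gro=off, even though GRO should normally help.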