Trusty kernel inbound network performance regression when GRO is enabled

Bug #1391339 reported by Rodrigo Vaz
This bug affects 2 people
Affects: linux (Ubuntu)
Status: Confirmed
Importance: Medium
Assigned to: Unassigned

Bug Description

After upgrading our EC2 instances from Lucid to Trusty we noticed an increase in download times; Lucid instances were able to download twice as fast as Trusty ones. After some investigation and testing of older kernels (precise, raring and saucy) we confirmed that this only happens on the trusty kernel or newer (the utopic kernel shows the same result). Disabling GRO with `ethtool -K eth0 gro off` seems to fix the problem, making download speed the same as on the Lucid instances again.

The problem is easily reproducible by running Apache Bench a couple of times against files bigger than 100MB over a 1Gb network (EC2), using HTTP or HTTPS.

Following is an example of download throughput with and without GRO:

<email address hidden> ~# ethtool -K eth0 gro off
<email address hidden> ~# for i in {1..10}; do ab -n 10 $URL | grep "Transfer rate"; done
Transfer rate: 85183.40 [Kbytes/sec] received
Transfer rate: 86375.80 [Kbytes/sec] received
Transfer rate: 94720.24 [Kbytes/sec] received
Transfer rate: 84783.82 [Kbytes/sec] received
Transfer rate: 84933.09 [Kbytes/sec] received
Transfer rate: 84714.04 [Kbytes/sec] received
Transfer rate: 84795.58 [Kbytes/sec] received
Transfer rate: 84636.54 [Kbytes/sec] received
Transfer rate: 84924.26 [Kbytes/sec] received
Transfer rate: 84994.10 [Kbytes/sec] received
<email address hidden> ~# ethtool -K eth0 gro on
<email address hidden> ~# for i in {1..10}; do ab -n 10 $URL | grep "Transfer rate"; done
Transfer rate: 74193.53 [Kbytes/sec] received
Transfer rate: 56808.91 [Kbytes/sec] received
Transfer rate: 56011.58 [Kbytes/sec] received
Transfer rate: 82227.74 [Kbytes/sec] received
Transfer rate: 70806.54 [Kbytes/sec] received
Transfer rate: 72848.10 [Kbytes/sec] received
Transfer rate: 58451.94 [Kbytes/sec] received
Transfer rate: 61221.33 [Kbytes/sec] received
Transfer rate: 58620.21 [Kbytes/sec] received
Transfer rate: 69950.03 [Kbytes/sec] received
<email address hidden> ~#

Similar results can be observed using iperf and netperf as well.
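
For example, a quick iperf comparison between two instances along these lines shows the same pattern (the hostname and duration are placeholders, not the exact values we used):

# on the receiving instance (the one whose GRO setting is being toggled)
iperf -s

# on the sending instance
iperf -c <receiver-host> -t 30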

Tested kernels:
Not affected: 3.8.0-44-generic (precise/raring), 3.11.0-26-generic (saucy)
Affected: 3.13.0-39-generic (trusty), 3.16.0-24-generic (utopic)

Let me know if I can provide any other information that might be helpful, such as perf traces and reports.
Rodrigo.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu):
status: New → Confirmed
Revision history for this message
Rodrigo Vaz (rodrigo-vaz) wrote :

FWIW, here is the speed for the Lucid instances with a custom 3.8.11 kernel, which I've used as a baseline:

Transfer rate: 93501.64 [Kbytes/sec] received
Transfer rate: 84949.88 [Kbytes/sec] received
Transfer rate: 84795.65 [Kbytes/sec] received

Rodrigo.

penalvch (penalvch)
tags: added: regression-release trusty
Revision history for this message
Kamal Mostafa (kamalmostafa) wrote :

This recent mainline commit might be relevant:

 73d3fe6d1c6d840763ceafa9afae0aaafa18c4b5 gro: fix aggregation for skb using frag_list

That commit notes that it fixes a regression introduced by 8a29111c7ca6 ("net: gro: allow to build full sized skb"), which first appeared in Trusty. That fix is already included in v3.13.11.11, which will be merged into Trusty in the near future.

In the meantime, I've constructed a test PPA which supplies the Trusty 3.13.0-40.68 kernel plus that fix. Rodrigo, please try this kernel and advise whether it affects the problem:

  https://launchpad.net/~kamalmostafa/+archive/ubuntu/test-lp1391339
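
Installing the test kernel from that PPA should amount to roughly the following (the exact linux-image package name may differ, so please check what the PPA actually publishes):

sudo add-apt-repository ppa:kamalmostafa/test-lp1391339
sudo apt-get update
sudo apt-get install linux-image-3.13.0-40-generic
# then reboot into the new kernel and re-run the benchmark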

Revision history for this message
Rodrigo Vaz (rodrigo-vaz) wrote :

Hi Kamal, thanks for putting together a PPA with the test kernel, but unfortunately I got the same results:

Linux runtime-common 4 3.13.0-40-generic #68+g73d3fe6-Ubuntu SMP Tue Nov 11 16:39:20 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

GRO enabled:
<email address hidden> ~# for i in {1..3}; do ab -n 10 $URL | grep "Transfer rate"; done
Transfer rate: 47312.68 [Kbytes/sec] received
Transfer rate: 82794.75 [Kbytes/sec] received
Transfer rate: 68697.54 [Kbytes/sec] received

GRO disabled:
<email address hidden> ~# ethtool -K eth0 gro off
<email address hidden> ~# for i in {1..3}; do ab -n 10 $URL | grep "Transfer rate"; done
Transfer rate: 81509.55 [Kbytes/sec] received
Transfer rate: 82609.47 [Kbytes/sec] received
Transfer rate: 90442.16 [Kbytes/sec] received

Regards,
Rodrigo.

Revision history for this message
Stefan Bader (smb) wrote :

Hi Rodrigo, there is a possibility that the problem is not a regression in handling GRO, but rather in supporting GRO at all. I can find the following commit in 3.13-rc1:

commit 99d3d587b2b4314ccc8ea066cb327dfb523d598e
Author: Wei Liu <email address hidden>
Date: Mon Sep 30 13:46:34 2013 +0100

    xen-netfront: convert to GRO API

Now I tried to reproduce the performance issue locally, using a Xen 4.4.1 Trusty host which runs a Trusty PV guest. On the guest side I start iperf in server mode (since that is the receive side; to be sure, I also reversed the setup with the same results), and on a desktop running Trusty I start iperf in client mode, connecting to the PV guest. The desktop and host have 1Gbit NICs. With that I get an average of about 850Mbit/sec over 10 runs, which is as good as I would expect. And this does not change significantly whether I enable or disable GRO.
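
In shell terms, the test was basically the following (the guest address is a placeholder, so take it as a sketch rather than my literal command history):

# on the Trusty PV guest (receive side)
iperf -s

# on the Trusty desktop, 10 runs against the guest
for i in $(seq 1 10); do iperf -c <guest-ip>; done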

Now, what we do not know is what gets involved network-wise between your server and the guests. I am not really sure how this could possibly happen.

Revision history for this message
Stefan Bader (smb) wrote :

I meant, I am not sure how to get traffic coming in in a way that GRO actually slows things down instead of reducing the processing impact.

Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: kernel-da-key
Revision history for this message
Rodrigo Vaz (rodrigo-vaz) wrote :

Hi Stefan,

Interesting finding indeed. It looks like the author only tested on Xen 4.4, while the EC2 instance type that we use reports Xen 3.4, as the log line below shows:

Xen version: 3.4.3.amazon (preserve-AD)

I'm using a common S3 URL in my test cases, and the results are the same whether I use curl or ab to measure the download speed. I've also used iperf and netperf between two of our instances and got similar results; the difference is just more noticeable with the S3 URL.

What is also interesting is that, although the Lucid instance (with kernel 3.8.11) shows GRO as enabled, it may not actually be using GRO at all, since xen-netfront was only converted to the GRO API in 3.13. That could mean the processing impact stays the same on both instances. I will try to measure this and report back, but in the meantime let me know if there is anything I can measure that may help to identify the issue.
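
FWIW, this is just the standard ethtool offload query I am using to compare the two instances (the exact output wording may vary between ethtool versions):

ethtool -k eth0 | grep generic-receive-offload
# prints e.g.: generic-receive-offload: on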

Thanks,
Rodrigo.

Revision history for this message
Stefan Bader (smb) wrote :

I think we need you to give us a good test case (good in the sense that it reproduces the issue and is also easy for us to set up), because I tried again this morning with iperf, this time on two t1.micro (64-bit) Trusty daily AMIs, and with that I seem to get identical performance with GRO enabled or disabled.

25 runs gro=on: Average bandwidth = 67.1386 Mbits/sec, StdDev = 15.0925 Mbits/sec
25 runs gro=off: Average bandwidth = 67.4393 Mbits/sec, StdDev = 15.0597 Mbits/sec

Revision history for this message
Stefan Bader (smb) wrote :

This is what I run on the client side; the server (receiver) just starts "iperf -s".
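
For reference, a rough equivalent of that client-side wrapper would be something like this (a sketch, not the exact attachment; the server address is a placeholder):

#!/bin/bash
# Run iperf repeatedly against a server and report average bandwidth and stddev.
SERVER=${1:?usage: $0 <server-ip>}
RUNS=25
for i in $(seq 1 $RUNS); do
    # -f m forces Mbits/sec so the awk below can parse a consistent unit
    iperf -c "$SERVER" -t 10 -f m | awk '/Mbits\/sec/ {print $(NF-1)}'
done | awk '{ s += $1; ss += $1*$1; n++ }
     END { m = s/n; printf "Average bandwidth = %.4f Mbits/sec, StdDev = %.4f Mbits/sec\n", m, sqrt(ss/n - m*m) }'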

Revision history for this message
Rodrigo Vaz (rodrigo-vaz) wrote :

Hi Stefan,

OK, I think I've figured out why you were unable to reproduce the slowness. As I mentioned earlier, we use the m2 instance type, which runs on underlying Xen 3.4, whereas the t1.micro is probably running on newer infrastructure. So I decided to give m3 a try, and its Xen is indeed newer (4.2). Following are the results of your iperf script on m2 and m3 instances; as you can see, the old instance type is really affected:

m3 client / m3 server / gro on
Average bandwidth = 958.334 Mbits/sec
StdDev = 21.3581 Mbits/sec

Server Output Sample:
[SUM] 0.0-50.5 sec 5.87 GBytes 1000 Mbits/sec
[ 5] local ... port 5001 connected with ... port 36291
[ 5] 0.0-10.1 sec 1.16 GBytes 984 Mbits/sec
[ 4] local ... port 5001 connected with ... port 36292
[ 4] 0.0-10.1 sec 1.19 GBytes 1.01 Gbits/sec
[ 5] local ... port 5001 connected with ... port 36293
[ 5] 0.0-10.1 sec 1.17 GBytes 997 Mbits/sec
[ 6] local ... port 5001 connected with ... port 36294
[ 6] 0.0-20.2 sec 1.17 GBytes 498 Mbits/sec
[SUM] 0.0-20.2 sec 2.34 GBytes 995 Mbits/sec

m2 client / m3 server / gro on
Average bandwidth = 643.833 Mbits/sec
StdDev = 31.689 Mbits/sec

m2 client / m3 server / gro off
Average bandwidth = 643.744 Mbits/sec
StdDev = 29.4622 Mbits/sec

Server Output Sample:
[ 6] 0.0-181.1 sec 803 MBytes 37.2 Mbits/sec
[ 5] local ... port 5001 connected with ... port 32660
[ 4] 0.0-191.1 sec 802 MBytes 35.2 Mbits/sec
[ 6] local ... port 5001 connected with ... port 32661
[ 5] 0.0-201.2 sec 803 MBytes 33.5 Mbits/sec
[ 4] local ... port 5001 connected with ... port 32662
[ 6] 0.0-211.2 sec 782 MBytes 31.1 Mbits/sec
[ 5] local ... port 5001 connected with ... port 32663
[ 4] 0.0-221.3 sec 798 MBytes 30.2 Mbits/sec
[ 6] local ... port 5001 connected with ... port 32664
[ 5] 0.0-231.4 sec 795 MBytes 28.8 Mbits/sec
[ 4] local ... port 5001 connected with ... port 32665
[ 6] 0.0-241.4 sec 796 MBytes 27.7 Mbits/sec
[ 4] 0.0-251.5 sec 801 MBytes 26.7 Mbits/sec

So it looks like the problem happens with Xen prior to 4.2, and I still think it is a regression, since kernel 3.11 handles inbound network traffic better with or without GRO.

Thanks,
Rodrigo.

Revision history for this message
Stefan Bader (smb) wrote :

It may look like a Xen issue, though the t1.micro instances were using some 3.4 versions, too. One had "kaos" in the version string; the other looked more like the common variant. If Amazon does not play tricks and apply different patches to the same version of Xen on different instance types (which I can only hope they don't), then the issue might be caused by the networking setup in dom0. Different versions of Xen likely mean different kernels in dom0 (but they could just as well have different kernels in dom0 with the same Xen version). And of course we have no clue whether they use standard bridging or maybe openvswitch, and if openvswitch, whether they use the in-kernel version or the upstream one...

So it might be a kernel issue, but not in the way you think. The fact that guest kernels before 3.13 handled GRO=on better is just because they did not use GRO (even when it was set to on). So the guest kernel did not regress; it just uncovered a problem that likely existed before. From what we have gathered so far, my vague theory would be that something in the host kernel (counting openvswitch modules of any origin) causes GRO, when turned on inside a guest, to make things slower rather than faster (and also with a greater variance in the average). Some people with better networking skills said this could happen if the skb truesize goes wrong or a big skb ends up with many small fragments. If we ignore the case where two instances end up on the same host, we have a chain of the host receiving the packets through its NIC, then routing/switching them onto the netback of the guest, which then makes them available to netfront inside the guest. Right now that area (host receiving and forwarding) looks suspicious.
