KVM images lose connectivity with bridged network

Bug #997978 reported by Jonathan Tullett
This bug affects 61 people
Affects                    Status         Importance   Assigned to   Milestone
OpenStack Compute (nova)   Invalid        Undecided    Unassigned
qemu-kvm (Ubuntu)          Fix Released   High         Unassigned
  Precise                  Fix Released   High         Unassigned

Bug Description

=========================================
SRU Justification:
1. Impact: networking breaks after a while in kvm guests using virtio networking
2. Development fix: The bug was fixed upstream and the fix picked up in a new
   merge.
3. Stable fix: 3 virtio patches are cherrypicked from upstream:
   a821ce5 virtio: order index/descriptor reads
   92045d8 virtio: add missing mb() on enable notification
   a281ebc virtio: add missing mb() on notification
4. Test case: Create a bridge enslaving the real NIC, and use that as the bridge
   for a kvm instance with virtio networking. See comment #44 for specific test
   case.
5. Regression potential: Should be low as several people have tested the fixed
   package under heavy load.
=========================================

System:
-----------
Dell R410 Dual processor 2.4Ghz w/16G RAM
Distributor ID: Ubuntu
Description: Ubuntu 12.04 LTS
Release: 12.04
Codename: precise

Setup:
---------
We're running 3 KVM guests, all Ubuntu 12.04 LTS using bridged networking.

From the host:
# cat /etc/network/interfaces
auto br0
iface br0 inet static
        address 212.XX.239.98
        netmask 255.255.255.240
        gateway 212.XX.239.97
        bridge_ports eth0
        bridge_fd 9
        bridge_hello 2
        bridge_maxage 12
        bridge_stp off

# ifconfig eth0
eth0 Link encap:Ethernet HWaddr d4:ae:52:84:2d:5a
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:11278363 errors:0 dropped:3128 overruns:0 frame:0
          TX packets:14437384 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:4115980743 (4.1 GB) TX bytes:5451961979 (5.4 GB)
          Interrupt:36 Memory:da000000-da012800

# ifconfig br0
br0 Link encap:Ethernet HWaddr d4:ae:52:84:2d:5a
          inet addr:212.XX.239.98 Bcast:212.XX.239.111 Mask:255.255.255.240
          inet6 addr: fe80::d6ae:52ff:fe84:2d5a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:1720861 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1708622 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:210152198 (210.1 MB) TX bytes:300858508 (300.8 MB)

# brctl show
bridge name bridge id STP enabled interfaces
br0 8000.d4ae52842d5a no eth0

I have no default network configured to autostart in libvirt as we're using bridged networking:
# virsh net-list --all
Name State Autostart
-----------------------------------------
default inactive no

# arp
Address HWtype HWaddress Flags Mask Iface
mailer03.xxxx.com ether 52:54:00:82:5f:0f C br0
mailer01.xxxx.com ether 52:54:00:d2:f7:31 C br0
mailer02.xxxx.com ether 52:54:00:d3:8f:91 C br0
dxi-gw2.xxxx.com ether 00:1a:30:2a:b1:c0 C br0

From one of the guests:
<domain type='kvm' id='4'>
  <name>mailer01</name>
  <uuid>d41d1355-84e8-ae23-e84e-227bc0231b97</uuid>
  <memory>2097152</memory>
  <currentMemory>2097152</currentMemory>
  <vcpu>1</vcpu>
  <os>
    <type arch='x86_64' machine='pc-1.0'>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
  </features>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/bin/kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/dev/mapper/vg_main-mailer01--root'/>
      <target dev='hda' bus='ide'/>
      <alias name='ide0-0-0'/>
      <address type='drive' controller='0' bus='0' unit='0'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/dev/mapper/vg_main-mailer01--swap'/>
      <target dev='hdb' bus='ide'/>
      <alias name='ide0-0-1'/>
      <address type='drive' controller='0' bus='0' unit='1'/>
    </disk>
    <controller type='ide' index='0'>
      <alias name='ide0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:d2:f7:31'/>
      <source bridge='br0'/>
      <target dev='vnet0'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/0'/>
      <target port='0'/>
      <alias name='serial0'/>
    </serial>
    <console type='pty' tty='/dev/pts/0'>
      <source path='/dev/pts/0'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <input type='mouse' bus='ps2'/>
    <graphics type='vnc' port='5900' autoport='yes' listen='127.0.0.1'>
      <listen type='address' address='127.0.0.1'/>
    </graphics>
    <video>
      <model type='cirrus' vram='9216' heads='1'/>
      <alias name='video0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </memballoon>
  </devices>
  <seclabel type='dynamic' model='apparmor' relabel='yes'>
    <label>libvirt-d41d1355-84e8-ae23-e84e-227bc0231b97</label>
    <imagelabel>libvirt-d41d1355-84e8-ae23-e84e-227bc0231b97</imagelabel>
  </seclabel>
</domain>
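
When comparing several guests, it can help to pull just the network-relevant fields out of a domain definition like the one above. The following is a minimal sketch, assuming the XML comes from `virsh dumpxml`; the function name `list_interfaces` and the trimmed `SAMPLE` string are illustrative and not part of the original report.

```python
import xml.etree.ElementTree as ET


def list_interfaces(domain_xml):
    """Return one dict per <interface> element found under <devices>."""
    root = ET.fromstring(domain_xml)
    nics = []
    for iface in root.findall("./devices/interface"):

        def attr(tag, name):
            # Child elements such as <model> are optional, so tolerate absence.
            node = iface.find(tag)
            return node.get(name) if node is not None else None

        nics.append({
            "type": iface.get("type"),
            "mac": attr("mac", "address"),
            "bridge": attr("source", "bridge"),
            "tap": attr("target", "dev"),
            "model": attr("model", "type"),
        })
    return nics


# Trimmed copy of the guest definition above (illustrative only).
SAMPLE = """
<domain type='kvm'>
  <devices>
    <interface type='bridge'>
      <mac address='52:54:00:d2:f7:31'/>
      <source bridge='br0'/>
      <target dev='vnet0'/>
      <model type='virtio'/>
    </interface>
  </devices>
</domain>
"""
```

Calling `list_interfaces(SAMPLE)` yields a single entry showing the guest NIC attached to br0 through tap device vnet0 with the virtio model.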

From within the guest:
# cat /etc/network/interfaces
# The primary network interface
auto eth0
iface eth0 inet static
        address 212.XX.239.100
        netmask 255.255.255.240
        network 212.XX.239.96
        broadcast 212.XX.239.111
        gateway 212.XX.239.97

# ifconfig
eth0 Link encap:Ethernet HWaddr 52:54:00:d2:f7:31
          inet addr:212.XX.239.100 Bcast:212.XX.239.111 Mask:255.255.255.240
          inet6 addr: fe80::5054:ff:fed2:f731/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:5631830 errors:0 dropped:0 overruns:0 frame:0
          TX packets:6683416 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2027322829 (2.0 GB) TX bytes:2076698690 (2.0 GB)

A commandline which starts the KVM guest:
/usr/bin/kvm -S -M pc-1.0 -enable-kvm -m 2048 -smp 1,sockets=1,cores=1,threads=1 -name mailer01 -uuid d41d1355-84e8-ae23-e84e-227bc0231b97 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/mailer01.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -drive file=/dev/mapper/vg_main-mailer01--root,if=none,id=drive-ide0-0-0,format=raw -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -drive file=/dev/mapper/vg_main-mailer01--swap,if=none,id=drive-ide0-0-1,format=raw -device ide-drive,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1 -netdev tap,fd=18,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:d2:f7:31,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -usb -vnc 127.0.0.1:0 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4

Problem:
------------
Periodically (at least once a day), one or more of the guests lose network connectivity. Ping responds with 'host unreachable', even from the dom0 host. Logging in via the serial console shows no problems: eth0 is up and the guest can ping the local host, but there is no outside connectivity. Restarting the network (/etc/init.d/networking restart) does nothing. Rebooting the guest brings it back to life.

I've verified there are no arp games going on on the primary host (the arp tables remain the same before, when it had connectivity, and after, when it doesn't).

This is a critical issue affecting production services on the latest LTS release of Ubuntu. It's similar to an issue which was 'resolved' in 10.04 but appears to have reared its ugly head again.

Changed in libvirt (Ubuntu):
importance: Undecided → High
Serge Hallyn (serge-hallyn) wrote :

Thanks for reporting this bug. Does this also happen if only one of the VMs is up? Is there any pattern to the time of day or length of a VM's uptime before this happens? What does 'route -n' show before and after it happens?

Serge Hallyn (serge-hallyn) wrote :

(setting status to incomplete while awaiting response)

Changed in libvirt (Ubuntu):
status: New → Incomplete
Jonathan Tullett (j+launchpad-net) wrote :

This happens to one of the VMs at any one time (on Wednesday two failed, about 2 hours apart). There's no discernible pattern in terms of time of day or length of a VM's uptime.

The next time one fails (they've been stable today), I'll do a route -n and post the output. For the record, currently (with a working VM), route -n shows:

 $ route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 212.XX.239.97 0.0.0.0 UG 100 0 0 eth0
212.XX.239.96 0.0.0.0 255.255.255.240 U 0 0 0 eth0

Serge Hallyn (serge-hallyn) wrote : Re: [Bug 997978] Re: KVM images lose connectivity with bridged network

Thanks,

In order to check whether it is a qemu (perhaps virtio driver) bug or
a bug in the kernel or network utilities on the host, would you be
able to try setting up a container and checking its networking?
There are lighter weight ways of testing this, but the simplest way
would be to:

sudo apt-get install lxc
# If having lxcbr0 bothers you, since you don't need it for this test, you
# can set LXC_AUTO=false in /etc/default/lxc and do
# "sudo stop lxc; sudo start lxc".
cat > lxc.conf << EOF
lxc.network.type=veth
lxc.network.link=br0
lxc.network.flags=up
EOF

sudo lxc-create -t ubuntu -f lxc.conf -n lxc1
sudo lxc-start -n lxc1 -d

Then log into the container's console with

sudo lxc-console -n lxc1

and, from there, periodically check the network status. If that also
loses connectivity periodically, then we know the bug is happening
below kvm.

Jonathan Tullett (j+launchpad-net) wrote :

I tried building the container this morning using both your file above, and the following:
lxc.network.type=veth
lxc.network.link=br0
lxc.network.flags=up
lxc.network.ipv4=212.XX.239.103/28
lxc.network.name=eth0

# lxc-create -t ubuntu -f lxc.conf -n lxc1
debootstrap is /usr/sbin/debootstrap
Checking cache download in /var/cache/lxc/precise/rootfs-amd64 ...
Copy /var/cache/lxc/precise/rootfs-amd64 to /var/lib/lxc/lxc1/rootfs ...
Copying rootfs to /var/lib/lxc/lxc1/rootfs ...

##
# The default user is 'ubuntu' with password 'ubuntu'!
# Use the 'sudo' command to run tasks as root in the container.
##

'ubuntu' template installed
'lxc1' created

But it fails to start:

# lxc-start -n lxc1
lxc-start: failed to spawn 'lxc1'

/var/log/syslog shows:
May 12 09:47:12 dom0 kernel: [ 1107.903216] device vethHzjri2 entered promiscuous mode
May 12 09:47:12 dom0 kernel: [ 1107.905151] ADDRCONF(NETDEV_UP): vethHzjri2: link is not ready

ifconfig shows a load of virtual devices and brctl shows:

# brctl show
bridge name bridge id STP enabled interfaces
br0 8000.0649d9be1876 no eth0
       veth4bkC47
       vethHzjri2
       vethNBwjzP
       vethZo4vwt
       vethhzluzM
       vethidQWcJ
       vethmtoeDY
       vethuPj7Qk
       vethuxztRp
       vnet1

(I've tried starting the container a few times).
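
Each failed start appears to have left a veth device attached to br0. A small parser makes such leftovers easy to spot. This is a sketch only: it assumes the raw `brctl show` layout, where continuation lines begin with whitespace (the alignment was lost in the paste above), and the helper names `parse_brctl_show` and `veth_ports` are made up for illustration.

```python
def parse_brctl_show(text):
    """Map each bridge name to the list of interfaces attached to it."""
    bridges = {}
    current = None
    for line in text.splitlines():
        if not line.strip() or line.startswith("bridge name"):
            continue  # skip blank lines and the header row
        fields = line.split()
        if line[0].isspace():
            # Continuation line: more ports of the bridge named above.
            if current is not None:
                bridges[current].extend(fields)
        else:
            # Columns: name, bridge id, STP flag, optional first interface.
            current = fields[0]
            bridges[current] = fields[3:]
    return bridges


def veth_ports(bridges, bridge):
    """Return the ports of one bridge whose names start with 'veth'."""
    return [port for port in bridges.get(bridge, []) if port.startswith("veth")]


# Shortened version of the output above, with the original indentation assumed.
SAMPLE = """\
bridge name     bridge id               STP enabled     interfaces
br0             8000.0649d9be1876       no              eth0
                                                        veth4bkC47
                                                        vethHzjri2
                                                        vnet1
"""
```

`veth_ports(parse_brctl_show(SAMPLE), "br0")` lists the two veth devices in the sample, which can then be compared against the containers that are actually running.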

I'm happy to debug this with you, but lxc isn't software I'm familiar with, unfortunately. Any ideas?

Jonathan Tullett (j+launchpad-net) wrote :

It took a few days, but we've finally had a failure of VM instance 2. It died 2.5 hours ago. Logging into the dom0 host shows the arp table dead for that host:

mailer02.xxxx.com (incomplete) br0
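
An "(incomplete)" entry like this is easy to check for automatically. The sketch below scans the text printed by `arp` and returns the hosts whose hardware address is unresolved. It assumes the column layout shown in this report; `incomplete_arp_hosts` is a hypothetical helper name.

```python
def incomplete_arp_hosts(arp_output):
    """Return the hosts whose ARP entry is marked '(incomplete)'."""
    hosts = []
    for line in arp_output.splitlines():
        fields = line.split()
        # Unresolved entries show '(incomplete)' where HWtype/HWaddress would be.
        if len(fields) >= 2 and fields[1] == "(incomplete)":
            hosts.append(fields[0])
    return hosts


# Lines taken from the arp output quoted in this report.
SAMPLE = """\
Address HWtype HWaddress Flags Mask Iface
mailer03.xxxx.com ether 52:54:00:82:5f:0f C br0
mailer01.xxxx.com ether 52:54:00:d2:f7:31 C br0
mailer02.xxxx.com (incomplete) br0
dxi-gw2.xxxx.com ether 00:1a:30:2a:b1:c0 C br0
"""
```

Running this against periodic `arp` snapshots on the host would show when a guest first stops answering ARP requests.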

Logging into the machine itself via console:

root:~# ifconfig
eth0 Link encap:Ethernet HWaddr 52:54:00:d3:8f:91
          inet addr:212.XX.239.101 Bcast:212.XX.239.111 Mask:255.255.255.240
          inet6 addr: fe80::5054:ff:fed3:8f91/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:6893246 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8242152 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2290922658 (2.2 GB) TX bytes:3314395798 (3.3 GB)

root:~# route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 212.XX.239.97 0.0.0.0 UG 100 0 0 eth0
212.XX.239.96 0.0.0.0 255.255.255.240 U 0 0 0 eth0

Restarting the network on the vm (/etc/init.d/networking restart) does nothing. Rebooting the VM brings it back to life.

I'm happy to try with the container again, but using the instructions provided (even with some additional research online), I can't get it running. Please advise.

Thanks.

Serge Hallyn (serge-hallyn) wrote :

Sorry, I didn't see your update on the 12th.

The configuration file as you showed it may also work, but it should be simpler
to use the one I posted.

Can you do
 sudo lxc-start -n lxc1 -l DEBUG -o debugout

and attach the file debugout here?

Do you have cgroups mounted? (what does 'grep cgroup /proc/self/mounts' show?)

Does the dhcp server for that network answer all requests, or only for
certain mac addresses? (Lack of dhcp response shouldn't prevent the
container from starting anyway, so shouldn't explain the problem)

Jonathan Tullett (j+launchpad-net) wrote :

Hey,

There is no DHCP server for that network, which is why I set up the static IPs.

The cgroup-bin package wasn't installed; I installed it and then grep showed some output:

root:~# grep cgroup /proc/self/mounts
cgroups /sys/fs/cgroup tmpfs rw,relatime,mode=755 0 0
cgroup /sys/fs/cgroup/cpu cgroup rw,relatime,cpu 0 0
cgroup /sys/fs/cgroup/cpuacct cgroup rw,relatime,cpuacct 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,relatime,devices 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,relatime,memory 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,relatime,freezer 0 0

The container has now started.

I'll configure it so it's running the same software as the other VMs and let's see what happens over the coming days.

Thanks for your help so far.

Changed in libvirt (Ubuntu):
status: Incomplete → New
Changed in bridge-utils (Ubuntu):
status: New → Incomplete
Changed in libvirt (Ubuntu):
status: New → Incomplete
Changed in bridge-utils (Ubuntu):
importance: Undecided → High
Jonathan Tullett (j+launchpad-net) wrote :

An update: had a different KVM VM die today (same symptoms/resolution as previously); the LXC instance remains working.

Jonathan Tullett (j+launchpad-net) wrote :

Another update: multiple KVM VM failures over the past couple of days, the lxc-container is working without issue.

Serge Hallyn (serge-hallyn) wrote :

Thanks Jonathan, sounds like the issue is definitely in qemu then.

Changed in qemu-kvm (Ubuntu):
status: New → Confirmed
Changed in bridge-utils (Ubuntu):
status: Incomplete → Invalid
Changed in qemu-kvm (Ubuntu):
importance: Undecided → High
Georg Leciejewski (vespaschorsch) wrote :

I have the same symptoms with Ubuntu 12. The network is unreachable 3-4 times a day, for about 1-2 minutes, but recovers by itself. It does not seem to be traffic/load related, as it also happens when nothing is running.

- No dhcp
- bridged network with static ip's for guests
- atm only running one VM Ubuntu 12
- pretty bare host install with just kvm and libvirt
- no relevant output in any log files

Network

---------------------------------------
interfaces file on host:

auto lo
iface lo inet loopback

# device: eth0
auto eth0
iface eth0 inet static
  address 176.9.x.x
  broadcast 176.9.x.x
  netmask 255.255.255.192
  gateway 176.9.x.x

# default route to access subnet
up route add -net 176.9.x.x netmask 255.255.255.192 gw 176.9.x.x eth0

auto br0
iface br0 inet static
  address 176.9.x.x
  netmask 255.255.255.255
  gateway 176.9.x.x
  pointopoint 176.9.x.x
  bridge_ports eth0
  bridge_stp off
  bridge_fd 0
  bridge_maxwait 0
  up route add -host 176.9.x.x dev br0
  up route add -host 176.9.x.x dev br0

------------------------------------
interfaces in vm

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
auto eth0
iface eth0 inet static
  address 176.9.x.x
  netmask 255.255.255.192
  gateway 176.9.x.x
  pointopoint 176.9.x.x
  dns-nameservers 213.133.98.98 213.133.99.99

--------------------------------
virsh # version
Compiled against library: libvir 0.9.8
Using library: libvir 0.9.8
Using API: QEMU 0.9.8
Running hypervisor: QEMU 1.0.0

Serge Hallyn (serge-hallyn) wrote :

Ok, looking back over the original description, the bug poster has an eth0 mac address of d4:ae:52:84:2d:5a. I'm quite certain that is at least a part of the problem - the bridge will always take the lowest mac address of any device on it, and the nics on the VMs have lower mac addresses. So any time a VM goes on or offline, the bridge mac address will change, causing network traffic to pause.

Georg, please show 'ifconfig -a' output while the VMs are running

Serge Hallyn (serge-hallyn) wrote :

If the mac address of eth0 is the cause of your problem, then a (non-ideal) workaround would be to use the stock NATed virbr0 for your VMs instead, as it won't be bridged with eth0.

Serge Hallyn (serge-hallyn) wrote :

Sorry, I had it backwards. The bridge takes the higher address.

bradleyd (bradleydsmith) wrote :

I have the same issue as stated above, but instead of rebooting the guests to bring them back, a quick ifdown, ifup does the trick. Network is restored after this.

Georg Leciejewski (vespaschorsch) wrote :

@serge, thanks for your input. here Dom0 ifconfig, with 2 machines running.

i set up br0 myself because the machines are running at hetzner, with special addon IP's so the vm's can be reached from the outside. (wiki.hetzner.de/index.php/KVM_mit_Nutzung_aller_IPs_-_the_easy_way) i am also evaluating the problem with hetzner support to ensure it's not one of their routers in front of the Dom0.

vnet0, vnet1 are the bridges created by libvirt, but the vm have static ip entries in /network/interfaces and are bound to br0 via libvirt.

------------------------------

br0 Link encap:Ethernet HWaddr c8:60:00:e9:4a:2e
          inet addr:176.9.126.xx Bcast:0.0.0.0 Mask:255.255.255.255
          inet6 addr: fe80::ca60:ff:fee9:4a2e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:5852477 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4403013 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:2310013511 (2.3 GB) TX bytes:2588138678 (2.5 GB)

eth0 Link encap:Ethernet HWaddr c8:60:00:e9:4a:2e
          inet addr:176.9.126.xx Bcast:176.9.126.95 Mask:255.255.255.192
          UP BROADCAST RUNNING PROMISC MULTICAST MTU:1500 Metric:1
          RX packets:6623242 errors:0 dropped:35190 overruns:0 frame:0
          TX packets:4668377 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2484375973 (2.4 GB) TX bytes:1715146205 (1.7 GB)
          Interrupt:17 Memory:fe500000-fe520000

lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:16436 Metric:1
          RX packets:59917 errors:0 dropped:0 overruns:0 frame:0
          TX packets:59917 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:9106822 (9.1 MB) TX bytes:9106822 (9.1 MB)

vnet0 Link encap:Ethernet HWaddr fe:54:00:2e:3d:0e
          inet6 addr: fe80::fc54:ff:fe2e:3d0e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:2033457 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2971987 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:500
          RX bytes:1023321831 (1.0 GB) TX bytes:1482597549 (1.4 GB)

vnet1 Link encap:Ethernet HWaddr fe:54:00:3f:2a:9c
          inet6 addr: fe80::fc54:ff:fe3f:2a9c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:124034 errors:0 dropped:0 overruns:0 frame:0
          TX packets:140178 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:500
          RX bytes:27970351 (27.9 MB) TX bytes:56681695 (56.6 MB)
------------------------------------

Georg Leciejewski (vespaschorsch) wrote :

Got this wrong:
vnet0, vnet1 are the bridges created by libvirt, => are the network interfaces created by libvirt.

I am just reading these bug reports concerning the MAC address problems:

https://bugzilla.redhat.com/show_bug.cgi?id=571991
https://bugzilla.redhat.com/show_bug.cgi?id=583139

and so far it seems ok that the br0 still has the MAC of eth0 and the VMs both start with FE::
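
The thread contains conflicting statements about which port MAC the bridge adopts, so the sketch below does not try to settle that. It only ranks the port MACs numerically and reports which ports currently share the bridge's address, using the values from the ifconfig output above. The helper names are illustrative.

```python
def mac_to_int(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' to an integer so MACs compare numerically."""
    return int(mac.replace(":", ""), 16)


def rank_ports(port_macs):
    """Return port names ordered from lowest to highest MAC address."""
    return sorted(port_macs, key=lambda name: mac_to_int(port_macs[name]))


def ports_sharing_bridge_mac(bridge_mac, port_macs):
    """Return the ports whose MAC equals the MAC shown on the bridge."""
    target = mac_to_int(bridge_mac)
    return [name for name, mac in port_macs.items() if mac_to_int(mac) == target]


# Addresses copied from the ifconfig output posted above.
BR0_MAC = "c8:60:00:e9:4a:2e"
PORTS = {
    "eth0": "c8:60:00:e9:4a:2e",
    "vnet0": "fe:54:00:2e:3d:0e",
    "vnet1": "fe:54:00:3f:2a:9c",
}
```

With these values, `rank_ports(PORTS)` puts eth0 first and the two fe:54 tap devices after it, and `ports_sharing_bridge_mac(BR0_MAC, PORTS)` returns only eth0. Re-running the check after a VM starts or stops would show whether the bridge address ever moves.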

Serge Hallyn (serge-hallyn) wrote :

Quoting Georg Leciejewski (<email address hidden>):
> @serge, thanks for your input. here Dom0 ifconfig, with 2 machines
> running.
>
> i setup br0 myself because the machines are running at hetzner, with
> special addon IP's so the vm's can be reached from the outside.
> (wiki.hetzner.de/index.php/KVM_mit_Nutzung_aller_IPs_-_the_easy_way) i
> am also evaluating the problem with hetzner support to ensure its not
> one of their routers in from of the Dom0.
>
> vnet0, vnet1 are the bridges created by libvirt, but the vm have static
> ip entries in /network/interfaces and are bound to br0 via libvirt.
>
> ------------------------------
>
> br0 Link encap:Ethernet HWaddr c8:60:00:e9:4a:2e
> inet addr:176.9.126.xx Bcast:0.0.0.0 Mask:255.255.255.255
> inet6 addr: fe80::ca60:ff:fee9:4a2e/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:5852477 errors:0 dropped:0 overruns:0 frame:0
> TX packets:4403013 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:2310013511 (2.3 GB) TX bytes:2588138678 (2.5 GB)
>
> eth0 Link encap:Ethernet HWaddr c8:60:00:e9:4a:2e
> inet addr:176.9.126.xx Bcast:176.9.126.95 Mask:255.255.255.192
> UP BROADCAST RUNNING PROMISC MULTICAST MTU:1500 Metric:1
> RX packets:6623242 errors:0 dropped:35190 overruns:0 frame:0
> TX packets:4668377 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000
> RX bytes:2484375973 (2.4 GB) TX bytes:1715146205 (1.7 GB)
> Interrupt:17 Memory:fe500000-fe520000

One thing I notice here is that you have an ip address on eth0, which I
assume is bridged with br0?

When I bridge eth0 to br0 using the following /etc/network/interfaces:

=========================
auto lo
iface lo inet loopback

auto br0
iface br0 inet dhcp
 bridge_ports eth0

# The primary network interface
auto eth0
iface eth0 inet manual
=========================

I get the following ifconfig -a output:

=========================
br0 Link encap:Ethernet HWaddr fa:16:3e:59:27:16
          inet addr:10.55.60.89 Bcast:10.55.60.255 Mask:255.255.255.0
          inet6 addr: fe80::f816:3eff:fe59:2716/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:223 errors:0 dropped:0 overruns:0 frame:0
          TX packets:178 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:21627 (21.6 KB) TX bytes:19555 (19.5 KB)

eth0 Link encap:Ethernet HWaddr fa:16:3e:59:27:16
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:259 errors:0 dropped:0 overruns:0 frame:0
          TX packets:178 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:26871 (26.8 KB) TX bytes:19487 (19.4 KB)
=========================

What does your /etc/network/interfaces look like?

Georg Leciejewski (vespaschorsch) wrote :

here it is. already posted it above with ip's xx:

================================
auto lo
iface lo inet loopback

# device: eth0
auto eth0
iface eth0 inet static
  address 176.9.126.79
  broadcast 176.9.126.95
  netmask 255.255.255.192
  gateway 176.9.126.65

# default route to access subnet
up route add -net 176.9.126.64 netmask 255.255.255.192 gw 176.9.126.65 eth0

auto br0
iface br0 inet static
  address 176.9.126.79
  netmask 255.255.255.255
  gateway 176.9.126.65
  pointopoint 176.9.126.65
  bridge_ports eth0
  bridge_stp off
  bridge_fd 0
  bridge_maxwait 0
  up route add -host 176.9.126.92 dev br0
  up route add -host 176.9.126.93 dev br0

======================================
and brctl show
======================================
bridge name bridge id STP enabled interfaces
br0 8000.c86000e94a2e no eth0
                              vnet0
virbr0 8000.000000000000 yes
======================================

I also did an mtr during downtimes and it shows that packets are lost on dom0 -> .79
I still hope it is some kind of misconfig, but as said before, the same network/bridge config ran without interruptions in Ubuntu 10.04 LTS.
Thanks for your patience.

Serge Hallyn (serge-hallyn) wrote :

Quoting Georg Leciejewski (<email address hidden>):
> here it is. already posted it above with ip's xx:
>
> ================================
> auto lo
> iface lo inet loopback
>
> # device: eth0
> auto eth0
> iface eth0 inet static

Hi,

I believe this is wrong. Could you change the eth0 bit to simply

auto eth0
iface eth0 inet manual

The fact that

> I also did an mtr during downtimes and it shows that packages are lost on dom0 -> .79

Supports the idea that that might solve your problem. (It also suggests
that yours is not the same as the original bug reporter's problem).

Georg Leciejewski (vespaschorsch) wrote :

I tried that with no difference, but i am on another path: maybe it is related to acpi

Changed in libvirt (Ubuntu):
status: Incomplete → Confirmed
Christian Parpart (trapni) wrote :

I am having the same issue, and gave an in-depth inspection in my report: bug 1016848

I am running Essex 2012.1 with networking in VlanManager mode, a dedicated nova-network gateway, and KVM as the virt-type, on Ubuntu 12.04 Precise. I run into this incident every 1-2 days; yesterday it even happened twice within 3 hours.

One symptom may be high networking traffic I/O on the given KVM instance.

Until now, I have worked around it with `nova reboot $instance_name`.

bradleyd (bradleydsmith), what exactly did you mean by ifdown & ifup? Which interface did you re-up: the vnet%d interface or the bridge (e.g. br100)? In the meantime, I'd like to write a tiny daemon that runs on the hypervisor nodes, checks every N seconds whether it can ping the KVM guest, and, if not, re-ups its underlying network interface.
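
A watchdog along those lines could be sketched as follows. This is only an outline under stated assumptions: it relies on the Linux iputils `ping` options `-c` (count) and `-W` (timeout in seconds), and the recovery step is left as a caller-supplied `recover` callback because the thread does not establish which interface has to be re-upped. All names here are hypothetical.

```python
import subprocess
import time


def ping_once(host, timeout_s=2):
    """Return True if a single ICMP echo request to host is answered."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def watch(hosts, recover, ping=ping_once, failures_needed=3,
          interval_s=10, rounds=None, sleep=time.sleep):
    """Ping every host once per round; call recover(host) after
    failures_needed consecutive misses. rounds=None loops forever."""
    misses = {host: 0 for host in hosts}
    done = 0
    while rounds is None or done < rounds:
        for host in hosts:
            if ping(host):
                misses[host] = 0
            else:
                misses[host] += 1
                if misses[host] >= failures_needed:
                    recover(host)      # placeholder for the real recovery action
                    misses[host] = 0   # start counting again after recovery
        done += 1
        sleep(interval_s)
    return misses
```

Requiring several consecutive misses avoids reacting to a single dropped packet. The `ping`, `sleep` and `rounds` parameters exist so the loop can be exercised without touching the network, for example `watch(["guest1"], recover=print, ping=lambda h: False, rounds=3, sleep=lambda s: None)`.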

Serge Hallyn, I'd like to assist in whatever you need to get this beast fixed, since this is a very major incident for me too, and I can't add more production services until I know the OpenStack stack is functioning well. So please tell me what I can provide you with. :-)

Regards,
Christian.

andrew bezella (abezella) wrote :

i believe that we're seeing the same problem with ganeti-managed kvm instances running on 12.04 utilizing a bridged network.

in an initial deployment of 8 guests (also 12.04) we had half of them drop off the network within a few hours. there is a weak correlation between high network load in the guests and the network dropping. in some but not all cases from the kvm instance's console i was able to ifdown/ifup its interface and bring it back online.

Stephane Neveu (stefneveu) wrote :

Same kind of problem here :

Running Ubuntu-servers 12.04 VMs on Ubuntu-servers 12.04 using KVM/Libvirt over bridges.

Pinging my gateway from a random VM and watching packets with tcpdump on the kvm host :

icmp is ok on my vnet -> ok on the bridge -> ok on my bond (active-backup) -> ok on my gateway (reply) -> ok on my bond -> ok on my bridge -> No packet received on my vnet !!!!
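
The hop-by-hop check above amounts to walking the packet path in order and stopping at the first capture point where nothing is seen. A minimal sketch of that bookkeeping, with the observations entered by hand from the tcpdump runs (the function name and hop labels are illustrative):

```python
def first_missing_hop(observations):
    """Given (hop, seen) pairs in path order, return the first hop where
    the packet was not seen, or None if it was seen everywhere."""
    for hop, seen in observations:
        if not seen:
            return hop
    return None


# The path described above: request out through the bond, reply back in.
TRACE = [
    ("vnet (request)", True),
    ("bridge (request)", True),
    ("bond (request)", True),
    ("gateway (reply)", True),
    ("bond (reply)", True),
    ("bridge (reply)", True),
    ("vnet (reply)", False),
]
```

For this trace the result is "vnet (reply)", meaning the reply reaches the bridge but is not seen on the guest's tap device.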

brctl showmacs mybridge seems to be ok, showing my MAC addresses (bridge + VM).

I have to ifdown/ifup eth0 on the virtual guest to make it work again until the next time.

Stephane Neveu (stefneveu) wrote :

Note I'm using 3.2.0-23-virtual kernel for my VMs ...

Christian Parpart (trapni) wrote :

So is it a bug in the VM's networking driver or in the hypervisor ?

Serge Hallyn (serge-hallyn) wrote :

@Georg,

have you found any more about the relation to acpi?

@Stephane,

in your case it would certainly seem to be a bug in either the guest kernel or the virtual nic driver, as with the original bug submitter. Can you try switching to a different virtual nic type, i.e.

   model type='ne2k_pci'
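
For anyone scripting this change across several guests, the model switch can be applied to a dumped domain definition as sketched below. This is an illustration only: `set_nic_model` is a made-up helper, it changes every interface in the domain, and the resulting XML would still have to be loaded back with `virsh define` (or the change made by hand through `virsh edit`) and the guest restarted.

```python
import xml.etree.ElementTree as ET


def set_nic_model(domain_xml, new_model):
    """Return the domain XML with every <interface>'s model type replaced."""
    root = ET.fromstring(domain_xml)
    for iface in root.findall("./devices/interface"):
        model = iface.find("model")
        if model is None:
            # No explicit model was set, so add the element.
            model = ET.SubElement(iface, "model")
        model.set("type", new_model)
    return ET.tostring(root, encoding="unicode")


# Trimmed domain definition for illustration.
SAMPLE = """
<domain type='kvm'>
  <devices>
    <interface type='bridge'>
      <mac address='52:54:00:d2:f7:31'/>
      <source bridge='br0'/>
      <model type='virtio'/>
    </interface>
  </devices>
</domain>
"""
```

`set_nic_model(SAMPLE, "ne2k_pci")` returns the same definition with the model changed, while the MAC address and bridge are left untouched.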

Also if it is possible for you to run a test on a quantal host, which has a much newer qemu-kvm, that would be interesting. Can you tell me in numbers how heavy traffic needs to be before the VM drops out? Is it traffic to the VM which freezes that VM, or does any traffic to the host or any VM threaten to freeze any VM?

@Christian,

several people have piped in on this bug. I'm not certain about yours, but this bug in particular is in qemu itself (or perhaps, though unlikely, the kernel).

Stephane Neveu (stefneveu) wrote :

@Serge

I upgraded my kernels yesterday evening on both hypervisors and guests, so I'm now running 3.2.0-25-generic on the hosts and the same kernel version, but the virtual flavour, in the VMs. Same problem this morning: some VMs dropped out. I'm currently using virtio everywhere. I'll test another NIC model and report back as soon as I can.
I also tried running a VM with the generic kernel (not the virtual one) and I'm facing the same issue.
Serge, as I'm still building this new platform, I can say there is no traffic at all on my VMs except for two people testing some Java stuff. I'm almost starting to think that VMs drop out when there is not enough traffic on the NICs...

I'll keep you posted.

Stephane Neveu (stefneveu) wrote :

Serge,

ne2k_pci seems to drop the link when generating some traffic (only tested on 4 VMs).
e1000 seems to have the same problem as virtio, dropping connections without traffic ...
What else may I try ? Is it really a driver issue ?

Stephane Neveu (stefneveu) wrote :

I've noticed I never had such a problem on one host running 3.2.0-24-generic ... on this one, my VMs have only 2 vnet per VM whereas on others I have at least 4 vnet per VM.

Is there a tap generation limit somewhere ? (I don't think so, I do not see such a thing in sysctl -a)
I'll try to downgrade my kernel on one buggy host while waiting for some ideas...

Serge Hallyn (serge-hallyn) wrote :

@Stephane,

can you show your host network configuration and a VM's xml? Are you bridging the VM nics with the host's eth0 over your own br0, or are you using the default NATed virbr0? Does this happen even when only a single VM is up?

I will try to reproduce next week.

(For host network configuration, the results of :
   sudo ifconfig -a
   sudo brctl show
   netstat -nr
   virsh net-dumpxml
should suffice, and 'virsh dumpxml VM1' to show the xml configuration for a guest)
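
The commands listed above can be gathered into a single report for attaching to the bug. The sketch below is one way to do that, under a few assumptions: it is run with sufficient privileges (so `sudo` is omitted), and the libvirt network and guest names are passed in by the caller, since `virsh net-dumpxml` and `virsh dumpxml` each need a name. The function names are illustrative.

```python
import subprocess

# Host-level commands taken from the list above.
HOST_COMMANDS = [
    ["ifconfig", "-a"],
    ["brctl", "show"],
    ["netstat", "-nr"],
]


def virsh_commands(network_name, domain_name):
    """Build the two virsh invocations for a given network and guest."""
    return [
        ["virsh", "net-dumpxml", network_name],
        ["virsh", "dumpxml", domain_name],
    ]


def collect(commands, run=subprocess.run):
    """Run each command and return one text report with all outputs."""
    sections = []
    for cmd in commands:
        try:
            proc = run(cmd, capture_output=True, text=True)
            # Show stderr instead when the command reports a failure.
            body = proc.stdout if proc.returncode == 0 else proc.stderr
        except FileNotFoundError:
            body = "(command not found)\n"
        sections.append("$ " + " ".join(cmd) + "\n" + body)
    return "\n".join(sections)
```

A typical call would be `collect(HOST_COMMANDS + virsh_commands("default", "VM1"))`, with the two names replaced by whatever `virsh net-list --all` and `virsh list --all` show on the host.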

Revision history for this message
Stephane Neveu (stefneveu) wrote :

Serge,

I'm using bond0 (active-backup with eth0/eth4) then tagging vlans with bond0.XXXX and linking my bond0.XXXX in a bridge ... then I do the same with bond1 (eth1/eth5) etc until bond3.

Then here is a dumpxml example :

<domain type='kvm' id='5'>
  <name>myguest1</name>
  <uuid>cc31a6e0-267c-4470-bcd7-8a92755a85cd</uuid>
  <memory>2097152</memory>
  <currentMemory>2097152</currentMemory>
  <vcpu>2</vcpu>
  <os>
    <type arch='x86_64' machine='pc-0.14'>hvm</type>
    <boot dev='hd'/>
    <bootmenu enable='no'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <pae/>
  </features>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/bin/kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/vm/disques//myguest1.qcow2'/>
      <target dev='hda' bus='ide'/>
      <alias name='ide0-0-0'/>
      <address type='drive' controller='0' bus='0' unit='0'/>
    </disk>
    <controller type='ide' index='0'>
      <alias name='ide0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:a1:3d:dc'/>
      <source bridge='bridge1'/>
      <target dev='vnet16'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <interface type='bridge'>
      <mac address='52:54:00:3b:81:78'/>
      <source bridge='bridge2'/>
      <target dev='vnet17'/>
      <model type='virtio'/>
      <alias name='net1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </interface>
    <interface type='bridge'>
      <mac address='52:54:00:3d:96:57'/>
      <source bridge='bridge3'/>
      <target dev='vnet18'/>
      <model type='virtio'/>
      <alias name='net2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </interface>
    <interface type='bridge'>
      <mac address='52:54:00:10:2e:f1'/>
      <source bridge='bridge4'/>
      <target dev='vnet19'/>
      <model type='virtio'/>
      <alias name='net3'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/6'/>
      <target port='0'/>
      <alias name='serial0'/>
    </serial>
    <console type='pty' tty='/dev/pts/6'>
      <source path='/dev/pts/6'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <input type='mouse' bus='ps2'/>
    <graphics type='vnc' port='5904' autoport='yes'/>
    <video>
      <model type='cirrus' vram='9216' heads='1'/>
      <alias name='video0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </memballoon>
  </devices>
  <seclabel type='dynamic' model='apparmor' relabel='yes'>
    <label>libvirt-cc31a6e0-267c-4470-bcd7-8a9...

Revision history for this message
Alex Dioso (adioso) wrote :

This bug affects me as well. Is it related to the one discussed at https://bugzilla.kernel.org/show_bug.cgi?id=42829 ?

I will try using the vhost_net driver in the host and vhost=on as a guest parameter to see if it bypasses the issue.

Revision history for this message
Stephane Neveu (stefneveu) wrote :

Alex,

Thanks for the link, I'll also try to test with :

<driver name='vhost' txmode='iothread' ioeventfd='on' event_idx='off'/>

Does it work better for you with vhost_net ?

Revision history for this message
Stephane Neveu (stefneveu) wrote :

OK, just in case it helps...

It does work after modprobing vhost_net and adding:
<driver name='vhost' txmode='iothread' ioeventfd='on' event_idx='off'/>
to every NIC definition.
Tested on more than 100 VMs.
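
For anyone who wants to apply that driver line to many guests, here is a minimal Python sketch that adds it to every virtio interface in a domain XML dump. The function name `add_vhost_driver` is made up for this example, and the attribute values are simply copied from the line above; note that later comments in this thread conclude that loading vhost_net alone was sufficient.

```python
import xml.etree.ElementTree as ET

# Attribute values copied from the <driver> line quoted above.
VHOST_DRIVER = {
    "name": "vhost",
    "txmode": "iothread",
    "ioeventfd": "on",
    "event_idx": "off",
}


def add_vhost_driver(domain_xml):
    """Return the domain XML with a vhost <driver> on each virtio NIC.

    Interfaces whose <model type=...> is not 'virtio' are left untouched.
    An existing <driver> element is updated rather than duplicated.
    """
    root = ET.fromstring(domain_xml)
    for iface in root.findall("./devices/interface"):
        model = iface.find("model")
        if model is None or model.get("type") != "virtio":
            continue
        driver = iface.find("driver")
        if driver is None:
            driver = ET.SubElement(iface, "driver")
        driver.attrib.update(VHOST_DRIVER)
    return ET.tostring(root, encoding="unicode")
```

The input would be the output of 'virsh dumpxml' for a guest, and the result would have to be loaded back into libvirt (for example with 'virsh define') and the guest restarted before it takes effect.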

Revision history for this message
Christian Parpart (trapni) wrote :

Stephane, yes, as a workaround (not to say the better way), modprobing the "vhost_net" driver before actually starting the VMs works perfectly. No incidents for 4 days now (tested on 30+ VMs).

But what is your driver tag for? I did not need to do that; my libvirt-bin added the vhost=on parameter to qemu-kvm automatically, so what do I need this line for?

Regards,
Christian.

Revision history for this message
Stephane Neveu (stefneveu) wrote :

Christian,

I'm not really sure whether just enabling vhost_net is enough... you are probably right: it has worked well for you for 4 days...
As I still do not understand where the bug is located, I preferred to apply everything to fix it quickly.

Reading the bugzilla entry at https://bugzilla.kernel.org/show_bug.cgi?id=42829, they were also talking about event_idx='off'.
It seems to be patched now (though I'm not sure):
http://git.kernel.org/?p=linux/kernel/git/davem/net-next.git;a=commitdiff;h=4b727361f0bc7ee7378298941066d8aa15023ffb;hp=e1ac50f64691de9a095ac5d73cb8ac73d3d17dba

Regards,

Revision history for this message
Stephane Neveu (stefneveu) wrote :

Christian,

You are right, no need to add :

<driver name='vhost' txmode='iothread' ioeventfd='on' event_idx='off'/>

in the xml ...

modprobe vhost_net should be enough.
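
Since the workaround depends on vhost_net being loaded before the guests start, a small check can help. The following Python sketch reads the kernel's module list from /proc/modules; the function names are made up for this example.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text.

    Each line of /proc/modules starts with the module name, followed by
    its size, reference count, dependants, state and load address.
    """
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])
    return names


def vhost_net_loaded(path="/proc/modules"):
    """Return True if the vhost_net module is currently loaded."""
    with open(path) as handle:
        return "vhost_net" in loaded_modules(handle.read())
```

If `vhost_net_loaded()` returns False, running 'modprobe vhost_net' before starting the VMs is the step described above.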

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Thanks Stephane, per comment #38 I'm marking this bug as affecting the kernel.

Changed in linux (Ubuntu):
status: New → Confirmed
importance: Undecided → High
tags: added: kernel-kvm
Changed in ifenslave (Ubuntu):
status: New → Confirmed
Thierry Carrez (ttx)
Changed in nova:
status: New → Invalid
no longer affects: ifenslave (Ubuntu)
no longer affects: libvirt (Ubuntu)
no longer affects: bridge-utils (Ubuntu)
no longer affects: linux (Ubuntu)
Changed in qemu-kvm (Ubuntu):
status: Confirmed → Fix Released
Changed in qemu-kvm (Ubuntu Precise):
status: New → In Progress
importance: Undecided → High
description: updated
58 comments hidden view all 138 comments
Revision history for this message
Matt Hilt (mjhilt-x) wrote :

Running "sudo reboot" from the VM doesn't change the PID, but using the reboot command on the openstack dashboard does.
It seems like some of our VMs used the former and some the latter, with a correlation between the soft reboot and the instance dying. So we'll hard reboot the VMs, and profusely apologize for causing alarm.

Revision history for this message
Soren Hansen (soren) wrote :

Matt, no problem at all. Please be sure to report back if you encounter the issue again after the hard reboot. Thanks!

Revision history for this message
Joe T (joe-topjian-v) wrote :

Hello,

Prior to applying the qemu package in Soren's PPA, we were able to reproduce this problem within 45 minutes (on average). We're now up to 22 hours (and climbing) without an issue.

If anyone is curious, here is the test setup that we have been using with OpenStack:

---

nova boot --image cfefd40f-be71-4c93-b480-c9964689f5ce --key_name sandbox --flavor 2 dhcp-1

dhcp-1> sudo su
dhcp-1> apt-get install iperf

nova boot --image cfefd40f-be71-4c93-b480-c9964689f5ce --key_name sandbox --flavor 2 dhcp-2

dhcp-2> sudo su
dhcp-2> apt-get install iperf
dhcp-2> iperf -s

dhcp-1> iperf -c dhcp-2 -t 86400 -i 10

---
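
To put a number on "time until the problem reproduces" during a run like the one above, a small watchdog can poll the guest and report when it first stops answering. This Python sketch is an illustration only: `tcp_probe` and `seconds_until_failure` are made-up names, and probing the iperf server port (5001 by default for iperf, treated here as an assumption) is just one possible health check.

```python
import socket
import time


def tcp_probe(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def seconds_until_failure(probe, interval=10.0, max_checks=None,
                          clock=time.monotonic, sleep=time.sleep):
    """Call probe() every `interval` seconds until it returns False.

    Returns the elapsed seconds at the first failed probe, or None if
    `max_checks` probes all succeeded.
    """
    start = clock()
    checks = 0
    while max_checks is None or checks < max_checks:
        if not probe():
            return clock() - start
        checks += 1
        sleep(interval)
    return None
```

For example, `seconds_until_failure(lambda: tcp_probe("dhcp-2", 5001))` run from a third machine would block until dhcp-2 stops accepting connections and then return how long that took.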

Thanks,
Joe

Revision history for this message
Adam Conrad (adconrad) wrote : Please test proposed package

Hello Jonathan, or anyone else affected,

Accepted qemu-kvm into precise-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/qemu-kvm/1.0+noroms-0ubuntu14.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please change the bug tag from verification-needed to verification-done. If it does not, change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in qemu-kvm (Ubuntu Precise):
status: In Progress → Fix Committed
tags: added: verification-needed
Changed in qemu-kvm (Ubuntu Precise):
status: Fix Committed → In Progress
Revision history for this message
Adam Conrad (adconrad) wrote :

Ignore the above automated message; the precise fix that was in the queue was superseded by a security update of the same version.

Revision history for this message
Joe T (joe-topjian-v) wrote :

Hi Adam,

Does this mean that the qemu fix for this ticket is not in -proposed yet? Or that the security update contains the fix?

Thanks,
Joe

Revision history for this message
BenKochie (ben-nerp) wrote :

As far as I can tell the fix is still not released in -proposed

Revision history for this message
Adam Conrad (adconrad) wrote :

The security update doesn't contain the fix, the original proposed update needs to be rebased against the security update. I may do that in a bit if Soren doesn't get there first.

Revision history for this message
Adam Conrad (adconrad) wrote :

Hello Jonathan, or anyone else affected,

Accepted qemu-kvm into precise-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/qemu-kvm/1.0+noroms-0ubuntu14.3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please change the bug tag from verification-needed to verification-done. If it does not, change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in qemu-kvm (Ubuntu Precise):
status: In Progress → Fix Committed
Revision history for this message
Joe T (joe-topjian-v) wrote :

I have installed this package on two of my OpenStack compute nodes. I'll have an update in a day or so on whether this package still fixes the issue, as the package from Soren's PPA did.

Revision history for this message
Jonathan Tullett (j+launchpad-net) wrote :

Installed yesterday and rebooted the dom0 machine and thus all virtual machines. Will report back if there are any problems.

Revision history for this message
Joe T (joe-topjian-v) wrote :

We tested the -proposed packages yesterday and are confident that they resolve the issue. We used the test scenario described in comment #101.

Servers that have not had the updated package applied failed the test within an hour. Servers with the updated package did not fail the test.

Robert Dupont (rdupontd)
Changed in qemu-kvm (Ubuntu Precise):
status: Fix Committed → Fix Released
Changed in qemu-kvm (Ubuntu):
status: Fix Released → Fix Committed
status: Fix Committed → Fix Released
Changed in qemu-kvm (Ubuntu Precise):
status: Fix Released → Fix Committed
tags: added: verification-done
removed: verification-needed
Revision history for this message
Adam Conrad (adconrad) wrote : Update Released

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu-kvm - 1.0+noroms-0ubuntu14.3

---------------
qemu-kvm (1.0+noroms-0ubuntu14.3) precise-proposed; urgency=low

  * Fix race condition in virtio code on multicore systems. (LP: #997978)
    - 9001-virtio-add-missing-mb-on-notification.patch
    - 9002-virtio-add-missing-mb-on-enable-notification.patch
    - 9003-virtio-order-index-descriptor-reads.patch
 -- Soren Hansen <email address hidden> Mon, 03 Sep 2012 10:15:54 +0200

Changed in qemu-kvm (Ubuntu Precise):
status: Fix Committed → Fix Released
Revision history for this message
David Geng (genggjh) wrote :

I have the same issue, but my host OS is RHEL 6.3 (2.6.32-220.el6.x86_64), the qemu-kvm version is 0.12.1.2, and my guest base image is Ubuntu 12.04 LTS.
My problem is:
After I set libvirt_use_virtio_for_bridges = true in nova.conf, new instances cannot get an IP address and the gateway cannot be added to the routing table.

The routing table looks like this:

-- before enabling virtio
~$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         172.17.33.8     0.0.0.0         UG    100    0        0 eth0
172.17.32.0     0.0.0.0         255.255.252.0   U     0      0        0 eth0

-- after enabling virtio
~$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         172.17.32.1     0.0.0.0         UG    100    0        0 eth0
172.17.32.0     0.0.0.0         255.255.252.0   U     0      0        0 eth0

Here is the dnsmasq process on my host server:
root 32034 32033 0 Oct12 ? 00:00:00 /usr/sbin/dnsmasq --strict-order --bind-interfaces --conf-file= --domain=novalocal --pid-file=/var/lib/nova/networks/nova-br100.pid --listen-address=172.17.33.8 --except-interface=lo --dhcp-range=172.17.33.3,static,120s --dhcp-lease-max=256 --dhcp-hostsfile=/var/lib/nova/networks/nova-br100.conf --dhcp-script=/usr/bin/nova-dhcpbridge --leasefile-ro

Soren,
Your solution is only for Ubuntu hosts, and the PPA package has to be installed on the host machine, right?
Is there any solution or workaround for RHEL?

Revision history for this message
Jonathan Tullett (j+launchpad-net) wrote :

This bug is considered fixed for me. Not a single network glitch since installing the package from PPA. Many thanks to the development team!

Revision history for this message
kraig (kamador) wrote : Re: [Bug 997978] Re: KVM images lose connectivity with bridged network

Same here, thank you everyone!

--
Kraig Amador

On Friday, November 2, 2012 at 5:42 AM, Jonathan Tullett wrote:

> This bug is considered fixed for me. Not a single network glitch since
> installing the package from PPA. Many thanks to the development team!

Revision history for this message
BenKochie (ben-nerp) wrote :

So I'm sorry to report that after about 50 days of uptime on qemu-kvm (1.0+noroms-0ubuntu14.3) I had 3 VMs out of ~60 in my cluster drop off the network. It happened on different host machines, so it's not a single machine in the cluster that's a problem.

Two of the nodes were restarted (full qemu shutdown/relaunch) so I didn't have a chance to debug them.

I was able to get a console on one of them and work on it before I gave up and restarted it. The interesting thing I discovered was that the workarounds I had used in the past did not work. Previously I was able to ifdown/ifup the virtual interface to restore networking. This time, neither that nor migrating between nodes restored networking.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

But otherwise it sounds like it looked the same - VM still up but its network down?

Had these VMs been running for 50 days, or was it that you hadn't seen the problem in 50 days? If the latter, could these have been on a newer kernel?

Revision history for this message
Scobo (mk-binary-artworks) wrote :

Will there be a fix for 10.04 lucid, too?
I think this bug affects me too: Windows SBS 2008 with virtio network driver on ubuntu 10.04 lucid host loses network connection from time to time. It's not really deterministic when this error occurs. Usually it happens when there's a large amount of traffic sent over the network. After disabling (via VNC console) the network card in Windows and re-enabling it, everything works fine again, no need to reboot the VM.
Any help is appreciated, since it's a quite annoying bug... :-)

Revision history for this message
BenKochie (ben-nerp) wrote :

I'm testing this some more, and it looks like it's still easy to reproduce on my system.

cc2ab6833adc73311a2407be2eb5f915 /usr/bin/qemu-system-x86_64

I can cause the guest to drop VM networking with an rsync+ssh from a nearby host.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@BenKochie,

could you please file a new bug with details?

Revision history for this message
gadLinux (gad-aguilardelgado) wrote :

I've found this problem in my saucy installation. I just opened a bug because I'm not sure if it's related.

https://bugs.launchpad.net/ubuntu/+source/core-network/+bug/1255516

My problem is actually worse, since the host machine cannot reach other machines on the network: not only virtual machines but also hardware machines, routers, NAS, and so on.

The only way to recover from this situation is to ifdown ifup the bridge. Then it recovers until it happens again.

When I remove the bridge no problems. But I was not able to test it since I need the bridge up.

I'm investigating macvtap instead of bridge...

Revision history for this message
gadLinux (gad-aguilardelgado) wrote :

I have to clarify that I'm not sure that this has anything to do with KVM, QEMU and the like.

I think it is more a problem of the Linux bridge driver, since it fails even if you don't have any VM running. Yes, it takes longer to fail, but this may be because there is not enough traffic to make it fail.

Revision history for this message
Krzysztof Janowicz (janowicz) wrote :

I think I may have the same problem running a 13.10 qemu-kvm host with 8 virtual machines. As far as I can see only two VMs seem to have the problem. All of them use em1: macvtap as source device and a bridge as source mode. After reading many posts here, I changed from virtio to rtl8139 as a device model to see if I still lose network connection.

Revision history for this message
Dewey McDonnell (dewey-w) wrote :

My web search reveals this problem has been around for years and most posts conclude that the problem continues. After solving my tap0 TX packets dropped problem, I find that my VM network has random freezes a few times each day. Many bloggers on this subject say this VM network freeze is difficult to reproduce. Not for me! I can cause a network freeze on my VM in a heartbeat. Like many others, all I need to do is start a data transfer over the bridge (FTP 2Kb file or larger). When the freeze happens, ifconfig tap0 says overruns:1. This is QEMU version 1.6.2 without KVM on CentOS 6.5 version 2.6.32-431.

Revision history for this message
Thomas Vachon (vachon) wrote :

After moving to 3.5 kernel I haven't seen it, even at 3x traffic which used
to cause it
On Mar 25, 2014 9:06 PM, "Dewey McDonnell" <email address hidden> wrote:

> My web search reveals this problem has been around for years and most
> posts conclude that the problem continues. After solving my tap0 TX
> packets dropped problem, I find that my VM network has random freezes a
> few times each day. Many bloggers on this subject say this VM network
> freeze is difficult to reproduce. Not for me! I can cause a network
> freeze on my VM in a heartbeat. Like many others, all I need to do is
> start a data transfer over the bridge (FTP 2Kb file or larger). When the
> freeze happens, ifconfig tap0 says overruns:1. This is QEMU version
> 1.6.2 without KVM on CentOS 6.5 version 2.6.32-431.

Revision history for this message
Dewey McDonnell (dewey-w) wrote :

Thanks Thomas for your speedy reply. Now my CentOS uname -a is version 3.5.0. My tap0 lockup problem is exactly the same as with the older kernel. Anything else you would suggest?

Revision history for this message
Russell McOrmond (russell-flora) wrote :

We're running qemu-kvm 1.0+noroms-0ubuntu14.13 on 12.04.4 LTS with KVM-based 12.04.4 LTS virtual machines, and have observed this problem. We have been using the regular software bridges on many machines, and have only noticed the problem on one of our newest servers.

Using virtio devices we get the discussed lockups.

Using e1000 we get no lockups, but this is a much lower performing interface and thus we have performance issues. We have NFS and other traffic between VMs and the host, so need more than the GigE that we have to external hosts.

I note above that this bug was considered fixed by qemu-kvm - 1.0+noroms-0ubuntu14.3, but this appears to not be the case.

Revision history for this message
Fawad Khaliq (fawadkhaliq) wrote :

Very easily reproducible on my side.

Revision history for this message
Izhar ul Hassan (ezhaar) wrote :

Yes, I think it is safe to say that the bug is still around. The VM loses network connectivity under "enough" load. I, for example, can reproduce this by running a Spark job which transfers a few gigabytes of data between worker VMs. Within a minute, one of the VMs loses network connectivity. If I try to reboot the VM, it goes into error state. Trying to delete it makes the qemu-kvm process defunct.

uname -r
3.8.0-29-generic

virsh --version
1.1.1

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@ezhaar - please open a new bug so we can collect new information. If you are on trusty then please file against qemu, otherwise file against qemu-kvm. Then mark it as also affecting libvirt and linux (the kernel). Then reproduce the bug, and immediately after the crashes do 'apport-collect <bug-number>', which should collect the data for each of those packages. Please show the host network configuration, the libvirt network config if applicable, the xml dumps for the vms, and where to get spark.

Revision history for this message
Øyvind Jelstad (oyvind-2) wrote :

Looks like I have this problem on 12.04.1 LTS with kernel 3.2.0-67-generic #101-Ubuntu SMP Tue Jul 15 17:46:11 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux on the host and Debian Wheezy on the guest (SMP Debian 3.2.60-1+deb7u3 x86_64 GNU/Linux)

The guest will occasionally lose its connection to other hosts on the LAN, and their entries in the ARP table on the guest are gone.
The problem seems to be that ARP replies are only sporadically forwarded by the host back to the guest.

Dumping ARP requests (triggered by a ping from the guest) on the bridge's external (real) interface on the host catches both requests and replies, while the same dump on vnet0 misses replies for minutes until a reply suddenly comes through and re-establishes the connection.

I am using virtio interface. It makes no difference if I change to e1000. FW policy is ACCEPT on all tables.

Revision history for this message
Martin Pajak (mpajak-r) wrote :

I faced probably the same problem installing Xen-4.3.2 with Gentoo kernel-3.12.21.

DomU's interface hangs after a short time under heavy network load (starting at ~10 MByte/s). From outside it looks as if the instance had crashed, but deactivating and reactivating the interface, e.g. from the "xl console <domU name>" with /etc/init.d/net.eth0 stop/start, restores normal operation.

After 3 days of testing/searching I found a workaround. Setting the following options with ethtool, I could successfully prevent my domU's interfaces from hanging:

ethtool --offload <network device> gso off tso off sg off gro off
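
To apply the same setting to several interfaces, the command above can be built programmatically. This is a minimal Python sketch; the helper names `offload_off_command` and `disable_offloads` are made up for this example, and the default feature list is simply the one from the command above.

```python
import subprocess


def offload_off_command(device, features=("gso", "tso", "sg", "gro")):
    """Build the argv for 'ethtool --offload <device> <feature> off ...'."""
    argv = ["ethtool", "--offload", device]
    for feature in features:
        argv.extend([feature, "off"])
    return argv


def disable_offloads(device, features=("gso", "tso", "sg", "gro")):
    """Run the ethtool command; raises CalledProcessError on failure."""
    subprocess.check_call(offload_off_command(device, features))
```

For instance, `disable_offloads("eth0")` run as root inside the guest would execute the same command as shown above for eth0. The setting is not persistent, so it would have to be reapplied after a reboot.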

This page led me to my solution: http://cloudnull.io/2012/07/xenserver-network-tuning/

I also posted my other experience with this bridged network configuration in the Gentoo wiki https://wiki.gentoo.org/wiki/Xen .

Revision history for this message
Øyvind Jelstad (oyvind-2) wrote :

My problem with missing ARP replies was solved (worked around) by setting
bridge_ageing 0
for the bridge in /etc/network/interfaces, making it a hub forwarding all packets to all hosts.

Revision history for this message
Gibbo (gibbo87) wrote :

I report the same situation as comment 129.
Spark Cluster installed with Ambari 1.7, installed HDP 2.2 running Zookeeper, Ganglia, HDFS and YARN.
Ubuntu 12.04

uname -r
3.2.0-67-virtual

Instances run fine until the start of a Spark or Hadoop job with YARN. The job gets accepted, but then 2 slaves of the cluster are affected by the bug and lose connectivity. They are still accessible through the OpenStack web console but can't reach the network. Rebooting brings the VM into a halted state.
Solved with ethtool -K eth0 tx off sg off tso off ufo off gso off gro off lro off
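
To confirm that the offloads really are off afterwards, the output of 'ethtool -k eth0' (lower-case k) can be checked. The Python sketch below parses that output; the function names are made up for this example, and the long feature names used with it (such as tcp-segmentation-offload for tso) reflect my understanding of ethtool's output and should be compared against what your version prints.

```python
def parse_offload_state(ethtool_k_output):
    """Map each feature name to True (on) or False (off).

    Expects lines of the form 'feature-name: on' or
    'feature-name: off [fixed]'; other lines are ignored.
    """
    state = {}
    for line in ethtool_k_output.splitlines():
        name, sep, value = line.strip().partition(": ")
        if not sep:
            continue  # header line such as "Features for eth0:"
        state[name] = value.split()[0] == "on"
    return state


def still_enabled(state, wanted_off):
    """Return the sorted features from wanted_off that are still on."""
    return sorted(name for name in wanted_off if state.get(name))
```

An empty list from `still_enabled` means every listed feature is reported as off.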

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Please file a new bug against the linux package, preferably (if possible) using the command 'ubuntu-bug linux'

Revision history for this message
Ramiro Varandas Jr (ramirovjnr) wrote :

If this problem is related to this one - https://bugs.launchpad.net/ubuntu/+source/qemu-kvm/+bug/1325560?comments=all - then upgrading the kernel should fix it.

In my case I'm using Ubuntu 14.04 with kernel 3.13 and after upgrading to kernel 3.16, no more connectivity problems.

Revision history for this message
Ramiro Varandas Jr (ramirovjnr) wrote :

- Updating -

The problem came back today - the VM was running OK for more than 72 hours - and then the connectivity with the gateway was lost again.

Following Martin's post #132, I noticed that some packets were being dropped as incorrect. I applied the ethtool fix, also changed the driver to e1000, and am going to monitor it.

After applying the ethtool change, I got no more dropped packets with incorrect checksums.

Revision history for this message
Shades (initialhit) wrote :

Still seeing this on 14.04.4 LTS under enough load, or when moving from paused state.
