[Ubuntu1610][Libvirt] Postcopy migration with --postcopy-after-precopy option is not working as expected

Bug #1620906 reported by bugproxy
This bug affects 1 person
Affects             Status         Importance   Assigned to        Milestone
libvirt (Ubuntu)    Fix Released   High         Taco Screen team
qemu (Ubuntu)       Invalid        Undecided    Unassigned

Bug Description

Problem Description
=========================
# time virsh migrate avocado-vt-vm1-Bala qemu+ssh://9.40.192.182/system --verbose --postcopy-after-precopy
Migration: [100 %]error: operation failed: job: unexpectedly failed

real 0m30.355s
user 0m0.100s
sys 0m0.004s
root@ltc-hab1:~# time virsh migrate avocado-vt-vm1-Bala qemu+ssh://9.40.192.182/system --verbose
Migration: [100 %]error: operation failed: job: unexpectedly failed

real 0m30.291s
user 0m0.088s
sys 0m0.012s

---Issue---
1. Migration with --postcopy-after-precopy should not start without the --live option. (If --postcopy is used without --live, the error "error: argument unsupported: post-copy migration is not supported with non-live or paused migration" is thrown.)

2. Also, from libvirt we could not verify whether the migration actually runs with postcopy enabled: the run with --postcopy-after-precopy took longer than the precopy run without it, whereas postcopy migration should ideally be faster.

3. In qemu we can check whether postcopy is enabled using "info migrate". Can this be added as a feature in libvirt, so that it reports whether a migration runs with postcopy enabled?
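
Until libvirt reports this directly, point 3 can be approximated by passing the QEMU monitor command through virsh. The following is only a sketch: it assumes the qemu driver, that HMP passthrough via `virsh qemu-monitor-command --hmp` is permitted, and that the monitor output contains a `Migration status:` line and a `postcopy-ram: on|off` capability entry (the exact layout may differ between QEMU versions). The function names are made up for illustration.

```shell
# Print the migration status QEMU reports for a domain, e.g. "active" or
# "postcopy-active". Prints nothing if no "Migration status:" line is present.
migration_status() {
    # $1 = domain name
    virsh qemu-monitor-command "$1" --hmp 'info migrate' |
        sed -n 's/^Migration status: *\([a-z-]*\).*/\1/p'
}

# Succeed (exit 0) if the postcopy-ram capability is reported as "on".
# The pattern is not anchored because some QEMU versions print all
# capabilities on one line and others print one capability per line.
postcopy_enabled() {
    # $1 = domain name
    virsh qemu-monitor-command "$1" --hmp 'info migrate_capabilities' |
        grep -q 'postcopy-ram: on'
}
```

Run on the source host while the migration is in progress, for example `migration_status avocado-vt-vm1-Bala` from a second shell.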

---uname output---
# uname -a
Linux ltc-hab1 4.4.0-28-generic #47-Ubuntu SMP Fri Jun 24 10:09:20 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

Machine Type = Habanero

Steps to Reproduce
=================================
1. Created guest with shared storage in NFS
2. Enabled ports 49152:49216 in iptables, virt_use_nfs -> on
3. Mounted the image location in destination and started migration.
4. # time virsh migrate avocado-vt-vm1-Bala qemu+ssh://9.40.192.182/system --verbose --postcopy-after-precopy
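
Step 2 does not show the exact commands used. One plausible form is sketched below; the TCP protocol, the INPUT chain, and the use of `setsebool` are assumptions, not taken from the report. With `DRY_RUN=1` the commands are only printed, since running them needs root.

```shell
# Run a command, or just print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

prepare_migration_host() {
    # Open the port range named in step 2 (TCP and INPUT chain are assumed).
    run iptables -I INPUT -p tcp --dport 49152:49216 -j ACCEPT
    # SELinux boolean named in step 2; only relevant where SELinux is in use.
    run setsebool -P virt_use_nfs on
}
```

`DRY_RUN=1 prepare_migration_host` prints both commands for review before they are applied on the source and destination hosts.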

Userspace tool common name: virsh (libvirt)
The userspace tool has the following bit modes: ppc64le
Userspace rpm: # dpkg --get-selections | grep -i libvirt
libvirt-bin install
libvirt-clients install
libvirt-daemon install
libvirt-daemon-system install
libvirt-dev:ppc64el install
libvirt0:ppc64el install
python-libvirt

============
Logs
============

Sosreport - source and destination

== Comment: #8 - Madhu Pavan Kothapally <email address hidden> - 2016-08-26 05:32:11 ==
Patch sent upstream for --live flag check.
== Comment: #9 - Madhu Pavan Kothapally <email address hidden> - 2016-09-02 13:53:03 ==
Two patches were accepted upstream:
commit id1: 67af358d119bed1f50a81f0c826ccaf704a2a085
commit id2: 04597a7038482688fb2cc8b0b1869392f6f78016

Hi Canonical,
Please include the above commits to fix the migration issue with --postcopy-after-precopy.

Thank you.

Revision history for this message
bugproxy (bugproxy) wrote : SOSreport - source

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-143637 severity-high targetmilestone-inin---
Revision history for this message
bugproxy (bugproxy) wrote : SOSreport - destination

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → libvirt (Ubuntu)
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

As far as I understand the libvirt docs, --postcopy-after-precopy isn't supported at all without --postcopy.
I haven't thought about --live yet.

From the man page:
"--postcopy enables post-copy logic in migration, but does not actually start post-copy, i.e., migration is started in pre-copy mode. Once migration is running, the user may switch to post-copy using the migrate-postcopy command sent from another virsh instance or use --postcopy-after-precopy along with --postcopy to let libvirt automatically switch to post-copy after the first pass of pre-copy is finished."

That would mean that "--postcopy-after-precopy" alone asks to switch to postcopy after the first pass without postcopy logic being enabled, which is doomed to fail.
And it seems that specifying --postcopy-after-precopy without --postcopy simply implies --postcopy, which is fine.
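
The manual variant described in the man page excerpt (start in pre-copy with --postcopy, then switch with migrate-postcopy from another virsh instance) could be scripted roughly as follows. This is a sketch under assumptions: the function name and the fixed delay are illustrative, and a real script would more likely watch migration progress than sleep.

```shell
# Start a live migration with post-copy enabled, then ask libvirt to switch
# to post-copy after a delay, if the migration is still running by then.
migrate_then_switch() {
    dom=$1 uri=$2 delay=${3:-5}
    virsh migrate --live --postcopy "$dom" "$uri" &
    mig_pid=$!
    sleep "$delay"
    # Switching only makes sense while the migration job is still active.
    if kill -0 "$mig_pid" 2>/dev/null; then
        virsh migrate-postcopy "$dom"
    fi
    # Propagate the exit status of the migration itself.
    wait "$mig_pid"
}
```

Example: `migrate_then_switch kvmguest-yakkety qemu+ssh://10.0.4.243/system 10`.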

I quickly tried dropping --live from my command line, and I must say most of this works just fine for me:

These are the working postcopy migrations:
1. this is my preferred commandline and it works just fine
# virsh migrate --live --postcopy --postcopy-after-precopy kvmguest-yakkety qemu+ssh://10.0.4.243/system

2. works, but I had no time to switch on the fly with migrate-postcopy from another shell
# virsh migrate --live --postcopy kvmguest-yakkety qemu+ssh://10.0.4.243/system

3. Seems to imply postcopy, as it works just fine
# virsh migrate --live --postcopy-after-precopy kvmguest-yakkety qemu+ssh://10.0.4.243/system

4. I wonder what this does exactly, but it works :-)
# virsh migrate --postcopy-after-precopy kvmguest-yakkety qemu+ssh://10.0.4.243/system

5. The only one that fails, but it fails gracefully, so I wouldn't consider it a bug:
# virsh migrate --postcopy kvmguest-yakkety qemu+ssh://10.0.4.243/system
error: argument unsupported: post-copy migration is not supported with non-live or paused migration

Are you on the latest qemu/libvirt of yakkety?
Are the issues or patches you reported known to be architecture specific?

Changed in libvirt (Ubuntu):
status: New → Incomplete
importance: Undecided → High
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-09-13 01:37 EDT-------
These patches are not arch specific.
BTW # virsh migrate --postcopy-after-precopy kvmguest-yakkety qemu+ssh://10.0.4.243/system
shouldn't work.
With patch 04597a7038482688fb2cc8b0b1869392f6f78016, --postcopy flag is mandated with --postcopy-after-precopy
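
The flag rules discussed in this thread (post-copy needs --live, and, with the patch above, --postcopy-after-precopy needs --postcopy) can be summarised as a small pre-flight check. This is a wrapper-side sketch, not the libvirt implementation; the function name and the error wording are made up for illustration.

```shell
# Return 0 if the given virsh migrate arguments form a valid post-copy
# combination according to the rules discussed in this bug, 1 otherwise.
check_postcopy_flags() {
    live=0 postcopy=0 after=0
    for arg in "$@"; do
        case $arg in
            --live) live=1 ;;
            --postcopy) postcopy=1 ;;
            --postcopy-after-precopy) after=1 ;;
        esac
    done
    if [ "$after" -eq 1 ] && [ "$postcopy" -eq 0 ]; then
        echo "error: --postcopy-after-precopy requires --postcopy" >&2
        return 1
    fi
    if [ "$postcopy" -eq 1 ] && [ "$live" -eq 0 ]; then
        echo "error: post-copy migration requires --live" >&2
        return 1
    fi
    return 0
}
```

A wrapper script could call `check_postcopy_flags "$@" && virsh migrate "$@"` to reject the invalid combinations before contacting libvirt.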

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi,
I can only share what I found experimentally, which is that it worked.
Thanks for confirming that the patches are not arch-specific.

I'll evaluate whether we can and should add the patches as delta.
Again, thank you for reporting the issues with the patches already identified. That makes things much easier, as long as it isn't the weirdest backport ever.

All would be good if the Y Final Freeze weren't so close ...

Changed in libvirt (Ubuntu):
status: Incomplete → Triaged
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi,
backports done - a testable version was made available at this ppa:
https://launchpad.net/~paelzer/+archive/ubuntu/qemu-machine-type-dev

Can you please test whether that satisfies your request?

If yes, I can go forward and try to push it together with some other changes we have incoming.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

FYI - I had to delete the package and will rebuild it soon - it will be in the same ppa.
It will be libvirt - 2.1.0-1ubuntu6~ppa1 (or newer) then.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Build complete, soon published again.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

ping - still waiting for an ack on the change.

In the meantime another version 6 was released, which forced me to build a new one for you (with the same content as before).

Please test libvirt_2.1.0-1ubuntu7~ppa1 from https://launchpad.net/~paelzer/+archive/ubuntu/qemu-machine-type-dev and let me know if that satisfies your need.

Revision history for this message
bugproxy (bugproxy) wrote : SOSreport - destination

Default Comment by Bridge

Jon Grimm (jgrimm)
Changed in libvirt (Ubuntu):
status: Triaged → In Progress
Revision history for this message
bugproxy (bugproxy) wrote : libvirt_debug_log

------- Comment (attachment only) From <email address hidden> 2016-09-27 06:33 EDT-------

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Well, the question is: what does that log mean?
The following
  grep "error:" libvirtd_bala.log
shows that the initially reported error no longer shows up - is that good?

Only a few stopping guests can be seen, a la "Error on monitor internal error: End of file from monitor". That may or may not be a totally different issue.
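
To see at a glance which errors a log like this contains, the grep above can be extended to count distinct messages. A minimal sketch, assuming only that each relevant line contains the text "error:" followed by the message; the function name is illustrative.

```shell
# Print "<count> <message>" for each distinct text following the last
# "error:" on a line, most frequent first.
summarize_errors() {
    # $1 = log file
    grep 'error:' "$1" | sed 's/.*error: *//' | sort | uniq -c | sort -rn
}
```

For example, `summarize_errors libvirtd_bala.log` would list the "End of file from monitor" messages with their count.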

Maybe bugproxy forgot to mirror your comment related to that file?

FYI - Since they are accepted upstream, very small, and only command-line argument checks, we are currently evaluating integrating the reported patches with the next libvirt upload, which is being prepared by smb atm.

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-09-27 09:31 EDT-------
Posting the comment as public:

---

I tested the scenarios on the latest Ubuntu1610 host with the 4.8.0-17 kernel, along with the patch given by Canonical
in comment #19

# uname -a
Linux powerkvm4-lp1 4.8.0-17-generic #19-Ubuntu SMP Sun Sep 25 06:35:40 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

# dpkg --list | grep libvirt
ii gir1.2-libvirt-glib-1.0:ppc64el 0.2.3-2 ppc64el GObject introspection files for the libvirt-glib library
ii gir1.2-libvirt-sandbox-1.0 0.5.1+git20151113-3 ppc64el GObject introspection files for the libvirt-sandbox library
ii libvirt-bin 2.1.0-1ubuntu7~ppa1 ppc64el programs for the libvirt library
ii libvirt-clients 2.1.0-1ubuntu7~ppa1 ppc64el Programs for the libvirt library
ii libvirt-daemon 2.1.0-1ubuntu7~ppa1 ppc64el Virtualization daemon
ii libvirt-daemon-system 2.1.0-1ubuntu7~ppa1 ppc64el Libvirt daemon configuration files
ii libvirt-dev:ppc64el 2.1.0-1ubuntu7~ppa1 ppc64el development files for the libvirt library
ii libvirt-doc 2.1.0-1ubuntu7~ppa1 all documentation for the libvirt library
ii libvirt-glib-1.0-0:ppc64el 0.2.3-2 ppc64el libvirt GLib and GObject mapping library
ii libvirt-glib-1.0-dev:ppc64el 0.2.3-2 ppc64el Development files for the libvirt-glib library
ii libvirt-ocaml 0.6.1.2-1build2 ppc64el OCaml bindings for libvirt
ii libvirt-ocaml-dev 0.6.1.2-1build2 ppc64el OCaml bindings for libvirt
ii libvirt-sandbox-1.0-5 0.5.1+git20151113-3 ppc64el Application sandbox toolkit shared library
ii libvirt-sandbox-1.0-dev 0.5.1+git20151113-3 ppc64el Development files for libvirt-sandbox library
ii libvirt-sanlock 2.1.0-1ubuntu7~ppa1 ppc64el Sanlock plugin for virtlockd
ii libvirt0:ppc64el 2.1.0-1ubuntu7~ppa1 ppc64el library for interfacing with different virtualization systems
ii munin-libvirt-plugins 0.0.6-1 all Munin plugins using libvirt
ii python-libvirt

Observation:

1. Issue is observed when migration is done twice

a. I migrated guest(Ubuntu1610) 1st time successfully,
# time virsh migrate avocado-vt-vm1 qemu+ssh://9.47.68.198/system --verbose --postcopy-after-precopy --postcopy --live
Migration: [100 %]

real 0m57.423s
user 0m0.056s
sys 0m0.012s

b. But when I tried a 2nd time:
- Cleaned the guest in destination
- Started the guest in the source
- Performed migration using the same migration command with the guest (no workloads in guest); it didn't end for 47 min,
so I killed it.

# time virsh migrate avocado-vt-vm1 qemu+ssh://9.47.70.201/system --verbose --postcopy-after-precopy --postc...


Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Thanks for reiterating and summarizing Breno!

I didn't cover re-migrating yet.
I was usually:
1. live migrating
2. offline migrating
3. post-copy migrating

I see you clean the guest in destination and start it in the source again.
So that shouldn't be much different from me doing an offline migration in between.

I extended my tests to do all that multiple times.
I now run the previously described sequence 5 times, then I add some workload to the guest and run it 5 times again. So far it just works fine; I couldn't reproduce your case yet.
If you want to take a look:
https://code.launchpad.net/~ubuntu-server/ubuntu/+source/qemu-migration-test/+git/qemu-migration-test
./qemu-libvirt-test.sh -r "yakkety" -s 1 -l "check-bug-1620906.status" 2>&1 | tee "check-bug-1620906.log"

I see your guest is from the avocado suite - do you think you could recreate the situation without?

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Simplified, this is the sequence that works for me:

@testkvm-yakkety-from -- virsh migrate --live --postcopy --postcopy-after-precopy kvmguest-yakkety qemu+ssh://10.223.248.8/system
@testkvm-yakkety-to -- virsh migrate --live --postcopy --postcopy-after-precopy kvmguest-yakkety qemu+ssh://10.223.248.189/system
@testkvm-yakkety-from -- virsh migrate --live kvmguest-yakkety qemu+ssh://10.223.248.8/system
@testkvm-yakkety-to -- virsh migrate --live kvmguest-yakkety qemu+ssh://10.223.248.189/system
@testkvm-yakkety-to -- virsh undefine /var/lib/uvtool/libvirt/kvmguest-yakkety.xml
@testkvm-yakkety-to -- virsh define /var/lib/uvtool/libvirt/kvmguest-yakkety.xml
@testkvm-yakkety-from -- virsh save kvmguest-yakkety /var/lib/uvtool/libvirt/kvmguest-yakkety.state
@testkvm-yakkety-to -- virsh restore /var/lib/uvtool/libvirt/kvmguest-yakkety.state
@testkvm-yakkety-from -- virsh restore /var/lib/uvtool/libvirt/kvmguest-yakkety.state
@testkvm-yakkety-from -- virsh migrate --live --postcopy --postcopy-after-precopy kvmguest-yakkety qemu+ssh://10.223.248.8/system
@testkvm-yakkety-to -- virsh migrate --live --postcopy --postcopy-after-precopy kvmguest-yakkety qemu+ssh://10.223.248.189/system
[...]

I'm disabling the non-postcopy migrations and will check whether it occurs then.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Postcopy-only migrations, run back to back, work as well.

The only difference is that I migrated there and back.
If I read your post correctly, you are killing/undefining on the target.
Could you check whether migrating both ways works for you as well?

If so, I'd assume that something in that "clean on target" step is causing the trouble.
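
A there-and-back test like the one suggested here could be driven from a third machine along these lines. This is a sketch, not the qemu-migration-test script linked earlier: the function name is made up, and it assumes both hosts accept `qemu+ssh` connections from the machine running it.

```shell
# Migrate a domain from host A to host B and back, a given number of
# rounds, stopping at the first failed migration.
round_trip() {
    dom=$1 host_a=$2 host_b=$3 rounds=${4:-5}
    i=1
    while [ "$i" -le "$rounds" ]; do
        virsh -c "qemu+ssh://$host_a/system" migrate --live --postcopy \
            --postcopy-after-precopy "$dom" "qemu+ssh://$host_b/system" || return 1
        virsh -c "qemu+ssh://$host_b/system" migrate --live --postcopy \
            --postcopy-after-precopy "$dom" "qemu+ssh://$host_a/system" || return 1
        i=$((i + 1))
    done
}
```

Example: `round_trip kvmguest-yakkety 10.223.248.189 10.223.248.8 5`. If this passes while the "clean on target, restart on source" sequence hangs, that points at the cleanup step rather than at post-copy itself.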

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Attaching my logs just in case.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Fixing the reported libvirt arg checks with the coming upload.
Added a qemu task to cover our discussion and the work on the postcopy migration itself, which still fails for you.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libvirt - 2.1.0-1ubuntu8

---------------
libvirt (2.1.0-1ubuntu8) yakkety; urgency=medium

  [ Christian Ehrhardt ]

  * avoid migration postcopy issues by ensuring valid commands (LP: #1620906)
    - d/p/ubuntu/check-live-for-postcopy.patch Check for --live flag for
      postcopy-after-precopy migration.
    - d/p/ubuntu/make-postcopy-mandatory-for-postcopy-after-precopy.patch to

  [ Stefan Bader ]

  * Fix Xenial to Yakkety migration from libvirt-bin.service to
    libvirtd.service (LP: #1627969).
  * Update Vcs-Git and Vcs-Browser fields to point to launchpad
    (LP: #1629210)

  [ Dann Frazier ]

  * Fix FTBS in Yakkety due to missing python dependency (LP: #1629041)

libvirt (2.1.0-1ubuntu7) yakkety; urgency=medium

  * Enable NUMA support in arm64 builds (LP: #1627926).

 -- Stefan Bader <email address hidden> Fri, 30 Sep 2016 10:11:30 +0200

Changed in libvirt (Ubuntu):
status: In Progress → Fix Released
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

As it "just works" for me on ppc as outlined in e.g. comments 15-17 I mark the qemu task incomplete.

The libvirt task with the requested fixes is completed.

Changed in qemu (Ubuntu):
status: New → Incomplete
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Please reopen with more details on how your case gets to the failure, in case it is still failing for you.
If possible, identify the steps you take that differ from what I did in comments 15-17.

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-11-01 00:48 EDT-------
Patch accepted upstream for flag check
commit id: 011935457a3b5d911e10dc60c681778e78c8fdf9

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Thanks for the upstream patch id, but I'd really prefer if you had opened a new bug for it as I asked. All old requests were fulfilled in this bug already.

While I agree that this is related being a follow on work by you, it is not the same but a new twist to the old issue.

To keep tracking sane I did so for you now and created bug 1638470.

Changed in qemu (Ubuntu):
status: Incomplete → Invalid
bugproxy (bugproxy)
tags: added: targetmilestone-inin1610
removed: targetmilestone-inin---