Loops on mount failure when Plymouth not running

Bug #553290 reported by Reuben Firmin
This bug affects 7 people
Affects            Status        Importance  Assigned to                      Milestone
mountall (Ubuntu)  Fix Released  High        Scott James Remnant (Canonical)
Lucid              Fix Released  High        Scott James Remnant (Canonical)

Bug Description

Binary package hint: plymouth

I had the following line in my fstab:

none /proc/bus/usb usbfs devgid=126,devmode=664 0 0

While debugging #552046, I removed plymouth and tried to boot.

Mountall looped 3657 times with the error message: "Skipping mounting /proc/bus/usb since plymouth is not installed". At that point, the boot froze. I am encountering other boot issues, so perhaps this was not causing the blockage. Regardless, the looping behaviour is a bug.

Reuben Firmin (reubenf) wrote :

Commenting this line out from fstab got me past this point in the boot.

Steve Langasek (vorlon) wrote :

mountall depends on plymouth now, so as you've noticed, removing plymouth is no longer an option.

However, if something is wrong with plymouth at runtime, mountall should degrade gracefully; it sounds like that may not be happening here. Reassigning to mountall.

affects: plymouth (Ubuntu) → mountall (Ubuntu)
Scott James Remnant (Canonical) (canonical-scott) wrote :

Could you edit /etc/init/mountall.conf and append --verbose to the command line for the mountall binary, then try booting?

Capture the output at the point of looping. The precise detail is important here, including the error that leads it to skip the mount.

Changed in mountall (Ubuntu):
status: New → Incomplete
summary: - If plymouth is missing, and fstab has a usbfs, system won't boot
+ Loops on mount failure when Plymouth not running
Changed in mountall (Ubuntu):
importance: Undecided → Medium
Scott James Remnant (Canonical) (canonical-scott) wrote :

Part of the bug here is that even after skipping a filesystem, mountall will try to mount it again whenever it can; that's why you see the loop. That behaviour was right at the time it was written, but now it's just wrong: a skipped filesystem should be ignored entirely.

But that still doesn't explain why it stops booting entirely :-(
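
The retry behaviour described in this comment can be illustrated with a minimal sketch. The struct, the state names and both functions below are hypothetical and are not taken from mountall's source; the sketch only shows the intended rule that an entry marked as skipped is left alone on later passes, so a failing mount is attempted once rather than thousands of times.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-entry state; not mountall's real data structure. */
enum mount_state { MOUNT_PENDING, MOUNT_MOUNTED, MOUNT_SKIPPED };

struct mount_entry {
    const char *mountpoint;
    enum mount_state state;
    int attempts;               /* number of mount attempts so far */
};

/* Stand-in for a mount attempt that fails and is therefore skipped. */
static void try_entry(struct mount_entry *e)
{
    assert(e != NULL);
    e->attempts++;
    e->state = MOUNT_SKIPPED;
}

/* One pass over the table. Only pending entries are attempted; mounted
 * and skipped entries are ignored. Returns how many were attempted. */
static size_t retry_pass(struct mount_entry *entries, size_t n)
{
    size_t attempted = 0;

    for (size_t i = 0; i < n; i++) {
        if (entries[i].state != MOUNT_PENDING)
            continue;
        try_entry(&entries[i]);
        attempted++;
    }
    return attempted;
}
```

With a single failing entry, the first call to retry_pass() attempts it once and every later call attempts nothing, which is the opposite of the loop reported above.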

Reuben Firmin (reubenf) wrote :

I'm having trouble reproducing the looping behaviour, even after having restored the usbfs to my fstab. I have cleaned up some udev configs since then...any chance those were interacting with this line in fstab?

On the boot being frozen: after filing this report, I discovered the workaround in #545536 - ctrl alt del when the boot is frozen makes it succeed. So some process (possibly not mountall) is sticking. I do have the "ureadahead" message mentioned in that bug, although I believe that it's innocuous; nevertheless, does that give you a clue where in the boot it has gotten stuck?

Scott James Remnant (Canonical) (canonical-scott) wrote : Re: [Bug 553290] Re: Loops on mount failure when Plymouth not running

On Thu, 2010-04-01 at 18:03 +0000, Reuben Firmin wrote:

> I'm having trouble reproducing the looping behaviour, even after having
> restored the usbfs to my fstab. I have cleaned up some udev configs
> since then...any chance those were interacting with this line in fstab?
>
No, but it may have been fixed by updating to mountall 2.10 which had a
lot of changes in this area.

> On the boot being frozen: after filing this report, I discovered the
> workaround in #545536 - ctrl alt del when the boot is frozen makes it
> succeed. So some process (possibly not mountall) is sticking. I do have
> the "ureadahead" message mentioned in that bug, although I believe that
> it's innocuous; nevertheless, does that give you a clue where in the
> boot it has gotten stuck?
>
Nope, no idea, sorry. Does it still stick?

Scott
--
Scott James Remnant
<email address hidden>

Reuben Firmin (reubenf) wrote :

Yeah, it blocks consistently on boot, but the ctrl + alt + del workaround works. Don't want to take this bug off on a tangent though, so if you want to have this focused just on the looping behaviour that's fine by me.

Wladimir Mutel (mwg) wrote :

I would very much like to see Plymouth decoupled from mountall.
This Plymouth thing (together with usplash, rhgb and similar stuff) just reminds me of a Soviet joke: "let's draw the curtains over the windows and rock the train carriage from the outside so that the passengers think they are actually rolling forward".
I am so accustomed to good old dmesg and other console output scrolling up the screen. I have used Linux for 11 years already.
Please don't hide this valuable info from me. I really feel deprived when you do so :>

gam3 (gam3-launchpad) wrote :

mountall should not depend on Plymouth, but should detect whether it is running and use it if it is available.

Changed in mountall (Ubuntu):
assignee: nobody → Scott James Remnant (scott)
Changed in mountall (Ubuntu Lucid):
milestone: none → ubuntu-10.04
importance: Medium → High
status: Incomplete → Triaged
Changed in mountall (Ubuntu Lucid):
status: Triaged → Fix Committed
Steve Langasek (vorlon) wrote :

Scott,

I've had to back out your proposed fix for this - in my testing, it causes mountall to never mount *any* of my filesystems, except for the rootfs and virtual filesystems. So something needs more investigation here; in the meantime, I'm getting 2.12 uploaded with the i18n fixes.

Changed in mountall (Ubuntu Lucid):
status: Fix Committed → Triaged
Scott James Remnant (Canonical) (canonical-scott) wrote :

Steve: to confirm, it caused mountall to never mount any of your filesystems even if plymouth was running?

Or just when Plymouth isn't running?

Steve Langasek (vorlon) wrote :

It caused mountall to not mount any of my filesystems, particularly when plymouth was running.

Steve Langasek (vorlon) wrote :

(reproduced on two separate machines - the simpler case had just a / and a /home and still hit the problem)

Scott James Remnant (Canonical) (canonical-scott) wrote :

Oh, that happening when plymouth is running is certainly not intended.

When plymouth isn't running, I'd expect it to not mount any filesystem that took too long.

Can you get --debug output from the machines where you reproduced it, please?

Steve Langasek (vorlon) wrote :

mountall --debug log attached.

Paul Sladen (sladen) wrote :

It's not a question of having /removed/ plymouth... server installs don't get plymouth installed in the first place!

Colin Watson (cjwatson) wrote :

Paul, that simply isn't true.

Note that plymouth is not the same as having a splash screen - server installs use plymouth (for boot logging, and generally for multiplexing of console I/O during boot), but they configure it not to display a splash screen.

Scott James Remnant (Canonical) (canonical-scott) wrote :

Steve: your log simply says that udev never issued an event for your /home

Scott James Remnant (Canonical) (canonical-scott) wrote :

Aha, I see the problem!

-       if (mnt->tag != TAG_SKIPPED)
-               all = FALSE;

should be:

-       if (mnt->tag != TAG_SKIPPED)
-               all = FALSE;
+       all = FALSE;

Oops ;-)

Committed and pushed as 312

I also decided that skipping on boredom is just wrong when Plymouth isn't running - and mountall should just wait, so committed that too

Changed in mountall (Ubuntu Lucid):
status: Triaged → Fix Committed
Scott James Remnant (Canonical) (canonical-scott) wrote :

I've uploaded a new version of mountall (2.13~ppa1) to https://launchpad.net/~scott/+archive/ppa

Please test and see whether it solves this problem

Thanks

Steve Langasek (vorlon) wrote :

Similar results again, mountall waits forever for /home even though it's clearly available. New mountall debug log attached.

FWIW, it seems that when I touch /forcefsck, this problem does *not* happen.

Steve Langasek (vorlon) wrote :

Scott,

I think there's a race condition in spawn() in mountall - AFAICS, there's nothing in the code that ensures spawn_child_handler is registered as a signal handler before calling execve(), so if fsck exits very fast, mountall may miss the signal that the child process has exited since it's passed to the default signal handler instead. This would explain why when the filesystem is clean and doesn't need checking, mountall locks up; but when it does a full fsck, mountall finishes fine.

Scott James Remnant (Canonical) (canonical-scott) wrote :

On Sat, 2010-04-17 at 08:31 +0000, Steve Langasek wrote:

> I think there's a race condition in spawn() in mountall - AFAICS,
> there's nothing in the code that ensures spawn_child_handler is
> registered as a signal handler before calling execve(), so if fsck exits
> very fast, mountall may miss the signal that the child process has
> exited since it's passed to the default signal handler instead. This
> would explain why when the filesystem is clean and doesn't need
> checking, mountall locks up; but when it does a full fsck, mountall
> finishes fine.
>
This shouldn't matter since wait() is always called in the main loop;
the signal handler is just there to break the loop and even the default
signal handler should do that

Scott
--
Scott James Remnant
<email address hidden>
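
The point about wait() in the reply above can be checked in isolation with a small sketch. It is not mountall code and the function name is made up for illustration. It shows only that a child's exit status stays available to waitpid() until it is collected, even when the child exits before the parent does anything and no SIGCHLD handler is installed. Whether a blocked main loop actually wakes up in time to call wait() is a separate question that this sketch does not address.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Fork a child that exits immediately with `code`, pause so the child
 * has certainly exited before we look, then reap it. Returns the child's
 * exit code, or -1 on failure. No SIGCHLD handler is installed. */
static int reap_fast_child(int code)
{
    assert(code >= 0 && code <= 255);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);            /* child: exit right away */

    /* Parent: let the child finish first (100 ms). */
    struct timespec ts = { .tv_sec = 0, .tv_nsec = 100 * 1000 * 1000 };
    nanosleep(&ts, NULL);

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The sketch assumes SIGCHLD has its default disposition; if the process has explicitly set it to SIG_IGN, children are reaped automatically and waitpid() fails instead.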

Scott James Remnant (Canonical) (canonical-scott) wrote :

On Sat, 2010-04-17 at 06:57 +0000, Steve Langasek wrote:

> Similar results again, mountall waits forever for /home even though it's
> clearly available. New mountall debug log attached.
>
The log indicates that fsck is running? Is it?

Scott
--
Scott James Remnant
<email address hidden>

Steve Langasek (vorlon) wrote :

fsck ran, then exited successfully.

Sorry, it seems the (current) problem may be with the ply_boot_client_flush() call. Backtrace:

(gdb) thread apply all bt full

Thread 1 (Thread 0x7f20eb6aa700 (LWP 357)):
#0 0x00007f20eac3c920 in __read_nocancel () from /lib/libpthread.so.0
No symbol table info available.
#1 0x00007f20ea81ac6b in read (fd=11, buffer=0x7fff3e7b9780, number_of_bytes=1) at /usr/include/bits/unistd.h:45
No locals.
#2 ply_read_some_bytes (fd=11, buffer=0x7fff3e7b9780, number_of_bytes=1) at ply-utils.c:353
        bytes_read = -512
        total_bytes_read = 0
#3 ply_read (fd=11, buffer=0x7fff3e7b9780, number_of_bytes=1) at ply-utils.c:385
        __PRETTY_FUNCTION__ = "ply_read"
#4 0x00007f20ea608e56 in ply_boot_client_process_incoming_replies (client=0x7f20ecdc2810) at ./ply-boot-client.c:266
        request_node = <value optimized out>
        request = <value optimized out>
        byte = "\000"
        size = <value optimized out>
        __PRETTY_FUNCTION__ = "ply_boot_client_process_incoming_replies"
#5 0x00007f20ea609415 in ply_boot_client_flush (client=0x7f20ecdc2810) at ./ply-boot-client.c:753
        __PRETTY_FUNCTION__ = "ply_boot_client_flush"
#6 0x00007f20eb6d26ba in emit_event (name=0x7f20eb6de964 "mounting", mnt=0x7f20ecdc84a0) at mountall.c:2204
        env = 0x0
        env_len = 0
        __FUNCTION__ = "emit_event"
#7 0x00007f20eb6d6fe7 in try_mount (mnt=0x7f20ecdc84a0, force=0) at mountall.c:1586
        __FUNCTION__ = "try_mount"
#8 0x00007f20eb6d2de2 in spawn_child_handler (proc=0x7f20ecdd0900, pid=840, event=<value optimized out>, status=0) at mountall.c:1717
        __FUNCTION__ = "spawn_child_handler"
<snip>

So the flush call is trying to read from the pipe, because there's an outstanding request to plymouth that *may* result in a response: the "s to skip or m for maintenance shell" key watch.

Since we don't want to cancel that request (we've only gotten as far as "mounting"; something might still go wrong between now and the mount finishing, so the user should still be able to interrupt), ply_boot_client_flush() needs to be made smarter about knowing when there's something to be read (e.g., poll()).

I've confirmed that building from trunk with only revision 314 commented out lets this test system boot again.
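
The poll() suggestion in this comment can be sketched as follows. This is not Plymouth's code and says nothing about how ply_boot_client_flush() is actually structured; both function names are hypothetical. It only illustrates the general technique of asking poll() with a zero timeout whether a descriptor has data before calling read(), so that an outstanding request with no reply yet does not block the caller.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdbool.h>
#include <unistd.h>

/* True if a read() on fd would not block right now, i.e. data is
 * available or the peer has hung up. Never blocks itself. */
static bool fd_readable_now(int fd)
{
    assert(fd >= 0);

    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n;

    do {
        n = poll(&pfd, 1, 0);   /* timeout 0: return immediately */
    } while (n < 0 && errno == EINTR);

    return n > 0 && (pfd.revents & (POLLIN | POLLHUP)) != 0;
}

/* Read pending bytes only if some are available. Returns the number of
 * bytes read, 0 if nothing is pending (or on end of file), -1 on error. */
static ssize_t read_if_pending(int fd, void *buf, size_t len)
{
    if (!fd_readable_now(fd))
        return 0;
    return read(fd, buf, len);
}
```

On an empty pipe read_if_pending() returns 0 straight away instead of sleeping in read(), which is the situation shown in frames #0 to #5 of the backtrace.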

Steve Langasek (vorlon) wrote :

Anyway, the good news seems to be that the fix for *this* bug can be safely uploaded now - it's only the fix for bug #559761 that appears to need more work.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package mountall - 2.13

---------------
mountall (2.13) lucid; urgency=low

  [ Scott James Remnant ]
  * Once a mountpoint has been skipped, don't try and mount it again
    (unless the udev device actually shows up). LP: #553290.
  * Skipping a filesystem means we should also skip anything that depends
    on that (ie. skip /usr/local when skipping /usr).
  * Don't skip filesystems due to timeout when Plymouth not available.

  * Don't run mount, swapon or fsck while there's an uncleared error on
    the filesystem. LP: #501801.

  * Don't display the filesystem check message when an fsck completes
    without needing to check the filesystem. LP: #564434.
 -- Steve Langasek <email address hidden> Mon, 19 Apr 2010 00:15:58 -0700
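
The second changelog entry (skip /usr/local when skipping /usr) amounts to a path-prefix check. The sketch below is one possible way to express that rule and is not the code from mountall 2.13; the struct and both function names are assumptions made for illustration. The detail worth noting is the component boundary test, so that skipping /usr does not also skip an unrelated mountpoint such as /usrlocal.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical table entry; not mountall's real data structure. */
struct entry {
    const char *mountpoint;
    bool skipped;
};

/* True if `path` is `parent` itself or lies below it in the tree. */
static bool is_under(const char *path, const char *parent)
{
    assert(path != NULL && parent != NULL);

    size_t len = strlen(parent);
    if (strncmp(path, parent, len) != 0)
        return false;
    /* A parent ending in '/' (such as the root) covers everything below. */
    if (len > 0 && parent[len - 1] == '/')
        return true;
    /* Otherwise the match must end on a path component boundary. */
    return path[len] == '\0' || path[len] == '/';
}

/* Mark `mountpoint` and everything mounted beneath it as skipped.
 * Returns the number of entries newly marked. */
static size_t skip_with_dependents(struct entry *entries, size_t n,
                                   const char *mountpoint)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (!entries[i].skipped &&
            is_under(entries[i].mountpoint, mountpoint)) {
            entries[i].skipped = true;
            count++;
        }
    }
    return count;
}
```

Given a table containing /usr, /usr/local, /usrlocal and /home, skipping /usr marks the first two and leaves the other two untouched.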

Changed in mountall (Ubuntu Lucid):
status: Fix Committed → Fix Released