Active VT tracking can fail at startup

Bug #544139 reported by Chris Coulson
252
This bug affects 61 people
Affects Status Importance Assigned to Milestone
ConsoleKit
Confirmed
Medium
consolekit (Ubuntu)
Fix Released
High
Colin Watson
Lucid
Fix Released
High
Colin Watson
Maverick
Fix Released
High
Colin Watson

Bug Description

Impact: ConsoleKit sometimes fails to determine which VT is active, breaking many parts of the system. For example, any dialog that requires PolicyKit authentication cannot be unlocked.
Development branch: Fixed in https://launchpad.net/ubuntu/+source/consolekit/0.4.1-4ubuntu1 by retrying console opens if they return EIO.
Patch: https://bugs.launchpad.net/ubuntu/+source/consolekit/+bug/544139/+attachment/1728928/+files/consolekit_0.4.1-3ubuntu2.debdiff
TEST CASE: Unfortunately, this bug is not uniformly reproducible, and may take many reboot attempts or even not be reproducible at all on any given system. If it is reproducible, then you can either try unlocking an administration dialog such as System -> Administration -> Time and Date (which will fail with a broken version), or (quicker) check for the string "Error waiting for native console" in /var/log/daemon.log. A successful fix will always permit a user with administrative privileges to unlock administrative dialogs.
Regression potential: When it breaks (not necessarily every time), consolekit is effectively completely broken. The test case should be sufficient to ensure that it is working properly.

Original description follows (note that the discussion about why EINVAL was being returned does not correspond to the end result of investigating this bug, but is preserved here for the record):

Binary package hint: consolekit

A few times over the last couple of days, I've noticed some weird consolekit issues where it doesn't correctly determine which VT is active, causing a lot of things to break (eg, disk mounting, rebooting, suspending etc). The issue is solved by rebooting.

When it fails, I get a lot of messages in my daemon.log when consolekit starts:

WARNING: Error waiting for native console 5 activation: Invalid argument

This occurs because the following call fails with EINVAL:

ioctl (console_fd, VT_WAITACTIVE, num);

I discussed this with Scott on #ubuntu-desktop. To summarize, there is a window between GDM starting and the X server coming up where the ioctl that consolekit does on the VT's will fail. Unfortunately, consolekit starts around the time of this window. Here is the log:

<chrisccoulson> Keybuk - i mentioned a consolekit issue last week, and your name was mentioned there
<chrisccoulson> that might have been what you remember
<Keybuk> can you remember more about what you mentioned?
<chrisccoulson> Keybuk - a couple of times when I booted last week, consolekit was unable to determine what the active VT was
<chrisccoulson> and it was throwing out errors like this:
<chrisccoulson> WARNING: Error waiting for native console 5 activation: Invalid argument
<Keybuk> right
<Keybuk> but why is consolekit using that ioctl?
<Keybuk> that's only used when you switch VT
<chrisccoulson> Keybuk - it spawns a thread for each VT, which waits for it to become active
<chrisccoulson> so it can track where the active one is
<Keybuk> ok
<Keybuk> it'll fail with -EINVAL for a short period during boot
<Keybuk> does it correctly back-off from that, and restart the thread again later?
<Keybuk> (if it goes into an infinite loop, that's not good either)
<chrisccoulson> Keybuk - no, that's probably the issue really. once it has failed, it just gives up
<chrisccoulson> so, we probably need to fix consolekit then?
<Keybuk> yeah
<Keybuk> we caused X to have the same bug
<chrisccoulson> ah, ok. that makes sense. and that explains why i can't recreate it all the time
<Keybuk> you get -EINVAL from VT_WAITACTIVE in a very specific condition
<Keybuk> the current foreground VT is in KD_GRAPHICS mode, but also VT_AUTO
<Keybuk> ie. it's been left with painted graphics ... but no process running on it
<Keybuk> since it's in graphics mode, the kernel prohibits VT switches
<Keybuk> can you guess when that condition is true?
<chrisccoulson> do you know how long it's in that condition for?
<Keybuk> chrisccoulson: however long the X server takes to start ;-)
<Keybuk> couple of seconds usually
<chrisccoulson> oh, right. that seems obvious now :)
<chrisccoulson> Keybuk - so the window is quite large then (and I think consolekit is activated after GDM starts isn't it?)
<chrisccoulson> i think gdm is the first thing to use it anyway
<Keybuk> chrisccoulson: gdm activates it
<chrisccoulson> yeah, i thought so
<chrisccoulson> thanks
<Keybuk> which means it's activated "before X starts or while X is starting"
<Keybuk> ie. exactly in that window
<Keybuk> chrisccoulson: so, on the VT_WAITACTIVE+VT_AUTO thing ... you could kinda argue it's a kernel bug
<Keybuk> because the kernel bug should deal with that case on its own
<Keybuk> but the kernel guys will tell you that the whole VT_* stuff is a mess, and they'd rather leave it alone
<chrisccoulson> yeah, it might be easier to work around it in consolekit for now
<Keybuk> exactly

Changed in consolekit (Ubuntu):
importance: Undecided → High
status: New → Triaged
description: updated
Revision history for this message
Graham Crumb (gcrumb) wrote :

Given that I'm still seeing these symptoms on an up-to-date Lucid installation, I'm assuming there's been no movement on this.

Can anyone suggest a more elegant work-around than 'log in and out a few times until things work again'?

Revision history for this message
Jacek Kałucki (laborm) wrote :

I can't.
I noticed, that "Login screen configuration" can't be unlocked simultaneously.

Revision history for this message
Hendrik Schrieber (hennekn) wrote :

Hey guys!
I'm also having this problem but it didn't appear for a long time. Now, after the recent kernel update, it happens almost every time I start my laptop and it won't go away even if I completely shut down. Before, I was able to get it to work if I shut down and restarted. Can anyone confirm this worsening after the kernel update?

Tell me, if you need any more information regarding this issue.

Revision history for this message
Hendrik Schrieber (hennekn) wrote :

Isn't there any workaround for this bug? Is there a way to downgrade the kernel to the previous version? There is nothing working anymore: no sound, no power management, no screensaver, I can't plug in any USB devices because it cannot be mounted and I can only shutdown using the terminal.

Revision history for this message
Giulio Spinozzi (giuliospinozzi) wrote :

Also I have this bug in my Ubuntu 10.04 update 32 bit, but only sometimes this occurs....Any workarounds???

Revision history for this message
Anthony Long (c-launchpad-languagehammer-com) wrote :

Are you using the current nvidia driver, by any chance? This sounds exactly like the problem that I had with the 'current' version. I downgraded to v. 173 and the problems went away. My symptoms were not being able to mount drives, power management problems, shutting down issues, no sound (that I noticed - there may have been more).

Revision history for this message
Otto Kekäläinen (otto) wrote : Re: [Bug 544139] Re: Active VT tracking can fail at startup

FYI: My computer hasn't suffered the symptoms described in this bug
report for over 2 months, so for my part this bug is fixed.

Revision history for this message
Hendrik Schrieber (hennekn) wrote :

@ Anthony Long: I have an ATI Radeon X1250 (RS690M chipset), so I don't have the nvidia driver installed.

This bug still exists on my computer and appears almost every time since the last kernel update! Only workaround right now is to completely shutdown, start again and hope that it works this time. Log in and out does not work for me, restart doesn't work either!

Maybe the importance should be set to critical since it makes the system unusable?

Revision history for this message
Damien (takahara) wrote :

Same problem here, on all our production machines. Like Chris I have the following messages in daemon.log:

WARNING: Error waiting for native console X activation: Invalid argument

Only for me X grows from 1 to 62, so I have 62 messages there. (Oddly, the two last ones are 62 *then* 61)

Symptoms for the user are (AFAICT):
- no sound
- no mounting of external drives
- impossibility to power down/suspend (in my case the power off and suspend options are not visible in the system menu)
- many (if not all) tasks that require admin privileges cannot be performed. For instance, the user-admin interface does not ask for the admin password; its controls are simply inoperative.

The bug is already visible in GDM, in that it does not show the suspend option in the power menu (lower right icon). GDM does show the power off option, but clicking it does nothing.

I second Hendrik on this, this bug should be set to critical:
- the a lack of functionality resulting from this bug is _very_ bad. Even "commercial" operating systems don't behave like that ;-)
- this happens every now and then, which is even more frustrating and hard to track down for a any user
- a stock 10.04 install has this problem.
- this has been reported multiple times already
- this bug is 3 months old and no fix seems to have been released (if there is one please let us know :-)

I don't know heck about consolekit but I will certainly be happy test any solution that will be proposed.

Let's squish this ugly bug!!!

Revision history for this message
Alex Bird (alex-alexbird) wrote :

Happens to me every single boot on Acer Aspire One (110, ZG5, 1GB RAM) since the last time I ran update manager (e.g. most recent updates right now 10 July 2010). I assume this did a kernel upgrade. I never experienced this problem before. I am unsure how to try an earlier kernel without the grub screen appearing as it does in desktop ubuntu.
I've been telling everyone how great UNE is, but now I'm about ready to throw it out of the window.

Revision history for this message
Sepero (cowpie2000) wrote :

A Possible Solution

This bug seems very related to these two:
https://bugs.launchpad.net/ubuntu/+source/upstart/+bug/543506
https://bugs.launchpad.net/ubuntu/+source/upstart/+bug/554172

This solution here is what worked for me. If it works for you, please help spread it to others.
http://ubuntuforums.org/showpost.php?p=9480172&postcount=20

Revision history for this message
Damien (takahara) wrote :

Sepero,

Thanks a lot for your input. I'm testing your solution right now. Currently I have 30 reboots with no problems, which is a good sign.

I first tried 2) and 3) only, but it failed after 10 reboots (no power button etc...) Trying with all 3 now and it seems to work.

However, I can now see this error appearing twice:

ureadahead-other main process (###) terminated with status 4

where ### is a proc number.

I will report later when I have performed more reboots/shutdowns/boots cycles.

Crossing fingers!!

Revision history for this message
Damien (takahara) wrote :

Two remarks:

1) you will have to reactivate your network connection(s) the first time you login after doing the 3 changes proposed by Sepero. Obviously purging ifupdown has something to do with that... ;-)

2) the ureadahead errors that I mentioned previously seem to happen just after X starts (between a screen flash and my nvidia logo)

I have now 40 reboots and... no problems! Could this be It??

Revision history for this message
Damien (takahara) wrote :

After 106 reboots the bug appeared.... twice! I really thought it was gone for good 'cause nothing happened until the 90th reboot. But apparently something is still lurking out there...

In any case it's a nice improvement, thanks for the hack!! ( ^^)/ At least now we have a usable system.
For "linux grade" stability we'll need a proper fix in consolekit I guess...

Revision history for this message
Mike Bianchi (mbianchi-foveal) wrote :

Damien, Sepero, and all ...

If you look at https://bugs.launchpad.net/ubuntu/+source/upstart/+bug/543506/comments/83 you will see
that I also recommend
    Add "init='/sbin/init --verbose'" to the linux boot line in grub.

I believe this helps by forcing "/sbin/init" to serialize starting up /etc/init/*.conf scripts ever-so-slightly and I found
that it helped avoid runlevel "unknown" boots.

Damien, please give --verbose a try and report on how you do. The fact that your are keeping a successful boot count should be very useful in verifying if this helps.

Please note that these are "fixes" are all get-arounds. What we desperately need is for someone to figure out why the writes to the console are occasionally failing!

Revision history for this message
Hendrik Schrieber (hennekn) wrote :

Ok, I'm gonna try the work-around suggested in the forum. bug 554172 does not look like it's the same I'm seeing though because restart doesn't fix the problem only a total shutdown and then start again sometimes fixes the problem temporarily.

@DD: I can see the power off and suspend buttons but every time I click them it just takes me back to the login screen. Does that mean it's a different issue?

Revision history for this message
Damien (takahara) wrote :

Mike,

I don't have the runlevel-unknown issue, at least not that I think of.
But anyway, I'll try your extra hack today and see if it helps. Problem is I may have to do a good 200 reboots to be sure it helps so it's going to take some time. Stay tuned...

Revision history for this message
Damien (takahara) wrote :

Hendrik, all,

('DD' and 'Damien' are the same person ;-)

Is the power-off behaviour your only problem? Can you still perform admin tasks and have sound when your problem occurs? If the answer is yes then it is likely a different issue, at least partly. But I think there's quite a few bugs related to the same problem; the cause may be the same.

BTW, is there any way to ditch the new 'fast' boot process of ubuntu? It's not really stable at the moment, to say the least. If we could go back to a classic sequential boot sequence it may just solve all these problems.

Revision history for this message
Hendrik Schrieber (hennekn) wrote :

No it's not the only problem, I can't power-off or restart, sound does not work, cups is not started and some other services like apache2 are not started either. I haven't tried to perform admin tasks yet so I don't know about that. How can I edit my linux boot line? Using grub2, so there is no menu.lst!

Revision history for this message
Damien (takahara) wrote :

Hi all,

Here are some preliminary results...

1) After setting Mike's "verbose" option the errors came back on 11 reboots out of 51 (~20%)

2) So I removed the "verbose" option, updated and installed grub and rebooted a few times. Errors were still there (7/42=16%), although I got more at the begining of my test than at the end. Not exactly the result I expected!

I have now reflashed the hard drive with the original image I had and I'm going to estimate the number of bad boots from this un-hacked system. Then try Sepero's hack again. Maybe the 1/50 error rate I reported earlier was not accurate ??

Revision history for this message
Damien (takahara) wrote :

Hendrik,

For admin tasks, try to add a user through the GUI users-admin. I don't know about apache or cups, but I'm not surprised they fail too.

Before editing your bootloader, first try without that hack (e.g. only Sepero's 3 hacks). Then if it doesn't work for you this is the way to edit your bootloader:

1) sudo gedit /etc/default/grub

In there, change the line

GRUB_CMDLINE_LINUX=""

to

GRUB_CMDLINE_LINUX=" init='/sbin/init --verbose' "

save this and then,

2) sudo update-grub

3) sudo grub-install <your drive device, e.g. /dev/sda>

then reboot a few times to see how it works.

Revision history for this message
Damien (takahara) wrote :

The results are in for today: a standard linux distro has this problem 7% os the time (7 bad boots, 93 good boots)

This is not so far from the 2/100 obtained before. I'm not so sure the hacking does any effect after all...

I'll try the hacks tomorrow, but I don't have much time to get this bug fixed so I may try an install of Debian instead ;-(

Revision history for this message
Damien (takahara) wrote :

I have no better news at this point so I contacted the consolekit guys to see what they think.

Revision history for this message
Damien (takahara) wrote :

According to an answer on the consolekit list, "GDM shouldn't be activating ConsoleKit until after the X server is up".
This may thus be a GDM bug (not waiting for X "ready" state before launching consolekit)

I have thus added GDM as affected package.

Revision history for this message
Damien (takahara) wrote :

Ubuntu being based on Debian, I've tried "testing" (aka squeeze). The good news is that it works without a glitch, the bad one is that I am switching all our production systems to Debian.

FYI, here's a comparison of relevant packages versions on Squeeze and Lucid:

Debian Squeeze:
--------------------------
GDM: 2.30.2-4
Consolekit: 0.4.1-4
policykit-1: 0.96.2
Kernel: 2.6.32-5

Ubuntu Lucid:
--------------------------
GDM: 2.30.2.is.2.30.0-0ubuntu2 (definitely the worst version string I've ever seen)
Consolekit: 0.4.1-3ubuntu1
policykit-1: 0.96.2
Kernel: 2.6.32-23-generic

Revision history for this message
Tobias Johansson (tj-mensa) wrote :

I reinstalled ubuntu on a computer. Before I had this issue maybee 1/5 startupps. But now (even with the workarounds) i would say its more 7/10 times.. I just started to use linux 2 month ago but Im starting to wonder if switching was a good idea.

Revision history for this message
Damien (takahara) wrote :

Tobias,

If you have a spare partition, could you try Debian "testing"? Since you experience this bug almost systematically, it would be interesting to see if Debian solves your problem like it did for me.
http://www.debian.org/devel/debian-installer/

Also, which video drivers do you use? If you have an nvidia card, try this:
lsmod | grep nvidia

It may be that the nvidia proprietary drivers make X slower to boot, resulting in The Bug.
Another thing: is your root partition on an SSD?

Revision history for this message
Damien (takahara) wrote :

(Chris Coulson, are you there? Is there anyone knowledgeable listening to this thread anymore?)

Some more investigations today: if I build consolekit from the latest source and install that, then the error is SYSTEMATIC.
Reinstalling the ubuntu package (--reinstall) does not change a thing, the error persists. I also noted that every 2-3 boot the X server will not even start.

Is the ubuntu source tree for consolekit available somewhere? After looking at the sources a fix seems trivial but I can't build a .deb that would install cleanly.

Thanks...

Revision history for this message
OC (olivier-c) wrote :

I confirm this on an HP Compaq nc6000 laptop, Ubuntu 10.04 i386, fresh install, both as installed and after updates.
My rate of "bad boots" is roughly 50% (based on roughly 20 restarts).

I installed Maverick alpha3 as a dual-boot. After 20 restarts I cannot reproduce the problem under Maverick.

I attach a syslog of a "bad boot" and one of a "normal boot".

Revision history for this message
OC (olivier-c) wrote :

Tried Mike Bianchi's proposed solution in comment 81 of Bug #543506.
As a result, 2 bad boots in a series of 20. This is five times less than before.

Also, marking the present bug as duplicate of Bug #554172, please correct me if mistaken.

Revision history for this message
Damien (takahara) wrote :

OC,

These bugs look different: they seem to have runlevel problems. Some comments point to this bug though, like #15 of Bug #543506.

There certainly is quite a lot of confusion with all these boot bugs.
This guy nailed exactly how messy all this is:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/554172/comments/140

I'm personally amazed that a bug for which there is an apparent solution (like this one) is not fixed after FOUR months.
Heck, there isn't even anyone assigned to it, even if its priority is high!

"On Launchpad, no one can hear you scream..."

Revision history for this message
skorqa (skorqa) wrote :

Hi all,
I'm also affected by this bug, or at least encounter the exact same symptoms as Damien described in #9
The machine is a laptop Acer Aspire 1692 and the kernel is 2.6.32.24-generic-pae
The distrib is an up to date Lucid 10.04, ( fresh install + updates)
I did not perform enough boots to extract a significant statistical data about failing / successful boots, but i m attaching a few logs
of good and bad boots. for some of them i set 'debug init='/sbin/init --verbose' ' kernel options.

Like Damien i'm not sure how much this bug is related to #554172 but an upstart fix has been committed in lucid-proposed, so i suggest some of us install that fix to see if it has any effects on this bug. I did it and i still did not notice the bug after a few boots (not enough to be significant). The logs of these one are attached as well.

If the upstart fix reveals not to work in this case, do you think that a temporary work around could be downgrading gdm / consolekit and kernel to debian squeeze version, or do you think it would break the distribution stability ?

Cheers

Fred

Revision history for this message
skorqa (skorqa) wrote :
Revision history for this message
skorqa (skorqa) wrote :
Revision history for this message
skorqa (skorqa) wrote :
Revision history for this message
skorqa (skorqa) wrote :
Revision history for this message
skorqa (skorqa) wrote :

well, unfortunately the bug reappears after ~20 reboots, even with the new upstart patch...

regards

f

Revision history for this message
Damien (takahara) wrote :

Thanks Skorka, now we are sure it's not a duplicate :)

The good thing about this bug is that I'm now testing many other distros (Debian, CentOS, Fedora...) for our multitouch system and I'm seeing lots of good things there... "♪♪ Always look on the bright side of life... ♪♪"

Revision history for this message
Chris Coulson (chrisccoulson) wrote :

Sorry, this seems to have dropped off the radar a bit. I will take a look at this when I get some time, but I'm busy with other things at the moment. Somebody else feel free to take it if they get an opportunity to work on it before I do

Changed in consolekit (Ubuntu Lucid):
assignee: nobody → Chris Coulson (chrisccoulson)
milestone: none → ubuntu-10.04.2
Revision history for this message
Damien (takahara) wrote :

Hi Chris,

Nice to see you're back! :-)

I have tried a few things in the sources of consolekit but unfortunately installing my new deb can make things worse "sometimes" (i.e. X refuses to start every other time or so, even if I make no changes in the sources). Maybe some config files are not properly cleaned up during the upgrade? Well, I don't know what's going on there but I'll continue to probe.

In the meantime if you have a specific fix to suggest (code?) or even a .deb to test I'll be happy to stress-test it a few hundred times to see if it solves something.

Cheers,

Damien

Revision history for this message
Colin Watson (cjwatson) wrote :

Can people affected by this bug have a go at the version of consolekit in this PPA?

  https://launchpad.net/~cjwatson/+archive/ppa

Beware: I really haven't tested this *at all* beyond building it, and can't reproduce the error condition myself anyway. Make sure you have a way to roll back to the old version of consolekit, as it's entirely possible that it will e.g. segfault as soon as this error occurs. I ran it past Chris Coulson as a sanity check but haven't sent it to anyone else yet.

Patch attached.

Revision history for this message
Colin Watson (cjwatson) wrote :

(I only have this built for Maverick right now.)

Revision history for this message
Colin Watson (cjwatson) wrote :

One other thing, just following up a hunch: on a broken system, could you find the process ID of console-kit-daemon ('ps acx | grep console-kit'), then attach the output of 'lsof -p THAT_PROCESS_ID' run as root?

Revision history for this message
Colin Watson (cjwatson) wrote :

Regarding gdm: I think why this is happening despite the fact that ConsoleKit upstream says that gdm shouldn't be starting CK until X has started is probably that we're carrying the patch in https://bugzilla.gnome.org/show_bug.cgi?id=593996. This queries CK for users before X starts.

However, it's wrong to assume that gdm is the only thing which starts ConsoleKit; other PAM sessions are quite liable to do so too, and it's entirely possible (if unlikely) for somebody to start a PAM session remotely while the X server happens to be starting. Thus, I argue that it makes sense to fix this in ConsoleKit independently of any changes in gdm.

tags: added: patch
Revision history for this message
Lennie (lbesselink) wrote :

Hi Colin and all the others,

I recently upgraded a laptop and I also sometimes had the 'can not shutdown/reboot' probleem.

If what Colin mentions is true, then would it not help to just disable the getting the recently-logged-in-user-list from consolekit as a test/workaround ?

Which can be achieved with the following command.

sudo -u gdm -- gconftool -s /apps/gdm/simple-greeter/disable_user_list true --type bool

which changes the file: /var/lib/gdm/.gconf/apps/gdm/simple-greeter/%gconf.xml

I enabled the debug in /etc/gdm/gdm.schemas before hand by setting the 'debug/Enable' to true and judging by the /var/log/gdm/:0-greeter.log file it seems it doesn't talk to consolekit anymore at startup.

Normally it would say something like this:
GdmUserManager: Found current seat: /org/freedesktop/ConsoleKit/Seat1

But not with the recent-user-list-disabled.

Maybe someone like Damien can test this ? Because I can't really reproduce the problem very well.

Possible you could also post (part of) a logfile of when it fails to work, maybe there is an other hint in there as well.

I hope this helps, because this is a pretty annoying problem.

Revision history for this message
Lennie (lbesselink) wrote :

Forget what I mentioned, it doesn't work. I was doing reboot-cycles to try and produce the error and I got the same problem again:
WARNING: Could not determine active console
WARNING: Error waiting for native console 10 activation: Invalid argument
...

Revision history for this message
Lennie (lbesselink) wrote :

I wonder what would happen if we enabled the last part again in /etc/X11/Xsession.d again, instead of the what it says now:

# gdm already creates a CK session for us, so do not run the expensive D-Bus
# calls if we have $GDMSESSION
if [ -z "$GDMSESSION" ] && [ -x "$CK_LAUNCH_SESSION" ] && \
 ( [ -z "$XDG_SESSION_COOKIE" ] || is_on_console ) ; then
    STARTUP="$CK_LAUNCH_SESSION $STARTUP"
fi

like so:

    STARTUP="$CK_LAUNCH_SESSION $STARTUP"

Revision history for this message
Lennie (lbesselink) wrote :

That did not help, maybe if I enable --debug in /usr/share/dbus-1/system-services/org.freedesktop.ConsoleKit.service it might give us more information.

Revision history for this message
Colin Watson (cjwatson) wrote :

On Sat, Aug 21, 2010 at 07:40:37PM -0000, Lennie wrote:
> If what Colin mentions is true, then would it not help to just disable
> the getting the recently-logged-in-user-list from consolekit as a
> test/workaround ?

Could you try the package from my PPA?

The UserManager stuff in gdm is one likely cause I noticed, but I didn't
put in the extensive effort to check whether it's the only one.

Revision history for this message
Lennie (lbesselink) wrote :

At the moment I'm doing reboots so I could get a some debug from consolekit, but so far it hasn't happend.

As I don't know consolekit or gdm very well but I can program, I hope it would give me some idea of where to start to look.

I think maybe the debug I enabled made startup of consolekit slow enough that it doesn't happen for me anymore. Because I can't reproduce. :-)

Can I suggest adding some debug to consolekit in the if "if (errno == EINVAL && data->retries++ < 10) {", so we can see if we trigger it ?

I've done that right now.

OK, I hope I can still trigger it this evening before going to bed, otherwise it will be tomorrow.

Revision history for this message
Colin Watson (cjwatson) wrote :

On Sat, Aug 21, 2010 at 10:26:26PM -0000, Lennie wrote:
> I think maybe the debug I enabled made startup of consolekit slow enough
> that it doesn't happen for me anymore. Because I can't reproduce. :-)
>
> Can I suggest adding some debug to consolekit in the if "if (errno ==
> EINVAL && data->retries++ < 10) {", so we can see if we trigger it ?

You should still get "Error waiting for native console 1 activation:
Invalid argument" etc. with my patch. Furthermore, if you arrange for
console-kit-daemon to be run with --debug, then you should get
"VT_WAITACTIVE for vt 1" and "VT_WAITACTIVE for vt 1 returned
<something>" around any calls to that ioctl, whether successful or not,
which should be enough to see what's going on.

Still, I've added a few more bits of debug to the version of my patch
that I'll send upstream if this works out.

Revision history for this message
Lennie (lbesselink) wrote :

Yeah, just noticed that, because your change did not help. :-(

With the debug on, I've not been able to reproduce the problem yet, so I don't yet have a debug with the error-situation.

Revision history for this message
Lennie (lbesselink) wrote :

I changed the original package (not ppa) to have g_warning instead of g_debug for VT_WAIT..

let's see what happends.

Revision history for this message
Lennie (lbesselink) wrote :
Download full text (4.7 KiB)

Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: Could not determine active console
Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: VT_WAITACTIVE for vt 2
Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: VT_WAITACTIVE for vt 2 returned -1
Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: Error waiting for native console 2 activation: Invalid argument
Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: VT_WAITACTIVE for vt 6
Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: VT_WAITACTIVE for vt 6 returned -1
Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: Error waiting for native console 6 activation: Invalid argument
Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: VT_WAITACTIVE for vt 3
Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: VT_WAITACTIVE for vt 15
Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: VT_WAITACTIVE for vt 8
Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: VT_WAITACTIVE for vt 19
Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: VT_WAITACTIVE for vt 12
Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: VT_WAITACTIVE for vt 12 returned -1
Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: Error waiting for native console 12 activation: Invalid argument
Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: VT_WAITACTIVE for vt 14
Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: VT_WAITACTIVE for vt 14 returned -1
Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: Error waiting for native console 14 activation: Invalid argument
Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: VT_WAITACTIVE for vt 16
Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: VT_WAITACTIVE for vt 16 returned -1
Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: Error waiting for native console 16 activation: Invalid argument
Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: VT_WAITACTIVE for vt 18
Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: VT_WAITACTIVE for vt 18 returned -1
Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: Error waiting for native console 18 activation: Invalid argument
Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: VT_WAITACTIVE for vt 20
Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: VT_WAITACTIVE for vt 20 returned -1
Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: Error waiting for native console 20 activation: Invalid argument
Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: VT_WAITACTIVE for vt 3 returned -1
Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: Error waiting for native console 3 activation: Invalid argument
Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: VT_WAITACTIVE for vt 7
Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: VT_WAITACTIVE for vt 7 returned -1
Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: Error waiting for native console 7 activation: Invalid argument
Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: VT_WAITACTIVE for vt 4
Aug 22 13:31:14 laptop console-kit-daemon[1108]: WARNING: VT_WAITACTIVE for vt 4 returned -1
Aug 22 13...

Read more...

Revision history for this message
Lennie (lbesselink) wrote :

Yes, it's this one that failed:

res = ioctl (console_fd, VT_GETSTATE, &stat);

Interresting, it would have done:

kb_ok = (ioctl (fd, VT_GETSTATE, &vts) == 0);

just moments before that in ck_fd_is_a_console

I wonder...

Revision history for this message
Lennie (lbesselink) wrote :

So I checked, when it fails:

Aug 22 14:32:46 laptop console-kit-daemon[1119]: WARNING: input of ck_get_active_console_num: 3
Aug 22 14:32:46 laptop console-kit-daemon[1119]: WARNING: Could not determine active console

That doesn't make a lot of sense to me, it seems to have called ck_get_a_console_fd from ck_vt_monitor_init and it does:
kb_ok = (ioctl (fd, VT_GETSTATE, &vts) == 0);

fd is 3 and just a moment later it does:
res = ioctl (console_fd, VT_GETSTATE, &stat);

in ck_get_active_console_num and fails.

Revision history for this message
Lennie (lbesselink) wrote :

Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: input of ck_get_active_console_num: 3
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: got an error: ioctl VT_GETSTATE
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: Could not determine active console

When I have and it fails:

        g_warning ("input of ck_get_active_console_num: %d", console_fd);

        res = ioctl (console_fd, VT_GETSTATE, &stat);
        if (res == ERROR) {
                g_warning ("got an error: ioctl VT_GETSTATE");
                perror ("ioctl VT_GETSTATE");
                goto out;
        }

        {
                int i;

                g_warning ("Current VT: tty%d", stat.v_active);

Revision history for this message
Colin Watson (cjwatson) wrote :

The theory in the description of this bug is that something is starting
X at the time, and that ConsoleKit is racing with it. As such, you will
see different results from the same ioctl from moment to moment.

It's not obviously that GDM is calling ConsoleKit too early - ConsoleKit
is started by D-Bus activation when GDM asks for it, no matter what
stage it asks for it, and I assert that ConsoleKit ought to be able to
back off for a while and cope even if X is starting at the time. I'll
construct a simpler test program and see what I got wrong in my patch.

The reason you're seeing odd sequences of messages is that there are
multiple threads running in parallel.

Revision history for this message
Colin Watson (cjwatson) wrote :

On Sun, Aug 22, 2010 at 11:57:38AM -0000, Lennie wrote:
> Eventually they all return -1 and say error waiting for native console.

Could you please attach the full unedited log? I'd like to go through
the whole thing myself. Thanks.

Revision history for this message
Lennie (lbesselink) wrote :
Download full text (18.4 KiB)

Here is the whole part of the log when it fails:
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: input of ck_get_active_console_num: 3
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: got an error: ioctl VT_GETSTATE
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: Could not determine active console
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: VT_WAITACTIVE for vt 4
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: VT_WAITACTIVE for vt 4 returned -1
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: Error waiting for native console 4 activation: Invalid argument
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: VT_WAITACTIVE for vt 7
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: VT_WAITACTIVE for vt 7 returned -1
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: Error waiting for native console 7 activation: Invalid argument
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: VT_WAITACTIVE for vt 9
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: VT_WAITACTIVE for vt 9 returned -1
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: Error waiting for native console 9 activation: Invalid argument
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: VT_WAITACTIVE for vt 11
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: VT_WAITACTIVE for vt 11 returned -1
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: Error waiting for native console 11 activation: Invalid argument
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: VT_WAITACTIVE for vt 13
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: VT_WAITACTIVE for vt 13 returned -1
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: Error waiting for native console 13 activation: Invalid argument
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: VT_WAITACTIVE for vt 15
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: VT_WAITACTIVE for vt 15 returned -1
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: Error waiting for native console 15 activation: Invalid argument
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: VT_WAITACTIVE for vt 17
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: VT_WAITACTIVE for vt 17 returned -1
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: Error waiting for native console 17 activation: Invalid argument
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: VT_WAITACTIVE for vt 19
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: VT_WAITACTIVE for vt 19 returned -1
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: Error waiting for native console 19 activation: Invalid argument
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: VT_WAITACTIVE for vt 3
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: VT_WAITACTIVE for vt 3 returned -1
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: Error waiting for native console 3 activation: Invalid argument
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: VT_WAITACTIVE for vt 21
Aug 22 15:19:19 laptop console-kit-daemon[1110]: WARNING: VT_WAITACTIVE for vt 21 returned -1
Aug 22 15:19:19 l...

Revision history for this message
Lennie (lbesselink) wrote :

A small test-program to query dbus would be enough. That way I could run it without gdm (and thus X) and see if it can cope with that.

And seems to work just fine if I run it on the console without gdm, etc.

Revision history for this message
Lennie (lbesselink) wrote :

Is that first part really threaded though ? Isn't it just failing because X starts up at the same time or something like that ?

Revision history for this message
Lennie (lbesselink) wrote :

What I mean is, it starts up tried to find the active vt, when that succeeds it continues to the next step, it tried to get it again but fails. Then starts the threads, but not for the active console:

                /* don't wait on the active vc */
                if (i == current_num) {
                        continue;
                }

Revision history for this message
Lennie (lbesselink) wrote :
Download full text (5.9 KiB)

I hacked in some code in ck_vt_monitor_init to try again and I think I got it.

The console_fd was just a bad one so the threads couldn't use it to VT_WAITACTIVE.

Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: console_fd: 3
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: input of ck_get_active_console_num: 3
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: got an error: ioctl VT_GETSTATE
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: Could not determine active console, trying again
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: console_fd: 12
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: input of ck_get_active_console_num: 12
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: Current VT: tty1
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: skipping: 1
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: VT_WAITACTIVE for vt 13
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: VT_WAITACTIVE for vt 2
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: VT_WAITACTIVE for vt 6
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: VT_WAITACTIVE for vt 8
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: VT_WAITACTIVE for vt 10
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: VT_WAITACTIVE for vt 5
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: VT_WAITACTIVE for vt 3
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: VT_WAITACTIVE for vt 14
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: VT_WAITACTIVE for vt 55
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: VT_WAITACTIVE for vt 59
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: VT_WAITACTIVE for vt 11
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: VT_WAITACTIVE for vt 16
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: VT_WAITACTIVE for vt 18
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: VT_WAITACTIVE for vt 20
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: VT_WAITACTIVE for vt 62
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: VT_WAITACTIVE for vt 60
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: VT_WAITACTIVE for vt 41
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: VT_WAITACTIVE for vt 46
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: VT_WAITACTIVE for vt 49
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: VT_WAITACTIVE for vt 56
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: VT_WAITACTIVE for vt 53
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: VT_WAITACTIVE for vt 54
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: VT_WAITACTIVE for vt 21
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: VT_WAITACTIVE for vt 12
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: VT_WAITACTIVE for vt 9
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: VT_WAITACTIVE for vt 45
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: VT_WAITACTIVE for vt 17
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: VT_WAITACTIVE for vt 39
Aug 23 02:32:07 laptop console-kit-daemon[1131]: WARNING: VT_WAITA...

Read more...

Revision history for this message
Damien (takahara) wrote :

Colin,

Regarding lsof, here's the output I have when the bug occurs.

(I am running Lucid so I can't test your PPA)

Revision history for this message
Lennie (lbesselink) wrote :

OK, I created a ppa with my quick hack so people can test it.

deb http://ppa.launchpad.net/lbesselink/consolekit/ubuntu lucid main

But it will take a couple of hours before it has been compiled by launchpad.

Revision history for this message
Lennie (lbesselink) wrote :

Well, the ppa is doing compiling. I hope someone, anyone.

Maybe Damien has time to test this ?

It solves this specific problem of the VT_WAITACTIVE.

__

Looking at some of the logging, the most likely candidate for which is changing the console_fd from under consolekit is this:
kernel: [ 14.439499] Console: switching to colour frame buffer device 160x50

And that is not X itself, it's the loading of the radeon driver. So it's the kernel-mode-setting.

Because with gdm-debug on it much later says: "Starting X server process:"

Revision history for this message
Lennie (lbesselink) wrote :

Just a small lesson, don't change text at the last moment :-)

Instead of:

"Well, the ppa is doing compiling. I hope someone, anyone."

I meant to say:

"Well, the ppa is done compiling. I hope someone, anyone can test it."

Revision history for this message
Damien (takahara) wrote :

Hi Lennie,

Just tried your PPA but it still fails, after just 6 boots.
Log is attached.

Revision history for this message
Damien (takahara) wrote :

BTW, Lennie, does your PPA includes Colin's patch?

Revision history for this message
Lennie (lbesselink) wrote :

Damein,

That's annoying. But atleast now we know, we can ask the kernel for something, which it will give us, but what it gives us is useless.

Interresting.
____

No, it does not include that patch.

I was trying to find out what is going on and fixed my problem with a workaround along the way.

I've contacted upstream to see if we can figure out the right way to fix it, so they can include a patch.

But if the kernel gives us something we can't use then it will probably limit the possibilities.

Revision history for this message
Damien (takahara) wrote :

Lennie,

Maybe we could try a couple of times (20x?) to get a proper console_fd, with a small pause in the loop (0.5s ?)
That way we would know if there is some delay between the fd becoming useless and a new fd being reported.

Revision history for this message
Lennie (lbesselink) wrote :

I'll see if I can create some code that does that.

But I hope to get an answer from upstream how they would like it to be solved.

That way I don't have to create any code which we'll gonna throw away later. :-)

But more importantly they can incorporate it in the upstream version, so others don't encounter the same problems and have to go looking for solutions as well.

Revision history for this message
Lennie (lbesselink) wrote :

I did what you asked and added a loop, 0.2 seconds 40 times.

I uploaded a new ppa, it will need some time to compile. It says 26 minutes to wait till it will compile amd64 and 8 hours before i386 will build.

Revision history for this message
Damien (takahara) wrote :

Thanks, I'm testing your PPA. I have about 60 consecutive successful boots so things look good (I will continue to 200 to be sure).

However, there is nothing suspicious appearing in the logs (ie. no more "got an error: ioctl VT_GETSTATE"). Does it mean that bad FDs are not returned any more?

Also, you said that the FD "3" was coming from another error (cfr the consolekit mailing list). Well, I don't see this #3 anymore: only FD "11" is used for all boots. Could this spurious FD be the real problem? The six good boots I had before the last bad boot (see message #69) were also using FD 11. Looks like going to 11 is the solution :-)

Revision history for this message
Damien (takahara) wrote :

120 reboots and not a miss!

Attached is the log of all reboots (minus the VT_WAITACTIVE message from the 64 threads)

Revision history for this message
Damien (takahara) wrote :

Also, this is the last consolekit log in full

Revision history for this message
Lennie (lbesselink) wrote :

So that could mean that when we keep trying to grab the console_fd it seems to work.

But I'm missing some debug-information, that is a bit strange.

It was a bit late last night so that might be why it is missing. :-)

I hope that is the reason, because otherwise I can't quiet explain it.

Revision history for this message
Damien (takahara) wrote :

Yeah, can't explain that either. If we ask for an FD more than once there should be something mentioned in the logs.

It could also be that I've just been lucky with the reboots. I've had about 100 successive boots before (without any patch) so it's worth trying more tomorrow...

Revision history for this message
Lennie (lbesselink) wrote :

Damien, I do have a question.

Just to make sure, you did use the same hardware/machine to test ?

I think I read you had several of these machines ? Did you maybe do a restore from an image or something like that ? Did you use the same method to do so ?

There was no extra USB-device plugged in or something like that ? Or an external display if this is a laptop.

Because timing seems to be very important to cause the error this bug is about.

Revision history for this message
Damien (takahara) wrote :

Lennie,

Exactly the same machine. I just did apt-get update;apt-get install consolekit. That's all. Not a single hardware change.

Mmmm... But I remember that it only upgraded the consolekit package, not the associated libraries. Could that explain something?

Revision history for this message
Lennie (lbesselink) wrote :

no, I don't think that matters.

I'll have a look at what else I did, maybe I did something wrong and the debug isn't displayed.

Revision history for this message
Damien (takahara) wrote :

I have updated the libs too, just in case. But the logs are expectedly the same.

( Small thing: you used usleep(200) which is not a pause of 0.2 sec but 0.0002 sec. )

Could it be that there's a problem in the packaging? Do you see the new debug warnings on your machine if you install your newest PPA?

Revision history for this message
Damien (takahara) wrote :

Using an hex editor, the new error strings (such as "did not fail ? after %d attempts") show up in the binary of console-kit-daemon.

"which console-kit-daemon" returns the proper location (/usr/sbin), so there should not be any conflict (like multiple binaries, who knows!)

Probably not a packaging issue then.

Revision history for this message
Damien (takahara) wrote :

I have performed 222 reboots with the latest PPA from Lennie, and not a single time did the bug happen. \( ^^)/
(However, this bug is far from being closed since we don't know exactly WHY it's working.)

Revision history for this message
Lennie (lbesselink) wrote :

The most logical explanation is, you had the same 'problem' I had when I turned on --debug for the console-kit-daemon.

It would start logging messages to rsyslogd (which saves it in syslog, daemon, etc.) early and the timing would be off and you would not be able to reproduce it.

Revision history for this message
Damien (takahara) wrote :

But I haven't turned on --debug when starting the daemon. Or is it a compilation option?
If so, did you turn it on in your latest PPA (lbesselink4)?

Also, I don't see how this would explain the lack of messages in the daemon log. From the code I saw even in case of an immediate success, the message "did not fail ? after 1 attempts" should be spotted in the logs.

Clearly I'm missing something ;-)

Revision history for this message
Lennie (lbesselink) wrote :

The debug that I added which says:

Aug 26 17:44:51 tacto-ppp-01 console-kit-daemon[927]: WARNING: ck_get_a_console_fd, choose: /dev/tty0

Is earlier in the startup of console-kit-daemon then the previous versions.

The previous version had the first debug at:
Aug 26 17:44:51 tacto-ppp-01 console-kit-daemon[927]: WARNING: console_fd: 11

What I had with the --debug option on was that it also started sending debug to syslog a lot earlier like my change now.

Any debug that early seems to prevent it failing at all the first time.

The 'did not ? after 1 attempts' is in the loop when it would try again which is after the first try.

Does that make it clear for you ?

Revision history for this message
Damien (takahara) wrote :

Ok, it makes sense now, thanks. I was only looking at the patch "08-" and thought that the "07-" was already used in your previous PPA, so I missed those debug statements.

So if we want to see if the retry loop is effective then we should remove patch 07?

Revision history for this message
Lennie (lbesselink) wrote :

Yes seems like it, I hope to have time for it this evening otherwise it's probably something for tomorrow.

If you know how to build it, you could try it out yourself. :-)

Revision history for this message
moojix (moojix) wrote :

I had the same symptoms as Damien desribed in #9
(no automount of external drives, shutdown or restart was not possible as user).
I used a similar workaround as Sepero pointed out in #10 (ubuntuforum link).

My workaround were the following 2 steps:
edit /etc/init/rc-sysinit.conf
old: "start on filesystem and net-device-up IFACE=lo"
new: "start on filesystem and started rsyslog and net-device-up IFACE=lo"

apt-get install ifupdown --reinstall --purge

Environment:
- Lucid 2.6.32-24-generic #41-Ubuntu SMP Thu Aug 19 01:38:40 UTC 2010 x86_64 GNU/Linux
- NVIDIA UNIX x86_64 Kernel Module 195.36.24

Revision history for this message
moojix (moojix) wrote :

sorry for my last comment #91. This workaround did not work for me.
Same symptoms (no automount external drives, no shutdown etc.) after reboot.

@Lennie I tested now your quick hack ppa from #66.
I have not tested as intensive as Damien, but it works well after 10 reboots on my box.

Revision history for this message
cmat (cmat555) wrote :

The work around in Lennie's PPA seems to be working here. Haven't experienced the problem since using it.

Revision history for this message
Lennie (lbesselink) wrote :

Hi,

moojix and cmat could you post the debug from your logs ?

You possible don't want to post all of it, just the stuff that seems to be different between boots.

I'm especially interrested to know if it says "trying again". You can see it with:
grep "trying again" /var/log/syslog

In that case please include the whole debug of one of those times you did get it at boot (replace the date and time with the right one :-) ):

grep console-kit-daemon /var/log/syslog | grep 'Aug 30 23:02'

Revision history for this message
Damien (takahara) wrote :

Lennie,

I tired to build the debs before, but I haven't been very successful. I now have a very good reason to try again though :-) Stay tuned...

Revision history for this message
Damien (takahara) wrote :

moojix, cmat,

To isolate the interesting parts of the log, you can also try this (rather awful) command:

cat /var/log/daemon.log | egrep "console-kit-daemon\[[0-9]{3-4}\]\:.WARNING:.[^V]"

(it finds the console kit messages in the logs, but removes the 64 VT_WAITACTIVE messages from the threads)

Revision history for this message
Damien (takahara) wrote :

I think I have some good news...

I first tried Lennie's PPA, but without patches #5, #6 and #7 (IOW, only patch #8). I'm actually not exactly sure what patches were applied, because the "skipping :%d" messages were still there. But the message of patch #7 was not there any more, which is the important thing (see #88).

Without that message, the bug reappeared after about 20 reboots and consolekit went through the retry-loop. However, the delay in the retry loop being very small (200usec), the loop did not have any effect: all 30 retries failed.

After much messing with the patches, I restarted with a clean source tgz and applied the patch 08 ONLY. I'm sure this time, 'cause I applied the patch manually on the sources (!). I also changed the usleep time to 200000 (0.2sec). After a few reboots the bug did not reappear, so I checked the log... only to see that the loop was working! In fact, the bug was triggered more often now that no extra debug messages were printed. But every time the loop fixed the problem. Banzai!

Interestingly, only 2 retries were necessary (which means a 0.2 second delay). Hence the characteristic time on which the bug occurs is [ 6ms < t < 200ms ]. Here's the log, which is now very compact:

Aug 31 12:52:48 tacto-ppp-01 console-kit-daemon[954]: WARNING: Could not determine active console
Aug 31 12:52:48 tacto-ppp-01 console-kit-daemon[954]: WARNING: console_fd: 3
Aug 31 12:52:48 tacto-ppp-01 console-kit-daemon[954]: WARNING: tried but failed
Aug 31 12:52:49 tacto-ppp-01 console-kit-daemon[954]: WARNING: console_fd: 12
Aug 31 12:52:49 tacto-ppp-01 console-kit-daemon[954]: WARNING: did not fail ? after 2 attempts
Aug 31 13:01:49 tacto-ppp-01 console-kit-daemon[864]: WARNING: Could not determine active console
Aug 31 13:01:49 tacto-ppp-01 console-kit-daemon[864]: WARNING: console_fd: 3
Aug 31 13:01:49 tacto-ppp-01 console-kit-daemon[864]: WARNING: tried but failed
Aug 31 13:01:50 tacto-ppp-01 console-kit-daemon[864]: WARNING: console_fd: 12
Aug 31 13:01:50 tacto-ppp-01 console-kit-daemon[864]: WARNING: did not fail ? after 2 attempts

If I'm not mistaken no warnings are printed at all when consolekit works well, and only a few lines are printed when the retry loop saves the day.

Lennie, could you make a nice&clean PPA with your patch #08 only (and change the usleep delay to 0.2sec)? I'm afraid my packaging skills are not exactly good enough for everyone to play with my debs.

Revision history for this message
Lennie (lbesselink) wrote :

That sounds very promissing.

But I've been pondering, maybe it needs more work to be sure.

Revision history for this message
Damien (takahara) wrote :

Yes, knowing why the console FD is wrong would be very nice... Then upstream could fix the bug in a more appropriate fashion.
But in the meantime it will at least make Ubuntu usable for some of us.

Also, the original sources had this "FIXME" warning where you added the retry loop, showing that something had to be done there. So even if the root of the problem is fixed upstream, keeping the retry loop in the code would still be a plus, IMHO.

Just out of curiosity, I will also try with a shorter sleep time to determine the "race time" more accurately.

Changed in consolekit (Ubuntu Lucid):
status: Triaged → In Progress
Revision history for this message
moojix (moojix) wrote :

Lennie, Damien,

"trying again" is not written on my box

zgrep "trying again" /var/log/* | wc -l
0

grep "console-kit" /var/log/syslog.1 | grep "Aug 31" | grep -v "VT_WAITACTIVE"
Aug 31 07:51:10 st-002 console-kit-daemon[1369]: WARNING: ck_get_a_console_fd, choose: /dev/tty0
Aug 31 07:51:10 st-002 console-kit-daemon[1369]: WARNING: console_fd: 11
Aug 31 07:51:10 st-002 console-kit-daemon[1369]: WARNING: input of ck_get_active_console_num: 11
Aug 31 07:51:10 st-002 console-kit-daemon[1369]: WARNING: Current VT: tty7
Aug 31 07:51:10 st-002 console-kit-daemon[1369]: WARNING: skipping: 7

Revision history for this message
Damien (takahara) wrote :

moojix,

look for "try but failed", this is the actual string.

Revision history for this message
Lennie (lbesselink) wrote :

Damien I think the text was actually "tried but failed", but I doubt moojix has that text as it will not be reached if moojix didn't see: "trying again"

Anyhow, I made some more changes and a new ppa.

The ppa is building at this moment.

Revision history for this message
Lennie (lbesselink) wrote :

I downloaded the ppa on my laptop and it didn't break it.

But I wasn't able to reproduce the problem which we are trying to fix. So I wasn't able to test the fix/workaround.

Revision history for this message
Damien (takahara) wrote :

Thanks Lennie, I'll try it tomorrow.

I am now doing quite a few reboot cycles to see how the retry delay varies. Early results show large differences: between 20ms and almost two seconds.

Revision history for this message
Lennie (lbesselink) wrote :

Damien,

Let me know what you'll find, because then we'll get a good idea of how bad it really is.

The new ppa has a loop which has 30 times 0.2 seconds (I changed the usleep also this time).

So that's 6 seconds, that ought to be enough ?

If you have any other code/suggestions let me know.

Revision history for this message
Damien (takahara) wrote :

Lennie,

I finished rebooting "a few times" and here's a histogram of what I've got.

It basically falls into 3 categories: less than 100ms (majority of cases), about 600~1000ms and close to 2000ms. I could not get anything >2 seconds, which is a good sign.

Given the wide range I would suggest starting the usleep value at 20ms and multiplying it by 2 every time it fails, with the max delay of 1 sec (what do you mean I'm going to 11? ;) ). That way we keep a good reaction time without polling hundreds of times. So we would check at the following accumulated times: 0 (first retry), 20, 60, 140, 300, 620, 1260, 2260, 3260,..., 10260. If it fails after those 10 seconds then we should have a phatt warning in the logs. A debug message telling the total time of the waiting loop would be nice too (if the loop is triggered, of course)

Revision history for this message
moojix (moojix) wrote :

just FYI:
the latest ppa (0.4.1-3ubuntu1-lbesselink7) works for me.

grep "but failed" /var/log/syslog | wc -l
0

grep "console-kit" /var/log/syslog | grep -v "VT_WAITACTIVE"
Sep 1 11:49:26 st-002 console-kit-daemon[1368]: WARNING: did not fail
Sep 1 11:49:26 st-002 console-kit-daemon[1368]: WARNING: skipping: 7

Revision history for this message
Damien (takahara) wrote :

moojix,

How often did the bug appeared with the stock ubuntu packages? If it was not systematic you'll need to reboot a few (or a lot) of times to trigger the bug and see how it's handled by Lennie's PPA.

Revision history for this message
moojix (moojix) wrote :

Damien,

i had the symptoms (no automount of external drives and shutdown/restart not working as user) described above in this bug at
99% of my reboots (On my new box since 4 weeks nearly with every daily boot - it's a workstation, not a server). As I had other priorities on this new box (i870 with ssd) I had just ignored this UI shutdown-problem ( sudo shutdown -h 0 works well :) ).
As i am not familiar with consolekit and boot process, I have no clue for which string or which startup sequence I need to look with the stock ubuntu consolekit package.
I have not systematic tested with the stock Ubuntu package, what's the difference in the error-case and the good-case was.

If Lennie is interested in, he can build a ppa with debug logging, but without the re-try fix, so I can try to reproduce the bug on my box with a nearly original Lucid-consolekit package, but with more logging.

Additionally what could be helpful: I will reboot several times with the original stock Ubuntu consolekit package to be sure, that I can reproduce the bug on this box (to validate my 99% failure rate without any fix...).

Revision history for this message
moojix (moojix) wrote :

okay, I've used the original Lucid consolekit package (0.4.1-3ubuntu1) for some reboot tests.
I can reproduce the symptoms: 1. reboot: okay - 2. .. 5. reboot failed with the symptoms.

1. reboot syslog (ok)
egrep "console" /var/log/syslog_20100901_2 | grep -v activation
Sep 1 18:16:04 st-002 rsyslogd-2039: Could no open output file '/dev/xconsole' [try http://www.rsyslog.com/e/2039 ]
Sep 1 18:16:04 st-002 kernel: [ 0.000000] console [tty0] enabled

2. reboot syslog (failed)
egrep "console" /var/log/syslog_20100901_3 | grep -v activation
Sep 1 18:20:03 st-002 rsyslogd-2039: Could no open output file '/dev/xconsole' [try http://www.rsyslog.com/e/2039 ]
Sep 1 18:20:03 st-002 kernel: [ 0.000000] console [tty0] enabled
Sep 1 18:20:03 st-002 init: Failed to open system console: Input/output error
Sep 1 18:20:04 st-002 console-kit-daemon[1050]: WARNING: Could not determine active console

Good. Now I'll try to reproduce this timing problem with Lennies PPA.

Revision history for this message
moojix (moojix) wrote :

I've repeated the tests with Lennies ppa (0.4.1-3ubuntu1-lbesselink7)

In 9/10 reboots this is my syslog:
grep console /var/log/syslog | grep -v VT_WAITACTIVE
Sep 1 19:00:26 st-002 rsyslogd-2039: Could no open output file '/dev/xconsole' [try http://www.rsyslog.com/e/2039 ]
Sep 1 19:00:26 st-002 kernel: [ 0.000000] console [tty0] enabled
Sep 1 19:00:26 st-002 init: Failed to open system console: Input/output error
Sep 1 19:00:30 st-002 console-kit-daemon[1364]: WARNING: did not fail
Sep 1 19:00:30 st-002 console-kit-daemon[1364]: WARNING: skipping: 7

in 1/10 reboot this is my syslog:
grep console /var/log/syslog_20100901_9 | grep -v VT_WAITACTIVE
Sep 1 18:58:08 st-002 rsyslogd-2039: Could no open output file '/dev/xconsole' [try http://www.rsyslog.com/e/2039 ]
Sep 1 18:58:08 st-002 kernel: [ 0.000000] console [tty0] enabled
Sep 1 18:58:09 st-002 console-kit-daemon[1147]: WARNING: did not fail
Sep 1 18:58:09 st-002 console-kit-daemon[1147]: WARNING: skipping: 1
Sep 1 18:58:09 st-002 console-kit-daemon[1147]: WARNING: skipping: 7

In all 10/10 reboots I have no symptoms as I've got with the original stock Lucid package (missing external drives etc.).
I do not understand how "init: Failed to open system console: Input/output error" is related.

Revision history for this message
Lennie (lbesselink) wrote :

moojix do you have one that does not say the following ?: did not fail

Revision history for this message
moojix (moojix) wrote :

Lennie,
no, unfortunately zero. all say with your ppa "did not fail".

Revision history for this message
Lennie (lbesselink) wrote :

Then I don't think I actually solved it, just slowed it down enough so it didn't fail.

I'm already looking at a diff between 0.4.1-3ubuntu1 and 0.4.1-3ubuntu1-lbesselink7

This morning the strace I tried also looked good. I don't think it's the debug/warning logging to syslog.

Maybe it is the calls to the debug-functions ? I did already change them from warning to debug, so it would not try to log anything through syslog.

An other day, an other ppa.

Revision history for this message
Lennie (lbesselink) wrote :

After a whole evening running tests and no failures (and thus no way to test the fix), the new ppa gave this result after less then 10 reboots:

Sep 2 00:21:06 laptop console-kit-daemon[1111]: WARNING: tried but failed
Sep 2 00:21:06 laptop console-kit-daemon[1111]: WARNING: eventually choose to use 12
Sep 2 00:21:06 laptop console-kit-daemon[1111]: WARNING: first fd was 3
Sep 2 00:21:06 laptop console-kit-daemon[1111]: WARNING: retried 1 times

So the workaround/test works for me.

Revision history for this message
moojix (moojix) wrote :

Lennie, I made 10 reboot tests. Looks good to me:

8x:
Sep 2 09:16:04 st-002 kernel: [ 0.000000] console [tty0] enabled
Sep 2 09:16:04 st-002 init: Failed to open system console: Input/output error
Sep 2 09:16:05 st-002 console-kit-daemon[1085]: WARNING: tried but failed
Sep 2 09:16:05 st-002 console-kit-daemon[1085]: WARNING: eventually choose to use 12
Sep 2 09:16:05 st-002 console-kit-daemon[1085]: WARNING: first fd was 3
Sep 2 09:16:05 st-002 console-kit-daemon[1085]: WARNING: retried 1 times
Sep 2 09:16:05 st-002 console-kit-daemon[1085]: WARNING: skipping: 7

2x
Sep 2 09:17:40 st-002 kernel: [ 0.000000] console [tty0] enabled
Sep 2 09:17:41 st-002 console-kit-daemon[1161]: WARNING: did not fail
Sep 2 09:17:41 st-002 console-kit-daemon[1161]: WARNING: skipping: 1
Sep 2 09:17:41 st-002 console-kit-daemon[1161]: WARNING: skipping: 7

0.4.1-3ubuntu1-lbesselink9 works for me, too.

Revision history for this message
Lennie (lbesselink) wrote :

Good very good, so now I can get back to talking to upstream about this. How they want to put in a real fix.

Revision history for this message
Damien (takahara) wrote :

Thanks Lennie! (^-^)v

I just understood why the console #3 was popping up all the time. That brings two newbie questions:

- why is nothing done if all no console could be obtained? IMHO the first warning should be printed from ck_get_a_console_fd. In fact, shouldn't the retry loop be in ck_get_a_console_fd?

- why is the console fd search stopped at 3? We often have values above that (11, 12...) so would it work if we just extended the search to 12 (or whatever the limit is)?

Revision history for this message
Lennie (lbesselink) wrote :

Damien, all good questions, but I got a better response from upstream:

Atleast /dev/tty0 should work, find out why it isn't.

Revision history for this message
Damien (takahara) wrote :

Well, I just answered my second question by trying it. It doesn't work.

But I still think that ck_get_a_console_fd should return ERROR and not a bogus fd.
This is fine tuning though...

Revision history for this message
Lennie (lbesselink) wrote :

Concerning your last question, I think a default kernel might have something like 65536 possible values for that and you don't want to check them all. Probably it can even be raised with a kernel-compile.

Revision history for this message
Lennie (lbesselink) wrote :

Your 3rd question:

No1 likes that part of the code, I've got the feeling that part will go away.

Revision history for this message
Colin Watson (cjwatson) wrote :

With reference to comment 110 in particular, I wonder if this is a consequence of bug 554172.

Revision history for this message
Colin Watson (cjwatson) wrote :

How about version 0.4.1-4ubuntu1~ppa2 in https://launchpad.net/~cjwatson/+archive/ppa? It should build soon. This retries the open() of console fds on EIO.

Revision history for this message
Colin Watson (cjwatson) wrote :

Could anyone please test the packages in that PPA? If it works and I get positive feedback, I can probably still just about squeeze it into 10.10.

Revision history for this message
moojix (moojix) wrote :

Colin, sorry. At the moment I can test for Lucid only. Anyone else?

Revision history for this message
Lennie (lbesselink) wrote :

Ohh, hi Colin, sorry I missed your first 2 messages.

I have a few remarks:

1. won't the timing be different on Maverick ? You still need to be able to reproduce the problem first.

2. Would it not be a good idea to build a ppa for Lucid so people can test that ?

3. I've been really busy with other things. I'm sad to say. I have some code I wanted to test to find out what the cause of the problem is and get a fix upstream.

4. Maybe I should schedule some time for that tonight or tomorrow night. As it's only 12 days left.

Revision history for this message
Colin Watson (cjwatson) wrote :

I've uploaded a matching Lucid version.

Revision history for this message
Colin Watson (cjwatson) wrote :

Lennie, I suspect that if you read through bug 554172 it will be enlightening. It explains why opening /dev/console (/dev/tty0 will be the same) can sometimes fail with EIO.

Revision history for this message
Lennie (lbesselink) wrote :

It was quiet a long bug report, but it made sense to me.

I wasn't able to test, the laptop I use for testing was in use this evening.

Revision history for this message
moojix (moojix) wrote :

I rebooted my box (Lucid) 10 times.
Good, I cannot reproduce the symptoms described in #9 or #109.
zero errors. all okay. no fail or error in syslog.
The Lucid consolekit package 0.4.1-3ubuntu2~ppa1 from cjwatson (#124 and #128) works for me.

Revision history for this message
Colin Watson (cjwatson) wrote :

OK, I'll go ahead then, and we can revisit if further problems arise. Thanks, and particular thanks to Lennie for tireless debugging.

Revision history for this message
Colin Watson (cjwatson) wrote :

I'll backport this to Lucid after it's spent some time being tested in Maverick.

Changed in consolekit (Ubuntu Maverick):
assignee: nobody → Colin Watson (cjwatson)
milestone: none → ubuntu-10.10
Changed in consolekit (Ubuntu Lucid):
assignee: Chris Coulson (chrisccoulson) → Colin Watson (cjwatson)
Changed in consolekit (Ubuntu Maverick):
status: Triaged → In Progress
Revision history for this message
Lennie (lbesselink) wrote :

Thanks Colin.

I had some time and access to the laptop again.

So I created a small patch on top of your patch to prove it what happends.

I did 10 boots and 4 times the code was triggered, 6 times it did not trigger,
but it seems those were the times it did not need to be triggered.

So I think that proves this works.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package consolekit - 0.4.1-4ubuntu1

---------------
consolekit (0.4.1-4ubuntu1) maverick; urgency=low

  * 10-retry_console_open_eio.patch: Retry console opens if they return EIO,
    since this may happen while a tty is closing (see LP #554172, and thanks
    to Lennie and others for extensive debugging work; LP: #544139).
 -- Colin Watson <email address hidden> Thu, 30 Sep 2010 14:49:11 +0100

Changed in consolekit (Ubuntu Maverick):
status: In Progress → Fix Released
Revision history for this message
Colin Watson (cjwatson) wrote :

Thanks a lot, Lennie!

Revision history for this message
Colin Watson (cjwatson) wrote :
description: updated
Revision history for this message
Martin Pitt (pitti) wrote : Please test proposed package

Accepted consolekit into lucid-proposed, the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you in advance!

Changed in consolekit (Ubuntu Lucid):
status: In Progress → Fix Committed
tags: added: verification-needed
Revision history for this message
Martin Pitt (pitti) wrote :

It seems nobody forwarded this upstream? *sigh*, I'll clean up after this and do that, so that we can get rid of that delta.

Martin Pitt (pitti)
affects: gdm → consolekit
Changed in consolekit:
importance: Undecided → Unknown
status: New → Unknown
Changed in consolekit:
status: Unknown → Confirmed
Revision history for this message
Martin Pitt (pitti) wrote :

Any testers for the lucid update? At least to verify that it still works?

Revision history for this message
Jean-Baptiste Lallement (jibel) wrote :

Verification for Lucid.

I've verified that the package upgrades correctly from a default Lucid installation and that after the installation the system reboots, that X and the network are working. If there are specific verifications to do, let me known.

Marking as verification-done.

tags: added: verification-done
removed: verification-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package consolekit - 0.4.1-3ubuntu2

---------------
consolekit (0.4.1-3ubuntu2) lucid-proposed; urgency=low

  * 10-retry_console_open_eio.patch: Retry console opens if they return EIO,
    since this may happen while a tty is closing (see LP #554172, and thanks
    to Lennie and others for extensive debugging work; LP: #544139).
 -- Colin Watson <email address hidden> Wed, 10 Nov 2010 12:18:29 +0000

Changed in consolekit (Ubuntu Lucid):
status: Fix Committed → Fix Released
Changed in consolekit:
importance: Unknown → Medium
Revision history for this message
Martin Pitt (pitti) wrote :

Upstream requests updating this patch:

----------------
Hmm, so I have discussed this with some folks, and it seems we do need
something like this as the kernel will not help us here.

However, I don't think the patch is acceptable: there are other reasons why EIO
might be returned. Since we cannot distuingish the error causes we need at
least a safe way out. My suggestion would be to retry opening 20 times and then
give up. In systemd I have settled to use 20 times with 50ms in between,
totalling to 1s in total.

Could you please update the patch? I'll then merge it.
-------------------

Revision history for this message
Carlos Felipe Forigua Rodríguez (carlos-forig) wrote :

Didn't work. Installed thunar and i can mount filesystems as a normal user using thunar

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Related questions

Remote bug watches

Bug watches keep track of this bug in other bug trackers.