Comment 34 for bug 470776

Michael Palmer (mp4) wrote :

Hi Steve,

Thanks for your work on this.

Just to make sure we're on the same page (since you mentioned the problem doesn't happen for you): this bug is not just about warning messages. It actually drops you into the rescue shell, and the boot process stops there, waiting for console input.

This happens on about 10% of boots per node for me (I gather it's when some race condition occurs), which makes unattended boot of a cluster unusable. E.g., I have 20 nodes in a cluster, so almost always one or more nodes fail to boot.

I am mounting /home and /usr/local over NFS, with these /etc/fstab entries:
evoa1:/home /home nfs rw 0 0
evoa1:/usr/local /usr/local nfs rw 0 0

I tried the workaround suggested above (putting noauto in /etc/fstab and then mounting the directories in /etc/rc.local). However, users can then ssh into the nodes before their home directories are mounted, and that causes other problems. (We are running some job queuing software, and queued jobs may try to start as soon as a node is up.)
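
For reference, here is a minimal sketch of what the rc.local side of that workaround could look like. It assumes the two fstab entries have been changed to use "rw,noauto". The retry loop and the names mount_with_retry, MOUNT_CMD, RETRIES and DELAY are hypothetical, added only for illustration; they are not part of any Ubuntu tooling, and the retry and delay values are guesses.

```shell
#!/bin/sh
# Sketch only: helper for /etc/rc.local when the NFS entries in
# /etc/fstab carry the "noauto" option (assumed, e.g. "rw,noauto").
# MOUNT_CMD, RETRIES and DELAY are hypothetical knobs for illustration.
MOUNT_CMD="${MOUNT_CMD:-mount}"
RETRIES="${RETRIES:-5}"
DELAY="${DELAY:-2}"

# Try to mount the fstab entry for directory $1, retrying up to
# $RETRIES times and sleeping $DELAY seconds after each failure.
mount_with_retry() {
    dir="$1"
    attempt=1
    while [ "$attempt" -le "$RETRIES" ]; do
        if "$MOUNT_CMD" "$dir"; then
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$DELAY"
    done
    echo "giving up on $dir after $RETRIES attempts" >&2
    return 1
}

# In /etc/rc.local, before the final "exit 0", one would then call:
#   mount_with_retry /home
#   mount_with_retry /usr/local
```

Note that this does nothing about the ordering problem described above: rc.local runs late, so sshd and the job queue can already be accepting work before the mounts succeed.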

A workaround would be helpful, e.g., just knowing the right place to put a "sleep 30" to reduce the frequency of the race condition.

Thanks,

Mike