Comment 39 for bug 1834875

Dan Watkins (oddbloke) wrote :

[N.B. I wrote the below before I saw Ryan's comment, so there is some repetition.]

OK, I've spent some time catching up on this properly so I can summarise: per comment #24, the issue is that when udev processes the events emitted by the kernel, it (sometimes) doesn't determine the correct partition information. The kernel _does_ emit all the events we would expect, and udev _does_ handle all the events we would expect (which is to say that `udevadm settle` doesn't change behaviour here; it merely ensures that the broken behaviour has completed before we proceed). The hypothesised race condition is somewhere between the kernel and udev: I believe the kernel event is emitted before the partition table has necessarily been fully updated, so when udev processes the event and reads the partition table, sometimes it finds the partition and sometimes it doesn't. To be clear, the kernel event generation and the buggy udev event handling both happen as a result of the resize command, _not_ as a result of anything else cloud-init runs subsequently.
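For anyone trying to confirm which state udev ended up in after a resize, here is a rough sketch of the sort of check I mean. It is not cloud-init code: the function names are made up for illustration, and the choice of `ID_PART_ENTRY_UUID` as the property that signals "udev read the partition entry" is my assumption rather than something established in this bug.

```python
import subprocess


def parse_udev_properties(text):
    """Parse KEY=VALUE lines (as printed by udevadm info) into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip blank or malformed lines
            props[key.strip()] = value
    return props


def udev_has_partition_info(props, key="ID_PART_ENTRY_UUID"):
    """Return True if udev recorded the given partition property.

    The default key is an assumption; pick whichever property is
    actually missing in the failing case.
    """
    return key in props


def query_udev_properties(devname):
    """Ask udev what it currently knows about a device node."""
    result = subprocess.run(
        ["udevadm", "info", "--query=property", "--name=" + devname],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_udev_properties(result.stdout)
```

Running `udev_has_partition_info(query_udev_properties("/dev/sda1"))` after `udevadm settle` returns would show whether udev recorded the partition on that run (the device name is just an example).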

So as far as I can tell, this bug would occur regardless of what runs the resize command, and no matter what commands are executed after the resize command. (It might be possible to work around this bug by issuing commands that force a re-read of the partition table on a disk, for example, but this bug _would_ still have occurred before then.)
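To illustrate the kind of workaround I have in mind (and only to illustrate it), something like the following would ask the kernel to re-read the partition table and then wait for udev to finish. The helper names are hypothetical, the use of `partprobe` is my own pick of tool, and I have not tested whether this actually avoids the race.

```python
import subprocess


def reread_commands(disk):
    """Commands that request a partition table re-read, then wait for udev.

    Using partprobe here is an assumption; other tools could be swapped in.
    """
    return [
        ["partprobe", disk],
        ["udevadm", "settle"],
    ]


def force_reread(disk, run=subprocess.run):
    """Run the re-read commands in order, raising if any of them fails."""
    for cmd in reread_commands(disk):
        run(cmd, check=True)
```

Note that this only papers over the problem after the fact: the original events would already have been mishandled by the time it runs.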

cloud-init could potentially work around a (kernel|systemd) that isn't handling partitions correctly, but we really shouldn't have to. Until we're satisfied that they cannot actually be fixed, we shouldn't do that. (I am _not_ convinced that this cannot be fixed in (the kernel|systemd), because using a different kernel and using a different udevadm have both caused the issue to stop reproducing.)

So, let me be a little more categorical. The information we have at the moment indicates an issue in the interactions between the kernel and udev on partition resize. cloud-init's involvement is merely as the initiator of that resize. Until we have more information that indicates the issue to be in cloud-init, this isn't a valid cloud-init issue. Once we have more information from the kernel and/or systemd folks, if it indicates that cloud-init _is_ at fault, please move this back to New.