status symlink non-atomicity traceback with status --wait

Bug #1962150 reported by Adam Collard
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
cloud-init
Fix Released
Medium
Adam Collard

Bug Description

MAAS run system-tests on a regular basis using LXD containers to set things up.

A recent run failed waiting for the LXD container to finish booting, with a traceback from cloud-init.

After launching the container, the script runs `timeout 2000 cloud-init status --wait --long` to ensure that the commands we pass through as user-data are complete.

Here's a redacted snippet from the logs

 2022-02-23 17:21:00 INFO : Waiting for boot to finish...
 2022-02-23 17:21:00 INFO timeout 2000 cloud-init status --wait --long
 2022-02-23 17:21:04 INFO ..........
 2022-02-23 17:21:04 INFO Traceback (most recent call last):
 2022-02-23 17:21:04 INFO File "/usr/bin/cloud-init", line 11, in <module>
 2022-02-23 17:21:04 INFO load_entry_point('cloud-init==21.4', 'console_scripts', 'cloud-init')()
 2022-02-23 17:21:04 INFO File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 927, in main
 2022-02-23 17:21:04 INFO retval = util.log_time(
 2022-02-23 17:21:04 INFO File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 2472, in log_time
 2022-02-23 17:21:04 INFO ret = func(*args, **kwargs)
 2022-02-23 17:21:04 INFO File "/usr/lib/python3/dist-packages/cloudinit/cmd/status.py", line 60, in handle_status_args
 2022-02-23 17:21:04 INFO status, status_detail, time = _get_status_details(init.paths)
 2022-02-23 17:21:04 INFO File "/usr/lib/python3/dist-packages/cloudinit/cmd/status.py", line 123, in _get_status_details
 2022-02-23 17:21:04 INFO status_v1 = load_json(load_file(status_file)).get('v1', {})
 2022-02-23 17:21:04 INFO File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 1361, in load_file
 2022-02-23 17:21:04 INFO with open(fname, 'rb') as ifh:
 2022-02-23 17:21:04 INFO FileNotFoundError: [Errno 2] No such file or directory: '/run/cloud-init/status.json'

You can see from the ... that there are a few successful attempts to wait, but then it fails.

Revision history for this message
Adam Collard (adam-collard) wrote (last edit ):

Reading the code - https://github.com/canonical/cloud-init/blob/main/cloudinit/cmd/status.py#L141-L144 - , the status command tries to guard against the file not existing, but clearly this run slipped through - the file existed when we os.path.exists() it, but not when we open() it :|

Looking deeper, we can see that cmd.main.status_wrapper atomically writes JSON (yay!) to the 'data_d' (/var/lib/cloud/data) then symlinks that status file in the 'link_d' (/run/cloud-init). All seems reasonable, but let's look at how it does that symlinking:

https://github.com/canonical/cloud-init/blob/main/cloudinit/cmd/main.py#L755-L758

Note the `force=True`, and refer to the implementation of sym_link:

https://github.com/canonical/cloud-init/blob/2837b835f101d81704f018a4f872b1d660eb6f3e/cloudinit/util.py#L1887-L1891

.. which deletes the symlink, then re-creates it.

This is non-atomic and entirely possible for readers (such as `--wait`) to see the entry in the FS, before it gets deleted, the deletion to occur and then open it.

Revision history for this message
Adam Collard (adam-collard) wrote :
Revision history for this message
James Falcon (falcojr) wrote :

Good find! Thanks for providing a PR too.

Changed in cloud-init:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Chad Smith (chad.smith) wrote :

Reproducible with the following on a system with cloud-init:

echo > reproducer.py <<EOF
from cloudinit.util import sym_link, write_file

source_file = "/tmp/test"
link_file = "/run/cloud-init/status.json"
write_file(source_file, "{}")

while True:
    sym_link(source_file, link_file, force=True)
EOF

# terminal 1
python3 ./reproducer.py

# terminal 2
cloud-init status --wait
.......
...
    with open(fname, 'rb') as ifh:
FileNotFoundError: [Errno 2] No such file or directory: '/run/cloud-init/status.json'

Revision history for this message
Chad Smith (chad.smith) wrote :
Changed in cloud-init:
status: Triaged → Fix Committed
assignee: nobody → Adam Collard (adam-collard)
Brett Holman (holmanb)
Changed in cloud-init:
status: Fix Committed → Fix Released
Revision history for this message
James Falcon (falcojr) wrote :
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.