cloud-init not restarting ssh service after writing sshd_config

Bug #1998526 reported by Anh Vo (MSFT)
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
cloud-init
Fix Released
High
James Falcon

Bug Description

cloud-init 22.2 introduced a race condition with ssh.service since it added a systemctl call to check if service is stopped/not-started. If ssh.service starts at around the same time of the check, cloud-init believes that the service is stopped and does not need to be restarts, proceeds to write the sshd_config. The writing of sshd_config might happen after the ssh.service has started, especially if there's significant delay in the systemctl status call (the delay would have given ssh.service plenty of time to startup)

I believed this is the commit that introduced the issue:
https://github.com/canonical/cloud-init/commit/5054ffe188

I've attached the cloud-init.log and the auth.log showing the time ssh service started.

From cloud-init.log - the call to check ssh service status happened at 22:44:43,630, when cloud-init wrote the file sshd_config at 22:44:51, ssh service already started. There's a strange 8s delay from systemctl that might have to do with systemd or the condition of the VM. Regardless, the race is definitely there.

2022-11-22 22:44:43,630 - subp.py[DEBUG]: Running command ['systemctl', 'status', 'ssh'] with allowed return codes [0] (shell=False, capture=True)
2022-11-22 22:44:51,116 - cc_set_passwords.py[DEBUG]: Writing config 'ssh_pwauth: True'. SSH service 'ssh' will not be restarted because it is stopped.
2022-11-22 22:44:51,116 - util.py[DEBUG]: Reading from /etc/ssh/sshd_config (quiet=False)
2022-11-22 22:44:51,116 - util.py[DEBUG]: Read 3546 bytes from /etc/ssh/sshd_config
2022-11-22 22:44:51,116 - ssh_util.py[DEBUG]: line 55: option PasswordAuthentication updated no -> yes
2022-11-22 22:44:51,117 - util.py[DEBUG]: Writing to /etc/ssh/sshd_config - wb: [600] 3547 bytes

Here's what the customer gave us from their VM:

ssh.service - OpenBSD Secure Shell server
   Loaded: loaded (/lib/systemd/system/ssh.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2022-11-22 22:44:43 UTC; 6 days ago

sshd_config file changed after only few seconds.

-rw------- 1 root root 3547 2022-11-22 22:44:51.113697898 +0000 /etc/ssh/sshd_config

Revision history for this message
Anh Vo (MSFT) (vtqanh) wrote :
Revision history for this message
Chad Smith (chad.smith) wrote (last edit ):

Confirmed race in logs comparison between auth.log and the systemctl status output info vs cloud-init.log thanks Anh.

It is strange though that we see an 8 second delay getting the response from the `systemctl status ssh` check and the immediate log line that follows that response. T

We can do one of two things:
 1. shift the point in time for our systemctl status ssh to inspect and restart ssh service only after we are certain we have successfully made ssh config changes.
 2. check something akin to `systemctl is-enabled ssh`, but we would also need to handle socket-activated ssh status on Ubuntu systems in 22.10 and later as systemctl is-enabled ssh will return 'disabled' in those states.

Minimally, I think we reduce exposure to this particular race by moving the logic for the status ssh check after the write/update of sshd_config.

But, ideally, we probably want cloud-init to grow awareness of whether or not ssh service is enabled, but not yet started.
This solution will likely take a bit of time and touch implementation details for both systemd and non-systemd environments via cloudinit.distros.manage_service()

Changed in cloud-init:
status: New → Triaged
Revision history for this message
Chad Smith (chad.smith) wrote :

It'd be interesting to see journalctl -b 0 on a system affected by this if available. This can be obtained from `sudo cloud-init collect-logs` on an affected system.

That said, I think we can still improve this solution a bit to mitigate this problem.

Changed in cloud-init:
importance: Undecided → High
Revision history for this message
Anh Vo (MSFT) (vtqanh) wrote :

Unfortunately we no longer have access to the VM to retrieve the journal log.

Revision history for this message
Chad Smith (chad.smith) wrote :

Upstream PR which plans to address this bug
https://github.com/canonical/cloud-init/pull/1909

ongoing discussion about the approach we should take, but this is expected to be fixed in next release.

Changed in cloud-init:
assignee: nobody → James Falcon (falcojr)
status: Triaged → In Progress
Revision history for this message
Alberto Contreras (aciba) wrote : Fixed in cloud-init version 23.1.

This bug is believed to be fixed in cloud-init in version 23.1. If this is still a problem for you, please make a comment and set the state back to New

Thank you.

Changed in cloud-init:
status: In Progress → Fix Released
Revision history for this message
James Falcon (falcojr) wrote :
Revision history for this message
shixuantong (sxt1001) wrote :
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.