Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TimeoutStartUSec too low, or MicroVMs taking a little while be to start #317

Open
pwaller opened this issue Dec 31, 2024 · 3 comments
Open

Comments

@pwaller
Copy link

pwaller commented Dec 31, 2024

I have a machine with (less than 10) microvms on it. When the machine is rebooted daily, the log looks like:

Dec 31 09:33:22 foo systemd[1]: Starting MicroVM 'machine-d'...
Dec 31 09:33:22 foo systemd[1]: Starting MicroVM 'machine-b'...
Dec 31 09:33:22 foo systemd[1]: Starting MicroVM 'machine-e'...
Dec 31 09:33:22 foo systemd[1]: Starting MicroVM 'machine-a'...
Dec 31 09:33:22 foo systemd[1]: Starting MicroVM 'machine-f'...
Dec 31 09:33:22 foo systemd[1]: Starting MicroVM machine-c'...
Dec 31 09:34:05 foo systemd[1]: Started MicroVM 'machine-a'.
Dec 31 09:34:52 foo systemd[1]: Started MicroVM machine-c'.
Dec 31 09:34:53 foo systemd[1]: [email protected]: Failed with result 'timeout'.
Dec 31 09:34:53 foo systemd[1]: Failed to start MicroVM 'machine-b'.
Dec 31 09:34:53 foo systemd[1]: [email protected]: Failed with result 'timeout'.
Dec 31 09:34:53 foo systemd[1]: Failed to start MicroVM 'machine-d'.
Dec 31 09:34:53 foo systemd[1]: [email protected]: Failed with result 'timeout'.
Dec 31 09:34:53 foo systemd[1]: Failed to start MicroVM 'machine-f'.
Dec 31 09:34:53 foo systemd[1]: [email protected]: Failed with result 'timeout'.
Dec 31 09:34:53 foo systemd[1]: Failed to start MicroVM 'machine-e'.
Dec 31 09:35:00 foo systemd[1]: Starting MicroVM 'machine-b'...
Dec 31 09:35:00 foo systemd[1]: Starting MicroVM 'machine-f'...
Dec 31 09:35:00 foo systemd[1]: Starting MicroVM 'machine-d'...
Dec 31 09:35:00 foo systemd[1]: Starting MicroVM 'machine-e'...
Dec 31 09:35:44 foo systemd[1]: Started MicroVM 'machine-e'.
Dec 31 09:35:44 foo systemd[1]: Started MicroVM 'machine-f'.
Dec 31 09:35:50 foo systemd[1]: Started MicroVM 'machine-b'.
Dec 31 09:34:53 foo systemd[1]: [email protected]: Failed with result 'timeout'.
Dec 31 09:36:31 foo systemd[1]: Failed to start MicroVM 'machine-d'.
Dec 31 09:36:39 foo systemd[1]: Starting MicroVM 'machine-d'...
Dec 31 09:37:48 foo systemd[1]: Started MicroVM 'machine-d'.

My first read of what appears to be happening is that the machines take a fair amount of CPU time before they are considered 'up' to systemd, and if they don't become up within the default time limit of 1m30s, they are terminated and restarted. Eventually this process settles down.

What doesn't quite add up for me is that I think there should be enough CPU available for this to all happen.

Some questions:

  • When is a microvm considered started?
  • How might I determine why they are not coming up quickly?
  • Has anyone else encountered this, are there common pitfalls?
  • Should the default start timeout be raised?
  • If it turns out the machines are contending for resources (such as I/O) is there a straightforward way to stagger their boot to reduce the contention overhead?
@pwaller pwaller changed the title TimeoutStartUSec too low, or MicroVMs taking a little while be considered to start TimeoutStartUSec too low, or MicroVMs taking a little while be to start Dec 31, 2024
@pwaller
Copy link
Author

pwaller commented Dec 31, 2024

It seems to me the machines are probably quite close to having come up, their consoles say things like 'Reached target Multi-User System'.

Another thing which is suspicious is that the amount of CPU time consumed is very low, only 8 seconds during the 1m30 TimeoutStartUSec window. One the one hand this makes me think the machine should have had plenty of time to come up (there are 8 CPUs and not so many microvms). On the other hand, perhaps it is hitting an I/O bottleneck.

Dec 31 09:34:49 foo microvm@machine-a[3320]: [  OK  ] Reached target Multi-User System.
Dec 31 09:34:52 foo systemd[1]: [email protected]: start operation timed out. Terminating.
Dec 31 09:34:52 foo microvm@machine-a[3320]: [
Dec 31 09:34:53 foo systemd[1]: [email protected]: Consumed 8.094s CPU time, 472M memory peak, 410.2M read from disk.

I do also see a suspicious message about flushing journal persistant storage, and some time spent waiting in systemd:

Dec 31 09:34:09 foo microvm@machine-a[3320]: [*     ] A start job is running for Flush Jo…ersistent Storage (21s / 1min 46s)

The machines are configured so hat their journals are written onto the host's storage and can be observed from there. I'm wondering if there is something wrong with this:

microvm.shares = [
  {
    tag = "journal";
    proto = "virtiofs";
    source = "/var/lib/microvms/${config.networking.hostName}/journal";
    mountPoint = "/var/log/journal";
    socket = "journal.sock";
  }
]

@JohannesEbke
Copy link
Contributor

One more observation on this particular machine: I looked at the systemd message timings inside one of the microvms, and noticed the "Switch Root" taking ~21s:
Dec 31 09:33:27 foo microvm@roundcube[3333]: Starting Switch Root... Dec 31 09:33:43 foo microvm@roundcube[3333]: [9B blob data] Dec 31 09:33:43 foo microvm@roundcube[3333]: Welcome to NixOS 25.05 (Warbler)! Dec 31 09:33:48 foo microvm@roundcube[3333]: [ OK ] Stopped initrd-switch-root.service.
This is before the mentioned "flushing journal persistant storage" delay, that runs from 08:34:05 to 08:34:33 in this instance.

An additional piece of meta-information: This server is not of the most recent build (Xeon E3-1245 V2 @ 3.40GHz) and (currently) uses btrfs on cryptsetup on two disks of spinning rust for the journals and data.

@astro
Copy link
Owner

astro commented Dec 31, 2024

The timeout could be turned into an option in the host module.

I am just lucky to not have run into this yet. 55 MicroVMs on our 2013 hardware start within 90s.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants