A user server's systemd service in a failed state fails to start up properly again #87
I also tried deleting the user and creating a new user, but faced the same issues. At this stage I'm going to trash the VM and start again, but I'd like to know a better solution :)
Don't know if it's relevant, but the traefik logs show a lot of the following errors:
Also facing similar issues. The hub starts, but spawning notebooks leads to the timeout issue.
I installed tljh on a local instance of vultr.com, but got several errors:
Still being affected by this issue. Started up a new instance, new install, added users, began logging in using user accounts to check server startup. Worked fine for 10, then would timeout. Would work fine again for 5, then timeout again. No traefik errors showing this time, nothing in the user's server logs, just the timeout errors in the jupyterhub logs.
@lachlancampbell, there's a service we added not so long ago that's culling notebook servers that have been idle for more than 10 min (the default value). Check out the docs on how to change the idle timeout and other defaults to fit your needs.
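For reference, the idle-culler defaults mentioned above are adjusted on a TLJH host with `tljh-config`. This is a sketch based on the TLJH documentation; `7200` is an arbitrary example value, not a recommendation:

```shell
# Example only; run on the TLJH host. Commands commented out because they
# require a live TLJH installation and sudo.
TIMEOUT=7200   # example: raise the idle timeout from 600s (10 min) to 2 hours

# sudo tljh-config set services.cull.timeout ${TIMEOUT}
# Or disable the idle culler entirely:
# sudo tljh-config set services.cull.enabled false
# Apply the change:
# sudo tljh-config reload
echo "services.cull.timeout=${TIMEOUT}"
```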
Okay, thanks. I'll try turning the culling service off, although would it stop servers from starting up, as that's what is happening?
@lachlancampbell, no, the idle culler shouldn't stop servers from starting up, it just stops them after 10 min of inactivity. Is the timeout error you're receiving on this fresh install the same as the one posted in the initial comment? Do you see anything in the logs related to the idle culler?
I believe it's the same error, but I've included a section of the full log below. As you can see, the idle culler does appear, but I'm not sure that it's related.
@lachlancampbell try checking the logs of the server that failed to show up:
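A sketch of what that log check might look like, assuming the TLJH/systemdspawner convention that user servers run as systemd units named `jupyter-<username>` (`cou000` is the username that appears in the logs in this thread):

```shell
# Hypothetical username taken from the logs in this thread
USERNAME="cou000"
UNIT="jupyter-${USERNAME}"   # systemdspawner unit naming convention

# Inspect that unit's recent journal output (needs sudo on the host):
# sudo journalctl -u "${UNIT}" --since "1 hour ago" --no-pager
echo "${UNIT}"
```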
You might also check for load on the system. What's interesting from the logs is that we see systemd attempting to start the server with:
But no output from that process. I think that's the root, that systemd is actually taking a long time to launch the requested process, or even failing entirely to do so. It's especially strange that this same action succeeds in one second, five minutes later:
I don't have a good answer yet for what could cause this to take a long time, or perhaps fail altogether. It might be worth checking. Are there other active users when this is happening?
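A sketch of the load and unit checks suggested above (assumptions: a TLJH host, and `jupyter-cou000` as a hypothetical user server unit):

```shell
# Checks worth running when a spawn times out. Real commands are commented
# out because they require a live systemd host.

# System load: high load averages here would explain slow unit startup
# uptime

# Any systemd units stuck in a failed state?
# systemctl --state=failed --no-pager

# Status and recent log lines for one user's server unit:
# sudo systemctl status jupyter-cou000 --no-pager
CHECKS="uptime failed-units unit-status"
echo "checks: ${CHECKS}"
```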
Logs for jupyter-cou000 don't show anything more:
When this occurred I was the only user, so it shouldn't have been load. I'll try to replicate it and check the system status as you suggest. It's possible it could be an intermittent problem with our cloud, although since others reported the same issue I was hoping it wasn't that.
I've hit the exact same issue as everyone else here. I can't exactly come up with reproduction steps, but it's something like:
I have a workaround I've been using, but would really like to find and/or help you patch a permanent solution. If someone can point me at the code that is breaking, I have a team of engineers who would happily submit a PR. Workaround:
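For comparison, a commonly reported workaround for spawn failures caused by a unit stuck in a failed state is clearing that state with `systemctl reset-failed`. This is a sketch, not necessarily this poster's exact workaround; `cou000` is a hypothetical username:

```shell
USERNAME="cou000"            # hypothetical username
UNIT="jupyter-${USERNAME}"   # systemdspawner unit naming convention

# Clear the failed state so systemd accepts a new start request:
# sudo systemctl reset-failed "${UNIT}"
# Then retry the spawn from the JupyterHub UI, or start the unit directly:
# sudo systemctl start "${UNIT}"
echo "reset-failed ${UNIT}"
```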
Environment: Ubuntu 18.04.1 LTS (GNU/Linux 4.15.0-1027-aws x86_64)
I also just found that if I run:
Still haven't found a resolution for this. Ran two more training workshops this week, faced the same startup inconsistencies, and still haven't been able to diagnose it. Unfortunately resetting failed services didn't work for me, as the user notebook servers are not listed as failed. The hubs worked well during the sessions, but the problems starting up servers are still irritating. I would be grateful if someone could suggest another way of attempting to diagnose what's going on.
I had the same problems; however, when I looked into the logs (/var/log/system...) it seemed like Python was failing on `import jupyterhub`. I had to sudo to jupyterhub-admin-user and then run `conda init` as advised in the Anaconda FAQ.
I've got the same problem. I fixed it by deleting the account I had created, for example user jupyterhub-admin. Delete it with "sudo userdel jupyterhub-admin".
This indeed solved the issue for me. Many thanks @rgaushell. I caused this issue by opening too many tabs in JupyterLab, which crashed the service.
My problem is that after a few days of running TLJH, the server reaches 100% CPU usage. This causes the server to stop responding.
Issue takeaways

It seems that starting a user server in TLJH means starting a systemd service via jupyterhub-systemdspawner. A systemd service may sometimes fail, and if a user server's systemd service is in a failed state, it can block startup of that user's server. An example reproduction strategy could be the following: https://github.com/jupyterhub/the-littlest-jupyterhub/issues/351#issuecomment-525379200, but I think we can reproduce it faster by crashing the user server in some other way. I would guess this to be a systemdspawner issue rather than a TLJH issue, but I'm not sure and could not find a bug reported about that in that repo at this point. I'm moving this issue from TLJH to there.

Action points
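A faster reproduction sketch along those lines (assumptions: a TLJH host and a test user; sending SIGKILL is one guess at "crashing the user server in some other way", and only leaves the unit failed if no automatic restart is configured):

```shell
# Hypothetical test user's server unit
UNIT="jupyter-testuser"

# 1. Start the user's server from the hub, then crash it hard:
# sudo systemctl kill --signal=SIGKILL "${UNIT}"
# 2. Check whether the unit is now in a failed state:
# systemctl is-failed "${UNIT}"
# 3. Try spawning again from the hub and observe whether startup is blocked.
echo "${UNIT}"
```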
It seems like use of
I'm having problems where, if I leave a hub running for an extended period of time (weeks or longer), it becomes slow to respond and is sometimes unable to start a user's server, for example:
I've tried reloading the hub and restarting the VM (Ubuntu 18.04 on OpenStack) with no success.