A user server's systemd service in a failed state fails to start up properly again #87
I also tried deleting the user and creating a new user, but faced the same issues. At this stage I'm going to trash the VM and start again, but I'd like to know a better solution :)
Don't know if it's relevant, but the traefik logs show a lot of the following errors:
Also facing similar issues. The hub starts, but spawning notebooks leads to the timeout issue.
I installed tljh on a local instance of vultr.com, but got several errors:
Still being affected by this issue. Started up a new instance, new install, added users, began logging in using user accounts to check server startup. Worked fine for 10, then would timeout. Would work fine again for 5, then timeout again. No traefik errors showing this time, nothing in the user's server logs, just the timeout errors in the jupyterhub logs.
@lachlancampbell, there's a service we added not so long ago that's culling notebook servers that have been idle for more than 10 min (the default value). Check out the docs on how to change the idle timeout and other defaults to fit your needs.
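For reference, the idle-culler defaults mentioned above are adjusted on a TLJH host with `tljh-config`. This is a sketch based on the TLJH documentation; `7200` is an arbitrary example value, not a recommendation:

```shell
# Example only; run on the TLJH host. Commands commented out because they
# require a live TLJH installation and sudo.
TIMEOUT=7200   # example: raise the idle timeout from 600s (10 min) to 2 hours

# sudo tljh-config set services.cull.timeout ${TIMEOUT}
# Or disable the idle culler entirely:
# sudo tljh-config set services.cull.enabled false
# Apply the change:
# sudo tljh-config reload
echo "services.cull.timeout=${TIMEOUT}"
```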
Okay, thanks. I'll try turning the culling service off, although would it stop servers from starting up, as that's what is happening?
@lachlancampbell, no, the idle culler shouldn't stop servers from starting up, it just stops them after 10 min of inactivity. Is the timeout error you're receiving on this fresh install the same as the one posted in the initial comment? Do you see anything in the logs related to the idle culler?
I believe it's the same error, but I've included a section of the full log below. As you can see, the idle culler does appear, but I'm not sure that it's related.
@lachlancampbell try checking the logs of the server that failed to show up:
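A sketch of what that log check might look like, assuming the TLJH/systemdspawner convention that user servers run as systemd units named `jupyter-<username>` (`cou000` is the username that appears in the logs in this thread):

```shell
# Hypothetical username taken from the logs in this thread
USERNAME="cou000"
UNIT="jupyter-${USERNAME}"   # systemdspawner unit naming convention

# Inspect that unit's recent journal output (needs sudo on the host):
# sudo journalctl -u "${UNIT}" --since "1 hour ago" --no-pager
echo "${UNIT}"
```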
You might also check for load on the system. What's interesting from the logs is that we see systemd attempting to start the server with:
But no output from that process. I think that's the root, that systemd is actually taking a long time to launch the requested process, or even failing entirely to do so. It's especially strange that this same action succeeds in one second, five minutes later:
I don't have a good answer yet for what could cause this to take a long time, or perhaps fail altogether. It might be worth checking. Are there other active users when this is happening?
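A sketch of the load and unit checks suggested above (assumptions: a TLJH host, and `jupyter-cou000` as a hypothetical user server unit):

```shell
# Checks worth running when a spawn times out. Real commands are commented
# out because they require a live systemd host.

# System load: high load averages here would explain slow unit startup
# uptime

# Any systemd units stuck in a failed state?
# systemctl --state=failed --no-pager

# Status and recent log lines for one user's server unit:
# sudo systemctl status jupyter-cou000 --no-pager
CHECKS="uptime failed-units unit-status"
echo "checks: ${CHECKS}"
```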
Logs for jupyter-cou000 don't show anything more:
When this occurred I was the only user, so it shouldn't have been load. I'll try to replicate it and check the system status as you suggest. It's possible it could be an intermittent problem with our cloud, although since others reported the same issue I was hoping it wasn't that.
I've hit the exact same issue as everyone else here. I can't exactly come up with reproduction steps, but it's something like:
I have a workaround I've been using, but would really like to find and/or help you patch a permanent solution. If someone can point me at the code that is breaking, I have a team of engineers who would happily submit a PR. Workaround:
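For comparison, a commonly reported workaround for spawn failures caused by a unit stuck in a failed state is clearing that state with `systemctl reset-failed`. This is a sketch, not necessarily this poster's exact workaround; `cou000` is a hypothetical username:

```shell
USERNAME="cou000"            # hypothetical username
UNIT="jupyter-${USERNAME}"   # systemdspawner unit naming convention

# Clear the failed state so systemd accepts a new start request:
# sudo systemctl reset-failed "${UNIT}"
# Then retry the spawn from the JupyterHub UI, or start the unit directly:
# sudo systemctl start "${UNIT}"
echo "reset-failed ${UNIT}"
```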
Environment: Ubuntu 18.04.1 LTS (GNU/Linux 4.15.0-1027-aws x86_64)
I also just found that if I run:
Still haven't found a resolution for this. Ran two more training workshops this week, faced the same startup inconsistencies, and still haven't been able to diagnose it. Unfortunately resetting failed services didn't work for me, as the user notebook servers are not listed as failed. The hubs worked well during the sessions, but the problems starting up servers are still irritating. I would be grateful if someone could suggest another way of attempting to diagnose what's going on.
I had the same problems; however, when I looked into the logs (/var/log/system...) it seemed like Python was failing on `import jupyterhub`. I had to sudo to jupyterhub-admin-user and then run `conda init` as advised in the Anaconda FAQ.
I've got the same problem. I fixed it by deleting the account I had created, for example user jupyterhub-admin. Delete it with "sudo userdel jupyterhub-admin".
This indeed solved the issue for me. Many thanks @rgaushell. I caused this issue by opening too many tabs in JupyterLab, which crashed the service.
My problem is that after a few days of running TLJH, the server reaches 100% CPU usage. This causes the server to stop responding.
Issue takeaways

It seems that starting a user server in TLJH means starting a systemd service via jupyterhub-systemdspawner. A systemd service may sometimes fail, and if a user server's systemd service is in a failed state, it can block startup of that user's server. An example reproduction strategy could be the following: https://github.com/jupyterhub/the-littlest-jupyterhub/issues/351#issuecomment-525379200, but I think we can reproduce it faster by crashing the user server in some other way. I would guess this to be a systemdspawner issue rather than a TLJH issue, but I'm not sure and could not find a bug reported about that in that repo at this point. I'm moving this issue from TLJH to there.

Action points
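A faster reproduction sketch along those lines (assumptions: a TLJH host and a test user; sending SIGKILL is one guess at "crashing the user server in some other way", and only leaves the unit failed if no automatic restart is configured):

```shell
# Hypothetical test user's server unit
UNIT="jupyter-testuser"

# 1. Start the user's server from the hub, then crash it hard:
# sudo systemctl kill --signal=SIGKILL "${UNIT}"
# 2. Check whether the unit is now in a failed state:
# systemctl is-failed "${UNIT}"
# 3. Try spawning again from the hub and observe whether startup is blocked.
echo "${UNIT}"
```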
It seems like use of
I'm having problems where, if I leave a hub running for an extended period of time (weeks or longer), it becomes slow to respond and is sometimes unable to start a user's server, for example:
I've tried reloading the hub and restarting the VM (Ubuntu 18.04 on OpenStack) with no success.