Scheduler kills whole cluster with 10 instances #1

Open
philipgiuliani opened this issue Sep 4, 2020 · 2 comments

@philipgiuliani

Hi,
we have a cluster of 10 instances that gets killed when we use quantum-swarm. With just 2 instances it worked fine.

08:12:56.435 [info] [swarm on A] [tracker:ensure_swarm_started_on_remote_node] nodeup B
08:12:56.435 [info] [swarm on A] [tracker:handle_topology_change] topology change complete
08:13:18.898 [info] GenStage consumer MyProject.Scheduler.ExecutorSupervisor is stopping after receiving cancel from producer #PID<61039.8804.0> with reason: :shutdown
08:13:18.898 [error] Supervisor received unexpected message: {:DOWN, #Reference<0.981856171.4094427137.33179>, :process, #PID<61039.8804.0>, :shutdown}
08:13:18.899 [error] Supervisor received unexpected message: {:DOWN, #Reference<0.981856171.4094427137.33175>, :process, #PID<61039.8802.0>, :shutdown}
08:13:18.901 [error] Supervisor received unexpected message: {:DOWN, #Reference<0.981856171.4094427137.33171>, :process, #PID<61039.8801.0>, :shutdown}
08:13:18.901 [error] Supervisor received unexpected message: {:DOWN, #Reference<0.981856171.4094427137.33167>, :process, #PID<61039.8799.0>, :shutdown}
08:13:18.902 [error] Supervisor received unexpected message: {:DOWN, #Reference<0.981856171.4094427137.33163>, :process, #PID<61039.8797.0>, :shutdown}
08:13:18.903 [warn] [swarm on [email protected]] [tracker:handle_replica_event] received track event for MyProject.Scheduler.NodeSelectorBroadcaster, mismatched pids, local clock conflicts with remote clock, event unhandled
08:13:18.906 [warn] [swarm on [email protected]] [tracker:handle_replica_event] received track event for MyProject.Scheduler.JobBroadcaster, mismatched pids, local clock conflicts with remote clock, event unhandled
08:13:18.907 [warn] [swarm on [email protected]] [tracker:handle_replica_event] received track event for MyProject.Scheduler.ExecutionBroadcaster, mismatched pids, local clock conflicts with remote clock, event unhandled
08:13:18.911 [error] Supervisor received unexpected message: {:DOWN, #Reference<0.981856171.4094164994.181688>, :process, #PID<61061.8615.0>, :noproc}
08:13:18.911 [error] Supervisor received unexpected message: {:DOWN, #Reference<0.981856171.4094164994.181719>, :process, #PID<61043.8615.0>, :noproc}
08:13:18.911 [error] Supervisor received unexpected message: {:DOWN, #Reference<0.981856171.4094164994.181711>, :process, #PID<61061.8621.0>, :shutdown}
08:13:18.911 [info] GenStage consumer MyProject.Scheduler.ExecutorSupervisor is stopping after receiving cancel from producer #PID<61043.8615.0> with reason: :noproc
08:13:18.912 [error] GenServer MyProject.Scheduler.ExecutorSupervisor terminating
** (stop) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
Last message: {:DOWN, #Reference<0.981856171.4094164994.181723>, :process, #PID<61043.8615.0>, :noproc}
08:13:18.912 [error] Supervisor received unexpected message: {:DOWN, #Reference<0.981856171.4094164994.181702>, :process, #PID<61061.8619.0>, :shutdown}
08:13:18.912 [error] Supervisor received unexpected message: {:DOWN, #Reference<0.981856171.4094164994.181698>, :process, #PID<61061.8617.0>, :shutdown}
08:13:18.913 [warn] [swarm on [email protected]] [tracker:handle_replica_event] received track event for MyProject.Scheduler.TaskRegistry, mismatched pids, local clock conflicts with remote clock, event unhandled
08:13:18.914 [warn] [swarm on [email protected]] [tracker:handle_replica_event] received track event for MyProject.Scheduler.NodeSelectorBroadcaster, mismatched pids, local clock conflicts with remote clock, event unhandled
08:13:18.919 [error] GenServer #PID<0.8135.0> terminating
** (stop) 'stopping because dependent process <0.8127.0> died: shutdown'
Last message: {:EXIT, #PID<0.8127.0>, :shutdown}
08:13:18.919 [error] GenServer #PID<0.8138.0> terminating
** (stop) 'stopping because dependent process <0.8128.0> died: shutdown'
Last message: {:EXIT, #PID<0.8128.0>, :shutdown}
08:13:18.919 [error] GenServer #PID<0.8131.0> terminating
** (stop) 'stopping because dependent process <0.8126.0> died: shutdown'
Last message: {:EXIT, #PID<0.8126.0>, :shutdown}
08:13:18.927 [info] Application my_project exited: shutdown
"Kernel pid terminated (application_controller) ({application_terminated,my_project,shutdown})
"
"{"Kernel pid terminated",application_controller,"{application_terminated,my_project,shutdown}"}
"

Crash dump is being written to: erl_crash.dump...done

I am not sure what other information I could supply that would help.

@philipgiuliani
Author

Hey @maennchen,

maybe it's just our architecture, but this seems like a critical problem to me. If you use this library in a cluster, it will crash the whole production system 😀

@maennchen
Member

@philipgiuliani Hm, I only tested it with two machines so far and it seemed to work fine.

I'll try to replicate the issue.

If you're able to determine the problem, a PR would also be very welcome.
