Demo 1 GRPC related error #12

Open
YinghanUVA opened this issue Nov 8, 2023 · 0 comments

Hi team,

I'm able to build and run successfully with the command bazel run //markdown/demo:demo_local_runner -- --training_type=batch, but I then run into the gRPC-related error below. Has this happened for you before? Any idea how to resolve it?

INFO:tensorflow:loss = 1.1790854, step = 1952
I1108 21:10:20.894386 140460748154688 basic_session_run_hooks.py:262] loss = 1.1790854, step = 1952
INFO:tensorflow:loss = 1.2298307, step = 2152 (18.186 sec)
I1108 21:10:39.080899 140460748154688 basic_session_run_hooks.py:260] loss = 1.2298307, step = 2152 (18.186 sec)
I1108 21:10:47.662103 140675923150656 cpu_training.py:374] MetricsHeartBeat thread stopped
I1108 21:10:47.664155 140675923150656 cpu_training.py:1712] Try to shutdown ps 0
I1108 21:10:47.677361 140269666805568 cpu_training.py:1776] Ps 0 shutdown successfully!
I1108 21:10:47.677551 140675923150656 cpu_training.py:1718] Shutdown ps 0 successfully!
I1108 21:10:47.677928 140675923150656 cpu_training.py:1712] Try to shutdown ps 1
I1108 21:10:47.678347 140269666805568 cpu_training.py:2158] Finished ps 0.
I1108 21:10:47.678776 140269666805568 runner_utils.py:396] exit monolith_discovery!
I1108 21:10:47.684976 140603018831680 cpu_training.py:1776] Ps 1 shutdown successfully!
I1108 21:10:47.685158 140675923150656 cpu_training.py:1718] Shutdown ps 1 successfully!
I1108 21:10:47.685652 140603018831680 cpu_training.py:2158] Finished ps 1.
I1108 21:10:47.686046 140603018831680 runner_utils.py:396] exit monolith_discovery!
I1108 21:10:47.693424 140675923150656 cpu_training.py:2155] Worker End 1699477847.693356, Cost: 30.059291124343872(s)
I1108 21:10:47.693858 140675923150656 cpu_training.py:2158] Finished worker 0.
I1108 21:10:47.694137 140675923150656 runner_utils.py:396] exit monolith_discovery!
2023-11-08 21:10:48.412364: W external/org_tensorflow/tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:514] RecvTensor cancelled for 128048405063079430
2023-11-08 21:10:48.412458: W external/org_tensorflow/tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:514] RecvTensor cancelled for 128048405063079430
2023-11-08 21:10:48.412479: W external/org_tensorflow/tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:514] RecvTensor cancelled for 128048405063079430
2023-11-08 21:10:48.412496: W external/org_tensorflow/tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:514] RecvTensor cancelled for 128048405063079430
2023-11-08 21:10:48.412575: I external/org_tensorflow/tensorflow/core/distributed_runtime/worker.cc:207] Cancellation requested for RunGraph.
2023-11-08 21:10:48.412993: W external/org_tensorflow/tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:514] RecvTensor cancelled for 128048405063079430
INFO:tensorflow:An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: From /job:ps/replica:0/task:1:
Socket closed
Additional GRPC error information from remote target /job:ps/replica:0/task:1:
:{"created":"@1699477848.412335053","description":"Error received from peer ipv4:10.128.0.74:34391","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}
I1108 21:10:48.415307 140460748154688 monitored_session.py:1285] An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: From /job:ps/replica:0/task:1:
Socket closed
Additional GRPC error information from remote target /job:ps/replica:0/task:1:
:{"created":"@1699477848.412335053","description":"Error received from peer ipv4:10.128.0.74:34391","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}

Related specs:

I1108 21:10:19.861951 140460748154688 estimator.py:191] Using config: {'_model_dir': '/tmp/movie_lens_tutorial', '_tf_random_seed': None, '_save_summary_steps': 200, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': device_filters: "/job:ps"
device_filters: "/job:chief"
device_filters: "/job:worker/task:0"
gpu_options {
  allow_growth: true
}
allow_soft_placement: true
graph_options {
  rewrite_options {
    disable_meta_optimizer: true
  }
}
operation_timeout_in_ms: -1
cluster_def {
  job {
    name: "chief"
    tasks {
      key: 0
      value: "10.128.0.74:33213"
    }
  }
  job {
    name: "ps"
    tasks {
      key: 0
      value: "10.128.0.74:33337"
    }
    tasks {
      key: 1
      value: "10.128.0.74:34391"
    }
  }
  job {
    name: "worker"
    tasks {
      key: 0
      value: "10.128.0.74:57669"
    }
  }
}
share_cluster_devices_in_session: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 200, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({'chief': ['10.128.0.74:33213'], 'ps': ['10.128.0.74:33337', '10.128.0.74:34391'], 'worker': ['10.128.0.74:57669']}), '_task_type': 'worker', '_task_id': 0, '_evaluation_master': '', '_master': 'grpc://10.128.0.74:57669', '_num_ps_replicas': 2, '_num_worker_replicas': 2, '_global_id_in_cluster': 1, '_is_chief': False}
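For reference, the cluster_def in the config above maps to the following tf.train.ClusterSpec (addresses copied from the log); this is just an illustrative sketch of the topology, not code from the demo:

import tensorflow as tf

# Same single-host topology as in the config dump: one chief, two parameter
# servers, and one worker, all on 10.128.0.74.
cluster = tf.train.ClusterSpec({
    "chief":  ["10.128.0.74:33213"],
    "ps":     ["10.128.0.74:33337", "10.128.0.74:34391"],
    "worker": ["10.128.0.74:57669"],
})
print(cluster.as_dict())

The error in the log came from ps task 1 (10.128.0.74:34391), i.e. the second address in the ps list.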