[202405][dualtor] Orchagent is going down during switchover #3298

Open
vkjammala-arista opened this issue Sep 24, 2024 · 3 comments

vkjammala-arista commented Sep 24, 2024

Description

When performing a switchover (active to standby or vice versa), we observe the orchagent process going down, which leaves the mux status in an inconsistent state.

Based on the debug logs, we suspected that using the bulker to program routes/neighbors (introduced by PR #3148) is the cause, and confirmed this by rerunning the tests after reverting that PR's changes.

Steps to reproduce the issue:

  1. Run any sonic-mgmt test (e.g. tests/dualtor_io/test_link_failure.py) that performs a switchover, for example via toggle_all_simulator_ports_to_rand_selected_tor or a similar fixture that performs a switchover during test setup.

Describe the results you received:

  1. Tests fail with "Failed to toggle all ports to <tor_device> from mux simulator" because the mux status is left in an inconsistent state.
    def _toggle_all_simulator_ports_to_target_dut(target_dut_hostname, duthosts, mux_server_url, tbinfo):
        """Helper function to toggle all ports to active on the target DUT."""
        ...
        if not is_toggle_done and \
                not utilities.wait_until(120, 10, 0, _check_toggle_done, duthosts, target_dut_hostname, probe=True):
>           pytest_assert(False, "Failed to toggle all ports to {} from mux simulator".format(target_dut_hostname))
E           Failed: Failed to toggle all ports to ld301 from mux simulator
  2. The orchagent process in the swss docker container will be down (this can be verified with ps aux inside the swss container).
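
The ps aux check above could be automated roughly as follows. This is only a sketch: the helper name is_process_running and the sample ps output are made up for illustration, and it assumes the text was collected separately (for example with something like docker exec swss ps aux on the DUT).

```python
def is_process_running(ps_output: str, name: str = "orchagent") -> bool:
    """Return True if any row of `ps aux` output runs a command named `name`."""
    for row in ps_output.splitlines()[1:]:  # skip the header row
        # `ps aux` has 11 columns; the last one (COMMAND) may contain spaces.
        fields = row.split(None, 10)
        if len(fields) < 11:
            continue
        executable = fields[10].split()[0]
        # Compare only the basename so /usr/bin/orchagent matches "orchagent".
        if executable.rsplit("/", 1)[-1] == name:
            return True
    return False


# Illustrative (made-up) samples, not captured from a real device.
SAMPLE_UP = """USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.1 30000 20000 pts/0 Ss+ 17:00 0:01 /usr/bin/python3 /usr/local/bin/supervisord
root 45 1.2 0.5 500000 80000 pts/0 Sl 17:00 0:30 /usr/bin/orchagent
"""

SAMPLE_DOWN = """USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.1 30000 20000 pts/0 Ss+ 17:00 0:01 /usr/bin/python3 /usr/local/bin/supervisord
root 99 0.0 0.0 3000 700 pts/1 S+ 17:50 0:00 grep orchagent
"""

print(is_process_running(SAMPLE_UP))
print(is_process_running(SAMPLE_DOWN))
```

Matching on the executable's basename rather than a plain substring avoids counting a grep orchagent row as a running orchagent.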

Describe the results you expected:

Switchover should have completed without any failures.

Additional information you deem important:

Some of the debug logs captured during the switchover:

2024 Sep 18 17:47:12.847339 gd377 NOTICE swss#orchagent: :- nbrHandler: Processing neighbors for mux Ethernet200, enable 0, state 2
2024 Sep 18 17:47:12.847339 gd377 INFO swss#orchagent: :- updateRoutes: Updating routes pointing to multiple mux nexthops
...
2024 Sep 18 17:47:12.851834 gd377 INFO swss#orchagent: :- addRoutes: Adding route entry 192.168.0.44, nh 400000000167a to bulker
2024 Sep 18 17:47:12.851834 gd377 INFO swss#orchagent: :- create_entry: EntityBulker.create_entry 1, 2, 1
2024 Sep 18 17:47:12.851834 gd377 INFO swss#orchagent: :- addRoutes: Adding route entry fc02:1000::2c, nh 400000000167a to bulker
2024 Sep 18 17:47:12.851834 gd377 INFO swss#orchagent: :- create_entry: EntityBulker.create_entry 2, 2, 1
2024 Sep 18 17:47:12.851834 gd377 DEBUG swss#orchagent: :> redis_bulk_create_route_entry: enter
2024 Sep 18 17:47:12.851834 gd377 DEBUG swss#orchagent: :> bulkCreate: enter
...
...
2024 Sep 18 17:47:12.881418 gd377 DEBUG swss#orchagent: :> waitForBulkResponse: enter
...
...
2024 Sep 18 17:47:12.886416 gd377 DEBUG swss#orchagent: :- processReply: got message: ["switch_shutdown_request","{\"switch_id\":\"oid:0x21000000000000\"}"]
...
...
2024 Sep 18 17:48:12.935572 gd377 DEBUG swss#orchagent: :> on_switch_shutdown_request: enter
2024 Sep 18 17:48:12.935597 gd377 ERR swss#orchagent: :- on_switch_shutdown_request: Syncd stopped
2024 Sep 18 17:48:12.946670 gd377 INFO swss#supervisord 2024-09-18 17:48:12,945 WARN exited: orchagent (exit status 1; not expected)
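
One detail in the excerpt above is the delay between the switch_shutdown_request message arriving in processReply and on_switch_shutdown_request being entered. A small sketch for measuring such gaps from syslog lines is shown below; the helper names parse_ts and gap_seconds are hypothetical, and the timestamp layout is assumed to match the lines quoted above.

```python
from datetime import datetime


def parse_ts(line: str) -> datetime:
    """Parse the leading 'YYYY Mon DD HH:MM:SS.ffffff' timestamp of a log line."""
    stamp = " ".join(line.split()[:4])
    return datetime.strptime(stamp, "%Y %b %d %H:%M:%S.%f")


def gap_seconds(lines, first_marker: str, second_marker: str):
    """Seconds between the first line containing first_marker and the next
    line after it containing second_marker; None if either is missing."""
    start = None
    for line in lines:
        if start is None:
            if first_marker in line:
                start = parse_ts(line)
        elif second_marker in line:
            return (parse_ts(line) - start).total_seconds()
    return None


# Two lines copied from the debug logs in this issue.
LOG = r'''2024 Sep 18 17:47:12.886416 gd377 DEBUG swss#orchagent: :- processReply: got message: ["switch_shutdown_request","{\"switch_id\":\"oid:0x21000000000000\"}"]
2024 Sep 18 17:48:12.935572 gd377 DEBUG swss#orchagent: :> on_switch_shutdown_request: enter'''.splitlines()

gap = gap_seconds(LOG, "processReply: got message", "on_switch_shutdown_request: enter")
print(f"gap: {gap:.6f} s")
```

For the two lines quoted here the gap is roughly 60 seconds. The sketch only measures the gap; it does not explain it.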

Based on the debug logs captured across multiple test runs, we suspected that the use of the bulker is causing orchagent to go down. We then ran the tests with PR #3148 ([muxorch] Using bulker to program routes/neighbors during switchover) reverted, and the tests pass.

yxieca commented Sep 26, 2024

@prsunny @Ndancejic can you assess this issue?

yxieca commented Sep 26, 2024

@bingwang-ms FYI

prsunny commented Jan 7, 2025

@Ndancejic, it looks like the fix is merged. Can you close this if it's no longer an issue?
