[Bug][T2?]: REBOOT_TYPE_POWEROFF reboot causes test failures on T2 as NTP slew doesn't recover for a while #16289

Closed
Javier-Tan opened this issue Jan 2, 2025 · 1 comment · Fixed by #16348
Javier-Tan commented Jan 2, 2025

Issue Description

When the chassis is rebooted through REBOOT_TYPE_POWEROFF (through PDUs), the internal clock resumes from roughly the time at which the reboot started. For example, if the chassis is restarted at 00:00 and takes 5 minutes to boot, the clock shows 00:00 (or slightly before) when it comes back up at 00:05. This could be caused by the RTC (?).

The time slowly corrects itself through NTP slew, as designed. However, because a chassis takes a long time to reboot, the offset is large and the correction takes very long (much longer than the wait_until timeout defined for this case in the reboot function).
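To put a rough number on "very long": the sketch below estimates how long a pure slew would take to remove an offset. It assumes the maximum slew rate of 500 ppm (0.5 ms of correction per second) that ntpd is limited to; the issue does not state which NTP daemon or settings the DUT uses, so treat the rate as an assumption. The function name is hypothetical, and the offset is the value reported in the "show ntp" output after the failure in this issue.

```python
# Assumption: the clock is corrected by slewing only, at ntpd's maximum
# slew rate of 500 ppm. The real DUT configuration may differ.
MAX_SLEW_RATE = 500e-6  # seconds of correction per second of real time


def estimate_slew_seconds(offset_seconds, slew_rate=MAX_SLEW_RATE):
    """Return the time needed to slew away an offset of the given size."""
    return abs(offset_seconds) / slew_rate


# Offset of the selected peer after the failed test: 64907.24 ms.
offset_s = 64907.24 / 1000.0
slew_s = estimate_slew_seconds(offset_s)
print("about {:.1f} hours to slew {:.1f} s".format(slew_s / 3600.0, offset_s))
```

Under that assumption the correction takes on the order of a day and a half, which is far beyond the 120 second wait_until timeout in the reboot function.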

Snippet responsible for handling this in reboot.py reboot function:

    # some device does not have onchip clock and requires obtaining system time a little later from ntp
    # or SUP to obtain the correct time so if the uptime is less than original device time, it means it
    # is most likely due to this issue which we can wait a little more until the correct time is set in place.
    if float(dut_uptime.strftime("%s")) < float(dut_datetime.strftime("%s")):
        logger.info('DUT {} timestamp went backwards'.format(hostname))
        wait_until(120, 5, 0, positive_uptime, duthost, dut_datetime)

    dut_uptime = duthost.get_up_time()

    assert float(dut_uptime.strftime("%s")) > float(dut_datetime.strftime("%s")), "Device {} did not reboot". \
        format(hostname)

Here dut_uptime is the DUT's latest reported startup time, and dut_datetime is the time the DUT reported before the reboot.

Any test using this reboot type should pass, unless this check is an intentional safeguard against DUTs behaving this way, in which case the image needs to be fixed.

On the test side:
We could increase the wait_until timeout (it would need to be very long), sync the clock manually in the test after reboot using sudo ntpdate <NTP_SERVER_IP>, or skip this check for T2 devices.
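The resync option could look roughly like the sketch below. It is a standalone illustration, not the code in reboot.py: uptime_advanced, get_up_time, resync and sleep are hypothetical names, and the callables are injected so the logic can be exercised without a DUT. In the real test, get_up_time would correspond to duthost.get_up_time() and resync would run something like sudo ntpdate <NTP_SERVER_IP> on the DUT.

```python
import time
from datetime import datetime, timedelta


def uptime_advanced(get_up_time, dut_datetime, timeout=120, interval=5,
                    resync=None, sleep=time.sleep):
    """Return True once the reported startup time is later than dut_datetime.

    If the startup time appears to have gone backwards and a resync
    callable is given, call it once before polling.
    """
    if resync is not None and get_up_time() <= dut_datetime:
        resync()  # hypothetical hook, e.g. force an NTP sync on the DUT
    waited = 0
    while True:
        if get_up_time() > dut_datetime:
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval


# Simulated DUT whose clock is behind until it is resynced.
before = datetime(2025, 1, 2, 3, 51, 10)
state = {"uptime": before - timedelta(seconds=5)}


def fake_resync():
    state["uptime"] = before + timedelta(minutes=5)


print(uptime_advanced(lambda: state["uptime"], before,
                      resync=fake_resync, sleep=lambda s: None))
```

Skipping the check for T2 devices is simpler still, but it also removes the only confirmation that those devices actually rebooted, so a forced resync keeps more of the original safeguard.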

Results you see

Sample test_power_off_reboot.py failure:
Before the test fails (NTP was synced before this):

Actual time: 2025-01-02 03:51:10.744581

timedatectl:
               Local time: Thu 2025-01-02 03:51:11 UTC
           Universal time: Thu 2025-01-02 03:51:11 UTC
                 RTC time: Thu 2025-01-02 03:51:10
                Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
              NTP service: n/a
          RTC in local TZ: no

show ntp:
MGMT_VRF_CONFIG is not present.
synchronised to unspecified at stratum 4 
   time correct to within 150 ms
   polling server every 64 s
     remote           refid      st t when poll reach   delay   offset   jitter
===============================================================================
*10.150.22.222   10.221.236.34    3 u    3   64  377   0.2086  -0.3139   0.0731

After test fails:

Actual time: 2025-01-02 03:57:48.520614

timedatectl:
               Local time: Thu 2025-01-02 03:56:44 UTC
           Universal time: Thu 2025-01-02 03:56:44 UTC
                 RTC time: Thu 2025-01-02 03:56:44
                Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
              NTP service: n/a
          RTC in local TZ: no

show ntp:
MGMT_VRF_CONFIG is not present.
synchronised to unspecified at stratum 4 
   time correct to within 65062 ms
   polling server every 64 s
     remote           refid      st t when poll reach   delay   offset   jitter
===============================================================================
*10.150.22.222   10.221.236.34    3 u   47   64   17   0.2465 64907.24  88.1042
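If a test wanted to judge clock recovery from this output instead of from the uptime comparison, the offset of the selected peer (the line starting with "*") could be extracted as sketched below. The function name is hypothetical, and the parsing assumes the column layout shown above (remote, refid, st, t, when, poll, reach, delay, offset, jitter), with the offset in milliseconds.

```python
def selected_peer_offset_ms(ntp_output):
    """Return the offset (ms) of the selected NTP peer, or None if there is none.

    Assumes the peer table layout shown in the issue, where the selected
    peer is marked with a leading '*' and offset is the ninth column.
    """
    for line in ntp_output.splitlines():
        if line.startswith("*"):
            fields = line.split()
            return float(fields[8])
    return None


sample = """\
     remote           refid      st t when poll reach   delay   offset   jitter
===============================================================================
*10.150.22.222   10.221.236.34    3 u   47   64   17   0.2465 64907.24  88.1042
"""
print(selected_peer_offset_ms(sample))
```

For the output captured after the failure this yields an offset of about 64.9 seconds, compared with well under a millisecond before the reboot.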

Results you expected to see

As stated before:

Any test using this reboot type should pass, unless this check is an intentional safeguard against DUTs behaving this way, in which case the image needs to be fixed.

On the test side:
We could increase the wait_until timeout (it would need to be very long), sync the clock manually in the test after reboot using sudo ntpdate <NTP_SERVER_IP>, or skip this check for T2 devices.

Is it platform specific

generic

Relevant log output

N/A

Output of show version

N/A

Attach files (if any)

N/A

arlakshm commented Jan 4, 2025

@Javier-Tan, can you skip the check for T2 for now?
