
fix: Logs API stops functioning when a pod fails #21287

Open
wants to merge 3 commits into master from exclude-broken-pods
Conversation

Fluder-Paradyne
Contributor

@Fluder-Paradyne Fluder-Paradyne commented Dec 21, 2024

Closes: #21286

When a pod is in a Failed, Completed, or Error state, the Kubernetes API sometimes returns a log message like:

unable to retrieve container logs for containerd://a0002996fc1eb2a973c4708fa4f4f5f1b9e643114322c71fe54dbbfef27148d5.

The current log parsing logic assumes logs always begin with a timestamp, which causes issues when encountering this specific log format.

  • Added an exception to handle cases where the log message begins with the string "unable to retrieve container logs for".
  • In these cases, the log stream is stopped cleanly, and a log entry is sent with the message and the current timestamp.
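The handling described above can be sketched roughly as follows. This is an illustrative standalone sketch, not the actual code in server/application/logs.go; the function and type names here are assumptions:

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// logEntry mirrors the shape discussed in the PR (illustrative).
type logEntry struct {
	line      string
	podName   string
	timeStamp time.Time
}

// parseLogLine sketches the fallback: Kubernetes log lines normally start
// with an RFC3339Nano timestamp, but a failed pod can yield a bare
// "unable to retrieve container logs for ..." message with no timestamp.
func parseLogLine(podName, line string) (logEntry, bool) {
	parts := strings.SplitN(line, " ", 2)
	if t, err := time.Parse(time.RFC3339Nano, parts[0]); err == nil {
		return logEntry{line: line, podName: podName, timeStamp: t}, true
	}
	if strings.HasPrefix(strings.ToLower(line), "unable to retrieve container logs for ") {
		// No timestamp available; attach the current time instead of failing.
		return logEntry{line: line, podName: podName, timeStamp: time.Now()}, true
	}
	return logEntry{}, false
}

func main() {
	e, ok := parseLogLine("my-pod", "unable to retrieve container logs for containerd://abc123")
	fmt.Println(ok, e.line) // true unable to retrieve container logs for containerd://abc123
}
```

The lowercased prefix check covers both the "unable ..." and "Unable ..." variants that the actual patch matches separately.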

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this does not need to be in the release notes.
  • The title of the PR states what changed and the related issues number (used for the release note).
  • The title of the PR conforms to the Toolchain Guide
  • I've included "Closes [ISSUE #]" or "Fixes [ISSUE #]" in the description to automatically close the associated issue.
  • I've updated both the CLI and UI to expose my feature, or I plan to submit a second PR with them.
  • Does this PR require documentation updates?
  • I've updated documentation as required by this PR.
  • I have signed off all my commits as required by DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My build is green (troubleshooting builds).
  • My new feature complies with the feature status guidelines.
  • I have added a brief description of why this PR is necessary and/or what this PR solves.
  • Optional. My organization is added to USERS.md.
  • Optional. For bug fixes, I've indicated what older releases this fix should be cherry-picked into (this may or may not happen depending on risk/complexity).


bunnyshell bot commented Dec 21, 2024

🔴 Preview Environment stopped on Bunnyshell

See: Environment Details | Pipeline Logs

Available commands (reply to this comment):

  • 🔵 /bns:start to start the environment
  • 🚀 /bns:deploy to redeploy the environment
  • /bns:delete to remove the environment


codecov bot commented Dec 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 55.21%. Comparing base (8126508) to head (a136cf4).
Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #21287      +/-   ##
==========================================
+ Coverage   55.19%   55.21%   +0.01%     
==========================================
  Files         337      337              
  Lines       57058    57061       +3     
==========================================
+ Hits        31496    31506      +10     
+ Misses      22863    22860       -3     
+ Partials     2699     2695       -4     

☔ View full report in Codecov by Sentry.

@Fluder-Paradyne Fluder-Paradyne force-pushed the exclude-broken-pods branch 2 times, most recently from e0d48bb to d0e06c7 Compare December 23, 2024 04:02
@Fluder-Paradyne Fluder-Paradyne marked this pull request as ready for review December 24, 2024 04:07
@Fluder-Paradyne Fluder-Paradyne requested a review from a team as a code owner December 24, 2024 04:07
@Fluder-Paradyne Fluder-Paradyne force-pushed the exclude-broken-pods branch 4 times, most recently from 0dbcf76 to 9035ea5 Compare December 27, 2024 11:25
server/application/logs.go (Outdated)
    @@ -39,6 +39,11 @@ func parseLogsStream(podName string, stream io.ReadCloser, ch chan logEntry) {
        timeStampStr := parts[0]
        logTime, err := time.Parse(time.RFC3339Nano, timeStampStr)
        if err != nil {
    +       if strings.HasPrefix(line, "unable to retrieve container logs for ") ||
    +           strings.HasPrefix(line, "Unable to retrieve container logs for ") {
    +           ch <- logEntry{line: line, podName: podName, timeStamp: time.Now()}
Contributor

Should err still be added here?

Contributor Author

No, any error in the log channel would stop all other streams immediately.

    logStream := mergeLogStreams(streams, time.Millisecond*100)
    sentCount := int64(0)
    done := make(chan error)
    go func() {
        for entry := range logStream {
            if entry.err != nil {
                done <- entry.err
                return
            } else {
Member

What is the purpose of ch <- logEntry{err: err}? Shouldn't we log all errors/messages that don't start with a timestamp in a way that doesn't crash the stream, and exit properly?

There might be other errors similar to "unable to retrieve container logs for" that we might receive.

    for entry := range res {
        entries = append(entries, entry)
    }

Contributor

You probably need to wait before checking the results to avoid flakiness, or use a channel for synchronization. That would also let the goroutine exit cleanly.

Contributor Author

I've added a done channel for synchronization and increased the time window to 5 seconds. Would 5 seconds be enough?

Member

Does the exact value of the timestamp really matter? Or does what matters is only that we received the log correctly?

Asserting "timestamp <= now" seems more reliable. I don't think mocking now is worth it here, but ideally that is what should be done.

@Fluder-Paradyne Fluder-Paradyne force-pushed the exclude-broken-pods branch 2 times, most recently from c1f1255 to 086514c Compare December 30, 2024 10:22
Signed-off-by: Fluder-Paradyne <[email protected]>
…essages

- Add sync and increase timeout for test

Signed-off-by: Fluder-Paradyne <[email protected]>
Projects
Status: Ready for final review
Development

Successfully merging this pull request may close these issues.

Logs API Stops Functioning When a Pod Fails
4 participants