Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tests: measure state lock #14874

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

sergiocazzolato
Copy link
Collaborator

@sergiocazzolato sergiocazzolato commented Dec 18, 2024

This change allows to collect the time the snapd state file is locked. To collect the results it is required to include the test snapd-state-lock and make sure just 1 worker is used.

SPREAD_SNAPD_STATE_LOCK_THRESHOLD_MS=15 SPREAD_USE_SNAPD_SNAP_URL= SPREAD_USE_PREBUILT_SNAPD_SNAP=false ./run-spread -artifacts .artifacts google:ubuntu-22.04-64:tests/main/snapd-state-lock google:ubuntu-22.04-64:tests/smoke/

this is an example of the smoke suite with threshold=15ms: https://paste.ubuntu.com/p/vkcMdcGTnM/
this is an example of the smoke suite with threshold=100ms: https://paste.ubuntu.com/p/XTn8jZ62pJ/

@sergiocazzolato sergiocazzolato marked this pull request as draft December 18, 2024 02:34
@sergiocazzolato sergiocazzolato force-pushed the tests-measure-state-lock branch from 0d1c64f to f46e537 Compare December 18, 2024 17:40
@sergiocazzolato sergiocazzolato added the Needs Samuele review Needs a review from Samuele before it can land label Dec 19, 2024
@sergiocazzolato sergiocazzolato marked this pull request as ready for review December 19, 2024 16:51
Copy link
Collaborator

@pedronis pedronis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question

@@ -139,6 +142,7 @@ func (s *State) Modified() bool {

// Lock acquires the state lock.
func (s *State) Lock() {
s.lockStart = osutil.GetLockStart()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be done with the lock taken? otherwise just the tentative themselves to lock will race/dirty lockStart

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, I'll update this

@pedronis pedronis requested a review from bboozzoo December 20, 2024 14:03
@sergiocazzolato
Copy link
Collaborator Author

sergiocazzolato commented Jan 6, 2025

These are the times after the fix proposed by @pedronis

lock-times-smoke.txt

I see it locks about 820ms when the snapd daemon is started (this happens in all the tests)

@pedronis pedronis self-requested a review January 6, 2025 15:54
@@ -159,6 +163,7 @@ func (s *State) writing() {
func (s *State) unlock() {
atomic.AddInt32(&s.muC, -1)
s.mu.Unlock()
osutil.MaybeSaveLockTime(s.lockStart)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this only needs to happen with the lock taken I think

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be ok to do with the state lock released if we flock the log file. With the state lock held, there would likely be no contention for flock on the log file. OTOH with the lock released, we'd rely on flock to serialize writes, but this would not keep the state lock unnecessarily held.

FWIW, unlocking may reschedule, so assuming we use flock for serializing writes, the timing may end up being incorrect given that the time is captured inside MaybeSaveLockTime(). The end time will likely need to be captured like so:

lockEnd := time.Now()
s.mu.Unlock()
osutil.MaybeSaveLockTime(s.lockStart, lockEnd)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note, I missed that s.lockStart is used after releasing the lock, it should ofc be read into a variable before the lock is released

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

udpated

@sergiocazzolato
Copy link
Collaborator Author

sergiocazzolato commented Jan 6, 2025

@pedronis This is the result of the main suite 145 tests (using a machine with 2 cores and threshold of 800ms),
https://pastebin.canonical.com/p/Km4DrSskYD/

I see some worst times doing:
github.com/snapcore/snapd/daemon.(*Daemon).Start
github.com/snapcore/snapd/overlord/assertstate.doValidateSnap

should we specially test those cases?

defer lockfile.Close()

pc := make([]uintptr, 10)
n := runtime.Callers(0, pc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

skip should likely be set to 2, to skip traceCallers and MaybeSaveLockTime

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

Comment on lines 50 to 52
for i := 0; i < n; i++ {
f := runtime.FuncForPC(pc[i])
file, line := f.FileLine(pc[i])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per docs of https://pkg.go.dev/runtime?utm_source=godoc#Callers, this should use runtime.CallersFrames()

// MaybeSaveLockTime allows to save lock times when this overpass the threshold
// defined by through the SNAPD_STATE_LOCK_THRESHOLD_MS environment settings.
func MaybeSaveLockTime(lockStart int64) {
lockEnd := time.Now().UnixNano() / int64(time.Millisecond)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can prob be done after both getenvs

@@ -159,6 +163,7 @@ func (s *State) writing() {
func (s *State) unlock() {
atomic.AddInt32(&s.muC, -1)
s.mu.Unlock()
osutil.MaybeSaveLockTime(s.lockStart)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be ok to do with the state lock released if we flock the log file. With the state lock held, there would likely be no contention for flock on the log file. OTOH with the lock released, we'd rely on flock to serialize writes, but this would not keep the state lock unnecessarily held.

FWIW, unlocking may reschedule, so assuming we use flock for serializing writes, the timing may end up being incorrect given that the time is captured inside MaybeSaveLockTime(). The end time will likely need to be captured like so:

lockEnd := time.Now()
s.mu.Unlock()
osutil.MaybeSaveLockTime(s.lockStart, lockEnd)

fmt.Fprintf(os.Stderr, "could not retrieve log file, SNAPD_STATE_LOCK_FILE env var required")
}

lockfile, err := os.OpenFile(lockfilePath, os.O_APPEND|os.O_WRONLY|os.O_CREATE, 0600)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could flock the file after opening it, so that we serialize the writes but the state lock could be unlocked earlier

f := runtime.FuncForPC(pc[i])
file, line := f.FileLine(pc[i])
formattedLine = fmt.Sprintf("%s:%d %s\n", file, line, f.Name())
if _, err = lockfile.WriteString(formattedLine); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could maybe use runtime.Profile to achieve the same thing but do it in more standard way.

)

func traceCallers(description string) {
lockfilePath := os.Getenv("SNAPD_STATE_LOCK_FILE")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd avoid using lock* in the name, to not confuse future self or other readers that this is not a lock file, but a file with log of state lock times (and their context). So maybe logFilePath.

if threshold <= 0 {
return
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"SNAPD_STATE_LOCK_FILE" could be tested here

@@ -111,6 +112,8 @@ type State struct {
// task/changes observing
taskHandlers map[int]func(t *Task, old, new Status) (remove bool)
changeHandlers map[int]func(chg *Change, old, new Status)

lockStart int64
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
lockStart int64
lockWait int64
lockStart int64

Comment on lines 145 to 147
s.mu.Lock()
atomic.AddInt32(&s.muC, 1)
s.lockStart = osutil.GetLockStart()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
s.mu.Lock()
atomic.AddInt32(&s.muC, 1)
s.lockStart = osutil.GetLockStart()
lockWait := osutil.LockTimestamp()
s.mu.Lock()
atomic.AddInt32(&s.muC, 1)
s.lockStart = osutil.LockTimestamp()
s.lockWait = lockWait


// MaybeSaveLockTime allows to save lock times when this overpass the threshold
// defined by through the SNAPD_STATE_LOCK_THRESHOLD_MS environment settings.
func MaybeSaveLockTime(lockStart int64) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func MaybeSaveLockTime(lockStart int64) {
func MaybeSaveLockTime(lockWait, lockStart, lockEnd int64) {

Comment on lines 165 to 166
s.mu.Unlock()
osutil.MaybeSaveLockTime(s.lockStart)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
s.mu.Unlock()
osutil.MaybeSaveLockTime(s.lockStart)
lockWait, lockStart := s.lockWait, s.lockStart
s.lockWait, s.lockStart = 0, 0
s.mu.Unlock()
osutil.MaybeSaveLockTime(lockWait, lockStart)

@@ -159,6 +163,7 @@ func (s *State) writing() {
func (s *State) unlock() {
atomic.AddInt32(&s.muC, -1)
s.mu.Unlock()
osutil.MaybeSaveLockTime(s.lockStart)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note, I missed that s.lockStart is used after releasing the lock, it should ofc be read into a variable before the lock is released

Comment on lines 78 to 82
elapsedMilliseconds := lockEnd - lockStart
if elapsedMilliseconds > threshold {
formattedLine := fmt.Sprintf("Elapsed Time: %d milliseconds", elapsedMilliseconds)
traceCallers(formattedLine)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
elapsedMilliseconds := lockEnd - lockStart
if elapsedMilliseconds > threshold {
formattedLine := fmt.Sprintf("Elapsed Time: %d milliseconds", elapsedMilliseconds)
traceCallers(formattedLine)
}
heldMs := lockEnd - lockStart
waitedMs := lockStart - lockWait
if heldMs > threshold || waitedMs > threshold {
formattedLine := fmt.Sprintf("waited %d ms held", waitedMs, heldMs)
traceCallers(formattedLine)
}

Copy link

codecov bot commented Jan 8, 2025

Codecov Report

Attention: Patch coverage is 54.54545% with 5 lines in your changes missing coverage. Please review.

Project coverage is 78.22%. Comparing base (24a0034) to head (af6f416).
Report is 88 commits behind head on master.

Files with missing lines Patch % Lines
osutil/statelock_fake.go 0.00% 3 Missing ⚠️
osutil/flock.go 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #14874      +/-   ##
==========================================
+ Coverage   78.20%   78.22%   +0.01%     
==========================================
  Files        1151     1154       +3     
  Lines      151396   152804    +1408     
==========================================
+ Hits       118402   119528    +1126     
- Misses      25662    25895     +233     
- Partials     7332     7381      +49     
Flag Coverage Δ
unittests 78.22% <54.54%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Samuele review Needs a review from Samuele before it can land
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants