Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

panic: valid partial linearization returned non-ok result from model step #25

Open
jayaprabhakar opened this issue Dec 21, 2024 · 7 comments
Labels

Comments

@jayaprabhakar
Copy link

When testing a history using porcupine, I got this error once. It wasn't reproducible, but I still wanted to report it, just in case.

The history matched as linearizable and but failed to generate the visualization.

panic: valid partial linearization returned non-ok result from model step [recovered]
        panic: valid partial linearization returned non-ok result from model step
 
goroutine 284 [running]:
testing.tRunner.func1.2({0x100a3dfc0, 0x100af0aa0})
        /Users/jp/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1632 +0x2c4
testing.tRunner.func1()
        /Users/jp/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1635 +0x47c
panic({0x100a3dfc0?, 0x100af0aa0?})
        /Users/jp/go/pkg/mod/golang.org/[email protected]/src/runtime/panic.go:785 +0x124
github.com/anishathalye/porcupine.computeVisualizationData({0x100aedc28, 0x100aedc30, 0x100aed3a8, 0xc0371da018, 0x100aed3b0, 0x100aed3b8, 0x100aed3c0}, {{0xc037865368, 0x1, 0x1}, ...})
        /Users/jp/go/pkg/mod/github.com/anishathalye/[email protected]/visualization.go:70 +0xa48
github.com/anishathalye/porcupine.Visualize({0x0, 0x0, 0x100aed3a8, 0xc0371da018, 0x100aed3b0, 0x100aed3b8, 0x100aed3c0}, {{0xc037865368, 0x1, 0x1}, ...}, ...)
        /Users/jp/go/pkg/mod/github.com/anishathalye/[email protected]/visualization.go:103 +0xd8
github.com/anishathalye/porcupine.VisualizePath({0x0, 0x0, 0x100aed3a8, 0xc0371da018, 0x100aed3b0, 0x100aed3b8, 0x100aed3c0}, {{0xc037865368, 0x1, 0x1}, ...}, ...)
        /Users/jp/go/pkg/mod/github.com/anishathalye/[email protected]/visualization.go:127 +0x144
github.com/slatedb/slatedb-go/fizztest.TestLinearizability(0xc0536e01a0)
        /Users/jp/src/slatedb-go/fizztest/mbt_test.go:644 +0x3cc4
testing.tRunner(0xc0536e01a0, 0x100aed300)
        /Users/jp/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1690 +0x188
created by testing.(*T).Run in goroutine 1
        /Users/jp/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1743 +0x5e4
FAIL    github.com/slatedb/slatedb-go/fizztest  133.153s
FAIL

I somehow suspect this could be caused due to a trick that I use to test systems that are not actually linearizable in traditional definition but follows a specific model. For example, if a method TransferMultiple transfers 3 dollars in 3 steps with $1 in each step.

func TransferMultiple:
    atomic:
        a += 1
        b -= 1
    atomic:
        a += 1
        b -= 1
    atomic:
        a += 1
        b -= 1
   

Here this operation will not appear as a single atomic action, but 3 individual atomic actions. At the same time, I need to ensure each inc a and dec b happens atomically.
Therefore when generating the history, when TransferMultiple operation is called, I create 2 additional virtual operations with exactly same start and end times.

... more operations
Operation:  19: {Client: 1 Input: {op: req-16} Call: 382167 Output: {} Return: 489292}
Operation:  20: {Client: 1 Input: {op: req-16 } Call: 382167 Output: {} Return: 489292}
Operation:  21: {Client: 1 Input: {op: TransferMultiple} Call: 382167 Output: {} Return: 489292}
... more operations

The important thing here is, the same Client: 1 did 3 things at the same time. From my understanding, this is generally not a typical expectation as a single thread should have a single operation at any point in time.

(Here, Ideally, I would've liked the TransferMultiple call to be listed first, but the porcupine HTML generator puts the last listed operation title over the other two so I need to reorder it)

The model that uses the state graph from the formal specification would put mark a specific order (21, 19, 20) as linearizable.

I am not sure if this caused the panic, but I am wondering if this could have caused the issue. Since I cannot reproduce this issue, I couldn't narrow down the cause.

@anishathalye
Copy link
Owner

From my understanding, this is generally not a typical expectation as a single thread should have a single operation at any point in time.

Yes, but Porcupine's linearizability checker does not actually rely on this / doesn't even look at the Client IDs. The example history you gave, from the linearizability checker's perspective, is equivalent to:

... more operations
Operation:  19: {Client: 1 Input: {op: req-16} Call: 382167 Output: {} Return: 489292}
Operation:  20: {Client: 2 Input: {op: req-16 } Call: 382167 Output: {} Return: 489292}
Operation:  21: {Client: 3 Input: {op: TransferMultiple} Call: 382167 Output: {} Return: 489292}
... more operations

To make this more readable in the visualizer, you could consider synthesizing multiple client IDs when you have such a situation, like 1a, 1b, 1c. Though Porcupine Client IDs are integers, so you could do something like use the units place to distinguish multiple concurrent operations from the same "Client", like 10, 11, 12.

I am not sure if this caused the panic, but I am wondering if this could have caused the issue. Since I cannot reproduce this issue, I couldn't narrow down the cause.

I don't think the above should cause this issue. I think this means that either (1) your model's behavior doesn't satisfy Porcupine's requirement on models (e.g., your model is nondeterministic, or your model's Equal() function has a bug, or your model's Partition() function has a bug), or (2) there is a bug in the Porcupine library itself.

If you can add persistence so you save the history in case you encounter this again, and you can share the history + model, that would be really helpful for investigating further.

@jayaprabhakar
Copy link
Author

Thanks @anishathalye. This narrows down what I should look for.

My model does not override Partition() function and the Equal() is a trivial. My Step function is non trivial though :( and it also looks at 700MB of state graph.

I will look through the code again for any unintended non-determinism. My prime suspect is map iteration. (My model also relies on json but I don't use json equal as Go's native json implementation randomizes order and some other aspects)

Within the porcupine's code, there is one map iteration checker.go:330 in the checkParallel function.

partialLinearizations := make([][][]int, len(history))
for i := 0; i < len(history); i++ {
	var partials [][]int
	// turn longest into a set of unique linearizations
	set := make(map[*[]int]struct{})
	for _, v := range longest[i] {
		if v != nil {
			set[v] = struct{}{}
		}
	}
	for k := range set {                // <-------- will this iteration order have any impact?
		arr := make([]int, len(*k))
		copy(arr, *k)
		partials = append(partials, arr)
	}
	partialLinearizations[i] = partials
}

I'll update you if I could reproduce this issue again.

@anishathalye
Copy link
Owner

The map iteration order you pointed to in Porcupine shouldn't cause any issues.

@omidrostamabadi
Copy link

omidrostamabadi commented Jan 10, 2025

I got the exact same error. After hours of debugging, I found the cause of this error in my case is using duplicate IDs. For this history I got the error (The numbers in front of Event are IDs):

Event 0: Kind: CallEvent Operation: PUT Key: 0 Value: 280297596632815 Output Value: 0           
Event 0: Kind: ReturnEvent Operation: PUT Key: 0 Value: 280297596632815 Output Value: 0         
Event 1: Kind: CallEvent Operation: PUT Key: 4294967296 Value: 280297596632815 Output Value: 0  
Event 1: Kind: ReturnEvent Operation: PUT Key: 4294967296 Value: 280297596632815 Output Value: 0
Event 2: Kind: CallEvent Operation: PUT Key: 8589934592 Value: 280297596632815 Output Value: 0
Event 2: Kind: ReturnEvent Operation: PUT Key: 8589934592 Value: 280297596632815 Output Value: 0
Event 1: Kind: CallEvent Operation: GET Key: 8589934592 Value: 1153620254 Output Value: 0
Event 1: Kind: ReturnEvent Operation: GET Key: 8589934592 Value: 0 Output Value: 280297596632815
Event 2: Kind: CallEvent Operation: GET Key: 8589934592 Value: 1730390528 Output Value: 0
Event 2: Kind: ReturnEvent Operation: GET Key: 8589934592 Value: 0 Output Value: 280297596632815
Event 3: Kind: CallEvent Operation: GET Key: 0 Value: 204931885 Output Value: 0
Event 3: Kind: ReturnEvent Operation: GET Key: 0 Value: 0 Output Value: 280297596632815
Event 4: Kind: CallEvent Operation: GET Key: 0 Value: 1298809402 Output Value: 0
Event 4: Kind: ReturnEvent Operation: GET Key: 0 Value: 0 Output Value: 280297596632815
Event 5: Kind: CallEvent Operation: GET Key: 0 Value: 1146639389 Output Value: 0
Event 5: Kind: ReturnEvent Operation: GET Key: 0 Value: 0 Output Value: 280297596632815

I had three PUT operations followed by GETs. There's only one outstanding operation at a time, so return event immediately follows call event. After using distinct IDs for GETs the problem went away.

The interesting point is that during CheckEvents or CheckEventsVerbose I didn't get any errors. It only happened when I tried to call 'Visualize'. More specifically, during CheckEventsVerbose, I had a bunch of PUTs followed by GETs. On the first GET operation, CheckEventsVerbose returned false to indicate the history is not linearizable. Then, on calling Visualize, one PUT was executed in STEP, and followed by a GET which lead to the error I guess. The following dump is for a different history from what I showed above. The prints are inside the STEP function:

Value for Putkey 0 is 280297596632815
Value for Putkey 1 is 280297596632815
Value for Putkey 4294967296 is 280297596632815
Value for Putkey 4294967297 is 280297596632815
Value for Putkey 8589934592 is 280297596632815
Value for Putkey 8589934593 is 280297596632815
Value for Putkey 12884901888 is 280297596632815
Value for Putkey 12884901889 is 280297596632815
Value for Putkey 17179869184 is 280297596632815
Value for Putkey 17179869185 is 280297596632815
Value for Putkey 21474836480 is 280297596632815
Value for Putkey 21474836481 is 280297596632815
Value for Putkey 25769803776 is 280297596632815
Value for Putkey 25769803777 is 280297596632815
Value for Putkey 30064771072 is 280297596632815
Value for Putkey 30064771073 is 280297596632815
Value for Putkey 34359738368 is 280297596632815
Value for Putkey 34359738369 is 280297596632815
Value for Putkey 38654705664 is 280297596632815
Value for Putkey 38654705665 is 280297596632815
Value for Getkey 30064771073 is 280297596632815
History is NOT linearizable! <<<<-------- CheckEventsVerbose done
Value for Putkey 0 is 280297596632815 <<<<-------- Visualize start
Error: Get for 30064771073 didn't return anything
panic: valid partial linearization returned non-ok result from model step

goroutine 1 [running]:
github.com/anishathalye/porcupine.computeVisualizationData({0x4e18f0, 0x4e18d8, 0x4e18f8, 0x4e1900, 0x4e1908, 0x4e1910, 0x4e18d0}, {{0xc0000b4030, 0x1, 0x1}, ...})
        /users/omidr/porcupine/visualization.go:137 +0x958
github.com/anishathalye/porcupine.Visualize({0x4e18f0, 0x0, 0x4e18f8, 0x4e1900, 0x4e1908, 0x4e1910, 0x0}, {{0xc0000b4030, 0x1, 0x1}, ...}, ...)
        /users/omidr/porcupine/visualization.go:179 +0xce
main.main()
        /users/omidr/porcupine/tests/test_blob.go:39 +0x59e
exit status 2

@anishathalye
Copy link
Owner

Yes, IDs need to be unique to a single operation or a single call/return pair for an event, this is a precondition of all the Check* functions. I guess the documentation wasn't clear enough about that? In the case that this precondition is violated, the functions provide no guarantee about their behavior.

Some things we could do:

  • Improve the documentation to clarify this
  • Add a validation step inside all of the Check* functions to try to figure out if the history is well-formed (e.g., checking for duplicate IDs, checking that there is a call + return for every operation, etc.). This has the downside that it adds run-time overhead for every caller, but the overhead is probably small compared to the linearizability check itself.

@omidrostamabadi
Copy link

I think it’s clear that IDs need to be unique, so adding to the doc probably won’t add much. It’s just if the trace generator has a bug, like what happened to me, the error message could be misleading. I guess your model is the suspect when you see an error like this. Anyway, I’d rather pay that small runtime validation overhead.

@jayaprabhakar
Copy link
Author

CheckEvent* functions have IDs, but in my case, I use CheckOperation* functions, as they don't have IDs, that may not be the issue. that said, I couldn't reliably reproduce this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants