panic: valid partial linearization returned non-ok result from model step #25

jayaprabhakar · 2024-12-21T07:37:49Z

When testing a history using porcupine, I got this error once. It wasn't reproducible, but I still wanted to report it, just in case.

The history matched as linearizable and but failed to generate the visualization.

panic: valid partial linearization returned non-ok result from model step [recovered]
        panic: valid partial linearization returned non-ok result from model step
 
goroutine 284 [running]:
testing.tRunner.func1.2({0x100a3dfc0, 0x100af0aa0})
        /Users/jp/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1632 +0x2c4
testing.tRunner.func1()
        /Users/jp/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1635 +0x47c
panic({0x100a3dfc0?, 0x100af0aa0?})
        /Users/jp/go/pkg/mod/golang.org/[email protected]/src/runtime/panic.go:785 +0x124
github.com/anishathalye/porcupine.computeVisualizationData({0x100aedc28, 0x100aedc30, 0x100aed3a8, 0xc0371da018, 0x100aed3b0, 0x100aed3b8, 0x100aed3c0}, {{0xc037865368, 0x1, 0x1}, ...})
        /Users/jp/go/pkg/mod/github.com/anishathalye/[email protected]/visualization.go:70 +0xa48
github.com/anishathalye/porcupine.Visualize({0x0, 0x0, 0x100aed3a8, 0xc0371da018, 0x100aed3b0, 0x100aed3b8, 0x100aed3c0}, {{0xc037865368, 0x1, 0x1}, ...}, ...)
        /Users/jp/go/pkg/mod/github.com/anishathalye/[email protected]/visualization.go:103 +0xd8
github.com/anishathalye/porcupine.VisualizePath({0x0, 0x0, 0x100aed3a8, 0xc0371da018, 0x100aed3b0, 0x100aed3b8, 0x100aed3c0}, {{0xc037865368, 0x1, 0x1}, ...}, ...)
        /Users/jp/go/pkg/mod/github.com/anishathalye/[email protected]/visualization.go:127 +0x144
github.com/slatedb/slatedb-go/fizztest.TestLinearizability(0xc0536e01a0)
        /Users/jp/src/slatedb-go/fizztest/mbt_test.go:644 +0x3cc4
testing.tRunner(0xc0536e01a0, 0x100aed300)
        /Users/jp/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1690 +0x188
created by testing.(*T).Run in goroutine 1
        /Users/jp/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1743 +0x5e4
FAIL    github.com/slatedb/slatedb-go/fizztest  133.153s
FAIL

I somehow suspect this could be caused due to a trick that I use to test systems that are not actually linearizable in traditional definition but follows a specific model. For example, if a method TransferMultiple transfers 3 dollars in 3 steps with $1 in each step.

func TransferMultiple:
    atomic:
        a += 1
        b -= 1
    atomic:
        a += 1
        b -= 1
    atomic:
        a += 1
        b -= 1

Here this operation will not appear as a single atomic action, but 3 individual atomic actions. At the same time, I need to ensure each inc a and dec b happens atomically.
Therefore when generating the history, when TransferMultiple operation is called, I create 2 additional virtual operations with exactly same start and end times.

... more operations
Operation:  19: {Client: 1 Input: {op: req-16} Call: 382167 Output: {} Return: 489292}
Operation:  20: {Client: 1 Input: {op: req-16 } Call: 382167 Output: {} Return: 489292}
Operation:  21: {Client: 1 Input: {op: TransferMultiple} Call: 382167 Output: {} Return: 489292}
... more operations

The important thing here is, the same Client: 1 did 3 things at the same time. From my understanding, this is generally not a typical expectation as a single thread should have a single operation at any point in time.

(Here, Ideally, I would've liked the TransferMultiple call to be listed first, but the porcupine HTML generator puts the last listed operation title over the other two so I need to reorder it)

The model that uses the state graph from the formal specification would put mark a specific order (21, 19, 20) as linearizable.

I am not sure if this caused the panic, but I am wondering if this could have caused the issue. Since I cannot reproduce this issue, I couldn't narrow down the cause.

The text was updated successfully, but these errors were encountered:

anishathalye · 2024-12-21T16:59:35Z

From my understanding, this is generally not a typical expectation as a single thread should have a single operation at any point in time.

Yes, but Porcupine's linearizability checker does not actually rely on this / doesn't even look at the Client IDs. The example history you gave, from the linearizability checker's perspective, is equivalent to:

... more operations
Operation:  19: {Client: 1 Input: {op: req-16} Call: 382167 Output: {} Return: 489292}
Operation:  20: {Client: 2 Input: {op: req-16 } Call: 382167 Output: {} Return: 489292}
Operation:  21: {Client: 3 Input: {op: TransferMultiple} Call: 382167 Output: {} Return: 489292}
... more operations

To make this more readable in the visualizer, you could consider synthesizing multiple client IDs when you have such a situation, like 1a, 1b, 1c. Though Porcupine Client IDs are integers, so you could do something like use the units place to distinguish multiple concurrent operations from the same "Client", like 10, 11, 12.

I am not sure if this caused the panic, but I am wondering if this could have caused the issue. Since I cannot reproduce this issue, I couldn't narrow down the cause.

I don't think the above should cause this issue. I think this means that either (1) your model's behavior doesn't satisfy Porcupine's requirement on models (e.g., your model is nondeterministic, or your model's Equal() function has a bug, or your model's Partition() function has a bug), or (2) there is a bug in the Porcupine library itself.

If you can add persistence so you save the history in case you encounter this again, and you can share the history + model, that would be really helpful for investigating further.

jayaprabhakar · 2024-12-22T04:49:40Z

Thanks @anishathalye. This narrows down what I should look for.

My model does not override Partition() function and the Equal() is a trivial. My Step function is non trivial though :( and it also looks at 700MB of state graph.

I will look through the code again for any unintended non-determinism. My prime suspect is map iteration. (My model also relies on json but I don't use json equal as Go's native json implementation randomizes order and some other aspects)

Within the porcupine's code, there is one map iteration checker.go:330 in the checkParallel function.

partialLinearizations := make([][][]int, len(history))
for i := 0; i < len(history); i++ {
	var partials [][]int
	// turn longest into a set of unique linearizations
	set := make(map[*[]int]struct{})
	for _, v := range longest[i] {
		if v != nil {
			set[v] = struct{}{}
		}
	}
	for k := range set {                // <-------- will this iteration order have any impact?
		arr := make([]int, len(*k))
		copy(arr, *k)
		partials = append(partials, arr)
	}
	partialLinearizations[i] = partials
}

I'll update you if I could reproduce this issue again.

anishathalye · 2024-12-22T18:36:36Z

The map iteration order you pointed to in Porcupine shouldn't cause any issues.

omidrostamabadi · 2025-01-10T05:17:41Z

I got the exact same error. After hours of debugging, I found the cause of this error in my case is using duplicate IDs. For this history I got the error (The numbers in front of Event are IDs):

Event 0: Kind: CallEvent Operation: PUT Key: 0 Value: 280297596632815 Output Value: 0           
Event 0: Kind: ReturnEvent Operation: PUT Key: 0 Value: 280297596632815 Output Value: 0         
Event 1: Kind: CallEvent Operation: PUT Key: 4294967296 Value: 280297596632815 Output Value: 0  
Event 1: Kind: ReturnEvent Operation: PUT Key: 4294967296 Value: 280297596632815 Output Value: 0
Event 2: Kind: CallEvent Operation: PUT Key: 8589934592 Value: 280297596632815 Output Value: 0
Event 2: Kind: ReturnEvent Operation: PUT Key: 8589934592 Value: 280297596632815 Output Value: 0
Event 1: Kind: CallEvent Operation: GET Key: 8589934592 Value: 1153620254 Output Value: 0
Event 1: Kind: ReturnEvent Operation: GET Key: 8589934592 Value: 0 Output Value: 280297596632815
Event 2: Kind: CallEvent Operation: GET Key: 8589934592 Value: 1730390528 Output Value: 0
Event 2: Kind: ReturnEvent Operation: GET Key: 8589934592 Value: 0 Output Value: 280297596632815
Event 3: Kind: CallEvent Operation: GET Key: 0 Value: 204931885 Output Value: 0
Event 3: Kind: ReturnEvent Operation: GET Key: 0 Value: 0 Output Value: 280297596632815
Event 4: Kind: CallEvent Operation: GET Key: 0 Value: 1298809402 Output Value: 0
Event 4: Kind: ReturnEvent Operation: GET Key: 0 Value: 0 Output Value: 280297596632815
Event 5: Kind: CallEvent Operation: GET Key: 0 Value: 1146639389 Output Value: 0
Event 5: Kind: ReturnEvent Operation: GET Key: 0 Value: 0 Output Value: 280297596632815

I had three PUT operations followed by GETs. There's only one outstanding operation at a time, so return event immediately follows call event. After using distinct IDs for GETs the problem went away.

The interesting point is that during CheckEvents or CheckEventsVerbose I didn't get any errors. It only happened when I tried to call 'Visualize'. More specifically, during CheckEventsVerbose, I had a bunch of PUTs followed by GETs. On the first GET operation, CheckEventsVerbose returned false to indicate the history is not linearizable. Then, on calling Visualize, one PUT was executed in STEP, and followed by a GET which lead to the error I guess. The following dump is for a different history from what I showed above. The prints are inside the STEP function:

Value for Putkey 0 is 280297596632815
Value for Putkey 1 is 280297596632815
Value for Putkey 4294967296 is 280297596632815
Value for Putkey 4294967297 is 280297596632815
Value for Putkey 8589934592 is 280297596632815
Value for Putkey 8589934593 is 280297596632815
Value for Putkey 12884901888 is 280297596632815
Value for Putkey 12884901889 is 280297596632815
Value for Putkey 17179869184 is 280297596632815
Value for Putkey 17179869185 is 280297596632815
Value for Putkey 21474836480 is 280297596632815
Value for Putkey 21474836481 is 280297596632815
Value for Putkey 25769803776 is 280297596632815
Value for Putkey 25769803777 is 280297596632815
Value for Putkey 30064771072 is 280297596632815
Value for Putkey 30064771073 is 280297596632815
Value for Putkey 34359738368 is 280297596632815
Value for Putkey 34359738369 is 280297596632815
Value for Putkey 38654705664 is 280297596632815
Value for Putkey 38654705665 is 280297596632815
Value for Getkey 30064771073 is 280297596632815
History is NOT linearizable! <<<<-------- CheckEventsVerbose done
Value for Putkey 0 is 280297596632815 <<<<-------- Visualize start
Error: Get for 30064771073 didn't return anything
panic: valid partial linearization returned non-ok result from model step

goroutine 1 [running]:
github.com/anishathalye/porcupine.computeVisualizationData({0x4e18f0, 0x4e18d8, 0x4e18f8, 0x4e1900, 0x4e1908, 0x4e1910, 0x4e18d0}, {{0xc0000b4030, 0x1, 0x1}, ...})
        /users/omidr/porcupine/visualization.go:137 +0x958
github.com/anishathalye/porcupine.Visualize({0x4e18f0, 0x0, 0x4e18f8, 0x4e1900, 0x4e1908, 0x4e1910, 0x0}, {{0xc0000b4030, 0x1, 0x1}, ...}, ...)
        /users/omidr/porcupine/visualization.go:179 +0xce
main.main()
        /users/omidr/porcupine/tests/test_blob.go:39 +0x59e
exit status 2

anishathalye · 2025-01-10T17:59:05Z

Yes, IDs need to be unique to a single operation or a single call/return pair for an event, this is a precondition of all the Check* functions. I guess the documentation wasn't clear enough about that? In the case that this precondition is violated, the functions provide no guarantee about their behavior.

Some things we could do:

Improve the documentation to clarify this
Add a validation step inside all of the Check* functions to try to figure out if the history is well-formed (e.g., checking for duplicate IDs, checking that there is a call + return for every operation, etc.). This has the downside that it adds run-time overhead for every caller, but the overhead is probably small compared to the linearizability check itself.

omidrostamabadi · 2025-01-10T19:14:09Z

I think it’s clear that IDs need to be unique, so adding to the doc probably won’t add much. It’s just if the trace generator has a bug, like what happened to me, the error message could be misleading. I guess your model is the suspect when you see an error like this. Anyway, I’d rather pay that small runtime validation overhead.

jayaprabhakar · 2025-01-12T05:47:57Z

CheckEvent* functions have IDs, but in my case, I use CheckOperation* functions, as they don't have IDs, that may not be the issue. that said, I couldn't reliably reproduce this issue.

anishathalye added the bug label Dec 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

panic: valid partial linearization returned non-ok result from model step #25

panic: valid partial linearization returned non-ok result from model step #25

jayaprabhakar commented Dec 21, 2024

anishathalye commented Dec 21, 2024

jayaprabhakar commented Dec 22, 2024

anishathalye commented Dec 22, 2024

omidrostamabadi commented Jan 10, 2025 •

edited

Loading

anishathalye commented Jan 10, 2025

omidrostamabadi commented Jan 10, 2025

jayaprabhakar commented Jan 12, 2025

panic: valid partial linearization returned non-ok result from model step #25

panic: valid partial linearization returned non-ok result from model step #25

Comments

jayaprabhakar commented Dec 21, 2024

anishathalye commented Dec 21, 2024

jayaprabhakar commented Dec 22, 2024

anishathalye commented Dec 22, 2024

omidrostamabadi commented Jan 10, 2025 • edited Loading

anishathalye commented Jan 10, 2025

omidrostamabadi commented Jan 10, 2025

jayaprabhakar commented Jan 12, 2025

omidrostamabadi commented Jan 10, 2025 •

edited

Loading