Add support for serialization of large caches #46
Conversation
Codecov Report: all modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##           master      #46      +/-   ##
==========================================
+ Coverage   78.45%   81.97%   +3.52%
==========================================
  Files           2        3       +1
  Lines         297      355      +58
==========================================
+ Hits          233      291      +58
  Misses         64       64

View full report in Codecov by Sentry.
Co-authored-by: Eric Hanson <[email protected]>
Thanks for the review @ericphanson, committed the changes you suggested!
Left a few more comments; overall this looks pretty good to me, though I've never written a serialize method before, so I'm not sure if there are things to look out for here.
ext/SerializationExt.jl
Outdated
# Create a mapping from memory address to id
node_map = Dict{Ptr, Int}()
Rather than using `Ptr`s, I think it is simpler (and maybe more robust) to use an `IdDict`:
Suggested change:

- # Create a mapping from memory address to id
- node_map = Dict{Ptr, Int}()
+ # Create a mapping from object to id. Here we use `IdDict` to use object identity as the hash.
+ node_map = IdDict{LinkedNode{K}, Int}()
Then this can be indexed simply as `node_map[node] = id`, rather than going through `pointer_from_objref`.
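A minimal sketch (not from the PR) of why `IdDict` is the right fit here: it hashes on object identity (`objectid`) rather than `isequal`, so two equal-but-distinct nodes keep separate entries:

```julia
a = [1]
b = [1]                        # isequal(a, b), but a distinct object

d = Dict{Vector{Int},Int}(a => 1)
d[b] = 2                       # Dict hashes by value, so this overwrites a's entry

id = IdDict{Vector{Int},Int}(a => 1)
id[b] = 2                      # IdDict hashes by identity, so a and b are separate keys

(length(d), length(id))        # (1, 2)
```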
This is great, thanks
Co-authored-by: Eric Hanson <[email protected]>
Co-authored-by: Eric Hanson <[email protected]>
lgtm, but it would be great if someone more familiar with serialization could review as well
ext/SerializationExt.jl
Outdated
# Link the nodes
for (idx, node) in enumerate(nodes)
    node.next = nodes[idx % n_nodes + 1]
    node.prev = nodes[idx == 1 ? n_nodes : idx - 1]
I think you can write these two right-hand sides using `mod1(idx + 1, n_nodes)` and `mod1(idx - 1, n_nodes)`, which I would find clearer.
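For reference (not from the PR), `mod1` is Base's 1-based modular arithmetic, which matches 1-indexed circular traversal exactly:

```julia
# mod1(x, n) maps x into the range 1:n, so the wrap-around cases
# at both ends of the node list come out right automatically.
n_nodes = 5
next_of_last  = mod1(5 + 1, n_nodes)  # 6 wraps to 1
prev_of_first = mod1(1 - 1, n_nodes)  # 0 wraps to 5
```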
Didn't know about the `mod1` function, thanks! That's much cleaner.
test/serializationtests.jl
Outdated
c_node = cache.keyset.first
d_node = deserialized_cache.keyset.first
for i in 1:cache.keyset.length
    c_node.val == d_node.val || @test false
Why not simply `@test c_node.val == d_node.val`?
This is just personal taste: for a large dict of 100k entries, I don't want to add 100k tests that compare each element. I just want one test that fails if any element is different, which I believe these do.
Sure, that is true. On the other hand, I believe the test could also have been run with just a handful of elements in the LRU cache. I haven't timed it, and I expect this is all very fast because it is just `Int`s, but the 100000 did catch my eye as a rather large number for a simple test case.
For small caches, the regular serialization method works fine and doesn't stack-overflow, so we need a big one to really test the new method.
^ exactly right
Well, yes and no, since the new method is now always called, irrespective of the size of the cache. And it is written in such a way that it should not suffer from the same flaws, so I would think that if it passes the test for a smaller cache, that provides sufficient guarantees. But I am fine with the current tests.
Is it clear what was causing the default serialisation strategy to fail? My guess of why it was entering an infinite loop would apply irrespective of the size of the cache, so that is not consistent with the original method working for small caches.
Maybe as a middle ground, the following gives just two tests, but still checks that all values are equal:
@test length(cache.keyset) == length(deserialized_cache.keyset)
@test all(((c_val, d_val),) -> c_val == d_val, zip(cache.keyset, deserialized_cache.keyset))
test/serializationtests.jl
Outdated
d_value, d_node, d_s = deserialized_cache.dict[key]
c_value == d_value || @test false
c_node.val == d_node.val || @test false
c_s == d_s || @test false
Same question for these tests.
This looks great, thanks. I've also added some minor comments and questions.
All great feedback so far, I've learned an unexpected amount about Julia from this simple PR. Thanks!
Do you want another review? I can ask a person in our group that recently implemented some serialisation things for some other cache structures that we have in our code, but he is only available from Friday onwards. I am also fine with having this merged as is.
I'm fine with either, doesn't hurt to get another set of eyes on it, and I've already pointed all my code to my branch anyways.
I think this looks very nice! I left some minor comments which only really change readability, and thus might be a bit subjective as well. Feel free to discard those if you would disagree.
As a more general comment, I think it is possible to avoid swapping the dictionary of linked nodes for a new dictionary that keeps an integer for the order. AFAIK, serialization of a dictionary happens by just serializing the `(key, value)` pairs sequentially; in that sense they already have an order, which you could use as the id. Conveniently, iterating over an `LRU` actually gives you the `key => value` pairs in the exact order you want, because of how iteration of the keyset happens. Thus, you could consider something along the lines of the following, and avoid the intermediate additional dictionary:
Serialization:
write(s.io, Int32(length(lru)))
for (k, (val, node, sz)) in lru
    serialize(s, k)
    serialize(s, val)
    serialize(s, sz)
end
For the deserialization, it should then be possible to just deserialize and, similar to how you implemented it, wrap the keys back in a `LinkedNode`, linking them along the way, because you can now be sure that you are traversing them in order (special-casing the first and last node). (Note that this snippet does not include special cases for dicts that are empty or have a single element.)
Deserialization:
dict = Dict{K, Tuple{V, LRUCache.LinkedNode{K}, Int}}()
n = read(s.io, Int32)
sizehint!(dict, n)

# first entry
k = deserialize(s)
first = node = LRUCache.LinkedNode{K}(k)
val = deserialize(s)
sz = deserialize(s)
dict[k] = (val, node, sz)

# middle entries
for i in 2:n-1
    prev = node
    k = deserialize(s)
    node = LRUCache.LinkedNode{K}(k)
    prev.next = node
    node.prev = prev
    val = deserialize(s)
    sz = deserialize(s)
    dict[k] = (val, node, sz)
end

# last node
prev = node
k = deserialize(s)
node = LRUCache.LinkedNode{K}(k)
prev.next = node
node.prev = prev
val = deserialize(s)
sz = deserialize(s)
dict[k] = (val, node, sz)
node.next = first
first.prev = node
I think I am still missing some things to build the keyset etc., and I did not test anything yet. I think this should result in a smaller file size, and maybe a bit better performance. Of course, depending on the use-case, this might all be overkill, and your implementation is definitely also viable!
Co-authored-by: Lukas <[email protected]>
Co-authored-by: Lukas <[email protected]>
Whoops, seems like I was too fast in suggesting the while-loop to for-loop change; the iterator indeed only returns the values, not the nodes themselves. My apologies!
Thanks for the in-depth feedback @lkdvos! I'll look into implementing your suggested refactor. Also, I suppose I need to undo that last commit now as well, haha.
@lkdvos Finished working on your recommended changes, the serialization code is much cleaner now. Thanks for your input!
Looks good to me; maybe @lkdvos can take a final look and then we can merge.
I think it would be worth testing roundtripping an entirely empty `lru`, as well as one with a single entry, since those look like edge cases in the deserialization implementation. Maybe:
lru = LRU{Int,Int}(; maxsize=10)
for n in 0:5
    if n > 0
        lru[n] = n
    end
    io = IOBuffer()
    serialize(io, lru)
    seekstart(io)
    lru_rt = deserialize(io)
    @test lru_rt isa LRU{Int,Int}
    @test issetequal(collect(lru), collect(lru_rt))
end
One other edge case I can think of is mutable values with shared object identity, like:
lru = LRU(; maxsize=5)
a = b = [1]
lru[1] = a
lru[2] = b
@test lru[1] === lru[2]
# now roundtrip it and check that the above `===` still holds
Good point, honestly not sure how to handle the multiple-references-to-mutable-values case, though.
The regular serializer handles it, so I was hoping it would just work.
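For what it's worth, `Serialization` does track shared references within a single `serialize` call, so identity survives a roundtrip. A minimal standalone check (a sketch, using a plain tuple rather than an `LRU`):

```julia
using Serialization

a = [1]
io = IOBuffer()
serialize(io, (a, a))    # two references to the same array
seekstart(io)
x, y = deserialize(io)
x === y                  # true: shared identity is preserved
```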
Great feedback @ericphanson, worked as expected. I was able to clean everything up; the code is much nicer now.
@Jutho @ericphanson, are there any other changes you recommend before merging?
Sorry, forgot about this. I think it's good to go!
Fixes #45, where serialization of a large LRU might result in a stack overflow. Does not automatically load Serialization, but uses package extensions (available in Julia 1.9+) to overwrite `Serialization.{serialize,deserialize}` when Serialization is loaded.
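For reference, the package-extension mechanism mentioned above is declared in `Project.toml` roughly as follows. This is a sketch: the UUID shown is the standard-library `Serialization` UUID, and the extension name is assumed to match `ext/SerializationExt.jl`.

```toml
[weakdeps]
Serialization = "9e88b42a-f829-5b0c-bbe9-9e923198166b"

[extensions]
SerializationExt = "Serialization"
```

With this in place, Julia 1.9+ automatically loads `ext/SerializationExt.jl` once both LRUCache and Serialization are loaded in the same session.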