
Add support for serialization of large caches #46

Merged (18 commits) on Jan 25, 2024

Conversation

@jarbus (Contributor) commented Jan 12, 2024

Fixes #45, where serializing a large LRU could result in a stack overflow. This PR does not automatically load Serialization; instead it uses package extensions (available in Julia 1.9+) to overwrite Serialization.serialize and Serialization.deserialize when Serialization is loaded.

@codecov-commenter commented Jan 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison: base (3125353) 78.45% vs. head (77d67a0) 81.97%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master      #46      +/-   ##
==========================================
+ Coverage   78.45%   81.97%   +3.52%     
==========================================
  Files           2        3       +1     
  Lines         297      355      +58     
==========================================
+ Hits          233      291      +58     
  Misses         64       64              


Project.toml: outdated review comment, resolved
@jarbus (Contributor, author) commented Jan 14, 2024

Thanks for the review @ericphanson, committed the changes you suggested!

@ericphanson (Member) left a comment:

Left a few more comments; overall this looks pretty good to me, though I've never written a serialize method before, so I'm not sure if there's anything to look out for here.

ext/SerializationExt.jl: outdated review comment, resolved
Comment on lines 9 to 10
# Create a mapping from memory address to id
node_map = Dict{Ptr, Int}()
Member comment:

Rather than using Ptr's, I think it is simpler (and maybe more robust) to use an IdDict:

Suggested change:

- # Create a mapping from memory address to id
- node_map = Dict{Ptr, Int}()
+ # Create a mapping from object to id. Here we use `IdDict` to use object identity as the hash.
+ node_map = IdDict{LinkedNode{K}, Int}()

Then this can be indexed simply as node_map[node] = id rather than using pointer_from_objref.

Contributor (author) reply:

This is great, thanks
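As a standalone illustration of why IdDict fits here (a sketch, not code from the PR): IdDict hashes on object identity (===) rather than isequal/hash, so distinct mutable nodes with equal contents still get distinct entries, which is exactly what mapping each node to an id requires. `Node` below is a hypothetical stand-in for LRUCache's LinkedNode.

```julia
# Sketch: IdDict keys on object identity (===), not on isequal/hash,
# so two distinct mutable nodes with equal contents stay separate.
# `Node` is a stand-in for the package's LinkedNode, not its real type.
mutable struct Node
    val::Int
end

a = Node(1)
b = Node(1)          # equal contents, different object

node_map = IdDict{Node,Int}()
node_map[a] = 1
node_map[b] = 2

@assert length(node_map) == 2   # both nodes kept, no pointer tricks needed
@assert node_map[a] == 1 && node_map[b] == 2
```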

test/serializationtests.jl: outdated review comment, resolved
@ericphanson (Member) left a comment:

LGTM, but it would be great if someone more familiar with serialization could review as well.

# Link the nodes
for (idx, node) in enumerate(nodes)
node.next = nodes[idx % n_nodes + 1]
node.prev = nodes[idx == 1 ? n_nodes : idx - 1]
Collaborator comment:

I think you can write these two right-hand sides as nodes[mod1(idx + 1, n_nodes)] and nodes[mod1(idx - 1, n_nodes)], which I would find clearer.

Contributor (author) reply:

Didn't know about the mod1 function, thanks! That's much cleaner.
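For reference, mod1 is 1-based modular arithmetic: mod1(x, n) returns a value in 1:n, so it wraps indices around a ring of n_nodes elements. A quick check (not from the PR):

```julia
# mod1(x, n) returns a value in 1:n, which suits 1-based circular indexing.
n_nodes = 5

@assert mod1(1 + 1, n_nodes) == 2        # next of the first node
@assert mod1(1 - 1, n_nodes) == 5        # prev of the first node wraps to the end
@assert mod1(n_nodes + 1, n_nodes) == 1  # next of the last node wraps to the front
```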

c_node = cache.keyset.first
d_node = deserialized_cache.keyset.first
for i in 1:cache.keyset.length
c_node.val == d_node.val || @test false
Collaborator comment:

Why not simply @test c_node.val == d_node.val ?

Contributor (author) reply:

This is just personal taste: for a large dict of 100k entries, I don't want to add 100k tests comparing each element. I just want one test that fails if any element differs, which I believe these do.

Collaborator reply:

Sure, that's true. On the other hand, the test could also have used just a handful of elements in the LRU cache, I believe. I haven't timed it, and I expect this is all very fast because it is just Ints, but the 100000 did jump out as a rather large number for a simple test case.

Member reply:

For small caches, the regular serialization method works fine and doesn't stack-overflow, so we need a big one to really test the new method.

Contributor (author) reply:

^ exactly right

Collaborator reply:

Well, yes and no: the new method is now always called, irrespective of the size of the cache. And it is written in such a way that it should not suffer from the same flaws, so I would think that passing the test for a smaller cache provides sufficient guarantees. But I am fine with the current tests.

Is it clear what was causing the default serialization strategy to fail? My guess as to why it entered an infinite loop would apply irrespective of the size of the cache, so that is not consistent with the original method working for small caches.
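One plausible reading of the failure (my guess, consistent with the discussion but not confirmed in this thread): each LinkedNode references its neighbours, so the default serializer walks the ring recursively, one stack frame per node, and only very deep rings exhaust the stack. Cyclic references themselves are fine at small sizes, which would explain why small caches round-trip. A minimal sketch with a hypothetical RingNode type (not the package's):

```julia
using Serialization

# Minimal stand-in for a doubly/singly-linked ring (hypothetical type,
# not LRUCache's). A two-node ring with a cycle serializes fine; the
# suspected problem is recursion depth, which only bites for huge rings.
mutable struct RingNode
    val::Int
    next::RingNode
    RingNode(v) = (n = new(v); n.next = n; n)   # start self-linked
end

first_node = RingNode(1)
first_node.next = RingNode(2)
first_node.next.next = first_node               # close the ring

io = IOBuffer()
serialize(io, first_node)
seekstart(io)
rt = deserialize(io)

@assert rt.val == 1
@assert rt.next.val == 2
@assert rt.next.next === rt   # the cycle survives the round trip
```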

Contributor comment:

Maybe as a middle ground, the following gives a single test (well, two), but still checks that all values are equal:

@test length(cache.keyset) == length(deserialized_cache.keyset)
@test all(((c_val, d_val),) -> c_val == d_val, zip(cache.keyset, deserialized_cache.keyset))

d_value, d_node, d_s = deserialized_cache.dict[key]
c_value == d_value || @test false
c_node.val == d_node.val || @test false
c_s == d_s || @test false
Collaborator comment:

Same question for these tests.

@Jutho (Collaborator) commented Jan 15, 2024

This looks great, thanks. I've also added some minor comments and questions.

@jarbus (Contributor, author) commented Jan 15, 2024

All great feedback so far, I've learned an unexpected amount about Julia from this simple PR. Thanks!

@Jutho (Collaborator) commented Jan 16, 2024

Do you want another review? I can ask someone in our group who recently implemented serialization for some other cache structures we have in our code, but he is only available from Friday onwards. I am also fine with merging this as is.

@jarbus (Contributor, author) commented Jan 17, 2024

I'm fine with either; it doesn't hurt to get another set of eyes on it, and I've already pointed all my code at my branch anyway.

@lkdvos (Contributor) left a comment:

I think this looks very nice! I left some minor comments that only really affect readability, and thus might be a bit subjective as well; feel free to discard them if you disagree.

As a more general comment, I think it is possible to avoid swapping the dictionary of linked nodes for a new dictionary that keeps an integer for the order. AFAIK, serialization of a dictionary happens by just serializing its (key, value) pairs sequentially, so they already have an order, which you could use as the id.

Conveniently, iterating over an LRU actually gives you the key => value pairs in exactly the order you want, because of how iteration of the keyset happens. Thus, you could consider something along the lines of the following and avoid the intermediate additional dictionary:

Serialization:

write(s.io, Int32(length(lru)))
for (k, (val, node, sz)) in lru
    serialize(s, k)
    serialize(s, val)
    serialize(s, sz)
end

For the deserialization, it should then be possible to just deserialize and, similar to your implementation, wrap the keys back in a LinkedNode, linking them along the way, because you can now be sure you are traversing them in order (special-casing the first and last node). (Note that this snippet does not include special cases for dicts that are empty or have a single element.)

Deserialization:

dict = Dict{K, Tuple{V, LRUCache.LinkedNode{K}, Int}}()
n = read(s.io, Int32)
sizehint!(dict, n)
# first entry
k = deserialize(s)
first = node = LRUCache.LinkedNode{K}(k)
val = deserialize(s)
sz = deserialize(s)
dict[k] = (val, node, sz)
# middle entries
for i in 2:n-1
    prev = node
    k = deserialize(s)
    node = LRUCache.LinkedNode{K}(k)
    prev.next = node
    node.prev = prev
    val = deserialize(s)
    sz = deserialize(s)
    dict[k] = (val, node, sz)
end
# last node
prev = node
k = deserialize(s)
node = LRUCache.LinkedNode{K}(k)
prev.next = node
node.prev = prev

val = deserialize(s)
sz = deserialize(s)
dict[k] = (val, node, sz)

node.next = first
first.prev = node

I think I am still missing some pieces to rebuild the keyset etc., and I have not tested anything yet. I think this should result in a smaller file size, and maybe slightly better performance. Of course, depending on the use case, this might all be overkill, and your implementation is definitely also viable!

ext/SerializationExt.jl: three outdated review comments, resolved

@lkdvos (Contributor) commented Jan 17, 2024

Whoops, it seems I was too quick in suggesting the while-loop to for-loop change; the iterator indeed only returns the values, not the nodes themselves. My apologies!

@jarbus (Contributor, author) commented Jan 17, 2024

Thanks for the in-depth feedback @lkdvos! I'll look into implementing your suggested refactor. I suppose I also need to undo that last commit now, haha.

@jarbus (Contributor, author) commented Jan 18, 2024

@lkdvos Finished working on your recommended changes; the serialization code is much cleaner now. Thanks for your input!

@Jutho (Collaborator) commented Jan 18, 2024

Looks good to me; maybe @lkdvos can take a final look and then we can merge.

@ericphanson (Member) left a comment:

I think it would be worth testing round-tripping an entirely empty LRU, as well as one with a single entry, since those look like edge cases in the deserialization implementation. Maybe:

lru = LRU{Int,Int}(; maxsize=10)
for n in 0:5
    if n > 0
        lru[n] = n
    end
    io = IOBuffer()
    serialize(io, lru)
    seekstart(io)
    lru_rt = deserialize(io)
    @test lru_rt isa LRU{Int,Int}
    @test issetequal(collect(lru), collect(lru_rt))
end

One other edge case I can think of is mutable values with shared object identity. Like

lru = LRU(; maxsize=5)
a = b = [1]
lru[1] = a
lru[2] = b
@test lru[1] === lru[2]
# now roundtrip it and check that the above `===` still holds

@jarbus (Contributor, author) commented Jan 19, 2024

Good point; honestly, I'm not sure how to handle the case of multiple references to mutable values, though.

@ericphanson (Member) replied:
The regular serializer handles it, so I was hoping it would just work
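That stdlib behaviour can be checked independently of LRUCache. A sketch using a plain Dict: the stdlib serializer tracks mutable objects, so shared references come back as a single shared object after deserialization.

```julia
using Serialization

# The stdlib serializer tracks mutable objects, so shared references
# are preserved across a round trip.
a = b = [1]                      # one array, two bindings
d = Dict(1 => a, 2 => b)

io = IOBuffer()
serialize(io, d)
seekstart(io)
d_rt = deserialize(io)

@assert d_rt[1] === d_rt[2]      # aliasing preserved
d_rt[1][1] = 42
@assert d_rt[2][1] == 42         # mutation visible through both keys
```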

@jarbus (Contributor, author) commented Jan 19, 2024

Great feedback @ericphanson, worked as expected. I was able to clean everything up, code is much nicer now.

ext/SerializationExt.jl: outdated review comment, resolved
@jarbus (Contributor, author) commented Jan 25, 2024

@Jutho @ericphanson are there any other changes you recommend before merging?

@ericphanson (Member) replied:
Sorry, forgot about this. I think it’s good to go!

@ericphanson merged commit bc3fd23 into JuliaCollections:master on Jan 25, 2024; 12 checks passed.
Linked issue closed by this pull request: Serializing Large LRUs hangs/StackOverflows (#45). 5 participants.