
Add support for serialization of large caches #46

Merged (18 commits) on Jan 25, 2024

Conversation

@jarbus (Contributor) commented Jan 12, 2024

Fixes #45, where serializing a large LRU could result in a stack overflow. This PR does not automatically load Serialization; instead it uses package extensions (available in Julia 1.9+) to overwrite Serialization.serialize and Serialization.deserialize when Serialization is loaded.

@codecov-commenter commented Jan 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison: base (3125353) 78.45% vs. head (77d67a0) 81.97%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master      #46      +/-   ##
==========================================
+ Coverage   78.45%   81.97%   +3.52%     
==========================================
  Files           2        3       +1     
  Lines         297      355      +58     
==========================================
+ Hits          233      291      +58     
  Misses         64       64              


Project.toml: outdated review comment, resolved
@jarbus (Contributor, author) commented Jan 14, 2024

Thanks for the review @ericphanson, committed the changes you suggested!

@ericphanson (Member) left a comment:

Left a few more comments; overall this looks pretty good to me, though I've never written a serialize method before, so I'm not sure if there's anything to look out for here.

ext/SerializationExt.jl: outdated review comment, resolved
Comment on lines 9 to 10
# Create a mapping from memory address to id
node_map = Dict{Ptr, Int}()
Member comment:

Rather than using Ptr's, I think it is simpler (and maybe more robust) to use an IdDict:

Suggested change:

- # Create a mapping from memory address to id
- node_map = Dict{Ptr, Int}()
+ # Create a mapping from object to id. Here we use `IdDict` to use object identity as the hash.
+ node_map = IdDict{LinkedNode{K}, Int}()

Then this can be indexed simply as node_map[node] = id rather than using pointer_from_objref.

Contributor (author) reply:

This is great, thanks
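As a standalone illustration of why IdDict fits here (a sketch, not code from the PR): IdDict hashes on object identity (===) rather than isequal/hash, so distinct mutable nodes with equal contents still get distinct entries, which is exactly what mapping each node to an id requires. `Node` below is a hypothetical stand-in for LRUCache's LinkedNode.

```julia
# Sketch: IdDict keys on object identity (===), not on isequal/hash,
# so two distinct mutable nodes with equal contents stay separate.
# `Node` is a stand-in for the package's LinkedNode, not its real type.
mutable struct Node
    val::Int
end

a = Node(1)
b = Node(1)          # equal contents, different object

node_map = IdDict{Node,Int}()
node_map[a] = 1
node_map[b] = 2

@assert length(node_map) == 2   # both nodes kept, no pointer tricks needed
@assert node_map[a] == 1 && node_map[b] == 2
```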

test/serializationtests.jl: outdated review comment, resolved
@ericphanson (Member) left a comment:

LGTM, but it would be great if someone more familiar with serialization could review as well.

# Link the nodes
for (idx, node) in enumerate(nodes)
node.next = nodes[idx % n_nodes + 1]
node.prev = nodes[idx == 1 ? n_nodes : idx - 1]
Collaborator comment:

I think you can write these two right-hand sides as nodes[mod1(idx + 1, n_nodes)] and nodes[mod1(idx - 1, n_nodes)], which I would find clearer.

Contributor (author) reply:

Didn't know about the mod1 function, thanks! That's much cleaner.
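For reference, mod1 is 1-based modular arithmetic: mod1(x, n) returns a value in 1:n, so it wraps indices around a ring of n_nodes elements. A quick check (not from the PR):

```julia
# mod1(x, n) returns a value in 1:n, which suits 1-based circular indexing.
n_nodes = 5

@assert mod1(1 + 1, n_nodes) == 2        # next of the first node
@assert mod1(1 - 1, n_nodes) == 5        # prev of the first node wraps to the end
@assert mod1(n_nodes + 1, n_nodes) == 1  # next of the last node wraps to the front
```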

c_node = cache.keyset.first
d_node = deserialized_cache.keyset.first
for i in 1:cache.keyset.length
c_node.val == d_node.val || @test false
Collaborator comment:

Why not simply @test c_node.val == d_node.val ?

Contributor (author) reply:

This is just personal taste: for a large dict of 100k entries, I don't want to add 100k tests comparing each element. I just want one test that fails if any element differs, which I believe these do.

Collaborator reply:

Sure, that's true. On the other hand, the test could also have used just a handful of elements in the LRU cache, I believe. I haven't timed it, and I expect this is all very fast because it is just Ints, but the 100000 did jump out as a rather large number for a simple test case.

Member reply:

For small caches, the regular serialization method works fine and doesn't stack-overflow, so we need a big one to really test the new method.

Contributor (author) reply:

^ exactly right

Collaborator reply:

Well, yes and no: the new method is now always called, irrespective of the size of the cache. And it is written in such a way that it should not suffer from the same flaws, so I would think that passing the test for a smaller cache provides sufficient guarantees. But I am fine with the current tests.

Is it clear what was causing the default serialization strategy to fail? My guess as to why it entered an infinite loop would apply irrespective of the size of the cache, so that is not consistent with the original method working for small caches.
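One plausible reading of the failure (my guess, consistent with the discussion but not confirmed in this thread): each LinkedNode references its neighbours, so the default serializer walks the ring recursively, one stack frame per node, and only very deep rings exhaust the stack. Cyclic references themselves are fine at small sizes, which would explain why small caches round-trip. A minimal sketch with a hypothetical RingNode type (not the package's):

```julia
using Serialization

# Minimal stand-in for a doubly/singly-linked ring (hypothetical type,
# not LRUCache's). A two-node ring with a cycle serializes fine; the
# suspected problem is recursion depth, which only bites for huge rings.
mutable struct RingNode
    val::Int
    next::RingNode
    RingNode(v) = (n = new(v); n.next = n; n)   # start self-linked
end

first_node = RingNode(1)
first_node.next = RingNode(2)
first_node.next.next = first_node               # close the ring

io = IOBuffer()
serialize(io, first_node)
seekstart(io)
rt = deserialize(io)

@assert rt.val == 1
@assert rt.next.val == 2
@assert rt.next.next === rt   # the cycle survives the round trip
```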

Contributor comment:

Maybe as a middle ground, the following gives a single test (well, two), but still checks that all values are equal:

@test length(cache.keyset) == length(deserialized_cache.keyset)
@test all(((c_val, d_val),) -> c_val == d_val, zip(cache.keyset, deserialized_cache.keyset))

d_value, d_node, d_s = deserialized_cache.dict[key]
c_value == d_value || @test false
c_node.val == d_node.val || @test false
c_s == d_s || @test false
Collaborator comment:

Same question for these tests.

@Jutho (Collaborator) commented Jan 15, 2024

This looks great, thanks. I've also added some minor comments and questions.

@jarbus (Contributor, author) commented Jan 15, 2024

All great feedback so far, I've learned an unexpected amount about Julia from this simple PR. Thanks!

@Jutho (Collaborator) commented Jan 16, 2024

Do you want another review? I can ask someone in our group who recently implemented serialization for some other cache structures we have in our code, but he is only available from Friday onwards. I am also fine with merging this as is.

@jarbus (Contributor, author) commented Jan 17, 2024

I'm fine with either; it doesn't hurt to get another set of eyes on it, and I've already pointed all my code at my branch anyway.

@lkdvos (Contributor) left a comment:

I think this looks very nice! I left some minor comments that only really affect readability, and thus might be a bit subjective as well; feel free to discard them if you disagree.

As a more general comment, I think it is possible to avoid swapping the dictionary of linked nodes for a new dictionary that keeps an integer for the order. AFAIK, serialization of a dictionary happens by just serializing its (key, value) pairs sequentially, so they already have an order, which you could use as the id.

Conveniently, iterating over an LRU actually gives you the key => value pairs in exactly the order you want, because of how iteration of the keyset happens. Thus, you could consider something along the lines of the following and avoid the intermediate additional dictionary:

Serialization:

write(s.io, Int32(length(lru)))
for (k, (val, node, sz)) in lru
    serialize(s, k)
    serialize(s, val)
    serialize(s, sz)
end

For the deserialization, it should then be possible to just deserialize and, similar to your implementation, wrap the keys back in a LinkedNode, linking them along the way, because you can now be sure you are traversing them in order (special-casing the first and last node). (Note that this snippet does not include special cases for dicts that are empty or have a single element.)

Deserialization:

dict = Dict{K, Tuple{V, LRUCache.LinkedNode{K}, Int}}()
n = read(s.io, Int32)
sizehint!(dict, n)
# first entry
k = deserialize(s)
first = node = LRUCache.LinkedNode{K}(k)
val = deserialize(s)
sz = deserialize(s)
dict[k] = (val, node, sz)
# middle entries
for i in 2:n-1
    prev = node
    k = deserialize(s)
    node = LRUCache.LinkedNode{K}(k)
    prev.next = node
    node.prev = prev
    val = deserialize(s)
    sz = deserialize(s)
    dict[k] = (val, node, sz)
end
# last node
prev = node
k = deserialize(s)
node = LRUCache.LinkedNode{K}(k)
prev.next = node
node.prev = prev

val = deserialize(s)
sz = deserialize(s)
dict[k] = (val, node, sz)

node.next = first
first.prev = node

I think I am still missing some pieces to rebuild the keyset etc., and I have not tested anything yet. I think this should result in a smaller file size, and maybe slightly better performance. Of course, depending on the use case, this might all be overkill, and your implementation is definitely also viable!

ext/SerializationExt.jl: three outdated review comments, resolved

@lkdvos (Contributor) commented Jan 17, 2024

Whoops, it seems I was too quick in suggesting the while-loop to for-loop change; the iterator indeed only returns the values, not the nodes themselves. My apologies!

@jarbus (Contributor, author) commented Jan 17, 2024

Thanks for the in-depth feedback @lkdvos! I'll look into implementing your suggested refactor. I suppose I also need to undo that last commit now, haha.

@jarbus (Contributor, author) commented Jan 18, 2024

@lkdvos Finished working on your recommended changes; the serialization code is much cleaner now. Thanks for your input!

@Jutho (Collaborator) commented Jan 18, 2024

Looks good to me; maybe @lkdvos can take a final look and then we can merge.

@ericphanson (Member) left a comment:

I think it would be worth testing round-tripping an entirely empty LRU, as well as one with a single entry, since those look like edge cases in the deserialization implementation. Maybe:

lru = LRU{Int,Int}(; maxsize=10)
for n in 0:5
    if n > 0
        lru[n] = n
    end
    io = IOBuffer()
    serialize(io, lru)
    seekstart(io)
    lru_rt = deserialize(io)
    @test lru_rt isa LRU{Int,Int}
    @test issetequal(collect(lru), collect(lru_rt))
end

One other edge case I can think of is mutable values with shared object identity. Like

lru = LRU(; maxsize=5)
a = b = [1]
lru[1] = a
lru[2] = b
@test lru[1] === lru[2]
# now roundtrip it and check that the above `===` still holds

@jarbus (Contributor, author) commented Jan 19, 2024

Good point; honestly, I'm not sure how to handle the case of multiple references to mutable values, though.

@ericphanson (Member) replied:
The regular serializer handles it, so I was hoping it would just work
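That stdlib behaviour can be checked independently of LRUCache. A sketch using a plain Dict: the stdlib serializer tracks mutable objects, so shared references come back as a single shared object after deserialization.

```julia
using Serialization

# The stdlib serializer tracks mutable objects, so shared references
# are preserved across a round trip.
a = b = [1]                      # one array, two bindings
d = Dict(1 => a, 2 => b)

io = IOBuffer()
serialize(io, d)
seekstart(io)
d_rt = deserialize(io)

@assert d_rt[1] === d_rt[2]      # aliasing preserved
d_rt[1][1] = 42
@assert d_rt[2][1] == 42         # mutation visible through both keys
```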

@jarbus (Contributor, author) commented Jan 19, 2024

Great feedback @ericphanson, worked as expected. I was able to clean everything up, code is much nicer now.

ext/SerializationExt.jl: outdated review comment, resolved
@jarbus (Contributor, author) commented Jan 25, 2024

@Jutho @ericphanson are there any other changes you recommend before merging?

@ericphanson (Member) replied:
Sorry, forgot about this. I think it’s good to go!

@ericphanson merged commit bc3fd23 into JuliaCollections:master on Jan 25, 2024; 12 checks passed.
Linked issue closed by this pull request: Serializing Large LRUs hangs/StackOverflows (#45). 5 participants.