Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: cache npm metadata #5491

Merged
merged 34 commits into from
Jun 24, 2023
Merged

feat: cache npm metadata #5491

merged 34 commits into from
Jun 24, 2023

Conversation

paul-soporan
Copy link
Member

@paul-soporan paul-soporan commented Jun 8, 2023

What's the problem this PR addresses?

Resolving package metadata is slower than it has to be because, most times, Yarn has already fetched it in the past, and some things can be cached and reused.

This should improve performance in various cases (ranging from creating new projects and cache-only-but-no-lockfile installs to yarn up when no new versions are available), since the server can avoid resending the response body if nothing has changed.

How did you fix it?

This PR makes Yarn cache npm package metadata inside <globalFolder>/metadata/npm/<cacheKey>/<registry>/<package>.json when getPackageMetadata is used.

If an exact version is requested, Yarn will return the metadata from disk directly and avoid hitting the network altogether.

Otherwise, Yarn will set the If-None-Match & If-Modified-Since headers using the etag & last-modified values that were cached during previous requests. This tells the server to skip sending the response body and just respond with 304, making Yarn reuse the cached metadata.

TODO:

  • Trim the cached metadata of unnecessary fields to decrease cache size
  • Update benchmark scripts to make sure that they take the metadata cache into account
  • Run more benchmarks
  • Make yarn cache clean clean the npm metadata cache (different PR)

Checklist

  • I have set the packages that need to be released for my changes to be effective.
  • I will check that all automated PR checks pass before the PR gets reviewed.

@paul-soporan
Copy link
Member Author

Implemented cached metadata trimming and the metadata cache size on the berry repo (install after rm yarn.lock) went from 857M to 115M.

@arcanis
Copy link
Member

arcanis commented Jun 9, 2023

Overall looks a good idea; couple of questions:

  • How are benchmarks impacted, in practice?
  • The cache is currently one file per package, correct? Would it make sense to have one mega cache file?
  • Rather than JSON, would it perhaps be faster to use v8.serialize (we already do this for the install state?)
    • In that case the write would have to be mutexed
    • And probably deferred until after the resolution step has completed, to make a single write
    • And perhaps even run in the background (no await), to avoid blocking the subsequent fetch step

@paul-soporan paul-soporan force-pushed the paul/feat/cache-npm-metadata branch from e476317 to f3fc013 Compare June 10, 2023 15:33
@paul-soporan paul-soporan force-pushed the paul/feat/cache-npm-metadata branch from b55f4f9 to 9e42027 Compare June 10, 2023 16:29
@paul-soporan
Copy link
Member Author

paul-soporan commented Jun 11, 2023

How are benchmarks impacted, in practice?

These are the results I get locally. On GH actions I'm seeing inconclusive results for some reason. More benchmarks welcome.

I'm seeing similar results on GH actions now.

Next

  • Before:
Testing install-full-cold
Benchmark 1: yarn install
  Time (mean ± σ):     11.026 s ±  0.248 s    [User: 12.908 s, System: 0.603 s]
  Range (min  max):   10.694 s  11.522 s    10 runs
 
Testing install-cache-only
Benchmark 1: yarn install
  Time (mean ± σ):      3.715 s ±  0.181 s    [User: 2.100 s, System: 0.315 s]
  Range (min  max):    3.442 s   3.989 s    10 runs
 
Testing install-cache-and-lock
Benchmark 1: yarn install
  Time (mean ± σ):     740.0 ms ±  10.7 ms    [User: 844.3 ms, System: 147.3 ms]
  Range (min  max):   718.8 ms  751.0 ms    10 runs
 
Testing install-ready
Benchmark 1: yarn add dummy-pkg@link:./dummy-pkg
  Time (mean ± σ):     357.7 ms ±   2.6 ms    [User: 372.3 ms, System: 40.6 ms]
  Range (min  max):   352.8 ms  362.3 ms    10 runs
  • After:
Testing install-full-cold
Benchmark 1: yarn install
  Time (mean ± σ):     10.684 s ±  0.218 s    [User: 12.763 s, System: 0.526 s]
  Range (min  max):   10.424 s  11.057 s    10 runs
 
Testing install-cache-only
Benchmark 1: yarn install
  Time (mean ± σ):      1.075 s ±  0.039 s    [User: 1.027 s, System: 0.161 s]
  Range (min  max):    1.032 s   1.157 s    10 runs
 
Testing install-cache-and-lock
Benchmark 1: yarn install
  Time (mean ± σ):     735.9 ms ±   8.9 ms    [User: 832.1 ms, System: 143.9 ms]
  Range (min  max):   715.2 ms  745.6 ms    10 runs
 
Testing install-ready
Benchmark 1: yarn add dummy-pkg@link:./dummy-pkg
  Time (mean ± σ):     359.0 ms ±   3.0 ms    [User: 370.9 ms, System: 43.2 ms]
  Range (min  max):   355.2 ms  364.3 ms    10 runs

Gatsby

  • Before:
Testing install-full-cold
Benchmark 1: yarn install
  Time (mean ± σ):     34.461 s ±  2.341 s    [User: 52.835 s, System: 3.466 s]
  Range (min  max):   33.194 s  41.047 s    10 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs.
 
Testing install-cache-only
Benchmark 1: yarn install
  Time (mean ± σ):     20.037 s ±  0.293 s    [User: 23.248 s, System: 2.338 s]
  Range (min  max):   19.634 s  20.382 s    10 runs
 
Testing install-cache-and-lock
Benchmark 1: yarn install
  Time (mean ± σ):     11.540 s ±  0.052 s    [User: 14.078 s, System: 1.433 s]
  Range (min  max):   11.452 s  11.623 s    10 runs
 
Testing install-ready
Benchmark 1: yarn add dummy-pkg@link:./dummy-pkg
  Time (mean ± σ):     926.8 ms ±   7.8 ms    [User: 1259.0 ms, System: 103.8 ms]
  Range (min  max):   918.8 ms  942.2 ms    10 runs
  • After:
Testing install-full-cold
Benchmark 1: yarn install
  Time (mean ± σ):     33.968 s ±  2.952 s    [User: 55.060 s, System: 3.342 s]
  Range (min  max):   32.473 s  42.293 s    10 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs.
 
Testing install-cache-only
Benchmark 1: yarn install
  Time (mean ± σ):     13.822 s ±  0.058 s    [User: 16.784 s, System: 1.609 s]
  Range (min  max):   13.739 s  13.915 s    10 runs
 
Testing install-cache-and-lock
Benchmark 1: yarn install
  Time (mean ± σ):     11.534 s ±  0.043 s    [User: 14.133 s, System: 1.429 s]
  Range (min  max):   11.471 s  11.590 s    10 runs
 
Testing install-ready
Benchmark 1: yarn add dummy-pkg@link:./dummy-pkg
  Time (mean ± σ):     930.7 ms ±   6.6 ms    [User: 1264.1 ms, System: 100.4 ms]
  Range (min  max):   919.1 ms  942.9 ms    10 runs

@paul-soporan
Copy link
Member Author

Rather than JSON, would it perhaps be faster to use v8.serialize

These are the benchmark results for serializing and deserializing a mega cache file:

JSON.stringify: 340.514ms
JSON size: 113 MB
JSON.parse: 548.721ms

v8.serialize: 301.942ms
v8 size: 108 MB
v8.deserialize: 805.485ms

v8.serialize is a bit faster than JSON.stringify and produces a result that's a bit smaller, but JSON.parse is way faster than v8.deserialize.

(we already do this for the install state?)

That's because we have to do it for the install state since it includes Maps and Sets which are natively supported by v8 but not by JSON.

Here I'll keep using JSON because overall it's faster.

@paul-soporan
Copy link
Member Author

The cache is currently one file per package, correct? Would it make sense to have one mega cache file?

It might be a bit faster, but it would make the code a lot more complicated because it requires mutexing and doesn't play nicely with the existing resolver architecture which is better suited for concurrent operations.

For now, I think the current implementation is fine, we can always tweak it in the future if we need to.

@paul-soporan paul-soporan requested a review from arcanis June 17, 2023 19:15
@paul-soporan
Copy link
Member Author

paul-soporan commented Jun 17, 2023

PR is now finished and ready for review.

Small note: Because we return the cached metadata directly for exact versions, if an exact version has been deprecated since it was cached, the deprecation warning will not be shown until the cache is reset (either cleared or a non-exact version is resolved for the same package which should cause the metadata to be fetched again from the network).
Not a problem anymore, as we've stopped reporting deprecation warnings during installs.


Final benchmarks from GitHub Actions (Gatsby):

TL;DR: install-cache-only is now 1.87 times faster.

Before
Testing install-full-cold
Benchmark #1: yarn install
  Time (mean ± σ):     53.801 s ±  0.328 s    [User: 76.642 s, System: 5.323 s]
  Range (min  max):   53.312 s  54.258 s    10 runs
 
Testing install-cache-only
Benchmark #1: yarn install
  Time (mean ± σ):     23.105 s ±  0.313 s    [User: 25.800 s, System: 3.807 s]
  Range (min  max):   22.708 s  23.590 s    10 runs
 
Testing install-cache-and-lock
Benchmark #1: yarn install
  Time (mean ± σ):      7.691 s ±  0.071 s    [User: 9.150 s, System: 1.762 s]
  Range (min  max):    7.601 s   7.839 s    10 runs
 
Testing install-ready
Benchmark #1: yarn add dummy-pkg@link:./dummy-pkg
  Time (mean ± σ):      1.893 s ±  0.019 s    [User: 2.268 s, System: 0.211 s]
  Range (min  max):    1.863 s   1.924 s    10 runs
After
Testing install-full-cold
Benchmark #1: yarn install
  Time (mean ± σ):     53.710 s ±  0.372 s    [User: 79.227 s, System: 4.789 s]
  Range (min  max):   53.285 s  54.395 s    10 runs
 
Testing install-cache-only
Benchmark #1: yarn install
  Time (mean ± σ):     12.371 s ±  0.277 s    [User: 13.988 s, System: 2.125 s]
  Range (min  max):   11.930 s  12.803 s    10 runs
 
Testing install-cache-and-lock
Benchmark #1: yarn install
  Time (mean ± σ):      7.677 s ±  0.053 s    [User: 9.183 s, System: 1.722 s]
  Range (min  max):    7.574 s   7.736 s    10 runs
 
Testing install-ready
Benchmark #1: yarn add dummy-pkg@link:./dummy-pkg
  Time (mean ± σ):      1.898 s ±  0.028 s    [User: 2.270 s, System: 0.205 s]
  Range (min  max):    1.859 s   1.950 s    10 runs

@paul-soporan paul-soporan requested a review from merceyz June 18, 2023 15:49
@belgattitude
Copy link

belgattitude commented Jun 20, 2023

Thanks @paul-soporan, some questions

  • about <globalFolder> as seen in the description. Would it work with enableGlobalCache: false ? I'm asking because when dealing with cache on ci, one cache location is generally simplier.

  • if so, would it be purged over time (like the .cache/*.zip) or the preferred way would be to invalidate on lock changes (or with a date based cache key)

  • Will it be only yarn 4 ?

PS: That interest me because I could update some gists:

@paul-soporan
Copy link
Member Author

about as seen in the description. Would it work with enableGlobalCache: false ? I'm asking because when dealing with cache on ci, one cache location is generally simplier.

It is completely independent of the archive cache or its settings. It will always be inside globalFolder, together with telemetry.json, index, and the mirror.

if so, would it be purged over time (like the .cache/*.zip) or the preferred way would be to invalidate on lock changes (or with a date based cache key)

It is a global cache, that applies to all projects on the machine, and will grow indefinitely, just like the mirror / global cache (which is not purged over time at the moment either).

The more metadata that's stored in this cache (and the more recent it is), the faster installs will be in new projects when dependencies aren't already part of the lockfile.

invalidate on lock changes

It is a registry metadata cache. The lockfile is based on it, not the other way around.

It is also completely useless when the lockfile exists and is up-to-date with the manifests.

or with a date based cache key

Only if you want to trim it to reduce disk usage. The data in the cache will only be used when it's safe to do so, so we shouldn't run into the risk of stale data.

We store the etag & last-modified response headers and forward them on subsequent requests using the If-None-Match / If-Modified-Since request headers. The server responds with 304 and an empty body if the metadata hasn't been updated since, which means we can use the cached metadata. Otherwise, we use the new metadata the server responds with and update the cache.

i.e. We already have a builtin invalidation mechanism.

Note: We also use the cached data directly for exact version (e.g. when [email protected] is requested), since the metadata for a specific version is immutable (besides the deprecated field, but that one won't be cached anymore after #5509 is merged since we'll remove the deprecation warnings during installs).

Will it be only yarn 4 ?

Most likely yes, even though it's not a breaking change. We could technically backport it to 3.x, but 4.x is getting closer and closer and we probably won't have another minor 3.x version.

@belgattitude
Copy link

Thanks a lot for this comprehensive answer.

}

function getMetadataFolder(configuration: Configuration) {
return ppath.join(configuration.get(`globalFolder`), `npm-metadata`);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking perhaps <global>/cache/metadata might be a better location, what do you think? Actions that already cache the <global>/cache folder would benefit from that out of the box, rather than having to add a new path (or cache the whole <global>, which may not be as common).

I'd also suggest moving the npm- prefix from the folder to each individual file name, by consistency with the cache itself (which contains files from the npm, git, http fetchers, etc).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping on this comment?

Copy link
Member Author

@paul-soporan paul-soporan Jun 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Sorry for the delay.)

I'm thinking perhaps <global>/cache/metadata might be a better location, what do you think?

I don't really like the idea, the mirror and the npm metadata cache are 2 separate things, and merging them in a single folder would lead to confusion. We've also got index which would have to be moved to cache too with this line of thinking.

This would just complicate things like yarn cache clean, where --mirror would have to stop removing the entire folder and just remove the archives.

I'd prefer all of them to be in separate folders. One thing we could do would be to move the mirror to <global>/cache/mirror in addition to what you proposed, then I'd be fine with it and it would still benefit from existing caching (but it would look a bit weird to have the global cache in <global>/cache/mirror, but I guess that's just the consequence of the mirror and the global cache being the same thing).

I'd also suggest moving the npm- prefix from the folder to each individual file name, by consistency with the cache itself (which contains files from the npm, git, http fetchers, etc).

I'd rather not. It would just give the illusion of consistency.

The cache can do it because it's the sole source that controls that folder.

The npm metadata cache is supposed to be specific to the plugin-npm resolvers - they are the only ones that populate and control it.

Moving the prefix to the filenames and having the folder called just metadata gives the illusion that the folder is a generalized metadata cache. And it could be indeed, if other resolvers ever need to cache metadata in it, but it would have different shapes (and possibly different filename formats) based on the resolvers that cache the data and that might lead to confusion.

Edit: The files also wouldn't be in a single folder like the cache, since e.g. for npm we have npm-metadata/<hash>/<registry>/lodash.json.

It would also have to be controlled by the core, which would be tasked with automatically generating the paths to ensure that no 2 resolvers accidentally use the same path (and also to make it possible for us to change the metadata cache path in the future).

In addition, we're not even certain that the current kind of cache is better than a monolithic one, that's something I'm open to experimenting with in the future.

That's why I think that opening the folder to anything but the npm resolvers is not something I want to do yet, if ever.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 Thought more about it and I'd be open to making it <global>/metadata/npm/<hash>/<registry>/lodash.json.

This way, we still have a common metadata folder but we make it clear that each resolver has to manage it manually.

What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems reasonable. I like the idea of moving the cache into cache/mirror (that said, I think we could also just rename --mirror into -g for the same effect; I think I'd like this even better 🤔).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 I still think I'd prefer to keep the mirror, metadata, and index separate.

I think we could also just rename --mirror into -g for the same effect

We could have a -g too but I prefer to retain the granularity. yarn cache clean will become interactive in a future PR anyways (just like we discussed some time ago), and it will clean everything by default.

packages/plugin-npm/sources/npmHttpUtils.ts Show resolved Hide resolved
packages/plugin-npm/sources/npmHttpUtils.ts Show resolved Hide resolved
packages/yarnpkg-core/sources/httpUtils.ts Show resolved Hide resolved
const runRequest = () => prettyNetworkError(request(target, null, {configuration, wrapNetworkRequest, ...rest}), {configuration, customErrorMessage})
.then(response => response.body);

// We cannot cache responses when wrapNetworkRequest is used, as it can differ between calls
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean that the user could mutate the request wrapping? Perhaps it'd be worth having a secondary wrapNetworkRequestPure hook? Maybe for later.

(Many usages ofwrapNetworkRequest will want to log the queries, or perhaps cache them even further, but not modify their results in an inconsistent way)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not about the hook, it's about the function parameter.

The wrapNetworkRequest hook can indeed modify the response, but for a given URL it's kinda guaranteed that the same hooks will run even across different calls.

However, the function argument (that I'm using for caching npm metadata) can be passed different values in different calls, meaning that in one call I could pass a wrapNetworkRequest that returns the body foo, and in a different call one that returns the body bar, therefore breaking caching.

Thinking about it more, the entire cache in httpUtils is a bit unsafe because it only depends on the target, but the result could be different depending on the configuration (settings and hooks) / headers / jsonResponse.

I assume that it was implemented with the assumption that there won't be different option bags passed to it across different calls for a given URL in the same process, but that's in no way enforced and could lead to bugs.

In any case, I don't intend to fix that in this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, the function argument (that I'm using for caching npm metadata) can be passed different values in different calls, meaning that in one call I could pass a wrapNetworkRequest that returns the body foo, and in a different call one that returns the body bar, therefore breaking caching.

Yes, that's what I mean by having a pure version of the hook - one that would semantically require implementers to return the same output given the same input.

But yes, that's a follow-up, maybe not even necessary until requested.

@arcanis arcanis enabled auto-merge June 24, 2023 12:50
@arcanis arcanis disabled auto-merge June 24, 2023 13:01
@arcanis arcanis merged commit 4897712 into master Jun 24, 2023
@arcanis arcanis deleted the paul/feat/cache-npm-metadata branch June 24, 2023 13:01
@merceyz
Copy link
Member

merceyz commented Jun 24, 2023

TL;DR: install-cache-only is now 1.87 times faster.
#5491 (comment)

The results are even better in my case so great job!

=== install-full-cold ===
Benchmark 1: node ./before.cjs install
  Time (mean ± σ):     32.690 s ±  1.055 s    [User: 54.431 s, System: 12.793 s]
  Range (min … max):   31.618 s … 35.251 s    10 runs

Benchmark 2: node ./after.cjs install
  Time (mean ± σ):     30.612 s ±  1.168 s    [User: 55.848 s, System: 19.314 s]
  Range (min … max):   29.162 s … 32.916 s    10 runs

Summary
  'node ./after.cjs install' ran
    1.07 ± 0.05 times faster than 'node ./before.cjs install'
=== install-cache-only ===
Benchmark 1: node ./before.cjs install
  Time (mean ± σ):     15.117 s ±  0.136 s    [User: 18.566 s, System: 5.275 s]
  Range (min … max):   14.940 s … 15.327 s    10 runs

Benchmark 2: node ./after.cjs install
  Time (mean ± σ):      5.786 s ±  1.179 s    [User: 7.259 s, System: 4.400 s]
  Range (min … max):    4.938 s …  8.626 s    10 runs

Summary
  'node ./after.cjs install' ran
    2.61 ± 0.53 times faster than 'node ./before.cjs install'
=== install-cache-and-lock ===
Benchmark 1: node ./before.cjs install
  Time (mean ± σ):      2.999 s ±  0.012 s    [User: 3.589 s, System: 2.468 s]
  Range (min … max):    2.983 s …  3.021 s    10 runs

Benchmark 2: node ./after.cjs install
  Time (mean ± σ):      3.039 s ±  0.086 s    [User: 3.588 s, System: 2.524 s]
  Range (min … max):    2.976 s …  3.208 s    10 runs

Summary
  'node ./before.cjs install' ran
    1.01 ± 0.03 times faster than 'node ./after.cjs install'
=== install-ready ===
Benchmark 1: node ./before.cjs add dummy-pkg@link:./dummy-pkg
  Time (mean ± σ):      1.120 s ±  0.041 s    [User: 1.602 s, System: 0.661 s]
  Range (min … max):    1.088 s …  1.211 s    10 runs

Benchmark 2: node ./after.cjs add dummy-pkg@link:./dummy-pkg
  Time (mean ± σ):      1.089 s ±  0.008 s    [User: 1.558 s, System: 0.651 s]
  Range (min … max):    1.078 s …  1.104 s    10 runs

Summary
  'node ./after.cjs add dummy-pkg@link:./dummy-pkg' ran
    1.03 ± 0.04 times faster than 'node ./before.cjs add dummy-pkg@link:./dummy-pkg'

arcanis pushed a commit that referenced this pull request Jun 24, 2023
**What's the problem this PR addresses?**

-
07b4ceb
changed the metadata cache structure but didn't update the benchmark
script to match.
- We're not removing the `$globalFolder/index` folder so
`install-full-cold` is skipping some work.

Follow-up to #5491

**How did you fix it?**

Clear the entire global folder.

**Checklist**
- [x] I have read the [Contributing
Guide](https://yarnpkg.com/advanced/contributing).
- [x] I have set the packages that need to be released for my changes to
be effective.
- [x] I will check that all automated PR checks pass before the PR gets
reviewed.
arcanis pushed a commit that referenced this pull request Jun 24, 2023
**What's the problem this PR addresses?**

Now that #5491 has landed we can
cache the metadata for our e2e tests to potentially speed them up.

**How did you fix it?**

Cache metadata.

**Checklist**
- [x] I have read the [Contributing
Guide](https://yarnpkg.com/advanced/contributing).
- [x] I have set the packages that need to be released for my changes to
be effective.
- [x] I will check that all automated PR checks pass before the PR gets
reviewed.
arcanis pushed a commit to yarnpkg/example-repo-zipn that referenced this pull request Jul 3, 2023
**What's the problem this PR addresses?**

-
yarnpkg/berry@07b4ceb
changed the metadata cache structure but didn't update the benchmark
script to match.
- We're not removing the `$globalFolder/index` folder so
`install-full-cold` is skipping some work.

Follow-up to yarnpkg/berry#5491

**How did you fix it?**

Clear the entire global folder.

**Checklist**
- [x] I have read the [Contributing
Guide](https://yarnpkg.com/advanced/contributing).
- [x] I have set the packages that need to be released for my changes to
be effective.
- [x] I will check that all automated PR checks pass before the PR gets
reviewed.
arcanis pushed a commit to yarnpkg/example-repo-zipn that referenced this pull request Jul 3, 2023
**What's the problem this PR addresses?**

Now that yarnpkg/berry#5491 has landed we can
cache the metadata for our e2e tests to potentially speed them up.

**How did you fix it?**

Cache metadata.

**Checklist**
- [x] I have read the [Contributing
Guide](https://yarnpkg.com/advanced/contributing).
- [x] I have set the packages that need to be released for my changes to
be effective.
- [x] I will check that all automated PR checks pass before the PR gets
reviewed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants