Process packages in parallel #175
Comments
Could you share a rough number for how many lines of code per second you are seeing during indexing, and what your target number is based on workspace size and peak commit velocity? IIRC, one of the problems here was that we didn't have easy access to the dependency graph between packages in a workspace, which meant that determining the order of indexing was non-trivial. It would be wasteful to do the naive thing of indexing in parallel without understanding the dependency graph, as that would likely lead to a bunch of duplicated work. Even if we did things in dependency order, it might not be fast enough if your build graph is not wide enough.
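To make the "dependency order" and "wide enough" points concrete, here is a minimal sketch of grouping workspace packages into waves that could be indexed in parallel. The `DependencyGraph` shape and the `topologicalWaves` name are assumptions for illustration only; scip-typescript does not expose such an API as far as this thread shows.

```typescript
// Hypothetical input shape: package name -> names of the workspace packages it depends on.
type DependencyGraph = Map<string, string[]>;

// Groups packages into "waves" (Kahn's algorithm, level by level): every package
// in a wave depends only on packages from earlier waves, so the packages within
// one wave could be processed in parallel.
function topologicalWaves(graph: DependencyGraph): string[][] {
  const remainingDeps = new Map<string, number>();
  const dependents = new Map<string, string[]>();
  graph.forEach((deps, pkg) => {
    remainingDeps.set(pkg, deps.length);
    deps.forEach((dep) => {
      if (!graph.has(dep)) throw new Error(`unknown dependency: ${dep}`);
      dependents.set(dep, [...(dependents.get(dep) ?? []), pkg]);
    });
  });

  // Start with the packages that have no workspace dependencies.
  let current: string[] = [];
  remainingDeps.forEach((count, pkg) => {
    if (count === 0) current.push(pkg);
  });

  const waves: string[][] = [];
  let processed = 0;
  while (current.length > 0) {
    waves.push(current);
    processed += current.length;
    const next: string[] = [];
    for (const pkg of current) {
      for (const dependent of dependents.get(pkg) ?? []) {
        const left = (remainingDeps.get(dependent) ?? 0) - 1;
        remainingDeps.set(dependent, left);
        if (left === 0) next.push(dependent);
      }
    }
    current = next;
  }
  if (processed !== graph.size) throw new Error("dependency cycle detected");
  return waves;
}
```

The length of the longest wave gives a rough upper bound on how many cores a dependency-ordered run could keep busy at once; a deep, narrow graph yields many short waves and little parallelism.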
I'm currently trying to index a large monorepo with more than 600 packages, so I'm sure a topologically sorted graph would be wide enough to saturate the CPUs.
It takes roughly 20 minutes to index a repository with ~2M SLOC of TypeScript on an M1 MacBook Pro.
Master for us gets updated roughly once every 15 minutes (and this rate could increase going forward). Ideally we'd want to be able to index + compress + upload + process well within this time. But given that Sourcegraph does not invalidate the entire index when a new commit is pushed (IIRC from the docs), it might not be too bad if the time exceeds that by a bit. Still, the faster, the better.
I think even with some duplicated work, having an option to use more cores might be several times faster on a multi-core machine, and that could be just enough for some repositories to exceed their master update rate. Perhaps this can be something that is made more efficient over time, rather than not providing an option at all? I'm happy to test out any early changes and give feedback on them from our repositories if it helps. I'm very interested in making this scale for us at the moment (and it should hopefully help other enterprise users of Sourcegraph too).
Right, code navigation will fall back to the nearest index if one can't be found for the latest commit.
That's a fair point. We could try exposing an option for this... In the meantime, if you want a quick workaround, one thing you could try is to avoid the workspace indexing option and instead invoke the indexer individually for each inner
Thanks for sharing the numbers. Based on those, it seems like the indexer is running much slower than we'd expect... The number you gave ends up being ~1.7k SLOC/s, whereas IIRC early benchmarks were around 5k - 10k SLOC/s, so we may have had a serious regression there too.
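The workaround of invoking the indexer once per package lends itself to a small worker pool. The sketch below is only an illustration under stated assumptions: `runWithConcurrency` is a hypothetical helper, and the `task` callback stands in for whatever spawns the indexer for one package (the exact command line is not spelled out in this thread, so it is deliberately left out).

```typescript
// Runs `task` once per item, with at most `limit` tasks in flight at a time.
// Results are returned in the same order as `items`.
async function runWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  task: (item: T) => Promise<R>
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new Error("limit must be a positive integer");
  }
  const results: R[] = new Array(items.length);
  let nextIndex = 0;

  // Each worker repeatedly claims the next unprocessed index until none are left.
  async function worker(): Promise<void> {
    while (nextIndex < items.length) {
      const index = nextIndex++;
      results[index] = await task(items[index]);
    }
  }

  const workerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

In practice `items` would be the package directories of the workspace and `limit` roughly the number of available cores. Note that this does nothing about duplicated work between packages that share dependencies, which is the concern raised earlier in the thread.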
@pastelsky Thank you for reporting! Can you elaborate with more details about your build? For example, are you able to typecheck your existing TypeScript code in parallel with a separate tool? The
For the record, I'm not opposed to adding basic parallel processing support to scip-typescript. I'm mostly curious to better understand your problem and see if there's a reasonable alternative solution that you can use.
Digging a little deeper into my hunch, I profiled a run of
This happens because
How costly can be seen from this profile of
This also explains why I can run
Given that workspaces are popular, and monorepos are common, the benchmark of "5k - 10k SLOC/s" seems purely theoretical. See profiling results
@pastelsky thank you for the detailed information. This is very helpful, especially the flamegraph!
Can you please confirm that this is for a clean build? Our benchmarks indicate that tsc typechecks around 1-4k loc/s (the exact number varies between projects). For the sourcegraph/sourcegraph repo, a clean typecheck of 160k lines takes ~2 minutes (~1.3k loc/s), but we admittedly make heavy use of rxjs, which most likely slows down typechecking compared to normal codebases. At ~2M loc and 3 minutes, your codebase is getting typechecked at ~11k loc/s, which sounds surprisingly fast.
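For reference, the throughput figures quoted in this thread can be reproduced with a one-line calculation. This is just the arithmetic behind the approximate numbers given above; `linesPerSecond` is a helper name made up for the example.

```typescript
// Lines of code processed per second, given a total line count and a duration in minutes.
function linesPerSecond(totalLines: number, minutes: number): number {
  if (minutes <= 0) throw new Error("minutes must be positive");
  return totalLines / (minutes * 60);
}

// Figures quoted in this thread (all approximate).
const indexingRate = linesPerSecond(2_000_000, 20); // scip-typescript on the ~2M SLOC monorepo
const sourcegraphTscRate = linesPerSecond(160_000, 2); // clean typecheck of sourcegraph/sourcegraph
const monorepoTscRate = linesPerSecond(2_000_000, 3); // clean typecheck of the ~2M SLOC monorepo

// prints "1667 1333 11111"
console.log(Math.round(indexingRate), Math.round(sourcegraphTscRate), Math.round(monorepoTscRate));
```

These round to the ~1.7k, ~1.3k and ~11k lines per second mentioned in the comments.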
@varungandhi-src and I investigated our usage of
Yes, this is for a clean build. Here's the tsconfig in use, which could make a difference: tsconfig.json
I haven't tried this out personally, but I remember that another tool we use, API Extractor, which also runs TS analysis per package, recommends this:
Like I said, we have about 600 packages in our monorepo, and these do internally link to each other. In some cases the dependency chains can be several levels deep. I don't think what we have is very atypical though.
Towards #175 Previously, scip-typescript didn't cache anything at all between TypeScript projects. This commit implements an optimization so that we now cache the results of loading source files and parsing options. Benchmarks against the sourcegraph/sourcegraph repo indicate this optimization speeds up the index time by ~30% from ~100s to ~70s. The resulting index.scip has identical checksum before and after applying this optimization. This new optimization is enabled by default, but can be disabled with the option `--no-global-caches`.
Previously, scip-typescript didn't cache anything at all between TypeScript projects. This commit implements an optimization so that we now cache the results of loading source files and parsing options. Benchmarks indicate this optimization consistently speeds up the `index` command in all three multi-project repositories that I tested it with.
- sourcegraph/sourcegraph: ~30% from ~100s to ~70s
- nextauthjs/next-auth: ~40% from 6.5s to 3.9s
- xtermjs/xterm.js: ~45% from 7.3s to 4.1s
For every repo, I additionally validated that the resulting index.scip has identical checksum before and after applying this optimization. Given these promising results, this new optimization is enabled by default, but can be disabled with the option `--no-global-caches`.
*Test plan*
Manually tested by running `scip-typescript index tsconfig.all.json` in the sourcegraph/sourcegraph repository. To benchmark the difference for this PR:
- Check out the code
- Run `yarn tsc -b`
- Go to the directory of your project
- Run `node PATH_TO_SCIP_TYPESCRIPT/dist/src/main.js`
- Copy the "optimized" index.scip with `cp index.scip index-withcache.scip`
- Run `node PATH_TO_SCIP_TYPESCRIPT/dist/src/main.js --no-global-caches`
- Validate the checksum is identical from the optimized output `shasum -a 256 *.scip`
@pastelsky thank you for the reference to api-extractor. It was helpful to browse their usage of the TypeScript APIs. This pointer gave me keywords to search for that led me to this file here https://github.com/fimbullinter/wotan/blob/e25bf84561562935584a47220af5c996d6b746e7/packages/wotan/src/project-host.ts#L226
I've opened a PR #182 that implements an optimization where we cache
* Optimization: cache results between projects (towards #175)
* Fix failing tests
* Address review comments
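The idea behind the optimization, caching the result of loading a source file so that later projects can reuse it, can be illustrated with a generic memoizing loader. This is a sketch under assumptions, not the code from the PR: `GlobalCache`, the `load` callback and the file paths are all hypothetical, and the real change works against the TypeScript compiler APIs.

```typescript
// Generic memoizing loader. `load` stands in for an expensive step such as
// reading and parsing a source file; it is a hypothetical parameter.
class GlobalCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly load: (key: string) => V;
  public hits = 0;
  public misses = 0;

  constructor(load: (key: string) => V) {
    this.load = load;
  }

  get(key: string): V {
    if (this.entries.has(key)) {
      this.hits++;
      return this.entries.get(key) as V;
    }
    this.misses++;
    const value = this.load(key);
    this.entries.set(key, value);
    return value;
  }
}

// Two hypothetical projects that share one dependency's declaration file.
const projects: string[][] = [
  ["packages/a/index.ts", "node_modules/lib/index.d.ts"],
  ["packages/b/index.ts", "node_modules/lib/index.d.ts"],
];
const cache = new GlobalCache((path: string) => ({ path, loadedAt: Date.now() }));
for (const files of projects) {
  for (const file of files) cache.get(file);
}
console.log(cache.hits, cache.misses); // prints "1 3"
```

The shared declaration file is loaded once and reused by the second project. In a monorepo where hundreds of packages pull in the same dependencies, that reuse is where savings of this kind would come from.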
@pastelsky I just released https://github.com/sourcegraph/scip-typescript/releases/tag/v0.2.11 with the optimization enabled. Please give it a try and let us know if you observe performance improvements.
This is amazing @olafurpg. This has effectively brought down indexing times in our internal monorepo from ~20 minutes to ~11 minutes.
@pastelsky thank you for confirming! That's a great improvement. The v0.2.11 release should emit indexes identical to v0.2.10, only faster, because it includes the performance optimization. If you're running v0.3.0 then it's normal that the index contents are different, since v0.3.0 includes a new feature addition. To be 100% sure, you can optionally disable the optimization with
I'd love to drill into the remaining performance gap between scip-typescript and
FWIW, this is how I'm generating these, so you can test improvements by yourself too: `clinic flame -- node node_modules/.bin/scip-typescript index --yarn-workspaces`
@pastelsky Thank you for the
I took a quick stab at identifying what other parts of the
I'm on the fence about adding parallel processing to scip-typescript itself because 1) I'm still optimistic we can get the same performance as
This wasn't documented, so I thought running it serially was the only way. This is assuming, of course, that sharded indexes are independent enough and can be uploaded via
I have not seen this
Our backend does not support merging independent
This seems like a better approach for now. The
I admit this isn't ideal, and it would be preferable to offer simpler parallelism like
I guess then this is a pre-requisite for any sort of sharding/filtering/parallelization (outside of
We have several nested packages under a common workspace root. I also don't want
We'd be able to run a script to get all packages, and slice them into buckets if
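The "slice them into buckets" step mentioned above could look roughly like this. It is a minimal sketch: `sliceIntoBuckets` is a made-up helper, and how the package list is obtained or how each bucket is handed to an indexer process is left open.

```typescript
// Distributes items round-robin into at most `bucketCount` buckets of near-equal size.
function sliceIntoBuckets<T>(items: readonly T[], bucketCount: number): T[][] {
  if (!Number.isInteger(bucketCount) || bucketCount < 1) {
    throw new Error("bucketCount must be a positive integer");
  }
  // Never create more buckets than there are items, so no bucket is empty.
  const size = Math.min(bucketCount, items.length);
  const buckets: T[][] = Array.from({ length: size }, (): T[] => []);
  items.forEach((item, i) => buckets[i % size].push(item));
  return buckets;
}
```

Round-robin assignment keeps bucket sizes within one item of each other but ignores how large each package is. Balancing by file count or line count would be a natural refinement if some packages dominate the indexing time.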
Can we do this in src-cli? Right now, src-cli accepts two forms of uploads, SCIP (via conversion to LSIF) and LSIF directly. We could add a third "scip directory" where there is one metadata file, and the other index files are generated by workers. We could add a function in the SCIP Go bindings to correctly concatenate a "scip directory" into a single index (the main thing that requires some care is putting the metadata first), and use that in src-cli. After that, src-cli would treat the index like a normal SCIP index. WDYT? I can see this being potentially useful in the future as well for indexers which add support for distributed indexing via Bazel, because the
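A rough sketch of the "metadata first" concatenation described above, written in TypeScript only to stay consistent with the rest of this thread (the proposal itself is for the SCIP Go bindings). Everything here is hypothetical: the `Shard` shape, the `isMetadata` flag and `concatenateShards` are invented for illustration, and the sketch assumes that the serialized shards can be joined by plain byte concatenation.

```typescript
// Hypothetical shape for one file in a "scip directory".
interface Shard {
  name: string;
  isMetadata: boolean; // true for the single file carrying the index metadata
  bytes: Uint8Array; // serialized contents, treated as opaque here
}

// Concatenates all shards into one buffer, placing the metadata shard first.
function concatenateShards(shards: readonly Shard[]): Uint8Array {
  const metadata = shards.filter((s) => s.isMetadata);
  if (metadata.length !== 1) {
    throw new Error(`expected exactly one metadata shard, found ${metadata.length}`);
  }
  // Sort worker shards by name so the output does not depend on listing order.
  const rest = shards
    .filter((s) => !s.isMetadata)
    .sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
  const ordered = [metadata[0], ...rest];

  const total = ordered.reduce((sum, s) => sum + s.bytes.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const shard of ordered) {
    out.set(shard.bytes, offset);
    offset += shard.bytes.length;
  }
  return out;
}
```

Rejecting directories with zero or multiple metadata files up front keeps the "exactly one metadata file" convention explicit, so a misconfigured worker fails loudly instead of producing an index that the backend later rejects.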
@varungandhi-src seems like a good idea. This could also enable Sourcegraph to support incremental indexing in the future, which would be a neat feature to have, especially for very large repositories.
scip-typescript currently can take several minutes to index yarn workspaces. It does not seem to saturate the CPU or use all cores. Ideally there should be an option to index multiple packages concurrently to reduce this time. This would be useful because slow indexing reduces the benefit of type indexing in the first place on a repository with frequent commits, as new commits can invalidate the ongoing index.