Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate performance differences with memchr::memmem implementation #40

Open
marmeladema opened this issue May 3, 2021 · 7 comments

Comments

@marmeladema
Copy link
Collaborator

#26 lead to a very interesting discussion where it was brought up that in some cases sliceslice is substantially slower than memchr::memmem.

We should investigate why, decide if this is something that actually matters for us and if so, fix it

See #26 (comment) for detailed performance results

@BurntSushi
Copy link

Since I spent a lot of time on this lately, I'd like to give a bit more of a holistic description of differences as I see them. And I'll talk a little bit about the actual benchmarks from the memchr crate.

The benchmarks are split into four variations (only two of which are relevant to sliceslice):

  • oneshot benchmarks the time it takes to build the searcher and execute the search. sliceslice has a ceiling on this given its current API, since it requires a Box<[u8]> for its needle. oneshot benchmarks are only executed on searches with zero or one matches. In the latter case, the match always occurs at the end of the haystack. memchr::memmem does somewhat well on these benchmarks for teeny haystacks because it diverts to Rabin-Karp immediately for small haystacks instead of spending time building searchers. This is a really important use case to optimize for, since it corresponds to things like bstr::ByteSlice::contains_str and of course, str::contains in std.
  • oneshotiter is like oneshot, except it iterates over all matches. This only runs for benchmarks with more than one match. This benchmark is not relevant to sliceslice since sliceslice does not expose the match position. Only whether a match occurred.
  • prebuilt is like oneshot, but doesn't include searcher construction in its measurement.
  • prebuiltiter is like oneshotiter, but doesn't include searcher construction in its measurement. This one is also irrelevant for sliceslice.

Getting into actual code/design differences:

  • The biggest difference is I think requirements. I really want to provide an additive time complexity bound. This means the implementation can't just be "Generic SIMD." It can only use "Generic SIMD" for a constant number of needles. There's also the case of portability. For at least these reasons, sliceslice will likely always have an edge when it comes to latency.
  • memchr::memmem has two variants of the "Generic SIMD" algorithm. For small needles (<=32 bytes), the "Generic SIMD" algorithm is used nearly as-is, including the confirmation step with memcmp. (x86_64 only.) For larger needles, a prefilter variant of "Generic SIMD" is used that doesn't do its own confirmation step. Instead, it drops out into Two-Way. This lets us guarantee our additive time bound.
  • memchr::memmem uses its own intended-to-be lower latency memcmp routine, where as sliceslice specializes memcmp on every needle length up to some small bound, so that it should compile down to a simple load and compare. The trade off is that more code is produced, but this almost certainly makes sliceslice faster in many cases, especially where the initial candidate generation produces many false positives. I went with my route to see if I could get to a place that was "good enough" without lots of extra codegen.
  • Perhaps the thing that likely makes the most difference in common cases is that memchr::memmem selects the two bytes to look for based on a background frequency distribution of bytes. The prebuilt/huge-ru/never-john-watson benchmark might be a good one to look at. Eyeballing things, I'd probably attribute the difference in my measurements to this heuristic. Taking a quick look at a profile, memchr::memmem spends almost no time in its confirmation code, but memcmp is quite visible on a profile when running with sliceslice.
  • Picking rare bytes has pathological cases of its own. The prebuilt/pathological-defeat-simple-vector/rare-alphabet benchmark shows this. Both memchr::memmem and sliceslice slow down considerably here, but sliceslice does a bit better. Although, there are other pathological benchmarks where the outcomes are reversed:
  • Since picking rare bytes is a heuristic that can itself fail with the assumption isn't true, when "Generic SIMD" is used as a prefilter, a heuristic will dynamically try to detect if the prefilter is effective or not, and if not, disable it. The prebuilt/pathological-defeat-simple-vector-repeated/rare-alphabet benchmark demonstrates this. If you profile this when compared to sliceslice, you'll see that memchr::memmem isn't in a SIMD routine at all, but rather, Two-Way. Meanwhile, sliceslice is thrashing in memcmp.

I think the thing I realized while doing this exercise is that it's not really possible to do better in every case. I'm sure y'all know this of course, but it's worth calling out explicitly. You kind of have to pick your trade offs and which cases you want to hit. For example, there are some cases where the old Regex substring search built on top of a simple memchr loop were twice as fast as memchr::memmem, simply because it would sometimes pick a rare enough byte where memchr was in and of itself good enough as a highly discriminatory prefilter. "Generic SIMD" can't match the same speed as memchr in those cases. Nevertheless, the simplistic memchr loop has a lot of weaknesses and can slow down to pretty bad levels as soon as its rare byte assumption fails to hold. The nice thing about "Generic SIMD" is that it's a bit more robust even when the bytes it picks aren't exactly rare, since it is at least typically the case that the combination of those bytes in specific positions is enough to be discriminatory. e.g., rg 'strengths' has gotten twice as fast now that it uses memchr::memmem. And I'll say, finding the pathological cases for the "Generic SIMD" algorithm was not easy. I would love to find simpler cases, because as of now, the inputs are quite pathological. (Which is, IMO, a very nice point in favor of the "Generic SIMD" algorithm.)

OK, I think I've written enough for now. Happy to keep chatting and learn from each other!

@marmeladema
Copy link
Collaborator Author

This discussion made me realize that we actually don't need to force users of sliceslice to have to heap-allocate the needle into a Box<[u8]>. This is how we use it internally, but since we have a Needle trait that is implemented for Box<[u8]> then we will be fine 👍
This has been implemented in #42
I haven't looked into how oneshot benchmarks are implemented, but maybe that will improve some of the results

@BurntSushi
Copy link

I haven't looked into how oneshot benchmarks are implemented, but maybe that will improve some of the results

Very likely will! If you ping me when a new release is out, then I'll update the benchmarks. :-)

@marmeladema
Copy link
Collaborator Author

@BurntSushi version 0.3.0 has just been published!

BurntSushi added a commit to BurntSushi/memchr that referenced this issue May 6, 2021
This was kicked off due to the sliceslice 0.3.0 release:
cloudflare/sliceslice-rs#40
@BurntSushi
Copy link

Yup, the benchmarks with the biggest difference are oneshot. From the bench directory on memchr master:

$ critcmp runs/2021-05-01_regex-and-bstr/raw.json runs/2021-05-06_sliceslice-0.3.0/raw.json -f '/sliceslice/' -t50
group                                                                      2021-05-01                             2021-05-06
-----                                                                      ----------                             ----------
memmem/sliceslice/oneshot/pathological-repeated-rare-small/never-tricky    1.61     51.7±0.27ns    18.0 GB/sec    1.00     32.2±0.49ns    28.9 GB/sec
memmem/sliceslice/oneshot/teeny-en/never-all-common-bytes                  2.35     28.7±0.03ns   929.5 MB/sec    1.00     12.2±0.01ns     2.1 GB/sec
memmem/sliceslice/oneshot/teeny-en/never-john-watson                       1.72     29.0±0.03ns   921.7 MB/sec    1.00     16.9±0.03ns  1581.8 MB/sec
memmem/sliceslice/oneshot/teeny-en/never-some-rare-bytes                   2.31     27.7±0.02ns   965.0 MB/sec    1.00     12.0±0.01ns     2.2 GB/sec
memmem/sliceslice/oneshot/teeny-en/never-two-space                         2.58     27.8±0.04ns   961.9 MB/sec    1.00     10.8±0.01ns     2.4 GB/sec
memmem/sliceslice/oneshot/teeny-en/rare-sherlock                           2.49     28.1±0.04ns   951.8 MB/sec    1.00     11.3±0.01ns     2.3 GB/sec
memmem/sliceslice/oneshot/teeny-ru/never-john-watson                       2.59     36.9±0.07ns  1085.1 MB/sec    1.00     14.2±0.01ns     2.7 GB/sec
memmem/sliceslice/oneshot/teeny-ru/rare-sherlock                           2.11     29.7±0.05ns  1348.2 MB/sec    1.00     14.1±0.02ns     2.8 GB/sec
memmem/sliceslice/oneshot/teeny-ru/rare-sherlock-holmes                    2.31     39.4±0.06ns  1016.3 MB/sec    1.00     17.0±0.02ns     2.3 GB/sec
memmem/sliceslice/oneshot/teeny-zh/never-john-watson                       1.93     33.9±0.04ns   871.2 MB/sec    1.00     17.6±0.02ns  1678.9 MB/sec
memmem/sliceslice/oneshot/teeny-zh/rare-sherlock                           2.48     27.9±0.08ns  1058.5 MB/sec    1.00     11.3±0.01ns     2.6 GB/sec
memmem/sliceslice/oneshot/teeny-zh/rare-sherlock-holmes                    2.02     46.2±0.09ns   639.4 MB/sec    1.00     22.9±0.03ns  1290.8 MB/sec

@marmeladema
Copy link
Collaborator Author

Thank you 👍 Nice performance improvements and imo improved API for our users.
On a similar note, I've run my benchmarks to compare rust 1.51.0 and 1.52.0 and on the short_haystack, I have noticed a 22% regression in terms of instruction counts.
@BurntSushi have you encountered a similar regression on some of your benchmarks?

@marmeladema
Copy link
Collaborator Author

@BurntSushi version 0.3.1 has just been published! Do you think you could rerun the numbers again between your memmem and sliceslice? The recent changes were mostly aiming at fixing rust 1.52.x regression but they might have improved a bit the short haystacks benchmarks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants