-
-
Notifications
You must be signed in to change notification settings - Fork 829
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor filters in spawn_senders #829
base: master
Are you sure you want to change the base?
Conversation
@sharkdp Should I keep working on this PR? |
How do we continue with this? I still think this could be a very valuable change to the code base. I would like to be very careful though and properly check with benchmarks that this doesn't introduce any performance regressions. Sorry for leaving this hanging for so long, requiring (you) to fix a few merge conflicts. |
It's okay, you did mention you were pretty busy in the Reddit post asking for contributors. I fixed the merge conflicts, refactored the code a tiny bit and then ran Benchmark report file
|
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
./fd-master --hidden --no-ignore '' '/home/user' |
343.1 ± 12.1 | 327.2 | 363.3 | 1.00 |
./fd-feature --hidden --no-ignore '' '/home/user' |
367.1 ± 6.9 | 358.4 | 378.6 | 1.07 ± 0.04 |
Simple pattern
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
./fd-master '.*[0-9]\.jpg$' '/home/user' |
22.6 ± 2.2 | 18.6 | 28.4 | 1.00 |
./fd-feature '.*[0-9]\.jpg$' '/home/user' |
22.7 ± 2.4 | 17.9 | 31.2 | 1.00 ± 0.15 |
Simple pattern (-HI)
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
./fd-master -HI '.*[0-9]\.jpg$' '/home/user' |
264.9 ± 9.4 | 254.4 | 285.5 | 1.00 |
./fd-feature -HI '.*[0-9]\.jpg$' '/home/user' |
287.3 ± 9.0 | 274.4 | 303.0 | 1.08 ± 0.05 |
File extension
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
./fd-master -HI --extension jpg '' '/home/user' |
282.5 ± 8.8 | 273.7 | 295.5 | 1.00 |
./fd-feature -HI --extension jpg '' '/home/user' |
475.7 ± 16.3 | 458.7 | 501.9 | 1.68 ± 0.08 |
File type
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
./fd-master -HI --type l '' '/home/user' |
261.3 ± 7.0 | 252.6 | 275.5 | 1.00 |
./fd-feature -HI --type l '' '/home/user' |
291.1 ± 9.0 | 280.2 | 310.6 | 1.11 ± 0.05 |
Cold cache
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
./fd-master -HI '.*[0-9]\.jpg$' '/home/user' |
79.630 ± 1.315 | 78.491 | 81.069 | 1.00 |
./fd-feature -HI '.*[0-9]\.jpg$' '/home/user' |
81.093 ± 1.952 | 79.208 | 83.105 | 1.02 ± 0.03 |
My guess is that the massive performance regression in the File Extension test probably comes from this line, where I clone an Option<RegexSet>
:
Box::new(Extensions::new(config.extensions.clone()))
I assume the Extensions
filter could be easily modified to use Option<&RegexSet>
instead. Let me know if the benchmarks look okay to you, assuming this outlier is fixed.
Thank you very much for bringing this up to date! Unfortunately, the benchmark results do not look okay, even excluding the |
It's a lot better now! Here are the new benchmark results: Benchmark report file
|
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
./fd-master --hidden --no-ignore '' '/home/asha' |
303.4 ± 10.6 | 286.0 | 315.4 | 1.05 ± 0.05 |
./fd-feature --hidden --no-ignore '' '/home/asha' |
288.3 ± 8.7 | 272.4 | 299.9 | 1.00 |
Simple pattern
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
./fd-master '.*[0-9]\.jpg$' '/home/asha' |
19.0 ± 1.3 | 16.5 | 23.1 | 1.00 ± 0.11 |
./fd-feature '.*[0-9]\.jpg$' '/home/asha' |
19.0 ± 1.6 | 14.2 | 23.6 | 1.00 |
Simple pattern (-HI)
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
./fd-master -HI '.*[0-9]\.jpg$' '/home/asha' |
210.2 ± 2.2 | 206.1 | 216.1 | 1.00 |
./fd-feature -HI '.*[0-9]\.jpg$' '/home/asha' |
211.5 ± 4.8 | 206.4 | 223.2 | 1.01 ± 0.03 |
File extension
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
./fd-master -HI --extension jpg '' '/home/asha' |
222.5 ± 1.7 | 220.5 | 225.7 | 1.00 |
./fd-feature -HI --extension jpg '' '/home/asha' |
222.8 ± 5.0 | 216.6 | 233.7 | 1.00 ± 0.02 |
File type
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
./fd-master -HI --type l '' '/home/asha' |
208.0 ± 6.1 | 200.7 | 218.9 | 1.00 |
./fd-feature -HI --type l '' '/home/asha' |
208.7 ± 3.7 | 203.8 | 215.9 | 1.00 ± 0.03 |
Cold cache
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
./fd-master -HI '.*[0-9]\.jpg$' '/home/asha' |
86.769 ± 1.437 | 85.202 | 88.022 | 1.02 ± 0.02 |
./fd-feature -HI '.*[0-9]\.jpg$' '/home/asha' |
85.200 ± 1.499 | 83.473 | 86.163 | 1.00 |
Cool! Thank you for the update. Is the "Switch file filter dynamic dispatch to enum" commit essential to get these improved runtimes? I liked the previous implementation better. And I can't imagine that the dynamic dispatch is that expensive? |
I like the previous implementation more too honestly, though it does seem to have a noticeable impact; you can be the judge of whether it's acceptable or not. Here's a benchmark of the commit right before replacing the dynamic dispatch with the enum: Benchmark report file
|
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
./fd-master --hidden --no-ignore '' '/home/asha' |
290.3 ± 8.5 | 282.2 | 306.3 | 1.00 |
./fd-feature --hidden --no-ignore '' '/home/asha' |
301.6 ± 8.8 | 284.0 | 317.6 | 1.04 ± 0.04 |
Simple pattern
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
./fd-master '.*[0-9]\.jpg$' '/home/asha' |
18.7 ± 1.6 | 14.7 | 22.8 | 1.00 |
./fd-feature '.*[0-9]\.jpg$' '/home/asha' |
19.1 ± 1.4 | 16.9 | 24.9 | 1.03 ± 0.12 |
Simple pattern (-HI)
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
./fd-master -HI '.*[0-9]\.jpg$' '/home/asha' |
208.1 ± 3.6 | 204.2 | 218.5 | 1.00 |
./fd-feature -HI '.*[0-9]\.jpg$' '/home/asha' |
208.6 ± 4.4 | 203.1 | 218.2 | 1.00 ± 0.03 |
File extension
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
./fd-master -HI --extension jpg '' '/home/asha' |
222.3 ± 4.5 | 216.4 | 235.2 | 1.00 |
./fd-feature -HI --extension jpg '' '/home/asha' |
223.8 ± 6.7 | 217.9 | 238.5 | 1.01 ± 0.04 |
File type
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
./fd-master -HI --type l '' '/home/asha' |
207.5 ± 4.6 | 202.7 | 218.7 | 1.00 |
./fd-feature -HI --type l '' '/home/asha' |
209.3 ± 5.9 | 202.3 | 222.8 | 1.01 ± 0.04 |
Cold cache
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
./fd-master -HI '.*[0-9]\.jpg$' '/home/asha' |
85.896 ± 1.179 | 84.573 | 86.836 | 1.00 ± 0.02 |
./fd-feature -HI '.*[0-9]\.jpg$' '/home/asha' |
85.639 ± 0.604 | 84.953 | 86.091 | 1.00 |
Regarding the enum option, I'm guessing you mostly have a problem with the enum-name-struct-name repetition, right? I'm talking about this:
let mut filters: Vec<FilterKind> = vec![
FilterKind::SkipRoot(SkipRoot),
FilterKind::MinDepth(MinDepth::new(config.min_depth)),
FilterKind::RegexMatch(RegexMatch::new(pattern, config.search_full_path)),
FilterKind::Extensions(Extensions::new(config.extensions.as_ref())),
];
If that's the case, one potential improvement could be to have the structs return the relevant FilterKind
variant themselves, something like this:
let mut filters = vec![
SkipRoot::make_filter(),
MinDepth::make_filter(config.min_depth),
RegexMatch::make_filter(pattern, config.search_full_path),
Extensions::make_filter(config.extensions.as_ref()),
];
Hm... not this duplication per se, just the general amount of additional code. |
Very interesting. I wouldn't have expected this. This might have to do with inlining actually. If we use static dispatch, the compiler might be able to inline to filter implementations? |
Can we try something like this? struct AllFilters {
skip_root: SkipRoot,
min_depth: MinDepth,
...
}
impl Filter for AllFilters {
fn should_skip(&self, entry: &DirEntry) -> bool {
self.skip_root.should_skip(entry)
|| self.min_depth.should_skip(entry)
|| ...
}
} For the |
Doesn't that defeat the whole purpose? I mean, it would still move code out to multiple functions. But we couldn't simply call: let should_skip = filters.iter().any(|filter| filter.should_skip(&entry)); and would have to list every filter manually. The let should_skip = min_depth_should_skip(…) || file_type_should_skip(…) || …; from the main code path. The |
Should the code stay as it is then? The metadata filters are next up for similar refactoring as per the first comment of the issue. |
How about something like: pub trait Filter: Send + Sync + Sized {
/// Whether the entry should be skipped or not.
fn should_skip(&self, entry: &DirEntry) -> bool;
fn chain<F: Filter>(self, other: F) -> ChainedFilter<Self, F> {
ChainedFilter(self, other)
}
}
pub struct ChainedFilter<F1: Filter, F2: Filter>(F1, F2);
impl<F1: Filter, F2: Filter> Filter for ChainedFilter<F1, F2> {
fn should_skip(&self, entry: &DirEntry) -> bool {
self.0.should_filter(entry) || self.1.should_filter(entry)
}
// then in walk:
let combined_filter = SkipRoot
.chain(MinDepth::new(config.min_depth))
.chain(RegexMatch::new(pattern, config.search_full_path))
//... This keeps the filters static but removes the need for any dispatch |
@yyogo |
How would you handle optional filters with that approach? let filters = {
let filters = SkipRoot
.chain(MinDepth::new(config.min_depth))
.chain(RegexMatch::new(pattern, config.search_full_path))
.chain(Extensions::new(config.extensions.as_ref()));
if let Some(file_types) = config.file_types.clone() {
filters.chain(file_types)
} else {
filters
}
}; This is a no-go since you're returning different types inside let filters = SkipRoot
.chain(MinDepth::new(config.min_depth))
.chain(RegexMatch::new(pattern, config.search_full_path))
.chain(Extensions::new(config.extensions.as_ref()))
.chain_opt(config.file_types.clone()); And try to implement fn chain_opt<F: Filter>(self, other: Option<F>) -> ChainedFilter<Self, F> {
match other {
Some(other) => self.chain(other),
None => ???,
}
} What should the |
@Asha20 maybe this? impl<F> Filter for Option<F> where F: Filter {
fn should_skip(&self, entry: &DirEntry) -> bool {
self.as_ref().map_or(false, |f| f.should_skip(entry))
}
} alternatively |
Relates to #382
I refactored all of the filters that don't require metadata, introducing each new filter in a separate commit. I'm planning to tackle the metadata filters later, but this might be a good stopping point for feedback before proceeding.