-
Notifications
You must be signed in to change notification settings - Fork 468
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ValueMap - separate HashMap for sorted attribs #2288
base: main
Are you sure you want to change the base?
ValueMap - separate HashMap for sorted attribs #2288
Conversation
b4ee62f
to
c9621ae
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2288 +/- ##
=====================================
Coverage 79.5% 79.5%
=====================================
Files 123 123
Lines 21258 21276 +18
=====================================
+ Hits 16905 16920 +15
- Misses 4353 4356 +3 ☔ View full report in Codecov by Sentry. 🚨 Try these New Features:
|
Benchmark results: cargo bench --bench metric
It looks like there's relation between number number of attributes, E.g. for 1 attribute, performance is increased, for 10 is decreased, but I cannot explain why this happens. cargo run --release --package stress --bin metrics
Same theme as before, counter performance increase, histograms performance decrease. |
let trackers = match self.sorted_attribs.lock() { | ||
Ok(mut trackers) => { | ||
let new = HashMap::with_capacity(trackers.len()); | ||
replace(trackers.deref_mut(), new) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have briefly touched this situation in my previous PR. about unnecessary allocation, but we didn't discussed downsides of it.
Here's the table of possible solutions.
memory policy | approach | benefits | drawbacks |
---|---|---|---|
concervative | take(trackers) | smallest memory footprint never grows more than needed | lots of allocation every time new collection phase starts |
balanced | replace(trackers, HashMap with_capacity (previous_len)) | also small memory footprint | on stable loads might save a lot of unnecessary allocations, on spiky loads might either waist memory or do more allocations |
liberal | replace(trackers, previous_tracker with same_memory) | no allocations at all | memory only grow, which might be a problem if you have huge spike, allocated memory will stay there forever (though it's only for contents of "empty" buckets, maybe not that bad?). Also it's a bit more complexity in code, as you need to have extra variable (and lock) just for that, and finally this might be unexpected for end user, as end user might expect that delta temporality doesn't do any sort of caching |
At the moment I chose "balanced approach", I found it most natural :) but happy to hear your opinion on this too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fraillt Please see #2343 where I opted for the liberal from the above table. It should ensure HashMap starts with enough capacity, and never ever needs to re-adjust in its lifetime. Please take a look and share your feedback.
"memory only grow" - it will not, due to cardinality capping.
"finally this might be unexpected for end user, as end user might expect that delta temporality doesn't do any sort of caching" - these are internal changes only. end users should still see expected behavior with this change.
prepare_data(dest, self.count.load(Ordering::SeqCst)); | ||
if self.has_no_attribute_value.swap(false, Ordering::AcqRel) { | ||
// reset sorted trackers so new attributes set will be written into new hashmap | ||
let trackers = match self.sorted_attribs.lock() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is making the contention between collect and update worse. Earlier, the only point of contention during collect was the write lock under which we replace the existing RwLock<Hashmap>
with a default one. With the changes in this PR, we now have two points of contention during collect:
- Locking the
Mutex
withself.sorted_attribs.lock()
which would compete with updates trying to add newer attribute sets - Acquiring the write lock with
self.all_attribs.write()
which would compete with all the updates
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I know, but I don't see other way if we want to implement sharding, so my goal with this PR was mostly try to preserve existing performance (as much as I could) while at the same time, make it easy to add sharding in the future.
However, I expect this extra lock to be very cheap in practice, for two reasons:
- Mutex is generally fast
- at the end of collection cycle, there should be quite low contention on inserting new attribute sets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but I don't see other way if we want to implement sharding,
Why is that? Implementing sharding for the Hashmap
should be only about replacing the existing HashMap
type with the type that implements sharding (something like Dashmap
). How is the lack of having a sorted and unsorted set of attributes trackers preventing us from trying a sharded Hashmap
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything could work with one hashmap, we could probably tolerate slower cumulative collection phase, but the real problem is with delta temporality collection.
Here's the problem:
We have hard requirement that attribute-set keys are order agnostic.
In practice this means, that we can have multiple attribute-sets in hashmap, which points to same tracker (hence, we have Arc<A>
.).
With sharding this means, that same tracker can be in multiple shards, so the question is how do you reset hashmap?
- we cannot simply "replace" hashmap with fresh instance.
- we cannot iterate through each shard, deduplicate (using hashset to identify already processed attribute-sets), and reset/clear each shard while iterating. This doesn't work because, even if you remove attribute-set in one shard, on next tracker update same tracker instance will be put back, as long as sorted attribute-set instance lives in another shard.
The only option to actually clear sharded hashmap is to lock ALL shards, then iterate/clone and unlock.
So my approach is to have two hashmaps instead, because all things considered it has more benefits than drawbacks.
- update existing tracker - no effect, because attribute set is found on first try in both cases (in case user never provides sorted attribute-sets, it might be even better, as first hashmap will be half as small, so less chance for collision).
- inserting new tracker - probably not expected, but it is performance increase actually (with cargo bench --bench metrics_counter), probably due to less checks and fast lock with Mutex.
- collection phase (both delta and cumulative) - huge benefits, short lock time, no need to have hashset in order to dedup, and solves Fix metrics dedup/sort bug #2093 (which would further slow down collection phase).
So basically this approach has bunch of benefits and 0 drawbacks, so why not try it :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that sharding has proven to be improving throughout significantly, but those tests were done at the time we took a Mutex lock in hotpath. Now we have replaced that with Read lock on hotpath, so there is no contention anymore. i.e if 10 threads wants to update(), they can all do so without being blocked by each other. With this existing design, how does sharding help improve performance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it has to do with CPU cache contention.
There's still single lock instance (on happy path) that all cores/threads will fight for. It doesn't matter if it's Mutex/RwLock or any other primitive, the fact remains that when we "lock it" some state is updated under the hood.
Once one CPU updates a memory (there's no difference if it's atomic or not), it marks this L1 cache line as "invalid" on all other CPU cores, so when next core comes in, it needs to fetch it (might be from main memory or previous CPU cache).
So sharding solves exactly this problem:
- it will have multiple lock instances (per shard), so that multiple threads don't fight that much anymore, and don't need to invalidate other CPU cache line every single time.
As I mentioned, I have done some tests locally, using naive implementation (I didn't used any padding between shards, to avoid false-sharing (when invalidating one shard, might also invalidate another if they happen to live on same L1 cache line)), and results are already ~3x improvement (it varies greatly between different CPUs, I saw 2x on one, and 5x on another in some cases). Proper implementation could be even more performant :)
Some more benchmark results: cargo bench --bench metrics_counter
|
6d97487
to
f38e09a
Compare
let sorted_attrs = AttributeSet::from(attributes).into_vec(); | ||
if let Some(tracker) = trackers.get(sorted_attrs.as_slice()) { | ||
tracker.update(value); | ||
let Ok(mut sorted_trackers) = self.sorted_attribs.lock() else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This introduces the possibility of a update() thread fighting for the lock with another update() thread (or even collect()). This is something we need to avoid as much as possible.
Previously, we have only read() locks, and the possibility of the update() thread getting blocked is very very small. Now that probability is quite high.
Metrics update() should not cause a thread to be blocked - as the thread could lose its CPU slice, and need to wait for a chance to get CPU back. We never wrote this down as a key requirement, but this is an important expectation from a metrics solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You probably missed, this only happens on new attribute-set, updating existing attribute set didn't changed.
In fact this PR holds locks for much less time compared to what we have in main.
It does, however, two locks in new attribute set case (sorted.lock() and all_attribs.write()), but it also has certain benefits as well. E.g. on main, when we insert new attribute-set we block readers until we finish these steps:
- get(as-is) -> (not found)
- get(sorted) -> (not found)
- create tracker
- update tracker
- insert (as-is)
- insert (sorted)
on this branch for same scenario readers are blocked until we:
- insert (as-is)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, my comment about update() blocking other update() more frequently in this PR was wrong, apologies.!
I was thinking of a scenario which is not yet implemented in main, but I was working on it
- take read lock
- check incoming order and update if found and return
- check sorted order and update if found and return
- Check if above cardinality limit. If yes, update the overflow and return. (no write locks taken so far)
The above change would ensure that in the situation of malicious inputs being constantly thrown, we only take read lock, but this PR need to take write lock. But I can see this similar change can be added in this PR too, so they can behave same.
I am not convinced about the need of storing sorted in a yet another hashmap. Is this primarily motivated by the issue described in #2093? If de-duplication is making things complex, we can do something like:
update(value, &attrs)
{
if dedup_metrics_attributes_enabled
{
attrs = deduplicate the attributes; // its costly, but under feature flag.
}
// this point onwards, operate as if attrs has no duplicates. (either by feature flag, or by users ensuring the attrs are free of duplicates.
...rest of logic..
}
Would this approach make things easier, and avoid the need of storing sorted attributes separately?
Separately, I love the separate RwLock for collect purposes, and the idea of mem::swapping that with main hashmap. Can we get that to its own PR we can merge quickly, to make continuous progress.
I also want to revisit the idea of draining actively used trackers for each collect - we could keep active ones in the hashmap, but use a marker field to indicate if active or not. This will make Metrics sdk operate fully in existing memory, as long as users have same timeseries, which is the common case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's several reasons for having separate hashmap:
- reduced contention on reads (as pointed out, because writes can happen mostly on separate lock)
- solves Fix metrics dedup/sort bug #2093 elegantly/efficiently
- allow implementing sharding efficiently. (although there is alternative impl [Perf][Metrics] Use flurry's concurrent hashmap for 5x throughput #2305 using concurrent map, but it has aggresive locks for writes, and I have a feeling that it might be very slow for individual
.add(value, attribs)
call on metrics. Actually thisflurry
concurrent map impl, sparked me an idea to implement Measure metrics latency #2323).
I'll come with test results soon, to verify if my concerns are valid or not... :) but basically, I still hold an opinion that this is best approach to move forward.
f38e09a
to
89a1c0c
Compare
Here's some results for this PR, using #2323 There's the outcome (compared to main):
|
Changes
Internal structure for ValueMap has changed, in order to achieve several things:
Merge requirement checklist
CHANGELOG.md
files updated for non-trivial, user-facing changes