Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: have flightGroups take some time to allow reuse #286

Open
wants to merge 3 commits into
base: v8
Choose a base branch
from

Conversation

moshegood
Copy link
Contributor

@moshegood moshegood commented Feb 5, 2024

Max memory usage on two clusters with/without this change.
The lower line is the one with the change.
Note also, the high memory usage servers were actively shedding load and rejecting connections to stay alive.

Screenshot 2024-02-05 at 1 04 15 PM

Other Approaches

The main other way to do this would be to cache the data from the MakeServerSidePutEvent(...) call in getReplayEvent()
That would require cache-invalidation, which would require the data stores to have some version id to enable the ld-relay to know that the underlying information has changed.

Requirements

  • I have added test coverage for new or changed functionality
  • I have followed the repository's pull request submission guidelines
  • I have validated my changes against all supported platform versions

Related issues

Provide links to any issues in this repository or elsewhere relating to this pull request.

Describe the solution you've provided

Provide a clear and concise description of what you expect to happen.

Describe alternatives you've considered

Provide a clear and concise description of any alternative solutions or features you've considered.

Additional context

Add any other context about the pull request here.

@moshegood moshegood requested a review from a team February 5, 2024 17:11
@@ -109,6 +110,7 @@ func (r *serverSideEnvStreamRepository) Replay(channel, id string) chan eventsou
// getReplayEvent will return a ServerSidePutEvent with all the data needed for a Replay.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is only called when we have a new or reconnecting client.
So some small delay here should be acceptable.

@moshegood moshegood force-pushed the moshe/lower.memory.usage branch from e6a11e1 to d22388d Compare February 5, 2024 19:22
@cwaldren-ld
Copy link
Contributor

Hi @moshegood, I like the idea here - trading increased latency for decreased memory usage.

In your use-case, what would you actually use as the configured LD_STREAMING_DELAY_SECONDS?

One downside I'm seeing is that the latency would be introduced even if no one else is being served. A dynamic approach might calculate the incoming RPS and use that to increase the latency, but that might be too complicated.

@moshegood
Copy link
Contributor Author

Hi @moshegood, I like the idea here - trading increased latency for decreased memory usage.

In your use-case, what would you actually use as the configured LD_STREAMING_DELAY_SECONDS?

One downside I'm seeing is that the latency would be introduced even if no one else is being served. A dynamic approach might calculate the incoming RPS and use that to increase the latency, but that might be too complicated.

I've been testing with 2 seconds.
Our big spikes in connectivity come during deployments and after network reconnects between the relays and LaunchDarkly.

@cwaldren-ld cwaldren-ld changed the title have flightGroups take some time to allow reuse feat: have flightGroups take some time to allow reuse Feb 8, 2024
@moshegood moshegood force-pushed the moshe/lower.memory.usage branch 3 times, most recently from 2976a5e to 52ff6a5 Compare February 11, 2024 16:20
@moshegood
Copy link
Contributor Author

Would it be better if we only delay when there was another request completed very recently?
For example, pause the flight group only if there was another request completed less than a second ago?

@cwaldren-ld
Copy link
Contributor

I think so, and I'm wondering if we can use something like token bucket to solve this.

For example:

  1. Setup token bucket with rate 1 req/sec (can be configurable)
  2. Request comes in, check the token bucket.
  3. If token available, then serve the request immediately.
  4. Otherwise, block and initiate flightgroup.
  5. [meanwhile, other requests may come in and join the flightgroup.]
  6. Serve the requests all at once. This fills the token bucket back up, go to (1).

@moshegood
Copy link
Contributor Author

Changed things around so that a single new client connection is responded to immediately.
We only delay the bulk fetch of all data between batches of new clients if we very recently fetched that data, and if the option to do so is set.

@moshegood moshegood force-pushed the moshe/lower.memory.usage branch from 2ea42ef to f517e61 Compare February 16, 2024 20:06
@moshegood
Copy link
Contributor Author

Any updates required on this PR?
Do you need to run additional tests?

@cwaldren-ld
Copy link
Contributor

Hi @moshegood , pardon the delay on my end.

We are still discussing how to approach this internally. Mainly it seems that we might have outgrown the usage of the flightgroup and need a more sophisticated system.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants