Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ActiveRecord::ConnectionTimeoutError "all pooled connections were in use" after enabling concurrency #469

Open
MattFenelon opened this issue Mar 19, 2024 · 16 comments · Fixed by #470 · May be fixed by #472
Open

ActiveRecord::ConnectionTimeoutError "all pooled connections were in use" after enabling concurrency #469

MattFenelon opened this issue Mar 19, 2024 · 16 comments · Fixed by #470 · May be fixed by #472
Labels

Comments

@MattFenelon
Copy link
Contributor

I recently enabled concurrency on a system and it resulted in a lot of ActiveRecord::ConnectionTimeoutError errors.

I believe it's because ActiveRecord has a 1 thread = 1 connection policy and the database connection pool for this system is limited to the number of puma threads. As soon as the system receives enough requests to use up the connection pool, the concurrently side-loaded resource threads are unable to get a connection from the pool, and vice versa.

But I'm surprised no one has come across this issue before so perhaps I've misunderstood the problem.

One solution might be to increase the system's database connection pool size. This would allow Graphiti's threads to have enough connections to not exhaust the thread pool. The difficulty would be in sizing the db pool. Given each request can sideload x resources, the total number of database connections required at any given time would be x * number of concurrent requests.

Ideally though I think it would make sense to allow the thread pool size to be configured. Concurrent::Promise, used to side-loaded resources, has by default a thread pool equal to java's int max.

What do you think?

vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/connection_adapters/abstract/connection_pool/queue.rb:128:in `block in wait_poll': could not obtain a connection from the pool within 5.000 seconds (waited 5.005 seconds); all pooled connections were in use (ActiveRecord::ConnectionTimeoutError)
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/connection_adapters/abstract/connection_pool/queue.rb:117:in `loop'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/connection_adapters/abstract/connection_pool/queue.rb:117:in `wait_poll'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/connection_adapters/abstract/connection_pool/queue.rb:78:in `internal_poll'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/connection_adapters/abstract/connection_pool/queue.rb:204:in `internal_poll'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/connection_adapters/abstract/connection_pool/queue.rb:73:in `block in poll'
    from vendor/ruby-3.2.2/lib/ruby/3.2.0/monitor.rb:202:in `synchronize'
    from vendor/ruby-3.2.2/lib/ruby/3.2.0/monitor.rb:202:in `mon_synchronize'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/connection_adapters/abstract/connection_pool/queue.rb:82:in `synchronize'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/connection_adapters/abstract/connection_pool/queue.rb:73:in `poll'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/connection_adapters/abstract/connection_pool.rb:663:in `acquire_connection'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/connection_adapters/abstract/connection_pool.rb:353:in `checkout'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/connection_adapters/abstract/connection_pool.rb:182:in `connection'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/connection_adapters/abstract/connection_handler.rb:246:in `retrieve_connection'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/connection_handling.rb:287:in `retrieve_connection'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/connection_handling.rb:254:in `connection'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/querying.rb:62:in `_query_by_sql'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/relation.rb:998:in `block in exec_main_query'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/relation.rb:1018:in `skip_query_cache_if_necessary'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/relation.rb:984:in `exec_main_query'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/relation.rb:962:in `block in exec_queries'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/relation.rb:1018:in `skip_query_cache_if_necessary'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/relation.rb:956:in `exec_queries'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/relation.rb:742:in `load'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/relation.rb:264:in `records'
    from vendor/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/relation.rb:259:in `to_ary'
    from lib/graphiti_adapters/an_adapter.rb:19:in `resolve'
    from vendor/bundle/ruby/3.2.0/gems/graphiti-1.3.9/lib/graphiti/resource.rb:117:in `resolve'
    from vendor/bundle/ruby/3.2.0/gems/graphiti-1.3.9/lib/graphiti/scope.rb:22:in `block in resolve'
    from vendor/bundle/ruby/3.2.0/gems/graphiti-1.3.9/lib/graphiti/scope.rb:81:in `block in broadcast_data'
    from vendor/bundle/ruby/3.2.0/gems/graphiti-1.3.9/lib/graphiti.rb:65:in `block in broadcast'
    from vendor/bundle/ruby/3.2.0/gems/activesupport-7.1.3/lib/active_support/notifications.rb:206:in `block in instrument'
    from vendor/bundle/ruby/3.2.0/gems/activesupport-7.1.3/lib/active_support/notifications/instrumenter.rb:58:in `instrument'
    from vendor/bundle/ruby/3.2.0/gems/activesupport-7.1.3/lib/active_support/notifications.rb:206:in `instrument'
    from vendor/bundle/ruby/3.2.0/gems/graphiti-1.3.9/lib/graphiti.rb:64:in `broadcast'
    from vendor/bundle/ruby/3.2.0/gems/graphiti-1.3.9/lib/graphiti/scope.rb:80:in `broadcast_data'
    from vendor/bundle/ruby/3.2.0/gems/graphiti-1.3.9/lib/graphiti/scope.rb:20:in `resolve'
    from vendor/bundle/ruby/3.2.0/gems/graphiti-1.3.9/lib/graphiti/resource_proxy.rb:68:in `data'
    from vendor/bundle/ruby/3.2.0/gems/graphiti-1.3.9/lib/graphiti/sideload.rb:231:in `load'
    from vendor/bundle/ruby/3.2.0/gems/graphiti-1.3.9/lib/graphiti/sideload.rb:299:in `resolve'
    from vendor/bundle/ruby/3.2.0/gems/graphiti-1.3.9/lib/graphiti/scope.rb:47:in `block (2 levels) in resolve_sideloads'
    from vendor/bundle/ruby/3.2.0/gems/concurrent-ruby-1.2.3/lib/concurrent-ruby/concurrent/executor/safe_task_executor.rb:24:in `block in execute'
    from vendor/bundle/ruby/3.2.0/gems/concurrent-ruby-1.2.3/lib/concurrent-ruby/concurrent/synchronization/mutex_lockable_object.rb:48:in `block in synchronize'
    from vendor/bundle/ruby/3.2.0/gems/concurrent-ruby-1.2.3/lib/concurrent-ruby/concurrent/synchronization/mutex_lockable_object.rb:48:in `synchronize'
    from vendor/bundle/ruby/3.2.0/gems/concurrent-ruby-1.2.3/lib/concurrent-ruby/concurrent/synchronization/mutex_lockable_object.rb:48:in `synchronize'
    from vendor/bundle/ruby/3.2.0/gems/concurrent-ruby-1.2.3/lib/concurrent-ruby/concurrent/executor/safe_task_executor.rb:22:in `execute'
    from vendor/bundle/ruby/3.2.0/gems/concurrent-ruby-1.2.3/lib/concurrent-ruby/concurrent/promise.rb:564:in `block in realize'
    from vendor/bundle/ruby/3.2.0/gems/concurrent-ruby-1.2.3/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:359:in `run_task'
    from vendor/bundle/ruby/3.2.0/gems/concurrent-ruby-1.2.3/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:350:in `block (3 levels) in create_worker'
    from vendor/bundle/ruby/3.2.0/gems/concurrent-ruby-1.2.3/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:341:in `loop'
    from vendor/bundle/ruby/3.2.0/gems/concurrent-ruby-1.2.3/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:341:in `block (2 levels) in create_worker'
    from vendor/bundle/ruby/3.2.0/gems/concurrent-ruby-1.2.3/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:340:in `catch'
    from vendor/bundle/ruby/3.2.0/gems/concurrent-ruby-1.2.3/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:340:in `block in create_worker'
Copy link

🎉 This issue has been resolved in version 1.6.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

@jkeen
Copy link
Collaborator

jkeen commented Mar 26, 2024

@MattFenelon moving our convo from #470 into here.

In short I'm still having a ton of trouble with web server lockups after this change and it's hard to figure out what exactly is going on… but it seems like without this change everything is a-ok. My DB pool values and the sideload thread max values seem to be reasonable, and yet sometimes things will just hang causing the webserver to not respond to requests.

I think we need to roll this thing back and investigate further on an unreleased branch. The part that worries me most is that there's no error message or any sort of clue that this is the issue, meaning there could be people out there with apps breaking without a clear cut cause. Also for me in order to get things running properly again it has required a server restart, which is a real bummer.

I'm going to put out a new patch release without these changes, and then maybe we can work together to try and resolve what's going on by working off a branch?

jkeen added a commit that referenced this issue Mar 26, 2024
…n what's causing thread pool hangs. refs #469

revert: "thread pool scope and mutex need to be global across all instances of Scope for it to be a global thread pool (#471)"
revert: "add thread pool and concurrency_max_threads configuration option (#470)"

This reverts commit 51fb51c.
This reverts commit 697d761.
github-actions bot pushed a commit that referenced this issue Mar 26, 2024
## [1.6.3](v1.6.2...v1.6.3) (2024-03-26)

### Bug Fixes

* Remove thread pool executor logic until we get a better handle on what's causing thread pool hangs. refs [#469](#469) ([7941b6f](7941b6f)), closes [#471](#471) [#470](#470)
@MattFenelon
Copy link
Contributor Author

Sorry to hear that @jkeen. I see you've reverted, that makes sense until we can clear this up.

In terms of diagnosing:

  1. What're your puma thread and workers set to?
  2. Did you use the default concurrency_max_threads setting?
  3. What's pool set to in database.yml?
  4. Does your database have a connection limit? If so, what is it?
  5. How many dynos/containers/vms do you have running?

MattFenelon added a commit to MattFenelon/graphiti that referenced this issue Mar 27, 2024
…handle on what's causing thread pool hangs. refs graphiti-api#469"

This reverts commit 7941b6f.
MattFenelon added a commit to MattFenelon/graphiti that referenced this issue Mar 27, 2024
This option allows to limit the maximum number of resources that can be
sideloaded concurrently. With a properly configured connection pool,
this ensures that the activerecord's connection pool is not exhausted by
the sideloading process.

The thread pool configuration is based on ActiveRecord's
global_thread_pool_async_query_executor.

This was previously attempted but there were reports of deadlocks. This
code differs from the original by using Concurrency::Delay assigned to a
constant instead of a regular Mutex. The Delay+constant is how
concurrent-ruby sets up their global thread pools so it's more likely to
be correct.

Closes graphiti-api#469

See: https://github.com/rails/rails/blob/94faed4dd4c915829afc2698e055a0265d9b5773/activerecord/lib/active_record.rb#L279-L287
See: graphiti-api#470
@MattFenelon MattFenelon linked a pull request Mar 27, 2024 that will close this issue
7 tasks
@MattFenelon
Copy link
Contributor Author

The fact that it's locking up completely makes me think it's a deadlock related to the use of the Mutex. I've gone with a different approach in #472 but I haven't been able to test those changes yet. Do you want to give it a try? I can test them in about a week or so.

jkeen added a commit to jkeen/graphiti that referenced this issue Mar 27, 2024
…n what's causing thread pool hangs. refs graphiti-api#469

revert: "thread pool scope and mutex need to be global across all instances of Scope for it to be a global thread pool (graphiti-api#471)"
revert: "add thread pool and concurrency_max_threads configuration option (graphiti-api#470)"

This reverts commit 51fb51c.
This reverts commit 697d761.
jkeen pushed a commit to jkeen/graphiti that referenced this issue Mar 27, 2024
## [1.6.3](graphiti-api/graphiti@v1.6.2...v1.6.3) (2024-03-26)

### Bug Fixes

* Remove thread pool executor logic until we get a better handle on what's causing thread pool hangs. refs [graphiti-api#469](graphiti-api#469) ([7941b6f](graphiti-api@7941b6f)), closes [graphiti-api#471](graphiti-api#471) [graphiti-api#470](graphiti-api#470)
@jkeen jkeen added bug and removed released labels Mar 27, 2024
@jkeen
Copy link
Collaborator

jkeen commented Mar 27, 2024

For the staging site that kept running into issues yesterday:

DB_POOL = 30
RAILS_MAX_THREADS = 10
GRAPHITI_CONCURRENCY_MAX_THREADS = 3
WEB_CONCURRENCY = 2

It's running on a VPS and when inspecting the process list it didn't seem like any processes had CPU spiked, or anything else that might usually jump out at me as a problem. I wasn't sure what I should be looking for in spotting any hung threads from the command line, even. Any pointers there?

I can give #472 a try this week sometime. Appreciate the follow up

@MattFenelon
Copy link
Contributor Author

I think I've found the issue. It only seems to happen when side-loading children of child resources. Does that tally up with the side loading you're doing, e.g. include=resource1.resource2(.resource3)+? I don't think this happens when the includes are only 1 level deep, e.g. include=resource1,resource2,(resource3)+.

For the sake of an example, let's imagine a thread pool of 1 and an unbounded queue:

  1. The parent resource creates Promise 1 to load child-resource :authors
  2. Waits for Promise 1 to finish
  3. Thread 1 is leased to run Promise 1. There are no more threads available in the pool.
  4. Thread 1 - Promise 1 - the authors resource creates Promise 2 to load child-resource :books
  5. Thread 1 - Promise 1 - waits for Promise 2 to finish
  6. Promise 2 can't run because the thread pool is exhausted. Thread 1 can't be released because Promise 1 is waiting on Promise 2. This is a deadlock.

Though it's technically a deadlock the graphiti code uses sleep to wait for the promises so it manifests itself as a hang. In my PR I replaced the sleep with Concurrent::Promise.zip(*promises).value!, which connects the threads together, allowing ruby to detect the deadlock and raise a fatal exception No live threads left. Deadlock?.

I think it can be fixed by making the child resource sideload logic non-blocking. That would remove the need for the child resources to wait, which would then free up the threads to be used for other promises in the queue. But it probably needs some kind of queue to allow the parent resource to know when all child resources are loaded.

  1. The parent resource creates Promise 1 to load child-resource :authors
  2. Waits for all promises in the queue to finish
  3. Thread 1 is leased to run Promise 1. There are no more threads available in the pool.
  4. Thread 1 - Promise 1 - the authors resource creates Promise 2 to load child-resource :books
  5. Thread 1 - Promise 1 finishes. Thread 1 is added back to the thread pool.
  6. Thread 1 is leased to run Promise 2.
  7. Thread 1 - Promise 2 - the books resource is loaded. There are no other resources to load.
  8. Thread 1 - Promise 2 finishes.
  9. The parent resource stops waiting as all promises in the queue have finished
  10. The parent resource returns. All resources have been loaded at this point.

A few ifs there but I'm going to try this approach and see how I get on.

I've added a successfully failing test case that shows the deadlock: e4de93e

@jkeen
Copy link
Collaborator

jkeen commented Apr 12, 2024

And just for the sake of clarity—how is that scenario you outlined working now in the released version of graphiti?

I guess what I'm trying to understand is if the strategy you're describing works, does it yield in performance gains, or does it yield a more safe approach to what is currently happening by default?

@MattFenelon
Copy link
Contributor Author

Sure I can do that. There's no (real) limit to the threads that the released version of graphiti can use. Instead the problem is that the unbounded threads eventually can exhaust the available database connection pool as ActiveRecord has a 1 thread = 1 connection policy.

For this failure example, we have a puma thread pool of 1 and a database connection pool of 1:

  1. Thread 1 - The parent resource is accessed.
  2. Thread 1 - The parent resource loads data from ActiveRecord. ActiveRecord leases the thread a connection from the connection pool. The database connection pool is now empty
  3. Thread 1 - The parent resource creates Thread 2 to load the :authors resource
  4. Thread 1 - waits on thread 2.
  5. Thread 2 - Attempts to load author data from ActiveRecord. AR attempts to lease a database connection to the thread but can't because thread 1 has the only available connection.
  6. Thread 2 - raises ActiveRecord::ConnectionTimeoutError "all pooled connections were in use".

In the real world this is less likely to happen because there have to be enough web threads running at the same time to exhaust the database pool.

For example, consider we have 5 puma threads and a database connection pool of 5:

  1. Thread 1 - The parent resource is accessed.
  2. Thread 1 - The parent resource loads data from ActiveRecord. ActiveRecord leases the thread a connection from the connection pool. There are 4 connections left in the pool
  3. Thread 1 - The parent resource creates Thread 2 to load the :authors resource
  4. Thread 1 - waits on thread 2.
  5. Thread 2 - Attempts to load author data from ActiveRecord. ActiveRecord leases the thread a connection from the connection pool. There are 3 connections left in the pool
  6. Thread 3 - a new request comes in. ActiveRecord leases the thread a connection from the connection pool. There are 2 connections left in the pool
  7. Thread 2 - finishes. The connection is made available to the pool again. There are 3 connections left in the pool
  8. Thread 1 - finishes waiting. The connection is made available to the pool again. There are 4 connections left in the pool

And so on. If someone has limited traffic, limited sideloads, or a large database connection pool, they may never see this issue.

A naive way to solve it is to just increase the database connection pool but the AR error will raise its head eventually with enough traffic and sideloads.

@jkeen
Copy link
Collaborator

jkeen commented Jun 7, 2024

@MattFenelon I was adjusting values for db pool, max threads, and web_concurrency today when I ran square into this exact problem! I never ran into it before because of how my database.yml is setup where RAILS_MAX_THREADS wasn't being used for the pool value, where instead I had a separate DB_POOL var defined (and set at 40). Others might have a similar setup, thereby never running into it

image

@MattFenelon
Copy link
Contributor Author

MattFenelon commented Sep 27, 2024

@jkeen Good to know!

I agree it's very much based on each app's configuration and resource lay out. With low sideloads and/or a large database connection pool, it's unlikely to ever be a problem.

That said, the default for new Rails' apps is to use RAILS_MAX_THREADS for the database connection pool. An app with that configuration will run into the problem sooner or later.

https://github.com/rails/rails/blame/13d5f87208921cc5139cd53b280c42f23c334826/railties/lib/rails/generators/rails/app/templates/config/databases/postgresql.yml.tt

@geeosh
Copy link

geeosh commented Dec 19, 2024

I've run into this issue of deadlocks a lot when running tests locally or in using my app on staging (although I don't get ActiveRecord::ConnectionTimeoutError locally, only in my staging environment). It would be helpful to raise some kind of exception to know this is the issue when there is a deadlock.

@MattFenelon any idea about a potential solution in sight? Any way I can help?

@geeosh
Copy link

geeosh commented Dec 20, 2024

This is interesting, I added the ActionDispatch::DebugLocks middleware, and found that the line that was creating the deadlock for me was in lib/graphiti/util/class.rb:9 in infer_resource_class where it calls safe_constantize when the sideloaded resources were loading. If I turn on eager loading then the deadlock goes away. Sounds like this might be a separate issue...

@MattFenelon
Copy link
Contributor Author

@geeosh #472 contains the fix for the activerecord::ConnectionTimeoutError issue.

The deadlocks were introduced in an earlier fix for this issue. That fix was reverted so you shouldn't be seeing deadlocks caused by that fix. If you’re still seeinn deadlocks, and you’re not on the version of Graphiti that contained the faulty fix, that’s a separate issue.

@jkeen
Copy link
Collaborator

jkeen commented Dec 26, 2024

@MattFenelon Last I tested this on my staging app I ran into the same lockup issue I was running into before—where everything seized and I had to restart the server, which would happen after a query I had with 9 sideloads. I started typing out a comment about it a while ago and I got distracted half way through and here we are—sorry for the delay in communicating this + merging.

I'll add that DebugLocks middleware and see if that exposes anything, let me know if there's any other configuration or debug I should enable to root out the issue

@geeosh
Copy link

geeosh commented Dec 27, 2024

@MattFenelon great thanks for clarifying. I am running version 1.7.6 and still seeing deadlocks, so it must be something else. I'll have to dig into it again.

@MattFenelon
Copy link
Contributor Author

MattFenelon commented Jan 7, 2025

@MattFenelon Last I tested this on my staging app I ran into the same lockup issue I was running into before—where everything seized and I had to restart the server, which would happen after a query I had with 9 sideloads.

Ok, thanks for the bug report @jkeen. I'll do some testing on my side and see if I can replicate. Could you provide more info of the issue you're seeing?

Are you able to provide the value of the include parameter you used to sideload those resources? If you can't share, can you say if they all single level (e.g. sideload1, sideload2), or multi-level (e.g. sideload.child1, sideload.child2), or a combination?

Are you using the default concurrency level of 4 or another setting? Also, what's your database pool set to?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants