
Child process sometimes hangs, pegging CPU @ 100% #234

Closed
Adam7288 opened this issue Sep 1, 2022 · 84 comments

Comments

@Adam7288

Adam7288 commented Sep 1, 2022

Here is a pstack trace for one of the hung processes pegging the CPU at 100% - it looks like some kind of deadlock:

#0 __lll_unlock_wake () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:371
#1 0x00007f2d659f0f9e in _L_unlock_738 () from /lib64/libpthread.so.0
#2 0x00007f2d659f0f10 in __pthread_mutex_unlock_usercnt (decr=1, mutex=0x7f2c7c5248f8) at pthread_mutex_unlock.c:55
#3 __GI___pthread_mutex_unlock (mutex=0x7f2c7c5248f8) at pthread_mutex_unlock.c:330
#4 0x00007f2d57cb140d in h2_beam_receive () from /etc/httpd/modules/mod_http2.so
#5 0x00007f2d57cc8ee3 in buffer_output_receive () from /etc/httpd/modules/mod_http2.so
#6 0x00007f2d57ccb1ec in stream_data_cb () from /etc/httpd/modules/mod_http2.so
#7 0x00007f2d66cba171 in nghttp2_session_pack_data () from /lib64/libnghttp2.so.14
#8 0x00007f2d66cbaedd in nghttp2_session_mem_send_internal () from /lib64/libnghttp2.so.14
#9 0x00007f2d66cbbae9 in nghttp2_session_send () from /lib64/libnghttp2.so.14
#10 0x00007f2d57cc7544 in h2_session_send () from /etc/httpd/modules/mod_http2.so
#11 0x00007f2d57cc777a in h2_session_process () from /etc/httpd/modules/mod_http2.so
#12 0x00007f2d57cb2149 in h2_c1_run () from /etc/httpd/modules/mod_http2.so
#13 0x00007f2d57cb2569 in h2_c1_hook_process_connection () from /etc/httpd/modules/mod_http2.so
#14 0x00005571f66c33c0 in ap_run_process_connection (c=c@entry=0x7f2d4006d5e0) at connection.c:42
#15 0x00007f2d5a8ab40a in process_socket (thd=thd@entry=0x5571f72c4510, p=, sock=, cs=0x7f2d4006d530, my_child_num=my_child_num@entry=11, my_thread_num=my_thread_num@entry=12) at event.c:1086
#16 0x00007f2d5a8ae6ae in worker_thread (thd=0x5571f72c4510, dummy=) at event.c:2179
#17 0x00007f2d659edea5 in start_thread (arg=0x7f2d3a7f4700) at pthread_create.c:307
#18 0x00007f2d65512b0d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

@icing
Owner

icing commented Sep 1, 2022

Hmm, how could it deadlock when unlocking a mutex?

Which version of httpd and mod_http2 are you running?

@Adam7288
Author

Adam7288 commented Sep 1, 2022

Good point, just trying to sound competent, failing.

Server version: Apache/2.4.54 (codeit)

Name : mod_http2
Arch : x86_64
Version : 2.0.3
Release : 2.codeit.el7
Size : 398 k
Repo : installed
From repo : CodeIT
Summary : module implementing HTTP/2 for Apache 2
URL : https://icing.github.io/mod_h2/
License : ASL 2.0
Description : The mod_h2 Apache httpd module implements the HTTP2 protocol
: (h2+h2c) on top of libnghttp2 for httpd 2.4 servers.

@alexskynet

alexskynet commented Sep 16, 2022

Same problem here on several different servers:

OS CentOS 7
httpd-2.4.54-1.codeit.el7.x86_64
mod_http2-2.0.4-1.codeit.el7.x86_64
The problem appears when using mpm_worker or mpm_event.
It doesn't happen using mpm_prefork.
So we had to "downgrade" from mpm_event to prefork to keep http2 turned on, even though we lost a lot of mpm performance.
Version 2.0.4 does not fix the behaviour.
httpd processes keep the CPU at 100% and hold on to memory; progressively every CPU core and then the RAM fills up.
strace-ing the stuck httpd processes shows they are all stuck in read.

Hope this may help to track down the problem

@icing
Owner

icing commented Sep 16, 2022

@alexskynet you do not have to switch mpm to disable http/2. http/2 is only used if you configure

Protocols h2 http/1.1

in your server. If you comment that out or just configure Protocols http/1.1, no h2 will be used.

That allows you to determine if the problem is related to http2 or if there is another problem in your server.
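For reference, the relevant httpd.conf lines would look roughly like this (a sketch only; exact placement depends on your vhost layout):

# HTTP/2 enabled, falling back to HTTP/1.1:
Protocols h2 http/1.1

# HTTP/1.1 only, to rule mod_http2 out:
Protocols http/1.1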

@alexskynet

alexskynet commented Sep 16, 2022

Thank you very much @icing
I realized that a few minutes after my post, so I'm running with event now.
I also realized that the reason it worked with prefork is that mod_http2 doesn't work with that MPM.
We ran for several days with no problems, while with mpm_event and http2, after a while (anywhere from ten minutes to at most one hour) we started getting more and more httpd processes stuck in read until the server hung completely.
Do you have any idea how we might work around those hangs?
Thank you again and best regards

@alexskynet

I want to add a little piece to the puzzle that may help.
We run several servers and noticed that some of them work fine with http2 with no hangs.
Digging a little, we noticed that the servers whose apache processes hang have their root on an NFS filesystem, while the ones that don't hang have a local filesystem.
Running the NFS-rooted servers without mod_http2 loaded leaves them fully functional with no hangs of any sort.
It would be interesting to know whether @Adam7288's setup is NFS-rooted or not.

@alexskynet

alexskynet commented Sep 21, 2022

Another bit of info.
My previous message isn't entirely accurate ...
We discovered a server with an NFS-based root that works fine with mod_http2.
We see no differences in the mount point configuration:

172.16.0.100:/apachesites on /var/www/virtual type nfs4 (rw,noatime,nodiratime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=172.16.0.32,local_lock=none,addr=172.16.0.100)

The difference is that the hanging servers all run Joomla (both versions 3 and 4) with FastCGI, but show no errors in the site logs or in the PHP logs.

The working server uses the same configuration but runs a custom php application.

From the logs users are using http2 without any kind of hang.

We are a little bit confused at this point

@icing
Owner

icing commented Sep 21, 2022

Yeah, hard to see the pattern from here.

@icing
Owner

icing commented Sep 21, 2022

I think you'd need to vary one thing at a time in your setup in order to find out which combination is the critical one. For example, place your /var/www/virtual/ on an ext4 and see if the problem persists.

@alexskynet

alexskynet commented Sep 21, 2022

Yes @icing, this is a nightmare: too many moving parameters ...
Three servers sharing the same kind of configuration show the same behaviour.
They are all CentOS 7 with the latest updates.
They all use FastCGI for PHP (various versions from the REMI repo, including the latest available 7.4, 8.0 and 8.1) and mod_pagespeed.
One of them uses ISPConfig.
They all have an NFS-rooted system and they all hang with httpd processes stuck in read.
A fourth server runs Nextcloud with the same configuration except a local FS: this one doesn't hang.
A sixth and a seventh server share the same configs with NFS root but only run a custom PHP application and don't hang.
So ... a real puzzle.

@alexskynet

Tried to downgrade the mod_http2 from 2.0.4 to 2.0.3 on one server
Let's wait

@alexskynet

Downgrading to 2.0.3 seems to fix the problem
We've run for more than two hours with no hangs and normal load.
Now we are testing the downgrade on a second server

@Adam7288
Author

I was experiencing this issue on 2.0.3 and continue to experience it on 2.0.4. NFS is something we use too so that is an interesting avenue. @icing is there any way we can get better data to troubleshoot this?

@alexskynet

alexskynet commented Sep 21, 2022

Thank you for your answer @Adam7288, this sounds interesting.
Which version of the NFS protocol are you using?
3, 4.0 or 4.1?
We are using protocol 4.0 with the noatime and nodiratime options on the servers where we're seeing the anomaly.
I suspect something with NFS around the release of locks, maybe???
In the meantime we've been running for some hours on 2.0.3 with no hangs, while with 2.0.4 it is usually a matter of minutes ...

@icing
Owner

icing commented Sep 21, 2022

I looked at the changes between 2.0.3 and 2.0.4 and would be surprised if they made the difference. BUT I can make you a 2.0.5 with the changes I did in Apache trunk. There are differences in connection handling. Hard to tell if they improve your situation, but I'd like to know how those changes fare for you.

The alternative in your case would be to try the mod_http2 that was released in Apache 2.5.54.

@icing
Owner

icing commented Sep 21, 2022

Regarding NFS usage: I have no personal experience with running a server that way. As I understand your setups, you use NFS to share the htdocs between nodes.

The question is: are files on the NFS share being modified, and might that have an impact? Do you have EnableMmap on for this, for example? Do you have H2CopyFiles on or off (off is the default)?
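For context, a sketch of how those directives could be set (the path is the NFS mount from the report above; H2CopyFiles defaults to off, and EnableMmap is sometimes turned off for NFS-backed document trees):

# avoid memory-mapping files that live on the NFS mount:
<Directory "/var/www/virtual">
    EnableMmap off
</Directory>

# let mod_http2 copy file contents instead of passing file handles across threads:
H2CopyFiles on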

@Adam7288
Author

The only time NFS is invoked on our end is within a PHP script that accesses files in some way - for instance a document download. We do not serve static or dynamic content directly from the NFS share.

@alexskynet

Thank you @icing
I appreciate your help very much
I understand perfectly that sometimes software development leads to unexpected results ... it has happened to me so many times too :-)
If I can help in testing I'll be more than happy to do it.
Yes, you understood perfectly.
We share the root among some nodes.
The only files that are actually modified while the webserver is being accessed are cache files, which may be regenerated when expired, and session files.
Yes EnableMmap is on

@icing
Owner

icing commented Sep 21, 2022

Ok, I'll make a v2.0.5 for testing. In the meantime, you might want to configure H2CopyFiles on to see if that makes any difference for you.

@icing
Owner

icing commented Sep 21, 2022

@alexskynet

thank you @icing
I'm compiling right now
Stay tuned

@alexskynet

Sorry @icing
Module installed and restarted httpd
Got 2 stuck processes in read in a couple of minutes

@icing
Owner

icing commented Sep 21, 2022

"stuck in read" as from the status handler or some backtrace?

@alexskynet

alexskynet commented Sep 21, 2022

strace -p
shows read
htop shows 100% CPU for the process

@alexskynet

Status shows

6781 0/0/0 W 0.00 22 0 0 0.0 0.00 0.00 87.27.4.2 h2

rolled back to 2.0.3: no hangs till now

@icing
Owner

icing commented Sep 21, 2022

Hmm, I assume your site is too busy to raise the http2 log level much?

@alexskynet

I can reinstall 2.0.5 and try to increase the logging for a few minutes if it helps.
Which directive do I need to insert?

@icing
Owner

icing commented Sep 21, 2022

If you can take it, LogLevel http2:trace1, but maybe start with LogLevel http2:debug.
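A per-module level can be combined with the global one, so a sketch for the test would be something like (keep your usual global level and raise only http2):

# first pass:
LogLevel warn http2:debug

# if the volume is manageable:
LogLevel warn http2:trace1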

@alexskynet

Ok give me some minutes

@alexskynet

Stuck within seconds with 2.0.5.
This is the trace log, but it looks OK to me:

[Wed Sep 21 16:09:53.574077 2022] [http2:debug] [pid 23712:tid 139812816349376] h2_workers.c(318): h2_workers: cleanup 25 workers idling
[Wed Sep 21 16:09:53.574635 2022] [http2:debug] [pid 23712:tid 139812816349376] h2_workers.c(346): h2_workers: cleanup all workers terminated
[Wed Sep 21 16:09:53.575301 2022] [http2:debug] [pid 23712:tid 139812816349376] h2_workers.c(349): h2_workers: cleanup zombie workers joined
[Wed Sep 21 16:12:20.930961 2022] [mpm_worker:notice] [pid 23636:tid 139812816349376] AH00297: SIGUSR1 received. Doing graceful restart
[Wed Sep 21 16:12:21.009126 2022] [http2:debug] [pid 25117:tid 139812816349376] h2_workers.c(370): h2_workers: cleanup 25 workers (25 idle)
[Wed Sep 21 16:12:21.009464 2022] [http2:debug] [pid 25117:tid 139812816349376] h2_workers.c(406): h2_workers: cleanup all workers terminated
[Wed Sep 21 16:12:21.010195 2022] [http2:debug] [pid 25117:tid 139812816349376] h2_workers.c(411): h2_workers: cleanup zombie workers joined
[Wed Sep 21 16:12:22.385166 2022] [lbmethod_heartbeat:notice] [pid 23636:tid 139812816349376] AH02282: No slotmem from mod_heartmonitor
[Wed Sep 21 16:12:22.388610 2022] [http2:info] [pid 23636:tid 139812816349376] AH03090: mod_http2 (v2.0.5-git, feats=CHPRIO+SHA256+INVHD+DWINS, nghttp2 1.49.0), initializing...
[Wed Sep 21 16:12:22.388690 2022] [http2:trace1] [pid 23636:tid 139812816349376] h2_protocol.c(426): h2_h2, child_init
[Wed Sep 21 16:12:22.388773 2022] [http2:trace1] [pid 23636:tid 139812816349376] h2_switch.c(46): h2_switch init
[Wed Sep 21 16:12:22.526948 2022] [mpm_worker:notice] [pid 23636:tid 139812816349376] AH00292: Apache/2.4.54 (codeit) OpenSSL/1.1.1q+quic mod_fcgid/2.3.9 configured -- resuming normal operations
[Wed Sep 21 16:12:22.587929 2022] [core:notice] [pid 23636:tid 139812816349376] AH00094: Command line: '/usr/sbin/httpd -D FOREGROUND'
[Wed Sep 21 16:12:22.592326 2022] [http2:info] [pid 26351:tid 139812816349376] h2_workers: created with min=25 max=37 idle_ms=600000
[Wed Sep 21 16:12:23.448280 2022] [http2:trace1] [pid 26351:tid 139811847399168] h2_c1.c(218): [client 10.0.10.252:39865] h2_h2, process_conn
[Wed Sep 21 16:12:23.448344 2022] [http2:trace1] [pid 26351:tid 139811847399168] h2_c1.c(226): [client 10.0.10.252:39865] h2_h2, process_conn, new connection using protocol 'http/1.1', direct=1, tls acceptable=1
[Wed Sep 21 16:12:23.448371 2022] [http2:trace1] [pid 26351:tid 139811847399168] h2_c1.c(282): [client 10.0.10.252:39865] h2_h2, declined
[Wed Sep 21 16:12:23.594873 2022] [http2:info] [pid 26438:tid 139812816349376] h2_workers: created with min=25 max=37 idle_ms=600000
[Wed Sep 21 16:12:25.171200 2022] [http2:debug] [pid 25034:tid 139812816349376] h2_workers.c(370): h2_workers: cleanup 25 workers (25 idle)
[Wed Sep 21 16:12:25.171708 2022] [http2:debug] [pid 25034:tid 139812816349376] h2_workers.c(406): h2_workers: cleanup all workers terminated
[Wed Sep 21 16:12:25.172210 2022] [http2:debug] [pid 25034:tid 139812816349376] h2_workers.c(411): h2_workers: cleanup zombie workers joined
[Wed Sep 21 16:12:33.463932 2022] [http2:trace1] [pid 26438:tid 139812048725760] h2_c1.c(218): [client 10.0.10.252:12296] h2_h2, process_conn
[Wed Sep 21 16:12:33.463996 2022] [http2:trace1] [pid 26438:tid 139812048725760] h2_c1.c(226): [client 10.0.10.252:12296] h2_h2, process_conn, new connection using protocol 'http/1.1', direct=1, tls acceptable=1
[Wed Sep 21 16:12:33.464018 2022] [http2:trace1] [pid 26438:tid 139812048725760] h2_c1.c(282): [client 10.0.10.252:12296] h2_h2, declined
[Wed Sep 21 16:12:36.796691 2022] [http2:trace1] [pid 26351:tid 139811855791872] h2_c1.c(218): [client 104.155.8.47:55726] h2_h2, process_conn
[Wed Sep 21 16:12:36.796740 2022] [http2:trace1] [pid 26351:tid 139811855791872] h2_c1.c(226): [client 104.155.8.47:55726] h2_h2, process_conn, new connection using protocol 'http/1.1', direct=1, tls acceptable=1
[Wed Sep 21 16:12:36.796759 2022] [http2:trace1] [pid 26351:tid 139811855791872] h2_c1.c(282): [client 104.155.8.47:55726] h2_h2, declined
[Wed Sep 21 16:12:43.466684 2022] [http2:trace1] [pid 26438:tid 139812057118464] h2_c1.c(218): [client 10.0.10.252:16043] h2_h2, process_conn
[Wed Sep 21 16:12:43.466738 2022] [http2:trace1] [pid 26438:tid 139812057118464] h2_c1.c(226): [client 10.0.10.252:16043] h2_h2, process_conn, new connection using protocol 'http/1.1', direct=1, tls acceptable=1
[Wed Sep 21 16:12:43.467413 2022] [http2:trace1] [pid 26438:tid 139812057118464] h2_c1.c(282): [client 10.0.10.252:16043] h2_h2, declined
[Wed Sep 21 16:12:53.482230 2022] [http2:trace1] [pid 26351:tid 139811788650240] h2_c1.c(218): [client 10.0.10.252:39973] h2_h2, process_conn
[Wed Sep 21 16:12:53.482277 2022] [http2:trace1] [pid 26351:tid 139811788650240] h2_c1.c(226): [client 10.0.10.252:39973] h2_h2, process_conn, new connection using protocol 'http/1.1', direct=1, tls acceptable=1
[Wed Sep 21 16:12:53.482290 2022] [http2:trace1] [pid 26351:tid 139811788650240] h2_c1.c(282): [client 10.0.10.252:39973] h2_h2, declined
[Wed Sep 21 16:13:00.838746 2022] [http2:trace1] [pid 26438:tid 139812006762240] h2_c1.c(218): [client 80.94.92.231:56704] h2_h2, process_conn
[Wed Sep 21 16:13:00.838798 2022] [http2:trace1] [pid 26438:tid 139812006762240] h2_c1.c(226): [client 80.94.92.231:56704] h2_h2, process_conn, new connection using protocol 'http/1.1', direct=1, tls acceptable=1
[Wed Sep 21 16:13:00.838821 2022] [http2:trace1] [pid 26438:tid 139812006762240] h2_c1.c(282): [client 80.94.92.231:56704] h2_h2, declined
[Wed Sep 21 16:13:03.492551 2022] [http2:trace1] [pid 26351:tid 139811855791872] h2_c1.c(218): [client 10.0.10.252:31642] h2_h2, process_conn
[Wed Sep 21 16:13:03.492615 2022] [http2:trace1] [pid 26351:tid 139811855791872] h2_c1.c(226): [client 10.0.10.252:31642] h2_h2, process_conn, new connection using protocol 'http/1.1', direct=1, tls acceptable=1
[Wed Sep 21 16:13:03.492640 2022] [http2:trace1] [pid 26351:tid 139811855791872] h2_c1.c(282): [client 10.0.10.252:31642] h2_h2, declined
[Wed Sep 21 16:13:13.506889 2022] [http2:trace1] [pid 26351:tid 139811704723200] h2_c1.c(218): [client 10.0.10.252:24285] h2_h2, process_conn
[Wed Sep 21 16:13:13.506939 2022] [http2:trace1] [pid 26351:tid 139811704723200] h2_c1.c(226): [client 10.0.10.252:24285] h2_h2, process_conn, new connection using protocol 'http/1.1', direct=1, tls acceptable=1
[Wed Sep 21 16:13:13.506957 2022] [http2:trace1] [pid 26351:tid 139811704723200] h2_c1.c(282): [client 10.0.10.252:24285] h2_h2, declined
[Wed Sep 21 16:13:23.517390 2022] [http2:trace1] [pid 26351:tid 139811662759680] h2_c1.c(218): [client 10.0.10.252:31236] h2_h2, process_conn
[Wed Sep 21 16:13:23.517450 2022] [http2:trace1] [pid 26351:tid 139811662759680] h2_c1.c(226): [client 10.0.10.252:31236] h2_h2, process_conn, new connection using protocol 'http/1.1', direct=1, tls acceptable=1
[Wed Sep 21 16:13:23.517469 2022] [http2:trace1] [pid 26351:tid 139811662759680] h2_c1.c(282): [client 10.0.10.252:31236] h2_h2, declined
[Wed Sep 21 16:13:33.515813 2022] [http2:trace1] [pid 26351:tid 139811847399168] h2_c1.c(218): [client 10.0.10.252:18757] h2_h2, process_conn

@icing
Owner

icing commented Sep 23, 2022

Thanks @alexskynet for putting v2.0.6 into your grinder. Sorry that it did nothing to improve things. 😢

Analyzing the backtraces now.

@icing
Owner

icing commented Sep 23, 2022

There was one change from v2.0.3 to newer ones that involved thread creation for the h2 workers. A quick check to see if that is causing problems would be to configure your server with a fixed number of h2 workers, like in:

H2MinWorkers 25
H2MaxWorkers 25

so that all workers are created at startup and no dynamic creation/destruction of threads happens.

Could you give this a shot?

@alexskynet

unfortunately no change
Still 100% CPU with your settings and 2.0.6

@alexskynet

Thread 17 (Thread 0x7ff9e1ff3700 (LWP 8225)):
#0 0x00007ffa1522c75d in read () from /lib64/libpthread.so.0
#1 0x00007ffa1545fd4b in apr_socket_recv () from /lib64/libapr-1.so.0
#2 0x00007ffa15ae3ae4 in socket_bucket_read () from /lib64/libaprutil-1.so.0
#3 0x000055ce02ad2e71 in ap_core_input_filter ()
#4 0x00007ffa0d3cd04e in logio_in_filter () from /etc/httpd/modules/mod_logio.so
#5 0x00007ffa07210a9a in bio_filter_in_read () from /etc/httpd/modules/mod_ssl.so
#6 0x00007ffa06fc95ea in bread_conv () from /lib64/libcrypto.so.81.1.1
#7 0x00007ffa06fc82c6 in BIO_read () from /lib64/libcrypto.so.81.1.1
#8 0x00007ffa1655ee21 in ssl3_read_n () from /lib64/libssl.so.81.1.1
#9 0x00007ffa16563622 in ssl3_get_record () from /lib64/libssl.so.81.1.1
#10 0x00007ffa16560e24 in ssl3_read_bytes () from /lib64/libssl.so.81.1.1
#11 0x00007ffa16568b60 in ssl3_read () from /lib64/libssl.so.81.1.1
#12 0x00007ffa16575665 in SSL_read () from /lib64/libssl.so.81.1.1
#13 0x00007ffa07211f73 in ssl_io_input_read () from /etc/httpd/modules/mod_ssl.so
#14 0x00007ffa07214e2c in ssl_io_filter_input () from /etc/httpd/modules/mod_ssl.so
#15 0x00007ffa066c594a in read_and_feed (session=session@entry=0x7ff9d4004c60) at h2_c1_io.c:486
#16 0x00007ffa066c6ecf in h2_c1_read (session=session@entry=0x7ff9d4004c60) at h2_c1_io.c:513
#17 0x00007ffa066da08b in h2_session_process (session=0x7ff9d4004c60, async=) at h2_session.c:1865
#18 0x00007ffa066c4cc9 in h2_c1_run (c=c@entry=0x7ff9e8007878) at h2_c1.c:130
#19 0x00007ffa066c50dd in h2_c1_hook_process_connection (c=0x7ff9e8007878) at h2_c1.c:278

@alexskynet

Thread 25 (Thread 0x7ff9e5ffb700 (LWP 8217)):
#0 0x00007ffa1522c75d in read () from /lib64/libpthread.so.0
#1 0x00007ffa1545fd4b in apr_socket_recv () from /lib64/libapr-1.so.0
#2 0x00007ffa15ae3ae4 in socket_bucket_read () from /lib64/libaprutil-1.so.0
#3 0x000055ce02ad2e71 in ap_core_input_filter ()
#4 0x00007ffa0d3cd04e in logio_in_filter () from /etc/httpd/modules/mod_logio.so
#5 0x00007ffa07210a9a in bio_filter_in_read () from /etc/httpd/modules/mod_ssl.so
#6 0x00007ffa06fc95ea in bread_conv () from /lib64/libcrypto.so.81.1.1
#7 0x00007ffa06fc82c6 in BIO_read () from /lib64/libcrypto.so.81.1.1
#8 0x00007ffa1655ee21 in ssl3_read_n () from /lib64/libssl.so.81.1.1
#9 0x00007ffa16563622 in ssl3_get_record () from /lib64/libssl.so.81.1.1
#10 0x00007ffa16560e24 in ssl3_read_bytes () from /lib64/libssl.so.81.1.1
#11 0x00007ffa16568b60 in ssl3_read () from /lib64/libssl.so.81.1.1
#12 0x00007ffa16575665 in SSL_read () from /lib64/libssl.so.81.1.1
#13 0x00007ffa07211f73 in ssl_io_input_read () from /etc/httpd/modules/mod_ssl.so
#14 0x00007ffa07214e2c in ssl_io_filter_input () from /etc/httpd/modules/mod_ssl.so
#15 0x00007ffa066c594a in read_and_feed (session=session@entry=0x7ff9d002cbd0) at h2_c1_io.c:486
#16 0x00007ffa066c6ecf in h2_c1_read (session=session@entry=0x7ff9d002cbd0) at h2_c1_io.c:513
#17 0x00007ffa066da08b in h2_session_process (session=0x7ff9d002cbd0, async=) at h2_session.c:1865
#18 0x00007ffa066c4cc9 in h2_c1_run (c=c@entry=0x7ff9e8005798) at h2_c1.c:130
#19 0x00007ffa066c50dd in h2_c1_hook_process_connection (c=0x7ff9e8005798) at h2_c1.c:278
#20 0x000055ce02ab13c0 in ap_run_process_connection ()
#21 0x00007ffa096c9d9b in worker_thread () from /etc/httpd/modules/mod_mpm_worker.so
#22 0x00007ffa15225ea5 in start_thread () from /lib64/libpthread.so.0
#23 0x00007ffa14d4ab0d in clone () from /lib64/libc.so.6

@icing
Owner

icing commented Sep 23, 2022

v2.0.7 released with a fix for mpm_worker setups that could (does?) result in busy loops.

Background: the v2.0.x line had improvements that return connections to mpm_event's monitoring as soon as possible, for efficiency. Unfortunately, the mpm_worker situation was not properly accounted for.

@nono303 I mentioned worker, but this should apply to Windows setups as well.

@alexskynet

Time to compile and run and I'll let you know

@alexskynet

running

@alexskynet

Very first impression is good: no immediate lockup, so I'm crossing my fingers.

@alexskynet

A giant step @icing !
It is running with no apparent problems
The only thing I noticed is that httpd CPU usage rose from 6-7% to around 10%, but this is really not a problem.

@icing
Owner

icing commented Sep 23, 2022

Thanks for your patience. Happy to hear that. Let's see what the day brings.

As to performance, I made some conservative changes to bring stability. When this version proves to be stable, I can dare to tighten the screws again somewhat.

@alexskynet

You're doing a great job @icing

I'm happy to have been useful to help a little bit: that's called opensource!

@alexskynet

Running OK so far.
Starting a second test server.
Thank you @icing, great job!

@alexskynet

alexskynet commented Sep 23, 2022

I just want to confirm that 2.0.7 works like a charm

CodeIt has already released the mod_http2-2.0.7 rpm with the fix so all the world should be happy now :-)

It has been running for several hours on two servers with absolutely no trouble.

Problem solved: well done @icing ! 🥇

@icing
Owner

icing commented Sep 23, 2022

Thanks again @alexskynet for the help!

Closing this as fixed in v2.0.7.

icing closed this as completed Sep 23, 2022
@nono303

nono303 commented Sep 23, 2022

… I came after the battle (quite busy right now), but many thanks @icing for your work and responsiveness.
I compiled and am running 2.0.7 on Windows, and everything works like a charm 😉
No more CPU pegged on this stack:

libapr-1.dll!impl_pollset_poll+0xba
mod_http2.so!mplx_pollset_poll+0x154
mod_http2.so!h2_mplx_c1_poll+0x52
mod_http2.so!h2_session_process+0xca5
mod_http2.so!h2_c1_run+0xaf
mod_http2.so!h2_c1_hook_process_connection+0x4e2
libhttpd.dll!ap_run_process_connection+0x35
libhttpd.dll!worker_main+0x3a8
kernel32.dll!BaseThreadInitThunk+0xd
ntdll.dll!RtlUserThreadStart+0x1d

Just maybe a few more CPU cycles in mod_watchdog, but I'm not sure that is related to h2:

mod_watchdog.so!wd_worker+0x24f
libapr-1.dll!dummy_worker+0x43
ucrtbase.DLL!o__realloc_base+0x60
kernel32.dll!BaseThreadInitThunk+0xd

@alexskynet

Sorry to reopen this, but today we had hangs on two different servers, both running 2.0.7.
Reverting to 2.0.3 seems to fix it.
We ran 2.0.7 fine for nearly 4 days before the hang.
The interesting part of the pstack output, at first glance, seems to be this:
Thread 7 (Thread 0x7fdcfb7fe700 (LWP 20589)):
#0 0x00007fdd26670cd4 in h2_fifo_remove () from /etc/httpd/modules/mod_http2.so
#1 0x00007fdcdc03e478 in ?? ()
#2 0x00007fdd35a7167a in apr_bucket_shared_split () from /lib64/libaprutil-1.so.0
#3 0x00007fdcfc0057d8 in ?? ()
#4 0x0000000000003727 in ?? ()
#5 0x00007fdd26882f40 in http2_module () from /etc/httpd/modules/mod_http2.so
#6 0x0000000000003ff7 in ?? ()
#7 0x00007fdd2666b2cc in h2_stream_recv_DATA () from /etc/httpd/modules/mod_http2.so
#8 0x5613754aae48f400 in ?? ()
#9 0x00007fdcdc018c80 in ?? ()
#10 0x00007fdce80028d0 in ?? ()
#11 0x00007fdce800b0e0 in ?? ()
#12 0x00007fdce8003b60 in ?? ()
#13 0x00007fdce800b080 in ?? ()
#14 0x00007fdce800e0c8 in ?? ()
#15 0x00007fdce800e0c0 in ?? ()
#16 0x00007fdd364981a1 in nghttp2_session_pack_data () from /lib64/libnghttp2.so.14
#17 0x00007fdd36498f0d in nghttp2_session_mem_send_internal () from /lib64/libnghttp2.so.14
#18 0x00007fdd36499b19 in nghttp2_session_send () from /lib64/libnghttp2.so.14
#19 0x00007fdd266672ec in on_stream_state_enter () from /etc/httpd/modules/mod_http2.so
#20 0x00007fdd26667574 in on_stream_state_enter () from /etc/httpd/modules/mod_http2.so
#21 0x00007fdcdc018c08 in ?? ()
#22 0x00007fdcfb7fdb18 in ?? ()
#23 0x0000000000000000 in ?? ()
Thread 6 (Thread 0x7fdcf97fa700 (LWP 20593)):
#0 0x00007fdd26668946 in h2_session_process () from /etc/httpd/modules/mod_http2.so
#1 0x00007fdc8c072970 in ?? ()
#2 0x0000000000000142 in ?? ()
#3 0x00007fdcf97f9a50 in ?? ()
#4 0x00007fdcd0015b20 in ?? ()
#5 0xffffffffffffffff in ?? ()
#6 0x0000000000000000 in ?? ()

@icing
Owner

icing commented Sep 27, 2022

@alexskynet this stack trace does not look right. I assume a version of mod_http2 other than v2.0.7 is loaded. v2.0.7 never invokes h2_fifo_remove(). Assuming the APR lib symbols are correct, apr_bucket_split is also never called in v2.0.7 (but was in versions v2.0.4 and earlier).

Could you double check?

@icing
Owner

icing commented Sep 29, 2022

v2.0.8 released. The fixes are unrelated to this, but I added an assertion before the point where the 100% cpu loop seems to happen in the reports by @alexskynet. It would be interesting to know if this triggers and, if so, what is logged (at level critical) in such a case.

icing reopened this Sep 30, 2022
@icing
Owner

icing commented Sep 30, 2022

The epic battle continues with v2.0.9:

  • Fixed a bug where errors during response body handling did not lead to
    a proper RST_STREAM. Instead processing went into an infinite loop.
    Extended test cases to catch this condition.

@alexskynet

Hi @icing and thank you for your hard work

Testing 2.0.9 with worker

Till now what I see in mod_status looks OK: it has been running for 20 minutes now.

I cross my fingers and wait ...

I'll let you know

@icing
Owner

icing commented Oct 1, 2022

You tricked me before! I remain sceptical...😉

@alexskynet

12 hours still running fine ...

It looks promising

@alexskynet

Hi @icing, I've done some testing for three days now and it works fine.
Five servers are running 2.0.9 and one is running master, with no anomalies.
The battle is over and you won.
Many compliments for the very good work.

@icing
Owner

icing commented Oct 4, 2022

Hi @alexskynet! This is excellent news. We won. Thank you very much for helping on this.

In the meantime, I have added more edge test cases and made some more improvements on reliability. Will release that in some days as a v2.0.10.

@alexskynet

alexskynet commented Oct 11, 2022 via email

@icing
Owner

icing commented Oct 11, 2022

With v2.0.10 just released and Alessandro's extensive testing, I think we have solved the issues.

Many, many thanks to everyone.

icing closed this as completed Oct 11, 2022