Stuck HTTPD server: Improving health checks
Development of a health check with proven effectiveness is still pending. The primary challenge is distinguishing between:

- A fully loaded or slightly overloaded, but well-coping HTTPD instance: the configured maximum number of workers is handling client connections, so the server no longer accepts new connections immediately. In this situation, the server should continue processing rather than being restarted and dropping all of these connections.
  Unfortunately, at the moment there is no way to get the state of the worker processes other than opening a new connection and accessing the `server-status` endpoint... docker-basics#13
- A stuck HTTPD instance: there is no progress on handling connections and processing requests. Such situations may arise because of bugs and other shortcomings in the HTTPD program. In this situation, the HTTPD process should be restarted so that it becomes available again.
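A first, very small probe along these lines could look as follows. This is only a sketch (the plain-HTTP request and the return values are our own assumptions; the real check runs over HTTPS against `/server-status?auto` via Telegraf), and it illustrates the open problem: a timeout alone cannot tell a busy-but-coping instance from a stuck one, so this probe flags both as "stuck".

```python
import socket

def probe_http(host, port, path="/server-status?auto", timeout=5.0):
    """Hypothetical health probe: 'ok' when the server answers within
    `timeout`, 'stuck' when the connection or response times out, and
    'down' when the connection is refused. A busy-but-coping instance
    may also report 'stuck' -- this is exactly the unsolved distinction."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.settimeout(timeout)
            request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
            s.sendall(request.encode())
            data = s.recv(1024)            # first bytes of the response
    except (socket.timeout, TimeoutError):
        return "stuck"                     # no progress: restart candidate
    except ConnectionRefusedError:
        return "down"                      # service not running at all
    return "ok" if data.startswith(b"HTTP/") else "stuck"
```

A real check would additionally need a view into the worker states (the `server-status` scoreboard) to separate "all workers legitimately busy" from "no progress at all".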
This week, we observed a stuck HTTPD process on one of our servers. We do not yet have a mechanism in place to automatically restart the HTTPD process once it becomes stuck. The incident described below is (as far as we know) representative of similar ones.
HTTPD on host lsdf-42-135 stopped responding to requests while appearing healthy by all other accounts (`httpd` service running, IP reachable, ...). The last successful request was a server-status check by Telegraf:

```
::1 - - [26/Mar/2023:03:41:00 +0200] "GET /server-status?auto HTTP/1.1" 200 137 668 444 "-" "Go-http-client/1.1"
```
The next status check (presumably started at 03:41:10) failed with a timeout:

```
Mär 26 03:41:15 lsdf-42-135-dis telegraf[2666]: 2023-03-26T01:41:15Z E! [inputs.apache] Error in plugin: error on request to https://localhost/server-status?auto : Get "https://localhost/server-status?auto": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
```
Any new connection eventually times out (the timeout is enforced by the client). As an example, a client on lsdf-42-190 opened several connections (in parallel) to the RR IP on lsdf-42-135. `netstat` on the client lsdf-42-190 shows the connections as established, with some data waiting in the send queue (not acknowledged by the remote host):

```
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0    288 lsdf-42-190-dis.l:58582 os-webdav-02-rr.l:https ESTABLISHED
tcp        0    288 lsdf-42-190-dis.l:58580 os-webdav-02-rr.l:https ESTABLISHED
tcp        0    288 lsdf-42-190-dis.l:58594 os-webdav-02-rr.l:https ESTABLISHED
```
On the server lsdf-42-135, `netstat` shows these connections (among others) only as `SYN_RECV`:

```
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 os-webdav-02-rr.ls:http icinga-kit1.scc.k:44428 SYN_RECV
tcp        0      0 os-webdav-02-rr.l:https lsdf-42-190-dis.l:58582 SYN_RECV
tcp        0      0 os-webdav-02-rr.l:https lsdf-42-190-dis.l:58594 SYN_RECV
tcp        0      0 os-webdav-02-rr.l:https lsdf-42-190-dis.l:58612 SYN_RECV
tcp        0      0 os-webdav-02-rr.l:https lsdf-42-190-dis.l:58580 SYN_RECV
tcp        0      0 os-webdav-02-rr.l:https lsdf-42-190-dis.l:58614 SYN_RECV
tcp        0      0 os-webdav-02-rr.l:https lsdf-42-190-dis.l:58608 SYN_RECV
tcp        0      0 localhost:https         localhost:40492         SYN_RECV
```
So the connections are never handed over to HTTPD.
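One plausible explanation (an assumption on our side, not yet verified on the host): with all workers occupied, no process calls `accept()` on the listening socket anymore. On Linux, once the accept queue is full the kernel drops the final ACK of new handshakes (leaving the server side in `SYN_RECV` while the client already considers the connection established) and drops further SYNs (so new clients run into their own connect timeout). The second effect can be sketched with a listener whose owner never accepts:

```python
import socket

# Sketch (Linux TCP semantics assumed): a listening socket whose owner never
# calls accept(), mimicking HTTPD workers that make no progress. Once the
# kernel's accept queue is full, new clients run into their connect timeout.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)                      # tiny backlog; accept() is never called
port = srv.getsockname()[1]

connected, timed_out = [], 0
for _ in range(4):
    try:
        connected.append(socket.create_connection(("127.0.0.1", port), timeout=1))
    except OSError:                # typically a connect timeout once the queue is full
        timed_out += 1
# The first connections "succeed" from the client's point of view even though
# the server application never sees them; the rest time out like our clients did.
```

This would also explain why the host otherwise looks healthy: the kernel still owns the listening socket, so the IP answers pings and the `httpd` service appears to be running.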
HTTPD is very close to its configured maximum number of workers (241 out of 256):

```
$ pstree -s 14982
systemd---httpd-+-228*[httpd---httpd]
                `-13*[httpd]
```
`lsof` (and `netstat`) show that some of the worker processes have an established connection to the LDAP server (similar incidents in the past seemed to be related to LDAP):

```
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
httpd 14982 root 4u IPv6 3742228 0t0 TCP *:443 (LISTEN)
httpd 14982 root 6u IPv6 3742232 0t0 TCP *:80 (LISTEN)
httpd 21303 aa0000 4u IPv4 536971728 0t0 TCP 141.52.212.167:51234->141.3.135.113:30389 (ESTABLISHED)
httpd 21304 aa0000 4u IPv4 536949177 0t0 TCP 141.52.212.167:51238->141.3.135.113:30389 (ESTABLISHED)
httpd 21305 aa0000 4u IPv4 536949174 0t0 TCP 141.52.212.167:51236->141.3.135.113:30389 (ESTABLISHED)
httpd 21306 aa0000 4u IPv4 536947298 0t0 TCP 141.52.212.167:51240->141.3.135.113:30389 (ESTABLISHED)
httpd 21317 aa0000 4u IPv4 536963994 0t0 TCP 141.52.212.167:51242->141.3.135.113:30389 (ESTABLISHED)
httpd 21318 aa0000 4u IPv4 536944089 0t0 TCP 141.52.212.167:51250->141.3.135.113:30389 (ESTABLISHED)
httpd 21319 aa0000 4u IPv4 536950047 0t0 TCP 141.52.212.167:51244->141.3.135.113:30389 (ESTABLISHED)
httpd 21320 aa0000 4u IPv4 536947314 0t0 TCP 141.52.212.167:51248->141.3.135.113:30389 (ESTABLISHED)
httpd 21321 aa0000 4u IPv4 536951194 0t0 TCP 141.52.212.167:51252->141.3.135.113:30389 (ESTABLISHED)
httpd 21322 aa0000 4u IPv4 536976678 0t0 TCP 141.52.212.167:51258->141.3.135.113:30389 (ESTABLISHED)
httpd 21323 aa0000 4u IPv4 536928879 0t0 TCP 141.52.212.167:51254->141.3.135.113:30389 (ESTABLISHED)
httpd 21340 aa0000 4u IPv4 536949183 0t0 TCP 141.52.212.167:51256->141.3.135.113:30389 (ESTABLISHED)
httpd 21342 aa0000 4u IPv4 536971734 0t0 TCP 141.52.212.167:51262->141.3.135.113:30389 (ESTABLISHED)
httpd 21343 aa0000 4u IPv4 536944097 0t0 TCP 141.52.212.167:51260->141.3.135.113:30389 (ESTABLISHED)
httpd 21344 aa0000 4u IPv4 536863375 0t0 TCP 141.52.212.167:51264->141.3.135.113:30389 (ESTABLISHED)
httpd 21356 apache 4u IPv4 536976685 0t0 TCP 141.52.212.167:51266->141.3.135.113:30389 (ESTABLISHED)
```
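To turn this observation into something a future health check could act on, the `lsof` output can be parsed mechanically. The helper below is hypothetical (the field layout and the LDAP port 30389 are taken from the output above, not from any existing tooling):

```python
import re

# Hypothetical helper: match one line of `lsof -i` output of the form
# COMMAND PID USER FD TYPE DEVICE SIZE/OFF TCP local->remote (STATE)
LSOF_RE = re.compile(
    r"^(?P<cmd>\S+)\s+(?P<pid>\d+)\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+TCP\s+"
    r"\S+->(?P<remote>\S+)\s+\((?P<state>\w+)\)"
)

def ldap_connections(lsof_output, ldap_port=30389):
    """Return the PIDs of all processes with an ESTABLISHED TCP connection
    to `ldap_port` according to the given `lsof` output."""
    pids = []
    for line in lsof_output.splitlines():
        m = LSOF_RE.match(line)
        if m and m["state"] == "ESTABLISHED" and m["remote"].endswith(f":{ldap_port}"):
            pids.append(int(m["pid"]))
    return pids
```

A threshold on the number of such PIDs (e.g. "more than N workers parked on LDAP") might make the suspected LDAP correlation measurable across future incidents.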
Client behaviour
I used the opportunity to test some clients regarding their behaviour against this somewhat broken state. The cluster of three servers is accessible via a DNS RR record, returning all three IPs in "random" order. In this case, the IP associated with the stuck HTTPD was returned as the first IP:

- `chromium` connects to only one IP (the first returned by DNS) and eventually reaches a timeout. Reloading manually does not help. Chromium maintains its own DNS cache, and record order and TTL may differ from other programs.
- `rclone` (with 96 concurrent transfers) connects to all IPs. Apparently it detects that one IP is not available (presumably after timeouts) and silently fails over to the other IPs; no message is printed.
- `curl` connects to only one IP (the first returned by DNS), eventually reaches a timeout, and returns an error.
- `wget` resolves all three IPs but only connects to the first one returned by DNS. Eventually a timeout is reached and an error is returned.
- `davix` connects to only one IP (the first returned by DNS), eventually reaches a timeout, and returns an error.
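For reference, the failover behaviour that `rclone` apparently implements, and that the other clients lack, amounts to iterating over all resolved addresses. A minimal sketch (how rclone actually does this internally is an assumption on our side):

```python
import socket

def connect_first_available(addrs, timeout=3.0):
    """Try each (host, port) address in order, like a client iterating a
    DNS round-robin record, and return a socket to the first address that
    answers within `timeout`. Raises OSError if none does."""
    last_err = None
    for host, port in addrs:
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as err:     # timeout, refused, unreachable, ...
            last_err = err         # silently fail over to the next IP
    raise last_err or OSError("no addresses given")
```

With per-address timeouts this degrades gracefully when the first RR entry points at the stuck instance, at the cost of a delay of up to `timeout` seconds per dead address.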
/cc @Sven.Siebler /cc @mozhdeh.farhadi