Stuck HTTPD server: Improving health checks
Development of a health check with proven effectiveness is still pending. The primary challenge is distinguishing between:

- A fully loaded or slightly overloaded, but well-coping HTTPD instance: the configured maximum number of workers is handling client connections, so the server no longer accepts new connections immediately. In this situation, the server should continue processing rather than being restarted and dropping all of these connections.
  Unfortunately, at the moment there is no way to get the state of the worker processes other than opening a new connection and accessing the `server-status` endpoint... docker-basics#13
- A stuck HTTPD instance: there is no progress on handling connections and processing requests. Such situations may arise because of bugs and other shortcomings in the HTTPD program. In this situation, the HTTPD process should be restarted so that it becomes available again.
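A first, very small probe along these lines could look as follows. This is only a sketch (the plain-HTTP request and the return values are our own assumptions; the real check runs over HTTPS against `/server-status?auto` via Telegraf), and it illustrates the open problem: a timeout alone cannot tell a busy-but-coping instance from a stuck one, so this probe flags both as "stuck".

```python
import socket

def probe_http(host, port, path="/server-status?auto", timeout=5.0):
    """Hypothetical health probe: 'ok' when the server answers within
    `timeout`, 'stuck' when the connection or response times out, and
    'down' when the connection is refused. A busy-but-coping instance
    may also report 'stuck' -- this is exactly the unsolved distinction."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.settimeout(timeout)
            request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
            s.sendall(request.encode())
            data = s.recv(1024)            # first bytes of the response
    except (socket.timeout, TimeoutError):
        return "stuck"                     # no progress: restart candidate
    except ConnectionRefusedError:
        return "down"                      # service not running at all
    return "ok" if data.startswith(b"HTTP/") else "stuck"
```

A real check would additionally need a view into the worker states (the `server-status` scoreboard) to separate "all workers legitimately busy" from "no progress at all".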
This week, we observed a stuck HTTPD process on one of our servers. We do not yet have a mechanism in place to automatically restart the HTTPD process once it becomes stuck. The incident described below is (as far as we know) representative of similar ones.
HTTPD on host lsdf-42-135 stopped responding to requests while appearing healthy by all other accounts (`httpd` service running, IP reachable, ...). The last successful request was a server-status check by Telegraf:

```
::1 - - [26/Mar/2023:03:41:00 +0200] "GET /server-status?auto HTTP/1.1" 200 137 668 444 "-" "Go-http-client/1.1"
```
The next status check (presumably started at 03:41:10) failed with a timeout:

```
Mär 26 03:41:15 lsdf-42-135-dis telegraf[2666]: 2023-03-26T01:41:15Z E! [inputs.apache] Error in plugin: error on request to https://localhost/server-status?auto : Get "https://localhost/server-status?auto": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
```
Any new connection eventually times out (the timeout is enforced by the client). As an example, a client on lsdf-42-190 opened several connections (in parallel) to the RR IP on lsdf-42-135. `netstat` on the client lsdf-42-190 shows the connections as established, with some data waiting in the send queue (not acknowledged by the remote host):

```
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0    288 lsdf-42-190-dis.l:58582 os-webdav-02-rr.l:https ESTABLISHED
tcp        0    288 lsdf-42-190-dis.l:58580 os-webdav-02-rr.l:https ESTABLISHED
tcp        0    288 lsdf-42-190-dis.l:58594 os-webdav-02-rr.l:https ESTABLISHED
```
On the server lsdf-42-135, `netstat` shows these connections (among others) only as `SYN_RECV`:

```
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 os-webdav-02-rr.ls:http icinga-kit1.scc.k:44428 SYN_RECV
tcp        0      0 os-webdav-02-rr.l:https lsdf-42-190-dis.l:58582 SYN_RECV
tcp        0      0 os-webdav-02-rr.l:https lsdf-42-190-dis.l:58594 SYN_RECV
tcp        0      0 os-webdav-02-rr.l:https lsdf-42-190-dis.l:58612 SYN_RECV
tcp        0      0 os-webdav-02-rr.l:https lsdf-42-190-dis.l:58580 SYN_RECV
tcp        0      0 os-webdav-02-rr.l:https lsdf-42-190-dis.l:58614 SYN_RECV
tcp        0      0 os-webdav-02-rr.l:https lsdf-42-190-dis.l:58608 SYN_RECV
tcp        0      0 localhost:https         localhost:40492         SYN_RECV
```
So the connections are never handed over to HTTPD.
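One plausible explanation (an assumption on our side, not yet verified on the host): with all workers occupied, no process calls `accept()` on the listening socket anymore. On Linux, once the accept queue is full the kernel drops the final ACK of new handshakes (leaving the server side in `SYN_RECV` while the client already considers the connection established) and drops further SYNs (so new clients run into their own connect timeout). The second effect can be sketched with a listener whose owner never accepts:

```python
import socket

# Sketch (Linux TCP semantics assumed): a listening socket whose owner never
# calls accept(), mimicking HTTPD workers that make no progress. Once the
# kernel's accept queue is full, new clients run into their connect timeout.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)                      # tiny backlog; accept() is never called
port = srv.getsockname()[1]

connected, timed_out = [], 0
for _ in range(4):
    try:
        connected.append(socket.create_connection(("127.0.0.1", port), timeout=1))
    except OSError:                # typically a connect timeout once the queue is full
        timed_out += 1
# The first connections "succeed" from the client's point of view even though
# the server application never sees them; the rest time out like our clients did.
```

This would also explain why the host otherwise looks healthy: the kernel still owns the listening socket, so the IP answers pings and the `httpd` service appears to be running.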
HTTPD is very close to its configured maximum number of workers (241 out of 256):

```
$ pstree -s 14982
systemd---httpd-+-228*[httpd---httpd]
                `-13*[httpd]
```
`lsof` (and `netstat`) show that some of the worker processes have an established connection to the LDAP server (similar incidents in the past seemed to be related to LDAP):

```
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
httpd 14982 root 4u IPv6 3742228 0t0 TCP *:443 (LISTEN)
httpd 14982 root 6u IPv6 3742232 0t0 TCP *:80 (LISTEN)
httpd 21303 aa0000 4u IPv4 536971728 0t0 TCP 141.52.212.167:51234->141.3.135.113:30389 (ESTABLISHED)
httpd 21304 aa0000 4u IPv4 536949177 0t0 TCP 141.52.212.167:51238->141.3.135.113:30389 (ESTABLISHED)
httpd 21305 aa0000 4u IPv4 536949174 0t0 TCP 141.52.212.167:51236->141.3.135.113:30389 (ESTABLISHED)
httpd 21306 aa0000 4u IPv4 536947298 0t0 TCP 141.52.212.167:51240->141.3.135.113:30389 (ESTABLISHED)
httpd 21317 aa0000 4u IPv4 536963994 0t0 TCP 141.52.212.167:51242->141.3.135.113:30389 (ESTABLISHED)
httpd 21318 aa0000 4u IPv4 536944089 0t0 TCP 141.52.212.167:51250->141.3.135.113:30389 (ESTABLISHED)
httpd 21319 aa0000 4u IPv4 536950047 0t0 TCP 141.52.212.167:51244->141.3.135.113:30389 (ESTABLISHED)
httpd 21320 aa0000 4u IPv4 536947314 0t0 TCP 141.52.212.167:51248->141.3.135.113:30389 (ESTABLISHED)
httpd 21321 aa0000 4u IPv4 536951194 0t0 TCP 141.52.212.167:51252->141.3.135.113:30389 (ESTABLISHED)
httpd 21322 aa0000 4u IPv4 536976678 0t0 TCP 141.52.212.167:51258->141.3.135.113:30389 (ESTABLISHED)
httpd 21323 aa0000 4u IPv4 536928879 0t0 TCP 141.52.212.167:51254->141.3.135.113:30389 (ESTABLISHED)
httpd 21340 aa0000 4u IPv4 536949183 0t0 TCP 141.52.212.167:51256->141.3.135.113:30389 (ESTABLISHED)
httpd 21342 aa0000 4u IPv4 536971734 0t0 TCP 141.52.212.167:51262->141.3.135.113:30389 (ESTABLISHED)
httpd 21343 aa0000 4u IPv4 536944097 0t0 TCP 141.52.212.167:51260->141.3.135.113:30389 (ESTABLISHED)
httpd 21344 aa0000 4u IPv4 536863375 0t0 TCP 141.52.212.167:51264->141.3.135.113:30389 (ESTABLISHED)
httpd 21356 apache 4u IPv4 536976685 0t0 TCP 141.52.212.167:51266->141.3.135.113:30389 (ESTABLISHED)
```
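To turn this observation into something a future health check could act on, the `lsof` output can be parsed mechanically. The helper below is hypothetical (the field layout and the LDAP port 30389 are taken from the output above, not from any existing tooling):

```python
import re

# Hypothetical helper: match one line of `lsof -i` output of the form
# COMMAND PID USER FD TYPE DEVICE SIZE/OFF TCP local->remote (STATE)
LSOF_RE = re.compile(
    r"^(?P<cmd>\S+)\s+(?P<pid>\d+)\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+TCP\s+"
    r"\S+->(?P<remote>\S+)\s+\((?P<state>\w+)\)"
)

def ldap_connections(lsof_output, ldap_port=30389):
    """Return the PIDs of all processes with an ESTABLISHED TCP connection
    to `ldap_port` according to the given `lsof` output."""
    pids = []
    for line in lsof_output.splitlines():
        m = LSOF_RE.match(line)
        if m and m["state"] == "ESTABLISHED" and m["remote"].endswith(f":{ldap_port}"):
            pids.append(int(m["pid"]))
    return pids
```

A threshold on the number of such PIDs (e.g. "more than N workers parked on LDAP") might make the suspected LDAP correlation measurable across future incidents.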
Client behaviour
I used the opportunity to test some clients regarding their behaviour against this somewhat broken state. The cluster of three servers is accessible via a DNS RR record, returning all three IPs in "random" order. In this case, the IP associated with the stuck HTTPD was returned as the first IP:

- `chromium` connects to only one IP (the first returned by DNS) and eventually reaches a timeout. Reloading manually does not help. Chromium maintains its own DNS cache, and record order and TTL may differ from other programs.
- `rclone` (with 96 concurrent transfers) connects to all IPs. Apparently it detects that one IP is not available (presumably after timeouts) and silently fails over to the other IPs; no message is printed.
- `curl` connects to only one IP (the first returned by DNS), eventually reaches a timeout, and returns an error.
- `wget` resolves all three IPs but only connects to the first one returned by DNS. Eventually a timeout is reached and an error is returned.
- `davix` connects to only one IP (the first returned by DNS), eventually reaches a timeout, and returns an error.
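For reference, the failover behaviour that `rclone` apparently implements, and that the other clients lack, amounts to iterating over all resolved addresses. A minimal sketch (how rclone actually does this internally is an assumption on our side):

```python
import socket

def connect_first_available(addrs, timeout=3.0):
    """Try each (host, port) address in order, like a client iterating a
    DNS round-robin record, and return a socket to the first address that
    answers within `timeout`. Raises OSError if none does."""
    last_err = None
    for host, port in addrs:
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as err:     # timeout, refused, unreachable, ...
            last_err = err         # silently fail over to the next IP
    raise last_err or OSError("no addresses given")
```

With per-address timeouts this degrades gracefully when the first RR entry points at the stuck instance, at the cost of a delay of up to `timeout` seconds per dead address.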
/cc @Sven.Siebler /cc @mozhdeh.farhadi